
Web Intelligence Architecture

Component Overview

1. Spider Engine (spider_rs)

  • Rust-based web crawler with Python bindings
  • Capabilities: crawl, scrape, smart crawl, cron jobs
  • Non-blocking async operation
  • Headless browser support
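The non-blocking design can be illustrated with a minimal asyncio sketch. This is not the spider_rs API; the `fetch` stub and queue handling are hypothetical stand-ins for the real crawler:

```python
import asyncio

async def fetch(url: str) -> str:
    """Stub fetch: a real crawler would issue a non-blocking HTTP request here."""
    await asyncio.sleep(0)  # yield control, simulating network I/O
    return f"<html><title>{url}</title></html>"

async def crawl(seed_urls: list[str], max_pages: int = 10) -> dict[str, str]:
    """Breadth-first crawl loop; pages are fetched without blocking the event loop."""
    results: dict[str, str] = {}
    queue: asyncio.Queue[str] = asyncio.Queue()
    for url in seed_urls:
        queue.put_nowait(url)
    while not queue.empty() and len(results) < max_pages:
        url = await queue.get()
        if url not in results:
            results[url] = await fetch(url)
    return results

pages = asyncio.run(crawl(["https://example.com"]))
```

Because each fetch awaits instead of blocking, many pages can be in flight on a single event loop, which is what the non-blocking bullet above refers to.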

2. Content Pipeline

URL → Spider → Raw HTML → Markdown Converter → Staging → Links/Storage
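The middle stages of the pipeline can be sketched as two small functions. The HTML-to-Markdown step here is a toy regex converter standing in for a real library, and the staging dict stands in for the Temp MCP:

```python
import re

def html_to_markdown(html: str) -> str:
    """Toy converter: a real pipeline would use a dedicated HTML-to-Markdown library."""
    md = re.sub(r"<h1>(.*?)</h1>", r"# \1\n\n", html)
    md = re.sub(r"<p>(.*?)</p>", r"\1\n", md)
    return re.sub(r"<[^>]+>", "", md).strip()

def stage(markdown: str, staging: dict[str, str], url: str) -> None:
    """Hold converted content in staging until the links/storage steps consume it."""
    staging[url] = markdown

staging: dict[str, str] = {}
raw = "<h1>Title</h1><p>Body text.</p>"
stage(html_to_markdown(raw), staging, "https://example.com")
```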

3. Integration Points

  • Links MCP (6635/6636): Store discovered URLs with metadata
  • Temp MCP (6680/6681): Stage content during processing
  • Contact Enrichment: Provide research capabilities for contact intelligence
  • Knowledge Base (KB): Store research summaries
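The endpoints above can be collected into a single config mapping so callers resolve services by name. The structure below is an assumption; only the names and port pairs come from the list above:

```python
# Integration registry: service name -> (port pair, role).
INTEGRATIONS: dict[str, dict] = {
    "links_mcp": {"ports": (6635, 6636), "role": "store discovered URLs with metadata"},
    "temp_mcp": {"ports": (6680, 6681), "role": "stage content during processing"},
}

def ports_for(service: str) -> tuple[int, int]:
    """Look up the port pair for a named integration."""
    return INTEGRATIONS[service]["ports"]
```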

Redis Storage (6670/6671)

  • web:crawl:{crawl_id} - Crawl job metadata
  • web:page:{url_hash} - Cached page content
  • web:queue:{job_id} - Pending URLs to process
  • web:history:{domain} - Crawl history per domain
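The key schema above can be centralized in small helpers so every component builds keys the same way. The hashing choice (SHA-256 truncated to 16 hex characters) is an assumption; the document only specifies that page keys use a URL hash:

```python
import hashlib

def crawl_key(crawl_id: str) -> str:
    return f"web:crawl:{crawl_id}"

def page_key(url: str) -> str:
    # Hash the URL so keys stay fixed-length and safe for Redis.
    url_hash = hashlib.sha256(url.encode()).hexdigest()[:16]
    return f"web:page:{url_hash}"

def queue_key(job_id: str) -> str:
    return f"web:queue:{job_id}"

def history_key(domain: str) -> str:
    return f"web:history:{domain}"
```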

Tools (Planned)

  • web.crawl - Crawl a URL or domain
  • web.scrape - Get page content as markdown
  • web.research - Deep research on a topic
  • web.extract - Extract structured data
  • web.links - Discover and categorize links
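One way to wire the planned tools is a dispatch table mapping tool names to handlers. The handler bodies below are placeholders; only the tool names come from the list above:

```python
from typing import Callable

def handle_crawl(args: dict) -> dict:
    # Placeholder: would enqueue a crawl job for args["url"].
    return {"tool": "web.crawl", "url": args.get("url")}

def handle_scrape(args: dict) -> dict:
    # Placeholder: would fetch args["url"] and return its content as markdown.
    return {"tool": "web.scrape", "url": args.get("url")}

TOOLS: dict[str, Callable[[dict], dict]] = {
    "web.crawl": handle_crawl,
    "web.scrape": handle_scrape,
}

def dispatch(tool: str, args: dict) -> dict:
    """Route a tool call to its handler; unknown tools fail loudly."""
    if tool not in TOOLS:
        raise ValueError(f"unknown tool: {tool}")
    return TOOLS[tool](args)

result = dispatch("web.scrape", {"url": "https://example.com"})
```

A table like this keeps adding the remaining tools (web.research, web.extract, web.links) to a one-line registration each.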

ID: da22d1ba
Path: Web Intelligence > Architecture
Updated: 2026-01-08T12:20:00