Web Intelligence Architecture
Component Overview
1. Spider Engine (spider_rs)
- Rust-based web crawler with Python bindings
- Capabilities: crawl, scrape, smart crawl, cron jobs
- Non-blocking async operation
- Headless browser support
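A minimal usage sketch of the spider_rs Python bindings. The target URL is a placeholder, and the method names follow the library's published examples; verify them against the installed version:

```python
import asyncio
from spider_rs import Website

async def main():
    # Website is the spider_rs entry point for a single crawl target.
    website = Website("https://example.com")

    # Non-blocking crawl: discovers links across the site.
    await website.crawl()
    print(website.get_links())

    # scrape() also retains page bodies for downstream processing.
    await website.scrape()
    print(f"scraped {len(website.get_pages())} pages")

asyncio.run(main())
```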
2. Content Pipeline
URL → Spider → Raw HTML → Markdown Converter → Staging → Links/Storage
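As a sketch of the middle stages, the snippet below converts raw HTML to Markdown and builds the staging record that would be handed off for staging and storage. html2text is a stand-in for whichever converter the pipeline actually uses, and the record's field names are illustrative:

```python
import hashlib
import html2text  # stand-in for the pipeline's Markdown converter

def process_page(url: str, raw_html: str) -> dict:
    # Raw HTML -> Markdown stage.
    markdown = html2text.html2text(raw_html)

    # Staging record handed to the Temp MCP; field names are illustrative.
    return {
        "url": url,
        "url_hash": hashlib.sha256(url.encode()).hexdigest()[:16],
        "markdown": markdown,
    }
```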
3. Integration Points
- Links MCP (6635/6636): Store discovered URLs with metadata
- Temp MCP (6680/6681): Stage content during processing
- Contact Enrichment: Provide research capabilities for contact intelligence
- Knowledge Base (KB): Store research summaries
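For illustration, handing a discovered URL to the Links MCP could look like the sketch below. The tools/call envelope is standard MCP JSON-RPC, but the transport (plain HTTP on port 6635), the links.store tool name, and the argument shape are all assumptions:

```python
import requests

def store_link(url: str, metadata: dict) -> dict:
    # Hypothetical: assumes the Links MCP accepts JSON-RPC over HTTP on
    # port 6635 and exposes a "links.store" tool.
    payload = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "tools/call",
        "params": {"name": "links.store", "arguments": {"url": url, **metadata}},
    }
    resp = requests.post("http://localhost:6635", json=payload, timeout=10)
    resp.raise_for_status()
    return resp.json()
```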
4. Redis Storage (6670/6671)
web:crawl:{crawl_id} - Crawl job metadata
web:page:{url_hash} - Cached page content
web:queue:{job_id} - Pending URLs to process
web:history:{domain} - Crawl history per domain
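The key patterns above map directly onto redis-py calls. A sketch, assuming Redis listens on port 6670 and using illustrative field names:

```python
import hashlib
import json
import redis

r = redis.Redis(port=6670, decode_responses=True)

def cache_page(url: str, markdown: str, ttl: int = 86400) -> None:
    # web:page:{url_hash} - cached page content, expired after one day.
    url_hash = hashlib.sha256(url.encode()).hexdigest()[:16]
    r.set(f"web:page:{url_hash}", markdown, ex=ttl)

def enqueue_urls(job_id: str, urls: list[str]) -> None:
    # web:queue:{job_id} - pending URLs, consumed with LPOP by workers.
    r.rpush(f"web:queue:{job_id}", *urls)

def record_crawl(crawl_id: str, domain: str, meta: dict) -> None:
    # web:crawl:{crawl_id} - crawl job metadata stored as a hash.
    r.hset(f"web:crawl:{crawl_id}", mapping=meta)
    # web:history:{domain} - per-domain crawl history as a list.
    r.rpush(f"web:history:{domain}", json.dumps({"crawl_id": crawl_id, **meta}))
```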
5. MCP Tools
web.crawl - Crawl a URL or domain
web.scrape - Get page content as markdown
web.research - Deep research on a topic
web.extract - Extract structured data
web.links - Discover and categorize links
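An example invocation of one of these tools, using the same MCP tools/call envelope sketched above; only the tool name comes from the list, and the argument names are illustrative:

```python
# Hypothetical web.scrape call; the "format" argument is an assumption.
request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "web.scrape",
        "arguments": {"url": "https://example.com/post", "format": "markdown"},
    },
}
```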