Web Intelligence Architecture
Component Overview
1. Spider Engine (spider_rs)
- Rust-based web crawler with Python bindings
- Capabilities: crawl, scrape, smart crawl, cron jobs
- Non-blocking async operation
- Headless browser support
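A minimal usage sketch of the spider_rs Python bindings. The target URL is a placeholder, and the method names follow the library's published examples; verify them against the installed version:

```python
import asyncio
from spider_rs import Website

async def main():
    # Website is the spider_rs entry point for a single crawl target.
    website = Website("https://example.com")

    # Non-blocking crawl: discovers links across the site.
    await website.crawl()
    print(website.get_links())

    # scrape() also retains page bodies for downstream processing.
    await website.scrape()
    print(f"scraped {len(website.get_pages())} pages")

asyncio.run(main())
```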
2. Content Pipeline
URL → Spider → Raw HTML → Markdown Converter → Staging → Links/Storage
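As a sketch of the middle stages, the snippet below converts raw HTML to Markdown and builds the staging record that would be handed off for staging and storage. html2text is a stand-in for whichever converter the pipeline actually uses, and the record's field names are illustrative:

```python
import hashlib
import html2text  # stand-in for the pipeline's Markdown converter

def process_page(url: str, raw_html: str) -> dict:
    # Raw HTML -> Markdown stage.
    markdown = html2text.html2text(raw_html)

    # Staging record handed to the Temp MCP; field names are illustrative.
    return {
        "url": url,
        "url_hash": hashlib.sha256(url.encode()).hexdigest()[:16],
        "markdown": markdown,
    }
```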
3. Integration Points
- Links MCP (6635/6636): Store discovered URLs with metadata
- Temp MCP (6680/6681): Stage content during processing
- Contact Enrichment: Provide research capabilities for contact intelligence
- Knowledge Base (KB): Store research summaries
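For illustration, handing a discovered URL to the Links MCP could look like the sketch below. The tools/call envelope is standard MCP JSON-RPC, but the transport (plain HTTP on port 6635), the links.store tool name, and the argument shape are all assumptions:

```python
import requests

def store_link(url: str, metadata: dict) -> dict:
    # Hypothetical: assumes the Links MCP accepts JSON-RPC over HTTP on
    # port 6635 and exposes a "links.store" tool.
    payload = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "tools/call",
        "params": {"name": "links.store", "arguments": {"url": url, **metadata}},
    }
    resp = requests.post("http://localhost:6635", json=payload, timeout=10)
    resp.raise_for_status()
    return resp.json()
```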
4. Redis Storage (6670/6671)
web:crawl:{crawl_id} - Crawl job metadata
web:page:{url_hash} - Cached page content
web:queue:{job_id} - Pending URLs to process
web:history:{domain} - Crawl history per domain
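The key patterns above map directly onto redis-py calls. A sketch, assuming Redis listens on port 6670 and using illustrative field names:

```python
import hashlib
import json
import redis

r = redis.Redis(port=6670, decode_responses=True)

def cache_page(url: str, markdown: str, ttl: int = 86400) -> None:
    # web:page:{url_hash} - cached page content, expired after one day.
    url_hash = hashlib.sha256(url.encode()).hexdigest()[:16]
    r.set(f"web:page:{url_hash}", markdown, ex=ttl)

def enqueue_urls(job_id: str, urls: list[str]) -> None:
    # web:queue:{job_id} - pending URLs, consumed with LPOP by workers.
    r.rpush(f"web:queue:{job_id}", *urls)

def record_crawl(crawl_id: str, domain: str, meta: dict) -> None:
    # web:crawl:{crawl_id} - crawl job metadata stored as a hash.
    r.hset(f"web:crawl:{crawl_id}", mapping=meta)
    # web:history:{domain} - per-domain crawl history as a list.
    r.rpush(f"web:history:{domain}", json.dumps({"crawl_id": crawl_id, **meta}))
```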
5. MCP Tools
web.crawl - Crawl a URL or domain
web.scrape - Get page content as markdown
web.research - Deep research on a topic
web.extract - Extract structured data
web.links - Discover and categorize links
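An example invocation of one of these tools, using the same MCP tools/call envelope sketched above; only the tool name comes from the list, and the argument names are illustrative:

```python
# Hypothetical web.scrape call; the "format" argument is an assumption.
request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "web.scrape",
        "arguments": {"url": "https://example.com/post", "format": "markdown"},
    },
}
```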