Web Intelligence Architecture v2.0
Operation Webweaver Documentation - Server v2.0.0
1. Overview
Multi-Level Web Intelligence
Web Intelligence provides a layered approach to web data gathering, from simple page fetches to comprehensive site ingestion. The system adapts to the complexity of the task.
Tool Users
| User | Capabilities | Primary Use Cases |
|---|---|---|
| Claude | Full access | Research, enrichment, analysis |
| LARS | Delegated access | Automated collection, training data |
| Delegates | Scoped access | Specific tasks (contact lookup, doc fetch) |
Environment Integration
Web Intelligence connects to multiple Nexus environments:
- Web (6670/6671) - Permanent web content storage
- Staging (6680/6681) - Temporary processing workspace
- KB (6625/6626) - Knowledge base documentation
- Docs/CDN - File storage and delivery
- Corpus - Extracted text for training
- Links (6635/6636) - URL bookmarking
2. Tool Levels
Level 1: Simple Fetch (httpx - lightweight)
Purpose: Single page retrieval with minimal overhead
Library: httpx (lightweight async HTTP)
| Tool | Purpose | Example |
|---|---|---|
| web.fetch | Single page → markdown | Like Claude's WebFetch |
| web.batch_fetch | Multiple pages in parallel | Batch processing |
| web.reader | Article extraction | Readability algorithm |
# Single page to markdown
web.fetch(url) → markdown content
# Parallel batch fetch
web.batch_fetch([url1, url2, url3]) → [content1, content2, content3]
# Clean article extraction
web.reader(url) → extracted article (title, content, metadata)
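A minimal sketch of how the Level 1 fetch path might look, assuming only that it sits on httpx as described under Key Libraries; `fetch_page`, the timeout, and redirect handling are illustrative choices, not the server's actual implementation.

```python
import asyncio
import httpx

async def fetch_page(url: str) -> str:
    """Fetch one page and return its raw HTML.

    The HTML → Markdown step that web.fetch performs would follow here;
    it is omitted because this document does not show that binding.
    """
    async with httpx.AsyncClient(follow_redirects=True, timeout=30.0) as client:
        response = await client.get(url)
        response.raise_for_status()
        return response.text

# html = asyncio.run(fetch_page("https://example.com"))
```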
Level 2: Search
Purpose: Discovery via DuckDuckGo (search only, no crawling)
Library: DuckDuckGo API
| Tool | Purpose | Example |
|---|---|---|
| web.search | DDG search results | Returns URLs + snippets |
web.search(query, limit=10) → [{url, title, snippet}, ...]
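For illustration only: the document does not name the search client, so this sketch uses the duckduckgo_search package (DDGS) as one plausible way to produce the url/title/snippet shape shown above.

```python
from duckduckgo_search import DDGS

def search(query: str, limit: int = 10) -> list[dict]:
    """Run a DDG text search and remap results to {url, title, snippet}."""
    with DDGS() as ddgs:
        results = ddgs.text(query, max_results=limit)
    # DDGS returns href/title/body keys; remap to the shape shown above.
    return [
        {"url": r["href"], "title": r["title"], "snippet": r["body"]}
        for r in results
    ]
```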
Level 3: Research (Full Pipeline)
Purpose: Comprehensive web intelligence with crawling
Library: spider_rs (heavy-duty crawler)
| Tool | Purpose | Example |
|---|---|---|
| web.research | Search + crawl + synthesize | Full 5-phase pipeline |
| web.crawl | Full site crawl | spider_rs powered |
| web.discover | Site link discovery | Find all links on site |
# Full research pipeline
web.research(topic, depth='standard') → structured intelligence
# Site-wide crawl
web.crawl(base_url, depth=2, limit=100) → all pages
# Link discovery
web.discover(url) → [all_links_found]
See Web Intelligence Pipeline for the full research flow.
Utility Tools
Purpose: Supporting operations for content processing
| Tool | Purpose |
|---|---|
| web.scrape | Raw HTML extraction |
| web.to_markdown | HTML → Markdown conversion |
| web.extract | CSS selector targeting |
| web.save_link | Save single URL to Links |
| web.bulk_save_links | Batch save URLs to Links |
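As a rough sketch of what web.extract-style CSS targeting involves (the real tool's parser is not named here; BeautifulSoup is an assumption, as are the function name and return shape):

```python
from bs4 import BeautifulSoup

def extract(html: str, selector: str) -> list[str]:
    """Return the text of every element matching a CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [el.get_text(strip=True) for el in soup.select(selector)]

# extract(page_html, "article h2") -> one string per matching heading
```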
3. Staging Architecture
Key Format
stg:{env}:{user_id}:{channel}:{row_id}
Components:
- stg: - Staging prefix
- {env} - Target environment (web, kb, docs)
- {user_id} - User ID format u_XXXX (NOT names)
- {channel} - Processing stage
- {row_id} - Unique row identifier
Example:
stg:web:u_z1p5:search:row_8f3a
stg:web:u_z1p5:metadata:row_8f3a
stg:web:u_z1p5:crawled:row_8f3a
stg:web:u_z1p5:processed:row_8f3a
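A tiny helper, purely illustrative (the function name is not part of the server), that assembles keys in this format:

```python
def stg_key(env: str, user_id: str, channel: str, row_id: str) -> str:
    """Build a staging key: stg:{env}:{user_id}:{channel}:{row_id}."""
    return f"stg:{env}:{user_id}:{channel}:{row_id}"

# stg_key("web", "u_z1p5", "search", "row_8f3a") -> "stg:web:u_z1p5:search:row_8f3a"
```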
Channels (Columns)
Data flows through channels like columns in a pipeline:
| Channel | Purpose | Contains |
|---|---|---|
| search | Initial discovery | URLs + snippets |
| metadata | Site structure | Sitemaps, nav, about pages |
| crawled | Raw content | HTML pages |
| processed | Extracted data | Structured JSON |
TTL (Time-To-Live)
- Default: 24 hours
- Extended: 72 hours (for complex pipelines)
- Manual clear: After successful routing to destination
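If the staging environment is a Redis-compatible key-value store listening on 6680 (an assumption this document does not confirm), the TTL rules above could be applied roughly like this:

```python
import json
import redis

DEFAULT_TTL = 24 * 3600   # default: 24 hours
EXTENDED_TTL = 72 * 3600  # extended: 72 hours for complex pipelines

r = redis.Redis(host="localhost", port=6680)  # staging port per this document; host assumed

def stage(key: str, payload: dict, extended: bool = False) -> None:
    """Write a staging row with the documented TTL and let it expire."""
    r.setex(key, EXTENDED_TTL if extended else DEFAULT_TTL, json.dumps(payload))

def clear(key: str) -> None:
    """Manual clear after the row has been routed to its destination."""
    r.delete(key)
```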
User ID Pattern
Always use u_XXXX format, never names:
- ✅ stg:web:u_z1p5:search:row_123
- ❌ stg:web:chris:search:row_123
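A sketch of the kind of guard this rule implies; the exact character set allowed after u_ is an assumption:

```python
import re

USER_ID = re.compile(r"^u_[a-z0-9]+$")  # u_XXXX form; allowed characters assumed

def valid_staging_user(key: str) -> bool:
    """True only when the user segment of a staging key is a u_XXXX ID."""
    parts = key.split(":")
    return len(parts) == 5 and parts[0] == "stg" and bool(USER_ID.match(parts[2]))

# valid_staging_user("stg:web:u_z1p5:search:row_123") -> True
# valid_staging_user("stg:web:chris:search:row_123")  -> False
```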
4. Environment Integration
Web Environment (6670/6671)
Purpose: Permanent web content storage
Contents:
- Crawled page snapshots
- Timestamped versions
- Searchable content index
Key Pattern: web:{user_id}:{date}:{domain}:{hash}
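Illustrative only: one way to assemble that key (the date format and hash length are assumptions; the pattern itself is from this document):

```python
import hashlib
from datetime import date
from urllib.parse import urlparse

def web_key(user_id: str, url: str) -> str:
    """Build a Web-environment key: web:{user_id}:{date}:{domain}:{hash}."""
    domain = urlparse(url).netloc
    digest = hashlib.sha256(url.encode()).hexdigest()[:12]  # hash length assumed
    return f"web:{user_id}:{date.today().isoformat()}:{domain}:{digest}"

# web_key("u_z1p5", "https://example.com/docs/intro")
# -> "web:u_z1p5:<today>:example.com:<12-char hash>"
```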
KB Environment (6625/6626)
Purpose: Knowledge base documentation
Contents:
- Structured documentation
- Hierarchical nodes
- Cross-references
Integration: web.ingest_site() → KB nodes
Docs/CDN
Purpose: File storage and delivery
Contents:
- Downloaded documents (PDFs, images)
- Generated reports
- Static assets
Integration: web.download() → Docs storage
Corpus Environment
Purpose: Extracted text for AI training
Contents:
- Clean text from web pages
- Document extractions
- Training datasets
Integration: Docs → Corpus (extraction pipeline)
Staging Environment (6680/6681)
Purpose: Temporary processing workspace
Contents:
- In-flight web data
- Pipeline intermediates
- Pre-routing buffers
Lifetime: 24-72 hours, then cleared
5. Key Libraries
spider_rs
Purpose: Heavy-duty site-wide crawling
Characteristics:
- Rust-based with Python bindings
- Async-native (await crawl(url))
- High-performance parallel crawling
- Built-in rate limiting
Use For: Level 3 (Research) - multi-page crawls, site mapping, link discovery
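A minimal sketch of driving spider_rs for a Level 3 crawl, following the bindings' published usage (Website / crawl / get_links); the wrapper name and any budget or depth handling are assumptions:

```python
import asyncio
from spider_rs import Website

async def discover_links(base_url: str) -> list[str]:
    """Crawl a site with spider_rs and return every link it discovered."""
    website = Website(base_url)
    await website.crawl()
    return website.get_links()

# links = asyncio.run(discover_links("https://example.com"))
```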
httpx
Purpose: Lightweight single-page fetch
Characteristics:
- Pure Python async HTTP
- Low overhead, fast
- Good for simple operations
Use For: Level 1 (Simple Fetch) - single pages, batch fetches
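A sketch of the Level 1 batch path under the same assumption, fanning requests out over one shared httpx connection pool; names and the timeout are illustrative:

```python
import asyncio
import httpx

async def batch_fetch(urls: list[str]) -> list[str]:
    """Fetch several pages in parallel, sharing one connection pool."""
    async with httpx.AsyncClient(follow_redirects=True, timeout=30.0) as client:
        responses = await asyncio.gather(*(client.get(u) for u in urls))
    return [r.text for r in responses]

# pages = asyncio.run(batch_fetch(["https://example.com/a", "https://example.com/b"]))
```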
readability-lxml
Purpose: Article content extraction
Characteristics:
- Readability algorithm implementation
- Extracts main content from cluttered pages
- Returns clean article text
Use For: Level 1 - web.reader tool
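A sketch of the reader path using readability-lxml directly; the metadata field that web.reader returns is not covered here, and `read_article` is an illustrative name:

```python
import httpx
from readability import Document

def read_article(url: str) -> dict:
    """Fetch a page and extract the main article via readability-lxml."""
    html = httpx.get(url, follow_redirects=True).text
    doc = Document(html)
    return {
        "title": doc.short_title(),
        "content": doc.summary(html_partial=True),  # cleaned article body (HTML)
    }
```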
html-to-markdown (Rust)
Purpose: HTML → Markdown conversion
Characteristics:
- Fast Rust implementation
- Preserves semantic structure
- Handles complex HTML
Use For: Content transformation for KB/Corpus
DuckDuckGo (DDG)
Purpose: Web search
Characteristics:
- No API key required
- Privacy-focused
- Returns structured results
Use For: Level 2 (Search) - discovery phase
6. Data Flows
Web → Staging → Web (Page Storage)
[Internet] → web.crawl() → Staging:crawled → process → Web Environment
Use Case: Permanent snapshot of web pages
Web → Staging → KB (Knowledge Capture)
[Internet] → web.ingest_site() → Staging:processed → kb.create() → KB Environment
Use Case: Import documentation into knowledge base
Web → Staging → Docs (File Storage)
[Internet] → web.download() → Staging:files → docs.save() → Docs/CDN
Use Case: Save PDFs, images, data files
Docs → Corpus (Text Extraction)
[Docs/CDN] → extract_text() → corpus.add() → Corpus Environment
Use Case: Extract text from documents for training
Staging → PDF (Report Generation)
[Staging:processed] → synthesize() → pdf.generate() → Docs/CDN
Use Case: Generate reports from research data
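To make the Web → Staging → KB flow concrete, here is a composition of the earlier sketches (discover_links, batch_fetch, stg_key, stage); kb_create stands in for the kb.create() call named above, whose real signature this document does not show:

```python
import asyncio

async def ingest_to_kb(base_url: str, user_id: str) -> None:
    """Web → Staging → KB, composed from the sketches in earlier sections."""
    links = await discover_links(base_url)           # Level 3 discovery (spider_rs sketch)
    pages = await batch_fetch(links[:25])            # Level 1 parallel fetch (httpx sketch)
    for i, html in enumerate(pages):
        key = stg_key("kb", user_id, "crawled", f"row_{i}")
        stage(key, {"url": links[i], "html": html})  # staged with the 24 h default TTL
    # Processing would then fill the 'processed' channel before routing:
    # kb_create(...)  # hypothetical stand-in for kb.create()
```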
Architecture Diagram
                         ┌─────────────────┐
                         │    INTERNET     │
                         └────────┬────────┘
                                  │
           ┌──────────────────────┼──────────────────────┐
           │                      │                      │
           ▼                      ▼                      ▼
  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
  │ Level 1: Fetch  │    │ Level 2: Search │    │Level 3: Research│
  │     (httpx)     │    │      (DDG)      │    │   (spider_rs)   │
  │                 │    │                 │    │                 │
  │ • fetch         │    │ • search        │    │ • research      │
  │ • batch_fetch   │    │                 │    │ • crawl         │
  │ • reader        │    │                 │    │ • discover      │
  └────────┬────────┘    └────────┬────────┘    └────────┬────────┘
           │                      │                      │
           └──────────────────────┼──────────────────────┘
                                  │
                                  ▼
                    ┌───────────────────────────┐
                    │    STAGING ENVIRONMENT    │
                    │        (6680/6681)        │
                    │                           │
                    │   stg:{env}:{user}:...    │
                    │   TTL: 24-72 hours        │
                    └─────────────┬─────────────┘
                                  │
      ┌─────────────┬─────────────┼─────────────┬─────────────┐
      │             │             │             │             │
      ▼             ▼             ▼             ▼             ▼
┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
│    Web    │ │    KB     │ │   Docs    │ │  Corpus   │ │   Links   │
│ 6670/6671 │ │ 6625/6626 │ │   /CDN    │ │           │ │ 6635/6636 │
│           │ │           │ │           │ │           │ │           │
│ Permanent │ │ Knowledge │ │   Files   │ │ Training  │ │   URLs    │
│ Snapshots │ │   Base    │ │           │ │   Data    │ │           │
└───────────┘ └───────────┘ └───────────┘ └───────────┘ └───────────┘
Documentation created during Operation Webweaver (g_4lji) - January 2026
Updated for Web Intelligence Server v2.0.0