Web Intelligence Architecture v2.0
Operation Webweaver Documentation - Server v2.0.0
1. Overview
Multi-Level Web Intelligence
Web Intelligence provides a layered approach to web data gathering, from simple page fetches to comprehensive site ingestion. The system adapts to the complexity of the task.
Tool Users
| User | Capabilities | Primary Use Cases |
|---|---|---|
| Claude | Full access | Research, enrichment, analysis |
| LARS | Delegated access | Automated collection, training data |
| Delegates | Scoped access | Specific tasks (contact lookup, doc fetch) |
Environment Integration
Web Intelligence connects to multiple Nexus environments:
- Web (6670/6671) - Permanent web content storage
- Staging (6680/6681) - Temporary processing workspace
- KB (6625/6626) - Knowledge base documentation
- Docs/CDN - File storage and delivery
- Corpus - Extracted text for training
- Links (6635/6636) - URL bookmarking
2. Tool Levels
Level 1: Simple Fetch (httpx - lightweight)
Purpose: Single page retrieval with minimal overhead
Library: httpx (lightweight async HTTP)
| Tool | Purpose | Example |
|---|---|---|
| web.fetch | Single page → markdown | Like Claude's WebFetch |
| web.batch_fetch | Multiple pages in parallel | Batch processing |
| web.reader | Article extraction | Readability algorithm |
# Single page to markdown
web.fetch(url) → markdown content
# Parallel batch fetch
web.batch_fetch([url1, url2, url3]) → [content1, content2, content3]
# Clean article extraction
web.reader(url) → extracted article (title, content, metadata)
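A minimal sketch of how the Level 1 fetch path might look, assuming only that it sits on httpx as described under Key Libraries; `fetch_page`, the timeout, and redirect handling are illustrative choices, not the server's actual implementation.

```python
import asyncio
import httpx

async def fetch_page(url: str) -> str:
    """Fetch one page and return its raw HTML.

    The HTML → Markdown step that web.fetch performs would follow here;
    it is omitted because this document does not show that binding.
    """
    async with httpx.AsyncClient(follow_redirects=True, timeout=30.0) as client:
        response = await client.get(url)
        response.raise_for_status()
        return response.text

# html = asyncio.run(fetch_page("https://example.com"))
```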
Level 2: Search
Purpose: Discovery via DuckDuckGo (search only, no crawling)
Library: DuckDuckGo API
| Tool | Purpose | Example |
|---|---|---|
| web.search | DDG search results | Returns URLs + snippets |
web.search(query, limit=10) → [{url, title, snippet}, ...]
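For illustration only: the document does not name the search client, so this sketch uses the duckduckgo_search package (DDGS) as one plausible way to produce the url/title/snippet shape shown above.

```python
from duckduckgo_search import DDGS

def search(query: str, limit: int = 10) -> list[dict]:
    """Run a DDG text search and remap results to {url, title, snippet}."""
    with DDGS() as ddgs:
        results = ddgs.text(query, max_results=limit)
    # DDGS returns href/title/body keys; remap to the shape shown above.
    return [
        {"url": r["href"], "title": r["title"], "snippet": r["body"]}
        for r in results
    ]
```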
Level 3: Research (Full Pipeline)
Purpose: Comprehensive web intelligence with crawling
Library: spider_rs (heavy-duty crawler)
| Tool | Purpose | Example |
|---|---|---|
| web.research | Search + crawl + synthesize | Full 5-phase pipeline |
| web.crawl | Full site crawl | spider_rs powered |
| web.discover | Site link discovery | Find all links on site |
# Full research pipeline
web.research(topic, depth='standard') → structured intelligence
# Site-wide crawl
web.crawl(base_url, depth=2, limit=100) → all pages
# Link discovery
web.discover(url) → [all_links_found]
See Web Intelligence Pipeline for the full research flow.
Utility Tools
Purpose: Supporting operations for content processing
| Tool | Purpose |
|---|---|
| web.scrape | Raw HTML extraction |
| web.to_markdown | HTML → Markdown conversion |
| web.extract | CSS selector targeting |
| web.save_link | Save single URL to Links |
| web.bulk_save_links | Batch save URLs to Links |
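As a rough sketch of what web.extract-style CSS targeting involves (the real tool's parser is not named here; BeautifulSoup is an assumption, as are the function name and return shape):

```python
from bs4 import BeautifulSoup

def extract(html: str, selector: str) -> list[str]:
    """Return the text of every element matching a CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [el.get_text(strip=True) for el in soup.select(selector)]

# extract(page_html, "article h2") -> one string per matching heading
```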
3. Staging Architecture
Key Format
stg:{env}:{user_id}:{channel}:{row_id}
Components:
- stg: - Staging prefix
- {env} - Target environment (web, kb, docs)
- {user_id} - User ID format u_XXXX (NOT names)
- {channel} - Processing stage
- {row_id} - Unique row identifier
Example:
stg:web:u_z1p5:search:row_8f3a
stg:web:u_z1p5:metadata:row_8f3a
stg:web:u_z1p5:crawled:row_8f3a
stg:web:u_z1p5:processed:row_8f3a
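A tiny helper, purely illustrative (the function name is not part of the server), that assembles keys in this format:

```python
def stg_key(env: str, user_id: str, channel: str, row_id: str) -> str:
    """Build a staging key: stg:{env}:{user_id}:{channel}:{row_id}."""
    return f"stg:{env}:{user_id}:{channel}:{row_id}"

# stg_key("web", "u_z1p5", "search", "row_8f3a") -> "stg:web:u_z1p5:search:row_8f3a"
```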
Channels (Columns)
Data flows through channels like columns in a pipeline:
| Channel | Purpose | Contains |
|---|---|---|
| search | Initial discovery | URLs + snippets |
| metadata | Site structure | Sitemaps, nav, about pages |
| crawled | Raw content | HTML pages |
| processed | Extracted data | Structured JSON |
TTL (Time-To-Live)
- Default: 24 hours
- Extended: 72 hours (for complex pipelines)
- Manual clear: After successful routing to destination
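If the staging environment is a Redis-compatible key-value store listening on 6680 (an assumption this document does not confirm), the TTL rules above could be applied roughly like this:

```python
import json
import redis

DEFAULT_TTL = 24 * 3600   # default: 24 hours
EXTENDED_TTL = 72 * 3600  # extended: 72 hours for complex pipelines

r = redis.Redis(host="localhost", port=6680)  # staging port per this document; host assumed

def stage(key: str, payload: dict, extended: bool = False) -> None:
    """Write a staging row with the documented TTL and let it expire."""
    r.setex(key, EXTENDED_TTL if extended else DEFAULT_TTL, json.dumps(payload))

def clear(key: str) -> None:
    """Manual clear after the row has been routed to its destination."""
    r.delete(key)
```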
User ID Pattern
Always use u_XXXX format, never names:
- ✅ stg:web:u_z1p5:search:row_123
- ❌ stg:web:chris:search:row_123
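A sketch of the kind of guard this rule implies; the exact character set allowed after u_ is an assumption:

```python
import re

USER_ID = re.compile(r"^u_[a-z0-9]+$")  # u_XXXX form; allowed characters assumed

def valid_staging_user(key: str) -> bool:
    """True only when the user segment of a staging key is a u_XXXX ID."""
    parts = key.split(":")
    return len(parts) == 5 and parts[0] == "stg" and bool(USER_ID.match(parts[2]))

# valid_staging_user("stg:web:u_z1p5:search:row_123") -> True
# valid_staging_user("stg:web:chris:search:row_123")  -> False
```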
4. Environment Integration
Web Environment (6670/6671)
Purpose: Permanent web content storage
Contents:
- Crawled page snapshots
- Timestamped versions
- Searchable content index
Key Pattern: web:{user_id}:{date}:{domain}:{hash}
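Illustrative only: one way to assemble that key (the date format and hash length are assumptions; the pattern itself is from this document):

```python
import hashlib
from datetime import date
from urllib.parse import urlparse

def web_key(user_id: str, url: str) -> str:
    """Build a Web-environment key: web:{user_id}:{date}:{domain}:{hash}."""
    domain = urlparse(url).netloc
    digest = hashlib.sha256(url.encode()).hexdigest()[:12]  # hash length assumed
    return f"web:{user_id}:{date.today().isoformat()}:{domain}:{digest}"

# web_key("u_z1p5", "https://example.com/docs/intro")
# -> "web:u_z1p5:<today>:example.com:<12-char hash>"
```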
KB Environment (6625/6626)
Purpose: Knowledge base documentation
Contents:
- Structured documentation
- Hierarchical nodes
- Cross-references
Integration: web.ingest_site() → KB nodes
Docs/CDN
Purpose: File storage and delivery
Contents:
- Downloaded documents (PDFs, images)
- Generated reports
- Static assets
Integration: web.download() → Docs storage
Corpus Environment
Purpose: Extracted text for AI training
Contents:
- Clean text from web pages
- Document extractions
- Training datasets
Integration: Docs → Corpus (extraction pipeline)
Staging Environment (6680/6681)
Purpose: Temporary processing workspace
Contents:
- In-flight web data
- Pipeline intermediates
- Pre-routing buffers
Lifetime: 24-72 hours, then cleared
5. Key Libraries
spider_rs
Purpose: Heavy-duty site-wide crawling
Characteristics:
- Rust-based with Python bindings
- Async-native (await crawl(url))
- High-performance parallel crawling
- Built-in rate limiting
Use For: Level 3 (Research) - multi-page crawls, site mapping, link discovery
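A minimal sketch of driving spider_rs for a Level 3 crawl, following the bindings' published usage (Website / crawl / get_links); the wrapper name and any budget or depth handling are assumptions:

```python
import asyncio
from spider_rs import Website

async def discover_links(base_url: str) -> list[str]:
    """Crawl a site with spider_rs and return every link it discovered."""
    website = Website(base_url)
    await website.crawl()
    return website.get_links()

# links = asyncio.run(discover_links("https://example.com"))
```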
httpx
Purpose: Lightweight single-page fetch
Characteristics:
- Pure Python async HTTP
- Low overhead, fast
- Good for simple operations
Use For: Level 1 (Simple Fetch) - single pages, batch fetches
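A sketch of the Level 1 batch path under the same assumption, fanning requests out over one shared httpx connection pool; names and the timeout are illustrative:

```python
import asyncio
import httpx

async def batch_fetch(urls: list[str]) -> list[str]:
    """Fetch several pages in parallel, sharing one connection pool."""
    async with httpx.AsyncClient(follow_redirects=True, timeout=30.0) as client:
        responses = await asyncio.gather(*(client.get(u) for u in urls))
    return [r.text for r in responses]

# pages = asyncio.run(batch_fetch(["https://example.com/a", "https://example.com/b"]))
```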
readability-lxml
Purpose: Article content extraction
Characteristics:
- Readability algorithm implementation
- Extracts main content from cluttered pages
- Returns clean article text
Use For: Level 1 - web.reader tool
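A sketch of the reader path using readability-lxml directly; the metadata field that web.reader returns is not covered here, and `read_article` is an illustrative name:

```python
import httpx
from readability import Document

def read_article(url: str) -> dict:
    """Fetch a page and extract the main article via readability-lxml."""
    html = httpx.get(url, follow_redirects=True).text
    doc = Document(html)
    return {
        "title": doc.short_title(),
        "content": doc.summary(html_partial=True),  # cleaned article body (HTML)
    }
```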
html-to-markdown (Rust)
Purpose: HTML → Markdown conversion
Characteristics:
- Fast Rust implementation
- Preserves semantic structure
- Handles complex HTML
Use For: Content transformation for KB/Corpus
DuckDuckGo (DDG)
Purpose: Web search
Characteristics:
- No API key required
- Privacy-focused
- Returns structured results
Use For: Level 2 (Search) - discovery phase
6. Data Flows
Web → Staging → Web (Page Storage)
[Internet] → web.crawl() → Staging:crawled → process → Web Environment
Use Case: Permanent snapshot of web pages
Web → Staging → KB (Knowledge Capture)
[Internet] → web.ingest_site() → Staging:processed → kb.create() → KB Environment
Use Case: Import documentation into knowledge base
Web → Staging → Docs (File Storage)
[Internet] → web.download() → Staging:files → docs.save() → Docs/CDN
Use Case: Save PDFs, images, data files
Docs → Corpus (Text Extraction)
[Docs/CDN] → extract_text() → corpus.add() → Corpus Environment
Use Case: Extract text from documents for training
Staging → PDF (Report Generation)
[Staging:processed] → synthesize() → pdf.generate() → Docs/CDN
Use Case: Generate reports from research data
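To make the Web → Staging → KB flow concrete, here is a composition of the earlier sketches (discover_links, batch_fetch, stg_key, stage); kb_create stands in for the kb.create() call named above, whose real signature this document does not show:

```python
import asyncio

async def ingest_to_kb(base_url: str, user_id: str) -> None:
    """Web → Staging → KB, composed from the sketches in earlier sections."""
    links = await discover_links(base_url)           # Level 3 discovery (spider_rs sketch)
    pages = await batch_fetch(links[:25])            # Level 1 parallel fetch (httpx sketch)
    for i, html in enumerate(pages):
        key = stg_key("kb", user_id, "crawled", f"row_{i}")
        stage(key, {"url": links[i], "html": html})  # staged with the 24 h default TTL
    # Processing would then fill the 'processed' channel before routing:
    # kb_create(...)  # hypothetical stand-in for kb.create()
```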
Architecture Diagram
                         ┌─────────────────┐
                         │    INTERNET     │
                         └────────┬────────┘
                                  │
           ┌──────────────────────┼──────────────────────┐
           │                      │                      │
           ▼                      ▼                      ▼
  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
  │ Level 1: Fetch  │    │ Level 2: Search │    │Level 3: Research│
  │     (httpx)     │    │      (DDG)      │    │   (spider_rs)   │
  │                 │    │                 │    │                 │
  │ • fetch         │    │ • search        │    │ • research      │
  │ • batch_fetch   │    │                 │    │ • crawl         │
  │ • reader        │    │                 │    │ • discover      │
  └────────┬────────┘    └────────┬────────┘    └────────┬────────┘
           │                      │                      │
           └──────────────────────┼──────────────────────┘
                                  │
                                  ▼
                    ┌───────────────────────────┐
                    │    STAGING ENVIRONMENT    │
                    │        (6680/6681)        │
                    │                           │
                    │   stg:{env}:{user}:...    │
                    │   TTL: 24-72 hours        │
                    └─────────────┬─────────────┘
                                  │
      ┌─────────────┬─────────────┼─────────────┬─────────────┐
      │             │             │             │             │
      ▼             ▼             ▼             ▼             ▼
┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
│    Web    │ │    KB     │ │   Docs    │ │  Corpus   │ │   Links   │
│ 6670/6671 │ │ 6625/6626 │ │   /CDN    │ │           │ │ 6635/6636 │
│           │ │           │ │           │ │           │ │           │
│ Permanent │ │ Knowledge │ │   Files   │ │ Training  │ │   URLs    │
│ Snapshots │ │   Base    │ │           │ │   Data    │ │           │
└───────────┘ └───────────┘ └───────────┘ └───────────┘ └───────────┘
Documentation created during Operation Webweaver (g_4lji) - January 2026
Updated for Web Intelligence Server v2.0.0