
Web Intelligence Architecture v2.0

Operation Webweaver Documentation - Server v2.0.0

1. Overview

Multi-Level Web Intelligence

Web Intelligence provides a layered approach to web data gathering, from simple page fetches to comprehensive site ingestion. The system adapts to the complexity of the task.

Tool Users

User        Capabilities       Primary Use Cases
Claude      Full access        Research, enrichment, analysis
LARS        Delegated access   Automated collection, training data
Delegates   Scoped access      Specific tasks (contact lookup, doc fetch)

Environment Integration

Web Intelligence connects to multiple Nexus environments:

  • Web (6670/6671) - Permanent web content storage
  • Staging (6680/6681) - Temporary processing workspace
  • KB (6625/6626) - Knowledge base documentation
  • Docs/CDN - File storage and delivery
  • Corpus - Extracted text for training
  • Links (6635/6636) - URL bookmarking


2. Tool Levels

Level 1: Simple Fetch (httpx - lightweight)

Purpose: Single page retrieval with minimal overhead
Library: httpx (lightweight async HTTP)

Tool              Purpose                      Notes
web.fetch         Single page → markdown       Like Claude's WebFetch
web.batch_fetch   Multiple pages in parallel   Batch processing
web.reader        Article extraction           Readability algorithm

# Single page to markdown
web.fetch(url) β†’ markdown content

# Parallel batch fetch
web.batch_fetch([url1, url2, url3]) β†’ [content1, content2, content3]

# Clean article extraction
web.reader(url) β†’ extracted article (title, content, metadata)
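
Under the hood, Level 1 is plain httpx. A minimal sketch of a parallel batch fetch using nothing but httpx itself - illustrative only; the real web.batch_fetch tool adds markdown conversion, staging writes, and error handling:

# Parallel batch fetch with httpx (illustrative sketch, not the server code)
import asyncio
import httpx

async def fetch_page(client: httpx.AsyncClient, url: str) -> str:
    """Fetch a single page and return its raw HTML."""
    resp = await client.get(url, follow_redirects=True, timeout=30.0)
    resp.raise_for_status()
    return resp.text

async def batch_fetch(urls: list[str]) -> list[str]:
    """Fetch multiple pages in parallel over one shared client."""
    async with httpx.AsyncClient() as client:
        return await asyncio.gather(*(fetch_page(client, u) for u in urls))

pages = asyncio.run(batch_fetch(["https://example.com", "https://example.org"]))
print([len(p) for p in pages])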

Level 2: Search (DuckDuckGo)

Purpose: Discovery via DuckDuckGo (search only, no crawling)
Library: DuckDuckGo API

Tool         Purpose              Notes
web.search   DDG search results   Returns URLs + snippets

web.search(query, limit=10) β†’ [{url, title, snippet}, ...]
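
A sketch of what the DDG call behind web.search might look like, assuming the third-party duckduckgo_search package (its method names have shifted between versions, so treat this as illustrative rather than the server's actual implementation):

# DuckDuckGo search shaped like web.search output (illustrative sketch)
from duckduckgo_search import DDGS

def search(query: str, limit: int = 10) -> list[dict]:
    """Return [{url, title, snippet}, ...] as web.search does."""
    with DDGS() as ddgs:
        raw = ddgs.text(query, max_results=limit)
    return [
        {"url": r.get("href"), "title": r.get("title"), "snippet": r.get("body")}
        for r in raw
    ]

for hit in search("web intelligence architecture", limit=5):
    print(hit["url"], "-", hit["title"])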

Level 3: Research (Full Pipeline)

Purpose: Comprehensive web intelligence with crawling
Library: spider_rs (heavy-duty crawler)

Tool           Purpose                       Notes
web.research   Search + crawl + synthesize   Full 5-phase pipeline
web.crawl      Full site crawl               spider_rs powered
web.discover   Site link discovery           Find all links on site

# Full research pipeline
web.research(topic, depth='standard') β†’ structured intelligence

# Site-wide crawl
web.crawl(base_url, depth=2, limit=100) β†’ all pages

# Link discovery
web.discover(url) β†’ [all_links_found]
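
A minimal crawl sketch, assuming the spider_rs Python bindings expose a Website class with an async crawl() and a get_links() accessor (depth and page budgets, as used by web.crawl, are configured separately and omitted here):

# Site-wide crawl with spider_rs (illustrative sketch; check the bindings
# for the exact builder and budget options)
import asyncio
from spider_rs import Website

async def crawl_site(base_url: str) -> list[str]:
    """Crawl a site and return the links discovered along the way."""
    website = Website(base_url)
    await website.crawl()        # async-native crawl of the whole site
    return website.get_links()   # links found during the crawl

links = asyncio.run(crawl_site("https://example.com"))
print(f"discovered {len(links)} links")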

See the Web Intelligence Pipeline documentation for the full research flow.

Utility Tools

Purpose: Supporting operations for content processing

Tool                  Purpose
web.scrape            Raw HTML extraction
web.to_markdown       HTML → Markdown conversion
web.extract           CSS selector targeting
web.save_link         Save single URL to Links
web.bulk_save_links   Batch save URLs to Links

3. Staging Architecture

Key Format

stg:{env}:{user_id}:{channel}:{row_id}

Components:
  • stg: - Staging prefix
  • {env} - Target environment (web, kb, docs)
  • {user_id} - User ID in u_XXXX format (NOT names)
  • {channel} - Processing stage
  • {row_id} - Unique row identifier

Example:

stg:web:u_z1p5:search:row_8f3a
stg:web:u_z1p5:metadata:row_8f3a
stg:web:u_z1p5:crawled:row_8f3a
stg:web:u_z1p5:processed:row_8f3a
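
A small helper keeps keys consistent; this is a hypothetical convenience function for illustration, not part of the server API:

# Hypothetical builder for stg:{env}:{user_id}:{channel}:{row_id} keys
import uuid

def staging_key(env: str, user_id: str, channel: str, row_id: str | None = None) -> str:
    """Build a staging key in the documented format."""
    if not user_id.startswith("u_"):
        raise ValueError("user_id must use the u_XXXX format, not a name")
    row_id = row_id or f"row_{uuid.uuid4().hex[:4]}"
    return f"stg:{env}:{user_id}:{channel}:{row_id}"

# staging_key("web", "u_z1p5", "search") -> "stg:web:u_z1p5:search:row_8f3a" (for example)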

Channels (Columns)

Data flows through channels like columns in a pipeline:

Channel     Purpose             Contains
search      Initial discovery   URLs + snippets
metadata    Site structure      Sitemaps, nav, about pages
crawled     Raw content         HTML pages
processed   Extracted data      Structured JSON

TTL (Time-To-Live)

  • Default: 24 hours
  • Extended: 72 hours (for complex pipelines)
  • Manual clear: After successful routing to destination

User ID Pattern

Always use the u_XXXX format, never names:
  • ✅ stg:web:u_z1p5:search:row_123
  • ❌ stg:web:chris:search:row_123
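
Putting the key format, the u_XXXX rule, and the TTL together: a minimal write sketch, assuming the staging environment speaks the Redis protocol on port 6680 (an assumption for illustration; substitute the actual staging client if it differs):

# Staging write with a self-expiring TTL (assumes Redis-compatible staging)
import json
import redis

DEFAULT_TTL = 24 * 3600    # default: 24 hours
EXTENDED_TTL = 72 * 3600   # extended: 72 hours for complex pipelines

def stage_row(key: str, payload: dict, ttl: int = DEFAULT_TTL) -> None:
    """Write one staging row with an expiry so it clears itself."""
    r = redis.Redis(host="localhost", port=6680, decode_responses=True)
    r.set(key, json.dumps(payload), ex=ttl)

# stage_row("stg:web:u_z1p5:search:row_8f3a", {"url": "https://example.com", "snippet": "..."})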


4. Environment Integration

Web Environment (6670/6671)

Purpose: Permanent web content storage
Contents:
  • Crawled page snapshots
  • Timestamped versions
  • Searchable content index

Key Pattern: web:{user_id}:{date}:{domain}:{hash}
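
How the {hash} component is derived is not specified here; as a hedged sketch, hashing the page content keeps snapshots of the same URL distinct across versions:

# Hypothetical builder for web:{user_id}:{date}:{domain}:{hash} keys;
# the SHA-256-over-content hash scheme is an assumption for illustration
import hashlib
from datetime import date
from urllib.parse import urlparse

def web_snapshot_key(user_id: str, url: str, content: str) -> str:
    domain = urlparse(url).netloc
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()[:8]
    return f"web:{user_id}:{date.today().isoformat()}:{domain}:{digest}"

# web_snapshot_key("u_z1p5", "https://example.com/page", "<html>...</html>")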

KB Environment (6625/6626)

Purpose: Knowledge base documentation
Contents:
  • Structured documentation
  • Hierarchical nodes
  • Cross-references

Integration: web.ingest_site() β†’ KB nodes

Docs/CDN

Purpose: File storage and delivery
Contents:
  • Downloaded documents (PDFs, images)
  • Generated reports
  • Static assets

Integration: web.download() β†’ Docs storage

Corpus Environment

Purpose: Extracted text for AI training
Contents:
  • Clean text from web pages
  • Document extractions
  • Training datasets

Integration: Docs β†’ Corpus (extraction pipeline)

Staging Environment (6680/6681)

Purpose: Temporary processing workspace
Contents:
  • In-flight web data
  • Pipeline intermediates
  • Pre-routing buffers

Lifetime: 24-72 hours, then cleared


5. Key Libraries

spider_rs

Purpose: Heavy-duty site-wide crawling
Characteristics:
  • Rust-based with Python bindings
  • Async-native (await crawl(url))
  • High-performance parallel crawling
  • Built-in rate limiting

Use For: Level 3 (Research) - multi-page crawls, site mapping, link discovery

httpx

Purpose: Lightweight single-page fetch
Characteristics:
  • Pure Python async HTTP
  • Low overhead, fast
  • Good for simple operations

Use For: Level 1 (Simple Fetch) - single pages, batch fetches

readability-lxml

Purpose: Article content extraction
Characteristics:
  • Readability algorithm implementation
  • Extracts main content from cluttered pages
  • Returns clean article text

Use For: Level 1 - web.reader tool
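
Roughly what web.reader does with readability-lxml; the Document interface below is readability-lxml's real API, while the fetch-and-wrap shape is illustrative:

# Article extraction with readability-lxml (wrapping is illustrative)
import httpx
from readability import Document

def read_article(url: str) -> dict:
    """Fetch a page and extract its main article content."""
    html = httpx.get(url, follow_redirects=True).text
    doc = Document(html)
    return {
        "title": doc.title(),      # page title
        "content": doc.summary(),  # cleaned HTML of the main content
    }

# read_article("https://example.com/some-article")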

html-to-markdown (Rust)

Purpose: HTML β†’ Markdown conversion Characteristics: - Fast Rust implementation - Preserves semantic structure - Handles complex HTML

Use For: Content transformation for KB/Corpus
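
The Rust html-to-markdown bindings are not documented here; as a stand-in sketch, the same transformation step can be shown with the Python markdownify package (a different library, named explicitly, used only to illustrate the conversion):

# HTML -> Markdown conversion using markdownify as a stand-in library
from markdownify import markdownify

html = "<h1>Title</h1><p>Some <strong>content</strong> with a <a href='https://example.com'>link</a>.</p>"
markdown = markdownify(html, heading_style="atx")
print(markdown)   # prints the ATX-style markdown rendering of the HTML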

DuckDuckGo (DDG)

Purpose: Web search
Characteristics:
  • No API key required
  • Privacy-focused
  • Returns structured results

Use For: Level 2 (Search) - discovery phase


6. Data Flows

Web β†’ Staging β†’ Web (Page Storage)

[Internet] β†’ web.crawl() β†’ Staging:crawled β†’ process β†’ Web Environment

Use Case: Permanent snapshot of web pages

Web β†’ Staging β†’ KB (Knowledge Capture)

[Internet] β†’ web.ingest_site() β†’ Staging:processed β†’ kb.create() β†’ KB Environment

Use Case: Import documentation into knowledge base
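
A sketch of the staging hop in this flow, again assuming Redis-compatible staging and using a hypothetical kb_create() stand-in for the kb.create tool:

# Route processed staging rows into the KB (illustrative sketch)
import json
import redis

def kb_create(title: str, body: str) -> None:
    """Hypothetical stand-in for kb.create; replace with the real tool call."""
    print(f"kb.create: {title} ({len(body)} chars)")

def route_processed_to_kb(user_id: str) -> None:
    r = redis.Redis(host="localhost", port=6680, decode_responses=True)
    for key in r.scan_iter(match=f"stg:kb:{user_id}:processed:*"):
        row = json.loads(r.get(key))
        kb_create(row["title"], row["content"])
        r.delete(key)   # manual clear after successful routing

# route_processed_to_kb("u_z1p5")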

Web β†’ Staging β†’ Docs (File Storage)

[Internet] β†’ web.download() β†’ Staging:files β†’ docs.save() β†’ Docs/CDN

Use Case: Save PDFs, images, data files

Docs β†’ Corpus (Text Extraction)

[Docs/CDN] β†’ extract_text() β†’ corpus.add() β†’ Corpus Environment

Use Case: Extract text from documents for training

Staging β†’ PDF (Report Generation)

[Staging:processed] β†’ synthesize() β†’ pdf.generate() β†’ Docs/CDN

Use Case: Generate reports from research data


Architecture Diagram

                                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                    β”‚    INTERNET     β”‚
                                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                             β”‚
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚                        β”‚                        β”‚
                    β–Ό                        β–Ό                        β–Ό
            β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
            β”‚ Level 1: Fetchβ”‚      β”‚  Level 2: Search β”‚      β”‚Level 3: Researchβ”‚
            β”‚     (httpx)   β”‚      β”‚      (DDG)       β”‚      β”‚  (spider_rs)    β”‚
            β”‚               β”‚      β”‚                  β”‚      β”‚                 β”‚
            β”‚  β€’ fetch      β”‚      β”‚  β€’ search        β”‚      β”‚  β€’ research     β”‚
            β”‚  β€’ batch_fetchβ”‚      β”‚                  β”‚      β”‚  β€’ crawl        β”‚
            β”‚  β€’ reader     β”‚      β”‚                  β”‚      β”‚  β€’ discover     β”‚
            β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
                    β”‚                       β”‚                       β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                            β”‚
                                            β–Ό
                              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                              β”‚    STAGING ENVIRONMENT   β”‚
                              β”‚      (6680/6681)         β”‚
                              β”‚                          β”‚
                              β”‚  stg:{env}:{user}:...    β”‚
                              β”‚  TTL: 24-72 hours        β”‚
                              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                           β”‚
           β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
           β”‚               β”‚               β”‚               β”‚               β”‚
           β–Ό               β–Ό               β–Ό               β–Ό               β–Ό
     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
     β”‚    Web    β”‚  β”‚    KB     β”‚  β”‚   Docs    β”‚  β”‚  Corpus   β”‚  β”‚   Links   β”‚
     β”‚ 6670/6671 β”‚  β”‚ 6625/6626 β”‚  β”‚   /CDN    β”‚  β”‚           β”‚  β”‚ 6635/6636 β”‚
     β”‚           β”‚  β”‚           β”‚  β”‚           β”‚  β”‚           β”‚  β”‚           β”‚
     β”‚ Permanent β”‚  β”‚ Knowledge β”‚  β”‚   Files   β”‚  β”‚  Training β”‚  β”‚   URLs    β”‚
     β”‚  Snapshotsβ”‚  β”‚   Base    β”‚  β”‚           β”‚  β”‚   Data    β”‚  β”‚           β”‚
     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Documentation created during Operation Webweaver (g_4lji) - January 2026
Updated for Web Intelligence Server v2.0.0
