Web Intelligence Pipeline
Overview
The Web Intelligence pipeline is a five-phase process that transforms search queries into structured, AI-consumable datasets. Parallel crawling and an AI decision point at each phase keep data collection precise and relevant.
Pipeline Diagram
```
┌────────────────────────────────────────────────────────────────┐
│ PHASE 1: SEARCH + FILTER                                       │
│                                                                │
│ DDG Search (50 results) → AI Filter → Top 10 Relevant          │
│                               ↓                                │
│                       Staging:web:links                        │
└────────────────────────────────────────────────────────────────┘
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ PHASE 2: PARALLEL METADATA CRAWL                               │
│                                                                │
│ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐   10 Parallel          │
│ │Site1│ │Site2│ │Site3│ │Site4│ │ ... │   Channels             │
│ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘                        │
│    │       │       │       │       │                           │
│    └───────┴───────┴───────┴───────┘                           │
│                    ↓                                           │
│   Grab: sitemap, /about, /contact, site structure              │
│                    ↓                                           │
│              Staging:web:metadata                              │
└────────────────────────────────────────────────────────────────┘
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ PHASE 3: PARALLEL DEEP CRAWL                                   │
│                                                                │
│ AI Selects Specific Pages from Metadata                        │
│                    ↓                                           │
│ Precise crawl (NOT "crawl the world")                          │
│   - Product pages                                              │
│   - Service descriptions                                       │
│   - Team/bio pages                                             │
│   - Pricing info                                               │
│                    ↓                                           │
│              Staging:web:content                               │
└────────────────────────────────────────────────────────────────┘
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ PHASE 4: PROCESS + PERMANENT STORAGE                           │
│                                                                │
│ AI extracts relevant data from crawled content                 │
│   - Entity extraction (names, addresses, phones)               │
│   - Content summarization                                      │
│   - Structure detection                                        │
│                    ↓                                           │
│ ┌──────────────────────────────────────────────────────────┐   │
│ │               WEB ENVIRONMENT (PERMANENT)                │   │
│ │            Timestamped, versioned, searchable            │   │
│ │            Ports: 6670 (vault) / 6671 (read)             │   │
│ └──────────────────────────────────────────────────────────┘   │
└────────────────────────────────────────────────────────────────┘
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ PHASE 5: SYNTH + DATASET                                       │
│                                                                │
│ DeepSeek creates:                                              │
│   - Summary documents                                          │
│   - Structured dataset format                                  │
│   - Training-ready data                                        │
│                    ↓                                           │
│ ┌──────────────────────────────────────────────────────────┐   │
│ │                   DATASET ENVIRONMENT                    │   │
│ │               Structured for LARS training               │   │
│ └──────────────────────────────────────────────────────────┘   │
└────────────────────────────────────────────────────────────────┘
```
Phase Details
Phase 1: Search + Filter
Input: Search query (e.g., "retirement planning Centennial CO")
Process:
1. DuckDuckGo search returns ~50 results
2. AI evaluates relevance to the query
3. Filters to the top 10 most relevant
Output: Staging:web:links - URLs ready for crawling
AI Decision Point: Which URLs are worth crawling?
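As a concrete sketch of Phase 1 in Python: a search call, an AI scoring pass, then a write to staging. Here `ddg_search`, `ai_relevance_score`, and the Redis staging client are hypothetical stand-ins for the real components.

```python
# Sketch of Phase 1: search, score, keep the top 10, stage the URLs.
import json

import redis

staging = redis.Redis()  # assumed staging store

def ddg_search(query: str, max_results: int = 50) -> list[dict]:
    """Placeholder: returns [{"url": ..., "title": ..., "snippet": ...}, ...]."""
    raise NotImplementedError

def ai_relevance_score(query: str, result: dict) -> float:
    """Placeholder: model scores one result's relevance to the query, 0.0-1.0."""
    raise NotImplementedError

def phase1_search_and_filter(query: str, top_n: int = 10) -> list[str]:
    results = ddg_search(query)                         # ~50 raw results
    ranked = sorted(results,
                    key=lambda r: ai_relevance_score(query, r),
                    reverse=True)                       # most relevant first
    urls = [r["url"] for r in ranked[:top_n]]           # keep the top 10
    staging.set("staging:web:links", json.dumps(urls))  # hand off to Phase 2
    return urls
```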
Phase 2: Parallel Metadata Crawl
Input: 10 filtered URLs
Process:
1. 10 parallel channels (one per site)
2. Each channel grabs:
   - Sitemap (if available)
   - /about page
   - /contact page
   - Site navigation structure
Output: Staging:web:metadata - Site structures and navigation
AI Decision Point: What pages exist on each site?
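A minimal sketch of the fan-out, assuming the metadata grab is a handful of plain HTTP GETs per site; the path list and the aiohttp usage are illustrative, not the production crawler.

```python
# Sketch of Phase 2's parallel metadata crawl: one task per site,
# each fetching only the pages that describe the site's structure.
import asyncio

import aiohttp

METADATA_PATHS = ["/sitemap.xml", "/about", "/contact"]  # assumed targets

async def crawl_site_metadata(session: aiohttp.ClientSession, base: str) -> dict:
    pages = {}
    for path in METADATA_PATHS:
        try:
            async with session.get(base.rstrip("/") + path) as resp:
                pages[path] = await resp.text() if resp.status == 200 else None
        except (aiohttp.ClientError, asyncio.TimeoutError):
            pages[path] = None  # a missing page is fine; the AI decides later
    return {"site": base, "pages": pages}

async def phase2_metadata_crawl(urls: list[str]) -> list[dict]:
    timeout = aiohttp.ClientTimeout(total=10)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        # 10 URLs in -> 10 parallel channels out, one per site.
        return await asyncio.gather(
            *(crawl_site_metadata(session, u) for u in urls))
```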
Phase 3: Parallel Deep Crawl
Input: Site metadata from Phase 2
Process:
1. AI analyzes metadata to select specific pages
2. Precise, targeted crawl (NOT recursive)
3. Grabs only what's needed:
   - Product/service pages
   - Team bios
   - Pricing information
   - Contact details
Output: Staging:web:content - Raw page content
AI Decision Point: Which specific pages contain the data we need?
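In sketch form, the defining property of Phase 3 is that the target list is built before any fetching starts, so the crawl is bounded by the AI's selection rather than by link-following. `ai_select_pages` is a hypothetical stand-in for the selection model.

```python
# Sketch of Phase 3: the AI turns metadata into an explicit URL list,
# and only those URLs are fetched -- no recursive link-following.
import asyncio

import aiohttp

def ai_select_pages(query: str, site_metadata: dict) -> list[str]:
    """Placeholder: model picks the product, bio, pricing, and contact
    pages worth fetching for this query."""
    raise NotImplementedError

async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    async with session.get(url) as resp:
        return await resp.text()

async def phase3_deep_crawl(query: str, all_metadata: list[dict]) -> dict[str, str]:
    # The entire crawl is this finite list, decided before fetching begins.
    targets = [u for meta in all_metadata for u in ai_select_pages(query, meta)]
    async with aiohttp.ClientSession() as session:
        texts = await asyncio.gather(*(fetch(session, u) for u in targets))
    return dict(zip(targets, texts))  # url -> raw content for staging:web:content
```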
Phase 4: Process + Permanent Storage
Input: Raw content from Phase 3
Process:
1. AI extracts entities:
   - Names, titles, companies
   - Addresses, phones, emails
   - Services, pricing, dates
2. Structures data for storage
3. Timestamps and versions each record
Output: Web Environment (6670/6671) - Permanent, searchable storage
AI Decision Point: What data is worth keeping permanently?
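A sketch of the processing step: raw page text in, structured and timestamped record out. `ai_extract_entities` stands in for the extraction model, and the record shape is an assumption.

```python
# Sketch of Phase 4 processing: one structured record per crawled page.
import time

def ai_extract_entities(page_text: str) -> dict:
    """Placeholder: pulls names/titles/companies, addresses/phones/emails,
    and services/pricing/dates out of the page text."""
    raise NotImplementedError

def phase4_process(url: str, page_text: str) -> dict:
    return {
        "url": url,
        "entities": ai_extract_entities(page_text),
        "crawled_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "schema_version": 1,  # bumped if the record shape ever changes
    }
```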
Phase 5: Synth + Dataset
Input: Processed data from Web Environment
Process:
1. DeepSeek synthesizes:
   - Summary documents
   - Structured datasets
   - Training data format
2. Validates data quality
3. Formats for LARS consumption
Output: Dataset Environment - LARS-ready training data
AI Decision Point: How should data be structured for optimal training?
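As a sketch, synthesis might emit one JSONL row per processed record. The instruction/response schema here is an assumption (swap in whatever format LARS actually trains on), and `ai_synthesize_summary` stands in for the DeepSeek call.

```python
# Sketch of Phase 5: shaping processed records into a training-ready
# JSONL file for the Dataset Environment.
import json

def ai_synthesize_summary(record: dict) -> str:
    """Placeholder for the DeepSeek summarization/synthesis call."""
    raise NotImplementedError

def phase5_build_dataset(records: list[dict], out_path: str) -> None:
    with open(out_path, "w", encoding="utf-8") as f:
        for record in records:
            row = {
                "instruction": f"Summarize what is known about {record['url']}.",
                "response": ai_synthesize_summary(record),
                "source": record["url"],
                "crawled_at": record["crawled_at"],
            }
            f.write(json.dumps(row) + "\n")  # one training example per line
```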
Key Principles
Staging is TEMPORARY
- Staging:web:links - Cleared after Phase 2 starts
- Staging:web:metadata - Cleared after Phase 3 starts
- Staging:web:content - Cleared after Phase 4 completes
Purpose: Organize, process, route - then clear
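That schedule is simple to enforce in code. A sketch, assuming the staging store is Redis:

```python
# Sketch of the clearing schedule: each phase deletes its predecessor's
# staging key as soon as it has consumed it. Redis is an assumed backend.
import redis

staging = redis.Redis()

CONSUMES = {
    2: "staging:web:links",     # cleared after Phase 2 starts
    3: "staging:web:metadata",  # cleared after Phase 3 starts
}

def begin_phase(phase: int) -> None:
    if phase in CONSUMES:
        staging.delete(CONSUMES[phase])

def finish_phase(phase: int) -> None:
    if phase == 4:
        staging.delete("staging:web:content")  # cleared after Phase 4 completes
```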
Web Environment is PERMANENT
- Like Documents, Corpus, KB
- Timestamped for history
- Versioned for updates
- Searchable for retrieval
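A sketch of the write and read paths, assuming the vault (6670) and read (6671) ports front two Redis instances; the document fixes the ports, while the Redis choice and key scheme are assumptions.

```python
# Sketch of permanent storage: every write gets a fresh version number,
# and each record carries its own timestamp (see phase4_process above).
import json

import redis

vault = redis.Redis(port=6670)   # writes go through the vault port
reader = redis.Redis(port=6671)  # retrieval uses the read port

def store_permanent(record: dict) -> str:
    base = f"web:{record['url']}"
    version = vault.incr(base + ":version")  # versioned for updates
    key = f"{base}:v{version}"
    vault.set(key, json.dumps(record))
    return key

def load_latest(url: str) -> dict | None:
    version = reader.get(f"web:{url}:version")
    if version is None:
        return None
    raw = reader.get(f"web:{url}:v{int(version)}")
    return json.loads(raw) if raw else None
```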
Three-Phase Parallelization
- Phase 2: 10 parallel metadata crawls
- Phase 3: Parallel deep crawls (count varies by AI selection)
- Phase 5: Can run synthesis while Phase 4 continues
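The Phase 4 / Phase 5 overlap is an ordinary producer-consumer handoff. A sketch using an asyncio queue, where `phase4_process` refers to the Phase 4 sketch above:

```python
# Sketch of overlapping Phases 4 and 5: synthesis starts on each record
# as soon as processing emits it, instead of waiting for the full batch.
import asyncio

async def phase4_stage(raw: asyncio.Queue, processed: asyncio.Queue) -> None:
    while (item := await raw.get()) is not None:
        url, text = item
        processed.put_nowait(phase4_process(url, text))  # Phase 4 sketch above
    processed.put_nowait(None)  # end-of-stream marker

async def phase5_stage(processed: asyncio.Queue, dataset: list) -> None:
    while (record := await processed.get()) is not None:
        dataset.append(record)  # hand each record to synthesis immediately
```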
AI Decides at Each Phase
- Phase 1: Relevance filtering
- Phase 2: (Automated - gather metadata)
- Phase 3: Page selection based on metadata
- Phase 4: Entity extraction and structuring
- Phase 5: Dataset synthesis strategy
Staging Environment Keys
Staging:web:links - Filtered URLs from search
Staging:web:metadata - Site structures, navigation
Staging:web:content - Raw crawled content
All staging keys are prefixed with staging: and are automatically cleared after processing.
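The shared prefix also makes a defensive end-of-run cleanup a single sweep; a sketch, again assuming the staging store is Redis:

```python
# Sketch: the staging: prefix lets cleanup be a single key scan.
import redis

staging = redis.Redis()

def clear_all_staging(prefix: str = "staging:web:") -> int:
    """Delete every staging key under the prefix; returns how many were removed."""
    removed = 0
    for key in staging.scan_iter(match=prefix + "*"):
        staging.delete(key)
        removed += 1
    return removed
```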