Web Intelligence Pipeline

Overview

The Web Intelligence pipeline is a five-phase process that transforms search queries into structured, AI-consumable datasets. Each phase involves parallel processing and AI decision-making to ensure precise, relevant data collection.

Pipeline Diagram

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                     PHASE 1: SEARCH + FILTER                     β”‚
β”‚                                                                  β”‚
β”‚  DDG Search (50 results) β†’ AI Filter β†’ Top 10 Relevant           β”‚
β”‚                              ↓                                   β”‚
β”‚                    Staging:web:links                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                  PHASE 2: PARALLEL METADATA CRAWL                β”‚
β”‚                                                                  β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”      10 Parallel        β”‚
β”‚  β”‚Site1β”‚ β”‚Site2β”‚ β”‚Site3β”‚ β”‚Site4β”‚ β”‚ ... β”‚      Channels          β”‚
β”‚  β””β”€β”€β”¬β”€β”€β”˜ β””β”€β”€β”¬β”€β”€β”˜ β””β”€β”€β”¬β”€β”€β”˜ β””β”€β”€β”¬β”€β”€β”˜ β””β”€β”€β”¬β”€β”€β”˜                       β”‚
β”‚     β”‚       β”‚       β”‚       β”‚       β”‚                          β”‚
β”‚     β””β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜                          β”‚
β”‚                       ↓                                         β”‚
β”‚  Grab: sitemap, /about, /contact, site structure                 β”‚
β”‚                       ↓                                         β”‚
β”‚              Staging:web:metadata                                β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   PHASE 3: PARALLEL DEEP CRAWL                   β”‚
β”‚                                                                  β”‚
β”‚  AI Selects Specific Pages from Metadata                        β”‚
β”‚                       ↓                                         β”‚
β”‚  Precise crawl (NOT "crawl the world")                            β”‚
β”‚  - Product pages                                                 β”‚
β”‚  - Service descriptions                                         β”‚
β”‚  - Team/bio pages                                                β”‚
β”‚  - Pricing info                                                  β”‚
β”‚                       ↓                                         β”‚
β”‚              Staging:web:content                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚               PHASE 4: PROCESS + PERMANENT STORAGE               β”‚
β”‚                                                                  β”‚
β”‚  AI extracts relevant data from crawled content                  β”‚
β”‚  - Entity extraction (names, addresses, phones)                  β”‚
β”‚  - Content summarization                                         β”‚
β”‚  - Structure detection                                           β”‚
β”‚                       ↓                                         β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚           WEB ENVIRONMENT (PERMANENT)                β”‚    β”‚
β”‚  β”‚       Timestamped, versioned, searchable             β”‚    β”‚
β”‚  β”‚          Ports: 6670 (vault) / 6671 (read)           β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   PHASE 5: SYNTH + DATASET                       β”‚
β”‚                                                                  β”‚
β”‚  DeepSeek creates:                                               β”‚
β”‚  - Summary documents                                             β”‚
β”‚  - Structured dataset format                                     β”‚
β”‚  - Training-ready data                                           β”‚
β”‚                       ↓                                         β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚              DATASET ENVIRONMENT                     β”‚    β”‚
β”‚  β”‚          Structured for LARS training                β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Phase Details

Phase 1: Search + Filter

Input: Search query (e.g., "retirement planning Centennial CO")

Process:
  1. DuckDuckGo search returns ~50 results
  2. AI evaluates relevance to the query
  3. Filters to the top 10 most relevant

Output: Staging:web:links - URLs ready for crawling

AI Decision Point: Which URLs are worth crawling?
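The filtering step above can be sketched in Python. The `relevance_score` stub below is a hypothetical stand-in for the AI relevance call (here, naive keyword overlap); the real pipeline would score each result with a model instead.

```python
def relevance_score(query: str, result: dict) -> float:
    # Stand-in for the AI relevance judgment: fraction of query terms
    # that appear in the result's title or snippet.
    terms = set(query.lower().split())
    text = (result["title"] + " " + result["snippet"]).lower()
    return sum(1 for t in terms if t in text) / len(terms)

def filter_results(query: str, results: list[dict], top_n: int = 10) -> list[str]:
    # Rank ~50 search results by relevance and keep the top N URLs
    # for Staging:web:links.
    ranked = sorted(results, key=lambda r: relevance_score(query, r), reverse=True)
    return [r["url"] for r in ranked[:top_n]]
```

The key point is that filtering happens before any crawling, so only the top N URLs ever reach the staging environment.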

Phase 2: Parallel Metadata Crawl

Input: 10 filtered URLs

Process:
  1. 10 parallel channels (one per site)
  2. Each channel grabs:
     - Sitemap (if available)
     - /about page
     - /contact page
     - Site navigation structure

Output: Staging:web:metadata - Site structures and navigation

AI Decision Point: What pages exist on each site?
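The one-channel-per-site fan-out can be sketched with `asyncio`. The `fetch` stub below is an assumption that stands in for a real HTTP GET so the sketch runs offline; the path list mirrors the metadata targets named above.

```python
import asyncio

# Metadata targets from Phase 2 (sitemap, /about, /contact).
METADATA_PATHS = ["/sitemap.xml", "/about", "/contact"]

async def fetch(url: str) -> str:
    # Placeholder for a real HTTP GET; returns a marker string offline.
    await asyncio.sleep(0)
    return f"<content of {url}>"

async def crawl_metadata(base_url: str) -> dict:
    # One channel: fetch all metadata pages for a single site concurrently.
    pages = await asyncio.gather(*(fetch(base_url + p) for p in METADATA_PATHS))
    return {"site": base_url, "pages": dict(zip(METADATA_PATHS, pages))}

async def crawl_all(urls: list[str]) -> list[dict]:
    # All sites crawled in parallel, one channel each.
    return await asyncio.gather(*(crawl_metadata(u) for u in urls))
```

Because each site is an independent task, one slow or unreachable site does not block the other nine channels.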

Phase 3: Parallel Deep Crawl

Input: Site metadata from Phase 2

Process:
  1. AI analyzes metadata to select specific pages
  2. Precise, targeted crawl (NOT recursive)
  3. Grabs only what's needed:
     - Product/service pages
     - Team bios
     - Pricing information
     - Contact details

Output: Staging:web:content - Raw page content

AI Decision Point: Which specific pages contain the data we need?
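A minimal sketch of the selection step, assuming the Phase 2 metadata includes a `urls` list per site. Keyword matching here is a hypothetical stand-in for the AI's page selection; the point it illustrates is choosing a short explicit URL list rather than crawling recursively.

```python
# Page categories Phase 3 cares about (products, services, team, pricing, contact).
WANTED = ("product", "service", "team", "pricing", "contact")

def select_pages(metadata: dict) -> list[str]:
    # Stand-in for the AI selection step: keep only URLs whose path
    # mentions a wanted category, instead of "crawling the world".
    return [url for url in metadata["urls"]
            if any(w in url.lower() for w in WANTED)]
```

The deep crawl then fetches exactly this list and nothing else, which keeps Phase 3 bounded regardless of site size.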

Phase 4: Process + Permanent Storage

Input: Raw content from Phase 3

Process:
  1. AI extracts entities:
     - Names, titles, companies
     - Addresses, phones, emails
     - Services, pricing, dates
  2. Structures data for storage
  3. Timestamps and versions the records

Output: Web Environment (6670/6671) - Permanent, searchable storage

AI Decision Point: What data is worth keeping permanently?
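A sketch of the record shape written to the Web Environment. The regexes are a simplified stand-in for the AI entity extraction (they only cover emails and US-style phone numbers); the timestamp and version fields reflect the permanence guarantees described above.

```python
import re
from datetime import datetime, timezone

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE_RE = re.compile(r"\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}")

def process_content(url: str, text: str, version: int = 1) -> dict:
    # Build a timestamped, versioned record for permanent storage.
    # Regex extraction stands in for the AI entity-extraction step.
    return {
        "url": url,
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "version": version,
    }
```

Re-crawling the same URL later would write a new record with a bumped `version` rather than overwriting the old one, preserving history.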

Phase 5: Synth + Dataset

Input: Processed data from the Web Environment

Process:
  1. DeepSeek synthesizes:
     - Summary documents
     - Structured datasets
     - Training-data format
  2. Validates data quality
  3. Formats for LARS consumption

Output: Dataset Environment - LARS-ready training data

AI Decision Point: How should data be structured for optimal training?
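The formatting step might produce JSONL records like the sketch below. The record schema here is an assumption for illustration; the actual LARS training format is not specified in this document.

```python
import json

def to_training_record(entry: dict) -> str:
    # Hypothetical JSONL line for the Dataset Environment: one record
    # per processed source, carrying its summary and extracted entities.
    record = {
        "source": entry["url"],
        "summary": entry["summary"],
        "entities": entry.get("entities", {}),
    }
    return json.dumps(record)
```

JSONL (one JSON object per line) is a common choice for training data because records can be streamed and appended without rewriting the file.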

Key Principles

Staging is TEMPORARY

  • Staging:web:links - Cleared after Phase 2 starts
  • Staging:web:metadata - Cleared after Phase 3 starts
  • Staging:web:content - Cleared after Phase 4 completes

Purpose: Organize, process, route - then clear

Web Environment is PERMANENT

  • Like Documents, Corpus, KB
  • Timestamped for history
  • Versioned for updates
  • Searchable for retrieval

Three-Phase Parallelization

  1. Phase 2: 10 parallel metadata crawls
  2. Phase 3: Parallel deep crawls (count varies by AI selection)
  3. Phase 5: Can run synthesis while Phase 4 continues

AI Decides at Each Phase

  • Phase 1: Relevance filtering
  • Phase 2: (Automated - gather metadata)
  • Phase 3: Page selection based on metadata
  • Phase 4: Entity extraction and structuring
  • Phase 5: Dataset synthesis strategy

Staging Environment Keys

Staging:web:links     - Filtered URLs from search
Staging:web:metadata  - Site structures, navigation
Staging:web:content   - Raw crawled content

All staging keys are prefixed with Staging: and are automatically cleared after processing.
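The put-then-clear lifecycle can be sketched with an in-memory store. `StagingStore` is a hypothetical stand-in for the real staging backend; only the prefix check and the clear-after-consumption semantics come from this document.

```python
class StagingStore:
    """In-memory stand-in for the staging environment."""

    def __init__(self):
        self._data = {}

    def put(self, key: str, value):
        # Enforce the staging key namespace described above.
        if not key.lower().startswith("staging:"):
            raise ValueError("staging keys must use the Staging: prefix")
        self._data[key] = value

    def get(self, key: str):
        return self._data.get(key)

    def clear_phase(self, key: str):
        # Called once the next phase has consumed this key's data:
        # staging is temporary, so the key is dropped entirely.
        self._data.pop(key, None)
```

For example, `Staging:web:links` is written at the end of Phase 1 and cleared once Phase 2 has started consuming it.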

ID: 6b7dc3d3
Path: Web Intelligence > Web Intelligence Pipeline
Updated: 2026-01-08T13:07:12