Web Intelligence Pipeline
Overview
The Web Intelligence pipeline is a five-phase process that transforms search queries into structured, AI-consumable datasets. Parallel crawling and an AI decision point at each phase keep data collection precise and relevant.
Pipeline Diagram
```
┌────────────────────────────────────────────────────────────────┐
│ PHASE 1: SEARCH + FILTER                                       │
│                                                                │
│ DDG Search (50 results) → AI Filter → Top 10 Relevant          │
│                               ↓                                │
│                       Staging:web:links                        │
└────────────────────────────────────────────────────────────────┘
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ PHASE 2: PARALLEL METADATA CRAWL                               │
│                                                                │
│ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐   10 Parallel          │
│ │Site1│ │Site2│ │Site3│ │Site4│ │ ... │   Channels             │
│ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘                        │
│    │       │       │       │       │                           │
│    └───────┴───────┴───────┴───────┘                           │
│                    ↓                                           │
│   Grab: sitemap, /about, /contact, site structure              │
│                    ↓                                           │
│              Staging:web:metadata                              │
└────────────────────────────────────────────────────────────────┘
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ PHASE 3: PARALLEL DEEP CRAWL                                   │
│                                                                │
│ AI Selects Specific Pages from Metadata                        │
│                    ↓                                           │
│ Precise crawl (NOT "crawl the world")                          │
│   - Product pages                                              │
│   - Service descriptions                                       │
│   - Team/bio pages                                             │
│   - Pricing info                                               │
│                    ↓                                           │
│              Staging:web:content                               │
└────────────────────────────────────────────────────────────────┘
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ PHASE 4: PROCESS + PERMANENT STORAGE                           │
│                                                                │
│ AI extracts relevant data from crawled content                 │
│   - Entity extraction (names, addresses, phones)               │
│   - Content summarization                                      │
│   - Structure detection                                        │
│                    ↓                                           │
│ ┌──────────────────────────────────────────────────────────┐   │
│ │               WEB ENVIRONMENT (PERMANENT)                │   │
│ │            Timestamped, versioned, searchable            │   │
│ │            Ports: 6670 (vault) / 6671 (read)             │   │
│ └──────────────────────────────────────────────────────────┘   │
└────────────────────────────────────────────────────────────────┘
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ PHASE 5: SYNTH + DATASET                                       │
│                                                                │
│ DeepSeek creates:                                              │
│   - Summary documents                                          │
│   - Structured dataset format                                  │
│   - Training-ready data                                        │
│                    ↓                                           │
│ ┌──────────────────────────────────────────────────────────┐   │
│ │                   DATASET ENVIRONMENT                    │   │
│ │               Structured for LARS training               │   │
│ └──────────────────────────────────────────────────────────┘   │
└────────────────────────────────────────────────────────────────┘
```
Phase Details
Phase 1: Search + Filter
Input: Search query (e.g., "retirement planning Centennial CO")
Process:
1. DuckDuckGo search returns ~50 results
2. AI evaluates relevance to the query
3. Filters to the top 10 most relevant
Output: Staging:web:links - URLs ready for crawling
AI Decision Point: Which URLs are worth crawling?
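As a concrete sketch of Phase 1 in Python: a search call, an AI scoring pass, then a write to staging. Here `ddg_search`, `ai_relevance_score`, and the Redis staging client are hypothetical stand-ins for the real components.

```python
# Sketch of Phase 1: search, score, keep the top 10, stage the URLs.
import json

import redis

staging = redis.Redis()  # assumed staging store

def ddg_search(query: str, max_results: int = 50) -> list[dict]:
    """Placeholder: returns [{"url": ..., "title": ..., "snippet": ...}, ...]."""
    raise NotImplementedError

def ai_relevance_score(query: str, result: dict) -> float:
    """Placeholder: model scores one result's relevance to the query, 0.0-1.0."""
    raise NotImplementedError

def phase1_search_and_filter(query: str, top_n: int = 10) -> list[str]:
    results = ddg_search(query)                         # ~50 raw results
    ranked = sorted(results,
                    key=lambda r: ai_relevance_score(query, r),
                    reverse=True)                       # most relevant first
    urls = [r["url"] for r in ranked[:top_n]]           # keep the top 10
    staging.set("staging:web:links", json.dumps(urls))  # hand off to Phase 2
    return urls
```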
Phase 2: Parallel Metadata Crawl
Input: 10 filtered URLs
Process:
1. 10 parallel channels (one per site)
2. Each channel grabs:
   - Sitemap (if available)
   - /about page
   - /contact page
   - Site navigation structure
Output: Staging:web:metadata - Site structures and navigation
AI Decision Point: What pages exist on each site?
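A minimal sketch of the fan-out, assuming the metadata grab is a handful of plain HTTP GETs per site; the path list and the aiohttp usage are illustrative, not the production crawler.

```python
# Sketch of Phase 2's parallel metadata crawl: one task per site,
# each fetching only the pages that describe the site's structure.
import asyncio

import aiohttp

METADATA_PATHS = ["/sitemap.xml", "/about", "/contact"]  # assumed targets

async def crawl_site_metadata(session: aiohttp.ClientSession, base: str) -> dict:
    pages = {}
    for path in METADATA_PATHS:
        try:
            async with session.get(base.rstrip("/") + path) as resp:
                pages[path] = await resp.text() if resp.status == 200 else None
        except (aiohttp.ClientError, asyncio.TimeoutError):
            pages[path] = None  # a missing page is fine; the AI decides later
    return {"site": base, "pages": pages}

async def phase2_metadata_crawl(urls: list[str]) -> list[dict]:
    timeout = aiohttp.ClientTimeout(total=10)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        # 10 URLs in -> 10 parallel channels out, one per site.
        return await asyncio.gather(
            *(crawl_site_metadata(session, u) for u in urls))
```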
Phase 3: Parallel Deep Crawl
Input: Site metadata from Phase 2
Process:
1. AI analyzes metadata to select specific pages
2. Precise, targeted crawl (NOT recursive)
3. Grabs only what's needed:
   - Product/service pages
   - Team bios
   - Pricing information
   - Contact details
Output: Staging:web:content - Raw page content
AI Decision Point: Which specific pages contain the data we need?
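In sketch form, the defining property of Phase 3 is that the target list is built before any fetching starts, so the crawl is bounded by the AI's selection rather than by link-following. `ai_select_pages` is a hypothetical stand-in for the selection model.

```python
# Sketch of Phase 3: the AI turns metadata into an explicit URL list,
# and only those URLs are fetched -- no recursive link-following.
import asyncio

import aiohttp

def ai_select_pages(query: str, site_metadata: dict) -> list[str]:
    """Placeholder: model picks the product, bio, pricing, and contact
    pages worth fetching for this query."""
    raise NotImplementedError

async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    async with session.get(url) as resp:
        return await resp.text()

async def phase3_deep_crawl(query: str, all_metadata: list[dict]) -> dict[str, str]:
    # The entire crawl is this finite list, decided before fetching begins.
    targets = [u for meta in all_metadata for u in ai_select_pages(query, meta)]
    async with aiohttp.ClientSession() as session:
        texts = await asyncio.gather(*(fetch(session, u) for u in targets))
    return dict(zip(targets, texts))  # url -> raw content for staging:web:content
```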
Phase 4: Process + Permanent Storage
Input: Raw content from Phase 3
Process:
1. AI extracts entities:
   - Names, titles, companies
   - Addresses, phones, emails
   - Services, pricing, dates
2. Structures data for storage
3. Timestamps and versions each record
Output: Web Environment (6670/6671) - Permanent, searchable storage
AI Decision Point: What data is worth keeping permanently?
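A sketch of the processing step: raw page text in, structured and timestamped record out. `ai_extract_entities` stands in for the extraction model, and the record shape is an assumption.

```python
# Sketch of Phase 4 processing: one structured record per crawled page.
import time

def ai_extract_entities(page_text: str) -> dict:
    """Placeholder: pulls names/titles/companies, addresses/phones/emails,
    and services/pricing/dates out of the page text."""
    raise NotImplementedError

def phase4_process(url: str, page_text: str) -> dict:
    return {
        "url": url,
        "entities": ai_extract_entities(page_text),
        "crawled_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "schema_version": 1,  # bumped if the record shape ever changes
    }
```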
Phase 5: Synth + Dataset
Input: Processed data from Web Environment
Process:
1. DeepSeek synthesizes:
   - Summary documents
   - Structured datasets
   - Training data format
2. Validates data quality
3. Formats for LARS consumption
Output: Dataset Environment - LARS-ready training data
AI Decision Point: How should data be structured for optimal training?
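As a sketch, synthesis might emit one JSONL row per processed record. The instruction/response schema here is an assumption (swap in whatever format LARS actually trains on), and `ai_synthesize_summary` stands in for the DeepSeek call.

```python
# Sketch of Phase 5: shaping processed records into a training-ready
# JSONL file for the Dataset Environment.
import json

def ai_synthesize_summary(record: dict) -> str:
    """Placeholder for the DeepSeek summarization/synthesis call."""
    raise NotImplementedError

def phase5_build_dataset(records: list[dict], out_path: str) -> None:
    with open(out_path, "w", encoding="utf-8") as f:
        for record in records:
            row = {
                "instruction": f"Summarize what is known about {record['url']}.",
                "response": ai_synthesize_summary(record),
                "source": record["url"],
                "crawled_at": record["crawled_at"],
            }
            f.write(json.dumps(row) + "\n")  # one training example per line
```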
Key Principles
Staging is TEMPORARY
- Staging:web:links - Cleared after Phase 2 starts
- Staging:web:metadata - Cleared after Phase 3 starts
- Staging:web:content - Cleared after Phase 4 completes
Purpose: Organize, process, route - then clear
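That schedule is simple to enforce in code. A sketch, assuming the staging store is Redis:

```python
# Sketch of the clearing schedule: each phase deletes its predecessor's
# staging key as soon as it has consumed it. Redis is an assumed backend.
import redis

staging = redis.Redis()

CONSUMES = {
    2: "staging:web:links",     # cleared after Phase 2 starts
    3: "staging:web:metadata",  # cleared after Phase 3 starts
}

def begin_phase(phase: int) -> None:
    if phase in CONSUMES:
        staging.delete(CONSUMES[phase])

def finish_phase(phase: int) -> None:
    if phase == 4:
        staging.delete("staging:web:content")  # cleared after Phase 4 completes
```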
Web Environment is PERMANENT
- Like Documents, Corpus, KB
- Timestamped for history
- Versioned for updates
- Searchable for retrieval
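A sketch of the write and read paths, assuming the vault (6670) and read (6671) ports front two Redis instances; the document fixes the ports, while the Redis choice and key scheme are assumptions.

```python
# Sketch of permanent storage: every write gets a fresh version number,
# and each record carries its own timestamp (see phase4_process above).
import json

import redis

vault = redis.Redis(port=6670)   # writes go through the vault port
reader = redis.Redis(port=6671)  # retrieval uses the read port

def store_permanent(record: dict) -> str:
    base = f"web:{record['url']}"
    version = vault.incr(base + ":version")  # versioned for updates
    key = f"{base}:v{version}"
    vault.set(key, json.dumps(record))
    return key

def load_latest(url: str) -> dict | None:
    version = reader.get(f"web:{url}:version")
    if version is None:
        return None
    raw = reader.get(f"web:{url}:v{int(version)}")
    return json.loads(raw) if raw else None
```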
Three-Phase Parallelization
- Phase 2: 10 parallel metadata crawls
- Phase 3: Parallel deep crawls (count varies by AI selection)
- Phase 5: Can run synthesis while Phase 4 continues
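The Phase 4 / Phase 5 overlap is an ordinary producer-consumer handoff. A sketch using an asyncio queue, where `phase4_process` refers to the Phase 4 sketch above:

```python
# Sketch of overlapping Phases 4 and 5: synthesis starts on each record
# as soon as processing emits it, instead of waiting for the full batch.
import asyncio

async def phase4_stage(raw: asyncio.Queue, processed: asyncio.Queue) -> None:
    while (item := await raw.get()) is not None:
        url, text = item
        processed.put_nowait(phase4_process(url, text))  # Phase 4 sketch above
    processed.put_nowait(None)  # end-of-stream marker

async def phase5_stage(processed: asyncio.Queue, dataset: list) -> None:
    while (record := await processed.get()) is not None:
        dataset.append(record)  # hand each record to synthesis immediately
```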
AI Decides at Each Phase
- Phase 1: Relevance filtering
- Phase 2: (Automated - gather metadata)
- Phase 3: Page selection based on metadata
- Phase 4: Entity extraction and structuring
- Phase 5: Dataset synthesis strategy
Staging Environment Keys
Staging:web:links - Filtered URLs from search
Staging:web:metadata - Site structures, navigation
Staging:web:content - Raw crawled content
All staging keys are prefixed with staging: and are automatically cleared after processing.
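The shared prefix also makes a defensive end-of-run cleanup a single sweep; a sketch, again assuming the staging store is Redis:

```python
# Sketch: the staging: prefix lets cleanup be a single key scan.
import redis

staging = redis.Redis()

def clear_all_staging(prefix: str = "staging:web:") -> int:
    """Delete every staging key under the prefix; returns how many were removed."""
    removed = 0
    for key in staging.scan_iter(match=prefix + "*"):
        staging.delete(key)
        removed += 1
    return removed
```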