
Source-Specific Staging Rules

Staging behavior changes based on data source. Not one-size-fits-all.

Web Content Rules

Characteristics

  • High volume (250K+ characters common)
  • Already pre-processed by markdown converter (not AI)
  • Lossy-okay (AI extracts what it needs)
  • Fast cleanup acceptable

Stage 1 Behavior

  • Contains markdown output from web scraper
  • AI decides what to keep
  • Wipes on a timer (30 seconds after Stage 2 is created)
  • No verification required before wipe (see the sketch after the pipeline)

Pipeline

Web Scrape → Markdown Converter → Stage 1 → AI extracts → Stage 2
                                    ↓
                            [timer wipe]
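
A minimal sketch of the timer wipe, assuming Stage 1 lives as a file on disk; schedule_timer_wipe and the file layout are illustrative, not part of the spec. Only the 30-second delay comes from the rule above:

import os
import threading

WIPE_DELAY_SECONDS = 30  # rule: wipe 30 seconds after Stage 2 is created

def schedule_timer_wipe(stage1_path):
    # Unconditional wipe: web content is lossy-okay, so nothing
    # is verified before Stage 1 is removed.
    def _wipe():
        if os.path.exists(stage1_path):
            os.remove(stage1_path)
    timer = threading.Timer(WIPE_DELAY_SECONDS, _wipe)
    timer.daemon = True  # a pending wipe should not block process exit
    timer.start()
    return timer

Call schedule_timer_wipe(...) at the moment Stage 2 is created.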

Document Content Rules

Characteristics

  • Every character matters (contracts, legal)
  • Lossless required
  • Slow cleanup acceptable
  • Must verify complete ingestion

Stage 1 Behavior

  • Contains full extracted content
  • Does NOT wipe until Corpus confirms 100% ingestion
  • Every ampersand, comma, colon preserved
  • Verification step required before wipe (see the sketch after the pipeline)

Pipeline

PDF/Doc → NLM Ingestor → Stage 1 → Corpus Ingestion → Verify Complete
                                                            ↓
                                                    [verified wipe]
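
A sketch of the verified wipe under one assumption: Corpus can report a SHA-256 digest of what it ingested. The interface is hypothetical, but a byte-level digest match is one way to guarantee that every character survived:

import hashlib
import os

def verified_wipe(stage1_path, corpus_sha256):
    # Wipe Stage 1 only after Corpus confirms 100% ingestion.
    # Matching digests over raw bytes means every ampersand,
    # comma, and colon made it into Corpus.
    with open(stage1_path, 'rb') as f:
        staged = hashlib.sha256(f.read()).hexdigest()
    if staged != corpus_sha256:
        raise RuntimeError('Ingestion incomplete; Stage 1 kept for retry')
    os.remove(stage1_path)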

Storage Destinations

Source         Permanent Store     Cleanup Rule
Documents      Corpus              Verified wipe
Web Content    Web Storage (new)   Timer wipe
API Data       Context/KB          Timer wipe
User Uploads   Corpus              Verified wipe
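
The table maps directly onto a lookup; a sketch with illustrative key and rule names:

# Source type -> (permanent store, cleanup rule); names are illustrative
ROUTING = {
    'document':    ('Corpus',      'verified_wipe'),
    'web_content': ('Web Storage', 'timer_wipe'),
    'api_data':    ('Context/KB',  'timer_wipe'),
    'user_upload': ('Corpus',      'verified_wipe'),
}

store, cleanup = ROUTING['web_content']  # ('Web Storage', 'timer_wipe')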

Web Storage Environment (Proposed)

Separate from Corpus because:

  • URLs as primary identifiers (not file paths)
  • Content changes over time (re-scrape needed)
  • Different retention policies
  • May need freshness tracking

Structure

from datetime import datetime, timezone

# Proposed Web Storage record; values shown are examples
web_content = {
    'id': 'web_abc123',                           # stable ID used by evaluation
    'url': 'https://corlera.com/about',           # URL is the primary identifier
    'scraped_at': datetime.now(timezone.utc),     # when the page was scraped
    'content': '...',                             # markdown-converted page text
    'freshness': 'fresh',                         # 'fresh' or 'stale' (re-scrape needed)
    'last_verified': datetime.now(timezone.utc),  # last freshness check
    'trained_on': False,                          # True once LARS has trained on it
}
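
One use of the freshness fields, assuming a seven-day window; the actual retention policy is not fixed above, so MAX_AGE is a placeholder:

from datetime import datetime, timezone, timedelta

MAX_AGE = timedelta(days=7)  # assumed freshness window, not a spec value

def refresh_status(record):
    # Flag a record stale once its last verification is older than
    # MAX_AGE, signalling that a re-scrape is needed.
    age = datetime.now(timezone.utc) - record['last_verified']
    return 'stale' if age > MAX_AGE else 'fresh'

web_content['freshness'] = refresh_status(web_content)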

Training from Both Sources

Document Training

  1. PDF ingested to Corpus (lossless)
  2. Generate Q&A from Corpus content
  3. Train via Nexus Loop
  4. Claude evaluates
  5. LARS learns document content
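
Steps 2-4 as a sketch; generate_qa, nexus_train, and claude_eval are hypothetical stand-ins for the real Corpus, Nexus Loop, and Claude interfaces:

def train_on_document(corpus_text, generate_qa, nexus_train, claude_eval):
    # Step 2: build Q&A pairs from lossless Corpus content
    qa_pairs = generate_qa(corpus_text)
    # Step 3: one pass through the Nexus Loop
    nexus_train(qa_pairs)
    # Step 4: Claude scores the result
    return claude_eval(qa_pairs)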

Web Training

  1. Site scraped to Web Storage
  2. LARS generates own dataset from content
  3. Train via Nexus Loop
  4. Claude evaluates via stable ID
  5. Loop until 98%+ accuracy
  6. LARS learns web content autonomously
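
The same shape with the loop made explicit; build_dataset, nexus_train, and claude_eval are again hypothetical stand-ins, and record is a Web Storage entry like web_content above:

def learn_web_content(record, build_dataset, nexus_train, claude_eval,
                      target=0.98):
    # Step 2: LARS generates its own dataset from the scraped content
    dataset = build_dataset(record['content'])
    accuracy = 0.0
    # Steps 3-5: repeat the Nexus Loop until Claude reports 98%+ accuracy
    while accuracy < target:
        nexus_train(dataset)
        accuracy = claude_eval(record['id'])  # evaluation keyed to the stable ID
    record['trained_on'] = True  # step 6: content is now internalized
    return accuracy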

The Power Move

LARS can be told: 'Learn this website.'

LARS:

  1. Scrapes the site
  2. Saves to Web Storage
  3. Creates training dataset
  4. Trains itself
  5. Gets evaluated by Claude
  6. Loops until verified
  7. Reports: 'I've learned corlera.com'

Autonomous web learning. Not just retrieval. Actual internalization.
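
End to end, 'Learn this website' reduces to a thin orchestrator over the pieces above; scrape and store_web_content are hypothetical, and learn_web_content is the sketch from the previous section:

def learn_website(url, scrape, store_web_content,
                  build_dataset, nexus_train, claude_eval):
    content = scrape(url)                     # 1. scrape the site
    record = store_web_content(url, content)  # 2. save to Web Storage
    accuracy = learn_web_content(             # 3-6. train until verified
        record, build_dataset, nexus_train, claude_eval)
    return f"I've learned {url} ({accuracy:.0%} accuracy)"  # 7. report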

ID: 3c1ea9fa
Path: Nexus AI Engine > Components > Staging Environment > Source-Specific Staging Rules
Updated: 2026-01-01T20:16:44