# Staging Rules by Source

Staging behavior changes based on the data source; the rules are not one-size-fits-all.
## Web Content Rules

### Characteristics
- High volume (250K+ characters common)
- Already pre-processed by markdown converter (not AI)
- Lossy-okay (AI extracts what it needs)
- Fast cleanup acceptable
### Stage 1 Behavior
- Contains markdown output from web scraper
- AI decides what to keep
- Wipes on timer (30 seconds after Stage 2 created)
- No verification required before wipe
### Pipeline

```
Web Scrape → Markdown Converter → Stage 1 → AI extracts → Stage 2
                                     ↓
                               [timer wipe]
```
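The timer wipe above can be sketched minimally. This is a hypothetical illustration, not the actual implementation: the `Stage1Buffer` class and method names are assumptions; the key behaviors from the rules are that the countdown starts only once Stage 2 exists and that no verification happens before the wipe.

```python
import threading

class Stage1Buffer:
    """Holds scraped markdown until Stage 2 exists, then wipes on a timer."""

    def __init__(self, content: str, wipe_delay: float = 30.0):
        self.content = content          # markdown output from the web scraper
        self.wipe_delay = wipe_delay    # 30 seconds per the staging rules
        self._timer = None

    def on_stage2_created(self):
        # Countdown starts only after Stage 2 is created.
        self._timer = threading.Timer(self.wipe_delay, self.wipe)
        self._timer.start()

    def wipe(self):
        # No verification step: web content is lossy-okay.
        self.content = None
```

Because the AI has already extracted what it needs into Stage 2, losing Stage 1 costs nothing.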
## Document Content Rules

### Characteristics
- Every character matters (contracts, legal)
- Lossless required
- Slow cleanup acceptable
- Must verify complete ingestion
### Stage 1 Behavior
- Contains full extracted content
- Does NOT wipe until Corpus confirms 100% ingestion
- Every ampersand, comma, colon preserved
- Verification step required before wipe
### Pipeline

```
PDF/Doc → NLM Ingestor → Stage 1 → Corpus Ingestion → Verify Complete
                            ↓
                     [verified wipe]
```
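One way to gate the verified wipe is a byte-for-byte hash comparison between Stage 1 and the Corpus copy, a sketch under the assumption that both sides can be read back as strings (the function name is hypothetical):

```python
import hashlib

def wipe_is_safe(stage1_content: str, corpus_content: str) -> bool:
    """Allow the Stage 1 wipe only if the Corpus copy is byte-for-byte identical.

    A hash mismatch catches any dropped character: every ampersand,
    comma, and colon must survive ingestion before the wipe proceeds.
    """
    stage1_hash = hashlib.sha256(stage1_content.encode("utf-8")).hexdigest()
    corpus_hash = hashlib.sha256(corpus_content.encode("utf-8")).hexdigest()
    return stage1_hash == corpus_hash
```

If this returns `False`, Stage 1 is retained and ingestion is retried; slow cleanup is acceptable for documents.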
## Storage Destinations
| Source | Permanent Store | Cleanup Rule |
|---|---|---|
| Documents | Corpus | Verified wipe |
| Web Content | Web Storage (new) | Timer wipe |
| API Data | Context/KB | Timer wipe |
| User Uploads | Corpus | Verified wipe |
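The table above maps directly to a routing rule. A minimal sketch, with hypothetical key names mirroring the four sources:

```python
# (permanent store, cleanup rule) per source, per the Storage Destinations table.
CLEANUP_RULES = {
    "documents":   ("Corpus", "verified"),
    "web_content": ("Web Storage", "timer"),
    "api_data":    ("Context/KB", "timer"),
    "user_uploads": ("Corpus", "verified"),
}

def cleanup_rule(source: str) -> str:
    """Return which wipe discipline applies to a given source."""
    _store, rule = CLEANUP_RULES[source]
    return rule
```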
## Web Storage Environment (Proposed)

Separate from Corpus because:

- URLs as primary identifiers (not file paths)
- Content changes over time (re-scrape needed)
- Different retention policies
- May need freshness tracking
### Structure

```python
web_content = {
    'id': 'web_abc123',
    'url': 'https://corlera.com/about',
    'scraped_at': timestamp,          # when the scrape ran
    'content': '...',
    'freshness': 'stale|fresh',       # one of 'stale' or 'fresh'
    'last_verified': timestamp,
    'trained_on': True,               # or False
}
```
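The `freshness` field implies a staleness policy. One possible rule, sketched here with an assumed seven-day maximum age (the threshold and function name are hypothetical, not part of the spec):

```python
from datetime import datetime, timedelta

def freshness(scraped_at: datetime, max_age: timedelta = timedelta(days=7)) -> str:
    """Classify a web_content record as 'fresh' or 'stale' by scrape age."""
    return 'fresh' if datetime.now() - scraped_at < max_age else 'stale'
```

A record flipping to `'stale'` is the signal to re-scrape, since web content changes over time.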
## Training from Both Sources

### Document Training
- PDF ingested to Corpus (lossless)
- Generate Q&A from Corpus content
- Train via Nexus Loop
- Claude evaluates
- LARS learns document content
### Web Training
- Site scraped to Web Storage
- LARS generates own dataset from content
- Train via Nexus Loop
- Claude evaluates via stable ID
- Loop until 98%+ accuracy
- LARS learns web content autonomously
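The train/evaluate/loop cycle above can be sketched as a simple control loop. This is an illustrative stand-in: `train_fn` and `evaluate_fn` are hypothetical hooks for the Nexus Loop trainer and the Claude evaluator (keyed by the stable record ID), with the 98% threshold from the rules:

```python
def train_until_verified(record_id, train_fn, evaluate_fn,
                         threshold=0.98, max_rounds=10):
    """Loop training rounds until the evaluator reports >= threshold accuracy."""
    for round_num in range(1, max_rounds + 1):
        train_fn(record_id)                 # one Nexus Loop training pass
        accuracy = evaluate_fn(record_id)   # evaluator scores via stable ID
        if accuracy >= threshold:
            return round_num, accuracy
    raise RuntimeError(
        f"{record_id}: did not reach {threshold:.0%} in {max_rounds} rounds")
```

The `max_rounds` guard is an added assumption so the loop cannot run forever on content the model fails to internalize.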
## The Power Move

LARS can be told: 'Learn this website.'

LARS:

1. Scrapes the site
2. Saves to Web Storage
3. Creates training dataset
4. Trains itself
5. Gets evaluated by Claude
6. Loops until verified
7. Reports: 'I've learned corlera.com'
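The seven steps chain into one orchestrator. A minimal sketch where every stage (`scrape`, `save`, `build_dataset`, `train`, `evaluate`) is a hypothetical stand-in for the real pipeline component:

```python
def learn_website(url, scrape, save, build_dataset, train, evaluate,
                  threshold=0.98):
    content = scrape(url)               # 1. scrape the site
    record_id = save(url, content)      # 2. save to Web Storage
    dataset = build_dataset(record_id)  # 3. create training dataset
    while True:
        train(dataset)                  # 4. train itself (Nexus Loop)
        if evaluate(record_id) >= threshold:  # 5. evaluated by Claude
            break                       # 6. loop until verified
    return f"I've learned {url}"        # 7. report
```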
Autonomous web learning. Not just retrieval. Actual internalization.