# Staging Rules by Source

Staging behavior changes based on the data source; the rules are not one-size-fits-all.
## Web Content Rules

### Characteristics
- High volume (250K+ characters common)
- Already pre-processed by markdown converter (not AI)
- Lossy-okay (AI extracts what it needs)
- Fast cleanup acceptable
### Stage 1 Behavior
- Contains markdown output from web scraper
- AI decides what to keep
- Wipes on timer (30 seconds after Stage 2 created)
- No verification required before wipe
### Pipeline

```
Web Scrape → Markdown Converter → Stage 1 → AI extracts → Stage 2
                                     ↓
                               [timer wipe]
```
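The timer wipe above can be sketched minimally. This is a hypothetical illustration, not the actual implementation: the `Stage1Buffer` class and method names are assumptions; the key behaviors from the rules are that the countdown starts only once Stage 2 exists and that no verification happens before the wipe.

```python
import threading

class Stage1Buffer:
    """Holds scraped markdown until Stage 2 exists, then wipes on a timer."""

    def __init__(self, content: str, wipe_delay: float = 30.0):
        self.content = content          # markdown output from the web scraper
        self.wipe_delay = wipe_delay    # 30 seconds per the staging rules
        self._timer = None

    def on_stage2_created(self):
        # Countdown starts only after Stage 2 is created.
        self._timer = threading.Timer(self.wipe_delay, self.wipe)
        self._timer.start()

    def wipe(self):
        # No verification step: web content is lossy-okay.
        self.content = None
```

Because the AI has already extracted what it needs into Stage 2, losing Stage 1 costs nothing.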
## Document Content Rules

### Characteristics
- Every character matters (contracts, legal)
- Lossless required
- Slow cleanup acceptable
- Must verify complete ingestion
### Stage 1 Behavior
- Contains full extracted content
- Does NOT wipe until Corpus confirms 100% ingestion
- Every ampersand, comma, colon preserved
- Verification step required before wipe
### Pipeline

```
PDF/Doc → NLM Ingestor → Stage 1 → Corpus Ingestion → Verify Complete
                            ↓
                     [verified wipe]
```
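One way to gate the verified wipe is a byte-for-byte hash comparison between Stage 1 and the Corpus copy, a sketch under the assumption that both sides can be read back as strings (the function name is hypothetical):

```python
import hashlib

def wipe_is_safe(stage1_content: str, corpus_content: str) -> bool:
    """Allow the Stage 1 wipe only if the Corpus copy is byte-for-byte identical.

    A hash mismatch catches any dropped character: every ampersand,
    comma, and colon must survive ingestion before the wipe proceeds.
    """
    stage1_hash = hashlib.sha256(stage1_content.encode("utf-8")).hexdigest()
    corpus_hash = hashlib.sha256(corpus_content.encode("utf-8")).hexdigest()
    return stage1_hash == corpus_hash
```

If this returns `False`, Stage 1 is retained and ingestion is retried; slow cleanup is acceptable for documents.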
## Storage Destinations
| Source | Permanent Store | Cleanup Rule |
|---|---|---|
| Documents | Corpus | Verified wipe |
| Web Content | Web Storage (new) | Timer wipe |
| API Data | Context/KB | Timer wipe |
| User Uploads | Corpus | Verified wipe |
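The table above maps directly to a routing rule. A minimal sketch, with hypothetical key names mirroring the four sources:

```python
# (permanent store, cleanup rule) per source, per the Storage Destinations table.
CLEANUP_RULES = {
    "documents":   ("Corpus", "verified"),
    "web_content": ("Web Storage", "timer"),
    "api_data":    ("Context/KB", "timer"),
    "user_uploads": ("Corpus", "verified"),
}

def cleanup_rule(source: str) -> str:
    """Return which wipe discipline applies to a given source."""
    _store, rule = CLEANUP_RULES[source]
    return rule
```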
## Web Storage Environment (Proposed)

Separate from Corpus because:

- URLs as primary identifiers (not file paths)
- Content changes over time (re-scrape needed)
- Different retention policies
- May need freshness tracking
### Structure

```python
web_content = {
    'id': 'web_abc123',
    'url': 'https://corlera.com/about',
    'scraped_at': timestamp,          # when the scrape ran
    'content': '...',
    'freshness': 'stale|fresh',       # one of 'stale' or 'fresh'
    'last_verified': timestamp,
    'trained_on': True,               # or False
}
```
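The `freshness` field implies a staleness policy. One possible rule, sketched here with an assumed seven-day maximum age (the threshold and function name are hypothetical, not part of the spec):

```python
from datetime import datetime, timedelta

def freshness(scraped_at: datetime, max_age: timedelta = timedelta(days=7)) -> str:
    """Classify a web_content record as 'fresh' or 'stale' by scrape age."""
    return 'fresh' if datetime.now() - scraped_at < max_age else 'stale'
```

A record flipping to `'stale'` is the signal to re-scrape, since web content changes over time.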
## Training from Both Sources

### Document Training
- PDF ingested to Corpus (lossless)
- Generate Q&A from Corpus content
- Train via Nexus Loop
- Claude evaluates
- LARS learns document content
### Web Training
- Site scraped to Web Storage
- LARS generates own dataset from content
- Train via Nexus Loop
- Claude evaluates via stable ID
- Loop until 98%+ accuracy
- LARS learns web content autonomously
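The train/evaluate/loop cycle above can be sketched as a simple control loop. This is an illustrative stand-in: `train_fn` and `evaluate_fn` are hypothetical hooks for the Nexus Loop trainer and the Claude evaluator (keyed by the stable record ID), with the 98% threshold from the rules:

```python
def train_until_verified(record_id, train_fn, evaluate_fn,
                         threshold=0.98, max_rounds=10):
    """Loop training rounds until the evaluator reports >= threshold accuracy."""
    for round_num in range(1, max_rounds + 1):
        train_fn(record_id)                 # one Nexus Loop training pass
        accuracy = evaluate_fn(record_id)   # evaluator scores via stable ID
        if accuracy >= threshold:
            return round_num, accuracy
    raise RuntimeError(
        f"{record_id}: did not reach {threshold:.0%} in {max_rounds} rounds")
```

The `max_rounds` guard is an added assumption so the loop cannot run forever on content the model fails to internalize.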
## The Power Move

LARS can be told: 'Learn this website.'

LARS:

1. Scrapes the site
2. Saves to Web Storage
3. Creates training dataset
4. Trains itself
5. Gets evaluated by Claude
6. Loops until verified
7. Reports: 'I've learned corlera.com'
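The seven steps chain into one orchestrator. A minimal sketch where every stage (`scrape`, `save`, `build_dataset`, `train`, `evaluate`) is a hypothetical stand-in for the real pipeline component:

```python
def learn_website(url, scrape, save, build_dataset, train, evaluate,
                  threshold=0.98):
    content = scrape(url)               # 1. scrape the site
    record_id = save(url, content)      # 2. save to Web Storage
    dataset = build_dataset(record_id)  # 3. create training dataset
    while True:
        train(dataset)                  # 4. train itself (Nexus Loop)
        if evaluate(record_id) >= threshold:  # 5. evaluated by Claude
            break                       # 6. loop until verified
    return f"I've learned {url}"        # 7. report
```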
Autonomous web learning. Not just retrieval. Actual internalization.