
Synthetic Data Generator Pipeline

Overview

A Ralph-inspired loop that generates training data variations using Claude, then feeds the expanded dataset to LARS training.

Pipeline Architecture

┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│  Seed Dataset   │────▶│  Claude Loop     │────▶│  Filtered Data  │
│  (3D examples)  │     │  (variations)    │     │  (quality pass) │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                                          │
                                                          ▼
                        ┌──────────────────┐     ┌─────────────────┐
                        │  LARS Training   │◀────│  Curriculum     │
                        │  (AI Server)     │     │  Ordering       │
                        └──────────────────┘     └─────────────────┘

Stage 1: Seed Dataset

  • Start with existing 3D training examples
  • lars_identity.json, lars_3d_combined.json, lars_3d_tasks.json
  • These are our 'ground truth' examples
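Loading and merging the seed files could look like the sketch below (the `data/` directory location is an assumption; each file is assumed to hold a JSON list of examples):

```python
import json
from pathlib import Path

SEED_FILES = [
    "lars_identity.json",
    "lars_3d_combined.json",
    "lars_3d_tasks.json",
]

def load_seed_dataset(data_dir="data"):
    """Merge all seed JSON files into one list of examples."""
    examples = []
    for name in SEED_FILES:
        path = Path(data_dir) / name
        with open(path) as f:
            examples.extend(json.load(f))  # each file holds a list of examples
    return examples
```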

Stage 2: Claude Variation Loop

Run on Nexus (no GPU needed; just API calls)

expanded_dataset = []

for example in seed_dataset:
    variations = []

    # Generate paraphrases
    variations.extend(claude.paraphrase(example, count=5))

    # Generate complexity variants
    variations.extend(claude.simplify(example))
    variations.extend(claude.elaborate(example))

    # Generate edge cases
    variations.extend(claude.edge_cases(example, count=3))

    # Add to output
    expanded_dataset.extend(variations)
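The `claude.paraphrase` call above is pseudocode. One way to sketch it against the official `anthropic` SDK (the prompt wording and model name here are illustrative assumptions, and `simplify`, `elaborate`, and `edge_cases` would follow the same pattern with different prompts):

```python
import json

def build_paraphrase_prompt(example, count=5):
    """Construct the instruction sent to Claude for one seed example."""
    return (
        f"Rewrite the following training example {count} ways, "
        "preserving its meaning exactly. Return a JSON list of strings.\n\n"
        + json.dumps(example)
    )

def paraphrase(client, example, count=5, model="claude-sonnet-4-20250514"):
    """Call the Messages API and parse the JSON list Claude returns."""
    response = client.messages.create(
        model=model,
        max_tokens=2048,
        messages=[
            {"role": "user", "content": build_paraphrase_prompt(example, count)}
        ],
    )
    return json.loads(response.content[0].text)
```

Here `client` would be an `anthropic.Anthropic()` instance; passing it in keeps the function easy to stub in tests.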

Stage 3: Quality Filtering

  • Semantic similarity check (stay close to original meaning)
  • Factual consistency (no hallucinated details, e.g. the spurious MBA claim)
  • Grammar/coherence scoring
  • Deduplication
  • Remove examples that contradict seed data
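A minimal sketch of the similarity-plus-dedup filters, using `difflib.SequenceMatcher` as a cheap stand-in for an embedding-based semantic-similarity model (the thresholds are illustrative, not tuned values):

```python
from difflib import SequenceMatcher

def filter_variations(seed_text, variations, min_sim=0.4, max_sim=0.95):
    """Keep variations that stay close to the seed without duplicating it."""
    kept, seen = [], set()
    for text in variations:
        key = " ".join(text.lower().split())      # normalize for dedup
        if key in seen:
            continue
        sim = SequenceMatcher(None, seed_text.lower(), text.lower()).ratio()
        if min_sim <= sim <= max_sim:             # close, but not a copy
            kept.append(text)
            seen.add(key)
    return kept
```

Factual-consistency and grammar checks would run as separate passes, since they need a model rather than string matching.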

Stage 4: Curriculum Ordering

  • Score examples by complexity (token count, concept density)
  • Sort simple → complex
  • Group by topic for staged training
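The scoring and sort could be sketched like this; "concept density" is approximated here by vocabulary ratio (unique tokens / total tokens), which is an assumption rather than a fixed part of the design:

```python
def complexity_score(example, density_weight=10.0):
    """Rough complexity: whitespace-token count plus a vocabulary-ratio bonus."""
    tokens = example["text"].split()
    if not tokens:
        return 0.0
    density = len(set(tokens)) / len(tokens)
    return len(tokens) + density_weight * density

def curriculum_order(examples):
    """Sort simple -> complex; the stable sort preserves topic grouping."""
    return sorted(examples, key=complexity_score)
```

Because Python's sort is stable, pre-grouping examples by topic before calling `curriculum_order` keeps same-score examples in topic order.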

Stage 5: LARS Training

  • Transfer expanded dataset to AI server (100.89.34.86)
  • Run training with curriculum order
  • Monitor for overfitting with validation set
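The transfer step might be wrapped as below; the remote user and path are placeholders, and only the host IP comes from this document:

```python
import subprocess

def build_transfer_cmd(local_path, host="100.89.34.86",
                       remote_path="/data/lars/expanded_dataset.jsonl",
                       user="lars"):
    """rsync command for shipping the expanded dataset to the AI server."""
    return ["rsync", "-avz", "--progress", local_path,
            f"{user}@{host}:{remote_path}"]

def transfer(local_path):
    """Run the transfer, raising if rsync exits non-zero."""
    subprocess.run(build_transfer_cmd(local_path), check=True)
```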

Resource Requirements

  • Nexus server: Claude API calls, Python script
  • AI server: GPU training (dual 3090s → triple 3090s)
  • Estimated time:
      • Data generation: 2-4 hours (API rate limited)
      • Training: depends on dataset size

Expected Outcome

  • 10-50x dataset size increase
  • Better generalization from diverse examples
  • Reduced overfitting to specific phrasings
  • Curriculum ordering may speed convergence
ID: 7e13a189
Path: Accelerated AI Training > Proposed Architecture > Synthetic Data Generator Pipeline
Updated: 2026-01-01T19:28:08