Overview
A Ralph-inspired loop that generates training data variations using Claude, then feeds the expanded dataset to LARS training.
Pipeline Architecture
```
┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│   Seed Dataset   │────▶│   Claude Loop    │────▶│  Filtered Data   │
│   (3D examples)  │     │   (variations)   │     │  (quality pass)  │
└──────────────────┘     └──────────────────┘     └──────────────────┘
                                                           │
                                                           ▼
                         ┌──────────────────┐     ┌──────────────────┐
                         │  LARS Training   │◀────│    Curriculum    │
                         │   (AI Server)    │     │     Ordering     │
                         └──────────────────┘     └──────────────────┘
```
Stage 1: Seed Dataset
- Start with existing 3D training examples
- lars_identity.json, lars_3d_combined.json, lars_3d_tasks.json
- These are our 'ground truth' examples
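Loading and concatenating the seed files can be sketched as follows (the data directory and the JSON-list file layout are assumptions; adjust to wherever the files actually live):

```python
import json
from pathlib import Path

# Seed files named in this doc; the directory they live in is an assumption.
SEED_FILES = ["lars_identity.json", "lars_3d_combined.json", "lars_3d_tasks.json"]

def load_seed_dataset(data_dir="."):
    """Concatenate all seed examples into one list."""
    examples = []
    for name in SEED_FILES:
        path = Path(data_dir) / name
        if path.exists():  # tolerate missing files during development
            examples.extend(json.loads(path.read_text()))
    return examples
```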
Stage 2: Claude Variation Loop
Runs on the Nexus server (no GPU needed; just API calls):
```
for example in seed_dataset:
    variations = []

    # Generate paraphrases
    variations.extend(claude.paraphrase(example, count=5))

    # Generate complexity variants
    variations.extend(claude.simplify(example))
    variations.extend(claude.elaborate(example))

    # Generate edge cases
    variations.extend(claude.edge_cases(example, count=3))

    # Add to output
    expanded_dataset.extend(variations)
```
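The `claude.*` helpers in the loop above are pseudocode. A minimal sketch of one, assuming the Anthropic Messages API and a JSON-list response format; the model name and prompt wording are illustrative, not prescribed by this doc:

```python
import json

def build_variation_prompt(example, kind="paraphrase", count=5):
    """Build the instruction sent to Claude for one seed example."""
    task = {
        "paraphrase": f"Rewrite this training example {count} ways, preserving meaning.",
        "simplify": "Rewrite this training example with simpler language.",
        "elaborate": "Expand this training example with more detail.",
        "edge_cases": f"Write {count} edge-case variants of this training example.",
    }[kind]
    return f"{task}\nReturn a JSON list of objects with the same keys.\n\n{json.dumps(example)}"

def generate_variations(client, example, kind, count=5, model="claude-sonnet-4-20250514"):
    """Call Claude once and parse the variations it returns.

    `client` is an anthropic.Anthropic() instance; the model name is an
    assumption and the response is expected to be a bare JSON list.
    """
    response = client.messages.create(
        model=model,
        max_tokens=2048,
        messages=[{"role": "user", "content": build_variation_prompt(example, kind, count)}],
    )
    return json.loads(response.content[0].text)
```

In practice the response would also need guarding against non-JSON output (retry or skip on `json.JSONDecodeError`).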
Stage 3: Quality Filtering
- Semantic similarity check (stay close to original meaning)
- Factual consistency (no hallucinations, e.g. the spurious MBA claims)
- Grammar/coherence scoring
- Deduplication
- Remove examples that contradict seed data
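The filtering pass could look like the sketch below. Character-level similarity via `difflib` stands in for a real embedding-based semantic check, and the thresholds and `text` key are illustrative assumptions:

```python
from difflib import SequenceMatcher

def quality_filter(seed, variations, min_sim=0.3, max_sim=0.95):
    """Keep variations that stay close to the seed's meaning without
    duplicating it verbatim. Thresholds are illustrative; a production
    pipeline would use embedding similarity, not character overlap."""
    kept, seen = [], set()
    for v in variations:
        text = v.get("text", "")
        if not text or text in seen:
            continue  # drop empties and exact duplicates
        sim = SequenceMatcher(None, seed.get("text", ""), text).ratio()
        if min_sim <= sim <= max_sim:  # close to the seed, but not a copy
            kept.append(v)
            seen.add(text)
    return kept
```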
Stage 4: Curriculum Ordering
- Score examples by complexity (token count, concept density)
- Sort simple → complex
- Group by topic for staged training
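A minimal scoring-and-sorting sketch of the steps above (the `text` and `topic` example keys are assumptions; a real pipeline might score with a tokenizer and concept tagging rather than whitespace splitting):

```python
def complexity_score(example):
    """Crude complexity proxy: whitespace token count plus the number of
    distinct words, so longer and more varied examples score higher."""
    tokens = example.get("text", "").split()
    return len(tokens) + len(set(t.lower() for t in tokens))

def curriculum_order(dataset):
    """Group by topic, then sort simple -> complex within each topic."""
    return sorted(dataset, key=lambda ex: (ex.get("topic", ""), complexity_score(ex)))
```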
Stage 5: LARS Training
- Transfer expanded dataset to AI server (100.89.34.86)
- Run training with curriculum order
- Monitor for overfitting with validation set
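The overfitting monitor can be a simple early-stopping rule over validation losses recorded at each evaluation step (a sketch; the patience value is illustrative):

```python
def should_stop(val_losses, patience=3):
    """Stop when validation loss has not improved for `patience`
    consecutive evaluations (a sign of overfitting to the expanded set)."""
    if len(val_losses) <= patience:
        return False  # not enough history yet
    best = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best
```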
Resource Requirements
- Nexus server: Claude API calls, Python script
- AI server: GPU training (dual 3090s → triple 3090s)
- Estimated time:
- Data generation: 2-4 hours (API rate limited)
- Training: Depends on dataset size
Expected Outcome
- 10-50x dataset size increase
- Better generalization from diverse examples
- Reduced overfitting to specific phrasings
- Curriculum ordering may speed convergence