
Synthetic Data Generator Pipeline

Overview

A Ralph-inspired loop that generates training data variations using Claude, then feeds the expanded dataset to LARS training.

Pipeline Architecture

┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│  Seed Dataset   │────▶│  Claude Loop     │────▶│  Filtered Data  │
│  (3D examples)  │     │  (variations)    │     │  (quality pass) │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                                          │
                                                          ▼
                        ┌──────────────────┐     ┌─────────────────┐
                        │  LARS Training   │◀────│  Curriculum     │
                        │  (AI Server)     │     │  Ordering       │
                        └──────────────────┘     └─────────────────┘

Stage 1: Seed Dataset

  • Start with existing 3D training examples
  • lars_identity.json, lars_3d_combined.json, lars_3d_tasks.json
  • These are our 'ground truth' examples
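Loading and merging the seed files could look like the sketch below (the `data/` directory location is an assumption; each file is assumed to hold a JSON list of examples):

```python
import json
from pathlib import Path

SEED_FILES = [
    "lars_identity.json",
    "lars_3d_combined.json",
    "lars_3d_tasks.json",
]

def load_seed_dataset(data_dir="data"):
    """Merge all seed JSON files into one list of examples."""
    examples = []
    for name in SEED_FILES:
        path = Path(data_dir) / name
        with open(path) as f:
            examples.extend(json.load(f))  # each file holds a list of examples
    return examples
```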

Stage 2: Claude Variation Loop

Run on Nexus (no GPU needed; just API calls)

expanded_dataset = []

for example in seed_dataset:
    variations = []

    # Generate paraphrases
    variations.extend(claude.paraphrase(example, count=5))

    # Generate complexity variants
    variations.extend(claude.simplify(example))
    variations.extend(claude.elaborate(example))

    # Generate edge cases
    variations.extend(claude.edge_cases(example, count=3))

    # Add to output
    expanded_dataset.extend(variations)
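The `claude.paraphrase` call above is pseudocode. One way to sketch it against the official `anthropic` SDK (the prompt wording and model name here are illustrative assumptions, and `simplify`, `elaborate`, and `edge_cases` would follow the same pattern with different prompts):

```python
import json

def build_paraphrase_prompt(example, count=5):
    """Construct the instruction sent to Claude for one seed example."""
    return (
        f"Rewrite the following training example {count} ways, "
        "preserving its meaning exactly. Return a JSON list of strings.\n\n"
        + json.dumps(example)
    )

def paraphrase(client, example, count=5, model="claude-sonnet-4-20250514"):
    """Call the Messages API and parse the JSON list Claude returns."""
    response = client.messages.create(
        model=model,
        max_tokens=2048,
        messages=[
            {"role": "user", "content": build_paraphrase_prompt(example, count)}
        ],
    )
    return json.loads(response.content[0].text)
```

Here `client` would be an `anthropic.Anthropic()` instance; passing it in keeps the function easy to stub in tests.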

Stage 3: Quality Filtering

  • Semantic similarity check (stay close to original meaning)
  • Factual consistency (no hallucinated details, e.g. the spurious MBA claim)
  • Grammar/coherence scoring
  • Deduplication
  • Remove examples that contradict seed data
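A minimal sketch of the similarity-plus-dedup filters, using `difflib.SequenceMatcher` as a cheap stand-in for an embedding-based semantic-similarity model (the thresholds are illustrative, not tuned values):

```python
from difflib import SequenceMatcher

def filter_variations(seed_text, variations, min_sim=0.4, max_sim=0.95):
    """Keep variations that stay close to the seed without duplicating it."""
    kept, seen = [], set()
    for text in variations:
        key = " ".join(text.lower().split())      # normalize for dedup
        if key in seen:
            continue
        sim = SequenceMatcher(None, seed_text.lower(), text.lower()).ratio()
        if min_sim <= sim <= max_sim:             # close, but not a copy
            kept.append(text)
            seen.add(key)
    return kept
```

Factual-consistency and grammar checks would run as separate passes, since they need a model rather than string matching.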

Stage 4: Curriculum Ordering

  • Score examples by complexity (token count, concept density)
  • Sort simple → complex
  • Group by topic for staged training
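The scoring and sort could be sketched like this; "concept density" is approximated here by vocabulary ratio (unique tokens / total tokens), which is an assumption rather than a fixed part of the design:

```python
def complexity_score(example, density_weight=10.0):
    """Rough complexity: whitespace-token count plus a vocabulary-ratio bonus."""
    tokens = example["text"].split()
    if not tokens:
        return 0.0
    density = len(set(tokens)) / len(tokens)
    return len(tokens) + density_weight * density

def curriculum_order(examples):
    """Sort simple -> complex; the stable sort preserves topic grouping."""
    return sorted(examples, key=complexity_score)
```

Because Python's sort is stable, pre-grouping examples by topic before calling `curriculum_order` keeps same-score examples in topic order.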

Stage 5: LARS Training

  • Transfer expanded dataset to AI server (100.89.34.86)
  • Run training with curriculum order
  • Monitor for overfitting with validation set
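The transfer step might be wrapped as below; the remote user and path are placeholders, and only the host IP comes from this document:

```python
import subprocess

def build_transfer_cmd(local_path, host="100.89.34.86",
                       remote_path="/data/lars/expanded_dataset.jsonl",
                       user="lars"):
    """rsync command for shipping the expanded dataset to the AI server."""
    return ["rsync", "-avz", "--progress", local_path,
            f"{user}@{host}:{remote_path}"]

def transfer(local_path):
    """Run the transfer, raising if rsync exits non-zero."""
    subprocess.run(build_transfer_cmd(local_path), check=True)
```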

Resource Requirements

  • Nexus server: Claude API calls, Python script
  • AI server: GPU training (dual 3090s → triple 3090s)
  • Estimated time:
      • Data generation: 2-4 hours (API rate limited)
      • Training: depends on dataset size

Expected Outcome

  • 10-50x dataset size increase
  • Better generalization from diverse examples
  • Reduced overfitting to specific phrasings
  • Curriculum ordering may speed convergence
ID: 7e13a189
Path: Accelerated AI Training > Proposed Architecture > Synthetic Data Generator Pipeline
Updated: 2026-01-01T19:28:08