
Claude as Evaluator Pattern

Core Concept

Claude serves as the evaluation layer for both the coding loop and the training loop. On each iteration, Claude is called via API, receives Nexus context (workflow.init), evaluates the LARS output, scores it, and decides: pass, or loop again.

Two Parallel Systems, Same Pattern

System 1: LARS Coding Loop (Ralph-Style)

Task → LARS Codes → Claude Evaluates → Pass? → Done
                                    → Fail? → LARS learns + fixes → Loop
  • Model: Qwen3-Coder (fits in 72GB VRAM)
  • Task: Code generation, bug fixes, refactoring
  • Evaluator: Claude via API
  • Learning: LARS remembers WHY it was wrong (key differentiator)

System 2: LARS Training Loop

Dataset → Train → Test → Claude Scores → 98%+? → Done
                                       → <98%? → Generate corrections → Retrain → Loop
  • Model: Base LARS (Qwen 2.5 abliterated + LoRA)
  • Task: Knowledge acquisition, identity, reasoning
  • Evaluator: Claude via API
  • Learning: Corrective examples fix gaps
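
Both systems can be described by the same small configuration. The sketch below is illustrative only; the LoopConfig dataclass, its field names, and the shared 98-point threshold are assumptions rather than anything already defined in the project.

from dataclasses import dataclass

@dataclass
class LoopConfig:
    """Hypothetical summary of one evaluate-and-retry loop."""
    name: str
    model: str           # model being driven through the loop
    task: str            # what that model is asked to do
    evaluator: str       # who scores the output
    pass_threshold: int  # score (0-100) that ends the loop

CODING_LOOP = LoopConfig(
    name="coding",
    model="Qwen3-Coder",
    task="code generation, bug fixes, refactoring",
    evaluator="Claude via API",
    pass_threshold=98,
)

TRAINING_LOOP = LoopConfig(
    name="training",
    model="Base LARS (Qwen 2.5 abliterated + LoRA)",
    task="knowledge acquisition, identity, reasoning",
    evaluator="Claude via API",
    pass_threshold=98,
)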

Claude API Evaluation Flow

# Evaluation step: runnable sketch assuming the Anthropic Python SDK;
# workflow.init() is the project's own Nexus context loader (not defined here).
import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def evaluate_lars_output(output, task_type, context):
    # 1. Get Nexus context
    nexus_context = workflow.init()

    # 2. Build evaluation prompt; ask for JSON so the reply is machine-parseable
    prompt = f"""
    Context: {nexus_context}
    Task Type: {task_type}
    LARS Output: {output}

    Evaluate this output. Reply with JSON only, in the form:
    {{"correct": true or false, "score": <0-100>, "corrections": ["specific fixes needed"]}}
    """

    # 3. Call Claude via the Messages API
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder; use whichever Claude model the project targets
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes Claude returns bare JSON as instructed
    evaluation = json.loads(response.content[0].text)

    # 4. Return the pass/fail decision against the 98% bar
    return {
        'passed': evaluation['score'] >= 98,
        'score': evaluation['score'],
        'corrections': evaluation['corrections'],
        'loop_again': evaluation['score'] < 98,
    }
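
As a usage example, the loop orchestrator can wrap this evaluator directly. The following is a minimal sketch: run_loop, lars_generate, and the max_iterations cap are hypothetical names, not existing project code.

# Minimal loop driver (sketch; lars_generate is a hypothetical call into the local LARS model)
def run_loop(task, task_type, max_iterations=5):
    corrections = []
    for _ in range(max_iterations):
        # LARS attempts the task, taking any prior corrections into account
        output = lars_generate(task, corrections)

        # Claude scores the attempt against the 98% bar
        result = evaluate_lars_output(output, task_type, context=task)
        if result['passed']:
            return output

        # Failed: feed the corrections back in and try again
        corrections = result['corrections']
    raise RuntimeError(f"no passing output after {max_iterations} iterations")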

The Learning Differentiator

Key insight from Chris: LARS shouldn't just fix mistakes; it should LEARN from them.

For Coding Loop

  • When code fails, capture:
      • Original attempt
      • Why it was wrong
      • Correct solution
  • Store each captured failure as a training example for future fine-tuning (see the sketch below)
  • LARS gets better over time, not just per-task
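
A minimal sketch of capturing that triple, assuming a JSONL store; CorrectionRecord and corrections.jsonl are hypothetical names, not existing project code.

# Sketch: persist each failed attempt as a future fine-tuning example
import json
from dataclasses import dataclass, asdict

@dataclass
class CorrectionRecord:
    task: str
    original_attempt: str   # what LARS produced
    why_wrong: str          # Claude's explanation of the failure
    correct_solution: str   # the fixed output

def store_correction(record: CorrectionRecord, path="corrections.jsonl"):
    # Append one JSON object per line so the file can feed a later fine-tuning run
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")

These records are what let LARS improve across tasks rather than only within the current one.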

For Training Loop

  • Failures become explicit correction examples
  • Corrections are added to the training set for the next loop (see the sketch below)
  • The model internalizes the corrections on the next training pass
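
A sketch of folding stored corrections into the next training round, reusing the JSONL files from the coding-loop sketch above; build_training_set and the file names are assumptions.

# Sketch: merge the base dataset with accumulated corrections before retraining
import json

def build_training_set(base_path="train.jsonl", corrections_path="corrections.jsonl"):
    examples = []
    for path in (base_path, corrections_path):
        try:
            with open(path) as f:
                examples.extend(json.loads(line) for line in f if line.strip())
        except FileNotFoundError:
            pass  # the first loop may have no corrections yet
    return examples

Each retraining round then trains on build_training_set(), so the previous loop's corrections are part of the next dataset.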

Implementation Requirements

  1. Claude API Integration: Direct API calls or MCP tool
  2. Nexus Context Injection: workflow.init data available to evaluator
  3. Scoring Rubric: Define what 98%+ means for each task type (see the rubric sketch below)
  4. Correction Generator: Turn failures into training data
  5. Loop Orchestrator: Manage the cycle, track iterations
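
One way to make requirement 3 concrete is a rubric table keyed by task type. The criteria names and weights below are illustrative assumptions, not an agreed rubric.

# Sketch of a per-task-type rubric the evaluation prompt could be built from
SCORING_RUBRICS = {
    "coding": {
        "pass_threshold": 98,
        "criteria": {
            "correctness": 0.6,   # does the code do what the task asked?
            "tests_pass": 0.3,    # does the associated test suite pass?
            "style": 0.1,         # idiomatic, readable code
        },
    },
    "knowledge": {
        "pass_threshold": 98,
        "criteria": {
            "factual_accuracy": 0.7,
            "identity_consistency": 0.2,  # stays in the LARS persona
            "reasoning_quality": 0.1,
        },
    },
}

The evaluation prompt can then quote the relevant rubric so Claude scores against the same definition of 98%+ on every iteration.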

Next Steps

  1. Small-scale test: Train on tiny dataset, evaluate, measure loop benefit
  2. Build Claude evaluator endpoint
  3. Create test suite for both coding and knowledge tasks
  4. Prototype loop orchestrator