
Claude as Evaluator Pattern

Core Concept

Claude serves as the evaluation layer for both the coding loop and the training loop. On each iteration, Claude is called via API, receives Nexus context (workflow.init), evaluates the LARS output, scores it, and decides: pass, or loop again.

Two Parallel Systems, Same Pattern

System 1: LARS Coding Loop (Ralph-Style)

Task → LARS Codes → Claude Evaluates → Pass? → Done
                                    → Fail? → LARS learns + fixes → Loop
  • Model: Qwen3-Coder (fits in 72GB VRAM)
  • Task: Code generation, bug fixes, refactoring
  • Evaluator: Claude via API
  • Learning: LARS remembers WHY it was wrong (key differentiator)

System 2: LARS Training Loop

Dataset → Train → Test → Claude Scores → 98%+? → Done
                                       → <98%? → Generate corrections → Retrain → Loop
  • Model: Base LARS (Qwen 2.5 abliterated + LoRA)
  • Task: Knowledge acquisition, identity, reasoning
  • Evaluator: Claude via API
  • Learning: Corrective examples fix gaps
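
Both systems can be described by the same small configuration. The sketch below is illustrative only; the LoopConfig dataclass, its field names, and the shared 98-point threshold are assumptions rather than anything already defined in the project.

from dataclasses import dataclass

@dataclass
class LoopConfig:
    """Hypothetical summary of one evaluate-and-retry loop."""
    name: str
    model: str           # model being driven through the loop
    task: str            # what that model is asked to do
    evaluator: str       # who scores the output
    pass_threshold: int  # score (0-100) that ends the loop

CODING_LOOP = LoopConfig(
    name="coding",
    model="Qwen3-Coder",
    task="code generation, bug fixes, refactoring",
    evaluator="Claude via API",
    pass_threshold=98,
)

TRAINING_LOOP = LoopConfig(
    name="training",
    model="Base LARS (Qwen 2.5 abliterated + LoRA)",
    task="knowledge acquisition, identity, reasoning",
    evaluator="Claude via API",
    pass_threshold=98,
)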

Claude API Evaluation Flow

# Evaluation step: runnable sketch assuming the Anthropic Python SDK;
# workflow.init() is the project's own Nexus context loader (not defined here).
import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def evaluate_lars_output(output, task_type, context):
    # 1. Get Nexus context
    nexus_context = workflow.init()

    # 2. Build evaluation prompt; ask for JSON so the reply is machine-parseable
    prompt = f"""
    Context: {nexus_context}
    Task Type: {task_type}
    LARS Output: {output}

    Evaluate this output. Reply with JSON only, in the form:
    {{"correct": true or false, "score": <0-100>, "corrections": ["specific fixes needed"]}}
    """

    # 3. Call Claude via the Messages API
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder; use whichever Claude model the project targets
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes Claude returns bare JSON as instructed
    evaluation = json.loads(response.content[0].text)

    # 4. Return the pass/fail decision against the 98% bar
    return {
        'passed': evaluation['score'] >= 98,
        'score': evaluation['score'],
        'corrections': evaluation['corrections'],
        'loop_again': evaluation['score'] < 98,
    }
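
As a usage example, the loop orchestrator can wrap this evaluator directly. The following is a minimal sketch: run_loop, lars_generate, and the max_iterations cap are hypothetical names, not existing project code.

# Minimal loop driver (sketch; lars_generate is a hypothetical call into the local LARS model)
def run_loop(task, task_type, max_iterations=5):
    corrections = []
    for _ in range(max_iterations):
        # LARS attempts the task, taking any prior corrections into account
        output = lars_generate(task, corrections)

        # Claude scores the attempt against the 98% bar
        result = evaluate_lars_output(output, task_type, context=task)
        if result['passed']:
            return output

        # Failed: feed the corrections back in and try again
        corrections = result['corrections']
    raise RuntimeError(f"no passing output after {max_iterations} iterations")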

The Learning Differentiator

Key insight from Chris: LARS shouldn't just fix mistakes; it should LEARN from them.

For Coding Loop

  • When code fails, capture:
      • Original attempt
      • Why it was wrong
      • Correct solution
  • Store each captured failure as a training example for future fine-tuning (see the sketch below)
  • LARS gets better over time, not just per-task
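
A minimal sketch of capturing that triple, assuming a JSONL store; CorrectionRecord and corrections.jsonl are hypothetical names, not existing project code.

# Sketch: persist each failed attempt as a future fine-tuning example
import json
from dataclasses import dataclass, asdict

@dataclass
class CorrectionRecord:
    task: str
    original_attempt: str   # what LARS produced
    why_wrong: str          # Claude's explanation of the failure
    correct_solution: str   # the fixed output

def store_correction(record: CorrectionRecord, path="corrections.jsonl"):
    # Append one JSON object per line so the file can feed a later fine-tuning run
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")

These records are what let LARS improve across tasks rather than only within the current one.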

For Training Loop

  • Failures become explicit correction examples
  • Corrections are added to the training set for the next loop (see the sketch below)
  • The model internalizes the corrections on the next training pass
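
A sketch of folding stored corrections into the next training round, reusing the JSONL files from the coding-loop sketch above; build_training_set and the file names are assumptions.

# Sketch: merge the base dataset with accumulated corrections before retraining
import json

def build_training_set(base_path="train.jsonl", corrections_path="corrections.jsonl"):
    examples = []
    for path in (base_path, corrections_path):
        try:
            with open(path) as f:
                examples.extend(json.loads(line) for line in f if line.strip())
        except FileNotFoundError:
            pass  # the first loop may have no corrections yet
    return examples

Each retraining round then trains on build_training_set(), so the previous loop's corrections are part of the next dataset.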

Implementation Requirements

  1. Claude API Integration: Direct API calls or MCP tool
  2. Nexus Context Injection: workflow.init data available to evaluator
  3. Scoring Rubric: Define what 98%+ means for each task type (see the rubric sketch below)
  4. Correction Generator: Turn failures into training data
  5. Loop Orchestrator: Manage the cycle, track iterations
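
One way to make requirement 3 concrete is a rubric table keyed by task type. The criteria names and weights below are illustrative assumptions, not an agreed rubric.

# Sketch of a per-task-type rubric the evaluation prompt could be built from
SCORING_RUBRICS = {
    "coding": {
        "pass_threshold": 98,
        "criteria": {
            "correctness": 0.6,   # does the code do what the task asked?
            "tests_pass": 0.3,    # does the associated test suite pass?
            "style": 0.1,         # idiomatic, readable code
        },
    },
    "knowledge": {
        "pass_threshold": 98,
        "criteria": {
            "factual_accuracy": 0.7,
            "identity_consistency": 0.2,  # stays in the LARS persona
            "reasoning_quality": 0.1,
        },
    },
}

The evaluation prompt can then quote the relevant rubric so Claude scores against the same definition of 98%+ on every iteration.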

Next Steps

  1. Small-scale test: Train on tiny dataset, evaluate, measure loop benefit
  2. Build Claude evaluator endpoint
  3. Create test suite for both coding and knowledge tasks
  4. Prototype loop orchestrator