Core Concept
Claude serves as the evaluation layer for both the coding loop and the training loop. Via the API, Claude receives Nexus context (workflow.init), evaluates LARS output, scores it, and decides: pass, or loop again.
Two Parallel Systems, Same Pattern
System 1: LARS Coding Loop (Ralph-Style)
Task → LARS Codes → Claude Evaluates → Pass? → Done
→ Fail? → LARS learns + fixes → Loop
- Model: Qwen3-Coder (fits in 72GB VRAM)
- Task: Code generation, bug fixes, refactoring
- Evaluator: Claude via API
- Learning: LARS remembers WHY it was wrong (key differentiator); see the loop sketch below
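A minimal sketch of how this loop could be driven, assuming a hypothetical `lars` client with `generate_code` and `record_failure` methods; `evaluate_lars_output` is the evaluator defined under "Claude API Evaluation Flow" below:

```python
# Sketch of the coding-loop driver (the `lars` client is a hypothetical stand-in
# for the local Qwen3-Coder runtime, not an actual harness).
def coding_loop(task, lars, max_iterations=5):
    corrections = None
    for attempt in range(1, max_iterations + 1):
        # LARS codes, informed by the previous round's corrections (if any)
        code = lars.generate_code(task, corrections=corrections)

        # Claude evaluates (see evaluate_lars_output below)
        result = evaluate_lars_output(code, task_type="coding")
        if result["passed"]:
            return code  # Pass → Done

        # Fail → capture WHY it was wrong so LARS can learn, then loop
        corrections = result["corrections"]
        lars.record_failure(task=task, attempt=code, corrections=corrections)
    raise RuntimeError(f"No passing solution after {max_iterations} attempts")
```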
System 2: LARS Training Loop
Dataset → Train → Test → Claude Scores → 98%+? → Done
→ <98%? → Generate corrections → Retrain → Loop
- Model: Base LARS (Qwen 2.5 abliterated + LoRA)
- Task: Knowledge acquisition, identity, reasoning
- Evaluator: Claude via API
- Learning: Corrective examples fix the gaps; see the loop sketch below
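The training loop follows the same shape. A sketch, assuming hypothetical `train_lora`, `run_eval_set`, and `generate_corrections` helpers for LoRA fine-tuning, running the eval set through LARS, and turning Claude's feedback into new examples:

```python
# Sketch of the training loop (all helper functions are hypothetical placeholders).
def training_loop(dataset, eval_set, target_score=98, max_rounds=10):
    for round_num in range(1, max_rounds + 1):
        adapter = train_lora(dataset)              # Train
        outputs = run_eval_set(adapter, eval_set)  # Test: LARS answers the eval set

        # Claude scores each output; average the scores across the eval set
        results = [evaluate_lars_output(o, task_type="knowledge") for o in outputs]
        avg_score = sum(r["score"] for r in results) / len(results)
        if avg_score >= target_score:
            return adapter                         # 98%+ → Done

        # <98% → turn failures into corrective examples and retrain
        failures = [(o, r) for o, r in zip(outputs, results) if not r["passed"]]
        dataset += generate_corrections(failures)
    raise RuntimeError("Target score not reached within max_rounds")
```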
Claude API Evaluation Flow
```python
# Pseudocode for the evaluation step. `workflow.init` and `claude_api` are
# placeholders for the Nexus context provider and the Claude API client; a more
# concrete sketch follows below.

PASS_THRESHOLD = 98  # minimum score needed to exit the loop

def evaluate_lars_output(output, task_type, context=None):
    # 1. Get Nexus context (reuse the caller's context if it was already fetched)
    nexus_context = context or workflow.init()

    # 2. Build the evaluation prompt
    prompt = f"""
    Context: {nexus_context}
    Task Type: {task_type}
    LARS Output: {output}

    Evaluate this output:
    - Is it correct? (yes/no)
    - If no, what's wrong?
    - Score: 0-100
    - Specific corrections needed
    """

    # 3. Call Claude via the API
    evaluation = claude_api.complete(prompt)

    # 4. Parse and return the loop decision
    return {
        'passed': evaluation.score >= PASS_THRESHOLD,
        'score': evaluation.score,
        'corrections': evaluation.corrections,
        'loop_again': evaluation.score < PASS_THRESHOLD,
    }
```
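A more concrete sketch of the same step against the Anthropic Python SDK. The model id and the JSON response contract are assumptions, not decisions from these notes:

```python
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def evaluate_with_claude(output, task_type, nexus_context, threshold=98):
    # Ask for strict JSON so the loop can parse the verdict reliably.
    prompt = (
        f"Context: {nexus_context}\n"
        f"Task Type: {task_type}\n"
        f"LARS Output:\n{output}\n\n"
        "Evaluate this output. Respond with JSON only:\n"
        '{"correct": true|false, "score": 0-100, "reason": "...", "corrections": ["..."]}'
    )
    message = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model id; use whichever is current
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    evaluation = json.loads(message.content[0].text)
    return {
        "passed": evaluation["score"] >= threshold,
        "score": evaluation["score"],
        "corrections": evaluation.get("corrections", []),
        "loop_again": evaluation["score"] < threshold,
    }
```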
The Learning Differentiator
Key insight from Chris: LARS shouldn't just fix mistakes; it should LEARN from them.
For Coding Loop
- When code fails, capture:
  - Original attempt
  - Why it was wrong
  - Correct solution
- Store each failure as a training example for future fine-tuning (see the sketch below)
- LARS gets better over time, not just per-task
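One way to capture that, sketched as a small record appended to a JSONL file (field names and path are illustrative assumptions):

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class FailureRecord:
    task: str
    original_attempt: str   # what LARS produced
    why_wrong: str          # Claude's explanation
    correct_solution: str   # the corrected version
    timestamp: str = ""

def record_failure(record: FailureRecord, path: str = "lars_failures.jsonl"):
    # Append as one JSON line so the file doubles as raw material for fine-tuning.
    record.timestamp = record.timestamp or datetime.now(timezone.utc).isoformat()
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")
```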
For Training Loop
- Failures become explicit correction examples
- Added to training set for next loop
- Model internalizes the corrections (see the sketch below)
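A sketch of how one captured failure could become an explicit correction example in chat format (the message schema is an assumption; match whatever format the LoRA trainer expects):

```python
def correction_to_training_example(failure: dict) -> dict:
    # `failure` holds the same fields as FailureRecord above. The example shows
    # the model its original task, its wrong answer, why it was wrong, and the
    # corrected answer it should have produced.
    return {
        "messages": [
            {"role": "user", "content": failure["task"]},
            {"role": "assistant", "content": failure["original_attempt"]},
            {"role": "user", "content": f"That was wrong: {failure['why_wrong']}. Correct it."},
            {"role": "assistant", "content": failure["correct_solution"]},
        ]
    }
```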
Implementation Requirements
- Claude API Integration: Direct API calls or MCP tool
- Nexus Context Injection: workflow.init data available to evaluator
- Scoring Rubric: Define what 98%+ means for each task type (see the sketch after this list)
- Correction Generator: Turn failures into training data
- Loop Orchestrator: Manage the cycle, track iterations
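For the scoring rubric, a sketch of what "what 98%+ means" could look like per task type, as weighted criteria fed into the evaluation prompt (the criteria and weights are illustrative assumptions):

```python
# Illustrative rubric: per-task-type criteria with weights summing to 1.0.
# Including the relevant rubric in the evaluation prompt keeps Claude's 0-100
# score grounded in the same definition on every iteration.
SCORING_RUBRICS = {
    "coding": {
        "tests_pass": 0.5,
        "matches_task_requirements": 0.3,
        "code_quality_and_style": 0.2,
    },
    "knowledge": {
        "factual_accuracy": 0.6,
        "identity_consistency": 0.2,
        "reasoning_quality": 0.2,
    },
}

def weighted_score(criterion_scores: dict, task_type: str) -> float:
    # criterion_scores: {criterion_name: score 0-100} as returned by the evaluator
    rubric = SCORING_RUBRICS[task_type]
    return sum(weight * criterion_scores[name] for name, weight in rubric.items())
```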
Next Steps
- Small-scale test: Train on tiny dataset, evaluate, measure loop benefit
- Build Claude evaluator endpoint
- Create test suite for both coding and knowledge tasks
- Prototype loop orchestrator