AM-DeepSeek-R1-Distilled Dataset Analysis

AM-DeepSeek-R1-Distilled-1.4M Analysis

Date: 2025-12-29

Dataset Structure

{
  "messages": [
    {
      "role": "user",
      "content": "<prompt>",
      "info": {"source": "...", "reference_answer": "..."}
    },
    {
      "role": "assistant",
      "content": "<think>reasoning</think><answer>solution</answer>",
      "info": {"think_content": "...", "answer_content": "..."}
    }
  ]
}
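
A minimal sketch of reading one record in this layout and splitting the assistant content into its reasoning and answer parts. The helper names (`split_assistant_content`, `extract_pairs`) and the sample record are hypothetical and only illustrate the structure above; they are not part of the dataset's tooling.

```python
import json
import re

# Matches the "<think>...</think><answer>...</answer>" layout shown above.
# DOTALL lets the reasoning and the answer span multiple lines.
THINK_ANSWER_RE = re.compile(
    r"<think>(?P<think>.*?)</think>\s*<answer>(?P<answer>.*?)</answer>",
    re.DOTALL,
)


def split_assistant_content(content):
    """Return (thinking, answer), or None if the tags are missing."""
    match = THINK_ANSWER_RE.search(content)
    if match is None:
        return None
    return match.group("think").strip(), match.group("answer").strip()


def extract_pairs(record):
    """Yield (thinking, answer) for each assistant message in one record."""
    for message in record.get("messages", []):
        if message.get("role") != "assistant":
            continue
        parts = split_assistant_content(message.get("content", ""))
        if parts is not None:
            yield parts


# Made-up sample record, for illustration only.
sample = json.loads("""
{
  "messages": [
    {"role": "user", "content": "What is 2 + 2?",
     "info": {"source": "example", "reference_answer": "4"}},
    {"role": "assistant",
     "content": "<think>Okay, let me see. 2 + 2 is 4.</think><answer>4</answer>",
     "info": {"think_content": "Okay, let me see. 2 + 2 is 4.", "answer_content": "4"}}
  ]
}
""")
pairs = list(extract_pairs(sample))
```

The same two strings also appear under `info.think_content` and `info.answer_content`, so parsing the tags is mainly useful as a consistency check or when the `info` block is absent.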

Thinking Patterns Observed

  • Natural language: 'Okay, let me see', 'Hmm', 'I remember that...'
  • Self-questioning: 'So maybe I can...', 'Then I can...'
  • Step enumeration in answers
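
The patterns above could be tallied with a simple phrase counter. This is a rough sketch: the marker lists reuse only the example phrases from this note, and the group names and the `count_markers` helper are assumptions, not a fixed taxonomy.

```python
# Illustrative marker phrases taken from the examples listed above.
# The grouping and the phrase lists are assumptions for this sketch.
MARKERS = {
    "natural_language": ["okay, let me see", "hmm", "i remember that"],
    "self_questioning": ["so maybe i can", "then i can"],
}


def count_markers(thinking):
    """Count case-insensitive occurrences of each marker group."""
    lowered = thinking.lower()
    return {
        group: sum(lowered.count(phrase) for phrase in phrases)
        for group, phrases in MARKERS.items()
    }


counts = count_markers(
    "Okay, let me see. Hmm, so maybe I can factor this. Then I can solve."
)
```

Substring counting is deliberately crude (it would also match "hmm" inside a longer word), so it suits a quick survey more than a precise measurement.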

Metrics

  • Average thinking length: 3000-3500 chars
  • Average answer length: 300-1300 chars
  • Ratio: ~3:1 thinking to answer, by character count
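
A sketch of how metrics like these could be computed from (thinking, answer) string pairs. It takes the ratio of the two averages, which is an assumption about how the figure above was derived; the `length_metrics` helper is hypothetical and the input below is synthetic.

```python
from statistics import mean


def length_metrics(pairs):
    """Compute average character lengths and their ratio.

    `pairs` is an iterable of (thinking, answer) strings.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (thinking, answer) pair")
    avg_think = mean(len(thinking) for thinking, _ in pairs)
    avg_answer = mean(len(answer) for _, answer in pairs)
    # Ratio of averages; guard against empty answers.
    ratio = avg_think / avg_answer if avg_answer else float("inf")
    return {"avg_think": avg_think, "avg_answer": avg_answer, "ratio": ratio}


# Synthetic input, chosen only to make the arithmetic easy to follow.
metrics = length_metrics([("a" * 300, "b" * 100), ("c" * 600, "d" * 200)])
```

Averaging per-record ratios instead would weight short answers more heavily and can give a noticeably different number, so it is worth stating which definition is used.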

Applied To

  • Created DS-002 (lars_3d_identity.json) based on this format
  • Successfully trained LARS with 3D reasoning (EXP-003)
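
A record in this format could be assembled along the following lines. This is a hypothetical sketch: the actual contents of lars_3d_identity.json are not shown in this note, and `build_record` and the sample strings are invented for illustration.

```python
import json


def build_record(prompt, thinking, answer, source="", reference_answer=""):
    """Assemble one record in the message layout described in this note."""
    return {
        "messages": [
            {
                "role": "user",
                "content": prompt,
                "info": {"source": source, "reference_answer": reference_answer},
            },
            {
                "role": "assistant",
                "content": f"<think>{thinking}</think><answer>{answer}</answer>",
                "info": {"think_content": thinking, "answer_content": answer},
            },
        ]
    }


# Placeholder strings, not taken from DS-002.
record = build_record(
    "Who are you?",
    "Hmm, the user asks who I am.",
    "I am an assistant.",
)
serialized = json.dumps(record, ensure_ascii=False)
```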
ID: e192aeee
Updated: 2025-12-29T15:07:31