
3D Reasoning Training Research

Date: 2025-12-29 Track Project: 53134a0f

Key Discovery: DeepSeek R1 Approach

DeepSeek proved you can train reasoning into a model using pure RL. The model learns:

  • Thinking tokens (spending more on hard problems)
  • Self-reflection ("Wait, let me reconsider...")
  • Backtracking when errors are detected

Format

<think>
[reasoning steps, self-questioning, backtracking]
</think>
<answer>
[final response]
</answer>
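
A minimal sketch of parsing this format downstream, assuming the model emits well-formed tags. The regex patterns and the split_completion helper are illustrative, not taken from Open-R1:

import re

# Illustrative helper (not from Open-R1): pull the reasoning span and
# the final response out of an R1-style completion.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def split_completion(text: str) -> tuple[str | None, str | None]:
    """Return (reasoning, answer); None for any tag that is missing."""
    think = THINK_RE.search(text)
    answer = ANSWER_RE.search(text)
    return (
        think.group(1).strip() if think else None,
        answer.group(1).strip() if answer else None,
    )

reasoning, answer = split_completion(
    "<think>2+2... wait, let me reconsider... 4</think><answer>4</answer>"
)
print(answer)  # "4"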

Open Datasets Found

Dataset                        Size   Content
AM-DeepSeek-R1-Distilled-1.4M  1.4M   Full reasoning chains
Mixture-of-Thoughts            350k   Verified traces
OpenR1-Math-220k               220k   Math with CoT
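
A quick way to inspect one of these datasets with Hugging Face's datasets library, streaming so the 1.4M rows aren't downloaded up front. The split name and column names are whatever the dataset defines, so the sketch prints them rather than assuming them:

from datasets import load_dataset

# Stream the distilled-R1 dataset instead of downloading it fully.
ds = load_dataset(
    "a-m-team/AM-DeepSeek-R1-Distilled-1.4M",
    split="train",  # assumed split name; check the dataset card
    streaming=True,
)

# Column names vary per dataset, so inspect rather than assume.
first = next(iter(ds))
print(first.keys())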

Training Pipeline (Open-R1)

  1. SFT - teach the FORMAT of reasoning (fine-tune on distilled traces like those above)
  2. GRPO - RL that rewards only the final answer and indirectly teaches the thinking (see the sketch below)
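
The core of GRPO is group-relative advantage: sample several completions per prompt, score each, and normalize rewards within the group, so no learned value model is needed. A minimal numpy sketch of that normalization (the sampling and the policy-gradient update are omitted):

import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-4) -> np.ndarray:
    """GRPO advantage: normalize each completion's reward against
    the other completions sampled for the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four completions for one prompt; only two got the answer right.
rewards = np.array([1.0, 0.0, 1.0, 0.0])
print(group_relative_advantages(rewards))  # ~[ 1, -1, 1, -1 ]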

Key Insight

You don't reward thinking directly - you reward correct answers. The model learns: better thinking = better answers = more reward.
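
A sketch of what such an outcome-only reward can look like, reusing the illustrative split_completion helper from above. The thinking span never touches the score, and the string equality check stands in for whatever verifier a real task would use:

def accuracy_reward(completion: str, gold: str) -> float:
    """Score only the final answer; the <think> span is ignored."""
    _, answer = split_completion(completion)
    return 1.0 if answer is not None and answer == gold else 0.0

print(accuracy_reward(
    "<think>7*6 is 42, wait, check: yes 42</think><answer>42</answer>",
    gold="42",
))  # 1.0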

Sources

  • github.com/huggingface/open-r1
  • huggingface.co/datasets/a-m-team/AM-DeepSeek-R1-Distilled-1.4M
  • github.com/Eclipsess/Awesome-Efficient-Reasoning-LLMs
  • github.com/rxlqn/awesome-llm-self-reflection