3D Reasoning Training Research
Date: 2025-12-29 Track Project: 53134a0f
Key Discovery: DeepSeek R1 Approach
DeepSeek proved you can train reasoning into a model using pure RL. The model learns:
- Thinking tokens (it spends more of them on hard problems)
- Self-reflection ('Wait, let me reconsider...')
- Backtracking when it detects an error
Format
```
<think>
[reasoning steps, self-questioning, backtracking]
</think>
<answer>
[final response]
</answer>
```
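A minimal sketch of parsing this format, useful for evaluation or for a reward function. The regexes and the `parse_completion` name are mine, not from Open-R1:

```python
import re

# Extract the reasoning and the final answer from a completion that follows
# the <think>/<answer> template above.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def parse_completion(text: str):
    """Return (reasoning, answer); either part is None if its tag is missing."""
    think = THINK_RE.search(text)
    answer = ANSWER_RE.search(text)
    return (
        think.group(1).strip() if think else None,
        answer.group(1).strip() if answer else None,
    )

sample = "<think>2+2=4... wait, recheck... yes, 4.</think><answer>4</answer>"
print(parse_completion(sample))  # ('2+2=4... wait, recheck... yes, 4.', '4')
```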
Open Datasets Found
| Dataset | Size | Content |
|---|---|---|
| AM-DeepSeek-R1-Distilled-1.4M | 1.4M | Full reasoning chains |
| Mixture-of-Thoughts | 350k | Verified traces |
| OpenR1-Math-220k | 220k | Math with CoT |
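Quick way to peek at the 1.4M set (repo id taken from the source link below). I haven't verified the column names or whether the repo has multiple configs, so inspect one example before building anything on it:

```python
from datasets import load_dataset

# Stream the distilled dataset rather than downloading all 1.4M examples.
# If the repo exposes multiple configs, load_dataset's error will list them.
ds = load_dataset(
    "a-m-team/AM-DeepSeek-R1-Distilled-1.4M",
    split="train",
    streaming=True,
)

example = next(iter(ds))
print(example.keys())  # confirm the actual schema before writing preprocessing
```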
Training Pipeline (Open-R1)
- SFT - teaches the FORMAT of reasoning by fine-tuning on distilled reasoning traces
- GRPO - RL that rewards the final answer, which indirectly teaches the thinking (rough wiring sketch below)
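Rough wiring sketch of the GRPO stage using TRL's GRPOTrainer, which Open-R1 builds on. The model and dataset are placeholders I picked for illustration, and the format-only reward is a toy; check the current TRL docs for exact config knobs:

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Placeholder dataset with a "prompt" column; in practice swap in a reasoning
# dataset such as OpenR1-Math-220k from the table above.
dataset = load_dataset("trl-lib/tldr", split="train")

def format_reward(completions, **kwargs):
    # Toy reward: only checks that the <think>/<answer> format is present.
    # An answer-correctness reward is sketched under "Key Insight" below.
    return [1.0 if "<think>" in c and "<answer>" in c else 0.0 for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",   # placeholder base model
    reward_funcs=format_reward,
    args=GRPOConfig(output_dir="grpo-out"),
    train_dataset=dataset,
)
trainer.train()
```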
Key Insight
You don't reward the thinking directly - you reward correct answers. The model learns that better thinking = better answers = more reward.
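A sketch of what such an outcome-only reward can look like; nothing inside <think> is scored. The `ground_truths` argument name is hypothetical (with TRL's GRPOTrainer, extra dataset columns are forwarded to reward functions by name, so match it to whatever the dataset calls its reference answer):

```python
import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def accuracy_reward(completions, ground_truths, **kwargs):
    # 1.0 when the text inside <answer> matches the reference answer, else 0.0.
    rewards = []
    for completion, truth in zip(completions, ground_truths):
        match = ANSWER_RE.search(completion)
        answer = match.group(1).strip() if match else ""
        rewards.append(1.0 if answer == truth.strip() else 0.0)
    return rewards

print(accuracy_reward(["<think>...</think><answer>4</answer>"], ["4"]))  # [1.0]
```

Exact string match is brittle for math; Open-R1 uses a math-aware verifier for its accuracy reward, but the principle is the same: score the answer, not the reasoning.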
Sources
- github.com/huggingface/open-r1
- huggingface.co/datasets/a-m-team/AM-DeepSeek-R1-Distilled-1.4M
- github.com/Eclipsess/Awesome-Efficient-Reasoning-LLMs
- github.com/rxlqn/awesome-llm-self-reflection