DeepSeek-R1: Teaching a Model to Reason With Almost No Human Labels
Reinforcement learning alone, with no supervised reasoning traces, can make a base language model develop strong step-by-step reasoning, rivaling top closed models.
What problem it solves
Strong reasoning in language models was assumed to require expensive human-written chains of thought. DeepSeek-R1 tests a cheaper idea: reward correct answers and let the model discover how to reason on its own.
The core method
A base model is trained with large-scale reinforcement learning, with rewards tied to verifiable outcomes such as correct math answers and working code. An earlier variant, R1-Zero, learns to reason from RL alone, with no supervised fine-tuning at all. R1 then adds a small amount of cold-start data and multiple RL stages to make the output readable and well-behaved.
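To make "rewarding verifiable outcomes" concrete, here is a minimal sketch of an outcome-only reward for math-style problems. It is an illustration under stated assumptions, not DeepSeek's actual reward code: the `<answer>...</answer>` tag convention, the exact-string comparison, and the function names `extract_answer` and `outcome_reward` are all hypothetical choices made for this example.

```python
import re
from typing import Optional

# Assumed output convention for this sketch: the final answer is wrapped
# in <answer>...</answer>. Everything outside the tags is ignored.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(completion: str) -> Optional[str]:
    """Return the text of the last <answer>...</answer> pair, or None."""
    matches = ANSWER_RE.findall(completion)
    return matches[-1].strip() if matches else None


def outcome_reward(completion: str, reference: str) -> float:
    """Score only the final answer: 1.0 on an exact match, else 0.0.

    The reasoning text before the answer is never inspected, so the
    model is free to discover whatever steps lead to correct answers.
    """
    answer = extract_answer(completion)
    if answer is not None and answer == reference.strip():
        return 1.0
    return 0.0


# Two completions with different reasoning but the same final answer
# receive the same reward; a wrong or missing answer receives none.
short = "7 * 6 = 42. <answer>42</answer>"
long = "Let me check: 7 * 5 = 35, plus 7 is 42. <answer>42</answer>"
wrong = "7 * 6 = 48. <answer>48</answer>"
untagged = "The answer is 42."

rewards = [outcome_reward(c, "42") for c in (short, long, wrong, untagged)]
```

A real pipeline would need a more forgiving comparison (for example, treating `0.5` and `1/2` as equal) and, for code, would run generated programs against test cases instead of matching strings.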
Key results
As training proceeds, the model spontaneously produces longer chains of reasoning and begins to self-check and reflect on its own work. It reaches accuracy on math and coding benchmarks competitive with leading closed models. DeepSeek released the weights and several distilled smaller models.
Why it matters
It showed that frontier-level reasoning can emerge from RL on outcomes, not just from imitating human reasoning, and that a capable reasoning model could be released with open weights. The release pushed the whole field toward open reasoning models within weeks.
Limits and open questions
RL on verifiable rewards works best where answers are checkable; open-ended reasoning is harder to score. Readability, language mixing, and safety required extra training stages, and the compute needed to reproduce the full pipeline is still substantial.
One line: reward the answer, not the steps, and a model can learn to think on its own.