PPO Explained: The Clipped Objective Behind RLHF
PPO keeps policy-gradient RL stable with a clipped surrogate objective — almost as well-behaved as TRPO but far simpler — which made it the default RL engine behind RLHF for ChatGPT and InstructGPT.
Quick answer
Proximal Policy Optimization (PPO) is a policy-gradient reinforcement-learning algorithm that takes the biggest safe step it can per update by clipping the policy ratio to a small band — the paper uses [1-0.2, 1+0.2] — so a single batch of data can be reused for several epochs without the policy diverging. It delivers most of the stability of Trust Region Policy Optimization (TRPO) using only first-order gradients and roughly a dozen lines of code, and it later became the optimizer inside RLHF for InstructGPT and ChatGPT.
What problem it solves
Vanilla policy gradients are wasteful and brittle: each sampled trajectory is used for exactly one gradient step, and if the step is too large the policy can collapse and never recover. TRPO fixed the collapse by enforcing a hard constraint on how far the new policy can move from the old one in KL divergence, but it pays for that with a second-order optimization (a conjugate-gradient solve using Fisher-vector products) that is awkward to implement and incompatible with architectures that share parameters between the policy and value function or use noise such as dropout. PPO’s goal is narrow and practical: keep TRPO’s “don’t move too far” behavior, without its formal constraint, using plain stochastic gradient ascent that any deep-learning framework can run.
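For reference, the constrained problem TRPO solves can be written roughly as follows. The notation here is an assumption made for illustration and does not appear in the text above: θ are the policy parameters, Â_t is the advantage estimate at timestep t, and δ is the size of the trust region.

```latex
\begin{aligned}
\max_{\theta}\quad & \hat{\mathbb{E}}_t\!\left[ \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\, \hat{A}_t \right] \\
\text{subject to}\quad & \hat{\mathbb{E}}_t\!\left[ \mathrm{KL}\!\left[ \pi_{\theta_{\text{old}}}(\cdot \mid s_t),\, \pi_{\theta}(\cdot \mid s_t) \right] \right] \le \delta
\end{aligned}
```

The objective is an ordinary importance-weighted surrogate. The cost comes from the constraint, which is what forces the second-order machinery.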
The clipped surrogate objective
The core idea is one term. PPO writes the per-sample policy ratio r = π_new(a|s) / π_old(a|s) and multiplies it by the advantage estimate, the same as a normal surrogate objective. The trick is that it also computes a clipped version where r is forced into [1-ε, 1+ε], and then takes the minimum of the clipped and unclipped terms. The minimum is what makes it work: when an action looks good (positive advantage), the objective stops rewarding the update once the ratio exceeds 1+ε, so there is no incentive to keep pushing the probability up; when an action looks bad (negative advantage), it stops rewarding the update once the ratio falls below 1-ε, so there is no incentive to keep pushing the probability down. Moves in the wrong direction are never clipped, so they are still penalized in full. The result is a pessimistic lower bound on the unclipped objective that removes the gradient signal exactly when the update would move too far: no KL constraint, no second-order math, just a min and a clamp. The paper also describes an alternative that adds an adaptive KL penalty to the loss, but reports that the clipped variant works better, and that is the one the world adopted.
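The min-and-clamp described above can be sketched in plain Python. This is a minimal illustration, not the paper's reference code: the function names `clipped_surrogate` and `mean_clipped_objective` are hypothetical, and a real implementation would compute the same expression on tensors inside an autodiff framework so that gradients flow through the ratio.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample clipped term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    # Force the ratio into the band [1 - eps, 1 + eps].
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Taking the minimum keeps whichever term is more pessimistic.
    return min(ratio * advantage, clipped_ratio * advantage)


def mean_clipped_objective(ratios, advantages, eps=0.2):
    """Average the per-sample terms over a batch; this is the quantity to maximize."""
    terms = [clipped_surrogate(r, a, eps) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)
```

With ε = 0.2, a ratio of 1.5 on an advantage of +2.0 gives 2.4 rather than 3.0, because the clipped term wins the minimum. A ratio of 1.5 on an advantage of -1.0 gives the full -1.5, because a move in the wrong direction is not clipped.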
In practice PPO runs an actor-critic loop: collect a fixed number of timesteps from parallel actors, estimate advantages (the paper uses generalized advantage estimation), then optimize the clipped objective plus a value-function loss and an entropy bonus for several epochs of minibatch SGD before throwing the data away and sampling again.
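The two numerical pieces of that loop, advantage estimation and the combined objective, might look like the sketch below. The function names and the default values for `gamma`, `lam`, `c1`, and `c2` are illustrative assumptions, not settings taken from the text above. The sketch handles a single trajectory segment with no episode boundary inside it.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory segment.

    rewards[t] and values[t] belong to timestep t. last_value is the critic's
    estimate for the state after the final step (use 0.0 if the episode ended).
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    # Walk backwards so each step can reuse the accumulated future residuals.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def combined_objective(clip_term, value_loss, entropy, c1=0.5, c2=0.01):
    """Clipped term minus a weighted value loss plus a weighted entropy bonus.

    This is the quantity to maximize; c1 and c2 are assumed coefficients.
    """
    return clip_term - c1 * value_loss + c2 * entropy
```

Setting `lam=0.0` reduces each advantage to the one-step TD residual, while `lam=1.0` sums all discounted future residuals, so `lam` controls how much the estimate relies on the critic versus on the observed rewards.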
Key results
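- On MuJoCo continuous-control tasks (HalfCheetah, Hopper, Walker2d, and others), the clipped objective scored highest in the authors’ comparison of surrogate objectives, ahead of the fixed- and adaptive-KL penalty variants. In a separate comparison against other algorithms, PPO outperformed vanilla policy gradient, A2C, and TRPO on almost all of the environments.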
- The clip threshold ε = 0.2 was the best single setting across the continuous-control suite, a number that survived almost unchanged into countless later codebases.
- On the Atari benchmark (49 games), PPO matched or beat A2C on most games and was competitive with ACER while being dramatically simpler, and it won clearly on the “speed of learning” metric measured over training.
- The headline trade-off the paper claims: PPO “strikes a favorable balance between sample complexity, simplicity, and wall-time” — and unlike most such claims, the simplicity part held up under a decade of reuse.
Why PPO became the RLHF workhorse
PPO’s lasting importance is not in robotics or Atari — it is in language models. Reinforcement learning from human feedback needs to nudge a large pretrained model toward human-preferred outputs without destroying the fluency it learned during pretraining, and that is exactly the “take a useful step but don’t move too far” problem PPO was built for. When OpenAI built InstructGPT and then ChatGPT, the alignment stage trained a reward model on human preference comparisons and then used PPO to optimize the language-model policy against that reward, with a KL penalty back to the original model holding the policy near its starting point. So a 2017 paper aimed at simulated locomotion turned out to be the optimizer that made instruction-following assistants work. My read: the reason PPO won this role is not that it is the most sample-efficient or the most theoretically clean algorithm — it is that it is robust and forgiving enough that a team can get it running on a 175B-parameter policy without a dedicated RL research effort, and in alignment that practicality mattered more than peak performance.
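One common way to express the KL penalty is to fold it into the reward that PPO optimizes. The sketch below assumes a sequence-level form in which the summed log-probability ratio between the policy and the frozen original model serves as a sampled estimate of the KL term. The function name `kl_shaped_reward` and the coefficient `beta=0.1` are hypothetical, and production systems may apply the penalty per token instead.

```python
def kl_shaped_reward(reward_model_score, logprobs_policy, logprobs_reference, beta=0.1):
    """Reward-model score minus beta times a sampled estimate of the KL term.

    logprobs_policy[i] and logprobs_reference[i] are the log-probabilities that
    the policy being trained and the frozen original model assign to the i-th
    generated token.
    """
    # Summed log-ratio over the response; positive when the policy has made
    # this response more likely than the original model did.
    log_ratio = sum(lp - lr for lp, lr in zip(logprobs_policy, logprobs_reference))
    return reward_model_score - beta * log_ratio
```

When the two models assign identical log-probabilities the penalty vanishes and the reward equals the reward-model score. The further the policy drifts toward responses the original model found unlikely, the more reward it gives up.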
Limits and open questions
PPO is finicky in ways the clean equation hides. Its real-world performance leans heavily on implementation details the paper underplays — advantage normalization, reward scaling, value-loss clipping, learning-rate schedules, orthogonal initialization — and later reproducibility studies showed these “code-level optimizations” account for a large share of its reported gains, which is uncomfortable for a method sold as simple. It is still on-policy, so it discards every batch after a few epochs and is far less sample-efficient than off-policy methods on tasks where samples are expensive. In the RLHF setting specifically, PPO drags along a separate value network and a reward model, making the training loop heavy and unstable to tune — which is precisely the pain that later methods like Direct Preference Optimization and GRPO set out to remove by dropping the critic or the separate reward model. PPO is the incumbent, not the final word.
FAQ
What is Proximal Policy Optimization in simple terms?
PPO is a reinforcement-learning algorithm that improves a policy by taking the largest update it can while clipping the change in action probabilities to a small band (the paper uses ±20%), so the policy improves steadily without making a destabilizing jump.
Why is PPO used in RLHF and ChatGPT?
RLHF needs to optimize a language model against a learned reward model without letting it drift far from the pretrained model. PPO’s clipped objective plus a KL penalty does exactly that, which is why InstructGPT and ChatGPT used PPO for their alignment stage.
How is PPO different from TRPO?
TRPO enforces a hard KL constraint using second-order optimization, which is complex to implement. PPO approximates the same “stay close to the old policy” effect with a first-order clipped objective, so it runs with ordinary SGD and far less code while reaching comparable or better results.
What does the clip parameter epsilon do in PPO?
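Epsilon sets how far the new-to-old policy probability ratio may move before the objective stops rewarding the change. The paper found ε = 0.2 best on its continuous-control tasks, meaning the incentive to change an action’s probability disappears once it has moved about 20% from its previous value. This is a soft limit enforced through the gradient, not a hard bound on the ratio.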
One line: clip the policy update so there is no incentive to step too far, and you get most of TRPO’s stability with SGD’s simplicity, the trade that made RLHF practical. Read the original paper on arXiv.