2025 was a breakout year for RL-based methods in LLM training. For a while, 'RL for LLMs' basically meant RLHF, largely following the ChatGPT-replication journey and its alignment-focused pipeline. Then OpenAI's o1 and DeepSeek-R1 helped prove a different point: on tasks with verifiable answers, RL can seriously boost an LLM's reasoning capability, not just polish its vibes. Around the same time, RL also became a tool for shaping agentic behavior, especially the ability to reliably call tools and execute multi-turn workflows.
With those trends converging, the community spent much of 2025 discussing and experimenting with RL algorithms tailored to token-level LLM training, with GRPO as the main attraction. This two-part post is my 'don't forget this' summary of what came out this year. Part 1 lays out the basics of off-policy policy gradients, how GRPO emerged, and why it started to dominate PPO in practice. Part 2 surveys the many training tweaks and variants people proposed and debated throughout the year.
We’ll start with the token-level MDP framing and off-policy policy gradients, since that’s the common spine behind PPO, GRPO, and most of their descendants.
Token-Level MDPs
Training LLMs with RL stemmed from the success of RLHF 1 2, which optimizes LLMs using human feedback. These methods typically model the LLM as a policy in a token-level Markov decision process (MDP). Let the LLM be a stochastic policy $\pi_\theta$, parameterized by $\theta$, that maps token histories to distributions over the vocabulary. The state space consists of all possible token sequences, while the action space is the vocabulary itself.
Given a prompt dataset $\mathcal{D}$, a prompt $x \sim \mathcal{D}$ initializes the episode. The model then generates a response $y = (y_1, \dots, y_T)$ autoregressively,
$$y_t \sim \pi_\theta(\cdot \mid x, y_{<t}), \quad t = 1, \dots, T,$$
where generation terminates when $y_T$ is the end-of-sequence (EOS) token 1.
At time step $t$, the state $s_t = (x, y_{<t})$ is defined as the prompt concatenated with the partial response, and the action $a_t = y_t$ corresponds to selecting the next token. Upon taking action $a_t$, the agent transitions deterministically to $s_{t+1} = (x, y_{\le t})$ and receives a reward $r_t = r(s_t, a_t)$, where $r$ is a heuristic reward function or a learned reward model.
In practice, the reward structure is typically sparse, with $r_t = 0$ for $t < T$ and a single terminal reward $r_T$ at the end of the episode. To facilitate learning, we may also design intermediate rewards (also called process rewards) for $t < T$, yet this is an active area of research.
The overall learning objective is to maximize the expected discounted sum of rewards under the policy:
$$J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\left[\sum_{t=1}^{T} \gamma^{t-1} r_t\right],$$
where $\gamma \in (0, 1]$ is a discount factor.
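To make the token-level MDP concrete, here is a tiny sketch of my own (not from any particular framework) that decomposes a prompt/response pair into (state, action, reward) triples with a single sparse terminal reward; the token ids and the `EOS` value are made up for the example.

```python
# Token-level MDP sketch: states are token prefixes, actions are next tokens,
# and the reward is sparse, arriving only on the terminal (EOS) action.
EOS = 2  # hypothetical EOS token id

def rollout_to_transitions(prompt_ids, response_ids, terminal_reward):
    """Decompose a (prompt, response) pair into (state, action, reward) triples."""
    transitions = []
    for t, token in enumerate(response_ids):
        state = prompt_ids + response_ids[:t]             # s_t = prompt + partial response
        action = token                                     # a_t = next token
        reward = terminal_reward if token == EOS else 0.0  # sparse terminal reward
        transitions.append((state, action, reward))
    return transitions

# Example: prompt "1+1=" (ids [5, 6, 7]) answered with "2<EOS>" (ids [8, EOS]),
# rewarded 1.0 by a verifier at the end of the episode.
print(rollout_to_transitions([5, 6, 7], [8, EOS], terminal_reward=1.0))
```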
Policy Gradient Methods
A popular approach to optimizing the policy is policy gradient methods. The key idea is to compute an estimate of the gradient of the objective with respect to the policy parameters $\theta$, and then perform gradient ascent. As an LLM itself is a parameterized stochastic policy, it is natural to apply policy gradient methods to LLM training. Using the policy gradient theorem, we can derive the gradient estimator as follows. (For more details, see Section 13.1 of Sutton & Barto's book or this nice blog post by Lilian Weng!)
$$\nabla_\theta J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\left[\sum_{t=1}^{T} \hat{A}_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$$
This is the basic idea of REINFORCE 3. Here, $\hat{A}_t$ is an advantage function, which may be any function of the following form:
$$\hat{A}_t = \hat{Q}_t - b(s_t).$$
As stated, $\hat{Q}_t$ must be an estimator of the state-action value function $Q^{\pi}(s_t, a_t)$. More specifically, it must satisfy:
$$\mathbb{E}\big[\hat{Q}_t \mid s_t, a_t\big] = Q^{\pi}(s_t, a_t).$$
$b(s_t)$ is a baseline, which may be any function that does not depend on the action $a_t$. The purpose of the baseline is to help reduce the variance of the gradient estimator without introducing bias. A near-optimal minimizer of variance is the state-value function $V^{\pi}(s_t)$, defined as:
$$V^{\pi}(s_t) = \mathbb{E}_{a \sim \pi(\cdot \mid s_t)}\big[Q^{\pi}(s_t, a)\big].$$
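As a rough sketch of how this estimator looks in code (assuming PyTorch; the tensor names are illustrative, not a reference implementation), the score-function loss below multiplies per-token log-probabilities by a detached advantage formed as a return estimate minus an action-independent baseline:

```python
import torch

def reinforce_loss(logprobs: torch.Tensor,
                   returns: torch.Tensor,
                   baseline: torch.Tensor) -> torch.Tensor:
    """REINFORCE with a baseline for one sampled response.

    logprobs: log pi_theta(a_t | s_t) for each generated token, shape (T,)
    returns:  estimates of Q(s_t, a_t), e.g. reward-to-go, shape (T,)
    baseline: any action-independent function of s_t, e.g. a value estimate, shape (T,)
    """
    advantages = (returns - baseline).detach()  # A_t = Q_t - b(s_t), treated as constant
    # Minimizing this loss performs gradient ascent on the objective J(theta).
    return -(logprobs * advantages).sum()
```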
At this point, one may imagine directly using $Q^{\pi}$ and $V^{\pi}$ to define the advantage function. However, both functions are unknown and must be estimated from samples. To this end, the actor-critic method trains, alongside the policy, a value network $V_\phi$, parameterized by $\phi$, to approximate $V^{\pi}$. As $Q^{\pi}$ can be expressed in terms of $V^{\pi}$, we may then use $V_\phi$ to construct various estimators for the advantage function. Some common choices include:
- Baselined MC return: low bias, high variance. $\hat{A}_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'} - V_\phi(s_t)$
- TD residual: high bias, low variance. $\hat{A}_t = \delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$
- $n$-step TD residual: discrete control between bias and variance using a hyperparameter $n$; note that $n = 1$ recovers the TD residual and $n \to \infty$ recovers the baselined MC return. $\hat{A}_t = \sum_{k=0}^{n-1} \gamma^{k} r_{t+k} + \gamma^{n} V_\phi(s_{t+n}) - V_\phi(s_t)$
- Generalized Advantage Estimation (GAE) 4: continuous control between bias and variance using a hyperparameter $\lambda \in [0, 1]$; note that $\lambda = 0$ recovers the TD residual and $\lambda = 1$ recovers the baselined MC return. $\hat{A}_t = \sum_{k \geq 0} (\gamma\lambda)^{k} \delta_{t+k}$
(For more details and insights behind advantage estimation choices and GAE, see this blog post by Daniel Seita.)
The policy network $\pi_\theta$ and the value network $V_\phi$ are typically trained alternately. The value network is trained to minimize the mean squared error between its predictions and the bootstrapped returns (equivalently, the sum of the advantage estimate and the predicted value), while the policy network is updated using the policy gradient estimator with the chosen advantage function.
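Here is a minimal sketch (mine, assuming PyTorch and per-sequence tensors) of the GAE recursion together with the bootstrapped return targets used for the value loss; `last_value` stands for $V_\phi(s_T)$ and would be zero for a terminated episode:

```python
import torch

def compute_gae(rewards: torch.Tensor, values: torch.Tensor,
                last_value: float = 0.0, gamma: float = 1.0, lam: float = 0.95):
    """GAE for a single response; rewards and values have shape (T,)."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    next_value, gae = last_value, 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual delta_t
        gae = delta + gamma * lam * gae                      # GAE recursion
        advantages[t] = gae
        next_value = values[t]
    returns = advantages + values  # bootstrapped targets for the value network
    return advantages, returns
```

Setting `lam=0.0` reduces this to the TD residual, while `lam=1.0` recovers the baselined MC return.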
Off-Policy Policy Gradient Methods
For LLMs, or even in other RL environments, generating on-policy trajectories can be computationally expensive and time-consuming. This is especially true for LLMs, since it requires running the LLM to generate responses for each prompt in the dataset. To increase efficiency, we may generate responses for many prompts in parallel, more than we would consume in a single model update. However, this introduces a distribution mismatch between the current policy $\pi_\theta$ and the behavior policy $\pi_{\mathrm{old}}$ used to generate the responses. Thus, we need to adjust the policy gradient estimator by incorporating importance sampling ratios. This is where off-policy policy gradient methods come into play. Using importance sampling, we can rewrite the policy gradient estimator as follows:
$$\nabla_\theta J(\theta) = \mathbb{E}_{s_t \sim \pi_\theta}\left[\sum_{t=1}^{T} \mathbb{E}_{a_t \sim \pi_{\mathrm{old}}(\cdot \mid s_t)}\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\mathrm{old}}(a_t \mid s_t)}\, \hat{A}_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]\right]$$
The importance sampling ratios are inserted in a somewhat unnatural way: we maintain on-policy samples generated from the current policy $\pi_\theta$, but each per-token expectation is taken over an off-policy sample $a_t$ sampled from the behavior policy $\pi_{\mathrm{old}}$, conditioned on the on-policy history $s_t$.
After this adjustment, we may apply an approximation that replaces the on-policy history with the off-policy history entirely. This is valid only when $\pi_\theta \approx \pi_{\mathrm{old}}$, so that the distributions are close enough for the difference in histories to become negligible. This leads to a fully off-policy estimator, where all tokens are sampled from the behavior policy $\pi_{\mathrm{old}}$. Also, in general, importance sampling estimators become inaccurate when the sampling distribution is too far from the nominal distribution. This is another reason why we need $\pi_\theta \approx \pi_{\mathrm{old}}$ for the approximation to hold well in practice.
Therefore, assuming $\pi_\theta \approx \pi_{\mathrm{old}}$ is kept throughout training, we can express the overall objective function as:
$$J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\mathrm{old}}(\cdot \mid x)}\left[\sum_{t=1}^{|y|} \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\mathrm{old}}(y_t \mid x, y_{<t})}\, \hat{A}_t\right]$$
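A minimal sketch of this surrogate in code (assuming PyTorch, batch-shaped tensors, and a response-token mask; the names are illustrative choices of mine):

```python
import torch

def surrogate_loss(logprobs: torch.Tensor, old_logprobs: torch.Tensor,
                   advantages: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Off-policy policy-gradient surrogate (no clipping yet).

    logprobs:     log pi_theta(y_t | x, y_<t) from the current policy, shape (B, T)
    old_logprobs: log pi_old(y_t | x, y_<t) from the behavior policy, shape (B, T)
    advantages:   per-token advantage estimates, shape (B, T)
    mask:         1.0 for response tokens, 0.0 for padding, shape (B, T)
    """
    ratio = torch.exp(logprobs - old_logprobs)     # importance sampling ratio
    per_token = ratio * advantages.detach()        # advantages treated as constants
    return -(per_token * mask).sum() / mask.sum()  # minimize = gradient ascent on J
```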
Practical Implementation
The overall training procedure for off-policy policy gradient methods can be summarized as follows:
Initialize: Policy Network $\pi_\theta$, Value Network $V_\phi$, Corpus of Prompts $\mathcal{D}$
for global epoch $e = 1, \dots, E$ do
- for global batch $\mathcal{B} \subset \mathcal{D}$ do
- Set behavior policy network $\pi_{\mathrm{old}} \leftarrow$ current $\pi_\theta$
- Generate responses $y \sim \pi_{\mathrm{old}}(\cdot \mid x)$ for $x \in \mathcal{B}$
- Compute off-policy logprobs $\log \pi_{\mathrm{old}}(y_t \mid x, y_{<t})$ for $x \in \mathcal{B}$, $t = 1, \dots, |y|$
- Compute rewards $r_t$ for $x \in \mathcal{B}$, $t = 1, \dots, |y|$
- Compute values $V_\phi(s_t)$ for $x \in \mathcal{B}$, $t = 1, \dots, |y|$
- Compute advantages $\hat{A}_t$ for $x \in \mathcal{B}$, $t = 1, \dots, |y|$
- for mini epoch $m = 1, \dots, M$ do
- for minibatch $b \subset \mathcal{B}$ do
- Compute on-policy logprobs $\log \pi_\theta(y_t \mid x, y_{<t})$ for $x \in b$, $t = 1, \dots, |y|$
- Update the policy network with loss: $\mathcal{L}_\pi(\theta) = -\frac{1}{|b|}\sum_{x \in b}\sum_{t=1}^{|y|} \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\mathrm{old}}(y_t \mid x, y_{<t})}\, \mathrm{sg}\!\left[\hat{A}_t\right]$
- end for
- end for
- Update the value network with loss (may also use multi-epoch, minibatch training): $\mathcal{L}_V(\phi) = \frac{1}{|\mathcal{B}|}\sum_{x \in \mathcal{B}}\sum_{t=1}^{|y|}\left(V_\phi(s_t) - \mathrm{sg}\!\left[\hat{A}_t + V_\phi(s_t)\right]\right)^2$
- end for
end for
(Here, $\mathrm{sg}[\cdot]$ denotes the stop-gradient operator, which is equivalent to torch.Tensor.detach().)
The value network is a model that takes in an arbitrary token sequence and outputs a scalar value. In practice, it is often implemented as an LLM with a scalar head on top. If we are using a reward model for reward computation (as in RLHF), this is very similar to the functionality of the reward model itself. For this reason, it is common to initialize the value network with the reward model weights in this case 1 5. If we do not have a reward model and instead use a heuristic reward function, we usually initialize the value network from the policy network weights.
Many modern RL-LLM frameworks (e.g., TRL, verl, slime) implement their pipelines based on the above structure. Often, they also incorporate various techniques to speed up training and improve memory efficiency. For instance, rollouts may be accelerated using vLLM or SGLang. Training may be optimized using parallelism techniques such as FSDP or Megatron-LM.
Meanwhile, as modern LLMs generate long reasoning traces, a behavior that gets further amplified under RL, rollouts may remain heavy and time-consuming even with aggressive optimizations. As a result, the community has been actively working on approaches to increase asynchrony, for example by performing asynchronous rollouts and asynchronous reward computation where necessary. The Prime Intellect team has been making notable progress in this direction (PRIME-RL, Verifiers).
Proximal Policy Optimization (PPO)
Now, to ensure that the assumption $\pi_\theta \approx \pi_{\mathrm{old}}$ holds during training, we may need to take special care. One popular approach is proximal policy optimization (PPO) 6, which is widely adopted in RL for LLMs due to its simplicity and effectiveness. The core idea of PPO is to prevent large policy updates by constraining the change in the policy at each update step. To this end, we may add trust-region constraints. The original idea comes from trust region policy optimization (TRPO) 7, and PPO further simplifies the implementation as follows:
-
Use a clipped objective as a proxy for trust-region constraints. Formally, we define the clipping function
$$\mathrm{clip}_\epsilon(\rho) = \mathrm{clip}(\rho,\ 1 - \epsilon,\ 1 + \epsilon),$$
where $\epsilon$ is a small hyperparameter. The combination $\min\!\big(\rho\,\hat{A},\ \mathrm{clip}_\epsilon(\rho)\,\hat{A}\big)$ places an upper bound on $\rho$ when $\hat{A}$ is positive, and a lower bound on $\rho$ when $\hat{A}$ is negative. Intuitively, if we consider $\rho$ as the importance sampling ratio $\rho_t = \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\mathrm{old}}(y_t \mid x, y_{<t})}$, this prevents excessively large updates that may arise from distribution mismatch. Using this, we replace the per-token contribution to the policy objective by
$$\min\!\big(\rho_t\,\hat{A}_t,\ \mathrm{clip}_\epsilon(\rho_t)\,\hat{A}_t\big).$$
-
Maintain the initial policy $\pi_{\mathrm{ref}}$ as a reference policy to prevent policy collapse. Then, add a KL-penalty to the per-token reward by:
$$r_t \leftarrow r_t - \beta \log \frac{\pi_{\mathrm{old}}(y_t \mid x, y_{<t})}{\pi_{\mathrm{ref}}(y_t \mid x, y_{<t})}$$
Here, $\beta$ is a hyperparameter that controls the strength of the penalty. After this adjustment, update the advantages accordingly, using the chosen advantage function (GAE is the most common choice with PPO). Let $\tilde{A}_t$ denote the updated advantage function, including the KL-penalty.
Finally, the resulting PPO objective becomes:
$$J_{\mathrm{PPO}}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\mathrm{old}}(\cdot \mid x)}\left[\sum_{t=1}^{|y|} \min\!\big(\rho_t\,\tilde{A}_t,\ \mathrm{clip}_\epsilon(\rho_t)\,\tilde{A}_t\big)\right]$$
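Putting the two PPO modifications together, a minimal sketch (assuming PyTorch; the tensor names, shapes, and default $\beta$, $\epsilon$ values are my own illustrative choices) might look like:

```python
import torch

def kl_penalized_rewards(rewards: torch.Tensor, old_logprobs: torch.Tensor,
                         ref_logprobs: torch.Tensor, beta: float = 0.05) -> torch.Tensor:
    """Fold a per-token KL penalty against the reference policy into the rewards."""
    return rewards - beta * (old_logprobs - ref_logprobs)

def ppo_clip_loss(logprobs: torch.Tensor, old_logprobs: torch.Tensor,
                  advantages: torch.Tensor, mask: torch.Tensor,
                  eps: float = 0.2) -> torch.Tensor:
    """Clipped PPO surrogate; advantages are assumed to already include the KL penalty."""
    ratio = torch.exp(logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    adv = advantages.detach()
    per_token = torch.minimum(ratio * adv, clipped * adv)
    return -(per_token * mask).sum() / mask.sum()
```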
In the early days of RL for LLMs, PPO was the go-to algorithm. This was especially true during the RLHF era, where RL was primarily used for human alignment, as exemplified by ChatGPT and its many replications. During this period, relatively little attention was paid to advancing RL algorithms themselves. Most of the effort went into building better reward models, mitigating overoptimization, figuring out how to collect preference data at scale, etc. (which eventually led us to the RL vs DPO debate and beyond — here are some great articles that summarize the progress back then).
Group Relative Policy Optimization (GRPO)
However, things started to change after the success of o1 and R1. RL for LLMs began to matter again, not just for alignment, but as a way to incentivize general reasoning capabilities. A big reason this worked was that these models were trained on tasks with clear, verifiable answers. When rewards are strictly verifiable, many of the classic RLHF problems that come from reward modeling (reward hacking, weird incentives, misalignment) largely disappear. Now that these are less of a bottleneck, attention has shifted back to the RL algorithms, with the goal of maximizing scalability and stability. DeepSeek's solution to this problem was group relative policy optimization (GRPO) 8, which builds upon PPO with several key modifications to take better advantage of verifiable rewards in reasoning tasks.
GRPO assumes purely sequence-level rewards 2, which was effectively always the case in RL for LLMs anyway. The key difference is that instead of treating the reward as given at the last token only, we treat it as given for the entire generated sequence. Accordingly, we calculate advantages at the sequence level, and broadcast the same advantage to every token in the sequence. Furthermore, GRPO assumes a grouped sampling scheme, meaning that for each prompt $x$, we generate a group of $G$ responses $\{y^{(1)}, \dots, y^{(G)}\}$ using the behavior policy $\pi_{\mathrm{old}}$. The key idea is to use relative rewards within the group to define a more informative advantage function. Specifically, GRPO introduces the following components:
-
Define the advantage function using normalized rewards within the group:
$$\hat{A}^{(i)} = \frac{R^{(i)} - \mathrm{mean}\!\left(R^{(1)}, \dots, R^{(G)}\right)}{\mathrm{std}\!\left(R^{(1)}, \dots, R^{(G)}\right) + \varepsilon},$$
where $R^{(i)}$ is the sequence-level reward of response $y^{(i)}$, and broadcast it to all tokens in $y^{(i)}$. The standard advantage function is a measure of better-than-average performance. Similarly, $\hat{A}^{(i)}$ measures how much better a response is compared to the other responses in the same group. Here, $\varepsilon$ is a small constant added for numerical stability.
-
As we do not have per-token rewards, we cannot directly apply the KL-penalty as in PPO. Instead, add a KL-penalty to the per-token objective using the k3 approximator 9:
$$\hat{D}_{\mathrm{KL}}\!\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right]_t = \frac{\pi_{\mathrm{ref}}(y_t \mid x, y_{<t})}{\pi_\theta(y_t \mid x, y_{<t})} - \log\frac{\pi_{\mathrm{ref}}(y_t \mid x, y_{<t})}{\pi_\theta(y_t \mid x, y_{<t})} - 1$$
This is meant to approximate the KL-divergence $D_{\mathrm{KL}}\!\left[\pi_\theta(\cdot \mid x, y_{<t}) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x, y_{<t})\right]$. In GRPO, we use the k3 estimator $k_3(r) = r - \log r - 1$ with $r = \frac{\pi_{\mathrm{ref}}(y_t \mid x, y_{<t})}{\pi_\theta(y_t \mid x, y_{<t})}$, evaluated at the sampled tokens $y_t$. You may notice some mismatch in the sampling distribution here, as $y_t$ is sampled from the behavior policy $\pi_{\mathrm{old}}$ instead of the current policy $\pi_\theta$. I will discuss this subtlety in the next post.
-
Apply sequence-level length normalization by dividing the per-token contributions by the sequence length $|y^{(i)}|$. This is slightly unintuitive, as the idea of average reward is typically discussed in infinite-horizon MDPs. There has been much debate on this length normalization term in the sense of loss aggregation, which I will also discuss in the next post. For now, just note that this is part of the original GRPO formulation.
Applying all these components, the overall GRPO objective becomes:
$$J_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; \{y^{(i)}\}_{i=1}^{G} \sim \pi_{\mathrm{old}}(\cdot \mid x)}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|y^{(i)}|}\sum_{t=1}^{|y^{(i)}|}\left(\min\!\Big(\rho^{(i)}_t \hat{A}^{(i)},\ \mathrm{clip}_\epsilon\!\big(\rho^{(i)}_t\big)\, \hat{A}^{(i)}\Big) - \beta\, \hat{D}_{\mathrm{KL}}\!\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right]_t\right)\right],$$
where $\rho^{(i)}_t = \frac{\pi_\theta\!\left(y^{(i)}_t \mid x,\, y^{(i)}_{<t}\right)}{\pi_{\mathrm{old}}\!\left(y^{(i)}_t \mid x,\, y^{(i)}_{<t}\right)}$.
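To tie the three components together, here is a minimal single-prompt sketch of the GRPO loss (assuming PyTorch; the tensor names, shapes, and the $\epsilon$, $\beta$ defaults are illustrative choices of mine, not a reference implementation):

```python
import torch

def grpo_loss(logprobs: torch.Tensor, old_logprobs: torch.Tensor,
              ref_logprobs: torch.Tensor, rewards: torch.Tensor,
              mask: torch.Tensor, eps: float = 0.2, beta: float = 0.04) -> torch.Tensor:
    """GRPO loss for one prompt with a group of G sampled responses.

    logprobs, old_logprobs, ref_logprobs, mask: shape (G, T_max)
    rewards: sequence-level rewards, shape (G,)
    """
    # 1. Group-relative advantages, broadcast to every token of each response.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    adv = adv.unsqueeze(-1).detach()

    # 2. Clipped importance-weighted policy term.
    ratio = torch.exp(logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    policy_term = torch.minimum(ratio * adv, clipped * adv)

    # 3. k3 estimator of KL(pi_theta || pi_ref) per token.
    log_r = ref_logprobs - logprobs
    kl = torch.exp(log_r) - log_r - 1.0

    # 4. Length-normalize per sequence, average over the group, negate for minimization.
    per_seq = ((policy_term - beta * kl) * mask).sum(-1) / mask.sum(-1)
    return -per_seq.mean()
```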
Why GRPO?
Following DeepSeek’s breakout success, GRPO rapidly gained attention and, in many discussions, overtook PPO as the method of choice. Yet PPO isn’t dead. In fact, it still works surprisingly well. However, it’s hard to ignore that much of the recent progress in RL for LLMs has been built on top of GRPO-style methods rather than on PPO itself. This shift isn’t just hype; GRPO appears to offer some practical advantages that make it an appealing choice for large-scale LLM training.
The most obvious advantage of GRPO is its computational simplicity. GRPO eliminates the need for a learned value function in the advantage estimate, which means there is no separate value network to train or maintain. This reduces both compute cost and system complexity, making it easier to implement and scale.
But more interestingly, recent empirical results suggest that removing or weakening value learning may actually improve performance, not just efficiency. A notable trend is that even when PPO is used, practitioners increasingly push the GAE parameter $\lambda$ toward 1.0, effectively reducing reliance on the value function. This behavior is reflected in results from the recent DeepSeek-R1 report published in Nature 10, which shows a clear performance ordering:

Performance of PPO and GRPO on the MATH task using DeepSeek-Coder-V2-Lite. We can observe a clear trend of GRPO > PPO ($\lambda$ = 1.0) > PPO ($\lambda$ = 0.95). As the influence of value learning decreases, performance improves, and removing it entirely works best.
This matches a growing consensus that value learning is hard in RL for LLMs. Researchers like John Schulman and Ross Taylor have both commented on this publicly. While the exact cause is still unclear, a common hypothesis is that bias in the learned value function harms optimization more than variance reduction helps. In other words, the errors introduced by imperfect value estimates may outweigh the benefits they provide in stabilizing policy updates.
Work such as VC-PPO 11 offers a more concrete analysis of this issue. Two key takeaways stand out. First, value network initialization matters enormously, and explicit value pretraining under a fixed initial policy may help. Second, variance reduction may not be as important in value learning as it is in policy learning. This has led to proposals like decoupled GAE, where different $\lambda$ values are used for policy updates and value updates. Even with these refinements, however, value learning remains brittle and sensitive to design choices.
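As a rough illustration of the decoupled-GAE idea (my own toy sketch, not VC-PPO's implementation; the numbers are made up), one might compute advantages for the policy with one $\lambda$ and value-function targets with another, typically larger, $\lambda$:

```python
import torch

def gae(rewards, values, gamma=1.0, lam=0.95):
    """GAE over a single terminated response; rewards and values have shape (T,)."""
    adv = torch.zeros_like(rewards)
    running, next_value = 0.0, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
        next_value = values[t]
    return adv

rewards = torch.tensor([0.0, 0.0, 1.0])   # sparse terminal reward
values = torch.tensor([0.3, 0.5, 0.8])    # V_phi(s_t) from the critic
policy_advantages = gae(rewards, values, lam=0.95)      # lambda for policy updates
value_targets = gae(rewards, values, lam=1.0) + values  # lambda for value targets
```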
GRPO avoids these issues altogether. With no value network, there’s no value bias, no delicate initialization, and no need to tune bias-variance tradeoffs. That simplicity, combined with strong empirical results, explains why GRPO has become the dominant foundation for recent RL-LLM work.
References
Footnotes
-
This formulation causes the episode length $T$ to be stochastic, being a random variable dependent on the policy. As a result, one cannot directly apply the common interchange $\mathbb{E}\big[\sum_{t} (\cdot)\big] = \sum_{t} \mathbb{E}\big[(\cdot)\big]$, which is helpful when it is easier to process a per-sample expectation than an expectation over a sum. To make the interchange well-defined, we instead consider an equivalent non-terminating MDP by making the EOS token a self-looping absorbing state with zero reward. Assuming convergence, we may write $\mathbb{E}\big[\sum_{t=1}^{\infty} (\cdot)\big] = \sum_{t=1}^{\infty} \mathbb{E}\big[(\cdot)\big]$, and if the inner terms are zero after $t = T$, the infinite sums reduce to finite sums up to $T$. Then, if our processing is zero-preserving, we can safely swap the expectation and the summation, process the per-sample expectations, and swap them back into an expectation over a finite sum. ↩
-
The original DeepSeekMath paper does introduce a process reward version of GRPO, but it is very unintuitive and not widely used in practice. PRIME also proposed a process reward variant of GRPO, but the intuition is still lacking. The main difficulty is that it is hard to define a meaningful notion of a group at the intermediate-step level, as multiple responses each take different intermediate steps. One possible solution is to sample a tree of responses, where each node corresponds to a partial response shared by multiple complete responses. However, this quickly becomes complicated and computationally expensive. For now, it seems that sequence-level rewards are the most natural fit for GRPO. ↩