RL for LLMs I: The Token-Level MDP and Off-Policy Policy Gradients

December 21, 2025

2025 was a breakout year for RL-based methods in LLM training. For a while, 'RL for LLMs' basically meant RLHF, largely following the ChatGPT-replication journey and its alignment-focused pipeline. Then OpenAI's o1 and DeepSeek-R1 helped prove a different point: on tasks with verifiable answers, RL can seriously boost an LLM's reasoning capability, not just polish its vibes. Around the same time, RL also became a tool for shaping agentic behavior, especially the ability to reliably call tools and execute multi-turn workflows.

With those trends converging, the community spent much of 2025 discussing and experimenting with RL algorithms tailored to token-level LLM training, with GRPO as the main attraction. This two-part post is my 'don't forget this' summary of what came out of this year. Part 1 lays out the basics of off-policy policy gradients, how GRPO emerged, and why it started to dominate PPO in practice. Part 2 surveys the many training tweaks and variants people proposed and debated throughout the year.

We’ll start with the token-level MDP framing and off-policy policy gradients, since that’s the common spine behind PPO, GRPO, and most of their descendants.

Token-Level MDPs

Training LLMs with RL stems from the success of RLHF [1, 2], which optimizes LLMs using human feedback. These methods typically model the LLM as a policy in a token-level Markov decision process (MDP). Let the LLM be a stochastic policy $\pi_\theta$, parameterized by $\theta$, that maps token histories to distributions over the vocabulary. The state space $\mathcal{S}$ consists of all possible token sequences, while the action space $\mathcal{A}$ is the vocabulary itself.

Given a prompt dataset $\mathcal{D}$, a prompt $x \in \mathcal{D}$ initializes the episode. The model then generates a response autoregressively,

$$y \sim \pi_\theta(\cdot \mid x) \iff y_t \sim \pi_\theta(\cdot \mid x, y_{<t}), \quad t = 1, 2, \ldots, T,$$

where generation terminates when $y_T$ is the end-of-sequence (EOS) token (see Footnote 1).

At time step $t$, the state $s_t = (x, y_{\leq t}) \in \mathcal{S}$ is the prompt concatenated with the partial response, and the action $a_t = y_{t+1} \in \mathcal{A}$ corresponds to selecting the next token. Upon taking action $a_t$, the agent transitions deterministically to $s_{t+1} = (x, y_{\leq t+1})$ and receives a reward $r(s_t, a_t) = r(x, y_{\leq t+1})$, where $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is a heuristic reward function or a learned reward model.

In practice, the reward structure is typically sparse: $r(x, y_{\leq t}) = 0$ for $t < T$, with a single terminal reward $r(x, y_{\leq T})$ at the end of the episode. To facilitate learning, we may also design intermediate rewards (also called process rewards) for $t < T$, though this remains an active area of research.

The overall learning objective is to maximize the expected discounted sum of rewards under the policy:

$$\operatorname*{maximize}_{\theta} \; \mathbb{E}_{\substack{x \sim \mathcal{D} \\ y \sim \pi_\theta(\cdot \mid x)}}\left[\sum_{t=1}^{\lvert y \rvert}\gamma^{t-1} r(x, y_{\leq t})\right].$$
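As a sanity check on this objective, the inner discounted sum for a single sampled response is straightforward to compute. A minimal sketch in plain Python, with made-up reward values and a sparse terminal reward:

```python
def discounted_return(rewards, gamma=1.0):
    """Compute sum_{t=1}^{|y|} gamma^(t-1) * r(x, y_{<=t}) for one response."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Sparse reward structure: zero for t < T, a single terminal reward at t = T.
rewards = [0.0, 0.0, 0.0, 1.0]  # |y| = 4
print(discounted_return(rewards, gamma=0.9))  # 0.9**3 * 1.0 = 0.729
```

In practice the outer expectation is estimated by averaging such returns over sampled prompts and responses.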

Policy Gradient Methods

A popular way to optimize the policy is policy gradient methods. The key idea is to compute an estimate of the gradient of the objective with respect to the policy parameters $\theta$, and then perform gradient ascent. Since an LLM is itself a parameterized stochastic policy, it is natural to apply policy gradient methods to LLM training. Using the policy gradient theorem, we can derive the gradient estimator as follows. (For more details, see Section 13.1 of Sutton & Barto's book or this nice blog post by Lilian Weng!)

$$\nabla_\theta J_\theta = \mathbb{E}_{\substack{x \sim \mathcal{D} \\ y \sim \pi_\theta(\cdot \mid x)}}\left[\sum_{t=1}^{\lvert y \rvert}\nabla_\theta\log\pi_\theta(y_t \mid x, y_{<t})\,A(x, y_{\leq t})\right]$$

This is the basic idea of REINFORCE [3]. Here, $A$ is an advantage function, which may be any function of the following form:

$$A(x, y_{\leq t}) = \hat{Q}(x, y_{\leq t}) - b(x, y_{<t}) \quad \text{s.t.} \quad \mathbb{E}\bigl[\hat{Q}\bigr] = Q^{\pi_\theta}.$$

As stated, $\hat{Q}$ must be an estimator of the action-value function $Q^{\pi_\theta} : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. More specifically, it must satisfy:

$$\mathbb{E}\Bigl[\hat{Q}(x, y_{\leq t}) \Bigm\vert x, y_{\leq t}\Bigr] = Q^{\pi_\theta}(x, y_{\leq t}) \coloneqq \mathbb{E}_{y_{>t} \sim \pi_\theta(\cdot \mid x, y_{\leq t})}\left[\sum_{t'=t}^{\lvert y \rvert}\gamma^{t'-t} r(x, y_{\leq t'}) \Biggm\vert x, y_{\leq t}\right], \quad \forall \, \underbrace{(x, y_{<t}) \in \mathcal{S},\, y_t \in \mathcal{A}}_{\Leftrightarrow \; (x, y_{\leq t}) \in \mathcal{S}}.$$

$b$ is a baseline, which may be any function that does not depend on the action $y_t$. Its purpose is to reduce the variance of the gradient estimator without introducing bias. A near-optimal variance minimizer is the state-value function $V^{\pi_\theta} : \mathcal{S} \to \mathbb{R}$, defined as:

$$V^{\pi_\theta}(x, y_{<t}) \coloneqq \mathbb{E}_{y_{\geq t} \sim \pi_\theta(\cdot \mid x, y_{<t})}\left[\sum_{t'=t}^{\lvert y \rvert}\gamma^{t'-t} r(x, y_{\leq t'}) \Biggm\vert x, y_{<t}\right].$$
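The claim that an action-independent baseline introduces no bias can be checked exactly on a toy softmax policy, since $\sum_a \pi(a)\,\nabla\log\pi(a) = 0$ at every state. A minimal sketch in plain Python (the parameter values and the baseline are made up for illustration):

```python
import math

# Toy softmax policy over a 3-token vocabulary at a single state.
theta = [0.2, -0.5, 1.0]

def probs(th):
    z = [math.exp(v) for v in th]
    s = sum(z)
    return [v / s for v in z]

def grad_log_pi(th, a):
    """For a softmax policy, d/d theta_k log pi(a) = 1[k == a] - pi(k)."""
    p = probs(th)
    return [(1.0 if k == a else 0.0) - p[k] for k in range(len(th))]

b = 0.7  # any baseline that does not depend on the action

# Exact expectation of the baseline term b * grad log pi: zero in every
# coordinate, so subtracting b changes variance but not the expected gradient.
p = probs(theta)
bias = [sum(p[a] * b * grad_log_pi(theta, a)[k] for a in range(3))
        for k in range(3)]
print(max(abs(x) for x in bias))  # ~0 (up to floating-point error)
```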

At this point, one might imagine using $Q^{\pi_\theta}$ and $V^{\pi_\theta}$ directly to define the advantage function. However, both are unknown and must be estimated from samples. To this end, the actor-critic method trains a value network $V_\phi$, parameterized by $\phi$, alongside the policy to approximate $V^{\pi_\theta}$. Since $Q^{\pi_\theta}$ can be expressed in terms of $V^{\pi_\theta}$, we may then use $V_\phi$ to construct various estimators of the advantage function. Some common choices include:

  • Baselined MC return (low bias, high variance):
    $$A^{\textsf{MC}}(x, y_{\leq t}) = \sum_{t'=t}^{\lvert y \rvert}\gamma^{t'-t} r(x, y_{\leq t'}) - V_\phi(x, y_{<t})$$
  • TD residual (high bias, low variance):
    $$A^{\textsf{TD}}(x, y_{\leq t}) = r(x, y_{\leq t}) + \gamma V_\phi(x, y_{<t+1}) - V_\phi(x, y_{<t})$$
  • $n$-step TD residual (discrete control of the bias-variance tradeoff via a hyperparameter $n \in \{1, 2, \ldots, \lvert y \rvert - t + 1\}$; note that $A^{\textsf{TD}} = A^{\textsf{TD}(1)}$ and $A^{\textsf{MC}} = A^{\textsf{TD}(\lvert y \rvert - t + 1)}$):
    $$A^{\textsf{TD}(n)}(x, y_{\leq t}) = \sum_{t'=t}^{t+n-1}\gamma^{t'-t} r(x, y_{\leq t'}) + \gamma^{n} V_\phi(x, y_{<t+n}) - V_\phi(x, y_{<t})$$
  • Generalized Advantage Estimation (GAE) [4] (continuous control of the bias-variance tradeoff via a hyperparameter $\lambda \in [0, 1]$; note that $A^{\textsf{TD}} = A^{\textsf{GAE}(0)}$ and $A^{\textsf{MC}} = A^{\textsf{GAE}(1)}$):
    $$\begin{aligned} A^{\textsf{GAE}(\lambda)}(x, y_{\leq t}) &= (1 - \lambda)\sum_{n=1}^{\lvert y \rvert - t}\lambda^{n-1}A^{\textsf{TD}(n)}(x, y_{\leq t}) + \lambda^{\lvert y \rvert - t}A^{\textsf{TD}(\lvert y \rvert - t + 1)}(x, y_{\leq t}) \\ &= \sum_{t'=t}^{\lvert y \rvert}(\gamma\lambda)^{t'-t}A^{\textsf{TD}}(x, y_{\leq t'}) \end{aligned}$$

(For more details and insights behind advantage estimation choices and GAE, see this blog post by Daniel Seita.)
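All of these estimators can be computed in a single backward pass over the sampled response using the recursion $A^{\textsf{GAE}(\lambda)}_t = \delta_t + \gamma\lambda\, A^{\textsf{GAE}(\lambda)}_{t+1}$, where $\delta_t$ is the TD residual. A minimal sketch in plain Python (the reward and value numbers are made up, and the value after EOS is taken as zero):

```python
def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one response.
    values[t] approximates V(x, y_{<t}); the value after EOS is 0."""
    T = len(rewards)
    adv = [0.0] * T
    last = 0.0
    for t in reversed(range(T)):
        next_v = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_v - values[t]  # TD residual
        last = delta + gamma * lam * last                # GAE recursion
        adv[t] = last
    return adv

rewards = [0.0, 0.0, 1.0]   # sparse terminal reward
values  = [0.4, 0.5, 0.8]   # made-up value predictions

# lam = 1 recovers the baselined Monte Carlo return, lam = 0 the TD residual.
print(gae(rewards, values, gamma=1.0, lam=1.0))  # [0.6, 0.5, 0.2]
```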

The policy network $\pi_\theta$ and the value network $V_\phi$ are typically trained alternately. The value network is trained to minimize the mean squared error between its predictions and the bootstrapped returns (equivalently, the sum of the advantage estimate and the predicted value), while the policy network is updated using the policy gradient estimator with the chosen advantage function.

Off-Policy Policy Gradient Methods

In many RL environments, and especially for LLMs, generating on-policy trajectories is computationally expensive and time-consuming, since it requires running the LLM to generate responses for each prompt in the dataset. To improve efficiency, we may generate responses for many prompts in parallel, more than the amount we would use for a single model update. However, this introduces a distribution mismatch between the current policy $\pi_\theta$ and the behavior policy $\pi_0$ that generated the responses, so we must correct the policy gradient estimator with importance sampling ratios. This is where off-policy policy gradient methods come into play. Using importance sampling, we can rewrite the policy gradient estimator as follows:

$$\begin{aligned} \nabla_\theta J_\theta &= \mathbb{E}_{\substack{x \sim \mathcal{D} \\ y_1 \sim \pi_\theta(\cdot \mid x) \\ y_2 \sim \pi_\theta(\cdot \mid x, y_1) \\ \ldots}}\left[\sum_{t=1}^{\lvert y \rvert}\nabla_\theta\log\pi_\theta(y_t \mid x, y_{<t})\,A(x, y_{\leq t})\right] \\ &= \mathbb{E}_{\substack{x \sim \mathcal{D} \\ y_1 \sim \pi_\theta(\cdot \mid x) \\ \textcolor{red}{y_1' \sim \pi_0(\cdot \mid x)} \\ y_2 \sim \pi_\theta(\cdot \mid x, y_1) \\ \textcolor{red}{y_2' \sim \pi_0(\cdot \mid x, y_1)} \\ \textcolor{red}{\ldots}}}\left[\sum_{t=1}^{\lvert y \rvert}\textcolor{red}{\frac{\pi_\theta(y_t' \mid x, y_{<t})}{\pi_0(y_t' \mid x, y_{<t})}}\nabla_\theta\log\pi_\theta(\textcolor{red}{y_t'} \mid x, y_{<t})\,A(x, y_{<t}, \textcolor{red}{y_t'})\right] \\ &\underset{(*)}{\approx} \mathbb{E}_{\substack{x \sim \mathcal{D} \\ y_1 \sim \pi_0(\cdot \mid x) \\ y_2 \sim \pi_0(\cdot \mid x, y_1) \\ \ldots}}\left[\sum_{t=1}^{\lvert y \rvert}\frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_0(y_t \mid x, y_{<t})}\nabla_\theta\log\pi_\theta(y_t \mid x, y_{<t})\,A(x, y_{\leq t})\right] \\ &= \nabla_\theta\,\mathbb{E}_{\substack{x \sim \mathcal{D} \\ y \sim \pi_0(\cdot \mid x)}}\left[\sum_{t=1}^{\lvert y \rvert}\frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_0(y_t \mid x, y_{<t})}A(x, y_{\leq t})\right] \end{aligned}$$

The importance sampling ratios are inserted in a somewhat unnatural way: we maintain on-policy samples $(y_1, y_2, \ldots, y_T)$ generated from the current policy $\pi_\theta$, but each per-token expectation is taken over an off-policy sample $y_t'$ drawn from the behavior policy $\pi_0$, conditioned on the on-policy history $(x, y_{<t})$.

After this adjustment, we may apply the approximation $(*)$, which replaces the on-policy history with the off-policy history entirely. This is valid only when $\pi_\theta \approx \pi_0$, so that the distributions are close enough for the difference in histories to be negligible. It leads to a fully off-policy estimator in which all tokens are sampled from the behavior policy $\pi_0$. More generally, importance sampling estimators become inaccurate when the sampling distribution drifts too far from the nominal distribution, which is another reason we need $\pi_\theta \approx \pi_0$ for the approximation to hold in practice.

Therefore, assuming $\pi_\theta \approx \pi_0$ is maintained throughout training, we can express the overall objective function as:

$$J_\theta = \mathbb{E}_{\substack{x \sim \mathcal{D} \\ y \sim \pi_0(\cdot \mid x)}}\left[\sum_{t=1}^{\lvert y \rvert}\frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_0(y_t \mid x, y_{<t})}A(x, y_{\leq t})\right].$$
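In implementations, this objective is computed from stored log-probabilities, with the ratio obtained as $\exp(\log\pi_\theta - \log\pi_0)$ to avoid underflow on long sequences. A minimal per-sequence sketch in plain Python (the log-prob and advantage values are made up):

```python
import math

def off_policy_objective(logp_theta, logp_behavior, advantages):
    """Per-token importance-weighted surrogate:
    sum_t [pi_theta(y_t|.) / pi_0(y_t|.)] * A_t, for one response."""
    return sum(math.exp(lt - lb) * a
               for lt, lb, a in zip(logp_theta, logp_behavior, advantages))

logp_b = [-1.2, -0.7, -2.0]   # behavior policy pi_0 log-probs
logp_t = [-1.1, -0.8, -2.0]   # current policy pi_theta log-probs
adv    = [0.5, 0.5, 0.5]
print(off_policy_objective(logp_t, logp_b, adv))
```

When $\pi_\theta = \pi_0$, every ratio is 1 and the objective reduces to the plain sum of advantages.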

Practical Implementation

The overall training procedure for off-policy policy gradient methods can be summarized as follows:

Initialize: policy network $\pi_\theta$, value network $V_\phi$, corpus of prompts $\mathcal{D}$

for global epoch $e = 1, 2, \ldots, E$ do

  • for global batch $\mathcal{B} = \{x_1, x_2, \ldots, x_B\} \subseteq \mathcal{D}$ do
    • Set behavior policy network $\pi_0 \gets$ current $\pi_\theta$
    • Generate responses $y_i \sim \pi_0(\cdot \mid x_i)$ for $i = 1, 2, \ldots, B$
    • Compute off-policy log-probs $\log\pi_0(y_{i,t} \mid x_i, y_{i,<t})$ for $i = 1, \ldots, B$, $t = 1, \ldots, \lvert y_i \rvert$
    • Compute rewards $r(x_i, y_{i,\leq t})$ for $i = 1, \ldots, B$, $t = 1, \ldots, \lvert y_i \rvert$
    • Compute values $V_\phi(x_i, y_{i,<t})$ for $i = 1, \ldots, B$, $t = 1, \ldots, \lvert y_i \rvert$
    • Compute advantages $A(x_i, y_{i,\leq t})$ for $i = 1, \ldots, B$, $t = 1, \ldots, \lvert y_i \rvert$
    • for mini epoch $e' = 1, 2, \ldots, E'$ do
      • for minibatch $\mathcal{B}' = \{x_1', x_2', \ldots, x_{B'}'\} \subseteq \mathcal{B}$ do
        • Compute on-policy log-probs $\log\pi_\theta(y_{i,t}' \mid x_i', y_{i,<t}')$ for $i = 1, \ldots, B'$, $t = 1, \ldots, \lvert y_i' \rvert$
        • Update the policy network with loss:
        $$\mathcal{L}_\theta = -\frac{1}{B'}\sum_{i=1}^{B'}\sum_{t=1}^{\lvert y_i' \rvert}\frac{\pi_\theta(y_{i,t}' \mid x_i', y_{i,<t}')}{\pi_0(y_{i,t}' \mid x_i', y_{i,<t}')}A(x_i', y_{i,\leq t}')$$
      • end for
    • end for
    • Update the value network with loss (may also use multi-epoch, minibatch training):
    $$\mathcal{L}_\phi = \frac{1}{B}\sum_{i=1}^{B}\frac{1}{\lvert y_i \rvert}\sum_{t=1}^{\lvert y_i \rvert}\frac{1}{2}\Bigl([\![A(x_i, y_{i,\leq t}) + V_\phi(x_i, y_{i,<t})]\!] - V_\phi(x_i, y_{i,<t})\Bigr)^2$$
  • end for
end for

(Here, $[\![\cdot]\!]$ denotes the stop-gradient operator, equivalent to torch.Tensor.detach().)

The value network is a model that takes in an arbitrary token sequence and outputs a scalar value. In practice, it is often implemented as an LLM with a scalar head on top. If we use a reward model for reward computation (as in RLHF), this is very similar to the functionality of the reward model itself, so it is common to initialize the value network from the reward model weights [1, 5]. If we instead use a heuristic reward function and have no reward model, we usually initialize the value network from the policy network weights.

Many modern RL-LLM frameworks (e.g., TRL, verl, slime) implement their pipelines based on the above structure. Often, they also incorporate various techniques to speed up training and improve memory efficiency. For instance, rollouts may be accelerated using vLLM or SGLang. Training may be optimized using parallelism techniques such as FSDP or Megatron-LM.

Meanwhile, modern LLMs generate long reasoning traces, and RL especially amplifies this behavior, so rollouts may remain heavy and time-consuming even with aggressive optimizations. As a result, the community has been actively working on approaches that increase asynchrony, for example by performing asynchronous rollouts and asynchronous reward computation if necessary. The Prime Intellect team has been making notable progress in this direction (PRIME-RL, Verifiers).

Proximal Policy Optimization (PPO)

Now, to ensure that the assumption $\pi_\theta \approx \pi_0$ holds during training, we may need to take special care. One popular approach is proximal policy optimization (PPO) [6], widely adopted in RL for LLMs due to its simplicity and effectiveness. The core idea of PPO is to prevent large policy updates by constraining the change in the policy at each update step via trust-region constraints. The original idea comes from trust region policy optimization (TRPO) [7]; PPO simplifies the implementation as follows:

  1. Use a clipped objective as a proxy for trust-region constraints. Formally, we define the clipping function

    $$\mathcal{C}_\epsilon(\rho, A) \coloneqq \min\Bigl(\rho A, \operatorname{clip}(\rho, 1 - \epsilon, 1 + \epsilon)\,A\Bigr).$$

    $\mathcal{C}_\epsilon$ places an upper bound on $\rho$ when $A$ is positive, and a lower bound on $\rho$ when $A$ is negative. Intuitively, viewing $\rho$ as the importance sampling ratio, this prevents excessively large updates that may arise from distribution mismatch. Using this, we replace the per-token contribution to the policy objective by

    $$\frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_0(y_t \mid x, y_{<t})}A(x, y_{\leq t}) \gets \mathcal{C}_\epsilon\left(\frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_0(y_t \mid x, y_{<t})},\, A(x, y_{\leq t})\right).$$
  2. Maintain the initial policy $\pi_{\textsf{ref}}$ as a reference policy to prevent policy collapse, and add a KL-penalty to the per-token reward:

    $$r(x, y_{\leq t}) \gets r(x, y_{\leq t}) - \beta\log\frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\textsf{ref}}(y_t \mid x, y_{<t})}.$$

    Here, $\beta > 0$ is a hyperparameter that controls the strength of the penalty. After this adjustment, recompute the advantages with the chosen advantage function (GAE is the most common choice with PPO). Let $A^{\textsf{KL}}$ denote the updated advantage function, including the KL-penalty.

Finally, the resulting PPO objective becomes:

$$J_\theta = \mathbb{E}_{\substack{x \sim \mathcal{D} \\ y \sim \pi_0(\cdot \mid x)}}\left[\sum_{t=1}^{\lvert y \rvert}\mathcal{C}_\epsilon\left(\frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_0(y_t \mid x, y_{<t})},\, A^{\textsf{KL}}(x, y_{\leq t})\right)\right].$$
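The clipping function $\mathcal{C}_\epsilon$ itself is only a few lines of code. A minimal per-token sketch in plain Python:

```python
def ppo_clip(ratio, adv, eps=0.2):
    """Clipped surrogate: min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * adv, clipped * adv)

# Positive advantage: gains are capped once the ratio exceeds 1 + eps...
print(ppo_clip(1.5, 1.0))   # 1.2, not 1.5
# ...negative advantage: the objective is bounded below once it drops past 1 - eps.
print(ppo_clip(0.5, -1.0))  # -0.8, not -0.5
# Inside the trust region, the objective passes through unchanged.
print(ppo_clip(1.1, 1.0))   # 1.1
```

The outer min keeps the surrogate a pessimistic bound: improvements from large ratios are clipped, while the unclipped term is kept whenever it is worse.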

In the early days of RL for LLMs, PPO was the go-to algorithm. This was especially true during the RLHF era, where RL was primarily used for human alignment, as exemplified by ChatGPT and its many replications. During this period, relatively little attention was paid to advancing RL algorithms themselves. Most of the effort went into building better reward models, mitigating overoptimization, figuring out how to collect preference data at scale, etc. (which eventually led us to the RL vs DPO debate and beyond — here are some great articles that summarize the progress back then).

Group Relative Policy Optimization (GRPO)

However, things started to change after the success of o1 and R1. RL for LLMs began to matter again, not just for alignment, but as a way to incentivize general reasoning capabilities. A big reason this worked was that these models were trained on tasks with clear, verifiable answers. When rewards are strictly verifiable, many of the classic RLHF problems that stem from reward modeling (reward hacking, weird incentives, misalignment) largely disappear. With those less of a bottleneck, attention shifted back to the RL algorithms themselves, with the goal of maximizing scalability and stability. DeepSeek's solution was group relative policy optimization (GRPO) [8], which builds on PPO with several key modifications to take better advantage of verifiable rewards in reasoning tasks.

GRPO assumes purely sequence-level rewards (see Footnote 2), which was effectively always the case in RL for LLMs anyway. The key difference is that instead of treating the reward as given at the last token only, we treat it as given for the entire generated sequence. Accordingly, we calculate advantages at the sequence level and broadcast the same advantage to every token in the sequence. Furthermore, GRPO assumes a grouped sampling scheme: for each prompt $x$, we generate a group of $G$ responses $\mathbf{y} = \bigl(y^{(1)}, y^{(2)}, \ldots, y^{(G)}\bigr)$ using the behavior policy $\pi_0$. The key idea is to use relative rewards within the group to define a more informative advantage function. Specifically, GRPO introduces the following components:

  1. Define the advantage function $A^{\textsf{GRPO}}$ using normalized rewards within the group:

    $$A^{\textsf{GRPO}}\left(x, y^{(j)}\right) = \frac{r\left(x, y^{(j)}\right) - \mathrm{mean}\Bigl(\left\{r\left(x, y^{(1)}\right), \ldots, r\left(x, y^{(G)}\right)\right\}\Bigr)}{\mathrm{std}\Bigl(\left\{r\left(x, y^{(1)}\right), \ldots, r\left(x, y^{(G)}\right)\right\}\Bigr) + \epsilon},$$

    and broadcast it to all tokens in $y^{(j)}$. The standard advantage function $Q^{\pi_\theta} - V^{\pi_\theta}$ measures better-than-average performance; similarly, $A^{\textsf{GRPO}}$ measures how much better a response is than the other responses in the same group. Here, $\epsilon > 0$ is a small constant added for numerical stability.

  2. Since we do not have per-token rewards, we cannot directly apply the KL-penalty to the rewards as in PPO. Instead, add a KL-penalty to the per-token objective using the k3 approximator [9]:

    $$k_3(\xi; p, q) \coloneqq \frac{q(\xi)}{p(\xi)} - \log\frac{q(\xi)}{p(\xi)} - 1, \quad \text{where} \;\; \xi \sim p.$$

    This approximates the KL-divergence $D_{\mathrm{KL}}(p \,\|\, q)$. In GRPO, we use $\xi = y_t^{(j)}$ with $p = \pi_\theta\bigl(\cdot \bigm\vert x, y^{(j)}_{<t}\bigr)$ and $q = \pi_{\textsf{ref}}\bigl(\cdot \bigm\vert x, y^{(j)}_{<t}\bigr)$. You may notice a mismatch in the sampling distribution here, as $y_t^{(j)}$ is actually sampled from the behavior policy $\pi_0$ rather than the current policy $\pi_\theta$. I will discuss this subtlety in the next post.

  3. Apply sequence-level length normalization by dividing the per-token contributions by the sequence length $\lvert y^{(j)} \rvert$. This is slightly unintuitive, as the idea of an average reward is typically discussed in infinite-horizon MDPs. There has been much debate on this length normalization term in the sense of loss aggregation, which I will also discuss in the next post. For now, just note that it is part of the original GRPO formulation.
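The k3 penalty in step 2 above is a per-token computation on log-probabilities. A minimal sketch in plain Python, with $p = \pi_\theta$ and $q = \pi_{\textsf{ref}}$ as in the text (the log-prob values are made up):

```python
import math

def k3_penalty(logp_theta, logp_ref):
    """k3 estimate of KL(pi_theta || pi_ref) from one sampled token y_t:
    r - log r - 1, with r = pi_ref(y_t|.) / pi_theta(y_t|.)."""
    log_r = logp_ref - logp_theta
    return math.exp(log_r) - log_r - 1.0

print(k3_penalty(-1.0, -1.0))       # 0.0 when the policies agree on the token
print(k3_penalty(-1.0, -1.5) >= 0)  # True: k3 is non-negative by construction
```

Non-negativity follows from $e^u - u - 1 \geq 0$, which is one reason k3 is preferred over the naive single-sample estimator $\log(p/q)$, which can go negative.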

Applying all these components, the overall GRPO objective becomes:

$$J_\theta = \mathbb{E}_{\substack{x \sim \mathcal{D} \\ \mathbf{y} \sim \pi_0(\cdot \mid x)}}\left[\frac{1}{G}\sum_{j=1}^{G}\frac{1}{\lvert y^{(j)} \rvert}\sum_{t=1}^{\lvert y^{(j)} \rvert}\left(\mathcal{C}_\epsilon\left(\frac{\pi_\theta\left(y_t^{(j)} \Bigm\vert x, y_{<t}^{(j)}\right)}{\pi_0\left(y_t^{(j)} \Bigm\vert x, y_{<t}^{(j)}\right)},\, A^{\textsf{GRPO}}\left(x, y^{(j)}\right)\right) - \beta\, k_3\left(y_t^{(j)} \Bigm\vert x, y_{<t}^{(j)} ; \pi_\theta, \pi_{\textsf{ref}}\right)\right)\right].$$
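The group-relative advantage is likewise simple to compute once the $G$ verifiable rewards are in hand. A minimal sketch in plain Python (using the population standard deviation; the binary rewards are made up):

```python
import math

def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages (r_j - mean) / (std + eps): one scalar per
    response, broadcast to every token of that response by the trainer."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g)
    return [(r - mean) / (std + eps) for r in rewards]

# Binary verifiable rewards for a group of G = 4 sampled responses.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # ~[1.0, -1.0, -1.0, 1.0]
```

Note that a group with identical rewards (all correct or all incorrect) yields zero advantage for every response, and therefore contributes no gradient signal.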

Why GRPO?

Following DeepSeek's breakout success, GRPO rapidly gained attention and, in many discussions, overtook PPO as the method of choice. Yet PPO isn't dead; in fact, it still works surprisingly well. However, it's hard to ignore that much of the recent progress in RL for LLMs has been built on top of GRPO-style methods rather than on PPO itself. This shift isn't just hype: GRPO appears to offer some practical advantages that make it an appealing choice for large-scale LLM training.

The most obvious advantage of GRPO is its computational simplicity. GRPO eliminates the need for a learned value function in the advantage estimate, which means there is no separate value network $V_\phi$ to train or maintain. This reduces both compute cost and system complexity, making it easier to implement and scale.

But more interestingly, recent empirical results suggest that removing or weakening value learning may actually improve performance, not just efficiency. A notable trend is that even when PPO is used, practitioners increasingly push the GAE parameter $\lambda$ toward 1.0, effectively reducing reliance on the value function. This is reflected in results from the recent DeepSeek-R1 report published in Nature [10], which shows a clear performance ordering:

PPO vs GRPO (DeepSeek R1)

Performance of PPO and GRPO on the MATH task using DeepSeek-Coder-V2-Lite. We can observe a clear trend of GRPO > PPO ($\lambda = 1.0$) > PPO ($\lambda = 0.95$): as the influence of value learning decreases, performance improves, and removing it entirely works best.

This matches a growing consensus that value learning is hard in RL for LLMs. Researchers such as John Schulman and Ross Taylor have commented on this publicly. While the exact cause is still unclear, a common hypothesis is that bias in the learned value function harms optimization more than variance reduction helps. In other words, the errors introduced by imperfect value estimates may outweigh the benefits they provide in stabilizing policy updates.

Work such as VC-PPO [11] offers more concrete analysis of this issue. Two key takeaways stand out. First, value network initialization matters enormously, and explicit value-pretraining under a fixed initial policy may help. Second, variance reduction may not be as important in value learning as it is in policy learning. This has led to proposals like decoupled GAE, where different $\lambda$ values are used for policy updates and value updates. Even with these refinements, however, value learning remains brittle and sensitive to design choices.

GRPO avoids these issues altogether. With no value network, there’s no value bias, no delicate initialization, and no need to tune bias-variance tradeoffs. That simplicity, combined with strong empirical results, explains why GRPO has become the dominant foundation for recent RL-LLM work.

References

[1]
Learning to summarize from human feedback
Stiennon, N., Ouyang, L., Wu, J., Ziegler, D. M., Lowe, R., Voss, C., Radford, A., Amodei, D., & Christiano, P., NeurIPS, 2020.
[2]
Training language models to follow instructions with human feedback
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., & Lowe, R., NeurIPS, 2022.
[3]
Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning
Williams, R. J., Machine Learning, 8(3–4), 229–256, 1992.
[4]
High-Dimensional Continuous Control Using Generalized Advantage Estimation
Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P., ICLR, 2016.
[5]
The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization
Huang, S., Noukhovitch, M., Hosseini, A., Rasul, K., Wang, W., & Tunstall, L., COLM, 2024.
[6]
Proximal Policy Optimization Algorithms
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O., arXiv, 2017.
[7]
Trust Region Policy Optimization
Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P., ICML, 2015.
[8]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y. K., Wu, Y., & Guo, D., arXiv, 2024.
[9]
Approximating KL Divergence
Schulman, J., Personal Blog, 2023.
[10]
DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning
Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X., Zhang, X., Yu, X., Wu, Y., Wu, Z. F., Gou, Z., Shao, Z., Li, Z., Gao, Z., Liu, A., … Zhang, Z., Nature, 645(8081), 633–638, 2025.
[11]
What's Behind PPO's Collapse in Long-CoT? Value Optimization Holds the Secret
Yuan, Y., Yue, Y., Zhu, R., Fan, T., & Yan, L., arXiv, 2025.

Footnotes

  1. This formulation makes the episode length $T = \lvert y \rvert$ stochastic, a random variable dependent on the policy. As a result, one cannot directly apply the common interchange $\mathbb{E}_{y}\bigl[\sum_{t=1}^{T} \cdots\bigr] = \sum_{t=1}^{T}\mathbb{E}_{y}[\cdots]$, which is helpful when a per-sample expectation is easier to process than an expectation over a sum. To make the interchange well-defined, we instead consider an equivalent non-terminating MDP by making the EOS token a self-looping absorbing state with zero reward. Assuming convergence, we may write $\mathbb{E}_{y}\bigl[\sum_{t=1}^{\infty} \cdots\bigr] = \sum_{t=1}^{\infty}\mathbb{E}_{y}[\cdots]$, and if the inner terms are zero for $t > T$, the infinite sums reduce to finite sums up to $T$. Then, if our processing is zero-preserving, we can safely swap the expectation and the summation, process the per-sample expectations, and swap them back into an expectation over a finite sum.
  2. The original DeepSeekMath paper does introduce a process-reward version of GRPO, but it is unintuitive and not widely used in practice. PRIME also proposed a process-reward variant of GRPO, but the intuition is still lacking. The main difficulty is that it is hard to define a meaningful notion of a group at the intermediate level, as different responses take different intermediate steps. One possible solution is to sample a tree of responses, where each node corresponds to a partial response shared by multiple complete responses, but this quickly becomes complicated and computationally expensive. For now, sequence-level rewards seem to be the most natural fit for GRPO.