In the previous post, we introduced off-policy policy gradient methods that are widely used for LLM training, especially GRPO. In this post, we will discuss several tweaks and improvements to GRPO that have been proposed in recent literature.
Some fixes are specific to GRPO, while others are more general and apply to other off-policy policy gradient methods as well. Throughout this post, to keep things simple, we will build upon the following objective:

$$\mathcal{J}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)} \left[ \sum_{t=1}^{|y|} \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\theta_{\text{old}}}(y_t \mid x, y_{<t})}\, \hat{A}_t \right]$$
This is the basic off-policy policy gradient objective in the token-level MDP setup. Adding appropriate clipping, KL penalty, group sampling and averaging, or length normalization to this objective will give us PPO or GRPO.
Also, recall that the clipping function is defined as:

$$\operatorname{clip}(r,\, 1-\epsilon,\, 1+\epsilon) = \min\big(\max(r,\, 1-\epsilon),\, 1+\epsilon\big)$$
Mitigating Unintended Off-Policiness
The first group of tweaks targets unintended off-policiness in the training pipeline. Although our baseline formulation already includes IS ratios to correct for off-policiness, there remain sources of off-policiness that have received little attention but can cause significant issues in practice, leading to high variance and instability in training. We therefore need to be careful about them.
Training-Inference Mismatch
One major source of off-policiness is the mismatch between the training and inference policies. Most existing LLM-RL frameworks use highly optimized inference engines (e.g., vLLM or SGLang) for collecting rollouts, while using separate training engines (e.g., FSDP or Megatron-LM) for computing losses and gradients. The inference engines employ various optimization techniques, such as speculative decoding, low-precision computation, or batch-variant CUDA kernels, to achieve high throughput and low latency. In contrast, the training engines prioritize numerical stability and ease of gradient computation, and therefore do not share the same optimizations. As a result, the policy that generates the rollouts is numerically different from the policy used in training, which can cause significant off-policiness in the training process.
To be more specific, the training pipeline usually consists of the following three steps:
- training a policy $\pi_\theta$ with FSDP or Megatron-LM etc.,
- collecting online rollouts with an optimized inference engine like vLLM or SGLang, whose policy we denote by $\mu$,
- but computing the off-policy logprobs from $\pi_{\theta_{\text{old}}}$, the training policy at the beginning of each iteration.
As a result, what we are actually computing (LHS) and what we should be computing (RHS) become different:

$$\mathbb{E}_{y \sim \mu}\!\left[ \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\theta_{\text{old}}}(y_t \mid x, y_{<t})}\, \hat{A}_t \right] \;\neq\; \mathbb{E}_{y \sim \mu}\!\left[ \frac{\pi_\theta(y_t \mid x, y_{<t})}{\mu(y_t \mid x, y_{<t})}\, \hat{A}_t \right]$$
Therefore, we need to build our algorithms upon the RHS. However, naively applying clipping to the RHS may cause issues, especially regarding batch size. When we apply clipping to the RHS, the IS ratios become batch-variant, because the denominator is different for different batches. This can lead to significant batch size sensitivity in training, as the effective learning rate can vary significantly across batches due to the varying IS ratios. This was first observed by Hilton et al. (2022), and they propose a simple solution to this issue by decoupling the IS ratios into two parts: one for the rollout-train mismatch, and the other for the policy update. The first part is not clipped, while the second part is clipped:

$$\underbrace{\frac{\pi_{\theta_{\text{old}}}(y_t \mid x, y_{<t})}{\mu(y_t \mid x, y_{<t})}}_{\text{rollout-train mismatch, not clipped}} \cdot\; \underbrace{\operatorname{clip}\!\left( \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\theta_{\text{old}}}(y_t \mid x, y_{<t})},\; 1-\epsilon,\; 1+\epsilon \right)}_{\text{policy update, clipped}}$$
This way, we can mitigate the batch size sensitivity issue caused by clipping, while still correcting for the off-policiness caused by the training-inference mismatch.
This fix was first thoroughly discussed by Yao et al. (2025) 1. In their work, they also find that applying truncated importance sampling (TIS) to the first part works the best:

$$\min\!\left( \frac{\pi_{\theta_{\text{old}}}(y_t \mid x, y_{<t})}{\mu(y_t \mid x, y_{<t})},\; C \right) \cdot \operatorname{clip}\!\left( \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\theta_{\text{old}}}(y_t \mid x, y_{<t})},\; 1-\epsilon,\; 1+\epsilon \right)$$

where $C$ is a truncation threshold.
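To make the decoupling concrete, here is a minimal PyTorch-style sketch of the combined fix. The tensor names, the interface, and the `tis_cap` default are illustrative assumptions, not taken from any particular framework:

```python
import torch

def decoupled_pg_loss(logp_train, logp_old, logp_rollout, advantages,
                      eps=0.2, tis_cap=2.0):
    """Decoupled off-policy loss sketch: a truncated, unclipped ratio corrects
    the rollout-train mismatch, while the usual clipped PPO ratio handles the
    policy update. All tensors are per-token, shape [num_tokens]."""
    # Part 1: rollout-train mismatch ratio pi_old / mu, truncated (TIS), no gradient.
    with torch.no_grad():
        mismatch = torch.exp(logp_old - logp_rollout).clamp(max=tis_cap)

    # Part 2: standard PPO-style clipped ratio pi_theta / pi_old.
    ratio = torch.exp(logp_train - logp_old)
    ppo_term = torch.minimum(ratio * advantages,
                             torch.clamp(ratio, 1 - eps, 1 + eps) * advantages)

    # Negate: we minimize the loss to maximize the objective.
    return -(mismatch * ppo_term).mean()
```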
Routing Replay for MoE Models
For MoE models, the policy is not determined only by the dense parameters, but also by the router decisions. It is useful to write the policy as depending explicitly on the routing masks:

$$\pi_\theta(y_t \mid x, y_{<t}) \;=\; \pi\big(y_t \mid x, y_{<t};\; \theta,\; M_\theta(x, y_{<t})\big)$$

where $M_\theta(\cdot)$ denotes the expert masks produced by the routers.
The usual token-level IS ratio is then implicitly

$$r_t(\theta) \;=\; \frac{\pi\big(y_t \mid x, y_{<t};\; \theta,\; M_\theta\big)}{\pi\big(y_t \mid x, y_{<t};\; \theta_{\text{old}},\; M_{\theta_{\text{old}}}\big)}$$
The problem is that $\pi_\theta$ and $\pi_{\theta_{\text{old}}}$ may activate different experts for the same token. Worse, the rollout was actually sampled using an inference engine, so the training ratio may be comparing probabilities computed under routing masks that differ from the masks used to generate the data. In other words, the ratio is pretending to compare policies, while the router quietly swaps the subnetworks underneath it. This is the main MoE-specific instability that motivates Routing Replay 2,3.
Routing Replay fixes this by caching a routing mask $\bar{M}$ and using the same mask when computing both the numerator and the denominator of the IS ratio. Then the training objective (without fixing the training-inference mismatch for logprob computation) becomes:

$$\mathcal{J}_{\text{Replay}}(\theta) = \mathbb{E}_{y \sim \pi_{\theta_{\text{old}}}}\left[ \sum_{t} \frac{\pi\big(y_t \mid x, y_{<t};\; \theta,\; \bar{M}\big)}{\pi\big(y_t \mid x, y_{<t};\; \theta_{\text{old}},\; \bar{M}\big)}\, \hat{A}_t \right]$$
There are two natural choices of $\bar{M}$. Vanilla Routing Replay (R2) 2 replays the old-policy routing mask computed in the training engine:

$$\bar{M} = M_{\theta_{\text{old}}}$$
This mainly reduces policy-staleness effects, because both the old and current policies are evaluated using the old training-engine routing pattern.
Rollout Routing Replay (R3) 3 instead replays the routing mask used by the inference engine during rollout:

$$\bar{M} = M_{\mu}$$
This directly aligns the training forward pass with the rollout computation. Implementation-wise, R3 does not freeze the router logits completely. It reuses only the binary routing mask from inference, but still computes the gate weights from the training router logits. Let $z_{t,e}^{(l)}$ and $m_{t,e}^{(l)} \in \{0, 1\}$ denote the router logits and the replayed expert mask for token $t$ at layer $l$ and expert $e$. Then the replayed gate weights are given by a softmax restricted to the replayed expert set:

$$g_{t,e}^{(l)} = \frac{m_{t,e}^{(l)}\, \exp\big(z_{t,e}^{(l)}\big)}{\sum_{e'} m_{t,e'}^{(l)}\, \exp\big(z_{t,e'}^{(l)}\big)}$$
Then the replayed MoE output is

$$\text{MoE}^{(l)}\big(h_t^{(l)}\big) = \sum_{e} g_{t,e}^{(l)}\, E_e^{(l)}\big(h_t^{(l)}\big)$$
where $E_e^{(l)}$ is the expert function for expert $e$ at layer $l$, and $h_t^{(l)}$ is the input to the router at layer $l$ for token $t$. So the discrete expert choice is replayed, but gradients can still flow through the training router logits inside the selected expert set. This avoids changing the activated sparse network between rollout and training, without completely removing router learning.
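As a rough sketch of this replayed gating (function and tensor names are assumptions; real MoE layers add top-k routing, capacity limits, shared experts, etc.):

```python
import torch
import torch.nn.functional as F

def replayed_moe_output(router_logits, replay_mask, expert_outputs):
    """R3-style gating sketch: reuse the binary expert mask cached at rollout
    time, but recompute gate weights from the *training* router logits, so
    gradients still flow into the router for the selected experts.

    router_logits:  [tokens, experts]          training-engine logits z
    replay_mask:    [tokens, experts]          0/1 expert mask m from inference
    expert_outputs: [tokens, experts, hidden]  stacked expert outputs E_e(h)
    """
    # Masked softmax over the replayed expert set: non-selected experts -> -inf.
    masked_logits = router_logits.masked_fill(replay_mask == 0, float("-inf"))
    gates = F.softmax(masked_logits, dim=-1)  # replayed gate weights g
    # Combine expert outputs with the replayed gates.
    return torch.einsum("te,teh->th", gates, expert_outputs)
```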
Routing Replay creates bias in the IS ratios, since the target policy is no longer the naturally routed policy, but rather the replay-routed policy. However, this routing gives a much more stable and meaningful IS ratio. In practice, this tradeoff is often worthwhile for MoE RL, especially under off-policy mini-batch reuse, where routing changes can otherwise make token-level ratios extremely noisy 4.
Biased KL Estimator in GRPO
Recall that the original GRPO paper uses the k3 approximator for the KL-penalty term, which is given as:

$$\hat{k}_3 = \frac{\pi_{\text{ref}}(y_t \mid x, y_{<t})}{\pi_\theta(y_t \mid x, y_{<t})} - \log \frac{\pi_{\text{ref}}(y_t \mid x, y_{<t})}{\pi_\theta(y_t \mid x, y_{<t})} - 1$$
This is an unbiased estimator of $D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}})$ when the samples are drawn from $\pi_\theta$ itself, with low variance near $\pi_\theta \approx \pi_{\text{ref}}$. However, if we sample from a different distribution $\mu$, then this estimator becomes biased. To correct for this bias, we can use importance sampling to get an unbiased estimator 5: for a general target $D_{\mathrm{KL}}(q \,\|\, p)$ estimated with samples $y \sim \mu$,

$$\hat{k}_3^{\mathrm{IS}}(y) = \frac{q(y)}{\mu(y)} \left( \frac{p(y)}{q(y)} - \log \frac{p(y)}{q(y)} - 1 \right)$$
In our case, we should use this with $q = \pi_\theta$, $p = \pi_{\text{ref}}$, and $\mu$ set to the actual sampling distribution (e.g., $\pi_{\theta_{\text{old}}}$ or the inference-engine policy). Notably, DeepSeek-V3.2 adopts this fix as well.
Moreover, this IS-corrected, off-policy k3 estimator is also the correct choice at the gradient level. That is, even if we were using the original k3 estimator in the on-policy case, where $\mu = \pi_\theta$ and the off-policy corrections are negligible, the value estimates would be unbiased, but the gradients would not be. If we take the gradient of $D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}})$ with respect to the parameters of $\pi_\theta$, we should be careful, since there are two contributions: $\pi_\theta$ appears both as the sampling distribution and inside the log-ratio. This leads to a common pitfall: if the estimates use on-policy samples, the gradients are not guaranteed to be unbiased, even if the values are. 6 For instance, the correct gradient of $D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}})$ is given by:

$$\nabla_\theta D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}}) = \mathbb{E}_{y \sim \pi_\theta}\left[ \log \frac{\pi_\theta(y)}{\pi_{\text{ref}}(y)}\; \nabla_\theta \log \pi_\theta(y) \right]$$
but if we use the k3 estimator with on-policy samples and differentiate it as a loss, the gradient is given by:

$$\mathbb{E}_{y \sim \pi_\theta}\left[ \left( 1 - \frac{\pi_{\text{ref}}(y)}{\pi_\theta(y)} \right) \nabla_\theta \log \pi_\theta(y) \right]$$
which is a wrong estimator for $\nabla_\theta D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}})$. So the k3 estimator is unbiased as a value, but differentiating it as a loss gives us the gradient of a different objective. In fact, it is an unbiased estimator for the gradient of the reverse KL $D_{\mathrm{KL}}(\pi_{\text{ref}} \,\|\, \pi_\theta)$:

$$\mathbb{E}_{y \sim \pi_\theta}\left[ \left( 1 - \frac{\pi_{\text{ref}}(y)}{\pi_\theta(y)} \right) \nabla_\theta \log \pi_\theta(y) \right] = \nabla_\theta D_{\mathrm{KL}}(\pi_{\text{ref}} \,\|\, \pi_\theta)$$
In contrast, differentiating the IS-corrected k3 estimator gives us an IS-corrected estimator of the correct gradient.
Hence, we can observe that using the IS-corrected k3 estimator is important for getting the correct estimates, both in terms of values and gradients.
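A minimal sketch of the IS-corrected k3 penalty in PyTorch. The tensor names are illustrative assumptions; all inputs are per-token log-probs gathered on the sampled tokens:

```python
import torch

def k3_kl_penalty(logp_train, logp_ref, logp_sampler):
    """IS-corrected k3 estimate of KL(pi_theta || pi_ref) under samples from mu.

    logp_train:   log pi_theta(y_t),  requires grad
    logp_ref:     log pi_ref(y_t),    frozen reference
    logp_sampler: log mu(y_t),        actual sampling distribution
    """
    log_ratio = logp_ref - logp_train            # log (pi_ref / pi_theta)
    k3 = torch.exp(log_ratio) - log_ratio - 1.0  # plain k3 term

    # IS weight pi_theta / mu, deliberately NOT detached: differentiating the
    # full product is what yields the correct gradient of KL(pi_theta || pi_ref).
    is_weight = torch.exp(logp_train - logp_sampler.detach())
    return (is_weight * k3).mean()
```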
Loss Aggregation
The second group of tweaks are about how to aggregate the token-level objectives into a sequence-level objective. GRPO's loss aggregation can introduce some unintended biases and issues in training, which can be mitigated by using different ways of aggregating the token-level objectives.
Token-Level Objectives
Recall that when GRPO aggregates the token-level objectives, it introduces an unintuitive length normalization $\frac{1}{|y_i|}$:

$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \min\Big( r_{i,t}(\theta)\, \hat{A}_{i,t},\; \operatorname{clip}\big(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_{i,t} \Big) \right]$$

where $r_{i,t}(\theta) = \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t} \mid x, y_{i,<t})}$.
However, to match the original off-policy policy gradient and PPO objectives, the inner expectation term should simply be an average of the PPO objectives over the group. Therefore, the correct way to aggregate the loss should exclude the length normalization, as proposed in Dr. GRPO 7:

$$\mathcal{J}_{\text{Dr.GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \sum_{t=1}^{|y_i|} \min\Big( r_{i,t}(\theta)\, \hat{A}_{i,t},\; \operatorname{clip}\big(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_{i,t} \Big) \right]$$
The authors of Dr. GRPO additionally argue that without this fix, GRPO is implicitly biased towards short correct responses and long incorrect responses: dividing by $|y_i|$ amplifies positive advantages on short responses and dilutes negative advantages on long ones. This is not desirable, as the whole point of test-time scaling is to generate long, correct responses without wasting tokens. This also aligns with some early observations of DeepSeek-R1, where the model tended to generate excessively long responses with low-quality, repetitive patterns.
A similar fix is also proposed by DAPO 8, which gives us a slightly different objective, where we normalize by the total number of tokens in the group instead of normalizing each response by its own length. This way, we can still maintain the length normalization while avoiding the bias on response lengths. The objective of DAPO is given by:

$$\mathcal{J}_{\text{DAPO}}(\theta) = \mathbb{E}\left[ \frac{1}{\sum_{i=1}^{G} |y_i|} \sum_{i=1}^{G} \sum_{t=1}^{|y_i|} \min\Big( r_{i,t}(\theta)\, \hat{A}_{i,t},\; \operatorname{clip}\big(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_{i,t} \Big) \right]$$
Though DAPO is still biased as it includes the length normalization, modern implementations of GRPO mostly use DAPO-style loss aggregation as default. This is probably because without length normalization, the magnitude of the loss can vary significantly across responses with different lengths, which can lead to instability in training. DAPO's way of normalizing by the total number of tokens in the group can mitigate this issue while still avoiding the bias on response lengths.
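The three aggregation schemes differ only in their normalizers, which a small sketch makes explicit. The shapes and the padding-mask convention are assumptions:

```python
import torch

def aggregate(token_loss, mask, mode):
    """Aggregate per-token objective values for one group of G padded responses.

    token_loss: [G, T] per-token terms (e.g., clipped PPO terms)
    mask:       [G, T] 1.0 for real tokens, 0.0 for padding
    """
    if mode == "grpo":      # per-response mean, then group mean: length-biased
        return ((token_loss * mask).sum(-1) / mask.sum(-1)).mean()
    if mode == "dr_grpo":   # plain token sum per response, then group mean
        return (token_loss * mask).sum(-1).mean()
    if mode == "dapo":      # normalize by the total token count of the group
        return (token_loss * mask).sum() / mask.sum()
    raise ValueError(f"unknown mode: {mode}")
```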
Sequence-Level Objectives
However, in GRPO-style advantage estimation, it is actually more natural to take a sequence-level view. Currently we calculate advantages at the sequence level, but simply broadcast those advantages to all tokens equally. This is conceptually awkward, because the GRPO advantage measures better-than-average at the sequence level, not the token level. In fact, even outside GRPO, any sequence-level advantage estimator would have the same issue. Therefore, when we are using sequence-level advantages, it is more natural to view the entire sequence as a single decision and compute the IS ratios at the sequence level as well. This is essentially treating LLM generation as a bandit problem, not a token-level MDP.
With this sequence-level view, the off-policy policy gradient objective should be:

$$\mathcal{J}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)}\left[ \frac{\pi_\theta(y \mid x)}{\pi_{\theta_{\text{old}}}(y \mid x)}\, \hat{A} \right], \qquad \frac{\pi_\theta(y \mid x)}{\pi_{\theta_{\text{old}}}(y \mid x)} = \prod_{t=1}^{|y|} \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\theta_{\text{old}}}(y_t \mid x, y_{<t})}$$
However, naively applying clipping and computing gradients on this sequence-level objective is not a good idea, because the IS ratios have very high variance and large numerical range at sequence-level. Therefore, we need to be careful about how to apply clipping and how to compute gradients. 1
GSPO 2 proposes a practical way to apply clipping at the sequence level: first add length normalization, then apply sequence-level clipping on the length-normalized IS ratios. The objective of GSPO is given by:

$$\mathcal{J}_{\text{GSPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \min\Big( s_i(\theta)\, \hat{A}_i,\; \operatorname{clip}\big(s_i(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_i \Big) \right], \qquad s_i(\theta) = \left( \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\text{old}}}(y_i \mid x)} \right)^{1/|y_i|}$$
Ignoring the clipping, this results in using the geometric mean of the token-level IS ratios as the sequence-level IS ratio. Note that the original GRPO objective without clipping is equivalent to using the arithmetic mean of the token-level IS ratios as the sequence-level IS ratio. Hence, GSPO can also be viewed as a variant of GRPO with a different way of aggregating the token-level IS ratios.
But in some scenarios, we may want to use a mixture of sequence-level and token-level advantages, such as certain multi-turn dialogue settings where we want to have turn-level advantages as well. The authors of GSPO introduce a token-level variant of GSPO as well, which is called GSPO-Token, where we can apply clipping at token-level and use token-level advantages.
We can derive it by first computing the gradient of GSPO without clipping:

$$\nabla_\theta \big( s_i(\theta)\, \hat{A}_i \big) = \hat{A}_i\, s_i(\theta)\, \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \nabla_\theta \log \pi_\theta(y_{i,t} \mid x, y_{i,<t})$$
Then, applying clipping to the IS ratios gives us the objective of GSPO-Token:

$$\mathcal{J}_{\text{GSPO-Token}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \min\Big( s_{i,t}(\theta)\, \hat{A}_{i,t},\; \operatorname{clip}\big(s_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_{i,t} \Big) \right]$$

where $s_{i,t}(\theta) = \operatorname{sg}\big[s_i(\theta)\big] \cdot \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\operatorname{sg}\big[\pi_\theta(y_{i,t} \mid x, y_{i,<t})\big]}$ and $\operatorname{sg}[\cdot]$ denotes stop-gradient.
Since we have more fine-grained control over the IS ratios at token-level, it is safer to add customized token-level advantages as well, instead of using the same sequence-level advantage for all tokens. 2
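The stop-gradient structure of GSPO-Token is easy to get wrong, so here is a sketch of both ratios (tensor names and shapes are assumptions):

```python
import torch

def gspo_ratios(logp_train, logp_old, mask):
    """Length-normalized sequence ratio s_i (GSPO) and its token-level variant
    (GSPO-Token), whose value equals s_i at every token but whose gradient
    flows through the token-level log-prob.

    logp_train, logp_old: [G, T] token log-probs; mask: [G, T] padding mask
    """
    lengths = mask.sum(-1)
    # Geometric-mean sequence ratio: exp of the mean token log-ratio.
    seq_ratio = torch.exp(((logp_train - logp_old) * mask).sum(-1) / lengths)

    # GSPO-Token: sg[s_i] * pi_theta(y_t) / sg[pi_theta(y_t)]. Numerically this
    # is just s_i, but its gradient is s_i * grad log pi_theta(y_t).
    tok_ratio = seq_ratio.detach().unsqueeze(-1) * torch.exp(
        logp_train - logp_train.detach())
    return seq_ratio, tok_ratio
```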
But notably, a subsequent work 4 from the same authors shows that the sequence-level objective is in fact already very close to the token-level objective (where we broadcast the sequence-level advantage to all tokens), so using the token-level objective with sequence-level advantages is not really a problem. Writing $r_{i,t} = 1 + \delta_{i,t}$ and using the first-order approximation when the $\delta_{i,t}$'s are small, we have:

$$s_i(\theta) = \left( \prod_{t=1}^{|y_i|} r_{i,t}(\theta) \right)^{1/|y_i|} \approx\, 1 + \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \delta_{i,t} \,\approx\, \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} r_{i,t}(\theta), \qquad \text{and hence}\quad \mathcal{J}_{\text{GSPO}}(\theta) \simeq \mathcal{J}_{\text{GRPO}}(\theta)$$
The symbol $\simeq$ means that the two objectives have the same gradient with respect to $\theta$ (to first order), and are therefore equivalent for optimization. Hence, we can see that the original GRPO-style token-level objective is already a very good approximation of the correct sequence-level objective at the gradient level. Therefore, we can just stick to the original GRPO-style token-level objective, without worrying too much about how to interpret the advantages.
Clipping Only the IS Ratios
The last group of tweaks concerns how to apply clipping to the IS ratios. While PPO and GRPO originally apply clipping to the entire product of IS ratios and advantages, some recent algorithms apply clipping only to the IS ratios instead. This idea is motivated especially in the context of LLM training, and is currently considered a promising direction for improving the stability and performance of LLM-RL algorithms.
CISPO
This idea was proposed in CISPO as part of the training recipe for MiniMax-M1 9. In standard PPO or GRPO-style objectives, clipping is applied to the full update term. When the ratio goes outside the clipping range, the clipped branch can produce zero gradient with respect to the token log-probability. This is intentional from a trust-region perspective, but it can be problematic in LLM reasoning training.
The issue is especially visible for rare but important tokens. Reasoning models often improve by increasing the probability of tokens that trigger reflection or correction, such as Wait, Hmm, Actually, or Aha. These tokens may have very low probability under the old policy, so their IS ratios can become large as soon as the current policy starts assigning them more probability. Under ordinary PPO/GRPO clipping, these updates may be clipped away, which means exactly the tokens we want the model to learn from can stop receiving useful gradients. Humanity invented a training signal and then carefully deleted it. Very elegant.
To avoid this, CISPO starts from the off-policy token-level policy gradient:

$$\nabla_\theta \mathcal{J}(\theta) = \mathbb{E}_{y \sim \pi_{\theta_{\text{old}}}}\left[ \sum_{t} r_t(\theta)\, \hat{A}_t\, \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t}) \right]$$
CISPO then clips the IS weight itself, while keeping the log-probability term differentiable:

$$\mathcal{J}_{\text{CISPO}}(\theta) = \mathbb{E}_{y \sim \pi_{\theta_{\text{old}}}}\left[ \sum_{t} \operatorname{sg}\Big[ \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon_{\text{low}},\, 1+\epsilon_{\text{high}}\big) \Big]\, \hat{A}_t\, \log \pi_\theta(y_t \mid x, y_{<t}) \right]$$
The important difference is that clipping no longer removes the token-level learning signal. The clipped ratio only controls the scale of the update, while the gradient still flows through $\log \pi_\theta$. Thus, CISPO behaves more like a variance-controlled policy gradient estimator than a pessimistic PPO-style lower-bound objective.
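A sketch of the CISPO-style loss with the stop-gradient on the clipped weight (the clip ranges and tensor names are illustrative):

```python
import torch

def cispo_loss(logp_train, logp_old, advantages, eps_low=0.2, eps_high=0.2):
    """CISPO sketch: clip the IS weight itself and stop its gradient, so the
    update scale is bounded but every token keeps a nonzero learning signal."""
    with torch.no_grad():  # sg[.]: the weight only rescales the update
        weight = torch.clamp(torch.exp(logp_train - logp_old),
                             1 - eps_low, 1 + eps_high)
    # REINFORCE-style term: gradient flows through log pi_theta even when clipped.
    return -(weight * advantages * logp_train).mean()
```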
Truncated Importance Sampling (TIS)
A closely related variant is truncated importance sampling (TIS) 10. Instead of clipping both sides of the ratio, we truncate only the upper tail:

$$\mathcal{J}_{\text{Tok-TIS}}(\theta) = \mathbb{E}_{y \sim \mu}\left[ \sum_{t} \operatorname{sg}\Big[ \min\big(r_t(\theta),\, C\big) \Big]\, \hat{A}_t\, \log \pi_\theta(y_t \mid x, y_{<t}) \right]$$

where the ratio is now computed against the rollout policy, $r_t(\theta) = \frac{\pi_\theta(y_t \mid x, y_{<t})}{\mu(y_t \mid x, y_{<t})}$.
This keeps the same CISPO-style stop-gradient structure, but focuses only on preventing extremely large ratios from dominating the update. Intuitively, this is useful when the main danger is not that a token has become too unlikely under the current policy, but that a small number of tokens have huge ratios and therefore create high-variance gradients. This style of loss is found to be very effective and stable in scaling up LLM-RL training, according to recent studies.
So far, importance sampling has been applied at the token level. However, token-level correction does not fully correct the mismatch between the rollout distribution and the current policy distribution. Therefore, it is also natural to apply importance sampling at the sequence level. This gives:

$$\mathcal{J}_{\text{Seq}}(\theta) = \mathbb{E}_{y \sim \mu}\left[ \operatorname{sg}\big[\rho(y)\big]\, \hat{A}\, \sum_{t} \log \pi_\theta(y_t \mid x, y_{<t}) \right], \qquad \rho(y) = \prod_{t=1}^{|y|} \frac{\pi_\theta(y_t \mid x, y_{<t})}{\mu(y_t \mid x, y_{<t})}$$
Then adding truncation to the sequence-level IS ratio gives the sequence-level TIS objective:

$$\mathcal{J}_{\text{Seq-TIS}}(\theta) = \mathbb{E}_{y \sim \mu}\left[ \operatorname{sg}\Big[ \min\big(\rho(y),\, C\big) \Big]\, \hat{A}\, \sum_{t} \log \pi_\theta(y_t \mid x, y_{<t}) \right]$$
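A sketch of Seq-TIS with one truncated weight per response (names, shapes, and the cap value are assumptions):

```python
import torch

def seq_tis_loss(logp_train, logp_rollout, advantages, mask, cap=2.0):
    """Sequence-level TIS sketch: a single truncated IS weight per response,
    applied to a REINFORCE-style term over that response's tokens.

    logp_train, logp_rollout, mask: [G, T]; advantages: [G]
    """
    with torch.no_grad():
        seq_log_ratio = ((logp_train - logp_rollout) * mask).sum(-1)  # log rho(y)
        weight = torch.exp(seq_log_ratio).clamp(max=cap)              # min(rho, C)
    seq_logp = (logp_train * mask).sum(-1)
    return -(weight * advantages * seq_logp).mean()
```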
Masking Sequences
Masked Importance Sampling (MIS)
TIS still gives some weight to sequences with very large ratios. If $\rho(y) > C$, the sequence is still used, just with weight $C$. This is reasonable if high-ratio samples are merely rare but valid. However, in LLM-RL, extremely large ratios may also indicate that the sample is outside the trustworthy overlap between the rollout policy and the training policy. This can happen due to stale rollouts, training-inference mismatch, or simply because long-horizon generation found a bizarre corner of the distribution.
This motivates masked importance sampling (MIS) 10. Instead of softly clipping high-ratio sequences, MIS rejects them entirely. The strict sequence-level IS version is

$$\mathcal{J}_{\text{Seq-MIS}}(\theta) = \mathbb{E}_{y \sim \mu}\left[ \operatorname{sg}\Big[ \rho(y)\, \mathbb{1}\{\rho(y) \le C\} \Big]\, \hat{A}\, \sum_{t} \log \pi_\theta(y_t \mid x, y_{<t}) \right]$$
This can be interpreted as a hard trust region: if the sequence ratio is too large, the sample is treated as unreliable and removed from the update. Compared with Seq-TIS, this sacrifices sample efficiency, since rejected samples provide no gradient. The benefit is robustness: if high-ratio samples are actually OOD or corrupted by training-inference mismatch, clipping them to $C$ still lets them affect the update, while masking removes them completely.
Geometric Sequence Masking
In practice, one may also use a mask-only version of Seq-MIS as a safety wrapper around another objective:

$$\mathcal{J}(\theta) = \mathbb{E}_{y \sim \mu}\Big[ \mathbb{1}\{\rho(y) \le C\} \cdot \mathcal{L}_{\text{base}}(y;\, \theta) \Big]$$
This is no longer a full IS estimator, but rather a practical filtering heuristic: we first check whether the sequence is too far off-policy, and only then apply the base loss. However, a problem with sequence-level masking is that the raw sequence ratio is length-dependent, since it is a product of token-level ratios. Even if each per-token ratio is only slightly larger than $1$, the product can become very large for long responses. This means that a fixed threshold $C$ can systematically reject long reasoning chains, even when the average per-token mismatch is small. For reasoning models, this is a fairly terrible failure mode, as the model can be punished not because the reasoning trajectory is bad, but because it is long.
Geometric sequence masking 10 fixes this by normalizing the log-ratio by the sequence length:

$$M_{\text{geo}}(y) = \mathbb{1}\left\{ C_{\text{low}} \le \rho(y)^{1/|y|} \le C_{\text{high}} \right\}, \qquad \rho(y)^{1/|y|} = \exp\left( \frac{1}{|y|} \sum_{t=1}^{|y|} \log \frac{\pi_\theta(y_t \mid x, y_{<t})}{\mu(y_t \mid x, y_{<t})} \right)$$
Intuitively, Geo-Mask filters out sequences whose average per-token log IS ratio is too extreme. Unlike Seq-MIS, it does not penalize a sequence merely for being long: a 200-token answer and a 20,000-token answer are judged by the same average per-token divergence criterion, which is much more appropriate for long-CoT or agentic reasoning RL.
Importantly, Geo-Mask by itself is only a filtering rule. To recover a valid IS estimator, the mask should be placed in front of an objective that already contains a proper IS correction. In that case, the mask defines which samples are trusted, while the underlying IS estimator still determines how accepted samples are weighted. In fact, DeepSeek-V3.2 applies this mask in front of GRPO, though only with an upper constraint, to remove sequences whose sampling policy and training policy have diverged too much. The original authors of the RL-collapse analysis similarly suggest combining geometric sequence masking with Tok-TIS.
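A sketch of the geometric mask as a wrapper (the thresholds are illustrative hyperparameters, not values from any cited work):

```python
import torch

def geo_mask(logp_train, logp_rollout, mask, low=0.99, high=1.01):
    """Accept a response only if the geometric mean of its token-level IS
    ratios lies in [low, high]; averaging makes the test length-independent."""
    with torch.no_grad():
        mean_log_ratio = ((logp_train - logp_rollout) * mask).sum(-1) / mask.sum(-1)
        geo_ratio = torch.exp(mean_log_ratio)
        return (geo_ratio >= low) & (geo_ratio <= high)  # [G] boolean mask
```

The resulting boolean mask would then zero out the per-sequence losses of rejected responses before aggregation, while the underlying (IS-corrected) objective handles the weighting of accepted ones.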
Dynamic Sampling
Dynamic Sampling is another practical trick proposed by DAPO 8. In GRPO, if all responses in a group receive the same reward, then all group-relative advantages become zero. For binary correctness rewards, this happens when the group accuracy is either $0$ or $1$. These prompts therefore contribute no useful policy-gradient signal, but still occupy batch space, because naturally we must pay compute to learn nothing.
DAPO handles this by over-sampling prompts and filtering out groups whose accuracy is exactly $0$ or $1$. Equivalently, only groups with at least one correct and at least one incorrect response are kept:

$$0 \;<\; \big|\{\, y_i : y_i \text{ is correct} \,\}\big| \;<\; G$$
This keeps the number of prompts with nonzero relative advantages more consistent across batches. Although it requires sampling extra groups, DAPO reports that it can still improve wall-clock efficiency because training needs fewer effective update steps.
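As a sketch, the filter is a one-line predicate on the group rewards; the `sample_group` callable is an assumed stand-in for prompt sampling plus rollout plus reward computation:

```python
def informative(rewards):
    """Keep a group only if it mixes correct and incorrect responses
    (binary rewards assumed), i.e., 0 < group accuracy < 1."""
    acc = sum(rewards) / len(rewards)
    return 0.0 < acc < 1.0

def fill_batch(sample_group, num_prompts):
    """Over-sample prompts until we have num_prompts informative groups."""
    kept = []
    while len(kept) < num_prompts:
        rewards = sample_group()  # e.g., list of 0/1 rewards for one group
        if informative(rewards):
            kept.append(rewards)
    return kept
```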
Misc
Clip-Higher
Clip-Higher is a simple asymmetric clipping strategy from DAPO 8. Instead of using the same clipping range on both sides, it uses a larger upper clipping threshold:

$$\operatorname{clip}\big(r_t(\theta),\; 1-\epsilon_{\text{low}},\; 1+\epsilon_{\text{high}}\big), \qquad \epsilon_{\text{high}} > \epsilon_{\text{low}}$$
The motivation is that the upper clipping bound can make it difficult to increase the probability of low-probability but high-reward tokens. In long-CoT RL, these low-probability tokens may correspond to exploratory reasoning patterns, so clipping them too aggressively can reduce diversity and lead to entropy collapse. By increasing only the upper clipping range, Clip-Higher gives the policy more room to upweight useful exploratory tokens, while still keeping the lower clipping range relatively tight for stability.
Removing STD Normalization
Another small but important change is to remove the standard-deviation normalization in GRPO advantages, as proposed by Dr. GRPO 7. Standard GRPO computes group-relative advantages by subtracting the group mean and dividing by the group standard deviation, but Dr. GRPO removes the standard-deviation term:

$$\hat{A}_i = \frac{r_i - \operatorname{mean}\big(\{r_j\}_{j=1}^{G}\big)}{\operatorname{std}\big(\{r_j\}_{j=1}^{G}\big)} \quad\longrightarrow\quad \hat{A}_i = r_i - \operatorname{mean}\big(\{r_j\}_{j=1}^{G}\big)$$
The intuition is that dividing by the group standard deviation changes the effective scale of the update depending on the reward distribution inside each sampled group. This can introduce another source of optimization bias, especially when rewards are sparse or nearly binary. Removing the STD normalization makes the advantage closer to a plain reward-minus-baseline estimator, which is simpler and more interpretable. DeepSeek-V3.2 also adopts this style of advantage normalization.
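In code, the change is a single line. A sketch; the epsilon guard against zero-variance groups is an implementation assumption:

```python
import torch

def group_advantages(rewards, use_std=False):
    """Group-relative advantages for one prompt's G responses.
    use_std=True reproduces standard GRPO; False is the Dr. GRPO variant."""
    adv = rewards - rewards.mean()
    if use_std:
        adv = adv / (rewards.std() + 1e-6)  # guard against all-equal rewards
    return adv
```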
Removing the KL Penalty
Another common simplification in modern LLM RL is to remove the explicit KL penalty against a frozen reference policy. The motivation is that KL regularization plays a different role in RLHF and reasoning RL. In RLHF, which was the beginning of most LLM RL work, the reward model is learned from data near the reference policy, so staying close to the reference helps avoid reward-model extrapolation. However, as modern LLM RL has shifted towards verifier-based reasoning, the reward is often more robust to distribution shift, so the KL penalty may be less necessary. Moreover, long-CoT reasoning may require the policy to move far from the initial model, producing longer traces, new reflection patterns, and different exploration behavior. A frozen reference can therefore become overly restrictive.
Numerous recent works, including DAPO 8, Dr. GRPO 7, and Open-Reasoner-Zero 11, all report that removing the KL penalty gives the best stability and final performance. This no-KL choice has also become common in frontier open models and training frameworks.
However, removing KL entirely is not the only reasonable choice. ProRL 12 argues that for prolonged RL, especially when starting from a strong distilled checkpoint, some KL control can help maintain entropy and stability. Instead of using a permanently frozen reference, ProRL periodically resets the reference policy to a recent online-policy checkpoint:

$$\pi_{\text{ref}} \;\leftarrow\; \pi_{\theta_k} \quad \text{(every fixed number of steps)}$$
This gives a moving trust region. The policy is still discouraged from making abrupt updates, but it is not permanently tied to an outdated reference model.
This can also be seen as a mirror-descent-style update 13:

$$\pi_{k+1} = \arg\max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big] - \beta\, D_{\mathrm{KL}}\big(\pi \,\|\, \pi_k\big)$$
where $\pi_k$ plays the role of $\pi_{\text{ref}}$ in the original KL penalty formulation. This is a more principled way to derive the moving-reference KL penalty, which can be viewed as a trust-region-style update that encourages the policy to stay close to its recent self, rather than to an outdated reference.
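A sketch of the periodic reference reset; the interval is an illustrative choice, not a value from the paper:

```python
def maybe_reset_reference(policy, ref_policy, step, reset_every=200):
    """Moving-reference sketch: every reset_every steps, copy the online
    policy's weights into the frozen reference used by the KL penalty."""
    if step > 0 and step % reset_every == 0:
        ref_policy.load_state_dict(policy.state_dict())
```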
To conclude, removing KL gives the policy maximum freedom and is often effective for verifier-based reasoning RL, while moving/adaptive KL methods try to preserve stability without over-constraining exploration. The best choice may depend on the specific task, reward model, and training setup, so it is worth experimenting with different KL strategies.
References
Footnotes
- Interestingly, DeepSeek-V2, DeepSeek-V3, and even DeepSeek-R1 (!) all used this naive sequence-level clipping, different from the original GRPO in DeepSeekMath. But DeepSeek-V3.2 reverted back to the original version (token-level loss aggregation with length normalization). They didn't explain why, and some guess that it was just a notation error in the report, but anyway, this naive sequence-level clipping is not a good idea. ↩
- GSPO and GSPO-Token are both originally proposed to use GRPO-style group relative advantages, so the full version of the objectives should include group sampling and group averaging as well, but here we present the single-sample versions for simplicity. ↩