The Stability Gap: Why Top-K Routing Breaks RL Optimization
How Discrete Expert Selection Creates Pathological Optimization Landscapes
The Problem
Training Mixture of Experts (MoE) language models with Reinforcement Learning can be unstable. While dense LLMs have continuous and differentiable policy mappings, MoE-based models like Mixtral, DeepSeek-MoE, and Qwen-MoE introduce the Top-K operator—a discrete switching mechanism that creates discontinuities in the optimization landscape.
This discreteness introduces two fundamental mathematical pathologies that break standard RL assumptions used in PPO, GRPO, and other LLM-RL algorithms.
TL;DR: The Two Pathologies
Challenge 1: Gradient Blackout. The gradient of the token distribution $\pi_\theta(y_t | x, y_{\lt t})$ with respect to unselected experts’ logits is exactly zero almost everywhere. Unlike non-smooth convex functions where subgradients guide optimization, the Top-K landscape offers no directional information on how to switch to a better expert.
Challenge 2: First-Order Approximation Failure. Modern LLM-RL algorithms (PPO, GRPO) rely on a surrogate objective that approximates the true objective to first order. This approximation requires the policy mapping to be smooth. Top-K routing violates this—an infinitesimal parameter change can cause a discrete expert switch, making the surrogate jump discontinuously and invalidating the gradient-based optimization entirely.
| Pathology | Dense LLMs | MoE LLMs with Top-K |
|---|---|---|
| Gradient flow | Smooth, non-zero almost everywhere | Zero almost everywhere for unselected experts’ logits |
| Token distribution mapping | Continuous and differentiable | Discontinuous at routing boundaries |
| First-order approximation | Valid: $\nabla L_\mu \approx \nabla J$ | Invalid at routing boundaries |
Part 1: The Gradient Blackout
Setup: Autoregressive LLM with MoE
Consider an autoregressive language model generating a response $y = (y_1, y_2, \ldots, y_T)$ given a prompt $x$. At each timestep $t$, the model predicts the next token $y_t$ given the context:
- State: $s_t = (x, y_{\lt t})$ — the prompt concatenated with previously generated tokens
- Action: $a_t = y_t$ — the next token to generate
- Policy: $\pi_\theta(a_t | s_t) = \pi_\theta(y_t | x, y_{\lt t})$ — the token probability distribution
In an MoE transformer, each MoE layer has a router that computes logits $h \in \mathbb{R}^N$ for $N$ experts based on the hidden state. For a fixed $K \lt N$, the Top-K operator selects the indices of the $K$ largest logits:
$$\mathcal{K}(h) = \{j : h_j \text{ is among the } K \text{ largest elements of } h\}$$
The MoE layer output is:
$$\text{MoE}(z) = \sum_{j \in \mathcal{K}(h)} \frac{e^{h_j}}{\sum_{k \in \mathcal{K}(h)} e^{h_k}} E_j(z)$$
where $z$ is the hidden state, $E_j$ is expert $j$’s FFN, and $h = h(z; \theta_r)$ depends on router parameters $\theta_r$. The final token distribution $\pi_\theta(y_t | x, y_{\lt t})$ depends on outputs from all MoE layers.
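To make the mechanics concrete, here is a minimal PyTorch sketch of such a layer (the class name `TopKMoELayer`, the dimensions, and the expert FFN shape are illustrative choices, not any particular model’s implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Minimal Top-K MoE layer: router logits -> pick K experts -> softmax over
    the *selected* logits only -> weighted sum of the selected experts' FFNs."""

    def __init__(self, d_model: int, n_experts: int, k: int):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)   # produces h = h(z; theta_r)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:       # z: (batch, d_model)
        h = self.router(z)                                     # (batch, n_experts)
        top_vals, top_idx = torch.topk(h, self.k, dim=-1)      # hard, non-differentiable selection
        gates = F.softmax(top_vals, dim=-1)                    # renormalize over the K selected logits
        out = torch.zeros_like(z)
        for slot in range(self.k):                             # accumulate each token's K experts
            idx = top_idx[:, slot]
            for e in idx.unique():
                sel = idx == e
                out[sel] += gates[sel, slot:slot + 1] * self.experts[int(e)](z[sel])
        return out
```

The detail that matters for everything below: gradients flow back into `h` only through `top_vals`, i.e. the $K$ selected logits; `torch.topk` gives the unselected entries of `h` nothing.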
The Zero Gradient Problem
When training with RL, we optimize the policy $\pi_\theta(y_t | x, y_{\lt t})$ to maximize reward. Consider the gradient with respect to an unselected expert’s logit $h_u$, where $u \notin \mathcal{K}(h)$.
Step 1: Locally Constant Set. Let $h_{(K)}$ denote the $K$-th largest element of $h$, and let $e_u$ be the $u$-th standard basis vector. Assuming no ties (which holds almost everywhere), since $u \notin \mathcal{K}(h)$, we have $h_u \lt h_{(K)}$. For any scalar perturbation $\epsilon$ with $h_u + \epsilon \lt h_{(K)}$:
$$\mathcal{K}(h + \epsilon \cdot e_u) = \mathcal{K}(h)$$
The set of selected experts remains unchanged as long as $h_u$ stays below the selection threshold.
Step 2: Zero Dependency. Since $u \notin \mathcal{K}(h)$:
- Expert $E_u$’s output does not contribute to the hidden state
- The logit $h_u$ does not appear in the softmax normalization
Result: The gradient of the token probability with respect to unselected expert logits is zero:
$$\frac{\partial \pi_\theta(y_t | x, y_{\lt t})}{\partial h_u} = 0 \quad \text{almost everywhere}$$
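This is easy to verify with autograd on a toy router. The sketch below stands in for the full layer by building a scalar that depends on $h$ only through the Top-K gates, the way $\pi_\theta$ does; the shapes, seed, and scalar stand-in are arbitrary:

```python
import torch

torch.manual_seed(0)
h = torch.randn(8, requires_grad=True)      # router logits for N = 8 experts
k = 2

top_vals, top_idx = torch.topk(h, k)        # selected experts and their logits
gates = torch.softmax(top_vals, dim=-1)

# Stand-in for "expert outputs -> token probability": a scalar that depends on h
# only through the Top-K gates, exactly like pi_theta does in the MoE layer above.
expert_outputs = torch.randn(8)             # pretend each E_j(z) is a fixed scalar
loss = (gates * expert_outputs[top_idx]).sum()
loss.backward()

print(top_idx)    # the K selected experts
print(h.grad)     # (generically) nonzero at the selected indices, exactly 0.0 everywhere else
```

Only the $K$ selected positions of `h.grad` are (generically) nonzero; every unselected logit receives exactly zero, matching the statement above.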
Why Subgradients Don’t Help
Normally, we handle non-smooth points (like ReLU at 0) using subgradients. However, there’s a crucial distinction:
Non-smooth but continuous (e.g., ReLU):
- The function $f(x) = \max(0, x)$ is continuous everywhere
- At $x=0$, the subgradient $\partial f(0) = [0, 1]$ provides valid descent directions
- Optimization can proceed by choosing any element of the subdifferential
Discontinuous (Top-K):
- The selection function is discontinuous at decision boundaries
- On the plateau: The gradient is exactly $\mathbf{0}$—no signal at all
- At the cliff: Where $h_i = h_j$ for the $K$-th and $(K+1)$-th ranked experts, the output jumps discontinuously as they swap positions
At a discontinuity, the classical subgradient is not defined. The Clarke Generalized Gradient can be defined for locally Lipschitz functions, but the MoE layer output $\text{MoE}(z)$, viewed as a function of the router logits and parameters, is not locally Lipschitz at switching boundaries—it has jump discontinuities there.
Key insight: The pathology is not that gradients are “undefined” at boundaries, but rather:
- Away from boundaries: $\frac{\partial \pi_\theta(y_t | x, y_{\lt t})}{\partial h_u} = 0$ exactly (no signal)
- At boundaries: the function jumps discontinuously, so no first-order approximation is valid
Bottom line: During LLM-RL training, the router receives no gradient signal about whether switching to a different expert would generate better responses. The model cannot learn to route tokens to more suitable experts based on reward feedback.
Part 2: The First-Order Approximation Failure
The Trust Region Principle and Its Practical Approximations
The theoretical foundation of modern LLM-RL comes from Trust Region Policy Optimization (TRPO) (Schulman et al., 2015). However, practical algorithms like PPO and GRPO do not implement actual trust region optimization—they use clipping mechanisms to mimic the trust region principle. Understanding this distinction is crucial.
In the LLM setting, we use the autoregressive MDP formulation:
- State: $s_t = (x, y_{\lt t})$ — prompt plus tokens generated so far
- Action: $a_t = y_t$ — the next token
- Policy: $\pi_\theta(y_t | x, y_{\lt t})$ — the LLM’s token distribution
The Surrogate Objective and Why It Works
The key insight of TRPO is optimizing a surrogate objective $L_\mu(\pi)$ instead of the true objective $J(\pi)$ directly:
$$L_{\mu}(\pi) = J(\mu) + \mathbb{E}_{s \sim d_\mu} \mathbb{E}_{a \sim \pi(\cdot|s)} [A_\mu(s, a)]$$
where $d_\mu$ is the state visitation distribution under the sampling policy $\mu$, and $A_\mu$ is the advantage function.
This surrogate is useful because it satisfies two critical conditions at $\pi = \mu$:
- Equal values: $L_\mu(\mu) = J(\mu)$
- Equal gradients: $\nabla_\theta L_\mu|_{\pi_\theta=\mu} = \nabla_\theta J|_{\pi_\theta=\mu}$
The surrogate is a first-order Taylor approximation of the true objective—it matches both value and gradient at the point of tangency. Away from $\pi = \mu$, the approximation degrades.
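To see where the gradient-matching property comes from, differentiate the surrogate with respect to $\theta$ (a standard step from the TRPO analysis, restated here in this post’s notation; $J(\mu)$ is constant in $\theta$):
$$\nabla_\theta L_\mu(\pi_\theta) = \mathbb{E}_{s \sim d_\mu} \sum_a \nabla_\theta \pi_\theta(a|s)\, A_\mu(s, a) = \mathbb{E}_{s \sim d_\mu} \mathbb{E}_{a \sim \pi_\theta(\cdot|s)} \left[ \nabla_\theta \log \pi_\theta(a|s)\, A_\mu(s, a) \right]$$
At $\pi_\theta = \mu$ we have $d_\mu = d_{\pi_\theta}$ and $A_\mu = A_{\pi_\theta}$, so, up to the normalization convention for $d_\mu$, this is exactly the policy-gradient expression for $\nabla_\theta J$ (the policy gradient theorem). Note that every step assumes $\nabla_\theta \pi_\theta$ exists, i.e., that the map from parameters to the token distribution is differentiable; this is precisely the assumption that Top-K routing violates below.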
The TRPO Lower Bound
TRPO quantifies exactly how much the approximation degrades. The original theorem (Schulman et al., 2015) gives:
$$J(\pi) \geq L_{\mu}(\pi) - \frac{4\epsilon\gamma}{(1-\gamma)^2} \cdot (D_{TV}^{\max})^2$$
where $\epsilon = \max_{s,a}|A_\mu(s,a)|$ and $D_{TV}^{\max} = \max_s D_{TV}(\pi(\cdot|s) \| \mu(\cdot|s))$. For finite-horizon undiscounted settings ($\gamma = 1$, horizon $T$), the bound becomes:
$$J(\pi) \geq L_{\mu}(\pi) - C \cdot T^2 \cdot (D_{TV}^{\max})^2$$
The penalty scales quadratically with both horizon and TV distance because state distribution mismatch accumulates over time.
The Gap Between Theory and Practice
Here’s the critical point: PPO/GRPO do not implement this bound. They use a constant clipping factor (e.g., $\epsilon_{\text{clip}} = 0.2$, not to be confused with the advantage bound $\epsilon$ above) regardless of sequence length, while the theory requires the trust region to shrink as $O(1/T^2)$.
In practice, PPO/GRPO are best understood as stochastic gradient ascent (SGA) methods that compute a clipped gradient estimator. Li et al., 2025 analyze how mismatch between sampling policy $\mu$ and target policy $\pi$ affects optimization. Crucially, the token-level importance sampling (IS) gradient used in PPO/GRPO is exactly the gradient of the surrogate objective $\nabla_\theta L_\mu$, not the true gradient $\nabla_\theta J$:
$$\underbrace{\mathbb{E}_{s_t \sim d_\mu} \mathbb{E}_{y_t \sim \mu} \left[ \frac{\pi_\theta(y_t|s_t)}{\mu(y_t|s_t)} A_\mu(s_t, y_t) \nabla_\theta \log \pi_\theta(y_t|s_t) \right]}_{\text{Token-level IS gradient (what PPO computes)}} = \nabla_\theta L_\mu$$
$$\nabla_\theta L_\mu \neq \nabla_\theta J = \underbrace{\mathbb{E}_{s_t \sim d_\pi} \mathbb{E}_{y_t \sim \pi} \left[ A_\pi(s_t, y_t) \nabla_\theta \log \pi_\theta(y_t|s_t) \right]}_{\text{True policy gradient}}$$
where $s_t = (x, y_{\lt t})$ is the state (prompt + generated prefix) and $y_t$ is the action (next token). The token-level IS ratio $\pi_\theta(y_t|s_t)/\mu(y_t|s_t)$ corrects for the token distribution mismatch, but the expectation over states is still taken under $d_\mu$, not $d_\pi$. This prefix distribution mismatch is the source of bias, which scales with both horizon and policy divergence.
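To ground this, here is a minimal sketch of the token-level clipped surrogate as it is typically implemented in PPO/GRPO-style trainers (the function name, tensor shapes, and default `clip_eps` are illustrative; real trainers add masking, KL penalties, and advantage normalization):

```python
import torch

def clipped_surrogate_loss(logp_new: torch.Tensor,    # log pi_theta(y_t | s_t), shape (B, T)
                           logp_old: torch.Tensor,    # log mu(y_t | s_t) from the rollout, shape (B, T)
                           advantages: torch.Tensor,  # A_mu(s_t, y_t) estimates, shape (B, T)
                           clip_eps: float = 0.2) -> torch.Tensor:
    """Token-level clipped surrogate: the IS ratio corrects the per-token distribution
    mismatch, but the states s_t = (x, y_<t) were sampled under mu, so the gradient of
    this loss is (a clipped estimate of) grad L_mu, not grad J."""
    ratio = torch.exp(logp_new - logp_old)                      # pi_theta / mu per token
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()            # negate: optimizers minimize
```

The clip only bounds the per-token ratio $\pi_\theta/\mu$; nothing in this estimator observes, or can correct for, the prefix distribution mismatch, and nothing anticipates a discontinuous jump in $\pi_\theta$.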
This bias is tolerable when:
- The off-policiness is solely induced by policy parameter updates
- The mismatch $D_{TV}^{\max}$ is small and controlled (e.g., by the clipping mechanism)
This bias becomes intolerable when:
- The mismatch has diverse, uncontrolled sources (e.g., expert shifts in MoE)
- $D_{TV}^{\max}$ is large, amplifying the approximation error
The success of PPO/GRPO therefore relies on:
- The first-order approximation $\nabla_\theta L_\mu \approx \nabla_\theta J$ being valid
- The policy remaining close enough to $\mu$ that the surrogate is meaningful
- The mapping from parameters to policy being smooth
How Top-K Breaks the First-Order Approximation
Let $f: \Theta \to \Pi$ be the map from parameters $\theta \in \Theta$ to the token distribution $\pi_\theta(y_t | x, y_{\lt t}) \in \Pi$.
In dense LLMs (GPT, LLaMA, etc.): $f$ is smooth. The surrogate $L_\mu(\pi)$ is a valid first-order approximation of $J(\pi)$, and gradient-based optimization works as expected.
In MoE LLMs (Mixtral, DeepSeek-MoE, etc.): $f$ is piecewise smooth but globally discontinuous—smooth within each routing region, but with jump discontinuities at region boundaries.
At a switching point $\theta^*$ (where expert rankings swap for some token), consider a direction $v$ crossing the decision boundary:
$$\lim_{\tau \to 0^+} \pi_{\theta^* + \tau v}(y_t | x, y_{\lt t}) \neq \lim_{\tau \to 0^+} \pi_{\theta^* - \tau v}(y_t | x, y_{\lt t})$$
The first-order approximation completely fails at these boundaries:
- At the discontinuity, the gradient $\nabla_\theta J$ does not exist in the classical sense
- The surrogate $L_\mu(\pi)$ cannot provide a valid first-order approximation to a discontinuous $J(\pi)$
- The clipping mechanism in PPO/GRPO cannot help—it assumes the underlying policy mapping is smooth
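The jump is easy to exhibit numerically: nudge one router logit across the $K$-th-largest threshold and the induced token distribution moves by a finite amount for an arbitrarily small parameter change. A toy sketch (every shape and value here is made up for illustration):

```python
import torch

torch.manual_seed(0)
expert_outputs = torch.randn(4, 16)        # pretend E_j(z) for 4 experts, hidden size 16
unembed = torch.randn(16, 100)             # toy unembedding into a 100-token vocabulary
h = torch.tensor([2.0, 1.0, 0.999, -1.0])  # expert 2 sits just below the K = 2 threshold
k = 2

def token_dist(h: torch.Tensor) -> torch.Tensor:
    vals, idx = torch.topk(h, k)
    gates = torch.softmax(vals, dim=-1)
    hidden = (gates.unsqueeze(-1) * expert_outputs[idx]).sum(0)
    return torch.softmax(hidden @ unembed, dim=-1)    # stand-in for pi_theta(. | x, y_<t)

p = token_dist(h)
h_shifted = h.clone()
h_shifted[2] += 0.002                      # tiny "parameter update" crosses the routing boundary
q = token_dist(h_shifted)

print(0.5 * (p - q).abs().sum())           # total variation distance: a finite jump, not O(0.002)
```

The resulting TV distance stays bounded away from zero no matter how small the nudge, which is exactly the "infinitesimal parameter change, finite policy change" failure mode described above.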
The Consequences for LLM-RL Training
When the router crosses a decision boundary during training:
1. The Surrogate Becomes Meaningless: PPO/GRPO optimize $L_\mu(\pi)$ as a proxy for $J(\pi)$. At a discontinuity, the surrogate jumps while the gradient estimator sees only the local (pre-jump) landscape. The optimizer is effectively blind to what happens after crossing.
2. Gradient Estimates Are Invalid: The clipped gradient estimator assumes $\nabla_\theta L_\mu \approx \nabla_\theta J$. At a discontinuity, neither gradient exists in the classical sense, and the computed “gradient” points in an arbitrary direction.
3. Large, Uncontrolled Approximation Error: When the router switches experts, the effective $D_{TV}^{\max}$ (per-token TV distance) can be large—the output distribution changes discretely, not continuously. The TRPO bound shows the surrogate-to-objective gap scales as $O(T^2 \cdot (D_{TV}^{\max})^2)$. When $D_{TV}^{\max}$ jumps due to expert switching, this creates a regime where the gradient estimator is systematically wrong, pushing optimization toward incorrect solutions. This may contribute to the training instability observed when training MoE LLMs with RL.
Part 3: Implications for MoE LLM-RL
Why LLM-RL with MoE is Hard
The combination of these two pathologies creates a perfect storm for RL training:
Exploration is blind: The router receives no gradient signal for unselected experts. When generating response $y$ to prompt $x$, the model cannot learn whether routing tokens to different experts would produce higher-reward responses.
Exploitation is unstable: When the optimizer does find a beneficial switch point, crossing it can cause instability due to the first-order approximation failure. This may manifest as reward spikes followed by degradation during RL training.
The optimization landscape is adversarial: Flat plateaus (zero gradient) punctuated by cliffs (discontinuities) with no smooth paths between expert configurations. The model gets stuck in suboptimal routing patterns.
Potential Solutions for MoE LLM-RL
Understanding these pathologies suggests directions for solutions:
For the gradient blackout:
- Soft routing (e.g., softmax over all experts) restores gradient flow but sacrifices sparsity and inference speed
- Auxiliary losses that provide signal to unselected experts (e.g., load balancing with gradient flow)
- Exploration bonuses for trying different expert combinations during rollouts
For the first-order approximation failure:
- Entropy regularization on the router to smooth the routing distribution
- Annealing from soft to hard routing during RL training (see the sketch after this list)
- Modified KL constraints that account for discrete expert switches
- Freezing the router during RL (sacrificing routing adaptation)
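As one illustration of the annealing idea (which also subsumes the soft-routing option from the previous list), the sketch below routes with a full softmax whose temperature is annealed during RL training and only collapses to the Top-K support once the temperature is low. The function name, the `hard_below` threshold, and the schedule are assumptions for illustration, not a published recipe:

```python
import torch
import torch.nn.functional as F

def annealed_routing_weights(h: torch.Tensor, k: int,
                             temperature: float, hard_below: float = 0.1) -> torch.Tensor:
    """Routing weights over all N experts.

    High temperature: dense softmax, so every entry of h receives gradient (no blackout).
    Low temperature: softmax restricted to the Top-K support, recovering sparse routing."""
    if temperature > hard_below:
        return F.softmax(h / temperature, dim=-1)         # soft: all experts active
    vals, idx = torch.topk(h, k, dim=-1)
    masked = torch.full_like(h, float("-inf")).scatter(-1, idx, vals)
    return F.softmax(masked, dim=-1)                      # hard: matches the Top-K gating above

# Example schedule: anneal from soft to hard over the course of RL training.
h = torch.randn(8, requires_grad=True)
for temperature in (2.0, 0.5, 0.05):
    w = annealed_routing_weights(h, k=2, temperature=temperature)
    print(temperature, (w > 0).sum().item())              # 8, 8, then 2 experts with nonzero weight
```

A straight-through variant (hard Top-K in the forward pass, soft softmax gradient in the backward pass) is another commonly discussed way to keep some router gradient while staying sparse, at the cost of a biased gradient estimate.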
Summary
The instability of RL training for MoE LLMs is not a bug to be fixed with hyperparameter tuning—it’s a fundamental consequence of the Top-K operator’s mathematical properties:
| Property | Effect on LLM-RL Optimization |
|---|---|
| Discrete expert selection | Zero gradient for unselected experts—no signal for improving routing |
| Jump discontinuities at boundaries | Large $D_{TV}^{\max}$ when experts switch, causing $O(T^2 \cdot (D_{TV}^{\max})^2)$ approximation error |
| First-order approximation failure | Surrogate $L_\mu$ invalid at discontinuities—gradient estimates systematically wrong |
| No gradient signal for switching | Cannot learn which expert would generate better tokens |
Until routing mechanisms are developed that preserve gradient information while maintaining sparsity, training MoE LLMs with RL will remain fundamentally more challenging than training dense LLMs.
References
Mixture of Experts:
- Shazeer, N., et al. (2017). “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.” ICLR.
- Fedus, W., Zoph, B., & Shazeer, N. (2022). “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.” JMLR.
Trust Region Methods:
- Schulman, J., et al. (2015). “Trust Region Policy Optimization.” ICML.
- Schulman, J., et al. (2017). “Proximal Policy Optimization Algorithms.” arXiv.
LLM-RL Analysis:
- Li, Y., Liu, J., et al. (2025). “Why Mismatch Breaks LLM-RL.” Blog.
Non-smooth Optimization:
- Clarke, F. H. (1990). Optimization and Nonsmooth Analysis. SIAM.
Last updated: December 7, 2025