The Stability Gap: Why Top-K Routing Breaks RL Optimization
How Discrete Expert Selection Creates Pathological Optimization Landscapes
The Problem
Training Mixture of Experts (MoE) language models with Reinforcement Learning can be unstable. While dense LLMs have continuous and differentiable policy mappings, MoE-based models like Mixtral, DeepSeek-MoE, and Qwen-MoE introduce the Top-K operator—a discrete switching mechanism that creates discontinuities in the optimization landscape.
This discreteness introduces two fundamental mathematical pathologies that break standard RL assumptions used in PPO, GRPO, and other LLM-RL algorithms.
TL;DR: The Two Pathologies
Challenge 1: Gradient Blackout. The gradient of the token distribution $\pi_\theta(y_t | x, y_{\lt t})$ with respect to unselected experts’ logits is exactly zero almost everywhere. Unlike non-smooth convex functions where subgradients guide optimization, the Top-K landscape offers no directional information on how to switch to a better expert.
Challenge 2: First-Order Approximation Failure. Modern LLM-RL algorithms (PPO, GRPO) rely on a surrogate objective that approximates the true objective to first order. This approximation requires the policy mapping to be smooth. Top-K routing violates this—an infinitesimal parameter change can cause a discrete expert switch, making the surrogate jump discontinuously and invalidating the gradient-based optimization entirely.
| Pathology | Dense LLMs | MoE LLMs with Top-K |
|---|---|---|
| Gradient flow | Smooth, non-zero almost everywhere | Zero almost everywhere for unselected experts’ logits |
| Token distribution mapping | Continuous and differentiable | Discontinuous at routing boundaries |
| First-order approximation | Valid: $\nabla L_\mu \approx \nabla J$ | Invalid at routing boundaries |
Part 1: The Gradient Blackout
Setup: Autoregressive LLM with MoE
Consider an autoregressive language model generating a response $y = (y_1, y_2, \ldots, y_T)$ given a prompt $x$. At each timestep $t$, the model predicts the next token $y_t$ given the context:
- State: $s_t = (x, y_{\lt t})$ — the prompt concatenated with previously generated tokens
- Action: $a_t = y_t$ — the next token to generate
- Policy: $\pi_\theta(a_t | s_t) = \pi_\theta(y_t | x, y_{\lt t})$ — the token probability distribution
In an MoE transformer, each MoE layer has a router that computes logits $h \in \mathbb{R}^N$ for $N$ experts based on the hidden state. For a fixed $K \lt N$, the Top-K operator selects the indices of the $K$ largest logits:
$$\mathcal{K}(h) = \{j : h_j \text{ is among the } K \text{ largest elements of } h\}$$
The MoE layer output is:
$$\text{MoE}(z) = \sum_{j \in \mathcal{K}(h)} \frac{e^{h_j}}{\sum_{k \in \mathcal{K}(h)} e^{h_k}} E_j(z)$$
where $z$ is the hidden state, $E_j$ is expert $j$’s FFN, and $h = h(z; \theta_r)$ depends on router parameters $\theta_r$. The final token distribution $\pi_\theta(y_t | x, y_{\lt t})$ depends on outputs from all MoE layers.
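To make the mechanics concrete, here is a minimal PyTorch sketch of such a layer (the class name `TopKMoELayer`, the dimensions, and the expert FFN shape are illustrative choices, not any particular model’s implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Minimal Top-K MoE layer: router logits -> pick K experts -> softmax over
    the *selected* logits only -> weighted sum of the selected experts' FFNs."""

    def __init__(self, d_model: int, n_experts: int, k: int):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)   # produces h = h(z; theta_r)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:       # z: (batch, d_model)
        h = self.router(z)                                     # (batch, n_experts)
        top_vals, top_idx = torch.topk(h, self.k, dim=-1)      # hard, non-differentiable selection
        gates = F.softmax(top_vals, dim=-1)                    # renormalize over the K selected logits
        out = torch.zeros_like(z)
        for slot in range(self.k):                             # accumulate each token's K experts
            idx = top_idx[:, slot]
            for e in idx.unique():
                sel = idx == e
                out[sel] += gates[sel, slot:slot + 1] * self.experts[int(e)](z[sel])
        return out
```

The detail that matters for everything below: gradients flow back into `h` only through `top_vals`, i.e. the $K$ selected logits; `torch.topk` gives the unselected entries of `h` nothing.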
The Zero Gradient Problem
When training with RL, we optimize the policy $\pi_\theta(y_t | x, y_{\lt t})$ to maximize reward. Consider the gradient with respect to an unselected expert’s logit $h_u$, where $u \notin \mathcal{K}(h)$.
Step 1: Locally Constant Set. Let $h_{(K)}$ denote the $K$-th largest element of $h$, and let $e_u$ be the $u$-th standard basis vector. Assuming no ties (which holds almost everywhere), since $u \notin \mathcal{K}(h)$, we have $h_u \lt h_{(K)}$. For any scalar perturbation $\epsilon$ with $h_u + \epsilon \lt h_{(K)}$:
$$\mathcal{K}(h + \epsilon \cdot e_u) = \mathcal{K}(h)$$
The set of selected experts remains unchanged as long as $h_u$ stays below the selection threshold.
Step 2: Zero Dependency. Since $u \notin \mathcal{K}(h)$:
- Expert $E_u$’s output does not contribute to the hidden state
- The logit $h_u$ does not appear in the softmax normalization
Result: The gradient of the token probability with respect to unselected expert logits is zero:
$$\frac{\partial \pi_\theta(y_t | x, y_{\lt t})}{\partial h_u} = 0 \quad \text{almost everywhere}$$
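This is easy to verify with autograd on a toy router. The sketch below stands in for the full layer by building a scalar that depends on $h$ only through the Top-K gates, the way $\pi_\theta$ does; the shapes, seed, and scalar stand-in are arbitrary:

```python
import torch

torch.manual_seed(0)
h = torch.randn(8, requires_grad=True)      # router logits for N = 8 experts
k = 2

top_vals, top_idx = torch.topk(h, k)        # selected experts and their logits
gates = torch.softmax(top_vals, dim=-1)

# Stand-in for "expert outputs -> token probability": a scalar that depends on h
# only through the Top-K gates, exactly like pi_theta does in the MoE layer above.
expert_outputs = torch.randn(8)             # pretend each E_j(z) is a fixed scalar
loss = (gates * expert_outputs[top_idx]).sum()
loss.backward()

print(top_idx)    # the K selected experts
print(h.grad)     # (generically) nonzero at the selected indices, exactly 0.0 everywhere else
```

Only the $K$ selected positions of `h.grad` are (generically) nonzero; every unselected logit receives exactly zero, matching the statement above.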
Why Subgradients Don’t Help
Normally, we handle non-smooth points (like ReLU at 0) using subgradients. However, there’s a crucial distinction:
Non-smooth but continuous (e.g., ReLU):
- The function $f(x) = \max(0, x)$ is continuous everywhere
- At $x=0$, the subgradient $\partial f(0) = [0, 1]$ provides valid descent directions
- Optimization can proceed by choosing any element of the subdifferential
Discontinuous (Top-K):
- The selection function is discontinuous at decision boundaries
- On the plateau: The gradient is exactly $\mathbf{0}$—no signal at all
- At the cliff: Where $h_i = h_j$ for the $K$-th and $(K+1)$-th ranked experts, the output jumps discontinuously as they swap positions
At a discontinuity, the classical subgradient is not defined. The Clarke Generalized Gradient can be defined for locally Lipschitz functions, but the MoE layer output $\text{MoE}(z)$, viewed as a function of the router logits and parameters, is not locally Lipschitz at switching boundaries—it has jump discontinuities there.
Key insight: The pathology is not that gradients are “undefined” at boundaries, but rather:
- Away from boundaries: $\frac{\partial \pi_\theta(y_t | x, y_{\lt t})}{\partial h_u} = 0$ exactly (no signal)
- At boundaries: the function jumps discontinuously, so no first-order approximation is valid
Bottom line: During LLM-RL training, the router receives no gradient signal about whether switching to a different expert would generate better responses. The model cannot learn to route tokens to more suitable experts based on reward feedback.
Part 2: The First-Order Approximation Failure
The Trust Region Principle and Its Practical Approximations
The theoretical foundation of modern LLM-RL comes from Trust Region Policy Optimization (TRPO) (Schulman et al., 2015). However, practical algorithms like PPO and GRPO do not implement actual trust region optimization—they use clipping mechanisms to mimic the trust region principle. Understanding this distinction is crucial.
In the LLM setting, we use the autoregressive MDP formulation:
- State: $s_t = (x, y_{\lt t})$ — prompt plus tokens generated so far
- Action: $a_t = y_t$ — the next token
- Policy: $\pi_\theta(y_t | x, y_{\lt t})$ — the LLM’s token distribution
The Surrogate Objective and Why It Works
The key insight of TRPO is optimizing a surrogate objective $L_\mu(\pi)$ instead of the true objective $J(\pi)$ directly:
$$L_{\mu}(\pi) = J(\mu) + \mathbb{E}_{s \sim d_\mu} \mathbb{E}_{a \sim \pi(\cdot|s)} [A_\mu(s, a)]$$
where $d_\mu$ is the state visitation distribution under the sampling policy $\mu$, and $A_\mu$ is the advantage function.
This surrogate is useful because it satisfies two critical conditions at $\pi = \mu$:
- Equal values: $L_\mu(\mu) = J(\mu)$
- Equal gradients: $\nabla_\theta L_\mu|_{\pi_\theta=\mu} = \nabla_\theta J|_{\pi_\theta=\mu}$
The surrogate is a first-order Taylor approximation of the true objective—it matches both value and gradient at the point of tangency. Away from $\pi = \mu$, the approximation degrades.
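To see where the gradient-matching property comes from, differentiate the surrogate with respect to $\theta$ (a standard step from the TRPO analysis, restated here in this post’s notation; $J(\mu)$ is constant in $\theta$):
$$\nabla_\theta L_\mu(\pi_\theta) = \mathbb{E}_{s \sim d_\mu} \sum_a \nabla_\theta \pi_\theta(a|s)\, A_\mu(s, a) = \mathbb{E}_{s \sim d_\mu} \mathbb{E}_{a \sim \pi_\theta(\cdot|s)} \left[ \nabla_\theta \log \pi_\theta(a|s)\, A_\mu(s, a) \right]$$
At $\pi_\theta = \mu$ we have $d_\mu = d_{\pi_\theta}$ and $A_\mu = A_{\pi_\theta}$, so, up to the normalization convention for $d_\mu$, this is exactly the policy-gradient expression for $\nabla_\theta J$ (the policy gradient theorem). Note that every step assumes $\nabla_\theta \pi_\theta$ exists, i.e., that the map from parameters to the token distribution is differentiable; this is precisely the assumption that Top-K routing violates below.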
The TRPO Lower Bound
TRPO quantifies exactly how much the approximation degrades. The original theorem (Schulman et al., 2015) gives:
$$J(\pi) \geq L_{\mu}(\pi) - \frac{4\epsilon\gamma}{(1-\gamma)^2} \cdot (D_{TV}^{\max})^2$$
where $\epsilon = \max_{s,a}|A_\mu(s,a)|$ and $D_{TV}^{\max} = \max_s D_{TV}(\pi(\cdot|s) \| \mu(\cdot|s))$. For finite-horizon undiscounted settings ($\gamma = 1$, horizon $T$), the bound becomes:
$$J(\pi) \geq L_{\mu}(\pi) - C \cdot T^2 \cdot (D_{TV}^{\max})^2$$
The penalty scales quadratically with both horizon and TV distance because state distribution mismatch accumulates over time.
The Gap Between Theory and Practice
Here’s the critical point: PPO/GRPO do not implement this bound. They use a constant clipping factor (e.g., $\epsilon_{\text{clip}} = 0.2$, not to be confused with the advantage bound $\epsilon$ above) regardless of sequence length, while the theory requires the trust region to shrink as $O(1/T^2)$.
In practice, PPO/GRPO are best understood as stochastic gradient ascent (SGA) methods that compute a clipped gradient estimator. Li et al., 2025 analyze how mismatch between sampling policy $\mu$ and target policy $\pi$ affects optimization. Crucially, the token-level importance sampling (IS) gradient used in PPO/GRPO is exactly the gradient of the surrogate objective $\nabla_\theta L_\mu$, not the true gradient $\nabla_\theta J$:
$$\underbrace{\mathbb{E}_{s_t \sim d_\mu} \mathbb{E}_{y_t \sim \mu} \left[ \frac{\pi_\theta(y_t|s_t)}{\mu(y_t|s_t)} A_\mu(s_t, y_t) \nabla_\theta \log \pi_\theta(y_t|s_t) \right]}_{\text{Token-level IS gradient (what PPO computes)}} = \nabla_\theta L_\mu$$
$$\nabla_\theta L_\mu \neq \nabla_\theta J = \underbrace{\mathbb{E}_{s_t \sim d_\pi} \mathbb{E}_{y_t \sim \pi} \left[ A_\pi(s_t, y_t) \nabla_\theta \log \pi_\theta(y_t|s_t) \right]}_{\text{True policy gradient}}$$
where $s_t = (x, y_{\lt t})$ is the state (prompt + generated prefix) and $y_t$ is the action (next token). The token-level IS ratio $\pi_\theta(y_t|s_t)/\mu(y_t|s_t)$ corrects for the token distribution mismatch, but the expectation over states is still taken under $d_\mu$, not $d_\pi$. This prefix distribution mismatch is the source of bias, which scales with both horizon and policy divergence.
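To ground this, here is a minimal sketch of the token-level clipped surrogate as it is typically implemented in PPO/GRPO-style trainers (the function name, tensor shapes, and default `clip_eps` are illustrative; real trainers add masking, KL penalties, and advantage normalization):

```python
import torch

def clipped_surrogate_loss(logp_new: torch.Tensor,    # log pi_theta(y_t | s_t), shape (B, T)
                           logp_old: torch.Tensor,    # log mu(y_t | s_t) from the rollout, shape (B, T)
                           advantages: torch.Tensor,  # A_mu(s_t, y_t) estimates, shape (B, T)
                           clip_eps: float = 0.2) -> torch.Tensor:
    """Token-level clipped surrogate: the IS ratio corrects the per-token distribution
    mismatch, but the states s_t = (x, y_<t) were sampled under mu, so the gradient of
    this loss is (a clipped estimate of) grad L_mu, not grad J."""
    ratio = torch.exp(logp_new - logp_old)                      # pi_theta / mu per token
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()            # negate: optimizers minimize
```

The clip only bounds the per-token ratio $\pi_\theta/\mu$; nothing in this estimator observes, or can correct for, the prefix distribution mismatch, and nothing anticipates a discontinuous jump in $\pi_\theta$.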
This bias is tolerable when:
- The off-policiness is solely induced by policy parameter updates
- The mismatch $D_{TV}^{\max}$ is small and controlled (e.g., by the clipping mechanism)
This bias becomes intolerable when:
- The mismatch has diverse, uncontrolled sources (e.g., expert shifts in MoE)
- $D_{TV}^{\max}$ is large, amplifying the approximation error
The success of PPO/GRPO therefore relies on:
- The first-order approximation $\nabla_\theta L_\mu \approx \nabla_\theta J$ being valid
- The policy remaining close enough to $\mu$ that the surrogate is meaningful
- The mapping from parameters to policy being smooth
How Top-K Breaks the First-Order Approximation
Let $f: \Theta \to \Pi$ be the map from parameters $\theta \in \Theta$ to the token distribution $\pi_\theta(y_t | x, y_{\lt t}) \in \Pi$.
In dense LLMs (GPT, LLaMA, etc.): $f$ is smooth. The surrogate $L_\mu(\pi)$ is a valid first-order approximation of $J(\pi)$, and gradient-based optimization works as expected.
In MoE LLMs (Mixtral, DeepSeek-MoE, etc.): $f$ is piecewise smooth but globally discontinuous—smooth within each routing region, but with jump discontinuities at region boundaries.
At a switching point $\theta^*$ (where expert rankings swap for some token), consider a direction $v$ crossing the decision boundary:
$$\lim_{\tau \to 0^+} \pi_{\theta^* + \tau v}(y_t | x, y_{\lt t}) \neq \lim_{\tau \to 0^+} \pi_{\theta^* - \tau v}(y_t | x, y_{\lt t})$$
The first-order approximation completely fails at these boundaries:
- At the discontinuity, the gradient $\nabla_\theta J$ does not exist in the classical sense
- The surrogate $L_\mu(\pi)$ cannot provide a valid first-order approximation to a discontinuous $J(\pi)$
- The clipping mechanism in PPO/GRPO cannot help—it assumes the underlying policy mapping is smooth
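The jump is easy to exhibit numerically: nudge one router logit across the $K$-th-largest threshold and the induced token distribution moves by a finite amount for an arbitrarily small parameter change. A toy sketch (every shape and value here is made up for illustration):

```python
import torch

torch.manual_seed(0)
expert_outputs = torch.randn(4, 16)        # pretend E_j(z) for 4 experts, hidden size 16
unembed = torch.randn(16, 100)             # toy unembedding into a 100-token vocabulary
h = torch.tensor([2.0, 1.0, 0.999, -1.0])  # expert 2 sits just below the K = 2 threshold
k = 2

def token_dist(h: torch.Tensor) -> torch.Tensor:
    vals, idx = torch.topk(h, k)
    gates = torch.softmax(vals, dim=-1)
    hidden = (gates.unsqueeze(-1) * expert_outputs[idx]).sum(0)
    return torch.softmax(hidden @ unembed, dim=-1)    # stand-in for pi_theta(. | x, y_<t)

p = token_dist(h)
h_shifted = h.clone()
h_shifted[2] += 0.002                      # tiny "parameter update" crosses the routing boundary
q = token_dist(h_shifted)

print(0.5 * (p - q).abs().sum())           # total variation distance: a finite jump, not O(0.002)
```

The resulting TV distance stays bounded away from zero no matter how small the nudge, which is exactly the "infinitesimal parameter change, finite policy change" failure mode described above.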
The Consequences for LLM-RL Training
When the router crosses a decision boundary during training:
1. The Surrogate Becomes Meaningless: PPO/GRPO optimize $L_\mu(\pi)$ as a proxy for $J(\pi)$. At a discontinuity, the surrogate jumps while the gradient estimator sees only the local (pre-jump) landscape. The optimizer is effectively blind to what happens after crossing.
2. Gradient Estimates Are Invalid: The clipped gradient estimator assumes $\nabla_\theta L_\mu \approx \nabla_\theta J$. At a discontinuity, neither gradient exists in the classical sense, and the computed “gradient” points in an arbitrary direction.
3. Large, Uncontrolled Approximation Error: When the router switches experts, the effective $D_{TV}^{\max}$ (per-token TV distance) can be large—the output distribution changes discretely, not continuously. The TRPO bound shows the surrogate-to-objective gap scales as $O(T^2 \cdot (D_{TV}^{\max})^2)$. When $D_{TV}^{\max}$ jumps due to expert switching, this creates a regime where the gradient estimator is systematically wrong, pushing optimization toward incorrect solutions. This may contribute to the training instability observed when training MoE LLMs with RL.
Part 3: Implications for MoE LLM-RL
Why LLM-RL with MoE is Hard
The combination of these two pathologies creates a perfect storm for RL training:
Exploration is blind: The router receives no gradient signal for unselected experts. When generating response $y$ to prompt $x$, the model cannot learn whether routing tokens to different experts would produce higher-reward responses.
Exploitation is unstable: When the optimizer does find a beneficial switch point, crossing it can cause instability due to the first-order approximation failure. This may manifest as reward spikes followed by degradation during RL training.
The optimization landscape is adversarial: Flat plateaus (zero gradient) punctuated by cliffs (discontinuities) with no smooth paths between expert configurations. The model gets stuck in suboptimal routing patterns.
Potential Solutions for MoE LLM-RL
Understanding these pathologies suggests directions for solutions:
For the gradient blackout:
- Soft routing (e.g., softmax over all experts) restores gradient flow but sacrifices sparsity and inference speed
- Auxiliary losses that provide signal to unselected experts (e.g., load balancing with gradient flow)
- Exploration bonuses for trying different expert combinations during rollouts
For the first-order approximation failure:
- Entropy regularization on the router to smooth the routing distribution
- Annealing from soft to hard routing during RL training (see the sketch after this list)
- Modified KL constraints that account for discrete expert switches
- Freezing the router during RL (sacrificing routing adaptation)
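As one illustration of the annealing idea (which also subsumes the soft-routing option from the previous list), the sketch below routes with a full softmax whose temperature is annealed during RL training and only collapses to the Top-K support once the temperature is low. The function name, the `hard_below` threshold, and the schedule are assumptions for illustration, not a published recipe:

```python
import torch
import torch.nn.functional as F

def annealed_routing_weights(h: torch.Tensor, k: int,
                             temperature: float, hard_below: float = 0.1) -> torch.Tensor:
    """Routing weights over all N experts.

    High temperature: dense softmax, so every entry of h receives gradient (no blackout).
    Low temperature: softmax restricted to the Top-K support, recovering sparse routing."""
    if temperature > hard_below:
        return F.softmax(h / temperature, dim=-1)         # soft: all experts active
    vals, idx = torch.topk(h, k, dim=-1)
    masked = torch.full_like(h, float("-inf")).scatter(-1, idx, vals)
    return F.softmax(masked, dim=-1)                      # hard: matches the Top-K gating above

# Example schedule: anneal from soft to hard over the course of RL training.
h = torch.randn(8, requires_grad=True)
for temperature in (2.0, 0.5, 0.05):
    w = annealed_routing_weights(h, k=2, temperature=temperature)
    print(temperature, (w > 0).sum().item())              # 8, 8, then 2 experts with nonzero weight
```

A straight-through variant (hard Top-K in the forward pass, soft softmax gradient in the backward pass) is another commonly discussed way to keep some router gradient while staying sparse, at the cost of a biased gradient estimate.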
Summary
The instability of RL training for MoE LLMs is not a bug to be fixed with hyperparameter tuning—it’s a fundamental consequence of the Top-K operator’s mathematical properties:
| Property | Effect on LLM-RL Optimization |
|---|---|
| Discrete expert selection | Zero gradient for unselected experts—no signal for improving routing |
| Jump discontinuities at boundaries | Large $D_{TV}^{\max}$ when experts switch, causing $O(T^2 \cdot (D_{TV}^{\max})^2)$ approximation error |
| First-order approximation failure | Surrogate $L_\mu$ invalid at discontinuities—gradient estimates systematically wrong |
| No gradient signal for switching | Cannot learn which expert would generate better tokens |
Until routing mechanisms are developed that preserve gradient information while maintaining sparsity, training MoE LLMs with RL will remain fundamentally more challenging than training dense LLMs.
References
Mixture of Experts:
- Shazeer, N., et al. (2017). “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.” ICLR.
- Fedus, W., Zoph, B., & Shazeer, N. (2022). “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.” JMLR.
Trust Region Methods:
- Schulman, J., et al. (2015). “Trust Region Policy Optimization.” ICML.
- Schulman, J., et al. (2017). “Proximal Policy Optimization Algorithms.” arXiv.
LLM-RL Analysis:
- Li, Y., Liu, J., et al. (2025). “Why Mismatch Breaks LLM-RL.” Blog.
Non-smooth Optimization:
- Clarke, F. H. (1990). Optimization and Nonsmooth Analysis. SIAM.
Last updated: December 7, 2025