The Stability Gap: Why Top-K Routing Breaks RL Optimization

How Discrete Expert Selection Creates Pathological Optimization Landscapes

The Problem

Training Mixture of Experts (MoE) language models with Reinforcement Learning can be unstable. While dense LLMs have continuous and differentiable policy mappings, MoE-based models like Mixtral, DeepSeek-MoE, and Qwen-MoE introduce the Top-K operator—a discrete switching mechanism that creates discontinuities in the optimization landscape.

This discreteness introduces two fundamental mathematical pathologies that break standard RL assumptions used in PPO, GRPO, and other LLM-RL algorithms.


TL;DR: The Two Pathologies

Challenge 1: Gradient Blackout. The gradient of the token distribution $\pi_\theta(y_t | x, y_{\lt t})$ with respect to unselected experts’ logits is exactly zero almost everywhere. Unlike non-smooth convex functions where subgradients guide optimization, the Top-K landscape offers no directional information on how to switch to a better expert.

Challenge 2: First-Order Approximation Failure. Modern LLM-RL algorithms (PPO, GRPO) rely on a surrogate objective that approximates the true objective to first order. This approximation requires the policy mapping to be smooth. Top-K routing violates this—an infinitesimal parameter change can cause a discrete expert switch, making the surrogate jump discontinuously and invalidating the gradient-based optimization entirely.

| Pathology | Dense LLMs | MoE LLMs with Top-K |
| --- | --- | --- |
| Gradient flow | Smooth, non-zero almost everywhere | Zero almost everywhere for unselected experts’ logits |
| Token distribution mapping | Continuous and differentiable | Discontinuous at routing boundaries |
| First-order approximation | Valid: $\nabla L_\mu \approx \nabla J$ | Invalid at routing boundaries |

Part 1: The Gradient Blackout

Setup: Autoregressive LLM with MoE

Consider an autoregressive language model generating a response $y = (y_1, y_2, \ldots, y_T)$ given a prompt $x$. At each timestep $t$, the model predicts the next token $y_t$ given the context:

  • State: $s_t = (x, y_{\lt t})$ — the prompt concatenated with previously generated tokens
  • Action: $a_t = y_t$ — the next token to generate
  • Policy: $\pi_\theta(a_t | s_t) = \pi_\theta(y_t | x, y_{\lt t})$ — the token probability distribution

In an MoE transformer, each MoE layer has a router that computes logits $h \in \mathbb{R}^N$ for $N$ experts based on the hidden state. For a fixed $K \lt N$, the Top-K operator selects the indices of the $K$ largest logits:

$$\mathcal{K}(h) = \{j : h_j \text{ is among the } K \text{ largest elements of } h\}$$

The MoE layer output is:

$$\text{MoE}(z) = \sum_{j \in \mathcal{K}(h)} \frac{e^{h_j}}{\sum_{k \in \mathcal{K}(h)} e^{h_k}} E_j(z)$$

where $z$ is the hidden state, $E_j$ is expert $j$’s FFN, and $h = h(z; \theta_r)$ depends on router parameters $\theta_r$. The final token distribution $\pi_\theta(y_t | x, y_{\lt t})$ depends on outputs from all MoE layers.
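
For concreteness, here is a minimal PyTorch sketch of a Top-K routed MoE layer matching the equations above. The module is illustrative (a single linear router, small FFN experts, unoptimized loop-based dispatch); it is not any particular model's implementation.

```python
import torch
import torch.nn as nn


class TopKMoE(nn.Module):
    """Minimal Top-K routed MoE layer: softmax gate over the K selected experts only."""

    def __init__(self, d_model: int, n_experts: int, k: int):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)            # logits h(z; theta_r)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:        # z: [n_tokens, d_model]
        h = self.router(z)                                      # [n_tokens, n_experts]
        top_vals, top_idx = torch.topk(h, self.k, dim=-1)       # K(h): indices of the K largest logits
        gates = torch.softmax(top_vals, dim=-1)                 # normalize over selected logits only
        out = torch.zeros_like(z)
        for slot in range(self.k):
            idx = top_idx[:, slot]
            for j in torch.unique(idx).tolist():                # dispatch tokens routed to expert j
                mask = idx == j
                out[mask] += gates[mask, slot].unsqueeze(-1) * self.experts[j](z[mask])
        return out
```

Note that the gate is a softmax over only the $K$ selected logits, matching the normalization in the equation above; the unselected logits never enter the forward computation.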

The Zero Gradient Problem

When training with RL, we optimize the policy $\pi_\theta(y_t | x, y_{\lt t})$ to maximize reward. Consider the gradient with respect to an unselected expert’s logit $h_u$, where $u \notin \mathcal{K}(h)$.

Step 1: Locally Constant Set. Let $h_{(K)}$ denote the $K$-th largest element of $h$, and let $e_u$ be the $u$-th standard basis vector. Assuming no ties (which holds almost everywhere), since $u \notin \mathcal{K}(h)$, we have $h_u \lt h_{(K)}$. For any scalar perturbation $\epsilon$ with $h_u + \epsilon \lt h_{(K)}$:

$$\mathcal{K}(h + \epsilon \cdot e_u) = \mathcal{K}(h)$$

The set of selected experts remains unchanged as long as $h_u$ stays below the selection threshold.

Step 2: Zero Dependency. Since $u \notin \mathcal{K}(h)$:

  • Expert $E_u$’s output does not contribute to the hidden state
  • The logit $h_u$ does not appear in the softmax normalization

Result: The gradient of the token probability with respect to unselected expert logits is zero:

$$\frac{\partial \pi_\theta(y_t | x, y_{\lt t})}{\partial h_u} = 0 \quad \text{almost everywhere}$$
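
This can be checked directly with autograd on a toy Top-K gate; the shapes and values below are illustrative placeholders, not tied to any real model.

```python
import torch

torch.manual_seed(0)
K, N = 2, 4
h = torch.randn(N, requires_grad=True)              # router logits
expert_outputs = torch.randn(N, 8)                   # stand-in for E_j(z), one row per expert

top_vals, top_idx = torch.topk(h, K)                 # K(h): the K largest logits and their indices
gates = torch.softmax(top_vals, dim=-1)              # softmax over the selected logits only
moe_out = (gates.unsqueeze(-1) * expert_outputs[top_idx]).sum(dim=0)

# Backpropagate a scalar readout of the MoE output to all router logits.
moe_out.sum().backward()
print("selected experts:", top_idx.tolist())
print("d(output)/d(h):  ", h.grad)                   # exactly zero for every unselected expert
```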

Why Subgradients Don’t Help

Normally, we handle non-smooth points (like ReLU at 0) using subgradients. However, there’s a crucial distinction:

Non-smooth but continuous (e.g., ReLU):

  • The function $f(x) = \max(0, x)$ is continuous everywhere
  • At $x=0$, the subgradient $\partial f(0) = [0, 1]$ provides valid descent directions
  • Optimization can proceed by choosing any element of the subdifferential

Discontinuous (Top-K):

  • The selection function is discontinuous at decision boundaries
  • On the plateau: The gradient is exactly $\mathbf{0}$—no signal at all
  • At the cliff: Where $h_i = h_j$ for the $K$-th and $(K+1)$-th ranked experts, the output jumps discontinuously as they swap positions

At a discontinuity, the classical subgradient is not defined. The Clarke generalized gradient is defined for locally Lipschitz functions, but the MoE layer output $\text{MoE}(z)$, viewed as a function of the router logits (or parameters), is not locally Lipschitz at switching boundaries: it has jump discontinuities there.

Key insight: The pathology is not that gradients are “undefined” at boundaries, but rather:

  1. Away from boundaries: $\frac{\partial \pi_\theta(y_t | x, y_{\lt t})}{\partial h_u} = 0$ exactly (no signal)
  2. At boundaries: the function jumps discontinuously, so no first-order approximation is valid (a worked example follows below)
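
A minimal worked example makes both regimes concrete. Take $N = 2$ experts with $K = 1$, so the selected expert receives gate weight $1$:

$$\text{MoE}(z) = \begin{cases} E_1(z), & h_1 \gt h_2, \\ E_2(z), & h_1 \lt h_2. \end{cases}$$

For $h_2 \lt h_1$ the output is constant in $h_2$, so $\partial\,\text{MoE}(z)/\partial h_2 = 0$ (the plateau). At $h_2 = h_1$ the output jumps from $E_1(z)$ to $E_2(z)$ (the cliff), and whenever $E_1(z) \neq E_2(z)$ no subgradient or Clarke gradient of the layer output exists there.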

Bottom line: During LLM-RL training, the router receives no gradient signal about whether switching to a different expert would generate better responses. The model cannot learn to route tokens to more suitable experts based on reward feedback.


Part 2: The First-Order Approximation Failure

The Trust Region Principle and Its Practical Approximations

The theoretical foundation of modern LLM-RL comes from Trust Region Policy Optimization (TRPO) (Schulman et al., 2015). However, practical algorithms like PPO and GRPO do not implement actual trust region optimization—they use clipping mechanisms to mimic the trust region principle. Understanding this distinction is crucial.

In the LLM setting, we use the autoregressive MDP formulation:

  • State: $s_t = (x, y_{\lt t})$ — prompt plus tokens generated so far
  • Action: $a_t = y_t$ — the next token
  • Policy: $\pi_\theta(y_t | x, y_{\lt t})$ — the LLM’s token distribution

The Surrogate Objective and Why It Works

The key insight of TRPO is optimizing a surrogate objective $L_\mu(\pi)$ instead of the true objective $J(\pi)$ directly:

$$L_{\mu}(\pi) = J(\mu) + \mathbb{E}_{s \sim d_\mu} \mathbb{E}_{a \sim \pi(\cdot|s)} [A_\mu(s, a)]$$

where $d_\mu$ is the state visitation distribution under the sampling policy $\mu$, and $A_\mu$ is the advantage function.

This surrogate is useful because it satisfies two critical conditions at $\pi = \mu$:

  1. Equal values: $L_\mu(\mu) = J(\mu)$
  2. Equal gradients: $\nabla_\theta L_\mu|_{\pi_\theta=\mu} = \nabla_\theta J|_{\pi_\theta=\mu}$

The surrogate is a first-order Taylor approximation of the true objective—it matches both value and gradient at the point of tangency. Away from $\pi = \mu$, the approximation degrades.
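
Both conditions can be verified directly. A short sketch of the gradient condition, treating $d_\mu$ as the (normalized) state visitation distribution:

$$\nabla_\theta L_\mu(\pi_\theta) = \mathbb{E}_{s \sim d_\mu} \left[ \sum_a \nabla_\theta \pi_\theta(a|s) \, A_\mu(s,a) \right] = \mathbb{E}_{s \sim d_\mu} \mathbb{E}_{a \sim \pi_\theta(\cdot|s)} \left[ \nabla_\theta \log \pi_\theta(a|s) \, A_\mu(s,a) \right]$$

Evaluating at $\pi_\theta = \mu$ turns the inner expectation into $\mathbb{E}_{a \sim \mu}$, which is the policy-gradient expression for $\nabla_\theta J$ at $\mu$ (up to the normalization convention chosen for $d_\mu$). The equal-value condition is immediate because $\mathbb{E}_{a \sim \mu}[A_\mu(s,a)] = 0$ for every state $s$.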

The TRPO Lower Bound

TRPO quantifies exactly how much the approximation degrades. The original theorem (Schulman et al., 2015) gives:

$$J(\pi) \geq L_{\mu}(\pi) - \frac{4\epsilon\gamma}{(1-\gamma)^2} \cdot (D_{TV}^{\max})^2$$

where $\epsilon = \max_{s,a}|A_\mu(s,a)|$ and $D_{TV}^{\max} = \max_s D_{TV}(\pi(\cdot|s), \mu(\cdot|s))$. For finite-horizon undiscounted settings ($\gamma = 1$, horizon $T$), an analogous bound holds:

$$J(\pi) \geq L_{\mu}(\pi) - C \cdot T^2 \cdot (D_{TV}^{\max})^2$$

The penalty scales quadratically with both horizon and TV distance because state distribution mismatch accumulates over time.
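
To make the scaling concrete, suppose (informally, for illustration only) that the surrogate gain of an update is roughly linear in the step, $L_\mu(\pi) - J(\mu) \approx g \cdot D_{TV}^{\max}$ for some $g \gt 0$. The guaranteed improvement from the bound is then at least

$$g \cdot D_{TV}^{\max} - C \cdot T^2 \cdot (D_{TV}^{\max})^2,$$

which is positive only when $D_{TV}^{\max} \lt g / (C \cdot T^2)$: the admissible trust region shrinks quadratically in the horizon, while practical algorithms keep their clipping range fixed.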

The Gap Between Theory and Practice

Here’s the critical point: PPO/GRPO do not implement this bound. They use a constant clipping range (e.g., 0.2) regardless of sequence length, while the theory requires the trust region to shrink as $O(1/T^2)$.

In practice, PPO/GRPO are best understood as stochastic gradient ascent (SGA) methods that compute a clipped gradient estimator. Li et al., 2025 analyze how mismatch between sampling policy $\mu$ and target policy $\pi$ affects optimization. Crucially, the token-level importance sampling (IS) gradient used in PPO/GRPO is exactly the gradient of the surrogate objective $\nabla_\theta L_\mu$, not the true gradient $\nabla_\theta J$:

$$\underbrace{\mathbb{E}_{s_t \sim d_\mu} \mathbb{E}_{y_t \sim \mu} \left[ \frac{\pi_\theta(y_t|s_t)}{\mu(y_t|s_t)} A_\mu(s_t, y_t) \nabla_\theta \log \pi_\theta(y_t|s_t) \right]}_{\text{Token-level IS gradient (what PPO computes)}} = \nabla_\theta L_\mu$$

$$\nabla_\theta L_\mu \neq \nabla_\theta J = \underbrace{\mathbb{E}_{s_t \sim d_\pi} \mathbb{E}_{y_t \sim \pi} \left[ A_\pi(s_t, y_t) \nabla_\theta \log \pi_\theta(y_t|s_t) \right]}_{\text{True policy gradient}}$$

where $s_t = (x, y_{\lt t})$ is the state (prompt + generated prefix) and $y_t$ is the action (next token). The token-level IS ratio $\pi_\theta(y_t|s_t)/\mu(y_t|s_t)$ corrects for the token distribution mismatch, but the expectation over states is still taken under $d_\mu$, not $d_\pi$. This prefix distribution mismatch is the source of bias, which scales with both horizon and policy divergence.
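
Below is a minimal sketch of the token-level clipped surrogate that PPO/GRPO-style trainers optimize. Tensor names such as `logp_new`, `logp_old`, and `adv` are placeholders, and this is the generic clipped objective rather than any specific library's implementation; the states entering the loss still come from rollouts under $\mu$, so the prefix-distribution bias described above is untouched by the clipping.

```python
import torch


def clipped_token_surrogate(logp_new: torch.Tensor,   # log pi_theta(y_t | s_t), shape [B, T]
                            logp_old: torch.Tensor,   # log mu(y_t | s_t) from the rollout
                            adv: torch.Tensor,        # advantage estimates A_mu(s_t, y_t)
                            mask: torch.Tensor,       # 1 for response tokens, 0 for padding
                            clip_range: float = 0.2) -> torch.Tensor:
    """Negative clipped surrogate; its gradient estimates grad L_mu, not grad J."""
    ratio = torch.exp(logp_new - logp_old)                                # token-level IS ratio
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * adv
    per_token = torch.minimum(unclipped, clipped)                         # pessimistic objective
    return -(per_token * mask).sum() / mask.sum()                         # maximize surrogate
```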

This bias is tolerable when:

  • The off-policiness is solely induced by policy parameter updates
  • The mismatch $D_{TV}^{\max}$ is small and controlled (e.g., by the clipping mechanism)

This bias becomes intolerable when:

  • The mismatch has diverse, uncontrolled sources (e.g., expert shifts in MoE)
  • $D_{TV}^{\max}$ is large, amplifying the approximation error

In short, the success of PPO/GRPO relies on:

  1. The first-order approximation $\nabla_\theta L_\mu \approx \nabla_\theta J$ being valid
  2. The policy remaining close enough to $\mu$ that the surrogate is meaningful
  3. The mapping from parameters to policy being smooth

How Top-K Breaks the First-Order Approximation

Let $f: \Theta \to \Pi$ be the map from parameters $\theta \in \Theta$ to the token distribution $\pi_\theta(y_t | x, y_{\lt t}) \in \Pi$.

In dense LLMs (GPT, LLaMA, etc.): $f$ is smooth. The surrogate $L_\mu(\pi)$ is a valid first-order approximation of $J(\pi)$, and gradient-based optimization works as expected.

In MoE LLMs (Mixtral, DeepSeek-MoE, etc.): $f$ is piecewise smooth but globally discontinuous—smooth within each routing region, but with jump discontinuities at region boundaries.

At a switching point $\theta^*$ (where expert rankings swap for some token), consider a parameter-space direction $v$ that crosses the decision boundary:

$$\lim_{\delta \to 0^+} \pi_{\theta^* + \delta v}(y_t | x, y_{\lt t}) \neq \lim_{\delta \to 0^+} \pi_{\theta^* - \delta v}(y_t | x, y_{\lt t})$$

The first-order approximation completely fails at these boundaries (a numerical sketch follows this list):

  • At the discontinuity, the gradient $\nabla_\theta J$ does not exist in the classical sense
  • The surrogate $L_\mu(\pi)$ cannot provide a valid first-order approximation to a discontinuous $J(\pi)$
  • The clipping mechanism in PPO/GRPO cannot help—it assumes the underlying policy mapping is smooth
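
Here is a small numerical sketch of the one-sided limits failing to agree: nudge the router logits across a tie, and the gated output (and hence any downstream token distribution) jumps by a fixed amount no matter how small the nudge. The two-expert, $K = 1$ setup with fixed expert outputs is illustrative only.

```python
import torch

torch.manual_seed(0)
E = torch.randn(2, 8)                        # stand-in expert outputs E_1(z), E_2(z)
h_star = torch.tensor([1.0, 1.0])            # router logits exactly at the decision boundary
v = torch.tensor([0.0, 1.0])                 # direction that flips the ranking of expert 2 vs. 1

def moe_out(h: torch.Tensor) -> torch.Tensor:
    """K = 1 routing: the single selected expert receives gate weight 1."""
    return E[torch.argmax(h)]

for delta in (1e-1, 1e-3, 1e-6):
    left = moe_out(h_star - delta * v)       # approach the boundary from the expert-1 side
    right = moe_out(h_star + delta * v)      # approach the boundary from the expert-2 side
    gap = (right - left).norm().item()
    print(f"delta={delta:.0e}  ||right - left|| = {gap:.3f}")
# The gap stays at ||E_2 - E_1|| no matter how small delta gets: a jump discontinuity.
```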

The Consequences for LLM-RL Training

When the router crosses a decision boundary during training:

1. The Surrogate Becomes Meaningless: PPO/GRPO optimize $L_\mu(\pi)$ as a proxy for $J(\pi)$. At a discontinuity, the surrogate jumps while the gradient estimator sees only the local (pre-jump) landscape. The optimizer is effectively blind to what happens after crossing.

2. Gradient Estimates Are Invalid: The clipped gradient estimator assumes $\nabla_\theta L_\mu \approx \nabla_\theta J$. At a discontinuity, neither gradient exists in the classical sense, and the computed “gradient” points in an arbitrary direction.

3. Large, Uncontrolled Approximation Error: When the router switches experts, the effective $D_{TV}^{\max}$ (per-token TV distance) can be large—the output distribution changes discretely, not continuously. The TRPO bound shows the surrogate-to-objective gap scales as $O(T^2 \cdot (D_{TV}^{\max})^2)$. When $D_{TV}^{\max}$ jumps due to expert switching, this creates a regime where the gradient estimator is systematically wrong, pushing optimization toward incorrect solutions. This may contribute to the training instability observed when training MoE LLMs with RL.


Part 3: Implications for MoE LLM-RL

Why LLM-RL with MoE is Hard

The combination of these two pathologies creates a perfect storm for RL training:

  1. Exploration is blind: The router receives no gradient signal for unselected experts. When generating response $y$ to prompt $x$, the model cannot learn whether routing tokens to different experts would produce higher-reward responses.

  2. Exploitation is unstable: When the optimizer does find a beneficial switch point, crossing it can cause instability due to the first-order approximation failure. This may manifest as reward spikes followed by degradation during RL training.

  3. The optimization landscape is adversarial: Flat plateaus (zero gradient) punctuated by cliffs (discontinuities) with no smooth paths between expert configurations. The model gets stuck in suboptimal routing patterns.

Potential Solutions for MoE LLM-RL

Understanding these pathologies suggests directions for solutions:

For the gradient blackout:

  • Soft routing (e.g., softmax over all experts) restores gradient flow but sacrifices sparsity and inference speed
  • Auxiliary losses that provide signal to unselected experts (e.g., load balancing with gradient flow)
  • Exploration bonuses for trying different expert combinations during rollouts

For the first-order approximation failure:

  • Entropy regularization on the router to smooth the routing distribution
  • Annealing from soft to hard routing during RL training (see the sketch after this list)
  • Modified KL constraints that account for discrete expert switches
  • Freezing the router during RL (sacrificing routing adaptation)
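
As one possible instantiation of the soft-to-hard annealing idea above (a sketch under assumptions, not a published recipe): blend a dense softmax gate over all experts with the sparse Top-K gate, and move the mixing weight toward fully sparse routing over the course of RL training so that gradients reach unselected experts early on.

```python
import torch


def annealed_gates(h: torch.Tensor, k: int, alpha: float) -> torch.Tensor:
    """Gate weights over all N experts, blending dense and Top-K routing.

    alpha = 0.0: dense softmax over all experts (full gradient flow, no sparsity).
    alpha = 1.0: standard Top-K gating (sparse, zero gradient to unselected logits).
    """
    soft = torch.softmax(h, dim=-1)                           # dense gate over all experts
    top_vals, top_idx = torch.topk(h, k, dim=-1)
    hard = torch.zeros_like(h).scatter(-1, top_idx, torch.softmax(top_vals, dim=-1))
    return (1.0 - alpha) * soft + alpha * hard


# Example schedule: linearly anneal alpha from 0 (dense) to 1 (sparse) over the RL run.
h = torch.randn(4, 8)                                         # [tokens, n_experts] router logits
for step in (0, 500, 1000):
    alpha = min(1.0, step / 1000)
    gates = annealed_gates(h, k=2, alpha=alpha)
    support = (gates[0] > 1e-6).sum().item()
    print(f"step {step:4d}  alpha={alpha:.1f}  experts with nonzero gate: {support}")
```

The trade-off is explicit: gradient flow to unselected experts is only available while $\alpha \lt 1$, and the inference-time sparsity of standard Top-K routing is only recovered once $\alpha$ reaches $1$.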

Summary

The instability of RL training for MoE LLMs is not a bug to be fixed with hyperparameter tuning—it’s a fundamental consequence of the Top-K operator’s mathematical properties:

| Property | Effect on LLM-RL Optimization |
| --- | --- |
| Discrete expert selection | Zero gradient for unselected experts—no signal for improving routing |
| Jump discontinuities at boundaries | Large $D_{TV}^{\max}$ when experts switch, causing $O(T^2 \cdot (D_{TV}^{\max})^2)$ approximation error |
| First-order approximation failure | Surrogate $L_\mu$ invalid at discontinuities—gradient estimates systematically wrong |
| No gradient signal for switching | Cannot learn which expert would generate better tokens |

Until routing mechanisms are developed that preserve gradient information while maintaining sparsity, training MoE LLMs with RL will remain fundamentally more challenging than training dense LLMs.


References

Mixture of Experts:

  • Shazeer, N., et al. (2017). “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.” ICLR.
  • Fedus, W., Zoph, B., & Shazeer, N. (2022). “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.” JMLR.

Trust Region Methods:

  • Schulman, J., et al. (2015). “Trust Region Policy Optimization.” ICML.
  • Schulman, J., et al. (2017). “Proximal Policy Optimization Algorithms.” arXiv.

LLM-RL Analysis:

Non-smooth Optimization:

  • Clarke, F. H. (1990). Optimization and Nonsmooth Analysis. SIAM.

Last updated: December 7, 2025
