When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch
How the inference-training gap causes catastrophic failures in LLM reinforcement learning
Co-First Authors: Jiacai Liu and Yingru Li
Corresponding Authors: Yingru Li and Yu Shen
TL;DR
The relentless push for faster inference has created a dangerous “training-inference mismatch” that can silently kill reinforcement learning with LLMs. Our investigation reveals a vicious cycle that is particularly acute in modern reasoning and agentic RL:
- OOD Contexts Drive Low-Probability Sampling: Agentic workflows expose models to external inputs and dynamic environments, forcing frequent generation of low-probability tokens that are essential for novel reasoning, tool calls, and adaptive responses.
- Low-Probability Tokens Amplify Training Collapse: These tokens are the weakest link; the training-inference mismatch is most severe for them, producing catastrophically large gradients that cause silent degradation and then sudden training failure (see the toy illustration after this list).
- Hardware Variability Complicates the Problem: Different GPU architectures exacerbate the mismatch unpredictably, meaning the same agentic training setup can succeed on one machine and catastrophically fail on another.
- Sequence-Level IS Is the Principled Solution: Sequence-level importance sampling (IS) emerges as the theoretically grounded fix, restoring training stability across different hardware and complex tasks (see the sketch after this list).
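To see why the mismatch hits rare tokens hardest, here is a toy numeric illustration (the fixed absolute probability gap `eps` and the specific values are illustrative assumptions, not the authors' measurements): the same small numerical discrepancy between the inference engine and the training engine barely moves the importance ratio for a likely token, but distorts it badly for an unlikely one, and that ratio is exactly what scales the policy-gradient contribution of the token.

```python
# Toy illustration (assumed error model: a fixed absolute probability gap
# between the inference engine and the training engine; not the authors' data).
eps = 5e-5  # hypothetical absolute probability discrepancy between the two engines

for p_infer in (0.9, 0.01, 2e-4):
    p_train = p_infer + eps        # training engine assigns slightly more mass
    ratio = p_train / p_infer      # token-level IS ratio pi_train / pi_infer
    print(f"p_infer={p_infer:<7} ratio={ratio:.3f}")

# Expected output:
# p_infer=0.9     ratio=1.000   <- high-probability token: engines effectively agree
# p_infer=0.01    ratio=1.005
# p_infer=0.0002  ratio=1.250   <- low-probability token: same gap, 25% ratio error
```

The relative error grows as the token probability shrinks, so the rare tokens that agentic workflows force the model to sample are exactly the ones whose updates get distorted the most.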
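To make the fix concrete, below is a minimal PyTorch-style sketch of a sequence-level IS correction, assuming per-token log-probs from both engines are gathered during rollout; the function name, tensor layout, and truncation threshold are illustrative assumptions rather than the authors' exact implementation. The key step is forming a single ratio π_train(y|x) / π_infer(y|x) per sampled sequence and using it to reweight that sequence's policy-gradient contribution.

```python
import torch


def sequence_level_is_loss(train_logprobs: torch.Tensor,   # (B, T) log pi_train per token
                           infer_logprobs: torch.Tensor,   # (B, T) log pi_infer per token
                           advantages: torch.Tensor,       # (B,)   per-sequence advantage
                           mask: torch.Tensor,             # (B, T) 1.0 for generated tokens
                           clip: float = 2.0) -> torch.Tensor:
    """Policy-gradient surrogate with one truncated IS ratio per sequence (sketch)."""
    # One ratio per *sequence*: w = pi_train(y|x) / pi_infer(y|x), computed by
    # summing token log-prob differences instead of keeping a ratio per token.
    log_ratio = ((train_logprobs - infer_logprobs) * mask).sum(dim=-1)
    seq_ratio = log_ratio.exp().clamp(max=clip)   # truncate to bound the variance

    # REINFORCE-style surrogate: the detached sequence ratio reweights each
    # sampled sequence so the gradient is corrected for having sampled from the
    # inference engine rather than the current training policy.
    seq_logprob = (train_logprobs * mask).sum(dim=-1)
    return -(seq_ratio.detach() * advantages * seq_logprob).mean()
```

The design choice is to apply the correction at the unit at which data is actually generated (whole sequences sampled from the inference engine), which is what makes the estimator principled; truncating the ratio trades a small bias for bounded variance on any single sequence.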