When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch
How the inference-training gap causes catastrophic failures in LLM reinforcement learning
Co-First Authors: Jiacai Liu and Yingru Li
Corresponding Authors: Yingru Li and Yu Shen
TL;DR
The relentless push for faster inference has created a dangerous “training-inference mismatch” that can silently kill reinforcement learning with LLMs. Our investigation reveals a vicious cycle that is particularly acute in modern reasoning and agentic RL:
- OOD Contexts Drive Low-Probability Sampling: Agentic workflows expose models to external inputs and dynamic environments, forcing frequent generation of low-probability tokens that are essential for novel reasoning, tool calls, and adaptive responses.
- Low-Probability Tokens Amplify Training Collapse: These tokens are the weakest link; the training-inference mismatch is most severe for them, producing catastrophically large gradients that cause silent degradation and then sudden training failure (see the toy illustration after this list).
- Hardware Variability Complicates the Problem: Different GPU architectures exacerbate the mismatch unpredictably, meaning the same agentic training setup can succeed on one machine and catastrophically fail on another.
- Sequence-Level IS Is the Principled Solution: Sequence-level importance sampling (IS) emerges as the theoretically grounded fix, restoring training stability across different hardware and complex tasks (see the sketch after this list).
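To see why the mismatch hits rare tokens hardest, here is a toy numeric illustration (the fixed absolute probability gap `eps` and the specific values are illustrative assumptions, not the authors' measurements): the same small numerical discrepancy between the inference engine and the training engine barely moves the importance ratio for a likely token, but distorts it badly for an unlikely one, and that ratio is exactly what scales the policy-gradient contribution of the token.

```python
# Toy illustration (assumed error model: a fixed absolute probability gap
# between the inference engine and the training engine; not the authors' data).
eps = 5e-5  # hypothetical absolute probability discrepancy between the two engines

for p_infer in (0.9, 0.01, 2e-4):
    p_train = p_infer + eps        # training engine assigns slightly more mass
    ratio = p_train / p_infer      # token-level IS ratio pi_train / pi_infer
    print(f"p_infer={p_infer:<7} ratio={ratio:.3f}")

# Expected output:
# p_infer=0.9     ratio=1.000   <- high-probability token: engines effectively agree
# p_infer=0.01    ratio=1.005
# p_infer=0.0002  ratio=1.250   <- low-probability token: same gap, 25% ratio error
```

The relative error grows as the token probability shrinks, so the rare tokens that agentic workflows force the model to sample are exactly the ones whose updates get distorted the most.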
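To make the fix concrete, below is a minimal PyTorch-style sketch of a sequence-level IS correction, assuming per-token log-probs from both engines are gathered during rollout; the function name, tensor layout, and truncation threshold are illustrative assumptions rather than the authors' exact implementation. The key step is forming a single ratio π_train(y|x) / π_infer(y|x) per sampled sequence and using it to reweight that sequence's policy-gradient contribution.

```python
import torch


def sequence_level_is_loss(train_logprobs: torch.Tensor,   # (B, T) log pi_train per token
                           infer_logprobs: torch.Tensor,   # (B, T) log pi_infer per token
                           advantages: torch.Tensor,       # (B,)   per-sequence advantage
                           mask: torch.Tensor,             # (B, T) 1.0 for generated tokens
                           clip: float = 2.0) -> torch.Tensor:
    """Policy-gradient surrogate with one truncated IS ratio per sequence (sketch)."""
    # One ratio per *sequence*: w = pi_train(y|x) / pi_infer(y|x), computed by
    # summing token log-prob differences instead of keeping a ratio per token.
    log_ratio = ((train_logprobs - infer_logprobs) * mask).sum(dim=-1)
    seq_ratio = log_ratio.exp().clamp(max=clip)   # truncate to bound the variance

    # REINFORCE-style surrogate: the detached sequence ratio reweights each
    # sampled sequence so the gradient is corrected for having sampled from the
    # inference engine rather than the current training policy.
    seq_logprob = (train_logprobs * mask).sum(dim=-1)
    return -(seq_ratio.detach() * advantages * seq_logprob).mean()
```

The design choice is to apply the correction at the unit at which data is actually generated (whole sequences sampled from the inference engine), which is what makes the estimator principled; truncating the ratio trades a small bias for bounded variance on any single sequence.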