When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch

How the inference-training gap causes catastrophic failures in LLM reinforcement learning

Co-First Authors: Jiacai Liu and Yingru Li

Corresponding Authors: Yingru Li and Yu Shen

TL;DR

The relentless push for faster inference has created a dangerous “training-inference mismatch” that can silently kill reinforcement learning with LLMs. Our investigation reveals a vicious cycle that is particularly acute in modern reasoning and agentic RL:

  • OOD Contexts Drive Low-Probability Sampling: Agentic workflows expose models to external inputs and dynamic environments, forcing frequent generation of low-probability tokens that are essential for novel reasoning, tool calls, and adaptive responses.
  • Low-Probability Tokens Amplify Training Collapse: These tokens are the weakest link: the training-inference mismatch is most severe for them, producing catastrophically large gradients that cause silent degradation followed by sudden training failure.
  • Hardware Variability Complicates the Problem: Different GPU architectures exacerbate the mismatch unpredictably, meaning the same agentic training setup can succeed on one machine and catastrophically fail on another.
  • Sequence-Level IS is the Principled Solution: Sequence-level importance sampling is the theoretically grounded fix, restoring training stability across different hardware and complex tasks (a minimal sketch follows below).
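
To make the idea concrete, here is a minimal PyTorch-style sketch of how a sequence-level importance weight might be computed and applied. The function name, the truncation cap `clip`, the placement of the stop-gradient, and the REINFORCE-style surrogate are illustrative assumptions, not the article's exact estimator.

```python
import torch


def sequence_level_is_loss(
    logp_train: torch.Tensor,   # [B, T] log-probs of the sampled tokens under the training policy
    logp_infer: torch.Tensor,   # [B, T] log-probs of the same tokens reported by the inference engine
    advantages: torch.Tensor,   # [B]   per-sequence advantage (e.g. reward minus baseline)
    mask: torch.Tensor,         # [B, T] 1 for generated tokens, 0 for prompt/padding
    clip: float = 2.0,          # truncation cap on the IS weight (illustrative value)
) -> torch.Tensor:
    """REINFORCE-style loss with a single, truncated importance weight per sequence."""
    # Per-token log-ratio between the policy being trained and the policy that
    # actually produced the samples (the inference engine's numerics).
    log_ratio = (logp_train - logp_infer) * mask

    # Sequence-level importance weight: exponentiate the *sum* of log-ratios.
    # A mismatch on one low-probability token shifts this sum only additively,
    # instead of multiplying that token's gradient by a potentially huge ratio.
    seq_log_ratio = log_ratio.sum(dim=-1)
    is_weight = torch.exp(seq_log_ratio).detach()   # pure reweighting, no gradient
    is_weight = torch.clamp(is_weight, max=clip)    # truncate to bound variance

    # Off-policy policy gradient: weight * advantage * grad of the sequence log-likelihood.
    seq_logp = (logp_train * mask).sum(dim=-1)
    return -(is_weight * advantages * seq_logp).mean()


# Toy usage with simulated engine/trainer drift on 2 sequences of 4 tokens.
B, T = 2, 4
logp_train = torch.log(torch.rand(B, T) * 0.5 + 1e-4).requires_grad_()
logp_infer = (logp_train + 0.05 * torch.randn(B, T)).detach()
loss = sequence_level_is_loss(
    logp_train, logp_infer,
    advantages=torch.tensor([1.0, -0.5]),
    mask=torch.ones(B, T),
)
loss.backward()
print(loss.item())
```

The design point this is meant to illustrate: a token the inference engine sampled at probability 1e-6 but the trainer scores at 1e-4 gives a 100x per-token ratio, multiplying that token's gradient a hundredfold, whereas folded into a sequence-level weight the same discrepancy only adds ln(100) ≈ 4.6 to the summed log-ratio before truncation caps the weight.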

Read the full article on Notion →

Yingru Li
Research Scientist

My research focuses on building intelligent agents by advancing reinforcement learning, large-scale optimization, and LLM reasoning.