Beyond Precision: Why Training-Inference Mismatch is an Optimization Problem and How Simple LR Scheduling Fixes It
A learning rate scheduling approach to stabilize LLM-RL training
Corresponding Author: Yingru Li
Co-First Authors: Yaxiang Zhang and Yingru Li
TL;DR
- The Problem: Reinforcement Learning (RL) training for LLMs is notoriously unstable. While recent studies attribute this to “training-inference mismatch” (caused by hybrid engines), standard fixes like Importance Sampling might fail during longer training runs.
- The Insight: We analyze this instability through an optimization lens. We find that as training progresses, gradient noise and training-inference mismatch increase together. This suggests that the “mismatch” is not merely a static numerical issue, but a dynamic problem coupled with the model’s optimization trajectory.
- The Solution: A specialized Learning Rate (LR) Scheduler.
- Mechanism: By decaying the learning rate as gradient noise rises, we can consistently stabilize RL training and keep the training-inference mismatch at a safe level.
- Heuristic: We propose a novel method to time this decay based on Response Length. A surge in response length serves as a reliable early indicator of impending instability, signaling exactly when to reduce the learning rate (see the sketch after this list).
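Below is a minimal sketch of the core idea, not the authors' exact implementation: monitor the mean response length each training step and cut the learning rate once a sustained surge above the recent baseline is detected. The class name `ResponseLengthLRScheduler` and the window size, surge threshold, and decay factor are illustrative assumptions, not values from the paper; the optimizer interface assumes PyTorch-style `param_groups`.

```python
from collections import deque


class ResponseLengthLRScheduler:
    """Decay the LR when mean response length surges above its recent baseline.

    Hypothetical sketch: hyperparameters below are placeholders, not the paper's.
    """

    def __init__(self, optimizer, window=50, surge_ratio=1.5, decay_factor=0.1):
        self.optimizer = optimizer
        self.history = deque(maxlen=window)  # rolling record of mean response lengths
        self.surge_ratio = surge_ratio       # how large a jump counts as a surge
        self.decay_factor = decay_factor     # multiplicative LR cut once triggered
        self.triggered = False

    def step(self, mean_response_length: float) -> None:
        # Compare the current mean response length against the rolling baseline.
        if len(self.history) == self.history.maxlen and not self.triggered:
            baseline = sum(self.history) / len(self.history)
            if mean_response_length > self.surge_ratio * baseline:
                # Surge detected: decay the learning rate to stabilize training
                # and keep the training-inference mismatch at a safe level.
                for group in self.optimizer.param_groups:
                    group["lr"] *= self.decay_factor
                self.triggered = True
        self.history.append(mean_response_length)
```

In an RL training loop, one would call `scheduler.step(batch_mean_response_length)` after each policy update; the one-shot trigger here is a simplification, and a practical version might instead apply a gradual decay schedule once the surge is observed.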