Beyond Precision: Why Training-Inference Mismatch is an Optimization Problem and How Simple LR Scheduling Fixes It
A learning rate scheduling approach to stabilize LLM-RL training
Corresponding Author: Yingru Li
Co-First Authors: Yaxiang Zhang and Yingru Li
TL;DR
- The Problem: Reinforcement Learning (RL) training for LLMs is notoriously unstable. While recent studies attribute this to “training-inference mismatch” (caused by hybrid engines), standard fixes like Importance Sampling might fail during longer training runs.
- The Insight: We analyze this instability through an optimization lens. We find that as training progresses, gradient noise and training-inference mismatch increase together. This suggests that the “mismatch” is not merely a static numerical issue, but a dynamic problem coupled with the model’s optimization trajectory.
- The Solution: A specialized Learning Rate (LR) Scheduler.
- Mechanism: By decaying the learning rate as gradient noise rises, we can consistently stabilize RL training and keep the training-inference mismatch at a safe level.
- Heuristic: We propose a novel method to time this decay based on Response Length. A surge in response length serves as a reliable early indicator of impending instability, signaling exactly when to reduce the learning rate (see the sketch after this list).
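Below is a minimal sketch of the core idea, not the authors' exact implementation: monitor the mean response length each training step and cut the learning rate once a sustained surge above the recent baseline is detected. The class name `ResponseLengthLRScheduler` and the window size, surge threshold, and decay factor are illustrative assumptions, not values from the paper; the optimizer interface assumes PyTorch-style `param_groups`.

```python
from collections import deque


class ResponseLengthLRScheduler:
    """Decay the LR when mean response length surges above its recent baseline.

    Hypothetical sketch: hyperparameters below are placeholders, not the paper's.
    """

    def __init__(self, optimizer, window=50, surge_ratio=1.5, decay_factor=0.1):
        self.optimizer = optimizer
        self.history = deque(maxlen=window)  # rolling record of mean response lengths
        self.surge_ratio = surge_ratio       # how large a jump counts as a surge
        self.decay_factor = decay_factor     # multiplicative LR cut once triggered
        self.triggered = False

    def step(self, mean_response_length: float) -> None:
        # Compare the current mean response length against the rolling baseline.
        if len(self.history) == self.history.maxlen and not self.triggered:
            baseline = sum(self.history) / len(self.history)
            if mean_response_length > self.surge_ratio * baseline:
                # Surge detected: decay the learning rate to stabilize training
                # and keep the training-inference mismatch at a safe level.
                for group in self.optimizer.param_groups:
                    group["lr"] *= self.decay_factor
                self.triggered = True
        self.history.append(mean_response_length)
```

In an RL training loop, one would call `scheduler.step(batch_mean_response_length)` after each policy update; the one-shot trigger here is a simplification, and a practical version might instead apply a gradual decay schedule once the surge is observed.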