Beyond Precision: Why Training-Inference Mismatch is an Optimization Problem and How Simple LR Scheduling Fixes It

A learning rate scheduling approach to stabilize LLM-RL training

Corresponding Author: Yingru Li

Co-First Authors: Yaxiang Zhang and Yingru Li

TL;DR

  • The Problem: Reinforcement Learning (RL) training for LLMs is notoriously unstable. Recent studies attribute this to the “training-inference mismatch” introduced by hybrid training/inference engines, yet standard fixes such as importance sampling can still fail over longer training runs (the first sketch below illustrates how this mismatch is typically quantified).
  • The Insight: We analyze this instability through an optimization lens and find that, as training progresses, gradient noise and the training-inference mismatch grow simultaneously. This suggests that the “mismatch” is not merely a static numerical issue but a dynamic problem coupled to the model’s optimization trajectory.
  • The Solution: A specialized Learning Rate (LR) Scheduler.
    • Mechanism: By decaying the learning rate as gradient noise rises, we can consistently stabilize RL training and keep the training-inference mismatch at a safe level.
    • Heuristic: We propose a novel method to time this decay based on Response Length. A surge in response length serves as a reliable early indicator of impending instability, signaling exactly when to reduce the learning rate (see the second sketch below).
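
To make the terminology concrete, here is a minimal sketch (not taken from the article) of how the training-inference mismatch and the importance-sampling correction are commonly quantified, assuming per-token log-probabilities are available from both the rollout (inference) engine and the training engine for the same sampled tokens; the tensor shapes and metric names are illustrative.

```python
# Minimal sketch; assumes per-token log-probs from both engines for the sampled tokens.
import torch

def mismatch_metrics(logp_train: torch.Tensor, logp_infer: torch.Tensor,
                     mask: torch.Tensor) -> dict:
    """Quantify training-inference mismatch on sampled response tokens.

    logp_train: log-probs recomputed by the training engine,  shape [B, T]
    logp_infer: log-probs reported by the inference engine,   shape [B, T]
    mask:       1.0 for response tokens, 0.0 for padding,     shape [B, T]
    """
    log_ratio = (logp_train - logp_infer) * mask
    n_tokens = mask.sum().clamp(min=1.0)
    # Per-token importance-sampling ratio used to correct the policy gradient.
    is_ratio = torch.exp(log_ratio)
    return {
        # Mean absolute log-prob gap: a direct measure of numerical mismatch.
        "mean_abs_log_ratio": log_ratio.abs().sum().item() / n_tokens.item(),
        # Sample-based estimate of KL(inference || training) over sampled tokens.
        "approx_kl": (-log_ratio).sum().item() / n_tokens.item(),
        # Average importance weight; values far from 1.0 signal a large mismatch.
        "mean_is_ratio": (is_ratio * mask).sum().item() / n_tokens.item(),
    }
```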

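And here is a minimal sketch, again hypothetical rather than the authors’ implementation, of the response-length heuristic: a scheduler that multiplicatively decays the learning rate of a PyTorch-style optimizer whenever the batch-mean response length surges past a rolling baseline. The window size, surge threshold, and decay factor are illustrative defaults.

```python
from collections import deque

class ResponseLengthLRScheduler:
    """Decays the LR when the batch-mean response length surges past a
    rolling baseline (used here as a proxy for rising gradient noise)."""

    def __init__(self, optimizer, window: int = 50, surge_ratio: float = 1.5,
                 decay: float = 0.5, min_lr: float = 1e-7):
        self.optimizer = optimizer            # any optimizer exposing .param_groups
        self.history = deque(maxlen=window)   # recent batch-mean response lengths
        self.surge_ratio = surge_ratio        # how far above baseline counts as a surge
        self.decay = decay                    # multiplicative LR decay factor
        self.min_lr = min_lr                  # floor for the learning rate

    def step(self, mean_response_length: float) -> bool:
        """Call once per RL update; returns True if the LR was decayed."""
        decayed = False
        if len(self.history) == self.history.maxlen:
            baseline = sum(self.history) / len(self.history)
            if mean_response_length > self.surge_ratio * baseline:
                for group in self.optimizer.param_groups:
                    group["lr"] = max(group["lr"] * self.decay, self.min_lr)
                self.history.clear()          # rebuild the baseline after decaying
                decayed = True
        self.history.append(mean_response_length)
        return decayed
```

Assumed usage is one call per RL update, e.g. `scheduler.step(batch_mean_response_length)`; the window, threshold, and decay factor would need tuning per setup.
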
Read the full article on Notion →

Yingru Li
Member of Technical Staff

My research focuses on building intelligent agents by advancing reinforcement learning, large-scale optimization, and LLM reasoning.