Yingru Li
Optimization
Beyond Precision: Why Training-Inference Mismatch is an Optimization Problem and How Simple LR Scheduling Fixes It
RL training for LLMs is notoriously unstable. While recent studies attribute this to training-inference mismatch from hybrid engines, we show it is not merely a static numerical issue but a dynamic problem coupled with the model's optimization trajectory. We propose a specialized learning rate scheduler that decays the learning rate as gradient noise rises, using a surge in response length as a reliable early indicator of impending instability (a toy sketch of this rule appears after this entry).
Yaxiang Zhang, Yingru Li, Jiacai Liu, Ziniu Li, Jiawei Xu, Qian Liu
Dec 20, 2025
1 min read
Research, Theory
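The listing above only summarizes the method; the full post has the details. As a rough illustration of the idea, here is a minimal Python sketch of a length-triggered LR decay rule. The class name `LengthAwareLRScheduler`, the surge threshold, and the EMA smoothing are assumptions made for this example, not the authors' implementation.

```python
# Minimal sketch (assumed names and defaults, not the authors' code): decay the
# learning rate when the batch's mean response length surges, treating the
# surge as an early warning of rising gradient noise.

class LengthAwareLRScheduler:
    def __init__(self, base_lr, surge_threshold=1.2, decay=0.5,
                 min_lr=1e-7, ema_beta=0.9):
        self.lr = base_lr
        self.surge_threshold = surge_threshold  # relative length surge that triggers decay
        self.decay = decay                      # multiplicative LR decay applied on a surge
        self.min_lr = min_lr
        self.ema_beta = ema_beta
        self.length_ema = None                  # smoothed history of mean response length

    def step(self, mean_response_length):
        """Return the LR to use, given the current batch's mean response length."""
        if self.length_ema is None:
            self.length_ema = mean_response_length
            return self.lr
        surge = mean_response_length / max(self.length_ema, 1e-8)
        if surge > self.surge_threshold:        # length surge -> treat gradient noise as rising
            self.lr = max(self.lr * self.decay, self.min_lr)
        self.length_ema = (self.ema_beta * self.length_ema
                           + (1 - self.ema_beta) * mean_response_length)
        return self.lr


# Toy usage: a sudden jump in generation length triggers an LR cut.
sched = LengthAwareLRScheduler(base_lr=1e-6)
for mean_len in [512, 520, 515, 900, 950]:
    print(sched.step(mean_len))
```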
The Optimal Token Baseline
RL training for LLMs frequently suffers from training collapse due to exploding gradient variance in long-horizon tasks. We derive the Optimal Token Baseline (OTB) from first principles, proving that updates should be weighted inversely to their accumulated uncertainty (Realized Energy). Our computationally free Logit-Gradient Proxy eliminates training collapse, matches N=32 performance with just N=4, and reduces token consumption by 62-66% (a toy sketch of the inverse-uncertainty weighting appears after this entry).
Yingru Li, Jiawei Xu, Ziniu Li, Jiacai Liu, Yuxuan Tong, Wei Liu, Longtao Zheng, Zhenghai Xue, Yaxiang Zhang, Tianle Cai, Ge Zhang, Qian Liu, Baoxiang Wang
Dec 20, 2025
1 min read
Research, Theory
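The exact OTB estimator and the Logit-Gradient Proxy are defined in the full post; the sketch below only illustrates the headline idea of down-weighting each response in proportion to its accumulated uncertainty. The p(1-p) variance proxy, the mean normalization, and the function names are assumptions for this example, not the paper's definitions.

```python
# Minimal sketch (assumed proxy and names, not the paper's estimator): weight
# each response's advantage inversely to its accumulated per-token uncertainty,
# so long, high-variance trajectories contribute smaller updates.

import torch

def realized_energy(token_logprobs: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Accumulated uncertainty per response, shape (batch,).

    token_logprobs: (batch, seq) log-probabilities of the sampled tokens.
    mask:           (batch, seq) 1 for generated tokens, 0 for padding.
    Uses the sum of p * (1 - p) per token as a cheap variance proxy (an
    assumption for illustration, not the paper's definition of Realized Energy).
    """
    p = token_logprobs.exp()
    return (p * (1.0 - p) * mask).sum(dim=-1)

def otb_weighted_advantages(advantages, token_logprobs, mask, eps=1e-6):
    """Scale per-response advantages by 1 / (realized energy), mean-normalized."""
    energy = realized_energy(token_logprobs, mask)
    weights = 1.0 / (energy + eps)
    weights = weights * (weights.numel() / weights.sum())  # keep the mean weight near 1
    return advantages * weights

# Toy usage with random tensors standing in for rollout statistics.
batch, seq, vocab = 4, 16, 8
logps = torch.log_softmax(torch.randn(batch, seq, vocab), dim=-1).max(dim=-1).values
mask = torch.ones(batch, seq)
adv = torch.randn(batch)
print(otb_weighted_advantages(adv, logps, mask))
```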
The Stability Gap: Why Top-K Routing Breaks RL Optimization
A rigorous mathematical analysis showing that Top-K expert routing in Mixture-of-Experts models creates two fundamental pathologies: gradient blackout (zero gradients almost everywhere) and first-order approximation failure (a discontinuous policy mapping), explaining why MoE-RL training can be unstable (a small numerical sketch of both pathologies appears after this entry).
Yingru Li
Dec 7, 2025
11 min read
Research, Theory
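Both pathologies in the summary can be seen numerically in a few lines of code. The sketch below is an illustration of the argument, not the post's derivation: it treats the hard Top-K mask as the routing map and checks its behavior under a small perturbation and a boundary-crossing one.

```python
# Minimal numerical sketch of the two pathologies described above.
import torch

def route(logits: torch.Tensor, k: int = 2) -> torch.Tensor:
    """Hard Top-K routing mask: 1 for selected experts, 0 otherwise."""
    idx = torch.topk(logits, k).indices
    return torch.zeros_like(logits).scatter(0, idx, 1.0)

logits = torch.randn(8)           # router logits for 8 experts
mask = route(logits)

# Gradient blackout: the routing map is piecewise constant, so an infinitesimal
# perturbation leaves the mask unchanged and its Jacobian is zero almost everywhere.
perturbed = route(logits + 1e-4 * torch.randn_like(logits))
print(torch.equal(mask, perturbed))      # almost always True -> zero gradient a.e.

# First-order failure: pushing a non-selected expert's logit across the K-th
# ranked value flips the mask all at once; the map is discontinuous, so a
# linear approximation around the current logits cannot track the change.
bumped = logits.clone()
bumped[torch.argsort(logits, descending=True)[2]] += 10.0  # promote the 3rd-ranked expert
print(torch.equal(mask, route(bumped)))  # False: the mask jumps discontinuously
```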