Yingru Li
Optimization
The Optimal Token Baseline
RL training for LLMs frequently suffers from training collapse due to exploding gradient variance in long-horizon tasks. We derive the Optimal Token Baseline (OTB) from first principles, proving that updates should be weighted inversely to their accumulated uncertainty (Realized Energy). Our computationally free Logit-Gradient Proxy eliminates training collapse, matches N=32 performance with just N=4, and reduces token consumption by 62-66%.
Yingru Li, Jiawei Xu, Ziniu Li, Jiacai Liu, Yuxuan Tong, Wei Liu, Longtao Zheng, Zhenghai Xue, Yaxiang Zhang, Tianle Cai, Ge Zhang, Qian Liu, Baoxiang Wang

Dec 20, 2025 · 1 min read · Research, Theory
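
To make the inverse-uncertainty weighting concrete, here is a minimal PyTorch sketch under assumptions of ours rather than the paper's: the per-token uncertainty is proxied by the closed-form squared norm of the softmax logit gradient, and the cumulative sum, the `1e-6` floor, and the function name `otb_weighted_loss` are all illustrative, not the published algorithm.

```python
import torch

def otb_weighted_loss(logps, advantages, logits):
    """Sketch: weight each token's policy-gradient term by the inverse
    of its accumulated uncertainty ("Realized Energy").
    logps: (T,) log pi(a_t|s_t); advantages: (T,); logits: (T, V).
    """
    # For a softmax policy, grad_z log pi(a|s) = onehot(a) - p, so its
    # squared norm is 1 - 2*p_a + ||p||^2 (our assumed energy proxy).
    probs = logits.softmax(dim=-1)
    p_a = logps.exp()
    energy = 1.0 - 2.0 * p_a + (probs ** 2).sum(dim=-1)
    # "Realized" energy: uncertainty accumulated along the trajectory.
    realized = energy.cumsum(dim=0)
    # Inverse weighting: tokens carrying more accumulated uncertainty
    # take proportionally smaller steps; detach so the weights act as
    # constants rather than as part of the objective.
    weights = (1.0 / (realized + 1e-6)).detach()
    return -(weights * advantages * logps).sum()
```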
The Stability Gap: Why Top-K Routing Breaks RL Optimization
A rigorous mathematical analysis showing that Top-K expert routing in Mixture-of-Experts (MoE) models creates two fundamental pathologies: gradient blackout (zero gradients almost everywhere) and first-order approximation failure (a discontinuous policy mapping), explaining why MoE-RL training can be unstable.
Yingru Li

Dec 7, 2025 · 11 min read · Research, Theory
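
The gradient-blackout pathology can be reproduced in a few lines of PyTorch. This is a toy of ours, not the paper's setup: scalars stand in for expert networks, and the gate follows one common Top-K-then-softmax convention.

```python
import torch

# Toy demonstration: with hard Top-K routing, the router logits of
# unselected experts receive exactly zero gradient.
router_logits = torch.tensor([2.0, 1.0, 0.5, -1.0], requires_grad=True)
k = 2

topk_vals, topk_idx = router_logits.topk(k)
# Softmax over the selected logits only; zeros everywhere else.
gates = torch.zeros_like(router_logits).scatter(0, topk_idx, topk_vals.softmax(dim=0))

expert_outputs = torch.tensor([1.0, -0.5, 0.3, 2.0])  # stand-in experts
(gates * expert_outputs).sum().backward()

print(router_logits.grad)  # unselected entries are exactly 0: "blackout"
# The selected index set is also piecewise constant in the logits, so the
# logits-to-gates map jumps at ties: the source of the first-order
# approximation failure the analysis describes.
```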