Yingru Li
Variance Reduction
The Optimal Token Baseline
RL training for LLMs frequently collapses on long-horizon tasks because gradient variance explodes. We derive the Optimal Token Baseline (OTB) from first principles, proving that updates should be weighted inversely to their accumulated uncertainty (Realized Energy). Our computationally free Logit-Gradient Proxy eliminates training collapse, matches N=32 performance with just N=4, and reduces token consumption by 62-66%.
Yingru Li, Jiawei Xu, Ziniu Li, Jiacai Liu, Yuxuan Tong, Wei Liu, Longtao Zheng, Zhenghai Xue, Yaxiang Zhang, Tianle Cai, Ge Zhang, Qian Liu, Baoxiang Wang
Dec 20, 2025
Research, Theory
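The summary above rests on the idea that a well-chosen baseline reduces gradient variance without biasing the update. As background only, here is a minimal sketch of the classical single-step result for a softmax policy: the variance-minimizing scalar baseline weights each reward by the squared norm of its score vector. This is not the OTB derivation or the Logit-Gradient Proxy from this post; the function names, the four-action toy policy, and the reward values are all hypothetical and chosen for illustration.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


def score_matrix(p):
    """Row a is the score of action a w.r.t. the logits: e_a - p."""
    return np.eye(len(p)) - p


def optimal_baseline(p, r):
    """Classical variance-minimizing scalar baseline (illustrative, not OTB).

    b* = E[||s||^2 * r] / E[||s||^2], where s is the score vector.
    """
    sq_norm = (score_matrix(p) ** 2).sum(axis=1)
    return (p * sq_norm * r).sum() / (p * sq_norm).sum()


def gradient_mean_and_variance(p, r, b):
    """Exact mean and total variance (covariance trace) of s_a * (r_a - b)."""
    g = score_matrix(p) * (r - b)[:, None]      # one estimator per action
    mean = p @ g                                # expectation under the policy
    second_moment = p @ (g ** 2).sum(axis=1)    # E[||g||^2]
    return mean, second_moment - mean @ mean


# Hypothetical toy problem: 4 actions, sparse 0/1 rewards.
logits = np.array([2.0, 0.5, -1.0, 0.0])
rewards = np.array([1.0, 0.0, 0.0, 1.0])
p = softmax(logits)

b_mean = p @ rewards                 # common choice: expected reward
b_opt = optimal_baseline(p, rewards)

mean_a, var_mean = gradient_mean_and_variance(p, rewards, b_mean)
mean_b, var_opt = gradient_mean_and_variance(p, rewards, b_opt)

print(f"mean-reward baseline: b={b_mean:.4f}, variance={var_mean:.6f}")
print(f"optimal baseline:     b={b_opt:.4f}, variance={var_opt:.6f}")
```

Because the expected score is exactly zero, both baselines give the same expected gradient, and only the variance differs. Expectations are computed exactly over the small action set instead of by sampling, so the comparison does not depend on a random seed.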