Project Lead: Yingru Li
Co-First Authors: Yingru Li and Jiawei Xu
TL;DR
- The Problem: RL training for LLMs frequently suffers from “training collapse” due to exploding gradient variance in long-horizon tasks. Standard baselines (like Group Mean) fail because they treat all tokens and sequences as equally “noisy.”
- The Insight: Gradient noise is heterogeneous. We derive the Optimal Token Baseline (OTB) from first principles, proving that updates should be weighted inversely to their accumulated uncertainty (Realized Energy).
- The Solution: We introduce a computationally free Logit-Gradient Proxy that approximates the true gradient norm using only forward-pass probabilities, requiring zero additional backward passes (see the sketch after this list).
- The Impact:
  - Stability: Eliminates training collapse by stabilizing gradient norms.
  - Efficiency: Matches the performance of group size N=32 with just N=4.
  - Savings: Reduces token consumption by 62% on single-turn reasoning and 66% on multi-turn Tool-Integrated Reasoning (TIR).
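For concreteness, here is a minimal sketch of how a logit-gradient proxy can be computed from forward-pass probabilities alone. It relies only on the standard softmax identity that the gradient of log π(y) with respect to the logits is onehot(y) - p, whose squared norm is 1 - 2·p_y + ‖p‖². The function names, tensor shapes, and the definition of Realized Energy as a cumulative sum along the sequence are illustrative assumptions, not the paper's exact implementation.

```python
import torch


def logit_grad_proxy(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Per-token proxy for the squared logit-gradient norm of log pi(y_t | context).

    For a softmax policy, d log pi(y) / d logits = onehot(y) - p, so the squared
    norm equals 1 - 2 * p[y] + ||p||^2 and can be read off the forward pass.

    logits: (batch, seq_len, vocab) raw model outputs
    tokens: (batch, seq_len) sampled token ids
    returns: (batch, seq_len) proxy values, no backward pass required
    """
    probs = torch.softmax(logits, dim=-1)                      # (B, T, V)
    p_y = probs.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)   # (B, T) prob of sampled token
    p_sq = probs.pow(2).sum(dim=-1)                            # (B, T) ||p||^2 per position
    return 1.0 - 2.0 * p_y + p_sq


def realized_energy(proxy: torch.Tensor) -> torch.Tensor:
    """Accumulated uncertainty along the sequence (cumulative sum of the proxy)."""
    return proxy.cumsum(dim=-1)


if __name__ == "__main__":
    # Toy example: 2 sequences of length 5 over a vocabulary of 8 tokens.
    logits = torch.randn(2, 5, 8)
    tokens = torch.randint(0, 8, (2, 5))
    proxy = logit_grad_proxy(logits, tokens)
    energy = realized_energy(proxy)
    print(proxy.shape, energy.shape)  # torch.Size([2, 5]) torch.Size([2, 5])
```

Because everything above is derived from probabilities the model already produces during sampling, the proxy costs essentially nothing to compute; the exact way OTB uses this accumulated quantity to down-weight noisy updates is derived in the paper.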