The Optimal Token Baseline

Variance Reduction for Long-Horizon LLM-RL

Project Lead: Yingru Li

Co-First Authors: Yingru Li and Jiawei Xu

TL;DR

  • The Problem: RL training for LLMs frequently suffers from “training collapse” due to exploding gradient variance in long-horizon tasks. Standard baselines (like Group Mean) fail because they treat all tokens and sequences as equally “noisy.”
  • The Insight: Gradient noise is heterogeneous. We derive the Optimal Token Baseline (OTB) from first principles, proving that updates should be weighted inversely to their accumulated uncertainty (Realized Energy).
  • The Solution: We introduce a computationally free Logit-Gradient Proxy. This allows us to approximate the true gradient norm using only forward-pass probabilities, with zero additional backward passes (see the sketch after this list).
  • The Impact:
    • Stability: Eliminates training collapse by stabilizing gradient norms.
    • Efficiency: Matches the performance of group size N=32 with just N=4.
    • Savings: Reduces token consumption by 62% on Single-turn Reasoning and 66% on Multi-turn Tool-Integrated Reasoning (TIR).
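To make the proxy and the inverse-uncertainty weighting concrete, here is a minimal PyTorch sketch. It assumes the Logit-Gradient Proxy is the squared norm of the per-token log-probability gradient with respect to the logits (which, for a softmax policy, depends only on forward-pass probabilities), and that "Realized Energy" is the running sum of these norms along the sequence, used to down-weight updates. The function names, the cumulative-sum form, and the `eps` term are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def logit_gradient_proxy(logprobs: torch.Tensor, chosen_ids: torch.Tensor) -> torch.Tensor:
    """Per-token squared norm of d(log pi)/d(logits), from forward-pass probabilities only.

    For a softmax policy, grad_z log pi(a|z) = onehot(a) - p, so
    ||grad||^2 = 1 - 2 * p_a + sum_i p_i^2 (no backward pass needed).

    logprobs:   (T, V) log-probabilities over the vocabulary at each position.
    chosen_ids: (T,)   sampled token ids.
    """
    probs = logprobs.exp()                                                  # (T, V)
    p_chosen = probs.gather(-1, chosen_ids.unsqueeze(-1)).squeeze(-1)       # (T,)
    return 1.0 - 2.0 * p_chosen + (probs ** 2).sum(-1)                      # (T,)

def inverse_energy_weights(token_grad_sq: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Hypothetical weighting: accumulate per-token squared gradient norms along the
    sequence ("realized energy") and weight updates inversely to that accumulation."""
    energy = token_grad_sq.cumsum(dim=-1)                                   # (T,) accumulated uncertainty
    return 1.0 / (energy + eps)

# Toy usage on a random rollout.
T, V = 8, 50                                                                # sequence length, vocab size
logprobs = torch.randn(T, V).log_softmax(-1)
tokens = torch.randint(V, (T,))

g2 = logit_gradient_proxy(logprobs, tokens)
weights = inverse_energy_weights(g2)
print(g2, weights)
```

In this sketch the proxy is exact for the logit gradient of a single softmax step; treating it as a stand-in for the full parameter-gradient norm, and the specific inverse form of the weights, are the hedged parts.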

Read the full article on Notion →

[Code] [Dataset]

Yingru Li
Member of Technical Staff

My research focuses on building intelligent agents by advancing reinforcement learning, large-scale optimization, and LLM reasoning.