Project Lead: Yingru Li
Co-First Authors: Yingru Li and Jiawei Xu
TL;DR
- The Problem: RL training for LLMs frequently suffers from “training collapse” due to exploding gradient variance in long-horizon tasks. Standard baselines (like Group Mean) fail because they treat all tokens and sequences as equally “noisy.”
- The Insight: Gradient noise is heterogeneous. We derive the Optimal Token Baseline (OTB) from first principles, proving that updates should be weighted inversely to their accumulated uncertainty (Realized Energy).
- The Solution: We introduce a computationally free Logit-Gradient Proxy that approximates the true gradient norm using only forward-pass probabilities, requiring zero additional backward passes (see the sketch after this list).
- The Impact:
  - Stability: Eliminates training collapse by stabilizing gradient norms.
  - Efficiency: Matches the performance of group size N=32 with just N=4.
  - Savings: Reduces token consumption by 62% on single-turn reasoning and 66% on multi-turn Tool-Integrated Reasoning (TIR).
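For concreteness, here is a minimal sketch of how a logit-gradient proxy can be computed from forward-pass probabilities alone. It relies only on the standard softmax identity that the gradient of log π(y) with respect to the logits is onehot(y) - p, whose squared norm is 1 - 2·p_y + ‖p‖². The function names, tensor shapes, and the definition of Realized Energy as a cumulative sum along the sequence are illustrative assumptions, not the paper's exact implementation.

```python
import torch


def logit_grad_proxy(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Per-token proxy for the squared logit-gradient norm of log pi(y_t | context).

    For a softmax policy, d log pi(y) / d logits = onehot(y) - p, so the squared
    norm equals 1 - 2 * p[y] + ||p||^2 and can be read off the forward pass.

    logits: (batch, seq_len, vocab) raw model outputs
    tokens: (batch, seq_len) sampled token ids
    returns: (batch, seq_len) proxy values, no backward pass required
    """
    probs = torch.softmax(logits, dim=-1)                      # (B, T, V)
    p_y = probs.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)   # (B, T) prob of sampled token
    p_sq = probs.pow(2).sum(dim=-1)                            # (B, T) ||p||^2 per position
    return 1.0 - 2.0 * p_y + p_sq


def realized_energy(proxy: torch.Tensor) -> torch.Tensor:
    """Accumulated uncertainty along the sequence (cumulative sum of the proxy)."""
    return proxy.cumsum(dim=-1)


if __name__ == "__main__":
    # Toy example: 2 sequences of length 5 over a vocabulary of 8 tokens.
    logits = torch.randn(2, 5, 8)
    tokens = torch.randint(0, 8, (2, 5))
    proxy = logit_grad_proxy(logits, tokens)
    energy = realized_energy(proxy)
    print(proxy.shape, energy.shape)  # torch.Size([2, 5]) torch.Size([2, 5])
```

Because everything above is derived from probabilities the model already produces during sampling, the proxy costs essentially nothing to compute; the exact way OTB uses this accumulated quantity to down-weight noisy updates is derived in the paper.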