Language Models | Yingru Li

Beyond Precision: Why Training-Inference Mismatch is an Optimization Problem and How Simple LR Scheduling Fixes It

RL training for LLMs is notoriously unstable. While recent studies attribute this to training-inference mismatch from hybrid engines, we show this is not merely a static numerical issue, but a dynamic problem coupled with the model’s optimization trajectory. We propose a specialized learning rate (LR) scheduler that decays the LR as gradient noise rises, using a surge in response length as a reliable early indicator of impending instability.

Yaxiang Zhang, Yingru LI, Jiacai Liu, Ziniu Li, Jiawei Xu, Qian Liu

Dec 20, 2025 1 min read Research, Theory

The Optimal Token Baseline

RL training for LLMs frequently suffers from training collapse due to exploding gradient variance in long-horizon tasks. We derive the Optimal Token Baseline (OTB) from first principles, proving that updates should be weighted inversely to their accumulated uncertainty (Realized Energy). Our computationally free Logit-Gradient Proxy eliminates training collapse, matches N=32 performance with just N=4, and reduces token consumption by 62-66%.

Yingru LI, Jiawei Xu, Ziniu Li, Jiacai Liu, Yuxuan Tong, Wei Liu, Longtao Zheng, Zhenghai Xue, Yaxiang Zhang, Tianle Cai, Ge Zhang, Qian Liu, Baoxiang Wang

Dec 20, 2025 1 min read Research, Theory

Trust Region Masking for Long-Horizon LLM Reinforcement Learning

We derive tighter off-policy bounds for LLM-RL: an O(T^{3/2}) Pinsker-Marginal bound and an O(T) Mixed bound, compared with the classical O(T^2) bound. We propose Trust Region Masking (TRM), which excludes an entire sequence from gradient computation if any of its tokens violates the trust region.

Yingru LI

Dec 20, 2025 4 min read Research, Theory
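
The masking rule in the summary above can be illustrated with a minimal sketch. The summary does not say how the trust region is measured, so the per-token ratio test with tolerance `eps` below is an assumption made for illustration, and the function name `trust_region_mask` is hypothetical rather than taken from the post.

```python
import math

def trust_region_mask(logp_new, logp_old, eps=0.2):
    """Return one 0/1 weight per sequence.

    logp_new, logp_old: lists of per-token log-probabilities, one inner
    list per sequence, under the current and the behavior policy.
    ASSUMPTION: a token violates the trust region when its probability
    ratio leaves [1 - eps, 1 + eps]; the post may use another criterion.
    """
    mask = []
    for new_seq, old_seq in zip(logp_new, logp_old):
        violated = any(
            abs(math.exp(n - o) - 1.0) > eps
            for n, o in zip(new_seq, old_seq)
        )
        # One violating token drops the whole sequence from the gradient.
        mask.append(0.0 if violated else 1.0)
    return mask

logp_old = [[-1.0, -2.0], [-1.0, -2.0]]
logp_new = [[-1.05, -2.0], [-1.0, -1.0]]
mask = trust_region_mask(logp_new, logp_old)
```

Here the second sequence is excluded because its last token's ratio is far outside the tolerance, even though its first token is unchanged. In a training loop the mask would multiply each sequence's loss term before averaging.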

Mathematical Formulations of Rollout Correction Methods

Definitive mathematical formulations for rollout correction methods in VeRL, progressing from REINFORCE to PPO to Decoupled PPO. The formulations cover policy mismatch, temporal lag, replay buffers, and off-policy algorithms, using importance sampling and rejection sampling techniques.

Yingru LI

Nov 4, 2025 1 min read Research, Theory, Documentation

Information Bandwidth in Reinforcement Learning

An information-theoretic analysis showing that scalar advantage formulations learn ≤ log₂(B) bits per episode, while per-timestep advantages preserve full reward entropy.

Yingru LI

Oct 1, 2025 16 min read Research, Theory

When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch

The relentless push for faster inference creates a dangerous training-inference mismatch that silently kills RL with LLMs. We reveal the vicious cycle—particularly acute in reasoning and agentic RL—and show that sequence-level importance sampling is the principled solution.

Jiacai Liu, Yingru LI, Yuqian Fu, Jiawei Wang, Qian Liu, Yu Shen

Sep 17, 2025 1 min read Research, Theory
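
As a rough illustration of sequence-level importance sampling, the sketch below computes one weight per sampled response as the ratio of its probability under the training engine to its probability under the inference engine. The function name `sequence_importance_weight` and the optional truncation `cap` are assumptions for this example and are not taken from the post.

```python
import math

def sequence_importance_weight(logp_train, logp_rollout, cap=None):
    """Importance weight for one whole response.

    logp_train:   per-token log-probs recomputed by the training engine.
    logp_rollout: per-token log-probs reported by the inference engine.
    cap:          optional upper bound on the weight (ASSUMPTION: the
                  truncation is illustrative, not the post's stated method).
    """
    # Summing log-ratios equals the log of the product of token ratios,
    # i.e. the ratio of full-sequence probabilities.
    log_ratio = sum(t - r for t, r in zip(logp_train, logp_rollout))
    weight = math.exp(log_ratio)
    return min(weight, cap) if cap is not None else weight

w = sequence_importance_weight([-1.0, -2.0], [-1.1, -2.1])
w_capped = sequence_importance_weight([-1.0, -2.0], [-1.1, -2.1], cap=1.0)
```

Working in log space avoids underflow when responses are long, since the product of many per-token ratios is formed as a single exponential of a sum.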