Yingru Li
Research
The Optimal Token Baseline
RL training for LLMs frequently suffers from training collapse due to exploding gradient variance in long-horizon tasks. We derive the Optimal Token Baseline (OTB) from first principles, proving that updates should be weighted inversely to their accumulated uncertainty (Realized Energy). Our computationally free Logit-Gradient Proxy eliminates training collapse, matches N=32 performance with just N=4, and reduces token consumption by 62-66%.
Yingru LI, Jiawei Xu, Ziniu Li, Jiacai Liu, Yuxuan Tong, Wei Liu, Longtao Zheng, Zhenghai Xue, Yaxiang Zhang, Tianle Cai, Ge Zhang, Qian Liu, Baoxiang Wang
Dec 20, 2025
1 min read
Research, Theory
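The exact OTB weighting is derived in the post itself; as a rough illustration of the idea summarized in the entry above, here is a minimal PyTorch sketch that down-weights each rollout's policy-gradient term by an assumed accumulated-uncertainty proxy (the per-token sampling energy $-\log \pi$). The weight formula, the epsilon constant, and the variable names are placeholders, not the paper's Realized Energy or Logit-Gradient Proxy.

```python
import torch

def inverse_energy_weights(token_logps, mask, eps=1e-8):
    """Hypothetical inverse-uncertainty sequence weights (sketch only, not the paper's OTB).

    token_logps: (batch, seq_len) log-probs of the sampled tokens under the policy.
    mask:        (batch, seq_len) 1 for response tokens, 0 for padding.
    """
    per_token_energy = -(token_logps * mask)        # assumed proxy: -log pi(a_t | s_t)
    realized_energy = per_token_energy.sum(dim=-1)  # accumulate uncertainty over the rollout
    return 1.0 / (realized_energy + eps)            # weight each update inversely to it

# Toy usage: scale each sequence's REINFORCE term by its weight.
torch.manual_seed(0)
logits = torch.randn(4, 16, 50)                                   # (batch, T, vocab)
tokens = torch.randint(0, 50, (4, 16))
token_logps = logits.log_softmax(-1).gather(-1, tokens.unsqueeze(-1)).squeeze(-1)
mask = torch.ones(4, 16)
advantages = torch.randn(4)
w = inverse_energy_weights(token_logps.detach(), mask)            # treat the weight as a constant
loss = -(w * advantages * (token_logps * mask).sum(-1)).mean()
```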
The Stability Gap: Why Top-K Routing Breaks RL Optimization
A rigorous mathematical analysis showing that Top-K expert routing in Mixture of Experts creates two fundamental pathologies: gradient blackout (zero gradients almost everywhere) and first-order approximation failure (discontinuous policy mapping), explaining why MoE-RL training can be unstable.
Yingru LI
Dec 7, 2025
11 min read
Research, Theory
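As a toy check of the "gradient blackout" pathology described in the entry above (not the post's full analysis), the snippet below routes through a hard Top-K gate and shows that the router logits of non-selected experts receive exactly zero gradient, because the selection is piecewise constant in the logits.

```python
import torch

# Hard Top-K routing: only the k selected experts' logits appear in the output,
# so all other logits get exactly zero gradient ("gradient blackout").
torch.manual_seed(0)
num_experts, k = 8, 2
logits = torch.randn(num_experts, requires_grad=True)

topv, topi = torch.topk(logits, k)              # hard selection; indices are non-differentiable
gates = torch.softmax(topv, dim=-1)             # renormalize over the selected experts only

expert_outputs = torch.randn(num_experts, 4)    # stand-in for per-expert outputs
y = (gates.unsqueeze(-1) * expert_outputs[topi]).sum(0)
y.sum().backward()

print(logits.grad)  # nonzero only at the k selected positions; zeros everywhere else
```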
Scalable Exploration via Ensemble++
Ensemble++ achieves Thompson Sampling-level exploration with only O(d log T) ensemble directions, enabling scalable uncertainty quantification for neural bandits and beyond.
Yingru LI
Nov 29, 2025
4 min read
Research
Language as a Universal Interface for Reinforcement Learning Agents
This post establishes a formal mathematical framework for language agents, deriving fundamental challenges from first principles and providing concrete design guidelines with real-world examples from SWE-Bench.
Yingru LI
Nov 7, 2025
22 min read
Research, Theory, Engineering
Mathematical Formulations of Rollout Correction Methods
Definitive mathematical formulations for rollout correction methods in VeRL, progressing from REINFORCE to PPO to Decoupled PPO. Handles policy mismatch, temporal lag, replay buffers, and off-policy algorithms with importance sampling and rejection sampling techniques.
Yingru LI
Nov 4, 2025
1 min read
Research, Theory, Documentation
Part 3: Trust Region Optimization via Sequence Masking
Authors: Yingru Li, Jiacai Liu. Original Blog: When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch. Series Context. Part 1: We established the SGA (Stochastic Gradient Ascent) framework and identified two failure modes of off-policy mismatch: Bias (measured by $D_{TV}$) and Variance (measured by $\chi^2$-divergence).
Yingru LI, Jiacai Liu
Nov 4, 2025
18 min read
Research
Part 2: Applying the SGA Framework — Token vs. Sequence-level Correction
Authors: Yingru Li, Jiacai Liu. Original Blog: When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch. Citation: @online{liu-li-2025-rl-collapse, title = {When Speed Kills Stability: Demystifying {RL} Collapse from the Training-Inference Mismatch}, author = {Liu, Jiacai and Li, Yingru and Fu, Yuqian and Wang, Jiawei and Liu, Qian and Shen, Yu}, year = {2025}, month = sep, url = {https://richardli.
Yingru LI, Jiacai Liu
Oct 31, 2025
17 min read
Research
Part 1: Why Off-Policy Breaks RL — An SGA Analysis Framework
Authors: Yingru Li, Jiacai Liu. Original Blog: When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch. The Problem: In reinforcement learning, we often cannot sample directly from the policy $\pi_\theta$ we are optimizing.
Yingru LI, Jiacai Liu
Oct 30, 2025
11 min read
Research
Information Bandwidth in Reinforcement Learning
An information-theoretic analysis showing that scalar advantage formulations learn ≤ log₂(B) bits per episode, while per-timestep advantages preserve full reward entropy.
Yingru LI
Oct 1, 2025
16 min read
Research, Theory
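Assuming B denotes the number of rollouts compared per episode (for example, a GRPO-style group size; the post defines B precisely, so this reading is an assumption here), the bound in the entry above is easy to tabulate:

```python
import math

# Back-of-envelope reading of the information bound, assuming B = rollouts per episode.
for B in (2, 8, 32, 256):
    print(f"B = {B:>3}: scalar advantage conveys at most log2(B) = {math.log2(B):.0f} bits/episode")

# A per-timestep advantage over T steps can instead carry up to T distinct reward
# signals, so its information content is bounded by the full reward entropy, not log2(B).
```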
When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch
The relentless push for faster inference creates a dangerous training-inference mismatch that silently kills RL with LLMs. We reveal the vicious cycle—particularly acute in reasoning and agentic RL—and show that sequence-level importance sampling is the principled solution.
Jiacai Liu, Yingru LI, Yuqian Fu, Jiawei Wang, Qian Liu, Yu Shen
Sep 17, 2025
1 min read
Research, Theory
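To make the last point concrete, here is a minimal sketch of sequence-level importance sampling under the training-inference mismatch described above: the whole response is reweighted by the ratio of the trainer's and the inference engine's sequence likelihoods. The clipping threshold and the commented loss form are illustrative assumptions, not the post's exact recipe.

```python
import torch

def sequence_is_weight(train_logps, infer_logps, mask, clip=10.0):
    """Sequence-level importance weight pi_theta(y|x) / mu(y|x) (sketch).

    train_logps, infer_logps: (batch, seq_len) token log-probs from the trainer
    and from the inference engine that generated the rollout; mask marks response tokens.
    """
    log_ratio = ((train_logps - infer_logps) * mask).sum(dim=-1)  # log pi/mu per sequence
    return torch.exp(log_ratio).clamp(max=clip)                   # truncate to control variance

# Illustrative usage inside a REINFORCE-style loss (weight treated as a constant):
# w = sequence_is_weight(train_lp, infer_lp, mask).detach()
# loss = -(w * advantages * (train_lp * mask).sum(-1)).mean()
```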