Reinforcement Learning

Beyond Precision: Why Training-Inference Mismatch is an Optimization Problem and How Simple LR Scheduling Fixes It

RL training for LLMs is notoriously unstable. While recent studies attribute this to training-inference mismatch from hybrid engines, we show this is not merely a static numerical issue, but a dynamic problem coupled with the model’s optimization trajectory. We propose a specialized learning rate (LR) scheduler that decays the LR as gradient noise rises, using a surge in response length as a reliable early indicator of impending instability.

Yaxiang Zhang, Yingru LI, Jiacai Liu, Ziniu Li, Jiawei Xu, Qian Liu

Dec 20, 2025 1 min read Research, Theory

The Optimal Token Baseline

RL training for LLMs frequently suffers from training collapse due to exploding gradient variance in long-horizon tasks. We derive the Optimal Token Baseline (OTB) from first principles, proving that updates should be weighted inversely to their accumulated uncertainty (Realized Energy). Our computationally free Logit-Gradient Proxy eliminates training collapse, matches N=32 performance with just N=4, and reduces token consumption by 62-66%.

Yingru LI, Jiawei Xu, Ziniu Li, Jiacai Liu, Yuxuan Tong, Wei Liu, Longtao Zheng, Zhenghai Xue, Yaxiang Zhang, Tianle Cai, Ge Zhang, Qian Liu, Baoxiang Wang

Dec 20, 2025 1 min read Research, Theory

Trust Region Masking for Long-Horizon LLM Reinforcement Learning

We derive tighter off-policy bounds for LLM-RL: an $O(T^{3/2})$ Pinsker-Marginal bound and an $O(T)$ Mixed bound, compared to the classical $O(T^2)$. We propose Trust Region Masking (TRM), which excludes entire sequences from gradient computation if any token violates the trust region.

Yingru LI

Dec 20, 2025 4 min read Research, Theory
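
The sequence-level masking rule in the summary above can be sketched as follows. This is a minimal illustration and not the post's implementation: the function name `trust_region_mask`, the per-token criterion (absolute log-probability ratio between the training and rollout policies), and the threshold `delta` are all assumptions made for the example, since the summary does not state which divergence or threshold defines the trust region.

```python
import numpy as np

def trust_region_mask(logp_train, logp_rollout, lengths, delta=0.2):
    """Return one 0/1 weight per sequence.

    A sequence gets weight 0 if ANY of its valid (non-padding) tokens
    has |log pi_train - log pi_rollout| > delta, and weight 1 otherwise.
    The criterion and `delta` are assumptions for illustration only.
    """
    logp_train = np.asarray(logp_train, dtype=float)      # shape (batch, T)
    logp_rollout = np.asarray(logp_rollout, dtype=float)  # shape (batch, T)
    lengths = np.asarray(lengths)                         # shape (batch,)

    max_len = logp_train.shape[1]
    # valid[i, t] is True for real tokens and False for padding positions.
    valid = np.arange(max_len)[None, :] < lengths[:, None]

    # Per-token trust-region violation, ignoring padding.
    violation = (np.abs(logp_train - logp_rollout) > delta) & valid

    # One violating token is enough to drop the whole sequence.
    return (~violation.any(axis=1)).astype(float)
```

The returned weights would multiply each sequence's loss term before averaging, so a masked sequence contributes no gradient at all instead of a clipped or reweighted one.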

The Stability Gap: Why Top-K Routing Breaks RL Optimization

A rigorous mathematical analysis showing that Top-K expert routing in Mixture of Experts creates two fundamental pathologies: gradient blackout (zero gradients almost everywhere) and first-order approximation failure (discontinuous policy mapping), explaining why MoE-RL training can be unstable.

Yingru LI

Dec 7, 2025 11 min read Research, Theory

Scalable Exploration via Ensemble++

Ensemble++ achieves Thompson Sampling-level exploration with only O(d log T) ensemble directions, enabling scalable uncertainty quantification for neural bandits and beyond.

Yingru LI

Nov 29, 2025 4 min read Research

Language as a Universal Interface for Reinforcement Learning Agents

This post establishes a formal mathematical framework for language agents, deriving fundamental challenges from first principles and providing concrete design guidelines with real-world examples from SWE-Bench.

Yingru LI

Nov 7, 2025 22 min read Research, Theory, Engineering

Mathematical Formulations of Rollout Correction Methods

Definitive mathematical formulations for rollout correction methods in VeRL, progressing from REINFORCE to PPO to Decoupled PPO. Handles policy mismatch, temporal lag, replay buffers, and off-policy algorithms with importance sampling and rejection sampling techniques.

Yingru LI

Nov 4, 2025 1 min read Research, Theory, Documentation

Part 3: Trust Region Optimization via Sequence Masking

Authors: Yingru Li, Jiacai Liu. Original Blog: When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch. Series Context. Part 1: We established the SGA (Stochastic Gradient Ascent) framework and identified two failure modes of off-policy mismatch: Bias (measured by $D_{TV}$) and Variance (measured by $\chi^2$-divergence).

Yingru LI, Jiacai Liu

Nov 4, 2025 18 min read Research

Part 2: Applying the SGA Framework — Token v.s. Sequence-level Correction

Authors: Yingru Li, Jiacai Liu. Original Blog: When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch. Citation: @online{liu-li-2025-rl-collapse, title = {When Speed Kills Stability: Demystifying {RL} Collapse from the Training-Inference Mismatch}, author = {Liu, Jiacai and Li, Yingru and Fu, Yuqian and Wang, Jiawei and Liu, Qian and Shen, Yu}, year = {2025}, month = sep, url = {https://richardli.

Yingru LI, Jiacai Liu

Oct 31, 2025 17 min read Research

Part 1: Why Off-Policy Breaks RL — An SGA Analysis Framework

Authors: Yingru Li, Jiacai Liu. Original Blog: When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch. The Problem: In reinforcement learning, we often cannot sample directly from the policy $\pi_\theta$ we are optimizing.

Yingru LI, Jiacai Liu

Oct 30, 2025 11 min read Research