Importance Sampling

Mathematical Formulations of Rollout Correction Methods

Definitive mathematical formulations for rollout correction methods in VeRL, progressing from REINFORCE to PPO to Decoupled PPO. The formulations cover policy mismatch, temporal lag, replay buffers, and off-policy algorithms, using importance sampling and rejection sampling techniques.

Yingru LI

Nov 4, 2025 1 min read Research, Theory, Documentation

Part 2: Applying the SGA Framework — Token v.s. Sequence-level Correction

Authors: Yingru Li, Jiacai Liu

Original Blog: When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch

Citation: @online{liu-li-2025-rl-collapse, title = {When Speed Kills Stability: Demystifying {RL} Collapse from the Training-Inference Mismatch}, author = {Liu, Jiacai and Li, Yingru and Fu, Yuqian and Wang, Jiawei and Liu, Qian and Shen, Yu}, year = {2025}, month = sep, url = {https://richardli.

Yingru LI, Jiacai Liu

Oct 31, 2025 17 min read Research

When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch

The relentless push for faster inference creates a dangerous training-inference mismatch that silently kills RL with LLMs. We reveal the vicious cycle—particularly acute in reasoning and agentic RL—and show that sequence-level importance sampling is the principled solution.

Jiacai Liu, Yingru LI, Yuqian Fu, Jiawei Wang, Qian Liu, Yu Shen

Sep 17, 2025 1 min read Research, Theory