Off-Policy

Part 3: Trust Region Optimization via Sequence Masking

Authors: Yingru Li, Jiacai Liu. Original Blog: When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch. Series Context: In Part 1, we established the SGA (Stochastic Gradient Ascent) framework and identified two failure modes of off-policy mismatch: Bias (measured by $D_{TV}$) and Variance (measured by $\chi^2$-divergence).

Yingru LI, Jiacai Liu

Nov 4, 2025 · 18 min read · Research

Part 2: Applying the SGA Framework — Token vs. Sequence-level Correction

Authors: Yingru Li, Jiacai Liu. Original Blog: When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch. Citation: @online{liu-li-2025-rl-collapse, title = {When Speed Kills Stability: Demystifying {RL} Collapse from the Training-Inference Mismatch}, author = {Liu, Jiacai and Li, Yingru and Fu, Yuqian and Wang, Jiawei and Liu, Qian and Shen, Yu}, year = {2025}, month = sep, url = {https://richardli.

Yingru LI, Jiacai Liu

Oct 31, 2025 · 17 min read · Research

Part 1: Why Off-Policy Breaks RL — An SGA Analysis Framework

Authors: Yingru Li, Jiacai Liu. Original Blog: When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch. The Problem: In reinforcement learning, we often cannot sample directly from the policy $\pi_\theta$ we are optimizing.

Yingru LI, Jiacai Liu

Oct 30, 2025 · 11 min read · Research