Authors: Yingru Li, Jiacai Liu
Original Blog: When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch

Series Context
Part 1: We established the SGA (Stochastic Gradient Ascent) framework and identified two failure modes of off-policy mismatch: Bias (measured by $D_{TV}$) and Variance (measured by $\chi^2$-divergence).
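
As a quick refresher on the two quantities named in Part 1, here is a minimal sketch that computes both for a pair of small discrete distributions. The names `pi_train` and `mu_infer`, the toy probabilities, and the direction of the divergence, $\chi^2(\pi \| \mu)$ with samples drawn from $\mu$, are assumptions made for illustration and are not taken from the original post.

```python
def tv_distance(p, q):
    """Total variation distance: 0.5 * sum_y |p(y) - q(y)|."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def chi2_divergence(p, q):
    """Chi-square divergence chi^2(p || q) = sum_y (p(y) - q(y))^2 / q(y).

    Assumes q(y) > 0 wherever p(y) > 0.
    """
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))


def ratio_second_moment(p, q):
    """E_{y ~ q}[(p(y) / q(y))^2], the second moment of the importance ratio."""
    return sum(qi * (pi / qi) ** 2 for pi, qi in zip(p, q))


# Hypothetical next-token distributions over a 3-token vocabulary.
pi_train = [0.5, 0.3, 0.2]  # policy as seen by the training engine (assumed)
mu_infer = [0.4, 0.4, 0.2]  # policy as seen by the inference engine (assumed)

d_tv = tv_distance(pi_train, mu_infer)
chi2 = chi2_divergence(pi_train, mu_infer)

# Identity: chi^2(p || q) = E_q[(p/q)^2] - 1, which is the variance of the
# importance ratio p/q under q (the ratio has mean 1 under q).
second_moment = ratio_second_moment(pi_train, mu_infer)
```

The identity in the final comment is why the $\chi^2$-divergence is a natural measure for the variance failure mode: it equals the variance of the importance ratio under the sampling distribution.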