Authors: Yingru Li, Jiacai Liu
Original Blog: When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch

The Problem

In reinforcement learning, we often cannot sample directly from the policy $\pi_\theta$ we are optimizing.