VeRL

Mathematical Formulations of Rollout Correction Methods

Definitive mathematical formulations for rollout correction methods in VeRL, progressing from REINFORCE to PPO to Decoupled PPO. The formulations cover policy mismatch, temporal lag, replay buffers, and off-policy algorithms, using importance sampling and rejection sampling techniques.

Yingru LI

Nov 4, 2025 · 1 min read · Research, Theory, Documentation