Mathematical Formulations of Rollout Correction Methods
A unified framework for handling general off-policy problems in RL training
Author: Yingru Li
Abstract
This document provides the definitive mathematical formulations for rollout correction methods in verl, following the natural progression from REINFORCE to PPO to Decoupled PPO.
Rollout correction provides a unified framework for handling general off-policy problems in RL training: any scenario where the distribution that collected the data differs from the distribution being trained. A generic form of this correction is sketched after the scenario list below.
Applicable scenarios include:
- Policy mismatch: Different numerical precision (FP8 vs FP16 vs BF16 vs FP32), different backends (vLLM vs SGLang for rollout, FSDP vs Megatron for training)
- Temporal lag: Model staleness, asynchronous rollout workers
- Replay buffers: Training on historical trajectories from earlier policy versions
- Off-policy algorithms: Behavioral cloning, DAPO, expert demonstrations
- Data filtering: Reweighting, preference learning, curriculum learning
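In every case above, the correction takes the same generic form: reweight the gradient estimator by the likelihood ratio between the policy being trained and the policy that produced the data. The sketch below uses illustrative notation (mu for the behavior policy, pi_theta for the training policy, A_t for an advantage estimate) rather than the document's final definitions, and, as in PPO-style methods, ignores the state-distribution mismatch.

```latex
% Generic off-policy correction via importance sampling (illustrative notation).
% \mu is the behavior policy that generated the data, \pi_\theta the policy being
% trained, and \hat{A}_t an advantage estimate; the ratio reweights each sampled token.
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\mu(a_t \mid s_t)}, \qquad
\nabla_\theta J(\theta) \approx
  \mathbb{E}_{\tau \sim \mu}\Big[\textstyle\sum_t r_t(\theta)\, \hat{A}_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big]
```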
Key Topics
Theoretical Foundation: From REINFORCE to Decoupled PPO
- REINFORCE policy gradient baseline
- PPO with trust region control
- Decoupled PPO for batch size invariance (the three objectives are sketched below)
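As a rough guide to the progression listed above, the three surrogate objectives can be sketched as follows. The notation (pi_old for the proximal policy, pi_rollout for the behavior policy, Â_t for the advantage estimate, epsilon for the clip range) is illustrative here and is defined precisely in the body of the document.

```latex
% REINFORCE: on-policy policy gradient, no ratio and no clipping.
J_{\mathrm{REINFORCE}}(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\textstyle\sum_t \hat{A}_t \log \pi_\theta(a_t \mid s_t)\Big]

% PPO: clipped surrogate with the trust-region ratio taken against the old (proximal) policy.
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\mathrm{old}}(a_t \mid s_t)}, \qquad
J_{\mathrm{PPO}}(\theta)
  = \mathbb{E}\Big[\min\!\big(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\,1-\varepsilon,\,1+\varepsilon)\,\hat{A}_t\big)\Big]

% Decoupled PPO: the trust-region ratio stays against pi_old, while a separate
% importance weight corrects for data collected under the rollout (behavior) policy.
J_{\mathrm{dec}}(\theta)
  = \mathbb{E}_{\tau \sim \pi_{\mathrm{rollout}}}\Big[
      \frac{\pi_{\mathrm{old}}(a_t \mid s_t)}{\pi_{\mathrm{rollout}}(a_t \mid s_t)}\,
      \min\!\big(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\,1-\varepsilon,\,1+\varepsilon)\,\hat{A}_t\big)\Big]
```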
Implementation in VeRL: The Three-Policy Framework
- Policy roles: Rollout (behavior), Old (proximal), Current (training)
- Operating modes: Decoupled vs Bypass
- Two distribution shifts and their corrections (see the ratio sketch after this list)
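The sketch below shows how the two shifts can be read off per-token log-probabilities; the function name and tensor shapes are assumptions for illustration, not verl's actual API.

```python
import torch

def policy_ratios(logp_current, logp_old, logp_rollout):
    """Illustrative three-policy ratios, assuming each input is a
    [batch, seq_len] tensor of per-token log-probabilities.

    - trust-region shift: current vs. old (proximal) policy
    - behavior shift:     old vs. rollout (behavior) policy
    """
    ppo_ratio = torch.exp(logp_current - logp_old)        # pi_current / pi_old
    behavior_ratio = torch.exp(logp_old - logp_rollout)   # pi_old / pi_rollout
    return ppo_ratio, behavior_ratio
```

In the decoupled reading, PPO's clipping acts on the first ratio, while the IS/RS corrections discussed below act on the second.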
Algorithmic Components and Combinations
- IS/RS (importance sampling / rejection sampling) aggregation levels (token, sequence, geometric); a sketch follows this list
- Loss functions (PPO vs policy gradient)
- Safety mechanisms (veto, batch normalization)
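For the aggregation levels named above, here is a hedged sketch built from per-token log-ratios between the training and rollout policies; the helper name and shapes are hypothetical.

```python
import torch

def aggregate_is_weights(log_ratio, mask, level="token"):
    """Aggregate per-token importance ratios (log_ratio = logp_train - logp_rollout).

    level="token":     one weight per token, exp(log_ratio)
    level="sequence":  one weight per sequence, exp(sum of token log-ratios)
    level="geometric": one weight per sequence, exp(mean of token log-ratios)
    mask is a [batch, seq_len] 0/1 tensor marking response tokens.
    """
    if level == "token":
        return torch.exp(log_ratio) * mask
    seq_sum = (log_ratio * mask).sum(dim=-1)
    if level == "sequence":
        return torch.exp(seq_sum)
    if level == "geometric":
        return torch.exp(seq_sum / mask.sum(dim=-1).clamp(min=1))
    raise ValueError(f"unknown aggregation level: {level}")
```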
Off-Policy Diagnostic Metrics
- KL divergence, perplexity, chi-squared divergence (estimators sketched below)
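A sketch of how these diagnostics might be estimated from per-token log-probabilities on data drawn from the rollout policy; the function and its inputs are illustrative, not the document's or verl's exact metric definitions.

```python
import torch

def offpolicy_diagnostics(logp_train, logp_rollout, mask):
    """Monte-Carlo estimates of common off-policy diagnostics (illustrative).

    KL(rollout || train)    ~ mean of (logp_rollout - logp_train) over response tokens
    perplexity (train)      ~ exp(-mean logp_train) over response tokens
    chi^2(train || rollout) ~ mean of (pi_train / pi_rollout)^2 minus 1
    """
    n = mask.sum().clamp(min=1)
    log_ratio = logp_train - logp_rollout
    kl = ((-log_ratio) * mask).sum() / n
    ppl = torch.exp(-(logp_train * mask).sum() / n)
    chi2 = (torch.exp(2 * log_ratio) * mask).sum() / n - 1
    return {"kl": kl, "ppl": ppl, "chi2": chi2}
```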
Summary and Decision Guide
- Method comparison table
- Scenario-based recommendations
Related Resources
- Rollout Correction Usage Guide - Practical implementation guide
- VeRL GitHub Repository