Mathematical Formulations of Rollout Correction Methods
A unified framework for handling general off-policy problems in RL training
Author: Yingru Li
Abstract
This document provides the definitive mathematical formulations for rollout correction methods in verl, following the natural progression from REINFORCE to PPO to Decoupled PPO.
Rollout correction provides a unified framework for handling general off-policy problems in RL training: any scenario where the distribution that collected the data differs from the distribution being trained. A generic form of this correction is sketched after the scenario list below.
Applicable scenarios include:
- Policy mismatch: Different numerical precision (FP8 vs FP16 vs BF16 vs FP32), different backends (vLLM vs SGLang for rollout, FSDP vs Megatron for training)
- Temporal lag: Model staleness, asynchronous rollout workers
- Replay buffers: Training on historical trajectories from earlier policy versions
- Off-policy algorithms: Behavioral cloning, DAPO, expert demonstrations
- Data filtering: Reweighting, preference learning, curriculum learning
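In every case above, the correction takes the same generic form: reweight the gradient estimator by the likelihood ratio between the policy being trained and the policy that produced the data. The sketch below uses illustrative notation (mu for the behavior policy, pi_theta for the training policy, A_t for an advantage estimate) rather than the document's final definitions, and, as in PPO-style methods, ignores the state-distribution mismatch.

```latex
% Generic off-policy correction via importance sampling (illustrative notation).
% \mu is the behavior policy that generated the data, \pi_\theta the policy being
% trained, and \hat{A}_t an advantage estimate; the ratio reweights each sampled token.
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\mu(a_t \mid s_t)}, \qquad
\nabla_\theta J(\theta) \approx
  \mathbb{E}_{\tau \sim \mu}\Big[\textstyle\sum_t r_t(\theta)\, \hat{A}_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big]
```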
Key Topics
Theoretical Foundation: From REINFORCE to Decoupled PPO
- REINFORCE policy gradient baseline
- PPO with trust region control
- Decoupled PPO for batch size invariance (the three objectives are sketched below)
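As a rough guide to the progression listed above, the three surrogate objectives can be sketched as follows. The notation (pi_old for the proximal policy, pi_rollout for the behavior policy, Â_t for the advantage estimate, epsilon for the clip range) is illustrative here and is defined precisely in the body of the document.

```latex
% REINFORCE: on-policy policy gradient, no ratio and no clipping.
J_{\mathrm{REINFORCE}}(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\textstyle\sum_t \hat{A}_t \log \pi_\theta(a_t \mid s_t)\Big]

% PPO: clipped surrogate with the trust-region ratio taken against the old (proximal) policy.
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\mathrm{old}}(a_t \mid s_t)}, \qquad
J_{\mathrm{PPO}}(\theta)
  = \mathbb{E}\Big[\min\!\big(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\,1-\varepsilon,\,1+\varepsilon)\,\hat{A}_t\big)\Big]

% Decoupled PPO: the trust-region ratio stays against pi_old, while a separate
% importance weight corrects for data collected under the rollout (behavior) policy.
J_{\mathrm{dec}}(\theta)
  = \mathbb{E}_{\tau \sim \pi_{\mathrm{rollout}}}\Big[
      \frac{\pi_{\mathrm{old}}(a_t \mid s_t)}{\pi_{\mathrm{rollout}}(a_t \mid s_t)}\,
      \min\!\big(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\,1-\varepsilon,\,1+\varepsilon)\,\hat{A}_t\big)\Big]
```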
Implementation in VeRL: The Three-Policy Framework
- Policy roles: Rollout (behavior), Old (proximal), Current (training)
- Operating modes: Decoupled vs Bypass
- Two distribution shifts and their corrections (see the ratio sketch after this list)
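The sketch below shows how the two shifts can be read off per-token log-probabilities; the function name and tensor shapes are assumptions for illustration, not verl's actual API.

```python
import torch

def policy_ratios(logp_current, logp_old, logp_rollout):
    """Illustrative three-policy ratios, assuming each input is a
    [batch, seq_len] tensor of per-token log-probabilities.

    - trust-region shift: current vs. old (proximal) policy
    - behavior shift:     old vs. rollout (behavior) policy
    """
    ppo_ratio = torch.exp(logp_current - logp_old)        # pi_current / pi_old
    behavior_ratio = torch.exp(logp_old - logp_rollout)   # pi_old / pi_rollout
    return ppo_ratio, behavior_ratio
```

In the decoupled reading, PPO's clipping acts on the first ratio, while the IS/RS corrections discussed below act on the second.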
Algorithmic Components and Combinations
- IS/RS (importance sampling / rejection sampling) aggregation levels (token, sequence, geometric); a sketch follows this list
- Loss functions (PPO vs policy gradient)
- Safety mechanisms (veto, batch normalization)
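For the aggregation levels named above, here is a hedged sketch built from per-token log-ratios between the training and rollout policies; the helper name and shapes are hypothetical.

```python
import torch

def aggregate_is_weights(log_ratio, mask, level="token"):
    """Aggregate per-token importance ratios (log_ratio = logp_train - logp_rollout).

    level="token":     one weight per token, exp(log_ratio)
    level="sequence":  one weight per sequence, exp(sum of token log-ratios)
    level="geometric": one weight per sequence, exp(mean of token log-ratios)
    mask is a [batch, seq_len] 0/1 tensor marking response tokens.
    """
    if level == "token":
        return torch.exp(log_ratio) * mask
    seq_sum = (log_ratio * mask).sum(dim=-1)
    if level == "sequence":
        return torch.exp(seq_sum)
    if level == "geometric":
        return torch.exp(seq_sum / mask.sum(dim=-1).clamp(min=1))
    raise ValueError(f"unknown aggregation level: {level}")
```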
Off-Policy Diagnostic Metrics
- KL divergence, perplexity, chi-squared divergence (estimators sketched below)
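A sketch of how these diagnostics might be estimated from per-token log-probabilities on data drawn from the rollout policy; the function and its inputs are illustrative, not the document's or verl's exact metric definitions.

```python
import torch

def offpolicy_diagnostics(logp_train, logp_rollout, mask):
    """Monte-Carlo estimates of common off-policy diagnostics (illustrative).

    KL(rollout || train)    ~ mean of (logp_rollout - logp_train) over response tokens
    perplexity (train)      ~ exp(-mean logp_train) over response tokens
    chi^2(train || rollout) ~ mean of (pi_train / pi_rollout)^2 minus 1
    """
    n = mask.sum().clamp(min=1)
    log_ratio = logp_train - logp_rollout
    kl = ((-log_ratio) * mask).sum() / n
    ppl = torch.exp(-(logp_train * mask).sum() / n)
    chi2 = (torch.exp(2 * log_ratio) * mask).sum() / n - 1
    return {"kl": kl, "ppl": ppl, "chi2": chi2}
```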
Summary and Decision Guide
- Method comparison table
- Scenario-based recommendations
Related Resources
- Rollout Correction Usage Guide - Practical implementation guide
- VeRL GitHub Repository