Mathematical Formulations of Rollout Correction Methods

A unified framework for handling general off-policy problems in RL training

Author: Yingru Li

Abstract

This document provides the definitive mathematical formulations for rollout correction methods in verl, following the natural progression from REINFORCE to PPO to Decoupled PPO.

Rollout correction provides a unified framework for handling general off-policy problems in RL training: any scenario where the distribution that collected the data differs from the distribution being trained. The common correction is sketched after the list below.

Applicable scenarios include:

  • Policy mismatch: Different numerical precision (FP8, FP16, BF16, FP32) or different backends between rollout and training (e.g., vLLM or SGLang for rollout vs. FSDP or Megatron for training)
  • Temporal lag: Model staleness, asynchronous rollout workers
  • Replay buffers: Training on historical trajectories from earlier policy versions
  • Off-policy algorithms: Behavioral cloning, DAPO, expert demonstrations
  • Data filtering: Reweighting, preference learning, curriculum learning
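
The correction underlying every scenario above is the same likelihood ratio between the policy being trained and whichever policy produced the data. Below is a minimal sketch, assuming per-token log-probabilities of the sampled tokens are available from both policies; the function name and clipping threshold are illustrative, not verl's API.

```python
import torch

def importance_weights(logp_train: torch.Tensor,
                       logp_rollout: torch.Tensor,
                       clip_max: float = 10.0) -> torch.Tensor:
    """Per-token importance ratios pi_train(a|s) / pi_rollout(a|s).

    Both inputs are log-probabilities of the sampled tokens with
    shape (batch, seq_len). Ratios are clipped from above so that a
    few extreme tokens cannot dominate the update.
    """
    log_ratio = logp_train - logp_rollout
    return torch.exp(log_ratio).clamp(max=clip_max)

# Usage with a policy-gradient surrogate: maximize E_rollout[w * A],
# letting gradients flow into w through logp_train.
#   w = importance_weights(logp_train, logp_rollout)
#   loss = -(w * advantages).mean()
```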

Key Topics

  1. Theoretical Foundation: From REINFORCE to Decoupled PPO (the three objectives are sketched after this list)

    • REINFORCE policy gradient baseline
    • PPO with trust region control
    • Decoupled PPO for batch size invariance
  2. Implementation in verl: The Three-Policy Framework

    • Policy roles: Rollout (behavior), Old (proximal), Current (being trained)
    • Operating modes: Decoupled vs Bypass
    • Two distribution shifts and their corrections
  3. Algorithmic Components and Combinations

    • IS/RS aggregation levels: token, sequence, geometric (sketched in code after this list)
    • Loss functions (PPO vs policy gradient)
    • Safety mechanisms (veto, batch normalization)
  4. Off-Policy Diagnostic Metrics

    • KL divergence, perplexity, chi-squared divergence (estimators sketched after this list)
  5. Summary and Decision Guide

    • Method comparison table
    • Scenario-based recommendations
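
The progression in item 1, written generically (notation here is not verl-specific): REINFORCE uses the raw policy gradient, PPO clips a ratio against the old (proximal) policy, and decoupled PPO additionally separates the behavior policy that collected the data from the proximal policy that anchors the trust region. A sketch of the three objectives in one common form, with the decoupled objective following the batch-size-invariance formulation:

```latex
% REINFORCE policy gradient
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\Big]

% PPO clipped surrogate, with r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\mathrm{old}}(a_t \mid s_t)
L^{\mathrm{PPO}}(\theta)
  = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\;
      \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big]

% Decoupled PPO: \pi_{\mathrm{prox}} anchors the clipped trust region, while an outer
% importance ratio corrects for data collected by the behavior policy \pi_{\mathrm{behav}}
L^{\mathrm{dec}}(\theta)
  = \mathbb{E}_{t \sim \pi_{\mathrm{behav}}}\Big[
      \frac{\pi_{\mathrm{prox}}(a_t \mid s_t)}{\pi_{\mathrm{behav}}(a_t \mid s_t)}
      \min\big(\rho_t(\theta)\,\hat{A}_t,\;
        \operatorname{clip}(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big],
  \qquad \rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\mathrm{prox}}(a_t \mid s_t)}
```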
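
For the IS/RS aggregation levels in item 3, the per-token log-ratios can be combined per token, per sequence (product of token ratios), or as a per-sequence geometric mean. A minimal sketch; the function and argument names are illustrative, not verl's configuration keys.

```python
import torch

def aggregate_is_ratios(log_ratio: torch.Tensor,
                        mask: torch.Tensor,
                        level: str = "token") -> torch.Tensor:
    """Aggregate per-token log importance ratios log(pi_train / pi_rollout).

    log_ratio, mask: (batch, seq_len); mask is 1 on response tokens.
      "token"     -> one ratio per token (no aggregation)
      "sequence"  -> product of token ratios, exp(sum of log-ratios)
      "geometric" -> geometric mean of token ratios per sequence
    """
    if level == "token":
        return torch.exp(log_ratio) * mask
    summed = (log_ratio * mask).sum(dim=-1)
    if level == "sequence":
        return torch.exp(summed)
    if level == "geometric":
        lengths = mask.sum(dim=-1).clamp(min=1)
        return torch.exp(summed / lengths)
    raise ValueError(f"unknown aggregation level: {level}")
```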
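
For the diagnostics in item 4, all three quantities can be estimated by Monte Carlo from tokens sampled by the rollout policy. A sketch under the same assumption of masked per-token log-probabilities:

```python
import torch

def offpolicy_diagnostics(logp_train: torch.Tensor,
                          logp_rollout: torch.Tensor,
                          mask: torch.Tensor) -> dict:
    """Per-token off-policy diagnostics from rollout-generated tokens."""
    n = mask.sum().clamp(min=1)
    log_ratio = (logp_train - logp_rollout) * mask
    kl = -(log_ratio.sum() / n)                       # estimate of KL(rollout || train)
    ppl = torch.exp(-(logp_train * mask).sum() / n)   # train-policy perplexity on rollout tokens
    w = torch.exp(log_ratio) * mask                   # per-token importance ratios
    chi2 = w.pow(2).sum() / n - 1.0                   # estimate of chi-squared(train || rollout)
    return {"kl": kl, "ppl": ppl, "chi2": chi2}
```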

Read the full documentation →

Yingru Li
Research Scientist

My research focuses on building intelligent agents by advancing reinforcement learning, large-scale optimization, and LLM reasoning.