Trust Region

Trust Region Masking for Long-Horizon LLM Reinforcement Learning

We derive tighter off-policy bounds for LLM-RL: an O(T^{3/2}) Pinsker-Marginal bound and an O(T) Mixed bound, compared with the classical O(T^2) bound. We propose Trust Region Masking (TRM), which excludes an entire sequence from gradient computation if any of its tokens violates the trust region.
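A minimal sketch of the sequence-level masking idea, under stated assumptions: the summary does not say how a token-level violation is measured, so the criterion here (absolute log-ratio between the training policy and the rollout policy exceeding a threshold `delta`) is hypothetical, as are the function names, the value of `delta`, and the plain policy-gradient loss used for illustration.

```python
import numpy as np

def trust_region_sequence_mask(logp_train, logp_rollout, valid, delta=0.1):
    """Return one weight per sequence: 0.0 if any valid token violates the trust region.

    logp_train, logp_rollout: (batch, T) log-probs of the sampled tokens.
    valid: (batch, T) boolean array, False on padding positions.
    delta: hypothetical threshold on |log ratio| (an assumption, not from the source).
    """
    log_ratio = np.abs(logp_train - logp_rollout)
    violates = (log_ratio > delta) & valid      # padding can never trigger a violation
    keep = ~violates.any(axis=1)                # one violating token drops the whole sequence
    return keep.astype(np.float32)

def masked_pg_loss(logp_train, advantages, valid, seq_keep):
    """Illustrative policy-gradient loss averaged over tokens of kept sequences only."""
    token_w = valid.astype(np.float32) * seq_keep[:, None]
    denom = max(float(token_w.sum()), 1.0)      # avoid dividing by zero if all are masked
    return float(-(token_w * advantages * logp_train).sum() / denom)

# Two sequences of length 3; the second has one token far outside the region.
logp_train = np.array([[-1.0, -2.0, -0.5], [-1.0, -2.0, -0.5]])
logp_rollout = np.array([[-1.02, -1.95, -0.5], [-1.0, -2.5, -0.5]])
valid = np.ones((2, 3), dtype=bool)
advantages = np.ones((2, 3))

keep = trust_region_sequence_mask(logp_train, logp_rollout, valid)
loss = masked_pg_loss(logp_train, advantages, valid, keep)
```

In this toy batch the first sequence is kept and the second is dropped entirely, including its two in-region tokens, which is the difference between sequence-level masking and per-token clipping.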

Yingru LI

Dec 20, 2025 · 4 min read · Research, Theory

Part 3: Trust Region Optimization via Sequence Masking

Authors: Yingru Li, Jiacai Liu

Original Blog: When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch

Series Context

Part 1: We established the SGA (Stochastic Gradient Ascent) framework and identified two failure modes of off-policy mismatch: Bias (measured by $D_{TV}$) and Variance (measured by $\chi^2$-divergence).

Yingru LI, Jiacai Liu

Nov 4, 2025 · 18 min read · Research