We derive tighter off-policy bounds for LLM-RL: an O(T^{3/2}) Pinsker-Marginal bound and an O(T) Mixed bound, improving on the classical O(T^2) bound. We propose Trust Region Masking (TRM), which excludes an entire sequence from the gradient computation if any of its tokens violates the trust region.
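As a rough illustration of the sequence-level masking idea, the sketch below assumes a symmetric trust region [1-eps, 1+eps] on per-token importance ratios and applies an importance-weighted policy-gradient surrogate; the function name and interface are hypothetical and not taken from the paper's implementation.

```python
import torch


def trm_policy_loss(logp_new, logp_old, advantages, eps=0.2):
    """Sketch of Trust Region Masking (TRM): drop whole sequences whose
    per-token importance ratio leaves the assumed trust region [1-eps, 1+eps].

    All tensor arguments have shape (batch, seq_len).
    """
    # Per-token importance ratios pi_new / pi_old.
    ratio = torch.exp(logp_new - logp_old)

    # A token is inside the trust region if its ratio stays within [1-eps, 1+eps].
    in_region = (ratio >= 1.0 - eps) & (ratio <= 1.0 + eps)

    # Keep a sequence only if every one of its tokens is inside the region.
    seq_mask = in_region.all(dim=-1, keepdim=True).float().detach()

    # Importance-weighted surrogate, zeroed out for masked sequences.
    per_token = ratio * advantages * seq_mask

    # Average over surviving tokens; clamp avoids division by zero
    # when every sequence in the batch is masked.
    denom = seq_mask.expand_as(per_token).sum().clamp(min=1.0)
    return -per_token.sum() / denom
```

In this reading, masking is all-or-nothing at the sequence level: a single out-of-region token removes the whole trajectory from the update rather than clipping that token alone.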