Yingru LI

Member of Technical Staff

xAI

About me

I am a Member of Technical Staff at xAI. I earned my Ph.D. in Computer Science in 2025 from The Chinese University of Hong Kong (CUHK), where I had the privilege of being advised by Prof. Zhi-Quan (Tom) Luo, with Prof. Benjamin Van Roy on my thesis committee.

Prior to xAI, I was a Research Scientist at ByteDance. I also had the valuable opportunity to collaborate with Prof. Tong Zhang and Prof. John Hopcroft.

Research Vision

My research aims to develop intelligent agents capable of reliably interacting with complex environments. By bridging foundational theory with scalable algorithms, I advance reinforcement learning, large-scale optimization, and large language model (LLM) reasoning to create systems for trustworthy decision-making.

🐦 Follow me on X for updates.

Recent Posts

Beyond Precision: Why Training-Inference Mismatch is an Optimization Problem and How Simple LR Scheduling Fixes It

RL training for LLMs is notoriously unstable. While recent studies attribute this to training-inference mismatch from hybrid engines, we show that this is not merely a static numerical issue but a dynamic problem coupled with the model’s optimization trajectory. We propose a specialized learning rate (LR) scheduler that decays the LR as gradient noise rises, using a surge in response length as a reliable early indicator of impending instability.

Yaxiang Zhang, Yingru LI, Jiacai Liu, Ziniu Li, Jiawei Xu, Qian Liu

Dec 20, 2025 1 min read Research, Theory
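As a rough illustration of the idea in the summary above, here is a minimal sketch of a scheduler that cuts the learning rate when response length surges. It is not the scheduler from the post: the class name `LengthTriggeredLR`, the surge test against a running average, and every parameter (`surge_ratio`, `decay`, `ema_beta`, `min_lr`) are assumptions made for illustration, and the sketch uses only the length signal rather than any direct measure of gradient noise.

```python
class LengthTriggeredLR:
    """Hypothetical LR scheduler: decay the LR when response length surges.

    All names and default values here are illustrative assumptions.
    """

    def __init__(self, base_lr, surge_ratio=1.5, decay=0.5, ema_beta=0.9, min_lr=1e-7):
        self.lr = base_lr
        self.surge_ratio = surge_ratio  # surge = length above ratio * running average
        self.decay = decay              # multiplicative LR cut applied on a surge
        self.ema_beta = ema_beta        # smoothing factor for the running average
        self.min_lr = min_lr            # floor so the LR never reaches zero
        self.ema = None                 # running average of mean response length

    def step(self, mean_response_length):
        """Update with this step's mean response length and return the LR to use."""
        length = float(mean_response_length)
        if self.ema is None:
            # First observation only initializes the running average.
            self.ema = length
            return self.lr
        if length > self.surge_ratio * self.ema:
            # Treat the surge as an early warning and decay the LR.
            self.lr = max(self.lr * self.decay, self.min_lr)
        self.ema = self.ema_beta * self.ema + (1.0 - self.ema_beta) * length
        return self.lr


sched = LengthTriggeredLR(base_lr=1e-5)
lrs = [sched.step(n) for n in (100, 105, 200)]
```

In this sketch the first two steps leave the LR unchanged, and the jump to a mean length of 200 triggers a single decay.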

The Optimal Token Baseline

RL training for LLMs frequently suffers from training collapse due to exploding gradient variance in long-horizon tasks. We derive the Optimal Token Baseline (OTB) from first principles, proving that updates should be weighted inversely to their accumulated uncertainty (Realized Energy). Our computationally free Logit-Gradient Proxy eliminates training collapse, matches N=32 performance with just N=4, and reduces token consumption by 62-66%.

Yingru LI, Jiawei Xu, Ziniu Li, Jiacai Liu, Yuxuan Tong, Wei Liu, Longtao Zheng, Zhenghai Xue, Yaxiang Zhang, Tianle Cai, Ge Zhang, Qian Liu, Baoxiang Wang

Dec 20, 2025 1 min read Research, Theory

Trust Region Masking for Long-Horizon LLM Reinforcement Learning

We derive tighter off-policy bounds for LLM-RL: O(T^{3/2}) Pinsker-Marginal and O(T) Mixed bounds, compared to the classical O(T^2). We propose Trust Region Masking (TRM), which excludes entire sequences from gradient computation if any token violates the trust region.

Yingru LI

Dec 20, 2025 4 min read Research, Theory
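The sequence-level masking rule described for TRM can be sketched as follows. The summary does not say how the trust region is measured, so this sketch assumes a threshold `eps` on the absolute per-token log-ratio between the training policy and the rollout policy. That criterion, the function names, and the simple policy-gradient loss are all illustrative assumptions, not the post's definitions.

```python
import numpy as np


def trust_region_mask(logp_train, logp_rollout, valid, eps=0.2):
    """Return a per-sequence mask: 1.0 keeps the sequence, 0.0 drops it.

    logp_train, logp_rollout: (batch, T) per-token log-probabilities.
    valid: (batch, T) boolean array, False on padding positions.
    eps: hypothetical threshold on |log ratio| (an assumed criterion).
    """
    log_ratio = logp_train - logp_rollout
    violates = (np.abs(log_ratio) > eps) & valid
    # A single violating token removes the whole sequence.
    keep = ~violates.any(axis=1)
    return keep.astype(np.float64)


def masked_pg_loss(logp_train, advantages, valid, seq_mask):
    """Policy-gradient style loss averaged over tokens of kept sequences."""
    token_mask = valid * seq_mask[:, None]
    denom = max(token_mask.sum(), 1.0)  # avoid division by zero if all are dropped
    return -(logp_train * advantages * token_mask).sum() / denom


logp_train = np.array([[-1.0, -1.0, -1.0], [-1.0, -1.0, -1.0]])
logp_rollout = np.array([[-1.05, -0.95, -1.0], [-1.0, -2.0, -1.0]])
valid = np.ones((2, 3), dtype=bool)

seq_mask = trust_region_mask(logp_train, logp_rollout, valid)
loss = masked_pg_loss(logp_train, np.ones((2, 3)), valid, seq_mask)
```

Here the second sequence contains one token whose log-ratio exceeds `eps`, so all three of its tokens are excluded from the loss, not just the offending one.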

The Stability Gap: Why Top-K Routing Breaks RL Optimization

We give a rigorous mathematical analysis showing that Top-K expert routing in Mixture-of-Experts (MoE) models creates two fundamental pathologies: gradient blackout (zero gradients almost everywhere) and first-order approximation failure (a discontinuous policy mapping). Together, these explain why MoE-RL training can be unstable.

Yingru LI

Dec 7, 2025 11 min read Research, Theory
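A small numerical sketch can make the two pathologies concrete for the hard selection step alone. It assumes a bare Top-K indicator mask over router logits (the helper names `top_k_mask` and `finite_diff_jacobian` are hypothetical) and leaves out gate weights and expert outputs, so it illustrates the general phenomenon rather than the post's exact setting.

```python
import numpy as np


def top_k_mask(logits, k):
    """Hard Top-K selection: 1.0 for the k largest logits, 0.0 elsewhere."""
    idx = np.argsort(logits)[-k:]
    mask = np.zeros_like(logits, dtype=np.float64)
    mask[idx] = 1.0
    return mask


def finite_diff_jacobian(logits, k, h=1e-4):
    """Central finite-difference Jacobian of the selection mask w.r.t. the logits."""
    n = logits.size
    jac = np.zeros((n, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = h
        jac[:, j] = (top_k_mask(logits + step, k) - top_k_mask(logits - step, k)) / (2 * h)
    return jac


# Away from ties the selection is locally constant, so its Jacobian is zero.
logits = np.array([2.0, 1.0, 0.5, -1.0])
jac = finite_diff_jacobian(logits, k=2)

# Crossing a tie between experts 1 and 2 flips the selected set abruptly.
before = top_k_mask(np.array([2.0, 1.0, 0.999, -1.0]), k=2)
after = top_k_mask(np.array([2.0, 1.0, 1.001, -1.0]), k=2)
```

The zero Jacobian corresponds to the "zero gradients almost everywhere" behavior, and the jump from `before` to `after` under a logit change of 0.002 shows why a first-order approximation cannot describe the mapping near a routing boundary.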

Scalable Exploration via Ensemble++

Ensemble++ achieves Thompson Sampling-level exploration with only O(d log T) ensemble directions, enabling scalable uncertainty quantification for neural bandits and beyond.

Yingru LI