Scalable Exploration via Ensemble++
Efficient Thompson Sampling for Deep Reinforcement Learning
The Exploration Challenge in Deep RL
Thompson Sampling is one of the most elegant algorithms for balancing exploration and exploitation in sequential decision-making. The idea is simple: maintain a posterior distribution over reward functions, sample from it, and act optimally with respect to the sample. This approach naturally trades off exploring uncertain regions against exploiting known good actions.
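To make the recipe concrete, here is a textbook Beta-Bernoulli Thompson Sampling sketch for a three-armed Bernoulli bandit; the arm probabilities are made up for illustration, and this is classical Thompson Sampling rather than Ensemble++ itself.

```python
# Textbook Beta-Bernoulli Thompson Sampling (illustration only, not Ensemble++).
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])   # hypothetical arm reward probabilities
alpha = np.ones(3)                        # Beta posterior: 1 + number of successes per arm
beta = np.ones(3)                         # Beta posterior: 1 + number of failures per arm

for t in range(1000):
    theta = rng.beta(alpha, beta)         # sample a reward estimate for every arm
    arm = int(np.argmax(theta))           # act optimally w.r.t. the sampled estimates
    reward = float(rng.random() < true_means[arm])
    alpha[arm] += reward                  # conjugate (Beta-Bernoulli) posterior update
    beta[arm] += 1.0 - reward
```

Exploration emerges because uncertain arms occasionally draw optimistic samples, while well-understood arms are exploited most of the time.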
But there’s a catch: exact Thompson Sampling is computationally intractable for modern deep learning systems. Neural networks don’t admit conjugate posteriors, and maintaining full uncertainty quantification over millions of parameters is simply infeasible.
The Ensemble Approximation Gap
Prior work attempted to approximate Thompson Sampling using ensembles—train multiple neural networks and use their disagreement as a proxy for uncertainty. The problem? The required ensemble size scales prohibitively:
- Theoretical guarantees require $\Omega(T \cdot |\mathcal{X}|)$ ensemble members
- For practical horizons and action spaces, this means thousands to millions of networks
- Training and inference costs become unmanageable
This creates a fundamental tension: we want the exploration benefits of Thompson Sampling but cannot afford its computational cost.
Ensemble++: The Key Innovation
Ensemble++ resolves this tension through a clever insight: instead of maintaining independent ensemble members, maintain a shared ensemble matrix factor that updates incrementally using random linear combinations.
Core Mechanism
The algorithm maintains matrices that incrementally approximate the posterior covariance. At each step:
- Sample: Draw a random reference vector and combine it with the shared ensemble factor to form an action-selection hypothesis
- Observe: Act on that hypothesis and collect reward feedback from the environment
- Update: Refresh the ensemble factor incrementally using perturbation vectors
This enables approximate posterior sampling without full matrix decomposition or expensive per-step retraining.
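As a rough illustration of this loop in the linear case, the sketch below maintains a ridge-regression mean together with a shared factor built from accumulated perturbation vectors; the class name, scaling constants, and the precise update rule are assumptions made for illustration, not the paper's exact algorithm.

```python
# Illustrative linear-bandit version of the sample / observe / update loop.
# Names and the precise update rule are assumptions; see the paper for the exact algorithm.
import numpy as np

class SharedFactorSampler:
    def __init__(self, d, m, lam=1.0, noise=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.m, self.noise = m, noise
        self.V = lam * np.eye(d)                  # regularized Gram matrix
        self.b = np.zeros(d)                      # running sum of x_t * y_t
        self.B = noise * np.sqrt(lam) * self.rng.standard_normal((d, m))  # prior perturbation factor

    def sample_hypothesis(self):
        V_inv = np.linalg.inv(self.V)
        mu = V_inv @ self.b                       # ridge point estimate
        A = V_inv @ self.B                        # shared ensemble factor (d x m)
        zeta = self.rng.standard_normal(self.m) / np.sqrt(self.m)  # reference vector
        return mu + A @ zeta                      # random linear combination of directions

    def update(self, x, y):
        self.V += np.outer(x, x)                  # rank-one statistics update
        self.b += y * x
        z = self.noise * self.rng.standard_normal(self.m)
        self.B += np.outer(x, z)                  # perturbation-vector update of the factor

sampler = SharedFactorSampler(d=5, m=16)
arms = np.eye(5)                                  # toy finite action set
true_theta = np.array([0.1, 0.0, 0.3, 0.2, 0.5])
for t in range(200):
    theta = sampler.sample_hypothesis()           # sample
    x = arms[int(np.argmax(arms @ theta))]        # act on the sampled hypothesis
    y = x @ true_theta + 0.1 * sampler.rng.standard_normal()
    sampler.update(x, y)                          # observe and update
```

A production version would maintain the inverse incrementally (e.g., via rank-one Sherman-Morrison updates) rather than calling `np.linalg.inv` at every step.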
The Symmetrized Ridge-Regression Perspective
A unified formulation expresses both linear and nonlinear variants through a two-sided perturbation loss:
$$\mathcal{L}(\theta) = \sum_{t=1}^{n} \left( y_t - f_\theta(x_t) \right)^2 + \lambda \|\theta\|^2 + \text{perturbation terms}$$
For linear features, this admits closed-form updates. For neural networks, we optimize via SGD while preserving the incremental update mechanics.
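For orientation, a simplified one-sided version of such a perturbed objective already gives a closed form: minimizing $\sum_{t=1}^{n} \left( y_t + z_t - \langle x_t, \theta \rangle \right)^2 + \lambda \|\theta - \theta_0\|^2$ over $\theta$, where $z_t$ is a per-sample perturbation and $\theta_0$ a randomized prior draw, yields the standard perturbed ridge estimator

$$\tilde{\theta} = \Big(\sum_{t=1}^{n} x_t x_t^\top + \lambda I\Big)^{-1} \Big(\sum_{t=1}^{n} (y_t + z_t)\, x_t + \lambda\, \theta_0\Big).$$

The paper's symmetrized, two-sided construction refines this one-sided form; the equation above is shown only to make the closed-form claim concrete.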
Theoretical Guarantees
The theoretical contribution is substantial. For linear bandits, Ensemble++ achieves:
| Setting | Regret Bound |
|---|---|
| Compact action sets | $\mathcal{O}(d^{3/2}\sqrt{T}(\log T)^{3/2})$ |
| Finite action sets | $\mathcal{O}(d\sqrt{T \log \lvert\mathcal{X}\rvert})$ |
These bounds are comparable to exact Thompson Sampling while requiring only:
- Ensemble size: $\Theta(d \log T)$ directions (vs. $\Omega(T \cdot |\mathcal{X}|)$ for prior methods)
- Per-step complexity: $\mathcal{O}(d^2)$ for linear case
- Storage: $\Theta(d \log T)$ ensemble directions
The approach works uniformly across compact and finite action sets without algorithmic reconfiguration—a notable improvement over prior methods that required problem-specific tuning.
Extension to Neural Networks
The same principle extends to nonlinear settings by replacing fixed linear features with learnable neural representations:
- Extract features from the last layer of a neural network
- Apply the Ensemble++ update rule to these features
- Propagate gradients through the full network via SGD
This bridges the gap between principled Bayesian exploration and practical deep RL systems.
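Below is a minimal PyTorch sketch of this pattern: a shared body produces last-layer features, a mean head plus a set of ensemble directions sits on top, and a random reference vector mixes the directions into a single hypothesis. The perturbed regression target, module names, and dimensions are illustrative assumptions, not the paper's exact architecture or loss.

```python
# Schematic neural variant: learnable features plus an ensemble head mixed by a
# random reference vector. All names, dimensions, and the perturbed target are illustrative.
import torch
import torch.nn as nn

class EnsembleHead(nn.Module):
    def __init__(self, feat_dim, n_dirs):
        super().__init__()
        self.mean = nn.Linear(feat_dim, 1)                     # point-estimate head
        self.dirs = nn.Linear(feat_dim, n_dirs, bias=False)    # ensemble "directions"

    def forward(self, phi, zeta):
        # A random linear combination of the directions yields one sampled hypothesis.
        return self.mean(phi) + self.dirs(phi) @ zeta.unsqueeze(-1)

body = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 32), nn.ReLU())
head = EnsembleHead(feat_dim=32, n_dirs=16)
opt = torch.optim.SGD(list(body.parameters()) + list(head.parameters()), lr=1e-2)

x = torch.randn(128, 8)     # toy context batch
y = torch.randn(128, 1)     # toy observed rewards
z = torch.randn(128, 16)    # per-sample perturbation vectors, fixed when data is stored
zeta = torch.randn(16)      # reference vector, resampled at each decision step

phi = body(x)                                       # learnable last-layer features
pred = head(phi, zeta)                              # hypothesis used for this SGD step
target = y + (z * zeta).sum(dim=1, keepdim=True)    # perturbed regression target (illustrative)
loss = ((pred - target) ** 2).mean()
opt.zero_grad(); loss.backward(); opt.step()        # gradients flow through the full network
```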
Empirical Results
Comprehensive experiments validate the theoretical claims across multiple settings:
Linear Bandits
Ensemble++ matches Thompson Sampling performance with a fraction of the ensemble size, while baselines like Ensemble+ and EpiNet require significantly larger ensembles for comparable performance.
Quadratic Bandits
Even in non-linear settings, Ensemble++ maintains strong performance, demonstrating the robustness of the incremental update mechanism.
Neural Bandits
With neural network function approximation, Ensemble++ achieves superior regret-versus-computation trade-offs compared to existing ensemble methods.
GPT-Based Bandits
Scaling to large language models, Ensemble++ enables efficient exploration in high-dimensional representation spaces where traditional methods become prohibitively expensive.
Why This Matters
The exploration-exploitation trade-off is fundamental to sequential decision-making. Ensemble++ makes principled exploration practical at scale by:
- Reducing computational burden: from $\Omega(T \cdot |\mathcal{X}|)$ ensemble members to $\Theta(d \log T)$ shared directions
- Preserving theoretical guarantees: Near-optimal regret bounds
- Enabling modularity: Plug-and-play integration with existing neural architectures
- Unifying settings: Same algorithm works across action space types
For practitioners building agents that must learn efficiently in uncertain environments—from recommendation systems to robotic control to LLM-based agents—Ensemble++ offers a principled yet practical path forward.
Poster
[TBD]
Paper Details
Title: Scalable Exploration via Ensemble++
Authors: Yingru Li, Jiawei Xu, Baoxiang Wang, Zhi-Quan Luo
Venue: NeurIPS 2025
Links:
Citation
@inproceedings{li2025scalable,
title = {Scalable Exploration via Ensemble++},
author = {Li, Yingru and Xu, Jiawei and Wang, Baoxiang and Luo, Zhi-Quan},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
year = {2025}
}
Last updated: November 30, 2025