Q-Star Meets Scalable Posterior Sampling: Bridging Theory and Practice via HyperAgent

Abstract

We propose HyperAgent, a reinforcement learning (RL) algorithm based on the hypermodel framework for exploration. HyperAgent enables efficient incremental approximation of posteriors over the optimal action-value function ($Q^\star$) without requiring conjugacy, and it acts greedily with respect to these approximate posterior samples. We demonstrate that HyperAgent offers robust performance in large-scale deep RL benchmarks: it solves the Deep Sea hard-exploration problems with a number of episodes that scales optimally with problem size, and it exhibits significant efficiency gains on the Atari suite. Implementing HyperAgent requires minimal code additions to well-established deep RL frameworks such as DQN. We prove that, under tabular assumptions, HyperAgent achieves logarithmic per-step computational complexity while attaining sublinear regret, matching the best known randomized tabular RL algorithm.
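
To make the acting rule concrete, here is a minimal sketch of one way a hypermodel over the last layer of a DQN-style network can be combined with index sampling: an index $z$ drawn from a reference distribution selects one approximate posterior sample of $Q$, and the agent acts greedily with respect to that sample. This is an illustrative sketch under our own assumptions, not the paper's implementation; names such as `HyperQNet`, `index_dim`, and `sample_index` are hypothetical.

```python
# Minimal sketch (PyTorch assumed): a hypermodel over the last layer of a
# Q-network. An index z ~ N(0, I_M) selects one approximate posterior sample
# of Q; acting greedily w.r.t. that sample gives Thompson-sampling-style
# exploration without any conjugate posterior updates.
import torch
import torch.nn as nn


class HyperQNet(nn.Module):
    def __init__(self, obs_dim: int, num_actions: int, index_dim: int = 8, hidden: int = 64):
        super().__init__()
        self.index_dim = index_dim
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        # Mean head: the usual last layer of a DQN-style network.
        self.mean_head = nn.Linear(hidden, num_actions)
        # Hypermodel head: maps features to a (num_actions x index_dim) matrix
        # whose product with the index z gives a sample-specific perturbation.
        self.hyper_head = nn.Linear(hidden, num_actions * index_dim)

    def forward(self, obs: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        """Q-values for the posterior sample selected by index z (shape [index_dim])."""
        phi = self.trunk(obs)                                 # [B, hidden]
        mean_q = self.mean_head(phi)                          # [B, A]
        hyper = self.hyper_head(phi)                          # [B, A * M]
        hyper = hyper.view(obs.shape[0], -1, self.index_dim)  # [B, A, M]
        return mean_q + hyper @ z                             # [B, A]


def sample_index(index_dim: int) -> torch.Tensor:
    """Draw a fresh index from the reference distribution (standard Gaussian)."""
    return torch.randn(index_dim)


def greedy_action(net: HyperQNet, obs: torch.Tensor, z: torch.Tensor) -> int:
    """Act greedily w.r.t. the Q-function selected by the sampled index z."""
    with torch.no_grad():
        q = net(obs.unsqueeze(0), z)  # obs: [obs_dim] -> q: [1, A]
    return int(q.argmax(dim=-1).item())
```

In an episodic setting, one would typically draw a single index at the start of each episode and hold it fixed, so the greedy policy stays consistent with one posterior sample throughout the episode; this is what yields deep, temporally extended exploration.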

Publication
International Conference on Machine Learning (ICML)

Summary of technical contributions of HyperAgent

  1. New probability tools in high-dimensional probability and statistics
  • The first probability tool for sequential random projection: a non-trivial martingale extension of the Johnson-Lindenstrauss (JL) lemma for adaptively sampled data, as required by the sequential nature of RL (see the first sketch after this list);
  • A unified and simple analysis of JL via a high-dimensional extension of the Hanson-Wright inequality.
  2. Methodology for sequential decision-making
  • Hypermodel: efficient incremental approximation of the posterior (uncertainty quantification) over complex models as more data are encountered, without relying on conjugacy (see the second sketch after this list);
  • Index sampling: approximate posterior sampling for data-efficient sequential decision-making.
  3. Results for sequential decision-making
  • Practically, HyperAgent demonstrates robust performance on large-scale deep RL benchmarks with significant efficiency gains in both data and computation;
  • Theoretically, HyperAgent is the first practically scalable method to achieve logarithmic per-step computation together with sublinear regret under tabular episodic RL and linear contextual bandit setups. At the heart of the analysis is a sequential incremental posterior approximation argument, made possible by our new probability tool for sequential random projection.
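
The following toy simulation conveys the flavor of the sequential random projection tool referenced in item 1: coefficients are chosen adaptively, so each may depend on the random vectors drawn so far, yet a single fixed random index dimension still approximately preserves the accumulated squared norm. This is our own illustrative construction, not the paper's theorem or proof; all names and constants are arbitrary.

```python
# Toy illustration of sequential random projection (our own construction):
# the squared norm of an adaptively built sum of random index vectors tracks
# the sum of squared coefficients, even though each coefficient x_t is allowed
# to depend on the randomness drawn before step t (mimicking the adaptivity of
# data collected by an RL agent).
import numpy as np

rng = np.random.default_rng(0)
T, M = 1_000, 256                  # number of steps, index dimension

running_sum = np.zeros(M)          # sum_{s <= t} x_s * z_s
true_sq_norm = 0.0                 # sum_{s <= t} x_s^2
for _ in range(T):
    # Adaptive coefficient: depends on the projection built so far.
    x_t = 1.0 + 0.5 * np.tanh(running_sum[0])
    # Fresh random index vector, scaled so that E[||z_t||^2] = 1.
    z_t = rng.normal(size=M) / np.sqrt(M)
    running_sum += x_t * z_t
    true_sq_norm += x_t ** 2

approx_sq_norm = float(running_sum @ running_sum)
rel_err = abs(approx_sq_norm - true_sq_norm) / true_sq_norm
print(f"sum of squared coefficients : {true_sq_norm:.1f}")
print(f"squared norm of running sum : {approx_sq_norm:.1f}")
print(f"relative error              : {rel_err:.3f}")
```

Increasing the index dimension M tightens the approximation; the substance of the martingale extension is that a single draw of the randomness suffices even though the coefficients are chosen adaptively, which is exactly the regime created by a learning agent whose data depend on its past decisions.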
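
For item 2, here is a sketch of how such an incremental, conjugacy-free posterior approximation can be maintained in practice: a DQN-style temporal-difference update in which the online and target networks share the same sampled index and the target is perturbed by an index-dependent noise term stored with each transition. The function `hyper_td_loss`, the noise scale `sigma`, and the per-transition vectors `xi` are our own illustrative choices; the paper's exact objective differs in its details.

```python
# Illustrative incremental update with index sampling (not the paper's exact loss).
# Each stored transition carries a fixed noise vector xi ~ N(0, I_M); the target is
# perturbed by sigma * <xi, z>, so that across sampled indices z the fitted Q-values
# keep posterior-like spread instead of collapsing to a single point estimate.
# q_net(obs, z) / target_net(obs, z) are assumed to return per-action Q-values,
# e.g. the HyperQNet sketched earlier on this page.
import torch


def hyper_td_loss(q_net, target_net, batch, index_dim: int = 8,
                  num_index_samples: int = 4, gamma: float = 0.99, sigma: float = 0.1):
    obs, action, reward, next_obs, done, xi = batch           # xi: [B, index_dim]
    losses = []
    for _ in range(num_index_samples):
        z = torch.randn(index_dim)                            # same index for online and target nets
        q = q_net(obs, z).gather(1, action[:, None]).squeeze(1)          # [B]
        with torch.no_grad():
            next_q = target_net(next_obs, z).max(dim=1).values           # [B]
            perturb = sigma * (xi @ z)                                    # [B]
            target = reward + perturb + gamma * (1.0 - done) * next_q    # [B]
        losses.append(torch.mean((q - target) ** 2))
    return torch.stack(losses).mean()
```

Minimizing a loss of this kind on each incoming batch updates the whole family of Q-functions indexed by z at once, which is what makes the posterior approximation incremental: no conjugate update or periodic re-fit from scratch is needed, and the per-step cost stays comparable to a standard DQN update.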