Many real-world problems involving multiple decision-makers can be modeled as unknown games with bandit feedback. To address the challenges posed by bandit feedback and the curse of multi-agency, we develop Thompson sampling-type algorithms that leverage information about the opponents' actions and the reward structure. Our approach significantly reduces the experimental budget, achieving a more-than-tenfold reduction over baseline algorithms in practical applications such as traffic routing and radar sensing. We show that, under certain assumptions on the reward structure, the regret bound depends only logarithmically on the size of the total action space, effectively mitigating the curse of multi-agency. Additionally, we introduce the Optimism-then-NoRegret framework, a novel contribution that unifies our proposed algorithms with existing methods in the field.
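To make the algorithmic idea concrete, the sketch below is a minimal, hypothetical instance of a Thompson sampling-type learner in an unknown two-player matrix game with bandit feedback: it keeps a Gaussian posterior over the mean reward of each joint action, samples a plausible reward matrix, best-responds against an empirical model of the opponent's observed play, and updates the posterior with the noisy observed reward. The class name `GaussianTSPlayer`, the independent conjugate-Gaussian priors, and the add-one-smoothed opponent model are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

# Minimal sketch (not the paper's exact algorithm): Thompson sampling for
# one player in an unknown two-player matrix game with bandit feedback.
# Assumptions: known Gaussian noise, independent Normal(0, 1) priors on the
# mean reward of each joint action, and the opponent's action is observed
# after each round (the "opponents' actions" information mentioned above).
class GaussianTSPlayer:
    def __init__(self, n_own, n_opp, noise_var=1.0, rng=None):
        self.rng = rng or np.random.default_rng()
        self.noise_var = noise_var
        # Posterior mean/variance for each joint action (own, opponent).
        self.mean = np.zeros((n_own, n_opp))
        self.var = np.ones((n_own, n_opp))
        # Counts of the opponent's observed actions (empirical opponent model,
        # with add-one smoothing so the initial distribution is uniform).
        self.opp_counts = np.ones(n_opp)

    def act(self):
        # Sample a plausible reward matrix from the current posterior.
        sampled = self.rng.normal(self.mean, np.sqrt(self.var))
        # Best-respond to the empirical distribution of opponent play.
        opp_dist = self.opp_counts / self.opp_counts.sum()
        return int(np.argmax(sampled @ opp_dist))

    def update(self, own_a, opp_a, reward):
        # Conjugate Gaussian update for the played joint action's posterior.
        v, nv = self.var[own_a, opp_a], self.noise_var
        self.mean[own_a, opp_a] = (v * reward + nv * self.mean[own_a, opp_a]) / (v + nv)
        self.var[own_a, opp_a] = v * nv / (v + nv)
        self.opp_counts[opp_a] += 1


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_rewards = rng.uniform(size=(4, 4))          # unknown to the player
    player = GaussianTSPlayer(4, 4, rng=rng)
    for t in range(2000):
        a = player.act()
        b = int(rng.integers(4))                     # stand-in random opponent
        r = true_rewards[a, b] + rng.normal(0, 1.0)  # bandit feedback only
        player.update(a, b, r)
    print("posterior-mean best response:", player.mean.mean(axis=1).argmax())
```

Exploiting further reward structure (e.g., a graphical or low-rank game) would replace the per-joint-action posterior with a structured one, which is what allows the regret to scale with the structure's complexity rather than the full joint action space.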