Doubly Optimal No-Regret Online Learning with Bandit Feedback

53 Pages. Posted: 7 Dec 2021. Last revised: 1 Jul 2022.

Tianyi Lin

University of California, Berkeley - Department of Electrical Engineering & Computer Sciences (EECS)

Zhengyuan Zhou

affiliation not provided to SSRN

Wenjia Ba

Stanford Graduate School of Business

Jiawei Zhang

New York University (NYU) - Department of Information, Operations, and Management Sciences

Date Written: December 6, 2021

Abstract

We consider online no-regret learning in unknown games with bandit feedback, where each player can only observe its reward at each time -- determined by all players' current joint action -- rather than its gradient. We focus on the class of smooth and strongly monotone games and study optimal no-regret learning therein. Leveraging self-concordant barrier functions, we first construct a new bandit learning algorithm and show that it achieves the single-agent optimal regret of $\tilde{\Theta}(n\sqrt{T})$ under smooth and strongly concave reward functions ($n \geq 1$ is the problem dimension). We then show that if each player applies this no-regret learning algorithm in strongly monotone games, the joint action converges in the last iterate to the unique Nash equilibrium at a rate of $\tilde{\Theta}(\sqrt{\frac{n^2}{T}})$. Prior to our work, the best-known convergence rate in the same class of games was $\tilde{O}(\sqrt[3]{\frac{n^2}{T}})$ (achieved by a different algorithm), leaving open the problem of optimal no-regret learning algorithms (since the known lower bound is $\Omega(\sqrt{\frac{n^2}{T}})$). Our results settle this open problem and contribute to the broad landscape of bandit game-theoretical learning by identifying the first doubly optimal bandit learning algorithm, in that it achieves (up to log factors) both optimal regret in single-agent learning and the optimal last-iterate convergence rate in multi-agent learning. We also present results from several applications -- Cournot competition, Kelly auctions, and distributed regularized logistic regression -- to demonstrate the efficacy of our algorithm.
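The paper's algorithm itself builds on self-concordant barrier functions and is not reproduced here. As a rough, illustrative sketch of the one-point bandit feedback model the abstract describes, the following code runs a standard single-point spherical gradient estimator (in the style of classical bandit convex optimization) with projected gradient ascent on a strongly concave quadratic reward. The quadratic reward, the box constraint, and the step-size and exploration schedules are all illustrative assumptions, not the authors' method or parameters.

```python
import numpy as np

# Illustrative assumption: a strongly concave quadratic reward with a known
# maximizer (the paper covers general smooth, strongly concave rewards).
x_star = np.array([0.5, -0.3])

def reward(x):
    return -float(np.sum((x - x_star) ** 2))

def one_point_grad(x, delta, rng):
    """Single-point spherical gradient estimate: only the scalar reward at the
    perturbed point x + delta*u is observed, matching the bandit feedback model."""
    u = rng.standard_normal(x.shape)
    u /= np.linalg.norm(u)
    return (x.shape[0] / delta) * reward(x + delta * u) * u

rng = np.random.default_rng(0)
x = np.zeros(2)
gaps = []
T = 50_000
for t in range(1, T + 1):
    delta = t ** -0.25      # shrinking exploration radius (illustrative schedule)
    eta = 0.5 * t ** -0.75  # diminishing step size (illustrative schedule)
    # Gradient *ascent* on the reward, followed by projection onto [-1, 1]^2.
    x = np.clip(x + eta * one_point_grad(x, delta, rng), -1.0, 1.0)
    gaps.append(reward(x_star) - reward(x))

early_gap = float(np.mean(gaps[:1000]))    # average suboptimality, first rounds
late_gap = float(np.mean(gaps[-1000:]))    # average suboptimality, last rounds
x_final = x
print(x_final, early_gap, late_gap)
```

The single-point estimator is unbiased for a smoothed version of the reward, so with diminishing exploration and step sizes the iterate drifts toward the maximizer while observing only scalar rewards; the barrier-based construction in the paper is what sharpens this kind of scheme to the optimal $\tilde{\Theta}(n\sqrt{T})$ regret.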

Keywords: no-regret learning; bandit feedback model; strongly monotone games; optimal regret; optimal last-iterate convergence rate; mirror descent

Suggested Citation

Lin, Tianyi and Zhou, Zhengyuan and Ba, Wenjia and Zhang, Jiawei, Doubly Optimal No-Regret Online Learning with Bandit Feedback (December 6, 2021). Available at SSRN: https://ssrn.com/abstract=3978421 or http://dx.doi.org/10.2139/ssrn.3978421

Tianyi Lin

University of California, Berkeley - Department of Electrical Engineering & Computer Sciences (EECS)

Berkeley, CA 94720-1712
United States

Zhengyuan Zhou

affiliation not provided to SSRN

Wenjia Ba (Contact Author)

Stanford Graduate School of Business

Jiawei Zhang

New York University (NYU) - Department of Information, Operations, and Management Sciences

44 West Fourth Street
New York, NY 10012
United States
