A Simple and Optimal Policy Design with Safety against Heavy-Tailed Risk for Stochastic Bandits

Simchi-Levi, David; Zheng, Zeyu; Zhu, Feng

doi:10.2139/ssrn.4122441

68 Pages. Posted: 8 Jun 2022. Last revised: 15 Nov 2022.

David Simchi-Levi

Massachusetts Institute of Technology (MIT) - School of Engineering

Feng Zhu

Massachusetts Institute of Technology (MIT) - Institute for Data, Systems, and Society (IDSS)

Date Written: May 15, 2024

Abstract

We study the stochastic multi-armed bandit problem and design new policies that enjoy both worst-case optimality for expected regret and light-tailed risk for the regret distribution. Specifically, our policy design (i) enjoys worst-case optimality for the expected regret at order O(√(KT ln T)) and (ii) has a worst-case tail probability of incurring a regret larger than any x > 0 that is upper bounded by exp(-Ω(x/√(KT))), a rate that we prove to be the best achievable with respect to T among all worst-case optimal policies. Compared to standard confidence-bound-based policies, our proposed policy achieves a delicate balance between doing more exploration at the beginning of the time horizon and doing more exploitation when approaching the end. We also enhance the policy design to accommodate the "any-time" setting, where T is unknown a priori, and prove performance guarantees equivalent to those in the "fixed-time" setting with known T. Numerical experiments are conducted to illustrate the theoretical findings. From a managerial perspective, we find that our new policy design yields better tail distributions and is preferable to celebrated policies, especially when (i) there is a risk of underestimating the volatility profile, or (ii) policy hyper-parameters are difficult to tune. We conclude by extending our proposed policy design to the stochastic linear bandit setting, where it achieves both worst-case optimality in terms of expected regret and light-tailed risk for the regret distribution.
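
For readers who prefer the two guarantees in displayed form, they can be sketched as follows. The notation is an assumption made here for illustration and is not taken from the abstract: π denotes a policy, θ a problem instance with K arms, and R_θ^π(T) the random regret of π on θ over T rounds. The suprema express the "worst-case" qualifier used in the abstract.

```latex
% Illustrative notation only (assumed, not from the abstract):
%   \pi               a policy
%   \theta            a problem instance (the K arms' reward distributions)
%   R_\theta^\pi(T)   the random regret of \pi on \theta over T rounds
\begin{align*}
\text{(i)}\quad  & \sup_{\theta}\, \mathbb{E}\bigl[R_\theta^\pi(T)\bigr]
                   = O\bigl(\sqrt{KT\ln T}\bigr), \\
\text{(ii)}\quad & \sup_{\theta}\, \mathbb{P}\bigl(R_\theta^\pi(T) > x\bigr)
                   \le \exp\bigl(-\Omega\bigl(x/\sqrt{KT}\bigr)\bigr)
                   \quad \text{for all } x > 0.
\end{align*}
```

Read this way, (ii) says that the probability of incurring a regret of at least x decays exponentially once x grows beyond the √(KT) scale, up to the constants hidden in the Ω notation.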

Keywords: stochastic bandits, worst-case optimality, instance-dependent consistency, heavy-tailed risk

Suggested Citation:

Simchi-Levi, David and Zheng, Zeyu and Zhu, Feng, A Simple and Optimal Policy Design with Safety against Heavy-Tailed Risk for Stochastic Bandits (May 15, 2024). Available at SSRN: https://ssrn.com/abstract=4122441 or http://dx.doi.org/10.2139/ssrn.4122441