Non-Stationary Reinforcement Learning: The Blessing of (More) Optimism

56 Pages. Posted: 17 Jun 2019. Last revised: 30 Aug 2021.

Wang Chi Cheung

Institute of High Performance Computing, Agency for Science, Technology and Research (A*STAR)

David Simchi-Levi

Massachusetts Institute of Technology (MIT) - School of Engineering

Ruihao Zhu

Cornell University

Date Written: May 23, 2019

Abstract

Motivated by applications in inventory control and real-time bidding, we consider undiscounted reinforcement learning (RL) in Markov decision processes (MDPs) under temporal drifts. In this setting, both the reward and state transition distributions are allowed to evolve over time, as long as their respective total variations, quantified by suitable metrics, do not exceed certain variation budgets. We first develop the Sliding Window Upper-Confidence bound for Reinforcement Learning with Confidence Widening (SWUCRL2-CW) algorithm and establish its dynamic regret bound when the variation budgets are known. In addition, we propose the Bandit-over-Reinforcement Learning (BORL) algorithm to adaptively tune the SWUCRL2-CW algorithm and achieve the same dynamic regret bound in a parameter-free manner, i.e., without knowing the variation budgets. Finally, we conduct numerical experiments to show that our proposed algorithms achieve superior empirical performance compared with existing algorithms.
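As a rough, hypothetical illustration of the two ingredients sketched above (estimating the MDP from a sliding window of recent data only, and widening the resulting confidence radius), consider the following Python snippet. The class name SlidingWindowModel, the window length passed as window, and the widening parameter eta are our own placeholders, not notation or code from the paper.

from collections import deque
import math

class SlidingWindowModel:
    """Sliding-window estimates with a widened confidence radius (illustrative only)."""

    def __init__(self, n_states, n_actions, window, eta, delta=0.05):
        self.nS, self.nA = n_states, n_actions
        self.eta = eta                        # extra widening added to the confidence radius
        self.delta = delta                    # confidence level used in the radius
        self.history = deque(maxlen=window)   # keep only the last `window` transitions

    def record(self, s, a, r, s_next):
        self.history.append((s, a, r, s_next))

    def estimate(self, s, a):
        """Windowed empirical reward, transition estimate, and widened L1 radius for (s, a)."""
        samples = [(r, s2) for (si, ai, r, s2) in self.history if (si, ai) == (s, a)]
        n = max(len(samples), 1)              # avoid division by zero when (s, a) is unseen
        r_hat = sum(r for r, _ in samples) / n
        p_hat = [0.0] * self.nS
        for _, s2 in samples:
            p_hat[s2] += 1.0 / n
        # Hoeffding-style L1 radius for the transition estimate, enlarged by eta
        # to inject the additional optimism that confidence widening calls for.
        radius = math.sqrt(2.0 * self.nS * math.log(2.0 / self.delta) / n) + self.eta
        return r_hat, p_hat, radius

An optimistic planner would then treat any transition distribution within the widened radius of p_hat as plausible when computing its policy, and a BORL-style wrapper would run several such learners with different window lengths and use an adversarial bandit to choose among them online.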

Notably, under non-stationarity, historical data samples may falsely indicate that certain state transitions rarely happen. This presents a significant challenge when one tries to apply the conventional Optimism in the Face of Uncertainty (OFU) principle to achieve a low dynamic regret bound for our problem. We overcome this challenge by proposing a novel confidence widening technique that incorporates additional optimism into our learning algorithms. To extend our theoretical findings, we demonstrate, in the context of single-item inventory control with lost sales, fixed cost, and zero lead time, how one can leverage special structures of the state transition distributions to bypass the difficulty of exploring time-varying demand environments.
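Schematically, and in our own notation rather than the paper's, confidence widening enlarges the usual L1 confidence ball around the windowed empirical transition estimate $\hat{p}_t(s,a)$ by a widening parameter $\eta \ge 0$:

$$H_t(s, a; \eta) = \Big\{ p \in \Delta^{\mathcal{S}} : \big\| p - \hat{p}_t(s, a) \big\|_1 \le \mathrm{rad}_t(s, a) + \eta \Big\},$$

where $\mathrm{rad}_t(s,a)$ is a standard concentration radius. Setting $\eta = 0$ recovers the usual OFU confidence region, while $\eta > 0$ keeps transition models plausible even when the windowed samples understate how much the environment can change.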

Keywords: reinforcement learning, data-driven decision making, confidence widening

Suggested Citation

Cheung, Wang Chi and Simchi-Levi, David and Zhu, Ruihao, Non-Stationary Reinforcement Learning: The Blessing of (More) Optimism (May 23, 2019). Available at SSRN: https://ssrn.com/abstract=3397818 or http://dx.doi.org/10.2139/ssrn.3397818

Wang Chi Cheung

Institute of High Performance Computing, Agency for Science, Technology and Research (A*STAR)

Singapore

David Simchi-Levi

Massachusetts Institute of Technology (MIT) - School of Engineering

MA
United States

Ruihao Zhu (Contact Author)

Cornell University

Ithaca, NY 14853
United States
