Non-Stationary Reinforcement Learning: The Blessing of (More) Optimism
56 Pages · Posted: 17 Jun 2019 · Last revised: 30 Aug 2021
Date Written: May 23, 2019
Abstract
Motivated by applications in inventory control and real-time bidding, we consider undiscounted reinforcement learning (RL) in Markov decision processes (MDPs) under temporal drifts. In this setting, both the reward and state transition distributions are allowed to evolve over time, as long as their respective total variations, quantified by suitable metrics, do not exceed certain variation budgets. We first develop the Sliding Window Upper-Confidence bound for Reinforcement Learning with Confidence Widening (SWUCRL2-CW) algorithm, and establish its dynamic regret bound when the variation budgets are known. In addition, we propose the Bandit-over-Reinforcement Learning (BORL) algorithm to adaptively tune the SWUCRL2-CW algorithm and achieve the same dynamic regret bound, but in a parameter-free manner, i.e., without knowing the variation budgets. Finally, we conduct numerical experiments to show that our proposed algorithms achieve superior empirical performance compared with existing algorithms.
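To make the sliding-window idea concrete, the sketch below (not the authors' code) shows how estimates can be built from only the most recent W observations, so that stale data from a drifted environment is discarded. The class name, the deque-based bookkeeping, and the Hoeffding-style radius are illustrative assumptions rather than the paper's exact construction.

```python
# Minimal sketch of sliding-window estimation, assuming a tabular MDP.
from collections import defaultdict, deque

import numpy as np


class SlidingWindowEstimator:
    def __init__(self, window_size: int):
        self.window_size = window_size          # W: number of recent steps kept
        self.history = deque()                  # (state, action, reward, next_state)
        self.counts = defaultdict(int)          # N_W(s, a)
        self.reward_sums = defaultdict(float)   # sum of rewards within the window
        self.trans_counts = defaultdict(int)    # N_W(s, a, s')

    def update(self, s, a, r, s_next):
        """Record one transition and evict observations older than W steps."""
        self.history.append((s, a, r, s_next))
        self.counts[(s, a)] += 1
        self.reward_sums[(s, a)] += r
        self.trans_counts[(s, a, s_next)] += 1
        if len(self.history) > self.window_size:
            old_s, old_a, old_r, old_sn = self.history.popleft()
            self.counts[(old_s, old_a)] -= 1
            self.reward_sums[(old_s, old_a)] -= old_r
            self.trans_counts[(old_s, old_a, old_sn)] -= 1

    def estimates(self, s, a, states, delta: float = 0.05):
        """Empirical mean reward, empirical transition vector, and a
        Hoeffding-style confidence radius built from window counts only."""
        n = max(self.counts[(s, a)], 1)
        r_hat = self.reward_sums[(s, a)] / n
        p_hat = np.array([self.trans_counts[(s, a, sn)] / n for sn in states])
        radius = np.sqrt(2 * np.log(1.0 / delta) / n)
        return r_hat, p_hat, radius
```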
Notably, under non-stationarity, historical data samples may falsely indicate that state transitions rarely happen. This presents a significant challenge when one tries to apply the conventional Optimism in the Face of Uncertainty (OFU) principle to achieve a low dynamic regret bound in our problem. We overcome this challenge by proposing a novel confidence widening technique that incorporates additional optimism into our learning algorithms. To extend our theoretical findings, we demonstrate, in the context of single-item inventory control with lost sales, fixed cost, and zero lead time, how one can leverage special structures on the state transition distributions to bypass the difficulty of exploring time-varying demand environments.
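The following sketch illustrates the confidence-widening idea described above: an extra widening term eta is added on top of the usual sampling-error radius of the transition confidence set, injecting additional optimism so the learner does not over-trust stale samples. The function names, the form of the base radius, and the UCRL-style inner maximization are assumptions for illustration, not the paper's exact constants.

```python
# Minimal sketch of confidence widening plus the optimistic inner maximization.
import numpy as np


def widened_confidence_radius(n_window: int, num_states: int,
                              delta: float, eta: float) -> float:
    """L1 radius from window counts, plus the widening parameter eta."""
    n = max(n_window, 1)
    base = np.sqrt(2 * num_states * np.log(2 * num_states / delta) / n)
    return base + eta


def optimistic_transition(p_hat: np.ndarray, values: np.ndarray,
                          radius: float) -> np.ndarray:
    """Choose the most optimistic transition vector within an L1 ball around
    the empirical estimate (the standard step in UCRL-style extended value
    iteration)."""
    p = p_hat.copy()
    best = int(np.argmax(values))
    # Move as much probability mass as the radius allows onto the best state.
    p[best] = min(1.0, p[best] + radius / 2.0)
    # Remove the excess mass from the lowest-value states first.
    excess = p.sum() - 1.0
    for idx in np.argsort(values):
        if excess <= 0:
            break
        removed = min(p[idx], excess)
        p[idx] -= removed
        excess -= removed
    return p
```

A larger eta enlarges the confidence set and hence the amount of optimism; intuitively, this guards against windows whose samples understate how often certain transitions can occur once the environment has drifted.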
Keywords: reinforcement learning, data-driven decision making, confidence widening