Bandits atop Reinforcement Learning: Tackling Online Inventory Models With Cyclic Demands

55 Pages. Posted: 21 Jul 2020. Last revised: 14 Jan 2022

Xiao-Yue Gong

MIT Operations Research Center

David Simchi-Levi

Massachusetts Institute of Technology (MIT) - School of Engineering

Date Written: September 7, 2021

Abstract

Motivated by a long-standing gap between inventory theory and practice, we study online inventory models with unknown cyclic demand distributions. We design provably efficient reinforcement learning (RL) algorithms that leverage the structure of inventory problems. We apply the standard performance measure in the online learning literature, regret, defined as the difference between the total expected cost of our policy and the total expected cost of the clairvoyant optimal policy that has full knowledge of the demand distributions a priori. In the presence of unknown cyclic demands, this paper analyzes the lost-sales model with zero lead time and the multi-product backlogging model with positive lead times, fixed joint-ordering costs, and order limits. For both models, we first introduce episodic models where inventory is discarded at the end of every cycle, and then build upon these results to analyze non-discarding models. Our RL policies HQL and FQL achieve ~O(T^(1/2)) regret for the episodic lost-sales model and the episodic multi-product backlogging model, matching the regret lower bound that we prove in this paper. For the non-discarding models, we construct a bandit learning algorithm, named Meta-HQL, on top of the previous RL algorithms. Meta-HQL achieves ~O(T^(1/2)) regret for the non-discarding lost-sales model with zero lead time, again matching the regret lower bound. For the non-discarding multi-product backlogging model, our policy Mimic-QL achieves an ~O(T^(5/6)) regret bound. Our policies remove the regret dependence on the cardinality of the state-action space for inventory problems, which is an improvement over existing RL algorithms. We conducted experiments with a real sales dataset from Rossmann, one of the largest drugstore chains in Europe, and also with a synthetic dataset. In both sets of experiments, our policy converges rapidly to the optimal policy and dramatically outperforms the best policy that models demand as i.i.d. instead of cyclic.
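
The regret measure described in the abstract can be written out as follows. This is only a sketch in notation assumed here for illustration, not necessarily the paper's own: T is the number of periods, \pi is the learning policy, \pi^* is the clairvoyant optimal policy, and C_t^{\pi} is the cost incurred in period t under policy \pi.

```latex
% Assumed notation (not taken from the paper):
%   T       : number of periods in the horizon
%   \pi     : the online learning policy
%   \pi^*   : the clairvoyant optimal policy that knows the demand distributions
%   C_t^\pi : cost incurred in period t under policy \pi
\[
  \mathrm{Regret}(T)
  \;=\;
  \mathbb{E}\!\left[\sum_{t=1}^{T} C_t^{\pi}\right]
  \;-\;
  \mathbb{E}\!\left[\sum_{t=1}^{T} C_t^{\pi^*}\right]
\]
% The bounds quoted in the abstract then read
\[
  \mathrm{Regret}(T) = \tilde{O}\!\left(T^{1/2}\right)
  \qquad\text{and}\qquad
  \mathrm{Regret}(T) = \tilde{O}\!\left(T^{5/6}\right),
\]
% where \tilde{O} conventionally suppresses polylogarithmic factors.
```

Both bounds are sublinear in T, so under this definition the average per-period regret Regret(T)/T vanishes as T grows.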

Keywords: inventory, cyclic demands, censored demand, lost-sales, reinforcement learning, nonparametric, bandits, regret analysis, lead time, joint-ordering cost

Suggested Citation

Gong, Xiao-Yue and Simchi-Levi, David, Bandits atop Reinforcement Learning: Tackling Online Inventory Models With Cyclic Demands (September 7, 2021). Available at SSRN: https://ssrn.com/abstract=3637705 or http://dx.doi.org/10.2139/ssrn.3637705

Xiao-Yue Gong (Contact Author)

MIT Operations Research Center

77 Massachusetts Avenue
Cambridge, MA 02139-4307
United States

David Simchi-Levi

Massachusetts Institute of Technology (MIT) - School of Engineering

MA
United States
