Don’t Follow RL Blindly: Lower Sample Complexity of Learning Optimal Inventory Control Policies with Fixed Ordering Costs
32 Pages · Posted: 15 May 2024 · Last revised: 19 May 2024
Date Written: May 14, 2024
Abstract
Situated in the voluminous literature on the sample complexity of reinforcement learning (RL) algorithms for general (unstructured) Markov Decision Processes (MDPs), we show in this work that a class of structured MDPs admits more efficient learning, i.e., lower sample complexity, than the best possible or best-known algorithms for generic RL. We focus on the MDPs describing inventory control systems with fixed ordering costs, a fundamental problem in supply chains. Interestingly, we find that a naive plug-in sampling-based approach applied to these inventory MDPs already achieves strictly lower sample complexity than the optimal or best-known bounds recently obtained for general MDPs. We improve on those "best-possible" bounds by carefully leveraging the structural properties of the inventory dynamics in various settings. More specifically, in the infinite-horizon discounted cost setting, we obtain an $O\!\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^{2}\epsilon^{2}}\right)$ sample complexity bound, improving on the generic optimal bound $\Theta\!\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^{3}\epsilon^{2}}\right)$ by a factor of $(1-\gamma)^{-1}$. In the infinite-horizon average cost setting, we obtain an $O\!\left(\frac{|\mathcal{S}||\mathcal{A}|}{\epsilon^{2}}\right)$ bound, improving on the generic optimal bound $\Theta\!\left(\frac{|\mathcal{S}||\mathcal{A}|\, t_{\mathrm{mix}}}{\epsilon^{2}}\right)$ by a factor of $t_{\mathrm{mix}}$ and hence removing the dependence on the mixing time.
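To give a concrete sense of the plug-in sampling-based approach referenced above, the sketch below estimates the demand distribution empirically from samples, builds the corresponding discounted-cost inventory MDP with a fixed ordering cost, and solves it by value iteration. This is a minimal illustration only: all names and parameter values (S_MAX, A_MAX, K, c, h, p, GAMMA, the lost-sales dynamic, and the Poisson demand in the usage line) are assumptions for the example, not the paper's model, algorithm, or analysis.

    import numpy as np

    # Illustrative parameters (assumed, not from the paper)
    S_MAX, A_MAX = 20, 20            # inventory capacity and maximum order size
    K, c, h, p = 5.0, 1.0, 0.5, 4.0  # fixed cost, unit cost, holding cost, lost-sales penalty
    GAMMA = 0.9                      # discount factor

    def next_state(x, a, d):
        # order up to at most S_MAX, then serve demand; unmet demand is lost
        return max(min(x + a, S_MAX) - d, 0)

    def cost(x, a, d):
        # fixed + variable ordering cost, plus holding and lost-sales penalties
        on_hand = min(x + a, S_MAX)
        return K * (a > 0) + c * a + h * max(on_hand - d, 0) + p * max(d - on_hand, 0)

    def plug_in_policy(demand_samples, n_iters=300):
        # plug-in step: replace the unknown demand law by its empirical distribution
        vals, counts = np.unique(demand_samples, return_counts=True)
        probs = counts / counts.sum()
        V = np.zeros(S_MAX + 1)
        # value iteration on the estimated (plug-in) inventory MDP
        for _ in range(n_iters):
            Q = np.zeros((S_MAX + 1, A_MAX + 1))
            for x in range(S_MAX + 1):
                for a in range(A_MAX + 1):
                    Q[x, a] = sum(
                        pr * (cost(x, a, d) + GAMMA * V[next_state(x, a, d)])
                        for d, pr in zip(vals, probs)
                    )
            V = Q.min(axis=1)
        return Q.argmin(axis=1)  # greedy ordering decision for each inventory level

    # usage: collect demand samples from the system, then solve the plug-in model
    rng = np.random.default_rng(0)
    policy = plug_in_policy(rng.poisson(5, size=1000))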
Keywords: Offline Learning; Sample Complexity; Inventory Control; Fixed Ordering Cost