Don’t Follow RL Blindly: Lower Sample Complexity of Learning Optimal Inventory Control Policies with Fixed Ordering Costs

32 Pages · Posted: 15 May 2024 · Last revised: 19 May 2024

Xiaoyu Fan

New York University

Boxiao Chen

University of Illinois at Chicago - College of Business Administration

Tava Lennon Olsen

University of Melbourne - Melbourne Business School

Michael Pinedo

New York University (NYU) - Leonard N. Stern School of Business

Hanzhang Qin

National University of Singapore (NUS)

Zhengyuan Zhou

New York University (NYU)

Date Written: May 14, 2024

Abstract

Situated in the voluminous literature on the sample complexity of reinforcement learning (RL) algorithms for general (unstructured) Markov Decision Processes (MDPs), we show in this work that a class of structured MDPs admits more efficient learning (i.e., lower sample complexity bounds) than the best possible or best-known algorithms for generic RL. We focus on the MDPs describing inventory control systems with fixed ordering costs, a fundamental problem in supply chains. Interestingly, we find that a naive plug-in sampling-based approach applied to these inventory MDPs already yields strictly lower sample complexity bounds than the optimal or best-known bounds recently obtained for general MDPs. We improve on those "best-possible" bounds by carefully leveraging the structural properties of the inventory dynamics in various settings. More specifically, in the infinite-horizon discounted cost setting, we obtain an $O\big(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^2\epsilon^2}\big)$ sample complexity bound, improving on the generic optimal bound $\Theta\big(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3\epsilon^2}\big)$ by a factor of $(1-\gamma)^{-1}$. In the infinite-horizon average cost setting, we obtain an $O\big(\frac{|\mathcal{S}||\mathcal{A}|}{\epsilon^2}\big)$ bound, improving on the generic optimal bound $\Theta\big(\frac{|\mathcal{S}||\mathcal{A}|\,t_{\mathrm{mix}}}{\epsilon^2}\big)$ by a factor of $t_{\mathrm{mix}}$ and hence removing the mixing-time dependence.
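
To make the plug-in approach concrete, the following is a minimal sketch (not the paper's implementation) of how a sampling-based plug-in method can be applied to a discounted-cost inventory problem with a fixed ordering cost: demand samples induce an empirical MDP, which is then solved by value iteration. The lost-sales dynamics and all parameters (capacity, K, c, h, p, gamma) are illustrative assumptions, not taken from the paper.

    import numpy as np

    def plug_in_policy(demand_samples, capacity=20, K=4.0, c=1.0, h=0.5,
                       p=2.0, gamma=0.95, tol=1e-6):
        """Solve the empirical discounted-cost inventory MDP (lost sales)
        induced by demand_samples and return a greedy ordering policy."""
        # Plug-in step: empirical demand pmf on {0, 1, ..., capacity}.
        counts = np.bincount(np.minimum(demand_samples, capacity),
                             minlength=capacity + 1)
        pmf = counts / counts.sum()

        n = capacity + 1                  # states: on-hand inventory 0..capacity
        V = np.zeros(n)
        while True:
            Q = np.full((n, n), np.inf)   # Q[x, q]: cost of ordering q in state x
            for x in range(n):
                for q in range(n - x):    # orders may not exceed capacity
                    y = x + q             # post-order inventory level
                    order_cost = K + c * q if q > 0 else 0.0
                    stage_cost, next_value = 0.0, 0.0
                    for demand, prob in enumerate(pmf):
                        leftover = max(y - demand, 0)
                        lost = max(demand - y, 0)
                        stage_cost += prob * (h * leftover + p * lost)
                        next_value += prob * V[leftover]
                    Q[x, q] = order_cost + stage_cost + gamma * next_value
            V_new = Q.min(axis=1)
            if np.max(np.abs(V_new - V)) < tol:
                return Q.argmin(axis=1)   # greedy policy w.r.t. converged values
            V = V_new

    # Usage: learn a policy from 2,000 simulated demand observations.
    rng = np.random.default_rng(0)
    policy = plug_in_policy(rng.poisson(5, size=2000))
    print(policy)                         # order quantity chosen in each state

With enough demand samples, the greedy policy recovered from the empirical model typically exhibits the (s, S) structure known to be optimal for fixed-ordering-cost inventory problems; the paper's bounds quantify how many samples suffice for such a plug-in policy to be epsilon-optimal.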

Keywords: Offline Learning; Sample Complexity; Inventory Control; Fixed Ordering Cost

Suggested Citation

Fan, Xiaoyu and Chen, Boxiao and Lennon Olsen, Tava and Pinedo, Michael and Qin, Hanzhang and Zhou, Zhengyuan, Don’t Follow RL Blindly: Lower Sample Complexity of Learning Optimal Inventory Control Policies with Fixed Ordering Costs (May 14, 2024). Available at SSRN: https://ssrn.com/abstract=4828001

Xiaoyu Fan (Contact Author)

New York University

New York
United States

Boxiao Chen

University of Illinois at Chicago - College of Business Administration

601 S Morgan St
Chicago, IL 60607
United States

Tava Lennon Olsen

University of Melbourne - Melbourne Business School

200 Leicester Street
Carlton, Victoria 3053
Australia

Michael Pinedo

New York University (NYU) - Leonard N. Stern School of Business

44 West 4th Street
Suite 9-160
New York, NY 10012
United States

Hanzhang Qin

National University of Singapore (NUS)

1E Kent Ridge Road
NUHS Tower Block Level 7
Singapore, 119228
Singapore

Zhengyuan Zhou

New York University (NYU)

Bobst Library, E-resource Acquisitions
20 Cooper Square 3rd Floor
New York, NY 10003
United States

Paper statistics

Downloads: 256
Abstract Views: 618
Rank: 243,230