Sage Journals: Discover world-class research

Abstract

We study the sample complexity of offline learning for a class of structured Markov decision processes (MDPs) describing the inventory control system with fixed ordering cost/setup cost, a fundamental problem in supply chains. We find that a naive plug-in sampling-based approach applied to the inventory MDPs leads to strictly lower sample complexity bounds compared to the optimal bounds recently obtained for the general MDPs. More specifically, in the infinite-horizon discounted cost setting, we obtain an $\tilde{O} (min {\frac{(\bar{S} - \underline{s})^{2}}{(1 - γ)^{2} ϵ^{2}}, \frac{1}{(1 - γ)^{4} ϵ^{2}}})$ sample complexity bound, where ${(\bar{S} - \underline{s})}^{2}$ corresponds to the number of state-action pairs in a generic MDP with state space $S$ and action space $A$ . As such, $\tilde{O} (\frac{(\bar{S} - \underline{s})^{2}}{(1 - γ)^{2} ϵ^{2}})$ improves on the optimal generic reinforcement learning (RL) bound $\tilde{Θ} (\frac{(\bar{S} - \underline{s})^{2}}{(1 - γ)^{3} ϵ^{2}})$ (when directly applying $\tilde{Θ} (\frac{| S | | A |}{(1 - γ)^{3} ϵ^{2}})$ here) by a factor of $(1 - γ)^{- 1}$ , and $\tilde{O} (\frac{1}{(1 - γ)^{4} ϵ^{2}})$ is able to completely remove the dependence on state and action cardinality. In the infinite-horizon average cost setting, we obtain an $\tilde{O} (\frac{(\bar{S} - \underline{s})^{2}}{ϵ^{2}})$ bound, improving on the generic optimal RL bound $\tilde{Θ} (\frac{(\bar{S} - \underline{s})^{2} t_{m i x}}{ϵ^{2}})$ (when directly applying $\tilde{Θ} (\frac{| S | | A | t_{m i x}}{ϵ^{2}})$ here) by a factor of $t_{m i x}$ , and hence removing the mixing time dependence. By carefully leveraging the structural properties of the inventory dynamics in various settings, we are able to improve on those “best-possible” bounds developed in the RL literature. Our results demonstrate the drawbacks one could face by blindly following RL algorithms and the necessity of designing sample efficient algorithms that properly incorporate the special structures of the inventory systems.

Keywords

Offline Learning Sample Complexity Inventory Control Fixed Ordering Cost

Get full access to this article

View all access options for this article.

References

Agarwal

Jiang

Kakade

, et al. (2019) Reinforcement learning: Theory and algorithms. Technical Report, CS Department, UW Seattle, Seattle, WA, USA, 10–4.

Agarwal

Kakade

Yang

(2020) Model-based reinforcement learning with a generative model is minimax optimal. In: Conference on learning theory, pp.67–83. PMLR.

Agrawal

Jia

(2022) Learning in structured mdps with convex cost functions: Improved regret bounds for inventory management. Operations Research 70(3): 1646–1664.

Amazon . (2021) The evolution of Amazon’s inventory planning system. Available at: https://www.amazon.science/latest-news/the-evolution-of-amazons-inventory-planning-system. Retrieved August 25, 2025.

Analytics8 . (n.d.) Anheuser-Busch InBev improves logistics operations with single view across breweries. Available at: https://www.analytics8.com/customer-stories/anheuser-busch-inbev/#:∼:text=The%20Tier%201%20logistics%20division,ran%20on%20different%20code%20bases. Retrieved August 25, 2025.

Bastani

Sinchaisri

(2021) Improving human sequential decision-making with reinforcement learning. arXiv preprint arXiv:2108.08454.

Chen

Simchi-Levi

Wang

, et al. (2022) Dynamic pricing and inventory control with fixed ordering cost and incomplete demand information. Management Science 68(8): 5684–5703.

Chen

Sim

Simchi-Levi

, et al. (2007) Risk aversion in inventory management. Operations Research 55(5): 828–842.

Cheung

Simchi-Levi

(2019) Sampling-based approximation schemes for capacitated stochastic inventory control models. Mathematics of Operations Research 44(2): 668–692.

10.

Chung

Rostami

Bastani

, et al. (2022) Decision-aware learning for optimizing health supply chains. arXiv preprint arXiv:2211.08507.

11.

Fan

Chen

Zhou

(2022) Sample complexity of policy learning for inventory control with censored demand. Available at SSRN 4178567.

12.

Gheshlaghi Azar

Munos

Kappen

(2013) Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning 91: 325–349.

13.

Gupta

Guha

Kumar

(2025) Advancements in inventory management: Insights from INFORMS Franz Edelman Award Finalists. Production and Operations Management, p.10591478241310225.

14.

Hoeffding

(1994) Probability inequalities for sums of bounded random variables. The collected works of Wassily Hoeffding, pp.409–426.

15.

Huh

Rusmevichientong

(2009) A nonparametric asymptotic analysis of inventory planning with censored demand. Mathematics of Operations Research 34(1): 103–123.

16.

Iglehart

(1961) Dynamic programming and stationary analyses of inventory problems. Technical report, Stanford University CALIF Applied Mathematics and Statistics Labs.

17.

Iglehart

(1963) Optimality of

(s, S)

policies in the infinite horizon dynamic inventory problem. Management Science 9(2): 259–267.

18.

JD . (2020) In-depth report: When big data and iot meet JD’s logistics. Available at: https://jdcorporateblog.com/in-depth-report-when-big-data-and-iot-meet-jds-logistics/. Retrieved August 25, 2025.

19.

Jin

Sidford

(2021) Towards tight bounds on the sample complexity of average-reward mdps. In: International conference on machine learning, pp.5055–5064. PMLR.

20.

Kakade

(2003) On the Sample Complexity of Reinforcement Learning. United Kingdom: University of London, University College London.

21.

Levi

Roundy

Shmoys

(2007) Provably near-optimal sampling-based policies for stochastic inventory control models. Mathematics of Operations Research 32(4): 821–839.

22.

Wei

Chi

, et al. (2024) Breaking the sample size barrier in model-based reinforcement learning with a generative model. Operations Research 72(1): 203–221.

23.

Lyu

Zhang

Xin

(2024) UCB-type learning algorithms with Kaplan–Meier estimator for lost-sales inventory models with lead times. Operations Research 72(4): 1317–1332.

24.

Qin

Simchi-Levi

Wang

(2022) Data-driven approximation schemes for joint pricing and inventory control models. Management Science 68(9): 6591–6609.

25.

Rossmann . (2025) Rossmann store statistics. Available at: https://rossmann.az/en/about. Retrieved August 25, 2025.

26.

Scarf

(1960) The optimality of

(s, S)

policies in the dynamic inventory problem.

27.

Scarf

Arrow

Karlin

(1957) A Min-Max Solution of An Inventory Problem. Rand Corporation Santa Monica.

28.

See

Sim

(2010) Robust approximation to multiperiod inventory management. Operations Research 58(3): 583–594.

29.

Sidford

Wang

, et al. (2018) Near-optimal time and sample complexities for solving Markov decision processes with a generative model. Advances in Neural Information Processing Systems 2018: 5186–5196.

30.

Sun

Vanajakumari

Sriskandarajah

, et al. (2024) Supply chain planning with free trade zone and uncertain demand. Transportation Research Part E: Logistics and Transportation Review 192: 103771.

31.

Veinott

AF Jr

Wagner

(1965) Computing optimal

(s, S)

inventory policies. Management Science 11(5): 525–552.

32.

Wang

Yang

(2022) Near sample-optimal reduction-based policy learning for average reward MDP. arXiv preprint arXiv:2212.00603.

33.

Wang

Blanchet

Glynn

(2023) Optimal sample complexity for average reward Markov decision processes. arXiv preprint arXiv:2310.08833.

34.

Xie

Xin

(2024) Vc theory for inventory policies. arXiv preprint arXiv:2404.11509.

35.

Yuan

Luo

Shi

(2021) Marrying stochastic gradient descent with bandits: Learning algorithms for inventory systems with fixed costs. Management Science 67(10): 6089–6115.

36.

Zhang

Chao

Shi

(2020) Closing the gap: A learning algorithm for lost-sales inventory systems with lead times. Management Science 66(5): 1962–1980.

37.

Zheng

(1991) A simple proof for optimality of

(s, S)

policies in infinite-horizon inventory systems. Journal of Applied Probability 28(4): 802–810.

38.

Zheng

Federgruen

(1991) Finding optimal

(s, S)

policies is about as simple as evaluating a single policy. Operations research 39(4): 654–665.

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

0.41 MB

Don’t Follow Reinforcement Learning Blindly: Lower Sample Complexity of Learning Optimal Inventory Control Policies with Fixed Ordering Cost

Abstract

Keywords

Get full access to this article

References

Supplementary Material