We study the sample complexity of offline learning for a class of structured Markov decision processes (MDPs) describing the inventory control system with fixed ordering cost/setup cost, a fundamental problem in supply chains. We find that a naive plug-in sampling-based approach applied to the inventory MDPs leads to strictly lower sample complexity bounds compared to the optimal bounds recently obtained for the general MDPs. More specifically, in the infinite-horizon discounted cost setting, we obtain an
sample complexity bound, where
corresponds to the number of state-action pairs in a generic MDP with state space
and action space
. As such,
improves on the optimal generic reinforcement learning (RL) bound
(when directly applying
here) by a factor of
, and
is able to completely remove the dependence on state and action cardinality. In the infinite-horizon average cost setting, we obtain an
bound, improving on the generic optimal RL bound
(when directly applying
here) by a factor of
, and hence removing the mixing time dependence. By carefully leveraging the structural properties of the inventory dynamics in various settings, we are able to improve on those “best-possible” bounds developed in the RL literature. Our results demonstrate the drawbacks one could face by blindly following RL algorithms and the necessity of designing sample efficient algorithms that properly incorporate the special structures of the inventory systems.
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
0.00 MB
0.41 MB