
Abstract

We study model-based undiscounted reinforcement learning for partially observable Markov decision processes (POMDPs). The oracle we consider is the optimal policy of the POMDP with a known environment, measured by the average reward over an infinite horizon. We propose a learning algorithm for this problem that builds on spectral method-of-moments estimation for hidden Markov models, belief error control in POMDPs, and upper confidence bound methods for online learning. We establish a regret bound of $O(T^{2/3} \sqrt{\log T})$ for the proposed algorithm, where $T$ is the learning horizon. To the best of our knowledge, this is the first algorithm achieving sublinear regret with respect to our oracle for learning general POMDPs.
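The belief referred to above is the posterior distribution over hidden states given past actions and observations. As a rough illustration only, and not the algorithm proposed in the article, the standard Bayes filter for a finite POMDP can be sketched as follows. The function name, array layouts, and the two-state model are hypothetical and chosen purely for the example.

```python
import numpy as np


def belief_update(belief, transition, observation, action, obs):
    """One step of the standard Bayes filter for a finite POMDP.

    belief:      shape (S,), current distribution over hidden states
    transition:  shape (A, S, S), transition[a, s, s2] = P(s2 | s, a)
    observation: shape (A, S, O), observation[a, s2, o] = P(o | s2, a)
    (These array layouts are assumptions made for this sketch.)
    """
    # Prediction step: sum_s b(s) * P(s2 | s, a)
    predicted = belief @ transition[action]
    # Correction step: weight each next state by the observation likelihood
    unnormalized = predicted * observation[action][:, obs]
    total = unnormalized.sum()
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return unnormalized / total


# Hypothetical model: two hidden states, one action, two observations.
transition = np.array([[[0.9, 0.1],
                        [0.2, 0.8]]])
observation = np.array([[[0.8, 0.2],
                         [0.3, 0.7]]])
belief = np.array([0.5, 0.5])

new_belief = belief_update(belief, transition, observation, action=0, obs=1)
print(new_belief)
```

When the transition and observation arrays are estimates rather than the true parameters, the filtered belief inherits their error, which is why a learning algorithm of this kind has to account for belief error.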

Keywords

exploration–exploitation, online learning, partially observable MDP, spectral estimator



Sublinear regret for learning POMDPs
