HQ-learning is a hierarchical extension of Q(λ)-learning designed to solve certain types of partially observable Markov decision problems (POMDPs). HQ-learning automatically decomposes a POMDP into a sequence of simpler subtasks, each of which can be solved by a memoryless policy learned by a reactive subagent. HQ-learning can solve partially observable mazes with more states than those used in most previous POMDP work.
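To make the described architecture concrete, the sketch below shows one way such a system could be organized: an ordered sequence of tabular subagents, each pairing a reactive Q-table (observation → action values) with an HQ-table that scores candidate subgoal observations; a subagent stays in control until it observes its chosen subgoal, then hands control to the next subagent in the sequence. This is a minimal illustration under assumed interfaces, not the paper's implementation: the names (`Subagent`, `run_episode`, the `env` object with `reset`/`step`) are hypothetical, and the one-step tabular updates simplify the actual Q(λ)- and HQ-update rules.

```python
import numpy as np

class Subagent:
    """One reactive subagent: a Q-table over (observation, action) for its
    memoryless policy, plus an HQ-table scoring candidate subgoal observations."""

    def __init__(self, n_obs, n_actions, epsilon=0.1):
        self.Q = np.zeros((n_obs, n_actions))  # reactive action values
        self.HQ = np.zeros(n_obs)              # subgoal preferences
        self.epsilon = epsilon

    def choose_subgoal(self, rng):
        # Epsilon-greedy pick of the observation this agent will try to reach.
        if rng.random() < self.epsilon:
            return int(rng.integers(self.HQ.size))
        return int(np.argmax(self.HQ))

    def choose_action(self, obs, rng):
        # Memoryless policy: the action depends only on the current observation.
        if rng.random() < self.epsilon:
            return int(rng.integers(self.Q.shape[1]))
        return int(np.argmax(self.Q[obs]))


def run_episode(env, agents, alpha=0.1, gamma=0.95, rng=None):
    """Run one episode with subagents activated in fixed order.
    `env` is assumed to expose reset() -> obs and step(a) -> (obs, reward, done)
    with integer observations (a hypothetical interface for this sketch)."""
    rng = rng or np.random.default_rng()
    i = 0                                    # index of the active subagent
    subgoal = agents[i].choose_subgoal(rng)
    obs = env.reset()
    done = False
    while not done:
        agent = agents[i]
        action = agent.choose_action(obs, rng)
        next_obs, reward, done = env.step(action)
        # One-step Q-update on the active agent's reactive policy
        # (the paper uses Q(λ) with eligibility traces; simplified here).
        target = reward + (0.0 if done else gamma * np.max(agent.Q[next_obs]))
        agent.Q[obs, action] += alpha * (target - agent.Q[obs, action])
        obs = next_obs
        # Subgoal reached: credit the HQ-table and pass control onward.
        # (The actual HQ-update also folds in the reward collected during
        # the subtask; this bootstrap is an illustrative simplification.)
        if obs == subgoal and i + 1 < len(agents):
            agent.HQ[subgoal] += alpha * (np.max(agents[i + 1].HQ)
                                          - agent.HQ[subgoal])
            i += 1
            subgoal = agents[i].choose_subgoal(rng)
```

The property this sketch preserves is the one the description hinges on: each subagent's policy is memoryless, mapping the current (possibly ambiguous) observation directly to an action, so it is the hierarchy of subgoals, not internal memory, that resolves the perceptual aliasing of the POMDP.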