Learning CPG-based Biped Locomotion with a Policy Gradient Method: Application to a Humanoid Robot

Abstract

In this paper, we describe a learning framework for a central pattern generator (CPG)-based biped locomotion controller using a policy gradient method. Our goals in this study are to achieve CPG-based biped walking with a 3D hardware humanoid and to develop an efficient learning algorithm for the CPG by reducing the dimensionality of the state space used for learning. We demonstrate in numerical simulations that an appropriate feedback controller can be acquired within a few thousand trials, and that the controller obtained in simulation achieves stable walking with a physical robot in the real world. We evaluate walking velocity and stability in both numerical simulations and hardware experiments. The results suggest that the learning algorithm is capable of adapting to environmental changes. Furthermore, we present an online learning scheme that starts from an initial policy and improves the controller on a hardware robot within 200 iterations.

Keywords

humanoid robots, reinforcement learning, bipedal locomotion, central pattern generator
