
Abstract

On-policy imitation learning algorithms such as DAgger evolve a robot control policy by executing it, measuring performance (loss), obtaining corrective feedback from a supervisor, and generating the next policy. As the loss between iterations can vary unpredictably, a fundamental question is under what conditions this process will eventually achieve a converged policy. If one assumes the underlying trajectory distribution is static (stationary), it is possible to prove convergence for DAgger. However, in more realistic models for robotics, the underlying trajectory distribution is dynamic because it is a function of the policy. Recent results show it is possible to prove convergence of DAgger when a regularity condition on the rate of change of the trajectory distributions is satisfied. In this article, we reframe this result using dynamic regret theory from the field of online optimization and show that dynamic regret can be applied to any on-policy algorithm to analyze its convergence and optimality. These results inspire a new algorithm, Adaptive On-Policy Regularization (AOR), that ensures the conditions for convergence. We present simulation results with cart–pole balancing and locomotion benchmarks that suggest AOR can significantly decrease dynamic regret and chattering as the robot learns. To the best of the authors’ knowledge, this is the first application of dynamic regret theory to imitation learning.
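
To make the distinction behind the abstract concrete, the following is a minimal sketch of how static and dynamic regret differ, using the standard online-optimization definitions: static regret compares the incurred loss with the best single fixed parameter in hindsight, while dynamic regret compares it with the best parameter at each round. The scalar quadratic losses f_n(theta) = (theta - c_n)^2 and the function names are illustrative assumptions, not taken from the article.

```python
from typing import Sequence


def incurred_loss(played: Sequence[float], targets: Sequence[float]) -> float:
    """Total loss of the played parameters under the assumed toy losses
    f_n(theta) = (theta - c_n)^2, where c_n is the per-round target."""
    return sum((p - c) ** 2 for p, c in zip(played, targets))


def static_regret(played: Sequence[float], targets: Sequence[float]) -> float:
    """Regret against the single best fixed parameter in hindsight."""
    # For summed quadratics, the best fixed parameter is the mean of the targets.
    best_fixed = sum(targets) / len(targets)
    comparator = sum((best_fixed - c) ** 2 for c in targets)
    return incurred_loss(played, targets) - comparator


def dynamic_regret(played: Sequence[float], targets: Sequence[float]) -> float:
    """Regret against the best parameter chosen separately at every round."""
    # Each toy loss is minimized at theta = c_n with value 0,
    # so the per-round comparator contributes nothing.
    return incurred_loss(played, targets)


# Hypothetical example: the per-round optimum drifts from 0 to 2.
targets = [0.0, 1.0, 2.0]
played = [0.0, 0.0, 1.0]
print(static_regret(played, targets), dynamic_regret(played, targets))
```

In this toy run the static regret is zero even though the played parameters lag the drifting optimum, which the dynamic regret does register. This is the sense in which dynamic regret is the stricter measure when the loss changes with the policy.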

Keywords

Imitation learning, online optimization, online learning, dynamic regret


Dynamic regret convergence analysis and an adaptive regularization algorithm for on-policy robot imitation learning
