Combining imitation and deep reinforcement learning to human-level performance on a virtual foraging task

Abstract

We develop a framework for learning bio-inspired foraging policies from human data. We conduct an experiment in which humans are virtually immersed in an open-field foraging environment and are trained to collect as much reward as possible. A Markov Decision Process (MDP) framework is introduced to model the human decision dynamics. Imitation Learning (IL) based on maximum likelihood estimation is then used to train Neural Networks (NN) that map observed states to human decisions. The results show that passive imitation substantially underperforms humans. We further refine the human-inspired policies via Reinforcement Learning (RL) using the on-policy Proximal Policy Optimization (PPO) algorithm, which shows better stability than other algorithms and steadily improves the policies pre-trained with IL. We show that the combination of IL and RL matches human performance and that artificial agents trained with our approach adapt quickly to a shift in the reward distribution. Finally, we show that good performance and robustness to reward distribution shift depend strongly on combining allocentric information with an egocentric representation of the environment.
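The abstract names two training stages: maximum-likelihood imitation of human decisions, followed by PPO fine-tuning. It does not give the network architecture, state features, or PPO hyperparameters, so the following is only a minimal sketch under assumptions of our own. A linear softmax policy over hypothetical feature vectors stands in for the paper's neural networks; `train_bc`, `nll`, and `ppo_clip_objective` are illustrative names, and the clipping value `eps=0.2` is an assumed default. This is not the authors' implementation.

```python
import math

# Assumption: states are short feature vectors, actions are discrete indices,
# and the policy is a linear softmax (a stand-in for the paper's neural networks).

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def action_probs(weights, state):
    # weights[a] is the weight vector of action a; logit_a = <weights[a], state>
    logits = [sum(w * x for w, x in zip(wa, state)) for wa in weights]
    return softmax(logits)

def nll(weights, demos):
    # Average negative log-likelihood of the demonstrated (human) actions
    return -sum(math.log(action_probs(weights, s)[a]) for s, a in demos) / len(demos)

def train_bc(demos, n_actions, lr=0.5, epochs=200):
    """Stage 1 (IL): maximum-likelihood fit of pi(a|s) to (state, action) pairs."""
    dim = len(demos[0][0])
    weights = [[0.0] * dim for _ in range(n_actions)]
    for _ in range(epochs):
        grads = [[0.0] * dim for _ in range(n_actions)]
        for s, a in demos:
            p = action_probs(weights, s)
            for b in range(n_actions):
                # d(-log p_a) / d(logit_b) = p_b - 1[b == a]
                coef = p[b] - (1.0 if b == a else 0.0)
                for i, x in enumerate(s):
                    grads[b][i] += coef * x
        # Full-batch gradient descent step on the average NLL
        for b in range(n_actions):
            for i in range(dim):
                weights[b][i] -= lr * grads[b][i] / len(demos)
    return weights

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Stage 2 (RL): PPO clipped surrogate, to be maximized, averaged over samples."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

# Usage on toy demonstrations: action 0 in state [1, 0], action 1 in state [0, 1]
demos = [([1.0, 0.0], 0), ([0.0, 1.0], 1)] * 5
w = train_bc(demos, n_actions=2)
print(action_probs(w, [1.0, 0.0]))
```

In a full pipeline, the weights returned by the imitation stage would initialize the policy whose parameters are then updated by ascending the clipped surrogate on fresh rollouts. The clipping keeps each update close to the current policy, which is one reason PPO can refine a pre-trained policy without immediately discarding what it learned from demonstrations.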

Keywords

Decision-making, foraging, reinforcement learning, imitation learning, autonomous navigation, deep learning, bio-inspired control


