Learning reward functions from diverse sources of human feedback: Optimally integrating demonstrations and preferences

Abstract

Reward functions are a common way to specify the objective of a robot. As designing reward functions can be extremely challenging, a more promising approach is to directly learn reward functions from human teachers. Importantly, data from human teachers can be collected either passively or actively in a variety of forms: passive data sources include demonstrations (e.g., kinesthetic guidance), whereas preferences (e.g., comparative rankings) are actively elicited. Prior research has independently applied reward learning to these different data sources. However, there exist many domains where multiple sources are complementary and expressive. Motivated by this general problem, we present a framework to integrate multiple sources of information, which are either passively or actively collected from human users. In particular, we present an algorithm that first utilizes user demonstrations to initialize a belief about the reward function, and then actively probes the user with preference queries to zero in on their true reward. This algorithm not only enables us to combine multiple data sources, but also informs the robot when it should leverage each type of information. Further, our approach accounts for the human’s ability to provide data, yielding user-friendly preference queries that are also theoretically optimal. Our extensive simulated experiments and user studies on a Fetch mobile manipulator demonstrate the superiority and the usability of our integrated framework.
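The two-stage idea described above (demonstrations initialize a belief, preference queries then refine it) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not stated in the abstract: a reward that is linear in trajectory features, a sample-based belief over unit-norm weight vectors, an unnormalized Boltzmann-style demonstration likelihood with a hypothetical rationality parameter `beta`, a logistic preference model, and query selection by information gain. The function names are hypothetical as well.

```python
import math
import random

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def sample_unit_weights(n, dim, rng):
    """Draw n candidate reward weight vectors uniformly from the unit sphere."""
    samples = []
    for _ in range(n):
        w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(dot(w, w))
        samples.append([x / norm for x in w])
    return samples

def prob_prefers_a(w, feat_a, feat_b):
    """Assumed logistic choice model: P(user picks A over B | w)."""
    diff = dot(w, feat_a) - dot(w, feat_b)
    diff = max(-50.0, min(50.0, diff))  # guard against overflow in exp
    return 1.0 / (1.0 + math.exp(-diff))

def normalize(log_belief):
    """Turn unnormalized log-weights into a probability vector."""
    top = max(log_belief)
    weights = [math.exp(v - top) for v in log_belief]
    total = sum(weights)
    return [x / total for x in weights]

def binary_entropy(p):
    p = min(max(p, 1e-12), 1.0 - 1e-12)
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def information_gain(belief, samples, query):
    """Mutual information between the user's answer and w under the belief."""
    feat_a, feat_b = query
    probs = [prob_prefers_a(w, feat_a, feat_b) for w in samples]
    marginal = sum(b * p for b, p in zip(belief, probs))
    expected = sum(b * binary_entropy(p) for b, p in zip(belief, probs))
    return binary_entropy(marginal) - expected

def learn_reward(demos, queries, answer_fn, n_samples=1000,
                 n_queries=10, beta=5.0, seed=0):
    """Return the posterior-mean weight vector after demos and preferences."""
    rng = random.Random(seed)
    samples = sample_unit_weights(n_samples, len(demos[0]), rng)
    # Stage 1 (passive): demonstrations initialize the belief. Assumption:
    # likelihood proportional to exp(beta * w . phi), normalizer ignored.
    log_belief = [beta * sum(dot(w, phi) for phi in demos) for w in samples]
    # Stage 2 (active): ask the most informative remaining query, then update.
    remaining = list(queries)
    for _ in range(min(n_queries, len(remaining))):
        belief = normalize(log_belief)
        query = max(remaining,
                    key=lambda q: information_gain(belief, samples, q))
        remaining.remove(query)
        feat_a, feat_b = query
        chose_a = answer_fn(feat_a, feat_b)
        for i, w in enumerate(samples):
            p = prob_prefers_a(w, feat_a, feat_b)
            log_belief[i] += math.log(max(p if chose_a else 1.0 - p, 1e-12))
    belief = normalize(log_belief)
    dim = len(samples[0])
    return [sum(b * w[k] for b, w in zip(belief, samples)) for k in range(dim)]
```

Here `demos` is a list of feature vectors for demonstrated trajectories, `queries` is a list of `(features_a, features_b)` pairs, and `answer_fn` returns `True` when the user prefers the first option. Setting `n_queries=0` gives the demonstration-only belief, which makes it easy to compare against the combined result.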

Keywords

Reward learning, active learning, inverse reinforcement learning, learning from demonstrations, preference-based learning, human–robot interaction
