Active reward learning and iterative trajectory improvement from comparative language feedback

Abstract

Human-in-the-loop learning has gained traction in fields such as robotics and natural language processing in recent years. Prior work mostly relies on human feedback in the form of preference comparisons, but this feedback type has several limitations: it does not let users explain the reasons for their preferences, and it provides only a binary signal for learning, which makes learning highly data-inefficient. Consequently, training robots requires a substantial amount of human feedback, which takes significant time and burdens the user. To overcome these challenges, we build on the insight that language is a richer medium than comparisons and conveys more information about user preferences. In this work, we incorporate comparative language feedback to iteratively improve robot trajectories and to learn reward functions that encode human preferences. We learn a shared latent space that integrates trajectory data and language feedback, and then leverage this latent space to improve trajectories and learn human preferences. Finally, we introduce an active learning method that integrates comparative language feedback to further boost data efficiency. Our results in simulation experiments and user studies demonstrate the effectiveness of the learned latent space and the success of our learning algorithms. In the user studies, our reward learning algorithm achieves a 23.9% improvement in subjective score on average and 11.3% higher time efficiency compared to the preference comparison method. Our active querying method further improves user experience, with an 8.31% average improvement in subjective scores compared to random querying. Our code is publicly available at https://liralab.usc.edu/comparative-language-feedback/.
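
To make the "binary signal" limitation concrete, the following is a minimal sketch of the preference-comparison baseline, not the method proposed in this article. It assumes a linear reward over hand-picked trajectory features and a Bradley-Terry choice model; the feature values, the learning rate, and all function names are hypothetical and chosen only for illustration.

```python
import math

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def preference_probability(w, phi_a, phi_b):
    """Bradley-Terry probability that trajectory A is preferred over B,
    assuming a linear reward R(xi) = w . phi(xi) (an assumption of this sketch)."""
    diff = [a - b for a, b in zip(phi_a, phi_b)]
    return 1.0 / (1.0 + math.exp(-dot(w, diff)))

def update_from_comparison(w, phi_a, phi_b, a_preferred, lr=0.5):
    """One gradient-ascent step on the log-likelihood of a single binary answer."""
    label = 1.0 if a_preferred else 0.0
    p = preference_probability(w, phi_a, phi_b)
    diff = [a - b for a, b in zip(phi_a, phi_b)]
    return [wi + lr * (label - p) * di for wi, di in zip(w, diff)]

def binary_entropy_bits(p):
    """Entropy of a yes/no answer in bits; it never exceeds 1 bit per query."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

# Hypothetical 3-dimensional trajectory features (illustrative values only).
phi_a = [1.0, 0.2, 0.0]
phi_b = [0.0, 0.5, 1.0]
w = [0.0, 0.0, 0.0]

# The user answers "A is better than B" twenty times in a row.
for _ in range(20):
    w = update_from_comparison(w, phi_a, phi_b, a_preferred=True)

p_final = preference_probability(w, phi_a, phi_b)
max_bits_per_query = binary_entropy_bits(0.5)
```

Each answer moves the weights only along the difference between the two compared trajectories and carries at most one bit, which is one way to see why many queries are needed under this feedback type.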

Keywords

reward learning, active learning, inverse reinforcement learning, preference-based learning, human–robot interaction, human-in-the-loop learning
