Domains as objectives: Domain-uncertainty-aware policy optimization through explicit multi-domain convex coverage set learning

Abstract

Uncertainty is inherent in real-world robotics problems, and any control framework must address it to succeed in practical applications. Reinforcement learning (RL) is no different: epistemic uncertainty arising from model uncertainty or misspecification is a challenge well captured by the sim-to-real gap. A simple solution to this issue is domain randomization (DR), which unfortunately can result in conservative agents. As a remedy to this conservativeness, universal policies that take additional information about the randomized domain as input have emerged as an alternative solution, along with recurrent neural network-based controllers. Uncertainty-aware universal policies are particularly compelling because they can account for system identification uncertainties during deployment. In this paper, we show that the challenge of efficiently optimizing uncertainty-aware policies can be reframed as solving the convex coverage set (CCS) problem within a multi-objective reinforcement learning (MORL) context. By introducing a novel Markov decision process (MDP) framework in which each domain’s performance is treated as an independent objective, we unify the training of uncertainty-aware policies with MORL approaches. This connection enables the application of MORL algorithms to DR, allowing for more efficient policy optimization. To illustrate this, we focus on the linear utility function, which aligns with the expectation in DR formulations, and propose a series of algorithms adapted from the MORL literature to solve the CCS problem, demonstrating their ability to enhance the performance of uncertainty-aware policies.
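
As a rough illustration of the "domains as objectives" view, the sketch below treats each policy's per-domain returns as a vector, scores it with a linear utility (the expected return under a belief over domains), and collects the policies that are optimal for at least one belief, which is a sampling-based approximation of the convex coverage set. The return values, the function names `expected_return` and `approximate_ccs`, and the Dirichlet sampling of beliefs are all assumptions made for illustration; this is not one of the algorithms proposed in the paper.

```python
import numpy as np


def expected_return(domain_returns, belief):
    """Linear utility: expected return of each policy under a belief over domains.

    domain_returns: array of shape (n_policies, n_domains)
    belief: array of shape (n_domains,), non-negative and summing to 1
    """
    domain_returns = np.asarray(domain_returns, dtype=float)
    belief = np.asarray(belief, dtype=float)
    return domain_returns @ belief


def approximate_ccs(domain_returns, n_samples=2000, seed=0):
    """Indices of policies that maximize the linear utility for some sampled belief.

    This is a sampling-based approximation (an assumption of this sketch),
    not an exact CCS solver.
    """
    domain_returns = np.asarray(domain_returns, dtype=float)
    n_domains = domain_returns.shape[1]
    rng = np.random.default_rng(seed)
    # Beliefs sampled uniformly from the probability simplex.
    beliefs = rng.dirichlet(np.ones(n_domains), size=n_samples)
    # Add the corners: full certainty about a single domain.
    beliefs = np.vstack([beliefs, np.eye(n_domains)])
    scores = beliefs @ domain_returns.T  # shape (n_beliefs, n_policies)
    winners = np.argmax(scores, axis=1)
    return sorted(set(winners.tolist()))


# Hypothetical per-domain returns for four candidate policies in two domains.
returns = np.array([
    [10.0, 0.0],   # specialist for domain 0
    [0.0, 10.0],   # specialist for domain 1
    [6.0, 6.0],    # compromise policy, similar in spirit to plain DR
    [7.0, 3.0],    # Pareto-optimal, but never best under any linear utility
])

uniform_scores = expected_return(returns, [0.5, 0.5])
ccs_indices = approximate_ccs(returns)
print(uniform_scores)
print(ccs_indices)
```

Under a uniform belief the compromise policy scores highest, which mirrors the conservative behaviour attributed to DR, while a belief concentrated on one domain selects the corresponding specialist. The fourth policy is not dominated by any other, yet no belief makes it the best choice, so it lies outside the convex coverage set.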

Keywords

Domain randomization, uncertainty-aware policy, convex coverage set, multi-domain reinforcement learning, Markov decision process framework
