Learning to soar: Resource-constrained exploration in reinforcement learning

Abstract

This paper examines temporal difference reinforcement learning with adaptive and directed exploration for resource-limited missions. The scenario considered is that of an unpowered aerial glider learning to perform energy-gaining flight trajectories in a thermal updraft. The presented algorithm, eGP-SARSA(λ), uses a Gaussian process regression model to estimate the value function in a reinforcement learning framework. The Gaussian process also provides a variance on these estimates, which is used to measure, in terms of information gain, how much future observations would contribute to the Gaussian process value function model. To avoid myopic exploration, we developed a resource-weighted objective function that combines the estimated value function with an estimate of the future information gain obtained from an action rollout, and uses it to generate directed explorative action sequences. Several modifications and computational speed-ups to the algorithm are presented, along with a standard GP-SARSA(λ) implementation with $\epsilon$-greedy exploration against which learning performance is compared. The results show that under this objective function, the learning agent is able to continue exploring for better state-action trajectories when platform energy is high and follow conservative energy-gaining trajectories when platform energy is low.
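The idea of trading estimated value against information gain according to available energy can be illustrated with a minimal sketch. This is not the eGP-SARSA(λ) algorithm itself: the squared-exponential kernel, the single-step (no rollout) scoring, the linear blend in `resource_weighted_score`, and all function and parameter names are assumptions made for illustration only. The information gain used here is the standard entropy reduction for one noisy Gaussian observation, $\tfrac{1}{2}\log(1 + \sigma^2/\sigma_n^2)$.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0, signal_var=1.0):
    """Squared-exponential kernel between point sets A (n, d) and B (m, d)."""
    sq_dist = (np.sum(A**2, axis=1)[:, None]
               + np.sum(B**2, axis=1)[None, :]
               - 2.0 * A @ B.T)
    return signal_var * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / length_scale**2)

def gp_posterior(X, y, X_query, noise_var=0.01):
    """Posterior mean and variance of a zero-mean GP at the query points."""
    K = rbf_kernel(X, X) + noise_var * np.eye(len(X))
    K_s = rbf_kernel(X, X_query)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = np.diag(rbf_kernel(X_query, X_query)) - np.sum(v**2, axis=0)
    return mean, np.maximum(var, 0.0)

def information_gain(var, noise_var=0.01):
    """Entropy reduction (in nats) from one noisy observation with prior variance var."""
    return 0.5 * np.log1p(var / noise_var)

def resource_weighted_score(value, info_gain, energy_fraction):
    """Hypothetical blend: high energy favours information, low energy favours value."""
    w = np.clip(energy_fraction, 0.0, 1.0)
    return (1.0 - w) * value + w * info_gain

def select_candidate(X, y, candidates, energy_fraction, noise_var=0.01):
    """Index of the candidate with the best resource-weighted score."""
    mean, var = gp_posterior(X, y, candidates, noise_var)
    scores = resource_weighted_score(mean, information_gain(var, noise_var),
                                     energy_fraction)
    return int(np.argmax(scores))

# Toy data: two visited state-action features with observed values.
X = np.array([[0.0], [1.0]])
y = np.array([1.0, 2.0])
# Candidate 0 lies on known, valuable ground; candidate 1 is unexplored.
candidates = np.array([[1.0], [4.0]])

high_energy_choice = select_candidate(X, y, candidates, energy_fraction=0.9)
low_energy_choice = select_candidate(X, y, candidates, energy_fraction=0.1)
```

In this toy setting the unexplored candidate has a much larger posterior variance, so it wins when the energy fraction is high, while the known high-value candidate wins when energy is low. A rollout-based version would sum the information gain over a simulated action sequence instead of scoring a single step.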

Keywords

Reinforcement learning, exploration, informative planning, aerial robotics, autonomous soaring

