Minmax fuzzy deterministic policy gradient for zero-sum differential game: Take pursuit-evasion problem as example

Abstract

A novel actor-critic algorithm is introduced and applied to zero-sum differential games. The proposed structure consists of two actors and one critic. Each actor represents the control policy of one player, and the critic approximates the state-action utility (Q) function. Instead of neural networks, fuzzy inference systems are used as the approximators for the actors and the critic, so that the learned policies can be expressed as linguistic fuzzy rules with a specific practical meaning. Because the goals of the players in the game are completely opposite, the two actors are updated simultaneously in opposite directions during training: one actor is updated in the direction that minimizes the Q value, while the other is updated in the direction that maximizes it. A pursuit-evasion problem with two pursuers and one evader is taken as an example to illustrate the validity of the method. In this problem, the two pursuers share the same actor, and the symmetry of the problem is used to improve the replay buffer. At the end of the paper, policies obtained after different numbers of training episodes are pitted against each other.
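The opposite-direction actor updates described above can be illustrated with a minimal sketch. It is not the paper's implementation: a hand-written quadratic critic stands in for the fuzzy critic, two one-parameter linear actors stand in for the fuzzy actors, and the constants `A`, `B`, `C` and all function names are hypothetical. It only shows the mechanism of a deterministic policy gradient in which one actor descends and the other ascends the same Q value.

```python
import numpy as np

# Hypothetical constants of a toy critic (not from the paper):
# Q(s, u, v) = (u - A*s)**2 + C*u*v - (v - B*s)**2
# Q is convex in u (minimizing player) and concave in v (maximizing player).
A, B, C = 1.0, 0.5, 1.0


def q_action_grads(s, u, v):
    """Return dQ/du and dQ/dv of the toy critic for a batch of samples."""
    dq_du = 2.0 * (u - A * s) + C * v
    dq_dv = C * u - 2.0 * (v - B * s)
    return dq_du, dq_dv


def train(steps=500, lr=0.3, batch=256, seed=0):
    """Update both actors simultaneously, in opposite directions."""
    rng = np.random.default_rng(seed)
    theta_u, theta_v = 0.0, 0.0  # actor parameters: u = theta_u*s, v = theta_v*s
    for _ in range(steps):
        s = rng.uniform(-1.0, 1.0, size=batch)  # sampled states
        u, v = theta_u * s, theta_v * s         # deterministic actions
        dq_du, dq_dv = q_action_grads(s, u, v)
        # Chain rule of the deterministic policy gradient:
        # dQ/dtheta = mean(dQ/da * da/dtheta), with da/dtheta = s here.
        grad_u = np.mean(dq_du * s)
        grad_v = np.mean(dq_dv * s)
        theta_u -= lr * grad_u  # minimizing actor: gradient descent on Q
        theta_v += lr * grad_v  # maximizing actor: gradient ascent on Q
    return theta_u, theta_v


if __name__ == "__main__":
    theta_u, theta_v = train()
    print(theta_u, theta_v)
```

For these particular constants, setting both gradients to zero gives the saddle point theta_u = 0.6, theta_v = 0.8, which the simultaneous updates approach. In the setting of the abstract the critic would be learned from data rather than given in closed form, so convergence is not guaranteed in the same way.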

Keywords

Fuzzy inference system, differential game, reinforcement learning, pursuit-evasion problem, deterministic policy gradient

