Abstract
For target capture by multiple unmanned surface vessels (multi-USV), a coordinated path planning and tracking control method is proposed that dynamically and effectively integrates multi-agent reinforcement learning (MARL) and active disturbance rejection control (ADRC). A bounded water environment model with various obstacles is constructed. To generate the optimal capture path online and accelerate its convergence, the real-time multi-agent deep deterministic policy gradient (MADDPG) algorithm is enhanced with prioritized experience replay (PER). To realize the interaction between agent and environment, the real positions of the unmanned surface vessels (USVs) are fed to the MADDPG network as state variables. The action space consists of the yaw angle and surge speed of each USV and serves as the reference path for tracking control. To prevent target escape, the USVs are required to be evenly distributed on a target-centered capture loop while staying outside the detection range of the target. For fast and safe capture, a composite reward function is proposed comprising a capture reward, an obstacle avoidance and collision avoidance reward, a boundary collision restriction reward, a capture inner boundary constraint reward, an angle constraint reward and a motion constraint reward. In addition, to incorporate the actual tracking performance, the errors between the references and the real states of the USVs are also formulated into the reward function. To follow the reference commands from the enhanced MADDPG in the presence of wind and wave disturbances, angle and speed tracking controllers are developed using linear ADRC (LADRC). Finally, the effectiveness of the proposed method is verified by capture simulations of static and dynamic targets with various types of obstacles.
Introduction
In recent years, unmanned surface vessels (USVs) have been applied to various ocean tasks owing to their agility, reliability and intelligence, 1 for example, preventing or hunting invasive ships. 2 In particular, target capture has become an attractive topic for USVs: the USVs need to surround the target in a fixed encirclement formation within a given area. 3
Currently, many target capture problems have been addressed, such as single-target or multi-target capture, 4 and static or dynamic target capture. 5 Various optimization and control algorithms have been developed. Yu et al. 6 propose a frontal interception guidance law and a distributed-encirclement formation tracking control law for leading and following aircraft to achieve coordinated target capture. Dong et al. 7 propose a fuzzy double-capture control based on an auction algorithm to reduce the possibility of target escape. For the case of uncertain target information, Fedele et al. 8 propose a control law that drives the agents' motion to surround the target. Aiming at encircling an unknown target in three-dimensional space, Li et al. 9 establish an estimator to locate the target position and then design a control law that drives the agents to complete circumnavigation of the target.
Although traditional control methods can be used for the target capture problem, their adaptability to complicated environments is limited: either an accurate mathematical model of the capture process is required, or various constraints restrict the development and application of these methods. As a new intelligent learning paradigm, reinforcement learning (RL) does not need to model the environment but instead focuses on the interaction between agent and environment. 10 RL can be divided into policy-based 11 and value-based 12 algorithms. Collaborative capture based on multi-agent RL (MARL) 13 has been studied extensively, where multiple agents perform autonomous trial-and-error and collaborative learning.14,15 Researchers have also enhanced RL algorithms through improvements to the reward function, the experience replay mechanism and the training method. 16 Target capture research mainly concerns how to determine the relative formation positions of the hunters and how to form the capture formation. For determining the relative positions of the hunters, task allocation methods are often used; optimal target assignment for the hunters can likewise be achieved in multi-target capture. Many task allocation approaches have been proposed to ensure reliable and effective resource allocation 17,18 and fast processing of large-scale data. 19 Regarding the formation of the capture itself, pure capture, capture with escape behavior and capture with game-confrontation behavior have each been studied. For pure target capture, many methods have been attempted. For instance, Zhang et al. 20 introduce a continuous MARL framework that adapts multiple optimization functions online to guide mobile robots in a dynamic environment. Fan et al. 21 propose a multi-robot RL algorithm guided by a potential energy model for single-target capture. On this basis, target escape behavior has been considered. For random escape of the target, Wu et al. 22 improve the multi-agent deep deterministic policy gradient (MADDPG). Li et al. 23 study hunting and escaping among multiple USVs using the proximal policy optimization method, enhancing the structure and learning scheme of the training network. Sun et al. 24 study cooperative pursuit strategies against the escape strategies of the besieged in different encirclement states. Gan et al. 25 propose a multi-USV collaborative pursuit strategy based on obstacle assistance and deep reinforcement learning, which effectively captures intelligent evaders through autonomous environment perception and copes with irregular obstacles and ocean current disturbances. More recently, the game confrontation between hunters and escapees has been studied further. Qu et al. 26 propose an adversarial game strategy with a flexible reward function, training the intruder's escape strategy in complex environments. For environments with dynamic obstacles, Sun et al. 27 propose a self-organizing cooperative hunting strategy to capture intruders, where effective obstacle avoidance is achieved by a dynamic collision avoidance approach.
Although the above studies apply MARL to target capture, and escape confrontation behavior 28 is addressed as well, the environment is generally assumed to be ideal and standardized, so these algorithms cannot adapt to changing environments with wind and wave disturbances. Moreover, the agents in the capture problem are mostly treated as ideal particles: the algorithms are developed on the kinematic model, and most dynamic constraints are ignored. In addition, in traditional MARL the decision or planning stage and the real motion control are separated during network training. Since the actual control results have no impact on the MARL computation, 29 a large gap exists between the actual effect of MARL and the expected result. How to apply MARL in reality therefore remains a challenging topic.
Motivated by the above, a capture framework integrating MARL navigation and actual tracking control is proposed for target capture by USVs in this paper. Most current studies assume an infinite capture environment or introduce virtual capture boundary constraints, 30 in which case capture of a dynamic target, which is more challenging than capture of a static target, is regarded as successful when the target is chased to the environmental boundary. Therefore, a bounded water environment with various obstacles is modeled in this paper first; the USVs must not collide with the environmental boundary, and hunting cannot be concluded at the boundary. To imitate a real capture environment with dynamic obstacles, 27 both dynamic and static obstacles are considered, and a safe obstacle avoidance distance is introduced to ensure the safety of the USVs. Secondly, MADDPG on the kinematic model is improved by combining prioritized experience replay (PER) to accelerate convergence. The proposed PER-MADDPG is executed online to generate the optimal capture path: the actual positions of the USVs are taken as its input, and its output is taken as the reference command for the tracking controllers. Besides, considering that a dynamic target may escape in a complex environment, the capture strategy is combined with escape behavior and the mechanism of a capture loop is proposed. In case of dynamic target escape, the USVs are required to be evenly distributed on the target-centered capture loop without moving into the detection range of the target. In the PER-MADDPG method, a composite reward function is formulated comprising a capture reward, an obstacle avoidance and collision avoidance reward, a boundary collision restriction reward, a capture inner boundary constraint reward, an angle constraint reward, a motion constraint reward, and a tracking error reward between the reference and real states of the USVs. To handle strong wind and wave disturbances, active disturbance rejection control (ADRC) 31 is introduced to design the tracking controllers: based on the pseudo-linear dynamic model, linear ADRC (LADRC) 32 is used to develop the angle and speed tracking controllers. Eventually, simulations of capturing static and dynamic targets under different kinds of obstacles demonstrate the effectiveness of the proposed method.
The main contributions of this paper are listed as follows:
(i) Traditional control methods are not suitable for strong uncertainty or drastic environmental changes, while existing reinforcement learning methods are mostly designed without regard to control errors. Therefore, the coordination control framework PER-MADDPG-LADRC is proposed for capturing a static or dynamic target with USVs. Real-time path planning based on PER-MADDPG and path tracking control based on LADRC are developed and connected to form a cascade closed-loop coordination control system, in which path planning runs on a slow time scale and path tracking control on a fast time scale.
(ii) Considering the complexity of the capture scenario, target escape may occur and dynamic obstacles generally exist in the marine environment. Therefore, the mechanism of a target-centered capture loop is proposed, and the capture inner boundary constraint reward is designed to prevent USVs from crossing through the encirclement. Moreover, static and dynamic obstacle avoidance and collision avoidance between USVs are both considered and incorporated into the composite reward function.
(iii) Since the two sub-systems of the coordination control system run on different time scales with different periods, accurate path tracking inevitably lags the planned path. Hence, the actual tracking error reward of the USVs is designed as part of the composite reward function in PER-MADDPG to reduce the impact of control error on path planning and tracking.
The rest of this paper is organized as follows. Section 'Problem statement' introduces the USV model and the capture model. The PER-MADDPG-LADRC methodology is described in Section 'Methodology'. In Section 'Simulation and results', two cases of static and dynamic target capture are presented. Conclusions are drawn finally.
Problem statement
Unmanned surface vessel model description
The mathematical model of a USV is multi-variable and strongly coupled. Since heave, pitch and roll have little effect on the planar motion of the USV, the three-degree-of-freedom model involving only surge, sway and yaw is adopted, as shown in Figure 1. Its kinematics and dynamics are presented as follows:
where,
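For reference, a standard three-degree-of-freedom formulation (the commonly used form, given here as an assumption, since the paper's exact terms are not reproduced) is

$$\dot{\eta} = R(\psi)\,\nu, \qquad M\dot{\nu} + C(\nu)\,\nu + D(\nu)\,\nu = \tau + \tau_d,$$

where $\eta = [x, y, \psi]^T$ is the earth-fixed position and yaw angle, $\nu = [u, v, r]^T$ collects the surge speed, sway speed and yaw rate, $R(\psi)$ is the rotation matrix from the body-fixed to the earth-fixed frame, $M$ is the inertia matrix, $C(\nu)$ and $D(\nu)$ are the Coriolis-centripetal and damping matrices, $\tau$ is the control input, and $\tau_d$ collects the wind and wave disturbances.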

Figure 1. USV schematic.
Target capture model
Generally, the marine environment includes unbounded scenarios (e.g. the open sea) and bounded scenarios (e.g. a lake or river). When the environment is unbounded, the allowable motion ranges of the USVs and the dynamic target are larger, and boundary collision risk need not be considered. When the environment is bounded, the motion ranges of the USVs and the dynamic target are limited; the USVs must not collide with the environmental boundary during capture, and hunting cannot be concluded at the boundary. Target capture in a bounded environment is therefore more challenging and complicated, and a bounded marine environment is considered in this paper. Accordingly, the target capture environment is a bounded area of size

Figure 2. Target capture model.
Based on the above rules, the condition for successful capture is presented as follows:
where,
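One plausible form of this condition (an illustrative sketch; the outer radius $r_w$ and tolerance $\epsilon$ are assumed symbols, while the inner radius $r_q$ is defined later) is

$$r_q \le \lVert p_n - p_t \rVert \le r_w, \qquad \Big|\theta_{n+1} - \theta_n - \frac{2\pi}{N_p}\Big| \le \epsilon, \qquad n = 1, \dots, N_p,$$

where $p_n$ and $p_t$ are the positions of USV $n$ and the target, $\theta_n$ is the bearing of USV $n$ about the target, and $N_p$ is the number of hunters: every hunter lies within the encirclement loop, and adjacent hunters are approximately evenly spaced around the target.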
Different from search and tracking problems with completely unknown target information, the capture problem requires that the hunters can obtain the position and motion information of the evaders in advance, so as to make fast maneuvering decisions. Research on the capture problem focuses on the effective and rapid formation of the capture. 34 For successful capture and safe operation of the USVs, the following assumptions are made:
Methodology
RL is formulated as a Markov Decision Process (MDP). 35 It consists of a five-tuple: state S, action A, discount factor γ, reward function R and state transition function P. In this paper, the MARL strategy is adopted, and the PER-MADDPG-LADRC algorithm is designed to solve the capture problem.
Coordination control structure design
MADDPG 36 is a centralized-training, distributed-execution algorithm. It combines the advantages of the Deep Q-Network (DQN) and Actor-Critic methods, and is suitable for multi-agent competitive and cooperative learning in complex environments.
The MADDPG algorithm is an off-policy algorithm designed for continuous action spaces. By constructing a deterministic policy, gradient ascent is used to maximize the Q value, yielding the desired training model. Since a deterministic policy explores the environment only to a limited extent, random noise is added to the action produced by the behavior policy to expand the scope of exploration. In MADDPG, each agent is trained by Actor-Critic, where the Actor only observes its own information while the Critic accesses global information. To avoid overestimating the action value, both the Actor and the Critic consist of a training network and a target network, and the target network is slowly driven toward the training network by soft update. For $N_p$ agents, the policy parameter set of the training Actor networks is defined as
where,
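Following the standard MADDPG formulation, the deterministic policy gradient for agent $n$ with centralized critic $Q_n$ takes the form

$$\nabla_{\theta_n} J(\mu_n) = \mathbb{E}\Big[\nabla_{\theta_n}\mu_n(o_n)\,\nabla_{a_n} Q_n^{\mu}(x, a_1, \dots, a_{N_p})\big|_{a_n = \mu_n(o_n)}\Big],$$

and the target networks are soft-updated as $\theta'_n \leftarrow \kappa\,\theta_n + (1-\kappa)\,\theta'_n$ with a small $\kappa$ (the symbol $\kappa$ is used here only to avoid clashing with the control input notation).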
The traditional MADDPG algorithm selects data from the experience replay pool by uniform random sampling; however, this leads to low utilization of valuable data in the early stage of training and is not beneficial for network training. Therefore, following the previous method, 38 prioritized experience replay (PER) is combined to improve the algorithm. The data in the experience replay pool are sorted in descending order of the absolute value of the temporal-difference error, and the sampling probability is allocated to each transition based on this order, described as follows:
where $o_i$ is the rank of transition $i$ in the experience replay pool, $K$ is the total amount of data in the pool, and $P(i)$ is the sampling probability of transition $i$. The importance sampling method is used in the proposed PER-MADDPG so that the sampled data have a consistent effect on the gradient. The importance sampling weight is set as follows:
where β is a hyperparameter. With PER, the probability that important data are selected for network training increases, and the convergence of the reward function in the early training stage is accelerated.
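As a minimal sketch of this rank-based sampling (the function name, the exponent values and the rank-to-priority mapping are illustrative assumptions, not the paper's exact parameterization):

```python
import numpy as np

def rank_based_sample(td_errors, batch_size, alpha=0.7, beta=0.5, rng=None):
    """Rank-based PER sketch: returns sampled indices and IS weights."""
    rng = rng or np.random.default_rng()
    err = np.abs(np.asarray(td_errors, dtype=float))
    K = len(err)
    order = np.empty(K, dtype=int)
    order[np.argsort(-err)] = np.arange(1, K + 1)   # rank 1 = largest |TD error|
    priorities = (1.0 / order) ** alpha             # P(i) grows with priority rank
    probs = priorities / priorities.sum()
    idx = rng.choice(K, size=batch_size, p=probs)
    weights = (K * probs[idx]) ** (-beta)           # importance-sampling correction
    return idx, weights / weights.max()             # normalized for gradient scale
```

In training, the returned weights would multiply the per-sample temporal-difference loss so that the biased sampling does not distort the expected gradient.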
A general MARL algorithm relies only on the kinematic model: the interaction between agent and environment is based on ideal positions, and neither wind and wave disturbances nor actual control deviations are involved. To apply the MARL algorithm in a practical environment, the real dynamic response of the closed-loop control system is incorporated into the MARL design in this paper.
Combining the PER-MADDPG algorithm with the LADRC method, the actual USV motion controller is introduced into the interaction between agent and environment, and a coordination control system consisting of path planning and tracking is designed. The output of the MADDPG action space is taken as the input of LADRC, and the actual position produced by the closed-loop control is used as part of the input of the MADDPG state space. Correspondingly, online path planning and tracking control for capture are realized. The possible wind and wave disturbances are included in the control-oriented model. That is, the action information output by the network is
where,
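A minimal toy of this two-time-scale interaction (the planner, the tracking loops and the inner/outer step ratio are all illustrative stand-ins, not the paper's networks or controllers):

```python
import numpy as np

rng = np.random.default_rng(0)
psi, u = 0.0, 0.0                       # current yaw angle and surge speed
pos = np.zeros(2)                       # real position fed back to the planner
planner = lambda p: (rng.uniform(-np.pi, np.pi), rng.uniform(0.0, 1.0))

for k in range(5):                      # slow time scale: PER-MADDPG planning step
    psi_ref, u_ref = planner(pos)       # action = references for tracking control
    for _ in range(20):                 # fast time scale: LADRC tracking steps
        psi += 0.1 * (psi_ref - psi)    # stand-in for the yaw-angle loop
        u += 0.1 * (u_ref - u)          # stand-in for the surge-speed loop
        pos += 0.05 * u * np.array([np.cos(psi), np.sin(psi)])  # kinematics
```

The key design choice is that the planner observes the position produced by the fast control loop, so planning and control form one cascade rather than two separate stages.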

Figure 3. Overall coordination control structure diagram.
In Figure 3, the environment state, including obstacles and target, is fed to all networks as public information.
State space and action space design
To ensure that the USVs capture the target rapidly and stably, and to accelerate the training process, the neighbor relationship of each USV within the encirclement loop is determined before each training round according to the relative distances between USVs. The neighbors with the smaller relative distances are selected as the adjacent USVs, and the relationship is kept unchanged during the current round. That is, for USV n, its neighbors are determined in two steps:
(i)
(ii) The nearest USVs $p_{n-1}$ and $p_{n+1}$ are selected as the neighbors from the $N_{re}$ USVs by the following equation.
where d represents the Euclidean distance. Even when a great number of USVs capture the target, the above rules remain effective for determining the neighbors.
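A sketch of this neighbor selection (the function and variable names are hypothetical):

```python
import numpy as np

def nearest_neighbors(positions, n):
    """Pick the two USVs closest to USV n, by Euclidean distance;
    the result is fixed for the current training round."""
    p = np.asarray(positions, dtype=float)
    d = np.linalg.norm(p - p[n], axis=1)
    d[n] = np.inf                      # exclude USV n itself
    return np.argsort(d)[:2]           # indices of the two nearest USVs

# e.g. nearest_neighbors([(0, 0), (1, 0), (5, 5), (0.5, 2)], 0) -> [1, 3]
```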
For the nth hunting USV, its state space is written as
For the nth hunting USV, its action space is
where
Reward function design
In PER-MADDPG, the reward function drives the capture performance to converge to the global optimum. The goal of this paper is to achieve stable and effective capture of static and dynamic targets in a bounded water environment while avoiding collisions safely. A composite reward function is therefore formulated. The capture reward function encodes the criterion of successful capture. To meet the security and motion constraints, the collision avoidance and obstacle avoidance reward function, the environmental boundary collision restriction reward function, and the USV motion constraint reward function are proposed. To ensure that the capture process is safe and the hunters cannot be detected by the target, the capture inner boundary constraint reward function is constructed; it guarantees that the hunting USVs approach and encircle the target from outside the inner boundary of the ideal encirclement rather than crossing through the encirclement. The angle constraint reward ensures smooth changes of the yaw angle. Furthermore, the tracking error reward function is designed to reduce the actual path tracking error.
(1) Capture reward function f1: This reward is developed to optimize the capture performance. It consists of a relative distance reward $f_d$ and a relative angle reward.
(2) Collision avoidance and obstacle avoidance reward function f2: To guarantee the safety of the USVs, collision avoidance between USVs and obstacle avoidance between USVs and obstacles in the water environment must both be considered. In addition, to further enhance safety, a safe obstacle avoidance distance is defined around the obstacles. The reward consists of an environmental obstacle avoidance reward
where $r_o$ is the obstacle size, $r_{usv}$ is the size of the hunting USV, and $r_s$ is the safe obstacle avoidance distance.
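One plausible piecewise shaping consistent with these distances (an illustrative assumption; $C_o$ and $C_s$ are hypothetical penalty weights) is

$$f_{2}^{o} = \begin{cases} -C_o, & d_{no} < r_o + r_{usv}, \\ -C_s\,\dfrac{r_o + r_{usv} + r_s - d_{no}}{r_s}, & r_o + r_{usv} \le d_{no} < r_o + r_{usv} + r_s, \\ 0, & \text{otherwise}, \end{cases}$$

where $d_{no}$ is the distance between USV $n$ and the obstacle center: a hard penalty for contact and a graded penalty inside the safe band.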
(3) Environmental boundary collision restriction reward function f3: Since the target is captured in bounded waters, a boundary collision penalty is essential to keep each USV inside the given water area. When a USV travels beyond the boundary, the penalty value is given as:
where $C_1$ is a positive parameter denoting the collision penalty. If an unbounded environment is considered, the environmental boundary collision restriction reward function is removed from the composite reward function of the USV.
(4) Capture inner boundary constraint reward function f4: To ensure that the hunting USVs approach and encircle the target from outside the inner boundary of the ideal encirclement rather than crossing through it, a penalty is applied when the distance between a USV and the target is less than the radius $r_q$ of the inner loop:
where $C_2$ is a positive parameter denoting the penalty for crossing the inner boundary.
(5) USV angle constraint reward function f5: To ensure that the yaw angle changes slowly and the trajectory is smooth, the angle variation is used as the angle constraint reward function:
where
(6) USV motion constraint reward function f6: For a static target, the USVs' speed should be as small as possible after the hunting formation is formed; that is, the USVs should hover near the target and maintain the encirclement. The motion constraint reward is proposed as:
where $u$ is the current surge speed of the USV, and $u_{max}$ is the maximum surge speed.
where T is the environment period.
For a dynamic target, in order to achieve stable and effective tracking of the target, the following motion constraint reward is proposed:
where
(7) Tracking error reward function f7: To reduce the path tracking control error, in the current control period the yaw angle and surge speed generated by the proposed PER-MADDPG method are used as the expected inputs of the tracking control and remain unchanged; within a planning period, LADRC acts as a step-response controller. The following reward function measures the difference between the actual position obtained by the controller and the expected position:
where
By weighting all the above sub-functions, the composite reward function $f_n$ for USV n is derived, and it is maximized to achieve the best capture performance.
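As a compact sketch of this weighted composition (the weights are hypothetical tuning parameters, not values from the paper):

```python
def composite_reward(f, w=(1.0, 1.0, 1.0, 1.0, 0.5, 0.5, 0.5)):
    """f: the seven sub-rewards (f1..f7) for one USV; w: their weights."""
    return sum(wi * fi for wi, fi in zip(w, f))
```

Maximizing this sum trades off capture speed against safety, smoothness and tracking accuracy according to the chosen weights.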
LADRC tracking controller design
Based on the output of the action space and the dynamic model of the USV, a first-order surge speed controller and a second-order yaw angle controller are designed using LADRC.
According to the dynamic model (2), the pseudo-linear surge speed model is formulated, and the surge speed controller is developed in the first-order LADRC formulation. 39 The linear extended state observer (LESO) and the surge speed LADRC law are designed as:
where
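In the standard first-order LADRC form (bandwidth parameterization, stated here as an assumption since the paper's exact gains are not reproduced), the surge dynamics are viewed as $\dot{u} = f_u + b_0\tau_u$ with total disturbance $f_u$, giving

$$\begin{aligned}\dot{z}_1 &= z_2 + \beta_1 (u - z_1) + b_0 \tau_u,\\ \dot{z}_2 &= \beta_2 (u - z_1),\end{aligned} \qquad \tau_u = \frac{k_p\,(u_{ref} - z_1) - z_2}{b_0},$$

where $z_1$ and $z_2$ estimate the surge speed and the total disturbance, and the gains are commonly set as $\beta_1 = 2\omega_o$, $\beta_2 = \omega_o^2$ and $k_p = \omega_c$ with observer and controller bandwidths $\omega_o$ and $\omega_c$.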
Similarly, the pseudo-linear yaw angle model is formulated, and the second-order LADRC yaw angle controller is designed accordingly. The LESO and the yaw angle LADRC law are developed as:
where
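Similarly, in the standard second-order LADRC form (again an assumption on the exact parameterization), the yaw dynamics are viewed as $\ddot{\psi} = f_\psi + b_0\tau_r$, giving

$$\begin{aligned}\dot{z}_1 &= z_2 + \beta_1(\psi - z_1),\\ \dot{z}_2 &= z_3 + \beta_2(\psi - z_1) + b_0\tau_r,\\ \dot{z}_3 &= \beta_3(\psi - z_1),\end{aligned} \qquad \tau_r = \frac{k_p\,(\psi_{ref} - z_1) - k_d z_2 - z_3}{b_0},$$

with $z_3$ estimating the total disturbance and typical gains $\beta_1 = 3\omega_o$, $\beta_2 = 3\omega_o^2$, $\beta_3 = \omega_o^3$, $k_p = \omega_c^2$ and $k_d = 2\omega_c$.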
In summary, the PER-MADDPG-LADRC algorithm flow is shown in Algorithm 1.
Computational complexity
When the capture problem is large-scale, computational complexity becomes a concern. The computational complexity of MARL is affected by the number of agents, the training of the neural networks, the complexity of environment interaction, the complexity of experience collection, the episode number, the episode length and the batch size. Ignoring constant factors, the computational complexity of the proposed PER-MADDPG-LADRC method can be expressed as
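One plausible accounting (an illustrative assumption, not the paper's exact expression): with episode number $E$, episode length $T$, $N_p$ agents, batch size $B$, per-sample network cost $C_{net}$ and replay pool size $K$, the training cost scales roughly as $O\big(E\,T\,(N_p B\,C_{net} + B\log K)\big)$, where the $\log K$ term stems from maintaining the PER priority ordering and the per-step LADRC update adds only a constant cost.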
Simulation and results
In this paper, simulations are carried out to validate the effectiveness of the proposed method. Static and dynamic targets are addressed respectively in the capture tests. To increase the complexity of the environment, both static and dynamic obstacles are included.
Simulation environment and parameter design
The simulation platform is an i7-12700H CPU, an RTX 3070 Ti GPU and 16 GB RAM, with the MADDPG framework implemented in PyTorch. It is assumed that the marine environment contains one target, three static obstacles, two dynamic obstacles and three hunting USVs. The static obstacles are large, while the dynamic obstacles are small. The initial positions and motion states of the obstacles are known, and the initial positions of the hunting USVs and the target are randomly distributed. Table 1 collects the environment, obstacle and target parameters.
Table 1. Environment, obstacles and target parameters.
At the beginning, the USVs are stationary. Wind and wave disturbances exist in the marine environment, described as follows 40 :
The motion control of the USV is mainly disturbed by wind and waves. These disturbances have similar distributions when acting on the USV, so they can be uniformly expressed as a resultant disturbance force whose magnitude and direction are denoted $s_e$ and δ, respectively.
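A minimal representation consistent with this description (the paper's full time-varying model is not reproduced here) is $\tau_d = s_e\,[\cos\delta,\ \sin\delta]^T$ in the horizontal plane, with $s_e$ and $\delta$ possibly time-varying.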
Table 2. Control and PER-MADDPG parameters.
Case 1: Static target capture
Different numbers of USVs capture the target
To demonstrate the performance with different numbers of hunting USVs, simulations with three and five USVs are implemented respectively. The test step number is 80. Figure 4(a) to (d) show the capture performance of three USVs at intervals of 20 steps. The black circular areas are obstacles, and the yellow curves are the trajectories of the dynamic obstacles. The red dashed circles represent the safe obstacle avoidance range of the obstacles. The red, blue and green curves are the capture trajectories of the three USVs; '*' marks the initial point of each USV, '△' marks the end point, and the purple circle is the target. The two dotted circles represent the inner and outer boundaries of the ideal encirclement loop. It can be seen that the USVs achieve an effective pursuit of the target within the given period. In Figure 4, the hunting trajectories of the USVs appear to pass through the dynamic obstacles; in fact no collision occurs, since the dynamic obstacles and the USVs pass through the same positions at different times. Effective obstacle avoidance is thus achieved. In addition, the USVs do not enter the inner boundary of the ideal encirclement loop during capture, and no target escape behavior occurs.

Figure 4. Capture performance of three USVs in Case 1.
Figure 5 shows the control results of the three USVs in Case 1: (a) presents the desired surge speed of the three USVs and the real speed controlled by LADRC, and (b) gives the desired yaw angle and the real angle steered by LADRC. It can be seen that LADRC achieves accurate path tracking. Although the yaw angle tracking shows a slight deviation when the USVs approach the expected capture positions, the overall real-time path planning and tracking coordination control is effective. The speed curves show that when the USVs reach the desired positions, the surge speed of most USVs gradually approaches zero with only small-amplitude fluctuations. Throughout the capture process, the yaw angle changes relatively smoothly.

Figure 5. Control result of three USVs in Case 1.
Figure 6 presents the capture performance of five USVs at intervals of 20 steps. The depiction of obstacles is consistent with the three-USV case; the red, blue, green, cyan and magenta curves are the capture trajectories of the five USVs. The USVs achieve an effective pursuit of the target, and effective obstacle avoidance is realized. In addition, to avoid entering the inner boundary of the ideal encirclement loop, USV1 circumnavigates within the encirclement loop to achieve capture, and its encirclement path is reasonable.

Figure 6. Capture performance of five USVs in Case 1.
Figure 7 shows the control results of the five USVs in Case 1: (a) shows the desired surge speed of the five USVs and their real controlled speed, and (b) gives the desired yaw angle and the real controlled angle. LADRC again achieves accurate path tracking, and the yaw angle changes relatively smoothly.

Figure 7. Control result of five USVs in Case 1.
Target capture under strong wind and wave disturbances
To demonstrate the impact of strong wind and wave disturbances on the motion control of the USV, the time-varying nonlinear disturbance force is increased as follows:
In this situation, Figure 8 shows the capture and tracking trajectories of three USVs capturing the static target, and Figure 9 displays the control results. The simulation shows that although the yaw angle tracking error grows with the increased disturbances, the USVs still achieve effective target capture under strong wind and wave disturbances.

Figure 8. Capture performance of three USVs under strong disturbances in Case 1.

Figure 9. Control result of three USVs under strong disturbances in Case 1.
Comparison simulation
For capturing the static target, comparison simulations with the original MADDPG, 41 IPPO 42 and REINFORCE 43 are implemented. All of them adopt the same state space and action space as the proposed method, but these comparison algorithms do not address tracking control performance in the training process, and the tracking error is not considered in their reward functions. The original MADDPG structure is similar to the proposed algorithm but lacks the PER link and LADRC. IPPO is a fully decentralized multi-agent framework, so it serves as a contrast to the centralized-training, distributed-execution framework used in this paper. REINFORCE is a single-agent RL algorithm; it is used in a centralized structure, concatenating the state spaces of all agents into a single input vector.
Figure 10(a) compares the reward functions of the four algorithms: the red curve denotes the proposed algorithm, the green curve the original MADDPG, the blue curve IPPO, and the yellow curve REINFORCE. Figure 10(b) shows the capture success rate of the four algorithms, tested after every 100 training epochs. The training results show that the multi-agent algorithms outperform the single-agent algorithm: the converged reward of the single-agent algorithm has a large gap from the others, and its reward value fluctuates considerably after stabilizing. The proposed PER-MADDPG-LADRC method shows more stable convergence and a better converged value than MADDPG and IPPO. Because the proposed algorithm accounts for tracking control performance in the presence of wind and wave disturbances during training, involving the LADRC control in the training model yields a higher capture success rate. The comparison algorithms do not address these factors during training, so their success rates are relatively low during testing. Moreover, owing to the limitation of the single-agent formulation, REINFORCE can only approach a capture tendency with a certain probability but cannot actually complete a capture; hence its success rate in Figure 10(b) represents only the probability of such a tendency rather than a standard capture.

Figure 10. Reward function and success rate of capture in Case 1.
Figure 11 shows the capture performance of the original MADDPG in Case 1, and Figure 12 gives its control results. Figure 13 presents the capture performance of IPPO in Case 1, and Figure 14 its control results. With the comparison algorithms, although the USVs show a tendency to complete the capture task, their distribution on the encirclement loop is not uniform and their surge speeds remain far from zero. Although no collision with the obstacles occurs, the USVs enter the safe obstacle avoidance range, so a collision risk exists. In addition, since IPPO adopts a completely decentralized framework, its cooperation is slightly worse than that of MADDPG. REINFORCE cannot realize the capture and produces large errors, so its simulation results are not presented.

Figure 11. Capture performance of MADDPG in Case 1.

Figure 12. Control result of MADDPG in Case 1.

Figure 13. Capture performance of IPPO in Case 1.

Figure 14. Control result of IPPO in Case 1.
Case 2: Dynamic target capture
Different numbers of USVs capture the target
Different numbers of USVs are also considered for dynamic target capture; three and five hunting USVs are simulated respectively, with a test step number of 100. Figure 15 shows the simulation results of three USVs, where (a) to (e) present the capture at different times. The depiction of the obstacles and the three hunting USVs is consistent with the static target capture, and the black curve is the target trajectory. The encirclement of the USVs begins to form at step 80; from step 80 to step 100, the USVs keep tracking the target and achieve a uniform distribution on the loop, while avoiding the obstacles effectively.

Figure 15. Capture performance of three USVs in Case 2.
Figure 16 gives the control results of the three USVs in Case 2: (a) shows the desired surge speed of the three USVs and their actual speed controlled by LADRC, and (b) gives the desired yaw angle and its accurate tracking. LADRC again achieves stable tracking control; both angle and speed control are effective throughout.

Figure 16. Control result of three USVs in Case 2.
Figure 17 shows the simulation results of five USVs, where (a) to (e) display the capture at different times. The depiction of the obstacles and the five hunting USVs is the same as in the static target capture. The capture is achieved successfully, and the USVs avoid the obstacles effectively as well.

Figure 17. Capture performance of five USVs in Case 2.
Figure 18 displays the control results of the five USVs in Case 2: (a) shows the desired surge speed of the five USVs and their actual controlled speed, and (b) gives the desired yaw angle and its accurate tracking. LADRC again achieves stable tracking control.

Figure 18. Control result of five USVs in Case 2.
To show that the proposed method can also be applied to target capture in an unbounded environment, the test step number is extended to 200; Figure 19 shows the capture trajectories at intervals of 40 steps. The desired capture result can still be achieved.

Figure 19. Capture performance of three USVs in unbounded environment in Case 2.
Target at another speed
According to the proposed capture assumptions and the USV motion constraint reward sub-function, capture and tracking can be realized stably when the speed of the dynamic target is less than the surge speed of the USVs. To illustrate the capture feasibility when the target moves at a higher speed, a new simulation is carried out for a target speed of

Figure 20. Capture performance of three USVs for another speed target in Case 2.

Figure 21. Control result of three USVs for another speed target in Case 2.
Target with varying motion state
To verify that the proposed method can adapt to a target with varying motion state, a scenario is considered in which the target runs at a time-varying speed. The initial speed is set as
where $t_s$ is the current step. When the step number is greater than 40, the motion angle of the target is adjusted from
where
Figure 22 shows the capture and tracking trajectories, and Figure 23 displays the control results. The simulation shows that the target with varying motion state can still be captured effectively while the dynamic obstacles move at varying speeds.

Figure 22. Capture performance of target with varying motion state in Case 2.

Figure 23. Control result of capturing target with varying motion state in Case 2.
Target capture under strong wind and wave disturbances
The feasibility of capturing a dynamic target under strong wind and wave disturbances is also verified. Figure 24 shows the capture and tracking trajectories of three USVs, and Figure 25 displays the control results. Compared with the static target case, the increased disturbances make the hunting trajectories for the dynamic target more tortuous and increase the yaw angle control error; however, the capture task is still accomplished.

Figure 24. Capture performance of three USVs under strong disturbances in Case 2.

Figure 25. Control result of three USVs under strong disturbances in Case 2.
Comparison simulation
For capturing the dynamic target, similar comparison simulations with MADDPG, IPPO and REINFORCE are implemented. Figure 26(a) compares the reward functions of the four algorithms: the red curve is the proposed algorithm, the green curve the original MADDPG, the blue curve IPPO, and the yellow curve REINFORCE. Figure 26(b) presents the success rate curves. The results show that the proposed PER-MADDPG-LADRC method has better convergence performance and stronger practicability, and its advantage is more evident for the dynamic target than for the static one. The proposed algorithm achieves stable convergence of the reward function after the 300th training episode, whereas the original MADDPG only converges after the 400th. Although IPPO converges faster, it easily falls into a local optimum and cannot reach the maximum reward value, which also degrades its capture performance. The success rate curves likewise show that the proposed algorithm captures better.

Figure 26. Reward function and success rate of capture in Case 2.
Figure 27 shows the capture performance of the original MADDPG in Case 2, and Figure 28 its control results. Figures 29 and 30 display the capture performance and control results of IPPO in Case 2. With the comparison algorithms, although the USVs achieve a stable encirclement of the target, they fail to distribute evenly on the encirclement loop, and the target may escape. REINFORCE can hardly complete the capture, so its tracking curves are not given.

Figure 27. Capture performance of MADDPG in Case 2.

Figure 28. Control result of MADDPG in Case 2.

Figure 29. Capture performance of IPPO in Case 2.

Figure 30. Control result of IPPO in Case 2.
Performance analysis
To verify the feasibility of the proposed algorithm, different numbers of USVs are used for static and dynamic target capture. To meet the collision avoidance constraint, the following condition is essential:
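One plausible form of such a condition (an illustrative assumption): the $N_p$ USVs, evenly spaced on the capture loop of radius $r$, must keep at least the collision-avoidance separation, i.e. $2r\sin(\pi/N_p) \ge 2(r_{usv} + r_s)$, which bounds $N_p$ from above for a given loop radius and USV size.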
Therefore, at most 7 USVs can capture one target simultaneously when the environment, USV and target conditions and parameters are the same as in Case 1 and Case 2. The number of USVs is set to 3, 5 and 7 respectively, and training is performed for each case. Figure 31(a) shows the reward function training curves for different numbers of USVs capturing the static target: the red curve is 3 USVs, the green curve 5 USVs, and the blue curve 7 USVs. Figure 31(b) shows the corresponding curves for dynamic target capture. All configurations achieve stable convergence of the reward function, although the time to reach stable convergence increases with the number of USVs. The computational complexity remains tolerable in these cases, so the real-time feasibility of the algorithm is guaranteed; for even larger numbers of USVs, computational complexity and capture performance can be balanced by adjusting the network structure and parameters. In addition, for dynamic target capture the motion states of the USVs are more complex, and the whole process divides into capturing and then tracking the target. Hence, as the number of USVs increases, training the reward function becomes noticeably harder than for the static target, yet it still converges stably in the end.

Figure 31. Reward function of different numbers of USVs.
In addition, Table 3 reports the relative simulation running time for different numbers of USVs under the same training episode number and episode length. Although the training time increases with the number of agents, the computational complexity remains tolerable.
Table 3. The simulation relative running time of different numbers of USVs.
In summary, the proposed algorithm applies not only to the setting assumed above but also to more complex water environments, such as a randomly moving target; this can be handled by modifying the corresponding network inputs, since the real-time motion state of the target forms part of the network state space. Additionally, this paper assumes that the position and motion of the target are known to the USVs. Although this assumption is relatively ideal, it can be realized in practice thanks to the development of sensors and improved environment detection capabilities. Furthermore, this paper studies the capture problem in bounded waters, whose constraints are more complex than those of an unbounded marine environment. Finally, this study incorporates the wind and wave disturbances and the control tracking error into the network training; this preliminary combination of methodology and practice is feasible. To extend the method to more practical marine conditions, real wind and wave disturbance data can replace the ideal data when training the network.
Conclusions
In existing research, MARL algorithms are mostly built on the kinematic model, which limits their practical potential for real environments and real vehicles. In this paper, the PER-MADDPG-LADRC method is proposed. The algorithm introduces the prioritized experience replay mechanism into MADDPG, improving the training convergence speed and the training results. It integrates LADRC path tracking into the interaction between agent and environment, with wind and wave disturbances considered in the actual control. For a successful capture, the USVs must lie on the encirclement loop without being detected by the target, so that the target cannot escape. The reward function design accounts not only for the capture reward, collision avoidance and obstacle avoidance reward, boundary collision restriction reward, capture inner boundary constraint reward, angle constraint reward and motion constraint reward, but also for the LADRC tracking error reward, bringing the trained network closer to reality.
The results show that the algorithm achieves capture of static and dynamic targets by multiple USVs in the presence of various types of obstacles, and that the combination of MADDPG and LADRC realizes real-time coordinated control of path planning and tracking of the USVs. In future research, the framework is expected to be applied to target capture in more complex environments, such as those with irregular or non-convex obstacles, or unknown and time-varying environments. The algorithm can also be combined with task allocation to solve multi-target capture by USVs.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by Natural Science Foundation of Tianjin (grant number 23JCZDJC01140).
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data availability statement
The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.
