Abstract
Unmanned surface vehicles (USVs) are intelligent platforms for unmanned surface navigation built on artificial intelligence, motion control, environmental awareness, and other technologies, and obstacle avoidance is an important part of their autonomous navigation. Because the USV works in water environments (e.g. monitoring and tracking, search and rescue scenarios), its dynamic and complex operating conditions make traditional methods unsuitable for solving its obstacle avoidance problem. In this paper, to address the poor convergence of the Twin Delayed Deep Deterministic policy gradient (TD3) algorithm of Deep Reinforcement Learning (DRL) in unstructured environments with wave and current interference, a random walk policy is proposed that deposits the experience of a pre-exploration policy into the experience pool to accelerate convergence and thus achieve USV obstacle avoidance: collision-free navigation from any start point to a given end point in a dynamic and complex environment, without offline trajectory or track point generation. We design a pre-exploration policy for the environment, build a virtual simulation environment for training and testing the algorithm, and give the reward function and training method. The simulation results show that the proposed algorithm converges more readily than the original algorithm and performs better in complex environments in terms of obstacle avoidance behavior, demonstrating the algorithm's feasibility and effectiveness.
Introduction
In recent years, USVs have been widely used in marine scientific research, marine search and rescue, marine energy exploration, and other fields. Because the task environment of a USV is very complex, containing not only static obstacles but also sea currents and other dynamic disturbances, obstacle avoidance has become a key factor affecting the autonomous navigation of USVs. It has become one of the leading research hotspots in the industry 1 and a goal of continuous exploration and optimization by scholars worldwide.
USV obstacle avoidance is the generation of a real-time collision-free path that satisfies the USV's dynamic constraints, based on environmental awareness and the vehicle's state information. During navigation, the information collected by the sensors reports obstacles, ships, and unexpected conditions near the hull in real time, so that the USV can reasonably deviate from its original route to avoid them while still completing the original task. 2 Common obstacle avoidance algorithms include the artificial potential field, 3 the particle swarm optimization algorithm, 4 and the bacterial foraging optimization algorithm. 5 Xie et al. 3 proposed an improved USV Artificial Potential Field (APF) algorithm to solve the local-optimum and unreachable-destination problems of the traditional artificial potential field algorithm. Xia et al. 4 combined the Velocity Obstacle (VO) method with Modified Quantum Particle Swarm Optimization (MQPSO) and proposed a USV local obstacle avoidance algorithm that can effectively plan obstacle avoidance paths. To address the tendency of the Bacterial Foraging Optimization (BFO) algorithm to get trapped in local optima during USV path planning, Yang et al. 5 proposed an optimization algorithm that combines Simulated Annealing (SA) with BFO; it not only successfully avoids static obstacles but also completes dynamic path planning efficiently. In practical tasks, however, the environment is often dynamic and uncertain: perception of environmental information is incomplete (the most common problem in practical applications), and wind and wave interference, sensor noise, and control errors make the above methods unsuitable for solving USV obstacle avoidance in complex environments.
In order to solve the problem of USV obstacle avoidance in a complex environment, learning-based obstacle avoidance methods have attracted attention in recent years. As an important area of machine learning, DRL has made considerable progress, providing strong support for obstacle avoidance of USVs by enabling them to handle high-dimensional state spaces and continuous action spaces. DRL has strong perceptual decision-making ability for tasks such as USV navigation, 6 control, 7 and obstacle avoidance. 8 Wang et al. 9 introduced a self-adaptive mechanism into the Extreme Learning Machine (ELM) to give the neural network faster learning and better generalization. Wang et al. 10 proposed an automatic architecture design method based on Monarch Butterfly Optimization (MBO) for Convolutional Neural Networks (CNNs), which can significantly reduce network design time and performance overhead. Cui et al. 11 converted malicious code into grayscale images and used a CNN to identify the transformed images, quickly and effectively detecting malicious code. However, several factors currently restrict DRL in obstacle avoidance of USVs: (1) because the USV training environment is subject to many interference factors from waves, the algorithm is usually difficult to converge; (2) the portability of DRL is poor, requiring retraining when the sensor or task changes; (3) there is a large gap between simulation and practical application environments, so training results are often good but transfer poorly; and (4) when an agent interacts with the real environment, erroneous behaviors can damage the agent, increasing training and time costs.
The main contributions of this article are summarized as follows:
(1) A new heuristic exploration policy is proposed to solve the slow convergence of TD3 when training the USV for obstacle avoidance. The agent explores the environment independently by taking actions with specified probabilities and stores the resulting experience in the experience pool, so that the algorithm has relatively positive samples at the beginning of training. This largely avoids the timid behavior the agent otherwise shows when positive samples are scarce early on, allowing it to adapt to the environment more quickly, accelerating convergence and reducing training time.
(2) To address the poor portability of the algorithm, the state space, reward function, and action space are designed. Generic distance sensor data are used as input to avoid poor robustness caused by changes in tasks or environments.
The rest of this article is organized as follows. Section 2 introduces related DRL work and the background of our algorithms. Section 3 describes the proposed algorithm and the implementation details of training, and elaborates on the state space, action space, and reward function. Section 4 presents the test environment, simulation system, and test results after training, and analyses the simulation results. Section 5 gives the conclusion and future work.
Related work
Deep reinforcement learning
Reinforcement Learning (RL) is a branch of machine learning. Unlike supervised and unsupervised learning, which learn from large amounts of labeled data or prior experience, RL guides the agent toward the expected behavior through reward values obtained by interacting with the environment, and evaluates the agent's behavior by the overall return. The learning framework for RL is the Markov Decision Process (MDP), as shown in Figure 1.

MDP.
The MDP is a sequence tuple

M = (S, A, P, R, γ)    (1)

In equation (1), S is the state space, A is the action space, P is the state transition probability, R is the reward function, and γ ∈ [0, 1] is the discount factor that balances immediate and future rewards.

The policy is a mapping from the state to the probability of selection for each action. If the agent selects the policy π, the value of a state s is the expected discounted return obtained by starting from s and following π thereafter:

v_π(s) = E_π[R_{t+1} + γ v_π(S_{t+1}) | S_t = s]    (2)

Equation (2) expresses the relationship between a state value and a subsequent state value. Value function v_π is the unique solution of this Bellman equation under policy π.

Solving the RL problem means finding a policy that can obtain many rewards in the long-term process. Therefore, we can define an optimal policy π* as a policy whose expected return is greater than or equal to that of every other policy in all states.

The optimal state value function is also shared by the optimal policy, denoted as v*(s) = max_π v_π(s).
The classical algorithm of DRL, the Deep Q Network (DQN) 12 algorithm, uses a neural network to approximate the action value function Q(s, a). We can use a parameterized network Q(s, a; θ) to approximate the optimal action value function Q*(s, a) = max_π Q_π(s, a). A neural network can also be used to approximate the policy π directly, as in policy gradient methods. DQN estimates the optimal function by minimizing the temporal-difference error between Q(s, a; θ) and the bootstrapped target y = r + γ max_{a′} Q(s′, a′; θ⁻), where θ⁻ are the parameters of a periodically updated target network. The goal of the DQN algorithm is to learn a Q network from which, in every state, the greedy action arg max_a Q(s, a; θ) can be selected.
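To make the target computation above concrete, the following sketch computes the DQN bootstrap target with a toy array of next-state Q-values standing in for the network output; the numbers are purely illustrative.

```python
import numpy as np

def dqn_td_target(q_next, reward, gamma=0.99, done=False):
    """TD target y = r + gamma * max_a' Q(s', a'); bootstrapping is cut at terminal states."""
    return reward if done else reward + gamma * float(np.max(q_next))

# Toy Q-values over three discrete actions in the next state.
q_next = np.array([1.0, 3.0, 2.0])
y = dqn_td_target(q_next, reward=0.5, gamma=0.9)  # 0.5 + 0.9 * 3.0 = 3.2
```

The greedy action in the next state is the one with the maximum Q-value; at terminal states the target collapses to the immediate reward.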
To apply RL to continuous state space and continuous action space, Lillicrap et al. 19 proposed the Deep Deterministic Policy Gradient (DDPG) algorithm. Zhou et al. 21 proposed a DDPG algorithm based on key learning of failure regions to improve the obstacle avoidance rate of ships and reduce the error of simulated routes. Xu et al. 22 proposed a DDPG-based route planning algorithm to generate navigation paths under unknown interference. As a classical algorithm for continuous motion control, the DDPG algorithm is widely used in obstacle avoidance, path planning, and other problems. However, it suffers from uneven overestimation of Q values, which makes the learned policy unstable.
The advantages and disadvantages of the USV collision avoidance algorithm are shown in Table 1.
Advantages and disadvantages of USV collision avoidance algorithms.
In general, RL algorithms have the following advantages over traditional algorithms in the application of USV obstacle avoidance:
(1) Stronger adaptability: Traditional USV obstacle avoidance algorithms often require manual parameter settings, such as obstacle avoidance distance and speed, while RL algorithms can learn the optimal policy through interaction with the environment.
(2) Ability to handle nonlinear and high-dimensional data: USVs need to deal with complex nonlinear and high-dimensional data such as waves and wind direction during obstacle avoidance, while RL algorithms can handle these data and learn the optimal policy directly from raw data.
(3) Ability to handle partial observability: In some cases, USVs may not be able to obtain complete environmental information, such as incomplete information about waves and wind direction. RL algorithms can handle partial observability and estimate unobserved state information through state estimation to learn the optimal policy.
(4) Learning capability: RL algorithms have the ability to learn the optimal policy through interaction with the environment and can continuously improve the policy through continuous training.
Twin delayed deep deterministic policy gradient algorithm
Due to the problem of uneven overestimation of Q values in the DDPG algorithm, Fujimoto et al. proposed the TD3 algorithm, which mitigates overestimation with three techniques: clipped double-Q learning using two Critic networks, delayed policy updates, and target policy smoothing. The TD3 algorithm adopts the Actor-Critic framework, as shown in Figure 2.

Actor-Critic algorithm framework.
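As a concrete sketch of two of the TD3 ingredients just described — clipped double-Q targets and target policy smoothing — in plain NumPy; the noise scale, clip bound, and values are illustrative, not the paper's settings.

```python
import numpy as np

def td3_target(r, q1_next, q2_next, gamma=0.99, done=False):
    """Clipped double-Q target: bootstrap from the smaller of the two Critic estimates."""
    return r if done else r + gamma * min(q1_next, q2_next)

def smoothed_target_action(pi_next, sigma=0.2, noise_clip=0.5, a_max=1.0, rng=None):
    """Target policy smoothing: add clipped Gaussian noise to the target Actor's action."""
    rng = rng if rng is not None else np.random.default_rng(0)
    eps = np.clip(rng.normal(0.0, sigma, size=np.shape(pi_next)), -noise_clip, noise_clip)
    return np.clip(np.asarray(pi_next) + eps, -a_max, a_max)

y = td3_target(1.0, q1_next=2.0, q2_next=1.5, gamma=0.9)  # 1.0 + 0.9 * 1.5 = 2.35
a_tilde = smoothed_target_action(np.array([0.9, -0.9]))   # stays within [-1, 1]
```

Taking the minimum of the two Critics keeps the target pessimistic, which is what counters the overestimation bias of DDPG.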
Because the USV used in this paper is a two-degree-of-freedom catamaran model whose behavior is controlled through the left and right thruster thrusts, discretizing the two thrust commands prevents the algorithm from converging. Therefore, we describe the USV obstacle avoidance problem in a continuous state-action space. The network structure is shown in Figure 3.

The network structure of TD3 algorithm.
First, the data are sampled from the environment, normalized, and stored in the experience buffer. A batch of transitions (s, a, r, s′) is then sampled from the buffer for training.

For the Critic networks, the loss function is the mean square error between the predicted value and the TD target, that is

L(θ_i) = N⁻¹ Σ (y − Q_{θ_i}(s, a))², with y = r + γ min_{i=1,2} Q_{θ′_i}(s′, ã)

where ã is the smoothed target action and θ′_i are the target Critic parameters.

For the Actor network, since it adopts a deterministic policy, its loss gradient is

∇_φ J(φ) = N⁻¹ Σ ∇_a Q_{θ_1}(s, a)|_{a=π_φ(s)} ∇_φ π_φ(s)
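The two losses above can be sketched numerically. This is plain NumPy without autograd, so only the scalar loss values are computed; in practice the deep learning framework differentiates them.

```python
import numpy as np

def critic_mse_loss(q_pred, y):
    """Mean squared error between Critic predictions and TD targets over a batch."""
    q_pred, y = np.asarray(q_pred, float), np.asarray(y, float)
    return float(np.mean((y - q_pred) ** 2))

def actor_loss(q_of_pi):
    """Deterministic policy gradient ascends Q(s, pi(s)), i.e. minimizes -mean(Q)."""
    return float(-np.mean(q_of_pi))

c_loss = critic_mse_loss([1.0, 2.0], [2.0, 4.0])  # ((2-1)^2 + (4-2)^2) / 2 = 2.5
a_loss = actor_loss([1.0, 3.0])                   # -(1 + 3) / 2 = -2.0
```

Minimizing the negative mean Critic value with respect to the Actor's parameters is exactly the deterministic policy gradient written above.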
Random walk twin delayed deep deterministic policy gradient algorithm
USV state space in random walk TD3
In DRL, USV gets information from the environment and takes appropriate action. State space provides information about essential objects in the environment, such as obstacles and targets, and accurately represents the current state of the USV itself.
26
In our approach, the state space consists of the USV's state and part of the environmental information detected by the sensor. Some researchers use vision as a navigation and obstacle avoidance modality, mapping current and target observations to a state space; this may work well in an ideal setting, but it is not robust. Obstacles encountered when the USV is deployed in other environments may be quite different, and training becomes harder if too many types of obstacles must be considered. Therefore, the state space of the USV is designed around generic data shared across tasks. There are 24 dimensions of continuous data in the USV state space: 19-dimensional ranging data, 2-dimensional attitude data, 1-dimensional heading angle data, 1-dimensional velocity data, and 1-dimensional distance-to-end-point data. The 19-D ranging data come from 19 laser beams emitted from the USV bow, spaced at 10° intervals from left to right, as shown in Figure 4; each beam returns the distance to the nearest obstacle within the sensor's range.

The obstacle detection of USV.

The diagram of USV’s state.

The posture of USV.
Because different states have different units and scales, we must preprocess them before input into the network. In this paper, the state values collected by USV during navigation are processed by the normalization method to accelerate the convergence of network training.
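A minimal sketch of the per-dimension preprocessing, assuming min-max normalization to [0, 1] with known value ranges; the actual bounds used in the paper are not stated, so those below are placeholders.

```python
import numpy as np

def normalize_state(raw, lo, hi):
    """Min-max normalize each state dimension to [0, 1] given its value range."""
    raw, lo, hi = (np.asarray(x, float) for x in (raw, lo, hi))
    return np.clip((raw - lo) / (hi - lo), 0.0, 1.0)

# e.g. a ranging value of 5 units with an assumed 0-10 sensor range maps to 0.5;
# out-of-range readings are clipped to the boundary.
s = normalize_state([5.0, 12.0], lo=[0.0, 0.0], hi=[10.0, 10.0])
```

Applying the same scaling per dimension removes the unit and scale differences between ranging, attitude, heading, speed, and distance data before they enter the network.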
USV reward function in random walk TD3
The reward function is the most crucial attribute in DRL. It evaluates the USV's action in the context of the environment: a positive value is a reward, and a negative value is a punishment. Reasonable reward function design is a prerequisite for DRL to solve complex problems. When solving the obstacle avoidance problem of the USV, the situations encountered while navigating or performing tasks must be fully accounted for, so that the function remains general. The reward function in this paper is designed to guide the USV to the target area without collision.
The reward function consists of eight parts.
The settings of the reward function.
Therefore, the reward received by the USV at the current moment can be expressed as the sum of the eight components above.
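The composition of the scalar reward can be sketched as below. The component names and values here are illustrative assumptions for the sketch; the paper's eight exact terms are those listed in its reward-function table.

```python
def total_reward(components):
    """Sum the per-term rewards and penalties into the scalar reward r_t."""
    return sum(components.values())

# Hypothetical per-step components (not the paper's exact terms or magnitudes).
r_t = total_reward({
    "goal_progress": 1.2,        # closer to the end point than last step
    "collision": 0.0,            # large negative on collision, else 0
    "obstacle_proximity": -0.3,  # penalty inside the warning distance
    "heading_alignment": 0.1,    # bow pointing toward the goal
})
```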
Random walk policy
When training in a complex obstacle environment, the USV often exhibits timid behavior: surrounded by many obstacles, it is reluctant to move forward toward the end point. Our goal is to explore the environment purposefully before the agent is trained, to obtain relatively high-quality samples and thus speed up the algorithm's convergence in the early stage of training. This paper uses a random walk policy for environmental exploration before training begins. It adds a heuristic probability: the probability of the USV moving forward is greater than that of moving left, right, or backward.
If the current experience pool size is less than the set value (which is less than or equal to the maximum capacity of the experience pool), action exploration is performed by sampling from the forward-biased probability distribution given below.
The probability distribution.
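Sampling from the forward-biased distribution can be sketched as follows. The exact probabilities are assumptions for illustration; the heuristic only requires that the forward probability exceed those of left, right, and back.

```python
import numpy as np

def random_walk_action(rng, p=(0.55, 0.175, 0.175, 0.10)):
    """Sample one of {forward, left, right, back}; forward has the largest probability."""
    actions = ("forward", "left", "right", "back")
    return actions[rng.choice(len(actions), p=list(p))]

rng = np.random.default_rng(0)
counts = {a: 0 for a in ("forward", "left", "right", "back")}
for _ in range(2000):
    counts[random_walk_action(rng)] += 1
# 'forward' dominates the empirical counts, biasing exploration toward the goal.
```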
USV action space in random walk TD3
The dual-thruster USV can control its attitude and behavior by adjusting the rotation speed of the left and right motors behind the hull. Therefore, this paper takes the left and right motor thrusts of the USV as the executable action. The resulting action value is the policy network's output with exploration noise added, clipped to the thrust limits of the thrusters.
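The noisy, clipped action selection can be sketched as below; the initial noise scale sigma0 and the per-episode decay factor are assumed values for illustration, not the paper's hyperparameters.

```python
import numpy as np

def select_thrust(policy_out, episode, a_max=1.0, sigma0=0.3, decay=0.999, rng=None):
    """Add episode-decaying Gaussian exploration noise to the Actor's
    [left, right] thrust output and clip to the thruster limits."""
    rng = rng if rng is not None else np.random.default_rng(0)
    sigma = sigma0 * (decay ** episode)       # noise shrinks as training progresses
    noise = rng.normal(0.0, sigma, size=2)
    return np.clip(np.asarray(policy_out, float) + noise, -a_max, a_max)

a_early = select_thrust([0.95, -0.95], episode=0)    # noisy but clipped to [-1, 1]
a_late = select_thrust([0.5, 0.5], episode=100_000)  # noise has decayed to ~0
```

Decaying the noise matches the aggressive-early, conservative-late exploration described in the algorithm design below.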
Design of algorithm
The algorithm flow is shown in Algorithm 1. First, the parameters of all neural networks are initialized, and the experience pool is initialized at the same time. When training starts, the algorithm first executes the random walk policy, performing different actions according to the given probabilities and adding them to the actions output by the policy network. These exploration actions decay with the number of episodes, meaning that exploration of the surrounding environment is more aggressive in the early stage of the algorithm; the acquired experience is stored in the experience pool. When the experience pool reaches M/20 of its maximum capacity M, the random walk policy is stopped and the algorithm begins updating. In this way, the algorithm obtains higher quality samples from the experience pool at the beginning of training, accelerating convergence. Then, based on the current state, the policy network outputs an action with exploration noise, the USV executes it, and the resulting reward and next state are stored with the current state and action in the experience pool for subsequent mini-batch updates.
Algorithm 1. Random walk policy TD3.
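The control flow of Algorithm 1 can be sketched as follows. Here env, policy, and update stand in for the simulator interface, the Actor with its two exploration modes, and the TD3 network update; all three interfaces are assumptions of this sketch, not the paper's code.

```python
import random
from collections import deque

def train(env, policy, update, M=100_000, episodes=5000, batch=64, seed=0):
    """Skeleton of Algorithm 1: a forward-biased random walk fills the experience
    pool until it holds M/20 transitions; only then do network updates begin."""
    rng = random.Random(seed)
    buffer = deque(maxlen=M)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if len(buffer) < M // 20:
                a = policy.random_walk(s)   # pre-exploration phase
            else:
                a = policy.act(s)           # noisy deterministic policy
            s2, r, done = env.step(a)
            buffer.append((s, a, r, s2, done))  # normalized transition
            if len(buffer) >= M // 20:
                update(rng.sample(list(buffer), min(batch, len(buffer))))
            s = s2
    return buffer
```

Because the pre-exploration transitions are already in the pool when updates start, the first mini-batches contain relatively positive samples instead of only random flailing.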
Experiments and results
In this section, we give the simulation environment and training results. Firstly, we constructed a virtual simulation environment and introduced wave and water flow disturbances to the USV in the simulation environment. Secondly, we conducted feasibility experiments of the algorithm with USV in a simple environment with fewer and regular obstacles. Subsequently, we trained USV in an environment with more obstacles and irregular shapes and conducted a comparative experiment using the TD3 algorithm. Finally, we randomly initialized the starting point of the USV to verify the generalization of the algorithm. The experimental environment is Windows 10.1 + Pytorch1.7.1 + CUDA10.1. Hardware is an INTEL i9-11900 processor and NVIDIA RTX A4000 graphics card.
Experiment platform and settings
We adopt UE4.26 to construct the virtual simulation environment. This version adds a water system that allows us to easily define water environments such as oceans, rivers, and lakes. It can adjust the wavelength, amplitude, and other wave parameters to realize physical interaction between the USV and the water body, and thus simulate the interference of waves and currents, so that the USV is trained closer to the natural environment and can be better deployed on real equipment in the future. The visualization of the simulation system based on UE4 is shown in Figure 7.

The virtual simulation system.
The simulation system includes an environment construction module and an environment perception module:
The Environment Construction Module is used to model the water body and terrain of the virtual navigation environment and the obstacles encountered during the voyage. Our virtual USV thrusters are positioned in a static mesh to simulate the differential model of the dual-thruster USV, as shown in Figure 8. To simulate the interference of waves on the USV, two thrusters are set at the center of gravity of the USV to simulate the disturbance forces in roll and pitch, respectively. The attitude inference algorithm calculates the interference intensity of the USV in waves.

USV rear view.

Buoyancy block configuration.
The Environment Perception Module is used to sense the scene data of a virtual USV model when navigating in a virtual navigation environment. The virtual scene data includes USV ranging data, attitude data, heading angle data, speed data, and USV distance terminal data. At the same time, we use TCP communication to transfer the above data from the simulation system and analyze the data to form the data type our algorithm can recognize.
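Parsing one simulator message into the 24-D state can be sketched as below. The wire format assumed here (ASCII, comma-separated floats, newline-terminated) is an illustration only; just the 24-field layout follows the state-space design above.

```python
import numpy as np

def parse_state_message(payload: bytes):
    """Split one simulator message into the components of the 24-D state vector."""
    fields = payload.decode("ascii").strip().split(",")
    if len(fields) != 24:
        raise ValueError(f"expected 24 fields, got {len(fields)}")
    v = np.array([float(x) for x in fields])
    ranges, attitude = v[:19], v[19:21]           # 19 laser beams; roll and pitch
    heading, speed, dist_to_goal = v[21], v[22], v[23]
    return ranges, attitude, heading, speed, dist_to_goal

# A synthetic 24-field message standing in for one TCP payload.
msg = (",".join(str(float(i)) for i in range(24)) + "\n").encode("ascii")
ranges, attitude, heading, speed, dist_to_goal = parse_state_message(msg)
```

Validating the field count at the boundary catches truncated TCP reads before malformed data reaches the network input.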
Training and result
The algorithm’s feasibility was verified by training it in a relatively simple environment. The training environment was a rectangular space measuring 1750 × 1500 units and contained three equally sized cubic obstacles. The starting and ending points of the USV were located at the lower left corner and upper right corner of the environment, respectively. The training was conducted for 2500 iterations, and the hyperparameters used are shown in Table 4. At the beginning of each episode, the Actor networks and Critic networks parameters are initialized, and copied into their respective target networks. The initial point is generated at a random position. The termination conditions for each episode are: (1) the USV reaches the target area, (2) the USV collides with an obstacle, (3) the USV capsizes, or (4) the maximum number of training steps is reached. The USV’s decision execution cycle is 0.5 s, which means that the network parameters are updated and rewarded every 0.5 s after each action is executed.
The hyperparameters.
After the training, we carried out several fixed start point and fixed end point navigation tests to obtain the distance between the USV and the end point and the thrust values of the USV in this environment, as shown in Figures 10 and 11, respectively.

Obstacle avoidance trajectory in a simple environment.

Thrust values under a simple environment.
From the results, it can be observed that the algorithm enables the USV to produce obstacle avoidance actions, but there are backward movements in the trajectory. Such behavior can be dangerous in practical use, as it indicates that the USV did not fully consider the future navigation environment, leading to untimely avoidance. The thrust plots show that both the left and right thrusters experienced sudden negative thrust values. In practical use, abrupt reversal of a thruster's rotation direction shortens its service life. We believe this is due to insufficient training, which led the USV to select backward maneuvers for obstacle avoidance.
After verifying the feasibility of the algorithm, we constructed a more complex training environment and carried out 5000 rounds of training. The training environment was set to a square environment with a size of 2500 × 2500 units, containing 10 randomly sized and irregular static obstacles. The dimensions of the cuboid obstacles were randomly stretched, and the irregular obstacles included reefs and shrubs, which exhibited non-uniform shapes relative to the cuboid obstacles.
We train USV using TD3 and the proposed algorithm. During the testing phase, the USV fully uses our trained strategy to sail steadily from the starting point to the end point and avoid obstacles. During the training of 5000 episodes, the average reward curves of the two algorithms are shown in Figure 12.

The average reward curve.
It is evident that our algorithm converges earlier than the original TD3 algorithm and, during subsequent training, enables the USV to explore the optimal path. Because the obstacles are placed densely relative to the warning distance we set, new obstacles continuously enter the warning area during navigation while earlier obstacles leave it as the USV progresses, causing fluctuations in the average reward value.
After the training, we carried out many navigation tests with fixed starting point and fixed end point, and recorded the obstacle avoidance trajectory of the USV and the thrust value of the thruster under this environment, as shown in Figures 13 and 14 respectively.

The distance to end point.

Thrust values under a complex environment.
The smoothness of the curves in Figure 14 indicates that the USV avoids obstacles smoothly and reaches the target point without any backward avoidance maneuvers: the USV anticipates the future navigation environment and thus avoids obstacles in advance. The left and right thrusters do not produce sudden negative thrust values. However, the thrust is still not smooth enough, which may cause thruster problems in actual deployment. Making the thrust values generated by the algorithm smoother will be a direction for future optimization.
The generalization of the algorithm is tested with randomly generated start points. In the test environment, the water flow disturbance differs in each episode, as shown in Figure 15.

The obstacle avoidance at random start point.
It can be seen that our algorithm reaches the end point without collision from different start points, and the trajectories conform to the dynamics of the USV. The results show that the modified algorithm adapts well to obstacle avoidance in complex static environments.
Conclusions
This paper presents a method of obstacle avoidance based on DRL to enable USV to perform obstacle avoidance tasks in a complex multi-static obstacle environment. A new heuristic exploration policy is proposed to improve the DRL TD3 algorithm, which enables the agent to explore the environment independently in the early stage and get a large number of positive samples, and store the information in the experience pool so that the agent can adapt to the environment faster and reduce the training time. The method is then tested in a UE 4.26-based simulation environment. The results show that the algorithm can train the USV to reach the target area safely and quickly in a multi-obstacle environment.
For further research, the thrust values generated by the current algorithm are not smooth enough, which will be a major obstacle to the algorithm's physical deployment; making them smoother will be a future direction for optimization. Our algorithm has also not considered the avoidance of dynamic obstacles. We plan to construct a more complex and realistic simulation environment, add dynamic obstacles, and test the trained algorithm in a real environment to demonstrate its engineering application value.
Research Data
sj-rar-1-mac-10.1177_00202940231195937 – Supplemental material for Obstacle avoidance USV in multi-static obstacle environments based on a deep reinforcement learning approach
Supplemental material, sj-rar-1-mac-10.1177_00202940231195937 for Obstacle avoidance USV in multi-static obstacle environments based on a deep reinforcement learning approach by Dengyao Jiang, Mingzhe Yuan, Junfeng Xiong, Jinchao Xiao and Yong Duan in Measurement and Control
Footnotes
Authors’ Note
Mingzhe Yuan, Junfeng Xiong, and Jinchao Xiao are now affiliated with the Guangzhou Institute of Industrial Intelligence, Guangzhou, China and the Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang, China. Yong Duan is now affiliated with the School of Information Science and Engineering, Shenyang University of Technology, Shenyang, China.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Supported by the Key-Area Research and Development Program of Guangdong Province (2020B1111010002). Supported by the Guangdong Basic and Applied Basic Research Foundation (2020A1515010584).
Patents
The present invention discloses a method and system for obstacle avoidance of a USV. After obtaining the position, motion, attitude, and obstacle information of the USV, the collected information is normalized using the normalization formula and stored in an experience pool in the form of a four-tuple. The information stored in the experience pool is processed by the neural network algorithm, and the resulting decision-making action is input to the USV motors so that the USV executes the obstacle avoidance action. Compared with other algorithms, the method designs the state space, reward function, and action space, increases the exploration activity of the USV in the early stage, and makes the neural network algorithm converge faster during training. After training it has a wide range of application environments and is suitable for scenarios with wind-wave interference, multiple obstacles, and dynamic obstacles, with high robustness.
Supplemental material
Supplemental material for this article is available online.
References
