Abstract
To address the difficulty that existing reinforcement learning algorithms converge slowly, or fail to converge, in the very large state space of three-dimensional unmanned aerial vehicle (UAV) path planning, this article proposes a reinforcement learning algorithm that combines a heuristic function with an experience replay mechanism based on the maximum average reward value. Knowledge of track performance is introduced to construct a heuristic function that guides the UAV's action selection and reduces useless exploration. The experience replay mechanism based on the maximum average reward increases the utilization of excellent samples and the convergence speed of the algorithm. Simulation results show that the proposed three-dimensional path planning algorithm has good learning efficiency, and that its convergence speed and training performance are significantly improved.
Introduction
Unmanned aerial vehicles (UAVs) have attracted wide attention from scholars all over the world in the past decade. UAVs have the advantages of small size, low cost, convenient use, low requirements on the operating environment, high flexibility, and no risk of casualties. They are widely used in aerial photography, plant protection, express transportation, disaster rescue, surveying and mapping, power-line inspection, reconnaissance, and other fields. Path planning is one of the key technologies enabling UAVs to accomplish such tasks. Path planning refers to finding an optimal or feasible trajectory from a starting point to a target point in a given space under given constraints. UAV path planning is an NP-hard problem with multiple constraints, including fuel consumption, maneuvering limits, terrain obstacles, threat information, and so on. Many scholars have done extensive work on path planning.
Reinforcement learning is an important branch of machine learning. 5 Unlike supervised and unsupervised learning, reinforcement learning relies on an agent that continuously interacts with the environment; each interaction returns an evaluative feedback signal, and the agent learns the optimal behavior by maximizing its cumulative reward. In theory, reinforcement learning does not depend on an exact model of the environment: whatever exploration–exploitation strategy is adopted, it converges to the optimal value after a sufficiently long time. For practical problems, however, when the state space is large and many actions are available in each state, the state-action space becomes very large. To accelerate the convergence of reinforcement learning algorithms, researchers have proposed several effective methods. Hengst 6 used hierarchical reinforcement learning to decompose a large-scale reinforcement learning problem into several sub-problems, which reduces the state space of the problem; however, reasonable hierarchical decomposition is a challenging task and is difficult to achieve. Andrew et al. 7 introduced a shaping function into reinforcement learning and added a heuristic value to the agent's returns, which effectively improved the convergence speed. Asmuth et al. 8 used a potential field function as prior knowledge to guide the reinforcement learning process and proved the effectiveness of the algorithm.
Building on existing reinforcement learning algorithms for path planning, this article proposes a three-dimensional UAV path planning method based on a heuristic reward function and a maximum-average-reward experience replay mechanism. First, the state space of the UAV is discretized to reduce the scale of the path planning problem. Then, a heuristic reward function is constructed from the UAV's maneuverability, fuel consumption, terrain obstacles, flight altitude, and other factors to improve the convergence speed of the algorithm. The experience replay mechanism is improved so that the importance of samples is evaluated by the maximum average reward value at a small computational cost. Finally, the validity of the method is verified by a three-dimensional UAV path planning simulation experiment: the method achieves an effective approximation of the value function, has good learning efficiency and generalization performance, and its convergence speed and training performance are clearly improved.
Problem formulation
The state space of a UAV in a real environment is continuous, which greatly increases the difficulty of problem-solving. Therefore, the planning space is first discretized in three dimensions, so that the search space of the path planning problem is reduced to a discrete set of spatial nodes; each node corresponds to a candidate waypoint of the UAV.
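As an illustration of this discretization step, the sketch below maps a continuous position to a grid node, assuming the resolution used later in the simulation section (5 km horizontally and 5 m vertically over a 200 km × 100 km × 250 m area); the helper names are hypothetical and not part of the original method.

```python
import numpy as np

# Assumed mission area and resolution, taken from the simulation section:
# 200 km x 100 km x 250 m, discretized with 5 km horizontal and 5 m vertical
# steps, giving a 40 x 20 x 50 grid of candidate waypoints.
AREA = np.array([200_000.0, 100_000.0, 250.0])   # metres
STEP = np.array([5_000.0, 5_000.0, 5.0])          # metres
GRID = (AREA / STEP).astype(int)                   # (40, 20, 50)

def to_node(position):
    """Map a continuous (x, y, z) position in metres to a discrete grid node."""
    idx = np.floor(np.asarray(position) / STEP).astype(int)
    return tuple(np.clip(idx, 0, GRID - 1))

def to_position(node):
    """Return the centre of a grid node in continuous coordinates."""
    return (np.asarray(node) + 0.5) * STEP

# Example: the point (12.3 km, 48.0 km, 137 m) falls in node (2, 9, 27).
print(to_node((12_300.0, 48_000.0, 137.0)))
```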
The set of all flight paths, including the starting point and the target point, is represented by
The cost from state space node
where
Q-learning
Algorithm principle
The study of reinforcement learning is based on the theoretical framework of the Markov decision process (MDP), in which the state and reward at the next moment depend only on the current state and action. An MDP can be represented by a four-tuple consisting of the state space, the action space, the state transition probability, and the reward function.
The reinforcement learning model can be described as the UAV–environment interaction shown in Figure 1. At each step, the UAV selects an action according to its current state, and the environment returns a reward signal together with the next state.

Reinforcement learning model.
The goal of reinforcement learning is to learn a good policy that maximizes future cumulative rewards in sequential decision-making. The sum of discounted cumulative rewards is called the expected return
where
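For reference, the discounted return at time t is conventionally written as follows, with γ the discount factor:

```latex
% Conventional form of the discounted return.
G_t = r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \cdots
    = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}, \qquad 0 \le \gamma \le 1
```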
The state-action value function represents the expected return obtained by taking a given action in a given state and thereafter following the current policy.
Q-learning is a model-free, off-policy reinforcement learning control algorithm, 9 which has good convergence properties in discrete MDPs. Its state-action value function is updated according to
where
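For reference, the standard tabular Q-learning update, which the update described above presumably follows, is:

```latex
% Standard tabular Q-learning update: alpha is the learning rate and
% gamma the discount factor.
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]
```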
Training strategy
Because of the huge state space of three-dimensional UAV path planning, it is difficult to record all state-action values in tabular form. A common approach is to approximate the optimal action-value function with a neural network and adjust the network weights by minimizing the mean square error of the Bellman equation. However, the convergence of this method is poor; experience replay and a separate target network can accelerate the convergence of the algorithm. 11
The experience replay mechanism was proposed by Lin. 12 A fixed-length buffer stores the sample data collected during interaction with the environment; each sample is a transition consisting of the current state, the chosen action, the received reward, and the next state. During training, mini-batches are drawn from this buffer, which breaks the correlation between consecutive samples in the traditional Q-learning algorithm and helps prevent the algorithm from falling into a local optimum.
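A minimal sketch of such a replay buffer is given below; the class and method names are illustrative and not taken from the original implementation.

```python
import random
from collections import deque, namedtuple

# A transition stores (state, action, reward, next_state, done), as described above.
Transition = namedtuple("Transition", "state action reward next_state done")

class ReplayBuffer:
    """Fixed-length buffer of transitions sampled uniformly at random."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # oldest samples are discarded first

    def push(self, *args):
        self.buffer.append(Transition(*args))

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between consecutive samples.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```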
Mnih et al. 13 proposed using two neural networks for learning: a target network that predicts the target Q-value, and an estimated network whose parameters are updated at every training step.
When the estimated neural network is trained, the loss function of the neural network can be written as follows
The estimated neural network is updated continuously; after a certain number of updates, its parameters are copied to the target network.
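A loss of this kind is usually written as follows in deep Q-learning, with θ the weights of the estimated network and θ⁻ those of the target network; the paper's exact formulation may differ in detail:

```latex
% Usual deep Q-learning loss: the target is computed with the frozen
% target-network weights theta^- and the prediction with the estimated
% network weights theta.
L(\theta) = \mathbb{E}\!\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^{-})
            - Q(s, a; \theta) \right)^{2} \right]
```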
Action selection strategy
To ensure good convergence, the action selection strategy must balance exploration and exploitation. The most commonly used action selection strategy is the ε-greedy strategy, which selects a random action with probability ε and the action with the largest estimated value otherwise.
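A minimal sketch of ε-greedy selection over a discrete action set is shown below; the function name and the way Q-values are passed in are illustrative assumptions.

```python
import random

import numpy as np

def epsilon_greedy(q_values, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy one.

    q_values: 1-D array of estimated action values for the current state.
    """
    if random.random() < epsilon:
        return random.randrange(len(q_values))      # explore
    return int(np.argmax(q_values))                  # exploit
```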
Heuristic Q-learning
In order to improve the convergence speed of Q-learning, this article proposes a heuristic Q-learning method by improving the reward function and action selection strategy.
Reward function
Traditional Q-learning mostly uses a sparse reward function. For the path planning problem, the UAV receives a positive reward when it reaches the target point, a negative penalty when it collides with an obstacle or leaves the state space, and a reward of 0 in all other cases. A sparse reward function is simple to compute and easy to design. For the three-dimensional UAV path planning problem, however, the state space is very large, the reward of most actions is 0, and the probability of encountering a meaningful reward is small, which leads to slow convergence of the learning algorithm. To overcome the shortcomings of the sparse reward function, a heuristic reward function is constructed
where
In the artificial potential field method, the target point attracts the UAV. Following this idea, this article constructs the attraction term of the heuristic reward
where
where
where
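The sketch below illustrates one common way to build such an attraction-based shaping reward from the distance to the target, together with terminal bonus and penalty terms; the weights and penalty values are assumptions for illustration and are not the values used in the paper.

```python
import numpy as np

# Illustrative shaping reward in the spirit of the heuristic function described
# above. The weights and penalty values below are assumptions, not the original ones.
K_ATTRACT = 1.0      # weight of the attraction toward the target
R_GOAL = 100.0       # bonus for reaching the target node
R_CRASH = -100.0     # penalty for hitting terrain or leaving the planning space

def heuristic_reward(pos, next_pos, goal, collided, reached):
    """Dense reward: positive when the step reduces the distance to the goal."""
    if reached:
        return R_GOAL
    if collided:
        return R_CRASH
    d_before = np.linalg.norm(np.asarray(goal) - np.asarray(pos))
    d_after = np.linalg.norm(np.asarray(goal) - np.asarray(next_pos))
    # Attraction term: reward progress toward the target, penalize moving away.
    return K_ATTRACT * (d_before - d_after)
```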
Action selection strategy
The key to the convergence of the Q-learning algorithm is balancing exploration and exploitation: the action selection strategy should let the UAV explore the environment sufficiently to avoid falling into a local optimum while still allowing the learning algorithm to converge quickly. The optimal strategy of Q-learning is usually based on the estimate of the state-action value function
Therefore, this article designs an evaluation function
If
In the early stage of learning, the UAV knows nothing about the environment, so the action selection strategy should emphasize exploration. As learning proceeds, the state-action value function becomes more reliable, and the strategy should place more emphasis on exploitation. Therefore,
where
MARER Q-learning
The experience replay mechanism has the drawback that sampling from the buffer is uniform at random, so the quality of samples is neglected: a sample in which the UAV finally reached the target point after thousands of explorations may never be selected by the Q-learning algorithm. To improve learning efficiency, Schaul et al. 14 proposed the prioritized replay sampling strategy, which sorts samples according to their importance. However, that algorithm must continuously sort the samples in the data set, which greatly increases the computational complexity.
To further improve the efficiency of the algorithm while keeping its complexity low, this article proposes a preferred experience replay mechanism based on the maximum average reward value (MARER Q-learning), built on top of heuristic Q-learning:
Calculate the average of reward values
Then update the maximum average reward value
If the maximum average reward value is updated, this indicates that the episode contains good samples: all samples from this episode are fed directly to the learning algorithm, and the episode's samples are additionally copied three times into the cached sample data set, which raises the proportion of excellent samples. The algorithm then randomly selects samples from the buffered data set for training, as in ordinary experience replay.
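A sketch of this buffering step is given below, assuming the ReplayBuffer and Transition types from the earlier sketch; the duplication factor of three follows the description above, while the function name, the optional learning callback, and the decision to also store every episode once are illustrative assumptions.

```python
def marer_episode_update(buffer, episode_transitions, max_avg_reward,
                         learn_fn=None, copies=3):
    """MARER-style end-of-episode step, following the description in the text.

    episode_transitions: list of Transition tuples collected in this episode.
    max_avg_reward: best average episode reward observed so far.
    learn_fn: optional callback that trains directly on a list of transitions.
    Returns the (possibly updated) maximum average reward.
    """
    avg_reward = sum(t.reward for t in episode_transitions) / len(episode_transitions)

    if avg_reward > max_avg_reward:
        max_avg_reward = avg_reward
        # Good episode: its samples are learned from directly ...
        if learn_fn is not None:
            learn_fn(episode_transitions)
        # ... and copied three times into the buffer to raise the share
        # of excellent samples in later random mini-batches.
        for _ in range(copies):
            for t in episode_transitions:
                buffer.push(*t)

    # Every episode also stores its samples once, as in ordinary replay.
    for t in episode_transitions:
        buffer.push(*t)

    return max_avg_reward
```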
Simulation results
Assume that the mission area of the UAV is 200 km × 100 km × 250 m, and that a representative topographic map is generated according to typical terrain features. The discretization step is 5 km in the horizontal directions and 5 m in the vertical direction, so the discrete space for the three-dimensional path planning problem is 40 × 20 × 50, giving 40 × 20 × 50 × 9 state-action values.
The expression for the terrain model used for the simulation is as follows
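Synthetic benchmark terrains of this kind are often generated as a superposition of exponential hills; the sketch below illustrates the idea with arbitrary peak positions and heights that are not the parameters used in the paper.

```python
import numpy as np

# Illustrative synthetic terrain: a superposition of exponential hills.
# Peak positions, widths, and heights are arbitrary placeholders, not the
# parameters used in the paper.
PEAKS = [
    # (x0_km, y0_km, height_m, x_spread_km, y_spread_km)
    (60.0, 30.0, 180.0, 15.0, 10.0),
    (120.0, 70.0, 220.0, 20.0, 12.0),
    (160.0, 40.0, 150.0, 10.0, 8.0),
]

def terrain_height(x_km, y_km):
    """Terrain height (m) at horizontal position (x_km, y_km)."""
    z = 0.0
    for x0, y0, h, sx, sy in PEAKS:
        z += h * np.exp(-((x_km - x0) / sx) ** 2 - ((y_km - y0) / sy) ** 2)
    return z
```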

Three-dimensional topographic map.
Simulation experiment parameters are as follows:
Reward parameters
Discount factor
Neural network training interval
Position coordinates of starting point
The maximum number of steps per learning episode is 200.
By counting the episodes in which the UAV successfully reaches the target point, as shown in Figures 3 and 4, the difference in convergence speed among the three reinforcement learning methods is clear. Heuristic Q-learning begins to converge after about 1700 episodes, and MARER Q-learning converges after about 1300 episodes, whereas plain Q-learning has still not reached the target point after 2000 episodes. MARER Q-learning converges fastest among the three algorithms, while Q-learning converges slowest. Heuristic Q-learning constructs a reward function that incorporates track information such as height and distance, so that the UAV gains a deeper understanding of the environment and actively excludes unfavorable actions, reducing the action search space and thereby accelerating convergence. Building on heuristic Q-learning, MARER Q-learning evaluates the quality of samples by the maximum average reward value and makes the algorithm select excellent samples in a targeted manner, thus achieving even better results.

Number of episodes in which the target point is reached successfully.

Average reward per episode: (a) heuristic Q-learning and (b) MARER Q-learning.
The optimal three-dimensional path given by heuristic Q-learning is shown in Figure 5, and its projections onto the vertical and horizontal planes are shown in Figure 6. The optimal path given by MARER Q-learning is shown in Figure 7, and its projections onto the vertical and horizontal planes are shown in Figure 8. Comparing the vertical planes of Figures 6 and 8 shows that the latter path flies lower and avoids obstacles better; comparing the horizontal planes shows that the latter track is straighter and more economical. In summary, the MARER Q-learning algorithm outperforms the other algorithms in both convergence speed and planning results.

Three-dimensional path map of UAV (heuristic Q-learning).

Three-dimensional path profile of UAV (heuristic Q-learning): (a) vertical plane and (b) horizontal plane.

Three-dimensional path map of UAV (MARER Q-learning).

Three-dimensional path profile of UAV (MARER Q-learning): (a) vertical plane and (b) horizontal plane.
As the figures show, both reinforcement learning methods successfully find the lower saddle between the two mountains and pass through it, climbing when obstacles are encountered and descending back to cruise altitude once the obstacles are cleared. The path planned by MARER Q-learning is better than that of heuristic Q-learning: it is mostly flat and is essentially the best path from the starting point to the target point.
Conclusion
Traditional Q-learning methods are inefficient: when the scale of the state space grows linearly, the complexity of the problem grows exponentially, so traditional reinforcement learning methods struggle with three-dimensional UAV path planning. This article proposes a Q-learning algorithm based on a heuristic function and an experience replay mechanism driven by the maximum average reward value. By comprehensively considering the constraints of UAV path planning, a heuristic function is constructed to guide the learning behavior of the UAV effectively, which avoids blind exploration to a certain extent and improves learning efficiency. The improved experience replay mechanism greatly increases the convergence speed of the algorithm at a small computational cost. The simulation results show that the three-dimensional trajectory obtained by the proposed method meets expectations.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work received support from National Natural Science Foundation of China (No 61702023, 61976014) and Fundamental Research Funds for the Central Universities.
