Abstract
A flexible reinforcement learning (RL) optimal collision-avoidance control formulation for unmanned aerial vehicles (UAVs) in a discrete-time framework is presented in this work. By utilizing the approximation capacity of neural networks (NNs) and the actor-critic scheme of the RL technique, an adaptive RL optimal collision-free controller with a minimal learning parameter (MLP) is formulated, based on a novel strategic utility function. The proposed approach resolves the optimal collision-avoidance control problem, which could not be addressed in the prior literature. Furthermore, the proposed MLP adaptive optimal control formulation reduces the number of adaptive laws, leading to lower computational complexity. Additionally, a rigorous stability analysis demonstrates that the proposed adaptive RL ensures the uniform ultimate boundedness (UUB) of all signals in the closed-loop system. Finally, simulation results illustrate the effectiveness of the proposed optimal RL control approaches.
Keywords
Introduction
The compact, lightweight design of unmanned aerial vehicles (UAVs), coupled with their capacity to perform tasks that are inconvenient or hazardous for humans, has earned them widespread acclaim. They have demonstrated their value in diverse domains, including industrial inspection, emergency response and disaster relief, and daily assistance. Nonetheless, along this developmental trajectory, there has been a steady rise in incidents where quadcopters have inflicted harm on people and property, jeopardizing airspace safety.1,2 Thus, autonomous obstacle avoidance is regarded as a paramount and indispensable functional prerequisite for quadcopters executing intricate operational tasks.
Regarding the obstacle avoidance problem in UAVs, a substantial body of literature has presented solutions from different perspectives. The problem primarily concerns designing a method that enables the UAV to travel safely from a starting point to a target point within a specified flight environment. This entails not only navigating the UAV safely around obstacles but also satisfying the requirements of the flight trajectory and the physical constraints of the UAV itself.3 Obstacle avoidance based on route planning is also known as global planning obstacle avoidance. Its basic idea is to use a route planning algorithm to find a flight path that lets the UAV depart from the starting point, avoid all obstacles, and reach the target point; a route tracking and guidance method then controls the UAV to fly along the generated route.4–6 Obstacle avoidance based on local collision prevention is also known as local planning obstacle avoidance: the UAV's local collision-prevention controller evades detected obstacles in real time. These methods do not rely on global information and do not require knowledge of the initial and target points; they rely only on real-time obstacle information from onboard sensors and are commonly used when prior information is insufficient or obstacles appear suddenly.7 Obstacle avoidance schemes based on the artificial potential function method use virtual potential fields to produce attractive and repulsive forces on the UAV, combine attraction and repulsion into a resultant force to establish the low-level regulation controller, and thereby obtain an effective local obstacle-avoidance flight path.8
However, traditional UAV obstacle avoidance algorithms typically require the construction of offline three-dimensional maps. Based on the global map, these algorithms treat obstacle points as constraints and employ path planning algorithms to compute the optimal path. While some obstacle avoidance algorithms avoid the complex map-construction process, they often require manual tuning of numerous parameters, and the robots cannot use obstacle-avoidance experience to improve themselves during the process. Based on the above analysis, there is an urgent demand for intelligent obstacle avoidance in quadcopter UAVs through real-time feedback and autonomous decision-making in complex environments.
With the advancement of machine learning, and facing the challenge of integrating the human “trial-and-error improvement” autonomous learning mechanism into obstacle-avoidance control of quadcopter drones, several prior research directions can be summarized as follows. Researchers have incorporated supervised learning into UAV obstacle avoidance, treating obstacle avoidance as a classification problem.9 Reinforcement learning, by contrast, optimizes behavior through interaction with the external environment. Its advantage lies in its independence from both the offline maps required by traditional non-machine-learning methods and the annotated datasets needed for supervised learning. By learning the mapping between input data and output actions through deep models, reinforcement learning enables intelligent agents to handle decision-making problems in high-dimensional continuous spaces, avoiding complex offline map construction.10 A deep reinforcement learning approach based on uncertainty perception has been proposed that enables quadcopter drones to remain “vigilant” in unfamiliar environments by estimating collision probabilities; this approach reduces operating speed and minimizes the possibility of collisions.11 The DDPG algorithm has been applied to plan the desired path for quadcopter drones and combined with a PID controller in a hierarchical structure to achieve collision-free target tracking tasks.12 As a classical algorithm for continuous action control, DDPG has been widely used in obstacle avoidance, path planning, and related problems. Ding et al.13 divide the path planning task of UAVs into a Path Travel Policy Module and an Information Exploration Policy Module using the DDPG algorithm.
To help guide the generation of UAV flight path trajectories and enhance the model's learning capability, an improved Artificial Potential Field (APF) force-guiding mechanism is introduced in the Path Travel Policy Module. The Information Exploration Policy Module provides the UAV with a series of temporary target points, enabling better obstacle-avoidance performance on complex maps. Li et al.,14 building upon the Proximal Policy Optimization (PPO) algorithm, improve the reward function design by introducing density rewards, distance rewards, and step-length penalties; this reduces congestion and enhances the task efficiency of the agent. Han et al.15 propose an obstacle-avoidance algorithm that combines artificial potential fields with deep reinforcement learning: by modifying the APF, obstacles directly affect intermediate target positions rather than control commands, which makes the method useful for guiding previously trained one-dimensional deep reinforcement learning controllers. Yan et al.16 present a distributed formation and obstacle-avoidance approach based on Multi-Agent Reinforcement Learning (MARL). Agents in the system make decisions and perform distributed control using only local, relevant information, and in the event of any disconnection they rapidly reconfigure into a new topology. The method improves formation error, formation convergence rate, and obstacle-avoidance success rate. Since the heterogeneity of agents cannot be overlooked in crowded scenarios, Zhu et al.17 model agents using Oriented Bounding Capsules (OBC) and transform the interaction state space of robot-obstacle agent pairs. To address speed heterogeneity, a speed-related collision risk function is designed to shape robot behavior, which raises the collision-avoidance success rate in congested scenes.
However, DDPG suffers from overestimation bias in Q-values; once this cumulative error grows large enough, it can lead to suboptimal policy updates and divergent behavior.
An aspect of practical significance that warrants attention is the presence of unknown information. In the work by Doukhi and Lee,18 a robust adaptive NN control strategy is developed for a quadrotor UAV. The controller accounts for uncertainties arising from disturbances, inertia, mass, and aerodynamics; by employing adaptive neural networks and certainty-equivalent control, it approximates the unknown dynamics, obviating the need for precise models or disturbance information. Wang et al.19 propose an innovative adaptive control scheme for uncertain discrete-time nonlinear systems in strict-feedback form. The algorithm uses a single neural network approximation to convert the original system into a predictor form, effectively addressing the noncausal issue. The designed controller consists of just one actor controller and one adaptive compensator, streamlining implementation and alleviating computational load. To enhance control performance by leveraging estimated unknown information, control performance is often evaluated with a long-term performance index, which has garnered considerable attention in the literature. To address this user-specified long-term cost, the control community frequently employs the reinforcement learning (RL) technique, as highlighted in Refs.20–22 In Chen et al.,23 a dynamic surface control scheme using an RBFNN and a disturbance observer is proposed to handle uncertainty and saturation, avoiding the explosion of complexity and ensuring convergence of closed-loop signals. Moreover, in the context of time-varying-constrained nonlinear systems, previous studies24,27 have utilized barrier Lyapunov functions to guarantee constraint adherence, using NNs to approximate the unknown information in the control strategy.
In view of the above, and to address autonomous intelligent obstacle avoidance for quadcopter drones based on real-time environmental feedback, this paper develops a “trial-and-error correction” evaluation-action intelligent control framework using reinforcement learning, whose innovations are summarized as follows:
The article presents an innovative approach that combines neural networks and actor-critic control mechanism of RL to develop an optimal collision-free RL strategy for UAVs with discrete-time systems.
The article introduces the idea of utilizing an MLP architectural strategy to reduce the number of adaptive adjustments within the adaptive control framework of RL. This leads to decreased computational complexity and improved operational efficiency.
The established RL framework and the MLP-based RL control system guarantee the stability of the closed-loop systems. This is supported by the analysis of uniformly ultimately bounded (UUB) behavior. Furthermore, it leverages the potential of deep learning models to adeptly tackle decision-making obstacles across vast continuous domains.
The structure of this work is as follows: Section “Problem formulation” introduces a nonlinear model of a UAV and transforms it into discrete form. Section “Design of reinforcement learning control strategy” presents a pioneering online reinforcement learning control scheme for UAVs that accounts for collision risk. Section “Simulation” showcases simulations validating the proficiency of the devised controllers. Lastly, Section “Conclusion” summarizes the key findings of the paper.
Problem formulation
Six DOF system model
The schematic diagram of the quadrotor UAV under consideration is shown. To facilitate the description of motion, two reference frames are introduced: the inertial coordinate system denoted as
where the Euler angle vector, denoted as
The rotational matrix
Furthermore,
Consequently, the explicit form of the quadrotor UAV attitude dynamics is written as:
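For orientation, a commonly used rigid-body reference form of quadrotor attitude dynamics, together with the forward-Euler discretization that yields a discrete-time model, is sketched below. This is an illustrative standard model with inertia terms $I_x, I_y, I_z$, arm length $l$, and sampling period $T$, and is not necessarily identical to the equations of this paper:

```latex
I_x\ddot{\phi}   = (I_y-I_z)\,\dot{\theta}\dot{\psi} + l\,u_{\phi},\qquad
I_y\ddot{\theta} = (I_z-I_x)\,\dot{\phi}\dot{\psi}   + l\,u_{\theta},\qquad
I_z\ddot{\psi}   = (I_x-I_y)\,\dot{\phi}\dot{\theta} + u_{\psi},
\qquad
x_{k+1} = x_k + T\,f(x_k,u_k).
```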
The primary aim of this article is to formulate two adaptive neural network reinforcement learning (NN RL) controllers for UAVs. These controllers ensure that all signals throughout the closed-loop system are uniformly ultimately bounded, while driving the tracking errors to a neighborhood of zero without any potential collision hazards.
Radial basis function neural network
In the context of this article, the definition of the radial basis function neural network (RBFNN) function
where
where
Numerous relevant publications have provided evidence of the capability of RBFNN to approximate diverse nonlinear functions across a confined domain
where the ideal weight vector
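As a concrete illustration of the RBFNN approximation $f(x)\approx W^{\top}S(x)$ described above, the following minimal sketch fits Gaussian basis weights to a scalar test function by least squares. The centers, width, and target function are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def rbf_basis(x, centers, width):
    """Gaussian radial basis vector S(x)."""
    d2 = np.sum((centers - x) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * width ** 2))

def rbfnn(x, W, centers, width):
    """RBFNN output: f(x) ~ W^T S(x)."""
    return W @ rbf_basis(x, centers, width)

# Fit an "ideal" weight vector offline by least squares on samples of sin(x),
# approximating it over the confined domain [-pi, pi].
centers = np.linspace(-np.pi, np.pi, 15).reshape(-1, 1)
X = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
Phi = np.stack([rbf_basis(x, centers, 0.5) for x in X])   # 200 x 15 basis matrix
W, *_ = np.linalg.lstsq(Phi, np.sin(X).ravel(), rcond=None)
approx_err = np.max(np.abs(Phi @ W - np.sin(X).ravel()))  # small on the domain
```

In adaptive control, the weights are instead updated online, but the offline fit shows the approximation capability the stability analysis relies on.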
Design of reinforcement learning control strategy
Within this section, an adaptive RL control scheme for the discrete nonlinear UAV system (5) is constructed as follows27:
where
Designs of critic NNs for avoiding obstacles
The utility function is designed to penalize collisions among UAVs and encourage keeping farther away from obstacles, as shown in Figure 1. According to Refs.22,25,27 and inspired by the idea of few-shot learning, the utility cost function is proposed as follows:
where

Figure 1. Reinforcement learning-based UAV collision avoidance control framework.
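Since the exact form of the utility cost function is given by the equation above, the sketch below only illustrates the general shape such a collision-aware utility can take: a quadratic tracking term plus a penalty that grows as the UAV approaches an obstacle. The exponential penalty, weights, and safety radius are illustrative assumptions, not the paper's design:

```python
import numpy as np

def utility(pos, pos_ref, obstacles, safe_radius=1.0, w_track=1.0, w_obs=5.0):
    """One-step utility: quadratic tracking cost plus obstacle-proximity penalty."""
    track_cost = w_track * float(np.sum((pos - pos_ref) ** 2))
    obs_cost = 0.0
    for p_obs in obstacles:
        d = float(np.linalg.norm(pos - p_obs))
        # Penalty grows sharply as the UAV closes on the safety radius.
        obs_cost += w_obs * np.exp(-(d - safe_radius))
    return track_cost + obs_cost
```

A nearby obstacle thus raises the one-step cost, so a policy minimizing the long-term utility is steered away from collisions while tracking the reference.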
Therefore, in order to encapsulate the overall performance of the policy over an extended period, the utility function for long-term policy evaluation is formulated in the following manner:
where the symbol
Given the challenging nature of obtaining an exact value for
where
Let us define the following Bellman error, which quantifies the discrepancy between the estimated value and the expected value in the Bellman equation, serving as a measure of the quality of the approximate solution:
where
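The discounted long-term utility and the Bellman error it induces can be sketched as follows. The discount factor and the scalar value-function interface are illustrative assumptions; the paper's critic uses an RBFNN approximation of the value:

```python
import numpy as np

def discounted_return(utilities, gamma=0.95):
    """Long-term utility J_k = sum_{i>=0} gamma^i * u_{k+i} over a finite horizon."""
    weights = gamma ** np.arange(len(utilities))
    return float(np.dot(weights, utilities))

def bellman_error(V_k, V_k1, u_k, gamma=0.95):
    """e_B = u_k + gamma * V(x_{k+1}) - V(x_k): zero when V solves the Bellman equation."""
    return u_k + gamma * V_k1 - V_k
```

For a constant utility u, the exact value is u / (1 - gamma), and plugging it into `bellman_error` gives zero, which is the consistency condition the critic update drives toward.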
In what follows, the formulation of the backstepping RL mechanism used to design the controllers is examined in detail:
where
Define
The unknown function
where
By substituting (17) into (15), it yields
Let
where
The design of the virtual controller takes place in the following manner:
By substituting (21) into (20), one has
where
At the time instance corresponding to
where
The strategic utility function provides a comprehensive evaluation of the effectiveness and overall value of a particular course of collision-avoiding action, and we have
Select the cost function as
Define
Utilize the RBFNN to approximate the unknown function
where the optimal weight vector is denoted as
Substituting (28) into (26) yields
Design the controller as
Substituting equation (30) into expression (29) yields:
where
The strategic utility function is defined, providing a comprehensive framework for evaluating the overall effectiveness and value of strategic choices and decisions:
Choose the cost function
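The critic and actor weight adaptations used in such actor-critic schemes can be sketched as simple gradient steps: the critic descends the squared Bellman error, and the actor is steered to reduce the critic's long-term cost estimate. The learning rates, linear-in-feature critic, and update forms below are illustrative assumptions, not the paper's adaptive laws; in the MLP scheme, the full weight vector estimate is further replaced by a scalar estimate of its norm, shrinking the number of adaptive laws:

```python
import numpy as np

def critic_update(Wc, phi_k, phi_k1, u_k, gamma=0.95, lr=0.05):
    """One gradient step of the critic weights on the squared Bellman error."""
    e_b = u_k + gamma * (Wc @ phi_k1) - (Wc @ phi_k)   # Bellman error
    return Wc + lr * e_b * (phi_k - gamma * phi_k1)    # descend (1/2) e_b^2

def actor_update(Wa, phi_k, critic_value, lr=0.01):
    """Steer the actor weights to reduce the critic's long-term cost estimate."""
    return Wa - lr * critic_value * phi_k
```

With a constant feature and constant utility, repeated critic updates converge to the exact discounted value u / (1 - gamma), mirroring the consistency argument behind the Bellman error.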
The preceding design and analysis are summarized in the following theorem.
Proof: Choose the Lyapunov function as
where
where
To proceed further, the first difference of
Invoke the Cauchy–Schwarz inequality as
Furthermore, by means of Young's inequality and
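For reference, the two inequalities invoked in this step take the standard forms (for vectors $a, b$ and any $\epsilon > 0$):

```latex
(a^{\top} b)^{2} \le \|a\|^{2}\,\|b\|^{2}
\quad\text{(Cauchy--Schwarz)},
\qquad
a^{\top} b \le \frac{\epsilon}{2}\,\|a\|^{2} + \frac{1}{2\epsilon}\,\|b\|^{2}
\quad\text{(Young)}.
```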
To continue,
By taking (31) and in mind, combined with (36), the
In what follows, from (25), (33), and (36),
By utilizing the Cauchy–Schwarz inequality (36), it yields
Similarly, by applying the forward difference technique, we can determine the difference of
By considering the explicit definition of forward difference, we can deduce the following:
Ultimately, the difference of the overall Lyapunov function
Choose the parameters
By selecting the following parameters:
through careful assessment, it can be inferred that the assertion of
By the established Lyapunov extension theorem, all signals within the closed-loop system are UUB, which completes the proof of Theorem 1.
Simulation
This section presents simulation and experimental results to illustrate the efficacy and resilience of the proposed approach. The simulation is conducted under the following conditions: The initial state vector of the quadrotor is denoted as

Figure 2. Position tracking trajectory of a quadrotor using the proposed RL algorithm for a static obstacle.

Figure 3. Position tracking trajectory of a quadrotor using the proposed RL algorithm for a dynamic obstacle.
The simulation results provide compelling evidence of the effective collision-avoidance capability of the proposed controller: the target is reached while the potential risk of collision is minimized (as depicted in Figures 2 and 3). In Figures 2 and 3, the trajectory visualized by the red line faithfully adheres to the intended path, and the blue line in Figure 3 represents the motion trajectory of the dynamic obstacle, so collisions are avoided under both static and dynamic conditions. Furthermore, the simulation outcomes in Figure 3 substantiate the suitability of the proposed RL control methodology for a diverse array of tracking missions while upholding compliance with stringent safety boundaries.
Conclusion
This study has made significant contributions to the control of UAVs. The investigation focused on transient performance, highlighting the importance of considering nonlinearities and ensuring stable maneuvering. By combining neural networks and reinforcement learning (RL), an innovative approach was developed for adaptive collision-free RL optimal control of UAVs with discrete-time systems. The introduction of the minimal learning parameter (MLP) reduced the number of adaptive laws, improving computational efficiency without compromising performance. The proposed RL and MLP-based controllers ensured closed-loop system stability, demonstrated through UUB analysis. Overall, this research advances UAV control strategies, emphasizing transient performance, neural networks, and RL techniques, with implications for safer and more efficient UAV operations. In future work, we will adopt the cooperative game indicator design concept to further enhance the control effect and, building upon the autonomous obstacle-avoidance control algorithm designed in this paper, investigate autonomous cooperative obstacle-avoidance control algorithms for UAVs under nonholonomic information constraints and scale variations.
Footnotes
Handling Editor: Chenhui Liang
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Henan Province Science and Technology Research Project (232102240098).
