Abstract
Multi-UAV systems play an important role on the battlefield. Although many methods have been proposed for multi-UAV task allocation, the problems of complex time constraints and an uncertain solution space remain, because multi-UAV systems usually face changing environmental factors. To solve these problems, this paper proposes a multi-UAV task assignment method based on a deep Q-network evolutionary reinforcement learning algorithm (MPSO-SA-DQN). Specifically, the method builds a multi-agent training framework on a deep evolutionary reinforcement learning mechanism and SA-DQN, aiming to improve the global exploration and optimization capabilities of the agents. At the same time, a multi-dimensional particle swarm optimization algorithm is introduced to optimize the state space. On the basis of task priority mapping, the MPSO-SA-DQN algorithm framework is proposed, so that the agents can optimize their execution states in real time while interacting with the environment, and can reach the optimal state and the maximum reward. According to the characteristics of global multi-UAV task assignment, this paper designs a priority state-space autoencoder strategy and global task features, and proposes a multi-UAV task allocation and iterative optimization method based on the MPSO-SA-DQN algorithm that continuously improves the allocation scheme. The simulation results show that the multi-UAV task allocation method based on MPSO-SA-DQN can effectively handle the uncertainty of the optimal solution space in task allocation. At the same time, the algorithm converges faster and shows good prospects for application in cooperative UAV swarm task planning.
Keywords
Introduction
First-hand information1–3 is extremely important in a war environment. Unmanned aerial vehicles (UAVs) excel at information reconnaissance and transmission, as well as battlefield monitoring. As a result, UAV technology4,5 plays an increasingly important role on the battlefield. However, the combat tasks that a single UAV can complete are limited by various restrictions. Thus, to meet the needs of more complex military tasks, multi-UAV maneuver decision-making6,7 has become a research hotspot.
At present, many studies with promising results on multi-UAV task allocation8–12 have been published. Specifically, Huawei et al.13 construct a cooperative task planning model for an unmanned warship and UAVs. The model uses an adaptive particle swarm algorithm to search for the optimal result and is able to solve the cooperative task allocation problem for unmanned combat platforms. Qingchao and Qingkui14 realize multi-UAV collaborative task allocation with a discrete pigeon-inspired optimization algorithm. However, the above-mentioned multi-UAV task allocation methods are unable to handle future large-scale operations, where the task must be planned quickly and the scene changes: the difficulty of solving the problem rises rapidly while the solution accuracy gradually decreases. Therefore, some works introduce deep evolutionary reinforcement learning15–17 to explore large-scale UAV task assignment. For example, Khadka et al.18,19 improve the diversity of data by introducing an evolutionary algorithm population: they train an agent with reinforcement learning, periodically reinsert the agent into the evolutionary population, and inject gradient information into the evolutionary algorithm. This provides a new way to solve the above problems.
Conventional algorithms20,21 have fixed solution-space dimensions and cannot escape local optimal solutions. At the same time, a conventional deep reinforcement learning algorithm cannot exit the training loop once the optimum is reached, which makes the training time too long. Furthermore, the task allocation methods above have some disadvantages. In particular, during UAV flight the mission-target pairing may change over time, and the environment in which the UAV performs the mission changes dynamically. Besides, task assignment is difficult to decide among multiple UAVs,22,23 because sudden and unpredictable events often occur. Finally, there are certain defects in the solution-space judgment of multi-UAV mission planning.
To address the problems of existing methods analyzed above, this paper proposes a multi-UAV task assignment method based on MPSO-SA-DQN. Firstly, for the multi-agent Markov game process, an agent training framework (SA-DQN) based on deep evolutionary reinforcement learning is constructed. Secondly, the design of the MPSO-SA-DQN environment is proposed. Then, the combination of multi-dimensional particle swarm optimization and SA-DQN is used to continuously interact with the environment, with the aim of continuously refining the optimization process and reaching the globally optimal state. Finally, a real-time priority state-space autoencoder strategy is proposed, and a multi-UAV task assignment model based on MPSO-SA-DQN is built by combining the multi-dimensional particle swarm and simulated annealing algorithms. Experimental results suggest that the proposed method can realize multi-UAV task assignment.
Related work
Wang et al.24 developed a lifelong learning architecture that can integrate artificial intelligence (AI) algorithms into heterogeneous IIoT networks. The framework implements the basis for efficient data transfer. The authors also added an attention mechanism based on reinforcement learning, which is similar to the embedding approach of the evolutionary algorithm mechanism.
Zheng et al.25 proposed a prior regularization method (DL-PR) for deep learning. The regularization factor, designed by combining inter-class adversarial factors with global and dimensional dispersion, helps to increase the inter-class distance and decrease the intra-class distance of the samples. It also helps deep learning models generalize well on signals with various signal-to-noise ratios (SNR). Lishen et al.26,27 improve the solution of transient multiscale partial differential equations by developing a generalized finite element method with global-local enrichment. The adaptive algorithm detects the subset of global nodes with a nontrivial enrichment set at each time step and then determines the local problem residuals, allowing for less redundancy at each time step. Qinghe et al.28,29 optimized a specially designed autoencoder (AE) via entropy-stochastic gradient descent. A similarity estimator for manifolds across different dimensions was designed as a penalty term to ensure their invariance during gradient backpropagation. It is highlighted that expertise hidden in the normal signal can be extracted and emphasized rather than simply overfitted.
Furthermore, the evolutionary reinforcement learning algorithm30–32 has the attributes of both evolutionary algorithms33–35 and reinforcement learning algorithms. As a result, it works better on problems requiring real-time performance and task execution in complex environments. For example, Khadka and Tumer36 train reinforcement learning agents with evolutionary algorithms to provide diverse data, and the trained agent is reinserted into the evolutionary reinforcement learning algorithm. Khadka et al.19 proposed a co-evolutionary reinforcement learning method. Based on the cooperative co-evolution (CC) framework of NCS, Yang et al.37 use reinforcement learning to expand the scale of NCS while preserving its parallel exploratory search behavior. These methods all have similarities. This paper focuses on applying evolutionary algorithms to the design of environments in reinforcement learning. Closest to the work in this paper is the research of Colas et al.,38 whose method uses the replay buffer of pure exploration trajectories as a scratchpad for the goal exploration process (GEP) and uses DDPG to generate sample data. This search process is very similar to evolutionary algorithms, although it mainly studies diversity rather than reinforcement learning strategies.
Khadka and Tumer36 introduced an evolutionary reinforcement learning algorithm based on a swarm intelligence mechanism. The method adds a swarm intelligence evolutionary algorithm to the deep reinforcement learning algorithm DDPG. The authors mainly mutate the multi-agent population continuously through evolutionary algorithms, and select individuals based on the agents' state space during training. At the same time, the evolutionary algorithm is used to guide a single agent in the process of interacting with the environment; this agent is then periodically inserted back into the population. When DDPG's gradient-based policy improvement mechanism is effective, the individuals with excellent performance are selected into the next generation, and thus the entire population evolves.
Modeling of multi-UAV task allocation
Multi-UAV task assignment problem
Multi-UAV task assignment has two purposes. One is to cause maximum damage to the enemy while ensuring the minimum loss to one's own side. The other is to make the completeness of the task execution sequence as high as possible within the limited time. In this paper, multi-UAV task assignment is modeled as a mixed Markov game: each UAV is regarded as an agent, and the action space of each agent is represented by its executable task sequences. To facilitate the interaction between multiple UAVs and the environment, this paper makes the following assumptions.
We regard the dynamic allocation of multi-UAV tasks as a sequential decision-making problem, whose framework is shown in Figure 1. Firstly, in different time-space sequence states, multiple tasks are assigned to single UAVs to execute. The transition from the initial state to the final state means the UAV receives tasks and then completes them. The UAV interacts with the global environment when performing tasks, driven by the advancing time-space sequence. Meanwhile, the UAV generates characteristics such as minimum risk, minimum execution path, maximum damage benefit, and task volume, and then updates the task completion and task priority features. The features of all tasks are combined into global environment information shared by all UAVs.

Figure 1. The framework of multi-UAV task assignment.
After that, each UAV perceives the change in the environment and makes a dynamic task allocation decision through MPSO-SA-DQN, the main goal being to form its own new task state sequence. Finally, each UAV starts to perform tasks cooperatively according to the latest task assignment plan. In this process, each UAV interacts with the others, updates task characteristics, and forms the latest global environment and assignment plan. These steps are repeated until the task is completed.
The multi-UAV task allocation framework shown in Figure 1 consists of three components: the environment, central algorithm execution, and training. The tasks to be executed by the UAVs are first initialized for allocation to form a task pool. The tasks are then randomly assigned to the UAVs to generate a global information network. Each UAV generates a sequence of tasks to execute by interacting with the environment. Then, filter matching between UAVs and tasks is performed based on MPSO-SA-DQN, and the corresponding tuples are formed. At the same time, the framework interacts and trains with the environment and updates the target MPSO-SA-DQN network. Finally, the targeted multi-UAV task allocation scheme is generated.
Modeling of multi-UAV task assignment
Generation of real-time task priority: this paper adopts real-time priority mapping to describe the characteristics of each task, which is expressed as
When the UAV performs a certain task, we set
When the task set
Equation (2) indicates that each UAV performs at least one subtask.
In this paper, the original task is divided into several subtasks that are evenly distributed in the environment. When a UAV interacts with the environment, it is assigned one subtask based on the principles of the shortest distance, the shortest time, and the completeness of subtask sequence execution. Besides, when the original task is assigned, the tasks performed by one UAV are distributed in the same area, so the travel distance of each UAV is relatively short overall. Thus, the distance between tasks can be ignored when performing a single task and when performing dynamic task allocation; here the latitude and longitude of a task mainly serve to distinguish tasks. Therefore, when the UAV interacts with the environment to perform tasks, the original task execution list is continuously optimized according to the task priority, and it finally reaches the optimum under the constraints of resources and conditions.
The speed of the UAV is represented by
Because the time efficiency of task execution is important, it is necessary to consider the completion time constraints of tasks. Besides, it is reasonable that the priority of each task will differ in each time period. Here, we express them as dynamic priorities
The proposed MPSO-SA-DQN
Based on the mechanism of the DQN algorithm and the properties of the tasks, each UAV is mapped to an agent in deep reinforcement learning, and the sequence of executable tasks is mapped to the agent's states. A multi-dimensional particle swarm algorithm is then introduced to optimize the agent's state space, dynamically guiding the agent as it interacts with the environment, so that the agent obtains the optimal task execution scheme through continuous learning and its temporal credit assignment is further rationalized. Secondly, a real-time priority state-space autoencoder is designed so that the complexity of the state space can be kept within a certain dimension, making the algorithm run more efficiently.
Then, to address the training redundancy in which the agent cannot end the loop after learning, we introduce a simulated annealing algorithm. Referring to its cooling mechanism, the agent is enabled to find the goal or complete the programmed task execution, quickly exit the loop, and enter the next round of learning. The overall application framework of the MPSO-SA-DQN algorithm is shown in Figure 2. The framework contains three parts: simulated annealing network training, the Q-network policy function mechanism, and the design of the multi-UAV task allocation environment. The environment design simulates the scenario of interaction between multiple UAVs and the platform. The simulated annealing training part, based on the design of the round reward, constantly cools down the Q-network training, which effectively reduces the agents' number of ineffective searches. The UAV allocation environment design is the core part of the MPSO-SA-DQN algorithm. As can be seen from Figure 2, all the UAVs in this paper use the same network to acquire and update the task execution strategy, with centralized training and distributed execution, so the UAVs share the parameters of the Q-value network after centralized training.

Figure 2. Overall application framework of MPSO-SA-DQN.
Design of SA-DQN training framework
The first stage of a common agent training process is to set the maximum number of steps. The agent then continually performs trial-and-error operations within the maximum number of steps in the current round until the best execution route is found. However, if the maximum number of steps is too small, the agent cannot finish optimizing within the total steps; if it is too large, there are too many redundant steps, which lengthens the entire training time. Therefore, the SA-DQN training framework proposed in this paper monitors a temperature during agent training: the next round begins immediately once the agent reaches a balanced state. This neither leaves the training insufficient nor makes the number of training steps too large. The training framework of SA-DQN is shown in Figure 2.
The steps of the SA-DQN training framework are as follows (a code sketch of this loop is given after the list):
Step1: Parameter initialization.
Step2: Whether the current round
Step3: Initialize the simulation environment.
Step4: Initialize the temperature
Step5: Determine whether the temperature is greater than the stable temperature
Step6: The agent interacts with the environment.
Step7: Generate tuples
Step8: Whether the current state function is better than the next state function, if “yes,” go to Step9, if “no,” go to Step10.
Step9: Accept the current state
Step10: Accept the next state
Step11: Whether the status is stable, if "yes," go to Step12, if "no," go to Step5.
Step12:
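A minimal Python sketch of this loop is given below. The env and agent objects and their methods (act, remember, value, learn) are illustrative assumptions, as are the Metropolis-style acceptance rule and the geometric cooling schedule, neither of which is fully specified above.

```python
import math
import random

def sa_dqn_episode(env, agent, t0, t_stable, alpha=0.95):
    """One SA-DQN round: the temperature is annealed from t0 toward
    t_stable, and the round ends as soon as the temperature is stable,
    instead of waiting out a fixed maximum step count."""
    state = env.reset()                      # Step3: initialize environment
    temperature = t0                         # Step4: initialize temperature
    while temperature > t_stable:            # Step5: stable-temperature check
        action = agent.act(state)            # Step6: interact with environment
        next_state, reward, done, _ = env.step(action)
        agent.remember(state, action, reward, next_state)   # Step7: store tuple
        # Steps 8-10: accept the better state; otherwise accept the worse
        # one with a probability that shrinks as the temperature cools.
        delta = agent.value(next_state) - agent.value(state)
        if delta >= 0 or random.random() < math.exp(delta / temperature):
            state = next_state               # Step10: accept the next state
        # else keep the current state (Step9)
        agent.learn()
        temperature *= alpha                 # cooling toward equilibrium
        if done:                             # Steps 11-12: round finished
            break
```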
Framework of MPSO-SA-DQN algorithm
Based on the SA-DQN training framework, MPSO-SA-DQN introduces a multi-dimensional particle swarm algorithm. To guide the actions of the agents, it introduces multiple objective functions. Besides, MPSO-SA-DQN changes the state space of the agents dynamically to improve the search ability in an uncertain space, and it further improves the convergence of the DQN algorithm. The flowchart of MPSO optimization of the multi-agent state space and the execution framework of MPSO-SA-DQN are both shown in Figure 2.
Steps of the MPSO-SA-DQN algorithm (a sketch of the MPSO component is given after the list):
Input:
Step1: Initialize the environment and restart the initial state of each agent
Step2: Determine whether the temperature has reached equilibrium, if “Yes” execute Step1, if “No” execute Step4.
Step3: In the current state
Step4: Combining each agent action
Step5: Store the tuple
Step6: Learning.
Step7: Randomly fetch the state tuple data of a group of agents from the memory pool.
Step8: According to the objective function, judge whether the current state
Step9: Determine whether the current round is over. If “yes,” then
Step10: Output multi-agent execution plan
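For reference, the following is a minimal sketch of the multi-dimensional particle swarm step that optimizes candidate agent state vectors against an objective function. The swarm coefficients and the objective interface are assumptions, not values taken from this paper.

```python
import numpy as np

def mpso_optimize_states(objective, n_particles, dim, iters=100,
                         w=0.7, c1=1.5, c2=1.5):
    """Minimal multi-dimensional PSO: particles are candidate agent state
    vectors scored by a task objective (e.g. a combination of path cost,
    risk, and damage benefit). All names and bounds are illustrative."""
    pos = np.random.uniform(-1.0, 1.0, (n_particles, dim))  # candidate states
    vel = np.zeros_like(pos)
    pbest = pos.copy()                                       # personal bests
    pbest_val = np.array([objective(p) for p in pos])
    gbest = pbest[pbest_val.argmin()].copy()                 # global best
    for _ in range(iters):
        r1, r2 = np.random.rand(2)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        vals = np.array([objective(p) for p in pos])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest  # state vector used to guide the agents' exploration
```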
Allocation method based on MPSO-SA-DQN
The design of environment for multi-UAV task assignment
State-space design
In this paper, the state space S is divided into three parts: the current agent state information
Where,
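A minimal sketch of how such a composite state vector might be assembled is shown below; the part dimensions and field contents are hypothetical.

```python
import numpy as np

def build_state(agent_info, task_features, global_env):
    """Hypothetical assembly of the state vector from the three parts
    named above; the concrete fields of each part are not specified here."""
    return np.concatenate([agent_info, task_features, global_env])

# e.g. a UAV's own status, per-task priority/completion features,
# and the globally shared environment information:
s = build_state(np.zeros(4), np.zeros(8), np.zeros(6))   # 18-dim state
```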
Design of action space
The action space of a UAV is continuous when performing a task. However, when assigning tasks to an agent, the action space needs to be discretized to match the state space designed in this paper. When the action space is initialized, a discrete representation is generated based on the number of task goals and subtasks. Specifically, in this paper the action space is defined as a collection of subtask sequences.
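As an illustration of this discretization (the subtask count and sequence length below are hypothetical, not the paper's own example):

```python
from itertools import permutations

# Hypothetical construction: with the task targets split into numbered
# subtasks, each discrete action is one executable subtask sequence.
subtasks = list(range(4))                            # subtask IDs 0..3
actions = [list(p) for p in permutations(subtasks, 2)]
# actions[0] == [0, 1]: execute subtask 0, then subtask 1
print(len(actions))                                  # 12 discrete actions
```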
Design of reward function
The purpose of the algorithm can be described as follows: the total time to perform the tasks should be as short as possible, and the completeness of the tasks should be as high as possible. So the reward is set to
Where,
Where,
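A minimal sketch consistent with these two goals is shown below; the linear form and the weights are assumptions, since the exact reward formula is not reproduced here.

```python
def reward(completeness, elapsed_time, w1=1.0, w2=0.1):
    """Illustrative reward: task completeness is rewarded, execution
    time is penalized. The weights w1, w2 and the linear combination
    are hypothetical, not the paper's exact formula."""
    return w1 * completeness - w2 * elapsed_time
```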
Update of the strategy function
Each state vector in this paper is described by a defined set of values. In order to quantify the cumulative reward value of a state vector at time t, a discount factor needs to be considered. In this paper, the discount factor is denoted by the symbol
At the current moment, the optimal action is:
Where:
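For reference, the standard DQN-family definitions of the discounted return and the greedy optimal action are shown below, assuming the discount factor is the conventional $\gamma$; the paper's own notation is not reproduced here.

```latex
G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1},
\qquad
a_t^{*} = \operatorname*{arg\,max}_{a \in A} Q(s_t, a)
```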
In this paper, the loss function is used to continuously optimize the parameters of the target network so as to keep them updated. The loss function is constructed as follows:
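A reference PyTorch sketch of a loss of this kind, the standard temporal-difference loss against a frozen target network, is shown below; the batch layout and the use of mean squared error are assumptions.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Standard DQN TD loss against a frozen target network; the paper's
    own loss is of this general form, but its exact terms are not
    reproduced here."""
    s, a, r, s_next, done = batch        # tensors; a is int64, done is float
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a)
    with torch.no_grad():                # target network is not updated here
        q_next = target_net(s_next).max(dim=1).values
        target = r + gamma * (1.0 - done) * q_next
    return F.mse_loss(q, target)
```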
Environmental design
The step() function is one of the most important modules in the environment design; it acts as the physics engine for deep reinforcement learning in this paper. The function takes as input the action
The state of the next moment
In this paper, the state of the agent mainly consists of the sequence of tasks executed by the UAV, the task completion, and the task priority. Based on the mission start time and the characteristics of the UAV, the agent dispatches the UAV to execute the mission. At the current time, the mission sequence executed by the UAV is evaluated and optimized based on threat assessment, range constraints, and damage gain to obtain a new mission sequence and the priority of the current mission.
Reward for the action
In the experiment, the rewards obtained when the agent performs the task are set as follows: if the agent takes too long to perform the task or the degree of task completion is too low, a punishment is given according to the design of the reward space.
Signal for the termination of training
During training, if the agent is unable to complete the task for a long period of time, it is determined that the agent cannot complete the set task sequence, and the training is terminated. Otherwise, the agent continues to train until the termination condition is reached.
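A gym-style sketch of this contract is given below; the internal helpers (execute_subtasks, evaluate_priorities, encode_state, compute_reward, all_done, stalled_too_long) are hypothetical names that stand in for the mechanisms described in the three subsections above.

```python
class TaskAllocationEnv:
    """Hypothetical environment shell; only the step() contract is sketched."""

    def step(self, action):
        # 1. Apply the chosen subtask sequence and advance the simulation.
        self.task_state = self.execute_subtasks(action)
        # 2. Re-evaluate task priorities from threat assessment,
        #    range constraints, and damage gain.
        self.priorities = self.evaluate_priorities(self.task_state)
        next_state = self.encode_state(self.task_state, self.priorities)
        # 3. Reward: penalize overlong execution or low task completion.
        reward = self.compute_reward(self.task_state)
        # 4. Terminate when all tasks are done or the agent is judged
        #    unable to complete the task sequence.
        done = self.all_done() or self.stalled_too_long()
        return next_state, reward, done, {}
```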
Strategy of real-time priority state space autoencoder
Because the dimension of the state space is high, excessive calculation makes training the agent expensive and time-consuming. Therefore, a state-space autoencoder based on real-time priority is designed. Specifically, a feature expression network based on an artificial neural network is adopted, with a three-layer design consisting of two parts: the encoder and the decoder. The encoder is the mapping from the input signal to the output representation; the decoder is the reverse mapping from the representation back to the input, reconstructing it. According to the characteristics of the input data in this paper, tanh is selected as the activation function. Autoencoder training is the process of minimizing the reconstruction error function, which is defined as:
Among them:
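A PyTorch sketch of such a three-layer tanh autoencoder is shown below; the layer widths are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PriorityStateAutoencoder(nn.Module):
    """Tanh encoder-decoder pair as described above; the state, hidden,
    and code dimensions are illustrative assumptions."""
    def __init__(self, state_dim=64, hidden_dim=32, code_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, code_dim), nn.Tanh())
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, state_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Training minimizes the reconstruction error, e.g. mean squared error:
# loss = nn.functional.mse_loss(model(x), x)
```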
UAV task assignment method based on MPSO-SA-DQN
In order to solve the problems of complex time constraints and an uncertain solution space, this section proposes a multi-UAV task assignment and iterative optimization method based on MPSO-SA-DQN. The method uses the simulated annealing mechanism to shorten the time spent on invalid iterations of deep reinforcement learning, encodes the task goals through the real-time priority state-space autoencoder strategy, and then optimizes the agents' execution plans in the complex environment with the multi-dimensional particle swarm optimization algorithm. Thus, it updates the states of the agents while continuously learning iteratively, until the optimal task planning solution is found. The steps of the MPSO-SA-DQN-based UAV task assignment method are as follows.
Steps of UAV task assignment method based on MPSO-SA-DQN
Input:
Step1: Initialize the task assignment environment and restart the initial state
Step2: Determine whether the temperature has reached equilibrium, if “Yes” execute Step1, if “No” execute Step4.
Step3: In the current state
Step4: Combine the actions of each UAV into an action vector
Step5: Store the tuple
Step6: Whether the number of tuples exceeds 200, if "Yes," execute Step7, if "No," execute Step3 (see the memory-pool sketch after these steps).
Step7: Learning.
Step8: Randomly fetch a set of UAV state tuple data from the memory pool.
Step9: According to the objective function, judge whether the current state
Step10: Determine whether the current round is over, if “Yes,” then
Step11: Multi-UAV task allocation scheme.
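A minimal sketch of the memory pool and the 200-tuple warm-up named in Step6 is given below; the pool capacity, batch size, and agent interface are assumptions.

```python
from collections import deque
import random

MEMORY = deque(maxlen=10_000)     # memory pool; the capacity is illustrative
WARMUP = 200                      # learning starts after 200 tuples (Step6)

def maybe_learn(agent, batch_size=32):
    """Sample uniformly from the pool only once the warm-up threshold has
    been passed (Steps 6-8); the agent.update interface is hypothetical."""
    if len(MEMORY) < WARMUP:
        return
    batch = random.sample(list(MEMORY), batch_size)
    agent.update(batch)           # one gradient step on the sampled tuples
```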
Simulation experiment and analysis
Experiment setting
In order to demonstrate the results of the designed method intuitively, a combat simulation scenario of UAV ground attack is designed. The experiment is set up as follows: the ground battlefield environment is generated with Python Tk,39,40 which simulates the environmental changes on the map and randomly generates 30 mission targets. When the ground attack starts, the number of corresponding task targets is obtained by counting the tasks completed by each UAV. The metrics for evaluating the performance of each UAV during the ground attack are the number of tasks completed by a single UAV over time, the degree of damage to the target, the risk benefit, and the path cost function. The standard is the highest task volume per unit time, the highest damage coefficient, the lowest path cost, and the smallest risk coefficient.
Construction of simulation platform
In order to verify the effectiveness and feasibility of the method, a simulated map environment is built. There are 30 tasks and 5 UAVs generated on the map. Table 1 shows the setting of the tasks in the experiments.
Table 1. Setting of subtasks.
The experiment uses TensorFlow and PyTorch to implement the neural networks of deep evolutionary reinforcement learning. The algorithm is trained for 350 steps, where one step is the process from the beginning to the end of the task. The reward value accumulated in each step of the algorithm is shown in Figure 3:

Figure 3. The cumulative reward value (r) of each cycle of the multi-agent system.
As shown in Figure 3, the algorithm is in its exploratory stage early in training: the cumulative reward value increases rapidly once training begins. In the middle period, the cumulative reward value gradually stabilizes with slight oscillations, because the algorithm updates its parameters periodically during the process. By the 90th training cycle it has stabilized and shows a slight upward trend; at this point each agent has learned and applies the strategy of dynamic multi-UAV task allocation. As shown in Figure 4, the SA-DQN strategy shows its advantage over the course of training: its convergence becomes gradually concentrated, whereas the plain DQN strategy is more scattered and converges worse.

Figure 4. Comparative loss function curves of SA-DQN and DQN per training cycle for the multi-agent system.
As can be seen from the loss curves in Figure 5, with the greedy value fixed at 0.9 and the learning rate varied, the loss curves start from a high point. A learning rate of 0.1 clearly performs best: it eventually converges with the smallest fluctuations, the fluctuation error settling at around 0.1, while the error fluctuations of the other curves are around 1.

Figure 5. Comparison curves of loss with different hyperparameters.
The completion of all tasks performed by the UAVs is shown in Figure 6. With the same learning rate, the smaller the greedy coefficient, the less definite the UAVs' task completion becomes. However, when the greedy coefficient is too large, it adds extra loss, so a moderate value of 0.9 is chosen for the greedy coefficient.

Figure 6. Comparative plot of overall task completion with different hyperparameters.
Comparative analysis of mission planning generation
Based on the optimal multi-UAV mission execution scheme, the multi-UAV mission planning results are finally demonstrated, assuming that the time needed to complete each task (the process time) is known; the real process time is the simulated mission execution time based on the time needed for the task. The mission planning method proposed in this paper (MPSO-SA-DQN) is compared with a hybrid particle swarm method (MPSO) and a greedy-algorithm-based task planning method (greedy) in terms of task completion, as shown in Figure 7:

Figure 7. Comparison chart of task completion effects (time unit: h).
In Figure 7, the vertical coordinate indicates the completion degree of the task (1 means completed) and the horizontal coordinate indicates the time (h). As can be seen, at 40 h the task completion of the multi-UAV task planning method based on reinforcement learning with multi-dimensional particle swarm has begun to converge, indicating that the overall task completion has reached its highest level and that the UAVs have cooperated to complete the tasks. However, the completion of MPSO and greedy has not yet reached 50% (0.5), indicating that the proposed method is better.
Figure 8 shows the desired time and the simulated time for the UAVs to perform tasks at the end of the first round. It shows that the time to complete all tasks is about 550 h, far from the desired time. Figure 9 likewise suggests that the actual completion time of the UAVs' initial task execution is too long and the real-time performance is low. Figure 6 shows that, after training, the UAVs have learned the task execution strategy. Furthermore, Figure 10 shows that the actual and planned times of each task approach each other, and the errors in Figure 11 show that the actual mission completion time is relatively accurate and the real-time performance has improved. Taking the 15th task as an example, before training the actual time used for the task exceeds 100 h; the UAV group depends only on task completion and, as can be seen from Figure 8, its execution times are relatively scattered, with long-duration and short-duration tasks performed alternately. After training is completed, the time is reduced to less than 100 h. These results show that the multi-agent system has adapted to the environment and applies the designed algorithm.

Figure 8. Planned and actual time of the UAV missions in the first round.

Figure 9. Error plot of the actual task execution time in the first round.

Figure 10. Planned and actual time of the UAV mission plan after the completion of training.

Figure 11. Error plot of the actual task execution time at the end of training.
Figure 12 suggests that, before training begins, the error between the time the UAV takes to perform a task and the desired time is large, and the real-time performance of the task cannot be guaranteed. After training, the time error becomes smaller, which shows that the proposed method has a significant effect on the execution time of UAV tasks. Furthermore, Figure 13 shows that the time for the UAVs to complete tasks also differs greatly before and after training. Although the time error of the 8th task is large, the completion-time errors of the other tasks all lie within [0, 1]. From Figure 14, we can see that the 8th task ended early because it started early, so its error is relatively large.

Figure 12. Error plots of UAV mission execution time.

Figure 13. Time error plot of UAV mission completion.

Figure 14. Planned and actual start times of task execution.
The evolution curves of the objective functions of multi-UAV task execution are shown in Figures 15–17. The multi-UAV task allocation method can quickly find the optimal allocation scheme in a complex battlefield environment while maintaining a certain stability in real-time task allocation. This enables the UAVs to achieve the maximum damage degree with the shortest path cost and the lowest risk. Besides, one of the damage benefit curves in Figures 15–17 is a straight line, which indicates that the UAV found the optimal state by interacting with the environment from the beginning; it achieved the maximum damage benefit and therefore remained unchanged at 11.0. After training with MPSO-SA-DQN, the optimal value of each UAV, the optimal path cost, and the optimal task allocation scheme are shown in Table 2.

Figure 15. Variations in risk benefit, path benefit, and damage of the first and second UAVs.

Figure 16. Variations in risk benefit, path benefit, and damage of the third and fourth UAVs.

Figure 17. Changes in risk benefit, path benefit, and damage of the fifth UAV.
Table 2. The optimal task allocation scheme for each UAV.
The multi-UAV mission planning method based on the improved multi-dimensional particle swarm operates under fixed target attributes and relatively stable target threats, whereas the MPSO-SA-DQN-based method operates under dynamically changing target attributes, constantly changing target threat attributes, and target values that change over time. In the deep reinforcement learning algorithm, the multi-dimensional particle swarm algorithm is used for state-action space optimization in the environment design. The multi-UAV mission planning method based on MPSO-SA-DQN can solve more problems, is more accurate, and adapts better to dynamic changes in the battlefield environment. Table 3 lists the threat coefficients and destruction levels obtained from the simulation of the two methods.
Table 3. Comparison of simulation results of the two algorithms.
Table 3 shows that the threat coefficient obtained by the MPSO-SA-DQN-based multi-UAV task allocation method is smaller than that of the method based on the multi-dimensional particle swarm algorithm. In terms of destruction degree, the results of the deep reinforcement learning-based method are more even and stable, indicating that its strategy is more practical and more transferable, and its task allocation scheme is more reasonable. In contrast, the method based on the improved multi-dimensional particle swarm yields two extremes of destruction, poor stability, and a poor ability to cope with real-time dynamic changes.
Therefore, in more complex situations, the reinforcement learning-based multi-UAV task allocation algorithm has stronger problem-solving ability than the algorithm based on the multi-dimensional particle swarm. In a complex environment where target attributes change over time, the task scale is large, and there are more than 20 UAVs (as shown in Figure 16 and Table 3), the improved deep reinforcement learning algorithm achieves higher task completion and higher efficiency than conventional algorithms, so the multi-UAV task allocation method based on deep reinforcement learning should be adopted. Conversely, in simpler scenarios, the multi-UAV task allocation method based on the multi-dimensional particle swarm can be adopted.
To exhibit the advantages of the proposed method, we conduct comparison experiments. The compared algorithms include the traditional greedy algorithm (greedy), genetic algorithm (GA), simulated annealing algorithm (SA), ant colony algorithm (ACO), and hybrid particle swarm algorithm (MPSO). The comparison curves of task completion over time are shown in Figure 18.

Figure 18. Task completion versus time when the number of UAVs is 20 (time unit: h).
Figure 19 shows the completeness of each task over training steps with 30 subtasks during task execution. The "90 SA-MPSO-DQN" curve indicates the change in task completion over time after ninety rounds of training, which shows that as the number of training rounds increases, the tasks are completed faster. As the training steps increase, more tasks are introduced, and a newly issued task initially has a completeness of 0. Therefore, at time step 40, the greedy, simulated annealing, and ant colony algorithms show a slight downward trend in overall task completeness due to the issuance of new tasks; as execution continues, the overall completeness gradually resumes its upward trend. In the early stage of task execution, the algorithm in this paper considers the time constraints and dynamically allocates tasks during execution, so its task completeness is significantly higher than that of the other algorithms. This indicates that the proposed algorithm can improve the overall task completeness of the system when tasks are issued at any time.

Figure 19. The completeness of tasks over time (time unit: h).
Conclusions
In this paper, a multi-UAV task assignment method based on MPSO-SA-DQN is proposed for multi-UAV systems facing variable environmental factors, complex time constraints, and an uncertain solution space. The design of the algorithm covers three aspects. Firstly, this paper constructs a multi-agent environment optimized by a swarm intelligence algorithm to dynamically constrain the agents' state-action space; given the complex battlefield environment, the agents are made to intelligently decide the best-adapted state under the current action. Secondly, based on the time-varying nature of UAV missions, a multi-agent dynamic training method based on the idea of simulated annealing is designed under the influence of agent-environment interactions; the method can be used in a state-oriented multi-agent global environment. Thirdly, this paper proposes a real-time prioritized state-space autoencoder strategy, which combines the characteristics of multiple UAVs to optimize the state space and make the algorithm more efficient.
In addition, the multi-UAV task allocation method is designed from two aspects. First, agents trained by DQN and its related algorithms can be used to solve the UAV task allocation problem; compared with traditional methods, this not only avoids complex constraints and tedious solving processes, but can also continuously generate dynamic assignment schemes as the states are refreshed. Second, because of the complexity and variability of the task environment, the agents are driven to dynamically find the optimal dimension for task execution based on the changing target values of the environment. The designed method can solve the dynamic optimization problem of mission planning under multiple constraints and achieve multi-UAV mission planning.
Finally, simulation experiments are carried out according to the algorithmic improvements and the designed multi-UAV mission planning method. The results show that the method can design a better multi-UAV mission execution scheme under variable environmental factors, complex time constraints, and spatial uncertainty. However, the problem of uneven task distribution remains in multi-UAV collaboration, and future research can improve the multi-agent strategy design.
Compared with other algorithms, the method achieves strong real-time capability: it continuously interacts with the environment, learns iteratively, and gradually moves from knowing the environment to adapting to it and forming a planning scheme. However, as the agents learn, the database capacity constantly expands; when the amount of data grows, the policy network is not stable enough and the algorithm's running efficiency drops. Therefore, applying the MPSO-SA-DQN method in military or civilian UAV operations may face challenges in the timeliness of training, selection, and computation over massive data, as well as in backup resource regulation. Reinforcement learning algorithms with more stable policy networks should be combined with multi-dimensional particle swarm algorithms to improve the execution efficiency of the method.
Although good results have been achieved in the current work, some shortcomings remain. As the scale of the task and the number of UAVs increase, the algorithm's solution speed gradually decreases, the interaction between the agents and the environment becomes limited, and the interaction rate of the environment built by the multi-dimensional particle swarm becomes lower. A more intelligent algorithm should be used in place of the multi-dimensional particle swarm algorithm to rebuild the multi-agent operating environment, so that the algorithm continually learns the commonalities of the tasks performed by the UAVs and continually generates solutions.
Acknowledgements
Throughout the writing of this dissertation I have received a great deal of support and assistance. First of all, I would like to express my sincere gratitude to my supervisor, Mr Peng Peng Fei, whose advice and encouragement have given me a deeper understanding of these AI studies. It is my great honor and joy to study under his guidance and supervision. In addition, his charisma and diligence have been a privilege that I will cherish for the rest of my life. There are no words to express my gratitude to him. I am also very grateful to my elder brother, Fan Linkun, and my elder sister, Zheng Yalian, who have given me a lot of help and companionship in the process of preparing this thesis. They helped me to revise this thesis meticulously. In addition, I would like to thank the fund support provided by the National Experimental Teaching and Teaching Laboratory Construction Project, the Development, Application and Research of Digital Resources of a Certain Equipment in Experimental Teaching, and the Equipment Development Research Project, which made it possible for my research to be sustained.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Data Availability Statement
The use of these data is strictly restricted to research and educational purposes; use for commercial purposes is forbidden.
