Abstract
This article addresses the recovery flight problem of flapping-wing micro-aerial vehicles under extreme attitudes using a reinforcement learning approach. First, a reinforcement learning-based control policy is proposed to enable the flapping-wing micro-aerial vehicles to recover flight rapidly while keeping the angular acceleration as small as possible. Then, a hybrid control approach is designed to significantly improve flight stability by combining the reinforcement learning-based control approach with the proportional-derivative control approach. Finally, simulation results demonstrate the effectiveness of the reinforcement learning-based method and the hybrid control method for flapping-wing micro-aerial vehicles under extreme attitudes.
Introduction
Recently, flapping-wing micro-aerial vehicles (FWMAVs) have attracted much attention from researchers and engineers due to their exceptional stealth, excellent aerodynamic efficiency, and maneuverability.1–5 Compared with traditional fixed-wing or rotary-wing aerial vehicles, FWMAVs can efficiently perform agile flight maneuvers, which makes them well suited to special application scenarios.6,7 For example, Han et al.2 designed an eagle-like flapping-wing robot with a vision system and a flight control system for environmental monitoring, and Wu et al.7 proposed a servo-driven bird-like flapping-wing robot to conduct an outdoor airdrop mission.
It is worth mentioning that FWMAVs are studied from two aspects: one is structural design,8,9 while the other focuses on controller design.10,11 Structural design involves the wing motion mode, the wing structure, the torque-generation structure, and so on.8,9,12,13 For example, Wang et al.8 presented a bat-inspired flapping-wing aircraft model that combines the flexible wing and tail design of a bat; its advantages over conventional rigid-wing aircraft are higher maneuverability and flight efficiency, especially in dynamic environments. Hou et al.9 designed a bio-inspired smart-winged microlight based on a wing mimicking a scarab beetle's wing, which integrates aerodynamics, sensory functions, and power generation for environmental monitoring. Ishiguro et al.12 investigated the optimal wing size for soft wings in flapping microlights and found that chordwise wing veins produce more force than spanwise wing veins. However, it is hard to achieve stable flight by structural design alone. One possible solution is to design an attitude controller for the FWMAVs.14–20 For example, Guo et al.16 employed the proportional-integral-derivative (PID) algorithm to stabilize the attitude within a small range. Nian et al.18 proposed a cascaded proportional-integral (PI) controller based on wing-tail interaction and aerodynamic-dynamic coupling to further enhance the control performance in complex flight tasks. However, the PID algorithm is sensitive to parameter changes and can be significantly influenced by disturbance variations. Some improved approaches have been proposed.20 For example, Ferdaus et al.20 introduced a neuro-fuzzy controller to deal with the problem of parameter changes.
It should be pointed out that there are two main limitations in using traditional algorithms to realize complex flight skills of FWMAVs. The first limitation is that traditional control algorithms rely on real-time and efficient updates of the control reference. Complex flight trajectories are often difficult to plan, and in some cases a special flight state is prone to violate the requirements of a traditional control algorithm. In this case, the algorithm is liable to accumulate errors relative to the control reference, eventually leading to loss of control. The second limitation is that, for model-based traditional control algorithms, part of the dynamic parameters of the FWMAVs are generally obtained by fitting data under specific conditions, as in the quasi-steady-state model. When the state of the FWMAV deviates from this specific situation during complex flight, the dynamic parameters change. As a result, the resulting simplified model struggles to predict and control the state of the FWMAV.
To address the aforementioned limitations, the model-free reinforcement learning (RL) method can be introduced,21 which represents an end-to-end strategy optimization and does not rely on a model of the FWMAV. RL uses trial and error to find, at a given state, the action that improves the long-term return, instead of selecting actions that follow a control reference. In addition, the data-driven nature of RL allows the policy to be insensitive to dynamic parameter changes. Hence, the RL-based method can be applied to FWMAVs to deal with both limitations.
In fact, there are some research efforts on RL-based policies for the agile movement of unmanned aerial vehicles (UAVs).22–26 For example, a control strategy based on proximal policy optimization (PPO) was utilized to map Kalman filter estimates to output control commands, where real-world data were used to train the PPO algorithm in order to mitigate discrepancies.24 By combining a deep deterministic policy gradient (DDPG) algorithm with a non-dominated sorting approach, a new method was proposed to achieve vertical flight of planar quadcopters for the first time in a simulation environment.25 The PPO approach was extended to enable rapid escape maneuvers of FWMAVs within a short time, effectively avoiding the challenges posed by acrobatic flight under model errors.26 Thus, RL-based methods for UAVs can serve as references for the controller design of FWMAVs.
Notice that extreme attitude is a special case in which the FWMAV has a large pitch angle and random velocities. This case often occurs under hand launching, strong winds, or collisions. For the FWMAVs, the large attitude angles and angular velocities in extreme attitudes make effective control by traditional controllers difficult. Moreover, the aerodynamic characteristics under extreme attitudes change significantly, which increases the model uncertainty and the difficulty of control. In addition, fast control is in high demand: the FWMAV needs to correct its attitude within a very short time, otherwise control may be lost. Therefore, how to design an RL-based policy for recovery flight of FWMAVs under extreme attitudes, dealing with the issues mentioned above, motivates the current study.
In this article, we design an RL-based control approach for the recovery flight of the FWMAVs under extreme attitudes and develop a hybrid control strategy to improve control stability after attitude recovery. The contributions are summarized as follows.
(1) An RL-based control approach is designed to solve the recovery flight problem of the FWMAVs under extreme attitudes; the proposed approach can recover the attitude within 1.5 s. (2) A hybrid control approach combining the RL-based controller with a PD controller is proposed, which maintains stable flight after attitude recovery. (3) The two approaches are validated by simulation results, which show that the proposed controllers can drive the FWMAV to complete the recovery flight under extreme attitudes and maintain sustained flight.
This article is organized as follows. The "Preliminaries" section describes the dynamics model of FWMAVs. The "RL-based controller" section presents the RL-based control policy and proposes the hybrid control approach. The "Simulation results and discussion" section discusses the control effectiveness of the RL-based control policy for recovery flight under extreme attitudes and validates the effectiveness of the hybrid control approach for sustained flight. Finally, the "Conclusion" section concludes this work.
Preliminaries
In this section, we briefly describe the dynamics of FWMAVs, which is used in the following simulations. The configuration is similar to that of the "Nimble,"27 shown in Figure 1, which can flap its left and right wings separately. The "Nimble" exhibits strong maneuverability owing to its ability to achieve stability without a tail wing.

Flapping-wing micro-aerial vehicle (FWMAV) platform “Nimble.”
Dynamics of wing-actuator
In the longitudinal dynamics plane, only the thrust and the pitching torque produce control effects. The corresponding actuators are the wing flapping actuator and the dihedral angle actuator. Considering the actual performance limitations of the actuators, the wing flapping is modeled as a first-order system as follows:
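A first-order lag of the following form is consistent with this description (a sketch in our own notation: $A$ is the flapping amplitude, $A_{\mathrm{cmd}}$ the commanded amplitude, and $\tau_A$ the actuator time constant; the article's symbols may differ):

$$\dot{A}(t) = \frac{1}{\tau_A}\left(A_{\mathrm{cmd}}(t) - A(t)\right)$$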
The dynamics of the dihedral angle actuator is given by the following equation:
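An analogous first-order servo model is plausible for the dihedral angle (again a sketch with assumed notation $\delta$, $\delta_{\mathrm{cmd}}$, and $\tau_\delta$):

$$\dot{\delta}(t) = \frac{1}{\tau_\delta}\left(\delta_{\mathrm{cmd}}(t) - \delta(t)\right)$$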
Force and torque modeling
The "Nimble" generates pitch torque by varying the position offset of the force produced by its two wings. Moreover, the magnitude of the produced thrust can be altered by adjusting the flapping amplitude.
Longitudinal dynamics of the FWMAV
In this article, in accordance with the control objective of recovering flight from an extreme initial pitch attitude, a simplified longitudinal dynamics model is formulated as follows:
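A planar rigid-body model of the following shape matches this description (a sketch with assumed notation: $m$ is the mass, $g$ the gravitational acceleration, $I_y$ the pitch inertia, $T$ the thrust from the flapping wings, $\tau$ the pitch torque, and $\theta$ the pitch angle; the article's equation may include additional aerodynamic terms):

$$
\begin{aligned}
m\ddot{x} &= -T\sin\theta,\\
m\ddot{z} &= T\cos\theta - mg,\\
I_y\ddot{\theta} &= \tau.
\end{aligned}
$$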
RL-based controller
This section introduces the RL-based controller for arbitrary initial attitudes based on the model-free PPO algorithm. Then, a hybrid control approach is proposed by combining the RL-based controller with a PD controller for subsequent stable flight, significantly enhancing both persistence and stability.
Task statement
In this work, our aim is for the FWMAV to achieve recovery flight from extreme conditions within a restricted area. The FWMAV initially flies with a pitch angle in
RL-based controller
Proximal policy optimization (PPO): In RL, the agent chooses the next action based on the current state and its policy. The environment then rewards the agent and updates the state through a transition model. This continuous interaction between the agent and the environment can be formulated as a Markov decision process (MDP). PPO is a policy iteration algorithm (see Hoang et al.23 and the references therein). Traditional policy gradient algorithms may suffer from aggressive policy updates, which lead to dramatic performance fluctuations. To mitigate such policy update oscillations, PPO employs two separate policy networks to represent the current and previous policies, and importance sampling is introduced to calculate the ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ between them. To further limit the magnitude of policy updates, the clipped objective function is obtained by the following equation:

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right] \qquad (13)$$

where $\hat{A}_t$ is the advantage estimate and $\epsilon$ is the clipping parameter. PPO updates the policy by maximizing the objective function (13) with multiple steps of stochastic gradient descent (SGD). In practical training, the PPO algorithm uses the actor-critic (AC) framework for iterative updates. The AC neural networks are illustrated in Figure 2. The actor NN selects and carries out actions, and the critic NN evaluates them.

State and action: Here, we assume that the state of the FWMAV is observable. It should be noted that the state space does not exactly correspond to the observable state of the FWMAV mentioned above. As part of the input vector to the actor and critic NNs in the PPO algorithm, the state space has a significant influence on the policy's ability to learn complex rules. This article focuses on stabilizing the flight of the FWMAV within a confined area; the state and the action are defined accordingly.

Reward function: In this article, the reward function follows a straightforward design principle, namely, to minimize ambiguity as much as possible. The reward function is given by the following equation:
It is worth mentioning that the reward function needs to accurately reflect the attitude deviation, energy consumption, and flight stability of the vehicle, so that the FWMAV can quickly reach a stable flight state from an extreme attitude. For this reason, the reward function contains a penalty term
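A reward with these ingredients might look like the following minimal sketch; the weights and the out-of-area penalty value are hypothetical, not the values used in the article:

```python
import numpy as np

# Minimal sketch of a reward reflecting attitude deviation, energy
# consumption, and flight stability, plus a penalty term for leaving the
# restricted area. All weights are hypothetical.
def reward(theta, omega, action, out_of_area,
           w_att=1.0, w_eng=0.05, w_stab=0.1, penalty=10.0):
    r = -w_att * abs(theta)                        # attitude deviation
    r -= w_eng * float(np.sum(np.square(action)))  # energy consumption
    r -= w_stab * abs(omega)                       # stability (pitch rate)
    if out_of_area:                                # restricted-area penalty
        r -= penalty
    return r
```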

The neural networks of the proximal policy optimization (PPO) algorithm.
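For concreteness, the clipped objective in equation (13) can be computed as in the following minimal sketch, written in PyTorch (our choice of framework; the article does not specify one):

```python
import torch

# Minimal sketch of the clipped surrogate objective in equation (13).
# log_prob_new/log_prob_old are log-probabilities of the taken actions under
# the current and previous policies; advantage is the estimated advantage.
def ppo_clip_loss(log_prob_new, log_prob_old, advantage, eps=0.2):
    ratio = torch.exp(log_prob_new - log_prob_old)  # importance sampling ratio
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Maximizing the clipped objective equals minimizing its negative mean.
    return -torch.min(unclipped, clipped).mean()
```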
Hybrid control approach
The FWMAV completes the attitude recovery flight within a restricted area based on the RL-based controller. However, the FWMAV is prone to reaching states that exceed the restricted area during flight, and the RL-based controller may choose poor actions in states outside the training state distribution. Thus, undesirable control can be generated due to the lack of sufficient information in the training process. To enhance the performance of the FWMAV in particular scenarios with limited training data, a hybrid control approach is designed by combining the RL-based controller with a PD controller, as shown in Figure 3.

The hybrid control framework, where the PD controller is employed when the FWMAV finishes the recovery task. PD: proportional-derivative; FWMAV: flapping-wing micro-aerial vehicle.
In this figure, the dynamics module estimates the state after the RL strategy chooses the action. Then, the PD controller uses the error between the estimated state and the target state to calculate the control output
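A minimal sketch of this switching logic is given below; the recovery-test thresholds, the PD gains, and the scalar control output are our assumptions, with rl_policy and dynamics standing in for the trained policy and the dynamics module:

```python
# Minimal sketch of the hybrid framework in Figure 3. The thresholds and
# PD gains are hypothetical; rl_policy and dynamics are placeholders for
# the trained policy and the dynamics module described above.
def hybrid_control(state, target, rl_policy, dynamics,
                   kp=2.0, kd=0.5, theta_tol=0.1, omega_tol=0.5):
    theta, omega = state["theta"], state["omega"]
    recovered = abs(theta) < theta_tol and abs(omega) < omega_tol
    if not recovered:
        # Recovery phase: the RL policy acts directly.
        return rl_policy(state)
    # Sustained-flight phase: estimate the next state under the RL action,
    # then let the PD controller correct it toward the target state.
    predicted = dynamics(state, rl_policy(state))
    error = target["theta"] - predicted["theta"]
    d_error = target["omega"] - predicted["omega"]
    return kp * error + kd * d_error
```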
Simulation results and discussion
In this section, we illustrate the test results of the RL-based control strategy for the recovery flight problem of the FWMAV under extreme attitudes. To highlight the performance, we compare the proposed control strategy with a traditional PID controller in two different scenarios. Finally, we test the proposed hybrid control approach to show its effectiveness for subsequent flight tasks.
Simulation environment
In the following simulations, we employ the interface provided by OpenAI Gym to build the RL training environment, where the Bullet physics engine is used to render the scene and evaluate the policy.
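A skeleton of such an environment, built on the classic gym.Env interface, is sketched below; the dimensions, bounds, and toy dynamics are illustrative placeholders rather than the article's actual implementation:

```python
import numpy as np
import gym
from gym import spaces


class FWMAVRecoveryEnv(gym.Env):
    """Skeleton training environment; dimensions, bounds, and the toy
    dynamics below are illustrative placeholders, not the article's model."""

    def __init__(self, dt=1.0 / 200.0):  # 200 Hz policy update
        self.dt = dt
        # Actions: flapping command and dihedral-angle command, normalized.
        self.action_space = spaces.Box(-1.0, 1.0, shape=(2,), dtype=np.float32)
        # State: x, z, vx, vz, pitch angle, pitch rate.
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(6,),
                                            dtype=np.float32)
        self.state = np.zeros(6, dtype=np.float32)

    def reset(self):
        # Randomized initial state to diversify training samples.
        self.state = np.random.uniform(-1.0, 1.0, size=6).astype(np.float32)
        return self.state.copy()

    def step(self, action):
        # Placeholder Euler step; a real implementation would call the
        # simplified longitudinal dynamics model here.
        self.state[:2] += self.dt * self.state[2:4]
        self.state[4] += self.dt * self.state[5]
        self.state[5] += self.dt * float(action[1])
        reward = -abs(float(self.state[4]))  # penalize pitch deviation
        done = bool(np.any(np.abs(self.state[:2]) > 5.0))  # left the area
        return self.state.copy(), reward, done, {}
```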
In the training process, the minimal longitudinal dynamics model is used to update the state information of the FWMAV. The control policy update frequency is set to 200 Hz, which ensures consistency with the real physical platform. The model parameters of the FWMAV are presented in Table 1. From this table, one can see that the mass of the FWMAV is set as
Model parameters of the flapping-wing micro-aerial vehicle (FWMAV).
In order to enhance the diversity of samples and prevent the policy from getting stuck in local optima, the initial state of the FWMAV is randomized
The parameter settings of the PPO algorithm are listed in Table 2. From this table, one can see that the learning rate is 0.0003, the batch size is 64, and the total timestep is
Hyper-parameters for reinforcement learning.
It is pointed out that model nonlinearities, model uncertainties, external disturbances, and input constraints are simulated by adding random values to the actions and states in the simulation environment. This allows us to simulate a range of possible real-world scenarios and to test the robustness of the proposed method under varying conditions. A PID control approach is used as the comparison algorithm.
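Such disturbance injection can be as simple as the following sketch, where the noise scales and bounds are hypothetical:

```python
import numpy as np

# Illustrative disturbance injection: random perturbations on actions and
# states stand in for nonlinearities, uncertainties, and disturbances.
# rng is a np.random.Generator, e.g. np.random.default_rng(0).
def perturb(action, state, rng, action_noise=0.05, state_noise=0.01,
            action_low=-1.0, action_high=1.0):
    noisy_action = action + rng.normal(0.0, action_noise, size=action.shape)
    noisy_action = np.clip(noisy_action, action_low, action_high)  # input constraint
    noisy_state = state + rng.normal(0.0, state_noise, size=state.shape)
    return noisy_action, noisy_state
```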
Recovery flight
In Figure 4, the comparison results of the RL-based controller and the PID controller are shown for the recovery flight of the FWMAV under the same initial pitch angle of

Results for the FWMAV comparing the RL-based controller (blue line) with the PID controller (red line) under the same initial condition. (a) The movement trajectories of the FWMAV at the
In order to understand the impact of different situations on extreme attitude recovery, Figures 5 to 7 show the success rates of the FWMAV controlled by the RL-based controller for different velocities along the x-axis, velocities along the z-axis, and angular velocities, respectively. It is worth mentioning that if the FWMAV can still remain in the restricted flying area within 2 s, the recovery flight task is considered successful. Note that the range of initial velocity values is set as [

Success rates for the different pitch angles and velocities at the x-axis.

Success rates for the different pitch angles and velocities at the z-axis.

Success rates for the different pitch angles and angular velocities.
Figure 8 shows the FWMAV’s attitude recovery under pitch angles of

Recovery flight trajectories under different initial pitch angles
Sustained flight
As shown in Figure 9, the RL-based controller does not perform well in the sustained flight task because the training objective is only to stabilize the attitude to

After the recovery flight, the sustained flight trajectories for the RL-based control approach and hybrid control approach.
Evaluation results for sustained flight.
RL: reinforcement learning.
Bold values indicate that the corresponding method is the best for the given performance metric.
Discussion
It is worth mentioning that, if the proposed approach were applied to a real FWMAV, the short-term flight behavior could be consistent with the simulation results. However, due to limitations in the hardware manufacturing process of the experimental prototype, when the motor of the FWMAV maintains a high speed, it induces violent vibration of the flapping structure, which may cause the gearbox to shift. The FWMAV then gradually rotates about the fuselage axis over time, which can lead to inconsistency between the actual flight results and the simulation results. In addition, when collecting the dataset, the manufacturing process cannot guarantee that the dimensional parameters of the experimental prototype are identical each time, which makes it difficult to obtain a dynamics model with high prediction accuracy. Nevertheless, the simulation results illustrate that the proposed RL-based method retains clear potential and advantages.
From the aforementioned results, the proposed RL-based approach can quickly complete recovery flight under extreme attitudes compared with traditional control methods such as PID control. The method does not rely on an environmental model and can ignore the influence of dynamic parameter changes, so it remains effective in complex environments. Moreover, it is robust and suited to different flight mission requirements, with high flexibility and adaptability. However, the proposed RL-based approach requires a large amount of environment-interaction data, and long-term flight training consumes significant computing resources. Due to the limited payload capacity, the FWMAV relies on wireless transmission of control commands, which introduces the potential impact of signal interference. At the same time, a large amount of training data is required to ensure the effectiveness of the proposed RL-based approach; for the specific scenario of recovery flight under extreme attitudes, specialized data collection and processing are needed, which increases the complexity and cost of the experiments.
Conclusion
The problem of recovery flight under extreme attitudes has been addressed. First, we have designed the RL-based controller to guide the FWMAV through the recovery flight. Second, in order to maintain sustained flight after recovery, we have developed the hybrid control approach by combining the RL-based control approach with the PD control approach. Finally, we have illustrated the effectiveness of the proposed RL-based control approach and the hybrid control approach. In the future, we will use a real FWMAV to test the effectiveness of both approaches.
Data availability
The simulation data can be obtained by contacting the corresponding author.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported in part by the Zhejiang Provincial Natural Science Foundation under Grant LZ23F030004, the National Natural Science Foundation of China under Grant 62073108, and the Fundamental Research Funds for the Provincial Universities of Zhejiang under Grant GK229909299001-004.
