Sage Journals: Discover world-class research

Abstract

The purpose of the article is to design data-driven attitude controllers for a 3-degree-of-freedom experimental helicopter under multiple constraints. Controllers were updated by utilizing the reinforcement learning technique. The 3-degree-of-freedom helicopter platform is an approximation to a practical helicopter attitude control system, which includes realistic features such as complicated dynamics, coupling and uncertainties. The method in this paper first describes the training environment, which consists of user-defined constraints and performance expectations by using a reward function module. Then, actor–critic-based controllers were designed for helicopter elevation and pitch axis. Next, the policy gradient method, which is an important branch of the reinforcement learning algorithms, is utilized to train the networks and optimize controllers. Finally, from experimental results acquired by the 3-degree-of-freedom helicopter platform, the advantages of the proposed method are illustrated by satisfying multiple control constraints.

Keywords

Helicopter reinforcement learning reward function module policy gradient actor–critic-based controllers

Introduction

In the unmanned aerial system field, the helicopter control problems have attracted much attention because of their wide applications and scientific significance. Difficulties in designing controllers for helicopters are due to their particular features, such as external disturbance forces, model uncertainties, nonlinear dynamic conditions and coupling axis problems. In the past few decades, abundant papers have been reported about the controller design for the helicopter, see for example, sliding mode controllers,^1,2 robust controller, $H_{\infty}$ controller^3,4 and $L_{2}$ controller.⁵ However, the experiments for the practical helicopter tests are time-consuming and high cost, and then few experiment results have been reported.

In order to avoid the difficulties of the practical helicopter experiments, we aim to find a simplified model to corroborate the theoretical application. The helicopter platform in the study by Zheng and Zhong⁶ provided a simplified model, which consisted of strong coupling characteristic and complicated, nonlinear dynamics. Also, this model utilized the Simulink for controller design, and then it compiled the codes for the practical experiments. Based on this model, a large amount of research papers have been published so far. Rios et al.⁷ utilized the sliding-mode observation method. The robust controller design methods were brought out by Liu et al.⁸ Also, Kutay et al.⁹ used the adaptive method to design the pitch axis output feedback controller within this platform. Rosales dealt in three axes the set-point regulation problem. Li et al.¹⁰ brought out robust nonlinear controller, which is based on the robust integral of the error, to solve the attitude track problems, both set-point tracking and sine-signal tracking. But, all these controllers are designed by manually adjusting parameters. Also, seldom controllers can deal with control problems under multiple constraints. In this paper, authors want to design controllers for the helicopter platform under multiple constraints only with its input data and output data, which also means model-free controllers.

Among the machine learning method, reinforcement learning (RL) performs well when dealing with serial decision-making problems and control strategies. Sutton and Barto¹¹ based on the past decade research introduced the RL in their study. One of the fundamental theories is the Bellman equation, which is utilized to optimize value function iterations or policy iterations. Inspired by this, the researchers in control theory field improved the adaptive dynamic programming (ADP) by RL. By using the reward in “Linear Quadratic Regulator (LQR)” form, they used the Hamilton–Jacobi–Bellman (HJB) equation to deal with controller design problems for linear systems,^12–16 discrete nonlinear system,^17–20 complex-valued nonlinear systems,²¹ nonlinear switched systems,²² multiagent systems^23–25 and so on. Also, according to ADP, Kiumarsi et al.,²⁶ Luo et al.,²⁷ and Modares et al.²⁸ brought out $H_{\infty}$ controllers and robust controllers. However, few papers provided practical results for the simulations. Based on the learning theory in the study by Sutton and Barto,¹¹ researchers applied the deep RL to train agents play simple computer games.²⁹ Also, they used the Monte Carlo Tree search method to train the intelligence play Go against human, which is famous in the world.³⁰ However, these methods are only useful for low-dimension discrete systems. Continuous systems bring difficulties for agents to choose actions because the number of the actions can seem to be infinity. Then, they utilized the policy gradient (PG) method and deep network to solve these problems.³¹

Methods such as ADP and Q-learning are the value-based learning methods. These algorithms alternate actions by estimating the value function or Q-function and then improve the policy. Another type of RL algorithm is the PG method. It improves the policy by an estimator of the gradient of the cost function, which is calculated through the data. Abundant research studies about PG methods have been reported so far, see for example the studies by Peters and Schaal³² and Luo et al.³³ But, few of them behaved well in practical plants,³⁴ because it is difficult for us to choose an appropriate step size. Too short means slow learning rates and too long results in divergence. Schulman et al.³⁵ solved this problem by giving a limitation for the choice of the step size.

The aim of the article is to design RL-based controllers for the helicopter under multiple constraints. In the practical systems, there exist many constraints, such as the limiting control inputs. Also, researchers want to add some constraints for the performance such as no overshoot. In this article, we use the reward functions to describe these constraints and construct the learning environment. An actor–critic-based controller is brought out for the helicopter experiment plant, and we utilize one type of the PG methods to optimize the controllers. Finally, the table-mount helicopter platform is used to prove the efficiency of the proposed algorithm under different constraint conditions.

Problem formulation

The helicopter model used in this article is the table-mount helicopter, which is shown in Figure 1. The helicopter consists of two parts: a rectangular body frame and two propeller assemblies. The arm in this model allows the elevation and the travel motions of the helicopter with the utilization of a 2-degree-of-freedom (DOF) instrumented joint. The helicopter will elevate when motors are both driven with two positive voltages. If the voltage on the front motor is greater than that on the back motor, the positive pitch movement is generated. The thrust vectors of the body pitches can result in the travel motion. Angle position information can be directly measured by encoders mounted through the middle part.

Figure 1.

Quanser table-mount helicopter platform.

Based on the abovementioned simplified structure of the helicopter model, the control method designed in the article aims to track the reference signals of elevation and pitch axes. Then, we can formulate elevation and pitch motions of the helicopter as follows.

Elevation motion

\begin{array}{l} J_{e} \overset{\cdot\cdot}{α} = K_{f} l_{α} V_{s} \cos (β) - m g l_{α} \cos (α) \\ V_{s} = V_{f} + V_{b} \end{array}

(1)

Pitch motion

\begin{matrix} J_{p} \overset{\cdot\cdot}{β} = K_{f} l_{β} V_{d}, V_{d} = V_{f} - V_{b} \end{matrix}

(2)

where parameter symbols are listed in Table 1.

Table 1.

Symbols for equations (1) and (2).

Symbol	Instruction
$α$	Elevation angle
$β$	Pitch angle
$J_{e}$	Inertia moment of elevation axis
$J_{p}$	Inertia moment of pitch axis
$l_{α}$	Distance between elevation axis and helicopter
$l_{β}$	Distance between pitch axis and helicopter
$m$	Mass of the helicopter
$g$	The gravitational acceleration constant
$K_{f}$	The force constant of the motor
$V_{f}, V_{b}$	Voltages for front motor and back motor
$V_{s}, V_{d}$	Control outputs

Define

e_{e} (t) = r_{e} (t) - α (t), e_{p} (t) = r_{p} (t) - β (t)

where $r_{e} (t)$ and $r_{p} (t)$ are the target signals of elevation and pitch axes. $e_{e} (t)$ and $e_{p} (t)$ are tracking errors for elevation and pitch motions.

The aim of this paper is to design a controller that can track the signal and satisfy multiple performance constraints. Then, we define these special constraints as $c_{1}, c_{2}, \dots$ .

Remark 1

Here, we present some examples for the special constraints $c_{1}, c_{2}, \dots$ .

$c_{1}$ : the track signal we expect can have small overshoot, even no overshoot.

$c_{2}$ : the controller we design can have $H_{\infty}$ performance, which means $\int_{0}^{\infty} e (t) dt < ϵ \int_{0}^{\infty} ω (t) dt$ . Here, $ϵ$ is a small positive scalar.

$c_{3}$ : some variables of the signals have strict limitation, and it is not convenient to directly use a saturation module.

$c_{4}$ : the controller can obtain better $H_{\infty}$ performance in a finite frequency range, otherwise, the controller can satisfy finite frequency $H_{\infty}$ performance.

In conclusion, our control objective is to design two model-free controllers for elevation axis and pitch axis such that $e_{e} (t) \to 0$ , $e_{p} (t) \to 0$ as $t \to \infty$ . Simultaneously, the closed-loop system can satisfy the special constraints $c_{1}, c_{2}, \dots$ . The controllers are designed in an actor–critic structure. The control outputs are $V_{s}$ and $V_{d}$ , and then we convert them into $V_{f}$ and $V_{b}$ , which are applied on the helicopter experiment.

Preliminary on RL

Policy iteration is an important part of the RL. It is widely known that the Q-function in RL is described as follows

\begin{matrix} Q^{π_{new}} (s, a) & = (1 - ϱ) Q^{π} (s, a) \\ + τ (r + ε \max_{a} Q^{π} (s', a)) \end{matrix}

(3)

where $r$ is the reward, $ε$ is the discounted factor and $ρ$ is the learning rate. The policy $π$ updates according to

π^{*} = \underset{π}{\arg \max} Q^{π} (s, a)

(4)

This type of Q-function brings many shortcomings when it is applied to solve the continuous system problem. The action bound and the action amount will be noted as infinity. Then, it is difficult to make a choice, which can bring maximal reward. However, another branch, PG method, can solve such problems.

PG method

In the PG method,³² we directly parameterize policy $π$ with parameters $θ$ and minimize the cost function as

J (θ) = E_{ρ_{θ}} [\begin{matrix} \sum_{t = 0}^{\infty} γ^{t} r (s_{t}, a_{t}) \end{matrix}]

where $ρ_{θ} = p (s_{0}) π (a_{0} | s_{0}) p (s_{1} | s_{0}, a_{0}) \dots$ . From the PG theorem, we can obtain its gradient as follows

\nabla_{θ} J = E [\begin{matrix} \sum_{t} \nabla_{θ} \log π_{θ} (a_{t} | s_{t}) (R_{t} - b_{t}) \end{matrix}]

(5)

where $R_{t} = \sum_{t' = t} γ^{t' - t} r (s_{t'}, a_{t'})$ and $b_{t}$ is a baseline. We often choose $b_{t} = V_{θ} (s_{t}) = E_{θ} [R_{t} | s_{t}]$ . Define the advantage function as $A_{θ} (s_{t}, a_{t}) = E_{θ} [R_{t} | s_{t}, a_{t}] - V_{θ} (s_{t})$ . Then, we have the commonly used gradient estimator

g = E_{t} [\nabla_{θ} \log π_{θ} (a_{t} | s_{t}) (A_{θ} (t))]

(6)

Then, we can optimize the policy by a stochastic gradient descent optimizer to obtain a policy, which can maximize the sum of the reward.

Trust region policy optimization

When we apply the PG method into the practical utilization, it is hard to choose the step size for the optimization. The policy will not converge when our step size is quite large. Also, the optimization process is exceedingly time-consuming when it has small step size.

Schulman et al.³⁵ brought out the TRPO method to solve this problem by defining a trust region for the step size. In each iteration, TRPO transfers the current parameters $θ$ to $θ_{old}$ . Then, it optimizes the policy according to

J_{TRPO} (θ_{new}) = E_{t} [\frac{π_{θ_{new}} (a_{t} | s_{t})}{π_{θ_{old}} (a_{t} | s_{t})} A_{θ_{old}} (t)]

(7)

TRPO sets a constraint for the step size of the policy update. By using Kullback–Leibler (KL) divergence, the constraint can be expressed as $KL [π_{θ_{old}} | π_{θ_{new}}] < δ$ .

Main result

In this section, we will introduce the structure of the table-mount helicopter control system and the controller design method with the consideration under multiple constraints.

The structure of the controllers and the physical plant is shown in Figure 2. From the figure we have two main modules, reward function module and RL controller. Reward function module provides reward for the controller according to multiple constraints. The algorithm updates controllers with the utilization of reward and state signals based on RL. After optimization, the controllers can efficiently track the reference signal and satisfy multiple constraints.

Figure 2.

Helicopter control system.

Reward function module

In the history of the RL, reward plays an important role as one of four basic elements. We set reward rules for the problem and with the algorithm, the agent obtains more and more rewards after each episode. In other words, we utilize reward functions to construct the performance environment, which helps us train the policy. In this paper, the reward we use is continuous with states varying.

From Figure 2, one can see that the total reward $r_{t}$ consists of automated reward $r_{a}$ and special reward $r_{s}$ . Define

r_{t} = r_{a} + r_{s}, r_{s} = \sum_{i = 1}^{n} w_{i} r_{ci}

(8)

where $n$ is the amount of the constraints, $r_{ci}$ is the reward for the $i_{th}$ constraint and $w_{i}$ is the weight value for $r_{ci}$ . Automated reward $r_{a}$ describes the main task for the controller optimization. It may be different in different problems and in this track problem, we set the value of $r_{a}$ at time $t$ as

r_{a} (t) = - | Qe (t) |

(9)

where $e (t)$ is the reference signal and $Q$ is parameter matrix we give before training. Then, we give definitions for special reward $r_{s}$

w_{i} = {\begin{matrix} 1, & r_{ci} < \underline{r_{ci}} \\ σ_{i} . & r_{ci} \geq \underline{r_{ci}} \end{matrix}

(10)

where $\underline{r_{ci}}$ denotes the lower robust reward bound for the constraint $c_{i}$ , $σ$ is a sufficient small positive rational number, which is decided by users and in this paper $σ_{i} = 0.1, i = 1, 2, \dots, n$ .

Remark 2

Compared with main target, constraint $c_{i}$ are described by $r_{ci}$ , and we suppose each constraint has a tolerable range. We should strictly satisfy the constraint out of range and relatively pay more attention to the main target in the range. Then, the bound for the range is the robust reward bound, which is $[\underline{r_{ci}}, 0]$ . Correspondingly, parameter $σ_{i}$ should be smaller when the main task of the control problem is more important.

The $i_{th}$ constraint reward $r_{ci}$ is different for different types of constraints. Here, we provide several types of $r_{ci}$ for reference, which will be used in our works. As mentioned in Remark 1, we design the following rewards for no overshoot condition $c_{1}$ , $H_{\infty}$ performance condition $c_{2}$ and strict limitation condition $c_{3}$ .

$t_{\exp}$ is the response time that user reasonably expect. $e_{s} (t)$ is the error signal and $e_{m}$ is the maximum tolerable value of $e_{s} (t)$ during the training. Let $φ = 0.5 | e_{m} |$ .

$c_{1}$ : suppose that we want to track signal $s_{c} (t)$ and the track response has small overshoot or no overshoot. Then

\begin{matrix} r_{ci} (t) = {\begin{matrix} - h {\overset{\cdot}{e}}_{s} (t), & if {\overset{\cdot}{e}}_{s} (t) > {\bar{c}}_{1} \\ 0 . & otherwise \end{matrix}, \underline{r_{ci}} = 0 \\ h = ξ_{1} \frac{| e_{m} |}{{\bar{c}}_{1}}, {\bar{c}}_{1} = 2 \frac{φ}{t_{\exp}} \end{matrix}

(11)

where $ξ_{1}$ is hyperparameter, and in this paper we take $ξ_{1} = 20$ .

$c_{2}$ : suppose the disturbance input during training is $ω (t)$ , the acceptable maximum value for $ε$ is $\bar{ε}$ and the error signal is $e_{s} (t)$ for track problem, then

\begin{matrix} r_{ci} (t) = - | \frac{{‖ e_{s} (t) ‖}_{\infty}}{{‖ ω (t) ‖}_{\infty}} |, \underline{r_{ci}} = \bar{ϵ} \end{matrix}

(12)

$c_{3}$ : suppose there is strict limitation for signal $s_{u} (t)$ , the limitation is $[\underline{s_{u}}, \bar{s_{u}}]$ and the error signal is $e_{s} (t)$ for track problem, then $\underline{r_{ci}} = 0$ and

\begin{matrix} r_{ci} (t) = (\begin{matrix} - | e_{m} | - \frac{ξ_{2} | e_{m} |}{| \bar{s_{u}} | + | \underline{s_{u}} |} (s_{u} (t) - \bar{s_{u}}), & if s_{u} (t) > \bar{s_{u}} \\ - | e_{m} | - \frac{ξ_{2} | e_{m} |}{| \bar{s_{u}} | + | \underline{s_{u}} |} (| \underline{s_{u}} | - s_{u} (t)), & if s_{u} (t) < \bar{s_{u}} \\ 0 & otherwise \end{matrix} \end{matrix}

(13)

where $ξ_{2}$ is hyperparameter, and in this paper we take $ξ_{2} = 10$ .

Based on the above introduction, we can assume that we utilize rewards $r_{ci}, i = 1, 2, \dots, n$ to describe multiple constraints and weights $w_{i}$ to reduce the influence of the constraints to main task. Then, the total reward $r_{t}$ can describe our control target.

Remark 3

Here, the reward we give for such three types of constraints is efficient in our simulation. We suggest that in similar constraints, the reward can follow similar structures and hyperparameters $ξ_{1}$ and $ξ_{2}$ should be reconsidered. Another point should be noted that ${\bar{c}}_{1}$ may not be in the same form as ours and $h$ will not always be $ξ_{1} t_{\exp}$ . Obviously, the different values of the parameters can influence the speed or the possibility of the convergence. Therefore, the choice of the parameters should be reasonable and cautious.

Remark 4

In previous studies,^{12–14,17–19,21–23} the authors utilized the ADP method to design controllers and the reward they use can be described as $r (k) = x_{k}^{T} P x_{k} + u_{k}^{T} R u_{k}$ . This reward is used to construct value function and the authors obtain an optimal controller with the HJB equation. However, the HJB equation cannot easily solve optimize problems with reward shown in this paper as $r_{t}$ . Thus, we propose a PG method to solve this type of optimize problems with multiple constraints.

Optimization method

From section “Preliminary on RL,” we denote the expected discounted cost under a stochastic policy $π$ as

J_{π} = E_{s_{0}, a_{0}, \dots} [\sum_{t = 0}^{\infty} γ^{t} r (s_{t})]

(14)

where $a_{t} = π (a_{t} | s_{t})$ and $s_{t + 1} = P (s_{t + 1} | s_{t}, a_{t})$ . $γ$ is the discounted constant and $r (s_{t})$ is the reward of $s_{t}$ . According to equation (8), we have

\begin{matrix} E_{s_{0}, a_{0}, \dots} [\sum_{t = 0}^{\infty} γ^{t} r_{t} (s_{t})] = E_{s_{0}, a_{0}, \dots} [\sum_{t = 0}^{\infty} γ^{t} r_{a} (s_{t})] \\ + \sum_{t = 0}^{n} E_{s_{0}, a_{0}, \dots} [\sum_{t = 0}^{\infty} γ^{t} w_{i} r_{ci} (s_{t})] \end{matrix}

(15)

According to equation (10), in the tolerable range, $σ_{i}$ is such small that we have

J_{π} = E_{s_{0}, a_{0}, \dots} [\sum_{t = 0}^{\infty} γ^{t} r (s_{t})] \approx E_{s_{0}, a_{0}, \dots} [\sum_{t = 0}^{\infty} γ^{t} r_{a} (s_{t})]

(16)

In our following method, we set $r (s_{t}) = r_{t} (s_{t})$ . The definitions of the value function $V_{π}$ and state-action value function $Q_{π}$ are given as follows

\begin{matrix} V_{π} (s_{t}) = E_{a_{t}, s_{t + 1}, \dots} [\sum_{m = 0}^{\infty} γ^{m} r (s_{t + m})] \\ Q_{π} (s_{t}, a_{t}) = E_{s_{t + 1}, a_{t + 1}, \dots} [\sum_{m = 0}^{\infty} γ^{m} r (s_{t + m})] \end{matrix}

(17)

Then, we define the advantage function $A_{π}$ as $A_{π} (s_{t}, a_{t}) = Q_{π} (s_{t}, a_{t}) - V_{π} (s_{t})$ . When we update policy from $π_{old}$ to $π_{new}$ , there is a useful equation shown as follows, whose proof can be found in the study by Kakade and Langford³⁶

J_{π_{new}} = J_{π_{old}} + E_{s_{0}, a_{0}, \dots} [\sum_{t = 0}^{\infty} γ^{t} A_{π_{old}} (s_{t}, a_{t})]

(18)

where $a_{t} = π_{new} (a_{t} | s_{t})$ and $s_{t + 1} = P (s_{t + 1} | s_{t}, a_{t})$ . We denote $ρ_{π}$ , the discounted visitation frequencies as

ρ_{π} (s) = P (s_{0} = s) + γ P (s_{1} = s) + γ^{2} P (s_{2} = s) + \dots

Then, equation (11) can be reformulated as

J_{π_{new}} = J_{π_{old}} + \sum_{s} ρ_{π_{new}} (s) \sum_{a} π_{new} (a | s) A_{π_{old}} (s, a)

(19)

One can see that from equation (19), we can obtain that if $\sum_{a} π_{new} (a | s) A_{π_{old}} (s, a) < 0$ , the policy update process from $π_{old}$ to $π_{new}$ is monotonic. The expected cost will be less after each step update. When the expected advantage function equals to 0, the cost function will reach the minimum. However, in some conditions, due to the estimation error of $V_{π_{old}} (s)$ or too long step size, the expected advantage of some states $s$ is positive, which means $\sum_{a} π_{new} (a | s) A_{π} (s, a) > 0$ . To solve this problem, a method is utilized, which is the conservative policy iteration, brought out by Kakade and Langford.³⁶ In Theorem 1 of the paper,³⁵ we can convert equation (19) into the following equations

J_{π_{new}} \leq L_{π_{old}} (π_{new}) + ε D_{KL}^{\max} (π_{old}, π_{new})

(20)

where

L_{π_{old}} (π_{new}) = J_{π_{old}} + \sum_{s} ρ_{π_{old}} (s) \sum_{a} π_{new} (a | s) A_{π_{old}} (s, a)

D_{KL}^{\max} (π_{old}, π_{new}) = \max_{s} D_{KL} (π_{old} (\cdot | s) ∥ π_{new} (\cdot | s))

ε = \frac{2 κ γ}{{(1 - γ)}^{2}}, κ = \max_{s} | E [A_{π_{old} (s, a)}] |

where $D_{KL}$ denotes KL divergence. We transform the abovementioned equation to a sample-based estimation form. By utilizing $q$ as sampling distribution, for one step with state $s_{k}$ , we have

\sum_{s} ρ_{π_{old}} (s) [\dots] = \frac{1}{1 - γ} E_{s ~ ρ_{π_{old}}} [\dots]

\sum_{a} π_{new} (a | s) A_{π_{old}} (s, a) = E_{a ~ q} [\frac{π_{new} (a | s_{k})}{q (a | s_{k})} A_{π_{old}} (s_{k}, a)]

In the estimation procedure, we simulate the plants with policy $π_{old}$ to generate series as $s_{0}, a_{0}, s_{1}, \dots$ . We can obtain that $q (a | s) = π_{old} (a | s)$ , which results in

\begin{matrix} L_{π_{old}} (π_{new}) = J_{π_{old}} + \frac{E (s_{k}, a_{k})}{1 - γ} \\ E (s_{k}, a_{k}) = E_{a ~ q, s ~ ρ_{π_{old}}} [\begin{matrix} \frac{π_{new} (a_{k} | s_{k})}{q (a_{k} | s_{k})} A_{π_{old}} (s_{k}, a_{k}) \end{matrix}] \end{matrix}

(21)

From equation (13), we can find that the penalty parameter $ε$ is so large that the KL divergence is short, which means the step size will be very small. Then, we optimize $L_{π_{old}} (π_{new})$ and give constraints for $ε D_{KL}^{\max} (π_{old}, π_{new})$ . Here, we do not care about the determined part in the optimization process, so minimizing $L_{π_{old}} (π_{new})$ is equal to minimizing $E (s_{k}, a_{k})$ .

In this study, we usually choose the reward $r (s_{k})$ not more than zero. In other words, the cost is not more than zero. At the same time, we parameterize policy $π_{θ} (a | s)$ with parameter vector $θ$ . Overload our equations and our objective can be described as

\begin{matrix} \underset{θ}{maximize} E [\begin{matrix} F (θ) A_{θ_{old}} (s_{k}, a_{k}) \end{matrix}] \\ subject to D_{KL}^{\max} (θ_{old}, θ) < δ \\ F (θ) = \frac{π_{θ_{new}} (a_{k} | s_{k})}{π_{θ_{old}} (a_{k} | s_{k})} \end{matrix}

(22)

From equation (22), we can find that in the optimization of each episode, the algorithm first obtain the second derivative of the constraint and then optimize through conjugate gradient. The process is time-consuming, and then we utilize a method, which only requires the first derivative during the optimization. Set

\begin{matrix} Tanh (x) = μ \times \frac{e^{ν (x - 1)} - e^{- ν (x - 1)}}{e^{ν (x - 1)} + e^{- ν (x - 1)}} + 1 \end{matrix}

(23)

where $μ$ and $ν$ are user-defined hyperparameters. Based on equation (22), in our algorithm, we aim to optimize policy according to such equation

\underset{θ}{maximize} E [\min (F (θ) A_{θ_{old}}, Tanh (F (θ)) A_{θ_{old}})]

(24)

Compared with equation (22), equation (24) only have upper bound with the variation of $F (θ)$ either when $A_{θ_{old}} > 0$ or $A_{θ_{old}} < 0$ . It means that both algorithms can efficiently improve policy when they find better advantages. However, equation (24) can also utilize the “bad” information to optimize policy. Simultaneously, equation (24) does not use the second derivative, which means its algorithm optimizes policy more quickly. In this paper, we will use equation (24) to train and optimize our controllers and we take hyperparameters as $μ = 0.2$ , $ν = 5$ .

Remark 5

In the TRPO algorithm, the policy is optimized through conjugate gradient algorithm and before that TRPO will obtain the second derivative of the constraint, which is time-consuming. However, algorithm in this paper is based on the TRPO abandon using the constraint, which means the algorithm will optimize policy only with first-order derivative. In other words, without constraints, maximization of the objective will lead to an excessively large policy update. When $A_{θ_{old}} > 0$ or $A_{θ_{old}} < 0$ , TRPO and our algorithm can both reach an upper bound or a lower bound, which are determined by the user-defined hyperparameters. Both algorithms utilize the change in probability ratio when it would make the objective improve, but the algorithm in this paper also includes it when it makes the objective worse. Based on these conditions, our algorithm can accomplish the same target as TRPO and optimize method more efficiently.

Controller implement

In this section, we introduce the implement of the critic–actor-based controller and conclude the algorithm.

From Figure 2, we have that the RL-based controller utilizes reward, state signals to update and output voltages. Reward obtained by reward function module guarantees the controller satisfy multiple constraints. Back propagation neural networks (NNs) are employed to implement the controller. For the critic part, we use a three-layer network with definition as follows

V_{ϕ} (s) = W_{1}^{T} φ (W_{2}^{T} s + b_{2}) + b_{1}

(25)

where $W_{1}$ and $W_{2}$ are weighting matrices. $b_{1}$ and $b_{2}$ are bias. We use the ReLU activation function $φ (\cdot)$ for NN. For elevation axis, we set the input state as $s_{e} (t) = [\overset{\cdot}{α} (t)^{T} e_{e} (t)^{T}]^{T}$ . The output layer is the value function $V_{ϕ_{e}} (s_{e} (t))$ .

For the critic part, our aim is to minimize the advantage function $A_{θ_{old}} (s_{t})$ . Here, we have

A_{θ_{eold}} (s_{e} (t)) = \sum_{t' > t} γ^{t' - t} r_{e} (t') - V_{ϕ_{e}} (s_{e} (t))

With minibatch $T$ , we define

J_{ecritic} (ϕ_{e}) = - \sum_{t = 1}^{T} {(\sum_{t' > t} γ^{t' - t} r_{e} (t') - V_{ϕ_{e}} (s_{t}))}^{2}

(26)

We can update the critic parameters $ϕ_{e}$ by the gradient method according to $J_{ecritic} (ϕ_{e})$ . Similarly, for the pitch axis, the inputs are $s_{p} (t) = [\overset{\cdot}{β} (t)^{T} e_{p} (t)^{T}]^{T}$ and value function is $V_{ϕ_{p}} (s_{p} (t))$ . We update parameters $ϕ_{p}$ by optimizing $J_{pcritic} (ϕ_{p})$ . In this part, the value function will describe the reward environment more and more accurate.

For the actor network, we optimize the actor network by using the advantage function value obtained from the critic part. In this paper, actor network structure is similar to critic network, but it has two outputs. The inputs for the network are $s_{e} (t)$ for elevation axis and $s_{p} (t)$ for pitch axis. The outputs for each axis consist of two parts: $u_{mean} (t)$ and $u_{d} (t)$ . The control input $u (t)$ is sampled under a normal distribution with the mean value $u_{mean} (t)$ and the standard deviation $u_{d} (t)$ . The control outputs for the elevation and pitch axis are voltage $V_{s}$ and $V_{d}$ . Thus, we have

\begin{matrix} u_{mean} = W_{1 m}^{T} φ_{2} (W_{2}^{T} s + b_{2}) + b_{1 m} \\ u_{d} = φ_{1} (W_{1 d}^{T} φ_{2} (W_{2}^{T} s + b_{2})) + b_{1 d} \end{matrix}

(27)

where $W_{1 m}$ , $W_{1 d}$ and $W_{2}$ are weighting matrices, and $b_{1 m}$ , $b_{1 d}$ and $b_{2}$ are bias. $φ_{1} (\cdot)$ and $φ_{2} (\cdot)$ are the softplus and Rectified Linear Unit (ReLU) activation functions.

In this part, our aim is to optimize the actor policy as fast as possible according to equation (24). For elevation axis, we define

J_{actor} (θ) = \sum_{t = 1}^{T} \min (F (θ) A_{θ_{old}} (t), Tanh (F (θ)) A_{θ_{old}} (t))

(28)

We can obtain $A_{θ_{old}} (t)$ from the critic part, and then update the actor network parameters $θ$ by the gradient method according to $J_{actor} (θ)$ . For elevation axis, the input is $s_{e} (t)$ and advantage function is $A_{θ_{eold}} (t)$ . Similarly, for the pitch axis, the input is $s_{p} (t)$ and advantage function is $A_{θ_{pold}} (t)$ obtained from the pitch critic network. We update parameters $θ_{e}$ and $θ_{p}$ by optimizing $J_{actor} (θ_{e})$ and $J_{actor} (θ_{p})$ .

The technique used in the article is a gradient descent method. It is well known that optimizer is important in dealing with the gradient descent problem. In this paper, we utilize a momentum optimizer, Adam optimizer, which was brought out by Kingma and Ba.³⁷ It has lower variance than the Stochastic gradient descent (SGD) optimizer when the optimization converges.

Based on the abovementioned statement, we have our controller design algorithm described in the following steps:

Algorithm: RL controller design method
1. Set the hyperparameters in reward function module and equation (24).
2. Initialize the parameters of two NNs, learning rates and minibatch size. Run the system and training.
3. In the minibatch period $T$ , run the system with the policy $π_{θ_{old}}$ . Collect state vectors, system inputs and reward.
4. Calculate the advantage function value $A_{θ_{old}} (s)$ and optimize the critic network with the objective function $J_{critic} (ϕ)$ in equation (26).
5. Optimize the actor network with the objective function $J_{actor} (θ)$ in equation (28) and get the new policy $π_{θ_{new}}$ .
6. Justify the sum reward of this episode is qualified or not. If it is qualified, stop training and obtain the parameters of the networks. If the sum reward is large, go back to the Step 3 with running the next $T$ with policy $π_{θ_{new}}$ .

Example

In this section, we will use the experiments on the table-mount helicopter platform to illustrate the applicability of the proposed algorithm.

The equipment is shown in Figure 1, and the control system model is shown in Figure 2. The sampling time for the experiments is 5 ms. The parameters of the helicopter are shown in Table 2. The hidden layers of both critic network and actor network have 20 neurons. The minibatch for the training technique is 64. The learning rates for critic net and actor net are 0.0001 and 0.0002, respectively. The discounted rate $γ$ is 0.95. In table-mount helicopter platform, we construct the controller in the Simulink and then the software compiles and downloads codes into the practical system. The first step here is to establish a numerical simulation environment by using the parameters in Table 2. Then, the proposed algorithm is utilized to train the controllers. After controllers satisfy the strict limitation constraints, we load the net parameters to the practical platform and then controllers will be trained in the practical environment.

Table 2.

Parameters of the Quanser helicopter.

Parameter	Value	Parameter	Value
$J_{e}$	1.034 kg m²	$J_{p}$	0.045 kg m²
$l_{α}$	0.66 m	$l_{β}$	0.178 m
$m$	0.094 kg	$g$	9.81 m/s²
$K_{f}$	0.1188 N/V

Experiment with constrained control input

In this experiment, we test the applicability of our controllers with strict limitation constraint. In the table-mount helicopter plant, the initial position of the helicopter is −27° elevation axis and initial pitch angle is 0°. The targets of elevation axis and pitch axis are both 0°. In the traditional method, we cannot control the maximum of the control input and we adjust the parameters of the controller in practice. In this experiment, we provide a LQR controller with little parameter adjustments for comparison. The controller parameters are shown in equation (29). Assume there are strong limitations for DC motor such that voltages $V_{f}$ and $V_{b}$ should be in the range $[1, 24]$

K_{e} = [\begin{matrix} - 37.67 & - 20.95 \end{matrix}], K_{p} = [\begin{matrix} 13.21 & - 4.769 \end{matrix}]

(29)

According to the algorithm, we first establish the reward function module with $e_{m} = 0.45$ for elevation axis and $e_{m} = 0.2$ for pitch axis

\begin{matrix} r_{ea} (k) & = - | e_{e} (k) |, r_{pa} (k) = - | e_{p} (k) | \\ w_{e 1} = & {\begin{matrix} 1 & if r_{ac 1} (k) < 0 \\ 0.1 & otherwise \end{matrix}, w_{p 1} = {\begin{matrix} 1 & if r_{pc 1} (k) < 0 \\ 0.1 & otherwise \end{matrix} \end{matrix}

\begin{matrix} r_{ec 1} (k) & = {\begin{matrix} - 0.4 - 0.093 \times (V_{s} (k) - 48) & if V_{s} (k) > 48 \\ - 0.3 - 0.093 \times (5 - V_{s} (k)) & if V_{s} (k) < 5 \\ 0 & otherwise \end{matrix} \\ r_{pc 1} (k) & = {\begin{matrix} - 0.2 - 0.2 \times (V_{d} (k) - 5) & if V_{d} (k) > 5 \\ - 0.2 - 0.2 \times (- 5 - V_{d} (k)) & if V_{d} (k) < - 5 \\ 0 & otherwise \end{matrix} \end{matrix}

(30)

For elevation axis, the reward at state $s_{e} (k)$ is $r_{e} (k) = r_{ea} (k) + w_{e 1} r_{ec 1} (k)$ and for pitch axis $r_{p} (k) = r_{pa} (k) + w_{p 1} r_{pc 1} (k)$ . Compared with the LQR controllers, we can obtain Figures 3 and 4.

Figure 3.

Comparison of elevation axis tracking response for (a) reinforcement learning-based controllers and (b) traditional LQR controller with small parameter adjustment.

Figure 4.

Comparison of control output $V_{s}$ for (a) reinforcement learning-based controllers and (b) traditional LQR controller with small parameter adjustment.

From Figure 3, compared with the LQR controller, the controller design in our RL-based algorithm has a shorter settling time, less overshoot and less steady-state error. At the same time, our controllers are model-free, which means they can learn to optimize themselves without any parameter adjustments. Also, from Figure 4(a), we can find that under this reward environment, controllers learn to adapt this penalty environment, which satisfies limitation.

Experiment with multiple constraints

During this experiment, we will illustrate that the proposed controllers can guarantee the special performance need. Similar to the abovementioned experiment, the initial positions for elevation axis and pitch axis are −27° and 0°. The targets of two axes are both 0°. In this experiment, we want the helicopter to have such a performance that there is no overshoot in the step response. We add the special reward as follows with $t_{\exp} = 2$ , $e_{m} = 0.45$ for elevation axis and $e_{m} = 0.2$ for pitch axis

\begin{matrix} w_{e 2} = & {\begin{matrix} 1 & if r_{ac 1} (k) < 0 \\ 0.1 & otherwise \end{matrix}, w_{p 2} = {\begin{matrix} 1 & if r_{pc 1} (k) < 0 \\ 0.1 & otherwise \end{matrix} \\ r_{ec 2} (k) & = {\begin{matrix} 100 \times \overset{\cdot}{α} (k) & (if \overset{\cdot}{α} (k) > 0.225) \\ 0 & otherwise \end{matrix} \\ r_{pc 2} (k) & = {\begin{matrix} 100 \times \overset{\cdot}{β} (k) & (if \overset{\cdot}{β} (k) > 0.1) \\ 0 & otherwise \end{matrix} \end{matrix}

(31)

According to equations (30) and (31), we can have the reward for elevation axis as $r_{e} (k) = r_{ea} (k) + w_{e 1} r_{ec 1} (k) + w_{e 2} r_{ec 2} (k)$ and for pitch axis $r_{p} (k) = r_{pa} (k) + w_{p 1} r_{pc 1} (k) + w_{p 2} r_{pc 2} (k)$ . The main aim of this part is to test whether proposed controllers can guarantee the performance which has no overshoot. We only observe the elevation axis and obtain Figures 5 and 6.

Figure 5.

Front motor and back motor voltages of the step response for elevation axis.

Figure 6.

Step response for the elevation axis: (a) before training, (b) during the training and (c) after training.

Figure 6 shows the tracking response for the elevation axis before, during and after training. The controller before training we used is the controller in the experiment 1, which is trained in the reward environment (30). The settling time is around 2 s, the overshot is 4.5% and the tracking error is about 0.34°. The tracking response (Figure 6(b)) we got utilizes the controller, which we trained in the reward environments (30) and (31) after 500 iterations. The settling time is 2.9 s, and it has no overshot but −0.8° tracking error. After the complete training, we obtain the tracking response in Figure 6(c), whose settling time is 2.5 s and tracking error is 0.16°. Surely, the response has no overshoot.

Figure 5 shows the front motor voltage and back motor voltage of the tracking response after training. Both voltages are in the interval (0, 24), which satisfies the reward constraints $r_{ec 1}$ and $r_{pc 1}$ .

Experiment with step signal tracking

In this part, we illustrate that our two-axis controllers can successfully track the step signals. The initial position of the helicopter is −27° for elevation axis and 0° for pitch axis. We will carry out four experiments, and the target pitch angle we set remains 20°. For elevation axis, the helicopter will carry out four experiments separately, and the four target elevation angles are −22°, −17°, −12° and −7°. The reward function module we utilized in this part is that $r_{e} (k) = r_{ea} (k) + w_{e 1} r_{ec 1} (k) + w_{e 2} r_{ec 2} (k)$ for elevation axis and $r_{p} (k) = r_{pa} (k) + w_{p 1} r_{pc 1} (k) + w_{p 2} r_{pc 2} (k)$ for pitch axis. Through training for the table-mount helicopter, results are shown in Figures 7 and 8.

Figure 7.

(a) Elevation axis tracking responses for four cases and (b) elevation axis control output $V_{s}$ voltage for four cases.

Figure 8.

(a) Tracking responses for pitch axis from 0° to 20° and (b) control output $V_{d}$ voltage.

As shown in Figure 7(a), the controllers we design can successfully help the helicopter track the signals in different step cases. Figure 7(b) shows the control output $V_{s}$ voltage, which determines the elevation axis motion. Similarly, Figure 8(a) shows the tracking response for the pitch axis and Figure 7(b) the control output $V_{d}$ voltage. One can see that Figures 7 and 8 illustrate that the proposed controllers can successfully track the set-point signals and satisfy our special constraints.

Conclusion

In this paper, we propose an RL-based controller to address the controller problem for a 3-DOF helicopter with multiple constraints. By choosing a special reward function module, the critic controller can successfully learn the reward environment and actor controller can update policy quickly. There are two advantages of this algorithm. One is that the controller we design is model-free controller, which means the controller is trained only according to the control inputs and system states. This point results in that the controllers will update by themselves by using real-time data. Another advantage is that we can design controllers according to our demand through reward function module. The platform experiment results show that the proposed controllers can successfully satisfy the constraints, special performance and step signal tracking performance.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Funding

This research was supported by State Key Laboratory of Robotics and System (HIT) (grant no.: SKLRS-2018-KF-12), the National Natural Science Foundation of China (grant no: 61603113), China Postdoctoral Science Foundation (grant no.: 2018T110309) and the 111 Project (grant no.: B16014).

ORCID iD

Zhan Li

References

Ozguner

. Sliding mode control of a quadrotor helicopter. In: Proceedings of the 45th IEEE conference on decision and control, San Diego, CA, 13–15 December 2006, pp. 4957–4962. New York: IEEE.

Starkov

Aguilar

Orlov

. Sliding mode control synthesis of a 3-DOF helicopter prototype using position feedback. In: Proceedings of the international workshop on variable structure systems, Antalya, Turkey, 8–10 June 2008, pp. 233–237. New York: IEEE.

Marconi

Naldid

. Robust full degree-of-freedom tracking control of a helicopter. Automatica 2007; 43(11): 1909–1920.

Gadewadikar

Lewis

Subbarao

, et al. Structured

H_{\infty}

command and control-loop design for unmanned helicopters. J Guid Contr Dynam 2008; 31(4): 1093–1102.

Lopez-Martinez

Ortega

Vivas

, et al. Nonlinear

L_{2}

control of a laboratory helicopter with variable speed rotors. Automatica 2007; 43(4): 655–661.

Zheng

Zhong

. Robust attitude regulation of a 3-DOF helicopter benchmark: Theory and experiments. IEEE T Ind Electron 2011; 58(2): 660–670.

Rios

Rosales

de Loza

, et al. Robust regulation for a 3-DOF helicopter via sliding-modes control and observation techniques. In: Proceedings of the 2010 American control conference, Baltimore, MD, 30 June–2 July 2010, pp. 4427–4432. New York: IEEE.

Liu

Zhong

. Robust LQR attitude control of a 3-DOF laboratory helicopter for aggressive maneuvers. IEEE T Ind Electron 2013; 60(10): 4627–4636.

Kutay

Calise

Idan

, et al. Experimental results on adaptive output feedback control using a laboratory model helicopter. IEEE T Contr Syst Technol 2005; 13(2): 196–202.

10.

Liu

HHT

Zhu

, et al. Nonlinear robust attitude tracking control of a table-mount experimental helicopter using output feedback. IEEE T Ind Electron 2015; 62(9): 5665–5676.

11.

Sutton

Barto

. Reinforcement learning: an introduction. Cambridge, MA: MIT Press, 1998.

12.

Zhou

. Adaptive learning and control for MIMO system based on adaptive dynamic programming. IEEE T Neural Networ 2011; 22(7): 1133–1148.

13.

Gao

Jiang

. Adaptive dynamic programming and adaptive optimal output regulation of linear systems. IEEE T Automat Contr 2016; 61(12): 4164–4169.

14.

Modares

Lewis

Jiang

. Optimal output-feedback control of unknown continuous-time linear systems using off-policy reinforcement learning. IEEE T Cybernetics 2016; 46(11): 2401–2410.

15.

Yang

Shi

. Reinforcement learning output feedback NN control using deterministic learning technique. IEEE T Neur Net Lear 2014; 25(3): 635–641.

16.

Kamalapurkar

Andrews

Walters

, et al. Model-based reinforcement learning for infinite-horizon approximate optimal tracking. IEEE T Neur Net Lear 2017; 28(3): 753–758.

17.

Kiumarsi

Lewis

Modares

, et al. Reinforcement q -learning for optimal tracking control of linear discrete-time systems with unknown dynamics. Automatica 2014; 50: 1167–1175.

18.

Wei

Liu

Lin

. Value iteration adaptive dynamic programming for optimal control of discrete-time nonlinear systems. IEEE T Cybernetics 2016; 46(3): 840–853.

19.

Bertsekas

. Value and policy iterations in optimal control and adaptive dynamic programming. IEEE T Neur Net Lear 2017; 28(3): 500–509.

20.

Song

Lewis

Wei

. Off-policy integral reinforcement learning method to solve nonlinear continuous-time nonzero-sum games. IEEE T Neur Net Lear 2017; 28(3): 704–713.

21.

Song

Xiao

Zhang

, et al. Adaptive dynamic programming for a class of complex-valued nonlinear systems. IEEE T Neur Net Lear 2014; 25(9): 1733–1739.

22.

Zhu

Ferrari

. A hybrid-adaptive dynamic programming approach for the model-free control of nonlinear switched systems. IEEE T Automat Contr 2016; 61(10): 3203–3208.

23.

Zhang

Jiang

Luo

, et al. Data-driven optimal consensus control for discrete-time multi-agent systems with unknown dynamics using reinforcement learning method. IEEE T Indus Electron 2017; 64(5): 4091–4100.

24.

Shen

Shi

. Distributed command filtered backstepping consensus tracking control of nonlinear multiple-agent systems in strict-feedback form. Automatica 2015; 53: 120–124.

25.

Modares

Chai

, et al. Off-policy reinforcement learning for synchronization in multiagent graphical games. IEEE T Neur Net Lear 2017; 28(10): 2434–2445.

26.

Kiumarsi

Lewis

Jiang

H_{\infty}

control of linear discrete-time systems: off-policy reinforcement learning. Automatica 2017; 78: 144–152.

27.

Luo

Huang

. Off-policy reinforcement learning for

H_{\infty}

control design. IEEE T Cybernetics 2015; 45(1): 65–76.

28.

Modares

Lewis

Jiang

H_{\infty}

tracking control of completely unknown continuous-time systems via off-policy reinforcement learning. IEEE T Neur Net Lear 2015; 26(10): 2550–2556.

29.

Guo

Singh

Lee

, et al. Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. In: Proceedings of the NIPS, Montreal, QC, Canada, 8–13 December 2014, pp. 3338–3346. Cambridge, MA: MIT Press.

30.

Silver

Huang

Maddison

, et al. Mastering the game of go with deep neural networks and tree search. Nature 2016; 529: 484–489.

31.

Lillicrap

Hunt

Pritzel

, et al. Continuous control with deep reinforcement learning. In: Proceedings of the ICIL, Pretoria, 22–23 February 2016.

32.

Peters

Schaal

. Reinforcement learning of motor skills with policy gradients. Neural Networks 2008; 21(4): 682–697.

33.

Luo

Liu

, et al. Policy gradient adaptive dynamic programming for data-based optimal control. IEEE T Cybernetics 2017; 47(10): 3341–3354.

34.

Mousavi

Schukat

Howley

. Traffic light control using deep policy-gradient and value function-based reinforcement learning. IET Intell Transp Sy 2017; 11(7): 417–423.

35.

Schulman

Abbeel

Levine

, et al. Trust region policy optimization. In: Proceedings of the ICML, Lille, 6–11 July 2015, pp. 1889–1897. New York: ACM.

36.

Kakade

Langford

. Approximately optimal approximate reinforcement learning. In: Proceedings of the ICML, Sydney, NSW, Australia, 8–12 July 2002, pp. 267–274. San Francisco, CA: Morgan Kaufmann Publishers Inc.

37.

Kingma

. Adam: a method for stochastic optimization. In: Proceedings of the ICLR, San Diego, CA, 7–9 May 2015.

Training a model-free reinforcement learning controller for a 3-degree-of-freedom helicopter under multiple constraints

Abstract

Keywords

Introduction

Problem formulation

Remark 1

Preliminary on RL

PG method

Trust region policy optimization

Main result

Reward function module

Remark 2

Remark 3

Remark 4

Optimization method

Remark 5

Controller implement

Example

Experiment with constrained control input

Experiment with multiple constraints

Experiment with step signal tracking

Conclusion

Footnotes

Declaration of conflicting interests

Funding

ORCID iD

References