Sage Journals: Discover world-class research

Abstract

In this article, a novel continuous-time optimal tracking controller is proposed for the single-input-single-output linear system with completely unknown dynamics. Unlike those existing solutions to the optimal tracking control problem, the proposed controller introduces an integral compensation to reduce the steady-state error and regulates the feedforward part simultaneously with the feedback part. An augmented system composed of the integral compensation, error dynamics, and desired trajectory is established to formulate the optimal tracking control problem. The input energy and tracking error of the optimal controller are minimized according to the objective function in the infinite horizon. With the application of reinforcement learning techniques, the proposed controller does not require any prior knowledge of the system drift or input dynamics. The integral reinforcement learning method is employed to approximate the Q-function and update the critic network on-line. And the actor network is updated with the deterministic learning method. The Lyapunov stability is proved under the persistence of excitation condition. A case study on a hydraulic loading system has shown the effectiveness of the proposed controller by simulation and experiment.

Keywords

Optimal tracking control reinforcement learning techniques integral compensation completely unknown dynamics

Introduction

Accurate tracking control has drawn great research interests in a number of application fields.^1–3 Optimal control deals with problems of minimizing the prescribed objective function in the infinite or finite horizon. Traditional optimal control for linear system solves the algebraic Riccati equation (ARE) off-line.^4,5 The optimal control policy is regulated as a state feedback according to the gradient of the value function.⁶ However, this kind of controller may suffer from steady-state error because of the disturbance in the system.⁷ And the design of the feedforward part is separate from the optimal regulation.

In this study, the integral compensation and feedforward part are introduced in the optimal controller. The integral term is necessary to maintain the system state around the equilibrium points and reduce the steady-state error. Inspired by the adaptive robust control, a discontinuous projection for the integral compensation is applied to ensure the robustness.^8,9 While the feedforward part can remarkably improve the responses of the control system,¹⁰ which is necessary for high-accuracy control problems. So, an augmented system with the integral compensation, error dynamics, and desired trajectory is established in this study. And the optimal tracking control problem (OTCP) of the augmented system is formulated for minimizing the performance function in the infinite horizon.

However, the introduction of the integral compensation and feedforward brings difficulties to the optimal controller design. The optimal control policy for the augmented system can hardly be obtained by solving the Hamilton–Jacobi–Bellman (HJB) equation directly. And the completely unknown system dynamics are also challenges for the controller design. In this study, a new optimal controller with reinforcement learning techniques is proposed to deal with the problems mentioned above.^11–14 Different from many state-of-the-art continuous-time optimal controllers, the proposed controller is built up with a Q-function-based actor–critic architecture.¹⁵ The critic is updated by the Q-function approximation, and the actor is optimized by the deterministic learning method.

The implementation of Q-function approximation in the continuous-time domain is inspired by the integral reinforcement learning (IRL) method.^16,17 An integral of the linear quadratic function is calculated to obtain the Bellman error and update the critic weights. The optimal control policy is obtained by the method of deterministic learning.^18,19 Different from those off-line deterministic policy gradient (DPG) methods, the deterministic learning method in this study enables an on-line policy update.^20,21

In this article, we developed an adaptive optimal controller based on the deterministic learning technique. The contribution of this article is as follows. First, the integral compensation and feedforward are added in the control input, so that the control performance can be improved. Second, the OTCP of the system can be solved on-line with completely unknown dynamics, by employing Q-function approximation and deterministic learning method. Third, the convergence and Lyapunov stability of the proposed controller are proved. And the effectiveness of the controller is validated by simulation and experiment.

The rest of this article is organized as follows. Section “Optimal tracking problem formulation” presents the OTCP for the augmented system. Section “Optimal controller design” presents the design of the optimal controller. Section “Stability analysis” presents the Lyapunov stability of the proposed controller. Section “Case study” presents a case study on a hydraulic loading system. Section “Conclusion” presents the conclusion.

Optimal tracking problem formulation

Linear control system with integral compensation

Consider the single-input-single-output (SISO) affine continuous-time linear system described as

\begin{array}{l} \dot{x} (t) = A x (t) + B u (t) \\ y (t) = C x (t) \end{array}

(1)

where $x \in R^{n}$ is the system state vector, $A \in R^{n \times n}$ and $B \in R^{n}$ are the unknown system dynamics, $u \in R$ is the control input, $y \in R$ is the system output, and $C \in R^{1 \times n}$ is the output matrix.

Assumption 1

The system state $x$ and the control input $u$ are contained in compact sets.

Assumption 2

The system with A and B is controllable. The vector $Ax (t)$ and $B$ are bounded.

Assumption 3

Assume that the desired trajectory of the system $x_{d} (t) \in R^{n}$ is bounded and there exists a Lipschitz continuous function $f_{d}$ such that

{\overset{\cdot}{x}}_{d} (t) = f_{d} (x_{d} (t))

(2)

The tracking error $e_{d} (t) \in R^{n}$ is defined as

e_{d} (t) = x (t) - x_{d} (t)

(3)

The control input $u (t)$ is described as

u (t) = u_{s} (t) + u_{f} (t)

(4)

where $u_{f} (t)$ is the optimal control input composed of the feedforward and feedback parts and $u_{s} (t)$ is an integral compensation

u_{s} (t) = k_{i} s_{e} (t)

(5)

where $k_{i}$ is the integral coefficient, which keeps unchanged in this study. And the integral term $s_{e} (t) \in R$ is described as a discontinuous projection of $e_{d}$ ^9,10

{\overset{\cdot}{s}}_{e} (t) = Proj (e_{d}) = {\begin{matrix} 0, if s_{e} \geq s_{e max} and C e_{d} > 0 \\ 0, if s_{e} \leq s_{e min} and C e_{d} < 0 \\ C e_{d}, else \end{matrix}

(6)

where $s_{e max}$ and $s_{e min}$ are the maximum and minimum of $s_{e} (t)$ , respectively. It can be seen that $u_{s}$ is a linear function of $e_{d}$ when $s_{e} (t)$ is in the range $(s_{e min}, s_{e max})$ .

Remark 1

The traditional optimal regulation method obtains a proportional–derivative (PD)-type controller with feedback only, which may cause steady-state error under uncertain dynamics. In this study, an integral compensation is introduced to eliminate the steady-state error.

Augmented system and performance function

Define the augmented system state $X \in R^{2 n}$ as

X = [e_{d}, x_{d}]

(7)

The augmented state vector is composed of the tracking error and desired trajectory.

Then, the dynamics of the augmented system can be written as

\overset{\cdot}{X} (t) = F (X) + G (X) u_{f} (t)

(8)

where the drift dynamics $F (X)$ and the input dynamics $G (X)$ can be written as

F (X) = [\begin{matrix} A (e_{d} + x_{d}) + B k_{i} s_{e} - f_{d} (x_{d}) \\ f_{d} (x_{d}) \end{matrix}], G (X) = [\begin{matrix} B \\ 0 \end{matrix}]

(9)

Remark 2

Because $A$ and $B$ are assumed to be unknown in this study, the dynamics of the augmented system $F (X)$ and $G (X)$ are also unknown.

Remark 3

The coefficient of the integral compensation is set as a constant in this study. Only the feedforward and feedback parts $u_{f} (t)$ need to be regulated. And it is not necessary for the integral term $s_{e} (t)$ to be a part of the state vector. The integral compensation is regarded as part of the system drift dynamics instead.

The objective of the OTCP is to minimize the performance function of the augmented system

V_{c} (t) = \int_{t}^{\infty} e^{- γ (τ - t)} [e_{d} {(τ)}^{T} Q_{T} e_{d} (τ) + R_{T} u_{f} {(τ)}^{2}] d τ

(10)

$V_{c} (t)$ is a discounted sum of costs in the infinite horizon. The diagonal matrix $Q_{T} \in R^{n \times n}$ and the real number $R_{T} \in R$ are coefficients of the quadratic function.

The optimal control policy for $u_{f}^{*}$ should be obtained on-line under the unknown dynamics

u_{f}^{*} = π_{f}^{*} (X)

(11)

According to Leibniz’s rule, the derivative of $V_{c} (t)$ can be obtained

{\overset{\cdot}{V}}_{c} (t) = - e_{d} (τ)^{T} Q_{T} e_{d} (τ) - u_{f} (τ)^{T} R_{T} u_{f} (τ) + γ V_{c} (t)

(12)

And the tracking HJB equation can be written as

\begin{matrix} H (X, u_{f}^{*}, \nabla V_{c}^{*}) = {e_{d}}^{T} Q_{T} e_{d} + R_{T} u_{f}^{* 2} - γ V_{c}^{*} (X) \\ + \nabla V_{c}^{* T} (X) [F (X) + G (X) u_{f}^{*} (X)] = 0 \end{matrix}

(13)

Remark 4

The system dynamics $F (X)$ and $G (X)$ are unknown in this study. So, the traditional linear quadratic regulator (LQR) method is limited in this problem. And the problem can neither be solved by dealing with the HJB equation directly.

Optimal controller design

For systems with completely unknown dynamics, the HJB function can hardly be solved. In this section, an optimal controller is proposed with the actor-critic architecture. The structure of the controller is shown in Figure 1. The Q-value approximation is employed to evaluate the performance function. And the optimal control policy is updated on-line by the deterministic learning technique. The feedback and feedforward parts of the control input can be obtained simultaneously.

Figure 1.

Structure of the proposed optimal controller.

Critic network and Q-function approximation

The state vector is pre-processed by a normalization while considering the difference of scale between the desired trajectory $x_{d}$ and the tracking error $e_{d}$

{\bar{x}}_{d} = \frac{1}{k_{yd}} x_{d}

(14)

The state vector is transformed as $\bar{X}$

\bar{X} = [e_{d}, {\bar{x}}_{d}]

(15)

The value function $V_{c} (t)$ is the expectation of the performance function, which can be estimated on-line by the Q-value

V_{c} (t) = Q_{c} (\bar{X}, u_{f}) + ε_{v}

(16)

where $ε_{v}$ is the approximation error and the Q-function $Q_{c} (\bar{X}, u_{f})$ is evaluated based on the actual control input $u_{f}$ .

Remark 5

During the on-line learning process, a probing noise is added on $u_{f}$ (see section “Experiment results” for more detail). So, the difference between the value function and Q-function $ε_{v}$ is mainly caused by the probing noise.

Because the amplitude of probing noise is relatively low, the approximation error $ε_{v}$ can be kept bounded in a compact set according to Assumption 2

‖ ε_{v} ‖ \leq {ε_{v}}_{max}

(17)

The Q-function can be obtained by a linear approximation

Q_{c} = W_{c}^{T} ϕ_{c} (\bar{X}, u_{f})

(18)

where $W_{c} (t) \in R^{l}$ are the weights for ideal approximation and $ϕ_{c}$ is the basis function.

The Bellman equation can be obtained from equation (10), which is written as

\int_{t - T}^{t} e^{- γ (τ - t + T)} [e_{d} {(τ)}^{T} Q_{T} e_{d} (τ) + R_{T} u_{f} {(τ)}^{2}] d τ + e^{- γ T} V_{c} (t) - V_{c} (t - T) = 0

(19)

According to equations (16) and (19), the tracking Bellman error can be written as

\begin{matrix} ε_{B} (t) = \\ \int_{t - T}^{t} e^{- γ (τ - t + T)} [e_{d} {(τ)}^{T} Q_{T} e_{d} (τ) + R_{T} u_{f} {(τ)}^{2}] d τ + W_{c}^{T} Δ ϕ_{c} (t) \end{matrix}

(20)

where

Δ ϕ_{c} (t) = e^{- γ T} ϕ_{c} (\bar{X} (t), u_{f} (t)) - ϕ_{c} (\bar{X} (t - T), u_{f} (t - T))

(21)

According to equations (16) and (18), the tracking Bellman equation error $ε_{B} (t)$ can be written as

ε_{B} (t) = ε_{v} (t - T) - e^{- γ T} ε_{v} (t)

(22)

So, $ε_{B} (t)$ is bounded according to equation (17).

The Q-value is estimated by the critic network

{\hat{Q}}_{c} = {\hat{W}}_{c}^{T} ϕ_{c} (\bar{X}, u_{f})

(23)

According to equation (20), the Bellman error with respect to the weights of the critic network ${\hat{W}}_{c}$ can be written as

\begin{matrix} e_{B} (t) = \\ \int_{t - T}^{t} e^{- γ (τ - t + T)} [e_{d} {(τ)}^{T} Q_{T} e_{d} (τ) + R_{T} u_{f} {(τ)}^{2}] d τ + {\hat{W}}_{c}^{T} Δ ϕ_{c} (t) \end{matrix}

(24)

In this study, the policy iteration method is employed to minimize the Bellman error. The objective function of the critic network can be written as

E_{B} = \frac{1}{2} e_{B}^{2}

(25)

The update rate of ${\hat{W}}_{c}$ can be written as

\begin{matrix} {\overset{\cdot}{\hat{W}}}_{c} = - \frac{α_{c}}{k_{nc}^{2}} \frac{\partial E_{B}}{\partial e_{B}} \frac{\partial e_{B}}{\partial {\hat{W}}_{c}} \\ = - α_{c} (\frac{1}{{(1 + Δ ϕ_{c}^{T} Δ ϕ_{c})}^{2}} e_{B} Δ ϕ_{c} + θ_{c} {\hat{W}}_{c}) \end{matrix}

(26)

where $θ_{c}$ is the regularization coefficient and $k_{nc}$ is the normalization term

k_{nc} = 1 + Δ ϕ_{c}^{T} Δ ϕ_{c}

(27)

The normalization term is applied to limit the update rate of the network weights.

Define the estimation error of the critic network as

{\tilde{W}}_{c} = W_{c} - {\hat{W}}_{c}

(28)

According to equations (20) and (28), the Bellman error $e_{B} (t)$ can be written as

e_{B} (t) = - {\tilde{W}}_{c}^{T} (t) Δ ϕ_{c} (t) + ε_{B} (t)

(29)

So, the critic neural network (NN) estimation error dynamics becomes

\overset{\cdot}{\tilde{W_{c}}} = - α_{c} (Δ {\bar{ϕ}}_{c} Δ {\bar{ϕ}}_{c}^{T} {\tilde{W}}_{c} (t) - \frac{1}{k_{nc}} ε_{B} (t) Δ {\bar{ϕ}}_{c} - θ_{c} {\hat{W}}_{c})

(30)

where $Δ {\bar{ϕ}}_{c}$ is defined as

Δ {\bar{ϕ}}_{c} = \frac{Δ ϕ_{c}}{1 + Δ ϕ_{c}^{T} Δ ϕ_{c}}

(31)

Actor network and deterministic learning technique

The control policy is improved by the actor network. The deterministic learning technique is applied to update the actor network weights on-line.

The optimal control policy $π_{f}^{*}$ satisfies

π_{f}^{*} (\bar{X}) = \arg min Q_{c} (\bar{X}, u_{f})

(32)

The optimal control input $u_{f}^{*}$ can be expressed by a linear approximation

u_{f}^{*} = π_{f}^{*} (\bar{X}) = W_{a}^{T} ϕ_{a} (\bar{X}) + ε_{u}

(33)

where $ε_{u}$ is the approximation error and satisfies

‖ ε_{u} ‖ \leq {ε_{u}}_{max}

(34)

$u_{f}$ is the estimation of $u_{f}^{*}$ according to the actor network

u_{f} = π_{f} (\bar{X}) = {\hat{W}}_{a}^{T} ϕ_{a} (\bar{X})

(35)

The deterministic learning method is employed to update the weights ${\hat{W}}_{a}$ with the Q-value in the critic²²

\overset{\cdot}{\hat{W_{a}}} = - α_{a} (\frac{\partial {\hat{Q}}_{c} (\bar{X}, u_{f})}{\partial u_{f}} \frac{\partial π_{f} (\bar{X})}{\partial {\hat{W}}_{a}} + θ_{a} {\hat{W}}_{a})

(36)

where $θ_{a}$ is a regularization coefficient added to ensure convergence.

The term $\partial {\hat{Q}}_{c} (\bar{X}, u_{f}) / \partial u_{f}$ can be written as

\frac{\partial {\hat{Q}}_{c} (\bar{X}, u_{f})}{\partial u_{f}} = \nabla_{u_{f}} {ϕ_{c}}^{T} {\hat{W}}_{c}

(37)

So, the update rate $\overset{\cdot}{\hat{W_{a}}}$ can be obtained as

\overset{\cdot}{\hat{W_{a}}} = - α_{a} (\frac{1}{k_{na}} ϕ_{a} \nabla_{u_{f}} {ϕ_{c}}^{T} {\hat{W}}_{c} + θ_{a} {\hat{W}}_{a})

(38)

where $k_{na}$ is the normalization term

k_{na} = (1 + {ϕ_{a}}^{T} ϕ_{a}) (1 + \nabla_{u_{f}} {\hat{V}}_{c}^{T} \nabla_{u_{f}} {\hat{V}}_{c})

(39)

Remark 6

The initial weights of the actor network ${\hat{W}}_{a} (0)$ should be an admissible control policy which is able to stabilize the system.

The estimation error ${\tilde{W}}_{a}$ is defined as

{\tilde{W}}_{a} = W_{a} - {\hat{W}}_{a}

(40)

Then, the estimation error dynamics of weights $W_{a}$ can be written as

\overset{\cdot}{\tilde{W_{a}}} = α_{a} (\frac{1}{k_{na}} ϕ_{a} \nabla_{u_{f}} {ϕ_{c}}^{T} {\hat{W}}_{c} + θ_{a} {\hat{W}}_{a})

(41)

Persistently exciting condition

According to equation (26), the convergence of the weights ${\hat{W}}_{c}$ requires the persistent excitation (PE) condition of $Δ ϕ_{c}$ . For all $t \geq 0$ , there exists $μ_{1} > 0$ and $μ_{2} > 0$ such that

μ_{1} I \leq \int_{t}^{t + T_{1}} Δ ϕ_{c} (τ) Δ ϕ_{c}^{T} (τ) d τ \leq μ_{2} I

(42)

According to equation (38), the term $\nabla_{u_{f}} {ϕ_{c}}^{T} {\hat{W}}_{c}$ denotes the gradient of the estimated Q-function with respect to the control input. And the vector $ϕ_{a}$ should satisfy the PE condition to ensure the convergence of the weights ${\hat{W}}_{a}$ . For all $t \geq 0$ , there exist $μ_{3} > 0$ and $μ_{4} > 0$ such that

μ_{3} I \leq \int_{t}^{t + T_{1}} ϕ_{a} (τ) {ϕ_{a}}^{T} (τ) d τ \leq μ_{4} I

(43)

However, the PE condition can hardly be verified on-line.^23,24 So, in this study, a probing noise is added on the control input $u_{f}$ .

Stability analysis

In this section, the stability of the proposed method is proved in the Lyapunov sense.

The Lyapunov function is defined as

J (t) = V_{c} (t) + \frac{1}{2} {\tilde{W}}_{c} (t)^{T} α_{c}^{- 1} {\tilde{W}}_{c} (t) + \frac{1}{2} {\tilde{W}}_{a} (t)^{T} α_{a}^{- 1} {\tilde{W}}_{a} (t)

(44)

The derivative of the Lyapunov function is given by

\overset{\cdot}{J} (t) = {\overset{\cdot}{V}}_{c} (t) + {\tilde{W}}_{c} (t)^{T} α_{c}^{- 1} \overset{\cdot}{\tilde{W_{c}}} (t) + {\tilde{W}}_{a} (t)^{T} α_{a}^{- 1} \overset{\cdot}{\tilde{W_{a}}} (t)

(45)

According to equations (12) and (18), ${\overset{\cdot}{V}}_{c} (t)$ can be written as

{\overset{\cdot}{V}}_{c} (t) = - e_{d} (t)^{T} Q_{T} e_{d} (t) - R_{T} u_{f} (t)^{2} + γ (W_{c}^{T} ϕ_{c} + ε_{v})

(46)

Note that $u_{f} \in R$ , the term $u_{f} (t)^{2}$ in equation (46) can be written as

\begin{matrix} u_{f} (t)^{2} = ({\hat{W}}_{a}^{T} ϕ_{a})^{2} \\ = W_{a}^{T} ϕ_{a} ϕ_{a}^{T} W_{a} - 2 {\tilde{W}}_{a}^{T} ϕ_{a} ϕ_{a}^{T} W_{a} + {\tilde{W}}_{a}^{T} ϕ_{a} ϕ_{a}^{T} {\tilde{W}}_{a} \end{matrix}

(47)

The first term in equation (46) can be written as

e_{d}^{T} Q_{T} e_{d} \geq λ_{min} (Q_{T}) ‖ e_{d} ‖^{2}

(48)

So, the derivative ${\overset{\cdot}{V}}_{c} (t)$ can be written as

\begin{matrix} {\overset{\cdot}{V}}_{c} (t) \leq - λ_{min} (Q_{T}) ‖ e_{d} ‖^{2} - R_{T} (W_{a}^{T} ϕ_{a} ϕ_{a}^{T} W_{a} \\ - 2 {\tilde{W}}_{a}^{T} ϕ_{a} ϕ_{a}^{T} W_{a} \\ + {\tilde{W}}_{a}^{T} ϕ_{a} ϕ_{a}^{T} {\tilde{W}}_{a}) + γ (W_{c}^{T} ϕ_{c} + ε_{v}) \end{matrix}

(49)

According to equation (30), the second term in equation (45) can be written as

\begin{matrix} {\overset{\cdot}{J}}_{1} (t) = {\tilde{W}}_{c} (t)^{T} α_{c}^{- 1} \overset{\cdot}{\tilde{W_{c}}} (t) \\ = - {\tilde{W}}_{c} (t)^{T} Δ \bar{ϕ} Δ {\bar{ϕ}}^{T} {\tilde{W}}_{c} (t) + \frac{1}{k_{nc}} {\tilde{W}}_{c} (t)^{T} ε_{B} (t) Δ \bar{ϕ} \\ + {\tilde{W}}_{c} (t)^{T} θ_{c} W_{c} - {\tilde{W}}_{c} (t)^{T} θ_{c} {\tilde{W}}_{c} \end{matrix}

(50)

Using equation (41), the third term in equation (45) can be written as

\begin{matrix} {\overset{\cdot}{J}}_{2} (t) = {\tilde{W}}_{a} (t)^{T} α_{a}^{- 1} \overset{\cdot}{\tilde{W_{a}}} (t) \\ = \frac{1}{k_{na}} {\tilde{W}}_{a} (t)^{T} ϕ_{a} \nabla_{u_{f}} {ϕ_{c}}^{T} {\hat{W}}_{c} + {\tilde{W}}_{a} (t)^{T} θ_{a} {\hat{W}}_{a} \end{matrix}

(51)

According to equations (23) and (38), ${\overset{\cdot}{J}}_{2} (t)$ can be written as

\begin{matrix} {\overset{\cdot}{J}}_{2} (t) = \frac{1}{k_{na}} {\tilde{W}}_{a} (t)^{T} ϕ_{a} \nabla_{u_{f}} {ϕ_{c}}^{T} W_{c} \\ - \frac{1}{k_{na}} {\tilde{W}}_{a} (t)^{T} ϕ_{a} \nabla_{u_{f}} {ϕ_{c}}^{T} {\tilde{W}}_{c} \\ + {\tilde{W}}_{a} (t)^{T} θ_{a} W_{a} - {\tilde{W}}_{a} (t)^{T} θ_{a} {\tilde{W}}_{a} \end{matrix}

(52)

According to the basic inequality, the second term in equation (52) can be written as

\begin{matrix} - \frac{1}{k_{na}} {\tilde{W}}_{a} (t)^{T} ϕ_{a} \nabla_{u_{f}} ϕ_{c} {\tilde{W}}_{c}^{T} \leq \frac{1}{4 k_{na}^{2}} {\tilde{W}}_{a} (t)^{T} ϕ_{a} ϕ_{a}^{T} {\tilde{W}}_{a} (t) \\ + {\tilde{W}}_{c}^{T} \nabla_{u_{f}} ϕ_{c} \nabla_{\bar{u}} {ϕ_{c}}^{T} {\tilde{W}}_{c} \end{matrix}

(53)

Therefore

\begin{matrix} {\overset{\cdot}{J}}_{2} (t) \leq \frac{1}{k_{na}} {\tilde{W}}_{a} (t)^{T} ϕ_{a} \nabla_{u_{f}} {ϕ_{c}}^{T} W_{c} \\ + \frac{1}{4 k_{na}^{2}} {\tilde{W}}_{a} (t)^{T} ϕ_{a} ϕ_{a}^{T} {\tilde{W}}_{a} (t) \\ + {\tilde{W}}_{c}^{T} \nabla_{u_{f}} ϕ_{c} \nabla_{u_{f}} {ϕ_{c}}^{T} {\tilde{W}}_{c} + {\tilde{W}}_{a} (t)^{T} θ_{a} W_{a} \\ - {\tilde{W}}_{a} (t)^{T} θ_{a} {\tilde{W}}_{a} \end{matrix}

(54)

Using equations (49), (50), and (54), $\overset{\cdot}{J} (t)$ becomes

\begin{matrix} \overset{\cdot}{J} (t) \leq - λ_{min} (Q_{T}) ‖ e_{d} ‖^{2} - R_{T} ({W_{a}}^{T} ϕ_{a})^{2} + γ ε_{v} \\ - {\tilde{W}}_{c} (t)^{T} N_{1} {\tilde{W}}_{c} (t) + {\tilde{W}}_{c} (t)^{T} k_{1} - {\tilde{W}}_{a} (t)^{T} N_{2} {\tilde{W}}_{a} \\ + {\tilde{W}}_{a}^{T} k_{2} + γ W_{c}^{T} ϕ_{c} \end{matrix}

(55)

where $k_{1}$ is written as

k_{1} = \frac{1}{k_{nc}} ε_{B} (t) Δ \bar{ϕ} + θ_{c} W_{c}

(56)

and $k_{2}$ is written as

k_{2} = 2 R_{T} ϕ_{a} ϕ_{a}^{T} W_{a} + θ_{a} W_{a} + \frac{1}{k_{na}} ϕ_{a} \nabla_{u_{f}} ϕ_{c}^{T} W_{c}

(57)

According to the range of $ε_{B}$ , and the assumption of $ϕ_{a}$ and $ϕ_{c}$ , it can be concluded that $k_{1}$ and $k_{2}$ are bounded.

And $N_{1}$ and $N_{2}$ are written as

N_{1} = θ_{c} + Δ \bar{ϕ} Δ {\bar{ϕ}}^{T} - \nabla_{u_{f}} ϕ_{c} \nabla_{u_{f}} {ϕ_{c}}^{T}

(58)

N_{2} = θ_{a} + R_{T} ϕ_{a} ϕ_{a}^{T} - \frac{1}{4 k_{na}^{2}} ϕ_{a} ϕ_{a}^{T}

(59)

The regularization coefficients $θ_{a}$ and $θ_{c}$ should make sure that $N_{1}$ and $N_{2}$ are larger than zero. Then, $\overset{\cdot}{J} (t)$ becomes negative provided that

‖ e_{d} ‖ \geq \sqrt{\frac{| γ {ε_{v}}_{max} |}{λ_{min} (Q_{T})}}

(60)

‖ {\tilde{W}}_{c} ‖ > \frac{k_{1}}{λ_{min} (N_{1})}

(61)

‖ {\tilde{W}}_{a} ‖ > \frac{‖ k_{2} ‖ + \sqrt{k_{2}^{2} + 4 γ N_{2} W_{c}^{T} ϕ_{c}}}{2 λ_{min} (N_{2})}

(62)

Case study

In this section, the tracking control of a hydraulic loading system for hydraulic motors is taken as a case study.²⁵ The hydraulic loading system utilizes energy regeneration technique to improve efficiency.^26–28 The photograph of the experimental setup is shown in Figure 2. Simulation and experiment results are given to verify the effectiveness of the proposed controller. The objective is to achieve high-accuracy pressure control, which can be defined as an OTCP.

Figure 2.

Experimental setup of the hydraulic loading system with energy regeneration.

OTCP of the hydraulic loading system

The simplified schematic of the hydraulic loading system with energy regeneration is shown in Figure 3. The hydraulic loading system is used to test the hydraulic motor (Rexroth A2FM63) mounted on the transmission shaft. The system is driven by a variable frequency induction motor (ABB QABP 355L2A). The variable displacement loading pump (Rexroth A6V2F63) regenerates the mechanical energy and adjusts the system pressure. Two flow meters (KRACHT VC12) and pressure sensors (KELLER PA-33X/600BAR) are mounted at two outlets of the tested motor. A personal computer (PC) receives all the sensor signals and sends the control signal by an I/O card (ADVANTECH USB4716). The objective of the OTCP is to obtain the optimal displacement input of the loading pump so that the performance function (10) can be minimized.

Figure 3

Simplified schematic of the hydraulic loading system.

The dynamics of the loading pump can be simplified as a first-order system

T_{m} {\overset{\cdot}{q}}_{m} + q_{m} = u_{m}

(63)

where $T_{m}$ is the time constant of the loading pump, $q_{m}$ is the output displacement of the loading pump neglecting the leakage flow, and $u_{m}$ (cm³/rev) is the displacement input of the loading pump.

The system pressure dynamics can be described as

\frac{V}{β} {\overset{\cdot}{p}}_{c} = q_{m} n - q_{c} n - Q_{l}

(64)

where $p_{c}$ (MPa) is the test pressure, $V$ is the volume of the high-pressure chamber, $β$ is the bulk modulus, $n$ is the rotational speed, $q_{c}$ is the rated displacement of the tested motor, and $Q_{l}$ (L/min) is the leakage flow rate of the system. $Q_{l}$ can be linearized as

Q_{l} = k_{pl} p_{c}

(65)

Define the system state and output as

x = {[p_{c}, {\overset{\cdot}{p}}_{c}]}^{T}

(66)

y = p_{c}

(67)

According to equations (63) and (64), the hydraulic loading system can be described as a second-order linear system

\overset{\cdot}{x} (t) = Ax (t) + Bu (t)

(68)

A = [\begin{matrix} 0 & 1 \\ - \frac{β k_{pl}}{V T_{m}} & - \frac{1}{T_{m}} - \frac{β k_{pl}}{V} \end{matrix}]

(69)

B = [\begin{matrix} 0 \\ \frac{β n}{V T_{m}} \end{matrix}]

(70)

The control input $u (t)$ is written as

u (t) = u_{m} (t) - q_{c}

(71)

And the range of $u (t)$ is limited as

- 30 c m^{3} / rev \leq u (t) \leq 30 c m^{3} / rev

(72)

The state vector for the optimal controller is written as

X = {[e_{d 1}, e_{d 2}, x_{d 1}, x_{d 2}]}^{T}, e_{d} = x - x_{d}

(73)

where $x_{d} (t)$ is the desired trajectory of the system pressure.

The basis function $ϕ_{c} (\bar{X}, u_{f})$ is chosen as a second-order polynomial

\begin{matrix} ϕ_{c} (\bar{X}, u_{f}) = [{\bar{X}}_{1}^{2}, {\bar{X}}_{1} {\bar{X}}_{2}, {\bar{X}}_{1} {\bar{X}}_{3}, {\bar{X}}_{1} {\bar{X}}_{4}, {\bar{X}}_{1} u_{f}, {\bar{X}}_{2}^{2}, {\bar{X}}_{2} {\bar{X}}_{3}, \\ {\bar{X}}_{2} {\bar{X}}_{4}, {\bar{X}}_{2} u_{f}, {\bar{X}}_{3}^{2}, {\bar{X}}_{3} {\bar{X}}_{4}, {\bar{X}}_{3} u_{f}, {\bar{X}}_{4}^{2}, {\bar{X}}_{4} u_{f}, u_{f}^{2}]^{T} \end{matrix}

(74)

And $ϕ_{a} (\bar{X})$ is an identity mapping on $\bar{X}$

ϕ_{a} (\bar{X}) = [{\bar{X}}_{1}, {\bar{X}}_{2}, {\bar{X}}_{3}, {\bar{X}}_{4}]^{T}

(75)

The controller is designed with $k_{i} = 1$ , $k_{yd} = 10$ , $Q_{T} = 10 I_{2}$ ( $I_{2}$ is the $2 \times 2$ identity matrix), $R = 1$ , $γ = 0.1$ , $θ_{c} = 0.01$ , and $θ_{a} = 0.02$ . The sample time $T_{s} = 0.01 s$ . The reinforcement interval $T = 0.1 s$ . The learning rates are chosen as

α_{c} = 0.2 e^{- 0.05 t}

(76)

α_{a} = 0.1 e^{- 0.05 t}

(77)

Simulation results

The hydraulic loading system is modeled in Matlab/Simulink with $T_{m} = 0.1 s$ , $V = 0.008 m^{3}$ , $β = 1400 MPa$ , $n = 1000 rev / \min$ , $q_{c} = 63 c m^{3} / rev$ , and $k_{pl} = 0.2 L / (\min MPa)$ . And the matrices A and B can be written as

A = [\begin{matrix} 01 \\ - 5.83 - 10.58 \end{matrix}]

(78)

B = [\begin{matrix} 0 \\ 29.17 \end{matrix}]

(79)

Notice that the system dynamics $A$ and $B$ are unknown while designing the optimal controller with our proposed method.

Figure 4 shows the control performance of the proposed controller while tracking a non-periodic signal compared with a proportional–integral–derivative (PID) controller.^29,30

Figure 4.

Simulation results of the proposed controller compared with the PID method: (a) system pressure while tracking a non-periodic signal and (b) comparison of the tracking errors.

The feedback gain of the PID controller $K_{PD}$ is regulated by the LQR method, and the integral coefficient $K_{I}$ is the same as that of the proposed method

K_{PD} = {[3.3, 3.4]}^{T}, K_{I} = 1

(80)

It can be seen that the tracking error of the proposed controller can be remarkably reduced after the learning process. With the feedforward term in the output, the proposed controller can outperform the PID controller, which is shown in Figure 4(b).

Figure 5 shows the system state $\bar{X}$ . It can be seen that the amplitudes of the four elements are similar after normalization.

Figure 5.

System state with normalization in simulation.

The gradient $\nabla_{u_{f}} {\hat{Q}}_{c} (\bar{X}, u_{f})$ is shown in Figure 6. When pressure rises with overshoot, the gradient is positive, and the control input is expected to decrease for improving the control performance.

Figure 6.

Gradient of the Q-function $\nabla_{u_{f}} {\hat{Q}}_{c} (\bar{X}, u_{f})$ in simulation.

The convergence during the learning process is shown in Figure 7. The Bellman error keeps bounded and converges to zero gradually. In Figure 7(b), the critic weights vector finally converge to

\begin{matrix} {\hat{W}}_{c} = [0.010, - 0.135, - 0.009, - 0.0367, - 0.185, 0.264, \\ - 0.032, - 0.098, 0.150, 0.070, - 0.016, - 0.093, \\ {- 0.081, - 0.282, 0.008]}^{T} \end{matrix}

(81)

The initial weights of actor network are set to be

{\hat{W}}_{a} (0) = {[2, 0, 0, 0]}^{T}

(82)

In Figure 7(c), it can be seen that the weights ${\hat{W}}_{a}$ finally converge to

{\hat{W}}_{a} = {[1.905, 2.316, 0.081, 0.834]}^{T}

(83)

The feedback and feedforward parts are learned simultaneously.

Figure 7.

Convergence of the proposed controller in simulation: (a) Bellman error, (b) critic network weights, and (c) actor network weights.

Experiment results

Figure 8 shows the experiment results of the proposed optimal controller while tracking a non-periodic signal. The performance of the proposed controller is also compared with a PID controller. It can be seen that the difference of tracking errors between the two controllers is relatively small at the beginning. After several seconds for learning, the tracking error of the proposed controller is remarkably reduced.

Figure 8.

Experiment results of the proposed controller on the hydraulic loading system: (a) system pressure while tracking a non-periodic signal and (b) tracking error compared with the PID method.

Figure 9 shows the convergence of the proposed controller. It can be seen that the Bellman error keeps bounded under the experimental circumstances. The weights of the critic ${\hat{W}}_{c}$ finally converge to

\begin{matrix} {\hat{W}}_{c} = [0.255, - 0.217, 0.007, - 0.030, - 0.094, 0.163, \\ - 0.020, - 0.003, 0.084, 0.031, - 0.001, - 0.015, \\ {- 0.025, - 0.327, 0.035]}^{T} \end{matrix}

(84)

And ${\hat{W}}_{a}$ finally converges to

{\hat{W}}_{a} = {[1.825, 1.154, 0.244, 0.691]}^{T}

(85)

So, the convergences of ${\hat{W}}_{a}$ and ${\hat{W}}_{c}$ are also validated in the experiment.

Figure 9.

The learning process of the proposed controller in experiment: (a) Bellman error, (b) critic network weights, and (c) actor network weights.

Conclusion

In this article, an SISO continuous-time optimal tracking controller is proposed for linear systems with completely unknown dynamics. The proposed controller is different from those conventional proportional-deviation-type optimal controllers in two aspects. First, the integral compensation and the feedforward part are introduced into the controller, so the control performance can be improved. Second, the reinforcement learning techniques are applied in the controller design, and the optimal control policy can be obtained on-line without the prior knowledge of system dynamics. The Lyapunov stability and the convergence of the system have been proved. A case study on a hydraulic loading system with energy regeneration is given to validate the control performance. The simulation and experiment results have shown the effectiveness of the proposed controller.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research is supported by the National Natural Science Foundation of China (51475414 and 51875504), as well as the Foundation for Innovative Research Groups of the National Natural Science Foundation of China (51821093).

ORCID iD

Tao Wang

References

Liu

HHT

Zhu

, et al. Nonlinear robust attitude tracking control of a table-mount experimental helicopter using output feedback. IEEE T Ind Electron 2015; 62(9): 5665–5676.

Wang

Zhang

Chen

, et al. Parameter identification and model-based nonlinear robust control of fluidic soft bending actuators. IEEE/ASME T Mech 2019; 24(3): 1346–1355.

Yuan

Chen

Yao

, et al. Fast and accurate motion tracking of a linear motor system under kinematic and dynamic constraints: an integrated planning and control approach. IEEE T Contr Syst T. Epub ahead of print 16 December 2019. DOI: 10.1109/TCST.2019.2955658.

Athans

Falb

. Optimal control: an introduction to the theory and its applications. IEEE T Automat Contr 2003; 12: 345–347.

Weiss

. Robust and optimal control. Automatica 1997; 33: 2095–2095.

Molinari

. The time-invariant linear-quadratic optimal control problem. Automatica 1977; 13: 347–357.

Yao

Reedy

, et al. Adaptive robust motion control of single-rod hydraulic actuators: theory and experiments. IEEE/ASME T Mech 2000; 5: 79–91.

Yao

Tomizuka

. Adaptive robust control of SISO nonlinear systems in a semi-strict feedback form. Automatica 1997; 33: 893–900.

Yao

Tomizuka

. Smooth robust adaptive sliding mode control of robot manipulators with guaranteed transient performance. J Dyn Syst: T ASME 1996; 118: 764–775.

10.

Takeshi

. Introduction of feedforward control for an improved optimal regulator system and its application. Int J Control 1983; 37: 191–204.

11.

Lewis

Vrabie

. Reinforcement learning and adaptive dynamic programming for feedback control. IEEE Circ Syst Mag 2009; 9: 32–50.

12.

Xue

Lin

, et al. Training a robust reinforcement learning controller for the uncertain system based on policy gradient method. Neurocomputing 2018; 316: 313–321.

13.

Xue

Yang

. Training a model-free reinforcement learning controller for a 3-degree-of-freedom helicopter under multiple constraints. Meas Control 2019; 52(7–8): 844–854.

14.

Yang

Liu

Wang

. Reinforcement learning for adaptive optimal control of unknown continuous-time nonlinear systems with input constraints. Int J Control 2014; 87: 553–566.

15.

Vamvoudakis

Lewis

. Online actor–critic algorithm to solve the continuous-time infinite horizon optimal control problem. Automatica 2010; 46: 878–888.

16.

Modares

Lewis

. Optimal tracking control of nonlinear partially-unknown constrained-input systems using integral reinforcement learning. Automatica 2014; 50: 1780–1792.

17.

Vamvoudakis

Vrabie

Lewis

. Online adaptive algorithm for optimal control with integral reinforcement learning. Int J Robust Nonlin 2014; 24: 2686–2710.

18.

Yang

Shi

. Reinforcement learning output feedback NN control using deterministic learning technique. IEEE T Neur Net Lear 2014; 25: 635–641.

19.

Liu

Yang

. Model-free adaptive control design for nonlinear discrete-time processes with reinforcement learning techniques. Int J Syst Sci 2018; 49: 2298–2308.

20.

Schulman

Levine

Moritz

, et al. Trust region policy optimization. Int Conf Mach Learn 2015; 37: 1889–1897.

21.

Silver

Lever

Heess

, et al. Deterministic policy gradient algorithms. Int Conf Mach Learn 2014; 32: 387–395.

22.

Prokhorov

Santiago

Donald

. Adaptive critic designs: a case study for neurocontrol. Neural Networks 1995; 8: 1367–1372.

23.

Lewis

Vamvoudakis

. Reinforcement learning for partially observable dynamic processes: adaptive dynamic programming using measured output data. IEEE T Syst Man Cy B 2010; 41: 14–25.

24.

Chowdhary

Yucelen

Maximillian

, et al. Concurrent learning adaptive control of linear systems with exponentially convergent bounds. Int J Adapt Control 2013; 27: 280–301.

25.

Zhang

Wang

. Design, modeling and pressure control of a 300-kW-class hydraulic pump/motor loading system with energy regeneration. Proc IMechE, Part I: J Systems and Control Engineering. Epub ahead of print 4 March 2020. DOI: 10.1177/0959651820903549.

26.

Wang

. An energy-saving pressure-compensated hydraulic system with electrical approach. IEEE/ASME T Mech 2014; 19: 570–578.

27.

Wang

Zhou

. A compact hydrostatic-driven electric generator: design, prototype, and experiment. IEEE/ASME T Mech 2016; 21: 1612–1619.

28.

Ahn

. Design and control of a closed-loop hydraulic energy-regenerative system. Autom Constr 2012; 22: 444–458.

29.

Ang

Chong

. PID control system analysis, design, and technology. IEEE T Contr Syst T 2005; 13: 559–576.

30.

Lee

Park

Lee

. PID controller tuning for desired closed-loop responses for SI/SO systems. AIChE J 2010; 44: 106–115.

A learning-based optimal tracking controller for continuous linear systems with unknown dynamics: Theory and case study

Abstract

Keywords

Introduction

Optimal tracking problem formulation

Linear control system with integral compensation

Assumption 1

Assumption 2

Assumption 3

Remark 1

Augmented system and performance function

Remark 2

Remark 3

Remark 4

Optimal controller design

Critic network and Q-function approximation

Remark 5

Actor network and deterministic learning technique

Remark 6

Persistently exciting condition

Stability analysis

Case study

OTCP of the hydraulic loading system

Simulation results

Experiment results

Conclusion

Footnotes

Declaration of conflicting interests

Funding

ORCID iD

References