Abstract
A combined kinematic/torque control law is developed using a backstepping design approach for a nonholonomic mobile robot with two driving wheels mounted on the same axis to track a reference trajectory. Auxiliary velocity control inputs are designed for the kinematic steering system to make the posture error asymptotically stable. Next, a computed-torque controller is designed such that the mobile robot's velocities converge to the given velocity inputs in an optimal manner, by converting the tracking control problem into a regulation problem in which the uncertainties in the dynamics of the mobile robot are taken into account. The proposed online, forward-in-time policy iteration (PI) algorithm based on approximate dynamic programming (ADP) solves the optimal control problem with unknown internal dynamics, using a single neural network (NN) to approximate the cost function. The near-optimal control policy is then computed directly from the cost function, which removes the action network required by the ordinary ADP method. The stability of the dynamical extension system is demonstrated using Lyapunov methods. Simulation results demonstrate the effectiveness of the proposed approach.
1. Introduction
A differentially driven wheeled mobile robot (WMR) is a typical nonholonomic system in which the wheels are assumed to roll without slipping [1, 2]. At the same time, it is an intrinsically nonlinear system with uncertainties in its dynamic model. The tracking control of such a system turns out to be a nontrivial problem, owing both to its challenging theoretical nature and to its practical importance.
Originally, many works [3–6] consider only the kinematic model of the mobile robot, with the assumption that the control signals instantaneously generate the actual velocity control inputs. However, the perfect velocity tracking [7] assumption does not hold in practice.
Controllers based on a full dynamic model [1, 2, 8–10] capture the behaviour better because they account for dynamic effects such as mass, friction and inertia, which are neglected by kinematic controllers. Optimization algorithms, such as genetic algorithms (GAs), ant colony optimization (ACO) and particle swarm optimization (PSO), have been used to find optimal intelligent controllers for WMRs [11–15]. However, these control schemes only ensure the stability of the closed-loop system and satisfactory tracking of the given reference signal; no optimality criterion is considered in the control objective. In many cases, it is desirable that the tracking control law not only stabilizes the system but also achieves optimality with respect to a pre-defined cost function [16–19].
From a mathematical point of view, the sufficient condition for solving this optimal control problem is the solution to the Hamilton-Jacobi-Bellman (HJB) equation [18, 19]. However, for nonlinear systems, finding a cost function that satisfies the HJB equation is challenging because it requires the solution of a partial differential equation that cannot be solved explicitly. For this reason, considerable effort has been devoted to developing ADP algorithms [20, 21], including attempts to use, analyse or develop general-purpose methods that find good approximate answers to this optimization problem, using learning or approximation methods to cope with complexity. Actor-critic (AC) architectures [16, 22] have been proposed as models of ADP algorithms, since AC methods are amenable to online implementation. Typically, an AC architecture consists of two NNs: an actor NN and a critic NN. The actor NN approximates the optimal control law and generates the control signals, while the critic NN rates the quality of the control signals by approximating the cost function.
As part of optimal control and as one of the important new tools for intelligent control, the ADP algorithm presented in this paper does not require preliminary learning. It works online and uses only one NN to approximately solve the HJB equation, while the internal dynamics, expressed in terms of the velocity tracking errors, are treated as unknown, which is the key difference from the AC architectures mentioned above.
The paper is organized as follows. Section 2 provides the kinematic and dynamic model of the WMR. The formulation of the adaptive optimal tracking control problem is given in Section 3, where a unifying design framework is proposed based on a backstepping control approach and nonlinear optimal control theory. Lyapunov theory guarantees the stability of the dynamical extension system while accounting for the error between the cost function and its NN approximation. The convergence proof of the combined control law is presented in Section 4. Section 5 evaluates the control performance of the near-optimal controller by comparing it with the initial stabilizing controller. Finally, Section 6 gives some concluding remarks.
2. Kinematic and Dynamic Model of the WMR
The WMR shown in Fig. 1 consists of a vehicle with two driving wheels mounted on the same axis and a passive self-adjusting supporting wheel, which carries the mechanical structure. The two driving wheels are independently driven by two actuators (e.g., DC motors). It is assumed that the mobile robot under study is a rigid frame equipped with non-deformable wheels, and that it moves on a horizontal plane.
Both wheels have the same radius, denoted by r, and the two driving wheels are separated by a distance 2R. The centre of mass of the mobile robot is located at point C. The pose of the robot in the global coordinate frame OXY can be completely specified by the three generalized coordinates q = [x, y, θ]ᵀ, where (x, y) is the position of point C and θ is the heading angle.

Fig. 1. System configuration of the WMR
2.1. Kinematic Model
For the WMR system considered here, the pure-rolling, non-slipping nonholonomic condition (1) states that the robot can only move in the direction perpendicular to the axis of the driving wheels:
The kinematic constraint (1) can be written as:
The null space of the constraint matrix is spanned by the columns of a full-rank matrix S(q).
The vector q̇ has to lie in this null space, therefore:
where
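As a concrete illustration, the standard unicycle kinematics ẋ = v cos θ, ẏ = v sin θ, θ̇ = ω implied by this null-space representation can be integrated numerically. A minimal sketch (the function name and step size are our choices, not from the original):

```python
import numpy as np

def unicycle_step(q, v, omega, dt):
    """One Euler-integration step of the unicycle kinematics
    x' = v cos(theta), y' = v sin(theta), theta' = omega."""
    x, y, theta = q
    return np.array([x + v * np.cos(theta) * dt,
                     y + v * np.sin(theta) * dt,
                     theta + omega * dt])

# Driving straight along the OX axis from the origin:
q = np.array([0.0, 0.0, 0.0])
for _ in range(100):
    q = unicycle_step(q, v=1.0, omega=0.0, dt=0.01)
# after 1 s at v = 1 m/s the robot has advanced about 1 m in x
```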
2.2. Dynamic Model
Using the Euler-Lagrange equations, the dynamical equations of motion can then be derived as:
Here, m is the mass, I is the moment of inertia of the robot around its centre of mass,
Next, we differentiate (4) with respect to time, substituting the expression for
3. Control Design
From the perspective of backstepping control, the control design problem of the WMR can be described as follows: first, the desired velocity profiles are generated for the mobile robot to follow a reference trajectory (called motion control); then, the control inputs to the robot (typically the driving torques/voltages of the motors) are determined so as to achieve the required velocities while taking into account the mass, friction and other parameters of the actual cart (called speed control).
3.1. Tracking Control Problem Formulation
In the trajectory tracking task, the mobile robot is required to follow a trajectory generated by a reference robot, prescribed as (7), where it moves at the desired linear and angular velocities,
To track a reference trajectory is to find a control law, which makes the real robot follow a given reference moving posture
3.2. Motion Control
The tracking error is expressed relative to the local coordinate frame fixed on the mobile robot as:
and the derivative of the error (8) is:
An auxiliary velocity control input [1] that achieves tracking for (4) is given by:
where
However, the perfect velocity tracking assumption is unrealistic. Therefore, the actual control inputs to the robot must be considered in order to achieve the required velocities.
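The auxiliary velocity law of [1] is commonly of the Kanayama type, combining the posture error (8) with the reference velocities. A sketch under that assumption (the gains k1, k2, k3 and all function names are illustrative choices of ours, not taken from the original):

```python
import numpy as np

def tracking_error(q, q_r):
    """Posture error expressed in the robot's local frame, as in (8).
    q = [x, y, theta] is the actual pose, q_r the reference pose."""
    x, y, th = q
    dx, dy = q_r[0] - x, q_r[1] - y
    e1 = np.cos(th) * dx + np.sin(th) * dy    # longitudinal error
    e2 = -np.sin(th) * dx + np.cos(th) * dy   # lateral error
    e3 = q_r[2] - th                          # heading error
    return e1, e2, e3

def auxiliary_velocity(e1, e2, e3, v_r, w_r, k1=2.0, k2=5.0, k3=2.0):
    """Kanayama-type auxiliary velocity input: drives the posture
    error to zero under the usual assumptions on v_r and w_r."""
    v_c = v_r * np.cos(e3) + k1 * e1
    w_c = w_r + k2 * v_r * e2 + k3 * np.sin(e3)
    return v_c, w_c

# With zero posture error the law reproduces the reference velocities:
e = tracking_error(np.array([0.0, 0.0, 0.0]), np.array([0.0, 0.0, 0.0]))
v_c, w_c = auxiliary_velocity(*e, v_r=1.0, w_r=0.5)
```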
3.3. Near-optimal Speed Control
Define the auxiliary velocity tracking error as:
Differentiating (12) and using (6), the mobile robot dynamics may be written in terms of the velocity tracking error as:
where the function
Therefore, a suitable control input for following velocity is given by the computed-torque like control:
where
Using this control input (14) in (13), the closed-loop system becomes:
Eq. (15) can be rewritten as:
Define the infinite horizon integral cost function as:
where
Definition 1 Admissible Control
A control u(e) is said to be admissible with respect to the cost (17) on a compact set Ω if:
(1) u is continuous on Ω with u(0) = 0, and u stabilizes the system (16) on Ω;
(2) the cost (17) is finite for every initial state in Ω.
Then, an infinitesimal version of (17) is the so-called nonlinear Lyapunov equation:
where
The optimal control problem can now be formulated: given the continuous-time system (16), the admissible control set
Defining the Hamiltonian of the problem:
The optimal cost function defined by
Assuming that the minimum on the right hand side of (20) exists and is unique, then the optimal control function for the given problem is:
Inserting this optimal control policy (21) in the Hamiltonian (19), we obtain the formulation of the HJB equation in terms of
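For orientation, in the special case of linear error dynamics ė = Ae + Bu with the quadratic cost (17), taking V*(e) = eᵀPe reduces (21) and the HJB equation to the familiar linear-quadratic forms (a standard fact, stated here with generic matrices A, B rather than the paper's WMR dynamics):

```latex
u^*(e) = -\tfrac{1}{2} R^{-1} B^{\mathsf{T}} \nabla V^*(e)
       = -R^{-1} B^{\mathsf{T}} P e,
\qquad
A^{\mathsf{T}} P + P A - P B R^{-1} B^{\mathsf{T}} P + Q = 0 .
```

The second relation is the algebraic Riccati equation; in the general nonlinear case no such closed form exists, which motivates the iterative approximation below.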
In order to find the optimal control solution
In the following, we discuss a new online algorithm based on PI which will adapt to solve the continuous-time (CT) optimal control problem without using any knowledge regarding the system's internal dynamics (i.e., the system function
Proof of the convergence of the algorithm to the optimal control function is provided in the next section.
3.3.1. Modified PI Algorithm for Solving the HJB Equation
Instead of trying to solve the HJB equation directly, the PI method starts by evaluating the cost of a given initial policy and then tries to use this information to obtain a new improved control policy.
The modified PI algorithm [19, 23], built on the online adaptive critic techniques [16, 21, 24, 25], is a two-step iteration, where i denotes the iteration index:
(1) Policy evaluation: solve for the cost function:
(2) Policy improvement: update the control policy using:
Given an initial stabilizing control policy
We solve this iteratively between Eqs. (23) and (24) without making use of any knowledge of the internal dynamics of the system,
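In the linear-quadratic special case, this two-step iteration is exactly Kleinman's algorithm: policy evaluation solves a Lyapunov equation for the cost matrix, and policy improvement updates the feedback gain. A numpy-only sketch on a double-integrator example (the matrices and the initial stabilizing gain are illustrative choices of ours, not the paper's WMR dynamics):

```python
import numpy as np

def lyap_solve(Ac, M):
    """Solve Ac^T P + P Ac = -M via the Kronecker/vec identity."""
    n = Ac.shape[0]
    L = np.kron(np.eye(n), Ac.T) + np.kron(Ac.T, np.eye(n))
    P = np.linalg.solve(L, -M.reshape(-1, order='F')).reshape(n, n, order='F')
    return 0.5 * (P + P.T)  # symmetrize against round-off

# Double-integrator example; K is an initial stabilizing gain.
A = np.array([[0., 1.], [0., 0.]])
B = np.array([[0.], [1.]])
Q, R = np.eye(2), np.array([[1.]])
K = np.array([[1., 2.]])
for _ in range(20):
    Ac = A - B @ K                       # closed-loop matrix under policy i
    P = lyap_solve(Ac, Q + K.T @ R @ K)  # (1) policy evaluation
    K = np.linalg.solve(R, B.T @ P)      # (2) policy improvement
# K converges to the LQR gain [1, sqrt(3)] for this example
```

Each iteration only evaluates the cost of the current stabilizing policy, mirroring the structure of (23)-(24).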
3.3.2. Single NNs-based ADP Algorithm for the HJB's Approximate Solution
In order to solve Eq. (22) using the modified PI algorithm proposed previously, we will use a NN to obtain an approximation of the unknown cost function
The cost function
where
Using the NN's approximate description for the cost function
where:
we tune the weights
This amounts to:
Using the inner product notation for the Lebesgue integral, (28) can be rewritten as:
According to the properties of the inner product, (29) becomes:
If there exist values of T such that Φ is invertible [17], then we obtain:
From (24), we get the new control policy:
To distinguish it from (24), we let
Eqs. (31) and (32) are successively solved at each iteration i until convergence.
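Numerically, the inversion in (31) is typically realized as a batch least-squares fit of the critic weights over sampled intervals. A sketch under that interpretation (the function name, the synthetic activation data and the true weight vector are our assumptions for illustration):

```python
import numpy as np

def critic_weights(phi_t, phi_tT, rewards):
    """Least-squares solve of W^T [phi(x(t)) - phi(x(t+T))] = integral reward
    over a batch of N sampled intervals."""
    Phi = phi_t - phi_tT   # N x m matrix of activation differences
    W, *_ = np.linalg.lstsq(Phi, rewards, rcond=None)
    return W

# Synthetic check: rewards generated from a known weight vector are recovered
# exactly when the activation differences have full column rank.
rng = np.random.default_rng(0)
W_true = np.array([1.0, -2.0, 0.5])
phi_t = rng.standard_normal((50, 3))
phi_tT = rng.standard_normal((50, 3))
W = critic_weights(phi_t, phi_tT, (phi_t - phi_tT) @ W_true)
```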
A general structure for the tracking control system is presented in Figure 2.
4. Convergence Proof
Theorem 1. The policy iterations (23) and (24) converge uniformly to the optimal control problem (17), without using any knowledge of the internal dynamics of the controlled system (16):

Fig. 2. Combined kinematic/torque near-optimal tracking control structure
Proof:
Theorem 1 in [19] shows that
From Eq. (17), we get:
The infinitesimal version of (33) is:
Integrating (34) over the time interval
Thus, assume that the solution of Eq. (18) also satisfies Eq. (23).
Subtracting (18) from (34), we obtain:
Thus Eq. (23) has a unique solution which is equal to the unique solution of (34).
The algorithm alternating between (34) and (24) is equivalent to the iteration between (23) and (24), which has been proven to converge to the solution of the HJB equation [27].
Theorem 2. Given an initial admissible control
Proof:
Due to the Weierstrass Approximation Theorem [2],
where
Let
Due to the completeness of
Since
Similarly, since
By combining (39)–(41), based on the universal approximation property of multilayer feedforward NNs, when the number of neurons in the hidden layer is sufficiently large, then:
Taking the derivative of
Using (18), we obtain:
Using (24), we obtain
From the result we proved above,
Theorem 3. Given an initial admissible control
Proof:
Results 1 and 2 in Theorem 3 follow directly from Theorems 1 and 2.
Establishing a Lyapunov function:
Using (9), we have the expression of
This implies that
From (45) we know that
By assumption
From the above, the exponential stability of the auxiliary velocity tracking error
5. Simulation Results
We would like to implement the near-optimal control scheme presented in Fig. 2 and compare its performance with the initial one to show the effectiveness of the control law developed in this work.
We took the WMR parameters as follows:
The control gains were selected as:
The reference trajectory is generated using the reference model in (7), with
The initial posture for the actual WMR is:
Additionally, we assumed the uncertainties in (16) as:
The weighting matrix Q should be chosen large enough that the state error weighs heavily in the cost function (17), keeping the accumulated state error small; conversely, the weighting matrix R should be chosen large if the energy consumption is to be kept small. For a convenient simulation, we selected
For the NNs, we selected the activation functions with
The initial stabilizing controller was taken as:

Fig. 3. Convergence of the NN weights
The simulation was conducted using data obtained from the system at every
Using (33), we have the expression of the near-optimal controller:

Fig. 4. Initial control vs. near-optimal control

Fig. 5. Velocity tracking error with the near-optimal control

Fig. 6. Trajectories with the initial/near-optimal control
Fig. 4 shows the initial reinforcement control inputs (46) and the near-optimal reinforcement control inputs (47) for the controller (14); the change in the reinforcement control input is clearly visible. The velocity tracking error under the near-optimal controller is shown in Fig. 5. The responses under the different controllers are shown in Fig. 6 - the tracking error
6. Conclusion
In this paper, we proposed an online PI algorithm based on ADP to solve the optimal control problem with unknown internal dynamics. It uses a single NN to approximate the cost function and then computes the near-optimal control law directly from this approximation. As an important additional advantage, the action networks [22, 25, 28] are no longer needed and the associated iterative training loops are eliminated. This leads to a notable simplification of the architecture and to substantial computational savings. It also eliminates the approximation error that the action network would otherwise introduce.
This paper also presents proofs of convergence for the online NN-based version of the algorithm while taking the NN's approximation error into account. The simulation results support the effectiveness of the online near-optimal controller.
