Q-learning trajectory planning based on Takagi–Sugeno fuzzy parallel distributed compensation structure of humanoid manipulator

Abstract

NAO is the first robot created by SoftBank Robotics. Famous around the world, NAO is a tremendous programming tool and he has especially become a standard in education and research. Aiming at the large error and poor stability of the humanoid robot NAO manipulator during trajectory tracking, a novel framework based on fuzzy controller reinforcement learning trajectory planning strategy is proposed. Firstly, the Takagi–Sugeno fuzzy model based on the dynamic equation of the NAO right arm is established. Secondly, the design and the gain solution of the state feedback controller based on the parallel feedback compensation strategy are studied. Finally, the ideal trajectory of the motion is planned by reinforcement learning algorithm so that the end of the manipulator can track the desired trajectory and realize the valid obstacle avoidance. Simulation and experiment shows that the end of the manipulator based on this scheme has good controllability and stability and can meet the accuracy requirements of trajectory tracking accuracy, which verifies the effectiveness of the proposed framework.

Keywords

Manipulator T-S fuzzy control Q-learning trajectory planning trajectory tracking

Introduction

In recent years, humanoid robots have attracted the attention of many researchers. Humanoid robot can imitate certain physiological perception, behavioral characteristics, social communication ability, and even some of thinking ability of human beings and has a higher intelligence.¹ Humanoid manipulator as the most common and the most important actuator in the humanoid robot can not only cooperate with other actuators to complete a series of activities such as crawling, handling, and throwing, it can also be used to perform specific tasks in a complex environment.^2
–4 It is one of the basic tasks of robot control that the end effector of the manipulator moves according to a desired trajectory. However, with the development of the production demand and the change of the service object, the requirements for the motion control of the manipulator are also increasing. How to control the end of the manipulator to achieve good motion planning and tracking has become a research hotspot in the field of humanoid robot control in recent years.⁵

The purpose of the trajectory tracking of the manipulator is to allow the position and speed of the robot track a desired trajectory by giving the driving torque of each joint.^6,7

When the accuracy requirements of trajectory tracking are not high, the influence of dynamics on the movement may not be considered. So, the manipulator usually uses the model-free proportion–integral–derivative (PID) control. However, with the increasing of automation, the tasks completed by the manipulator become more and more complicated, and the performance and the control precision of the manipulator have changed. Especially in the case of high precision and real time, the traditional control effects can’t meet performance requirements.⁸ For the problem of motion control of manipulators, scholars have done a lot of research work. Ranjan et al.⁹ studied hand–eye coordination system of NAO robot. This article studied the identification of the kinematics model of the manipulator and the calculation using the monocular visual Bouquets calibration method. The kinematic model of the left manipulator is established using Denavit–Hartenberg (D-H), and then the adaptive neural network fuzzy system is applied to the kinematic inverse solution calculation. Finally, the experimental analysis verified the error of the hand–eye coordination system was effectively reduced. Vázquez et al.¹⁰ proposed a continuous-time neural network control scheme, which was successfully applied to the trajectory tracking control of a 2-degrees-of-freedom (DOF) direct-drive manipulator. The article used the recurrent high-order neural network (RHONN) structure to dynamically identify the controlled object online and then use the backstepping method to construct the local neural network controller of the RHONN subsystem. Experiments showed that the system can effectively counteract the effects caused by the coupling of friction and gravity between the subsystems and improve the tracking effect of the end effector. Nugroho et al.¹¹ discussed the problem of motion planning using two arms of NAO for games. This article established a positive kinematics model based on the improved D-H movement analysis method and obtain kinematic inverse solution using an inverse transformation method. Based on this kinematics model, nine pose interpolation points were generated using the trajectory planning method of 3-D Cartesian space. The deviation compensation experience value manually settled overcome the positional deviation of the end effector by many experiments, and finally, smooth movement of the manipulator was realized. Yuan and coworkers¹² developed a D-H kinematics model of the manipulator and visual servo control module by combining monocular vision of NAO with Kinect stereo vision. Position-based visual servo (PBVS) control method was designed, and the iterative learning algorithm was used to improve the PBVS control law and complete the task of delivering the household goods. The proposed algorithm is mainly aimed at the situation that the objects are simple in the experimental environment and the features of the target objects are extracted easily. Ren¹³ used the particle swarm optimization (PSO) algorithm to optimize the redundant parameters in the trajectory and built a “knowledge” database in the work environment, which enabled the manipulator to achieve rapid and continuous response to the high-speed target under the condition of satisfying global constraints. Because the reactive optimization of fast action is not analyzed using dynamics, it may cause the certain imbalance of the robot. However, it still has important guiding significance for improving the reaction speed of the manipulator.

From the view of work task, the control problem of the manipulator is a process from simple location operation to trajectory tracking and intelligent learning. From the view of control technology, it is a process from simple logic control to high-precision PID adjustment, then to dynamic intelligent control. Now, the research on the control algorithm of the manipulator has made good progress, but some algorithms have limitations, such as ignoring external disturbances or adding a high-performance computer. Therefore, the accuracy of the motion control of the manipulator still needs further study.

The manipulator of this article is the arm of humanoid robot NAO for the object of motion control. This manipulator has the advantages of compact structure and flexible movement.¹⁴ From the view of the control, the manipulator is a complex nonlinear system with multiple inputs and multiple outputs. The study of the manipulator based on dynamics is a necessary basis for achieving high-precision operations.¹⁵ And the Takagi–Sugeno (T-S) fuzzy control technology has a high degree of nonlinear approximation mapping capability and robustness and is suitable for solving the analyzed and control problems of complex nonlinear uncertain systems than other models.^16
–18 By building a fuzzy model and a controller, high-precision tracking of the robot arm can be achieved.

The rest of this article is organized as follows. It describes the design of Q-learning method based on the T-S fuzzy parallel distributed compensation (PDC) control structure in the second section. In the third section, the trajectory tracking of a desired curve for a fuzzy model-based (FMB) manipulator closed-loop system is described. The obstacle avoidance and grasping experiment are designed using Q-learning algorithm in the fourth section. Finally, conclusion and future work are discussed in the fifth section.

Algorithm design

This article includes trajectory tracking and trajectory planning. PDC control based on T-S fuzzy is used to control the position of the trajectory tracking, and the RL algorithm is used for trajectory planning of obstacle avoidance.

Dynamic model

The humanoid robot NAO has five DOFs for each arm.¹⁹ For example, there are two DOFs with the RShoulderRoll and the RShoulderPitch on the shoulder, two DOFs with the RElbowRoll and the deflection RElbowYaw on the elbow, and a DOF with the deflection RwrisYaw for the right arm. Figure 1 shows the right arm offset angle.

Figure 1.

Right arm offset angle diagram.

Because that the NAO robot’s hand has only one DOF and the load is limited, it cannot grasp in any position. Therefore, this article does not consider the attitude control of the end of the arm, and only the position control of the NAO robot’s right arm is studied. The two DOFs RElbowYaw and RWristYaw are rotary joints, which only affect the attitude control and have no effect on the end position control. Therefore, the model can be simplified to the three DOFs manipulator. They are RShoulderRoll, RShoulderPitch, and RElbowRoll.

α, a, θ, and d are respectively defined as the twist angle, rod length, offset angle, and offset distance of each joint. The kinematic parameters of the right arm can be obtained by D-H method shown in Table 1.

Table 1.

Kinematic parameters of the right arm.

Joint	α (rad)	a (m)	θ (°)
RShoulderPitch	π/2	0	(−119.5, 119.5)
RShoulderRoll	0	0.09	(−76, 18)
RElbowRoll	0	0.10855	(2, 88.5)

Without considering the effects of friction and external disturbances, the dynamic equation of the serial manipulator established by the Lagrange functional balance method²⁰ can be expressed as follows

D \ddot{q} + C (q, \dot{q}) \dot{q} + ϕ = τ

where q, $\dot{q}$ , and $\ddot{q}$ are all n × 1 order vectors, which represent the position, velocity, and acceleration of each joint, respectively. D is an n × n order positive definite inertial matrix. C is the n × n order centrifugal and Coriolis force terms. ϕ is an n × 1 order vector representing a heavy moment vector. τ is the external input control torque.

The nonlinear motion equation (1) is linearized. The position coordinate q and velocity coordinate $\dot{q}$ of each joint of the robot are selected as the state variable of the system. Define

x = [\begin{matrix} q \\ \dot{q} \end{matrix}]

Then equation (1) can be rewritten as follows

\bar{D} \dot{x} + \bar{C} x + \bar{ϕ} = \bar{τ}

where

\bar{D} = [\begin{matrix} I & 0 \\ 0 & D \end{matrix}], \bar{C} = [\begin{matrix} 0 & - I \\ 0 & C \end{matrix}], \bar{ϕ} = [\begin{matrix} 0 \\ ϕ \end{matrix}], \bar{τ} = [\begin{matrix} 0 \\ τ \end{matrix}]

Since D is positive, $\bar{D}$ is reversible. The left and right sides of the equation (2) are multiplied ${\bar{D}}^{- 1}$ at the same time, which can be expressed as

\dot{x} = - {\bar{D}}^{- 1} \bar{C} x + {\bar{D}}^{- 1} (\bar{τ} - \bar{G})

Define ${\begin{cases} A = - {\bar{D}}^{- 1} \bar{C} \\ B = {\bar{D}}^{- 1} \\ u = \bar{τ} - \bar{G} \end{cases}$ , the equation of movement of the robot can be described as a form of state space

\dot{x} = A x + B u

where $x \in R^{n}$ is the state variable and $u \in R^{m}$ is the input variable. The robot state space equation has the form of a state equation that describes the dynamic characteristics of the system in modern control theory, but it is still a complex time-varying nonlinear equation.

T-S fuzzy model

The T-S fuzzy model is a nonlinear system described by a set of “IF-THEN” fuzzy rules. Each rule represents a subsystem. A nonlinear system can usually be expressed as a weighted sum of some local linear systems.²¹ T-S fuzzy model can approximate the actual controlled model within the form of fuzzy rules. The rules of the fuzzy controller output shown in Table 2.

Table 2.

$(K_{P}, K_{D})$ fuzzy rule.

dee	NB	NM	NS	Z	PS	PM	PB
NB	PB, PS	PS, PS	NB, PM	NM, PM	NB, PS	NM, Z	PS, Z
NM	PB, PS	PB, PB	NB, PM	NM, PS	NM, PS	NS, Z	Z, NS
NS	Z, PM	NS, PM	NM, PM	NM, PS	NS, Z	NS, NS	Z, NS
Z	Z, PM	NS, PM	NS, PS	NS, Z	NS, NS	NS, NM	Z, NM
PS	Z, PS	Z, PS	Z, Z	Z, NS	Z, NS	Z, NM	Z, NM
PM	PB, PS	NS, Z	PS, NS	PS, NM	PS, NM	PS, NM	PB, NB
PB	PB, Z	PM, Z	PM, NM	PM, NM	PS, NM	PS, NB	PB, NB

In Table 2, $(K_{P}, K_{D})$ is the output of fuzzy controller and used in the PD controller, e represents the error signal, [PB, PM, PS, Z, NS, NM, NB] means [positive big, positive middle, positive small, zero, negative small, negative middle, negative big].

The robot dynamics equation can be expressed by the following T-S fuzzy model, which is described by the IF-THEN rules as:

The rule i:

\begin{matrix} if x_{1} (t) is M_{1}^{i} and x_{2} (t) is M_{2}^{i} and \dots and x_{n} (t) is M_{n}^{i} i = 1, 2, \dots, r \\ then \dot{x} (t) = A_{i} x (t) + B_{i} u (t) \end{matrix}

where $x (t)$ represents the system state vector in the ith rule. $M_{j}^{i}$ is the fuzzy subset corresponding to each state variable in the ith rule. $u (t)$ represents the system input vector. r is the number of fuzzy rules. $A_{i}$ and $B_{i}$ , respectively, represent the system matrix and input matrix, $A_{i} \in R^{n \times n}$ , $B_{i} \in R^{n \times m}$ .

T-S model is a weighted sum of multiple linear models, which can approximate a nonlinear system. In equation (5), each T-S fuzzy rule describes a linear subsystem and the consequent output can be expressed by a linear function. According to the anti-fuzzification definition of the fuzzy system, the total output of the fuzzy model constructed by the fuzzy rules in equation (5) is

\begin{matrix} \dot{x} (t) = \frac{\sum_{i = 1}^{r} w_{i} [A_{i} x (t) + B_{i} u (t)]}{\sum_{i = 1}^{r} w_{i}} \\ = \sum_{i = 1}^{r} w_{i} [A_{i} x (t) + B_{i} u (t)] \end{matrix}

where w_i denotes the membership function of the ith rule, which represents the weight of the rule i in the rule base. And it satisfies $w_{i} \geq 0 \forall i$ and $\sum_{i = 1}^{r} w_{i} = 1$ .

A T-S fuzzy model is built by substituting data, which can use the following three rules to describe the dynamic behavior of the system

{\begin{cases} if x is about {[\begin{matrix} 0.15 & 0 \end{matrix}]}^{T}, then \dot{x} (t) = A_{1} x + B_{1} u . \\ if x is about {[\begin{matrix} 0.18 & 0 \end{matrix} .2]}^{T}, then \dot{x} (t) = A_{2} x + B_{2} u . \\ if x is about {[\begin{matrix} 0.2 & 0 \end{matrix}]}^{T}, then \dot{x} (t) = A_{3} x + B_{3} u . \end{cases}

where

\begin{array}{l} A_{1} = [\begin{matrix} 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & - 0.0297 & - 0.0009 & 0 \\ 0 & 0 & 0 & - 0.0974 & - 0.1065 & 0 \\ 0 & 0 & 0 & 0.0952 & 0.1778 & 0 \end{matrix}] A_{2} = [\begin{matrix} 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0.2451 & 0.5332 & - 0.1853 \\ 0 & 0 & 0 & - 0.0159 & - 0.1946 & 0.4624 \\ 0 & 0 & 0 & - 0.1677 & 0.8281 & - 0.2538 \end{matrix}] \\ A_{3} = [\begin{matrix} 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & - 0.1791 & - 0.5301 & 0 \\ 0 & 0 & 0 & 0.2124 & - 0.2730 & 0 \\ 0 & 0 & 0 & - 0.6052 & 0.0857 & 0 \end{matrix}] B_{1} = [\begin{matrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 27.0827 & 1.2538 & 18.0910 \\ 0 & 0 & 0 & 1.2538 & 46.3021 & - 46.2125 \\ 0 & 0 & 0 & 18.0910 & - 46.2125 & 76.1791 \end{matrix}] \\ B_{2} = [\begin{matrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 15.1052 & 32.2481 & 7.0632 \\ 0 & 0 & 0 & 32.2481 & 5.0640 & - 29.3071 \\ 0 & 0 & 0 & 7.0632 & - 29.3071 & 17.5615 \end{matrix}] B_{3} = [\begin{matrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 6.0250 & 52.2067 & 2.7913 \\ 0 & 0 & 0 & 52.2067 & 18.1731 & - 34.4208 \\ 0 & 0 & 0 & 2.7913 & - 34.4208 & 67.3205 \end{matrix}] \end{array}

PDC controller design

PDC is a fuzzy controller design method based on T-S fuzzy model proposed by Tanaka and coworkers.²² It is suitable for solving the nonlinear system control problem based on the T-S fuzzy model. The basic principle of this method is to design an independent linear state feedback controller for each local linear subsystem. The control rules of each linear subsystem and the corresponding fuzzy system have the same number of rules and the antecedent description. Finally, the global system controller is obtained by fuzzy weighting of the membership function. The total controller output of the whole system is the weighted set of each subsystem controller. Figure 2 describes the PDC design principle.

Figure 2.

PDC design principle. PDC: parallel distributed compensation.

For each T-S fuzzy rule, through the state feedback method, if-then fuzzy rules of the controller can be expressed as follows:

The rule j:

if $x_{1} (t)$ is $M_{1}^{j}$ and…and $x_{n} (t)$ is $M_{n}^{j}$ , then

u (t) = K_{j} x (t) j = 1, 2, \dots, r

where $K_{j}$ denotes the feedback gain of the controller corresponding to the ith linear subsystem.

According to the description of the PDC method, by weighted average sum of the linear controllers of all subsystems, the T-S fuzzy controller of the system equation (6) can be obtained as

u (t) = \frac{\sum_{j = 1}^{r} w_{j} K_{j} x (t)}{\sum_{j = 1}^{r} w_{j}} = \sum_{j = 1}^{r} w_{j} K_{j} x (t)

The FMB control system is a closed-loop system composed of a T-S fuzzy model and a fuzzy controller. The T-S fuzzy model is used to represent a non-linear controlled object.²³ The fuzzy controller can process the state vector $x (t)$ and the input vector $r (t)$ and then output the generated control signal $u (t)$ is output to the controlled object for control purposes. FMB control system block diagram is shown in Figure 3.

Figure 3.

FMB control system block diagram. FMB: fuzzy model-based.

Considering that the membership function has the properties shown in the equation (9)

\sum_{i = 1}^{r} w_{i} = \sum_{j = 1}^{r} w_{j} = \sum_{i = 1}^{r} \sum_{j = 1}^{r} w_{i} w_{j} = 1

Therefore, according to the equations (6), (8), and (9), the total output of the FMB closed-loop control system can be expressed as follows

\begin{matrix} \dot{x} (t) = \sum_{i = 1}^{r} w_{i} [A_{i} x (t) + B_{i} (\sum_{j = 1}^{r} w_{j} K_{j} x (t))] \\ = \sum_{i = 1}^{r} \sum_{j = 1}^{r} w_{i} w_{j} [A_{i} + B_{i} K_{j}] x (t) \end{matrix}

Stability is an important attribute of the control system, and it is also a necessary condition for the normal operation of the system. Next, the stability conditions of T-S fuzzy system based on the Lyapunov function are inferred and analyzed using linear matrix inequality (LMI) tools.

Construct the Lyapunov function as

V (t) = \frac{1}{2} x^{T} P x

where $P$ is a positive symmetric matrix.

Derivation of the time t by the Lyapunov function yields the equation $\dot{V} (t)$

\dot{V} (t) = \frac{1}{2} {\dot{x}}^{T} P x + \frac{1}{2} x^{T} P \dot{x}

Substituting the closed-loop system equation (10) into equation (12)

\begin{matrix} \dot{V} (t) = \frac{1}{2} {[\sum_{i = 1}^{r} \sum_{j = 1}^{r} w_{i} w_{j} (A_{i} + B_{i} K_{j}) x]}^{T} P x + \frac{1}{2} x^{T} P [\sum_{i = 1}^{r} \sum_{j = 1}^{r} w_{i} w_{j} (A_{i} + B_{i} K_{j}) x] \\ = \frac{1}{2} x^{T} {\sum_{i = 1}^{r} \sum_{j = 1}^{r} w_{i} w_{j} [\begin{array}{l} {(A_{i} + B_{i} K_{j})}^{T} P + \\ P (A_{i} + B_{i} K_{j}) \end{array}]} x \end{matrix}

Simplifying the equation (13)

\begin{matrix} \dot{V} (t) = \frac{1}{2} x^{T} \sum_{i = 1}^{r} w_{i} w_{i} [{(A_{i} + B_{i} K_{i})}^{T} P + P (A_{i} + B_{i} K_{i})] x \\ + \frac{1}{2} x^{T} \sum_{i < j}^{r} w_{i} w_{j} {\begin{cases} {[(A_{i} + B_{i} K_{j}) + (A_{j} + B_{j} K_{i})]}^{T} P + \\ P [(A_{i} + B_{i} K_{j}) + (A_{j} + B_{j} K_{i})] \end{cases}} x \end{matrix}

Therefore, satisfy $\dot{V} (t) \leq 0$ to make the system stable, the following inequalities must be established

{\begin{matrix} {(A_{i} + B_{i} K_{i})}^{T} P + P (A_{i} + B_{i} K_{i}) < 0 & i = j = 1, 2, \dots, r \\ \begin{array}{l} {[(A_{i} + B_{i} K_{j}) + (A_{j} + B_{j} K_{i})]}^{T} P + \\ P [(A_{i} + B_{i} K_{j}) + (A_{j} + B_{j} K_{i})] < 0 \end{array} & i < j \leq r \end{matrix}

Define $Q = P^{- 1} > 0$ . It is known that $P$ is a positive symmetric matrix. According to the nature of the inverse matrix, $Q$ is also known as a positive symmetric matrix. Define $V_{i} = K_{i} Q$ , $V_{j} = K_{j} Q$ again. Each pair of the inequalities in equation (15) is multiplied by $Q$ on the left and right sides, and then substituting $Q$ , $V_{i}$ , and $V_{j}$ into the above equation and simplifying it, which can obtain the system stability conditions

{\begin{matrix} Q A_{i}^{T} + A_{i} Q + V_{i}^{T} B_{i}^{T} + B_{i} V_{i} < 0 & i = 1, 2, \dots, r \\ Q A_{i}^{T} + A_{i} Q + Q A_{j}^{T} + A_{j} Q + V_{j}^{T} B_{i}^{T} + B_{i} V_{j} + V_{i}^{T} B_{j}^{T} + B_{j} V_{i} < 0 & i < j \leq r \end{matrix}

If define $Q = P^{- 1}$ , then $K_{1} = V_{1} P$ , $K_{2} = V_{2} P$ , $K_{3} = V_{3} P$ . According to the equation (16), considering the control system constructed in this article, the T-S fuzzy system is asymptotically stable when there is a positive definite matrix $Q$ and it satisfies the constraints described by equation (17).

{\begin{cases} Q A_{1}^{T} + A_{1} Q + V_{1}^{T} B_{1}^{T} + B_{1} V_{1} < 0 \\ Q A_{2}^{T} + A_{2} Q + V_{2}^{T} B_{2}^{T} + B_{2} V_{2} < 0 \\ Q A_{3}^{T} + A_{3} Q + V_{3}^{T} B_{3}^{T} + B_{3} V_{3} < 0 \\ Q A_{1}^{T} + A_{1} Q + Q A_{2}^{T} + A_{2} Q + V_{2}^{T} B_{1}^{T} + B_{1} V_{2} + V_{1}^{T} B_{2}^{T} + B_{2} V_{1} < 0 \\ Q A_{1}^{T} + A_{1} Q + Q A_{3}^{T} + A_{3} Q + V_{3}^{T} B_{1}^{T} + B_{1} V_{3} + V_{1}^{T} B_{3}^{T} + B_{3} V_{1} < 0 \\ Q A_{2}^{T} + A_{2} Q + Q A_{3}^{T} + A_{3} Q + V_{3}^{T} B_{2}^{T} + B_{2} V_{3} + V_{2}^{T} B_{3}^{T} + B_{3} V_{2} < 0 \end{cases}

The solution of the inequality equation (17) is a convex optimization accessibility problem of LMI. By using MATLAB’s LMI toolbox, the positive definite matrix $Q$ is found that the system can satisfy the constraints of equation (15), which can obtain the status feedback gain matrix results for the controller of each fuzzy subsystem as follows

\begin{array}{l} K_{1} = [\begin{matrix} - 0.5 & 0 & 0 & - 0.0610 & 0.0375 & 0.0368 \\ 0 & - 0.5 & 0 & 0.0375 & - 0.0740 & - 0.0531 \\ 0 & 0 & - 0.5 & 0.0368 & - 0.0531 & - 0.0534 \\ - 0.0610 & 0.0375 & 0.0368 & - 0.0332 & 0.0248 & 0.0220 \\ 0.0375 & - 0.0740 & - 0.0531 & 0.0248 & - 0.0427 & - 0.0336 \\ 0.0368 & - 0.0531 & - 0.0534 & 0.0220 & - 0.0336 & - 0.0322 \end{matrix}] \\ K_{2} = [\begin{matrix} - 0.5 & 0 & 0 & - 0.0166 & - 0.0179 & - 0.0220 \\ 0 & - 0.5 & 0 & - 0.0179 & 0.0055 & 0.0156 \\ 0 & 0 & - 0.5 & - 0.0220 & 0.0156 & - 0.0210 \\ - 0.0166 & - 0.0179 & - 0.0220 & - 0.0637 & 0.0066 & 0.0007 \\ - 0.0179 & 0.0055 & 0.0156 & 0.0066 & 0.0389 & 0.0244 \\ - 0.0220 & 0.0156 & - 0.0210 & 0.0007 & 0.0244 & 0.0264 \end{matrix}] \\ K_{3} = [\begin{matrix} - 0.5 & 0 & 0 & 0.0006 & - 0.0187 & - 0.0095 \\ 0 & - 0.5 & 0 & - 0.0187 & 0.0024 & 0.0020 \\ 0 & 0 & - 0.5 & - 0.0095 & 0.0020 & - 0.0133 \\ 0.0006 & - 0.0187 & - 0.0095 & - 0.1598 & 0.0148 & - 0.0472 \\ - 0.0187 & 0.0024 & 0.0020 & 0.0148 & 0.1229 & 0.0939 \\ - 0.0095 & 0.0020 & - 0.0133 & - 0.0472 & 0.0939 & 0.0426 \end{matrix}] \end{array}

Q-learning algorithm

Q-learning algorithm is an iterative incremental online learning method. It can make the agent have the ability to select the optimal action sequence of the Markov decision process through interaction with the external environment.^24,25 Figure 4 shows the principle of the Q-learning algorithm.

Figure 4.

Q-learning algorithm principle.

The agent receives the input s_t in the environment and outputs the corresponding action a_t through an internal reasoning mechanism. The environment becomes a new state $s_{t + 1}$ under the action a_t , and at the same time, it generates an immediate reward signal (reward or penalty) $r_{t + 1}$ for the agent. $r_{t + 1}$ is the evaluation of the agent action a_t from the environmental state s_t . If a behavioral strategy gains a positive return (reward) from the environment, then the tendency of the agent selecting the action will increase, whereas the tendency will decrease.

The purpose of this algorithm is to learn the return value $Q (s, a)$ of each state-action $(S_{t}, a_{t})$ . $Q (s, a)$ is updated according to the equation (18)

Q (S_{t}, a_{t}) = r_{t} + γ max Q (S_{t + 1}, a)

where $r_{t}$ means the immediate reward of performing the action $a_{t}$ from environmental feedback under the state $s_{t}$ . $γ \in (0, 1)$ represents a discount factor that reflects the influence extent that the future actions affect the current action.

This article selects Q-learning method for trajectory planning, obtains the ideal trajectory of the end of the manipulator and achieves the optimal path of the workspace. The external environment is discretized into a 6 × 6 × 2 grid map, and the vertices of each cube are a corresponding state. The end of the manipulator has six kinds of actions, including top, bottom, left, right, front, and back in each state. The control of the corresponding movement are respectively denoted by $u_{up} = [0, 0, 1]$ , $u_{down} = [0, 0, - 1]$ , $u_{forward} = [0, 1, 0]$ , $u_{backward} = [0, - 1, 0]$ , $u_{left} = [- 1, 0, 0]$ , and $u_{right} = [1, 0, 0]$ .

The obstacle area is S_d , the desired position is S_t , the safe area is S_s , then the reward function is expressed as follows

r = {\begin{matrix} 100 & s = s_{t} \\ 0 & s \in s_{s} and s \neq s_{t} \\ - 100 & s \in s_{d} \end{matrix}

Trajectory planning flow chart is shown in Figure 5. The pseudocode of the Q-learning algorithm is shown in Table 3. The end condition are as follows: 1. Reach the max steps; 2. Reach the obstacle; 3. Reach the goal.

Figure 5.

Trajectory planning flow chart.

Table 3.

Q-learning algorithm.

Input: environment E; action space A; initial state s_i ; target state s_t ; discount factor γ
Process: $Q (s, a) = 0$ ; $s = s_{i}$ ; for $N \neq 0$ , do; Select action a_i based on $P (a_{i} \| s_{i}) = \frac{Q (s_{i}, a_{i})}{\sum Q (s_{i}, a)}$ $r_{i}, s_{i + 1} =$ the state of reward and transfer generated when performing a_i in E; Update $Q (s_{i}, a_{i}) = r_{t} + γ max Q (s_{i + 1}, a)$ ; if $s_{i + 1} = s_{t}$ or $s_{i + 1} \in s_{d}$ , break; $N - 1$ ; end for
Output: action sequence $π : S \to A$

Simulation research

This section studies the trajectory tracking of right arm of the robot NAO. The simulation environment is MATLAB 2014a and the simulation time is set to 10 s. The selected end desired trajectory is shown in equation (20)

{\begin{cases} x (t) = 0.03 sin (t) - 0.11 \\ y (t) = 0.03 cos (t) + 0.06 \\ z (t) = 0.01 cos (2 t) + 0.0509 \end{cases}

Without-interference simulation

The tracking curve of the fuzzy PDC controller control effect based on the manipulator T-S fuzzy model is shown in Figure 6. As a comparison, the tracking curve of the fuzzy PD controller is shown in Figure 7.

Figure 6.

The end tracking curve without interference using T-S fuzzy control. T-S: Takagi–Sugeno.

Figure 7.

The end tracking curve without interference using fuzzy PD control. PD: proportional–derivative.

In Figures 6 and 7, both control schemes can make the end of the manipulator track the curve of the equation (20) well under the condition of without interference. Especially in the T-S fuzzy control scheme, the tracking curve basically fits the desired curve. The change of the joint driving torque using T-S fuzzy control is shown in Figure 8.

Figure 8.

The change of driving torque of the joint using T-S fuzzy control without interference. T-S: Takagi–Sugeno.

To demonstrate the improved effect more intuitively, the position error of the end effector in each axis direction of the 3-D coordinate system is given. Figures 9, 10, and 11 are the error of the actuator, respectively, relative to the desired trajectory on the x-axis, y-axis, and z-axis.

Figure 9.

The end effector error comparison in x-axis without interference.

Figure 10.

The end effector error comparison in y-axis without interference.

Figure 11.

The end effector error comparison in z-axis without interference.

In Figures 9 to 11, comparing with the fuzzy PD algorithm, T-S fuzzy control algorithm can make the position error converge near 0 in a short time, which significantly reduces the tracking error of the end effector and effectively improves the control performance of the manipulator.

Interference simulation

The above control scheme is performed without interference, and it is an ideal model. However, in practice, it is inevitably subject to disturbance caused by the robot’s own frictional resistance and external noise. Therefore, Gaussian white noise is added to the control system, which can simulate the actual work environment and study the control effect of the proposed algorithm under the interference environment.

The desired trajectory in equation (20) is still used as the Gaussian noise signal with a mean value 0 and variance 1 is added, and the other parameters consisting of the noise-free environment is kept. The tracking curve based on T-S fuzzy model using fuzzy PDC control is shown in Figure 12, and the tracking curve of the fuzzy PD controller is shown in Figure 13.

Figure 12.

The end tracking curve with interference using T-S fuzzy control. T-S: Takagi–Sugeno.

Figure 13.

The end tracking curve with interference using fuzzy PD control. PD: proportional–derivative.

In Figures 12 and 13, control schemes are influenced by noise. For the T-S fuzzy control scheme, the end effector can also basically track the desired trajectory curve. There’s a certain trajectory deviation comparing with the Figure 6, but it cannot affect the system to complete the task in the allowable range of error. At this time, the fuzzy PD controller is seriously disturbed by noise. It shows poor stability, and there are more offset expectations of tracking trajectory. The system cannot complete the trajectory tracking task with high precision.

Fuzzy control change of driving torque of each joint is shown in Figure 14. Comparing with Figure 8(a) to (c), respectively, it can be seen that the control torque of each joint in Figure 14 (a)to (c) is affected by noise and output jitter occurs near the stable value. Among them, the influence of two roll joints rotating around the z-axis is more obvious. Under interference situation, the desired trajectory errors of the end effector of the manipulator on the x-axis, y-axis, and z-axis are shown, respectively, in Figures 15, 16, and 17.

Figure 14.

The change of driving torque of each joint without interference using T-S fuzzy control. T-S: Takagi–Sugeno.

Figure 15.

The error of the end effector on x-axis with interference.

Figure 16.

The error of the end effector on y-axis with interference.

Figure 17.

The error of the end effector on z-axis with interference.

Quantitative analysis

To quantitatively analyze the tracking effect of the end effector using T-S fuzzy controller and the fuzzy PD controller respectively, the formula for calculating the average tracking error of the algorithm is defined as in the equation (21)

E = \frac{\sum_{i = 1}^{i_{max}} | α - β |}{i_{max}}

where E represents the average tracking error, and E ₁ and E ₂, respectively, represent the average tracking error of the T-S fuzzy control algorithm and the fuzzy PD control algorithm. α represents the expected value of each coordinate axis. β represents the actual tracking value of each coordinate axis, and $β_{1}$ and $β_{2}$ , respectively, represent the tracking values of the T-S fuzzy control algorithm and the fuzzy PD control algorithm. i represents the number of samples. In this section the sample time is 0.01 s, so $i_{max} = 1001$ .

In addition, I is the improvement precision of T-S fuzzy control compared with the fuzzy PD control. The calculation formula is shown in equation (22)

I = | \frac{E_{1} - E_{2}}{E_{2}} | \times 100 %

The table of control accuracy can be calculated as shown in Table 4.

Table 4.

The control accuracy table.

Axis	Without interference			With interference
Axis	E ₁ (mm)	E ₂ (mm)	I (%)	E ₁ (mm)	E ₂ (mm)	I (%)
x-axis	1.1034	9.6974	88.62	1.6360	4.6864	65.09
y-axis	2.1102	6.9719	69.73	3.2459	9.1623	64.57
z-axis	1.6921	1.8183	6.94	4.6921	8.6348	45.66

Experiment study

NAO-H25 robot is used to perform the grabbing experiments. The experimental task is to control the robot right arm to move along the non-collision optimal path and grab the target object, which can verify the feasibility of the trajectory planning Q-learning algorithm. The experiment scene is shown in Figure 18. A cylindrical rod is used catching target, and the brush and the milk carton, respectively, are used as the obstacle 1 and the obstacle 2. Considering the length of the manipulator, the three objects are placed on a diagonal line, and the robot begins to plan path when facing the obstacle 1.

Figure 18.

The grasping experiment scene.

The rod is placed on the experiment platform, so that the robot’s camera can see the whole target. Then the contour learning in the visual development tool Choregraphe [version:1.14] is used to scan the target object for contour recognition. The object data identified is named object and sent to the robot. Figure 19 shows the contour recognition of the target object.

Figure 19.

The contour recognition of the target object

The laser on the head of the NAO robot can measure the position of the objects and the obstacles in the environment.²⁶ Firstly, a laser positioning is performed on the target object, and the Laser-Monitor monitoring window is opened. The obtained result of target scanning is shown in Figure 20. Then, the obstacles 1 and 2 are added to the scene, and the robot performs laser positioning of the obstacle. The scanning result is shown in Figure 21.

Figure 20.

Target’s laser scan results.

Figure 21.

Obstacle laser scan results.

The center of the robot is used as the original coordinate, and the 3-D coordinate system shown in Figure 22 is established, which can calculate the position of the radar scanning result.

Figure 22.

NAO robot coordinate system.

The Q-learning obstacle avoidance is implemented using Python 2.7, and the T-S fuzzy controller designed in the third section is used for trajectory tracking control. The grasping process of 3-D virtual robot in Choregraphe is shown in Figure 23.

Figure 23.

The virtual robot grasping process in Choregraphe.

The practical grasping process is shown in Figure 24. Figures 23 and 24 show the manipulator needs to avoid the obstacles 1 and 2 to grasp the target. According to the planned path using Q-learning algorithm, the end effector can change the direction of the movement when it collides the obstacle. By using T-S fuzzy PDC controller, it can accurately reach the location of the target along the planned path and successfully grasp the target.

Figure 24.

Physical robot grasping process.

The whole trajectory of the end effector of the manipulator in this experiment is shown in Figure 25. The trajectory of the planning path using the Q-learning algorithm is shown in Figure 26. In Figures 25 and 26, the red prism represents the obstacle and the green column represents the target, and the change of the trajectory of the end effector can be clearly shown.

Figure 25.

The whole grasping trajectory of end effector.

Figure 26.

The trajectory of the planning path.

Conclusion

In this article, aiming at the obstacle avoidance problem of NAO right arm, a method of enhanced Q-learning collision-less trajectory planning based on fuzzy PDC structure is proposed, which achieves the autonomous grasping to avoid obstacles. Firstly, the T-S fuzzy model of the manipulator is built, then a closed-loop fuzzy controller is designed based on the T-S fuzzy model to reduce the tracking error and analyze the stability of the closed-loop system. Finally, based on the fuzzy PDC control structure, the Q-learning algorithm is proposed to make the action strategy, and the obstacle avoidance trajectory is planned. The simulation and experiment results show that the proposed algorithm can effectively achieve the manipulator movement control and make the manipulator complete the grasping task successfully, which verifies the feasibility of the algorithm.

Future work will consider increasing the pose control of joint and designing motion grasping algorithms for dynamic environments, further extending the application scenario of humanoid manipulator.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work is supported by the National Natural Science Foundation of China projects (No. 61773333, No. 61473248, No. 61573305) and the Key Projects for Scientific and Technological Research in Colleges and Universities in Hebei Province (ZD2016150).

ORCID iD

Shuhuan Wen

References

Singh

Baranwal

Nandi

. Development of a self reliant humanoid robot for sketch drawing. Multimed Tools Appl 2017; 76(18): 18847–18870.

Masayuki

. Analytical inverse kinematics for 5-DOF humanoid manipulator under arbitrarily specified unconstrained orientation of end-effector. Robotica 2015; 33(4): 747–767.

Charalambous

Fletcher

Webb

. The development of a scale to evaluate trust in industrial human-robot collaboration. Int J Soc Robot 2016; 8(2): 193–209.

Goher

Mansouri

Fadlallah

. Assessment of personal care and medical robots from older adults’ perspective. Robot Biomim 2017; 4(1): 1–7.

Lee

Jung

Park

. A study on intelligent control algorithm development for cooperation working of human and robot. J Korean Soc Ind Converg 2017; 20(4): 285–297.

Wang

Sheng

. A robot robust trajectory tracking control system. Comput Syst Appl 2012; 2(1): 59–64.

Lynch

Shiroma

Arai

. Collision-free trajectory planning for a 3-DOF robot with a passive joint. Int J Robot Res 2016; 19(12): 1171–1184.

Zhang

Wan

. Study on the accuracy of a 6-DOF flexure hinge-based robot. Adv Sci Lett 2011; 4(4): 1485–1490.

Ranjan

Kumar

Laxmi

. Identification and control of NAO humanoid robot to grasp an object using monocular vision. In. The 2nd International conference on electrical, computer and communication technologies, Coimbatore, India, 22-24 February 2017, pp. 1–5. Coimbatore: IEEE.

10.

Vázquez

Jurado

Castañeda

. Real-time decentralized neural control via backstepping for a robotic arm powered by industrial servomotors. IEEE Trans Neur Net Lear 2018; 29(2): 419–426.

11.

Nugroho

Prihatmanto

Rohman

. Design and implementation of kinematics model and trajectory planning for NAO humanoid robot in a tic-tac-toe board game. In: IEEE 4th international conference on system engineering and technology, Bandung, Indonesia, 24-25 November 2014, pp. 1–7. Bandung: IEEE.

12.

Tian

Yuan

. Visual servoing method combining iterative learning control for household object handling. J Huazhong U Sci Med(Natural Science Edition) 2015; 43(S1): 536–540.

13.

Ren

Zhu

Xiong

. Trajectory planning of 7-DOF humanoid manipulator under rapid and continuous reaction and obstacle avoidance environment. Acta Autom Sinica 2015; 41(6): 1131–1144.

14.

Mao

Wen

. Eliminating drift of the head gesture reference to enhance Google Glass-based control of an NAO humanoid robot. Int J Adv Robot Syst 2017; 14(2): 1–10.

15.

Six

Briot

Chriette

. The kinematics, dynamics and control of a flying parallel robot with three quadrotors. IEEE Robot Autom Lett 2018; 3(1): 559–566.

16.

Helbing

Farkas

Vicsek

. Simulating dynamical features of escape panic. Nature 2000; 407(Sep): 487–490.

17.

Parisi

Gilman

Moldovan

. A modification of the social force model can reproduce experimental data of pedestrian flows in normal conditions. Phys A Stat Mech Appl 2009; 388(17): 3600–3608.

18.

Zhang

. Physics inspired methods for crowd video surveillance and analysis: a survey. IEEE Access 2018; pp(99): 1–1.

19.

Singh

Nandi

. NAO humanoid robot: analysis of calibration techniques for robot sketch drawing. Robot Auton Syst 2016; 79: 108–121.

20.

Wang

Hirai

. Soft gripper dynamics using a line-segment model with an optimization-based parameter identification method. IEEE Robot Autom Lett 2017; 2(2): 624–631.

21.

Takagi

Sugeno

. Fuzzy identification of systems and its applications to modeling and control. IEEE Trans Syst Man Cyb 1985; 15(1): 116–132.

22.

Wang

Tanaka

Griffin

. An approach to fuzzy control of nonlinear systems: stability and design issues. IEEE Trans Fuzzy Syst 1996; 4(1): 14–23.

23.

Lam

. LMI-based stability analysis for fuzzy-model-based control systems using artificial T–S fuzzy model. IEEE Trans Fuzzy Syst 2011; 19(3): 505–513.

24.

Watkins

CJCH

Dayan

. Q-learning. Mach Learn 1992; 8(3): 279–292.

25.

Zhijun

Zhao

Chen

. Reinforcement learning of manipulation and grasping using dynamical movement primitives for a humanoid-like mobile manipulator. IEEE/ASME Trans Mech 2018; 23(1): 121–131.

26.

Bolotnikova

Demirel

Anbarjafari

. Real-time ensemble based face recognition system for NAO humanoids using local binary pattern. Analog Integr Circuits Signal Process 2017; 92(3): 467–475.