Abstract

To improve the robustness of biped walking, a model parameters optimization method based on policy gradient descent learning is presented. For the linear inverted pendulum mode-based model parameters optimization, the input parameters of the inverted pendulum model and the torso attitude parameters of the robot are first selected as the correction variables, and the correction equation is established. Then, the tracking errors of the center of mass (CoM) of the robot and the errors of the robot posture relative to the upright state of the body are used to establish the fitness function. According to the fitness function, the gain coefficients in the model parameters correction equation are optimized using the policy gradient learning method, and the modified gain parameters are substituted into the correction equation to obtain the correction amount. By applying this optimization strategy, the robot can adjust its body posture and walking pattern quickly and in real time under unknown disturbances, which enhances walking robustness. Simulations and experiments on a full-body humanoid robot NAO validate the effectiveness of the proposed method. The experiments show that the optimized model yields a more controlled, robust walk on the NAO robot and on various surfaces without additional manual parameter tuning.

Keywords

Humanoid robot, robust walking, active balance, model parameters correction, policy gradient descent learning

Introduction

For humanoid robots, generating a stable and robust biped walking gait is very important. Many studies of dynamic biped walking represent the humanoid robot with simplified dynamics models,^{1–6} such as the three-dimensional linear inverted pendulum mode (3D-LIPM), the cart-table model, and so on. Using such an abstract model, there are various ways to generate appropriate gait patterns. Humanoid walking is commonly realized by planning the center of mass (CoM) trajectories, so that the resultant zero moment point (ZMP) trajectory follows a desired ZMP trajectory, which is normally determined by predefined foot positioning. These model-based control methods can realize biped walking in an ideal environment. However, under complex conditions, balanced walking of a humanoid robot requires real-time compensation to guarantee dynamic stability. To realize robust and dynamic walking, many strategies have been proposed to compensate for the nonzero variations in momentum^{7,8} or to regulate the CoM through sensory feedback.^{9–11}

Biped dynamic walking under disturbances is still an open question. To recover balance after external disturbances, several well-known methods that stabilize walking using sensor feedback have been proposed,^{7–9,12,13} including regulation of the CoM state or online modification of the predefined foot positions through sensory feedback.^{10,11,14–17} Considering humanoid momentum compensation, Englsberger and Ott⁷ proposed a walking method based on a capture point, which integrates CoM vertical motion and angular momentum change. Ugurlu et al.¹² created a strategy to rotate a humanoid’s upper body to compensate for undesired yaw movement when the robot walks. Chang et al.⁸ estimated the robot state and regulated the angular momentum of the CoM and the steps to recover from a push. Castano et al.¹⁰ proposed an online walking control that replanned the gait pattern based on foot placement control using the actual CoM state feedback. Wieber^{11,15} proposed an online walking gait generation approach, with both adjustable foot positioning and step duration, based on the linear model predictive control scheme. Based on the preview control scheme, Nishiwaki and Kagami¹⁶ proposed strategies to adjust future foot positioning and the ZMP trajectory. Stephens¹⁷ utilized an integral control method to maintain standing balance and viewed the disturbance as impulsive. Liu and Atkeson¹⁸ studied the balance of a standing robot based on a trajectory library, which is time-consuming.

Open-loop walking can easily be disturbed by external influences such as collisions with obstacles or irregular terrain. Closed-loop online modulation of the controller is an important strategy for handling disturbances during biped walking. Graf and Röfer¹⁹ presented a closed-loop 3D-LIPM-based gait generation method for the RoboCup standard platform. By adding sensor feedback, the robot can reach faster walking speeds and be robust to disturbances. On the one hand, this allows reacting to unexpected disruptions such as unevenness of the ground or forces exerted on the robot. On the other hand, it compensates for forces and hardware characteristics that were not considered in the model used. Sensor feedback is used to observe the state of the robot and to adjust the inverted pendulum model to the observed state in order to improve the stability of the walk. Their approach not only compensates for inaccuracies in the model but also reacts effectively to external perturbations such as forces exerted on the robot. Urbann and Hofmann⁵ presented a reactive stepping algorithm based on the generation of walking motions with sensor feedback. With the closed-form calculation of foot placement modifications, disturbances of the CoM position could be balanced with negligible ZMP deviations.

To improve the robustness of previous work on LIPM-based biped dynamic walking, the main contribution of this article is an improved closed-loop dynamic gait generation method for humanoid robots that builds on the previous work of Graf et al. and combines the LIPM with parameter optimization. For the presented LIPM-based model parameters optimization, the input parameters of the inverted pendulum model and the torso attitude parameters of the robot are first selected as the correction variables, and the correction equation is established. Then, the tracking errors of the CoM of the robot and the errors of the robot posture relative to the upright state of the body are used as the fitness index of the robot with respect to the current environment, and the fitness function is established. According to the fitness function, the gain coefficients in the model parameters correction equation are optimized using the policy gradient learning method, and the modified gain parameters are substituted into the correction equation to obtain the correction amount. By applying this optimization strategy, the robot can adjust its body posture and walking pattern quickly and in real time under unknown disturbances, which enhances walking robustness. Simulations and experiments on a full-body humanoid robot NAO validate the effectiveness of the proposed method. The experiments show that the optimized model yields a more controlled, robust walk on NAO robots and on various surfaces without additional manual parameter tuning.

The rest of this article is organized as follows. The second section presents the LIPM-based biped gait generation, including the LIPM model, the CoM trajectory and foot trajectory planning, and observation of the CoM by a Kalman filter. The third section introduces the design of the model parameters corrector based on the policy gradient descent learning method. The fourth section verifies the real-time performance and validity of the presented compensation algorithm by experiments. The fifth section provides the conclusions and discussion.

LIPM-based gait generation

The dynamics of a humanoid robot can be approximated as a linear inverted pendulum with its mass concentrated at the CoM and supported at the ZMP (Figure 1). In this model, the robot’s leg is assumed to be a weightless scalable limb and its kinematic constraints are not considered. When the constraint of the limited area of the supporting polygon is not considered and the vertical moment of the CoM is neglected, the dynamics of the LIPM in the x–z plane can be formulated as a strictly proper dynamic system as follows

{\dot{x}}_{c} = A x_{c} + B u_{x}

x_{z} = C x_{c}

Figure 1.

Linear inverted pendulum model.

where $x_{c} = {[x_{c}, {\dot{x}}_{c}, {\ddot{x}}_{c}]}^{T}$ contains the position, velocity, and acceleration of the CoM along the x-axis in ∑_w, u_x is the jerk of x_c, $p_{c} = {[x_{c}, y_{c}, z_{c}]}^{T}$ is the position of the CoM in ∑_w, $p_{z} = {[x_{z}, y_{z}, z_{z}]}^{T}$ is the position of the ZMP in ∑_w, g is the gravitational acceleration, and $A = [\begin{matrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{matrix}], B = [\begin{matrix} 0 \\ 0 \\ 1 \end{matrix}], C = [\begin{matrix} 1 & 0 & - \frac{z_{c} - z_{z}}{g} \end{matrix}]$ .
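As a small numerical illustration of the state-space model above, the following Python sketch builds A, B, and C and evaluates the ZMP output for one CoM state. The function name `lipm_matrices`, the CoM height of 0.26 m, and the sample state are assumptions chosen for illustration and do not come from the article.

```python
import numpy as np

G = 9.81  # gravitational acceleration in m/s^2


def lipm_matrices(z_c, z_z=0.0, g=G):
    """Return (A, B, C) of the jerk-driven LIPM for one horizontal axis."""
    A = np.array([[0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [0.0, 0.0, 0.0]])
    B = np.array([[0.0], [0.0], [1.0]])
    C = np.array([[1.0, 0.0, -(z_c - z_z) / g]])
    return A, B, C


# Assumed example values: CoM height 0.26 m, ZMP on the ground.
A, B, C = lipm_matrices(z_c=0.26)
state = np.array([[0.02], [0.10], [0.50]])  # CoM position, velocity, acceleration
zmp_x = (C @ state).item()                  # x_z = x_c - (z_c - z_z) / g * acceleration
state_rate = A @ state + B * 0.0            # state derivative for zero jerk input
```

The output row C encodes the usual cart-table relation: the ZMP lags behind the CoM in proportion to the CoM acceleration.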

Given a set of future foot positions, a reference ZMP trajectory that fulfills the dynamic stability requirement can be generated, and the desired CoM trajectory can be generated online using preview control (or the built-in PID control) based on the LIPM. By using inverse kinematics to obtain joint commands from the CoM trajectory, an open-loop humanoid walking control method can be realized. However, this can only realize biped walking in an ideal environment, such as walking on flat terrain without external disturbance.

The LIPM-based trajectory planning

The position and velocity of the CoM relative to the origin of the inverted pendulum are given by Graf and Röfer¹⁹ as follows

x (t) = x_{0} \cdot cosh (k \cdot t) + {\dot{x}}_{0} \cdot 1 / k \cdot sinh (k \cdot t)

\dot{x} (t) = x_{0} \cdot k \cdot sinh (k \cdot t) + {\dot{x}}_{0} \cdot cosh (k \cdot t)

where $k = \sqrt{g / h}$ , g is the gravitational acceleration, h is the height of the CoM above the ground, x₀ and ${\dot{x}}_{0}$ are the position and the velocity of the CoM relative to the origin of the inverted pendulum at t = 0, respectively.
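The closed-form solution above can be evaluated directly. The following is a minimal Python sketch; the function name `pendulum_state` and the numeric values in the usage lines are illustrative assumptions.

```python
import math


def pendulum_state(t, x0, v0, h, g=9.81):
    """Position and velocity of the CoM relative to the pendulum origin at time t.

    x0, v0: position and velocity at t = 0; h: CoM height above the ground.
    """
    k = math.sqrt(g / h)
    x = x0 * math.cosh(k * t) + v0 / k * math.sinh(k * t)
    v = x0 * k * math.sinh(k * t) + v0 * math.cosh(k * t)
    return x, v


# Assumed example: CoM 3 cm from the origin, moving at 0.1 m/s, height 0.26 m.
x_now, v_now = pendulum_state(0.1, x0=0.03, v0=0.1, h=0.26)
```

Because the motion is governed by hyperbolic functions, any nonzero offset from the origin grows with time, which is why the support leg has to be switched after a finite single support phase.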

The support leg switches constantly during walking, so the origin of the pendulum must switch as well, because the origin of the pendulum is on the support foot. As shown in Figures 1(a) and 2, Q is the coordinate system of the robot, and its origin is located between the feet. To eliminate the double support phase, we need to calculate the time point at which the support leg switches. t = 0 is defined as a starting time point for altering the support leg; at t = 0, the y-component of the velocity is 0 ( ${({\dot{x}}_{0})}_{y} = 0$ ). The position of the CoM at this point, ${(x_{0})}_{y}$ , is an arbitrary parameter whose value is greater or less than 0 depending on the active support leg. Hence, the CoM position in the y-direction over the range $t_{b} \leq t \leq t_{e}$ (a single support phase that starts at t = t_b and ends at t = t_e) is as follows

x_{y} (t) = {(x_{0})}_{y} \cdot cosh (k \cdot t)

Figure 2.

xy-cross section showing the step size $\bar{s}$ and the inverted pendulum origins r and $\bar{r}$ .

The inverted pendulum motion diagram in the y–z plane is shown in Figure 3. The swing foot should have crossed a distance of ${\bar{r}}_{y} + {\bar{s}}_{y} - r_{y}$ with respect to the support leg (Figure 2) at the end of the single support phase. The switching time point of the support leg can be determined by the ending time of the single support phase t_e as follows

{(x (t_{e}))}_{y} - {(\bar{x} ({\bar{t}}_{b}))}_{y} = {\bar{r}}_{y} + {\bar{s}}_{y} - r_{y}

{(\dot{x} (t_{e}))}_{y} = {(\bar{\dot{x}} ({\bar{t}}_{b}))}_{y}

Figure 3.

The inverted pendulum motion in y–z plane.

where $\bar{x} (\bar{t})$ and $\bar{\dot{x}} (\bar{t})$ are the position and velocity of the CoM relative to the next pendulum origin. t_e and ${\bar{t}}_{b}$ can be calculated using an iterative method.²⁰

The support leg switching in the x-direction follows the same rule as in the y-direction:

{(x (t_{e}))}_{x} - {(\bar{x} ({\bar{t}}_{b}))}_{x} = {\bar{r}}_{x} + {\bar{s}}_{x} - r_{x}

{(\dot{x} (t_{e}))}_{x} = {(\bar{\dot{x}} ({\bar{t}}_{b}))}_{x}

To ensure that the origin of the inverted pendulum in the next single support phase is optimal, with the CoM located directly above the origin, we impose the conditions ${\bar{r}}_{x} = 0$ and ${(\bar{x} (0))}_{x} = 0$ . When the CoM has the position $x_{t_{b}}$ and velocity ${\dot{x}}_{t_{b}}$ relative to the origin of Q at the beginning of a single support phase, r_x, ${(x_{0})}_{x}$ , and ${({\dot{x}}_{0})}_{x}$ can be computed from the following equations, so that all the pendulum parameters are determined.

\begin{array}{l} r_{x} + {(x_{0})}_{x} \cdot C (k \cdot t_{b}) + {({\dot{x}}_{0})}_{x} \cdot \frac{S (k \cdot t_{b})}{k} = {(x_{t_{b}})}_{x} \\ {(x_{0})}_{x} k \cdot S (k \cdot t_{b}) + {({\dot{x}}_{0})}_{x} \cdot C (k \cdot t_{b}) = {({\dot{x}}_{t_{b}})}_{x} \\ {(x_{0})}_{x} k \cdot S (k \cdot t_{e}) + {({\dot{x}}_{0})}_{x} \cdot C (k \cdot t_{e}) - {({\bar{x}}_{0})}_{x} \cdot k \cdot S (\bar{k} \cdot {\bar{t}}_{b}) \\ - {({\bar{\dot{x}}}_{0})}_{x} \cdot C (\bar{k} \cdot {\bar{t}}_{b}) = 0 \\ r_{x} + {(x_{0})}_{x} \cdot C (k \cdot t_{e}) + {({\dot{x}}_{0})}_{x} \cdot \frac{S (k \cdot t_{e})}{k} - {({\bar{x}}_{0})}_{x} \cdot C (\bar{k} \cdot {\bar{t}}_{b}) \\ - {({\bar{\dot{x}}}_{0})}_{x} \cdot \frac{S (\bar{k} \cdot {\bar{t}}_{b})}{k} = {\bar{r}}_{x} + {\bar{s}}_{x} \end{array}

where C( ) and S( ) are the abbreviation of cosh( ) and sinh( ).

The $x_{z} (t)$ in z-direction can be defined as follows

x_{z} (t) = h + l_{z} \cdot l (t)

φ (t) = \frac{t - t_{b}}{t_{e} - t_{b}}

l (t) = \begin{cases} \frac{1}{2} \cdot (1 - \cos (\frac{2 π \cdot (φ (t) - φ_{s})}{φ_{D}})), & φ_{s} \leq φ (t) \leq φ_{s} + φ_{D} \\ 0, & \text{else} \end{cases}

where l_z is the lift height in z-direction of CoM, φ_s and φ_D are the proportional constant coefficients, which describe the beginning time and the duration of movement with the relationship of $2 φ_{s} + φ_{D} = 1, 0 \leq φ_{s}, φ_{D} \leq 1$ .
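The vertical profile can be sketched in a few lines of Python: the phase φ(t) runs from 0 to 1 over the single support phase, and l(t) is a full cosine bump that is nonzero only inside the window [φ_s, φ_s + φ_D]. The function names and the numeric values below are assumptions for illustration.

```python
import math


def phase(t, t_b, t_e):
    """Normalized phase of the single support phase: 0 at t_b, 1 at t_e."""
    return (t - t_b) / (t_e - t_b)


def lift_profile(phi, phi_s, phi_d):
    """Cosine bump l: 0 at the window edges, 1 at the window center, 0 outside."""
    if phi_s <= phi <= phi_s + phi_d:
        return 0.5 * (1.0 - math.cos(2.0 * math.pi * (phi - phi_s) / phi_d))
    return 0.0


def com_height(t, t_b, t_e, h, l_z, phi_s, phi_d):
    """CoM height x_z(t) = h + l_z * l(t)."""
    return h + l_z * lift_profile(phase(t, t_b, t_e), phi_s, phi_d)


# Assumed example: phi_s = 0.25 and phi_D = 0.5 satisfy 2 * phi_s + phi_D = 1.
z_mid = com_height(0.25, t_b=0.0, t_e=0.5, h=0.26, l_z=0.01, phi_s=0.25, phi_d=0.5)
```

With 2φ_s + φ_D = 1 the window is centered in the single support phase, so the maximum lift occurs at mid-phase.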

When the movement of the CoM trajectory relative to the origin of the pendulum motion is determined, the movement of the support leg with respect to the CoM will also be determined, which can be expressed as

p_{sup_foot} = {[\begin{matrix} - x_{x} (t) & - x_{y} (t) & - x_{z} (t) \end{matrix}]}^{T}

The movement of the swing foot with respect to the robot coordinate system Q can be divided into the xy-direction movement and the z-direction movement as follows

\begin{array}{l} p_{{sw_foot}_{x y}} = (s_{x y} + {\bar{s}}_{x y}) \cdot m (t) - s_{x y} + f_{y} \\ m (t) = \begin{cases} 0, & φ (t) < φ_{s} \\ \frac{1}{2} \cdot (1 - \cos (\frac{π \cdot (φ (t) - φ_{s})}{φ_{D}})), & φ_{s} \leq φ (t) \leq φ_{s} + φ_{D} \\ 1, & φ (t) > φ_{s} + φ_{D} \end{cases} \end{array}

p_{{sw_foot}_{z}} = s_{z} \cdot l (t)

where s_xy and ${\bar{s}}_{x y}$ are the current and next step’s size in xy-direction, s_z is the lift height in z-direction, and f_y is the foot position which only contains y-component (positive or negative is determined by the left or right foot).
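A minimal Python sketch of the horizontal swing-foot motion follows. It assumes that the blend m is 0 before the movement window starts (φ < φ_s), rises along a half cosine inside the window, and stays at 1 after the window ends (φ > φ_s + φ_D). The function names and sample values are illustrative assumptions.

```python
import math


def swing_blend(phi, phi_s, phi_d):
    """Blend m: 0 before the window, half-cosine ramp inside it, 1 afterwards."""
    if phi < phi_s:
        return 0.0
    if phi > phi_s + phi_d:
        return 1.0
    return 0.5 * (1.0 - math.cos(math.pi * (phi - phi_s) / phi_d))


def swing_foot_xy(phi, s_xy, s_next_xy, f_y, phi_s, phi_d):
    """Swing foot position in the xy-plane of the robot coordinate system Q.

    s_xy, s_next_xy: current and next step sizes as (x, y) tuples.
    f_y: lateral foot offset, which affects only the y-component.
    """
    m = swing_blend(phi, phi_s, phi_d)
    x = (s_xy[0] + s_next_xy[0]) * m - s_xy[0]
    y = (s_xy[1] + s_next_xy[1]) * m - s_xy[1] + f_y
    return x, y


# Assumed example: 4 cm current step, 5 cm next step, foot 5 cm to the left.
start = swing_foot_xy(0.0, (0.04, 0.0), (0.05, 0.0), 0.05, 0.25, 0.5)
end = swing_foot_xy(1.0, (0.04, 0.0), (0.05, 0.0), 0.05, 0.25, 0.5)
```

The foot therefore starts one current step behind the reference point and ends one next step ahead of it, with zero velocity at both ends of the window.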

Meanwhile, we need to set the angular velocity of the support foot and the swing foot as follows

\begin{array}{l} ω_{sup_foot} = {[\begin{matrix} 0 & 0 & 0 \end{matrix}]}^{T} \\ ω_{sw_foot} = {[\begin{matrix} ω_{x} \cdot l (t) & ω_{y} \cdot l (t) & 0 \end{matrix}]}^{T} \end{array}

Finally, combining target foot positions relative to the CoM and CoM position relative to torso by computing current robot’s stance, the joint angles can be calculated by inverse kinematics.

Observing the CoM by Kalman filter

Based on Graf’s previous work,¹⁹ two Kalman filters are used to determine an observed position of the CoM, one each for the x- and y-components of the position and the velocity of the CoM. The expected CoM position x_ei with respect to Q is computed using x(t), and its error with respect to the measured position x_mi is calculated to update the Kalman filter as follows

x_{e i} = r + x (t_{i})

Δ x_{i} = x_{m i} - x_{e i}

The state quantities and covariance matrices of the Kalman prediction phase are as follows

u_{i}' = {[\begin{matrix} {(x_{e i})}_{x} & {(x_{e i})}_{y} & {(\dot{x} (t_{i}))}_{x} & {(\dot{x} (t_{i}))}_{y} \end{matrix}]}^{T}

Σ_{i}' = A \cdot Σ_{i - 1} A^{Τ} + Σ_{ε i}

where $A = [\begin{matrix} 1 & 0 & Δ t_{i} & 0 \\ 0 & 1 & 0 & Δ t_{i} \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{matrix}], Δ t_{i} = t_{i} - t_{i - 1}$ , and Σ_εi is a process noise covariance matrix.

The state quantity and covariance matrix of the Kalman update phase are as follows

K_{i} = Σ_{i}' \cdot C^{T} \cdot {(C \cdot Σ_{i}' \cdot C^{T} + Σ_{m i})}^{- 1}

u_{i} = {[\begin{matrix} {(x_{f i})}_{x} & {(x_{f i})}_{y} & {({\dot{x}}_{f i})}_{x} & {({\dot{x}}_{f i})}_{y} \end{matrix}]}^{T} = u_{i}' + K_{i} \cdot Δ x_{i}

Σ_{i} = Σ_{i}' - K_{i} \cdot C \cdot Σ_{i}'

where K_i is the Kalman gain, $C = [\begin{matrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{matrix}]$ , and Σ_mi is the measurement noise covariance matrix. A smaller Σ_εi and a larger Σ_mi make the correction of the inverted pendulum model parameters less effective.
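One prediction and update cycle of the filter can be sketched with NumPy as follows. The state prediction itself is assumed to come from the pendulum model (the expected position and velocity), so only the covariance is propagated with A. The function name `kalman_step` and all numeric values are illustrative assumptions.

```python
import numpy as np


def kalman_step(u_pred, sigma_prev, x_meas, dt, sigma_eps, sigma_m):
    """One predict/update cycle for the state [x, y, x_dot, y_dot].

    u_pred: state predicted by the pendulum model (length 4).
    x_meas: measured CoM position (length 2).
    sigma_eps, sigma_m: process and measurement noise covariances.
    """
    A = np.array([[1.0, 0.0, dt, 0.0],
                  [0.0, 1.0, 0.0, dt],
                  [0.0, 0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]])
    C = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0]])
    sigma_pred = A @ sigma_prev @ A.T + sigma_eps
    gain = sigma_pred @ C.T @ np.linalg.inv(C @ sigma_pred @ C.T + sigma_m)
    delta_x = x_meas - C @ u_pred          # measured minus expected position
    u_filt = u_pred + gain @ delta_x
    sigma_filt = sigma_pred - gain @ C @ sigma_pred
    return u_filt, sigma_filt


# Assumed example: the model expects the CoM at the origin, the sensors disagree.
u_pred = np.zeros(4)
x_meas = np.array([0.01, -0.02])
trusting, _ = kalman_step(u_pred, np.eye(4), x_meas, 0.02, np.eye(4), 1e-9 * np.eye(2))
doubting, _ = kalman_step(u_pred, np.eye(4), x_meas, 0.02, np.eye(4), 1e9 * np.eye(2))
```

In this sketch a very small measurement covariance pulls the filtered position onto the measurement, while a very large one leaves the model prediction almost unchanged, which matches the remark about the effect of the two covariances.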

The filtered position x_fi and the filtered velocity ${\dot{x}}_{f i}$ of the Kalman filter that estimates the true position and velocity of the CoM are used to redetermine the parameters of the inverted pendulum of the current single support phase. Since the prediction of the natural motion of the CoM is based on the pendulum parameters, the main purpose of the correction is to improve the prediction in the next iteration. So the pendulum parameters r, x₀, ${\dot{x}}_{0}$ , and t_i are redetermined to find a pendulum function that fits to the estimated position and the estimated velocity of the CoM

r' + x' (t_{i}') = x_{f i}

\dot{x'} (t_{i}') = {\dot{x}}_{f i}

The motion equation in the y-direction is corrected as follows

{(x_{f i})}_{y} - r_{y}' = {(x_{0})}_{y}' \cdot \cosh (k' \cdot t_{i}')

{({\dot{x}}_{f i})}_{y} = {(x_{0})}_{y}' \cdot k' \cdot \sinh (k' \cdot t_{i}')

The values of ${(x_{0})}_{y}'$ and t_i′ can be obtained by solving these two equations with the identity $\cosh^{2} - \sinh^{2} = 1$ as follows; the remaining parameters of the inverted pendulum model are then obtained by the inverted pendulum parameter calculation step described previously

{(x_{0})}_{y}' = \sqrt{{({(x_{f i})}_{y} - r_{y}')}^{2} - \frac{{({\dot{x}}_{f i})}_{y}^{2}}{{k'}^{2}}}

t_{i}' = \frac{1}{k'} \cdot \operatorname{arsinh} (\frac{{({\dot{x}}_{f i})}_{y}}{k' \cdot {(x_{0})}_{y}'})
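The lateral refit can be checked numerically. The sketch below assumes the lateral motion model x_y(t) = (x_0)_y · cosh(k · t) relative to the pendulum origin, so that the filtered position and velocity determine (x_0)_y′ and t_i′. The function name `refit_lateral` is hypothetical, and choosing the sign of the square root from the side of the origin on which the CoM lies is an assumption added here to cover both support legs.

```python
import math


def refit_lateral(x_f, v_f, r_y, k):
    """Recover the lateral pendulum parameters from a filtered CoM state.

    x_f, v_f: filtered CoM position and velocity in the y-direction.
    r_y: y-position of the pendulum origin; k: sqrt(g / h).
    Returns (x0, t) such that x_f - r_y = x0*cosh(k*t) and v_f = x0*k*sinh(k*t).
    """
    d = x_f - r_y
    radicand = d * d - (v_f / k) ** 2
    if radicand <= 0.0:
        raise ValueError("state is not reachable by x0 * cosh(k * t)")
    # Assumption: cosh is positive, so x0 has the same sign as the offset d.
    x0 = math.copysign(math.sqrt(radicand), d)
    t = math.asinh(v_f / (k * x0)) / k
    return x0, t


# Round trip with assumed values: generate a state, then recover the parameters.
k = math.sqrt(9.81 / 0.26)
x0_true, t_true, r_y = -0.05, 0.08, 0.05
x_f = r_y + x0_true * math.cosh(k * t_true)
v_f = x0_true * k * math.sinh(k * t_true)
x0_fit, t_fit = refit_lateral(x_f, v_f, r_y, k)
```

The round trip recovers the generating parameters, which is a quick consistency check on the two correction equations.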

Learning-based model parameters correction

In this work, a model parameters optimization method based on policy gradient descent learning is proposed. There are usually two ways of using policy gradient learning with the inverted pendulum model. One approach is to update the input parameters of the inverted pendulum every time a policy set is generated during parameter optimization. When the robot, via the virtual linkage model, takes one step using this policy set, the corresponding value of the objective function and the parameter set of the next iteration are obtained. Once the iteration count reaches the preset number N_iter, the parameter set is taken as the input parameters of the inverted pendulum, and the output parameters of the inverted pendulum are passed on to the robot. This method can directly adjust the gait of a robot. However, each iteration involves computations of both the robot model and the inverted pendulum model, which means N_iter previews for each real step of the robot. The previews take a long time and prevent the robot’s pace from being smooth, so this approach is not suitable for real-time robot walking.

The other approach is to evaluate the state of a step with the objective function after the robot has taken that step, and then to compensate the input parameters of the inverted pendulum. In this case, policy gradient learning does not directly optimize the input parameters of the inverted pendulum but optimizes the gain coefficients of the compensation amount, thereby indirectly adjusting the gait of the robot. This second approach separates the gradient descent method for policy search from the virtual linkage model of the robot, and the computation speed is greatly improved. It is thus suitable for real-time gait adjustment of the robot. However, a relational function between the parameter compensation of the inverted pendulum and the gain coefficients needs to be established to capture the internal relationship between the compensation amount and the virtual linkage model of the robot, which makes the design somewhat more difficult.

The overall procedure of humanoid walking with learning-based parameters correction is summarized in Figure 4. To determine an observed position of the CoM, a Kalman filter is used in the inner feedback loop for the x- and y-components of the position and the velocity of the CoM. The filtered position x_f and the filtered velocity ${\dot{x}}_{f}$ of the Kalman filter, which estimate the true position and velocity of the CoM, are used to redetermine the parameters of the inverted pendulum of the current single support phase. In the outer parameters correction loop, the policy gradient learning (PGL) compensator is established by selecting the input parameters of the inverted pendulum model and the torso attitude parameters of the robot as the correction variables.

Figure 4.

The diagram of our humanoid walking control with model parameters correction.

Model parameters correction equation

Many parameters of the LIPM influence biped walking, such as the step strides s_x and s_y, the step height s_z, and the height of the inverted pendulum h. In addition, the robot is a multidimensional system with many degrees of freedom (DOFs) that cannot be completely described by the inverted pendulum; the trunk posture θ_B, for example, is not captured. The variables s_x, s_y, $θ_{B, x}$ , and $θ_{B, y}$ are selected as the key parameters that affect the walking pattern, and the parameters correction equation is constructed as follows

Δ s_{x} = K_{1} \cdot \frac{1}{N} \sum_{i = 1}^{N} (x_{f, x, i} - x_{e, x, i}) + K_{3} \cdot \frac{1}{N} \sum_{i = 1}^{N} (θ_{B, y, i} - θ_{B, y, i}^{ref})

Δ s_{y} = K_{2} \cdot \frac{1}{N} \sum_{i = 1}^{N} (x_{f, y, i} - x_{e, y, i}) + K_{4} \cdot \frac{1}{N} \sum_{i = 1}^{N} (θ_{B, x, i} - θ_{B, x, i}^{ref})

Δ θ_{B, x} = K_{5} \cdot \frac{1}{N} \sum_{i = 1}^{N} (p_{LHip, z, i} - p_{RHip, z, i})

Δ θ_{B, y} = K_{6} \cdot \frac{1}{N} \sum_{i = 1}^{N} (p_{SuppFoot, x, i} - p_{Head, x, i})

where $N = \frac{t_{e} - t_{b}}{Δ T}$ is the number of interpolation steps in the single support phase, x_fi is the CoM estimate from the Kalman filter, x_ei is the ideal value of the CoM, s is the step stride, and Δs is the correction of the step size. θ_B is the actual torso inclination angle and Δθ_B is the correction of the torso angle. $θ_{B, x}^{ref}$ is the torso’s tilt angle when upright. ${(p_{RHip})}_{z}$ and ${(p_{LHip})}_{z}$ are the z-direction displacements of the right and left legs of the robot, respectively. ${(p_{Head})}_{x}$ and ${(p_{SuppFoot})}_{x}$ are the x-direction displacements of the head and the support leg of the robot, respectively. $\underline{K} = \{K_{1}, \dots, K_{6}\}$ are the gain coefficients.
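The four correction terms are averages over the N frames of one single support phase, scaled by the gains. The following NumPy sketch evaluates them; the function name `correction_terms`, the array layout (one row per frame, columns x and y), and the sample numbers are assumptions for illustration.

```python
import numpy as np


def correction_terms(gains, x_f, x_e, theta_b, theta_ref,
                     p_lhip_z, p_rhip_z, p_supp_x, p_head_x):
    """Return (ds_x, ds_y, dtheta_x, dtheta_y) for one single support phase.

    gains: (K1, ..., K6).
    x_f, x_e: filtered and expected CoM positions, shape (N, 2).
    theta_b, theta_ref: torso angles and their references, shape (N, 2).
    The remaining arguments are per-frame scalars, shape (N,).
    """
    k1, k2, k3, k4, k5, k6 = gains
    com_err = np.mean(np.asarray(x_f) - np.asarray(x_e), axis=0)
    tilt_err = np.mean(np.asarray(theta_b) - np.asarray(theta_ref), axis=0)
    ds_x = k1 * com_err[0] + k3 * tilt_err[1]   # x step uses the tilt about y
    ds_y = k2 * com_err[1] + k4 * tilt_err[0]   # y step uses the tilt about x
    dtheta_x = k5 * np.mean(np.asarray(p_lhip_z) - np.asarray(p_rhip_z))
    dtheta_y = k6 * np.mean(np.asarray(p_supp_x) - np.asarray(p_head_x))
    return ds_x, ds_y, dtheta_x, dtheta_y


# Assumed example with N = 4 identical frames.
n = 4
result = correction_terms(
    gains=(1.0, 2.0, 3.0, 4.0, 5.0, 6.0),
    x_f=np.tile([0.02, -0.01], (n, 1)), x_e=np.zeros((n, 2)),
    theta_b=np.tile([0.1, 0.2], (n, 1)), theta_ref=np.zeros((n, 2)),
    p_lhip_z=np.full(n, 0.21), p_rhip_z=np.full(n, 0.20),
    p_supp_x=np.zeros(n), p_head_x=np.full(n, 0.03),
)
```

Note the cross-coupling: the step correction along one axis uses the torso tilt about the other axis.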

This parameter modification function is designed based on the correction of the foothold of the robot in the x- and y-directions and the correction of the straightness of the torso. The correction of the step length along the x-direction, $Δ s_{x}$ , is related to the error in the x-axis centroid and the error in the y-axis torso angle because the y-axis torso angle represents the forward and backward inclination of the body. The correction of the step length along the y-direction, $Δ s_{y}$ , is related to the error in the y-axis centroid and the error in the x-axis torso angle because the x-axis torso angle represents the left and right inclination of the body. The correction of the torso angle on the x-direction, $Δ θ_{B_{x}}$ , is related to the height difference between the left and right hip joints, which is because, when a robot needs to keep both feet on the ground or perform the walking task, the left and right tilts of the body will cause a certain degree of height difference between the left and right hip joints, and for a robot without a driving waist joint, the inclination of the body is mainly caused by the height difference of the two hip joints. The correction of the torso angle in the y-direction, $Δ θ_{B_{y}}$ , is related to the difference in distance between the head joint and the support foot in the x-direction, which is because, for a robot without a driving waist joint, the forward and backward inclination of the body of a robot is approximately equivalent to the linkage rotation of the robot with the support foot as the fulcrum, with the support foot as the origin of the linkage, and with the head as the end of the linkage. Therefore, the distance between the support foot and the head in the x-direction represents the degree of inclination of the body.

The output of the compensator is used to correct the inverted pendulum model and the robot model as follows

{\bar{s}}_{x y} = s_{x y} + Δ s

θ_{B, i} = θ_{B, i - 1} + Δ θ_{B}

where ${\bar{s}}_{x y}$ is the input parameter of the inverted pendulum model that represents the step size of the next single support phase; it is corrected only once per single support phase. $θ_{B, i}$ is the torso angle for each frame of the single support phase, so the torso angle is compensated multiple times within each single support phase.

Policy gradient reinforcement learning

The policy gradient learning method is a powerful machine learning algorithm for finding a locally optimal policy when there is no analytic mapping from the performance index to the parameters to be tuned or when the gradient of the cost function cannot be calculated directly. This method has already been adopted in many robotics applications, especially for learning the walking of legged robots.^{21–23} In this work, the parameters of the inverted pendulum are optimized using policy gradient learning. If, after adjustment, the robot maintains high CoM tracking accuracy and a good upright body posture when subjected to an unknown disturbance, then the robot is highly adaptable to that disturbance. In the correction equation, $K_{1} \dots K_{6}$ are the gains that control the output of the compensator. Their values are selected to minimize the fitness function $F (\underline{K})$ , which is defined as follows

F (\underline{K}) = α_{x} (| Δ s_{x} | + | Δ {\bar{x}}_{x} |) + α_{y} (| Δ s_{y} | + | Δ {\bar{x}}_{y} |) + β_{x} (| Δ θ_{B, x} | + | Δ {\bar{θ}}_{B, x} |) + β_{y} (| Δ θ_{B, y} | + | Δ {\bar{θ}}_{B, y} |)

Δ {\bar{x}}_{x} = \frac{1}{N} \sum_{i = 1}^{N} (x_{f, x, i} - x_{e, x, i})

Δ {\bar{x}}_{y} = \frac{1}{N} \sum_{i = 1}^{N} (x_{f, y, i} - x_{e, y, i})

Δ {\bar{θ}}_{B, x} = \frac{1}{N} \sum_{i = 1}^{N} (θ_{B, x, i} - θ_{B, x, i}^{ref})

Δ {\bar{θ}}_{B, y} = \frac{1}{N} \sum_{i = 1}^{N} (θ_{B, y, i} - θ_{B, y, i}^{ref})

where $\underline{K} = \{K_{1}, \dots, K_{6}\}$ represents the set of gain parameters, and α_x, α_y, β_x, and β_y are the weight factors, which satisfy $α_{x} + α_{y} = 1$ and $β_{x} + β_{y} = 1$ . The smaller the value of the fitness evaluation function, the higher the fitness of the robot under the gain parameter set.

Each term of the fitness function contains an absolute value of the compensation amount with the mean error, which is obtained by the linear addition of the absolute value of the output of the compensator, the absolute value of the centroid error, and the absolute value of the body inclination angle error, with different weights. The first two terms represent the following effect of the centroid, and the latter two terms represent the erecting effect of the body. In the process of gradient learning, if the value of the fitness function decreases gradually, the errors in the centroid following and the body inclination are gradually reduced, and the adaptability of the robot under the current parameter set is gradually increased. The corrections of the input parameters and the body inclination of the inverted pendulum cannot be too large, to ensure a reasonable range for allocation of the robot joint space. At the same time, it should be noted that the objective function is not directly associated with the robot walking time but a better erecting effect of the body indicates a longer implementation of centroid following in the robot.
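The fitness value is a weighted sum of absolute values, which can be sketched in Python as below. The function name `fitness`, the tuple layout of the arguments, and the sample numbers are assumptions; α_y and β_y are derived from the constraints α_x + α_y = 1 and β_x + β_y = 1.

```python
def fitness(corrections, mean_errors, alpha_x=0.5, beta_x=0.5):
    """Weighted fitness F for one single support phase (smaller is better).

    corrections: (ds_x, ds_y, dtheta_x, dtheta_y), the compensator outputs.
    mean_errors: mean CoM errors in x and y, then mean torso angle errors
                 about x and y, in that order.
    """
    alpha_y = 1.0 - alpha_x
    beta_y = 1.0 - beta_x
    ds_x, ds_y, dth_x, dth_y = corrections
    ex, ey, eth_x, eth_y = mean_errors
    return (alpha_x * (abs(ds_x) + abs(ex))
            + alpha_y * (abs(ds_y) + abs(ey))
            + beta_x * (abs(dth_x) + abs(eth_x))
            + beta_y * (abs(dth_y) + abs(eth_y)))


# Assumed example values.
value = fitness((0.02, -0.01, 0.05, -0.03), (0.01, 0.02, -0.04, 0.06),
                alpha_x=0.6, beta_x=0.3)
```

Because the compensator outputs themselves appear in the sum, large corrections are penalized alongside large tracking errors, which reflects the requirement that the corrections must not become too large.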

Optimization learning of the parameters in the model

Policy gradient learning is used to optimize the gain parameter set $\underline{K} = \{K_{1}, \dots, K_{6}\}$ , that is, the values assigned to each parameter in the parameter set of the fitness function; the value of the fitness function is calculated from the gain parameter set and the current state of the robot. Assuming that the objective function $F (\underline{K})$ is differentiable with respect to each parameter in $\underline{K}$ , the local optimal solution ${\underline{K}}^{*}$ can be obtained by calculating the gradient of $F (\underline{K})$ .

The specific procedures of the method of policy gradient learning are as follows.

Step 1

In the kth iteration, for the gain parameter set ${\underline{K}}^{k - 1}$ obtained in the previous iteration, the partial derivatives of $F (\underline{K})$ at each parameter value within ${\underline{K}}^{k - 1}$ are calculated, and n policies are randomly generated around ${\underline{K}}^{k - 1}$ . The obtained policy set is represented as ${}_{m}{\underline{K}}^{k - 1}$ (m = 1, …, n), with the number of policies n proportional to the searching space, and the equation for the generation of the policy set is

{}_{m}{\underline{K}}^{k - 1} = {\underline{K}}^{k - 1} + {}_{m}{\underline{ρ}}

where ${}_{m}{\underline{ρ}}$ (m = 1, …, n) refers to the disturbance set, each disturbance ρ_m in the disturbance set is randomly selected from the set $\{- ε_{m}, 0, + ε_{m}\}$ , and ε_m represents the disturbance gain parameter of the corresponding ρ_m.

Step 2

According to the values −ε_m, 0, and +ε_m of the disturbance ρ_m, the ${}_{m}{\underline{K}}^{k - 1}$ are divided into three corresponding groups: $G_{- ε_{m}}$ , G₀, and $G_{+ ε_{m}}$ . Each ${}_{m}{\underline{K}}^{k - 1}$ is substituted into the fitness evaluation function to obtain the corresponding average of each group: ${\bar{F}}_{- ε_{m}}$ , ${\bar{F}}_{0}$ , and ${\bar{F}}_{+ ε_{m}}$ .

Step 3

The approximate gradient value $\nabla {\underline{K}}^{k - 1}$ is calculated as follows: if ${\bar{F}}_{- ε_{m}} > {\bar{F}}_{0}$ and ${\bar{F}}_{+ ε_{m}} > {\bar{F}}_{0}$ , then $\nabla {\underline{K}}^{k - 1} = 0$ ; otherwise, $\nabla {\underline{K}}^{k - 1} = {\bar{F}}_{+ ε_{m}} - {\bar{F}}_{- ε_{m}}$ .

Step 4

After orthogonal processing of $\nabla {\underline{K}}^{k - 1}$ , the gradient value $\nabla {\underline{\overset{⌢}{K}}}^{k - 1}$ is obtained by multiplying by a fixed step factor η. The gradient value $\nabla {\underline{\overset{⌢}{K}}}^{k - 1}$ is subtracted from the policy set ${\underline{K}}^{k - 1}$ to yield the policy set ${\underline{K}}^{k}$ of this iteration, which is then used for the next iteration.

Step 5

When the number of iterations reaches the preset value N_iter, the iteration ends.
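Steps 1 to 5 can be sketched in plain Python as follows. This is a sketch under stated assumptions, not the authors' implementation: the "orthogonal processing" of Step 4 is interpreted here as scaling the gradient estimate to unit length, a parameter whose perturbation groups are not all populated gets a zero gradient, and the toy quadratic fitness used in the example stands in for the walking fitness function. All function names are hypothetical.

```python
import math
import random


def gradient_component(f_minus, f_zero, f_plus):
    """Step 3: zero if both perturbed groups score worse than the unperturbed one."""
    if f_minus > f_zero and f_plus > f_zero:
        return 0.0
    return f_plus - f_minus


def pgl_iteration(K, fitness, eps, eta, n_policies, rng):
    """One iteration: returns the gain set for the next iteration."""
    # Step 1: n random policies around K; each gain is shifted by -eps, 0 or +eps.
    shifts = [[rng.choice((-e, 0.0, e)) for e in eps] for _ in range(n_policies)]
    scores = [fitness([k + s for k, s in zip(K, shift)]) for shift in shifts]
    grad = []
    for j in range(len(K)):
        # Step 2: group the scores by the sign of the shift applied to gain j.
        groups = {-1: [], 0: [], 1: []}
        for shift, score in zip(shifts, scores):
            groups[(shift[j] > 0) - (shift[j] < 0)].append(score)
        if not all(groups.values()):
            grad.append(0.0)  # assumption: no estimate if a group is empty
            continue
        f_minus, f_zero, f_plus = (sum(groups[s]) / len(groups[s]) for s in (-1, 0, 1))
        grad.append(gradient_component(f_minus, f_zero, f_plus))
    # Step 4: scale to unit length (assumed meaning) and step by eta.
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(K)
    return [k - eta * g / norm for k, g in zip(K, grad)]


def optimize_gains(K0, fitness, eps, eta, n_policies, n_iter, seed=0):
    """Step 5: repeat the iteration a preset number of times."""
    rng = random.Random(seed)
    K = list(K0)
    for _ in range(n_iter):
        K = pgl_iteration(K, fitness, eps, eta, n_policies, rng)
    return K


def toy_fitness(K):
    """Stand-in fitness with its minimum at K = 0 (illustration only)."""
    return sum(k * k for k in K)


K_final = optimize_gains([1.0] * 6, toy_fitness, eps=[0.05] * 6,
                         eta=0.02, n_policies=30, n_iter=100)
```

Each iteration needs only n fitness evaluations and no analytic gradient, which is what makes this kind of search usable when the fitness comes from simulated walking trials.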

Experiments and results

Humanoid robot

To validate the proposed method, the NAO robot produced by Aldebaran Robotics is used for the experiments (Figure 5(a)). The robot has 25 DOFs, with 11 DOFs for the two legs. The robot is about 54 cm tall during walking and weighs 4.35 kg. The sensor data and the walking control module update at a rate of 50 frames per second. The robot is equipped with an x86 AMD Geode 500 MHz CPU. The difficulty in developing walking approaches on this robot is that each leg weighs almost as much as the body and the links of the robot are made of plastic, so the dynamics of the robot differ considerably from the simplified dynamics model, and the disturbances caused by the deformation of the plastic links are significant.

Figure 5.

Humanoid robot platform. (a) NAO robot. (b) The ODE-based simulation environment. (c) Coordinate systems of the humanoid robot.

Two coordinate systems are used (Figure 5(c)). The first is the world coordinate system ∑_w, with its origin on the ground and its x- and y-axes forming a plane parallel to the ground. The second is the supporting coordinate system ∑_sup, which is attached to the supporting foot. Its origin is the origin point of the supporting foot, with its x-axis pointing in the forward direction of the supporting foot and its y-axis pointing to the outside. In this work, experiments on both the simulated and the real full-body humanoid robot are carried out to show the performance of the proposed method. Because of the potential damage and the time involved in running the PGL process on a real humanoid robot, the learning process is currently carried out in the Webots simulator.
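The relation between the two coordinate systems can be illustrated with a planar transform. The sketch below is an assumption-laden example rather than the article's code: the function name and arguments are hypothetical, the foot pose is reduced to a ground-plane position and a yaw angle, and a right-handed frame is used throughout, so the mirroring implied by "y-axis to the outside" for one of the feet is not modelled.

```python
import math


def world_to_support(point_w, foot_origin_w, foot_yaw):
    """Express a world-frame ground point in the supporting-foot frame.

    point_w, foot_origin_w: (x, y) in the world frame, in metres.
    foot_yaw: heading of the supporting foot in the world frame, in radians.
    """
    dx = point_w[0] - foot_origin_w[0]
    dy = point_w[1] - foot_origin_w[1]
    c, s = math.cos(foot_yaw), math.sin(foot_yaw)
    # Rotate the offset by -foot_yaw (transpose of the foot's rotation).
    return (c * dx + s * dy, -s * dx + c * dy)
```

For example, with the foot at (1, 2) and heading along the world y-axis, the world point (1, 3) lies directly ahead of the foot, on the x-axis of the supporting frame.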

Simulated experiments

Two experimental setups are developed to test the robot performance; in both, the robot has no prior knowledge of the terrain. In the first setup, the robot is disturbed by an uncertain external force during walking. The point at which the external force is applied, the force moment, the magnitude of the force, and its duration are all uncertain, which tests the robustness of the robot to unknown disturbances during walking. In the second setup, the robot walks on different terrains to verify the walking robustness when the walking conditions change.

Experiments with only Kalman filter

Omnidirectional walking with only Kalman filter

Figure 6 shows an experiment in which the robot turns counterclockwise while walking forward at a speed of 0.05 m/s, with only the Kalman filter and no online correction of the LIPM model parameters. The robot follows the specified step size with good CoM trajectory tracking, and the ZMP always remains within the support polygons of the feet.

Figure 6.

Plot recorded when the robot walks forward while turning counterclockwise. The top plot shows the x–y plane data and the bottom plot shows the CoM and ZMP trajectories. CoM: Centre of mass; ZMP: Zero moment point.
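The stability condition used throughout these experiments, that the ZMP remains inside the support polygon, can be checked with a standard point-in-convex-polygon test. The following is a minimal sketch, assuming the support polygon is given as a convex list of ground-plane vertices in counterclockwise order; the function name is hypothetical and the polygon geometry is left to the caller.

```python
def zmp_in_support_polygon(zmp, vertices):
    """Return True if the ZMP lies inside or on a convex support polygon.

    zmp: (x, y) position of the zero moment point.
    vertices: (x, y) polygon corners listed in counterclockwise order.
    """
    px, py = zmp
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Cross product of the edge with the vector to the point; a negative
        # value means the point is to the right of the edge, i.e. outside.
        cross = (x2 - x1) * (py - y1) - (y2 - y1) * (px - x1)
        if cross < 0.0:
            return False
    return True
```

In single support the polygon would be the outline of the stance foot; in double support it would be the convex hull of both feet.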

Flat terrain walking with unknown disturbance

An external impulse of about 6.5 N is applied at about 9 s, mainly along the positive y-axis direction. Figure 7 shows an experiment in which the robot walks straight forward with only Graf’s method of correcting the inverted pendulum parameters and without the policy-learning-based model parameters optimization. The robot swings and falls down after being hit by the disturbance. Since there is no online compensation to modify the state of the robot, the robot still follows the predefined footsteps, which results in the same reference ZMP trajectory. Although the Kalman filter allows the robot to generate an appropriate CoM trajectory to track the reference ZMP, the CoM tracking requirement cannot be fulfilled under a large external disturbance. The robot manages to swing for several steps and finally falls down.

Figure 7.

Plot recorded when the robot walks forward with only Graf’s method but no learning-based model parameters correction.

Experiments with parameters online correction

Flat terrain walking with unknown disturbance

In the same experiment, the parameters correction is added in the outer loop of the PGL compensator, as shown in Figure 4. The input parameters of the inverted pendulum model and the torso attitude parameters of the robot are used as the correction variables to establish the compensator, and the parameters of the inverted pendulum are optimized by the gradient learning method. Figure 8 shows snapshots of the motions performed during disturbance resistance. An external impulse of about 6.5 N is applied at about 5 s in the y-direction, with a duration of about 0.5 s. The initial values and the weight factors are set as shown in Table 1. The robot begins walking along the x-direction. After being disturbed by the external force in the left direction, the robot, with the online compensation, immediately adjusts its pace over the next few steps and steps to the left; through the trunk angle compensation, it gradually restores its upright state.

Figure 8.

Snapshots of the motions performed in disturbance resistance.

Table 1.

FPC parameters.

Gain	Value
α_x	0.3
α_y	0.7
β_x	0.6
β_y	0.4
K^0	$[0, 0, 0, 0, 1, 1]$
η	0.001
ε_j	0.001
N_iter	480
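To make the role of the weight factors in Table 1 concrete, the sketch below shows one way a weighted fitness value could be computed from the CoM tracking errors and the torso angle errors. It is a hypothetical illustration only: the weighted sum of absolute errors, the assignment of the α weights to the CoM errors and the β weights to the torso angles, and all names are assumptions, and the fitness equation actually used in the article may differ in form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FitnessWeights:
    alpha_x: float
    alpha_y: float
    beta_x: float
    beta_y: float


def weighted_fitness(w, com_err_x, com_err_y, torso_x, torso_y):
    """Hypothetical weighted sum of absolute errors; lower is better."""
    # Assumption: alpha weights the CoM tracking errors.
    com_term = w.alpha_x * abs(com_err_x) + w.alpha_y * abs(com_err_y)
    # Assumption: beta weights the torso deviation from upright.
    torso_term = w.beta_x * abs(torso_x) + w.beta_y * abs(torso_y)
    return com_term + torso_term


# Weight factors as listed in Table 1.
TABLE_1 = FitnessWeights(alpha_x=0.3, alpha_y=0.7, beta_x=0.6, beta_y=0.4)
```

With weights of this kind, a larger α_y than α_x would penalise lateral CoM tracking error more heavily than forward error.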

The torso attitude curve of the robot is shown in Figure 9. The plot shows that the robot tilts to the right after being subjected to the external force. After about 7 s of adjustment, the x-axis torso curve returns to the periodic fluctuation of normal walking. Since the inertial unit sensor of the robot cannot measure the z-axis torso angle, z-axis torso angle compensation is not included in the parametric compensator design, so the z-axis torso angle is always zero. The expected and measured CoM trajectories and the measured ZMP trajectory are shown in Figure 10. When the robot is subjected to the rightward external force, the ZMP shifts to the right and fluctuates strongly but always remains in the support area. The CoM is tracked well, and the external disturbance has little effect on the CoM tracking. The recorded footsteps of the robot in the world coordinate system are shown in Figure 11. With the compensator, the pace of the robot changes in response to external disturbances: when the robot is subjected to rightward external forces, it moves to the right and the travel distance in the x-direction is slightly reduced. The value of the objective function of the policy gradient learning method while the parameter compensator is active during walking is shown in Figure 12. The objective function curve shows an overall decreasing trend, indicating that the CoM tracking error and the torso angle error are gradually reduced, which verifies the effectiveness of the presented model compensation.

Figure 9.

Body attitude of robot under external disturbance.

Figure 10.

The ideal CoM, measured CoM and measured ZMP trajectory under external disturbance. CoM: Centre of mass; ZMP: Zero moment point.

Figure 11.

The footstep of the robot in x–y plane under external disturbance.

Figure 12.

Fitness function curve of the policy gradient learning.

Irregular terrain adaptive walking

The second experimental setup tests the ability of the presented method to walk on irregular terrain. In the simulated experiment, a flat plate of size 0.5 × 0.5 × 0.016 m³ is laid on the walking ground. While walking on the plate, the robot is affected by the reaction of the plate. The initial parameter set of the policy gradient learning method is ${\underline{K}}^{0} = \{0, 0, 0, 0, 1, 2\}$, with η = 0.001, $ε_{j} = 0.001$, and $N_{iter} = 720$. The weighting factors of the objective equation are set as $α_{x} = 0.6$, $α_{y} = 0.4$, $β_{x} = 0.6$, and $β_{y} = 0.6$. Figure 13 shows snapshots of the robot walking up and down the plate. Through the compensator, the robot adjusts its step size and trunk posture in real time and successfully walks across the plate.

Figure 13.

Snapshots of the walking up and down the plate. (a) Without online parameters correction and (b) with online parameters correction.

Figure 14 shows the body attitude in the pitch and roll planes during walking. The robot slides backward and shakes when walking up the plate. While walking on the plate, the x-axis torso curve returns to its normal waveform, while the y-axis torso curve remains negative and fluctuates slightly. When walking down the plate, the robot fluctuates and slides in the forward direction. The ideal CoM, the measured CoM, and the measured ZMP trajectory are shown in Figure 15. When the robot walks up and down the plate, the x-direction CoM tracking shows a more obvious error: the actual CoM position lags behind the ideal CoM position. The ZMP trajectory fluctuates strongly while the robot walks up and down the plate but always remains in the supporting polygon. The footsteps of the robot with the compensator in the world coordinate system are shown in Figure 16. The yellow area marks the plate. When the robot moves up and down the plate, its pace changes with the ground conditions.

Figure 14.

Body attitude angles.

Figure 15.

The ideal CoM, measured CoM, and measured ZMP trajectory during adaptive walking. CoM: Centre of mass; ZMP: Zero moment point.

Figure 16.

Footsteps in x–y plane during walking through the plate.

Real robot experiments

Two experimental setups are developed to test the real robot performance; in both, the robot has no prior knowledge of the disturbance or the walking terrain. Video clips of the experiments can be found in the video attachment. In the first setup, the robot is disturbed by an unknown external force, as shown in Figure 17. The result of the experiment is shown in Figure 18. The robot is disturbed by an external force of about 5 N from the back while walking at a speed of 0.12 m/s. After the disturbance, the model parameter compensator allows the robot to adjust its pace and body posture in real time according to the target equation, thus reducing the CoM tracking error and the inclination of the body. Several experiments show that, in the simulation environment, the robot can withstand an external force of about 10 N in the y-direction and about 6 N in the x-direction. The real robot can resist a force of about 5 N in the x-direction.

Figure 17.

Real experiment setup of external disturbance during biped walking.

Figure 18.

Real experiment setup of external disturbance during biped walking

Conclusions and discussion

For the control problem of humanoid walking, much recent research focuses on generating reference CoM trajectories according to constraints, a predefined ZMP, or other dynamics. To resist small disturbances, some recent approaches also add online adaptation of those trajectories, under the heuristic assumption that the robot can track these trajectories exactly. However, because of the nature of the robot’s contact with the environment, its ability to track these trajectories is unavoidably limited, which leads to loss of control and, consequently, to the fall of the robot. Although this problem has been noted in some recent work and some approaches have been presented, an applicable and efficient method is yet to be explored. In this article, a novel solution to this problem is proposed. With the online model parameters optimization strategy, the robot can adjust its body posture and walking patterns quickly and in real time under unknown disturbances, which enhances the walking robustness. The robot can successfully recover its balance even after being pushed by significant external forces. The gains of the model parameters corrector are learnt through the policy gradient reinforcement learning method, which shows satisfactory results when applied to both the physically simulated and the real full-body humanoid robots.

Footnotes

Acknowledgements

The authors would like to thank the technicians in the Laboratory of Robotics and Intelligent Systems of Tongji University for their assistance during experiments.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported in part by the National Natural Science Foundation of China (grant nos. U1713211, 61673300), and the Fundamental Research Funds for the Central Universities, Basic Research Project of Shanghai Science and Technology Commission (grant nos. 16JC1401200, 16DZ1200903).

Supplemental material

Supplementary material for this article is available online.

References

1. Kajita, Kanehiro, Kaneko. The 3D linear inverted pendulum mode: a simple modeling for a biped walking pattern generation. In: IEEE/RSJ international conference on intelligent robots and systems, Maui, USA, 29 October–3 November 2001, pp. 239–246.

2. Graf, Röfer. A center of mass observing 3D-LIPM gait for the RoboCup standard platform league humanoid. In: Röfer, Mayer, Savage, Saranli (eds) RoboCup 2011: Robot Soccer World Cup XV. RoboCup 2011. Lecture notes in computer science, vol. 7416, Berlin, Heidelberg: Springer, 2011, pp. 102–113.

3. Graf, Röfer. A closed-loop 3D-LIPM gait for the RoboCup standard platform league humanoid. In: Zhou, Pagello, Behnke, Menegatti, Röfer, Stone (eds) Proceedings of the fourth workshop on humanoid soccer robots in conjunction with the 2010 IEEE-RAS international conference on humanoid robots, Taipei, Taiwan, 18–22 October 2010.

4. Liu, Urbann. Bipedal walking with dynamic balance that involves three-dimensional upper body motion. Robot Auton Syst 2016; 77: 39–54.

5. Urbann, Hofmann. A reactive stepping algorithm based on preview controller with observer for biped robots. In: IEEE/RSJ international conference on intelligent robots and systems (IROS), Daejeon Convention Center, Daejeon, Korea, 9–14 October 2016, pp. 5324–5331.

6. Kajita, Hirukawa, Harada. Introduction to humanoid robotics. Springer Publishing Company, Inc, 2014.

7. Englsberger, Ott. Integration of vertical com motion and angular momentum in an extended capture point tracking controller for bipedal walking. In: IEEE-RAS international conference on humanoid robots, Osaka, Japan, 29 November–1 December 2012, pp. 183–189.

8. Chang, Huang, Hsu. Humanoid robot push-recovery strategy based on CMP criterion and angular momentum regulation. In: IEEE international conference on advanced intelligent mechatronics (AIM), Busan, Korea, 7–11 July 2015, pp. 761–766.

9. Liu, Wang. Active balance of humanoids with foot positioning compensation and non-parametric adaptation. Robot Auton Syst 2016; 75: 297–309.

10. Castano, Zhou. Dynamic and reactive walking for humanoid robots based on foot placement control. Int J Hum Robot 2016; 13(02): 1550041.

11. Wieber. Trajectory free linear model predictive control for stable walking in the presence of strong perturbations. In: IEEE-RAS international conference on humanoid robots, Genova, Italy, 4–6 December 2006, pp. 137–142.

12. Ugurlu, Saglia, Tsagarakis. Yaw moment compensation for bipedal robots via intrinsic angular momentum constraint. Int J Hum Robot 2012; 9(04): 1250033.

13. Chen, Zhang. Bio-inspired control of walking with toe-off, heel-strike and disturbance rejection for a biped robot. IEEE Trans Indust Elect 2017; 64(10): 7962–7971.

14. Chen, Cai. Rebalance strategies for humanoids walking by foot positioning compensator based on adaptive heteroscedastic SpGPs. In: IEEE international conference on robotics and automation (ICRA), Shanghai, China, 9–13 May 2011, pp. 563–568.

15. Diedam, Dimitrov, Wieber. Online walking gait generation with adaptive foot positioning through linear model predictive control. In: IEEE/RSJ international conference on intelligent robots and systems (IROS), Nice, France, 22–26 September 2008, pp. 1121–1126.

16. Nishiwaki, Kagami. Strategies for adjusting the ZMP reference trajectory for maintaining balance in humanoid walking. In: IEEE international conference on robotics and automation (ICRA), Anchorage, Alaska, USA, 3–7 May 2010, pp. 4230–4246.

17. Stephens. Integral control of humanoid balance. In: IEEE/RSJ international conference on intelligent robots and systems (IROS), San Diego, USA, 29 October–2 November 2007, pp. 4020–4027.

18. Liu, Atkeson. Standing balance control using a trajectory library. In: IEEE/RSJ international conference on intelligent robots and systems, St. Louis, MO, USA, 11–15 October 2009, pp. 3031–3036.

19. Graf, Röfer. A center of mass observing 3D-LIPM gait for the RoboCup standard platform league humanoid. Robot Soccer World Cup XV, 2012, pp. 102–114.

20. Czarnetzki, Kerner, Urbann. Observer-based dynamic walking control for biped robots. Robot Auton Syst 2009; 57(8): 839–845.

21. Missura, Behnke. Gradient-driven online learning of bipedal push recovery. In: IEEE/RSJ international conference on intelligent robots and systems (IROS), Hamburg, Germany, 28 September–2 October 2015, pp. 387–392. IEEE.

22. Peng, Berseth, van de Panne. Terrain-adaptive locomotion skills using deep reinforcement learning. ACM Trans Graphics 2016; 35(4): 81.

23. Missura, Behnke. Online learning of bipedal walking stabilization. KI-Künstliche Intelligenz 2015; 29(4): 401–405.


Dynamic walking control of humanoid robots combining linear inverted pendulum mode with parameter optimization
