
Abstract

Dynamic motion primitive has been the most prevalent model-based imitation learning method in the last few decades. Gaussian mixed regression dynamic motion primitive, which draws upon the strengths of both the motion model and the probability model to cope with multiple demonstrations, is a practical and prominent branch of the dynamic motion primitive family. As Gaussian mixed regression dynamic motion primitive learns only from expert demonstrations and requires full environmental information, it is incapable of handling tasks with unmodeled obstacles. To address this problem, we propose the positive and negative demonstrations-based dynamic motion primitive, in which the introduction of negative demonstrations brings additional flexibility. Positive and negative demonstrations-based dynamic motion primitive extends Gaussian mixed regression dynamic motion primitive in three aspects. The first is a new log-likelihood function that balances the probabilities on positive and negative demonstrations. The second is the positive and negative demonstrations-based expectation–maximization algorithm, which involves iteratively calculating the lower bound of a new Q-function. The last is the application framework of data set aggregation for positive and negative demonstrations-based dynamic motion primitive to handle unmodeled obstacles. Experiments on several typical robot manipulation tasks, including letter writing, obstacle avoidance, and grasping in a grid box, are conducted to validate the performance of positive and negative demonstrations-based dynamic motion primitive.

Keywords

Expectation–maximization algorithm, Gaussian mixed regression, negative demonstrations, skill transfer learning, data set aggregation

Introduction

Traditional robots have helped humans accomplish laborious work in factories for several decades. They need merely repeat some preplanned motions in limited or structured environments. In the last few years, the development of deep learning has facilitated substantial progress in the perception abilities of robots. Some challenges once thought unsolvable, such as 3D object recognition and localization, 6D pose estimation, and grasp position prediction, have been partly solved with deep convolutional neural networks.¹ With stronger perception, robots can observe and learn skills from human actions actively and directly. This is known as skill transfer learning (STL), or imitation learning, of robots.

STL models can be split into three categories²,³: motion, policy, and procedural models. Motion models, such as the dynamic movement primitives (DMPs)⁴ or hidden Markov model (HMM),⁵ treat STL tasks as trajectory encoding and recovering problems. Once the analytical and structured task configuration is available, either predefined manually or fed via additional independent perception modules, motion models are able to learn skills from a few or even only one demonstration. By contrast, policy models,⁶,⁷ such as behavioral cloning or generative adversarial imitation learning, handle STL tasks as sequential decision problems. Policy models output commands for the actuators (joint angles, Cartesian position, or servo torques) at each control step. Policy models have a superior capability in feature expression and generalization and can produce end-to-end policies. Nonetheless, they rely on numerous demonstrations and a long learning time. Procedural models represent² the task structure and configuration at a higher level of semantics, making them more suitable for logical inference in complex tasks. A perfect STL model should possess the abilities and advantages of all three models. Among them, motion models are the most likely to be widely used in the near future.

DMP⁸,⁹ is the most prevalent motion model nowadays. With a second-order spring-damping system that is driven by a virtual force term, DMP first encodes the demonstrations and then transfers the shape information to new tasks. DMP and its improved variants, such as Gaussian mixed regression (GMR) DMP (GMRDMP)⁴ and compliant parametric DMP (CPDMP),¹⁰ have been thoroughly demonstrated to be capable of extracting and transferring shape information with a few expert demonstrations. When coping with unexpected obstacles, a compensatory force associated with the obstacle can be introduced into the spring-damping system to adjust the recovered trajectory. However, a structured representation of unexpected obstacles is not always available in practice, and the DMP methods cannot handle this case. Could we address this issue through human intervention?

GMRDMP is a typical and practical member of the DMP family. It draws upon the strengths of both the motion model and the probability model to deal with multiple demonstrations. Thus, we have chosen GMRDMP as the basic method to be improved. Because the optimization problem behind GMRDMP is to recover the demonstration as closely as possible, there is no room for the user to modify the recovered trajectory by adjusting hyperparameters. In other words, GMRDMP lacks a flexible interaction channel for the user to adjust the recovered trajectory.

To address this problem, we extend GMRDMP with negative demonstrations to improve its flexibility, which yields the positive and negative demonstration-based DMP (PNDMP). Positive demonstrations are the original expert demonstrations, whereas negative demonstrations are failed executions by either the experts or the robot itself. Positive ones highlight what the student robot should perform, while negative ones indicate what it should avoid. In supervised learning and reinforcement learning, negative samples are necessary for providing specific annotation information or reward value. In particular, some researchers have utilized them to filter the nonexpert demonstrations as a preprocessing step for motion models.¹¹,¹² However, to the best of the authors’ knowledge, few studies have sought to directly improve DMP with negative demonstrations.

PNDMP consists of three essential extensions to GMRDMP:

A new log-likelihood function is defined for PNDMP to balance the probabilities on positive and negative demonstrations. With it, we can establish an adjusting mechanism for the recovered trajectories.

A positive and negative demonstrations-based expectation–maximization (PNEM) algorithm is proposed to solve the optimization problem with respect to the new log-likelihood function. A new Q-function and its lower bound are derived to obtain an iterative solution.

PNDMP is applied in the data set aggregation (DAgger)¹³ framework to tackle unmodeled obstacles. In the PNDMP DAgger framework, negative demonstrations implying unmodeled obstacles are gathered and annotated during the training process, and then used to retrain the Gaussian mixed model (GMM). DAgger is a fundamental paradigm designed primarily for policy models; however, it has not been applied to motion models yet.

The rest of the article is organized as follows. The second section reviews recent developments of DMP. The third section introduces some prerequisite knowledge and formulates the optimization problem of PNDMP. The fourth section presents the main algorithms. Experimental results are analyzed in the fifth section. Finally, some conclusions are drawn in the sixth section.

Related work

DMP uses a spring-damping dynamic system to fit the evolution of the observed states.⁸,¹⁴ The dynamic system is driven by a nonlinear force term that is encoded in the form of a weight vector and a group of Gaussian basis functions. Typically, the weights are optimized via regression methods such as locally weighted regression,¹⁵ locally weighted projection regression,¹⁶ and Gaussian process regression (GPR).¹⁷ In theory, DMP can ensure that the system converges to a new target state. To account for the random characteristics of multiple demonstrations, probabilistic models have been introduced into the DMP framework. With an appropriate probabilistic model, such as GMM,¹⁸ HMM,¹⁰ hierarchical Bayesian model,¹⁹,²⁰ and kernel model,²¹ the spatial and temporal correlations among demonstrations can be represented by a few parameters. A typical probabilistic DMP framework can be summarized in three steps. First, the virtual force terms of all demonstrations are extracted via the dynamic equation of DMP. Second, force trajectories are encoded with a probabilistic model. Finally, the virtual force is retrieved via the probabilistic model for a new task, based on which the motion is then evolved via the dynamic equation of DMP. A recent comprehensive survey of the DMP family can be found in the study of Saveriano et al.²²

Many trajectory encoding methods have been proposed to improve the adaptability of probabilistic DMP. GMRDMP,⁴ which adopts the expectation–maximization (EM) algorithm to obtain a GMM of the virtual force and the GMR algorithm to recover the force trajectories, is a prevalent probabilistic DMP method. Gribovskaya et al.¹⁸ used a first-order nonlinear multivariate system to replace the spring-damping system; this system can ensure robustness to external spatial-temporal perturbations through online adaptation of a motion. Colome and Torras²³ proposed a probabilistic dimensionality reduction method to decrease the computation burden of probabilistic movement primitives and extended it to learn the joint couplings. Tanwani and Calinon²⁴ designed a semi-tied GMM to associate or tie the covariance matrices of the mixture model with a common latent space to cope with perturbations. Lioutikov et al.²⁵ segmented the demonstrations and established a primitive library, which can be reused in different tasks. Li et al.²⁶ introduced dynamic time warping into the GMRDMP⁴ to ensure that the sampling time is identical.

Meanwhile, some researchers have been devoted to enhancing the robustness when the new task lies beyond the demonstration regions. Considering that some shape information would be lost when the target state is not covered by the demonstrations, Calinon²⁷ proposed the multi-coordinate GMRDMP. Huang et al.²⁸ introduced confidence weights for each frame of multi-frame GMRDMP and designed an iterative algorithm to choose the frames highly relevant to a task. Pervez et al.²⁹ encoded the force term of visual demonstrations directly with a convolutional neural network and changed the traditional discriminative model to a generative model³⁰ to solve the extrapolation problem. The model was trained with synthetic demonstrations to improve its generalization capability. Mei et al.³¹ developed a continuous DMP to track dynamic targets, which was referred to as an online navigation method. Karlsson et al.³² proved the exponential convergence of temporally coupled DMP, which is used to handle perturbations. Yang et al.³³ proposed a framework that considers the trajectory tracking module and DMP simultaneously.

Some methods for establishing the relationship between task variables and encoding parameters have also been proposed. Akgun and Thomaz³⁴ designed a framework to estimate the actions and goals from demonstrations, which can serve for acquiring the relationship. Alizadeh et al.³⁵ proposed a partial GPR algorithm for the situation that task parameters are partially observed. Liu et al.³⁶ designed an approximation regression approach to learn the relationship and proposed a post-correction algorithm to improve accuracy.

Safe learning control has long attracted considerable attention, especially in safety-critical fields. The safety criteria differ from task to task. Four criteria are widely used: the stability of dynamic systems,³⁷,³⁸ obstacle avoidance,³⁹ robustness to noise,³⁹ and the gap between the training set and real tasks.⁴⁰ The safety of learning control can be improved in three ways. The first is to utilize an external safety module³⁹ or manual intervention⁴¹ to abort or modify dangerous motions. The external module requires full obstacle information to evaluate the risk probability and generate a control command on the learning process with an independent algorithm, while manual intervention requires that the framework provide an interaction channel. The second is to modify the learning process.⁴² Guidance or regulation about the safe region is introduced into the learning strategy based on a specific safety model. Although this approach has been used in policy models, it has not yet been applied directly to any motion model. The third is to modify the learning goal according to the safety criterion.⁴³,⁴⁴

For DMP, the safety criteria should be incorporated into the spring-damping system in the form of an additional force term. Pervez and Lee³⁰ combined a generative model with DMP to learn the relation between external configuration and the model parameters, while Liu et al.³⁶ used GPR to learn the relation. Uger and Girgin¹⁰ extended the nonlinear shaping function to a parametric form using a parametric hidden Markov model instead of a weighted sum of radial basis functions. Park et al.⁴⁵ added the gradient of a potential field centered on the obstacle, that is, a repellent force, to the dynamic equation. Ginesi et al.⁴⁶ proposed some super-quadric potential functions that include the velocity in the form of the potential of obstacles. As all these methods require full information about obstacles, they are unsuitable for tasks in which the structured representation of obstacles is not available.

Preliminaries and problem description

DMP can extract the shape information from one single demonstration, and GMRDMP is developed to learn from multiple demonstrations to improve the generality. In this section, discrete DMP and GMR are introduced first. Then, the reversibility and continuity of DMP and GMRDMP are discussed. At last, a formal representation of the model of PNDMP is presented.

In all figures of this article, “circle,” “X,” and “square” markers represent the starting, end, and target points of a trajectory, respectively.

Dynamic movement primitives

DMP models the kinematics of a demonstration as a spring-damping system and enforces a nonlinear virtual force to it to make the state converge to a target. Up to now, a variety of DMPs have been investigated.²² In this article, we have considered a basic but relatively well-developed one.⁴⁷ The dynamic system can be represented as

τ \ddot{x} = α (β (x_{t_{f}} - x) - \dot{x}) + f (s)

where $x$ and $\dot{x}$ are the position and velocity, respectively; $x_{0}$ and $x_{t_{f}}$ are the starting point and endpoint, respectively; $τ$ is the time-scaling coefficient; $α$ is the spring coefficient; and $β$ is the damping coefficient. Usually, we let $β = α / 4$ so that the system is critically damped. $f (s)$ is the virtual force that drives the system to converge to $x_{t_{f}}$. $s$ is the phase variable that describes the normalized time and is governed by a first-order canonical system

τ \dot{s} = - α_{s} s

$s$ decreases from 1 toward 0 as the state moves from the starting point to the endpoint, and the decay rate is determined by the coefficient $α_{s}$. $f (s)$ is encoded as a function of $s$ as follows

\begin{array}{l} f (s) = \frac{\sum_{i = 1}^{N} w_{i} ϕ_{i} (s)}{\sum_{i = 1}^{N} ϕ_{i} (s)} \cdot s \cdot (x_{t_{f}} - x_{0}), \\ ϕ_{i} (s) = e^{- h_{i} {(s - c_{i})}^{2}} \end{array}

where $ϕ_{i} (s)$ is the $i$th Gaussian basis function with center $c_{i}$ and bandwidth $h_{i}$, $w_{i}$ is the weight of $ϕ_{i} (s)$, and $N$ is the number of basis functions. In general, a larger $N$ will generate a smoother trajectory.
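Equations (1) to (3) can be integrated numerically to recover a trajectory. The following is a minimal one-DOF sketch, assuming explicit Euler integration, zero initial velocity, and the hypothetical function name `dmp_rollout`; none of these choices is prescribed by the method itself.

```python
import numpy as np

def dmp_rollout(w, c, h, x0, xf, tau=1.0, alpha=10.0, alpha_s=1.0, dt=0.01, T=1.0):
    """Integrate equations (1)-(3) for one DOF (assumed Euler scheme)."""
    beta = alpha / 4.0                      # choice stated in the text
    x, v, s = float(x0), 0.0, 1.0           # assumption: the motion starts at rest
    traj = [x]
    for _ in range(int(round(T / dt))):
        phi = np.exp(-h * (s - c) ** 2)     # Gaussian basis functions, eq. (3)
        f = (phi @ w) / (phi.sum() + 1e-12) * s * (xf - x0)
        acc = (alpha * (beta * (xf - x) - v) + f) / tau   # eq. (1)
        v += acc * dt
        x += v * dt
        s += (-alpha_s * s / tau) * dt      # canonical system, eq. (2)
        traj.append(x)
    return np.array(traj)
```

With all weights set to zero the virtual force vanishes, and the rollout reduces to a critically damped approach to the endpoint, which is a convenient sanity check before learned weights are plugged in.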

For a multi-degree of freedom (DOF) system, each DOF corresponds to an independent spring-damping model, and all DOFs share one canonical system. DMP can transfer the shape feature of a trajectory from the source task to the target task based on its temporal and spatial invariance.
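As a brief check of the critical-damping choice, consider equation (1) with $f (s) = 0$ and the error $e = x - x_{t_{f}}$. The lines below are only a worked restatement of equations (1) and (2) under that simplification, not an additional result.

```latex
τ \ddot{e} + α \dot{e} + α β e = 0, \qquad τ r^{2} + α r + α β = 0, \qquad Δ = α^{2} - 4 τ α β
Δ = 0 \iff β = \frac{α}{4 τ}, \quad \text{which reduces to } β = α / 4 \text{ for } τ = 1
s (t) = s (0) \, e^{- α_{s} t / τ}, \qquad s (0) = 1
```

The last line is the closed-form solution of the canonical system, so $s$ decays monotonically toward 0 without reaching it in finite time.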

Gaussian mixture regression

GMR provides a probabilistic retrieval of movements or policies with a GMM based on the theory of marginal and conditional distribution. Recovering a trajectory can be formalized as a regression problem. GMR computes the next action on the fly with a computation time independent of the number of data points that are used to train the model.

A GMM with $K$ components is denoted by $G M M^{K}$, has parameters $θ = \{π_{k}, μ_{k}, Σ_{k}\}_{k = 1}^{K}$, and is written as

P (S) \sim \sum_{k = 1}^{K} π_{k} N (μ_{k}, Σ_{k})

where $S$ is the multidimensional variable, and $π_{k}$, $μ_{k}$, and $Σ_{k}$ are the mixing coefficient (i.e. weight), mean vector, and covariance matrix of the $k$th Gaussian component, respectively.

For a regression task, S, $μ_{k}$ , and $Σ_{k}$ can be written in the block form as

S = [\begin{matrix} S^{I} \\ S^{O} \end{matrix}], μ_{k} = [\begin{matrix} μ_{k}^{I} \\ μ_{k}^{O} \end{matrix}], Σ_{k} = [\begin{matrix} Σ_{k}^{I} & Σ_{k}^{I O} \\ Σ_{k}^{O I} & Σ_{k}^{O} \end{matrix}]

where the superscripts I and O represent the input and output blocks, respectively.

Given a model $G M M^{K}$ and the input $S^{I}$, we can compute the output according to the conditional distribution

P (S^{O} | S^{I}) = N (S^{O} | {\hat{μ}}^{O}, {\hat{Σ}}^{O})

where

\begin{array}{l} {\hat{μ}}_{k}^{O} = μ_{k}^{O} + Σ_{k}^{O I} {(Σ_{k}^{I})}^{- 1} (S^{I} - μ_{k}^{I}), \\ {\hat{Σ}}_{k}^{O} = Σ_{k}^{O} - Σ_{k}^{O I} {(Σ_{k}^{I})}^{- 1} Σ_{k}^{I O}, \\ h_{k} = \frac{π_{k} N (S^{I} | μ_{k}^{I}, Σ_{k}^{I})}{\sum_{l = 1}^{K} π_{l} N (S^{I} | μ_{l}^{I}, Σ_{l}^{I})}, \\ {\hat{μ}}^{O} = \sum_{k = 1}^{K} h_{k} {\hat{μ}}_{k}^{O}, \\ {\hat{Σ}}^{O} = \sum_{k = 1}^{K} h_{k} [{\hat{Σ}}_{k}^{O} + {\hat{μ}}_{k}^{O} {({\hat{μ}}_{k}^{O})}^{T}] - {\hat{μ}}^{O} {({\hat{μ}}^{O})}^{T} \end{array}

In contrast to other regression methods such as locally weighted regression, locally weighted projection regression, or GPR, GMR models the regression function indirectly. It models the joint probability density function of data points and derives the regression function from the joint density model. The estimation of the model parameters is thus achieved in an offline phase that depends linearly on the number of data points. One advantage of GMR is that its regression is independent of this number and can be computed very rapidly. The other advantage is that any variables can be chosen as the input freely without modification of the model. For example, once a GMM about the joint probability $[s, x, y, z]$ is trained, the user can treat s as the input variable and $x, y, z$ as the output, or the user can treat x as the input variable and z as the output. GMR is very suitable for those applications which specify input and output dimensions at run time (e.g. to handle missing sensory inputs or to react swiftly by retrieving partial outputs).
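The conditional mean and covariance of GMR can be computed directly from the block form of the GMM parameters. The sketch below assumes that the input dimensions are stored first in each mean vector and covariance matrix; the function names `gauss_pdf` and `gmr` and the array layout are illustrative choices, not part of the original method description.

```python
import numpy as np

def gauss_pdf(x, mu, cov):
    """Multivariate Gaussian density N(x | mu, cov)."""
    d = x - mu
    norm = np.sqrt((2 * np.pi) ** mu.size * np.linalg.det(cov))
    return np.exp(-0.5 * d @ np.linalg.solve(cov, d)) / norm

def gmr(s_in, pi, mu, sigma, n_in):
    """pi: (K,), mu: (K, D), sigma: (K, D, D); the first n_in dims are inputs."""
    s_in = np.atleast_1d(s_in).astype(float)
    K, n_out = pi.size, mu.shape[1] - n_in
    # Activation h_k of each component for the given input
    h = np.array([pi[k] * gauss_pdf(s_in, mu[k, :n_in], sigma[k, :n_in, :n_in])
                  for k in range(K)])
    h = h / h.sum()
    mean = np.zeros(n_out)
    second = np.zeros((n_out, n_out))
    for k in range(K):
        s_ii = sigma[k, :n_in, :n_in]
        s_oi = sigma[k, n_in:, :n_in]
        s_io = sigma[k, :n_in, n_in:]
        s_oo = sigma[k, n_in:, n_in:]
        mu_k = mu[k, n_in:] + s_oi @ np.linalg.solve(s_ii, s_in - mu[k, :n_in])
        sig_k = s_oo - s_oi @ np.linalg.solve(s_ii, s_io)
        mean += h[k] * mu_k
        second += h[k] * (sig_k + np.outer(mu_k, mu_k))
    return mean, second - np.outer(mean, mean)
```

Because only the `K` components are visited, the cost of one query does not depend on the number of training points, which matches the property noted above.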

In GMRDMP, $G M M^{K}$ is obtained from the demonstrations and used to recover the trajectory for a new task according to equation (6). The input $S^{I}$ is the phase variable $s$, and the output $S^{O}$ is the virtual force.

Reversibility and continuity of DMP and GMRDMP

Denote by $S_{2}$ the set of recovered trajectories obtained from a given set of demonstrations $S_{1}$, that is, $S_{2} = \mathrm{STL} (S_{1})$. The reversibility criterion means that $S_{1}$ can be recovered from $S_{2}$, that is, $S_{1} = \mathrm{STL} (S_{2}) = \mathrm{STL} (\mathrm{STL} (S_{1}))$. Continuity means that the recovered trajectory should change continuously when the starting or end point of the new task changes continuously. Failure in reversibility or continuity would lead to counterintuitive results and confuse the user.

Reversibility of DMP

In this test, a DMP model with parameters $N = 500$ , $α = 10$ , $β = α / 4$ , and integration step $d_{t} = 0.01$ is used for STL. As shown in Figure 1, the demonstrations are encoded and transferred to new tasks with different starting or end points. The recovered trajectory is encoded and transferred to a recycled trajectory in the same way. When the start or the end point changes, recovered2 always nearly coincides with demo1, which means that the transfer process is reversible.

Figure 1.

DAgger based on PNDMP. (a) Whole trajectories, (b) first half trajectory, and (c) second half trajectory.

Reversibility of GMRDMP

In this test, two demonstrations are given. Their starting and end points are generated randomly from two regions. The starting region is $ℛ_{1} = \{x \in [0.75, 0.85], y \in [0.15, 0.25]\}$, and the end region is $ℛ_{2} = \{x \in [0.35, 0.45], y \in [- 0.25, - 0.15]\}$. We obtained a GMM from them to recover a trajectory with starting region $ℛ_{2}$ and end region $ℛ_{1}$. Then we obtained the recycled trajectory with starting region $ℛ_{1}$ and end region $ℛ_{2}$ based on another model learned from the recovered trajectory. Both models have four components, and the hyperparameters of GMRDMP are $α_{s} = 1.0$, $α = 10$, $β = α / 4$, and $d_{t} = 0.01$. As shown in Figure 2, although the recycled trajectory accords with the demonstrations, the recovered trajectory fails to preserve the shape. The reason is that the trajectory is encoded in the world coordinate frame.

Figure 2.

Reversibility of GMRDMP. GMRDMP: Gaussian mixed regression dynamic motion primitive.

Continuity of DMP

In this test, the starting and end points of the demonstration are at $(0.8, 0.2, 0.0)$ and $(0.4, - 0.2, 0.01)$ , respectively. In the new task, the starting point is at $(0.2, 0.2, 0.0)$ and the target varies from $(- 0.2, - 0.2, - 0.03)$ to $(- 0.2, - 0.2, 0.02)$ . The results of DMP are shown in Figure 3. The recovered trajectories are very sensitive to the slight change of z. This is because the virtual force is proportional to $x_{t_{f}} - x_{0}$ . This problem has been discussed in some literature.⁴⁷

Figure 3.

Continuity of DMP. DMP: dynamic motion primitive.

Continuity of GMRDMP

In this test, only one demonstration is given. The targets of the new tasks vary from $(0.4, 0.2, - 0.3)$ to $(0.4, 0.2, - 0.5)$. The GMM has four components, and the hyperparameters of GMRDMP are $α_{s} = 1.0$, $α = 10$, $β = α / 4$, and $d_{t} = 0.01$. Because the encoded force depends only on the world coordinates, its continuity is guaranteed. The results of GMRDMP are shown in Figure 4.

Figure 4.

Continuity of GMRDMP. GMRDMP: Gaussian mixed regression dynamic motion primitive.

The above tests illustrate that DMP has good reversibility but poor continuity, while GMRDMP is just the opposite. The discontinuity of DMP and the irreversibility of GMRDMP can be resolved by coordinate transformation or normalization.²⁷ However, different transformation or normalization methods may produce very different results. Many other members of the DMP family also suffer from this problem. This kind of uncertainty is not acceptable in industrial applications. Thus, some manual intervention, such as DAgger, is necessary for the application of GMRDMP. As GMRDMP lacks flexibility for interactive adjustment, we intend to utilize negative demonstrations to improve its flexibility.

Problem description

PNDMP is an improved version of GMRDMP with negative demonstrations. Without loss of generality, this study considers the typical motion planning task of the end effector of a robot manipulator. The methodology can be extended to other tasks easily.

Denote the set of demonstrations as

D = \{d_{i, j} \mid i = 1, 2, \dots, N_{p} + N_{q}, \ j = 1, \dots, M\}

where $N_{p}$ and $N_{q}$ are the numbers of positive and negative demonstrations, respectively, and $M$ is the length of each demonstration. Each sampling point is

d_{i, j} = [s_{i, j}, x_{i, j}, y_{i, j}, z_{i, j}, {\dot{x}}_{i, j}, {\dot{y}}_{i, j}, {\dot{z}}_{i, j}, {\ddot{x}}_{i, j}, {\ddot{y}}_{i, j}, {\ddot{z}}_{i, j}]

$\{d_{i, j} \mid i = 1, 2, \dots, N_{p}\}$ and $\{d_{i, j} \mid i = N_{p} + 1, N_{p} + 2, \dots, N_{p} + N_{q}\}$ are the positive and negative subsets, respectively.

With equation (1), the virtual force at the $j$th sampling point of the $i$th demonstration in the $x$ direction can be calculated as

f_{i, j}^{(x)} = τ {\ddot{x}}_{i, j} - α (β (x_{i, M} - x_{i, j}) - {\dot{x}}_{i, j})

$f_{i, j}^{(y)}$ and $f_{i, j}^{(z)}$ can be calculated in the same manner.
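The force extraction amounts to solving equation (1) for the force term along a recorded trajectory. The sketch below handles one DOF and assumes that velocities and accelerations are obtained by finite differences, that the last sample is the endpoint, and that the phase follows the closed-form solution of the canonical system; the function names are hypothetical.

```python
import numpy as np

def virtual_force(x, dt, tau=1.0, alpha=10.0):
    """Solve equation (1) for the force term along one demonstrated DOF."""
    beta = alpha / 4.0
    xd = np.gradient(x, dt)      # assumption: finite-difference velocity
    xdd = np.gradient(xd, dt)    # assumption: finite-difference acceleration
    # The spring term uses the current position x, as in equation (1)
    return tau * xdd - alpha * (beta * (x[-1] - x) - xd)

def force_samples(x, dt, tau=1.0, alpha=10.0, alpha_s=1.0):
    """Return rows (s_j, f_j) for one demonstration."""
    t = np.arange(len(x)) * dt
    s = np.exp(-alpha_s * t / tau)   # solution of the canonical system
    return np.column_stack([s, virtual_force(x, dt, tau, alpha)])
```

Stacking the rows returned for all positive demonstrations, and separately for all negative ones, gives the two point sets used for training.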

Then, we can obtain the new positive and negative sets of virtual force as below

\begin{array}{l} U = \{(s_{i, j}, f_{i, j}^{(x)}, f_{i, j}^{(y)}, f_{i, j}^{(z)}) \mid i = 1, 2, \dots, N_{p}\}, \\ V = \{(s_{i, j}, f_{i, j}^{(x)}, f_{i, j}^{(y)}, f_{i, j}^{(z)}) \mid i = N_{p} + 1, N_{p} + 2, \dots, N_{p} + N_{q}\} \end{array}

For convenience in the following expressions, we rewrite them as $U = \{u_{i}, i = 1, 2, \dots, m_{u}\}$ and $V = \{v_{j}, j = 1, 2, \dots, m_{v}\}$, where $m_{u} = N_{p} M$ and $m_{v} = N_{q} M$.

The pipeline of GMRDMP can be summarized as follows:

Obtain $U$ and $V$ from $D$ ;

Optimize $G M M^{K} = \{π_{k}, μ_{k}, Σ_{k}\}_{k = 1}^{K}$ with the EM algorithm based on $U$ and $V$ ;

Retrieve the force trajectory $(\hat{s}, f^{(x)} (\hat{s}), f^{(y)} (\hat{s}), f^{(z)} (\hat{s}))$ via GMR with the model $G M M^{K}$ ;

For a new task, substitute the recovered force trajectory and the new starting and end points into equation (1) to recover a new motion.

In PNDMP, the optimization of $G M M^{K}$ is replaced by maximizing a new log-likelihood function for adjustment of the recovered trajectory, in which $U$ and $V$ are annotated with different labels. The log-likelihood function is defined as below

L (θ) = (1 - λ) \sum_{i = 1}^{m_{u}} log P (u_{i} | θ) + λ \sum_{j = 1}^{m_{v}} log P (v_{j} | θ)

where $λ \in [0, 1]$ is a scalar weight, and $u_{i}$ and $v_{j}$ are the samples of the sets $U$ and $V$, respectively. The optimization objective changes to maximize the probability on U and minimize the probability on V simultaneously.
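For a given parameter set, the weighted log-likelihood in equation (12) can be evaluated as follows. This is a minimal sketch with hypothetical function names; the log-sum-exp step is an implementation choice for numerical stability and does not change the value being computed.

```python
import numpy as np

def gmm_logpdf(X, pi, mu, sigma):
    """log P(x | theta) for each row of X; pi: (K,), mu: (K, D), sigma: (K, D, D)."""
    n, d = X.shape
    comp = np.empty((n, pi.size))
    for k in range(pi.size):
        diff = X - mu[k]
        sol = np.linalg.solve(sigma[k], diff.T).T
        maha = np.einsum('ij,ij->i', diff, sol)      # squared Mahalanobis distance
        _, logdet = np.linalg.slogdet(sigma[k])
        comp[:, k] = np.log(pi[k]) - 0.5 * (maha + logdet + d * np.log(2 * np.pi))
    m = comp.max(axis=1, keepdims=True)              # log-sum-exp over components
    return (m + np.log(np.exp(comp - m).sum(axis=1, keepdims=True))).ravel()

def pn_log_likelihood(U, V, lam, pi, mu, sigma):
    """Equation (12): weighted sum of the log-likelihoods on U and V."""
    return ((1 - lam) * gmm_logpdf(U, pi, mu, sigma).sum()
            + lam * gmm_logpdf(V, pi, mu, sigma).sum())
```

Tracking this value across iterations is one simple way to implement a tolerance-based stopping rule.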

Because the difference between discrete and rhythmic GMRDMP lies in the calculation of $U$ from $D$ , PNDMP can be easily extended to rhythmic tasks by replacing the motion equation. Furthermore, we can also plan one single cycle of the rhythmic motion with the discrete GMRDMP first and repeat that cycle infinitely to accomplish rhythmic tasks. Thus, we only focus on the discrete GMRDMP in this paper.

Main algorithms of PNDMP

In this section, we develop a new PNEM algorithm to maximize the log-likelihood function (12). In addition, we devise a new positive and negative demonstration-based Kmeans (PNKmeans) algorithm for determining the initial value for PNEM. Finally, the DAgger framework of PNDMP is put forward.

PNEM

In GMRDMP, the EM algorithm⁴⁸,⁴⁹ provides an iterative solution for $\arg\max_{θ} \sum_{i} \log P (u_{i} | θ)$. It introduces a latent variable to establish the Q-function and uses the Kullback–Leibler divergence to calculate the lower bound of the Q-function, in which the latent variables represent the stage of the trajectory. On this basis, the E-step and M-step can be executed iteratively. The derivation of PNEM follows that of the standard EM, and the standard EM is a special case of PNEM.

In this subsection, we mainly focus on the procedure from the optimization problem $\hat{θ} = \arg\max_{θ} L (θ)$ to the Q-function. The rest of the derivation can be found in Appendix 1. When the positive and negative latent variables $ζ$ and $η$ are introduced, $L (θ)$ can be rewritten as

\begin{array}{l}
L (θ) = (1 - λ) \sum_{i = 1}^{m_{u}} q (ζ_{i}) \sum_{ζ_{i}} log \frac{P (u_{i}, ζ_{i} | θ)}{P (ζ_{i} | u_{i}, θ)} + λ \sum_{j = 1}^{m_{v}} q (η_{j}) \sum_{η_{j}} log \frac{P (v_{j}, η_{j} | θ)}{P (η_{j} | v_{j}, θ)} \\
= (1 - λ) \sum_{i = 1}^{m_{u}} q (ζ_{i}) \sum_{ζ_{i}} log \frac{P (u_{i}, ζ_{i} | θ) / q (ζ_{i})}{P (ζ_{i} | u_{i}, θ) / q (ζ_{i})} + λ \sum_{j = 1}^{m_{v}} q (η_{j}) \sum_{η_{j}} log \frac{P (v_{j}, η_{j} | θ) / q (η_{j})}{P (η_{j} | v_{j}, θ) / q (η_{j})} \\
= (1 - λ) \sum_{i = 1}^{m_{u}} q (ζ_{i}) \sum_{ζ_{i}} log \frac{P (u_{i}, ζ_{i} | θ)}{q (ζ_{i})} + λ \sum_{j = 1}^{m_{v}} q (η_{j}) \sum_{η_{j}} log \frac{P (v_{j}, η_{j} | θ)}{q (η_{j})} \\
\quad + (1 - λ) \sum_{i = 1}^{m_{u}} q (ζ_{i}) \sum_{ζ_{i}} log \frac{q (ζ_{i})}{P (ζ_{i} | u_{i}, θ)} + λ \sum_{j = 1}^{m_{v}} q (η_{j}) \sum_{η_{j}} log \frac{q (η_{j})}{P (η_{j} | v_{j}, θ)} \\
= (1 - λ) \sum_{i = 1}^{m_{u}} q (ζ_{i}) \sum_{ζ_{i}} log \frac{P (u_{i}, ζ_{i} | θ)}{q (ζ_{i})} + λ \sum_{j = 1}^{m_{v}} q (η_{j}) \sum_{η_{j}} log \frac{P (v_{j}, η_{j} | θ)}{q (η_{j})} \\
\quad + (1 - λ) \cdot K L (q (ζ) ∥ P (ζ | U, θ)) + λ \cdot K L (q (η) ∥ P (η | V, θ))
\end{array}

where $K L (a ∥ b) \geq 0$ is the Kullback–Leibler divergence that measures the distance between distributions $a$ and $b$, and $K L (a ∥ b) = 0$ if and only if $a = b$. As the conditional probabilities $P (ζ | U, θ)$ and $P (η | V, θ)$ are statistical variables, they can be obtained when the parameter $θ$ is given. Then, we have

\begin{array}{r} L (θ) \geq (1 - λ) \sum_{i = 1}^{m_{u}} q (ζ_{i}) \sum_{ζ_{i}} log \frac{P (u_{i}, ζ_{i} | θ)}{q (ζ_{i})} \\ + λ \sum_{j = 1}^{m_{v}} q (η_{j}) \sum_{η_{j}} log \frac{P (v_{j}, η_{j} | θ)}{q (η_{j})} \end{array}

When $q (ζ) = P (ζ | U, θ)$ and $q (η) = P (η | V, θ)$, equality holds in (14). Thus, we can construct a constrained Q-function for $L (θ)$ and calculate the latent variables and $θ$ alternately, in the same manner as the standard EM algorithm. Denoting $θ^{(t)}$ as the estimated result of iteration $t - 1$, the positive and negative demonstrations-based Q-function (PNQ-function) is defined as

\begin{array}{r} Q (θ, θ^{(t)}) = (1 - λ) \sum_{i = 1}^{m_{u}} P (ζ_{i} | u_{i}, θ^{(t)}) \sum_{ζ_{i}} log P (u_{i}, ζ_{i} | θ) \\ + λ \sum_{j = 1}^{m_{v}} P (η_{j} | v_{j}, θ^{(t)}) \sum_{η_{j}} log P (v_{j}, η_{j} | θ) \end{array}

With the new PNQ-function, we can obtain the posterior probabilities of the latent variables in the E-step

γ_{u, i k}^{(t + 1)} = \frac{π_{k}^{(t)} ϕ (u_{i} | μ_{k}^{(t)}, Σ_{k}^{(t)})}{\sum_{l = 1}^{K} π_{l}^{(t)} ϕ (u_{i} | μ_{l}^{(t)}, Σ_{l}^{(t)})}

γ_{v, j k}^{(t + 1)} = \frac{π_{k}^{(t)} ϕ (v_{j} | μ_{k}^{(t)}, Σ_{k}^{(t)})}{\sum_{l = 1}^{K} π_{l}^{(t)} ϕ (v_{j} | μ_{l}^{(t)}, Σ_{l}^{(t)})}

And in the M-step, we can obtain

π_{k}^{(t + 1)} = \frac{(1 - λ) \sum_{i = 1}^{m_{u}} γ_{u, i k}^{(t + 1)} + λ \sum_{j = 1}^{m_{v}} γ_{v, j k}^{(t + 1)}}{(1 - λ) m_{u} + λ m_{v}}

μ_{k}^{(t + 1)} = \frac{(1 - λ) \sum_{i = 1}^{m_{u}} u_{i} γ_{u, i k}^{(t + 1)} + λ \sum_{j = 1}^{m_{v}} v_{j} γ_{v, j k}^{(t + 1)}}{(1 - λ) \sum_{i = 1}^{m_{u}} γ_{u, i k}^{(t + 1)} + λ \sum_{j = 1}^{m_{v}} γ_{v, j k}^{(t + 1)}}

Σ_{k}^{(t + 1)} = \frac{(1 - λ) Σ_{a} + λ Σ_{b}}{(1 - λ) \sum_{i = 1}^{m_{u}} γ_{u, i k}^{(t + 1)} + λ \sum_{j = 1}^{m_{v}} γ_{v, j k}^{(t + 1)}}

where

\begin{array}{l} Σ_{a} = \sum_{i = 1}^{m_{u}} (u_{i} - μ_{k}) {(u_{i} - μ_{k})}^{T} γ_{u, i k}, \\ Σ_{b} = \sum_{j = 1}^{m_{v}} (v_{j} - μ_{k}) {(v_{j} - μ_{k})}^{T} γ_{v, j k} \end{array}

In the E-step, $γ_{u, i k}$ and $γ_{v, j k}$ are calculated with $π_{k}$, $μ_{k}$, and $Σ_{k}$ frozen. In the M-step, it is the opposite. The E-step and M-step are repeated until a stopping condition (maximum number of iterations or tolerance) is met.
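One PNEM iteration can be sketched as below. The function names are hypothetical, the covariance update is assumed to use the freshly updated means (as in standard EM), and the small ridge term `reg` is an added assumption for numerical stability rather than part of equations (16) to (21).

```python
import numpy as np

def component_pdf(X, mu, sigma):
    """Gaussian density of every row of X under every component; shape (n, K)."""
    n, d = X.shape
    out = np.empty((n, mu.shape[0]))
    for k in range(mu.shape[0]):
        diff = X - mu[k]
        maha = np.einsum('ij,ij->i', diff, np.linalg.solve(sigma[k], diff.T).T)
        out[:, k] = np.exp(-0.5 * maha) / np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma[k]))
    return out

def pnem_step(U, V, lam, pi, mu, sigma, reg=1e-6):
    """One E-step and M-step of PNEM for positive set U and negative set V."""
    # E-step: responsibilities with the current parameters frozen
    gu = pi * component_pdf(U, mu, sigma)
    gu /= gu.sum(axis=1, keepdims=True)
    gv = pi * component_pdf(V, mu, sigma)
    gv /= gv.sum(axis=1, keepdims=True)
    # M-step: responsibilities weighted by (1 - lam) and lam
    wu, wv = (1 - lam) * gu, lam * gv
    nk = wu.sum(axis=0) + wv.sum(axis=0)
    pi_new = nk / ((1 - lam) * len(U) + lam * len(V))
    mu_new = (wu.T @ U + wv.T @ V) / nk[:, None]
    sigma_new = np.empty_like(sigma)
    for k in range(len(pi)):
        du, dv = U - mu_new[k], V - mu_new[k]
        sigma_new[k] = ((wu[:, k, None] * du).T @ du + (wv[:, k, None] * dv).T @ dv) / nk[k]
        sigma_new[k] += reg * np.eye(U.shape[1])   # assumption: ridge for stability
    return pi_new, mu_new, sigma_new
```

Setting `lam = 0` removes the contribution of the negative set, so the step reduces to a standard EM update on the positive set alone.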

PNKmeans

EM is sensitive to the initial value. Thus, Kmeans is often adopted to generate a preferable initial value $μ_{k}^{(0)}$ for EM. Kmeans is also an iterative clustering algorithm that only considers the distance between data points. In this section, we propose the PNKmeans method to introduce negative demonstrations into Kmeans.

Given the point data sets U and V defined as before and K to-be-calculated center points as $o_{k}^{(t)}$ , in the iteration $t + 1$ we can obtain

\begin{array}{l} d_{i k}^{(t)} = {‖ u_{i} - o_{k}^{(t)} ‖}^{2}, \\ {\bar{d}}_{j k}^{(t)} = {‖ v_{j} - o_{k}^{(t)} ‖}^{2} \end{array}

The sample space can be divided into $K$ subspaces $S_{k}^{(t)}$ by $o_{k}^{(t)}$. A point $x$ belongs to $S_{k}^{(t)}$, denoted as $x \in S_{k}^{(t)}$, if ${‖ x - o_{k}^{(t)} ‖}^{2} \leq {‖ x - o_{j}^{(t)} ‖}^{2}$ for all $j \neq k$. The points on the boundary of two different subspaces belong to both subspaces.

The optimization objective is to minimize J which is defined as

J = (1 - λ) \underbrace{\sum_{i = 1}^{m_{u}} \sum_{k = 1}^{K} r_{i k} d_{i k}}_{J_{1}} + λ \underbrace{\sum_{j = 1}^{m_{v}} \sum_{k = 1}^{K} {\bar{r}}_{j k} {\bar{d}}_{j k}}_{J_{2}}

where $r_{i k} = 1$ if $u_{i} \in S_{k}^{(t)}$ and $r_{i k} = 0$ otherwise, and ${\bar{r}}_{j k}$ is defined in the same way for $v_{j}$; $J_{1}$ is the optimization objective of Kmeans, and $J_{2}$ is the new term for negative demonstrations. As $J_{1}$ and $J_{2}$ are linearly weighted, the new center points can be updated as follows

o_{k}^{(t + 1)} = (1 - λ) \underset{u_{i} \in O_{k}^{(t)}}{m e a n} {u_{i}} + λ \underset{u_{j} \in O_{k}^{(t)}}{m e a n} {v_{j}}
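One iteration of this center update might look like the sketch below. The function name `pnkmeans_step` and the handling of empty cells are assumptions, because the source does not say what happens when no positive or no negative point falls into a subspace. Ties are also resolved to the first nearest center here, whereas the text assigns boundary points to both subspaces.

```python
import numpy as np


def pnkmeans_step(U, V, centers, lam):
    """Sketch of one PNKmeans iteration on positive points U and negative points V."""
    centers = np.asarray(centers, dtype=float)

    def assign(X):
        # Squared Euclidean distance of every point to every center, shape (n, K)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)

    label_u, label_v = assign(U), assign(V)
    new_centers = centers.copy()
    for k in range(len(centers)):
        u_k, v_k = U[label_u == k], V[label_v == k]
        if len(u_k) == 0:
            continue  # assumption: keep the old center if the cell has no positive point
        if len(v_k) == 0:
            new_centers[k] = u_k.mean(axis=0)  # assumption: fall back to plain Kmeans
        else:
            new_centers[k] = (1.0 - lam) * u_k.mean(axis=0) + lam * v_k.mean(axis=0)
    return new_centers, label_u, label_v
```

Calling the function repeatedly until the centers stop moving, or until an iteration limit is reached, gives an initial value for the mixture means.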

Extrapolation strategy for PNDMP

From equations (18), (19), and (20), we can find that the results of PNEM are functions of the weight $λ$ , that is

θ (λ) = \{π_{k} (λ), μ_{k} (λ), Σ_{k} (λ)\}_{k = 1}^{K}

where $λ \in [0, 1]$ , $θ (0)$ represents the model extracted from positive demonstrations, and $θ (1)$ the model extracted from negative demonstrations. When we adjust $λ$ from 0 to 1, the shape of the recovered trajectory changes gradually from that of the positive demonstrations to that of the negative ones. However, to adjust the trajectory further, that is, with $λ < 0$ or $λ > 1$ , calculating $θ (λ)$ directly is unsuitable. This is because of the exponential decay characteristic of the Gaussian distribution. When a Gaussian component is close to positive demonstrations and far from negative demonstrations, the effect of negative demonstrations on this component declines considerably. Thus, we designed an extrapolation strategy to improve the flexibility of PNDMP.

Considering that $π_{k}$ represents the stage division of the trajectory and $Σ_{k}$ represents the degree of dispersion, the extrapolation strategy to calculate $\hat{θ} (w_{p}, w_{n})$ from the trained GMM model is as below

\begin{array}{l} {\hat{π}}_{k} (w) = π_{k} (0.5), \\ {\hat{μ}}_{k} (w) = w_{p} \cdot μ_{k} (0) + w_{n} \cdot μ_{k} (1), \\ {\hat{Σ}}_{k} (w) = Σ_{k} (0.5) \end{array}

This strategy can make the demonstrations act on the adjustment continuously and smoothly, as shown in Figure 5.

Figure 5.

Adjustment of the recovered trajectory based on extrapolation. (a) Recovering the trajectory. (b) Adjusting the trajectory via extrapolation with different $w_{n}$ and $w_{p} = 1 - w_{n}$ . (c) Adjusting the trajectory via $θ (λ)$ with different $λ$ .
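The extrapolation rule is a linear combination of the two sets of means, with the priors and covariances taken from the model at $λ = 0.5$ . A small sketch, assuming each model is stored as a dictionary with the hypothetical keys `"pi"`, `"mu"`, and `"sigma"`:

```python
import numpy as np


def extrapolate_gmm(theta_pos, theta_neg, theta_mid, w_n, w_p=None):
    """Sketch of the extrapolated model.

    theta_pos, theta_neg, theta_mid stand for theta(0), theta(1), theta(0.5).
    If w_p is not given, w_p = 1 - w_n is used, as in Figure 5(b).
    """
    if w_p is None:
        w_p = 1.0 - w_n
    return {
        "pi": theta_mid["pi"],  # stage division kept from lambda = 0.5
        "mu": w_p * np.asarray(theta_pos["mu"]) + w_n * np.asarray(theta_neg["mu"]),
        "sigma": theta_mid["sigma"],  # dispersion kept from lambda = 0.5
    }
```

With a negative `w_n`, each mean is pushed away from its negative-demonstration counterpart along the line joining the two, by an amount that does not shrink with distance.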

DAgger framework of PNDMP

Unexpected situations are very common in the industrial field. For instance, some new sensors added into the manipulator’s workspace render preprogrammed trajectories unsafe, yet the operator cannot model the obstacle. In that case, DAgger is an effective solution. DAgger is one of the fundamental incremental learning methods for sequential decision tasks, as well as an efficient method to cope with unexpected situations. In DAgger, new samples are collected and annotated by experts and then added to the training data set during the iterative training process. In the DAgger framework of PNDMP, the user can treat the recovered trajectory as a negative demonstration and re-train the GMM model.
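The retraining loop described above can be outlined as follows. This is only a schematic sketch: `fit_and_recover` and `is_safe` are hypothetical callables standing in for PNDMP training plus trajectory recovery and for the user's safety judgement, and the `max_rounds` limit is an added assumption.

```python
def dagger_pndmp(positives, fit_and_recover, is_safe, max_rounds=10):
    """Sketch of a DAgger-style loop that turns unsafe rollouts into negative demos.

    fit_and_recover(positives, negatives) -> trajectory   (hypothetical)
    is_safe(trajectory) -> bool                           (hypothetical)
    """
    negatives = []
    trajectory = fit_and_recover(positives, negatives)
    for _ in range(max_rounds):
        if is_safe(trajectory):
            break
        # Treat the unsafe recovered trajectory as a new negative demonstration
        negatives.append(trajectory)
        trajectory = fit_and_recover(positives, negatives)
    return trajectory, negatives
```

The loop returns the last recovered trajectory together with the negative demonstrations gathered along the way, so the caller can still check whether the final trajectory is safe when the round limit is hit.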

Experiments

To validate the advantage of PNDMP, we have conducted experiments on three typical robot manipulating tasks. The first task is 2D letter writing,²⁴ with which we analyzed the properties of PNEM and PNKmeans. The second one is obstacle avoidance.¹⁰ The last one is the grasping task in a grid box, which is conducted on a Franka Panda robot. Experiments on the last two tasks illustrate the usage and performance of PNDMP.

Letter writing tasks

In the letter writing task, we have selected the letters B, E, Q, T, and X as the samples to learn. Ten positive and two negative demonstrations of each letter are gathered, with each demonstration including 200 sampling points.

The hyperparameters of PNDMP are $K = 5$ , $α = 10$ , $α_{s} = 1.0$ , $d_{t} = 0.01$ , $w_{p} = 1$ and $w_{n} = - 1$ . Kmeans and EM are both limited to 100 iterations. Denote the clustering results of standard Kmeans and PNKmeans as $μ_{k}^{(0)}$ and ${\tilde{μ}}_{k}^{(0)}$ , $k = 1, \dots, K$ , the results of standard EM based on $μ_{k}^{(0)}$ and ${\tilde{μ}}_{k}^{(0)}$ as $(π_{k, E M}^{(t_{f})}, μ_{k, E M}^{(t_{f})}, Σ_{k, E M}^{(t_{f})})$ and $({\tilde{π}}_{k, E M}^{(t_{f})}, {\tilde{μ}}_{k, E M}^{(t_{f})}, {\tilde{Σ}}_{k, E M}^{(t_{f})})$ , the results of PNEM as $(π_{k, P N E M}^{(t_{f})}, μ_{k, P N E M}^{(t_{f})}, Σ_{k, P N E M}^{(t_{f})})$ and $({\tilde{π}}_{k, P N E M}^{(t_{f})}, {\tilde{μ}}_{k, P N E M}^{(t_{f})}, {\tilde{Σ}}_{k, P N E M}^{(t_{f})})$ , and the corresponding recovered letters as $T_{E M}$ , ${\tilde{T}}_{E M}$ , $T_{P N E M}$ , and ${\tilde{T}}_{P N E M}$ , $T = ℬ, ℰ, Q, T, X$ , respectively. The superscript $t_{f}$ denotes the number of iterations of each execution.

It is difficult to evaluate the recovered letters quantitatively. Thus, we have repeated the experiment 100 times and analyzed the statistical difference between ${\tilde{T}}_{E M}$ and ${\tilde{T}}_{P N E M}$ . As EM can only obtain a locally optimal solution, the result of GMRDMP is random. PNDMP inherits this property from DMP. In tasks with an obvious gap between the positive and negative demonstrations, PNDMP shows flexibility to adjust the recovered trajectory. As shown in Figure 6, the recovered trajectories can be divided into different patterns. The statistical results are presented in Table 1. The marginal distributions of $(s, f_{x} (s))$ are presented in Figure 7, in which each ellipse represents a component of GMM.

Figure 6.

Patterns of recovered motion of letters. (a) Pattern 1 of B, (b) pattern 2 of B, (c) pattern 3 of B, (d) pattern 4 of B, (e) pattern 1 of E, (f) pattern 2 of E, (g) pattern 3 of E, (h) pattern 4 of E, (i) pattern 1 of Q, (j) pattern 1 of T, (k) pattern 1 of X, and (l) pattern 2 of X.

Figure 7.

Marginal distributions of the GMM. GMM: Gaussian mixed model.

Table 1.

Statistical results of different patterns of the letter writing task.

	Pattern 1	Pattern 2	Pattern 3	Pattern 4
B	21	14	60	5
E	32	18	42	8
Q	100	—	—	—
T	100	—	—	—
X	62	38	—	—

It is worth noting that PNDMP fails to adjust the trajectory in some tasks. The scope of PNDMP is important for workers with little knowledge about the learning control of robots, but it is still challenging to estimate whether PNDMP is appropriate for a task.

Although calculating $θ (λ)$ directly for $λ \in (- \infty,0) \cup (1, + \infty)$ cannot generate good trajectories, we found a special phenomenon in the process of the experiment. That is, the introduction of negative demonstrations can improve the converging speed of PNDMP. In this experiment, we analyzed the converging speed of PNDMP and PNKmeans on the letter writing task. As shown in Table 2 and Figure 8, there are 60 configurations in all, which are the combinations of five letters, three GMM sizes, two Kmeans algorithms, and two EM algorithms. The hyperparameters of the algorithms are the same as before except that $θ (λ = - 0.25)$ is used in PNDMP. PNEM converges with fewer iterations than EM on 80% of the configurations. PNKmeans converges with fewer iterations than Kmeans on 60% of the configurations. It should be noted that the number of components of GMM has no obvious relation to the number of iterations.

Table 2.

Average iterations of 100 executions.

		B	E	Q	T	X
$G M M^{5}$	Kmeans + PNEM	15.29	8.04	43.49	19.52	53.63
	PNKmeans + PNEM	15.81	9.79	31.09	24.08	47.13
	Kmeans + EM	52.49	32.33	43.43	34.49	65.27
	PNKmeans + EM	54.26	31.52	33.35	36.85	56.30
$G M M^{6}$	Kmeans + PNEM	65.64	30.05	68.79	28.09	31.49
	PNKmeans + PNEM	52.20	29.15	53.98	27.19	34.29
	Kmeans + EM	73.60	27.17	74.76	33.39	33.51
	PNKmeans + EM	58.06	26.01	56.72	35.90	36.15
$G M M^{7}$	Kmeans + PNEM	46.22	36.70	19.96	32.39	51.89
	PNKmeans + PNEM	45.44	38.15	24.23	31.25	46.75
	Kmeans + EM	49.04	33.90	34.57	39.09	49.42
	PNKmeans + EM	48.77	33.21	35.95	51.12	48.26

PNEM: positive and negative demonstrations-based expectation–maximum; EM: expectation–maximization.
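The 80% and 60% figures quoted above can be reproduced from Table 2 by pairing configurations that differ only in the EM variant or only in the Kmeans variant, which gives 30 matched pairs per comparison. The short check below transcribes the table; the array layout and variable names are illustrative choices.

```python
import numpy as np

# Axis 0: GMM with 5, 6, 7 components.
# Axis 1: Kmeans + PNEM, PNKmeans + PNEM, Kmeans + EM, PNKmeans + EM.
# Axis 2: letters B, E, Q, T, X.
table2 = np.array([
    [[15.29, 8.04, 43.49, 19.52, 53.63],
     [15.81, 9.79, 31.09, 24.08, 47.13],
     [52.49, 32.33, 43.43, 34.49, 65.27],
     [54.26, 31.52, 33.35, 36.85, 56.30]],
    [[65.64, 30.05, 68.79, 28.09, 31.49],
     [52.20, 29.15, 53.98, 27.19, 34.29],
     [73.60, 27.17, 74.76, 33.39, 33.51],
     [58.06, 26.01, 56.72, 35.90, 36.15]],
    [[46.22, 36.70, 19.96, 32.39, 51.89],
     [45.44, 38.15, 24.23, 31.25, 46.75],
     [49.04, 33.90, 34.57, 39.09, 49.42],
     [48.77, 33.21, 35.95, 51.12, 48.26]],
])

km_pnem, pnkm_pnem, km_em, pnkm_em = (table2[:, r, :] for r in range(4))

# PNEM against EM, holding the initializer fixed
pnem_wins = np.concatenate([(km_pnem < km_em).ravel(), (pnkm_pnem < pnkm_em).ravel()])
# PNKmeans against Kmeans, holding the EM variant fixed
pnkm_wins = np.concatenate([(pnkm_pnem < km_pnem).ravel(), (pnkm_em < km_em).ravel()])

pnem_rate = pnem_wins.mean()  # 24 of 30 pairs
pnkm_rate = pnkm_wins.mean()  # 18 of 30 pairs
print(f"PNEM faster than EM: {pnem_rate:.0%}, PNKmeans faster than Kmeans: {pnkm_rate:.0%}")
```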

Figure 8.

Comparison of iterations of four methods with GMM⁵. GMM: Gaussian mixed model.

The searching processes of one component of $G M M^{5}$ for letters B and Q are depicted in Figures 9 and 10, respectively. The converging step decreases gradually in both executions. In Figure 9, PNEM converges faster and reaches the stopping condition much earlier than EM. In Figure 10, PNEM converges faster with a shorter searching path. Thus, negative demonstrations can impose a boundary constraint on the searching space and then make PNEM apt to reach a local optimum, as shown in the sketch diagram in Figure 11.

Figure 9.

Iteration process of one component for letter B.

Figure 10.

Iteration process of one component for letter Q.

Figure 11.

Sketch diagram of the search path.

To analyze the performance of PNKmeans, another clustering task of letters is conducted. In this task, the demonstrations are clustered with a $G M M^{5}$ without being encoded to the virtual force. The execution is repeated 100 times for each letter, and 2500 components of GMM are recorded. Let $e_{k}^{(t_{f})} = {‖ {\tilde{μ}}_{k, P N E M}^{(t_{f})} - μ_{k, P N E M}^{(t_{f})} ‖}_{2}$ and $e_{k}^{(0)} = {‖ {\tilde{μ}}_{k}^{(0)} - μ_{k}^{(0)} ‖}_{2}$ ; the joint distribution of $e_{k}^{(t_{f})}$ and $e_{k}^{(0)}$ is depicted in Figure 12. The numbers of points lying in regions 1–4 are 2317, 16, 12, and 155, respectively. Regions 1 and 4 indicate that close or different initial values lead to very close results, as for B, E, and Q in Figure 13. Region 3 indicates that close initial values lead to very different results, such as the letter X in Figure 6. Region 2 indicates that different initial values lead to very different results, such as the letter T. Thus, PNKmeans can improve the converging speed slightly without affecting the clustering results in most cases.

Figure 12.

Distribution of the difference between Kmeans and PNKmeans.

Figure 13.

Effect of PNKmeans on PNDMP. PNDMP: positive and negative demonstrations-based dynamic motion primitive

Obstacle avoidance task

In this task, the recovered trajectories of $G M M^{5}$ are required to avoid two obstacles. The position and volume of the obstacles are unknown, and the DAgger framework based on PNDMP is used to finish this task. Trajectories are recovered by GMRDMP first. If they do not satisfy the safety criterion, they are treated as negative demonstrations and mixed with the original positive demonstrations to retrain $G M M^{5}$ . Ten positive demonstrations are collected and each demonstration includes 200 points. GMRDMP and PNDMP have the same hyperparameters $α = 10$ and $α_{s} = 1.0$ , and Kmeans and EM are both limited to 100 iterations. Those hyperparameters make the recovered trajectories of GMRDMP collide with the spherical obstacle, as shown by the negative demos in Figure 14. By adding them to the data set and re-training $G M M^{5}$ , we can obtain better trajectories with $w_{n} = - 1, - 2$ and $w_{p} = 1 - w_{n}$ .

Figure 14.

Recovered trajectories for obstacle avoidance task.

Grasping task

In the grasping task, we need to control the Franka Panda robot to grasp a bottle and move it from one square of the grid box to another, as shown in Figure 15(a). To focus on the PNDMP algorithm, we have chosen a bottle with a red cap and sticker as the object and designed the simple pipeline in Figure 16 to detect it. As depicted in Figure 15(b), the end effector of Panda is large, and it often collided with the box when we collected the demonstrations. Because the geometrical collision volume of the grid box is very hard to describe with a few parameters, neither task-parameterized DMP (TPDMP)²⁷ nor CPDMP is suitable for this task.

Figure 15.

Grasping task by Franka Panda. (a) The Franka Panda robot. (b) Camera above the grid box.

Figure 16.

Detection of the bottle. (a) Original RGB image. (b) Red channel. (c) Contours of the cap and sticker. (d) Position and attitude of the bottle.

In the world coordinate frame, the bottom centers of the squares are $(0.3, - 0.15, 0)$ , $(0.7, - 0.15, 0)$ , $(0.4, 0.15, 0)$ , and $(0.8, 0.15, 0)$ . The lengths and heights of the grid edges are 0.3 and 0.25, respectively. The starting and end points of demonstrations lie in $p_{0} = ({\bar{X}}^{4} \pm 0.1, {\bar{Y}}^{4} \pm 0.1, 0)$ and $p_{5} = ({\bar{X}}^{1} \pm 0.1, {\bar{Y}}^{1} \pm 0.1, 0)$ , where $({\bar{X}}^{i}, {\bar{Y}}^{i},0)$ is the bottom center of square i. Positive demonstrations are collected by nonlinear interpolation with three specified waypoints $p_{1} = p_{0} + (0, 0, 0.2)$ , $p_{2} = (p_{0} + p_{5}) / 2 + (0, 0, 0.35)$ , and $p_{3} = p_{5} + (0, 0, 0.2)$ . Negative demonstrations are collected with $p_{1} = p_{0} + (0.1, 0.1, 0.2)$ , $p_{2} = (p_{0} + p_{5}) / 2 + (0, 0, 0.25)$ , and $p_{3} = p_{5} + (- 0.1, - 0.1, 0.2)$ . The negative demonstrations are much closer to the grid edges than the positive demonstrations. Ten positive and two negative demonstrations are collected and each demonstration includes 200 points. The maximum iterations of Kmeans and EM are both set to 100. The hyperparameters are $K = 5$ , $α = \sqrt{20}$ , $α_{s} = 1.0$ , $w_{n} = - 1$ , and $w_{p} = 1 - w_{n}$ . The results are shown in Figure 17.

Figure 17.

Recovered trajectories for grasping task in a grid box.
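As a worked illustration of the waypoint definitions used to collect the demonstrations in this task, the sketch below builds the waypoints for a positive and a negative demonstration from a given start $p_{0}$ and end $p_{5}$ . The function name `demo_waypoints` is hypothetical, and the nonlinear interpolation between waypoints is left out because the source does not specify it.

```python
import numpy as np


def demo_waypoints(p0, p5, negative=False):
    """Return the rows p0, p1, p2, p3, p5 for one demonstration (sketch)."""
    p0 = np.asarray(p0, dtype=float)
    p5 = np.asarray(p5, dtype=float)
    mid = (p0 + p5) / 2.0
    if negative:
        # Negative demonstrations: shifted sideways and with a lower apex
        p1 = p0 + np.array([0.1, 0.1, 0.2])
        p2 = mid + np.array([0.0, 0.0, 0.25])
        p3 = p5 + np.array([-0.1, -0.1, 0.2])
    else:
        # Positive demonstrations: lift straight up, pass over the grid edge, descend
        p1 = p0 + np.array([0.0, 0.0, 0.2])
        p2 = mid + np.array([0.0, 0.0, 0.35])
        p3 = p5 + np.array([0.0, 0.0, 0.2])
    return np.stack([p0, p1, p2, p3, p5])


# Example with the unperturbed centers of squares 4 and 1
positive = demo_waypoints((0.8, 0.15, 0.0), (0.3, -0.15, 0.0))
negative = demo_waypoints((0.8, 0.15, 0.0), (0.3, -0.15, 0.0), negative=True)
```

In this example the apex of the negative demonstration lies at height 0.25, level with the stated grid edge height, while the positive apex at 0.35 clears it by 0.1.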

Conclusion

By introducing negative demonstrations, we developed PNDMP from GMRDMP with the new log-likelihood function, the PNEM algorithm, and the DAgger framework. PNDMP provides interactive flexibility for the user to adjust the recovered trajectories. Experiments on three typical robot manipulating tasks have been conducted to validate the performance of PNDMP from different aspects. Besides that, another experiment shows that PNEM and PNKmeans converge faster than standard EM and Kmeans when $λ < 0$ .

Although PNDMP can handle unexpected obstacles in many tasks, several issues remain for future work. The first is to clarify the scope of PNDMP, which is of great significance for industrial practice. The second is to identify the task parameters based on positive and negative demonstrations. Similar to PNEM, the negative demonstrations would provide a definite boundary to improve the accuracy and speed of parameter identification. The last is to extend the DAgger framework to combine the motion model and the policy model, which means that we can cope with the path planning task and the sequential decision task with a uniform model.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received financial support for the research, authorship, and/or publication of this article: This work is supported by National Natural Science Foundation of China (62002053), Natural Science Foundation of Guangdong Province (2021A1515011866), Guangdong Basic and Applied Basic Research Projects (2019A1515111082, 2020A1515110504), Social Welfare Major Project of Zhongshan (2019B2010, 2019B2011, 420S36), Achievement Cultivation Project of Zhongshan Industrial Technology Research Institute (419N26), Science and Technology Foundation of Guangdong Province (2021A0101180005), and Young Innovative Talents Project of Education Department of Guangdong Province (2018KQNCX337, 2019KQNCX186).

ORCID iD

Shuai Dong

Appendix 1

References

1. Wang, Lian, et al. Vision-based robotic grasp detection from object localization, object pose estimation to grasp estimation: a review. arXiv: 190506658v1 2019: 1–24.

2. Liu, Cai, et al. A review of robot manipulation skills learning methods. Acta Autom Sin 2019; 45(3): 458–470.

3. Liu, et al. Skill transfer learning for autonomous robots and human-robot cooperation: a survey. Robot Autonom Syst 2020; 128: 103515.

4. Calinon, Guenter, Billard. On learning, representing and generalizing a task in a humanoid robot. IEEE Trans Syst Man Cyber—PT B: Cyber 2007; 37(2): 286–298.

5. Kroemer, Daniel, Neumann, et al. Towards learning hierarchical skills for multi-phase manipulation tasks. In: IEEE international conference on robotics and automation, Seattle, WA, USA, 26–30 May 2015, pp. 1503–1510.

6. Kuefler, Kochenderfer. Burn-in demonstrations for multi-modal imitation learning. arXiv:171005090v1 2017: 1–8.

7. Ermon. Generative adversarial imitation learning. arXiv:160603476 2016.

8. Ijspeert, Nakanishi, Schaal. Learning attractor landscapes for learning motor primitives. In: 15th international conference on neural information processing systems. Cambridge, MA: MIT Press, pp. 1547–1554.

9. Ijspeert, Nakanishi, Schaal. Movement imitation with nonlinear dynamical systems in humanoid robots. In: IEEE international conference on robotics & automation. Washington, DC, USA, 29 May–2 June 2023, pp. 1398–1403.

10. Ugur, Girgin. Compliant parametric dynamic movement primitives. Robotica 2020; 38(Preprint): 457–474.

11. Hutchison. Modelling and simulation for autonomous systems. Berlin: Springer, 2014.

12. Kim, Seo, Choi, et al. Incorporating safety into parametric dynamic movement primitives. IEEE Robot Autom Lett 2019; 4(3): 2260–2267.

13. Menda, Driggs-Campbell, Kochenderfer. EnsembleDAgger: a Bayesian approach to safe imitation learning. In: IEEE international conference on intelligent robots and systems, Macau, China, 3–8 November 2019, pp. 5041–5048. DOI: 10.1109/IROS40897.2019.8968287.

14. Schaal, Peters, Nakanishi, et al. Control, planning, learning, and imitation with dynamic movement primitives. In: Workshop on bilateral paradigms on humans and humanoids: IEEE international conference on intelligent robots and systems, Naples, Italy, 29 August–2 September 2022, pp. 1–21.

15. Cohn, Ghahramani, Jordan. Active learning with statistical models. J Artif Intell Res 1996; 4(1522): 129–145.

16. Wang, Merel, Reed. Robust imitation of diverse behaviors. In: Neural information processing systems. Long Beach, USA: Neural Information Processing Systems, 2017, pp. 5320–5329.

17. Ude, Gams, Asfour, et al. Task-specific generalization of discrete and periodic dynamic movement primitives. IEEE Trans Robot 2010; 26(5): 800–815.

18. Gribovskaya, Zadeh, Billard. Learning nonlinear multivariate dynamics of motion in robotic manipulators. Int J Robot Res 2010; 30: 80–117.

19. Colome, Neumann, Peters, et al. Dimensionality reduction for probabilistic movement primitives. In: IEEE-RAS international conference on humanoid robots, Madrid Spain, 18–20 November 2014, pp. 794–800.

20. Paraschos, Daniel, Peters, et al. Using probabilistic movement primitives in robotics. Auto Robot 2018; 42(3): 529–551.

21. Huang, Rozo, Silvério, et al. Kernelized movement primitives. Int J Robot Res 2019; 38(7): 833–852.

22. Saveriano, Abu-Dakka, Kramberger, et al. Dynamic movement primitives in robotics: a tutorial survey. arXiv:210203861v1 2021: 1–43.

23. Colome, Torras. Dimensionality reduction for dynamic movement primitives and application to bimanual manipulation of clothes. IEEE Trans Robot 2018; 34(3): 602–615.

24. Tanwani, Calinon. Learning robot manipulation tasks with task-parameterized semi-tied hidden semi-Markov model. IEEE Robot Autom Lett 2016; 1(1): 235–242.

25. Lioutikov, Neumann, Maeda, et al. Learning movement primitive libraries through probabilistic segmentation. Int J Robot Res 2017; 36(8): 879–894.

26. Yang, et al. An enhanced teaching interface for a robot using DMP and GMR. Int J Intell Robot Appl 2018; 2(1): 110–121.

27. Calinon. A tutorial on task-parameterized movement learning and retrieval. Intell Serv Robot 2016; 9(1): 1–29.

28. Huang, Silverio, Rozo, et al. Generalized task-parameterized skill learning. In: IEEE international conference on robotics and automation, Brisbane, Australia, 21–25 May 2018, pp. 5667–5674.

29. Pervez, Mao, Lee. Learning deep movement primitives using convolutional neural networks. In: IEEE-RAS international conference on humanoid robots, Birmingham, UK, 15–17 November 2017, pp. 191–197.

30. Pervez, Lee. Learning task-parameterized dynamic movement primitives using mixture of GMMs. Intell Serv Robot 2 2018; 11(1): 61–78.

31. Mei, Chen, Zhang, et al. Path planning for mobile robots based on dynamic movement primitives. Inform Contr 2019; 48(4): 392–400.

32. Karlsson, Robertsson, Johansson. Convergence of dynamical movement primitives with temporal coupling. In: 2018 17th European control conference (ECC), Limassol, Cyprus, 12–15 June 2018.

33. Yang, Chen, et al. Robot learning system based on adaptive neural control and dynamic movement primitives. IEEE Trans Neural Net Learn Syst 2019; 25(5): 581–603.

34. Akgun, Thomaz. Simultaneously learning actions and goals from demonstration. Auto Robot 2016; 40(2): 211–227.

35. Alizadeh, Malekzadeh, Barzegari. Learning from demonstration with partially observable task parameters using dynamic movement primitives and Gaussian process regression. In: IEEE/ASME international conference on advanced intelligent mechatronics, AIM, vol. 2016-September, Banff, Alberta, Canada, 12–15 July 2016, pp. 889–894. IEEE.

36. Liu, Qian, Gui, et al. Task generalization of robots based on parameterized learning of multi-demonstration action primitives. Robot 2019; 41(5): 574–582.

37. Zhou, Oguz, Leibold, et al. A general framework to increase safety of learning algorithms for dynamical systems based on region of attraction estimation. IEEE Trans Robot 2020; 36(5): 1472–1490.

38. Berkenkamp, Turchetta, Schoellig, et al. Safe model-based reinforcement learning with stability guarantees. In: Conference on neural information processing systems, Long Beach, CA, USA, 13 November 2017, pp. 1–11.

39. Kahn, Villaflor, Pong, et al. Uncertainty-aware reinforcement learning for collision avoidance. arXiv: 170201182v1 2017: 1–12.

40. Lee, Saigol, Theodorou. Safe end-to-end imitation learning for model predictive control. arXiv:180310231v3 2018: 1–11.

41. Karlsson, Robertsson, Johansson. Autonomous interpretation of demonstrations for modification of dynamical movement primitives. In: 2017 IEEE international conference on robotics and automation (ICRA), Marina Bay Sands Singapore, 29 May 2017–3 June 2017.

42. Tessler, Mankowitz, Mannor. Reward constrained policy optimization. In: International conference on machine learning, Sydney, Australia, 26 December 2018, pp. 1–10.

43. Wen. Constrained cross-entropy method for safe reinforcement learning. In: Conference on neural information processing systems, Montreal, Canada, 12 August 2020, pp. 1–11.

44. Tessler, Mankowitz, Mannor. Reward constrained policy optimization. arXiv:180511074 2018.

45. Park, Hoffmann, Pastor, et al. Movement reproduction and obstacle avoidance with dynamic movement primitives and potential fields. In: 2008 8th IEEE-RAS international conference on humanoid robots, humanoids 2008, Daejeon, Korea (South), 1–3 December 2008, pp. 91–98.

46. Ginesi, Meli, Roberti, et al. Dynamic movement primitives: volumetric obstacle avoidance using dynamic potential functions. arXiv:200700518 2020.

47. Ijspeert, Nakanishi, Hoffmann, et al. Dynamical movement primitives: learning attractor models for motor behaviors. Neural Comput 2013; 25(2): 328–373.

48. Statistical learning methods. Beijing: Tsinghua University press, 2019.

49. McLachlan, Krishnan. The EM algorithm and extensions. New York: John Wiley & Sons, 1996.

Dynamic movement primitives based on positive and negative demonstrations
