Abstract
Robot introspection aids robots to understand what they do and how they do it. Previous robot introspection techniques have often used parametric hidden Markov models or supervised learning techniques, implying that the number of hidden states or classes is defined a priori and fixed through the entire modeling process. Fixed parameterizations limit the modeling power of a process to properly encode the data. Furthermore, first-order Markov models are limited in their ability to model complex data sequences that represent highly dynamic behaviors as they assume observations are conditionally independent given the state. In this work, we contribute a Bayesian nonparametric autoregressive Hidden Markov model for the monitoring of robot contact tasks, which are characterized by complex dynamical data that are hard to model. We used a nonparametric prior that endows our hidden Markov models with an unbounded number of hidden states for a given robot skill (or subtask). We use a hierarchical Dirichlet stochastic process prior to learn an hidden Markov model with a switching vector autoregressive observation model of wrench signatures and end-effector pose for the manipulation contact tasks. The proposed scheme monitors both nominal skill execution and anomalous behaviors. Two contact tasks are used to measure the effectiveness of our approach: (i) a traditional pick-and-place task composed of four skills and (ii) a cantilever snap assembly task (also composed of four skills). The modeling performance or our approach was compared with other methods, and classification accuracy measures were computed for skill and anomaly identification. The hierarchical Dirichlet stochastic process prior to learn an hidden Markov model with a switching vector autoregressive observation model was shown to have excellent process monitoring performance with higher identification rates and monitoring ability.
Introduction
For complex robot contact tasks, unmodeled disturbances can lead to failure. Recently, there has been more interest in automating snap assembly tasks for their varied applications. 1 Snap assemblies consist of strong elastic elements that create significant forces at the time of insertion and can be challenging to control when there are modeling errors. To this end, robot (physical) introspection is a valuable tool to help gain insight into the execution of robot skills. Robot introspection uses the robot’s sensory–motor signatures to discern the “what” and the “how” and empower better online decision-making. The “what” is concerned not only in understanding a robot’s nominal skills, but also in identifying anomalies and specific failure modes. Robot introspection, for example, could be used to identify external perturbations introduced into a system; internal modeling errors that lead to anomalies. Without introspection, when robots fail, they do so irreparably. Thus, introspection endows robots with the ability to recover and adjust to the unexpected. Particularly as robots continue their quest of gaining autonomy in unstructured environments, it is imperative that they understand both their internal dynamics and those of the environment. In doing so, robots can enhance their online decision-making ability and better negotiate the unexpected.
Despite increasingly robust trajectory planning 2,3 and control algorithms, 4 failure modes continue to exist whenever models do not capture the dynamics of the unstructured environment. The vast majority of systems have worked in process monitoring, state estimation, or robot introspection using parametric models, implying that the number of classes or states (also referred to as “model complexity”) are fixed. 5 The quality of a model is affected by the assumptions it presumes. When researchers make narrow assumptions, then the quality of the model is compromised. For cases like manipulation contact tasks where contact dynamics, friction, and actuator noise all introduce disturbances to the system, it is imperative to have good approximation of the true complexity of hidden states to better model the signals and produce more reliable classifications. Figure 1 shows a snap assembly’s force signature during alignment and how this task might be segmented and modeled as a sequence of skills by the introspection framework.

A representation of signal evolution for a contact task in one force axis. On the left, a robot is performing an assembly task. On the right, one would expect to identify different behaviors through the signal; for example, approaching, rotating, contacting, and son on. Key problems in the power of building statistical models were used to characterize the signal as well as distinguish the data pattern between per skill/subtask.
In the field of robot introspection for process monitoring of contact tasks, there are two relevant areas in machine learning: supervised parametric approaches and unsupervised nonparametric approaches (detailed description in the section “Related works”). This work furthers the state-of-the-art methods in unsupervised nonparametric process monitoring. Particularly, we investigate the modeling ability of Bayesian nonparametric techniques on Markov switching process to learn complex dynamics typical in robot contact tasks. A stochastic prior—the sticky hierarchical Dirichlet process (sHDP) is used to derive the model complexity (which represents primitive manipulation actions), while the Markov switching process known as the switching vector autoregressive (VAR-HMM) is used to model complex dynamical phenomena encoded in the wrench signature of a robot performing contact tasks. We hypothesize that the approach will lead to good generalizations that discover and model the underlying contact states in a computationally efficient manner. While others have used Bayesian nonparametric HMMs to model contact tasks, the more expressive sHDP-VAR-HMM has yet to be studied. We also contribute an extensive analysis of the performance of the model across different Markov models (sHDP-HMM with Gaussian observation, first-order autoregressive (AR) observation, or second-order AR observation) and different sensor modalities (wrench and end-effector’s pose) in two practical experiments: the Baxter robot do pick-and-place task and the HIRO robot do snap assembly task, respectively.
Our proposed introspective system runs in parallel with a manipulation behavior that is encoded through a finite state machine (FSM). The FSM also receives feedback from the process monitor. A Bayesian nonparametric prior, conjugate to the categorical distribution, is used to learn the mode complexity and transition distribution of the latent state portion of the model for a given robot skill. With respect to the VAR model, the Bayesian approach uses a conjugate prior to the AR likelihood to learn the posterior distribution of a set of dynamical parameters. For each state in the FSM, an inference is used to acquire models of the various subtasks in the task. From the set of trained models, an expected log-likelihood is derived. A classification threshold for anomalous (or unexpected events) detection is also derived from the trained model expected log-likelihood. Testing looks for both correct introspection (of the subtask at hand) and possible anomalous behavior. To do so, log-likelihoods of cumulative observations of the testing trial given separate sets of the trained models are produced. Correct introspection occurs when the log-likelihood of the skill at hand (as indexed by an FSM) attains the largest likelihood; otherwise, the nominal skill identification is faulty. Simultaneously, we monitor whether the log-likelihood of testing observations belonging to the current skill crosses below an anomalous threshold or not. If it does, an anomaly is occurred.
A typical pick-and-place task by the Baxter humanoid robot and a cantilever snap assembly task by the HIRO robot were demonstrated (see Figure 2). (Both tasks were driven by FSMs in their nominal behavior. Anomalous cases in the snap assembly case occurred from time-to-time throughout experimentation.) For both experiments, the sHDP-VAR-HMM results were compared with other introspective methods. This article shows that the nonparametric approach outperformed the baseline comparisons. The modeling technique was also used to learn a library of robot skills that can be used for future introspection and integrated with low-level control for online decision-making.

On the left, the Baxter robot performs the traditional pick-and-place task. On the right, the HIRO robot performs a snap assembly task. Both tasks are driven by a finite state machine (http://www.rps.org/SMACH).
The remainder of the article is structured as follows: We first review the related works on robotic process monitoring in recent decade in the second section. The methodology of Bayesian nonparametric Markov switching process is described in the third section. After introducing the formulation of sHDP-VAR-HMM, we consequently present our introspective implementation in process monitoring in the fourth section. The experimental description and results of two real-robot contact tasks are presented in the fifth section. In the “Discussion” section, the strengths and weaknesses of nonparametric method for robot process monitoring are discussed. Finally, we will have the conclusion and future work. ( Supplemental material, code, and data can be found at website: http://www.juanrojas.net/shdp-var-hmm/.)
Related works
This section describes unsupervised parametric approaches, and unsupervised nonparametric approaches have been applied in robot introspection for process monitoring of contact tasks, respectively. Supervised parametric methods have a prescribed number of classes that are fixed over time. This limits the expressive power of the algorithm, as the system may actually consist of a different complexity than that assumed. Knowing the correct class number a priori for real tasks and how to segment them is challenging. Tests like the maximum likelihood estimation (MLE) are used to compute which model best fits the data, but such methods tend to overfit, and undergoing such testing can be arduous. 6 –8 As a common approach in process monitoring, supervised parametric learning maps input features to the fixed and prescribed number of manually annotated classes. For example, in contact tasks state estimation, Rojas et al. extract relative-change patterns classified through a small set of categories and aided by contextual information and where increasingly abstract layers were used to estimate task behaviors. 1 The framework used Bayesian models to provide a belief about its classification 9 and support vector machines (SVMs) to identify specific failure modes. 10 Rodriguez et al. 6 classified wrench data using SVMs to learn a decision rule between successful and failed assemblies offline. Later in Rodriguez et al., 7 a probabilistic classification for outputs was produced through relevance vector machines in combination with a Markov chain model. Golz et al. 11 used an SVM approach to distinguish regular human–robot interactions versus collisions, thus inferring the intention of the robot. Artificial neural networks (ANNs) have also been used to perform state estimation. 12,13 Althoefer et al. 12 used a radial basis ANN to monitor and estimate failure modes for threaded fastener assemblies. Most recently, Kappler et al. used a naive Bayes approach with a Gaussian model to incrementally model the current state of execution through kinesthetic demonstrations. 14 The system tightly integrates a low-level control loop with a high-level state estimation loop to perform tractable online decision-making. A supervised discriminative model is used to learn general failure but has to be taught, no global failure threshold exists at system deployment.
Others have opted to use a stochastic process based on a Markov chain, namely the hidden Markov model (HMM) or variants thereof. Hovland and McCarragher 8 pioneered the use of HMMs to model contact events by observing wrench signatures. The contact state was identified among a set of discrete edge-surface configurations and provided a probability over a sequence of contacts. In this work, the model complexity remained fixed. Kroemer et al. learned to model the transitions among robot subtasks in manipulation problems. 15,16 The work used a variant of the standard AR HMM (AR-HMM) where actions, states, and robot skills were modeled. The key difference to the AR-HMM is the additional edge from the current state to the current skill. As a result of this edge, the transition between skills depends on the observed state. However, the results were still susceptible to the number of skills used and require validation tests at the end to compute the correct number. Park et al. used an HMM with a multimodal feature vector for execution monitoring and anomaly detection. 17 They innovated an anomaly detection threshold whose value varied according to the task’s progress.
As previously mentioned, most of the work related to high-level process monitoring has used parametric models. Parametric models assume a fixed and prescribed model complexity (i.e. the number of latent states in the model). Few works in process monitoring have studied nonparametric approaches that allow a prior with an unbounded number of potential model parameters to learn the model complexity based on the data. In doing so, one can expect better modeling and thus behavior identification. Unsupervised nonparametric learning methods adapt their complexity based on available data allowing more realistic classification. A key point is whether expert domain information, in the form of a prior, can be used to learn the model more efficiently. Very recently the stochastic processes such as the Dirichlet process (DP) and the hierarchical equivalent (hierarchical Dirichlet process [HDP]) 18,19 have been used in robotics to integrate domain-specific information when available, which use the data to drive the complexity of the model and assign beliefs to the states. 20,21 In recent years, many researchers have used nonparametric Bayesian approaches to learn HMMs in contact tasks for normal or abnormal detection and recognition. DiLello et al. 20,21 used the sHDP prior to learn the model parameters of an HMM based on wrench signatures for an industrial robot alignment task. The work was used to learn specific failure modes and do online classification. The approach is robust as it is able to model not just individual contact events but rather the evolution of signals; it does so while incorporating domain knowledge and adjusting its model complexity according to the data patterns presented. Nonetheless, the approach is limited by the assumptions of the HMM in which observations are conditionally independent given the state. This is often an insufficient condition to capture the temporal dependencies of complex dynamical behaviors. Niekum et al. 22 used a beta-process prior on the VAR (BP-VAR-HMM) to extract the mode complexity from the data. The BP-VAR-HMM has the ability to discover and model primitives from multiple related time-series. Robot skill discrimination was reliable and consistent though this work only modeled robot joint angles in a primarily non-contact-oriented task.
sHDP-VAR-HMM
Contact task modeling involves the interpretation of noisy wrench signals. Wrench noise is not well approximated by Gaussian noise and may contain latent patterns that stem from the knowledge of an expert task programmer or human demonstrator. Such patterns vary when the same task is executed by different agents. Our goal is, then, given roughly similar signals, to identify fundamental temporal patterns and model signal evolution to provide the robot with temporal introspection about its evolving high-level state. If successful, a robot can use this information to reason about its next move: whether selecting the next skill or recovering from abnormal behaviors (internal or external). Few works explore the capability of characterizing both robot skills and failure simultaneously in the same architecture. Both are critical to help the robot has a more complete understanding of its actions. We are thus in need of powerful modeling techniques to characterize complex dynamical phenomena, capturing both spatial and temporal correlations for the data, while integrating beliefs about the system. The sHDP-VAR-HMM leverages the attributes of combinatorial stochastic processes (sHDP) with state-space descriptions of dynamical systems (AR-HMM) 19 to achieve this goal. In this section, we provide a brief introduction to HMMs, Bayesian nonparametrics, and their integration with state-space models.
Hidden Markov model
HMM has been a workhorse in pattern recognition able to encode probabilistic state-space model. The HMM is a stochastic process where a finite number of latent states have Markovian state transitions. Conditioned on the mode sequence, the model assumes (discrete or continuous) conditionally independent observations given the latent state. Let zt denote the latent state of the Markov chain at time
where the state at the first time step is distributed according to an initial transition distribution
Then, to calculate the log-likelihood value of an observation vector
As illustrated in equation (2), if the state path
where
The denominator represents the probability of a sequence of observations that can be calculated by equation (3).
Bayesian nonparametric priors
HMMs restrictively assume a fixed model complexity. In process monitoring, latent states can represent robot primitives. It is clear that not all robot skills have the same number of primitives and that even the same skill might have variation when conducted under different conditions. To introduce flexibility into the number of computed hidden states, we leverage priors as probability measures.
HMMs can also be represented though a set of transition probability measures Gj. Probability measures yield strictly positive probabilities and sum to 1. Consider if instead of using a transition distribution on latent states, we use it across emission parameters
where
Therefore, given θj, different probability weights are assigned to possible successor candidates θk. We can also assign a prior to the categorical probability measure Gj. The Dirichlet distribution is a natural selection due to conjugacy. Thus, the transition probabilities
Additionally, emission parameters are drawn from a base measure H such that
We can further summarize the probability measure
where γ is the concentration parameter, and H is the base measure over parameter space Θ. The weights βk are sampled via a stick-breaking construction
where
This HDP is used as a prior on the HMM. The implications are a mode complexity learned from the data and a sparse state representation. The
The sticky HDP increases the expected probability of self-transitions by an amount proportional to κ and leads to posterior distributions with smoothly varying dynamics. Finally, priors are placed on the concentration parameters α and γ and the sticky parameter κ. Latent state creation is influenced by α and γ, while self-transition probabilities are biased by κ. These priors allow to integrate expert knowledge for better modeling than the expectation-maximization (EM) algorithm traditionally used in HMMs.
sHDP-VAR-HMM
Many complex dynamical phenomena are not adequately described as conditionally independent of the observations given the state zi; that is, the observations are a noisy linear combination of some finite set of past observations plus additive white noise. In such cases, the dynamical processes can be modeled by an AR process. Particularly, an order r vector AR process, denoted by VAR(r), with observations

On the left, graphical representation of a hierarchical Dirichlet process hidden Markov model (HDP-HMM) over T time steps. The latent, discrete-valued Markov process zt captures the temporal dependencies in the observations yt. On the right, graphical model for the hierarchical Dirichlet process switching vector autoregressive process (HDP-VAR-HMM) over T time steps. The latent, discrete-valued Markov process zt dictates the linear dynamical model at time t. For an order r switching VAR process (shown here for
The VAR has simplifying assumptions that make it a practical choice in applications.
19
The switching regime can be combined with the sHDP-HMM from the section “Bayesian nonparametric priors” to leverage the expressiveness of the VAR system with the ability of nonparametric priors to learn the mode complexity of the model. The VAR(r) process, with AR order r, consists of a latent state
We see a generative model for a time-series
and an inverse-Wishart prior on Σ
See appendix F.1 in Fox 25 for a detailed derivation of resulting posterior and definitions for variables M and K, which are parameters that define the matrix-normal portion of the MNIW prior.
Robot introspection
In our robot introspection system, we simultaneously detect nominal robot skills and anomalous events. Traditional robot behavior design uses FSMs to execute nominal robot tasks. FSMs partition the task into a sequence of skills (also known as subtasks or phases) and events that trigger a transition across skills. The skill sequence is learned from kinesthetic teaching or through an expert programmer and executed through a low-level controller (see our control basis controller 1 ). Transition thresholds are empirically devised under controlled conditions during the design phase and usually focus on robot data such as lapsed time, joint angle configurations, or wrench change-point events. This setup is fragile, particularly in unstructured scenarios like human–robot collaboration, where many unexpected events may occur. Thus, there is a need for the robot both to learn new skills along with their identification and to detect anomalous events.
Model training
We train our model on individual subtasks and capture their dynamics through an observation vector Y composed of end-effector wrench values and robot joint angles. The AR model in the sHDP-VAR-HMM captures spatiotemporal correlations in the observations. Blocked Gibbs sampling
25
learns the posterior distribution of the sHDP-VAR-HMM. We also learn the mean values for the model parameters Π of a given skill s, where
Nominal skill identification
Given M trained models for S skills in the robot manipulation task (M is equal to S when using k-fold cross-validation for optimal model selection), the optimal model of a skill is used along with the standard forward–backward algorithm to compute the expected cumulative likelihood of a sequence of observations. The expected cumulative likelihood is
If so, the skill identification is deemed correct, and we record the time at which the correct classification was flagged. We compute both a classification accuracy matrix and the mean time threshold value by a cross-validation period. In Figure 4, one can observe the log-likelihood curves for a snap assembly task consisting of four skills (see the section “Experimental verification” for more details): (i) a guarded approach, (ii) an alignment, (iii) an insertion, and (iv) mating. Note how for each of the subtasks, the log-likelihood corresponding to the current FSM index has a greater value than the rest.

Cumulative log-likelihoods using test trial observations conditioned on all available trained skill model parameters
Anomaly monitoring
General anomaly monitoring versus specific failure mode detection is more robust. Specific failure mode detection has many disadvantages: an infinite number of failure modes (even a small representative subset requires much work) and intentional failure, which is difficult to execute and puts the robot at risk. Learning failures as you go is viable, 14 but we deem that general anomaly detection complemented with recovery strategies based on additional sensory information is easier to generalize and more robust. Some have even began leveraging human sense of failure to guide robots. 26,27 In our work, anomaly detection is continuously monitored based on the given lower bound of threshold. It is based on the notion that, in nominal tasks, the cumulative likelihood has similar patterns across trials of the same robot skill. Thus, the expected cumulative log-likelihood L derived in training can be used to implement an anomaly threshold F. At first, we define the threshold as mentioned in DiLello et al. 20,21 In Figure 4, the dashed cyan curve represents L generated from the training data for each skill. For each time step in an indexed skill sc, the anomaly threshold is set as follows
where k is a real-valued constant that varies from the standard deviation of the threshold. We are interested only in the lower bound, thus the negative sign. Then, process monitoring works according to the following
where Πn represents the parameters for the nominal trained model. Equation (19) denotes that if the cumulative likelihood crosses the threshold, an anomaly is identified, else nominal.
Figure 4 shows the snap assembly task composed of four skills. As such, four probabilistic models are derived. Given an indexed position in a graph, one can draw the expected cumulative log-likelihood for that skill and the accompanying threshold. One can also see how at the termination of a skill, all data are reset and restarted. Moreover, in our experiments, we noticed that upon resetting our cumulative log-likelihood observations, anomalies were often triggered at the beginning of the task. We observed that the standard deviation of the cumulative log-likelihood training graphs began with very small standard deviations and then grew significantly over time. Figure 5 shows how the standard deviation of the cumulative log-likelihoods grows over time. The standard deviation grows as observations show greater variance due to the accumulation of error. To address this limitation, a new threshold definition was designed. Given that the derivative between the cumulative log-likelihoods between time steps t and

This plot shows the standard deviation of the cumulative log-likelihoods computed during training. The standard deviation grows over time as observations show greater variance due to the accumulation of error.
Additionally, the maximum and minimum log-likelihood of test trials can be computed as follows:
where Ti is the time length of each skill in trial Xi, and
The test aims to detect whether the derivative is an outlier compared with the derivatives of successful skill executions.
Figure 6 shows an example of how the derivative signal crosses the empirical threshold multiple times as anomalies are triggered by external perturbations during the execution of a skill. Finally, given this threshold definition, we select the threshold value that maximizes the true-positive rate (TPR) and minimizes the false-positive rate (FPR).

The derivative signal crosses the empirical threshold multiple times as anomalies are triggered by external perturbations during the execution of a skill. The red dash line represents the low-bound threshold for each skill.
Experimental verification
Results for nominal and anomalous skill identification are reported independently for clarity. In the case of anomaly detection, the trials within the test folds for skill and anomaly identification were randomly intermixed. A 12-dimensional observation vector consisting of 6-D wrench measurements and 6-D pose values at each time step t:
An AR order
In order to assess the robot introspection ability of the sHDP-VAR(r)-HMM, two test beds were tested: (i) a Baxter robot for a pick-and-place task and (ii) a HIRO robot for the snap assembly task. Both robots used F/T sensors (Robotiq FT300 and a JR3 respectively) that were placed on the robot wrists to measure forces and moments at
Experiment 1: Pick-and-place task
The dual-arm Baxter robot is used for the pick-and-place task. The latter is bootstrapped via a FSM composed of four skills (see Figure 7), and the goal is to use the sensed wrench and pose signals in building a statistical model for each skill through the proposed sHDP-VAR(r)-HMM method.

Experiment 1: the skills of a traditional pick-and-place task by FSM: (i) prepick, (ii) pick, (iii) preplace, and (iv) place. FSM: finite state machine.
Figure 8 shows an example of the wrench signals for each skill (shown in different colors) of this task. The signals were segmented into four skills according to the states shown in the FSM in Figure 7. Notice that as the task is being executed, significant patterns emerge within and in between the skills. For instance, if we examine the signals for Go-to-Pick-Hover and Go-to-Pick shown in Figure 7, we can identify patterns that can be encoded with statistical significance using our model. To evaluate the performance of the sHDP-VAR(r)-HMM models, we tested 24 samples for each skill through leave-one-out cross-validation. 28 The optimal model for each skill was selected via MLE.

Experiment 1: recorded wrench signals in the pick-and-place task. Different colors encode different skills.
For nominal skill identification, we test normally executing skills. Figure 9 shows corresponding signals generated on the basis that for ith skill model output is 1 if its cumulative likelihood is greater than the rest of the models at each time step, else 0. If a model has a maximum output cumulative likelihood value during the last time step in a sample, then the value is set to 1, implying that this observation belongs to that model class (the correct skill identification); otherwise, the firing would be 0. From the skill intervals shown in Figure 8 and the model outputs of in Figure 9, the monitoring accuracy for the sHDP-VAR(2)-HMM-based method can be readily proposed. The same scheme can be inferred for the other skills. We assessed the performance of the model by comparing with other similar techniques, namely the sticky HDP-HMM with Gaussian observation model and the sHDP-VAR(1)-HMM with a first-order AR model. We also show the performance by using different multimodal signal combinations. A confusion matrix is used for the accuracy evaluation of skill identification for the different methods and shown in Table 1. Notice from Figure 9 that, at the beginning of each skill, the identification suffers from low accuracies. This is due to the way in which we cumulatively compute the log-likelihood (not unlike humans that often need more information before confidently classifying). We compute the average test time to perform correct classification and report it as a percentage of total duration for a given skill (see Table 2). Low percentages imply quick identification, while large ones imply slow decisions.

Experiment 1: sHDP-VAR(2)-HMM model outputs for the pick-and-place task. sHDP: sticky hierarchical Dirichlet process; VAR: vector autoregressive; HMM: hidden Markov model.
Identification accuracy of different methods with different multimodal signals in a pick-and-place experiment.
sHDP: sticky hierarchical Dirichlet process; VAR: vector autoregressive; HMM: hidden Markov model; piH: skill of Go-to-Pick-Hover; Pik: skill of Go-to-Pick; plH: skill of Go-to-Place-Hover; Pla: skill of Go-to-Place.
Average time for correct anomaly identification in a pick-and-place task.
sHDP: sticky hierarchical Dirichlet process; VAR: vector autoregressive; HMM: hidden Markov model; piH: skill of Go-to-Pick-Hover; Pik: skill of Go-to-Pick; plH: skill of Go-to-Place-Hover; Pla: skill of Go-to-Place.
For anomaly monitoring, we manually induced three types of external perturbations to collect anomaly data. First, we introduce human collisions. Human collisions cause a pose displacement of a held object and lead to failures in the placement/assembly of that object. Five failure trials were induced for each perturbation, for a total of 15 anomalous trials (and 24 nominal trials). Receiver operating characteristic (ROC) curves were used to measure the discriminative ability of the system across wrench and pose sensor modalities (i.e. wrench and wrench–pose modalities). Constant k in equation (18) determines our classification threshold and is optimized to obtain the best monitoring performance. Comparisons between unimodal wrench signals and multimodal pose–wrench signals are conducted. Figure 10 shows ROC curves evaluate the relationship between the FPR and TPR. For any given TPR, multimodal tests resulted in lower FPRs when compared with unimodal versions. Our method also enjoyed high TPR when compared with other introspective methods.

Experiment 1: ROC curves for the pick-and-place task. The figure shows ROC curves that compare the performance of multimodal and unimodal sensing for anomaly detection and the performance of our anomaly detection method versus other baseline methods. ROC: receiver operating characteristic.
Experiment 2: Snap assembly task
In this experiment, so as to show the applicability and superiority of the suggested sHDP-VAR-HMM monitoring scheme in a real-world assembly task of industrial relevance, a 6-DoF dual-arms HIRO robot with electric actuators and a Robotiq 6-DoF force–torque sensor attached on the wrist is used to perform a snap assembly task of camera parts, as shown in Figure 11. A custom end-effector holds a male part, while a female part is fixed to a table in front of the robot. The tool center point (TCP) is placed at the location where the male and female parts make contact. The world reference frame was located at the manipulator’s base. The TCP position and orientation were determined with reference to the world coordinate frame To. The force and moment reference frames were determined with respect to the wrist’s frame (see Figure 2(b)). OpenHRP 29 executes the FSM, and modular hybrid pose–force–torque controllers 1 execute the skills. Four nominal skills are connected by the FSM: (i) a guarded approach, (ii) an alignment procedure, (iii) a snap insertion with high elastic forces, and (iv) a mating procedure. Unexpected events occur during initial parts’ contact (e.g. the wrong parts localization) or during the insertion stage where wedging is possible. An example of the recorded wrench signals is shown in Figure 12.

Experiment 2: the skills in a snap assembly task FSM: (i) approach, (ii) rotation, (iii) insertion, and (iv) mating. FSM: finite state machine.

Experiment 2: recorded wrench signals in the snap assembly task. Different colors encode different skill.
As in experiment 1, 44 real-robot nominal trials and 16 anomalous trials were conducted. We measured the skill identification and anomaly detection for the modeling schemes considered in experiment 2. The output of the proposed HDP-VAR(2)-HMM scheme is shown in Figure 13. The identification accuracy per robot skill is presented in Table 3, and the average time for computing correct classifications is reported as a percentage of total skill duration (see Table 4). In addition, as in experiment 1, we still compare performance with the same baseline methods. Meanwhile, the ROC curves confirm the anomaly detection performance that is illustrated in Figure 14. It is thus seen that the proposed process monitoring scheme has both accurate and efficient performance in contact tasks.

Experiment 2: sHDP-VAR(2)-HMM model outputs in the snap assembly task. sHDP: sticky hierarchical Dirichlet process; VAR: vector autoregressive; HMM: hidden Markov model.
Identification accuracy of different methods with variable sensor signals in snap assembly experiment.
sHDP: sticky hierarchical Dirichlet process; VAR: vector autoregressive; HMM: hidden Markov model.
Average time of computing correct identification in snap assembly task.
sHDP: sticky hierarchical Dirichlet process; VAR: vector autoregressive; HMM: hidden Markov model.

Experiment 2: Receiver operating characteristic curves for the snap assembly task across multimodal scenarios and baseline methods. Unimodal wrench signals are compared with multimodal pose and wrench signals.
Discussion
The results show that the sHDP-VAR-HMM had better identification accuracy than baseline methods, specially for the go-to-place skill of the first experiment and the more complex insertion skill of the snap assembly. Both skills entail complex dynamical phenomena that is hard to model. For both experiments, the sHDP-VAR-HMM significantly outperforms baselines in classifying nominal skills while requiring less time to do so. The ROC curves show that our model has better discrimination with less false positives than the rest of the models. The sHDP-VAR-HMM thus makes stronger generalizations and is a more powerful modeling system than the sHDP-HMM.
Compared with our previous SVM classifier, 10 the SVM tends to do better. On the other hand, the sHDP-VAR-HMM system allows for the simultaneous checking of both nominal and anomalous skills, but not so in Rojas et al., 10 where a two stage classifier is necessary. Having said that, the SVM probabilistic method is able to provide a confidence parameter beyond accuracy classification, something that is not readily available in our HMM (though condition numbers are possible but not yet attempted in our work). Compared with the parametric method, EM-based Gaussian mixtures models, 30 our method without given the number of the mixtures of the Gaussian components for each signal, has better performance.
As part of our future work, we expect to integrate a real-time open-source framework that will exploit Markov jump linear system and significantly increase the robustness, accuracy, and computational efficiency of the state-of-the art approaches in robot introspection. A wide array of robotic domains will benefit from the ability to confirm executed skills including collaborative manufacturing robots, service robots, as well as areas such as high-level planning and programming by demonstration. Better robot introspection will also result in better decision-making, failure prevention and recovery, and increased efficiency across all robot domains. What’s more, despite the robust recognition ability of the sHDP-VAR-HMM, the model still requires some time for its parameters to converge to correct classifications as a result of the use of Gibbs sampling. Future efforts will focus on improving the computational efficiency of the approach through variational Bayesian inference methodologies.
Conclusion and future work
In this work, a second-order nonparametric Markov switching process, the sHDP-VAR(2)-HMM, was used to perform robot introspection. The introspection system robustly and efficiently identified nominal and anomalous executions across robots and robot contact tasks. The combination of the VAR-HMM Markov-switching process with the nonparametric Bayesian prior used to learn posterior models resulted in better models that were more capable of encoding complex switching dynamical phenomena with better and faster classification rates.
The developed models were used to estimate the log-likelihood of observation vector during task execution. Two autonomous manipulation experiments were considered. For both experiments, comparisons were made with the results of the available introspective approaches like the sticky HDP-HMM with Gaussian observation model and the sHDP-VAR(1)-HMM with AR order 1. Additionally, multimodal sensor observations were shown to result in better introspection capabilities by comparing the accuracy with different multimodal inputs. Development of robot introspection is expected to have a direct impact on a large variety of practical applications, such as that can be used to estimate the log-likelihood of failure and prevent the failure during robot manipulation task, improves the safety in human–robot collaborative environment by assessing the quality of learned internal model for each skill, and can speed up the anomaly recovery and/or repair process by providing the detailed skill identification and anomaly monitoring.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work is supported by Major Project of the Guangdong Province Department for Science and Technology (Grant Numbers: 2014B090919002 and 2016B0911006), the NSFC-Guangdong Joint Fund (Grant Number: U1401240), the National Science Foundation of China (Grant Number: 61750110521), and the Frontier and Key Technology Innovation Special Funds of Guangdong (Grant Numbers: 2014B090919002, 2016B090911002, and 2017B050506008).
