Gauss process state-space model optimization algorithm with expectation maximization

Abstract

A Gauss process state-space model trained in a laboratory cannot accurately simulate a nonlinear system in a non-laboratory environment. To solve this problem, a novel Gauss process state-space model optimization algorithm, the EM-GP-RTSS algorithm, is proposed by combining the expectation–maximization algorithm with the Gauss process Rauch–Tung–Striebel smoother algorithm. First, a theoretical formulation of the Gauss process state-space model is given, which has not appeared in previous references. Second, a Gauss process state-space model optimization framework based on the expectation–maximization algorithm is proposed. In the expectation–maximization algorithm, the unknown system state is treated as the missing data, and the maximization of the measurement likelihood function is transformed into that of a conditional expectation function. Then, the Gauss process–assumed density filter algorithm and the Gauss process Rauch–Tung–Striebel smoother algorithm are formulated for the Gauss process state-space model defined in this article, in order to calculate the smoothed distribution in the conditional expectation function. Finally, the Monte Carlo numerical integration method is used to obtain an approximate expression of the conditional expectation function. The simulation results demonstrate that the Gauss process state-space model optimized by the EM-GP-RTSS can simulate the system in the non-laboratory environment better than the Gauss process state-space model trained in the laboratory, and can reach or exceed the estimation accuracy of the traditional state-space model.

Keywords

Gauss process state-space model, nonlinear dynamic system, expectation maximization algorithm, Gauss process Rauch–Tung–Striebel smoother

Introduction

In many engineering problems, the state of a complex nonlinear dynamic system must be estimated from a time series of sensor measurements. This requires a nonlinear dynamic model of the system, such as a nonlinear state-space model (SSM) consisting of a parametric state transition equation and a parametric measurement equation. The Gauss process (GP) model can be used to simulate the nonlinear state transition equation and the measurement equation separately. If knowledge of the nonlinear system is insufficient to build a principled parametric model, a non-parametric Gauss process state-space model (GPSSM) can be constructed instead. For example, GPSSMs are used for target monitoring and location problems in wireless sensor networks.^1,2

In order to train the GPSSM of a nonlinear system, its true system states are required. Paradoxically, however, these states are usually unknown and need to be estimated. Several methods and models have been proposed to overcome the training and estimation problems of the GPSSM.

Gauss process latent variable models (GPLVMs)³ are first proposed to learn a system model with low-dimensional state-space and high-dimensional observation space. Later, GPLVMs are extended to the dynamic system to obtain GPBF-LEARN, which is a framework for learning GP-BayesFilters from weakly labeled training data only.⁴ With specially tailored Particle Markov Chain Monte Carlo (MC) samplers, a fully Bayesian approach to inference and learning (i.e. state estimation and system identification) for GPSSMs is proposed by marginalizing state transition functions.⁵ Although this method preserves a non-parametric representation of a system model, it requires that its measurement model should be known. Variational GPSSM is proposed by combining variational Bayes with sequential MC.⁶ Through imposing a structured Gaussian variational posterior distribution over latent states, an inference mechanism for the GPSSM is proposed in order to address the nonlinear system identification.⁷ Although the above works have been devoted to the GPSSM training based on the unknown true system states, unfortunately the problem has not been solved theoretically.

Thus, the best way to obtain an accurate GPSSM is still to acquire its true system states and use them to train a state transition equation and a measurement equation. In order to obtain true system states, it is necessary to build a state detection system with multiple types of sensors, which is expensive and not possible in some cases. When a homogeneous nonlinear system can be placed in a laboratory, such a state detection system can be constructed in the laboratory. After true states are obtained in the laboratory, a GPSSM of the homogeneous nonlinear system can be established.

However, our purpose is to simulate a nonlinear dynamic system in a practical application with GPSSMs, that is, the nonlinear dynamic system in a non-laboratory environment. Because noise and parameters in a non-laboratory environment differ from those of the homogeneous nonlinear system in a laboratory, simulating a homogeneous nonlinear system in a non-laboratory environment with a GPSSM trained in a laboratory could introduce a model bias, which will lead to an overfitting problem of state estimation.

In order to improve the accuracy of a GPSSM in a non-laboratory environment, a reasonable method is to eliminate its model bias by adaptively correcting the hyperparameters of the GPSSM that is trained in a laboratory.

Recently, parameter identification methods, combining the expectation–maximization (EM) algorithm⁸ with smoothing algorithms,⁹ are gradually recognized for parametric nonlinear SSMs. Given the input measurements and the output measurements, the EM estimates the parameters in a multivariable bilinear SSM as the robust and gradient-free computation of the maximum likelihood (ML) solution.¹⁰ A parameter estimation method combining particle method with the EM is proposed to estimate the general SSM of nonlinear systems.¹¹ Recognizing the unknown noise covariance of a nonlinear system is studied through the EM and the moving horizon estimation.¹² The EM using the classical extended and ensemble versions of the Kalman smoother is proposed to estimate the model error covariance.¹³ For nonlinear SSMs with multiplicative unknown parameters, an adaptive dispersion filter algorithm based on the EM is proposed.¹⁴

Given training data, a GPSSM can be regarded as a special parametric SSM. Based on the maximum likelihood estimator/estimate (MLE) criterion, this article proposes a novel GPSSM optimization algorithm called EM-GP-RTSS, which combines the EM algorithm with the Gauss process Rauch–Tung–Striebel smoother (GP-RTSS) algorithm, in order to improve the accuracy of GPSSMs simulating the nonlinear dynamic system in a non-laboratory environment.

First, a theoretical formulation of GPSSMs is proposed by extending the one-dimensional output prediction to a multidimensional output prediction. This is an important step that has not appeared in previous references. Second, the MLE criterion is used to describe the GPSSM optimization problem as the maximization of the measurement likelihood function. However, an analytical expression of the measurement likelihood function is difficult to obtain, and thus the EM is used to replace the difficult-to-maximize measurement likelihood function with the conditional expectation function (CEF). Third, a GPSSM optimization framework based on the EM is proposed in this article. The GP-RTSS algorithm is given to obtain the smoothed estimation distribution of the system state, which is required by the calculation of the CEF. Fourth, because the integral in the CEF cannot be calculated in closed form, the MC numerical integration method is used to express an approximate CEF. Fifth, the variable metric method¹⁴ is adopted for the optimization calculation of the approximate CEF. In the simulation experiment, a noise pendulum model is used as a representative of the nonlinear dynamic system to verify the effectiveness of the EM-GP-RTSS algorithm proposed in this article.

This article is organized as follows. In section “Background on GP,” we briefly review the background on GPs. In section “Optimization algorithm for GPSSM,” we describe the development of the GPSSM optimization algorithm in detail. In section “Simulation experiment,” the experimental results based on the noise pendulum model are presented to demonstrate the performance of the EM-GP-RTSS. Finally, the conclusion is drawn in section “Conclusion and future works.”

Background on GP

In this section, a standard GP model and a noise input prediction model are given. This section is the mathematical foundation for the development of the GPSSM optimization algorithm.

Standard GP model

A GP is a non-parametric tool for learning regression functions from sample data.¹⁵ Specifically, a GP represents a random process whose outputs at any set of sampling points follow a joint Gaussian distribution. A GP can be fully described by its mean function and covariance function. Suppose there is a training data set from the following noisy process

β_{i} = ψ (α_{i}) + ε

(1)

where $α_{i}$ is an n-dimensional input column vector, $β_{i}$ is a scalar output, $ε$ is a random variable that obeys a Gaussian white noise $N (ε | 0, σ^{2})$ , where $σ$ is its standard deviation. Through denoting the matrix $A = [α_{1}, α_{2}, \dots, α_{N}]$ and the column vector $β = (β_{1}, β_{2}, \dots, β_{N})^{T}$ , the data set D can be restated as $D = 〈 A, β 〉$ . The joint distribution of the noise output $β$ is a function of $A$ . This distribution is assumed as a zero-mean multidimensional Gaussian distribution

p (β) = N (0, K (A, A) + σ^{2} I)

(2)

where $K (A, A)$ is the kernel matrix, and its element of the $i th$ row and the $j th$ column is denoted as $K_{ij} = k (α_{i}, α_{j})$ . The kernel function of GPSSMs usually selects the squared exponential (SE) function

k (α_{i}, α_{j}) = σ_{ψ}^{2} \exp (- \frac{1}{2} {(α_{i} - α_{j})}^{T} W_{ψ}^{- 1} (α_{i} - α_{j}))

(3)

where $σ_{ψ}^{2}$ is a signal variance. The diagonal matrix $W_{ψ}$ consists of the length scales for each input dimension. $σ_{ψ}$ , $σ$ , and $W_{ψ}$ are the hyperparameters of the GP. Given a training data set $〈 A, β 〉$ and a noiseless test input $α_{*}$ , the Gaussian distribution of the output $β_{*}$ can be obtained by the standard GP prediction model, whose mean is

G P_{ρ} (α_{*}, 〈 A, β 〉) = k_{*}^{T} {[K (A, A) + σ^{2} I]}^{- 1} β

(4)

and whose variance is

G P_{Σ} (α_{*}, 〈 A, β 〉) = k (α_{*}, α_{*}) - k_{*}^{T} {[K (A, A) + σ^{2} I]}^{- 1} k_{*}

(5)

where $k_{*}$ is the kernel vector between $α_{*}$ and $A$ , denoted as $(k (α_{*}, α_{1}), k (α_{*}, α_{2}), \dots, k (α_{*}, α_{N}))^{T}$ .
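Formulas (3)–(5) can be illustrated with a minimal NumPy sketch. The function names `se_kernel` and `gp_predict`, the toy sine data, and the hyperparameter values are assumptions made for illustration only; inputs are stored as columns of $A$ as in the text.

```python
import numpy as np

def se_kernel(A1, A2, sigma_psi, w_diag):
    """SE kernel of formula (3). A1 is (n, N1), A2 is (n, N2); columns are inputs.
    w_diag holds the diagonal of W_psi."""
    S1 = A1 / np.sqrt(w_diag)[:, None]
    S2 = A2 / np.sqrt(w_diag)[:, None]
    sq = (S1**2).sum(0)[:, None] + (S2**2).sum(0)[None, :] - 2.0 * S1.T @ S2
    return sigma_psi**2 * np.exp(-0.5 * np.maximum(sq, 0.0))

def gp_predict(alpha_star, A, beta, sigma_psi, w_diag, sigma):
    """Mean (4) and variance (5) at a noiseless test input alpha_star of shape (n,)."""
    K = se_kernel(A, A, sigma_psi, w_diag) + sigma**2 * np.eye(A.shape[1])
    k_star = se_kernel(A, alpha_star[:, None], sigma_psi, w_diag)[:, 0]
    mean = k_star @ np.linalg.solve(K, beta)
    var = sigma_psi**2 - k_star @ np.linalg.solve(K, k_star)  # k(a*, a*) = sigma_psi^2
    return mean, var

# Toy training set (hypothetical): noise-free samples of a sine function.
A = np.linspace(-3.0, 3.0, 30)[None, :]
beta = np.sin(A[0])
mean, var = gp_predict(np.array([0.5]), A, beta,
                       sigma_psi=1.0, w_diag=np.array([1.0]), sigma=0.05)
```

The linear systems are solved with `np.linalg.solve` rather than by forming the inverse explicitly, which is usually the more stable choice for kernel matrices.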

Noise input prediction model

If the test input $α_{*}$ is a random Gaussian noise, the prediction distribution of corresponding output $β_{*}$ cannot be directly given by formulas (4) and (5). Assume that $α_{*}$ follows the Gaussian distribution $N (α_{*} | {\hat{α}}_{*}, P_{*})$ , where ${\hat{α}}_{*}$ is a mean vector and $P_{*}$ is a covariance matrix. The prediction problem corresponds to calculating the probability distribution

p (β_{*}) = \int \int p (β_{*} | α_{*}, ε) p (α_{*}) p (ε) d α_{*} d ε

(6)

The mean and variance of $p (β_{*} | α_{*}, ε)$ are calculated by formulas (4) and (5), respectively. Fortunately, when the SE kernel function is used, the mean ${\hat{β}}_{*}$ and the variance $σ_{β}^{2}$ of $p (β_{*})$ can be calculated analytically

\begin{matrix} {\hat{β}}_{*} = E_{ε} [E_{α_{*}} [E_{ψ} [ψ (α_{*}) + ε | α_{*}, ε]]] \\ = \int \int G P_{ρ} (α_{*}, 〈 A, β 〉) N (α_{*} | {\hat{α}}_{*}, P_{*}) N (ε | 0, σ^{2}) d α_{*} d ε \\ = λ^{T} γ \end{matrix}

(7)

where $λ = {[K (A, A) + σ^{2} I]}^{- 1} β$ and $γ = (γ_{1}, γ_{2}, \dots, γ_{N})^{T}$ . $E [\cdot]$ and $E [\cdot | \cdot]$ represent the expectation operation and the conditional expectation operation, respectively.

For $i = 1, 2, \dots, N$ , the element $γ_{i}$ is

\begin{matrix} γ_{i} = \int k (α_{i}, α_{*}) N (α_{*} | {\hat{α}}_{*}, P_{*}) d α_{*} \\ = σ_{ψ}^{2} {| P_{*} W_{ψ}^{- 1} + I |}^{- \frac{1}{2}} \\ \exp (- \frac{1}{2} {(α_{i} - {\hat{α}}_{*})}^{T} {(W_{ψ} + P_{*})}^{- 1} (α_{i} - {\hat{α}}_{*})) \end{matrix}

(8)

The variable $γ_{i}$ is the expectation of $k (α_{i}, α_{*})$ on $α_{*}$ . $σ_{β}^{2}$ is

\begin{matrix} σ_{β}^{2} = \int \int \int {(β_{*} - {\hat{β}}_{*})}^{2} p (β_{*} | α_{*}, ε) p (α_{*}) p (ε) d β_{*} d α_{*} d ε \\ = \int \int [\int {(β_{*} - {\hat{β}}_{*})}^{2} p (β_{*} | α_{*}, ε) d β_{*}] p (α_{*}) p (ε) d α_{*} d ε \\ = \int [G P_{ρ} {(α_{*}, 〈 A, β 〉)}^{2} + G P_{Σ} (α_{*}, 〈 A, β 〉) + σ^{2} - {\hat{β}}_{*}^{2}] p (α_{*}) d α_{*} \\ = λ^{T} Φ λ + σ_{ψ}^{2} - tr ({(K (A, A) + σ^{2} I)}^{- 1} Φ) - {\hat{β}}_{*}^{2} + σ^{2} \end{matrix}

(9)

where $tr (\cdot)$ is the trace operation. The element of the $i th$ row and the $j th$ column in $Φ$ is

\begin{matrix} Φ_{ij} = \frac{k (α_{i}, {\hat{α}}_{*}) k (α_{j}, {\hat{α}}_{*})}{{| 2 P_{*} W_{ψ}^{- 1} + I |}^{\frac{1}{2}}} \\ \times \exp ({({\tilde{α}}_{ij} - {\hat{α}}_{*})}^{T} {(\frac{1}{2} W_{ψ} + P_{*})}^{- 1} P_{*} W_{ψ}^{- 1} ({\tilde{α}}_{ij} - {\hat{α}}_{*})) \end{matrix}

(10)

where ${\tilde{α}}_{ij} = 1 / 2 (α_{i} + α_{j})$ . Thus, $p (β_{*})$ can be approximated by $N (β_{*} | {\hat{β}}_{*}, σ_{β}^{2})$ in the closed form.
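The moment formulas (7)–(10) can be sketched in NumPy as follows. The function names, the toy data, and the hyperparameter values are assumptions for illustration. In the code, $Φ_{ij}$ is scaled by ${| 2 P_{*} W_{ψ}^{- 1} + I |}^{- 1 / 2}$ (that is, the kernel product is divided by the square root of the determinant), which is the scaling for which $Φ_{ij}$ equals the expectation of $k (α_{i}, α_{*}) k (α_{j}, α_{*})$ under $N (α_{*} | {\hat{α}}_{*}, P_{*})$.

```python
import numpy as np

def se_kernel(A1, A2, sigma_psi, w_diag):
    """SE kernel of formula (3); columns of A1 (n, N1) and A2 (n, N2) are inputs."""
    S1 = A1 / np.sqrt(w_diag)[:, None]
    S2 = A2 / np.sqrt(w_diag)[:, None]
    sq = (S1**2).sum(0)[:, None] + (S2**2).sum(0)[None, :] - 2.0 * S1.T @ S2
    return sigma_psi**2 * np.exp(-0.5 * np.maximum(sq, 0.0))

def gp_predict_noisy_input(a_hat, P, A, beta, sigma_psi, w_diag, sigma):
    """Mean (7) and variance (9) of the output for a test input ~ N(a_hat, P)."""
    n, N = A.shape
    W = np.diag(w_diag)
    W_inv = np.diag(1.0 / w_diag)
    K = se_kernel(A, A, sigma_psi, w_diag) + sigma**2 * np.eye(N)
    lam = np.linalg.solve(K, beta)
    # gamma_i of formula (8)
    diff = A - a_hat[:, None]
    quad = np.sum(diff * np.linalg.solve(W + P, diff), axis=0)
    gamma = sigma_psi**2 * np.linalg.det(P @ W_inv + np.eye(n)) ** -0.5 * np.exp(-0.5 * quad)
    mean = lam @ gamma
    # Phi_ij of formula (10)
    k_hat = se_kernel(A, a_hat[:, None], sigma_psi, w_diag)[:, 0]
    M = np.linalg.solve(0.5 * W + P, P @ W_inv)          # (W/2 + P)^{-1} P W^{-1}
    mid = 0.5 * (A[:, :, None] + A[:, None, :]) - a_hat[:, None, None]
    expo = np.einsum('aij,ab,bij->ij', mid, M, mid)
    det10 = np.linalg.det(2.0 * P @ W_inv + np.eye(n))
    Phi = np.outer(k_hat, k_hat) / np.sqrt(det10) * np.exp(expo)
    var = (lam @ Phi @ lam + sigma_psi**2
           - np.trace(np.linalg.solve(K, Phi)) - mean**2 + sigma**2)
    return mean, var

# Toy example (hypothetical data and hyperparameters).
A = np.linspace(-3.0, 3.0, 15)[None, :]
beta = np.sin(A[0])
w = np.array([0.8])
mean, var = gp_predict_noisy_input(np.array([0.3]), np.array([[0.2]]),
                                   A, beta, 1.0, w, 0.1)
```

Setting $P_{*} = 0$ in this sketch recovers the standard prediction of formulas (4) and (5) with the noise variance $σ^{2}$ added, which is a convenient sanity check.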

Optimization algorithm for GPSSM

In this section, the GPSSM optimization algorithm is developed with the EM and the GP-RTSS. The overall flowchart of the GPSSM optimization algorithm is shown in Figure 1. First, the training data are obtained to train an original GPSSM to simulate a nonlinear system in a laboratory. Second, the measurements of the nonlinear dynamic system are obtained in a non-laboratory environment. The EM is used to optimize the GPSSM that was trained on the homogeneous system in the laboratory. Since calculating the CEF in the Expectation step (E-step) requires the smoothed distribution of system states, the GP-RTSS algorithm is introduced into the EM optimization framework. In the Maximization step (M-step), the CEF is maximized to obtain the updated hyperparameters, which adjusts the GPSSM at the same time. The E-step and the M-step are repeated alternately until convergence, at which point the optimized GPSSM and the estimated states are obtained.

Figure 1.

The flowchart of the GPSSM optimization.

GPSSM

This section gives a theoretical formulation of GPSSMs. The concept of GPSSMs is already proposed in Ko and Fox,^4,16 and Deisenroth et al.,¹⁷ but a theoretical formulation of GPSSMs has not been found. Consider the traditional SSM of a discrete-time nonlinear system

\begin{matrix} x_{k + 1} = f (x_{k}) + w_{k} \\ y_{k + 1} = h (x_{k + 1}) + v_{k + 1} \end{matrix}

(11)

where $k \in \{0, 1, 2, \dots\}$ is an index of the time step, $x_{k} \in R^{n}$ is a state vector, $y_{k} \in R^{m}$ is a measurement vector, $w_{k}$ and $v_{k}$ are zero-mean Gaussian white noises, and $n, m \in N$ . The functions $f$ and $h$ are a state transition function and a measurement function, respectively. If the dynamic system is well understood, a traditional modeling method is to construct the parametric expressions of $f$ and $h$ . If not, it is very difficult to establish an appropriate parametric model. However, if GP prior distributions are placed on $f$ and $h$ separately, the GPSSM can be used to simulate the nonlinear SSM described in equation (11).

The operating mechanism of the dynamic system is assumed to be constant, that is the corresponding SSM of the system is unchanged in a time window $[t_{0}, t_{0} + Δ t \times L]$ , where $t_{0}$ is the left edge of the time window, $t_{0} + Δ t \times L$ is the right edge of the time window, L is a sample number, and $Δ t$ is a sampling period. For ease of writing, set $t_{0} = 0 s$ and $Δ t = 1 s$ in this article.
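The indexing convention of equation (11) over a window of $L$ samples can be sketched as a short simulation. The functions `f_demo` and `h_demo`, the noise levels, and the function name `simulate_ssm` are hypothetical placeholders chosen for illustration and are not the model used in this article; the noise is assumed to be independent across dimensions with a common standard deviation.

```python
import numpy as np

def simulate_ssm(f, h, x0, L, w_std, v_std, rng):
    """Simulate equation (11) for L steps.
    Returns X of shape (n, L+1) holding x_0..x_L and Y of shape (m, L) holding y_1..y_L."""
    x0 = np.asarray(x0, dtype=float)
    n = x0.size
    m = np.atleast_1d(h(x0)).size          # measurement dimension inferred from h
    X = np.empty((n, L + 1))
    Y = np.empty((m, L))
    X[:, 0] = x0
    for k in range(L):
        # x_{k+1} = f(x_k) + w_k
        X[:, k + 1] = f(X[:, k]) + w_std * rng.standard_normal(n)
        # y_{k+1} = h(x_{k+1}) + v_{k+1}; column k of Y stores y_{k+1}
        Y[:, k] = h(X[:, k + 1]) + v_std * rng.standard_normal(m)
    return X, Y

def f_demo(x):
    # Hypothetical two-state transition function, used only as a placeholder.
    return np.array([x[0] + 0.1 * x[1], x[1] - 0.1 * np.sin(x[0])])

def h_demo(x):
    # Hypothetical scalar measurement function.
    return np.array([np.sin(x[0])])

X, Y = simulate_ssm(f_demo, h_demo, [0.5, 0.0], L=100,
                    w_std=0.01, v_std=0.05, rng=np.random.default_rng(0))
```

Note that there is no measurement for $x_{0}$: the state sequence has $L + 1$ columns while the measurement sequence has $L$.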

The matrix $X_{0 : L} = [x_{0}, x_{1}, \dots, x_{L}]$ is a sequence of states in the state-space, and the matrix $Y_{1 : L} = [y_{1}, y_{2}, \dots, y_{L}]$ is a sequence of measurements for $X_{1 : L}$ . Because the output of GP is a scalar, $f$ and $h$ need to be decomposed into several one-dimensional functions, namely, $f = (f_{1}, f_{2}, \dots, f_{n})^{T}$ and $h = (h_{1}, h_{2}, \dots, h_{m})^{T}$ . The SSM (equation (11)) can be expressed as follows

\left\{ \begin{matrix} x_{k + 1}^{(1)} = f_{1} (x_{k}) + w_{k}^{(1)} \\ x_{k + 1}^{(2)} = f_{2} (x_{k}) + w_{k}^{(2)} \\ ⋮ \\ x_{k + 1}^{(n)} = f_{n} (x_{k}) + w_{k}^{(n)} \end{matrix} \right. ; \left\{ \begin{matrix} y_{k + 1}^{(1)} = h_{1} (x_{k + 1}) + v_{k + 1}^{(1)} \\ y_{k + 1}^{(2)} = h_{2} (x_{k + 1}) + v_{k + 1}^{(2)} \\ ⋮ \\ y_{k + 1}^{(m)} = h_{m} (x_{k + 1}) + v_{k + 1}^{(m)} \end{matrix} \right.

(12)

The set $D_{f_{i}} = 〈 X_{0 : L - 1}, x_{1 : L}^{(i)} 〉$ is denoted as the training data set of $f_{i}$ , where $X_{0 : L - 1}$ is the training input data, $x_{1 : L}^{(i)} = (x_{1}^{(i)}, x_{2}^{(i)}, \dots, x_{L}^{(i)})^{T}$ is the training output data, $x^{(i)}$ is the $i th$ dimension element of $x$ . Similarly, $D_{h_{i}} = 〈 X_{1 : L}, y_{1 : L}^{(i)} 〉$ is denoted as the training data set of $h_{i}$ , where $y_{1 : L}^{(i)} = (y_{1}^{(i)}, y_{2}^{(i)}, \dots, y_{L}^{(i)})^{T}$ . The prior distribution of the output training data $x_{1 : L}^{(i)}$ and $y_{1 : L}^{(i)}$ can be denoted as two zero-mean multidimensional Gaussian distributions, respectively

p (x_{1 : L}^{(i)}) = N (0, K_{f_{i}} (X_{0 : L - 1}, X_{0 : L - 1}) + σ_{xi}^{2} I)

(13)

p (y_{1 : L}^{(i)}) = N (0, K_{h_{i}} (X_{1 : L}, X_{1 : L}) + σ_{yi}^{2} I)

(14)

where $K_{f_{i}} (X_{0 : L - 1}, X_{0 : L - 1})$ and $K_{h_{i}} (X_{1 : L}, X_{1 : L})$ are the kernel matrices, abbreviated as $K_{f_{i}}$ and $K_{h_{i}}$ . The SE kernel functions of $K_{f_{i}}$ and $K_{h_{i}}$ are, respectively

k_{f_{i}} (x, x^{'}) = σ_{f_{i}}^{2} \exp (- \frac{1}{2} {(x - x^{'})}^{T} W_{f_{i}}^{- 1} (x - x^{'}))

(15)

k_{h_{i}} (x, x^{'}) = σ_{h_{i}}^{2} \exp (- \frac{1}{2} {(x - x^{'})}^{T} W_{h_{i}}^{- 1} (x - x^{'}))

(16)

The above formulas (12)–(16) define a formulated GPSSM, which can simulate the traditional nonlinear SSM described in formula (11).
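The training sets $D_{f_{i}}$ and $D_{h_{i}}$ defined above can be assembled from a state sequence and a measurement sequence with a few array slices. The sketch below is illustrative only; the function name `build_gpssm_datasets` and the array layout (states and measurements stored as columns) are assumptions.

```python
import numpy as np

def build_gpssm_datasets(X, Y):
    """X: (n, L+1) array of states x_0..x_L; Y: (m, L) array of measurements y_1..y_L.
    Returns two lists of (inputs, targets) pairs, one pair per output dimension."""
    n, cols = X.shape
    m, L = Y.shape
    if cols != L + 1:
        raise ValueError("X must contain exactly one more column than Y")
    # D_{f_i} = <X_{0:L-1}, x^{(i)}_{1:L}>: inputs x_0..x_{L-1}, targets are row i shifted by one.
    D_f = [(X[:, :-1], X[i, 1:]) for i in range(n)]
    # D_{h_i} = <X_{1:L}, y^{(i)}_{1:L}>: inputs x_1..x_L, targets are row i of Y.
    D_h = [(X[:, 1:], Y[i, :]) for i in range(m)]
    return D_f, D_h

# Small deterministic example with n = 2, m = 1, L = 4.
X = np.arange(10, dtype=float).reshape(2, 5)
Y = np.array([[10.0, 11.0, 12.0, 13.0]])
D_f, D_h = build_gpssm_datasets(X, Y)
```

Each of the $n + m$ scalar GPs then has its own hyperparameters but shares the same input matrix with the other GPs of its group.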

Because the system state is unknown, the state $x_{k}$ is assumed to obey a prior Gaussian distribution $N (x_{k} | {\hat{x}}_{k}, P_{k})$ , where ${\hat{x}}_{k}$ is the mean vector and $P_{k}$ is the covariance matrix. The state $x_{k}$ is the noise test input of the GPSSM, and $x_{k + 1}$ and $y_{k}$ are the state transition output and the measurement output of the GPSSM, respectively. The prediction distributions of $x_{k + 1}$ and $y_{k}$ are assumed to be the Gaussian distributions $N (x_{k + 1} | {\bar{x}}_{k + 1}, Q_{k + 1})$ and $N (y_{k} | {\bar{y}}_{k}, R_{k})$ , respectively, where ${\bar{x}}_{k + 1}$ and ${\bar{y}}_{k}$ are mean vectors, and $Q_{k + 1}$ and $R_{k}$ are covariance matrices. Each element of ${\bar{x}}_{k + 1}$ and ${\bar{y}}_{k}$ can be calculated by formulas (6)–(8) in closed form, and $Q_{k + 1}^{(i, j)} (i, j = 1, 2, \dots, n)$ , which is the element of the $i th$ row and the $j th$ column in $Q_{k + 1}$ , can be obtained by

\begin{matrix} Q_{k + 1}^{(i, j)} = \int (x_{k + 1}^{(i)} - {\bar{x}}_{k + 1}^{(i)}) (x_{k + 1}^{(j)} - {\bar{x}}_{k + 1}^{(j)}) \\ p (x_{k + 1}^{(i)} | x_{k}, w_{k}^{(i)}) \\ \cdot p (x_{k + 1}^{(j)} | x_{k}, w_{k}^{(j)}) p (x_{k}) p (w_{k}^{(i)}) p (w_{k}^{(j)}) \\ d x_{k + 1}^{(i)} d x_{k + 1}^{(j)} d x_{k} d w_{k}^{(i)} d w_{k}^{(j)} \end{matrix}

(17)

If $i = j$ , then $Q_{k + 1}^{(i, j)}$ is the diagonal element of $Q_{k + 1}$ , namely, the variance of $x_{k + 1}^{(i)}$ calculated by formula (9). If $i \neq j$ , it is calculated by

\begin{matrix} Q_{k + 1}^{(i, j)} = \int (f_{i} (x_{k}) + w_{k}^{(i)} - {\bar{x}}_{k + 1}^{(i)}) (f_{j} (x_{k}) + w_{k}^{(j)} - {\bar{x}}_{k + 1}^{(j)}) \\ p (f_{i} (x_{k}) + w_{k}^{(i)} | x_{k}, w_{k}^{(i)}) \\ \cdot p (f_{j} (x_{k}) + w_{k}^{(j)} | x_{k}, w_{k}^{(j)}) p (x_{k}) p (w_{k}^{(i)}) p (w_{k}^{(j)}) d x_{k + 1}^{(i)} d x_{k + 1}^{(j)} \\ d x_{k} d w_{k}^{(i)} d w_{k}^{(j)} \\ = \int G P_{ρ} (x_{k}, D_{f_{i}}) G P_{ρ} (x_{k}, D_{f_{j}}) p (x_{k}) d x_{k} - {\bar{x}}_{k + 1}^{(i)} {\bar{x}}_{k + 1}^{(j)} \\ = \int λ_{f_{i}}^{T} k_{f_{i}} (x_{k}, X_{0 : L - 1}) {k_{f_{j}} (x_{k}, X_{0 : L - 1})}^{T} λ_{f_{j}} p (x_{k}) d x_{k} - {\bar{x}}_{k + 1}^{(i)} {\bar{x}}_{k + 1}^{(j)} \\ = λ_{f_{i}}^{T} [\int k_{f_{i}} (x_{k}, X_{0 : L - 1}) {k_{f_{j}} (x_{k}, X_{0 : L - 1})}^{T} p (x_{k}) d x_{k}] λ_{f_{j}} - {\bar{x}}_{k + 1}^{(i)} {\bar{x}}_{k + 1}^{(j)} \\ = λ_{f_{i}}^{T} \tilde{Φ} λ_{f_{j}} - {\bar{x}}_{k + 1}^{(i)} {\bar{x}}_{k + 1}^{(j)} \end{matrix}

(18)

\begin{matrix} {\tilde{Φ}}^{(p, q)} = σ_{f_{i}}^{2} σ_{f_{j}}^{2} {| (W_{f_{i}}^{- 1} + W_{f_{j}}^{- 1}) P_{k} + I |}^{- \frac{1}{2}} \\ \cdot \exp (- \frac{1}{2} {(x_{p} - x_{q})}^{T} {(W_{f_{i}} + W_{f_{j}})}^{- 1} (x_{p} - x_{q})) \\ \exp (- \frac{1}{2} {(z_{pq} - {\hat{x}}_{k})}^{T} Λ^{- 1} (z_{pq} - {\hat{x}}_{k})) \end{matrix}

(19)

where $λ_{f_{i}} = {(K_{f_{i}} + σ_{x i}^{2} I)}^{- 1} x_{1 : L}^{(i)}$ , $Λ = {(W_{f_{i}}^{- 1} + W_{f_{j}}^{- 1})}^{- 1} + P_{k}$ , $z_{pq} = W_{f_{j}} (W_{f_{i}} + W_{f_{j}})^{- 1} x_{p} + W_{f_{i}} (W_{f_{i}} + W_{f_{j}})^{- 1} x_{q}$ , and $p, q = 0, 1, \dots, L - 1$ index the training inputs $x_{p}$ and $x_{q}$ in $X_{0 : L - 1}$ . Thus, the mean vector and covariance matrix of the predicted distributions of $x_{k + 1}$ and $y_{k}$ can be calculated analytically. In addition, the conditional prediction distributions of $x_{k + 1}$ and $y_{k}$ given a noise-free test input $x_{k}$ are calculated, respectively, by

\begin{matrix} p (x_{k + 1} | x_{k}) = Π_{i = 1}^{n} \\ N (x_{k + 1}^{(i)} | G P_{ρ} (x_{k}, D_{f_{i}}), G P_{Σ} (x_{k}, D_{f_{i}}) + σ_{xi}^{2}) \end{matrix}

(20)

p (y_{k} | x_{k}) = Π_{i = 1}^{m} N (y_{k}^{(i)} | G P_{ρ} (x_{k}, D_{h_{i}}), G P_{Σ} (x_{k}, D_{h_{i}}) + σ_{yi}^{2})

(21)
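Formula (19) can be sanity-checked in the scalar case ($n = 1$), where ${\tilde{Φ}}^{(p, q)}$ is the expectation of the product $k_{f_{i}} (x_{k}, x_{p}) k_{f_{j}} (x_{k}, x_{q})$ under $N (x_{k} | {\hat{x}}_{k}, P_{k})$. The sketch below compares the closed form with a plain grid quadrature; the function names and all numerical values are arbitrary assumptions chosen for illustration.

```python
import numpy as np

def phi_tilde_1d(x_p, x_q, s_i, w_i, s_j, w_j, x_hat, P):
    """Closed form of formula (19) for scalar states (W_{f_i} = w_i, W_{f_j} = w_j)."""
    w_inv_sum = 1.0 / w_i + 1.0 / w_j
    Lam = 1.0 / w_inv_sum + P                       # Lambda of formula (19)
    z = (w_j * x_p + w_i * x_q) / (w_i + w_j)       # z_pq of formula (19)
    return (s_i**2 * s_j**2 * (w_inv_sum * P + 1.0) ** -0.5
            * np.exp(-0.5 * (x_p - x_q) ** 2 / (w_i + w_j))
            * np.exp(-0.5 * (z - x_hat) ** 2 / Lam))

def phi_tilde_quadrature(x_p, x_q, s_i, w_i, s_j, w_j, x_hat, P, num=20001):
    """Same expectation evaluated by a Riemann sum on a wide uniform grid."""
    half_width = 10.0 * np.sqrt(P)
    grid = np.linspace(x_hat - half_width, x_hat + half_width, num)
    dx = grid[1] - grid[0]
    k_i = s_i**2 * np.exp(-0.5 * (grid - x_p) ** 2 / w_i)
    k_j = s_j**2 * np.exp(-0.5 * (grid - x_q) ** 2 / w_j)
    pdf = np.exp(-0.5 * (grid - x_hat) ** 2 / P) / np.sqrt(2.0 * np.pi * P)
    return np.sum(k_i * k_j * pdf) * dx

params = dict(x_p=0.4, x_q=-0.7, s_i=1.2, w_i=0.5, s_j=0.8, w_j=1.5, x_hat=0.1, P=0.3)
closed_form = phi_tilde_1d(**params)
numerical = phi_tilde_quadrature(**params)
```

The two values should agree to high precision; a mismatch in such a check usually points to a sign or index slip when the formula is transcribed into code.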

Optimization algorithm framework

A GPSSM is theoretically expressed in the previous sections. In order to establish an instantiated GPSSM for a specific nonlinear system, the training data set of the state transition equation and the measurement equation should be given, that is, true states $X_{0 : L}$ and corresponding measurements $Y_{1 : L}$ need to be known. In a non-laboratory environment, however, $X_{0 : L}$ is often unknown and needs to be estimated.

An accepted method is to build a state detection system equipped with multiple types of sensors in a laboratory, by which true states can be measured to train a GPSSM. The GPSSM obtained in the laboratory is denoted as the GPSSM* in this article. The GPSSM* can achieve a satisfactory state estimation accuracy in the laboratory, but it suffers from an overfitting problem of state estimation in a non-laboratory environment. This is because the two environments differ, so the homogeneous nonlinear system is affected by several factors; the main differences are system noise, sensor noise, and system parameters. Consequently, using the GPSSM* in a non-laboratory environment introduces a model bias. The lack of robustness is a drawback of the GPSSM*, but the GPSSM* still captures the basic mechanism of the homogeneous nonlinear system in the non-laboratory environment.

In order to solve this problem, the core idea of this article is to optimize the hyperparameters of the GPSSM* to obtain an approximate optimal model for a nonlinear system in a non-laboratory environment. The state and measurement sequences obtained in the laboratory are denoted as $X_{0 : L}^{*}$ and $Y_{1 : L}^{*}$ , respectively, and the corresponding training data set is denoted as $D^{*}$ . The hyperparameters $\{σ_{f_{i}}, W_{f_{i}}, σ_{xi}\}_{i = 1}^{n}$ and $\{σ_{h_{i}}, W_{h_{i}}, σ_{yi}\}_{i = 1}^{m}$ of the GPSSM are collectively referred to as $θ$ . Given $D^{*}$ , the MLE of $θ$ , denoted as $θ^{*}$ , is acquired by maximizing the measurement likelihood function.¹⁸ If $D^{*}$ and $θ^{*}$ are given, the GPSSM* is determined and denoted as $G (θ^{*}, D^{*})$ . A general GPSSM is denoted as $G (θ, D)$ . A measurement sequence of a nonlinear system $Y_{1 : L}$ is obtained in a non-laboratory environment. Based on the MLE criterion, there is an optimal GPSSM $G (\overset{⌢}{θ}, \overset{⌢}{D})$ that maximizes the log-likelihood function

G (\overset{⌢}{θ}, \overset{⌢}{D}) = \underset{G (θ, D)}{\arg \max} \log p (Y_{1 : L} | G (θ, D))

(22)

Because $\overset{⌢}{D}$ is unknown, $G (\overset{⌢}{θ}, \overset{⌢}{D})$ cannot be obtained. However, an approximate optimal GPSSM $G (\hat{θ}, D^{*})$ can be obtained by correcting $θ^{*}$ of $G (θ^{*}, D^{*})$

\hat{θ} = \underset{θ}{\arg \max} \log p (Y_{1 : L} | G (θ, D^{*}))

(23)

Given $D^{*}$ , $G (θ, D^{*})$ can be regarded as a special parametric SSM. Formula (23) expresses a hyperparameter optimization problem by the MLE criterion. As the measurement log-likelihood function in formula (23) is difficult to express analytically, the numerical search algorithm cannot be adopted to maximize the function directly.

The unknown system state sequence is denoted as $X_{0 : L}$ . By the Bayes theorem, the log-likelihood function $\log p (Y_{1 : L} | G (θ, D^{*}))$ can be decomposed into

\begin{matrix} \log p (Y_{1 : L} | G (θ, D^{*})) = \log p (X_{0 : L}, Y_{1 : L} | G (θ, D^{*})) \\ - \log p (X_{0 : L} | Y_{1 : L}, G (θ, D^{*})) \end{matrix}

(24)

The hyperparameter $θ$ at the $μ th$ EM iteration is denoted as $θ_{μ}$ . Taking the expectation of both sides of formula (24) with respect to $p (X_{0 : L} | Y_{1 : L}, G (θ_{μ}, D^{*}))$ gives

\begin{matrix} \log p (Y_{1 : L} | G (θ, D^{*})) = \int p (X_{0 : L} | Y_{1 : L}, G (θ_{μ}, D^{*})) \\ \log p (X_{0 : L}, Y_{1 : L} | G (θ, D^{*})) d X_{0 : L} \\ - \int p (X_{0 : L} | Y_{1 : L}, G (θ_{μ}, D^{*})) \log p (X_{0 : L} | Y_{1 : L}, G (θ, D^{*})) \\ d X_{0 : L} \end{matrix}

(25)

Let

\begin{matrix} l (θ, θ_{μ}) = \int p (X_{0 : L} | Y_{1 : L}, G (θ_{μ}, D^{*})) \\ \log p (X_{0 : L}, Y_{1 : L} | G (θ, D^{*})) d X_{0 : L} \end{matrix}

(26)

Then

\begin{matrix} \log p (Y_{1 : L} | G (θ, D^{*})) - \log p (Y_{1 : L} | G (θ_{μ}, D^{*})) \\ = l (θ, θ_{μ}) - l (θ_{μ}, θ_{μ}) + \int p (X_{0 : L} | Y_{1 : L}, G (θ_{μ}, D^{*})) \\ \log \frac{p (X_{0 : L} | Y_{1 : L}, G (θ_{μ}, D^{*}))}{p (X_{0 : L} | Y_{1 : L}, G (θ, D^{*}))} d X_{0 : L} \end{matrix}

(27)

where $l (θ, θ_{μ})$ is a CEF. The last term on the right-hand side of equation (27) is the Kullback–Leibler divergence of $p (X_{0 : L} | Y_{1 : L}, G (θ, D^{*}))$ from $p (X_{0 : L} | Y_{1 : L}, G (θ_{μ}, D^{*}))$ , which is always non-negative.
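As a worked step, equation (27) follows from evaluating equation (25) once at a general $θ$ and once at $θ = θ_{μ}$ and subtracting. The shorthand $q_{θ} (X)$ below is introduced here only for brevity and is not part of the article's notation.

```latex
% Shorthand (assumption, for brevity): q_{θ}(X) = p(X_{0:L} | Y_{1:L}, G(θ, D^{*}))
\begin{aligned}
\log p (Y_{1:L} | G (θ, D^{*})) &= l (θ, θ_{μ}) - \int q_{θ_{μ}} (X) \log q_{θ} (X) \, d X
  && \text{equation (25)} \\
\log p (Y_{1:L} | G (θ_{μ}, D^{*})) &= l (θ_{μ}, θ_{μ}) - \int q_{θ_{μ}} (X) \log q_{θ_{μ}} (X) \, d X
  && \text{equation (25) at } θ = θ_{μ} \\
\log p (Y_{1:L} | G (θ, D^{*})) - \log p (Y_{1:L} | G (θ_{μ}, D^{*}))
  &= l (θ, θ_{μ}) - l (θ_{μ}, θ_{μ}) + \int q_{θ_{μ}} (X) \log \frac{q_{θ_{μ}} (X)}{q_{θ} (X)} \, d X
  && \text{equation (27)}
\end{aligned}

% Non-negativity of the last term by Jensen's inequality (\log is concave):
\int q_{θ_{μ}} (X) \log \frac{q_{θ_{μ}} (X)}{q_{θ} (X)} \, d X
  = - \int q_{θ_{μ}} (X) \log \frac{q_{θ} (X)}{q_{θ_{μ}} (X)} \, d X
  \geq - \log \int q_{θ} (X) \, d X = 0
```

Hence any $θ$ that increases the CEF relative to $l (θ_{μ}, θ_{μ})$ increases the measurement log-likelihood by at least the same amount.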

For $\forall θ_{μ + 1}$ , if $l (θ_{μ + 1}, θ_{μ}) > l (θ_{μ}, θ_{μ})$ , then $p (Y_{1 : L} | G (θ_{μ + 1}, D^{*})) > p (Y_{1 : L} | G (θ_{μ}, D^{*}))$ . Given $G (θ, D^{*})$ and $Y_{1 : L}$ , an EM-based GPSSM optimization framework is proposed as Framework 1.

Framework 1 (EM-based GPSSM optimization framework)

Step 1. Initialize $θ_{μ} = θ^{*}$ and $μ = 0$ ;

Step 2. Calculate $l (θ, θ_{μ})$ in the E-step;

Step 3. Compute $θ_{μ + 1} = \arg \max_{θ} l (θ, θ_{μ})$ in the M-step;

Step 4. Check the convergence. If the iteration has not converged, set $μ = μ + 1$ and return to step 2; otherwise, end.
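The loop structure of Framework 1 can be sketched on a deliberately simple toy problem in which the E-step and M-step are available in closed form. The toy model ($x_{i} \sim N (m, 1)$, $y_{i} = x_{i} + v_{i}$, $v_{i} \sim N (0, 1)$, unknown scalar $m$) and the names `run_em`, `e_step`, `m_step`, and `loglik` are assumptions for illustration; this is not the GPSSM E-step or M-step, only the same alternation with the latent state treated as missing data.

```python
import numpy as np

def run_em(theta0, e_step, m_step, loglik, tol=1e-10, max_iter=200):
    """Generic EM loop mirroring Steps 1-4 of Framework 1."""
    theta = theta0                                  # Step 1: initialise
    history = [loglik(theta)]
    for _ in range(max_iter):
        stats = e_step(theta)                       # Step 2: expectations under theta_mu
        theta = m_step(stats)                       # Step 3: maximise the CEF
        history.append(loglik(theta))
        if history[-1] - history[-2] < tol:         # Step 4: convergence check
            break
    return theta, history

# Toy data: marginally y_i ~ N(m, 2) with true m = 2 (hypothetical example).
rng = np.random.default_rng(1)
y = 2.0 + np.sqrt(2.0) * rng.standard_normal(50)

def e_step(m):
    # Posterior of the latent x_i given y_i is N((y_i + m) / 2, 1 / 2);
    # the M-step below only needs the posterior means.
    return 0.5 * (y + m)

def m_step(posterior_means):
    # Maximising sum_i E[log N(x_i | m, 1)] over m gives the average posterior mean.
    return posterior_means.mean()

def loglik(m):
    # Measurement log-likelihood of the toy model.
    return np.sum(-0.5 * np.log(2.0 * np.pi * 2.0) - 0.25 * (y - m) ** 2)

m_hat, history = run_em(0.0, e_step, m_step, loglik)
```

In this toy case the iteration halves the distance to the sample mean of `y` at every pass, and the recorded log-likelihood values never decrease, which is the monotonicity property implied by equation (27).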

In the next section, the GPSSM optimization algorithm is developed based on Framework 1.

Expectation step

The first problem in Framework 1 is the specific expression of $l (θ, θ_{μ})$ . For ease of writing, $p (\cdot | G (θ, D^{*}))$ is recorded as $p_{θ} (\cdot)$ . According to the Bayesian conditional probability theorem and the Markov property, $l (θ, θ_{μ})$ can be decomposed as follows

l (θ, θ_{μ}) = l_{0} + l_{1} + l_{2}

(28)

l_{0} = \int p_{θ_{μ}} (x_{0} | Y_{1 : L}) \log p_{θ} (x_{0}) d x_{0}

(29)

l_{1} = \sum_{i = 0}^{L - 1} \int \int p_{θ_{μ}} (x_{i + 1}, x_{i} | Y_{1 : L}) \log p_{θ} (x_{i + 1} | x_{i}) d x_{i} d x_{i + 1}

(30)

l_{2} = \sum_{i = 1}^{L} \int p_{θ_{μ}} (x_{i} | Y_{1 : L}) \log p_{θ} (y_{i} | x_{i}) d x_{i}

(31)

where $p_{θ} (x_{0})$ is the probability distribution of the initial state $x_{0}$ , which is assumed to be Gaussian with mean vector ${\hat{x}}_{0}$ and covariance matrix $P_{0}$ . Both are unknown parameters that are included in $θ$ .

From formulas (29)–(31), calculating $l (θ, θ_{μ})$ requires the smoothed distribution $p_{θ_{μ}} (x_{i} | Y_{1 : L})$ and the joint smoothed distribution $p_{θ_{μ}} (x_{i + 1}, x_{i} | Y_{1 : L})$ . Thus, while the EM iteratively optimizes $θ$ , it also estimates the system states, and a smoothing algorithm is needed to obtain the smoothed distributions of $X_{0 : L}$ .

Filtering algorithm

For a nonlinear system state estimation problem, combining GPSSMs and Bayesian filtering algorithms, a variety of state estimation algorithms are proposed. Ko and Fox¹⁶ present a unified framework including the extended Kalman filter (EKF), the unscented Kalman filter (UKF), and the particle filter (PF). And, Deisenroth et al.¹⁷ propose the Gauss process–assumed density filter (GP-ADF) to acquire the state estimation mean and covariance matrix accurately.

Combining GP-BayesFilters with the Rauch–Tung–Striebel (RTS) smoother, the corresponding RTS smoothing algorithms can be obtained directly. These algorithms use the GP prediction for a noise-free test input, and the predicted covariance matrix is only a diagonal matrix, which cannot describe the correlation among the output dimensions. Deisenroth et al.¹⁹ propose an analytical RTS smoother called the GP-RTSS algorithm, which is able to predict based on the noise test input and is shown to be more robust. The GP-RTSS algorithm is obtained by combining the RTSS algorithm with the GP-ADF algorithm.

The filtered distribution of the system state $x_{k}$ obeys the Gaussian distribution approximately, that is

p (x_{k} | Y_{1 : k}) \approx N (x_{k} | {\hat{x}}_{k}, P_{k})

(32)

{\hat{x}}_{k} = {\bar{x}}_{k} + P_{k}^{xy} R_{k}^{- 1} (y_{k} - {\bar{y}}_{k})

(33)

P_{k} = Q_{k} - P_{k}^{xy} R_{k}^{- 1} {(P_{k}^{xy})}^{T}

(34)

where ${\hat{x}}_{k}$ is a filtered estimation mean vector, $P_{k}$ is a filtered estimation covariance matrix, ${\bar{x}}_{k}$ is a state prediction mean vector, $Q_{k}$ is a prediction state covariance matrix, ${\bar{y}}_{k}$ is a prediction measurement mean vector, $R_{k}$ is a prediction measurement covariance matrix, and $P_{k}^{xy}$ is a cross-covariance matrix for the prediction state and the prediction measurement.
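The measurement update of formulas (33) and (34) can be written as a small NumPy function once the predicted moments are available. The function name `measurement_update` and the scalar test values are assumptions for illustration; how ${\bar{x}}_{k}$, $Q_{k}$, ${\bar{y}}_{k}$, $R_{k}$, and $P_{k}^{xy}$ are computed from the GPSSM is outside this sketch.

```python
import numpy as np

def measurement_update(x_bar, Q, y_bar, R, P_xy, y):
    """Filtered mean (33) and covariance (34) from predicted moments.
    x_bar: (n,), Q: (n, n), y_bar: (m,), R: (m, m), P_xy: (n, m), y: (m,)."""
    # gain = P_xy R^{-1}, obtained from a linear solve instead of an explicit inverse.
    gain = np.linalg.solve(R.T, P_xy.T).T
    x_hat = x_bar + gain @ (y - y_bar)      # formula (33)
    P = Q - gain @ P_xy.T                   # formula (34)
    return x_hat, P

# Scalar linear-Gaussian check: x ~ N(0, 1), y = x + v with v ~ N(0, 1),
# so R = 2 and P_xy = 1; observing y = 2 should give mean 1 and variance 0.5.
x_hat, P = measurement_update(np.array([0.0]), np.array([[1.0]]),
                              np.array([0.0]), np.array([[2.0]]),
                              np.array([[1.0]]), np.array([2.0]))
```

In the linear-Gaussian case this update coincides with the Kalman filter update, which makes it a convenient reference when testing an implementation.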

The filtered distribution of $x_{k - 1}$ at the $(k - 1) th$ time step is taken as a prior Gaussian distribution, $N (x_{k - 1} | {\hat{x}}_{k - 1}, P_{k - 1})$ . Feeding $x_{k - 1}$ as a noise input into the state transition equation of the GPSSM gives the prediction distribution of $x_{k}$ as $N (x_{k} | {\bar{x}}_{k}, Q_{k})$ . Taking $N (x_{k} | {\bar{x}}_{k}, Q_{k})$ as the prior Gaussian distribution of $x_{k}$ , the prediction distribution of $y_{k}$ , denoted as $N (y_{k} | {\bar{y}}_{k}, R_{k})$ , is obtained by inputting $x_{k}$ into the measurement equation of the GPSSM. $P_{k}^{xy}$ in formulas (33) and (34) is derived as follows

\begin{matrix} P_{k}^{xy} = \int \int (x_{k} - {\bar{x}}_{k}) {(y_{k} - {\bar{y}}_{k})}^{T} p (x_{k}, y_{k} | Y_{1 : k - 1}) d x_{k} d y_{k} \\ = \int \int x_{k} y_{k}^{T} p (x_{k}, y_{k} | Y_{1 : k - 1}) d x_{k} d y_{k} - {\bar{x}}_{k} {\bar{y}}_{k}^{T} \\ = \int \int x_{k} y_{k}^{T} p (y_{k} | x_{k}) p (x_{k} | Y_{1 : k - 1}) d x_{k} d y_{k} - {\bar{x}}_{k} {\bar{y}}_{k}^{T} \\ =^{(4)} \int x_{k} [\int y_{k}^{T} Π_{i = 1}^{m} N (y_{k}^{(i)} | G P_{ρ} (x_{k}, D_{h_{i}}^{*}), G P_{Σ} (x_{k}, D_{h_{i}}^{*}) + σ_{yi}^{2}) d y_{k}] \\ N (x_{k} | {\bar{x}}_{k}, Q_{k}) d x_{k} - {\bar{x}}_{k} {\bar{y}}_{k}^{T} \\ = \int x_{k} (G P_{ρ} (x_{k}, D_{h_{1}}^{*}), G P_{ρ} (x_{k}, D_{h_{2}}^{*}), \dots, G P_{ρ} (x_{k}, D_{h_{m}}^{*})) \\ N (x_{k} | {\bar{x}}_{k}, Q_{k}) d x_{k} - {\bar{x}}_{k} {\bar{y}}_{k}^{T} \\ = [\int x_{k} G P_{ρ} (x_{k}, D_{h_{1}}^{*}) N (x_{k} | {\bar{x}}_{k}, Q_{k}) d x_{k}, \dots, \int x_{k} G P_{ρ} (x_{k}, D_{h_{m}}^{*}) \\ N (x_{k} | {\bar{x}}_{k}, Q_{k}) d x_{k}] - {\bar{x}}_{k} {\bar{y}}_{k}^{T} \end{matrix}

(35)

In step (4) of formula (35), $p (y_{k} | x_{k})$ is given by formula (21) and $p (x_{k} | Y_{1 : k - 1})$ is the one-step state prediction distribution, that is, $N (x_{k} | {\bar{x}}_{k}, Q_{k})$. The key to calculating $P_{k}^{xy}$ is the following integral, for $i = 1, 2, \dots, m$

\begin{matrix} \int x_{k} G P_{ρ} (x_{k}, D_{h_{i}}^{*}) N (x_{k} | {\bar{x}}_{k}, Q_{k}) d x_{k} \\ =^{(1)} \int x_{k} k_{h_{i}}^{T} {[K_{h_{i}} (X_{1 : L}^{*}, X_{1 : L}^{*}) + σ_{yi}^{2} I]}^{- 1} \\ y_{1 : L}^{* (i)} N (x_{k} | {\bar{x}}_{k}, Q_{k}) d x_{k} \\ =^{(2)} \int x_{k} k_{h_{i}}^{T} λ (D_{h_{i}}^{*}) N (x_{k} | {\bar{x}}_{k}, Q_{k}) d x_{k} \\ =^{(3)} \sum_{j = 1}^{L} λ^{(j)} (D_{h_{i}}^{*}) \int x_{k} k_{h_{i}} (x_{k}, x_{j}^{*}) N (x_{k} | {\bar{x}}_{k}, Q_{k}) d x_{k} \\ =^{(4)} \sum_{j = 1}^{L} c_{ij} λ^{(j)} (D_{h_{i}}^{*}) {(W_{h_{i}} + Q_{k})}^{- 1} (W_{h_{i}} {\bar{x}}_{k} + Q_{k} x_{j}^{*}) \end{matrix}

(36)

In step (1) of formula (36), $G P_{ρ} (x_{k}, D_{h_{i}}^{*})$ is obtained by replacing $α_{*}$ and $〈 A, β 〉$ with $x_{k}$ and $D_{h_{i}}^{*} = 〈 X_{1 : L}^{*}, y_{1 : L}^{* (i)} 〉$ in formula (4), respectively. In step (2) of formula (36), let $λ (D_{h_{i}}^{*}) = {[K_{h_{i}} (X_{1 : L}^{*}, X_{1 : L}^{*}) + σ_{y i}^{2} I]}^{- 1} y_{1 : L}^{* (i)}$. In step (3), let $k_{h_{i}}^{T} = (k_{h_{i}} (x_{k}, x_{1}^{*}), k_{h_{i}} (x_{k}, x_{2}^{*}), \dots, k_{h_{i}} (x_{k}, x_{L}^{*}))$, where $k_{h_{i}} (x_{k}, x_{j}^{*})$ is described by formula (16). The passage from step (3) to step (4) in formula (36) is derived as follows

\begin{matrix} \int x_{k} k_{h_{i}} (x_{k}, x_{j}^{*}) N (x_{k} | {\bar{x}}_{k}, Q_{k}) d x_{k} \\ = \int x_{k} σ_{h_{i}}^{2} \exp (- \frac{1}{2} {(x_{k} - x_{j}^{*})}^{T} W_{h_{i}}^{- 1} (x_{k} - x_{j}^{*})) \frac{1}{{(2 π)}^{\frac{n}{2}} {| Q_{k} |}^{\frac{1}{2}}} \exp (- \frac{1}{2} {(x_{k} - {\bar{x}}_{k})}^{T} Q_{k}^{- 1} (x_{k} - {\bar{x}}_{k})) d x_{k} \\ = \frac{σ_{h_{i}}^{2}}{{(2 π)}^{\frac{n}{2}} {| Q_{k} |}^{\frac{1}{2}}} \int x_{k} \exp (- \frac{1}{2} {(x_{k} - x_{j}^{*})}^{T} W_{h_{i}}^{- 1} (x_{k} - x_{j}^{*}) - \frac{1}{2} {(x_{k} - {\bar{x}}_{k})}^{T} Q_{k}^{- 1} (x_{k} - {\bar{x}}_{k})) d x_{k} \\ = \frac{σ_{h_{i}}^{2}}{{(2 π)}^{\frac{n}{2}} {| Q_{k} |}^{\frac{1}{2}}} \exp (- \frac{1}{2} {({\bar{x}}_{k} - x_{j}^{*})}^{T} {(W_{h_{i}} + Q_{k})}^{- 1} ({\bar{x}}_{k} - x_{j}^{*})) \\ \cdot \int x_{k} \exp (\begin{matrix} - \frac{1}{2} {(x_{k} - {(W_{h_{i}} + Q_{k})}^{- 1} (W_{h_{i}} {\bar{x}}_{k} + Q_{k} x_{j}^{*}))}^{T} \cdot {(W_{h_{i}} {(W_{h_{i}} + Q_{k})}^{- 1} Q_{k})}^{- 1} \\ \cdot (x_{k} - {(W_{h_{i}} + Q_{k})}^{- 1} (W_{h_{i}} {\bar{x}}_{k} + Q_{k} x_{j}^{*})) \end{matrix}) d x_{k} \\ =^{(4)} c_{ij} \int x_{k} \frac{1}{{(2 π)}^{\frac{n}{2}} {| W_{h_{i}} {(W_{h_{i}} + Q_{k})}^{- 1} Q_{k} |}^{\frac{1}{2}}} \\ \cdot \exp (\begin{matrix} - \frac{1}{2} {(x_{k} - {(W_{h_{i}} + Q_{k})}^{- 1} (W_{h_{i}} {\bar{x}}_{k} + Q_{k} x_{j}^{*}))}^{T} \cdot {(W_{h_{i}} {(W_{h_{i}} + Q_{k})}^{- 1} Q_{k})}^{- 1} \\ \cdot (x_{k} - {(W_{h_{i}} + Q_{k})}^{- 1} (W_{h_{i}} {\bar{x}}_{k} + Q_{k} x_{j}^{*})) \end{matrix}) d x_{k} \\ = c_{ij} {(W_{h_{i}} + Q_{k})}^{- 1} (W_{h_{i}} {\bar{x}}_{k} + Q_{k} x_{j}^{*}) \end{matrix}

(37)

\begin{matrix} c_{ij} = \frac{σ_{h_{i}}^{2} {| W_{h_{i}} {(W_{h_{i}} + Q_{k})}^{- 1} Q_{k} |}^{1 / 2}}{{| Q_{k} |}^{1 / 2}} \exp \\ (- \frac{1}{2} {({\bar{x}}_{k} - x_{j}^{*})}^{T} {(W_{h_{i}} + Q_{k})}^{- 1} ({\bar{x}}_{k} - x_{j}^{*})) \end{matrix}

(38)
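As a sanity check on formulas (37) and (38), the scalar case can be integrated numerically and compared with the closed form. The sketch below is illustrative only: the function names and the parameter values are assumptions, not taken from the article, and the state is restricted to one dimension so that $W_{h_{i}}$ and $Q_{k}$ reduce to the positive scalars `w` and `q`.

```python
import math

def kernel_gaussian_moment(sigma_h, w, q, x_bar, x_star):
    """Closed form of formulas (37)-(38) for a scalar state.

    Returns the integral of x * k(x, x_star) * N(x | x_bar, q) over x, with
    k(x, x_star) = sigma_h**2 * exp(-(x - x_star)**2 / (2 * w)).
    """
    # Formula (38): |W (W + Q)^{-1} Q|^{1/2} / |Q|^{1/2} reduces to sqrt(w / (w + q)).
    c = sigma_h**2 * math.sqrt(w / (w + q)) * math.exp(-0.5 * (x_bar - x_star)**2 / (w + q))
    # Mean of the product Gaussian, (W + Q)^{-1} (W x_bar + Q x_star).
    mean = (w * x_bar + q * x_star) / (w + q)
    return c * mean

def numeric_moment(sigma_h, w, q, x_bar, x_star, half_width=12.0, steps=20000):
    """Midpoint-rule evaluation of the same integral (brute-force reference)."""
    sd = math.sqrt(q)
    lo = x_bar - half_width * sd
    dx = 2 * half_width * sd / steps
    total = 0.0
    for s in range(steps):
        x = lo + (s + 0.5) * dx
        k = sigma_h**2 * math.exp(-0.5 * (x - x_star)**2 / w)
        pdf = math.exp(-0.5 * (x - x_bar)**2 / q) / math.sqrt(2 * math.pi * q)
        total += x * k * pdf * dx
    return total

# Arbitrary illustrative values (hypothetical, not the trained hyperparameters).
closed = kernel_gaussian_moment(1.7, 0.8, 0.3, 0.4, -0.2)
numeric = numeric_moment(1.7, 0.8, 0.3, 0.4, -0.2)
print(closed, numeric)
```

The two printed values should agree to within the quadrature error, which illustrates that $c_{ij}$ collects every factor that does not depend on $x_{k}$.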

In step (4) of formula (37), the integral is essentially the expectation of $x_{k}$ under a Gaussian density, so the result is the mean vector $(W_{h_{i}} + Q_{k})^{- 1} (W_{h_{i}} {\bar{x}}_{k} + Q_{k} x_{j}^{*})$. $c_{ij}$ is a normalization constant. Thus, the GP-ADF can be obtained as Algorithm 1.

Algorithm 1. GP-ADF

Step 1. Initialize $G (θ_{μ}, D^{*})$ and $N (x_{0} | {\hat{x}}_{0}, P_{0})$;
Step 2. Take $N (x_{k - 1} | {\hat{x}}_{k - 1}, P_{k - 1})$ as noise that is input to $G (θ_{μ}, D^{*})$;
Step 3. Calculate $N (x_{k} | {\bar{x}}_{k}, Q_{k})$;
Step 4. Take $N (x_{k} | {\bar{x}}_{k}, Q_{k})$ as noise that is input to $G (θ_{μ}, D^{*})$;
Step 5. Calculate $N (y_{k} | {\bar{y}}_{k}, R_{k})$;
Step 6. Calculate $P_{k}^{xy}$;
Step 7. Get $y_{k}$;
Step 8. Calculate ${\hat{x}}_{k} = {\bar{x}}_{k} + P_{k}^{xy} R_{k}^{- 1} (y_{k} - {\bar{y}}_{k})$;
Step 9. Calculate $P_{k} = Q_{k} - P_{k}^{xy} R_{k}^{- 1} {(P_{k}^{xy})}^{T}$;
Step 10. Let $k = k + 1$ and return to Step 2.
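The measurement update in Steps 8 and 9 of Algorithm 1 can be sketched for a two-dimensional state and a scalar measurement, the dimensions of the pendulum example used later in the article. This is a minimal illustration, not the authors' implementation: the function name `adf_update` and the numerical inputs are assumptions, and the predicted moments are taken as given rather than computed from the GP models.

```python
def adf_update(x_bar, Q, y_bar, R, P_xy, y):
    """Steps 8-9 of Algorithm 1 for a 2-D state and a scalar measurement.

    x_bar: predicted state mean [2]; Q: predicted state covariance [2][2];
    y_bar, R: predicted measurement mean and variance (scalars);
    P_xy: state-measurement cross-covariance [2]; y: received measurement.
    """
    gain = [p / R for p in P_xy]                                   # P^{xy} R^{-1}
    x_hat = [xb + g * (y - y_bar) for xb, g in zip(x_bar, gain)]   # Step 8
    P = [[Q[r][c] - gain[r] * P_xy[c] for c in range(2)]           # Step 9
         for r in range(2)]
    return x_hat, P

# Hypothetical predicted moments, chosen only to exercise the formulas.
x_hat, P = adf_update([0.5, 0.0], [[0.2, 0.05], [0.05, 0.3]], 0.4, 0.5, [0.1, 0.02], 0.9)
print(x_hat, P)
```

Because $R_{k}$ is a scalar here, the matrix inverse in Steps 8 and 9 reduces to a division; a general implementation would solve a linear system instead.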

Smoothing algorithm

Algorithm 1 accurately calculates the mean vector and the covariance matrix of the filtered estimation. Combining the GP-ADF with the RTS smoother, the distribution of the smoothed estimation can be calculated as follows

p (x_{k - 1} | Y_{1 : L}) \approx N (x_{k - 1} | {\tilde{x}}_{k - 1}, {\tilde{P}}_{k - 1})

(39)

{\tilde{x}}_{k - 1} = {\hat{x}}_{k - 1} + P_{k - 1, k}^{xx} Q_{k}^{- 1} ({\tilde{x}}_{k} - {\bar{x}}_{k})

(40)

{\tilde{P}}_{k - 1} = P_{k - 1} + P_{k - 1, k}^{xx} Q_{k}^{- 1} {({\tilde{P}}_{k} - Q_{k})}^{T} {(P_{k - 1, k}^{xx} Q_{k}^{- 1})}^{T}

(41)

After executing Algorithm 1, the only unknown quantity is the matrix $P_{k - 1, k}^{xx}$ in the above formulas (39)–(41), which is calculated as follows

\begin{matrix} P_{k - 1, k}^{xx} = \int \int (x_{k - 1} - {\hat{x}}_{k - 1}) {(x_{k} - {\bar{x}}_{k})}^{T} p (x_{k - 1}, x_{k} | Y_{1 : k - 1}) d x_{k - 1} d x_{k} \\ = \int \int x_{k - 1} x_{k}^{T} p (x_{k - 1}, x_{k} | Y_{1 : k - 1}) d x_{k - 1} d x_{k} - {\hat{x}}_{k - 1} {\bar{x}}_{k}^{T} \\ = \int x_{k - 1} [\int x_{k}^{T} Π_{i = 1}^{n} N (x_{k}^{(i)} | G P_{ρ} (x_{k - 1}, D_{f_{i}}^{*}), G P_{Σ} (x_{k - 1}, D_{f_{i}}^{*}) + σ_{xi}^{2}) d x_{k}] \\ \cdot N (x_{k - 1} | {\hat{x}}_{k - 1}, P_{k - 1}) d x_{k - 1} - {\hat{x}}_{k - 1} {\bar{x}}_{k}^{T} \\ = \int x_{k - 1} (G P_{ρ} (x_{k - 1}, D_{f_{1}}^{*}), G P_{ρ} (x_{k - 1}, D_{f_{2}}^{*}), \dots, G P_{ρ} (x_{k - 1}, D_{f_{n}}^{*})) \\ \cdot N (x_{k - 1} | {\hat{x}}_{k - 1}, P_{k - 1}) d x_{k - 1} - {\hat{x}}_{k - 1} {\bar{x}}_{k}^{T} \\ =^{(5)} [\dots, \int x_{k - 1} G P_{ρ} (x_{k - 1}, D_{f_{i}}^{*}) N (x_{k - 1} | {\hat{x}}_{k - 1}, P_{k - 1}) d x_{k - 1}, \dots] - {\hat{x}}_{k - 1} {\bar{x}}_{k}^{T} \end{matrix}

(42)

In step (5) of formula (42), the calculation of the integral term is similar to formula (36)

\begin{matrix} \int x_{k - 1} G P_{ρ} (x_{k - 1}, D_{f_{i}}^{*}) N (x_{k - 1} | {\hat{x}}_{k - 1}, P_{k - 1}) d x_{k - 1} \\ = \sum_{j = 0}^{L - 1} d_{ij} χ^{(j)} (D_{f_{i}}^{*}) {(W_{f_{i}} + P_{k - 1})}^{- 1} (W_{f_{i}} {\hat{x}}_{k - 1} + P_{k - 1} x_{j}^{*}) \end{matrix}

(43)

χ (D_{f_{i}}^{*}) = {[K_{f_{i}} (X_{0 : L - 1}^{*}, X_{0 : L - 1}^{*}) + σ_{xi}^{2} I]}^{- 1} x_{1 : L}^{(i)}

(44)

\begin{matrix} d_{ij} = \frac{σ_{f_{i}}^{2} {| W_{f_{i}} {(W_{f_{i}} + P_{k - 1})}^{- 1} P_{k - 1} |}^{1 / 2}}{{| P_{k - 1} |}^{1 / 2}} \exp \\ (- \frac{1}{2} {({\hat{x}}_{k - 1} - x_{j}^{*})}^{T} {(W_{f_{i}} + P_{k - 1})}^{- 1} ({\hat{x}}_{k - 1} - x_{j}^{*})) \end{matrix}

(45)

Thus, the GP-RTSS can be described as Algorithm 2.

Algorithm 2. GP-RTSS

Step 1. Initialize $G (θ_{μ}, D^{*})$ and $N (x_{0} | {\hat{x}}_{0}, P_{0})$;
Step 2. Perform the GP-ADF from $k = 1$ to $k = L$, and save the filtered distribution and the prediction distribution;
Step 3. When $k = L$, let ${\tilde{x}}_{k} = {\hat{x}}_{k}$ and ${\tilde{P}}_{k} = P_{k}$;
Step 4. Calculate $P_{k - 1, k}^{xx}$;
Step 5. Input the saved filtered distribution, prediction distribution, and $P_{k - 1, k}^{xx}$ into Steps 6 and 7;
Step 6. Calculate ${\tilde{x}}_{k - 1} = {\hat{x}}_{k - 1} + P_{k - 1, k}^{xx} Q_{k}^{- 1} ({\tilde{x}}_{k} - {\bar{x}}_{k})$;
Step 7. Calculate ${\tilde{P}}_{k - 1} = P_{k - 1} + P_{k - 1, k}^{xx} Q_{k}^{- 1} {({\tilde{P}}_{k} - Q_{k})}^{T} {(P_{k - 1, k}^{xx} Q_{k}^{- 1})}^{T}$;
Step 8. Let $k = k - 1$ and return to Step 4.
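The backward recursion of Algorithm 2 can be sketched for a scalar state. The sketch uses the gain $P_{k - 1, k}^{xx} Q_{k}^{- 1}$, that is, the cross-covariance divided by the prediction covariance at time k, which matches the definition of $H_{k}$ in formula (48). Everything else is an assumption made for illustration: the function name `rts_backward` is hypothetical, and a linear Kalman filter with made-up parameters stands in for the GP-ADF so that the inputs are easy to generate.

```python
def rts_backward(x_hat, P, x_bar, Q, P_xx):
    """Backward recursion of the RTS smoother for a scalar state.

    Index k runs over 0..L. x_hat[k], P[k]: filtered mean and variance;
    x_bar[k], Q[k]: predicted mean and variance of x_k given Y_{1:k-1} (index 0 unused);
    P_xx[k]: cross-covariance between x_{k-1} and x_k (index 0 unused).
    """
    L = len(x_hat) - 1
    x_s, P_s = list(x_hat), list(P)   # at k = L smoothed and filtered moments coincide
    for k in range(L, 0, -1):
        J = P_xx[k] / Q[k]                                  # gain P^{xx}_{k-1,k} Q_k^{-1}
        x_s[k - 1] = x_hat[k - 1] + J * (x_s[k] - x_bar[k])
        P_s[k - 1] = P[k - 1] + J * (P_s[k] - Q[k]) * J
    return x_s, P_s

# Stand-in forward pass: scalar linear model x_k = a x_{k-1} + noise, y_k = x_k + noise.
a, q, r = 0.9, 0.1, 0.5
ys = [0.8, 0.5, 0.7, 0.2]
x_hat, P, x_bar, Q, P_xx = [0.0], [1.0], [None], [None], [None]
for y in ys:
    xb, Qk = a * x_hat[-1], a * a * P[-1] + q
    P_xx.append(a * P[-1])            # Cov(x_{k-1}, x_k | Y_{1:k-1}) for the linear model
    K = Qk / (Qk + r)
    x_bar.append(xb)
    Q.append(Qk)
    x_hat.append(xb + K * (y - xb))
    P.append((1 - K) * Qk)

x_s, P_s = rts_backward(x_hat, P, x_bar, Q, P_xx)
print(x_s, P_s)
```

In the GP-RTSS the only change to this recursion is where the inputs come from: the filtered and predicted moments are produced by Algorithm 1 and the cross-covariance by formulas (42) and (43).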

Maximization step

Algorithm 2 can obtain the smoothed distribution of the GPSSM. Next, the joint smoothed distribution $p_{θ_{μ}} (x_{i + 1}, x_{i} | Y_{1 : L})$ is required to calculate the CEF

\begin{matrix} p (x_{k + 1}, x_{k} | Y_{1 : L}) = p (x_{k} | x_{k + 1}, Y_{1 : L}) p (x_{k + 1} | Y_{1 : L}) \\ = p (x_{k} | x_{k + 1}, Y_{1 : k}) p (x_{k + 1} | Y_{1 : L}) \end{matrix}

(46)

where $p (x_{k + 1} | Y_{1 : L})$ is the smoothed distribution of $x_{k + 1}$ . The distribution $p (x_{k} | x_{k + 1}, Y_{1 : k})$ is calculated by $p (x_{k}, x_{k + 1} | Y_{1 : k})$

\begin{matrix} p (x_{k}, x_{k + 1} | Y_{1 : k}) = N \\ (\begin{matrix} x_{k} \\ x_{k + 1} \end{matrix} | (\begin{matrix} {\hat{x}}_{k} \\ {\bar{x}}_{k + 1} \end{matrix}), (\begin{matrix} P_{k} & P_{k, k + 1}^{xx} \\ {(P_{k, k + 1}^{xx})}^{T} & Q_{k + 1} \end{matrix})) \end{matrix}

(47)

Then

\begin{matrix} p (x_{k} | x_{k + 1}, Y_{1 : k}) = N \\ (x_{k} | {\hat{x}}_{k} + P_{k, k + 1}^{xx} Q_{k + 1}^{- 1} (x_{k + 1} - {\bar{x}}_{k + 1}), P_{k} - P_{k, k + 1}^{xx} Q_{k + 1}^{- 1} {(P_{k, k + 1}^{xx})}^{T}) \\ = N (x_{k} | (P_{k, k + 1}^{xx} Q_{k + 1}^{- 1}) x_{k + 1} + ({\hat{x}}_{k} - P_{k, k + 1}^{xx} Q_{k + 1}^{- 1} {\bar{x}}_{k + 1}), \\ P_{k} - P_{k, k + 1}^{xx} Q_{k + 1}^{- 1} (P_{k, k + 1}^{xx})^{T}) \\ =^{(3)} N (x_{k} | H_{k} x_{k + 1} + U_{k}, Γ_{k}) \end{matrix}

(48)

In step (3) of formula (48), $H_{k} = P_{k, k + 1}^{xx} Q_{k + 1}^{- 1}$ , $U_{k} = {\hat{x}}_{k} - P_{k, k + 1}^{xx} Q_{k + 1}^{- 1} {\bar{x}}_{k + 1}$ , and $Γ_{k} = P_{k} - P_{k, k + 1}^{xx} Q_{k + 1}^{- 1} (P_{k, k + 1}^{xx})^{T}$ . By the property of Gauss distribution

\begin{matrix} p (x_{k + 1}, x_{k} | Y_{1 : L}) = N \\ (\begin{matrix} x_{k + 1} \\ x_{k} \end{matrix} | (\begin{matrix} {\tilde{x}}_{k + 1} \\ H_{k} {\tilde{x}}_{k + 1} + U_{k} \end{matrix}), (\begin{matrix} {\tilde{P}}_{k + 1} & {\tilde{P}}_{k + 1} H_{k}^{T} \\ H_{k} {\tilde{P}}_{k + 1} & H_{k} {\tilde{P}}_{k + 1} H_{k}^{T} + Γ_{k} \end{matrix})) \\ = N (\begin{matrix} x_{k + 1} \\ x_{k} \end{matrix} | (\begin{matrix} {\tilde{x}}_{k + 1} \\ {\tilde{x}}_{k} \end{matrix}), (\begin{matrix} {\tilde{P}}_{k + 1} & {\tilde{P}}_{k + 1} H_{k}^{T} \\ H_{k} {\tilde{P}}_{k + 1} & {\tilde{P}}_{k} \end{matrix})) \end{matrix}

(49)
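The factorization in formula (46) suggests a two-stage way to draw pairs from the joint smoothed distribution: first draw $x_{k + 1}$ from its smoothed marginal, then draw $x_{k}$ from the conditional in formula (48). The scalar sketch below illustrates this under stated assumptions: the function name and all numerical values are hypothetical, and `P_s_next` denotes the smoothed variance of $x_{k + 1}$.

```python
import random

def sample_smoothed_pair(rng, x_s_next, P_s_next, H, U, Gamma):
    """Draw (x_{k+1}, x_k) from the joint smoothed distribution, scalar case.

    First x_{k+1} ~ N(x_s_next, P_s_next), then x_k ~ N(H x_{k+1} + U, Gamma),
    following the factorization in formula (46) and the conditional in (48).
    """
    x_next = rng.gauss(x_s_next, P_s_next ** 0.5)   # gauss takes a standard deviation
    x_k = rng.gauss(H * x_next + U, Gamma ** 0.5)
    return x_next, x_k

rng = random.Random(0)
pairs = [sample_smoothed_pair(rng, 1.0, 0.2, 0.5, 0.3, 0.05) for _ in range(20000)]
mean_k = sum(p[1] for p in pairs) / len(pairs)
var_k = sum((p[1] - mean_k) ** 2 for p in pairs) / len(pairs)
print(mean_k, var_k)
```

With these inputs the sample mean and variance of $x_{k}$ should be close to $H_{k} {\tilde{x}}_{k + 1} + U_{k} = 0.8$ and $H_{k}^{2} {\tilde{P}}_{k + 1} + Γ_{k} = 0.1$, the marginal moments of $x_{k}$ implied by the joint distribution.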

Thus, $l_{0}$ , $l_{1}$ , and $l_{2}$ are expressed as

l_{0} = \int p_{θ_{μ}} (x_{0} | Y_{1 : L}) \log N (x_{0} | {\hat{x}}_{0}, P_{0}) d x_{0}

(50)

\begin{matrix} l_{1} = \sum_{i = 0}^{L - 1} \int \int p_{θ_{μ}} (x_{i + 1}, x_{i} | Y_{1 : L}) \log Π_{j = 1}^{n} N (x_{i + 1}^{(j)} | G P_{ρ} (x_{i}, D_{f_{j}}), G P_{Σ} (x_{i}, D_{f_{j}}) + σ_{xj}^{2}) d x_{i} d x_{i + 1} \\ = - \frac{1}{2} \sum_{i = 0}^{L - 1} \sum_{j = 1}^{n} \int \int p_{θ_{μ}} (x_{i + 1}, x_{i} | Y_{1 : L}) (\frac{{(x_{i + 1}^{(j)} - G P_{ρ} (x_{i}, D_{f_{j}}))}^{2}}{G P_{Σ} (x_{i}, D_{f_{j}}) + σ_{xj}^{2}} + \log (G P_{Σ} (x_{i}, D_{f_{j}}) + σ_{xj}^{2})) d x_{i} d x_{i + 1} + const \end{matrix}

(51)

\begin{matrix} l_{2} = \sum_{i = 1}^{L} \int p_{θ_{μ}} (x_{i} | Y_{1 : L}) \log Π_{j = 1}^{m} N (y_{i}^{(j)} | G P_{ρ} (x_{i}, D_{h_{j}}), G P_{Σ} (x_{i}, D_{h_{j}}) + σ_{yj}^{2}) d x_{i} \\ = \sum_{i = 1}^{L} \sum_{j = 1}^{m} \int p_{θ_{μ}} (x_{i} | Y_{1 : L}) \log N (y_{i}^{(j)} | G P_{ρ} (x_{i}, D_{h_{j}}), G P_{Σ} (x_{i}, D_{h_{j}}) + σ_{yj}^{2}) d x_{i} \\ = - \frac{1}{2} \sum_{i = 1}^{L} \sum_{j = 1}^{m} \int p_{θ_{μ}} (x_{i} | Y_{1 : L}) (\frac{{(y_{i}^{(j)} - G P_{ρ} (x_{i}, D_{h_{j}}))}^{2}}{G P_{Σ} (x_{i}, D_{h_{j}}) + σ_{yj}^{2}} + \log (G P_{Σ} (x_{i}, D_{h_{j}}) + σ_{yj}^{2})) d x_{i} + const \end{matrix}

(52)

The terms $l_{1}$ and $l_{2}$ cannot be calculated analytically, because they contain integrals that have no closed-form solution. The MC numerical integral method can approximate such integrals to any desired accuracy, provided that enough samples are drawn. For $l_{1}$, the MC method is used to sample N particles $\{x_{i + 1}^{g}, x_{i}^{g}\}_{g = 1}^{N}$ randomly from $p_{θ_{μ}} (x_{i + 1}, x_{i} | Y_{1 : L})$. Then, $l_{1}$ is approximated by ${\hat{l}}_{1}$

{\hat{l}}_{1} = - \frac{1}{2 N} \sum_{g = 1}^{N} \sum_{i = 0}^{L - 1} \sum_{j = 1}^{n} (\frac{{(x_{i + 1}^{g (j)} - G P_{ρ} (x_{i}^{g}, D_{f_{j}}))}^{2}}{G P_{Σ} (x_{i}^{g}, D_{f_{j}}) + σ_{xj}^{2}} + \log (G P_{Σ} (x_{i}^{g}, D_{f_{j}}) + σ_{xj}^{2})) + const

(53)

Similarly, for $l_{2}$, the particles $\{x_{i}^{g}\}_{g = 1}^{N}$ are sampled from $p_{θ_{μ}} (x_{i} | Y_{1 : L})$, and $l_{2}$ is approximated by ${\hat{l}}_{2}$

{\hat{l}}_{2} = - \frac{1}{2 N} \sum_{g = 1}^{N} \sum_{i = 1}^{L} \sum_{j = 1}^{m} (\frac{{(y_{i}^{(j)} - G P_{ρ} (x_{i}^{g}, D_{h_{j}}))}^{2}}{G P_{Σ} (x_{i}^{g}, D_{h_{j}}) + σ_{yj}^{2}} + \log (G P_{Σ} (x_{i}^{g}, D_{h_{j}}) + σ_{yj}^{2})) + const

(54)

Based on the above content, $l (θ, θ_{μ})$ can be approximately expressed as $\hat{l} (θ, θ_{μ}) = l_{0} + {\hat{l}}_{1} + {\hat{l}}_{2}$ .
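The structure of the estimator in formula (54) can be sketched for a scalar measurement ($m = 1$). This is a sketch under stated assumptions rather than the article's implementation: `gp_mean` and `gp_var` are caller-supplied stand-ins for $G P_{ρ}$ and $G P_{Σ}$, `noise_var` stands in for $σ_{y}^{2}$, and the additive constant is dropped.

```python
import math

def l2_hat(samples, ys, gp_mean, gp_var, noise_var):
    """Monte Carlo estimate of l_2 up to an additive constant, scalar measurement.

    samples[g][i] is the g-th particle drawn from the smoothed distribution of
    the state at time i + 1; ys[i] is the measurement at time i + 1.
    gp_mean and gp_var are caller-supplied stand-ins for GP_rho and GP_Sigma.
    """
    total = 0.0
    for particle in samples:
        for x, y in zip(particle, ys):
            v = gp_var(x) + noise_var            # predictive variance plus noise variance
            total += (y - gp_mean(x)) ** 2 / v + math.log(v)
    return -total / (2 * len(samples))

# Toy stand-ins (hypothetical): zero predictive mean and a constant predictive variance.
value = l2_hat([[0.0, 0.0]], [1.0, -1.0], lambda x: 0.0, lambda x: 0.5, 0.5)
print(value)
```

The estimator ${\hat{l}}_{1}$ in formula (53) has the same form, with pairs of particles replacing single particles and the state transition GP replacing the measurement GP.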

After $\hat{l} (θ, θ_{μ})$ is obtained, it is maximized with respect to $θ$ in order to obtain the new iterate $θ_{μ + 1}$. $θ$ consists of three parts, namely, $({\hat{x}}_{0}, P_{0})$, $\{σ_{f_{i}}, W_{f_{i}}, σ_{xi}\}_{i = 1}^{n}$, and $\{σ_{h_{i}}, W_{h_{i}}, σ_{yi}\}_{i = 1}^{m}$. $({\hat{x}}_{0}, P_{0})$ appears only in $l_{0}$, $\{σ_{f_{i}}, W_{f_{i}}, σ_{xi}\}_{i = 1}^{n}$ appears only in ${\hat{l}}_{1}$, and $\{σ_{h_{i}}, W_{h_{i}}, σ_{yi}\}_{i = 1}^{m}$ appears only in ${\hat{l}}_{2}$. The latter two sets are denoted by $θ_{f}$ and $θ_{h}$, respectively. Then, the maximization of $\hat{l} (θ, θ_{μ})$ can be decomposed into maximizing $l_{0}$, ${\hat{l}}_{1}$, and ${\hat{l}}_{2}$ over $({\hat{x}}_{0}, P_{0})$, $θ_{f}$, and $θ_{h}$, respectively.

From the lemma of Deisenroth et al.,¹⁹ $l_{0}$ takes its maximum value when ${\hat{x}}_{0} = {\tilde{x}}_{0}$ and $P_{0} = {\tilde{P}}_{0}$, where ${\tilde{x}}_{0}$ and ${\tilde{P}}_{0}$ are the mean and covariance of the smoothed estimation distribution of the initial state, respectively. The partial derivatives of ${\hat{l}}_{1}$ and ${\hat{l}}_{2}$ with respect to $θ_{f}$ and $θ_{h}$, respectively, can be computed analytically, which allows for gradient-based parameter optimization. In this article, the variable metric method¹⁴ is used to search for the maximum, which is equivalent to minimizing the negated objective. The correction matrices constructed by the variable metric method are all symmetric positive definite, which guarantees that every search direction is a descent direction, so each iteration decreases the negated objective and therefore increases ${\hat{l}}_{1}$ or ${\hat{l}}_{2}$. In fact, $θ_{μ + 1}$ does not need to be globally optimal; it only needs to satisfy the inequality $\hat{l} (θ_{μ + 1}, θ_{μ}) > \hat{l} (θ_{μ}, θ_{μ})$, which ensures $p (Y_{1 : L} | G (θ_{μ + 1}, D^{*}))$ $> p (Y_{1 : L} | G (θ_{μ}, D^{*}))$.

Final optimization algorithm

The final optimization algorithm is described formally in this section. It is called the EM-GP-RTSS algorithm, because it mainly consists of the EM algorithm and the GP-RTSS algorithm.

The key idea of Algorithm 3 is to replace the difficult-to-maximize measurement likelihood function with the CEF, because the MLE of the hyperparameters of the desired GPSSM can be approached asymptotically by iteratively optimizing the CEF. Meanwhile, Algorithm 3 outputs the smoothed estimation at each EM iteration. Algorithm 3 can therefore also be regarded as a joint state estimation and model optimization algorithm for the GPSSM.

Algorithm 3. EM-GP-RTSS

Step 1. Initialize $θ_{μ} = θ^{*}$, $μ = 0$;
Step 2. E-step:
(a) Run Algorithm 1 and Algorithm 2 to obtain $p_{θ_{μ}} (x_{i} | Y_{1 : L})$, for $i = 0, 1, \dots, L$;
(b) Calculate $p_{θ_{μ}} (x_{i + 1}, x_{i} | Y_{1 : L})$ for $i = 0, 1, \dots, L - 1$ by formula (46);
Step 3. M-step:
(a) Sample $\{x_{i + 1}^{g}, x_{i}^{g}\}_{g = 1}^{N}$ randomly from $p_{θ_{μ}} (x_{i + 1}, x_{i} | Y_{1 : L})$ to calculate ${\hat{l}}_{1}$ by formula (53);
(b) Sample $\{x_{i}^{g}\}_{g = 1}^{N}$ randomly from $p_{θ_{μ}} (x_{i} | Y_{1 : L})$ to calculate ${\hat{l}}_{2}$ by formula (54);
(c) Calculate the gradients of ${\hat{l}}_{1}$ and ${\hat{l}}_{2}$ with respect to $θ_{f}$ and $θ_{h}$, respectively;
(d) Apply the variable metric method to search for $θ_{f}^{*}$ and $θ_{h}^{*}$;
(e) Let $θ_{μ + 1} = \{({\hat{x}}_{0}, P_{0}), θ_{f}^{*}, θ_{h}^{*}\}$;
Step 4. Judge the convergence. If $\hat{l} (θ_{μ + 1}, θ_{μ}) - \hat{l} (θ_{μ}, θ_{μ}) > ε$ $(ε > 0)$, let $μ = μ + 1$ and return to Step 2; otherwise, end.

The computational complexity of Algorithm 3 is determined by two suboperations: computing the smoothed distribution and computing the approximate CEF. For the GPSSM described in this article, the computational complexity of the GP-RTSS algorithm is $O (L n^{3} + 2 L m^{3})$, where L is the number of samples (time steps), so the cost scales linearly with L. The approximate CEF is calculated by formulas (53) and (54), with computational complexity $O (NL (n + m))$, where N is the number of particles in the MC numerical integral. In practice the threshold $ε$ is not easy to choose, so a common alternative is to set a maximum number of EM iterations, which can be used to balance the amount of calculation against the estimation accuracy.
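The control flow of Algorithm 3, including both stopping rules discussed above (the threshold $ε$ and a maximum number of iterations), can be sketched as a generic loop. The sketch is illustrative only: `run_em`, `e_step`, `m_step`, and `cef` are hypothetical names, the three callables are placeholders for the GP-RTSS, the variable metric search, and the approximate CEF, and the toy problem at the end is unrelated to the pendulum experiments.

```python
def run_em(theta0, e_step, m_step, cef, eps=1e-4, max_iters=20):
    """Generic EM loop with a CEF-increase threshold and an iteration cap.

    e_step(theta) -> smoothed statistics; m_step(theta, stats) -> new theta;
    cef(theta_new, theta_old, stats) -> approximate CEF value.
    All three callables are placeholders supplied by the caller.
    """
    theta = theta0
    for it in range(max_iters):
        stats = e_step(theta)
        theta_new = m_step(theta, stats)
        gain = cef(theta_new, theta, stats) - cef(theta, theta, stats)
        if gain <= eps:           # increase too small: stop and keep the current theta
            return theta, it
        theta = theta_new
    return theta, max_iters       # iteration cap reached

# Toy problem (assumption): the "CEF" is -(theta - 3)^2 and the M-step moves halfway to 3.
def toy_e_step(theta):
    return None

def toy_m_step(theta, stats):
    return theta + 0.5 * (3.0 - theta)

def toy_cef(theta_new, theta_old, stats):
    return -(theta_new - 3.0) ** 2

theta, iters = run_em(0.0, toy_e_step, toy_m_step, toy_cef)
print(theta, iters)
```

Returning the current parameters when the increase falls below $ε$ is one possible design choice; Algorithm 3 itself does not state which iterate is output when the loop ends.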

Simulation experiment

In this section, computer simulation experiments are carried out to verify the performance of the EM-GP-RTSS algorithm. A noisy pendulum system is selected as the object to be simulated by the GPSSM in these experiments. Because both its state transition equation and its measurement equation are typical nonlinear forms, it is often used to verify the performance of nonlinear filtering and smoothing algorithms.^17,19

In this study, all experiments are performed using MATLAB R2013a on an Intel i5 quad-core 2.90-GHz 64-bit machine with 4 GB RAM.

Experimental settings

For a noisy pendulum system with unit length and unit mass, the standard form of the discrete-time nonlinear SSM can be established as follows

\begin{matrix} \underbrace{(\begin{matrix} x_{1, k} \\ x_{2, k} \end{matrix})}_{x_{k}} = \underbrace{(\begin{matrix} x_{1, k - 1} + x_{2, k - 1} Δ t \\ x_{2, k - 1} - g \sin (x_{1, k - 1}) Δ t \end{matrix})}_{f (x_{k - 1})} + q_{k - 1} \\ y_{k} = \underbrace{\sin (x_{1, k})}_{h (x_{k})} + r_{k} \end{matrix}

(55)

where $x_{1, k}$ is the angle at the $k$th time step, $x_{2, k}$ is the angular velocity at the $k$th time step, g is the gravitational acceleration, $y_{k}$ is the measurement at the $k$th time step, $Δ t$ is the sampling period of the sensor, $q_{k - 1} \sim N (q_{k - 1} | 0, Q)$, $r_{k} \sim N (r_{k} | 0, R^{2})$ and

Q = [\begin{matrix} \frac{q Δ t^{3}}{3} & \frac{q Δ t^{2}}{2} \\ \frac{q Δ t^{2}}{2} & q Δ t \end{matrix}]

(56)

where q is the spectral density.
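Formulas (55) and (56) can be turned into a short simulator. The following is a minimal sketch, not the MATLAB code used in the experiments: the function names are assumptions, the initial state is fixed by hand instead of being drawn from its prior, and the parameter values are those of the pendulum model⁰.

```python
import math
import random

def process_noise_cov(q, dt):
    """Formula (56): covariance of the discretized process noise."""
    return [[q * dt**3 / 3, q * dt**2 / 2],
            [q * dt**2 / 2, q * dt]]

def f(x, g, dt):
    """Noise-free state transition of formula (55); x = [angle, angular velocity]."""
    return [x[0] + x[1] * dt, x[1] - g * math.sin(x[0]) * dt]

def simulate(x0, q, R, g, dt, L, rng):
    """Generate states X_{0:L} and measurements Y_{1:L} from formula (55)."""
    Q = process_noise_cov(q, dt)
    # Cholesky factor of the 2x2 matrix Q, used to draw correlated process noise.
    l11 = math.sqrt(Q[0][0])
    l21 = Q[1][0] / l11
    l22 = math.sqrt(Q[1][1] - l21**2)
    xs, ys = [list(x0)], []
    for _ in range(L):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        fx = f(xs[-1], g, dt)
        x = [fx[0] + l11 * z1, fx[1] + l21 * z1 + l22 * z2]
        xs.append(x)
        ys.append(math.sin(x[0]) + rng.gauss(0, R))   # r_k has standard deviation R
    return xs, ys

xs, ys = simulate([0.5, 0.0], q=0.01, R=0.01, g=9.831, dt=0.1, L=50, rng=random.Random(1))
print(len(xs), len(ys))
```

The measurement noise is drawn with standard deviation R because formula (55) specifies its variance as $R^{2}$.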

On the one hand, the nonlinear SSM (equation (55)) serves as a benchmark model to test the EM-GP-RTSS; on the other hand, it provides the true states used to train the GPSSM*, which is taken as the GPSSM in the laboratory. Different instantiated pendulum models (equation (55)) are obtained by setting different values for the parameters $(q, R, g)$. Five parameter combinations $(q, R, g)$ are set in the simulation experiment, as shown in Table 1.

Table 1.

Parameter settings of five pendulum models.

Parameters	q	R	g
Pendulum model⁰	0.01	0.01	9.831
Pendulum model¹	0.05	0.05	9.831
Pendulum model²	0.1	0.1	9.831
Pendulum model³	0.5	0.5	9.831
Pendulum model⁴	0.01	0.01	9.719

The five pendulum models in Table 1 represent five homogeneous nonlinear dynamic systems. The pendulum model⁰ is used as the laboratory model that provides true states. The remaining four pendulum models describe noisy pendulum systems in different non-laboratory environments. For the pendulum models^1–3, the values of $(q, R)$ differ from those of the pendulum model⁰, representing noise levels different from the laboratory. For the pendulum model⁴, a different gravitational acceleration g represents a geographical location different from the laboratory, because g is a function of latitude and elevation. For all of the above pendulum models, the sensor sampling period is $Δ t = 0.1 s$, the initial time is $t_{0} = 0 s$, and the number of samples is $L = 50$. The initial state obeys $x_{0} \sim N (x_{0} | {\hat{x}}_{0}, P_{0})$, where ${\hat{x}}_{0} \sim N ({\hat{x}}_{0} | 0, E_{2 \times 2})$, $P_{0} = 0.1 E_{2 \times 2}$, and $E_{2 \times 2} = [1, 0; 0, 1]$.

By running the pendulum model⁰, the true system states and measurements $(X_{0 : 50}^{*}, Y_{1 : 50}^{*})$ are obtained. Then, the GPSSM* is trained with $(X_{0 : 50}^{*}, Y_{1 : 50}^{*})$ to simulate the pendulum model⁰. The trained hyperparameters of the GPSSM* are $\{σ_{f_{i}}, W_{f_{i}}, σ_{xi}\}_{i = 1}^{2}$ and $\{σ_{h}, W_{h}, σ_{y}\}$, where $W_{f_{i}} = [s_{f_{i, 1}}^{2}, 0; 0, s_{f_{i, 2}}^{2}]$ and $W_{h} = [s_{h_{1}}^{2}, 0; 0, s_{h_{2}}^{2}]$. $\{X_{0 : 50}^{i}, Y_{1 : 50}^{i}\}_{i = 1}^{4}$ are obtained by running the pendulum models^1–4 separately, where $X_{0 : 50}^{i}$ is unknown in the non-laboratory environment. If the GPSSM*, which is fitted to the laboratory data, is used to estimate $X_{0 : 50}^{i}$ given $Y_{1 : 50}^{i}$, there is an overfitting issue. If a new GPSSMⁱ, obtained by optimizing the GPSSM* with the EM-GP-RTSS algorithm for the pendulum modelⁱ, improves the estimation accuracy of $X_{0 : 50}^{i}$, this demonstrates that the GPSSMⁱ can simulate the pendulum modelⁱ well. For the pendulum models^1–4, this article uses the GPSSM*, the GPSSMⁱ, and the parametric model (equation (55)) to estimate their states. For the GPSSMs and the traditional SSMs, the unscented Rauch–Tung–Striebel smoother (URTSS),²⁰ a classical nonlinear smoothing algorithm, is used as a benchmark for comparison with the GP-RTSS proposed in this article. The root mean square error (RMSE) is used to measure the accuracy of state estimation. The RMSE of the state $x_{1, k}$ is calculated as follows

RMSE (x_{1, k}) = \sqrt{\frac{1}{N} \sum_{i = 1}^{N} {({\hat{x}}_{1, k}^{i} - x_{1, k})}^{2}}

(57)

where ${\hat{x}}_{1, k}^{i}$ is the smoothed estimation mean of $x_{1, k}$ in the $i$th MC simulation experiment and N is the number of MC simulation experiments. The RMSE of the state $x_{2, k}$ is calculated in the same way as in formula (57). In the EM-GP-RTSS, the convergence condition is set to a maximum of 20 iterations, and the number of sampling particles in the MC numerical integral is set to $N = 200$. The parameter setting of the URTSS algorithm is the same as in Särkkä.²⁰
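Formula (57) translates directly into code. The sketch below is illustrative, with hypothetical function names and made-up numbers: it computes the RMSE at one time step and then at every time step of a trajectory.

```python
import math

def rmse(estimates, truth):
    """Formula (57): RMSE over the MC runs at a single time step."""
    n = len(estimates)
    return math.sqrt(sum((e - truth) ** 2 for e in estimates) / n)

def rmse_curve(runs, truths):
    """RMSE at every time step; runs[i][k] is the estimate of run i at time k."""
    return [rmse([run[k] for run in runs], truths[k]) for k in range(len(truths))]

# Three hypothetical runs over two time steps.
runs = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
curve = rmse_curve(runs, [2.0, 0.5])
print(curve)
```

Plotting such a curve against k gives the kind of per-time-step accuracy comparison reported for 100 MC simulations later in this section.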

Experimental results

The true states $X_{0 : 50}^{*}$ of the pendulum model⁰ are shown in Figure 2. The measurements $\{Y_{1 : 50}^{i}\}_{i = 0}^{4}$ of the pendulum models^0–4 are shown in Figure 3. $(X_{0 : 50}^{*}, Y_{1 : 50}^{*})$ are the training data of the GPSSM*. In the simulation experiments, the BFGS algorithm²¹ is used to train the hyperparameters of the GPSSM*, which are recorded in Table 2. $(X_{0 : 50}^{*}, Y_{1 : 50}^{*})$ are the input data and the output data used to construct the kernel functions of the GPSSMs^1–4. $\{Y_{1 : 50}^{i}\}_{i = 1}^{4}$ are taken as the measurements in the non-laboratory environments. The EM-GP-RTSS is adopted to optimize the GPSSM* in order to obtain the GPSSMs^1–4, whose hyperparameters are recorded in Table 3. The kernel functions $k_{f_{1}}$, $k_{f_{2}}$, and $k_{h}$ in Tables 2 and 3 are defined in formulas (15) and (16).

Figure 2.

The true states obtained by the pendulum model⁰.

Figure 3.

The measurements obtained by the pendulum models^0–4.

Table 2.

Hyperparameters of GPSSM*.

Function	$k_{f_{1}}$	$k_{f_{2}}$	$k_{h}$
$σ_{f (h)}$	10.435	3.5634	1.7309
$s_{f {(h)}_{1}}$	52.625	14.844	95.455
$s_{f {(h)}_{2}}$	3.4048	7.5330	0.6710
$σ_{x (y)}$	0.0035	0.0327	0.0086

GPSSM: Gauss process state-space model.

Table 3.

Hyperparameters of GPSSM^1–4.

Model	GPSSM¹			GPSSM²			GPSSM³			GPSSM⁴
Function	$k_{f_{1}}$	$k_{f_{2}}$	$k_{h}$	$k_{f_{1}}$	$k_{f_{2}}$	$k_{h}$	$k_{f_{1}}$	$k_{f_{2}}$	$k_{h}$	$k_{f_{1}}$	$k_{f_{2}}$	$k_{h}$
$σ_{f (h)}$	18.99	4.099	2.163	24.05	4.801	1.640	24.69	3.151	1.444	7.655	4.718	1.838
$s_{f {(h)}_{1}}$	104.7	14.42	66.80	96.64	34.39	64.68	106.7	5.535	39.65	71.29	40.76	100.2
$s_{f {(h)}_{2}}$	8.003	6.933	1.013	11.10	17.13	0.764	11.61	3.769	0.814	3.680	19.85	0.763
$σ_{x (y)}$	0.010	0.074	0.052	0.015	0.106	0.098	0.025	0.243	0.466	0.004	0.037	0.009

GPSSM: Gauss process state-space model.

Figures 4–7 show the smoothed estimation means of the pendulum models^1–4 obtained after running the EM-GP-RTSS to optimize the GPSSM*, respectively. The estimation of the GP-RTSS with the GPSSM* and the estimation of the URTSS with the true SSM (equation (55)) are also given. From Figures 4–7, it can be seen that the smoothed estimation mean of the EM-GP-RTSS follows the true states of the pendulum models^1–4 more accurately than that of the GP-RTSS with the GPSSM*, which is confirmed by the corresponding standard deviations in Table 4. This demonstrates that directly applying the GPSSM* to simulate the non-laboratory pendulum models^1–4 introduces model bias. Table 4 also shows that the GP-RTSS with the GPSSMs^1–4 reaches or exceeds the accuracy of the URTSS with the SSMs^1–4 for the pendulum models 1 and 3 and for the angle of the pendulum model 4, whereas the URTSS has the smaller standard deviation for the pendulum model 2 and for the angular velocity of the pendulum model 4.

Figure 4.

The smoothed estimation results for the pendulum model¹: (a) angle smoothing estimation mean and (b) angular velocity smoothing estimate mean.

Figure 5.

The smoothed estimation results for the pendulum model²: (a) angle smoothing estimation mean and (b) angular velocity smoothing estimate mean.

Figure 6.

The smoothed estimation results for the pendulum model³: (a) angle smoothing estimation mean and (b) angular velocity smoothing estimate mean.

Figure 7.

The smoothed estimation results for the pendulum model⁴: (a) angle smoothing estimation mean and (b) angular velocity smoothing estimate mean.

Table 4.

Standard deviation of smoothed estimation mean.

Algorithm	EM-GP-RTSS		GP-RTSS		URTSS
Standard deviation	Angle	Angular velocity	Angle	Angular velocity	Angle	Angular velocity
Pendulum model 1	0.0666	0.1340	0.0890	0.1381	0.1101	0.2352
Pendulum model 2	0.3547	0.3450	0.3996	0.4196	0.1434	0.2879
Pendulum model 3	0.4197	0.4299	0.6401	1.0124	0.8329	0.8476
Pendulum model 4	0.0239	0.0581	0.0447	0.1385	0.0430	0.0433

EM: expectation–maximization; GP: Gauss process; RTSS: Rauch–Tung–Striebel smoother; URTSS: unscented Rauch–Tung–Striebel smoother.

The one-step prediction performance of the GPSSMs^1–4 and the GPSSM* is tested in order to further discuss the advantage of the GPSSMs^1–4. The pendulum models^1–4 are run again over the time period $[0, 1 s]$ to generate four sets of new true states and measurements, denoted as $\{{X^{'}}_{0 : 10}^{i}, {Y^{'}}_{1 : 10}^{i}\}_{i = 1}^{4}$. $\{{X^{'}}_{0 : 9}^{i}\}_{i = 1}^{4}$ are input to the state transition equations of the GPSSMs^1–4, and the prediction distributions of $\{{X^{'}}_{1 : 10}^{i}\}_{i = 1}^{4}$ are obtained from formula (20). Similarly, $\{{X^{'}}_{1 : 10}^{i}\}_{i = 1}^{4}$ are input to the measurement equations of the GPSSMs^1–4, and the prediction distributions of $\{{Y^{'}}_{1 : 10}^{i}\}_{i = 1}^{4}$ are obtained from formula (21).

The prediction distributions from the GPSSMs^1–4 are compared with the prediction distributions from the GPSSM*, as shown in Figures 8–11. The error bars show two standard deviations of the state prediction distributions or the measurement prediction distributions. The red circle indicates the corresponding true value of the prediction. The distributions obtained by the GPSSMs^1–4 contain the true values within two standard deviations. For a Gaussian prediction distribution, the true value falls within two standard deviations with a probability of about 95.5%. Except for the pendulum model⁴, the prediction distributions of the GPSSM* do not contain the true values within two standard deviations. When the noise of a nonlinear dynamic system increases in the non-laboratory environments, the confidence of the prediction distribution with the GPSSM* is lower than that with the GPSSMs^1–4 optimized by the EM-GP-RTSS. In addition, the mean values of the prediction distributions with the GPSSMs^1–4 are closer to the true values.
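The two-standard-deviation criterion can be checked programmatically. The sketch below is a hypothetical helper, not part of the article's experiments: it computes the fraction of true values covered by the error bars and the nominal Gaussian coverage of such an interval.

```python
import math

def two_sigma_coverage(means, variances, truths):
    """Fraction of true values inside mean +/- 2 standard deviations."""
    hits = sum(abs(t - m) <= 2 * math.sqrt(v) for m, v, t in zip(means, variances, truths))
    return hits / len(truths)

# Nominal probability that a Gaussian variable lies within two standard deviations.
nominal = math.erf(2 / math.sqrt(2))

# Made-up predictions: the first true value is covered, the second is not.
coverage = two_sigma_coverage([0.0, 0.0], [1.0, 1.0], [1.5, 2.5])
print(nominal, coverage)
```

An empirical coverage far below the nominal value, as described for the GPSSM* above, indicates prediction distributions that are biased or overconfident.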

Figure 8.

The one-step prediction distributions with GPSSM* and GPSSM¹ for the pendulum model¹. (a) The prediction distribution of the state transition equation GP model. (b) The prediction distribution of the measurement equation GP model.

Figure 9.

The one-step prediction distributions with GPSSM* and GPSSM² for the pendulum model². (a) The prediction distribution of the state transition equation GP model. (b) The prediction distribution of the measurement equation GP model.

Figure 10.

The one-step prediction distributions with GPSSM* and GPSSM³ for the pendulum model³. (a) The prediction distribution of the state transition equation GP model. (b) The prediction distribution of the measurement equation GP model.

Figure 11.

The one-step prediction distribution with GPSSM* and GPSSM⁴ for the pendulum model⁴. (a) The prediction distribution of the state transition equation GP model. (b) The prediction distribution of the measurement equation GP model.

The GPSSMs^1–4 are then reused for the state estimation problem of the corresponding nonlinear dynamic systems to evaluate the improvement in simulation performance. For each of the GPSSMs^1–4, 100 MC simulations are executed. The true states and the measurements are generated by the pendulum models^1–4, respectively, and the RMSEs of the angle and the angular velocity are calculated. The URTSS with the traditional SSM provides a benchmark for comparison. For the GPSSMs, the GP-RTSS is used to calculate the smoothed estimation.

Figures 12–15 show the RMSEs for the state estimation of the pendulum models^1–4 using the GPSSM*, the GPSSMs^1–4, and the traditional SSMs^1–4, respectively. It can be seen that the optimized GPSSMs^1–4 have lower RMSEs than the GPSSM* in both state dimensions. This shows that the optimized GPSSMs^1–4 simulate the corresponding pendulum models^1–4 better and reduce the model bias introduced by the GPSSM*. Meanwhile, the GP-RTSS with the GPSSMs can achieve or exceed the state estimation accuracy of the URTSS with the traditional SSMs.

Figure 12.

Comparison of state estimation accuracy with GPSSM*, GPSSM¹, and traditional SSM¹ for pendulum model¹ in 100 MC simulations: (a) RMSE of angle and (b) RMSE of angular velocity.

Figure 13.

Comparison of state estimation accuracy with GPSSM*, GPSSM², and traditional SSM² for pendulum model² in 100 MC simulations: (a) RMSE of angle and (b) RMSE of angular velocity.

Figure 14.

Comparison of state estimation accuracy with GPSSM*, GPSSM³, and traditional SSM³ for pendulum model³ in 100 MC simulations: (a) RMSE of angle and (b) RMSE of angular velocity.

Figure 15.

Comparison of state estimation accuracy with GPSSM*, GPSSM⁴, and traditional SSM⁴ for pendulum model⁴ in 100 MC simulations: (a) RMSE of angle and (b) RMSE of angular velocity.

Figure 16 shows the standard deviation of the smoothed estimation at each iteration when the EM-GP-RTSS optimizes the GPSSM* to simulate the pendulum model³. Each iteration of optimizing the GPSSM³ by the EM-GP-RTSS also yields a smoothed estimation, whose error decreases steadily and eventually converges.

Figure 16.

Standard deviation of smoothed estimation by the EM-GP-RTSS at each iteration: (a) standard deviation of angle and (b) standard deviation of angular velocity.

In the simulation experiment, the convergence condition of the EM-GP-RTSS is simply whether the maximum number of iterations has been reached. Setting a loose maximum number of iterations can avoid the EM falling into a local optimum and directly controls the amount of computation. Because it is easy to apply, this stopping rule is widely used when solving the MLE problem by the EM.^11,13
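The stopping rule described above can be illustrated with a small sketch. It is not the EM-GP-RTSS itself: `run_em`, `e_step`, and `m_step` are hypothetical names, and the toy problem (estimating a Gaussian mean when some samples are missing) merely stands in for the smoothing pass and the maximization of the CEF. The only termination test is the iteration count.

```python
def run_em(theta0, e_step, m_step, max_iter=50):
    """Run EM for exactly max_iter iterations (the only stopping rule)."""
    theta = theta0
    history = [theta]
    for _ in range(max_iter):
        stats = e_step(theta)   # stand-in for the smoothing pass (E-step)
        theta = m_step(stats)   # stand-in for maximizing the CEF (M-step)
        history.append(theta)
    return theta, history


# Toy data: samples from a Gaussian with unknown mean; None marks a
# missing sample, which plays the role of the unobserved state.
observed = [1.0, 2.0, None, 3.0, None]


def e_step(mu):
    # Replace each missing sample by its expected value under the current mean.
    return [mu if y is None else y for y in observed]


def m_step(completed):
    # Maximum likelihood estimate of the mean from the completed data.
    return sum(completed) / len(completed)


mu_hat, history = run_em(0.0, e_step, m_step, max_iter=50)
```

Recording `history` allows the per-iteration behaviour to be inspected afterwards, in the same spirit as plotting a quantity at each iteration of the optimization.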

Conclusion and future works

In this article, a novel optimization algorithm for the GPSSM of nonlinear dynamic systems is proposed by combining the EM algorithm with the GP-RTSS algorithm, that is, the EM-GP-RTSS algorithm.

The main contributions are as follows. First, according to the concept of the GP, a theoretical formulation of the GPSSM is proposed, which is not found in previous references. This formulation provides a rigorous mathematical basis for the development of the EM-GP-RTSS algorithm. Second, according to the maximum likelihood estimation (MLE) criterion, a GPSSM optimization framework with the EM algorithm is proposed. In the EM algorithm, the unknown system state is treated as the missing data, and the maximization of the measurement likelihood function is transformed into that of the CEF. Then, the GP-ADF algorithm and the GP-RTSS algorithm are proposed with the GPSSM defined in this article, in order to calculate the smoothed distribution in the CEF. Finally, the MC numerical integral method is used to obtain the approximate expression of the CEF.

In the simulation experiment, a noise pendulum model is used as a representative nonlinear dynamic system to verify the effectiveness of the EM-GP-RTSS algorithm proposed in this article. The noise pendulum model⁰ is selected as the nonlinear system in the laboratory, whose true states are available, and a GPSSM* is trained to simulate it. The EM-GP-RTSS algorithm is then used to optimize the GPSSM* for simulating four pendulum models^1–4, which yields the GPSSMs^1–4. The experimental results demonstrate that the GPSSMs^1–4 can estimate the system state of the pendulum models^1–4 more accurately than the GPSSM* and can reach the estimation accuracy of the traditional SSM. The one-step prediction distributions from the GPSSMs^1–4 have higher confidence than those of the GPSSM*. The EM-GP-RTSS algorithm helps to extend the application scope of the GPSSM from a laboratory environment to a practical non-laboratory environment.

In future research, first, a linear kernel function could be designed to develop a corresponding smoothing algorithm for the GPSSMs. A smoothing algorithm based on importance sampling methods can be used to obtain the nonlinear non-Gaussian smoothed estimation. Second, the computational complexity must be balanced against the accuracy by selecting an appropriate number of particles in the MC numerical integral method. However, the relationship between the number of particles and the accuracy is difficult to determine. As an alternative to the MC numerical integration method, fixed-point sampling numerical integration methods, such as those based on the unscented transform or the spherical-radial cubature rule, may be adopted. The advantage of fixed-point sampling numerical integration methods is that they require fewer sampling points than the MC numerical integration method. We will therefore study how to use high-degree fixed-point sampling numerical integration methods to obtain high-accuracy approximations for formulas (51) and (52).
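The difference between the two families of integration methods can be illustrated on a scalar toy problem. The sketch below is not an approximation of formulas (51) and (52); it only compares a three-point unscented transform with plain MC sampling when approximating E[g(x)] for x ~ N(mean, std²). The function names, the choice kappa = 2, and the test function g(x) = x² are assumptions made for illustration.

```python
import math
import random


def ut_expectation(g, mean, std, kappa=2.0):
    """Approximate E[g(x)], x ~ N(mean, std^2), with 3 fixed sigma points."""
    n = 1  # state dimension of this scalar toy problem
    spread = math.sqrt(n + kappa) * std
    points = [mean, mean + spread, mean - spread]
    weights = [kappa / (n + kappa), 0.5 / (n + kappa), 0.5 / (n + kappa)]
    return sum(w * g(x) for w, x in zip(weights, points))


def mc_expectation(g, mean, std, n_samples, seed=0):
    """Approximate the same expectation with n_samples random particles."""
    rng = random.Random(seed)
    total = sum(g(rng.gauss(mean, std)) for _ in range(n_samples))
    return total / n_samples


def square(x):
    return x * x


mean, std = 1.0, 0.5
exact = mean ** 2 + std ** 2  # closed form of E[x^2] for a Gaussian
ut_value = ut_expectation(square, mean, std)                   # 3 evaluations
mc_value = mc_expectation(square, mean, std, n_samples=10000)  # 10000 evaluations
ut_error = abs(ut_value - exact)
mc_error = abs(mc_value - exact)
```

For this polynomial integrand the three sigma points reproduce the exact value up to rounding, while the MC error shrinks only slowly as the number of particles grows. For a general nonlinear integrand neither method is exact, which is why higher-degree rules are of interest.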

Footnotes

Handling Editor: James Brusey

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the National Natural Science Foundation of China (No. 61472441) and the Equipment Prophecy Foundation of China (No. 61403110304).

ORCID iD

Haiyan Yang

References

1. Kim, Park, Heo, et al. Characterizing dynamic walking patterns and detecting falls with wearable sensors using Gaussian process methods. Sensors 2017; 17(5): 1172.

2. Liu, Meng, Own C-M. Gaussian process regression plus method for localization reliability improvement. Sensors 2016; 16(8): 1193.

3. Lawrence. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. J Mach Learn Res 2005; 6: 1783–1816.

4. Fox. Learning GP-BayesFilters via Gaussian process latent variable models. Autonom Rob 2011; 30(1): 3–23.

5. Frigola, Lindsten, Schön. Bayesian inference and learning in Gaussian process state-space models with particle MCMC. In: Proceedings of the advances in neural information processing systems, Lake Tahoe, NV, 5–10 December 2013, pp.3156–3164. New York: ACM.

6. Frigola, Chen, Rasmussen. Variational Gaussian process state-space models. In: Proceedings of the advances in neural information processing systems, Montreal, QC, Canada, 8–13 December 2014, pp.3680–3688. New York: ACM.

7. Eleftheriadis, Nicholson TFW, Deisenroth, et al. Identification of Gaussian process state space models. In: Proceedings of the advances in neural information processing systems, Long Beach, CA, 4–9 December 2017, pp.5309–5319. New York: ACM.

8. Moon. The expectation-maximization algorithm. IEEE Sig Proc Mag 1996; 13: 47–60.

9. Särkkä. Bayesian filtering and smoothing. Cambridge: Cambridge University Press, 2013.

10. Gibson, Wills, Ninness. Maximum-likelihood parameter estimation of bilinear systems. IEEE Trans Auto Control 2005; 50(10): 1581–1596.

11. Schön, Wills, Ninness. System identification of nonlinear state-space models. Automatica 2011; 47(1): 39–49.

12. Kerrigan. Noise covariance identification for nonlinear systems using expectation maximization and moving horizon estimation. Automatica 2017; 77: 336–343.

13. Dreano, Tandeo, Pulido, et al. Estimating model-error covariances in nonlinear state-space models using Kalman smoothing and the expectation-maximization algorithm. Quart J Roy Meteorol Soc 2017; 143(705): 1877–1885.

14. Wang, Song, Liang. EM-based adaptive divided difference filter for nonlinear system with multiplicative parameter. Int J Robust Nonlin 2017; 27: 2167–2197.

15. Seeger. Gaussian processes for machine learning. Int J Neu Syst 2004; 14(2): 69–106.

16. Fox. GP-BayesFilters: Bayesian filtering using Gaussian process prediction and observation models. Autonom Rob 2009; 27(1): 75–90.

17. Deisenroth, Huber, Hanebeck. Analytic moment-based Gaussian process filtering. In: Proceedings of the 26th annual international conference on machine learning, Montreal, QC, Canada, 14–18 June 2009, pp.225–232. New York: ACM.

18. Kleint, Fox, et al. GP-UKF: unscented Kalman filters with Gaussian process prediction and observation models. In: Proceedings of the 2007 IEEE/RSJ international conference on intelligent robots and systems, San Diego, CA, 29 October–2 November 2007, pp.1901–1907. New York: IEEE.

19. Deisenroth, Turner, Huber, et al. Robust filtering and smoothing with Gaussian processes. IEEE Trans Automat Control 2012; 57(7): 1865–1871.

20. Särkkä. Unscented Rauch–Tung–Striebel smoother. IEEE Trans Automat Control 2008; 53(3): 845–849.

21. Yuan Y-X. A modified BFGS algorithm for unconstrained optimization. IMA J Num Anal 1991; 11(3): 325–332.