Abstract
Online modeling and prediction of data streams is an important research direction in the field of data mining. In practical applications, data streams are often nonstationary and contain outliers, so an online learning algorithm with both dynamic tracking capability and anti-outlier capability is urgently needed. With this in mind, this paper proposes a novel robust adaptive online sequential extreme learning machine (RA-OSELM) algorithm for the online modeling and prediction of nonstationary data streams with outliers. The RA-OSELM is developed from the well-known online sequential extreme learning machine algorithm, but it uses a more robust M-estimation loss function in place of the conventional least square loss function to suppress incorrect online updates caused by outliers, thereby enhancing its robustness in the presence of outliers. Moreover, the RA-OSELM adopts a variable forgetting factor method to automatically track the dynamic changes of nonstationary data streams and promptly eliminate the negative impact of outdated data, so it tends to produce satisfactory tracking results in nonstationary environments. The performance of RA-OSELM is evaluated and compared with other representative algorithms on synthetic and real data sets, and the experimental results indicate that the proposed algorithm has better adaptive tracking capability and stronger robustness than its counterparts for predicting nonstationary data streams with outliers.
Introduction
The analysis and prediction of data streams is an important issue in the data mining community, and it has been widely applied in various fields, such as business data analysis, traffic flow prediction, industrial process monitoring, and network intrusion detection.1,2 In these situations, the data are usually generated successively in the form of a data stream, and an online learning algorithm is more desirable than a batch learning algorithm for dealing with the streaming data, since the online learner can update in an incremental fashion without the need for retraining whenever a new sample arrives.1,3
Among the existing online learning algorithms, the online sequential extreme learning machine (OSELM) 4 is an emerging and practical one. OSELM is developed on the basis of the interpolation theory and approximation theorem of the extreme learning machine (ELM), 5 and its core idea is to transform the training process of single hidden layer feedforward neural networks (SLFNs) into an equivalent problem of solving linear equations, using a recursive least squares (RLS) method to calculate the output weights recursively so as to implement the online update of the learning model. By virtue of the powerful approximation ability of ELM and the concise recursive calculation of RLS, OSELM can realize online incremental learning of sequential samples accurately and efficiently. Compared with other popular online learning algorithms, OSELM has many advantages, such as fast learning speed, strong generalization ability and simple implementation, which make it very suitable for engineering applications requiring online learning. In view of its good theoretical foundation and superior online learning ability, OSELM has attracted great interest since it was proposed, and it has achieved successful applications in the field of data streams modeling and prediction, such as rainfall forecasting, 6 wind speed forecasting, 7 streamflow prediction, 8 time series prediction,9,10 time-varying system prediction,11,12 etc.
In practical data streams applications, data samples are usually acquired by sensor devices or manual methods. Due to instrument degradation, mechanical faults, human error and other causes, there will inevitably be a small number of outliers in the collected data streams. Outliers in data streams may seriously affect the accuracy of the modeling process and result in poor predictions. Several works have been done to improve the robustness of OSELM in the presence of outliers. On the basis of bi-objective optimization theory, Huynh et al. 13 proposed a regularized OSELM algorithm (R-OSELM) to deal with noisy data sets. By using the Tikhonov regularization method, the R-OSELM tends to generate a learning model with good generalization performance, and thus enhances the immunity of the model to noise and outliers. However, this algorithm is not a targeted solution for the outliers problem, and its resistance to outliers is not satisfactory. In Sun et al., 14 an empirical survival error potential based sequential extreme learning algorithm (ESEP-ELM) incorporating an integrated noise compensation mechanism was developed and used for predicting noisy chaotic time series. In comparison with the original OSELM, the ESEP-ELM has higher computational efficiency and stability with much better immunity to outliers, but its learning accuracy is not as good as that of OSELM. In recent years, researchers have put forward multiple robust ELM algorithms against outliers by combining different optimization learning methods,15–18 but these algorithms are exclusively of the batch learning type and are not applicable in the online environment.
Moreover, data streams in the real world are often nonstationary and exhibit dynamic behaviors; that is to say, the internal variation trends of the streaming data often evolve with time,19,20 and accordingly, the new data are more helpful for depicting the real-time status of the corresponding target system, while the out-of-date data may be useless or even become noise that harms the learning model. 21 Therefore, in the online modeling process of nonstationary data streams, we should attach more importance to the recent data than to the remote ones and promptly eliminate the negative impacts of the outdated data, so as to better track the dynamic changes of the nonstationary data streams. In order to enhance the tracking capability of OSELM in a nonstationary environment, the concept of the forgetting factor was introduced into OSELM and an improved algorithm named F-OSELM was proposed.22,23 Different from the original OSELM, which treats all samples coequally, the F-OSELM assigns a forgetting factor less than one as a weight coefficient to the old samples to weaken their roles and indirectly enhance the contributions of the new samples, so as to better track the real-time status of the nonstationary system. On this basis, the generalized regularization technique was further incorporated into the F-OSELM to solve its potential ill-posed problem, and a new GRF-OSELM algorithm was developed. 24 Compared with F-OSELM, the GRF-OSELM has better stability and reliability by avoiding the potential ill-conditioned matrix inverse, which makes it more practicable in real applications. The above studies22–24 have shown that the forgetting factor is indeed an effective approach for tracking the intrinsic changes of nonstationary systems, while in some complex nonstationary environments, a variable forgetting factor (VFF) strategy is usually a more attractive choice than a fixed forgetting factor for global adaptivity.
Therefore, a variety of VFF-based OSELM algorithms25–30 have been put forward in recent years. The online prediction error is the most direct and effective indicator of the dynamic changes of a nonstationary system, so many scholars directly take the online prediction error as the core observation variable and design some kind of correction function of the prediction error to dynamically adjust the size of the forgetting factor.25–29 These VFF methods are very concise and efficient, with almost no additional computational overhead, but their parameter settings require meticulous tuning for different problems, which makes them difficult to apply in practice. In Soares and Araújo, 30 the authors proposed a new directional forgetting factor based OSELM algorithm (DFF-OSELM), in which the old data samples are forgotten along a certain direction. It has been demonstrated by experiments that the DFF-OSELM is effective for dynamic system modeling and prediction, but this algorithm also has many control parameters that need to be set manually and cautiously.
For the outliers problem and the dynamic tracking problem in data streams modeling stated above, individual solutions have been presented in our previous works10,12 to cope with them separately. However, in actual applications these two problems often occur simultaneously, and a comprehensive online learning algorithm with both outlier resistance and dynamic tracking ability is much needed. With this in mind, this paper proposes a novel robust adaptive OSELM algorithm (RA-OSELM) that integrates the merits of our previous works10,12 under a unified online learning framework, so as to address the above two problems simultaneously. Different from the original OSELM, which adopts the conventional least square learning criterion, the RA-OSELM employs a more robust M-estimation loss function to train the learning model, so as to reduce the sensitivity of the model to outliers. Meanwhile, a new variable forgetting factor method based on error gradient descent is derived and incorporated into the RA-OSELM to further enhance the dynamic tracking ability and adaptivity of the algorithm in nonstationary environments. The performance of the proposed RA-OSELM algorithm is illustrated with both synthetic and real data streams, and the experimental results show that the RA-OSELM has better adaptive tracking capability and stronger robustness than other representative algorithms for predicting nonstationary data streams with outliers.
The remainder of this paper is outlined as follows. The Preliminaries section reviews the preliminary knowledge of this paper. The Proposed RA-OSELM section explains the details of our proposed RA-OSELM. The Experiments section presents the comparison experiments for validating the performances of the RA-OSELM. Finally, the Conclusions section concludes this paper.
Preliminaries
OSELM algorithm
OSELM 4 is an online implementation of the batch ELM algorithm. 5
For a given data set with N different samples
The above N equations in equation (2) can also be compactly written as
Huang et al. 5 have mathematically proved that ELM with random hidden nodes has universal approximation capability, and that the parameters of the hidden nodes of ELM can be randomly assigned and remain unchanged; learning an ELM is then equivalently transformed into finding a least square solution
On the basis of ELM, Liang et al. 4 developed an online version of ELM to meet the needs of online learning. The learning process of OSELM includes two phases: an initialization phase and an online learning phase.
In the initialization phase, the standard ELM algorithm is used to initialize the OSELM. Given an initial training data set
In the online learning phase, the RLS algorithm is employed to update the output weights constantly. Upon the arrival of another chunk of data
Especially, when the k-th data chunk
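A minimal runnable sketch of this two-phase scheme (random sigmoid hidden layer, least-squares initialization, per-sample RLS update) is given below; the class name, the array shapes, and the small ridge term added for numerical stability are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

class OSELM:
    """Sketch of OSELM: random sigmoid hidden layer + RLS output-weight update."""

    def __init__(self, n_inputs, n_hidden, rng=None):
        rng = np.random.default_rng(rng)
        self.W = rng.uniform(-1, 1, (n_hidden, n_inputs))  # random input weights (fixed)
        self.b = rng.uniform(-1, 1, n_hidden)              # random biases (fixed)
        self.beta = None   # output weights, learned
        self.P = None      # inverse correlation matrix for RLS

    def _hidden(self, X):
        # Sigmoid additive hidden-layer output matrix H
        return 1.0 / (1.0 + np.exp(-(X @ self.W.T + self.b)))

    def init_phase(self, X0, y0):
        """Initialization phase: batch least squares on the initial chunk."""
        H0 = self._hidden(X0)
        # Small ridge term (assumption) avoids an ill-conditioned inverse
        self.P = np.linalg.inv(H0.T @ H0 + 1e-4 * np.eye(H0.shape[1]))
        self.beta = self.P @ H0.T @ y0

    def update(self, x, y):
        """Online learning phase: RLS update for one new sample (x, y)."""
        h = self._hidden(x.reshape(1, -1))
        Ph = self.P @ h.T
        k = Ph / (1.0 + h @ Ph)                    # RLS gain vector
        self.beta = self.beta + k @ (y - h @ self.beta)
        self.P = self.P - k @ h @ self.P

    def predict(self, X):
        return self._hidden(X) @ self.beta
```

The gain-vector form of the update avoids any matrix inversion after the initialization phase, which is what makes the online phase cheap.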
M-OSELM algorithm
As described above, the OSELM algorithm adopts least square as its learning criterion. Nevertheless, the least square method is easily affected by outliers and prone to generating overfitted models, causing the performance of OSELM to degenerate seriously in the presence of outliers. M-estimation is one of the most popular methods for handling outlier problems. In order to deal with sequential data with outliers, an M-estimation based OSELM algorithm (M-OSELM) has been proposed in our previous work. 10
The objective function of M-OSELM is written as
Compared with the original OSELM, M-OSELM uses an M-estimation loss function to replace the traditional least square learning criterion, so as to enhance its robustness to outliers. It is easy to see that outliers in streams may produce large perturbation errors and affect the accuracy of the obtained learning model. Different from the least square method, which minimizes the quadratic sum of all errors uniformly, the M-estimation technique uses moderate loss functions to mitigate the large perturbations when the errors exceed a reasonable range, and hence reduces the influence of outliers on the learning model.
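This contrast can be made concrete with a small numeric check; the Huber form and the tuning constant below are common illustrative choices, not necessarily the exact loss function used in the paper:

```python
import numpy as np

def squared_loss(e):
    """Least square criterion: grows quadratically, so outliers dominate the fit."""
    return e ** 2

def huber_loss(e, c=1.345):
    """Huber-type M-estimation loss: quadratic for small errors, linear beyond c."""
    a = np.abs(e)
    return np.where(a <= c, 0.5 * a ** 2, c * a - 0.5 * c ** 2)

errors = np.array([0.1, 0.5, 1.0, 5.0, 20.0])  # the last two mimic outliers
print(squared_loss(errors))  # quadratic growth: 400.0 for e = 20
print(huber_loss(errors))    # linear growth: about 26.0 for e = 20
```

The moderate (linear) tail is exactly what keeps a single outlier from dominating the online update.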
GRAF-OSELM algorithm
For nonstationary systems, the variation trends of the systems often evolve over time. Accordingly, the new samples are more helpful to depict the real-time status of the nonstationary systems and the out-of-date samples may be useless or even become noises to harm the learning model. However, during OSELM's online learning process, it just constantly adds new learning samples into the learner and learns them sequentially, but does nothing to the old samples, which greatly limits its tracking performance in nonstationary environments.
To better express the timeliness of the dynamic data and further improve the prediction accuracy of OSELM for nonstationary systems, an improved OSELM algorithm with generalized regularization and adaptive forgetting factor (GRAF-OSELM) has been proposed in our previous work. 12
The objective function of GRAF-OSELM is written as
Different from the OSELM which treats all the training samples equally, the GRAF-OSELM assigns a different forgetting factor as weight to each sample according to the changes of the residual error in the online learning process, so as to reflect the different contributions of the new samples and the old samples to the learning model, thus improving its dynamic tracking ability to nonstationary systems. Moreover, different from other forgetting factor based OSELM algorithms22,23 which commonly use a traditional exponential forgetting regularization, the GRAF-OSELM adopts a new generalized regularization approach to make the algorithm have a constant regularization effect and a persistent stability in all the online learning stages.
Proposed RA-OSELM
The learning model of RA-OSELM
As described in the Preliminaries section, our previous works presented an M-OSELM algorithm 10 and a GRAF-OSELM algorithm 12 to handle the outliers problem and the nonstationary system modeling problem, respectively. However, in practical data streams applications, the streaming data are often nonstationary and contain outliers, and a comprehensive online learning algorithm with both outlier resistance and dynamic tracking ability is urgently needed but still lacking. With this in mind, this paper proposes a novel robust adaptive OSELM algorithm (RA-OSELM) that integrates the merits of our previous works10,12 so as to address the above two problems simultaneously. The objective function of RA-OSELM is written as
It can be seen that when the forgetting factor
Recursive solution of the RA-OSELM model
Next, we will solve the above RA-OSELM model and try to derive a recursive formula of the output weights for online learning. Differentiating equation (13) with respect to
Using the Sherman-Morrison-Woodbury formula, 31 the inverse matrices of
Let
Define
Also, equation (24) can be transformed and rewritten as
Combining equations (14) to (16), we can finally obtain the recursive formula for updating
Note that in equation (28), the calculation of
Online outlier detection for RA-OSELM
In this study, we adopt a modified Huber function as the M-estimation function of RA-OSELM. This Huber function is a piecewise function described below
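The exact modified form is given by the piecewise equation; as a hedged sketch, a standard (unmodified) Huber influence function and its associated weight function, which downweights large residuals in the reweighted solution, look like this. The constant c = 1.345 is a conventional choice, not necessarily the paper's:

```python
import numpy as np

def huber_psi(e, c=1.345):
    """Influence function: identity for small errors, clipped to +/- c beyond that."""
    return np.clip(e, -c, c)

def huber_weight(e, c=1.345):
    """Weight w(e) = psi(e)/e: 1 inside [-c, c], decaying as c/|e| outside."""
    e = np.asarray(e, dtype=float)
    w = np.ones_like(e)
    mask = np.abs(e) > c
    w[mask] = c / np.abs(e[mask])
    return w
```

Because the weight decays as c/|e|, a sample whose residual is ten times the threshold contributes only a tenth of the usual weight to the update.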
According to the statistical analysis and central limit theorem, the estimation error of the learning model without outliers can be simply assumed to be normally distributed with mean zero and variance
The standard deviation
Adaptive forgetting scheme for RA-OSELM
Previous studies have proven that the forgetting factor method is effective for nonstationary system modeling and prediction.22–24 However, in practical applications the variation rates of the nonstationary systems are often irregular. In this situation, the preset fixed forgetting factor may not guarantee the global adaptability to the dynamic changes of nonstationary systems, and a variable forgetting factor (VFF) method is generally a more attractive choice. Up to now, several VFF-based OSELM algorithms25–30 have been developed and successfully applied to practical applications, but these methods still have some shortcomings such as complex parameter setting and poor practicability. In this section, a new variable forgetting factor method based on error gradient descent is derived to accommodate the proposed RA-OSELM, by which the forgetting factor can be automatically tuned without need for meticulous parameter setting.
With the strategy of adaptive forgetting, the forgetting factor in RA-OSELM is updated adaptively according to the changes of the online prediction errors so as to adapt itself to nonstationary environment. With this in mind, the goal is to compute the forgetting factor
Let
By moving and merging similar items, the
Perform the matrix inverse calculation on the last term of equation (36) with the Sherman-Morrison formula, 31 and we further get
Combining with equation (25), we have
The
Substituting equation (39) into equation (38), we obtain
Provided that we obtain the gradient
Adopting the instantaneous estimation
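Whatever the exact gradient expression derived above, the resulting update has a simple generic shape: a gradient-descent step on the forgetting factor, clipped to a valid interval. The step size and the bounds below are illustrative assumptions, not the paper's derived quantities:

```python
def update_forgetting_factor(lam, grad, step=0.005, lam_min=0.9, lam_max=1.0):
    """One gradient-descent step on the forgetting factor.

    lam  : current forgetting factor
    grad : instantaneous estimate of d(e^2)/d(lambda), as derived in the text
    """
    lam = lam - step * grad
    # Clip to keep lambda in a sensible range (bounds are illustrative)
    return min(max(lam, lam_min), lam_max)
```

When the prediction error grows (positive gradient), the factor shrinks and old data are forgotten faster; when the error is stable, the factor drifts back toward one.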
Differences among RA-OSELM, M-OSELM and GRAF-OSELM
In this study, we consult our previous works10,12 and absorb some good ideas from them. Nevertheless, compared with these two previous works, many improvements and innovations have been made in this paper to accommodate the new and more complex scenarios considered here. The main differences among the proposed RA-OSELM algorithm and our previous M-OSELM 10 and GRAF-OSELM 12 algorithms can be summarized as follows.
The research purposes of each paper are different. Guo et al. 10 aim to provide a robust online learning algorithm for predicting chaotic time series with outliers, and Guo et al. 12 aim to provide a stable and adaptive online learning algorithm for time-varying system prediction, while this paper aims to provide a comprehensive online learning algorithm with both robustness and adaptive tracking capability for predicting nonstationary data streams with outliers. In brief, this paper provides a complete solution by integrating the merits of our previous works,10,12 so as to address the outliers problem and the dynamic tracking problem simultaneously.
The objective functions of each algorithm are different. Compared with M-OSELM (equation (11)), an extra forgetting factor is added in RA-OSELM (equation (13)) to better depict the dynamic behaviors of the nonstationary data streams; compared with GRAF-OSELM (equation (12)), an M-estimation function instead of the traditional least square function is used in RA-OSELM (equation (13)) to enhance its robustness to outliers.
The obtained recursive updating formulas of each algorithm are different. For details, see equations (23) to (26) and (28) in this paper, equation (26) in Guo et al., 10 and equation (60) in Guo et al. 12 Comparing these equations, we can deduce that if the forgetting factor in RA-OSELM is equal to one, its recursive updating formula degenerates to equation (26), and if the adopted M-estimation function in RA-OSELM is the ordinary least squares function, its recursive updating formula degenerates to equation (60). This further certifies that the proposed RA-OSELM algorithm is a unified online learning framework encompassing our previous two algorithms.
The outlier detection methods of RA-OSELM and M-OSELM are similar but not the same. In M-OSELM a common sliding-window strategy is used to estimate the error standard deviation (see equation (29) in Guo et al. 10 ), while in RA-OSELM we adopt the forgetting factor as a smoothing coefficient and estimate the error standard deviation in a recursive fashion (see equation (31) in this paper), so as to suit the change characteristics of the nonstationary environment.
The forgetting factor adjustment schemes of RA-OSELM and GRAF-OSELM are similar but not the same. In section 3.4 of this paper, the item
In conclusion, though this paper absorbs some good ideas from Guo et al.,10,12 it is not just a simple combination of our previous works; many new issues have been reconsidered and many improvements have been made. In fact, the main thought of this paper is to provide a complete solution under a unified learning framework, so as to simultaneously address the above-mentioned outliers problem and dynamic tracking problem, which were only considered separately in our previous works.10,12
Experiments
In this section, a simulative nonstationary data set, a parameter varying chaotic time series data set and a real stock price data set with dynamic behaviors are used to evaluate the performances of the proposed RA-OSELM algorithm. For comparison, the R-OSELM, 13 M-OSELM, 10 GRF-OSELM, 24 GRAF-OSELM 12 and DFF-OSELM 30 are employed as contrast algorithms and their experimental results are also given.
Data set description
The first data set is an artificial data set containing 2200 samples. For each sample, the input is composed of four random variables that are normally distributed with mean zero and standard deviation 0.1, and the output is the inner product of the input vector and the corresponding coefficient vector. To simulate the dynamic feature of nonstationary data streams, the coefficient vector for each sample evolves continuously as time goes on. 36 To be specific, the coefficient vector is first initialized as
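A stream of this kind can be generated along the following lines; the initial coefficient vector and the random-walk drift rule are illustrative assumptions rather than the paper's exact coefficient schedule:

```python
import numpy as np

def make_drifting_stream(n=2200, d=4, drift=0.001, seed=0):
    """Synthetic nonstationary stream: y_t = x_t . w_t with a slowly evolving w_t."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=d)                  # illustrative initial coefficient vector
    X, y = np.empty((n, d)), np.empty(n)
    for t in range(n):
        X[t] = rng.normal(0.0, 0.1, d)      # inputs: N(0, 0.1^2), as in the text
        y[t] = X[t] @ w
        w = w + rng.normal(0.0, drift, d)   # assumed random-walk drift of coefficients
    return X, y
```

Because the coefficients never stop drifting, a learner without forgetting accumulates stale information and its error grows with the prediction horizon.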
The second data set is a Logistic time series with time-varying characteristics. The Logistic time series has been widely used in dynamic system modeling due to its chaotic and dynamic nature, and here a parameter varying Logistic time series described in equation (43) is taken as an example to validate the tracking performance of the proposed RA-OSELM for dynamic systems. 37
As seen from equation (43), the system parameter
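For reference, a logistic map with a slowly drifting parameter can be simulated as follows; the drift range of the parameter here is an illustrative assumption, not the schedule defined in equation (43):

```python
import numpy as np

def logistic_series(n=2200, x0=0.3):
    """Parameter-varying logistic map: x_{t+1} = a_t * x_t * (1 - x_t)."""
    x = np.empty(n)
    x[0] = x0
    for t in range(n - 1):
        a = 3.6 + 0.4 * t / (n - 1)         # assumed slow linear drift of the parameter
        x[t + 1] = a * x[t] * (1.0 - x[t])
    return x
```

As the parameter drifts through the chaotic regime, the statistical character of the series changes over time, which is exactly what makes it a useful tracking benchmark.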
The third data set is a real stock price data stream from the financial field. Because the factors affecting stock prices often change over time, stock prices usually exhibit nonstationary behaviors, which makes this data set very appropriate for evaluating the tracking performance of the proposed method. The stock price data set is downloaded from https://www.pmel.noaa.gov/tao/drupal/disdel/, and the first 700 data points are chosen for the experiment. For this data set, the time delay and embedding dimension for phase space reconstruction are selected as 1 and 15, respectively. 21
Experimental setup
In our experiments, the R-OSELM, M-OSELM, GRF-OSELM, GRAF-OSELM, DFF-OSELM and RA-OSELM are conducted to model and predict the above three nonstationary data streams in an online mode. To be more specific, the data are scheduled to enter the learners one by one in the form of a stream after the initialization phase, and the learners then make online predictions for each input continuously while executing an online update of themselves at the same time. For comparison, the experimental results reported for each case are the mean of 30 independent experiments, and the root mean squared error (RMSE) is adopted as the evaluation criterion. For all six OSELM-based algorithms, the same sigmoidal additive activation function is used; the number of hidden nodes n is set as 50 and 100, respectively, in the experiments, and the corresponding number of initial training samples is uniformly taken as 200.
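The one-by-one predict-then-update protocol described above can be sketched as a prequential evaluation loop; the model interface names (`init_phase`, `predict`, `update`) are assumptions for illustration, not the authors' code:

```python
import numpy as np

def prequential_rmse(model, X, y, n_init=200):
    """Predict-then-update evaluation: each sample is first predicted, then learned.

    `model` is any object exposing init_phase(X0, y0), predict(X), update(x, y)
    (an assumed OSELM-style interface).
    """
    model.init_phase(X[:n_init], y[:n_init])
    errs = []
    for i in range(n_init, len(X)):
        errs.append(float(model.predict(X[i:i + 1])[0]) - float(y[i]))
        model.update(X[i], y[i])
    return float(np.sqrt(np.mean(np.square(errs))))
```

Because every sample is predicted before it is learned, the resulting RMSE measures genuine online tracking performance rather than in-sample fit.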
In addition, other parameter settings for each algorithm in each case are as follows. In the first experiment, for R-OSELM,
Experimental results and analysis
Comparison of tracking capability
We first verify the tracking performances of the six online algorithms in tracking the dynamic changes of the nonstationary data streams. Tables 1 to 3 give the mean and SD (standard deviation) of the online prediction RMSE over 30 independent experiments for the six algorithms conducting different numbers of prediction steps on the above three nonstationary data sets. From Table 1, it can be seen that the prediction errors of R-OSELM and M-OSELM increase gradually with the prediction steps, and they are much larger than those of the other algorithms in the same situations. In contrast, the four forgetting factor-based algorithms, GRF-OSELM, GRAF-OSELM, DFF-OSELM and RA-OSELM, can promptly eliminate the influence of outdated samples and better track the real-time status of the nonstationary data streams, so they obtain much smaller prediction RMSEs than R-OSELM and M-OSELM. Furthermore, we can see that the proposed RA-OSELM, equipped with the variable forgetting scheme, always achieves the best or nearly the best results among the contrast algorithms and exhibits good tracking performance. The experimental results in Table 2 are similar to those in Table 1, and in this simulation the proposed RA-OSELM algorithm and our previous GRAF-OSELM algorithm achieve the best prediction results at an equal accuracy level. It is worth noting that the DFF-OSELM does not behave well in this experiment, which may be because the parameters recommended by Soares and Araújo 30 are not appropriate for this data set. In the last experiment, the corresponding results in Table 3 also show that the RA-OSELM and GRAF-OSELM achieve better prediction performance than their counterparts, demonstrating their superiority in tracking in real applications.
In short, the proposed RA-OSELM algorithm has the same good tracking capability as our previous GRAF-OSELM algorithm and they can generally provide more accurate prediction results than other representative algorithms for predicting nonstationary data streams.
Mean and SD of the online prediction RMSE of the six algorithms on simulative nonstationary data streams.
The boldface values display the best results in each case.
Mean and SD of the online prediction RMSE of the six algorithms on time-varying time series.
The boldface values display the best results in each case.
Mean and SD of the online prediction RMSE of the six algorithms on stock price.
The boldface values display the best results in each case.
Comparison of robustness
To further verify and compare the robustness of the six algorithms in the presence of outliers, some outliers are added to the above three data sets, and another group of contrast tests is performed on each contaminated data set. For the first and second data sets, four outliers are added to replace the original samples at points 300, 800, 1300 and 1800; for the third data set, the outliers are added in the same way at points 100, 200, 300 and 400. Figures 1 to 3 present typical graphs of the online prediction error between the real value and the predicted value for the six contrast algorithms on the three nonstationary data sets with outliers. As shown in the figures, the R-OSELM, GRF-OSELM, GRAF-OSELM and DFF-OSELM are very sensitive to outliers: once outliers occur in the streaming data, these algorithms conduct an incorrect update calculation using the abnormal outliers and consequently produce biased online prediction models, so the prediction errors in the subsequent prediction steps increase significantly or fluctuate dramatically. In contrast, by adopting the M-estimation technique, the proposed RA-OSELM algorithm can effectively avoid the negative effects of the outliers in the online update process and generate a more accurate fitting model, so the prediction errors of the RA-OSELM are very stable (Figure 1(f) and Figure 2(f)) or fluctuate in a normal range (Figure 3(f)) during the whole online prediction process, except at the occurrence points of the outliers. From the figures we can also see that although the M-OSELM algorithm demonstrates similar robustness to outliers as the proposed RA-OSELM, its online prediction errors are much larger than those of the RA-OSELM due to its lack of tracking capability.
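The contamination procedure can be sketched as follows; the spike magnitude and the alternating sign pattern are illustrative assumptions, not the paper's exact injection rule:

```python
import numpy as np

def inject_outliers(y, positions, magnitude=5.0):
    """Replace the samples at the given positions with large spikes.

    The spike size (a fixed multiple of the series' scale) is an illustrative choice.
    """
    y = np.asarray(y, dtype=float).copy()
    scale = max(float(np.std(y)), 1.0)       # guard against a flat series
    for j, p in enumerate(positions):
        sign = 1.0 if j % 2 == 0 else -1.0   # alternate spike direction
        y[p] = sign * magnitude * scale
    return y
```

Replacing (rather than perturbing) the original samples matches the description above and makes the contamination points unambiguous when reading the error plots.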

The typical graphs of the online prediction error on simulative nonstationary data streams with outliers. (a) R-OSELM; (b) GRF-OSELM; (c) GRAF-OSELM; (d) DFF-OSELM; (e) M-OSELM; (f) RA-OSELM.

The typical graphs of the online prediction error on time-varying time series with outliers. (a) R-OSELM; (b) GRF-OSELM; (c) GRAF-OSELM; (d) DFF-OSELM; (e) M-OSELM; (f) RA-OSELM.

The typical graphs of the online prediction error on stock price with outliers. (a) R-OSELM; (b) GRF-OSELM; (c) GRAF-OSELM; (d) DFF-OSELM; (e) M-OSELM; (f) RA-OSELM.
Combining the above two aspects together, we can conclude that the proposed RA-OSELM has good tracking capability with strong robustness and it can achieve more accurate and stable prediction results than its counterparts for predicting nonstationary data streams with outliers.
Comparison of learning efficiency
To compare the learning efficiency of the six OSELM-based algorithms, the online modeling and prediction times of these algorithms on each data set are recorded, and the results are shown in Tables 4 to 6, respectively. The experimental results reported here are obtained with the number of hidden nodes set as 100 for all six algorithms. From the results, we can first see that for each single algorithm, the prediction time increases linearly with the number of prediction steps. In addition, comparing all the algorithms under the same condition, the prediction times of R-OSELM, M-OSELM and DFF-OSELM are basically the same and the shortest, followed by GRF-OSELM, then GRAF-OSELM, while the prediction time of RA-OSELM is the longest. Although the proposed RA-OSELM is the least efficient of the six algorithms, because additional calculations are required to adjust the forgetting factor recursively and to detect potential outliers continually in the online prediction process, its adaptive tracking ability and anti-outlier ability are greatly improved, which makes the trade-off worthwhile.
Online prediction time (seconds) of the six algorithms on simulative nonstationary data streams.
Online prediction time (seconds) of the six algorithms on time-varying time series.
Online prediction time (seconds) of the six algorithms on stock price.
Conclusions
In this paper, a novel robust adaptive OSELM algorithm (RA-OSELM) based on M-estimation and a variable forgetting factor is presented for predicting nonstationary data streams with outliers. By employing a more robust M-estimation loss function instead of the traditional least square learning criterion, the RA-OSELM's online updating scheme is less affected by the large disturbances stemming from outliers, thus improving the robustness of the algorithm in the presence of outliers. Besides, with the help of the variable forgetting factor strategy, the RA-OSELM is capable of automatically tracking the dynamic changes of nonstationary data streams and promptly reducing the adverse impacts of outdated data, so it tends to produce satisfactory tracking results in nonstationary environments. The performance of the proposed RA-OSELM algorithm has been evaluated and compared with five other representative algorithms on synthetic and real data sets. The experimental results indicate that the RA-OSELM has better adaptive tracking capability and stronger robustness than its counterparts, and that it is applicable to the online modeling and prediction of nonstationary data streams with outliers.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the National Natural Science Foundation of China (grant numbers 61603326, 61379064, 61273106); the research fund of Jiangsu Provincial Key Constructive Laboratory for Big Data of Psychology and Cognitive Science (grant number 72591962004 G).
