Rolling bearing fault diagnosis based on probabilistic mixture model and semi-supervised ladder network

Abstract

Fault diagnosis of rolling bearings is of great significance for ensuring the production efficiency of rotating machinery as well as personal safety. In recent years, machine learning has shown great potential in signal feature extraction and pattern recognition, and it is superior to traditional fault diagnosis methods in dealing with big data. However, most current intelligent diagnostic methods assume the ideal condition that both the bearing data set and the label information are sufficient, which is often not the case in engineering practice. In response to this problem, this paper proposes using a probabilistic mixture model (PMM) to approximate the data distribution of the bearing signal, and then using a Markov Chain Monte Carlo (MCMC) algorithm to sample the probabilistic model and expand the fault data set. In addition, a Semi-supervised Ladder Network (SSLN) can achieve the effect of a supervised learning classifier with only a few labeled samples. On the Case Western Reserve University (CWRU) Bearing Database, the recognition accuracy of the proposed PMM-SSLN model reaches 99.5%, and the experimental results show that the model is applicable when both the bearing data set and the label information are insufficient.

Keywords

Rolling bearing, fault diagnosis, probabilistic mixture model, MCMC, semi-supervised ladder network

Introduction

Rolling bearings are the most important and most vulnerable components of rotating machinery and are widely used in the machinery industry.¹ Thus, fault diagnosis of rolling bearings not only improves the production efficiency and the service life of the equipment, but also detects and resolves faults in time to prevent serious accidents. With the rapid development of the industrial Internet, the scale of mechanical equipment groups is expanding and the number of monitoring points in monitoring systems is increasing, which makes the monitoring data in the whole field of mechanical fault diagnosis grow explosively. Efficient mining of the rich information contained in big data has become the focus and challenge of the research.²

The traditional signal processing methods for rolling bearings include time domain analysis, frequency domain analysis and time-frequency domain analysis. Doguer and Strackeljan³ investigated an approach in which suitable time domain features are selected using the higher derivatives of the acceleration time signal and some parameters characterizing the randomness of the peak positions. Frequency domain analysis converts the signal from the time domain to the frequency domain, which gives a more accurate picture of the frequency components of the signal. Zhang et al.⁴ presented an improved local cepstrum method to analyze vibration signals of rolling element bearings; the method is strongly resistant to noise and can detect early faults. However, frequency domain analysis shows great limitations in the face of non-stationary signals. In contrast, time-frequency analysis can describe the time-varying frequency characteristics of the signal. The wavelet transform (WT) can automatically adapt to the requirements of time-frequency signal analysis, and the time-frequency analysis method based on WT has proved its effectiveness in the face of non-stationary signals.⁵

Traditional time-frequency analysis relies on complex signal processing techniques and fault diagnosis experience. In addition, the increasing amount of data makes fault diagnosis more difficult. In recent years, machine learning has become a powerful tool for big data analysis and is widely used in machinery fault detection and diagnosis.⁶ Jian et al.⁷ proposed a one-dimensional fusion neural network (OFNN) that identifies bearing health under different load conditions. Liu et al.⁸ presented a novel bearing fault diagnosis method with an RNN in the form of an autoencoder, which has high robustness and classification accuracy. Haidong S et al.⁹ combined a deep wavelet auto-encoder (DWAE) with an extreme learning machine (ELM); the diagnostic accuracy reached 95% in experiments on the CWRU motor bearing database.

However, these studies are based on the assumption that there is sufficient valid data and label information, which is inconsistent with the characteristics of monitoring data in engineering practice. This assumption leads to the following two weaknesses:

In engineering practice, rolling bearings operate normally for most of their service life, so the scarcity of fault data is an inherent feature of rolling bearing fault diagnosis. However, machine learning needs sufficient samples for each label.¹⁰

The equipment accumulates massive data during long-term operation, but the health status is known for only a small part of it. Supervised learning, however, requires a large amount of labeled data, so the useful information in the unlabeled data is likely to be wasted. In the face of such massive unlabeled data, manual labeling is time-consuming.

To overcome the above weaknesses, this paper proposes the PMM-SSLN framework based on the probabilistic mixture model and the Semi-supervised Ladder Network. Taking into account the data characteristics of rolling bearings in engineering practice, the data set is expanded by sampling the probability model established from the bearing data, and SSLN is then trained on the expanded data set.

PMM is one of the most commonly used tools for clustering and density estimation,¹¹⁻¹³ and it can approximate any unknown distribution with multiple components. No matter how complicated the data distribution is, the local characteristics of the data can be described by adding components, where each component is an independent probability distribution. Compared with a single model, it is more flexible and adaptable. SSLN is a pure semi-supervised learning method proposed by Rasmus et al.¹⁴,¹⁵ Based on the traditional unsupervised learning network structure, lateral connections are added between the encoder and decoder of each layer to achieve compatibility with supervised learning.¹⁶ It can greatly reduce the cost of labeling bearing samples, and it uses unlabeled data to improve the performance of the classifier.

The contributions of the paper lie in the following two aspects:

PMM is applied to approximate the data distribution of the complex bearing signal, and then an MCMC algorithm is adopted to sample the model. In this way, large amounts of pseudo data with the same distribution can be generated to assist in training the neural network. This method provides a new idea for handling the small-sample-size problem in the intelligent diagnosis of rolling bearings.

This paper uses SSLN, a pure semi-supervised learning method, to train on the expanded data set. This method can fully exploit the information hidden in the unlabeled data and can achieve the effect of a supervised learning classifier with only a small amount of labeled data. This solution improves the utilization of bearing signals and thereby improves the accuracy of diagnosis.

The remainder of this paper is organized as follows: section “Mathematical model of monitored data” introduces the mathematical model of the monitored data; section “Estimation and data reproduction from PMM” introduces estimation and data reproduction from the probabilistic model; section “Fault diagnosis using semi-supervised ladder network” introduces fault diagnosis using SSLN; section “Simulations and experiments” presents the simulations and experiments, including the PMM-SSLN model experiment setup and results; and section “Conclusion” gives the conclusions and proposes open questions.

Mathematical model of monitored data

Probabilistic model of observed data

The data collected from rolling bearings, including noise, can be regarded as a stochastic signal, which cannot be described by a fixed mathematical relationship. Each observation represents only one of the possible results within the variation range, and the variation is subject to statistical laws. From a statistical point of view, the data can be considered a random variable that obeys a certain probability distribution.

Figure 1 shows the scatter diagram of $C$ data points collected from a rolling bearing in $C$ sampling periods. Due to various external interference, complex vibrations occur during bearing operation. As a result, the monitored data usually do not follow a single probability distribution. PMM is more reasonable than a single model because it can use multiple components to describe a complex distribution. This feature makes PMM suitable for building probabilistic models of rolling bearing compound failures. Regardless of how complex the structure of the data distribution is, the mixture model can add components to describe the local features of the target distribution. The components here are probability density functions. Thus, PMM can be employed to approximate the distribution of bearing signals to the greatest extent.

Figure 1.

The scatter diagram of the bearing data in the time domain.

PMM is a linear weighted combination of a finite number of single density functions. It can be generalized to the following form:

Suppose the bearing data set $X = \{X_{1}, X_{2}, \dots, X_{C}\}$ obeys a certain distribution, where $X$ is a $d$ -dimensional random variable and $d$ depends on the number of physical quantities monitored in the experiment, such as vibration or temperature data. Then the probability density function is:

P (X) = \sum_{k = 1}^{K} ω_{k} f_{k} (X; β_{k})

(1)

Where $P (X)$ represents a mixture model with $K$ components, $f_{k} (X; β_{k})$ represents the probability density function of the $k$ -th component and $β_{k}$ denotes its parameters. $ω_{k}$ is its weight, which is the prior probability of the component $f_{k} (X; β_{k})$ , and $\sum_{k = 1}^{K} ω_{k} = 1$ . Figure 2 is a schematic diagram of a one-dimensional PMM of fault data, which consists of two components. In practical applications, $d$ , $K$ and the types of probability distribution can be adjusted to suit different occasions, which determines the universality of PMM. The parameters to be estimated in the model are written as $θ = \{ω_{k}, β_{k}\}$ .

Figure 2.

A schematic diagram of PMM based on the bearing data.
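As an illustration of equation (1), the following is a minimal sketch of a one-dimensional mixture density. It assumes Gaussian components (the general PMM allows other component types), and the function names and numeric values are hypothetical, chosen only for demonstration.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def mixture_pdf(x, weights, means, stds):
    """Equation (1): weighted sum of K component densities f_k(x; beta_k)."""
    weights = np.asarray(weights, dtype=float)
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("component weights must sum to 1")
    x = np.asarray(x, dtype=float)
    # One row per component density, then combine rows with the weights.
    comps = np.stack([gaussian_pdf(x, m, s) for m, s in zip(means, stds)])
    return weights @ comps

# Hypothetical two-component example (illustrative values, not fitted to data).
grid = np.linspace(-10.0, 10.0, 20001)
density = mixture_pdf(grid, weights=[0.7, 0.3], means=[0.0, 0.5], stds=[0.1, 0.4])
```

Because the weights sum to one and every component integrates to one, the mixture itself integrates to one, which can be checked numerically on the grid.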

Research routine

The rolling bearing fault diagnosis method based on PMM-SSLN Model consists of three steps:

Data acquisition and augmentation of the bearing signal model:

First collect bearing signals from the experimental device, and use PMM to fit the distribution of bearing data, in order to establish the probability models for bearing data under different working conditions. Then sample the probabilistic models by MCMC algorithm to ensure that each category of bearing data is sufficient.

Model training with Semi-supervised Ladder Network:

After obtaining sufficient training data, SSLN is trained by using small amounts of labeled data and massive unlabeled data. Complete the training after reaching the set number of iterations.

Fault identification:

Input the data set to be tested and finally output the diagnosis result.

The specific process is shown in Figure 3.

Figure 3.

The flow chart of the proposed PMM-SSLN Model.

Estimation and data reproduction from PMM

Information criteria for determining the component number

The first parameter that needs to be estimated is the number of components $K$ . The closer $K$ is to the number of samples $C$ , the more accurately the model describes the empirical distribution of the bearing data, but the higher the complexity of the model becomes. This leads to over-fitting, which does not guarantee an accurate fit to the true distribution. Therefore, a criterion is needed that balances the complexity of the model against its goodness of fit. Currently the most commonly used criteria are the Akaike Information Criterion (AIC)¹⁷ and the Bayesian Information Criterion (BIC),¹⁸ which are defined as

AIC = - 2 \ln L + 2 V

(2)

BIC = - 2 \ln L + V \ln C

(3)

where $\ln L$ is the maximized log-likelihood of the model, $V$ is the number of independently adjustable parameters within the model, and $C$ is the number of samples.

It can be seen from equations (2) and (3) that the preferred model is the one with the smallest AIC or BIC value. The two information criteria differ in their penalty terms. The penalty term of BIC takes the number of samples into account to prevent the model from over-fitting when the number of samples is large. For large sample sizes ( $C > 300$ ), BIC performs better than AIC.¹⁹
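Equations (2) and (3) translate directly into code. The sketch below is illustrative only: the function names and the example numbers are assumptions, not values from the experiments.

```python
import math

def aic(log_likelihood, n_params):
    """Equation (2): AIC = -2 ln L + 2 V."""
    return -2.0 * log_likelihood + 2.0 * n_params

def bic(log_likelihood, n_params, n_samples):
    """Equation (3): BIC = -2 ln L + V ln C."""
    return -2.0 * log_likelihood + n_params * math.log(n_samples)

# Hypothetical values: ln L = -1200, V = 5 parameters, C = 1024 samples.
example_aic = aic(-1200.0, 5)
example_bic = bic(-1200.0, 5, 1024)
```

The per-parameter penalty of BIC is $\ln C$ , which exceeds the constant penalty 2 of AIC as soon as $C > e^{2} \approx 7.4$ , so for any realistic sample size BIC favors models with fewer components.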

Expectation-maximization algorithm for parameter estimation

After determining the number of components $K$ , parameter estimation is the process of solving for the parameter set $θ$ . Maximum Likelihood Estimation (MLE) is the most commonly used means of parameter estimation in probability and statistics. For a single distribution, the estimate can be obtained directly by differentiating the likelihood function. But for PMM, there is no way to find the maximum likelihood directly because it is unknown which component each sample comes from.

The Expectation-Maximization (EM) algorithm²⁰ is a method to find the maximum likelihood estimate of a model with hidden variables. For the bearing data set $X = \{X_{1}, X_{2}, \dots, X_{C}\}$ , the hidden variable $Z = \{Z_{1}, Z_{2}, \dots, Z_{K}\}$ is introduced to indicate which component each bearing data point comes from. The bearing data here is a discrete variable, and the log-likelihood function of the model can be written as:

\log P (X | θ) = \sum_{i = 1}^{C} \log [\sum_{k = 1}^{K} P (X_{i}, Z_{k} | θ)]

(4)

The purpose of the EM algorithm is to find the maximum likelihood solution of equation (4). A hidden distribution $Q (Z)$ is introduced, which can be regarded as the posterior of the hidden variable $Z$ . The E-step constructs the lower bound $L (θ, θ_{t - 1})$ of the log-likelihood function with the previous estimate $θ_{t - 1}$ fixed. According to Jensen’s inequality, the lower bound is tight when $Q (Z) = P (Z | X, θ_{t - 1})$ , and its $θ$ -dependent part is:

L (θ, θ_{t - 1}) = \sum_{i = 1}^{C} \sum_{k = 1}^{K} Q (Z_{k}) \log P (X_{i}, Z_{k} | θ)

(5)

= E_{Q} [\log P (X, Z | θ)]

(6)

Where $t$ represents the iteration index, and $E_{Q}$ represents the mathematical expectation of the joint log-likelihood $\log P (X, Z | θ)$ under the hidden distribution $Q (Z)$ . The M-step solves for $θ$ with $Q$ fixed to maximize the lower bound, and $θ_{t}$ is computed by equation (7):

θ_{t} = \arg \max_{θ} E_{Q} [\log P (X, Z | θ)]

(7)
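The Jensen step behind the lower bound can be written out explicitly. The derivation below is a sketch in the symbols of equations (4) to (7); the per-sample notation $Q_{i} (Z_{k})$ is an assumption introduced here only to make the dependence on $X_{i}$ visible.

```latex
\begin{aligned}
\log P(X \mid \theta)
  &= \sum_{i=1}^{C} \log \sum_{k=1}^{K} Q_{i}(Z_{k})\,
     \frac{P(X_{i}, Z_{k} \mid \theta)}{Q_{i}(Z_{k})} \\
  &\geq \sum_{i=1}^{C} \sum_{k=1}^{K} Q_{i}(Z_{k})
     \log \frac{P(X_{i}, Z_{k} \mid \theta)}{Q_{i}(Z_{k})} \\
  &= \sum_{i=1}^{C} \sum_{k=1}^{K} Q_{i}(Z_{k}) \log P(X_{i}, Z_{k} \mid \theta)
     - \sum_{i=1}^{C} \sum_{k=1}^{K} Q_{i}(Z_{k}) \log Q_{i}(Z_{k})
\end{aligned}
```

The inequality follows from the concavity of the logarithm, and it holds with equality at $θ = θ_{t - 1}$ when $Q_{i} (Z_{k}) = P (Z_{k} | X_{i}, θ_{t - 1})$ . The last term does not depend on $θ$ , so maximizing the bound over $θ$ only requires maximizing the expected joint log-likelihood.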

It can be proved that the likelihood function is monotonically non-decreasing over the iterations, so the EM algorithm finally converges to a maximum of the likelihood, which may be a local one. The parameters at this maximum are taken as the parameter estimate of PMM. In this way, the probability density function of the bearing data is obtained, and more data can be generated for data augmentation by means of the corresponding sampling method. The pseudo code of the EM algorithm is as follows:

Algorithm 1 Expectation-Maximization algorithm
Input: $X$ , $P (X, Z | θ)$ , $P (Z | X, θ)$ , $T$ .
Output: $θ$
/* $T$ is the maximum number of iterations */
1: Initialize $θ_{0}$
2: for $t = 1$ to $T$ do
/* E-step: calculate the expectation */
3: $Q (Z) \leftarrow P (Z | X, θ_{t - 1})$
4: $L (θ, θ_{t - 1}) = E_{Q} [\log P (X, Z | θ)]$
/* M-step: maximize the expectation */
5: $θ_{t} = \arg \max_{θ} L (θ, θ_{t - 1})$
6: end for
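For a concrete instance of Algorithm 1, the sketch below fits a one-dimensional Gaussian mixture with EM. It is not the authors' implementation: the quantile-based initialization, the variance floor and the function names are assumptions made for this example. For Gaussian components the M-step has the closed-form updates used in the loop.

```python
import numpy as np

def gmm_log_joint(x, weights, means, stds):
    """log[w_k f_k(x_i; beta_k)] for each sample i and component k, shape (C, K)."""
    return (np.log(weights)
            - 0.5 * np.log(2.0 * np.pi * stds ** 2)
            - 0.5 * ((x[:, None] - means) / stds) ** 2)

def em_gmm_1d(x, k, n_iter=200):
    """Fit a K-component 1-D Gaussian mixture; returns weights, means, stds, log-likelihood."""
    x = np.asarray(x, dtype=float)
    # Assumed initialization: evenly spaced quantiles as means, equal weights.
    means = np.quantile(x, (np.arange(k) + 0.5) / k)
    stds = np.full(k, x.std())
    weights = np.full(k, 1.0 / k)
    var_floor = 1e-3 * x.var()  # assumed safeguard against collapsing components
    for _ in range(n_iter):
        # E-step: responsibilities Q(Z_k) = P(Z_k | X_i, theta_{t-1}).
        log_joint = gmm_log_joint(x, weights, means, stds)
        log_norm = np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
        resp = np.exp(log_joint - log_norm)
        # M-step: closed-form maximizers of E_Q[log P(X, Z | theta)].
        nk = np.maximum(resp.sum(axis=0), 1e-12)
        weights = nk / nk.sum()
        means = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
        stds = np.sqrt(np.maximum(var, var_floor))
    log_lik = np.logaddexp.reduce(gmm_log_joint(x, weights, means, stds), axis=1).sum()
    return weights, means, stds, float(log_lik)
```

The responsibilities are computed in log space with `np.logaddexp.reduce` so that samples far from every component do not cause underflow.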

MCMC algorithm for reproduction of bearing data

The bearing data generation process is a mechanical discrete event system, and it can be considered as the process of sampling the established distribution model. Expanding the training data set by sampling has the advantages of easy implementation and low cost. This paper attempts to use the method of sampling PMM to improve the learning performance of deep learning networks. The difficulty of sampling from the specified probability distribution $P (x)$ is that the probability density function of PMM is not necessarily integrable in closed form and the cumulative distribution function does not necessarily have an inverse function. To address this problem, a Markov process can be constructed that, after multiple transitions, eventually converges to a stationary distribution. The nature of the Markov chain determines that the stationary distribution is directly related to the choice of the transfer distribution $κ$ . The goal is to find a transfer distribution $κ$ whose stationary distribution is $P (x)$ . The Metropolis-Hastings (M-H) algorithm is a very important MCMC sampling algorithm, which avoids a large number of samples being rejected by amplifying the acceptance rate and can quickly converge to a stationary distribution.²¹

The M-H algorithm introduces a transfer distribution $κ (x \to x^{*})$ , and its calculation formula is as follows:

κ (x \to x^{*}) = q (x^{*} | x) α (x \to x^{*})

(8)

α (x \to x^{*}) = \min (1, \frac{P (x^{*}) q (x | x^{*})}{P (x) q (x^{*} | x)})

(9)

Where $q (x^{*} | x)$ is the proposal probability, namely the probability of proposing a transition from state $x$ to state $x^{*}$ ; $α (x \to x^{*})$ is the probability of accepting the new state $x^{*}$ .

It can be proved that the M-H algorithm satisfies the detailed balance condition:

P (x) κ (x \to x^{*}) = P (x^{*}) κ (x^{*} \to x)

(10)
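The proof of equation (10) is short enough to state. The following sketch substitutes equations (8) and (9) into the left-hand side for $x \neq x^{*}$ :

```latex
\begin{aligned}
P(x)\,\kappa(x \to x^{*})
  &= P(x)\, q(x^{*} \mid x)\,
     \min\!\left(1, \frac{P(x^{*})\, q(x \mid x^{*})}{P(x)\, q(x^{*} \mid x)}\right) \\
  &= \min\!\bigl(P(x)\, q(x^{*} \mid x),\; P(x^{*})\, q(x \mid x^{*})\bigr)
\end{aligned}
```

The final expression is unchanged when $x$ and $x^{*}$ are exchanged, so it also equals $P (x^{*}) κ (x^{*} \to x)$ , which is the right-hand side of equation (10).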

When the detailed balance condition is satisfied, it can be guaranteed that the stationary distribution that eventually converges is the target distribution $P (x)$ after multiple transfers. As a result, the transition sequence from the beginning of convergence can be used as the simulated bearing samples. In this paper, MCMC algorithm is used to expand the training set of the neural network instead of a large number of experiments. This method of data augmentation can improve the learning effect of deep learning networks when the historical data set is limited. The specific algorithm and pseudo code of M-H algorithm are as follows:

Algorithm 2 Metropolis-Hastings algorithm
Input: objective distribution $P (x)$
Output: $x^{(0)}, x^{(1)}, x^{(2)}, \dots$
1: Initialize $x^{(0)} \sim q (x)$
2: for iteration $i = 1, 2, \dots$ do
3: $x^{*} \sim q (x^{*} | x^{(i - 1)})$
/* calculate the acceptance probability */
4: $α (x^{(i - 1)} \to x^{*}) = \min (1, \frac{P (x^{*}) q (x^{(i - 1)} | x^{*})}{P (x^{(i - 1)}) q (x^{*} | x^{(i - 1)})})$
5: $u \sim Uniform (0, 1)$
6: if $u < α$ then
7: $x^{(i)} \leftarrow x^{*}$
8: else
9: $x^{(i)} \leftarrow x^{(i - 1)}$
10: end if
11: end for
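Algorithm 2 can be sketched in a few lines for a one-dimensional target. The example below assumes a Gaussian random-walk proposal, which is symmetric, so the two $q$ terms in the acceptance ratio cancel. The proposal choice, the step size, the burn-in length and the function names are assumptions for illustration rather than the settings used in the paper.

```python
import numpy as np

def gmm_pdf(x, weights, means, stds):
    """Density of a 1-D Gaussian mixture at a scalar x (the target P(x))."""
    comps = np.exp(-0.5 * ((x - means) / stds) ** 2) / (stds * np.sqrt(2.0 * np.pi))
    return float(np.dot(weights, comps))

def metropolis_hastings(target_pdf, n_samples, burn_in=1000, step=1.0, x0=0.0, seed=0):
    """Random-walk M-H sampler; returns the states kept after the burn-in transitions."""
    rng = np.random.default_rng(seed)
    x, p_x = x0, target_pdf(x0)
    if p_x <= 0.0:
        raise ValueError("the initial state must have positive target density")
    samples = np.empty(n_samples)
    for i in range(burn_in + n_samples):
        x_star = x + step * rng.normal()      # propose x* ~ q(x* | x)
        p_star = target_pdf(x_star)
        alpha = min(1.0, p_star / p_x)        # symmetric proposal: q cancels
        if rng.uniform() < alpha:             # accept with probability alpha
            x, p_x = x_star, p_star
        if i >= burn_in:
            samples[i - burn_in] = x          # a rejected move repeats the old state
    return samples
```

A typical call passes a closure over fitted mixture parameters, for example `metropolis_hastings(lambda v: gmm_pdf(v, w, m, s), n_samples=20000)` with hypothetical arrays `w`, `m` and `s`. Only the states after the burn-in period are kept, matching the use of the transition sequence after convergence.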

Fault diagnosis using semi-supervised ladder network

The proposed data augmentation method solves the problem of insufficient data sets, so that SSLN has sufficient training data. In addition, the health information of bearing data in engineering practice is also insufficient. In the face of these huge amounts of data, labeling the health information of the bearings takes time and effort. SSLN is a derivative algorithm of the autoencoder. It adopts the structure of ladder network to realize the compatibility of unsupervised learning and supervised learning. This semi-supervised learning method can make full use of a large amount of unlabeled data, which is consistent with the high labeling cost in the field of fault diagnosis. Therefore, this paper uses SSLN as a fault diagnosis model of bearing signals.

Structure of ladder network

Semi-supervised learning is essentially a combination of unsupervised learning and supervised learning. Unsupervised learning requires that all input information be reconstructed as much as possible. As shown in Figure 4, the autoencoder network is a typical unsupervised learning method.²² The compressed representation of the input signal $x$ is obtained through multiple layers of encoding, and then decoded to reconstruct the original signal. This requires the highest layer to retain all the detailed features as much as possible. But the goal of supervised learning is to extract information related to classification tasks and filter out irrelevant information. This creates an inevitable conflict between unsupervised learning and supervised learning. The ladder network adds lateral connections between the encoding and decoding channels.¹⁵ This enables the high-level layers to extract the features most relevant to the classification and to restore other missing features through lateral connections during the decoding process. This structure is therefore conducive to semi-supervised learning.

Figure 4.

Structural comparison of autoencoder and ladder network. Where “^” indicates the variable of the corrupted decoding process, and $h^{(l)}$ is the output of layer $l$ after the activation function of the layer and the input of the next layer.

Structure of SSLN

SSLN is a semi-supervised network that combines the ladder network and supervised learning. Suppose there are $N$ labeled samples $\{x (n), t (n) | 1 \leq n \leq N\}$ and $M$ unlabeled samples $\{x (m) | N + 1 \leq m \leq N + M\}$ , where $M \gg N$ and the output $t (n)$ is a class label. Semi-supervised learning studies how these unlabeled data can aid in training a classifier. Figure 5 shows the structure of SSLN with $L$ layers. SSLN consists of a corrupted encoding path, a decoding path and a clean encoding path.

Figure 5.

A structural illustration of the L-layer ladder network. The corrupted encoding path shares the same mappings $f^{(l)}$ with the clean encoding path. The decoding path contains the denoising function $g^{(l)}$ . The cost functions $C_{d}^{(l)}$ on each layer minimize the difference between ${\hat{z}}^{(l)}$ and $z^{(l)}$ .

In the corrupted encoding path, Gaussian random noise is added on each layer to enhance the robustness of the network. The output $\tilde{y}$ represents the predicted classification result after the encoding process with noise. The expression for each layer of the process is as follows:

\tilde{x} = {\tilde{z}}^{(0)} = {\tilde{h}}^{(0)} = x + n^{(0)}

(11)

{\tilde{z}}_{pre}^{(l)} = W^{(l)} {\tilde{h}}^{(l - 1)}

(12)

{\tilde{z}}^{(l)} = N_{B} ({\tilde{z}}_{pre}^{(l)}) + n^{(l)}

(13)

{\tilde{h}}^{(l)} = ϕ (γ^{(l)} ({\tilde{z}}^{(l)} + β^{(l)}))

(14)

\tilde{y} = {\tilde{h}}^{(L)}

(15)

Where $W^{(l)}$ is the weight matrix from layer $l - 1$ to layer $l$ , ${\tilde{h}}^{(l - 1)}$ is the output of the activation function of layer $l - 1$ , $n^{(l)}$ is Gaussian random noise, $N_{B}$ is batch normalization,²⁴ ${\tilde{z}}^{(l)}$ is the noisy input after batch normalization, and $γ^{(l)}$ and $β^{(l)}$ are trainable parameters that act as correction factors of batch normalization. $ϕ (•)$ is the activation function: the ReLU function is applied from layer $1$ to layer $L - 1$ , and for layer $L$ , the predictive output layer, the softmax classification function is selected. The supervised cost $C_{c}$ for the corrupted forward pass is as follows:

C_{c} = - \frac{1}{N} \sum_{n = 1}^{N} \log P (\tilde{y} = t (n) | x (n))

(16)
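Equations (11) to (16) describe a plain forward pass, which the following NumPy sketch mirrors for one batch. It is an illustration under assumptions: the function names, the weight shape convention (each $W^{(l)}$ stored with one row per output unit) and the small constants added for numerical stability are not taken from the paper. Setting `noise_std=0.0` removes the noise terms.

```python
import numpy as np

def batch_norm(z, eps=1e-8):
    """N_B: normalize each unit over the batch dimension."""
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

def softmax(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def encoder_forward(x, weights, gammas, betas, noise_std=0.0, rng=None):
    """Return the output h^(L) and the list of z^(l) for l = 0..L."""
    if rng is None:
        rng = np.random.default_rng(0)
    h = x + noise_std * rng.normal(size=x.shape)             # eq. (11)
    zs = [h]
    n_layers = len(weights)
    for l, (w, g, b) in enumerate(zip(weights, gammas, betas), start=1):
        z_pre = h @ w.T                                       # eq. (12)
        z = batch_norm(z_pre) + noise_std * rng.normal(size=z_pre.shape)  # eq. (13)
        a = g * (z + b)                                       # argument of phi, eq. (14)
        h = softmax(a) if l == n_layers else np.maximum(a, 0.0)
        zs.append(z)
    return h, zs                                              # eq. (15): y = h^(L)

def supervised_cost(y_tilde, labels):
    """Eq. (16): mean negative log-probability assigned to the true class."""
    idx = np.arange(labels.shape[0])
    return float(-np.mean(np.log(y_tilde[idx, labels] + 1e-12)))
```

The same function serves both encoder paths because they share the mappings; only the noise level differs.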

The decoding path uses the result of the encoding process to decode the unlabeled data. Due to the addition of the lateral connection, each hidden layer of the decoding path not only obtains the information of the previous layer, but also obtains the information of the corrupted encoding path in the same layer and restores the lost information. The specific expression is as follows:

{\hat{u}}^{(L)} = N_{B} (\tilde{y}), l = L

(17)

{\hat{u}}^{(l)} = N_{B} (V^{(l + 1)} {\hat{z}}^{(l + 1)}), 0 \leq l \leq L - 1

(18)

{\hat{z}}_{i}^{(l)} = g_{i}^{(l)} ({\tilde{z}}_{i}^{(l)}, u_{i}^{(l)}) = a_{i}^{(l)} ξ_{i}^{(l)} + b_{i}^{(l)} sigmoid (c_{i}^{(l)} ξ_{i}^{(l)})

(19)

Where $V^{(l + 1)}$ is the weight matrix from layer $l + 1$ to layer $l$ , ${\tilde{z}}_{i}^{(l)}$ represents the value of the $i$ -th neuron of ${\tilde{z}}^{(l)}$ , $u_{i}^{(l)}$ indicates the value of the $i$ -th neuron of ${\hat{u}}^{(l)}$ , ${\hat{z}}_{i}^{(l)}$ represents the value of the $i$ -th neuron after the denoising function, $ξ_{i}^{(l)} = [1, {\tilde{z}}_{i}^{(l)}, u_{i}^{(l)}, {\tilde{z}}_{i}^{(l)} u_{i}^{(l)}]^{T}$ , $a_{i}^{(l)}$ and $c_{i}^{(l)}$ are trainable $1 \times 4$ weight vectors, and $b_{i}^{(l)}$ is a trainable scalar weight.
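The denoising function of equation (19) acts on each neuron independently, which makes it easy to vectorize. The sketch below is a hypothetical NumPy rendering for one layer and one batch; the array layout (one row of `a` and `c` per neuron) is an assumption of this example.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def denoise(z_tilde, u, a, b, c):
    """Eq. (19) for a whole layer.

    z_tilde, u: arrays of shape (batch, m), lateral and top-down inputs.
    a, c: arrays of shape (m, 4), per-neuron weight vectors.
    b: array of shape (m,), per-neuron scalar weights.
    """
    # xi = [1, z_tilde, u, z_tilde * u] for every sample and neuron: (batch, m, 4).
    xi = np.stack([np.ones_like(u), z_tilde, u, z_tilde * u], axis=-1)
    linear = np.einsum("bmk,mk->bm", xi, a)   # a_i . xi_i
    gate = np.einsum("bmk,mk->bm", xi, c)     # c_i . xi_i
    return linear + b * sigmoid(gate)
```

With `a = [0, 1, 0, 0]` and `b = 0` for every neuron, the function returns the lateral input unchanged, and with `a = [0, 0, 1, 0]` it returns the top-down signal, so training can move each neuron between relying on the lateral connection and relying on the layer above.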

Except that no noise is added, the clean encoding path has the same structure as the corrupted encoding path. The specific expression is thus:

h^{(0)} = z^{(0)} = x

(20)

z_{pre}^{(l)} = W^{(l)} h^{(l - 1)}

(21)

z^{(l)} = N_{B} (z_{pre}^{(l)})

(22)

h^{(l)} = ϕ (γ^{(l)} (z^{(l)} + β^{(l)}))

(23)

y = h^{(L)}

(24)

The unsupervised denoising cost function $C_{d}$ and the total cost $C_{t}$ are as follows:

C_{d} = \sum_{l = 0}^{L} λ_{l} C_{d}^{(l)} = \sum_{l = 0}^{L} \frac{λ_{l}}{(N + M) m_{l}} \sum_{n = 1}^{N + M} {‖ z^{(l)} (n) - {\hat{z}}_{BN}^{(l)} (n) ‖}^{2}

(25)

{\hat{z}}_{BN}^{(l)} (n) = \frac{{\hat{z}}^{(l)} (n) - mean (z_{pre}^{(l)})}{\sqrt{var (z_{pre}^{(l)})}}

(26)

C_{t} = C_{c} + C_{d}

(27)

Where $m_{l}$ is the width of layer $l$ , $N + M$ is the total number of labeled and unlabeled samples in the training set, and $λ_{l}$ is the weight of the cost function of layer $l$ . The purpose of calculating ${\hat{z}}_{BN}^{(l)}$ is to reduce the noise caused by batch normalization. The model parameters $W^{(l)}$ , $γ^{(l)}$ , $β^{(l)}$ , $V^{(l)}$ , $a_{i}^{(l)}$ , $b_{i}^{(l)}$ , $c_{i}^{(l)}$ are trained to minimize the total cost $C_{t}$ . After the training is completed, the test set is fed into the clean encoding path to obtain the classification result $y$ .
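The cost terms in equations (25) to (27) reduce to a weighted sum of mean squared errors. The following sketch assumes the per-layer activations are already available as arrays of shape (samples, width); the function names and the stabilizing constant are assumptions of this example.

```python
import numpy as np

def normalize_like_encoder(z_hat, z_pre, eps=1e-8):
    """Eq. (26): normalize the decoder output with the encoder batch statistics."""
    return (z_hat - z_pre.mean(axis=0)) / np.sqrt(z_pre.var(axis=0) + eps)

def denoising_cost(z_clean, z_hat_bn, lambdas):
    """Eq. (25): layer-weighted squared error, averaged over samples and layer width."""
    total = 0.0
    for z, z_hat, lam in zip(z_clean, z_hat_bn, lambdas):
        n_samples, width = z.shape
        total += lam * np.sum((z - z_hat) ** 2) / (n_samples * width)
    return float(total)

def total_cost(c_c, c_d):
    """Eq. (27): supervised cost plus unsupervised denoising cost."""
    return c_c + c_d
```

Dividing by the layer width keeps wide layers from dominating the sum, so the relative importance of each layer is controlled by $λ_{l}$ alone.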

On the basis of the autoencoder, SSLN retains and improves the unsupervised cost. Thus the structural information contained in a large amount of unlabeled bearing data can be extracted, which improves the reconstruction in the decoding process. In addition, the supervised cost turns the network into a semi-supervised learning network that also extracts the features of the labeled samples. Moreover, the lateral connections improve the classification performance of the upper layers without reducing the reconstruction performance of the decoder. These advantages make SSLN well suited to bearing fault diagnosis.

Simulations and experiments

Rolling bearing experimental data description

The experiment in this section uses the most commonly used vibration signal analysis method. The data comes from Case Western Reserve University Bearing Data Center. The experimental setup is shown in Figure 6. The test stand consists of a 2 hp motor, a torque transducer and a dynamometer. The bearing to be tested is a deep groove ball bearing, and electro-discharge machining (EDM) is used to set the faults on the inner ring, ball and outer ring of the bearing to be tested.

Figure 6.

Case Western Reserve University bearing experimental device.

Single point faults are introduced into different components of the test bearing with fault diameters of 0.007, 0.014, 0.021, and 0.028 inches. An accelerometer is used to collect the vibration signal of the rolling bearing at the drive end, and the sampling frequency is 12 kHz. Each bearing is tested under different loads (0, 1, 2, and 3 hp), and the motor speed varies between 1730 and 1797 rpm. In order to highlight the identification ability of the fault identification method, this section selects minor faults, that is, bearing data with a fault diameter of 0.007 inch. The experimental data set is shown in Table 1. The data set is divided into 4 categories and 16 groups of data, one group per category and load. The data is processed so that the length of a single sample is 1024, and each group finally has 108 samples. Each group is randomly divided into 81 training samples and 27 testing samples.

Table 1.

The CWRU rolling bearing dataset information.

Class label	Fault location	Load (hp)	No. of training samples	No. of testing samples
0	Normal	0, 1, 2, 3	4 × 81	4 × 27
1	Inner ring	0, 1, 2, 3	4 × 81	4 × 27
2	Ball	0, 1, 2, 3	4 × 81	4 × 27
3	Outer ring	0, 1, 2, 3	4 × 81	4 × 27
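The sample counts in Table 1 can be reproduced with a simple segmentation and split. The sketch below assumes non-overlapping windows taken from the start of each record, which the text does not specify, and uses hypothetical function names.

```python
import numpy as np

def segment_signal(signal, length=1024, n_segments=108):
    """Cut a 1-D vibration record into n_segments windows of the given length."""
    signal = np.asarray(signal, dtype=float)
    needed = length * n_segments
    if signal.size < needed:
        raise ValueError("record too short for the requested segmentation")
    # Assumption: windows do not overlap and start at the first sample.
    return signal[:needed].reshape(n_segments, length)

def split_group(segments, n_train=81, seed=0):
    """Randomly split one group into training and testing samples."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(segments))
    return segments[idx[:n_train]], segments[idx[n_train:]]
```

Applied to all 16 groups, this yields 16 × 81 = 1296 training samples and 16 × 27 = 432 testing samples.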

SSLN experiment setup and results

The experiment in this section builds SSLN from fully connected layers. The dimensions of the layers of the encoding path are set to [1024, 1000, 500, 250, 250, 100, 4], and the weights of the unsupervised loss terms of the layers are set to [1000, 10, 0.1, 0.1, 0.1, 0.1, 0.1]. The learning rate is set to 0.02, the batch size is set to 32, and the maximum number of iterations is set to 8000.
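For reference, the settings above can be collected in a single configuration object. The class and field names below are hypothetical; only the values come from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SSLNConfig:
    """Hyperparameters of the SSLN experiment (field names are assumptions)."""
    layer_sizes: tuple = (1024, 1000, 500, 250, 250, 100, 4)
    denoising_weights: tuple = (1000, 10, 0.1, 0.1, 0.1, 0.1, 0.1)
    learning_rate: float = 0.02
    batch_size: int = 32
    max_iterations: int = 8000

    def __post_init__(self):
        # One denoising weight is needed per layer, including the input layer.
        if len(self.layer_sizes) != len(self.denoising_weights):
            raise ValueError("need one denoising weight per layer")

config = SSLNConfig()
```

The input width of 1024 matches the sample length, and the output width of 4 matches the number of classes in Table 1.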

In the training set, 32 samples are randomly selected and labeled, and the remaining 1264 samples are regarded as unlabeled data. Each experiment uses the same 32 labeled samples and the same test set, and 0, 316, 632, 948, and 1264 unlabeled samples are added in turn to train the model. The training results are shown in Figure 7. When there is no unlabeled data, the encoding path of SSLN is used for supervised learning with only 32 labeled samples. The accuracy is only 25.9%, which is close to the 25% expected from random guessing among four classes. It can be seen that the lack of labeled samples has a great impact on the classification performance of supervised learning. This is because the large amount of unlabeled data cannot be used, resulting in a lack of data in the training set and over-fitting in neural network training.

Figure 7.

Influence of the amount of unlabeled data on accuracy.

Adding unlabeled data significantly improves the accuracy of semi-supervised learning, and the accuracy increases gradually as the number of unlabeled samples grows. It can be inferred that increasing the amount of unlabeled data helps to improve the accuracy of SSLN fault recognition. After adding all the unlabeled data, the accuracy finally reached 71.8%, which is still not a high recognition rate. Because the accuracy is limited by the size of the data set, the next section continues to expand the training set through data augmentation.

PMM-SSLN model experiment setup and results

Based on the previous section, this section designs the PMM-SSLN model to improve the fault recognition accuracy of SSLN in the case of small data sets. This section takes the inner ring fault under no external load as an example. Figure 8 shows the frequency distribution histogram of the fault data. It can be roughly seen from the figure that the data distribution is dense in the middle and sparse on both sides, which is similar to the Gaussian distribution. However, it does not necessarily follow a single Gaussian distribution. This experiment selects a one-dimensional Gaussian Mixture Model (GMM) for probabilistic modeling of the bearing data. This means that each component in PMM is the probability density function of a Gaussian distribution. Because of the large sample size, the BIC criterion is used to determine the number of components. This not only ensures the accuracy of the model, but also gives the model a certain generalization ability, so that it can predict data that has not been observed.

Figure 8.

The frequency distribution histogram of the inner ring fault data.

It can be seen from Figure 9 that the BIC value reaches its minimum at $K = 2$ . After that, as the number of components increases, the BIC value slowly rises and eventually stabilizes. It can be concluded that the model fits best when $K = 2$ . After determining the number of components, the parameters of the GMM are estimated through the EM algorithm. As shown in Figure 10, the probability density function of the model is drawn on top of the frequency distribution histogram. It can be found that the GMM fits the distribution of the bearing data well compared to the single Gaussian model (GSM). As the number of components is increased further, the weights of the new components are close to zero, which is consistent with the BIC result.

Figure 9.

BIC for determining the number of GMM components.

Figure 10.

Comparison of the fitting effect between GSM and GMM.
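The model selection procedure described above, fitting a GMM for each candidate $K$ and keeping the one with the smallest BIC, can be sketched as follows. This is an illustrative reimplementation with a compact EM fit: the initialization, the variance floor, the iteration count and the function names are assumptions. A one-dimensional GMM with $K$ components has $V = 3K - 1$ free parameters ( $K$ means, $K$ variances and $K - 1$ independent weights).

```python
import numpy as np

def _log_joint(x, weights, means, stds):
    return (np.log(weights) - 0.5 * np.log(2.0 * np.pi * stds ** 2)
            - 0.5 * ((x[:, None] - means) / stds) ** 2)

def gmm_log_likelihood(x, k, n_iter=200):
    """Fit a K-component 1-D GMM with EM and return the final log-likelihood."""
    means = np.quantile(x, (np.arange(k) + 0.5) / k)   # assumed initialization
    stds = np.full(k, x.std())
    weights = np.full(k, 1.0 / k)
    var_floor = 1e-3 * x.var()                          # assumed regularization
    for _ in range(n_iter):
        lj = _log_joint(x, weights, means, stds)
        resp = np.exp(lj - np.logaddexp.reduce(lj, axis=1, keepdims=True))
        nk = np.maximum(resp.sum(axis=0), 1e-12)
        weights = nk / nk.sum()
        means = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
        stds = np.sqrt(np.maximum(var, var_floor))
    return float(np.logaddexp.reduce(_log_joint(x, weights, means, stds), axis=1).sum())

def select_k_by_bic(x, k_max=6):
    """Return the K in 1..k_max with the smallest BIC, plus all BIC values."""
    x = np.asarray(x, dtype=float)
    bics = []
    for k in range(1, k_max + 1):
        n_params = 3 * k - 1
        bics.append(-2.0 * gmm_log_likelihood(x, k) + n_params * np.log(x.size))
    return int(np.argmin(bics)) + 1, bics
```

On data drawn from a two-component mixture, the extra components gain little likelihood while each adds a penalty of $3 \ln C$ , which is why the BIC curve rises after its minimum.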

After establishing the GMM for the observed data, a large amount of pseudo data can be generated through M-H sampling. The acceptance rate of M-H sampling is relatively high, so it can converge to a stable distribution more quickly. These pseudo data can be considered to have a certain similarity and homology with the original data in terms of probability. A threshold on the number of state transitions needs to be set to ensure that the distribution of the output data has converged to a stable distribution, and it is set to 100,000 here. The amount of generated data is set to three times the number of original samples, and its frequency distribution histogram is shown in Figure 11. It can be seen that the generated data set fits the distribution of the established GMM, which means that the sampled data and the original data are very close in probability distribution.

Figure 11.

GMM of the original data and histogram of the generated data.
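
The sampling step can be sketched as follows. This is a minimal illustration rather than the authors' sampler: it assumes a symmetric Gaussian random-walk proposal (so the M-H acceptance ratio reduces to the ratio of target densities), and the names `mh_sample`, `step` and `burn_in` are hypothetical. The `burn_in` argument plays the role of the state-transition threshold, which the text sets to 100,000.

```python
import math
import random


def gmm_pdf(x, weights, means, variances):
    """Density of a 1-D Gaussian mixture at x."""
    return sum(
        w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
        for w, m, v in zip(weights, means, variances)
    )


def mh_sample(target_pdf, n_samples, burn_in, step, x0, seed=0):
    """Draw n_samples from target_pdf with random-walk Metropolis-Hastings.

    The first burn_in transitions are discarded so that the chain can
    approach its stationary distribution before samples are kept.
    """
    rng = random.Random(seed)
    x, px = x0, target_pdf(x0)
    samples = []
    for i in range(burn_in + n_samples):
        cand = x + rng.gauss(0.0, step)  # symmetric proposal (assumed)
        pc = target_pdf(cand)
        # Accept with probability min(1, pc / px).
        if pc >= px or rng.random() < pc / px:
            x, px = cand, pc
        if i >= burn_in:
            samples.append(x)
    return samples
```

For example, with fitted parameters `weights`, `means` and `variances`, the call `mh_sample(lambda x: gmm_pdf(x, weights, means, variances), n_samples=3 * n_original, burn_in=100_000, step=1.0, x0=means[0])` would produce three times as many values as the original record. The step size of 1.0 is only a placeholder and would need tuning to the amplitude scale of the signal.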

The generated data are then divided into three groups, and the values in each group are matched one-to-one with the original data according to their order of size. The generated data are also given the same time labels as the original data. This scheme allows the generated data to retain the timing information of the original signal as far as possible. The comparison between the original signal and the generated signal in the time domain is shown in Figure 12.

Figure 12.

Time domain diagram of the inner ring fault’s original signal (a) and generated signal (b).
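
One way to read the matching step is as a rank mapping: within each group, the k-th smallest generated value is placed at the time index where the original signal has its k-th smallest value. The sketch below illustrates that reading. The function names are hypothetical, and splitting the generated values into consecutive equal-length chunks is an assumption, since the text does not say how the three groups are formed.

```python
def rank_match(original, generated):
    """Reorder `generated` so that its ranks follow those of `original`.

    The k-th smallest generated value is written to the time index at
    which `original` holds its k-th smallest value.
    """
    if len(original) != len(generated):
        raise ValueError("original and generated must have the same length")
    # Time indices of the original signal, sorted by amplitude.
    order = sorted(range(len(original)), key=lambda t: original[t])
    gen_sorted = sorted(generated)
    matched = [0.0] * len(original)
    for rank, t in enumerate(order):
        matched[t] = gen_sorted[rank]
    return matched


def build_generated_signals(original, generated, n_groups=3):
    """Split `generated` into n_groups chunks and rank-match each chunk."""
    n = len(original)
    if len(generated) != n_groups * n:
        raise ValueError("generated must hold n_groups * len(original) values")
    return [
        rank_match(original, generated[g * n:(g + 1) * n])  # consecutive chunks (assumed)
        for g in range(n_groups)
    ]
```

Each returned signal has the same length and time axis as the original, and its peaks and troughs occur at the same instants, which is what lets the generated signal in Figure 12(b) follow the shape of the original in Figure 12(a).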

The proposed data augmentation method is used to expand the bearing data of each state in Table 1 to 324 pieces, which yields 16 × 324 pieces of 1024-dimensional bearing signals. The labeled samples are the same as in the previous experiment, and the rest are unlabeled. SSLN is trained on the expanded training set, with parameter settings consistent with the previous section. To further demonstrate the effectiveness of the PMM-SSLN model, Support Vector Machines (SVM) and Semi-supervised Support Vector Machines (S3VM) are introduced as comparison models, and the same expanded data set is used in experiment 3. Combined with the experiments in the previous section, six comparative experiments are listed in Table 2.

Table 2.

Classification results of the six compared approaches on the CWRU bearing dataset.

Serial number	Approaches	No. of labeled training samples	No. of unlabeled training samples	No. of testing samples	Accuracy (%)
1	SVM	32	0	432	39.8
2	S3VM	32	1264	432	45.4
3	PMM-S3VM	32	5152	432	53.7
4	Encoder of SSLN	32	0	432	25.9
5	SSLN	32	1264	432	71.8
6	PMM-SSLN	32	5152	432	99.5
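
As a quick consistency check on the training-set sizes in Table 2, the counts can be worked through as follows. This sketch assumes that the 16 states hold equal numbers of samples, so that each state has 81 original pieces. That per-state figure is inferred from the table totals and is not stated in this section.

$$
\begin{aligned}
32 + 1264 &= 1296 = 16 \times 81, \\
81 + 3 \times 81 &= 324, \\
16 \times 324 &= 5184 = 32 + 5152.
\end{aligned}
$$

Under that assumption, generating three times the original data per state gives the 324 pieces per state and the 5152 unlabeled samples listed for experiments 3 and 6.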

It can be seen that the recognition accuracy of the PMM-SSLN model reaches 99.5%, a significant improvement over the other five methods. When only a small amount of labeled data is used, the SVM model, which is a shallow learning model, is more accurate than the encoder of SSLN (39.8% versus 25.9%). When the size of the training data set increases, however, the diagnostic accuracy of the SSLN model is significantly higher than that of the SVM-based models (71.8% for SSLN versus 45.4% for S3VM, and 99.5% for PMM-SSLN versus 53.7% for PMM-S3VM). This indicates that SSLN has an advantage in extracting information from a large number of training samples, and that data augmentation can significantly improve its recognition accuracy. Moreover, SSLN requires only a small amount of labeled data but a large amount of unlabeled data for training. This is consistent with practical engineering applications, where the health status of most bearing signals is unknown. The proposed data augmentation method ensures that sufficient unlabeled data are available and further improves the diagnostic ability of SSLN. The PMM-SSLN model proposed in this paper can therefore address the problems of small data volume and scarce labels in the fault diagnosis of rolling bearings.

Conclusion

This paper proposes an intelligent fault diagnosis method based on a probabilistic mixture model and a Semi-supervised Ladder Network. When valid data are limited, a probabilistic model of the corresponding bearing data can be established through PMM, and data augmentation can then be performed through MCMC sampling. Owing to the strong expressive ability of PMM and the fast training speed of the EM algorithm, this method can quickly capture the empirical distribution of rolling bearing data under different complex working conditions. Furthermore, PMM generalizes well, so its prediction of unobserved data is not degraded by over-fitting. In addition, the cost of labeling rolling bearing data is high in industrial applications. SSLN, which can extract features from unlabeled data, requires only a small amount of labeled data to train the model. Experimental results show that the PMM-SSLN model proposed in this paper can address the lack of valid data and label information in industrial applications, and the final accuracy reaches 99.5% on the CWRU bearing dataset. Future work includes further improving the model to better handle data imbalance and to identify compound faults of rolling bearings.

Footnotes

Handling Editor: James Baldwin

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Anhui Province Key Project (No. 18030901039, No. 201903c08020008, and No. 201903a05020064); Fundamental Research Funds for the Central Universities (Grant no. JZ2019YYPY0017, JZ2020YYPY0247)

ORCID iD

Xuesong Lu
