A roundtrip probability estimation method for mechanical equipment fault detection under imbalanced samples

Abstract

To address the high misdetection rate of mechanical faults under imbalanced samples, a roundtrip probability-based method is proposed. Through roundtrip mapping between latent variables and real fault data, a biased estimate of the probability distribution of the real fault data is obtained. Virtual fault data are then sampled from this distribution to increase the number of fault samples. To discriminate between real and virtual data, a loss function based on binary cross-entropy is designed. For reconstruction between fault data and their roundtrip-mapped results, an objective function based on mean squared error is designed. The method thus preserves boundary data and avoids generating too many virtual data in the central area. Meanwhile, a strategy for eliminating abnormal samples is designed to reduce boundary deviation. To support the advantage of roundtrip, the underlying reasons for misdetection are analyzed in terms of empirical risk and structural risk. Experiments on 30 benchmark imbalanced datasets show that the fault detection rate increases after sample augmentation. The method is additionally verified on blade crack and bearing fault detection. The results show that the F1 score increases from 0.485 to 0.51 and from 0.725 to 0.775 for these two cases, respectively.

Keywords

Mechanical fault detection, roundtrip, imbalanced samples, probability estimation

Introduction

Fault detection is crucial for ensuring the safe operation of mechanical equipment. In recent years, data-driven detection methods have received widespread attention,^1,2 mainly because they do not establish kinematic and dynamic models, avoiding in-depth analysis about mechanical equipment’s physical system.^3,4 However, in industrial applications, normal data is far more abundant than fault data. Classification decision boundary may be shifted, resulting in data-driven methods being prone to misidentifying fault state as normal state. Especially, in early failure stages, it is extremely difficult to detect equipment faults. Currently, solutions to fault detection under imbalanced samples mainly include data-level methods and algorithmic-level methods.⁵

Data-level methods achieve balance by increasing minority class samples (i.e. fault samples) or decreasing majority class samples (i.e. normal samples). Typical methods include SMOTE (Synthetic Minority Over-sampling Technique), ADASYN (Adaptive Synthetic Sampling), ENN (Edit Nearest Neighbor), and GAN (Generative Adversarial Network). For instance, Chawla et al.⁶ used SMOTE to perform nearest-neighbor sampling to solve class imbalance problems. Bunkhumpornpat et al.⁷ combined density estimation algorithm with SMOTE to balance the dataset while retaining features of minority class samples. Huang et al.⁸ combined ADASYN with extreme learning machine to diagnose sliding electrical contact fault under imbalanced samples. He et al.⁹ generated minority class samples using ADASYN and then modified the decision boundary of classifier to improve detection rate. Batista et al.¹⁰ studied the behavior of SMOTE-Tomek and SMOTE-ENN for generating virtual minority class samples that conform to the overall dataset distribution. Wei et al.¹¹ designed a sample-characteristic-based SCOTE (Sample-characteristic Oversampling Technique) resampling method and found that fault detection rate significantly increases after balance on bearing fault diagnosis. Aiming at out-of-distribution detection, Han and Li¹² designed a deep ensemble method for these unseen data. Liu et al.¹³ proposed a multi-scale GAN method for generating wavelet spectrograms to expand the amount of fault samples and experimental results showed that GAN could effectively generate two-dimensional spectral fault data. Fan et al.¹⁴ embedded one-dimensional convolutional neural network in WGAN(Wasserstein GAN) to generate vibration data and experimental results on two bearing cases showed that this method could improve fault diagnosis accuracy. Han et al.¹⁵ proposed a semi-supervised adversarial discriminative learning approach for the scarcity of annotated samples.

Algorithmic-level methods reduce false negative rate by raising the cost of fault samples, including cost-sensitive learning and multi-kernel learning. Domingos¹⁶ combined cost-sensitive learning with pattern recognition to improve recognition rate by assigning different classification errors with unequal costs. Wang et al.¹⁷ proposed a cost-sensitive learning approach for multi-class data to improve fault detection in pumping units, which are characterized by a typical long-tailed distribution and severe class imbalance. Ng et al.¹⁸ proposed an intelligent electricity pricing classification method based on cost-sensitive learning and streaming imbalance-reversed bagging. Zhu et al.¹⁹ introduced a weight matrix and regularization term into multi-kernel learning to solve pattern recognition problems with imbalanced samples. Ma et al.²⁰ combined multi-kernel learning with cost-sensitive learning to ensure model fitting and enhance the support of decision boundaries for minority class samples.

Despite significant improvement on detection of minority class samples by the two types of methods, there are still obvious limitations (see Section 1.1 for details). (1) Data-level methods synthesize or remove samples within certain neighborhood, which is not effective for handling multi-cluster or non-convex cluster data; (2) Algorithmic-level methods require in-depth analysis of costs and the designed cost function needs to match specific problems, which lacks generality and needs extremely professional mathematical basis.

To overcome such drawbacks, this work proposes a mechanical equipment fault detection method based on roundtrip probability estimation. By adversarial roundtrip, we achieve bidirectional mapping between latent variable space and sample space, such that it fits the probability distribution of fault data. Further, we sample some virtual data to achieve balance. Verifications on 30 UCI benchmark data show that detection rate of minority class samples significantly increases after balance. Additionally, the proposed method is validated on real mechanical equipment fault detection and results show a significant reduction in false negatives. The main contributions are summarized as follows:

1) To improve fault detection accuracy, the roundtrip method is modified to adapt it to fault detection under imbalanced data.

2) To generate reliable fault data, a novel loss function and an adversarial training procedure are designed to ensure convergence.

3) An outlier identification method is designed to preserve the sample boundary.

Imbalanced samples problem analysis and research motivation

Imbalanced samples problem analysis

Because operating faulty equipment is prohibited, collected data consist mostly of normal-state samples and very few fault samples, resulting in a significant imbalance in quantity.²¹ Additionally, in industrial applications, equipment is shut down for inspection as soon as a fault is detected. Most collected fault samples are therefore in the early failure stage and exhibit weak fault features. When data-driven methods are applied to such imbalanced samples, the overall accuracy may be high while the fault detection rate is low.

Taking binary classification as examples, we analyze the in-depth reasons from the perspective of empirical risk minimization and structural risk minimization. For empirical risk minimization model²² (such as logistic regression), precise prediction is achieved by minimizing mean squared error (MSE) cost function.

MSE = \frac{1}{N} \sum_{i = 1}^{N} {‖ y_{i} - {\hat{y}}_{i} ‖}_{2}^{2} = \frac{1}{N} \sum_{i = 1}^{N} {‖ y_{i} - f (x_{i}) ‖}_{2}^{2}

(1)

Where N denotes the number of samples; $f (\cdot)$ is a prediction function; $x_{i}$ represents the i-th sample; $y_{i}$ and ${\hat{y}}_{i}$ are the label of the i-th sample and its predicted value. $y_{i}$ is a one-hot vector. ${\hat{y}}_{i}$ is a vector whose components lie in the interval [0, 1] and sum to 1, so each squared error ${‖ y_{i} - {\hat{y}}_{i} ‖}_{2}^{2}$ is at most 2. If the data are extremely imbalanced, we can further expand the cost function as follows:

\begin{matrix} MSE = \frac{1}{N} \sum_{i = 1}^{N_{1}} ‖ y_{i} - {\hat{y}}_{i} ‖_{2}^{2} + \frac{1}{N} \sum_{j = 1}^{N_{2}} ‖ y_{j} - {\hat{y}}_{j} ‖_{2}^{2} \\ \leq \frac{1}{N} \sum_{i = 1}^{N_{1}} ‖ y_{i} - {\hat{y}}_{i} ‖_{2}^{2} + \frac{2 N_{2}}{N} \end{matrix}

(2)

Where N₁ denotes normal sample amount and N₂ denotes fault sample amount. Since N>>N₂, $\frac{2 N_{2}}{N}$ is near zero. Further, it can be inferred that

\frac{1}{N} \sum_{i = 1}^{N_{1}} ‖ y_{i} - {\hat{y}}_{i} ‖_{2}^{2} + \frac{2 N_{2}}{N} \approx \frac{1}{N} \sum_{i = 1}^{N_{1}} ‖ y_{i} - {\hat{y}}_{i} ‖_{2}^{2}

(3)

It is concluded that MSE focuses only on normal samples and ignores fault samples. The consistency between predicted values and target values for fault samples is ignored, leading to misclassification of fault samples. In early failure stages especially, detection methods may fail because of the overlap between the normal state and the fault state in feature space.
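As a quick numerical check of equations (2) and (3), the following minimal sketch splits the MSE into its normal-sample and fault-sample terms. The sample counts (990 normal, 10 fault) and the helper name `mse_split` are illustrative assumptions, not values taken from the experiments.

```python
def mse_split(y_true, y_pred, is_fault):
    """Return (normal_term, fault_term) of equation (2); their sum is the MSE."""
    n = len(y_true)
    normal_term = 0.0
    fault_term = 0.0
    for y, p, fault in zip(y_true, y_pred, is_fault):
        # Squared Euclidean distance between the one-hot label and the prediction.
        sq = sum((a - b) ** 2 for a, b in zip(y, p))
        if fault:
            fault_term += sq / n
        else:
            normal_term += sq / n
    return normal_term, fault_term


# Hypothetical imbalanced set: 990 normal samples and 10 fault samples.
n_normal, n_fault = 990, 10
normal, fault = [1.0, 0.0], [0.0, 1.0]
y_true = [normal] * n_normal + [fault] * n_fault
# Worst case for the fault class: every sample is predicted "normal" with certainty.
y_pred = [normal] * (n_normal + n_fault)
is_fault = [False] * n_normal + [True] * n_fault

normal_term, fault_term = mse_split(y_true, y_pred, is_fault)
bound = 2 * n_fault / (n_normal + n_fault)  # the 2*N2/N term of equation (2)
print(normal_term, fault_term, bound)
```

Even though every fault sample is misclassified in this toy case, the fault term cannot exceed 2N₂/N, so the total MSE stays small.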

For structural risk minimization in case of linear separability (such as support vector machine, SVM),²³ decision surface is obtained by minimizing the following objective function.

\begin{matrix} \min Loss = ‖ W ‖ + C \sum_{i = 1}^{N} ξ_{i} \\ s.t. \ y_{i} (W x_{i} + b) \geq 1 - ξ_{i} \\ ξ_{i} \geq 0 \end{matrix}

(4)

Where W is a coefficient vector to be solved; b is a bias value; $C$ is a pre-defined penalty coefficient; $ξ_{i}$ is a relaxation variable; $y_{i}$ is the label; $x_{i}$ is the i-th sample. To simplify analysis, ideal decision surface is $\hat{W} x + \hat{b} = 0$ . For imbalanced case, supposing that decision surface is shifted by a distance of d away from normal samples along x-axis, variation of the objective function can be computed.

\begin{matrix} Δ Loss = C \sum_{i = 1}^{N_{1}} (ξ_{i} - d \sin θ) + C \sum_{j = 1}^{N_{2}} (ξ_{j} + d \sin θ) \\ - (C \sum_{i = 1}^{N_{1}} ξ_{i} + C \sum_{j = 1}^{N_{2}} ξ_{j}) \\ = Cd \sin θ (N_{2} - N_{1}) < 0 \end{matrix}

(5)

Where N₁ denotes the number of normal samples and N₂ denotes the number of fault samples; θ is the angle between the decision surface and the x-axis.

The above equation shows that $Δ Loss$ is less than 0 (because N₂ < N₁), implying that the objective function decreases. Therefore, the decision surface moves toward the fault samples (i.e. it favors the normal samples), so that normal samples are identified correctly while fault samples are misidentified.
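The sign of equation (5) can be checked numerically. The sketch below compares the slack penalty before and after the shift with the closed form Cd sin θ (N₂ − N₁). All numbers (slack values, C, d, θ, sample counts) are hypothetical, and, like equation (5), the sketch ignores the constraint ξ ≥ 0.

```python
import math


def slack_penalty(c, xi_normal, xi_fault):
    """Penalty term C * sum(xi) of equation (4) before the shift."""
    return c * (sum(xi_normal) + sum(xi_fault))


def shifted_penalty(c, xi_normal, xi_fault, d, theta):
    """Penalty after moving the decision surface by d away from the normal samples."""
    shift = d * math.sin(theta)
    return c * (sum(x - shift for x in xi_normal) + sum(x + shift for x in xi_fault))


def delta_loss_closed_form(c, d, theta, n_normal, n_fault):
    """Last line of equation (5)."""
    return c * d * math.sin(theta) * (n_fault - n_normal)


# Hypothetical setting: 100 normal samples, 5 fault samples.
xi_normal, xi_fault = [0.2] * 100, [0.2] * 5
c, d, theta = 1.0, 0.1, math.pi / 3

direct = shifted_penalty(c, xi_normal, xi_fault, d, theta) - slack_penalty(c, xi_normal, xi_fault)
closed = delta_loss_closed_form(c, d, theta, len(xi_normal), len(xi_fault))
print(direct, closed)
```

Both values are negative whenever N₂ < N₁, which is the reason the optimizer prefers the shifted surface.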

Research motivation

Based on the analysis in Section 1.1, increasing the cost of fault samples can alleviate the above problem, but it may lead to overfitting because of the insufficient sample size, and the cost function needs to be redesigned for each problem. Synthesizing virtual samples, by contrast, generalizes across problems and balances the cost function indirectly. SMOTE²⁴ is a representative sample synthesis method, and many methods are improvements of it, such as WMOTE (Weighted Majority Over-sampling Technique) and SCOTE. Their core idea is to synthesize samples using the following equation:

X = λ X_{1} + (1 - λ) X_{2}

(6)

Where $X$ is the synthesized virtual sample; $X_{1}$ and $X_{2}$ are randomly selected minority-class samples; $λ$ is a random number between 0 and 1. However, when dealing with non-convex, multi-cluster data shown in Figure 1, such methods may select $X_{1}$ and $X_{2}$ from different clusters, resulting in synthesizing erroneous samples.
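The failure mode described above can be reproduced with a few lines. The following minimal sketch applies equation (6) to two hypothetical minority clusters; the cluster coordinates and function names are assumptions for illustration, and the neighbor-selection step of a full SMOTE implementation is deliberately omitted.

```python
import random


def interpolate(x1, x2, lam):
    """Equation (6): X = lam * X1 + (1 - lam) * X2."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]


def synthesize(minority, rng):
    """Pick two distinct minority samples at random and interpolate between them."""
    x1, x2 = rng.sample(minority, 2)
    return interpolate(x1, x2, rng.random())


# Two hypothetical, well-separated minority clusters.
cluster_a = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
cluster_b = [[5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]

# If X1 and X2 come from different clusters, the synthetic point lands in the
# empty region between them.
midpoint = interpolate(cluster_a[0], cluster_b[0], 0.5)

rng = random.Random(0)
sample = synthesize(cluster_a + cluster_b, rng)
print(midpoint, sample)
```

The point `midpoint` belongs to neither cluster, which is the kind of erroneous sample that interpolation can produce on multi-cluster data.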

Figure 1.

Non-convex, multi-cluster sample distribution.

Different from SMOTE-based methods, GAN generates synthetic samples by fitting the probability distribution,^25–27 which can better handle multi-cluster and non-convex problems. GAN is implemented by maximizing and minimizing the following equation:

\begin{matrix} min_{G} max_{D} L (D, G) = E_{X ~ P (X)} [\log D (X)] \\ + E_{Z ~ P (Z)} [\log (1 - D (G (Z)))] \end{matrix}

(7)

In GAN framework, G denotes generator, D denotes discriminator, X denotes real fault samples, and Z denotes random variable. After D reaching optimal state, virtual samples generated by G will fall into high-density central region. This leads to a high confidence value for $G (Z)$ when judged as real samples ( $D (G (Z))$ will be close to 1) and the objective function decreases. Therefore, most samples generated by GAN locate in central region and there are few boundary samples (those are more valuable for decision boundary).

Roundtrip²⁸ uses a bidirectional GAN to achieve mutual mapping between sample space and latent variable space. It introduces two discriminators to ensure consistency between the distributions of virtual data and real data. Due to reconstruction error term in objective function, generated samples may not tend to high-density regions (otherwise, reconstruction error increases). Therefore, roundtrip can generate many boundary samples while covering real sample space, which is beneficial for high detection accuracy.

Roundtrip fault detection method

Basic principle of roundtrip probability estimation

Roundtrip is an approach published in PNAS in 2021. Its basic principle is described in the reference.²⁸ The overall framework of the roundtrip method is shown in Figure 2.

Figure 2.

Overall framework of roundtrip.

It mainly includes four parts: generator networks G and H, discriminators D_X and D_Z. Z is a latent variable, which can be sampled from a standard normal distribution; X represents fault sample; $\tilde{Z}$ and $\tilde{X}$ denote virtual latent variables and virtual samples obtained by unidirectional mapping, which are calculated by the following equation:

\tilde{Z} = H (X)

(8)

\tilde{X} = G (Z)

(9)

The G network implements the mapping from latent variables to fault samples, while the H network implements the mapping from fault samples to latent variables, consistent with equations (8) and (9). The inputs of discriminator D_X are the real and virtual fault samples ${X, \tilde{X}}$, and the inputs of discriminator D_Z are the real and virtual latent variables ${Z, \tilde{Z}}$. D_X and D_Z determine authenticity (real 0, fake 1). By minimizing the six objective functions in equation (10), G, H, D_X, and D_Z reach the optimal configuration. L₁ and L₂ aim to obtain well-performing discriminators. L₃ and L₄ supervise G and H to generate high-quality virtual fault samples and virtual latent variables. Meanwhile, L₅ and L₆ enforce cycle consistency to improve the search for G and H. From another perspective, when L₁ decreases, L₃ necessarily increases, and when L₂ decreases, L₄ increases, which leads to adversarial training. When L₃ and L₄ decrease, virtual latent variables and virtual samples move away from the decision boundary and tend toward high-density areas with high confidence. In this case, L₅ and L₆ increase, which again results in adversarial training.

\left\{ \begin{matrix} min_{D_{X}} L_{1} = E_{X ~ P (X)} [D_{X}^{2} (X)] + E_{Z ~ P (Z)} [{(1 - D_{X} (\tilde{X}))}^{2}] \\ min_{D_{Z}} L_{2} = E_{X ~ P (X)} [{(1 - D_{Z} (\tilde{Z}))}^{2}] + E_{Z ~ P (Z)} [D_{Z}^{2} (Z)] \\ min_{G} L_{3} = E_{Z ~ P (Z)} [D_{X}^{2} (\tilde{X})] \\ min_{H} L_{4} = E_{X ~ P (X)} [D_{Z}^{2} (\tilde{Z})] \\ min_{G, H} L_{5} = E_{Z ~ P (Z)} [‖ Z - H (G (Z)) ‖_{2}^{2}] \\ min_{G, H} L_{6} = E_{X ~ P (X)} [‖ X - G (H (X)) ‖_{2}^{2}] \end{matrix} \right.

(10)
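To make the six objectives of equation (10) concrete, the sketch below evaluates them on sample batches, replacing each expectation by a batch mean. The scalar maps `g` and `h` and the constant discriminators are toy stand-ins chosen for illustration only; in the method itself these are the neural networks G, H, D_X, and D_Z.

```python
import random


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def roundtrip_objectives(xs, zs, g, h, d_x, d_z):
    """Batch estimates of L1..L6 in equation (10) (real -> 0, fake -> 1)."""
    x_fake = [g(z) for z in zs]  # equation (9): virtual samples
    z_fake = [h(x) for x in xs]  # equation (8): virtual latent variables
    return {
        "L1": mean(d_x(x) ** 2 for x in xs) + mean((1 - d_x(x)) ** 2 for x in x_fake),
        "L2": mean((1 - d_z(z)) ** 2 for z in z_fake) + mean(d_z(z) ** 2 for z in zs),
        "L3": mean(d_x(x) ** 2 for x in x_fake),
        "L4": mean(d_z(z) ** 2 for z in z_fake),
        "L5": mean((z - h(g(z))) ** 2 for z in zs),
        "L6": mean((x - g(h(x))) ** 2 for x in xs),
    }


# Toy one-dimensional setting (assumed): G and H are exact inverses of each
# other, and both discriminators are undecided (output 0.5 everywhere).
rng = random.Random(0)
xs = [rng.gauss(1.0, 2.0) for _ in range(50)]
zs = [rng.gauss(0.0, 1.0) for _ in range(50)]
losses = roundtrip_objectives(
    xs, zs,
    g=lambda z: 2.0 * z + 1.0,
    h=lambda x: (x - 1.0) / 2.0,
    d_x=lambda x: 0.5,
    d_z=lambda z: 0.5,
)
print(losses)
```

Because `g` and `h` invert each other here, the cycle-consistency terms L₅ and L₆ vanish up to rounding, while the undecided discriminators give L₁ = L₂ = 0.5.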

After each objective function stabilizes, probability distribution of fault samples can be obtained using Laplace approximation method, as shown in equation (11).

P^{LP} (X) \approx {(\frac{1}{\sqrt{2 π}})}^{n} σ^{- n} \sqrt{det (Σ)} e^{- \frac{c (X)}{2}}

(11)

where

c (X) = ‖ \tilde{Z} ‖_{2}^{2} + σ^{- 2} ‖ X - G (\tilde{Z}) ‖_{2}^{2} - μ^{T} Σ^{- 1} μ

(12)

Σ = I + σ^{- 2} J_{z}^{T} J_{z}

(13)

In equations (11)–(13), I represents the identity matrix; $μ$ is the mean value of X; $Σ$ is the covariance of X; $‖ \cdot ‖_{2}$ represents the two-norm, that is, the Euclidean distance; $J_{z}$ is the Jacobian matrix of $\tilde{X}$ with respect to Z; $σ$ is the standard deviation between the real values and the virtual values generated by G.

Improvements of roundtrip method

Detailed implementation of roundtrip network model

Considering that the fault samples used in the experiment are one-dimensional time series data, G, H, D_X, and D_Z all adopt fully connected neural networks; the details of each network are listed in Table 1. Each network has five layers, and each hidden layer has 20 units. In Table 1, dim_X is the dimension of a real fault sample and dim_Z is the dimension of the latent variable. Because the scales of the latent variable space and the fault sample space are not fixed, G and H use LeakyReLu activation functions in the hidden layers and no activation function at the output. Since D_X and D_Z output the confidence that a sample is real or fake, Tanh is used as the hidden-layer activation function and Sigmoid as the output activation function. Batch normalization is added to accelerate convergence of the discriminator networks.²⁹ Note that these structural parameters were chosen through repeated trials.

Table 1.

Network model in roundtrip (FCN represents Fully Connected Network).

	Network type	Number of layers	Neuron number	Hidden layer activation function	Output layer activation function	Batch normalization
G	FCN	5	[dim_Z,20,20,20,dim_X]	LeakyReLu	No	No
H	FCN	5	[dim_X,20,20,20,dim_Z]	LeakyReLu	No	No
D_X	FCN	5	[dim_X,20,20,20,2]	Tanh	Sigmoid	Yes
D_Z	FCN	5	[dim_Z,20,20,20,2]	Tanh	Sigmoid	Yes
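The layer layouts in Table 1 can be written down programmatically. The sketch below builds the neuron-number lists and counts the weights and biases of each fully connected network. The choice dim_X = dim_Z = 8 is a hypothetical example, and the extra parameters introduced by batch normalization in D_X and D_Z are not counted.

```python
def layer_sizes(dim_in, dim_out, hidden_units=20, hidden_layers=3):
    """Neuron numbers of a five-layer FCN as listed in Table 1."""
    return [dim_in] + [hidden_units] * hidden_layers + [dim_out]


def parameter_count(sizes):
    """Number of weights plus biases of a fully connected network."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(sizes, sizes[1:]))


dim_x, dim_z = 8, 8  # hypothetical dimensions, not taken from the paper

networks = {
    "G": layer_sizes(dim_z, dim_x),   # latent variable -> fault sample
    "H": layer_sizes(dim_x, dim_z),   # fault sample -> latent variable
    "D_X": layer_sizes(dim_x, 2),     # discriminator on samples
    "D_Z": layer_sizes(dim_z, 2),     # discriminator on latent variables
}
for name, sizes in networks.items():
    print(name, sizes, parameter_count(sizes))
```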

Detailed implementation of loss function

Mean squared error (MSE) is effective for measuring the consistency between two random variables. It is therefore suitable for L₅ and L₆ in equation (10), which correspond to the reconstruction of the latent variables and of the fault samples. However, the gradient of MSE is very small around the global minimum, which slows convergence when it is used for D_X and D_Z. For L₁–L₄, we use the binary cross-entropy loss function instead, which is suitable for true/false discrimination. Considering that L₃–L₆ are optimized simultaneously, they are merged into a single objective L_DG, and the loss functions are modified as follows:

\left\{ \begin{matrix} min_{D_{X}} L_{1} = - \sum_{x \in {X, \tilde{X}}, y \in {Y, \tilde{Y}}} y \log (D_{X} (x)) - \sum_{x \in {X, \tilde{X}}, y \in {Y, \tilde{Y}}} (1 - y) \log (1 - D_{X} (x)) \\ min_{D_{Z}} L_{2} = - \sum_{z \in {Z, \tilde{Z}}, p \in {P, \tilde{P}}} p \log (D_{Z} (z)) - \sum_{z \in {Z, \tilde{Z}}, p \in {P, \tilde{P}}} (1 - p) \log (1 - D_{Z} (z)) \\ min_{G, H} L_{DG} = - \sum_{x \in {\tilde{X}}, y \in {\tilde{Y}}} y \log (1 - D_{X} (x)) - \sum_{z \in {\tilde{Z}}, p \in {\tilde{P}}} p \log (1 - D_{Z} (z)) + α L_{5} + β L_{6} \end{matrix} \right.

(14)

With

\begin{matrix} L_{5} = \frac{1}{N} \sum_{z \in {Z}} ‖ z - H (G (z)) ‖_{2}^{2} \\ L_{6} = \frac{1}{N} \sum_{x \in {X}} ‖ x - G (H (x)) ‖_{2}^{2} \end{matrix}

(15)

In equation (14), $Y$ and $\tilde{Y}$ represent the labels of real and virtual samples, respectively; $P$ and $\tilde{P}$ represent the labels of real and virtual latent variables, respectively; N represents the number of samples.
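The losses of equation (14) can be sketched as follows, using the labelling convention stated earlier (real 0, fake 1). The inputs are lists of discriminator outputs, and `l5` and `l6` are the reconstruction errors of equation (15). The clipping constant `EPS` and the function names are assumptions of this sketch; the default α = β = 10 follows the benchmark setting reported in the experiments.

```python
import math

EPS = 1e-12  # numerical guard against log(0); an assumption of this sketch


def bce(label, prob):
    """Binary cross-entropy of one discriminator output."""
    prob = min(max(prob, EPS), 1.0 - EPS)
    return -label * math.log(prob) - (1 - label) * math.log(1.0 - prob)


def discriminator_loss(d_real, d_fake):
    """L1 (or L2) in equation (14): real inputs labelled 0, virtual inputs labelled 1."""
    return sum(bce(0, p) for p in d_real) + sum(bce(1, p) for p in d_fake)


def generator_loss(dx_fake, dz_fake, l5, l6, alpha=10.0, beta=10.0):
    """L_DG in equation (14): virtual data should be scored as real (output near 0)."""
    adv_x = -sum(math.log(1.0 - min(p, 1.0 - EPS)) for p in dx_fake)
    adv_z = -sum(math.log(1.0 - min(p, 1.0 - EPS)) for p in dz_fake)
    return adv_x + adv_z + alpha * l5 + beta * l6


# A discriminator that separates real from virtual data has a lower loss ...
good_d = discriminator_loss(d_real=[0.1], d_fake=[0.9])
bad_d = discriminator_loss(d_real=[0.9], d_fake=[0.1])
# ... while the generators prefer virtual data that the discriminators score near 0.
fooling = generator_loss(dx_fake=[0.1], dz_fake=[0.1], l5=0.0, l6=0.0)
caught = generator_loss(dx_fake=[0.9], dz_fake=[0.9], l5=0.0, l6=0.0)
print(good_d, bad_d, fooling, caught)
```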

Adversarial training

In adversarial training process, Adam algorithm is used to minimize the three objective functions in equation (14). During each iteration, parameters of discriminator D_X in L₁ objective function are updated as shown in equation (16).

\begin{matrix} θ_{D_{X}}^{t} = θ_{D_{X}}^{t - 1} - γ (β_{1} m_{D_{X}}^{t - 1} + (1 - β_{1}) \nabla_{θ} L_{1} (θ_{D_{X}})) / \\ \sqrt{(β_{2} v_{D_{X}}^{t - 1} + (1 - β_{2}) \nabla_{θ}^{2} L_{1} (θ_{D_{X}})) + ε} \end{matrix}

(16)

In the above equation, $θ_{D_{X}}$ represents the parameters of discriminator D_X, t is the iteration number, γ is the learning rate, and β₁ and β₂ are the first- and second-moment coefficients, respectively. $m_{D_{X}}^{t - 1}$ and $v_{D_{X}}^{t - 1}$ are the first- and second-moment estimates accumulated up to the previous iteration. $\nabla_{θ} L_{1} (θ_{D_{X}})$ is the first-order derivative of L₁ with respect to $θ_{D_{X}}$, $\nabla_{θ}^{2} L_{1} (θ_{D_{X}})$ denotes the element-wise square of this gradient, which is used to update the second-moment estimate, and $ε$ is used to avoid division by zero and is set to 1e-5 in this paper. Second, the parameters of discriminator D_Z are updated for the L₂ objective function.

\begin{matrix} θ_{D_{Z}}^{t} = θ_{D_{Z}}^{t - 1} - γ (β_{1} m_{D_{Z}}^{t - 1} + (1 - β_{1}) \nabla_{θ} L_{2} (θ_{D_{Z}})) / \\ \sqrt{(β_{2} v_{D_{Z}}^{t - 1} + (1 - β_{2}) \nabla_{θ}^{2} L_{2} (θ_{D_{Z}})) + ε} \end{matrix}

(17)

where $θ_{D_{Z}}$ represents variables in discriminator D_Z; $\nabla_{θ} L_{2} (θ_{D_{Z}})$ is the first-order derivative of L₂ with respect to $θ_{D_{Z}}$ ; $\nabla_{θ}^{2} L_{2} (θ_{D_{Z}})$ is the second-order moment of gradients. Finally, parameters in the objective function L_DG are updated.

\begin{matrix} θ^{t} = θ^{t - 1} - γ \frac{β_{1} m^{t - 1} + (1 - β_{1}) \nabla_{θ} L_{DG} (θ)}{\sqrt{(β_{2} v^{t - 1} + (1 - β_{2}) \nabla_{θ}^{2} L_{DG} (θ)) + ε}}, \\ θ \in {θ_{D_{X}}, θ_{D_{Z}}, θ_{G}, θ_{H}} \end{matrix}

(18)

Where $\nabla_{θ} L_{DG} (θ)$ is the first-order derivative of L_DG with respect to $θ$ , and $\nabla_{θ}^{2} L_{DG} (θ)$ is the second-order moment of gradients.
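The update rule shared by equations (16)–(18) can be sketched as a single function. The version below follows the equations as written, with ε inside the square root and no bias correction, whereas the commonly used form of Adam applies bias correction and adds ε outside the square root. Interpreting the squared-gradient term element-wise and carrying `m` and `v` forward as running averages are assumptions of this sketch; the default hyperparameters are those reported in this paper.

```python
import math


def adam_step(theta, grad, m, v, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-5):
    """One parameter update in the form of equation (16).

    theta, grad, m, v are lists of equal length; returns the updated
    parameters together with the new first- and second-moment estimates.
    """
    new_theta, new_m, new_v = [], [], []
    for t, g, m_prev, v_prev in zip(theta, grad, m, v):
        m_t = beta1 * m_prev + (1.0 - beta1) * g        # first-moment estimate
        v_t = beta2 * v_prev + (1.0 - beta2) * g * g    # second-moment estimate
        new_theta.append(t - lr * m_t / math.sqrt(v_t + eps))
        new_m.append(m_t)
        new_v.append(v_t)
    return new_theta, new_m, new_v


# Toy objective f(theta) = theta^2 with gradient 2 * theta (illustrative only).
theta, m, v = [1.0], [0.0], [0.0]
grad = [2.0 * theta[0]]
theta_new, m_new, v_new = adam_step(theta, grad, m, v)
print(theta_new, m_new, v_new)
```

In training, the same step would be applied in turn to the parameters of D_X (with the gradient of L₁), D_Z (with the gradient of L₂), and the parameters updated through L_DG.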

Identification of outlier samples

As roundtrip implements a mapping from latent variable space to sample space, outlier samples may be sampled from low probability density regions. To identify and remove outlier samples, the following equation is used:

\tilde{X} = {\tilde{x} | \exists x \in X, {‖ \tilde{x} - x ‖}_{2} < threshold}

(19)

Where $X$ and $\tilde{X}$ are the real and virtual sample sets, and threshold is a preset value. If no real sample lies within the Euclidean neighborhood of radius threshold around a virtual sample, the virtual sample is identified as an outlier and removed. For the non-convex and multi-cluster data shown in Figure 1, virtual samples are generated using Roundtrip and outliers are removed with threshold values of 0.7, 0.15, and 0.4, respectively (obtained by trial and error). The results are shown in Figure 3. The virtual samples effectively fill the multi-cluster and non-convex regions. Before removal, outlier samples appear in the blank regions; after removal, the remaining virtual samples are well distributed over the non-convex and multi-cluster regions.

Figure 3.

Real Fault samples (blue points) and virtual fault samples (red points, generated by roundtrip): (a) before removing outlier samples and (b) after removing outlier samples.

Note that for the cases in Section 3 and Section 4, the data are high-dimensional and cannot be visualized, so it is hard to find the optimal threshold. For the benchmark in Section 3, the threshold is set to 0.4. For the two cases in Section 4, the threshold is set to 10. Both values were also obtained by trial and error.
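The outlier rule described in this subsection amounts to a nearest-neighbor distance test: a virtual sample is kept only if at least one real sample lies within the threshold. A minimal sketch is given below; the two-dimensional points are hypothetical, and the brute-force search is chosen for clarity rather than speed.

```python
import math


def remove_outliers(virtual, real, threshold):
    """Keep virtual samples that have at least one real sample within `threshold`."""
    kept = []
    for v in virtual:
        # Euclidean distance to the closest real sample.
        nearest = min(math.dist(v, r) for r in real)
        if nearest < threshold:
            kept.append(v)
    return kept


# Hypothetical real and virtual samples in two dimensions.
real = [[0.0, 0.0], [1.0, 1.0]]
virtual = [[0.1, 0.1], [5.0, 5.0], [0.9, 1.2]]

kept = remove_outliers(virtual, real, threshold=0.4)
print(kept)
```

Here the point (5.0, 5.0) has no real sample nearby and is discarded, while the two points close to real samples are retained.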

Overall framework

The overall framework of roundtrip for mechanical equipment fault detection is shown in Figure 4. The details are as follows.

(1) Collect normal state samples and fault samples of mechanical equipment.

(2) Input fault samples into Roundtrip model to train G and H and obtain probability density function of fault samples according to equations (11)–(13).

(3) Sample virtual fault data to enhance fault sample amount.

(4) Merge normal state samples, real fault samples and virtual fault samples to obtain a balanced dataset, input it into classifier to perform fault detection.

Figure 4.

Overall framework of fault detection under imbalanced samples.

Benchmark experiment

Dataset description

We verified the proposed method on UCI (University of California Irvine) benchmark imbalanced dataset, which is obtained from http://archive.ics.uci.edu/ml/index.php. As shown in Table 2, there are 30 datasets, all of which are binary classification problems with the number of features ranging from 7 to 34. The minimum and maximum imbalance ratios are 1.3:1 and 175.5:1 respectively. UCI dataset has been widely used to test various learning methods as a standard dataset, making it suitable for evaluating the performance of the proposed method. Considering multiscale characteristics of different features, we use equation (20) to normalize each feature and rescale them to the range of [0,1].

F = \frac{F - F_{min}}{F_{max} - F_{min}}

(20)

Table 2.

UCI benchmark imbalanced dataset.

NO.	Dataset	# Majority class	# Minority class	Dimension	Imbalanced proportion (majority: minority)
1	Ecoli3	301	35	7	8.6:1
2	Ecoli4	316	20	7	15.8:1
3	Ecoli678	329	9	7	36.5:1
4	Abalone5v19	115	32	8	3.6:1
5	Abalone7v17	391	58	8	6.7:1
6	Abalone9v18	689	42	8	16.4:1
7	Abalone19	4145	32	8	129.5:1
8	Yeast1v3	429	163	8	14.3:1
9	Yeast0v4	463	51	8	9.1:1
10	Yeast1v7	429	30	8	14.3:1
11	Yeast6	1449	35	8	41.4:1
12	Glass2	138	76	9	1.8:1
13	Glass1	144	70	9	2.1:1
14	Glass7	185	29	9	6.4:1
15	Glass5	201	13	9	15.5:1
16	Glass6	205	9	9	22.8:1
17	Shuttle5v3	809	39	9	20.7:1
18	Shuttle4v2	2155	13	9	165.8:1
19	Pageblocks2v4	329	88	10	3.7:1
20	Pageblocks45v3	203	28	10	7.3:1
21	Pageblocks25v3	444	28	10	15.9:1
22	Pageblocks1v5	4913	115	10	42.7:1
23	Pageblocks1v3	4913	28	10	175.5:1
24	Wine3	130	48	13	2.7:1
25	Vehicle4	647	199	17	3.3:1
26	Segment123	120	90	19	1.3:1
27	Segment1	180	30	19	6.0:1
28	Thyroid-ann2	3581	191	21	18.8:1
29	Thyroid-ann1	3679	93	21	39.6:1
30	Ionosphere	225	126	34	1.8:1

Where F represents feature to be processed; F_min and F_max are the minimum and maximum feature values in historical data.
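Equation (20) is the usual min-max scaling applied to each feature column. The following minimal sketch shows it on a small hypothetical table; mapping a constant feature to 0 is an assumption added here to avoid division by zero and is not discussed in the text.

```python
def min_max_scale(column):
    """Equation (20) for one feature column."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant feature: assumed to map to 0 in this sketch.
        return [0.0 for _ in column]
    return [(value - lo) / (hi - lo) for value in column]


def normalize_features(rows):
    """Scale every feature (column) of a list of samples (rows) to [0, 1]."""
    columns = [min_max_scale(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*columns)]


# Hypothetical samples with two features on different scales.
rows = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]]
scaled = normalize_features(rows)
print(scaled)
```

When new data arrive, F_min and F_max would be taken from the historical data, as stated above, rather than recomputed on the new batch.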

Evaluation metrics

The F1 score is selected as the evaluation metric, since it considers both the sensitivity (recall) to fault samples and the reliability (precision) of the detection method. A detailed analysis of this metric can be found in the reference.³⁰ The F1 score is calculated as follows.

F1 = 2 \cdot \frac{P \cdot R}{P + R}

(21)

P = \frac{# Correctly Classified Fault Samples}{# Correctly Classified Fault Samples + # Misclassified Normal Samples}

(22)

R = \frac{# Correctly Classified Fault Samples}{# Correctly Classified Fault Samples + # Misclassified Fault Samples}

(23)

Where P denotes precision and R denotes recall. Since the amount of minority class samples is small, a three-fold cross-validation is adopted and F1 score is calculated by averaging the results of three-fold cross-validation.
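Equations (21)–(23) can be sketched directly from label lists. In the example below, label 1 marks a fault sample; returning 0 when no fault sample is detected is an assumption of this sketch for the otherwise undefined case. The labels and fold scores are made up for illustration.

```python
def f1_score(y_true, y_pred, fault_label=1):
    """F1 score of the fault class following equations (21)-(23)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == fault_label and p == fault_label)
    fp = sum(1 for t, p in pairs if t != fault_label and p == fault_label)
    fn = sum(1 for t, p in pairs if t == fault_label and p != fault_label)
    if tp == 0:
        # No fault sample detected: precision or recall is 0 or undefined.
        return 0.0
    precision = tp / (tp + fp)  # equation (22)
    recall = tp / (tp + fn)     # equation (23)
    return 2 * precision * recall / (precision + recall)


def mean_f1(fold_scores):
    """Average F1 over cross-validation folds."""
    return sum(fold_scores) / len(fold_scores)


# Hypothetical fold: 3 fault samples and 5 normal samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
score = f1_score(y_true, y_pred)
all_normal = f1_score(y_true, [0] * len(y_true))
print(score, all_normal, mean_f1([score, all_normal, 1.0]))
```

A classifier that labels every sample as normal obtains an F1 score of 0 under this convention, no matter how high its overall accuracy is.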

Results and analysis

Experimental parameters are set as follows: α and β in equation (14) are set to 10; the learning rate γ in equations (16)–(18) is set to 0.01; the first- and second-moment coefficients β₁ and β₂ are set to 0.9 and 0.999, respectively; the threshold in equation (19) is set to 0.4; and the number of iterations is set to 10,000. These parameters were obtained through multiple trials. In addition, to balance the cost between real samples and virtual samples in equation (14), the size and dimension of the virtual latent variables are the same as those of the real samples.

Four datasets (Ecoli3, Yeast0v4, Pageblocks1v3, and Wine3) are selected for analysis. The variation of the objective functions during iteration is shown in Figure 5. L₁ and L₂ do not conflict, so their sum is shown for ease of presentation. Both curves show a decreasing trend, indicating that the iteration process is reasonable. In addition, while L_DG descends, L₁+L₂ may fluctuate. This is because the generated samples are close to the real samples, making it difficult for the discriminators to distinguish between real and virtual samples, which produces an adversarial effect.

Figure 5.

Iterative process curve: (a) Ecoli3, (b) Yeast0V4, (c) Pageblocks1v3, and (d) Wine3.

According to equation (11), probability distribution of minority class samples is obtained and virtual samples (number=# majority class - # minority class) are sampled. Logistic regression and SVM are used to recognize the datasets. For SVM, RBF kernel is adopted with a regularization factor of 1.0. F1 values are shown in Table 3. It is seen that with logistic regression classifier, there are 14 F1 values greater than that before enhancement, 9 equal value and 7 smaller value. With SVM classifier, there are 13 F1 values greater than that before enhancement, 10 equal value and 7 smaller value. It indicates that roundtrip method can generate effective virtual samples, thereby reducing misdetections.

Table 3.

F1 values before and after enhancement with two different classifiers.

No.	Dataset	Logistic regression		SVM
No.	Dataset	Before	After	Before	After
1	Ecoli3	0	0.12	0.65	0
2	Ecoli4	0	0.10	0.89	0.83
3	Ecoli678	0	0	0.62	0.63
4	Abalone5v19	0.95	1	0.98	0.98
5	Abalone7v17	0.65	0.86	0.68	0.77
6	Abalone9v18	0.04	0.33	0	0.05
7	Abalone19	0	0	0	0
8	Yeast1v3	0.64	0.82	0.85	0.84
9	Yeast0v4	0.22	0.06	0.68	0.37
10	Yeast1v7	0	0	0	0.11
11	Yeast6	0	0	0	0
12	Glass2	0.15	0.28	0	0.52
13	Glass1	0.51	0.65	0	0.14
14	Glass7	0.81	0.84	0	0
15	Glass5	0.35	0.35	0	0
16	Glass6	0.17	0	0	0
17	Shuttle5v3	0.99	0.98	0.31	0.26
18	Shuttle4v2	1	0.69	0	0.07
19	Pageblocks2v4	0.95	0.97	0	0
20	Pageblocks45v3	0.94	0.55	0.37	0.33
21	Pageblocks25v3	0.61	0.53	0.47	0.49
22	Pageblocks1v5	0.58	0.30	0.27	0.26
23	Pageblocks1v3	0.65	0.66	0.13	0.48
24	Wine3	0.95	0.97	0	0
25	Vehicle4	0.90	0.90	0.18	0.60
26	Segment123	0.78	0.79	0.78	0.80
27	Segment1	0.95	0.95	0	0
28	Thyroid-ann2	0	0	0	0
29	Thyroid-ann1	0	0.10	0.15	0.19
30	Ionosphere	0.81	0.81	0.90	0.91

Case studies on mechanical fault detection

Wind turbine blade crack detection

Dataset description

The wind turbine, whose structure is shown in Figure 6, is widely used for wind power generation. The blade is the main energy conversion component of a wind turbine and operates under heavy loads. According to statistics, accidents caused by blade cracks account for 30% of total accidents, and they occur frequently during high-wind power generation periods. In this case study, data provided by State Power Investment Corporation Limited are used for validation; they are available at https://www.datafountain.cn/competitions/302. There are 2000 normal samples and a varying number of blade crack samples, forming 10 imbalanced datasets, as listed in Table 4. The minimum imbalance ratio is 2:1 and the maximum is 20:1. Each sample contains 75 features, such as pitch motor current, vibration in the x and y directions, and inverter grid-side current.

Figure 6.

Structure diagram of wind turbine. (1) wheel hub; (2) blades; (3) rotor; (4) gear box; (5) anemometer and wind vane; (6) generator; (7) control box; (8) tower; (9) yaw system; (10) main bearing.

Table 4.

Dataset details.

No.	Dataset	# Majority class	# Minority class	Imbalanced ratio
1	Wind_5	2000	100	20:1
2	Wind_10	2000	200	10:1
3	Wind_15	2000	300	6.67:1
4	Wind_20	2000	400	5:1
5	Wind_25	2000	500	4:1
6	Wind_30	2000	600	3.33:1
7	Wind_35	2000	700	2.86:1
8	Wind_40	2000	800	2.5:1
9	Wind_45	2000	900	2.22:1
10	Wind_50	2000	1000	2:1

Detection results with different classifiers

For equation (14), α and β are both set to 15 (obtained through repeated experiments), and the other parameters are configured as described in Section 3.3. Normalization is performed using equation (20). The first four datasets are selected, and the curves of their objective functions during iteration are shown in Figure 7. Both curves trend downward as the iteration progresses, indicating that the iteration is reasonable. In the early stage of iteration, when L_DG decreases, L₁+L₂ fluctuates violently because the generated virtual samples are close to the real samples, making it difficult for the discriminators to distinguish them. Conversely, when L₁+L₂ decreases, L_DG fluctuates because the adversarial cost of the virtual samples in L_DG increases as the discriminators improve. Overall, the two objective functions are stable at the end of the iteration.

Figure 7.

Iterative process curve: (a) Wind_5, (b) Wind_10, (c) Wind_15, and (d) Wind_20.

As shown in Table 5, with the logistic regression classifier, 5 F1 scores improve after enhancement, 2 get worse, and 3 are unchanged. With the SVM classifier, 7 F1 scores improve after enhancement and 3 get worse. The average F1 score of the 10 groups increases from 0.54 to 0.56 for logistic regression and from 0.43 to 0.46 for SVM. Moreover, as the number of minority class samples increases, the F1 score improves. However, once the ratio of minority class to majority class reaches 35%, the F1 score tends to stabilize or even decrease. This is because improving the recognition of minority class samples increases the misidentification of majority class samples, and the F1 score reflects both effects. Combining the two classifiers, the average F1 score is 0.485 (i.e. (0.54 + 0.43)/2) before enhancement and 0.51 (i.e. (0.56 + 0.46)/2) after enhancement. This indicates that the designed method improves the wind turbine fault detection rate and reduces misdetection while maintaining the overall recognition rate.

Table 5.

F1 values before and after balance under two different classifiers.

No.	Dataset	Logistic regression (before)	Logistic regression (after)	SVM (before)	SVM (after)
1	Wind_5	0.24	0.30	0.24	0.26
2	Wind_10	0.40	0.52	0.30	0.39
3	Wind_15	0.56	0.56	0.24	0.49
4	Wind_20	0.58	0.61	0.52	0.57
5	Wind_25	0.62	0.63	0.47	0.60
6	Wind_30	0.67	0.67	0.64	0.65
7	Wind_35	0.63	0.61	0.57	0.54
8	Wind_40	0.58	0.59	0.54	0.28
9	Wind_45	0.58	0.58	0.43	0.47
10	Wind_50	0.59	0.58	0.44	0.37
Average	-	0.54	0.56	0.43	0.46
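The counts and combined averages quoted for Table 5 can be rechecked with a short script. This is only a bookkeeping sketch: the lists copy the table values, the helper name `tally` is hypothetical, and the combined averages reuse the rounded per-classifier averages reported in the table rather than recomputing them.

```python
# F1 scores copied from Table 5, in the order Wind_5 ... Wind_50.
lr_before = [0.24, 0.40, 0.56, 0.58, 0.62, 0.67, 0.63, 0.58, 0.58, 0.59]
lr_after = [0.30, 0.52, 0.56, 0.61, 0.63, 0.67, 0.61, 0.59, 0.58, 0.58]
svm_before = [0.24, 0.30, 0.24, 0.52, 0.47, 0.64, 0.57, 0.54, 0.43, 0.44]
svm_after = [0.26, 0.39, 0.49, 0.57, 0.60, 0.65, 0.54, 0.28, 0.47, 0.37]

def tally(before, after):
    """Count datasets whose F1 improved, stayed equal, or got worse."""
    better = sum(a > b for b, a in zip(before, after))
    same = sum(a == b for b, a in zip(before, after))
    worse = sum(a < b for b, a in zip(before, after))
    return better, same, worse

print("logistic regression (better, same, worse):", tally(lr_before, lr_after))
print("SVM (better, same, worse):", tally(svm_before, svm_after))

# Combined averages from the reported per-classifier averages.
combined_before = (0.54 + 0.43) / 2
combined_after = (0.56 + 0.46) / 2
print(round(combined_before, 3), round(combined_after, 3))
```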

Comparison with other methods

Comparison method

To further verify the advantage of the proposed method, it is compared with three other methods: SMOTE,⁵ GAN,³¹ and Roundtrip1 (Roundtrip without removing outlier samples). For SMOTE, the number of nearest neighbors is 5, which is a commonly used value. For GAN, the model structure is listed in Table 6 and is the same as that in reference.³¹ Roundtrip1 has the same structure as Roundtrip.
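For readers unfamiliar with the SMOTE baseline, the role of the nearest-neighbor count can be seen in a minimal NumPy sketch of its interpolation step. This is an illustrative simplification, not the implementation used in the experiments; the function name `smote_sketch`, the toy data, and the random seeds are assumptions.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating toward one of k nearest neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances within the minority class.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)                 # a sample is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]   # indices of the k nearest neighbours

    base = rng.integers(0, len(X_min), size=n_new)            # randomly chosen base samples
    pick = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour per base sample
    gap = rng.random((n_new, 1))                              # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[pick] - X_min[base])

X_min = np.random.default_rng(1).normal(size=(20, 3))  # toy minority class, 20 samples
X_new = smote_sketch(X_min, n_new=40, k=5)
print(X_new.shape)
```

Because every synthetic point lies on a segment between two real minority samples, the method cannot place samples outside the region already covered by the minority class.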

Table 6.

Model structure of GAN.³¹

	Generator	Discriminator
Input layer	10	Sample dimension
Activation function	Leaky ReLU	Leaky ReLU
Batch normalization	Yes	–
Hidden layer1	15	10
Activation function	Leaky ReLU	–
Batch normalization	Yes	–
Hidden layer2	20	–
Activation function	Sigmoid	Sigmoid
Output layer	Sample dimension	1

Comparison results

Comparison results are listed in Table 7. The mean F1 scores of the three comparison methods are 0.55, 0.52, and 0.52, respectively. Overall, the proposed method is only slightly better than SMOTE, but clearly better than GAN and Roundtrip1. Specifically, the proposed method scores lower than SMOTE on three datasets (Wind_5, Wind_30, and Wind_35) and lower than GAN on one dataset (Wind_5). Likewise, F1 scores increase as the imbalance ratio is alleviated, but once the ratio reaches 35%, the F1 score stabilizes or even decreases (see Section 4.1.2). In summary, the proposed method can improve the wind turbine fault detection rate while maintaining the overall recognition rate.

Table 7.

Comparisons with other methods (F1 values with the logistic regression classifier).

No.	Dataset	Proposed	SMOTE	GAN	Roundtrip1
1	Wind_5	0.30	0.31	0.32	0.25
2	Wind_10	0.52	0.45	0.44	0.45
3	Wind_15	0.56	0.55	0.53	0.52
4	Wind_20	0.61	0.58	0.56	0.56
5	Wind_25	0.63	0.60	0.56	0.58
6	Wind_30	0.67	0.68	0.63	0.60
7	Wind_35	0.61	0.65	0.60	0.60
8	Wind_40	0.59	0.58	0.54	0.55
9	Wind_45	0.58	0.57	0.54	0.55
10	Wind_50	0.58	0.57	0.54	0.54
Average	-	0.56	0.55	0.52	0.52

IMS bearing fault detection

Dataset description

Bearing vibration data provided by NASA are used for validation. Figure 8(a) shows the experimental setup, in which four Rexnord ZA-2115 double-row bearings, each with 16 rolling elements, are installed on the shaft. A radial load of 6000 pounds is applied to the shaft and the sampling frequency is 20 kHz. More details can be found in reference.³² Figure 8(b) to (d) show the degradation signals of three bearings, where the green line represents the normal state (N), the blue line the degraded state (D), and the red line the faulty state (F). The sample sizes are clearly imbalanced; the specific sample information is given in Table 8. Since the signals of rotating machinery are periodic, each sample is transformed into the frequency domain. Because the spectrum is symmetric, only one half is kept, so the length of each transformed sample is 1000.
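The frequency-domain transformation can be sketched with NumPy's real FFT. The segment length of 2000 points, the choice to keep the first 1000 magnitude bins, and the function name `to_spectrum` are assumptions for illustration; the paper states only the 20 kHz sampling frequency and the final sample length of 1000.

```python
import numpy as np

def to_spectrum(segment, n_bins=1000):
    """Return the first n_bins one-sided FFT magnitudes of a real-valued segment."""
    segment = np.asarray(segment, dtype=float)
    spectrum = np.abs(np.fft.rfft(segment))  # one-sided spectrum, len(segment)//2 + 1 bins
    return spectrum[:n_bins]

fs = 20_000                             # sampling frequency from the text (20 kHz)
n_points = 2000                         # assumed raw segment length
t = np.arange(n_points) / fs
segment = np.sin(2 * np.pi * 500 * t)   # toy 500 Hz tone standing in for a vibration signal

features = to_spectrum(segment)
peak_hz = int(np.argmax(features)) * fs / n_points
print(features.shape, peak_hz)
```

With these assumed settings each bin spans 10 Hz, so the 1000 retained bins cover the band from 0 Hz up to just below the 10 kHz Nyquist frequency.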

Figure 8.

Bearing test device: (a) test rig, (b) acceleration signal of bearing 3, (c) acceleration signal of bearing 4, and (d) acceleration signal of bearing 1.

Table 8.

Data details.

No.	Bearing	Sample length	Majority class	Minority class	Number of samples (Maj:Min)	Imbalance ratio
1	Bearing 3	1000	N	D	1800:300	6.00
2	Bearing 3	1000	D	F	300:56	5.36
3	Bearing 4	1000	N	D	1400:450	3.11
4	Bearing 4	1000	D	F	450:306	1.47
5	Bearing 1	1000	N	D	700:250	2.80
6	Bearing 1	1000	D	F	250:34	7.35
7	Bearing 1&3&4	1000	Rest	F	5240:56	93.57
8	Bearing 1&3&4	1000	Rest	F	4990:306	16.31
9	Bearing 1&3&4	1000	Rest	F	5262:34	154.76
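The imbalance ratios in Table 8 follow directly from the sample counts, as the short check below shows. The counts are copied from the table; the helper name `imbalance_ratio` is hypothetical.

```python
# (majority, minority) sample counts copied from Table 8, groups 1 to 9.
groups = [(1800, 300), (300, 56), (1400, 450), (450, 306), (700, 250),
          (250, 34), (5240, 56), (4990, 306), (5262, 34)]

def imbalance_ratio(n_majority, n_minority):
    """Number of majority class samples per minority class sample."""
    return n_majority / n_minority

ratios = [f"{imbalance_ratio(maj, mn):.2f}" for maj, mn in groups]
for i, r in enumerate(ratios, start=1):
    print(i, r)

# Groups 7 to 9 pool all three bearings, so each should contain the same total.
pooled_totals = {maj + mn for maj, mn in groups[6:]}
print(pooled_totals)
```

Groups 7 to 9 each total 5296 samples, which equals the sum of the per-bearing counts in groups 1 to 6 and is consistent with the "Rest" class pooling every sample that is not the selected fault class.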

Detection results with different classifiers

Experimental parameters are set to the values in Section 3.3 and normalization is performed using equation (20). The second and sixth groups of data are selected, and the curves of their objective functions during iteration are shown in Figure 9. After 10,000 iterations, both curves tend to be stable. The F1 values with the two classifiers are shown in Table 9. With the logistic regression classifier, six F1 scores improve after enhancement, two are unchanged, and one gets worse. With the SVM classifier, six F1 scores improve after enhancement and three are unchanged. The average F1 values increase from 0.69 to 0.70 and from 0.76 to 0.85, respectively. With the logistic regression classifier, the average does not increase noticeably because the first F1 score is much worse than before enhancement (0.62 to 0.48), although the last four F1 scores all improve. In summary, the average F1 value is 0.725 (i.e. (0.69 + 0.76)/2) before enhancement and 0.775 (i.e. (0.70 + 0.85)/2) after enhancement. This indicates that the Roundtrip method can improve the detection rate of bearing faults and reduce the possibility of misdetection while maintaining the overall recognition rate.

Figure 9.

Iterative process curve: (a) Data 2 and (b) Data 6.

Table 9.

F1 values before and after balance with two different classifiers.

No.	Logistic regression (before)	Logistic regression (after)	SVM (before)	SVM (after)
1	0.62	0.48	0.78	0.81
2	0.66	0.66	0.75	0.75
3	0.75	0.76	0.87	0.88
4	0.93	0.94	0.96	0.97
5	0.96	0.96	0.99	0.99
6	0.79	0.87	0.51	0.84
7	0.60	0.63	0.62	0.64
8	0.25	0.27	0.93	0.93
9	0.70	0.77	0.43	0.87
Average	0.69	0.70	0.76	0.85

Comparison with other methods

The proposed method is compared with three other methods: SMOTE, GAN, and Roundtrip1. Their parameters and structures are described in Section 4.1.3. The results are listed in Table 10. The mean F1 scores of the three comparison methods are 0.36, 0.68, and 0.65, respectively. Overall, the proposed method is only slightly better than GAN, but clearly better than SMOTE and Roundtrip1. SMOTE nearly fails on this task: its F1 score is 0 on group 1 and below 0.15 on groups 2, 7, and 8. Although GAN achieves two higher F1 scores (groups 8 and 9), its other seven F1 scores are lower than those of the proposed method. For Roundtrip1, the results show the effectiveness of removing outlier samples. Overall, the proposed method is effective for improving the fault detection rate while maintaining the overall recognition rate.

Table 10.

Comparisons with other methods (F1 values with the logistic regression classifier).

No.	Proposed	SMOTE	GAN	Roundtrip1
1	0.48	0	0.45	0.45
2	0.66	0.13	0.58	0.59
3	0.76	0.70	0.72	0.72
4	0.94	0.58	0.92	0.88
5	0.96	0.88	0.93	0.90
6	0.87	0.45	0.83	0.82
7	0.63	0.09	0.58	0.61
8	0.27	0.07	0.34	0.24
9	0.77	0.34	0.79	0.71
Average	0.70	0.36	0.68	0.65

Conclusion

Data-driven methods are a hot topic in the field of mechanical fault detection. However, imbalanced samples shift the decision boundary, resulting in misdetection. In this paper, we propose a fault detection method based on roundtrip probability estimation. By bidirectional mapping between the latent variable space and the sample space, we increase the number of fault samples and reduce misdetections. The method is applied to 30 benchmark test sets, and the results show that the classifier's F1 value improves after roundtrip enhancement. The method is also applied to wind turbine blade cracking and NASA bearing fault detection. The results show that the roundtrip method improves the recognition rate of fault states and reduces misdetections while maintaining the overall recognition rate, raising the average F1 score from 0.485 to 0.51 and from 0.725 to 0.775 for the two cases, respectively.

In future work, we will focus on generating multi-class data to handle coupled imbalance. We will also extend the method to a wider range of industrial applications.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by National Natural Science Foundation of China [Grant Number 52105536], Key Scientific and Technological Research Projects in Henan Province [Grant Number 212102210072], Guangdong Basic and Applied Basic Research Foundation [Grant Number 2022A1515140066], Key Research and Development Projects of Henan Province [Grant Number 221111240200], Open Project of Henan Key Laboratory of Intelligent Manufacturing of Mechanical Equipment, Zhengzhou University of Light Industry [Grant Number IM202309], and Key Scientific Research Projects of Institutions of Higher Learning in Henan Province [Grant Number 22A460033].

ORCID iDs

Zhang Yuyan

Wen Xiaoyu

Data availability statement

Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.

References

1. Zhang, Gao, Wen, et al. Intelligent fault diagnosis of machine under noisy environment using ensemble orthogonal contractive auto-encoder. Expert Syst Appl 2022; 203: 117408.
2. Chen, Rao, Feng, et al. Modified varying index coefficient autoregression model for representation of the nonstationary vibration from a planetary gearbox. IEEE Trans Instrum Meas 2023; 72: 1–12.
3. Tidriri, Chatti, Verron, et al. Bridging data-driven and model-based approaches for process fault diagnosis and health monitoring: a review of researches and future challenges. Annu Rev Control 2016; 42: 63–81.
4. Chen, Rao, Feng, et al. Physics-informed LSTM hyperparameters selection for gearbox fault detection. Mech Syst Signal Process 2022; 171: 108907.
5. Zhang, Gao, et al. Imbalanced data fault diagnosis of rotating machinery using synthetic oversampling and feature learning. J Manuf Syst 2018; 48: 34–50.
6. Chawla, Bowyer, Hall, et al. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 2002; 16: 321–357.
7. Bunkhumpornpat, Sinapiromsaran, Lursinsap. DBSMOTE: density-based synthetic minority over-sampling technique. Appl Intell 2012; 36: 664–684.
8. Huang G-B, Zhou, Ding, et al. Extreme learning machine for regression and multiclass classification. IEEE Trans Syst Man Cybern B 2012; 42(2): 513–529.
9. Bai, Garcia, et al. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE international joint conference on neural networks (IEEE world congress on computational intelligence), Hong Kong, 1–8 June 2008. New York: IEEE.
10. Batista, Prati, Monard. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor Newsl 2004; 6(1): 20–29.
11. Wei, Huang, Yao, et al. New imbalanced bearing fault diagnosis method based on sample-characteristic oversampling TechniquE (SCOTE) and multi-class LS-SVM. Appl Soft Comput 2021; 101: 107043.
12. Han, Y-F. Out-of-distribution detection-assisted trustworthy machinery fault diagnosis approach with uncertainty-aware deep ensembles. Reliab Eng Syst Saf 2022; 226: 108648.
13. Liu, Zhang, Jiang. Imbalanced fault diagnosis of rolling bearing using improved MsR-GAN and feature enhancement-driven CapsNet. Mech Syst Signal Process 2022; 168: 108664.
14. Fan, Yuan, Miao, et al. Full attention Wasserstein GAN with gradient normalization for fault diagnosis under Imbalanced Data. IEEE Trans Instrum Meas 2022; 71: 1–16.
15. Han, Xie, Pei. Semi-supervised adversarial discriminative learning approach for intelligent fault diagnosis of wind turbine. Inf Sci 2023; 648: 119496.
16. Domingos. Metacost: a general method for making classifiers cost-sensitive. In: Proceedings of the fifth ACM SIGKDD international conference on knowledge discovery and data mining, 1999. New York, NY: Association for Computing Machinery.
17. Wang, Zhou, et al. Open world long-tailed data classification through active distribution optimization. Expert Syst Appl 2023; 213: 119054.
18. WWY, Zhang, Lai, et al. Cost-sensitive weighting and imbalance-reversed bagging for streaming imbalanced and concept drifting in electricity pricing classification. IEEE Trans Ind Inform 2019; 15(3): 1588–1597.
19. Zhu, Wang, et al. Multiple empirical kernel learning with majority projection for imbalanced problems. Appl Soft Comput 2019; 76: 221–236.
20. Wang. Fault diagnosis method of check valve based on multikernel cost-sensitive extreme learning machine. Complexity 2017; 2017: 1–19.
21. Hai Xiang, Jing, Shang, Y, J. Learning from class-imbalanced data: review of methods and applications. Expert Syst Appl 2017; 73: 220–239.
22. Huang, Zhang. Self-adaptive training: beyond empirical risk minimization. Adv Neural Inf Process Syst 2020; 33: 19365–19376.
23. Shawe-Taylor, Bartlett, Williamson, et al. Structural risk minimization over data-dependent hierarchies. IEEE Trans Inf Theory 1998; 44(5): 1926–1940.
24. Fernandez, Garcia, Herrera, et al. SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. J Artif Intell Res 2018; 61: 863–905.
25. Goodfellow, Pouget-Abadie, Mirza, et al. Generative adversarial networks. Commun ACM 2020; 63(11): 139–144.
26. Shao, Cai, et al. Dual-threshold attention-guided GAN and limited infrared thermal images for rotating machinery fault diagnosis under speed fluctuation. IEEE Trans Ind Inform 2023; 19: 9933–9942.
27. Chen, Shao, Dou, et al. Data augmentation and intelligent fault diagnosis of planetary gearbox using ILoFGAN under extremely limited samples. IEEE Trans Reliab 2023; 72: 1029–1037.
28. Liu, Jiang, et al. Density estimation using deep generative neural networks. Proc Natl Acad Sci 2021; 118(15): e2101344118.
29. Ioffe, Szegedy. Batch normalization: accelerating deep network training by reducing internal covariate shift. In: ICML'15: Proceedings of the 32nd International conference on machine learning, 2015, vol. 37, pp. 448–456. JMLR.org.
30. Mitchell. Machine learning. Vol. 1. New York, NY: McGraw-Hill, 2007.
31. Yuyan, Yongqi, Chunya, et al. Identification method of cracking state of wind turbine blade based on GAN under imbalanced samples. Comput Integr Manuf Syst 2023; 29(2): 532–543.
32. Qiu, Lee, Lin, et al. Wavelet filter-based weak signature detection method and its application on rolling element bearing prognostics. J Sound Vib 2006; 289(4–5): 1066–1090.