Abstract

Long-term operation of electric motors under complex and extreme conditions can lead to unpredictable failures; accurate diagnosis of motor faults has therefore attracted the attention of scholars and engineers. However, noise in the vibration signals of the motor’s rolling bearings strongly degrades the diagnostic performance of a model during feature extraction and fault classification. To denoise the vibration signal and classify faults accurately, this study proposes a novel method based on WOA-SVMD and a multi-scale CNN-Transformer. Firstly, the Whale Optimization Algorithm (WOA) is employed to obtain the optimal parameters of Successive Variational Mode Decomposition (SVMD), which is then used to decompose the signal into Intrinsic Mode Functions (IMFs). Secondly, uncorrelated components are removed using the correlation coefficient method, and the remaining IMFs are reconstructed into a new signal. Thirdly, local and global features of the signal are extracted using a multi-scale Convolutional Neural Network (CNN) and a Transformer. Finally, the fault type is classified using the softmax function. The experimental results show that the proposed method effectively reduces noise interference, and the fault diagnosis accuracy reaches 99.24% on the CWRU dataset and 99.68% on the PU dataset.

Keywords

Fault diagnosis, whale optimization algorithm, successive variational mode decomposition, convolutional neural network, transformer

Introduction

With the rapid development of science and technology, electric motors have found extensive applications in modern industry and daily life.^1–3 The majority of rotating machinery failures are caused by bearing failures.⁴ Bearings are vital to the normal operation of motors: they carry loads, reduce friction, protect critical components, and ensure smooth and efficient operation.⁵ Rolling bearings use rolling elements to reduce friction and have a lower friction factor, which reduces not only energy loss but also the heat generated by friction, contributing to the efficient operation and longer life of the drive motor. Moreover, rolling bearings can absorb and reduce the loads and impacts caused by road shocks and vibrations, playing a cushioning role and protecting the motor and other key components from excessive stress and damage.^6–8 Therefore, accurately diagnosing motor bearing faults is of great significance. Through the efforts of researchers at home and abroad, rolling bearing fault diagnosis has progressed from traditional manual feature extraction and fault-cause analysis to the current autonomous diagnosis based on deep learning.

Common signal decomposition methods include the Fourier transform, the wavelet transform, and Empirical Mode Decomposition (EMD), among which EMD has been widely used. Qi et al.⁹ developed a bearing fault diagnosis technique that integrates empirical mode decomposition, deep learning, and knowledge graphs to enhance multi-dimensional feature extraction and provide comprehensive fault information, achieving high accuracy even under varying loads and noisy conditions. However, challenges such as mode overlap and pseudo-modes at signal boundaries persist with EMD. In 2014, Dragomiretskiy and Zosso¹⁰ proposed Variational Mode Decomposition (VMD), which improves upon EMD by using a non-recursive decomposition model and ensuring stable results through the construction of a variational problem. Zhou et al.¹¹ advanced rolling bearing fault diagnosis by incorporating the Whale Gray Wolf optimization algorithm, variational mode decomposition, and support vector machine (SVM) to effectively diagnose faults in nonlinear and nonstationary vibration signals. Zhen et al.¹² developed a fault diagnosis method that combined VMD with cyclostationarity demodulation to extract fault characteristics from noisy bearing signals. However, VMD also has limitations, such as its reliance on the number of mode components $K$ and the penalty factor $α$.¹³ Inspired by VMD, Nazari and Sakhaei¹⁴ proposed a novel Variational Mode Extraction (VME) method. VME can directly extract target IMF components from vibration signals with faster convergence and higher computational efficiency. Zhong et al.¹⁵ presented a bearing fault diagnosis scheme that integrated VME with particle swarm optimization to improve the accuracy of fault identification. However, practical applications of VME are hampered by the difficulty of determining the initial center frequency of the required IMF and of selecting the balance parameters.
Hence, Nazari and Sakhaei¹⁶ introduced Successive Variational Mode Decomposition (SVMD), which applies VME successively to decompose the signal adaptively into a series of IMF components. Compared with VMD, SVMD does not require prior knowledge of the number of IMFs and has lower computational complexity; compared with VME, it avoids the difficulty of computing the initial center frequencies. Jiang et al.¹⁷ developed an SVMD method combined with a mode regrouping strategy based on a sparsity index to extract fault features from rolling bearing signals more effectively. Guo et al.¹⁸ introduced a rolling bearing fault diagnosis method utilizing SVMD with an energy concentration and position accuracy index to optimize the parameter value and identify the target mode. However, the performance of SVMD is influenced by the parameter $α$, which requires manual tuning and lacks adaptability.

Conventional diagnostic methods are usually based on time-domain, frequency-domain, and time-frequency-domain analysis.^19–21 Song et al.²² introduced a visual diagnosis method leveraging incrementally accumulated holographic symmetrical dot pattern characteristic fusion to enhance signal differentiation. Samal et al.²³ developed a bearing fault diagnosis approach using artificial neural networks combined with vibration analysis for higher classification accuracy. However, traditional fault diagnosis techniques depend on the experience and expertise of technicians to identify faults, which has significantly limited the development of rolling bearing fault diagnosis. (1) Because feature extraction relies heavily on the knowledge and judgment of engineers, minor fault indicators may be mistakenly discarded or obscured by noise.^24,25 (2) The extracted features are mainly designed for specific fault problems, making them less generalizable. (3) Under actual working conditions, bearings often operate under variable loads and speeds, which leads to fluctuating pulse intervals and complicates feature extraction, and the vibration signals collected by the system are contaminated by noise.^26,27 Thus, traditional manual fault diagnosis cannot solve the bearing fault diagnosis problem under these conditions.

Deep learning models, such as AutoEncoders, Deep Belief Networks, Convolutional Neural Networks, and Recurrent Neural Networks, have powerful nonlinear modeling and automatic feature extraction capabilities. Rajabioun et al.²⁸ developed a deep learning framework that combined multi-sensory input from vibration and magnetic flux signals to effectively diagnose distributed bearing faults. Ding et al.²⁹ introduced a channel attention siamese network with metric learning for intelligent bearing fault diagnosis, achieving high accuracy even with very small sample sizes and maintaining reliability under noise and signal distortion. Niu et al.³⁰ presented a deep residual CNN that enhanced discriminative feature learning and integrated domain knowledge, improving diagnostic accuracy and training efficiency for multitask bearing fault diagnosis. Jiaocheng et al.³¹ employed a Bayesian optimization-deep convolution gated recurrent unit method to automatically adjust hyperparameters, enabling precise bearing fault identification and overcoming the limitations of traditional experience-based adjustment. However, the aforementioned studies overlook the spatiotemporal correlations in vibration signals. Although these studies achieve relatively good fault diagnosis performance, there is still room for improvement.

Considering the shortcomings of the aforementioned methods, several recently emerging approaches have addressed some of these issues. Liu et al.³² proposed a Siamese CNN-BiLSTM model designed to fully extract the multidimensional and temporal features of rolling bearing vibration signals and thereby better learn the high-level features of different fault types. Wang et al.³³ combined WOA-VMD with a graph attention network (GAT) to overcome the effect of noise on the signal: WOA-VMD decomposes the original signal, the k-nearest neighbors (KNN) method constructs the graph-structured data, and the attention-based GAT model classifies the rolling bearing signals. Feng et al.³⁴ developed a new multi-head attention (MHA) mechanism that integrates positional information into the weight matrix, enabling it to extract more effective data features than the traditional MHA method. Additionally, their method is developed for fault diagnosis scenarios that involve missing information.

In summary, accurate bearing fault diagnosis is fraught with many challenges, such as the integrity of the raw input signals and the feature extraction capabilities of the model, both of which can directly impact the fault classification results. Although numerous studies have been conducted to address these challenges, there are still some limitations that prevent sufficient noise reduction in the signals and hinder the model’s ability to fully extract features. Inspired by the limitations of the aforementioned existing methods, a WOA-SVMD and multi-scale CNN-Transformer-based fault diagnosis method is proposed for signal denoising and fault classification to address the issues mentioned above. The innovation of our proposed method lies in the use of SVMD for signal decomposition instead of traditional VMD. The advantage of the SVMD method is that it requires the adjustment of only one parameter, the penalty factor $α$ , while traditional VMD necessitates the tuning of two parameters: the penalty factor $α$ and the number of modes $K$ . Building on this, we employed the WOA to optimize the penalty factor, thereby eliminating the cumbersome manual tuning steps associated with SVMD. Additionally, we utilized a hybrid approach that combines multi-scale CNNs and Transformers for feature extraction. The size of the CNN convolutional kernels depends on the frequency range where the signal’s energy is concentrated, and the multi-scale CNNs effectively capture the local features of the vibration signals. Subsequently, the Transformer is used to extract global features from the multi-scale convolutional maps, enabling the model to achieve optimal fault diagnosis performance. The complexity of the proposed WOA-SVMD and multi-scale CNN-Transformer method can be analyzed in terms of time and space complexity. For WOA-SVMD, the time complexity is determined by the number of iterations and the computational cost of evaluating the quality of solutions in each iteration. 
Its space complexity is proportional to the population size, as it requires storing every solution in the population. For the multi-scale CNN-Transformer, the time complexity primarily depends on the network’s depth and width, as well as the scale of the input data. Specifically, the time complexity of the CNN component is proportional to the number of convolutional layers and the computational cost of each layer. The Transformer component’s time complexity is influenced by the number of attention heads and the sequence length in the self-attention mechanism. The space complexity of the multi-scale CNN-Transformer is associated with the number of model parameters, including the weights and biases in the convolutional layers, as well as the self-attention and feedforward network parameters in the Transformer. In practical applications, optimization measures such as GPU acceleration have been employed to improve computational efficiency, achieving favorable results.

Taking the above into account, a WOA-SVMD and multi-scale CNN-Transformer method is proposed for denoising and fault classification of bearing vibration signals. Firstly, the vibration signals obtained from the sensors are decomposed by SVMD, the parameters of which are optimized by the WOA algorithm with minimal envelope entropy. The generated IMFs are then filtered by the Pearson correlation coefficient method and reconstructed into new signals. Secondly, multi-scale CNN is used to extract local features from the processed data, and the Transformer is applied to further explore the global features. Finally, the output is fed into the softmax function for fault classification.

The contributions of our work can be summarized as follows:

(1) A combined WOA-SVMD and multi-scale CNN-Transformer fault diagnosis model is proposed to address the problem of noise interference in the real operating environments and elevate the fault diagnosis accuracy.

(2) To remove the noise variables in vibration signals, a combined WOA-SVMD and correlation coefficient method is proposed. This method optimizes the penalty factor parameter of SVMD to achieve the best decomposition results, while the correlation coefficient method is employed to filter out noise components with low correlation.

(3) To achieve adequate feature extraction, a combined multi-scale CNN-Transformer model is proposed. The multi-scale CNN extracts multiple localized features at various high-energy frequencies by using convolution kernels of different sizes, constructing rich feature representations, while the Transformer captures global features through its self-attention mechanism.

The remainder of this paper is organized as follows: Section 2 presents the architecture of the WOA-SVMD and multi-scale CNN-Transformer, covering WOA optimization, signal decomposition, filtering, multi-scale convolution, multi-head attention, and fault classification. Section 3 shows the experimental results. The conclusion is provided in Section 4.

The proposed fault diagnosis method

The overall framework of the proposed method is illustrated in Figure 1. The model is composed of data processing, multi-scale CNN and Transformer Encoder for feature extraction and fault type classification. In this study, vibration signal data was collected by placing an accelerometer on the drive end of the motor housing. The fault diagnosis process involves three main steps.

Figure 1.

Framework of proposed WOA-SVMD and multi-scale CNN-Transformer model.

Firstly, a data preprocessing model is established. The vibration signal collected by the accelerometer is susceptible to ambient noise. Common signal decomposition methods require preset parameter values. For instance, the VMD method requires a predetermined number of mode components $K$ and a penalty factor $α$, while the SVMD method requires only a preset penalty factor $α$. These parameters must be manually adjusted to achieve the desired outcomes. To optimize signal decomposition, this study employs WOA to tune the parameter $α$ of SVMD. After the signal is decomposed into a series of Intrinsic Mode Functions (IMFs) using SVMD, the Pearson correlation coefficient between each component and the original signal is calculated. Components with low correlation coefficients are filtered out, and the remaining components are reconstructed to achieve denoising.

Secondly, a combined multi-scale CNN and Transformer feature extraction model is developed to leverage the strengths of both approaches and complement each other. CNN excels at extracting local features by utilizing convolutional kernels that slide across different regions of the input, efficiently capturing localized patterns in the data. On the other hand, the Transformer is known for its ability to extract spatial features through its unique self-attention mechanism, which enables it to dynamically adjust its focus on various parts of the input. This allows the Transformer to flexibly and effectively capture global spatial features, avoiding the limitations of relying solely on local information. By integrating these two methods, the model is able to fully capture both local and spatial features, leading to improved fault diagnosis accuracy.

Finally, the extracted features are flattened, and the fault categories are predicted using a fully connected layer followed by a softmax function. The predicted categories are then compared to the actual fault categories to calculate the loss value, which is used to continuously optimize the model’s classification performance.
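The classification step can be illustrated with a minimal sketch. The text names only the softmax function; the cross-entropy loss, the four-class logits, and the helper names below are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def cross_entropy(probs, label):
    """Negative log-likelihood of the true class (assumed loss; the text does not name one)."""
    return float(-np.log(probs[label]))

# Hypothetical output of the fully connected layer for four fault classes.
logits = np.array([2.0, 0.5, -1.0, 0.1])
probs = softmax(logits)               # class probabilities that sum to 1
predicted = int(np.argmax(probs))     # index of the most probable fault class
loss = cross_entropy(probs, label=0)  # loss when the true class is 0
```

Subtracting the maximum logit before exponentiation leaves the result unchanged but avoids overflow for large logits.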

WOA-SVMD model establishment

Signal denoising is a critical part of data preprocessing. Excessive noise in the input data not only degrades the quality of the learned features but also slows the convergence of model training, or even prevents it from converging.³⁵

SVMD optimized by WOA

One of the main challenges with VMD is the need to accurately preset the number of modes $K$ before running the algorithm. Setting $K$ too high may result in duplicate modes, generating redundant information and increasing computational cost, while setting it too low may lead to incomplete signal decomposition and mode aliasing. SVMD sequentially extracts all the IMFs, eliminating the dependence on prior knowledge for determining $K$ to enhance the algorithm’s robustness.

Assuming the input signal $x (t)$ is decomposed into two signals:

\begin{matrix} x (t) = u_{L} (t) + x_{r} (t) \end{matrix}

(1)

where $u_{L} (t)$ is the $L$ -th mode, and $x_{r} (t)$ is the residual signal, which includes two parts: the sum of the previously obtained modes and the unprocessed part of signal $x_{u} (t)$ .

\begin{matrix} x_{r} (t) = \sum_{i = 1}^{L - 1} u_{i} (t) + x_{u} (t) \end{matrix}

(2)

The SVMD method for the $L$ -th mode $u_{L} (t)$ extraction is established based on the following four criteria: (1) Each mode should be compacted around its center frequency; (2) the spectral overlap between $u_{L} (t)$ and $x_{r} (t)$ should be minimized; (3) the energy of $u_{L} (t)$ at frequencies around the center frequencies of the previously obtained modes should also be minimized; (4) the original signal $x (t)$ should be completely reconstructed from the $L$ modes and $x_{u} (t)$ .¹⁶ Hence, as the ( $L - 1$ ) modes are recognized, the extraction of the $L$ -th mode can be formulated as a constrained minimization problem, as follows:

\begin{matrix} \min_{u_{L}, ω_{L}, x_{r}} \left\{ α ‖ \partial_{t} [(δ (t) + \frac{j}{π t}) * u_{L} (t)] e^{- j ω_{L} t} ‖_{2}^{2} + ‖ β_{L} (t) * x_{r} (t) ‖_{2}^{2} + \sum_{i = 1}^{L - 1} ‖ β_{i} (t) * u_{L} (t) ‖_{2}^{2} \right\} \\ s.t. \quad u_{L} (t) + x_{r} (t) = x (t) \end{matrix}

(3)

where $\partial_{t}$ denotes the partial derivative with respect to time $t$, $δ (t)$ is the Dirac function, $ω_{L}$ is the center frequency of the $L$-th mode, $α$ is a balancing parameter, and $*$ denotes the convolution operation. $β_{L} (t)$ represents the impulse response of the filter ${\hat{β}}_{L} (ω)$ in equation (4), which is used to filter out the frequencies in $x_{r} (t)$ that overlap with $u_{L} (t)$ to meet the second criterion. $β_{i} (t)$ is the impulse response of the filter ${\hat{β}}_{i} (ω)$ in equation (5), which is used to filter out the frequencies in $u_{i} (t)$ that overlap with $u_{L} (t)$ to meet the third criterion. The filters ${\hat{β}}_{L} (ω)$ and ${\hat{β}}_{i} (ω)$ can be expressed as:

{\hat{β}}_{L} (ω) = \frac{1}{α {(ω - ω_{L})}^{2}}

(4)

{\hat{β}}_{i} (ω) = \frac{1}{α {(ω - ω_{i})}^{2}}, i = 1, 2, \dots, L - 1

(5)

To transform the constrained minimization problem described in equation (3) into an unconstrained optimization problem, the quadratic penalty term and Lagrangian multiplier $λ$ are jointly introduced to establish the augmented Lagrangian function:

\begin{matrix} L (u_{L}, ω_{L}, λ) = α ‖ \partial_{t} [(δ (t) + \frac{j}{π t}) * u_{L} (t)] e^{- j ω_{L} t} ‖_{2}^{2} + ‖ β_{L} (t) * x_{r} (t) ‖_{2}^{2} + \sum_{i = 1}^{L - 1} ‖ β_{i} (t) * u_{L} (t) ‖_{2}^{2} \\ + ‖ x (t) - (u_{L} (t) + x_{u} (t) + \sum_{i = 1}^{L - 1} u_{i} (t)) ‖_{2}^{2} + 〈 λ (t), x (t) - (u_{L} (t) + x_{u} (t) + \sum_{i = 1}^{L - 1} u_{i} (t)) 〉 \end{matrix}

(6)

According to Parseval’s theorem, equation (6) can be transformed into its frequency domain form and rewritten as:

\begin{matrix} L (u_{L}, ω_{L}, λ) = α ‖ j (ω - ω_{L}) [(1 + sgn (ω)) \cdot {\hat{u}}_{L} (ω)] ‖_{2}^{2} + ‖ {\hat{β}}_{L} (ω) \cdot ({\hat{x}}_{u} (ω) + \sum_{i = 1}^{L - 1} {\hat{u}}_{i} (ω)) ‖_{2}^{2} \\ + \sum_{i = 1}^{L - 1} ‖ {\hat{β}}_{i} (ω) \cdot {\hat{u}}_{L} (ω) ‖_{2}^{2} + ‖ \hat{x} (ω) - ({\hat{u}}_{L} (ω) + {\hat{x}}_{u} (ω) + \sum_{i = 1}^{L - 1} {\hat{u}}_{i} (ω)) ‖_{2}^{2} \\ + 〈 \hat{λ} (ω), \hat{x} (ω) - ({\hat{u}}_{L} (ω) + {\hat{x}}_{u} (ω) + \sum_{i = 1}^{L - 1} {\hat{u}}_{i} (ω)) 〉 \end{matrix}

(7)

As in the VMD and VME methods, the alternating direction method of multipliers (ADMM) is employed to iteratively solve the above minimization problem. The detailed solution process can be found in Nazari and Sakhaei.¹⁶ The final iterative update equations for ${\hat{u}}_{L} (ω)$, $ω_{L}$, and $\hat{λ} (ω)$ are as follows:

\begin{matrix} {\hat{u}}_{L}^{n + 1} (ω) = \frac{\hat{x} (ω) + α^{2} {(ω - ω_{L}^{n})}^{4} {\hat{u}}_{L}^{n} (ω) + \frac{\hat{λ} (ω)}{2}}{[1 + α^{2} {(ω - ω_{L}^{n})}^{4}] [1 + 2 α {(ω - ω_{L}^{n})}^{2} + \sum_{i = 1}^{L - 1} \frac{1}{α^{2} {(ω - ω_{i}^{n})}^{4}}]} \end{matrix}

(8)

\begin{matrix} ω_{L}^{n + 1} = \frac{\int_{0}^{\infty} ω {| {\hat{u}}_{L}^{n + 1} (ω) |}^{2} d ω}{\int_{0}^{\infty} {| {\hat{u}}_{L}^{n + 1} (ω) |}^{2} d ω} \end{matrix}

(9)

\begin{matrix} {\hat{λ}}^{n + 1} = {\hat{λ}}^{n} + τ [ \hat{x} (ω) - ( {\hat{u}}_{L}^{n + 1} (ω) + \sum_{i = 1}^{L - 1} {\hat{u}}_{i}^{n + 1} (ω) \\ + \frac{α^{2} {(ω - ω_{L}^{n + 1})}^{4} (\hat{x} (ω) - {\hat{u}}_{L}^{n + 1} (ω) - \sum_{i = 1}^{L - 1} {\hat{u}}_{i} (ω)) - \sum_{i = 1}^{L - 1} {\hat{u}}_{i} (ω)}{1 + α^{2} {(ω - ω_{L}^{n})}^{4}} ) ] \end{matrix}

(10)

where $\hat{x} (ω)$ represents the Fourier transform of the original signal $x (t)$, ${\hat{u}}_{L}^{n} (ω)$ denotes the Fourier transform of the $L$-th mode $u_{L}^{n} (t)$ at the $n$-th iteration with center frequency $ω_{L}^{n}$, $n$ is the iteration index, and $τ$ is the update coefficient.
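Of the three updates, the center-frequency update in equation (9) is the easiest to illustrate in isolation: it is the power-weighted mean frequency of the current mode over the non-negative half of the spectrum. The sketch below is a discrete approximation that assumes a uniformly sampled real-valued mode; the function name, the use of Hz instead of rad/s, and the test signal are illustrative choices, not part of the original method description.

```python
import numpy as np

def center_frequency(mode, fs):
    """Power-weighted mean frequency in Hz over the one-sided spectrum, cf. equation (9)."""
    spectrum = np.fft.rfft(mode)                     # spectrum for frequencies >= 0
    freqs = np.fft.rfftfreq(len(mode), d=1.0 / fs)   # matching frequency axis in Hz
    power = np.abs(spectrum) ** 2
    return float(np.sum(freqs * power) / np.sum(power))

# Hypothetical mode: a 50 Hz cosine sampled at 1 kHz for one second.
fs = 1000.0
t = np.arange(1000) / fs
mode = np.cos(2.0 * np.pi * 50.0 * t)
omega = center_frequency(mode, fs)   # close to 50 Hz for this narrow-band mode
```

For a narrow-band mode the estimate sits on the spectral peak, which is why the iteration in equations (8) and (9) settles on a mode that is compact around its center frequency.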

From the above iterative updates, it is clear that an appropriate value for the penalty parameter $α$ needs to be set. Therefore, WOA is utilized to optimize this parameter. The hunting behavior of whales includes three main steps: encircling prey, bubble net attacking, and searching for food. Based on this hunting mechanism, Zhang and Deng³⁶ proposed the WOA. When encircling their target prey, whales may not initially know its exact location and must continuously communicate and adjust within the group to move toward the prey and update their positions. The position update formula is as follows:

\begin{matrix} X (t + 1) = X_{p} (t) - A \cdot | C \cdot X_{p} (t) - X (t) | \end{matrix}

(11)

where $X (t)$ denotes the current individual position, $X_{p} (t)$ represents the prey position, $t$ denotes the iteration number, and $A \cdot | C \cdot X_{p} (t) - X (t) |$ is the distance between current position and the target position. $A$ and $C$ are defined as follows:

\begin{matrix} A = 2 β \cdot ran d_{1} - β \end{matrix}

(12)

\begin{matrix} C = 2 \cdot ran d_{2} \end{matrix}

(13)

where $ran d_{1}$ and $ran d_{2}$ are random numbers in the interval $[0, 1]$ , and $β$ is the convergence factor as follows:

β = 2 - \frac{2 t}{t_{\mathrm{max\_iter}}}

(14)

where $t_{\mathrm{max\_iter}}$ denotes the maximum number of iterations.

The simulation of the humpback whale’s unique hunting technique involves a spiral motion to update its position and a narrowing ring mechanism, commonly referred to as the bubble net strategy. The probability of successfully capturing prey with these two methods is assumed to be 50%. When the probability $p$ is less than 0.5, the whale’s position is updated using equation (11). Conversely, when $p$ is 0.5 or greater, the position is updated as follows:

X (t + 1) = D' \cdot e^{bl} \cdot \cos (2 π l) + X_{p} (t)

(15)

where $D' = | X_{p} (t) - X (t) |$ , $b$ is a constant, and $l$ represents a random number.

In addition to the bubble net strategy mentioned above, the whale also searches for prey according to the positions of other whales, again governed by the vector $A$. When $| A | < 1$, it moves closer to the prey; when $| A | \geq 1$, it moves away from the current prey and toward a randomly selected whale. This approach enhances the global optimization capability. The corresponding mathematical model is as follows:

X (t + 1) = X_{rand} (t) - A \cdot | C \cdot X_{rand} (t) - X (t) |

(16)

where $X_{rand}$ represents a randomly selected whale position vector.

Subsequently, the minimum envelope entropy is selected as the fitness function. The envelope entropy reflects the sparsity of the original signal. A higher envelope entropy value indicates more noise and less meaningful information in the IMF, while a lower value suggests less noise and more effective information. The envelope entropy of the signal $X (i)$, where $i = 1, 2, \dots, N$, is calculated using the following equation:

\begin{matrix} p_{n} = \frac{a (n)}{\sum_{n = 1}^{N} a (n)} \\ E_{p} = - \sum_{n = 1}^{N} p_{n} \lg p_{n} \end{matrix}

(17)

where $a (n)$ represents the envelope signal of the $k$ mode components decomposed by SVMD after Hilbert demodulation, $p_{n}$ is the probability distribution sequence obtained by normalizing $a (n)$ . $N$ denotes the number of sampling points, and $E_{p}$ is the envelope entropy.
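Equation (17) can be sketched as follows. The envelope is taken as the magnitude of an FFT-based analytic signal, which is one common way to realize Hilbert demodulation; the helper names, the small `eps` guard inside the logarithm, and the two test signals are assumptions for illustration only.

```python
import numpy as np

def analytic_signal(x):
    """FFT-based analytic signal: keep DC, double positive frequencies, drop negative ones."""
    n = len(x)
    spectrum = np.fft.fft(x)
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0          # Nyquist bin is kept once
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * weights)

def envelope_entropy(x, eps=1e-12):
    """Envelope entropy E_p of equation (17), using the base-10 logarithm (lg)."""
    a = np.abs(analytic_signal(np.asarray(x, dtype=float)))  # envelope a(n)
    p = a / np.sum(a)                                        # normalized sequence p_n
    return float(-np.sum(p * np.log10(p + eps)))             # eps guards against log(0)

n = 256
flat = np.cos(2.0 * np.pi * 8.0 * np.arange(n) / n)   # constant envelope
sparse = np.zeros(n)
sparse[n // 2] = 1.0                                  # single impulse
flat_entropy = envelope_entropy(flat)
sparse_entropy = envelope_entropy(sparse)
```

A constant envelope gives a uniform $p_{n}$ and therefore the maximum possible entropy $\lg N$, whereas the impulsive signal concentrates the envelope in a few samples and scores lower, which is the behavior the fitness function relies on.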

After obtaining the optimal whale position using WOA, it is used in updating the parameters for $ω_{L}$ , $u_{L}$ and $λ$ in SVMD, serving as the maximum number of loop iterations. Meanwhile, the penalty factor $α$ is continuously updated to achieve optimization.

The stopping criterion for updating ${\hat{u}}_{L}^{n + 1} (ω)$ is defined as:

\frac{‖ {\hat{u}}_{L}^{n + 1} - {\hat{u}}_{L}^{n} ‖_{2}^{2}}{‖ {\hat{u}}_{L}^{n} ‖_{2}^{2}} < ε_{1}

(18)

The termination criterion of SVMD is expressed as:

| σ^{2} - \frac{1}{T} ‖ (x (t) - \sum_{i = 1}^{K} u_{i} (t)) ‖_{2}^{2} | < ε_{2}

(19)

where $σ^{2}$ denotes the power of the additive white noise. $ε_{1}$ and $ε_{2}$ are assigned small positive values 1e−6.

Process of WOA optimized SVMD

The flowchart of optimizing the SVMD parameters using WOA is shown in Figure 2. The specific steps are as follows:

Step 1: Set the search range of the SVMD parameter to be optimized and initialize the WOA, including the population size and the maximum number of iterations. Here, the penalty factor $α$ is searched within [500, 3000], the maximum number of iterations is set to 10, and the population size is set to 20;

Step 2: Decompose the input signal using SVMD to obtain the IMFs. Calculate the envelope entropy of each IMF using equation (17). The envelope entropy is used as the fitness function to find and retain the optimal whale position;

Step 3: Start the iteration. Generate a random number $p$ in the interval [0, 1]. If $p < 0.5$, go to Step 4; otherwise, update the position using equation (15);

Step 4: Determine the value of $A$. If $| A | < 1$, update the position using equation (11); otherwise, update the position using equation (16);

Step 5: Calculate each whale’s fitness and compare it with the previously stored optimal position. If the new solution is better, replace the stored optimum with it;

Step 6: Determine whether the iteration is terminated. If $t \leq t_{\mathrm{max\_iter}}$, set $t = t + 1$ and return to Step 3. Otherwise, end the iteration and save the optimal parameter $α$.
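The six steps can be sketched for a scalar parameter such as $α$. In the sketch below the fitness is a hypothetical quadratic stand-in, because a full SVMD decomposition followed by the envelope entropy of equation (17) is outside the scope of a short example. The spiral constant `b = 1`, the range $[-1, 1]$ for $l$, the clipping to the search bounds, and all function names are assumptions rather than settings reported in the paper.

```python
import math
import random

def woa_minimize(fitness, lower, upper, pop_size=20, max_iter=10, b=1.0, seed=0):
    """Scalar WOA sketch following Steps 1-6; returns (best_x, best_f, history)."""
    rng = random.Random(seed)
    whales = [rng.uniform(lower, upper) for _ in range(pop_size)]      # Step 1
    scores = [fitness(x) for x in whales]                              # Step 2
    best_idx = min(range(pop_size), key=scores.__getitem__)
    best_x, best_f = whales[best_idx], scores[best_idx]
    history = [best_f]
    for t in range(max_iter):                                          # Step 6 loop
        beta = 2.0 - 2.0 * t / max_iter                                # eq. (14)
        for k in range(pop_size):
            x = whales[k]
            a_coef = 2.0 * beta * rng.random() - beta                  # eq. (12)
            c_coef = 2.0 * rng.random()                                # eq. (13)
            if rng.random() < 0.5:                                     # Step 3
                if abs(a_coef) < 1.0:                                  # Step 4
                    target = best_x                                    # eq. (11)
                else:
                    target = whales[rng.randrange(pop_size)]           # eq. (16)
                x_new = target - a_coef * abs(c_coef * target - x)
            else:
                l = rng.uniform(-1.0, 1.0)                             # assumed range
                x_new = (abs(best_x - x) * math.exp(b * l)
                         * math.cos(2.0 * math.pi * l) + best_x)       # eq. (15)
            whales[k] = min(max(x_new, lower), upper)                  # keep in range
        for x in whales:                                               # Step 5
            f = fitness(x)
            if f < best_f:
                best_x, best_f = x, f
        history.append(best_f)
    return best_x, best_f, history

def toy_fitness(alpha):
    """Hypothetical stand-in for the minimum envelope entropy after SVMD."""
    return (alpha - 1800.0) ** 2

best_alpha, best_score, history = woa_minimize(toy_fitness, 500.0, 3000.0)
```

Because the stored optimum is only replaced by a better candidate, the best fitness recorded in `history` never increases from one iteration to the next.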

Figure 2.

The flowchart of WOA optimizing SVMD.

Signal denoising process

After SVMD decomposition, the signal is broken down into a series of IMFs, each corresponding to a distinct frequency component of the original signal. These IMFs are well-localized in the time domain and exhibit concentrated energy distribution in the frequency domain. Next, Pearson correlation coefficients are calculated between each IMF and the original signal. IMFs with correlation coefficients higher than 0.2 are selected because they closely resemble the original signal and retain more meaningful information. In contrast, IMFs with low correlation coefficients are likely to contain more noise or irrelevant components.

The correlation coefficient is a statistic used to describe the strength of the linear relationship between variables. In this paper, the Pearson correlation coefficient is used. The coefficient lies in the interval $[- 1, 1]$, where a larger absolute value indicates a stronger linear correlation between the variables, and a smaller absolute value indicates a weaker correlation. Its mathematical expression is as follows:

r (X, Y) = \frac{Cov (X, Y)}{\sqrt{D (X)} \sqrt{D (Y)}}

(20)

where $X$ and $Y$ represent two sample variables, $r (X, Y)$ denotes the correlation coefficient between them, $Cov (X, Y)$ is the covariance of $X$ and $Y$, and $D (X)$ and $D (Y)$ are the variances of $X$ and $Y$, so that $\sqrt{D (X)}$ and $\sqrt{D (Y)}$ are the corresponding standard deviations.
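The selection rule can be sketched with equation (20) and the 0.2 threshold stated above. The three synthetic components stand in for IMFs produced by SVMD, and the function names are illustrative assumptions.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient of equation (20)."""
    xc = x - np.mean(x)
    yc = y - np.mean(y)
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))

def filter_and_reconstruct(imfs, signal, threshold=0.2):
    """Keep IMFs whose correlation with the signal exceeds the threshold and sum them."""
    kept = [k for k, imf in enumerate(imfs) if pearson(imf, signal) > threshold]
    denoised = np.zeros_like(signal, dtype=float)
    for k in kept:
        denoised = denoised + imfs[k]
    return denoised, kept

# Synthetic stand-ins for IMFs: two tones plus a weak noise component.
rng = np.random.default_rng(0)
t = np.arange(2000) / 2000.0
imfs = [
    np.sin(2.0 * np.pi * 30.0 * t),
    0.8 * np.sin(2.0 * np.pi * 120.0 * t),
    0.05 * rng.standard_normal(t.size),
]
signal = np.sum(imfs, axis=0)
denoised, kept = filter_and_reconstruct(imfs, signal)   # kept holds the retained indices
```

In this toy case the two tones correlate strongly with the composite signal and are retained, while the weak noise component falls well below the threshold and is dropped.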

Then, the retained IMFs are reconstructed to obtain the denoised signal. The denoising effect is evaluated by Mean Squared Error (MSE) and Signal-to-Noise ratio (SNR). The expressions for these metrics are as follows:

MSE = \frac{1}{L} \sum_{n = 1}^{L} {| \hat{x} (n) - x_{r} (n) |}^{2}

(21)

SNR = 10 \lg (\frac{\sum_{n = 1}^{L} {\hat{x}}^{2} (n)}{\sum_{n = 1}^{L} {| \hat{x} (n) - x_{r} (n) |}^{2}})

(22)

where $\hat{x} (n)$ is the noisy signal, $x_{r} (n)$ is the denoised signal, and $L$ is the signal length.
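Both metrics follow directly from equations (21) and (22). The sketch below uses the same roles as the text (noisy signal $\hat{x} (n)$, denoised signal $x_{r} (n)$); the function names and the four-sample example are illustrative assumptions.

```python
import numpy as np

def mse(noisy, denoised):
    """Mean squared error of equation (21)."""
    return float(np.mean(np.abs(noisy - denoised) ** 2))

def snr_db(noisy, denoised):
    """Signal-to-noise ratio of equation (22) in dB (lg is the base-10 logarithm)."""
    signal_power = np.sum(noisy ** 2)
    residual_power = np.sum(np.abs(noisy - denoised) ** 2)
    return float(10.0 * np.log10(signal_power / residual_power))

noisy = np.array([1.0, -1.0, 1.0, -1.0])
denoised = np.array([0.9, -0.9, 0.9, -0.9])
error = mse(noisy, denoised)      # about 0.01: every sample differs by 0.1
ratio = snr_db(noisy, denoised)   # about 20 dB: power ratio of 4 / 0.04 = 100
```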

Multi-scale CNN-Transformer model for fault classification

Multi-scale CNN-Transformer method

For time-series vibration signals, 1D-CNN performs local feature extraction via convolution kernels, which has the advantages of relatively simple structure, lower computational complexity, and reduced risk of overfitting. Small kernel sizes can capture the high-frequency features of the vibration signal, while large kernel sizes can capture the low-frequency features of the vibration signal. For example, when the original signal has a sampling rate of 48 kHz, if a 1D-CNN with a convolutional kernel size of 120 is employed, the kernel covers a time window of roughly 2.5 ms, which primarily captures frequency features at or below 400 Hz. On the other hand, if the convolutional kernel size is reduced to 6, the kernel covers a much shorter time window of about 0.125 ms, enabling the extraction of high-frequency features up to 8 kHz.
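The arithmetic behind these figures can be checked with a few lines. The sketch uses the rule of thumb from the paragraph above, namely that a kernel spanning a time window of length $T$ is associated with a frequency of about $1 / T$; the function name is an assumption.

```python
def kernel_window(kernel_size, fs_hz):
    """Return (window length in seconds, associated frequency in Hz) for a 1D kernel."""
    duration_s = kernel_size / fs_hz      # time covered by the kernel
    frequency_hz = fs_hz / kernel_size    # reciprocal of the window length
    return duration_s, frequency_hz

for size in (120, 6):
    duration_s, frequency_hz = kernel_window(size, 48_000)
    print(f"kernel {size}: {duration_s * 1e3:.3f} ms, about {frequency_hz:.0f} Hz")
```

At 48 kHz this gives 2.5 ms and 400 Hz for a kernel of size 120, and 0.125 ms and 8 kHz for a kernel of size 6, matching the values quoted above.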

CNNs are primarily composed of convolutional layers, batch normalization layers, pooling layers, activation layers, and so on. They learn and extract features from input data through local connections and weight sharing, using classifiers for supervised classification predictions.³⁷

The one-dimensional standard convolution operation is given below:

Conv 1 D (x_{n}, K) = \sum_{i = 1}^{c} \sum_{j = 0}^{K - 1} ω_{j, i} x_{n - \frac{K - 1}{2} + j, i} + b

(23)

where $Conv 1 D (\cdot)$ represents the one-dimensional standard convolution operation, $K$ and $ω$ represent the width and weight of the convolution kernel, respectively, $b$ indicates the bias, $c$ represents the number of signal channels, and $x_{n}$ is the input signal.

Multi-scale Convolution (MSC) makes use of convolution kernels of different sizes to extract features as shown below:

\begin{matrix} MSC (x_{n}, K_{1}, \dots, K_{m}) = \\ Concat [Conv 1 D (x_{n}, K_{1}), \dots, Conv 1 D (x_{n}, K_{m})] \end{matrix}

(24)

where $MSC (\cdot)$ stands for multi-scale convolution operation and $Concat [\cdot]$ stands for splicing operation in the channel dimension.
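The concatenation in equation (24) can be sketched as follows for a single-channel input. This is an illustrative sketch only: each branch here has one output channel and uses zero "same" padding so that all branches keep the input length, both of which are assumptions; the helper names are hypothetical.

```python
def conv1d_same(x, kernel, bias=0.0):
    """Single-channel centered convolution with zero padding."""
    half = (len(kernel) - 1) // 2
    out = []
    for n in range(len(x)):
        acc = bias
        for j, wj in enumerate(kernel):
            idx = n - half + j
            if 0 <= idx < len(x):
                acc += wj * x[idx]
        out.append(acc)
    return out


def multi_scale_conv(x, kernels):
    """Apply one branch per kernel and concatenate along the channel axis."""
    branches = [conv1d_same(x, k) for k in kernels]
    # One row per time step, one column (channel) per kernel size.
    return [list(step) for step in zip(*branches)]


features = multi_scale_conv([1.0, 2.0, 3.0, 4.0], [[1.0], [1.0, 1.0, 1.0]])
print(features)
```

Each row of `features` holds the responses of all kernel sizes at one time step, which is the channel-wise splicing that $Concat [\cdot]$ denotes.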

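The gradients of the loss with respect to the weights $ω^{l}$ and bias $b^{l}$, which are used to update these parameters during training, can be expressed as: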

\frac{\partial L}{\partial ω^{l}} = σ (z^{l - 1}) * δ^{l}

(25)

\frac{\partial L}{\partial b^{l}} = \sum δ^{l}

(26)

where $L$ represents the loss function, $σ (z^{l - 1})$ denotes the ( $l - 1$ )-th layer activation output, $δ^{l}$ denotes the error in the output feature map of the convolution kernel in the $l$ -th layer, and $*$ denotes the convolution operation.

As can be seen from Figure 3, the energy of the bearing vibration signal is mainly concentrated in the 0–5000 Hz band and around 8400 Hz, so 1D-CNNs with different convolutional kernel sizes were used to capture these different frequency ranges. Eventually, eight kernel sizes were chosen: 6, 15, 18, 36, 48, 72, 96, and 144.

Figure 3.

The frequency domain vibration signal of 0.18 mm inner-race fault.

After the convolution operation comes pooling, and the pooling process is represented as follows:

\begin{matrix} p_{i, j}^{l} = \max_{t \in S} [x_{i, j}^{l - 1} (t)] \end{matrix}

(27)

where $x_{i, j}^{l - 1} (t)$ is the $t$ -th value of the $j$ -th local region of channel $i$ in layer ( $l - 1$ ), $S$ denotes the pooling region, and $p_{i, j}^{l}$ is the corresponding output.
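A minimal sketch of the max pooling in equation (27) for one channel is given below. Non-overlapping regions (stride equal to the pooling size) and dropping an incomplete trailing region are assumptions; the name `max_pool1d` is hypothetical.

```python
def max_pool1d(channel, size):
    """Max over consecutive, non-overlapping regions of `size` samples."""
    return [
        max(channel[start:start + size])
        for start in range(0, len(channel) - size + 1, size)
    ]


print(max_pool1d([1, 5, 2, 4, 3, 0, 7, 6], 4))
```

With a pooling size of 4, each pooling layer reduces the sequence length by a factor of four.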

Subsequently, the activation function is applied to introduce nonlinearity, thereby enhancing the model’s representation capability. ReLU is one of the most commonly used activation functions in CNNs. It effectively mitigates the vanishing gradient problem and speeds up the convergence process. The ReLU function is mathematically defined as follows:

\begin{matrix} a_{i, j}^{l} = f (x_{i, j}^{l}) = max (0, x_{i, j}^{l}) \end{matrix}

(28)

where $x_{i, j}^{l}$ is the $j$ -th local region of channel $i$ in layer $l$ , and $a_{i, j}^{l}$ is the corresponding output.

Finally, the feature maps generated by each convolution kernel are concatenated to create a multi-scale feature map. These feature maps will serve as the input to the Transformer for further global feature extraction.

In the multi-scale CNN (MSCNN), each CNN branch consists of three convolutional layers, two max pooling layers, and a batch normalization layer. The ReLU activation function provides the nonlinearity of the MSCNN, and each pooling layer downsamples its input. The specific hyperparameter settings are presented in Figure 4.

Figure 4.

The specific architecture of multi-scale CNN.

The Transformer network is a deep neural network based on the self-attention mechanism that can extract features in parallel and model global dependencies. A full Transformer network includes an encoder-decoder structure. For classification, however, the encoder alone suffices as the core feature extractor, making the decoder superfluous. The architecture of the Transformer encoder is shown in Figure 5.

Figure 5.

The architecture of transformer encoder.

First of all, positional embedding is used to infuse positional information into these feature maps, enabling the model to recognize the relative or absolute position of each feature. This is especially important for time-series signals, where the order of data points is crucial. With Positional Embedding, the model can better capture the temporal relationships between features, thereby enhancing the understanding and representation of global features. The positional encoding is expressed as:

\begin{matrix} p_{k, 2 i} = \sin (\frac{k}{10000^{\frac{2 i}{d_{model}}}}) \\ p_{k, 2 i + 1} = \cos (\frac{k}{10000^{\frac{2 i}{d_{model}}}}) \end{matrix}

(29)

where $p_{k}$ is the positional encoding of the $k$ -th position in the sequence, 2 $i$ and 2 $i$ + 1 index its even and odd dimensions, and $d_{model}$ is the embedding dimension. This formula allows $p$ to adapt across all long sequences in the training set, making it easier for the model to compute relative positions.
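Equation (29) can be evaluated with the following short sketch, written with the standard library only. The function name `positional_encoding` is hypothetical; the loop variable `i` runs over the even dimension indices, so it plays the role of $2i$ in the equation.

```python
import math


def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding of shape [seq_len][d_model]."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for k in range(seq_len):
        for i in range(0, d_model, 2):          # i corresponds to 2i in eq. (29)
            angle = k / (10000 ** (i / d_model))
            pe[k][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[k][i + 1] = math.cos(angle)
    return pe


for row in positional_encoding(3, 4):
    print([round(v, 4) for v in row])
```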

Then, two key modules are employed: the multi-head attention module and the feed-forward connection layer module. Multi-head attention is composed of $h$ scaled dot-product attention modules. Initially, the input is passed through linear projection layers to produce queries, keys, and values. These components are then passed through the $h$ scaled dot-product attention modules.

Finally, the individual attention heads are concatenated and linearly projected to generate the final attention output. The process can be summarized as follows:

Z_{i} = Attention (Q_{i}, K_{i}, V_{i}) = softmax (\frac{Q_{i} K_{i}^{T}}{\sqrt{d_{k}}}) V_{i}

(30)

\begin{matrix} Y_{o} = Concat (Z_{1}, Z_{2}, \dots, Z_{h}) W^{y} \end{matrix}

(31)

where $Z_{i}$ is the output of the $i$ -th attention head, $W^{y}$ is the updatable output weight matrix, and $Y_{o}$ is the output of the module. $Q_{i}$ , $K_{i}$ , and $V_{i}$ are defined as follows:

\begin{matrix} Q_{i} = {XW}_{i}^{Q}, K_{i} = {XW}_{i}^{K}, V_{i} = {XW}_{i}^{V}, 1 \leq i \leq h \end{matrix}

(32)

where $X$ is the input with length $l$ and dimension $d$ , and $W_{i}^{Q}$ , $W_{i}^{K}$ , $W_{i}^{V}$ are the updatable weight matrices.
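Equations (30) to (32) can be traced step by step with a small pure-Python sketch. It is illustrative rather than the authors' code: matrices are nested lists, the weights in the usage example are identity matrices chosen only to make the result easy to inspect, and all function names are hypothetical.

```python
import math


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([v / total for v in exps])
    return out


def attention_head(x, wq, wk, wv):
    """Equations (32) and (30) for a single head."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)


def multi_head_attention(x, heads, w_y):
    """Equation (31): concatenate the head outputs, then project with w_y."""
    zs = [attention_head(x, wq, wk, wv) for wq, wk, wv in heads]
    concat = [sum((z[t] for z in zs), []) for t in range(len(x))]
    return matmul(concat, w_y)


eye = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 0.0], [0.0, 1.0]]
w_y = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]  # maps 2 heads x 2 dims to 2 dims
print(multi_head_attention(x, [(eye, eye, eye), (eye, eye, eye)], w_y))
```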

The feed-forward connection layer module consists of two fully connected layers, with a ReLU activation function applied between them. Additionally, the Transformer model includes two normalization layers. Layer normalization plays a critical role in ensuring fast convergence and stable training of the Transformer. In summary, the architecture can be described as follows:

\begin{matrix} LN = LayerNorm (X + Y_{o}) \end{matrix}

(33)

\begin{matrix} FFN = ReLU (LN \cdot W_{1} + b_{1}) W_{2} + b_{2} \end{matrix}

(34)

\begin{matrix} OUT = LayerNorm (LN + FFN) \end{matrix}

(35)

where $W_{1}$ , $W_{2}$ and $b_{1}$ , $b_{2}$ are the weights and biases of the two fully connected layers, $FFN$ is the output of the feed-forward connection layer module, and $LayerNorm (\cdot)$ is the layer normalization. $OUT$ represents the output of the encoder layer, obtained by applying a residual connection and layer normalization to the feed-forward output.
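The residual and normalization steps in equations (33) to (35) can be sketched as below. This is a simplified illustration: the layer normalization omits learnable gain and bias terms, which is an assumption, and the function names and example weights are hypothetical.

```python
import math


def layer_norm(row, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    return [(v - mean) / math.sqrt(var + eps) for v in row]


def linear(row, w, b):
    """row @ w + b, where w is stored as a list of rows."""
    return [sum(r * wij for r, wij in zip(row, col)) + bj
            for col, bj in zip(zip(*w), b)]


def encoder_tail(x, y_o, w1, b1, w2, b2):
    """Apply equations (33)-(35) to each position of the sequence."""
    out = []
    for x_row, y_row in zip(x, y_o):
        ln = layer_norm([a + c for a, c in zip(x_row, y_row)])    # eq. (33)
        hidden = [max(0.0, h) for h in linear(ln, w1, b1)]         # ReLU
        ffn = linear(hidden, w2, b2)                               # eq. (34)
        out.append(layer_norm([a + c for a, c in zip(ln, ffn)]))   # eq. (35)
    return out


x = [[1.0, 2.0, 3.0, 4.0]]
y_o = [[0.0, 0.0, 0.0, 0.0]]
w1 = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1], [0.2, 0.1, 0.0], [-0.3, 0.4, 0.2]]
b1 = [0.0, 0.1, -0.1]
w2 = [[0.1, 0.0, -0.1, 0.2], [0.3, -0.2, 0.0, 0.1], [0.0, 0.1, 0.2, -0.3]]
b2 = [0.0, 0.0, 0.0, 0.0]
print(encoder_tail(x, y_o, w1, b1, w2, b2))
```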

Classification

Classification is the final step in fault diagnosis, where the classifier outputs the predicted category. This predicted result is then compared with the actual category to assess its accuracy. Then, the model parameters are adjusted accordingly. Through continuous training and fine-tuning, the model is gradually optimized. Ultimately, we obtain the desired model, which can be used for the final testing phase.

Firstly, the final output of the Transformer is flattened using the Flatten function. Once flattened, the data is passed through the fully connected layer for further processing, ultimately generating the final classification results.

Then, the cross-entropy loss function is used to calculate the loss between the predicted category and the true category. Subsequently, backpropagation and optimization are performed using the Adam optimizer. After 100 epochs of training, the optimal model for fault diagnosis is obtained.

Finally, the test set data is fed into the trained model for fault classification, which completes the fault diagnosis process.
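The softmax output and cross-entropy loss used in the classification step can be illustrated with a minimal sketch. The logits below are made-up values for a ten-class problem, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [v / total for v in exps]


def cross_entropy(logits, true_label):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[true_label])


logits = [0.1, 2.5, 0.3, -1.0, 0.0, 0.2, -0.5, 0.4, 0.0, 0.1]  # illustrative values
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(predicted, cross_entropy(logits, true_label=1))
```

The loss is small when the true class receives a high probability and equals $\ln 10$ when all ten classes are scored equally.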

Overall flowchart of the proposed method

The flowchart of the WOA-SVMD and multi-scale CNN-Transformer method is shown in Figure 6, which mainly includes the following steps:

Step 1: Collect vibration signals under different fault types.

Step 2: Use WOA to obtain optimal parameter $α$ of SVMD.

Step 3: Use the optimal $α$ parameter to decompose the signal using the SVMD algorithm to obtain a series of IMFs.

Step 4: Calculate the Pearson correlation coefficients between these IMFs and the original signal. Set the retention threshold at 0.2, filtering out IMFs with correlation coefficients below this value. Reconstruct the signal to remove noise.

Step 5: Input the denoised signal into the model to extract feature vectors.

Step 6: The fault diagnosis results are then output through the fully connected layer and Softmax classifier.

Figure 6.

Overall flowchart of WOA-SVMD and multi-scale CNN-Transformer method.
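Step 4 of the flowchart can be sketched as follows, assuming the IMFs are already available as equal-length lists. The decomposition itself (SVMD) is not shown. The function names are hypothetical, and the synthetic two-component signal is only a stand-in for real IMFs.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    if sa == 0 or sb == 0:      # constant sequence: treat as uncorrelated
        return 0.0
    return cov / (sa * sb)


def reconstruct(signal, imfs, threshold=0.2):
    """Sum the IMFs whose correlation with the signal reaches the threshold."""
    kept = [imf for imf in imfs if pearson(imf, signal) >= threshold]
    if not kept:
        return [0.0] * len(signal), 0
    return [sum(vals) for vals in zip(*kept)], len(kept)


n = 200
slow = [math.sin(2 * math.pi * 3 * t / n) for t in range(n)]
fast = [0.05 * math.sin(2 * math.pi * 50 * t / n) for t in range(n)]
signal = [s + f for s, f in zip(slow, fast)]
denoised, n_kept = reconstruct(signal, [slow, fast])
print(n_kept, round(pearson(fast, signal), 4))
```

Here the weak high-frequency component correlates with the signal at about 0.05, falls below the 0.2 threshold, and is dropped from the reconstruction.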

Experimental verification

Validation based on CWRU bearing dataset

The SKF-6205 drive-end bearing fault data, sourced from the open-access bearing fault dataset by Case Western Reserve University (CWRU), was utilized as the experimental data in this study, as shown in Figure 7.³⁸ The raw data was collected from accelerometers mounted on the drive end, fan end, and base of the motor housing. For this study, the vibration data from the drive end was selected as the primary vibration signal. The dataset includes four health conditions: normal, inner race fault, outer race fault, and rolling element fault. Each fault type is available in three defect sizes: 0.18, 0.36, and 0.54 mm, resulting in a total of 10 categories of data, including the normal condition. The vibration signals were recorded under motor loads and speeds of 0 hp/1797 rpm, 1 hp/1772 rpm, 2 hp/1750 rpm, and 3 hp/1730 rpm, with a sampling frequency of 48 kHz. In this study, the data corresponding to 3 hp and 1730 rpm were used. Further details are provided in Table 1.

Figure 7.

Testbed for the CWRU dataset.

Table 1.

CWRU dataset.

Fault category	Fault diameter (mm)	Training set	Test set	Sample length	label
Normal	0	750	250	2048	0
Rolling elements	0.18	750	250	2048	1
Rolling elements	0.36	750	250	2048	2
Rolling elements	0.54	750	250	2048	3
Inner race	0.18	750	250	2048	4
Inner race	0.36	750	250	2048	5
Inner race	0.54	750	250	2048	6
Outer race	0.18	750	250	2048	7
Outer race	0.36	750	250	2048	8
Outer race	0.54	750	250	2048	9

To avoid overfitting due to insufficient data, we employed a sliding window overlap sampling method for data augmentation, as illustrated in equation (36). In this equation, $M$ represents the total length of the data, $N$ denotes the length of each sample, $γ$ is the offset, and $A$ is the number of samples that can be obtained from the current signal, with [·] indicating the floor function. In this study, $N$ is set to 2048 and $γ$ is set to 480.

A = [\frac{M - N}{γ} - 1]

(36)
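The overlap sampling can be sketched as below. The sample count follows equation (36) exactly as printed; the signal length of 10,000 points in the usage example is a made-up value, and the function names are hypothetical.

```python
import math


def sample_count(M, N, gamma):
    """Equation (36) as printed: A = floor((M - N) / gamma - 1)."""
    return math.floor((M - N) / gamma - 1)


def sliding_windows(signal, N, gamma, count):
    """Cut `count` windows of length N, each shifted by gamma samples."""
    return [signal[a * gamma: a * gamma + N] for a in range(count)]


signal = list(range(10_000))            # stand-in for a vibration record
A = sample_count(len(signal), 2048, 480)
windows = sliding_windows(signal, 2048, 480, A)
print(A, len(windows[0]), windows[-1][0])
```

With $N$ = 2048 and $γ$ = 480, consecutive samples share 1568 points, and a record needs at least 482,528 points for equation (36) to yield 1000 samples.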

The dataset is divided into training and test sets in a 75% to 25% ratio, respectively. Each state includes 750 samples in the training set and 250 samples in the test set, resulting in a total of 10,000 samples. The cross-entropy loss function is employed, with the Adam algorithm as the optimization method. All programs run on a computer with the following configuration: Intel i3 12100, NVIDIA RTX 3080.

WOA-SVMD denoising results

This experiment used a dataset with rolling element faults by selecting the bearing early fault signal of 1024 points starting from time t = 0.

Firstly, noise is added to the original signal at a signal-to-noise ratio (SNR) of 5 dB. A higher SNR indicates less noise, while a lower SNR indicates more noise. As shown in Figure 8, the signal changes noticeably after the noise is added.

Figure 8.

Comparison between the original signal and the noise-added 5 dB signal.
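Adding noise at a prescribed level can be sketched as follows. The dB value is interpreted here as a signal-to-noise ratio, and the noise is assumed to be white Gaussian, which the text does not specify; the function names and the sine test signal are hypothetical.

```python
import math
import random


def add_noise(signal, snr_db, seed=0):
    """Add zero-mean Gaussian noise scaled to the requested SNR in dB."""
    rng = random.Random(seed)
    p_signal = sum(v * v for v in signal) / len(signal)
    p_noise = p_signal / (10 ** (snr_db / 10))
    sigma = math.sqrt(p_noise)
    return [v + rng.gauss(0.0, sigma) for v in signal]


def measured_snr_db(clean, noisy):
    """SNR in dB computed from the clean signal and the added noise."""
    p_s = sum(v * v for v in clean)
    p_n = sum((a - b) ** 2 for a, b in zip(clean, noisy))
    return 10 * math.log10(p_s / p_n)


clean = [math.sin(0.1 * t) for t in range(20_000)]
noisy = add_noise(clean, snr_db=5.0)
print(round(measured_snr_db(clean, noisy), 2))
```

At 5 dB the noise power is about 32% of the signal power, which is why the waveform changes so visibly.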

Then, the optimization iteration curve of WOA is illustrated in Figure 9, which shows that the algorithm converges after 5 iterations, resulting in a minimum envelope entropy of 6.76735 and an optimal penalty factor value of 2445.7462.

Figure 9.

Optimization iterative curve of WOA.

Next, the optimal penalty factor parameter α is applied in the SVMD algorithm for signal decomposition, resulting in a series of IMFs, as shown in Figure 10. Typically, low-frequency IMFs represent the trends or slow variations in the signal, while high-frequency IMFs contain detailed information about noise or rapid oscillations. To better visualize the IMFs, they are displayed in a 3-dimensional space, as illustrated in Figure 11, allowing a clearer view of each IMF component. Subsequently, the amplitude spectrum, power spectrum, and frequency spectrum of each IMF are calculated to further analyze their frequency characteristics, as shown in Figures 12–14, respectively. The amplitude spectrum shows the intensity of each frequency component within the IMF, aiding in the identification of dominant frequencies. The power spectrum reveals how the IMF’s power is distributed across frequency components, indicating the frequency range where energy is concentrated. The frequency spectrum displays the distribution of various frequency components within the IMF, providing insight into the signal’s frequency structure. A Hilbert transform is also performed on each IMF to compute its instantaneous amplitude and instantaneous phase, and subsequently, the instantaneous frequency is calculated, which reflects the variation in the IMF’s frequency over time. The instantaneous amplitudes and instantaneous frequencies of all IMFs are then synthesized along the time axis to generate a three-dimensional Hilbert spectrum. This spectrum illustrates the energy distribution of the signal in the time-frequency plane, revealing the strength of each frequency component at every moment in time. The results are presented in Figure 15.

Figure 10.

IMF components decomposed by SVMD.

Figure 11.

3D-IMF components decomposed by SVMD.

Figure 12.

Amplitude spectrum of each IMF.

Figure 13.

Power spectrum of each IMF.

Figure 14.

Frequency spectrum of each IMF.

Figure 15.

Hilbert spectrum of each IMF: (a) 2D-Hilbert spectrum and (b) 3D-Hilbert spectrum.
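The instantaneous amplitude and frequency behind the Hilbert spectrum can be sketched with the standard library alone. The analytic signal is built here with a direct (slow, $O(n^2)$) discrete Fourier transform, which is adequate only for short illustrative inputs; the cosine test signal and the function names are assumptions.

```python
import cmath
import math


def analytic_signal(x):
    """Analytic signal via the DFT: keep DC, double positive frequencies, zero negative ones."""
    n = len(x)
    spectrum = [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
                for k in range(n)]
    for k in range(n):
        if 0 < k < n / 2:
            spectrum[k] *= 2
        elif k > n / 2:
            spectrum[k] = 0
        # k == 0 and the Nyquist bin (k == n / 2, even n) are left unchanged
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def instantaneous(x, fs):
    """Return (instantaneous amplitude, instantaneous frequency in Hz)."""
    z = analytic_signal(x)
    amplitude = [abs(v) for v in z]
    # The phase increment between neighbours avoids explicit phase unwrapping.
    freq = [cmath.phase(z[t + 1] * z[t].conjugate()) * fs / (2 * math.pi)
            for t in range(len(z) - 1)]
    return amplitude, freq


fs, f0, n = 32, 4, 32
x = [math.cos(2 * math.pi * f0 * t / fs) for t in range(n)]
amp, freq = instantaneous(x, fs)
print(round(amp[0], 6), round(freq[0], 6))
```

For a pure 4 Hz cosine the sketch returns a constant amplitude of 1 and a constant frequency of 4 Hz; for a real IMF both quantities vary over time, and plotting them against time gives the Hilbert spectrum.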

Finally, the Pearson correlation coefficients between each IMF and the original signal are calculated, with the results presented in Table 2. The IMFs with correlation coefficients greater than 0.2 are retained and used to reconstruct the signals for denoising. To verify the effectiveness of the proposed method, the denoised signals are compared with the original signals, and the comparison is shown in Figure 16. Two other signal decomposition methods, WOA-VMD and GWO-CEEMDAN, are also included for comparison. The denoising effect with different dB levels of noise added to the original signal is shown in Figure 17. It is important to note that the parameters optimized for VMD were the number of modes $K$ and the penalty factor $α$ , with $K$ ranging from 3 to 10. For CEEMDAN, the optimized parameters were the noise amplitude weight $Nstd$ and the number of noise realizations $NR$ , with $Nstd$ set between 0.2 and 0.5, and $NR$ ranging from 50 to 100. The final optimization results were: optimal number of modes $K$ = 10, penalty factor $α$ = 2512.69, noise amplitude weight $Nstd$ = 0.22748, and noise realization count $NR$ = 89.

Table 2.

Pearson correlation coefficient of each IMF.

IMF component	Correlation coefficient	IMF component	Correlation coefficient
IMF1	0.2763	IMF5	0.0151
IMF2	0.2378	IMF6	0.0006
IMF3	0.9324	IMF7	0.0002
IMF4	0.0396	IMF8	0.0001

Figure 16.

Comparison of denoised signal with original signal.

Figure 17.

Comparison of denoising results of different methods.

Multi-scale CNN-Transformer classification result

In this experiment, the following model parameters were configured: embedding dimension, hidden dimension, number of attention heads, number of encoder layers, and dropout rate. When the embedding and hidden dimensions are too small, the network becomes under-parameterized, resulting in degraded performance. Additionally, a lower number of attention heads can cause the model to overly focus on its own position, leading to overfitting. The final model parameters are detailed in Table 3. The parameter $K$ represents the convolutional kernel sizes described in Section 2.2.1.

Table 3.

Network parameters setting.

No.	Network layer	Parameter
1	Input size	Vibration signal (2048 × 1)
2	Output	Fault category (0–9)
3	Multi-scale CNN	4@K × 1 & max pool 4
4	No. of encoder layers	4
5	Hidden dimension	128
6	No. of attention heads	6
7	Embedding dimension	32
8	Optimizer	Adam
9	Learning rate	0.0005
10	Dropout rate	0.1
11	Epochs	100
12	Batch size	32

The loss values and accuracy curves during the training of the multi-scale CNN-Transformer model are displayed in Figure 18. Figure 19 presents the confusion matrix on the CWRU dataset. The confusion matrix illustrates the comparison between the models’ predictions and the actual labels across different categories. In this matrix, the horizontal axis represents the true categories, while the vertical axis represents the predicted categories. The diagonal line indicates the number of correct predictions, while the off-diagonal elements represent the number of incorrect predictions. To visualize the features extracted by the models, T-SNE visualization in Figure 20 displays the high-dimensional features from the final hidden layer reduced to a two-dimensional vector distribution. The effectiveness of the models’ feature extraction capabilities is demonstrated through the similarity and clustering of the data points in these plots. To further assess the effectiveness of the multi-scale CNN-Transformer model, five additional models were selected for comparative analysis.^39–41

Figure 18.

Loss values and accuracy curves of multi-scale CNN-Transformer model of the CWRU dataset: (a) loss curve and (b) accuracy curve.

Figure 19.

Confusion matrix of multi-scale CNN-Transformer of the CWRU dataset.

Figure 20.

T-SNE visualization of the CWRU dataset.

On the test set, the multi-scale CNN-Transformer network achieves a fault diagnosis accuracy of 99.24%, higher than that of the other five networks. This supports the benefit of combining the Transformer network with the multi-scale CNN for fault diagnosis. Additional details on the models’ performance differences can be seen in the confusion matrices in Figure 21. To further evaluate the performance of the proposed method, a detailed comparison was conducted using four metrics: accuracy, precision, recall, and F1 score. True Positive (TP) refers to instances where the classifier correctly identifies the positive class. True Negative (TN) occurs when the classifier correctly identifies the negative class. False Positive (FP) represents cases where the classifier incorrectly labels a negative instance as positive, while False Negative (FN) occurs when a positive instance is incorrectly classified as negative. Accuracy measures the proportion of correct predictions out of the total predictions, reflecting the model’s overall correctness. The formula is provided in equation (37). Precision assesses how many of the predicted positive instances were actually correct, focusing on the accuracy of positive predictions, with the formula shown in equation (38). Recall evaluates how well the model identifies actual positive instances, emphasizing its ability to detect all positive cases, as outlined in equation (39). The F1 score, the harmonic mean of precision and recall, combines these metrics into a single measure. It is particularly valuable when class distribution is imbalanced or when balancing precision and recall is crucial. A high F1 score indicates minimized false positives and false negatives, as detailed in equation (40). A detailed comparison of the models’ performance using these four metrics is shown in Figure 22.

Accuracy = \frac{(TP + TN)}{(TP + TN + FP + FN)}

(37)

Precision = \frac{TP}{(TP + FP)}

(38)

Recall = \frac{TP}{(TP + FN)}

(39)

F1\text{-}Score = 2 \times \frac{Precision \times Recall}{Precision + Recall}

(40)
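Equations (37) to (40) can be applied to a multi-class confusion matrix by treating one class at a time as the positive class. The sketch below assumes rows hold the true labels and columns the predicted labels; the 2 × 2 matrix is a made-up example, and the function name is hypothetical.

```python
def per_class_metrics(cm, cls):
    """Accuracy, precision, recall and F1 for class `cls` (one-vs-rest).

    cm[t][p] is the number of samples with true label t predicted as p
    (assumed layout).
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    tp = cm[cls][cls]
    fn = sum(cm[cls]) - tp
    fp = sum(cm[t][cls] for t in range(n)) - tp
    tn = total - tp - fn - fp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1


cm = [[8, 2],
      [1, 9]]
print(per_class_metrics(cm, 0))
```

Averaging the per-class values over all ten classes gives macro-averaged scores; whether the reported figures use macro or another averaging scheme is not stated in the text.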

Figure 21.

Confusion matrices of different models on the CWRU dataset: (a) CNN, (b) WDCNN, (c) CNN-Transformer, (d) multi-scale CNN-Transformer, (e) SNDCNN, and (f) MSCNN-LSTM.

Figure 22.

Comparison of different methods using four performance evaluation metrics of the CWRU dataset.

Validation based on PU bearing dataset

The PU dataset is provided by the Paderborn University Bearing Data Center and is widely used in bearing fault diagnosis research.⁴² The test rig consists of several modules: an electric motor, a torque-measurement shaft, a rolling bearing test module, a flywheel, and a load motor, as shown in Figure 23. Ball bearings with various types of damage are installed in the bearing test module to generate experimental data. For this study, the diagnostic analysis focuses on faulty data from the upper region of the rolling bearing module, recorded at a sampling frequency of 64 kHz. In addition to normal operating conditions, the bearing faults include single-point damage induced by electrical discharge machining.

Figure 23.

The PU bearing fault test rig.

Four specific operating conditions—N15_M07_F10, N09_M07_F10, N15_M01_F10, and N15_M07_F04—were deliberately selected for the experiments. In this nomenclature, “N” represents the rotational speed, with N15 corresponding to 1500 rpm and N09 to 900 rpm. “M” denotes the torque magnitude, where M07 indicates 0.7 Nm and M01 represents 0.1 Nm. Additionally, “F” stands for the radial force, with F10 equating to 1000 N and F04 to 400 N. The rolling bearing data collected under these distinct operating conditions were systematically classified into four states: normal, inner race (IR) defects, outer race (OR) defects, and multiple defects involving both IR and OR defects, as outlined in Table 4.
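The naming scheme of the operating conditions can be decoded mechanically, as in this small sketch. The function name `parse_condition` and the output keys are hypothetical; the scale factors follow the values quoted above (N15 = 1500 rpm, M07 = 0.7 Nm, F10 = 1000 N).

```python
def parse_condition(code):
    """Decode a PU operating-condition code such as 'N15_M07_F10'."""
    parts = {field[0]: int(field[1:]) for field in code.split("_")}
    return {
        "speed_rpm": parts["N"] * 100,       # N15 -> 1500 rpm
        "torque_nm": parts["M"] / 10,        # M07 -> 0.7 Nm
        "radial_force_n": parts["F"] * 100,  # F10 -> 1000 N
    }


for code in ("N15_M07_F10", "N09_M07_F10", "N15_M01_F10", "N15_M07_F04"):
    print(code, parse_condition(code))
```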

Table 4.

PU dataset.

Fault category	Bearing code	Training set	Test set	Sample length	label
Normal	K001	375	125	2048	0
OR defect	KA01	375	125	2048	1
OR defect	KA04	375	125	2048	2
OR defect	KA16	375	125	2048	3
OR & IR defect	KB23	375	125	2048	4
OR & IR defect	KB24	375	125	2048	5
OR & IR defect	KB27	375	125	2048	6
IR defect	KI01	375	125	2048	7
IR defect	KI16	375	125	2048	8
IR defect	KI17	375	125	2048	9

In this experiment, we also applied a sliding window overlap sampling method, as illustrated in equation (36), for data augmentation to prevent overfitting. The parameter $N$ is set to 2048 and $γ$ is set to 500. The dataset is split into training and test sets in a 75% to 25% ratio, respectively. Each state contains 375 samples in the training set and 125 samples in the test set, resulting in a total of 5000 samples.

The loss values and accuracy curves during the training of the multi-scale CNN-Transformer model are shown in Figure 24. The confusion matrix on the PU dataset is presented in Figure 25. Figure 26 provides a T-SNE visualization of the features extracted by the multi-scale CNN-Transformer model. Figure 27 demonstrates the performance of different models using confusion matrices, highlighting the effectiveness of combining the Transformer network with the multi-scale CNN for fault diagnosis. Finally, Figure 28 offers a detailed comparison of the model’s performance using the four previously mentioned metrics.

Figure 24.

Loss values and accuracy curves of multi-scale CNN-Transformer model of the PU dataset: (a) loss curve and (b) accuracy curve.

Figure 25.

Confusion matrix of multi-scale CNN-Transformer of the PU dataset.

Figure 26.

T-SNE visualization of the PU dataset.

Figure 27.

Confusion matrices of different models on the PU dataset: (a) CNN, (b) WDCNN, (c) CNN-Transformer, (d) multi-scale CNN-Transformer, (e) SNDCNN, and (f) MSCNN-LSTM.

Figure 28.

Comparison of different methods using four performance evaluation metrics of the PU dataset.

Conclusions

This paper proposes an intelligent fault diagnosis method based on WOA-SVMD and multi-scale CNN-Transformer, which addresses the challenges of motor bearing vibration signals being easily interfered with by environmental noise and insufficient extraction of fault features. The WOA-SVMD method is used in the signal denoising process. WOA is employed to optimize the penalty factor parameter of SVMD, enabling the best signal decomposition. Then, the Pearson correlation coefficient method is applied to calculate the correlation between the IMFs obtained from the decomposition and the original signal. IMFs with low correlation are filtered out, while those with high correlation are retained to reconstruct the signal, achieving effective denoising. The feature extraction process employs a multi-scale CNN-Transformer model. The multi-scale CNN uses convolutional kernels of various sizes to extract local features from the input signal, constructing rich feature maps, and incorporates the ReLU activation function to enhance the network’s feature recognition capability. Subsequently, the Transformer is applied to capture global features through its self-attention mechanism. Finally, fault type is classified using the Softmax function. The proposed method has been validated on the CWRU and PU public datasets. Comparative experiments demonstrate that the method achieves superior signal denoising performance and higher fault diagnosis accuracy than other approaches.

However, fault diagnosis is only one aspect of machinery and equipment health management. In the future, we plan to test and optimize our models for remaining useful life prediction of bearings.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: National Natural Science Foundation of China (62373321), Open Project of Zhejiang Key Laboratory of Automotive Electronics Intelligence (JY20240708) and National Key Laboratory of Industrial Control Technology (ICT2024B55).

ORCID iD

Jili Tao

Data availability statement

Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.

References

1. Liu, Ping, et al. TRA-ACGAN: a motor bearing fault diagnosis model based on an auxiliary classifier generative adversarial network and transformer network. ISA Trans 2024; 149: 381–393.

2. Luo, Zhang, Yang, et al. Imbalanced data fault diagnosis of rolling bearings using enhanced relative generative adversarial network. J Mech Sci Technol 2024; 38(2): 541–555.

3. Podmiljšak, Saje, Jenuš, et al. The future of permanent-magnet-based electric motors: how will rare earths affect electrification? Materials 2024; 17(4): 848–890.

4. Lei, Wang, Zhang, et al. Rolling bearing fault diagnosis method based on MTF and PC-MDCNN. J Mech Sci Technol 2024; 38(7): 3315–3325.

5. Gonda, Paulus, Graf, et al. Basic experimental and numerical investigations to improve the modeling of the electrical capacitance of rolling bearings. Tribol Int 2024; 193: 109354–109365.

6. Guo, Wang, et al. A CNN-BiLSTM-bootstrap integrated method for remaining useful life prediction of rolling bearings. Qual Reliab Eng Int 2023; 39(5): 1796–1813.

7. Zhao, Wang, Han, et al. Fault diagnosis for abnormal wear of rolling element bearing fusing oil debris monitoring. Sensors 2023; 23(7): 3402–3420.

8. Liu, Hao, Liu, et al. Prediction of remaining useful life of rolling element bearings based on LSTM and exponential model. Int J Mach Learn Cybern 2023; 14(4): 1567–1578.

9. Yao, et al. Application of EMD combined with deep learning and knowledge graph in bearing fault. J Signal Process Syst 2023; 95(8): 935–954.

10. Dragomiretskiy, Zosso. Variational mode decomposition. IEEE Trans Signal Process 2014; 62(3): 531–544.

11. Zhou, Xiao, Niu, et al. Rolling bearing fault diagnosis based on WGWOA-VMD-SVM. Sensors 2022; 22(16): 6281–6309.

12. Zhen, Feng, et al. Rolling bearing fault diagnosis based on VMD reconstruction and DCS demodulation. Int J Hydromechatron 2022; 5(3): 205–225.

13. Liu, et al. An optimized VMD method and its applications in bearing fault diagnosis. Measurement 2020; 166: 108185.

14. Nazari, Sakhaei SM. Variational mode extraction: a new efficient method to derive respiratory signals from ECG. IEEE J Biomed Health Inform 2018; 22(4): 1059–1067.

15. Zhong, Xia, Mei. A parameter-adaptive VME method based on particle swarm optimization for bearing fault diagnosis. Exp Tech 2023; 47(2): 435–448.

16. Nazari, Sakhaei SM. Successive variational mode decomposition. Signal Process 2020; 174: 107610–167020.

17. Jiang, Wang. Rolling bearing weak fault diagnosis utilizing successive variational mode decomposition with sparsity index reconstructing strategy. J Vibroengineering 2023; 25(1): 26–41.

18. Guo, Yang, Jiang, et al. Rolling bearing fault diagnosis based on successive variational mode decomposition and the EP Index. Sensors 2022; 22(10): 3889–3911.

19. Zhou, Yang, Liu, et al. Fuzzy broad learning system combined with feature-engineering-based fault diagnosis for bearings. Machines 2022; 10(12): 1229.

20. Yang, Zheng XX. A novel bearing fault diagnosis method with feature selection and manifold embedded domain adaptation. Proc IMechE C: Journal of Mechanical Engineering Science 2022; 236(14): 8185–8197.

21. Jiang. Multi-fault diagnosis of rolling bearing using two-dimensional feature vector of WP-VMD and PSO-KELM algorithm. Soft Comput 2023; 27(12): 8175–8187.

22. Song, Liao, Wang, et al. Incrementally accumulated holographic SDP characteristic fusion method in ship propulsion shaft bearing fault diagnosis. Meas Sci Technol 2022; 33(4): 045011.

23. Samal, Sunil, Jamadar, et al. AI-enhanced fault diagnosis in rolling element bearings: a comprehensive vibration analysis approach. FME Trans 2024; 52(3): 450–460.

24. Hou. A multi-scale feature fusion network-based fault diagnosis method for wind turbine bearings. Wind Eng 2023; 47(1): 3–15.

25. Choudakkanavar, Mangai JA. A hybrid 1D-CNN-Bi-LSTM based model with spatial dropout for multiple fault diagnosis of roller bearing. Int J Adv Comput Sci Appl 2022; 13(8): 637–644.

26. Cao, Gong, et al. Weak fault feature extraction of rolling bearing under strong poisson noise and variable speed conditions. J Mech Sci Technol 2022; 36(11): 5341–5351.

27. Wang, Cao. A multiscale convolution neural network for bearing fault diagnosis based on frequency division denoising under complex noise conditions. Complex Intell Syst 2023; 9(4): 4263–4285.

28. Rajabioun, Afshar, Atan, et al. Classification of distributed bearing faults using a novel sensory board and Deep Learning Networks with hybrid inputs. IEEE Trans Energy Convers 2024; 39(2): 963–973.

29. Ding, Qin, et al. A novel deep learning approach for intelligent bearing fault diagnosis under extremely small samples. Appl Intell 2024; 54(7): 5306–5316.

30. Niu, Liu, Wang, et al. Enhanced discriminate feature learning deep residual CNN for multitask bearing fault diagnosis with information fusion. IEEE Trans Ind Inform 2023; 19(1): 762–770.

31. Jiaocheng, Jinan, Xin, et al. Bayes-DCGRU with bayesian optimization for rolling bearing fault diagnosis. Appl Intell 2022; 52(10): 11172–11183.

32. Liu, Chen, Wang, et al. A Siamese CNN-bilstm-based method for unbalance few-shot fault diagnosis of rolling bearings. Meas Control 2024; 57(5): 551–565.

33. Wang, Zhang, Cao, et al. A rolling bearing fault diagnosis method based on the WOA-VMD and the GAT. Entropy 2023; 25(6): 889–919.

34. Feng, Zhao, Wang, et al. Fault diagnosis method based on the multi-head attention focusing on data positional information. Meas Control 2023; 56(3-4): 583–595.

35. Mirjalili, Lewis. The whale optimization algorithm. Adv Eng Softw 2016; 95: 51–67.

36. Zhang, Deng. An intelligent fault diagnosis method of rolling bearings based on short-time Fourier transform and convolutional neural network. J Fail Anal Prev 2023; 23(2): 795–811.

37. Zhang, et al. Intelligent fault diagnosis of rolling bearings under varying operating conditions based on domain-adversarial neural network and attention mechanism. ISA Trans 2022; 130: 477–489.

38. Wang, Zhang, et al. A new fault diagnosis of rolling bearing based on Markov transition field and CNN. Entropy 2022; 24(6): 751–764.

39. Zhang, Peng, et al. A new deep learning model for fault diagnosis with good anti-noise and domain adaptation ability on raw vibration signals. Sensors 2017; 17(2): 425–446.

40. Han, Cao, Luan, et al. A rolling bearing fault diagnosis method based on switchable normalization and a deep convolutional neural network. Machines 2023; 11(2): 185–206.

41. Zheng. Fault diagnosis method of rolling bearing based on MSCNN-LSTM. Comput Mater Contin 2024; 79(3): 4395–4411.

42. Lessmeier, Kimotho, Zimmer, et al. Condition monitoring of bearing damage in electromechanical drive systems by using motor current signals of electric motors: a benchmark data set for data-driven classification. In: Third European conference of the prognostics and health management society, 2016, pp.152–156.

A WOA-SVMD and multi-scale CNN-transformer method for fault diagnosis of motor bearing
