Abstract
To address the challenge of incipient fault diagnosis in rotating machinery under complex operating conditions, this paper proposes a novel methodology integrating Multi-scale Energy Transform (MSET) with a dual-stream neural network architecture. The proposed framework consists of three key components: First, a new MSET technique is introduced for preprocessing weak fault signals. By applying Energy Transform (ET) iteratively with optimized parameters, where ET is defined as the reconstruction of a signal multiplied by its filtered Teager–Kaiser Energy Operator (TKEO), the proposed MSET effectively enhances fault impulse components while preserving their structural characteristics. This process yields a high signal-to-noise ratio (SNR) output suitable for subsequent sparse decomposition. Second, a sparse decomposition approach combining MSET with time–frequency spectrum optimization (TFSO) is developed to extract sparse atoms representing fault impulses. After MSET preprocessing, the maximum energy trajectory of the time–frequency representation across scales exhibits convex properties. This enables the reformulation of optimal atom matching pursuit into a convex optimization problem, which is efficiently solved using the Golden Jackal Optimization (GJO) algorithm to rapidly and accurately identify the optimal time–frequency spectrum. Third, a two-stream diagnostic network incorporating a time-series Transformer and an Inception-ResNet-v2 architecture is designed to comprehensively exploit multi-dimensional features in fault signals. This model effectively captures both global temporal dependencies and local pattern characteristics, enabling accurate fault mode identification. Simulation and experimental case studies confirm that the proposed method achieves high diagnostic accuracy and computational efficiency, particularly under strong noise conditions.
Introduction
Large low-speed, heavy-duty machinery is characterized by frequent starting and braking, load fluctuation, serious coupling, and strong interference, which makes its fault signals non-periodic, noisy, and sparse.1–4 The amplitude of the impulse components in an incipient fault signal is often less than 10% of the noise amplitude. Such weak features lead to insufficient feature learning and low diagnostic accuracy in machine learning methods.5
In the field of weak fault diagnosis, there are two main classes of solutions: (1) digital signal processing methods, which separate the fault features from noise by analyzing the characteristics of the fault features; and (2) machine learning methods, which learn features from large sets of fault samples in order to classify faults.
In digital signal processing methods for weak fault diagnosis, the basis function representing the fault signal is usually constructed according to the fault features. By exploiting the time-frequency characteristics of the basis function, the fault component and the noise component can be separated through decomposition. Widely used methods of this kind include Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN), Local Mean Decomposition (LMD), Ensemble Local Mean Decomposition (ELMD), Variational Mode Decomposition (VMD), Maximum Correlated Kurtosis Deconvolution (MCKD), Feature Mode Decomposition (FMD), and Sparse Decomposition (SD). CEEMDAN was proposed to alleviate mode mixing and can obtain fault features with less noise and clearer physical meaning.6 An improved CEEMDAN reconstructs the signal using Principal Component Analysis (PCA) and fractal dimension, which overcomes the limitations of redundancy and mode confusion.7 A hybrid fault diagnosis method based on second-generation wavelet de-noising and LMD extracts fault features from strong background noise and converges faster.8 A method based on ELMD and the fast kurtogram extracts fault characteristic information from signals embedded in strong noise.9 VMD performs well in fault diagnosis because of its clearly defined center frequencies and bandwidths. A variable-bandwidth self-convergent VMD selects optimal parameters through adaptive convergence of the center frequency and a variable-bandwidth control parameter.10 SD has proved to be a useful method for extracting impulses from noisy signals. Mallat first proposed a sparse representation method that extracts impulses using constructed redundant dictionaries, which is a milestone in digital signal processing and can represent impulses with higher precision.11 Its obvious shortcoming is decomposition speed. To overcome this, a whale optimization algorithm-optimized Orthogonal Matching Pursuit (OMP) was proposed with a combined time-frequency atom dictionary, which optimizes the atom parameters to best approximate the original signal.12 A fast Sparse Decomposition method based on Time-Frequency Spectrum Segmentation (SD-TFSS) was proposed to convert the complex matching pursuit into spectrum processing.13 A bearing fault diagnosis method using generalized logarithm sparse regularization was introduced to enhance sparsity and reduce noise disturbance.14 A Sparse Representation method based on Maximum Correlated Kurtosis Deconvolution and a Periodic Dictionary (MCKD-PDSR) was introduced to improve the accuracy of the dictionary, which is useful in complex environments.15 FMD uses an adaptive finite impulse response (FIR) filter bank, period estimation, and an update process to lock onto the fault information, and eliminates redundant and mixed modes during mode selection to obtain purer and more accurate fault features. An adaptive FMD based on a health indicator was introduced to recognize different defects.16
Machine learning methods can learn the internal laws and patterns of large volumes of equipment health data through model training, and can identify features that human analysts may overlook, which is of great significance for improving the accuracy of equipment fault diagnosis. Widely used models include the Convolutional Neural Network (CNN), Residual Network (ResNet), Deep Residual Shrinkage Network (DRSN), and Vision Transformer (ViT). A CNN parameter design method based on fault signal analysis uses the physical characteristics of bearing acceleration signals to guide the CNN design, and performs well in terms of accuracy and uncertainty.17 An adaptive symmetric loss in a dynamic wide-kernel ResNet was proposed to extract potential fault impulse features, which improves the diagnostic performance of the network under background noise.18 An improved DRSN-GRU dual-channel model was introduced to decouple and recognize compound faults.19 A frequency channel-attention based ViT was proposed to enhance sensitivity to signal frequency characteristics and increase the interpretability of the model.20
In practical applications, strong noise interference is the key factor affecting the accuracy of fault diagnosis. Traditional digital signal processing and machine learning methods face pronounced constraints in the presence of strong noise and sample imbalance. (1) Digital signal processing methods often struggle to achieve ideal results in weak fault diagnosis because of poor adaptability to strong noise, single-dimensional feature extraction, high algorithmic complexity, and difficult parameter selection. (2) The performance of machine learning methods is highly dependent on data quality. An inherent weakness is their oversensitivity to noisy data, which inclines them to learn noise patterns during training while overlooking subtle discriminative features. This ultimately leads to overfitting on datasets with high noise and class imbalance, and to a significant degradation of generalization capability.
To enhance the accuracy and stability of weak fault diagnosis, this paper integrates the strengths of digital signal processing and machine learning to propose an incipient fault diagnosis method based on MSET and a two-stream network.
The proposed methodology comprises three main parts. (1) A weak-impulse enhancement approach, MSET, is designed. MSET is defined as the repeated reconstruction of a signal by multiplying it by its filtered TKEO signal, which enhances weak impulses to a level at which they can be computed. Wavelet de-noising with the “rbio3.9” basis is adopted to filter the high-frequency noise of the TKEO signal, and the relative change rate of the correlation coefficient between the signals before and after MSET is introduced as the iteration termination index. (2) A fault signal representation approach using sparse decomposition based on MSET and TFSO is proposed. Because MSET changes the structure of the impulses, sparse decomposition is used to express the fault impulses exactly. After MSET, the problem of optimal atom matching pursuit can be converted into the problem of maximizing energy over the time-frequency spectra. GJO is adopted to search for the optimal spectrum rapidly and accurately. Once all the optimal atoms matching the MSET impulses have been selected, the corresponding atoms matching the original fault impulses can be reconstructed, and the sparse representation of the original fault signal can be obtained. (3) A two-stream fault diagnosis network consisting of a time-series Transformer and Inception-ResNet-v2 is constructed. The time-series Transformer explores the feature information of the fault signal in the time domain, and Inception-ResNet-v2 explores its feature information in the time-frequency domain. Simulation and application results show that the proposed method performs well in accuracy and efficiency, especially under strong noise.
The main innovations and contributions of this paper are as follows: (1) An MSET method is designed to enhance weak fault features while preserving the integrity of their time-frequency characteristics, thereby providing rich time-frequency features for model training. (2) A sparse decomposition method based on MSET and TFSO is developed, which effectively extracts fault features from strong noise and demonstrates excellent performance in both diagnostic accuracy and computational efficiency. (3) An incipient fault diagnosis framework integrating MSET with a two-stream network is established, achieving comprehensive fault diagnosis under both variable and constant rotational speed conditions.
This paper is structured as follows. A fundamental introduction to the sparse decomposition, TKEO, Transformer network and Inception-ResNet-v2 network is presented in “Related work”. The proposed incipient fault diagnosis method based on MSET and two-stream network is presented in “Fault diagnosis method design”. The performance of the proposed method with simulated signals and engineering application is shown in “Simulation” and “Engineering application”, respectively. Finally, conclusions are presented in “Conclusions”.
Related work
Sparse decomposition
To obtain a succinct representation of signals with local features, traditional methods such as the fast Fourier transform (FFT) and wavelet transform (WT) are not suitable, whereas sparse decomposition performs well. Unlike the FFT and WT, the basis functions of sparse decomposition are non-orthogonal and redundant. The sparse decomposition is defined as follows
The over-complete dictionary consists of a large number of atoms, and the number of atoms is much larger than the length of the signal. Finding the fewest atoms that approximate the analyzed signal is equivalent to solving the 0-norm problem
Methods such as matching pursuit (MP), basis pursuit (BP), orthogonal matching pursuit (OMP), and stagewise orthogonal matching pursuit (StOMP) can find approximate solutions to this NP-hard problem with good precision and sparsity.21,22 However, decomposition efficiency remains a pressing problem. Considering precision, sparsity, and efficiency together, a fast sparse decomposition based on time-frequency spectrum optimization is proposed.
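To make the greedy idea behind MP concrete, the following is a minimal sketch, not the method proposed in this paper. It assumes a small hypothetical dictionary of unit-norm, one-sided damped-cosine atoms; the function names, the atom shape, and all parameter values are illustrative assumptions. At each iteration the atom with the largest inner product with the residual is selected and its contribution is subtracted.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def damped_atom(n, shift, freq, decay):
    """Hypothetical unit-norm atom: a damped cosine starting at `shift`."""
    atom = [0.0] * n
    for t in range(shift, n):
        atom[t] = math.exp(-decay * (t - shift)) * math.cos(2 * math.pi * freq * (t - shift))
    norm = math.sqrt(dot(atom, atom))
    return [v / norm for v in atom]

def matching_pursuit(signal, dictionary, n_atoms):
    """Greedy MP: pick the atom most correlated with the residual, then subtract it."""
    residual = list(signal)
    selected = []  # list of (atom index, coefficient)
    for _ in range(n_atoms):
        coeffs = [dot(residual, atom) for atom in dictionary]
        k = max(range(len(dictionary)), key=lambda i: abs(coeffs[i]))
        selected.append((k, coeffs[k]))
        residual = [r - coeffs[k] * a for r, a in zip(residual, dictionary[k])]
    return selected, residual

# Illustrative dictionary: 6 shifts x 3 frequencies = 18 atoms of length 128.
N = 128
dictionary = [damped_atom(N, s, f, 0.1) for s in range(0, 96, 16) for f in (0.1, 0.2, 0.3)]

# Synthetic signal built from two atoms of the dictionary.
signal = [3.0 * a + 1.5 * b for a, b in zip(dictionary[4], dictionary[13])]
atoms, residual = matching_pursuit(signal, dictionary, n_atoms=2)
```

The cost of each iteration grows with the number of atoms, since every atom is correlated with the residual. This is the efficiency problem that motivates a faster atom search.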
Teager-Kaiser energy operator (TKEO)
The TKEO is defined by
Impulse in mechanical vibration can be expressed as
This equation shows that the TKEO can demodulate the impulse and enhance its amplitude, with the resulting energy amplitude proportional to the square of the oscillation frequency.
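As an illustration of this frequency dependence, the sketch below uses the standard discrete form of the TKEO, ψ[n] = x[n]² − x[n−1]·x[n+1], which is assumed here as the sampled counterpart of the continuous definition. For a pure sinusoid A·cos(Ωn + φ), this operator returns the constant A²·sin²(Ω), which is approximately A²Ω² for small Ω. The amplitude and frequency values are illustrative.

```python
import math

def tkeo(x):
    """Discrete TKEO on interior samples: psi[n] = x[n]^2 - x[n-1] * x[n+1]."""
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

A, omega = 2.0, 0.3  # illustrative amplitude and digital frequency (rad/sample)
x = [A * math.cos(omega * n + 0.7) for n in range(200)]

psi = tkeo(x)
expected = (A * math.sin(omega)) ** 2  # constant energy level for a pure tone
```

Doubling the amplitude or (for small Ω) the frequency quadruples the output, which is why high-frequency noise is amplified together with the impulses.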
Transformer network
The Transformer network is a deep learning model based on the attention mechanism, and it has achieved tremendous success in natural language processing, image recognition, and other artificial intelligence fields.26
The structure of the Transformer network is shown in Figure 1.

Figure 1. Structure of the Transformer network.
The encoder and decoder are the two main components of the Transformer network. The encoder is composed of multiple identical layers, each consisting of two sublayers: a self-attention layer and a feed-forward layer. The decoder is also composed of several identical layers, each consisting of three sublayers: a self-attention layer, an encoder–decoder attention layer, and a feed-forward layer. The self-attention mechanism is the key component: it models the relationship between any two positions in the input sequence and automatically learns the interdependence between positions according to the content of the sequence.
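The self-attention computation can be sketched as scaled dot-product attention. The example below is a simplified illustration and not the network used in this paper: the learned query, key, and value projections are replaced by identity mappings for brevity, and the three-token input is a hypothetical toy sequence.

```python
import math

def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

def self_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(K[0])
    scores = [[sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d) for kj in K] for qi in Q]
    weights = [softmax(r) for r in scores]  # one weight row per query position
    out = [[sum(w * v[c] for w, v in zip(wr, V)) for c in range(len(V[0]))] for wr in weights]
    return out, weights

# Toy sequence of three 2-dimensional tokens; identity projections (assumption).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = self_attention(X, X, X)
```

Each output row is a weighted average of all value vectors, so every position can draw on every other position in a single step.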
The Transformer network also introduces positional encoding to distinguish elements at different positions in the sequence, and a multi-head attention mechanism to better capture long-range dependencies in sequences.
Transformer network has many advantages, including parallel computing, global information capturing, flexibility, and easy training. However, it also has some disadvantages, such as sequence length limitation, which is limited by computing resources and memory, high computational overhead of attention weight, and so on.
Inception-ResNet-v2 network
Inception was proposed in 2014 as a new deep learning framework that obtains abundant and diversified image features through convolution and pooling operations at different scales. In the Inception network, 1×1 convolution kernels are used to reduce and increase the dimension of the feature map, nonlinear activations are added to improve the expressive ability of the network, and appropriate conditional features are added to greatly increase the linear dependence of the activation function without transformation. In addition, increasing the depth and width of the network and using max pooling for down-sampling can reduce the dimension of the deep features, and the regularization provided by intermediate auxiliary layers can improve recognition accuracy.27
However, these optimizations lead to a continuous increase in parameters and computation, which may cause the vanishing gradient problem. Residual connections are therefore added to the Inception module. Gradient information propagates more easily through residual connections during back-propagation, which accelerates convergence. By further improving and evolving the Inception network, the Inception-ResNet-v2 network is constructed with high performance and a low error rate.
The structure of Inception-ResNet-v2 is shown in Figure 2.

Figure 2. Structure of Inception-ResNet-v2.
Fault diagnosis method design
To address the challenge of incipient fault diagnosis in low-speed heavy-duty machinery, this paper proposes a novel method based on MSET and a dual-stream network architecture. The framework consists of three key steps: (1) MSET is applied to suppress strong background noise and enhance weak impulse components in raw vibration signals; (2) A fast sparse decomposition technique combining MSET and TFSO is employed to obtain sparse representations of both the enhanced MSET signal and the original fault signal by leveraging their correlated time-frequency characteristics; (3) A dual-stream diagnostic network integrating a time-series Transformer and Inception-ResNet-v2 is designed, where the Transformer branch comprehensively extracts temporal features from 1D signals while the Inception-ResNet-v2 branch analyzes time-frequency patterns from 2D generalized S-transform (GST) spectrograms. The flowchart of the incipient fault diagnosis method based on MSET and the two-stream network is shown in Figure 3.

Figure 3. Flowchart of incipient fault diagnosis method based on MSET and two-stream network.
Multi-scale energy transform
Although machine learning methods can autonomously learn features from various signal representations, such as raw vibration data, frequency spectra, and time-frequency spectra, for fault diagnosis, their accuracy is often limited when dealing with low-quality signals. Consequently, enhancing the discriminative features in these signals can significantly improve the diagnostic performance of the models.
Sparse decomposition offers a significant advantage in extracting weak impact signals from strong background noise by utilizing a redundant fault dictionary. Nevertheless, noise interference makes many optimization algorithms difficult to use in atom searching, resulting in poor performance in computational efficiency and accuracy.
How to filter out the strong noise while preserving or modifying the structural characteristics of the impulse signals is the key issue to be addressed by MSET.
Building on the TKEO’s ability to track high-frequency energy, we propose an MSET method for impulse enhancement and noise suppression. It includes four steps: (1) calculate the TKEO energy signal of the signal, (2) filter the high-frequency noise of the TKEO energy signal and retain the low-frequency envelope, (3) execute one ET by multiplying the filtered TKEO energy signal by the corresponding original signal, and (4) repeat steps (1)–(3) until the optimal impulse enhancement is achieved.
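The four steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in this paper: a moving average stands in for the wavelet de-noising of step (2), the number of scales is fixed in advance instead of being chosen by a termination criterion, each ET output is normalized by its peak, and the function names and test impulse are hypothetical.

```python
import math

def tkeo(x):
    """Step (1): discrete TKEO, with edge samples copied from their neighbours."""
    psi = [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]
    return [psi[0]] + psi + [psi[-1]]

def smooth(x, half_width):
    """Step (2): moving average, a stand-in for the wavelet low-pass filter."""
    out = []
    for n in range(len(x)):
        window = x[max(0, n - half_width): n + half_width + 1]
        out.append(sum(window) / len(window))
    return out

def energy_transform(x, half_width=8):
    """Step (3): multiply the signal by its filtered TKEO envelope, then normalize."""
    envelope = smooth(tkeo(x), half_width)
    y = [a * b for a, b in zip(x, envelope)]
    peak = max(abs(v) for v in y) or 1.0
    return [v / peak for v in y]

def mset(x, n_scales, half_width=8):
    """Step (4): repeat the ET a fixed number of times (assumed known here)."""
    for _ in range(n_scales):
        x = energy_transform(x, half_width)
    return x

def tail_fraction(s, start=150):
    """Share of the signal energy located at or after sample `start`."""
    return sum(v * v for v in s[start:]) / sum(v * v for v in s)

# Noise-free test impulse starting at sample 100 (illustrative parameters).
onset = 100
x = [0.0] * 300
for t in range(onset, 300):
    x[t] = math.exp(-0.05 * (t - onset)) * math.cos(0.8 * (t - onset))

y = mset(x, n_scales=2)
```

In this noise-free case the output stays zero before the onset and decays faster than the input, so the impulse becomes shorter while its oscillation is retained.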
After MSET processing, the oscillation frequency of the impulses is preserved, while their decay rate exhibits a regular variation. This predictable behavior enables the reconstruction of the original signal’s time-frequency information through sparse decomposition.
The flowchart of MSET is shown in Figure 4.

Figure 4. Flowchart of MSET.

(1) Basic principles

According to the impulse signal shown in equation (4), the final expression form after MSET is analyzed in detail.
The ET is defined as a signal reconstruction method in which the original signal is multiplied by its TKEO energy signal.
The TKEO transformation shows that the output amplitude is proportional to the product of the square of the amplitude and the square of the oscillation frequency, which greatly increases the overall signal magnitude. Since sparse decomposition depends on the morphological features of the signal rather than its absolute magnitude, normalization is applied to eliminate dimensional influences.
So, the ET of impulse signal
A single ET performs well in enhancing impulses and suppressing noise, but its effect is poor at low SNR. To achieve better performance, MSET is proposed to reconstruct the impulses by applying the transform multiple times.
The MSET of impulse signal
The structure of
Based on the mathematical model of the original vibration impulse signal in equation (4), sparse representation requires the extraction of four key parameters: scale factor, shift factor, frequency factor, and phase factor. After the TKEO transformation, the envelope component of the signal is obtained, whose structure often exhibits higher similarity to high-frequency noise and possesses weaker anti-interference capability. In contrast, after the MSET, the signal is represented as a vibration impulse component with stronger anti-interference capability. By performing sparse representation on the time-frequency parameters obtained from the MSET, the time-frequency parameters of the original signal can be calculated, thereby achieving a more accurate sparse representation of the original signal.
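For reference, the effect of MSET on a single impulse can be worked out under an assumed impulse model, since the exact form of equation (4) is not reproduced here. Assume a one-sided damped cosine with amplitude A, decay rate ξ, onset time u, angular frequency ω, and phase φ; this notation is an assumption and may differ from that of equation (4). Applying the continuous TKEO in its standard form Ψ[x] = ẋ² − x·ẍ, and ignoring noise, envelope filtering, and the onset discontinuity, gives:

```latex
\begin{aligned}
x(t) &= A\,e^{-\xi (t-u)}\cos\!\big(\omega (t-u)+\varphi\big), \qquad t \ge u,\\
\Psi[x(t)] &= \dot{x}^{2}(t) - x(t)\,\ddot{x}(t) = A^{2}\omega^{2}\,e^{-2\xi (t-u)},\\
\mathrm{ET}[x(t)] &= x(t)\,\Psi[x(t)] = A^{3}\omega^{2}\,e^{-3\xi (t-u)}\cos\!\big(\omega (t-u)+\varphi\big),\\
\mathrm{MSET}_{n}[x(t)] &\propto e^{-3^{n}\xi (t-u)}\cos\!\big(\omega (t-u)+\varphi\big) \quad \text{(after normalization)}.
\end{aligned}
```

Under this idealization, each ET leaves the onset time, frequency, and phase unchanged and multiplies the decay rate by three, so the decay rate of the original impulse can be recovered from that of the MSET impulse once the number of scales n is known.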
The enhancement quality of MSET depends critically on the purity of the TKEO envelope. Consequently, significant noise interference in the envelope leads to substantial signal distortion. The overall performance of MSET is therefore determined by the degree of distortion in the TKEO energy signal caused by noise: stronger noise results in more severe TKEO distortion and correspondingly poorer MSET performance.
Figures 5–7 show the results of MSET for signals with different SNRs, and Table 1 shows the correlation coefficients of the impulses before and after MSET without filtering.

Figure 5. The results of MSET without filtering for noise-free signal S1. (a) Noise-free signal. (b) Zoom-in view of 3rd TKEO. (c) Signal by 3 times ET.

Figure 6. The results of MSET without filtering for signal with noise S2 (SNR = −1.83 dB). (a) Signal with noise. (b) Zoom-in view of 3rd TKEO. (c) Signal by 3 times ET.

Figure 7. The results of MSET without filtering for signal with noise S3 (SNR = −13.20 dB). (a) Signal with noise. (b) Zoom-in view of 3rd TKEO. (c) Signal by 3 times ET.

Table 1. The correlation coefficient (CC) of impulses before and after MSET without filtering.


Figure 5 shows that in each ET, the real and ideal TKEO signals generally coincide, indicating that the impulse characteristics are well preserved. However, the attenuation speed of impulses increases rapidly with successive ET cycles.
Figure 6 shows that the real TKEO signals are contaminated by high-frequency noise. Table 1 shows that the result of the 1st ET for S2 is the best, and that the impulses become seriously distorted if the ET is continued.
Figure 7 shows that the TKEO signals are seriously disturbed by high-frequency noise. Table 1 further shows that the result of 2nd ET for S3 is the best, and the impulses will be seriously distorted if the ET is continued.
Therefore, to maximize the performance of MSET while preserving the key impulse features, two critical issues must be addressed: (1) effective filtering of high-frequency noise in the TKEO signal, and (2) appropriate selection of the MSET scale.

(2) Filtering strategy for TKEO energy signal

The definition of the TKEO shows that the TKEO energy signal is proportional to the square of the frequency, which makes it inherently sensitive to high-frequency noise. Such noise degrades the ET process and leads to cumulative errors. Removing high-frequency noise while retaining the low-frequency envelope is therefore particularly important in MSET. The filtering strategy for the TKEO energy signal is shown in Figure 8.

Figure 8. Filtering strategy for TKEO energy signal.

Table 2. The similarity of reconstructed signals under different wavelet basis functions.
The results indicate that “rbio3.9” outperforms the other wavelet bases, primarily because its scaling function resembles the Gaussian function, and its waveform more closely matches the envelope morphology of the TKEO energy signal.
Wavelet de-noising with the “rbio3.9” basis is adopted in the proposed MSET. The curves of the “rbio3.9” wavelet basis are shown in Figure 9.

Figure 9. The curves of wavelet base of “rbio3.9.”

In this TKEO energy signal filtering, the signal is first decomposed by the wavelet transform with “rbio3.9”; then the approximation coefficients at the last level are retained and all detail coefficients are set to zero; finally, the filtered TKEO energy signal is obtained by wavelet reconstruction from the processed coefficients. Figure 8(c) shows that the reconstruction scaling function exhibits a slight oscillation, which leads to signal distortion in the wavelet reconstruction when the decomposition level is too large. However, too small a decomposition level leads to unsatisfactory de-noising. A suitable decomposition level is therefore very important.
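The decompose, zero-the-details, and reconstruct procedure can be illustrated with a self-contained sketch. To avoid external dependencies, the example below swaps in the Haar wavelet for “rbio3.9”, so it shows only the structure of the procedure and not the filter used in this paper; the function names are hypothetical and the signal length is assumed to be a multiple of 2 raised to the decomposition level.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_decompose(x, level):
    """Multi-level Haar analysis; returns (approximation, [details per level])."""
    approx, details = list(x), []
    for _ in range(level):
        a = [(approx[i] + approx[i + 1]) / SQRT2 for i in range(0, len(approx), 2)]
        d = [(approx[i] - approx[i + 1]) / SQRT2 for i in range(0, len(approx), 2)]
        details.append(d)
        approx = a
    return approx, details

def haar_reconstruct(approx, details):
    """Inverse of haar_decompose (coarsest detail level is applied first)."""
    for d in reversed(details):
        up = []
        for a, di in zip(approx, d):
            up.extend([(a + di) / SQRT2, (a - di) / SQRT2])
        approx = up
    return approx

def lowpass_by_approximation(x, level):
    """Keep the last-level approximation, set every detail coefficient to zero."""
    approx, details = haar_decompose(x, level)
    zeroed = [[0.0] * len(d) for d in details]
    return haar_reconstruct(approx, zeroed)

x = [float(i) for i in range(16)]
filtered = lowpass_by_approximation(x, level=2)
restored = haar_reconstruct(*haar_decompose(x, 2))
```

With the Haar wavelet, this filter reduces to averaging over blocks of 2^level samples, which makes the trade-off visible: a larger level removes more noise but also flattens more of the envelope. In PyWavelets, the same structure would use `pywt.wavedec` and `pywt.waverec` with the wavelet name `'rbio3.9'`.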
The correlation coefficient, which measures the correlation of signals before and after ET, is used to evaluate the effect of wavelet de-noising with different decomposition levels. However, because the decay rate of the impulse differs before and after ET, the signals before and after ET cannot be used directly in the calculation. So the correlation coefficient

(3) Termination condition of MSET

Because the amplitude of the TKEO is proportional to the square of the frequency, the amplitude of the noise is enhanced along with the amplitude of the impulses. Even after filtering the TKEO energy signal, residual noise interference often persists. Although multiple ETs can effectively enhance impulses and suppress noise, each ET iteration increases the decay rate of the impulses, thereby shortening their duration. This phenomenon may lead to increased computational cost and potential blurring of time-frequency features.
Therefore, it is very important to select the appropriate scale of MSET. The relative change rate of correlation coefficient is introduced as the iteration criterion. When it reaches a certain threshold, the MSET will be stopped
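A possible form of this stopping rule is sketched below. Because the defining equation is not reproduced here, the sketch assumes that the relative change rate is k = |CC_i − CC_(i−1)| / |CC_(i−1)|, where CC_i is the correlation coefficient obtained at scale i, and that iteration stops once k reaches the threshold. The function names, the direction of the comparison, and the CC values in the example are illustrative assumptions, not results from this paper.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def relative_change(cc_prev, cc_curr):
    """Assumed definition of the relative change rate k."""
    return abs(cc_curr - cc_prev) / abs(cc_prev)

def stopping_scale(cc_per_scale, threshold=0.5):
    """Return the last scale (1-based) before k first reaches the threshold.

    cc_per_scale[0] is the correlation coefficient after the 1st ET,
    cc_per_scale[1] after the 2nd ET, and so on.
    """
    for i in range(1, len(cc_per_scale)):
        if relative_change(cc_per_scale[i - 1], cc_per_scale[i]) >= threshold:
            return i
    return len(cc_per_scale)

# Made-up CC values for four scales: the sharp drop at the 4th ET triggers the stop.
example_cc = [0.90, 0.85, 0.80, 0.35]
chosen = stopping_scale(example_cc)
```

In this example the change between the 3rd and 4th ET is about 0.56, which exceeds 0.5, so the 3rd scale is retained.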
To evaluate the performance of MSET on signals with strong noise, with and without TKEO filtering, a signal comprising three impulses and noise is constructed, with an SNR of −14.55 dB.
Figure 10 shows the results of MSET without TKEO filtering, and Figure 11 shows the results of MSET with TKEO filtering.

Figure 10. The results of MSET without filtering for signal with noise (SNR = −14.55 dB). (a) Impulse components. (b) Signal with noise. (c) Signal by 1 times ET. (d) Signal by 2 times ET. (e) Signal by 3 times ET.

Figure 11. The results of MSET for signal with noise (SNR = −14.55 dB). (a) 1st TKEO and filtered TKEO. (b) Signal by 1 times ET. (c) 2nd TKEO and filtered TKEO. (d) Signal by 2 times ET. (e) 3rd TKEO and filtered TKEO. (f) Signal by 3 times ET. (g) 4th TKEO and filtered TKEO. (h) Signal by 4 times ET.

As shown in Figure 10(b), the fault impulses are completely masked by strong noise, making them particularly difficult to identify. When the TKEO is left unfiltered, the result of the 1st ET remains severely contaminated by noise, as seen in Figure 10(c). This issue persists in the subsequent 2nd and 3rd ET cycles, shown in Figures 10(d) and (e), demonstrating that the enhancement process remains ineffective without proper TKEO pre-filtering.
As shown in Figure 11, the iterative ET process effectively enhances impulse components while suppressing noise. Figure 11(a) demonstrates that the filtered TKEO signal is significantly smoother, with most high-frequency noise removed. Correspondingly, Figure 11(b) reveals emerging impulses after the 1st ET, despite residual background noise. The effect of the 2nd ET is shown in Figures 11(c) and (d): the TKEO signal becomes cleaner, and the impulses are more clearly visible. After the 3rd ET in Figures 11(e) and (f), the signal is nearly noise-free; however, noise-induced energy inconsistency leads to amplitude fluctuations in the extracted impulses. Finally, Figures 11(g) and (h) indicate that excessive decomposition occurs at the 4th ET, manifesting as edge oscillations in the TKEO and severe distortion in the impulse structure.
Table 3. The similarity between the reconstructed signals and the ideal signal.
From Table 3, it can be calculated that the inflection points of k are 0.51, 0.56, 0.55, and 0.58, respectively. Therefore, selecting 0.5 as the representative value for the inflection point of k is reasonably justified.
Comprehensive analysis of Figures 9 and 10 leads to the following conclusions: (1) MSET is effective in extracting weak fault impulses; (2) the performance of MSET partially depends on the filtering quality of the TKEO; (3) the choice of MSET scale is critical, since too small a scale leads to insufficient noise suppression, while too large a scale causes severe impulse distortion; (4) as the MSET scale increases, noise interference can lead to inconsistent and widely varying amplitudes in the extracted impulses.
Therefore, it is very important to select both an appropriate wavelet decomposition level for TKEO filtering and a suitable MSET scale to achieve optimal performance.
Sparse decomposition method based on MSET and TFSO
In MSET, the energy of each impulse is amplified to a different degree because of noise interference. As the scale increases, this effect accumulates and can result in the loss of certain impulses. To solve this problem, sparse decomposition is employed to reconstruct all the impulses; its strategy is to extract the most prominent impulse in each iteration. Meanwhile, to overcome the low efficiency of sparse decomposition, time-frequency spectrum optimization is used to search for atoms quickly, taking advantage of the MSET signal.
The flow chart of the sparse decomposition method based on MSET and TFSO is shown in Figure 12.

Figure 12. The flow chart of sparse decomposition method based on MSET and TFSO.

(1) TFSO based on GJO algorithm

The preceding analysis shows that the frequency factor, shift factor, and phase factor of the impulses do not change after MSET; only the scale factor changes. Therefore, if the MSET signal can be expressed by sparse decomposition, the sparse representation of the original fault signal can be calculated.
Under noise interference, it is impossible to optimize over the time-frequency spectra of the original fault signal. The main reason is that the maximum energy curve of the time-frequency spectrum across scales is not a convex function, which renders the optimization algorithm ineffective. After MSET, however, the noise is greatly suppressed, and the maximum energy curve of the time-frequency spectrum across scales exhibits the characteristics of a convex function.
The specific proof is as follows:
When a window function is used to process signals, a higher time resolution is desired in the high-frequency band and a higher frequency resolution in the low-frequency band. Accordingly, the GST uses four parameters to control the location and shape of the window; in particular, it uses frequency to control the window shape, which gives the GST better time-frequency characteristics.21
The GST is defined as follows
With the GST, a large number of time-frequency spectra are generated for different values of λ and p.
The time-frequency spectrum obtained by the GST exhibits an energy-concentrated area near the modulated frequency and occurrence time of an impulse, and this energy distribution is closely related to the adjustment factor.
If p = 1, λ is a variable, the GST of signal S(t) can be written as follows
The real component is
The imaginary component is
Set
Thus, the inner product of the signal and the atoms can be written by GST as follows
Thus, in the time-frequency spectra, the amplitude corresponds to the inner product; the time axis, frequency axis, and adjustment factor correspond to the shift factor, frequency factor, and scale factor of the atoms, respectively; and the phases of the real and imaginary components correspond to the phase factor of the atoms. Finding the maximum energy points across all the time-frequency spectra with different adjustment factors yields a set of optimal atoms to express the impulses.
The atom matching accuracy is determined by the resolution of the adjustment factor. Each adjustment factor corresponds to one time-frequency spectrum and one GST operation, so higher matching accuracy requires a finer resolution of the adjustment factor and therefore more time-frequency spectra. Quickly finding the optimal adjustment factor is thus essential.
According to the local feature of the impulse, it can be proved that
Thus, in each time-frequency spectrum, the local maximum energy is dependent on the adjustment factor
The objective function can be simplified as
Although traditional methods such as Particle Swarm Optimization (PSO) and Genetic Algorithm (GA) are widely used, they possess certain inherent limitations. PSO is prone to becoming trapped in local optima, while GA requires complex parameter tuning and exhibits limited local search capability. The GJO algorithm, a novel meta-heuristic proposed in 2022, and inspired by the cooperative hunting behavior of golden jackals, demonstrates an excellent balance between global exploration and local exploitation. Its leader–follower model effectively maintains population diversity, thereby mitigating premature convergence. Furthermore, GJO involves fewer control parameters, making it relatively straightforward to implement and apply. Therefore, this proposed TFSO method adopts the GJO algorithm as the core optimizer, aiming to validate its effectiveness in the problem domain under investigation and to establish a reliable performance benchmark for future, more advanced versions.
The maximum value of the time-frequency spectrum energy is used as the fitness function, and the scale factor is used as the variable. According to the optimal time-frequency spectrum, the corresponding time-frequency factors can be calculated.
GJO mainly includes an exploration phase and an exploitation phase.28
(1) Exploration phase
The mathematical models for the search behavior of male and female jackals are as follows
(2) Exploitation phase
The mathematical models for the search behavior of male and female jackals are as follows
The mathematical model for the position of prey is as follows
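As an illustration of the two phases (not a reproduction of the paper's equations, which are not available in this section), the following minimal sketch implements a one-dimensional GJO-style search that maximizes a fitness function over a single variable. The update rules follow the commonly published form of GJO as an assumption; the constants 1.5, 0.05, 0.01 and the Lévy exponent 1.5, as well as the names `levy_steps` and `gjo_maximize`, are assumptions made here. The quadratic fitness in the usage note is only a stand-in for the maximum time-frequency spectrum energy.

```python
import math
import numpy as np

def levy_steps(rng, size, beta=1.5):
    """Levy-flight steps (Mantegna's method); beta = 1.5 is an assumed value."""
    sigma = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
             / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.standard_normal(size) * sigma
    v = rng.standard_normal(size)
    return 0.01 * u / np.abs(v) ** (1 / beta)

def gjo_maximize(fitness, lower, upper, n_agents=20, n_iter=100, seed=0):
    """Simplified 1-D GJO-style search; returns (best position, best fitness)."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform(lower, upper, size=n_agents)  # candidate prey positions
    leaders = np.empty(0)      # best two positions so far: male, female
    leader_fit = np.empty(0)
    for t in range(n_iter):
        fit = np.array([fitness(x) for x in pos])
        # Keep the two fittest candidates seen so far as male and female jackal
        cand = np.concatenate([leaders, pos])
        cand_fit = np.concatenate([leader_fit, fit])
        top = np.argsort(cand_fit)[::-1][:2]
        leaders, leader_fit = cand[top], cand_fit[top]
        male, female = leaders

        # Evading energy of the prey decays linearly over the iterations
        e = 1.5 * (1 - t / n_iter) * (2 * rng.random(n_agents) - 1)
        rl = 0.05 * levy_steps(rng, n_agents)
        explore = np.abs(e) >= 1  # exploration if |E| >= 1, else exploitation
        d_male = np.where(explore, np.abs(male - rl * pos), np.abs(rl * male - pos))
        d_female = np.where(explore, np.abs(female - rl * pos), np.abs(rl * female - pos))
        y1 = male - e * d_male
        y2 = female - e * d_female
        # New position is the average of the two jackal-guided updates
        pos = np.clip((y1 + y2) / 2, lower, upper)
    return float(leaders[0]), float(leader_fit[0])
```

A call such as `gjo_maximize(lambda x: -(x - 3.0) ** 2, 0.0, 10.0)` searches a single-peaked fitness on [0, 10]. In the setting of this paper, each fitness evaluation would correspond to one GST operation at the candidate scale factor, so the number of evaluations (here `n_agents * n_iter`) is the quantity to keep small.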
The flow chart of GJO is shown in Figure 13.
Figure 13. The flow chart of GJO.
(2) Sparse representation for incipient fault signal

Suppose the time-frequency factors of the extracted atoms from the preprocessed signal are given by
The preprocessed signal can be represented sparsely in the library built by the extracted atoms as
The incipient fault signal can be represented by
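To illustrate the idea of representing a signal over a small library of extracted atoms, the sketch below builds atoms from (shift, frequency, scale, phase) factors and solves for the coefficients by least squares. The Gaussian-windowed cosine used in `gabor_atom` is an assumed stand-in for the atom shape, since the paper's atom definition is not reproduced in this section; the names `gabor_atom` and `sparse_fit` are hypothetical.

```python
import numpy as np

def gabor_atom(t, shift, freq, scale, phase):
    """Unit-norm Gaussian-windowed cosine (assumed stand-in atom shape)."""
    g = np.exp(-0.5 * ((t - shift) / scale) ** 2)
    g = g * np.cos(2 * np.pi * freq * (t - shift) + phase)
    return g / np.linalg.norm(g)

def sparse_fit(signal, t, factors):
    """Represent `signal` over the library built from the given atom factors.

    Returns the coefficients, the sparse reconstruction, and the residual.
    """
    library = np.column_stack([gabor_atom(t, *f) for f in factors])
    coeffs, *_ = np.linalg.lstsq(library, signal, rcond=None)
    recon = library @ coeffs
    return coeffs, recon, signal - recon
```

With only a few atoms in the library, the reconstruction keeps the impulse components and leaves everything the atoms cannot explain, such as broadband noise, in the residual.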
Two-stream fault diagnosis network based on time-series Transformer and Inception-ResNet-v2
When equipment malfunctions, the vibration signals exhibit characteristic information related to the fault. One-dimensional time-domain vibration signals contain rich information such as amplitude, phase, root mean square, variance, impulse, kurtosis, and margin, as well as crucial global temporal dependencies. For instance, the sparsity characteristic of early faults in low-speed heavy-duty equipment can only be manifested over long time sequences. Frequency-domain information alone, however, performs inadequately. Two-dimensional time-frequency spectrums possess local features that characterize fault information, including fault oscillation frequency, impulse occurrence time, instantaneous frequency and energy distribution, and lines or spots near the fault characteristic frequency. Additionally, they exhibit multi-scale characteristics, where different fault types and severity levels manifest at different scales in the time-frequency representations.
To fully exploit the multi-dimensional feature information in fault signals and improve fault diagnosis accuracy, a dual-stream fault diagnosis network based on Time-series Transformer and Inception-ResNet-v2 is proposed in this paper. The network employs the Time-series Transformer to process one-dimensional vibration data, using sparse representation signals constructed based on MSET and TFSO as input to this channel to mitigate the impact of strong noise on model accuracy. The Inception-ResNet-v2 processes two-dimensional time-frequency spectrums, with optimized time-frequency representations as input to this channel to better reveal subtle time-frequency features of weak fault signals.
The multi-scale architecture of Inception-ResNet-v2 enables the perception of fault information in time-frequency spectrums at different scales: the 1×1 convolutional modules extract point-like features of transient impulses to determine the existence of faults; the 3×3 convolutional modules extract line features of periodic faults to identify fault types; and the 5×5 convolutional modules extract broadband variation features to assess fault severity and different fault sources. This parallel, multi-scale feature extraction capability allows the network to adaptively capture comprehensive fault-related information from noisy time-frequency representations without manual design and selection of feature scales. The residual connections in Inception-ResNet-v2 ensure direct gradient backpropagation, enabling the learning of highly complex nonlinear fault mapping relationships.
The Transformer exhibits powerful global contextual dependency modeling capabilities, capturing long-range and complex temporal dependencies in fault signals. Through the global attention mechanism of the Transformer, a shock signal at one time point can establish direct connections with modulation signals at distant time points without passing through multiple convolutional layers. This is particularly crucial for detecting weak and sparse early fault signals.
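The direct connection between distant time points can be seen in a minimal single-head scaled dot-product self-attention sketch. It is a generic illustration of the attention mechanism, not the configuration of the network in this paper; the projection matrices `w_q`, `w_k`, `w_v` and the function name `self_attention` are hypothetical.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence.

    x: array of shape (n_steps, d_model), one embedded segment per time step.
    Returns the attended features and the (n_steps, n_steps) weight matrix.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Softmax over the key axis, shifted for numerical stability
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

Entry `weights[i, j]` links time step `i` to time step `j` in a single layer regardless of how far apart they are, whereas a convolutional stack needs several layers before its receptive field spans the same distance.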
By leveraging the complementary advantages of Inception-ResNet-v2’s local perception and Transformer’s global correlation, more comprehensive fault features are extracted from complementary perspectives.
Feature attention fusion is implemented as follows
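As a rough illustration only, one common form of attention-weighted fusion of two feature vectors is sketched below. It is not necessarily the formulation used in this paper, whose fusion equation is not reproduced in this section; the scoring vector `w`, the tanh scoring function, and the name `attention_fuse` are assumptions.

```python
import numpy as np

def attention_fuse(f_time, f_tf, w):
    """Fuse two stream features with softmax attention weights.

    f_time: feature vector from the time-series stream, shape (d,)
    f_tf:   feature vector from the time-frequency stream, shape (d,)
    w:      scoring vector, shape (d,) (learned in practice; assumed here)
    """
    feats = np.stack([f_time, f_tf])   # shape (2, d)
    scores = np.tanh(feats) @ w        # one scalar score per stream
    scores = scores - scores.max()     # shift for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    fused = alpha @ feats              # weighted sum of the two streams
    return fused, alpha
```

In this form the two weights in `alpha` sum to one, so the fused vector shifts toward whichever stream scores higher for a given sample.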
Simulation
Simulation is designed mainly to verify the anti-interference capability and accuracy of the proposed approach. Firstly, simulation is used to validate the proposed sparse decomposition method based on MSET and TFSO by comparing it with other methods such as improved VMD, 29 FMD, 30 and SD-TFSS. 31 Secondly, simulation is used to evaluate the fault diagnosis accuracy of the proposed two-stream fault diagnosis network based on time-series Transformer and Inception-ResNet-v2 against other machine learning methods.32,33
Comparison of proposed sparse decomposition method based on MSET and TFSO with other methods
A simulated noisy signal with low SNR was constructed, which is given by
Time-frequency factors of S_p.
S1 is a noise-free signal with a = 0; S2 is a noisy signal with a = 0.1 and SNR = −3.54 dB; S3 is a noisy signal with a = 0.5 and SNR = −15.08 dB.
The improved VMD, FMD, and SD-TFSS methods and the proposed sparse decomposition method based on MSET and TFSO are used to analyze the simulated signals.
The simulated signals are shown in Figure 14.
Figure 14. The simulated signals. (a) S1: Noise-free signal. (b) S2: Noisy signal, SNR = −3.54 dB. (c) S3: Noisy signal, SNR = −15.08 dB.
The results of improved VMD method are shown in Figure 15.
Figure 15. The results of improved VMD method. (a) S2: modes number K = 4, balancing parameter a = 121. (b) S3: modes number K = 5, balancing parameter a = 1480.
The results of FMD method are shown in Figure 16.
Figure 16. The results of FMD method. (a) S2: modes number K = 7, cut number of frequency band C = 3. (b) S2: modes number K = 7, cut number of frequency band C = 5.
The results of S3 by SD-TFSS method are shown in Figure 17.
Figure 17. The results of S3 by SD-TFSS. (a) Optimal spectrum in 1st iteration. (b) Extracted impulse in 1st iteration. (c) Optimal spectrum in 2nd iteration. (d) Extracted impulses in 2nd iteration. (e) Optimal spectrum in 3rd iteration. (f) Extracted impulses in 3rd iteration.
The results of S3 by proposed sparse decomposition method based on MSET and TFSO are shown in Figure 18.
Figure 18. The results of S3 by proposed sparse decomposition method based on MSET and TFSO. (a) Zoom-in view of optimal spectrum in 1st iteration. (b) Extracted impulse in 1st iteration. (c) Zoom-in view of optimal spectrum in 2nd iteration. (d) Extracted impulses in 2nd iteration. (e) Zoom-in view of optimal spectrum in 3rd iteration. (f) Residual signal in 3rd iteration.
Figure 14(a) shows that the shapes of the three impulses differ because of their different scale factors and frequency factors. Figure 14(b) shows that signal S2 contains weak noise and the impulses remain obvious. Figure 14(c) shows that signal S3 contains strong noise and the impulses are completely submerged by it.
Figure 15(a) shows that, in the decomposition of S2 with improved VMD, almost all three impulses are extracted in IMF2 with a little noise. Figure 15(b) shows that, in the decomposition of S3 with improved VMD, the impulses are mainly distributed in IMF2 and IMF3; the first two impulses can be extracted, the third is extracted relatively poorly, and all of them are seriously disturbed by noise.
Figure 16(a) shows that, in the decomposition of S2 with FMD, the impulses are mainly distributed in IMF1 and IMF2. Because the impulses have different frequency factors, the third impulse is separated from the first two. However, all three impulses show some shift errors along the time axis. Figure 16(b) shows that, in the decomposition of S3 with FMD, the impulses are mainly distributed in IMF2; they are seriously disturbed by noise, and the third impulse cannot be extracted at all. The two extracted impulses also show some shift errors along the time axis.
Figure 17 shows that decomposing signal S3 with SD-TFSS extracts all the impulses from strong noise. Owing to the strong noise interference, the impulse amplitudes have a certain degree of error. Figures 17(a), (c) and (e) show that in each decomposition the time-frequency spectrum is seriously interfered with by noise, which makes the time-frequency position of the maximum energy prone to deviation. Overall, this method demonstrates the advantages of sparse decomposition for weak fault feature extraction.
Figure 18 shows that, for signal S3 processed by the sparse decomposition method based on MSET and TFSO, the time-frequency spectrums in Figures 18(a), (c) and (e) are almost free from noise interference owing to the MSET preprocessing, and the impulses are extracted well.
Comparison of selected time-frequency factors for S1.
Comparison of selected time-frequency factors for S2.
Comparison of selected time-frequency factors for S3.
Comparison of minimum operations of GST.
Tables 5–7 show that, for the proposed method, the error increases only within a small range as the noise increases, and the selected factors are closer to those of the simulated signals, which indicates that the proposed method is more accurate. Additionally, Table 8 shows that the proposed method needs fewer GST operations, which account for most of the computing time when searching for the optimal time-frequency spectrum, so the search strategy of the proposed method greatly improves search efficiency.
In summary, this simulation demonstrates that the proposed method outperforms improved VMD, FMD, and SD-TFSS in both accuracy and efficiency.
Comparison of proposed method and other machine learning networks
The Case Western Reserve University (CWRU) data set is adopted for simulation; it contains data for the normal condition, outer ring fault, inner ring fault, and rolling element fault. Samples are chosen with a fault diameter of 0.021 inches, a motor speed of 1772 r/min, and a sampling frequency of 12 kHz. There are 300 samples for each condition, 1200 samples in total, of which 960 are randomly selected as the training set and 240 as the test set. Gaussian white noise is added to increase the fault diagnosis difficulty, at strengths of −2 dB, −4 dB, −6 dB, and −10 dB, respectively. The compared methods are CNN, Vision-Transformer, 23 Inception-ResNet-v2, 24 TCN + CNN, 28 Liconvformer, 32 ClassBD, 33 and the proposed method.
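The noise injection and the random split described above can be sketched as follows, assuming the stated noise strengths denote the signal-to-noise ratio in dB. The function names `add_awgn` and `split_indices` are hypothetical, and the sine wave in the usage note is only a placeholder for a CWRU vibration sample.

```python
import numpy as np

def add_awgn(signal, snr_db, rng):
    """Add Gaussian white noise scaled so that the SNR equals snr_db exactly."""
    noise = rng.standard_normal(signal.shape)
    p_signal = np.mean(signal ** 2)
    p_noise_target = p_signal / 10 ** (snr_db / 10)
    # Rescale the noise realization to the target power
    noise *= np.sqrt(p_noise_target / np.mean(noise ** 2))
    return signal + noise, noise

def split_indices(n_samples, n_train, rng):
    """Randomly split sample indices into training and test sets."""
    perm = rng.permutation(n_samples)
    return perm[:n_train], perm[n_train:]
```

For instance, `add_awgn(x, -10, np.random.default_rng(0))` would return a copy of `x` in which the noise power is ten times the signal power, and `split_indices(1200, 960, rng)` would give the 960/240 partition used here.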
Signals of outer ring fault in the data set with different noise are shown in Figure 19. Corresponding processed signals by proposed sparse decomposition method based on MSET and TFSO are shown in Figure 20. The diagnosis accuracy on the CWRU dataset in ablation study is shown in Table 9. The diagnosis accuracy of different methods for CWRU data set is shown in Table 10.
Figure 19. Signals of outer ring fault in the data set with different noise.
Figure 20. Processed signals by proposed sparse decomposition method based on MSET and TFSO.
Table 9. The diagnosis accuracy on the CWRU dataset in ablation study.
Table 10. The diagnosis accuracy of different methods for CWRU data set.

Figure 19 shows that the fault impulses are obvious in the first original signal; as the noise increases, the impulses are gradually submerged, and they are completely submerged when the intensity is −10 dB. Figure 20 shows that, with the proposed method, the noise is largely filtered out, but parts of the impulses disappear owing to the interference of strong noise. In particular, when the intensity is −10 dB, only one impulse can be extracted.
Table 9 indicates that directly using SD can effectively extract fault impulse signals, thereby achieving a filtering effect with satisfactory diagnostic accuracy. After incorporating the MSET module proposed in this paper, the performance is further improved. In contrast, when the non-filtered MSET module is added, the diagnostic results deteriorate sharply with increasing noise due to the cumulative effect of signal distortion. Overall, the method proposed in this paper achieves the best diagnostic performance across different SNRs.
Table 10 shows that the proposed method, taking advantage of the proposed feature extraction strategy and the two-stream network, provides sufficient valid features to train the network. The diagnosis accuracy still reaches 86.31% when the noise intensity is −10 dB. This indicates that the proposed approach has excellent anti-noise ability and is suitable for fault diagnosis with weak features.
Engineering application
To verify the validity of the proposed approach, fault diagnosis is carried out under two working conditions: variable speed and constant speed. In variable speed fault diagnosis, the weak fault features are first extracted by the proposed sparse decomposition method based on MSET and TFSO, and fault diagnosis is then performed using the occurrence times of the impulses. In constant speed fault diagnosis, our own fault data set is used for fault diagnosis with the proposed two-stream network.
The fault simulation platform for rotating machinery is shown in Figure 21, and the bearings with outer ring fault and inner ring fault are shown in Figure 22. The bearing parameters are shown in Table 11.
Figure 21. The fault simulation platform.
Figure 22. Two kinds of fault bearings. (a) Outer ring fault, width = 0.2 mm, depth = 0.5 mm. (b) Inner ring fault, width = 0.2 mm, depth = 0.5 mm.
Table 11. Bearing parameters.

Fault diagnosis under variable speed
There are 200 samples for each fault type, including outer race faults and inner race faults. The rotation speed accelerates linearly from 300 r/min to 450 r/min in 0.25 s, the sampling frequency is 20 kHz, and each sample contains 5000 sampling points.
The acquired outer ring fault signal and extracted signal by the proposed method are shown in Figure 23. The corresponding actual and theoretical occurrence time of impulses are shown in Table 12.
Figure 23. Acquired signal and extracted signal by proposed method from outer ring fault bearing. (a) Acquired signal. (b) Extracted impulses. (c) Residual.
Table 12. The actual and theoretical occurrence time of impulses.
The acquired inner ring fault signal and extracted signal by the proposed method are shown in Figure 24. The corresponding actual and theoretical occurrence time of impulses are shown in Table 13.
Figure 24. Acquired signal and extracted signal by proposed method from inner ring fault bearing. (a) Acquired signal. (b) Extracted impulses. (c) Residual.
Table 13. The actual and theoretical occurrence time of impulses.
Figure 23 shows that all the impulses can be extracted from the fault signal; the amplitude increases with the rotating speed, and the interval decreases with the rotating speed. As the impulses are non-periodic, traditional frequency analysis methods are invalid.
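The shrinking interval between impulses follows directly from the linear speed ramp. As a sketch under stated assumptions, suppose a fault produces one impulse every 1/O shaft revolutions, where O is the fault characteristic order. With shaft frequency f(t) = f0 + a·t, the accumulated revolutions are θ(t) = f0·t + a·t²/2, so the k-th impulse after the first occurs where θ(t) = θ(t1) + k/O, giving t = (−f0 + sqrt(f0² + 2·a·θ))/a. The function name `impulse_times` and the order value used in the usage note are hypothetical; the actual order depends on the bearing geometry.

```python
import numpy as np

def impulse_times(f0_hz, f1_hz, ramp_s, fault_order, t_first=0.0):
    """Theoretical impulse times during a linear speed ramp (nonzero ramp).

    f0_hz, f1_hz: shaft frequency at the start and end of the ramp, in Hz
    ramp_s:       ramp duration in seconds
    fault_order:  impulses per shaft revolution (assumed known)
    t_first:      occurrence time of the first impulse, in seconds
    """
    a = (f1_hz - f0_hz) / ramp_s  # shaft acceleration in Hz/s

    def revolutions(t):
        return f0_hz * t + 0.5 * a * t ** 2

    rev_first = revolutions(t_first)
    rev_end = revolutions(ramp_s)
    # Number of further impulses that fit before the ramp ends
    n_more = int(np.floor((rev_end - rev_first) * fault_order + 1e-12))
    target = rev_first + np.arange(n_more + 1) / fault_order
    # Invert the quadratic revolutions(t) = target for t >= 0
    return (-f0_hz + np.sqrt(f0_hz ** 2 + 2 * a * target)) / a
```

For the ramp used here (300 r/min to 450 r/min in 0.25 s, that is 5 Hz to 7.5 Hz), calling `impulse_times(5.0, 7.5, 0.25, fault_order=4.0)` with an illustrative order of 4 gives impulse times whose spacing decreases steadily, which is why a fixed-frequency analysis does not apply.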
Table 12 lists the occurrence times of all extracted impulses and the theoretical occurrence times, calculated from the first impulse, under different fault types. The distance of different fault types is computed as
Figure 24 shows that the impulses can be extracted from the fault signal; the amplitude increases with the rotating speed, and the interval decreases with the rotating speed.
Table 13 lists the occurrence times of all extracted impulses and the theoretical occurrence times, calculated from the first impulse, under different fault types. The distance of different fault types is computed as
The diagnosis accuracy under variable speed conditions.
The engineering application demonstrates that the proposed method can accurately extract the non-periodic impulses and identify the fault type under variable rotation speed.
Fault diagnosis under uniform speed
Our own data set, acquired from the fault simulation platform shown in Figure 21, is adopted for diagnosis. The sampling frequency is 20 kHz, each sample contains 4096 sampling points, the rotation speed is 900 r/min, and the vibration data are collected by an acceleration sensor. The data set contains data for the normal condition, outer ring fault, inner ring fault, and rolling element fault. There are 200 samples for each condition, 800 samples in total, of which 640 are randomly selected as the training set and 160 as the test set. The diagnosis accuracy of different methods for our own data set is shown in Table 11.
The diagnosis accuracy of different methods for our own data set.
Conclusions
This paper proposes an incipient fault diagnosis method based on MSET and a two-stream network for rotating machinery under complicated conditions. Results from both simulation and engineering applications confirm the effectiveness of the proposed approach. The main conclusions are summarized as follows: (1) The proposed MSET method effectively enhances impulse components buried in strong noise while preserving their structural integrity, which is a crucial feature for subsequent sparse decomposition. (2) The sparse decomposition method based on MSET and TFSO has good performance on accuracy and efficiency in extracting impulses from strong noise and is more applicable in complex and harsh environments. (3) The proposed incipient fault diagnosis method based on MSET and two-stream network not only improves fault diagnosis accuracy but also reliably identifies fault types under variable rotational speeds.
Footnotes
Funding
The authors received no financial support for the research, authorship, and/or publication of this article. This work is supported by the Project Fund of Sichuan Province All-electric Navigation Aircraft Key Technology Engineering Research Center No. CAFUC2025KF02, by the Project Fund of the National Natural Science Foundation of China No. 52475525 and No. 52375520, by the Project Fund of Central-Guided Local Science and Technology Development Special Program of Hubei Province No. 2023EGA005.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
