Abstract
Fault signatures in rolling bearing vibration signals are weak under strong noise and are further obscured by interference between coupled fault modes, while existing methods remain limited in modal representation and feature fusion. To address these problems, a deep learning-based multi-modal fusion framework for fault diagnosis is proposed that combines time-, frequency-, and time–frequency-domain representations (TFT) through an attention mechanism. Three complementary modalities are constructed to characterize fault-related information: the time-domain signal, the frequency-domain spectrum obtained by the fast Fourier transform (FFT), and the time–frequency-domain representation obtained by multi-scale wavelet convolution (MWC). A parallel convolutional neural network–long short-term memory (CNN–LSTM) architecture extracts spatial and temporal features from each modality. A multi-head attention mechanism with learnable modal weights then fuses the multi-modal features adaptively, enhancing discriminative information and suppressing redundant representations. Experimental results on the HUST bearing dataset show that the proposed framework achieves 100% classification accuracy for both single and compound faults, outperforming all baseline methods. Under varying signal-to-noise ratio (SNR) conditions, the model maintains stable performance, exhibiting strong noise robustness and generalization capability. Visualization with t-distributed stochastic neighbor embedding (t-SNE) and cross-dataset validation further confirm the effectiveness and separability of the learned feature representations.
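As a rough illustration of how the three input modalities named in the abstract might be built from a single vibration segment, the following Python sketch uses NumPy only. It is not the authors' implementation: the function names `ricker` and `build_modalities`, the wavelet widths, the normalization, and the use of fixed Ricker wavelets as a stand-in for the multi-scale wavelet convolution (MWC) are all assumptions made for illustration.

```python
import numpy as np

def ricker(points, width):
    """Ricker (Mexican hat) wavelet sampled at `points` positions (hypothetical helper)."""
    t = np.arange(points) - (points - 1) / 2.0
    norm = 2.0 / (np.sqrt(3.0 * width) * np.pi ** 0.25)
    return norm * (1.0 - (t / width) ** 2) * np.exp(-0.5 * (t / width) ** 2)

def build_modalities(signal, widths=(2, 4, 8, 16)):
    """Return (time, frequency, time-frequency) views of a 1-D signal.

    Assumption: fixed wavelets at a few scales approximate the idea of a
    multi-scale wavelet convolution; the paper's MWC may differ.
    """
    x = np.asarray(signal, dtype=float)
    # Time domain: zero-mean, unit-variance segment.
    time_view = (x - x.mean()) / (x.std() + 1e-12)
    # Frequency domain: one-sided FFT magnitude spectrum.
    freq_view = np.abs(np.fft.rfft(time_view)) / len(x)
    # Time-frequency domain: one convolution per wavelet scale, stacked as rows.
    tf_view = np.stack([
        np.convolve(time_view, ricker(min(10 * w, len(x)), w), mode="same")
        for w in widths
    ])
    return time_view, freq_view, tf_view

# Synthetic example: a 50-cycle sinusoid over 1024 samples plus mild noise.
rng = np.random.default_rng(0)
n = np.arange(1024)
segment = np.sin(2 * np.pi * 50 * n / 1024) + 0.1 * rng.standard_normal(1024)
time_view, freq_view, tf_view = build_modalities(segment)
print(time_view.shape, freq_view.shape, tf_view.shape)  # (1024,) (513,) (4, 1024)
```

In a pipeline like the one described, each view would feed its own CNN–LSTM branch before attention-based fusion; those stages are omitted here because the abstract does not specify their configuration.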
