Abstract
Bearings are critical components in rotating machinery, and vibration-based fault diagnosis plays an important role in monitoring their operating condition. To improve the accuracy and stability of bearing fault diagnosis at the signal representation level, this work proposes a multimodal diagnostic method that fuses visual representations with complementary temporal features extracted from bearing vibration signals. Specifically, a Mamba-based state space model is employed to capture long-term temporal dependencies in vibration sequences, enabling improved modeling of slowly varying, long-range dynamic patterns. Meanwhile, an improved Gramian Angular Field (GAF) is introduced to map one-dimensional time-series signals into two-dimensional images, and EfficientNet is adopted as the visual feature extraction backbone. In addition, an LSTM module combined with a self-attention mechanism is integrated to model short-term temporal dynamics and to facilitate effective interaction between temporal and visual representations. Furthermore, the IVY optimization algorithm is used to automatically tune key hyperparameters and enhance training stability. By jointly modeling long-term temporal features, short-term temporal dynamics, and image-based representations, the proposed method forms a collaborative and complementary feature representation of bearing vibration signals. Experimental results indicate that the proposed approach provides consistent performance improvements and favorable generalization: ablation studies show that accuracy increased from 93.31% to 99.62% as key modules were progressively incorporated, while comparative experiments on two public bearing datasets achieved F1 scores of 98.75% and 98.27%, demonstrating competitive performance relative to existing image-only and time-series-only baselines.
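The abstract's improved GAF variant is not detailed here, but the underlying transform it builds on is standard. As a rough illustration only, the following sketch shows the plain Gramian Angular Summation Field (GASF): a signal is rescaled to [-1, 1], encoded as angles via arccos, and expanded into a 2D image of pairwise angle sums. The function name `gasf` and the toy sine signal are illustrative, not from the paper.

```python
import numpy as np

def gasf(x):
    """Map a 1D signal to a 2D Gramian Angular Summation Field image."""
    x = np.asarray(x, dtype=float)
    # Min-max rescale to [-1, 1] so arccos is well defined
    x = 2.0 * (x - x.min()) / (x.max() - x.min()) - 1.0
    phi = np.arccos(np.clip(x, -1.0, 1.0))  # polar-angle encoding
    # GASF(i, j) = cos(phi_i + phi_j), a symmetric n x n image
    return np.cos(phi[:, None] + phi[None, :])

sig = np.sin(np.linspace(0, 4 * np.pi, 64))  # toy vibration-like signal
img = gasf(sig)
print(img.shape)  # (64, 64)
```

The resulting image can then be fed to a 2D CNN backbone such as EfficientNet, as the abstract describes.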