Residual useful life prediction of rolling bearings based on improved Informer modeling

Abstract

Accurate prediction of rolling bearings’ Remaining Useful Life (RUL) is critical for ensuring machinery reliability and safety. While deep learning offers considerable potential, prevailing prognostics models face significant challenges: they often overlook critical inter-sensor correlations, exhibit instability in long-term predictions, and demand extensive training data. These limitations severely hinder their efficacy in data-scarce or informationally redundant scenarios. To overcome these issues, this paper introduces a novel hybrid architecture that synergistically integrates Convolutional Neural Networks (CNNs) with the Informer model. The proposed framework is engineered to autonomously extract and fuse salient nonlinear spatiotemporal features from multi-sensor data streams. Raw sensor signals are first segmented via a sliding window approach to preserve degradation characteristics. Subsequently, stacked convolutional layers hierarchically learn high-level representations, effectively capturing both intra- and inter-sensor dependencies. These enriched features are then processed by the Informer module for efficient time-series encoding and long-term dependency modeling, ultimately yielding a precise RUL estimate through a fully-connected layer. Extensive experimental results on rolling-element bearing datasets demonstrate the superiority of our method. It achieves state-of-the-art prediction accuracy and markedly superior stability over time, even when trained with significantly reduced dataset sizes, confirming its robustness and practical utility.

Keywords

rolling bearings; remaining useful life; non-linear feature fusion; Informer modeling

Introduction

Rolling bearings are called the “joints of industry,”¹ and the researcher resembles a “doctor” who diagnoses the current state of the bearing from sensor data. According to statistics, about 40%–45% of annual mechanical failures are due to bearing damage,² so accurately predicting when a failure will occur is important for avoiding safety accidents and improving equipment safety.

In recent years, scholars at home and abroad have focused on rolling bearing life prediction and proposed model-based, data-driven, and digital-twin-based methods.^3,4 Data-driven models can abstract complex problems and express data relationships concisely, which improves data comprehensibility; they also keep data consistent across scenarios and reduce redundancy and conflict. Therefore, this paper explores data-driven approaches that enable more efficient and accurate data processing and analysis. In addition, recent studies have extended machine learning and neural network–based approaches to fatigue life and reliability prediction problems in various engineering domains, including gearbox reliability assessment, defect-driven fatigue life modeling, and physics-informed neural network–based life prediction frameworks. These works further demonstrate the effectiveness and flexibility of data-driven and hybrid learning approaches for complex life prediction tasks.^5–7 To address the inability of the multi-layer perceptron (MLP) to learn salient features automatically, together with the persistent problems that vibration signals are inevitably contaminated by noise and that extracted features contain redundant or irrelevant information, Peng et al.⁸ proposed a hybrid feature extraction method based on Adaptive Sparse Narrowband Decomposition (ASND) and Locality Preserving Projection (LPP). This approach was then integrated with a Least Squares Support Vector Machine (LS-SVM) for Remaining Useful Life (RUL) prediction, significantly improving prognostic accuracy in experimental validation. To address RUL prediction for aero-engine rolling bearings, Liu et al.⁹ proposed a data-driven prognostic method combining deep learning with particle filtering. 
This approach demonstrates superior prediction accuracy and enhanced stability, and is less susceptible than conventional techniques to variations in particle number or resampling method. More importantly, it better captures the evolutionary trend of rolling bearing degradation. To overcome interference from noise and other disruptive signals, Wang et al.¹⁰ proposed an RUL prediction method for rolling bearings based on an improved empirical wavelet transform (IEWT) and a one-dimensional convolutional neural network (1D-CNN). The proposed model achieves higher prediction accuracy than existing approaches, as evidenced by reduced mean absolute error (MAE) and root mean square error (RMSE). Wang et al.¹¹ considered the problems of vanishing and exploding gradients in long-term time series prediction and established a Long Short-Term Memory (LSTM) network model to predict the time series of rolling bearing signals. Wang et al.¹² addressed the problem that the model could not effectively distinguish data from different sensors and proposed a multiscale learning strategy that automatically learns representations at different time scales for regression analysis and RUL estimation. Although deep learning has achieved good results in bearing RUL prediction, existing prediction methods have the following limitations. (1) In representation learning, the correlations between sensors are not adequately considered. The data from a sensor at the current moment is inevitably influenced by the data from other sensors at previous moments; if the representation learning model ignores this spatio-temporal correlation, degradation information is lost. (2) During training, a large amount of data is required. 
The amount of training data directly affects the prediction accuracy, yet acquiring a substantial amount of data can be highly challenging under certain specialized operating scenarios. Therefore, how to train models with limited training data becomes a pressing issue. (3) Current bearing life prediction methods cannot achieve stable long-term prediction. The main reasons include complex and variable influencing factors, environmental and usage conditions that are difficult to measure accurately, and issues with data acquisition and processing. Consequently, to enhance the accuracy and stability of bearing life prediction, it is necessary to optimize data processing and analysis methods in order to develop more advanced and stable prediction models.

To address these limitations, this paper improves the Informer network for RUL prediction of bearings. The feature fusion extraction module (abbreviated as FFE) enables monitoring data from different sensors to be used directly as input to the prediction network. By segmenting the data into equal-length pieces, it captures subtle changes in bearing degradation more finely and thus preserves more nonlinear degradation information. The FFE module also fuses multi-sensor signals to enhance degradation information. Meanwhile, time encoding is injected on top of the convolution operation, and multiple high-level features are combined in parallel to generate the final result. This process makes comprehensive use of the available feature information, improves the model’s ability to identify bearing degradation states, and achieves high-precision RUL prediction for rolling bearings even with a limited amount of data.

Organization of the model

The model consists of two parts, the Feature Fusion Extraction (FFE) module and the Informer prediction network (together abbreviated as FFE-Informer), as shown in Figure 1. In RUL prediction, deep learning faces problems such as under-utilization of degradation information across sensors, a high demand for training data, and an inability to achieve long-horizon prediction. To address these issues, two measures are taken. First, a convolutional neural network preprocesses the degradation information from different sensors in order to preserve the rich nonlinear features. Second, the data is segmented into equal-length pieces to enrich the labeled data and maintain high prediction accuracy even with small samples. In this way, the model fuses degradation signals from different sensors and maintains high accuracy with fewer training samples. In the model, the original input signal is cropped, time-domain features are extracted from the cropped signal, and the cropped signal is then subjected to the Fast Fourier Transform to extract frequency-domain features. Finally, these features are passed to the Informer network for training and used for RUL prediction.

Figure 1.

Structure of the model.

Feature fusion extraction module (FFE module)

The Feature Fusion Extraction (FFE) module mainly consists of a Convolutional Neural Network¹³ (CNN), as shown in Figure 2. A CNN is a feed-forward neural network for efficiently processing multidimensional data; it mainly consists of convolutional layers, pooling layers, and fully connected layers.¹⁴ In bearing life research, Wang et al.¹⁰ proposed a method for predicting the remaining useful life (RUL) of rolling bearings based on an Improved Empirical Wavelet Transform (IEWT) and a 1D Convolutional Neural Network (1D-CNN), which was validated by promising experimental results. However, the fusion of degradation information from different sensors is still insufficient. Therefore, this paper proposes a feature fusion approach that fuses degradation information across sensors.

Figure 2.

Feature fusion extraction module (FFE).

Specifically, the life-cycle time series data is $X = [x_{1}, x_{2}, \dots, x_{T}]$ , where $T$ is the maximum life cycle and $x_{t}$ is the data at moment $t$. Because the record measured by the sensor at each sampling is too long to be processed efficiently as a whole, the original data is partitioned into $N$ vectors with the same label, as shown in the following equation:

N = [l / s]

(1)

Where $l$ is the length of the data recorded by the sensor each time and $s$ is the step size. The data is then fed into the CNN. Let $X_{input}^{i - 1} \in R^{1 \times d}$ be the input to the $i$-th layer; its corresponding output $X_{out}^{i}$ is

X_{out}^{i} = F (K * X_{input}^{i - 1} + b)

(2)

Where $F (\cdot)$ is the ReLU activation function,¹⁵ $K$ is the convolution kernel, $b$ is the bias matrix, and * denotes the convolution operation.

The ReLU activation function is added after the convolution in order to increase its nonlinear fitting ability during the training process. The formula is as follows:

x^{ij} = f_{ReLU} (y^{ij}) = \max \{0, y^{ij}\}

(3)

To improve computational efficiency, the data is optimized using maximum pooling, and the feature mapping state of the j-th step in the i-th pooling layer is:

y_{t}^{ij} = pool (X_{t}^{ij}, p, s)

(4)

Where $pool (\cdot)$ is the downsampling function (maximum pooling is used in this paper), $p$ is the pooling size, and $s$ is the step size.
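The convolution, ReLU, and max-pooling steps of equations (2)–(4) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the random kernel values, and the single-channel input are assumptions made for the example, and the convolution is written as cross-correlation, as is common in deep learning libraries.

```python
import numpy as np

def conv1d_valid(x, kernels, bias):
    """Valid-mode 1D convolution of a single-channel signal.

    x: (d,) input, kernels: (n_k, k), bias: (n_k,).
    Returns an array of shape (n_k, d - k + 1).
    """
    k = kernels.shape[1]
    # Each row of `windows` is one length-k slice of x.
    windows = np.lib.stride_tricks.sliding_window_view(x, k)
    return kernels @ windows.T + bias[:, None]

def relu(y):
    """Equation (3): element-wise max{0, y}."""
    return np.maximum(0.0, y)

def max_pool1d(x, p, s):
    """Equation (4) with maximum pooling: window size p, step size s.

    x: (n_k, L). Returns an array of shape (n_k, (L - p) // s + 1).
    """
    windows = np.lib.stride_tricks.sliding_window_view(x, p, axis=1)[:, ::s]
    return windows.max(axis=-1)

# Hypothetical example: one 256-point segment, 4 kernels of length 8,
# pooling size 4 with step 4 (the step value is an assumption).
rng = np.random.default_rng(0)
segment = rng.standard_normal(256)
kernels = rng.standard_normal((4, 8))
bias = np.zeros(4)

features = max_pool1d(relu(conv1d_valid(segment, kernels, bias)), p=4, s=4)
print(features.shape)  # (4, 62)
```

In a trained network the kernel and bias values are learned; the sketch only shows how the tensor shapes evolve through one convolution and pooling stage.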

During the operation of rolling bearings, the bearings undergo damage due to rolling contact fatigue. When the damaged parts come into contact with other parts, impacts are formed. Consequently, the vibration signals of rolling bearings contain a wealth of information.¹⁶ Compared with time-domain information, frequency-domain information makes it easier to separate degradation-related components from white noise. Therefore, the Fourier transform is applied to the data processed by the CNN, using the following transformation formula:

X (K) = \sum_{n = 0}^{N - 1} x (n) e^{- \frac{2 \pi j n K}{N}}

(5)

Where n = 0, 1, …, N−1, $x (n)$ is the discrete signal, and $K$ here is the frequency index (not the convolution kernel of equation (2)). Then the time domain and frequency domain acceleration signals are extracted, and the horizontal signal eigenvalues and vertical signal eigenvalues are fused using equation (12).
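As a quick check of equation (5), the sum can be evaluated directly and compared with a library FFT. The sketch below assumes NumPy; the function name `dft` and the test signal are illustrative only.

```python
import numpy as np

def dft(x):
    """Direct evaluation of equation (5): X(K) = sum_n x(n) exp(-2*pi*j*n*K/N)."""
    x = np.asarray(x, dtype=complex)
    N = x.shape[0]
    n = np.arange(N)
    K = n.reshape(-1, 1)  # one row per frequency index K
    return np.exp(-2j * np.pi * K * n / N) @ x

# Hypothetical test signal: 3 full cycles over 32 samples.
N = 32
signal = np.sin(2 * np.pi * 3 * np.arange(N) / N)

spectrum = dft(signal)
amplitude = np.abs(spectrum)

# The direct sum agrees with NumPy's FFT, which uses the same sign convention.
print(np.allclose(spectrum, np.fft.fft(signal)))
# The dominant bin in the first half of the spectrum is the signal frequency.
print(int(np.argmax(amplitude[: N // 2])))
```

The direct sum costs O(N²) operations, so in practice the FFT is used; the sketch only makes the definition concrete.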

The training samples are passed through the CNN to obtain degradation features. During the training process, the ReLU activation function is added after convolution to increase the nonlinear fitting ability. Written with an explicit layer index $l$, the activation is:

x^{l (i, j)} = f_{ReLU} (y^{l (i, j)}) = \max \{0, y^{l (i, j)}\}

(6)

Where $y^{l (i, j)}$ is the $l$-th layer convolution output and $x^{l (i, j)}$ is the $l$-th layer activation output. To make better use of the time information, the timestamp of each label is encoded at one-minute intervals. Time encoding is introduced to fuse the time information into the acceleration signal with the following equation:

\begin{matrix} T_{min} = min / 59.0 - 0.5 \\ T_{hour} = hour / 23.0 - 0.5 \end{matrix}

(7)

Where $T_{min}$ and $T_{hour}$ are the encoded minute and hour information, so each value of $T$ lies in [−0.5, 0.5]. The flow is shown in Figure 2.
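Equation (7) maps the minute and hour of each timestamp linearly onto [−0.5, 0.5]. A minimal sketch, assuming the timestamps are available as Python `datetime` objects (the function name `encode_time` is hypothetical):

```python
from datetime import datetime

def encode_time(ts):
    """Equation (7): scale minute (0..59) and hour (0..23) to [-0.5, 0.5]."""
    t_min = ts.minute / 59.0 - 0.5
    t_hour = ts.hour / 23.0 - 0.5
    return t_min, t_hour

# Boundary cases: the first and last minute of a day.
print(encode_time(datetime(2020, 1, 1, 0, 0)))    # (-0.5, -0.5)
print(encode_time(datetime(2020, 1, 1, 23, 59)))  # (0.5, 0.5)
```

The two encoded values can then be appended to the feature vector of the corresponding sample.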

Informer module

The Informer model is a lightweight model improved from the Transformer.¹⁷ In data-driven rolling bearing RUL prediction, Long Short-Term Memory (LSTM),¹¹ the Transformer model, and others are the mainstream models for prediction. However, the aforementioned models are limited to predicting shorter time series, which can significantly constrain the computational efficiency and prediction accuracy when dealing with long series prediction.

The Informer model¹⁸ consists mainly of an Encoder and a Decoder. The Encoder receives long sequence inputs and replaces canonical self-attention with ProbSparse self-attention, which improves computational efficiency by computing attention only for the most important queries. In canonical self-attention, the query, key, and value matrices are denoted by Q, K, and V, respectively. For a more explicit representation, let $q_{i}$, $k_{i}$, and $v_{i}$ denote the i-th rows of Q, K, and V. The attention of the i-th query can then be defined as:

A (q_{i}, K, V) = \sum_{j} \frac{k (q_{i}, k_{j})}{\sum_{l} k (q_{i}, k_{l})} v_{j} = E_{p (k_{j} | q_{i})} [v_{j}]

(8)

Where $p (k_{j} | q_{i}) = k (q_{i}, k_{j}) / \sum_{l} k (q_{i}, k_{l})$ and $k (q_{i}, k_{j})$ denotes the asymmetric exponential kernel $\exp (q_{i} k_{j}^{T} / \sqrt{d})$ .

Next, the query sparsity criterion is defined through the Kullback-Leibler (KL) divergence between the uniform distribution $q (k_{j} | q_{i}) = 1 / L_{K}$ and the attention distribution $p (k_{j} | q_{i})$:

KL (q | | p) = \ln \sum_{l = 1}^{L_{K}} e^{\frac{q_{i} k_{l}^{T}}{\sqrt{d}}} - \frac{1}{L_{K}} \sum_{j = 1}^{L_{K}} \frac{q_{i} k_{j}^{T}}{\sqrt{d}} - \ln L_{K}

(9)

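Dropping the constant term and, as in the original Informer, approximating the Log-Sum-Exp term by its maximum, the i-th query sparsity measure is defined as:

M (q_{i}, K) = \max_{j} \left\{ \frac{q_{i} k_{j}^{T}}{\sqrt{d}} \right\} - \frac{1}{L_{K}} \sum_{j = 1}^{L_{K}} \frac{q_{i} k_{j}^{T}}{\sqrt{d}}

(10)

Where the first term is the maximum scaled dot product over all keys, which stands in for the Log-Sum-Exp term of equation (9), and the second term is the arithmetic mean over all keys.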
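The sparsity measure of equation (10) can be illustrated with a few lines of NumPy. This is a simplified sketch under stated assumptions: it evaluates the measure against all keys, whereas the Informer implementation samples a subset of keys for efficiency, and the names `sparsity_measure` and `top_u_queries` are hypothetical.

```python
import numpy as np

def sparsity_measure(Q, K):
    """Equation (10): max minus mean of the scaled dot products, per query.

    Q: (L_Q, d) queries, K: (L_K, d) keys. Returns an array of shape (L_Q,).
    """
    d = Q.shape[1]
    scores = Q @ K.T / np.sqrt(d)  # (L_Q, L_K)
    return scores.max(axis=1) - scores.mean(axis=1)

def top_u_queries(Q, K, u):
    """Indices of the u queries with the largest sparsity measure."""
    M = sparsity_measure(Q, K)
    return np.argsort(-M)[:u]

# Toy example: query 0 scores every key equally (uninformative),
# query 1 strongly prefers the first key.
Q = np.array([[0.0, 0.0], [10.0, 0.0]])
K = np.array([[1.0, 0.0], [-1.0, 0.0]])

print(sparsity_measure(Q, K))
print(top_u_queries(Q, K, u=1))
```

A query whose attention distribution is close to uniform receives a measure near zero and can be skipped, which is the source of the efficiency gain of ProbSparse self-attention.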

Importance of the sparse self-attention mechanism: the sparse multi-head self-attention mechanism has been used in image recognition, where it aims to improve foreground-background discrimination and reduce edge-region ambiguity.¹⁹

The encoder aims to extract robust long-range dependencies from long sequence inputs. Let the t-th sequence input be $X^{t} \in R^{l_{x} \times d}$ ; its specific structure is shown in Figure 3. The extraction proceeds from the j-th layer to the (j + 1)-th layer with the following expression:

X_{j + 1}^{t} = \mathrm{MaxPool} \left( \mathrm{ReLU} \left( \mathrm{Conv1d} \left( {[X_{j}^{t}]}_{AB} \right) \right) \right)

(11)

Where AB denotes the attention block, which contains the multi-head ProbSparse self-attention and the basic operations.

Figure 3.

Individual encoder operation.

The remaining useful life prediction framework of the total model is shown in Figure 1. The Informer network has the ability to process the state of the monitored data for a longer period of time and perform long time series prediction. Its main functions include focusing on processing feature information, mining temporal correlations in hidden data, and updating parameters automatically to improve prediction accuracy.

It should be clarified that this study adopts the original ProbSparse attention mechanism and its threshold setting from the Informer model without modification. The improvement proposed in this work lies in the introduction of the FFE feature fusion extraction module and the task-oriented RUL prediction framework, rather than changes to the internal attention threshold function.

Analysis of feature extraction and fusion

Typically, statistical features such as the mean, variance, and standard deviation are extracted from the time and frequency domains and used as inputs in time series data processing. To clarify the motivation for introducing statistical features, it should be noted that bearing vibration signals are typically non-stationary and contain significant noise and operating-condition interference. In raw waveform form, degradation-related changes are not always sufficiently stable or prominent, especially in early degradation stages. Statistical features computed over signal segments can provide more stable and compact representations of signal variation trends.

In this work, statistical features are not used as a replacement for raw vibration signals, but as complementary degradation descriptors. Together with FFT-based frequency-domain features and multi-direction vibration inputs, they are fused in the FFE module to form a more informative feature representation. This fused representation helps the prediction network focus on degradation-related characteristics and improves the stability and effectiveness of RUL prediction.

However, relying on the signal of a single sensor still suffers from insufficient information mining. To solve this problem, we simultaneously utilize the vibration information recorded by multiple sensors, taking the XJTU-SY dataset²⁰ as an example. Since bearing degradation is inevitable, making full use of the degradation characteristics helps to predict the remaining service life of the bearings accurately.

To this end, an FFE module is introduced to enrich the data and make it more representative. The FFE module performs CNN convolutional processing and statistical feature extraction on the raw vibration signals and adds time information at one-minute intervals. Specifically, the CNN slides a window over the data at a specific stride and then enhances the representativeness of each moment’s features by raising and reducing the feature dimension.

The FFE module proceeds as follows: for a single sampling of the vibration signal X, a sliding-window operation is applied to the time-domain and frequency-domain signals, and the data are then transformed by convolution, pooling, and nonlinear activation. At the same time, some statistical features commonly used in rolling bearing RUL prediction are selected as degradation features; the specific information is shown in Table 1, where $x_{i}$ is the i-th sampling value in the single-sampling vibration signal X, $n$ is the number of sampling values, and $\mu$ is their mean.

Table 1.

Characteristics and their formulas.

Feature parameters	Time domain characteristics	Frequency domain characteristics
Standard deviation	$X_{sd} = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(x_{i} - \mu)}^{2}}$	$X_{sd} = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(x_{i} - \mu)}^{2}}$
Skewness	$X_{sk} = \frac{\frac{1}{n} \sum_{i = 1}^{n} {(x_{i} - \mu)}^{3}}{X_{sd}^{3}}$	$X_{sk} = \frac{\frac{1}{n} \sum_{i = 1}^{n} {(x_{i} - \mu)}^{3}}{X_{sd}^{3}}$
Kurtosis	$X_{ku} = \frac{\frac{1}{n} \sum_{i = 1}^{n} {(x_{i} - \mu)}^{4}}{{\left( \frac{1}{n} \sum_{i = 1}^{n} {(x_{i} - \mu)}^{2} \right)}^{2}}$	$X_{ku} = \frac{\frac{1}{n} \sum_{i = 1}^{n} {(x_{i} - \mu)}^{4}}{{\left( \frac{1}{n} \sum_{i = 1}^{n} {(x_{i} - \mu)}^{2} \right)}^{2}}$
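The three statistics of Table 1 ($X_{sd}$, $X_{sk}$, $X_{ku}$) can be computed per segment as sketched below. The sketch assumes NumPy and uses the population (1/n) form given in the table; taking the FFT magnitude as the frequency-domain input is an assumption for illustration, and the function names are hypothetical.

```python
import numpy as np

def stat_features(x):
    """Return (X_sd, X_sk, X_ku) of a 1D array using the 1/n definitions.

    A constant input has zero standard deviation, so X_sk and X_ku are
    undefined for it (division by zero).
    """
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    centered = x - mu
    sd = np.sqrt(np.mean(centered ** 2))
    sk = np.mean(centered ** 3) / sd ** 3
    ku = np.mean(centered ** 4) / np.mean(centered ** 2) ** 2
    return sd, sk, ku

def time_and_frequency_features(segment):
    """Concatenate the time-domain and frequency-domain statistics."""
    spectrum = np.abs(np.fft.rfft(segment))  # one-sided amplitude spectrum
    return stat_features(segment) + stat_features(spectrum)

# Worked example on a tiny symmetric sequence:
# mean 3, variance 2, third moment 0, fourth moment 6.8.
sd, sk, ku = stat_features([1, 2, 3, 4, 5])
print(sd, sk, ku)  # sqrt(2), 0.0, 1.7

rng = np.random.default_rng(0)
print(len(time_and_frequency_features(rng.standard_normal(256))))  # 6
```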

The XJTU-SY dataset²⁰ contains the acceleration signals collected in the horizontal and vertical directions during the operation of rolling bearings. In studies of remaining bearing life, the horizontal acceleration signal is usually the main focus. However, the vertical acceleration signal also contains a large amount of bearing life information. Therefore, in this paper, the horizontal and vertical acceleration signals are combined: the horizontal signal serves as the main signal, while the vertical signal serves as the auxiliary. The fusion can be expressed as:

F = \alpha H + (1 - \alpha) V

(12)

Where $F$ is the fusion feature, $H$ is the horizontal acceleration signal after FFE processing, $V$ is the vertical acceleration signal after FFE processing, and $\alpha$ is the fusion weight, whose value range is [0.5, 1]. Experimentally, it is found that adjusting the value of $\alpha$ can not only reduce the value of the loss function but also enable more accurate RUL predictions under different working conditions. The details are shown in Figure 4(c).
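Equation (12) is a convex combination of the two processed signals. A minimal sketch, assuming NumPy arrays of equal shape for H and V (the function name `fuse` and the sample values are illustrative):

```python
import numpy as np

def fuse(H, V, alpha):
    """Equation (12): F = alpha * H + (1 - alpha) * V, with alpha in [0.5, 1]."""
    if not 0.5 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0.5, 1] so that H stays the main signal")
    H = np.asarray(H, dtype=float)
    V = np.asarray(V, dtype=float)
    return alpha * H + (1.0 - alpha) * V

# Hypothetical feature vectors for the horizontal and vertical channels.
H = np.array([2.0, 4.0])
V = np.array([0.0, 0.0])

print(fuse(H, V, alpha=0.75))  # [1.5 3. ]
print(fuse(H, V, alpha=1.0))   # [2. 4.], the vertical channel is ignored
```

Restricting the weight to [0.5, 1] keeps the horizontal channel dominant; a suitable value would be chosen per working condition, for example by comparing the validation loss over a small grid of candidate weights.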

Figure 4.

The value of α: (a) Bearing1_1, (b) Bearing2_3, and (c) Bearing3_2.

The experimental results indicate that adjusting the fusion ratio of data from different sensors improves the accuracy of the prediction results, which enhances the robustness of the model and supports the effectiveness of the fusion step within the overall network.

Before model training, the raw vibration signals are first segmented into fixed-length samples for feature construction and network input. For each signal segment, time-domain statistical features are computed, and FFT is applied to obtain frequency-domain features. These features, together with multi-direction vibration signal inputs, are used to form the fused feature representation in the FFE module.

To improve training efficiency and increase the number of labeled samples, the original long signal sequences are divided into multiple equal-length segments. Segments originating from the same time interval are assigned the same RUL label, so that each constructed sample corresponds to a supervised RUL target value.
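The segmentation and labeling step can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it uses NumPy, non-overlapping windows by default (step equal to the window size), and a hypothetical function name. For one XJTU-SY record of 25.6 kHz × 1.28 s = 32768 points and a window of 256, this yields 128 segments that share one RUL label.

```python
import numpy as np

def segment_with_labels(signal, rul, window, step=None):
    """Split one record into equal-length segments that share the same RUL label.

    signal: 1D array from a single sampling; rul: RUL label of that sampling.
    Returns (segments, labels) with shapes (n, window) and (n,).
    """
    signal = np.asarray(signal)
    step = window if step is None else step
    n = (len(signal) - window) // step + 1
    if n <= 0:
        raise ValueError("signal is shorter than the window")
    segments = np.stack([signal[i * step : i * step + window] for i in range(n)])
    labels = np.full(n, rul, dtype=float)
    return segments, labels

# One hypothetical record: 32768 points with a normalized RUL label of 0.8.
record = np.arange(32768, dtype=float)
segments, labels = segment_with_labels(record, rul=0.8, window=256)

print(segments.shape)  # (128, 256)
print(labels[:3])      # [0.8 0.8 0.8]
```

Choosing a step smaller than the window would produce overlapping segments and more samples, at the cost of stronger correlation between neighboring samples.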

The proposed framework performs RUL regression rather than multi-step time-series forecasting. After feature fusion and sequence modeling by the Informer network, the output is mapped through a fully connected layer to produce a single RUL prediction value for each input sample.

It should be noted that assigning a static RUL label to segmented samples is an approximation, whose validity depends on the assumption that the segment window is sufficiently short such that the RUL variation within a segment is limited. When degradation evolves rapidly or the segment window is relatively long, a potential label drift effect may occur. In this work, multi-source feature fusion in the FFE module enhances degradation-sensitive representations and can mitigate the impact of this approximation to some extent; however, more refined label modeling to explicitly address label drift is beyond the scope of the present study and will be explored in future work.

Experimental validation

Extensive experiments are conducted on the XJTU-SY dataset²⁰ and the PHM2012 dataset.²¹ Each dataset comprises three distinct working conditions. Four models (Informer, Informer Stack, FFE-Informer, and FFE-Informer Stack) are then compared in terms of their predictions on the full-lifespan data across these three working conditions.

Introduction to the bearing dataset

XJTU-SY bearing dataset as the basis for analysis

The XJTU-SY bearing dataset is openly released for scholars around the world. It has a sampling frequency of 25.6 kHz, a sampling interval of 1 min, and a sampling duration of 1.28 s. It records the complete life cycle data and can be used for model-based and data-driven residual useful life prediction. The experimental platform is shown in Figure 5.

Figure 5.

Testbed of rolling element bearings.

The dataset provides real experimental data that characterize degradation over the entire life cycle. Taking Bearing1-1 and Bearing2-2 as examples, their horizontal and vertical raw acceleration signals are depicted in Figure 6. The 123 raw samples of Bearing1-1 are processed by the CNN, and their feature data are then extracted. The feature map is plotted as a 3D map, as shown in Figure 8.

Figure 6.

Horizontal and vertical signal vibration diagram: (a) Bearing1_1 horizontal signal, (b) Bearing1_1 vertical signal, (c) Bearing2_2 horizontal signal, and (d) Bearing2_2 vertical signal.

PHM2012 bearing data set

The PRONOSTIA platform can address issues that include testing, validation, bearing health assessment, diagnosis, and prognosis. The primary aim of the experiment is to validate real-world data pertaining to accelerated bearing degradation. The platform is shown in Figure 7. A radial force equal to the maximum dynamic load of the bearing of 4 kN is applied to the bearing under test, and the bearing is subjected to accelerated degradation tests over a period of several hours. The bearing speed was maintained at approximately 1800 revolutions per minute. The sampling frequency was 25.6 kHz. Each sample contained 2560 data points and was repeated every 10 s. When the amplitude of the vibration signal surpasses 20g, it indicates that the bearing has reached the end of its lifespan (Figure 8).²¹

Figure 7.

The PRONOSTIA platform.

Figure 8.

Three-dimensional diagram of features.

It should be noted that for the PHM2012 dataset, the failure time and RUL labels are generated following the commonly used test-bench criterion (vibration amplitude exceeding 20g), in order to maintain a consistent label definition and ensure comparability with prior studies. Using a different threshold would shift the failure time earlier or later, thereby changing the endpoint definition and distribution scale of the RUL labels, which may further affect the training target and generalization behavior of the model. In this work, we follow the standard setting for PHM2012; a systematic sensitivity analysis with respect to different thresholds will be considered in future work.

Prediction metrics

In order to evaluate the prediction performance of FFE-Informer, two prediction metrics are used in this paper: Mean Square Error (MSE) and Mean Absolute Error (MAE).²² MSE evaluates the accuracy of a prediction algorithm given its RUL prediction results. The specific expression is given below:

MSE = \frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}

(13)

Where $n$ indicates the number of samples, $y_{i}$ indicates the true value, and ${\hat{y}}_{i}$ indicates the model’s predicted value. The closer the MSE is to 0, the closer the predicted RUL is to the true RUL. The MSE is sensitive to outliers (when the difference between an abnormal value and the normal value is large, the error tends to exceed 1, and squaring it further magnifies its impact), but it is able to reflect the distribution of the prediction error.

MAE is the average of the absolute values of the errors. MAE is not sensitive to outliers and therefore does not reflect the distribution of the errors. Its formula is:

MAE = \frac{1}{n} \sum_{i = 1}^{n} | y_{i} - {\hat{y}}_{i} |

(14)

Where $n$ indicates the number of samples, $y_{i}$ indicates the true value, and ${\hat{y}}_{i}$ indicates the model’s predicted value. MAE evaluates the degree of deviation between the true value and the fitted value. The closer the MAE is to 0, the better the model fit and the higher the prediction accuracy.
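Both metrics of equations (13) and (14) are one-liners in NumPy. The sketch below uses hypothetical RUL values to show how a single large error affects MSE more strongly than MAE.

```python
import numpy as np

def mse(y_true, y_pred):
    """Equation (13): mean of squared errors."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    """Equation (14): mean of absolute errors."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Hypothetical values: two exact predictions and one outlier with error 3.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 6.0]

print(mse(y_true, y_pred))  # 3.0  (the squared outlier error 9 dominates)
print(mae(y_true, y_pred))  # 1.0
```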

Network parameter configuration for FFE

The hyperparameters that need to be set for the proposed FFE model include the cut length $l$ of the original data, which is then partitioned into n sequential inputs. These inputs are further divided into 70% for the training set, 10% for the validation set, and 20% for the test set. Additionally, the size of the convolution kernel $K$ , the size of the pooling layer $p$ , and the step size $s$ must also be determined. The specific parameters are shown in Table 2.

Table 2.

FFE parameter configuration.

Hyperparameter	Size	Hyperparameter	Size
Kernel size	$8 \times 1$	Number of kernels	4
Pooling size	4	Cut window size	256
Number of convolutional layers	2	Number of pooling layers	2

The time window contains a large amount of degradation information, and the prediction performance is related to the time window size. Bearing1-1, Bearing2-2, and Bearing3-5 in the XJTU-SY dataset are used as examples. The impact of different cutting window sizes on the prediction performance is analyzed first, with sizes set to 64, 128, 256, and 512. The box plots of MSE and MAE for the three tested bearings under these four window sizes are shown in Figure 9(a)–(c), where (a), (b), and (c) correspond to Bearing1-1, Bearing2-2, and Bearing3-5, respectively. These box plots show that the MSE and MAE values become more stable when the time window is enlarged within a certain range, which means that appropriately enlarging the window aggregates the time information of bearing degradation and improves the prediction performance of FFE-Informer. However, when the window becomes too large, the accuracy fluctuates over a wider range, and because larger windows produce higher-dimensional input vectors, more memory and computation time are required. Based on the above analysis, a time window size of 256 is suitable for the RUL prediction of bearings.

Figure 9.

Box plots of MSE and MAE for the three tested bearings under different time window sizes: (a) Bearing1-1, (b) Bearing2-2, and (c) Bearing3-5.

Experimental results

In this section, long time series prediction of FFE-Informer is studied and discussed by performing bearing RUL estimation. Firstly, the effect of cutting window size on prediction accuracy and reduction of training datasets is analyzed. Then, the benefits of Informer are discussed and a comparison between FFE-Informer and other advanced forecasting methods in long time forecasting is presented to illustrate its superiority. In particular, the MSE and MAE for each bearing in the test datasets are calculated from halfway through the life cycle to the end, as the predictions for these check time instances are more reliable and meaningful than earlier ones.²³

Impact of training set share on prediction accuracy

The size of the training data directly affects what the algorithm can learn and how well it learns, as follows: (1) Typically, enlarging the training set improves the accuracy of the algorithm. (2) When the training set is too small, the algorithm cannot learn the underlying degradation pattern reliably; it tends to fit the specific details of the few available samples and generalizes poorly to new data. (3) As the size of the training set increases, the training time increases accordingly. Experiments on several datasets show that the FFE module maintains its accuracy even when the amount of training data is reduced. MAE is used to illustrate this, with a test set of 20%, a validation set of 10%, and a prediction length of 128. The results are presented in Table 3. The XJTU-SY results show that, after incorporating the FFE module, the model achieves an MAE that is comparable to or lower than that of the plain Informer model trained on 70% of the data, even when the training share is reduced by 10 percentage points to 60%. Therefore, the FFE module remains effective when trained with a reduced training set. To make this conclusion more convincing, the model is further validated on the PHM2012 dataset. Validating a model on another dataset is an important practice for assessing its generalization ability and applicability: by testing the model on different datasets, we can verify its performance in various situations and confirm its validity in the real world. 
The training data is shown in Table 4, and it can be seen from the experimental data that it is still valid. Therefore, the model proposed in this paper according to the existence of general applicability. In order to reduce the time, the prediction length in Table 4 is set to 24. Based on the data presented in the table, it can be observed that the prediction accuracy of the FFE network remains steady and maintains a high level even when the amount of data in the training set is reduced.

Table 3.

Effect of XJTU-SY training set ratio on results.

Methods	Informer	FFE-Informer
Metric	Train (70%)	Train (70%)	Train (68%)	Train (66%)	Train (64%)	Train (62%)	Train (60%)
Bearing1-1	1.008	0.882	0.837	0.783	0.916	0.849	0.910
Bearing2-2	0.961	0.651	0.678	0.728	0.734	0.757	0.862
Bearing3-5	1.001	0.665	0.747	1.032	1.141	0.923	0.988
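The comparison in Table 3 can be checked with a short script. The sketch below lists, for each bearing, the training ratios at which the FFE-Informer MAE exceeds the Informer MAE obtained with 70% training data. The numbers are copied from Table 3; the variable and function names are illustrative assumptions.

```python
# MAE of the Informer baseline trained on 70% of the data (Table 3).
baseline_mae = {"Bearing1-1": 1.008, "Bearing2-2": 0.961, "Bearing3-5": 1.001}

# MAE of FFE-Informer at each training ratio (Table 3).
train_ratios = [70, 68, 66, 64, 62, 60]
ffe_mae = {
    "Bearing1-1": [0.882, 0.837, 0.783, 0.916, 0.849, 0.910],
    "Bearing2-2": [0.651, 0.678, 0.728, 0.734, 0.757, 0.862],
    "Bearing3-5": [0.665, 0.747, 1.032, 1.141, 0.923, 0.988],
}


def ratios_above_baseline(bearing):
    """Training ratios at which FFE-Informer is worse than the baseline."""
    return [
        ratio
        for ratio, mae in zip(train_ratios, ffe_mae[bearing])
        if mae > baseline_mae[bearing]
    ]


exceptions = {bearing: ratios_above_baseline(bearing) for bearing in ffe_mae}
print(exceptions)
```

With the Table 3 values, the only exceptions are Bearing3-5 at training ratios of 66% and 64%; in all other cases the FFE-Informer MAE stays below the baseline.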

Table 4.

Effect of PHM2012 training set ratio on the results.

Methods	Informer	FFE-Informer
Metric	Train (70%)	Train (70%)	Train (68%)	Train (66%)	Train (64%)	Train (62%)	Train (60%)
Bearing1-6	1.072	0.921	1.017	0.822	0.829	0.832	0.852
Bearing2-3	0.578	0.332	0.315	0.301	0.318	0.307	0.327
Bearing2-7	2.223	0.978	0.949	0.966	1.000	0.968	1.093

The description of performance as “stable” under reduced training data refers to the overall trend in the reported results and is not a claim that the differences are statistically non-significant. Repeated runs and formal significance tests (e.g. paired t-test or Friedman test) were not conducted in this study, so the wording is deliberately limited to avoid over-interpretation.

It should be noted that the “small-sample” advantage discussed in this study is defined in a relative sense, based on the experimental setting where the training data ratio is reduced to 60%. The results indicate that the proposed FFE-Informer framework maintains stable prediction performance under comparatively limited training data conditions. This robustness is mainly attributed to the input-side feature fusion strategy, which integrates multi-direction vibration signals with statistical and frequency-domain features to provide more degradation-sensitive representations. More extreme data-scarce scenarios and comparisons with transfer learning or meta-learning strategies are beyond the scope of the present work and will be considered in future studies.

FFE analysis of long time series prediction results

This section analyzes the rolling bearing life prediction results on the XJTU-SY and PHM2012 datasets. On the XJTU-SY dataset, adding the FFE module makes the model more stable in long-sequence prediction and improves prediction accuracy. On the PHM2012 dataset, prediction accuracy also improves after the FFE module is added. This indicates that the FFE module plays a key role in long-sequence prediction and benefits the predictive performance of the model.

Comparative analysis on the XJTU-SY dataset and PHM2012 dataset

To demonstrate the advantage of the proposed FFE network in long time-series prediction, we compared FFE-Informer with three other methods applicable to such predictions (Informer, Informer stack, and FFE-Informer stack) in predicting the RUL of the tested bearings under three working conditions. Bearing1-1, Bearing2-2, and Bearing3-5 are selected for analysis, and their results are shown in Table 5. As the prediction length increases, the error generally rises as well. The model is clearly more stable in long time-series prediction once the FFE network is added, which indicates that the FFE network plays a key role in prediction stability over long horizons. Tables 5 and 6 both show this effect of the FFE module. Taking a prediction length of 168 as an example, FFE-Informer reduces the MSE relative to Informer by about 42%, 36%, and 50% for the three bearings, respectively, and the MAE by about 19%, 32%, and 33%.

Taking Bearing1-1 as an example, the predicted value fluctuates slightly above and below the true value. As illustrated in Figure 10, the estimated RUL deviates significantly from the actual RUL in the initial prediction stage, but this deviation gradually diminishes over time. In the initial phase, the bearings are in the break-in period with minimal wear, so capturing degradation characteristics and establishing an accurate correlation between monitoring signals and RUL is challenging. As bearing wear increases, the monitoring signals carry more degradation information, and the proposed FFE-Informer obtains increasingly accurate RUL predictions.
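The relative error reductions at a prediction length of 168 follow from the Table 5 entries as (baseline − improved) / baseline. The sketch below reproduces that arithmetic; the values are copied from the Informer and FFE-Informer columns of Table 5, and the names `percent_reduction` and `table5_h168` are illustrative.

```python
def percent_reduction(baseline, improved):
    """Relative error reduction in percent, measured against the baseline."""
    return 100.0 * (baseline - improved) / baseline


# (MSE, MAE) at prediction length 168, copied from Table 5.
table5_h168 = {
    "Bearing1-1": {"Informer": (1.938, 1.088), "FFE-Informer": (1.108, 0.882)},
    "Bearing2-2": {"Informer": (1.509, 0.961), "FFE-Informer": (0.956, 0.651)},
    "Bearing3-5": {"Informer": (1.585, 1.001), "FFE-Informer": (0.786, 0.665)},
}

reductions = {}
for bearing, row in table5_h168.items():
    base_mse, base_mae = row["Informer"]
    ffe_mse, ffe_mae = row["FFE-Informer"]
    reductions[bearing] = (
        round(percent_reduction(base_mse, ffe_mse), 1),
        round(percent_reduction(base_mae, ffe_mae), 1),
    )
    print(bearing, reductions[bearing])
```

For these inputs the MSE reductions are 42.8%, 36.6%, and 50.4%, and the MAE reductions are 18.9%, 32.3%, and 33.6%.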

Table 5.

Multivariate long-sequence time-series prediction results (XJTU-SY).

Methods		Informer		Informer stack		FFE-Informer stack		FFE-Informer	
Bearing	Prediction length	MSE	MAE	MSE	MAE	MSE	MAE	MSE	MAE
Bearing1-1	24	1.445	0.899	1.165	0.804	0.918	0.694	1.050	0.768
	48	1.647	0.949	1.395	0.999	1.061	0.730	1.136	0.819
	72	1.690	0.998	1.401	0.855	1.185	0.775	1.105	0.768
	168	1.938	1.088	1.688	0.968	1.120	0.767	1.108	0.882
	240	2.199	1.189	1.865	1.044	1.692	0.953	1.563	0.914
Bearing2-2	24	1.058	0.695	0.977	0.659	0.876	0.615	0.829	0.599
	48	1.468	0.902	1.060	0.691	0.938	0.664	0.864	0.609
	72	1.240	0.812	1.044	0.701	0.902	0.641	0.927	0.626
	168	1.509	0.961	1.329	0.854	1.386	0.710	0.956	0.651
	240	1.458	0.923	1.388	0.889	1.083	0.761	1.069	0.745
Bearing3-5	24	0.638	0.618	0.630	0.624	0.576	0.547	0.520	0.456
	48	0.733	0.682	0.694	0.633	0.623	0.594	0.641	0.597
	72	0.925	0.774	0.911	0.758	0.667	0.614	0.730	0.664
	168	1.585	1.001	1.038	0.830	0.837	0.713	0.786	0.665
	240	1.331	0.961	1.138	0.874	0.835	0.728	0.878	0.729

Table 6.

Multivariate long-sequence time-series prediction results (PHM2012).

Methods	Informer	Informer stack	FFE-Informer stack	FFE-Informer
Error	MAE	MAE	MAE	MAE
Bearing1-6	0.805	0.891	0.693	0.789
Bearing2-3	0.333	0.337	0.188	0.189
Bearing2-7	5.782	5.685	4.265	4.506

Figure 10.

Bearing1-1 prediction results: (a) RUL prediction and (b) feature prediction.

It is observed that the prediction error increases as the prediction horizon becomes longer, which is common in long-range sequence prediction tasks and is mainly caused by uncertainty accumulation and error propagation. With extended forecast ranges, indirect degradation cues and operating variability introduce additional uncertainty, leading to gradual error growth. This trend is qualitatively consistent with cumulative error effects similar to those discussed in random-walk–type processes, although bearing degradation is not strictly a random walk. In our framework, FFE-based feature fusion improves degradation representation and helps reduce, but not fully remove, long-horizon error accumulation.
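As a purely illustrative idealization, and not a model fitted in this study, suppose the one-step prediction errors ε_i are zero-mean, mutually uncorrelated, and share a common variance σ² (all three are assumptions). The error accumulated over a horizon of h steps then has a variance that grows linearly in h, so its root-mean-square value grows as the square root of h:

```latex
E_h = \sum_{i=1}^{h} \varepsilon_i, \qquad
\mathbb{E}[E_h] = 0, \qquad
\operatorname{Var}(E_h) = \sum_{i=1}^{h} \operatorname{Var}(\varepsilon_i) = h\,\sigma^{2}, \qquad
\sqrt{\mathbb{E}\!\left[E_h^{2}\right]} = \sigma\sqrt{h}.
```

Real bearing degradation violates these assumptions, since errors are correlated and their variance changes with the degradation stage, so the expression only indicates why error growth with horizon is expected rather than predicting its rate.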

Comparison with representative deep learning models

To further validate the comparative performance of the proposed model on public benchmark datasets, several representative deep learning models for time-series forecasting were selected as baseline methods for comparison, including the Long Short-Term Memory network (LSTM), the Gated Recurrent Unit network (GRU), and the Temporal Convolutional Network (TCN). These models have been widely adopted in remaining useful life (RUL) prediction studies and are considered representative and comparable baselines.

The comparison experiments were conducted on two publicly available bearing degradation datasets, namely the XJTU-SY dataset (Bearing1-1, Bearing2-2, Bearing3-5) and the PHM2012 dataset (Bearing1-6, Bearing2-3, Bearing2-7). To ensure fairness, all models were evaluated under a unified experimental configuration. The training ratio was set to 70%, the input window length was 256, and the prediction horizon was 168, following a multi-step RUL sequence prediction setting. The data preprocessing procedures and sample construction strategies were kept consistent across all models. The evaluation metrics were Mean Absolute Error (MAE) and Mean Squared Error (MSE).
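The multi-step sample construction with an input window of 256 and a prediction horizon of 168 can be sketched as below. This is a simplified assumption about the layout, not the authors' pipeline: the stride of 1, the function name `make_windows`, and the synthetic single-feature series are all hypothetical, and decoder start tokens used by Informer-style models are omitted.

```python
def make_windows(features, rul, input_len=256, horizon=168, stride=1):
    """Pair each input window with the RUL sequence that immediately follows it."""
    if len(features) != len(rul):
        raise ValueError("features and rul must have the same length")
    samples = []
    last_start = len(features) - input_len - horizon
    for start in range(0, last_start + 1, stride):
        x = features[start:start + input_len]
        y = rul[start + input_len:start + input_len + horizon]
        samples.append((x, y))
    return samples


# Synthetic run-to-failure series for illustration only.
n = 500
features = [[float(t)] for t in range(n)]   # one feature per time step
rul = [float(n - 1 - t) for t in range(n)]  # linearly decreasing RUL label
samples = make_windows(features, rul)
print(len(samples), len(samples[0][0]), len(samples[0][1]))
```

Applying the same function with the same arguments to every model is one way to keep sample construction identical across baselines.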

The comparison results are presented in Tables 7 and 8. On the XJTU-SY dataset, the proposed FFE-Informer model demonstrates overall competitive performance compared with mainstream deep sequence models and achieves the best results under several bearing conditions. For example, it obtains the lowest MAE and MSE values on Bearing2-2 and reaches prediction accuracy comparable to the best-performing baseline on Bearing3-5, indicating good stability and adaptability across different degradation patterns.

Table 7.

RUL prediction performance comparison on XJTU-SY datasets.

Method	Bearing1-1		Bearing2-2		Bearing3-5
Method	MAE	MSE	MAE	MSE	MAE	MSE
FFE-Informer	4.041	16.717	2.738	10.091	2.479	6.663
LSTM	4.151	18.205	3.962	16.245	2.479	6.663
GRU	3.172	11.878	3.741	14.745	3.244	11.481
TCN	3.652	14.648	2.992	10.769	2.901	9.001

Table 8.

RUL prediction performance comparison on PHM2012 bearing datasets.

Method	Bearing1-6		Bearing2-3		Bearing2-7
Method	MAE	MSE	MAE	MSE	MAE	MSE
FFE-Informer	1.203	1.688	1.069	1.211	0.905	0.898
LSTM	1.245	1.865	2.035	4.231	2.968	8.885
GRU	1.105	1.531	1.971	4.479	2.034	4.404
TCN	1.531	3.552	1.744	3.189	3.091	9.627

On the PHM2012 dataset, the advantage of the proposed model is more pronounced. For Bearing2-3 and Bearing2-7, FFE-Informer achieves the lowest MAE and MSE among all compared models, with substantially smaller prediction errors than LSTM, GRU, and TCN. For Bearing1-6, it ranks second on both metrics, close behind GRU. Overall, the results indicate that the proposed model exhibits strong generalization ability and competitive performance across different datasets and degradation trajectories.

Depth ablation study of the front-end feature extractor

To further investigate the impact of the front-end feature extraction module depth on remaining useful life (RUL) prediction performance and model generalization ability, this study conducts a structural depth ablation analysis based on the proposed Feature Fusion Extractor (FFE) architecture. Several front-end variants with different convolutional depths are constructed to enable a systematic evaluation. The objective is to examine whether increasing the number of convolutional layers consistently improves prediction accuracy under an unchanged overall forecasting framework, and to analyze the stability of deeper structures under different training sample ratios.

In terms of architectural configuration, three front-end settings are designed for comparison: an Informer baseline model without a front-end convolutional extraction module, an FFE-2 configuration consisting of two convolutional layers and two pooling layers (used as the default setting in this work), and an FFE-4 configuration, which increases the convolutional depth to four layers while keeping the number of pooling layers unchanged. The FFE-4 variant is introduced as an extended structure with increased feature extraction capacity, aiming to isolate and evaluate the effect of convolutional depth on prediction performance.
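To make the two front-end depths concrete, the sketch below tracks the temporal length of a 256-step window through each configuration. Kernel sizes, strides, padding, and the ordering of layers are not stated in the text, so the values used here (convolution kernel 3 with padding 1, pooling kernel 2 with stride 2, pooling after each group of convolutions) are assumptions chosen only for illustration.

```python
def conv1d_out_len(length, kernel=3, stride=1, padding=1):
    """Output length of a 1-D convolution (assumed kernel 3, padding 1)."""
    return (length + 2 * padding - kernel) // stride + 1


def pool1d_out_len(length, kernel=2, stride=2):
    """Output length of a 1-D pooling layer (assumed kernel 2, stride 2)."""
    return (length - kernel) // stride + 1


def front_end_out_len(length, layers):
    """Propagate a sequence length through a list of 'conv' / 'pool' layers."""
    for layer in layers:
        if layer == "conv":
            length = conv1d_out_len(length)
        elif layer == "pool":
            length = pool1d_out_len(length)
        else:
            raise ValueError(f"unknown layer type: {layer}")
    return length


# Hypothetical layer orderings: two pooling layers in both variants.
FFE_2 = ["conv", "pool", "conv", "pool"]
FFE_4 = ["conv", "conv", "pool", "conv", "conv", "pool"]

len_ffe2 = front_end_out_len(256, FFE_2)
len_ffe4 = front_end_out_len(256, FFE_4)
print(len_ffe2, len_ffe4)
```

Under these assumptions both variants hand a sequence of the same length to the Informer, so the comparison differs only in convolutional depth, which matches the stated aim of isolating that factor.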

The ablation experiments were conducted on two publicly available benchmark bearing degradation datasets, namely the XJTU-SY bearing dataset (Bearing1-1, Bearing2-2, Bearing3-5) and the PHM2012 dataset (Bearing1-6, Bearing2-3, Bearing2-7). All compared models were trained and evaluated under a unified experimental configuration to ensure fairness. The input window length was set to 256, and the prediction horizon was set to 168, following a multi-step RUL sequence forecasting setting. Two training ratios, 70% and 60%, were adopted to cover both regular-sample and limited-sample scenarios. The evaluation metrics were MAE and MSE computed in the normalized label space. Performance was consistently evaluated on the latter half of the degradation stage to better reflect realistic RUL prediction conditions. The corresponding results are reported in Tables 9 and 10.

Table 9.

Ablation results of different FFE depths on the XJTU-SY dataset.

Variant	Train ratio (%)	Bearing1-1		Bearing2-2		Bearing3-5
Variant	Train ratio (%)	MAE	MSE	MAE	MSE	MAE	MSE
Informer	70	1.058	1.158	1.622	1.484	0.881	0.813
FFE-2		0.986	1.010	1.402	1.019	0.863	0.782
FFE-4		1.077	1.198	1.870	1.031	0.938	0.918
Informer	60	0.578	0.420	1.010	1.107	0.967	0.923
FFE-2		0.428	0.262	0.950	0.989	0.826	0.827
FFE-4		0.464	0.298	0.997	1.080	0.792	0.924

Table 10.

Ablation results of different FFE depths on the PHM2012 dataset.

Variant	Train ratio (%)	Bearing1-6		Bearing2-3		Bearing2-7
Variant	Train ratio (%)	MAE	MSE	MAE	MSE	MAE	MSE
Informer	70	1.109	1.281	0.733	0.587	0.860	1.043
FFE-2		1.016	1.188	0.702	0.529	0.522	0.573
FFE-4		1.278	1.748	0.750	0.599	0.801	0.691
Informer	60	1.168	1.462	0.777	0.691	0.968	1.034
FFE-2		1.108	1.327	0.696	0.571	0.769	0.690
FFE-4		1.123	1.361	0.611	0.463	0.847	0.815

The experimental results show that introducing a lightweight front-end feature extraction module improves prediction performance. The FFE-2 configuration achieves lower MAE and MSE values than the Informer baseline in every bearing case reported in Tables 9 and 10, indicating that lightweight convolutional feature extraction is beneficial for modeling vibration-based degradation sequences. However, when the convolutional depth is further increased to FFE-4, the prediction errors do not decrease consistently. Instead, performance degrades on multiple bearing sequences and stability is reduced.
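The pairwise comparisons behind this summary can be tallied directly from Tables 9 and 10. The sketch below counts, over all bearings, training ratios, and both metrics, how often one variant has a strictly lower error than another. The values are copied from the two tables; the data layout and the function name `count_wins` are illustrative.

```python
# (MAE, MSE) per bearing, copied from Table 9 (XJTU-SY) and Table 10 (PHM2012).
results = {
    ("XJTU-SY", 70): {
        "Informer": [(1.058, 1.158), (1.622, 1.484), (0.881, 0.813)],
        "FFE-2": [(0.986, 1.010), (1.402, 1.019), (0.863, 0.782)],
        "FFE-4": [(1.077, 1.198), (1.870, 1.031), (0.938, 0.918)],
    },
    ("XJTU-SY", 60): {
        "Informer": [(0.578, 0.420), (1.010, 1.107), (0.967, 0.923)],
        "FFE-2": [(0.428, 0.262), (0.950, 0.989), (0.826, 0.827)],
        "FFE-4": [(0.464, 0.298), (0.997, 1.080), (0.792, 0.924)],
    },
    ("PHM2012", 70): {
        "Informer": [(1.109, 1.281), (0.733, 0.587), (0.860, 1.043)],
        "FFE-2": [(1.016, 1.188), (0.702, 0.529), (0.522, 0.573)],
        "FFE-4": [(1.278, 1.748), (0.750, 0.599), (0.801, 0.691)],
    },
    ("PHM2012", 60): {
        "Informer": [(1.168, 1.462), (0.777, 0.691), (0.968, 1.034)],
        "FFE-2": [(1.108, 1.327), (0.696, 0.571), (0.769, 0.690)],
        "FFE-4": [(1.123, 1.361), (0.611, 0.463), (0.847, 0.815)],
    },
}


def count_wins(challenger, reference):
    """Count metric entries where the challenger's error is strictly lower."""
    wins = 0
    total = 0
    for variants in results.values():
        pairs = zip(variants[challenger], variants[reference])
        for challenger_pair, reference_pair in pairs:
            for c, r in zip(challenger_pair, reference_pair):
                total += 1
                if c < r:
                    wins += 1
    return wins, total


ffe2_vs_informer = count_wins("FFE-2", "Informer")
ffe4_vs_ffe2 = count_wins("FFE-4", "FFE-2")
print(ffe2_vs_informer, ffe4_vs_ffe2)
```

On the reported values, FFE-2 is better than Informer in 24 of 24 entries, while FFE-4 is better than FFE-2 in only 3 of 24.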

A further comparison under different training ratios reveals that when the training proportion decreases to 60%, the performance fluctuation of the FFE-4 configuration becomes more pronounced, whereas FFE-2 maintains relatively stable accuracy. This suggests that deeper front-end convolutional structures are more prone to overfitting under limited data conditions. This trend is consistently observed across both the XJTU-SY and PHM2012 datasets. Overall, the results indicate that for vibration time-series based RUL prediction tasks, a deeper front-end feature extractor does not necessarily lead to better performance. Instead, a moderately deep and lightweight convolutional structure achieves a more robust balance between predictive accuracy and generalization capability. Therefore, the FFE-2 configuration is adopted as the default front-end setting in this work.

Conclusion

In this paper, we present FFE-Informer, an improved bearing RUL prediction framework that combines a CNN with the Informer model. The framework uses degradation information gathered from multiple sensors and employs the FFE module to perform dimensionality reduction and expansion on the collected data. Temporal encoding is then introduced to incorporate temporal information into the high-level features, which improves the accuracy of long time-series prediction. The sparse attention mechanism further strengthens attention to local and edge information. Finally, the learned feature representations are passed to a fully connected layer to estimate the RUL. To validate the FFE-Informer model, we conducted experiments on the XJTU-SY and PHM2012 datasets and compared the results with the original Informer model. The experimental results show that FFE-Informer predicts long future sequences with high accuracy and maintains this accuracy with a relatively small amount of training data, which makes it useful for maintenance decision making.

Footnotes

Handling Editor: Divyam Semwal

ORCID iDs

Xu Bai

Xiaotong Li

Xiaochen Zhang

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was mainly supported by The National Natural Science Foundation of China (No. U23A20631) and also supported by Scientific Research Innovation Capability Support Project for Young Faculty (No. SRICSPYF-BS2025023), The LiaoNing Revitalization Talents Program (No. XLYC2403117), The Joint Fund Project of Liaoning Science and Technology Department (No. 2024-MSLH-393).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability statement

The figures and tables data used to support the findings of this study are included within the article, and the article permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

References

1. Xia, Wang, et al. Intelligent fault diagnosis for bearings of industrial robot joints under varying working conditions based on deep adversarial domain adaptation. IEEE Trans Instrum Meas 2022; 71: 1–13.

2. Lei, Han, Wang, et al. Interpretation of XJTU-SY rolling bearing accelerated life test dataset. J Mech Eng 2019; 55(16): 1–6.

3. Nikolakis, Alexopoulos, Xanthakis, et al. The digital twin implementation for linking the virtual representation of human-based production tasks to their physical counterpart in the factory-floor. Int J Comput Integr Manuf 2019; 32(1): 1–12.

4. Wang, Zhang, et al. Creep-fatigue reliability assessment for high-temperature components fusing on-line monitoring data and physics-of-failure by engineering damage mechanics approach. Int J Fatigue 2023; 169: 107481.

5. Meng, Nie, Yang, et al. Reliability analysis of wind turbine gearboxes: past, progress and future prospects. Int J Struct Integr 2025; 16(1): 4–38.

6. Wang, Zhang, et al. Machine learning-based fatigue life prediction of laser powder bed fusion additively manufactured Hastelloy X via nondestructively detected defects. Int J Struct Integr 2025; 16(1): 104–126.

7. Dang, Tang, et al. A fatigue life prediction framework of laser-directed energy deposition Ti-6Al-4V based on physics-informed neural network. Int J Struct Integr 2025; 16(2): 327–354.

8. Peng, Liu, Cheng, et al. Remaining useful life prediction of rolling bearing using adaptive sparsest narrow-band decomposition and locality preserving projections. Adv Mech Eng 2019; 11(12): 168781401988977.

9. Liu, Chen, Cheng, et al. Convolution neural network based particle filtering for remaining useful life prediction of rolling bearing. Adv Mech Eng 2022; 14(6): 16878132221100631.

10. Wang, Zhao, Ding. RUL prediction of rolling bearings based on improved empirical wavelet transform and convolutional neural network. Adv Mech Eng 2022; 14(6): 16878132221106609.

11. Wang, Liu, Deng, et al. Remaining life prediction method for rolling bearing based on the long short-term memory network. Neural Process Lett 2019; 50: 2437–2454.

12. Wang, Lei, et al. Multiscale convolutional attention network for predicting remaining useful life of machinery. IEEE Trans Ind Electron 2020; 68(8): 7496–7504.

13. Ding, Sun. Remaining useful life estimation in prognostics using deep convolution neural networks. Reliab Eng Syst Saf 2018; 172: 1–11.

14. Shang, Tang, Zhao, et al. A remaining life prediction of rolling element bearings based on a bidirectional gate recurrent unit and convolution neural network. Measurement 2022; 202: 111893.

15. Daubechies, DeVore, Foucart, et al. Nonlinear approximation and (deep) ReLU networks. Constr Approx 2022; 55(1): 127–172.

16. Ren, Sun, Wang, et al. Prediction of bearing remaining useful life with deep convolution neural network. IEEE Access 2018; 6: 13041–13049.

17. Giuliari, Hasan, Cristani, et al. Transformer networks for trajectory forecasting. In: 2020 25th international conference on pattern recognition (ICPR), Milan, Italy, 10–15 January 2021, pp.10335–10342. IEEE.

18. Zhou, Zhang, Peng, et al. Informer: beyond efficient transformer for long sequence time-series forecasting. Proc AAAI Conf Artif Intell 2021; 35(12): 11106–11115.

19. Wen, Zhou, Zhang, et al. Transformers in time series: a survey. arXiv preprint arXiv:2202.07125, 2022.

20. Lei, Guo, et al. Machinery health prognostics: a systematic review from data acquisition to RUL prediction. Mech Syst Signal Process 2018; 104: 799–834.

21. Nectoux, Gouriveau, Medjaher, et al. PRONOSTIA: an experimental platform for bearings accelerated degradation tests. In: IEEE international conference on prognostics and health management, Denver, CO, USA, June 2012, pp.1–8. IEEE.

22. Chicco, Warrens, Jurman. The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. PeerJ Comput Sci 2021; 7: e623.

23. Zhu, Chen, Shen. A new data-driven transferable remaining useful life prediction approach for bearing under different working conditions. Mech Syst Signal Process 2020; 139: 106602.