Abstract
The reciprocating piston pump is an important piece of power equipment in coal mine production, so research on its condition monitoring and fault diagnosis is of great significance. Extracting fault information from monitoring data is challenging because of the complex underground environment and severe noise, and existing methods suffer from insensitive feature extraction and low diagnostic accuracy. To address this, a new fault diagnosis method for reciprocating piston pumps based on feature fusion of a convolutional neural network (CNN) and a transformer encoder is proposed. In this method, a multi-scale CNN encoder and a transformer encoder extract local and global features of the signal in parallel, and a multi-scale convolution module improves the diversity of the local features. Before the transformer encoder extracts global features, the monitoring signal is segmented into patches according to the phase of the pump crankshaft, which reduces the influence of data randomness on the global features and improves their interpretability. In addition, a feature fusion module is constructed to realize the interaction and fusion of local and global features and to improve the overall characterization of the device state. The proposed method is applied to a reciprocating piston pump fault diagnosis task. The experimental results show that it achieves a diagnostic accuracy of 99.145% ± 0.1576% under 5-fold cross-validation, which is higher than that of the other methods compared.
Introduction
The reciprocating piston pump provides hydraulic power for the actions of coal mine hydraulic supports and is an important piece of power equipment in the integrated liquid supply system. 1 Once the pump malfunctions, it not only affects the operation of the hydraulic supports but also interrupts the power supply and disrupts coal mining, resulting in serious economic losses. 2 Therefore, real-time condition monitoring and fault diagnosis of the reciprocating piston pump are of great research significance. 3
At present, deep learning models are widely used in many application scenarios. Their end-to-end working mode is simpler and more effective than fault diagnosis based on expert experience, and they have been successfully applied to the intelligent fault diagnosis of mechanical equipment.4–6 However, the coal mining environment is complex, and the varying fluid demand of the hydraulic supports affects the operating condition of the reciprocating piston pump, so its monitoring data contain severe noise. When the hydraulic support performs different actions, the required pressure and flow rate vary. The pump must therefore frequently pressurize and unload with the help of the unloading valve, which produces significant pressure and flow pulsations that affect the vibration signal of the pump valve. How to extract fault information from such noisy monitoring signals to achieve accurate fault diagnosis therefore has important research value and practical significance.
To enhance the feature extraction capability of neural networks and improve the robustness of feature combinations, convolutional neural networks (CNN) have been improved to solve fault diagnosis problems.7–9 Huang et al. 10 proposed a multi-scale cascade CNN that increases feature diversity and enhances the characterization ability of feature combinations for bearing fault diagnosis in both normal and noisy environments. Chao et al. 11 used a multi-scale cascade structure to extract image features and, combining it with residual blocks, proposed a multi-scale cascade midpoint residual CNN for bearing fault diagnosis. Wang et al. 12 converted the original vibration signal into a two-dimensional representation through time-frequency symmetric point graph transformation and built a rolling bearing fault diagnosis model by connecting a CNN and a transformer in series. Huang et al. 13 used parallel one-dimensional convolutions to capture multi-scale information of bearing signals and combined them with an attention mechanism to propose a multi-scale channel attention CNN, which achieved good diagnostic accuracy in a noisy environment. Xu et al. 14 combined a multi-scale CNN with a bidirectional long short-term memory (LSTM) network to propose a bearing fault diagnosis method for wind turbines in complex working and testing environments. Jiang et al. 15 constructed a convolution layer based on the continuous wavelet transform for feature extraction and designed a kernel weight recalibration module that dynamically assigns different weights to different wavelet kernels; the resulting multi-wavelet kernel CNN was applied to gearbox fault diagnosis with good accuracy. Shao et al. 16 proposed an adaptive multi-scale attention CNN for bearing fault diagnosis, based on multi-scale CNN feature extraction and transfer learning with the correlation alignment distance.
Transformer-based models have also been applied to fault diagnosis. Fu et al. 17 used generative adversarial networks (GAN) to supplement fault data samples and then proposed a bearing fault diagnosis model based on a transformer network and an auxiliary classifier generative adversarial network. Dong et al. 18 extracted global and local features of fault signals with a transformer network and built a model for bearing fault diagnosis. Li et al. 19 used cross-attention to capture the information correlation between samples and proposed a self-attention-based transformer model for bearing fault diagnosis. Lei et al. 20 aggregated the local information of a sample with a message-passing mechanism and extracted its global information with an improved transformer; the resulting method, based on a hypergraph embedded coding transformer and adaptive information fusion, was proposed to improve fault diagnosis accuracy under strong noise interference. Zhang et al. 21 took the transformer as the basic block of a conditional generative adversarial network (CGAN) and used the CGAN to generate fault samples, compensating for the shortage of fault samples before establishing a fault diagnosis model. Their validation showed a higher fault detection rate and better fault diagnosis performance. Xiao et al. 22 improved the transformer and proposed a Bayesian variational transformer for rotating machinery fault diagnosis.
The above methods have achieved certain results, but when they are applied to the fault diagnosis of coal mine reciprocating piston pumps, the generalization ability of the intelligent fault diagnosis model needs to be further improved.
(1) The coal mine environment is relatively complex, and the monitoring signal of the reciprocating piston pump contains severe background noise, so the characterization ability of the feature combination must be enhanced. A CNN can learn multi-level features of monitoring signals through network stacking and extracts local features well. However, a convolution kernel has a fixed receptive field, so the extracted local features mainly reflect local changes of the signal. These local changes are easily disturbed by noise, which makes it difficult for local features to distinguish whether a change is caused by a fault or by background noise. In contrast, global features capture the major trends and overall behavior of a signal and are relatively insensitive to noise. When local and global features are combined, their respective advantages can be used to reduce the effect of noise, thereby improving the accuracy and generalization of the fault diagnosis model. Therefore, the research focus of this paper is to extract local and global features of the input signal and integrate them to improve the representation ability of the model's feature combination.
(2) The vibration signal is a random non-stationary signal whose value at each time point is random. Meanwhile, the vibration signal has strong local characteristics, that is, the correlation between adjacent time points is strong. If the vibration signal is divided into patches, each processed as a whole, the transformer model can efficiently extract global features by capturing long-distance dependencies across patches with self-attention while maintaining locality. This segmentation also reduces the number of tokens and thus the computational complexity of the model. Therefore, how to segment the signal into patches so that the global features of the signal are learned well is an urgent problem.
(3) The complementary advantages of different features can enhance the robustness of the model and reduce the possibility of false alarms and missed detections. However, not all features extracted by the CNN encoder and transformer encoder are valid, and different features contribute differently to the fault information. Therefore, fusing the features extracted by the CNN encoder and the transformer encoder is crucial for the robustness of the fault diagnosis model.
In this paper, a new fault diagnosis method based on feature fusion of a CNN and a transformer encoder is proposed to address the above shortcomings. The proposed method is applied to the fault diagnosis task of a reciprocating piston pump. The experimental results show that the proposed method performs better than the popular fault diagnosis models it is compared with.
The main contributions of this article are as follows:
(1) A parallel CNN encoder and transformer encoder are used to extract the local and global features of the signal respectively, to make up for the fact that local features alone cannot fully counteract the noise.
(2) Patch processing of the time-domain vibration signal is carried out in combination with the phase information of the reciprocating piston pump, to reduce the influence of data randomness on the global features learned by the transformer encoder.
(3) A fusion module for local and global features is constructed, and the fusion and optimal selection of these features are realized through multi-level parallel fusion, so as to improve the representation ability of the feature combination and the robustness of the model.
The rest of the paper is organized as follows: Section 2 introduces the experimental test system of the reciprocating piston pump. Section 3 describes the fault diagnosis method based on feature fusion of CNN and transformer encoder. The examples of application and comparative analysis are described in Section 4. Finally, conclusions are given in Section 5.
Reciprocating piston pump test rig and analysis of vibration data characteristics
Reciprocating piston pump test rig
The experimental test system of the reciprocating piston pump is built for research tests and factory tests of the pump. As shown in Figure 1, the experimental test system includes a self-developed BRW630/40 reciprocating piston pump, a drive motor, a data acquisition system, and so on. The data acquisition system collects the operation and status data of the reciprocating piston pump in real time, including flow rate, pressure, speed, vibration, phase, noise, etc. Specifically, five single-axis acceleration sensors of model M603M170 are installed on top of the valves of the BRW630/40 pump to collect the vertical vibration signals of the five groups of pump valves. At the same time, a GSH2000 Hall sensor is installed on the crankcase end cover to assist in identifying the phase of the crankshaft. The BH7000 multi-channel synchronous acquisition system is used to collect the vibration signal of the pump valve, and the whole-period signal is extracted by combining it with the phase signal.

Figure 1. Experimental test system diagram of reciprocating piston pump.
The self-developed BRW630/40 reciprocating piston pump is taken as the research object, and its parameters are shown in Table 1. 3 In addition, by replacing healthy parts of the pump with faulty parts, various fault types of the reciprocating piston pump can be simulated.
Table 1. Parameters of BRW630/40 reciprocating piston pump.
Analysis of vibration data characteristics
The pump valve, which includes the suction valve and the discharge valve, plays an important role in liquid compression and is a key component of the reciprocating piston pump. Pump valve failure is one of the typical failures of the reciprocating piston pump. Its failure rate is relatively high, and deterioration of the failure may lead to the shutdown or scrapping of the pump. Three failure modes are simulated in this experiment: suction valve failure, discharge valve failure, and combined suction and discharge valve failure. Figure 2(a) and (b) show a faulty suction valve spool and a faulty drain valve seat from an actual application scenario, respectively.

Figure 2. Faulty suction valve spool and drain valve seat.
The failure simulation experiment is carried out on the fifth group of pump valves, and the vibration data of pump valves in different states are obtained during the experiment. The sampling frequency of data acquisition is set to 10.24 kHz, and the vibration signal of the fifth pump valve under normal and fault conditions is shown in Figure 3.

Figure 3. Vibration signals of pump valve in different states.
As can be seen from Figure 3, the vibration waveform has nonstationarity and randomness, which affects fault feature extraction. Because the vibration signal has periodic changes in the angular domain, extracting the global features of the signal can reduce the non-stationarity of the signal. At the same time, there are detailed excitation shock signals at specific phases, which are caused by the opening or closing of the valve. Therefore, when the valve failure occurs, the extraction of the local features of the signal can play an important role in fault identification. In conclusion, the fusion of local and global features extracted from vibration signals can effectively improve the representation ability of model feature combination and enhance the generalization ability of fault diagnosis model.
Fault diagnosis method based on feature fusion of CNN and transformer encoder
Model architecture
In order to improve the recognition accuracy of the reciprocating piston pump fault diagnosis model under severe noise monitoring data, a fault diagnosis method based on feature fusion of CNN and transformer encoder is proposed. The model uses a parallel CNN encoder and transformer encoder to extract the local and global features of the signal respectively, and uses the fused local and global features to identify the health state of the reciprocating piston pump.
One 360-degree rotation of the crankshaft means that the reciprocating piston pump has completed one energy conversion, and the corresponding vibration signal is called the whole-period signal. The vibration sensors are single-axis sensors installed at different locations to monitor the status of different components of the pump. Therefore, the input of the fault diagnosis model is the whole-period signal of a single sensor, and the input dimension of the model is [batch_size, 1, L]. The length L of the whole-period signal depends on the crankshaft speed and the data sampling frequency. The data acquisition time of a single period is about 0.13245 s, and L is equal to 1356.
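The relation between the sampling frequency, the revolution period, and L can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function names are illustrative, and truncating to whole samples is an assumption about how the reported length was obtained.

```python
# Sketch: samples per crankshaft revolution from sampling rate and period.
SAMPLING_RATE_HZ = 10_240   # 10.24 kHz, as stated for the data acquisition
PERIOD_S = 0.13245          # approximate duration of one crankshaft revolution


def samples_per_revolution(sampling_rate_hz: float, period_s: float) -> int:
    """Whole samples acquired in one revolution (truncation is an assumption)."""
    return int(sampling_rate_hz * period_s)


def crank_speed_rpm(period_s: float) -> float:
    """Crankshaft speed implied by the revolution period."""
    return 60.0 / period_s


L = samples_per_revolution(SAMPLING_RATE_HZ, PERIOD_S)
print(L)                                     # samples in one whole-period signal
print(round(crank_speed_rpm(PERIOD_S), 1))   # implied crankshaft speed in r/min
```

With the values quoted in the text, the product is about 1356.3 samples, which is consistent with L = 1356.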
As shown in Figure 4, the model consists of three main modules: (1) the CNN encoder, which extracts local features at different levels through stacked network layers; (2) the transformer encoder, which extracts global features at different levels through multi-level coding layers; (3) the feature fusion module, which realizes the interaction and fusion of local and global features. Finally, the outputs of the two corresponding layers are flattened and concatenated to create a feature map containing both local and global information, which is used for device state recognition.

Figure 4. Fault diagnosis model architecture based on feature fusion of CNN and transformer encoder.
Local feature extraction based on CNN encoder
Classical convolutional neural networks use convolution kernels of a single fixed size to extract features and therefore cannot extract signal features at different scales. Since convolution kernels of different sizes have different time-frequency resolutions, multi-scale features can be extracted by using kernels of different sizes, and key features characterizing device health can then be found in different frequency resolution ranges.
The multi-scale convolutional module (MSCM) uses multiple one-dimensional dilated convolutional layers to extract features of the monitoring signal at different scales and then aggregates them. 10 A one-dimensional dilated convolution layer can perceive the input signal over a larger range and obtain richer context information. At the same time, one-dimensional dilated convolution layers with different kernel sizes have different time-frequency feature extraction effects, so the extracted features are more diverse. The MSCM captures features at multiple scales at the same time and can therefore describe the information in the status signal more fully.
In addition, the input signal features are extracted in parallel by the one-dimensional dilated convolution layers with different scales, and the resulting multi-scale features are combined to obtain the final comprehensive feature representation. Finally, the final output dimension of MSCM is the same as the input dimension, which makes MSCM easy to integrate with other neural networks.
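The property that each branch of the MSCM keeps the signal length unchanged can be illustrated with a small plain-Python sketch. It is not the authors' implementation: the function names are hypothetical, the moving-average kernels are placeholders for learned weights, and the kernel sizes (3, 5, 7, 9) are used only as an example set of odd sizes.

```python
from typing import List, Sequence


def same_padding(kernel_size: int, dilation: int = 1) -> int:
    """Zero padding per side that preserves length for stride 1 and an odd kernel."""
    return dilation * (kernel_size - 1) // 2


def dilated_conv1d(x: Sequence[float], kernel: Sequence[float],
                   dilation: int = 1) -> List[float]:
    """Stride-1 dilated convolution (cross-correlation) with 'same' zero padding."""
    pad = same_padding(len(kernel), dilation)
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(w * padded[i + j * dilation] for j, w in enumerate(kernel))
        for i in range(len(x))
    ]


def multi_scale_branches(x: Sequence[float],
                         kernel_sizes: Sequence[int] = (3, 5, 7, 9),
                         dilation: int = 1) -> List[List[float]]:
    """Run one branch per kernel size; every branch keeps the input length."""
    branches = []
    for k in kernel_sizes:
        kernel = [1.0 / k] * k  # placeholder weights; a trained model learns these
        branches.append(dilated_conv1d(x, kernel, dilation))
    return branches


signal = [float(i % 7) for i in range(32)]
branches = multi_scale_branches(signal)
print([len(b) for b in branches])  # one length per branch, all equal to len(signal)
```

Because every branch returns a sequence of the same length, the branch outputs can be aggregated channel-wise without cropping or resampling.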
Assuming that the dimension of input feature X of MSCM is Cin × Lin, it can be split into S components [

Figure 5. Schematic diagram of multi-scale convolutional feature extraction.
The convolution kernel size k is taken as [3, 5, 7, 9], and the stride and dilation parameters are both 1. The input of the CNN encoder is consistent with the input of the fault diagnosis model, both of which have dimensions of [batch_size, 1, L]. Especially, the number of output channels of the first CNN encoder layer, denoted as C_L, and
Global feature extraction based on transformer encoder
The transmission shaft of the reciprocating piston pump is generally driven by a three-phase explosion-proof motor. The input shaft drives the crankshaft through a first-stage reduction gear, and the connecting rod and slider then drive the plunger in a reciprocating motion. When the plunger moves away from the pump head, the pressure in the plunger cavity decreases, and the suction valve opens to draw in the emulsion. When the plunger moves toward the pump head, the pressure in the plunger cavity increases, and the drain valve opens to discharge the emulsion. Therefore, for non-stationary and random vibration signals, the working stage of the reciprocating piston pump can be identified from the time point or the crankshaft phase. At the same time, the time points of the monitoring signal correspond to the periodic change of the crankshaft phase.
The vibration amplitude corresponding to the specific time point is random, and taking the vibration signal data point as the minimum analysis unit is not conducive to accurately capturing the periodic and trend features of the signal. In combination with the phase change of the reciprocating piston pump, the minimum unit of signal analysis is determined by the phase angle, which is more beneficial to obtain the global features of the signal trend. By analyzing the signal fragment corresponding to the phase angle, the anomaly of the signal in the specific phase can be captured, and the interpretability of fault diagnosis can be improved.
Each signal sample is a whole-period sample. As shown in Figure 6, the vibration signal is cut into patches at a fixed phase-angle interval. In other words, the number of signal patches PN and the phase angle α of each patch are related by PN = 360/α. For example, when the interval phase angle is 1 degree, the signal sample is divided into 360 patches; when it is 2 degrees, the signal sample is divided into 180 patches; and so on.
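The relation PN = 360/α can be turned into sample ranges with a short sketch. It assumes that the samples of one revolution are evenly spaced in crank angle, which is an approximation, and the function names are hypothetical. The ranges it produces differ in length by at most one sample, whereas the model uses a fixed patch length PL; how the patches are equalized is not covered by this sketch.

```python
from typing import List, Tuple


def patch_count(alpha_deg: int) -> int:
    """Number of patches PN for a phase interval alpha (degrees): PN = 360 / alpha."""
    if alpha_deg <= 0 or 360 % alpha_deg != 0:
        raise ValueError("alpha_deg must be a positive divisor of 360")
    return 360 // alpha_deg


def patch_bounds(num_samples: int, alpha_deg: int) -> List[Tuple[int, int]]:
    """Half-open sample ranges [start, end), one per phase window of width alpha."""
    pn = patch_count(alpha_deg)
    # Assumption: sample index maps linearly to crank angle within one revolution.
    edges = [round(p * num_samples / pn) for p in range(pn + 1)]
    return list(zip(edges[:-1], edges[1:]))


bounds = patch_bounds(1356, alpha_deg=1)
print(len(bounds))                          # number of patches for a 1-degree interval
print(sorted({e - s for s, e in bounds}))   # distinct patch lengths that occur
```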

Figure 6. Patch cutting of vibration signal based on fixed phase angle.
Assume that the monitoring signal
A convolution with a kernel size of 1 is used to increase the number of channels of the input sample, so that its dimension changes from [batch_size, 1, L] to [batch_size, C_G, L]. After patch cutting, the dimension changes from [batch_size, C_G, L] to [batch_size, C_G, PN, PL]. When PN is equal to 360, PL is equal to 4. To prevent the filtered signal components from being mixed with each other in subsequent processing, it is crucial to maintain channel independence. Therefore, the batch and channel dimensions are merged, and the input dimension changes from [batch_size, C_G, PN, PL] to [batch_size*C_G, PN, PL].
Instead of treating the data at each time point as a token, the transformer encoder treats each signal patch as a token, which significantly reduces the number of tokens. Before the vibration signal is fed to the first transformer encoder layer, each signal patch must be projected and given a positional embedding. For projection, a linear layer can be used for feature mapping, which changes the dimension of each patch from PL to dmodel. For positional embedding, sinusoidal positional encoding can be used. Each transformer encoder layer contains two sub-layers: the self-attention layer and the position-wise feed-forward network (FFN). Each sub-layer is followed by an Add&Norm operation, i.e. a residual connection and layer normalization. The transformer encoder enables the model to learn the global features of the monitored vibration signal, so that fault information of the reciprocating piston pump can be better detected from noisy vibration signals.
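The sequence of dimension changes described above can be written down as a small bookkeeping helper. This is only a sketch of the shape arithmetic, not of the layers themselves; the function name is hypothetical and the value C_G = 16 in the example is an illustrative assumption.

```python
from typing import Dict, Tuple


def shape_trace(batch_size: int, length: int, c_g: int,
                pn: int, pl: int) -> Dict[str, Tuple[int, ...]]:
    """Tensor shapes after each preprocessing step on the transformer branch."""
    return {
        "input": (batch_size, 1, length),
        "after_1x1_conv": (batch_size, c_g, length),        # channel expansion
        "after_patching": (batch_size, c_g, pn, pl),        # PN patches of length PL
        "after_merge": (batch_size * c_g, pn, pl),          # channels kept independent
    }


# Example values: batch size 64, L = 1356, PN = 360, PL = 4 (from the text);
# c_g = 16 is an assumed channel count used only for illustration.
for step, shape in shape_trace(64, 1356, 16, 360, 4).items():
    print(f"{step:>15}: {shape}")
```

Merging the batch and channel axes means each channel of each sample is processed as its own token sequence, so attention never mixes information across channels.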
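For reference, the widely used sinusoidal positional encoding can be sketched in plain Python as follows. The base of 10000 is the conventional choice and is an assumption here, since the text does not state it; the function name and the example sizes (120 tokens, model width 128) are chosen only for illustration.

```python
import math
from typing import List


def sinusoidal_positional_encoding(num_tokens: int, d_model: int) -> List[List[float]]:
    """Return a num_tokens x d_model table of sine/cosine position codes."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even")
    table = []
    for pos in range(num_tokens):
        row = [0.0] * d_model
        for i in range(0, d_model, 2):
            # Even index i uses sine, the following odd index uses cosine.
            angle = pos / (10000 ** (i / d_model))
            row[i] = math.sin(angle)
            row[i + 1] = math.cos(angle)
        table.append(row)
    return table


pe = sinusoidal_positional_encoding(num_tokens=120, d_model=128)
print(len(pe), len(pe[0]))  # table size: tokens x model width
```

Each row would be added to the projected patch at the same position, so the encoder can distinguish patches from different crank-angle windows.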
Feature fusion of local and global features
Because there is no information interaction between the local features extracted by the CNN encoder and the global features extracted by the transformer encoder, a fusion module for local and global features is constructed, inspired by the attention mechanism and the selective kernel network.23–25 Since the CNN and transformer encoders extract signal features in parallel, the fusion of local and global features in this study takes the form of multi-level parallel fusion. As shown in Figure 7, global average pooling is applied to the features from the CNN encoder and the transformer encoder so that local and global features of different dimensions can be concatenated. Two fully connected layers are used to learn the fusion weights of the local and global features and to assign weights jointly to features of different types and channels. On this basis, the fusion and optimal selection of local and global features are realized, which exploits their complementary advantages and improves the representation ability of the feature combination and the robustness of the model. Finally, the outputs of the two corresponding layers are flattened and concatenated to create a feature map containing both local and global information, which is used for device state recognition.

Figure 7. Feature fusion module.
Suppose that in the local feature map
For the transformer encoder layer, the input and output dimensions remain unchanged, with both being represented as [batch_size*C_G, PN, dmodel]. Therefore, the output feature map of the transformer encoder layer must be reshaped to [batch_size, C_G, PN, dmodel] before being fed to the feature fuse module. For global feature map
The global average pooling
The local features and global features are concatenated to obtain the combined feature
Two fully connected layers are used to learn the fusion weights of local features and global features, and weights are uniformly assigned to features of different types and channels, shown as follows:
Finally, the channel weight vector
Adaptively assigning low weights to invalid features and high weights to sensitive features can greatly enhance the representation capability of the combined local and global features and improve the robustness of the model.
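The weighting scheme described in this section (global average pooling, concatenation, two fully connected layers, channel-wise reweighting) can be sketched in plain Python. This is a hedged illustration rather than the authors' module: the ReLU and sigmoid activations, the hidden width, the random weights, and all function names are assumptions, and each global channel is treated as an already flattened list of values.

```python
import math
import random
from typing import List, Tuple

Matrix = List[List[float]]  # one row per channel


def global_average_pool(feature_map: Matrix) -> List[float]:
    """Average every channel over its remaining axis."""
    return [sum(ch) / len(ch) for ch in feature_map]


def dense(x: List[float], weights: Matrix, bias: List[float]) -> List[float]:
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def fusion_weights(local_map: Matrix, global_map: Matrix,
                   w1: Matrix, b1: List[float],
                   w2: Matrix, b2: List[float]) -> List[float]:
    """Channel weights for the concatenated [local, global] descriptor."""
    pooled = global_average_pool(local_map) + global_average_pool(global_map)
    hidden = [max(0.0, h) for h in dense(pooled, w1, b1)]                 # ReLU (assumed)
    return [1.0 / (1.0 + math.exp(-z)) for z in dense(hidden, w2, b2)]   # sigmoid (assumed)


def apply_weights(local_map: Matrix, global_map: Matrix,
                  weights: List[float]) -> Tuple[Matrix, Matrix]:
    """Scale each local and global channel by its learned weight."""
    n_local = len(local_map)
    scaled_local = [[weights[c] * v for v in ch] for c, ch in enumerate(local_map)]
    scaled_global = [[weights[n_local + c] * v for v in ch]
                     for c, ch in enumerate(global_map)]
    return scaled_local, scaled_global


rng = random.Random(0)
local_map = [[rng.uniform(-1, 1) for _ in range(10)] for _ in range(4)]
global_map = [[rng.uniform(-1, 1) for _ in range(6)] for _ in range(4)]
n_in, n_hidden = 8, 2  # 4 local + 4 global channels, reduced hidden width
w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
weights = fusion_weights(local_map, global_map, w1, [0.0] * n_hidden, w2, [0.0] * n_in)
fused_local, fused_global = apply_weights(local_map, global_map, weights)
print(len(weights), len(fused_local), len(fused_global))
```

In a trained model the two dense layers would be learned jointly with the encoders, so that channels carrying little fault information receive weights close to zero.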
Experiment and result analysis
Case study of fault diagnosis
In order to verify the applicability and effectiveness of the proposed method, the valve state data set of the reciprocating piston pump is used for validation. According to the working characteristics of the reciprocating piston pump, vibration data of the pump valve were collected under various stable pressure conditions, including no-load, 10, 20, 30, and 40 MPa, as well as under variable working conditions such as boosting and unloading. These different operating conditions represent different pressure and flow pulsations, which in turn correspond to different noise levels. The robustness of the model was verified using the cross-validation method based on the data collected under these different working conditions. The experimental results show that the proposed method can accurately identify the fault of the reciprocating piston pump, and the identification accuracy is better than other methods.
For the MSCM, five different convolution kernel sizes are set, specifically [3, 5, 7, 9, 11]. Additionally, to ensure that the feature lengths after convolution remain the same, the corresponding dilation rates are set to [1, 1, 1, 1, 1]. The input signal is processed by stacked MSCMs to extract signal features at different levels. The number of convolution kernels for each scale increases layer by layer and is set to 16-32-32, and the window size and stride of the pooling layer are both set to 2. Correspondingly, a one-dimensional convolution is used to change the number of channels of the input tensor of the transformer encoder so that it matches the number of channels of the output tensor of the CNN encoder. The crankshaft phase angle corresponding to each signal patch is set to 3 degrees.
For the linear layer used for the feature mapping of patch signals, the number of neurons dmodel is set to 128. The dimension of the FFN in the transformer encoder is set to 200, the number of transformer encoder layers is set to 1, and the number of heads in multi-head attention is set to 4. For the final fully connected classifier, the numbers of neurons in the fully connected layers are set to 128-128-4. After the model parameters were initialized, the Adam optimizer was used to train the model. The learning rate was set to 5e-4, and the number of samples per training batch was set to 64. The fault diagnosis model is trained and validated using the measured pump valve dataset. For each pump valve state, 10,000 data samples were selected to build the data set, which was divided into training, validation, and test sets at a ratio of 7:2:1. To demonstrate the feature extraction capability of the model, the t-Distributed Stochastic Neighbor Embedding (t-SNE) algorithm 26 was used to reduce the dimension of the features of the last fully connected layer and to visualize them in three dimensions, as shown in Figure 8.

Figure 8. 3D feature visualization based on t-SNE.
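The 7:2:1 division of the 10,000 samples of one pump valve state can be sketched as a shuffled index split. The shuffling, the fixed seed, and the function name are assumptions made for reproducibility of the example; the text only specifies the ratio.

```python
import random
from typing import List, Sequence, Tuple


def split_indices(num_samples: int, ratios: Sequence[int] = (7, 2, 1),
                  seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle sample indices and split them into train / validation / test."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # assumed: random assignment of samples
    total = sum(ratios)
    n_train = num_samples * ratios[0] // total
    n_val = num_samples * ratios[1] // total
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]      # remainder goes to the test set
    return train, val, test


train, val, test = split_indices(10_000)
print(len(train), len(val), len(test))
```

Applying the same split to each of the four states keeps the class proportions equal across the three subsets.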
The generalization ability of the fault diagnosis model is quantitatively measured on the test set of the pump valve data. The recognition results on the test set are presented as a confusion matrix in Figure 9. The figure shows that the fault diagnosis model achieves a recognition accuracy of more than 99% for each state of the reciprocating piston pump. The average accuracy on this test set is 99.75%, which shows that the method can accurately diagnose the faults of the reciprocating piston pump. Under 5-fold cross-validation, the accuracy of the fault diagnosis model is 99.145% ± 0.1576%.

Figure 9. Confusion matrix of recognition results of the test set.
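A result of the form "mean ± standard deviation" over 5 folds can be produced as sketched below. The contiguous, unshuffled folds, the use of the sample standard deviation, and the function names are assumptions; the text does not state which variant was used.

```python
import statistics
from typing import Iterator, List, Sequence, Tuple


def k_fold_indices(num_samples: int, k: int = 5) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, held_out_indices) for k contiguous folds."""
    fold_sizes = [num_samples // k + (1 if i < num_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        held_out = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, num_samples))
        yield train, held_out
        start += size


def summarize(accuracies: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of the per-fold accuracies."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


for fold, (train_idx, held_out_idx) in enumerate(k_fold_indices(10, k=5)):
    print(fold, len(train_idx), len(held_out_idx))
```

In use, the model would be trained once per fold, its accuracy on the held-out indices collected in a list, and that list passed to `summarize`.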
Visualization of multi-scale convolution features
To better understand the differences between convolution features at different scales, the features extracted by the dilated convolutions at different scales are visualized using the trained model. Taking the first-layer features of the CNN encoder as an example, the original signal is compared with the feature maps. Because the randomness of the time-domain vibration signal makes the differences difficult to distinguish, the fast Fourier transform is applied to the original signal and the feature maps. The corresponding spectra are shown in Figure 10.

Figure 10. Spectrum diagram of original signal and the feature maps.
It can be seen from the figure that the spectrum corresponding to the feature maps extracted by convolution of different kernel sizes is significantly different, which is reflected in that the center of gravity of different spectra is focused on different frequency bands. At the same time, based on the same kernel size, the spectra of features extracted by different convolution kernels are similar. The feature spectrogram verifies that the signal features extracted by convolution kernels of different sizes have different resolutions in the frequency domain, which makes it convenient to find the key features of device health in different frequency domain resolution ranges. Therefore, multi-scale convolutional feature extraction can significantly enhance the diversity of features and improve the ability of feature combination representation and robustness.
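The "center of gravity" of a spectrum mentioned above can be quantified as the spectral centroid. The following sketch uses a naive discrete Fourier transform from the standard library and a synthetic two-tone signal; the circular moving average stands in for a learned convolution kernel and is purely an assumption for illustration, as are the function names and tone frequencies.

```python
import cmath
import math
from typing import List, Sequence


def magnitude_spectrum(x: Sequence[float]) -> List[float]:
    """One-sided DFT magnitudes (naive O(N^2) transform, fine for short signals)."""
    n = len(x)
    return [
        abs(sum(v * cmath.exp(-2j * math.pi * k * t / n) for t, v in enumerate(x)))
        for k in range(n // 2 + 1)
    ]


def spectral_centroid(x: Sequence[float], sampling_rate_hz: float) -> float:
    """Magnitude-weighted mean frequency of the one-sided spectrum."""
    mags = magnitude_spectrum(x)
    total = sum(mags)
    if total == 0:
        raise ValueError("signal has no spectral energy")
    freqs = [k * sampling_rate_hz / len(x) for k in range(len(mags))]
    return sum(f * m for f, m in zip(freqs, mags)) / total


def circular_moving_average(x: Sequence[float], k: int) -> List[float]:
    """Length-preserving moving average with wrap-around (stand-in for a kernel)."""
    n, half = len(x), k // 2
    return [sum(x[(i + j) % n] for j in range(-half, half + 1)) / k for i in range(n)]


FS = 10_240  # sampling frequency in Hz, as in the experiment
N = 256      # short synthetic record; 400 Hz and 3000 Hz fall exactly on DFT bins
two_tone = [math.sin(2 * math.pi * 400 * t / FS) + math.sin(2 * math.pi * 3000 * t / FS)
            for t in range(N)]
centroid_raw = spectral_centroid(two_tone, FS)
centroid_smooth = spectral_centroid(circular_moving_average(two_tone, 9), FS)
print(round(centroid_raw), round(centroid_smooth))
```

The smoothed signal has a lower centroid because the averaging kernel attenuates the higher tone more strongly, which mirrors how kernels of different sizes shift the spectral emphasis of their feature maps.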
Analysis of the influence of patch length on global feature characterization
To further analyze the influence of patch length on the global features extracted by the transformer encoder, the model performance under different patch lengths is compared. With the CNN encoder structure kept fixed, the patch length is determined by different phase angles, and the transformer encoder is constructed accordingly to obtain fault diagnosis models for different patch lengths. The crankshaft phase angle is increased from 1 to 4 degrees in steps of 1 degree, and the corresponding patch lengths are [4, 8, 12, 15]. Using the same training parameters, the training loss and recognition accuracy curves for the different patch lengths are shown in Figure 11.

Figure 11. Model training loss and recognition accuracy curves corresponding to different patch lengths.
As can be seen from the figure, when the phase angle corresponding to each patch is 1 degree, the model converges slowly and its recognition accuracy is unsatisfactory. When the phase angle increases to 2 and 3 degrees, both the convergence speed and the recognition accuracy improve. When the phase angle is 3 or 4 degrees, the performance of the model is similar. Global features improve the performance of the model, but they have inherent limitations, so performance cannot be pushed to its maximum by relying on global features alone. Based on the above analysis, and in order to retain local detail when extracting global features, a phase angle of 3 degrees per patch is selected to build the final fault diagnosis model.
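The effect of the phase angle on the token count, and therefore on the size of the self-attention score matrix, can be tabulated with a short sketch. The quadratic count below covers only the score matrix of one attention head for one sequence and is a simplification; the function names are hypothetical.

```python
def tokens_per_revolution(alpha_deg: int) -> int:
    """Number of patch tokens for a phase interval of alpha degrees."""
    if alpha_deg <= 0 or 360 % alpha_deg != 0:
        raise ValueError("alpha_deg must be a positive divisor of 360")
    return 360 // alpha_deg


def attention_score_entries(num_tokens: int) -> int:
    """Entries in one tokens-by-tokens self-attention score matrix."""
    return num_tokens * num_tokens


POINTWISE_TOKENS = 1356  # one token per sample of the whole-period signal

for alpha in (1, 2, 3, 4):
    pn = tokens_per_revolution(alpha)
    ratio = attention_score_entries(POINTWISE_TOKENS) / attention_score_entries(pn)
    print(f"alpha={alpha} deg  tokens={pn}  score entries={attention_score_entries(pn)}"
          f"  reduction vs. pointwise={ratio:.0f}x")
```

Larger phase intervals shrink the attention cost quickly, but each token then spans more of the revolution, which is the trade-off behind choosing 3 degrees.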
Comparative analysis
To verify the superiority of the proposed method, it is compared with other fault diagnosis models along two lines:
Comparison with popular methods. The proposed method (denoted as MSC = PT) was compared with several popular fault diagnosis methods: 1DCNN, transformer, and CNN-LSTM. For the 1DCNN model, the convolution kernel size is set to 3, and the number of convolution layers and channels matches the MSC = PT model. The transformer model uses the same structure as the transformer encoder of the MSC = PT model. CNN-LSTM uses the same parameters as the 1DCNN model, with 2 LSTM layers and a hidden size of 512. The settings of the fully connected layers are consistent with the MSC = PT model. Parameters not specified here keep their default values.
Ablation studies. To verify the effect of the transformer network, the transformer encoder was removed from the MSC = PT model while the feature fusion blocks were retained, giving the model MSCFF. To assess the impact of the MSCM, the MSCM blocks in the MSC = PT model were replaced with simple convolutional blocks (with the same total number of feature maps), giving the model 1DCNN = PT. To assess the effectiveness of the feature fusion module, it was removed from the MSC = PT model and only the last-layer features of the CNN encoder and the transformer encoder were concatenated, giving the model MSC + PT.
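The compared models can be summarized as a small configuration table. The sketch below only restates the description above in code form; the `ModelVariant` class, its field names, and the string labels are hypothetical and are not taken from the authors' implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelVariant:
    name: str
    conv_branch: Optional[str]  # "multi-scale", "single-scale", or None
    transformer_branch: bool    # whether a transformer encoder is present
    fusion: Optional[str]       # "fusion-module", "concat", or None
    note: str = ""


PROPOSED = ModelVariant("MSC = PT", "multi-scale", True, "fusion-module",
                        "proposed model")

BASELINES = [
    ModelVariant("1DCNN", "single-scale", False, None, "kernel size 3"),
    ModelVariant("transformer", None, True, None,
                 "same encoder structure as MSC = PT"),
    ModelVariant("CNN-LSTM", "single-scale", False, None,
                 "2 LSTM layers, hidden size 512"),
]

ABLATIONS = [
    ModelVariant("MSCFF", "multi-scale", False, "fusion-module",
                 "transformer encoder removed"),
    ModelVariant("1DCNN = PT", "single-scale", True, "fusion-module",
                 "MSCM replaced by simple convolutional blocks"),
    ModelVariant("MSC + PT", "multi-scale", True, "concat",
                 "last-layer features concatenated"),
]


def differing_fields(variant, reference):
    """Names of the structural fields in which variant differs from reference."""
    fields = ("conv_branch", "transformer_branch", "fusion")
    return [f for f in fields if getattr(variant, f) != getattr(reference, f)]


for variant in ABLATIONS:
    print(f"{variant.name}: changes {differing_fields(variant, PROPOSED)}")
```

Written this way, each ablation model differs from the proposed model in exactly one structural component, which is what allows each comparison to isolate the contribution of that component.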
Taking the recognition accuracy on the test set as the metric, randomized experiments were carried out using 5-fold cross-validation. The results of the different methods are shown as box plots in Figure 12.
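The evaluation protocol can be sketched as follows. This is a minimal, self-contained illustration of 5-fold cross-validation with a mean and standard deviation summary. The toy one-dimensional data and the threshold classifier are placeholders standing in for the pump data and the diagnosis models, and the use of the sample standard deviation is an assumption, since the text does not state which estimator was used.

```python
import random
import statistics


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


def make_toy_data(n_samples=200, seed=0):
    """Synthetic two-class data (placeholder, not pump measurements)."""
    rng = random.Random(seed)
    return [(rng.gauss(3.0 * (i % 2), 1.0), i % 2) for i in range(n_samples)]


def fit_threshold(train):
    """Placeholder model: threshold halfway between the two class means."""
    mean0 = statistics.mean(x for x, y in train if y == 0)
    mean1 = statistics.mean(x for x, y in train if y == 1)
    return (mean0 + mean1) / 2.0


def accuracy(threshold, test):
    """Fraction of test samples whose predicted class matches the label."""
    correct = sum(int(x > threshold) == y for x, y in test)
    return correct / len(test)


data = make_toy_data()
fold_accuracies = []
for train_idx, test_idx in k_fold_indices(len(data), k=5):
    threshold = fit_threshold([data[i] for i in train_idx])
    fold_accuracies.append(accuracy(threshold, [data[i] for i in test_idx]))

mean_acc = statistics.mean(fold_accuracies)
std_acc = statistics.stdev(fold_accuracies)  # sample standard deviation
print(f"accuracy over 5 folds: {100 * mean_acc:.2f}% +/- {100 * std_acc:.2f}%")
```

The per-fold accuracies collected in `fold_accuracies` are the kind of values a box plot summarizes for each method.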

Box plots of results identified by different methods.
As can be seen from the figure, the model proposed in this paper performs best among all compared models. The transformer-based model primarily extracts global features, which enables it to accurately identify pump and valve faults, and in most cases its performance is superior to that of 1DCNN. However, because it does not focus on local fault characteristics, it may misdiagnose faults when the global characteristics of the samples are not significantly different, as indicated by small differences in RMS values. The results verify the rationality of the model architecture and also show that fusing local and global features improves model performance. The superiority of the model structure is verified by the ablation studies (comparison among MSCFF, 1DCNN = PT, MSC + PT, and MSC = PT), and the effectiveness of the feature fusion module is verified by comparing the MSC = PT and MSC + PT models. In addition, the standard deviations of the recognition accuracy for the above methods are 0.849%, 1.11%, 0.893%, 0.527%, 0.518%, 0.778%, and 0.1576%, respectively. The proposed method has the smallest standard deviation (0.1576%), which demonstrates its robustness.
Conclusions
Due to the complex underground environment of a coal mine, the condition monitoring data of the reciprocating piston pump contain serious noise, which hinders accurate fault identification. To further improve the accuracy of fault diagnosis for underground reciprocating piston pumps in coal mines, a fault diagnosis method based on feature fusion of a CNN and a transformer encoder is proposed. Firstly, the multi-scale convolution module improves the representational ability and robustness of the combined features. Secondly, the fusion of local and global features compensates for the fact that local features alone cannot completely resist noise. In addition, patch segmentation of time-domain vibration signals reduces the influence of data randomness on the learning of global features in the transformer encoder and improves the interpretability of global features. Finally, compared with other popular fault diagnosis methods, the proposed method performs better. The experimental data of the reciprocating piston pump verify the effectiveness and superiority of the proposed method. Future research will further explore equipment fault diagnosis and health management methods based on lightweight models, which will help reduce the hardware performance required to run the models. The transfer of the model to devices with similar structures will also be studied. In addition, with a view to the health management of the entire equipment, future research can extend to the fusion of multiple sensors or multiple model results to form a comprehensive maintenance decision.
Footnotes
Author contributions
All authors contributed to the study’s conception and design. Material preparation, data collection, and analysis were performed by Li Ran, Ye Zhuang, and He Yonghua. The first draft of the manuscript was written by Lai Yuehua and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was supported by China Coal Technology Engineering Group (2023-TD-QN006 and 2023-TD-ZD003-007).
Conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
The data cannot be made publicly available upon publication because they are not available in a format that is sufficiently accessible or reusable by other researchers. The data that support the findings of this study are available upon reasonable request from the authors.
