Abstract
To accurately diagnose quayside container crane (QCC) gearbox faults, this article proposes a method that combines the frequency-domain Markov transformation field (FDMTF) and a multi-branch residual convolutional neural network (MBRCNN). Firstly, the gearbox vibration signal is converted into the frequency domain to reveal the components and amplitudes of the signal stably and concisely. Secondly, the one-dimensional frequency signal is encoded into a two-dimensional image by the Markov transformation field to capture the dynamic characteristics of the signal. Thirdly, the MBRCNN is constructed, which can extract multi-scale features and alleviate the problems caused by a deep network structure. Finally, the FDMTF image is fed into the constructed MBRCNN model for pattern recognition. The effectiveness of the proposed FDMTF–MBRCNN method is verified by two case studies. In Case 1, the diagnosis results on a benchmark dataset achieve 100% accuracy, better than seven state-of-the-art methods published in the last 3 years. In Case 2, the diagnosis results on the dataset collected from a 1:4 scaled test rig achieve 98.85% accuracy, better than eleven encoding methods and four convolutional neural network methods. The method also obtains a recognition accuracy of more than 94% with small samples, different network hyper-parameters or variable loads, which verifies its robustness. These case studies show that the FDMTF–MBRCNN method is expected to be applicable to the actual fault diagnosis of QCC gearboxes.
Keywords
Introduction
Owing to their stable transmission performance and high transmission efficiency, gearboxes are widely used in key transmission systems of various machinery, such as wind turbines, automobiles, machine tools, robots and quayside container cranes (QCCs).1–3 As the key mechanical equipment of cargo ports, the QCC is used specifically for loading and unloading. 2 The QCC operates in a harsh working environment, which increases the failure rate of its gearboxes, which bear a huge dynamic load for long periods. Typical gear faults include gear wear, tooth root crack, gear fatigue and tooth breaking. 1 These gear faults, as well as bearing faults, damage the transmission system and easily induce other faults.1,3 In addition to increasing operation and maintenance costs, gearbox faults can lead to catastrophic failure of the machine, resulting in huge economic losses. 4 Fault diagnosis can improve the reliability and safety of machinery and reduce operation and maintenance costs.5,6
Because it responds rapidly and does not require much expert knowledge, intelligent fault diagnosis has received extensive attention. 7 According to the depth of the model, intelligent fault diagnosis methods can be roughly classified as machine learning-based and deep learning-based. Machine learning-based fault diagnosis methods roughly include three steps: manual feature extraction, dimension reduction and classification model construction. Commonly used features include time-domain features, frequency-domain (FD) features and time–frequency features. Feature selection is a dimensionality reduction method that removes redundant information by selecting feature subsets, so that multiple categories of health and fault states can be accurately distinguished. 8 There are many shallow machine learning models for classification, such as the support vector machine, 9 extreme learning machine 10 and K-nearest neighbor. 11 However, manual feature extraction depends on prior knowledge and generalizes poorly, 12 and it is difficult for shallow machine learning models to express complex mappings. 13
Compared with machine learning, deep learning models integrate adaptive feature extraction with classification and can learn complex nonlinear mappings. 14 Deep learning models, such as the deep belief network, 15 stacked autoencoder (SAE), 16 recurrent neural network 17 and convolutional neural network (CNN),14,18–20 have been widely used in mechanical fault diagnosis. Among them, the CNN can significantly decrease the number of training parameters because of its local connections, weight sharing and pooling operations.18,21 Because the CNN was originally designed for image classification, it is well suited to classifying two-dimensional (2D) images.14,19 Therefore, how to transform one-dimensional (1D) vibration signals into 2D images is a meaningful topic in fault diagnosis.19,21 Many signal processing methods have been used to encode 1D vibration signals into 2D images for mechanical fault diagnosis, such as FD transformation, that is, the spectrum, 22 short-time Fourier transform (STFT) 23 and continuous wavelet transform (CWT). 24 Other methods that can encode time series into 2D images, such as recurrence plots (RP), 25 the Gramian angular field (GAF), which is divided into the Gramian angular summation field (GASF) and the Gramian angular difference field (GADF), 26 and the Markov transformation field (MTF),26,27 can also be considered for CNN-based fault diagnosis. Hsueh et al. 28 converted three-phase current signals into images through RP as the input of a CNN for rotating machinery fault diagnosis. Tang et al. 29 encoded the vibration signal into a GASF image and achieved an accuracy of 99% in mechanical fault identification. He et al. 30 used MTF to map the 1D monitoring signal into a 2D image, which improved the fault diagnosis accuracy of a nuclear power system. Han et al. 31 compared the characterization ability of GAF and MTF in bearing fault diagnosis cases and pointed out that MTF retains more dynamic information than GAF.
It should be noted that the above methods directly encode 1D vibration signals into 2D images, which may leave the images vulnerable to disturbance by arbitrary shocks and background noise.
Bai et al. 32 pointed out that, in addition to characterizing 1D vibration signals, building a good network structure is also important for fault diagnosis. Many mainstream CNN structures, such as AlexNet, 33 GoogLeNet 34 and ResNet, 35 are used in image recognition. These classical models provide many references for the construction of CNNs. For example, the multi-branch (MB) module of GoogLeNet can learn different features, and the residual module can alleviate the problems of network degradation and gradient vanishing/explosion. Inspired by GoogLeNet, Wu et al. 36 proposed an enhanced multi-scale CNN for high-speed train rolling bearings. Zhang et al. 37 constructed an improved ResNet with a hybrid attention mechanism for wind turbine gearbox fault diagnosis. In addition to the network structure, the hyper-parameters used in CNN training, such as the learning rate, number of epochs and mini-batch size, also affect the identification accuracy and generalization of the diagnosis model. To obtain a CNN model that is insensitive to hyper-parameters, Zhang et al. 19 proposed an MB module that can adaptively adjust its branches to suit the preset hyper-parameters. However, that approach requires designing multiple networks and choosing the most appropriate one by the minimum loss function, which is complex. In short, a CNN model that is insensitive to hyper-parameters is worthy of further exploration.
Two characteristics of the QCC gearbox need attention. One is that the number of fault samples is small, and the other is that the QCC often works under different loads. These should be considered when verifying diagnosis methods. To address the above issues, a QCC gearbox fault diagnosis method combining FDMTF and a multi-branch residual CNN (MBRCNN) is proposed. Compared with time-domain signals, FD signals reveal the components and amplitudes of signals more stably and concisely. The phase of the signal and random instantaneous impacts change the shape of the time-domain waveform. In the FD, by contrast, a phase change of the signal does not affect the shape of the spectrum. Moreover, short shocks in the time domain rarely show large amplitudes in the FD, because the spectrum reflects all components of the signal and the frequency amplitude corresponding to an instantaneous impact may be low. The MBRCNN combines the MB block and the residual block to extract multi-scale features and alleviate the problems caused by the deep structure. Our specific contributions and insights are as follows:
A new fault diagnosis method is proposed for the QCC gearbox, which benefits from the FDMTF and the MBRCNN. Its performance is explored in two case studies: a benchmark gearbox dataset and a dataset collected from a scaled-down QCC test rig. Its robustness to different proportions of training samples, different hyper-parameters and variable loads is also explored.
The representation ability of FDMTF for 1D vibration signals is comprehensively studied. Firstly, the representation effect of FDMTF for time series is compared with that of MTF. Secondly, the feasibility of FDMTF for the representation of gearbox vibration signals is explored. Finally, FDMTF is compared with other 2D representation methods, such as GASF, GADF, FDGASF, FDGADF, RP and FDRP, in the case of QCC gearbox fault diagnosis.
A CNN structure, that is, the MBRCNN, which integrates the MB block and the residual block, is proposed to facilitate the extraction of multi-scale features and alleviate the problems caused by deep network layers.
This article is arranged as follows. Section ‘The frequency-domain MTF’ proposes the FDMTF and compares the performance of FDMTF and time-domain MTF. Section ‘The proposed fault diagnosis method’ proposes the gearbox fault diagnosis method for the QCC based on FDMTF–MBRCNN. Section ‘Case study’ presents the case study based on two datasets. Section ‘Conclusions’ provides some conclusions.
The frequency-domain MTF
Markov transformation field
MTF26,27 can capture the dynamic characteristics in time and frequency. Given a time series
where ωmn is the probability that an element in qm is followed by an element in qn, that is, ωmn = P(xt∈qn|xt−1∈qm); m, n∈[1, Q].
Finally, the MTF is extended by each probability of the points of the original time series
where Mij(i, j∈{1, 2, …, N}) is the probability that the interval qm corresponding to time series xi is transferred to the interval qn corresponding to time series xj.
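For reference, the construction described in this subsection can be written compactly as follows. This is a sketch of the standard MTF formulation from the cited works, not necessarily the exact form of Equations (1) and (2): the matrix name W, the bin count Q and the row-normalization condition are assumptions, chosen so that ωmn is the probability of moving from interval qm to interval qn, as stated in the text.

```latex
% Assumed notation: a series x_1, ..., x_N whose values are split into Q quantile intervals q_1, ..., q_Q.
% Markov transition matrix (each row sums to one):
W =
\begin{bmatrix}
\omega_{11} & \cdots & \omega_{1Q} \\
\vdots      & \ddots & \vdots      \\
\omega_{Q1} & \cdots & \omega_{QQ}
\end{bmatrix},
\qquad
\sum_{n=1}^{Q} \omega_{mn} = 1 .

% Markov transformation field: the transition probability between the
% intervals that contain x_i and x_j, spread over all pairs of time points.
M_{ij} = \omega_{mn}, \qquad x_i \in q_m,\; x_j \in q_n, \qquad i, j \in \{1, 2, \dots, N\} .
```

Written this way, M is an N × N image whose entry (i, j) depends only on the intervals of the two points, which is why the field keeps information about every pair of positions in the series rather than only adjacent ones.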
The proposed FDMTF
For the above time series
where
To obtain the actual FDMTF, it is generally necessary to normalize the FD signal to the interval [−1, 1], and the normalized series is given by:
The FD series of Equation (4) is treated as a sequence to be processed by Equations (1) and (2) to obtain the FDMTF. To prove the advantages of FDMTF over time-domain MTF, we intercept different time intervals of the same signal to compare the encoding ability.
The signal, which was used by Bai et al. 32 to compare GAF and FDGAF, can be expressed as follows:
The sampling time is 0.5 s and the sampling frequency is 512 Hz. Figure 1 shows this time signal. The signal is divided into two parts for characterization by MTF and FDMTF. The black line is the original signal s, the blue line is the first half s1 of s and the red line is the second half s2 of s.

The comparison between MTF and FDMTF.
As shown in Figure 1, signals s1 and s2 differ in phase, resulting in differences in their time-domain MTF images. In contrast, both their spectra and their FDMTF images are identical. Figure 1 thus shows that FDMTF has advantages over time-domain MTF in encoding signals into images. The reason is that FDMTF is insensitive to phase changes of the signal, while MTF is not.
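The encoding pipeline of this section (amplitude spectrum, normalization to [−1, 1], then MTF) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the article's implementation: the function names, the choice of Q = 8 quantile bins, min-max scaling as the normalization, and the two-tone test signal with added noise are all hypothetical, since the original equations and test signal are not reproduced here.

```python
import numpy as np

def markov_transition_field(series, n_bins=8):
    """Encode a 1D series of length N as an N x N Markov transition field."""
    x = np.asarray(series, dtype=float)
    # Interior quantile edges split the values into n_bins intervals.
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    bins = np.digitize(x, edges)  # interval index (0 .. n_bins-1) of every point
    # Count transitions between consecutive points, then normalise each row.
    counts = np.zeros((n_bins, n_bins))
    np.add.at(counts, (bins[:-1], bins[1:]), 1.0)
    row_sums = counts.sum(axis=1, keepdims=True)
    w = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
    # Entry (i, j) is the transition probability between the intervals of x_i and x_j.
    return w[bins[:, None], bins[None, :]]

def amplitude_spectrum(signal):
    """One-sided FFT amplitude of a real signal."""
    return np.abs(np.fft.rfft(np.asarray(signal, dtype=float)))

def fdmtf(signal, n_bins=8):
    """Frequency-domain MTF: spectrum -> scale to [-1, 1] (assumed min-max) -> MTF."""
    amp = amplitude_spectrum(signal)
    lo, hi = amp.min(), amp.max()
    amp_norm = 2.0 * (amp - lo) / (hi - lo) - 1.0
    return markov_transition_field(amp_norm, n_bins)

# Hypothetical test signal: 0.5 s sampled at 512 Hz, two tones plus weak noise.
fs = 512
t = np.arange(256) / fs
rng = np.random.default_rng(0)
s = np.sin(2 * np.pi * 20 * t) + 0.5 * np.sin(2 * np.pi * 50 * t)
s = s + 0.1 * rng.standard_normal(t.size)
s_shifted = np.roll(s, 37)  # a circular shift changes only the phase of each component

image = fdmtf(s)
print(image.shape)
print(np.allclose(amplitude_spectrum(s), amplitude_spectrum(s_shifted)))
```

In this sketch a circular shift stands in for the phase change: it leaves the amplitude spectrum unchanged up to floating-point error, so the sequence passed to the MTF step is essentially the same, whereas the time-domain samples themselves are reordered.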
The proposed fault diagnosis method
A new QCC gearbox fault diagnosis method is proposed in this section. The process of the proposed method is drawn in Figure 2.

The process of FDMTF–MBRCNN.
Figure 2 covers the main steps of the proposed FDMTF–MBRCNN method, including the acquisition of the FDMTF and the construction of the CNN structure, that is, the MBRCNN. After being trained on historical vibration signals, the method classifies online vibration signals into a certain category to complete fault diagnosis. The method first converts the vibration signal into the FD and then obtains the FDMTF from the FD signal, which is used as the input of the MBRCNN. The MBRCNN consists of nine layers and integrates the MB block of GoogLeNet and the residual block of ResNet. Three branches of different lengths are designed: the first branch has eight network layers, the second has three and the third has only one. The size of the initial convolution kernel is related to the level of the original features that are extracted. To obtain original features at different levels, the first convolution kernels of the three branches have different sizes. The features extracted by the third branch converge into the second branch. Similarly, the features extracted by the second branch converge into the first branch. In this way, the MBRCNN model can extract rich features at different levels. Each residual module uses two network layers.
There are no residual modules in the third branch, one residual module in the second branch and three residual modules in the first branch. The basic MBRCNN module involving two branches and a residual module is shown in Figure 3.

The basic module of MBRCNN.
It can be seen from Figure 3 that the output of layer l,
where
How to obtain FDMTF and the advantages of FDMTF over MTF have been introduced in section ‘The frequency-domain MTF’. To facilitate a clear understanding, this section focuses on the MBRCNN construction, and the specific steps of the proposed fault diagnosis method for the QCC gearbox.
The constructed CNN model
According to Figure 2, we introduce the MBRCNN model in detail. The specific details of the network, such as the size of the convolution kernel and the output channel, can be seen in Table 1.
Network structure parameters of MBRCNN.
MBRCNN: multi-branch residual convolutional neural network.
where k′, s′, p′ and c′ represent the kernel size, step size, padding value and the number of channels, respectively; ‘Type’ is the kind of pooling, including max pooling and average pooling.
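The role of k′, s′ and p′ can be illustrated with the usual output-size rule for a convolution or pooling layer. The sketch below is only an illustration: the function name and the three kernel, stride and padding triples are hypothetical and are not the values of Table 1; only the 224 × 224 input size is taken from the case studies.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling layer (floor convention)."""
    return (size + 2 * padding - kernel) // stride + 1

# Hypothetical first-layer settings (kernel, stride, padding) for three branches.
branch_settings = [(7, 2, 3), (5, 2, 2), (3, 2, 1)]
sizes = [conv_output_size(224, k, s, p) for k, s, p in branch_settings]
print(sizes)  # [112, 112, 112]
```

With padding matched to the kernel size in this way, branches that use different kernels still produce feature maps of the same spatial size, which is one common way to make the outputs of parallel branches mergeable.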
As can be seen from Table 1, MBRCNN designs multiple branches in the first three layers. The convolution kernels with different sizes are expected to extract the multi-scale features of the same input. For the general convolution layer, the convolution operation is expressed by the following formula.
where
where ReLu(·) is the activation; BN represents the batch normalization operation.
The convolution operation of the basic MBRCNN module can be expressed by the general formula.
where
The pooling layer generally follows the convolution layer to reduce the dimension of the feature map. Because it does not involve parameter updating, it is not necessary to regard it as an independent network layer. The MBRCNN constructed in this article adopts two pooling operations, that is, max pooling and average pooling. The pooling process is briefly described as follows:
where
The last layer of the MBRCNN is the fully connected layer, which uses Softmax(·) as the activation function.
where ui is the output value of the i-th node of the front unit of the classifier, o is the index of the number of nodes and O is the total number of categories, o∈[1, O].
Cross-entropy is selected as the loss function for training.
where
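The Softmax activation and cross-entropy loss named above can be sketched with the standard library alone. This is the textbook single-sample form with a one-hot target, which is an assumption: the article's Equation (12) may, for example, average over a mini-batch. The function names and the example scores are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to one."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(u - shift) for u in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Cross-entropy of one sample whose true class index is `label`."""
    return -math.log(probs[label])

# Hypothetical scores for a three-class problem; the true class is index 0.
probs = softmax([2.0, 1.0, 0.1])
loss = cross_entropy(probs, label=0)
print(probs, loss)
```

The loss is small when the probability assigned to the true class is close to one and grows without bound as that probability approaches zero, which is what drives the parameter updates during training.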
The steps of the FDMTF–MBRCNN method
The flowchart of the FDMTF–MBRCNN method for QCC gearbox fault diagnosis is shown in Figure 4. The step of data preparation in the diagnosis process using online data is the same as that in the training step.

The flowchart of the FDMTF–MBRCNN for gearbox diagnosis.
According to Figure 4, the steps of the proposed FDMTF–MBRCNN method for the QCC gearbox fault diagnosis are as follows:
Step 1: Offline signal collection. Collect the vibration signals of the QCC gearbox with different statuses.
Step 2: Image representation. The collected vibration signals are first subjected to the fast Fourier transform, and the FD signal is then encoded by MTF to obtain the FDMTF according to section ‘The frequency-domain MTF’.
Step 3: MBRCNN model construction. Integrate multi-branch blocks and residual blocks to construct the MBRCNN according to section ‘The constructed CNN model’.
Step 4: Network loss function calculation. Divide the training set into two parts, one of which is used to calculate the loss function according to Equation (12).
Step 5: Iteration judgement. If the preset number of iterations has been reached, go to Step 7; otherwise, proceed to Step 6.
Step 6: Parameter updating. Use the Adam method 38 to optimize the network parameters. After updating all network parameters, obtain the network model of this iteration and return to Step 4.
Step 7: CNN model selection. Input the remaining part of the training set, that is, the validation set, into the network model of each iteration, and select the model with the highest accuracy as the final trained model.
Step 8: Online diagnosis. Obtain the FDMTF of the online vibration signal, and then input it into the trained model obtained in Step 7 to get the diagnosis results.
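The selection logic of Steps 4 to 7 can be sketched independently of any deep learning framework. In the sketch below, `update` and `validate` are hypothetical callables standing in for one round of loss calculation plus Adam update and for validation scoring, and the accuracy values are made up for illustration. The sketch scores each model as soon as it is produced rather than after all iterations, which selects the same model.

```python
def train_and_select(update, validate, n_iterations):
    """Run n_iterations updates and keep the model with the best validation accuracy."""
    best_state, best_acc = None, float("-inf")
    for iteration in range(n_iterations):
        state = update(iteration)   # Steps 4 and 6: compute the loss, update the parameters
        acc = validate(state)       # Step 7: accuracy of this iteration's model on the validation part
        if acc > best_acc:          # ties keep the earlier model
            best_state, best_acc = state, acc
    return best_state, best_acc

# Toy stand-ins: the "model" is just the iteration index with a made-up accuracy.
val_acc_per_iteration = [0.62, 0.91, 0.97, 0.95]
best_state, best_acc = train_and_select(
    update=lambda iteration: iteration,
    validate=lambda state: val_acc_per_iteration[state],
    n_iterations=len(val_acc_per_iteration),
)
print(best_state, best_acc)  # 2 0.97
```

Selecting on a held-out validation part, rather than simply taking the last iteration, guards against keeping a model from an iteration at which the validation accuracy has already started to fall.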
Case study
Two case studies are used to demonstrate the effectiveness of the FDMTF–MBRCNN method. The gearbox dataset of Case 1 is provided by the University of Connecticut (UoC),39,40 and has been extensively used to explore the performance of fault diagnosis methods.41–48 In addition, Zhao et al. 41 tested seven publicly available datasets with four benchmark types of deep learning models and showed that the UoC gearbox dataset is the most difficult to diagnose among the seven. The gearbox data of Case 1 are used to study the generality and competitiveness of the proposed method. The gearbox data of Case 2 are collected from a scaled-down QCC test rig; this is the main case study of this article because it is closer to a real QCC gearbox dataset.
Case 1: UoC gearbox dataset
The UoC gearbox dataset was collected from a benchmark two-stage parallel shaft gearbox test rig. The faulty gear is located on the input shaft of the first stage. The vibration signals of the gearbox in nine different statuses are collected by an accelerometer. The dataset, provided by Cao et al.,39,40 does not describe its operating conditions in detail, so Case 1 is not considered proof of diagnosis under variable loads. The length of each acquired signal is 3600 points, and the sampling frequency is 20 kHz. The vibration signals of all nine statuses are used for analysis, including normal, tooth missing, tooth root crack, tooth spalling and tooth chipping (five different fault levels). The sample label of each gear status and the corresponding number of samples are shown in Table 2.
The labels of different gear statuses.
In Table 2, chipping level 1 represents the most serious chipping fault and, conversely, chipping level 5 represents the least serious. About 104 samples were collected for each gear status, so there are 936 samples in total. The samples are randomly divided into a training set and a testing set at a ratio of 7:3. Within the training set, 0.55 × 936 samples are used as the training part, and 0.15 × 936 samples are used as the validation part. The input image size of the MBRCNN model is 224 × 224. The training hyper-parameters are set as follows: the mini-batch size, the learning rate and the number of epochs are 64, 0.001 and 20, respectively. To illustrate the learning ability of MBRCNN, t-distributed stochastic neighbour embedding (t-SNE) 49 is used to visualize the features of the last layer before the fully connected layer. The recognition accuracy on the UoC gearbox dataset is 100%, which demonstrates the effectiveness of the FDMTF–MBRCNN method. The specific diagnosis results are shown in Figure 5.

The diagnosis results: (a) the recognition accuracy and (b) dimensionality reduction representation of features by t-SNE.
Figure 5(a) is the confusion matrix of the diagnosis results, indicating that the correct recognition rate of each category is 100%. Figure 5(b) shows that different kinds of samples are well-distinguished without overlapping, which proves the feature extraction ability of MBRCNN. To avoid accidental interference, the dataset was trained and tested five times. The recognition accuracy is still 100% each time, which proves the stability of the proposed method.
Comparison analysis of Case 1
Based on the UoC gearbox dataset, the proposed FDMTF–MBRCNN is compared with other state-of-the-art methods published in the last 3 years. Many studies use this dataset for fault diagnosis with limited training samples. To facilitate comparison, the proposed method is therefore also evaluated with different numbers of training samples. It should be noted that some studies expanded the total amount of data to meet the requirements of training. The detailed comparison results are shown in Table 3.
The details of the comparison results.
FDMTF: frequency-domain Markov transformation field; MBRCNN: multi-branch residual convolutional neural network. CGAN: conditional generative adversarial networks; MLP: multi-layer perceptron; mRMR: minimum redundancy maximum relevance.
In Table 3, the proportion of the training set covers all samples with known labels, that is, both training samples and validation samples. Jin et al. 42 proposed two methods, that is, CNN-RL and gated recurrent unit neural network-reinforcement learning, which obtained recognition accuracies of 0.928 and 0.989, respectively. On the one hand, Table 3 shows the recognition accuracy of the FDMTF–MBRCNN under different proportions of training data. The recognition accuracy of the proposed method is 100% under training proportions of 10%, 20%, 30%, 40%, 50% and 70%, which proves its robustness. It also shows that the FDMTF–MBRCNN method can achieve good results on the small sample problem; according to Zhang et al., 50 a training set proportion of less than 50% can be regarded as a small sample problem. On the other hand, the diagnosis results obtained on the UoC gearbox dataset by the methods proposed in other studies range from 0.929 to 0.996 and are, without exception, lower than the recognition accuracy of the proposed FDMTF–MBRCNN method. The comparison results in Table 3 prove the superiority of the FDMTF–MBRCNN method.
Case 2: QCC gearbox dataset
To collect the QCC gearbox vibration signal, we built a scaled-down test rig of an actual type of QCC with a ratio of 1:4, as shown in Figure 6.

The scaled-down test rig of a QCC: (a) the location of faulty parts, (b) the trolley movement gearbox and (c) the appearance of the test rig.
The movement of the load box controlled by the test rig is divided into vertical and horizontal movement to simulate the loading and unloading of container ships. Therefore, the power output of the whole test rig is divided into two parts, and the two motors are each connected to their corresponding gearbox. The trolley gearbox controls the horizontal movement of the load box, and the hoisting gearbox controls the vertical movement. In the experiment, the running statuses of the trolley gearbox are studied. Figure 6(a) shows the location of the faulty parts of the trolley gearbox. The red positions mark where the parts are located, and the green line shows the power transmission inside the gearbox. Figure 6(b) shows the appearance of the horizontal trolley gearbox, which is the object of this experiment. The motor provides power, which is transmitted through the drum and gearbox to the steel wire rope that drives the horizontal movement of the trolley. Figure 6(c) shows the appearance of the test rig. Iron blocks are added to the load box to simulate actual containers. During the experiment, the input speed of the trolley gearbox is 1750 rpm, and the load is 1000 kg. The number of sampling points is 2048, and the sampling frequency is 10,000 Hz.
The vibration signals of the gearbox under nine different health statuses are collected. The specific category label of each health status, and the corresponding number of samples are shown in Table 4. A total of 579 samples were obtained in the experiment.
The labels of different health statuses.
The faulty parts used in Case 2 are shown in Figure 7. For the convenience of observation, the damage location is marked with a green circle. In addition, the broken tooth and the worn tooth are coated with pink paint.

The faulty parts of Case 2: (a)–(f): Label 1–6 and (g), (h): label 8, 9.
Figure 7 shows the eight fault types analysed in Case 2. We randomly extract six signals under the normal status and show their time-domain waveform, spectrum, time-domain MTF and FDMTF in Figure 8.

Four different 2D expressions of the 1D vibration signal under normal status: (a) time-domain image, (b) frequency-domain image, (c) time-domain MTF and (d) FDMTF.
The images in Figure 8(a) and (c), which are obtained from the time-domain signal, perform poorly because there are large differences between images even for the same health status. This may indicate that the vibration signal of the scaled-down test rig is subject to unknown interferences. In contrast, Figure 8(b) and (d), which are based on the FD signal, show that the six images of the same status are almost identical. Figure 8 thus shows that the FDMTF, which can resist the impact of unknown interferences, is more robust than the time-domain MTF.
The FDMTF images in nine different health statuses are shown in Figure 9. The FDMTF presents different patterns under different health statuses of the gearbox, which is conducive to the training of MBRCNN. Combining Figures 8 and 9, one can find that the FDMTF is suitable for encoding the gearbox vibration signal into a 2D image. The reason is that FDMTF makes the images of the same health status similar and the images of different health statuses distinct.

The FDMTF of the vibration signals under nine statuses: (a)–(i) corresponding to the label 1–9.
Training and testing are carried out according to the steps of the FDMTF–MBRCNN method in section ‘The steps of the FDMTF–MBRCNN method’. The training sample proportion, network hyper-parameters and input image size are the same as those of Case 1. With this split, a total of 174 samples are used for testing. The overall recognition accuracy is 98.85%, and the specific diagnosis results are shown in Figure 10. Figure 10(a) shows that only two samples with a broken gear tooth are mistaken for gear root crack, and the identification accuracy of the remaining categories is 100%. Figure 10(b) shows that the features extracted by MBRCNN are very effective and can separate different kinds of samples. To eliminate the influence of accidental factors, training and testing were repeated five times. The recognition accuracy of all five tests is 98.85%, which proves the stability of the proposed method.

The diagnosis results: (a) the recognition accuracy and (b) dimensionality reduction representation of features by t-SNE.
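The reported Case 2 accuracy can be checked against the two figures given in the text, 174 test samples and two misclassified samples. The short sketch below only reproduces this arithmetic; the variable names are arbitrary.

```python
n_test = 174   # number of test samples reported for Case 2
n_wrong = 2    # misclassified samples reported for Figure 10(a)

accuracy = (n_test - n_wrong) / n_test
print(f"{accuracy:.2%}")  # 98.85%
```

The two numbers are consistent: 172 correct predictions out of 174 give the stated 98.85%.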
Comparison analysis of Case 2
The comparison with FDMTF
To demonstrate the merits of the proposed method in 2D characterization, FDMTF is compared with other methods that can convert 1D vibration signals into 2D images. These methods include time-domain MTF, spectrum, time-domain waveform, FDGADF, time-domain GADF, FDGASF, time-domain GASF, FDRP, time-domain RP, STFT and CWT. As mentioned in the Introduction, most of these 2D encoding methods have been used in fault diagnosis. These different 2D images are input into MBRCNN, and the diagnosis results are shown in Table 5. For convenience, the standard deviation is placed after the average recognition accuracy to describe the dispersion of the results.
The diagnosis results of different inputs based on MBRCNN.
FDMTF: frequency-domain Markov transformation field; MTF: Markov transformation field; GASF: Gramian angular summation field; GADF: Gramian angular difference field; STFT: short-time Fourier transform; CWT: continuous wavelet transform; RP: recurrence plots; MBRCNN: multi-branch residual convolutional neural network. The bold entries represent the highest diagnosis accuracy.
Table 5 shows that the diagnosis accuracy of the proposed FDMTF, 98.85%, is higher and more stable than that of the other 11 methods, which proves the superiority of using FDMTF as the input of MBRCNN. The recognition accuracy of the spectrum, FDGADF, FDGASF, FDRP, STFT and CWT exceeds 90%. However, the methods that encode time-domain information directly into the 2D image, including the time-domain waveform, time-domain MTF, time-domain GADF, time-domain GASF and time-domain RP, give poor diagnosis results, with recognition accuracy below 35%. The reason may be that the time-domain vibration signal is prone to disturbance by phase, background noise, etc., resulting in large differences between signals even under the same state. To further show the difference between the 2D representation methods based on FD signals and those based on time-domain signals, the average recognition accuracies in Table 5 are plotted in Figure 11.

The comparison of recognition accuracy based on time domain and frequency domain.
In Figure 11, ‘signal’ means the raw vibration signal. Figure 11 intuitively shows the difference between the recognition results obtained by encoding 2D images from FD signals and from time-domain signals. For all five encoding methods, converting the FD signal rather than the time-domain signal into the 2D image used as the input of MBRCNN yields higher recognition accuracy.
The comparison with MBRCNN
To prove the advantages of MBRCNN in feature extraction and classification, the famous networks AlexNet and ResNet-18 are used as diagnosis models for comparison. Using FDMTF as input, the recognition results obtained by AlexNet and ResNet-18 are drawn in Figures 12 and 13, respectively.

The fault diagnosis results of AlexNet: (a) the recognition accuracy and (b) dimensionality reduction representation of features by t-SNE.

The fault diagnosis results of ResNet-18: (a) the recognition accuracy and (b) dimensionality reduction representation of features by t-SNE.
The recognition accuracy of AlexNet is 90.23%, and that of ResNet-18 is 97.7%. The recognition accuracy of the above two network models is lower than that of the proposed method, which proves the advantages of MBRCNN. At the same time, AlexNet as well as ResNet-18 can obtain recognition accuracy of more than 90%, which shows that FDMTF is suitable for the characterization of the 1D vibration signal. Figures 12(b) and 13(b) show that the feature extraction ability of AlexNet and ResNet-18 is also weaker than that of MBRCNN in this case analysis.
To further prove the advantages of the proposed MBRCNN model, ImageNet, 51 which is widely used to assist CNN model training, is used to pre-train as well as fine-tune ResNet-18. Pre-training refers to using the model parameters trained on another dataset as the initial parameters in place of random initialization. Fine-tuning refers to using the network parameters trained on another dataset as the parameters of the feature extraction layers; these layers are no longer updated, and only the parameters of the fully connected layer are adjusted slightly.
Therefore, ResNet-18 has three training strategies, including training directly whose results are shown in Figure 13, pre-training and fine-tuning. To eliminate accidental factors, each method was trained and tested five times. The average recognition accuracy obtained by the abovementioned methods is shown in Table 6.
The average recognition accuracy of different network models.
The bold entries represent the highest diagnosis accuracy.
Table 6 shows that the proposed method achieves the highest recognition accuracy among the five methods, at 98.85%. Training ResNet-18 with the assistance of ImageNet yields lower, not higher, recognition accuracy than training it directly. The recognition accuracy of fine-tuning is only 63.79%, which shows that the training strategy of fine-tuning is not suitable for fault diagnosis based on FDMTF.
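The three ResNet-18 training strategies compared in Table 6 can be sketched in a framework-free way as follows. This is only an illustrative sketch of the definitions given above, not the training code used in this article: the strategy labels, the layer names, the toy layer sizes and the choice to leave the fully connected layer randomly initialized are all assumptions.

```python
import random

def init_random(layer_sizes, seed=0):
    """Random initialization of every layer (the 'training directly' case)."""
    rng = random.Random(seed)
    return {name: [rng.gauss(0.0, 0.01) for _ in range(size)]
            for name, size in layer_sizes.items()}

def configure(strategy, layer_sizes, source_weights, head="fc"):
    """Return (weights, trainable) for 'direct', 'pretrain' or 'finetune'.

    source_weights: parameters learned on another dataset (e.g. ImageNet).
    head: name of the fully connected classification layer (assumed name).
    """
    weights = init_random(layer_sizes)
    if strategy in ("pretrain", "finetune"):
        # Replace the random start of the feature-extraction layers with the
        # source parameters. Assumption: the head keeps its random start.
        for name in layer_sizes:
            if name != head:
                weights[name] = list(source_weights[name])
    if strategy == "finetune":
        # Feature-extraction layers are frozen; only the head is updated.
        trainable = {name: name == head for name in layer_sizes}
    else:
        # Direct training and pre-training both update every layer.
        trainable = {name: True for name in layer_sizes}
    return weights, trainable

# Toy stand-ins for a real network (hypothetical names and sizes)
sizes = {"conv1": 4, "conv2": 4, "fc": 3}
source = {"conv1": [1.0] * 4, "conv2": [2.0] * 4}
for strategy in ("direct", "pretrain", "finetune"):
    _, trainable = configure(strategy, sizes, source)
    print(strategy, trainable)
```

In this reading, pre-training and fine-tuning share the same starting point and differ only in which layers remain trainable, which is the distinction the comparison in Table 6 relies on.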
Robustness analysis based on Case 2
Robustness against proportions of training sample
In subsection ‘Comparison analysis of Case 1’, the proposed method achieves 100% recognition accuracy on the UoC dataset even with a small training sample. The dataset of the QCC test rig may be more sensitive to the number of samples because the collected data contains noise and impact interference. 52 This subsection examines the generalization performance of the FDMTF–MBRCNN method under different proportions of training samples. The corresponding diagnosis results are shown in Table 7.
The diagnosis results under different proportions of training samples.
Because the dataset obtained from the QCC test rig is small, no validation set is used when the proportion of the training set is 30% or 25%. When the proportion of the training set is between 25% and 70%, the diagnosis accuracy is higher than 95%. The standard deviation remains very small across all training-set proportions, which demonstrates the stability of the proposed method. These results demonstrate the generalization of the FDMTF–MBRCNN method and its robustness against the proportion of training samples. When the proportion of training samples is 25%, only 16 samples of each category participate in training on average, and a total of 435 samples are used for testing. In this case, a recognition accuracy of 95.72% is still obtained, which again shows that the proposed method is expected to help solve the small sample problem.
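A split by training proportion can be sketched as below. The article does not state how the samples are drawn, so the per-class (stratified) random split, the function name and the toy label list are assumptions made purely for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_ratio, seed=0):
    """Split sample indices so every class contributes about train_ratio to training."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, rest = [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)                      # random draw within the class
        k = round(len(idxs) * train_ratio)     # training samples for this class
        train.extend(idxs[:k])
        rest.extend(idxs[k:])
    return train, rest

# Toy data: 3 hypothetical classes with 64 samples each (not the QCC dataset)
labels = [c for c in range(3) for _ in range(64)]
train_idx, test_idx = stratified_split(labels, train_ratio=0.25)
print(len(train_idx), len(test_idx))  # 48 144
```

With 64 samples per class and a 25% proportion, each class contributes 16 training samples, which mirrors the order of magnitude discussed above.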
Robustness against hyper-parameters
The proposed method integrates the MB block and the residual block. In addition to providing strong feature extraction ability, this structure may reduce the sensitivity to hyper-parameters: MBRCNN has multiple branches with different depths, which may adapt to different hyper-parameter settings. To test this conjecture, we discuss the influence of three hyper-parameters, namely, the mini-batch size, the learning rate and the number of epochs, on the recognition accuracy. Herein, we adopt the control variable method, that is, when the preset value of one parameter is changed, the values of the other two parameters remain unchanged. To avoid the influence of accidental factors, each group of hyper-parameters is trained and tested five times. The recognition accuracy averaged over the five runs when the three hyper-parameters are changed in turn is shown in Figure 14.

The recognition accuracies of changing hyper-parameters: (a) mini-batch size, (b) the number of epochs and (c) the learning rate.
The recognition accuracies under 15 different groups of hyper-parameter combinations are all more than 95%, which shows that the proposed method is not sensitive to the setting of hyper-parameters. This indicates that the MBRCNN model can adapt to different parameter combinations and reduce the burden of hyper-parameter tuning in deep learning. It can also be seen that, among the three parameters, the learning rate has the greatest impact, and it is appropriate to choose a value near 0.001.
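The control variable procedure with five repetitions per setting can be sketched as follows. The baseline and candidate values are hypothetical (the actual grids appear only in Figure 14), and `dummy_run` is a placeholder for one complete training and testing run of the model.

```python
from statistics import mean

def one_factor_configs(baseline, candidates):
    """Yield configurations that change one hyper-parameter at a time."""
    for name, values in candidates.items():
        for value in values:
            cfg = dict(baseline)   # the other hyper-parameters keep their preset values
            cfg[name] = value
            yield name, cfg

def sweep(baseline, candidates, run_fn, repeats=5):
    """Average run_fn(cfg, seed) over `repeats` runs for every configuration."""
    results = []
    for name, cfg in one_factor_configs(baseline, candidates):
        accuracies = [run_fn(cfg, seed) for seed in range(repeats)]
        results.append((name, cfg[name], mean(accuracies)))
    return results

# Hypothetical preset values and grids, chosen only for illustration
baseline = {"batch_size": 32, "epochs": 50, "lr": 1e-3}
candidates = {
    "batch_size": [8, 16, 32, 64, 128],
    "epochs": [10, 30, 50, 70, 90],
    "lr": [1e-1, 1e-2, 1e-3, 1e-4, 1e-5],
}

def dummy_run(cfg, seed):
    """Placeholder for one train-and-test run; returns a made-up accuracy."""
    return 0.95

results = sweep(baseline, candidates, dummy_run)
print(len(results))  # 15
```

Three hyper-parameters with five values each give the 15 groups mentioned above; a full grid over the same values would instead require 125 combinations, which is why the one-at-a-time design is the cheaper choice for a sensitivity check.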
Robustness against variable loads
QCCs need to handle containers of varying weight, so the online vibration signal and the training vibration signal are likely to be collected under different load conditions. To explore the robustness of the FDMTF–MBRCNN method against variable loads, the vibration signal of the test rig under a 500 kg load at 1750 rpm is newly collected. A total of 653 samples are collected. The detailed number of samples of each gearbox status is shown in Table 8.
The number of samples of each status.
In order to explain the influence of different operating conditions, taking normal vibration signal as an example, the time-domain waveforms and frequency spectrum under different load conditions are displayed in Figure 15.

Time-domain waveform and spectrum under variable loads: (a) and (b) 1000 kg; (c) and (d) 500 kg.
Figure 15 shows that the amplitude under a load of 1000 kg is larger than that under a load of 500 kg, both in the time domain and the FD. This indicates that fluctuating load conditions change the features of the vibration signal. All samples collected under the 1000 kg load condition are used to train the diagnosis models, which are then used to diagnose the data collected under the 500 kg load. During training, 70% of the 1000 kg samples are used for training and the remaining 30% for validation to select the best model.
Based on the proposed FDMTF–MBRCNN method, the recognition accuracy of samples collected under a 500 kg load is 94.55%. On the one hand, this diagnosis result shows that the variable load does influence the recognition accuracy of the proposed method. On the other hand, it shows that the proposed method is robust against variable loads because the method still achieves 94.55% accuracy. Without the help of domain adaptation technology, such a high recognition accuracy may be attributed to the good characterization of FDMTF. To highlight the advantages of FDMTF, 11 other characterization methods are also used to explore the possibility of fault diagnosis under variable loads. The specific diagnosis results are shown in Table 9.
The diagnosis results under variable load conditions.
FDMTF: frequency-domain Markov transformation field; MTF: Markov transformation field; GASF: Gramian angular summation field; GADF: Gramian angular difference field; STFT: short-time Fourier transform; CWT: continuous wavelet transform; RP: recurrence plots.
The bold entries represent the highest diagnosis accuracy.
Table 9 shows the recognition accuracy under different loads when 12 different characterization methods are used as the input of MBRCNN. The recognition accuracy obtained by the FDMTF is the highest, which proves once again the superiority of FDMTF. Seven of the 12 characterization methods achieve recognition accuracy of more than 80% with a small standard deviation. The order of recognition accuracy of different inputs is FD based > time–frequency domain based > time-domain based.
Discussions
The robustness of the proposed method is reflected by its ability to resist changes in the following three aspects: the number of training samples, the network hyper-parameters and the variable load conditions. Tables 3 and 7 show that the recognition accuracy exceeds 95% in two cases under the training sets with different proportions, which shows the robustness of the proposed method against the number of training samples. Figure 14 shows that the average diagnosis accuracy exceeds 95% when the three hyper-parameters are changed in turn, which shows the robustness of the proposed method against the network hyper-parameters. The recognition accuracy of fault diagnosis under variable loads in Case 2 is 94.55%, which shows the robustness of the proposed method against variable loads.
Based on the UoC gearbox dataset, the recognition accuracy of the proposed method exceeds that of the seven state-of-the-art methods in Refs. 42–48, which demonstrates its advancement over existing methods. The diagnosis accuracy of Case 2 is lower than that of Case 1, which may be because the vibration signal of Case 2 is more complex and contains more noise interference.
In combination with Figures 8 and 9, it can be found that FDMTF makes the characterizations of vibration signals of the same running status similar and those of different running statuses different, which shows that it can be used as the input of a CNN for identifying gearbox faults. In the two case studies, the training samples for each health status number in the dozens rather than the thousands, and a recognition accuracy of more than 98.8% is still obtained. With only 10% training samples in Case 1, 100% recognition accuracy can still be achieved. In Case 2, when there are only 25% training samples, the recognition accuracy can reach 95.72%. This suggests that a good representation of the 1D vibration signal offers a possible way of solving the small sample problem.
The MBRCNN network structure is designed to extract multi-scale features and to avoid a series of problems caused by the deep network structure. Figure 14 shows that the proposed model is insensitive to hyper-parameters with the help of MB blocks and residual blocks. However, to obtain high recognition accuracy, the learning rate still deserves attention. According to the analysis in Figure 14(c), a learning rate of about 0.001 is recommended.
The results in Tables 5 and 9 and Figure 11 show that the methods based on the time-domain signal are not suitable for characterizing the vibration signal of the QCC gearbox. There is noise interference in the QCC gearbox vibration signal,52,53 which may cause large differences between vibration signals even under the same health status. In addition, signal characterization methods such as GAF and MTF are affected by the phase of the time series itself, as shown by Bai et al. 32 and in this article. Therefore, 2D image encoding based on the FD signal is recommended in place of encoding based on the time-domain signal.
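The phase sensitivity discussed above can be illustrated with a small, self-contained sketch. It is not the implementation used in this article: the MTF below assumes quantile binning and a first-order transition matrix, the number of bins is arbitrary, the toy signal is a sum of harmonics rather than gearbox data, and a circular shift stands in for a change of phase.

```python
import cmath
import math

def amplitude_spectrum(x):
    """One-sided DFT magnitudes via a naive O(n^2) DFT (enough for a demo)."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2)]

def mtf(x, n_bins=4):
    """Markov transition field of a 1D sequence using quantile bins (assumed variant)."""
    n = len(x)
    ranked = sorted(x)
    edges = [ranked[(n * b) // n_bins] for b in range(1, n_bins)]
    bins = [sum(v >= e for e in edges) for v in x]
    # Row-normalized first-order transition matrix between bins
    W = [[0.0] * n_bins for _ in range(n_bins)]
    for a, b in zip(bins, bins[1:]):
        W[a][b] += 1.0
    for row in W:
        total = sum(row)
        if total:
            row[:] = [v / total for v in row]
    # Pixel (i, j) holds the transition probability from bin(x_i) to bin(x_j)
    return [[W[bi][bj] for bj in bins] for bi in bins]

n = 64
# Toy signal: harmonics 1..31 with amplitude 1/k (hypothetical, not measured data)
x = [sum(math.sin(2 * math.pi * k * t / n) / k for k in range(1, 32)) for t in range(n)]
shifted = x[5:] + x[:5]   # same waveform, window starts 5 samples later

spec_x, spec_s = amplitude_spectrum(x), amplitude_spectrum(shifted)
same_spectrum = all(abs(a - b) < 1e-9 for a, b in zip(spec_x, spec_s))
same_time_mtf = mtf(x) == mtf(shifted)
same_fd_mtf = mtf(spec_x) == mtf(spec_s)
print(same_spectrum, same_time_mtf, same_fd_mtf)  # True False True
```

In this toy setting the shift changes the time-domain MTF image but leaves the amplitude spectrum, and hence the image encoded from it, unchanged, which is consistent with the recommendation to encode the FD signal.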
Table 5 and Figure 11 show that the recognition accuracy obtained by time-domain MTF is the lowest among the 12 methods, lower than that of time-domain GAF and even lower than that of the time-domain waveform. However, the recognition accuracy obtained by FDMTF is higher than that obtained by FDGAF. The mechanism behind this reversal is not yet clear and is worthy of further research.
The recognition accuracy of FDGAF in Table 5 is more than 98%, which shows that it is also a good 2D image encoding method. In addition, it can be found that GADF is better than GASF in encoding both FD signals and time-domain signals. This result is consistent with the research results in Han et al. 31 and Zhang et al. 53. Although GASF has achieved satisfactory results in Tang et al. 29 and Bai et al. 32 for mechanical fault diagnosis, it is suggested that GADF be considered instead of GASF when GAF is used to encode vibration signals.
Table 6 shows that using the network parameters trained by the ImageNet dataset to fine-tune the network parameters based on FDMTF data is not necessarily suitable. The reason is that there may be great differences between the two kinds of image data. The method of pre-training parameters shown in Table 6 also failed to achieve the expected results. If the network model parameters trained by FDMTF of vibration signals collected from other gearboxes can be used, pre-training as well as fine-tuning may be helpful to reduce training costs of the QCC gearbox fault diagnosis model.
Zhang et al. 11 used domain adaptation technology to solve the problem that training data and testing data come from different load conditions. Seven of the 2D characterization methods in Table 9 achieve a recognition accuracy of more than 80%, including the 94.55% obtained by the proposed method, which does not use domain adaptation or other transfer learning techniques. Similar to the research results obtained in Zhang et al., 19 2D image representation of vibration signals combined with CNN can provide a potential idea for fault diagnosis under variable load conditions.
Conclusion
To accurately identify QCC gearbox faults, this article proposes a method, namely, FDMTF–MBRCNN, which combines the strengths of FDMTF and MBRCNN. Compared with time-domain MTF, the FDMTF can ignore the difference caused by the signal phase and is more robust to background noise. MBRCNN combines the advantages of MB and residual blocks. The experimental data collected from the UoC gearbox and the QCC scaled-down test rig achieve diagnosis accuracy of 100% and 98.85%, respectively, which demonstrates the effectiveness of the proposed method. A recognition accuracy of more than 94% can be obtained under different proportions of training samples, hyper-parameter combinations or variable loads, which demonstrates the robustness of the FDMTF–MBRCNN method. The specific conclusions are as follows:
A QCC gearbox fault diagnosis method integrating FDMTF and MBRCNN is proposed. Its effectiveness and superiority are proved by the two case analyses and the comparison with the other methods.
The FDMTF is used to encode 1D vibration signals into 2D images, whose recognition accuracy is higher than that of the other 11 methods. FDMTF is recommended for encoding the vibration signal rather than time-domain MTF. In addition, for fault diagnosis based on CNN, the 2D encoding method is recommended for FD signals instead of time-domain signals.
When using FDMTF as input, the MBRCNN can effectively extract the features of different signals. Its feature extraction ability and recognition accuracy are better than those of four comparison methods, namely, AlexNet and ResNet-18 under three different training strategies.
The FDMTF–MBRCNN method has the potential to solve the small sample problem and the variable load problem, which are hot issues in the current fault diagnosis field.
Our future work will further focus on the research of 2D representation of vibration signals to realize QCC gearbox fault diagnosis under variable speed conditions.
Footnotes
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
