A domain adaptation model based on multiscale residual networks for aeroengine bearing cross-domain fault diagnosis

Abstract

Rolling bearings are core components of rotating machinery, so their fault diagnosis has important practical engineering significance. Most current intelligent fault diagnosis methods assume that the training data and test data have similar probability distributions. In practical scenarios, however, the distribution of vibration signals inevitably shifts because of internal and external factors such as changes in working conditions, which significantly degrades the performance of intelligent diagnostic models. To address the problems that the feature distribution of rolling bearing vibration signals is inconsistent across working conditions and that labels for the samples to be diagnosed are difficult to obtain, this paper proposes a new domain-adaptive fault diagnosis method. First, a multi-scale feature extraction module extracts features from the input signals, and a residual network structure is used to avoid degradation of model performance. Then, the APReLU activation function allows the vibration signals to undergo different nonlinear transformations, learned adaptively from their own characteristics. Finally, the Joint Maximum Mean Discrepancy (JMMD) is used to reduce the shift of both the conditional and marginal distributions between domains. The method can therefore extract domain-invariant feature information and align the source and target domains, enabling cross-domain intelligent fault diagnosis. Six transfer fault diagnosis tasks based on a rolling bearing experimental platform are designed to evaluate the performance and effectiveness of the proposed method, and four popular methods are selected for comprehensive analysis and comparison. The results show that the method has good robustness and superior performance across the diagnostic tasks.

Keywords

Fault diagnosis, rolling bearing, domain adaptation, residual network

Introduction

The rolling bearing is the most frequently used key component in rotating machinery, which is widely used in modern industrial production. The health status of rolling bearings has an important impact on whether the entire mechanical equipment can run smoothly. Fault diagnosis therefore helps ensure production safety and personal safety in industrial production, and has great practical engineering significance.

Aeroengines often operate under extreme conditions such as high speed, high temperature, and overload, which makes them more vulnerable to failure. If an aeroengine fails, the flight performance of the aircraft is degraded, and in severe cases a flight accident may result. In addition, aeroengines are expensive, and the cost of maintenance and replacement is huge. It is therefore very important to adopt suitable methods for aeroengine fault diagnosis. Detecting and locating faults quickly and accurately can greatly reduce aircraft maintenance costs, avoid flight accidents, and ensure flight safety. For these reasons, aeroengine fault diagnosis has received increasing research attention.

A bearing vibration signal is a mixture of the vibration source signal and noise, and it contains a large amount of information about the operating status of the bearing. Therefore, the running status of the bearing can be diagnosed by processing and analyzing its vibration signal. In traditional fault diagnosis research, scholars process vibration signals and extract useful fault feature information from them. Many classical signal processing methods have emerged, such as the fast Fourier transform (FFT), short-time Fourier transform (STFT), and wavelet transform (WAT). On this basis, many optimized methods were developed. For example, Zhang et al.¹ proposed a parameter-adaptive variational mode decomposition (VMD) method to analyze vibration signals of rotating machinery, used the grasshopper optimization algorithm (GOA) to optimize the VMD parameters, and applied it to bearing fault feature extraction. Li et al.² proposed a bandwidth empirical mode decomposition and adaptive multiscale morphological analysis (BEMD-AMMA) method, which weakens the mode aliasing of the traditional EMD algorithm, and realized early fault diagnosis of rolling bearings by constructing the principal component of the original signal and demodulating it.

However, all of the above fault diagnosis methods require considerable professional knowledge of rotating machinery signal processing and a solid mathematical foundation. Moreover, in actual operation, mechanical equipment usually runs under more complex working conditions, and vibration signals change with rotating speed, load, and other working conditions, so the performance of fault diagnosis methods based on signal processing degrades. Therefore, some scholars have used machine learning-based fault diagnosis methods to make up for these deficiencies. Shao et al.³ proposed an intelligent mechanical fault diagnosis method that combines VMD with a support vector machine (SVM) and optimizes the SVM with the VHHO algorithm, which can effectively identify different bearing faults. Udmale et al.⁴ used spectral kurtosis (SK) to construct the fault feature set of vibration signals, identified the fault feature information with an extreme learning machine (ELM), and introduced an improved two-way search method based on local search to determine the ELM parameters, which improved the performance of the bearing fault diagnosis model.

With the rapid development of artificial intelligence, deep learning has been widely used in the field of fault diagnosis because of its strong learning ability and automatic feature extraction, and it has become a research hotspot. Deep learning can automatically extract deep fault information, effectively avoiding the limitations of selecting fault features manually. Cao et al.⁵ applied the particle swarm optimization (PSO) algorithm to the convolutional neural network (CNN) and proposed an adaptive deep convolutional neural network fault diagnosis model, which improves the diagnostic accuracy and robustness of the traditional deep learning fault diagnosis algorithm. Wen et al. proposed a rolling bearing fault classification method based on a CNN with an automatic learning rate scheduler (AutoLR-CNN), and improved the stability of the diagnosis model by using a dual CNN structure. Experiments showed that AutoLR-CNN has advanced fault diagnosis performance.⁶ Yan et al.⁷ proposed a stacked denoising autoencoder (SDA) fault diagnosis algorithm to automatically identify the rolling bearing status, and determined the parameters of the autoencoder model through the grasshopper optimization algorithm (GOA), which improves the fault classification accuracy of the autoencoder. Deep learning-based fault diagnosis algorithms such as autoencoders (AE) and CNN do not require traditional manual signal processing, can achieve end-to-end fault diagnosis, and obtain satisfactory results. In addition, when the network model is deep, the selection and optimization of parameters are very important. For parameter optimization, scholars generally use dedicated optimization algorithms, such as the multi-objective optimization technique,⁸ the Grasshopper Optimization Algorithm,⁹ and the Whale Optimization Algorithm.¹⁰

It is worth mentioning that deep learning is not only widely used in data-based fault diagnosis and classification, but is also important in the prediction of time series data. For example, Chen et al.¹¹ proposed a nonlinear combined short-term wind speed prediction method that combines the extreme learning machine (ELM), Elman neural network (ENN), and long short-term memory (LSTM) neural network, and achieved good prediction results. Zhao et al.¹² proposed a novel short-term traffic flow forecasting model, EnLSTM-WPEO, based on the LSTM neural network, which is of great significance in the field of intelligent transportation. Karasu and Altan¹³ used the CHGSO algorithm to select the effective features related to prediction, and used the LSTM neural network model to predict the crude oil time series, which addressed the chaotic and nonlinear characteristics of the original data.

However, the above deep learning-based fault diagnosis algorithms assume that the training set and the test set are independent and identically distributed (IID). In actual industry, because of changes in temperature, speed, load, and other factors, the distributions of the training set and test set usually drift apart and no longer satisfy this assumption. As a result, a fault diagnosis algorithm trained under the original working condition is difficult to apply to a new working condition, and its generalization performance is greatly reduced. Therefore, it is of great practical significance to solve the loss of model applicability caused by changes in data distribution under different working conditions, that is, to realize cross-domain fault diagnosis.

For the above cross-domain fault diagnosis problems, the most direct approach is to re-collect and label training data under the new working conditions and re-train the model, but this requires considerable manpower, material resources, and time. Therefore, scholars have applied transfer learning theory to fault diagnosis. Transfer learning solves related target domain problems using knowledge learned in the source domain: the model learns in the source domain (the training set) and transfers that knowledge to a different but related target domain (the test set), so as to solve new related tasks in the target domain and improve the generalization ability of the classification model.¹⁴⁻¹⁶

Domain adaptation is a special case of transfer learning. Its core idea is to map the source domain and target domain to a common feature space to eliminate the discrepancies between domains and re-form feature sets with the same distribution. Domain adaptation has strong cross-domain and cross-task learning ability, which makes it suitable for cross-condition diagnosis of rolling bearings, so that a fault diagnosis model trained under one working condition can adapt to fault diagnosis tasks under different working conditions. Adaptation is a process of continuously approaching the target: processing methods, parameters, boundary conditions, or constraints are adjusted automatically according to the data characteristics during processing and analysis, so as to achieve the best processing effect.¹⁷,¹⁸

CNN has a strong ability to extract features and process complex high-dimensional data owing to its local connectivity, weight sharing, and subsampling. Research on CNN-based domain adaptation fault diagnosis therefore has considerable room for development. Liu et al. proposed a deep learning-based domain adaptation fault diagnosis algorithm that consists of a stacked autoencoder (SAE) and a Softmax-based weighted domain discriminator, and trained the discriminator adversarially to minimize the approximate domain discrepancy distance. This algorithm improves the accuracy of fault diagnosis in partial domain adaptation (PDA) scenarios.¹⁹ Li et al. integrated the domain adaptation idea into CNN and proposed a deep convolutional transfer learning fault diagnosis algorithm, which uses a CNN as a feature extractor to extract fault information and solves the domain shift problem by minimizing the multi-kernel maximum mean discrepancy (MK-MMD) between the two domains. Experiments showed that this method can improve the cross-domain performance of the fault diagnosis model.²⁰

However, the above papers ignored the influence of vibration coupling on rolling bearings during operation, which gives the vibration signal multi-scale temporal characteristics in the time domain. Therefore, when bearing damage occurs, the fault features may also show multi-scale characteristics because of vibration coupling. Moreover, the Maximum Mean Discrepancy (MMD) used in the above papers ignores the conditional distribution discrepancy between the source domain and the target domain, so it cannot accurately measure the discrepancy between different domains. In short, the above papers leave room for improvement. To solve these problems, this paper builds on the original algorithms and proposes a new domain-adaptive fault diagnosis model.

To address the inconsistent feature distribution of fault status signals collected under different working conditions of the rolling bearing and the lack of labels for samples in the target domain, this paper proposes a domain adaptation transfer diagnosis method based on a multi-scale feature fusion residual network. The method is mainly composed of fault feature extraction and domain adaptation. The feature extraction layers extract the common features of the source domain and target domain; this part uses the pre-trained multi-scale feature fusion residual network to extract general features. The discrepancy between the source domain and the target domain is mainly reflected in the fully connected layer. In the domain adaptation part, the fully connected layer is used to learn the transfer knowledge, and the joint maximum mean discrepancy (JMMD) metric is used to calculate and minimize the joint distribution discrepancy, so as to realize transfer learning for rolling bearings under different working conditions. The main insights and contributions of this paper are summarized as follows:

In this paper, a domain-adaptive algorithm for intelligent fault diagnosis, the Residual Joint Adaptive Network based on multi-scale feature fusion, is proposed. The algorithm can classify and identify unlabeled target domain datasets by using source domain data with similar fault categories, so as to realize fault diagnosis in the target domain. The algorithm does not require any signal processing, which reduces its dependence on signal processing and on labeled target domain data. Experiments show that the algorithm has good adaptability and portability, and has a good application prospect.

In the fault feature extraction part, considering that the vibration signal has multi-scale characteristics, the algorithm uses a multi-scale feature extraction and fusion layer as the first layer, which not only accelerates model training but also extracts more comprehensive multi-scale fault features. The residual module is used to deepen the model and solve the degradation problem of deep neural networks. The APReLU activation function is introduced into the residual network to enhance the feature recognition ability of the network. Therefore, the algorithm can obtain more effective fault features in the feature extraction part.

In the domain adaptation part, the algorithm introduces JMMD to calculate the distribution discrepancy and minimizes it to achieve domain adaptation. The traditional MMD metric only considers the marginal distribution discrepancy between the source and target domains, whereas JMMD sums the marginal distribution discrepancy and the conditional distribution discrepancy to measure the discrepancy between domains. Therefore, this algorithm can measure the difference between domains more accurately when performing domain adaptation, and can obtain higher accuracy in the transfer learning problem of rolling bearings.

Finally, bearing vibration signal data sets under different faults and working conditions are obtained on the bearing fault test platform, and six groups of transfer learning bearing fault diagnosis tasks under different working conditions are carried out to explore the effectiveness of the proposed method. In addition, four common fault diagnosis methods based on deep networks and deep transfer learning are selected for comprehensive analysis, and the diagnosis results are interpreted. The results further demonstrate the effectiveness and superiority of this method.

The rest of this paper is organized as follows: the second part introduces the theoretical background of domain adaptation, feature extraction methods, and domain adaptation strategies; the third part explains the construction of the cross-domain fault diagnosis algorithm; the fourth part evaluates and discusses model performance through experiments; finally, the fifth part presents the conclusions and future work directions.

Theoretical background

Problem formulation-domain adaptation

This paper focuses on the problem of domain adaptation in fault diagnosis, which is a branch of transfer learning. The working conditions of aeroengine will change with the external environment and human factors. Even if the fault type of aeroengine rolling bearings is the same, the distribution of vibration signals will change with the change of working conditions. Therefore, the distribution of fault features in the source domain and target domain is usually different under different working conditions. Domain adaptation can map samples of different working conditions (source domain and target domain) to a common space to find their common features, so as to narrow the domain discrepancies and eventually form feature sets with the same distribution.

The data under one working condition is considered a domain $D = \{X, P(X)\}$, where $X = \{x_{1}, x_{2}, \dots, x_{n}\}$ is the sample space and $P(X)$ is the marginal probability distribution of $X$. The task of transfer learning is $T = \{Y, f(X)\}$, where $Y$ is the label space and $f(X)$ is the target classification model, which is essentially a conditional probability distribution $f(X) = Q(Y_{t} | X_{t})$.

$D_{s} = \{(x_{i}^{s}, y_{i}^{s})\}_{i=1}^{n_{s}}$ is the source domain; its data $X_{s} = [x_{1}^{s}, \dots, x_{n_{s}}^{s}]$ consist of $n_{s}$ labeled samples, and the corresponding learning task is $T_{s}$. $D_{t} = \{x_{j}^{t}\}_{j=1}^{n_{t}}$ is the target domain; its data $X_{t} = [x_{1}^{t}, \dots, x_{n_{t}}^{t}]$ consist of $n_{t}$ unlabeled samples, and the corresponding learning task is $T_{t}$, where $D_{s} \neq D_{t}$ and $T_{s} \neq T_{t}$.

In a real environment, interference from the external environment and changes in working conditions cause domain shift. The data distributions of the source domain $D_{s}$ and target domain $D_{t}$ therefore differ, that is, $P(X_{s}) \neq P(X_{t})$ and $Q(Y_{s} | X_{s}) \neq Q(Y_{t} | X_{t})$, which leads to a serious drop in the accuracy of the fault diagnosis classifier in the target domain, so a classification model trained on the source domain cannot be directly applied to the target domain. Domain adaptation maps samples of the source domain and target domain to a common subspace. By selecting the most appropriate common subspace, the feature distributions of the mapped source domain and target domain samples can be made as close as possible. The goal of domain adaptation is to fully learn the knowledge from $D_{s}$ and $D_{t}$ during training, extract transferable features, and build a cross-domain target classification model $f(x)$ that achieves good results on $D_{t}$ and completes the task $T_{t}$.

By using domain adaptation technology, the cross-condition diagnosis problem is transformed into a general classification problem that satisfies the independent and identical distribution. This technology can improve the generalization ability and robustness of fault diagnosis algorithm, and improve the adaptability of the model to other application scenarios.

Residual network

CNN can fully extract the main features of one-dimensional signals such as vibration signals and audio signals. It mainly consists of convolutional layers, pooling layers, and fully connected layers.

The function of the convolutional layer is to extract local features. Each convolution kernel in the convolutional layer convolves local regions of the samples with a fixed kernel size and stride, and extracts the corresponding features.

$$x_{n}^{l} = f\left(\sum_{m} x_{m}^{l-1} * w_{n}^{l} + b_{n}^{l}\right) \tag{1}$$

Where $x_{n}^{l}$ is the $n$th feature map of the $l$th layer, $x_{m}^{l-1}$ is the $m$th feature map of the $(l-1)$th layer, $w_{n}^{l}$ is the $n$th convolution kernel of the $l$th layer, $b_{n}^{l}$ is the bias, $*$ is the convolution operation, and $f$ is the non-linear activation function.
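Equation (1) can be illustrated with a minimal pure-Python sketch. It assumes a "valid" (no padding) sliding-window product as used by most deep learning frameworks, ReLU as the activation $f$, and one kernel slice per input feature map; the names `conv1d_layer`, `kernels`, and `biases` are hypothetical and do not come from the paper.

```python
def relu(v):
    """Non-linear activation f, assumed here to be ReLU."""
    return max(0.0, v)


def conv1d_layer(inputs, kernels, biases, stride=1):
    """Sketch of equation (1) for one convolutional layer.

    inputs:  list of M input feature maps (lists of floats, equal length).
    kernels: kernels[n][m] is the slice of the n-th kernel applied to input map m.
    biases:  biases[n] is the bias of the n-th output feature map.
    Returns a list of N output feature maps.
    """
    length = len(inputs[0])
    outputs = []
    for n, kernel in enumerate(kernels):
        k = len(kernel[0])
        out_len = (length - k) // stride + 1  # no padding assumed
        fmap = []
        for pos in range(out_len):
            start = pos * stride
            total = biases[n]
            # Sum the contributions of all input feature maps (the sum over m).
            for m, x_m in enumerate(inputs):
                total += sum(x_m[start + j] * kernel[m][j] for j in range(k))
            fmap.append(relu(total))
        outputs.append(fmap)
    return outputs


# One input map (a ramp) and one difference kernel of size 3.
signal = [[0.0, 1.0, 2.0, 3.0, 4.0, 5.0]]
edge_kernel = [[[-1.0, 0.0, 1.0]]]
features = conv1d_layer(signal, edge_kernel, biases=[0.0])
print(features)  # [[2.0, 2.0, 2.0, 2.0]]
```

Each output value is the slope of the ramp over a window of three points, which shows how a kernel responds to a local pattern regardless of where it occurs in the signal.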

The function of the pooling layer is to reduce the dimension of the feature space, simplify the computational complexity of the network, and reduce model parameters while retaining the main information.

$$x_{n}^{l} = \varsigma\left(\alpha_{n}^{l} \cdot \vartheta\left(x_{m}^{l-1}\right) + b_{n}^{l}\right) \tag{2}$$

Where $\varsigma$ is the activation function, $\alpha_{n}^{l}$ is the pooling layer weight, $\vartheta$ is the pooling operation, and $b_{n}^{l}$ is the bias.
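As a small sketch of equation (2), the following assumes max pooling as the pooling operation, an identity activation, and a window and stride of 2 (the Max-pool setting listed in Table 1). The function names and default weight and bias values are hypothetical.

```python
def max_pool1d(feature_map, window=2, stride=2):
    """Pooling operation: keep the largest value in each window."""
    out_len = (len(feature_map) - window) // stride + 1
    return [max(feature_map[i * stride:i * stride + window]) for i in range(out_len)]


def pooling_layer(feature_map, weight=1.0, bias=0.0, window=2, stride=2):
    """Equation (2) with an identity activation (an assumption of this sketch)."""
    pooled = max_pool1d(feature_map, window, stride)
    return [weight * p + bias for p in pooled]


pooled = pooling_layer([1.0, 3.0, 2.0, 5.0, 4.0, 0.5])
print(pooled)  # [3.0, 5.0, 4.0]
```

The feature map is halved in length while the dominant value of each window is retained.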

The function of the fully connected layer is to connect all features. It is a transition structure between the convolutional layer and the classifier, which converts the feature map into a one-dimensional array and sends it to the classifier.

$$x_{n}^{l} = \gamma\left(\beta_{n}^{l} \cdot x_{m}^{l-1} + b_{n}^{l}\right) \tag{3}$$

Where $\gamma$ is the activation function, $\beta_{n}^{l}$ is the fully connected layer weight, and $b_{n}^{l}$ is the bias.
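The transition from feature maps to class probabilities can be sketched as follows. The sketch assumes an identity activation in equation (3) followed by a Softmax classifier; the weights and the two-class setup are hypothetical values chosen only for illustration.

```python
import math


def flatten(feature_maps):
    """Convert a list of feature maps into a one-dimensional array."""
    return [v for fmap in feature_maps for v in fmap]


def fully_connected(features, weights, biases):
    """Equation (3) with an identity activation (an assumption of this sketch)."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, biases)]


def softmax(logits):
    """Numerically stable Softmax that turns logits into class probabilities."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


flat = flatten([[1.0, 2.0], [3.0, 4.0]])
logits = fully_connected(flat,
                         weights=[[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]],
                         biases=[0.0, 0.0])
probs = softmax(logits)
```

Here `logits` equals `[1.0, 4.0]`, so the second class receives the larger probability.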

The parameters of a CNN model are trained by backpropagation using the chain rule, which may cause vanishing or exploding gradients and makes training more difficult. To address this problem, He et al.²¹ proposed the residual network. A residual network feeds the output of an earlier layer directly into a later layer through cross-layer links, which effectively alleviates the vanishing gradient problem. It consists of a series of residual blocks, each of which includes convolutional layers, non-linear activation function layers, and a cross-layer link. The residual block is shown in Figure 1.

Figure 1.

Structure diagram of residual network.

In Figure 1, $a^{l}$ is the input of the $l$ th residual module, $F (\cdot)$ is the residual function, and the residual module can be expressed as:

$$a^{l+1} = a^{l} + F\left(a^{l}\right) \tag{4}$$

For a deeper $L$th residual module ($L > l$), applying equation (4) recursively gives:

$$a^{L} = a^{l} + \sum_{i=l}^{L-1} F\left(a^{i}\right) \tag{5}$$

It can be seen that the residual module adds its input directly to its output through the cross-layer link. Because of this additive term, the gradient can be passed directly back to shallower layers during backpropagation, which speeds up model training and helps avoid the degradation of model performance caused by vanishing or exploding gradients.
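Equations (4) and (5) can be checked numerically with a scalar toy model. The residual function `residual_fn` below is a hypothetical stand-in (a real block would use convolution and activation layers); the sketch only illustrates the unrolled sum and the identity term in the gradient.

```python
import math


def residual_fn(a):
    """Hypothetical residual branch F; a stand-in for conv + activation layers."""
    return 0.5 * math.tanh(a)


def forward(a_l, num_blocks):
    """Apply a^{l+1} = a^l + F(a^l) num_blocks times and return all activations."""
    activations = [a_l]
    for _ in range(num_blocks):
        a = activations[-1]
        activations.append(a + residual_fn(a))
    return activations


acts = forward(0.3, num_blocks=4)

# Equation (5): a^L equals a^l plus the sum of all intermediate residuals.
unrolled = acts[0] + sum(residual_fn(a) for a in acts[:-1])

# Central finite-difference estimate of d a^L / d a^l.
eps = 1e-6
grad = (forward(0.3 + eps, 4)[-1] - forward(0.3 - eps, 4)[-1]) / (2 * eps)
```

Because of the skip connection, the derivative of $a^{L}$ with respect to $a^{l}$ is 1 plus a term that depends on $F$, so in this toy model `grad` stays above 1 and the signal from deep layers is not attenuated by the stacked blocks.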

Adaptively parametric rectifier linear units

In a multi-layer neural network, the output of the upper neuron node and the input of the lower neuron node are connected by an activation function. The activation function introduces nonlinear characteristics into the neural network, so that the model can better learn complex nonlinear functions. The ReLU function is a commonly used activation function. Compared with other activation functions such as Sigmoid and tanh, it is closer to the biological neural activation mechanism, and has the advantages of simple calculation and fast convergence. The ReLU function can be expressed as $\mathrm{ReLU}(x) = \max(0, x)$, where $x$ is the input of the function. The ReLU function exhibits unilateral inhibition, which brings sparse activation to neurons, so that the sparse neural network can more fully mine relevant features and better fit the training data.

For the vibration signals of rolling bearings, the vibration signals under the same fault status may have different characteristics due to changes in temperature, load, speed, and other operating conditions, and the vibration signals under different fault statuses may also have similar characteristics at a certain moment due to changes in working conditions. In traditional deep learning, the activation function adopts the same nonlinear mapping transformation for all vibration signals, so the traditional ReLU activation unit may not be able to map all the input features into the correct regions.

In response to the above problems, in order to enhance feature recognition ability, Zhao et al.²² combined the attention mechanism with the PReLU activation function and proposed an adaptively parametric rectifier linear unit (APReLU), which can perform different nonlinear transformations on vibration signals through adaptive learning. Zhao et al. applied it in ResNet and proved through experiments that the ResNet-APReLU structure has strong advantages in the field of rolling bearing fault diagnosis. This structure can effectively optimize the parameters in the network and improve the accuracy of fault diagnosis. Therefore, in order to obtain higher diagnostic accuracy under non-stationary operating conditions, this paper introduces the above structure into the transfer learning method.

The PReLU activation function is a ReLU activation function with parameters proposed by He et al.,²³ which has stronger fitting ability than the traditional ReLU. Its formula is $\mathrm{PReLU}(x) = \max(0, x) + \alpha \cdot \min(0, x)$, where $\alpha$ is the learnable multiplicative weight parameter. Building on PReLU, APReLU adds an attention mechanism so that the multiplicative weight parameter of the nonlinear transformation is adjusted adaptively for each input. The attention mechanism makes the neural network focus on the key features of the data and lets the model select the features most helpful for the current task, thereby improving the information extraction ability of the model.
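The two formulas above translate directly into code. In this short sketch the coefficient `alpha` is passed in as a fixed hypothetical value, whereas in a trained network it would be learned.

```python
def relu(x):
    """ReLU(x) = max(0, x): negative inputs are suppressed to zero."""
    return max(0.0, x)


def prelu(x, alpha):
    """PReLU(x) = max(0, x) + alpha * min(0, x): negative inputs are scaled."""
    return max(0.0, x) + alpha * min(0.0, x)


print(relu(-2.0))         # 0.0
print(prelu(-2.0, 0.25))  # -0.5
print(prelu(3.0, 0.25))   # 3.0
```

Positive inputs pass through both functions unchanged; only the treatment of negative inputs differs.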

The structure of APReLU is shown in Figure 2. First, the input feature map is fed into two channels. The first channel computes the global information of the positive features: the feature map passes through the $\max(0, x)$ function and then a global average pooling (GAP) operation. The second channel computes the global information of the negative features: the feature map passes through the $\min(0, x)$ function and then a GAP operation. Next, the two one-dimensional vectors output by the two channels are concatenated and passed in turn through an FC layer, a BN layer, a ReLU function, an FC layer, a BN layer, and a Sigmoid function to obtain a one-dimensional vector, namely the multiplicative weight parameter. The number of neurons in each FC layer equals the number of channels in the input feature map. Finally, the multiplicative weight parameter is multiplied with the input feature map to obtain the final output feature map.

Figure 2.

Structure diagram of APReLU.

In the above process, the $\min(0, x)$ function retains the information of negative features, the GAP reduces the influence of shifts in the vibration signal, the BN layer speeds up model training, and, to prevent the multiplicative weight parameter $\alpha$ from becoming too large, the Sigmoid function maps $\alpha$ to the interval (0, 1).
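The APReLU data flow can be sketched for a single sample as follows. This is a simplified illustration under several assumptions: the BN layers are omitted because batch statistics are undefined for one sample, the coefficient is applied to the negative part of the input following the PReLU formula, and all weights (`w1`, `b1`, `w2`, `b2`) are hypothetical fixed values rather than learned parameters.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(vector, weights, biases):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, biases)]


def aprelu(feature_map, w1, b1, w2, b2):
    """Simplified APReLU forward pass (BN layers omitted in this sketch).

    feature_map: list of C channels, each a list of floats.
    w1: C x 2C weights and w2: C x C weights (hypothetical, normally learned).
    Returns the output feature map and the per-channel coefficients alpha.
    """
    # Global average pooling of the positive and negative parts per channel.
    pos_gap = [sum(max(0.0, v) for v in ch) / len(ch) for ch in feature_map]
    neg_gap = [sum(min(0.0, v) for v in ch) / len(ch) for ch in feature_map]
    # Concatenate, then FC -> ReLU -> FC -> Sigmoid.
    hidden = [max(0.0, h) for h in dense(pos_gap + neg_gap, w1, b1)]
    alpha = [sigmoid(z) for z in dense(hidden, w2, b2)]
    # Apply the adaptive coefficient to the negative part, as in PReLU.
    output = [[max(0.0, v) + a * min(0.0, v) for v in ch]
              for ch, a in zip(feature_map, alpha)]
    return output, alpha


x = [[1.0, -2.0, 3.0, -4.0], [-1.0, -1.0, 2.0, 0.0]]
w1 = [[0.5, -0.2, 0.1, 0.3], [-0.4, 0.6, 0.2, -0.1]]
w2 = [[1.0, -1.0], [0.5, 0.5]]
y, alpha = aprelu(x, w1, [0.0, 0.0], w2, [0.0, 0.0])
```

Each channel receives its own coefficient in (0, 1), computed from the statistics of that particular input, which is what distinguishes this unit from PReLU with a single learned constant.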

In this paper, all ReLU activation functions in the residual module are replaced by APReLU activation functions, and the final residual module is shown in Figure 3.

Figure 3.

Structure diagram of ResNet-APReLU.

Multi-scale feature extraction and fusion module

The rolling bearing will be affected by vibration coupling during operation, which will bring the time multi-scale characteristics to the vibration time domain signal. When the bearing is damaged, the fault characteristics will also show multi-scale characteristics. In view of the above problems, in order to effectively mine fault information in data and extract multi-scale fault features, in this section, a multi-scale feature extraction and fusion module is added to the residual network, and a residual network model based on multi-scale convolution feature fusion is proposed. Compared with the traditional convolution layer, the multi-scale feature extraction module has stronger feature extraction ability and can obtain more effective fault information. The whole multi-scale convolution feature extraction and fusion module is equivalent to a convolution block.

This section builds a multi-scale feature extraction and fusion module through parallel learning, adopts a parallel three-channel structure for multi-scale feature extraction, and uses convolution kernels of different sizes to extract multi-scale fault features from the signal. The convolution kernel sizes of the three branches are 64 × 1, 128 × 1, and 256 × 1, respectively. By using convolutions of different sizes to extract features of different scales, the model not only retains the details of the shallow layer but also integrates the deep information. Meanwhile, to further optimize the model and improve the diagnosis effect of the network, batch normalization and activation functions are added after each convolutional layer. Finally, the feature information of different scales is stacked and spliced together by a Concat layer to generate new output features. The residual network model framework based on multi-scale convolution feature fusion is shown in Figure 4.

Figure 4.

Structure diagram of residual network model based on multi-scale convolution feature fusion.
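The parallel-branch idea can be sketched with toy sizes. The sketch below is not the configuration of the paper: it uses kernel sizes 3, 5, and 7 as stand-ins for 64, 128, and 256, one kernel per branch, stride 1 with zero padding so that all branches keep the input length, ReLU in place of batch normalization plus activation, and concatenation along the channel axis. All names are hypothetical.

```python
import math


def conv1d_same(signal, kernel):
    """Stride-1 sliding-window product with zero padding (output length = input length)."""
    k = len(kernel)
    pad_left = (k - 1) // 2
    padded = [0.0] * pad_left + list(signal) + [0.0] * (k - 1 - pad_left)
    return [sum(padded[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal))]


def multi_scale_block(signal, branch_kernels):
    """Run parallel branches with different kernel sizes and stack their outputs."""
    channels = []
    for kernels in branch_kernels:          # one entry per branch
        for kernel in kernels:              # one output channel per kernel
            fmap = conv1d_same(signal, kernel)
            channels.append([max(0.0, v) for v in fmap])  # ReLU stands in for BN + activation
    return channels                         # concatenation along the channel axis


signal = [math.sin(0.5 * t) for t in range(16)]
branch_kernels = [
    [[1.0 / 3] * 3],   # short window: fine detail
    [[1.0 / 5] * 5],   # medium window
    [[1.0 / 7] * 7],   # long window: slower trends
]
fused = multi_scale_block(signal, branch_kernels)
print(len(fused), len(fused[0]))  # 3 16
```

Each branch views the same signal through a window of a different length, and the fused output exposes all of these views to the following residual modules.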

Table 1 lists the parameters of the convolutional neural network used in this paper. The input of the network is a vibration signal sample with a length of 2048 data points, and multiple hidden layers then extract increasingly abstract features layer by layer. To enhance the fault diagnosis ability of the model and obtain a deeper network with a larger receptive field, all convolutional layers after the second layer use 3 × 1 convolution kernels, and the number of convolution kernels in each module is twice that of the previous module.

Table 1.

Parameters of the neural network model.

Layer type	Window size/stride	Kernel number
Input	–	–
Multi-scale module
Conv1_1x	[64 × 1]/[6 × 1]	16
Conv1_2x	[128 × 1]/[8 × 1]	16
Conv1_3x	[256 × 1]/[12 × 1]	16
Max-pool	[2 × 1]/[2 × 1]	–
Residual module_1
Conv2_1x	[3 × 1]/[1 × 1]	32
Conv2_2x	[3 × 1]/[1 × 1]	32
Residual module_2
Conv3_1x	[3 × 1]/[1 × 1]	64
Conv3_2x	[3 × 1]/[1 × 1]	64
Residual module_3
Conv4_1x	[3 × 1]/[1 × 1]	128
Conv4_2x	[3 × 1]/[1 × 1]	128
AdaptiveAvg-pool	1	–
Fully connected	–	–
Softmax	7	–

Remark 1. The convolution kernel sizes of the multi-scale fusion layer and the number of residual modules were chosen through experimental comparison, balancing training speed against model accuracy.
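As a worked arithmetic example, the output length of each multi-scale branch in Table 1 can be estimated with the standard convolution length formula. The padding is not stated in this section, so the sketch assumes no padding; under that assumption the three branches produce different lengths, and how they are aligned before the Concat layer is not specified here. The helper name `conv_output_length` is hypothetical.

```python
def conv_output_length(input_length, kernel_size, stride, padding=0):
    """Standard output-length formula for a 1D convolution."""
    return (input_length + 2 * padding - kernel_size) // stride + 1


# (kernel size, stride) for each branch, taken from Table 1.
branches = {"Conv1_1x": (64, 6), "Conv1_2x": (128, 8), "Conv1_3x": (256, 12)}

# Input length of 2048 points, no padding assumed.
lengths = {name: conv_output_length(2048, k, s) for name, (k, s) in branches.items()}
print(lengths)  # {'Conv1_1x': 331, 'Conv1_2x': 241, 'Conv1_3x': 150}
```

The larger kernels are paired with larger strides, so the coarse-scale branches produce shorter and cheaper feature maps.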

Joint maximum mean discrepancy

The source domain data is defined as $X_{s} = \{x_{i}^{s}, y_{i}^{s}\}_{i=1}^{n_{s}}$, and the target domain data is defined as $X_{t} = \{x_{j}^{t}\}_{j=1}^{n_{t}}$, where $n_{s}$ is the number of samples in the source domain and $n_{t}$ is the number of samples in the target domain. To find an appropriate target classification model, it is necessary to reduce the distribution discrepancy $d(X_{s}, X_{t})$ between the source domain $X_{s}$ and the target domain $X_{t}$. Maximum mean discrepancy (MMD) is often used to measure the marginal distribution discrepancy between two distributions, and is a loss function commonly used in transfer learning methods to reduce data distribution discrepancies. The MMD between $X_{s}$ and $X_{t}$ is:

\begin{matrix} MMD_{H_{k}} (X_{s}, X_{t}) \\ = ‖ \frac{1}{n_{s}} \sum_{i = 1}^{n_{s}} φ (x_{i}^{s}) - \frac{1}{n_{t}} \sum_{j = 1}^{n_{t}} φ (x_{j}^{t}) ‖_{H_{k}}^{2} \end{matrix}

(6)

Where $H_{k}$ is the reproducing kernel Hilbert space (RKHS), $k$ is the kernel, and $φ$ is the nonlinear mapping into the RKHS. In this paper, $k$ is the Gaussian kernel function $K (X^{s}, X^{t}) = \exp (- \frac{‖ X^{s} - X^{t} ‖^{2}}{2 σ^{2}})$ , where $σ$ is the kernel width.
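As a minimal sketch of how Eq. (6) can be evaluated in practice, the squared RKHS norm can be expanded into three kernel averages, so that $φ$ never has to be computed explicitly. The function names, the default $σ = 1$, and the toy feature vectors below are assumptions made for illustration and are not taken from the paper.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd_squared(xs, xt, sigma=1.0):
    """Biased empirical estimate of the squared MMD in Eq. (6).

    ||mean phi(xs) - mean phi(xt)||^2 expands into
    mean k(xs, xs) + mean k(xt, xt) - 2 * mean k(xs, xt).
    """
    k_ss = sum(gaussian_kernel(a, b, sigma) for a in xs for b in xs) / len(xs) ** 2
    k_tt = sum(gaussian_kernel(a, b, sigma) for a in xt for b in xt) / len(xt) ** 2
    k_st = sum(gaussian_kernel(a, b, sigma) for a in xs for b in xt) / (len(xs) * len(xt))
    return k_ss + k_tt - 2.0 * k_st


# Toy feature vectors (hypothetical values, not data from the paper).
source = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3]]
near_target = [[0.1, 0.1], [0.2, 0.2], [0.0, 0.2]]
far_target = [[3.0, 3.1], [3.2, 2.9], [2.8, 3.0]]

print(mmd_squared(source, source))       # identical sets give zero discrepancy
print(mmd_squared(source, near_target))  # similar distributions: small value
print(mmd_squared(source, far_target))   # shifted distribution: larger value
```

The double loops cost O(n²) kernel evaluations per batch, which is one reason the discrepancy is usually computed on mini-batches of features rather than on the full dataset.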

Traditional MMD only considers the marginal distribution. In practical industrial scenarios, however, both the marginal and the conditional distributions of the source and target domains often differ, and each affects domain adaptation differently. Therefore, this section introduces the joint maximum mean discrepancy (JMMD).²⁴ JMMD fuses the marginal distribution discrepancy and the conditional distribution discrepancy. Experiments show that a model built on JMMD represents the actual distribution discrepancies between the source domain and target domain more faithfully and is more robust.²⁵

Calculation of the marginal distribution MMD: It can be represented by formula (6):

\begin{matrix} MMD_{H} (P_{s}, P_{t}) \\ = MMD_{H_{k}} (X_{s}, X_{t}) \\ = ‖ \frac{1}{n_{s}} \sum_{i = 1}^{n_{s}} φ (x_{i}^{s}) - \frac{1}{n_{t}} \sum_{j = 1}^{n_{t}} φ (x_{j}^{t}) ‖_{H_{k}}^{2} \end{matrix}

(7)

Calculation of conditional MMD: Assuming that there are $L$ category labels, for any category $l \in \{1, \ldots, L\}$ the conditional distribution discrepancy between $Q_{s} (x_{s} | y_{s} = l)$ and $Q_{t} (x_{t} | y_{t} = l)$ is:

\begin{matrix} MMD_{H} (Q_{s}^{(l)}, Q_{t}^{(l)}) \\ = ‖ \frac{1}{n_{s}^{l}} \sum_{i = 1}^{n_{s}^{l}} φ (x_{i}^{s, l}) - \frac{1}{n_{t}^{l}} \sum_{j = 1}^{n_{t}^{l}} φ (x_{j}^{t, l}) ‖_{H_{k}}^{2} \end{matrix}

(8)

Where $x_{i}^{s, l}$ and $x_{j}^{t, l}$ are the source domain data and target domain data labeled $l$ respectively, $n_{s}^{l}$ and $n_{t}^{l}$ are the quantity of source domain data and target domain data labeled $l$ respectively.

In summary, the formula for calculating the joint maximum mean discrepancy is

\begin{matrix} D_{H} (J_{s}, J_{t}) \\ = MMD_{H} (P_{s}, P_{t}) + \sum_{l = 1}^{L} MMD_{H} (Q_{s}^{(l)}, Q_{t}^{(l)}) \\ = ‖ \frac{1}{n_{s}} \sum_{i = 1}^{n_{s}} φ (x_{i}^{s}) - \frac{1}{n_{t}} \sum_{j = 1}^{n_{t}} φ (x_{j}^{t}) ‖_{H_{k}}^{2} + \\ \sum_{l = 1}^{L} ‖ \frac{1}{n_{s}^{l}} \sum_{i = 1}^{n_{s}^{l}} φ (x_{i}^{s, l}) - \frac{1}{n_{t}^{l}} \sum_{j = 1}^{n_{t}^{l}} φ (x_{j}^{t, l}) ‖_{H_{k}}^{2} \end{matrix}

(9)

Where $J_{s}$ and $J_{t}$ are the joint probability distributions of the source domain data and the target domain data respectively.
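The structure of Eq. (9), a marginal term plus one conditional term per class, can be sketched as follows. Because the target domain is unlabeled, the sketch assumes that pseudo-labels for the target samples are available (for example, from the classifier's current predictions); this is an assumption of the example, as are the function names, the rule of skipping classes that are absent from a batch, and the toy data.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd_squared(xs, xt, sigma=1.0):
    """Biased empirical squared MMD via the kernel expansion."""
    def mean_kernel(p, q):
        return sum(gaussian_kernel(a, b, sigma) for a in p for b in q) / (len(p) * len(q))
    return mean_kernel(xs, xs) + mean_kernel(xt, xt) - 2.0 * mean_kernel(xs, xt)


def jmmd(xs, ys, xt, yt_pseudo, num_classes, sigma=1.0):
    """Marginal MMD plus the sum of per-class conditional MMDs, following Eq. (9)."""
    total = mmd_squared(xs, xt, sigma)
    for label in range(num_classes):
        xs_l = [x for x, y in zip(xs, ys) if y == label]
        xt_l = [x for x, y in zip(xt, yt_pseudo) if y == label]
        # Assumption: a class missing from either batch contributes nothing.
        if xs_l and xt_l:
            total += mmd_squared(xs_l, xt_l, sigma)
    return total


# Toy two-class features (hypothetical values, not data from the paper).
xs = [[0.0, 0.0], [0.2, 0.1], [2.0, 2.0], [2.1, 1.9]]
ys = [0, 0, 1, 1]
xt_aligned = [[0.1, 0.0], [0.1, 0.2], [1.9, 2.1], [2.2, 2.0]]
xt_shifted = [[1.0, 1.0], [1.2, 1.1], [3.0, 3.0], [3.1, 2.9]]
yt_pseudo = [0, 0, 1, 1]

print(jmmd(xs, ys, xt_aligned, yt_pseudo, num_classes=2))  # well-aligned domains
print(jmmd(xs, ys, xt_shifted, yt_pseudo, num_classes=2))  # shifted target domain
```

In this toy case the shifted target yields a larger value than the aligned one, because both the marginal term and the per-class terms grow when the class clusters move.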

Proposed deep learning method

Build model

To address the inconsistent distribution of data characteristics collected under different working conditions of rolling bearings and the lack of labels for target domain samples, a domain adaptation transfer diagnosis method based on a multi-scale feature fusion residual network is proposed in this paper. The method consists of two parts: fault feature extraction and domain adaptation. The feature extraction part uses the pre-trained multi-scale feature fusion residual network to extract the common features of the source domain and the target domain. The difference between the source domain and the target domain is mainly reflected in the fully connected (FC) layer. The domain adaptation part therefore uses the FC layer to learn the transfer knowledge, and uses the JMMD metric to calculate and minimize the distribution discrepancy, so as to realize transfer learning for rolling bearings under different working conditions.

Based on this idea, the model diagram of the proposed method is shown in Figure 5. First, the bearing vibration signals under different working conditions and health statuses are preprocessed and divided into a training set and a test set. Second, to reduce the training time and accelerate model convergence, the pre-trained multi-scale feature fusion residual network is used for general feature extraction, and the JMMD metric is used for domain adaptation. Finally, a Softmax classifier is used for bearing fault diagnosis on the unlabeled target domain.

Figure 5.

Structure diagram of domain adaptation transfer diagnosis method based on multi-scale feature fusion residual network.

Training method

In this paper, the objective optimization function is minimized to train the domain adaptation model. For the rolling bearing domain adaptation fault diagnosis method proposed in this paper, the objective optimization function includes the following two items: The multi-class loss function $L_{C}$ between the source-domain sample predicted labels and the true labels, and the domain distribution discrepancy $L_{J}$ learned from cross-domain datasets. The form of the objective optimization function is

L = L_{C} + γ L_{J}

(10)

Where $γ$ is a trade-off parameter greater than 0, which is used to control how much $L_{J}$ participates in the training of the network.

The multi-class loss $L_{C}$ of the model is calculated using the cross-entropy loss function.

L_{C} = - \frac{1}{N} \sum_{n = 1}^{N} [\hat{y} \ln y + (1 - \hat{y}) \ln (1 - y)]

(11)

Where $N$ is the number of samples, $y$ is the label predicted by the model classifier for an input sample $x$ , and $\hat{y}$ is the actual label in the source domain.
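As a minimal sketch, Eq. (11) can be evaluated as printed on paired lists of actual labels and predicted probabilities. The function name `cross_entropy_eq11`, the clipping constant `eps`, and the toy values are assumptions for illustration and are not from the paper.

```python
import math


def cross_entropy_eq11(y_true, y_pred, eps=1e-12):
    """Evaluate Eq. (11) literally.

    y_true holds the actual labels (0 or 1) and y_pred the predicted
    probabilities. eps is an added safeguard that keeps log() finite.
    """
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip into (0, 1)
        total += t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return -total / len(y_true)


# Toy values (hypothetical): predictions that agree or disagree with the labels.
labels = [1, 0, 1]
good = cross_entropy_eq11(labels, [0.9, 0.1, 0.8])
bad = cross_entropy_eq11(labels, [0.1, 0.9, 0.2])
print(good, bad)  # the loss is lower when predictions match the labels
```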

After determining the optimization objective of the model, let $θ_{f}$ and $θ_{c}$ be the parameters of the feature extractor and the health status classifier respectively. The objective optimization function (10) can be rewritten as

L (θ_{f}, θ_{c}) = L_{C} (θ_{f}, θ_{c}) + γ L_{J} (θ_{f}, θ_{c})

(12)

In order to reduce the number of iterations required for convergence, improve the training speed, and reduce resource utilization, this paper selects the Mini-Batch Gradient Descent (MBGD) optimization algorithm to optimize the model. The algorithm introduces a batch size: assuming that a single batch contains $N$ samples, samples are randomly selected from the source domain and target domain to fill the batch, so that only part of the training set is used in each update, until all samples have been selected. In addition, $N$ should be chosen appropriately, since a poorly chosen batch size may degrade model accuracy.
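The batching procedure described above can be sketched as follows: indices are shuffled once per epoch and then consumed in chunks of the batch size until every sample has been used. The helper name `mini_batches` and the fixed seed are assumptions; the sample count of 4900 (700 × 7) and the batch size of 128 are used only as an illustration.

```python
import random


def mini_batches(num_samples, batch_size, seed=0):
    """Yield index batches that together cover every sample exactly once."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # random selection without replacement
    for start in range(0, num_samples, batch_size):
        yield indices[start:start + batch_size]


batches = list(mini_batches(num_samples=4900, batch_size=128))
print(len(batches))      # 39 batches per epoch
print(len(batches[-1]))  # the last batch holds the remaining 36 samples
```

In a domain adaptation setting, one such index stream would typically be drawn for the source domain and another for the target domain, so that each update sees a batch from both.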

Based on the above formula and the MBGD algorithm, the parameters $θ_{f}$ and $θ_{c}$ are updated as follows:

\begin{matrix} θ_{f} \leftarrow θ_{f} - ε (\frac{\partial L_{C}}{\partial θ_{f}} + γ \frac{\partial L_{J}}{\partial θ_{f}}) \\ θ_{c} \leftarrow θ_{c} - ε (\frac{\partial L_{C}}{\partial θ_{c}} + γ \frac{\partial L_{J}}{\partial θ_{c}}) \end{matrix}

(13)

Where $ε$ is the learning rate.
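The update rule in Eq. (13) can be illustrated on a scalar toy problem. The quadratic stand-ins for $L_{C}$ and $L_{J}$, the values of $γ$ and $ε$, and the function name are assumptions chosen so that the minimizer of $L_{C} + γ L_{J}$ is known in closed form; they are not the losses used in the paper.

```python
def combined_step(theta, grad_c, grad_j, lr, gamma):
    """One gradient step: theta <- theta - lr * (dL_C/dtheta + gamma * dL_J/dtheta)."""
    return theta - lr * (grad_c + gamma * grad_j)


gamma, lr = 0.5, 0.1  # hypothetical trade-off parameter and learning rate
theta = 0.0
for _ in range(200):
    grad_c = 2.0 * (theta - 1.0)  # gradient of the toy loss L_C = (theta - 1)^2
    grad_j = 2.0 * theta          # gradient of the toy loss L_J = theta^2
    theta = combined_step(theta, grad_c, grad_j, lr, gamma)

print(theta)  # approaches 1 / (1 + gamma), the minimizer of L_C + gamma * L_J
```

A larger $γ$ pulls the solution further toward the minimizer of $L_{J}$, which mirrors the role of the trade-off parameter in Eq. (10).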

Experimental verification

Datasets preparation

In order to further verify the validity and effectiveness of the proposed method, this paper uses the dynamic balance bearing fault experiment platform of the research team to obtain a real bearing fault dataset. The platform is mainly composed of a variable speed drive motor, JP-680T electrical measurement system, coupling, healthy bearing, test bearing, acceleration sensor, displacement sensor, rotational speed sensor, and OROS vibration signal acquisition and analysis system. In addition, the bearings used are SKF bearings, and the sampling frequency is 25.6 kHz.

Four bearing health statuses are considered in the experiment: normal (NO), inner race fault (IF), outer race fault (OF), and rolling ball fault (BF). Two fault sizes are used, 0.5 and 0.8 mm. In addition, fault simulation experiments at rotating speeds of ( $A$ ) 3000 rpm, ( $B$ ) 4000 rpm, and ( $C$ ) 5000 rpm were designed to simulate different operating conditions. Bearing vibration signals were then collected for each combination of working condition, fault type, and fault size. Each working condition contains signals of one healthy status and six fault statuses, seven categories in total. Every 2048 data points constitute a sample, and each category contains 700 samples, so 700 × 7 samples are taken for each working condition. The details of the dataset are shown in Table 2.

Table 2.

Bearing fault parameter design.

Fault label	Fault type	Fault size/mm	Rotating speed/rpm	Number of samples
0	NO	–	3000/4000/5000	700 × 7
1	IF	0.5	3000/4000/5000	700 × 7
2	IF	0.8	3000/4000/5000	700 × 7
3	OF	0.5	3000/4000/5000	700 × 7
4	OF	0.8	3000/4000/5000	700 × 7
5	BF	0.5	3000/4000/5000	700 × 7
6	BF	0.8	3000/4000/5000	700 × 7

When dividing the sample data, in order to make full use of the data and extract more effective features, a data enhancement method with equally spaced sliding windows is used to sample the original data with overlap. The number of samples obtained by overlapping sampling is $N = \frac{L_{1} - L_{2}}{S} + 1$ , where $N$ is the number of samples after overlapping sampling, $L_{1}$ is the length of the original data, $L_{2}$ is the length of a single sample, that is, the window width, and $S$ is the moving step of the sliding window, that is, the sampling interval. In this paper, $L_{2} = 2048$ and $S = 30$ .
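The overlapping-sampling formula can be checked with a short sketch. The window width of 2048 and the step of 30 come from the text; the record length $L_{1}$ is not stated, so the value below is a hypothetical one chosen so that the formula yields 700 samples. The function names are also assumptions.

```python
def num_windows(total_length, window, step):
    """N = (L1 - L2) / S + 1, with floor division dropping an incomplete last window."""
    return (total_length - window) // step + 1


def sliding_windows(signal, window, step):
    """Cut a 1-D signal into overlapping windows of equal length."""
    return [signal[i:i + window] for i in range(0, len(signal) - window + 1, step)]


window, step = 2048, 30               # L2 and S from the text
record_length = 699 * step + window   # hypothetical L1 = 23018 points
signal = list(range(record_length))   # placeholder for a vibration record

samples = sliding_windows(signal, window, step)
print(len(samples), num_windows(record_length, window, step))  # 700 700
```

With a step of 30 and a width of 2048, consecutive samples share 2018 points, which is how a fairly short record can be expanded into hundreds of training samples.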

In order to verify the effectiveness of the proposed fault diagnosis method, a total of six (3 × 2 ordered source-target pairs) cross-working-condition fault diagnosis tasks are set up for comprehensive analysis according to the three different speeds: $A \to B$ , $A \to C$ , $B \to A$ , $B \to C$ , $C \to A$ , $C \to B$ . Each task takes one working condition as the source domain and another as the target domain. The target tasks cover not only different operating conditions, but also different levels of failure severity. During the transfer stage, the source domain data has labels, and the target domain data has no labels. Labeled data in the source domain and part of the data in the target domain are used for training, and the rest of the data in the target domain is used for testing.
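The six transfer tasks are simply the ordered source-target pairs of the working conditions A, B, and C defined earlier, which can be enumerated as a small sketch (the variable names are assumptions):

```python
from itertools import permutations

# Working conditions and rotating speeds (rpm) from the experiment description.
speeds = {"A": 3000, "B": 4000, "C": 5000}

# Every ordered (source, target) pair with source != target.
tasks = [f"{src}->{tgt}" for src, tgt in permutations(speeds, 2)]
print(tasks)  # ['A->B', 'A->C', 'B->A', 'B->C', 'C->A', 'C->B']
```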

Experimental results and analysis

Diagnosis result analysis

In order to comprehensively evaluate the algorithm and verify the superiority and portability of the proposed method, this section compares this method with other commonly used fault diagnosis methods, which are described in detail below.

In the comparison experiment of fault diagnosis models, this paper uses the same dataset and data division method for all methods, and each model is run with its own optimal parameters. In addition, the feature extraction part of each comparison method is similar to that of the proposed method. All experiments were performed on the PyTorch platform, and in order to reduce the randomness of the experiments, each category of experiments was repeated 10 times and the results were averaged. The main hyperparameters are set as follows: the batch size is 128, the number of epochs is 100, the optimization algorithm is MBGD, and the initial learning rate is 1e−4.

CNN: Use a CNN without domain adaptation. The CNN is trained on the source domain dataset and tested on the target domain data, which is used to demonstrate the inherent generalization ability of the model.

Deep domain confusion (DDC): DDC uses an adaptation layer to connect the two CNNs and calculates the MMD distance in the last layer of the model to reduce the discrepancy between the source and target domains.²⁶

Deep adaptation network (DAN): DAN combines CNN with multi-kernel MMD, and calculates the MK-MMD distance at the first, second, and third layers from the bottom, which is a classic method of deep transfer learning.²⁷

Joint adaptation networks (JAN): JAN uses the JMMD distance, combining joint distribution adaptation and adversarial learning.²⁸ The model does not take into account the multi-scale fusion module and residual structure, which is used to verify the impact of these techniques on the overall performance of the model.

The experimental results of the six groups of transfer tasks for the five fault diagnosis methods are shown in Table 3 and Figure 6. The histogram in Figure 6(a) compares the methods on each task: the proposed algorithm achieves the highest accuracy in all six transfer learning tasks. The radar chart in Figure 6(b) compares each method across tasks: all methods have lower accuracy on tasks $A \to C$ and $C \to A$ than on the other tasks. This is likely due to the large difference between working condition A and working condition C, yet the proposed algorithm still maintains accuracies of 97.13% and 96.82% on tasks $A \to C$ and $C \to A$ , respectively. The results also show that the transfer learning methods based on domain adaptation outperform the deep learning method without domain adaptation (CNN), so domain adaptation is of great significance for practical diagnostic demands.

Table 3.

Classification accuracies of the different methods.

	CNN	DDC	DAN	JAN	Proposed
A–B	83.12 ± 2.16	94.51 ± 0.53	95.21 ± 0.72	97.62 ± 0.92	98.67 ± 0.39
A–C	75.83 ± 1.81	87.74 ± 1.23	91.16 ± 0.59	95.46 ± 1.24	97.13 ± 0.62
B–A	82.58 ± 3.29	93.37 ± 0.85	95.36 ± 0.51	97.83 ± 0.53	98.26 ± 0.41
B–C	80.15 ± 4.02	91.62 ± 2.86	93.82 ± 1.35	96.16 ± 0.76	97.37 ± 0.68
C–A	76.71 ± 2.98	88.28 ± 1.17	90.06 ± 2.36	93.65 ± 0.20	96.82 ± 0.75
C–B	79.52 ± 3.73	90.61 ± 0.93	93.76 ± 1.31	96.71 ± 0.58	97.15 ± 0.53
Average	79.65	91.02	93.39	96.24	97.57

Figure 6.

Accuracies of different tasks under different fault diagnosis models: (a) histogram and (b) radar chart.

Although the other domain adaptation methods achieve average accuracies above 90%, their accuracy fluctuates noticeably across tasks. In particular, their accuracy on the $A \to C$ and $C \to A$ tasks is clearly lower than on the other tasks, and a gap remains compared with the proposed method. In contrast, the proposed method reaches an average test accuracy of 97.57%, higher than the other fault diagnosis methods, with generally small standard deviations across the transfer tasks, which indicates better robustness. These results demonstrate the effectiveness and superiority of the proposed method.
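The mean ± standard deviation entries reported for repeated runs can be computed as in the sketch below. The accuracy values are hypothetical placeholders rather than results from the paper, and the use of the sample standard deviation (rather than the population form) is an assumption, since the text does not say which was used.

```python
import statistics


def summarize_runs(accuracies):
    """Return the mean and sample standard deviation of repeated runs."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# Hypothetical accuracies (%) from 10 repeated runs, not results from the paper.
runs = [98.1, 98.9, 98.6, 99.0, 98.4, 98.7, 98.3, 99.1, 98.8, 98.5]
mean, std = summarize_runs(runs)
print(f"{mean:.2f} ± {std:.2f}")
```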

In addition, the confusion matrix allows the classification accuracy of each fault diagnosis method to be analyzed in more detail. The task $B \to C$ is randomly selected, and the confusion matrices of the diagnostic results of the above methods are shown in Figure 7. It can be seen that the accuracy of DAN, JAN, and the proposed method exceeds 90% for each condition, and the differences in recognition among the seven categories are small. However, compared with the proposed method, DAN and JAN still produce more misclassifications, which further illustrates the superiority of the proposed method.

Figure 7.

Confusion matrix of different fault diagnosis models: (a) CNN, (b) DDC, (c) DAN, (d) JAN, and (e) proposed.
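For reference, a confusion matrix and the per-class accuracy read from it can be computed as in the following sketch. The function names, the convention that rows are true labels and columns are predicted labels, and the three-class toy labels are assumptions for illustration; the experiments in this paper use seven classes.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Count predictions; rows are true labels, columns are predicted labels."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_accuracy(matrix):
    """Diagonal count divided by the row total for each true class."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


# Toy labels (hypothetical): one sample of class 1 is mistaken for class 2.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

cm = confusion_matrix(y_true, y_pred, num_classes=3)
print(cm)                      # [[2, 0, 0], [0, 1, 1], [0, 0, 2]]
print(per_class_accuracy(cm))  # [1.0, 0.5, 1.0]
```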

Feature visualization analysis

Due to the poor interpretability of deep learning, in order to more intuitively analyze the feature extraction ability of the proposed domain adaptation model, this section uses t-SNE (t-distributed stochastic neighbor embedding)²⁹ to visualize the learned features in a two-dimensional space. The transfer learning task $C \to A$ is randomly selected, and the visual analysis results of the testing process of the proposed method and the other four comparison methods are shown in Figure 8.

Figure 8.

Feature visualization by t-SNE of different fault diagnosis models: (a) CNN, (b) DDC, (c) DAN, (d) JAN, and (e) proposed.

As can be seen in Figure 8(a), the features learned by the CNN method are heavily mixed and overlapping, with poor distinguishability, indicating that the knowledge learned from the source domain is not well generalized to the target domain.

In contrast, Figure 8(b) and (c) show that the feature clustering of these two methods is improved, but some discrepancies between the source and target domains remain, along with some misclassifications. In Figure 8(d) and (e), samples of the same fault category from the source and target domains are clustered more closely together, and the proposed method separates the features of different conditions better. To sum up, the proposed method has better classification performance and domain adaptation ability.

Therefore, the proposed method in this paper has better clustering ability and distinguishing ability, and the shared features can be extracted more effectively. This method is not only able to learn fault discriminative features for accurate condition recognition, but also has strong transferability to reduce domain differences.

Conclusion

In this paper, an intelligent fault diagnosis algorithm based on domain adaptation is proposed for rotating machinery systems. Considering that the characteristic distribution of rolling bearing vibration signals collected under different working conditions is inconsistent and the samples to be diagnosed have no labels, the algorithm consists of a feature extraction part and a domain adaptation part, which together improve the generalization performance and classification accuracy of the model in the target domain. In the feature extraction part, the multi-scale feature extraction module is used to extract features from the input signal to maximize the effective features of the fault data, and the APReLU activation function is introduced into the residual network to further enhance the feature recognition ability of the network. In the domain adaptation part, a more comprehensive domain adaptation framework is used, which introduces the Joint Maximum Mean Discrepancy (JMMD) to optimize the domain adaptation model. Finally, the experimental results and the comparison with other fault diagnosis methods show that the proposed model has excellent fault identification ability and domain adaptation performance, obtains more accurate classification results, and remains stable across the transfer scenarios considered.

Although the proposed method has achieved good experimental results, some limitations remain. First, the current dataset is relatively balanced among the different fault statuses, whereas in practical applications it is usually easier to obtain bearing data for the healthy state, while data for the different fault states are scarce. Therefore, it should be further investigated whether the model can effectively extract domain-invariant features between the source domain and target domain under unbalanced training datasets. In addition, a runtime analysis shows that the computational cost of the proposed model is relatively high, which makes it unsuitable for online scenarios with strict real-time requirements. These limitations will be investigated in future work.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work is supported by Key Laboratories for National Defense Science and Technology (6142605200402), National Key Laboratory of Science and Technology on Helicopter Transmission (Grant No. HTL-O-21G11), the Aeronautical Science Foundation of China (20200007018001), the Aero Engine Corporation of China Industry-University-Research Cooperation Project (HFZL2020CXY011), and the Research Fund of State Key Laboratory of Mechanics and Control of Mechanical Structures (Nanjing University of Aeronautics and Astronautics, MCMS-I-0121G03).

ORCID iDs

Pu Yang

Huilin Geng

Peng Liu

References

1. Zhang, Miao, Zhang, et al. A parameter-adaptive VMD method based on grasshopper optimization algorithm to analyze vibration signals from rotating machinery. Mech Syst Signal Process 2018; 108: 58–72.

2. Liang, et al. Application of bandwidth EMD and adaptive multiscale morphology analysis for incipient fault diagnosis of rolling bearings. IEEE Trans Ind Electron 2017; 64: 6506–6517.

3. Shao, Tan, et al. Coordinated approach fusing time-shift multiscale dispersion entropy and vibrational Harris hawks optimization-based SVM for fault diagnosis of rolling bearing. Measurement 2021; 173: 108580.

4. Udmale, Singh SK. Application of spectral kurtosis and improved extreme learning machine for bearing fault classification. IEEE Trans Instrum Meas 2019; 68: 4222–4233.

5. Cao, Wang, Yue, et al. Rolling bearing fault diagnosis of launch vehicle based on adaptive deep CNN. J Vibr Shock 2020; 39: 97–104, 149.

6. Wen, Gao, et al. Convolutional neural network with automatic learning rate scheduler for fault classification. IEEE Trans Instrum Meas 2021; 70: 1–12.

7. Yan, Liu, Jia. Health condition identification for rolling bearing using a multi-domain indicator-based optimized stacked denoising autoencoder. Struct Health Monit 2020; 19: 1602–1626.

8. Karasu, Altan, Bekiros, et al. A new forecasting model with wrapper-based feature selection approach using multi-objective optimization technique for chaotic crude oil time series. Energy 2020; 212: 118750.

9. Okur, Altan. Grasshopper optimization algorithm-based adaptive control of extruder pendulum system in 3D printer. In: 2021 innovations in intelligent systems and applications conference (ASYU), Istanbul, Turkey. 2021, pp. 1–6.

10. Altan, Parlak. Adaptive control of a 3D printer using whale optimization algorithm for bio-printing of artificial tissues and organs. In: 2020 innovations in intelligent systems and applications conference (ASYU), Istanbul, Turkey. 2020, pp. 1–5.

11. Chen, Zeng, et al. A two-layer nonlinear combination method for short-term wind speed prediction based on ELM, ENN, and LSTM. IEEE Internet Things J 2019; 6(4): 6997–7010.

12. Zhao, Zeng, KD. EnLSTM-WPEO: short-term traffic flow prediction by ensemble LSTM, NNCT weight integration, and population extremal optimization. IEEE Trans Veh Technol 2020; 69(1): 101–113.

13. Karasu, Altan. Crude oil time series prediction model based on LSTM network with chaotic Henry gas solubility optimization. Energy 2022; 242: 122964.

14. Huang, et al. A perspective survey on deep transfer learning for fault diagnosis in industrial scenarios: theories, applications and challenges. Mech Syst Signal Process 2022; 167: 108487.

15. Tan, Sun, Kong, et al. A survey on deep transfer learning. In: Artificial neural networks and machine learning – ICANN 2018, vol. 11141. 2018, pp. 270–279. Springer, Cham.

16. Ben-David, Blitzer, Crammer, et al. A theory of learning from different domains. Mach Learn 2010; 79: 151–175.

17. Wang. Parameter estimation and adaptive control for servo mechanisms with friction compensation. IEEE Trans Ind Inform 2020; 16(11): 6816–6825.

18. Wang. Approximation-free control for nonlinear helicopters with unknown dynamics. IEEE Trans Circuits Syst II Express Briefs 2022; 69: 3254–3258.

19. Liu, Wei, et al. A stacked auto-encoder based partial adversarial domain adaptation model for intelligent fault diagnosis of rotating machines. IEEE Trans Ind Inform 2021; 17: 6798–6809.

20. Zhang, Ding, et al. Multi-layer domain adaptation method for rolling bearing fault diagnosis. Signal Process 2019; 157: 180–197.

21. Zhang, Ren, et al. Deep residual learning for image recognition. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR), Las Vegas, NV, USA. 2016, pp. 770–778.

22. Zhao, Zhong, et al. Deep residual networks with adaptively parametric rectifier linear units for fault diagnosis. IEEE Trans Ind Electron 2021; 68: 2587–2597.

23. Zhang, Ren, et al. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: 2015 IEEE international conference on computer vision (ICCV), Santiago, Chile. 2015, pp. 1026–1034.

24. Han, Liu, Yang, et al. Deep transfer network with joint distribution adaptation: A new intelligent fault diagnosis framework for industry application. ISA Trans 2020; 97: 269–281.

25. Long, Wang, Ding, et al. Transfer feature learning with joint distribution adaptation. In: 2013 IEEE international conference on computer vision (ICCV), Sydney, NSW, Australia. 2013, pp. 2200–2207.

26. Tzeng, Hoffman, Zhang, et al. Deep domain confusion: maximizing for domain invariance. arXiv preprint arXiv:1412.3474. 2014.

27. Long, Cao, Wang, et al. Learning transferable features with deep adaptation networks. In: Proceedings of the 32nd international conference on machine learning, Lille, France. 2015, pp. 97–105. New York, NY: PMLR.

28. Long, Zhu, Wang, et al. Deep transfer learning with joint adaptation networks. In: International conference on machine learning, 2016, pp. 2208–2217. New York, NY: PMLR.

29. Van der Maaten, Hinton. Visualizing data using t-SNE. J Mach Learn Res 2008; 9(11).