Sage Journals: Discover world-class research

Abstract

As the digitalization of industrial assets advances, data-driven fault diagnosis has increasingly garnered attention. However, models often underperform due to the lack of sufficient training data and the complexity of operational environments. In scenarios where a similar task with abundant data exists in the source domain, leveraging the knowledge embedded in this source data could be key to constructing an effective diagnostic model for the target domain. Following this idea, this study introduces a novel cross-domain decision method, weighted structure expansion and reduction (WSER), for fault diagnosis. This method initially extracts features from the time, frequency, and time-frequency domains. It then estimates data weights following the idea of instance transfer to mitigate the dissimilarity between the source and target data distributions. Based on these estimated weights, feature selection is further performed. The extracted source knowledge is subsequently transferred to the target domain using the proposed WSER method. The proposed method is applied on two public engineering fault datasets, and the results demonstrate the effectiveness of the proposed method in increasing the accuracy of fault diagnosis.

Keywords

Fault diagnosis, transfer learning, feature extraction, feature selection, decision tree

Introduction

In recent years, the advent of increasingly complex systems in fields such as manufacturing, aviation, and power generation has given rise to unprecedented challenges in fault diagnosis. These systems, characterized by their intricate network of interconnected components and subsystems, require sophisticated diagnostic techniques to ensure their reliable and safe operation. Furthermore, faults within these systems could have far-reaching implications, which may cause performance degradation, system failure, and even catastrophic accidents.¹ Therefore, developing effective and robust methods of fault diagnosis can be of paramount importance.

The existing fault diagnosis methods can be classified into model-based, signal-based, and knowledge-based methods.² The knowledge-based methods, also known as data-driven methods, extract the underlying knowledge about a system from collected historical data, without previously known models or signal patterns. Such methods can be effective for complex systems where explicit system models or signal symptoms are hard to establish. Currently, machine learning methods have been widely applied in data-driven fault diagnosis,³ such as support vector machine (SVM), k-nearest neighbor (KNN), and neural network (NN).^1,4,5 These machine learning methods can establish well-performing models when a large amount of data is available.

However, in some cases, there may not be enough data for model training, such as the situations where the complex systems are newly applied or used infrequently, and the data collection may cost too much time. In such cases, the model trained with insufficient data may not perform well on the target task of fault diagnosis. In addition, machine learning methods work well under a general assumption: the training data and the testing data should be drawn from the same distribution.⁶ Even if there exists a large amount of data collected from a similar system, the obtained model trained using the data can still perform poorly on the target task with a different data distribution.

To address the problems mentioned above, more and more attention has been paid to transfer learning. Transfer learning methods are proposed to learn and transfer shared knowledge from a similar domain (source domain) to the current domain (target domain).^7,8 Transfer learning methods have been widely applied for fault diagnosis. For example, Zhao et al.⁹ proposed a transfer learning method based on bidirectional gated recurrent unit and manifold embedded distribution alignment, to tackle the fault diagnosis problem with limited labeled data; Wu et al.¹⁰ developed an adaptive deep transfer learning method for bearing fault diagnosis, which was constructed based on instance transfer and feature transfer; Liu et al.¹¹ proposed a transfer learning method based on model transfer for fault diagnosis in building chillers; and Yang et al.¹² proposed a feature-based transfer learning neural network to identify the healthy conditions of real machines with the diagnosis knowledge obtained from experimental machines. The above-mentioned studies demonstrate the effectiveness of transfer learning in tackling fault diagnosis problems with few or even without labeled data, and transfer learning is thus studied in this paper.

In addition, many machine learning methods are considered black boxes, where the process of generating decisions can be difficult for decision makers to understand. The generated model may not be suitable for high-stakes decision making.¹³ In high-stakes decisions there are often considerations outside the collected data that need to be combined with a risk calculation. It may be hard to manually calibrate how much the additional information adjusts the estimated risk with a black-box model. For example, in fault diagnosis, there could be some conditions that are not easily collected as data but are very useful for the diagnosis of a specific system or component. Besides, it can be unclear what factors are considered in the construction of the model, which could lead to risky or unreliable results. For example, the chatbot Tay in 2016, designed to continuously learn and improve through interactions, became a “troubled girl” embodying gender discrimination and racial prejudice within 24 h of engaging with humans. The black-box nature of machine learning algorithms, in which the decision-making process becomes opaque and difficult to trace, exacerbates the potential for unintended consequences.

Building the diagnosis model with an explainable method, such as a decision tree, can thus be essential for complex systems that require high reliability. Among existing machine learning methods, decision tree, logistic regression, and linear regression are more explainable than others from the perspective of the model. Compared with the linear methods, a decision tree can capture nonlinear relationships and extract more complicated patterns. Currently, some transfer learning methods based on decision trees have been proposed, such as the Structure Expansion Reduction (SER), Structure Transfer (STRUT), and Transfer in Decision Trees (TDT) methods.^14–16 Among these methods, STRUT keeps the structure of the source decision tree but adjusts its threshold values, whereas SER and TDT use the labeled target data to adjust the structure of the decision tree trained using the source data, which makes them more flexible than STRUT when the domains are dissimilar. Unlike TDT, SER conducts a reduction operation after expanding the source decision tree to further improve the tree structure. The SER method is thus the focus of this study.

Existing transfer learning methods can be classified into four categories according to the transferred objects: instance, feature, model, and relationship.¹⁷ SER is a model-based transfer learning method, but it differs from the existing transfer learning methods. In instance-based transfer learning, the shared knowledge is assumed to be contained in the source data, and the data weights are estimated or the data are selected to help adapt the marginal distributions. For example, Huang et al.¹⁸ proposed Kernel Mean Matching (KMM) to match the means between the source and the target data in a Reproducing Kernel Hilbert Space, and Sugiyama et al.¹⁹ proposed the Kullback-Leibler Importance Estimation Procedure (KLIEP) to minimize the Kullback-Leibler (KL) divergence between the source and the target data. Feature-based methods focus on transforming one feature representation to align with the other, or transforming both feature representations to align them with each other. For example, Daumé²⁰ proposed the Feature Augmentation Method to transform the original features by feature replication, Pan et al.²¹ proposed the Transfer Component Analysis (TCA) to adapt the marginal distribution by minimizing the distribution difference using Maximum Mean Discrepancy, and Fernando et al.²² proposed Subspace Alignment (SA) to transform the source subspace obtained with Principal Component Analysis into the target one. Model-based methods assume that the knowledge can be shared through the model or its parameters. For example, Duan et al.²³ proposed a framework, Domain Adaptation Machine, to construct a robust classifier with some base classifiers preobtained on multiple source domains, Zhuang et al.²⁴ proposed the Matrix Tri-Factorization Based Classification Framework to characterize the connections among the document classes and the concepts conveyed by the word clusters using parameters, and Gao et al.²⁵ proposed an ensemble-based framework, Locally Weighted Ensemble, to combine various learners generated with different source domains or learning algorithms. Relational-based transfer learning approaches focus on transferring the logical relationships or rules learned in the source domain to the target domain. For example, Wang et al.²⁶ proposed a relational knowledge transfer to extract the relational knowledge from data manifold structure and transfer it backwards to help generate virtual data for unseen categories, and Qin et al.²⁷ proposed a relational-based transductive transfer learning method, where the time series are clustered using the similarity measured with the relational knowledge.

Compared with the instance-based methods, SER can better extract deep knowledge through the decision tree model, which can avoid the problem of high dissimilarity between the marginal distributions or high inconsistency between the label spaces. In addition, most feature-based, model-based, and relational-based transfer learning methods can be black boxes for certain tasks, while SER, being constructed on decision trees, offers better interpretability, so its results can be more reliable for diagnostic problems of complex systems. However, the original SER focuses only on the transferability between the tree structures of the source and target domains. It does not consider the marginal distributions, which, where applicable, could further facilitate the knowledge transfer.

In this work, a cross-domain decision method is proposed based on the improved SER method. In the proposed method, features are first extracted from the time domain, the frequency domain and the time-frequency domain. The data weights are then estimated following the idea of instance transfer. The extracted features are selected based on the estimated data weights. The knowledge contained in the source decision tree model is further transferred using the proposed weighted SER (WSER) method by considering the estimated data weights.

The main contributions of this paper include: (1) A cross-domain decision method WSER is introduced based on decision tree with instance transfer and model transfer. (2) The weights of the labeled source and the target data are calculated following the idea of instance transfer. (3) A new feature selection algorithm is developed to prioritize and select features with the estimated data weights. (4) The effectiveness of the proposed method is demonstrated through its application on two engineering fault datasets, showcasing its practical utility.

The remainder of this paper is organized as follows. Section 2 briefly reviews the preliminaries of the related algorithms. Section 3 elaborates the details of the proposed method. The proposed method is further verified using two public engineering fault datasets in Section 4. Finally, this paper is concluded in Section 5.

Theoretical backgrounds

Feature extraction

Fast Fourier transform

The fast Fourier transform (FFT) is an algorithm used to efficiently compute the discrete Fourier transform (DFT) of a sequence or time-domain signal.²⁸ The FFT algorithm is widely used in various fields, including signal processing, image processing, audio analysis, and many other applications that involve analyzing the frequency content of signals.^29–31

The main advantage of the FFT algorithm is its computational efficiency, making it possible to perform high-speed spectral analysis on large sets of data in real-time or near real-time applications. The algorithm exploits the symmetry and periodicity properties of the DFT to reduce the number of computations required. It divides the DFT calculation into smaller subproblems and recursively combines the results, resulting in a significant reduction in computational complexity.

Based on the FFT algorithm,³⁰ the spectrum $s (k)$ of a given signal $x_{n}$ is defined by

s (k) = \sum_{n = 1}^{N} x_{n} e^{- i 2 π k n / N}, k = 1, \dots, K,

(1)

where $x_{n}$ is the time signal, $i = \sqrt{- 1}$, $K$ denotes the number of spectrum lines, $N$ is the number of samples in the time signal, and $K$ equals $N$ in the FFT algorithm. The frequency value of the $k$-th spectrum line can be calculated as

f_{k} = k * (F_{S} / N),

(2)

where $F_{S}$ denotes the sampling frequency. For a real-valued signal, the spectrum is conjugate-symmetric, so half of the $N$ frequency points can be derived from the other half, and the $N / 2$ redundant points can be discarded to improve computational efficiency.

By comparing the FFT spectra of vibration signals under fault conditions with those under the healthy condition, fault diagnosis can be conducted more effectively using specific frequency components.
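The divide-and-combine idea described above, together with equation (2), can be illustrated with a minimal sketch. It assumes a radix-2 Cooley-Tukey scheme (so the signal length must be a power of two) and 0-based indexing of the spectrum lines; the function names and the 8 Hz test signal are hypothetical and are not taken from the paper.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(x[0::2])  # DFT of the even-indexed samples
    odd = fft(x[1::2])   # DFT of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle            # first half of the spectrum
        out[k + n // 2] = even[k] - twiddle   # second half reuses the same terms
    return out


def one_sided_spectrum(x, fs):
    """Return (frequencies, magnitudes) for the first N/2 spectrum lines."""
    n = len(x)
    spectrum = fft(x)
    freqs = [k * fs / n for k in range(n // 2)]       # f_k = k * (F_S / N)
    mags = [abs(spectrum[k]) for k in range(n // 2)]  # redundant half discarded
    return freqs, mags


# Hypothetical test signal: an 8 Hz sine sampled at 64 Hz for one second.
fs = 64
signal = [math.sin(2 * math.pi * 8 * t / fs) for t in range(64)]
freqs, mags = one_sided_spectrum(signal, fs)
peak_frequency = freqs[mags.index(max(mags))]
```

In this sketch the largest magnitude falls on the spectrum line at 8 Hz, which is the kind of characteristic frequency component that a fault signature would be compared against.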

Wavelet packet transform

Wavelet Packet Transform (WPT) algorithm is a signal processing technique that extends the capabilities of wavelet analysis by providing a more detailed and flexible decomposition of signals into subbands.^32,33 It is a multi-resolution analysis tool that allows for a more comprehensive exploration of signal features in both time and frequency domains.

Unlike the traditional discrete wavelet transform, which further decomposes only the low-pass (approximation) subband at each level, the WPT algorithm decomposes both the low-pass and the high-pass subbands at each level, producing a full binary tree and a richer representation of signal components. This decomposition can be performed recursively to achieve greater granularity and capture fine-scale details in the signal.³⁴ The WPT algorithm provides a flexible framework for signal analysis, offering the ability to select and analyze specific subbands of interest. The WPT algorithm is thus used in this paper to extract the time-frequency domain features.

Given a wavelet packet function $Ψ$ , and three integer indices $j$ , $n$ and $g = 0, 1, \dots, 2^{j} - 1$ , which are the scale (frequency localization) parameter, the translation (time localization) parameter³⁵ and the modulation or oscillation parameter, respectively, $Ψ$ can be further obtained as

Ψ = Ψ_{j, n}^{g} (t) = 2^{j / 2} Ψ^{g} (2^{j} t - n) .

(3)

The wavelet packet coefficients $c_{j, n}^{g}$ of a signal $x$ can be computed as the inner product between the signal and the corresponding wavelet packet function, which is

c_{j, n}^{g} = \langle x, Ψ_{j, n}^{g} (t) \rangle = \int_{- \infty}^{+ \infty} x (t) \cdot Ψ_{j, n}^{g} (t) dt .

(4)

The wavelet packet node energy $E_{j} (g)$ is defined as

E_{j} (g) = \sum_{n} {(c_{j, n}^{g})}^{2} .

(5)

The obtained $E_{j} (g)$ can represent the characteristics of vibration signals in both time domain and frequency domain.

Feature selection

Feature selection is a process of selecting a subset of relevant features or variables from a larger set of available features, which can be an important step in the preprocessing stage to improve model performance, reduce overfitting, and enhance interpretability.

The objective of feature selection lies in identifying and retaining those features that hold the highest value of information and discriminatory power for the target task, while discarding irrelevant or redundant features that may introduce noise or add unnecessary complexity to the model.

Feature selection algorithms can be broadly categorized into three types, including filter, wrapper, and embedded algorithms.³⁶ Filter algorithms rank features based on the statistical properties or the relevance to the target variable, wrapper algorithms evaluate feature subsets by using a specific learning algorithm, and embedded algorithms incorporate feature selection as part of the learning algorithm itself.

In this paper, the recursive feature elimination (RFE) algorithm is mainly considered. As a wrapper algorithm, the RFE algorithm can consider the interaction and combination effects of features, which can lead to more accurate feature selection.^37,38 The RFE algorithm incorporates the model performance during the feature selection process, which ensures that the selected features are directly related to the model performance. In addition, the RFE algorithm is a flexible algorithm that can be used with various methods for model construction. The application of other feature selection algorithms in different scenarios is not extensively discussed in this paper.

Methods

In this section, a cross-domain decision method aimed at fault diagnosis is proposed by considering instance and model transfer. Initially, features are extracted from time series data, and subsequently, the data weights are heuristically estimated. Feature selection proceeds based on these estimated weights. The source knowledge related to fault diagnosis is then acquired from the source domain via decision tree. This knowledge is subsequently transferred to the target domain employing the WSER method.

Framework

The operational physical parameters of machinery serve as references that aid in abnormality detection and diagnosis, and vibration signals generated by high-sensitivity accelerometers are primarily utilized for this purpose.^39,40 The vibration signals collected as time series data are thus the main focus of this paper.

Given two fault diagnosis problems from the source domain $D_{S}$ and the target domain $D_{T}$, the vibration signals $x$ of the machine in $D_{S}$ and $D_{T}$ are recorded at a fixed time step, and the operation states $y$ of the machine are monitored along with $x$. The task is to construct a fault diagnosis model that determines the operation state of the machine from the vibration signals $x$ in the target domain $D_{T}$. While abundant historical signal data $x^{S}$ are collected with operation states $y^{S}$ in $D_{S}$, only a few data $x^{L}$ are recorded with $y^{L}$ in $D_{T}$. A model constructed with only the few labeled signal data $D^{L} = \{x^{L}, y^{L}\}$ in $D_{T}$ may not perform well on the target testing data $D^{U}$. In such cases, the sufficient signal data $D^{S} = \{x^{S}, y^{S}\}$ in $D_{S}$ can help learn the patterns of fault diagnosis, which may facilitate the model construction in the target domain $D_{T}$ and improve the model performance on target data. Following this idea, the process of the proposed method is depicted in Figure 1.

Figure 1.

The process of the proposed method.

As shown in Figure 1, to help extract the fault patterns from the time-series signal data, features are first extracted from the data $x$ with operation states $y$, including time-domain, frequency-domain, and time-frequency-domain features. Then, to improve the distribution similarity between the source and target data, the data weights are estimated following the idea of instance transfer. The extracted features are further selected using the improved decision-tree-based RFE ($DT^{+}$-RFE) algorithm based on the data weights. Finally, the fault patterns learned from $D^{S}$ using a decision tree are transferred to $D_{T}$ and further optimized using the proposed WSER method with $D^{L}$.

Feature extraction

The features derived from vibration signal data encapsulate the health status information of machine components and are crucial for fault diagnosis and prognosis.⁴¹ Signal processing techniques across multiple domains (time, frequency, and time-frequency) have been applied to the collected vibration data to obtain a variety of original features.^34,42

In this section, various features are extracted from the time, frequency, and time-frequency domains, which are further used to help construct the diagnosis model.

Time-domain features

Time-domain analysis, a straightforward technique typically used in the initial stages of mechanical fault diagnosis, provides amplitude information of the signal in relation to time.⁴¹ Time-domain features often involve statistical attributes, which are particularly sensitive to impulse faults.³³ Sixteen time-domain features are calculated in this paper, such as the mean, absolute mean, and variance, as defined in Table 1.

Table 1.

Time-domain features.

Features	Formulation	Features	Formulation
Mean	$x_{mean}^{I} = \frac{1}{N} \sum_{n = 1}^{N} x_{n}$	Skewness value	$x_{sv}^{I} = \frac{1}{N} \sum_{n = 1}^{N} {(\frac{x_{n} - x_{mean}^{I}}{σ})}^{3}$
Absolute mean	$x_{abm}^{I} = \frac{1}{N} \sum_{n = 1}^{N} | x_{n} |$	Kurtosis value	$x_{kv}^{I} = \frac{1}{N} \sum_{n = 1}^{N} {(\frac{x_{n} - x_{mean}^{I}}{σ})}^{4}$
Variance	$x_{var}^{I} = \frac{1}{N - 1} \sum_{n = 1}^{N} {(x_{n} - x_{mean}^{I})}^{2}$	Crest factor	$x_{cf}^{I} = max (| x_{n} |) / {(\frac{1}{N} \sum_{n = 1}^{N} x_{n}^{2})}^{1 / 2}$
Root mean square	$x_{rms}^{I} = {(\frac{1}{N} \sum_{n = 1}^{N} x_{n}^{2})}^{1 / 2}$	Impulse factor	$x_{if}^{I} = max (| x_{n} |) / (\frac{1}{N} \sum_{n = 1}^{N} | x_{n} |)$
Maximum	$x_{max}^{I} = max (x_{n})$	Margin factor	$x_{mf}^{I} = max (| x_{n} |) / {(\frac{1}{N} \sum_{n = 1}^{N} \sqrt{| x_{n} |})}^{2}$
Minimum	$x_{min}^{I} = min (x_{n})$	Skewness factor	$x_{sf 1}^{I} = \frac{1}{N} \sum_{n = 1}^{N} {(\frac{x_{n} - x_{mean}^{I}}{σ})}^{4} / σ^{3 / 2}$
Peak-peak value	$x_{ppv}^{I} = max (x_{n}) - min (x_{n})$	Kurtosis factor	
Square root of the amplitude	$x_{sra}^{I} = {(\frac{1}{N} \sum_{n = 1}^{N} \sqrt{| x_{n} |})}^{2}$	Shape factor	$x_{sf 2}^{I} = {(\frac{1}{N} \sum_{n = 1}^{N} x_{n}^{2})}^{1 / 2} / (\frac{1}{N} \sum_{n = 1}^{N} | x_{n} |)$

$σ$ denotes the standard deviation, that is, $σ = \sqrt{x_{var}^{I}}$.
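As an illustration of how the statistics in Table 1 can be evaluated, the following minimal sketch computes a subset of them for a short signal. It assumes that $σ$ is the standard deviation (the square root of the variance with the $N - 1$ denominator) and omits the skewness factor and kurtosis factor; the function name and the four-sample signal are hypothetical.

```python
import math


def time_domain_features(x):
    """Compute a subset of the Table 1 statistics for a signal x."""
    n = len(x)
    mean = sum(x) / n
    abs_mean = sum(abs(v) for v in x) / n
    variance = sum((v - mean) ** 2 for v in x) / (n - 1)
    sigma = math.sqrt(variance)  # assumed: sigma is the standard deviation
    rms = math.sqrt(sum(v * v for v in x) / n)
    peak = max(abs(v) for v in x)
    sra = (sum(math.sqrt(abs(v)) for v in x) / n) ** 2
    return {
        "mean": mean,
        "absolute_mean": abs_mean,
        "variance": variance,
        "rms": rms,
        "maximum": max(x),
        "minimum": min(x),
        "peak_to_peak": max(x) - min(x),
        "square_root_amplitude": sra,
        "skewness": sum(((v - mean) / sigma) ** 3 for v in x) / n,
        "kurtosis": sum(((v - mean) / sigma) ** 4 for v in x) / n,
        "crest_factor": peak / rms,
        "impulse_factor": peak / abs_mean,
        "margin_factor": peak / sra,
        "shape_factor": rms / abs_mean,
    }


# Hypothetical four-sample signal, used only to exercise the formulas.
signal = [0.0, 3.0, -4.0, 1.0]
features = time_domain_features(signal)
```

A constant or all-zero signal would make several denominators zero, so a practical implementation would need to guard against that case.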

Frequency-domain features

Frequency-domain approaches typically entail an analysis of vibration signals to identify characteristic frequencies associated with the rotation of bearings.³⁰

The FFT is applied to the time-domain vibration signals to extract the frequency-domain features, which can provide information on the defect frequencies of the components.⁴³ Twelve features are calculated from the statistical properties of the spectrum, such as the mean, variance, and maximum,³⁹ as defined in Table 2.

Table 2.

Frequency-domain features.

Features	Formulation	Features	Formulation
Mean	$x_{mean}^{F} = \frac{1}{K} \sum_{k = 1}^{K} s (k)$	Frequency center	$x_{fc}^{F} = \frac{\sum_{k = 1}^{K} f_{k} s (k)}{\sum_{k = 1}^{K} s (k)}$
Variance	$x_{var}^{F} = \frac{\sum_{k = 1}^{K} {(s (k) - x_{mean}^{F})}^{2}}{K - 1}$	Root mean square	$x_{rms}^{F} = \sqrt{\frac{\sum_{k = 1}^{K} f_{k}^{2} s (k)}{\sum_{k = 1}^{K} s (k)}}$
Maximum	$x_{\max}^{F} = max (s (k))$	Standard deviation	$x_{std}^{F} = \sqrt{\frac{\sum_{k = 1}^{K} {(f_{k} - x_{fc}^{F})}^{2} s (k)}{K}}$
Minimum	$x_{\min}^{F} = min (s (k))$	CP1	$x_{cp 1}^{F} = \frac{\sum_{k = 1}^{K} {(f_{k} - x_{fc}^{F})}^{3} s (k)}{K {(x_{std}^{F})}^{3}}$
Skewness power spectrum	$x_{sp}^{F} = \frac{1}{K} \sum_{k = 1}^{K} {(\frac{s (k) - x_{mean}^{F}}{\sqrt{x_{var}^{F}}})}^{3}$	CP2	$x_{cp 2}^{F} = \frac{x_{std}^{F}}{x_{fc}^{F}}$
Kurtosis power spectrum	$x_{kp}^{F} = \frac{1}{K} \sum_{k = 1}^{K} {(\frac{s (k) - x_{mean}^{F}}{\sqrt{x_{var}^{F}}})}^{4}$	CP3	$x_{cp 3}^{F} = \frac{\sum_{k = 1}^{K} {(f_{k} - x_{fc}^{F})}^{4} s (k)}{K {(x_{std}^{F})}^{4}}$

$s (k)$ is the spectrum for $k = 1, \dots, K$, $K$ is the number of spectrum lines, and $f_{k}$ is the frequency value of the $k$-th spectrum line.
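The twelve statistics in Table 2 can be evaluated directly from a magnitude spectrum and its frequency values, as in the minimal sketch below. It assumes that the variance uses squared deviations and that the standard deviation is taken about the frequency center; the function name and the four-line spectrum are hypothetical.

```python
import math


def frequency_domain_features(s, f):
    """Compute the Table 2 statistics from spectrum s and frequencies f."""
    K = len(s)
    total = sum(s)
    mean = total / K
    var = sum((v - mean) ** 2 for v in s) / (K - 1)  # assumed squared deviations
    fc = sum(fk * sk for fk, sk in zip(f, s)) / total
    std = math.sqrt(sum((fk - fc) ** 2 * sk for fk, sk in zip(f, s)) / K)
    return {
        "mean": mean,
        "variance": var,
        "maximum": max(s),
        "minimum": min(s),
        "skewness": sum(((v - mean) / math.sqrt(var)) ** 3 for v in s) / K,
        "kurtosis": sum(((v - mean) / math.sqrt(var)) ** 4 for v in s) / K,
        "frequency_center": fc,
        "rms": math.sqrt(sum(fk ** 2 * sk for fk, sk in zip(f, s)) / total),
        "std": std,
        "cp1": sum((fk - fc) ** 3 * sk for fk, sk in zip(f, s)) / (K * std ** 3),
        "cp2": std / fc,
        "cp3": sum((fk - fc) ** 4 * sk for fk, sk in zip(f, s)) / (K * std ** 4),
    }


# Hypothetical four-line spectrum, used only to exercise the formulas.
spectrum = [1.0, 2.0, 3.0, 2.0]
frequencies = [0.0, 10.0, 20.0, 30.0]
freq_features = frequency_domain_features(spectrum, frequencies)
```

A flat spectrum gives zero variance and would need special handling before the skewness and kurtosis terms are computed.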

Time-frequency-domain features

As stated above, time-domain features and frequency-domain features are easily extracted and commonly used in fault diagnosis. However, time and frequency information cannot be simultaneously considered in the extracted features above. Time-frequency domain analysis is thus further utilized to help extract comprehensive features, which may be more effective in fault diagnosis.⁴⁴ Many time-frequency domain analysis technologies have been developed, including short-time Fourier transform (STFT), wavelet packet transform (WPT), Hilbert-Huang transform (HHT) algorithms, etc.^33,45,46 In this paper, the WPT algorithm is adopted to extract the time-frequency-domain features with accelerometer sensor signals due to its flexible decomposition, excellent time-frequency localization, computational efficiency, and wide applicability.^34,40,47

The vibration signals are first decomposed into four levels using the WPT algorithm, following the procedure described in Rauber et al.³⁴ The energy values of the wavelet packet nodes are then calculated at the fourth level, yielding 16 time-frequency-domain features.³³ This analysis considers a 1-D time-domain vibration signal comprising $N$ samples.

In the WPT algorithm, with a tree depth of $j$, $2^{j}$ final leaves $W_{j, 0}, \dots, W_{j, 2^{j} - 1}$ are generated. Each has approximately $N / 2^{j}$ wavelet coefficients. The features derived from the final $2^{j}$ leaf nodes of the decomposition tree represent the respective proportions of the energy contained in each leaf. Let $c_{j, n}^{g}$, $n = 0, \dots, N / 2^{j} - 1$ be the $N / 2^{j}$ wavelet coefficients of leaf node $g$ at tree depth $j$, where $g = 0, \dots, 2^{j} - 1$. The energy of the $g$-th node³⁴ is calculated as

E_{j} (g) = \sum_{n = 0}^{N / 2^{j} - 1} {[c_{j, n}^{g}]}^{2} .

(6)

Then, the $g$-th wavelet packet feature is

x_{g}^{IF} = \frac{E_{j} (g)}{\sum_{m = 0}^{2^{j} - 1} E_{j} (m)} .

(7)

With tree depth $j = 4$, 16 final leaves, $W_{4, 0}, \dots, W_{4, 15}$, are generated, and 16 features are obtained as $\{x_{g}^{IF} | g = 0, \dots, 15\}$ with $\sum_{g = 0}^{15} x_{g}^{IF} = 1$.
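Equations (6) and (7) can be illustrated with a minimal sketch that builds the full wavelet packet tree and returns the relative leaf energies. The Haar wavelet is assumed here purely for simplicity, since the wavelet basis is not specified in this section. The leaves are returned in natural (filter-bank) order rather than frequency order, and the function names and test signal are hypothetical.

```python
import math


def haar_split(x):
    """One Haar analysis step: return the (low-pass, high-pass) halves of x."""
    r = math.sqrt(2.0)
    low = [(x[i] + x[i + 1]) / r for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) / r for i in range(0, len(x), 2)]
    return low, high


def wavelet_packet_energy_features(x, depth=4):
    """Relative energy of each of the 2**depth leaves (natural node order)."""
    if len(x) % (2 ** depth):
        raise ValueError("len(x) must be divisible by 2**depth")
    nodes = [list(x)]
    for _ in range(depth):
        next_nodes = []
        for node in nodes:  # split every node, not only the low-pass one
            next_nodes.extend(haar_split(node))
        nodes = next_nodes
    energies = [sum(c * c for c in node) for node in nodes]  # equation (6)
    total = sum(energies)
    return [e / total for e in energies]                     # equation (7)


# Hypothetical two-tone signal with 64 samples.
signal = [math.sin(0.3 * n) + 0.5 * math.sin(2.5 * n) for n in range(64)]
wp_features = wavelet_packet_energy_features(signal, depth=4)
```

With `depth=4` the sketch returns 16 non-negative values that sum to one, matching the normalization stated above. An all-zero signal has zero total energy and would need a guard.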

Weight estimation

After the feature data ${\bar{D}}^{S} = \{{\bar{x}}^{S}, y^{S}\}$ and ${\bar{D}}^{L} = \{{\bar{x}}^{L}, y^{L}\}$ are obtained from $D^{S}$ and $D^{L}$ through feature extraction, the data weights are estimated to help increase the distribution similarity between the source and the target data. In this section, the weight estimation is conducted by instance transfer following the idea of the Multiclass TrAdaBoost (MC-TAB) method.⁴⁸ Compared with other weight estimation methods, the developed method makes effective use of the labeled target data. In addition, besides the source data weights, the weights of the labeled target data are also estimated in this process, which can help further extract the data or information that is representative of the target domain.

Given the source data ${\bar{D}}^{S}$ and the target data ${\bar{D}}^{L}$ with labels, the weights of ${\bar{x}}^{S}$ and ${\bar{x}}^{L}$ are first initialized. In the absence of supplementary information to help obtain the initial data weights, they can be designated as equal, such as 1.

The weights are then normalized as

p^{r} = w^{r} / (\sum_{n = 1}^{N^{S} + N^{L}} w_{n}^{r}) .

(8)

A model is then trained using decision tree with the labeled data ${\bar{D}}^{S}$ and ${\bar{D}}^{L}$ with the normalized weights $p^{r}$ , where the decision tree is used here to keep consistency with the transfer model in the following steps of the proposed method.

Then the error of the derived model on ${\bar{D}}^{L}$ can be calculated as

ε_{r} = \sum_{n = 1}^{N^{L}} \frac{w_{n}^{L, r} \cdot I (h_{r} ({\bar{x}}_{n}^{L}) \neq y_{n}^{L})}{\sum_{m = 1}^{N^{L}} w_{m}^{L, r}},

(9)

where $I (\cdot)$ denotes the indicator function and $h_{r}$ denotes the model trained at the $r$-th iteration. The weight updating parameters can be further calculated as

α_{r} = \log ((1 - ε_{r}) / ε_{r}) + \log (K - 1),

(10)

α = \log (1 / (1 + \sqrt{2 \ln N^{S} / R})),

(11)

where $K$ is the number of classes, and $R$ is the maximum number of iterations. In equation (10), the first term depends on the error rate, which helps adjust the weights by reflecting the importance of the current model. The second term adapts the algorithm to multi-class cases. The $α$ in equation (11) adjusts the weights at a fixed rate. Further details can be found in Hastie’s work.⁴⁹

And the weights can be updated as

w_{n}^{r + 1} = \begin{cases} p_{n}^{r} \cdot K (1 - ε_{r}) \cdot e^{α \cdot I (h_{r} ({\bar{x}}_{n}) \neq y_{n})}, & 1 \leq n \leq N^{S} \\ p_{n}^{r} \cdot e^{α_{r} \cdot I (h_{r} ({\bar{x}}_{n}) \neq y_{n})}, & N^{S} + 1 \leq n \leq N^{S} + N^{L} \end{cases},

(12)

where $w_{n}^{r}$ denotes the weights at the $r$-th iteration, and $e^{(\cdot)}$ denotes the exponential function with base $e$, which helps update the weights smoothly.

To avoid overfitting to the labeled target data, the maximum number of iterations $R$ is set to 20 in this paper.
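One iteration of the weight update in equations (8) to (12) can be sketched as follows. The sketch assumes that the misclassification flags of the current decision tree are already available, so the tree training itself is left out, and that the denominator of equation (8) is the sum of the current weights. The function name, argument names, and the toy inputs are hypothetical.

```python
import math


def update_weights(weights, wrong, n_source, n_classes, n_rounds):
    """One weight update; source samples come first, then labeled target samples.

    weights: current weights w^r of all samples.
    wrong:   booleans, True where the current model misclassifies the sample.
    """
    total = sum(weights)
    p = [w / total for w in weights]  # equation (8)

    # Equation (9): weighted error on the labeled target samples only.
    target_w = weights[n_source:]
    target_wrong = wrong[n_source:]
    eps = sum(w for w, bad in zip(target_w, target_wrong) if bad) / sum(target_w)
    if not 0.0 < eps < 1.0:
        raise ValueError("target error must lie strictly between 0 and 1")

    alpha_r = math.log((1 - eps) / eps) + math.log(n_classes - 1)  # equation (10)
    alpha = math.log(1 / (1 + math.sqrt(2 * math.log(n_source) / n_rounds)))  # (11)

    new_weights = []
    for n, (p_n, bad) in enumerate(zip(p, wrong)):
        if n < n_source:
            # Source branch of equation (12): alpha <= 0 shrinks misclassified samples.
            new_weights.append(p_n * n_classes * (1 - eps) * math.exp(alpha * bad))
        else:
            # Target branch of equation (12): alpha_r scales misclassified samples.
            new_weights.append(p_n * math.exp(alpha_r * bad))
    return new_weights


# Hypothetical toy case: 4 source samples, 2 labeled target samples, 3 classes.
weights = [1.0] * 6
wrong = [False, True, False, False, False, True]
new_weights = update_weights(weights, wrong, n_source=4, n_classes=3, n_rounds=20)
```

In this toy case the misclassified source sample ends up with a smaller weight than the correctly classified source samples, while the misclassified target sample ends up with a larger weight than the correctly classified one.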

Feature selection

As stated above, 44 features are extracted from the data. However, not all features are very relevant to model construction, and the irrelevant or redundant features may lead to model overfitting or high complexity.⁵⁰ The feature selection is thus conducted to help find the most effective features, which can also assist in reducing the data dimensionality and complexity.³⁶

To obtain the relevant feature subset from the 44 features, the RFE algorithm is applied in this paper, with a decision tree as the base classifier wrapped by the RFE algorithm. Compared with other classification methods, the decision tree offers better interpretability, and it is also used for model construction and knowledge transfer in the source and target domains in the subsequent sections. The decision tree is thus chosen here as the base classifier for feature selection to keep consistency. The $DT^{+}$-RFE algorithm is further developed in this paper based on DT-RFE, where the data weights are considered in the process of feature selection.

The $DT^{+}$-RFE algorithm initially considers all features and progressively eliminates those deemed irrelevant, as determined by assigned scores, until only pertinent features remain. Based on the data weights, the algorithm outputs an array of positive integers that gives the ranking of each feature. A lower score denotes a higher feature ranking, and a higher score denotes a lower ranking. The $DT^{+}$-RFE algorithm eliminates low-ranked or irrelevant features and selects only those with high rankings.

As stated above, the source data with high weights are more similar to the target data, so the $DT^{+}$-RFE algorithm constructed using the source data weights can better find the features that are more relevant to the tasks in the target domain.

Model construction based on instance and model transfer

Once the data ${\tilde{D}}^{S} = \{{\tilde{x}}^{S}, y^{S}\}$ and ${\tilde{D}}^{L} = \{{\tilde{x}}^{L}, y^{L}\}$ are obtained from ${\bar{D}}^{S}$ and ${\bar{D}}^{L}$ after feature extraction and selection, the knowledge can be learned from the data with specific methods. The decision tree method is selected in this paper to construct the model because it retains high interpretability. In addition, the decision tree method has better fitting power in nonlinear problems than linear models.

As stated above, when the labeled target data ${\tilde{D}}^{L}$ are insufficient, the decision model constructed using the decision tree method may perform poorly in the target domain $D_{T}$. In such cases, the abundant source data ${\tilde{D}}^{S}$ can help extract knowledge that is applicable to the target data.

The source model is first trained using the source data ${\tilde{D}}^{S}$ and the data weights $w^{S}$. In this paper, the decision tree is constructed using the CART algorithm, where the Gini index is used to measure the reduction in class impurity from partitioning the feature space, as shown in equations (13) and (14).⁵¹

impurity = 1 - \sum_{j} p_{j}^{2},

(13)

Gini = impurity (Parent) - \sum_{k} p_{k} \cdot impurity (Child_{k}),

(14)

where $p_{j}$ denotes the relative frequency of class $j$, that is, the number of samples of class $j$ divided by the total number of samples, and $p_{k}$ denotes the proportion of the parent's samples that fall into the child node $Child_{k}$.
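As a small illustration of equations (13) and (14), the following sketch computes the impurity and the impurity reduction of a split. It assumes that, when data weights are used, each class frequency $p_{j}$ and each child proportion $p_{k}$ is computed from summed weights rather than raw counts; with unit weights it reduces to the counting form. The function names are hypothetical.

```python
from collections import defaultdict


def gini_impurity(samples):
    """Equation (13) for a list of (label, weight) pairs."""
    total = sum(w for _, w in samples)
    mass = defaultdict(float)
    for label, w in samples:
        mass[label] += w  # weighted class mass
    return 1.0 - sum((m / total) ** 2 for m in mass.values())


def gini_gain(parent, children):
    """Equation (14): parent impurity minus the weighted child impurities."""
    total = sum(w for _, w in parent)
    gain = gini_impurity(parent)
    for child in children:
        p_k = sum(w for _, w in child) / total  # share of the parent in child k
        gain -= p_k * gini_impurity(child)
    return gain


# Two normal (N) and two inner race fault (IR) samples with unit weights.
parent = [("N", 1.0), ("N", 1.0), ("IR", 1.0), ("IR", 1.0)]
perfect_split = [parent[:2], parent[2:]]
print(gini_impurity(parent), gini_gain(parent, perfect_split))  # 0.5 0.5
```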

After the source model $M^{S}$ is obtained, the knowledge in the source domain $D_{S}$ is contained in $M^{S}$. The key question is then how to transfer this knowledge from $D_{S}$ to $D_{T}$.

Similar to the SER method, WSER applies two transformations, namely expansion and reduction, using the limited labeled target data ${\tilde{D}}^{L}$. In addition, WSER takes into account the weights of ${\tilde{D}}^{L}$ generated by the estimation method, which improves the effectiveness of the expansion and reduction.

Given a leaf node $v$ of the source model $M^{S}$, WSER computes ${\tilde{D}}_{v}^{L}$, the subset of the target data ${\tilde{D}}^{L}$ that reaches the node $v$. Subsequently, each leaf $v$ is expanded into a full tree by growing a full decision tree on ${\tilde{D}}_{v}^{L}$ with the CART algorithm.⁵²
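A minimal sketch of the expansion step is given below, assuming the source model is a fitted scikit-learn `DecisionTreeClassifier`. The helper `expand_leaves` is hypothetical: it only grows one weighted CART subtree per source leaf from the target samples that reach it, and does not graft the subtrees back into the source tree.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def expand_leaves(source_tree, X_target, y_target, w_target):
    """Return {leaf_id: subtree grown on the target samples reaching that leaf}."""
    # apply() gives, for each target sample, the index of the leaf it reaches.
    leaf_ids = source_tree.apply(X_target)
    subtrees = {}
    for leaf in np.unique(leaf_ids):
        mask = leaf_ids == leaf
        # Fully grown CART tree on the local target subset, with data weights.
        sub = DecisionTreeClassifier(criterion="gini", random_state=0)
        sub.fit(X_target[mask], y_target[mask], sample_weight=w_target[mask])
        subtrees[int(leaf)] = sub
    return subtrees
```

Leaves that no target sample reaches are left untouched in this sketch, which is one possible design choice rather than something stated in the text.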

The reduction is then conducted based on the leaf error and the subtree error. These are defined, respectively, as the empirical error on $v$ with respect to ${\tilde{D}}_{v}^{L}$ if $v$ were pruned into a leaf, and the empirical error of the subtree whose root is $v$.¹⁶ The leaf error can be calculated as

LE (v, {\tilde{D}}_{v}^{L}) = \frac{1}{| N_{v}^{L} |} \sum_{n = 1}^{| N_{v}^{L} |} w_{n, v}^{L} \cdot 1_{\{y_{n, v}^{L} \neq y_{v}\}},

(15)

where $| N_{v}^{L} |$ denotes the number of elements in ${\tilde{D}}_{v}^{L}$, $w_{n, v}^{L}$ and $y_{n, v}^{L}$ denote the weight and the label of the $n$-th element in ${\tilde{D}}_{v}^{L}$, and $y_{v}$ denotes the majority class of the leaf $v$. The subtree error can be obtained by aggregating the errors of all leaves, each weighted by the fraction of ${\tilde{D}}_{v_{j}}^{L}$ attributed to each leaf $v_{j}$,¹⁵ which is calculated as

SE (v, {\tilde{D}}_{v}^{L}) = \frac{1}{| N_{v}^{L} |} \sum_{j = 1}^{J} \sum_{n = 1}^{| N_{v_{j}}^{L} |} w_{n, v_{j}}^{L} \cdot 1_{\{y_{n, v_{j}}^{L} \neq y_{v_{j}}\}},

(16)

where $y_{n, v_{j}}^{L}$ denotes the label of the $n$-th element in ${\tilde{D}}_{v_{j}}^{L}$, and $y_{v_{j}}$ denotes the majority class of the leaf $v_{j}$. If the leaf error on the node $v$ is smaller than the subtree error, the subtree of $v$ is pruned.

The WSER algorithm is summarized as follows.

Algorithm 1 Weighted Structure Expansion Reduction (WSER)
Input: Source model $M^{S}$ and labeled target data ${\tilde{D}}^{L}$.
Output: Target decision model $M^{T}$.
1: For a leaf node in the source tree model $M^{S}$, expand the leaves of node $v$ with ${\tilde{D}}_{v}^{L}$ and the data weights, where ${\tilde{D}}_{v}^{L}$ denotes the data that reach node $v$;
2: Recurse over all child nodes of node $v$, and repeat step 1;
3: Calculate the leaf error LE and the subtree error SE using equations (15) and (16);
4: If $LE (v, {\tilde{D}}_{v}^{L}) < SE (v, {\tilde{D}}_{v}^{L})$, delete all child nodes of node $v$, and set $d (v) = 0$ and $y_{v} = \underset{y}{\arg\max} \left| \{(\cdot, y) \in {\tilde{D}}_{v}^{L}\} \right|$;
5: Repeat steps 1–4 until all leaf nodes of $M^{S}$ are recursed, and generate the target decision model $M^{T}$.
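The reduction part of the algorithm (steps 3 and 4) can be sketched on a toy tree as follows. This is a hedged illustration, not the authors' code: the `Node` class, the sample layout `(features, label, weight)`, and the choice to skip nodes that no target sample reaches are all assumptions made for the example. The leaf label used for the leaf error is the unweighted majority class of the target data at the node, following step 4.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple

Sample = Tuple[List[float], int, float]  # (features, label, weight)


@dataclass
class Node:
    """Hypothetical binary tree node; a leaf has feature=None."""
    label: int = 0
    feature: Optional[int] = None
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def split(node: Node, samples: List[Sample]):
    """Route samples to the left and right children of an internal node."""
    left = [s for s in samples if s[0][node.feature] <= node.threshold]
    right = [s for s in samples if s[0][node.feature] > node.threshold]
    return left, right


def subtree_weighted_errors(node: Node, samples: List[Sample]) -> float:
    """Sum of the weights of the samples misclassified by the subtree."""
    if node.feature is None:
        return sum(w for _, y, w in samples if y != node.label)
    left, right = split(node, samples)
    return (subtree_weighted_errors(node.left, left)
            + subtree_weighted_errors(node.right, right))


def reduce_tree(node: Node, samples: List[Sample]) -> None:
    """Bottom-up reduction: turn `node` into a leaf when LE < SE."""
    if node.feature is None or not samples:
        return  # already a leaf, or no target data reaches this node
    left, right = split(node, samples)
    reduce_tree(node.left, left)
    reduce_tree(node.right, right)
    n = len(samples)
    majority = Counter(y for _, y, _ in samples).most_common(1)[0][0]
    leaf_error = sum(w for _, y, w in samples if y != majority) / n  # eq. (15)
    subtree_error = subtree_weighted_errors(node, samples) / n       # eq. (16)
    if leaf_error < subtree_error:
        node.feature, node.left, node.right = None, None, None
        node.label = majority
```

For example, a root that splits on feature 0 at 0.5 into leaves labeled 0 and 1 is collapsed into a single leaf labeled 0 when every target sample carries label 0, since the leaf error is then 0 while the subtree still misclassifies the samples routed to the right leaf.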

Experiments

To validate the effectiveness of the proposed fault diagnosis method, it is applied to two public engineering fault datasets: the bearing data provided by the Bearing Data Center of Case Western Reserve University (CWRU) and the gearbox dataset from Southeast University (SEU). Comparative experiments against machine learning and transfer learning-based methods are performed to underscore its effectiveness.

Dataset

CWRU dataset

The CWRU dataset is widely recognized as a standard benchmark for rolling bearing fault diagnosis. Its test rig comprises a driving motor, a torque transducer, and a load motor. Test bearings 6205-2RS JEM SKF and 6203-2RS JEM SKF are mounted at the drive end and the fan end of the driving motor, respectively, to support the motor shaft.⁵³ Bearing vibration data are collected by acceleration sensors mounted at the ends of the driving motor under various operational loads and bearing conditions.³³ The CWRU bearing data have been used extensively in prior studies and can thus provide an effective validation for bearing fault diagnosis.¹,⁵³,⁵⁴

The vibration signals collected at a sampling frequency of 12 kHz are adopted in this paper. Four kinds of bearing health conditions are identified in the data, namely normal (N), inner race fault (IR), outer race fault (OR), and roller fault (RF). Each of the three fault types includes fault diameters of 0.007, 0.014, and 0.021 in. All bearings are re-fitted onto the test rig under four distinct operating conditions, that is, constant speeds for motor loads of 0, 1, 2, and 3 horsepower (HP). These loads correspond to motor speeds of 1797, 1772, 1750, and 1730 rpm, respectively.

To extract samples from the signal data, the sample length is set to 1024, which means each sample contains 1024 signal points. For each operating condition, 9000 samples are randomly extracted from the signal data. The details of the preprocessed data samples are given in Table 3.
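The sample extraction can be illustrated with the short sketch below. It assumes that each sample is a contiguous window whose start position is drawn uniformly at random, so windows may overlap; the text does not state how the random extraction is performed, and the function name `extract_samples` is hypothetical.

```python
import random


def extract_samples(signal, n_samples, sample_length=1024, seed=0):
    """Cut n_samples contiguous windows of sample_length points from a signal."""
    last_start = len(signal) - sample_length
    if last_start < 0:
        raise ValueError("signal is shorter than one sample")
    rng = random.Random(seed)
    # Assumption: random start positions, overlapping windows allowed.
    starts = [rng.randint(0, last_start) for _ in range(n_samples)]
    return [signal[s:s + sample_length] for s in starts]


# Toy signal standing in for one vibration record.
toy_signal = list(range(5000))
samples = extract_samples(toy_signal, n_samples=10)
print(len(samples), len(samples[0]))  # 10 1024
```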

Table 3.

The details of extracted CWRU data samples.

Datasets	Health conditions	Operating conditions	Sample numbers
H0	N/IR/OR/RF	0 HP (1797 rpm)	9000
H1	N/IR/OR/RF	1 HP (1772 rpm)	9000
H2	N/IR/OR/RF	2 HP (1750 rpm)	9000
H3	N/IR/OR/RF	3 HP (1730 rpm)	9000

As shown in Table 3, four datasets are obtained after data preprocessing. The datasets share the same label space of health conditions but are collected under different operating conditions. To validate the effectiveness of the proposed method, 12 transfer tasks $Z_{k}$ $(k = 1, \dots, 12)$ of fault diagnosis are conducted in this paper, including $H_{0} \to H_{1}$, $H_{0} \to H_{2}$, $H_{0} \to H_{3}$, $H_{1} \to H_{0}$, $H_{1} \to H_{2}$, $H_{1} \to H_{3}$, $H_{2} \to H_{0}$, $H_{2} \to H_{1}$, $H_{2} \to H_{3}$, $H_{3} \to H_{0}$, $H_{3} \to H_{1}$, and $H_{3} \to H_{2}$. To simulate the situation where only a few labeled data are available in the target domain, only 100 data samples are randomly selected from a dataset when it is chosen as the target data, and the rest of the data are used for testing.
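The task construction can be written down compactly. The sketch below enumerates the 12 ordered source-to-target pairs in the order listed above and splits a target dataset into 100 labeled samples and a test remainder; the helper `split_target` and the use of a fixed random seed are assumptions for illustration.

```python
import random
from itertools import permutations

DOMAINS = ["H0", "H1", "H2", "H3"]
# Ordered (source, target) pairs: Z1 = H0 -> H1, ..., Z12 = H3 -> H2.
TASKS = {f"Z{k}": pair for k, pair in enumerate(permutations(DOMAINS, 2), start=1)}


def split_target(indices, n_labeled=100, seed=0):
    """Randomly pick n_labeled target samples; the rest are used for testing."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_labeled], shuffled[n_labeled:]


labeled_idx, test_idx = split_target(range(9000))
print(TASKS["Z6"], len(labeled_idx), len(test_idx))  # ('H1', 'H3') 100 8900
```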

SEU dataset

The SEU dataset is a gearbox dataset collected from the Drivetrain Dynamics Simulator by Shao et al.⁵⁵ The details of the SEU dataset are given in Table 4. This dataset consists of two sub-datasets, the bearing and gear datasets, for which eight channels were collected. The data of channel 2 are mainly used, following the setting of the work in Zhao et al.⁵⁶

Table 4.

The details of extracted SEU data samples.

Type	Status	Description
Bearing	Health	\
	Ball	Crack in the ball
	Outer	Crack in the outer ring
	Inner	Crack in the inner ring
	Combination	Crack in the both sides
Gear	Health	\
	Chipped	Crack in the gear feet
	Miss	Missing feet in the gear
	Surface	Wear in the surface of the gear
	Root	Crack in the root of the gear feet

As shown in Table 4, each of the two sub-datasets contains one healthy status and four fault statuses, while the fault statuses differ between bearing and gear. The transfer tasks are established between two working conditions, with the rotating speed and system load set to 20 Hz – 0 V or 30 Hz – 2 V for each sub-dataset, which are denoted as conditions 0 and 1, respectively. In total, there are four transfer learning settings, including $B_{0} \to B_{1}$, $B_{1} \to B_{0}$, $G_{0} \to G_{1}$, and $G_{1} \to G_{0}$.

Table 5.

Performances of the $D T^{T}$ , $D T^{ST}$ , and WSER models on tasks $Z_{k}$ $(k = 1, \dots, 12)$ .

Tasks	Details	$D T^{T}$	$D T^{ST}$	WSER
$Z_{1}$	$H_{0} \to H_{1}$	$0.7941 \pm 0.0062$	$0.6908 \pm 0.0355$	$0.9206 \pm 0.0220$
$Z_{2}$	$H_{0} \to H_{2}$	$0.8987 \pm 0.0239$	$0.6841 \pm 0.0400$	$0.9324 \pm 0.0170$
$Z_{3}$	$H_{0} \to H_{3}$	$0.8881 \pm 0.0255$	$0.7093 \pm 0.0454$	$0.9432 \pm 0.0168$
$Z_{4}$	$H_{1} \to H_{0}$	$0.8805 \pm 0.0523$	$0.9455 \pm 0.0226$	$0.9559 \pm 0.0076$
$Z_{5}$	$H_{1} \to H_{2}$	$0.8980 \pm 0.0329$	$0.9804 \pm 0.0087$	$0.9651 \pm 0.0088$
$Z_{6}$	$H_{1} \to H_{3}$	$0.8697 \pm 0.0414$	$0.9046 \pm 0.0808$	$0.9553 \pm 0.0061$
$Z_{7}$	$H_{2} \to H_{0}$	$0.8707 \pm 0.0431$	$0.9423 \pm 0.0329$	$0.9536 \pm 0.0071$
$Z_{8}$	$H_{2} \to H_{1}$	$0.8161 \pm 0.0348$	$0.9743 \pm 0.0036$	$0.9486 \pm 0.0220$
$Z_{9}$	$H_{2} \to H_{3}$	$0.8403 \pm 0.0465$	$0.9426 \pm 0.0278$	$0.9531 \pm 0.0136$
$Z_{10}$	$H_{3} \to H_{0}$	$0.8294 \pm 0.0654$	$0.8969 \pm 0.0487$	$0.9400 \pm 0.0214$
$Z_{11}$	$H_{3} \to H_{1}$	$0.8345 \pm 0.0604$	$0.8998 \pm 0.0444$	$0.9598 \pm 0.0133$
$Z_{12}$	$H_{3} \to H_{2}$	$0.8748 \pm 0.0435$	$0.9455 \pm 0.0335$	$0.9718 \pm 0.0214$

Note. Bolded results indicate the best model performance under the same conditions.

Results

Results of CWRU dataset

Performance of the proposed method

Following data preparation, the proposed method is employed to verify its efficacy. As delineated above, 12 transfer tasks are performed. Each task consists of a source domain $D_{S}$ and a target domain $D_{T}$, with 9000 training samples in $D_{S}$, and 100 training samples and 8900 testing samples in $D_{T}$. With the collected data, 44 features are first extracted from the time, frequency, and time-frequency domains.

The data weights are further estimated, and the weighted data are used for feature selection using $D T^{+} - RFE$ method, as stated in Methods, where half of the features are selected by default.

The proposed WSER method is then used to generate the target diagnosis models based on the obtained data ${\tilde{D}}_{k}^{S}$ and ${\tilde{D}}_{k}^{L}$ and the data weights. These models are evaluated on ${\tilde{D}}_{k}^{U}$ to obtain the performance of the WSER models on tasks $Z_{k}$ $(k = 1, \dots, 12)$. In addition, two baselines are evaluated on ${\tilde{D}}_{k}^{U}$ to further highlight the effectiveness of the proposed WSER method: the $D T^{T}$ models, trained using the decision tree method with only the labeled data ${\tilde{D}}_{k}^{L}$ in $D_{T}$, and the $D T^{ST}$ models, trained using the decision tree method with the weighted data ${\tilde{D}}_{k}^{S}$ and ${\tilde{D}}_{k}^{L}$. The performance of these models on the different tasks is given in Table 5. All performance in this study is measured by accuracy, that is, the proportion of correct predictions among all the testing data.

As shown in Table 5, the $D T^{ST}$ models outperform the $D T^{T}$ models on tasks $Z_{4} - Z_{12}$, which suggests that the signal data collected under varying operational conditions can be similar and that leveraging source data can enhance the target model performance. In addition, the WSER models perform better than the other models on most tasks; the $D T^{ST}$ models outperform the WSER models only on tasks $Z_{5}$ and $Z_{8}$. The results indicate that the proposed WSER method can effectively leverage the source knowledge for the target domain.

Effect of different categories of features on model performance

To understand how time, frequency, and time-frequency features affect diagnostic model performance, models are constructed using each of these feature types separately. They are evaluated on ${\tilde{D}}_{k}^{U}$ $(k = 1, \dots, 12)$ to obtain the performance. The results are given in Table 6.

As shown in Table 6, the models constructed using only time features perform worse overall than those based on frequency features and time-frequency features, and the frequency feature based models perform better overall than the time-frequency feature based models. In addition, the models constructed using all the features perform better in most cases. The results show that, among the three categories of features, the frequency features can be more important than the others, which means the fault status tends to be reflected by the frequency information of the CWRU dataset.

Table 6.

The performance on CWRU dataset constructed with different categories of features.

Tasks	Time	Frequency	Time-frequency	All
$Z_{1}$	$0.8067 \pm 0.0203$	$0.9347 \pm 0.0159$	$0.8273 \pm 0.0203$	$0.9206 \pm 0.0220$
$Z_{2}$	$0.8353 \pm 0.0212$	$0.9416 \pm 0.0218$	$0.8765 \pm 0.0154$	$0.9324 \pm 0.0170$
$Z_{3}$	$0.8877 \pm 0.0289$	$0.9209 \pm 0.0268$	$0.9126 \pm 0.0255$	$0.9432 \pm 0.0168$
$Z_{4}$	$0.8663 \pm 0.0240$	$0.9147 \pm 0.0506$	$0.8943 \pm 0.0149$	$0.9559 \pm 0.0076$
$Z_{5}$	$0.8862 \pm 0.0332$	$0.9508 \pm 0.0212$	$0.8899 \pm 0.0352$	$0.9651 \pm 0.0088$
$Z_{6}$	$0.9221 \pm 0.0275$	$0.9160 \pm 0.0444$	$0.9259 \pm 0.0292$	$0.9553 \pm 0.0061$
$Z_{7}$	$0.8509 \pm 0.0317$	$0.9047 \pm 0.0423$	$0.8510 \pm 0.0148$	$0.9536 \pm 0.0071$
$Z_{8}$	$0.8901 \pm 0.0321$	$0.9334 \pm 0.0131$	$0.8697 \pm 0.0239$	$0.9486 \pm 0.0220$
$Z_{9}$	$0.9240 \pm 0.0192$	$0.9157 \pm 0.0351$	$0.9422 \pm 0.0105$	$0.9531 \pm 0.0136$
$Z_{10}$	$0.8573 \pm 0.0253$	$0.9011 \pm 0.0282$	$0.8404 \pm 0.0273$	$0.9400 \pm 0.0214$
$Z_{11}$	$0.8761 \pm 0.0253$	$0.9343 \pm 0.0183$	$0.8281 \pm 0.0328$	$0.9598 \pm 0.0133$
$Z_{12}$	$0.9021 \pm 0.0538$	$0.9348 \pm 0.0272$	$0.8447 \pm 0.0234$	$0.9718 \pm 0.0214$

Note. Bolded results indicate the best model performance under the same conditions.

Feature selection in transfer tasks on CWRU dataset

As stated in Section 3.4, the features are ranked using the RFE algorithm, and the ranking results are presented in Figure 2 to illustrate which features are more important for model construction on the specific tasks.

Figure 2.

The feature selection at 10 iterations on tasks $Z_{1} - Z_{12}$ .

As shown in Figure 2, on tasks $Z_{1} - Z_{12}$, time features 1 and 6, frequency features 16, 17, 21, 23, and 24, and time-frequency features 30, 34, 39, and 42 show higher importance than the other features. Overall, the frequency features can be more important than the others on tasks $Z_{1} - Z_{12}$ of the CWRU dataset.

Results of SEU dataset

Results of the proposed method

For the SEU dataset, four transfer tasks are performed. Each task consists of a source domain and a target domain, with 4500 training samples in $D_{S}$, and 100 training samples and 4400 testing samples in $D_{T}$. The same 44 features are also extracted to help construct the models. After the weight estimation and feature selection, the results of the proposed method on the SEU dataset are given in Table 7.

Table 7.

Performances of the $D T^{T}$ , $D T^{ST}$ , and WSER models on tasks $Z_{k}$ $(k = 13, \dots, 16)$ .

Tasks	Details	$D T^{T}$	$D T^{ST}$	WSER
$Z_{13}$	$B_{0} \to B_{1}$	$0.8536 \pm 0.0581$	$0.7867 \pm 0.0723$	$0.9125 \pm 0.0526$
$Z_{14}$	$B_{1} \to B_{0}$	$0.7009 \pm 0.1714$	$0.7035 \pm 0.0970$	$0.8335 \pm 0.0517$
$Z_{15}$	$G_{0} \to G_{1}$	$0.5094 \pm 0.0807$	$0.5548 \pm 0.0557$	$0.6201 \pm 0.0550$
$Z_{16}$	$G_{1} \to G_{0}$	$0.5473 \pm 0.0595$	$0.5324 \pm 0.0562$	$0.6054 \pm 0.0382$

Note. Bolded results indicate the best model performance under the same conditions.

As shown in Table 7, the $D T^{ST}$ models outperform the $D T^{T}$ models only on $Z_{14}$ and $Z_{15}$, which suggests that the signal data collected under varying operational conditions may differ for the SEU dataset, and that direct sample weighting may not improve the model performance in the target domain. However, the WSER models perform better than the other models on all the tasks, which again indicates the effectiveness of the proposed WSER method. In addition, note that all models perform poorly on datasets $G_{0}$ and $G_{1}$, possibly because the handcrafted features are not well suited to such datasets.

Effect of different categories of features on model performance

The performance of the models constructed with time, frequency, and time-frequency features is given in Table 8 to show the effect of the different categories of features on the transfer tasks for the SEU dataset.

As shown in Table 8, the models constructed using all the features perform better in most cases. The time feature based models perform poorly compared with the other models. Unlike on the CWRU dataset, the time-frequency feature based models show better performance than the frequency feature based models on tasks $Z_{13}$, $Z_{14}$, and $Z_{16}$. The results indicate that time-frequency features may show higher importance on the transfer tasks for the SEU dataset.

Table 8.

The performance on SEU dataset constructed with different categories of features.

Tasks	Time	Frequency	Time-Frequency	All
$Z_{13}$	$0.4596 \pm 0.0451$	$0.8974 \pm 0.0381$	$0.9080 \pm 0.0375$	$0.9125 \pm 0.0526$
$Z_{14}$	$0.4553 \pm 0.0661$	$0.7014 \pm 0.0610$	$0.8472 \pm 0.0559$	$0.8335 \pm 0.0517$
$Z_{15}$	$0.4246 \pm 0.0345$	$0.6009 \pm 0.0306$	$0.5722 \pm 0.0326$	$0.6201 \pm 0.0550$
$Z_{16}$	$0.5076 \pm 0.0900$	$0.5178 \pm 0.0965$	$0.5718 \pm 0.0232$	$0.6054 \pm 0.0382$

Note. Bolded results indicate the best model performance under the same conditions.

Feature selection in transfer tasks on SEU dataset

The ranking results of SEU dataset are further presented in Figure 3 to show feature importance on the specific tasks.

Figure 3.

The feature selection at 10 iterations on tasks $Z_{13} - Z_{16}$ .

As shown in Figure 3, on tasks $Z_{13} - Z_{16}$, time feature 1, frequency features 20 and 22, and time-frequency features 28, 32, 33, 34, 37, and 39 show higher importance than the other features. Overall, the time-frequency features can be more important on the transfer tasks $Z_{13} - Z_{16}$ of the SEU dataset, which is consistent with the results presented above.

Comparative analysis

Comparison with machine learning methods

Results of CWRU dataset

To further highlight its effectiveness, the proposed method is compared with several typical machine learning methods, including Support Vector Machine (SVM), Logistic Regression (LR), K-Nearest Neighbor (KNN), AdaBoost (ADB), Fully Connected Neural Network (FNN), and Gaussian Naive Bayes (GNB). The six methods are used to train models with the labeled data ${\tilde{D}}_{k}^{S}$ and ${\tilde{D}}_{k}^{L}$ $(k = 1, \dots, 12)$. The performance of the obtained models is evaluated using ${\tilde{D}}_{k}^{U}$. The model results on the 12 tasks $Z_{k}$ $(k = 1, \dots, 12)$ are given in Table 9.

Table 9.

Comparison of the proposed method with machine learning methods for CWRU dataset.

Tasks	SVM	LR	KNN	ADB	FNN	GNB	WSER
$Z_{1}$	$0.6346 \pm 0.0361$	$0.6459 \pm 0.0608$	$0.7604 \pm 0.0572$	$0.8392 \pm 0.0626$	$0.7709 \pm 0.1236$	$0.8445 \pm 0.0273$	$0.9206 \pm 0.0220$
$Z_{2}$	$0.7503 \pm 0.0503$	$0.6325 \pm 0.0618$	$0.8491 \pm 0.0674$	$0.8246 \pm 0.1064$	$0.8581 \pm 0.0778$	$0.8605 \pm 0.0117$	$0.9324 \pm 0.0170$
$Z_{3}$	$0.8000 \pm 0.0263$	$0.6326 \pm 0.0737$	$0.8331 \pm 0.0723$	$0.8788 \pm 0.0442$	$0.8721 \pm 0.0376$	$0.8666 \pm 0.0028$	$0.9432 \pm 0.0168$
$Z_{4}$	$0.6624 \pm 0.0182$	$0.6310 \pm 0.0451$	$0.8679 \pm 0.0479$	$0.8508 \pm 0.0879$	$0.8633 \pm 0.0496$	$0.7704 \pm 0.0632$	$0.9559 \pm 0.0076$
$Z_{5}$	$0.7839 \pm 0.0570$	$0.6070 \pm 0.0483$	$0.9537 \pm 0.0446$	$0.8350 \pm 0.0958$	$0.9466 \pm 0.0265$	$0.7762 \pm 0.0753$	$0.9651 \pm 0.0088$
$Z_{6}$	$0.6469 \pm 0.0339$	$0.6067 \pm 0.0438$	$0.8445 \pm 0.0681$	$0.7825 \pm 0.1034$	$0.8645 \pm 0.0789$	$0.7389 \pm 0.0641$	$0.9553 \pm 0.0061$
$Z_{7}$	$0.7193 \pm 0.0161$	$0.6087 \pm 0.0372$	$0.9129 \pm 0.0175$	$0.7653 \pm 0.1083$	$0.9076 \pm 0.0229$	$0.8761 \pm 0.0388$	$0.9536 \pm 0.0071$
$Z_{8}$	$0.7348 \pm 0.0699$	$0.5966 \pm 0.0359$	$0.9285 \pm 0.0460$	$0.7881 \pm 0.1394$	$0.9153 \pm 0.0291$	$0.8988 \pm 0.0465$	$0.9486 \pm 0.0220$
$Z_{9}$	$0.7415 \pm 0.0456$	$0.6404 \pm 0.0352$	$0.8932 \pm 0.0702$	$0.7477 \pm 0.1368$	$0.9562 \pm 0.0297$	$0.8771 \pm 0.0179$	$0.9531 \pm 0.0136$
$Z_{10}$	$0.7354 \pm 0.0200$	$0.5679 \pm 0.0477$	$0.8947 \pm 0.0421$	$0.7642 \pm 0.1448$	$0.8993 \pm 0.0367$	$0.9075 \pm 0.0386$	$0.9400 \pm 0.0214$
$Z_{11}$	$0.7109 \pm 0.0216$	$0.5406 \pm 0.0256$	$0.8552 \pm 0.0553$	$0.6380 \pm 0.1278$	$0.8825 \pm 0.0555$	$0.9457 \pm 0.0109$	$0.9598 \pm 0.0133$
$Z_{12}$	$0.7822 \pm 0.0309$	$0.5876 \pm 0.0298$	$0.9403 \pm 0.0585$	$0.8292 \pm 0.0859$	$0.9339 \pm 0.0542$	$0.9634 \pm 0.0230$	$0.9718 \pm 0.0214$

Note. Bolded results indicate the best model performance under the same conditions.

As shown in Table 9, the WSER models perform better than the SVM, LR, KNN, ADB, FNN, and GNB models on most of the tasks; only the FNN model performs better than the WSER model, on task $Z_{9}$. Overall, the WSER method performs best in most cases. The Wilcoxon signed-rank test is performed to assess the performance differences among the models based on their respective results.⁵⁷ The results indicate that the WSER method significantly outperforms the SVM, LR, KNN, ADB, and GNB methods $(T = 0, p = 0.0005 < 0.05)$ and the FNN method $(T = 1, p = 0.0010 < 0.05)$, which underscores the effectiveness of the WSER method.
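To make the reported test statistics easier to follow, the sketch below recomputes the Wilcoxon signed-rank statistic for the FNN and WSER mean accuracies in Table 9. It assumes an exact two-sided test with no ties and no zero differences; under these assumptions it reproduces $T = 1$ and $p \approx 0.0010$. The function name is hypothetical, and the exact variant of the test used in the study is not stated in the text.

```python
def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test (assumes no ties, no zeros)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    t = min(w_plus, w_minus)
    # counts[s] = number of sign patterns whose positive ranks sum to s.
    max_sum = n * (n + 1) // 2
    counts = [0] * (max_sum + 1)
    counts[0] = 1
    for r in range(1, n + 1):
        for s in range(max_sum, r - 1, -1):
            counts[s] += counts[s - r]
    p = min(1.0, 2 * sum(counts[: t + 1]) / 2 ** n)
    return t, p


# Mean accuracies on tasks Z1..Z12, taken from Table 9.
wser = [0.9206, 0.9324, 0.9432, 0.9559, 0.9651, 0.9553,
        0.9536, 0.9486, 0.9531, 0.9400, 0.9598, 0.9718]
fnn = [0.7709, 0.8581, 0.8721, 0.8633, 0.9466, 0.8645,
       0.9076, 0.9153, 0.9562, 0.8993, 0.8825, 0.9339]
t_stat, p_value = wilcoxon_signed_rank(wser, fnn)
print(t_stat, round(p_value, 4))  # 1 0.001
```

With only four paired results, as in the SEU experiments, the smallest attainable two-sided p-value of this exact test is $2 / 2^{4} = 0.125$, so significance at the 0.05 level cannot be reached.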

The models trained using the six machine learning methods with only labeled target data of datasets $H_{0} - H_{3}$ are also obtained in this paper, and the results are given in Table 10.

Table 10.

Performances of machine learning models in single target domain for CWRU dataset.

Datasets	SVM	LR	KNN	ADB	FNN	GNB
$H_{0}$	$0.6720 \pm 0.0317$	$0.6399 \pm 0.0308$	$0.7891 \pm 0.0769$	$0.7376 \pm 0.1590$	$0.7471 \pm 0.0259$	$0.7669 \pm 0.0547$
$H_{1}$	$0.6182 \pm 0.0113$	$0.5533 \pm 0.0470$	$0.8348 \pm 0.0121$	$0.7687 \pm 0.1286$	$0.7198 \pm 0.0310$	$0.6533 \pm 0.0219$
$H_{2}$	$0.7345 \pm 0.0081$	$0.7418 \pm 0.0091$	$0.7922 \pm 0.0235$	$0.6842 \pm 0.0615$	$0.7294 \pm 0.0268$	$0.8068 \pm 0.0289$
$H_{3}$	$0.7790 \pm 0.1303$	$0.6543 \pm 0.0805$	$0.8629 \pm 0.0643$	$0.6849 \pm 0.1070$	$0.8863 \pm 0.0724$	$0.8853 \pm 0.0611$

As shown in Tables 9 and 10, most machine learning methods perform better with the assistance of the source data. This indicates that the model performance in the target domain can be effectively enhanced by the knowledge contained within the source data.

Results of SEU dataset

The above six machine learning methods are also used to train models with the labeled data ${\tilde{D}}_{k}^{S}$ and ${\tilde{D}}_{k}^{L}$ $(k = 13, \dots, 16)$, and with only ${\tilde{D}}_{k}^{L}$. The performance evaluated using ${\tilde{D}}_{k}^{U}$ on tasks $Z_{k}$ is given in Tables 11 and 12, respectively.

Table 11.

Comparison of the proposed method with machine learning methods for SEU dataset.

Tasks	SVM	LR	KNN	ADB	FNN	GNB	WSER
$Z_{13}$	$0.3770 \pm 0.0571$	$0.4385 \pm 0.1639$	$0.5668 \pm 0.1939$	$0.5514 \pm 0.1231$	$0.5911 \pm 0.1605$	$0.3820 \pm 0.0742$	$0.9125 \pm 0.0526$
$Z_{14}$	$0.4038 \pm 0.0798$	$0.2915 \pm 0.0809$	$0.4121 \pm 0.0829$	$0.6462 \pm 0.0586$	$0.5384 \pm 0.1464$	$0.3545 \pm 0.0783$	$0.8335 \pm 0.0517$
$Z_{15}$	$0.3493 \pm 0.1010$	$0.3229 \pm 0.0819$	$0.3402 \pm 0.0997$	$0.4036 \pm 0.0517$	$0.6198 \pm 0.0371$	$0.5351 \pm 0.0998$	$0.6201 \pm 0.0550$
$Z_{16}$	$0.4480 \pm 0.1354$	$0.3939 \pm 0.1055$	$0.4455 \pm 0.1381$	$0.5740 \pm 0.0855$	$0.7156 \pm 0.0695$	$0.5856 \pm 0.0919$	$0.6054 \pm 0.0382$

Note. Bolded results indicate the best model performance under the same conditions.

Table 12.

Performances of machine learning models in single target domain for SEU dataset.

Datasets	SVM	LR	KNN	ADB	FNN	GNB
$B_{0}$	$0.3863 \pm 0.0916$	$0.4240 \pm 0.1641$	$0.4726 \pm 0.2125$	$0.5857 \pm 0.1855$	$0.5568 \pm 0.1843$	$0.5103 \pm 0.2194$
$B_{1}$	$0.3988 \pm 0.0786$	$0.2875 \pm 0.0781$	$0.4155 \pm 0.0933$	$0.6076 \pm 0.0918$	$0.5692 \pm 0.1653$	$0.3979 \pm 0.0479$
$G_{0}$	$0.3200 \pm 0.0787$	$0.3299 \pm 0.0882$	$0.3386 \pm 0.0954$	$0.4008 \pm 0.0519$	$0.6171 \pm 0.0413$	$0.5215 \pm 0.1098$
$G_{1}$	$0.4540 \pm 0.1088$	$0.3598 \pm 0.0816$	$0.4002 \pm 0.0976$	$0.5632 \pm 0.0524$	$0.6266 \pm 0.0986$	$0.5624 \pm 0.0555$

As shown in Table 11, the WSER models perform better than the SVM, LR, KNN, ADB, FNN, and GNB models on most of the tasks; only the FNN model performs better than the WSER model, on task $Z_{16}$. Overall, the WSER method performs best in most cases. The Wilcoxon signed-rank test is not performed here because it requires at least six sets of results.⁵⁷

As shown in Tables 11 and 12, the models trained using labeled source and target data perform slightly better than those trained using only labeled target data, which indicates the feasibility of making use of the knowledge contained in the source data. In addition, the models trained using the proposed method perform better than the others in most cases, which also indicates the effectiveness of the proposed method.

Comparison with transfer learning based methods

Results of CWRU dataset

The WSER method is developed based on the integration of the decision tree method and transfer learning. Its efficacy can also be underscored when compared with the combination of the decision tree method and other transfer learning methods.

Comparative experiments are performed with seven methods, derived in the following three ways.

First, the decision tree method is combined with two instance-based transfer learning methods, namely Nearest Neighbors Weighting (NNW)⁵⁸ and Kullback-Leibler Importance Estimation Procedure (KLIEP).¹⁹

Second, the decision tree method is combined with three typical feature-based transfer learning methods, namely correlation alignment (CORAL),⁵⁹ transfer component analysis (TCA),²¹ and subspace alignment (SA).²²

Third, the decision tree method is combined with two model-based transfer learning methods designed for decision trees, namely SER and STRUT.¹⁵

The performance of the models trained using the methods derived above is examined using ${\tilde{D}}_{k}^{U}$ . The relevant results on Tasks $Z_{k}$ $(k = 1, \dots, 12)$ are given in Table 13.

Table 13.

Comparison of the proposed method with transfer learning based methods for CWRU dataset.

Tasks	NNW	KLIEP	CORAL	TCA	SA	SER	STRUT	WSER
$Z_{1}$	$0.8567 \pm 0.0623$	$0.8963 \pm 0.0505$	$0.8663 \pm 0.0837$	$0.9189 \pm 0.0223$	$0.8285 \pm 0.0706$	$0.8838 \pm 0.0567$	$0.9261 \pm 0.0212$	$0.9206 \pm 0.0220$
$Z_{2}$	$0.9090 \pm 0.0460$	$0.9097 \pm 0.0415$	$0.9192 \pm 0.0286$	$0.8833 \pm 0.0722$	$0.8801 \pm 0.0429$	$0.9151 \pm 0.0398$	$0.9292 \pm 0.0262$	$0.9324 \pm 0.0170$
$Z_{3}$	$0.8948 \pm 0.0428$	$0.9039 \pm 0.0360$	$0.8779 \pm 0.0525$	$0.8663 \pm 0.0658$	$0.8819 \pm 0.0456$	$0.9170 \pm 0.0086$	$0.9371 \pm 0.0114$	$0.9432 \pm 0.0168$
$Z_{4}$	$0.9128 \pm 0.0343$	$0.8618 \pm 0.0712$	$0.8515 \pm 0.0448$	$0.9108 \pm 0.0223$	$0.8144 \pm 0.0574$	$0.9591 \pm 0.0092$	$0.9210 \pm 0.0257$	$0.9559 \pm 0.0076$
$Z_{5}$	$0.9489 \pm 0.0240$	$0.9157 \pm 0.0423$	$0.9529 \pm 0.0198$	$0.9734 \pm 0.0134$	$0.8110 \pm 0.0818$	$0.9591 \pm 0.0087$	$0.9418 \pm 0.0116$	$0.9651 \pm 0.0088$
$Z_{6}$	$0.9323 \pm 0.0156$	$0.8879 \pm 0.0563$	$0.9068 \pm 0.0402$	$0.9045 \pm 0.0204$	$0.8249 \pm 0.0696$	$0.9294 \pm 0.0225$	$0.9073 \pm 0.0343$	$0.9553 \pm 0.0061$
$Z_{7}$	$0.9132 \pm 0.0301$	$0.8679 \pm 0.0398$	$0.8730 \pm 0.0560$	$0.9085 \pm 0.0304$	$0.8324 \pm 0.0408$	$0.9614 \pm 0.0050$	$0.9509 \pm 0.0058$	$0.9536 \pm 0.0071$
$Z_{8}$	$0.8968 \pm 0.0500$	$0.8621 \pm 0.0499$	$0.9264 \pm 0.0243$	$0.9646 \pm 0.0113$	$0.7387 \pm 0.0817$	$0.9491 \pm 0.0230$	$0.9444 \pm 0.0238$	$0.9486 \pm 0.0220$
$Z_{9}$	$0.9315 \pm 0.0198$	$0.9002 \pm 0.0309$	$0.8946 \pm 0.0152$	$0.9528 \pm 0.0151$	$0.8295 \pm 0.0510$	$0.9567 \pm 0.0112$	$0.9437 \pm 0.0161$	$0.9531 \pm 0.0136$
$Z_{10}$	$0.9020 \pm 0.0495$	$0.8441 \pm 0.0389$	$0.8887 \pm 0.0388$	$0.9005 \pm 0.0669$	$0.8064 \pm 0.0335$	$0.9166 \pm 0.0185$	$0.9397 \pm 0.0187$	$0.9400 \pm 0.0214$
$Z_{11}$	$0.8775 \pm 0.0644$	$0.8719 \pm 0.0583$	$0.9081 \pm 0.0424$	$0.8646 \pm 0.0529$	$0.8216 \pm 0.0733$	$0.9565 \pm 0.0231$	$0.9477 \pm 0.0295$	$0.9598 \pm 0.0133$
$Z_{12}$	$0.9250 \pm 0.0313$	$0.9204 \pm 0.0255$	$0.9139 \pm 0.0545$	$0.9232 \pm 0.0189$	$0.8268 \pm 0.0723$	$0.9525 \pm 0.0336$	$0.9269 \pm 0.0183$	$0.9718 \pm 0.0214$

Note. Bolded results indicate the best model performance under the same conditions.

As shown in Table 13, compared with the WSER models, the TCA models perform better on tasks $Z_{5}$ and $Z_{8}$, the SER models perform better on tasks $Z_{4}$, $Z_{7}$, $Z_{8}$, and $Z_{9}$, and the STRUT model performs better on task $Z_{1}$. In the rest of the cases, the WSER models perform better than the other compared methods. Overall, the WSER method performs best in most cases. The Wilcoxon signed-rank test is performed to assess the performance differences among the models based on the results. The results indicate that the WSER method significantly outperforms the NNW, KLIEP, CORAL, and SA methods $(T = 0, p = 0.0005 < 0.05)$, the STRUT method $(T = 5, p = 0.0049 < 0.05)$, the TCA method $(T = 5, p = 0.0093 < 0.05)$, and the SER method $(T = 13, p = 0.0425 < 0.05)$, which further highlights the effectiveness of the WSER method.

Results of SEU dataset

The models are also trained using the methods derived above on SEU dataset, and examined using ${\tilde{D}}_{k}^{U}$ $(k = 13, \dots, 16)$ . The relevant results on Tasks $Z_{k}$ are given in Table 14.

Table 14.

Comparison of the proposed method with transfer learning based methods for SEU dataset.

Tasks	NNW	KLIEP	CORAL	TCA	SA	SER	STRUT	WSER
$Z_{13}$	$0.7832 \pm 0.1149$	$0.7575 \pm 0.1618$	$0.7537 \pm 0.1768$	$0.7073 \pm 0.2306$	$0.8763 \pm 0.0673$	$0.8199 \pm 0.0904$	$0.7941 \pm 0.1124$	$0.9125 \pm 0.0526$
$Z_{14}$	$0.7373 \pm 0.1147$	$0.7632 \pm 0.1106$	$0.7163 \pm 0.1001$	$0.5021 \pm 0.1180$	$0.7369 \pm 0.1021$	$0.7453 \pm 0.1061$	$0.6861 \pm 0.1620$	$0.8335 \pm 0.0517$
$Z_{15}$	$0.4274 \pm 0.1548$	$0.4987 \pm 0.0932$	$0.4826 \pm 0.0857$	$0.4311 \pm 0.1317$	$0.4795 \pm 0.0978$	$0.5049 \pm 0.0684$	$0.5081 \pm 0.0872$	$0.6201 \pm 0.0550$
$Z_{16}$	$0.5679 \pm 0.0566$	$0.5409 \pm 0.0893$	$0.5575 \pm 0.1429$	$0.4108 \pm 0.1354$	$0.5815 \pm 0.1226$	$0.5560 \pm 0.1240$	$0.5791 \pm 0.1216$	$0.6054 \pm 0.0382$

Note. Bolded results indicate the best model performance under the same conditions.

As shown in Table 14, the WSER models outperform all compared methods on every task, which further highlights the effectiveness of the proposed method.
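The comparison in Table 14 can be checked mechanically. The sketch below transcribes the mean accuracies from the table in row order (standard deviations are omitted) and reports, for each task, the best method, the runner-up, and the margin between their means. The helper name `best_and_margin` is illustrative.

```python
# Mean accuracies transcribed from Table 14, in the table's row order.
METHODS = ["NNW", "KLIEP", "CORAL", "TCA", "SA", "SER", "STRUT", "WSER"]
TABLE_14_MEANS = [
    [0.7832, 0.7575, 0.7537, 0.7073, 0.8763, 0.8199, 0.7941, 0.9125],
    [0.7373, 0.7632, 0.7163, 0.5021, 0.7369, 0.7453, 0.6861, 0.8335],
    [0.4274, 0.4987, 0.4826, 0.4311, 0.4795, 0.5049, 0.5081, 0.6201],
    [0.5679, 0.5409, 0.5575, 0.4108, 0.5815, 0.5560, 0.5791, 0.6054],
]


def best_and_margin(row):
    """Return (best method, runner-up method, margin between their means)."""
    ranked = sorted(zip(row, METHODS), reverse=True)
    (best_acc, best), (second_acc, second) = ranked[0], ranked[1]
    return best, second, round(best_acc - second_acc, 4)


for row in TABLE_14_MEANS:
    print(best_and_margin(row))
```

On these values, WSER has the highest mean in all four rows; the runner-up is SA, KLIEP, STRUT, and SA, respectively. This check compares means only and does not account for the reported standard deviations.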

Discussion

The results presented in Tables 5 and 7 offer insight into model performance across domains. When models are constructed directly from the labeled data in the target domain, their performance can be limited. After the source data are reweighted, models built with the weighted data show notable improvements on some tasks, which suggests that data weighting can reduce the difference in feature distribution between the source and target domains on these tasks. However, on other tasks the model performance gets worse with the weighted data. This may indicate a large difference in feature distribution between the source and target domains on these tasks, which is difficult to bridge through data weighting alone. In contrast, the proposed method, which employs labeled data from both the source and target domains, demonstrates a clear advantage. Its performance surpasses that of models constructed either solely with the labeled target data or with the weighted labeled data from both domains. This indicates that the proposed method not only extracts shared knowledge from the source domain but also transfers this knowledge more effectively between the two domains. Consequently, the proposed method enhances model performance in the target domain even when the feature distributions of the source and target domains are markedly different, which underscores its robustness to significant differences in feature distribution.

In addition, the proposed WSER method outperforms the compared methods without transfer learning, indicating that the knowledge from the source domain can be utilized to construct an effective model for the target task. Moreover, according to Tables 13 and 14, the WSER method achieves better performance compared to other transfer-learning-based methods. These results demonstrate the effectiveness of the WSER method in extracting and transferring knowledge for fault diagnosis.

To sum up, in fault diagnosis problems where data are limited by cost or other constraints, it can be challenging to construct an effective model. The proposed method addresses this issue by extracting knowledge from a source domain and transferring it to the target domain. The fault diagnosis model built with the transferred knowledge can provide better predictive power for the target task. Additionally, because the proposed method is based on decision trees, it offers better interpretability than black-box machine learning methods. This transparency allows engineers to understand how the recommended decisions are made, enhancing the reliability of system operation and maintenance.
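The interpretability argument can be made concrete with a toy example. The sketch below is not the WSER model: it uses a small hand-written tree whose feature names, thresholds, and fault labels are hypothetical, and it shows how every root-to-leaf path of a decision tree reads as an explicit IF-THEN rule that an engineer can inspect.

```python
# A toy decision tree as nested dicts. Feature names, thresholds, and
# class labels are hypothetical and are not taken from the article.
TOY_TREE = {
    "feature": "rms", "threshold": 0.25,
    "left": {"label": "normal"},
    "right": {
        "feature": "kurtosis", "threshold": 4.0,
        "left": {"label": "outer-race fault"},
        "right": {"label": "inner-race fault"},
    },
}


def extract_rules(node, conditions=()):
    """Return one readable IF-THEN rule per leaf of the tree."""
    if "label" in node:
        premise = " AND ".join(conditions) or "TRUE"
        return [f"IF {premise} THEN {node['label']}"]
    name, cut = node["feature"], node["threshold"]
    return (extract_rules(node["left"], conditions + (f"{name} <= {cut}",))
            + extract_rules(node["right"], conditions + (f"{name} > {cut}",)))


def predict(node, sample):
    """Follow the splits for one sample (a dict of feature values)."""
    while "label" not in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


for rule in extract_rules(TOY_TREE):
    print(rule)
# IF rms <= 0.25 THEN normal
# IF rms > 0.25 AND kurtosis <= 4.0 THEN outer-race fault
# IF rms > 0.25 AND kurtosis > 4.0 THEN inner-race fault
```

Each prediction is traceable to exactly one of these rules, which is the sense in which a tree-based diagnosis can be audited more directly than a black-box classifier.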

Conclusion

Data-driven methods can be effective for fault diagnosis of complex systems. However, their application can be limited by the lack of data. To tackle this challenge, this study develops a cross-domain decision method for fault diagnosis that facilitates knowledge transfer from the source domain to the target domain. First, features are extracted from the time, frequency, and time-frequency domains. The data weights are then determined following the idea of instance transfer, which can reduce the distribution dissimilarity between the source and target data. The extracted features are subsequently selected using the estimated data weights. Finally, the knowledge contained in the source model is transferred to the target domain using the proposed method. The efficacy of the method is validated on the CWRU and SEU engineering fault datasets. A comparative analysis against machine learning methods and other transfer-learning-based methods further confirms its superior performance.

The principal limitations of this study are as follows: (1) the proposed method constructs the model with features extracted using specific methods, which may need adjustment in different decision scenarios; and (2) the feature spaces of the source and the target domains are assumed to be the same, which may not be applicable in some problems.

In future work, the proposed method will be extended to situations where the source and target domains have heterogeneous feature spaces. In addition, the transfer task in which no labeled data are available in the target domain will be further investigated.

Footnotes

Handling Editor: Chenhui Liang

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research is supported by the Science and Technology Project for State Grid Anhui Electric Power Co., Ltd (No. 52120522000M).

ORCID iD

Zijian Wu

