Abstract
Vibration signals reflecting different machinery conditions are very useful for fault diagnosis. However, vibration signal characteristics differ across equipment types and failure patterns, and this information is often lost in structureless condition diagnosis models. We propose a structured Fisher discrimination sparse coding–based fault diagnosis scheme that improves the feature extraction procedure in both efficiency and effectiveness. The scheme has three major components: (1) a structured dictionary for synthesizing the vibration signals, learned by structured Fisher discrimination dictionary learning; (2) tree-structured sparse coding, which extracts sparse representation coefficients from vibration signals to represent fault features; and (3) a support vector machine classifier on the features to recognize different faults. The proposed algorithm is verified on a standard bearing fault data set and a worm gear fault experiment. Test results show that the proposed method achieves better performance with considerable efficiency and generalization ability.
Introduction
Bearings and gears are widely used in automobiles, machines, turbines, and mining equipment. Pre-emptive detection of bearing and gear failures is critical to the reliable operation of mechanical systems. 1 This can be achieved by making use of the information contained in vibration signals. Traditional basis representation algorithms such as the fast Fourier transform (FFT), wavelets, 2 and wavelet variants 3 are effective in dealing with vibration signals for fault diagnosis. However, the structural and discriminative information in vibration signals was rarely exploited in the past. To overcome this limitation, new signal processing techniques have emerged to capture useful structural characteristics and discriminative information. 4 Recently, considerable attention has been paid to sparse coding representation–based techniques for vibration signal processing. 5 Sparse coding representation was first proposed for accurate signal reconstruction. 6 Representing a signal in decomposed form involves the choice of a dictionary, which is a collection of elementary signals or atoms.7,8 In many tasks, the dictionary is fixed; for example, FFT and wavelets are special cases of signal representation based on fixed-format dictionaries, but this is not sufficient for some complicated situations. A better approach is to learn a dictionary from the data itself, in which a sparse representation can be realized. 9
There are two practical ways to deal with the dictionary learning (DL) issue: greedy pursuit algorithms10–12 and convex relaxation algorithms.13–15 In the first category, matching pursuit (MP)10,11 and orthogonal matching pursuit (OMP) 12 are the simplest approximate greedy algorithms. In the second, basis pursuit (BP) 14 and the least absolute shrinkage and selection operator (Lasso) 15 are suitable for tackling the convex problem. Other algorithms, such as the FOCal Underdetermined System Solver (FOCUSS), 16 have also been developed. More details can be found in previous studies.17–20 In our previous research, 22 the shift invariant sparse coding (SISC) 21 algorithm was used to learn a dictionary from training vibration signals, and new vibration signals were classified by finding their sparse coefficients with respect to the activation distribution of the learned dictionaries.
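As a concrete illustration of the greedy pursuit family mentioned above, the following is a minimal Python sketch of OMP. The dictionary and signal here are synthetic toy data, not signals from this study, and the implementation is a simplified sketch rather than a production solver.

```python
import numpy as np

def omp(D, y, k):
    """Minimal orthogonal matching pursuit: greedily select k atoms
    (unit-norm columns of D) to approximate the signal y."""
    residual = y.copy()
    support = []
    x = np.zeros(D.shape[1])
    for _ in range(k):
        # Pick the atom most correlated with the current residual
        j = int(np.argmax(np.abs(D.T @ residual)))
        support.append(j)
        # Least-squares refit on all selected atoms (the "orthogonal" step)
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x[support] = coef
    return x

# Toy example: y is an exact combination of two dictionary atoms
rng = np.random.default_rng(0)
D = rng.standard_normal((64, 16))
D /= np.linalg.norm(D, axis=0)          # normalize atoms to unit norm
y = 3.0 * D[:, 2] - 1.5 * D[:, 9]
x = omp(D, y, k=2)
print(np.nonzero(x)[0])                  # indices of the recovered atoms
```

With a low-coherence random dictionary and an exactly 2-sparse signal, OMP recovers the generating atoms and the reconstruction is exact.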
However, these methods may not take advantage of the discriminative information within the data, especially when there are high correlations among the samples of different classes. Moreover, the efficiency and generalization ability of the algorithms are also important for fault diagnosis, while the computation of DL and sparse coding is complicated. Furthermore, how to decompose the signal over the dictionary is another challenge for signal analysis. Hence, there have recently been several attempts to include discrimination information in computing the dictionary and its coefficients.23,24 A Fisher discrimination–based dictionary learning (FDDL) scheme was proposed for image recognition. 25 The Fisher discrimination property is inherited by the sparse representation coefficients, which present small within-class scatter and large between-class scatter. Each sub-dictionary is constrained to preserve good reconstruction for its target training class but poor reconstruction for the other classes. However, the Fisher discrimination features only reflect the relationship between classes in the representation dimension; even when the relationship between classes is known, the inherent correlation between classes is not considered.
In this article, we present a structured sparse representation fault diagnosis framework, which extends our previous work on sparse coding. 22 In the proposed framework, first, a structured dictionary is learned from labeled training samples by structured Fisher discrimination dictionary learning (SFDDL), in which the dictionary atoms retain their class labels under a structured constraint. The atoms are arranged according to the correlation between classes: the more relevant two classes are, the closer their atoms are arranged in the dictionary. Then, test samples are encoded by tree-structured sparse coding (TSSC) on the learned dictionary; the tree structure organizes the faults in a hierarchical framework similar to a fault tree. Both the representation coefficients and the residual are exploited in the final classification. Compared with our previous work in Liu et al., 22 which addressed how to efficiently generate features for vibration signals, we evaluate and validate our algorithm on a standard bearing fault database and on a worm gear fault experiment.
The remainder of the article is organized as follows. Section “Structured Fisher discrimination sparse coding modeling” briefly describes the proposed fault diagnosis scheme, the SFDDL model, and the proposed TSSC model. In section “Case studies,” two experiments are described. The discussion of the results is presented in section “Discussion.” Finally, a conclusion is provided in section “Conclusion.”
Structured Fisher discrimination sparse coding modeling
To effectively reveal the structural information in vibration signals, the proposed method, structured Fisher discrimination sparse coding (SFDSC), adopts a machine learning scheme that combines supervised DL with tree-structured coefficient solving, as shown in Figure 1. The vibration signals are acquired by the data acquisition system, and the fault information is simultaneously recorded as the labels of the signals. The vibration signals are then processed as detailed in sections "SFDDL" and "TSSC coefficients."

Overview of the proposed methodology.
SFDDL
Unsupervised DL algorithms have achieved state-of-the-art results in image classification 26 and image reconstruction. 27 With supervised or semi-supervised DL algorithms, the labels of the training samples carry class discrimination information that can lead to better classification results. 28 The Fisher discrimination criterion used in classification maximizes the distance between classes while minimizing the distance among elements within each class. Discriminative DL falls into two categories: the shared dictionary (SD) and the partition dictionary (PD). An SD is learned from all classes, and its representation coefficients are discriminative. A sub-dictionary of a PD is learned only from the samples of the corresponding class; the representation residual associated with each class can then be used in classification, while the representation coefficients are not required to be discriminative and are not further considered in classification. 18 In addition, the correlation between classes is considered through a predefined relationship matrix.
Let
where
The reconstruction error of any training sample class
and we minimize this discriminative reconstruction error term.
Moreover, to increase the discrimination capability of dictionary
where
where
Following the study of Yang et al., 18 we can define
and we also want to minimize the Fisher discrimination term. The variable
Considering the above discussion, we have the SFDDL model
Here, it is assumed that
SFDDL has better discriminative capability: the dictionary learned under the Fisher discrimination criterion essentially contains distinguishing information, with small within-class distance and large between-class distance. Furthermore, dictionary atoms are learned simultaneously in SFDDL, while BP-based methods13,21,22,29 learn atoms sequentially from one sample class to another. SFDDL does not require time-consuming convolution computation,21,22 and batch processing is employed instead of a tedious single-sample cycle. These properties give SFDDL better generalization ability. The above improvements are confirmed in the practical application cases presented in section "Case studies."
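The Fisher discrimination criterion that drives SFDDL can be illustrated with a short sketch evaluating the within-class and between-class scatter of coding coefficients. This is a simplified illustration of the criterion only, not the authors' solver: the full SFDDL objective also contains reconstruction and structure terms, and practical FDDL formulations add an elastic term for convexity.

```python
import numpy as np

def fisher_term(X, labels):
    """Fisher discrimination term on coding coefficients X (atoms x samples):
    trace of the within-class scatter minus trace of the between-class
    scatter.  Smaller values mean each class's coefficients cluster tightly
    while the class means are well separated -- the goal of (S)FDDL."""
    m = X.mean(axis=1, keepdims=True)            # global mean coefficient
    sw = 0.0                                     # within-class scatter
    sb = 0.0                                     # between-class scatter
    for c in np.unique(labels):
        Xc = X[:, labels == c]
        mc = Xc.mean(axis=1, keepdims=True)
        sw += ((Xc - mc) ** 2).sum()
        sb += Xc.shape[1] * ((mc - m) ** 2).sum()
    return sw - sb

# Toy coefficients: two classes separated along the first dimension
X = np.array([[0., 0., 5., 5.],
              [1., -1., 1., -1.]])
labels = np.array([0, 0, 1, 1])
shuffled = np.array([0, 1, 0, 1])    # labels ignoring the true structure
print(fisher_term(X, labels), fisher_term(X, shuffled))
```

With the true labels the term is negative (well-separated classes); with shuffled labels the same data scores much higher, which is what the minimization penalizes.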
TSSC coefficients
Recently, tree-structured sparse coding, which uses group information about the features to yield solutions with grouped sparsity, has received increasing attention in many areas, including signal processing, machine learning, and statistical learning. In this setting, the groups of the inputs are assumed to be available as prior knowledge, and groups of inputs, rather than individual inputs, serve as the unit of variable selection. 30
Sparse coding achieves a sparse representation of the signal by applying the
In TSSC, the structure over the features is represented as a tree whose leaf nodes and internal nodes are clusters of features. Such regularization helps uncover structured sparsity, which is desirable for applications with meaningful tree structures on the features. 32 In many applications, certain tree structures arise naturally to represent features, but the resulting optimization problem is much more difficult to solve than Lasso and group Lasso because of the complex tree-structured regularization. Tree structures are well suited to fault diagnosis; for example, the fundamental concept of fault tree analysis is the translation of a physical system into a structured logic diagram (a fault tree), in which one specified top event of interest is attributed to certain specified causes. 33 For a diagnostic system that knows the relationships between faults, we can define a tree structure similar to a fault tree to represent the fault logic. Here, we define a structured regularization with a predefined tree structure based on a group-Lasso penalty, where one group is defined for each node of an index tree. Figure 2 shows an example of the index tree.
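The tree-structured group regularization described above can be sketched as a weighted sum of Euclidean norms over the coefficient groups indexed by the tree nodes. The index tree below is a small hypothetical example, not the one used in the experiments, and uniform unit weights are an assumption.

```python
import numpy as np

def tree_group_penalty(x, groups, weights=None):
    """Tree-structured group-lasso penalty: sum over tree nodes g of
    w_g * ||x[g]||_2.  Each node of the index tree contributes one group;
    a parent group contains the indices of all its descendants, so driving
    a parent's norm to zero zeroes the whole subtree (hierarchical
    sparsity)."""
    if weights is None:
        weights = [1.0] * len(groups)
    return sum(w * np.linalg.norm(x[list(g)]) for g, w in zip(groups, weights))

# Hypothetical 3-level index tree over 4 coefficients:
# root {0,1,2,3}; internal nodes {0,1} and {2,3}; leaves {0},{1},{2},{3}
groups = [(0, 1, 2, 3), (0, 1), (2, 3), (0,), (1,), (2,), (3,)]
x = np.array([3.0, 4.0, 0.0, 0.0])   # activation confined to one subtree
penalty = tree_group_penalty(x, groups)
print(penalty)
```

Because the activation is confined to the first subtree, the groups of the second subtree contribute nothing to the penalty, which is exactly the behavior that encourages group-sparse solutions.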

A sample of the index tree.
For an index tree
We define the tree-structured group regularization penalty term as
where
The penalized optimization problem associated with the tree-structured group regularization for a given learned dictionary is
where
In the model defined by equation (9), the value selected for the regularization parameter
In the implementation of the TSSC algorithm, the first step is initiated by
Case studies
Bearing fault diagnosis experiment
The standard bearing fault vibration data were obtained from the bearing center of Case Western Reserve University (CWRU). 34 This data set has been widely used as a benchmark. The vibration data were collected from the drive-end ball bearing of a three-phase induction motor (Reliance Electric 2HP IQ PreAlert). The SKF 6205-2RS JEM deep groove ball bearing was tested with single-point faults seeded by electro-discharge machining. Fault diameters were 0.007, 0.014, 0.021, and 0.028 in, and the faults were located on the inner race, the outer race, and the ball. An acceleration transducer was mounted on the motor housing at the drive end to acquire vibration signals. All signals were recorded at a sampling frequency of 12 kHz under motor loads of 0, 1, 2, and 3 horsepower (hp0–hp3). Fault-free samples were also collected at the four motor loads.
In this study, data sets were selected from the tests listed in Table 1. All raw data were segmented into 1024-point samples without overlap, and signal edges were smoothed. All samples were normalized to [−1, 1], with the mean value and standard deviation saved as two time-domain features. In this way, 118 samples were obtained for each set of faulty data, giving 1180 samples for each motor load (hp0–hp3) and 4720 samples of 10 classes of bearing data in total. Without loss of generality, half of the samples under motor load 0 hp, denoted by "*" in Table 1, were randomly selected to learn atoms, while the remaining samples were used only for testing. The sparse coefficients coded over the learned atoms were used as discriminative features for bearing fault diagnosis. The adaptability and generalization ability of the learned atoms were tested with the bearing data under motor loads of 1, 2, and 3 hp.
Bearing data set list.
These sets of samples are randomly split in half for learning atoms; the others are used for testing.
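The segmentation and normalization steps above can be sketched as follows. The exact normalization and edge-smoothing procedures are not specified in the text, so centering by the segment mean and scaling by the peak absolute value are assumptions made for illustration.

```python
import numpy as np

def segment_and_normalize(signal, seg_len=1024):
    """Split a raw vibration signal into non-overlapping seg_len-point
    samples, normalize each to [-1, 1], and keep each segment's original
    mean and standard deviation as two time-domain features.
    Normalization here (assumed): remove the mean, divide by the peak
    absolute value.  Edge smoothing from the paper is omitted."""
    n_seg = len(signal) // seg_len
    samples, features = [], []
    for i in range(n_seg):
        seg = np.asarray(signal[i * seg_len:(i + 1) * seg_len], float)
        mu, sigma = seg.mean(), seg.std()
        features.append((mu, sigma))             # saved time-domain features
        centered = seg - mu
        peak = np.abs(centered).max()
        samples.append(centered / peak if peak > 0 else centered)
    return np.array(samples), np.array(features)

# Toy signal standing in for a raw vibration record
signal = np.sin(np.linspace(0.0, 100.0, 4096)) + 0.3
samples, feats = segment_and_normalize(signal)
print(samples.shape, feats.shape)
```

Each row of `samples` is one 1024-point training/testing sample in [−1, 1], and each row of `feats` carries the (mean, standard deviation) pair saved before normalization.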
Atoms learning of bearing vibration signals by SFDDL
Faults in mechanical components are indicated by vibration signals sensed from the machine. Vibration signals contain fault information, normal operational information, and noise components. Therefore, effectively extracting useful information about an abnormal condition is critical to machinery fault diagnosis. The proposed SFDDL method learns a discriminative dictionary from a small proportion of the vibration signals to capture the most information about the faults.
In our experiments, each atom has 1024 points, the same length as the samples of the original vibration signals. The number of atoms per class is 12, and 10 classes of bearing fault data sets are learned. As a result, a total of 10 sub-dictionaries
where

The learned dictionary (atoms (
Comparing each class of atoms in Figure 3 with the others, there is an obvious difference between the normal condition (sub-dictionary C1(N)) and the rest: almost no impulses appear in the atoms of the normal condition. In contrast, impulses of different degrees are distributed in the atoms of the inner race, outer race, and ball faults, corresponding to the severity of the fault diameter. The variations in the location and morphology of the impulses are due to the different mechanisms generating them, consistent with our previous work. 22
Choosing 12 atoms per class is a trade-off between computational cost (the sum of learning and coding time) and the classification performance of fault diagnosis. Figure 4 shows the time consumed for learning the dictionary from half of the samples under motor load 0 hp with different numbers of atoms per class, together with the coding time for the remaining half of the samples with respect to the learned dictionary. The learning time is approximately linear in the number of atoms per class, while the coding time follows an S-curve: the SFDDL learning time increases uniformly, but the TSSC coefficient coding time rises rapidly once the number of atoms per class exceeds 12

Time consumed on learning and coding with various number of atoms per class (half samples under motor load 0 hp for learning and the remaining half for coding).
Sparse coding of bearing vibration signals by TSSC
The proposed TSSC method is applied to solve the sparse coding of vibration signals using the learned dictionary. In this experiment, the index tree of the group structure is shown in Figure 5. The root node is

Index of tree-structured group for bearing fault diagnosis.
As an example, we randomly select one sample from each class; the result is shown in Figure 6. The first row plots the original vibration signals, with the trend removed and the amplitude normalized to [−1, 1]. The normal condition appears more stationary than the fault conditions, whose signals contain amplitude impulses. The active sparse coding coefficients are scattered in the second row, showing that TSSC produces group-sparse solutions. The group residuals and group activations of the reconstruction are scattered in the third and fourth rows, likewise normalized to the range −1 to 1. For the normal condition sample C1(N), the activation (non-zero coefficients) is exactly centered in the first group; the minimum residual and maximum activation are both discriminatively located at class 1, so the normal condition can be classified easily using these features. In the same way, most samples can be distinguished correctly, although their distributions vary among classes. However, the inner race fault with a fault diameter of 0.014 in, denoted C3(I14), fails to be detected using this strategy. A possible explanation is that these signals are more complex and contain several different components of fault information, which are captured by different sub-dictionaries. To remove this defect, other features, such as time-domain or frequency-domain features, can be added to the classification.

An example to illustrate the tree-structured sparse coding of different classes of vibration signals (randomly selected from testing half samples of hp0) with learned dictionary: the first row is the original vibration signal, the second row is the corresponding sparse coding coefficients, the third and fourth row are the normalized group residual values and normalized group activation values of the reconstruction for the original signal using different sub-dictionaries, respectively: (a) example of fault class C1–C5 and (b) example of fault class C6–C10.
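The group residual and group activation features described above can be sketched as follows. The sub-dictionary layout (contiguous blocks of atoms per class) and the ℓ1 activation measure are assumptions made for illustration; the toy dictionary below is not from the experiments.

```python
import numpy as np

def group_residuals_activations(y, D, x, n_classes, atoms_per_class):
    """For each class c, reconstruct y using only that class's
    sub-dictionary D_c and coefficient block x_c, and report the residual
    ||y - D_c x_c||_2 and the activation ||x_c||_1.  A sample is then
    attributed to the class with the smallest residual / largest
    activation (the paper additionally rescales both feature vectors
    before feeding them to an SVM)."""
    residuals, activations = [], []
    for c in range(n_classes):
        sl = slice(c * atoms_per_class, (c + 1) * atoms_per_class)
        recon = D[:, sl] @ x[sl]
        residuals.append(np.linalg.norm(y - recon))
        activations.append(np.abs(x[sl]).sum())
    return np.array(residuals), np.array(activations)

# Toy setup: 2 classes with 2 atoms each; y generated purely by class 0
D = np.eye(4)                       # trivially orthogonal toy dictionary
x = np.array([1.0, 2.0, 0.0, 0.0])  # group-sparse code: class-0 block only
y = D @ x
residuals, activations = group_residuals_activations(
    y, D, x, n_classes=2, atoms_per_class=2)
print(residuals, activations)
```

The class-0 residual is zero and its activation dominates, so both decision rules point to the correct class.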
Bearing faults diagnosis
As mentioned above, quadratic SVMs were implemented to classify the fault classes. First, 70% of the samples in the test data under 0 hp motor load were used for 10-fold cross validation. Then, the remaining 30% of the samples were tested with the model that achieved the best accuracy in the cross-validation phase. The results are shown in Figure 7. When the number of atoms per class (

Classification accuracy rate of various numbers of atoms per class of bearing test data, under motor load 0 hp: 10-fold cross validation with 70% samples and test with the remaining 30% samples.
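The evaluation protocol above (quadratic SVM, 10-fold cross validation on 70% of the samples, final test on the remaining 30%) can be sketched with scikit-learn. This is a sketch, not the authors' code: the synthetic blobs stand in for the extracted sparse coding features, and `coef0=1` in the polynomial kernel is an assumption (it adds linear terms to the quadratic kernel).

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC

# Synthetic, well-separated 4-class features standing in for the
# group residual/activation feature vectors of the paper
X, y = make_blobs(n_samples=400, centers=4, cluster_std=0.5, random_state=0)

# 70% for cross validation, 30% held out for the final test
X_cv, X_test, y_cv, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# "Quadratic SVM": polynomial kernel of degree 2 (coef0=1 is assumed)
clf = SVC(kernel="poly", degree=2, coef0=1, C=1.0)
cv_scores = cross_val_score(clf, X_cv, y_cv, cv=10)   # 10-fold CV on 70%
clf.fit(X_cv, y_cv)                                   # refit on CV portion
test_acc = clf.score(X_test, y_test)                  # accuracy on the 30%
print(cv_scores.mean(), test_acc)
```

On well-separated features, both the cross-validation mean and the held-out test accuracy should be high; with real extracted features, the CV stage would also be the place to select hyperparameters.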
Worm gear fault experiment
Experimental setup
The experiment is conducted on a test bench, as shown in Figure 8. The worm gearbox of the test rig is of type WPA (W: worm speed reducer; P: whole box structure; A: input shaft) 40 (ratio 1/10, number of threads 2, number of teeth 20, module 2.5 mm, lead angle 9°28′, pressure angle 20°, reference diameter of worm 30 mm) and is driven by an alternating current (AC) servomotor. The load is applied by another AC servomotor at 0 or 6 N m. Artificial faults (worm gear pitting, worm gear spalling, and worm gear breakage, as shown in Figure 9) are produced on the worm gear, and the vibration signal is sensed from the gearbox housing with sampling frequency

Worm gear fault test platform.

Artificial worm gear fault specimens.
Worm gear faults diagnosis
Signals under different loadings and faults were acquired. As listed in Table 2, there are four classes of failure and two loading conditions for the worm gear, and the raw vibration signals under the different conditions are shown in Figure 10. The signals were divided into 1024-point segments as samples. In total, 50 × 4 samples (50 per class) under free loading were used for DL, and another 50 × 4 samples under free loading plus 50 × 4 samples under 6 N m loading were used for testing. The fault signals were processed according to the flowchart in Figure 1. The parameters used for SFDDL are the same as in the bearing case, except that the number of atoms per class is 10 and the predefined relationship matrix is an identity matrix. Because each kind of failure has only one degree of fault, the tree structure has two levels: one root node and four child nodes in the second level.
Worm gear data set list.
These sets of samples are randomly split in half for learning atoms; the others are used for testing.

A set of raw vibration signals with faulted worm gear.
The learned dictionary is shown in Figure 11. It is worth noting that the dictionaries learned under fault conditions have significant low-frequency components in their atoms, whereas the normal condition has none. Serious impact components appear in the atoms (such as atoms 2 and 6) under the broken condition. These learned atoms can be considered to contain the corresponding fault features; hence, they are able to represent the raw signals in the sparse model.

Learned dictionaries for worm gear faults by SFDDL.
The classification results for the worm gear faults by SISC 22 and the proposed SFDSC method are shown in Table 3. With SFDSC, the accuracy is more than 96% under free loading and about 88% at 6 N m loading. The proposed SFDSC method thus outperforms SISC in both cases.
Fault diagnosis accuracy for worm gear with SISC 22 and SFDSC.
SISC: shift invariant sparse coding; SFDSC: structured Fisher discrimination sparse coding.
Discussion
Generalization and robustness
To validate the adaptability and generalization ability of the proposed SFDSC model, variations of the working load in the bearing data were used in the tests. Based on the over-complete dictionary (12 atoms per class) learned from half of the samples under motor load 0 hp by SFDDL, 20 features (10 group residual features and 10 group activation features) were extracted for the test bearing data samples (totally
The classification accuracies for the three other motor loads and different feature sets are given in Table 4. For load hp1, the accuracies are higher than 87% with time-domain feature sets, in both the cross-validation and test scenarios. More than 93% of samples are correctly distinguished under load hp2, while the average for load hp3 is about 92%. If we focus only on the fault categories N, I, O, and B, rather than the particular fault diameters, the results rise to 100%, 97%, 98%, and 98% for hp0, hp1, hp2, and hp3, respectively. Furthermore, the misclassification among categories is less than 3% for all load conditions. This indicates that the proposed SFDSC has good generalization ability for the bearing fault diagnosis problem.
Classification accuracy rates of SFDSC (dictionary learning set: half of hp0, 12 atoms per class. Validate and test set: hp0, hp1, hp2, and hp3; 70% for cross validation (CV), 30% for the test).
SFDSC: structured Fisher discrimination sparse coding.
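The category-level evaluation above (collapsing the 10 diameter-specific classes into N, I, O, and B) can be sketched as follows. The class-name strings are assumed stand-ins for the labels of Table 1, and the toy prediction lists are illustrative, not experimental results.

```python
# Assumed map from diameter-specific class labels to the four fault
# categories: N (normal), I (inner race), O (outer race), B (ball)
CATEGORY = {
    "N": "N",
    "I07": "I", "I14": "I", "I21": "I",
    "O07": "O", "O14": "O", "O21": "O",
    "B07": "B", "B14": "B", "B21": "B",
}

def category_accuracy(y_true, y_pred):
    """Accuracy after dropping the fault diameter: a prediction counts as
    correct if it hits the right category (N/I/O/B), even when the
    predicted diameter is wrong."""
    hits = sum(CATEGORY[t] == CATEGORY[p] for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

# Toy example: two diameter errors (still correct category), one
# genuine category error (B predicted as N)
y_true = ["N", "I07", "I14", "O21", "B07"]
y_pred = ["N", "I14", "I14", "O07", "N"]
print(category_accuracy(y_true, y_pred))
```

Diameter confusions within a category do not lower this score, which is why the category-level accuracies in Table 4 exceed the 10-class accuracies.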
Comparison with other algorithms
Using the same CWRU bearing data set, Table 5 summarizes the classification results of different methods from the literature and ours. As abnormality detectors, all techniques perform well, identifying nearly 100% of faults. For the four-category (N, I, O, and B) classification, the proposed method has the best result (100%), equal to IWPT-SVM. 35 Meanwhile, we obtain a generalization result of about 98% on the unlearned data sets hp1, hp2, and hp3. For the full 10-class fault diagnosis problem, we also achieve the best result, 99.15%, similar to our previous work. 22 The generalization ability is improved from 86.84% to 90.87%. This is because the SFDSC method takes full advantage of the hierarchical structure present in the bearing failure modes: the structured dictionary learned by the proposed method better represents the different types of bearing faults.
Comparisons with other methods based on CWRU bearing data set.
Bold values represent the results of the proposed algorithm in this paper.CWRU: Case Western Reserve University; IWPT-SVMs: international workshop on parsing technologies-support vector machines; SISC-LDA: shift invariant sparse coding-learning for dimensionality reduction and classification; HSSMC-SVM: hyper-sphere-structured multiclass-support vector machine; SFDSC-SVM: structured Fisher discrimination sparse coding-support vector machine.
The best results.
Using the first data set trained classifier in the article.
The average results.
The time costs of SISC 22 and SFDSC are compared in Table 6. The times were measured on the same personal computer (PC) platform: Pentium® Dual-Core CPU E5800 @ 3.20 GHz, 4.00 GB RAM. SISC takes about 100 times longer than SFDSC for a dictionary of the same scale. Even for the smaller dictionary finally used in SISC, its computation time is still more than 100 times that of SFDSC (10,455 s vs 98 s). This indicates that SFDSC is much more efficient than SISC, which is very desirable for practical applications. Two factors make SFDSC more efficient: batch processing without time-consuming convolution computation, and the use of the tree-structured index.
Computation time (seconds) for learning same size dictionary by SISC and SFDSC.
Bold values represent the results of the analysis cases in this paper.SISC: shift invariant sparse coding; SFDSC: structured Fisher discrimination sparse coding.
Number of atoms per class used in Liu et al. 22
Number of atoms per class used in this article.
Conclusion
In this article, we have presented a sparse representation–based classification framework for machinery fault diagnosis. The SFDSC approach is applied to classify various types of faults in rolling element bearings. A structured, labeled dictionary is learned by structured Fisher discrimination DL, whose sub-dictionaries have discrimination ability: small within-class scatter and large between-class distance. The signal mean value and standard deviation are combined with the tree-group-structured sparse coding coefficients for fault diagnosis, and experiments have demonstrated that the SVM classification is accurate. Furthermore, the experimental results are compared with previously published results on the same bearing data set, showing that the proposed method outperforms many state-of-the-art bearing fault diagnosis methods while providing better generalization ability and higher efficiency. Some limitations remain in structured Fisher discrimination DL, such as the need for labeled samples and for known fault relationships to predefine the tree structure. Notwithstanding these limitations, the effect of the empirically chosen parameters on the final results is not significant. Future work should seek semi-supervised or unsupervised DL algorithms in which unlabeled samples can be used.
Footnotes
Academic Editor: Francesco Massi
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research is supported by the National Natural Science Foundation of China (Grant no.51305258, 51275290, and 11202125) and National Key Technology R&D Program (2014BAD08B01, 2015BAF11B01).
