Abstract
In order to make full use of the absolute positional information of fault signals, this paper designs a new multi-head attention (MHA) mechanism focusing on data positional information, proposes a novel MHA-based fault diagnosis method, and extends it to fault diagnosis scenarios with missing information. Based on the absolute positional information of the fault data and a trainable parameter matrix, a novel attention weight matrix is generated, and the fault features are extracted by a fully connected network with the attention mechanism. By integrating the positional information into the weight matrix, the new MHA mechanism can extract more effective data features than the traditional MHA method. Furthermore, the proposed method is also developed for fault diagnosis scenarios with missing information: a special attention weight modification method is designed to reduce the impact of missing data on the fault diagnosis results. In the experimental simulations, data sampled from the ZHS-2 multi-function motor flexible rotor test bed and the Tennessee-Eastman process data are used to test the performance of the algorithm. The results show that the proposed method can effectively extract fault features and reduce the impact of missing data.
Introduction
In recent years, owing to intelligent data acquisition systems and advanced computer control systems, the integration and automation of industrial production processes have been greatly improved. Meanwhile, highly interrelated subsystems and control units have increased the complexity of industrial processes. Most industrial processes therefore have urgent needs for reliability and safety: once an accident occurs, it may cause serious property damage and personal injury. Therefore, operation monitoring and fault diagnosis of modern industrial production systems are necessary and important. As is well known, compared with model-based and knowledge-based fault diagnosis methods, data-driven methods do not need to establish accurate models or depend on expert systems. Because computer control systems collect a large amount of process data, data-driven fault diagnosis algorithms have attracted increasing attention. 1
In recent decades, a variety of data-driven fault diagnosis methods have been extensively studied, including the K-nearest neighbor classifier, 2 support vector machine, 3 Fisher discriminant analysis, 4 and random forest. 5 Most of the methods mentioned above are applied to fault diagnosis scenarios with a small amount of data. However, in scenarios with multiple working conditions and massive amounts of data, these methods encounter problems such as one-sided analysis results, poor accuracy, or low efficiency. In the last few years, deep learning has developed rapidly6,7 and presents a breakthrough advantage over traditional fault diagnosis methods. Deep learning replaces the feature extraction step of traditional algorithms and automatically mines the deep features of the input data, which reduces the reliance on expert knowledge. Representative deep learning models include deep belief networks, 8 sparse auto-encoders, 9 convolutional neural networks (CNNs), 10 and recurrent neural networks (RNNs). 11 At present, deep learning has been widely used in the field of fault diagnosis.12–16
On the other hand, the attention mechanism has developed rapidly in the field of deep learning in recent years. The attention mechanism was first applied in the field of computer vision, where its function is to direct greater attention to the areas of an image that need to be studied. 17 Subsequently, the attention mechanism was widely applied in natural language processing tasks and considerably developed. Bahdanau et al. 18 used the attention mechanism to connect an RNN-based encoder and decoder and applied it to machine translation tasks. Vaswani et al. 19 proposed a fully connected architecture with multi-head attention. The attention mechanism succeeds in these tasks because the pivotal features of the data are emphasized while the redundant features are weakened.20,21 In this way, the feature extraction process of deep learning is expected to enhance the reliability and effectiveness of data-driven methods. The attention mechanism is an end-to-end recognition method: it eliminates manual feature extraction steps and has been employed in mechanical fault diagnosis. In the field of fault diagnosis, Huang et al. 22 proposed a shallow multiscale convolutional neural network with attention to improve the accuracy of bearing fault diagnosis. Yang et al. 23 proposed a method using gated recurrent units with attention for the same purpose. Canizo et al. 24 proposed a multi-head CNN–RNN, a supervised deep-learning method for multi-time-series anomaly detection in multi-sensor systems. Yao et al. 25 proposed a deep convolutional neural network with attention that has good real-time and generalization performance for gear fault diagnosis. Currently, most attention mechanisms used in the fault diagnosis domain are improvements of CNNs and RNNs. A CNN is a parallel model, but its feature extraction ability is limited by the size of its convolution kernel.
An RNN has a strong ability to extract long-range information, but it has to compute units one by one, which hinders the full exploitation of GPU parallelism.
It is worth noting that the multi-head attention mechanism performs well in global feature extraction and parallel computation. This paper proposes a multi-head attention fault diagnosis method based on data positional information. Different from the existing self-attention mechanism, which generates the attention weight matrix from the data values, the new multi-head attention fault diagnosis method utilizes the absolute positional information of the data to generate the attention weight matrix. This method can extract the features of fault information more effectively, alleviate the low-rank bottleneck problem in multi-head attention, and enhance the sensitivity of the multi-head attention model to the positional features of the data. In addition, this paper also designs a special missing-data weight modification method for scenarios with missing data, to reduce the influence of missing data on the fault diagnosis results.
The main contributions of this paper are summarized as follows. (1) In the multi-head attention mechanism focusing on data positional information designed in this paper, positional encoding is used instead of the input data to generate the weight matrix, so more positional information is integrated into the neural network. This extra information increases the sensitivity of the model to the directionality of the data. (2) Fault diagnosis methods based on the new multi-head attention mechanism are presented for the scenarios with and without missing data. For the scenarios with missing data, by exploiting the interpretability of the attention weight matrix, a special attention weight modification method is designed to reduce the influence of missing data on the fault diagnosis results.
Preliminaries
In this section, we review the basic components of the multi-head attention fault diagnosis method.
Scaled dot-product attention
In fault diagnosis, the input data matrix is assumed to be $X \in \mathbb{R}^{N \times D_x}$, where $N$ is the number of data points in a sample sequence and $D_x$ is the feature dimension. The query, key, and value matrices are obtained by linear projections,

$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V,$$

where $W^Q, W^K \in \mathbb{R}^{D_x \times d_k}$ and $W^V \in \mathbb{R}^{D_x \times d_v}$ are trainable parameter matrices. The scaled dot-product attention is then computed as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

where the softmax is applied row-wise and the scaling factor $\sqrt{d_k}$ prevents the dot products from growing too large.
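As a concrete illustration, the scaled dot-product attention above can be sketched in a few lines of NumPy. The dimensions ($N = 64$, $D_x = 64$, $d_k = 8$) follow the values used later in the experiment section; the random weight initialization is purely illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable row-wise softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(X, Wq, Wk, Wv):
    """Scaled dot-product attention over an input sequence X (N x Dx)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # linear projections
    dk = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(dk))        # N x N attention weight matrix
    return A @ V, A                           # weighted values and weights

rng = np.random.default_rng(0)
N, Dx, dk = 64, 64, 8
X = rng.standard_normal((N, Dx))
Wq, Wk, Wv = (rng.standard_normal((Dx, dk)) * 0.1 for _ in range(3))
out, A = scaled_dot_product_attention(X, Wq, Wk, Wv)
```

Each row of `A` is a probability distribution over all positions of the sequence, which is what gives the attention weight matrix its interpretability.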
Multi-head attention
To extract more interaction information, the multi-head attention mechanism was developed, which can be formulated as

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_M)W^O, \quad \mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V).$$

The above formula indicates that the outputs of the $M$ attention heads are concatenated into one matrix and then multiplied by the parameter matrix $W^O$.
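The concatenate-and-project structure above can be sketched as follows. This is a minimal NumPy version of standard multi-head attention (after Vaswani et al.); the per-head dimension $d_k = D_x / M$ and the weight scaling are illustrative choices, not taken from the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, heads, Wo):
    """heads: list of (Wq, Wk, Wv) projection triples, one per head."""
    outs = []
    for Wq, Wk, Wv in heads:
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
        outs.append(A @ V)                    # one head's output
    return np.concatenate(outs, axis=-1) @ Wo # concat heads, project back

rng = np.random.default_rng(1)
N, Dx, M = 64, 64, 8
dk = Dx // M                                  # per-head dimension
X = rng.standard_normal((N, Dx))
heads = [tuple(rng.standard_normal((Dx, dk)) * 0.1 for _ in range(3))
         for _ in range(M)]
Wo = rng.standard_normal((M * dk, Dx)) * 0.1
Y = multi_head_attention(X, heads, Wo)
```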
Positional encoding
In the multi-head attention network, the input data do not contain positional information. In other words, there is no difference between the input data from different positions in the multi-head attention model. Therefore, positional encoding is introduced to reflect the positional relationship between different positions of the input data. An existing positional encoding method is given as follows:
Define the positional encoding $PE \in \mathbb{R}^{N \times D_x}$ as

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/D_x}}\right), \quad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/D_x}}\right),$$

where $pos$ denotes the position of a data point in the sequence and $i$ indexes the encoding dimension. For a single sample, the positional encoding is added element-wise to the input data, $X' = X + PE$.
By introducing positional encoding into the input data, the neural network can extract data features efficiently.
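The sinusoidal encoding described above can be generated directly. This sketch assumes an even feature dimension (as in the experiments, $D_x = 64$); the variable names are illustrative.

```python
import numpy as np

def positional_encoding(N, D):
    """Sinusoidal positional encoding: sin on even dims, cos on odd dims."""
    pos = np.arange(N)[:, None]                 # positions 0..N-1
    i = np.arange(0, D, 2)[None, :]             # even dimension indices
    angles = pos / np.power(10000.0, i / D)
    PE = np.zeros((N, D))
    PE[:, 0::2] = np.sin(angles)
    PE[:, 1::2] = np.cos(angles)
    return PE

PE = positional_encoding(64, 64)
X = np.random.default_rng(2).standard_normal((64, 64))
X_with_pos = X + PE                             # X' = X + PE
```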
Main work
The attention focusing on data positional information
As noted above, the input data of the multi-head attention network do not themselves contain positional information, so positional encoding is introduced to reflect the positional relationship between different positions of the input data. The positional encoding in equation (4) reflects the absolute position of the information. However, multi-head attention still has some shortcomings in extracting positional information. For instance, a Bi-LSTM can discriminatively collect the information of a sample from its left and right sides, but it is not easy for the multi-head attention network to distinguish which side the data information comes from. 22
According to the analysis above, in this sub-section a positional encoding is introduced into the attention weight matrix to make it easier for the neural network to distinguish the directionality of information. It is defined as follows:
Because
For the input data processed by the neural network, the forward and backward positional encodings are the same for the cos terms but opposite in sign for the sin terms. Therefore, using equations (4) and (6) at the same time makes it easier for the neural network to distinguish the different directions of the information. The new positional encoding is integrated into the attention weight network, which helps the neural network distinguish the directionality of the data and improves the final classification accuracy.
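The directional property described above follows from the evenness of cosine and the oddness of sine, and can be checked numerically. This small sketch evaluates the sinusoidal encoding of equation (4) at a forward and a backward offset of the same magnitude; the function name and offsets are illustrative.

```python
import numpy as np

def encoding_at(pos, D=64):
    # Sinusoidal encoding terms evaluated at a (possibly negative) offset.
    i = np.arange(0, D, 2)
    angles = pos / np.power(10000.0, i / D)
    return np.sin(angles), np.cos(angles)

sin_fwd, cos_fwd = encoding_at(+5)   # 5 positions to the right
sin_bwd, cos_bwd = encoding_at(-5)   # 5 positions to the left

# cos terms are even (identical in both directions),
# sin terms are odd (sign-flipped), which encodes direction:
assert np.allclose(cos_fwd, cos_bwd)
assert np.allclose(sin_fwd, -sin_bwd)
```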
Based on the above discussion, a new multi-head attention mechanism is designed as shown in Figure 1.

The attention focusing on data positional information.
In Figure 1, the attention output is expressed as:
where
In order to improve the neural network's ability to discriminate data positional information,
where
Attention weight modified method for missing data
As shown in Figure 2(a),
where Att(

(a) The historical measurement data and (b) the calculation of attention.
Some signals may be missing from sample sequences due to sensor issues such as sensor faults, inconsistent sensor measurement rates, etc. Compared with convolutional neural networks, the multi-head attention neural network has better interpretability. For unreliable or missing values, the corresponding weights in the attention weight matrix can be masked or modified, which is very useful for fault diagnosis.
Since the attention weights have the advantage of interpretability, we can further improve the accuracy of fault diagnosis in scenarios with missing data by modifying the weight values corresponding to the missing data in the attention weight matrix. The positions of the missing data are recorded by constructing a mark matrix
where
For a single sample:
where $i \in \{1, 2, \ldots, m\}$, $j \in \{1, 2, \ldots, M\}$, $m$ is the number of samples, and $M$ is the number of attention heads.
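The weight modification idea can be illustrated with a simple hard-masking variant: attention weights pointing at filled-in positions are suppressed before the softmax. Note that this is only a sketch of the general mechanism; the paper's actual modification (equation (15)) uses additional learnable parameters rather than a fixed mask.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def masked_attention_weights(scores, missing_mask):
    """Suppress attention paid to missing (filled-in) positions.

    scores: N x N raw attention scores (Q K^T / sqrt(dk)).
    missing_mask: length-N boolean vector, True where data were missing.
    """
    scores = scores.copy()
    scores[:, missing_mask] = -1e9   # mask columns of filled values
    return softmax(scores)

rng = np.random.default_rng(3)
scores = rng.standard_normal((8, 8))
mask = np.zeros(8, dtype=bool)
mask[[2, 5]] = True                  # positions 2 and 5 were filled in
A = masked_attention_weights(scores, mask)
```

After masking, each row of `A` is still a valid probability distribution, but essentially no weight is assigned to the unreliable positions.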
In Figure 2(a), assume that

Modified attention weight process.
Algorithm: modified-MHA.
Fault diagnosis model of multi-head attention based on positional information
In this section, the fault diagnosis method based on the above multi-head attention mechanism is proposed for the sample data sequence without missing data. The whole method is divided into two stages: offline modeling training and online diagnosis.
The process of offline model training:
(a) The historical measurement data
where,
(b)
(c) The new data
Sublayer 1: Using equation (16),
The above formula describes multiple
where
Sublayer 2: The second sublayer is a fully connected feedforward network consisting of two layers. The number of neurons in these two layers is a hyperparameter; here, the first layer has 2N neurons and the second layer has N neurons. The second layer uses the ReLU activation function. 31 The sublayer's residual connection with layer normalization is expressed as
where
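The feedforward sublayer described above can be sketched in NumPy as follows, assuming the layer sizes stated in the text (2N then N neurons, ReLU on the second layer, residual connection followed by layer normalization). The weight initialization and batch size are illustrative.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each row to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def ffn_sublayer(x, W1, b1, W2, b2):
    """Position-wise feedforward sublayer with residual connection + layer norm."""
    h = x @ W1 + b1                      # first layer: 2N neurons
    h = np.maximum(h @ W2 + b2, 0.0)     # second layer: N neurons, ReLU
    return layer_norm(x + h)             # residual connection, then layer norm

rng = np.random.default_rng(4)
N = 64
x = rng.standard_normal((8, N))
W1, b1 = rng.standard_normal((N, 2 * N)) * 0.1, np.zeros(2 * N)
W2, b2 = rng.standard_normal((2 * N, N)) * 0.1, np.zeros(N)
y = ffn_sublayer(x, W1, b1, W2, b2)
```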
(d) The result of (c) is sent to a fully connected neural network, activated by the ReLU function, and then flattened. Finally, we utilize the softmax classifier to recognize fault conditions. The following cross-entropy loss function is used in this work,
where m denotes the number of samples, and y_k and t_k denote the actual label and the predicted value of the kth sample, respectively. Moreover, this work adopts the Adam optimizer 32 to minimize the loss function and update the network parameters.
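A minimal sketch of the cross-entropy loss used for training, computed from raw classifier logits with integer class labels (the toy logits and labels below are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    """Mean cross-entropy over m samples; labels are integer class indices."""
    m = logits.shape[0]
    p = softmax(logits)
    # Pick out the predicted probability of each true class.
    return -np.log(p[np.arange(m), labels] + 1e-12).mean()

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 3.0, 0.2]])
labels = np.array([0, 1])
loss = cross_entropy(logits, labels)
```

In practice this scalar loss would be minimized with the Adam optimizer, as stated above.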
(e) Steps (b) to (d) are repeated until the accuracy of fault classification is met or the number of set iterations is reached. The complete algorithm flow chart is shown in Figure 4.

Structure of the multi-head attention fault diagnosis model.
The process of online diagnosis: (f) Several samples to be diagnosed are collected from the actual production process, and the data are standardized and used as the input data.
(g) The data obtained by (f) are sent to the multi-head attention fault diagnosis model to obtain the fault classification results and complete the online fault diagnosis.
Fault diagnosis model of multi-head attention under the scenarios with partial missing data
In sample data sequences, some signals may be missing due to sensor issues such as sensor faults, inconsistent sensor measurement rates, etc. It is common to fill the gaps with the average of the adjacent values in the data column. However, these filled values are often inaccurate, and the inaccurate filled values deteriorate the performance of the multi-head attention fault diagnosis method proposed in the previous section. How to reduce the impact of these inaccurate values on the overall data is a key issue of fault diagnosis in scenarios with partially missing data.
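The adjacent-average filling mentioned above can be sketched as follows; this is a generic imputation sketch, not the paper's exact preprocessing code.

```python
import numpy as np

def fill_with_adjacent_mean(x):
    """Fill NaN entries of a 1-D signal with the mean of the nearest valid neighbors."""
    x = x.copy()
    missing = np.where(np.isnan(x))[0]
    valid = np.where(~np.isnan(x))[0]
    for i in missing:
        left = valid[valid < i]          # nearest valid value above
        right = valid[valid > i]         # nearest valid value below
        neighbors = []
        if left.size:
            neighbors.append(x[left[-1]])
        if right.size:
            neighbors.append(x[right[0]])
        x[i] = np.mean(neighbors)
    return x

sig = np.array([1.0, np.nan, 3.0, 4.0, np.nan])
filled = fill_with_adjacent_mean(sig)    # [1.0, 2.0, 3.0, 4.0, 4.0]
```

As the text notes, such filled values are only estimates, which is exactly why the attention weight modification is needed downstream.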
In this section, a multi-head attention fault diagnosis approach is presented for sample data sequences with missing data, using the attention weight modification method. The whole method is divided into two stages: offline model training and online diagnosis.
(a) The historical measurement data
The formula indicates that each column in
(b) The matrices
where
(c)
(d) The new data
(e) For the output
Experimental simulation
Experiment with ZHS-2 multi-function motor flexible rotor test bed
The experimental data are collected from the ZHS-2 multi-function motor flexible rotor test bed. Eight vibration acceleration sensors are installed in the horizontal direction of the rotor support; the sensors are adsorbed on the base of the test bed and arranged evenly. The collected signal is the vibration signal of the rotating machine rotor, which is transmitted to the upper computer through an HG8902 data collection box. The rotor speed of the motor is 1500 r/min. The test bed can simulate a variety of operation modes of rotating machinery, including the rotor unbalance fault mode, fan blade broken fault mode, pedestal looseness fault mode, etc. The structure of the test bench is shown in Figure 5.

Test bench structure.
Seven operation modes are used to test the performance of the proposed fault diagnosis method in this experimental simulation: rotor imbalance with one screw (bph1), rotor imbalance with three screws (bph3), rotor imbalance with five screws (bph5), rotor imbalance with seven screws (bph7), pedestal looseness (jzsd), fan blade breakage (fjdy), and normal mode (zc). In each mode, each sensor continuously collects 3,072,000 data points in 240 s. As can be seen in Figure 6, the signal waveforms of the various fault states are very similar, so it is difficult to accurately diagnose faults. In order to improve the accuracy of fault diagnosis and training efficiency, we need to expand a single sample. We take 8 times the original sample for each line of data, that is, Dx = 64, N = 64, M = 8, num = 2. The training set size is

Vibration signals of 7 modes.
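The sample construction described above amounts to segmenting each long vibration record into fixed-length windows. The following sketch shows this segmentation on a short synthetic signal; the synthetic sine signal stands in for the 3,072,000-point sensor record, and the non-overlapping windowing is an assumption about the preprocessing, not taken verbatim from the paper.

```python
import numpy as np

def segment_signal(signal, sample_len):
    """Split a long 1-D vibration signal into non-overlapping fixed-length samples."""
    n = len(signal) // sample_len
    return signal[: n * sample_len].reshape(n, sample_len)

# Illustrative only: a short synthetic signal instead of the real sensor record.
sig = np.sin(np.linspace(0, 100, 6400))
samples = segment_signal(sig, 64)     # each row is one 64-point sample (Dx = 64)
```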
The F1 score is used as the comprehensive evaluation index:

$$F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
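The per-class F1 score (harmonic mean of precision and recall) can be computed from predictions as follows; the toy labels are illustrative.

```python
import numpy as np

def f1_score(y_true, y_pred, cls):
    """Per-class F1 = 2PR/(P+R), computed from true and predicted labels."""
    tp = np.sum((y_pred == cls) & (y_true == cls))
    fp = np.sum((y_pred == cls) & (y_true != cls))
    fn = np.sum((y_pred != cls) & (y_true == cls))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

y_true = np.array([0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2])
f1_cls1 = f1_score(y_true, y_pred, 1)   # P = 2/3, R = 2/3, F1 = 2/3
```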
Several deep learning models are selected as the control group: a one-dimensional convolutional neural network with residual structure (1D-CNN), the normal multi-head attention (MHA) neural network, and the Pos-MHA proposed in this paper. The structures and parameters of these comparison models are introduced as follows.
One-dimensional convolutional neural network with residual structure (1D-CNN). The network structure is shown in Table 2. The structures of block1 and block2 are shown in Figure 7.
The normal multi-head attention (MHA) neural network uses the attention mechanism given in Vaswani et al. 19
Pos-MHA is the method proposed in this paper; its attention mechanism is shown in equation (12).
1D-CNN structure.

Residual structure Block1 and Block2.
Figure 8 shows the learning curves of the three deep models on the test set without missing fault values. Because the correlation information extracted by the 1D-CNN is limited by the size of its convolutional kernel, its classification accuracy fluctuates greatly. Pos-MHA integrates more positional information into the weight matrix and uses equations (8) and (9) to alleviate the MHA low-rank bottleneck problem, achieving higher classification accuracy than MHA.

Learning curves of the three models in the test set with normal data.
We also compare Pos-MHA with several traditional data-driven fault diagnosis methods on the validation set in Table 3. Among them, RF is a random forest classifier with 50 decision trees, and GNB is a Gaussian naive Bayes classifier. It can be seen from Table 3 that the Pos-MHA network proposed in this paper alleviates the low-rank bottleneck of multi-head attention, integrates more positional information, and improves the classification accuracy significantly. The average F1 score reaches 98.9%, which is nearly 10% higher than that of the other data-driven fault diagnosis algorithms. Especially in the diagnosis of bph1, bph5, and jzsd, the F1 score of Pos-MHA is improved by 10%–200% compared with the other algorithms. Pos-MHA has the highest F1 score for all 7 fault types and has obvious advantages over the other fault diagnosis methods.
F1 scores of the 5 fault diagnosis methods on the validation set when there are no missing data.
In order to further explore the performance of the data-driven fault diagnosis algorithms under complex conditions, we also compare the performance of the Modified-Pos-MHA network proposed in this paper with the other algorithms in the case of missing data. Modified-Pos-MHA is a fault diagnosis algorithm for missing data; it differs from Pos-MHA in that it uses equation (15) to modify the attention weights of the missing data (Figures 9 and 10).

Learning curves of the three models of the test set with missing data.

The confusion matrix of Pos-MHA of the validation set without missing data.
The figure above shows the learning curves of the four deep models when 1/8 of the whole sample is randomly set as missing data and filled with the average of adjacent values in the test set. Among them, the Modified-Pos-MHA neural network uses equation (15) to modify the attention weight matrix, MHA is the common multi-head attention fault diagnosis method, and 1D-CNN is the one-dimensional convolutional neural network with the residual structure described in Table 2. It can be seen that Modified-Pos-MHA modifies the weights of the filled values and achieves higher accuracy.
It can be seen from Table 4 that when the average of adjacent values is used to fill in the null values, the fault classification accuracy of each model decreases to a certain extent. Modified-Pos-MHA modifies the attention weights corresponding to the filled values and further improves the accuracy of Pos-MHA. The average F1 score of Modified-Pos-MHA reaches 98.4%, the highest among all methods. Among the seven fault types, Modified-Pos-MHA achieves the highest F1 score in every case, outperforming the other algorithms.
The F1 scores of the 6 fault diagnosis methods on the validation set when the fault data includes missing data.
The figure above shows the confusion matrix of Pos-MHA on the validation set when using normal data. There are 7 fault states in the validation set and 6400 samples for each fault type. When there are no missing data, Pos-MHA can predict faults very well, with an average F1 score of 99.8%, and 99.9% on the classification tasks of bph3 and bph7.
Figures 11 and 12 are the confusion matrices of Pos-MHA and Modified-Pos-MHA when the fault data contain missing values that are filled in with adjacent mean values. The results show that the performance of Pos-MHA decreases when data are missing. Compared with Pos-MHA, Modified-Pos-MHA with the modified attention weights achieves higher average fault diagnosis accuracy.

The confusion matrix of Pos-MHA of the validation set with missing data.

The confusion matrix of Modified-Pos-MHA of the validation set with missing data.
Simulation with Tennessee-Eastman process data
In order to verify the effectiveness of the proposed method in actual industrial production, the method is applied to the Tennessee-Eastman process (TEP) data sets. 33 The TE process is widely used to test the monitoring performance of process monitoring schemes. The process consists of five typical units: reactor, condenser, stripper, separator, and compressor. In this study, the benchmark data available at http://web.mit.edu/braatzgroup/links.html are employed. The TEP dataset contains 22 process variables, 19 component variables, and 12 operating variables, and 21 types of faults are presented. In this experiment, we use all 21 fault states. In order to verify the performance of the model, 800 samples are collected for each fault, so a total of 16,800 samples are collected for model training and method testing. First, 10% of the data are set aside as the validation set, which does not participate in model training and is only used for prediction with the trained model. The remaining data are randomly shuffled and then divided into a training set and a test set at a ratio of 4:1.
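The data split described above (10% held-out validation, remainder shuffled 4:1 into train/test) can be sketched as follows; the placeholder feature matrix and the seed are illustrative.

```python
import numpy as np

def split_dataset(X, y, val_frac=0.1, test_ratio=0.2, seed=0):
    """Hold out val_frac for validation; split the rest 4:1 into train/test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(len(X) * val_frac)
    val, rest = idx[:n_val], idx[n_val:]
    n_test = int(len(rest) * test_ratio)
    test, train = rest[:n_test], rest[n_test:]
    return (X[train], y[train]), (X[test], y[test]), (X[val], y[val])

# 21 fault classes x 800 samples = 16,800 samples, as in the TEP experiment.
X = np.arange(16800 * 2, dtype=float).reshape(16800, 2)   # placeholder features
y = np.repeat(np.arange(21), 800)
(Xtr, ytr), (Xte, yte), (Xva, yva) = split_dataset(X, y)
```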
SVC is a support vector machine classifier using the RBF kernel function, and LR is logistic regression. As shown in Table 5, the Pos-MHA network proposed in this paper integrates more positional information and improves the classification accuracy significantly. The average F1 score reaches 88%, which is more than 10% higher than that of the other data-driven fault diagnosis algorithms. Especially in the diagnosis of fault 10, fault 16, and fault 19, the F1 score of Pos-MHA is improved by 40%–200% compared with the other algorithms. Pos-MHA has the highest F1 score for all 21 fault types and has obvious advantages over the other fault diagnosis methods.
The F1 scores of the 5 fault diagnosis methods on TEP validation set when there are no missing data.
Figure 13 shows the learning curves of the three deep models on the test set without missing fault values.

Learning curves of the three models in TEP test set when using normal data.
In Table 6, 1/8 of the original data are randomly changed to null values and filled with the average of adjacent values. It can be seen from the table that when the average of adjacent values is used to fill in the null values, the fault classification accuracy of each model decreases to a certain extent. Modified-Pos-MHA further improves the accuracy of Pos-MHA; its average F1 score reaches 74%, the highest among all methods. Among the 21 fault types, Modified-Pos-MHA achieves the highest F1 score in 13, outperforming the other algorithms.
The F1 score of the 6 fault diagnosis methods on the validation set when the fault data have missing data.
Figure 14 shows the learning curves of the four deep models when 1/8 of the whole sample is randomly set as missing values and filled with the average of adjacent values in the test set.

Learning curves of the three models in TEP test set when data are missing.
The TEP experiments also verify the effectiveness of the proposed method on real industrial data.
Conclusion
This paper studies a multi-head attention fault diagnosis method in which the weight matrix is generated using positional encoding instead of the data values. The new method alleviates the low-rank bottleneck of multi-head attention and increases the sensitivity of the model to positional information. This paper also designs a special attention method for missing data: using the interpretability of the attention weight matrix, additional learnable parameters are assigned to the filled-in data so that they carry markers distinct from the real data in the neural network. This method effectively reduces the influence of missing data on the fault diagnosis results. The experimental results show that the proposed Pos-MHA method improves the classification accuracy compared with the traditional MHA-based and other data-driven fault diagnosis methods. For sample data with missing values, the Modified-Pos-MHA network proposed in this paper reduces the influence of missing data on the neural network and further improves the fault classification accuracy compared with Pos-MHA.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported in part by the National Natural Science Foundation of China under Grant U1804163, Grant 620732136 and Grant 61973209, by the Hubei Superior and Distinctive Discipline Group of “New Energy Vehicle and Smart Transportation”, and by the central government guided local science and technology development projects of Hubei province (2018ZYYD049).
