An Anomaly Detection Based on Data Fusion Algorithm in Wireless Sensor Networks

Abstract

In recent years, wireless sensor networks (WSN) have been applied in more and more areas. However, energy consumption and outlier detection remain open problems in WSN. To address both, this paper proposes a timely anomaly detection algorithm based on data fusion. The algorithm first employs piecewise aggregate approximation (PAA) to compress the original data so that energy consumption is reduced. It then combines an improved unsupervised K-Means detection algorithm with an artificial immune system (AIS) to classify the compressed data as normal or abnormal. Experiments on synthetic and real sensor datasets show that the algorithm achieves a high outlier detection rate while keeping the false alarm rate low. In addition, because detection operates on compressed data, the algorithm can effectively prolong the network lifetime.

1. Introduction

With the development of digital electronics and wireless communications, a new breed of tiny embedded systems known as wireless sensor nodes has emerged in the past decade. A wireless sensor network (WSN) consists of a large number of cheap, tiny, battery-powered sensor nodes equipped with limited on-board processing, storage, and radio capabilities. WSN supports a variety of applications, such as health monitoring, scientific data collection, environmental monitoring, and military operations [1].

Event-driven WSN applications [2] require timely data analysis and assessment in order to facilitate (near) real-time, efficient, and accurate critical decision making and situation awareness. However, stringent resource constraints such as battery life, computational capacity, and communication overload may lead to unreliable and inaccurate sensor data, especially when battery power is exhausted. What is worse, sensor nodes are usually deployed in harsh and unattended environments where the data may be unreliable because of noise, missing values, redundant data, and so on. Therefore, ensuring the accuracy of sensor data and finding unreliable data in a timely manner has become an urgent task in WSN applications [3].

To identify unreliable data or malicious intrusion, outlier detection is used in WSN, and several outlier detection algorithms have been proposed [4–8], but almost none of them consider energy consumption during detection. In this paper, we propose an anomaly detection algorithm based on data fusion. To reduce the communication overload and prolong battery life, we first use the lightweight PAA algorithm to compress the time series data of each node. Then, based on the result of PAA, we combine an improved unsupervised K-Means detection algorithm with AIS to classify the data as normal or abnormal. The improved K-Means algorithm classifies the compressed data, and AIS offsets the drawbacks of K-Means so that the detection result is globally optimal. Simulations with two synthetic datasets and several real environmental datasets adopted in [9] show that our adaptive outlier detection technique achieves high detection accuracy and a low false alarm rate, while saving network energy through data compression.

The rest of the paper is organized as follows. In Section 2, we review the characteristics of outlier detection and some existing outlier detection algorithms in WSN. In Section 3, we describe in detail our improved K-Means and AIS algorithm, which operates on compressed data. In Section 4, we present experimental results that validate the capabilities of our algorithm on two synthetic datasets and several real datasets. Finally, we conclude the paper and discuss directions for future work in Section 5.

2. Related Works

In this section, we firstly analyze the challenges of outlier detection in WSN when compared with the traditional outlier detection techniques. In Section 2.2, we analyze the advantages and disadvantages of some recent outlier detection algorithms in WSN.

2.1. Challenges of Outlier Detection in WSN

Outliers in WSN, also known as anomalies, can be defined as “those measurements that significantly deviate from the normal pattern of sensed data [13].” This definition indicates that an effective way to detect outliers in WSN is to define the normal behavior of sensor data and treat the observations that deviate from this normal behavior as outliers. However, conventional outlier detection techniques might not be suitable for sensor data in WSN, where outlier detection faces additional challenges [3].

(i) Resource constraints: as stated before, WSN has stringent resource constraints such as battery life, computational capacity, and communication overload. Most traditional techniques are computationally expensive and require much memory for data analysis and storage, so minimizing the communication load of the WSN needs more attention.

(ii) Distributed streaming data: during data collection, the underlying phenomenon being measured may change, so the sensor data form a nonstationary stream. Consequently, most traditional outlier detection techniques, which analyze stationary offline data, are not suitable for WSN, and algorithms that need a priori knowledge of the data distribution are also infeasible. Thus, a key problem of outlier detection in WSN is how to process distributed streaming data online.

(iii) Large-scale deployment: a massive number of sensor nodes may be deployed in the WSN. This requires the construction of an accurate normal profile that represents the normal behavior of sensor data, so that the outlier detection technique can maintain a high detection rate while keeping the false alarm rate low.

According to the above analysis, an effective and efficient outlier detection technique for WSN should identify outliers in a distributed and online manner with high detection accuracy and a low false alarm rate, while satisfying the WSN constraints in terms of communication, computational, and memory complexity [4].

2.2. Analysis of Several Previous Algorithms

Recently, many detection techniques have been proposed. Sun et al. [6] proposed a mechanism based on the extended Kalman filter to detect falsely injected data. The paper further combines cumulative summation and the generalized likelihood ratio to increase detection sensitivity. This algorithm is practical on resource-constrained hardware, but its heavy computation cannot guarantee real-time detection. Salem et al. [5] presented an algorithm that combines the Mahalanobis distance (MD) and the kernel density estimator. The former is used for spatial analysis and the latter identifies abnormal patterns. This technique does not require a priori knowledge of the data distribution and can achieve good detection accuracy with a low false alarm rate. A remaining problem is its high dependency on the predefined MD threshold: an appropriate threshold is quite difficult to determine, and a single threshold may not be suitable for outlier detection in multidimensional data. Zhang et al. [4] took into account the correlation among sensor data attributes and proposed two distributed and online outlier detection techniques based on a hyperellipsoidal one-class support vector machine (SVM). The algorithm exploits spatiotemporal correlation to identify outliers and updates the ellipsoidal SVM-based model, which represents the changed normal behavior of sensor data, for further outlier identification. However, this technique ignores the communication capability of each node, because the spatial correlation is calculated by exchanging parameters between neighbors. Bhargava and Raghuvanshi [7] proposed an anomaly detection algorithm based on the S-transform and a one-class SVM. To reduce the data size, the S-transform is applied to extract the significant components of the time series data, and the one-class SVM is then used to classify the data as normal or anomalous. This algorithm applies a compression transformation before outlier detection, but it needs to classify every extracted component for the final decision, which generates heavy computation. Some other outlier detection techniques in WSN have also been proposed [14–16]. The above algorithms can detect outliers effectively, but they have not fully considered the characteristics of outlier detection in WSN, such as constrained resources and dynamic data flow. To solve these problems while keeping the detection accuracy high, we propose an outlier detection algorithm based on data fusion. First, we use an improved PAA to save sensor energy and reduce the amount of computation in the subsequent outlier detection. Then we use an improved K-Means algorithm, which needs no a priori knowledge of the data distribution, to distinguish normal and abnormal data. Finally, the AIS algorithm is used to make the K-Means detection result globally optimal and the detection accuracy high. The advantages of PAA, K-Means, and AIS and the reasons for selecting them are presented in detail in the next section.

3. Our Proposed Algorithm

In this section, we present our outlier detection approach in detail. Figure 1 shows the flow diagram of the proposed algorithm. First, we use PAA to compress the original sensor data. Then the improved K-Means algorithm classifies the compressed data, and AIS offsets the drawbacks of K-Means so that the detection result is globally optimal. The two phases, compression and outlier detection, are discussed in detail below.

Figure 1

Flow diagram of our proposed algorithm.

3.1. Data Compression with PAA

To save sensor energy, data compression techniques have been used in WSN, because most energy is consumed in data transmission [17–19]. However, due to resource constraints, a suitable compression algorithm for WSN should be efficient and simple. As a dimensionality reduction algorithm, PAA is more intuitive and simpler than techniques such as Fourier transforms and wavelets [20]. What is more, the degree of compression can be changed by adjusting the compression ratio parameter. Thus, the PAA technique is adopted to compress the data in this paper.

A time series C of length n can be represented in a w-dimensional space by the vector

$$\bar{C} = \bar{c_{1}}, \dots, \bar{c_{w}}. \tag{1}$$

The ith element of $\bar{C}$ is calculated as follows:

$$\bar{c_{i}} = \frac{1}{k} \times \sum_{j = k (i - 1) + 1}^{k \cdot i} c_{j}, \qquad k = \frac{n}{w}, \tag{2}$$

where k (also written K below) is the compression ratio and $c_{j}$ is the jth element of the original data. More information about PAA can be found in [20].

The compression process of PAA is simple and fast, but it compresses the data directly without considering the characteristics of the data or the correlation between data points, which may lead to mistakes in the subsequent handling of the compressed data, including outlier detection. For example, suppose that a value greater than 10 is considered abnormal and that $C_{1} = \{2, 4, 9\}$ and $C_{2} = \{1, 1, 13\}$ are two original data series. After PAA with $k = 3$, $\bar{c_{1}} = \bar{c_{2}} = 5$, so the difference between the two series is hidden and the outlier value 13 cannot be detected. In order to preserve the differences between sequences, the variance of each window is added as a second output of PAA. The variance associated with the ith element of $\bar{C}$ is defined as

$${Var}_{i} = \frac{1}{k} \times \sum_{j = k (i - 1) + 1}^{k \cdot i} {(c_{j} - \bar{c_{i}})}^{2}. \tag{3}$$

Because acquiring neighbor data consumes energy, compressing the data with PAA across space (between nodes) is not encouraged. In contrast, every sensor has a cache to store previous data, so it is convenient to compress the time series data with PAA within each node. A sliding window of length k selects the data, where k is the compression ratio of PAA. According to (2) and (3), a two-element tuple $X_{j} = (\bar{c_{i j}}, {Var}_{i j})$ is defined to represent the ith compressed datum of the jth sensor node.
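The per-node compression of (2) and (3) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `paa_with_variance` and the use of non-overlapping windows whose length divides the series length are assumptions made here.

```python
from typing import List, Sequence, Tuple


def paa_with_variance(series: Sequence[float], k: int) -> List[Tuple[float, float]]:
    """Compress `series` into (mean, variance) pairs over windows of length k.

    Assumption: windows do not overlap and len(series) is a multiple of k.
    """
    if k <= 0 or len(series) == 0 or len(series) % k != 0:
        raise ValueError("series length must be a positive multiple of k")
    compressed = []
    for start in range(0, len(series), k):
        window = series[start:start + k]
        mean = sum(window) / k                           # Eq. (2)
        var = sum((c - mean) ** 2 for c in window) / k   # Eq. (3)
        compressed.append((mean, var))
    return compressed


# The two example series from the text: equal means, different variances.
c1 = paa_with_variance([2, 4, 9], k=3)
c2 = paa_with_variance([1, 1, 13], k=3)
print(c1)  # mean 5.0, variance 26/3
print(c2)  # mean 5.0, variance 32.0
```

Both series compress to the same mean, 5.0, but the variances (26/3 versus 32.0) differ, which is what lets the later classification step separate them.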

3.2. Outlier Detection on Compressed Data

Based on the compression result of PAA in the preceding subsection, we combine an improved unsupervised K-Means detection algorithm with AIS to classify the compressed sensor data as normal or abnormal. Because the original data do not need to be reconstructed, the approach has good real-time performance. After all sensors send their data to the sink node, the improved K-Means algorithm, which needs no prior knowledge of the data distribution, classifies the compressed data, and AIS offsets the drawbacks of K-Means so that the detection result is globally optimal. The outlier detection process is described in detail below.

3.2.1. Classify by the Algorithm of K-Means

According to the definition of outliers in WSN, the key to outlier detection is how to effectively separate the outliers from normal data. Classification is the process of assigning sampled data to different classes or clusters. Some classification techniques, such as SVM and artificial neural networks (ANN), have been used for outlier detection in WSN [21–23]. K-Means, one of the simplest classification algorithms, has also proved effective for outlier detection in other areas [24–26]. It groups N data points into $k^{'}$ clusters, where $k^{'}$ denotes the number of clusters and is written with a prime to distinguish it from the compression ratio K. In addition, compared to other classification techniques, K-Means has the following advantages for outlier detection in WSN.

(i) K-Means does not need knowledge of the data distribution, and it is suitable for dynamic data.

(ii) Its calculation is simpler than that of other classification techniques, such as SVM and ANN.

However, every compressed datum contains two elements, so we modify the original K-Means algorithm so that it can be combined with the PAA algorithm. Suppose that the number of sensors in the WSN is N; the steps of the improved K-Means in our algorithm are as follows.

(1) Select $k^{'}$ random instances $M_{1} : (M_{11}, M_{12})$, $M_{2} : (M_{21}, M_{22})$, $\dots$, $M_{k^{'}} : (M_{k^{'} 1}, M_{k^{'} 2})$ from all the compressed sensor data as the initial centroids of the clusters $C_{1}, C_{2}, \dots, C_{k^{'}}$.

(2) For every data point $X_{j} = (\bar{c_{i j}}, {Var}_{i j})$:

(a) Calculate the squared Euclidean distance of each element to each centroid:

$$Y_{n j 1} = {\|\bar{c_{i j}} - M_{n 1}\|}^{2}, \qquad Y_{n j 2} = {\|{Var}_{i j} - M_{n 2}\|}^{2}, \qquad n = 1, 2, \dots, k^{'}, \tag{4}$$

where $Y_{n j 1}$ is the distance between the first element of $X_{j}$ and the first element of the centroid of the nth cluster, and $Y_{n j 2}$ is the distance between the second element of $X_{j}$ and the second element of that centroid. To reduce the influence of different orders of magnitude, $Y_{n j 1}$ and $Y_{n j 2}$ are normalized to $Y_{n j 1}^{'}$ and $Y_{n j 2}^{'}$, and the distance between $X_{j}$ and $M_{n}$ is defined as

$$D (X_{j n}) = m \cdot Y_{n j 1}^{'} + (1 - m) \cdot Y_{n j 2}^{'}, \qquad n = 1, 2, \dots, k^{'}, \tag{5}$$

where m is the weighting parameter that adjusts the proportion of the two factors. The distance represents the correlation between a sensor datum and a cluster center, and it also determines whether the datum belongs to the outlier class. Finally, find the cluster $C_{q}$ that is closest to $X_{j}$.

(b) Assign $X_{j}$ to $C_{q}$ and update the centroid $(M_{q 1}, M_{q 2})$ of $C_{q}$ (the centroid of a cluster is the arithmetic mean of the instances in the cluster).

(3) Repeat step (2) until the centroids no longer change. The algorithm aims at minimizing the squared error function J:

$$J = \sum_{j = 1}^{N} {\|D ({X_{j}}^{(i)})\|}^{2}, \qquad i = 1, 2, \dots, k^{'}, \tag{6}$$

where $D ({X_{j}}^{(i)})$ represents the distance between $X_{j}$ and the centroid of the cluster $C_{i}$ in which $X_{j}$ is located.

After classification, according to the definition of outliers, the ideal result is that the abnormal data are grouped together in one cluster while the normal data fall into the remaining clusters, because outliers deviate from the normal data. Therefore, the smaller J is, the more precise the classification is. In addition, abnormal data are far fewer than normal data, so the data in the cluster with the fewest members are identified as outliers.
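Steps (1) to (3) and the smallest-cluster rule can be sketched as below. The paper does not specify how $Y_{n j 1}$ and $Y_{n j 2}$ are normalized, so this sketch assumes division by the squared range of each component; it also updates centroids once per pass rather than after every assignment. The names `improved_kmeans`, `weighted_distances`, `smallest_cluster`, and the default m = 0.5 are illustrative choices, not taken from the paper.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (window mean, window variance)


def component_scale(data: Sequence[Point], d: int) -> float:
    """Assumed normalizer: squared range of component d (guarded against 0)."""
    values = [p[d] for p in data]
    return max((max(values) - min(values)) ** 2, 1e-12)


def weighted_distances(x: Point, centroids: Sequence[Point],
                       scale: Tuple[float, float], m: float) -> List[float]:
    """Eq. (4)-(5): weighted sum of the two normalized squared differences."""
    return [m * (x[0] - c[0]) ** 2 / scale[0]
            + (1 - m) * (x[1] - c[1]) ** 2 / scale[1] for c in centroids]


def improved_kmeans(data: Sequence[Point], centroids: Sequence[Point],
                    m: float = 0.5, max_iter: int = 100):
    """Return (labels, centroids, J) starting from the given centroids."""
    scale = (component_scale(data, 0), component_scale(data, 1))
    centroids = [tuple(c) for c in centroids]
    labels: List[int] = []
    for _ in range(max_iter):
        # Step (2a): assign every point to its closest centroid.
        labels = []
        for x in data:
            d = weighted_distances(x, centroids, scale, m)
            labels.append(d.index(min(d)))
        # Step (2b): recompute each centroid as the mean of its members.
        updated = []
        for q, old in enumerate(centroids):
            members = [x for x, lab in zip(data, labels) if lab == q]
            if members:
                updated.append((sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members)))
            else:
                updated.append(old)  # an empty cluster keeps its centroid
        if updated == centroids:     # Step (3): stop when nothing changes
            break
        centroids = updated
    # Eq. (6): sum of squared distances to the assigned centroid.
    j = sum(weighted_distances(x, centroids, scale, m)[lab] ** 2
            for x, lab in zip(data, labels))
    return labels, centroids, j


def smallest_cluster(labels: Sequence[int]) -> int:
    """Index of the non-empty cluster with the fewest members (outlier cluster)."""
    return min(set(labels), key=lambda q: list(labels).count(q))


data = [(1.0, 0.1), (1.1, 0.2), (0.9, 0.1), (1.0, 0.15), (1.05, 0.1), (5.0, 32.0)]
labels, cents, j = improved_kmeans(data, [(1.0, 0.1), (5.0, 32.0)])
outlier_cluster = smallest_cluster(labels)
print(labels, outlier_cluster)
```

In this toy input the last pair has a much larger variance than the others, so it ends up alone in its cluster and that cluster is reported as the outlier cluster. The initial centroids are fixed here for reproducibility; the paper selects them at random.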

3.2.2. Classification Improvement with AIS

It is well known that the K-Means algorithm depends on the initial centroids of the clusters and easily falls into a local optimum, so these problems must be solved to make the classification more precise. The AIS algorithm used here, the clonal selection algorithm (CSA), is a global search algorithm and is considered appropriate to offset the drawbacks of K-Means in this paper.

The clonal selection algorithm is an emerging intelligent algorithm inspired by the immune system. It uses the diversity of the immune system to maintain population diversity so that it can avoid the “premature convergence” problem of general optimization and reach the global optimum [27–29]. A detailed description of this algorithm is presented in [30].

Given the defects of the K-Means algorithm, the purpose of CSA in this paper is to find the best initial centroids of the clusters, which ensures that the classification result is the global optimum and that the outlier detection rate is high. The application of CSA in our paper can be described as follows.

(1) In our K-Means algorithm, the squared error function J in (6) is the criterion for judging the classification result, so we choose J as the objective function and the affinity.

(2) Because our purpose is to find the best initial centroids of the clusters, we define a set of cluster centroids as an antibody and randomly initialize multiple sets of centroids as the initial antibody group T:

$$T = \begin{bmatrix} M_{1}^{1} & M_{2}^{1} & \dots & M_{k^{'}}^{1} \\ M_{1}^{2} & M_{2}^{2} & \dots & M_{k^{'}}^{2} \\ \vdots & \vdots & & \vdots \\ M_{1}^{Q} & M_{2}^{Q} & \dots & M_{k^{'}}^{Q} \\ \vdots & \vdots & & \vdots \end{bmatrix}, \tag{7}$$

where $(M_{1}^{Q}, M_{2}^{Q}, \dots, M_{k^{'}}^{Q})$ is the Qth set of initial cluster centroids.

(3) For every antibody in T, we classify the compressed data with the improved K-Means and record the affinity J. Then we sort the affinities by size.

(4) According to the sorted affinity sequence, we select the antibodies at the top of the sequence (that is, with the smallest J) as the parent antibody group, which serves as the initial antibody group in the next round, because good genes are more likely to be propagated to the next generation. Then we clone these antibodies according to the size of their affinities.

(5) Determine whether the classification result obtained with the antibody corresponding to the minimum J meets the end condition, namely, that J is small enough. If it does, this antibody gives the best initial centroids of the clusters, which ensures that the K-Means classification result is the best. Otherwise, continue with the following steps.

(6) Process the antibodies selected in step (4) with clone, crossover, and mutation operations to form a new, diverse generation of antibodies.

(7) If the maximum number of iterations has been reached, the process is finished; otherwise, return to step (3).
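The loop in steps (1) to (7) can be sketched as follows. This is a simplified illustration under several assumptions that are not in the paper: the affinity is passed in as a callable, better-ranked parents receive more clones, crossover is single-point, and mutation replaces one centroid with a randomly drawn data point. The helper `kmeans_cost` is a plain squared-Euclidean K-Means used only as a stand-in for the improved K-Means; all function and parameter names are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]   # (window mean, window variance)
Antibody = List[Point]        # one candidate set of k' initial centroids


def sq_dist(a: Point, b: Point) -> float:
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def kmeans_cost(data: Sequence[Point], centroids: Antibody, n_iter: int = 20) -> float:
    """Stand-in affinity: run plain K-Means, return the final squared error."""
    cents = list(centroids)
    for _ in range(n_iter):
        groups: List[List[Point]] = [[] for _ in cents]
        for x in data:
            d = [sq_dist(x, c) for c in cents]
            groups[d.index(min(d))].append(x)
        cents = [(sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g))
                 if g else c for g, c in zip(groups, cents)]
    return sum(min(sq_dist(x, c) for c in cents) for x in data)


def clonal_selection(data: Sequence[Point], k_prime: int,
                     affinity: Callable[[Antibody], float],
                     pop_size: int = 8, n_parents: int = 3, n_clones: int = 3,
                     p_mut: float = 0.3, max_gen: int = 20,
                     j_target: float = 0.0, seed: int = 0):
    """Search for initial centroids with a small J; returns (best, J, history)."""
    rng = random.Random(seed)
    pool = list(data)
    # Step (2): random initial antibody group T.
    population = [rng.sample(pool, k_prime) for _ in range(pop_size)]
    best, best_j, history = None, float("inf"), []
    for _ in range(max_gen):                       # Step (7): iteration limit
        # Step (3): evaluate and sort by affinity J (smaller is better).
        scored = sorted(((affinity(a), a) for a in population), key=lambda t: t[0])
        if scored[0][0] < best_j:
            best_j, best = scored[0]
        history.append(best_j)
        if best_j <= j_target:                     # Step (5): end condition
            break
        parents = [a for _, a in scored[:n_parents]]   # Step (4)
        population = [list(p) for p in parents]        # parents survive
        for rank, parent in enumerate(parents):        # Step (6)
            for _ in range(max(1, n_clones - rank)):   # better rank, more clones
                mate = rng.choice(parents)
                cut = rng.randrange(k_prime)
                child = list(parent[:cut]) + list(mate[cut:])   # crossover
                if rng.random() < p_mut:                        # mutation
                    child[rng.randrange(k_prime)] = rng.choice(pool)
                population.append(child)
    return best, best_j, history


data = [(1.0, 0.1), (1.1, 0.2), (0.9, 0.1), (1.0, 0.15), (1.05, 0.1), (5.0, 32.0)]
best, best_j, history = clonal_selection(data, 2, lambda a: kmeans_cost(data, a))
print(best, best_j)
```

Because the best antibody found so far is always retained, the recorded best J never increases from one generation to the next. The default `j_target` of 0.0 means the loop normally runs until the iteration limit; a practical threshold would have to be chosen per dataset.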

After the above process, we obtain better initial cluster centroids and a more accurate classification than with the original K-Means algorithm. As a result, normal and abnormal sensor data are separated into different clusters more effectively, so our algorithm achieves higher detection accuracy and a lower false alarm rate. However, an effective outlier detection algorithm in WSN not only needs a high detection rate but also needs to respect the characteristics of sensor data and the constrained resources. Compared to other algorithms, the advantages of our algorithm are shown in Table 1.

Table 1

The comparison of characteristics.

	Online	Based on compressed data	No a priori knowledge of data
Our algorithm	Yes	Yes	Yes
Salem et al. [5]	Yes	No	Yes
Bhargava and Raghuvanshi [7]	Yes	Yes	No
Zhang et al. [4]	Yes	No	No

As shown in Table 1, compared to other algorithms, our algorithm can satisfy more requirements of outlier detection in WSN.

3.3. Pseudocode of Our Algorithm

Based on the previous introduction of every part in our algorithm, Pseudocode 1 is used to describe the whole process of our algorithm.

Pseudocode 1: Pseudocode of our algorithm.

Input: original data from every sensor

Output: abnormal sensors and outlier data

For the original data from every sensor

(1) Compress the data with the improved PAA algorithm in each node and get the compressed data ( $\bar{c_{i j}}$ , ${Var}_{i j}$ ).

(2) All nodes send the compressed data of the same time window to the base station.

(3) Initialize the antibody group T and set the iteration counter to 0.

(4) While (the iteration counter has not reached the specified value)

(5)     For each antibody in T:

(6)         Classify all the compressed data with the improved K-Means algorithm and calculate the affinity J.

(7)     End For

(8)     Sort all affinities by size and choose the antibodies at the top of the sorted sequence as the parent antibody group. Then clone these antibodies based on the size of their affinities.

(9)     Determine whether the minimum J meets the end condition.

(10)     If (it does)

(11)         Exit the optimization loop.

(12)     Else

(13)         Process the parent antibody group with clone, crossover, and mutation operations to form a new, diverse generation of antibodies.

(14)         Increase the iteration counter by one.

(15) End the optimization and select the antibody in the first row as the best initial centroids of the clusters.

(16) Classify the compressed data with the selected antibody and pick out the abnormal compressed data.

(17) Judge the abnormal type based on the location of the abnormal data.

4. Experimental Evaluation

In order to evaluate the performance of our proposed algorithm, experiments were carried out on two synthetic anomaly datasets and some real medical datasets commonly used in anomaly detection. The names and sources of these datasets are shown in Table 2. For comparison, the K-Means algorithm without AIS is used as the baseline.

Table 2

The experimental datasets.

Name of dataset	Length of dataset	Source of dataset
Ma_Data	1200	Synthetic dataset from [10]
Keogh_Data	1200	Synthetic dataset from [10]
chfdb_chf01	3600	ECG [11]
chfdb_chf13	3600	ECG [11]
stdb_308	3600	ECG [11]
Synthetic_control	600	UCI [12]

4.1. Experimental Setup and Evaluation Metrics

Our simulation is conducted in MATLAB. We assume that knowledge of sensor node locations is available at the base station. We do not assume any specific routing or medium access protocol in this network or make any assumptions on the node density of the network because our algorithm is not for a particular application scenario.

Detection rate (DR) and false alarm rate (FAR) are used to evaluate the performance of our outlier detection algorithm. Detection rate is the ratio of correctly detected outlier data to the total number of outlier data. The false alarm rate is the ratio of the number of normal data which is incorrectly detected as outlier data to the total number of normal data.

The number of abnormal data is very few in general, so we will select the cluster where the number of data is minimum as outlier cluster and calculate the DR and FAR depending on these data in the outlier cluster.
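The two metrics defined above can be computed from ground-truth and predicted outlier flags as in the following sketch. The function name `detection_metrics` and the boolean-flag input format are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def detection_metrics(truth: Sequence[bool], pred: Sequence[bool]) -> Tuple[float, float]:
    """Return (DR, FAR) as fractions in [0, 1].

    truth[i] is True if datum i really is an outlier;
    pred[i] is True if datum i was placed in the outlier cluster.
    """
    if len(truth) != len(pred):
        raise ValueError("truth and pred must have the same length")
    detected = sum(1 for t, p in zip(truth, pred) if t and p)
    false_alarms = sum(1 for t, p in zip(truth, pred) if (not t) and p)
    n_outliers = sum(1 for t in truth if t)
    n_normal = len(truth) - n_outliers
    dr = detected / n_outliers if n_outliers else 0.0
    far = false_alarms / n_normal if n_normal else 0.0
    return dr, far


# Toy example: 2 true outliers and 8 normal points.
truth = [False, False, False, True, False, True, False, False, False, False]
pred = [False, False, True, True, False, False, False, False, False, False]
dr, far = detection_metrics(truth, pred)
print(dr, far)  # 0.5 0.125
```

Here one of the two outliers is detected (DR = 50%) and one of the eight normal points is flagged by mistake (FAR = 12.5%).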

4.2. Experimental Results

4.2.1. Decision of Related Parameters

As described in Section 3, the compression ratio K and the number of clusters $k^{'}$ are two key parameters that have a great impact on the classification result. Therefore, experiments were carried out to determine appropriate ranges of K and $k^{'}$ for which the detection results are robust and accurate. To present the results clearly and concisely, we show only the results on the stdb_308 dataset, because the results on the other datasets are similar. The stdb_308 dataset is shown in Figure 2; data whose absolute value is greater than 0.5 are abnormal. The DR and FAR under different K and $k^{'}$ are shown in Tables 3 and 4, respectively. In order to reduce the error, we take the average of 100 runs under the same parameters as the final result.

Table 3

DR (%) under different K and $k^{'}$ .

$k^{'}$ (rows) / K (columns)	6	5	4	3	2
2	96.12	96.45	97.05	97.12	97.15
3	96.77	97.12	97.23	97.30	97.28
4	97.65	97.15	97.20	97.32	97.35
5	97.50	97.89	97.91	97.86	97.85
6	97.62	97.50	98.03	98.11	98.10
7	97.59	97.85	98.00	98.08	98.12
8	97.65	97.79	98.05	98.12	98.14

Table 4

FAR (%) under different K and $k^{'}$ .

$k^{'}$ (rows) / K (columns)	6	5	4	3	2
2	5.12	4.45	3.87	3.12	2.98
3	5.11	4.32	3.58	3.09	2.78
4	4.85	4.21	3.56	2.98	2.77
5	4.90	3.87	3.34	3.01	2.65
6	4.85	3.75	2.98	2.65	2.56
7	4.83	3.68	2.55	2.34	2.45
8	4.75	3.57	2.14	1.98	1.53

Figure 2

The database of stdb_308.

According to the algorithm described in the previous section, the smaller K is, the less information is lost in compression and the more accurate the detection is. On the other hand, the larger K is, the stronger the compression is and the more energy is saved. As for the parameter $k^{'}$, the smaller it is, the lower the amount of computation but the worse the classification; the opposite holds when $k^{'}$ increases.

As shown in Tables 3 and 4, the results are in accordance with the theoretical analysis. To balance energy saving against effective outlier detection while also considering the amount of computation, we set K = 4 and $k^{'}$ = 3.

4.2.2. Results of Outlier Detection

With K and $k^{'}$ selected in the previous subsection, relevant experiments were conducted to compare the performance of our algorithm (OA) and the algorithm of K-Means without AIS (KWA), and the results are shown in Table 5.

Table 5

The experimental results.

Name of dataset	DR with OA	DR with KWA	FAR with OA	FAR with KWA
Ma_Data	97.50%	81.12%	3.34%	8.23%
Keogh_Data	98.10%	80.45%	3.15%	7.78%
chfdb_chf01	98.86%	81.34%	3.22%	7.12%
chfdb_chf13	97.13%	78.65%	1.97%	8.02%
stdb_308	97.23%	80.67%	3.58%	7.84%
Synthetic_control	98.34%	78.12%	2.78%	8.45%

As shown in Table 5, our algorithm performs clearly better on every dataset: its DR is higher and its FAR is lower than those of the algorithm without AIS. The reason is that our algorithm adopts AIS to drive the result toward the global optimum, whereas the other algorithm depends on the initial cluster centers, so the result of each run is a local optimum and is unstable. In addition, although the original data are compressed before outlier detection, the DR of our algorithm is higher than 95% and the FAR is lower than 5% in every row, so our algorithm is an effective outlier detection approach for WSN that ensures a high detection rate while keeping the false alarm rate low.

However, in real applications there is always some environmental noise, which alters the original signal, so we artificially add Gaussian white noise to the original signals and repeat the experiments to test the robustness of our algorithm. We take the Ma_Data (Ma), stdb_308 (Std), and Synthetic_control (Syn) datasets as examples, and the results are shown in Figures 3 and 4. Figure 3 shows the DR on these datasets under different SNR, and Figure 4 shows the FAR under different SNR.

Figure 3

DR with different SNR.

Figure 4

FAR with different SNR.

As shown in Figures 3 and 4, we test the robustness of our algorithm under Gaussian white noise with different signal-to-noise ratios (SNR). In both figures, DR decreases slightly and FAR rises slightly as the SNR decreases, because a smaller SNR distorts the original data more strongly. However, the DR remains above 85% and the FAR below 15% at all tested noise levels, which indicates that our algorithm is robust against noise.
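Adding Gaussian white noise at a prescribed SNR can be sketched as below. The paper does not state its SNR definition, so this sketch assumes the common one, SNR (dB) = 10 log10(signal power / noise power), with signal power taken as the mean squared sample value. The function name `add_white_gaussian_noise` and the `seed` parameter are illustrative.

```python
import math
import random
from typing import List, Sequence


def add_white_gaussian_noise(signal: Sequence[float], snr_db: float,
                             seed: int = 0) -> List[float]:
    """Return a noisy copy of `signal` with approximately the requested SNR."""
    if len(signal) == 0:
        raise ValueError("signal must not be empty")
    rng = random.Random(seed)
    signal_power = sum(s * s for s in signal) / len(signal)
    # Assumed definition: snr_db = 10 * log10(signal_power / noise_power).
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in signal]


# Check on a synthetic sine wave that the measured SNR is close to the target.
clean = [math.sin(0.01 * i) for i in range(20000)]
noisy = add_white_gaussian_noise(clean, snr_db=10.0)
p_signal = sum(s * s for s in clean) / len(clean)
p_noise = sum((a - b) ** 2 for a, b in zip(noisy, clean)) / len(clean)
measured_snr_db = 10 * math.log10(p_signal / p_noise)
print(round(measured_snr_db, 1))
```

The noisy series can then be passed through the same compression and detection steps as the clean one to reproduce a robustness sweep over several SNR values.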

4.2.3. Analysis of Energy Saving and Real-Time Performance

Compared to other outlier detection algorithms, another advantage of our algorithm is that it can prolong the network lifetime, because PAA reduces the amount of transmitted data. For example, for the stdb_308 dataset with K = 4, each window of four samples is replaced by two values (its mean and its variance), so the number of transmitted values drops from 3600 to 1800 and the network saves about half of the transmission energy. The network saves more energy when K is larger.
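The saving can be written out explicitly. The derivation below assumes that transmission energy is proportional to the number of values sent and that every window of K samples is replaced by exactly two values, the mean and the variance of Section 3.1; both are simplifying assumptions rather than statements from the experiments.

```latex
% n original samples, compression ratio K, two values (mean, variance) per window
n_{\text{sent}} = 2 \cdot \frac{n}{K}, \qquad
\text{saving} = 1 - \frac{n_{\text{sent}}}{n} = 1 - \frac{2}{K}

% stdb_308, n = 3600:
% K = 4: n_sent = 1800, saving = 50%
% K = 6: n_sent = 1200, saving = 66.7% (approximately)
% K = 2: n_sent = 3600, saving = 0%
```

Under these assumptions, K = 2 brings no reduction at all, which is one reason to prefer a larger K when the detection accuracy in Tables 3 and 4 allows it.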

Some outlier detection methods that are combined with a data fusion algorithm reconstruct the original data before detection. However, the reconstruction step of a data fusion algorithm usually takes a long time, which leads to poor real-time performance. Our detection algorithm operates directly on the compressed data, so it has better real-time performance.

4.2.4. The Comparison of Our Algorithm with Other Outlier Detection Algorithms in WSN

In order to show the effectiveness of our algorithm more clearly, we compare it with the algorithms listed in Table 1. The stdb_308 dataset is selected for these experiments. For every algorithm, we choose the optimal parameter settings according to its original paper. In addition, we take the average of 100 runs as the final result to ensure the reliability of the experiments. The comparison is shown in Table 6.

Table 6

The comparison of several algorithms.

	DR	FAR	Time consumption	Saved energy consumption
Our algorithm	97.23%	3.58%	11.5 s	50%
Salem et al. [5]	93.45%	6.15%	11.7 s	0%
Bhargava and Raghuvanshi [7]	92.17%	7.28%	12.8 s	35%
Zhang et al. [4]	95.19%	5.62%	10.6 s	0%

As shown in Table 6, the DR and FAR of our algorithm are the best among the compared algorithms, because combining AIS with K-Means optimizes the classification result. Our algorithm also saves the most energy, because PAA reduces the amount of transmitted data and thereby prolongs the network lifetime. The algorithm proposed by Zhang et al. [4] has no data compression step, so its time consumption is lower than that of our algorithm. However, because PAA, K-Means, and AIS are lightweight methods, the time consumption of our algorithm is lower than that of the other two algorithms.

5. Conclusion and Future Work

In this paper, we propose an outlier detection algorithm that detects abnormal data directly on compressed data in WSN. In the first phase, we use the PAA algorithm to compress the time series data in each node so that the communication overload is reduced and battery life is prolonged. Based on the result of PAA, we then combine the improved unsupervised K-Means detection algorithm with AIS to classify the data as normal or abnormal. The major advantages of this algorithm are that it operates on compressed data, which reduces energy consumption, and that it achieves a high detection rate while keeping the false alarm rate low. Experiments on synthetic and real data demonstrate the effectiveness of our algorithm in detecting outliers and resisting noise.

A potential limitation of this approach is that the time cost increases markedly when the data volume is large, because every data point must be reclassified against the new cluster centroids in each iteration until the classification converges, and because the search for the optimal initial centroids is slow. Another limitation is that our detection algorithm handles only univariate data, whereas the data in some WSN applications are more complex. Therefore, our future work is to improve the algorithm so that it has better real-time performance and is better suited to the detection of multivariate data.
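The first limitation can be seen in a plain K-Means loop. The sketch below is the textbook Lloyd iteration on scalar data, not our improved variant (it omits the search for initial centroids and the AIS step); the function name `kmeans_1d` is hypothetical. Each pass compares every value with every centroid, so the cost grows with both the data volume and the number of iterations needed to converge.

```python
def kmeans_1d(values, centroids, max_iter=100):
    """Textbook Lloyd iterations on scalar data.

    Returns (centroids, labels, iterations). Stops when the assignment
    no longer changes or after `max_iter` passes.
    """
    centroids = list(centroids)
    labels = None
    for iteration in range(1, max_iter + 1):
        # Assignment step: every value is compared with every centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: abs(v - centroids[j]))
            for v in values
        ]
        if new_labels == labels:
            return centroids, labels, iteration
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return centroids, labels, max_iter


if __name__ == "__main__":
    data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
    print(kmeans_1d(data, [0.0, 10.0]))
```

With poorly chosen starting centroids, as in the usage example, several full passes over the data are needed before the assignment stabilizes.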

Footnotes

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant no. 61271274), key project of Natural Science Foundation of Hubei Province of China (2011CDA069), general project of Natural Science Foundation of Hubei Province of China (2010CDB04203, 2011CDB339), and Key Science and Technology of Hubei Province of China (2012BAA02003, 2011BAB042). The authors also gratefully acknowledge the helpful comments and suggestions of the reviewers, which have improved the presentation.

References

1. Bri, Garcia, Lloret, Dini. Real deployments of wireless sensor networks. Proceedings of the 3rd International Conference on Sensor Technologies and Applications (SENSORCOMM '09), June 2009, pp. 415–423. doi:10.1109/sensorcomm.2009.69. 2-s2.0-70449509159.

2. Garcia-Hernandez C. F., Ibarguengoytia P. H., Garcia-Hernandez, Perez-Diaz J. A. Wireless sensor networks and applications: a survey. International Journal of Computer Science and Network Security, 2007, 7, pp. 264–273.

3. Zhang, Meratnia, Havinga. Outlier detection techniques for wireless sensor networks: a survey. IEEE Communications Surveys & Tutorials, 2010, 12(2), pp. 159–170. doi:10.1109/surv.2010.021510.00088. 2-s2.0-77955082590.

4. Zhang, Meratnia, Havinga P. J. M. Distributed online outlier detection in wireless sensor networks using ellipsoidal support vector machine. Ad Hoc Networks, 2013, 11(3), pp. 1062–1074. doi:10.1016/j.adhoc.2012.11.001. 2-s2.0-84875691648.

5. Salem, Liu, Mehaoua. Anomaly detection in medical wireless sensor networks. Journal of Computing Science and Engineering, 2013, 7(4), pp. 272–284. doi:10.5626/jcse.2013.7.4.272.

6. Sun, Shan, Xiao. Anomaly detection based secure in-network aggregation for wireless sensor networks. IEEE Systems Journal, 2013, 7(1), pp. 13–25. doi:10.1109/JSYST.2012.2223531. 2-s2.0-84874644743.

7. Bhargava, Raghuvanshi A. S. Anomaly detection in wireless sensor networks using S-transform in combination with SVM. Proceedings of the 5th International Conference on Computational Intelligence and Communication Networks (CICN '13), September 2013, Mathura, India, IEEE, pp. 111–116. doi:10.1109/cicn.2013.34. 2-s2.0-84892654967.

8. Moshtaghi, Leckie, Karunasekera, Rajasegarar. An adaptive elliptical anomaly detection model for wireless sensor networks. Computer Networks, 2014, 64, pp. 195–207. doi:10.1016/j.comnet.2014.02.004. 2-s2.0-84896743358.

9. Zhong Q.-L., Cai Z.-X. Symbolic algorithm for time series data based on statistic feature. Chinese Journal of Computers, 2008, 31(10), pp. 1857–1864. 2-s2.0-55649083060.

10. Breunig M. M., Kriegel H. P., R. T., Sander. LOF: identifying density-based local outliers. Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '00), June 2000, pp. 93–104. doi:10.1145/342009.335388.

11. http://physionet.org/physiobank/database

12. http://archive.ics.uci.edu/ml/datasets.html

13. Chandola, Banerjee, Kumar. Anomaly detection: a survey. ACM Computing Surveys, 2009, 41(3), article 15. doi:10.1145/1541880.1541882. 2-s2.0-68049121093.

14. Sisodia M. S., Raghuwanshi. Anomaly base network intrusion detection by using random decision tree and random projection: a fast network intrusion detection technique. Network Protocols and Algorithms, 2011, 3(4), pp. 93–107. doi:10.5296/npa.v4i4.1342.

15. Samparthi V. S., Verma H. K. Outlier detection of data in wireless sensor networks using kernel density estimation. International Journal of Computer Applications, 2010, 5(6), pp. 28–32. doi:10.5120/924-1302.

16. Sahoo P. K. Efficient security mechanisms for mHealth applications using wireless body sensor networks. Sensors, 2012, 12(9), pp. 12606–12633. doi:10.3390/s120912606. 2-s2.0-84867006572.

17. Macii, Colombo, Pivato, Fontanelli. A data fusion technique for wireless ranging performance improvement. IEEE Transactions on Instrumentation and Measurement, 2013, 62(1), pp. 27–37. doi:10.1109/tim.2012.2209918. 2-s2.0-84871022679.

18. Tan, Xing, Liu, Wang, Jia. Exploiting data fusion to improve the coverage of wireless sensor networks. IEEE/ACM Transactions on Networking, 2012, 20(2), pp. 450–462. doi:10.1109/TNET.2011.2164620. 2-s2.0-84859964364.

19. Cheng C.-T., Leung, Maupin. A delay-aware network structure for wireless sensor networks with in-network data fusion. IEEE Sensors Journal, 2013, 13(5), pp. 1622–1631. doi:10.1109/JSEN.2013.2240617. 2-s2.0-84875752519.

20. Lin, Keogh, Lonardi, Chiu. A symbolic representation of time series, with implications for streaming algorithms. Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD '03), June 2003, pp. 2–11. doi:10.1145/882082.882086.

21. Cheng, Pei, Liu. Hierarchical distributed data classification in wireless sensor networks. Computer Communications, 2010, 33(12), pp. 1404–1413. doi:10.1016/j.comcom.2010.01.027. 2-s2.0-77955303204.

22. Wang. Clustering-based distributed support vector machine in wireless sensor networks. Journal of Information & Computational Science, 2012, 9(4), pp. 1083–1096. 2-s2.0-84859779374.

23. Siripanadorn, Hattagam, Teaumroog. Anomaly detection in wireless sensor networks using self-organizing map and wavelets. International Journal of Communication, 2010, 4, pp. 74–83.

24. Yasser, Siavash, Arash. An unsupervised network anomaly detection approach by K-means. Proceedings of the IEEE Symposium on Computers and Communications (ISCC '08), 2008, pp. 398–403.

25. Research of K-MEANS algorithm based on information entropy in anomaly detection. Proceedings of the 4th International Conference on Multimedia and Security (MINES '12), November 2012, pp. 71–74. doi:10.1109/mines.2012.169. 2-s2.0-84873120893.

26. Gaddam S. R., Phoha V. V., Balagani K. S. K-Means+ID3: a novel method for supervised anomaly detection by cascading k-Means clustering and ID3 decision tree learning methods. IEEE Transactions on Knowledge and Data Engineering, 2007, 19(3), pp. 345–354. doi:10.1109/tkde.2007.44. 2-s2.0-33847704184.

27. El-Sharkh M. Y. Clonal selection algorithm for power generators maintenance scheduling. International Journal of Electrical Power and Energy Systems, 2014, 57, pp. 73–78. doi:10.1016/j.ijepes.2013.11.051. 2-s2.0-84890458835.

28. Sun Z. X. Generative tracking of 3D human motion in latent space by sequential clonal selection algorithm. Multimedia Tools and Applications, 2014, 69(1), pp. 79–109. doi:10.1007/s11042-012-1251-5. 2-s2.0-84894607914.

29. Feng, Jiao L. C., Zhang, Sun. Hyperspectral band selection based on trivariate mutual information and clonal selection. IEEE Transactions on Geoscience and Remote Sensing, 2014, 52(7), pp. 4092–4115. doi:10.1109/tgrs.2013.2279591. 2-s2.0-84896394026.

30. de Castro L. N., Von Zuben F. J. Learning and optimization using the clonal selection principle. IEEE Transactions on Evolutionary Computation, 2002, 6(3), pp. 239–251. doi:10.1109/tevc.2002.1011539. 2-s2.0-0036613006.