Abstract
Sensor nodes in wireless sensor networks are prone to malfunction because they are exposed to the nearby environment directly. Consequently, wrong sensor readings occurred from sensor nodes and these readings are called an outlier. Commonly, since an outlier deviates from normal sensor readings and it can bring about some problems, various techniques to detect the outliers have been proposed. In this paper, we propose an efficient outlier detection technique based on data clustering. In order to decide the width of the cluster that consists of the sensor readings, we applied the Pigeonhole Principle and then detected the outliers based on clusters. In experiments, we demonstrate the efficiency of our proposed technique compared to other outlier detection techniques.
1. Introduction
Recently, since development of the integrated circuit, the size of the sensor is gradually reduced and various sensors are built in the sensor node in a wireless sensor network (WSN). Sensor nodes detect the various and huge information (e.g., temperature, humidity, and light) around their environments and they communicate to the base station and others using radio transmission. Accordingly, WSNs are used in various applications such as environment and habitat monitoring [1, 2], combat field surveillance [3], security [4], or health care [5] applications.
Commonly, sensor nodes are severely constrained in terms of the computation power, communication bandwidth, and battery power. Among these limitations, the power is of utmost importance, since replacing the battery of sensor nodes is too either expensive or impossible [6, 7]. Thus, the energy preservation is a major research issue since it directly impacts the life time of the network. Recently, much research has shown that the radio communication is more expensive than the computation or the sensing. Thus, many techniques [8–13] have been proposed in order to reduce the communication overhead.
Particularly, since sensor nodes are placed outdoor for the applications such as the disaster monitoring [14] and habitat monitoring, the sensor nodes can be malfunctioned or the sensor readings may be incorrect due to external impact such as the severe external environments [15]. In addition, due to sudden changes in external environments, some sensor readings may deviate significantly from the normal sensor readings. These abnormal sensor readings are called the outliers. For example, assume that several sensor nodes are deployed in a mountain to monitor forest fires. When an outlier value is detected and sent to a forest guard, the forest guard can identify the actual forest fires or he/she can initialize the sensor node if the outlier is generated by the malfunctioned sensor. Thus, the outlier detection is a quite important task to detect an event or to maintain sensor networks harmoniously.
In this paper, we focus on an energy-efficient outlier detection technique in WSNs based on data clustering. To construct data clusters, we use the Pigeonhole Principle. By applying cluster width, called the permission range, obtained by the Pigeonhole Principle, we partition the data space into several clusters. However, if we partition the data space evenly, some sensor readings are identified as outliers although similar sensor readings of them are detected by sensor nodes. Thus, we partition the domain of data unevenly based on the location of sensor readings. Then, we identify the outliers according to the user-defined threshold θ which is related to the number of sensor readings in a cluster.
The remainder of the paper is organized as follows. Section 2 discusses related work. In Section 3, we present the background of our work. Section 4 introduces the proposed outlier detection technique in WSNs. Section 5 presents an empirical evaluation. Section 6 summarizes the paper.
2. Related Work
In WSNs, a lot of outlier detection techniques have been proposed. In [16], an outlier detection technique was proposed to collect the outliers with respect to the neighbor sensor nodes. Each sensor node calculates the median of the sensor readings received from the neighbors and its readings. Each sensor node computes the mean μ and the standard deviation σ of the differences between its sensor readings and the calculated median. If the standardized value
Palpanas et al. proposed an outlier detection technique based on the the Epanechnikov kernel function [17]. Given a value v, each sensor node estimates the number of values around v using the kernel function. If the number of values around value v is less than a user-defined threshold p, a value v is regarded as an outlier. However, to make the kernel function, each sensor node transmits the required information to the base station along the routing path. Thus, it wastes a lot of energy.
In [18], an outlier detection technique based on data clustering, called DC, was proposed. Given a user-defined threshold ε, if the distance between any two sensor readings is less than ε, they become a cluster. When the distance between a pair of cluster centers is less than ε, they are merged. The intercluster distance (ICD) of a cluster to k-nearest clusters is computed to detect the outlier clusters. When ICD of a cluster is quite different from the mean of ICDs, it is regarded as an outlier cluster. Then, the information of outlier clusters is broadcasted into WSNs. Thus, it consumes much energy.
In [19], an outlier detection technique based on an ellipsoid was proposed. Each sensor node constructs an ellipsoid boundary of sensor readings using the mean and the covariance of sensor readings and transmits its ellipsoid boundary to the base station. The base station merges all received ellipsoids, computes the global ellipsoid boundary, and broadcasts the global ellipsoid to all sensor nodes. Then, with respect to the global ellipsoid boundary, each sensor node identifies the outliers among its sensor readings.
Recently, an outlier detection technique based on the distance between sensor readings and the estimation deviation was proposed [20]. Each sensor node computes the expected deviation and the average distance between pairs of sensor readings detected within recent time interval. If the average distance is greater than the expected deviation, the sensor reading at the current time becomes an outlier. However, when the expectation model is frequently updated, the communication overhead increases.
3. Preliminary
In this section, we present the basic model of sensor networks briefly.
3.1. Sensor Networks
We consider a sensor network consisting of n stationary sensor nodes

A simple sensor network.
Each sensor node generates its readings periodically. A sampling period is known as an epoch [21]. To agree on a global time base that allows sensor nodes to start and finish each epoch simultaneously, each sensor node executes the SMACS protocol [22] or a global time synchronization protocol [23]. Based on global time synchronized, nodes sleep for a certain period of time in each epoch to minimize energy consumption and each sensor node awakes to sample and receive results when its neighbors try to propagate a message.
4. Outlier Detection Based on Clustering
In this section, we present our proposed outlier detection technique which is based on data clustering in WSNs. In addition, we introduce an efficient data transmission scheme for our outlier detection technique.
4.1. Clustering Technique
Generally, in the outlier detection techniques based on data clustering, the width of clusters is the most important, since the number of sensor readings in each cluster is affected by the width of clusters. If the width of clusters is too large, all sensor readings may belong to a single cluster. Otherwise, each cluster may have only a single sensor reading. Thus, the outliers cannot be identified in these cases.
To solve the above problem, in this paper, we applied the Pigeonhole Principle to determine the width of clusters. We regard that pigeonholes and pigeons are the domain of sensor readings and the sensor readings, respectively. Thus, when we partition this domain into x number of subdomains, where x is less than the number of sensor readings, at least one subdomain contains more than two sensor readings.
Given a set of sensor readings
We use PR as the width of clusters and identify the outliers. For instance, given a set of sensor readings R shown in Figure 2, PR of R is obtained as

Equipartitioning based on permission range.
In Figure 2, the clusters represented by dotted lines are presented when we partition the domain of R evenly. Assume that the user-defined threshold θ is
To solve this drawback, we propose the nonequipartitioning based on the permission range (PR). In the nonequipartitioning, a set of clusters is constructed with respect to Definition 1.
Definition 1.
If the difference between a pair of sensor readings
In our proposed technique, if

Nonequipartitioning based on permission range.
4.2. Clustering Scheme for WSNs
If each sensor node transmits its readings to the base station blindly at each epoch and the base station computes the outliers, each sensor node consumes much energy. In this section, we present an efficient data transmission scheme for our outlier detection algorithm. We assume that all sensor nodes take sensor readings periodically and keep these readings into their local storage for a time window w. Note that when w is 1, each sensor node transmits data to the base station at each epoch.
At first, according to the Definition 1, each sensor node constructs clusters using its sensor readings detected within w. If the number of sensor readings in a cluster is greater than a user-defined threshold θ, all sensor readings in this cluster cannot be the outliers (i.e., nonoutlier cluster (NOC)). Otherwise, all sensor readings in a cluster may be the outliers, and then we call such clusters the outlier candidate clusters (OCCs).
Along the routing path to the base station, each sensor node transmits NOCs and OCCs. For NOCs, cluster ranges (CRs) are transmitted only where CR consists of minimum and maximum values of sensor readings in a NOC. In contrast, for OCCs, the sensor readings in OCC are transmitted to the parent node.
When a parent node p received CRs of NOCs and OCCs from its child nodes, p attempts to merge them with its clusters. To merge the clusters, we use the following definition.
Definition 2.
Given two cluster ranges
If a pair of cluster ranges overlap within PR, there are at least two sensor readings which are close and contained in different clusters. Thus, if two cluster ranges
Note that, when two clusters are merged into a new cluster where at least one of them is a NOC, the merged cluster cannot be an OCC, since the number of sensor readings in a NOC is already greater than θ. Thus, when a sensor node transmits a NOC, we do not need to transmit all sensor readings in the NOC and it only needs the cluster range CR of the NOC. Thus, each sensor node reduces the energy consumption when it transmits the NOCs since the volume of the transmitted data from each sensor decreases.
In contrast, when two OCCs are merged into a new cluster, we check whether the number of sensor reading in the new cluster is greater than θ or not. Since each sensor node sends all sensor readings in each OCC, we can easily count the number of sensor readings in the new cluster.
Along the routing path from each sensor node to the base station, CRs of NOCs and OCCs are merged and transmitted gradually. Finally, the base station can determine the outliers among the received OCCs. Recall that the cluster ranges (CRs) rather than all senor readings in NOCs are transmitted along the routing paths. Thus, we can reduce the energy consumption of each sensor node.
For example, given a WSN with 11 sensor nodes in Figure 4 where the user-defined threshold θ is 1 and the domain of a set of sensor readings is

Acquisition sensor readings.

Clustering in
In Figure 6,

Clustering in
A sensor node

Clustering in
4.3. Multidimensional Clustering Scheme
Generally, since the sensor nodes in WSNs are equipped with several sensor devices, sensor readings are multidimensional. For instance, when a sensor node obtains the weather information, it detects temperature as well as other parameters such as humidity and light. Thus, we propose the clustering scheme for a set of multidimensional data.
Let a set of d-dimension sensor readings
Definition 3.
Let a d-dimensional cluster range of a cluster
Based on Definition 3, For every dimension k with
For example, given three cluster ranges
Example of CRs on 2 dimensions.
5. Experiments
5.1. Experimental Environments
To evaluate the performance of our proposed algorithm compared with the state-of-the-art algorithm, we used a set of real-life data which is provided by Intel Berkeley Research Lab [24]. A sensor network consists of 54 sensor nodes, and each sensor node is deployed in

Placement of sensor nodes [24].
As competitors, we implemented the Brute-Force (
In
To compute the energy consumption, we use the free space channel model [25]. Under this model, to transmit an l-bits message and a distance c, a sensor expends
And, to receive this message, a sensor expends
In this experiment, we set 50 nJ/bit to the electronic circuit constant (
Parameters.
5.2. Experimental Result
To evaluate the energy consumption of each outlier detection algorithm, we run our own simulator for 1000 epoches and plot the total energy consumption.
Figure 9 shows the energy consumption varying the window size w. When the size of a window is small (i.e.,

Varying the window size (w).
However, as w increases, the performance of
Figure 10 shows the energy consumption varying θ. As shown in Figure 10, the performances of all techniques are stable in spite of varying θ. The energy consumption of

Varying the threshold size (θ).
Figure 11 shows the energy consumption varying the dimension d. The energy consumption of all techniques is increased according to varying d, because the packet of the sensor nodes on multidimension contains more information as compared with that on 1 dimension. The energy consumption of

Varying the dimension (d).
6. Conclusion
In this paper, we present an efficient outlier detection technique in WSNs. To obtain the appropriate width of clusters, we applied the Pigeonhole Principle. In our proposed technique, each sensor node in WSNs constructs and merges the clusters based on the permission range PR. Then, our proposed technique uses two kinds of clusters (NOC and OCC) in order to detect the outliers and reduce the energy consumption of each sensor. In our experiments with a set of real-life data, we show that our proposed technique outperforms existing techniques significantly.
Footnotes
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2012R1A1B3003060).
