Sage Journals: Discover world-class research

Abstract

Adopting filtering mechanism of dynamic filtering windows installed on sensor nodes to process top-k queries is an important research direction in wireless sensor networks. The mechanism can reduce transmissions of redundant data by utilizing filters. However, existing algorithms based on filters consume a vast amount of energy due to filter updating. In this paper, an energy-efficient top-k query technique based on adaptive filters is proposed. Due to updating filters consuming a large amount of energy, an algorithm named FUGPR based on Gaussian process regression to process top-k queries is provided for saving energy. When the filters change, the sensor readings are predicted to calculate the updating costs of filters; then FUGPR decides whether the filters need to be updated or not. Thus, the energy consumption for updating filters is decreased. Experimental results show that our approach can reduce energy consumption efficiently for updating filters on two distinct real datasets.

1. Introduction

With the improved recognition to the physical world and the rapid development on technologies such as electron and wireless communication, wireless sensor networks have been applied to many areas such as military, medical treatment, and environment surveillance. The bright future of wireless sensor networks has attracted the attention of so many scholars. In wireless sensor networks, sensor nodes are energy-limited which are powered by batteries. In addition, various kinds of sensors generate a large amount of data. If all the data is sent to the base station, much energy will be consumed and sensors will run out of energy soon.

For large quantity of data generated by wireless sensors, users are always interested in max or min k objects among them. People are often interested in the k maximum or minimum values in a wireless sensor network. A top-k query which returns k sensor nodes with the highest readings can meet the requirements above. Top-k queries have been widely used in wireless sensor networks [1]; for instance, it can be used to effectively monitor environmental and ecological changes. Assume that there are many bird feeders placed in a forest, each of which includes a sensor node. These sensor nodes can detect weight changes to count the number of birds landing at the feeder; thus ornithologists can estimate the number of birds in different areas and choose k areas to observe living habits of birds. Yet, some sensor nodes frequently become the top-k queries candidate sets. After several runs of top-k queries, some failed nodes will emerge, because these nodes frequently perceive and transmit data. According to the top-k query semantic of the example above, the k best sensor nodes often become failed nodes. After that, the top-k query results can not reflect the true distribution of the birds in the forest due to the failed nodes whose energy is exhausted.

To answer top-k queries in wireless sensor networks, a typical solution is by making the use of filters. Filters are broadcasted into the network and used by individual sensor node to decide whether its value is relevant to compute the query result. FILA (filter based monitoring approach) [2, 3] is an energy-efficient approach among these algorithms. The basic idea of FILA is to install a filter at each child sensor node, which avoids redundant data transmitting to parent nodes or base station. However, the approach consumes much energy on updating filters. Eager filter update policy in FILA updates windows parameters immediately once sensor readings violate the filters while lazy update policy cannot guarantee the updated filters need not be reupdated to the old ones in next epochs. By predicting future sensor readings, we can decide whether it is necessary to update current filters for saving energy in a long period. Mai et al. [4, 5] proposed DAFM algorithm according to FILA. The algorithm predicts the next sensor readings of sensor nodes by linear regression model. Then, the base station decides whether to update filtering windows based on predicted benefits. When sensor data satisfies complicated distributions and varies widely over time, the predicted performance of linear regression is unsatisfactory and the performance of DAFM algorithm gets worse. These issues are essential to top-k queries in wireless sensor networks and are investigated in this paper. The contributions of this paper are summarized as follows: (i)

We adopt a powerful Gaussian process regression (GPR) model to predict next s steps’ sensor readings based on historical sensor data. The GPR model is suitable for sensor data due to the temporal correlations. Our proposed filter updating based on Gaussian process regression (FUGPR) approach focuses on reevaluation when filters need updating. FUGPR introduces prediction mechanism to achieve less energy consumption compared to eager filter update and lazy update policies in FILA approach. Also, comparing to DAFM approach, our FUGPR approach guarantees smaller mean squared error thus can perform effectively.

(ii)

Instead of one-step-forward prediction of DAFM approach, we adopt adaptive s step prediction based on prediction errors. To select suitable s values based on specified sensor data, FUGPR outperforms existing approaches.

(iii)

Extensive experiments are conducted to evaluate proposed FUGPR approach by using real data traces. The results show that our FUGPR approach outperforms DAFM, FILE-e, and FILA-l in terms of both energy consumption and network lifetime under single hop and multihop network configurations.

The rest of this paper is organized as follows. In Section 2, we discuss related work. Typical filtering methods for top-k query processing are introduced in Section 3 and our proposed FUGPR algorithm is provided in Section 4. In Section 5, extensive simulations are conducted to show the efficiency and accuracy of the proposed method. Finally, we conclude this paper in Section 6.

2. Related Work

A naive implementation for monitoring top-k query is to use a centralized approach in which all sensor readings are periodically collected by the base station, which then computes the top-k result set directly. The most predominant algorithm for top-k processing is the Threshold Algorithm (TA) [6]. Along with TA algorithm, Cao and Wang [7] developed a three-phase protocol (TPUT) which decreases the number of remote accesses in large distributed networks. Theobald et al. [8] introduced a family of approximate top-k algorithms with probabilistic guarantees for multidimensional data. By extending TPUT algorithm, Michel et al. [9] proposed the KLEE algorithm to provide approximate answers which provides an adaptive framework to allow trade-off efficiency against result quality and network bandwidth. Zeinalipour-Yazti et al. [10] proposed UB-K and UBLB-K algorithms that return upper/lower bounds instead of exact answers. Also, approximate evaluation of top-k queries has also been proposed by Silberstein et al. [11] based on sampling approaches.

In the area of wireless sensor networks, energy-efficient query processing has always been an important issue. A wireless sensor network can be viewed as a distributed network which consists of lots of energy-limited sensor nodes, and communication cost is the main energy consumption. Therefore, there is no doubt that the centralized approach will consume extra energy because of the transmission of massive data. Filtering and aggregation are two main strategies at present that cope with each other to process top-k queries in wireless sensor networks. In order to reduce the communication cost in data collection, Madden et al. [12] proposed an in-network aggregation technique, known as TAG, which keeps unavailable data from transmission compared with centralized algorithms. Similar to TAG, Cougar [13] and Kspot [14] also employ a centralized optimizer to coordinate sensor nodes in an energy-efficient manner. Chen et al. propose QF (Quantile Filter) [15] approach which treats sensing values and its sensor as a point. The goal of algorithm is to decrease energy consumption and prolong the lifetime of network which is not only minimizing the total energy consumption, but also consuming less energy on each node. Recently, they develop online algorithms for answering time-dependent top-k queries with different values of k through the dynamic maintenance of a materialized view that consists of historical top-k results [16]. Liu et al. [17] propose a new cross pruning (XP) aggregation framework for top-k query in wireless sensor networks. There is a cluster-tree routing structure to aggregate more objects locally and a broadcast-then-filter approach in the framework. In addition, it provides an in-network aggregation technique to filter out redundant values which enhance in-network filtering effectiveness. Abbasi et al. [18] proposed MOTE (model-based optimization technique) approach based on assigning filters on nodes by model-based optimization. Nevertheless, it is an NP-hard problem on how to get optimal filter settings for top-k set.

However, these approaches incur unnecessary updates in the network and are not really energy-efficient. Olston et al. [19] propose the range caching mechanism for approximate query processing. In range caching mechanism, the base station caches a value range for each sensor node and computes a tentative result based on cashed value ranges. The cashed values are refreshed only when the new values on the sensor nodes violate the ranges. Also, Olston et al. [20] use filters to bound error in query result over distributed streams. An adaptive scheme is proposed to reduce the communication cost by precision adjustment at each individual source. Inspired by the filter based ideas [19, 20], Wu et al. [2, 3] propose FILA approach for top-k monitoring, which is to install a filter on each sensor node to filter out unnecessary data not contributed to final results. Reevaluation and filter setting are two critical aspects that ensure the correctness and effectiveness. When the new filters are different from the old ones maintained by sensor nodes themselves, base station needs to update them. But when sensing values on nodes vary widely, base station needs updating filters to related nodes frequently which leads to large scales of updating cost and makes the performance of algorithm worse. Mai et al. [4, 5] propose DAFM approach which aims to reduce the communication cost of sending probe messages in reevaluation process as well as the transmission cost in filter updating in FILA. DAFM approach predicts the next readings of sensor nodes to determine these values are in or out of filtering windows by linear regression models. To some extent, this approach decreases the cost of filter updating. However, the sensed values are affected by various factors; the performance is worse when adopting linear regression models to predicate future sensor values. In our previous work [21], time series models are adopted for predicting next sensor values for unnecessary filter updates. However, this approach is more suitable for specific sensor data, that is, time series data distribution and not energy-efficient for non time series sensor data.

3. Filter Based Top-k Query Methods

We first give a formal definition in Section 3.1. Then, Section 3.2 provides filter based top-k query methods and discusses the limitations of existing methods.

3.1. Problem Definition

We consider there are N sensor nodes in a wireless sensor network labeled $1,2, \dots, N$ which compose a set I, $I = {1,2, \dots, N}$ . Each sensor node $n_{i}$ ( $n_{i} \in I$ ) measures the local physical phenomenon $v_{i}$ (e.g.,temperature,voltage,light, orhumidity) at a fixed sampling rate. We consider a top-k query that requests the list of sensor nodes R with the highest readings, $R = 〈n_{1}, n_{2}, \dots, n_{k}〉$ for $\forall i < j$ , $v_{i} \geq v_{j}$ , and $\forall l \neq i$ $(i = 1,2, \dots, k)$ , $v_{l} \leq v_{k}$ . The results are maintained by base station and returned to users finally. The goal in this paper is to prolong the lifetime of wireless sensor networks by minimizing the overall energy consumption.

3.2. Filter Based Top-k Query Processing

3.2.1. An Overview of Filter Based Methods

To reduce network traffic in data collection, TAG [12] as an in-network aggregation technique has been proposed. The topology of the sensor network can be assumed as a spanning tree rooted at the base station. In TAG solution for monitoring top-k queries, every intermediate node of the spanning tree collects all values from its children and forwards the k largest of these values to the base station. Although TAG achieves a reduction of the number of transmitted values by aggregation, the number of messages remains high. Consider the situation that the final k largest values fall in one branch of the spanning tree at some level; the readings of other branches do not need to be forwarded. Therefore, TAG incurs unnecessary transmissions in the network and is not energy-efficient.

Range caching method was first provided by Olston et al. [19, 20] and later utilized by Wu et al. [2, 3] to process top-k queries in wireless sensor networks. The base station cashes a value range α for the value at each sensor node. If sensed value is $v_{i}$ for sensor node $n_{i}$ , we take $u_{i} = v_{i} + α / 2$ , $l_{i} = v_{i} - α / 2$ as the upper bound and lower bound of the range. A sensor node is updated with the base station only when the new value is beyond the range of the previously reported value. However, it is obvious that range caching method is simple to process top-k queries. However, the ranges (we can take a range as a filter) of distinct sensor nodes are overlapped and top-k queries could only be executed after the base station collects all the values of the sensor nodes with the overlapped ranges. The procedure will consume much energy so range caching method is not energy-efficient. Beyond this, the range value α is difficult to be decided.

To overcome the defect of range caching method, Wu et al. [2, 3] proposed FILA to process top-k queries in sensor networks. At the beginning of FILA, the base station collects the readings from all sensor nodes and sorts the sensor readings to obtain the initial top-k result set. Then the base station calculates a filter $[l_{i}, u_{i}]$ for each sensor node $N_{i}$ and sends it the corresponding sensor node for installation. At the next top-k query processing period, if the new reading of sensor node $N_{i}$ is within the filter $[l_{i}, u_{i}]$ , then no update is sent to the base station. Otherwise, if the new reading violates the filter, then an update is sent to the base station. The base station should reevaluate the top-k result and adjust the filters of relevant sensor nodes.

(1) Filter Setting. At the beginning of FILA method, the base station has obtained the sorted readings of all the sensor nodes ${v_{1}, v_{2}, \dots, v_{N}}$ and $v_{1} \geq v_{2} \geq \dots \geq v_{N}$ . In FILA, each node in the top-k results has a separate filter while all the remaining non-top-k nodes share a common one. At any time, we only need $k + 1$ filters. In addition, to ensure the correctness of FILA method, each filter $[l_{i}, u_{i}]$ should cover their current readings but does not overlap with any other filter. A feasible filter setting $[l_{1}, u_{1}], [l_{2}, u_{2}], \dots, [l_{k}, u_{k}]$ , $[l_{k + 1}, u_{k + 1}]$ must satisfy $v_{1} \leq u_{1}$ , $v_{i + 1} \leq u_{i + 1} \leq l_{i} \leq v_{i}$ , $(1 \leq i \leq k)$ and $l_{k + 1} \leq v_{N}$ .

To maximize the filtering capability, the upper bound of the top-1 node's filter $u_{1}$ is set to $+ \infty$ while the lower bound of the non-top-k's node filter $l_{k + 1}$ is set to $- \infty$ and $u_{i + 1}$ is set equal to $l_{i}$ for all the nodes except them. In FILA, two filter setting strategies, uniform and skewed settings, are provided, respectively. In this paper, we adopt uniform filter setting for simplicity. That is, $u_{i} + 1$ and $l_{i}$ are set at the midpoint of two sensor readings which is shown as below and Figure 1 illustrates the filter setting of the sensor nodes for a top-k query. Consider

\begin{matrix} u_{1} = + \infty \\ v_{i + 1} = l_{i} = \frac{(v_{i} + v_{i + 1})}{2}, (1 \leq i \leq k) \\ l_{k + 1} = - \infty . \end{matrix}

(1)

Figure 1

Filter setting.

(2) Query Reevaluation. In FILA, if there is at least one sensor reading sent to the base station for filter violation, the top-k results become undecided at the base station. Then the base station should probe some related sensor node(s) to reevaluate the top-k results.

When the updated reading overlaps with the filter of any other sensor node, there are three situations for this update.

Internal Update. An update originated from a top-k node jumps into the filter of another top-k node. For example, as shown in Figure 2(I), $v_{1}$ jumps into $[l_{2}, u_{2}]$ .

Figure 2

Three situations of windows update.

Join Update. An update from a non-top-k node jumps over the critical bound and falls into the filter of a top-k node. For example, as shown in Figure 2(II), $v_{k + 1}$ jumps into $[l_{k - 1}, u_{k - 1}]$ .

Leave Update. An update from a top-k node jumps over the critical bound and falls into the filter of non-top-k nodes. For example, as shown in Figure 2(III), $v_{3}$ jumps into $[l_{k + 1}, u_{k + 1}]$ .

It is obvious that if an update is an internal or a join one, then only the relevant top-k node whose filter covers the updated sensor reading needs to be probed to reevaluate the top-k result. Otherwise, a leave update from a node $N_{i}$ may have to probe all non-top-k nodes to look for the new top-kth node. This introduces high energy consumption.

(3) Filter Updating. When sensor readings pass their filters, the filter settings of corresponding sensor nodes need to be recomputed and the base station will send the new filtering windows to the nodes for installation. There are two approaches for updating the filter of each node $n_{i}$ in FILA.

Eager Filter Update. If a new filtering window is different from the old one, then the new filter computed by the base station is immediately sent to the node $n_{i}$ to replace the old one.

Lazy Filter Update. If a new filtering window $[l_{i}^{'}, u_{i}^{'}]$ fully contains the old one $[l_{i}, u_{i}]$ , that is, $[l_{i}, u_{i}] \subset [l_{i}^{'}, u_{i}^{'}]$ , then the base station delays the filter update until node $n_{i}$ 's reading violates the old one.

When top-k result set changes, the filters of the sensor nodes will also change. We demonstrate above two updating policies using the example shown in Figure 3. In Figure 3, we only consider four sensor nodes $n_{i}$ , $n_{i + 1}$ , $n_{i + 2}$ , and $n_{i + 3}$ . When the readings $v_{i + 1}$ , $v_{i + 2}$ change, the four filters will also change. According to the eager filter update policy, the base station needs to send all the four filters to the sensor nodes. In contrast, according to the lazy filter update policy, due to the fact that new filter windows of nodes $n_{i}$ , $n_{i + 3}$ contain the old ones, that is, $[l_{i}, u_{i}] \subset [l_{i}^{'}, u_{i}^{'}]$ , $[l_{i + 3}, u_{i + 3}] \subset [l_{i + 3}^{'}, u_{i + 3}^{'}]$ , the base station only needs to send the filters of nodes $n_{i + 1}$ , $n_{i + 2}$ while delaying the filter update of nodes $n_{i}$ and $n_{i + 3}$ . We can see that eager filter update consumes much energy for updating the filters of all sensor nodes even when only one sensor reading violates the filter. The lazy filter update consumes less energy because the sensor nodes with the new filter windows containing the old ones remain unchanged.

Figure 3

Changes of filter window.

We can see that lazy filter update policy is naive and not energy-efficient. Due to delay of the filter updating to some sensor nodes, the gap between adjacent filters will occur as illustrated in Figure 4 which decreases the filtering capability. In addition, lazy filter update is a suboptimal method because the total number of filtering updates remains in a high level and consumes much energy though the number of updates is reduced for one time filter update.

Figure 4

Existence of the gap between windows.

3.2.2. Defects of FILA

To some extent, FILA avoids redundant data from transmission and saves energy. However, there is massive unnecessary energy consumption. Assume that after sampling in epoch t, the filtering windows are shown in Figure 5 [5]. In Figure 5(a), at epoch $t_{i}$ , before sampling top-3 has the value of ${v_{1}, v_{4}, v_{3}}$ , and the result set is ${v_{1}, v_{3}, v_{4}}$ after sampling as shown in Figure 5(b); that is, the value $v_{3}$ jumps out its filter and falls into the filtering window installed on $v_{4}$ . The base station needs to adjust the filtering windows for relevant nodes which are shown as Figure 5(c). However, after sampling in epoch $t_{i + 1}$ as shown in Figure 5(d), the value $v_{3}$ changes again and jumps into the frontal window. In FILA, fluctuation of data will cause frequent updates of the filter windows which will incur large amount of communication cost. Mai et al. [5] proposed an updating algorithm based on prediction which determines whether to update the filters according to the possible cost on updating. The algorithm decreases the cost on updating the filters to some extent. However, this approach is limited by the prediction performance of linear regression model.

Figure 5

Frequent updating of the filters.

In addition, from Figure 5, the values on the non-top-k vary in a small range; their filtering windows are useful to filter out irrelevant sensor nodes. However, the upper bound of non-top-k nodes is determined by the values of ith and $(i + 1)$ th sensor nodes; when the value which is ranked at kth varies, the filters on non-top-k will be affected directly. For a wireless sensor network generally consists of large number of sensor nodes, k is usually small in a top-k query while a large number of sensor nodes fall into non-top-k node set. If the algorithm updates the filters once reevaluating the top-k results, it will consume more energy on updating the filters of non-top-k nodes.

In this paper, we adopt Gaussian process regression to alleviate the burden of filter update in FILA. When the filters change, the sensor readings are predicted by Gaussian process regression to calculate the updating costs of filters. An algorithm for top-k processing under FILA framework named FUGPR is provided. FUGPR reduces the cost of filter updating when fluctuation of sensor readings occurs and prolongs the lifetime of the wireless sensor network.

4. Filter Updating Algorithms

FUGPR is efficient top-k processing method based on predicted benefits of Gaussian process regression [22, 23]. Gaussian processes let the data “speak” more clearly for themselves. Gaussian processes extend multivariate Gaussian distributions to infinite dimensionality. Formally, a Gaussian process generates data located throughout some domain such that any finite subset of the range follows a multivariate Gaussian distribution. Under Bayesian linear regression framework, a Gaussian process predictor calculates posteriors from priors over functions rather than from priors over parameters [24]. Gaussian process regression has been widely exploited in many research topics, such as human motion estimation and time series prediction [25]. Recently, Gaussian process approach has been adopted for predicting of monitoring models over distributed data streams [26] and real-time sensor data processing in wireless sensor networks [27].

In FUGPR, each node in the top-k list has a separate filter while all the remaining non-top-k nodes share a common one. This setting is along with the framework of FILA. The $k + 1$ filters must not overlap with one another. If the reading of a sensor node at next epoch falls in the filter, the reading will not be sent to the base station. Otherwise the violated reading needs to be sent to the base station. Due to limited storage of sensor nodes (e.g., MICAz and telosB only have 4 KB and 10 KB RAM, resp. [28]), each sensor node only preserves current reading and violated readings from its child nodes. The maximum number of readings stored for the sensor nodes are the ones connecting directly with the base station which is decided by the number of layers of the spanning tree (which is a value not less than 100). For typical sensor nodes such as MICAz and telosB, the storage is enough to preserve these readings. The base station will reevaluate the filter setting of all the nodes based on the violated readings. According to the analysis in Section 3.2, the reevaluation process consumes much energy for filter updating. Instead, we utilize the historical sensor readings of each node to construct Gaussian process models to predict further values of each sensor reading. Then, the costs by utilizing the old filters as well as the new filters are evaluated which decides whether or not to update filters. The predict benefits of FUGPR can save much energy for avoiding updating filters frequently.

4.1. Gaussian Process Regression for Prediction

We predict next epochs’ readings of a sensor node on their temporal correlations. The very simple example is to predict next epoch's sensor value based on current reading. For l step prediction, we assume $x_{i}^{(l)}$ is current reading and $y_{i}^{(l)}$ the predicting value of next epoch. We can get a dataset $H_{D}$ , $H_{D} = {(x_{i}^{(l)}, y_{i}^{(l)}), l = 1,2, \dots, m, y_{i}^{(l)} = x_{i}^{(l + 1)}}$ , which can be seen as the input of the Gaussian process regression. However, it is obvious that this method will lead to big error margin for predicting next epoch's sensor readings based on current reading is far from enough.

To facilitate the calculation, we only focus on energy consumption resulting from communication and omit other aspects of energy consumption. The assumption is reasonable for the communication of wireless sensor networks is main aspect of energy consumption. The total amount of energy spent in sending a message with m bytes of content is given by $α_{s} + β_{s} m$ , where $α_{s}$ and $β_{s}$ are the per-message and per-byte sending costs, respectively. And the total amount of energy consumption in receiving a message with m bytes of content is given by $α_{r} + β_{r} m$ , where $α_{r}$ and $β_{r}$ are the per-message and per-byte receiving costs, respectively. As an example, typical values for MICA2 motes, receiving cost is defined analogously, with typical values of $α_{r}$ and $β_{r}$ roughly 60% less than their sending counterparts. To calculate quantitatively for our FUGPR, we set $α_{r} = 0.4 α_{s}$ , $β_{r} = 0.4 β_{s}$ and the values of $α_{s}$ , $β_{s}$ , $α_{r}$ , and $β_{r}$ are displayed in Table 1, respectively [29].

Table 1

Typical values for energy calculation.

Symbol	Value
$α_{s}$	0.645 mJ
$β_{s}$	0.0144 mJ/byte
$α_{r}$	0.258 mJ
$β_{r}$	0.00576 mJ/byte

From Table 1, we can learn that the energy consumption of communication is higher than data transmission. In order to reduce the number of communications, each transmission the sensor node sends more bytes for saving energy. In FUGPR, readings fallen in their corresponding filter need not be sent to the base station while the violated sensor readings along with their past p readings are sent to the base station together. For each sensor node, the p value is learned from the historical data at the base station. The p value can not be a large number because transmitting too much data will also consume much energy. In this paper, the p value varies from 1 to 5. We choose the p value with the minimum squared error (MSE) for the Gaussian process regression in FUGPR. At the base station, for each sensor node we use Gaussian process regression model on historical dataset $H_{D}$ . The input is the p sensor readings $X (t)$ and the output is the predicting value $y (t)$ . Let $X^{t} = [x^{t} x^{t - 1} \cdot \cdot \cdot x^{t - p + 1}]$ and $y^{t} = x^{t + 1}$ . The historical dataset $H_{D}$ satisfies $H_{D} = {(X^{(l)}, y^{(l)})$ , $l = p, p + 1, \dots, p + m - 1}$ . As illustrated in Figure 6, the $p + m$ data values compose m regression points as shown in Figure 6.

Figure 6

Construction of the training dataset.

To each sensor node, we use several past reading as the input and the reading at next epoch as the output; we can obtain the following equation through a Gaussian noise model:

\begin{matrix} y = f (X) + N (0, σ^{2}) . \end{matrix}

(2)

Users often expect the underlying function $f (X)$ to be linear and a least-squares method can be exploited to fit the straight line. In DAFM, Mai et al. [5] adopted linear regression method to predict next epochs’ sensor values. Typical linear regression models for prediction have the following formula:

\begin{matrix} y = b + θ X + ε, \end{matrix}

(3)

where ε is random error satisfying Gaussian distribution

N (0, σ^{2})

and the parameter b, θ can be estimated by the least-squared method.

Due to various sensor readings from real applications in wireless sensor networks, the prerequisites of linear regression method are not satisfied. Besides linear regression models, other analogical models may use the principles of model selection to choose among the various possibilities for we suspect $f (X)$ may also be quadratic, cubic, or even nonpolynomial. Instead of claiming $f (X)$ relates to some specific model, a Gaussian process extends multivariate Gaussian distributions to infinite dimensionality and lets the data stand for themselves. In geostatistics, Gaussian processes are known as “Kriging,” but the input space in the literature only concentrates two or three dimensions while Gaussian processes consider more general input spaces.

We illustrate Gaussian process regression as shown in Figure 7. For simplicity, we consider one-dimensional input space. The solid points are real sensor readings, the solid line indicates an estimation of prediction values for the real sensor readings, and the shaded region denotes 95% confidence intervals along with the regression line. If current sensor reading of a sensor node is $x_{*}$ , the sensor value $y_{*}$ of the next epoch along with the predicted value $y_{*}^{'}$ predicted by Gaussian process regression satisfies a Gaussian distribution. Consider

\begin{matrix} y_{*} ~ N (y_{*}^{'}, σ^{2}) . \end{matrix}

(4)

Figure 7

An example for Gaussian process regression.

4.1.1. Gaussian Process Regression

A Gaussian process is a collection of random variables, any finite number of which has a joint Gaussian distribution. In general, a Gaussian process is completely specified by its mean function $m (x)$ and covariance function $k (x, x^{'})$ . Usually, for notational simplicity we will take the mean function to be zero. The covariance function $k (x, x^{'})$ represents the influence of different sensor readings.

If there are m items in the training dataset $H_{D}$ , a $m \times m$ positive definite covariance matrix can be constructed by $k (x, x^{'})$ as follows:

\begin{matrix} K = [\begin{bmatrix} k (x^{1}, x^{1}) & k (x^{1}, x^{2}) & \dots & k (x^{1}, x^{m}) \\ k (x^{2}, x^{1}) & k (x^{2}, x^{2}) & \dots & k (x^{2}, x^{m}) \\ ⋮ & ⋮ & ⋱ & ⋮ \\ k (x^{m}, x^{1}) & k (x^{m}, x^{2}) & \dots & k (x^{m}, x^{m}) \end{bmatrix}] . \end{matrix}

(5)

When a new sensor reading $x_{*}$ arrives, the covariance matrix $K_{*}$ of previous m data items and $K_{* *}$ of itself are shown as follows:

\begin{matrix} K_{*} = [\begin{bmatrix} k (x_{*}, x_{1}) & k (x_{*}, x_{2}) & \dots & k (x_{*}, x_{m}) \end{bmatrix}], \end{matrix}

(6)

where

K_{* *} = k (x_{*}, x_{*})

. The estimation of sensor readings x,

x = [x^{(1)} x^{(2)} \dots x^{(m)}]

is y,

y = [y^{(1)} y^{(2)} \dots y^{(m)}]

. The joint function of y and

y_{*}

is also a multivariate Gaussian function. Consider

\begin{matrix} [\begin{bmatrix} y \\ y_{*} \end{bmatrix}] ~ N (0, [\begin{bmatrix} K & K_{*}^{T} \\ K_{*} & K_{* *} \end{bmatrix}]) . \end{matrix}

(7)

According to Bayesian rules, we can obtain

\begin{matrix} y_{*} | y ~ N (K_{*} K^{- 1} y, K_{* *} - K_{*} K^{- 1} K_{*}^{T}) . \end{matrix}

(8)

The mean value and variance of $y_{*}$ are $K_{*} K^{- 1} y$ and $K_{* *} - K_{*} K^{- 1} K_{*}^{T}$ , respectively. That is,

\begin{matrix} y_{*} ~ N (y_{*}^{'} = K_{*} K^{- 1} y, K_{* *} - K_{*} K^{- 1} K_{*}^{T}) . \end{matrix}

(9)

When given a new sensor reading $x_{*}$ and a confidence level $1 - δ$ , the confidence interval is $y_{*}^{'} \pm z δ / 2 σ$ .

4.1.2. Training Gaussian Process Regression Models

How well does the selection of the covariance function decide the reliability of our regression process? The procedure mainly lies on the characteristics of the training sensor readings. In the Gaussian process literature, a popular choice is the squared exponential (SE) covariance function which is also adopted in our paper. Consider

\begin{matrix} k (x_{p}, x_{q}) = σ_{f}^{2} exp (- \frac{1}{2} {(x_{p} - x_{q})}^{T} P^{- 1} (x_{p} - x_{q})) . \end{matrix}

(10)

When folding the noise $c o v N o i s e = σ_{n}^{2} δ (x_{p}, x_{q})$ into $k (x_{p}, x_{q})$ , the SE covariance function will be

\begin{array}{l} k (x_{p}, x_{q}) = σ_{f}^{2} exp (- \frac{1}{2} {(x_{p} - x_{q})}^{T} P^{- 1} (x_{p} - x_{q})) \\ + σ_{n}^{2} δ (x_{p}, x_{q}) . \end{array}

(11)

In the above equation, if $P = \begin{smallmatrix} (\begin{smallmatrix} λ^{- 2} \\ ⋱ \\ λ^{- 2} \end{smallmatrix}) \end{smallmatrix}$ , the covariance function is named $c o v S E i s o$ , that is, squared exponential covariance function with isotropic distance measure. If $x_{p}$ , $x_{q}$ are d-dimensional column vectors, $x_{p} - x_{q} = [r_{1}, r_{2}, \dots, r_{d}]^{T}$ , and $P = \begin{smallmatrix} (\begin{smallmatrix} l_{1}^{2} \\ ⋱ \\ l_{d}^{2} \end{smallmatrix}) \end{smallmatrix}$ , the covariance function is named $c o v S E a r d$ , that is, squared exponential covariance function with automatic relevance determination (ARD) distance measure.

After covariance function is chosen, we call its parameters hyperparameters. Take the covariance function $c o v S E a r d + c o v N o i s e$ , for example, hyperparameters set $θ = {l, σ_{f}, σ_{n}}$ , $l = {l_{1}, l_{2}, \dots, l_{d}}$ . Our maximum a posteriori estimate of θ occurs when $p (θ | X, y)$ is at its greatest value. Consider

\begin{matrix} p (θ | y, X) = \frac{p (y | X, θ) p (θ)}{p (y | X)} . \end{matrix}

(12)

Assuming we have little prior knowledge about what θ should be, Bayes's theorem tells us this corresponds to maximizing $p (y | X, θ)$ given by

\begin{matrix} arg max_{θ} p (θ | X, y) = arg max_{θ} p (y | X, θ) . \end{matrix}

(13)

For $y ~ N (0, K)$ , we obtain

\begin{matrix} p (y | X, θ) = \frac{1}{{(2 π)}^{n / 2} {|K|}^{1 / 2}} exp (- \frac{1}{2} y^{T} K^{- 1} y) . \end{matrix}

(14)

The log-style $log p (y | X, θ) = L (θ)$ of the equation is

\begin{matrix} L (θ) = - \frac{1}{2} y^{T} K^{- 1} y - \frac{1}{2} log |K| - \frac{n}{2} log 2 π . \end{matrix}

(15)

Simply run the favorite multivariate optimization algorithm (e.g., conjugate gradients, Nelder-Mead simplex) on this equation and a pretty good choice for θ will be found.

4.2. Optimization Strategies of Filter Updating

If the new reading of one sensor node is without its filters, then an update is sent to the base station. Now, the update is jump into other nodes’ filter; the top-k results become undecided at the base station. Therefore, the base station should probe some related sensor node(s) to reevaluate the top-k results. These nodes which should send their new readings to the base station need to wrap up the p readings of current and past epochs and then utilize Gaussian process regression for prediction to ascertain the means and variances of future readings. For these sensor nodes whose readings are not updated, the means and variances of their future readings still maintain the recent ones.

Assume the new reading of a node $n_{i}$ at epoch $t_{n + 1}$ is $v_{i}$ , the predicted new reading is $v_{i}^{pre}$ , and its variance is $σ_{i}^{2}$ ; then $v_{i} ~ N (v_{i}^{pre}, σ_{i}^{2})$ . Assume the probability function of the approximate distribution of the new reading is $f_{i} (x)$ . Let $F_{i}^{old} = [l_{i}^{old}, u_{i}^{old}]$ and $F_{i}^{new} = [l_{i}^{new}, u_{i}^{new}]$ denote the node $n_{i}$ 's old filter and newly computed filter, respectively. Then the probability of the real reading of $n_{i}$ falling in $F_{i}^{old}$ is

\begin{matrix} P_{i}^{old} = \int_{l_{i}^{old}}^{u_{i}^{old}} f_{i} (x) d x . \end{matrix}

(16)

And the probability of the real reading of $n_{i}$ falling in $F_{i}^{new}$ is

\begin{matrix} P_{i}^{new} = \int_{l_{i}^{new}}^{u_{i}^{new}} f_{i} (x) d x . \end{matrix}

(17)

Let $r_{i}$ be the average number of recent epochs that $n_{i}$ 's readings do not violate $n_{i}$ 's filter. Assume that we have run for 100 epochs. For the convenience of description, we only consider the sensor node $n_{i}$ from epochs 80th to 100th. Assume that $n_{i}$ 's reading violates its filters at epochs 84th, 91th, 94th, and 100th. Then, the numbers of epochs that $n_{i}$ 's reading does not violate its filters are $91 - 84 - 1 = 6$ epochs, $94 - 91 - 1 = 2$ epochs, and $100 - 94 - 1 = 5$ epochs. So, the average number of epochs of nonviolation is $r_{i} = (6 + 2 + 5) / 3$ .

If the base station needs updating the filters, the cost of using the new filters at the future s epochs is referred to as cost_new. Otherwise, the cost of using the old filters is denoted as cost_old. If ${cost}_{new} < {cost}_{old}$ , then the base station sends updated filters; otherwise, all the sensor nodes maintain the old filters. cost_new and cost_old are calculated as follows:

\begin{matrix} {cost}_{new} = |N_{u}| + |N| \times s - \sum_{n_{i} \in I} P_{i}^{new} \times r_{i} \\ {cost}_{old} = |N| \times s - \sum_{n_{i} \in I} P_{i}^{old} \times r_{i} . \end{matrix}

(18)

In formula (18), $|N_{u}|$ denotes the number of sensor nodes which need to update their filters; $|N|$ denotes the number of nodes in the wireless sensor networks. The result of cost_new and cost_old for comparison has nothing to do with $|N| \times s$ . Hence, we do not need to calculate $|N| \times s$ and consider the specific value of s.

5. Experiments

5.1. Experiments Settings

We use two real datasets: Intel Lab data and LEM data to conduct our experiments.

Intel Lab Data. The set contains information about data collected from 54 sensors deployed in the Intel Berkeley Research Lab [30]. We partition the sensor network into some nonoverlapping regions, making the sensors with positions close to each other in the same area. In this paper, we do not research in depth in how to partition regions more reasonably and only partition the sensor network based on location information by using the k-means algorithm. Taking an example of the Intel Berkeley Research Lab wireless sensor network, we partition it into 10 regions. Figure 8 shows the result of region partitioning, where the dark nodes denote the initial center points. We adopttemperature,humidity, andvoltage data on March 1, 2004, as experimental data and the data are collected every 31 seconds. We evaluate the performance of the proposed algorithm FUGPR by MATLAB simulations.

Figure 8

Sensor node topology of Intel Berkeley Research Lab.

LEM Data. The dataset is collected from the Live from Earth and Mars project at the University of Washington [31]. We adopttemperature,dew point, andsea level pressure collected from December 1, 2012, to December 1, 2013, as experimental data. There are 509,174 records in total. The dataset has missing values in some epochs and we have filled the missing values with the average of the readings at the prior and subsequent epochs. Each attribute of the dataset has 509,174 readings. We extract many subsets, and each subset contained 3,000 readings. The subsets are used to simulate the physical phenomena in the immediate surrounding of different sensor nodes.

We simulated a single hop network and two multihop networks. The single hop network has 12 sensor nodes as shown in Figure 9(a). The two multihop networks contain $8 \times 8 = 64$ and $12 \times 12 = 144$ sensor nodes, respectively. The network consisting of 84 sensor nodes is shown as Figure 9(b), and we number the nodes. The network layout of $12 \times 12$ is similar to $8 \times 8$ , is omitted for simplicity. To simulate the spatial correlation of sensor readings, the subset starting at successive time is assigned to neighboring nodes in the simulated networks.

Figure 9

Network layout.

5.2. Experimental Results

We compare our FUGPR with several typical methods, TAG, range caching, and FILA. TAG involves several basic aggregation methods which can be seen as the standards of energy consumption. In range caching method, if two filters are overlapped, the base station may request sensor values to decide final top-k results. In our experiments, the filter window value α for range caching method varies from 0.3 to 3.2 and energy consumptions under different α values are calculated. The α value with the minimum energy consumption is chosen as the final filter window value in our experiments. For FILA method, we adopt lazy filter update policy for our experiments. For our FUGPR method, we use the GPML MATLAB toolbox (http://www.gaussianprocess.org/gpml/) for prediction by our proposed Gaussian process regression models.

Figure 10 shows the comparison of energy consumption for TAG, range caching, FILA, and FUGPR after 1000 runs of top-k queries under Intel Berkeley Lab data:temperature,humidity, andpressure. We can see that FILA and FUGPR adopt filter mechanism which avoids redundant data transmitted to the base station, thus saving much energy. FUGPR optimizes the filter updating procedure which reduces the energy consumption for frequent filter updating. In Figure 10, with the increase of the k values, FUGPR is much better than FILA for energy consumption. The reason is that there will be many violated sensor readings needed to be sent to base station which requires filter reevaluation in FILA method. In contrast, our FUGPR does not need to do much filter updating operations under prediction benefits by adopting Gaussian process regression models. FUGPR utilizes the predicted sensor values to evaluate the costs of using updated filters and old ones in next s epochs, namely, cost_new and cost_old. The small one will be adopted and no necessary filter updates can be avoided in FUGPR.

Figure 10

Comparison of energy consumption of Intel Berkeley Research Lab sensor network.

Figure 11 shows the comparison of energy consumption for TAG, range caching, FILA, and FUGPR after 1000 runs of top-k queries for single hop sensor network. FILA and FUGPR can filter many sensor readings, especially when k is small. When k increases (>7), we can see that FILA consumes much energy. For thedew point data, when k is 9, the energy consumption of FILA is even larger than TAG method. The reason is that the fluctuation of data causes the energy consumption of requesting sensor node readings by the base station and updating the filters exceeds the energy consumption saved by filter benefits. In contrast, our FUGPR method is prior for all the circumstances.

Figure 11

Comparison of energy consumption of a single hop network.

Figures 12 and 13 show the comparison of energy consumption for TAG, range caching, FILA, and FUGPR after 1000 runs of top-k queries for two multihop sensor networks, $8 \times 8$ and $12 \times 12$ , respectively. The same situation occurs for FILA method and our proposed FUGPR method is prior to the other three methods.

Figure 12

Comparison of energy consumption of $8 \times 8$ multihop network.

Figure 13

Comparison of energy consumption of $12 \times 12$ multihop network.

6. Conclusion

This paper combines Gaussian process regression models into FILA method and proposes FUGPR method for top-k query processing in wireless sensor networks. The prediction benefits of the Gaussian process regression models reduce the cost for filter updating caused by fluctuation of sensor values. Experimental results show that under two distinct real datasets, for example, Intel Berkeley Research Lab data, LEM data, and simulated single hop and multihop sensor network structures, our proposed FUGPR is prior to TAG, range caching, and FILA methods and thus reduces the energy consumption and prolongs the lifetime of the sensor networks.

Footnotes

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This work is partially supported by the Fundamental Research Funds for the Central Universities of China under Grant no. NS2015095.

References

Dylla

Miliaraki

Theobald

Top-k query processing in probabilistic databases with non-materialized views

Proceedings of the 29th International Conference on Data Engineering (ICDE '13)

April 2013

Washington, DC, USA

IEEE Computer Society

122 133

10.1109/icde.2013.6544819

2-s2.0-84881334099

Tang

Lee

W.-C.

Monitoring top-k query in wireless sensor networks

Proceedings of the 22nd International Conference on Data Engineering (ICDE '06)

April 2006

143

10.1109/icde.2006.103

2-s2.0-33749638737

Tang

Lee

W.-C.

Top-k monitoring in wireless sensor networks

IEEE Transactions on Knowledge and Data Engineering 2007 19 7 962 976

10.1109/TKDE.2007.1038

2-s2.0-34249805447

Thanh

M. H.

Lee

K. Y.

Lee

Y. W.

Kim

M. H.

Processing top-k monitoring queries in wireless sensor networks

Proceedings of the 3rd International Conference on Sensor Technologies and Applications (SENSORCOMM '09)

June 2009

545 552

10.1109/sensorcomm.2009.91

2-s2.0-70449500750

Mai

H. T.

Lee

Y. W.

Lee

K. Y.

Kim

M. H.

Distributed adaptive top-k monitoring in wireless sensor networks

Journal of Systems and Software 2011 84 2 314 327

10.1016/j.jss.2010.10.018

2-s2.0-78650603218

Fagin

Lotem

Naor

Optimal aggregation algorithms for middleware

Proceedings of the 20th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS ’01)

May 2001

New York, NY, USA

ACM

102 113

2-s2.0-0034819889

Cao

Wang

Efficient Top-K query calculation in distributed networks

Proceedings of the 23rd Annual ACM Symposium on Principles of Distributed Computing (PODC '04)

July 2004

New York, NY, USA

ACM

206 215

2-s2.0-10444257928

Theobald

Weikum

Schenkel

Top-k query evaluation with probabilistic guarantees

Proceedings of the 30th International Conference on Very Large Data Bases (VLDB '04)

2004

648 659 VLDB Endowment

Michel

Triantafillou

Weikum

KLEE: a framework for distributed top-k query algorithms

Proceedings of the 31st International Conference on Very Large Data Bases (VLDB '05)

September 2005

637 648

2-s2.0-33744898753

10.

Zeinalipour-Yazti

Lin

Gunopulos

Distributed spatio-temporal similarity search

Proceedings of the 15th ACM Conference on Information and Knowledge Management (CIKM '06)

November 2006

ACM

14 23

10.1145/1183614.1183621

2-s2.0-34547620671

11.

Silberstein

Braynard

Ellis

Munagala

Yang

A sampling-based approach to optimizing top-k queries in sensor networks

Proceedings of the 22nd International Conference on Data Engineering (ICDE '06)

April 2006

10.1109/icde.2006.10

2-s2.0-33749605102

12.

Madden

Franklin

M. J.

Hellerstein

J. M.

Hong

Tag: a tiny aggregation service for ad-hoc sensor networks

Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI ’02)

December 2002

Boston, Mass, USA

131 146

10.1145/1060289.1060303

13.

Yao

Gehrke

The cougar approach to in-network query processing in sensor networks

SIGMOD Record 2002 31 3 9 18

10.1145/601858.601861

2-s2.0-1442296009

14.

Andreou

Zeinalipour-Yazti

Chrysanthis

P. K.

Samaras

Power efficiency through tuple ranking in wireless sensor network monitoring

Distributed and Parallel Databases 2011 29 1-2 113 150

10.1007/s10619-010-7072-5

2-s2.0-78650889013

15.

Chen

Liang

Zhou

J. X.

Energy-efficient top-k query processing in wireless sensor networks

Proceedings of the 19th International Conference on Information and Knowledge Management and Co-located Workshops (CIKM '10)

October 2010

New York, NY, USA

ACM

329 338

10.1145/1871437.1871482

2-s2.0-78651301821

16.

Chen

Liang

J. X.

Energy-efficient top-k query evaluation and maintenance in wireless sensor networks

Wireless Networks 2014 20 4 591 610

10.1007/s11276-013-0625-6

2-s2.0-84899625448

17.

Liu

Lee

W.-C.

A cross pruning framework for top-k data collection in wireless sensor networks

Proceedings of the 11th IEEE International Conference on Mobile Data Management (MDM '10)

May 2010

157 166

10.1109/mdm.2010.41

2-s2.0-77955219522

18.

Abbasi

Khonsari

Farri

MOTE: efficient monitoring of top-k set in sensor networks

Proceedings of the 13th IEEE Symposium on Computers and Communications (ISCC '08)

July 2008

Marrakech, Morocco

IEEE

957 962

10.1109/iscc.2008.4625647

2-s2.0-55849141952

19.

Olston

Loo

B. T.

Widom

Adaptive precision setting for cached approximate values

Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '01)

May 2001

New York, NY, USA

ACM

355 366

2-s2.0-0034823616

20.

Olston

Jiang

Widom

Adaptive filters for continuous queries over distributed data streams

Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '03)

June 2003

New York, NY, USA

563 574

2-s2.0-1142291592

21.

Zheng

Zhang

Song

Wang

Prediction-based filter updating policies for top-k monitoring queries in wireless sensor networks

International Journal of Distributed Sensor Networks 2014 2014 11

696978

10.1155/2014/696978

2-s2.0-84899895776

22.

Rasmussen

C. E.

Williams

C. K.

Gaussian Processes for Machine Learning 2006

The MIT Press

MR2514435

23.

Williams

C. K. I.

Rasmussen

C. E.

Gaussian processes for regression

Advances in Neural Information Processing Systems 1996 8

MIT Press

514 520

24.

Williams

Jordan

Prediction with gaussian processes: from linear regression to linear prediction and beyond

Learning in Graphical Models 1998 89

Dordrecht, The Netherlands

Springer

599 621 NATO ASI Series

10.1007/978-94-011-5014-9_23

25.

Brahim-Belhouari

Bermak

Gaussian process for nonstationary time series prediction

Computational Statistics & Data Analysis 2004 47 4 705 712

10.1016/j.csda.2004.02.006

2-s2.0-8344270238

26.

Giatrakos

Deligiannakis

Garofalakis

Sharfman

Schuster

Prediction-based geometric monitoring over distributed data streams

Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '12)

May 2012

New York, NY, USA

ACM

265 276

10.1145/2213836.2213867

2-s2.0-84862665086

27.

Osborne

M. A.

Roberts

S. J.

Rogers

Jennings

N. R.

Real-time information processing of environmental sensor network data using Bayesian Gaussian processes

ACM Transactions on Sensor Networks 2012 9 1, article 1

10.1145/2379799.2379800

2-s2.0-84870667283

28.

Cody-Kenny

Guerin

Ennis

Simon Carbajo

Huggard

Mc Goldrick

Performance evaluation of the 6LoWPAN protocol on MICAz and TelosB motes

Proceedings of the 4th ACM International Workshop on Performance Monitoring, Measurement, and Evaluation of Heterogeneous Wireless and Wired Networks (PM2HW2N '09)

October 2009

ACM

25 30

10.1145/1641913.1641917

2-s2.0-74049102989

29.

Silberstein

Braynard

Yang

Constraint chaining: on energy-efficient continuous monitoring in sensor networks

Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '06)

June 2006

New York, NY, USA

ACM

157 168

10.1145/1142473.1142492

2-s2.0-34250676986

30.

Intel berkeley research lab, http://www.select.cs.cmu.edu/data/labapp3/index.html

31.

Live from earth and mars data from the university of Washington, http://www-k12.atmos.washington.edu/k12/

Adaptive Filter Updating for Energy-Efficient Top- k Queries in Wireless Sensor Networks Using Gaussian Process Regression

Abstract

1. Introduction

2. Related Work

3. Filter Based Top-k Query Methods

3.1. Problem Definition

3.2. Filter Based Top-k Query Processing

3.2.1. An Overview of Filter Based Methods

3.2.2. Defects of FILA

4. Filter Updating Algorithms

4.1. Gaussian Process Regression for Prediction

4.1.1. Gaussian Process Regression

4.1.2. Training Gaussian Process Regression Models

4.2. Optimization Strategies of Filter Updating

5. Experiments

5.1. Experiments Settings

5.2. Experimental Results

6. Conclusion

Footnotes

Conflict of Interests

Acknowledgment

References