Distributed high-dimensional similarity search approach for large-scale wireless sensor networks

Abstract

Similarity search in high-dimensional space has become increasingly important in many wireless sensor network applications. However, existing approaches to similarity search are based on the premise that sensed data are processed centrally, or that sensed data are simple enough to be stored in a relational database. In contrast to previous work, we propose a distributed approximate similarity search algorithm to retrieve high-dimensional sensed data similar to a query in wireless sensor networks. First, the sensors are divided into several clusters using a distributed clustering method. Second, the sink transmits the compressed hash code set to the cluster heads. Finally, the estimated similarity score is compared with a specified threshold to filter out irrelevant sensed data. As a result, higher search precision and energy efficiency can be achieved. Extensive simulation results show that the proposed algorithm provides significant performance gains in terms of precision and energy efficiency compared with existing algorithms.

Keywords

Wireless sensor networks, distributed approximate similarity search, locality sensitive hashing

Introduction

With the rapid development of the Internet, wireless networks, and sensor technologies, there is growing interest in leveraging the massive amounts of data available in distributed data sources such as wireless sensor networks (WSNs). Processing large volumes of data is a key challenge for further WSN applications.¹ Furthermore, the growth of smart devices and WSN applications makes sensed data more diverse, and the number of features extracted to represent the sensed data is also very high.² Hence, similarity search in high-dimensional space has become a hot research topic in WSNs. For example, sensors equipped with visual information collection modules are capable of sensing and storing images. Users can then query the WSN to see which sensors store images that are similar to “target images,” so that the location of an event can be detected.

Over the past decade, many techniques have been proposed for similarity search (also known as nearest neighbor search). For instance, tree-based methods^3,4 have been developed to address the fast approximate nearest neighbor search problem. However, almost all tree-based methods suffer from the curse of dimensionality, with their performance typically degrading to that of an exhaustive linear scan; that is, when the dimensionality exceeds about 10, existing indexing data structures based on space partitioning are slower than the linear-scan approach.

Current studies on high-dimensional similarity search focus on the locality sensitive hashing (LSH) method.^5,6 The principle of LSH is to hash similar data items into the same hash code with a high probability by random projections, and multiple hash tables are constructed independently to enlarge the probability. Due to the randomized hashing, the LSH methods suffer from long hash codes and a large number of hash tables. Some improved methods are proposed to overcome the drawback of basic LSH methods, and these methods can be divided into two categories, the data-dependent hashing method⁷ and the data-independent LSH method.⁸ Although these hashing methods have shown success in high-dimensional similarity search, these approaches are based on centralized architecture. How to optimize the similarity search in a distributed environment is rarely taken into account. In Haghani et al.,⁹ the authors tried to distribute the LSH-based methods in peer-to-peer (P2P) overlay networks.

However, similarity search in WSNs has some unique characteristics: (1) the sensed data of the nodes are prone to value fluctuations because of hardware-cost constraints and external interference; (2) centralized hash tables must be avoided to reduce the communication overhead; and (3) each node is aware of its location and sensed data, but has no global knowledge of the WSN. Considering the distributed nature of WSNs, how to acquire feasible solutions to high-dimensional similarity search in large-scale WSNs is still an open problem.

To this end, we propose a distributed similarity search approach for retrieving the similar high-dimensional sensed data in WSNs. The contributions of this work are summarized as follows:

We propose a distributed LSH-based model via computing the similarity score in the cluster head of WSN, instead of probing hash codes in multiple centralized hash tables. So, energy efficiency and high search accuracy can be achieved.

We propose a distributed approximate similarity search (DASS) algorithm to retrieve the similar high-dimensional sensed data for query in WSNs. More explicitly, the sensors are divided into several clusters using the distributed clustering method, the sink transmits the compressed hash code set to the cluster head, and the similarity score is estimated in a distributed manner.

We propose an effective method to calculate a similarity score that reflects the degree of similarity between the sensed data and the query. Furthermore, the similarity score is compared with a threshold to filter out irrelevant sensed data, resulting in higher similarity search precision and lower energy consumption.

Note that an important aspect of our study design is how to use large image datasets to simulate the large-scale WSN scenario. The idea of our study design is described in detail in the “Simulation” section.

The rest of the article is structured as follows. Section “Related work” reviews the related work. Section “Introduction of basic LSH method and multi-probe LSH method” gives a brief introduction of LSH-based similarity search methods. The details of our algorithms are presented in section “DASS approach.” Performance evaluation results are given in section “Simulation.” Section “Conclusion” concludes the article.

Related work

Similarity search in WSN

Early works of similarity search in WSN focus on database-oriented approach,¹⁰ and a WSN database should provide functions for querying sensed data. Furthermore, in Cheng et al.,¹¹ a distributed database management on WSNs is proposed in an energy-efficient manner. Two Hilbert curve–based approaches^12,13 are proposed to tackle the problem of similarity search due to the locality-preserving property of the Hilbert curve. Recently, some studies have turned to the distributed top-k query and range query processing.^14,15 In Ye et al.,¹⁶ the notion of sufficient set and necessary set is introduced for processing probabilistic top-k in a sensor network with tree topology. At the same time, considering the characteristics of WSN, a distributed spatial–temporal similarity data storage scheme¹⁷ is provided, and a framework¹⁸ is proposed to accelerate query evaluation in content-caching networks using XML metadata.

The methods mentioned above typically rely on information about the network topology. Furthermore, the sensed data are typically organized as in a traditional database. Hence, with the growth of smart devices and applications, these methods are not suitable for high-dimensional similarity search in WSNs.

Hashing-based methods

Recently, some methods have been proposed to improve LSH in many aspects. These methods can be divided into two categories, the data-dependent hashing methods and the data-independent hashing methods.

A representative data-dependent hashing method is spectral hashing (SH),⁷ which can produce very compact hash codes by applying nonlinear functions along the principal directions of the data. However, its assumption of a uniform data distribution rarely holds in practical situations. A variety of other data-dependent methods have been proposed, including principal component analysis (PCA)-based hashing,^19,20 graph-based hashing,²¹ semi-supervised hashing,¹⁹ and locally linear hashing.²² These methods capture the data distribution and the underlying manifold structure of the data to improve the similarity search. Although the above supervised or semi-supervised methods can project high-dimensional data items into more compact codes, they still spend considerable time and energy on the training process, which is difficult to implement in a large-scale WSN.

Among the data-independent hashing methods, random projection has been widely used. Although these methods have strict performance guarantees, they are less efficient because the hashing functions are not specifically designed for a particular data set. Several efforts have been made to improve the performance of LSH based on randomized projection. Multi-probe LSH methods⁸ extend the candidate set using similar hash codes to reduce the number of hash tables. Furthermore, the query-adaptive method²³ generates an optimal probe sequence of hash codes, and posteriori multi-probe LSH²⁴ puts forward a more reliable posteriori model based on some prior knowledge, which helps to accurately select the hash codes to be probed. In Gu et al.,²⁵ a novel probability model and a query-adaptive algorithm are proposed to generate the optimal multi-probe sequence for range queries. At the same time, Bayes’ LSH²⁶ is presented to perform candidate pruning and similarity estimation using a principled Bayesian approach.

The above approaches to the similarity search problem focus on centralized settings. Considering the distributed nature of WSNs, it is necessary to find an efficient solution to perform high-dimensional similarity search in a distributed manner.

Introduction of basic LSH method and multi-probe LSH method

LSH is an effective approximate similarity search method in centralized systems. The main idea of basic LSH is to map similar objects to the same hash codes with high probability using LSH functions. A family of LSH functions $H = \{h : S \to U\}$ is called $(r, cr, p_{1}, p_{2})$ -sensitive for a distance function $D$ if it satisfies the following properties for any $p, q \in S$

If D (p, q) \leq r then \Pr_{H} [h (p) = h (q)] \geq p_{1}

(1)

If D (p, q) > cr then \Pr_{H} [h (p) = h (q)] \leq p_{2}

(2)

where $S$ specifies the domain of data items, $c > 1$ , and $p_{1} > p_{2}$ . Intuitively, the LSH functions have the property that nearby data items have a higher probability of colliding than ones that are far away. To meet the above requirement, we consider the family of LSH functions based on p-stable distributions when the distance measure is $l_{p}$ norm, and for any data item $v \in R^{d}$ , the p-stable LSH⁵ functions are defined as

h_{a, b}^{l} (v) = \left\lfloor \frac{a^{l} \cdot v + b^{l}}{W} \right\rfloor

(3)

where $v \in R^{d}$ is projected to an integer-valued hash value $h_{a, b}^{l} (v)$ ; $a^{l} \in R^{d}$ is a random vector whose entries are chosen independently from a p-stable distribution, where $p \in (0, 2]$ ; and $b^{l}$ is a real number chosen uniformly from the range $[0, W]$ . Then, LSH constructs $L$ hash tables using hashing projections $\{g_{l} (\cdot)\}_{l = 1}^{L}$ , which are defined as

g_{l} (v) = (h_{a_{1}, b_{1}}^{l} (v), \dots, h_{a_{k}, b_{k}}^{l} (v))

(4)

where $v \in R^{d}$ is projected to the k-dimensional hash code set ${g_{l} (v)}_{l = 1}^{L}$ of size $L$ . To perform search process, a given query item $q \in R^{d}$ is projected to k-dimensional hash code set ${g_{l} (q)}_{l = 1}^{L}$ , and the data items having the same k-dimensional hash code are retrieved from each hash table to form the candidate set. Finally, the candidate set is filtered by computing their distance to the query item (Figure 1).

Figure 1.

Illustrative examples on the relationship of a hash projection and hash functions.
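The construction in equations (3) and (4) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the class name `PStableLSH` and its methods are hypothetical, and Gaussian entries are assumed for $a^{l}$ , which corresponds to the 2-stable case (the $l_{2}$ norm).

```python
import numpy as np
from collections import defaultdict


class PStableLSH:
    """L hash tables, each keyed by a k-dimensional code g_l(v) (equations (3)-(4))."""

    def __init__(self, d, k, L, W, seed=0):
        rng = np.random.default_rng(seed)
        # a^l: Gaussian entries (assumed 2-stable case); shape (L, k, d)
        self.a = rng.standard_normal((L, k, d))
        # b^l: offsets drawn uniformly from [0, W); shape (L, k)
        self.b = rng.uniform(0.0, W, size=(L, k))
        self.W = W
        self.tables = [defaultdict(list) for _ in range(L)]

    def codes(self, v):
        """Return an (L, k) integer array whose row l is g_l(v)."""
        return np.floor((self.a @ v + self.b) / self.W).astype(int)

    def insert(self, item_id, v):
        """Store item_id in the bucket g_l(v) of every table."""
        for table, code in zip(self.tables, self.codes(v)):
            table[tuple(code)].append(item_id)

    def candidates(self, q):
        """Union of the buckets that share a full k-dimensional code with q."""
        found = set()
        for table, code in zip(self.tables, self.codes(q)):
            found.update(table.get(tuple(code), []))
        return found
```

The candidate set returned by `candidates` would still need the final filtering step described above, in which the true distance to the query is computed.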

In practice, the basic LSH method suffers from long hash codes, which require a large $L$ to maintain a reasonable collision probability. The storage burden of holding $L$ hash tables then impairs system performance.

A multi-probe LSH method⁸ is proposed to extend the candidate set with data items whose hash codes are similar to that of the query item $q$ in each hash table. The idea is to increase the probability of obtaining relevant data items in each hash table and consequently reduce the number of hash tables $L$ .

Given a query item $q \in R^{d}$ , the multi-probe LSH methods retrieve the data items having similar hash codes $g_{l} (q) + Δ_{l}$ in each hash table, where the hash perturbation vector $Δ_{l} = (δ_{1}^{l}, \dots, δ_{k}^{l})$ can be defined as $δ_{i}^{l} \in {- 1, 0, + 1}$ (Figure 2).

Figure 2.

Illustrative examples on the idea of the multi-probe LSH methods.

Moreover, two methods of the multi-probe LSH were proposed, which can be summarized as follows:

Step-wise probing (SWP).⁸ This method is based on the fact that the data items whose hash codes have fewer hash values different from $g_{l} (q)$ are more likely to be close to the query item $q$ . Hence, the SWP method focuses on the hash perturbation vectors $Δ_{l} = (δ_{1}^{l}, \dots, δ_{k}^{l})$ in which the number of non-zero elements is smaller than $k$ .

Query-directed probing (QDP).²⁶ The SWP method does not consider the effect of the position of each hash value in $g_{l} (q)$ on the similarity search. The QDP method forms the candidate set according to the posterior probabilities of all possible similar hash codes.

Although the multi-probe LSH methods can obtain relatively higher recall rate with fewer hash tables, these approaches are still based on centralized architecture and cannot be used directly in WSN.
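To make the perturbation vectors $Δ_{l}$ concrete, the following Python sketch enumerates them in a step-wise order, that is, by increasing number of non-zero entries. The function names and the `max_nonzero` cut-off are illustrative assumptions; the exact probing order of the cited method may differ.

```python
from itertools import combinations, product


def stepwise_perturbations(k, max_nonzero):
    """Yield vectors in {-1, 0, +1}^k ordered by their number of non-zero entries."""
    for n in range(1, max_nonzero + 1):
        for positions in combinations(range(k), n):
            for signs in product((-1, 1), repeat=n):
                delta = [0] * k
                for pos, sign in zip(positions, signs):
                    delta[pos] = sign
                yield tuple(delta)


def probe_codes(code, max_nonzero):
    """Return g_l(q) followed by the perturbed codes g_l(q) + delta."""
    probes = [tuple(code)]
    for delta in stepwise_perturbations(len(code), max_nonzero):
        probes.append(tuple(c + d for c, d in zip(code, delta)))
    return probes
```

For a code of length $k = 3$ and at most one non-zero entry, `probe_codes` returns the original code plus six neighbors, one for each position and sign.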

DASS approach

As described above, although the LSH-based methods can perform high-dimensional similarity search effectively, they must maintain multiple centralized hash tables, which may incur high communication overhead. At the same time, the data-dependent hash methods (such as graph-based hashing²¹ and semi-supervised hashing¹⁹) spend considerable time and energy on the training process. Therefore, we propose a DASS approach for retrieving similar high-dimensional sensed data in WSNs.

Similar to the definition of the p-stable LSH function in equation (3), we define the unquantified hash function as

f_{a, b}^{l} (v) = a^{l} \cdot v + b^{l}

(5)

where data item $v \in R^{d}$ is hashed to a real-valued hash value $f_{a, b}^{l} (v)$ . Note that $h_{a, b}^{l} (v)$ given by equation (3) is the discretized version of $f_{a, b}^{l} (v)$ , and the unquantified hash projection can be defined as

F_{l} (v) = (f_{a_{1}, b_{1}}^{l} (v), \dots, f_{a_{k}, b_{k}}^{l} (v))

(6)

where $v \in R^{d}$ is projected to the k-dimensional unquantified hash code set $\{F_{l} (v)\}_{l = 1}^{L}$ of size $L$ .

Approximate similarity computation method

Degree of similarity

For a given query item $q \in R^{d}$ , the relationship of the real-valued hash value $f_{a, b}^{l} (q)$ and the integer-valued hash value $h_{a, b}^{l} (q)$ is illustrated in Figure 3. The horizontal axis is divided into intervals of length $W$ as defined in equation (3), and $f_{a, b}^{l} (q)$ falls into the middle interval corresponding to $h_{a, b}^{l} (q)$ . Similarly, the left interval and the right interval correspond to $h_{a, b}^{l} (q) - 1$ and $h_{a, b}^{l} (q) + 1$ , respectively. The distance of $f_{a, b}^{l} (q)$ from the boundary of the interval can be defined as

τ_{a, b}^{l} (δ) = \begin{cases} f_{a, b}^{l} (q) - h_{a, b}^{l} (q) \times W, & \text{when } δ = - 1 \\ 0, & \text{when } δ = 0 \\ W - τ_{a, b}^{l} (- 1), & \text{when } δ = 1 \end{cases}

(7)

Figure 3.

Probability of similar data item falling into the neighbor intervals.
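Equation (7) amounts to measuring how far the real-valued hash value of the query lies from the left and right edges of its own interval. A small Python sketch, with the hypothetical helper name `boundary_distances`, makes this explicit.

```python
import math


def boundary_distances(f_q, W):
    """Return (h, tau): h = floor(f_q / W), and tau maps delta to the distance in equation (7)."""
    h = math.floor(f_q / W)
    left = f_q - h * W  # distance to the boundary shared with interval h - 1
    return h, {-1: left, 0: 0.0, 1: W - left}
```

For example, with $W = 4$ and a real-valued hash value of 9.5, the query falls into interval 2, at distance 1.5 from the left boundary and 2.5 from the right boundary.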

Note that the interval corresponding to $h_{a, b}^{l} (q) + δ (δ \in {- 1, 0, + 1})$ can be viewed as the neighbor intervals of $h_{a, b}^{l} (q)$ . If $f_{a, b}^{l} (v_{i})$ falls into the neighbor intervals of $h_{a, b}^{l} (q)$ , $v_{i}$ has a high probability of being close to $q$ .

Hence, $\Pr [h_{a, b}^{l} (v_{i}) = h_{a, b}^{l} (q) + δ] (δ \in {- 1, 0, + 1})$ can be used to evaluate the degree of similarity between $v_{i}$ and $q$ . As described in multi-probe LSH,⁸ for any data item $v_{i}$ , $f_{a, b}^{l} (v_{i}) - f_{a, b}^{l} (q)$ is a Gaussian random variable with zero mean and variance $σ^{2} = c ‖ v_{i} - q ‖_{2}^{2}$ , whose probability density function is shown by the curve in Figure 3. Assuming $W$ is large enough, the degree of similarity between $v_{i}$ and $q$ (i.e. the probability of $f_{a, b}^{l} (v_{i})$ falling into the neighbor intervals of $h_{a, b}^{l} (q)$ ) can be estimated by

\Pr [h_{a, b}^{l} (v_{i}) = h_{a, b}^{l} (q) + δ] \approx \exp (- λ (τ_{a, b}^{l} (δ))^{2})

(8)

where $δ \in {- 1, 0, + 1}$ , $λ$ is a constant depending on $‖ v_{i} - q ‖_{2}$ , and $τ_{a, b}^{l} (δ)$ is used to estimate $f_{a, b}^{l} (v_{i}) - f_{a, b}^{l} (q)$ .
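The Gaussian property used above can be verified directly for the $l_{2}$ case. The derivation below is a sketch under an assumption not stated in the text, namely that the entries $a_{j}^{l}$ of $a^{l}$ are independent $\mathcal{N} (0, s^{2})$ variables; the symbols $s$ and $j$ are introduced here for illustration only.

```latex
f_{a, b}^{l} (v_{i}) - f_{a, b}^{l} (q)
  = (a^{l} \cdot v_{i} + b^{l}) - (a^{l} \cdot q + b^{l})
  = a^{l} \cdot (v_{i} - q)
  = \sum_{j = 1}^{d} a_{j}^{l} \, (v_{i, j} - q_{j})
  \sim \mathcal{N} \left( 0, \; s^{2} \, \| v_{i} - q \|_{2}^{2} \right)
```

Under this assumption the offset $b^{l}$ cancels and the constant in the variance is $c = s^{2}$ (so $c = 1$ for standard normal entries). One natural reading of equation (8), which the text does not state explicitly, is that $λ$ plays the role of $1 / (2 σ^{2})$ in the Gaussian density.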

As described above, the degree of similarity measured by a pair of hash values is not accurate. In order to overcome the problem, we introduce the concept of the similarity score.

Similarity score

As described above, given a k-dimensional hash code $g_{l} (q) = (h_{a_{1}, b_{1}}^{l} (q), \dots, h_{a_{k}, b_{k}}^{l} (q))$ , the probability of projecting the sensed data $v_{i}$ to the hash code $g_{l} (q) + Δ_{l}$ can be obtained as

\Pr [g_{l} (v_{i}) = g_{l} (q) + Δ_{l}] = \prod_{i = 1}^{k} \Pr [h_{a_{i}, b_{i}}^{l} (v_{i}) = h_{a_{i}, b_{i}}^{l} (q) + δ_{i}^{l}]

where $Δ_{l} = (δ_{1}^{l}, \dots, δ_{k}^{l})$ and $δ_{i}^{l} \in {- 1, 0, + 1}$ . According to equation (8)

\begin{aligned} \Pr [g_{l} (v_{i}) = g_{l} (q) + Δ_{l}] & \approx \prod_{i = 1}^{k} \exp (- λ {(τ_{a_{i}, b_{i}}^{l} (δ_{i}^{l}))}^{2}) \\ & = \exp (- λ \sum_{i = 1}^{k} {(τ_{a_{i}, b_{i}}^{l} (δ_{i}^{l}))}^{2}) \end{aligned}

(9)

From equations (8) and (9), if the sensed data $v_{i}$ has the similar hash code $g_{l} (q) + Δ_{l}$ , $v_{i}$ has a high probability of being close to $q$ . Furthermore, the probability is related to $\sum_{i = 1}^{k} {(τ_{a_{i}, b_{i}}^{l} (δ_{i}^{l}))}^{2}$ . Hence, the similarity score can be expressed by

ϕ_{l} (v_{i}, q) = \begin{cases} \exp (- λ \sum_{i = 1}^{k} {(τ_{a_{i}, b_{i}}^{l} (δ_{i}^{l}))}^{2}), & \text{if } g_{l} (v_{i}) = g_{l} (q) + Δ_{l} \\ 0, & \text{otherwise} \end{cases}

(10)

Clearly, a larger similarity score $ϕ_{l} (v_{i}, q)$ indicates a higher probability of $v_{i}$ being close to $q$ .

From the definition in equation (10), we need to generate the hash code sets ${g_{l} (v_{i})}_{l = 1}^{L}$ and ${g_{l} (q)}_{l = 1}^{L}$ in order to improve the similarity search performance. The total similarity score can be written as

Φ (v_{i}, q) = \sum_{l = 1}^{L} ϕ_{l} (v_{i}, q)

(11)

Therefore, when the cluster head receives the unquantified hash code set ${F_{l} (q)}_{l = 1}^{L}$ , the similarity score $Φ (v_{i}, q)$ can be computed to evaluate the similarity between $v_{i}$ and $q$ .
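Equations (7) to (11) can be combined into a single vectorized routine. The Python sketch below is one possible reading of these equations, not the authors' code: the function name is hypothetical, the inputs are $(L, k)$ arrays of unquantified hash values, and $λ$ is treated as a tunable parameter `lam`, because the true distance on which it depends is not known at the cluster head.

```python
import numpy as np


def similarity_score(F_v, F_q, W, lam):
    """Total score of equation (11) from (L, k) arrays of unquantified hash values."""
    F_v = np.asarray(F_v, dtype=float)
    F_q = np.asarray(F_q, dtype=float)
    h_v = np.floor(F_v / W)
    h_q = np.floor(F_q / W)
    delta = h_v - h_q          # per-position perturbation between the two codes
    left = F_q - h_q * W       # distance for delta = -1 in equation (7)
    tau = np.where(delta == -1, left, np.where(delta == 1, W - left, 0.0))
    # A code contributes only if every position differs by at most one interval (equation (10))
    valid = np.all(np.abs(delta) <= 1, axis=1)
    phi = np.where(valid, np.exp(-lam * np.sum(tau ** 2, axis=1)), 0.0)
    return float(phi.sum())
```

When the codes of $v_{i}$ and $q$ agree in every position of a hash code, that code contributes exactly 1 to the total, so the score is bounded above by $L$ .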

Implementation of the DASS algorithm

Based on the above analysis, the process of the DASS approach can be described as follows. First, the distributed clustering algorithm²⁷ is implemented to divide the sensors into several clusters, and a cluster head is selected according to the residual energy of each node. Second, the sink node is responsible for projecting a given query item $q \in R^{d}$ to the hash code set whose size is less than $d$ . Then, the compressed hash code set is transmitted between the cluster heads over the WSN. On receiving the hash code set, each cluster head computes the similarity scores reflecting the degrees of similarity between $q$ and the sensed data of the nodes. Furthermore, the cluster head sends back all the sensed data whose similarity scores are no smaller than the specified threshold. Finally, the candidate set is constructed from all the sensed data returned by the cluster heads, and the sink node validates the candidate set to obtain the query result. Therefore, the location of the event of interest can be detected.

Now, we present the DASS algorithm for retrieving high-dimensional similar sensed data for query in a distributed manner.

Algorithm: DASS

Input: Given a WSN with sensor nodes $n_{i} (i \leq M \cap i \in N)$ and a sink node $s$ , a query item $q \in R^{d}$ , and the sensed data $v_{i} \in R^{d}$ of the node $n_{i}$

Step 1. The WSN is divided into several clusters $c_{j} (j \leq C \cap j \in N)$ , where $C$ is the number of clusters in the WSN, and $n_{c h_{j}} \in c_{j}$ is selected as the cluster head of the cluster $c_{j}$ using the distributed clustering algorithm.²⁷

Step 2. The sink node $s$ receives a query item $q \in R^{d}$ and obtains k-dimensional unquantified hash code set ${F_{l} (q)}_{l = 1}^{L}$ given by equation (6).

Step 3. ${F_{l} (q)}_{l = 1}^{L}$ is transmitted to the cluster heads $n_{c h_{j}}$ of the cluster $c_{j}$ in the WSN.

Step 4. For each $c_{j} (j \leq C \cap j \in N)$ do.

Step 5. $n_{c h_{j}}$ collects all the sensed data $v_{i}$ of $n_{i} \in c_{j}$ .

Step 6. $n_{c h_{j}}$ computes the k-dimensional hash code set ${g_{l} (v_{i})}_{l = 1}^{L}$ and ${g_{l} (q)}_{l = 1}^{L}$ given by equations (3) and (4).

Step 7. Set $Φ (v_{i}, q) = 0$ .

Step 8. For l = 1 to L do.

Step 9. Compute the similarity score $ϕ_{l} (v_{i}, q)$ according to equations (7)–(10).

Step 10. $Φ (v_{i}, q) \leftarrow Φ (v_{i}, q) + ϕ_{l} (v_{i}, q)$

Step 11. End for.

Step 12. If $Φ (v_{i}, q) \geq η$ , $n_{c h_{j}}$ sends back $v_{i}$ as a response to the sink node $s$ ; otherwise (if $Φ (v_{i}, q) < η$ ), $n_{c h_{j}}$ ignores $v_{i}$ , where $η$ is the specified similarity score threshold.

Step 13. End for.

Step 14. The sink node $s$ validates the set of returned sensed data ${v_{i}}$ and returns the query result.

In the DASS algorithm, the compressed ${F_{l} (q)}_{l = 1}^{L}$ is transferred to the cluster heads, where the comparison of the similarity score with the score threshold $η$ is used to improve the precision of similarity search. More importantly, the number of irrelevant sensed data items in the responses can be reduced significantly, which reduces the power consumption in the WSN. Note that the parameters of the hash functions and hash projections in the algorithm are pre-stored in both the sink node and the cluster heads. The computational complexity in each cluster head is $O (kL)$ . Since $L$ can be very small, and there is no need to hold hash tables, the overhead of our proposed algorithm is relatively low.

The DASS approach has a higher probability of selecting the nodes whose sensed data meet the similarity search requirement, because irrelevant sensed data are filtered out. Furthermore, the calculation of the similarity score has a low computational complexity, and there is no need to maintain multiple hash tables. Therefore, energy efficiency and high search accuracy can be achieved.
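The query path of the algorithm (steps 2 to 13) can be sketched end to end in Python. This is a simplified illustration under several assumptions: clustering (step 1) and the final validation at the sink (step 14) are omitted, each cluster is modeled as a dictionary from a node identifier to its sensed vector, Gaussian projections are used, and all function names are hypothetical.

```python
import numpy as np


def make_projection(d, k, L, W, seed=0):
    """Shared parameters a (L, k, d) and b (L, k), pre-stored at the sink and cluster heads."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((L, k, d)), rng.uniform(0.0, W, size=(L, k))


def unquantified_codes(a, b, v):
    """Unquantified hash code set of v as an (L, k) array (equations (5)-(6))."""
    return a @ v + b


def score(F_v, F_q, W, lam):
    """Total similarity score (equations (7)-(11))."""
    h_q = np.floor(F_q / W)
    delta = np.floor(F_v / W) - h_q
    left = F_q - h_q * W
    tau = np.where(delta == -1, left, np.where(delta == 1, W - left, 0.0))
    valid = np.all(np.abs(delta) <= 1, axis=1)
    return float(np.sum(np.where(valid, np.exp(-lam * np.sum(tau ** 2, axis=1)), 0.0)))


def cluster_head_filter(a, b, F_q, sensed, W, lam, eta):
    """Steps 5-12: ids of member nodes whose score reaches the threshold eta."""
    return [node_id for node_id, v in sensed.items()
            if score(unquantified_codes(a, b, v), F_q, W, lam) >= eta]


def dass_query(a, b, q, clusters, W, lam, eta):
    """Steps 2-13: the sink sends the codes of q, and each cluster head replies with its matches."""
    F_q = unquantified_codes(a, b, q)      # step 2, computed once at the sink
    responses = []
    for sensed in clusters:                # steps 4-13, one iteration per cluster head
        responses.extend(cluster_head_filter(a, b, F_q, sensed, W, lam, eta))
    return responses
```

Only the $(L, k)$ array of unquantified hash values crosses the network at query time, which is the quantity that the query ratio in the simulation section measures.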

Simulation

In this section, we compare the performance of DASS with existing hashing-based similarity search methods. We implement the algorithms using MATLAB on a DELL server with an Intel 8-core 3.5 GHz CPU and 64 GB RAM. First, the data-dependent hashing methods (such as SH,⁷ graph-based hashing,²¹ and semi-supervised hashing¹⁹) spend considerable time and energy on the training procedure. It is difficult to train on the sensed data over a large-scale WSN to obtain compact hash codes because of the high communication overhead. Hence, the data-dependent hashing methods are not appropriate for comparison in the simulation. In contrast, the multi-probe LSH methods can obtain a relatively higher recall rate with fewer hash tables, and we can extend multi-probe LSH to a distributed setting. More specifically, the cluster head evaluates the similarity between the sensed data and the query by comparing the difference between their hash codes. Finally, we repeat each experiment 10 times and report the average over the 10 runs.

On this basis, we mainly use recall, precision, query ratio, and response ratio to evaluate the high-dimensional similarity search performance. In this article, the following methods are compared:

Multi-probe LSH with SWP⁸

Multi-probe LSH with QDP²⁶

Data source in WSN

The simulation scenario, based on wireless multimedia sensor networks, is designed to verify the performance gain. More specifically, the simulation was conducted in a WSN with 269,648 sensor nodes, where the sensed data of each node are represented by the 500-dimensional feature vector of an image from the NUS-WIDE dataset.²⁸ In the simulation, we query the WSN to see which sensors store images that are similar to “target images.” Note that the routing techniques, radio frequency power, and topology information of the WSN are outside the scope of this article.

Performance metrics

We measure the performance of the similarity search algorithm in two aspects: search quality and energy efficiency. For each query item $q$ , search quality is measured by recall and precision, which can be defined as

\text{recall} = \frac{| S (q) \cap I (q) |}{| I (q) |} \quad \text{and} \quad \text{precision} = \frac{| S (q) \cap I (q) |}{| S (q) |}

(12)

where $I (q)$ is the set of ideal answers, $S (q)$ is the set of actual answers that the algorithm obtains, and |.| denotes the cardinality of a set.
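Equation (12) translates directly into set operations. In the Python sketch below, the function name is hypothetical, and returning 0.0 for an empty set is a convention assumed here rather than something the text specifies.

```python
def recall_precision(actual, ideal):
    """Return (recall, precision) for the answer set S(q) and the ideal set I(q)."""
    actual, ideal = set(actual), set(ideal)
    hits = len(actual & ideal)
    recall = hits / len(ideal) if ideal else 0.0         # assumed convention for an empty I(q)
    precision = hits / len(actual) if actual else 0.0    # assumed convention for an empty S(q)
    return recall, precision
```

For instance, if the algorithm returns four nodes and two of them belong to an ideal set of three, recall is 2/3 and precision is 1/2.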

Energy efficiency is measured by query ratio and response ratio. According to equations (3) and (4), we define query ratio and response ratio as follows:

\text{query ratio} = \frac{L \times k}{d} \quad \text{and} \quad \text{response ratio} = \frac{| S (q) |}{M}

(13)

where $L$ is the number of hash codes for each query item $q$ ; $k$ and $d$ are the dimensionalities of the hash code $F_{l} (q)$ and the query item $q$ , respectively; and $M$ is the number of sensor nodes. By definition, a smaller query ratio means that less data is transmitted at query time. Similarly, a smaller response ratio means a smaller percentage of responding nodes, which generally goes along with better precision.
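As a quick arithmetic check of equation (13), the two ratios can be computed as follows. The function names are hypothetical; the sample values in the note below use $k = 12$ and $d = 500$ from the simulation setup, and they reproduce the query ratios listed in Table 1.

```python
def query_ratio(L, k, d):
    """Fraction of the query's dimensionality that is transmitted as hash values."""
    return (L * k) / d


def response_ratio(num_responses, M):
    """Fraction of the M sensor nodes whose data are sent back to the sink."""
    return num_responses / M
```

With $L = 7$ the query ratio is $7 \times 12 / 500 = 0.168$ , and with $L = 36$ it is $0.864$ . As an illustration only, a response ratio of 0.05% in a network of 269,648 nodes corresponds to roughly 135 responding nodes.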

Impact of parameters η, k and L

We carry out experiments to test the influence of the threshold η by varying it from 0.2 to 0.6, with k set to 12. The results are shown in Figure 4.

Figure 4.

Impact of η on (a) recall and (b) precision for different L.

As discussed above, when the similarity score $Φ (v_{i}, q)$ is less than the threshold η, the sensed data $v_{i}$ of the node $n_{i}$ are considered irrelevant to the query item q. Conversely, if $Φ (v_{i}, q)$ is at least η, the sensed data $v_{i}$ are sent back to the sink node as an actual answer. Hence, a larger η leads to lower recall and higher precision. In practical applications, the threshold η needs to be adjusted to optimize the tradeoff between recall and precision. In Figure 4(a), the recall values corresponding to L from 7 to 11 are in ascending order, because more hash codes lead to a higher similarity score. The relationship between precision and L can be analyzed in a similar way in Figure 4(b).

In Figure 5(a), we set L as 12. From the analysis of equations (9) and (10), the similarity score is obtained using a negative exponential function, so a larger k leads to a lower similarity score. Hence, the recall values corresponding to k from 11 to 14 are in descending order, because a higher hash code dimensionality leads to a higher probability of identifying sensed data as irrelevant. The relationship between precision and k can be analyzed in a similar way in Figure 5(b).

Figure 5.

Impact of η on (a) recall and (b) precision for different k.

Performance of DASS in terms of similarity search

In Figure 6(a) and (b), the precision decreases as the recall increases from 0.79 to 0.98, because higher recall causes more irrelevant sensed data to be identified as relevant, which lowers the precision. Furthermore, the DASS algorithm is insensitive to the values of L and k and can be used in a wide variety of applications.

Figure 6.

Precision as a function of recall for different (a) L and (b) k.

Comparison with multi-probe LSH

Finally, we compare the performance of the DASS algorithm with two multi-probe LSH methods, that is, SWP⁸ and QDP.²⁶ Our simulations aim to comprehensively evaluate the performance in terms of precision, query ratio, and response ratio.

Furthermore, in SWP, the number of non-zero elements of $Δ_{l} = (δ_{1}^{l}, \dots, δ_{k}^{l})$ is less than 4, and the thresholds in both DASS and QDP can be adjusted to optimize the search performance.

As described before, higher recall often results in lower precision, which wastes energy on transmitting irrelevant data. Therefore, we set recall as 0.92, 0.96, and 0.98 to evaluate the performance of the algorithms.

Table 1 shows that, when the value of recall remains fixed, DASS achieves higher precision, a smaller response ratio, and a query ratio no larger than that of the other methods. As defined in equation (13), if the query ratio is less than 1, the size of the hash code set is less than the size of the query; that is, a smaller query ratio means less data transmission. Hence, the proposed algorithm can save energy compared with the traditional cluster architecture of WSNs. Similarly, a smaller response ratio means that less query result data are transmitted. Hence, DASS can meet the energy efficiency requirement of WSNs while maintaining high similarity search performance compared with SWP, QDP, and the cluster methods of WSNs.

Table 1.

Performance comparisons in terms of precision, response ratio, and query ratio.

Recall	Algorithm	L↓	Precision↑	Response ratio (%) ↓	Query ratio↓
0.92	Multi-probe LSH with SWP⁸	20	0.19	0.08	0.48
	Multi-probe LSH with QDP²⁶	7	0.06	0.3	0.168
	DASS	7	0.34	0.05	0.168
0.96	Multi-probe LSH with SWP	28	0.16	0.1	0.672
	Multi-probe LSH with QDP	11	0.04	0.44	0.264
	DASS	9	0.28	0.06	0.216
0.98	Multi-probe LSH with SWP	36	0.14	0.12	0.864
	Multi-probe LSH with QDP	16	0.03	0.6	0.384
	DASS	11	0.25	0.07	0.264

LSH: locality sensitive hashing; SWP: step-wise probing; QDP: query-directed probing; DASS: distributed approximate similarity search.

(↑) and (↓) indicate that the larger (smaller) the value, the better the performance; the best results in each evaluation are boldfaced.

As seen in Figure 7, the value of precision decreases with the increase in recall, and the DASS algorithm achieves relatively higher precision than the other two algorithms when recall remains fixed. It can be explained as follows.

Figure 7.

Impact of recall on precision for different algorithms.

Because of the randomized projection used in the existing hashing-based algorithms, some irrelevant data items can be selected as relevant data with a certain probability, which often leads to relatively low precision and high communication overhead. In contrast, our proposed algorithm takes into account multiple hash codes of each data item to compute the similarity score and compares it with the specified threshold, so that high recall and precision can be obtained in a distributed manner.

Conclusion

Considering the characteristics of WSNs such as high-dimensional sensed data, energy constraint, and distributed architecture, we propose a DASS algorithm to retrieve similar high-dimensional sensed data for query in WSNs. More specifically, each cluster head is responsible for computing the similarity score of each sensed data in its own cluster and comparing the similarity score with a specified threshold to filter out the irrelevant sensed data. Simulation results have demonstrated the effectiveness and superiority of the proposed algorithms in terms of search performance and energy efficiency.

Footnotes

Academic Editor: Lyudmila Mihaylova

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the National Science Foundation of China (61571233, 61203289, and 61572262), the National Science Foundation of Jiangsu (BK20141427), the key University Science Research Project of Jiangsu Province (14KJA510003), and the National Basic Research Program of China (2011CB302903).

References

1. Diallo, Rodrigues JJPC, Sene. Distributed database management techniques for wireless sensor networks. IEEE T Parall Distr 2015; 26(2): 604–620.

2. Jardak, Mähönen, Riihijärvi. Spatial big data and wireless networks: experiences, applications, and research challenges. IEEE Network 2014; 28(4): 26–31.

3. Bentley. K-d Trees for semidynamic point sets. In: Proceedings of the sixth annual symposium on computational geometry, Berkeley, CA, 7–9 June 1990, pp.187–197. New York: ACM.

4. Beygelzimer, Kakade, Langford. Cover trees for nearest neighbor. In: Proceedings of the ICML, Pittsburgh, PA, 25–29 June 2006, pp.97–104. New York: ACM.

5. Datar, Immorlica, Indyk. Locality-sensitive hashing scheme based on p-stable distributions. In: Proceedings of the twentieth annual symposium on computational geometry, Brooklyn, NY, 8–11 June 2004, pp.253–262. New York: ACM.

6. Kulis, Jain, Grauman. Fast similarity search for learned metrics. IEEE T Pattern Anal 2009; 31(12): 2143–2157.

7. Weiss, Torralba, Fergus. Spectral hashing. In: Proceedings of conference on Advances in Neural Information Processing Systems (NIPS 2008), British Columbia, Canada, 8–11 December 2008, pp.1753–1760.

8. Josephson, Wang. Multi-Probe LSH: efficient indexing for high-dimensional similarity search. In: Proceedings of the 33rd international conference on very large data bases (VLDB’07), Vienna, 23–27 September 2007, pp.950–961. New York: ACM.

9. Haghani, Michel, Aberer. Distributed similarity search in high dimensions using locality sensitive hashing. In: Proceedings of the 12th international conference on extending database technology, Saint Petersburg, 24–26 March 2009, pp.744–755. New York: ACM.

10. Madden, Franklin, Hellerstein. TinyDB: an acquisitional query processing system for sensor networks. ACM Trans Database Syst 2005; 30(1): 122–173.

11. Cheng, Chen. Efficient query-based data collection for mobile wireless monitoring applications. Comput J 2010; 53(10): 1643–1657.

12. Yang, Mareboyana. Similarity search in sensor networks using semantic-based caching. J Netw Comput Appl 2012; 35(2): 577–583.

13. Chung, Lee. Finding similar answers in data-centric sensor networks. In: Proceedings of the IEEE international conference on sensor networks, ubiquitous, and trustworthy computing, Taichung, Taiwan, 11–13 June 2008, pp.217–224. New York: IEEE.

14. Zeinalipour-Yazti, Vagena. Distributed top-k query processing in wireless sensor networks. In: Proceedings of the 9th international conference on mobile data management, Beijing, China, 27–30 April 2008, p.227. New York: IEEE.

15. Ahmed, Gregory. Distributed efficient similarity search mechanism in wireless sensor networks. Sensors 2015; 15(3): 5474–5503.

16. Lee. Distributed processing of probabilistic top-k queries in wireless sensor networks. IEEE T Knowl Data En 2013; 25(1): 76–91.

17. Shen, Zhao. A distributed spatial-temporal similarity data storage scheme in wireless sensor networks. IEEE T Mobile Comput 2011; 10(7): 982–996.

18. Liu, Zheng, Liu. Metadata-guided evaluation of resource-constrained queries in content caching based wireless networks. Wireless Netw 2011; 17(8): 1833–1850.

19. Wang, Kumar, Chang. Semi-supervised hashing for large-scale search. IEEE T Pattern Anal 2012; 34(12): 2393–2406.

20. Gong, Lazebnik, Gordo. Iterative quantization: a procrustean approach to learning binary codes for large-scale image retrieval. IEEE T Pattern Anal 2013; 35(12): 2916–2929.

21. Liu, Wang, Kumar. Hashing with graphs. In: Proceedings of conference on International Conference on Machine Learning (ICML 2011), Bellevue, WA, 28 June–2 July 2011, pp.1–8.

22. Irie. Locally linear hashing for extracting non-linear manifolds. In: Proceedings of the IEEE computer vision and pattern recognition (CVPR), Columbus, OH, 23–28 June 2014, pp.2123–2130. New York: IEEE.

23. Jegou, Amsaleg, Schmid. Query adaptative locality sensitive hashing. In: Proceedings of the IEEE international conference on acoustics, speech, and signal processing, Las Vegas, NV, 31 March–4 April 2008, pp.825–828. New York: IEEE.

24. Joly, Buisson. A posteriori multi-probe locality sensitive hashing. In: Proceedings of the 16th ACM international conference on multimedia, Vancouver, BC, Canada, 26–31 October 2008, pp.209–218. New York: ACM.

25. Zhang. Query range sensitive probability guided multi-probe locality sensitive hashing. In: Proceedings of the 13th ACIS international conference on software engineering, artificial intelligence, networking and parallel & distributed computing (SNPD), Kyoto, Japan, 8–10 August 2012, pp.3–9. New York: IEEE.

26. Satuluri, Parthasarathy. Bayesian locality sensitive hashing for fast similarity search. Proc VLDB Endow 2012; 5(5): 430–441.

27. Younis, Fahmy. Distributed clustering in ad-hoc sensor networks: a hybrid, energy-efficient approach. In: Proceedings of the IEEE INFOCOM, Hong Kong, 7–11 March 2004. New York: IEEE.

28. Chua, Tang, Hong. NUS-WIDE: a real-world web image database from National University of Singapore. In: Proceedings of the ACM international conference image and video retrieval, Santorini Island, Greece, 8–10 July 2009, p.48. New York: ACM.