Abstract

With the development of the Internet of Things, many applications need to use people’s location information, resulting in large amounts of data to be processed, known as big data. In recent years, many methods have been proposed to protect privacy in location-based services. However, existing technologies perform poorly on big data. For instance, sensor devices such as smartphones with a location recording function may submit location information anytime and anywhere, which may lead to privacy disclosure. Attackers can leverage huge amounts of data to obtain useful information. In this article, we propose the noise-added selection algorithm, a location privacy protection method that satisfies differential privacy and prevents privacy disclosure to an attacker with arbitrary background knowledge. For the Internet of Things, we maximize the availability of the data and the algorithm while protecting the information. In detail, we filter real-time location distribution information, use our selection mechanism for comparison and analysis to determine privacy-protected regions, and then apply differential privacy to them. As shown in the theoretical analysis and the experimental results, the proposed method achieves significant improvements in security and privacy and strikes a good balance between privacy protection level and data availability.

Keywords

Location-based service, differential privacy, protection mechanism, Internet of Things

Introduction

In recent years, the Internet of Things (IoT) has gradually become part of people’s daily lives, giving rise to a large number of related applications.¹ In the current era of the Internet, communication and interconnection devices are developing rapidly. With the development of the IoT, communication between devices becomes more frequent. To provide better services to users, these devices often collect private information, which can lead to disclosure of the user’s privacy.² As a result, the open interconnected environment brings many threats to society. People are now taking privacy problems seriously.³

Location-based service (LBS)⁴ is a hot topic in current mobile terminal services and is widely used in the mobile application market. In shopping applications, using the LBS to obtain the user’s location not only removes the cumbersome process of manually entering the location but also provides a basis for locating the distribution warehouse. In navigation, the LBS obtains the user’s location in real time and returns information to the user, making the acquisition and query of road conditions more intuitive and simple. In addition, LBS plays an important role in applications such as social networking, weather, taxi, group purchase, and travel.⁵,⁶ The geographic location information provided by LBS can enrich the functionality of the application and greatly enhance the user experience.

However, while improving the user experience, the LBS application needs to collect the user’s geographic location information. The body sensor network (BSN)⁷ is a natural way to make use of LBS. It forms a huge database by collecting data uploaded in real time from various portable devices. For example, people use their smartphones to submit their location anytime and anywhere. However, these data can easily be targeted by various attackers.⁸ Therefore, it is essential to propose a privacy protection strategy for BSN.

We found that location information can outline the general characteristics of a person. With some key information, one can infer the home address, workplace, and so on of the target person. At the same time, studies have found that people’s activities are highly regular. Thus, obtaining people’s location information not only infringes on their privacy at the current moment but also helps predict their future locations. This is also the purpose of the attackers. Therefore, more and more people consider geographical location information obtained from BSN to be part of personal privacy that should not be obtained and analyzed by others. Based on the above issues, we should provide statistical data for external analysis in the Industrial Internet of Things (IIoT),⁹ while ensuring that the privacy of any individual cannot be inferred.

We can simply divide this sensitive information into two parts:¹⁰ disclosure based on location and disclosure based on data. The first threat arises when the portable device collects our location information. The other concerns non-location information submitted by us. This article mainly focuses on the first threat. Figure 1 shows the location distribution of users.

Figure 1.

Location distribution in IoT reality.

We usually use differential privacy (DP)¹¹ to protect location information; it is a suitable method for location-based privacy in the IIoT scenario described above. It allows statistical data to be queried and analyzed in a privacy-preserving environment. It ensures that, for any single individual in a dataset, the result of a statistical query will not change significantly whether or not that individual is present. In short, the attacker cannot infer any individual’s data from the statistical result. DP addresses shortcomings of traditional privacy models (such as encryption algorithms¹² and k-anonymity¹³). First, DP is defined with strict mathematics, which means we can provide privacy for datasets at different levels. Second, DP assumes that the attacker has the maximum background knowledge, including all records except the target record. As a result, the DP guarantee does not depend on the attacker’s background knowledge. Above all, we argue that the DP model provides the most suitable privacy-preserving method for location-based information.

From the description above, we can observe that the DP model can protect any individual. However, DP also has its drawbacks.¹⁴ Because every single individual must be considered, the protection mechanism must be very strict, which means that the usability of the query data may be affected. We need not only to avoid the disclosure of private data but also to preserve the reliability of the query results for data analysts. Thus, the balance between data availability and the degree of privacy protection is a serious issue in the DP model.

Before the emergence of big data in IIoT, researchers proposed a variety of algorithms that satisfy privacy guarantees for common location data. With the emergence of big data, however, these methods show defects in efficiency or usability. As a result, privacy protection has become a major challenge. In this article, we therefore propose a privacy strategy based on LBS with BSN.

The specific contributions are as follows:

We introduce a planar structure to represent the location data in sensor networks according to the distribution and real-time position of location data;

We present a selection algorithm based on a density threshold, which has lower time complexity than partition algorithms;

We apply an advanced exponential mechanism to the selection algorithm and describe it in detail;

We conduct extensive experiments on public datasets, demonstrating that the presented algorithm performs better than the state-of-the-art approaches in IIoT.

The rest of this article is organized as follows. Section “Related work” presents the essential previous related work. Section “Preliminaries” introduces the preliminary knowledge. Section “The proposed noise-added selection method” details the proposed location privacy protection method. The real-world IoT datasets and experimental results are presented in section “Evaluation,” and conclusions are given in section “Conclusion.”

Related work

Evolution of the privacy model

With the growing popularity of location-based applications, many researchers have proposed location models¹⁵ to protect privacy. We simply divide them into two categories.¹⁶ One is traditional anonymization, which encrypts sensitive private location information and data. The other is the DP protection model, which does not depend on the background knowledge of the attacker.¹⁶

Anonymization

Anonymization has always been a traditional and important method in location privacy protection.¹³,¹⁷⁻¹⁹ In the static anonymization policy, the data publisher needs to process the quasi-identifiers in the data, so that multiple records with the same quasi-identifier values can be combined. The sets of records having the same combination of quasi-identifier values are called equivalence groups. Wong et al.¹⁷ proposed the k-anonymity method, and a large number of methods are based on k-anonymity. The k-anonymity technology requires that the number of records in each equivalence group be at least k. When an attacker targeting big data performs a linking attack, the attack on any record is also associated with the other $k - 1$ records in the equivalence group. This feature protects the user’s privacy by making it impossible for an attacker to determine the records associated with a particular user.

The l-diversity anonymity strategy¹⁸ ensures that the sensitive attribute of each equivalence class has at least l different values. With l-diversity, the attacker can confirm the sensitive information of an individual with a probability of at most $1 / l$.

The t-closeness anonymity strategy¹³ measures the distance between distributions of sensitive attribute values by the earth mover’s distance (EMD) and requires that the difference between the distribution of sensitive attribute values in an equivalence group and the distribution of sensitive attribute values in the entire dataset not exceed a threshold t. On the basis of l-diversity, the distribution problem of sensitive attributes is considered, and the distribution of sensitive attribute values in all equivalence classes is required to be as close as possible to the global distribution of the attributes.

Based on the static anonymization policy, a dynamic m-invariance anonymity strategy¹⁹ is proposed: in addition to insertions of new records, it supports republication of the data after historical records are deleted.

However, these strategies may cause a large loss of information, which may lead the user of the data to make a false judgment.

DP

DP is a powerful privacy model that can be implemented by adding randomized noise to aggregated query results to protect individual entries without significantly changing the results of the query. A DP algorithm guarantees that what an attacker can learn about a person is almost the same as what the attacker could learn from a dataset that does not contain that person’s record.

Partitioning strategy

A representative example of data-independent tree partitioning for two-dimensional data is the quadtree.²⁰ It can also be extended to higher dimensions, using octrees and similar structures to represent the data. Their distinguishing feature is that the partitioning method is set in advance and depends only on the attribute domain. Compared with a uniform allocation of the privacy budget, the query error of this method grows slowly,²¹ and noise is added only to part of the statistical data in the time period, selected by filtering.

Data-dependent tree partitioning relies mainly on the input, and the more representative structures are the kd-tree²² and Hilbert R-trees. The construction of a kd-tree has two main steps. The first is to select the dimension with the largest variance in the k-dimensional dataset and then select the median m as the pivot to divide the dataset along that dimension, which yields two sub-collections. At the same time, a tree node is created to store the total count value. The second is to repeat step 1 for the two sub-collections until no sub-collection can be divided further. If a sub-collection cannot be subdivided, its data are saved to a leaf node. We focus on the construction of the kd-tree and the process of combining it with DP. The algorithm that uses DP based on the kd-tree is called kd-standard. The privacy budget of kd-standard is divided into two parts. The first part is used to determine the median, because if DP is not used to protect the splitting process, the splitting line may leak the true value of the median. The second part is used to add Laplace noise to the counts at each level of the kd-tree. Kd-standard may also suffer from two problems of the quadtree approach: inconsistent query results and nonuniform distribution of the privacy budget. The solution is the same as for the quadtree.

R-tree²³ is a spatial partitioning structure consisting of nested rectangles that may overlap. The hybrid tree combines data-independent and data-dependent tree-partitioning methods. This method sets the partitioning process in advance. The algorithm combines the strengths of the two tree-partitioning methods and offsets their weaknesses, so the query results are more accurate.

UG partitioning²⁴ is a relatively simple way to divide the space. This method divides the data domain into equal-sized grid cells and then adds noise to the count value of each cell, where the cell size is chosen to minimize the sum of the noise error and the uniformity-assumption error. The disadvantage of UG partitioning is that all areas of the dataset are treated equally. Whether an area is dense or sparse, it is divided in the same way. If there are few data points in a region, this method will result in many fine divisions of the region, which increases the noise error and hardly reduces the uniformity-assumption error.
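The UG idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from the cited work: the function names, the square `cells_per_side` layout, and the choice of sensitivity 1 (one point added or removed changes one cell count by 1) are assumptions made here, and the points are assumed to lie inside the rectangle `[0, width] x [0, height]`. The rule for choosing the number of cells is not reproduced; `cells_per_side` is taken as given.

```python
import random


def laplace(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_uniform_grid(points, width, height, cells_per_side, epsilon, rng):
    """Count points in equal-sized cells, then add Laplace noise to each count."""
    m = cells_per_side
    counts = [[0] * m for _ in range(m)]
    for x, y in points:
        # Clamp so that points on the upper boundary fall into the last cell.
        col = min(int(x / width * m), m - 1)
        row = min(int(y / height * m), m - 1)
        counts[row][col] += 1
    # Assumed sensitivity of each cell count: 1 (add or remove one point).
    scale = 1.0 / epsilon
    return [[c + laplace(scale, rng) for c in row] for row in counts]


rng = random.Random(0)
pts = [(0.5, 0.5), (1.5, 0.5), (1.5, 1.5), (1.9, 1.9)]
grid = noisy_uniform_grid(pts, width=2.0, height=2.0,
                          cells_per_side=2, epsilon=1.0, rng=rng)
```

Because every cell receives noise of the same scale regardless of how many points it holds, sparse cells are dominated by noise, which is the weakness noted above.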

The AG partition²⁴ first performs a uniform grid division. Because there is a second-layer partition, the first-layer grid is coarser than the UG partition. The AG partitioning then adaptively selects the partition granularity of the second-layer grid within each first-layer cell according to the noisy count value of that cell.

Preliminaries

We will introduce the definition of DP and partitioning in detail.

The DP protection model aims to perturb the data with random noise before publication. Thus, even if an attacker knows all the other records except the target one, the individual’s private information cannot be inferred through data mining and analysis. Hence, the DP protection model has been studied and refined by scholars since it was proposed. Because DP is defined on neighboring datasets, we first give the definition of neighboring datasets.²⁵

Definition 1: Neighboring dataset

For two datasets $D_{1}$ and $D_{2}$ with the same attribute structure, $D_{1}$ and $D_{2}$ are called neighboring datasets if and only if they differ in exactly one data record.

For the neighboring datasets in the above definition, a difference of one record covers not only a changed record but also a deleted or an added record.

On this basis, we list the definition of DP.¹¹

Definition 2: ε-DP

A randomized algorithm B gives ε-DP if, for any pair of neighboring datasets $D_{1}$ and $D_{2}$ and for every set of outcomes $O \subseteq Range(B)$, B satisfies

$$\Pr[B(D_{1}) \in O] \leq e^{ε} \cdot \Pr[B(D_{2}) \in O]$$

(1)

We call $ε$ in formula (1) the privacy budget, which represents the level of privacy protection. The smaller the $ε$, the higher the protection level. There is currently no good standard for the value of $ε$. Generally, the value is adjusted according to the required level of privacy protection. From experience, it is often set between ln 2 and ln 5.
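As a worked illustration of formula (1), the two endpoint values mentioned above can be substituted directly. The numbers below follow from the definition alone and are not taken from the experiments.

```latex
ε = \ln 2:\quad \Pr[B(D_{1}) \in O] \leq 2 \cdot \Pr[B(D_{2}) \in O]
\qquad
ε = \ln 5:\quad \Pr[B(D_{1}) \in O] \leq 5 \cdot \Pr[B(D_{2}) \in O]
```

Since the definition holds for every ordered pair of neighboring datasets, the same bound applies with $D_{1}$ and $D_{2}$ swapped. With $ε = \ln 2$, for example, the presence or absence of one record changes the probability of any outcome set by a factor between $1/2$ and $2$.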

Two mechanisms are generally adopted to achieve DP: the Laplace mechanism and the exponential mechanism. Both rely on the notion of sensitivity. To make these definitions easier to understand, we first introduce global sensitivity.²⁶

Definition 3: Global sensitivity

For a function $F : D \to R^{d}$ and any neighboring datasets $D_{1}$ and $D_{2}$, the global sensitivity $Δ F$ of function F is defined as

$$Δ F = \max_{D_{1}, D_{2}} \| F(D_{1}) - F(D_{2}) \|_{1}$$

(2)

where D refers to the dataset, $R^{d}$ refers to the set of d-dimensional real vectors, d is a positive integer, and $\| F(D_{1}) - F(D_{2}) \|_{1}$ represents the $L_{1}$ distance between $F(D_{1})$ and $F(D_{2})$.

Global sensitivity represents the maximum change in the output of the function when any single record in the dataset is changed.
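As a small worked example of formula (2), consider a counting query over a fixed region, written here with a hypothetical symbol $R$ that is not part of the original notation. Neighboring datasets differ in one record, and that record lies either inside or outside $R$, so the count changes by at most 1.

```latex
F(D) = | \{ x \in D : x \in R \} |, \qquad
Δ F = \max_{D_{1}, D_{2}} \| F(D_{1}) - F(D_{2}) \|_{1} = 1
```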

There are two common mechanisms to realize the DP model, namely, the Laplace mechanism²⁷ and the exponential mechanism.²⁸ Both randomize the released statistics, but they differ significantly. The Laplace mechanism adds noise to a numeric query result. The exponential mechanism uses a scoring function and publishes an output according to its score. We give their definitions below.

Definition 4: Laplace mechanism

Given a function $F : D \to R^{d}$ over a dataset D, a privacy budget $ε$, and the global sensitivity $Δ F$, a randomized algorithm A satisfies the Laplace mechanism when

$$A(D) = F(D) + Lap\left(\frac{Δ F}{ε}\right)$$

(3)

where $Lap(λ)$ denotes noise drawn from a Laplace distribution with location parameter 0 and scale parameter $λ$, here $λ = Δ F / ε$.
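Formula (3) can be illustrated with a short Python sketch that uses only the standard library. The function names and the example values (a count of 120, $Δ F = 1$, $ε = 0.5$) are assumptions chosen for illustration, not values from the article.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale)."""
    # If X and Y are independent exponentials with mean `scale`,
    # then X - Y follows Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Return F(D) + Lap(sensitivity / epsilon) for a scalar query result."""
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    return true_value + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)
# Hypothetical example: a region count of 120 released with epsilon = 0.5.
noisy_count = laplace_mechanism(120, sensitivity=1.0, epsilon=0.5, rng=rng)
```

The noise scale is $Δ F / ε$, so halving $ε$ doubles the expected absolute error of the released value.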

Figure 2 shows the distribution of Laplace noise function.

Figure 2.

Laplace distribution function for different values of the scale parameter b.

In some cases, the result of a dataset query is an entity object rather than a number, and the exponential mechanism is proposed for this case. It is mainly used when the output of the algorithm is non-numeric. In some application scenarios, we need to choose the better output among multiple candidate results, and the exponential mechanism ranks the candidate outputs of the algorithm by defining a usability function.

Definition 5: Exponential mechanism

Given a randomized algorithm B whose input is a dataset D and whose output is an entity object $r \in ξ$, where $ξ$ is the set of possible outputs, let $u(D, r)$ be the usability function with sensitivity $Δ u$. B satisfies the exponential mechanism if it chooses and outputs r from $ξ$ with a probability proportional to $\exp\left(\frac{ε \cdot u(D, r)}{2 Δ u}\right)$.

Here the global sensitivity of the usability function $u(D, r) \to R$ is $Δ u = \max_{D, D'} \| u(D, r) - u(D', r) \|$, where the maximum is taken over all pairs of neighboring datasets D and $D'$.
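Definition 5 can be sketched as weighted random sampling. The following Python sketch uses only the standard library; the function name, the region labels, and the example scores are assumptions for illustration. The usability function is passed in as a callable that already has the dataset fixed.

```python
import math
import random


def exponential_mechanism(candidates, utility, sensitivity, epsilon, rng):
    """Pick one candidate r with probability proportional to
    exp(epsilon * u(r) / (2 * sensitivity))."""
    scores = [utility(r) for r in candidates]
    best = max(scores)
    # Subtracting the maximum score rescales all weights by the same factor,
    # so the probabilities are unchanged while exp() cannot overflow.
    weights = [math.exp(epsilon * (s - best) / (2.0 * sensitivity))
               for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


rng = random.Random(0)
# Hypothetical usability scores: the number of points in each candidate region.
region_scores = {"region_a": 40, "region_b": 55, "region_c": 12}
chosen = exponential_mechanism(list(region_scores), region_scores.get,
                               sensitivity=1.0, epsilon=0.1, rng=rng)
```

Candidates with higher scores are exponentially more likely to be chosen, but every candidate keeps a nonzero probability, which is what hides the exact ranking from an observer.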

User location information is protected by hiding it, and the sanitized data are released by adding Laplace noise to the statistical results of each cell. Definitions 6 and 7²⁹ are used to show that the proposed data release method satisfies DP.

Definition 6: Sequential composition

Suppose we have a set of privacy mechanisms $M = \{M_{1}, M_{2}, \dots, M_{m}\}$ and each $M_{i}$ provides an $ε_{i}$-DP guarantee. Then, after executing these mechanisms in sequence on the same dataset, M provides ε-DP, where $ε = ε_{1} + ε_{2} + \dots + ε_{m}$.

Definition 7: Parallel composition

Suppose we have a set of privacy mechanisms $M = \{M_{1}, M_{2}, \dots, M_{m}\}$, and each $M_{i}$ provides an $ε_{i}$-DP guarantee on a disjoint subset of the entire dataset. Then M provides $(\max_{i} ε_{i})$-DP.

When we understand the privacy of the underlying operations, composition theorem of DP can help us understand and calculate the privacy of a complex algorithm.
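The two composition rules amount to simple budget arithmetic, sketched below in Python. The function names and the example budgets are assumptions for illustration.

```python
def sequential_budget(epsilons):
    """Total budget when all mechanisms run on the same dataset."""
    return sum(epsilons)


def parallel_budget(epsilons):
    """Total budget when each mechanism runs on a disjoint subset."""
    return max(epsilons)


# Hypothetical split: two mechanisms applied one after the other to the same data.
total_sequential = sequential_budget([0.25, 0.75])    # 1.0
# Hypothetical case: three disjoint regions, each perturbed with its own budget.
total_parallel = parallel_budget([0.1, 0.5, 0.3])     # 0.5
```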

From the above theorems and definitions, it follows that the smaller the $ε$, the higher the degree of privacy protection and the lower the availability of the data. Conversely, the larger the $ε$, the lower the degree of privacy protection and the better the availability of the data.

In addition, according to parallel composition, if we allocate a separate privacy budget to each region’s query value, we can achieve a good balance between query availability and privacy protection. However, the time spent in this separate allocation process will be several times that of the uniform allocation method, which is not acceptable to us.

The proposed noise-added selection method

In this section, we propose our main method for location privacy protection. For a real-time dataset containing users’ geographic location information in IIoT, we first convert it into a two-dimensional dataset and screen the dataset with a sliding square. We then apply an exponential mechanism to the filtered collection by giving it a scoring function and selecting p data records according to probability. Finally, a Laplace noise addition mechanism is applied to these data, and the results are written back to the original dataset.

We transform Figure 1 into Figure 3, which means that we convert the geographic map into a mathematical lattice map. Think of the map as a huge white background area where each user’s location information is transformed into a point in the area based on latitude and longitude. In Figure 3, the blue points and the square replace the actual distribution of users in Figure 1.

Figure 3.

Flattened dataset.

Problem definition and assumption

Notations

Let D denote the domain of all users’ locations. $D_{s}$ represents the result set of the region selection algorithm, and $D'_{s}$ represents the set chosen from $D_{s}$ by the exponential mechanism. $D'$ is the final dataset generated from $D'_{s}$ using the Laplace mechanism. s represents the area of the square filter block, and q represents the density threshold. The notations are given in Table 1.

Table 1.

Notations of involved symbols.

Symbols	Notations
D	Original dataset
$D'$	Laplace noise-added dataset
$D_{s}$	Region selection dataset
$D'_{s}$	Exponential selection dataset
q	Parameter which represents the density of the area
s	Parameter which represents the area of the detecting region
$ε$	Total privacy budget
$ε_{1}$	Privacy budget allocated to the selection algorithm
$ε_{2}$	Privacy budget allocated to the Laplace noise-added algorithm
N	The number of points in the whole region
p	Number of data records selected by the exponential mechanism

Problem description

For a BSN service application, a real-time geographic location distribution is needed at a given moment to provide better service to the customer. However, it is too unreliable and dangerous to let the application obtain the user’s real location, even though the application must still complete its service to the user. In this case, a trusted third party needs to protect and process the location information submitted by the user. It can then release sanitized statistical data to the application while ensuring the availability of the released locations to avoid deviations in the BSN service.

We need a trusted third party to perform the selection and noise addition. The framework is as follows. First, users of the application are asked to submit their real-time position to the third party. Second, the third party processes the location data it has collected and uses a selection mechanism to select the target regions. We treat each area as a block of data so that these areas constitute a dataset. Next, we filter the data in this dataset and find the appropriate blocks for noise addition. In the end, we write the changed data back to the original location map and update it, which achieves the goal of privacy protection.

We believe that users are willing to do this for two reasons. First, the third party is assumed to be absolutely safe, so nobody will use an individual’s data without the user’s permission. Second, the method we execute follows the DP model, whose guarantee is very strong. Whether or not an individual gives permission to the third party, attackers may still have chances to discover his or her private information, and this risk is removed when people submit their locations to the third party for sanitization. This motivates our algorithm. We make two strong assumptions to keep the result clear. First, we assume that the third party is completely safe. Second, we assume that the attackers are smart and malicious, which means they understand how to use differential attacks and have the maximum background knowledge.

Construction of screening algorithm

For a dataset constructed from a monitored map, the content is of many kinds, including the number of users and their positions. If the data are placed on a coordinate system in which the x-axis represents the longitude and the y-axis represents the latitude, then each user’s location corresponds to a coordinate. For ease of understanding, we show our process in Figure 4.

Figure 4.

Total algorithm flow chart.

The process is summarized as follows. The database is transformed into a coordinate system for region selection, whose result is a set of squares in the coordinate system, each containing many data points. Then, these squares are selected by the exponential mechanism. The selected squares are then perturbed with noise by the Laplace mechanism. Finally, we reload the noise-added blocks into the original dataset, enabling DP protection. The noise-added selection (NAS) algorithm is shown in algorithm 1.

Algorithm 1: NAS
Input: Original dataset D; thresholds q, s; privacy budget $ε$;
Output: Noise-added dataset $D'$;
1. Consider dataset D as a coordinate system in the first quadrant, and each user’s position information as a point distributed in the coordinate system;
2. Move a square with an area of s so that the abscissa gradually increases and then the ordinate gradually increases; the regions satisfying the judgment condition during this process form a dataset. We detail this in algorithm 2;
3. Calculate the output probability using the exponential mechanism. The formula is detailed in algorithm 3;
4. Select the regions that need to be protected, whose information is more important;
5. Add Laplace noise to the selected dataset. Update the original dataset D by altering the values of the regions, and call the result $D'$;
6. return $D'$;

As presented in step 2 of algorithm 1, the square is used to detect the data that satisfy the condition. The small square first traverses the horizontal axis and then the vertical axis. The detailed process is shown in algorithm 2 and Figure 5.

Algorithm 2: Region selection
Input: Original dataset D;
Output: Updated set $D_{s}$;
1. Initialize a square whose lower left corner coordinate value is $(i, j)$;
2. Initialize a dataset $D_{s} = \emptyset$;
3. for $j = 0$; $j < length - \sqrt{s}$; $j += \sqrt{s}$ do
4.     for $i = 0$; $i < length - \sqrt{s}$; $i += \sqrt{s}$ do
5.         Count the number of points in this square, denoted as N;
6.         if $\frac{N}{s} > q$ then
7.             $D_{s} = D_{s} \cup C$
8.         else Break
9.         end if
10.     end for
11. end for
12. return $D_{s}$;

Figure 5.

Process of region selection (the horizontal axis represents the longitude and the vertical axis represents the latitude).

As shown in step 1 of algorithm 2, we set the point in the lower left corner of the square as the traversal point. Dataset $D_{s}$ is established in step 2 to record the squares that satisfy the condition. We execute a for loop to compute the density of each square, where C denotes the current square. If the density is larger than q, we select the square and store it in $D_{s}$, as shown in steps 3–11. Finally, we return the dataset $D_{s}$.
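The region selection step can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors’ implementation: the function name and the point representation are hypothetical, the domain is assumed to be the square `[0, length) x [0, length)`, the loop bound is written so that the last full square is also visited, and a square that fails the density test is skipped so that the scan continues along the row.

```python
import math


def select_dense_regions(points, length, s, q):
    """Slide a square of area s over the domain and keep the dense squares.

    Returns a list of ((i, j), n) pairs, where (i, j) is the lower left
    corner of a selected square and n is the number of points inside it.
    """
    side = math.sqrt(s)
    selected = []
    j = 0.0
    while j + side <= length:          # traverse the vertical axis
        i = 0.0
        while i + side <= length:      # traverse the horizontal axis
            n = sum(1 for x, y in points
                    if i <= x < i + side and j <= y < j + side)
            if n / s > q:              # density test from step 6
                selected.append(((i, j), n))
            i += side
        j += side
    return selected


pts = [(0.2, 0.2), (0.4, 0.6), (0.9, 0.1), (2.5, 2.5), (3.5, 3.5), (3.6, 3.4)]
dense = select_dense_regions(pts, length=4.0, s=1.0, q=1.5)
```

Each square scans the full point list here, which keeps the sketch short. An implementation for large datasets would more likely bucket the points by square in a single pass.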

Selection method using exponential mechanism

We propose the selection method based on the exponential mechanism mainly because of the strong privacy it offers. First, the filtered data are still large in number, and perturbing all of them would consume a large privacy budget; the exponential mechanism reduces the risk of privacy disclosure. Second, it selects the location data that include more sensitive information, so the disclosure risk can be controlled by adjusting the privacy budget. Third, it controls the number of data records selected each time. Combining these advantages, we propose algorithm 3.

Algorithm 3: Exponential selection with $DP - p$
Input: Dataset $D_{s}$; privacy budget $ε_{1}$;
Output: Updated set $D'_{s}$;
1. Initialize records in the set $D_{s}$;
2. Count the number of records in $D_{s}$, denoted as n;
3. Score each data record $C_{i}$ by the formula as follows, denoted as $Score(D_{s}, C_{i})$;
4. Calculate the value of DP-added score $S_{c}^{DP}$;
5. Calculate the probability $\Pr(C_{i}) = \frac{S_{c_{i}}^{DP}}{Σ_{j = 1}^{n} S_{c_{j}}^{DP}}$;
6. The probability of each data obtained is ranked in descending order. We choose the first p data to construct the set $D'_{s}$;
7. return $D'_{s}$;

First, step 2 of algorithm 3 counts the number of records in the dataset. Then, we use the score function to compute the score of each record in step 3. The formula is

$$Score(D_{s}, C_{i}) = E(C_{i})$$

(4)

where $E(C_{i})$ refers to the number of location points in record $C_{i}$. Then, we obtain $S_{c_{i}}^{DP}$ in step 4

$$S_{c_{i}}^{DP} = \exp\left[\frac{ε_{1} \cdot Score(D_{s}, C_{i})}{2 Δ Score}\right]$$

(5)

$$Δ Score = \max_{i, j \leq n} \| E(C_{i}) - E(C_{j}) \|_{1}$$

(6)

where $Δ Score$ represents the maximum difference between the point counts of any two records in $D_{s}$.

The probability is obtained in step 5 and sorted in descending order in step 6. Finally, we complete the selection with the first p records and construct $D'_{s}$.
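Formulas (4) to (6) and steps 5 and 6 of algorithm 3 can be sketched in Python as below. The function name and the input format (a list of point counts, one per record in $D_{s}$) are assumptions for illustration. The fallback $Δ Score = 1$ when all counts are equal is also an assumption, added only to avoid division by zero.

```python
import math


def exponential_selection(counts, epsilon_1, p):
    """Return (indices of the p most probable records, all probabilities)."""
    if not counts:
        return [], []
    # Formula (6): largest difference between any two record counts.
    delta = max(counts) - min(counts)
    if delta == 0:
        delta = 1  # assumption: avoid division by zero when all counts match
    top = max(counts)
    # Formula (5), shifted by the maximum count. The shift cancels out when
    # the weights are normalized and keeps exp() from overflowing.
    weights = [math.exp(epsilon_1 * (c - top) / (2.0 * delta)) for c in counts]
    total = sum(weights)
    probs = [w / total for w in weights]              # step 5
    order = sorted(range(len(counts)),
                   key=lambda i: probs[i], reverse=True)
    return order[:p], probs                           # step 6


# Hypothetical point counts of four selected squares.
chosen, probs = exponential_selection([5, 20, 10, 15], epsilon_1=1.0, p=2)
```

The sketch follows step 6 as written and keeps the p records with the highest probability. A variant that draws the p records at random according to $\Pr(C_{i})$ would replace the sorting step with weighted sampling.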

Added Laplace noise

As previously mentioned, noise is added to the counts to perturb the statistics returned by a query. In this part, we add Laplace noise to the query result. Algorithm 4 shows the details of this release.

Algorithm 4: Laplace noise release
Input: $D'_{s}$;
Output: Updated set $D'$;
1. Initialize $D_{i} \in D'_{s}$;
2. Initialize $n_{i}$: number of points in $D_{i}$;
3. for $D_{i} \in D'_{s}$ do
4.     $n'_{i} = n_{i} + Laplace(\frac{Δ F}{ε_{2}})$;
5. end for
6. Update $D'_{s}$;
7. $D' = D'_{s}$;
8. return $D'$;

Here, $Laplace(\frac{Δ F}{ε_{2}})$ is noise drawn from the Laplace distribution. Note that the privacy budget $ε$ is used in two parts, the exponential mechanism with $ε_{1}$ and the Laplace mechanism with $ε_{2}$, so that $ε = ε_{1} + ε_{2}$. According to the sequential composition theorem, the proposed method satisfies ε-DP.
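The loop in algorithm 4 can be sketched in Python as follows. The function name, the dictionary keyed by region corner, and the example split $ε_{1} = ε_{2} = 0.5$ are assumptions for illustration. The article does not state how the budget is divided.

```python
import random


def laplace_release(region_counts, sensitivity, epsilon_2, rng):
    """Add Laplace(sensitivity / epsilon_2) noise to every selected region count."""
    scale = sensitivity / epsilon_2
    noisy = {}
    for region, n in region_counts.items():
        # Difference of two exponentials with mean `scale` is Laplace(0, scale).
        noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        noisy[region] = n + noise
    return noisy


rng = random.Random(0)
epsilon = 1.0
epsilon_1, epsilon_2 = 0.5, 0.5     # hypothetical split with eps = eps_1 + eps_2
selected = {(0.0, 0.0): 30, (3.0, 3.0): 12}   # counts of the selected squares
released = laplace_release(selected, sensitivity=1.0,
                           epsilon_2=epsilon_2, rng=rng)
```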

Choice of parameter

In this article, we use the thresholds q and s, whose roles were described above. We now discuss how these parameters are obtained and why they are set this way.

As detailed in algorithm 2, to detect the distributed spots, NAS needs two parameter values, q and s. These values specify the minimum accepted density and the area of the detecting square, respectively. We need to determine these values privately, because they can leak information about the underlying dataset.

To solve the above issue, we set these parameters based on the average density of the whole region. Suppose the number of position points is L and the area of the region is S. The average density is then $L / S$. We set $q = L / S$, so each selected spot has a density greater than the average.

Next, we choose the value of s. Intuitively, q and s together determine which data point spots are selected. If the selection square is relatively small, the time complexity may increase; if it is large, we may lose considerable accuracy and miss some qualifying spots. We therefore impose two limit conditions: first, $s \geq 1$; second, $s \leq \frac{S}{4}$. These conditions rule out two situations: a square too small to hold the data points, and a square too large to partition the region. When the conditions are satisfied, we use $s = \frac{S}{512}$ in our implementation of NAS, and the experiments show that this setting achieves reasonably good utility.
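The parameter rules above amount to two small calculations, sketched here for concreteness. The function names `density_threshold` and `detection_square_area` are hypothetical, and the unit of area is an assumption: the condition $s \geq 1$ only has meaning once a unit for S has been fixed, which the text does not specify.

```python
def density_threshold(num_points, region_area):
    """Return q = L / S, the average density of the whole region."""
    if num_points < 0 or region_area <= 0:
        raise ValueError("num_points must be >= 0 and region_area must be > 0")
    return num_points / region_area


def detection_square_area(region_area, divisor=512):
    """Return s = S / divisor after checking 1 <= s <= S / 4."""
    s = region_area / divisor
    if not (1 <= s <= region_area / 4):
        raise ValueError("s violates the limit conditions 1 <= s <= S / 4")
    return s


if __name__ == "__main__":
    L, S = 5000, 2048.0   # hypothetical point count and region area
    print("q =", density_threshold(L, S))
    print("s =", detection_square_area(S))
```

With the default divisor of 512, the lower bound $s \geq 1$ is the binding one: any region with $S < 512$ in the chosen unit is rejected, while $s \leq \frac{S}{4}$ always holds.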

Evaluation

Experimental settings

Dataset

We use the check-in dataset from Gowalla (the Gowalla total check-in dataset), which contains a large number of user check-in records. The format of the experimental dataset is illustrated in Table 2.

Table 2.

Information on datasets.

User ID	Time of check-in	Latitude	Longitude
228470	2010-10-19T23:55:27Z	30.2359091167	−97.7951395833
4203150	2010-10-18T22:17:43Z	30.2691029532	−97.7493953705
3166370	2010-10-17T23:42:03Z	30.2557309927	−97.7633857727
165160	2010-10-17T19:26:05Z	30.2634181234	−97.7575966669

Environment and parameter

The configuration information and parameter settings are shown in Tables 3 and 4, respectively.

Table 3.

Configuration information.

Name	Configuration details
CPU	i7-6560U CPU at 2.20 and 2.21 GHz
Memory	DDR4 8GB
Disk	256GB SSD
System image	Windows 10, 64-bit operating system

Table 4.

Parameter settings.

Parameter	Value
$ε$	0.005, 0.01, 0.015, 0.02, 0.025, 0.03, 0.035, 0.04, 0.045, 0.05, 0.1, 0.15, 0.2, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 2, 5, 10, 15
p	20, 40, 60, 80, 100, 120, 140, 160, 180, 200, 220, 240, 260, 280, 300, 320, 340, 360, 380, 400, 420, 440, 460, 480, 500

Time analysis

We analyze the running time in two respects:

Time to construct the region selection area;

Time to run the exponential selection method and to update the original dataset after the Laplace noise is added.

Table 5 shows the test results. The values of p, L, and $ε$ are set according to the analysis that follows. As the number of data points increases, the running time grows.

Table 5.

Time analysis.

L	Time (region selection), s	L	Time (original database updated), s
100	0.00000012	100	0.0000037
200	0.00000034	200	0.0000058
500	0.00000066	500	0.0000154
1000	0.00000106	1000	0.0000369
1500	0.00000155	1500	0.0000516
2000	0.00000199	2000	0.0000741
2500	0.00000311	2500	0.0000773
5000	0.00000584	5000	0.0000985
7500	0.00000963	7500	0.0002114
10,000	0.000001182	10,000	0.0003132

Experiment and availability analysis

We select 500, 1000, and 2000 data points from Gowalla. Black dots represent insensitive locations and red dots represent sensitive locations. The experimental results are shown in Figures 5–10.

Figure 6.

500 data points before NAS.

Figure 7.

500 data points after NAS.

Figure 8.

1000 data points before NAS.

For 500 data points, Figure 5 contains 151 red sensitive points and 349 black insensitive points; after protection, Figure 6 contains 195 red sensitive points and 305 black insensitive points. For 1000 data points, Figure 7 contains 219 red sensitive points and 781 black insensitive points; after NAS protection, Figure 8 contains 244 red sensitive points and 756 black insensitive points. The 2000-point case behaves the same way: Figure 9 contains 299 red sensitive points and 1701 black insensitive points, while Figure 10 contains 322 red sensitive points and 1678 black insensitive points after protection. In every case, the number of sensitive locations increases after NAS is applied. Table 6 presents the statistical results of the data utility.

Figure 9.

1000 data points after NAS.

Figure 10.

2000 data points before NAS.

Table 6.

Test results.

Data points	Sensitive points (before)	Insensitive points (before)	Sensitive points (after)	Insensitive points (after)
500	151	349	195	305
1000	219	781	244	756
2000	299	1701	322	1678

We compare the proposed method with three existing methods: UG, AG, and KDT. Table 7 summarizes the main settings of these methods.

Table 7.

Information of method.

Method	General process
UG	Grid size, $m = \sqrt{\frac{N ε}{c}}$
AG	First grid size, $m_{1} = \max \{10, \frac{1}{4} \sqrt{\frac{N α ε}{c}}\}$ ; Second grid size, $m_{2} = \sqrt{\frac{N' (1 - α) ε'}{c}}$
KDT	Total height: 6. Switching height: 3

The formula of the evaluated standard RER (relative error rate) is given as

$$RER = \frac{| Query (D') - Query (D) |}{Query (D')} \tag{7}$$

where $Query (D')$ denotes the query result on the database modified by the method under evaluation (NAS, UG, AG, or KDT) and $Query (D)$ denotes the query result on the original database. The experimental results are shown in Figure 11 for 1000 data points and p = 220. From Figure 11, we can make the following observations:

As the privacy budget $ε$ increases, the RER decreases gradually. Data availability therefore improves, because less noise is needed to satisfy DP, at the cost of a lower privacy protection level. We obtain the best experimental results at the chosen values of p and L, and therefore set p to 100 and L to 5000.

The results of UG, AG, and KDT show that these methods become more effective as the privacy budget grows, whereas their error is relatively large when $ε$ is small. In contrast, the proposed NAS method outperforms them in every tested setting.

The parameters we need to consider are p, $ε$, and L. We next examine the effect of each single variable on the experimental results.
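For reference, the RER of equation (7) is a one-line computation. The sketch below is illustrative only: the function name `relative_error_rate` is hypothetical, the example query values are made up, and the guard against a non-positive denominator is an added assumption, since a noisy count can in principle be zero or negative.

```python
def relative_error_rate(query_modified, query_original):
    """RER = |Query(D') - Query(D)| / Query(D'), following equation (7)."""
    if query_modified <= 0:
        # Assumption of this sketch: the modified query result is a positive count.
        raise ValueError("query_modified must be positive")
    return abs(query_modified - query_original) / query_modified


if __name__ == "__main__":
    # Hypothetical example: a range count of 100 on D and 105 on the released D'.
    print(relative_error_rate(query_modified=105.0, query_original=100.0))
```

Note that equation (7) normalizes by the query on the modified database $D'$, so this sketch does the same.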

Figure 11.

2000 data points after NAS.

We use the control-variable method: we vary one variable while holding the others fixed and observe how the resulting value changes. We run the experiments in this way and organize the data into several tables to summarize the trends:

As shown in Table 8 and Figure 12, the RER keeps decreasing as the privacy budget $ε$ increases. The larger the $ε$, the less noise is needed and the lower the privacy protection level.

As shown in Tables 9 and 10, the number of data points and the number of selected regions affect the RER significantly. When the number of data points increases, the Laplace budget $ε_{2}$ must be divided more times to satisfy DP. Moreover, the amount of altered data stays the same while the number of data points keeps growing, so the RER decreases as L increases. If we hold the other factors fixed and make p larger, more regions need noise added, and the RER decreases.

Table 8.

Relation between $ε$ and RER.

$ε$	RER
0.005	0.05
0.02	0.035
0.04	0.03
0.1	0.025
0.2	0.02
0.5	0.02
1	0.02
2	0.015
5	0.015
10	0.01

Figure 12.

Relative error rate comparison of previous and proposed methods.

Table 9.

Relation between p and RER.

p	RER
40	0.1
100	0.08
160	0.065
200	0.055
240	0.05
300	0.045
360	0.04
400	0.03
440	0.03
500	0.025

Table 10.

Relation between L and RER.

L	RER
100	0.185
200	0.18
500	0.16
1000	0.13
1500	0.11
2000	0.08
2500	0.065
5000	0.04
7500	0.035
10,000	0.03

Overall, the proposed method NAS has clear advantages in both the level of location privacy protection and the effectiveness and availability of the algorithm.

Conclusion

In this article, we proposed NAS, a location data release method based on DP. The method enables LBS applications to keep providing convenience to users while protecting their private data with a strong privacy guarantee. In addition, it increases the availability of the released data. Specifically, NAS first partitions the location dataset into pieces using a region selection algorithm. It then uses the score function with the exponential mechanism to evaluate the selection result and obtain a new selection set. Next, it applies the Laplace noise release method to perturb the counts of the selected data. Finally, it updates the original dataset with the perturbed data to answer queries. We evaluated the proposed method through extensive experiments, and the results show that it achieves better availability under the same privacy guarantee.

Footnotes

Handling Editor: Fei Yu

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China (nos 61370198, 61370199, 61672379, and 61300187).

ORCID iD

Zhaobin Liu


Noise-added selection method for location-based service using differential privacy in Internet of Things
