The rapid development of mobile technology has improved users' quality of treatment, and tremendous amounts of medical information are now readily available and widely used in data analysis and applications, which brings serious threats to users' privacy. Classical methods based on cryptography and anonymity models fail in current mobile healthcare applications due to their high complexity, poor controllability, and dependence on the background knowledge of adversaries. Differential privacy is a relatively new notion of privacy and has become the de facto standard for a security-controlled privacy guarantee. In this paper, the key concepts and implementation mechanisms of differential privacy are explained, and the existing research results are summarized, including data release methods based on histograms, tree structures, time series, graphs, and frequent pattern mining. Finally, shortcomings of existing methods and suggested directions for future research are presented.
1. Introduction
With the further development of mobile technology, it is easy to collect, store, and analyze mobile healthcare information, such as physiological data, social interactions, consumption records, and trajectory data, all of which inevitably involve users' privacy. The MHR (mobile healthcare record) has been questioned over users' privacy disclosure since its inception due to the high sensitivity of health data [1, 2]. In 2015, MIT researchers (de Montjoye et al. [3]) reported that the seemingly unimportant dates and locations of only four consumption records are enough to identify 90 percent of the people in a dataset recording three months of credit-card transactions by 1.1 million users. Motivated by these concerns about users' privacy in mobile healthcare, researchers have proposed a series of privacy protection schemes, mainly including data distortion, data encryption, and restrictive release [4].
Anonymity models based on restrictive release, such as k-anonymity, l-diversity, and t-closeness, were proposed to guarantee users' privacy in healthcare applications [5]. k-anonymity was first introduced by Sweeney and Samarati [6, 7]; it has the property that each record is indistinguishable from at least k − 1 other records. However, it may still disclose private information [8, 9], so l-diversity and t-closeness were proposed to enhance privacy security. Yet all these methods fail to offer a quantifiable guarantee for users' privacy and lack a clear demarcation of the adversary's ability.
Avancha et al. [10] presented the privacy problems in mHealth systems and gave a conceptual privacy framework without specific technology; Francis et al. [11] proposed a privacy-preserving model for the e-health system which uses a trusted authority to maintain the privacy of users' health records; Sadki and El Bakkali [12] proposed an integrated privacy module to protect users' privacy in mobile healthcare. However, existing privacy-preserving models mainly depend on the background knowledge of adversaries, who may harm the data by reidentifying it [1]. Because of this dependence, privacy protection research falls into a vicious circle: a method is proposed to protect privacy and is subsequently broken; another method is then proposed to solve the problem and is broken again. Differential privacy [13], by contrast, can protect the presence of an individual regardless of the adversary's background knowledge and gives a clear demarcation of the adversary's ability.
In 2006, differential privacy, based on a probability model, was first proposed by Dwork. It can be classified as data distortion and uses random noise to ensure that the adversary fails to guess with high probability whether a tuple is present in the dataset, even if he knows all the tuples except the current one, which is the theoretical upper limit of the adversary's attacking capability. Differential privacy is a strictly provable and security-controlled method and has been quickly and widely introduced into other data application scenarios since it was proposed in the database domain [14–16]. The popularity of network applications [17, 18] will certainly promote the further development of differential privacy.
Leoni [19] reviewed differential privacy in the noninteractive setting, classifying work mainly at the technical level, such as dimensionality reduction and filtering; Xiong et al. [20] gave a survey on differential privacy and its applications but neglected time series and graph data release methods; Yang et al. [21] described a classification of differential privacy, but the detailed treatment is insufficient. As the classification in many existing surveys is incomplete and their accounts are not entirely clear, this paper summarizes the state of the art in differential privacy research at the data level, covering histogram, tree structure, time series, graph, and frequent pattern mining data release methods, and then discusses open problems through analysis and comparison of existing methods.
The rest of this paper is organized as follows. Section 2 provides preliminaries on differential privacy and its implementation mechanisms. Section 3 introduces the data processing models and releasing strategies of differential privacy. Section 4 presents the data release methods of differential privacy. Finally, Section 5 concludes the paper and presents research directions for future work.
2. Preliminaries
Differential privacy is a relatively new notion of privacy and is one of the most popular privacy concepts as well. Starting from the basic concept of privacy, this section introduces the notion, implementation mechanism, and two important properties of differential privacy.
2.1. Base Definitions
Privacy is most often defined as the right to be left alone, free from intrusion or interruption. It is an umbrella term, encompassing elements such as physical privacy, communications privacy, and information privacy [22].
Intuitively, differential privacy ensures that the removal or addition of a single database tuple does not noticeably affect the outcome of any analysis [23]. The definition of differential privacy is as follows.
Let D and D′ be two neighboring datasets which differ on at most one tuple, denoted by |D △ D′| ≤ 1. Given a randomized function A, Range(A) is the set of all possible outputs of A on D and D′. For an arbitrary subset S ⊆ Range(A), A is said to satisfy ε-differential privacy if

Pr[A(D) ∈ S] ≤ e^ε × Pr[A(D′) ∈ S].
The parameter ε is called the privacy budget and controls the ratio between the output probabilities of function A on the neighboring datasets D and D′. The smaller ε is, the closer the ratio Pr[A(D) ∈ S] / Pr[A(D′) ∈ S] is to 1; that is to say, the output probability distributions of A on D and D′ are approximately the same. Meanwhile, the higher the security becomes, the lower the utility gets. The value of ε is usually small, such as 0.01, 0.1, 0.5, or 0.8.
The strong privacy guarantee of differential privacy comes at the price of much noise added to the results of queries and analyses [21]. Dwork et al. further proposed enhancing the utility of data by relaxing the security requirement; (ε, δ)-differential privacy was proposed for this purpose.
Let D and D′ be two neighboring datasets which differ on at most one tuple, denoted by |D △ D′| ≤ 1. Given a randomized function A, Range(A) is the set of all possible outputs of A on D and D′. For an arbitrary subset S ⊆ Range(A), A is said to satisfy (ε, δ)-differential privacy if

Pr[A(D) ∈ S] ≤ e^ε × Pr[A(D′) ∈ S] + δ.
No hard rule dictates what the values of the privacy budget ε and of δ should be; they can be set according to the actual application requirements. The value of δ is usually small, and (ε, δ)-differential privacy reduces to ε-differential privacy when δ = 0.
The output probability distributions of the randomized function A on the neighboring datasets D and D′ under ε-differential privacy are shown in Figure 1 [26, 27]; the larger the noise scale is, the wider the distributions become and the higher the probability of adding large noise becomes.
Output probabilities of the randomized function A in neighboring datasets D and D′.
Function A provides the privacy guarantee by adding random noise to the output; that is, A(D) = f(D) + noise, where f is a query operation such as average, count, or sum, and the noise follows the Laplace distribution, which will be introduced in Section 2.2. The smaller the difference between the two curves is, the stronger the resulting security is.
A simple attack example is as follows. Table 1 is a health dataset that provides a count service for querying the number of people suffering from HIV. Without a privacy protection strategy, the adversary just needs to query the HIV count of the four people including Alice, which is f(D) = 2, and the count of the three people excluding Alice, which is f(D′) = 1; the adversary then learns that Alice is suffering from HIV simply from f(D) − f(D′) = 1. Obviously, a practical health data release application fails if it cannot resist such attacks.
Table 1: Health records.

Name     HIV (0 is No; 1 is Yes)     Address
Tim      0                           New York
Bob      1                           Washington
Tony     0                           New York
Alice    1                           Hawaii
⋮        ⋮                           ⋮
Now, we add random noise to the output of the count function f so that the two count results fall into the same range with high probability; for example, A(D) = f(D) + noise = 2 + noise and A(D′) = f(D′) + noise′ = 1 + noise′. Now the adversary fails to get Alice's health information by the above method, since both A(D) and A(D′) may output, say, 1.5, and the difference A(D) − A(D′) no longer equals 1 deterministically. Differential privacy assumes the maximum capability of the adversary in this instance; that is to say, the adversary knows all statistics except whether Alice is present in the dataset or not.
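The differencing attack and its noisy defense can be sketched in a few lines of Python. The records and ε value below are illustrative, mirroring Table 1; the Laplace distribution used for the noise is introduced in Section 2.2.

```python
import math
import random

def laplace_noise(scale):
    # Draw one sample from Laplace(0, scale) by inverse-CDF sampling.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_count(records, predicate, epsilon):
    # A count query has global sensitivity 1, so Lap(1/epsilon) noise
    # suffices for epsilon-differential privacy.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Hypothetical records mirroring Table 1: (name, hiv_flag).
records = [("Tim", 0), ("Bob", 1), ("Tony", 0), ("Alice", 1)]

# The differencing attack: without noise the two answers differ by exactly
# Alice's value; with noise, the difference is randomized.
with_alice = noisy_count(records, lambda r: r[1] == 1, epsilon=0.5)
without_alice = noisy_count(records[:3], lambda r: r[1] == 1, epsilon=0.5)
```

Because each call draws fresh noise, with_alice − without_alice no longer reveals Alice's attribute deterministically.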
2.2. Noise Mechanism
The two main noise mechanisms are the Laplace mechanism and the exponential mechanism, and the noise magnitude depends on the privacy budget and the global sensitivity.
Given a function f : D → R^d, the global sensitivity of f is

GS_f = max_{D, D′} ‖f(D) − f(D′)‖₁,

where D and D′ are two neighboring datasets, R^d is the d-dimensional real output space, and ‖f(D) − f(D′)‖₁ is the first-order (L1) norm distance between f(D) and f(D′).
The global sensitivity of the count function is GS_f = 1; however, GS_f is larger for other operations. For instance, if f is a sum operation over a numeric dataset, the global sensitivity equals the largest possible magnitude of a single tuple, which is larger than the sensitivity of the count function. Nissim et al. therefore first proposed the notion of local sensitivity to reduce the sensitivity magnitude, since the noise grows with the global sensitivity.
Given a function f : D → R^d, the local sensitivity of f at a dataset D is

LS_f(D) = max_{D′ : |D △ D′| ≤ 1} ‖f(D) − f(D′)‖₁.
For a concrete dataset, the local sensitivity under a sum operation can be far smaller than the global sensitivity. Local sensitivity may reduce the noise magnitude, but the noise magnitude itself might reveal information about the dataset; for details see [41, Example 1]. Nissim et al. therefore further proposed the notion of β-smooth sensitivity to avoid disclosing private information; the added noise is then proportional to a smooth upper bound on the local sensitivity.
For β > 0 and a function f : D → R^d, the β-smooth sensitivity of f at D is

S_{f,β}(D) = max_{D′} ( LS_f(D′) · e^{−β·d(D, D′)} ),

where d(D, D′) is the number of tuples in which D and D′ differ.
2.2.1. Laplace Mechanism
The literature [42] proposed the Laplace mechanism, which achieves differential privacy by adding random noise drawn from the Laplace distribution to the output. Its definition is as follows.
Definition 6 (Laplace mechanism).
Given a dataset D and a function f : D → R^d with global sensitivity Δf, the random algorithm A(D) = f(D) + Lap(Δf/ε) satisfies ε-differential privacy if the noise obeys the Laplace distribution Lap(Δf/ε), with location parameter 0 and scale parameter b = Δf/ε.
Let Lap(b) denote the Laplace distribution with location parameter 0 and scale parameter b; its probability density function is p(x) = (1/2b) e^{−|x|/b}. As Figure 2 shows, the larger b is (equivalently, the smaller ε is), the larger the noise added to the output becomes.
Let σ denote the standard deviation and σ² the variance; for Lap(b), E[X] = 0 and Var[X] = 2b². With b = Δf/ε, we obtain σ = √2 · Δf/ε and Var[X] = 2(Δf/ε)².
Probability density function.
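As a sketch of Definition 6, the following snippet implements the Laplace mechanism by inverse-CDF sampling and empirically checks the variance relation Var[X] = 2b² derived above; the sample size and parameter values are arbitrary choices.

```python
import math
import random

def lap_pdf(x, b):
    # Density of Laplace(0, b): (1 / 2b) * exp(-|x| / b).
    return math.exp(-abs(x) / b) / (2.0 * b)

def laplace_mechanism(value, sensitivity, epsilon):
    # Release value + Lap(sensitivity / epsilon) via inverse-CDF sampling.
    b = sensitivity / epsilon
    u = random.random() - 0.5
    return value - b * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

# Empirical check of Var[Lap(b)] = 2 * b**2 with b = sensitivity/epsilon = 2.
samples = [laplace_mechanism(0.0, sensitivity=2.0, epsilon=1.0)
           for _ in range(200_000)]
empirical_var = sum(x * x for x in samples) / len(samples)  # mean is 0
```

The empirical variance should come out close to 2 · 2² = 8, matching the formula above.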
2.2.2. Exponential Mechanism
The Laplace mechanism is used when the output is numerical; the exponential mechanism is another security-controlled scheme to satisfy differential privacy when the output is nonnumerical.
Intuitively, the exponential mechanism still ensures that the change of a single database tuple does not noticeably affect the outcome of the score function; the definition is as follows.
Let D denote the input dataset and let r be a potential answer; given a score function u(D, r), if a random algorithm A selects an answer r with the following probability, then algorithm A is said to satisfy ε-differential privacy:

Pr[A(D) = r] ∝ exp( ε·u(D, r) / (2Δu) ),

where Δu denotes the sensitivity of the score function u and is defined as

Δu = max_r max_{D, D′} |u(D, r) − u(D′, r)|.
The exponential mechanism outputs nonnumerical results according to their score-function values. From the definition above, the output probability depends on the privacy budget: when ε is larger, the highest-scored result is output with higher probability, and as the gap between the output probabilities grows, the security decreases; conversely, the smaller ε is, the higher the security will be.
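The selection rule above can be sketched as follows; the dataset, candidate answers, and score function are hypothetical examples, with a count-based score of sensitivity 1.

```python
import math
import random

def exponential_mechanism(dataset, candidates, score, sensitivity, epsilon):
    # Select candidate r with probability proportional to
    # exp(epsilon * u(D, r) / (2 * sensitivity)).
    weights = [math.exp(epsilon * score(dataset, r) / (2.0 * sensitivity))
               for r in candidates]
    pick = random.random() * sum(weights)
    acc = 0.0
    for r, w in zip(candidates, weights):
        acc += w
        if pick <= acc:
            return r
    return candidates[-1]

# Hypothetical example: privately choose the most frequent label.
data = ["flu", "flu", "flu", "cold"]
labels = ["flu", "cold", "hiv"]
count_score = lambda d, r: d.count(r)  # one tuple changes a count by at most 1
choice = exponential_mechanism(data, labels, count_score,
                               sensitivity=1.0, epsilon=2.0)
```

With a larger ε the top-scored label wins almost surely; with a small ε all labels become nearly equally likely, which is exactly the security-utility trade-off described above.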
2.3. Combination Properties
Differential privacy has two important properties [44], as follows.
Property 1 (sequential composition).
Let mechanisms A₁, …, Aₙ satisfy ε₁-, …, εₙ-differential privacy, respectively; then the sequence of mechanisms applied to the same dataset D satisfies (Σᵢ εᵢ)-differential privacy.
Sequential composition shows that the privacy budget, and hence the error, accumulates linearly when multiple differentially private mechanisms are used to release data from the same dataset.
Property 2 (parallel composition).
Let mechanisms A₁, …, Aₙ satisfy ε₁-, …, εₙ-differential privacy, respectively; then the mechanisms applied to pairwise disjoint datasets D₁, …, Dₙ satisfy (maxᵢ εᵢ)-differential privacy.
Parallel composition shows that the degree of the privacy guarantee depends on the value of maxᵢ εᵢ; as maxᵢ εᵢ grows, the security decreases gradually.
These two properties play a major role in proving whether an algorithm satisfies differential privacy and in judging whether a budgeting strategy is reasonable [4]. For instance, given an algorithm A with n steps where each step has budget ε/n, the algorithm A satisfies ε-differential privacy by sequential composition.
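The two composition rules reduce to simple arithmetic on the budgets, as this sketch illustrates:

```python
def sequential_budget(epsilons):
    # Sequential composition on the same dataset: budgets add up.
    return sum(epsilons)

def parallel_budget(epsilons):
    # Parallel composition on disjoint datasets: the maximum budget counts.
    return max(epsilons)

# An n-step algorithm with per-step budget eps / n satisfies eps-DP overall.
n, eps = 5, 1.0
total = sequential_budget([eps / n] * n)  # equals eps
```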
Differential privacy has great advantages over traditional privacy protection models; however, some data utility must be sacrificed to satisfy differential privacy. From the definition of ε-differential privacy, the level of similarity between query results on neighboring datasets depends on the value of ε: the smaller ε is, the more similar the query results become and the stronger the privacy guarantee for the raw data is; at the same time, a smaller ε leads to larger noise and poorer data utility, and vice versa. In short, utility and security are a pair of contradictions, so the aim of privacy protection is to ensure a very high level of privacy while simultaneously providing extremely accurate information about the database [13].
In order to balance the security and the utility of the data, we need to consider how to answer the maximum number of queries with the smallest budget before exhausting ε, or how to enhance query accuracy under a given ε. Existing methods usually use relative error [31], absolute error [31], variance [30, 38], standard deviation [31], and false negatives [40] to evaluate the rationality of an algorithm.
3. Data Process Models and Release Strategies
There are two data processing models in differential privacy, namely, the noninteractive model and the interactive model; the latter is also called the online query model, as the data requester is only allowed to access the information through an interface provided by the data owner. Similarly, the noninteractive model is called the offline query model, as the data requester can directly access the sanitized dataset released by the data owner [20, 42].
The interactive model is shown in Figure 3. The data owner provides a data query algorithm based on differential privacy, and the data requester sends a query request; on receiving the request, the query algorithm gets the raw data from the original database, sanitizes it, and returns the sanitized result to the requester. In this model, the number of queries is restricted by the privacy budget ε: more queries lead to a smaller budget for each query if the total budget ε is a constant, so larger noise is added to each query result, and in the end the results become useless. Hence, how to design the query algorithm to provide the maximum number of queries under a limited budget ε is the key problem in this model.
Interactive model.
The noninteractive model is shown in Figure 4. The data owner is a trusted curator and releases a sanitized dataset; the data requester sends a query request against the sanitized dataset, and the noisy result is returned. In this model, the number of queries is not restricted by the privacy budget ε, so how to design an efficient release algorithm that enhances query accuracy is the key problem.
Noninteractive model.
Differential privacy research is based on the two models above, and there are roughly three data release strategies, as follows.
Strategy 1.
Noise is added to the raw data D to get noisy data D*, which can be further optimized by postprocessing, for example, the least squares method, to obtain optimized data D** with higher query accuracy; then D** is released. In this strategy, ε is usually budgeted uniformly; the process is shown in Algorithm 1.
Strategy 2.
First, conversion and compression, which usually reduce the sensitivity of the function f, are applied to obtain synthetic data (for example, a graph is converted into a tree structure); then noise is added to the synthetic data to get the noisy data D*. In this strategy, ε is budgeted uniformly; finally, D* is released. The process of Strategy 2 is shown in Algorithm 2.
Algorithm 2.
Input. Raw data D; privacy budget ε.
Output. Released data D*.
Strategy 3.
Noise is added to the raw data D to get noisy data D*; note that ε is budgeted nonuniformly, that is, ε = ε₁ + ε₂ + ⋯ + εₙ, and a reasonable budgeting strategy is used to reduce the data query error. The process of Strategy 3 is shown in Algorithm 3.
Algorithm 3.
Input. Raw data D; privacy budgets ε₁, …, εₙ with ε = Σᵢ εᵢ.
Output. Released data D*.
These three data release strategies can be used alone or in combination. For instance, in a given data release method, Strategy 3 may first be used to allocate the budget reasonably, and Strategy 1 may then be used to further enhance query accuracy by postprocessing.
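A minimal sketch of Strategy 1 for a vector of counts, assuming sensitivity 1 and using non-negativity clamping as a simple stand-in for the least squares postprocessing mentioned above:

```python
import math
import random

def lap(scale):
    # One sample from Laplace(0, scale) by inverse-CDF sampling.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def strategy1_release(counts, epsilon):
    # Strategy 1: perturb every count (sensitivity 1, uniform budgeting),
    # then post-process.  Clamping to non-negative values stands in for the
    # least squares refinement described in the text.
    noisy = [c + lap(1.0 / epsilon) for c in counts]  # noisy data D*
    return [max(0.0, x) for x in noisy]               # optimized data D**

released = strategy1_release([4, 0, 7, 2], epsilon=1.0)
```

The postprocessing step consumes no extra privacy budget, since it only touches the already-noised values.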
4. Data Release Categories
How to ensure the level of privacy guarantee while simultaneously providing high utility of the data is a real concern, and many methods have been proposed for it. In this section, we give our categorization of differential privacy data release methods.
4.1. Histogram Release
A histogram is constructed by splitting the input dataset into mutually disjoint subsets called buckets or bins depending on a set of attributes. Any access to the raw data is conducted through the differential privacy interface when users send a data query, and the differentially private histogram is used to answer the users' queries; the process is shown in Figure 5 [46].
Differentially private histogram release.
The most straightforward method is to add Laplace noise to each histogram bucket using privacy parameter ε, with the count-function sensitivity Δf = 1. This method fits unit-length and short-range queries, but it leads to larger query error for long-range queries (i.e., when the number of buckets n covered by the query is large), since by the noise variance formula the range-query error grows as 2n/ε². To reduce the query error, multiple buckets are merged into one partition; the count of each bucket in a partition is then taken as the average of the counts of the buckets it contains, as shown in Figure 6.
Example of bucket merging.
Noise is added to each bucket before the merger; with seven buckets the total noise is 7 × noise, and after the merge operation into three partitions the total noise becomes 3 × noise, smaller than before the merger. However, the merge operation introduces an approximation error, as the count of each bucket in a partition is replaced by the average of the counts of the buckets it contains.
We need to make the number of partitions as small as possible to reduce the noise error, and make the counts of the buckets within a partition as uniform as possible to reduce the approximation error. In general, a finer-grained partitioning introduces smaller approximation error but larger noise error, so finding the right balance between approximation error and noise error is a key question [46].
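The trade-off between the two error sources can be made concrete with a small computation; the bucket counts and partition layout below are invented for illustration, and the noise error is measured by the variance 2/ε² of Lap(1/ε):

```python
def histogram_errors(buckets, partition_sizes, epsilon):
    # Total squared approximation error and total noise variance when the
    # buckets are merged into consecutive partitions of the given sizes.
    noise_var = 2.0 / (epsilon * epsilon)  # Var of Lap(1/epsilon)
    approx_err, noise_err, i = 0.0, 0.0, 0
    for size in partition_sizes:
        part = buckets[i:i + size]
        mean = sum(part) / size
        approx_err += sum((c - mean) ** 2 for c in part)
        # One noisy total per partition, shared evenly by its buckets:
        # per-bucket variance noise_var / size**2, times size buckets.
        noise_err += noise_var / size
        i += size
    return approx_err, noise_err

counts = [5, 6, 5, 20, 1, 2]
fine = histogram_errors(counts, [1] * 6, epsilon=0.5)      # no merging
coarse = histogram_errors(counts, [3, 1, 2], epsilon=0.5)  # merged
```

Merging lowers the noise error at the cost of a nonzero approximation error, exactly the balance discussed above.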
Xiao et al. [29, 46] proposed a partitioning strategy based on the kd-tree. It seeks to merge buckets which are close to uniform into a partition using a variance-like metric, defined over the current partition as the sum of squared deviations of its bucket counts from the partition's average count. If this metric exceeds a given threshold T, the current partition is split into two partitions at the median of the bucket counts in the partition. The process repeats until the whole domain is split into several partitions, each close to uniform.
The process of kd-tree based histogram release is as follows. First, the original histogram is constructed according to the k different attributes of the raw data, and noise is added to it using part of the privacy budget to generate a noisy k-dimensional dataset. Next, the noisy k-dimensional dataset is partitioned by the kd-tree algorithm to obtain the final partitions. Then, noise is added to each partition using the remaining privacy budget. Finally, the noisy histogram is released. The performance of the algorithm is determined by the threshold parameter and by the tree depth of the partitioning or the number of tuples in each partition. The kd-tree based histogram release algorithm was further applied to health information de-identification in Xiao et al.'s subsequent research under the name DPCube, which is based on Strategy 1, with the kd-tree used for postprocessing after adding noise. DPCube supports multidimensional data and long-range queries, and the query error may be low if the parameter values are selected properly.
In view of the balance between the approximation error and the noise error, Xu et al. [15] proposed the notion of the Sum of Squared Error (SSE) to reduce the approximation error and enhance query accuracy. Let C = {c₁, …, cₙ} be the series of counts of the original histogram, where cᵢ is the count of each bucket, and let H be the partitioned histogram that merges neighboring counts into k partitions B₁, …, B_k, where each partition Bⱼ covers an interval of buckets. The SSE is the sum over all partitions of Σ_{cᵢ ∈ Bⱼ} (cᵢ − b̂ⱼ)², where b̂ⱼ is the optimal representative value for Bⱼ, generally the mean of the counts in Bⱼ.
Two algorithms, NoiseFirst and StructureFirst, were proposed, both using SSE to measure the error between the histogram H and the original count data D. NoiseFirst, based on Strategy 1, constructs the optimal histogram by dynamic programming after adding noise and fits short-range queries. StructureFirst, based on Strategy 2, generates the optimal partition structure from the original count sequence, which risks privacy disclosure; it therefore first uses part of the privacy budget ε₁ to protect the sensitive structure information in the histogram and then adds noise to the partitions using the remaining privacy budget (ε − ε₁); StructureFirst fits long-range queries. StructureFirst and DPCube by Xiao et al. are very similar; the main difference is that the two steps in DPCube allocate the privacy budget uniformly, while StructureFirst uses a near-optimal budgeting strategy. Both methods need to reconstruct the original histogram, and the number of partitions k is the key factor affecting their performance.
To reduce the error of long-range queries, Hay et al. [47] proposed optimizing the histogram by a hierarchical tree structure. The histogram is transformed into a hierarchical tree in which all queried intervals are arranged, with the unit-length intervals as the leaves [48], and the query accuracy is enhanced by a consistency check between the nodes of adjacent layers of the tree.
Regarding the hierarchical method for optimizing histograms, Qardaji et al. [48] studied the factors that affect the mean squared error (MSE) of range queries under consistency checks and different budgeting strategies. In one dimension, the hierarchical method can reduce the MSE when the branching factor is more than 4; however, it fails to adapt to multidimensional histograms when the dimensionality is greater than or equal to 3. In that case, adding noise directly to each unit bucket of the histogram, which Qardaji et al. call the Flat method, outperforms the hierarchical method. A detailed analysis of histogram release methods is shown in Table 2.
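A minimal sketch of the hierarchical idea: a binary tree over the buckets, with the per-level budget split uniformly by sequential composition (the consistency-check step is omitted for brevity):

```python
import math
import random

def lap(scale):
    # One sample from Laplace(0, scale) by inverse-CDF sampling.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def hierarchical_counts(buckets, epsilon):
    # Build a binary interval tree over the histogram and noise every level.
    # Each tuple is counted once per level, so a tree with h levels splits
    # the budget into epsilon / h per level by sequential composition.
    levels = [list(buckets)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([sum(prev[i:i + 2]) for i in range(0, len(prev), 2)])
    scale = len(levels) / epsilon  # Lap(1 / (epsilon / h)) per node
    return [[c + lap(scale) for c in lvl] for lvl in levels]

# A long-range query can now be answered from a few internal nodes instead
# of many leaves, which reduces the accumulated noise.
tree = hierarchical_counts([3, 1, 4, 1, 5, 9, 2, 6], epsilon=1.0)
```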
In a word, methods based on histogram release usually need to reconstruct the original histogram. As the number of partitions k grows, the approximation error decreases but the noise error increases; meanwhile, the time cost grows, which restricts their application. For instance, the time complexity of StructureFirst depends heavily on k, so how to improve the efficiency of the algorithm and reduce the approximation error, especially for higher-dimensional data, is one research direction for histograms. Moreover, many algorithms adopt uniform budgeting, and how to allocate the privacy parameter ε according to the application requirements to enhance query accuracy is another research direction. Note that Li et al. [49] first proposed DPSynthesizer, an open-source visual toolkit for differentially private histogram data release, which lets the user view one-dimensional or multidimensional differentially private histogram data directly as graphs.
4.2. Tree Structure Release
To seek differential privacy budgeting strategies and reduce query error, a series of methods based on tree-structured data splitting were proposed; they are known as private spatial decompositions, which divide geospatial data into smaller regions and obtain statistics on the points within each region [30]. If the partition discloses sensitive information during spatial decomposition, it is called data-dependent decomposition; otherwise, it is called data-independent decomposition. For instance, the kd-tree is recursively split at the median data value, which discloses the median itself, so it is a data-dependent decomposition.
(1) Data-Dependent Decomposition. As shown in Figure 7, a node discloses its split value during splitting; to mask the real information, noise is added to the node. With the data in D sorted in ascending order and m its median value, the noisy median m + noise is used; however, the noisy median may go beyond the range of the data in D. For this issue, Inan et al. [50, 51] proposed to adopt the mean instead of the median for numerical data, where the mean is approximated as the ratio of the sum of noisy counts to the number of counts; but this method fails to ensure the degree of approximation between the mean and the median.
Example of a private kd-tree.
For an indexing structure based on the kd-tree, Xiao et al. [29, 46] first spend part of the privacy budget on the original data, then construct the kd-tree and compute the median from the noisy data, thereby guaranteeing the privacy of the median.
Cormode et al. [30] proposed the EM method, which chooses the median according to the exponential mechanism within the kd-tree; EM returns a value x with probability determined by a score based on the rank of x in the data D. Since the returned x is approximately equal to the true median, the median is protected. After the selection, privacy budget ε₁ is used for the median computation and privacy budget ε₂ for the count computation, with ε = ε₁ + ε₂. A larger ε₁ leads to a more accurate median but a larger count error, while a smaller ε₁ gives a more accurate count but a larger median error; thus, how to allocate the budget ε between the count computation and the median computation, balancing the two errors, is a concern. The above algorithm based on the kd-tree and EM medians is called kd-standard.
(2) Data-Independent Decomposition. Cormode et al. [30] proposed the algorithm Quad-opt based on the quadtree, which recursively divides the data space into equal quadrants, does not disclose data information, and thus belongs to data-independent decomposition. As shown in Figure 8, the space rooted at node a is divided into four quadrants, and each quadrant continues to divide until the predefined tree depth h is reached (h begins with 0). Compared with classical uniform budgeting, that is, εᵢ = ε/(h + 1) for each level, Cormode et al. proposed a novel geometric budgeting strategy, which allocates a geometrically increasing budget εᵢ to each level i of the tree subject to Σᵢ εᵢ = ε and yields a smaller upper bound on the query error, where the error of a range query Q is the sum of the variances of the nodes contained in Q. Quad-opt further processes the noisy data through the least squares method during postprocessing to enhance query accuracy: for example, a node's own noisy count and the sum of its children's noisy counts can be combined, weighted by their respective budgets, to obtain an optimized estimate for the node.
Example of private quadtree: the numbers are the actual counts.
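The geometric budgeting idea can be sketched as follows; the growth factor used here is an illustrative placeholder, since the specific factor is derived in Cormode et al.'s paper:

```python
def geometric_budgets(epsilon, height, factor=2.0 ** (1.0 / 3.0)):
    # Per-level budgets that grow geometrically toward the leaves and sum
    # to epsilon (sequential composition down a root-to-leaf path).  The
    # default growth factor is an assumption for illustration only.
    raw = [factor ** i for i in range(height + 1)]
    total = sum(raw)
    return [epsilon * r / total for r in raw]

# Deeper levels get more budget, hence less noise per node.
budgets = geometric_budgets(epsilon=1.0, height=3)
```

Because range queries touch many leaf nodes but few high-level nodes, spending more budget near the leaves lowers the total query variance compared with uniform budgeting.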
If all the privacy budget is allocated to the leaf nodes, the algorithm is equivalent to directly dividing the data space into cells. When the data is very sparse, adding noise to each cell leaves little usable information; inspired by this, Fan et al. [32] proposed the Spatial Estimation Algorithm (SEA), which merges similar cells into one group or partition to overcome the sparsity of the data. SEA first models the types of cells based on domain knowledge before splitting; that is, each cell is either sparse or dense, and a partition is called homogeneous if all the cells it contains belong to the same category, in which case it need not be split; otherwise, splitting continues until the partition is homogeneous or the predefined tree depth is reached. Then noise is added to each partition after the splitting has completed, and each cell receives noise/nᵢ, where nᵢ is the number of cells in the partition. Finally, SEA enhances the query accuracy while providing a very high level of privacy guarantee through the cell-merge operation, essentially by reducing the noise added per cell.
Partitions based on the quadtree or kd-tree fit one-dimensional or two-dimensional data; for higher-dimensional data, Qardaji et al. [31] proposed the method UG, based on a uniform grid, which can be seen as a special n-ary tree. There are two sources of error in the grid method: noise error, introduced by adding Laplace-distributed random noise, and nonuniformity error, introduced by assuming that the data points are distributed uniformly; the nonuniformity error is proportional to the number of data points in the cells that fall on the border of the query rectangle [31]. The total query error combines these two errors.
UG splits the space into cells; given a query Q, the query error depends on N, the number of data points, on r, the ratio of the area of Q to the area of the whole domain, and on a constant factor. Note that UG prescribes the granularity w of the partition, which is the key factor affecting the accuracy of the query results.
UG treats all cells equally, whether sparse or dense. When a cell is very sparse, the noise error dominates; when it is very dense, the nonuniformity error dominates. Qardaji et al. further proposed the adaptive grids method AG to balance the two errors: when a region is dense, a finer granularity is used to reduce the nonuniformity error; when a region is sparse, a coarser granularity is used to reduce the noise error.
AG first splits the space into coarse cells. If a cell is too dense, it is further split into finer cells; a count query is then issued for each coarse cell and each finer cell using privacy budgets ε₁ and ε₂, respectively, and the two noisy counts are combined with weights chosen to minimize the query error. However, UG and AG lack an adaptive criterion for judging whether a cell is sparse or dense. At the same time, the assumption that the nonuniformity error is proportional to the number of data points in the boundary cells is derived for square query regions, which does not always hold, since the shape of a query may be rectangular, and the query results are influenced severely by the parameter selection. A detailed analysis of tree structure release methods is shown in Table 3.
[Table 3 excerpt: balances the noise error and the nonuniformity error; lacks an adaptive judging criterion; uniform budgeting.]
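The UG pipeline described above can be sketched in a few lines of Python. The helper names, the unit-square domain, and the full-cover query approximation (which ignores partial cell overlap, the source of the nonuniformity error) are assumptions made for this sketch, not details of [31].

```python
import math
import random

def lap(scale):
    """Sample Laplace noise with the given scale via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

def uniform_grid_release(points, w, epsilon):
    """Sketch of UG: partition the unit square into a w x w grid and
    perturb each cell count. Each point falls in exactly one cell, so the
    histogram has sensitivity 1 and Lap(1/epsilon) noise per cell suffices."""
    counts = [[0.0] * w for _ in range(w)]
    for x, y in points:
        i = min(int(x * w), w - 1)
        j = min(int(y * w), w - 1)
        counts[i][j] += 1
    return [[c + lap(1.0 / epsilon) for c in row] for row in counts]

def range_query(noisy, w, x0, y0, x1, y1):
    """Answer a rectangle query by summing noisy cells whose centers fall
    inside the query (partial overlap is ignored in this sketch)."""
    total = 0.0
    for i in range(w):
        for j in range(w):
            cx, cy = (i + 0.5) / w, (j + 0.5) / w
            if x0 <= cx <= x1 and y0 <= cy <= y1:
                total += noisy[i][j]
    return total
```

With a large ε the noisy answer stays close to the true count, while small ε makes the per-cell noise, and hence the total error over many cells, dominate.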
In conclusion, data-independent decomposition poses no privacy risk for the tree structure itself, but how to reduce the noise error remains the key open question. In Fan et al.'s data categorization there are only sparse data and dense data, and this partition of categories is too coarse; how to partition the data categories based on variance or information entropy to enhance query accuracy is one direction for future research. In data-dependent decomposition, some privacy budget is needed to protect the kd-tree itself, since the tree discloses median information; thus, how to balance the allocation of the privacy budget to raise query accuracy is another direction for future research.
4.3. Time Series Release
Many kinds of real-time data are released under differential privacy, such as MHR, GPS, and trajectory data. Real-time data exhibits high correlation between time stamps over a timescale. If the length of the time series is T, then under uniform budgeting each time stamp receives only ε/T of the budget, so the noise added to the data at each time k grows proportionally to T; when T is large, the added noise becomes large and leads to poor utility of the data.
For time series data, in order to reduce this error, Rastogi and Nath [52] proposed an algorithm based on the discrete Fourier transform (DFT). For a time series D, the algorithm first executes the DFT and retains only the first k DFT coefficients; Laplace noise is then added to those coefficients, and the inverse DFT is executed on the noisy coefficients. Finally, the perturbed series is released. However, the algorithm needs to execute the DFT and inverse DFT over the full series, so it is not suited to real-time applications [32].
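The keep-k-coefficients-perturb-invert pipeline can be sketched as follows. The per-coefficient noise scale √k/ε is a simplified stand-in for the calibration derived in [52], and all function names are illustrative.

```python
import cmath
import math
import random

def lap(scale):
    """Sample Laplace noise with the given scale."""
    u = random.random() - 0.5
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

def fpa(series, k, epsilon):
    """Sketch of the DFT-based release: compute the DFT, keep the first k
    coefficients, perturb them, and reconstruct with the inverse DFT."""
    n = len(series)
    coeffs = [sum(series[t] * cmath.exp(-2j * math.pi * f * t / n)
                  for t in range(n)) for f in range(k)]
    # Perturb real and imaginary parts; scale sqrt(k)/epsilon is a
    # simplified choice for this sketch.
    noisy = [c + complex(lap(math.sqrt(k) / epsilon),
                         lap(math.sqrt(k) / epsilon)) for c in coeffs]
    # Inverse DFT using only the retained noisy coefficients.
    return [sum(noisy[f] * cmath.exp(2j * math.pi * f * t / n)
                for f in range(k)).real / n for t in range(n)]
```

Because only k of the n coefficients carry noise, a smooth series can be reconstructed with far less total perturbation than adding noise at every time stamp.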
For real-time traffic monitoring data, in order to raise the execution efficiency and query accuracy in a real-time environment, Fan et al. [32] proposed the Temporal Estimation Algorithm (TEA) based on a time series model. Given the space G, TEA first splits the space into cells, and the time span T is discretized into time indices; the frequency series of a cell c records its count at each time stamp k. A process model is then constructed from domain knowledge, such as locations, population, and road networks; for each cell c, the model captures the degree of variation between neighboring time stamps, and Laplace noise calibrated to a uniform budget over the time span T is added to the frequency of c at each time stamp k. For estimation purposes, the Laplace noise is approximated by white Gaussian noise. Finally, the noisy data is released. TEA uses the time series model and this educated guess to enhance query accuracy; meanwhile, its computational cost is low.
For the time span T, if a real-time system has a higher sampling rate, larger noise must be introduced to satisfy differential privacy; conversely, the approximation error grows if the system has a lower sampling rate, so selecting the sampling rate is the key. Fan et al. proposed the adaptive sampling algorithm FAST [33, 34] to solve this problem in subsequent research. FAST adjusts the sampling rate automatically according to the data dynamics: the rate is increased when the data changes rapidly and decreased otherwise. The dynamics of the current data are described by the feedback error, defined as the relative difference between the posterior estimate and the prior estimate at each sampling point, with a user-assigned parameter δ guarding against very small estimates. Because the sampling frequency is not fixed, in order to approach the optimal sampling rate quickly, FAST uses a PID controller to adjust the rate: the PID algorithm takes the feedback errors as input and outputs the PID error Δ, the weighted sum of the proportional, integral, and derivative components, and the new sampling rate is then computed from Δ and the user-specified parameters θ and ξ. This adaptive sampling substantially reduces the noise error compared with uniform budgeting.
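The PID-based rate adaptation described above can be sketched as follows. The coefficient values and the interval-update formula (an exponential form in Δ and ξ) follow the general description but are illustrative assumptions, not FAST's tuned values.

```python
import math

def pid_error(feedback_errors, cp=0.9, ci=0.1, cd=0.0, window=5):
    """PID error: weighted sum of the proportional (latest error),
    integral (recent average), and derivative (latest change) terms.
    Coefficients are illustrative, not FAST's tuned values."""
    e = feedback_errors
    proportional = e[-1]
    integral = sum(e[-window:]) / min(len(e), window)
    derivative = (e[-1] - e[-2]) if len(e) > 1 else 0.0
    return cp * proportional + ci * integral + cd * derivative

def next_interval(interval, delta, theta=10.0, xi=0.2):
    """Adapt the sampling interval: a large PID error delta (data changing
    fast) shrinks the interval, i.e., raises the sampling rate; a small
    delta grows it. theta and xi play the role of the user parameters
    mentioned above; the exact formula is a sketch."""
    return max(1, int(interval + theta * (1 - math.exp((delta - xi) / xi))))
```

For example, with the defaults, a near-zero error lengthens a 10-step interval, while a large error collapses it to 1 so the next sample is taken immediately.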
For web browsing behavior data, exploiting the rich correlation of the time series, Fan et al. proposed the session-level differentially private algorithm U-KF [35], based on the user-level framework FAST. U-KF releases differentially private aggregates in real time using a state-space model and adopts a Kalman filter [53] during postprocessing. Given the original series, where each value represents the number of browsing sessions at time k, U-KF proceeds in three steps. First, in the prediction step, the prior estimate is derived from the previous posterior estimate [35]. Then the perturbed value is computed by adding Laplace noise calibrated to the query sensitivity. Next, in the correction step, the posterior estimate is computed from the perturbed value and the prior estimate using the Kalman gain [53]. Finally, the posterior estimate is released. U-KF is a univariate time series algorithm with high query accuracy, and its computing time is proportional to the number of web pages; however, U-KF only supports browsing requests on a single server, failing to support requests across different servers. Detailed analysis of time series release methods is shown in Table 4.
[Table 4 excerpt: low computing complexity and high query accuracy; only supports requests on a single server; uniform budgeting.]
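The predict-perturb-correct loop behind U-KF can be sketched with a one-dimensional Kalman filter. This is a minimal sketch assuming a constant-state (random-walk) process model and approximating the Laplace measurement noise by its variance in the update; the process noise q, the initialization, and all names are illustrative.

```python
import math
import random

def lap(scale):
    """Sample Laplace noise with the given scale."""
    u = random.random() - 0.5
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

def kalman_release(stream, epsilon, q=1.0, sensitivity=1.0):
    """Perturb each count with Laplace noise, then smooth the noisy
    measurements with a 1-D Kalman filter as postprocessing."""
    r = 2.0 * (sensitivity / epsilon) ** 2   # variance of Lap(sensitivity/eps)
    est, p = stream[0], 1.0                  # posterior state and variance
    out = []
    for x in stream:
        prior, p_prior = est, p + q          # prediction step (random walk)
        z = x + lap(sensitivity / epsilon)   # differentially private measurement
        k = p_prior / (p_prior + r)          # Kalman gain
        est = prior + k * (z - prior)        # correction step
        p = (1 - k) * p_prior
        out.append(est)
    return out
```

Because the filter runs only on already-perturbed values, it is pure postprocessing and consumes no additional privacy budget.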
For record data, in order to enhance query accuracy, Xiao et al. [54, 55] proposed the Privelet method based on the wavelet transform. Privelet does not directly perturb the frequency matrix M of the input data; it first transforms M into a wavelet coefficient matrix C, which is perturbed with Laplace noise; the noisy coefficient matrix is then mapped back into a noisy frequency matrix, which is finally released. This shows that differential privacy based on the wavelet transform is feasible. For real-time data such as smart meter readings, Ferrández-Pastor et al. [56] indicated that using the wavelet transform to analyze user behavior is feasible, so time series release combined with the wavelet transform is one direction for future research. At the same time, how to split the time span T is the key issue for time series release methods; the partition granularity of T should be chosen in combination with the features of the data, such as stepped data, and how to split the data adaptively is also a future research direction for time series.
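A one-dimensional analogue of Privelet's transform-perturb-invert pipeline can be sketched with the Haar wavelet. Real Privelet calibrates a different noise scale per wavelet level; this sketch uses a single scale for brevity, and all names are illustrative.

```python
import math
import random

def haar(v):
    """Recursive Haar transform of a length-2^k vector: averages first,
    then the detail (difference) coefficients."""
    if len(v) == 1:
        return v[:]
    avg = [(v[i] + v[i + 1]) / 2 for i in range(0, len(v), 2)]
    diff = [(v[i] - v[i + 1]) / 2 for i in range(0, len(v), 2)]
    return haar(avg) + diff

def inv_haar(c):
    """Inverse of haar(): rebuild pairs from averages and differences."""
    if len(c) == 1:
        return c[:]
    half = inv_haar(c[:len(c) // 2])
    out = []
    for a, d in zip(half, c[len(c) // 2:]):
        out += [a + d, a - d]
    return out

def privelet_release(freq, epsilon):
    """Sketch of the pipeline: transform, perturb coefficients, invert.
    A single Laplace scale is used here; Privelet weights it per level."""
    def lap(scale):
        u = random.random() - 0.5
        return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)
    noisy = [c + lap(1.0 / epsilon) for c in haar(freq)]
    return inv_haar(noisy)
```

The benefit of perturbing in the wavelet domain is that a range query touches only O(log n) coefficients, so its noise grows polylogarithmically instead of linearly in the range length.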
4.4. Graph Release
Nowadays, digital health social networks often contain users' sensitive information, and directly releasing these network data may harm individual privacy, so the network data should be sanitized before release. Network data differs from general data: according to graph theory, queries on it are highly sensitive to small changes when the clustering coefficient is large. For instance, masking the presence or absence of a single edge in a subgraph counting query requires adding large random noise, and larger noise implies poorer utility of the data. In order to enhance the utility of the sanitized data, a series of methods has been proposed.
Hay et al. [57] first proposed edge differential privacy for network data, or differential privacy for short. Given graphs G1 = (V1, E1) and G2 = (V2, E2), if V1 = V2 and E1 and E2 differ in exactly one edge, then G1 and G2 are neighbors, denoted G1 ~ G2.
Let G1 and G2 be two neighboring graphs which differ in at most one edge. Given a randomized function A, let Range(A) be the set of all possible outputs of A on G1 and G2; for an arbitrary subset S ⊆ Range(A), A is said to satisfy ε-edge differential privacy (or ε-differential privacy) if Pr[A(G1) ∈ S] ≤ e^ε · Pr[A(G2) ∈ S].
Similarly, edge differential privacy can be extended to multiple edges: if neighboring graphs differ in k edges, then A is said to satisfy ε-k-edge differential privacy. Furthermore, if neighboring graphs may differ in all edges connected to one single node, then A is said to satisfy ε-node differential privacy, which is the ultimate goal desirable to achieve; however, it introduces excessive noise and causes the worst data utility. In the subsequent sections, A satisfying ε-edge differential privacy is taken as the privacy standard.
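As a tiny example of the edge-DP Laplace mechanism: the total edge count changes by exactly 1 between edge-neighboring graphs, so it can be released with very little noise, which is precisely why low-sensitivity graph statistics are preferred in what follows. The function name is illustrative.

```python
import math
import random

def noisy_edge_count(edges, epsilon):
    """Under edge differential privacy, the query "how many edges does G
    have" has sensitivity 1 (neighbors differ in one edge), so adding
    Lap(1/epsilon) noise to the count satisfies epsilon-edge DP. A
    subgraph count such as triangles would need far larger noise."""
    def lap(scale):
        u = random.random() - 0.5
        return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)
    return len(edges) + lap(1.0 / epsilon)
```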
In order to raise the utility of a graph under the requested level of privacy, the graph's structure needs to be captured first; after noise is added, the structure is converted back into a synthetic graph intended to be close to the original. Sala et al. [58] proposed the Pygmalion method, which captures the graph's structure according to the dK-series graph model [59].
The dK-series model captures the graph's structure as a set of statistics called the dK-series: the degree distributions of d-node subgraphs. As shown in Figure 9, when d = 1, it is the node degree distribution; when d = 2, it is the joint degree distribution of 2-node subgraphs. A matching generator then produces a synthetic graph from the dK-series.
Example of a dK-series.
Pygmalion first captures the dK-series from the original graph, then sorts and clusters the series into a set of subseries. Next, each subseries is perturbed with random noise calibrated to the local sensitivity. Finally, a synthetic graph is generated from the perturbed dK-series. However, although local sensitivity reduces the noise magnitude, it may itself reveal sensitive information.
Wang and Wu [36] also used the dK-series graph model to capture the graph's structure but, in order to strictly achieve differential privacy and avoid leaking information through the sensitivity itself, adopted the smooth sensitivity. For the 2K-series, Wang and Wu proposed a method satisfying (ε, δ)-differential privacy: it first uses the parameters (ε, δ) to compute the smoothing parameter β, then obtains the β-smooth sensitivity from β and the local sensitivity; next, it draws random noise scaled to the smooth sensitivity and adds it to the 2K-series; finally, a new graph is generated from the perturbed series.
Wang et al. [37] proposed another method, LNPP, which uses spectral graph theory to protect the graph data. Spectral graph analysis mainly uses the graph's adjacency matrix to describe its topological structure: the adjacency matrix A represents the graph G, with entry 1 if there is an edge between nodes i and j and 0 otherwise. LNPP first decomposes the matrix A to obtain its eigenvalues and eigenvectors; these are then perturbed with the Laplace mechanism, and the output eigenvectors are postprocessed with a vector orthogonalization technique [60], since the perturbed eigenvectors are no longer orthonormal [37].
Both the spectral approach and the dK-series model guarantee privacy, but the noise introduced by the former grows with the number of vertices n in the input graph, and the sensitivity of the latter also depends on the graph's structure. In order to raise the utility of the data, Xiao et al. [38] proposed a novel method based on the hierarchical random graph (HRG) model to reduce the magnitude of the perturbation noise.
A classical graph model is constructed from the observed edges, whereas HRG uses connection probabilities between vertices. In the HRG model, a graph G is represented by a hierarchical structure and the connection probabilities. The hierarchical structure of G is a rooted binary tree T with n leaf nodes corresponding to the n vertices of G [38]. Each internal node r carries a probability p_r; for any two vertices i and j of G with lowest common ancestor r in T, their connection probability is p_r = E_r / (L_r · R_r), where L_r and R_r are the numbers of leaf nodes in the left and right subtrees of r and E_r is the number of edges in G whose endpoints are leaves of the two subtrees of r in T [38]. The hierarchical random graph can thus be defined as (T, {p_r}). As shown in Figure 10, an original graph G can be represented by different dendrograms.
Example of the HRG model.
For a graph G, in order to select a good dendrogram from numerous candidates, the notion of likelihood is used to estimate how exactly a selected dendrogram T represents G: L(T) = ∏_r p_r^{E_r} (1 − p_r)^{L_r R_r − E_r}, the product taken over the internal nodes r. In the above instance, the set of p_r values is first calculated for each dendrogram; the likelihoods of the two dendrograms are then computed and compared, and the dendrogram with the larger likelihood represents the original graph more exactly.
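The likelihood above can be computed directly from the per-node statistics. In this sketch each internal node is encoded as a triple (L_r, R_r, E_r), an encoding chosen here for illustration.

```python
def hrg_likelihood(internal_nodes):
    """Likelihood of a dendrogram in the HRG model. For each internal node
    r with L_r and R_r leaves in its subtrees and E_r crossing edges, the
    maximum-likelihood probability is p_r = E_r / (L_r * R_r), and the
    likelihood is the product of p_r^E_r * (1 - p_r)^(L_r*R_r - E_r)."""
    like = 1.0
    for (l, r, e) in internal_nodes:
        total = l * r
        p = e / total
        like *= (p ** e) * ((1 - p) ** (total - e))
    return like
```

For example, a node with two leaves on each side and two crossing edges (p_r = 0.5) contributes 0.5^2 * 0.5^2 = 0.0625, so dendrograms whose p_r values sit near 0 or 1 score higher.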
In the HRG-based method, the privacy budget ε is divided into two parts. First, the exponential mechanism is used: a Markov chain Monte Carlo procedure is designed to select a good dendrogram, in which only one probability is affected by a single edge change; this step consumes part of the privacy budget, with the log-likelihood of the dendrogram serving as the score function. Then the Laplace mechanism is adopted: the edge counts E_r of the selected dendrogram are perturbed using the remaining budget. Finally, the sanitized graph is reconstructed with the HRG model and released. The method's global sensitivity is small, which reduces the added noise magnitude. Detailed analysis of graph release methods is shown in Table 5.
[Table 5 excerpt: sampling and perturbation share the privacy budget.]
To sum up, directly adding noise to a graph leads to poor data utility because of the graph's high sensitivity, so the data must be projected into other representations, such as the spectral domain, the dK-series, or connection probabilities; therefore, how to select a representation with lower sensitivity is one direction for future research.
4.5. Pattern Mining Release
A frequent pattern is a subset of items that appears frequently in a dataset, such as an itemset, subsequence, or substructure, and it is the basis of mining tasks such as classification and clustering. However, frequent pattern mining may reveal individuals' privacy, so how to identify useful patterns in the data while simultaneously guaranteeing users' privacy is a new research direction in data mining.
Frequent pattern mining is usually measured by support degree [28, 39, 40], occurrence [61], or stay points [62, 63]. Given a dataset over a transaction set T, each transaction t ∈ T contains some items drawn from the item space I, and an itemset X is a subset of I. If a transaction t contains itemset X, we say that t supports X, and the ratio of the number of transactions supporting X to the total number of transactions is the support degree of X. X is then called a frequent itemset if its support degree exceeds a given threshold.
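The support degree follows directly from its definition; the helpers below (with illustrative names) compute it and list the frequent single items for a given threshold.

```python
def support(transactions, itemset):
    """Support degree: the fraction of transactions that contain every
    item of the given itemset."""
    s = set(itemset)
    return sum(1 for t in transactions if s <= set(t)) / len(transactions)

def frequent_items(transactions, threshold):
    """All single items whose support degree exceeds the threshold,
    the 1-itemset base case of frequent itemset mining."""
    items = {i for t in transactions for i in t}
    return sorted(i for i in items if support(transactions, [i]) > threshold)
```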
In terms of personal location information, in order to mine useful patterns such as people's interesting geographic locations, Ho and Ruan [62] proposed the BuildDPQuadTree method based on a quadtree and the density-based clustering algorithm DBSCAN. The focus is how to define an interesting region when querying a location history database, for which the notion of a stay point is proposed. Trajectory data consists of points (t_i, lat_i, lon_i), where t_i is a timestamp, lat_i is the latitude, and lon_i is the longitude. If a trajectory stays within a circular region of radius ρ for at least a given time period, the center of the circle is a stay point; given a set of trajectories, a region is called an interesting location if it contains more than r stay points.
BuildDPQuadTree first recursively divides the data space G into small regions: given the stay point set S, a partition threshold, and a privacy budget, partitioning stops when the noisy count of a region falls below the threshold and continues otherwise. After the set of spatial partitions is obtained, each cluster of stay points is formed by the density-based spatial clustering of applications with noise (DBSCAN) method [64]. Given the threshold r and the privacy budget, a cluster is called an interesting region if its noisy size exceeds r, where the count is taken over the set D of all individuals who have records in the cluster [62]; its center location is defined as the noisy mean of the stay points it contains, with noise depending on the quadtree depth h. BuildDPQuadTree provides a novel method for identifying interesting regions via stay points; however, its noise magnitude is easily affected by the partition threshold. In Ho and Ruan's subsequent research, the algorithm DP-ILD [63], adopting β-smooth sensitivity, was proposed; it satisfies (ε, δ)-differential privacy.
The sensitivity of an itemset query depends on the dimensionality of the itemset during top-k frequent pattern mining. In order to reduce the sensitivity, the data is usually projected into a lower-dimensional space. Li et al. [39] proposed the PrivBasis method to search for the top-k frequent patterns. In the initialization step, PrivBasis reduces the search space by finding a minimal set of basis items that contains the top k frequent itemsets [39]; the supports of all subsets of the basis are then derived by binary itemset support counting [65]; however, the final frequent patterns may be imprecise.
In order to raise the utility of the data, Lee and Clifton [40] proposed the NoiseCut method based on the sparse vector technique and the FP-tree. NoiseCut first identifies all frequent itemsets L; in this step, the sparse vector technique is adopted to reduce the consumption of the privacy budget: given a noisy threshold derived from the support of the kth most frequent itemset and a count query q, privacy budget is paid only when the noisy answer to q exceeds the noisy threshold, and NoiseCut regards an itemset as frequent exactly in that case. A noisy FP-tree is then built according to L, and the supports of itemsets and the top-k frequent itemsets can be derived from the FP-tree. In NoiseCut, each itemset has a false negative probability; the smaller this probability, the higher the chance of obtaining the correct frequent patterns.
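A minimal sketch of the sparse vector technique in this style: only above-threshold answers consume budget, so many below-threshold queries can be answered almost for free. The ε split between threshold noise and query noise follows common SVT presentations and is an assumption here, not NoiseCut's exact calibration.

```python
import math
import random

def lap(scale):
    """Sample Laplace noise with the given scale."""
    u = random.random() - 0.5
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

def sparse_vector(queries, threshold, epsilon, budget_c=1):
    """Sketch of SVT: compare noisy query answers (true counts, each of
    sensitivity 1) against a noisy threshold. Up to budget_c positive
    answers are allowed; afterwards the mechanism stops answering."""
    noisy_t = threshold + lap(2.0 / epsilon)
    answers, paid = [], 0
    for q in queries:
        if paid >= budget_c:
            answers.append(None)      # budget for positive answers exhausted
            continue
        if q + lap(4.0 * budget_c / epsilon) >= noisy_t:
            answers.append(True)      # above threshold: pays budget
            paid += 1
        else:
            answers.append(False)     # below threshold: essentially free
    return answers
```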
For graph data, Shen and Yu [28] proposed the frequent graph pattern mining method Diff-FPM, based on Markov chain Monte Carlo (MCMC) random walk technology. Diff-FPM obtains the top-k frequent subgraphs mainly via random walk; the real counts of the subgraphs are then perturbed with Laplace noise. The key question for Diff-FPM is whether the random walk reaches its steady state: Diff-FPM satisfies ε-differential privacy if the steady state is reached; if not, it satisfies only (ε, δ)-differential privacy, in which case the security of the data may be weakened.
For sequential data, such as DNA sequences and trajectories, Bonomi et al. [61] indicated that there are disadvantages in using support to mine frequent sequential patterns; for example, a pattern may fail to be regarded as frequent if it occurs repeatedly within the same transaction, which may cause a loss of informative patterns. In addition, patterns represented as subsets of a universe of items may not describe well the sequentiality of important events in reality [61]. In order to mine frequent sequential patterns from sequential data, Bonomi et al. proposed the notion of occurrence, which can identify repetitive patterns as frequent. Given a pattern p and a string x, p occurs at position i of x if the substring of x starting at i matches p, and the occurrence count of p in x is the number of such positions; given a dataset D of strings, the occurrence count of p in D is the sum over all strings, and p is called a frequent sequential pattern if this count exceeds a given threshold. Note that adding or removing one string x in the dataset D may change the count of a pattern by up to |x| − |p| + 1, leading to high sensitivity; to reduce it, a prefix tree structure [66, 67] is usually used to split the input data, and the frequency of prefixes is used instead of the frequency of the patterns [61]. Detailed analysis of pattern mining release methods is shown in Table 6.
[Table 6 excerpt: utility drops as the number of outputs grows [28]; sampling and perturbation share the privacy budget.]
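The occurrence-based frequency described above can be computed as below; overlapping matches are counted, which is what makes a single string's contribution as large as |x| − |p| + 1.

```python
def occurrences(pattern, s):
    """Number of (possibly overlapping) positions where pattern occurs
    in the string s."""
    return sum(1 for i in range(len(s) - len(pattern) + 1)
               if s[i:i + len(pattern)] == pattern)

def dataset_occurrences(pattern, dataset):
    """Occurrence count of a pattern in a dataset of strings: the sum of
    its per-string occurrence counts."""
    return sum(occurrences(pattern, x) for x in dataset)
```

For example, "aa" occurs twice in "aaa" (positions 0 and 1), so a repetitive pattern inside a single long string can dominate the count, exactly the sensitivity problem the prefix tree mitigates.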
Bonomi et al. pointed out that existing differentially private frequent pattern mining methods mainly assume exact data; however, real data are usually noisy, and some patterns may be lost [61]. There is little research on mining frequent patterns from noisy data, which is a future research direction. At the same time, how to raise the utility of the data while simultaneously providing a privacy guarantee remains an issue for further research.
For different data types, researchers have proposed a series of differentially private data release methods, and the methods summarized in this paper are only part of the whole; there are many methods this paper does not describe, such as batch query methods based on matrix mechanisms [68–70] and data release methods based on machine learning [71].
4.6. Discussion
Data release methods under different strategies are the main research content of differential privacy and the theoretical basis for practical applications. A classification and comparison of differentially private data release methods is shown in Table 7, from which we can see that there are still few research achievements in this new area; many urgent issues remain to be addressed, and new study directions remain to be explored.
Classification and comparison of differential privacy data release methods.
[Table 7 excerpt: application scenarios include user behavior, DNA sequences, trajectory and disease trend analysis, and recommender systems.]
(1) High Sensitivity Problem. For multidimensional or graph data, whose sensitivity is so high that it may lead to poor data utility, how to reduce the sensitivity of the query function f and enhance the utility requires in-depth research.
(2) Time Series Problem. For time series data, how to balance utility and security as the time span T grows is a question deserving research, and the privacy guarantee for online data is another question worth studying.
(3) Efficiency Problem. The computational complexity of many differential privacy algorithms is relatively high, so how to ensure privacy while simultaneously improving algorithm efficiency is a question deserving research.
(4) New Research Directions for Differential Privacy. In existing differentially private data release methods, the data are usually stored by one party; if the data are stored by multiple parties, how to share the data while guaranteeing users' privacy among the parties is a question deserving further research in distributed differential privacy [72, 73], also called the secure multiparty computation problem. It mainly faces the following difficulties: (1) excessive communication overhead makes an algorithm infeasible when the data and the number of participants are large; (2) designing an algorithm that still satisfies differential privacy is another obstacle. An efficient and safe algorithm with low communication overhead is therefore a question deserving research, for example, how to design a differentially private algorithm for distributed real-time streaming data such as financial data.
(5) Extensions of the Definition of Differential Privacy. Kearns et al. [74] introduced differential privacy into game theory to guarantee players' privacy. Their paper points out that the equilibria of complete-information games can be implemented in a setting of incomplete information under certain conditions (when the number of players is large). Any algorithm that estimates a correlated equilibrium of a complete-information game while satisfying a variant of differential privacy is said to satisfy joint differential privacy. Under it, no group of players can learn "much" about the type of any player outside the group, even if these players collude in an arbitrary way [74].
Given a randomized function A, let D and D′ be i-neighboring datasets which differ only in their ith coordinate. For every player i, every pair of i-neighboring datasets D and D′, and every event S over the outputs to the players other than i, A is said to satisfy (ε, δ)-joint differential privacy if Pr[A(D)_{−i} ∈ S] ≤ e^ε · Pr[A(D′)_{−i} ∈ S] + δ,
where A(D)_{−i} denotes the outputs of A to all players except i; (ε, δ)-joint differential privacy equals ε-joint differential privacy when δ = 0.
In joint differential privacy, the input consists of n data elements and the output is also a vector of n messages, one per player. Each output element corresponds to one player's input, and the guarantee protects the ith player's privacy even if all other players collude: even knowing everything except the input and output of player i, the colluders fail to learn much about player i's input.
Joint differential privacy is an appropriate solution concept for problems whose solution can be partitioned among the n players who provide private data to the algorithm, such as equilibrium computation [74, 75] in game theory and agents' privacy guarantees in the maximum matching problem [76].
5. Conclusion and Future Work
Differential privacy has become the de facto standard for privacy guarantees and is one of the hot research topics in the field of healthcare applications. In this paper, the differentially private data release methods in the recent literature are classified, and a series of release methods is summarized, providing references for the further study of differential privacy. In Section 4, different research directions are proposed for the different categories; for example, it is pointed out that time series release combined with the wavelet transform is one direction for addressing the privacy of real-time streaming data. However, some other questions worth studying remain; they are summarized as follows.
(1) Privacy Budgeting for Tree Structures. The privacy budgeting strategy in tree structure release is usually uniform, and Cormode et al. [30] presented a novel nonuniform budgeting method based on a complete quadtree to improve query accuracy. The key to the method is a geometric budgeting strategy, analyzed through the total variance over all node variances, which is of great theoretical significance, but the effect is not optimal because few nodes may be touched in actual queries. Inspired by this, the query error is small if the privacy budget properly tilts toward the lower-level nodes of the tree, so how to design an optimal privacy budgeting strategy is a great challenge.
(2) Balancing the Noise Error and the Nonuniformity Error. For geospatial data, Qardaji et al. [31] constructed a differentially private synopsis by choosing the right partition granularity over the data domain to balance the noise error and the nonuniformity error, the latter taken as proportional to the number of data points in the cells that intersect the border of the query rectangle. However, the nonuniformity error is connected not only with the point count but also with the area of the cells that intersect the query border: the smaller the intersected area, the smaller the nonuniformity error, which is zero when the intersected area is zero. How to choose the optimal partition granularity for geospatial data when the nonuniformity error depends on both point count and area is a great challenge as well.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (Grant no. 41371402); the National Basic Research Program of China (973 Program) (Grant no. 2011CB302306); the Fundamental Research Funds for the Central Universities under Grant no. 2015211020201.
References
1. Dankar F. K., El Emam K., "The application of differential privacy to health data," in Proceedings of the Joint EDBT/ICDT Workshops, Berlin, Germany, March 2012, ACM, pp. 158–166. doi:10.1145/2320765.2320816
2. Francis T., Madiajagan M., Kumar V., "Privacy issues and techniques in E-health systems," in Proceedings of the ACM SIGMIS Conference on Computers and People Research (SIGMIS-CPR '15), Newport Beach, Calif, USA, June 2015, ACM, pp. 113–115. doi:10.1145/2751957.2751981
3. de Montjoye Y.-A., Radaelli L., Singh V. K., Pentland A. S., "Unique in the shopping mall: on the reidentifiability of credit card metadata," Science, vol. 347, no. 6221, pp. 536–539, 2015. doi:10.1126/science.1256297
4. Li Y., Wen W., Xie G.-Q., "Survey of research on differential privacy," Application Research of Computers, vol. 29, no. 9, pp. 3201–3211, 2012.
5. Ye H., Chen E. S., "Attribute utility motivated k-anonymization of datasets to support the heterogeneous needs of biomedical researchers," AMIA Annual Symposium Proceedings, vol. 2011, pp. 1573–1582, 2011.
6. Sweeney L., Computational Disclosure Control: A Primer on Data Privacy Protection, Massachusetts Institute of Technology, Cambridge, Mass, USA, 2001.
7. Samarati P., "Protecting respondents' identities in microdata release," IEEE Transactions on Knowledge and Data Engineering, vol. 13, no. 6, pp. 1010–1027, 2001. doi:10.1109/69.971193
8. Machanavajjhala A., Kifer D., Gehrke J., "l-diversity: privacy beyond k-anonymity," ACM Transactions on Knowledge Discovery from Data, vol. 1, no. 1, article 3, 2007. doi:10.1145/1217299.1217302
9. Ninghui L., Tiancheng L., Venkatasubramanian S., "t-closeness: privacy beyond k-anonymity and l-diversity," in Proceedings of the 23rd International Conference on Data Engineering (ICDE '07), Istanbul, Turkey, April 2007, IEEE, pp. 106–115. doi:10.1109/icde.2007.367856
10. Avancha S., Baxi A., Kotz D., "Privacy in mobile technology for personal healthcare," ACM Computing Surveys, vol. 45, no. 1, article 3, 2012. doi:10.1145/2379776.2379779
11. Francis T., Madiajagan M., Kumar V., "Privacy issues and techniques in e-health systems," in Proceedings of the ACM SIGMIS Conference on Computers and People Research, Newport Beach, Calif, USA, June 2015, ACM, pp. 113–115. doi:10.1145/2751957.2751981
12. Sadki S., El Bakkali H., "Enhancing privacy on mobile health: an integrated privacy module," in Proceedings of the 5th International Conference on Next Generation Networks and Services (NGNS '14), Casablanca, Morocco, May 2014, IEEE, pp. 245–250. doi:10.1109/NGNS.2014.6990259
13. Dwork C., "Differential privacy," in Proceedings of the 33rd International Colloquium on Automata, Languages and Programming (ICALP '06), Venice, Italy, July 2006, pp. 1–12.
14. Yuan G., Zhang Z., Winslett M., Xiao X., Yang Y., Hao Z., "Low-rank mechanism: optimizing batch queries under differential privacy," Proceedings of the VLDB Endowment, vol. 5, no. 11, pp. 1352–1363, 2012.
Rastogi V., Nath S., "Differentially private aggregation of distributed time-series with transformation and encryption," in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '10), Indianapolis, Ind, USA, June 2010, ACM, pp. 735–746. doi:10.1145/1807167.1807247
17. Zhu R., Shu W., Mao T., Deng T., "Enhanced MAC protocol to support multimedia traffic in cognitive wireless mesh networks," Multimedia Tools and Applications, vol. 67, no. 1, pp. 269–288, 2013. doi:10.1007/s11042-011-0942-7
18. Zhu K., Zhu R., Nii H., Samani H., Jalaeian B., "PaperIO: a 3D interface towards the internet of embedded paper-craft," IEICE Transactions on Information and Systems, vol. 97, no. 10, pp. 2597–2605, 2014. doi:10.1587/transinf.2013thp0001
19. Leoni D., "Non-interactive differential privacy: a survey," in Proceedings of the 1st International Workshop on Open Data (WOD '12), Nantes, France, May 2012, ACM, pp. 40–52. doi:10.1145/2422604.2422611
20. Xiong P., Zhu T.-Q., Wang X.-F., "A survey on differential privacy and applications," Chinese Journal of Computers, vol. 37, no. 1, pp. 101–122, 2014. doi:10.3724/sp.j.1016.2014.00101
21. Yang Y., Zhang Z., Miklau G., "Differential privacy in data publication and analysis," in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '12), Scottsdale, Ariz, USA, May 2012, pp. 601–606.
22. International Organization for Standardization, Information Technology—Business Operational View—Part 1: Operational Aspects of Open-Edi for Implementation, ISO/IEC 15944-1, 2nd edition, London, UK, 2011.
23. Dwork C., "Differential privacy: a survey of results," in Theory and Applications of Models of Computation: 5th International Conference, TAMC 2008, Xi'an, China, April 25–29, 2008, vol. 4978 of Lecture Notes in Computer Science, Springer, Berlin, Germany, 2008, pp. 1–19. doi:10.1007/978-3-540-79228-4_1
24. Dwork C., "Differential privacy," in Encyclopedia of Cryptography and Security, 2nd edition, pp. 338–340, 2011.
25. Dwork C., Kenthapadi K., McSherry F., Mironov I., Naor M., "Our data, ourselves: privacy via distributed noise generation," in Advances in Cryptology—EUROCRYPT 2006, St. Petersburg, Russia, May 28–June 1, 2006, vol. 4004 of Lecture Notes in Computer Science, Springer, Berlin, Germany, 2006, pp. 486–503. doi:10.1007/11761679_29
26. Dwork C., "A firm foundation for private data analysis," Communications of the ACM, vol. 54, no. 1, pp. 86–95, 2011. doi:10.1145/1866739.1866758
27. Hsu J., Gaboardi M., Haeberlen A., Khanna S., Narayan A., Pierce B. C., Roth A., "Differential privacy: an economic method for choosing epsilon," in Proceedings of the IEEE 27th Computer Security Foundations Symposium (CSF '14), Vienna, Austria, July 2014, IEEE, pp. 398–410. doi:10.1109/csf.2014.35
28. Shen E., Yu T., "Mining frequent graph patterns with differential privacy," in Proceedings of the 19th ACM SIGKDD International Conference, Chicago, Ill, USA, August 2013, pp. 545–553. doi:10.1145/2487575.2487601
29. Xiao Y., Gardner J., Xiong L. DPCube: releasing differentially private data cubes for health information. Proceedings of the IEEE 28th International Conference on Data Engineering (ICDE '12), April 2012, Washington, DC, USA, IEEE, 1305–1308. doi:10.1109/icde.2012.135.
30. Cormode G., Procopiuc C., Srivastava D., Shen E., Yu T. Differentially private spatial decompositions. Proceedings of the 28th IEEE International Conference on Data Engineering (ICDE '12), April 2012, 20–31. doi:10.1109/icde.2012.16.
31. Qardaji W., Yang W., Li N. Differentially private grids for geospatial data. Proceedings of the 29th International Conference on Data Engineering (ICDE '13), April 2013, Brisbane, Australia, 757–768. doi:10.1109/icde.2013.6544872.
32. Fan L., Xiong L., Sunderam V. Differentially private multi-dimensional time series release for traffic monitoring. Data and Applications Security and Privacy XXVII, Lecture Notes in Computer Science, vol. 7964, Springer, Berlin, Germany, 2013, 33–48. doi:10.1007/978-3-642-39256-6_3.
33. Fan L., Xiong L., Sunderam V. FAST: differentially private real-time aggregate monitor with filtering and adaptive sampling. Proceedings of the ACM SIGMOD Conference on Management of Data, June 2013, New York, NY, USA, 1065–1068. doi:10.1145/2463676.2465253.
34. Fan L., Xiong L. An adaptive approach to real-time aggregate monitoring with differential privacy. IEEE Transactions on Knowledge and Data Engineering, 2014, 26(9), 2094–2106. doi:10.1109/tkde.2013.96.
35. Fan L., Bonomi L., Xiong L., Sunderam V. Monitoring web browsing behavior with differential privacy. Proceedings of the 23rd International Conference on World Wide Web (WWW '14), April 2014, Seoul, Republic of Korea, 177–187. doi:10.1145/2566486.2568038.
36. Wang Y., Wu X. Preserving differential privacy in degree-correlation based graph generation. Transactions on Data Privacy, 2013, 6(2), 127–145.
37. Wang Y., Wu X., Wu L. Differential privacy preserving spectral graph analysis. Advances in Knowledge Discovery and Data Mining: 17th Pacific-Asia Conference, PAKDD 2013, Gold Coast, Australia, April 14–17, 2013, Proceedings, Part II, Lecture Notes in Computer Science, vol. 7819, Springer, Berlin, Germany, 2013, 329–340. doi:10.1007/978-3-642-37456-2_28.
38. Xiao Q., Chen R., Tan K.-L. Differentially private network data release via structural inference. Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14), August 2014, New York, NY, USA, 911–920. doi:10.1145/2623330.2623642.
39. Li N., Qardaji W., Su D., Cao J. PrivBasis: frequent itemset mining with differential privacy. Proceedings of the VLDB Endowment, 2012, 5(11), 1340–1351. doi:10.14778/2350229.2350251.
40. Lee J., Clifton C. W. Top-k frequent itemsets via differentially private FP-tree. Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14), August 2014, New York, NY, USA, 931–940. doi:10.1145/2623330.2623723.
41. Nissim K., Raskhodnikova S., Smith A. Smooth sensitivity and sampling in private data analysis. Proceedings of the 39th Annual ACM Symposium on Theory of Computing (STOC '07), 2007, ACM, 75–84. doi:10.1145/1250790.1250803.
42. Dwork C., McSherry F., Nissim K., Smith A. Calibrating noise to sensitivity in private data analysis. Theory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4–7, 2006, Proceedings, Lecture Notes in Computer Science, vol. 3876, Springer, Berlin, Germany, 2006, 265–284. doi:10.1007/11681878_14.
43. McSherry F., Talwar K. Mechanism design via differential privacy. Proceedings of the 48th Annual Symposium on Foundations of Computer Science (FOCS '07), October 2007, Providence, RI, USA, 94–103. doi:10.1109/focs.2007.4389483.
44. McSherry F. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. Communications of the ACM, 2010, 53(9), 89–97. doi:10.1145/1810891.1810916.
45. Zhang X.-J., Meng X.-F. Differential privacy in data publication and analysis. Chinese Journal of Computers, 2014, 37(4), 927–949. doi:10.3724/sp.j.1016.2014.00927.
46. Xiao Y., Xiong L., Yuan C. Differentially private data release through multidimensional partitioning. Secure Data Management: 7th VLDB Workshop, SDM 2010, Singapore, September 17, 2010, Proceedings, Lecture Notes in Computer Science, vol. 6358, Springer, Berlin, Germany, 2010, 150–168. doi:10.1007/978-3-642-15546-8_11.
47. Hay M., Rastogi V., Miklau G., Suciu D. Boosting the accuracy of differentially private histograms through consistency. Proceedings of the VLDB Endowment, 2010, 3(1-2), 1021–1032. doi:10.14778/1920841.1920970.
48. Qardaji W., Yang W., Li N. Understanding hierarchical methods for differentially private histograms. Proceedings of the VLDB Endowment, 2013, 6(14), 1954–1965. doi:10.14778/2556549.2556576.
49. Li H., Xiong L., Zhang L., Jiang X. DPSynthesizer: differentially private data synthesizer for privacy preserving data sharing. Proceedings of the VLDB Endowment, 2014, 7(13), 1677–1680. doi:10.14778/2733004.2733059.
50. Inan A., Kantarcioglu M., Ghinita G., Bertino E. Private record matching using differential privacy. Proceedings of the 13th International Conference on Extending Database Technology (EDBT '10), March 2010, Lausanne, Switzerland, ACM, 123–134. doi:10.1145/1739041.1739059.
51. Inan A., Kantarcioglu M., Ghinita G., Bertino E. A hybrid approach to private record matching. IEEE Transactions on Dependable and Secure Computing, 2012, 9(5), 684–698. doi:10.1109/tdsc.2012.46.
52. Rastogi V., Nath S. Differentially private aggregation of distributed time-series with transformation and encryption. Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '10), June 2010, Indianapolis, Ind, USA, ACM, 735–746. doi:10.1145/1807167.1807247.
53. Kalman R. E. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 1960, 82(1), 35–45. doi:10.1115/1.3662552.
54. Xiao X., Wang G., Gehrke J. Differential privacy via wavelet transforms. Proceedings of the 26th IEEE International Conference on Data Engineering (ICDE '10), March 2010, Long Beach, Calif, USA, IEEE, 225–236. doi:10.1109/icde.2010.5447831.
55. Xiao X., Wang G., Gehrke J. Differential privacy via wavelet transforms. IEEE Transactions on Knowledge and Data Engineering, 2011, 23(8), 1200–1214. doi:10.1109/tkde.2010.247.
56. Ferrández-Pastor F.-J., García-Chamizo J.-M., Nieto-Hidalgo M., Romacho-Agud V., Flórez-Revuelta F. Using wavelet transform to disaggregate electrical power consumption into the major end-uses. Ubiquitous Computing and Ambient Intelligence. Personalisation and User Adapted Services, Lecture Notes in Computer Science, vol. 8867, Springer, 2014, 272–279. doi:10.1007/978-3-319-13102-3_45.
57. Hay M., Li C., Miklau G., Jensen D. Accurate estimation of the degree distribution of private networks. Proceedings of the 9th IEEE International Conference on Data Mining (ICDM '09), December 2009, Miami, Fla, USA, IEEE, 169–178. doi:10.1109/icdm.2009.11.
58. Sala A., Zhao X., Wilson C., Zheng H., Zhao B. Y. Sharing graphs using differentially private graph models. Proceedings of the ACM SIGCOMM Internet Measurement Conference (IMC '11), November 2011, Berlin, Germany, 81–97. doi:10.1145/2068816.2068825.
59. Mahadevan P., Krioukov D., Fall K., Vahdat A. Systematic topology analysis and generation using degree correlations. Proceedings of the Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications (SIGCOMM '06), September 2006, Pisa, Italy, ACM, 135–146. doi:10.1145/1151659.1159930.
60. Garthwaite P. H., Critchley F., Anaya-Izquierdo K., Mubwandarikwa E. Orthogonalization of vectors with minimal adjustment. Biometrika, 2012, 99(4), 787–798. doi:10.1093/biomet/ass041.
61. Bonomi L., Xiong L. Mining frequent patterns with differential privacy. Proceedings of the VLDB Endowment, 2013, 6(12), 1422–1427. doi:10.14778/2536274.2536329.
62. Ho S.-S., Ruan S. Differential privacy for location pattern mining. Proceedings of the 4th ACM SIGSPATIAL International Workshop on Security and Privacy in GIS and LBS (SPRINGL '11), November 2011, Chicago, Ill, USA, ACM, 17–24. doi:10.1145/2071880.2071884.
63. Ho S.-S., Ruan S. Preserving privacy for interesting location pattern mining from trajectory data. Transactions on Data Privacy, 2013, 6(1), 87–106.
64. Ester M., Kriegel H.-P., Sander J., Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the International Conference on Knowledge Discovery & Data Mining (KDD '96), 1996, 226–231.
65. Chen J., Xiao K. BISC: a bitmap itemset support counting approach for efficient frequent itemset mining. ACM Transactions on Knowledge Discovery from Data, 2010, 4(3), article 12. doi:10.1145/1839490.1839493.
66. Bonomi L., Xiong L. A two-phase algorithm for mining sequential patterns with differential privacy. Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (CIKM '13), November 2013, Burlingame, Calif, USA, 269–278. doi:10.1145/2505515.2505553.
67. Bonomi L., Xiong L., Chen R., Fung B. C. M. Frequent grams based embedding for privacy preserving record linkage. Proceedings of the 21st ACM International Conference on Information and Knowledge Management (CIKM '12), November 2012, 1597–1601. doi:10.1145/2396761.2398480.
68. Li C., Hay M., Rastogi V., Miklau G., McGregor A. Optimizing linear counting queries under differential privacy. Proceedings of the 29th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS '10), June 2010, Indianapolis, Ind, USA, ACM, 123–134. doi:10.1145/1807085.1807104.
69. Li C., Miklau G. An adaptive mechanism for accurate query answering under differential privacy. Proceedings of the VLDB Endowment, 2012, 5(6), 514–525. doi:10.14778/2168651.2168653.
70. Yuan G., Zhang Z., Winslett M., Xiao X., Yang Y., Hao Z. Low-rank mechanism: optimizing batch queries under differential privacy. Proceedings of the VLDB Endowment, 2012, 5(11), 1352–1363. doi:10.14778/2350229.2350252.
71. Ji Z., Jiang X., Wang S., Xiong L., Ohno-Machado L. Differentially private distributed logistic regression using private and public data. BMC Medical Genomics, 2014, 7(Suppl. 1), article S14. doi:10.1186/1755-8794-7-s1-s14.
72. Liu J., Huang J. Z., Luo J., Xiong L. Privacy preserving distributed DBSCAN clustering. Transactions on Data Privacy, 2013, 6(1), 69–85.
73. Friedman A., Sharfman I., Keren D., Schuster A. Privacy-preserving distributed stream monitoring. Proceedings of the Network and Distributed System Security Symposium (NDSS '14), February 2014, San Diego, Calif, USA, 1–12. doi:10.14722/ndss.2014.23128.
74. Kearns M., Pai M. M., Roth A., Ullman J. Mechanism design in large games: incentives and privacy. American Economic Review, 2014, 104(5), 431–435. doi:10.1257/aer.104.5.431.
75. Roth A. Differential privacy, equilibrium, and efficient allocation of resources. Proceedings of the 51st Annual Allerton Conference on Communication, Control, and Computing (Allerton '13), October 2013, Monticello, Ill, USA, 1593–1597. doi:10.1109/Allerton.2013.6736719.
76. Hsu J., Huang Z., Roth A., Roughgarden T., Wu Z. S. Private matchings and allocations. Proceedings of the 46th Annual ACM Symposium on Theory of Computing (STOC '14), May–June 2014, New York, NY, USA, ACM, 21–30. doi:10.1145/2591796.2591826.