A novel destination prediction attack and corresponding location privacy protection method in geo-social networks

Abstract

Location publication in check-in services of geo-social networks raises serious privacy concerns due to rich sources of background information. This article proposes a novel destination prediction approach, Destination Prediction, designed specifically for the check-in service of geo-social networks. It not only addresses the “data sparsity problem” faced by common destination prediction approaches, but also takes advantage of the commonly available background information from geo-social networks and other public resources, such as social structure, road network, and speed limits. Further considering the Destination Prediction–based attack model, we present a location privacy protection method, Check-in Deletion, and a framework, Destination Prediction + Check-in Deletion, to help check-in users detect potential location privacy leakage and retain confidential locational information against destination inference attacks without sacrificing the real-time check-in precision and user experience. A new data preprocessing method is designed to construct a reasonable complete check-in subset from the worldwide check-in data set of a real-world geo-social network without loss of generality and validity of the evaluation. Experimental results show the strong prediction ability of the Destination Prediction approach, the effective protection capability of the Check-in Deletion method against destination inference attacks, and the high running efficiency of the Destination Prediction + Check-in Deletion framework.

Keywords

Geo-social networks, location privacy, destination prediction, data sparsity problem, data mining

Introduction

Driven by the explosive growth of online social media and modern mobile devices with integrated position sensors, such as smart phones and tablet PCs, geo-social networks (GeoSNs) have attracted millions of users in recent years. GeoSNs are online social networks (OSNs) that combine location-reporting capabilities with traditional social network functionality,¹ such as Facebook Places, Foursquare, Google Latitude, Gowalla, and Flickr. One of the most popular services in GeoSNs is the check-in service. It allows users to publicly report their arrival at a location (i.e. Point Of Interest, POI) and share their experiences, which not only provides a platform for social interactions and self-expression, but also brings material benefits to users, such as special offers or discounts from the business providers. Thus, GeoSN users tend to check in as frequently as possible to benefit from these services.

However, along with the aforementioned pleasure and benefits that check-in services bring, the publication of users’ locations through these services raises serious privacy threats.^2–4 On one hand, revealing a user’s exact location leads to privacy leakage when the place itself is sensitive, such as a home or workplace, or it may allow an adversary to infer sensitive information when the user visits places such as a hospital or a night club. One may argue that the user should notice the potential privacy violations and avoid checking in at such sensitive positions. However, such check-ins are inevitable according to the findings of several sociological and psychological studies,^5,6 which show that the vast majority of people underestimate the risks. On the other hand, continuously shared locations constitute trajectories, which enable intrusive inferences of users’ movement patterns and locations that may be misused for stalking, mugging, or determining empty houses for burglaries.³ Consequently, practical location privacy protection mechanisms are badly needed in GeoSNs.

Currently, there are some existing location privacy protection approaches based on different attack models.^7–9 Unfortunately, the protection of location privacy in GeoSNs still remains a big challenge.³ (1) Many approaches only consider limited attack models, ignoring the commonly available background information that may be used by advanced adversaries to violate location privacy, such as social information and geographical information. (2) Because of the inevitable “data sparsity problem,”¹⁰ the prediction ability of the existing Bayesian-based location inference algorithms is considerably restricted when the available historical trajectories are too sparse to cover all the possible trajectories. (3) Existing location privacy protection methods mainly focus on processing user’s real-time check-in by location obfuscation or position dummies,¹¹ which badly affect user experience. Furthermore, the efficiency of existing location privacy protection frameworks is too low to meet real-time check-in demand.¹²

Our work was motivated by the above limitations. In this article, we focus on the destination prediction attack and propose a new attack model named DesPre (Destination Prediction), a corresponding protection method named CkiDel (Check-in Deletion), and a framework named DPCD (Destination Prediction + Check-in Deletion), all designed specifically for check-in services. We first construct the personalized historical trajectories data set for each target user based on social relationships and user similarities. Then, to overcome the “data sparsity problem,” we decompose the historical trajectories in the user’s personalized trajectories data set to construct a Markov model and calculate transition probabilities offline. Afterwards, by directly fetching those transition probabilities, we compute the destination probability and then the top- $κ$ potential destinations based on the Bayesian inference framework. The proposed privacy protection method CkiDel uses DesPre’s prediction results to construct a removal list of check-ins from the historical trajectory so that the real destination would not be available. Based on DesPre and CkiDel, the privacy protection framework DPCD is proposed to help check-in users detect potential location privacy leakage and retain confidential locational information. In summary, the contributions of this article are as follows:

A novel approach to personalized destination prediction, named DesPre, is proposed specially for check-in services in GeoSNs, which not only addresses the “data sparsity problem” faced by other destination prediction approaches, but also takes advantage of the commonly available background information from the GeoSNs so that the most probable locations to be visited by the target user can be inferred accurately even when the available historical trajectories are too sparse to cover all the possible trajectories.

Making use of the DesPre approach as the privacy attack model, we present a new privacy protection method CkiDel, which prevents adversaries from obtaining the correct patterns of users’ movements by deleting the smallest number of users’ historical check-ins so that the real sensitive destinations of target users would not be available without sacrificing the real-time check-in precision and user experience.

Based on the DesPre inference approach and CkiDel protection method, the location privacy protection framework DPCD is proposed to guard against destination inference attack. It achieves both effective protection ability and high running efficiency since the inference approach DesPre transfers large quantities of time-consuming calculations to the offline training phase and utilizes the geographical and time constraints to filter the unreachable destinations to avoid further complex calculations, which is proved by extensive experiments.

Without loss of generality and validity of the experimental evaluation, a new data preprocessing method is designed to construct a reasonable complete check-in subset from the worldwide check-in data set of Gowalla, a real-world GeoSN.

The remainder of this article is structured as follows. We discuss related work in section “Related work.” The problem addressed in this article is formalized in section “Problem formalization.” In section “DesPre-based attack model,” the DesPre approach is presented in detail along with the DesPre-based attack model. Section “Privacy protection method CkiDel and framework DPCD” describes our privacy protection method CkiDel and framework DPCD. Experimental evaluations are discussed in section “Experimental evaluation.” Section “Conclusion and future work” concludes the article and presents our future work.

Related work

Past work on privacy in GeoSNs includes identifying privacy threats and proposing privacy protection approaches.¹³ Freni et al.¹⁴ first formalized the notions of location and absence privacy in GeoSNs and proposed a cloaking algorithm to enforce them. CR Vicente et al.¹ complemented the privacy preservation concepts and identified four specific classes of privacy threats systematically in GeoSNs. Mascetti et al.¹⁵ designed two protocols for preserving privacy by encrypting location information to prevent collection of a user’s geographical information. L Siksnys et al.¹⁶ proposed a client–server solution to achieve private and flexible proximity detection in GeoSNs. G Zhong et al.¹⁷ introduced three distributed protocols to guarantee location privacy in location-based services (LBSs). However, some of the above solutions^1,14 could not meet the demands of exact location and real-time publication in check-in services, while some others^15–17 were specially designed for the proximity services of GeoSNs, which did not apply to location publication of check-in services in GeoSNs.

Check-in services in GeoSNs could be considered as a combination of general LBSs and OSNs. So the most intuitive idea is to apply the existing privacy-related findings in LBSs and OSNs to the GeoSN setting. Unfortunately, direct application is infeasible. On one hand, research works on privacy in OSNs mainly focus on profile information, while the biggest challenge in check-in services is posed by the dynamic geo-spatial information, which bridges the online world and the offline world. On the other hand, techniques in general LBSs are mostly based on the assumption of user anonymity,¹⁸ which does not correspond with the reality in most existing GeoSNs.¹⁴ Furthermore, there is not much non-positional information, such as social relationships, in general LBSs, while this information plays a key role in GeoSNs’ location privacy leakage.

Although the achievements in LBSs cannot be applied to the GeoSN setting directly, we could still draw lessons from them. Some scholars who work on destination prediction approaches in general LBSs¹¹ are motivated mainly by applications in location-based targeted advertising or POI recommendation. Proper utilization of these location inference approaches could benefit both users and business providers; nevertheless, abuse of them by adversaries could pose a great threat to location privacy. The attack model proposed in this article is inspired by these ideas. Since continuously shared locations in check-in services constitute trajectories, those historical trajectory-based destination prediction approaches could be transplanted into check-in services of GeoSNs, provided that the GeoSNs’ features are taken into consideration. Furthermore, the literature¹⁹ showed that the best predictors of a person’s behavior are based on his or her friends, relatives, and other related people. Cho et al.²⁰ found, by jointly analyzing GeoSNs’ social relationships and user mobility, that more than half of the users’ movements are affected by their friends. These findings further support the idea of this article since the commonly available social structures in GeoSNs could be used to improve the accuracy of the inference model as well as the attacking ability of the attack model.

The most related work to ours is presented by Xue et al.¹⁰ Nevertheless, our destination inference model is different from theirs in the following four aspects:

Our model is specially designed for check-in services of GeoSNs, whose unique features pose us a new problem, while the prediction algorithm by Xue et al.¹⁰ mainly focuses on general LBSs.

When formalizing the destination prediction problem, we consider the time factor and decompose the prediction into two main issues, which is more in line with reality. In contrast, Xue et al.¹⁰ consider only the time-independent destination probabilities, which is just one of the two main issues we address.

Our destination inference approach is individual-oriented, which means each target user would have his or her own personalized model. However, the inference model for different users is exactly the same in the work by Xue et al.¹⁰ which may introduce inaccuracy since human moving patterns vary from person to person.²⁰

Our model makes full use of commonly available information by utilizing historical and social information to calculate posterior probabilities and considering geographical information to filter unreachable locations, which contributes to the accuracy and efficiency of the model, respectively. But Xue’s model only took the historical trajectories into account.

Besides, considering user’s need of self-expression and social interaction, our privacy protection method CkiDel would keep the real-time check-in from being removed to the greatest extent, while the End-Points Generation Method in the work by Xue et al.¹⁰ may easily cancel users’ current check-in, which dampens user experience and discounts the functionalities of GeoSNs.

Problem formalization

Terms and notions

A check-in is the process whereby a user $u_{k}$ claims his visit to POI $l$ at a certain time $t$ through GeoSNs, that is, $c^{k} = (l^{k}, t^{k})$ . Continuously published check-ins of $u_{k}$ constitute a trajectory, which is represented as a sequence of time-ordered locations during a specific time interval, that is, ${tra}_{k} = 〈 l_{1}, l_{2}, \dots, l_{i}, \dots, l_{n} 〉$ . A query trajectory is the user-specified ongoing trajectory whose destination (i.e. the user’s next location) is about to be predicted. Consisting of a user’s latest check-ins, a query trajectory of $u_{k}$ is defined as ${tra}_{k}^{q} = 〈 l_{s}, l_{2}, l_{3}, \dots, l_{c} 〉$ , where $l_{s}$ and $l_{c}$ are, respectively, the starting location and current location of $u_{k}$ .

In previous studies on destination prediction,^10,21–24 a uniform grid is commonly used to help represent the data set. Following this paradigm, we divide the entire area of interest into square cells with sides of $λ$ m, so the whole space is represented as a two-dimensional grid graph. Each grid cell may contain several POIs, and all the POIs within the same cell would be considered as the same object. Consequently, the representation of a check-in would be replaced by $c^{k} = (g^{k}, t^{k})$ , and a trajectory would be transformed into a sequence of time-ordered grid cell coordinates, that is, ${tra}_{k} = 〈 g_{1}, g_{2}, \dots, g_{i}, \dots, g_{n} 〉$ , where $g_{i} = (x_{i}, y_{i})$ . For example, the trajectory ${tra}_{1}$ , as shown in Figure 1, which consists of $〈 l_{1}, l_{5}, l_{6}, l_{9} 〉$ would be represented as $〈 g_{1}, g_{4}, g_{5}, g_{8} 〉$ .

Figure 1.

Sample grid graph and trajectory.
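
As an illustration only, the grid representation described above can be sketched in Python as follows. The sketch assumes that POI positions are already given as planar coordinates in metres in some local frame (the article does not specify a projection), and the names `to_grid_cell` and `trajectory_to_cells`, as well as the sample coordinates, are hypothetical.

```python
import math


def to_grid_cell(x_m, y_m, cell_side_m):
    """Map planar coordinates (metres) to integer grid cell coordinates (x_i, y_i)."""
    return (math.floor(x_m / cell_side_m), math.floor(y_m / cell_side_m))


def trajectory_to_cells(points, cell_side_m):
    """Convert a time-ordered list of (x_m, y_m) check-in positions into grid cells."""
    return [to_grid_cell(x, y, cell_side_m) for x, y in points]


# Hypothetical check-in positions in a local planar frame, with lambda = 500 m.
points = [(120.0, 80.0), (130.0, 620.0), (700.0, 640.0), (760.0, 1200.0)]
cells = trajectory_to_cells(points, 500.0)
print(cells)
```

All POIs that fall into the same square cell map to the same pair of integers, which is what allows them to be treated as one object in the later steps.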

Problem and goal

In this article, we focus on the basic location privacy problem in GeoSNs and limit our attention to the destination prediction attack, a kind of location privacy attack in which the adversary infers the place most likely to be visited by the target user at a certain time in the near future. Given the historical trajectories data set, the user-specified query trajectory ${tra}_{k}^{q}$ , and the maximum travel time (denoted by $Δ t_{m}$ ) between the user’s current check-in and the destination, the prediction problem could be described as finding the grid cell $g_{i}$ that maximizes the destination probability $p (l_{d} \in g_{i} | Δ t_{m}, {tra}_{k}^{q})$ , where $l_{d} \in g_{i}$ indicates that the destination $l_{d}$ of target user $u_{k}$ is located in the grid cell $g_{i}$ . Following the most popular Bayesian inference framework, the destination probability is equivalent to

p (l_{d} \in g_{i} | Δ t_{m}, {tra}_{k}^{q}) = \frac{p (l_{d} \in g_{i}, Δ t_{m} | {tra}_{k}^{q})}{p (Δ t_{m})}

(1)

Note that, given a definite $Δ t_{m}$ for a certain query, $p (Δ t_{m})$ in equation (1) is a constant, that is

p (l_{d} \in g_{i} | Δ t_{m}, {tra}_{k}^{q}) \propto p (l_{d} \in g_{i}, Δ t_{m} | {tra}_{k}^{q}) = p (Δ t_{m} | l_{d} \in g_{i}, {tra}_{k}^{q}) \cdot p (l_{d} \in g_{i} | {tra}_{k}^{q})

(2)

In equation (2), the former probability $p (Δ t_{m} | l_{d} \in g_{i}, {tra}_{k}^{q})$ measures the reachability of grid cell $g_{i}$ within the time interval $Δ t_{m}$ when starting from the user’s current position $g_{c}$ , so we call it reachability probability. The latter probability $p (l_{d} \in g_{i} | {tra}_{k}^{q})$ in equation (2) measures the possibility of the target user’s destination being in the grid cell $g_{i}$ without considering the time factor, so we call it time-independent destination probability.

According to equation (2), the destination prediction problem could be decomposed into two main issues. One is to calculate the reachability probability of each grid cell in the training data set, and the other is to compute the time-independent destination probability. The goal of our proposed inference approach DesPre is to list the top- $κ$ potential destinations of a target user in a given scenario by calculating the destination probabilities of the candidate grid cells. The goal of our protection method CkiDel is to prevent adversaries who use the DesPre approach from inferring a top- $κ$ list of potential destinations that contains the correct destination.

DesPre-based attack model

Background information of attackers

It is widely accepted that the GeoSN service providers are untrusted.¹² They may either analyze users’ data or share them with untrusted third parties. We assume that adversaries have access to all the resources published on all the users’ GeoSN profiles, including users’ basic attributes, check-in records, and articulated lists of friends. This assumption may be conservative, but for the sake of security, we should never underestimate the enemies. Furthermore, since geographical information is generally available, adversaries could easily obtain the road networks of the area they are interested in and related geographical resources, such as the maximum speed of road sections. Taking these geographical constraints into consideration, the attacking ability of the adversaries would be enhanced since the field of potential destinations can be narrowed and the accuracy and efficiency of destination inference can be improved. In sum, the background information the adversary possesses is as follows ( $n$ is the total number of users in the GeoSN): (1) historical information: the set $C_{i}$ $(i \in [1, n])$ of user $u_{i}$ ’s published check-ins, and $C = ⋃_{i = 1}^{n} C_{i}$ ; (2) social information: friend list of user $u_{i}$ $(i \in [1, n])$ , denoted by $F_{i}$ ; (3) geographical information: road network $G (V, E, S)$ of the area of interest: $G$ is an undirected graph, $E$ is the edge set of all the road sections, $V$ is the vertex set of all the intersections, and $S$ is the weight set of all the edges, which indicates the speed limits of the road sections.

Destination prediction approach: DesPre

Based on the abovementioned background information the adversaries may have, the destination prediction approach DesPre is proposed, which mainly includes two phases: the offline training phase and the online prediction phase. Detailed descriptions are as follows.

DesPre—offline training

The offline training phase of DesPre aims to prepare the probabilities for efficient online computing of destination probabilities, and it includes three steps.

Personalized historical trajectories filtering

Human movement exhibits structural patterns, which may vary from person to person due to geographical and social constraints. Thus, the practice of utilizing the whole historical trajectory data set to construct one inference model for all users by Xue et al.¹⁰ would introduce inaccuracy when predicting destinations. Given this, our DesPre approach first filters the historical trajectories to construct personalized historical data set for individual-oriented destination prediction. The filter rules we employ are as follows:

1. Filtering by friend closeness. Previous studies indicate that more than half of the users’ movements are affected by their friends.²⁰ The closer two persons are, the more likely they are to hang out together, resulting in more similar visiting behaviors. From this aspect, the historical check-ins of the target user’s friends carry considerable weight in the user’s movement prediction. Thus, a variable friend closeness, denoted by clos, is used to identify the target user $u_{t}$ ’s close friends so as to filter historical trajectories. Given $u_{t}$ and his friend $u_{f}$

clos (u_{t}, u_{f}) = \frac{| F_{t} \cap F_{f} |}{\min (| F_{t} |, | F_{f} |)}

(3)

Then, the first filter rule we employed could be formalized as

\forall u_{f} \in F_{t}, t \in [1, n], f \in [1, n], if clos (u_{t}, u_{f}) \geq φ, then T_{t} = T_{t} \cup C_{f}

It means that only when clos meets the given friend closeness filter threshold $φ$ could $u_{f}$ ’s historical trajectories be incorporated into the target user’s personalized training set.
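
The first filter rule can be sketched as follows. This is a minimal illustration rather than the authors' implementation: friend lists are modelled as Python sets, the function names and the toy data are hypothetical, and two choices are assumptions not stated in the text, namely that $T_{t}$ starts from the target user's own check-ins and that closeness is taken as 0 when one of the friend lists is empty.

```python
def closeness(friends_t, friends_f):
    """Equation (3): number of shared friends divided by the smaller friend-list size."""
    smaller = min(len(friends_t), len(friends_f))
    if smaller == 0:
        return 0.0  # assumption: an undefined closeness is treated as 0
    return len(friends_t & friends_f) / smaller


def filter_by_closeness(target, friends, checkins, phi):
    """Build T_t from the check-ins of friends whose closeness is at least phi."""
    # Assumption: the training set starts with the target user's own check-ins.
    training = list(checkins.get(target, []))
    for f in sorted(friends[target]):
        if closeness(friends[target], friends[f]) >= phi:
            training.extend(checkins.get(f, []))
    return training


# Hypothetical toy data: friend lists F_i and check-ins C_i as (cell, time) pairs.
friends = {
    "t": {"a", "b", "c"},
    "a": {"t", "b", "c"},
    "b": {"t", "a"},
    "c": {"t", "a", "d", "e"},
}
checkins = {"t": [("g1", 1)], "a": [("g2", 2)], "b": [("g3", 3)], "c": [("g4", 4)]}
training = filter_by_closeness("t", friends, checkins, phi=0.5)
print(training)
```

In this toy example the closeness values of friends a, b, and c are 2/3, 1/2, and 1/3, so with $φ = 0.5$ only the check-ins of a and b are added.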

2. Filtering by user similarity. Similar users who have common interests and tastes tend to visit similar places.¹² Thus, similar users’ historical trajectories may contribute a lot to the target user’s destination prediction. In order to estimate user similarity, check-in vector²⁵ is introduced to measure a user’s visiting possibility to a set of grid cells.

Definition 1. (check-in vector)

Given a grid cell sequence $S = 〈 g_{1}, g_{2}, \dots, g_{i}, \dots, g_{m} 〉$ , $u_{t}$ ’s check-in vector of S is $P_{S}^{t} = 〈 p_{1}^{t}, p_{2}^{t}, \dots, p_{i}^{t}, \dots, p_{m}^{t} 〉$ , where $p_{i}^{t} \in [0, 1]$ represents $u_{t}$ ’s check-in probability to grid cell $g_{i}$ , specifically

p_{i}^{t} = \frac{| \{ c^{t} \mid g^{t} = g_{i} \} |}{| C_{t} |}

(4)

Thus, given two users $u_{t}$ and $u_{k}$ , the element set of sequence $S$ would be the union of the grid cells in $u_{t}$ ’s and $u_{k}$ ’s check-in sets, that is, $\{ g \mid (g, t) \in C_{t} \} \cup \{ g \mid (g, t) \in C_{k} \}$ . The user similarity, denoted by sim, could be computed by the cosine similarity of the two users’ check-in vectors, that is

sim (u_{t}, u_{k}) = \frac{P_{S}^{t} \cdot P_{S}^{k}}{\| P_{S}^{t} \|_{2} \times \| P_{S}^{k} \|_{2}}

(5)

where $\| P_{S}^{t} \|_{2}$ represents the two-norm of vector $P_{S}^{t}$ . Thus, the second filter rule is formalized as

\forall u_{k} \in F_{t}, t \in [1, n], k \in [1, n], if sim (u_{t}, u_{k}) \geq θ, then T_{t} = T_{t} \cup C_{k}

It means that only when sim meets the given user similarity threshold $θ$ could the friend’s historical trajectories be incorporated into $u_{t}$ ’s personalized training set $T_{t}$ .
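
Definition 1 and equation (5) can be illustrated with the short sketch below. It is only a sketch under stated assumptions: each user's check-ins are reduced to a list of visited grid cell labels, the function names and sample data are hypothetical, and the similarity is taken as 0 when either vector is all zeros.

```python
import math
from collections import Counter


def checkin_vector(visited_cells, cell_order):
    """Equation (4): share of the user's check-ins that fall into each cell of S."""
    counts = Counter(visited_cells)
    total = len(visited_cells)
    if total == 0:
        return [0.0] * len(cell_order)
    return [counts[g] / total for g in cell_order]


def similarity(cells_t, cells_k):
    """Equation (5): cosine similarity of the two users' check-in vectors."""
    order = sorted(set(cells_t) | set(cells_k))  # S: union of both users' cells
    p_t = checkin_vector(cells_t, order)
    p_k = checkin_vector(cells_k, order)
    dot = sum(a * b for a, b in zip(p_t, p_k))
    norm = math.sqrt(sum(a * a for a in p_t)) * math.sqrt(sum(b * b for b in p_k))
    return dot / norm if norm > 0 else 0.0  # assumption: 0 for an empty vector


# Hypothetical check-in cells of two users.
cells_t = ["g1", "g1", "g2", "g3"]
cells_k = ["g1", "g2", "g2", "g4"]
sim = similarity(cells_t, cells_k)
print(round(sim, 4))
```

Here $S = 〈 g_{1}, g_{2}, g_{3}, g_{4} 〉$ , the two check-in vectors are $〈 0.5, 0.25, 0.25, 0 〉$ and $〈 0.25, 0.5, 0, 0.25 〉$ , and the resulting similarity is $0.25 / 0.375 = 2/3$ .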

By applying the two filter rules to the whole data set, we can obtain an individual-oriented historical trajectories training set $T_{t}$ for each target user $u_{t}$ . Correspondingly, the sample grid cells’ set containing all checked-in grid cells involved in $T_{t}$ is also determined and is denoted by $G_{t}$ .

Markov model construction

In order to make full use of the selected trajectories and conquer the data sparsity problem, we decompose each trajectory in $T_{t}$ into sub-trajectories with length 2 and use them to construct a first-order Markov model following the practice of previous studies.^10,26,27

Specifically, each state in Markov model corresponds to a grid cell in $G_{t}$ , and the transition between two states corresponds to the movement between two adjacent grid cells. Thus, the one-step transition probability matrix $M^{1}$ in Markov model would be a two-dimensional matrix where one dimension refers to the grid cell of current state and the other refers to the next state. The entries of $M^{1}$ are the probabilities of traveling directly from one grid cell $g_{i}$ to its adjacent cell $g_{j}$ , which is denoted by $p_{ij}^{1}$ and is calculated as the number of trajectories containing sequence $〈 g_{i}, g_{j} 〉$ divided by the number of trajectories containing $g_{i}$

p_{ij}^{1} = p (g_{j} | g_{i}) = \frac{| \{ tra \mid 〈 g_{i}, g_{j} 〉 \subset tra, tra \in T_{t} \} |}{| \{ tra \mid 〈 g_{i} 〉 \subset tra, tra \in T_{t} \} |}

(6)

Using equation (6), we can get the one-step transition probabilities of each pair of adjacent grid cells in $G_{t}$ and fill matrix $M^{1}$ with them.

Take the scenario in Figure 1 as an example; suppose that the three trajectories shown in the grid graph are all the trajectories in $T_{t}$ , the corresponding grid cells set would contain all the cells in the grid graph, that is, $G_{t} = \{ g_{1}, g_{2}, \dots, g_{9} \}$ . Then, the trajectory ${tra}_{1} = 〈 g_{1}, g_{4}, g_{5}, g_{8} 〉$ would be decomposed into $〈 g_{1}, g_{4} 〉$ , $〈 g_{4}, g_{5} 〉$ , and $〈 g_{5}, g_{8} 〉$ , which in turn contributes to one-step transition probabilities $p_{14}^{1}$ , $p_{45}^{1}$ , and $p_{58}^{1}$ , respectively, and so do the other trajectories in $T_{t}$ . Eventually, the Markov model is constructed and the transition matrix $M^{1}$ is obtained.
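
The decomposition into length-2 sub-trajectories and the counting in equation (6) can be sketched as follows. Trajectories are modelled as lists of grid cell indices, and the sparse dictionary representation of $M^{1}$ is an illustrative choice. Only the first trajectory below is the one named in the text; the other two are hypothetical stand-ins for the remaining trajectories of Figure 1.

```python
from collections import defaultdict


def one_step_transitions(trajectories):
    """Equation (6): trajectories containing <g_i, g_j> over trajectories containing g_i."""
    pair_count = defaultdict(int)
    cell_count = defaultdict(int)
    for tra in trajectories:
        # Each trajectory is counted at most once per cell and per consecutive pair.
        for g in set(tra):
            cell_count[g] += 1
        for pair in set(zip(tra, tra[1:])):
            pair_count[pair] += 1
    return {(gi, gj): n / cell_count[gi] for (gi, gj), n in pair_count.items()}


trajectories = [
    [1, 4, 5, 8],  # tra_1 from the text
    [2, 5, 8],     # hypothetical
    [4, 5, 6],     # hypothetical
]
m1 = one_step_transitions(trajectories)
print(m1[(5, 8)], m1[(5, 6)])
```

With these three trajectories, $g_{5}$ appears in all three, $〈 g_{5}, g_{8} 〉$ in two, and $〈 g_{5}, g_{6} 〉$ in one, so $p_{58}^{1} = 2/3$ and $p_{56}^{1} = 1/3$ . Pairs that never occur are simply absent from the dictionary and correspond to zero entries of $M^{1}$ .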

Total transition probability matrix formation

The probabilities stored in $M^{1}$ are the one-step transition probabilities, corresponding to traveling from one cell to its adjacent cell in exactly one step. If two cells are not adjacent to each other in space, which means the distance between them is longer than one step, then the two corresponding entries in $M^{1}$ would be zero. Suppose that the distance between two grid cells $g_{i}$ and $g_{j}$ is $r$ ; then traveling from $g_{i}$ to $g_{j}$ takes at least $r$ steps, which corresponds to the $r$-step transition probability. To calculate the $r$-step transition probability, we raise $M^{1}$ to the $r$-th power to form $M^{r}$ , whose entries are the transition probabilities of traveling from one cell to another in exactly $r$ steps. The probability of moving from $g_{i}$ to $g_{j}$ via all the shortest paths (i.e. in $r$ steps) is equal to the value of $M_{ij}^{r}$ , the corresponding entry of $M^{r}$ . However, in practice, when traveling from one place to another, people do not always choose the shortest path; a small detour might be taken occasionally due to various reasons.¹⁰ Therefore, the possibility of traveling from $g_{i}$ to $g_{j}$ should be the sum of the probabilities of all possible paths between the two cells, which is calculated as the total transition probability

p_{ij}^{T} = \sum_{r = ds}^{dl} M_{ij}^{r} = \left[ M^{ds} \sum_{r = 0}^{dl - ds} M^{r} \right]_{ij} = \left[ M^{ds} (M^{0} + M^{1} + \dots + M^{dl - ds}) \right]_{ij}

(7)

where ds is the length of the shortest path between $g_{i}$ and $g_{j}$ , dl is the length of the possible longest path, $M^{0}$ is the identity matrix $I$ , and $[ \cdot ]_{ij}$ denotes the $(i, j)$ entry of a matrix. Using equation (7), the total transition probability of each pair of grid cells in $G_{t}$ and the total transition probability matrix $M^{T}$ can be obtained.
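
A minimal sketch of the accumulation in equation (7) is given below, using plain nested lists so that it has no external dependencies. It makes one simplifying assumption that is not in the article: a single cap `max_steps` plays the role of dl for every pair of cells. Starting the sum at $r = 1$ instead of $r = ds$ does not change the result, because an entry of $M^{r}$ is zero whenever $r$ is smaller than the shortest path length between the two cells. The function names and the sample matrix are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def total_transition_matrix(m1, max_steps):
    """Sum M^1 + M^2 + ... + M^max_steps entry by entry (see equation (7))."""
    n = len(m1)
    total = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in m1]  # current power of M, starting at M^1
    for _ in range(max_steps):
        for i in range(n):
            for j in range(n):
                total[i][j] += power[i][j]
        power = mat_mul(power, m1)
    return total


# Hypothetical 3-cell example: cell 0 moves to cell 1 or 2, cell 1 moves to cell 2.
m1 = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
]
total = total_transition_matrix(m1, max_steps=3)
print(total[0][2])
```

In this example cell 2 can be reached from cell 0 either directly (probability 0.5) or through cell 1 (probability 0.5 × 1.0), so the total transition probability from cell 0 to cell 2 is 1.0, whereas the one-step matrix alone would give 0.5.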

DesPre—online prediction

The online prediction phase of DesPre aims to list the top- $κ$ potential destinations of the target users by calculating the destination probabilities of the candidate grid cells. According to the analysis in section “Problem and goal,” the destination prediction problem lies in the computation of both the reachability probability $p (Δ t_{m} | l_{d} \in g_{i}, {tra}_{k}^{q})$ and the time-independent destination probability $p (l_{d} \in g_{i} | {tra}_{k}^{q})$ . In this subsection, we first present the definition of the reachability probability, then analyze how to compute the time-independent destination probability from the matrices acquired during the offline training phase, and finally introduce the online prediction algorithm of DesPre.

Reachability probability

Human movement may be circumscribed by geographical factors, such as road connectivity, maximum moving speed of road sections, and available time. Taking these geographical constraints into consideration, the attacking power of the adversaries would be enhanced, because the field of potential destinations can be narrowed and the accuracy and efficiency of destination inference can be improved.

Reachability probability is a variable indicating whether a certain grid cell is reachable or not from the target user’s current location within the time interval $Δ t_{m}$ under the limitation of the road network. Specifically, it can be defined as the following piecewise constant function

p (Δ t_{m} | l_{d} \in g_{i}, {tra}_{k}^{q}) = \begin{cases} 1, & \text{if the grid cell } g_{i} \text{ is reachable within } Δ t_{m} \\ 0, & \text{otherwise} \end{cases}

(8)

It is mainly used for filtering the unreachable grid cells to avoid further complex calculations of time-independent destination probabilities. If the reachability probability equals 1, the DesPre model will continue to calculate the time-independent destination probability. Otherwise, the current candidate grid cell will be eliminated.
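
The article does not spell out how reachability is evaluated on $G (V, E, S)$ . One plausible sketch, given purely as an assumption, is to run Dijkstra's algorithm with edge travel time equal to edge length divided by speed limit and to mark a grid cell as reachable when at least one intersection inside it can be reached within $Δ t_{m}$ . The edge lengths, the mapping `cell_of` from intersections to grid cells, and all names below are hypothetical and not part of the original model description.

```python
import heapq


def travel_times(adj, source):
    """Dijkstra: minimum travel time (s) from source to every reachable intersection.

    adj[v] is a list of (neighbour, length_m, speed_limit_mps) tuples.
    """
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        t, v = heapq.heappop(heap)
        if t > best.get(v, float("inf")):
            continue  # stale heap entry
        for w, length_m, speed in adj.get(v, []):
            nt = t + length_m / speed
            if nt < best.get(w, float("inf")):
                best[w] = nt
                heapq.heappush(heap, (nt, w))
    return best


def reachability(adj, source, cell_of, cell, dt_max_s):
    """Equation (8): 1 if some intersection in `cell` is reachable within dt_max_s."""
    times = travel_times(adj, source)
    hit = any(t <= dt_max_s for v, t in times.items() if cell_of[v] == cell)
    return 1 if hit else 0


# Hypothetical undirected road network (each edge listed in both directions).
adj = {
    "A": [("B", 1000.0, 10.0)],
    "B": [("A", 1000.0, 10.0), ("C", 3000.0, 10.0)],
    "C": [("B", 3000.0, 10.0)],
}
cell_of = {"A": "g1", "B": "g2", "C": "g3"}
print(reachability(adj, "A", cell_of, "g2", 300.0),
      reachability(adj, "A", cell_of, "g3", 300.0))
```

Starting from intersection A, B is 100 s away and C is 400 s away, so with $Δ t_{m} = 300$ s cell g2 is kept as a candidate and cell g3 is filtered out.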

Time-independent destination probability

Time-independent destination probability $p (l_{d} \in g_{i} | {tra}_{k}^{q})$ indicates the possibility of the target user’s destination being located in the grid cell $g_{i}$ without considering the time factor. To derive it, we again apply Bayes’ law and obtain

p (l_{d} \in g_{i} | {tra}_{k}^{q}) = \frac{p ({tra}_{k}^{q} | l_{d} \in g_{i}) p (l_{d} \in g_{i})}{\sum_{j} p ({tra}_{k}^{q} | l_{d} \in g_{j}) p (l_{d} \in g_{j})}

(9)

where $p (l_{d} \in g_{i})$ is the prior probability that can be easily computed as

p (l_{d} \in g_{i}) = \frac{| \{ tra \mid l_{d} \in g_{i} \} |}{| \{ tra \} |}

(10)

where $| \{ tra \mid l_{d} \in g_{i} \} |$ is the number of trajectories terminating at a location in $g_{i}$ and $| \{ tra \} |$ is the total number of trajectories in the training data set.
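
As a small illustration of equation (10), the prior can be obtained by counting the last cell of every training trajectory. The function name and the sample trajectories are hypothetical, and empty trajectories are ignored as an assumption.

```python
from collections import Counter


def destination_prior(trajectories):
    """Equation (10): share of training trajectories that terminate in each cell."""
    ends = Counter(tra[-1] for tra in trajectories if tra)
    total = sum(ends.values())
    return {g: n / total for g, n in ends.items()}


# Hypothetical training trajectories given as lists of grid cell indices.
prior = destination_prior([[1, 4, 5, 8], [2, 5, 8], [4, 5, 6]])
print(prior)
```

Two of the three sample trajectories end in cell 8 and one ends in cell 6, so the priors are 2/3 and 1/3; cells where no trajectory ends receive a prior of zero and are absent from the dictionary.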

In equation (9), $p ({tra}_{k}^{q} | l_{d} \in g_{i})$ is the posterior probability. Given a query trajectory ${tra}_{k}^{q}$ of user $u_{k}$ , whose starting position is located in grid cell $g_{s}$ and whose current position is in $g_{c}$ , the posterior probability measures the possibility that $u_{k}$ travels from $g_{s}$ to $g_{c}$ via ${tra}_{k}^{q}$ conditioned on his or her destination being located in $g_{i}$ . It can be calculated as

p ({tra}_{k}^{q} | l_{d} \in g_{i}) = \frac{[p ({tra}_{k}^{q}) \cdot p_{ci}^{t}]}{p_{si}^{t}}

(11)

where $p_{ci}^{t}$ and $p_{si}^{t}$ are the total transition probabilities of moving from $g_{c}$ and $g_{s}$ , respectively, to the predicted destination $g_{i}$ (that is, the entries of $M^{T}$ defined by equation (7)), and both of them can be directly retrieved from the total transition probability matrix $M^{T}$ obtained via the offline training of DesPre. $p ({tra}_{k}^{q})$ , named the trajectory probability, is the probability that a user moves specifically along the trajectory ${tra}_{k}^{q}$ , and it can be calculated as the product of the total transition probabilities of each pair of adjacent check-in cells in ${tra}_{k}^{q}$ , formally

p ({tra}_{k}^{q}) = p (〈 g_{s}, g_{2}, \dots, g_{n - 1}, g_{c} 〉) = p_{s 2}^{t} \cdot \left( \prod_{i = 2}^{n - 2} p_{i (i + 1)}^{t} \right) \cdot p_{(n - 1) c}^{t}

(12)

where every $p_{i (i + 1)}^{t}$ can be retrieved from $M^{T}$ directly. By incorporating equation (12) into (11), we could get the posterior probability needed for destination prediction.
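
Equations (11) and (12) can be sketched as two small functions. The total transition probabilities are passed in as a dictionary keyed by cell pairs, which is an illustrative representation of $M^{T}$ ; the function names and numbers are hypothetical, and returning 0 when $p_{si}^{t}$ is zero is an assumption made to avoid a division by zero.

```python
def trajectory_probability(query, p_total):
    """Equation (12): product of total transition probabilities along the query."""
    prob = 1.0
    for a, b in zip(query, query[1:]):
        prob *= p_total.get((a, b), 0.0)
    return prob


def posterior(query, dest, p_total):
    """Equation (11): p(tra^q) * p_ci / p_si for a candidate destination cell."""
    p_si = p_total.get((query[0], dest), 0.0)
    if p_si == 0.0:
        return 0.0  # assumption: destination unreachable from the start cell
    p_ci = p_total.get((query[-1], dest), 0.0)
    return trajectory_probability(query, p_total) * p_ci / p_si


# Hypothetical entries of the total transition probability matrix.
p_total = {("s", "c"): 0.5, ("c", "d"): 0.8, ("s", "d"): 0.4}
query = ["s", "c"]
print(trajectory_probability(query, p_total), posterior(query, "d", p_total))
```

For the query $〈 g_{s}, g_{c} 〉$ the trajectory probability is 0.5, and the value for destination d is $0.5 \times 0.8 / 0.4 = 1.0$ .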

Online prediction algorithm

Given the sample grid cells set $G_{t}$ , a user-specified query trajectory ${tra}_{k}^{q}$ , the maximum travel time $Δ t_{m}$ between the user’s current check-in and the destination, the total transition probability matrix $M^{T}$ generated in the offline training phase, and road network $G (V, E, S)$ , the pseudo-code of the online destination prediction algorithm is shown in Algorithm 1.

Algorithm 1. DesPre_Prediction.
Input: $G_{t}$ , $M^{T}$ , ${tra}_{k}^{q}$ , $Δ t_{m}$ , $G (V, E, S)$ .
Output: a sorted list of the top- $κ$ potential destinations.
1. $des_list \leftarrow \emptyset$ ; /* a list to store the prediction results. */
2. $p ({tra}_{k}^{q}) \leftarrow M^{T}$ ; /* calculate the trajectory probability $p ({tra}_{k}^{q})$ from $M^{T}$ . */
3. for each $g_{i}$ in $G_{t}$ do
4. /* calculate the reachability of $g_{i}$ according to the road network and traffic condition. */
5. $p (Δ t_{m} | l_{d} \in g_{i}, {tra}_{k}^{q}) \leftarrow G (V, E, S)$ ; /* calculate the reachability probability. */
6. if $p (Δ t_{m} | l_{d} \in g_{i}, {tra}_{k}^{q}) = = 0$ /* $g_{i}$ is not reachable within $Δ t_{m}$ . */
7. continue; /* discard $g_{i}$ and move on to the next candidate cell. */
8. else /* $p (Δ t_{m} | l_{d} \in g_{i}, {tra}_{k}^{q}) = = 1$ , $g_{i}$ is reachable within $Δ t_{m}$ . */
9. retrieve $p_{ci}^{t}$ and $p_{si}^{t}$ from $M^{T}$ ;
10. $p ({tra}_{k}^{q} | l_{d} \in g_{i}) \leftarrow p ({tra}_{k}^{q}), p_{ci}^{t}, p_{si}^{t}$ ; /* calculate the posterior probability. */
11. $p (l_{d} \in g_{i} | Δ t_{m}, {tra}_{k}^{q}) = 1 \cdot p (l_{d} \in g_{i} | {tra}_{k}^{q}) \leftarrow p ({tra}_{k}^{q} | l_{d} \in g_{i}), p (l_{d} \in g_{i})$ ; /* calculate the destination probability. */
12. save $p (l_{d} \in g_{i} | Δ t_{m}, {tra}_{k}^{q})$ into $des_list$ ;
13. end if
14. sort $des_list$ ;
15. return the top- $κ$ elements of $des_list$ ;

As shown in Algorithm 1, we first calculate the trajectory probability $p ({tra}_{k}^{q})$ of the given query trajectory using equation (12) (line 2). Then, for each $g_{i}$ in the sample grid cells set $G_{t}$ , we calculate the reachability probability of $g_{i}$ under the constraint of the road network, as described in subsection “Reachability probability” (line 5), and proceed according to its value. If $p (Δ t_{m} | l_{d} \in g_{i}, {tra}_{k}^{q}) = = 0$ (line 6), $g_{i}$ is not reachable, so we eliminate it from the candidate list and skip directly to the next candidate cell (line 7). Otherwise (line 8), $p (Δ t_{m} | l_{d} \in g_{i}, {tra}_{k}^{q}) = = 1$ , which means $g_{i}$ is reachable, so we go on to calculate the time-independent destination probability: we first retrieve $p_{ci}^{t}$ and $p_{si}^{t}$ from $M^{T}$ (line 9), then calculate the posterior probability $p ({tra}_{k}^{q} | l_{d} \in g_{i})$ using equation (11) (line 10) and $p (l_{d} \in g_{i} | {tra}_{k}^{q})$ using equations (9) and (10) (line 11). The destination probability is obtained at the same time, since $p (l_{d} \in g_{i} | Δ t_{m}, {tra}_{k}^{q})$ equals $p (l_{d} \in g_{i} | {tra}_{k}^{q})$ in this case (line 11). Finally, we sort the grid cells by the obtained destination probabilities and return the top- $κ$ potential destinations.
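The online phase can be sketched in Python as follows. This is a simplified illustration rather than the authors' implementation: it assumes $M^{T}$ and the prior probabilities are plain dictionaries, it replaces the road-network computation with a hypothetical `reachable(current, candidate)` callback that returns True or False, and it scores each candidate with the ratio form of equation (13), in which the trajectory probability cancels.

```python
def despre_predict(candidates, M, prior, query, reachable, kappa):
    """Return up to kappa (cell, destination probability) pairs, best first."""
    start, current = query[0], query[-1]

    def weight(g):
        # Unnormalised score (p_ci / p_si) * p(l_d in g_i).
        p_si = M.get(start, {}).get(g, 0.0)
        if p_si == 0.0:
            return 0.0  # assumption: no transition from g_s means score 0
        return M.get(current, {}).get(g, 0.0) / p_si * prior.get(g, 0.0)

    weights = {g: weight(g) for g in candidates}
    total = sum(weights.values())  # normalising sum over all candidates j
    if total == 0.0:
        return []

    scored = [
        (g, w / total)
        for g, w in weights.items()
        if w > 0.0 and reachable(current, g)  # reachability is either 0 or 1
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:kappa]


# Toy inputs with made-up values.
M = {"A": {"B": 0.5, "C": 0.2, "D": 0.4}, "B": {"C": 0.5, "D": 0.6}}
prior = {"C": 0.5, "D": 0.5}
top = despre_predict(
    candidates=["B", "C", "D"],
    M=M,
    prior=prior,
    query=["A", "B"],
    reachable=lambda cur, g: g != "C",  # pretend cell C is out of reach
    kappa=2,
)
```

For clarity the sketch computes all weights before filtering; an implementation concerned with speed would test reachability first, as Algorithm 1 does, so that unreachable cells never trigger a matrix lookup.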

Privacy protection method CkiDel and framework DPCD

Privacy protection method: CkiDel

As indicated above, people’s movement patterns are hidden in their historical check-in records, and destination prediction approaches utilize exactly these historical check-ins to mine behavior patterns and thus infer destinations. Given this, the most intuitive way to guard against destination inference attacks is to prevent adversaries from obtaining the correct patterns of users’ movements so that the most likely destinations cannot be derived. This is the basic idea of our proposed location privacy protection method CkiDel. In practice, users with high security awareness tend to manually remove some of the historical content published on their OSN profiles at regular intervals, which supports the feasibility of our idea. In CkiDel, we break a user’s original behavior patterns by deleting some of his or her historical check-in records so that the sensitive destination of the query trajectory is no longer predicted among the top- $κ$ potential destinations. Furthermore, a protection method should preserve as much of the functionality of GeoSNs’ check-in service as possible and guarantee a good user experience. However, whether it is convenient to remove certain check-in nodes is quite subjective, and it is hard to set uniform standards. Our belief is that the fewer nodes are removed, the less inconvenience is caused. All in all, the principle we follow is that CkiDel minimizes the number of removed check-ins while still meeting the privacy request. The detailed method is as follows.

CkiDel—check-in nodes removing strategy

According to the analysis in sections “Problem and goal” and “Destination prediction approach: DesPre,” given a specific query trajectory ${tra}_{k}^{q}$ of user $u_{k}$ and the maximum travel time $Δ t_{m}$ , the destination probability of $g_{i}$ that is ultimately needed can be calculated as equation (13) by incorporating equations (2), (9), and (11)

p (l_{d} \in g_{i} | Δ t_{m}, {tra}_{k}^{q}) = p (Δ t_{m} | l_{d} \in g_{i}, {tra}_{k}^{q}) \cdot \frac{\frac{p_{ci}^{t}}{p_{si}^{t}} \cdot p (l_{d} \in g_{i})}{\sum_{j} (\frac{p_{cj}^{t}}{p_{sj}^{t}} \cdot p (l_{d} \in g_{j}))}

(13)

which shows that, apart from the candidate cell itself, the value of the destination probability depends only on the starting grid cell $g_{s}$ and the current grid cell $g_{c}$ . Specifically, in equation (13), the former multiplier, that is, the reachability probability, is a piecewise constant function defined as equation (8); since its value equals either 1 or 0, it is mainly used for filtering out the unreachable grid cells before complex calculations. As for the latter multiplier in equation (13), besides the candidate grid cell $g_{i}$ , the values of $p_{ci}^{t}$ and $p_{si}^{t}$ depend on the current grid cell $g_{c}$ and the starting cell $g_{s}$ , respectively. In addition, once the training set is fixed, the value of the prior probability $p (l_{d} \in g_{j})$ does not change.
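The reason the intermediate cells drop out can be made explicit. The short derivation below assumes that equation (9) has the standard Bayesian form (the same form as equation (16)); under that assumption, substituting equation (11) places the trajectory probability $p ({tra}_{k}^{q})$ in both the numerator and every term of the denominator, so it cancels.

```latex
\begin{aligned}
p (l_{d} \in g_{i} | {tra}_{k}^{q})
&= \frac{p ({tra}_{k}^{q} | l_{d} \in g_{i}) \cdot p (l_{d} \in g_{i})}
        {\sum_{j} p ({tra}_{k}^{q} | l_{d} \in g_{j}) \cdot p (l_{d} \in g_{j})} \\
&= \frac{p ({tra}_{k}^{q}) \cdot \frac{p_{ci}^{t}}{p_{si}^{t}} \cdot p (l_{d} \in g_{i})}
        {\sum_{j} p ({tra}_{k}^{q}) \cdot \frac{p_{cj}^{t}}{p_{sj}^{t}} \cdot p (l_{d} \in g_{j})} \\
&= \frac{\frac{p_{ci}^{t}}{p_{si}^{t}} \cdot p (l_{d} \in g_{i})}
        {\sum_{j} \left( \frac{p_{cj}^{t}}{p_{sj}^{t}} \cdot p (l_{d} \in g_{j}) \right)}
\end{aligned}
```

The last line is the second multiplier of equation (13); only $g_{s}$ and $g_{c}$ remain, entering through $p_{si}^{t}$ and $p_{ci}^{t}$ .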

All in all, given an individual-oriented DesPre inference model and a user-specified query trajectory, the destination probabilities of potential positions will not change unless an endpoint of the query trajectory is removed. So, the strategy of our proposed CkiDel method is to delete endpoints of the query trajectory until the calculated destination probability of the sensitive actual destination meets the user-specified privacy threshold. Moreover, since the current grid cell $g_{c}$ is the one in which the user is about to check in, CkiDel will not delete $g_{c}$ unless the user permits it, so as to preserve the functionality of the GeoSN and a good user experience.

CkiDel algorithm

Based on the above idea and strategy, the algorithm of CkiDel is shown in Algorithm 2.

Algorithm 2. CkiDel.
Input: ${tra}_{k}^{q} = 〈 g_{1}, g_{2}, \dots, g_{n} 〉$ , $g_{t}$ , $κ$ , $G_{t}$ , $M^{T}$ , $Δ t_{m}$ , $G (V, E, S)$ .
Output: a list of check-ins to be removed.
1. $des_list \leftarrow \emptyset$ ; /* a list to store the predicting results. */
2. $rvk_list \leftarrow \emptyset$ ; /* a list to store the historical check-ins to be removed. */
3. for $i = 1 : n - 1$ do
4. ${tra}_{k}^{q} = 〈 g_{i + 1}, \dots, g_{n} 〉$ /* revoke the first i check-ins in the original query trajectory. */
5. /* calculate the top- $κ$ potential destinations for the new ${tra}_{k}^{q}$ , and update $des_list$ . */
6. $des_list \leftarrow$ DesPre_Prediction /* calculate the top- $κ$ potential destinations. */
7. if $g_{t} \notin des_list$ /* $g_{t}$ is beyond the predicted top- $κ$ potential destinations. */
8. $rvk_list \leftarrow 〈 g_{1}, \dots, g_{i} 〉$ ; /* store the first i check-ins needed to be removed. */
9. return $rvk_list$ ;
10. end if /* the rank of $g_{t}$ ’s destination probability is still too high, go on removing */
11. /* the privacy request cannot be met by deleting historical check-ins; ask for the user’s
12. permission to give up checking in $g_{c}$ this time. */
13. return $g_{c}$ ;

As shown in Algorithm 2, given the query trajectory ${tra}_{k}^{q}$ , the user-specified privacy threshold $κ$ , and the sensitive destination $g_{t}$ , CkiDel iteratively deletes endpoints of ${tra}_{k}^{q}$ one at a time, starting from the very first one (line 4). For each newly obtained query trajectory, it uses the DesPre_Prediction algorithm to calculate the top- $κ$ potential destinations (line 6) and checks whether the sensitive destination $g_{t}$ is among them. Once the result meets the privacy threshold, that is, $g_{t}$ is no longer in the top- $κ$ list, the historical check-ins to be removed are returned (lines 7–9). Otherwise, CkiDel continues to generate a new query trajectory. Note that the privacy request may remain unmet even after all the check-ins in ${tra}_{k}^{q}$ except the current one, that is, the one the target user is about to share on GeoSNs, have been removed. In this case, CkiDel returns only the current check-in and suggests that the user give up checking in at this grid cell for the moment (line 13).
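The removal loop can be sketched in Python. This is an illustration under assumptions, not the authors' code: the predictor is passed in as a hypothetical callback `predict_top_k(trajectory)` that returns the top- $κ$ cells for a trajectory, and the fallback of Algorithm 2 (returning $g_{c}$ ) is represented by a boolean flag instead.

```python
def ckidel(query, sensitive, predict_top_k):
    """Return (check-ins to remove, whether to skip the current check-in).

    query          -- list of cells <g_1, ..., g_n>, where g_n is g_c
    sensitive      -- the sensitive destination g_t
    predict_top_k  -- callable mapping a trajectory to its top-kappa cells
    """
    n = len(query)
    for i in range(1, n):
        shortened = query[i:]  # drop the first i check-ins
        if sensitive not in predict_top_k(shortened):
            return query[:i], False  # removing g_1..g_i is enough
    # Even the trajectory consisting of g_c alone still reveals g_t:
    # suggest that the user give up checking in at g_c.
    return [], True


# Stub predictor for demonstration only: it "predicts" the sensitive cell X
# whenever the trajectory still starts at cell A or cell B.
def stub_predictor(trajectory):
    return ["X"] if trajectory[0] in {"A", "B"} else ["Y"]


removed, skip_current = ckidel(["A", "B", "C", "D"], "X", stub_predictor)
```

Because the loop stops at the first prefix whose removal hides $g_{t}$ , the returned list is the shortest prefix that meets the privacy request, in line with the goal of removing as few check-ins as possible.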

Privacy protection framework: DPCD

Based on the above inference model and protection method, we propose a location privacy protection framework DPCD to guard against destination inference attacks. DPCD consists of three main components: mobile user device, trusted proxy server, and GeoSN server. We assume the centralized trusted proxy server lies between GeoSN users and GeoSN server. The architecture of DPCD is shown in Figure 2.

Figure 2.

Architecture of the DPCD framework.

The trusted proxy server has three modules: DesPredictor, PrvProtector, and CkiProcessor. Based on the DesPre approach, DesPredictor is in charge of preparing all the intermediate parameters offline and calculating the most likely destinations online for target users. Given the user-specified sensitive locations and predicted destinations from DesPredictor, PrvProtector is responsible for designing the privacy preserving check-in proposals by applying the CkiDel method. According to the proposals made by PrvProtector, CkiProcessor acts on behalf of target users to publish the real-time check-in on GeoSN and deletes the chosen historical check-in records. To clearly articulate the working process of DPCD, Figure 3 shows a sequence diagram to present the flow of a check-in event under the DPCD framework.

Figure 3.

Sequence diagram of the check-in event under the DPCD framework.

The proposed protection method and framework aim to provide a privacy alert and a suggested solution rather than a compulsory measure. This is why the DPCD framework is designed to involve multiple interactions between GeoSN users and the trusted proxy server. As mentioned in section “Privacy protection method: CkiDel,” whether it is convenient to remove certain check-in nodes is quite subjective. Thus, the best we can do is to remind users of the potential threats they may face if they insist on checking in at certain nodes.

Experimental evaluation

Data preprocessing and properties analysis

Data preprocessing method

The data set we utilized is a real-world check-in data set from a GeoSN named Gowalla, which was collected by Cho et al.²⁰ from February 2009 to October 2010. It is a worldwide data set containing both public check-in data and an explicit social network. For simplicity, only the resident users of California (i.e. the study area) and related check-ins are studied by this article, and a subset selection method is needed.

As far as we know, the commonly used subset selection methods in such studies directly keep the check-in records located in the study area of interest¹² and then select the corresponding users as the experimental subjects. This method is crude: a considerable number of trajectories are artificially broken, which in turn breaks users’ moving patterns. Destination predictions and performance evaluations based on such broken trajectories are not meaningful. To overcome this deficiency, we propose a data preprocessing method to construct a reasonable check-in subset from the worldwide check-in data set of Gowalla. The preprocessing method is described as follows.

First, we select the check-in records whose POI is located in California, namely, California-related check-ins. The users involved in the California-related check-ins, namely, California-related users, are thereby determined. Note that, among the California-related users, there are some whose main activity area lies beyond our study area and who only paid short visits to California. For such users, our check-in selection breaks their movement patterns, so destination prediction based on the broken trajectory data would not be meaningful. Thus, these users should be discarded. Specifically, for each California-related user, we calculate the ratio of the number of California-related check-ins to the total number of the user’s check-ins in the original data set. Then, using this check-in ratio as the filter, all the California-related users whose check-in ratio is less than 1 are removed from our data set, so that only the resident users of California, namely, significant users, remain.
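The filter can be expressed compactly. The following Python sketch assumes, for illustration only, that each check-in is a `(user, poi)` pair and that a hypothetical predicate `in_study_area(poi)` decides whether a POI lies in California; neither is part of the Gowalla data format.

```python
from collections import Counter


def select_resident_users(checkins, in_study_area):
    """Keep only users whose check-in ratio equals 1, i.e. users for whom
    every check-in in the original data set lies inside the study area."""
    total = Counter()
    inside = Counter()
    for user, poi in checkins:
        total[user] += 1
        if in_study_area(poi):
            inside[user] += 1

    # Comparing counts avoids floating-point issues with the ratio itself.
    residents = {u for u in total if inside[u] == total[u]}
    kept = [(u, p) for u, p in checkins if u in residents]
    return residents, kept


# Toy data: POIs are labelled by region for the sake of the example.
checkins = [
    ("u1", "CA-cafe"), ("u1", "CA-park"),
    ("u2", "CA-pier"), ("u2", "NY-museum"),
    ("u3", "TX-diner"),
]
residents, kept = select_resident_users(checkins, lambda poi: poi.startswith("CA-"))
```

In this toy run only `u1` survives: `u2` is a visitor whose trajectories would be cut at the state border, and `u3` never checked in inside the study area.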

Our new method guarantees that all the needed trajectories of a selected experimental subject (i.e. a target GeoSN user) are completely preserved and never artificially broken. In other words, for a given target user, regardless of whether the data set ultimately used is the worldwide one or the tailored one obtained through our preprocessing method, the trajectories related to him or her (i.e. those needed to construct his or her prediction model) are always the same, so the subsequent model construction process and the model’s performance do not differ. This implies that the proposed data preprocessing method does not reduce the validity or generality of the experimental evaluation.

Properties analysis

The tailored data set obtained through the aforementioned preprocessing method includes 5397 users and 214,961 check-in records. We further calculate the time interval and distance between two consecutive check-ins, and the distributions of them are shown in Figure 4.

Figure 4.

The resulting data set’s attributes.

Besides the data set of Gowalla, we also obtain the road network data of California²⁸ for the calculation of reachability probabilities, which contains 21,693 edges (i.e. road sections).

Effectiveness evaluation

Evaluation metrics

Two performance metrics, Predictive Accuracy and Aggregated Distance Error, are used to evaluate the effectiveness of our proposed approaches. Predictive Accuracy measures how accurately the prediction models can predict the actual destination of the users. For instance, accuracy of 0.8 means that in 80% of the cases, the actual destination is among the top- $κ$ predicted destinations. The parameter $κ$ is the privacy threshold that determines the number of predicted destinations to be presented to users. The Distance Error for a single prediction of a query trajectory is defined as the weighted average of the distance deviations between the actual destination $g_{d}$ and each one of the predicted top- $κ$ destinations $g_{i}$ , formally

DE = \frac{\sum_{i = 1}^{κ} (p_{g_{i}}^{des} \cdot ‖ g_{i} - g_{d} ‖)}{\sum_{i = 1}^{κ} p_{g_{i}}^{des}}

(14)

where $p_{g_{i}}^{des}$ is the destination probability of $g_{i}$ . The Aggregated Distance Error is defined as the arithmetic mean of the $DE$ values of all predictions in the experiments, and it measures the spatial proximity of the predicted destinations to the real destination. The higher the Predictive Accuracy, the better the prediction model performs, whereas a lower Aggregated Distance Error indicates better performance.
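Both metrics are simple to compute once the predictions are available. The Python sketch below is illustrative only: it assumes each grid cell is represented by planar centre coordinates (for example, in kilometres) and uses the Euclidean distance for the distance term in equation (14); the function names are hypothetical.

```python
from math import hypot


def distance_error(predictions, actual):
    """Equation (14): probability-weighted mean distance between the
    predicted top-kappa cells and the actual destination.

    predictions -- list of ((x, y), destination_probability)
    actual      -- (x, y) of the actual destination cell
    """
    total_p = sum(p for _, p in predictions)
    if total_p == 0:
        raise ValueError("destination probabilities sum to zero")
    weighted = sum(
        p * hypot(x - actual[0], y - actual[1]) for (x, y), p in predictions
    )
    return weighted / total_p


def aggregated_distance_error(errors):
    """Arithmetic mean of the per-query distance errors."""
    return sum(errors) / len(errors)


def predictive_accuracy(cases):
    """Fraction of queries whose actual cell is among the top-kappa cells.

    cases -- list of (top_kappa_cells, actual_cell)
    """
    hits = sum(1 for top, actual in cases if actual in top)
    return hits / len(cases)


de = distance_error([((3.0, 4.0), 0.75), ((0.0, 0.0), 0.25)], (0.0, 0.0))
acc = predictive_accuracy([(["a", "b"], "a"), (["c"], "d"), (["e"], "e"), (["f"], "f")])
```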

Baseline models

We employ three non-trivial baseline models for effectiveness evaluation.

1. Most Frequent Visit (MFV) model

The MFV model²⁹ takes the probability that a user $u_{k}$ ’s next check-in is in grid cell $g_{i}$ to be the fraction of check-ins in $g_{i}$ within $u_{k}$ ’s visiting history, formally

\overset{MFV}{p} (l_{d} \in g_{i} | Δ t_{m}, {tra}_{k}^{q}) = \frac{| \{ c^{k} | l^{k} \in g_{i} \} |}{| \{ c^{k} \} |}

(15)

2. ZMDB model

The ZMDB model,²⁴ which is named after its authors, is based on the Bayesian inference framework, specifically

\overset{ZMDB}{p} (l_{d} \in g_{i} | Δ t_{m}, {tra}_{k}^{q}) = \frac{p ({tra}_{k}^{q} | l_{d} \in g_{i}) p (l_{d} \in g_{i})}{\sum_{j} p ({tra}_{k}^{q} | l_{d} \in g_{j}) p (l_{d} \in g_{j})}

(16)

The posterior probability is defined as

p ({tra}_{k}^{q} | l_{d} \in g_{i}) = \frac{| \{ {tra}_{l_{d} \in g_{i}} | {tra}_{k}^{q} \subset {tra}_{l_{d} \in g_{i}} \} |}{| \{ {tra}_{l_{d} \in g_{i}} \} |}

(17)

where ${tra}_{k}^{q} \subset {tra}_{l_{d} \in g_{i}}$ means that the query trajectory partially matches a trajectory whose termination $l_{d}$ is located in $g_{i}$ . The prior probability $p (l_{d} \in g_{i})$ can be easily calculated by equation (10).

3. Sub-Trajectory Synthesis (SubSyn) model

The SubSyn model¹⁰ mainly focuses on solving the data sparsity problem faced by destination inference models based on historical trajectories. It also follows the Bayesian inference framework, where the destination probability and the prior probability $p (l_{d} \in g_{i})$ are likewise calculated by equations (16) and (10), respectively. The idea of SubSyn is similar to ours, but it only takes the historical trajectories into account and does not consider any external information, such as the time factor, social structure, or geographical information.
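The two count-based baselines, equations (15) and (17), can be sketched in a few lines of Python. The sketch is hedged in two ways: trajectories are assumed to be lists of cell identifiers, and “partially matches” in equation (17) is interpreted here as “contains the query as a contiguous sub-sequence,” which is an assumption of this illustration rather than a definition taken from the ZMDB paper.

```python
from collections import Counter


def mfv_probabilities(history):
    """Equation (15): share of the user's check-ins that fall in each cell."""
    counts = Counter(history)
    n = len(history)
    return {cell: c / n for cell, c in counts.items()}


def contains(trajectory, query):
    """True if query occurs in trajectory as a contiguous sub-sequence."""
    m = len(query)
    return any(
        trajectory[i:i + m] == query for i in range(len(trajectory) - m + 1)
    )


def zmdb_posterior(trajectories, query, dest):
    """Equation (17): among historical trajectories ending in dest,
    the fraction that match the query trajectory."""
    ending = [t for t in trajectories if t and t[-1] == dest]
    if not ending:
        return 0.0  # no trajectory ends in dest: the data sparsity case
    return sum(contains(t, query) for t in ending) / len(ending)


mfv = mfv_probabilities(["a", "a", "b", "c"])
history = [["A", "B", "C"], ["B", "C"], ["A", "C"], ["A", "B", "D"]]
p_c = zmdb_posterior(history, ["A", "B"], "C")
p_e = zmdb_posterior(history, ["A", "B"], "E")
```

The last call shows why ZMDB suffers from data sparsity: a destination that never terminates a historical trajectory, or a query that matches none of them, receives probability 0.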

Prediction capability evaluation of DesPre

Evaluation with the varied time span of training set

Human movement patterns may change over time, so the choice of historical trajectories influences the performance of inference models. To study the models’ effectiveness when the time span of the training set varies, we separate all the check-in records over the time span of 20 months into 13 time buckets; each time bucket thus covers approximately 1.5 months, and the timestamp at the end of the $i$th time bucket is denoted by $τ_{i}$ . We then carry out 13 rounds of experiments in total for the target users. For the $i$th round of experiments, the raw training set used by DesPre’s offline training phase includes not only the check-ins recorded in the first $(i - 1)$ time buckets (i.e. those recorded before the time $τ_{i - 1}$ ), but also those recorded in the first month of the $i$th time bucket. The testing set used for the online prediction includes the check-ins recorded in the last half month of the $i$th time bucket. The detailed data set partition of the $i$th round of experiments is shown in Figure 5.

Figure 5.

Data set partition of the $i$th round of experiments.
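The partition can be reproduced with a short Python sketch. It assumes that check-ins are `(timestamp, cell)` pairs and that the buckets are equal slices of the overall time span, with the first two thirds of bucket $i$ (about 1 month) going to training and the last third (about 0.5 month) to testing; the function name and parameters are hypothetical.

```python
from datetime import datetime, timedelta


def round_split(checkins, t0, t_end, i, n_buckets=13, train_fraction=2 / 3):
    """Split check-ins into the training and testing sets of round i (1-based).

    Training: everything before the cut point inside bucket i.
    Testing:  the remainder of bucket i.
    """
    bucket = (t_end - t0) / n_buckets           # length of one time bucket
    bucket_start = t0 + (i - 1) * bucket        # this is tau_{i-1}
    cut = bucket_start + train_fraction * bucket
    bucket_end = t0 + i * bucket                # this is tau_i

    train = [c for c in checkins if c[0] < cut]
    test = [c for c in checkins if cut <= c[0] < bucket_end]
    return train, test


# Toy timeline: 13 buckets of 45 days each, check-ins given as day offsets.
t0 = datetime(2009, 2, 1)
t_end = t0 + timedelta(days=13 * 45)
checkins = [(t0 + timedelta(days=d), "g") for d in (10, 50, 74, 76, 89, 95)]
train, test = round_split(checkins, t0, t_end, i=2)
```

With $i = 2$ the training set spans roughly 2.5 months, which corresponds to the second round of experiments.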

In each round of experiments, we first use the corresponding raw training set to construct an individual-oriented Markov model for each target user, where each user’s check-in sequence recorded in 1 day is regarded as a trajectory. Then, we randomly choose five check-ins for each target user from the testing set as the current check-in $g_{c}$ to be published. The check-in that immediately follows each $g_{c}$ is regarded as the real destination that the inference models need to predict. The partial trajectory formed by the check-ins published on the same day as $g_{c}$ but before $g_{c}$ is regarded as the query trajectory. As for the maximum travel time $Δ t_{m}$ between the user’s current check-in and the destination, we treat it as a constant for brevity. Since the mode of the time interval between consecutive check-ins is within 3 h, $Δ t_{m}$ is set to 3 h. Figure 6 shows the differences in Predictive Accuracy and Aggregated Distance Error as the time span of the training set varies.

Figure 6.

Performance of inference models with varied time span of training set.

According to Figure 6, all the models show a roughly decreasing trend in accuracy and an increasing trend in Aggregated Distance Error as the training set’s time span increases, which could be explained by the growing number of unique check-in locations. DesPre performs better than the baselines in all 13 rounds of experiments, and its best performance occurs in the second round, that is, when the time span of the training set is approximately 2.5 months. In the first round of experiments, DesPre performs relatively worse because few patterns can be mined to predict destinations without sufficient data; the same holds for SubSyn. SubSyn ranks second to DesPre, but it shows the steepest decline in accuracy because it uses all the users’ check-in records to construct a single inference model, which obscures each user’s personalized movement pattern in the long run. The ZMDB model, although its accuracy and precision are unimpressive, performs stably except in the first round, where its effectiveness is seriously affected by the data sparsity problem since a training set with a short time span does not contain enough historical trajectories. MFV simply predicts the destination as the most frequently checked-in location in history. It ignores short-term effects and cannot tell which locations in a long history are currently significant, which leads to a continuous decrease in Predictive Accuracy and a large margin of error. All in all, DesPre outperforms SubSyn, ZMDB, and MFV in both Predictive Accuracy and Aggregated Distance Error, and the optimal time span of the training set is approximately 2.5 months in the scenario of this article.

Evaluation with the varied size of grid cell

The size of the grid cell may influence both the precision and the accuracy of prediction models. On one hand, a coarse grid may lead to low prediction precision because the area covered by each grid cell is large; on the other hand, it may improve prediction accuracy because more trajectories in the training set fall into identical grid cells, which increases the number of trajectories matching the query. A fine grid, by contrast, may lead to high prediction precision but relatively low accuracy, and it may also make prediction models less efficient. Thus, we study the impact of grid size on the performance of inference models to find a side length that balances these effects and achieves the best prediction performance.

We carry out four rounds of experiments by changing the side length $λ$ of the grid cell from 200 to 800 m in 200 m increments. The time span of the training set is 2.5 months, and the rest of the experimental setup is consistent with that in subsection “Evaluation with the varied time span of training set.” Figure 7 shows the performance differences of inference models with respect to the grid cell’s side length.

Figure 7.

Performance of inference models with varied size of grid cell.

As shown in Figure 7, the Predictive Accuracy of all the prediction models generally rises as the grid cell’s side length increases, which agrees with the above analysis that a coarse grid leads to high accuracy. Among all the models, DesPre achieves the highest accuracy, with its lowest Aggregated Distance Error being 1.2 km, since it uses rich sources of background information to mine the movement patterns of users. MFV performs the worst because it simply predicts the destination as the most frequently visited location in history and cannot distinguish which location is more important over a given period of time. The ZMDB approach suffers the most from the data sparsity problem caused by the fine grid, especially when the side length of the grid cell is very short, for example, 200 m, while SubSyn and DesPre are not much affected. As for the Aggregated Distance Error, all the models show a general increase with λ after the global minimum point at λ = 400 m. In sum, DesPre has a clear advantage in both Predictive Accuracy and Aggregated Distance Error, and 400 m is the optimal side length of the grid cell for the data set used in this article.

Protection ability evaluation of CkiDel

Since the protection ability of DPCD is consistent with that of the CkiDel method, we only discuss the protective performance of CkiDel in the following. Our evaluation methodology is to feed the incomplete trajectories processed by CkiDel into several non-trivial destination inference models and check whether the models can still obtain the real destinations. If they can, CkiDel is ineffective; if they cannot, CkiDel achieves the protection goal of guarding against destination prediction attacks.

Our evaluation is based on the experiments conducted in subsection “Evaluation with the varied size of grid cell.” We take the resulting incomplete trajectories obtained by deleting a portion of check-ins of the original historical trajectories as the input of several destination inference attack models, including the DesPre-based, SubSyn-based, ZMDB-based, and MFV-based attack models. Then, we record the predictive results of them.

The aforementioned performance metrics, Predictive Accuracy and Aggregated Distance Error, are again used to evaluate the prediction performance of these destination inference models. The results show that the Predictive Accuracy of all the attack models is 0 and that their Aggregated Distance Error is high (no less than 32.57 km), with nearly 37.41% of the check-in nodes in the query trajectories removed on average. This indicates that the attack models are ineffective against CkiDel, and thus demonstrates the protection capability of the CkiDel method and the DPCD framework. In fact, these results are consistent with theoretical analysis. CkiDel is designed based on the DesPre model, and a removal list of historical check-ins is accepted if and only if the inference model is unable to predict the real destination using the resulting trajectory. Furthermore, according to the results in subsection “Prediction capability evaluation of DesPre,” DesPre is the most powerful inference model and uses the richest background information for destination prediction, yet it still fails to obtain the correct results. The weaker models, whose input information is poorer than that of DesPre, fail as well. From this point of view, it is easy to understand why CkiDel and DPCD achieve the protection goal.

Efficiency evaluation

Besides the effectiveness of the proposed approaches, the operating efficiency of the DPCD framework is also important, because it must run online and answer real-time queries promptly in order to support GeoSNs’ check-in service and guarantee a good user experience. Given the architecture of DPCD, we study the running efficiencies of DesPre and CkiDel separately on an enterprise server with an Intel Xeon CPU E5-2650 and 64 GB RAM.

Efficiency of DesPre

Running efficiency of DesPre offline training phase

We carry out the evaluation with respect to various grid sizes, that is, four rounds of experiments by changing the side length $λ$ of grid cell from 200 to 800 m with 200 m increment, while the time span of the utilized training set is 2.5 months. Experiments have been repeated five times and the average runtimes of DesPre-training algorithm are shown in Table 1.

Table 1.

Average runtime of DesPre-training phase.

Side length of grid cells (m)	200	400	600	800
Average runtime (hh:mm:ss)	13:49:21	02:09:13	00:28:58	00:07:34

As we can see from Table 1, the efficiency of DesPre’s offline training phase is not high, because we deliberately move large quantities of time-consuming calculations to the offline training phase in order to guarantee the efficiency of the online prediction. Fortunately, the DesPre-training algorithm is not invoked for each online query, and the transition probabilities generated by the offline training phase need not be updated instantaneously. In other words, the DesPre-training algorithm only needs to run occasionally, for example, once every few months. Therefore, the relatively long runtimes are acceptable, especially the one corresponding to the optimal side length of 400 m, whose average runtime is around 2 h.

Running efficiency of DesPre online prediction phase

As for the online prediction phase of DesPre, we record the runtime of the algorithm and compare it with SubSyn and ZMDB, since they all follow the Bayesian inference framework and have a relatively independent and complete online prediction phase. Results show that the average runtime of DesPre’s online prediction algorithm is on the order of 10⁻² ms for each query, while that of SubSyn is on the order of 10⁻¹ ms and that of ZMDB is more than 10² ms. The detailed average runtimes of the different online prediction algorithms are shown in Table 2. ZMDB performs worst because computing its posterior probability requires a time-consuming trajectory matching process, whereas DesPre and SubSyn can fetch most of the needed probabilities directly from the matrices generated by the offline training phase. SubSyn performs worse than DesPre because it needs to compute the destination probabilities of all the POIs to find the top- $κ$ potential destinations, while DesPre only needs to calculate a relatively small portion of them, thanks to its constraint conditions, which avoid large quantities of unnecessary computations.

Table 2.

Average runtime of online prediction algorithms.

Algorithm	DesPre	SubSyn	ZMDB
Average runtime (ms)	1.59 × 10⁻²	7.44 × 10⁻¹	1.81 × 10²

Efficiency of CkiDel

We conduct the evaluation with respect to number of check-ins in query trajectories, while the grid cell’s side length is 400 m and the time span of the training data set is 2.5 months. The average runtimes of CkiDel are listed in Table 3. Results show that the average runtimes do not change too much with an increase in check-ins in query trajectories.

Table 3.

Average running time of CkiDel.

Number of check-ins in query trajectories	4	6	8	10	12
Running time (10⁻² ms)	6.03	6.61	6.84	7.12	6.98

The average runtime of CkiDel is on the order of 10⁻² ms, which is one order of magnitude faster than the privacy protection method by Xue et al.¹⁰ and far more acceptable than the existing privacy alert method by Huo et al.,¹² whose average response time is on the order of minutes. The efficiency advantage of CkiDel mainly comes from the quick response of online destination inference, which is due to two strategies. The first is to move large quantities of time-consuming calculations to the offline training phase, and the second is to use the geographical and time constraints to filter out the unreachable destinations and thus avoid further complex calculations of destination probabilities.

Conclusion and future work

Privacy risks in location publication are high due to rich sources of background information. In this article, we propose a novel destination prediction approach, DesPre, designed specifically for the check-in service of GeoSNs, and design a protection method, CkiDel, to guard against destination inference attacks. Based on DesPre and CkiDel, a new privacy protection framework, DPCD, is proposed to help GeoSN users detect potential location privacy leakage and retain confidential location information without sacrificing real-time check-in precision and user experience. We also design a data preprocessing method to construct a reasonable and complete data subset from the real-world GeoSN data set for performance evaluation. Experimental results demonstrate the effective prediction ability of the DesPre approach, the high protection capability of the CkiDel algorithm, and the high running efficiency of the DPCD framework.

Currently, the DPCD framework mainly counters the threat of destination inference attack. In the future, we will consider more potential location privacy leakage threats to extend our location privacy protection framework.

Footnotes

Academic Editor: Geng Yang

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Nature Science Foundation of Jiangsu China under Grant No. BK20131069.

References

1. Vicente, Freni, Bettini. Location-related privacy in geo-social networks. IEEE Internet Comput 2011; 15(3): 20–27.

2. Sadilek, Kautz, Bigham JP. Finding your friends and following them to where you are. In: Proceedings of the 5th ACM international conference on web search and data mining, Seattle, WA, 8–12 February 2012, pp.723–732. New York: ACM.

3. Wernke, Skvortsov, Dürr. A classification of location privacy attacks and approaches. Pers Ubiquit Comput 2014; 18(1): 163–175.

4. Mokbel MF. Privacy in location-based services: state-of-the-art and research directions. In: Proceedings of the 2007 IEEE international conference on mobile data management, Mannheim, 7–11 May 2007, p.228. New York: IEEE.

5. Danezis, Lewis, Anderson RJ. How much is location privacy worth? In: Proceedings of the 4th workshop on the economics of information security (WEIS 2005), Cambridge, MA, 2–3 June 2005, vol. 5.

6. Ropeik, Gray GM. Risk: a practical guide for deciding what’s really safe and what’s dangerous in the world around you. New York: Houghton Mifflin Harcourt, 2002.

7. Chow, Mokbel MF. Trajectory privacy in location-based services and data publication. ACM SIGKDD Explor Newsl 2011; 13(1): 19–29.

8. Krumm. A survey of computational location privacy. Pers Ubiquit Comput 2009; 13(6): 391–399.

9. Solanas, Domingo-Ferrer J, Martínez-Ballesté. Location privacy in location-based services: beyond TTP-based schemes. In: Proceedings of the 1st international workshop on privacy in location-based applications (PILBA), Malaga, 9 October 2008, pp.12–23. CEUR-WS.org.

10. Xue, Zhang, Zheng. Destination prediction by sub-trajectory synthesis and privacy protection against such prediction. In: Proceedings of the 2013 IEEE 29th international conference on data engineering (ICDE), Brisbane, QLD, Australia, 8–11 April 2013, pp.254–265. New York: IEEE.

11. Kido, Yanagisawa, Satoh. An anonymous communication technique using dummies for location-based services. In: Proceedings of the 2005 IEEE international conference on pervasive services (ICPS’05), Santorini, 11–14 July 2005, pp.88–97. New York: IEEE.

12. Huo, Meng, Zhang. Feel free to check-in: privacy alert against hidden location inference attacks in GeoSNs. Database Syst Adv Appl 2013; 7825: 377–391.

13. Roick, Heuser. Location based social networks—definition, current state of the art and research agenda. Trans GIS 2013; 17(5): 763–784.

14. Freni, Ruiz Vicente, Mascetti. Preserving location and absence privacy in geo-social networks. In: Proceedings of the 19th ACM international conference on information and knowledge management (CIKM 2010), Toronto, ON, Canada, 26–30 October 2010, pp.309–318. New York: IEEE.

15. Mascetti, Freni, Bettini. Privacy in geo-social networks: proximity notification with untrusted service providers and curious buddies. Int J Very Larg Data Bases 2011; 20(4): 541–566.

16. Siksnys, Thomsen, Saltenis. Private and flexible proximity detection in mobile social networks. In: Proceedings of the 2010 IEEE eleventh international conference on mobile data management, Kansas City, MO, 23–26 May 2010, pp.75–84. New York: IEEE.

17. Zhong, Goldberg, Hengartner. Louis, Lester and Pierre: three protocols for location privacy. In: Proceedings of the 7th international conference on privacy enhancing technologies (PET’07), Ottawa, Canada, 20–22 June 2007, pp.62–76. Berlin: Springer.

18. Gruteser, Grunwald. Anonymous usage of location-based services through spatial and temporal cloaking. In: Proceedings of the ACM 1st international conference on mobile systems, applications and services, San Francisco, CA, 5–8 May 2003, pp.31–42. New York: ACM.

19. Bell, Koren, Volinsky. Modeling relationships at multiple scales to improve accuracy of large recommender systems. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining, San Jose, CA, 12–15 August 2007, pp.95–104. New York: ACM.

20. Cho, Myers, Leskovec. Friendship and mobility: user movement in location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining, San Diego, CA, 21–24 August 2011, pp.1082–1090. New York: ACM.

21. Krumm, Horvitz. Predestination: inferring destinations from partial trajectories. In: Proceedings of the international conference on ubiquitous computing, Innsbruck, 17–21 September 2006, pp.243–260. Berlin: Springer.

22. Krumm, Horvitz. Predestination: where do you want to go today? IEEE Comput 2007; 40(4): 105–107.

23. Patterson, Liao, Fox. Inferring high-level behavior from low-level sensors. In: Proceedings of the international conference on ubiquitous computing, Seattle, WA, 12–15 October 2003, pp.73–89. Berlin: Springer.

24. Ziebart, Maas, Dey. Navigate like a cabbie: probabilistic reasoning from observed context-aware behavior. In: Proceedings of the 10th ACM conference on Ubiquitous computing, Seoul, Korea, 21–24 September 2008, pp.322–331. New York: ACM.

25. Gao, Tang, Liu. Exploring social-historical ties on location-based social networks. In: Proceedings of the 6th international AAAI conference on weblogs and social media, Dublin, 5–7 June 2012. Palo Alto, CA: AAAI Press.

26. Ashbrook, Starner. Using GPS to learn significant locations and predict movement across multiple users. Pers Ubiquit Comput 2003; 7(5): 275–286.

27. Bhattacharya, Das SK. LeZi-update: an information-theoretic approach to track mobile users in PCS networks. In: Proceedings of the 5th annual ACM/IEEE international conference on mobile computing and networking, Seattle, WA, 15–19 August 1999, pp.1–12. New York: IEEE.

28. Cheng, Hadjieleftheriou. On trip planning queries in spatial databases. In: Proceedings of the 9th international symposium on spatial and temporal databases, Angra dos Reis, Brazil, 22–24 August 2005, vol. 3363, pp.273–290. Berlin: Springer.

29. Gao, Tang, Liu. Mobile location prediction in spatio-temporal context. In: Proceedings of the Nokia mobile data challenge workshop, Newcastle, 18–19 June 2012.