Identifying hidden visits from sparse call detail record data

Abstract

Despite many studies on trip inference using call detail record (CDR) data, a fundamental understanding of their limitations is lacking. In particular, because of the sparse nature of CDR data, users may travel to a location without being revealed in the data, which we refer to as a hidden visit. The existence of hidden visits hinders our ability to extract reliable information about human mobility and travel behavior from CDR data. In this study, we propose a data fusion approach to obtain labeled data for statistical inference of hidden visits. In the absence of complementary data, this can be accomplished by extracting labeled observations from more granular cellular data access records, and extracting features from voice call and text messaging records. The proposed approach is demonstrated using a real-world CDR dataset of 3 million users from a large Chinese city. Different machine learning models are used to infer whether a hidden visit exists during an observed displacement. The test results show significant improvement over the naive no-hidden-visit rule, which is an implicit assumption adopted by most existing studies. Based on the proposed model, we estimate that over 10% of the displacements extracted from CDR data involve hidden visits.

Keywords

Call detail record; individual mobility; trip inference; data fusion; hidden visit

Introduction

Enabled by the increasing availability of large-scale datasets on human movements, human mobility has become an emerging field dedicated to extracting patterns that describe individual trajectories in time and space. In essence, human movements are the results of spatiotemporal choices (e.g., the decision to go somewhere at some time) made by individuals with diverse preferences and lifestyles. Trips reflect critical travel decisions, and thus are basic behavioral units of human mobility. A trip is defined as “the travel required from an origin location to access a destination for the purpose of performing some activity” (McNally, 2007). The ability to extract trips from large-scale spatiotemporal data sources is important for urban planning, transportation management, and location-based services.

One of the most commonly used data sources for human mobility studies is call detail record (CDR) data, which are collected by cellular service operators primarily for billing and network management. CDR data are one type of event-driven mobile phone network data (Calabrese et al., 2014). The generative events typically include incoming and outgoing voice calls, text messages (or Short Message Service, SMS), and, in some cases, cellular data usage (e.g., 3G/4G). In this study, we treat cellular data usage records, as referred to in Ranjan et al. (2012), as part of CDR data. Whenever a cellular transaction is made, the CDR database records its time and approximate location, in the form of the connected cell tower or antenna. Thus, CDR data offer the opportunity to capture spatiotemporal patterns of mobile phone users over time at a large scale.

In recent years, CDR data have been used extensively to extract useful human mobility patterns and urban transportation information. The related studies cover diverse topics ranging from origin-destination (OD) estimation (Caceres et al., 2007; Calabrese et al., 2011; Iqbal et al., 2014; Mellegard et al., 2011; Wang et al., 2013) and travel time estimation (Hasan and Ali, 2017) to meaningful place detection (Ahas et al., 2010; Isaacman et al., 2011) and human activity discovery (Csáji et al., 2013; González et al., 2008; Phithakkitnukoon et al., 2010; Schneider et al., 2013). The majority of these studies depend on the ability to accurately extract trips from CDR data. However, unlike Global Positioning System (GPS) data (Zhao et al., 2015), CDRs are recordings of people’s telecommunication activities, which are not perfectly aligned with their travel behavior (Xu et al., 2018). This raises the need to translate a series of telecommunication activities into a series of travel activities, which is not a straightforward task (Zhao et al., 2016a).

One critical limitation of CDR data for trip extraction is its sparsity. Phone usage tends to be sporadic in nature (Barabási, 2005). For most users, their mobile phone records are sparsely and irregularly distributed over time, resulting in periods when users may travel but have no phone records to reveal it in the CDR data. We call these time periods elapsed time intervals, or ETIs. An ETI is defined as the period between two consecutive mobile phone records that is long enough for a user to potentially make a trip unobservable from the CDR data. When ETIs occur, the observed spatiotemporal traces of the user are likely incomplete, and the trip estimations based on such incomplete observations are prone to errors. For example, because of the sparsity issue, the fact that two locations are sequentially observed in CDR data does not mean that they are connected by a direct trip. They may not be an OD pair if the user makes an unobserved trip to another location between them. In other words, there may be a hidden visit, which occurs when a user visits a place but has no CDR associated with it. By definition, hidden visits can only occur during ETIs. This issue has received limited attention in existing literature (Bayir et al., 2010; Chen et al., 2019). Without properly considering hidden visits, the extracted OD pairs may be incorrect, the trip generation rate may be underestimated, and the spatiotemporal distribution of trips is likely to be skewed based on individual preferences of mobile phone usage (Zhao et al., 2016b). This calls for methods that can infer the existence of hidden visits based on spatiotemporal context of the ETI as well as the individual characteristics of the user.

The objective of this study is to highlight the issue of hidden visits, and develop an approach to infer the existence of hidden visits during ETIs. Inferring something unobservable in the data, using that same data, is a challenging task. Typically, an unsupervised approach (Chen et al., 2019) is the only choice. However, the heterogeneity across different subsets of CDR data, e.g., voice call vs. data access records, raises the opportunity to adopt a supervised approach based on data fusion for hidden visit inference. Specifically, for a subgroup of users with passively generated data activities, their data access records may be used to recover the portion of travel that is hidden from actively generated voice call records, which can then be used to train hidden visit inference models applied to the general user population. In this paper, we focus on the problem of inferring whether hidden visits exist or not. The ability to identify hidden visits is important for ensuring the quality of the trip-level information extracted from CDR data. For example, if a hidden visit exists, the extracted OD pair should not be used for OD estimation. By identifying hidden visits during ETIs, we may distinguish OD pairs that are accurately inferred from those that are not. It is worth emphasizing that this paper focuses on identifying the existence of hidden visits, i.e., a binary problem, which is of great value for trip extraction by itself. It will provide a foundation for the inference of the exact time and location of the hidden visits, which is a relatively more challenging problem and should be further studied in future research.

The specific contributions of this study are summarized as follows. First, we define the problem of hidden visit inference as part of the trip detection process using sparse CDR data. We show that estimated trip characteristics, such as average trip distance, would be biased without hidden visit inference. Second, we propose a data fusion approach to obtain labeled training data from CDR data alone for supervised statistical inference of hidden visits. More specifically, labeled observations are extracted from more frequently sampled cellular data access records, and features from voice call and text messaging records. Third, we demonstrate the proposed data fusion approach for predicting whether observed displacements contain hidden visits, and identify a range of spatial, temporal, and personal features for the prediction task. Based on a large-scale real-world dataset, we estimate that over 10% of the displacements extracted from CDR data involve hidden visits.

Methodology

Trip extraction from CDR data

Despite their increasing popularity in human mobility and transportation studies, CDR data have several limitations that hinder the ability to accurately extract individual trips. First, CDR data typically provide spatial information at the cell tower level, while the precise location of the user is unknown. It is also well documented that positioning noise exists in CDR data, which stems from signal movements (Calabrese et al., 2011; Iqbal et al., 2014). Low spatial resolution and signal noise both contribute to localization error. Second, the status of travel is not provided in CDR data. A mobile phone record may be generated during a visit to a place or during a trip between two places. This poses a challenge for identifying trip origins and destinations. Third, the sparse nature of CDR data makes it impossible to obtain a complete profile of user mobility. Even when complete records of the mobile phone activities of an individual are available, not every trip of the user is observable. Only those trips that occur in tandem with mobile phone activities are recoverable.

All the aforementioned limitations are, to various degrees, recognized and discussed in the literature. While terminologies and methodologies vary across specific studies, this section synthesizes them into a unified framework shown in Figure 1. Generally, to extract trips from CDR data, three stages are needed—localization, movement state identification, and hidden visit inference. Each is intended to address one of the three limitations. The results obtained after each stage are closer to actual individual travel behavior.

Figure 1.

A general process for extracting trips from CDR data.

The first stage, localization, intends to mitigate localization errors and estimate user locations. A plethora of different methods have been used in the literature to reduce localization error (Csáji et al., 2013; Isaacman et al., 2011; Jiang et al., 2013; Wang et al., 2013). They typically include two steps—trajectory smoothing and spatial clustering. In trajectory smoothing, one takes a sequence of CDRs within a certain time threshold and applies smoothing or filtering algorithms to reduce “jumps” in the location sequence. These algorithms include speed-based filtering (Wang et al., 2013), time-weighted smoothing (Csáji et al., 2013), or assigning a single medoid location to every record in the sequence if they are close by (Jiang et al., 2013). All of them produce smoothed location sequences. In spatial clustering, one ignores the ordering or the temporal distribution of CDRs and clusters data points based on their spatial distribution only. In this way, we can consolidate points that may represent the same location but are visited on different days. Agglomerative clustering (Hariharan and Toyama, 2004; Jiang et al., 2013) and leader clustering (Csáji et al., 2013; Isaacman et al., 2011) are two common spatial clustering algorithms used in prior research. The former clusters a sequence of locations based on a distance matrix only, while the latter can prioritize some locations over others usually based on the visit frequency. Cluster diameters need to be specified in both algorithms. In some cases, the location of a mobile phone may be recorded as triangulated coordinates computed based on the locations of multiple cell towers that the device connects to. When triangulated coordinates are available, a model-based clustering method, proposed by Chen et al. (2014), is more flexible as it does not require predetermined threshold values and allows for probabilistic cluster assignments.

After localization, the location of a user at a certain time is represented by a clustered location, instead of a cell tower location. A time-stamped user location is called a presence. Each presence can either occur during a trip or during an activity at a meaningful place. Jiang et al. (2013) referred to the former category as “pass-by” points, and the latter as “stay” points. The goal of the second stage, movement state identification, is to distinguish between the two categories and extract visits from presences. A visit is a series of “stay” points that correspond to the same location. The most common way to do this in prior literature is simply to apply a dwell time threshold (Calabrese et al., 2011; Jiang et al., 2013; Mellegard et al., 2011; Wang et al., 2013). For example, a threshold of 10 min is used in Jiang et al. (2013). A sequence of presences that are associated with the same location and span over 10 min is classified as a visit. Otherwise, they are flagged as pass-bys. This method works well for high-frequency data, such as GPS data. However, for sparse CDR data, it may cause most presences to be labeled as pass-by points. One way to improve this is to further identify “potential stays”, the presences that are classified as pass-bys using the dwell time criterion but are associated with a previously visited location (Jiang et al., 2013).

Whereas most existing methods only cover the first two stages, we argue that a third stage, hidden visit inference, is needed to distinguish between trips and displacements. A displacement occurs between two consecutive visits observed in the data, while a trip occurs between two consecutive visits regardless of whether they are observed. In other words, a displacement may correspond to one or more trips. Using displacements extracted from sparse CDR data to directly estimate mobility patterns may lead to biases (Zhao et al., 2016b). The discrepancy between displacements and trips is a non-trivial obstacle in applying CDR data for travel behavior analysis (Chen et al., 2016). Hidden visit inference is a problem that has been largely overlooked in the literature. Bayir et al. (2010) is the first study that explicitly defines the problem of hidden visits. They make the distinction between “observed end-locations” and “hidden end-locations”. Based on their definition, “a hidden location occurs when a significant amount of time is elapsed during cell transition.” In an attempt to address the issue, they propose the use of a transition time threshold to determine whether a hidden location exists during an ETI, which heavily relies on personal judgment and lacks statistical robustness. While several statistical methods have been developed to fill in the gaps in sparse CDR trajectories by estimating the length of stay at each observed location, they do not explicitly consider hidden visits to a different location (Chen et al., 2018; Hoteit et al., 2016). More recently, Chen et al. (2019) proposed a tensor decomposition method for complete CDR trajectory reconstruction. Specifically, a 3-dimensional tensor is constructed for each user and the missing locations are estimated based on the assumption of user behavior regularity.
Unsupervised learning methods, such as tensor decomposition, are often necessary because the ground-truth data about hidden visits are typically not available. However, unsupervised learning methods are generally difficult to calibrate and do not perform as well as supervised learning methods for prediction tasks. In this study, we will present a novel data fusion approach that makes it possible to infer hidden visits using supervised learning methods. In addition, unlike Chen et al. (2019), our hidden visit inference method will combine both individual-specific features and other spatiotemporal features under a universal model to allow learning across users.

Hidden visit inference based on data fusion

To infer whether a hidden visit exists is essentially a classification problem. It involves building a statistical model for predicting a binary output based on one or more inputs (James et al., 2013). This requires a set of training examples, each being a pair consisting of a feature vector, $X,$ and a desired output value, $Y .$ However, in the case of CDR data, such training data is typically unavailable. This is arguably the most critical obstacle that limits our ability to transition from existing heuristic-based approaches to statistical approaches.

Specifically for hidden visit inference, the complete travel profile of a user is required, along with the sparse CDR data, in order to form labeled observations. One way to achieve this is to find another data source that complements the characteristics of CDR data. CDR data are one example of large-scale urban mobility data sources that cover a large user population and a long observation period, but the individual-level information that is captured in such data is relatively coarse. In contrast, another type of data may be collected from a smaller sample of individuals over a shorter observation period, but can provide richer and more detailed information at the individual level, e.g., the Reality Mining dataset (Eagle and Pentland, 2006). These two types of data are complementary to each other. In this study, we refer to them as coarse big data and rich small data, respectively. Whenever both types of data are available, we can maximize their value by combining the two for statistical learning, which involves forming training examples with $X$ extracted from coarse big data, and $Y$ from the rich small data. The trained models can then be applied to coarse big data for a larger population over a longer period, so that some of the unobservable information in the data can be inferred. Potential ways to collect rich small data include travel surveys and smartphone GPS tracking. Both options require active recruitment of sample users, and thus are costly and not very scalable.

Given these practical challenges, this study proposes a new way to apply data fusion using only CDR data. This is possible because multiple types of mobile phone transactions are recorded in CDR data, including voice calls, SMS, and data activities. While many of the datasets analyzed in the literature consist of voice calls only (Barabási, 2005; Csáji et al., 2013; González et al., 2008; Iqbal et al., 2014), or voice calls in combination with SMS records (Isaacman et al., 2011), records of data activities are becoming more available (Calabrese et al., 2011, 2014; Ranjan et al., 2012). Unlike voice call and SMS activities, data activities do not always require user initiation or participation (Ranjan et al., 2012). On devices with enabled cellular data capability, a plethora of mobile applications, if allowed by users, make periodic or sporadic connections to the cellular network automatically. These data activities are recorded as data access records, and they tend to be less sparse than voice call or SMS records. Furthermore, voice call and SMS records are determined by mobile phone usage preferences. As a result, the mobility patterns observed from such data may be confounded with the user’s mobile phone usage behavior (Williams et al., 2015). On the other hand, data access records can be generated passively. For example, a user may prefer not to make voice calls at certain locations or at a certain time of the day, and thus, the travel associated with these locations and periods may be hidden from the voice call records. However, these otherwise “hidden” visits can be captured by passively generated data access records. For these reasons, data access records can be used to capture complete travel profiles, at least for a small group of smartphone users with passively generated cellular data activities.
Therefore, despite the lack of a complementary data source, it is still possible to obtain labeled data by extracting $X$ from voice calls and SMS records, and $Y$ from data access records.

Problem Formulation

After localization and movement state identification, we obtain a series of stay points for each user, $(s_{1}^{u}, t_{1}^{u}), (s_{2}^{u}, t_{2}^{u}), \dots, (s_{n}^{u}, t_{n}^{u}),$ where $s_{i}^{u}$ and $t_{i}^{u}$ are the location and timestamp of the i-th stay point for user u. A hidden visit occurs when the user makes a trip to a location other than $s_{i}^{u}$ and $s_{i + 1}^{u}$ during the time between $t_{i}^{u}$ and $t_{i + 1}^{u}$ . It is challenging to directly estimate the location and time of the hidden visit because of the large number of possible outcomes. Instead, we focus on a simpler question in this study—whether hidden visits exist. Let $h_{i}^{u}$ indicate whether a hidden visit exists between $t_{i}^{u}$ and $t_{i + 1}^{u}$ . It is the target variable to be inferred.

For a given user, the superscript u is omitted for clarity. Let $e_{i}$ indicate whether the time period between $t_{i}$ and $t_{i + 1}$ counts as an ETI, i.e.,

$$e_{i} = \begin{cases} 1, & \text{if } t_{i + 1} - t_{i} > τ \\ 0, & \text{otherwise} \end{cases}$$

where $τ$ is the minimum time threshold of ETI, which essentially defines the temporal resolution of the analysis. The choice of $τ$ depends on the problem requirement and the data constraint. By definition, hidden visits can only exist during ETIs. Or in other words, $P (h_{i} = 1 | e_{i} = 0) = 0 .$ If the data is frequently sampled, $e_{i} = 0, \forall i,$ and no hidden visit inference is needed. Otherwise, it is necessary to estimate $P (h_{i} = 1 | e_{i} = 1) .$ One way to estimate this is to use true values of $h_{i},$ which may be obtained through the data fusion approach described in the previous section.
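The ETI indicator can be illustrated with a minimal Python sketch. The function name `eti_flags`, the one-hour default for τ, and the sample timestamps are hypothetical and chosen only for illustration; they are not taken from the study's implementation.

```python
from datetime import datetime, timedelta

def eti_flags(timestamps, tau=timedelta(hours=1)):
    """Return e_i for each consecutive pair: 1 if t_{i+1} - t_i > tau, else 0."""
    return [1 if (t_next - t_cur) > tau else 0
            for t_cur, t_next in zip(timestamps, timestamps[1:])]

# Hypothetical stay-point timestamps for one user (time-ordered)
stamps = [
    datetime(2013, 11, 4, 8, 0),
    datetime(2013, 11, 4, 8, 30),   # 30 min gap: not an ETI
    datetime(2013, 11, 4, 11, 0),   # 2.5 h gap: ETI
    datetime(2013, 11, 4, 12, 0),   # exactly 1 h: not strictly greater than tau
]
flags = eti_flags(stamps)
```

Only the intervals flagged with 1 would be passed on to hidden visit inference; the strict inequality mirrors the definition of $e_{i}$.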

Assume that both coarse big data and rich small data are available for a group of users. A series of stay points can be obtained from the former. For each time interval $(t_{i}, t_{i + 1}),$ when $e_{i} = 1,$ a set of locations $S_{i}^{'}$ are obtained from the latter. Therefore, the true values of $h_{i}$ may be determined based on the following rule:

$h_{i} = 1,$ if $e_{i} = 1$ and there exists an $s^{'} \in S_{i}^{'}$ such that $s^{'} \notin \{s_{i}, s_{i + 1}\},$

$h_{i} = 0,$ otherwise.
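The labeling rule can be written as a short Python sketch. The function name `hidden_visit_label` and the string location identifiers are hypothetical; the logic follows the rule stated above, with `observed_locations` standing in for $S_{i}^{'}$.

```python
def hidden_visit_label(s_i, s_next, e_i, observed_locations):
    """Return h_i: 1 if the interval is an ETI and the rich data show a
    location other than the two endpoint stay locations, else 0."""
    if e_i != 1:
        return 0  # hidden visits can only occur during ETIs
    return int(any(loc not in (s_i, s_next) for loc in observed_locations))

# Hypothetical examples: locations are plain labels
label_detour = hidden_visit_label("A", "B", 1, ["A", "Z", "B"])  # visit to Z in between
label_direct = hidden_visit_label("A", "B", 1, ["A", "B"])       # only endpoints observed
```

The same function covers both displacement cases ($s_{i} \neq s_{i + 1}$) and round trips ($s_{i} = s_{i + 1}$), since the rule only checks membership in the endpoint set.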

To ensure that the location sequences in the two data sources are comparable, we may transform them into discrete time series, for example, by binning the timestamps into hours. Also, in reality, even in the frequently sampled data access records, user locations may be missing in certain periods. For example, if the mobile phone of a user is out of battery for a period, all the travel activities during the period would be missing. We call these periods unrecoverable. Only hidden visits within the recoverable ETIs can be identified. This process of obtaining labeled observations is illustrated using the example in Figure 2.

Figure 2.

Illustration of the process for identifying the existence of hidden visits.

With labeled observations, a model can be trained to estimate $P (h_{i} = 1 | e_{i} = 1) .$ For parametric methods, the training data are used to find the values of a set of model parameters, $θ,$ so that some loss function is minimized. Assuming there are M users in the training data, each with $N_{u}$ observations, the estimation problem can be generally expressed as

$$\hat{θ} = \arg\min_{θ} \sum_{u = 1}^{M} \sum_{i = 1}^{N_{u}} e_{i}^{u} L (h_{i}^{u}, f (X_{i}^{u}; θ))$$

where $X_{i}^{u}$ is the feature vector associated with the $i$-th observation of individual $u,$ and $L (a, b)$ is the loss function specified by the true value $a$ and the estimated value $b .$ The specific form of the loss function and $f (X_{i}^{u}; θ)$ depends on the choice of algorithm.
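To make the role of $e_{i}^{u}$ in the objective concrete, the sketch below evaluates the ETI-masked sum for a given $θ$. It assumes, purely for illustration, a logistic model for $f$ and binary cross-entropy for $L$; the text leaves both choices open, and the names `predict`, `log_loss`, `masked_objective`, and the toy data are hypothetical.

```python
import math

def predict(x, theta):
    """f(X; theta): a logistic model, assumed here only for illustration."""
    z = sum(w * v for w, v in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(h, p, eps=1e-12):
    """L(a, b): binary cross-entropy between label h and probability p."""
    p = min(max(p, eps), 1.0 - eps)
    return -(h * math.log(p) + (1 - h) * math.log(1.0 - p))

def masked_objective(users, theta):
    """Sum of e_i * L(h_i, f(X_i; theta)) over all users and observations."""
    total = 0.0
    for observations in users:          # one list of observations per user u
        for e, h, x in observations:    # (e_i, h_i, X_i)
            if e == 1:                  # non-ETI intervals contribute nothing
                total += log_loss(h, predict(x, theta))
    return total

# Toy data: two users, three observations, one of which is not an ETI
users = [
    [(1, 1, [1.0, 2.0]), (0, 0, [1.0, 0.5])],
    [(1, 0, [1.0, -1.0])],
]
loss_at_zero = masked_objective(users, [0.0, 0.0])
```

A parameter estimate $\hat{θ}$ would be whichever $θ$ minimizes `masked_objective`; any numerical optimizer could be used for that step.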

It is important to consider two different scenarios depending on whether $s_{i} = s_{i + 1} .$ In Scenario I, $s_{i} \neq s_{i + 1} .$ A displacement occurs from $s_{i} = A$ to $s_{i + 1} = B,$ and the goal of hidden visit inference is to determine whether the user visits another place $Z$ in between, e.g., $A \to Z \to B$ . In Scenario II, $s_{i} = s_{i + 1}$ . In this case the goal is to determine whether the user visits another place and returns to the same place, e.g., $A \to Z \to A .$ The two scenarios have different implications regarding user behavior and travel patterns, and thus require separate model specifications, even though the general methodology may be similar. This paper focuses on Scenario I and demonstrates how the data fusion approach may be implemented to infer $P (h_{i} = 1 | e_{i} = 1, s_{i} \neq s_{i + 1}) .$

Application

Data

The dataset used for this study is collected by one of the major cellular service operators from a Chinese city with a population of 6 million. The dataset contains over 2 billion mobile phone transaction records generated by 3 million users during November 2013. Only voice call and data access records are available; SMS records are not available, exacerbating the sparsity problem at the individual level. The key fields in the CDR data include:

ID—encrypted unique identifier for each phone number

Location Area Code (LAC)—location area code, used in combination with Cell ID to identify the cell tower used for the transaction

Cell ID—used in combination with LAC to identify the cell tower used for the transaction

Date Time—the timestamp of the mobile phone transaction

Event ID—the type of the event that triggers the transaction, which may be an outgoing call, incoming call, or data usage (2G/3G).

In addition to the CDR dataset, we have a cell tower database documenting the attributes (including geographic location) of the cell towers. This makes it possible to query the coordinates (in the form of longitude and latitude) of the tower associated with each mobile phone record using LAC and Cell ID. In total, we are able to identify the locations of over 9,000 cell towers that appear in the CDR dataset. The distribution of the cell towers is shown together with the 607 traffic analysis zones (TAZs) of the city in Figure 3(a). It is clear that the distribution of the cell towers is highly uneven and concentrated in the city center, where most activities take place. In Figure 3(b), each TAZ is represented as a node, and the edge between two nodes represents the number of displacements between TAZs. The size of the node describes the total number of displacements from/to the TAZ.

Figure 3.

Spatial distribution of cell towers and displacements between TAZs.

Although the total amount of CDR data is large, the number of records per user is sparse. On average, a user generates 0.16 voice call records and 0.40 data access records every hour. Voice call and data access records exhibit different patterns. As shown in Figure 4, data access records are not only larger in number but also more evenly distributed throughout the day than voice call records. Similar to prior findings in Candia et al. (2008), the number of voice call records per user has two peaks, one in the morning and the other in the afternoon, which resembles the distribution of travel demand. It suggests that making phone calls is somewhat correlated with travel, potentially causing biases in travel estimation. For example, if some users only make phone calls before and after their commutes, we may overestimate the proportion of commuting trips and underestimate other trips. Without observing the actual travel behavior from another less biased data source, it is very difficult to correct the bias. Data access records suffer from a similar problem, but to a lesser degree. Therefore, data access records may be used to quantify, and potentially mitigate, the biases of voice call records.

Figure 4.

Mobile phone usage pattern over time of day.

The inter-event time distribution of the voice call and data activities is also explored. The inter-event time is calculated as the time difference between two consecutive records for the same user. Again, voice call records and data access records are analyzed separately. Figure 5 shows that, whereas the inter-call time is characterized by a smooth distribution curve, the inter-event time for data activities exhibits a few peculiar spikes, the most significant of which is at the 1-hour mark. This is likely caused by the fact that some mobile applications are set up to automatically make hourly connections to the cellular network. This finding suggests that a reasonable choice for the ETI threshold τ is 1 hour, at least for this dataset.
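The inter-event time calculation can be sketched in Python as follows. The record layout `(user_id, event_type, timestamp_seconds)` and the function name `inter_event_times` are assumptions made for illustration; the actual CDR fields are those listed in the Data section.

```python
from collections import defaultdict

def inter_event_times(records):
    """records: iterable of (user_id, event_type, timestamp_seconds).
    Returns {event_type: [gaps in seconds]}, where each gap is the time between
    two consecutive records of the same user and the same transaction type."""
    by_key = defaultdict(list)
    for user, etype, ts in records:
        by_key[(user, etype)].append(ts)
    gaps = defaultdict(list)
    for (_, etype), stamps in by_key.items():
        stamps.sort()
        gaps[etype].extend(b - a for a, b in zip(stamps, stamps[1:]))
    return dict(gaps)

# Hypothetical records: one user with hourly data connections and two calls
records = [
    ("u1", "data", 0), ("u1", "data", 3600), ("u1", "data", 7200),
    ("u1", "call", 100), ("u1", "call", 5000),
    ("u2", "data", 50),
]
result = inter_event_times(records)
```

A histogram of `result["data"]` over a full dataset is where a spike at 3600 seconds, like the one described for Figure 5, would show up.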

Figure 5.

Distribution of inter-event time by transaction type.

Note that not every user’s data access records have such regular hourly inter-event intervals. Some users may not have a smartphone, while others may disallow passive data usage by some mobile applications. An examination of mobile phone usage patterns shows that 9% of the users have only data access records but no voice calls, possibly because they represent tablets or secondary devices. The rest of the users are considered regular mobile users who generated at least one voice call record in November. They can be further divided into call-only users who do not use cellular data (43%) and mixed users who have both voice call and data access records (57%).

By further decomposing mixed users, we find that a large proportion of data access records are generated by a small group of users. Figure 6 shows the distribution of the active hours of mixed users based on their voice call or data access frequency. An active hour denotes an hour when the user generates at least one record, and we break this down by the two transaction types. Note that there are 720 hours in total during our study period (i.e., 30 days). The two distributions shown in the figure are distinctly different, which is highlighted in the log-log plot (see the inset chart of Figure 6). There is virtually no user that has more than half of their hours with at least one voice call, but there is a small group of users that generate data access records in most of the hours, a strong indication that these users have passively generated data activities. In this study, we define frequent data users as the mixed users who have active data usage in at least half of the hours (in this case, 360 hours). The threshold represents a trade-off between certainty and volume: a lower threshold would place more users in this group, but we would be less certain that these users have passively generated data access records. For this group of users, their voice call records are still sparse, but their data access records are not. Therefore, we can to some extent observe their mobility between voice calls based on their data usage. Note that the frequent data users only account for 10% of the regular mobile user population. Nevertheless, hidden visit inference models may be trained based on CDRs of the frequent data users, before being deployed for the large majority of users with sparse CDR data. Note that even a frequent data user may still have ETIs in their trajectories, e.g., when the user is out of cellular coverage or is served by unknown Wi-Fi hotspots. We do not require them to have complete trajectories.
Instead, we only extract the complete segments of trajectories for training data. After the model is trained, it can then be deployed for hidden visit inference for all ETIs.
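The definition of frequent data users can be sketched as below. The function names, the use of second-based timestamps measured from the start of the study period, and the `share` parameter are hypothetical; the 720-hour period and the one-half threshold come from the text.

```python
def active_hours(timestamps_seconds):
    """Number of distinct clock hours that contain at least one record."""
    return len({int(ts // 3600) for ts in timestamps_seconds})

def is_frequent_data_user(call_ts, data_ts, total_hours=720, share=0.5):
    """True for a mixed user (both voice call and data access records) whose
    data access records cover at least `share` of the hours in the period."""
    if not call_ts or not data_ts:
        return False  # call-only or data-only users are not mixed users
    return active_hours(data_ts) >= share * total_hours

# Hypothetical user with one data record in each of the first 360 hours
hourly_data = [h * 3600 + 10 for h in range(360)]
frequent = is_frequent_data_user([5], hourly_data)
```

Lowering `share` would admit more users at the cost of less certainty that their data activity is passively generated, which is the trade-off described above.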

Figure 6.

Distribution of mixed users by number of active hours.

Preprocessing

Before hidden visit inference, the two preceding stages need to be completed: localization and movement state identification. For localization, a method similar to Isaacman et al. (2011) and Csáji et al. (2013) is adopted to perform spatial clustering of cell towers and identify important user locations. The method has two steps. In the first step, the cell towers are ranked based on their importance for each user, where the importance is measured based on the number of “call-days” (i.e., days the cell tower was contacted). In the second step, the leader algorithm is used for clustering analysis. The algorithm starts with the most important cell tower and merges the surrounding towers into the first cluster. It then moves on to the most important of the remaining towers and repeats, until all towers are assigned to a cluster. Unlike the K-means algorithm, the leader algorithm does not require a predefined number of clusters, which is advantageous because localization is performed at the individual level. Compared to the hierarchical clustering algorithm, the leader algorithm allows us to assign higher priority (or weight) to the frequently used cell towers. The weighted centroid of the cell towers within a cluster is used to represent the location of a user.
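A minimal Python sketch of the two-step localization procedure is given here. Several details are assumptions rather than values from the study: the 1 km cluster radius, the use of call-days as centroid weights, the haversine distance, and the function and variable names.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def leader_cluster(towers, radius_km=1.0):
    """towers: {tower_id: ((lat, lon), call_days)} for one user, call_days >= 1.
    Returns a list of clusters, each with its members and weighted centroid."""
    # Step 1: rank towers by importance (number of call-days), highest first
    remaining = sorted(towers, key=lambda t: towers[t][1], reverse=True)
    clusters = []
    # Step 2: the most important remaining tower leads a new cluster
    while remaining:
        leader = remaining[0]
        members = [t for t in remaining
                   if haversine_km(towers[leader][0], towers[t][0]) <= radius_km]
        weight = sum(towers[t][1] for t in members)
        centroid = (
            sum(towers[t][0][0] * towers[t][1] for t in members) / weight,
            sum(towers[t][0][1] * towers[t][1] for t in members) / weight,
        )
        clusters.append({"members": members, "centroid": centroid})
        remaining = [t for t in remaining if t not in members]
    return clusters

# Hypothetical towers: T1 and T2 are about 0.56 km apart, T3 is far away
towers = {
    "T1": ((30.000, 120.000), 10),
    "T2": ((30.005, 120.000), 5),
    "T3": ((30.100, 120.100), 3),
}
clusters = leader_cluster(towers)
```

Because the procedure runs per user, the number of clusters is determined by the data and the radius rather than fixed in advance.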

In the movement state identification stage, we distinguish between stay and pass-by points based on the dwell time and the frequency of appearances at the associated user location. To determine dwell time, presences are aggregated to segments based on location matching and temporal proximity. Two consecutive presences are combined if they are associated with the same location $(s_{i} = s_{i + 1})$ and the time difference is within the ETI threshold $(t_{i + 1} - t_{i} < τ) .$ Each presence segment is associated with a single location and covers a unique time span. If a segment’s time span is above the minimum dwell time threshold (set as 10 min based on Jiang et al., 2013) or its associated location appears at a frequency higher than the minimum frequency threshold (set as 4 days per month), all the presences in the segment are classified as stay points; otherwise, they are flagged as pass-by points and will not be used for further analysis. Here, we assume that a location is an important place for users if they appear at the location at least once a week. This is loosely based on Isaacman et al. (2011), which defines an important place as “a geographic location where a person spends a significant amount of time and/or which she visits frequently”.
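The segment aggregation and stay/pass-by rule can be sketched as follows. The sketch assumes that presences are time-ordered `(location, time_in_minutes)` pairs and that the number of days on which each location appears is precomputed. Whether the thresholds are inclusive is an assumption, and the function names are hypothetical.

```python
def build_segments(presences, tau):
    """Merge consecutive presences at the same location closer than tau.

    presences: time-ordered list of (location, time_in_minutes).
    """
    segments = []
    for loc, t in presences:
        last = segments[-1] if segments else None
        if last and last["loc"] == loc and t - last["end"] < tau:
            last["end"] = t          # extend the current segment
            last["n"] += 1
        else:
            segments.append({"loc": loc, "start": t, "end": t, "n": 1})
    return segments


def is_stay(segment, days_seen, min_dwell=10, min_days=4):
    """Stay if dwell time or visit frequency meets its threshold.

    days_seen maps location -> number of days per month the user appears there.
    Inclusive comparisons (>=) are an assumption of this sketch.
    """
    dwell = segment["end"] - segment["start"]
    return dwell >= min_dwell or days_seen.get(segment["loc"], 0) >= min_days
```

Presences in segments for which `is_stay` returns `False` correspond to pass-by points and would be dropped from further analysis.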

For model implementation, 10,000 frequent data users are selected to form training samples. For each user in the dataset, we are able to obtain a series of stay points $(s_{1}, t_{1}), (s_{2}, t_{2}), \dots, (s_{n}, t_{n})$ extracted from voice call records, and corresponding hidden visit labels $h_{1}, h_{2}, \dots, h_{n - 1}$ extracted from the data access records based on the approach discussed in the Methodology section. For simplicity, we filter the data to only keep the observations where $e_{i} = 1$ and $s_{i} \neq s_{i + 1}$. In other words, each observation is a displacement with ETI. A displacement with ETI is characterized by two consecutive stay points $(s_{i}, t_{i})$ and $(s_{i + 1}, t_{i + 1})$, where $s_{i} \neq s_{i + 1}$ (thus a displacement) and $t_{i + 1} - t_{i} > \tau$ (thus an ETI). $h_{i}$ indicates whether there exists a hidden visit during the ETI. For example, if two consecutive stay points are (A, 8 am) and (B, 11 am), it counts as a displacement with ETI, and the goal is to infer whether there exists a hidden visit between 8 am and 11 am. For model training, we only consider instances when $h_{i}$ can be recovered from the data access records. For each user, we randomly select one observation. This gives us a dataset of 9,761 observations, which is then used for model training. As this is a binary classification problem, the performance of classifiers is evaluated using standard metrics such as classification accuracy, precision, recall, F1 score, and the area under the receiver operating characteristic curve (ROC AUC). The precision and recall values are calculated based on the “positive” class, which is the class with $h_{i} = 1$. 10-fold cross-validation is used to obtain robust model performance metrics.
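The filtering step can be written compactly. The sketch below assumes stay points are time-ordered `(location, time_in_hours)` pairs and treats the ETI threshold as a parameter `tau`; the function name is hypothetical.

```python
def displacements_with_eti(stays, tau):
    """Return (i, s_i, s_next, gap) for consecutive stay points with
    different locations (a displacement) and a gap above tau (an ETI)."""
    out = []
    for i, ((s_i, t_i), (s_next, t_next)) in enumerate(zip(stays, stays[1:])):
        if s_i != s_next and t_next - t_i > tau:
            out.append((i, s_i, s_next, t_next - t_i))
    return out
```

For the example in the text, stay points (A, 8 am) and (B, 11 am) yield one displacement with a 3-hour ETI whenever `tau` is below 3 hours.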

Model specification

Given a displacement with ETI, a set of attributes need to be defined in order to be used for hidden visit inference. Generally, its attributes can be categorized into three sets—spatial (of the displacement), temporal (of the ETI) and personal (of the user) attributes.

Spatial attributes refer to the characteristics of the displacement $(s_{i}, s_{i + 1})$. Distance is a measure of travel cost between $s_{i}$ and $s_{i + 1}$, and the proportion of observed trips (i.e., displacements with no ETIs) is a measure of the strength of connection. In an attempt to characterize locations, the home and workplace of an individual are inferred based on simple heuristics. For each user, we identify the two most frequently visited locations (a minimum frequency threshold has to be met), and determine that the one with more presences during (1) all hours on weekends and (2) night hours (i.e., before 7 am or after 7 pm) on weekdays is home (Alexander et al., 2015) and the other is the workplace. The term “workplace” is used to generally indicate the location that a user visits frequently during daytime on weekdays. It may be a school for some users. This categorization allows us to semantically characterize displacements in a few ways. Based on the function of only $s_{i + 1}$, we may categorize displacements by three travel purposes: home, work, or other. Alternatively, displacements may be grouped into four categories based on the functional combination of both $s_{i}$ and $s_{i + 1}$: home-based work (HBW), home-based other (HBO), work-based other (WBO), and other-based other (OBO). HBW is travel between home and work, HBO between home and other, WBO between work and other, and OBO between two other locations. This categorization is commonly used in transportation planning. In addition, we also examine the distribution of the user’s displacements without ETIs and see how many of them are between $s_{i}$ and $s_{i + 1}$. To examine systematic spatial patterns of hidden visits, we further explore built environment features defined as the cell tower densities at the TAZs associated with $s_{i}$ and $s_{i + 1}$, though they are not found to be significant in preliminary analysis.
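The home/workplace heuristic and the four displacement types can be sketched as below. The sketch assumes presences are stored per location as `(weekday, hour)` pairs with Monday as 0, and it omits the minimum frequency threshold because its value is not given in the text. All names are hypothetical.

```python
def infer_home_work(presences_by_location):
    """presences_by_location: {location: [(weekday, hour), ...]}, Monday = 0."""
    ranked = sorted(presences_by_location,
                    key=lambda loc: len(presences_by_location[loc]),
                    reverse=True)
    top_two = ranked[:2]
    if not top_two:
        return None, None

    def home_score(loc):
        # Weekend presences (all hours) plus weekday presences
        # before 7 am or after 7 pm.
        return sum(1 for wd, hr in presences_by_location[loc]
                   if wd >= 5 or hr < 7 or hr >= 19)

    home = max(top_two, key=home_score)
    work = next((loc for loc in top_two if loc != home), None)
    return home, work


def displacement_type(origin, destination, home, work):
    """Classify a displacement as HBW, HBO, WBO, or OBO."""
    def label(loc):
        return "home" if loc == home else "work" if loc == work else "other"

    ends = {label(origin), label(destination)}
    if ends == {"home", "work"}:
        return "HBW"
    if "home" in ends:
        return "HBO"
    if "work" in ends:
        return "WBO"
    return "OBO"
```

The categories are symmetric in origin and destination, which matches the definition of HBW as travel between home and work in either direction.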

Temporal attributes refer to the characteristics of the ETI $(t_{i}, t_{i + 1})$. The duration of the ETI, or $(t_{i + 1} - t_{i})$, is an important factor. In general, the longer the ETI, the more likely there exists a hidden visit. To determine the time of day effect, we use four dummy variables: whether the ETI overlaps with (i) morning peak hours (from 7 am to 9 am), (ii) afternoon peak hours (from 4 pm to 7 pm), (iii) midday hours (from 10 am to 3 pm), and (iv) night hours (from 8 pm to 6 am the next day). These dummy variables are not mutually exclusive because ETIs often span multiple hours. The underlying assumption is that people have different mobility patterns during different periods in a day. In addition, we calculate how often a user appears at other locations (neither $s_{i}$ nor $s_{i + 1}$) during the same time of the day (TOD) as the ETI $(t_{i}, t_{i + 1})$. For example, if the ETI is between 8 am and 11 am, we will count the total number of hours where the user is observed elsewhere between 8 am and 11 am across all days in the observation period.
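The four time-of-day dummy variables can be computed as in the sketch below. It assumes each window is half-open (for example, the morning peak covers the clock hours 7 and 8) and that an ETI "overlaps" a window if it touches any clock hour in it. These conventions and the names are assumptions of the sketch.

```python
from datetime import datetime, timedelta

# Clock hours covered by each window (half-open intervals are assumed).
WINDOWS = {
    "morning_peak": set(range(7, 9)),
    "afternoon_peak": set(range(16, 19)),
    "midday": set(range(10, 15)),
    "night": set(range(20, 24)) | set(range(0, 6)),
}


def hours_touched(start, end):
    """Set of clock hours (0-23) intersected by the interval [start, end)."""
    hours = set()
    cur = start.replace(minute=0, second=0, microsecond=0)
    while cur < end:
        hours.add(cur.hour)
        cur += timedelta(hours=1)
    return hours


def tod_dummies(start, end):
    """Non-exclusive 0/1 indicators of overlap with each window."""
    touched = hours_touched(start, end)
    return {name: int(bool(touched & hrs)) for name, hrs in WINDOWS.items()}
```

An ETI from 8 am to 11 am sets both the morning peak and midday indicators, which shows why the dummy variables are not mutually exclusive.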

User attributes include both characteristics of mobile phone usage and those related to travel behavior. One specific measure of travel tendency is the number of displacements per active hour, which is a normalized measure of user displacement rate. Table 1 presents numerous attributes that are extracted and tested, and the italicized ones are those that are selected to be included in the final model based on model validation.

Table 1.

Possible features for hidden visit inference.

Category	Features
Spatial features	Distance between $s_{i}$ and $s_{i + 1}$
	Displacement type *
	Location ranking of $s_{i}$ and $s_{i + 1}$ *
	Visit frequency of $s_{i}$ and $s_{i + 1}$ *
	Location function of $s_{i}$ and $s_{i + 1}$ (home, work, or other) *
	% of displacements (without ETIs) between $s_{i}$ and $s_{i + 1}$ *
	Cell tower density of TAZs associated with $s_{i}$ and $s_{i + 1}$
Temporal features	Duration
	Time of the day (TOD)
	Day of the week
	Number of locations where user appears during same TOD*
	Frequency of user appearing elsewhere (neither $s_{i}$ nor $s_{i + 1}$) during same TOD *
User features	Number of active call hours*
	Number of visited locations*
	Number of visited locations per active call hour*
	Number of displacements*
	Number of displacements per active call hour*

Note: features with * are derived using individual user data; features in italics are included in final models.

Because different assumptions for loss functions and model structures may yield different results, four commonly used classifiers are tested. They are logistic regression, support vector machine (SVM), random forest, and gradient boosting. The implementation details of these methods are described as follows:

The logistic regression model outputs the probability distribution across the two classes, and thus a cut-off value needs to be chosen to produce a point estimate (i.e., yes or no). Based on preliminary tests, the cut-off value is set at 0.5. To avoid over-fitting, the L2 regularization is used, and the parameter of inverse regularization strength, C, is chosen to be 1.0.

For SVM, the radial basis function kernel is used. It takes the following form: $K(x, x') = \exp(-\gamma \lVert x - x' \rVert^{2})$, where the parameter $\gamma$ needs to be specified based on the validation set. Based on preliminary tests, it is determined that $\gamma = 0.05$. Similar to logistic regression, the regularization parameter C = 1.0.

A random forest is an ensemble method that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The Gini index is used to measure the impurity of a node in a tree, and the number of trees in the forest is set to be 100.

Gradient boosting is an ensemble method that builds an additive model in a forward stage-wise fashion and allows for optimization of an arbitrary differentiable loss function (Friedman, 2001). In each stage, a weak model, typically a decision tree classifier, is fitted based on the negative gradient of the loss function.

All classifiers are implemented in Python through the machine learning package “scikit-learn” (Pedregosa et al., 2011). If not specified, the default model settings are used.
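For illustration, the four classifiers with the settings stated above could be configured in scikit-learn roughly as follows. This is a sketch rather than the original code: the shuffled stratified folds, the fixed random seed, and the helper names `build_classifiers` and `evaluate` are assumptions.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.svm import SVC

SCORING = ["accuracy", "precision", "recall", "f1", "roc_auc"]


def build_classifiers():
    """Classifiers with the stated settings; everything else is default."""
    return {
        # L2 regularization is the scikit-learn default penalty.
        "logistic_regression": LogisticRegression(C=1.0),
        "svm": SVC(kernel="rbf", gamma=0.05, C=1.0),
        "random_forest": RandomForestClassifier(n_estimators=100,
                                                criterion="gini"),
        "gradient_boosting": GradientBoostingClassifier(),
    }


def evaluate(clf, X, y, folds=10):
    """Mean 10-fold cross-validated scores for the positive class (h = 1)."""
    cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=0)
    result = cross_validate(clf, X, y, cv=cv, scoring=SCORING)
    return {m: float(result[f"test_{m}"].mean()) for m in SCORING}
```

Here `X` would hold the features of each displacement with ETI and `y` the hidden visit labels $h_{i}$.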

Feature importance analysis

Of the classifiers used, logistic regression has the most interpretable model parameters. Thus, the detailed results of the logistic regression model are presented in Table 2 for a better understanding of the relationship between variables. Positive coefficients mean that an increase in attribute values will increase the probability that a hidden visit occurs, and vice versa.

Table 2.

Logistic regression model results.

Feature	Estimate	p-value
Intercept	−1.683	0.000
Spatial features
Distance between $s_{i}$ and $s_{i + 1}$ (in km)	0.087	0.000
Destination = Home	0.169	0.011
Displacement type = HBW	−0.980	0.000
Displacement type = HBO	−0.552	0.000
Displacement type = WBO	−0.415	0.000
If any trip from $s_{i}$ to $s_{i + 1}$ is observed	−0.270	0.000
% of displacements (without ETIs) between $s_{i}$ and $s_{i + 1}$	−0.627	0.037
Temporal features
Duration of the ETI (in hours)	0.031	0.000
ETI overlaps with morning peak hours	0.190	0.016
ETI overlaps with afternoon peak hours	0.136	0.027
ETI overlaps with midday hours	0.119	0.040
If user appears in other locations during same TOD	0.249	0.003
Frequency of user appearing elsewhere during same TOD	0.034	0.000
User features
Number of observed user locations	−0.009	0.015
Number of displacements per active hour	1.177	0.000
Model fit
chi-square = 2078.75, degrees of freedom = 16, McFadden’s $R^{2}$ =0.162

As expected, spatial attributes matter in the classification problem. Hidden visits are more likely to occur when the displacement distance is longer. Interestingly, they also occur more often when the displacement ends at home. One possible explanation is that people are more likely to make a short visit to another place (e.g., grocery store) on their way home. This may be because people have fewer time constraints when they travel back home. If either $s_{i}$ or $s_{i + 1}$ corresponds to the home or workplace, the probability of a hidden visit decreases. This may be caused by the fact that the home and the workplace are the two most frequently visited locations for a user, and the probability of visiting any other location is relatively small. If the user is observed to travel from $s_{i}$ to $s_{i + 1}$ on other occasions (in cases of displacements with no ETIs), a hidden visit is less likely. Human mobility has been found to show high regularity (Song et al., 2010). People tend to repeat the same trip over time.

In terms of temporal attributes, the ETI duration has a similar effect as the displacement distance; the longer the ETI, the more likely a hidden visit occurs. If the ETI overlaps with morning peak hours, it is more likely to involve hidden visits. Afternoon peak and midday hours have similar, but lesser, effects, while the coefficient for night hours is insignificant. This finding matches our expectation that people are generally more mobile in peak hours than at midday, and least active during night hours. If the user is observed to appear in other locations during the same period on other days, hidden visits are also more likely to occur.

The number of displacements per active hour approximates the travel rate of a user. As expected, a user with a higher travel rate is more likely to undertake a hidden visit. We find that, although the number of user locations per active hour is not a significant factor in the model, the total number of observed user locations is. This suggests that the more visited locations are revealed in the data, the less likely the user is to have hidden visits. One may argue this is a result of call frequency; a frequent caller reveals more visited locations. However, the frequency of the call activities is also insignificant in the model. A more likely explanation is rooted in the user’s spatial preference regarding phone calls. Regardless of call frequency, some users distribute their phone calls across all locations, while others may prefer to make phone calls only at a few locations. The latter group of users is more likely to have hidden visits in their voice call records. In other words, the distribution of calls matters more than the frequency of calls.
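To make the coefficient interpretation concrete, the sketch below plugs the estimates from Table 2 into the logistic function. It assumes that the coefficients apply to unscaled features in the stated units and that OBO and a non-home destination are the reference categories. The feature names are hypothetical labels for the table rows.

```python
import math

# Estimates copied from Table 2; the keys are illustrative names.
COEF = {
    "intercept": -1.683,
    "distance_km": 0.087,
    "dest_home": 0.169,
    "type_hbw": -0.980,
    "type_hbo": -0.552,
    "type_wbo": -0.415,
    "trip_observed": -0.270,
    "share_direct": -0.627,
    "eti_hours": 0.031,
    "morning_peak": 0.190,
    "afternoon_peak": 0.136,
    "midday": 0.119,
    "elsewhere_same_tod": 0.249,
    "elsewhere_freq": 0.034,
    "n_locations": -0.009,
    "disp_per_active_hour": 1.177,
}


def hidden_visit_probability(features):
    """Logistic probability of a hidden visit; omitted features count as 0."""
    z = COEF["intercept"] + sum(COEF[k] * v for k, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical displacement: 5 km, OBO, 3-hour ETI from 8 am to 11 am,
# user with 10 observed locations and 0.2 displacements per active hour.
example = {"distance_km": 5, "eti_hours": 3, "morning_peak": 1, "midday": 1,
           "n_locations": 10, "disp_per_active_hour": 0.2}
p_example = hidden_visit_probability(example)
```

Under these assumptions the example displacement has a hidden visit probability of about 0.33. Changing its type to HBW lowers the linear predictor by 0.980 and the probability with it.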

Another way to assess feature importance is to see how much they contribute to the actual prediction of hidden visits. To do this, we evaluate the overall prediction performance, measured by classification accuracy and ROC AUC. Then, we remove spatial, temporal, and personal features from the model, and compare the resulting difference in prediction performance. The results are summarized in Table 3. Based on the results, it seems that spatial features are most important (because of the largest drop in prediction performance), followed by temporal features. User features are least important.

Table 3.

Comparison of prediction performance with different features.

Feature combination	Accuracy	ROC AUC
Spatial + Temporal + User	0.694	0.696
Spatial + Temporal	0.694	0.695
Spatial + User	0.668	0.675
Temporal + User	0.621	0.626

Comparison of model performance

Table 4 shows the performance of the four classifiers. They are compared against two baseline models. Baseline 1 is a deterministic model that assumes no hidden visit, which is the assumption that many prior studies have made when they extracted trips from CDR data. It reaches a classification accuracy of 62.8%, meaning that the naive rule will underestimate the number of trips by at least 37%. Note that the precision and recall are both 0 for Baseline 1, because there are no positive cases predicted. Baseline 2 is a probabilistic model that generates predictions by sampling from the marginal distribution of the target variable observed in the training data. Its prediction accuracy is lower, but, unlike Baseline 1, it produces positive precision and recall. Compared to the baseline models, the fitted statistical models significantly improve the prediction performance on all metrics. Among the four classifiers, gradient boosting performs best in terms of overall accuracy and precision, while logistic regression does better in recall, F1 score, and ROC AUC. Because the positive class accounts for a smaller proportion than the negative class, it tends to be under-classified, resulting in a lower recall than precision. Common strategies to address the data imbalance include oversampling or overweighting the positive class, or, for probabilistic models such as logistic regression, adjusting the cut-off probability threshold. However, these strategies may worsen the overall accuracy score. This is a trade-off to be assessed depending on specific applications.

Table 4.

Comparison of classification performance.

Methods	Accuracy	Precision	Recall	F1 Score	ROC AUC
Baseline 1 (deterministic)	0.628	0.0	0.0	N/A	N/A
Baseline 2 (probabilistic)	0.537	0.366	0.363	0.365	0.499
Logistic regression	0.694	0.566	0.702	0.627	0.696
SVM	0.712	0.652	0.459	0.538	0.659
Random forest	0.721	0.654	0.503	0.569	0.675
Gradient boosting	0.729	0.671	0.511	0.580	0.683
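The two baselines are simple enough to sketch in plain Python. The code assumes binary labels with 1 denoting a hidden visit and follows the convention in Table 4 of reporting precision and recall as 0 when no positive case is predicted; the function names are hypothetical.

```python
import random


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is undefined (N/A) when precision and recall are both zero.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else None)
    return {"accuracy": (tp + tn) / len(y_true), "precision": precision,
            "recall": recall, "f1": f1}


def baseline_no_hidden_visit(n):
    """Baseline 1: always predict no hidden visit."""
    return [0] * n


def baseline_marginal(y_train, n, seed=0):
    """Baseline 2: sample predictions from the training share of positives."""
    p = sum(y_train) / len(y_train)
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(n)]


def expected_marginal_accuracy(p):
    """Expected accuracy of Baseline 2 when the positive share is p."""
    return p * p + (1 - p) * (1 - p)
```

With the positive share of 1 − 0.628 = 0.372 implied by Baseline 1, `expected_marginal_accuracy(0.372)` is approximately 0.533, and the expected precision and recall both equal 0.372. These values are close to the Baseline 2 row of Table 4.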

Many classifiers, such as logistic regression, can produce probabilistic predictions for a new observation. Standard SVM does not provide such probabilities, but it can with Platt scaling (Platt, 2000). Figure 7 shows how the proportions of positive (in red) and negative (in blue) classes vary based on $\hat{P} (h_{i} = 1)$ estimated by the logistic regression model. Generally, as $\hat{P} (h_{i} = 1)$ increases, the relative proportion of the positive class rises. However, when $\hat{P} (h_{i} = 1)$ is between 0.5 and 0.7, the model cannot confidently distinguish between the two classes. Instead, we may directly use $\hat{P} (h_{i} = 1)$ as an indicator of uncertainty. In many applications, especially at the aggregate level, probabilistic predictions are preferred over point predictions, because they directly account for the degree of uncertainty in the analysis. This is a distinct advantage of statistical models over heuristic-based approaches, which cannot provide reliable probabilistic estimations.

Figure 7.

Proportions of positive/negative classes varying by estimated probabilities of hidden visits.

Model deployment for trip extraction

To demonstrate the importance of the hidden visit inference model, we apply the trained logistic regression model to all displacements with ETIs (with unknown $h_{i})$ in the CDR dataset. To avoid the high uncertainty associated with very long periods without observations, we only focus on the displacements with durations shorter than 8 hours. 56.5 million such displacements are extracted from the dataset, of which 23.2 million (or 41%) are associated with ETIs. Based on the calculated probabilities $\hat{P} (h_{i} = 1)$ from the model, we estimate that the expected number of displacements with hidden visits is over 6.4 million—27.7% of the displacements with ETIs, or 11.4% of all displacements.

This has two implications. On the one hand, the results show that more than 10% of the displacements are not direct trips, and considering their end-locations $(s_{i}, s_{i + 1})$ as OD pairs would be inaccurate. Thus, in order to obtain an accurate estimation of OD distribution, we may discount those displacements with hidden visits. On the other hand, each displacement with a hidden visit corresponds to more than one trip, and as a result the actual number of trips is at least 11.4% more than the number of observed displacements. One way to correct this underestimation is to apply an upscaling factor based on the estimated proportion of hidden visits. Note that this proportion varies significantly over time, as shown in Figure 8 as the orange curve. Higher proportions occur during early mornings and middays. The variation is determined by two components—the variation in the proportion of displacements with ETIs over all displacements, and the estimated $P (h_{i} = 1 | e_{i} = 1, s_{i} \neq s_{i + 1}),$ both of which are also shown in Figure 8 as the blue and red curves, respectively. The former is a result of the uneven distribution of CDRs over time of day (see Figure 4), and the latter is estimated by the hidden visit inference model. Figure 8 suggests the peak in the early morning is primarily driven by the former component, while both components contribute to the higher percentage of hidden visits around middays. It is intuitive that during the periods with less mobile phone activity, there are greater chances of encountering ETIs, and thus hidden visits are more likely to occur. Note that this is not to say there are more hidden visits occurring in the early morning. Instead, the correct way to interpret this is that each displacement in the early morning is more likely to contain a hidden visit. These variations need to be taken into consideration when we upscale the number of displacements to number of trips.
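The aggregate estimates above follow from summing predicted probabilities. A minimal sketch, assuming the model output is a list of $\hat{P} (h_{i} = 1)$ values for displacements with ETIs (the function names are hypothetical):

```python
import math


def expected_hidden_visits(probabilities):
    """Expected number of displacements with hidden visits."""
    return math.fsum(probabilities)


def upscaling_factor(n_displacements, expected_hidden):
    """Lower-bound factor for converting displacements to trips, assuming
    each hidden visit adds at least one trip."""
    return 1.0 + expected_hidden / n_displacements
```

With the rounded counts reported above (6.4 million expected hidden visits among 56.5 million displacements), the factor is roughly 1.11. Because the proportion varies over the day, a separate factor per time-of-day bin may be more appropriate than a single global one.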

Figure 8.

Percentage of inferred hidden visits over time of day.

As another demonstration of the value of the hidden visit inference model, we compare the average trip distance for displacements with and without ETIs. Average trip distance is an important indicator of travel demand, useful for transportation planning. Hypothetically, the presence of ETIs should not significantly alter the average trip distance. However, because of potential hidden visits during ETIs, displacements with ETIs are more likely to contain more than one trip, resulting in a longer average distance than direct trips. As shown in Table 5, the displacements with ETIs have a much longer average distance than displacements without ETIs, likely as a result of hidden visits. To address the inconsistency, we adopt the logistic regression model for hidden visit inference, and use the predicted probability of no hidden visit $\hat{P} (h_{i} = 0 \mid e_{i} = 1, s_{i} \neq s_{i + 1})$ as the weight for each displacement. As a result, the weighted average distance is more consistent with displacements without ETIs. Note that longer displacements with ETIs are more likely to involve hidden visits, and thus they are more likely to be down-weighted. Therefore, the average distance is lower after weighting.

Table 5.

Comparison of average distance across different types of displacements.

Type of displacements	Average distance (km)
Displacements without ETIs	4.489
Displacements with ETIs	7.551
Displacements with ETIs weighted by $\hat{P} (h_{i} = 0 \mid e_{i} = 1, s_{i} \neq s_{i + 1})$	4.879
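The weighting scheme can be sketched as follows, assuming each displacement with an ETI has a distance in km and a predicted hidden visit probability, and that the weight is the complementary probability of no hidden visit. The function name is hypothetical.

```python
def weighted_average_distance(distances_km, p_hidden):
    """Average distance weighted by the probability of no hidden visit."""
    weights = [1.0 - p for p in p_hidden]
    total = sum(weights)
    if total == 0:
        raise ValueError("all displacements have zero weight")
    return sum(w * d for w, d in zip(weights, distances_km)) / total
```

As a toy illustration, two displacements of 2 km and 10 km with hidden visit probabilities of 0.1 and 0.7 have an unweighted mean of 6 km and a weighted mean of about 4 km, because the longer displacement is down-weighted.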

Discussion

In this study, we define the problem of hidden visits caused by data sparsity, and develop a data fusion approach to infer the existence of hidden visits in CDR data. The proposed method works by extracting labeled observations from more granular cellular data access records and features from voice call and/or SMS records. It is demonstrated using the CDR data of 3 million users from a large Chinese city over a one-month period. The records of a sample of 10,000 frequent data users are used to train hidden visit inference models. The test results show that the developed models offer superior performance compared to the implicit assumption of no hidden visit adopted by many prior studies. Furthermore, the approach allows us to explicitly account for uncertainty in hidden visit inferences via probabilistic estimates. In addition, the results reveal that longer displacements are more likely to involve hidden visits. By applying the trained model to the general user population, we find that 11.4% of the 56.5 million displacements extracted from CDR data involve hidden visits. This means that, without considering hidden visits, the trip distance estimated from CDR data may be over-estimated and more than 10% of the observed OD pairs are potentially inaccurate. These findings provide a better understanding of the potential biases of sparse CDR data, especially voice call records, for travel estimation.

The proposed methodology presents a promising research direction, and opens up many opportunities for future advancement. First, the presence of signal noise, or localization errors in particular, in CDR data limits the performance of hidden visit inference models. Better localization methods can further reduce signal noise and potentially improve the performance of the models. Second, incorporating more features and sequential dependence can help improve the performance. Mining of individual-level longitudinal data may reveal more features regarding the user’s activity patterns and routines. Sequential dependence exists across a series of displacements of the same user, and it may be accounted for using methods like conditional random fields. Third, as the models are developed based on training samples extracted from frequent data users’ CDRs, it is assumed that the model parameters are applicable to the general user population. The validity of the assumption depends on the problem and model specifications. Future research is needed to examine whether the frequent data users can be used as a reasonable training sample for model development, for which more ground-truth data should be collected. Fourth, in the current models, we assume all users share the same model parameters. However, different groups of people may have different mobility/telecommunication patterns and behavioral preferences. This may be accounted for in two steps: apply user clustering first and then develop models for each of the clusters. This requires a weaker assumption on the representativeness of training samples, as both the frequent data users and the general population can be considered as different mixtures of the same underlying user clusters. See Appendix A for one way to cluster users based on their voice call patterns. More generally, a combination of unsupervised learning methods (e.g., Chen et al., 2019) and supervised learning methods (described in this study) can potentially further improve the model performance. Finally, in this case study, we only focus on inferring the existence of hidden visits. This is an important step of trip extraction from CDR data. An extension of this work is to infer, if a hidden visit exists, when and where it occurs. It is a more challenging problem. As each user visits a different set of locations at different time periods, the specific spatiotemporal patterns of hidden visits for one user may not be generalizable to other users. In addition, there can be more than one hidden visit during an ETI, further complicating the task. Future studies are needed to provide a better understanding of individual heterogeneity of spatiotemporal patterns and propose new methods to infer detailed hidden visit information.

Whenever large-scale data is used, user privacy is an important consideration. The target application of our study is trip detection and OD estimation, which are done at aggregate level, not individual level. The developed models can be directly deployed on the database servers of telecom carriers, without need for data transfer. Furthermore, compared to other forms of big data, such as social media or credit card transaction data, CDR data is relatively less intrusive in terms of personal privacy. In addition, its localization error helps to mask the exact user locations, providing another layer of privacy preservation.

With the rapid advance in cellular network technologies and growth in smartphone usage, it is reasonable to expect that the data access records will be increasingly rich and prevalent. Specifically, the increasing prevalence of mobile signaling data, which records all signal jumps between cell towers, should largely reduce the occurrence of ETIs, though data sparsity can still be an issue for parts of the population during certain time periods. The specific parameters may change depending on the configuration of the networks or mobile phone settings, but the proposed data fusion approach is general and should still hold in the foreseeable future. Although CDR data are the focus of this study, the proposed approach can be extended to other types of coarse big data, such as transit smart card data and geo-located social media data. The value of this type of data can be enhanced through data fusion and statistical inference. For example, if we have smart card records and travel survey data for a group of individuals, the same data fusion approach may be applied to infer whether a user makes a hidden visit between two observed transit trips, which makes it possible to estimate individual-level mode shares for public transit.

Appendix A: User cluster analysis

In this section, we show one possible way to cluster users based on their voice call patterns, because different voice call patterns likely indicate different user activity schedules, lifestyles, and to some extent mobility behaviors. Specifically, each user $u$ is characterized by a 720-dimensional vector $c^{u}$, where the $i$-th element in the vector $c_{i}^{u}$ indicates whether the user makes a voice call during the $i$-th hour in the month. Based on these user-specific vectors, we can run K-means cluster analysis to group users who share similar voice call patterns. We choose $K = 3$ based on within-cluster sum-of-squares distributions. Figure A1 shows the average voice call patterns of the 3 estimated user clusters, as well as their proportions in the overall user population. It is clear that the three user clusters exhibit three levels of voice call frequencies or intensities. In addition, Cluster 1 (i.e., the frequent callers) displays a bigger difference between weekdays and weekends, compared to Clusters 2 and 3. Based on the user clustering results, we can express the difference between frequent data users and general users as different mixtures among the three clusters. As shown in Figure A2, there is a higher proportion of Cluster 1 users and a lower proportion of Cluster 3 users among frequent data users. Therefore, in future research, we may explore how to combine hidden visit inference (or other mobility models) with user clustering analysis, to directly account for user heterogeneity in different user groups.
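A minimal sketch of this clustering step with scikit-learn is given below. It assumes that, for each user, the indices (0 to 719) of the hours containing at least one voice call are known. The helper names, the number of initializations, and the random seed are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans


def call_vector(call_hour_indices, n_hours=720):
    """Binary vector c^u: 1 if the user makes a voice call in hour i."""
    v = np.zeros(n_hours, dtype=float)
    v[np.asarray(list(call_hour_indices), dtype=int)] = 1.0
    return v


def cluster_users(vectors, k=3, seed=0):
    """Group users by hourly voice call pattern with K-means."""
    model = KMeans(n_clusters=k, n_init=10, random_state=seed)
    labels = model.fit_predict(np.vstack(vectors))
    # Each cluster center is the average hourly call pattern of its members.
    return labels, model.cluster_centers_
```

The returned cluster centers correspond to the average voice call patterns of the kind shown in Figure A1, and the label counts give the cluster proportions.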

Acknowledgements

The authors would like to thank the Energy Foundation China and China Sustainable Transportation Center for providing the data and funding support that made this research possible. We also thank the anonymous reviewers for their constructive comments, which helped us to improve the paper.

Authorship contribution

Zhan Zhao: conceptualization, methodology, formal analysis, writing—original draft.

Haris N Koutsopoulos: conceptualization, supervision, writing—review and editing.

Jinhua Zhao: conceptualization, supervision, writing—review and editing.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was partly supported by the Energy Foundation China and China Sustainable Transportation Center.

Zhan Zhao is currently an Assistant Professor at the University of Hong Kong, where he is based in Department of Urban Planning and Design, and also affiliated with Musketeers Foundation Institute of Data Science. His research interests include human mobility modeling, emerging transportation systems, and urban data science.

Haris N Koutsopoulos is currently a Professor with the Department of Civil and Environmental Engineering, Northeastern University, Boston, MA, USA. His current research focuses on the use of data from opportunistic and dedicated sensors to improve planning, operations, monitoring, and control of urban transportation systems.

Jinhua Zhao is currently the Edward H. and Joyce Linde Associate Professor of city and transportation planning at MIT. He brings behavioral science and transportation technology together to shape travel behavior, design mobility systems, and reform urban policies. He directs the MIT Urban Mobility Laboratory and Public Transit Laboratory.

References

1. Ahas, Silm, Järv, et al. (2010) Using mobile positioning data to model locations meaningful to users of mobile phones. Journal of Urban Technology 17(1): 3–27.

2. Alexander, Jiang, Murga, et al. (2015) Origin–destination trips by purpose and time of day inferred from mobile phone data. Transportation Research Part C: Emerging Technologies 58, Part B: 240–250.

3. Barabási (2005) The origin of bursts and heavy tails in human dynamics. Nature 435(7039): 207–211.

4. Bayir, Demirbas, Eagle (2010) Mobility profiler: A framework for discovering mobility profiles of cell phone users. Pervasive and Mobile Computing 6(4): 435–454.

5. Caceres, Wideberg, Benitez (2007) Deriving origin destination data from a mobile phone network. IET Intelligent Transport Systems 1(1): 15–26.

6. Calabrese, Di Lorenzo, Liu, et al. (2011) Estimating origin-destination flows using mobile phone location data. IEEE Pervasive Computing 10(4): 36–44.

7. Calabrese, Ferrari, Blondel (2014) Urban sensing using mobile phone network data: A survey of research. ACM Computing Surveys 47(2): 25:1–25:20.

8. Candia, González, Wang, et al. (2008) Uncovering individual and collective human dynamics from mobile phone records. Journal of Physics A: Mathematical and Theoretical 41(22): 224015.

9. Chen, Bian (2014) From traces to trajectories: How well can we guess activity locations from mobile phone traces? Transportation Research Part C: Emerging Technologies 46: 326–337.

10. Chen, Susilo, et al. (2016) The promises of big data and small data for travel behavior (aka human mobility) analysis. Transportation Research Part C: Emerging Technologies 68: 285–299.

11. Chen, Hoteit, Viana, et al. (2018) Enriching sparse mobility information in call detail records. Computer Communications 122: 44–58.

12. Chen, Viana, Fiore, et al. (2019) Complete trajectory reconstruction from sparse mobile phone data. EPJ Data Science 8(1): 1–24.

13. Csáji, Browet, Traag, et al. (2013) Exploring the mobility of mobile phone users. Physica A: Statistical Mechanics and its Applications 392(6): 1459–1473.

14. Eagle, Pentland (2006) Reality mining: Sensing complex social systems. Personal and Ubiquitous Computing 10(4): 255–268.

15. Friedman (2001) Greedy function approximation: A gradient boosting machine. Annals of Statistics 29(5): 1189–1232.

16.

González

Hidalgo

Barabási

(2008) Understanding individual human mobility patterns. Nature 453(7196): 779–782.

17.

Hariharan

Toyama

(2004) Project Lachesis: Parsing and modeling location histories. In: Egenhofer

Freksa

Miller

(eds) Geographic Information Science, No. 3234 in Lecture Notes in Computer Science. Berlin Heidelberg: Springer, 106–124.

18.

Hasan

Ali

(2017) Estimating travel time of Dhaka City from mobile phone call detail records. In: Proceedings of the Ninth International Conference on Information and Communication Technologies and Development, ICTD ’17. New York, NY, USA: Association for Computing Machinery, 1–11.

19.

Hoteit

Chen

Viana

, et al. (2016) Filling the gaps: On the completion of sparse call detail records for mobility analysis. In: Proceedings of the Eleventh ACM Workshop on Challenged Networks, CHANTS ’16. New York, NY, USA: Association for Computing Machinery, 45–50.

20.

Iqbal

Choudhury

Wang

, et al. (2014) Development of origin–destination matrices using mobile phone call data. Transportation Research Part C: Emerging Technologies 40: 63–74.

21.

Isaacman

Becker

Cáceres

, et al. (2011) Identifying important places in people’s lives from cellular network data. In: Lyons

Hightower

Huang

(eds) Pervasive Computing, No. 6696 in Lecture Notes in Computer Science. Berlin Heidelberg: Springer, 133–151.

22.

James

Witten

Hastie

, et al. (2013) An Introduction to Statistical Learning, Springer Texts in Statistics, vol. 103. New York, NY: Springer New York.

23.

Jiang

Fiore

Yang

, et al. (2013) A review of urban computing for mobile phone traces: Current methods, challenges and opportunities. In: Proceedings of the 2nd ACM SIGKDD International Workshop on Urban Computing, UrbComp ’13. New York, NY, USA: ACM, 2:1–2:9.

24.

McNally

(2007) The four-step model. In: Handbook of Transport Modelling, Handbooks in Transport, volume 1. Emerald Group Publishing Limited, 35–53.

25.

Mellegard

Moritz

Zahoor

(2011) Origin/Destination-estimation using cellular network data. In: 2011 IEEE 11th International Conference on Data Mining Workshops (ICDMW), 891–896.

26.

Pedregosa

Varoquaux

Gramfort

, et al. (2011) Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12: 2825–2830.

27.

Phithakkitnukoon

Horanont

Lorenzo

, et al. (2010) Activity-aware map: Identifying human daily activity pattern using mobile phone data. In: Salah

Gevers

Sebe

, et al. (eds) Human Behavior Understanding, number 6219 in Lecture Notes in Computer Science. Berlin Heidelberg: Springer, 14–25.

28.

Platt

(2000) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Smola

Bartlett

Schölkopf

, et al. (eds) Advances in Large Margin Classifiers. Cambridge, MA: MIT Press, 61–74.

29.

Ranjan

Zang

Zhang

, et al. (2012) Are call detail records biased for sampling human mobility? ACM SIGMO-BILE Mobile Computing and Communications Review 16(3): 33.

30.

Schneider

Belik

Couronne

, et al. (2013) Unravelling daily human mobility motifs. Journal of The Royal Society Interface 10(84): 20130246–20130246.

31.

Song

Blumm

, et al. (2010) Limits of predictability in human mobility. Science 327(5968): 1018–1021.

32.

Wang

Schrock

Broek

, et al. (2013) Estimating dynamic origin-destination data and travel demand using cell phone network data. International Journal of Intelligent Transportation Systems Research 11(2): 76–86.

33.

Williams

Thomas

Dunbar

, et al. (2015) Measures of human mobility using mobile phone records enhanced with GIS data. PLOS ONE 10(7): e0133630.

34.

Shaw

, et al. (2018) Uncovering the relationships between phone communication activities and spatiotemporal distribution of mobile phone users. In: Shaw

Sui

(eds.) Human Dynamics Research in Smart and Connected Communities, Human Dynamics in Smart Cities. Cham: Springer International Publishing, 41–65.

35.

Zhao

Musolesi

Hui

, et al. (2015) Explaining the power-law distribution of human mobility through transportation modality decomposition. Scientific Reports 5: 9136.

36.

Zhao

Koutsopoulos

(2016a) Individual-level trip detection using sparse call detail record data based on supervised statistical learning. Paper number 16-4386. Available at: https://trid.trb.org/view/1393647 (accessed 6 September 2022).

37.

Zhao

Shaw

, et al. (2016b) Understanding the bias of call detail records in human mobility research. International Journal of Geographical Information Science 30(9): 1738–1762.