Abstract
This paper presents a data fusion methodology for inferring trip purposes from GPS trajectories of ride-hailing services in Toronto. The methodology has a discrete choice model at its core that predicts the most probable purpose distributions using only basic trip-related information such as approximate pick-up and drop-off locations, trip start times, and land use characteristics around the origins and destinations. The choice model is estimated using revealed trip purpose data from a small-sample travel survey augmented by land use information from an enhanced point of interest database and the census. The methodology is applied to the trajectories of commercial ride-hailing trips made in Toronto between September 2016 and September 2018. For the core choice model, multinomial, nested, and mixed multinomial logit models are compared. Validation of the inferred trip purposes using the trip purpose proportions from another independent survey (not used in choice model estimation) reveals that the multinomial logit model can infer ride-hailing trip purpose distribution with reasonable accuracy. The inferred purpose distribution explains the nature of ride-hailing trips and provides important context for the travel demand generated by the services. The results indicate that although ride-hailing services are mostly used for discretionary activities, they also play important roles in daily commuter travel. A quarter of the total weekday ride-hailing trips were made for work- and school-related activities. With increasing ridership, these services may start influencing conventional travel modes and thereby adversely affect the level of traffic congestion and transit ridership in the city.
In recent years, ride-hailing services offered by private transportation companies (e.g., Uber, Lyft, DiDi) have experienced rapid growth owing to their flexible, reliable, and cost-effective mobility options. Between September 2016 and March 2019, ride-hailing trips grew by more than 180% in the City of Toronto ( 1 ). This increasing market penetration makes it imperative to have a clear understanding of the characteristics of these trips and how they are changing people's travel behavior. Such holistic and in-depth analysis requires large amounts of data, which can be extracted from the GPS trajectories of the ride-hailing trips. The trajectory data contains a wealth of travel-related information, such as when and where ride-hailing passengers move around the city, at high resolution. However, being passively collected, it does not contain semantic information about why the trips are being made. Trip purpose information directly relates to the activities for which ride-hailing is used; thus, it provides important context for the travel demand generated by the services.
Although questionnaire-based travel surveys can easily collect trip purposes, they suffer from limitations of high survey cost, heavy respondent burden, short time and space coverage, and trip underreporting ( 2 ). Also, these surveys usually underrepresent the younger population, who tend to use ride-hailing services more frequently ( 1 , 3 , 4 ). Thus, there is a trade-off of information between ride-hailing trajectory and travel survey data, where the former provides rich spatial and temporal information but lacks trip purposes, and the latter provides detailed trip purposes but suffers from small sample size and inaccuracies. This paper proposes a data fusion methodology that leverages both of these information sources to infer ride-hailing trip purposes.
The study uses trajectories of ride-hailing trips made in the City of Toronto between September 2016 and September 2018 to investigate the patterns of ride-hailing trip purposes at the drop-off locations. In particular, a data fusion-based methodology is proposed that uses an econometric choice model to impute the most probable purposes of individual ride-hailing trips. Application of choice theory induces important behavioral features. It captures the reality that passengers arriving at the same location can be driven by different purposes, and the purposes may also differ based on trip start times and origins. The proposed fusion method relies only on basic information such as approximate pick-up and drop-off locations of trips, start times, land use information, and revealed trip purpose data from small-sample travel surveys. The inferred trip purpose pattern extends our understanding of the increasing ride-hailing ridership, which in turn will facilitate efficient urban policy planning.
The remainder of the paper is organized as follows. The second section presents a review of methodologies used in trip purpose imputation research. The third section presents the proposed data fusion method. The fourth section presents the application of the method for the City of Toronto and discusses the results obtained. The final section summarizes the findings and limitations of the study and provides guidelines for future research.
Previous Research on Trip Purpose Inference
Trip purpose inference has emerged as a popular research topic over the last two decades ever since the advent of passive data collection technologies, such as GPS devices, mobile phones, transit smart cards, and so forth. The developed algorithms vary widely in their complexity, input data requirements, and performance accuracy. This section reviews the previous works based on the methods and the input/context variables used. More detailed reviews (especially for inference of purpose from GPS-based travel survey data) can be found in Gong et al. ( 5 ), Xiao et al. ( 6 ), and Ermagun et al. ( 7 ).
Earlier studies mostly relied on deterministic rule-based methods that utilize predefined heuristics to infer the trip purposes (8–11). These algorithms are criticized for lack of generalization and poor transferability. More recent studies tend to favor probabilistic methods and machine learning. For predictions, the probabilistic algorithms calculate the occurrence probability of each trip purpose, whereas the machine learning algorithms report whether a specific purpose is selected or not based on classification and pattern recognition methods. The majority of the studies that adopt the probabilistic method utilize Bayes’ theorem of conditional probability to estimate trip purposes based on different spatial and temporal constraints (e.g., 12–14). However, a few studies ( 7 , 10 , 15 , 16 ) also utilize the theory of choice modeling, particularly the multinomial and the nested logit models. In machine learning techniques, the most popular tools include clustering algorithms ( 17 ), decision trees ( 15 , 18 ), and random forest classifiers ( 7 , 19–21), although some studies use more complex techniques like artificial neural networks ( 6 , 22 ) and continuous Markov models ( 23 ).
Researchers have compared the performance accuracy of different categories of algorithms. Ermagun et al. ( 7 ), Zhu ( 16 ) and Oliveira et al. ( 15 ) found that machine learning algorithms (random forest and decision tree) have higher predictive accuracy than probabilistic methods (regularized multinomial logit and nested logit models) while using detailed contextual information, such as people’s socio-economic attributes, previous activity type, current activity duration, and characteristics of nearby places from Google Places API. However, some of the studies made context-specific simplifying assumptions. For example, Ermagun et al. ( 7 ) and Zhu ( 16 ) did not predict home, work, and education trips, since these regular activities can be easily labeled using user-provided home and work locations from smartphone-based trajectories ( 7 ) or multiple-day continuously observed travel patterns from transit smart card data ( 16 ). This exclusion improved the prediction accuracy of the other activity types with relatively small sample sizes in the training data. On the other hand, Wu et al. ( 19 ) did not find any striking differences in performance between rule-based and random forest models when contextual attributes like time and activity data are used. These contrasting results indicate that the application context and input variables play important roles in accurate inference of trip purpose.
The most promising input variables for trip purpose inference models are reported to be the land use characteristics, points of interest information, and trip timing (trip start and end time, day of the week, etc.). Activity duration, frequency of visit to a destination, and other tour-based information also play essential roles in inference of trip purpose from GPS-based travel surveys and transit smart card data ( 11 , 12 , 23 , 24 ). Socio-demographic characteristics of the respondents and key addresses collected in GPS-based surveys also improve the inference accuracy by providing additional contexts about activities performed ( 6 , 20 , 25 ). Some recent studies on inference of purpose of taxi trips have explored the potential of data repositories like Area of Interest ( 26 ) and social network check-in data ( 14 , 17 ).
Many of the contextual variables, including activity duration, frequency of visit, and tour-based information, are not available for inference of ride-hailing trip purpose. Ride-hailing services record only part of a tour's trajectory, and the socio-demographics of the users (if available) are not shared because of privacy concerns. Moreover, it is difficult to obtain reasonably large estimation/training samples for the inference model. All these issues pose unique challenges for ride-hailing trip purpose inference that have not been sufficiently tackled in the existing literature. Most of the studies that investigated ride-hailing travel behavior relied solely on primary or secondary data sets like household travel surveys. For example, Young and Farber ( 3 ) analyzed the purpose of ride-hailing trips using data from a large-scale household travel survey. However, trip purposes inferred from ride-hailing trajectories can extend our understanding beyond what is possible from summaries of household travel surveys or small-sample ride-hailing specific surveys. They have the potential to provide important context for ride-hailing demand generation at spatial and temporal resolutions unavailable from survey results alone. Despite such immense potential, minimal research effort has been applied to inferring ride-hailing trip purposes from passively collected pick-up and drop-off locations and times. Recently, Dias et al. ( 4 ) attempted to infer purposes from publicly available anonymized ride-hailing data by fusing information from multiple sources. However, the purposes were based on locations visited and not the activities performed at those locations.
It is evident that there is a gap in the literature in relation to ride-hailing trip purpose inference. This study attempts to narrow the gap by proposing an econometric data fusion approach for inferring trip purpose from passively collected ride-hailing trajectory data. The fusion tool is based on a choice model that can infer detailed purposes of individual trips using limited context-specific variables and relatively small estimation data.
Data Fusion Methodology
Overview of the Fusion Methodology
Figure 1 presents the methodology used in this study for inferring unobserved destination purposes of ride-hailing trips. The core of the fusion tool is a discrete choice model that predicts the most likely trip purpose outcomes by fusing GPS attributes with travel survey and land use data. The model is first estimated using revealed trip purpose responses from the travel survey data. Land use information obtained from an enhanced point of interest data repository and census data are fused with the survey data to provide contexts about typical activities performed at the origins and the destinations. The estimated model is then applied to impute the missing purposes of the ride-hailing trips that are also augmented with land use information.

Figure 1. Overview of data fusion-based trip purpose inference method.
Discrete choice theory is used to develop the core model of trip purpose inference mainly to leverage its strong behavioral foundation—the random utility maximization (RUM) theory. Under the RUM assumption, the choice model calculates the probability of each trip purpose based on its systematic/observable utilities. It is anticipated that such a model will be able to accurately infer the most probable trip purposes using the (approximate) pick-up and drop-off locations, start times of the ride-hailing trips, and limited context-specific data.
Formulation of Discrete Choice Models
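The study tests the performance of three discrete choice model formulations: the multinomial logit model (MNL), the nested logit model (NL), and the mixed multinomial logit model (MMNL). Among these structures, MNL is the most popular ( 27 ) because of its well-known advantage of tractability (closed-form probability function). The model assumes that the random utility components of the trip purpose alternatives are independently and identically distributed (i.i.d.) type I extreme values, which leads to the following probability of alternative $i$:

$$P_i = \frac{\exp(V_i)}{\sum_{j} \exp(V_j)}$$

where $V_i$ is the systematic utility of trip purpose alternative $i$ and the sum runs over all trip purpose alternatives $j$. The i.i.d. assumption gives the MNL the independence of irrelevant alternatives (IIA) property. The NL relaxes this property across nests by allowing alternatives in the same nest to share a common random error component. For alternative $i$ in nest $m$, the probability is

$$P_i = \frac{\exp(V_i/\mu_m)\left[\sum_{j \in B_m} \exp(V_j/\mu_m)\right]^{\mu_m - 1}}{\sum_{l}\left[\sum_{j \in B_l} \exp(V_j/\mu_l)\right]^{\mu_l}}$$

where $B_m$ is the set of alternatives in nest $m$ and $\mu_m$ is the scale parameter of that nest.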
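NL is still limited and cannot capture many forms of unobserved heterogeneity. Moreover, the IIA property holds for the alternatives within a nest. To test a more flexible error structure, an MMNL is estimated. Details of MMNL can be found in Train ( 29 ), Walker ( 33 ), and Hensher and Greene ( 34 ). In this formulation, the total utility $U_i$ of trip purpose alternative $i$ is

$$U_i = V_i + \eta_i + \varepsilon_i$$

where the random utility is made up of two components: an i.i.d. type I extreme value term $\varepsilon_i$, as in the MNL, and an additive, normally distributed error term $\eta_i$ with zero mean.

While many alternative error correlation patterns can be captured by an MMNL model, in this investigation, it was found that only a heteroskedastic MMNL can be valid (based on statistical significance tests of additive error variances, σ). The covariance matrix of the model has elements in its main diagonal only, that is, there is no covariance among the model’s error terms (in other words, the covariance matrix consists of only differently-scaled variances and no off-diagonal covariances). Here, the probability of trip purpose alternative $i$ is

$$P_i = \int \frac{\exp(V_i + \sigma_i \xi_i)}{\sum_{j} \exp(V_j + \sigma_j \xi_j)}\, \phi(\xi)\, d\xi$$

where $\xi$ is a vector of independent standard normal variates with density $\phi$, and $\sigma_i$ is the standard deviation of the additive error term of alternative $i$, so that $\eta_i = \sigma_i \xi_i$.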
The MNL and NL models are estimated using the classical maximum likelihood estimation technique. The MMNL model is estimated by maximum simulated likelihood estimation technique and by using Halton draws for error simulation. All of these are done through programs written in GAUSS ( 35 ).
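As an illustration of how these probabilities can be computed, the following minimal Python sketch evaluates the closed-form MNL probabilities and approximates heteroskedastic MMNL probabilities by averaging logit probabilities over Halton-based normal draws. It is not the GAUSS code used in the study; the function names, the choice of prime bases, and the use of unscrambled Halton points are all assumptions made for the example.

```python
import math
from statistics import NormalDist


def mnl_probabilities(utilities):
    """Closed-form MNL probabilities from systematic utilities."""
    shift = max(utilities)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in utilities]
    total = sum(exps)
    return [e / total for e in exps]


def halton(index, base):
    """Radical-inverse (Halton) point in (0, 1) for a 1-based index."""
    value, fraction = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        value += digit * fraction
        fraction /= base
    return value


def mmnl_probabilities(utilities, sigmas, draws=200):
    """Simulated probabilities for a heteroskedastic error-components logit.

    Alternative j receives an independent normal error sigma_j * xi_j, and
    the logit probabilities are averaged over `draws` Halton-based draws.
    """
    # One prime base per alternative (assumption: at most 13 alternatives).
    primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41]
    if len(utilities) > len(primes):
        raise ValueError("add more prime bases for additional alternatives")
    normal = NormalDist()
    totals = [0.0] * len(utilities)
    for d in range(1, draws + 1):
        shocked = [
            v + s * normal.inv_cdf(halton(d, primes[j]))
            for j, (v, s) in enumerate(zip(utilities, sigmas))
        ]
        for j, p in enumerate(mnl_probabilities(shocked)):
            totals[j] += p
    return [t / draws for t in totals]
```

Setting every element of `sigmas` to zero reduces the simulated probabilities to the MNL probabilities, which offers a simple consistency check. In practice, Halton sequences built on large prime bases can be strongly correlated for small numbers of draws, so scrambled or shuffled sequences may be preferable.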
Empirical Analysis for the City of Toronto
Description of Data Sets Used
Ride-Hailing Trip Trajectory Records
The main data set used in this research was the trajectories of all ride-hailing trips made in Toronto between September 2016 and September 2018. The data set was obtained from the City of Toronto as part of a research collaboration. It contains trip-level information, including spatial coordinates and timestamps of pick-up and drop-off locations. However, to protect users’ privacy, the coordinates are mapped to the nearest intersections, and there are no user IDs. Thus, each record in the data set must be treated as an independent trip with no means of tracking multiple rides by the same user. A summary of the ride-hailing service usage pattern in the city based on the analysis of the data has been published by the City of Toronto Big Data Innovation Team ( 1 ).
This study uses the ride-hailing trips made between September and December 2016, which amounts to about 6.95 million trips after necessary data cleaning. Figure 2 shows the spatial distribution of these trips on weekdays and weekends. On both types of days, the destinations of the trips are highly concentrated around Downtown Toronto and other major activity hubs. However, almost 1.5 times more trips are generated on a weekend than on a weekday. To keep the computation burden associated with the inference process reasonable, a 20% random sample was generated that corresponds to a total of 1,390,527 ride-hailing records. As the primary objective of this research, the missing destination purposes of this sample of ride-hailing trips were inferred.

Figure 2. Spatial distribution of ride-hailing drop-off locations on an average weekday (left) and on an average weekend (right) at the dissemination area level.
Person Trip Survey Data
As part of an R&D project (TTS2.0) by the authors’ research group, a web-based travel survey was conducted in August and November 2017 among the residents of Greater Toronto and Hamilton Area. The survey collected travel diaries along with home and work locations, and household and individual level socio-economic information. The diaries contain trip start times, origin and destination locations, travel modes, and reported trip purposes, classified into 13 categories (shown in Table 1). A subset of this survey data comprising trips originating and terminating within the City of Toronto was taken. After cleaning for missing information, a final data set of 5,065 trip records covering both summer and fall seasons was obtained. This data set was used for the estimation/training of the trip purpose classification models in this study.
Table 1. Descriptive Statistics of the Datasets
Note: DA = “dissemination area”, a geographic unit from the Canadian Census; EPOI = Enhanced Point of Interest; TTS = Transportation Tomorrow Survey; SD = standard deviation; NAICS = North American Industry Classification System; na = not applicable.
Count of establishments per square kilometer in DA.
Count of private dwellings per square kilometer in DA.
It should be noted that the estimation sample contains trips of all modes, and not exclusively of the ride-hailing mode. As such, this study assumes that the relationship between trip purpose, land use at trip ends, and trip start time is similar for the ride-hailing trip trajectory records and the trips of all modes in the estimation data. This is an unavoidable assumption for this study, given that the share of ride-hailing is still small compared with other modes, which makes it difficult to obtain a representative ride-hailing survey sample that is large enough to allow model estimation. However, the performance of the fusion method is validated by matching the predicted market shares of the trajectory data with observed trip purpose market shares obtained exclusively for ride-hailing trips from an independent travel survey.
Enhanced Points of Interest Data
Enhanced Points of Interest (EPOI) is a national database of Canadian business and recreational points of interest maintained by DMTI Spatial Inc. ( 36 ). It contains the geocoded locations of the points of interest along with their North American Industry Classification System (NAICS) codes. These codes place each EPOI in categories like healthcare facilities, shopping centers, educational services, accommodation and food services, and so forth. This open-source data was processed to generate the counts of establishments of different EPOI categories within each dissemination area. As shown in Figure 1, the EPOI data is fused with both the estimation and the inference data sets to provide detailed land use characteristics of trip origins and destinations. This, in turn, provides important context about the typical activities performed at these locations.
2016 Canadian Census Data
Data from the 2016 Canadian Census ( 37 ) is used to derive the number of private dwellings within each dissemination area. A dissemination area (DA) is the smallest geographic unit for which census information is publicly available in Canada. Accordingly, the location coordinates of the other data sets are associated with their corresponding DA, which indirectly takes care of different accuracies of the spatial coordinates in the ride-hailing trajectories and the survey data sets. Similar to the EPOI data, the number of private dwellings within each DA is fused with the estimation and the inference data sets as a contextual variable.
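The land use fusion step described above can be sketched as a simple key-based join on the DA identifier. The Python example below is illustrative only: the record layout, the field names (`area_km2`, `counts`, `origin_da`, `dest_da`), and the sample values are hypothetical, and the use of log(1 + density) to keep DAs with zero counts finite is an assumption, since the source states only that the density measures are log-transformed.

```python
import math


def log_density(count, area_km2):
    """Log of (1 + count per square kilometer); the +1 is an assumption."""
    return math.log1p(count / area_km2)


def build_land_use(da_table):
    """Map each DA id to its log-transformed density features."""
    features = {}
    for da_id, row in da_table.items():
        features[da_id] = {
            name: log_density(count, row["area_km2"])
            for name, count in row["counts"].items()
        }
    return features


def augment_trip(trip, land_use):
    """Attach origin and destination land use features to one trip record."""
    out = dict(trip)
    for end in ("origin", "dest"):
        for name, value in land_use[trip[end + "_da"]].items():
            out[f"{end}_{name}"] = value
    return out


# Hypothetical DA-level counts (EPOI establishments and census dwellings).
da_table = {
    "DA_A": {"area_km2": 0.5, "counts": {"retail": 10, "dwellings": 200}},
    "DA_B": {"area_km2": 2.0, "counts": {"retail": 0, "dwellings": 900}},
}
land_use = build_land_use(da_table)
trip = {"start_hour": 8, "origin_da": "DA_B", "dest_da": "DA_A"}
augmented = augment_trip(trip, land_use)
```

The same join can be applied to both the survey records and the ride-hailing records, so that the estimation and inference data sets carry identically defined land use variables.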
2016 Transportation Tomorrow Survey Data
The Transportation Tomorrow Survey (TTS) ( 38 ) is a large-scale household travel survey conducted in the Greater Toronto and Hamilton Area once every five years. It collects travel diaries (for weekdays) and individual and household level socio-economic attributes. Its latest iteration ran from September 2016 to December 2016 and included ride-hailing as a travel mode. This generated a sample of 1,264 ride-hailing trips in the city, with seven categories of reported trip purposes: “home,” “work,” “school,” “daycare,” “facilitate passenger,” “shopping,” and “others.” The observed trip purpose market shares from this data were used to validate the performance of the inference model.
Table 1 presents the descriptive statistics of the datasets described above. It also mentions how each of these datasets was used in the overall fusion algorithm.
Estimation Results of Discrete Choice Models
The three candidate discrete choice models described above (in the “Formulation of Discrete Choice Models” sub-section) are estimated using two categories of explanatory variables: trip attributes and land use variables. The trip-related variables include trip start time, day, season, and trip length. Land use information includes the densities of EPOI and of private dwellings within a DA. Table 2 presents the estimation results. For the NL model, different types of nesting structures were tested by grouping “similar” trip purpose alternatives based on a priori expectations. Specifically, alternatives that are expected to share common components in their random error terms were placed under the same nest. However, based on behavioral plausibility and statistical significance of the scale parameters, only one nest of mandatory trips was found to be valid. The mandatory trip nest includes “home,” “work,” “education,” and “daycare” purposes. In the MMNL model, the Cholesky factors corresponding to the variance terms of the alternatives are estimated. (Note that Table 2 reports the parameters corresponding to the standard deviations [or variances by extension] of the alternatives and not of the Cholesky decomposition.) The model was tested with 50, 100, 150, and 200 draws and seemed to reach convergence after 150 draws. The model results obtained for D = 200 are presented here. Variables in the final specifications are selected based on the expected sign and statistical significance (95% confidence level) of the corresponding parameters. Some parameters with lower than 95% confidence are still retained in the model because they are perceived to be important variables. Moreover, the same explanatory variables are kept in all models for ease of performance comparison.
Among the three specifications, the MMNL model shows the best goodness-of-fit relative to the constant-only model, with the highest log-likelihood and the lowest Akaike information criterion (AIC) and Bayesian information criterion (BIC) values.
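For reference, the information criteria used in this comparison can be computed from a model's final log-likelihood as in the short Python sketch below. The formulas are the standard definitions (lower AIC and BIC values indicate better fit after penalizing the number of parameters); the function names and the example numbers are hypothetical and are not taken from Table 2. McFadden's rho-squared is included as one common way of expressing fit against a constant-only model, which is an assumption about how such a comparison might be made.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2LL."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2LL."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


def rho_squared(log_likelihood, log_likelihood_constants):
    """McFadden's rho-squared relative to the constant-only model."""
    return 1 - log_likelihood / log_likelihood_constants


# Hypothetical values for illustration; only the sample size of 5,065
# estimation records comes from the text.
example_aic = aic(log_likelihood=-9000.0, n_params=60)
example_bic = bic(log_likelihood=-9000.0, n_params=60, n_obs=5065)
example_rho = rho_squared(-9000.0, -11000.0)
```

Because BIC penalizes each additional parameter by ln(n) rather than 2, it favors more parsimonious specifications than AIC once the sample exceeds about eight observations.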
Table 2. Empirical Model of Trip Purposes
Note: MNL = multinomial logit model; NL = nested logit model; MMNL = mixed multinomial logit model; DA = “dissemination area”, a geographic unit from the Canadian Census; na = not applicable.
All the density measures of land use characteristics are log-transformed.
***, **, * Significant at 99%, 95%, and 90% levels of confidence, respectively.
As expected, the land use variables at destinations are important predictors for most trip purposes (except “facilitate passenger,” “worship,” and “other,” which are not associated with a specific land use type). For example, the coefficient for retail EPOI density is positive and significant for shopping trips, which indicates that if a trip ends in a DA with many retail establishments, it has a high probability of being a shopping trip. Land use characteristics around the trip origin provide context about the previous activity, and the associated coefficients are statistically significant for some trip purposes. For example, a trip that starts from a DA with many manufacturing or educational establishments has a higher probability of being a return home trip. Similarly, trips to daycare facilities are more likely to start from a residential area.
For trip start times, separate coefficients are estimated for each time period of the day to capture their specific effects on trip purpose. Most of the alternatives have positive parameters for the morning period with respect to return home purpose. This indicates that most of the trips in the morning are destined for some out-of-home activity location. Moreover, it is found that if a trip starts later in the day, it has a lower probability of being a work trip, and a higher probability of being a discretionary trip for facilitating passenger, eating out, recreational or social visits, and so forth.
Day of the week is an important trip purpose determinant for work and religious activities. The dummy variable representing weekdays has a positive and significant coefficient for work trips, indicating that a trip has a higher probability of being a work trip if it is made on a weekday as opposed to the weekend. On the other hand, the same variable has a negative coefficient for religious activity related trips, which generally occur over the weekend. Season of the year is also an important attribute for some trip purposes. For example, the fall season dummy variable has a positive coefficient for educational trips, as schools remain closed for most of the summer and reopen in fall. In general, all the variables of the models provide satisfactory and intuitive behavioral explanations for the associated trip purposes.
Inferring Ride-Hailing Trip Purposes
First, the ride-hailing trajectory data set (as described above) is fused with land use information from the EPOI and the census databases to provide contexts about potential activities at trip origins and destinations. The choice models estimated in the previous section are then applied to each trip of the augmented data set. This generates the probabilities of occurrence of the 13 purposes for each ride-hailing trip. Finally, the probabilities of all the records for each trip purpose are averaged to obtain the percentage of travel for that purpose category. Table 3 shows the aggregated prediction results.
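The aggregation step at the end of this procedure can be illustrated with a minimal Python sketch. The function name and the two-purpose example are hypothetical; the per-trip probabilities would come from whichever estimated choice model is applied to the augmented trip records.

```python
def aggregate_shares(trip_probabilities, purposes):
    """Average per-trip purpose probabilities into percentage shares."""
    n_trips = len(trip_probabilities)
    totals = {purpose: 0.0 for purpose in purposes}
    for probs in trip_probabilities:
        for purpose in purposes:
            totals[purpose] += probs[purpose]
    return {p: 100.0 * totals[p] / n_trips for p in purposes}


# Two hypothetical trips with predicted probabilities for two purposes.
example = [
    {"home": 0.5, "work": 0.5},
    {"home": 1.0, "work": 0.0},
]
shares = aggregate_shares(example, ["home", "work"])
# shares == {"home": 75.0, "work": 25.0}
```

Averaging the probabilities, rather than assigning each trip to its single most probable purpose and counting, keeps purposes that are rarely the most likely outcome for any individual trip from vanishing in the aggregate distribution.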
Table 3. Inferred Purpose Distribution of Ride-Hailing Trips
Note: MNL = multinomial logit model; NL = nested logit model; MMNL = mixed multinomial logit model.
Validation of Inferred Trip Purposes
To evaluate the effectiveness of the data fusion-based inference algorithm, the reported ride-hailing trip purposes from the 2016 TTS are considered as the ground truth. This limits the validation process to the weekday ride-hailing records, since the TTS collects weekday travel diaries only. Moreover, as mentioned above, the TTS has only seven categories of reported trip purposes, as opposed to the 13 categories inferred by the fusion algorithm. As such, to make the trip purpose categories compatible for the validation exercise, the inferred categories of “shopping and errands,” “eat out,” “recreation, sports, leisure,” “arts, health and personal care,” “services,” “visiting friends, family,” “worship, religion,” and “other” are grouped into a more general class that corresponds to the combined category of “shopping” and “others” in the TTS. The inferred and observed trip purpose distributions are shown in Figure 3.

Figure 3. Trip purpose distributions for Transportation Tomorrow Survey (TTS) data and ride-hailing data. The graph represents “weekday” trips only.
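The category grouping used for validation can be sketched in Python as follows. The list of grouped categories follows the text above; the function names are hypothetical. The source does not state how its prediction errors are defined, so the relative error computed here (absolute difference divided by the observed share) is an assumption chosen for illustration.

```python
GROUPED = "shopping and others"

# Inferred categories that fold into the combined validation class.
DISCRETIONARY = [
    "shopping and errands",
    "eat out",
    "recreation, sports, leisure",
    "arts, health and personal care",
    "services",
    "visiting friends, family",
    "worship, religion",
    "other",
]


def regroup(inferred_shares):
    """Collapse the 13 inferred purposes into the validation categories."""
    grouped = {}
    for purpose, share in inferred_shares.items():
        key = GROUPED if purpose in DISCRETIONARY else purpose
        grouped[key] = grouped.get(key, 0.0) + share
    return grouped


def relative_error(inferred_share, observed_share):
    """Relative error in percent (assumed definition, see lead-in)."""
    return abs(inferred_share - observed_share) / observed_share * 100.0
```

For example, `regroup({"home": 30.0, "eat out": 20.0, "other": 10.0, "work": 40.0})` leaves “home” and “work” unchanged and returns a combined “shopping and others” share of 30.0.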
Overall, the choice models capture the trend of trip purpose distribution well. Figure 3 shows that the inferred and the observed trip purpose percentages are reasonably close for “home,” “education,” and “shopping and others.” The prediction errors for these purpose categories are less than 10%, 2%, and 9%, respectively. However, the models tend to overpredict the share of “work” trips made by ride-hailing services. The prediction error for these trips is roughly 28%. The inference accuracy is quite low for smaller share alternatives such as “daycare” and “facilitate passenger.” Nonetheless, the validation results are quite encouraging, especially given that the trips in the estimation data are not exclusively ride-hailing trips and as such have somewhat different spatial and temporal characteristics than the ride-hailing trip records of the inference data (see Figure 4). In spite of this, the predicted trip purpose shares match the observed shares quite well, and it can be anticipated that with better estimation data, the predictions will be even more accurate. Moreover, the proposed method performed well despite the existence of mixed-use land parcels in the study area (especially the downtown core of Toronto, where most of the ride-hailing trips are made), which has always been a major challenge for trip purpose imputation ( 5 ).

Figure 4. Trip length (a) and start time (b) distributions of estimation and prediction data sets.
The inference accuracies of the three model specifications are comparable with each other. The more complex NL and MMNL do not provide consistently improved inference. This is not surprising given that the nesting and the variance structures of these models are superimposed on the estimation data, which might not hold for the ride-hailing trajectory data.
Characteristics of Inferred Ride-Hailing Trip Purposes
The inferred purposes help reveal the nature of ride-hailing trips in the City of Toronto. From Table 3, it is evident that ride-hailing is mostly used for discretionary activities like shopping and errands, eating out, recreation and social visits, and so forth, and for returning home. The increasing popularity of the services as a commuting mode is also noticeable. About a quarter of all weekday ride-hailing trips made within the city are for work- and education-related destinations. Moreover, a portion of the “return home” trips might originate from work or school locations, which means that this category also contributes to commuting travel on top of the “work” and “education” categories. This increasing popularity of ride-hailing as a commuting mode might have a long-term effect on the overall transportation system as the modal share of the services becomes more prominent in the future. More commuting ride-hailing trips would result in increased peak-period congestion on the city’s street networks.
The inferred trip purposes also extend our understanding of ride-hailing demand generation throughout the City of Toronto at a spatial resolution that is unavailable from survey results alone. Figure 5 shows the spatial distributions of some of the major trip purposes predicted by the MNL model at the DA level. Such disaggregate distributions reveal important patterns of ride-hailing use in the city. For example, it is found that the downtown core attracts a wide variety of ride-hailing trips, especially “shopping and errands,” “eat out,” “recreation, sports, leisure,” “services,” and some “work” trips. However, it attracts comparatively fewer “education” and “visiting friends, family” trips. A large share of “work” trips are also destined for the outer employment hubs of the city. The most evident pattern is observed for “eat out” trip destinations, which are mostly concentrated around the downtown area. The opposite trend is observed for “visiting friends, family” trips, which are mostly destined for the mid- or up-town locations.

Figure 5. Spatial distribution of the inferred ride-hailing trip purposes at the dissemination area level: (a) “home” trips, (b) “work” trips, (c) “education” trips, (d) “shopping and errands” trips, (e) “eat out” trips, (f) “recreation, sports, leisure” trips, (g) “arts, health and personal care” trips, (h) “services” trips, (i) “visiting friends, family” trips.
A comparison between the inferred weekday and weekend trip purposes (Figure 6) reveals that more “return home” and “shopping and others” trips are made by ride-hailing over the weekends. This indicates that a higher proportion of people use these services over the weekends for trips to entertainment, bars, and other activities, which in turn may help in reducing drinking and driving.

Ride-hailing trip purpose distributions for weekdays and weekends.
Finally, a comparative analysis between the inferred weekday trip purpose distribution of ride-hailing obtained from the MNL model and the revealed trip purposes of other travel modes from the TTS is shown in Figure 7. As expected, the proportions of trip purposes are somewhat similar for ride-hailing and taxi, indicating the strong modal competition between these services. However, ride-hailing is used more for work-related trips than taxi. This might be attributed to the higher degree of reliability and flexibility associated with ride-hailing services compared with taxi services, making them a preferred mode for work trips. Interestingly, more educational trips are made via ride-hailing than taxis, which indicates that ride-hailing is more popular among the student population than taxi service, perhaps owing to its lower cost.

Proportions of trip purposes by different travel modes.
In general, the proportion of mandatory trips (home, work, education, daycare) among the total ride-hailing trips is found to be comparable to that of other travel modes, especially driving a private car (auto driver), riding as a passenger in a private car (auto passenger), and taxi. While this may be inconsequential as long as the modal share of ride-hailing is small, with its increasing popularity, ride-hailing may start to influence the ridership levels of the more substantial modes. In particular, the greater convenience and reliability of ride-hailing, offered at a relatively lower cost than traditional taxi services, might encourage people to switch from transit to ride-hailing for their mandatory trips. Thus, municipal authorities need to enforce regulatory policies that preserve the benefits of ride-hailing services, namely improved mobility options for people, without increasing congestion or reducing transit ridership.
Conclusion and Future Work
The paper presents a data fusion methodology for inferring trip purpose from passively collected ride-hailing trajectory data. The inferred trip purposes extend our understanding of the use of this relatively new mobility option. The proposed algorithm is based on a choice model that uses only basic trip-related information (such as approximate pick-up and drop-off locations and trip start times) and land use characteristics around the origins and destinations to predict the probability of different trip purposes. The choice model is estimated using data from a small-sample travel survey, an EPOI database, and a census database. The estimated model is then applied to the ride-hailing trip records to obtain the most probable distribution of ride-hailing trip purposes. The research also has the potential for broader use beyond ride-hailing travel. The data fusion method presented here is generic enough to be applicable for trip purpose inference from other types of passive data sources, including transit smart card transactions, GPS-based travel surveys, mobile phone call detail records, taxi trajectories, and so forth. It thus demonstrates the benefit of fusing big data sources (such as trip trajectories) with small data sources (such as traditional travel surveys) to extend our understanding of urban travel demand and activity dynamics.
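The prediction step summarized above can be illustrated with a minimal sketch of the multinomial logit form, in which each trip receives a probability for every purpose and the per-trip probabilities are averaged into an aggregate distribution. The purposes, attribute names, and coefficient values below are hypothetical placeholders chosen for illustration; they are not the estimates or the specification reported in this paper.

```python
import math

# Hypothetical coefficients for illustration only; not the paper's estimates.
COEFFICIENTS = {
    "home": {"asc": 0.0, "residential_share_dest": 1.0, "evening_start": 0.5},
    "work": {"asc": -0.2, "employment_density_dest": 1.2, "morning_start": 0.8},
    "eat out": {"asc": -0.5, "restaurant_count_dest": 0.9, "evening_start": 0.7},
}


def utility(purpose, trip):
    """Linear-in-parameters utility: constant plus weighted trip attributes."""
    coefs = COEFFICIENTS[purpose]
    return coefs["asc"] + sum(
        beta * trip.get(name, 0.0)
        for name, beta in coefs.items()
        if name != "asc"
    )


def purpose_probabilities(trip):
    """Multinomial logit probabilities of each purpose for one trip."""
    utils = {p: utility(p, trip) for p in COEFFICIENTS}
    u_max = max(utils.values())  # subtract the maximum for numerical stability
    exp_u = {p: math.exp(u - u_max) for p, u in utils.items()}
    total = sum(exp_u.values())
    return {p: e / total for p, e in exp_u.items()}


def aggregate_shares(trips):
    """Average the per-trip probabilities to obtain aggregate purpose shares."""
    shares = {p: 0.0 for p in COEFFICIENTS}
    for trip in trips:
        for p, prob in purpose_probabilities(trip).items():
            shares[p] += prob / len(trips)
    return shares


# Three made-up trips described only by destination land use and start time.
trips = [
    {"residential_share_dest": 0.8, "evening_start": 1.0},
    {"employment_density_dest": 1.5, "morning_start": 1.0},
    {"restaurant_count_dest": 2.0, "evening_start": 1.0},
]
shares = aggregate_shares(trips)
```

Averaging probabilities, rather than assigning each trip its single most likely purpose, keeps the uncertainty of each trip in the aggregate distribution.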
The proposed algorithm is empirically tested using ride-hailing trip data from the City of Toronto. Three different formulations of the choice model are estimated: an MNL structure, an NL structure with one nest of mandatory trips, and a heteroskedastic MMNL structure. The estimation results indicate that basic land use data and time of day provide good context about the typical activities performed at trip origins and destinations. The use of a probabilistic choice model captures the reality that passengers arriving at the same location can be driven by different purposes, and that the purposes may also differ based on trip start times and origins. As such, probabilistic choice models are behaviorally more plausible and generic than deterministic rule-based approaches (e.g., 8–11). The prediction performances of the models are evaluated by comparing the inferred distribution with the revealed trip purpose proportions obtained from TTS. It is found that the models can predict trip purposes reasonably well even when the training data is relatively small and only limited context-specific attributes are available. Given that the City of Toronto has mixed-use land parcels, and the estimation data has somewhat different spatial and temporal characteristics compared with the ride-hailing data, the performance of the choice models is considered satisfactory.
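One simple way to compare an inferred purpose distribution with revealed survey proportions is sketched below. The error measures (per-purpose absolute error, mean absolute error, and total variation distance) are assumptions made for illustration and are not necessarily the measures used in this paper; the share values are made-up placeholders, not results.

```python
def share_errors(inferred, observed):
    """Compare two purpose distributions defined over the same purposes.

    Returns the absolute error per purpose, the mean absolute error,
    and the total variation distance (half the sum of absolute errors).
    """
    abs_err = {p: abs(inferred[p] - observed[p]) for p in observed}
    mae = sum(abs_err.values()) / len(abs_err)
    tvd = 0.5 * sum(abs_err.values())
    return abs_err, mae, tvd


# Placeholder shares for illustration only; not values from the paper.
inferred = {"home": 0.40, "work": 0.20, "other": 0.40}
observed = {"home": 0.35, "work": 0.25, "other": 0.40}
abs_err, mae, tvd = share_errors(inferred, observed)
```

The total variation distance can be read as the share of trips that would have to change purpose for the two distributions to coincide.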
The inferred purpose distribution of ride-hailing trips indicates that the services are mostly used for discretionary activities (e.g., shopping and errands, eating out, recreation, social visits, etc.) and for returning home. The services also play an important role in daily commuter travel. About a quarter of all weekday ride-hailing trips are made to work- and education-related destinations. This, along with the portion of “return home” trips that originate from work or school locations, indicates the increasing popularity of the services as a commuting mode. During the weekends, these services are used more for going to and returning from entertainment- and recreation-related activities, which in turn may help reduce drinking and driving. With regard to modal competition, it is evident that taxi and ride-hailing are strong competitors that cater to similar purposes. Moreover, it is anticipated that with growing ridership, ride-hailing will start to influence other travel modes, including transit. As such, efficient policies should be mandated to reduce the adverse impacts of ride-hailing on the level of traffic congestion, the environment, and the equity of mobility services.
The proposed data fusion method rests on the basic assumption that the missing ride-hailing trip purposes have the same conditional probability as the trips of all modes in the estimation data. Although the validation suggests that the inference algorithm performs satisfactorily even under this assumption, it is a very strong one, and future work might relax it by using a sufficiently large, representative sample exclusively of ride-hailing trips. An alternative approach to relaxing the assumption might be to recalibrate the alternative specific constants of the estimated choice model to reflect the observed market shares of ride-hailing trip purposes, provided that this information is available to the analyst from an independent ride-hailing-specific survey. Such an adjustment of constants would help to minimize the systematic bias associated with estimating the model using a sample that is not exclusively of ride-hailing trips. Also, more advanced context data repositories such as social network check-in data, the Google Places API, and hours of operation of points of interest can be used to improve the prediction accuracy. In future work, the authors plan to compare the trip purpose predictive performance of the choice models with that of appropriate machine learning tools. Specifically, they want to investigate the suitability of unsupervised learning techniques for overcoming the scarcity of representative training data for ride-hailing trips.
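The recalibration of alternative specific constants mentioned above could be carried out with the commonly used iterative adjustment in which each constant is shifted by the log ratio of the target share to the currently predicted share, repeated until the two agree. The sketch below assumes a multinomial logit form; the function names, purposes, utilities, and target shares are hypothetical and are not taken from this paper.

```python
import math


def predict_shares(asc, systematic):
    """Aggregate MNL shares. `systematic` holds, per trip, a dict of
    purpose -> utility excluding the alternative specific constant."""
    shares = {p: 0.0 for p in asc}
    for v in systematic:
        u = {p: asc[p] + v[p] for p in asc}
        u_max = max(u.values())  # subtract the maximum for numerical stability
        exp_u = {p: math.exp(u[p] - u_max) for p in asc}
        total = sum(exp_u.values())
        for p in asc:
            shares[p] += exp_u[p] / total / len(systematic)
    return shares


def recalibrate_asc(asc, systematic, target, base, tol=1e-8, max_iter=200):
    """Shift each constant by ln(target share / predicted share) until the
    predicted shares match the targets; the base constant is kept at zero."""
    asc = dict(asc)  # work on a copy
    for _ in range(max_iter):
        pred = predict_shares(asc, systematic)
        if max(abs(pred[p] - target[p]) for p in asc) < tol:
            break
        for p in asc:
            asc[p] += math.log(target[p] / pred[p])
        shift = asc[base]
        for p in asc:
            asc[p] -= shift  # a common shift leaves the probabilities unchanged
    return asc


# Hypothetical inputs for illustration only.
systematic = [
    {"home": 1.0, "work": 0.0, "eat out": -1.0},
    {"home": -1.0, "work": 1.0, "eat out": 0.0},
    {"home": 0.0, "work": -1.0, "eat out": 1.0},
    {"home": 0.0, "work": 0.0, "eat out": 0.0},
]
initial_asc = {"home": 0.0, "work": 0.0, "eat out": 0.0}
target = {"home": 0.5, "work": 0.3, "eat out": 0.2}
new_asc = recalibrate_asc(initial_asc, systematic, target, base="home")
calibrated = predict_shares(new_asc, systematic)
```

Only the constants change in this adjustment; the coefficients on land use and time-of-day attributes are left as estimated, so the approach corrects aggregate shares without re-estimating the model.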
Footnotes
Acknowledgements
Data were made available for the research by the Big Data Innovation Team of the City of Toronto and the Data Management Group of the University of Toronto.
Author Contributions
The authors confirm contribution to the paper as follows: study conception and design: K. N. Habib, S. Hossain; data collection: S. Hossain; analysis and interpretation of results: S. Hossain, K. N. Habib; draft manuscript preparation: S. Hossain, K. N. Habib. All authors reviewed the results and approved the final version of the manuscript.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research was funded by a Trillium Scholarship and an NSERC Discovery Grant.
The opinions expressed in this paper are those of the authors and not those of NSERC. The authors claim the sole responsibility for all results, comments, and interpretations made in the paper.
