Route Choice Set Generation on High-Resolution Networks

Abstract

This study seeks to find a strategy to capture the most observed trajectories with a minimum number of algorithms. GPS information on 4,538 real trips from 131 travelers in 2008 was collected and analyzed in Minneapolis-St. Paul (the Twin Cities) as part of the I-35W Bridge Collapse study. The high-resolution road network of the Twin Cities includes 108,561 nodes and 277,747 links. Labeling and link penalty approaches are combined to generate alternatives based on either observed or free-flow speed. Overall, with the best 10 labels, on average, 40 unique routes are generated for each origin-destination pair, and around 80% of all observed trips could be captured with an 80% overlap threshold. About 88% of all observed trips have an average deviation within 50 m compared with the best matching result when combining all labels introduced in this study. Freeway-preferred routes cover more observed trips than freeway-avoided routes, and the peak coverage occurs when freeway travel time is weighted at 0.8 to 1 times travel time on non-freeway links. A random effects panel model is used for predicting the overlap between an alternative route and the observed trajectory. Multinomial and mixed logit models with a path-size term are applied to model the route selection. These models indicate that alternative routes which are shorter in distance, have faster average free-flow speed, contain a higher freeway percentage, and incur fewer traffic lights are more likely to have higher overlap with observed trajectories and are more likely to be selected.

Keywords

planning and analysis behaviors decision analysis and processes

Generating the choice set for route choice is a complex problem. For a large urban network, potential routes, which are difficult to itemize, could number in the thousands ( 1 ), and thus a clear choice set of available routes for a trip should be explicitly generated from the network before route choice modeling. Many studies have pointed out that the size and composition of choice sets deeply influence the quality of route choice estimation and choice probability ( 2 – 6 ).

Previous studies show that for networks with thousands or tens of thousands of nodes and links, the common choice set generation methods provide a choice set that captures a high share of observed trips with fewer than 50 potential routes. However, on a network with hundreds of thousands of nodes and links, generating the same number of alternative routes with these methods captures a smaller share of observed trips, so more potential routes are needed in the choice set to obtain the same share. One option to overcome this difficulty is to reduce the network size by removing low-traffic local streets in neighborhoods. However, eliminating links from the network loses the traffic on those links and worsens the accuracy of map matching for GPS points.

Therefore, this research aims to provide a method for combining two popular choice set generation procedures, namely link penalty and labeling approaches, to capture a significant proportion of observed trips on a graph that represents all streets within a real road network. The generated choice sets are then used for modeling travelers’ route selection. Previous literature suggests that attributes including trip length, travel time, travel speed, percentage of the freeway, number of left turns, and number of right turns might influence route choice ( 7 – 9 ). These attributes are included in this study.

We first review different methods for route choice set generation in the literature. We then describe our methodology for data preparation, choice set generation, and choice set evaluation. Next, we present our results. Finally, we suggest directions for future research.

Literature Review

Choice set generation approaches can be categorized based on the way they produce potential routes ( 1 ). The measures all build on variations of the idea of the best path through the network, and are summarized in Table 1:

K-shortest path: Calculate the $K$ best paths based on a generalized cost of links. Some algorithms allow cycles in the path ( 10 , 11 ), and some others focus on acyclic paths ( 12 , 13 ).

Link elimination ( 14 ): The algorithm is based on the K-shortest path algorithm. When searching for the $(K + 1)$th shortest path (normally measured in distance or time), some or all links in the previous $K$ shortest paths are eliminated from the network.

Link penalty ( 15 ): Similar to link elimination, but instead of removing links from the original network, this method adds a fixed penalty factor to links which are included in the previous shortest path before re-estimating the shortest path.

Labeling ( 16 ): Labeling differs from the previous two methods. It defines a target label such as “Minimize traffic lights” before calculating a path, and then searches for routes that optimize the target. For example, optimizing the label “Minimize traffic lights” means finding the path with the fewest traffic lights.

Constrained enumeration: Unlike the methods listed above, which assume that travelers make decisions based on minimum generalized cost, the constrained enumeration approach assumes that people select routes based on behavioral rules. A branch-and-bound algorithm introduced by Prato and Bekhor ( 17 ) uses branching rules that reflect behavioral assumptions through the definition of thresholds. A directional threshold and a loop threshold remove routes with either high overlap with existing alternative routes or very long travel time.

Probabilistic methods: Probabilistic methods calculate the probability of links based on their distance to the shortest path ( 18 ). For an origin-destination (OD) pair, at each way point, a repeated random walk process adds the probable next link based on similarity to the shortest path. The route probability then equals the product of the probability of each link comprising the route.

Doubly stochastic generation function ( 19 ): This assumes that travelers perceive path costs with error, and the generation function includes stochastic terms for cost and for the heterogeneity of travelers. These random terms are assumed to follow a probability distribution.

Table 1.

Summary of the Choice Set Generation Techniques Observed in the Literature

Study	# of nodes	# of links	Travelers	Generated routes	Capture rate (overlap = 80%)
Link elimination
Bekhor et al. ( 20 )	13,000	34,000	188	30	71%
Frejinger and Bierlaire ( 18 )	3,077	7,459	2,978	15	80%
Prato and Bekhor ( 2 )	419	1,427	236	10	70%
Pillat et al. ( 21 )	7,703	22,620	1,089	1–13	60%
Rieser-Schüssler et al. ( 22 )	408,636	882,120	500	100	75%
Ding et al. ( 23 )	7,808	11,106	997	30	79%
Zhu and Levinson ( 6 )	8,618	22,477	143	29–58	25%
Yao and Bekhor ( 3 )	8,583	21,151	6,000	24	83%
Link penalty
Bekhor et al. ( 20 )	13,000	34,000	188	40	80%
Prato and Bekhor ( 2 )	419	1,427	236	15	62%
Zhu and Levinson ( 6 )	8,618	22,477	143	15	55%
Yao and Bekhor ( 3 )	8,583	21,151	6,000	27	96%
Labeling
Bekhor et al. ( 20 )	13,000	34,000	188	1	46%
Prato and Bekhor ( 2 )	419	1,427	236	1	31%
Spissu et al. ( 24 )	No information	18,000	393	1	47%
Quattrone and Vitetta ( 25 )	4,480	16,029	332	5	75%
Zhu and Levinson ( 6 )	8,618	22,477	143	1	23%
Tang and Levinson ( 26 )	8,618	22,477	124	1	28%
Yao and Bekhor ( 3 )	8,583	21,151	6,000	1	59%

Directly using the K-shortest path approach for the route choice problem is uncommon in recent research. The high similarity (overlapping) of generated routes means the alternatives cannot be easily distinguished as different routes by travelers. Instead, link elimination, link penalty, and labeling approaches are commonly applied in the literature. As shown in Table 1, these methods generally provide acceptable results and are easy to implement.

Prato ( 1 ) noted that some links which are included in both actual trajectories and alternative routes might be removed before searching for a new shortest path in link elimination, and thus the real trajectories cannot be captured. Link penalty keeps those “used” links in the network, and the new shortest path still has the chance to use them. Therefore, link penalty approaches are more likely to generate real paths. For labeling, choosing good labels can make the generation process efficient, but the label selection process relies on the modelers’ experience, and not all real trips have clear labels. For constrained enumeration, the difficulty is setting the thresholds for behavioral constraints.

Bekhor and Prato ( 27 ) argued that the empirical results in a small network can be opposite to those in a large urban network. One shortcoming of probabilistic methods is the unrealistic loops created by repeating the random walk process. Prato ( 1 ) argued that the doubly stochastic generation function method is computationally prohibitive for larger road networks.

All the methods above generate alternative routes based on predefined rules and are further categorized to be explicit methods by Yao and Bekhor ( 3 ). Compared with explicit methods, implicit methods do not need alternative routes to be defined before model estimation ( 28 ). However, implicit methods have high computational costs and are unsolvable when there are cycles in the paths.

Overall, generating alternative routes based on a K-shortest path algorithm is easy to implement but is unlikely to capture all observed trajectories. Similar shortcomings could be found in constrained enumeration methods. For stochastic shortest path based methods, the high time cost and the heavy reliance on choosing suitable probability distributions mean that they are not suitable for every data set. Based on previous studies, the majority of research focuses on improving link elimination, link penalty, and labeling approaches, and in most of these studies both the number of observed trips and the size of the network are small. In high-resolution networks, a previous study generated 100 alternative routes per OD pair and covered approximately 75% of observed trips with an 80% overlap threshold. This should be improved.

Methodology

Data Preparation

GPS Data

GPS information was collected in the Minneapolis-St. Paul region (the Twin Cities) as part of the I-35W Bridge Collapse study in 2008 ( 6 , 29 ). Within a 13-week period, 43,117 trips were recorded from 153 participants using either a logging GPS device (QSTARZ BT-Q1000p GPS Travel Recorder powered by DC output from in-vehicle cigarette lighter) or a real-time communicating GPS device (adapted from the system deployed in the Commute Atlanta study [ 30 ]) installed in the travelers’ vehicles. In this study, an observed trip is defined as an observed journey from a single origin to a single destination for one traveler at a specific time. For example, for the same origin and destination, even if a traveler uses the same route on two different days, these two journeys are defined as two separate observed trips. In this study, only morning trips are considered. Trips finished within 1 min are assumed to be short journeys for parking and are removed. In a few cases where travelers detour 1.5 to three times the shortest distance to their workplace, these trips are assumed to be to pick someone up or have other purposes and are excluded. The filtering process and results are presented in Table 2.

Table 2.

Data Processing

Cleaning process	Travelers removed	Unique travelers	Unique trips
Raw data	0	153	43,117
Trips completed between 5 a.m. and 9 a.m.	21	132	5,990
Trip duration longer than 1 min	1	131	4,965
Trips forming a complete route	0	131	4,839
Trips shorter than 150% shortest distance path	0	131	4,538

As shown in Figure 1, the Lawrence Group (TLG) road network for the Twin Cities is used in this study, and it includes 108,561 nodes and 277,747 links. Cleaned trajectories are then matched to the TLG network using a K-nearest neighbor (KNN) approach ( 31 ). As shown in Figure 2, an algorithm was developed and applied to ensure all matched links form a complete route.

Figure 1.

The Twin Cities road network from the Lawrence Group (TLG).

Figure 2.

Map matching algorithm.

In rare cases, multiple complete routes are formed by adding potential links connecting disjointed points upstream. In this case, the shortest one is selected, assuming people will avoid a local detour. To protect privacy, the recorded start and end points are not the travelers’ real origins and destinations. Therefore, a 100 m tolerance zone is set around each reported origin and destination. Finally, all 4,538 trips were successfully matched, and 1,940 OD pairs were included.

Link Attributes

Road attributes such as length, speed limit, road type, number of traffic lights, number of bus stops, and travel direction are obtained from the TLG network. Three sources of real-time speed data are applied in this study.

The first is TomTom speed data, which were obtained by aggregating records from millions of GPS logging and navigation devices in the Twin Cities metro region ( 32 – 34 ). The travelers in the I-35W study did not use TomTom as their guidance, and the TomTom speed data were collected in a different year (2011). Additionally, this study focuses on morning trips, so the morning peak hour (7:00–9:00 a.m.) TomTom speed data are combined with the TLG network. For a few roads in the TLG network, the TomTom speed network splits those roads into several links and records speeds for those links separately. In this case, the speeds of the longest link are used to represent the travel speeds for the road. To simulate real travel time on each route, the real travel time for a traveler on each link is needed, but the TomTom data only provide the travel speed distribution on each link: an aggregated travel speed for every 5% from the fifth percentile to the 95th percentile, where, for example, the fifth percentile travel speed represents the lowest 5% of recorded speeds on the link. Therefore, two extreme scenarios are used in this study:

Perfectly correlated scenario $(M 1)$ : assumes the same percentile applies on all links (e.g., the 90th percentile travel speed is used for all links).

Perfectly independent scenario $(M 2)$ : assumes travel speeds on all links are independent and randomly chooses the percentile level for each link.
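The two scenarios can be illustrated with a minimal sketch. The data layout (a dictionary of percentile speeds per link), the function names, and the toy speeds are assumptions made for illustration only; they are not the study's data structures.

```python
import random

# Percentile bins described in the text: every 5% from the 5th to the 95th.
PERCENTILES = list(range(5, 100, 5))  # 19 bins

def draw_correlated(link_speeds, percentile):
    """M1: every link uses the speed at the same percentile."""
    return {link: speeds[percentile] for link, speeds in link_speeds.items()}

def draw_independent(link_speeds, rng):
    """M2: each link draws its own percentile independently."""
    return {link: speeds[rng.choice(PERCENTILES)]
            for link, speeds in link_speeds.items()}

def travel_times(link_lengths, speeds):
    """Link travel time = length / speed (units must be consistent)."""
    return {link: link_lengths[link] / speeds[link] for link in link_lengths}

# Hypothetical toy data: speed (m/s) per percentile for two links.
link_speeds = {
    "a": {p: 10 + p // 5 for p in PERCENTILES},
    "b": {p: 20 + p // 5 for p in PERCENTILES},
}
link_lengths = {"a": 500.0, "b": 1000.0}

m1 = draw_correlated(link_speeds, 90)                 # one of the 19 M1 draws
m2 = draw_independent(link_speeds, random.Random(0))  # one M2 draw
m1_times = travel_times(link_lengths, m1)
```

Each draw yields one set of link travel times, which can then be passed to a shortest path search.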

Additionally, travel speeds from loop detectors and the GPS speeds from all 153 participants in the I-35W Bridge Collapse study are also included in this study. Both have smaller sample sizes compared with the TomTom speed data. Of course, we could have used the GPS data itself to generate a speed map and used the average speed of all links to represent the travel speed for non-used links ( 29 ), but as the 2011 year TomTom data has recorded travel speed on most of the links and is available to us, we used it to represent the link travel speed in this study.

For all three sources of real-time travel speed data, not all links in the TLG network have speed records. The 2008 to 2009 street images in Google Maps are used to check those links, and we find that most of the links are local streets. Based on that, we assume these are low traffic roads, and therefore, speeds are assumed to be 15 mph (25 km/h).

Choice Set Generation

Choice set generation is an important step before modeling individual route choices, and quality and size significantly affect the modeling results. The high-resolution TLG road network provides an advantage in accuracy of the paths, but it also results in higher computational complexity in the route search process. As TomTom speed data records 19 travel speed percentage bins (every 5% from 5% to 95%), a simulation method is used to find alternative routes.

Overall, 19 draws based on 19 travel speed percentage bins defined in TomTom speed data are performed for the perfectly correlated scenario $M 1$ , and thus 19 travel times are obtained for each route alternative for a given trip’s OD pair.

For the perfectly independent scenario $M 2$ , since travel speeds of all links are independent, for each simulation, the travel speed of each link is randomly drawn from the 5th to 95th percentile speeds. We performed 20 draws for the M2 scenario because of computational cost. As computational costs drop, more draws would be preferred, although we observed that additional draws are subject to diminishing returns, as alternative routes tend to repeat and the speed variance on many links is fairly small. Finally, 20 simulated travel times are obtained for each link for scenario $M 2$ . Several packages can generate a network and find the shortest path, such as NetworkX, SNAP, and LightGraphs, and each package has advantages in different aspects. Since evaluating those packages is not the main aim of this study, the most familiar package, NetworkX, is used. In both cases, the A-star algorithm in Python’s NetworkX package is applied to find the shortest path.

The labeling and link penalty approaches are combined to generate alternative routes with TomTom and free-flow speed data. The general steps are:

First step: Define a label.

Second step: Determine a penalty or bonus factor and apply the factor to all related links in the road network.

Third step: Search the path which satisfies the predefined label in the first step for all OD pairs, such as “Minimize travel distance.”

Fourth step: For some labels, we want multiple paths rather than a single path. Before finding the (k+1)th path for these labels, we add a penalty factor to all links in the kth path and update those links in the road network.

Fifth step: Once all labels are completed, form a choice set with the generated alternative routes.

Labels used in this study are categorized into time-based paths and distance-based paths.

Time-Based Labels

Time-based paths find alternative routes with the least travel time under different conditions. The measurable link attribute for this kind of label is travel time. Freeways, which normally have faster travel speed, good road surface condition, and no traffic lights, are attractive for many travelers. However, as they have fewer intersections, traveling on freeways often results in a somewhat longer travel distance. To investigate the trade-off for the benefits of freeways, a link penalty method is applied. A set of penalty weightings (0.33, 0.6, 0.7, 0.8, 0.9, 0.95, 1.0, 1.05, 1.11, 1.25, 1.43, 1.67, 3.33) are multiplied by the link travel time on freeway links to identify the weighted fastest path. The same process is implemented for free-flow travel time, travel time under the perfectly correlated scenario $(M 1)$ , and travel time under the perfectly independent scenario $(M 2)$ . In addition, for every 5% from 5% to 95% travel speed in the perfectly correlated scenario, alternatives are generated based on all penalty factors in that list. Therefore, one route is found for each penalty factor (13 factors in total) for free-flow travel time, 19 routes are found for each penalty factor (13 factors in total) for $M 1$ , and 20 routes are found for each penalty factor (13 factors in total) for $M 2$ .

As the TLG network includes many more links and nodes than the networks used in previous studies, the computational costs for generating K-paths are high. Therefore, 20 shortest time paths are found for each of three factors (0.3, 0.8, and 1.0) with free-flow travel time. For scenario $M 1$ , 20 shortest time paths are searched for penalty factor 1 with 50th, 70th, and 90th percentile travel time. Since computational time for scenario $M 2$ is high, and the travel time simulation process itself is like searching K-shortest paths, no additional K-shortest paths are generated for $M 2$ . The link penalty method is applied to searching K alternatives. For travel time based on detector speed and travel time based on GPS speed, only a factor equal to 1 (no preference for freeway) is applied.
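A minimal sketch of the weighted fastest path search follows. It uses a small standard-library Dijkstra search rather than the NetworkX A-star routine used in the study, so that the example is self-contained. The graph layout, the function name `fastest_path`, and the toy network are hypothetical.

```python
import heapq

def fastest_path(graph, origin, destination, freeway_factor=1.0):
    """Dijkstra on travel time, with freeway link times scaled by freeway_factor.

    graph: {node: [(neighbor, travel_time, is_freeway), ...]}  (assumed layout)
    Returns the node sequence, or None if the destination is unreachable.
    """
    best = {origin: 0.0}
    previous = {}
    queue = [(0.0, origin)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == destination:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, time, is_freeway in graph.get(node, []):
            weighted = time * freeway_factor if is_freeway else time
            new_cost = cost + weighted
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(queue, (new_cost, neighbor))
    if destination not in best:
        return None
    path = [destination]
    while path[-1] != origin:
        path.append(previous[path[-1]])
    return path[::-1]

# Toy network: a freeway route O-F-D (12 min) and a local route O-L-D (11 min).
graph = {
    "O": [("F", 6.0, True), ("L", 5.0, False)],
    "F": [("D", 6.0, True)],
    "L": [("D", 6.0, False)],
}
FACTORS = [0.33, 0.6, 0.7, 0.8, 0.9, 0.95, 1.0, 1.05, 1.11, 1.25, 1.43, 1.67, 3.33]
routes = {f: fastest_path(graph, "O", "D", f) for f in FACTORS}
unique_routes = {tuple(r) for r in routes.values()}
```

In this toy case the freeway route wins only when the factor is 0.9 or lower, so the 13 factors produce two unique routes.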

Distance-Based Labels

Distance-based paths find alternative routes with the least travel distance under different conditions. The measurable link attribute for this kind of label is link length, and based on the shortest distance path(s), some further searching processes are made to optimize target labels.

“Shortest distance”: One shortest distance path is found for each OD pair.

“Minimum left turns”: Since turns are actions of vehicles rather than properties of links, it is hard to find the minimum left-turn path for a given OD pair directly by using the NetworkX package in Python ( 35 ). Instead of finding the minimum left turns path in the network, this study identifies the minimum left turns path of the 20 shortest distance paths.

The link penalty approach is used to generate the 20 shortest paths in this study. Before finding the (k+1)th shortest path, the length of each link in the kth shortest path is multiplied by a penalty factor, and the (k+1)th path is found on the updated network with those penalized links. Penalty factors of 1.05, 1.2, and 99 are used in this study. The reason for using 99 is that, even with a penalty factor of 1.2 on the links in the last shortest path, we still get the same path or a path with only small differences from the last shortest path. However, if we eliminate the links in the last shortest path, we might lose links that appear in the observed trips. Therefore, we used a very high penalty factor for the links in the last shortest path to obtain a much different path. Overall, as shown in Table 3, 47 time-based labels and five distance-based labels are included.
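The iterative link penalty search can be sketched as follows. This is an illustration under stated assumptions: penalties accumulate across iterations, repeated paths are dropped, the destination is reachable, and a small standard-library Dijkstra stands in for the NetworkX routine. The function names and the toy network are hypothetical.

```python
import heapq

def shortest_path(weights, origin, destination):
    """Dijkstra over a {(u, v): weight} dict of directed links.

    Assumes the destination is reachable from the origin.
    """
    adjacency = {}
    for (u, v), w in weights.items():
        adjacency.setdefault(u, []).append((v, w))
    best, previous, queue = {origin: 0.0}, {}, [(0.0, origin)]
    while queue:
        cost, node = heapq.heappop(queue)
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, w in adjacency.get(node, []):
            if cost + w < best.get(neighbor, float("inf")):
                best[neighbor] = cost + w
                previous[neighbor] = node
                heapq.heappush(queue, (cost + w, neighbor))
    path = [destination]
    while path[-1] != origin:
        path.append(previous[path[-1]])
    return path[::-1]

def link_penalty_paths(lengths, origin, destination, k, penalty):
    """Run k searches, multiplying the weight of every link on the latest
    path by `penalty` before the next search; keep only distinct paths."""
    weights = dict(lengths)  # work on a copy; the original lengths stay intact
    paths = []
    for _ in range(k):
        path = shortest_path(weights, origin, destination)
        if path not in paths:
            paths.append(path)
        for link in zip(path, path[1:]):
            weights[link] *= penalty
    return paths

# Toy network with three parallel routes of length 2.0, 2.2, and 4.0.
lengths = {("O", "A"): 1.0, ("A", "D"): 1.0,
           ("O", "B"): 1.1, ("B", "D"): 1.1,
           ("O", "C"): 2.0, ("C", "D"): 2.0}
mild = link_penalty_paths(lengths, "O", "D", k=3, penalty=1.05)
harsh = link_penalty_paths(lengths, "O", "D", k=3, penalty=99)
```

In this toy case the 1.05 factor returns the first path twice before switching, giving two distinct paths from three searches, while the factor of 99 gives three distinct paths. This mirrors the motivation for the very high factor described above.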

Table 3.

Summary of Labels

Labels class	Penalty weights	Number of labels
Time-based labels
Free-flow travel time scenario
Freeway preferred	0.33, 0.6, 0.7, 0.8, 0.9, and 0.95	6
Freeway avoided	1.05, 1.11, 1.25, 1.43, 1.67, and 3.33	6
No preferences (Shortest time)	1.0	1
K-shortest path	0.3, 0.8, and 1.0	3
Perfectly correlated scenario (M1)
Freeway preferred	0.33, 0.6, 0.7, 0.8, 0.9, and 0.95	6
Freeway avoided	1.05, 1.11, 1.25, 1.43, 1.67, and 3.33	6
No preferences (Shortest time)	1.0	1
K-shortest path with 50% travel time	1.0	1
K-shortest path with 70% travel time	1.0	1
K-shortest path with 90% travel time	1.0	1
Perfectly independent scenario (M2)
Freeway preferred	0.33, 0.6, 0.7, 0.8, 0.9, and 0.95	6
Freeway avoided	1.05, 1.11, 1.25, 1.43, 1.67, and 3.33	6
No preferences (Shortest time)	1.0	1
Other time-based labels
Shortest time with loop detector speed data	1.0	1
Shortest time with GPS speed data	1.0	1
Distance-based labels
Shortest distance	1.0	1
Minimum left turns	1.0	1
K-shortest path	1.05, 1.2 and 99	3
Total number of labels	NA	52

Note: NA = not available.

Choice Set Evaluation

Overlap

The common choice set performance indicators are overlap rate and capture rate. Overlap rate measures the percentage by which generated routes overlap the observed trajectory, as shown in Equation 1.

\begin{matrix} O_{i, n} = \frac{L_{i, n}^{c}}{L_{n}} \cdot 100\% \end{matrix}

(1)

where

$O_{i, n}$ is the percentage of observed trip $n$ that is overlapped by generated route $i$

$L_{i, n}^{c}$ is the total length of all common links in generated route $i$ and observed trip $n$

$L_{n}$ is the total length of the observed trajectory

For an observed trip, the overlap rate between each generated alternative route and the actual route is calculated. The same process is repeated for all observed trips.

Capture rate describes the percentage of the set of observed trajectories that are captured by the generated routes under a specific overlap threshold. For an observed trip, if any alternative route in the choice set has an overlap rate that reaches the threshold, then this observed trip is defined as “captured.” For example, a capture rate of 43% for an 80% overlap threshold means that, for 43% of all observed trajectories in the sample set, at least one route generated by the selected algorithm spatially coincides with 80% or more of the trajectory. Both overlap rate and capture rate are used to show the performance of the choice set generation algorithms. Overlap thresholds of 50%, 60%, 70%, 80%, 90%, and 100% are used to measure capture rates.
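As a minimal sketch of the two indicators, the functions below compute the overlap rate of Equation 1 and the capture rate for a given threshold. Routes are represented as lists of link identifiers, an observed trip is assumed not to repeat a link, and a trip counts as captured when its best overlap is at or above the threshold (needed for the 100% threshold to be attainable). All names and the toy data are hypothetical.

```python
def overlap_rate(route_links, observed_links, link_length):
    """Equation 1: shared length / observed trip length, in percent."""
    observed_length = sum(link_length[a] for a in observed_links)
    common = set(route_links) & set(observed_links)
    return 100.0 * sum(link_length[a] for a in common) / observed_length

def capture_rate(trips, link_length, threshold):
    """Percent of trips whose best generated route reaches the threshold.

    trips: list of (observed_links, choice_set) pairs.
    """
    captured = 0
    for observed_links, choice_set in trips:
        best = max(overlap_rate(r, observed_links, link_length) for r in choice_set)
        captured += best >= threshold
    return 100.0 * captured / len(trips)

# Toy data: link lengths in meters and two observed trips with their choice sets.
link_length = {"a": 100, "b": 300, "c": 600, "d": 500}
trips = [
    (["a", "b", "c"], [["a", "b", "d"], ["a", "c"]]),  # best overlap: 70%
    (["d"], [["d"]]),                                   # best overlap: 100%
]
rate_80 = capture_rate(trips, link_length, 80)
rate_70 = capture_rate(trips, link_length, 70)
```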

Deviation

In a high-resolution road network, there exist many objective routes (alternative routes) which slightly deviate from target routes (observed trips) ( 22 ). Some generated routes which have no overlap with observed trips might be one block or several meters away from the used route. The causes of those small deviations vary and include factors such as traffic light phase, the influence of temporary road conditions, a gap in the traffic, and personal preferences which could not be identified based on the collected information. Compared with routes which are far from observed routes, these “closer alternative routes” are also important ( 36 ), so only using overlap rate to evaluate choice sets might be insufficient. Therefore, a term called “average deviation” is calculated based on Equation 2 and applied in this study.

\begin{matrix} D_{i, n} = A_{i, n} / L_{n} \end{matrix}

(2)

where

$D_{i, n}$ is the average deviation between observed trip $n$ and the generated route $i$ . Unit is meters.

$A_{i, n}$ is the area of the region between the generated route $i$ and observed trip $n$ . Unit is square meters.

$L_{n}$ is the total length of the observed trip. Unit is meters.

Deviation thresholds of 5 m, 50 m, 100 m, 200 m, 300 m, 400 m, and 500 m are tested.
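The text does not state how the area between the two paths is computed. One possible sketch, assuming both paths are polylines in projected coordinates (meters), share their endpoints, and do not cross each other, closes the two polylines into a polygon and applies the shoelace formula. The function names are hypothetical.

```python
import math

def polyline_length(points):
    """Total length of a polyline given as (x, y) tuples in meters."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))

def enclosed_area(observed, generated):
    """Shoelace area of the ring: observed path out, generated path back.

    Assumes the two paths do not cross; crossing paths would need to be
    split at the crossings, because signed areas would otherwise cancel.
    """
    ring = list(observed) + list(reversed(generated))
    twice_area = sum(x1 * y2 - x2 * y1
                     for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]))
    return abs(twice_area) / 2.0

def average_deviation(observed, generated):
    """Equation 2: area between the paths divided by observed trip length."""
    return enclosed_area(observed, generated) / polyline_length(observed)

# Toy case: the generated route runs one 100 m block north of a 1 km trip.
observed = [(0.0, 0.0), (1000.0, 0.0)]
generated = [(0.0, 0.0), (0.0, 100.0), (1000.0, 100.0), (1000.0, 0.0)]
deviation = average_deviation(observed, generated)
```

Here the enclosed area is 100,000 square meters over a 1,000 m trip, so the average deviation is 100 m.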

Explanatory Variables

After obtaining a choice set with good performance, to understand and predict route choices, this study evaluates which variables explain route choice. Both linear models and logit models are used in this study. We model the overlap and route choices as being determined by a set of independent variables described as follows:

Length (m): Trip length

Traffic lights coverage: Length of link with traffic lights divided by trip length

Bus stops coverage: Length of link with bus stops divided by trip length

Free-flow travel time: Travel time based on speed limit

Free-flow speed: Trip length divided by free-flow time

Freeway percentage: Freeway length divided by trip length

Left turns: The number of left turns along the trip

Right turns: The number of right turns along the trip

Traffic lights: The number of traffic lights passed along the trip

Bus stops: The number of bus stops passed along the trip

Path size: Between 0 and 1, 1 for unique route

If the angle between the two connected road segments is between 30 and 150 degrees, the movement through these two segments is defined as a turn.
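A minimal sketch of this turn rule, assuming planar coordinates with x pointing east and y pointing north, measures the signed heading change at each intermediate point and uses its sign to separate left from right turns. Because the 30 to 150 degree range is symmetric, the result is the same whether the angle is read as the heading change or as the interior angle between the segments. The function names are hypothetical.

```python
import math

def classify_movement(p0, p1, p2, low=30.0, high=150.0):
    """Classify the movement p0 -> p1 -> p2 as 'left', 'right', or 'none'."""
    ax, ay = p1[0] - p0[0], p1[1] - p0[1]
    bx, by = p2[0] - p1[0], p2[1] - p1[1]
    cross = ax * by - ay * bx  # positive when the path bends counterclockwise
    dot = ax * bx + ay * by
    deflection = math.degrees(math.atan2(cross, dot))  # signed heading change
    if low <= abs(deflection) <= high:
        return "left" if deflection > 0 else "right"
    return "none"

def count_turns(points):
    """Return (left turns, right turns) along a polyline."""
    moves = [classify_movement(*triple)
             for triple in zip(points, points[1:], points[2:])]
    return moves.count("left"), moves.count("right")

# East, then north (left turn), then east (right turn), then a slight bend.
path = [(0, 0), (100, 0), (100, 100), (200, 100), (300, 110)]
lefts, rights = count_turns(path)
```

With such a counter, the “Minimum left turns” label reduces to picking the candidate path with the smallest left-turn count among the 20 shortest distance paths.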

Linear Overlap and Deviation Models

In this study, a linear model which predicts the performance (overlap or deviation) of the alternative route is introduced. As GPS trajectories are collected from the same group of participants for a period of time, panel regression models are applied. All attributes shown in the “Explanatory Variables” section are taken as independent variables, and dependent variables including overlap rate $O_{i, n}$ and average deviation $D_{i, n}$ are modeled.

For both $O_{i, n}$ and $D_{i, n}$ , the Breusch-Pagan test is adopted to test the heteroskedasticity, and the Durbin-Watson test is used for checking auto-correlation. The Hausman test is used to test for endogeneity in the panel data. Hypotheses in this study include:

For overlap: alternative routes with shorter trip lengths, less travel time, faster travel speeds, a higher freeway percentage, and fewer turns are more likely to have higher overlap with the observed trip.

For deviation: alternative routes with longer trip lengths, longer travel time, lower travel speeds, a lower freeway percentage, and more turns are more likely to have higher deviation from the observed trip.

Route Choice Models

To model travelers’ route choices, a mixed logit model is applied. As shown in Equation 3, the utility $U$ is the sum of a deterministic term $V$ and an error term $ε$ . Equation 4 shows the probability of alternative route $i$ being selected from the choice set $C$ .

U_{n, i} = V_{n, i} + ε_{n, i} = β X_{n, i} + ε_{n, i}

(3)

where

$U$ is the utility function

$V$ is the deterministic part in the utility function

$β$ is the coefficient vector, where $β ~ f (β | θ)$ for any distribution of $f$

$X$ is a vector which includes all variables in the “Explanatory Variables” section

$ε$ is the error term, which is iid (independently and identically distributed) extreme value

P (i | C) = \int \frac{e^{β X_{n, i}}}{\sum_{j} e^{β X_{n, j}}} f (β | θ) d β

(4)

Generated alternative routes in the choice set might overlap partially or completely with each other. The path-size logit model adds a path-size $(Z)$ term, as shown in Ben-Akiva and Bierlaire ( 37 ), to the deterministic part of the utility function (Equation 5) to correct for the overlapping effect. The formula for $Z$ is presented in Equation 6.

\begin{matrix} V = β_{k} X_{k} + β_{Z} \ln Z \end{matrix}

(5)

where

$V$ is the deterministic part in the utility function

$β_{k}$ is the coefficient vector

$X_{k}$ is a vector which includes all variables in the “Explanatory Variables” section

$Z$ is the path-size term

\begin{matrix} Z_{i, C} = \sum_{a \in i} \frac{l_{a}}{l_{i}} \frac{1}{\sum_{j} δ_{a, j}} \end{matrix}

(6)

where

$Z_{i, C}$ is the path-size term for alternative route $i$ in choice set C

$l_{a}$ is the length of link $a$ in alternative route $i$

$l_{i}$ is the total length of the alternative route $i$

$δ_{a, j}$ is 1 if alternative route $j$ includes link $a$ , and 0 otherwise

$\sum_{j} δ_{a, j}$ is the number of alternative routes that contain link $a$
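Equation 6 can be illustrated with a short sketch. Routes are represented as lists of link identifiers, and each route is assumed not to repeat a link. The function name and the toy data are hypothetical.

```python
def path_size(choice_set, link_length):
    """Equation 6 for every route in the choice set.

    choice_set: list of routes, each a list of link ids (no repeated links).
    Returns one path-size value per route, in the same order.
    """
    usage = {}  # number of routes in the choice set that contain each link
    for route in choice_set:
        for link in set(route):
            usage[link] = usage.get(link, 0) + 1
    sizes = []
    for route in choice_set:
        route_length = sum(link_length[a] for a in route)
        sizes.append(sum(link_length[a] / route_length / usage[a] for a in route))
    return sizes

# Toy choice set: routes 1 and 2 share link "a"; route 3 is fully distinct.
link_length = {"a": 1.0, "b": 1.0, "c": 2.0, "d": 5.0}
sizes = path_size([["a", "b"], ["a", "c"], ["d"]], link_length)
```

The fully distinct route receives a path size of 1, while the two routes that share link "a" receive 0.75 and about 0.83, consistent with the range stated for the path-size variable.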

Since observed trips are recorded for multiple participants, heterogeneity might exist across people. However, multinomial logit models and path-size logit models do not consider variation among travelers. Therefore, mixed logit, which allows the coefficient $β$ to be random in utility function, is applied to take heterogeneity into account. Moreover, as shown in Equations 7 and 8, the path size introduced in Equation 6 is added as an attribute to the mixed logit model, and the results of the mixed logit model with and without the $Z$ term are compared. A Python package called “pylogit” is applied to perform mixed logit modeling ( 38 ).

\begin{matrix} U_{n, i} = β_{n} X_{n, i} + β_{Z} \ln (Z_{i, C}) + ε_{n, i} \end{matrix}

(7)

P (i | C) = \int \frac{e^{β X_{n, i} + β_{Z} \ln (Z_{i, C})}}{\sum_{j} e^{β X_{n, j} + β_{Z} \ln (Z_{j, C})}} f (β | θ) d β

(8)

All the logit models presented above aim to predict the probability of selecting one route from the choice set $C$ .
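The integral in the mixed logit probability has no closed form and is usually approximated by simulation. The sketch below is not the pylogit estimation used in the study; it only illustrates the idea by averaging logit probabilities over random draws of the coefficients. It assumes independent normal coefficients for $f$ and a fixed path-size coefficient, and all names and numbers are hypothetical.

```python
import math
import random

def logit_probabilities(utilities):
    """Multinomial logit with a max-shift for numerical stability."""
    top = max(utilities)
    exps = [math.exp(u - top) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_logit_probabilities(X, log_path_size, mean, std, beta_z,
                              draws=500, seed=0):
    """Average logit probabilities over draws of the random coefficients.

    X: one attribute vector per alternative.
    mean, std: parameters of independent normal coefficients (assumed f).
    beta_z: path-size coefficient, held fixed here as a simplification.
    """
    rng = random.Random(seed)
    totals = [0.0] * len(X)
    for _ in range(draws):
        beta = [rng.gauss(m, s) for m, s in zip(mean, std)]
        utilities = [sum(b * x for b, x in zip(beta, row)) + beta_z * lz
                     for row, lz in zip(X, log_path_size)]
        for i, p in enumerate(logit_probabilities(utilities)):
            totals[i] += p
    return [t / draws for t in totals]

# Two alternatives described by (length in km, freeway share).
X = [[10.0, 0.5], [12.0, 0.2]]
log_ps = [math.log(1.0), math.log(0.6)]
probs = mixed_logit_probabilities(X, log_ps, mean=[-0.3, 1.0],
                                  std=[0.1, 0.3], beta_z=1.0)
# With zero spread, the result collapses to a plain multinomial logit.
fixed = mixed_logit_probabilities(X, [0.0, 0.0], mean=[-0.3, 1.0],
                                  std=[0.0, 0.0], beta_z=1.0, draws=3)
```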

Results

Overlap

For all route generation algorithms described in the “Methodology” section, the alternative routes are compared with the observed trip, and the overlap rate is measured. The results are presented in Table 4. Two generated routes that are not exactly the same are counted as two unique routes. As the overlap threshold increases, a generated route must share a larger part of the observed trip for that trip to count as captured. Thus, the capture rate decreases as the overlap threshold increases for all labels.

Table 4.

Capture Rate for Various Labels Under Different Overlap Thresholds

Labels	Unique routes	Capture rate (%) at 70% overlap	Capture rate (%) at 80% overlap	Capture rate (%) at 90% overlap	Capture rate (%) at 100% overlap
Shortest distance	1	37	31	27	24
Free-flow travel time
Freeway preferred factor = 0.8	1	55	48	42	26
Shortest free-flow time	1	52	46	39	26
Freeway avoided factor = 1.05	1	50	42	34	25
K-shortest free-flow time	20	57	49	38	26
K-freeway preferred factor = 0.8	20	56	47	37	25
Perfectly correlated scenario (M1)
Freeway preferred factor = 0.8	avg. 3.5	58	50	41	22
Shortest time factor = 1	avg. 3.7	57	48	39	18
Freeway avoided factor = 1.05	avg. 4	56	47	37	18
Perfectly independent scenario (M2)
Freeway preferred factor = 0.8	avg. 7.0	61	52	45	25
Shortest time factor = 1	avg. 8.3	60	52	44	22
Freeway avoided factor = 1.05	avg. 8.7	60	52	43	22
Minimum left turns	1	45	35	30	25
Shortest distance and least free-flow time path with all freeway factors	avg. 4.3	71	63	51	33
Full choice set	avg. 107	90	81	71	44

Note: avg. = average.

Shortest distance path and shortest free-flow time are common labels used in previous studies. The capture rates under various overlap thresholds for these labels are presented in Table 5 and compared with past studies.

Table 5.

Percentage of Observed Trips Captured by Shortest Distance Path and Shortest Free-Flow Time Path Under Various Thresholds

Overlap threshold	Capture rate
Overlap threshold	Shortest distance	Shortest time
This study
50%	48%	64%
60%	40%	57%
70%	37%	52%
80%	31%	46%
90%	27%	39%
100%	24%	26%
Bekhor et al. ( 20 )
80%	28%	46%
90%	22%	37%
100%	20%	34%
Zhu and Levinson ( 6 )
80%	9%	23%
90%	5%	16%
100%	2%	6%

In this study, as in Bekhor et al. ( 20 ) and Zhu and Levinson ( 6 ), the shortest free-flow time path captures more observed routes than the shortest distance path at every threshold reported in Table 5.

As shown in Figure 3, the capture rate rises at first and peaks when the freeway factor is between 0.7 and 1.11, then declines continuously in all three scenarios. For a perfect match (overlap = 100%), even though fewer unique routes are found based on free-flow travel time than in the other two scenarios, the free-flow scenario captures the most observed trips when the factor is not less than 0.7. According to Figure 3, the freeway-preferred path generally captures a greater percentage of observed trips than the freeway-avoided path in all three scenarios.

Figure 3.

Comparison of capture rate under different thresholds for each freeway factor in three scenarios: free-flow, perfectly independent (M2), and perfectly correlated (M1).

As described in the “Methodology” section, 52 labels are applied, and on average 107 unique routes are generated for each observed trip. To account for the marginal effect of adding more alternatives to the choice set, labels are added one at a time: at each iteration, only the label that gives the choice set the highest capture rate is added, and the capture rate of the resulting choice set is measured. The best label at each iteration is therefore the one that captures the most trips not yet covered by the choice set. Figure 4 shows the cumulative capture rate as each of the 10 best labels is added, under an overlap threshold of 80%. After six labels, each additional label raises the capture rate only slightly (roughly 1%).
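The iterative procedure above is a greedy selection. A minimal sketch follows, assuming that for each label the set of captured trip identifiers at the chosen overlap threshold is already known. The label names and trip identifiers are hypothetical.

```python
def greedy_label_selection(captured_by_label, n_labels):
    """Add labels one at a time, each time picking the label that captures
    the most trips not yet covered by the labels already chosen.

    Returns a list of (label, cumulative number of captured trips).
    """
    covered = set()
    chosen = []
    remaining = dict(captured_by_label)  # shallow copy so the input is not modified
    for _ in range(min(n_labels, len(remaining))):
        best = max(remaining, key=lambda label: len(remaining[label] - covered))
        covered |= remaining.pop(best)
        chosen.append((best, len(covered)))
    return chosen


# Hypothetical trips (by id) captured by each label at a given overlap threshold.
captured_by_label = {
    "shortest_distance": {1, 2, 3},
    "free_flow_time": {2, 3, 4, 5},
    "freeway_factor_0.8": {3, 4, 5, 6},
    "min_left_turns": {1, 7},
}
selection = greedy_label_selection(captured_by_label, n_labels=3)
print(selection)
```

Dividing the cumulative counts by the number of observed trips gives a cumulative capture rate curve of the kind plotted in Figure 4. Greedy selection does not guarantee the best possible combination of labels, but it makes the marginal contribution of each added label explicit.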

Figure 4.

Cumulative capture rate of adding 10 best labels in choice set.

Deviation

The results of capture rate for eight different deviation thresholds are presented in Table 6. Of the observed trips, 88% have at least one generated alternative route with an average deviation less than 50 m.

Table 6.

Percentage of Observed Trips Covered by Different Labels Under Different Deviation Thresholds

Threshold	Capture rate (%)
Threshold	Shortest distance	Shortest free-flow time	Shortest time (M1)	Shortest time (M2)	Combine all labels in “Methodology”
5 m	24	31	36	39	62
50 m	35	48	53	55	88
100 m	42	53	60	62	93
200 m	50	61	67	70	97
300 m	63	68	72	75	98
400 m	68	73	77	78	98
500 m	69	77	80	81	98
800 m	78	84	84	85	100

Overlap Versus Deviation

Both overlap and deviation can be used to assess the performance of choice sets. Models predicting deviation and models predicting overlap are compared to determine the more suitable dependent variable for the analysis. To simplify the process, a choice set which includes the “shortest distance path” and 13 “least free-flow time paths with different freeway factors” is used. With a 70% overlap threshold, 70% of observed trips could be captured by this choice set.

According to the Breusch-Pagan test ( $p = 0.0$ for F-test) and Durbin-Watson test results $(DW = 0.43)$ , there is heteroskedasticity and auto-correlation in the data set, and the Hausman test $(p = 0.97)$ does not reject the null hypothesis that there is no correlation between the regressors and the errors. Therefore, a random effects model is implemented when estimating overlap and average deviation. The results are presented in Table 7. Overall, the $R^{2}$ of predicting overlap is higher than that of predicting average deviation with the same independent variables, but the two models provide a consistent conclusion:

For deviation: longer distances and lower freeway percentage are associated with larger average deviation from the observed trip.

For overlap: higher freeway percentage, shorter distances, fewer right turns and traffic lights, and lower bus stop coverage are associated with higher overlap with the observed trip.
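The Durbin-Watson statistic reported for this data set is the ratio of the sum of squared differences between successive residuals to the sum of squared residuals. Values near 2 suggest no first-order autocorrelation, and values well below 2 suggest positive autocorrelation. Using the common approximation $DW ≈ 2 (1 - ρ)$, a value of 0.43 corresponds to a first-order residual correlation of roughly 0.79. The sketch below illustrates the statistic on hypothetical residual series; it is not the panel-data test procedure used in this study.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for a sequence of regression residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals: a slowly drifting series and a sign-alternating series.
smooth = [1.0, 0.9, 0.8, 0.6, 0.3, -0.1, -0.4, -0.7, -0.9, -1.0]
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

dw_smooth = durbin_watson(smooth)            # well below 2: positive autocorrelation
dw_alternating = durbin_watson(alternating)  # well above 2: negative autocorrelation

# Approximate first-order autocorrelation implied by the reported DW = 0.43.
implied_rho = 1 - 0.43 / 2
print(dw_smooth, dw_alternating, implied_rho)
```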

Table 7.

Overlap Versus Deviation

$β$	Deviation	Overlap
Constant	51.84***	0.50***
Constant	(63.31)	(0.05)
Length	0.05***	−5.69e-6***
Length	(0.005)	(2.14e-6)
Freeway percentage	−452.34**	0.178**
Freeway percentage	(123.77)	(0.178)
Traffic light	NS	−0.012**
Traffic light	NA	(0.005)
Right turns	NS	−0.017**
Right turns	NA	(−2.34)
Bus stop coverage	NS	−28.32**
Bus stop coverage	NA	(12.63)
$R^{2}$	0.195	0.366

Note: (standard error); NA = not available; NS = not significant. ** = significance at 5% level; *** = significance at 1% level.

For the remaining models, overlap is selected as the dependent variable for the linear model, and the selected alternative is defined as the one with the maximum overlap. Future research should test deviation in more depth.

Linear Model

As presented in Figure 4, with an 80% overlap threshold, a choice set generated from the 10 best labels captures 80% of observed trips. This best-10-label choice set averages 40 unique routes for each OD pair and is applied in the route choice modeling process. As above, a random effects model is used to model the overlap for the best-10-label choice set, and $\ln (Z)$ , introduced in Equation 6, is added as an independent variable. According to the estimated results in Table 8, adding the path-size term increases $R^{2}$ by 0.11. Routes with a larger freeway percentage, higher average free-flow speed, shorter trip length, and fewer traffic lights are more likely to have a higher overlap with observed trips. The numbers of left turns and right turns are not significant when the path-size term is included, which contrasts with results in the literature. This might be caused by the composition of the choice set: as shown in Table 7, the number of right turns is significant when using the simple choice set formed by the shortest distance path and 13 least free-flow time paths.

Table 8.

Random Effects Model With and Without Path-Size Term $(Z)$

$β$	With $Z$	Without $Z$
Constant	−0.31**	−0.37***
Constant	(0.129)	(0.136)
Length	−9.45e-6**	−1.51e-5***
Length	(3.03e-6)	(3.18e-6)
Free-flow speed	0.0054**	0.011***
Free-flow speed	(0.0022)	(0.0023)
Freeway percentage	0.23***	0.237***
Freeway percentage	(0.051)	(0.058)
Traffic light coverage	25.86*	68.68***
Traffic light coverage	(14.7)	(14.72)
Bus stop coverage	14.69	35.8**
Bus stop coverage	(13.013)	(14.58)
Traffic light	−0.004**	−0.0038***
Traffic light	(0.0013)	(0.0014)
Bus stop	−0.0005	−0.0022***
Bus stop	(0.0004)	(0.0005)
Left turns	−0.001	−0.0032***
Left turns	(0.001)	(0.0008)
Right turns	0.0009	−0.0006***
Right turns	(0.0011)	(0.001)
$\ln (Z)$	−0.127***	NA
$\ln (Z)$	(0.0087)	NA
$R^{2}$	0.450	0.336
$R^{2}$ (Overall)	0.439	0.309

Note: (standard error); NA = not available. * = significance at 10% level; ** = significance at 5% level; *** = significance at 1% level.

Logit Model

As described above, overlap is used to determine the selection of routes in the choice set. The coefficients $β$ in the utility function Equation 3 are assumed to follow a normal distribution, and simulations are replicated 20 times to obtain draws from the distribution. Additionally, a path-size MNL model is applied to compare its performance with that of the mixed logit model with the $Z$ term. The results are presented in Table 9. Across the three models in Table 9, only the explanatory variables identified as significant are used. The variable “left turns” has the highest variance inflation factor (VIF) among all variables, and its value equals the standard VIF threshold (10). Since the effect of the number of left turns on route choice is worth exploring, this variable is kept in the model. Adding the $\ln (Z)$ term improves the log-likelihood, Akaike Information Criterion (AIC), and Bayesian Information Criterion (BIC) of the mixed logit model. On average across the population, routes with a higher freeway percentage, more bus stops, shorter trip length, and fewer traffic lights, left turns, and right turns are more likely to be chosen. It should be noted that, as $β$ is assumed to follow a normal distribution, the mixed logit coefficients presented in Table 9 are mean values. We recognize that some share of the population might have the opposite sign for any given $β$ .

Table 9.

Mixed Logit Model With Versus Without $Z$ Term Versus $Z$ -MNL Model

$β$	Mixed logit with $Z$	Mixed logit without $Z$	MNL with $Z$
Length	−0.0013***	−0.0013***	−0.00036***
Length	(0.0007)	(0.0007)	NA
Free-flow speed	0.19***	0.18***	0.176***
Free-flow speed	(−0.149)	(−0.16)	NA
Freeway percentage	2.24***	2.67***	2.27***
Freeway percentage	(1.83)	(0.15)	NA
Traffic light	−0.025***	−0.05***	0.034***
Traffic light	(0.086)	(0.108)	NA
Bus stop	0.04***	0.049***	0.0463***
Bus stop	(−0.092)	(−0.08)	NA
Left turns	−0.048***	−0.03***	−0.049***
Left turns	(−0.048)	(−0.059)	NA
Right turns	−0.045***	−0.06***	−0.119***
Right turns	(0.095)	(0.016)	NA
$\ln (Z)$	0.33***	NA	−0.078**
$\ln (Z)$	(−0.168)	NA	NA
Log-Likelihood	−10,609.644	−10,675.82	−12,522.55
AIC	21,259.289	21,387.64	25,273.1
BIC	21,387.869	21,503.363	26,006.01

Note: (sigma of $β$ ); NA = not available; MNL = multinomial logit model; AIC = Akaike Information Criterion; BIC = Bayesian Information Criterion. ** = significance at 5% level; *** = significance at 1% level.

Discussion and Conclusion

When dealing with a high-resolution road network and GPS trajectories, it is difficult to generate a parsimonious choice set that captures most of the trajectories in a set of observed trips. Relying on only one choice set generation method is not sufficient because, as more links and nodes are included in the network, the number of possible routes between an OD pair rises. Additionally, as the number of links around a point increases, map matching results on high-resolution graphs are more likely to be affected by inaccurate GPS points. However, because the high-resolution graph closely resembles the real-life network, the analysis results are likely to be more useful.

This study combines labeling, link penalty, and k-shortest path algorithms and uses recorded travel speed and free-flow speed to generate choice sets for route choice modeling. The high-resolution road network of the Twin Cities, which includes 108,561 nodes and 277,747 links, is applied. Compared with the choice set in the other studies in Table 1, we achieve almost the same capture rate (80%) as Bekhor et al. ( 20 ) did, but our network is much larger than their network (108,561 nodes and 277,747 links versus 13,000 nodes and 34,000 links). For the same level of network size, compared with the study from Rieser-Schüssler et al. ( 22 ), we gain a higher capture rate (around 80% versus 75%) with fewer alternative routes for each OD pair in the choice set (40 versus 100). The choice set generation results are improved compared with previous studies and could be applied to other datasets to improve the choice modeling results.

A panel regression model is also applied to the overlap between alternative routes and observed routes. Using the same initial attributes, a random effects model predicts this overlap with an $R^{2}$ of 0.45. Alternative routes with shorter distance, higher percentage of freeway, faster free-flow speed, and fewer traffic lights are more likely to overlap with observed trips and are more likely to be selected. Based on the $R^{2}$ of the random effects models, the overlap rate is more easily predicted as a dependent variable than the average deviation between alternative routes and observed trips.

According to the results presented in Table 5, alternatives based on the least travel time match more observed trips than those based on the shortest distance. For most drivers, travel time might be more important than distance when they plan morning trips. Moreover, for both free-flow travel time and TomTom travel time, multiplying the travel time of freeway links by a factor of 0.8 when applying the shortest path algorithm improves the capture rate (to about 50% and more than 40% at the 80% and 90% overlap thresholds, respectively), consistent with most drivers preferring freeways over surface streets. We also observe that assuming link travel times are perfectly independent (M2) captures more observed trips than assuming they are perfectly correlated (M1) or using free-flow travel time. A plausible explanation is that the alternative routes found under the M2 scenario are more realistic.
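The effect of the freeway factor on the generated route can be illustrated with a toy example. The sketch below is a plain Dijkstra search in which the travel time of freeway links is multiplied by a factor before the search; the four-node network, its travel times, and the function name are hypothetical, and this is not the implementation used in this study.

```python
import heapq


def shortest_path(graph, origin, destination, freeway_factor=1.0):
    """Dijkstra search on travel time, with freeway link times scaled by freeway_factor.

    graph: {node: [(neighbor, travel_time, is_freeway), ...]}
    Returns (weighted cost, list of nodes on the path).
    """
    queue = [(0.0, origin, [origin])]
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == destination:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for neighbor, time, is_freeway in graph.get(node, []):
            if neighbor not in settled:
                weight = time * freeway_factor if is_freeway else time
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return float("inf"), []


# Hypothetical network: a 10-minute surface route (A-B-D) and a 12-minute freeway route (A-C-D).
graph = {
    "A": [("B", 5.0, False), ("C", 6.0, True)],
    "B": [("D", 5.0, False)],
    "C": [("D", 6.0, True)],
}
_, neutral_path = shortest_path(graph, "A", "D", freeway_factor=1.0)
_, preferred_path = shortest_path(graph, "A", "D", freeway_factor=0.8)
_, avoided_path = shortest_path(graph, "A", "D", freeway_factor=1.05)
print(neutral_path, preferred_path, avoided_path)
```

A factor below 1 makes freeway links look faster than they are, so a slightly slower freeway route can replace the surface route in the choice set; a factor above 1 has the opposite effect.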

A list of penalty factors (weighting freeway times versus non-freeway times) is applied to freeway links under both free-flow and observed conditions, and factors between 0.8 and 1.0 generally capture more observed trips under almost all overlap thresholds. With the same number of iterations and the same penalty, assuming that link travel times are perfectly independent generates more unique routes, and captures more observed trips, than assuming that they are perfectly correlated. By using the best 10 labels to generate, on average, 40 unique routes for each OD pair, around 80% of observed trips could be captured under an 80% overlap threshold. Combining all labels in this study, the generated choice set captures 81%, 71%, and 44% of observed trips with the overlap threshold set to 80%, 90%, and 100%, respectively. The drop from 71% to 44% is large but expected. Many factors influence route choices, and some of them are unknown to the modeler or even to the travelers themselves, and they are very hard to infer from GPS trajectories alone. Generating routes that are exactly the same as the observed trips is therefore challenging. In addition, 88% of observed trips have at least one generated alternative route with an average deviation within 50 m.

Based on statistical evaluation of the models, the mixed logit model with the path-size term shows better log-likelihood, AIC, and BIC than the path-size MNL model and the mixed logit model without that term. From an application perspective, however, it might not be the “better” one. As described in the “Methodology” section, the coefficients $β$ in the mixed logit model are assumed to follow a normal distribution, and thus the coefficients of the mixed logit model in Table 9 are mean values. For example, with $β_{Trafficlight} = - 0.025$ and $σ_{Trafficlight} = 0.086$ , about 39% of the population ( $P (β > 0)$ ) has a coefficient with the opposite sign to the mean value (−0.025), which makes it hard to form a general conclusion. The path-size MNL model, in turn, does not consider the heteroskedasticity in the panel data. Therefore, in addition to using a logit model, we test panel regression models of the overlap between alternative route and observed route. Alternatively, testing whether the coefficient $β$ in the mixed logit model follows some other distribution might be helpful. Since this paper focuses more on choice set generation, this problem is left for future work.
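The share of the population with a positive traffic light coefficient follows directly from the normal cumulative distribution function. A short check, using only the mean and sigma reported in Table 9 (the function name is illustrative):

```python
import math


def share_with_positive_sign(mean, sigma):
    """P(beta > 0) for beta ~ Normal(mean, sigma^2)."""
    z = (0.0 - mean) / sigma
    # 1 - Phi(z), written with the error function.
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))


# Mean and sigma of the traffic light coefficient (mixed logit with Z, Table 9).
p_traffic_light = share_with_positive_sign(mean=-0.025, sigma=0.086)
print(round(p_traffic_light, 3))
```

The result is about 0.39, in line with the 39% quoted in the text. The same function can be applied to any other coefficient in Table 9 whose mean and sigma are reported.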

Footnotes

Acknowledgements

We are thankful to the development team of NetworkX, the development team of Biogeme, and the development team of “pylogit” for making their Python packages open source.

Author Contributions

The authors confirm contribution to the paper as follows: study conception and design: H. Wang, E. Moylan, D. Levinson; analysis and interpretation of results: H. Wang, E. Moylan, D. Levinson; draft manuscript preparation: H. Wang. All authors reviewed the results and approved the final version of the manuscript.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iDs

Haotian Wang

Emily Moylan

David Levinson

References

1. Prato, C. G. Route Choice Modeling: Past, Present and Future Research Directions. Journal of Choice Modelling, Vol. 2, No. 1, 2009, pp. 65–100.

2. Prato, C. G., and Bekhor. Modeling Route Choice Behavior: How Relevant Is the Composition of Choice Set? Transportation Research Record: Journal of the Transportation Research Board, 2007. 2003: 64–73.

3. Yao and Bekhor. Data-Driven Choice Set Generation and Estimation of Route Choice Models. Transportation Research Part C: Emerging Technologies, Vol. 121, 2020, p. 102832.

4. Bekhor, Toledo, and J. N. Prashker. Effects of Choice Set Size and Route Choice Models on Path-Based Traffic Assignment. Transportmetrica, Vol. 4, No. 2, 2008, pp. 117–133.

5. Bovy, P. H. On Modelling Route Choice Sets in Transportation Networks: A Synthesis. Transport Reviews, Vol. 29, No. 1, 2009, pp. 43–68.

6. Zhu and Levinson. Do People Use the Shortest Path? An Empirical Test of Wardrop’s First Principle. PLoS One, Vol. 10, No. 8, 2015, p. e0134322.

7. Lai. Modelling Stochastic Route Choice Behaviours with a Closed-Form Mixed Logit Model. Mathematical Problems in Engineering, Vol. 2015, 2015, p. 729089. https://doi.org/10.1155/2015/729089.

8. Lai and Sha. Understanding Drivers’ Route Choice Behaviours in the Urban Network with Machine Learning Models. IET Intelligent Transport Systems, Vol. 13, No. 3, 2019, pp. 427–434.

9. Jensen, A. F., T. K. Rasmussen, and C. G. Prato. A Route Choice Model for Capturing Driver Preferences When Driving Electric and Conventional Vehicles. Sustainability, Vol. 12, No. 3, 2020, p. 1149.

10. Bellman and Kalaba. On kth Best Policies. Journal of the Society for Industrial and Applied Mathematics, Vol. 8, No. 4, 1960, pp. 582–588.

11. Eppstein. Finding the k Shortest Paths. SIAM Journal on Computing, Vol. 28, No. 2, 1998, pp. 652–673.

12. Yen, J. Y. Finding the K Shortest Loopless Paths in a Network. Management Science, Vol. 17, No. 11, 1971, pp. 712–716.

13. Hadjiconstantinou and Christofides. An Efficient Implementation of an Algorithm for Finding K Shortest Simple Paths. Networks: An International Journal, Vol. 34, No. 2, 1999, pp. 88–101.

14. Azevedo, M. E. O. S. Costa, J. J. E. S. Madeira, and E. Q. V. Martins. An Algorithm for the Ranking of Shortest Paths. European Journal of Operational Research, Vol. 69, No. 1, 1993, pp. 97–106.

15. de la Barra, Perez, and Anez. Multidimensional Path Search and Assignment. Proc., 21st PTRC Summer Annual Meeting, University of Manchester, UK, 1993.

16. Ben-Akiva, Bergman, A. J. Daly, and Ramaswamy. Modelling Inter Urban Route Choice Behaviour. Papers Presented at 9th International Symposium on Transportation and Traffic Theory, Delft, The Netherlands, July 11–13, 1984.

17. Prato, C. G., and Bekhor. Applying Branch-and-Bound Technique to Route Choice Set Generation. Transportation Research Record: Journal of the Transportation Research Board, 2006. 1985: 19–28.

18. Frejinger and Bierlaire. Capturing Correlation with Subnetworks in Route Choice Models. Transportation Research Part B: Methodological, Vol. 41, No. 3, 2007, pp. 363–378.

19. Bovy, P. H., and Fiorenzo-Catalano. Stochastic Route Choice Set Generation: Behavioral and Probabilistic Foundations. Transportmetrica, Vol. 3, No. 3, 2007, pp. 173–189.

20. Bekhor, M. E. Ben-Akiva, and M. S. Ramming. Evaluation of Choice Set Generation Algorithms for Route Choice Models. Annals of Operations Research, Vol. 144, No. 1, 2006, pp. 235–247.

21. Pillat, Mandir, and Friedrich. Dynamic Choice Set Generation Based on Global Positioning System Trajectories and Stated Preference Data. Transportation Research Record: Journal of the Transportation Research Board, 2011. 2231: 18–26.

22. Rieser-Schüssler, Balmer, and K. W. Axhausen. Route Choice Sets for Very High-Resolution Data. Transportmetrica A: Transport Science, Vol. 9, No. 9, 2013, pp. 825–845.

23. Ding, Gao, Jenelius, Rahmani, Huang, Pereira, and Ben-Akiva. Routing Policy Choice Set Generation in Stochastic Time-Dependent Networks: Case Studies for Stockholm, Sweden, and Singapore. Transportation Research Record: Journal of the Transportation Research Board, 2014. 2466: 76–86.

24. Spissu, Meloni, and Sanjust. Behavioral Analysis of Choice of Daily Route with Data from Global Positioning System. Transportation Research Record: Journal of the Transportation Research Board, 2011. 2230: 96–103.

25. Quattrone and Vitetta. Random and Fuzzy Utility Models for Road Route Choice. Transportation Research Part E: Logistics and Transportation Review, Vol. 47, No. 6, 2011, pp. 1126–1139.

26. Tang and D. M. Levinson. Deviation Between Actual and Shortest Travel Time Paths for Commuters. Journal of Transportation Engineering, Part A: Systems, Vol. 144, No. 8, 2018, p. 04018042.

27. Bekhor and C. G. Prato. Methodological Transferability in Route Choice Modeling. Transportation Research Part B: Methodological, Vol. 43, No. 4, 2009, pp. 422–437.

28. Fosgerau, Frejinger, and Karlstrom. A Link Based Network Route Choice Model with Unrestricted Choice Set. Transportation Research Part B: Methodological, Vol. 56, 2013, pp. 70–80.

29. Zhu. The Roads Taken: Theory and Evidence on Route Choice in the Wake of the I-35W Mississippi River Bridge Collapse and Reconstruction. University of Minnesota, Minneapolis, 2010.

30. Rates. Atlanta Commute Vehicle Soak and Start Distributions and Engine Starts per Day Impact on Mobile Source. Atlanta, 2007.

31. Fix and J. L. Hodges. Discriminatory Analysis. Nonparametric Discrimination: Consistency Properties. International Statistical Review/Revue Internationale de Statistique, Vol. 57, No. 3, 1989, pp. 238–247.

32. Cui and Levinson. Accessibility and the Ring of Unreliability. Transportmetrica A: Transport Science, Vol. 14, No. 1–2, 2018, pp. 4–21.

33. Tang and Levinson. An Empirical Study of the Deviation Between Actual and Shortest Travel Time Paths. Working Paper. University of Minnesota, Minneapolis, 2015.

34. Cohn. Real-Time Traffic Information and Navigation: An Operational System. Transportation Research Record, 2009. 2129: 129–135.

35. Hagberg, Swart, and Schult. Exploring Network Structure, Dynamics, and Function Using NetworkX. Los Alamos National Lab (LANL), Los Alamos, NM, 2008.

36. Wang, Moylan, and D. M. Levinson. Prediction of the Deviation Between Alternative Routes and Actual Trajectories for Bicyclists. Findings, June 10, 2022. https://doi.org/10.32866/001c.35701.

37. Ben-Akiva and Bierlaire. Discrete Choice Methods and Their Applications to Short Term Travel Decisions. In Handbook of Transportation Science (R. W. Hall, ed.), Springer, Boston, MA, 1999, pp. 5–33.

38. Brathwaite and J. L. Walker. Asymmetric, Closed-Form, Finite-Parameter Models of Multinomial Choice. Journal of Choice Modelling, Vol. 29, 2018, pp. 78–112.