
Abstract

This study presents an innovative multistage methodology for decomposing urban traffic flows into light vehicle (LV) and heavy vehicle (HV) categories, addressing a critical gap in transportation network analysis. Utilizing data from Greater Sydney’s road network, we develop a comprehensive approach comprising three main stages: origin–destination (OD) matrix estimation using RapidEx, quadratic programming optimization for HV/LV proportion estimation, and XGBoost regression for generalization. Our analysis examines associations between HV proportions and urban characteristics, including points of interest (POI), nightlight intensity, and zonal attributes. The XGBoost model achieves a test R² of 0.637, demonstrating strong predictive power for real-world applications. Through SHAP (SHapley Additive exPlanations) analysis, we uncover complex nonlinear relationships between nightlight intensity and HV proportions, with significant interaction effects between urban features. The model performs particularly well in predicting common urban HV proportion ranges (0.2–0.6), reflecting typical urban traffic compositions. These findings provide valuable insights for urban planning and policy development, especially in contexts where detailed vehicle classification data are limited.

Keywords

urban transportation, traffic decomposition, heavy vehicles, light vehicles, origin–destination flows, machine learning

Introduction and Background

Urban transportation networks are complex systems that significantly influence economic efficiency, environmental sustainability, and quality of life in cities worldwide. These networks are shaped by diverse vehicle types and their movement patterns, with a crucial distinction between heavy vehicles (HV) and light vehicles (LV). HVs, encompassing trucks, buses, and coaches, constitute a significant portion of total traffic volume. Trucks usually account for up to 15% of traffic, while buses may represent approximately 3% ( 1 ). However, these figures exhibit substantial variability across different regions, influenced by factors such as local industrial activity, transportation infrastructure, and the extent of public transit systems.

The ability to distinguish between LV and HV movements is crucial for several reasons. It allows for a more detailed understanding of urban logistics and commercial activities, as HVs are often indicative of goods movement and industrial operations ( 2 , 3 ). It also provides insights into public transportation patterns as buses belong to the HV category. This decomposition by vehicle type enables more targeted approaches to traffic management, infrastructure planning, and policy-making.

HVs, while constituting a small proportion of total traffic, contribute disproportionately to various urban challenges. Commercial vehicles significantly affect traffic congestion and emissions ( 4 , 5 ), and are responsible for a disproportionate share of road maintenance burden ( 6 ). This underscores the need for accurate identification and disaggregation of HV movements to generate effective policies and manage their impact on urban areas.

However, existing approaches to distinguish between vehicle types have significant limitations. The most traditional approach relies on manual classification surveys, which, while accurate, are inherently resource-intensive and limited in spatial coverage ( 7 ). Fixed sensor networks, particularly weigh-in-motion systems, provide highly accurate data but only at specific locations within the network ( 8 ). Vehicle classification algorithms utilizing inductive loop detectors have shown promise but require substantial infrastructure investment and may experience varying levels of accuracy, depending on vehicle type ( 9 , 10 ). Video analytics and computer vision systems ( 11 ) offer promising capabilities but grapple with challenges of scalability, camera coverage requirements, processing limitations, and performance under varying weather conditions. Global Positioning System (GPS) tracking of commercial fleets captures valuable movement data but represents only a subset of overall HV movements ( 12 ).

Recent advancements in transportation modeling tools, such as RapidEx ( 13 ), have created new opportunities to address these research gaps. By leveraging travel time data from sources such as Google or TomTom, in conjunction with road network information from OpenStreetMap, RapidEx can estimate potential flow at every link in a network, achieving travel times that closely align with observed data. This capability, combined with the tool’s ability to generate origin–destination (OD) patterns consistent with these flows, provides an opportunity for fine-grained traffic flow disaggregation.

However, the application of such advanced tools often faces challenges owing to the lack of detailed, decomposed data needed to generate accurate OD patterns for all links in a network. Even in developed nations, the availability of such data frequently lacks the necessary granularity and network coverage ( 14 ). While sensor deployment for obtaining decomposed flow data is increasing in some areas, it is not yet commonplace worldwide. This data scarcity highlights the need for models that can be developed in data-rich areas and then applied to other contexts where specific, detailed transportation data is limited.

To address this challenge, understanding the relationship between transportation and economic activity becomes crucial. This relationship is fundamental for urban planning and policy-making ( 2 , 3 ), as it allows for the creation of transferable models that can estimate traffic patterns based on more commonly available economic indicators. While this applies to all types of vehicle movements, it is particularly relevant for HVs, which include both freight transport and public transit. The movement of HVs, especially in freight transport, plays a vital role in bridging the gap between production and consumption locations. This necessitates the development of models that integrate economic activity, logistics decision-making, and traffic flows ( 15 , 16 ). By focusing on these relationships, we can create more comprehensive models that account for both LV and HV movements, providing a better understanding of urban transportation dynamics.

In recent years, there has been increased interest in freight transport research, driven by the need to better understand freight flows and volumes within transportation networks ( 16 , 17 ). However, obtaining reliable freight trip generation data remains challenging, especially where large-scale freight transport surveys are infrequent ( 18 , 19 ). To address these limitations, researchers have explored various proxies for economic activity in trip generation models ( 20 – 23 ). Among these proxies, point of interest (POI) data and nighttime light intensity data have gained significant attention as indicators of economic activity ( 22 , 23 ). POI data, representing entities with geolocation information, has shown strong potential in transport research ( 24 , 25 ). These studies demonstrate the effectiveness of such proxies in estimating patterns of socio-economic change and transport demand at various geographical scales ( 26 ).

To fully capitalize on these advancements and explore relationships between vehicle movement patterns and urban socio-economic indicators, machine learning techniques have emerged as a promising approach. In particular, XGBoost has shown impressive results in traffic prediction tasks ( 27 – 29 ). Its ability to handle complex nonlinear relationships and robustness to overfitting makes it particularly suitable for transportation data analysis, enabling the development of models that link socio-demographic factors and other proxies to vehicle type splits.

Our research leverages these techniques to develop an innovative approach for decomposing urban traffic flows into LV and HV categories. Focusing on Greater Sydney’s road network, we aim to identify relationships between vehicle movement patterns and urban socio-economic indicators. This study is motivated by the need to provide urban planners, policymakers, and transportation engineers with a more comprehensive understanding of the factors influencing HV and LV movements in urban areas. The significance of this research extends beyond the immediate case study. By developing a machine learning-based regression model that links socio-demographic factors and other proxies to vehicle type splits, we create a framework that can be generalized to other regions or cities. Our approach addresses the scarcity of observed link-level decomposed flow data and offers a scalable solution for urban transportation analysis on a global scale, moving beyond traditional survey-based methods to provide a more accurate and transferable understanding of urban transportation patterns and their relationship to economic activities.

Major contributions of our study are as follows:

We develop a novel multistage methodology that integrates advanced network modeling, optimization techniques, and machine learning to disaggregate traffic flows.

We provide a comprehensive analysis of the factors influencing HV proportions in urban areas, offering a nuanced view of the complex dynamics governing urban freight movement.

We establish a generalizable framework for predicting vehicle type distributions across entire urban networks, enabling more accurate forecasting and scenario analysis for urban planning in diverse contexts.

By bridging the gap between aggregated traffic data and disaggregated vehicle flows, this research aims to enhance our understanding of urban transportation dynamics and provide practical tools for evidence-based decision making. The insights gained from this study have the potential to make a noticeable difference toward improving urban mobility efficiency, reducing congestion, and supporting more sustainable urban development practices across various urban environments.

The remainder of this paper is organized as follows: the second section describes the data sources used. The third section details the methodology. The fourth section presents results and discusses their implications for urban planning and transportation policy. Finally, the fifth section concludes with a summary of our findings and suggestions for future research directions.

Data

The study leverages three datasets that capture urban activity, nightlight intensity, and traffic volumes in Sydney, Australia. These datasets provide a rich and diverse set of information, enabling a thorough analysis of the relationship between urban characteristics and the composition of traffic flows.

The first dataset, POI, is sourced from SafeGraph, a leading provider of geospatial data. SafeGraph’s POI data is widely used in academic research ( 30 – 32 ), government analyses, and business applications, owing to its comprehensiveness and accuracy. The dataset provides detailed spatial coordinates and categorical classifications for business and service locations throughout the Sydney region. For the year 2023, the POI data categorize urban amenities into seven distinct groups: education, employment, food, medical, retail, services, and transit (see Table 1). This granular classification enables detailed analysis of spatial patterns in urban activities and their potential influence on traffic composition.

Table 1.

Description of Key Variables in the Analysis

Variable	Measure	Description
Urban amenity indicators (POI counts)
POI_education	Count	Educational facilities including schools, universities, colleges, and training centers
POI_employment	Count	Business establishments including corporate offices, financial institutions, and technology firms
POI_food	Count	Food service and entertainment venues including restaurants, cafes, hotels, and recreational facilities
POI_medical	Count	Healthcare facilities including hospitals, clinics, and institutional care services
POI_retail	Count	Retail establishments including shopping centers, department stores, and specialty shops
POI_services	Count	Service businesses including professional services, personal care, and commercial enterprises
POI_transit	Count	Transportation hubs including transit stations and logistics centers
Economic and demographic indicators
Nightlight_intensity	nW.cm⁻².sr⁻¹	Average radiance of nighttime light (proxy for economic activity)
Population	Count	Total number of residents within the analysis zone
Area	km²	Geographic extent of the analysis zone

The second dataset utilizes the visible infrared imaging radiometer suite (VIIRS) day/night band (DNB) nighttime lights data, maintained by the Earth Observation Group ( 33 ). This data source has emerged as the gold standard for economic studies utilizing night-lights ( 34 – 37 ), offering superior resolution and accuracy compared with previous satellite-based light measurements. We employed the annual VIIRS Nighttime Lights (VNL) V2 ( 33 ) average radiance composite data for 2023, which provides cloud-free average radiance values. The VNL V2 product features enhanced filtering of aurora, stray light, and ephemeral lights, making it particularly suitable for urban analysis. The 2023 dataset was selected to maintain temporal consistency with other data sources and ensure the most current representation of nighttime light patterns in the study area.

The third dataset comprises traffic volume data from Transport for New South Wales (2023). The authority’s Traffic Volume Viewer provides disaggregated traffic flow data for specific road network links, distinguishing between HV and LV ( 38 ). This dataset’s reliability and relevance for traffic analysis have been demonstrated in previous research ( 39 ).

The summary statistics presented in Table 2 encompass analysis of all 158 zones within the study area. For each zone, we aggregated POI counts by category to capture the distribution of commercial and service activities. Population and area measurements provide demographic and spatial context, while nightlight intensity measurements serve as a proxy for economic activity levels. This combination of variables enables comprehensive analysis of the relationships between urban form, economic activity, and traffic patterns.

Table 2.

Summary Statistics of Key Variables

	POI_education	POI_employment	POI_food	POI_medical	POI_retail
M	21.47	36.07	198.20	184.22	296.83
SD	25.61	62.15	221.02	230.21	331.23
Minimum	0.00	0.00	2.00	1.00	4.00
25%	5.00	6.25	53.00	36.00	84.75
50%	11.00	17.50	119.50	101.00	210.00
75%	29.00	36.00	245.75	215.25	345.75
Maximum	161.00	621.00	1,455.00	1,157.00	2,139.00
	POI_services	POI_transit	Nightlight_intensity	Population	Area
M	709.50	49.60	25.51	27,659.27	20.07
SD	863.77	53.07	24.19	31,814.99	19.43
Minimum	10.00	1.00	0.42	14.72	0.89
25%	224.25	16.00	7.55	7,506.12	6.20
50%	403.50	31.00	20.68	16,118.90	6.07
75%	857.50	55.75	33.80	33,588.74	43.42
Maximum	6,868.00	322.00	127.02	157,582.34	43.46

Note: POI = point of interest; SD = standard deviation.

The majority of the data used in this study is available for most parts of the world, either in the public domain or through API access from the sources. This includes network data from OpenStreetMap (OSM), travel time data from Google, TomTom, or Mapbox, nightlight data from NASA’s Visible Infrared Imaging Radiometer Suite (VIIRS) Day/Night Band (DNB), and points of interest (POI) data from SafeGraph, OSM, and Google. The only specialized data is the decomposed vehicular split in the network, which is not readily available worldwide. This data scarcity is the primary motivation for our research—to build a machine learning model that can be applied in regions where such specialized data may not be accessible, by leveraging the wealth of publicly available data sources mentioned above.

Methodology

This study presents a novel multistage approach to estimate the proportions of HVs and LVs in OD matrices for transportation network analysis. Our methodology integrates advanced network modeling techniques, optimization algorithms, and machine learning to provide a comprehensive solution for disaggregating vehicle types in OD flows. The approach consists of three stages: OD matrix estimation, optimization-based proportion estimation, and machine learning-based generalization.

OD Matrix Estimation

The first stage of our methodology focuses on estimating the OD matrix for the study area using RapidEx ( 13 ). RapidEx is a network modeling tool that extracts road network information from OpenStreetMap and has been successfully applied to analyze similar road networks ( 40 ). To manage computational complexity while maintaining essential traffic corridors, we focused on motorways, trunk roads, and primary roads. This network simplification strategy allows for efficient processing without compromising the integrity of major HV and LV flow patterns. Travel time data for each link were collected using the Google Maps API for a representative weekday morning peak period (June 6, 2024, 7:00−9:00 a.m.). This time window was selected to capture peak traffic conditions. Additionally, we gathered population data from WorldPOP ( 41 ) and categorized POI information to inform the demand estimation process.

Figure 1 illustrates the road network of Greater Sydney, color-coded as motorways (dark blue), trunk roads (cyan), and primary roads (brown). Secondary and tertiary roads are omitted, as the selected network captures the majority of HV and LV flows. The overlay shows the 158 zones, which are based on the Uber H3 hexagonal hierarchical spatial index. These zones start at resolution 6 (approximately 36 km² per hexagon) and are further subdivided up to resolution 8 (approximately 0.9 km² per hexagon) based on POI density.

Figure 1.

Road network and zones of greater Sydney. (Color online only.)

The RapidEx tool employs a bi-level approach for OD matrix estimation. In the upper level, OD demand is initially estimated using a line search for total demand ranging from 5,000 to 1,500,000 vehicles, utilizing population and POI data proportions to distribute demand among OD pairs. The lower level consists of a traffic assignment problem, determining link-level flows and travel times based on the current OD estimate. A genetic algorithm then operates at the upper level, adjusting the OD demand based on a fitness function that measures the match between estimated and observed traffic flows or travel times at the lower level. This bi-level problem is solved iteratively until a specified convergence criterion is met. For clarity, note that the OD demand estimated by RapidEx is expressed in passenger car units (PCU), a standardized measure of traffic flow accounting for various vehicle characteristics. For a more comprehensive understanding of the methodology and its implementation, readers are referred to ( 13 ).
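The upper-level line search described above can be illustrated with a deliberately small sketch. It is not RapidEx itself: the BPR volume-delay function, the one-link-per-OD-pair "assignment", the 5,000-vehicle step, and all numeric inputs are assumptions made only for this example; the search range of 5,000 to 1,500,000 is the one quoted in the text.

```python
# Toy upper-level line search over total demand (illustration only, not RapidEx).

def bpr_time(flow, free_time, capacity, alpha=0.15, beta=4):
    """BPR volume-delay function (assumed here; the source does not name one)."""
    return free_time * (1.0 + alpha * (flow / capacity) ** beta)

def link_times(total_demand, shares, free_times, capacities):
    """Trivial stand-in for traffic assignment: each OD pair loads one link."""
    return [bpr_time(total_demand * s, t0, c)
            for s, t0, c in zip(shares, free_times, capacities)]

def line_search(observed, shares, free_times, capacities,
                lo=5_000, hi=1_500_000, step=5_000):
    """Scan total demand and keep the value with the smallest squared time error."""
    best_total, best_err = None, float("inf")
    for total in range(lo, hi + 1, step):
        estimated = link_times(total, shares, free_times, capacities)
        err = sum((e - o) ** 2 for e, o in zip(estimated, observed))
        if err < best_err:
            best_total, best_err = total, err
    return best_total

shares = [0.5, 0.3, 0.2]                      # hypothetical population/POI shares
free_times = [10.0, 8.0, 12.0]                # hypothetical free-flow times (min)
capacities = [40_000.0, 30_000.0, 20_000.0]   # hypothetical link capacities

# Synthetic "observed" travel times generated from a known total demand.
observed = link_times(60_000, shares, free_times, capacities)
best_total = line_search(observed, shares, free_times, capacities)
print(best_total)
```

In this toy setting the scan recovers the demand level that generated the synthetic travel times; in the full method the genetic algorithm then refines the individual OD entries rather than only the total.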

Optimization-Based Proportion Estimation

The second stage of our methodology focuses on estimating the proportions of HVs and LVs for each OD pair using an optimization-based approach. Since the OD demand estimated by RapidEx is expressed in PCU, and the contribution of each OD pair to each link is known from the traffic assignment step, changing the split between HVs and LVs does not affect travel times as long as the PCU value remains constant. The same PCU demand can therefore be met by many different combinations of HV and LV counts, so the PCU demand alone does not determine the split.
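As a worked illustration of this non-uniqueness, assume a hypothetical PCU factor of 2 for an HV and 1 for an LV (the source does not state the factors used). Writing $N^{L}$ and $N^{H}$ for the LV and HV counts of one OD pair, a demand of 100 PCU is met by several different splits:

```latex
D^{PCU} = N^{L} + 2 N^{H} = 100
\quad \Rightarrow \quad
(N^{L}, N^{H}) \in \{ (100, 0),\ (60, 20),\ (20, 40),\ \dots \}
```

Each split loads the network identically in PCU terms, which is why observed HV and LV counts are needed to pin the proportions down.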

The optimization model considers all OD pairs and observed flow ( 38 ) at a few links to identify the optimal and unique proportions. By leveraging the OD matrix from the previous stage and the contribution of each OD pair to link flows, we formulated a quadratic programming optimization model to determine HV and LV proportions at the OD level. Since demand is constant and the value of travel time will not be affected by these proportions, we eliminated the need to solve the traffic assignment problem at every iteration, thus making the optimization process extremely fast.

The optimization problem is defined as follows:

Objective Function:

\min \sum_{l \in obs} \left[ \left( F_{l, obs}^{H} - \sum_{(i, j) \in OD} ODL_{i, j, l} \times p_{i, j}^{H} \right)^{2} + \left( F_{l, obs}^{L} - \sum_{(i, j) \in OD} ODL_{i, j, l} \times p_{i, j}^{L} \right)^{2} \right]

(1)

Subject to:

p_{i, j}^{H} + p_{i, j}^{L} = 1 \forall (i, j) \in OD

(2)

p_{i, j}^{H} \in [0, 1] \forall (i, j) \in OD

(3)

p_{i, j}^{L} \in [0, 1] \forall (i, j) \in OD

(4)

where $F_{l, obs}^{H}$ and $F_{l, obs}^{L}$ are the observed flows for HVs and LVs on link $l$, respectively, $ODL_{i, j, l}$ represents the demand on link $l$ contributed by OD pair $(i, j)$, and $p_{i, j}^{H}$ and $p_{i, j}^{L}$ are the proportions of HVs and LVs for OD pair $(i, j)$.

To solve this optimization problem, we employed Gurobi 11, a commercial optimization solver known for its efficiency in handling large-scale quadratic programming problems. To ensure a high-quality solution, a relative optimality gap of 1e-6 is set as the stopping criterion.
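To make the structure of Equations 1 to 4 concrete, the following is a minimal sketch on a two-link, two-OD-pair toy problem. It deliberately swaps in a plain projected gradient method written in pure Python instead of Gurobi, so it illustrates the formulation rather than the solver used in the study. The function name `solve_hv_proportions`, the matrix `odl`, and all numbers are hypothetical.

```python
def solve_hv_proportions(odl, f_hv, f_lv, iters=2000):
    """Minimise Eq. (1) over p^H in [0, 1], using p^L = 1 - p^H from Eq. (2).

    odl[l][k] is the demand that OD pair k contributes to observed link l.
    """
    n_links, n_od = len(odl), len(odl[0])
    # Conservative step size: 4 * (sum of squared coefficients) bounds the
    # largest eigenvalue of the Hessian of the objective.
    step = 1.0 / (4.0 * sum(a * a for row in odl for a in row))
    p = [0.5] * n_od
    for _ in range(iters):
        grad = [0.0] * n_od
        for l in range(n_links):
            hv_est = sum(odl[l][k] * p[k] for k in range(n_od))
            lv_est = sum(odl[l][k] * (1.0 - p[k]) for k in range(n_od))
            r_hv = f_hv[l] - hv_est   # HV residual on link l
            r_lv = f_lv[l] - lv_est   # LV residual on link l
            for k in range(n_od):
                grad[k] += 2.0 * odl[l][k] * (r_lv - r_hv)
        # Gradient step, then projection onto the box [0, 1] (Eqs. 3 and 4).
        p = [min(1.0, max(0.0, p[k] - step * grad[k])) for k in range(n_od)]
    return p

# Hypothetical link contributions and a known HV share used to build the data.
odl = [[100.0, 50.0],
       [0.0, 200.0]]
true_p = [0.3, 0.1]
f_hv = [sum(a * q for a, q in zip(row, true_p)) for row in odl]
f_lv = [sum(a * (1.0 - q) for a, q in zip(row, true_p)) for row in odl]

p_hv = solve_hv_proportions(odl, f_hv, f_lv)
p_lv = [1.0 - q for q in p_hv]
```

Because the traffic assignment is not re-solved inside the loop, each iteration only needs the fixed coefficients in `odl`, which mirrors the reason given above for the speed of this stage.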

The output of this stage is a set of HV and LV proportions for each OD pair in the study area. These proportions provide a detailed understanding of the composition of traffic flows between zones, enabling more accurate modeling and analysis of HV and LV movements in the network.

Machine Learning-Based Generalization

The final stage of our methodology focuses on developing a machine learning model to generalize the relationship between various socio-economic, land-use, and demographic factors and the proportions of HVs in OD flows. While the previous stage provides exact proportions at the OD pair level for the study area, this stage aims to build a model that can be applied to other regions or time periods where detailed data may not be available. Among various machine learning approaches considered, XGBoost ( 42 ) emerged as the most suitable choice owing to its ability to handle complex nonlinear relationships and its robust performance with transportation data. The proportion of LVs can be easily derived as the complement of the HV proportion (i.e., 1 - HV proportion).

To train the XGBoost model, we engineered a comprehensive set of features that capture the factors influencing HV and LV proportions. These features included POI data and their sub-categories (employment centers, educational institutions, transit hubs, retail establishments, medical facilities, and service related), which serve as proxies for land-use patterns and economic activity; nightlight intensity data, an additional proxy for economic activity ( 43 ), capturing the intensity of human settlements and industrial areas; population data, representing the distribution of residential areas and potential trip generation/attraction zones; and zonal area, accounting for the size and spatial extent of each zone. These features were compiled for both origin and destination zones of each OD pair, providing a rich set of explanatory variables for the XGBoost model.

To improve model quality and reduce noise in the dataset, we applied several preprocessing steps. OD pairs with zero demand and those with an average annual daily traffic (AADT) for HVs less than 1 were excluded. All features underwent z-score normalization to prevent bias from high-magnitude variables. The dataset was partitioned into an 80% training set and a 20% testing set.
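A minimal sketch of these preprocessing steps, using only the Python standard library, is given below. The record layout, the field names (`demand`, `hv_aadt`, `poi_food`, `population`), the random seed, and the toy data are assumptions for illustration; the thresholds and the 80/20 split follow the text.

```python
import random
import statistics

def preprocess(rows, feature_names, test_share=0.2, seed=0):
    """Filter OD pairs, z-score the features, and split into train and test."""
    # Exclude OD pairs with zero demand or with HV AADT below 1.
    kept = [dict(r) for r in rows if r["demand"] > 0 and r["hv_aadt"] >= 1]
    # z-score each feature: (x - mean) / standard deviation. Statistics are
    # computed over all kept rows, following the description in the text;
    # fitting them on the training rows only is a common alternative.
    for name in feature_names:
        column = [r[name] for r in kept]
        mean, std = statistics.fmean(column), statistics.pstdev(column)
        for r in kept:
            r[name] = (r[name] - mean) / std if std > 0 else 0.0
    # Shuffle reproducibly, then hold out the test share.
    random.Random(seed).shuffle(kept)
    n_test = round(len(kept) * test_share)
    return kept[n_test:], kept[:n_test]

# Hypothetical OD-pair records.
rows = [{"demand": float(i % 4), "hv_aadt": 0.5 * i,
         "poi_food": 10.0 * i, "population": 1000.0 + 250.0 * i}
        for i in range(12)]

train, test = preprocess(rows, ["poi_food", "population"])
```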

To optimize the performance of the XGBoost model, we conducted an extensive hyperparameter tuning process using grid search. The grid search explored a vast space of 468,750 parameter combinations, including key parameters such as maximum depth, minimum child weight, subsample, column sample by tree, learning rate, alpha (L1 regularization), lambda (L2 regularization), and the number of estimators. This exhaustive search allowed us to identify the best combination of hyperparameters that maximize the model’s predictive accuracy and generalization ability.

The trained XGBoost model was then evaluated on the testing set to assess its performance in predicting HV proportions for unseen OD pairs. To gain deeper insights into the model’s behavior and interpretability, we conducted comprehensive SHAP (SHapley Additive exPlanations) ( 44 ) analysis to understand feature interactions and their impacts on predictions. This included analyzing SHAP summary plots to identify the most influential factors, SHAP interaction values to quantify feature interdependencies, and SHAP dependence plots to visualize complex relationships between features. This combination of techniques provides a more nuanced understanding of how different urban characteristics work together to influence HV proportions.
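For readers unfamiliar with the quantity behind these plots, the sketch below computes exact Shapley values by brute-force enumeration for a small hand-written function. It is an illustration of the definition only: the toy model, its coefficients, and the zero baseline are hypothetical, and the enumeration is not the tree-specific algorithm that SHAP tooling uses for models such as XGBoost.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values; features outside a coalition take baseline values."""
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for s in combinations(others, size):
                total += weight * (value(set(s) | {i}) - value(set(s)))
        phi.append(total)
    return phi

def toy_model(z):
    """Hypothetical HV-proportion model with one interaction term."""
    nightlight, food, retail = z
    return 0.3 + 0.10 * nightlight + 0.05 * food + 0.04 * food * retail

x = [1.5, 2.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
```

The values in `phi` sum to the difference between the prediction at `x` and at the baseline, and the interaction term is shared equally between the two features involved, which is the kind of effect that SHAP interaction values separate out.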

The machine learning-based generalization stage offers several benefits for extending the HV proportion estimates to other regions or time periods. The XGBoost model captures the complex nonlinear relationships between the input features and the target variable, allowing for accurate predictions even in the absence of detailed flow data. Furthermore, the model’s ability to handle large datasets and its robustness to outliers and noise make it suitable for application in diverse urban contexts.

The output of this stage is a fully trained XGBoost model that can predict HV proportions for any OD pair, given the corresponding input features. This model serves as a powerful tool for transportation planners and policymakers, enabling them to assess the impact of different scenarios on the distribution of HVs in the network and make informed decisions concerning infrastructure investments, traffic management strategies, and environmental policies.

Results and Discussion

The efficacy of our multistage approach is demonstrated through rigorous validation and performance metrics. While the OD matrix estimation process using RapidEx involves a nonconvex optimization problem owing to its bi-level structure and genetic algorithm components, several robust mechanisms ensure consistent, high-quality solutions. The algorithm employs multiple random initializations to explore different regions of the solution space, maintains population diversity within the genetic algorithm to prevent premature convergence, and utilizes convergence criteria based on fitness. This approach helps mitigate concerns about local optima and solution uniqueness. While a detailed treatment of local optima and solution uniqueness is beyond the scope of this paper, interested readers are referred to ( 13 ) for a comprehensive discussion.

The method’s reliability is evidenced by its performance metrics, with link flows showing remarkable accuracy, within 4% error on links with available count data. Additionally, travel times align closely with observed values, falling within 10% for 85% of the links. The quadratic programming optimization for proportion estimation, in contrast to the OD estimation stage, is strictly convex owing to its mathematical structure. The objective function, formulated as a sum of squared residuals with strictly positive link-flow coefficients, yields a positive definite Hessian, while all constraints are linear equalities and inequalities within a bounded solution space. This structure guarantees the existence of a unique global optimum for the proportion estimation stage, providing consistent results under identical input conditions.
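The convexity argument can be made explicit under one notational assumption that is not in the original: stack the HV proportions into a vector $p$ and the coefficients $ODL_{i, j, l}$ into a matrix $A$ with one row per observed link and one column per OD pair. Using constraint (2) to substitute $p^{L} = \mathbf{1} - p$, a sketch of the derivation is:

```latex
\begin{aligned}
f(p) &= \lVert F^{H} - A p \rVert^{2} + \lVert F^{L} - A (\mathbf{1} - p) \rVert^{2} \\
\nabla f(p) &= -2 A^{\top} (F^{H} - A p) + 2 A^{\top} \left( F^{L} - A (\mathbf{1} - p) \right) \\
\nabla^{2} f(p) &= 4 A^{\top} A
\end{aligned}
```

The matrix $A^{\top} A$ is always positive semidefinite, and it is positive definite exactly when the columns of $A$ are linearly independent, so the strict convexity stated above corresponds to that rank condition on the link-contribution matrix.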

With these reliable outputs, we applied an XGBoost regression model to capture the complex relationships between these factors and the proportions of HVs. The XGBoost model was optimized through extensive hyperparameter tuning, with the final configuration including 1,200 estimators, a maximum depth of 5, minimum child weight of 4, subsample and column sample by tree both at 0.8, learning rate of 0.06, and regularization parameters alpha and lambda at 0.3 and 9, respectively. This configuration was determined using five-fold cross-validation, with mean squared error (MSE) as the primary selection metric, ensuring a balance between predictive accuracy and model generalizability.
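For reference, the reported configuration can be written with the parameter names of the scikit-learn style `XGBRegressor` wrapper in the `xgboost` package. This is a sketch of how the stated values map onto that interface, not the authors' training script; the variable names are hypothetical, and any setting not listed in the text is left at the library default.

```python
# Final hyperparameters as reported in the text.
final_params = {
    "n_estimators": 1200,
    "max_depth": 5,
    "min_child_weight": 4,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "learning_rate": 0.06,
    "reg_alpha": 0.3,   # alpha, L1 regularization
    "reg_lambda": 9,    # lambda, L2 regularization
}

try:
    from xgboost import XGBRegressor
    model = XGBRegressor(**final_params)
except Exception:
    # xgboost is missing or could not be loaded; the dictionary above still
    # documents the configuration.
    model = None
```

With the library installed, `model.fit(X_train, y_train)` followed by `model.predict(X_test)` would train and apply the regressor, where `X_train`, `y_train`, and `X_test` stand for the feature matrices and HV proportions described in the methodology.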

The final model achieved a training R² of 0.931, test R² of 0.637, and cross-validation R² of 0.566 (standard deviation: 0.015). While these metrics suggest potential overfitting when evaluated on proportions alone, several factors support the model’s practical utility. The cross-validation R²’s low standard deviation (0.015) demonstrates consistent model behavior across different data subsets. Additionally, evaluating the same predictions on a volume basis reveals substantially stronger performance (Figure 2), indicating better accuracy in high-volume scenarios that matter most for practical applications. The model’s mean absolute error (MAE) of 0.137 indicates predictions typically deviate by only 13.7 percentage points, a practically acceptable range for strategic transportation planning applications.

Figure 2.

Model validation through actual versus predicted heavy vehicle (HV) proportions and volumes.

Examining the actual vs predicted plots (Figure 2) highlights the model’s contrasting behavior on proportional and absolute metrics. The proportion scatter (left panel) shows wider dispersion and a clear gap between training (R² = 0.931) and test (R² = 0.637) performance, with the test MAE at 0.140. This plot also exhibits a pronounced vertical band at an actual HV proportion of approximately 0.9. In contrast, the volume-based scatter (right panel) aligns tightly around the 1:1 line, yielding high training (R² = 0.987) and test (R² = 0.913) values for absolute HV AADT.

The aforementioned vertical band in the proportion plot (Figure 2, left panel) stems from approximately 12% of OD pairs that, owing to the upstream data extrapolation model, derive a high HV share (around 0.8–0.9). These cases span a wide range of total AADT (from as low as 1 up to 4,756, with an average of ≈ 237). Although these cases are distributed across volume levels, the model’s performance on HV proportions in the two highest-volume deciles (mean AADT ≈ 323 for Decile 8 and ≈ 846 for Decile 9; see Table 3) particularly affects the global R². On those deciles the model is slightly conservative, with median proportional errors of −0.054 and −0.062, producing decile-level $R^{2}$ values of 0.36 and −0.44. These two deciles therefore pull the global test $R^{2}$ down, whereas the remaining eight deciles register $R^{2}$ between 0.55 and 0.72.

Table 3.

Proportion-Model Performance by Average Annual Daily Traffic (AADT) Decile (Only Test Set)

Decile	M AADT	$R^{2}$	MAE	Median error
0	2.2	0.55	0.18	0.018
1	4.7	0.55	0.13	0.052
2	10.3	0.72	0.14	0.007
3	21.5	0.66	0.12	0.023
4	35.3	0.61	0.14	0.026
5	61.5	0.59	0.15	0.008
6	109.7	0.55	0.15	0.017
7	212.0	0.64	0.09	−0.018
8	323.1	0.36	0.17	−0.054
9	845.6	−0.44	0.13	−0.062

Note: MAE = mean absolute error.

To evaluate the practical implications of these proportion predictions, HV counts were derived by multiplying the model’s predicted HV proportions by the total AADT for each OD pair. An R² analysis on these derived HV counts (Figure 2, right panel) yields a robust test R² of 0.913. This confirms that, despite the nuances in proportion prediction, the model’s underlying proportion estimates, when scaled by volume, lead to HV count estimations that correspond well with actual counts across the network, including on the high-volume links. The dual evaluation therefore isolates a narrowly confined regime (very high AADT corridors, particularly those with inherently high HV shares) that could benefit from a dedicated sub-model or refined approach for HV proportion prediction in future work. At the same time, it demonstrates that the present model already provides high practical value for deriving HV counts network-wide and for understanding HV proportions across the remaining 90% of the network segments.
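The two evaluation views can be reproduced mechanically as follows. The sketch uses hypothetical proportions and AADT values, not the study's data, and shows only how the same predictions are scored once as proportions and once as derived HV counts.

```python
def r2_score(actual, predicted):
    """Coefficient of determination; negative when worse than predicting the mean."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical OD pairs: total AADT, actual and predicted HV proportions.
total_aadt = [5.0, 20.0, 60.0, 200.0, 800.0, 3000.0]
actual_prop = [0.20, 0.50, 0.35, 0.60, 0.30, 0.45]
pred_prop = [0.30, 0.38, 0.45, 0.48, 0.38, 0.40]

# HV counts are the proportions scaled by each OD pair's total AADT.
actual_hv = [p * v for p, v in zip(actual_prop, total_aadt)]
pred_hv = [p * v for p, v in zip(pred_prop, total_aadt)]

r2_prop = r2_score(actual_prop, pred_prop)
r2_volume = r2_score(actual_hv, pred_hv)
```

In this toy data the volume-based score is higher than the proportion-based one because the spread of total AADT across OD pairs dominates the variance of the counts. The same definition also shows how a subset of the data, such as a single decile, can yield a negative value when the residuals exceed the variation of the actual values within that subset.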

SHAP analysis identifies the main determinants of HV proportions across the urban network. The beeswarm visualization (Figure 3) shows that nightlight_intensity_Destination has the largest influence, with SHAP values ranging from −0.2 to 0.12, indicating a substantial role in both increasing and decreasing HV proportions. This bidirectional effect suggests that urban nighttime activity is a key differentiator of the HV/LV composition of traffic. The feature food_Destination shows a notably different pattern, with SHAP values concentrated between −0.1 and 0.06 and distinct clusters that indicate threshold effects in how food-related destinations influence the presence of HVs in traffic flows.

Figure 3.

SHapley Additive exPlanations (SHAP) value distribution for individual features.

The interaction analysis (Figure 4) and hierarchical clustering (Figure 5) reveal complementary aspects of the relationships among urban features. Features cluster into three distinct groups: commercial/service destinations (food, medical, transit, services, retail, employment), urban form characteristics (area, education, population), and origin features. Their interactions nonetheless show complex interdependencies. Notably, nightlight_intensity_Destination stands alone in the clustering and has the highest influence, yet it shows the strongest interaction with Area_Destination (interaction strength = 0.007686). This suggests that, while nightlight intensity captures unique aspects of urban activity, its effect on HV patterns is modulated by spatial scale. Origin features, which form a separate cluster with consistently lower importance, also show weaker interactions with all other features.

Figure 4.

SHapley Additive exPlanations (SHAP) interaction matrix.

Figure 5.

Feature importance with hierarchical clustering.
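
For readers unfamiliar with the quantities plotted in Figures 3 to 5, the following sketch computes exact Shapley values and pairwise Shapley interaction values by brute-force enumeration for a three-feature toy function. It is an illustration under stated assumptions, not the procedure used in this study: the toy function `f`, the all-zero baseline, and the helper names are hypothetical, and practical SHAP analyses of tree ensembles rely on much faster tree-specific algorithms rather than enumeration. The off-diagonal interaction value follows the Shapley interaction index, with each pairwise effect split equally between the two features.

```python
from itertools import combinations
from math import factorial

def f(x):
    """Toy model (hypothetical): additive terms plus one x0*x2 interaction."""
    return x[0] + 2 * x[1] + x[0] * x[2]

def value(subset, x, baseline):
    """Model output when only the features in `subset` take their value from x."""
    return f([x[i] if i in subset else baseline[i] for i in range(len(x))])

def subsets(items):
    """Yield every subset of `items` as a tuple."""
    for k in range(len(items) + 1):
        yield from combinations(items, k)

def shapley(i, x, baseline):
    """Exact Shapley value of feature i by enumerating all coalitions."""
    m = len(x)
    others = [j for j in range(m) if j != i]
    total = 0.0
    for s in subsets(others):
        s = set(s)
        w = factorial(len(s)) * factorial(m - len(s) - 1) / factorial(m)
        total += w * (value(s | {i}, x, baseline) - value(s, x, baseline))
    return total

def interaction(i, j, x, baseline):
    """Pairwise Shapley interaction value for features i != j."""
    m = len(x)
    others = [k for k in range(m) if k not in (i, j)]
    total = 0.0
    for s in subsets(others):
        s = set(s)
        w = factorial(len(s)) * factorial(m - len(s) - 2) / (2 * factorial(m - 1))
        delta = (value(s | {i, j}, x, baseline) - value(s | {i}, x, baseline)
                 - value(s | {j}, x, baseline) + value(s, x, baseline))
        total += w * delta
    return total

x, baseline = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
phi = [shapley(i, x, baseline) for i in range(3)]
pairs = {(i, j): interaction(i, j, x, baseline) for i, j in combinations(range(3), 2)}
print(phi, pairs)
```

Here the attributions are 1.5, 2.0, and 0.5, which sum to f(x) − f(baseline) = 4. The only non-zero pairwise interaction is between features 0 and 2 (a value of 0.5), matching the single product term in the toy function.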

The dependence analysis (Figure 6) uncovers more nuanced relationships for these key features. For nightlight_intensity_Destination, there is a clear regime shift at intensity values of around 1.0, where the SHAP values change from predominantly negative to positive, suggesting increased HV presence in well-lit areas (potentially those with higher economic activity). For food destinations, the relationship exhibits a distinct threshold pattern: SHAP values increase sharply for food destination values between −1 and 2 and then plateau. The retail destination interaction (shown by color) indicates that areas with high retail presence (red points) systematically amplify the positive effect of food destinations on HV proportions, suggesting that commercial clusters are associated with higher proportions of HV traffic in the overall traffic mix.

Figure 6.

SHapley Additive exPlanations (SHAP) dependence plots showing (a) how the impact of nightlight_intensity_Destination varies with its feature value, colored by Population_Destination, and (b) how the impact of food_Destination varies with its feature value, colored by retail_Destination.
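
A regime shift of the kind read from Figure 6(a) can also be located numerically rather than by eye. The sketch below is one simple possibility, assuming paired lists of feature values and SHAP values are available; the function name `sign_crossover` and the toy data are invented for illustration and are not taken from the study. It scans candidate thresholds and keeps the one that best separates negative from non-negative SHAP values.

```python
def sign_crossover(feature_values, shap_values):
    """Find the threshold t such that 'SHAP < 0 iff feature < t' holds most often.

    Returns (t, share of points consistent with that rule).
    """
    pairs = sorted(zip(feature_values, shap_values))
    best_t, best_hits = None, -1
    for k in range(1, len(pairs)):
        # Candidate threshold: midpoint between neighbouring feature values.
        t = (pairs[k - 1][0] + pairs[k][0]) / 2
        hits = sum((s < 0) == (v < t) for v, s in pairs)
        if hits > best_hits:
            best_t, best_hits = t, hits
    return best_t, best_hits / len(pairs)

# Invented toy data: SHAP values change sign between feature values 0.8 and 1.2.
feature = [-0.5, 0.0, 0.4, 0.8, 1.2, 1.6, 2.0, 2.5]
shap_vals = [-0.15, -0.12, -0.08, -0.02, 0.03, 0.06, 0.09, 0.11]

threshold, agreement = sign_crossover(feature, shap_vals)
print(threshold, agreement)
```

On real dependence data the agreement share would typically fall below 1, and its value gives a rough indication of how sharp the transition is.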

Conclusion

This study introduces an innovative approach to decomposing urban traffic flows into LV and HV categories, addressing a critical gap in transportation network analysis. Our three-stage methodology, combining OD matrix estimation, quadratic programming optimization, and XGBoost regression, provides a robust framework for estimating HV and LV proportions across Greater Sydney’s road network. By leveraging diverse data sources, we have implemented a comprehensive model that achieves a test R² of 0.637, demonstrating its effectiveness in capturing complex urban mobility patterns.

Our model performs particularly well in predicting HV proportions in the 0.2–0.6 range, where the majority of urban freight movements occur; the observed distribution of HV proportions is bimodal, with peaks at 0.3 and 0.9. This pattern reflects the natural distribution of HV traffic in urban environments, with higher proportions being characteristic of specialized zones. The SHAP interaction analysis reveals relationships between urban features and HV proportions that go beyond simple feature importance rankings. While nightlight_intensity_Destination emerged as the most influential feature, the analysis uncovered coupling effects between urban characteristics, the strongest being between nightlight intensity and area at destinations (interaction strength = 0.007686). This interaction indicates that the impact of economic activity on HV traffic varies with the spatial scale of development: similar economic activities might generate different HV patterns depending on the spatial configuration of the destination zone. For urban planners and policymakers, these interaction effects imply that the effectiveness of infrastructure improvements or policy interventions may depend significantly on the broader urban context in which they are implemented. Such insights are particularly valuable for developing targeted strategies for different urban contexts, from compact city centers to sprawling industrial areas.

The methodological framework presented in this study, leveraging increasingly accessible global data types (road network data, Points of Interest, satellite imagery), holds considerable promise for broader application. Expanding this adaptable approach to diverse global urban contexts, particularly rapidly developing cities with limited specialized HV data availability but growing freight demands, is a key future direction. However, while the methodology is designed for broad applicability, it is crucial to acknowledge that the model parameters and feature correlations learned from Greater Sydney may necessitate local recalibration or fine-tuning when applied to cities with markedly different geographical, economic, or social characteristics. Therefore, a primary area for future research is the rigorous investigation and testing of model transferability. This involves applying the framework to other urban centers, especially in developing countries; such application would require thorough validation and likely adaptation using local data to capture unique regional attributes and ensure optimal predictive performance. This process will be vital for establishing the framework’s broader utility and refining its adaptability across different urban contexts.

Further research will also address other identified limitations, such as improving predictions for high-HV proportion scenarios (>0.8) where the current model shows increased uncertainty. Developing specialized sub-models tailored to this specific regime, or incorporating more detailed data on factors influencing HV movements within such high-volume, high-share corridors (such as industrial activities, specific zoning regulations, or dedicated HV infrastructure), could enhance predictive performance. Additionally, extending the model to capture dynamic, time-dependent variations in HV proportions would enhance its practical value. By incorporating socio-economic factors through these advancements, and by developing associated policy recommendations, this continuously refined approach can become an even more powerful tool for comprehensive urban planning. Such a tool would help bridge the gap between transportation efficiency and social equity in our evolving smart cities, enabling Mobility as a Resource (MaaR) concepts and supporting decision-makers in optimizing urban mobility solutions that are efficient, socially inclusive, and environmentally sustainable.

Footnotes

Acknowledgements

The authors extend their gratitude to Anna Sotnikova for her assistance in matching observed flows to the exact links, which significantly contributed to the quality of this research.

Author Contributions

The authors confirm contribution to the paper as follows: study conception and design: PF, SRK, STW; data collection: PF, SRK; analysis and interpretation of results: PF, SRK, DYL, STW; draft manuscript preparation: PF, SRK, DYL, STW. All authors reviewed the results and approved the final version of the manuscript.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

ORCID iDs

Patrick Fernandez

Surendra Reddy Kancharla

Dung-Ying Lin

S. Travis Waller

Decomposing Urban Traffic Flows: A Multistage Approach to Model Heavy Vehicle Movements in Greater Sydney
