Abstract
This study presents an innovative multistage methodology for decomposing urban traffic flows into light vehicle (LV) and heavy vehicle (HV) categories, addressing a critical gap in transportation network analysis. Utilizing data from Greater Sydney’s road network, we develop a comprehensive approach comprising three main stages: origin–destination (OD) matrix estimation using RapidEx, quadratic programming optimization for HV/LV proportion estimation, and XGBoost regression for generalization. Our analysis examines associations between HV proportions and urban characteristics, including points of interest (POI), nightlight intensity, and zonal attributes. The XGBoost model achieves a test R2 of 0.637, demonstrating strong predictive power for real-world applications. Through SHAP (SHapley Additive exPlanations) analysis, we uncover complex nonlinear relationships between nightlight intensity and HV proportions, with significant interaction effects between urban features. The model performs particularly well in predicting common urban HV proportion ranges (0.2–0.6), reflecting typical urban traffic compositions. These findings provide valuable insights for urban planning and policy development, especially in contexts where detailed vehicle classification data are limited.
Introduction and Background
Urban transportation networks are complex systems that significantly influence economic efficiency, environmental sustainability, and quality of life in cities worldwide. These networks are shaped by diverse vehicle types and their movement patterns, with a crucial distinction between heavy vehicles (HV) and light vehicles (LV). HVs, encompassing trucks, buses, and coaches, constitute a significant portion of total traffic volume. Trucks usually account for up to 15% of traffic, while buses may represent approximately 3% ( 1 ). However, these figures exhibit substantial variability across different regions, influenced by factors such as local industrial activity, transportation infrastructure, and the extent of public transit systems.
The ability to distinguish between LV and HV movements is crucial for several reasons. It allows for a more detailed understanding of urban logistics and commercial activities, as HVs are often indicative of goods movement and industrial operations ( 2 , 3 ). It also provides insights into public transportation patterns as buses belong to the HV category. This decomposition by vehicle type enables more targeted approaches to traffic management, infrastructure planning, and policy-making.
HVs, while constituting a small proportion of total traffic, contribute disproportionately to various urban challenges. Commercial vehicles significantly affect traffic congestion and emissions ( 4 , 5 ), and are responsible for a disproportionate share of road maintenance burden ( 6 ). This underscores the need for accurate identification and disaggregation of HV movements to generate effective policies and manage their impact on urban areas.
However, existing approaches to distinguish between vehicle types have significant limitations. The most traditional approach relies on manual classification surveys, which, while accurate, are inherently resource-intensive and limited in spatial coverage ( 7 ). Fixed sensor networks, particularly weigh-in-motion systems, provide highly accurate data but only at specific locations within the network ( 8 ). Vehicle classification algorithms utilizing inductive loop detectors have shown promise but require substantial infrastructure investment and may experience varying levels of accuracy, depending on vehicle type ( 9 , 10 ). Video analytics and computer vision systems ( 11 ) offer promising capabilities but grapple with the challenges of scalability, camera coverage requirements, processing limitations, and performance under varying weather conditions. Global Positioning System (GPS) tracking of commercial fleets captures valuable movement data but represents only a subset of overall HV movements ( 12 ).
Recent advancements in transportation modeling tools, such as RapidEx ( 13 ), have created new opportunities to address these research gaps. By leveraging travel time data from sources such as Google or TomTom, in conjunction with road network information from OpenStreetMap, RapidEx can estimate potential flow at every link in a network, achieving travel times that closely align with observed data. This capability, combined with the tool’s ability to generate origin–destination (OD) patterns consistent with these flows, provides an opportunity for fine-grained traffic flow disaggregation.
However, the application of such advanced tools often faces challenges owing to the lack of detailed, decomposed data needed to generate accurate OD patterns for all links in a network. Even in developed nations, the availability of such data frequently lacks the necessary granularity and network coverage ( 14 ). While sensor deployment for obtaining decomposed flow data is increasing in some areas, it is not yet commonplace worldwide. This data scarcity highlights the need for models that can be developed in data-rich areas and then applied to other contexts where specific, detailed transportation data is limited.
To address this challenge, understanding the relationship between transportation and economic activity becomes crucial. This relationship is fundamental for urban planning and policy-making ( 2 , 3 ), as it allows for the creation of transferable models that can estimate traffic patterns based on more commonly available economic indicators. While this applies to all types of vehicle movements, it is particularly relevant for HVs, which include both freight transport and public transit. The movement of HVs, especially in freight transport, plays a vital role in bridging the gap between production and consumption locations. This necessitates the development of models that integrate economic activity, logistics decision-making, and traffic flows ( 15 , 16 ). By focusing on these relationships, we can create more comprehensive models that account for both LV and HV movements, providing a better understanding of urban transportation dynamics.
In recent years, there has been increased interest in freight transport research, driven by the need to better understand freight flows and volumes within transportation networks ( 16 , 17 ). However, obtaining reliable freight trip generation data remains challenging, especially where large-scale freight transport surveys are infrequent ( 18 , 19 ). To address these limitations, researchers have explored various proxies for economic activity in trip generation models ( 20 – 23 ). Among these proxies, point of interest (POI) data and nighttime light intensity data have gained significant attention as indicators of economic activity ( 22 , 23 ). POI data, representing entities with geolocation information, has shown strong potential in transport research ( 24 , 25 ). These studies demonstrate the effectiveness of such proxies in estimating patterns of socio-economic change and transport demand at various geographical scales ( 26 ).
To fully capitalize on these advancements and explore relationships between vehicle movement patterns and urban socio-economic indicators, machine learning techniques have emerged as a promising approach. In particular, XGBoost has shown impressive results in traffic prediction tasks ( 27 – 29 ). Its ability to handle complex nonlinear relationships and robustness to overfitting makes it particularly suitable for transportation data analysis, enabling the development of models that link socio-demographic factors and other proxies to vehicle type splits.
Our research leverages these techniques to develop an innovative approach for decomposing urban traffic flows into LV and HV categories. Focusing on Greater Sydney’s road network, we aim to identify relationships between vehicle movement patterns and urban socio-economic indicators. This study is motivated by the need to provide urban planners, policymakers, and transportation engineers with a more comprehensive understanding of the factors influencing HV and LV movements in urban areas. The significance of this research extends beyond the immediate case study. By developing a machine learning-based regression model that links socio-demographic factors and other proxies to vehicle type splits, we create a framework that can be generalized to other regions or cities. Our approach addresses the scarcity of observed link-level decomposed flow data and offers a scalable solution for urban transportation analysis on a global scale, moving beyond traditional survey-based methods to provide a more accurate and transferable understanding of urban transportation patterns and their relationship to economic activities.
Major contributions of our study are as follows:
We develop a novel multistage methodology that integrates advanced network modeling, optimization techniques, and machine learning to disaggregate traffic flows.
We provide a comprehensive analysis of the factors influencing HV proportions in urban areas, offering a nuanced view of the complex dynamics governing urban freight movement.
We establish a generalizable framework for predicting vehicle type distributions across entire urban networks, enabling more accurate forecasting and scenario analysis for urban planning in diverse contexts.
By bridging the gap between aggregated traffic data and disaggregated vehicle flows, this research aims to enhance our understanding of urban transportation dynamics and provide practical tools for evidence-based decision making. The insights gained from this study have the potential to make a noticeable difference toward improving urban mobility efficiency, reducing congestion, and supporting more sustainable urban development practices across various urban environments.
The remainder of this paper is organized as follows: the second section describes different data sources used. The third section describes detailed methodology. The fourth section presents results and discusses their implications for urban planning and transportation policy. Finally, the fifth section concludes with a summary of our findings and suggestions for future research directions.
Data
The study leverages three comprehensive datasets that capture urban activity, nightlight intensity, and traffic volumes in Sydney, Australia. These datasets provide a rich and diverse set of information, enabling a thorough analysis of the relationship between urban characteristics and the composition of traffic flows.
The first dataset, POI, is sourced from SafeGraph, a leading provider of geospatial data. SafeGraph’s POI data is widely used in academic research ( 30 – 32 ), government analyses, and business applications, owing to its comprehensiveness and accuracy. The dataset provides detailed spatial coordinates and categorical classifications for business and service locations throughout the Sydney region. For the year 2023, the POI data categorize urban amenities into seven distinct groups: education, employment, food, medical, retail, services, and transit (see Table 1). This granular classification enables detailed analysis of spatial patterns in urban activities and their potential influence on traffic composition.
Table 1. Description of Key Variables in the Analysis
The second dataset utilizes the Visible Infrared Imaging Radiometer Suite (VIIRS) Day/Night Band (DNB) nighttime lights data, maintained by the Earth Observation Group ( 33 ). This data source has emerged as the gold standard for economic studies utilizing night-lights ( 34 – 37 ), offering superior resolution and accuracy compared with previous satellite-based light measurements. We employed the annual VIIRS Nighttime Lights (VNL) V2 ( 33 ) average radiance composite data for 2023, which provides cloud-free average radiance values. The VNL V2 product features enhanced filtering of aurora, stray light, and ephemeral lights, making it particularly suitable for urban analysis. The 2023 dataset was selected to maintain temporal consistency with other data sources and ensure the most current representation of nighttime light patterns in the study area.
The third dataset comprises traffic volume data from Transport for New South Wales (2023). The authority’s Traffic Volume Viewer provides disaggregated traffic flow data for specific road network links, distinguishing between HV and LV ( 38 ). This dataset’s reliability and relevance for traffic analysis have been demonstrated in previous research ( 39 ).
The summary statistics presented in Table 2 cover all 158 zones within the study area. For each zone, we aggregated POI counts by category to capture the distribution of commercial and service activities. Population and area measurements provide demographic and spatial context, while nightlight intensity measurements serve as a proxy for economic activity levels. This combination of variables enables comprehensive analysis of the relationships between urban form, economic activity, and traffic patterns.
Table 2. Summary Statistics of Key Variables
Note: POI = point of interest; SD = standard deviation.
The majority of the data used in this study is available for most parts of the world, either in the public domain or through API access from the sources. This includes network data from OpenStreetMap (OSM), travel time data from Google, TomTom, or Mapbox, nightlight data from NASA’s VIIRS DNB, and POI data from SafeGraph, OSM, and Google. The only specialized data is the decomposed vehicular split in the network, which is not readily available worldwide. This data scarcity is the primary motivation for our research: to build a machine learning model that can be applied in regions where such specialized data may not be accessible, by leveraging the publicly available data sources mentioned above.
Methodology
This study presents a novel multistage approach to estimate the proportions of HVs and LVs in OD matrices for transportation network analysis. Our methodology integrates advanced network modeling techniques, optimization algorithms, and machine learning to provide a comprehensive solution for disaggregating vehicle types in OD flows. The approach consists of three stages: OD matrix estimation, optimization-based proportion estimation, and machine learning-based generalization.
OD Matrix Estimation
The first stage of our methodology estimates the OD matrix for the study area using RapidEx ( 13 ), a network modeling tool that extracts road network information from OpenStreetMap and has been successfully applied to similar road networks ( 40 ). To manage computational complexity while maintaining essential traffic corridors, we focused on motorways, trunk roads, and primary roads. This network simplification strategy allows for efficient processing without compromising the integrity of major HV and LV flow patterns. Travel time data for each link were collected using the Google Maps API for a representative weekday morning peak period (June 6, 2024, 7:00−9:00 a.m.). This time window was selected to capture peak traffic conditions. Additionally, we gathered population data from WorldPOP ( 41 ) and categorized POI information to inform the demand estimation process.
Figure 1 illustrates the road network of Greater Sydney, color-coded as motorways (dark blue), trunk roads (cyan), and primary roads (brown). Secondary and tertiary roads are omitted, as the selected network captures the majority of HV and LV flows. The overlay shows the 158 zones, which are based on the Uber H3 hexagonal hierarchical spatial index. These zones start at resolution 6 (approximately 36 km² per hexagon) and are further subdivided up to resolution 8 (approximately 0.9 km² per hexagon) based on POI density.

Figure 1. Road network and zones of Greater Sydney. (Color online only.)
The RapidEx tool employs a bi-level approach for OD matrix estimation. In the upper level, OD demand is initially estimated using a line search for total demand ranging from 5,000 to 1,500,000 vehicles, utilizing population and POI data proportions to distribute demand among OD pairs. The lower level consists of a traffic assignment problem, determining link-level flows and travel times based on the current OD estimate. A genetic algorithm then operates at the upper level, adjusting the OD demand based on a fitness function that measures the match between estimated and observed traffic flows or travel times at the lower level. This bi-level problem is solved iteratively until a specified convergence criterion is met. Note that the OD demand estimated by RapidEx is expressed in passenger car units (PCU), a standardized measure of traffic flow that accounts for differing vehicle characteristics. For a fuller description of the methodology and its implementation, readers are referred to ( 13 ).
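As a rough illustration of the upper-level starting point, the sketch below distributes one candidate total demand across OD pairs. The weighting rule (origin population times destination POI count) and all zone names and numbers are assumptions made for this example; the text states only that population and POI proportions are used, and the actual rule is described in ( 13 ).

```python
def distribute_demand(total_demand, population, poi_count):
    """Split a candidate total demand across OD pairs.

    Hypothetical rule: the weight of pair (i, j) is the population of
    origin i times the POI count of destination j; intrazonal pairs are
    skipped and the weights are normalised to sum to one.
    """
    weights = {
        (i, j): population[i] * poi_count[j]
        for i in population
        for j in poi_count
        if i != j
    }
    weight_sum = sum(weights.values())
    return {od: total_demand * w / weight_sum for od, w in weights.items()}


# Made-up zones, purely for illustration.
population = {"A": 12000, "B": 8000, "C": 5000}
poi_count = {"A": 40, "B": 100, "C": 60}

# One candidate from the 5,000 to 1,500,000 line-search range.
od_matrix = distribute_demand(50_000, population, poi_count)
for od, trips in sorted(od_matrix.items()):
    print(od, round(trips, 1))
```

In a line search, this distribution would be repeated for each candidate total, with the traffic assignment step scoring each candidate.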
Optimization-Based Proportion Estimation
The second stage of our methodology estimates the proportions of HVs and LVs for each OD pair using an optimization-based approach. The OD demand estimated by RapidEx is expressed in PCU, and the contribution of each OD pair to each link is known from the traffic assignment step. Because travel times depend only on the PCU flow, changing the HV/LV mix of an OD pair leaves travel times unchanged as long as its PCU demand is held constant. Many HV and LV combinations satisfy the same PCU demand, so the split cannot be determined from the PCU values alone.
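A small worked example shows why the PCU demand alone does not pin down the vehicle mix. The PCU equivalence factor of 2.5 for an HV is a hypothetical value chosen for illustration; the factor actually used is not stated in this section.

```python
HV_PCU_FACTOR = 2.5  # hypothetical: one HV counts as 2.5 PCU, one LV as 1 PCU


def vehicle_split(pcu_demand, hv_proportion, hv_pcu=HV_PCU_FACTOR):
    """Return (lv_count, hv_count) whose PCU total equals pcu_demand.

    hv_proportion is the HV share of the vehicle count n. Since
    pcu = lv + hv_pcu * hv with hv = p * n and lv = (1 - p) * n,
    the vehicle count is n = pcu / (1 + (hv_pcu - 1) * p).
    """
    total_vehicles = pcu_demand / (1 + (hv_pcu - 1) * hv_proportion)
    hv = hv_proportion * total_vehicles
    lv = total_vehicles - hv
    return lv, hv


# Three different mixes, all consistent with the same 1,000 PCU demand.
for p in (0.0, 0.2, 0.5):
    lv, hv = vehicle_split(1000, p)
    print(f"HV share {p:.1f}: LV={lv:.1f}, HV={hv:.1f}, PCU={lv + HV_PCU_FACTOR * hv:.1f}")
```

Every mix reproduces the same PCU total, which is why observed HV and LV link counts are needed to select one of them.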
The optimization model considers all OD pairs and the observed flows ( 38 ) at a few links to identify the optimal and unique proportions. By leveraging the OD matrix from the previous stage and the contribution of each OD pair to link flows, we formulated a quadratic programming optimization model to determine HV and LV proportions at the OD level. Since the PCU demand is constant and travel times are not affected by these proportions, the traffic assignment problem does not need to be re-solved at every iteration, which makes the optimization process extremely fast.
The optimization problem is defined as follows:
Objective Function:
Subject to:
where
To solve this optimization problem, we employed Gurobi 11, a commercial optimization solver known for its efficiency in handling large-scale quadratic programming problems. To ensure a high-quality solution, a relative optimality gap of 1e-6 is set as the stopping criterion.
The output of this stage is a set of HV and LV proportions for each OD pair in the study area. These proportions provide a detailed understanding of the composition of traffic flows between zones, enabling more accurate modeling and analysis of HV and LV movements in the network.
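To make the structure of this stage concrete, the following sketch solves a tiny bounded least-squares problem of the kind described: squared residuals between modeled and observed HV and LV link flows, with each OD pair's PCU demand held fixed. It is an illustrative assumption, not the formulation used in the study. The decision variable (HV count per OD pair), the PCU factor of 2.5, the link-usage matrix, and all numbers are hypothetical, and projected gradient descent stands in for the Gurobi solver.

```python
HV_PCU = 2.5  # hypothetical PCU equivalent of one heavy vehicle

# Hypothetical inputs: A[l][k] is the share of OD pair k's flow that uses
# link l, d[k] is the PCU demand of OD pair k, and hv_obs / lv_obs are the
# observed HV and LV counts on each link.
A = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0],
     [1.0, 1.0, 0.0]]
d = [100.0, 150.0, 60.0]
hv_obs = [15.0, 25.0, 30.0]
lv_obs = [122.5, 147.5, 175.0]


def matvec(M, v):
    """Compute M v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def rmatvec(M, v):
    """Compute M^T v."""
    return [sum(M[l][k] * v[l] for l in range(len(M))) for k in range(len(M[0]))]


def solve_hv(A, d, hv_obs, lv_obs, e=HV_PCU, iters=2000):
    """Minimise ||A h - hv_obs||^2 + ||A (d - e h) - lv_obs||^2
    subject to 0 <= h_k <= d_k / e, using projected gradient descent."""
    upper = [dk / e for dk in d]
    # Step 1/L, where L = 2 (1 + e^2) ||A||_F^2 bounds the largest
    # eigenvalue of the Hessian 2 (1 + e^2) A^T A.
    frob_sq = sum(a * a for row in A for a in row)
    step = 1.0 / (2.0 * (1.0 + e * e) * frob_sq)
    h = [0.0] * len(d)
    for _ in range(iters):
        r_hv = [m - o for m, o in zip(matvec(A, h), hv_obs)]
        lv_od = [dk - e * hk for dk, hk in zip(d, h)]
        r_lv = [m - o for m, o in zip(matvec(A, lv_od), lv_obs)]
        grad = [2.0 * gh - 2.0 * e * gl
                for gh, gl in zip(rmatvec(A, r_hv), rmatvec(A, r_lv))]
        # Gradient step followed by projection onto the box constraints.
        h = [min(max(hk - step * gk, 0.0), uk)
             for hk, gk, uk in zip(h, grad, upper)]
    return h


hv = solve_hv(A, d, hv_obs, lv_obs)
lv = [dk - HV_PCU * hk for dk, hk in zip(d, hv)]
hv_share = [hk / (hk + lk) for hk, lk in zip(hv, lv)]
print([round(s, 3) for s in hv_share])
```

Because the OD-to-link shares in `A` are fixed inputs, no traffic assignment is re-run inside the loop, which mirrors the reason given above for the speed of this stage.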
Machine Learning-Based Generalization
The final stage of our methodology focuses on developing a machine learning model to generalize the relationship between various socio-economic, land-use, and demographic factors and the proportions of HVs in OD flows. While the previous stage provides exact proportions at the OD pair level for the study area, this stage aims to build a model that can be applied to other regions or time periods where detailed data may not be available. Among various machine learning approaches considered, XGBoost ( 42 ) emerged as the most suitable choice owing to its ability to handle complex nonlinear relationships and its robust performance with transportation data. The proportion of LVs can be easily derived as the complement of the HV proportion (i.e., 1 - HV proportion).
To train the XGBoost model, we engineered a comprehensive set of features that capture the factors influencing HV and LV proportions. These features included POI data and their sub-categories (employment centers, educational institutions, food outlets, transit hubs, retail establishments, medical facilities, and service-related businesses), which serve as proxies for land-use patterns and economic activity; nightlight intensity data, an additional proxy for economic activity ( 43 ), capturing the intensity of human settlements and industrial areas; population data, representing the distribution of residential areas and potential trip generation/attraction zones; and zonal area, accounting for the size and spatial extent of each zone. These features were compiled for both origin and destination zones of each OD pair, providing a rich set of explanatory variables for the XGBoost model.
To improve model quality and reduce noise in the dataset, we applied several preprocessing steps. OD pairs with zero demand and those with an average annual daily traffic (AADT) for HVs less than 1 were excluded. All features underwent z-score normalization to prevent bias from high-magnitude variables. The dataset was partitioned into an 80% training set and a 20% testing set.
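The preprocessing steps above can be sketched with the standard library alone. The field names (`demand`, `hv_aadt`, `population`, `area`), the random seed, and the sample rows are hypothetical. The sketch also assumes that normalization statistics are computed over all retained rows, which the text does not specify; fitting them on the training rows only would be the stricter alternative.

```python
import random
import statistics


def zscore(values):
    """Standardise values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def preprocess(rows, feature_names, test_fraction=0.2, seed=0):
    # Exclude OD pairs with zero demand or with HV AADT below 1.
    kept = [dict(r) for r in rows if r["demand"] > 0 and r["hv_aadt"] >= 1]
    # z-score each feature column (assumption: statistics from all kept rows).
    for name in feature_names:
        for r, z in zip(kept, zscore([r[name] for r in kept])):
            r[name] = z
    # Shuffle reproducibly, then split 80/20.
    random.Random(seed).shuffle(kept)
    n_test = int(round(len(kept) * test_fraction))
    return kept[n_test:], kept[:n_test]


# Made-up OD records, purely for illustration.
rows = [
    {"demand": float(i), "hv_aadt": 0.5 * i,
     "population": 1000.0 * i, "area": 36.0 / (i + 1)}
    for i in range(12)
]
train, test = preprocess(rows, ["population", "area"])
print(len(train), len(test))
```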
To optimize the performance of the XGBoost model, we conducted an extensive hyperparameter tuning process using grid search. The grid search explored a space of 468,750 parameter combinations, covering maximum depth, minimum child weight, subsample, column sample by tree, learning rate, alpha (L1 regularization), lambda (L2 regularization), and the number of estimators. This exhaustive search allowed us to identify the combination of hyperparameters that maximizes the model’s predictive accuracy and generalization ability.
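The size of such a grid is the product of the number of candidate values per parameter. The sketch below shows one grid shape that yields exactly 468,750 combinations (seven parameters with five values each and one with six). The candidate values are hypothetical, since the grid itself is not reported here; the keys follow the parameter names of XGBoost's scikit-learn interface.

```python
import itertools
import math

# Hypothetical candidate values. Only the eight parameters and the total of
# 468,750 combinations are taken from the text.
param_grid = {
    "max_depth": [3, 4, 5, 6, 7],
    "min_child_weight": [1, 2, 3, 4, 5],
    "subsample": [0.6, 0.7, 0.8, 0.9, 1.0],
    "colsample_bytree": [0.6, 0.7, 0.8, 0.9, 1.0],
    "learning_rate": [0.02, 0.04, 0.06, 0.08, 0.10],
    "reg_alpha": [0.0, 0.1, 0.3, 0.5, 1.0],
    "reg_lambda": [1, 3, 5, 7, 9],
    "n_estimators": [200, 400, 600, 800, 1000, 1200],
}

# 5**7 * 6 combinations in total.
n_combinations = math.prod(len(values) for values in param_grid.values())


def iter_grid(grid):
    """Lazily yield one parameter dictionary per grid point."""
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


first = next(iter_grid(param_grid))
print(n_combinations, first)
```

Each yielded dictionary could be passed to a model constructor and scored by cross-validation; iterating lazily avoids materializing the full grid in memory.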
The trained XGBoost model was then evaluated on the testing set to assess its performance in predicting HV proportions for unseen OD pairs. To gain deeper insights into the model’s behavior and interpretability, we conducted comprehensive SHAP (SHapley Additive exPlanations) ( 44 ) analysis to understand feature interactions and their impacts on predictions. This included analyzing SHAP summary plots to identify the most influential factors, SHAP interaction values to quantify feature interdependencies, and SHAP dependence plots to visualize complex relationships between features. This combination of techniques provides a more nuanced understanding of how different urban characteristics work together to influence HV proportions.
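The quantity that SHAP attributes to each feature is a Shapley value. The sketch below computes exact Shapley values by brute-force enumeration for a three-feature toy function. It is not the tree-specific algorithm that SHAP libraries use for XGBoost models; the toy model, its coefficients, and the zero baseline are hypothetical, and "absent" features are simply replaced by baseline values rather than averaged over background data.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline)."""
    n = len(x)

    def value(subset):
        # Features in `subset` take their actual value, the rest the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(z):
    """Hypothetical HV-proportion model with one food x retail interaction."""
    nightlight, food, retail = z
    return 0.3 + 0.1 * nightlight + 0.05 * food * retail


x = [1.5, 2.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print([round(p, 4) for p in phi])
```

The attributions sum to the difference between the prediction and the baseline prediction, and the interaction term is shared equally between the two features involved, which is the property that SHAP interaction values separate out.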
The machine learning-based generalization stage offers several benefits for extending the HV proportion estimates to other regions or time periods. The XGBoost model captures the complex nonlinear relationships between the input features and the target variable, allowing for accurate predictions even in the absence of detailed flow data. Furthermore, the model’s ability to handle large datasets and its robustness to outliers and noise make it suitable for application in diverse urban contexts.
The output of this stage is a fully trained XGBoost model that can predict HV proportions for any OD pair, given the corresponding input features. This model serves as a powerful tool for transportation planners and policymakers, enabling them to assess the impact of different scenarios on the distribution of HVs in the network and make informed decisions concerning infrastructure investments, traffic management strategies, and environmental policies.
Results and Discussion
The efficacy of our multistage approach is demonstrated through rigorous validation and performance metrics. While the OD matrix estimation process using RapidEx involves a nonconvex optimization problem owing to its bi-level structure and genetic algorithm components, several robust mechanisms ensure consistent, high-quality solutions. The algorithm employs multiple random initializations to explore different regions of the solution space, maintains population diversity within the genetic algorithm to prevent premature convergence, and utilizes convergence criteria based on fitness. This approach helps mitigate concerns about local optima and solution uniqueness. While a detailed treatment of local optima and solution uniqueness is beyond the scope of this paper, interested readers are referred to ( 13 ) for a comprehensive discussion.
The method’s reliability is evidenced by its performance metrics, with link flows showing remarkable accuracy, within 4% error on links with available count data. Additionally, travel times align closely with observed values, falling within 10% for 85% of the links. The quadratic programming optimization for proportion estimation, in contrast to the OD estimation stage, is strictly convex owing to its mathematical structure. The objective function, formulated as a sum of squared residuals with strictly positive link-flow coefficients, yields a positive definite Hessian, while all constraints are linear equalities and inequalities within a bounded solution space. This structure guarantees the existence of a unique global optimum for the proportion estimation stage, providing consistent results under identical input conditions.
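The convexity argument can be made explicit for a generic least-squares objective. The notation here is illustrative rather than the study's own: $x$ collects the decision variables, $A$ holds the coefficients that map them to link flows, and $b$ holds the observed link flows.

```latex
\begin{aligned}
f(x) &= \lVert A x - b \rVert_2^2, \\
\nabla f(x) &= 2 A^{\top} (A x - b), \\
\nabla^2 f(x) &= 2 A^{\top} A \succeq 0 .
\end{aligned}
```

The Hessian $2A^{\top}A$ is positive semidefinite for any $A$, so the problem is convex over a feasible set defined by linear constraints. It is positive definite, and the minimizer is therefore unique, when $A$ has full column rank.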
With these reliable outputs, we applied an XGBoost regression model to capture the complex relationships between the urban features described in the methodology and the proportions of HVs. The XGBoost model was optimized through extensive hyperparameter tuning, with the final configuration including 1,200 estimators, a maximum depth of 5, minimum child weight of 4, subsample and column sample by tree both at 0.8, learning rate of 0.06, and regularization parameters alpha and lambda at 0.3 and 9, respectively. This configuration was determined using five-fold cross-validation, with mean squared error (MSE) as the primary selection metric, ensuring a balance between predictive accuracy and model generalizability.
The final model achieved a training R2 of 0.931, test R2 of 0.637, and cross-validation R2 of 0.566 (standard deviation: 0.015). While these metrics suggest potential overfitting when evaluated on proportions alone, several factors support the model’s practical utility. The cross-validation R2’s low standard deviation (0.015) demonstrates consistent model behavior across different data subsets. Additionally, evaluating the same predictions on a volume basis reveals substantially stronger performance (Figure 2), indicating better accuracy in high-volume scenarios that matter most for practical applications. The model’s mean absolute error (MAE) of 0.137 indicates predictions typically deviate by only 13.7 percentage points, a practically acceptable range for strategic transportation planning applications.

Figure 2. Model validation through actual versus predicted heavy vehicle (HV) proportions and volumes.
Examining the actual versus predicted plots (Figure 2) highlights the model’s contrasting behavior on proportional and absolute metrics. The proportion scatter (left panel) shows wider dispersion and a clear gap between training (R2 = 0.931) and test (R2 = 0.637) performance, with the test MAE at 0.140. This plot also exhibits a pronounced vertical band at an actual HV proportion of approximately 0.9. In contrast, the volume-based scatter (right panel) aligns tightly around the 1:1 line, yielding high training (R2 = 0.987) and test (R2 = 0.913) values for absolute HV AADT.
The aforementioned vertical band in the proportion plot (Figure 2, left panel) stems from approximately 12% of OD pairs that, owing to the upstream data extrapolation model, derive a high HV share (around 0.8–0.9). These cases span a wide range of total AADT (from as low as 1 up to 4,756, with an average of ≈ 237). While these cases are distributed across various volume levels, the model’s performance on HV proportions for instances that fall into the two highest-volume deciles (mean AADT ≈ 323 for Decile 8 and ≈ 846 for Decile 9; see Table 3) particularly affects the global R2. On those deciles the model is slightly conservative, with median proportional errors of –0.054 and –0.062, producing decile-level
Table 3. Proportion-Model Performance by Average Annual Daily Traffic (AADT) Decile (Only Test Set)
Note: MAE = mean absolute error.
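A decile breakdown of this kind can be computed as sketched below. The function name, the equal-count binning, the zero-based decile labels, and the sign convention (predicted minus actual, so that negative values indicate under-prediction) are assumptions for illustration, and the sample data are made up.

```python
import statistics


def decile_report(aadt, actual, predicted):
    """Per-decile mean AADT, MAE, and median signed error.

    OD pairs are sorted by total AADT and cut into ten equal-count bins,
    labelled 0 (lowest volume) to 9 (highest volume).
    """
    order = sorted(range(len(aadt)), key=lambda i: aadt[i])
    n = len(order)
    report = []
    for decile in range(10):
        idx = order[decile * n // 10:(decile + 1) * n // 10]
        if not idx:
            continue  # fewer than ten observations can leave a bin empty
        errors = [predicted[i] - actual[i] for i in idx]
        report.append({
            "decile": decile,
            "mean_aadt": statistics.fmean(aadt[i] for i in idx),
            "mae": statistics.fmean(abs(e) for e in errors),
            "median_error": statistics.median(errors),
        })
    return report


# Made-up test-set values, purely for illustration.
aadt = [10.0 * (i + 1) for i in range(20)]
actual = [0.3] * 20
predicted = [0.3 - 0.005 * (i // 2) for i in range(20)]

for row in decile_report(aadt, actual, predicted):
    print(row["decile"], row["mean_aadt"], round(row["mae"], 3), round(row["median_error"], 3))
```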
To evaluate the practical implications of these proportion predictions, HV counts were derived by multiplying the model’s predicted HV proportions by the total AADT for each OD pair. An R2 analysis on these derived HV counts (Figure 2, right panel) yields a robust test R2 of 0.913. This confirms that, despite the nuances in proportion prediction, the model’s underlying proportion estimates, when scaled by volume, lead to HV count estimations that correspond well with actual counts across the network, including on the high-volume links. The dual evaluation therefore isolates a narrowly confined regime, very high AADT corridors, particularly those with inherently high HV shares, that could benefit from a dedicated sub-model or refined approach for HV proportion prediction in future work. Simultaneously, this demonstrates that the present model already provides high practical value for deriving HV counts network-wide and for understanding HV proportions across the remaining 90% of the network segments.
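The mechanism behind the dual evaluation can be reproduced on a handful of made-up OD pairs: a large proportion error on a low-volume pair depresses the proportion-based R2 but contributes little once proportions are scaled by total AADT. All numbers below are hypothetical and are not the study's data or results.

```python
def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up OD pairs: the first has a very high HV share but very low volume.
total_aadt = [5, 20, 80, 300, 900, 2500]
actual_prop = [0.90, 0.30, 0.45, 0.25, 0.35, 0.30]
pred_prop = [0.55, 0.40, 0.35, 0.30, 0.30, 0.33]

# HV counts are the proportions scaled by total AADT.
actual_hv = [p * t for p, t in zip(actual_prop, total_aadt)]
pred_hv = [p * t for p, t in zip(pred_prop, total_aadt)]

r2_prop = r2_score(actual_prop, pred_prop)
r2_volume = r2_score(actual_hv, pred_hv)
mae_prop = mean_absolute_error(actual_prop, pred_prop)
print(f"proportion R2={r2_prop:.3f}, volume R2={r2_volume:.3f}, proportion MAE={mae_prop:.3f}")
```

Part of the volume-based fit comes from the total AADT itself, which is common to both the actual and derived counts, so the two R2 values answer different questions and are best reported together, as done above.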
SHAP analysis revealed critical insights into the determinants of HV proportions across the urban network. The beeswarm visualization (Figure 3) illustrates that nightlight_intensity_Destination exhibits the highest magnitude of influence, with SHAP values ranging from -0.2 to 0.12, indicating its substantial role in both increasing and decreasing HV proportions in traffic composition. This bidirectional effect suggests that urban nighttime activity serves as a key differentiator in the modal split of vehicular traffic. Food_Destination demonstrates a notably different pattern, with SHAP values concentrated between -0.1 and 0.06, and distinct clustering patterns that indicate threshold effects in how food-related destinations influence the presence of HVs in traffic flows.

Figure 3. SHapley Additive exPlanations (SHAP) value distribution for individual features.
The interaction analysis (Figure 4) and hierarchical clustering (Figure 5) reveal complementary aspects of urban feature relationships. Features cluster into three distinct groups: commercial/service destinations (food, medical, transit, services, retail, employment), urban form characteristics (area, education, population), and origin features. Their interactions nonetheless show complex interdependencies. Notably, nightlight_intensity_Destination stands apart in the clustering and has the highest influence, yet shows its strongest interaction with Area_Destination (interaction strength = 0.007686). This suggests that while nightlight intensity captures unique aspects of urban activity, its effect on HV patterns is strongly modulated by spatial scale. Origin features, which form a separate cluster with consistently lower importance, also show weaker interaction patterns across all features.

Figure 4. SHapley Additive exPlanations (SHAP) interaction matrix.

Figure 5. Feature importance with hierarchical clustering.
The dependence analysis (Figure 6) uncovers nuanced relationships between these key features; feature values are shown on the z-score-normalized scale used for training. Nightlight_intensity_Destination shows a clear regime shift at normalized intensity values around 1.0, where its impact transitions from predominantly negative to positive SHAP values, suggesting increased HV presence in well-lit (potentially higher economic activity) areas. For food destinations, the relationship exhibits a distinct threshold pattern, with SHAP values increasing sharply for normalized food destination values between -1 and 2, before plateauing. The retail destination interaction (shown by color) reveals that areas with high retail presence (red points) systematically amplify the positive effect of food destinations on HV proportions, indicating that commercial clusters generate higher proportions of heavy vehicle traffic in the overall traffic mix.

Figure 6. SHapley Additive exPlanations (SHAP) dependence plots showing (a) how nightlight_intensity_Destination’s impact varies with its feature value, colored by Population_Destination, and (b) food_Destination’s relationship, colored by retail_Destination values.
Conclusion
This study introduces an innovative approach to decomposing urban traffic flows into LV and HV categories, addressing a critical gap in transportation network analysis. Our three-stage methodology, combining OD matrix estimation, quadratic programming optimization, and XGBoost regression, provides a robust framework for estimating HV and LV proportions across Greater Sydney’s road network. By leveraging diverse data sources, we have implemented a comprehensive model that achieves a test R² of 0.637, demonstrating its effectiveness in capturing complex urban mobility patterns.
Our model performs particularly well when predicting HV proportions in the 0.2–0.6 range, where the majority of urban freight movements occur. The observed HV proportions follow a bimodal distribution with peaks at 0.3 and 0.9; this pattern reflects the natural distribution of HV traffic in urban environments, with higher proportions being characteristic of specialized zones. Our SHAP interaction analysis reveals intricate relationships between urban features and HV proportions that go beyond simple feature importance rankings. While nightlight_intensity_Destination emerged as the most influential feature, the analysis uncovered significant coupling effects between urban characteristics, with the strongest interaction observed between nightlight intensity and area at destinations (interaction strength = 0.007686). This interaction shows how the impact of economic activity on HV traffic varies substantially with the spatial scale of development.
These interaction effects provide crucial insights for urban planners and policymakers: the effectiveness of infrastructure improvements or policy interventions may depend significantly on the broader urban context in which they are implemented. For instance, the strong interaction between nightlight intensity and area suggests that similar economic activities might generate different HV patterns depending on the spatial configuration of the destination zone. Such insights are particularly valuable for developing targeted strategies for different urban contexts, from compact city centers to sprawling industrial areas.
The methodological framework presented in this study, leveraging increasingly accessible global data types (road network data, Points of Interest, satellite imagery), holds considerable promise for broader application. Expanding this adaptable approach to diverse global urban contexts, particularly rapidly developing cities with limited specialized HV data availability but growing freight demands, is a key future direction. However, while the methodology is designed for broad applicability, it is crucial to acknowledge that the model parameters and feature correlations learned from Greater Sydney may necessitate local recalibration or fine-tuning when applied to cities with markedly different geographical, economic, or social characteristics. Therefore, a primary area for future research is the rigorous investigation and testing of model transferability. This involves applying the framework to other urban centers, especially in developing countries; such application would require thorough validation and likely adaptation using local data to capture unique regional attributes and ensure optimal predictive performance. This process will be vital for establishing the framework’s broader utility and refining its adaptability across different urban contexts.
Further research will also address other identified limitations, such as improving predictions for high-HV proportion scenarios (>0.8) where the current model shows increased uncertainty. Developing specialized sub-models tailored to this specific regime, or incorporating more detailed data on factors influencing HV movements within such high-volume, high-share corridors (such as industrial activities, specific zoning regulations, or dedicated HV infrastructure), could enhance predictive performance. Additionally, extending the model to capture dynamic, time-dependent variations in HV proportions would enhance its practical value. By incorporating socio-economic factors through these advancements, and by developing associated policy recommendations, this continuously refined approach can become an even more powerful tool for comprehensive urban planning. Such a tool would help bridge the gap between transportation efficiency and social equity in our evolving smart cities, enabling Mobility as a Resource (MaaR) concepts and supporting decision-makers in optimizing urban mobility solutions that are efficient, socially inclusive, and environmentally sustainable.
Footnotes
Acknowledgements
The authors extend their gratitude to Anna Sotnikova for her assistance in matching observed flows to the exact links, which significantly contributed to the quality of this research.
Author Contributions
The authors confirm contribution to the paper as follows: study conception and design: PF, SRK, STW; data collection: PF, SRK; analysis and interpretation of results: PF, SRK, DYL, STW; draft manuscript preparation: PF, SRK, DYL, STW. All authors reviewed the results and approved the final version of the manuscript.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
