Advancing the Methodology for Identifying Crash Contributing Factors in Implementing a Systemic Approach for Prioritizing Sites for Safety Improvements

Abstract

The traditional approach of network screening to identify roadway sites for safety improvements has issues, such as the tendency for safety countermeasures to be targeted only at locations that have experienced a high frequency of crashes. The systemic safety approach, on the other hand, proactively identifies risk factors for focus crash and site types, then reviews the road network and identifies locations for implementation of appropriate countermeasures based on the prevalence of these risk factors. Using crash data from Ohio collector road segments, the focus site type, as a case study, this research aimed to advance the systemic approach methodology by first developing a severity index (SI) that considers the frequency and cost of crashes in each severity level. An advanced method that integrates eXtreme Gradient Boosting (XGBoost) and SHapley Additive exPlanations (SHAP) algorithms was then proposed and applied to identify contributing factors to focus crash types by determining those affecting SI for a focus site type. The technique produced relative importance scores that were then used to rank and compare contributing factors for the focus crash types. Then, road segments were not only identified, but also prioritized for safety improvements targeting the focus crash type based on the prevalence of critical contributing factors and their importance scores. The example demonstrated that locations without crashes may still be categorized high priority for safety improvement with the systemic safety approach. The proposed methodology incorporates advanced concepts but is nevertheless implementable beyond the research community.

Keywords

systemic safety approach, crash severity, highway safety performance, eXtreme Gradient Boosting (XGBoost), SHapley Additive exPlanations (SHAP)

The traditional approach to safety management involves a network screening process that prioritizes locations with potential for safety improvements based on historical crash concentration hotspots ( 1 ). This focus constitutes a reactive approach, whereby the possibility of a safety improvement for a specific site is tied to the previous occurrence of crashes at that location ( 1 ), whereas locations without crashes tend to be ignored. Corridor approaches are somewhat less reactive since they result in safety improvements along an entire corridor, including high-crash hotspots and locations with low- or no-crash concentrations. By contrast, new initiatives aimed at preventing the very possibility of severe crashes, such as “Vision Zero” and “Safe System,” anticipate the occurrence of crashes and target improvements at locations regardless of their historical crash profile ( 1 ).

The “systemic approach” that has been gaining traction is found between these two extremes, involving both reactive and proactive components ( 2 ). The approach was characterized most recently by FHWA as follows: “A systemic approach involves the installation of low- to moderate-cost countermeasures at locations with the highest risk of severe crashes” ( 3 ). As part of this process, historical crash data are used to identify the type of roadways that suffer from recurring safety concerns, making it a partly reactive approach ( 1 ). However, it goes beyond identifying clusters of crashes, as it does not consider specific high-crash locations but rather those with high-risk road features. In so doing, it mitigates the issues arising from natural random fluctuations in historical crash data ( 3 ). Ultimately, therefore, the approach would also consider countermeasures at low- or no-crash sites ( 1 )—this is in contrast to the traditional network screening approaches for identifying sites with potential for safety improvement. The systemic approach to safety does not, however, replace the traditional approaches, in that high-crash locations must still be addressed. Instead, both site analysis and systemic approaches are necessary to advance a comprehensive safety management program ( 4 ).

Three fundamental tasks make up the planning component of the systemic approach to road safety management: (1) identify focus crash and facility types (FCFTs) and contributing factors, (2) screen and prioritize candidate locations, and (3) select countermeasures. The following provides an overview of these tasks:

Task 1: Identify FCFTs and Contributing Factors

The first task of the systemic approach is to select FCFTs and contributing factors. This process starts by identifying the most critical crash types to focus on. Focus crash types can be identified with different approaches, such as the most frequent severe crashes or those contributing to the highest number of fatalities and serious injuries, crashes with the highest probability of being severe, and alignment with existing plans and programs ( 3 ). Once the focus crash types are identified, the next step is to select the specific facility types where these types of crashes are prevalent. A common approach to identifying focus facility types involves using a crash tree diagram ( 3 ), which can take various forms, including tabular formats and machine learning–based decision trees ( 5 ).

The last step of this task is the subject of this study. It involves analyzing the factors contributing to an increased risk of focus crash types on a focus facility type. These contributing risk factors may include infrastructure characteristics, operational data, asset management data, community and contextual data, socioeconomic data, and crash and safety data ( 3 ). Various methods have been used to identify these contributing factors, including traditional approaches such as established findings, local knowledge, and analysis of crash overrepresentation, as well as more advanced techniques like statistical modeling and machine learning algorithms ( 3 ). As a background for this study, these methods are reviewed in some detail in the next section.

Task 2: Screen and Prioritize Candidate Locations

The second task of the systemic approach is to develop a prioritized list of potential locations at which systemic improvements should be applied. The process involves three key steps. During the first step, agencies would typically use the risk-contributing factors identified in the previous step to pinpoint system elements within the focus facility type that are at higher risk of severe crashes. The second step requires calculating the “risk score” for each system element, which is a quantitative evaluation based on the presence and relative importance of identified risk-contributing factors. The last step involves prioritizing system elements by ranking them in order from the highest to the lowest risk score ( 3 ).

Task 3: Select Countermeasures

The third task of the systemic approach is to select countermeasures for high-priority locations identified in Task 2 by developing a list of potential countermeasures and selecting them for deployment ( 3 ). Various resources are available for this task, most notably the CMF Clearinghouse ( 6 ) and the Highway Safety Manual ( 7 ). To ensure consistency in countermeasure deployment, tools such as decision trees, worksheets, or matrices may be considered ( 3 ).

Literature Review

As noted, a foundational element of the systemic safety approach analysis, which is the focus of this study, is the identification of contributing factors associated with specific crash types and facility types (FCFTs). The literature reveals a range of methodological approaches for this purpose, spanning from basic descriptive statistics to state-of-the-art machine learning algorithms. This review synthesizes key studies that illustrate this spectrum of techniques and highlights the progression toward more interpretable and cost-informed approaches.

An example of the simpler methods is the study conducted by Preston et al., who developed the Systemic Tool to provide a comprehensive understanding of the pivotal role that systemic planning plays in the safety management process ( 4 ). To determine the contributing factors to crashes, the method uses descriptive statistics to compare the proportion of locations with specific attributes to the percentage of severe crashes occurring at those locations. In another study, the New York State Department of Transportation (NYSDOT) used overrepresentation analysis to identify risk factors contributing to crashes as part of its Roadway Departure Safety Action Plan. NYSDOT focused on two specific crash types, head-on collisions and other roadway departure crashes, and identified specific facility types to focus on for each. Using overrepresentation, the agency compared the proportion of severe focus crashes to the proportion of vehicle miles traveled for particular system elements to determine the risk factors ( 8 ). Here, severe crashes are KA crashes on the KABCO injury scale, in which K = fatal (killed), A = incapacitating injury, B = nonincapacitating injury, C = possible injury, and O = property damage only.

An example of the application of more advanced techniques for identifying the contributing factors for FCFTs is the study performed by Thomas et al., who developed a process and guide for conducting systemic safety analyses for pedestrians, using analytical techniques to identify pedestrian activities, roadway features, and other contextual and behavioral risk factors, such as land use, that increase pedestrian crashes ( 9 ). Two types of analyses were conducted to identify a set of variables that are predictive of crash risk: a regression tree analysis to determine the most important predictors, followed by the estimation of negative binomial (NB) regression models. Because several of the 51 candidate variables were potentially interrelated, a data mining algorithm (conditional random forest) was applied to identify the most important independently predictive variables among the correlated set, and thereby to select and rank a smaller set of variables to test in regression modeling.

Another example of more advanced techniques is the direct use of the random forest in a project for FHWA by Saleem et al., who investigated contributing factors for focus crash types using data from three sources ( 10 ): crash and roadway inventory from the Highway Safety Information System (HSIS) ( 11 ), climate data from the National Oceanic and Atmospheric Administration ( 12 ), and socioeconomic census data from the U.S. Census Bureau ( 13 ). (It should be noted here in passing that this database was also used for the investigation in the current study.) To investigate the contributing factors for the focus crash types, Saleem et al. used the random forest method. Using such advanced techniques allows for the identification of predictors that do not appear in a single regression or classification tree but which, nonetheless, are highly related to the target variable.

In another study that applied machine learning algorithms, Cho et al. used the chi-square automatic interaction detection (CHAID) decision tree to analyze roadway departure crashes on undivided two-lane rural roads in Virginia ( 5 ). (CHAID, which was developed by Kass ( 14 ), is a type of decision tree that is constructed by iteratively dividing the data into smaller subsets, starting with the entire dataset and then splitting each subset into two or more smaller groups.) The Cho et al. dataset contained information on crash characteristics, such as collision type and severity, as well as contributing factors like lighting conditions and speeding, and a range of attributes related to administrative details, facility characteristics, and traffic operations.

In addition to statistical and machine learning approaches applied to analyze crashes, different approaches have been used to quantify crash severity and prioritize segments accordingly. Traditional methods often rely on crash frequency or severity counts. However, there are alternative approaches that incorporate severity through weighting, such as the equivalent property damage only (EPDO) method, which is favored for its simplicity and ability to reflect the relative impact of crashes through weights. As such, it is still the method of choice for prioritizing sites for systemic treatment in state-level safety management programs such as the Nevada Highway Safety Improvement Program ( 15 ) and the Oregon All Roads Transportation Safety Program ( 16 ).

Building on both strands of the literature, this study utilizes a cost-based severity index (SI) grounded in FHWA-recommended economic values ( 17 ), offering an alternative to traditional fixed-weight methods. At the same time, this study responds to the growing methodological shift toward advanced analytics by leveraging machine learning techniques capable of handling high-dimensional, correlated datasets in offering a novel pathway to enhance both the predictive power and transparency of systemic safety analyses. This dual emphasis on cost-based severity weighting and model interpretability addresses key gaps in existing approaches and represents a critical step forward in the development of practical, data-driven tools for informed safety management.

Study Objectives

The primary objective of this study is to advance the methodology for identifying contributing factors for FCFTs by capitalizing on the strengths of the various approaches while seeking to address some of the weaknesses. One point of departure from prominent studies, such as that by Saleem et al., is based on the tendency for those studies to focus on crash frequency, such as the unweighted sum of fatal and incapacitating injury crashes, to capture severity ( 10 ). By utilizing a weighted SI through a weighted sum of the relative costs of PDO (i.e., property damage only), injury, and fatal crashes, this study aims to provide a more balanced and economically meaningful assessment of crashes.

A second point of departure, from an analytical perspective, is based on the reality that traditional regression-based techniques often suffer from certain drawbacks, such as sensitivity to multicollinearity and outliers, reliance on the assumption of linear relationships, and sensitivity to the underlying data distribution. Additionally, many commonly applied methods are overly simplistic, reducing the accuracy and depth of insights they can provide. In addressing these issues, this study seeks to offer a more robust framework for analyzing crash data by investigating the use of an advanced technique called eXtreme Gradient Boosting (XGBoost) ( 18 ), applied in conjunction with SHapley Additive exPlanations (SHAP) ( 19 ), to identify contributing factors to crashes by determining the factors affecting the SI. Applying XGBoost with SHAP provides high predictive accuracy and effectively handles nonlinear relationships and multicollinearity, which are common in crash data. XGBoost is also data-efficient and can handle large, complex datasets, whereas SHAP improves the interpretability of the model by clarifying the direction of the effects of contributing factors. This approach produces importance scores, allowing the factors to be ranked and compared based on their importance. Sites can then be prioritized for safety improvement based on the prevalence of critical factors contributing to focus crash types and their importance scores. A database of Ohio collector road segments (the focus facility type) is used to develop and illustrate the methodology for identifying contributing factors for run-off-road crashes (the focus crash type) and to show how sites can subsequently be prioritized for safety improvement based on the importance scores of the contributing factors.

Data

The dataset for this research pertains to 2,351 run-off-road traffic crashes on 10,234 Ohio two-lane rural collector road segments during a 6-year period. As noted, it is part of a larger dataset used by Saleem et al. in the study referred to earlier ( 10 ) to identify FCFTs as well as contributing factors, including those that characterize adjacent neighborhoods, in applying a systemic safety approach. Run-off-road crashes were selected for this research because they were identified as a focus crash type in that study ( 10 ). Collector road segments, which Saleem et al. identified as a focus facility type, were selected because motorists from adjacent neighborhoods are likely to use these roads, which provides the opportunity to consider the characteristics of those neighborhoods as contributing factors, as Saleem et al. did.

The dataset combines data from three sources: crash and roadway inventories from the HSIS ( 11 ), climate data from the National Oceanic and Atmospheric Administration ( 12 ), and socioeconomic census data from the U.S. Census Bureau ( 13 ). FCFTs were selected by analyzing the Fatality Analysis Reporting System (FARS) ( 20 ) and the HSIS databases. The descriptive statistics of the data are presented in Table 1.

Table 1.

Descriptive Statistics of the Data

Variable name	Min.	Max.	Average	SD
Segment length (mi)	0.001	0.27	0.047	0.033
Annual average daily traffic	93.3	9,513.3	1,107	1,036
Curve radius (ft)	23.5	1,468.0	485.5	236.0
Percentage of grade	0	20	3.67	4.70
Surface width (ft)	16	44	19.15	1.82
Speed limit (mph)	25	55	52.11	6.07
Shoulder width (ft)	0	10	2.22	1.47
Average snowfall per year (in.)	6.8	109.9	21.12	10.21
Average rainfall per year (in.)	33.16	48.74	41.40	2.52
Average annual number of days with a minimum temperature of 32°F or lower	88	148	122.17	12.41
Average annual maximum temperature (°F)	57	69	62.46	2.36
Average annual minimum temperature (°F)	37	46	40.86	1.78
Average annual minimum winter temperature (°F)	17	27	22.02	2.03
Proportion of population ages 16+ unemployed	0	0.35	0.09	0.06
Proportion of population ages 16–24 working full time	0	1	0.17	0.19
Proportion of population ages 16–24 working part time	0	1	0.44	0.23
Proportion of population ages 16–24 unemployed	0	1	0.38	0.23
Proportion of population ages 25+ without a high school diploma	0.11	0.86	0.45	0.09
Proportion of population ages 25+ with a high school diploma	0.05	0.76	0.43	0.09
Proportion of population ages 25+ with a university degree	0	0.65	0.13	0.08
Proportion of households with income less than $50,000	0.06	0.96	0.55	0.12
Proportion of households with income between $50,000 and $100,000	0	0.62	0.32	0.11
Proportion of households with income more than $100,000	0	0.64	0.13	0.08
Proportion of households with no vehicles	0	0.80	0.06	0.07
Proportion of households with one vehicle	0	0.70	0.25	0.10
Proportion of households with two or more vehicles	0.11	0.96	0.69	0.11
Proportion of population ages 15–19	0	0.32	0.07	0.03
Proportion of population ages 20–44	0.09	0.75	0.28	0.06
Proportion of population ages 45–64	0.03	0.56	0.32	0.09
Proportion of population ages 65–74	0	0.33	0.09	0.04
Proportion of population ages 75+	0	0.29	0.06	0.04

Note: Min. = minimum; Max. = maximum; SD = standard deviation.

Methods

As noted earlier, it was first necessary to define an SI on which the assessment of contributing factors would be based. For this research, similar to Roy et al., an SI was developed for each segment as a weighted sum of the crash frequencies in the different severity levels ( 21 ). The SI was calculated per Equation 1, with the severity level designations based on the KABCO scale,

SI = W_{1} \times (\text{No. of O-level crashes}) + W_{2} \times (\text{No. of C-level crashes}) + W_{3} \times (\text{No. of B-level crashes}) + W_{4} \times (\text{No. of KA-level crashes})

(1)

The relative weights (W_i) of each crash, considering its severity, were calculated using the crash costs for Ohio published by FHWA ( 17 ). These crash costs and the derived weights are shown in Table 2.

Table 2.

Crash Costs and Weights

	Injury severity
Crash cost and weight parameter	KA (fatal and serious injury)	B (evident injury)	C (possible injury)	O (no injury)
Average cost per involved crash	$336,145	$56,146	$38,056	$8,576
Average weight (W_i) per involved crash	39.2	6.55	4.44	1
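
As a rough illustration of Equation 1 and Table 2, the sketch below derives each weight as the ratio of a severity level's crash cost to the O-level cost and then scores a segment. The names `CRASH_COST`, `WEIGHT`, `cost_ratio`, and `severity_index` are illustrative assumptions and do not come from the study's code.

```python
# Crash costs per severity level, copied from Table 2 (USD).
CRASH_COST = {"KA": 336_145, "B": 56_146, "C": 38_056, "O": 8_576}

# Published weights from Table 2 (cost relative to an O-level crash, rounded).
WEIGHT = {"KA": 39.2, "B": 6.55, "C": 4.44, "O": 1.0}


def cost_ratio(level: str) -> float:
    """Unrounded weight: cost of `level` relative to the O-level cost."""
    return CRASH_COST[level] / CRASH_COST["O"]


def severity_index(counts: dict[str, int]) -> float:
    """Equation 1: weighted sum of crash counts by severity level."""
    return sum(WEIGHT[level] * counts.get(level, 0) for level in WEIGHT)


# Hypothetical segment with one KA-, one B-, and one C-level crash.
example_si = severity_index({"KA": 1, "B": 1, "C": 1})
print(round(example_si, 2))
```

For this hypothetical segment the index is 39.2 + 6.55 + 4.44 = 50.19, and a segment with no recorded crashes scores 0.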

It should be noted in passing that, although the systemic and Safe System approaches emphasize reducing fatal and serious injury (KA) crashes, the use of an SI that also considers B-, C-, and O-level crashes provides a complementary and analytically rigorous method for prioritizing safety improvements. Even so, fatal and serious injury crashes inevitably account for a substantial portion of the SI value. As such, this index places implicit and substantial weight on the most severe outcomes, aligning closely with the objectives of the Safe System approach. All the same, it should be stressed that the proposed methodology would still be applicable for other severity indices, such as the sum of fatal and serious crashes, or, indeed, EPDO.

As noted, this study aimed to advance the methodology for identifying the most important variables affecting SI and the contributing factors corresponding to the identified FCFT (i.e., run-off-road crashes on two-lane rural collector roads for this study). Among the promising state-of-the-art methods in the field of nonparametric and artificial intelligence models, particularly machine learning, is XGBoost, which was proposed recently by Ester et al. ( 22 ). Some studies have shown that its performance and efficiency are greater than those of conventional classification techniques ( 23 – 25 ). As such, it was selected as the method for this study’s investigation.

In applying XGBoost, a learning sample of data with known class labels and predictor variable values is recursively partitioned when constructing the tree. Tree-based models split the data multiple times according to specific cut-off values in the features ( 26 ), known as split points. Based on Chen et al., the objective function of the XGBoost algorithm can be defined by Equations 2 and 3 ( 27 ).

Obj = \sum_{i = 1}^{n} l (y_{i}, {\hat{y}}_{i}) + \sum_{k = 1}^{K} Ω (f_{k})

(2)

Ω (f_{k}) = γ T + \frac{1}{2} λ \sum_{j = 1}^{T} w_{j}^{2}

(3)

where

Obj is the objective function;

n is the number of samples and K is the number of trees;

$l (y_{i}, {\hat{y}}_{i})$ is the training loss for sample x_i;

$Ω (f_{k})$ is the regularization term that penalizes the complexity of tree f_k;

y_i is the actual value of sample x_i;

${\hat{y}}_{i}$ is the predicted value of sample x_i;

f_k is the k^th tree;

$γ$ is the complexity penalty applied to each leaf;

T is the number of leaf nodes in the tree;

$λ$ is the regularization coefficient applied to the leaf weights; and

w_j is the weight of leaf j.

Equation 4 shows how the expanded objective function can be obtained by approximating the original objective with a second-order Taylor expansion.

Obj^{(t)} \approx \sum_{i = 1}^{n} [l (y_{i}, {\hat{y}}_{i}^{(t - 1)}) + g_{i} f_{t} (x_{i}) + \frac{1}{2} h_{i} f_{t}^{2} (x_{i})] + Ω (f_{t}) + C

(4)

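where f_t is the tree added at boosting iteration t, ${\hat{y}}_{i}^{(t - 1)}$ is the prediction from the first t − 1 trees, g_i and h_i are the first- and second-order derivatives of the loss function with respect to ${\hat{y}}_{i}^{(t - 1)}$, and C is a constant.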
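
To make the terms in Equation 4 concrete, the following sketch evaluates g_i and h_i for a squared-error loss. That loss is an assumption made here for illustration, because the loss function used in the study is not stated. The sketch also computes the leaf weight that minimizes the second-order objective for a single leaf, w* = −G / (H + λ), where G and H are the sums of g_i and h_i in the leaf; this is a standard result for this form of objective. All names and numbers are hypothetical.

```python
import numpy as np


def squared_error_grad_hess(y_true, y_pred):
    """First- and second-order derivatives of l = 0.5 * (y - y_hat)^2
    with respect to the current prediction y_hat."""
    g = y_pred - y_true          # dl/dy_hat
    h = np.ones_like(y_pred)     # d2l/dy_hat^2
    return g, h


def optimal_leaf_weight(g, h, lam):
    """Weight minimizing sum(g*w + 0.5*h*w^2) + 0.5*lam*w^2 for one leaf."""
    return -g.sum() / (h.sum() + lam)


# Hypothetical SI values for three segments falling in the same leaf,
# with the previous-iteration prediction at zero.
y_true = np.array([0.0, 6.55, 1.0])
y_prev = np.zeros(3)

g, h = squared_error_grad_hess(y_true, y_prev)
w_star = optimal_leaf_weight(g, h, lam=1.0)
print(g, h, w_star)
```

With λ = 0 and a zero starting prediction, the weight would reduce to the mean SI in the leaf (7.55 / 3); the λ penalty shrinks it toward zero.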

Gradient boosting has the advantage that, once the boosted trees are constructed, it is relatively straightforward to retrieve importance scores. As a general rule, these scores measure how useful each feature was in constructing the boosted decision trees. They allow the attributes in the dataset to be ranked and compared based on their importance to the SI. XGBoost combines high computational speed with strong prediction accuracy. This performance results from including a regularization term in its objective function, which mitigates overfitting while training error is minimized. Furthermore, XGBoost is resistant to multicollinearity issues and can address multinomial logit (MNL) problems involving the independence of irrelevant alternatives (IIA) assumption ( 28 ).

Model interpretability and explainability are essential for machine learning (ML) applications, particularly in traffic safety. Although earlier research focused on enhancing and comparing model performance, it neglected the importance of interpretability and the need to assess the impacts of various risk factors and their cumulative effects on traffic mitigation measures. Whereas XGBoost excels in predictive performance, it lacks interpretability, a gap that can be addressed using SHAP, a relatively new method for ML model interpretation derived from game theory. SHAP, which has been used in more recent road safety research ( 29 – 31 ), provides a baseline for assessing the impact of each feature on the final output ( 31 ) and a clear, interpretable explanation of the model predictions by calculating the contribution of each feature to the model, that is, the degree to which it influences the prediction result ( 29 ).

This study used the XGBoost algorithm integrated with SHAP in Python. To evaluate predictive performance and guard against overfitting, the dataset was randomly divided into training and testing subsets using an 80/20 split. The XGBoost model was trained on the larger portion and validated on the held-out test set to assess generalizability. This internal validation approach, which is common in ML, helps ensure that model performance reflects true patterns rather than overfitting to noise in the training data. The generic Python code was developed by Lundberg et al. ( 19 ) and is available as open source on GitHub ( 32 ). As such, these methods, though seemingly complex, can be implemented relatively easily beyond the research community.

To facilitate the application of the method investigated in this study, the framework of the proposed systemic safety approach is presented in Figure 1. It begins with the collection of crash records, roadway features, and socioeconomic variables. Next, feature engineering and data cleaning are performed to prepare the dataset for modeling. Crash severity is then quantified by calculating the SI using FHWA-recommended crash cost values. An XGBoost model is trained to predict the SI based on the available features, and SHAP values are applied to interpret the model and identify the most influential risk factors. Finally, segments are prioritized for safety treatment by combining SHAP outputs with the SI, enabling a proactive, cost-informed ranking of high-risk locations.

This framework is highly adaptable and can be implemented by agencies with access to basic crash, roadway inventory, and contextual datasets. The SI can be recalibrated using local crash cost estimates to reflect state-specific priorities or economic conditions (e.g., differing valuation of injuries or fatalities). Likewise, the list of predictive features can be updated based on locally available attributes, including climate exposure, roadway classifications, or enforcement patterns. The XGBoost–SHAP framework supports such customization, offering both robustness and interpretability, even with varying data quality. This modularity makes the method suitable for implementation in departments of transportation, metropolitan planning organizations, or regional safety agencies aiming to evolve from frequency-based screening toward proactive, data-informed systemic safety management. Finally, it should be noted that the XGBoost Python package can readily be used to solve both regression and classification problems, and that it includes a built-in function that computes feature importance, which can be used to rank the study variables.

Figure 1.

The framework of the systemic safety approach in this study.

Case Study Results

This section documents the results of applying the systemic safety approach with the advanced methodology to the dataset used for the case study. The results are focused and presented separately for the first two tasks identified in the introduction: the identification of potential risk factors for the FCFT and the screening and prioritization of candidate locations. (Although the third task—countermeasure selection—is a critical one in the systemic safety process, in practice, it should be based on site-specific characteristics, benefit–cost analysis, and agency guidelines and priorities. As such, delving into this task is beyond the scope of this study.)

Results for Task 1: Identifying Potential Risk Factors for the FCFT

The overall influence of each feature on the SI is obtained as an importance score from the XGBoost model within the SHAP interpretation framework. Figure 2 shows the scores for the first 16 features. According to this figure, SI is primarily influenced by segment length, which is the feature ranked first in importance. The second most important variable is annual average daily traffic (AADT), followed by the curve radius, proportion of the population aged 16+ unemployed, shoulder width, and proportion of the population aged 45 to 64.

Figure 2.

Importance scores for each variable.

The values presented in Figure 2 are the absolute values of the importance scores and do not show the direction of the effect of each feature. To understand how features affect the SI, Figure 3 was developed, which shows the direction of effects for the 16 features with the highest importance scores. The graph in this figure has two axes: a vertical axis listing the features and a horizontal axis representing their SHAP values. The SHAP value of each feature indicates its contribution to the model, that is, how influential it is on the prediction results ( 29 ). Each dot is shaded according to the value of the feature, as shown in the color bar on the right of the figure: a red dot indicates a high value of the risk factor, whereas a blue dot indicates a low value ( 31 ). In addition, to identify the most important contributing factors, the SHAP values of the features can be ranked ( 33 ). For instance, for AADT, the SHAP value was high whenever the feature value was high, indicating that this variable positively affected the predictions of SI. Overall, integrating XGBoost with SHAP allowed both the direction and the strength of the effect of each variable to be determined.

Figure 3.

Distribution of SHAP values for each variable.
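
One simple way to summarize the direction that Figure 3 conveys visually is to check whether a feature's SHAP values rise or fall with the feature's own values. The sketch below uses the sign of their correlation for this purpose. This heuristic, the function name `effect_direction`, and the toy numbers are assumptions for illustration and are not part of the study's procedure.

```python
import numpy as np


def effect_direction(feature_values, shap_values):
    """Return +1 if SHAP values tend to increase with the feature,
    -1 if they tend to decrease, and 0 if no linear trend can be computed."""
    r = np.corrcoef(feature_values, shap_values)[0, 1]
    if np.isnan(r):
        return 0
    return int(np.sign(r))


# Toy values: higher AADT pushes the predicted SI up, while a larger
# curve radius is assumed here to push it down.
aadt = np.array([500.0, 1500.0, 4000.0, 9000.0])
aadt_shap = np.array([-0.30, -0.10, 0.25, 0.60])

radius = np.array([100.0, 400.0, 800.0, 1400.0])
radius_shap = np.array([0.40, 0.10, -0.15, -0.30])

print(effect_direction(aadt, aadt_shap))      # positive association
print(effect_direction(radius, radius_shap))  # negative association
```

A sign obtained this way only captures a monotonic tendency; the full distribution of SHAP values remains the more informative view when an effect is nonlinear.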

Results for Task 2: Screening and Prioritizing Candidate Locations

The outcome of this task in the systemic approach is a list of prioritized locations based on the presence of the most important risk factors. The greater the prevalence of critical risk factors in the segment characteristics, the greater the potential for certain types of crashes to occur and the higher the priority of the segment for safety improvement investment. The result of this prioritization is a ranking of the elements of the focus facility type. For this step in the systemic safety approach, following Gooch et al., the values of the features were normalized ( 3 ). Normalization was conducted such that, if a feature was positively associated with the SI, a value of 1 was assigned to its highest value or presence. Conversely, if a feature was negatively associated with the SI, its highest value was assigned a value of 0. The importance scores from SHAP were then used as prioritization weights. Finally, the weighted sum of the normalized feature values, with the importance scores as weights, was used to rank high-priority locations. Table 3 shows the 10 highest priority locations in the database used for this study; information is provided for the SI and importance score, as well as the six most important characteristics identified in Figure 2. Notably, 3 of the 10 high-priority locations did not have any crashes (SI = 0), emphasizing the key principle of the systemic safety approach. Also of note is that one location had a high SI (50.19). This SI, which can be calculated from Equation 1, is based on one fatal (K) crash, one B-level (evident injury) crash, and one C-level (possible injury) crash (39.2 + 6.55 + 4.44 = 50.19).

Table 3.

High-Priority Locations Based on the Systemic Safety Approach

County route	L (mi)	R (ft)	AADT	SW (ft)	Page>16	P45–64	SI	Importance score
BEL0331R	0.142	971	9,513	1	0.135	0.229	6.55	1.051
COL0170R	0.035	803.5	7,849.5	2	0.103	0.298	0	0.985
COL0170R	0.093	638.5	7,849.5	2	0.103	0.298	1	0.96
COL0170R	0.118	685	8,750	2	0.103	0.298	6.55	0.96
SCI0073R	0.082	567	8,780	3	0.233	0.205	50.19	0.95
COL0170R	0.051	561	7,849.5	2	0.103	0.298	1	0.948
COL0170R	0.081	546	7,849.5	2	0.103	0.298	0	0.945
COL0170R	0.058	589	8,750.2	2	0.103	0.298	0	0.942
COL0170R	0.04	356	8,750.2	2	0.103	0.298	4.44	0.941
COL0170R	0.098	517.5	8,750.2	2	0.103	0.298	6.55	0.931

Note: L = segment length (mi); R = radius (ft); AADT = annual average daily traffic; SW = shoulder width (ft); Page>16 = proportion of population aged 16+ unemployed; P45–64 = proportion of population aged 45–64; SI = severity index.
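The normalization and weighting step described for Task 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: min–max scaling is assumed for the normalization (the text specifies only which extreme maps to 1 or 0), and the segment identifiers, feature values, weights, and effect directions below are all hypothetical.

```python
def normalize(values, positive):
    """Min-max scale to [0, 1]; flip when the feature is negatively associated with SI."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # a constant feature cannot discriminate between sites
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if positive else [1.0 - s for s in scaled]


def prioritize(segments, weights, positive):
    """Rank segments by the importance-weighted sum of normalized feature values."""
    scores = {seg["id"]: 0.0 for seg in segments}
    for feature, weight in weights.items():
        normalized = normalize([seg[feature] for seg in segments], positive[feature])
        for seg, value in zip(segments, normalized):
            scores[seg["id"]] += weight * value
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical segments; note that crash history is not an input to the score.
segments = [
    {"id": "A", "aadt": 9000, "radius_ft": 500, "shoulder_ft": 1},
    {"id": "B", "aadt": 3000, "radius_ft": 1500, "shoulder_ft": 4},
    {"id": "C", "aadt": 6000, "radius_ft": 1000, "shoulder_ft": 2},
]
# Hypothetical SHAP-based importance weights and assumed directions of effect.
weights = {"aadt": 0.5, "radius_ft": 0.3, "shoulder_ft": 0.2}
positive = {"aadt": True, "radius_ft": False, "shoulder_ft": False}

ranking = prioritize(segments, weights, positive)
for segment_id, score in ranking:
    print(segment_id, round(score, 3))
```

Because the score depends only on roadway and contextual characteristics, a segment with no recorded crashes can rank at the top of the list if its risk factors are prevalent, which is the behavior observed for the zero-crash locations in Table 3.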

Discussion and Summary

This paper aimed to advance the methodology for identifying contributing factors for focus crash types in implementing a systemic approach for prioritizing sites in a focus facility type for safety improvements. A dataset containing crash and roadway inventories, climate data, and socioeconomic census data pertaining to Ohio collector road segments was used as a case study to accomplish this aim. To consider different aspects of crash severity, an SI was developed based on the cost and frequency of crashes in each severity level on the KABCO scale. An advanced method that integrates XGBoost and SHAP algorithms was applied to identify contributing factors to focus crash types by determining those affecting the SI for the targeted site types. The technique produced relative importance scores that were then used to rank and compare contributing factors for the focus crash types. Then, road segments were not only identified but also prioritized for safety improvement, targeting the focus crash type based on the prevalence of critical contributing factors and their importance scores.

This study builds on and extends existing methods used in systemic safety analysis by integrating advanced ML and economic quantification techniques into the process of identifying contributing factors and prioritizing sites for improvement. Traditionally, safety studies have relied heavily on crash frequency or severity counts (e.g., fatal or serious injury crashes) as the basis for flagging candidate locations for countermeasure implementation ( 34 , 35 ). Although these approaches can be useful, they often cannot identify locations that share the same risk factors as high-crash sites but have not yet experienced severe crashes.

The case study results demonstrated that several prioritized locations had zero observed crashes during the study period but shared high-risk characteristics, affirming the proactive power of the systemic approach. This finding supports prior research ( 1 , 36 ) and reinforces the argument that relying solely on historical crash frequency underrepresents true risk. In addition, the case study application, while confirming known contributing factors (e.g., AADT, horizontal curves), also identified contextual risk indicators such as socioeconomic factors that are generally not considered in traditional safety screening.

The use of XGBoost, enhanced by SHAP interpretability values, enabled a more robust and transparent identification of contributing features, even in the presence of complex nonlinearities or feature interactions. In so doing, it overcame the limitations of earlier studies, such as those of Thomas et al. ( 9 ) and Cho et al. ( 5 ), that used statistical methods (e.g., logistic regression or decision trees) to identify contributing factors for specific crash types. Although conceptually appealing, those methods may suffer from limitations such as multicollinearity, loss of interaction effects, or reliance on subjective thresholds for significance.

Although this study provides valuable insights into the application of a systemic safety approach, it has limitations that should be acknowledged. First, the geographical scope was limited to Ohio, which may restrict the generalizability of findings to other regions. Second, although this study considered a range of variables, more detailed information about neighborhoods, drivers, and vehicles involved in crashes could provide further insights. For instance, data on neighborhood characteristics such as ethnicity and on driver sociodemographic and behavioral factors could be valuable. Future research could also integrate more advanced ML algorithms into a hybrid framework, which could enhance the analysis of risk factors, potentially leading to more efficient prioritization of sites for safety treatments and more focused targeting of interventions. Additionally, transforming continuous variables into discrete bins based on SHAP-identified thresholds (e.g., for curve radius, shoulder width, or grade) could support the development of interpretable risk scoring tools aligned with practitioner needs. Finally, adopting longitudinal validation strategies, in which models are trained on early-year crash data and their predictions are validated against subsequent years, would enable real-world evaluation of model performance and improve comparisons across systemic and traditional approaches.

In sum, this study provides a replicable, interpretable, and economically grounded framework that complements and extends previous systemic safety methods. It offers practical tools for agencies seeking to evolve beyond crash-frequency-based screening and toward a more data-driven, risk-informed prioritization process. In this context, it should be noted that the XGBoost and SHAP methods, though seemingly complex, can be implemented relatively easily beyond the research community using freely downloadable open-source Python code.

Footnotes

Acknowledgements

The dataset was provided by the University of North Carolina Highway Safety Research Center. The authors gratefully acknowledge its assistance.

Author Contributions

The authors confirm contribution to the paper as follows: study conception and design: M. Jafari, B. Persaud; data collection: M. Jafari; analysis and interpretation of results: M. Jafari, B. Persaud, C. Mohammadi; draft manuscript preparation: M. Jafari, B. Persaud, C. Mohammadi. All authors reviewed the results and approved the final version of the manuscript.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article. It should be noted, however, that one co-author, Bhagwant Persaud, is an Associate Editor of Transportation Research Record.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was funded by a discovery grant (Appl. ID RGPIN-2023-03787) from the Natural Sciences and Engineering Research Council of Canada and the Highway Infrastructure and Innovation Funding Program of the Ontario Ministry of Transportation.

ORCID iDs

Mahsa Jafari

Bhagwant Persaud

Cameron Mohammadi

Data Accessibility Statement

The datasets analyzed during the current study are available from the corresponding author on reasonable request.

References

1. Grembek, Pasquet, Vanoli. An Enhanced Systemic Approach to Road Safety. No. CSCRS-R2. University of California, Berkeley. Safe Transportation Research and Education Center, 2019.

2. FHWA. A Systemic Approach to Safety Using Risk to Drive Action. 2015. Report No.: FHWA-SA-12-025.

3. Gooch, Gross, Dunn, Kersavage, Sanders, Schoner, et al. Systemic Safety User Guide. FHWA; 2024. Report No.: FHWA-SA-23-008.

4. Preston, Storm, Dowds J. B., Wemple, Hill, Systematics. Developing Methodology for Identifying, Evaluating, and Prioritizing Systemic Improvements. United States. Federal Highway Administration. Office of Safety; 2013.

5. Cho H. W., Cottrell B. H. Jr., Lim I. K. Development of a Systemic Safety Improvement Plan for Two-Lane Rural Roads in Virginia. Virginia Transportation Research Council (VTRC), 2020.

6. FHWA. CMF Clearinghouse [Internet]. 2024. https://cmfclearinghouse.fhwa.dot.gov

7. AASHTO. Highway Safety Manual [Internet]. Washington, DC: American Association of State Highway and Transportation Officials, 2010. http://www.highwaysafetymanual.org/Pages/default.aspx

8. New York State Department of Transportation. New York State roadway departure safety action plan, 2024.

9. Thomas, Kumfer, Lang, Zegeer, Sandt, Lan, Nordback. Systemic Pedestrian Safety Analysis: Contractor’s Technical Report. NCHRP Report 893, TRB, Washington, D.C., 2018. https://onlinepubs.trb.org/onlinepubs/nchrp/nchrp_rpt_893_Contractor.pdf

10. Saleem, Porter R. J., Srinivasan, Carter, Himes. Contributing Factors for Focus Crash and Facility Types. Washington, DC: Federal Highway Administration, U.S. Department of Transportation, 2020.

11. FHWA. Highway Safety Information System (HSIS) [Internet]. 2018. https://www.hsisinfo.org

12. National Oceanic and Atmospheric Administration. “National Oceanic and Atmospheric Administration” (website) [Internet]. 2018. https://www.noaa.gov. Accessed January 19, 2018.

13. U.S. Census Bureau. “Socioeconomic Census Data” (website) [Internet]. 2018. Available from: https://www.census.gov/data.html. Accessed January 19, 2018.

14. Kass G. V. An Exploratory Technique for Investigating Large Quantities of Categorical Data. Journal of the Royal Statistical Society: Series C (Applied Statistics), Vol. 29, No. 2, 1980, pp. 119–127.

15. Highway Safety Improvement Program. Nevada Highway Safety Improvement Program. Carson City: Nevada Department of Transportation, 2015.

16. Siddique Z. Q., Bish D. W., Haas K. J. Oregon’s All Roads Transportation Safety Program: Data-Driven Program to Improve Safety on All Public Roads. Transportation Research Record, Vol. 2582, No. 1, 2016, pp. 18–25.

17. Harmon, Bahar, Gross. Crash Costs for Highway Safety Analysis. Report No.: FHWA-SA-17-071. U.S. Federal Highway Administration, Washington, D.C., January 2018. https://highways.dot.gov/sites/fhwa.dot.gov/files/2022-09/fhwasa17071.pdf

18. Jamal, Zahid, Tauhidur Rahman, Al-Ahmadi H. M., Almoshaogeh, Farooq, et al. Injury Severity Prediction of Traffic Crashes with Ensemble Machine Learning Techniques: A Comparative Study. International Journal of Injury Control and Safety Promotion, Vol. 28, No. 4, 2021, pp. 408–427.

19. Lundberg S. M., Lee S. I. A Unified Approach to Interpreting Model Predictions. Advances in Neural Information Processing Systems, 2017, p. 30.

20. NHTSA. Fatality Analysis Reporting System (FARS) [Internet]. 2018. https://www.nhtsa.gov/research-data/fatality-analysis-reporting-system-fars. Accessed January 19, 2018.

21. Roy, Farid, Ksaibati. Effects of Pavement Friction and Geometry on Traffic Crash Frequencies: A Case Study in Wyoming. International Journal of Pavement Research and Technology, Vol. 16, No. 6, 2023, pp. 1468–1481.

22. Ester, Kriegel. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (vol, pg 785, 2016). GEOGRAPHICAL ANALYSIS. 2022.

23. Misra, Bao. Modeling Pedestrian Injury Severity: A Case Study of Using Extreme Gradient Boosting Vs Random Forest in Feature Selection. Transportation Research Record, Vol. 2678, No. 1, 2023, pp. 1–11.

24. Zhang, Khattak, Matara C. M., Hussain, Farooq. Hybrid Feature Selection-Based Machine Learning Classification System for the Prediction of Injury Severity in Single and Multiple-Vehicle Accidents. PLoS One, Vol. 17, No. 2, 2022, p. e0262941.

25. Zhang, Shi, Zhang, Abraham. A XGBoost-Based Lane Change Prediction on Time Series Data Using Feature Engineering for Autopilot Vehicles. IEEE Transactions on Intelligent Transportation Systems, Vol. 23, No. 10, 2022, pp. 19187–19200.

26. Loh W. Y., Shih Y. S. Split Selection Methods for Classification Trees. Statistica Sinica, Vol. 7, No. 4, 1997, pp. 815–840.

27. Chen, Guestrin. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 785–794.

28. Sun, Wang, et al. Understanding Key Contributing Factors on the Severity of Traffic Violations by Elderly Drivers: A Hybrid Approach of Latent Class Analysis and XGBoost Based SHAP. International Journal of Injury Control and Safety Promotion, Vol. 31, No. 2, 2024, pp. 273–293.

29. Zhang, Chen, Xing, Feng. Prediction and Analysis of Likelihood of Freeway Crash Occurrence Considering Risky Driving Behavior. Accident Analysis & Prevention, Vol. 192, 2023, Article No. 107244.

30. Wang, Chen, Zhang, Wong, Zhang, Zhou. Toward Safer Highway Work Zones: An Empirical Analysis of Crash Risks Using Improved Safety Potential Field and Machine Learning Techniques. Accident Analysis & Prevention, Vol. 194, 2024, Article No. 107361.

31. Zahid, Habib, Ijaz, Ameer, Ullah, Ahmed, et al. Factors Affecting Injury Severity in Motorcycle Crashes: Different Age Groups Analysis Using Catboost and SHAP Techniques. Traffic Injury Prevention, Vol. 25, No. 3, 2024, pp. 472–481.

32. Slundberg. SHAP, 2017. https://github.com/slundberg/shap. Accessed November 5, 2024.

33. Zhao, Jiang, Tighe. Exploring Implicit Relationships Between Pavement Surface Friction and Vehicle Crash Severity Using Interpretable Extreme Gradient Boosting Method. Canadian Journal of Civil Engineering, Vol. 49, No. 7, 2022, pp. 1206–1219.

34. Shaon M. R. R., Zhao, Wang, Jackson. Developing a Data-Driven Network Screening Procedure for Systemic Safety Approach. Transportation Research Record, Vol. 2678, No. 3, 2023, pp. 348–364.

35. Gooch, Mahmud, Gross, Polin. Identification of Risk Factors for Severe Younger and Older Driver Crashes in Massachusetts. Transportation Research Record, Vol. 2678, No. 12, 2024, pp. 1950–1963.

36. FHWA. Kentucky Transportation Cabinet Applies Systemic Safety Project Selection Tool on Behalf of Local Agencies. Report No. FHWA-SA-13-023. Federal Highway Administration, U.S. Department of Transportation, Washington, D.C., 2013.