Abstract
This study identifies and analyzes the risk factors affecting serious multi-fatality crashes using Bayesian networks. First, a Bayesian network structure was constructed based on expert experience and the Dempster–Shafer evidence theory. Second, the structure was amended to satisfy a conditional-independence test. Finally, data on 484 serious multi-fatality crashes that occurred in China during 2000–2012 were input into the Bayesian network to calculate the posterior probability of each factor. The results showed that the most influential factor was driver behavior, followed by vehicle condition, road condition, and external environment. Among driver behaviors, speeding and mistaken adjustment had the greatest influence on serious crashes. The findings provide useful information for engineers taking corrective and preventive measures to reduce the probability of serious multi-fatality crashes.
Introduction
A serious multi-fatality crash is defined as a motor-vehicle crash resulting in more than 10 deaths. 1 Such crashes cause catastrophic losses of human life and property and can even threaten social stability. However, to date, little information about crashes resulting in many deaths and injuries has been published. One important reason could be that such crashes occur infrequently, especially in developed countries, so it is difficult to accumulate enough data to support research on them. According to the Annual Reports of Road Accidents of the People’s Republic of China, there were 484 serious multi-fatality crashes from 2000 to 2012, resulting in 7467 deaths and 7376 injuries. Although the number of crashes declined during this period, the average number of deaths per crash remained stable at between 14 and 17, meaning that the severity of these crashes remains worrying. Thus, it is necessary to analyze the factors affecting serious multi-fatality crashes and to propose countermeasures that reduce the frequency and severity of such crashes.
Most previous studies have analyzed the factors responsible for road accidents, and these factors were often classified into three groups: road user, road environment, and vehicle.2,3 Moreover, Evans 4 found that the major factors affecting traffic safety included driver behavior and performance, driver’s age and sex, vehicle, roadway, and environment factors. Similarly, in this study, the factors contributing to serious crashes are grouped into driver behavior, driver’s characteristics, vehicle, roadway, and environment factors.
Many researchers have collected large crash data sets and developed mathematical models to interpret various risk factors. For example, Abdel-Aty 5 studied crash data (1996–1997) and police reports (1999–2000) for Central Florida using ordered probit models and concluded that the major factors included driver behavior such as drink driving, vehicle factors such as vehicle type, road factors such as horizontal curves, environment factors such as area type and weather, and driver characteristics such as age and gender. Boufous et al. 6 studied police crash records and hospitalization data for New South Wales during 2000 and 2001 using multivariate linear regression and concluded that speeding, driver error, speed limit, and rural area influenced the severity of crashes involving older people. Wang et al. 7 studied 10,946 crash records at freeway exit segments in the state of Florida during 2003 and 2006 using the partial proportional odds model and found that the following major factors were responsible for these accidents: alcohol or drugs, heavy vehicles, number of lanes, curve and grade, surface condition, land type, weather, and light conditions. In addition to the models mentioned above, other models used for accident analysis include ordered logit models 8–10 and multinomial logit or probit models.11,12 However, these statistical methods treat all factors as independent rather than related. In reality, multiple factors contribute to serious multi-fatality crashes, and these factors are often interrelated.13,14
To avoid this problem, Bayesian networks have been applied as a promising approach. For example, Zhao et al. 15 developed a Bayesian network to explore the interactions among the factors affecting hazardous material transportation accidents and demonstrated how such a model can be used for accident analysis. Simoncic 16 captured the interrelations between various factors contributing to two-car accidents using Bayesian networks. Maglogiannis et al. 17 built a risk analysis model using a Bayesian network to represent the interplay among undesirable events in health information systems; the network identified and prioritized the most critical events. However, the structures of these Bayesian networks were constructed from the subjective judgments of experts, so they may differ from actual situations. In addition, many studies did not take conditional independence among factors into consideration, which may lead to incorrect interpretations of the relationships between factors in Bayesian networks.
To overcome these limitations,15,18 this article builds a Bayesian network structure based on expert knowledge using the Dempster–Shafer (D-S) evidence theory and modifies the structure according to a test for conditional independence. Then, the posterior probabilities of the various factors are computed using the expectation–maximization (EM) algorithm to determine their influence on serious multi-fatality crashes. Finally, interventions and countermeasures are proposed to reduce the probability of serious crashes.
Data collection
For modeling purposes, data on 484 serious multi-fatality crashes were collected from the Annual Reports of Road Accidents of the People’s Republic of China from 2000 to 2012. In these reports, each crash is documented in detail with text and pictures. The information includes the crash description, driver behavior, driver’s characteristics, vehicle and road conditions, external environment, crash casualties, and potential countermeasures. In particular, the temporal and spatial distributions of these crashes were analyzed. As shown in Figure 1(a), serious crashes are more likely to occur in March, April, August, and October. Among all regions shown in Figure 1(b), Southwest China has the highest percentage, which is mainly due to the poor road environment and adverse weather conditions in mountain areas.

Figure 1. Temporal and spatial distributions of serious multi-fatality crashes from 2000 to 2012: (a) distribution of serious multi-fatality crashes each month and (b) spatial distribution of serious multi-fatality crashes.
A Bayesian network structure for serious multi-fatality crashes
Bayesian networks combine graph theory with probability theory, and they can represent the interplay among the variables. For specific information about Bayesian networks, refer to the work of Pearl.14,15 Here, only a basic mathematical description is presented.
Bayesian networks are directed acyclic graphs in which nodes represent random variables (namely, risk factors in this study) and arcs represent direct probabilistic dependences among them. For a Bayesian network over the set of variables X = {X1, X2, …, Xn}, the joint probability distribution, P, can be expressed as formula (1)

$$P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P\big(X_i \mid \pi(X_i)\big) \tag{1}$$

where π (Xi) denotes the set of parent variables of variable Xi.
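As an illustration of this factorization, the following minimal sketch multiplies each variable's probability given its parents for a tiny hypothetical network (speeding → crash ← rain). The node names and all probabilities are assumptions chosen for illustration; they are not values from this study.

```python
from itertools import product

# Hypothetical three-node network: speeding -> crash <- rain.
# Structure and numbers are illustrative assumptions, not study results.
parents = {"speeding": (), "rain": (), "crash": ("speeding", "rain")}

# cpt[node][parent_values] = P(node = 1 | parents = parent_values)
cpt = {
    "speeding": {(): 0.3},
    "rain": {(): 0.2},
    "crash": {(0, 0): 0.01, (0, 1): 0.05, (1, 0): 0.10, (1, 1): 0.30},
}

def joint_probability(assignment):
    """Product over all nodes of P(node | parents(node))."""
    prob = 1.0
    for node, node_parents in parents.items():
        p_one = cpt[node][tuple(assignment[p] for p in node_parents)]
        prob *= p_one if assignment[node] == 1 else 1.0 - p_one
    return prob

def total_probability():
    """Sum of the joint over every assignment; should be 1 for a valid network."""
    names = list(parents)
    return sum(
        joint_probability(dict(zip(names, values)))
        for values in product((0, 1), repeat=len(names))
    )
```

Because each conditional table is normalized, the joint probabilities over all assignments sum to one, which is a quick sanity check on any hand-built network.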
In addition, a flowchart (see Figure 2) illustrates how the Bayesian network method is applied to identify the risk factors affecting serious multi-fatality crashes in China.

Figure 2. A flowchart for identifying risk factors by the Bayesian network method.
Define the factors
According to the theory of road safety, 4 this article defines driver behavior, vehicle condition, road condition, and external environment as the direct factors responsible for serious multi-fatality crashes. The indirect factors are driver’s gender, age, and driving experience, which may influence the probability of serious crashes by affecting driver behavior. The direct factors are also called parent factors, and each parent factor includes certain child factors. For example, in Table 1, driver behavior as a parent factor includes six child factors: alcohol or drug, dangerous lane change, speeding, mistaken adjustment, driving fatigue, and wrong-way movement. Table 1 provides the descriptions and value sets for these factors.
Table 1. Factors contributing to serious multi-fatality crashes.
For horizontal curve, RH and RL denote the turning radius of high-class and low-class roads, respectively. Similarly, for longitudinal gradient, iH and iL denote the longitudinal gradient of high-class and low-class roads, respectively. According to the national guideline, 19 both horizontal curve and longitudinal gradient are classified into the three groups listed in Table 1.
Based on the national specification JTG D20-2006, 20 if the sight distance for a high-class road (SH) is greater than the stopping sight distance (S), or the sight distance for a low-class road (SL) is greater than double the stopping sight distance (2S), then the sight distance is considered good; otherwise, it is considered poor.
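The sight-distance rule can be written as a small classifier. The sketch below is illustrative only: the function name, the "high"/"low" road-class labels, and the distances used in the example are assumptions and are not taken from JTG D20-2006.

```python
def sight_distance_state(road_class, sight_distance, stopping_sight_distance):
    """Classify sight distance as 'good' or 'poor'.

    High-class roads are compared against S, low-class roads against 2S,
    following the rule described in the text. All inputs share one unit.
    """
    if road_class not in ("high", "low"):
        raise ValueError("road_class must be 'high' or 'low'")
    factor = 1 if road_class == "high" else 2
    required = factor * stopping_sight_distance
    return "good" if sight_distance > required else "poor"

# Hypothetical distances in meters, for illustration only.
example_high = sight_distance_state("high", 120.0, 110.0)
example_low = sight_distance_state("low", 120.0, 110.0)
```

With the same available sight distance, the low-class road is classified as poor here because it must exceed twice the stopping sight distance.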
Finally, within the group of driver’s characteristics, both age and driving experience are discretized based on the distributions of crashes in the annual reports of road accidents.
Develop the Bayesian network structure
Develop the Bayesian network structure based on expert knowledge
For a Bayesian network with many factors, several simplifying assumptions based on the features of the risk factors are made to reduce the number of relationships that must be confirmed by experts. (1) The four direct factors (driver behavior, vehicle condition, road condition, and external environment) are independent of each other. Similarly, the three indirect factors (driver’s gender, age, and driving experience) are independent of each other. (2) The child factors belonging to the same parent factor are independent of each other. For example, the six child factors of driver behavior are mutually independent. (3) Serious multi-fatality crashes are affected only by direct factors. In other words, indirect factors influence driver behavior only through its child factors and do not influence driver behavior directly.
Based on the three assumptions, a preliminary Bayesian network structure can be developed, as shown in Figure 3. However, there are still many pairs of factors whose relationships are undetermined. Thus, the next step is to confirm these relationships based on the knowledge of experts.

Figure 3. Bayesian network structure based on assumptions.
Generally, there are four kinds of relations between two factors (i.e. Fi and Fj), which are as follows: (1) Fi → Fj denotes that Fi directly leads to Fj; (2) Fi ← Fj denotes that Fj directly leads to Fi; (3) Fi ↔ Fj denotes that the relation between Fi and Fj is not determinate; and (4) Fi | Fj denotes that there is no direct relation between Fi and Fj.
Five experts with experience in road safety were asked to identify the relations between the pairs of factors. Specifically, each expert assigned a value indicating the likelihood of a causal relationship for each pair of factors. To reduce the subjectivity of expert judgments, the D-S evidence theory, 18 a powerful and flexible mathematical tool for handling uncertain, imprecise, and incomplete information, was used to integrate the experts’ opinions and determine the causal relationships for the pairs of factors.
Under the D-S theory, a mass, or basic probability assignment (BPA), must be provided so that the combined mass (M) can be obtained through Dempster’s rule of combination. 21 In this study, the mass (m) equals the probability of a causal relationship assigned by an expert, and Dempster’s rule of combination can be written as formula (2)

$$M(A) = \frac{1}{1-K} \sum_{A_1 \cap A_2 \cap \cdots \cap A_n = A} \; \prod_{i=1}^{n} m_i(A_i), \qquad K = \sum_{A_1 \cap A_2 \cap \cdots \cap A_n = \emptyset} \; \prod_{i=1}^{n} m_i(A_i) \tag{2}$$

where A denotes one of the four kinds of relationships for a pair of factors, m_i(A_i) denotes the mass assigned by expert i to relationship A_i, K denotes the total conflicting mass, and n denotes the number of experts (n = 5). The relationship with the maximum combined mass is accepted as the relationship between the two factors. 15 Partial results of this process are listed in Table 2.
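To make the combination step concrete, the following sketch applies the standard form of Dempster's rule to the opinions of five experts on one pair of factors and then selects the relationship with the largest combined mass. The expert masses are hypothetical, and treating each of the four relationships as a singleton hypothesis is an assumption of this sketch rather than a detail confirmed by the study.

```python
from itertools import product

def combine_two(m1, m2):
    """Combine two mass functions (dict: frozenset -> mass) with Dempster's rule."""
    combined, conflict = {}, 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            combined[common] = combined.get(common, 0.0) + mass_a * mass_b
        else:
            conflict += mass_a * mass_b  # mass that would fall on the empty set
    if 1.0 - conflict < 1e-12:
        raise ValueError("experts are in total conflict; the rule is undefined")
    return {h: mass / (1.0 - conflict) for h, mass in combined.items()}

def combine_experts(opinions):
    """Fold Dempster's rule over all experts; keys are relationship labels."""
    masses = [{frozenset([rel]): m for rel, m in op.items()} for op in opinions]
    result = masses[0]
    for mass in masses[1:]:
        result = combine_two(result, mass)
    # Singleton hypotheses intersect to singletons, so unwrap the labels.
    return {next(iter(h)): m for h, m in result.items()}

# Hypothetical masses from five experts for one pair of factors.
opinions = [
    {"->": 0.6, "<-": 0.1, "<->": 0.2, "|": 0.1},
    {"->": 0.5, "<-": 0.2, "<->": 0.2, "|": 0.1},
    {"->": 0.7, "<-": 0.1, "<->": 0.1, "|": 0.1},
    {"->": 0.4, "<-": 0.3, "<->": 0.2, "|": 0.1},
    {"->": 0.6, "<-": 0.1, "<->": 0.1, "|": 0.2},
]
combined = combine_experts(opinions)
best_relation = max(combined, key=combined.get)
```

Because Dempster's rule is associative, combining the experts one at a time gives the same result as a single n-way combination.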
Table 2. Relationships between the factors contributing to serious multi-fatality crashes based on expert knowledge (partial table, showing only representative examples).
Using the D-S theory, the causal relationships for most pairs of factors can be confirmed. However, there are still 10 pairs of factors, such as S ↔ F in Table 3, for which the relationships are indeterminate. To solve this problem, mutual information is applied to estimate whether two factors are dependent. 15 By definition, the mutual information I (Fi; Fj) between two factors Fi and Fj can be expressed as formula (3)

$$I(F_i; F_j) = \sum_{F_i, F_j} P(F_i, F_j) \log \frac{P(F_i, F_j)}{P(F_i)\, P(F_j)} \tag{3}$$

where P(Fi) indicates the percentage of crashes with the factor Fi among all crashes, and P(Fi, Fj) indicates the percentage of crashes with the two factors (Fi, Fj) among all crashes. If I (Fi; Fj) is greater than a certain threshold, a causal relationship is considered to exist between Fi and Fj. Because serious multi-fatality crashes are low-probability events, the threshold is set to 0.05. Based on the 484 serious crashes, the mutual information for the 10 pairs of factors was calculated. For a pair of factors with a causal relationship (I (Fi; Fj) > 0.05), the direction of the relationship was determined by the experts. The resulting relationships for the 10 pairs of factors are shown in Table 3.
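The mutual-information screening can be sketched as follows, estimating each probability as a relative frequency over the crash records. The record layout (one dictionary per crash), the factor labels, and the use of the natural logarithm are assumptions of this sketch; the study does not state the logarithm base.

```python
from collections import Counter
from math import log

def mutual_information(records, fi, fj):
    """Estimate I(Fi; Fj) from records using relative frequencies (natural log)."""
    n = len(records)
    joint = Counter((r[fi], r[fj]) for r in records)
    count_i = Counter(r[fi] for r in records)
    count_j = Counter(r[fj] for r in records)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * log(p_ab / ((count_i[a] / n) * (count_j[b] / n)))
    return mi

THRESHOLD = 0.05  # threshold value stated in the text

# Hypothetical records: "S" and "F" are placeholder factor labels.
dependent = [{"S": 1, "F": 1}, {"S": 1, "F": 1}, {"S": 0, "F": 0}, {"S": 0, "F": 0}]
independent = [{"S": s, "F": f} for s in (0, 1) for f in (0, 1)]

has_link = mutual_information(dependent, "S", "F") > THRESHOLD
no_link = mutual_information(independent, "S", "F") > THRESHOLD
```

Only value combinations that actually occur in the data contribute to the sum, which avoids taking the logarithm of zero.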
Table 3. Relationships of 10 pairs of factors determined by using mutual information.
Based on the D-S theory and the concept of mutual information, the causal relationships among all factors contributing to serious crashes can be determined, and the preliminary Bayesian network structure is amended accordingly. Figure 4 displays the new Bayesian network structure based on expert knowledge, in which red arcs represent the links added between pairs of factors.

Figure 4. Bayesian network structure based on expert knowledge.
Generally, a Bayesian network developed using causal relationships must satisfy the assumption of conditional independence. 22 However, the structure of the new network does not take into consideration the conditional independence of the various factors. Thus, the structure of the Bayesian network shown in Figure 4 should be modified again.
Modify the Bayesian network structure based on conditional independence
According to assumption (3) in section “Develop the Bayesian network structure based on expert knowledge,” indirect factors can influence driver behavior only through its child factors, which implies conditional-independence relationships. However, contrary to this assumption, Figure 4 contains paths connecting an indirect factor with direct factors other than driver behavior. The sets of factors involved are (gender, driving fatigue, daylight) and (gender, wrong-way movement, median divider).
Geiger and Pearl 23 proved that all conditional-independence relationships can be discovered from the topology of a Bayesian network using a method called “directed separation” (D-separation). A typical algorithm proposed by Zhao et al. 15 was used to modify the structure of the Bayesian network. In this algorithm, conditional mutual information is used to test conditional independence. The conditional mutual information I (Fi; Fj | C) for a pair of factors (Fi and Fj) given condition C is defined as formula (4)

$$I(F_i; F_j \mid C) = \sum_{F_i, F_j, C} P(F_i, F_j, C) \log \frac{P(F_i, F_j \mid C)}{P(F_i \mid C)\, P(F_j \mid C)} \tag{4}$$

where C is the cut-set that D-separates factors Fi and Fj. When I (Fi; Fj | C) is smaller than a certain threshold, the factors Fi and Fj are considered to be conditionally independent given C. Because serious multi-fatality crashes are low-probability events, the threshold is set to 0.05.
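The conditional-independence test can be sketched in the same frequency-based way. The record layout, the factor labels, and the natural logarithm are assumptions of this sketch, and the example data are synthetic.

```python
from collections import Counter
from math import log

def conditional_mutual_information(records, fi, fj, cut_set):
    """Estimate I(Fi; Fj | C) from records; cut_set is a list of factor names."""
    n = len(records)
    n_abc, n_ac, n_bc, n_c = Counter(), Counter(), Counter(), Counter()
    for r in records:
        c = tuple(r[name] for name in cut_set)
        n_abc[(r[fi], r[fj], c)] += 1
        n_ac[(r[fi], c)] += 1
        n_bc[(r[fj], c)] += 1
        n_c[c] += 1
    cmi = 0.0
    for (a, b, c), count in n_abc.items():
        # P(a,b|c) / (P(a|c) P(b|c)) simplifies to n_abc * n_c / (n_ac * n_bc).
        ratio = count * n_c[c] / (n_ac[(a, c)] * n_bc[(b, c)])
        cmi += (count / n) * log(ratio)
    return cmi

# Synthetic case 1: X and Y both copy C, so they are independent given C.
copies = [{"X": v, "Y": v, "C": v} for v in (0, 1) for _ in range(2)]
# Synthetic case 2: X equals Y regardless of C, so they stay dependent given C.
linked = [{"X": v, "Y": v, "C": c} for v in (0, 1) for c in (0, 1)]

cmi_copies = conditional_mutual_information(copies, "X", "Y", ["C"])
cmi_linked = conditional_mutual_information(linked, "X", "Y", ["C"])
```

In the first case the dependence between X and Y is fully explained by C, so the estimate is zero; in the second it is not, so the estimate stays well above the 0.05 threshold.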
However, such an algorithm is not completely suitable for the network structure in Figure 4. There is only one path for the pairs of factors (daylight, driving fatigue), (gender, driving fatigue), (median divider, wrong-way movement), and (gender, wrong-way movement). Therefore, in order to use the typical algorithm, a dummy factor and two dummy arcs were added into the graph (in Figure 5).

Figure 5. Modification of the Bayesian network structure: (a) modified structure for factors daylight, driving fatigue, and gender and (b) modified structure for factors median divider, wrong-way movement, and gender.
According to Figure 5(a), there are two paths between factors “daylight” and “driving fatigue.” When the arc between the two factors was removed temporarily, the cut-set included “dummy factor” and “gender.” Using formula (4), the conditional mutual information was equal to 0.102 (I > 0.05), meaning that “daylight” and “driving fatigue” are not conditionally independent given “dummy factor” and “gender.” The arc was therefore added back to the graph. In the same way, the arc between nodes “gender” and “driving fatigue” was removed temporarily. The conditional mutual information was equal to 0.004 (I < 0.05), meaning that “gender” and “driving fatigue” are conditionally independent given “dummy factor” and “daylight.” This arc was therefore removed permanently. Similarly, in Figure 5(b), based on the conditional mutual information, the arc between factors “median divider” and “wrong-way movement” was added back and the arc between factors “gender” and “wrong-way movement” was removed permanently. As a result, Figure 6 shows the Bayesian network structure for serious multi-fatality crashes after eliminating links according to the test of conditional independence.
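The keep-or-remove decision for the two arcs in Figure 5(a) can be expressed as a simple threshold check. The sketch below uses only the two conditional mutual information values reported in the text; the function and variable names are illustrative assumptions.

```python
THRESHOLD = 0.05  # threshold value stated in the text

def is_conditionally_independent(cmi, threshold=THRESHOLD):
    """An arc is dropped when the conditional mutual information is below the threshold."""
    return cmi < threshold

# Values reported in the text for the arcs tested in Figure 5(a).
tested_arcs = {
    ("daylight", "driving fatigue"): 0.102,
    ("gender", "driving fatigue"): 0.004,
}

kept_arcs = [arc for arc, cmi in tested_arcs.items()
             if not is_conditionally_independent(cmi)]
removed_arcs = [arc for arc, cmi in tested_arcs.items()
                if is_conditionally_independent(cmi)]
```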

Figure 6. Bayesian network structure based on conditional independence.
Learning the parameters of the Bayesian network
Based on expert knowledge and conditional independence, the Bayesian network structure for serious multi-fatality crashes has been established. Then, probabilistic inference can be conducted to predict the posterior probabilities of various factors. In this study, the probability of serious multi-fatality crash was set to 100% because all crashes have already happened. Pearl 24 proved that the posterior probability can capture the relationships and interplay among the variables that describe a situation. Thus, this research attempts to identify the factors affecting serious crashes and analyze the interplay among these factors.
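To show what setting the crash node to 100% means for inference, the following sketch computes posterior probabilities by enumeration in a tiny hypothetical network (speeding → crash ← rain). The structure and all numbers are illustrative assumptions, not values from the study's network.

```python
from itertools import product

# Hypothetical priors and crash probabilities, for illustration only.
P_SPEEDING = 0.3
P_RAIN = 0.2
P_CRASH = {(0, 0): 0.01, (0, 1): 0.05, (1, 0): 0.10, (1, 1): 0.30}

def posterior_given_crash():
    """Return P(speeding = 1 | crash = 1) and P(rain = 1 | crash = 1)."""
    joint = {}
    for s, r in product((0, 1), repeat=2):
        p_s = P_SPEEDING if s else 1.0 - P_SPEEDING
        p_r = P_RAIN if r else 1.0 - P_RAIN
        joint[(s, r)] = p_s * p_r * P_CRASH[(s, r)]  # P(s, r, crash = 1)
    evidence = sum(joint.values())                   # P(crash = 1)
    post_speeding = sum(v for (s, _), v in joint.items() if s == 1) / evidence
    post_rain = sum(v for (_, r), v in joint.items() if r == 1) / evidence
    return post_speeding, post_rain

post_speeding, post_rain = posterior_given_crash()
```

In this toy example both posteriors exceed their priors, because each factor raises the crash probability and the crash is observed.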
Specifically, the 484 serious multi-fatality crashes were used to estimate the posterior probabilities of the network with the EM algorithm. 22 The EM algorithm 15 is generally used to find maximum-likelihood estimates for a set of parameters when the data set is incomplete. To perform this operation, GeNIe 2.0, software developed at the Decision Systems Laboratory, University of Pittsburgh, was adopted. Figure 7 displays the posterior probabilities, estimated with GeNIe 2.0, for the factors that influence serious multi-fatality crashes.
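The idea behind EM parameter learning with incomplete data can be sketched on a two-node network A → B in which some values of A are missing. This is a minimal illustration under assumed data and starting values; it is not the GeNIe implementation and does not reproduce the study's network.

```python
def em_two_node(records, iterations=50):
    """EM for A -> B with binary nodes; records are (a, b) with a possibly None.

    Returns P(A = 1) and a dict giving P(B = 1 | A = a).
    """
    p_a = 0.5
    p_b = {0: 0.4, 1: 0.6}  # asymmetric starting values (an assumption)
    n = len(records)
    for _ in range(iterations):
        # E-step: expected value of A for each record under current parameters.
        weights = []
        for a, b in records:
            if a is None:
                like1 = p_a * (p_b[1] if b else 1.0 - p_b[1])
                like0 = (1.0 - p_a) * (p_b[0] if b else 1.0 - p_b[0])
                weights.append(like1 / (like1 + like0))
            else:
                weights.append(float(a))
        # M-step: re-estimate parameters from expected counts.
        n1 = sum(weights)
        n0 = n - n1
        p_a = n1 / n
        p_b[1] = sum(w * b for w, (_, b) in zip(weights, records)) / n1
        p_b[0] = sum((1.0 - w) * b for w, (_, b) in zip(weights, records)) / n0
    return p_a, p_b

# Synthetic records; None marks a missing observation of A.
complete = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 1)]
incomplete = complete + [(None, 1), (None, 0)]

p_a_complete, p_b_complete = em_two_node(complete)
p_a_incomplete, p_b_incomplete = em_two_node(incomplete)
```

With complete data the procedure reduces to counting relative frequencies; with missing values, each incomplete record contributes fractionally to both states of A.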

Figure 7. Posterior probabilities for factors contributing to serious multi-fatality crashes.
Results and discussions
Based on the results in Figure 7, the posterior probability for driver behavior is 0.64, which is higher than those for vehicle condition (0.57), road condition (0.50), and external environment (0.39). This result shows that even when the vehicle and road provide the necessary level of safety, driver error and illegal behavior are still the major factors causing serious multi-fatality crashes.
Driver behavior
Speeding has the highest posterior probability (0.67) within the driver behavior group. With the improvement of vehicle performance and roadway conditions, speeding violations in China are becoming increasingly frequent. The higher the speed, the greater the potential damage. Thus, police officers should strengthen enforcement against speeding violations.
Mistaken adjustment has the second highest posterior probability (0.45), which is realistic given the current level of drivers’ skills in China. Because driving tests are poorly supervised, many licensed drivers lack adequate driving skills. In emergencies especially, they often take mistaken countermeasures that result in more casualties and property losses. Therefore, the government should implement training programs and driving tests to improve drivers’ skills and raise their safety awareness.
Wrong-way movement has a relatively high posterior probability (0.22), followed by dangerous lane change (0.15), driving fatigue (0.12), and alcohol or drug (0.05). First, wrong-way movement is likely to cause serious head-on collisions. To prevent this behavior, median dividers or separation barriers can be installed on certain roadways. Second, dangerous lane change is a common problem in traffic safety, but this behavior is hard to change in a short time. Thus, sustained education and publicity are required. Third, to reduce driving fatigue, a rule referred to as “84220” has been issued in China. Under this rule, a driver’s accumulated driving time should not exceed 8 h within 24 h; continuous driving time should not exceed 4 h during the day or 2 h during the night; and each rest period should last at least 20 min. Finally, few serious crashes involve alcohol or drugs. The main reason could be that, in China, driving under the influence of alcohol or drugs is a criminal offense. Given the high cost of a violation, drivers are unlikely to break the law.
Vehicle condition
In the category of vehicle size, large vehicle has the highest posterior probability (0.59). Large vehicles with heavy weight or high passenger density are more likely to cause serious casualties, and overloading (0.40) can further increase the severity of crashes. In addition, vehicle defect (0.41) is also a significant factor contributing to serious crashes. Therefore, large vehicles should be tracked and monitored in real time, especially large buses carrying many passengers. Specifically, all large vehicles should be encouraged to install Global Positioning System (GPS) devices, and the performance of large vehicles should be tested carefully before departure.
Road condition
Road condition, which includes 10 contributing factors, has a complex influence on serious crashes. The 10 factors can be divided into two groups: factors with two states and factors with three states. Among the factors with two states, median divider (0.72) and roadside safety facilities (0.50) have relatively high posterior probabilities, which indicates that these two interventions can significantly reduce the potential for serious crashes. A median divider separates traffic flows in opposite directions and thereby avoids severe head-on collisions. Roadside safety facilities (i.e. guardrail, crash cushion cylinder, and wall), as passive protection measures, can reduce crash severity. For the other factors with two states, their degree of influence on serious crashes, sorted in descending order, is surface friction (0.40), road class (0.33), sight distance (0.31), and surface evenness (0.21).
For the horizontal curve, serious crashes are most likely to occur on straight roads (0.49), followed by sharp curves (0.30). Similarly, for the longitudinal gradient, small slope has the highest posterior probability (0.42), followed by large slope (0.37). These findings can be explained in two ways. When road alignment is good (i.e. straight and small slope), driving speeds are often high and may even exceed speed limits. When road alignment is poor (i.e. sharp curve and large slope), some drivers may operate their vehicles incorrectly and even lose control of them.
Roads with more lanes have higher posterior probability due to higher operation speeds. Moreover, the modeling results indicate that higher speed limits are associated with higher crash severities, suggesting that future speed limit changes need to be carefully assessed on a case-by-case basis.
In summary, in terms of road condition, there are two ways to reduce the potential for serious crashes. When road condition is good, police officers should keep driving speeds within a reasonable range and strengthen enforcement against speeding violations. When road condition is poor, engineers cannot depend only on speed control measures. Other engineering measures (i.e. separated guardrail, crash cushion cylinder and wall, and median divider) are also required. In addition, warning information about adverse weather, poor sight distance, sharp curves, and large slopes should be monitored and disseminated to raise drivers’ safety awareness.
External environment
Within the external environment group, mountain area has the highest posterior probability (0.50), followed by rain (0.34), daylight (0.29), and fog (0.17). In a collision in a mountain area, vehicles often lose control and roll down cliffs, resulting in catastrophic casualties. Thus, roadside safety facilities such as crash cushion walls are required. Based on our results, rain and fog affect drivers’ visibility and pavement friction, leading to more deaths and injuries. In addition, poor light conditions and driving fatigue during the night are also significant factors affecting serious crashes.
Driver characteristics
In the category of gender, male drivers are associated with a higher risk of serious crashes, with a posterior probability of 0.95. According to our results, drivers aged between 23 and 45 years are more likely to be involved in crashes, and drivers who have held a license for 6 to 15 years have a relatively high posterior probability (0.42). The most likely reason could be that these drivers are often overconfident in their physical state and driving skills.
Conclusion
In this study, Bayesian networks were used to investigate the influence of risk factors on serious multi-fatality crashes. The Bayesian network was developed based on expert experience and the D-S theory, modified based on the test for conditional independence, and supplied with data from 484 serious crashes that occurred in China from 2000 to 2012. The posterior probability of each factor was calculated using the EM algorithm in GeNIe 2.0. As a result, the relative influence of the risk factors was revealed, and corresponding countermeasures were discussed and suggested.
According to our results, driver behavior was the most influential factor causing serious crashes, and speeding and mistaken adjustment had greater influence than the other behaviors. Vehicle condition had the second highest posterior probability. Within this group, large vehicles with heavy weights or many passengers were more likely to cause serious casualties. Road condition had the third highest posterior probability. For the factors with two states, their degree of influence, sorted in descending order, was median divider, roadside safety facilities, surface friction, road class, sight distance, and surface evenness. Higher speed limits were associated with more serious crashes, suggesting that future speed limit changes need to be carefully assessed on a case-by-case basis. In addition, given poor road alignments, active speed controls are not enough to reduce the crash rate; engineering interventions are also required. Finally, external environment had the fourth highest posterior probability. Of these factors, mountain area had the greatest influence on serious crashes. The results of this study are consistent with the reality of serious multi-fatality crashes in China and point to possible measures for reducing the frequency of such crashes.
However, the method in this manuscript has a significant limitation: it assumes that some factors are independent of each other in order to reduce the number of relationships that must be confirmed by experts. If possible, all relationships among risk factors should be confirmed and determined.
Footnotes
Academic Editor: Yongjun Shen
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The project was supported by the Open Fund for the Key Laboratory for Traffic and Transportation Security of Jiangsu Province (TTS2015-09).
