Optimal road accident case retrieval algorithm based on k-nearest neighbor

Abstract

This article proposes an optimal algorithm that helps traffic managers make more accurate decisions from knowledge of previous road accident cases. The algorithm, based on k-nearest neighbor, determines the weight of each accident case feature from an information entropy index, establishes a road accident case retrieval base using a two-step cluster algorithm, and proposes a global similarity model of road accident cases. A new comprehensive evaluation index called matching degree is then presented. A prototype system is developed to conduct case retrieval experiments that verify the performance of the proposed algorithm for road accident case retrieval. The results of the experiments demonstrate the effectiveness of this case retrieval algorithm for road accident management in real time.

Keywords

Road accident, case retrieval base, similarity model, case retrieval evaluation, case retrieval algorithm

Introduction

Case-based reasoning (CBR) is a problem-solving paradigm that reuses knowledge from previous, corresponding cases. The CBR cycle contains four main steps:¹ case retrieve, reuse, revise, and retain. CBR has been applied extensively in assessing intelligent transportation system (ITS) benefits, traffic pattern recognition, decision support, and traffic emergency management.²⁻⁸ Road accident emergency management strategies, consisting of traffic control, road wrecker, and emergency rescue, are always made based on accident features such as road accident time, accident location, and accident type. It is therefore feasible to use the CBR approach to derive road accident management measures from a previous case whose accident features are similar to those of the input case. Generally speaking, case retrieve is the most important step in the CBR process. A number of retrieval algorithms currently exist: (1) nearest neighbor (NN), (2) induction algorithms, and (3) template retrieval. NN is used more frequently than the others in current CBR transportation applications. Sadek et al.⁹ used the NN approach to retrieve traffic diversion strategies, with the weight of each case feature assumed to be equal to 1.0. Chowdhury et al.¹⁰ analyzed diversion strategies under incident conditions based on the NN algorithm in a CBR model. The performance of case retrieval was evaluated by subjectively inputting eight different combinations of weights for the case features (e.g. volume multiplier, incident duration, percentage of drivers diverting, and incident location), and the best combination of weights was selected.¹⁰ Hoogendoorn et al.¹¹ developed a CBR system to help make traffic management decisions, using mean fuzzy membership to determine the case weights. Ji and Liu¹² also used the NN algorithm to retrieve similar decision-making cases for traffic congestion management, with the analytic hierarchy process (AHP) approach used to determine the weight of each case feature.
Cluster analysis is also commonly used to improve similarity calculation.¹³ Rousseeuw and Kaufman applied the hierarchical agglomerative clustering method on similar occasions. Saharan and Baragona¹⁴ used cluster analysis to measure the similarity of the factors, both on the whole data set and separately for each severity level, to outline the association between accident type and the factors involved.

Although the NN case retrieval algorithm plays an important role in the research and application of traffic congestion management, traffic incident management, traffic management evaluation, and so on, two key problems remain:

(1) If the case base contains no case fully similar to the input one, users may need to make decisions according to scattered information that resembles individual features of the input case. However, the NN case retrieval algorithm cannot retrieve these data (cases) effectively.

(2) An effective index for evaluating the retrieved case set has not been considered. It is too one-sided to evaluate case retrieval accuracy based only on the value of the maximum similarity index.

A road accident case retrieval algorithm based on k-NN is proposed in this study. The presented optimal k-NN case retrieval algorithm has not been covered in most previous studies. Building on previous studies and the critical issues above, this study determines the weight of each accident case feature based on an information entropy index. Based on a two-step cluster algorithm, a road accident case retrieval base is constructed to lay a solid foundation for equalizing case retrieval. Similarity models, including two local similarity models and a global similarity model of road accident cases, are proposed. A new evaluation index called integrated matching degree is then proposed for evaluating the retrieved road accident case set. A road accident case retrieval prototype system is developed, using the road accident database of the Shanghai–Hangzhou freeway in China as the original case base. Finally, a series of experiments is carried out to verify the effectiveness of the presented optimal case retrieval algorithm by analyzing both the maximum similarity and the integrated matching degree of the retrieved case set.

This article is organized as follows. Section “Case retrieval algorithm” proposes an optimal k-NN case retrieval algorithm, which includes determination of each case feature weight, development of a case retrieval base, and road accident case similarity model. Section “Case retrieval evaluation” details a new evaluation index called matching degree, including local matching degree and integrated matching degree. A prototype system is developed, and the comparative analysis is presented after 40 road accident case retrieval experiments are conducted in section “Case retrieval experiments.” Concluding comments are reported in section “Conclusion.”

Case retrieval algorithm

The k-NN algorithm computes a weighted sum of the case feature similarities and forms a retrieved case set containing a certain number of cases. The first step is therefore to determine the weights and prepare a dedicated case base for retrieval.

Weight determination

Setting each case feature weight accurately is critical to case retrieval because a road accident case usually contains both numerical data (e.g. accident location) and enumeration data (e.g. accident type). Information entropy is a good index for evaluating the degree of dispersion of either numerical or enumeration data. The value of each feature weight is therefore determined objectively from a data dispersion analysis of each road accident case feature.

Assume that a system may be in many different states and the probability of each system state is $p_{i}$ $(i = 1, 2, \dots, m)$, so the information entropy can be defined as¹⁵

$$H = - k \sum_{i = 1}^{m} p_{i} \ln p_{i} \tag{1}$$

Similarly, the information entropy of the jth road accident case feature can be defined as follows

$$H_{j} = - k \sum_{i = 1}^{m} p_{i} \ln p_{i}, \quad j = 1, 2, \dots, n \tag{2}$$

where m represents the number of system states, n is the number of case features, and $k = 1 / \ln m$. The range of information entropy is [0, 1], which makes it suitable for analyzing data dispersion.

The closer the information entropy is to 0, the smaller the degree of dispersion and the smaller the weight needed for accurate case retrieval, and vice versa.
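As a quick check on the stated range, the two extreme cases of equation (1) can be worked out with $k = 1 / \ln m$. The convention $0 \ln 0 = 0$ used in the second case is an assumption that the text does not state explicitly.

For a uniform distribution, $p_{i} = 1 / m$ for all i (maximum dispersion):

$$H = - \frac{1}{\ln m} \sum_{i = 1}^{m} \frac{1}{m} \ln \frac{1}{m} = - \frac{1}{\ln m} \ln \frac{1}{m} = \frac{\ln m}{\ln m} = 1$$

For a degenerate distribution, $p_{1} = 1$ and $p_{i} = 0$ for $i > 1$ (no dispersion):

$$H = - \frac{1}{\ln m} \left( 1 \cdot \ln 1 + \sum_{i = 2}^{m} 0 \cdot \ln 0 \right) = 0$$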

The weight value of road accident feature j is modeled as follows

$$w_{j} = \frac{H_{j}}{\sum_{j = 1}^{n} H_{j}} \tag{3}$$

An original case base is developed from a road accident database containing 3542 road accident cases that occurred on the Shanghai–Hangzhou freeway in China. The information entropy of each road accident case feature (accident location, accident date, accident time, accident type, accident severity, and weather condition) is computed according to equation (2). The weight of each road accident feature is then calculated with equation (3), as shown in Table 1.

Table 1.

Weights of the road accident case features.

Case feature	Accident location	Accident date	Accident time	Accident type	Accident severity	Weather condition
Information entropy	0.809561	0.998445	0.826354	0.597972	0.179309	0.459111
Weight	0.209148	0.257946	0.213487	0.154485	0.046324	0.11861
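The calculation in equations (2) and (3) can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation. It assumes that the state probabilities are estimated by relative frequencies and that m is the number of distinct observed values. The function names and the feature keys are hypothetical. The entropy values are copied from Table 1, so the printed weights can be compared with the table's Weight row.

```python
import math
from collections import Counter


def normalized_entropy(values):
    """Equation (2): H = -k * sum(p_i * ln p_i), with k = 1 / ln(m)."""
    counts = Counter(values)
    m = len(counts)  # number of distinct observed states (assumption)
    if m <= 1:
        return 0.0  # zero or one state: no dispersion
    total = len(values)
    k = 1.0 / math.log(m)
    return -k * sum((c / total) * math.log(c / total) for c in counts.values())


def entropy_weights(entropies):
    """Equation (3): each weight is the feature's share of the total entropy."""
    total = sum(entropies.values())
    return {feature: h / total for feature, h in entropies.items()}


# Information entropy values copied from Table 1 (keys are hypothetical names).
TABLE1_ENTROPY = {
    "accident_location": 0.809561,
    "accident_date": 0.998445,
    "accident_time": 0.826354,
    "accident_type": 0.597972,
    "accident_severity": 0.179309,
    "weather_condition": 0.459111,
}

weights = entropy_weights(TABLE1_ENTROPY)
for feature, w in weights.items():
    print(f"{feature}: {w:.6f}")
```

Because the weights are shares of the total entropy, they sum to 1, which is the property that the global similarity model relies on.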

Construction of case retrieval base

Because the weights of accident type, accident severity, and weather condition are smaller, the retrieval accuracy of these three features is lower than that of the other three (accident location, accident date, and accident time). If the case base contains no case identical to the input one, users still want to summarize countermeasures from scattered information in the retrieved case set that resembles individual features of the input case. It is therefore important to ensure, as far as possible, that every case feature can be retrieved exactly. To this end, the original case base is divided into several sub-case bases according to case feature categories using a two-step cluster algorithm, and is thereby transformed into a road accident case retrieval base. Before two-step clustering can be applied effectively, some road accident data, including accident location, accident type, accident severity, and weather condition, must first be classified.

Road accident locations can be classified into two groups, accident prone locations and occasional accident locations, based on the accident cumulative frequency curve method.¹⁶ At the same time, the data of accident type, accident severity, and weather condition are each classified so as to balance the data frequency distribution. The results of data classification are shown in Tables 2–5.

Table 2.

Data classification of road accident location.

Accident feature	Data classified	Location label
Accident location	APL	K108, K120, K116, K110, K114, K135, K105, K117, K75, K76, K138, K109, K99
Accident location	OAL	Others in the case base

Table 3.

Data classification of road accident type.

Accident type	Data classified
V-V crash	V-V crash
V-R crash	V-R crash
Rollover	Others
Vehicle scraped	Others
Crushed	Others
Fired	Others
Drop from vehicle	Others
Other types	Others

Table 4.

Data classification of road accident severity.

Accident severity	Data classified
LOP	LOP
Injury	Casualty
Death	Casualty

Table 5.

Data classification of weather condition.

Weather condition	Data classified
Sunshine	Sunshine
Wind	AW
Fog	AW
Snow	AW
Cloud	AW
Rain	AW

The acronyms of some road accident features are shown as follows:

OAL: occasional accident location;

APL: accident prone location;

PP: peak period;

LP: low period;

LOP: loss of property;

V-V crash: vehicle-to-vehicle crash;

V-R crash: vehicle-to-roadside facility crash;

AW: abnormal weather (e.g. wind, fog, snow, cloud, and rain).
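The classification in Tables 2–5 amounts to a set of lookup rules applied before clustering. The sketch below shows one way to express them. The function names and the use of plain strings for raw values are assumptions made for illustration. The grouping of injury and death into "Casualty" and of the less frequent accident types into "Others" follows a reading of the tables' layout. The two-step clustering itself is not shown.

```python
# Accident prone location (APL) labels listed in Table 2.
APL_LABELS = {
    "K108", "K120", "K116", "K110", "K114", "K135", "K105",
    "K117", "K75", "K76", "K138", "K109", "K99",
}


def classify_location(label):
    """Table 2: listed labels are APL, all others are OAL."""
    return "APL" if label in APL_LABELS else "OAL"


def classify_type(accident_type):
    """Table 3: keep the two crash types, group the rest as Others."""
    return accident_type if accident_type in ("V-V crash", "V-R crash") else "Others"


def classify_severity(severity):
    """Table 4: loss of property stays LOP, injury and death become Casualty."""
    return "LOP" if severity == "LOP" else "Casualty"


def classify_weather(weather):
    """Table 5: everything except sunshine is abnormal weather (AW)."""
    return "Sunshine" if weather == "Sunshine" else "AW"


# Hypothetical raw record, used only to show the mapping.
raw = {"location": "K108", "type": "Rollover", "severity": "Injury", "weather": "Fog"}
classified = {
    "location": classify_location(raw["location"]),
    "type": classify_type(raw["type"]),
    "severity": classify_severity(raw["severity"]),
    "weather": classify_weather(raw["weather"]),
}
print(classified)
```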

Based on two-step cluster algorithm, the original case base is divided into nine sub-case bases. The structure of road accident case retrieval base is shown in Figure 1.

Figure 1.

Structure of road accident case retrieval base.

Similarity model of road accident cases

A road accident case always contains both categorical and numerical data. Methods for calculating the similarity of numerical and categorical data are presented first, and a model of global similarity is then established.

Local similarity

Local similarity of numerical data

Assume the mth case feature of cases $C_{i}$ and $C_{j}$ is numerical, with values $V_{im}$ and $V_{jm}$, respectively. The features can be normalized to ensure the same range and impact: $V_{im}, V_{jm} \in [0, 1]$. The local similarity of the mth case feature between $C_{i}$ and $C_{j}$ can then be calculated as follows

$$sim(C_{im}, C_{jm}) = 1 - D(C_{im}, C_{jm}) = 1 - | V_{im} - V_{jm} | \tag{4}$$

where sim is the similarity function, D is the distance function, $C_{im}$ is the mth case feature of case $C_{i}$, and $C_{jm}$ is the mth case feature of case $C_{j}$.

Local similarity of categorical data

If two cases have the same categorical data, the local similarity of the two categorical data equals 1.0; otherwise, it is 0.

The model is listed below

$$sim(C_{im}, C_{jm}) = \begin{cases} 1, & C_{im} = C_{jm} \\ 0, & C_{im} \neq C_{jm} \end{cases} \tag{5}$$

Global similarity

The model of global similarity is defined as follows

$$sim(C_{i}, C_{j}) = \sum_{m = 1}^{n} w_{m} \, sim(C_{im}, C_{jm}) \tag{6}$$

where $w_{m}$ denotes the weight of the mth feature of a case, and all of the weights of road accident case features are shown in Table 1; n represents the number of features in a case and so $\sum_{m = 1}^{n} w_{m} = 1$.
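Equations (4)–(6) can be sketched as follows. This is a minimal illustration under stated assumptions, not the prototype system's code. Cases are represented as dictionaries with hypothetical keys, numerical features are assumed to be already normalized to [0, 1], and treating location, date, and time as the numerical features is an assumption, since the text only names accident location as an example of numerical data. The weights are those of Table 1.

```python
# Feature weights from Table 1 (keys are hypothetical field names).
WEIGHTS = {
    "location": 0.209148,
    "date": 0.257946,
    "time": 0.213487,
    "type": 0.154485,
    "severity": 0.046324,
    "weather": 0.11861,
}

# Assumption: these features are numerical and normalized to [0, 1].
NUMERICAL = {"location", "date", "time"}


def local_similarity(a, b, numerical):
    """Equation (4) for numerical values, equation (5) for categorical ones."""
    if numerical:
        return 1.0 - abs(a - b)
    return 1.0 if a == b else 0.0


def global_similarity(case_i, case_j, weights=WEIGHTS, numerical=NUMERICAL):
    """Equation (6): weighted sum of the local similarities."""
    return sum(
        w * local_similarity(case_i[f], case_j[f], f in numerical)
        for f, w in weights.items()
    )


# Two hypothetical cases with made-up normalized values.
case_a = {"location": 0.40, "date": 0.10, "time": 0.50,
          "type": "V-V crash", "severity": "LOP", "weather": "Sunshine"}
case_b = {"location": 0.50, "date": 0.10, "time": 0.75,
          "type": "V-V crash", "severity": "Casualty", "weather": "Sunshine"}

print(round(global_similarity(case_a, case_b), 4))
```

Since the weights sum to 1, a case compared with itself has a global similarity of 1.0, and any mismatch lowers the score in proportion to the weight of the mismatched feature.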

The global similarity between the input case and each case in the road accident case retrieval base can be calculated successively through equations (4)–(6). A retrieved case set can then be formed. The number of cases selected from each sub-case base according to global similarity is about $[\sqrt{k_{l}}]$.¹⁷ The total number of cases in the retrieved case set is therefore defined as

$$k = \sum_{l = 1}^{9} \sqrt{k_{l}} \tag{7}$$

where $k_{l}$ denotes the number of cases in the lth sub-case base, and k represents the total number of cases in the retrieved case set.
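The selection step behind equation (7) can be sketched as below. The sketch assumes that $[\sqrt{k_{l}}]$ means the integer part (floor) of the square root, which the text does not state explicitly, and it takes the similarity function as a parameter so that it stays independent of any particular case representation. The function name and the toy data are hypothetical.

```python
import math


def retrieve_case_set(input_case, sub_case_bases, similarity):
    """Take the floor(sqrt(k_l)) most similar cases from each sub-case base."""
    retrieved = []
    for cases in sub_case_bases:
        quota = math.isqrt(len(cases))  # assumption: [sqrt(k_l)] is the floor
        ranked = sorted(cases, key=lambda c: similarity(input_case, c), reverse=True)
        retrieved.extend(ranked[:quota])
    return retrieved


# Toy example: cases are single normalized numbers, similarity is 1 - |a - b|.
toy_bases = [
    [0.0, 0.2, 0.4, 0.6],                                   # k_l = 4, quota 2
    [0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 0.35, 0.45, 0.55],      # k_l = 9, quota 3
]
toy_result = retrieve_case_set(0.5, toy_bases, lambda a, b: 1.0 - abs(a - b))
print(len(toy_result))
```

Taking a share from every sub-case base, rather than the overall top k, is what keeps cases from each feature category represented in the retrieved set.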

Case retrieval evaluation

Traffic managers usually want to retrieve a case identical to the road accident that has just happened in order to assist their decision-making. In most previous studies, retrieving the maximum similarity case is also identified as the only aim of case retrieval, or even of case reasoning, and the maximum similarity degree is used to evaluate the effectiveness of case retrieval. However, if the retrieved case set contains no case identical to the one the users need, the maximum global similarity is not equal to 1.0 ($sim_{\max}(C_{i}, C_{j}) \neq 1$), and the users need a comprehensive evaluation of the whole retrieved case set.

A new comprehensive evaluation index called matching degree is presented. Matching degree represents the degree of each case in the retrieved case set to match the input case. Assume that there are k cases in a retrieved case set. If the data of the mth feature in a case are numerical, the local matching degree of this feature is defined as follows

$$p_{m} = w'_{m} \max_{t} \left( sim_{t}(C_{im}, C_{jm}) \right), \quad t = 1, 2, \dots, k \tag{8}$$

Assume $w'_{m} = 1 / n$ $(m = 1, 2, \dots, n)$, where $w'_{m}$ represents the weight of the mth feature of a case in the calculation process of local matching degree; and $p_{m}$ denotes the local matching degree of the mth feature of cases in a retrieved case set.

If the data of the mth feature in a case is categorical, the local matching degree of this feature is defined as follows

$$p_{m} = \begin{cases} w'_{m}, & \sum_{t = 1}^{k} sim_{t}(C_{im}, C_{jm}) \geq 1 \\ 0, & \sum_{t = 1}^{k} sim_{t}(C_{im}, C_{jm}) = 0 \end{cases} \quad t = 1, 2, \dots, k \tag{9}$$

And for a certain input case, the integrated matching degree of retrieved case set can be finally defined as

$$p = \sum_{m = 1}^{n} p_{m} \tag{10}$$

where p denotes the integrated matching degree of a retrieved case set.
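Equations (8)–(10) can be sketched as follows. The sketch assumes that $sim_{t}$ is the local similarity between the input case and the tth retrieved case, which is how the surrounding text reads but is not defined explicitly. Cases are dictionaries with hypothetical keys, numerical values are assumed to be normalized to [0, 1], and returning 0 for an empty retrieved set is a choice made here for robustness.

```python
def integrated_matching_degree(input_case, retrieved, numerical, categorical):
    """Sum of local matching degrees over all features, with w'_m = 1 / n."""
    n = len(numerical) + len(categorical)
    if n == 0 or not retrieved:
        return 0.0  # assumption: nothing to match against
    w = 1.0 / n
    p = 0.0
    for f in numerical:
        # Equation (8): best local similarity found anywhere in the set.
        p += w * max(1.0 - abs(input_case[f] - c[f]) for c in retrieved)
    for f in categorical:
        # Equation (9): full credit if at least one retrieved case matches.
        if any(c[f] == input_case[f] for c in retrieved):
            p += w
    return p


# Hypothetical input case and retrieved set with three features.
query = {"location": 0.5, "type": "V-V crash", "weather": "AW"}
retrieved_set = [
    {"location": 0.25, "type": "Others", "weather": "AW"},
    {"location": 0.75, "type": "V-R crash", "weather": "Sunshine"},
]
p = integrated_matching_degree(query, retrieved_set, ["location"], ["type", "weather"])
print(round(p, 4))
```

Unlike the maximum global similarity, this index credits a feature as soon as any case in the retrieved set matches it, so a set of partially matching cases can still score well.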

Case retrieval experiments

A prototype system is developed to verify the performance of the proposed optimal algorithm for road accident case retrieval. The system runs on a Genuine Intel(R) processor with 0.99 GB of memory, a configuration comparable to that of a common personal computer. The results of the experiments are evaluated based on the integrated matching degree and the maximum similarity. During the experiments, the retrieval time is recorded to verify whether the case retrieval algorithm can work in real time. The interface of the road accident case retrieval system is shown in Figure 2.

Figure 2.

Road accident case retrieval system.

The system randomly generates 40 experimental cases, and each one is retrieved from the original accident case base and from the case retrieval base, respectively, based on the similarity models (equations (4)–(6)) of road accident cases. The case feature weights are set to the values in Table 1, and the matching degree of the retrieved case set is calculated with equations (8)–(10). The results are shown in Figures 3 and 4.

Figure 3.

Analysis of experiments on original case base.

Figure 4.

Analysis of experiments on case retrieval base.

The minimum similarity threshold for a reference case is set to 0.8. Figure 3 shows that 37 of the 40 experiments meet the minimum threshold, so the retrieval accuracy is about 92.5%, which indicates that the weights of the case features are assigned reasonably. The average retrieval time is about 0.39 s, which can meet the needs of real-time decision-making for traffic management in practice. However, the maximum similarity of cases E5, E24, and E26 is only 0.6788, 0.6526, and 0.7887, respectively, and the integrated matching degree of these three cases is also smaller.

In the next step, the same experiments are conducted on the proposed case retrieval base to address the shortcomings mentioned above.

After the experiments on the case retrieval base, Figure 4 shows that the integrated matching degree has clearly improved in most experiments. For cases E5, E24, and E26, the integrated matching degree increased by about 67.1%, 25.9%, and 53.6%, respectively. The improved integrated matching degree can compensate for the effect of low case similarity on decision-making. Users can devise more scientific and reasonable management measures through a comprehensive assessment of the retrieved case set. The average retrieval time is about 0.40 s, which is also acceptable.

Conclusion

This article has assessed the dispersion of road accident data based on the information entropy method. On the basis of the classified road accident data, a road accident case retrieval base is developed with a two-step cluster algorithm to help traffic managers obtain a more accurate case set through the k-NN retrieval algorithm. A global similarity model of road accident cases has been proposed. A new evaluation index called matching degree is then presented, and its function has been discussed in detail. Taking the integrated matching degree and the maximum similarity degree as the two main evaluation indexes, users can obtain a more comprehensive evaluation of the case retrieval result.

Dozens of case retrieval experiments have been conducted on the prototype system to verify the presented optimal algorithm in terms of maximum similarity, case retrieval time, and integrated matching degree. The results of the experiments demonstrate the effectiveness and promise of this optimal case retrieval algorithm for road accident management in real time.

Further research along this line will focus on critical issues such as case reuse, new case revision, and case retention to enlarge the whole case base within the cycle of case reasoning.

Footnotes

Handling Editor: Jiangchen Li

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This paper has been supported by the Project of Safety Oriented Operation Management-Facility Risk Analysis and Control (Tongji University—Shenzhen Urban Transport Planning Center “City Traffic Joint Laboratory” funding).

ORCID iD

Xianyuan Dong

References

1. Sadek, Smith, Demetsky MJ. Artificial intelligence-based architecture for real-time traffic flow management. Transp Res Record: J Transp Res Board 1998; 1651: 55–58.

2. Sadek, Morse, Ivan, et al. Case-based reasoning for assessing intelligent transportation systems benefits. Comput Aided Civil Inf Eng 2003; 18: 173–183.

3. Dahlgren, Khattak, McDonough, et al. ITS decision enhancements: developing case-based reasoning and expert systems and incorporating new material. Berkeley, CA: University of California, 2004.

4. Chantaraskul, Cuthbert. Using case-based reasoning in traffic pattern recognition for best resource management in 3G networks. In: Proceedings of the 7th ACM international symposium on modeling, analysis and simulation of wireless and mobile systems, Venice, 4–6 October 2004. New York: ACM.

5. Shumin, Zhaosheng, Maolei. A decision support system of urban traffic emergency control based on expert system. In: Proceedings of the software engineering and service sciences (ICSESS), Beijing, China, 16–18 July 2010. New York: IEEE.

6. Lin, Liu, et al. Traffic control method of highway tunnel emergency based on CBR and RBR. J Traffic Transp Eng 2011; 11: 108–113, 122.

7. Zhang. Traffic dispersion aid decision method based on CBR. Comput Eng Design 2014; 35: 3621–3625.

8. Design and implementation of the road traffic emergency plan decision support system for Changzhou. Dhaka, Bangladesh: Southeast University, 2015.

9. Sadek, Smith, Demetsky MJ. A prototype case-based reasoning system for real-time freeway traffic routing. Transp Res Part C 2011; 9: 353–379.

10. Chowdhury, Sadek, et al. Applications of artificial intelligence paradigms to decision support in real-time traffic management. Transp Res Record: J Transp Res Board 2006; 1968: 92–98.

11. Hoogendoorn, De Schutter, Schuurman. Decision support in dynamic traffic management. Real-time scenario evaluation. Eur J Transp Inf Res 2003; 3: 21–38.

12. Ji, Liu. Traffic congestion management method on case-based reasoning, Beijing, China. J Southwest Jiaotong Univ 2009; 44: 415–420.

13. Rousseeuw, Kaufman. Finding groups in data: an introduction to cluster analysis. Hoboken, NJ: John Wiley & Sons, 1990.

14. Saharan, Baragona. A new genetic algorithm for clustering binary data with application to traffic road accident in Christchurch. Far East J Theor Stat 2013; 45: 67–89.

15. Zhu. Integrated assessment of highway operation system safety. Shanghai, China: Tongji University, 2010.

16. Fang, Guo, Yang. A new identification method for accident prone location on highway, Beijing, China. J Traffic Transp Eng 2001; 1: 90–98.

17. Zhang, Gong. Data mining principle and technology. Beijing, China: Electronic Industry Press, 2004.