Abstract

Road safety analysis is typically performed by domain experts on the basis of the information contained in accident reports. The main challenges are the difficulty of considering a large number of reports in textual form and the subjectivity of the expert judgments contained in the reports. This work develops a framework that combines Natural Language Processing (NLP) and Machine Learning (ML) for the automatic classification of accidents, with the final aim of assisting experts in performing road safety analyses. Two different models for the representation of the textual reports (Hierarchical Dirichlet Processes (HDPs) and Doc2Vec) and three ML-based classifiers (Artificial Neural Networks (ANNs), Decision Trees (DTs) and Random Forests (RFs)) are compared. The framework is applied to a repository of road accident reports provided by the US National Highway Traffic Safety Administration. The best trade-off between classification accuracy and explainability of the obtained results is achieved by combining HDP topic modeling and RF classification.

Keywords

Road safety, road accident reports, Natural Language Processing, Hierarchical Dirichlet Process, Doc2Vec, Artificial Neural Networks, Decision Tree, Random Forest

A framework based on Natural Language Processing and Machine Learning for the classification of the severity of road accidents from reports
