
Abstract

Safety is a critical factor in evaluating autonomous vehicles (AVs), and real-world crash data provide valuable insights for assessing AV safety performance. While structured AV crash data have been widely used to analyze general crash patterns, unstructured crash narratives contain rich contextual information that remains underutilized. These narratives offer in-depth descriptions of crash circumstances, making them essential for understanding AV crash causes. However, extracting meaningful insights from these narratives presents challenges such as data scarcity and class imbalance in cause classification. Therefore, this study uses an improved bidirectional encoder representations from transformers (BERT) model to classify sentences related to crash causes and then performs fine-grained cause analysis using the topic modeling method latent Dirichlet allocation (LDA). The text similarity between cause sentences and topic words is then computed for topic assignment. To address data scarcity and class imbalance in cause classification, a mixup data augmentation strategy and focal loss, respectively, are integrated into the BERT model. Experimental results on real California Department of Motor Vehicles crash reports show a significant improvement in cause sentence classification performance compared with baseline methods. Specifically, accuracy, precision, recall, F1-score, and area under the curve increased by approximately 4.95%, 8.39%, 20.25%, 14.32%, and 10.16%, respectively. The topics of cause sentences are summarized into three groups: operational scene, location, and driving status in AV crashes. The results indicate that crashes are most common in operational scenes such as “traffic yielding,” “waiting to turn,” and “pedestrian yielding.” For location-related factors, crashes frequently occur at “intersections” and “stop signs.” Notably, within the driving status category, “manual operation” is the most critical factor.
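The two techniques the abstract names for data scarcity and class imbalance can be illustrated with a minimal sketch. It follows the standard published definitions of mixup (interpolating pairs of inputs and labels with a Beta-distributed weight) and binary focal loss (down-weighting well-classified examples), not the authors' implementation. The function names, the default values of `gamma` and `alpha`, and the choice to mix fixed-length sentence embedding vectors are all assumptions made for illustration.

```python
import math
import random


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for a single example.

    p: predicted probability of the positive class (e.g. "cause sentence").
    y: true label, 0 or 1.
    gamma, alpha: illustrative defaults, not values taken from the study.
    """
    # p_t is the probability the model assigned to the true class.
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # Clamp to avoid log(0).
    p_t = min(max(p_t, 1e-12), 1.0)
    # The (1 - p_t)^gamma factor shrinks the loss of easy examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def sample_lambda(alpha=0.2, rng=random):
    """Draw the mixup weight from Beta(alpha, alpha)."""
    return rng.betavariate(alpha, alpha)


def mixup(x_i, y_i, x_j, y_j, lam):
    """Linearly interpolate two examples and their one-hot labels.

    x_i, x_j: feature vectors (assumed here to be sentence embeddings).
    y_i, y_j: one-hot label vectors of equal length.
    lam: mixing weight in [0, 1].
    """
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y
```

With `gamma=0` and `alpha=0.5` the focal loss reduces to half the ordinary cross-entropy, which is a convenient sanity check when tuning either parameter.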

Keywords

cause classification, cause analysis, autonomous vehicle, crash narratives, BERT, LDA

Cause Classifications and Analysis of Autonomous Vehicle Crash Narratives Using Improved Bidirectional Encoder Representations from Transformers and Latent Dirichlet Allocation Methods
