Enhancing Data Accessibility through Automated Personally Identifiable Information De-Identification in Crash Narratives

Abstract

Unstructured crash narratives in police reports contain rich textual information that can reveal key insights into crash circumstances, such as contributing factors and driver behavior, that are often missing from the structured fields of crash data. However, the presence of personally identifiable information (PII) in these narratives, together with the lack of scalable, domain-specific redaction tools, limits their broader use because of privacy concerns and legal restrictions. To address this challenge, a scalable, privacy-preserving pipeline for automated PII de-identification of crash narratives was developed and evaluated. The proposed method uses GLiNER, a generalist model for named entity recognition using a bidirectional transformer, which is known for its strong zero-shot, few-shot, and fine-tuned performance across diverse entity types. The model was fine-tuned on a manually annotated training set to adapt it to the crash narrative domain. Combining this fine-tuned named entity recognition model with a rule-based post-processing module improved PII detection performance by resolving span misalignments and recovering entities that were initially missed. On a test set, the pipeline achieved an F1 score above 80%, particularly for frequent PII categories such as names and addresses. Post-processing further reduced false negatives by 32%. The pipeline was developed and tested on local machines to ensure data confidentiality. Additionally, the workflow supports accessibility and future use through GLiNER-Studio, a user-friendly tool that enables non-programmers to fine-tune models on new datasets. This study contributes a practical solution to the need for automated PII de-identification in transportation safety data, enabling secure data sharing and ethical analytics for research and policymaking.
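The abstract does not list the rules used in the post-processing module, so the following is only a minimal sketch of the two operations it names: resolving span misalignments and recovering missed entities. It assumes the model returns character-offset spans as dictionaries with `start`, `end`, and `label` keys. The `RULES` table, the phone-number pattern, and all function names are hypothetical and chosen for illustration.

```python
import re

# Hypothetical rule set: the source does not describe its actual rules.
RULES = {
    "phone": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
}


def snap_to_word_boundaries(text, span):
    """Widen a span that starts or ends in the middle of a word."""
    start, end = span["start"], span["end"]
    # Move left only while both the first tagged character and its
    # left neighbour are alphanumeric, i.e. the span cuts into a word.
    while 0 < start < len(text) and text[start].isalnum() and text[start - 1].isalnum():
        start -= 1
    # Same idea on the right edge.
    while 0 < end < len(text) and text[end - 1].isalnum() and text[end].isalnum():
        end += 1
    return {"start": start, "end": end, "label": span["label"]}


def recover_missed(text, spans):
    """Add regex matches that do not overlap any span already found."""
    recovered = list(spans)
    for label, pattern in RULES.items():
        for m in pattern.finditer(text):
            overlaps = any(
                m.start() < s["end"] and s["start"] < m.end() for s in recovered
            )
            if not overlaps:
                recovered.append({"start": m.start(), "end": m.end(), "label": label})
    return recovered


def redact(text, spans):
    """Replace each span with a [LABEL] placeholder, right to left so
    earlier offsets stay valid while the text changes length."""
    for s in sorted(spans, key=lambda s: s["start"], reverse=True):
        text = text[: s["start"]] + f"[{s['label'].upper()}]" + text[s["end"] :]
    return text


def deidentify(text, model_spans):
    """Post-process model spans, then redact the narrative."""
    aligned = [snap_to_word_boundaries(text, s) for s in model_spans]
    return redact(text, recover_missed(text, aligned))


narrative = "Driver John Smith called 555-123-4567 after the crash."
# Simulated model output: the name span is clipped on both sides and
# the phone number was missed entirely.
model_spans = [{"start": 8, "end": 16, "label": "name"}]
print(deidentify(narrative, model_spans))
```

In this sketch the clipped name span is widened to cover the full name and the phone number is picked up by the rule, so both are replaced by placeholders. Overlapping model spans are not merged here; a production module would need to handle that case.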

Keywords

PII de-identification, crash narratives, named entity recognition, GLiNER, transportation safety, privacy-preserving NLP

