Abstract
Unstructured crash narratives in police reports contain rich textual information about crash circumstances, such as contributing factors and driver behavior, that is often missing from the structured fields of crash data. However, the presence of personally identifiable information (PII) in these narratives, together with the lack of scalable, domain-specific redaction tools, limits their broader use because of privacy concerns and legal restrictions. To address this challenge, a scalable, privacy-preserving pipeline for automated PII de-identification of crash narratives was developed and evaluated. The proposed method uses the Generalist Model for Named Entity Recognition using Bidirectional Transformer (GLiNER), which is known for its strong zero-shot, few-shot, and fine-tuned performance across diverse entity types. The model was fine-tuned on a manually annotated training set to adapt it to the crash narrative domain. Combining the fine-tuned named entity recognition model with a rule-based post-processing module improved PII detection performance by resolving span misalignments and recovering entities that were initially missed. On a test set, the pipeline achieved an F1 score above 80%, particularly for frequent PII categories such as names and addresses, and post-processing further reduced false negatives by 32%. The pipeline was developed and tested on local machines to ensure data confidentiality. Additionally, the workflow supports accessibility and future use through GLiNER-Studio, a user-friendly tool that enables non-programmers to fine-tune models on new datasets. This study offers a practical solution for automated PII de-identification in transportation safety data, enabling secure data sharing and ethical analytics for research and policymaking.
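The rule-based post-processing step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not list the rules or entity labels used, so the phone-number pattern, the `NAME` and `PHONE` labels, and the function names `snap_to_words`, `postprocess`, and `redact` are all assumptions made for illustration. The model output is hard-coded rather than produced by GLiNER, and spans are assumed to be non-empty `(label, start, end)` character offsets.

```python
import re

# Hypothetical rule set: the source does not list its rules or entity labels.
PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def snap_to_words(text, start, end):
    """Widen a non-empty span until it no longer cuts through a word."""
    while start > 0 and text[start - 1].isalnum() and text[start].isalnum():
        start -= 1
    while end < len(text) and text[end - 1].isalnum() and text[end].isalnum():
        end += 1
    return start, end


def postprocess(text, spans):
    """Fix span boundaries, then add rule matches the model missed."""
    fixed = [(label, *snap_to_words(text, s, e)) for label, s, e in spans]
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            # Skip rule matches that overlap a span the model already found.
            overlaps = any(s < m.end() and m.start() < e for _, s, e in fixed)
            if not overlaps:
                fixed.append((label, m.start(), m.end()))
    return sorted(fixed, key=lambda span: span[1])


def redact(text, spans):
    """Replace each span with a bracketed label, working right to left."""
    for label, s, e in sorted(spans, key=lambda span: span[1], reverse=True):
        text = text[:s] + f"[{label}]" + text[e:]
    return text


narrative = "Driver John Smith called 555-123-4567 after the crash."
# Stand-in for model output: a NAME span that stops short of the full surname.
model_spans = [("NAME", 7, 15)]
print(redact(narrative, postprocess(narrative, model_spans)))
```

In this sketch the boundary fix addresses a span misalignment (the truncated surname), and the pattern match recovers an entity the model did not return, which is the mechanism by which such a step could lower false negatives. Replacing from right to left keeps earlier character offsets valid while the text changes length.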
