Abstract
Traffic crashes are a leading cause of death in low- and middle-income countries, where weak infrastructure and limited data hinder effective responses. Predicting crash injury severity is vital for emergency planning and policy; however, most machine learning models rely on English-language data, limiting their use in multilingual, low-resource settings. This is especially problematic for Khmer, Cambodia’s official language, which lacks word boundaries, has complex morphology, and suffers from scarce natural language processing resources. Standard models fail because of poor tokenization, semantic drift, and a lack of script-specific representations. To address this, a Khmer-aware deep learning framework is proposed that integrates conditional random field-based tokenization, multigranular embeddings (character, subword, word), a dilated bidirectional long short-term memory network with self-attention, and noise-robust classification to manage linguistic complexity and data variability. A labeled dataset of 1,074 Khmer-language traffic reports collected from eight Cambodian news outlets (2015–2024) is also introduced. The model achieves 95.2% accuracy, 0.952 precision, 0.952 recall, and 0.951 macro-F1, outperforming the best traditional model (eXtreme Gradient Boosting: 88.0% accuracy, 0.80 macro-F1) with a nearly 60% lower error rate. Results confirm that language-specific design is essential for reliable severity prediction in low-resource languages. Exploratory analysis of media-reported crashes reveals that 40.7% were classified as fatal, 52.1% of fatalities occurred on national roads, and 73.6% involved motorcycles; these patterns reflect reporting intensity rather than population-level risk. This work provides a reproducible pipeline to transform vernacular text into public health intelligence.
By combining linguistic expertise with deep learning, it is demonstrated that inclusive, language-aware AI can turn local narratives into actionable, life-saving insights, setting a precedent for equitable road safety research in underserved regions.
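The error-rate comparison above can be checked from the two reported accuracies alone. The following is a minimal sketch, assuming "error rate" means one minus accuracy; the function name `relative_error_reduction` is illustrative and not part of the published pipeline.

```python
def relative_error_reduction(baseline_acc: float, model_acc: float) -> float:
    """Fraction by which the error rate (1 - accuracy) drops relative to the baseline."""
    baseline_err = 1.0 - baseline_acc  # XGBoost: 1 - 0.880 = 0.120
    model_err = 1.0 - model_acc        # proposed model: 1 - 0.952 = 0.048
    return (baseline_err - model_err) / baseline_err


# Accuracies as reported in the abstract (rounded to one decimal place).
reduction = relative_error_reduction(0.880, 0.952)
print(f"Relative error reduction: {reduction:.1%}")
```

With the rounded accuracies, the error rate falls from 12.0% to 4.8%, a relative reduction of about 60%, which is consistent with the figure quoted in the abstract.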
