Abstract
Background
Construction sites generate large volumes of textual safety data, yet inconsistent terminology and mixed-language expressions (MLEs) reduce the reliability of analysis. Korean safety violation and warning reports (SVWRs), a localized form of safety observation reports, are often written with irregular spacing, abbreviations, and hybrid vocabulary, which hinders their systematic use in data-driven safety management.
Objective
This study aims to develop and validate a domain-specific text normalization framework to improve the linguistic consistency and analytical reliability of SVWRs.
Methods
A dataset of 64,999 SVWRs collected from 39 construction sites in South Korea was analyzed. A rule- and dictionary-based normalization pipeline was designed to unify fragmented terms and standardize MLEs. Topic modeling was then conducted with symmetric priors and eight topics aligned with national safety categories.
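To make the idea of rule- and dictionary-based normalization concrete, the following is a minimal sketch, not the study's actual pipeline. The dictionary entries, the names `VARIANT_TO_STANDARD`, `build_pattern`, and `normalize`, and the single whitespace rule are all assumptions chosen for illustration; the real dictionary and rule set are not given in this abstract.

```python
import re

# Hypothetical dictionary entries for illustration only; not taken from the study.
VARIANT_TO_STANDARD = {
    "안전 모": "안전모",      # fragmented spacing -> single term (hard hat)
    "소화 기": "소화기",      # fragmented spacing -> single term (fire extinguisher)
    "핸드레일": "안전난간",    # loanword "handrail" -> standard Korean term (guardrail)
    "H/R": "안전난간",        # abbreviation -> standard term
}


def build_pattern(variants):
    """Compile one alternation regex, trying longer variants first."""
    ordered = sorted(variants, key=len, reverse=True)
    return re.compile("|".join(re.escape(v) for v in ordered))


_PATTERN = build_pattern(VARIANT_TO_STANDARD)


def normalize(report: str) -> str:
    """Apply a whitespace rule, then dictionary substitution."""
    # Rule step: collapse runs of whitespace and trim the ends.
    text = re.sub(r"\s+", " ", report).strip()
    # Dictionary step: replace each known variant with its standard form.
    return _PATTERN.sub(lambda m: VARIANT_TO_STANDARD[m.group(0)], text)


if __name__ == "__main__":
    for raw in ["안전 모   미착용", "H/R 미설치", "핸드레일 탈락"]:
        print(raw, "->", normalize(raw))
```

Plain substring matching of this kind can over-match inside longer words, so a production pipeline would likely need token-aware or context-aware rules on top of the dictionary lookup.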
Results
Normalization increased topic-model coherence from 0.412 to 0.497 (20.6% improvement), clarifying risk structures across categories such as falls, electrical hazards, and fire prevention. It revealed co-occurring risk patterns previously obscured by inconsistent language use, demonstrating that linguistic preprocessing is crucial for reliable text-based safety analytics.
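The reported 20.6% figure is consistent with a relative change computed against the pre-normalization coherence score. As a quick check, using only the two values stated above:

```latex
\[
\frac{0.497 - 0.412}{0.412} = \frac{0.085}{0.412} \approx 0.206 \quad (20.6\%)
\]
```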
Conclusions
The proposed framework enhances both methodological reliability and practical applicability by converting fragmented field reports into standardized, analyzable data. Its dictionary-based architecture can be extended to other agglutinative languages or multilingual settings, supporting scalable, data-driven safety management in the construction industry.
