Mixed-language expressions in construction safety violation and warning reports: A domain-specific framework for normalization and safety analytics

Abstract

Background

Construction sites generate large volumes of textual safety data, yet inconsistent terminology and mixed-language expressions (MLEs) reduce the reliability of analysis. Korean safety violation and warning reports (SVWRs), a localized form of safety observation reports, are often written with irregular spacing, abbreviations, and hybrid vocabulary, hindering systematic utilization for data-driven safety management.

Objective

This study aims to develop and validate a domain-specific text normalization framework to improve the linguistic consistency and analytical reliability of SVWRs.

Methods

A dataset of 64,999 SVWRs collected from 39 construction sites in South Korea was analyzed. A rule- and dictionary-based normalization pipeline was designed to unify fragmented terms and standardize MLEs. Topic modeling was then conducted with symmetric priors and eight topics aligned with national safety categories.
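As an illustration of what a rule- and dictionary-based normalization step can look like, the following minimal sketch collapses irregular whitespace and maps variant spellings to a single canonical term. It is not the study's pipeline: the dictionary entries, the names `CANONICAL_TERMS`, `build_pattern`, and `normalize`, and the longest-match rule are all assumptions made for this example.

```python
import re

# Hypothetical dictionary: variant forms (lowercase keys) -> one canonical term.
# These entries are illustrative only and are not taken from the study.
CANONICAL_TERMS = {
    "안전 모": "안전모",      # irregular spacing -> "safety helmet"
    "헬멧": "안전모",         # loanword "helmet" -> canonical term
    "helmet": "안전모",       # English form -> canonical term
    "핸드레일": "안전난간",    # loanword "handrail" -> "safety railing"
}


def build_pattern(terms):
    """Compile one alternation regex, longest variants first so they win."""
    variants = sorted(terms, key=len, reverse=True)
    return re.compile("|".join(re.escape(v) for v in variants), re.IGNORECASE)


PATTERN = build_pattern(CANONICAL_TERMS)


def normalize(report: str) -> str:
    """Collapse whitespace runs, then replace each variant with its canonical term."""
    text = re.sub(r"\s+", " ", report).strip()
    return PATTERN.sub(lambda m: CANONICAL_TERMS[m.group(0).lower()], text)


print(normalize("작업자  헬멧 미착용"))
print(normalize("핸드레일 미설치"))
```

Plain substring matching of this kind can over-match inside longer words, so a production pipeline would likely need morphological or boundary checks in addition to the dictionary lookup.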

Results

Normalization increased topic-model coherence from 0.412 to 0.497 (20.6% improvement), clarifying risk structures across categories such as falls, electrical hazards, and fire prevention. It revealed co-occurring risk patterns previously obscured by inconsistent language use, demonstrating that linguistic preprocessing is crucial for reliable text-based safety analytics.
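The reported 20.6% figure is consistent with a relative change computed against the pre-normalization coherence score. The short check below reproduces that arithmetic; the function name `relative_improvement` is an assumption introduced for this example.

```python
def relative_improvement(before: float, after: float) -> float:
    """Return the relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Coherence scores reported in the abstract: 0.412 before, 0.497 after.
print(f"{relative_improvement(0.412, 0.497):.1f}%")
```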

Conclusions

The proposed framework enhances both methodological reliability and practical applicability by converting fragmented field reports into standardized, analyzable data. Its dictionary-based architecture can be extended to other agglutinative languages or multilingual settings, supporting scalable, data-driven safety management in the construction industry.

Keywords

construction industry accidents, occupational risk management, occupational health and safety, data mining, natural language processing, text normalization, topic modeling

