Semantic variation and translation issues in positively evaluative Korean homographs

Abstract

This paper investigates homograph disambiguation in neural machine translation (NMT), with the aim of building an NMT system for a morphologically complex language such as Korean. Homographs, words that share a written form but differ in meaning depending on context, cause significant problems for NMT systems, which often assign their meanings arbitrarily and so produce translation errors. This work introduces UTagger, a tool that uses the Standard Korean Language Dictionary (SKLD) to assign each homograph a distinct sense-code and thereby resolve the ambiguity. These sense-codes are then integrated into the parallel corpus, so that the NMT system treats each indexed homograph as an individual word, reducing ambiguity-related translation errors. In experiments on Korean-English and Korean-Vietnamese translation pairs, adding UTagger improves translation quality as measured by BLEU, TER, and DLRATIO. The results indicate that UTagger improves translation accuracy and reliability for languages with complex morphology. The study concludes by discussing how UTagger could be extended to a wider range of homographs, particularly positively evaluative ones, and outlines future work, including improving its ability to adapt to other languages and dialects.
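The sense-coding idea in the abstract can be illustrated with a minimal sketch. It is not UTagger's implementation or interface: the function name `tag_homographs`, the toy inventory `SENSES`, the sense-codes, and the cue-word overlap rule are all assumptions made for illustration, and the codes are not actual SKLD entries. The sketch only shows how appending a sense-code turns one ambiguous surface form into distinct vocabulary items before the corpus reaches an NMT system.

```python
# Toy sense inventory: surface form -> list of (sense_code, context cue words).
# Codes and cue words are invented for illustration; they are not SKLD entries.
SENSES = {
    "배": [
        ("01", {"먹다", "과일", "달다"}),  # "pear" sense
        ("02", {"타다", "바다", "항구"}),  # "ship" sense
    ],
}


def tag_homographs(tokens, senses=SENSES):
    """Append a sense-code to each known homograph, chosen by cue-word overlap."""
    context = set(tokens)
    tagged = []
    for tok in tokens:
        candidates = senses.get(tok)
        if not candidates:
            # Not a known homograph: leave the token unchanged.
            tagged.append(tok)
            continue
        # Pick the sense whose cue words overlap the sentence most;
        # on a tie, max() keeps the first listed sense.
        code, _ = max(candidates, key=lambda c: len(c[1] & context))
        tagged.append(f"{tok}_{code}")
    return tagged


print(tag_homographs(["배", "타다"]))  # ['배_02', '타다']
print(tag_homographs(["배", "먹다"]))  # ['배_01', '먹다']
```

After tagging, `배_01` and `배_02` are separate tokens, so a translation model trained on the annotated corpus can learn a different target-side translation for each.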

Keywords

neural machine translation (NMT); homograph disambiguation; morphologically rich languages; Korean language processing; UTagger; sense-coding; Standard Korean Language Dictionary (SKLD); translation accuracy; BLEU score; TER (translation edit rate); DLRATIO; Korean-English translation; Korean-Vietnamese translation; lexical ambiguity resolution; contextual word disambiguation; parallel corpus annotation; multilingual NLP; machine translation evaluation metrics
