Abstract
This paper investigates homograph disambiguation in neural machine translation (NMT), with the aim of developing an NMT system for a morphologically complex language such as Korean. Homographs, words that share a written form but carry different meanings depending on context, cause significant problems for NMT systems, which often assign a meaning arbitrarily and so produce translation errors. The work introduces UTagger, a tool that uses the Standard Korean Language Dictionary (SKLD) to assign distinct sense-codes to homographs and thereby resolve the ambiguity. These sense-codes are integrated into the parallel corpus so that the NMT system treats each sense-tagged homograph as a separate word, reducing ambiguity-related translation errors. In experiments on the Korean-English and Korean-Vietnamese language pairs, adding UTagger improves translation quality as measured by BLEU, TER, and DLRATIO. The results indicate that UTagger improves translation accuracy and reliability for languages with complex morphology. The study concludes by discussing, in light of the favourable evaluation results, the extension of UTagger to a wider range of homographs, and outlines future work, including improving its ability to adapt to other languages and dialects.
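The idea of treating each sense-tagged homograph as its own vocabulary item can be illustrated with a minimal Python sketch. It is not UTagger's implementation or API: the cue table, the sense-codes "01" and "02", the `_` separator, and the function name `tag_homographs` are all hypothetical and chosen only for illustration, and the input is assumed to be already segmented into morphemes. A real tagger would rely on SKLD senses and a far richer context model than keyword matching.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical sense inventory for the homograph "배" (pear / ship).
# The codes and cue morphemes are made up for this sketch; they are NOT
# the actual SKLD sense numbers or UTagger's disambiguation rules.
SENSE_CUES: Dict[str, List[Tuple[str, Set[str]]]] = {
    "배": [
        ("01", {"먹", "맛있"}),  # illustrative code for the "pear" sense
        ("02", {"타", "항구"}),  # illustrative code for the "ship" sense
    ],
}


def tag_homographs(morphemes: List[str]) -> List[str]:
    """Append a sense-code to each known homograph, based on context cues.

    Morphemes without an entry, or without a matching cue in the sentence,
    are returned unchanged.
    """
    context = set(morphemes)
    tagged = []
    for morpheme in morphemes:
        code = None
        for sense_code, cues in SENSE_CUES.get(morpheme, []):
            if cues & context:  # any cue morpheme present in the sentence
                code = sense_code
                break
        tagged.append(f"{morpheme}_{code}" if code else morpheme)
    return tagged


if __name__ == "__main__":
    print(tag_homographs(["배", "를", "먹", "었", "다"]))  # "ate a pear"
    print(tag_homographs(["배", "를", "타", "다"]))        # "board a ship"
```

After tagging, the two occurrences of the same surface form become different tokens in the source side of the parallel corpus, so an NMT vocabulary built from that corpus would learn a separate embedding for each sense.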
