Abstract
Machine translation (MT) for underrepresented languages such as Xhosa presents significant challenges due to limited linguistic resources and the complex structure of these languages. This research proposes a novel approach to improving Xhosa-to-English MT accuracy using Adaptive Gradient Boosted Bidirectional Encoder Representations from Transformers (AdaGrad-BBERT). The method combines BERT's contextual modeling with adaptive gradient boosting, enhancing contextual understanding and overall translation accuracy. A series of preprocessing steps optimizes model performance: the dataset first undergoes text cleaning, including noise removal, punctuation normalization, and correction of spelling inconsistencies. Tokenization uses BERT's WordPiece model, which handles rare and out-of-vocabulary words effectively. Part-of-speech tagging and dependency parsing capture syntactic relationships specific to Xhosa, whose grammatical structure differs markedly from that of English. Pre-trained BERT embeddings generate rich, context-sensitive representations of Xhosa words, supporting more accurate translations. An encoder-decoder architecture with an attention mechanism is fine-tuned using the AdaGrad-BBERT optimization technique. Translation quality is evaluated with the BLEU score, and model performance is assessed over multiple training epochs. Experiments conducted in the TensorFlow framework on a Xhosa-English dataset demonstrate significant improvements in translation accuracy, with the BLEU score reaching 0.896. The proposed system highlights the potential of AdaGrad-BBERT to bridge the MT gap for underrepresented languages, offering a scalable solution with promising applications in education, cross-cultural communication, and digital inclusion.
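The abstract evaluates translation quality with the BLEU score. As a rough illustration only (this is not the paper's evaluation code, which is not shown; production work would use a corpus-level implementation such as sacreBLEU), a simplified sentence-level BLEU with uniform n-gram weights and a brevity penalty can be sketched in pure Python:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped n-gram
    precisions (n = 1..max_n) times a brevity penalty. Single reference,
    whitespace tokenization, crude smoothing to avoid log(0)."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # clipped counts: each candidate n-gram credited at most as often
        # as it appears in the reference
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)
    # brevity penalty discourages short candidates
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A perfect match scores 1.0, and scores fall toward 0 as n-gram overlap with the reference decreases; the 0.896 reported above would correspond to very high overlap on this scale.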
