Abstract
Machine translation (MT) for underrepresented languages such as Xhosa presents significant challenges due to limited linguistic resources and the complex structure of these languages. This research proposes a novel approach to improving Xhosa-to-English MT accuracy using Adaptive Gradient Boosted Bidirectional Encoder Representations from Transformers (AdaGrad-BBERT). The method combines BERT's contextual modeling with adaptive gradient boosting, enhancing contextual understanding and overall translation accuracy. A series of preprocessing steps optimizes model performance: the dataset first undergoes text cleaning, including noise removal, punctuation normalization, and correction of spelling inconsistencies. Tokenization uses BERT's WordPiece model, which handles rare and out-of-vocabulary words effectively. Part-of-speech tagging and dependency parsing capture syntactic relationships specific to Xhosa, whose grammatical structure differs markedly from that of English. Pre-trained BERT embeddings generate rich, context-sensitive representations of Xhosa words, supporting more accurate translations. An encoder-decoder architecture with an attention mechanism is fine-tuned using the AdaGrad-BBERT optimization technique. Translation quality is evaluated with the BLEU score, and model performance is assessed over multiple training epochs. Experiments conducted in the TensorFlow framework on a Xhosa-English dataset demonstrate significant improvements in translation accuracy, with the BLEU score reaching 0.896. The proposed system highlights the potential of AdaGrad-BBERT to bridge the MT gap for underrepresented languages, offering a scalable solution with promising applications in education, cross-cultural communication, and digital inclusion.
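The abstract evaluates translation quality with the BLEU score. As a rough illustration only (this is not the paper's evaluation code, which is not shown; production work would use a corpus-level implementation such as sacreBLEU), a simplified sentence-level BLEU with uniform n-gram weights and a brevity penalty can be sketched in pure Python:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped n-gram
    precisions (n = 1..max_n) times a brevity penalty. Single reference,
    whitespace tokenization, crude smoothing to avoid log(0)."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # clipped counts: each candidate n-gram credited at most as often
        # as it appears in the reference
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)
    # brevity penalty discourages short candidates
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A perfect match scores 1.0, and scores fall toward 0 as n-gram overlap with the reference decreases; the 0.896 reported above would correspond to very high overlap on this scale.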
