Abstract
The development of real-time machine translation has become crucial for supporting multilingual communication in domains such as tourism, commerce, and global collaboration. While traditional text-only models have achieved notable success, they struggle to interpret contextual cues in complex, multimodal environments. This study introduces a Transformer-based multimodal translation framework that integrates textual, visual, and audio modalities. By leveraging a hierarchical attention mechanism, late fusion strategy, and wait-k decoding for real-time performance, the model captures rich contextual information and generates more accurate translations. An auxiliary loss function further enhances training stability and convergence. Experimental results on publicly available multilingual datasets show strong improvements over unimodal baselines, demonstrating the system’s practical potential for real-time, context-aware applications in dynamic, real-world scenarios.
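To make the decoding strategy concrete, the following is a minimal sketch, not the authors' implementation, of how a wait-k policy can be combined with late fusion of modality-specific encodings. All names here (fuse_late, translate_wait_k, encode_step, decode_step, the fusion weights) are illustrative assumptions rather than identifiers from the paper, and the toy encoder/decoder stand in for the Transformer components described in the abstract.

```python
import numpy as np

def fuse_late(text_feat, image_feat, audio_feat, weights=(0.6, 0.2, 0.2)):
    """Late fusion: combine per-modality encodings only after each encoder
    has run, here as a simple weighted sum of fixed-size feature vectors.
    (The weights are assumed for illustration.)"""
    w_t, w_i, w_a = weights
    return w_t * text_feat + w_i * image_feat + w_a * audio_feat

def translate_wait_k(source_tokens, encode_step, decode_step, k=3, max_len=50):
    """Wait-k policy: before emitting the t-th target token, the model has
    read min(k + t - 1, len(source_tokens)) source tokens, so READs and
    WRITEs interleave after an initial lag of k tokens."""
    encoder_states, target = [], []
    while len(target) < max_len:
        # Number of source tokens that must be read before the next WRITE.
        needed = min(k + len(target), len(source_tokens))
        while len(encoder_states) < needed:
            encoder_states.append(encode_step(source_tokens[len(encoder_states)]))  # READ
        token = decode_step(encoder_states, target)                                 # WRITE
        if token == "<eos>":
            break
        target.append(token)
    return target

# Toy usage with stand-in encoder/decoder steps (a real system would use
# Transformer encoders per modality and a Transformer decoder).
encode_step = lambda tok: np.ones(4) * len(tok)
decode_step = lambda states, tgt: "<eos>" if len(tgt) >= 3 else f"tok{len(tgt)}"
print(translate_wait_k("guten morgen allerseits .".split(), encode_step, decode_step, k=2))
```

Under this kind of policy, lowering k trades translation quality for latency, which is why the abstract pairs wait-k decoding with the richer fused context from the visual and audio streams.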
