Abstract
The development of real-time machine translation has become crucial for supporting multilingual communication in domains such as tourism, commerce, and global collaboration. While traditional text-only models have achieved notable success, they struggle to interpret contextual cues in complex, multimodal environments. This study introduces a Transformer-based multimodal translation framework that integrates textual, visual, and audio modalities. By leveraging a hierarchical attention mechanism, late fusion strategy, and wait-k decoding for real-time performance, the model captures rich contextual information and generates more accurate translations. An auxiliary loss function further enhances training stability and convergence. Experimental results on publicly available multilingual datasets show strong improvements over unimodal baselines, demonstrating the system’s practical potential for real-time, context-aware applications in dynamic, real-world scenarios.
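To make the decoding strategy concrete, the following is a minimal sketch, not the authors' implementation, of how a wait-k policy can be combined with late fusion of modality-specific encodings. All names here (fuse_late, translate_wait_k, encode_step, decode_step, the fusion weights) are illustrative assumptions rather than identifiers from the paper, and the toy encoder/decoder stand in for the Transformer components described in the abstract.

```python
import numpy as np

def fuse_late(text_feat, image_feat, audio_feat, weights=(0.6, 0.2, 0.2)):
    """Late fusion: combine per-modality encodings only after each encoder
    has run, here as a simple weighted sum of fixed-size feature vectors.
    (The weights are assumed for illustration.)"""
    w_t, w_i, w_a = weights
    return w_t * text_feat + w_i * image_feat + w_a * audio_feat

def translate_wait_k(source_tokens, encode_step, decode_step, k=3, max_len=50):
    """Wait-k policy: before emitting the t-th target token, the model has
    read min(k + t - 1, len(source_tokens)) source tokens, so READs and
    WRITEs interleave after an initial lag of k tokens."""
    encoder_states, target = [], []
    while len(target) < max_len:
        # Number of source tokens that must be read before the next WRITE.
        needed = min(k + len(target), len(source_tokens))
        while len(encoder_states) < needed:
            encoder_states.append(encode_step(source_tokens[len(encoder_states)]))  # READ
        token = decode_step(encoder_states, target)                                 # WRITE
        if token == "<eos>":
            break
        target.append(token)
    return target

# Toy usage with stand-in encoder/decoder steps (a real system would use
# Transformer encoders per modality and a Transformer decoder).
encode_step = lambda tok: np.ones(4) * len(tok)
decode_step = lambda states, tgt: "<eos>" if len(tgt) >= 3 else f"tok{len(tgt)}"
print(translate_wait_k("guten morgen allerseits .".split(), encode_step, decode_step, k=2))
```

Under this kind of policy, lowering k trades translation quality for latency, which is why the abstract pairs wait-k decoding with the richer fused context from the visual and audio streams.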
