An Edge-Enabled Low-Latency Cross-Lingual Speech-to-Text Framework for Efficient Human–Robot Interaction

Abstract

Humanoid robots are being deployed in settings where people speak different languages and expect quick, natural responses. In such settings, speech interaction cannot tolerate noticeable delays. Most current speech-to-text systems depend on cloud servers, which introduces latency, reliance on dependable connectivity, and failures during real-time use. These shortcomings become especially evident when robots are expected to interact with human users autonomously and continuously. To address these limitations, this work proposes an edge-centric speech-to-text framework designed specifically for multilingual humanoid robots. Instead of sending audio to the cloud, the framework performs speech processing directly on the robot. It combines lightweight neural models for real-time streaming recognition, an onboard mechanism that identifies the spoken language in real time, and local caching for faster retrieval of repeated or known speech patterns. Together, these components yield faster, more reliable transcription without heavy use of network resources. Because speech is handled locally at the wireless network edge, the system substantially reduces communication delays while still providing transcription in multiple languages. In experiments, overall response time is more than 60% lower than that of cloud-based systems. More importantly, the framework remains robust under fluctuating network bandwidth, packet loss, and background noise. The results indicate that edge-based multilingual speech-to-text systems will be important for making humanoid robots more responsive and context-aware. Faster understanding leads to faster reactions, smoother conversations, and interactions that feel more natural, which is a major step toward practical and reliable human-robot communication in real working environments.
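The abstract does not give implementation details, so the following is only a minimal sketch of how the three described components (language identification, streaming recognition, and a local cache) might be chained on the device. Every name here (`TranscriptCache`, `transcribe_on_device`, `identify_language`, `recognize`) is hypothetical, the exact-match hash is a stand-in for whatever matching the framework actually uses, and the LRU eviction policy is an assumption rather than the authors' method.

```python
import hashlib
from collections import OrderedDict
from typing import Callable, Optional, Tuple


class TranscriptCache:
    """Small LRU cache mapping an audio fingerprint to (language, text).

    Hypothetical illustration; not the cache described in the paper.
    """

    def __init__(self, capacity: int = 128) -> None:
        self.capacity = capacity
        self._items: "OrderedDict[str, Tuple[str, str]]" = OrderedDict()

    @staticmethod
    def fingerprint(audio: bytes) -> str:
        # Assumption: exact-match hashing. Real speech would need a
        # noise-tolerant key, since two utterances rarely match byte for byte.
        return hashlib.sha256(audio).hexdigest()

    def get(self, key: str) -> Optional[Tuple[str, str]]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: str, value: Tuple[str, str]) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used


def transcribe_on_device(
    audio: bytes,
    identify_language: Callable[[bytes], str],
    recognize: Callable[[bytes, str], str],
    cache: TranscriptCache,
) -> Tuple[str, str]:
    """Return (language, text) without leaving the device."""
    key = cache.fingerprint(audio)
    hit = cache.get(key)
    if hit is not None:
        return hit  # known pattern: skip both models
    language = identify_language(audio)  # onboard language ID (stubbed by caller)
    text = recognize(audio, language)    # lightweight ASR model (stubbed by caller)
    cache.put(key, (language, text))
    return language, text
```

Passing the two models in as callables keeps the sketch independent of any particular recognizer: a caller would supply its own language-identification and recognition functions, and a repeated utterance would be answered from the cache without invoking either one.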

Keywords

cross-lingual speech recognition; humanoid robots; low-latency STT; real-time interaction; wireless edge networks
