Abstract
The accelerated progress in natural language processing has elevated prosodic rhythm analysis to a critical research frontier. Current methodologies remain constrained by their reliance on surface-level acoustic attributes and predefined linguistic frameworks, limiting their capacity to capture evolving semantic-rhythm correlations. To bridge this methodological gap, this investigation develops SAM-Fuse, a context-augmented multimodal fusion framework that establishes dynamic cross-modal integration among textual semantics, paralinguistic features, and visual prosodic cues. Our architecture incorporates three key innovations: (1) a hierarchical semantic encoder with contextual augmentation, (2) an attention-based modality fusion gate with adaptive weighting, and (3) cross-modal rhythm pattern distillation. Rigorous evaluations across multimodal speech corpora demonstrate statistically significant improvements (p < 0.01) in rhythm prediction accuracy and cross-domain generalisation compared with state-of-the-art baselines. The proposed paradigm advances fundamental understanding of semantic-prosodic interactions while providing practical solutions for voice synthesis and affective computing applications.
