Abstract
The accelerated progress in natural language processing has elevated prosodic rhythm analysis to a critical research frontier. Current methodologies remain constrained by their reliance on surface-level acoustic attributes and predefined linguistic frameworks, limiting their capacity to capture evolving semantic-rhythm correlations. To bridge this methodological gap, this investigation develops SAM-Fuse, a context-augmented multimodal fusion framework that establishes dynamic cross-modal integration among textual semantics, paralinguistic features, and visual prosodic cues. Our architecture incorporates three key innovations: (1) a hierarchical semantic encoder with contextual augmentation, (2) an attention-based modality fusion gate with adaptive weighting, and (3) cross-modal rhythm pattern distillation. Rigorous evaluations across multimodal speech corpora demonstrate statistically significant improvements (p < 0.01) in rhythm prediction accuracy and cross-domain generalisation compared with state-of-the-art baselines. The proposed paradigm advances fundamental understanding of semantic-prosodic interactions while providing practical solutions for voice synthesis and affective computing applications.
