Abstract
The rapid expansion of online education platforms has highlighted the critical need for accurate learner emotion recognition to improve interactivity and learning outcomes. However, conventional unimodal approaches struggle to capture the complex emotional cues embedded in learners’ language and speech. To address this gap, this study proposes a novel text-audio multimodal weighted network (TA-MWN) that integrates Bidirectional Encoder Representations from Transformers (BERT) for textual feature extraction and Mel-frequency cepstral coefficients (MFCCs) for speech feature representation. These features are further processed through a multi-layer Long Short-Term Memory (LSTM) network to model temporal dependencies. The proposed multimodal weighted network (MWN) dynamically fuses the prediction probabilities of the two modalities, enhancing sentiment recognition performance. Experimental validation on the public CMU-MOSI dataset demonstrates that TA-MWN outperforms traditional unimodal and state-of-the-art multimodal methods in precision, recall, and F1-score. Further evaluation on a self-constructed online learning dataset confirms the model’s adaptability and recognition accuracy, with over 90% accuracy in negative sentiment detection. The proposed approach not only improves emotion recognition in online education but also provides a scalable framework for adaptive learning, personalized feedback, and privacy-friendly interaction that does not rely on visual data. These findings have practical implications for enhancing learner engagement and emotional support in intelligent online learning environments.
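As a rough illustration of the late-fusion idea described above, the sketch below feeds a text branch and an audio branch through multi-layer LSTMs and combines their softmax probabilities with a learnable modality weight. It assumes precomputed BERT token embeddings and MFCC frames as inputs; the hidden sizes, layer counts, class count, and the sigmoid-gated convex combination are illustrative assumptions, not the paper's exact TA-MWN configuration.

```python
# Minimal late-fusion sketch over text and audio branches (assumed configuration).
import torch
import torch.nn as nn


class UnimodalLSTMBranch(nn.Module):
    """Multi-layer LSTM over a feature sequence, producing class probabilities."""

    def __init__(self, input_dim: int, hidden_dim: int = 128,
                 num_layers: int = 2, num_classes: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, input_dim)
        _, (h_n, _) = self.lstm(x)
        logits = self.classifier(h_n[-1])      # final hidden state of the top layer
        return torch.softmax(logits, dim=-1)   # per-modality prediction probabilities


class TextAudioWeightedFusion(nn.Module):
    """Fuses the two branches' probabilities with a learnable modality weight."""

    def __init__(self, text_dim: int = 768, audio_dim: int = 40, num_classes: int = 2):
        super().__init__()
        self.text_branch = UnimodalLSTMBranch(text_dim, num_classes=num_classes)
        self.audio_branch = UnimodalLSTMBranch(audio_dim, num_classes=num_classes)
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learned text/audio trade-off

    def forward(self, text_feats: torch.Tensor, audio_feats: torch.Tensor) -> torch.Tensor:
        p_text = self.text_branch(text_feats)
        p_audio = self.audio_branch(audio_feats)
        w = torch.sigmoid(self.alpha)              # keep the weight in (0, 1)
        return w * p_text + (1.0 - w) * p_audio    # fused sentiment probabilities


if __name__ == "__main__":
    model = TextAudioWeightedFusion()
    text = torch.randn(4, 32, 768)   # e.g., 32 BERT token embeddings per utterance
    audio = torch.randn(4, 200, 40)  # e.g., 200 MFCC frames per utterance
    print(model(text, audio).shape)  # -> torch.Size([4, 2])
```

Fusing probabilities rather than raw features keeps each branch independently trainable and lets the weight adapt when one modality is noisy or missing.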
