Abstract
The rapid expansion of online education platforms has highlighted the critical need for accurate learner emotion recognition to improve interactivity and learning outcomes. However, conventional unimodal approaches struggle to capture the complex emotional cues embedded in learners’ language and speech. To address this gap, this study proposes a novel text-audio multimodal weighted network (TA-MWN) that integrates Bidirectional Encoder Representations from Transformers (BERT) for textual feature extraction and Mel-frequency cepstral coefficients (MFCCs) for speech feature representation. These features are further processed through a multi-layer Long Short-Term Memory (LSTM) network to model temporal dependencies. The proposed multimodal weighted network (MWN) dynamically fuses the prediction probabilities of the two modalities, enhancing sentiment recognition performance. Experimental validation on the public CMU-MOSI dataset demonstrates that TA-MWN outperforms traditional unimodal and state-of-the-art multimodal methods in precision, recall, and F1-score. Further evaluation on a self-constructed online learning dataset confirms the model’s adaptability and recognition accuracy, with over 90% accuracy in negative sentiment detection. The proposed approach not only improves emotion recognition in online education but also provides a scalable framework for adaptive learning, personalized feedback, and privacy-friendly interaction that does not rely on visual data. These findings have practical implications for enhancing learner engagement and emotional support in intelligent online learning environments.
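As a rough illustration of the late-fusion idea described above, the sketch below feeds a text branch and an audio branch through multi-layer LSTMs and combines their softmax probabilities with a learnable modality weight. It assumes precomputed BERT token embeddings and MFCC frames as inputs; the hidden sizes, layer counts, class count, and the sigmoid-gated convex combination are illustrative assumptions, not the paper's exact TA-MWN configuration.

```python
# Minimal late-fusion sketch over text and audio branches (assumed configuration).
import torch
import torch.nn as nn


class UnimodalLSTMBranch(nn.Module):
    """Multi-layer LSTM over a feature sequence, producing class probabilities."""

    def __init__(self, input_dim: int, hidden_dim: int = 128,
                 num_layers: int = 2, num_classes: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, input_dim)
        _, (h_n, _) = self.lstm(x)
        logits = self.classifier(h_n[-1])      # final hidden state of the top layer
        return torch.softmax(logits, dim=-1)   # per-modality prediction probabilities


class TextAudioWeightedFusion(nn.Module):
    """Fuses the two branches' probabilities with a learnable modality weight."""

    def __init__(self, text_dim: int = 768, audio_dim: int = 40, num_classes: int = 2):
        super().__init__()
        self.text_branch = UnimodalLSTMBranch(text_dim, num_classes=num_classes)
        self.audio_branch = UnimodalLSTMBranch(audio_dim, num_classes=num_classes)
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learned text/audio trade-off

    def forward(self, text_feats: torch.Tensor, audio_feats: torch.Tensor) -> torch.Tensor:
        p_text = self.text_branch(text_feats)
        p_audio = self.audio_branch(audio_feats)
        w = torch.sigmoid(self.alpha)              # keep the weight in (0, 1)
        return w * p_text + (1.0 - w) * p_audio    # fused sentiment probabilities


if __name__ == "__main__":
    model = TextAudioWeightedFusion()
    text = torch.randn(4, 32, 768)   # e.g., 32 BERT token embeddings per utterance
    audio = torch.randn(4, 200, 40)  # e.g., 200 MFCC frames per utterance
    print(model(text, audio).shape)  # -> torch.Size([4, 2])
```

Fusing probabilities rather than raw features keeps each branch independently trainable and lets the weight adapt when one modality is noisy or missing.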
