Abstract
Multimodal sentiment analysis is a well-established field in AI that integrates text, video, and audio to better understand human emotions. However, the field still faces two major challenges: first, it is difficult to effectively eliminate noise interference in multimodal data; second, existing research focuses on fusion mechanisms between modalities while ignoring inter-modal similarity and heterogeneity, which biases sentiment predictions. To address these issues, this paper proposes a Transformer-based deep fusion model. Following the sentiment analysis strategy of PS-Mixer, we define a polarity vector (PV) and a strength vector (SV) to judge the polarity (positive, negative, or neutral) and the intensity (on a 0–3 scale) of emotions, respectively. The PV fuses text and video features, while the SV fuses text and audio features; a cross-fusion strategy further improves the expressiveness of the learned representations. The Transformer architecture enhances the model's generalization performance and its ability to process large-scale data. Experimental results show that the model outperforms existing multimodal sentiment analysis methods on the MOSI and MOSEI datasets, offering a new direction for research in this field.
