Abstract
With the rapid advancement of large language models (LLMs) and computer vision technologies, multimodal large language models (MLLMs) have demonstrated remarkable potential in sentiment analysis. Traditional sentiment analysis methods often rely on unimodal data (e.g., text or images), making it difficult to comprehensively capture complex emotional expressions. This paper proposes a multimodal sentiment analysis framework based on MLLMs, integrating visual and textual information to enhance sentiment classification accuracy. Experiments on a high-profile social media event show that the Qwen2-VL-Adapter model outperforms conventional methods on multiple evaluation metrics, validating the effectiveness of multimodal information fusion. This study provides a robust technical framework for sentiment analysis in public opinion monitoring and offers valuable data support for crisis management. However, the model's performance is influenced by the specificity of the dataset and by computational demands, which may limit its application in resource-constrained environments.
