Abstract
Multimodal sentiment analysis is a well-established field in AI that integrates text, video, and audio to better understand human emotions. However, the field still faces two major challenges: first, it is difficult to effectively eliminate noise interference in multimodal data; second, existing research focuses on fusion mechanisms between modalities while ignoring inter-modal similarity and heterogeneity, which biases sentiment predictions. To address these issues, this paper proposes a Transformer-based deep fusion model. Following the sentiment analysis strategy of PS-Mixer, we define a polarity vector (PV) and a strength vector (SV) to judge the polarity (positive, negative, or neutral) and the intensity (on a 0–3 scale) of emotions, respectively. The PV fuses text and video features, while the SV fuses text and audio features; a cross-fusion strategy further improves the expressiveness of the learned representations. The Transformer architecture enhances the model's generalization performance and its ability to process large-scale data. Experimental results show that the model outperforms existing multimodal sentiment analysis methods on the MOSI and MOSEI datasets, offering a new direction for research in this field.
