Abstract
Multimodal sentiment analysis is becoming increasingly significant. However, the heterogeneity of multimodal signals poses a challenge in learning modality representations and fusing information. To address this, this paper proposes BiMSA (multimodal sentiment analysis based on BiGRU and bidirectional interactive attention). In the modality representation learning layer, BiMSA incorporates a structure consisting of BiGRU to extract contextual information from video and audio inputs. Subsequently, modality features are projected into modality internal representations and interactive representations for extracting information, which allows the model to take more aspects of modality information into account. In the modality fusion layer, a bidirectional interactive attention mechanism is used to focus on the key representations of key modalities and to integrate multimodal information flexibly and efficiently. Attention weights are concentrated on modality representations that synergistically contribute toward overall sentiment orientation. Additionally, similarity-loss and difference-loss constraints are introduced to align representations while mitigating redundant information and achieving a better fusion effect. Experimental results on public datasets (CMU-MOSI and CMU-MOSEI) demonstrate the effectiveness of the BiMSA model.
Introduction
Sentiment analysis refers to the automated process of interpreting and classifying sentiment based on data, with the aim of determining whether expressed opinions are positive, negative, or neutral. Traditional sentiment analysis primarily focuses on text-based discrimination of sentiment. However, due to the rapid growth of social networks, an increasing number of users now express their attitudes through multiple modalities rather than relying solely on textual information, as in the past.
Multimodal data serves as a transmission medium encompassing textual, visual, and auditory elements. Incorporating the descriptive quality of facial expressions conveys precise and nuanced emotional information and unveils latent textual cues. Consequently, analyzing sentiment in data rich with affective content has become the primary approach to discerning users’ sentiment tendencies. Multimodal sentiment analysis holds immense value in social media sentiment analysis, user experience research, sentiment perception in autonomous vehicles, and sentiment monitoring in healthcare.
In the early stages of multimodal sentiment analysis, features were primarily extracted independently from text and images and then fused (Ren et al., 2021) before being passed to a connected network for classification. Experimental results demonstrate that utilizing multimodal data leads to higher classification accuracy than using single-modality data; subsequent approaches mainly focused on developing intricate fusion mechanisms, ranging from attention-based models to tensor-based fusion (Poria et al., 2017). Despite these advancements, challenges persist regarding how best to apply multimodal information in sentiment analysis. Firstly, extracting features from multiple modalities often leads to information redundancy or loss, because repeated information across modalities may be mixed while unique modality-private details are missed. Secondly, after feature extraction, it is necessary to overcome the heterogeneity of different modalities and effectively integrate their features.
This paper proposes utilizing a neural network constructed from BiGRU (Liu et al., 2019) to extract video and audio features in the modality representation learning layer, while using bidirectional encoder representations from transformers (BERT) to extract text features. The BiGRU structure can process sequence data bidirectionally and extract contextual information from sequences, so the model can fully capture the contextual information in video and audio data and deepen its understanding of video and audio features. Previous models typically employ convolutional neural networks (Lecun & Bottou, 1998) and recurrent neural networks (RNN; Zaremba et al., 2014) for feature extraction, neglecting the fact that sequence data should take into account not only subsequent information but also preceding information; therefore, BiGRU units are utilized. Additionally, inspired by Hazarika et al.’s (2020) ideas, modality internal representations and modality interactive representations based on the extracted features are investigated. The modality internal representations are independent representations of the modalities themselves, while the modality interactive representations capture the interaction between multiple modalities. The interactive representations can capture relevant information shared between modalities, such as cues emphasizing the speaker’s sentiment, which enables the model to consider more aspects of the modality characteristics. In the modality fusion layer, the paper incorporates a bidirectional interactive attention mechanism to ensure each unique modality representation is aware of the other cross-modality representations. This enables each representation to generalize latent information from other representations, extract crucial information from a substantial volume of data, ignore unimportant information, and focus on the important parts. Furthermore, during model training, a similarity loss and a difference loss are added to the task loss, which helps align the modality interactive representations, reduce redundant information, and achieve better fusion results.
The main contributions of this paper can be summarized as follows:
In the modality representation learning layer of multimodal sentiment analysis, this paper proposes a neural network constructed by BiGRU to extract features from video and audio inputs, so that the model can effectively capture context information when processing sequence data.
In the modality fusion layer, a bidirectional interactive attention mechanism is used to fuse all representations. This allows the model to focus on representations that have a synergistic effect on overall sentiment orientation.
Additionally, similarity constraints and difference constraints are incorporated to align interactive representations between modalities and reduce redundant information.
To address the problems encountered in modality representation learning and modality fusion, perform multimodal sentiment analysis more accurately, and realize the value of multimodal sentiment analysis in economic and scientific applications, this paper proposes BiMSA, a hybrid model based on BiGRU, a bidirectional interactive attention mechanism, and constraint functions. Experiments on the publicly available CMU-MOSI and CMU-MOSEI datasets demonstrate that the BiMSA model outperforms a series of baseline models, and ablation experiments demonstrate the importance of each component of the model.
Related Work
Multimodal sentiment analysis faces two major challenges: one is modality representation learning, and the other is modality fusion.
Modality Representation Learning
The modality representation of text mainly converts text into a form that machines can process; word vector models such as Word2vec (Goldberg & Levy, 2014), which represent text as vectors, are commonly used. GloVe word vectors use a co-occurrence matrix to incorporate global information (Pennington et al., 2014); ELMo word vectors capture the context-dependent meanings of words as the linguistic environment changes (Peters et al., 2018). BERT enabled scholars to pre-train on large-scale corpora and feed word vectors into downstream tasks after learning semantic relations (Devlin et al., 2018). Stappen et al. (2021) transcribed video clips into textual format, employed one-hot vectors for text encoding, and subsequently utilized a support vector machine classifier to achieve excellent sentiment analysis results.
The feature extraction of the video modality primarily relies on the analysis of geometric and texture features exhibited by faces. Python’s OpenCV and Dlib libraries are often used for facial keypoint detection. For example, Liu et al. (2021) used OpenCV to detect faces. The OpenFace tool proposed by Baltrušaitis et al. (2016) can also extract facial features and obtain low-dimensional representations. Cambria et al. (2018) used deep convolutional neural networks to extract text and visual features, and Wang et al. (2020) used neural networks to extract facial features.
Mel-frequency cepstral coefficients and linear prediction cepstral coefficients are commonly used features for the audio modality. Schuller’s team developed the openSMILE tool, which can preprocess speech and extract features (Eyben et al., 2010). Degottex et al. (2014) developed COVAREP for speech feature extraction.
Banerjee et al. (2021) aimed to quantitatively demonstrate the influence of non-verbal cues on the results of multimodal sentiment analysis, focusing on facial emotion. Taking a Spanish dataset as an example, the results show that analysis of Spanish text combined with visual features outperforms analysis of English text features alone, indicating an intrinsic correlation between Spanish visual cues and Spanish text (Banerjee et al., 2021).
Additionally, Hazarika et al. (2020) proposed MISA, which projects each modality into two subspaces. One facilitates the learning of commonalities and the reduction of gaps across different modalities. The other represents the unique representation of each modality. These representations offer a comprehensive perspective on multimodal data fusion for enabling predictions.
The work presented in this paper differs from the above research. To capture contextual information of sequential data, BERT is used for extracting textual representations. For representation learning of video and audio modalities, a neural network constructed with BiGRU is utilized to extract representation information.
Modality Fusion
In multimodal sentiment analysis, the modality fusion method determines the effect of the model, so it is important to choose an appropriate fusion method. Currently, there exist three primary modality fusion methods: feature-level fusion, decision-level fusion, and hybrid fusion (Wang, 2021).
In feature-level fusion, after feature extraction from video, audio, and text, Poria et al. (2018) concatenated the features in the fusion layer to form a combined vector, which was then input to a classifier to obtain the result. Li et al. (2022) proposed contrastive learning and multi-layer fusion (CLMLF). Its multi-layer fusion (MLF) module fuses multimodal features based on labels; simultaneously, a contrastive-learning-based task is designed to help the model acquire sentiment-related features and enhance its ability to extract and integrate features effectively. CLMLF exhibits strong competitiveness, particularly through its visualization techniques, contrastive learning tasks, and MLF module. Li et al. (2023) proposed the Mutual Information Maximization and Feature Space Separation and Bi-Bimodal Modality Fusion model, designing a mutual information (MI) maximization module and a feature space separation module. The MI module maximizes the MI between two modalities to retain more correlated information, while the feature separation module separates the fused features to prevent the loss of independent information during fusion.
Decision-level fusion enables each modality to learn features with its most suitable model, but a disadvantage is that internal connections between features cannot be learned, and the process is time-consuming. Huang et al. (2019) proposed a sentiment analysis model that used the connections and differences between images and texts to classify sentiment; the model fuses the results of three sub-models to obtain the prediction. Poria et al. (2015) proposed a short-text feature extraction method based on deep convolutional neural networks and a parallel decision-level data fusion method to improve running speed.
Hybrid fusion mainly considers the fusion modality of the model. The two-layer multimodal learning model proposed by Ji et al. (2018) effectively addresses the dependence between modalities. This model consists of two layers: the first layer focuses on capturing the correlation between tweets and tweet features to predict sentiment, while the second layer emphasizes learning the relationships among multiple modalities. Experimental results demonstrate that the model shows excellent performance.
In addition, many scholars have made contributions to modality fusion. Han et al. (2021) proposed a hierarchical MI maximization framework for multimodal sentiment analysis, the MultiModal InfoMax model, which maximizes MI at the input and fusion levels to reduce the loss of valuable task-related information. This was the first time MI was incorporated into multimodal sentiment analysis.
BiMSA’s approach in the modality fusion layer differs from previous methods by using a bidirectional interactive attention mechanism to construct the fusion network. It incorporates both the internal representations and the interactive representations of the modalities into the network, enabling the model to ignore redundant information and focus on relevant feature information for accurate sentiment polarity prediction. Additionally, the model assigns higher weights to features that contribute significantly toward predicting the outcome, thereby selecting focused features.
Method
In order to use multimodal data to judge sentiment in video and audio, and to fully extract and fuse modality features, a multimodal sentiment analysis model is proposed. The overall structure of the model is illustrated in Figure 1. There are three components: modality representation learning, modality fusion, and sentiment prediction.

The overall structure of the BiMSA model.
Each video in the dataset represents a sample sequence
First, BiMSA extracts the raw features of the data.
For text data, the bidirectional encoding capability of the BERT pre-trained model is used to extract raw features. The BERT pre-trained model is based on the Transformer architecture and understands and encodes the semantic and syntactic structure of the input text at different levels, which makes it perform well in handling long-distance dependencies and contextual understanding. Thus, the context and relevant information in the text can be better understood, and the extracted features are more comprehensive and accurate.
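As an illustration, token-level BERT features could be obtained roughly as follows; this is a minimal sketch that assumes the HuggingFace transformers library and the bert-base-uncased checkpoint, neither of which is specified by the model itself.

```python
# Minimal sketch: extracting token-level text features with a pre-trained BERT model.
# The library and checkpoint name are illustrative assumptions; the paper only states
# that BERT is used to encode the text modality.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

utterance = "But the film is an amazing experience"
inputs = tokenizer(utterance, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    outputs = bert(**inputs)

# Sequence of contextualized token embeddings: (batch, seq_len, 768)
text_features = outputs.last_hidden_state
```

The resulting sequence of contextual embeddings can then be passed to the representation learning layer.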
For video data, Facet tools can extract key features from video quickly and accurately. Moreover, various types of video features can be extracted, such as facial expressions, postural movements, voice emotions, etc., to help users quickly analyze and understand video content from multiple angles. Therefore, Facet tools are used in this paper to extract visual features.
For audio data, BiMSA uses COVAREP to analyze the sound characteristics of speech signals. COVAREP can extract a variety of sound features, and these multi-dimensional features can provide rich information about the speech signal and facilitate more comprehensive acoustic analysis. Moreover, the sound features extracted by COVAREP can be used to analyze the speaker’s sentiment state and intonation change, which is helpful for sentiment recognition and other applications.
With this approach, the sequence of raw video clips will be represented as a feature vector
Then, representation learning is conducted on the acquired original features. The rationale for not directly using the raw features is that they are relatively simple and require non-linear combination to be fully exploited. Moreover, there is considerable redundancy among the features, so not all of them are useful for prediction. Additionally, many features exhibit variability and noise.
For each modality
BiGRU is composed of two gated recurrent units running in opposite directions, which can effectively capture long-range contextual dependencies in a sequence and alleviate the vanishing and exploding gradient problems encountered when training RNNs (Cho et al., 2014; Chung et al., 2014; Liu et al., 2019). In BiGRU, the forward and reverse passes produce hidden-layer representations at the corresponding time steps, and a concatenation operation then yields video and audio features with contextual information. The formulae can be expressed as follows:
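A standard BiGRU formulation consistent with this description, where $x_t$ denotes the input feature at time step $t$ and the notation is illustrative rather than the paper's own, is:

```latex
% Forward and backward GRU hidden states, concatenated at each time step.
\begin{aligned}
\overrightarrow{h}_{t} &= \overrightarrow{\mathrm{GRU}}\left(x_{t},\, \overrightarrow{h}_{t-1}\right),\\
\overleftarrow{h}_{t}  &= \overleftarrow{\mathrm{GRU}}\left(x_{t},\, \overleftarrow{h}_{t+1}\right),\\
h_{t} &= \left[\overrightarrow{h}_{t}\,;\, \overleftarrow{h}_{t}\right].
\end{aligned}
```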
After obtaining the modality internal representations and modality interactive representations, a modality fusion module is applied. The approach employed here utilizes the bidirectional interactive attention mechanism and subsequently concatenates the two transformed layers of vectors.
The bidirectional interactive attention mechanism can be used to optimize the model from three perspectives: multimodal information fusion, cross-modal association learning, and dynamic attention allocation.
In terms of multimodal information fusion, through the bidirectional interactive attention mechanism, the model can better integrate and understand the information of different modalities when dealing with multiple modal inputs, so as to improve the model’s understanding ability of complex scenes.
In cross-modal association learning, the bidirectional interactive attention mechanism makes each modality’s features aware of other cross-modal features, taking into account not only the global relations between different modalities but also the local relations between different representations of the same modality. On the basis of these relations, the correlations between different modalities can be identified and modeled. By allowing each representation to capture features from other representations and induce underlying information (Vaswani et al., 2017), the representations act synergistically on the overall affective orientation.
As for dynamic attention allocation, the multimodal bidirectional interactive attention mechanism can dynamically adjust the attention allocation according to the modal input, so as to make better use of the information of different modalities and make the model more flexible and efficient in processing multimodal information.
First of all, the modality internal representations and modality interactive representations of the three modalities are stacked into matrices
Then, each input matrix containing representations is divided into multiple “heads,” and each head independently performs the self-attention calculation that follows and produces its own attention output: the Query, Key, and Value vectors are computed for each element based on the input sequence data. Suppose there are h “heads,” and the input sequence for each “head” is
The attention score between different representations is computed to quantify the level of attention that each representation pays to other representations. This score is calculated using the following formula:
The model then calculates the attention weight coefficients between different representations. Through the softmax function, the attention scores are converted into values between 0 and 1 that sum to 1, yielding the attention weights:
The values are then weighted and summed according to the weight coefficients: the value vector of each element is multiplied by its corresponding attention weight, and the results are summed. This process combines the input multimodal feature sequence to generate a new context-dependent sequence vector:
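Taken together, the three steps above correspond to standard scaled dot-product attention; the symbols below are illustrative, with $Q$, $K$, and $V$ the Query, Key, and Value matrices and $d_k$ the Key dimension:

```latex
% Scaled dot-product attention for a single head: score, softmax weights, weighted sum.
\begin{aligned}
\mathrm{score}(Q, K) &= \frac{QK^{\top}}{\sqrt{d_{k}}},\\
\alpha &= \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_{k}}}\right),\\
\mathrm{Attention}(Q, K, V) &= \alpha V.
\end{aligned}
```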
After that, the output
Finally, the output vectors of matrices
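As a concrete illustration of the bidirectional interactive attention fusion described above, the following minimal sketch applies standard multi-head attention in both directions and concatenates the two outputs; the hidden size, number of heads, use of torch.nn.MultiheadAttention, and final projection are assumptions for illustration rather than the exact design.

```python
# Sketch of bidirectional interactive attention fusion. Illustrative assumptions:
# hidden size, number of heads, and the concatenation + projection at the end.
import torch
import torch.nn as nn

class BiInteractiveAttention(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        # One attention module per direction: A attends to B, and B attends to A.
        self.attn_ab = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_ba = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, reps_a, reps_b):
        # reps_a, reps_b: (batch, num_representations, dim) matrices of stacked
        # modality internal / interactive representations.
        a2b, _ = self.attn_ab(query=reps_a, key=reps_b, value=reps_b)
        b2a, _ = self.attn_ba(query=reps_b, key=reps_a, value=reps_a)
        # Concatenate the two directions and project back to the model dimension.
        fused = torch.cat([a2b, b2a], dim=-1)
        return self.proj(fused)

# Example: six stacked representations (three modalities x internal/interactive).
fusion = BiInteractiveAttention()
reps = torch.randn(8, 6, 128)
out = fusion(reps, reps)   # (8, 6, 128)
```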
The task of multimodal sentiment analysis is to input a data series pair
The learning is achieved through the process of minimizing:
The difference loss is calculated by imposing an orthogonality constraint between the two types of representations (Bousmalis et al., 2016; Hazarika et al., 2020; Liu et al., 2017; Ruder & Plank, 2018). In addition to the constraints between the modality internal representations and the modality interactive representations, orthogonality constraints between the modality internal representations themselves are also added.
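A sketch of how the three constraints could be combined into one training objective is given below; the mean-absolute-error task term, the mean-squared-error form of the similarity loss, the squared Frobenius-norm form of the orthogonality (difference) loss, and the weighting coefficients are common choices (cf. Bousmalis et al., 2016) stated here as assumptions rather than the paper's exact formulation.

```python
# Sketch of the combined objective: task loss + similarity loss + difference loss.
# The specific loss forms and weights are illustrative assumptions.
import torch
import torch.nn.functional as F

def orthogonality_loss(a, b):
    """Squared Frobenius norm of a^T b, pushing two representation matrices apart."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    return (a.transpose(-2, -1) @ b).pow(2).sum()

def total_loss(pred, target, interactive_reps, internal_reps, alpha=0.3, beta=0.3):
    # Task loss: mean absolute error for the sentiment regression target (assumed).
    task = F.l1_loss(pred, target)
    # Similarity loss: pull the modality interactive representations together.
    sim = sum(F.mse_loss(interactive_reps[i], interactive_reps[j])
              for i in range(len(interactive_reps))
              for j in range(i + 1, len(interactive_reps)))
    # Difference loss: orthogonality between internal and interactive representations
    # of each modality, plus between the internal representations themselves.
    diff = sum(orthogonality_loss(p, s)
               for p, s in zip(internal_reps, interactive_reps))
    diff += sum(orthogonality_loss(internal_reps[i], internal_reps[j])
                for i in range(len(internal_reps))
                for j in range(i + 1, len(internal_reps)))
    return task + alpha * sim + beta * diff
```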
Datasets
CMU-MOSI
The CMU-MOSI dataset consists of 2,198 discourse-level video clips, comprising independent movie reviews by 89 speakers (Zadeh et al., 2016). It is worth noting that the dataset maintains a rough gender balance. Each statement within it is accompanied by a continuous sentiment score ranging from −3 (strongly negative) to +3 (strongly positive).
CMU-MOSEI
The CMU-MOSEI dataset surpasses its predecessor, the CMU-MOSI dataset (Zadeh et al., 2018). It comprises over 65 hours of monologue video from more than 1,000 speakers, encompassing 23,453 annotated clips across 250 topics.
Evaluation Criteria
The CMU-MOSI and CMU-MOSEI datasets involve regression tasks, evaluated using the mean absolute error (MAE) and the Pearson correlation (Corr). Additionally, the baselines report classification metrics: seven-class accuracy (Acc-7) over the sentiment range −3 to +3, binary accuracy (Acc-2), and the F1-score.
MAE
The MAE is the mean absolute distance between the predicted values and the true values; the lower the MAE, the better the model performs.
Corr is the correlation between the prediction matrix and the ground-truth matrix. The higher the correlation, the more effective the BiMSA model is. The formula for the correlation coefficient is as follows:
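For reference, the standard definitions of these two regression metrics, with $\hat{y}_i$ the predicted value, $y_i$ the true value, and $N$ the number of test samples, are:

```latex
% Standard definitions of MAE and the Pearson correlation coefficient.
\begin{aligned}
\mathrm{MAE} &= \frac{1}{N}\sum_{i=1}^{N}\left|\hat{y}_{i} - y_{i}\right|,\\
\mathrm{Corr} &= \frac{\sum_{i=1}^{N}\left(\hat{y}_{i} - \bar{\hat{y}}\right)\left(y_{i} - \bar{y}\right)}
                     {\sqrt{\sum_{i=1}^{N}\left(\hat{y}_{i} - \bar{\hat{y}}\right)^{2}}\;
                      \sqrt{\sum_{i=1}^{N}\left(y_{i} - \bar{y}\right)^{2}}}.
\end{aligned}
```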
The Acc-7 is determined by applying the round function to map the regression-based prediction results onto a discrete range of −3 to +3.
Binary Accuracy (Acc-2)
For binary classification (Acc-2), the confusion matrix is used. Both the predicted score and the true score are mapped to binary labels and compared to calculate the binary classification accuracy:
The F1-score is the harmonic mean of precision and recall:
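In the usual confusion-matrix notation (TP, TN, FP, FN), the corresponding standard forms are:

```latex
% Standard binary accuracy and F1-score in confusion-matrix notation.
\begin{aligned}
\mathrm{Acc\text{-}2} &= \frac{TP + TN}{TP + TN + FP + FN},\\
\mathrm{F1} &= \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}
                    {\mathrm{Precision} + \mathrm{Recall}},
\quad \mathrm{Precision} = \frac{TP}{TP + FP},
\quad \mathrm{Recall} = \frac{TP}{TP + FN}.
\end{aligned}
```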
The following multimodal sentiment analysis baseline models are selected to compare with BiMSA:
BC-LSTM: The proposed model presents a hierarchical concept comprising two components (Poria et al., 2017). The first level extracts single-modality features, while the second level feeds the extracted features into a long short-term memory network. Finally, fusion is performed on the features extracted from the three modalities to derive the final prediction.
TFN: The fundamental concept of the model is to learn intra-modality and inter-modality dynamics end-to-end. The inter-modality dynamics are modeled by tensor fusion, and intra-modality dynamics are modeled by three modal embedding subnetworks (Zadeh et al., 2017).
LMF: The study proposes a model that decomposes the weights into low-rank factors, which effectively reduces the number of parameters in the model (Liu et al., 2018).
MulT: The core of the model is the cross-modality attention module, which focuses on cross-modality interactions at the utterance scale. This module latently adapts the data stream from one modality to another by repeatedly reinforcing the features of one modality with those of the others (Tsai et al., 2019).
MISA: MISA emphasizes the significance of multimodal representation learning as a prerequisite for fusion. MISA acquires modality-shared and modality-unique representations whose fusion facilitates prediction of the sentiment state.
ConFEDE: ConFEDE holds that multimodal sentiment analysis depends to a large extent on learning a good representation of multimodal information, which should include modality-invariant representations that are consistent across modalities as well as modality-specific representations (Yang et al., 2023). Thus, ConFEDE jointly performs contrastive representation learning and contrastive feature decomposition to enhance the representation of multimodal information.
Results and Analysis
Quantitative Results
Tables 1 and 2 show the results of experiments on the CMU-MOSI and CMU-MOSEI datasets for the baseline models and the BiMSA model. Regardless of the dataset used, the BiMSA model outperforms the baseline models in terms of both regression and classification evaluation indicators.
Experimental Performance of BiMSA Model on CMU-MOSI Dataset.
Experimental Performance of BiMSA Model on CMU-MOSEI Dataset.
On the CMU-MOSI dataset, compared to the ConFEDE model, the strongest of the current baselines, the MAE of the BiMSA model is reduced by 0.014, the correlation is increased by 0.018, and the seven-class accuracy is higher than that of the ConFEDE model by 2.21%.
The trend is similar on the CMU-MOSEI dataset: compared to the ConFEDE model, the MAE of BiMSA is decreased by 0.007, the correlation is increased by 0.021, and BiMSA demonstrates an enhancement of 1.87%.
The above experiments illustrate that BiMSA, combining the BiGRU structure with the bidirectional interactive attention module, better connects the modality representation learning layer and the modality fusion layer of the model, enriching the contextual information. The BiGRU structure considers both the historical and the future information relative to the current moment and can exploit context to make more comprehensive predictions. The bidirectional interactive attention module strengthens the model’s attention to information at different positions in the input sequence, so that the model can make full use of contextual information and better understand sequence data. At the same time, adding the similarity constraint and the difference constraint to the loss function brings the interactive features closer in the feature space and reduces redundancy, while the difference constraint disperses different features in the space, thereby improving representation learning, making features more discriminative, and improving the robustness and generalization ability of the model. This also validates the effectiveness and advancement of BiMSA in multimodal sentiment analysis tasks.
On both datasets, ablation experiments were performed, divided into three parts: modality ablation, model component ablation, and loss function ablation. Tables 3 and 4 show the results.
Ablation Experiment of BiMSA Model on CMU-MOSI Dataset.
Ablation Experiment of BiMSA Model on CMU-MOSEI Dataset.
Modality Ablation
In the modality ablation part, this paper mainly examines the difference between the influence of a single modality and that of multiple modalities on sentiment analysis. In other words, in addition to the full BiMSA model, sentiment analysis is performed using only text, only video, and only audio, respectively.
The results in Tables 3 and 4 demonstrate that the BiMSA model performs better than any single-modality sentiment analysis model in the tables: the multimodal model achieves the lowest mean absolute error, the highest correlation, the highest classification accuracy, and the highest F1-score. This proves the advantage of multimodal sentiment analysis over single-modality sentiment analysis.
In addition, in single-modality sentiment analysis, the model using only the text modality performs significantly better than the model using only the video modality, and the model using only the audio modality performs the worst. Among them, on the two datasets, the accuracy of the text-only model decreased by 2.59%.
Model Component Ablation
Firstly, the component that extracts modality representations with the network constructed by BiGRU is removed. On the CMU-MOSI dataset, the model without the BiGRU module is significantly weaker than the model with it: the MAE of the former is higher, and the binary classification accuracy is reduced by 2.55%.
Additionally, the bidirectional interactive attention mechanism component is excluded from the modality fusion module. Upon removing this component, the regression and classification indicators of the model decline. Specifically, on the CMU-MOSI dataset, the binary classification accuracy decreased by 1.51%.
Loss Function Ablation
For the ablation of the loss function, the similarity loss function and the difference loss function are removed, respectively. In terms of the binary classification accuracy on the CMU-MOSI dataset, the performance of the model without the similarity loss function decreased by 2.29%.
Loss Curve
The error curves of the loss function are traced. In Figure 2, the left plot shows the loss curves on the CMU-MOSI dataset, and the right plot shows the loss curves on the CMU-MOSEI dataset. On both datasets, the loss curves of the training set and the validation set gradually decline and converge as the number of epochs increases, and the gap between them is very small, indicating that the model suffers from neither overfitting nor underfitting and possesses generalization capability.

Loss function trajectories of the BiMSA model.
Conclusion
In this paper, BiMSA, a multimodal sentiment analysis model, is proposed. The representation learning module utilizes BERT to extract features from text and a unit structure composed of BiGRU to extract deep features from the video and audio modalities, so that the model can capture context information more effectively, which is important for understanding continuous motion in video and audio. Then, the features of the three modalities are projected into modality internal representations and modality interactive representations through linear transformations, yielding six representations. In the modality fusion module, a bidirectional interactive attention mechanism is employed to enable each modality to be aware of potential information from the other representations and to fuse them synergistically for sentiment analysis, which makes the model more flexible in processing multimodal information. Finally, during training, three loss functions, namely the task loss, the similarity loss, and the difference loss, are combined to minimize the differences between the modality interactive representations while maximizing the differences between the modality internal representations and the modality interactive representations, reducing redundancy and maximizing information extraction.
The experimental results demonstrate the high effectiveness of BiMSA, which surpasses previous models in performance. Ablation experiments and visualization analysis further validate the generalization capability of the representation learning module and the rationality of each component. In general, the idea proposed in this paper, focusing on the extraction and fusion of feature information, has been empirically proven effective. Moving forward, we plan to incorporate additional modalities for sentiment analysis, such as exploring the impact of physiological signals.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work is supported in part by the National Natural Science Foundation of China (12361072) and Xinjiang Natural Science Foundation of China (2023D01A36).
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
