Abstract

Current fake news detection models do not adequately extract fine-grained image features and overlook the important influence of shallow text features on the results. In addition, their fusion methods are too simple and do not account for the differing importance of the various information sources to the final detection result. To address these limitations, we propose the Dual-branch Hybrid Visual Networks and Hierarchical Adaptive Fusion Strategy Model (DVHAM). Specifically, we design a dual-branch hybrid network based on transformer and convolutional neural network architectures. This network considers the global features of the image while also fully incorporating its local details. In addition, we incorporate the hierarchical information of text to construct a hierarchical adaptive dynamic fusion module. The module employs a paired multihead attention mechanism and an adaptive adjustment strategy based on a gating mechanism. This design enables the model to capture and utilize the complementarity and correlation between visual information and the semantic information at different levels of the text model. It then adaptively fuses the modal interaction features, which carry different information, for the final detection task. DVHAM achieves accuracies of 91.5%, 92.4%, and 90.6% on the Weibo, Twitter, and PHEME datasets, respectively, which demonstrates its effectiveness for fake news detection.

Keywords

multimodal fake news detection; dual-branch hybrid network; feature fusion; adaptive strategy

1. Introduction

With the rapid development of the Internet and social media, online media has become increasingly prominent as a channel for news acquisition. More and more people use social media to get information about all aspects of social life, from politics and the economy to celebrity gossip, which has had a far-reaching impact on society. According to the latest statistics from the International Telecommunication Union, nearly 5.4 billion people worldwide, or about 67% of the world’s total population, were using the Internet in 2023, a 45% increase from 2018. This figure reflects the unprecedented ease of access to news, but it also raises the challenge of news authenticity. Compared with traditional media, social media allows individuals and organizations to publish content freely, making information generation and sharing free and flexible. However, this convenience has also lowered the threshold for posting information, which has led to the proliferation of fake news on social media and brought negative impacts on society. Moreover, algorithmic recommendation and personalized filtering on social media make it easier for users to access information that agrees with their views, further exacerbating the distortion of the truth and the spread of fake news, which seriously impedes the healthy development of social media and the Internet as a whole.

Compared with traditional newspaper media, social media platforms have become one of the main channels for news dissemination because of their real-time and interactive features. On social media, news also exhibits a trend toward multimodality. According to research statistics (Yang, 2016), 35.1% of fake news takes the form of combined text and images, and 24.3% combines the three modes of text, image, and video at the same time. Especially in the fields of public health and social events, false information is usually packaged and disseminated with rich multimedia content. This indicates that relying only on single-modal text features can no longer meet the demand for effective detection of multimodal fake news on social media. Therefore, how to effectively detect multimodal fake news spread through social media platforms, and thereby keep the cyberspace environment healthy, has become an important challenge in today’s society.

Early fake news detection relied mainly on domain experts to make factual judgments and verifications, which could not guarantee timeliness or scalability and consumed considerable human resources. To overcome these limitations, researchers classified fake news by combining manually labeled features with traditional machine learning (Noble, 2006; Rish et al., 2001; Safavian & Landgrebe, 1991), but did not achieve satisfactory detection results. With the continuous development and application of deep learning, researchers have started to use deep neural networks to detect multimodal news content (Farhangian et al., 2023), and methods based on convolutional neural networks (CNNs; Yoo, 2015) and recurrent neural networks (RNNs; Sherstinsky, 2020) have been widely proposed. These approaches typically employ pretrained models to extract text and image features from the news separately. The features are then combined by simple concatenation or enhanced by an attention mechanism to obtain the final classification features. With the advent of the transformer (Vaswani et al., 2017), many methods have begun to use the BERT model (Kenton & Toutanova, 2019) for text feature extraction because of its powerful semantic extraction capabilities. These methods typically use the output of the last encoding layer of the BERT model as the text feature representation. In addition, many methods combine BERT with transformer-based visual models, such as Vision Transformer (ViT; Dosovitskiy et al., 2020) and Swin Transformer (SwinT; Liu et al., 2021), to extract language and visual features, respectively. Although these methods outperform traditional methods (Wu et al., 2023), they still make insufficient use of text and visual features and have difficulty modeling fine-grained interactions between images and text.

Considering the above issues, this paper identifies several challenges in feature extraction and feature fusion for multimodal fake news detection methods:

(1) Current feature extraction methods are not comprehensive enough to support effective fusion (Dosovitskiy et al., 2020; Kenton & Toutanova, 2019; Vaswani et al., 2017). For text feature extraction, a common approach is to use the last encoding layer of the BERT model as the text representation. However, this approach overlooks the nuanced information captured in the intermediate layers of the BERT model, resulting in the loss of important shallow and local semantic information. For image feature extraction, the general approach is to use traditional CNNs alone. While CNNs excel at local feature extraction, their ability to obtain global features is limited, resulting in overly localized image features. Other methods rely solely on transformer-based networks. Transformer models benefit from a global receptive field, but they perform poorly at extracting local features. Therefore, most models are unable to extract image features adequately, which affects the multimodal fusion.
(2) Effective fusion of multimodal features is a major challenge in multimodal fake news detection. Traditional fusion methods (Sherstinsky, 2020; Yoo, 2015), such as early fusion and late fusion, often fail to adequately capture the complex interactions across modalities. Some existing methods (Qian et al., 2021; Singhal et al., 2019) simply concatenate the extracted features and thus fail to explore the intrinsic correlations and complementarities among multimodal features. Other methods fuse features from different modalities with fixed weights (Qian et al., 2021; Singhal et al., 2019; Zhou et al., 2020), overlooking the fact that different information sources contribute differently to the final result. Consequently, the comprehensive extraction and effective utilization of multimodal features, as well as the effective modeling of multimodal fusion networks, have become active research topics in this field.
To address the above problems, this study proposes a new detection model that combines a dual-branch hybrid network and a hierarchical adaptive fusion strategy to improve the detection of multimodal fake news containing images and text. The main contributions of this paper are as follows:

We propose a dual-branch hybrid network based on CNN and transformer for fine-grained image modeling. This network design allows for deep interaction and fusion between the local features of convolution operations and the global representations of the self-attention mechanism, thereby enhancing the learning of image feature representations. This method not only extracts key information from images but also captures subtle differences within the images, which benefits the subsequent fusion.

We propose a hierarchical adaptive dynamic fusion module that uses a paired multi-head attention (MHA) mechanism and an adaptive adjustment strategy based on a gating mechanism. The hierarchical modal interaction approach enables the model to capture and exploit the complementarities and correlations between the different levels of semantic and visual information in the text model. The adaptive tuning strategy enables the model to dynamically adjust the importance of the multilevel fusion features for the final detection task based on the quality and reliability of each level of fusion.

We propose the Dual-branch Hybrid Visual Networks and Hierarchical Adaptive Fusion Strategy Model (DVHAM) for multimodal fake news detection. DVHAM is validated on three real-world, publicly available datasets. The experimental results show an overall performance improvement over existing baseline methods.

2. Related Work

Methods for fake news detection are closely related to the evolution of news forms. Therefore, following the shift of news from unimodal to multimodal forms, this section reviews unimodal and multimodal detection methods in turn.

2.1. Unimodal Fake News Detection

In terms of text, traditional unimodal fake news detection methods mainly rely on hand-designed statistical features (Castillo et al., 2011; Potthast et al., 2017; Volkova et al., 2017). These methods can capture the surface features of text but are limited in capturing deep semantic information. In recent years, deep learning-based methods have gained much attention in fake news detection. These methods use techniques such as pretrained word vectors, CNNs, and long short-term memory networks to extract richer semantic information (Hochreiter & Schmidhuber, 1997; Mikolov et al., 2013). For example, Ma et al. (2015) proposed a fake news learning model based on a recurrent neural network (RNN) that represents text features as a time series. Chen et al. (2018) proposed an RNN-based fake news detection model that selectively learns temporal hidden representations of consecutive news text sequences and captures the contextual changes of related news content over time.

With the diversification of news formats, images have taken on a key role in news content. Consequently, researchers have increasingly used visual features to detect false information (Jin et al., 2016; Wu et al., 2015). Existing methods mainly focus on analyzing the accompanying image information and the type of image features. For example, Qi et al. (2019) proposed an image pattern capture model based on CNN. They used CNNs to extract the frequency-domain patterns of images, used RNNs for semantic detection of photo authenticity, and finally used an attention mechanism to fuse the image patterns with the frequency-domain patterns. As data takes increasingly diverse forms, the unimodal approach to fake news detection shows limitations in real-world applications.

2.2. Multimodal Fake News Detection

Multimodal fake news detection aims to determine the authenticity of news using information from multiple modalities, such as text and images, simultaneously. Early work is mainly based on the attention mechanism to learn and fuse multimodal features. For example, Jin et al. (2017) proposed an RNN based on the attention mechanism to fuse image, text, and background features. Kumari and Ekbal (2021) designed a framework to improve multimodal representation by maximizing the correlation between text and image features. In recent years, the application of pretrained language models has greatly contributed to the development of this field. For example, Tuan and Minh (2021) introduced BERT to learn textual features of posts. Wang et al. (2018) proposed the EANN model, which learns feature representations of text and images with the help of an event discriminator. Zhuang and Zhang (2022) used GloVe (Pennington et al., 2014) and VGG16 (Simonyan & Zisserman, 2014) to extract text and image features, respectively. Khattar et al. (2019) used a reconstruction task to help feature fusion. In addition, scholars have worked on modeling the semantic hierarchy of text. Qian et al. (2021) designed HMCAN, which can model hierarchical semantic relationships. Building on this, Li et al. (2021) further modeled the correspondence between image and text entities and proposed the Entity-oriented Multi-modal Alignment and Fusion network.

In general, multimodal fake news detection fuses text and visual information and can achieve better results than a single modality, but there are still some deficiencies yet to be solved. For example, the modeling in feature extraction is not deep enough, resulting in less comprehensive feature extraction. The fusion strategy of the model is too simple to make full use of the correlation between modalities, and even if the interactions between modalities are taken into account, most of the works only consider the interactions between the image features and the deep textual semantic features, ignoring the influence of the shallow features on the judgment of the authenticity of the information.

3. Method

3.1. Model Overview

In this paper, we propose a multimodal fake news detection model that combines visual feature extraction by a dual-branch hybrid network with a hierarchical adaptive fusion strategy. The overall framework of the model is shown in Figure 1. The model is mainly composed of the following four parts:

Figure 1.

Overall Architecture of Dual-branch Hybrid Visual Networks and Hierarchical Adaptive Fusion Strategy Model (DVHAM). The Model is Mainly Composed of Four Modules: an Input Layer, a Feature Extraction Layer Based on the Dual-Branch Hybrid Visual Network and the BERT Model, a Hierarchical Adaptive Feature Fusion Layer, and an Output Layer.

Input layer: The main task is to receive the input data, including the text and image information of the news, and perform the corresponding preprocessing to meet the needs of the model.

Feature extraction layer: This layer is composed of a dual-branch hybrid visual network and a BERT model. Its primary role is to fully extract and understand the image and textual features in preparation for the integration stage of the model. The dual-branch hybrid visual network is responsible for extracting visual features from images, including global features and local details. The BERT model extracts hierarchical semantic features from the text.

Feature fusion layer: The main function of this layer is to effectively fuse visual features with text features. We adopt a hierarchical adaptive fusion strategy that enables the model to dynamically assign weights according to the importance of different features. This strategy not only improves the flexibility of the model but also enables the model to better capture and utilize the correlation between visual and textual information.

Output layer: Based on the fused features, this layer determines whether the input data is fake news. The following sections will introduce the DVHAM method in detail.

For ease of reading, Table 1 lists the symbols used in this paper and their definitions.

Table 1.

Relevant Symbols and Definitions Used in This Document.

Symbol	Definition
$T$	Sequence of cleaned news text
$t_{i}$	The $i$ th word in the input text
$b_{i}$	Embedding vector representation in the BERT model
$W_{T}$	Word embedding matrix with dimensions $n \times d$
$H^{l}$	Hidden state of the $l$ th layer BERT encoder
$B^{l}$	Feature representation obtained by applying a linear layer to the hidden state $H^{l}$
$I$	Input image with dimensions $256 \times 256 \times 3$
$I_{h}$	Image features processed by the head network of dual-branch hybrid visual networks
$I_{C_{1}}$	Output feature for the first stage of the CNN branch module
$I_{T_{1}}$	Output feature for the first stage of the transformer branch module
$I_{C_{i}}$	Feature map output of the CNN branch at stage $i$
$I_{T_{i}}$	Patch embedding output of the transformer branch at stage $i$
$g_{i} (I)$	Output of the $i$ th interaction aggregation stage
$F_{C T_{i}}$	Output of the interaction structure during the $i$ th stage from the CNN branch to the transformer branch
$F_{T C_{i}}$	Output of the interaction structure during the $i$ th stage from the transformer branch to the CNN branch
$B_{i} (I)$	Matrix obtained by bilinear fusion of two features at position L
$A_{i} (I)$	Matrix obtained after summation pooling operation on $B_{i} (I)$
$m_{i} (I)$	Vectorized form of the matrix $A_{i} (I)$
$M_{i} (I)$	Normalized matrix obtained from the vector $m_{i} (I)$
$G$	Final comprehensive visual features obtained after $N$ interaction aggregation stages
$N$	Number of stages in the two-branch interaction aggregation process
$X^{i}$	Text feature matrix at the $i$ th layer
$T_{I}^{i}$	Cross-modal attention output with text as query
$I_{T}^{i}$	Cross-modal attention output with image as query
$h_{i}$	Cross-modal attention computation result
$H_{T}$	Concatenated matrix of outputs from multiple cross-modal attention heads
$C_{T I}^{i}$	Interaction matrix based on text and visual information at the $i$ th layer
$α_{i}$	Learned weight parameter matrix
$C_{M}$	Final fusion feature
$C^{1}$ – $C^{4}$	Four different levels of global pooling features
$C_{M}^{'}$	Multimodal feature matrix after the first linear transformation and ReLU activation function
$C_{M}^{″}$	Multimodal feature matrix after the second linear transformation and ReLU activation function
$p$	Predicted probability value after Softmax activation function
$y^{'}$	Predicted label, taking a value in $\{0, 1\}$ , where 0 means real news and 1 means fake news

Note. CNN = convolutional neural network; ReLU = rectified linear unit.

3.2. Input Layer

The multimodal fake news detection task studied in this paper is modeled as a binary classification problem: determining whether the input news is real or fake. The input to the model is multimodal content containing both text and images. The role of the input layer is to preprocess the text and image content of the news data for model training. For news text, the first step is data cleaning, that is, removing irrelevant characters from the text. The cleaned news text can then be represented as $T = \{t_{1}, t_{2}, \dots, t_{n}\}$ , where $t_{i}$ denotes the $i$ th word in the input text. We transform each word $t_{i}$ in the original text $T$ into an embedding vector $b_{i} \in R^{d}$ of the BERT model and obtain the corresponding word embedding matrix $W_{T} \in R^{n \times d}$ . The BERT model (Kenton & Toutanova, 2019) used in this paper is a pretrained model with its own vocabulary embedding layer, which can be represented as WordEmbed( $t_{i}$ ). The formulas are shown below:

$$ b_{i} = \mathrm{WordEmbed}(t_{i}), \tag{1} $$

$$ W_{T} = \{b_{1}, b_{2}, \dots, b_{n}\}. \tag{2} $$

The final obtained word embedding matrix $W_{T}$ can be input to the text feature extraction model to obtain the corresponding hierarchical semantic feature vectors.

When processing news image data, the input image $I \in R^{H \times W \times C}$ is first resized to a uniform size of $256 \times 256$ , because most pretrained models in deep learning require input images of the same size. Next, the resized images are normalized and converted into tensors for input to the visual feature extraction module of the dual-branch hybrid network based on CNN and transformer.
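The image preprocessing described above can be sketched as follows. This is only an illustration: the nearest-neighbour resizing, the `mean`/`std` values, and the function name `preprocess_image` are assumptions, since the text states only that images are resized to $256 \times 256$ and normalized.

```python
import numpy as np

def preprocess_image(image, size=256, mean=0.5, std=0.5):
    """Resize an H x W x C uint8 image to size x size and normalize it.

    Nearest-neighbour resizing and the mean/std values are illustrative
    assumptions, not settings taken from the paper.
    """
    h, w, _ = image.shape
    # Map every output row/column to a source row/column.
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    resized = image[rows][:, cols]
    # Scale pixel values to [0, 1], then standardize.
    scaled = resized.astype(np.float32) / 255.0
    return (scaled - mean) / std

rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=(480, 640, 3), dtype=np.uint8)  # dummy photo
tensor = preprocess_image(raw)
print(tensor.shape)  # (256, 256, 3)
```

In practice an image library with proper interpolation would replace the index-based resize; the sketch only fixes the order of the steps (resize, scale, normalize).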

3.3. Feature Extraction Layer

After a piece of news data has been preprocessed in the input layer, we feed the text and images into different feature extraction modules to realize the vectorized representation of the data. Therefore, the feature extraction layer of the model consists of two parts: hierarchical feature extraction of text and image feature extraction based on the dual-branch hybrid visual network.

3.3.1. Hierarchical Feature Extraction of Text

Text, as the main content of news, usually contains rich information, so text features are important clues for fake news detection. The quality of text feature extraction, and whether the features are adequately utilized, have a significant impact on the final detection result, so we use the BERT pretrained language model (Kenton & Toutanova, 2019), which has excellent semantic extraction ability, to extract text features. However, most methods use only the output of the last layer of the BERT model and ignore the hidden states of the intermediate layers, thus losing the complete hierarchical semantic information. To capture complete hierarchical semantic features of text and learn better multimodal news representations, we adopt a BERT pretrained model containing 12 layers of transformer encoders. First, we input the text sequence $W_{T} = \{b_{1}, b_{2}, \dots, b_{n}\}$ into the BERT model, where $b_{i}$ represents the embedding vector of the $i$ th word. The BERT model then generates a hidden state for each encoder layer. The hidden state for the $l$ th encoder layer is denoted as $H^{l}$ , which can be represented as:

$$ H^{l} = \mathrm{TransformerLayer}_{l}(W_{T}), \tag{3} $$

where $\mathrm{TransformerLayer}_{l}$ denotes the encoder function of the $l$ th layer of the BERT model, and $W_{T}$ is the input text sequence. The hidden state $H^{l}$ can be expressed as:

$$ H^{l} = \{h_{1}^{l}, h_{2}^{l}, \dots, h_{n}^{l}\}, \tag{4} $$

where $h_{i}^{l}$ represents the hidden state of the $i$ th word at the $l$ th layer, and $n$ is the length of the text sequence. Next, we apply a linear transformation to each hidden state $H^{l}$ to obtain the feature representation $B^{l}$ :

$$ B^{l} = \mathrm{Linear}(H^{l}) = W^{l} H^{l} + b^{l}. \tag{5} $$

Here, $\mathrm{Linear}$ represents the linear layer function, and $W^{l}$ and $b^{l}$ are the learnable parameters of the linear layer. Finally, we take the feature representations from the third, sixth, ninth, and 12th encoder layers to form the multilevel feature representations:

$$ X^{1} = B^{3}, \quad X^{2} = B^{6}, \quad X^{3} = B^{9}, \quad X^{4} = B^{12}. \tag{6} $$

These multilevel feature representations, denoted as $X^{1}, X^{2}, X^{3}, X^{4}$ , capture semantic information at different scales and abstraction levels, enhancing the model’s expressiveness and robustness to input changes.
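The layer selection in Eqs. (5) and (6) can be illustrated with a minimal NumPy sketch. Random arrays stand in for the BERT hidden states, and the sizes `n`, `d`, and `d_out` as well as the names `hidden_states`, `params`, and `text_levels` are assumptions made for illustration, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_out = 8, 768, 256   # sequence length, hidden width, projected width (assumed)
num_layers = 12

# Stand-ins for the hidden states H^1 ... H^12, each of shape (n, d).
hidden_states = [rng.standard_normal((n, d)) for _ in range(num_layers)]

def linear(h, weight, bias):
    """Eq. (5) in row-vector form: B^l = H^l W^l + b^l."""
    return h @ weight + bias

selected_layers = (3, 6, 9, 12)   # layers named in Eq. (6)

# One independent linear layer (W^l, b^l) per selected encoder layer.
params = {
    layer: (rng.standard_normal((d, d_out)) * 0.02, np.zeros(d_out))
    for layer in selected_layers
}

# X^1 ... X^4: projected features from layers 3, 6, 9, and 12.
# The list is 0-indexed, so layer l lives at index l - 1.
text_levels = [
    linear(hidden_states[layer - 1], *params[layer])
    for layer in selected_layers
]
print([x.shape for x in text_levels])
```

With a real encoder the indexing may differ. For example, Hugging Face Transformers returns the embedding output as the first entry when `output_hidden_states=True` is requested, so layer $l$ would sit at index $l$ rather than $l - 1$.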

3.3.2. Image Feature Extraction Based on the Dual-Branch Hybrid Visual Networks

As an important part of news, images contain feature information that is closely related to the text, and they therefore play an indispensable auxiliary role in fake news detection. Most current methods use pretrained CNN-based visual models such as ResNet (He et al., 2016) and VGG (Simonyan & Zisserman, 2014) to extract features, but these models do not sufficiently mine the global features of the images. Other methods use transformer-based visual models to acquire image features, but these are not as good as CNNs at focusing on the local salient features of the image. To extract richer and more comprehensive image features, this paper designs a dual-branch hybrid network based on the transformer and CNN architectures; its detailed composition is shown in Figure 2.

Figure 2.

Dual-Branch Hybrid Visual Networks for Image Feature Extraction.

The dual-branch hybrid network is composed of a head module and $N$ interactive aggregation stages. First, the input image $I \in R^{256 \times 256 \times 3}$ passes through the head module, which contains convolutional and max-pooling layers, to generate the initial feature map $I_{h}$ . The feature map $I_{h}$ is the input to the CNN branch module (CnnBlock) and the transformer branch module (TranBlock) in the initial phase, and the generated features $I_{C_{1}}$ and $I_{T_{1}}$ are used as inputs to the $N$ interactive aggregation stages. The specific process is as follows:

$$ I_{h} = \mathrm{HeadBlock}(I), \tag{7} $$

$$ I_{C_{1}} = \mathrm{CnnBlock}(I_{h}), \tag{8} $$

$$ I_{T_{1}} = \mathrm{TranBlock}(I_{h}), \tag{9} $$

where $\mathrm{CnnBlock}(\cdot)$ and $\mathrm{TranBlock}(\cdot)$ represent the computational processes of the CNN branch module and the transformer branch module, respectively. Next, using features $I_{C_{1}}$ and $I_{T_{1}}$ as inputs, $N$ stages containing the dual-branch module, the interaction structure, and the aggregation module are sequentially executed to obtain the final image features. The output of the previous stage in this process is the input of the next stage. The output of the $i$ th stage, $g_{i}(I)$ , is defined as:

$$ g_{i}(I) = \begin{cases} \mathrm{InterAgg}_{1}(I_{C_{1}}, I_{T_{1}}), & \text{if } i = 1, \\ \mathrm{InterAgg}_{i}(g_{i-1}(I), g_{i-1}(I)), & \text{if } i = 2, 3, \dots, N, \end{cases} \tag{10} $$

where $\mathrm{InterAgg}_{i}$ denotes the process of the interaction aggregation stage. Finally, the image feature matrix representation $G$ for the fusion stage is generated by the fully connected layer, rectified linear unit (ReLU) activation, and dropout layer:

$$ G = \mathrm{DP}(\mathrm{ReLU}(\mathrm{Linear}(g_{N}(I)))). \tag{11} $$
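The control flow of Eqs. (7) to (11) can be sketched as a short loop. Every callable below is a toy placeholder chosen only so the example runs; none of them reflects the actual HeadBlock, CnnBlock, TranBlock, or InterAgg computations, and the name `run_visual_pipeline` is hypothetical. Dropout is omitted, as it would be at inference time.

```python
import numpy as np

def run_visual_pipeline(image, head, cnn_block, tran_block, stages, project):
    """Control flow of Eqs. (7)-(11); every callable is a placeholder."""
    i_h = head(image)              # Eq. (7): initial feature map
    i_c1 = cnn_block(i_h)          # Eq. (8): first CNN-branch output
    i_t1 = tran_block(i_h)         # Eq. (9): first transformer-branch output
    g = stages[0](i_c1, i_t1)      # Eq. (10), case i = 1
    for stage in stages[1:]:       # Eq. (10), cases i = 2, ..., N
        g = stage(g, g)            # previous output feeds both branches
    return project(g)              # Eq. (11), without dropout

def relu(x):
    return np.maximum(x, 0.0)

n_stages = 3  # N, chosen arbitrarily for the toy run
features = run_visual_pipeline(
    image=np.ones((4, 4, 3)),
    head=lambda img: img.mean(axis=-1),
    cnn_block=lambda x: 2.0 * x,
    tran_block=lambda x: x + 1.0,
    stages=[lambda a, b: 0.5 * (a + b)] * n_stages,
    project=relu,
)
print(features.shape)  # (4, 4)
```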

Each interactive aggregation stage primarily consists of three components: the dual-branch module, the interaction structure, and the aggregation module. The dual-branch module is formed by a CNN branch and a transformer branch. The CNN branch is designed to provide local feature details to the transformer branch, while the transformer branch is intended to enhance the global perceptual ability of the CNN branch. The CNN branch (He et al., 2016) includes a $1 \times 1$ down-projection convolution, a $3 \times 3$ spatial convolution, a $1 \times 1$ up-projection convolution, and a residual connection between the input and output. The transformer branch (Dosovitskiy et al., 2020) contains an MHA module and a multilayer perceptron (MLP) module. Layer normalization (LayerNorm) is performed before each layer, and residual connections are established in the attention layer and MLP module. The process of the $i$ th stage is shown below:

$$ I_{C_{i+1}} = \mathrm{CnnBlock}_{i}(I_{C_{i}}), \tag{12} $$

$$ I_{T_{i+1}} = \mathrm{TranBlock}_{i}(I_{T_{i}}). \tag{13} $$

Here, $I_{C_{i}}$ and $I_{T_{i}}$ represent the feature map output of the CNN branch and the patch embedding output of the transformer branch in the previous stage, respectively, and $I_{C_{i+1}}$ and $I_{T_{i+1}}$ represent the outputs of the current stage.

The interaction structure comprises a down-sampling module and an up-sampling module. This is designed to address the misalignment issue when interacting between the feature maps of the CNN branch and the patch embeddings of the transformer branch. To input the feature maps from the CNN branch into the transformer branch, a $1 \times 1$ convolution is initially applied to align the channel numbers with the patch embeddings, followed by a down-sampling module to align the spatial dimensions. Subsequently, the feature maps are added to the patch embeddings. When feedback is provided from the transformer branch to the CNN branch, the patch embeddings are up-sampled to align the spatial scale, and the channel dimension is aligned with the CNN feature maps through a $1 \times 1$ convolution before being added to the feature maps. Throughout this process, LayerNorm and BatchNorm modules are utilized to normalize the features. The mathematical representation of this process is as follows:

$$ F_{C T_{i}} = \mathrm{DownSample}(I_{C_{i+1}}) + I_{T_{i}}, \tag{14} $$

$$ F_{T C_{i}} = \mathrm{UpSample}(I_{T_{i+1}}) + I_{C_{i+1}}, \tag{15} $$

where $F_{C T_{i}}$ and $F_{T C_{i}}$ represent the outputs of the interaction structure during the $i$ th stage when fed from the CNN branch to the transformer branch and from the transformer branch back to the CNN branch, respectively. $\mathrm{DownSample}$ and $\mathrm{UpSample}$ refer to the down-sampling module and the up-sampling module, respectively.

The aggregation module primarily serves to effectively combine feature information from the two branches, ensuring feature integrity while minimizing redundant information. This module aligns the patch embeddings obtained from the transformer branch with the CNN feature map via an up-sampling module. They are then combined at each position using matrix outer product and average pooling to obtain an aggregated feature representation. First, for image $I$ , consider the bilinear fusion of two features $F_{C T_{i}}(L, I) \in R^{M \times N}$ and $F_{T C_{i}}(L, I) \in R^{M \times N}$ at location $L$ to obtain the matrix $B_{i}(I)$ . Subsequently, a summation pooling operation is performed on $B_{i}(I)$ at all locations, yielding the matrix $A_{i}(I)$ . This is shown below:

$$ B_{i}(I) = \mathrm{Bilinear}(L, I, F_{C T_{i}}, F_{T C_{i}}) = F_{C T_{i}}(L, I)^{T} F_{T C_{i}}(L, I) \in R^{M \times N}, \tag{16} $$

$$ A_{i}(I) = \sum_{L} B_{i}(I) = \sum_{L} F_{C T_{i}}(L, I)^{T} F_{T C_{i}}(L, I). \tag{17} $$

Subsequently, the obtained matrix $A_{i}(I)$ is vectorized and normalized to yield the matrix $M_{i}(I)$ . The process is as follows:

$$ m_{i}(I) = \mathrm{vec}(A_{i}(I)) \in R^{M N \times 1}, \tag{18} $$

$$ M_{i}(I) = \mathrm{sign}(m_{i}(I)) \sqrt{| m_{i}(I) |} \in R^{M N \times 1}, \tag{19} $$

$$ g_{i}(I) = \frac{M_{i}(I)}{\| M_{i}(I) \|^{2}} \in R^{M N \times 1}. \tag{20} $$

The aggregated features $g_{i}(I)$ are processed by the sampling module, respectively, and then sent to the next dual-branch interaction aggregation stage. In this way, a rich and comprehensive visual feature $G$ is obtained after $N$ interaction aggregation stages for use in the fusion stage.
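The aggregation in Eqs. (16) to (20) can be sketched in NumPy as follows. The sketch assumes that each spatial location carries a row vector of length $M$ from one branch and of length $N$ from the other, so that the outer product at each location is $M \times N$. It follows Eq. (20) as written, dividing by the squared norm. The function name `aggregate`, the `eps` guard, and all sizes are illustrative assumptions.

```python
import numpy as np

def aggregate(f_ct, f_tc, eps=1e-12):
    """Bilinear aggregation following Eqs. (16)-(20) as written.

    f_ct: (L, M) array, one M-dimensional feature per location (assumed layout).
    f_tc: (L, N) array, one N-dimensional feature per location (assumed layout).
    """
    # Eqs. (16)-(17): summing the per-location outer products
    # f_ct[l]^T f_tc[l] over all locations equals one matrix product.
    a = f_ct.T @ f_tc                               # shape (M, N)
    # Eq. (18): flatten to a vector (row-major order here).
    m = a.reshape(-1)
    # Eq. (19): element-wise signed square root.
    signed_sqrt = np.sign(m) * np.sqrt(np.abs(m))
    # Eq. (20): divide by the squared norm; eps avoids division by zero.
    return signed_sqrt / (np.sum(signed_sqrt ** 2) + eps)

rng = np.random.default_rng(0)
num_locations, m_dim, n_dim = 5, 3, 4               # toy sizes
f_ct = rng.standard_normal((num_locations, m_dim))
f_tc = rng.standard_normal((num_locations, n_dim))
g = aggregate(f_ct, f_tc)
print(g.shape)  # (12,)
```

Replacing the last line of `aggregate` with a division by the plain norm would give unit-length outputs, which is a one-line change if that variant is preferred.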

3.4. Feature Fusion Layer

The primary objective of the feature fusion layer is to let the hierarchical text features and image features output by the feature extraction layer interact, generating a fused feature vector for fake news detection. Previous methods simply concatenated multimodal features and thus failed to capture the close connections between modalities at a fine-grained level. To address this issue, we designed a hierarchical adaptive dynamic fusion module. As shown in Figure 1, the module consists of four paired MHA mechanism (PAM) fusion modules and an adaptive dynamic fusion module. The detailed structure of the PAM fusion module is shown in Figure 3. The PAM module mainly consists of two cross-modal MHA layers (Vaswani et al., 2017) and feed-forward networks (FFN). The inputs of the PAM module are feature vectors from two different modalities, and each of the outputs is an interaction feature after residual connection and normalization operations.

Figure 3.

Paired Multi-Head Attention Mechanism (PAM).

To model multimodal information more effectively, we feed the text features from four different levels, together with the image features, into four PAM fusion blocks for interaction. This allows the model to capture and exploit the complementarity and correlation between visual information and semantic information at different levels of the text model, and thus to better understand the complex relationships between textual and visual information. The specific process is as follows. First, after linear projection, the text feature matrix $X^{i}$ is used as the query $Q$ of a cross-modal MHA layer, and the image feature $G$ is used as the keys $K$ and values $V$; the MHA output is the text feature matrix $T_{I}^{i}$, which has interacted with the image information. Similarly, to obtain the image feature matrix $I_{T}^{i}$ that has interacted with the text, the image feature matrix $G$ is used as the query $Q$ of a second cross-modal MHA layer (Vaswani et al., 2017), while the text feature matrix $X^{i}$ is used as its keys $K$ and values $V$. We then sum the inter-modal interaction features $T_{I}^{i}$ and $I_{T}^{i}$. Following the transformer structure, the sum is passed sequentially through the feed-forward neural network layer, the residual connection, and the normalization layer. After the above process, we obtain the interaction matrix $C_{T I}^{i}$ of the $i$ th layer, which is based on both textual and visual information.

For the PAM fusion block, the attention calculation process is shown in the following equation:

\begin{aligned} h_{i} = CrossAtt (Q_{A}, K_{B}, V_{B}) = SoftMax \left( \frac{Q_{A} K_{B}^{T}}{\sqrt{d_{h}}} \right) V_{B}, \end{aligned}

(21)

\begin{aligned} H = Concat (h_{1}; h_{2}; \dots; h_{n}), \end{aligned}

(22)

\begin{aligned} {MHA}_{1} ((X^{i}, G, G)) = H W^{0}, \end{aligned}

(23)

\begin{aligned} {MHA}_{2} ((G, X^{i}, X^{i})) = H W^{0}, \end{aligned}

(24)

where $h_{i}$ is the result of the cross-modal attention computation, whose inputs are the text feature matrix $X^{i}$ and the image feature matrix $G$ of the $i$ th stage, which come from two different modalities. MHA denotes the multi-head attention computation: the outputs of the cross-modal attention heads are concatenated and then linearly transformed. $W^{0} \in R^{256 \times 1}$ is the matrix for the linear transformation, $H$ is the concatenation of the outputs of the $n$ attention heads, and Concat denotes the concatenation operation. The cross-modal fusion matrix $C_{T I}^{i}$ of the $i$ th layer can then be formalized as follows:

\begin{aligned} T_{I}^{i} & = NM (X^{i} + {MHA}_{1} ((X^{i}, G, G))), \end{aligned}

(25)

\begin{aligned} I_{T}^{i} & = NM (G + {MHA}_{2} ((G, X^{i}, X^{i}))), \end{aligned}

(26)

\begin{aligned} C_{T I}^{i} & = NM ((T_{I}^{i} + I_{T}^{i}) + FFN (T_{I}^{i} + I_{T}^{i})), \end{aligned}

(27)

where NM and FFN denote the normalization layer and the feed-forward neural network, respectively, and $X^{i}$ is the text output matrix of the $i$ th layer in the text feature extraction model, $i \in \{1, 2, 3, 4\}$. Finally, the four hierarchical text features $X^{1}$, $X^{2}$, $X^{3}$, and $X^{4}$ and the image feature $G$ are input to the four PAM fusion modules to obtain four hierarchical fusion vectors $C_{T I}^{1}$, $C_{T I}^{2}$, $C_{T I}^{3}$, and $C_{T I}^{4}$, each of which carries fusion information from a different level.
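The PAM computation in equations (21) and (25) to (27) can be sketched in pure Python as follows. This is a simplified illustration rather than the authors' implementation: it uses a single attention head without learned projections, assumes the text and image feature matrices have the same shape so that they can be summed, uses a parameter-free layer normalization for NM, and takes the FFN as a caller-supplied function. All function names are hypothetical.

```python
import math

def softmax(row):
    m = max(row)
    e = [math.exp(x - m) for x in row]
    total = sum(e)
    return [x / total for x in e]

def cross_attention(q, k, v):
    """Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        w = softmax(scores)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v))
                    for c in range(len(v[0]))])
    return out

def layer_norm(x, eps=1e-5):
    """Row-wise normalization without learned scale or shift (assumption)."""
    out = []
    for row in x:
        mu = sum(row) / len(row)
        var = sum((t - mu) ** 2 for t in row) / len(row)
        out.append([(t - mu) / math.sqrt(var + eps) for t in row])
    return out

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def pam(x, g, ffn):
    """Paired cross-modal attention for one text level x and image feature g."""
    t_i = layer_norm(add(x, cross_attention(x, g, g)))   # Eq. 25: text queries image
    i_t = layer_norm(add(g, cross_attention(g, x, x)))   # Eq. 26: image queries text
    s = add(t_i, i_t)
    return layer_norm(add(s, ffn(s)))                    # Eq. 27
```

Calling `pam` once per text level with the same image feature yields the four fusion features described above.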

The effective integration of the four modal interaction vectors, each with a different degree of importance, is a crucial aspect of the detection task. Existing models often simply concatenate multimodal feature vectors or use fixed average weights to integrate features from different modalities. This approach overlooks the varying importance of different information sources to the final result, thereby introducing excessive noise. To address this issue, we propose an adaptive tuning structure that employs a gating mechanism, as illustrated in Figure 4. Specifically, our model assigns learnable weights to the four fusion features and performs an adaptive weighted concatenation to derive the final multimodal feature representation of a news article. The process unfolds as follows. First, we apply global average pooling to the four output features, compressing the global spatial information into a single-channel descriptor. These four features are then concatenated and processed through a gating mechanism, which comprises a fully connected layer, a ReLU layer, another fully connected layer, and a Sigmoid layer. This mechanism yields four weighting factors, denoted as $α_{1}$, $α_{2}$, $α_{3}$, and $α_{4}$. Each weighting factor is then multiplied element-wise with the corresponding feature, and the weighted features are concatenated to form the final fusion feature $C_{M}$. The process is summarized as follows:

\begin{aligned} C^{i} & = GAP (C_{T I}^{i}), i \in \{1, 2, 3, 4\}, \end{aligned}

(28)

\begin{aligned} S & = C^{1} \oplus C^{2} \oplus C^{3} \oplus C^{4}, \end{aligned}

(29)

\begin{aligned} α_{i} & = σ (f_{1} (S, W)) = σ (W_{i + 1} δ (W_{i} S)), \end{aligned}

(30)

\begin{aligned} C_{M} & = (α_{1} ⊙ C_{T I}^{1}) \oplus (α_{2} ⊙ C_{T I}^{2}) \oplus (α_{3} ⊙ C_{T I}^{3}) \oplus (α_{4} ⊙ C_{T I}^{4}), \end{aligned}

(31)

where $GAP ()$ denotes the global average pooling operation, $C^{i}$ is the compressed feature of the $i$ th fusion vector, $α_{i}$ is the learned weight, $σ$ is the Sigmoid function, $δ$ is the ReLU function, $\oplus$ denotes the feature concatenation operation, $⊙$ denotes element-wise multiplication, $W_{i}$ is a parameter of the fully connected layers, and $C_{M}$ is the final fusion feature.
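The gating procedure of equations (28) to (31) can be sketched as below. This is an illustrative pure-Python version under stated assumptions: each fusion feature is a tokens-by-channels list of lists with the same shape, the fully connected layers have no bias, and the weight matrices `w1` and `w2` are supplied by the caller instead of being learned. The function names are hypothetical.

```python
import math

def gap(feature):
    """Global average pooling over tokens: (tokens x d) -> d values (Eq. 28)."""
    n = len(feature)
    return [sum(row[c] for row in feature) / n for c in range(len(feature[0]))]

def linear(weights, x):
    """Bias-free fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def gate_weights(features, w1, w2):
    """FC -> ReLU -> FC -> Sigmoid on the concatenated pooled features."""
    s = [v for f in features for v in gap(f)]                          # Eqs. 28-29
    hidden = [max(0.0, h) for h in linear(w1, s)]                      # ReLU
    return [1.0 / (1.0 + math.exp(-z)) for z in linear(w2, hidden)]    # Eq. 30

def adaptive_fuse(features, w1, w2):
    """Scale each fusion feature by its gate and concatenate channels (Eq. 31)."""
    alphas = gate_weights(features, w1, w2)
    fused = []
    for t in range(len(features[0])):
        row = []
        for a, f in zip(alphas, features):
            row.extend(a * v for v in f[t])
        fused.append(row)
    return alphas, fused
```

Because the gates come from a Sigmoid, each weight lies in (0, 1) independently; unlike a Softmax, the four weights are not forced to sum to one.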

Figure 4.

Adaptive Dynamic Fusion Module.

3.5. Output Layer

The output layer in the final part of DVHAM is the multimodal fake news detector. The role of this layer is to utilize the multimodal features obtained through the fusion layer to predict the authenticity of the news. This output layer is composed of two linear transformation layers with the ReLU activation function and a fully connected layer with the Softmax activation function. Its input is the multimodal feature $C_{M}$ and its output is the prediction label $y^{'} \in \{0, 1\}$ . The detailed process is shown in the following equations:

\begin{aligned} C_{M}^{'} & = DP (ReLU (Linear (C_{M}, (w_{1}, b_{1})))), \end{aligned}

(32)

\begin{aligned} C_{M}^{''} & = ReLU (Linear (C_{M}^{'}, (w_{2}, b_{2}))), \end{aligned}

(33)

\begin{aligned} p & = softmax (W C_{M}^{''} + b), \end{aligned}

(34)

where $p$ is the predicted probability obtained by applying the Softmax activation function to $C_{M}^{''}$, $w_{1}$, $w_{2}$, and $W$ are the weights of the fully connected layers, and $b_{1}$, $b_{2}$, and $b$ are the biases. In this binary classification task, when the model outputs a probability $p \geq 0.5$, the news is categorized as fake news; otherwise, it is categorized as real news. This ensures that the prediction always corresponds to the category with the higher probability, and yields the predicted label $y^{'}$:

y^{'} = \begin{cases} 1, & \text{if } p \geq 0.5, \\ 0, & \text{if } p < 0.5, \end{cases}

(35)

where 0 represents real news and 1 represents fake news. Finally, we use the binary cross-entropy loss function for training, as shown below:

L (θ) = - \sum [ y \log p + (1 - y) \log (1 - p) ],

(36)

where $y$ is the true label of the news sample, $p$ denotes the probability predicted by the model for this sample, and $θ$ denotes the learnable parameters of the model.
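The decision rule and the loss can be illustrated with a short sketch. It follows the convention stated in the text that 1 denotes fake news, 0 denotes real news, and a predicted probability of at least 0.5 is categorized as fake. The clipping constant `eps` is an added numerical safeguard and is not part of the paper; the function names are hypothetical.

```python
import math

def predict_label(p_fake, threshold=0.5):
    """Return 1 (fake) when the predicted probability reaches the threshold, else 0 (real)."""
    return 1 if p_fake >= threshold else 0

def bce_loss(labels, probs, eps=1e-12):
    """Binary cross-entropy summed over samples, as in Eq. 36.

    labels: true labels (1 = fake, 0 = real).
    probs:  predicted probabilities of the fake class.
    """
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0); not in the paper
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total
```

A confident correct prediction contributes almost nothing to the loss, while a confident wrong one is penalized heavily, which is why the clipping guard is useful in practice.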

4. Experimentation and Analysis

4.1. Datasets

In the research field of fake news detection, many real-world datasets are publicly available. To comprehensively and accurately evaluate the performance of DVHAM, we selected three multimodal datasets that contain both images and text and that differ in language and collection source: the WEIBO dataset (Jin et al., 2017), the TWITTER dataset (Boididou et al., 2014), and the PHEME dataset (Zubiaga et al., 2017). These three datasets enable us to test the generalization ability and validity of our model in diverse environments.

The WEIBO dataset was collected by Jin et al. (2017) and originates from Sina Weibo, a prominent social platform in China. Consequently, the dataset is in Chinese and has been officially authenticated by China’s authoritative Xinhua News Agency. As shown in Table 2, the dataset comprises 9,528 matched text-image pairs corresponding to news articles, with an approximately even distribution of real and fake news. We randomly split the WEIBO dataset into a training set and a validation set in an $8 : 2$ ratio.

Table 2.
Statistical Data for the WEIBO Dataset, TWITTER Dataset, and PHEME Dataset.

Dataset	Real news	Fake news	Image
WEIBO	4,779	4,749	9,528
TWITTER	6,026	7,098	514
PHEME	1,428	590	2,018

The TWITTER dataset (Boididou et al., 2014) was collected on the multimedia social platform Twitter and is in English. Each tweet in this dataset contains text, visual information, and user information. To match our research objectives, we removed the user information from the dataset. The dataset comprises a development set and a test set. The development set includes events associated with 17 rumors, whereas the test set includes events linked to an additional 35 rumors. As shown in Table 2, the dataset consists of 6,026 real news items, 7,098 fake news items, and 514 images, so the number of images is relatively small. In this study, we use the development set for training and the test set for testing.

The PHEME dataset (Zubiaga et al., 2017) was collected by its authors on the Twitter platform for five different events and is in English. The original dataset comprises news text, corresponding images, and annotations. Because our objective is to detect fake news that includes both text and images, we retained only those instances that contain both. The statistics after cleaning are presented in Table 2: the dataset includes matched text-image pairs for 2,018 news items. Compared with the WEIBO dataset, the distribution of news categories in the PHEME dataset is uneven, and the volume of data is relatively small, which makes it a challenging test for our model. We divided the PHEME dataset into training and test sets at a ratio of $8 : 2$ .

4.2. Experimental Setup

All experiments in this paper were conducted under Ubuntu 20.04 and Python 3.8, using the PyTorch deep learning framework for model construction and training. The GPU was an NVIDIA Tesla P100. The text encoders are the pretrained BERT models bert-base-chinese and bert-base-uncased, chosen according to the language of each dataset. Training used a batch size of 8 and ran for up to 200 epochs. To mitigate the risk of overfitting, a dropout rate of 0.6 was applied during training. The Adam optimizer was used with a learning rate of $1 \times 10^{- 5}$ and a weight decay of 0.01. An early stopping strategy was used: training was terminated when the model performance did not improve significantly within 20 epochs. Additionally, a linear learning rate decay strategy was used.
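The early stopping and learning rate schedule described above can be sketched in a framework-independent way. The sketch assumes that the learning rate decays linearly to zero over the full training budget and that "improves significantly" means exceeding the best validation score by more than a margin `min_delta`; neither detail is specified in the paper, and both function names are hypothetical.

```python
def linear_decay_lr(base_lr, epoch, total_epochs):
    """Learning rate after linear decay from base_lr (epoch 0) to 0 (final epoch)."""
    return base_lr * max(0.0, 1.0 - epoch / total_epochs)

def early_stopping_epoch(val_scores, patience=20, min_delta=0.0):
    """Return (stop_epoch, best_score) for a sequence of per-epoch validation scores.

    Training stops once `patience` epochs have passed since the last improvement.
    """
    best, best_epoch = float("-inf"), -1
    for epoch, score in enumerate(val_scores):
        if score > best + min_delta:
            best, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best
    return len(val_scores) - 1, best
```

With the reported settings, `linear_decay_lr(1e-5, epoch, 200)` gives the per-epoch learning rate and `patience=20` matches the stopping window.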

The multimodal fake news detection studied in this paper is a binary classification problem. The experiments use four commonly used metrics of binary classification performance, namely accuracy, precision, recall, and F1 score, to evaluate DVHAM and the baseline models. The time complexity of training can be expressed as $O (e n)$ , where $n$ represents the number of samples in the dataset and $e$ denotes the number of training epochs. This complexity is in line with that of most standard training paradigms. As the primary objective of this study is to improve model accuracy, we do not compare or analyze time complexity further.
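Because the results tables report precision, recall, and F1 separately for the fake and real classes, the metrics can be computed per class as in the following sketch. It uses the standard definitions; the function names and the label encoding (1 = fake, 0 = real) are illustrative choices.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def class_metrics(y_true, y_pred, positive):
    """Precision, recall, and F1 when `positive` is treated as the target class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Calling `class_metrics(y_true, y_pred, positive=1)` gives the fake-news columns and `positive=0` gives the real-news columns.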

4.3. Baseline Model

To validate the performance of the proposed model in the multimodal fake news detection task, a series of comparison tests with the baseline models are conducted. These baseline models (Qian et al., 2021) are mainly categorized into two types: unimodal models and multimodal models.

4.3.1. Unimodal Baseline Models

A unimodal model utilizes only a single information source, such as text or images, for fake news detection. The unimodal models we selected are described below:

SVM-TS (Ma et al., 2015): The model uses heuristic rules and SVM classifiers to detect and identify by analyzing time-series data from social media.

CNN (Yu et al., 2017): The model uses a CNN to flexibly extract key features dispersed in the input sequence and effectively identify fake news.

GRU (Ma et al., 2016): The model uses RNN to learn and capture hidden representations of contextual information about relevant posts over time for effective early rumor detection.

TextGCN (Yao et al., 2019): The method uses a graph convolutional neural network to construct a single textual graph convolutional network based on word co-occurrence and document word relations for effective text classification.

4.3.2. Multimodal Baseline Model

A multimodal model uses multiple sources of information, such as text and images, at the same time for fake news detection. The multimodal baseline models we selected are described below:

EANN (Wang et al., 2018): The model is called an Event Adversarial Neural Network, which is an end-to-end framework for extracting event invariant features. It consists of three main components: a multimodal feature extractor, a fake news detector, and an event discriminator.

Att-RNN (Jin et al., 2017): The method uses RNN with an attention mechanism to fuse multimodal features for effective rumor detection. For a fair comparison with this paper, the use of social context features was removed from the experiments.

MVAE (Khattar et al., 2019): The model is an end-to-end network that uses a bimodal variational autoencoder in conjunction with a binary classifier for the fake news detection task.

SpotFake (Singhal et al., 2019): The model does not depend on any other subtasks but uses textual and visual features of the article to detect fake news. Specifically, the BERT model is used to learn the textual features while the image features are learned from VGG-19, and finally, the features are directly concatenated for detection.

SAFE (Zhou et al., 2020): The model uses a similarity-aware multimodal approach to extract textual and visual features of the news separately. Then, the relationships between the cross-modal extracted features are further investigated. These were jointly learned and used to predict fake news.

HMCAN (Qian et al., 2021): The model detects fake news by jointly modeling multimodal contextual information and the hierarchical semantics of the text in a unified deep model.

GFNN (Li et al., 2023): The core idea of GFNN is to explore consistency and inconsistency from highly and lowly correlated word-region pairs, respectively.

MRAN (Yang et al., 2024): The MRAN model fuses intramodal and extramodal semantic information through multilevel text encoding and relation-aware attention mechanisms to achieve efficient fake news detection.

Table 3 compares the results of DVHAM and the baseline models on the three datasets. DVHAM achieves the highest accuracy on all three datasets: 0.915, 0.924, and 0.906, which are 1.2, 6.9, and 3.6 percentage points higher than the latest baseline model, MRAN, respectively. On the WEIBO dataset, DVHAM attains the best F1 scores for both fake news and real news. On the TWITTER dataset, it attains the best real-news F1 score (0.944), although GFNN reaches a higher fake-news F1 score (0.916 versus 0.879). These results indicate that the model performs well on both Chinese and English datasets. On the PHEME dataset, the fake-news F1 score of DVHAM (0.825) is also not the best, which we attribute to the small amount of data. Figure 5 shows the accuracy and fake-news F1 scores of all models. We next analyze the results in detail from the following aspects:

(1) Comparing traditional and deep learning methods, the SVM-TS model performs the worst overall, with the lowest accuracy on all three datasets, indicating that traditional hand-crafted features are not effective for recognizing fake news.
(2) Regarding unimodal and multimodal methods, the unimodal baseline models (CNN, GRU, and TextGCN) clearly underperform in terms of accuracy and F1 scores. DVHAM outperforms TextGCN, the strongest unimodal baseline on WEIBO and TWITTER, by 12.8, 22.1, and 7.8 percentage points in accuracy on the three datasets, respectively. This suggests that employing multiple sources of information for fake news detection is superior to relying solely on text, and that visual information can play a significant supporting role.
(3) Regarding the fusion mechanism, SpotFake uses the pretrained BERT model to extract text features and achieves better performance than the earlier multimodal models EANN and MVAE on the WEIBO and TWITTER datasets, which illustrates the power of BERT. However, SpotFake simply concatenates image and text features, and its accuracy and F1 scores are lower than those of HMCAN and DVHAM, which use attention mechanisms for fusion. This demonstrates that the attention mechanism is effective for modal fusion.
(4) DVHAM outperforms each baseline model in overall performance, illustrating the effectiveness of our paired MHA mechanism and adaptive fusion strategy.

Figure 5.
Comparison of Accuracy and F1 Scores for Fake News Detection Across Three Datasets: (a) WEIBO, (b) Twitter, and (c) PHEME.

Table 3.
The Comparison Results of Dual-Branch Hybrid Visual Networks and Hierarchical Adaptive Fusion Strategy Model (DVHAM) and Baseline Models.

Dataset	Methods	Accuracy	Fake news Precision	Fake news Recall	Fake news F1	Real news Precision	Real news Recall	Real news F1
WEIBO	SVM-TS	0.640	0.741	0.573	0.646	0.651	0.798	0.711
	CNN	0.740	0.736	0.756	0.744	0.747	0.723	0.735
	GRU	0.702	0.671	0.794	0.727	0.747	0.609	0.671
	TextGCN	0.787	0.975	0.573	0.727	0.712	0.985	0.827
	EANN	0.782	0.827	0.697	0.756	0.752	0.863	0.804
	Att-RNN	0.772	0.854	0.656	0.742	0.720	0.889	0.795
	MVAE	0.824	0.854	0.769	0.809	0.802	0.875	0.837
	SpotFake	0.869	0.877	0.859	0.868	0.861	0.879	0.870
	SAFE	0.763	0.833	0.659	0.736	0.717	0.868	0.785
	HMCAN	0.885	0.920	0.845	0.881	0.856	0.926	0.890
	GFNN	0.901	0.913	0.889	0.889	0.888	0.913	0.900
	MRAN	0.903	0.904	0.908	0.906	0.897	0.892	0.894
	DVHAM	0.915	0.906	0.936	0.921	0.927	0.892	0.909
TWITTER	SVM-TS	0.529	0.488	0.497	0.496	0.565	0.556	0.561
	CNN	0.549	0.508	0.597	0.549	0.598	0.509	0.550
	GRU	0.634	0.581	0.812	0.667	0.758	0.502	0.604
	TextGCN	0.703	0.808	0.365	0.503	0.680	0.939	0.779
	EANN	0.648	0.810	0.498	0.617	0.584	0.759	0.660
	Att-RNN	0.664	0.749	0.615	0.676	0.589	0.728	0.651
	MVAE	0.745	0.801	0.719	0.758	0.689	0.777	0.730
	SpotFake	0.771	0.784	0.744	0.764	0.769	0.807	0.787
	SAFE	0.766	0.777	0.795	0.786	0.752	0.731	0.742
	HMCAN	0.897	0.971	0.801	0.878	0.853	0.979	0.912
	GFNN	0.923	0.872	0.965	0.916	0.971	0.891	0.929
	MRAN	0.855	0.861	0.857	0.859	0.847	0.816	0.831
	DVHAM	0.924	0.926	0.837	0.879	0.923	0.967	0.944
PHEME	SVM-TS	0.639	0.546	0.576	0.560	0.729	0.705	0.717
	CNN	0.779	0.732	0.606	0.663	0.799	0.875	0.835
	GRU	0.832	0.782	0.712	0.745	0.855	0.896	0.865
	TextGCN	0.828	0.775	0.735	0.737	0.827	0.828	0.828
	EANN	0.681	0.685	0.664	0.694	0.701	0.750	0.747
	Att-RNN	0.850	0.791	0.749	0.770	0.876	0.899	0.888
	MVAE	0.852	0.806	0.719	0.760	0.817	0.917	0.893
	SpotFake	0.823	0.743	0.745	0.744	0.864	0.863	0.863
	SAFE	0.811	0.827	0.559	0.667	0.806	0.940	0.866
	HMCAN	0.881	0.830	0.838	0.834	0.910	0.905	0.907
	MRAN	0.870	0.852	0.808	0.839	0.889	0.928	0.908
	DVHAM	0.906	0.807	0.844	0.825	0.943	0.928	0.935

Note. Values in bold and underlined represent the highest scores per metric.

4.4. Ablation Experiments

Given that DVHAM is constructed from multiple components, we aim to ascertain the impact of each component on the experimental outcomes. To this end, we designed the following variants of DVHAM and evaluated them in ablation experiments on each of the three datasets:

DVHAM-T: Remove the text component of the model and use only image features for detection; this is a unimodal approach.

DVHAM-I: Remove the visual component of the model and use only text for classification; this is a unimodal approach.

DVHAM-G: Instead of using the adaptive dynamic fusion mechanism, concatenate the final features with equal weights.

DVHAM-F: Remove the paired MHA (PAM) fusion and simply concatenate the text and image features.

DVHAM-C: Delete the hierarchical text features and keep only the text features in the last layer of the BERT model.

DVHAM-R: Replace the dual-branch hybrid visual network with a ResNet network.

DVHAM-V: Replace the dual-branch hybrid visual network with a ViT network.

The experimental results are shown in Table 4. DVHAM performs better than every variant, which indicates that each component plays a non-negligible role in the final result. We analyze the ablation experiments in detail as follows:

(1) DVHAM-T and DVHAM-I use only image features and only text features for detection, respectively. DVHAM-T has the lowest accuracy on all three datasets, and both unimodal variants perform below DVHAM. This indicates that combining text and image features is critical for improving model performance.
(2) DVHAM-G and DVHAM-F do not use the adaptive dynamic fusion mechanism and the paired MHA fusion, respectively. Their results are lower than those of DVHAM, indicating that both mechanisms play a significant role, especially the adaptive dynamic fusion mechanism, which dynamically adjusts the feature weights according to the input to better capture the correlation between text and images.
(3) Compared with DVHAM, the accuracy of DVHAM-C, which removes the hierarchical text features, decreases by 0.6, 7.6, and 3.4 percentage points on the three datasets, respectively, indicating that the shallow text features also play a non-negligible role.
(4) DVHAM-R and DVHAM-V replace the dual-branch hybrid visual network with a single-branch network (ResNet and ViT, respectively). Both accuracy and F1 scores drop compared with DVHAM, indicating that the dual-branch network helps improve performance because it captures image features from different perspectives and thus provides richer information.
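The accuracy differences discussed above can be recomputed directly from the accuracy column of Table 4. The short sketch below does this arithmetic; the dictionary layout and the function name `accuracy_drop` are illustrative.

```python
# Accuracy values copied from Table 4.
FULL = {"WEIBO": 0.915, "TWITTER": 0.924, "PHEME": 0.906}
VARIANTS = {
    "DVHAM-T": {"WEIBO": 0.673, "TWITTER": 0.767, "PHEME": 0.758},
    "DVHAM-I": {"WEIBO": 0.717, "TWITTER": 0.835, "PHEME": 0.882},
    "DVHAM-G": {"WEIBO": 0.908, "TWITTER": 0.863, "PHEME": 0.862},
    "DVHAM-F": {"WEIBO": 0.902, "TWITTER": 0.847, "PHEME": 0.850},
    "DVHAM-C": {"WEIBO": 0.909, "TWITTER": 0.848, "PHEME": 0.872},
    "DVHAM-R": {"WEIBO": 0.889, "TWITTER": 0.822, "PHEME": 0.874},
    "DVHAM-V": {"WEIBO": 0.902, "TWITTER": 0.804, "PHEME": 0.857},
}

def accuracy_drop(variant):
    """Accuracy loss of a variant relative to full DVHAM, in percentage points."""
    return {d: round((FULL[d] - VARIANTS[variant][d]) * 100, 1) for d in FULL}

if __name__ == "__main__":
    for name in VARIANTS:
        print(name, accuracy_drop(name))
```

Expressing the gaps in percentage points avoids confusing them with relative changes.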

Table 4.
Ablation Results of Dual-Branch Hybrid Visual Networks and Hierarchical Adaptive Fusion Strategy Model (DVHAM) and Its Variants.

Dataset	Methods	Accuracy	Fake news Precision	Fake news Recall	Fake news F1	Real news Precision	Real news Recall	Real news F1
WEIBO	DVHAM-T	0.673	0.685	0.700	0.692	0.660	0.644	0.652
	DVHAM-I	0.717	0.742	0.706	0.724	0.692	0.728	0.709
	DVHAM-G	0.908	0.915	0.910	0.913	0.901	0.907	0.904
	DVHAM-F	0.902	0.918	0.893	0.905	0.885	0.911	0.898
	DVHAM-C	0.909	0.911	0.917	0.914	0.907	0.901	0.904
	DVHAM-R	0.889	0.911	0.874	0.892	0.867	0.905	0.886
	DVHAM-V	0.902	0.889	0.930	0.909	0.918	0.872	0.894
	DVHAM	0.915	0.906	0.936	0.921	0.927	0.892	0.909
TWITTER	DVHAM-T	0.767	0.669	0.587	0.625	0.807	0.856	0.831
	DVHAM-I	0.835	0.729	0.798	0.762	0.895	0.853	0.874
	DVHAM-G	0.863	0.819	0.826	0.822	0.891	0.886	0.888
	DVHAM-F	0.847	0.780	0.840	0.809	0.895	0.852	0.873
	DVHAM-C	0.848	0.968	0.626	0.760	0.809	0.987	0.889
	DVHAM-R	0.822	0.828	0.678	0.746	0.820	0.912	0.863
	DVHAM-V	0.804	0.768	0.587	0.665	0.817	0.912	0.862
	DVHAM	0.924	0.926	0.837	0.879	0.923	0.967	0.944
PHEME	DVHAM-T	0.758	0.532	0.679	0.597	0.872	0.786	0.827
	DVHAM-I	0.882	0.835	0.774	0.803	0.901	0.931	0.916
	DVHAM-G	0.862	0.728	0.761	0.744	0.913	0.898	0.905
	DVHAM-F	0.850	0.664	0.872	0.754	0.948	0.842	0.892
	DVHAM-C	0.872	0.804	0.779	0.791	0.901	0.914	0.908
	DVHAM-R	0.874	0.739	0.807	0.772	0.929	0.898	0.913
	DVHAM-V	0.857	0.736	0.716	0.726	0.899	0.908	0.903
	DVHAM	0.906	0.807	0.844	0.825	0.943	0.928	0.935

Note. Values in bold and underlined represent the highest scores per metric.

Figure 6.
Effect of Parameter G on experimental results: (a) Accuracy and (b) Fake News F1 Score.

Overall, each component of DVHAM has an important impact on model performance, and the combination of these components enables DVHAM to efficiently process the multimodal information in text and images, which improves performance on the fake news detection task.

4.5. The Influence of Hyperparameters on Model Outcomes

This section presents validation experiments on the three datasets and discusses the impact of key parameters of DVHAM on the experimental outcomes. The parameters with the greatest influence on DVHAM are the number of text layers $G$ and the number of interaction aggregation modules $N$ . We select the values of $G$ from [1, 2, 3, 4, 6]; the results are shown in Figure 6. When $G$ equals 12, the computational expense becomes prohibitive, so we exclude this setting. The results show that on the WEIBO and TWITTER datasets, a larger $G$ corresponds to higher accuracy and a higher fake-news F1 score. On the PHEME dataset, the best performance is achieved when $G = 6$ ; however, considering computational resource consumption, $G = 4$ offers the best cost-effectiveness. The model also exhibits comparable accuracy on the WEIBO and PHEME datasets, as shown in Figure 6(a), whereas Figure 6(b) shows a significant gap between the two datasets in the fake-news F1 score. This divergence can be attributed to the fact that the PHEME dataset is substantially smaller than the WEIBO dataset and has an imbalanced distribution of real and fake news. Together, these factors account for the pronounced difference in the fake-news F1 score.

Another critical parameter is the number of interaction aggregation modules $N$ within the dual-branch hybrid network. If $N$ is too large, the computational complexity of the model becomes high, whereas $N = 1$ is of little interest. Therefore, we select the values of $N$ from $[2, 3, 4, 5, 6, 7]$ . The experimental results on the WEIBO, TWITTER, and PHEME datasets, illustrated in Figure 7, show that the model performs best when $N = 5$ : both the accuracy and the fake-news F1 score reach their highest values there. With smaller values, the dual-branch hybrid network cannot learn a sufficiently comprehensive feature representation, resulting in suboptimal performance, while larger values cause overfitting and increase the number of model parameters.

Figure 7.

Effect of Parameter $N$ on Experimental Results: (a) Accuracy and (b) Fake News F1 Score.

5. Conclusion

In this paper, we address the social challenge posed by the spread of fake news, which significantly impacts cyberspace governance, by proposing DVHAM. The DVHAM model introduces a dual-branch hybrid network architecture and a hierarchical adaptive dynamic fusion module. The dual-branch hybrid network, based on transformer and CNN architectures, considers not only the global features of the image but also its local details, addressing the limitations of existing models in image feature extraction. The fusion module employs a paired MHA mechanism and an adaptive adjustment strategy based on a gating mechanism, which capture and utilize the complementarity and correlation between visual information and semantic information at different levels of the text model, thus overcoming the limitations of current models in fusing multimodal information. Through comparative and ablation experiments on three datasets, DVHAM has demonstrated its effectiveness and feasibility in the domain of fake news detection.

Although DVHAM has demonstrated its effectiveness in extracting and fusing multimodal information for fake news detection, its robustness may be compromised in scenarios involving noisy data or adversarial attacks. For instance, in real-world settings where image data is significantly degraded, the dual-branch hybrid network may have difficulty distinguishing genuine global and local features from noise. Similarly, text inputs embedded with adversarial perturbations can disrupt the paired MHA mechanism and the adaptive gating strategy, resulting in a suboptimal fusion of semantic and visual information. Future work on DVHAM should aim to improve its robustness by enhancing preprocessing with noise-reduction and augmentation techniques and by adopting adversarial training methods to better handle corrupted data and resist attacks. In addition, DVHAM currently considers only the text and image information of news and does not yet cover information such as audio and the social context of news. Given the large amount of audio information, propagation paths, user characteristics, and data distribution associated with current news, effectively utilizing these additional information sources for fake news detection will be an important direction for our future research. We expect that further research and exploration will allow us to propose more comprehensive and effective fake news detection methods.

Footnotes

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This paper was supported by the National Natural Science Foundation of China (No. 61976085).

Declaration of Conflicting Interest

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The data that support the findings of this study are openly available. The WEIBO, TWITTER, and PHEME datasets used in this study are publicly accessible. Detailed information about the datasets and how to access them can be found in the respective references. Our project’s GitHub repository: .

Dataset	Methods	Accuracy	Fake news Precision	Fake news Recall	Fake news F1	Real news Precision	Real news Recall	Real news F1
WEIBO	SVM-TS	0.640	0.741	0.573	0.646	0.651	0.798	0.711
	CNN	0.740	0.736	0.756	0.744	0.747	0.723	0.735
	GRU	0.702	0.671	0.794	0.727	0.747	0.609	0.671
	TextGCN	0.787	0.975	0.573	0.727	0.712	0.985	0.827
	EANN	0.782	0.827	0.697	0.756	0.752	0.863	0.804
	Att-RNN	0.772	0.854	0.656	0.742	0.720	0.889	0.795
	MVAE	0.824	0.854	0.769	0.809	0.802	0.875	0.837
	SpotFake	0.869	0.877	0.859	0.868	0.861	0.879	0.870
	SAFE	0.763	0.833	0.659	0.736	0.717	0.868	0.785
	HMCAN	0.885	0.920	0.845	0.881	0.856	0.926	0.890
	GFNN	0.901	0.913	0.889	0.889	0.888	0.913	0.900
	MRAN	0.903	0.904	0.908	0.906	0.897	0.892	0.894
	DVHAM	0.915	0.906	0.936	0.921	0.927	0.892	0.909
TWITTER	SVM-TS	0.529	0.488	0.497	0.496	0.565	0.556	0.561
	CNN	0.549	0.508	0.597	0.549	0.598	0.509	0.550
	GRU	0.634	0.581	0.812	0.667	0.758	0.502	0.604
	TextGCN	0.703	0.808	0.365	0.503	0.680	0.939	0.779
	EANN	0.648	0.810	0.498	0.617	0.584	0.759	0.660
	Att-RNN	0.664	0.749	0.615	0.676	0.589	0.728	0.651
	MVAE	0.745	0.801	0.719	0.758	0.689	0.777	0.730
	SpotFake	0.771	0.784	0.744	0.764	0.769	0.807	0.787
	SAFE	0.766	0.777	0.795	0.786	0.752	0.731	0.742
	HMCAN	0.897	0.971	0.801	0.878	0.853	0.979	0.912
	GFNN	0.923	0.872	0.965	0.916	0.971	0.891	0.929
	MRAN	0.855	0.861	0.857	0.859	0.847	0.816	0.831
	DVHAM	0.924	0.926	0.837	0.879	0.923	0.967	0.944
PHEME	SVM-TS	0.639	0.546	0.576	0.560	0.729	0.705	0.717
	CNN	0.779	0.732	0.606	0.663	0.799	0.875	0.835
	GRU	0.832	0.782	0.712	0.745	0.855	0.896	0.865
	TextGCN	0.828	0.775	0.735	0.737	0.827	0.828	0.828
	EANN	0.681	0.685	0.664	0.694	0.701	0.750	0.747
	Att-RNN	0.850	0.791	0.749	0.770	0.876	0.899	0.888
	MVAE	0.852	0.806	0.719	0.760	0.817	0.917	0.893
	SpotFake	0.823	0.743	0.745	0.744	0.864	0.863	0.863
	SAFE	0.811	0.827	0.559	0.667	0.806	0.940	0.866
	HMCAN	0.881	0.830	0.838	0.834	0.910	0.905	0.907
	MRAN	0.870	0.852	0.808	0.839	0.889	0.928	0.908
	DVHAM	0.906	0.807	0.844	0.825	0.943	0.928	0.935
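
The F1 columns in the results table above follow from the precision and recall columns through the usual harmonic-mean definition, $F1 = 2PR / (P + R)$. The short sketch below recomputes the fake news F1 score for the three DVHAM rows from the precision and recall values copied from the table; the helper name `f1_score` and the dictionary layout are assumptions made for illustration only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) from the DVHAM fake news columns.
dvham_fake = {
    "WEIBO": (0.906, 0.936, 0.921),
    "TWITTER": (0.926, 0.837, 0.879),
    "PHEME": (0.807, 0.844, 0.825),
}

recomputed = {
    name: f1_score(p, r) for name, (p, r, _reported) in dvham_fake.items()
}

for name, (_p, _r, reported) in dvham_fake.items():
    print(f"{name}: recomputed F1 = {recomputed[name]:.3f}, reported = {reported:.3f}")
```

For these three rows the recomputed values agree with the reported F1 scores to three decimal places. The same check can be applied to any other row of the table.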

Dual-Branch Hybrid Visual Networks and Hierarchical Adaptive Fusion Strategy: An Effective Multimodal Fake News Detection Model

Table 2.
Statistical Data for the WEIBO Dataset, TWITTER Dataset, and PHEME Dataset.

Dataset	Real news	Fake news	Image
WEIBO	4,779	4,749	9,528
TWITTER	6,026	7,098	514
PHEME	1,428	590	2,018