Abstract
An increasing amount of fake news combining text, images and other forms of multimedia is spreading rapidly across social platforms, leading to misinformation and negative impacts. Therefore, the automatic identification of multimodal fake news has become an important research hotspot in academia and industry. The key to multimedia fake news detection is to accurately extract features of both text and visual information, as well as to mine the correlation between them. However, most existing methods merely fuse the features of different modal information without fully extracting intra- and inter-modal connections and complementary information. In this work, we learn physical tampering cues for images in the frequency domain to supplement information in the image spatial domain, and propose a novel multimodal frequency-aware cross-attention network (MFCAN) that fuses the representations of text and image by jointly modelling intra- and inter-modal relationships between textual and visual information within a unified deep framework. In addition, we devise a new cross-modal fusion block based on the cross-attention mechanism that can leverage inter-modal as well as intra-modal relationships to complement and enhance the feature matching of text and image for fake news detection. We evaluated our approach on two publicly available datasets, and the experimental results show that our proposed model outperforms existing baseline methods.
Introduction
Social networks have become an important platform for people to express themselves and exchange information due to their low cost, convenience, and rapid dissemination, but the lack of effective regulatory measures has also resulted in the proliferation of fake news. The massive amount of online fake news spreading uncontrollably on social media brings negative impacts to individuals and damages the stability and harmony of society [1]. For example, during the 2016 US presidential election, fake news was more popular and spread more widely on Facebook than real news [2], and during the COVID-19 pandemic, a large amount of inaccurate information also caused social panic [3, 4]. Detecting fake news is therefore essential to building a credible social news system.
In addition, along with the development of Internet technology, news has evolved from unimodal to multimodal, i.e., multimedia news with pictures or videos. Multimedia news brings a richer news reading experience, and news with visual content not only attracts more attention but also increases the credibility of the news [5]. However, fake news has also taken advantage of this, often using distorted or even faked images or videos to attract and mislead readers, thus spreading more quickly than real news [6]. Figure 1 depicts four examples of fake news stories. The text in Fig. 1(a) and 1(b) appears normal, but the images are obviously fabricated. In Fig. 1(c), the image appears authentic, but the accompanying text reveals that it is a forgery. In Fig. 1(d), the image and the text apparently convey the same meaning, but the image is actually taken out of context. From the above examples it can be concluded that it is difficult to detect fake news from a single modality of data. Generally, fake news creators produce fake news by expressing strong emotions through textual descriptions and mismatching irrelevant images or fabricating images that are visually striking [7]. Therefore, each modality of data can provide different degrees of information about rumours, and it is important to consider the fusion of multimodal data to detect fake news.

Some examples of fake news images. (a) and (b) are manipulated photographs in which (a) a picture of football star Henry is spliced into a hotel room to represent his participation in the Sochi Winter Olympics, and (b) depicts prismatic cloud formations on Japan’s Mount Fuji. (c) An image of the 2009 New York air crash was widely disseminated as Malaysia Airlines Flight 370, which crashed in 2014. (d) Trump’s feeding of the fish was misconstrued as dumping fish food, when in fact he was imitating Shinzo Abe by dumping the remaining fish food.
Recent studies have demonstrated the increasing importance of combining image data to detect fake news. According to previous studies, fake news images can be broadly classified into misleading images and manipulated images. Misleading images are those that are not related to the news text in terms of content or semantics, or those that are genuine and used without permission. Manipulated images are those that have been modified using image editing software. Copy-move and splicing are two common image tampering operations. In Fig. 2, the first column is the original real image, the second column is a fake news image synthesized by copying and moving objects within the same photo or splicing in objects from another photo, and the third column shows the manipulated area. Due to the quantized compression of the discrete cosine transform (DCT), misleading images, after being downloaded and re-uploaded, typically leave more apparent compression artifacts than real pictures. Tampered images also leave manipulation traces. However, existing work gives little consideration to the detection of image tampering features and neglects to effectively model the spatial-domain and frequency-domain features of images. For the exploitation of visual information, the majority of these approaches [5, 9] use pre-trained models, such as VGG19 [10], to extract higher-order semantic information in the spatial domain, while ignoring the effective modelling of the tampered physical features of fake images, such as compression artifacts after splice forgery and copy-move forgery features reflected in the frequency domain [11]. Moreover, physical traces induced by image editing that are perceptible in the frequency domain are no longer discernible in the spatial domain (RGB domain) [12–14]. This necessitates a multimodal method that simultaneously models the RGB domain and the frequency domain to detect subtle manipulation traces.

Examples of image copy-move and splicing.
In addition, the interactions and relationships between textual and visual features (spatial-domain and frequency-domain features) have not been adequately investigated in prior research. Some models [8, 16] simply concatenate text features and visual features to obtain fusion features at a later point in the model, without mining the semantic complementarity of text and images, which can result in information loss. For example, [9] obtained a shared representation of text and visual information for classification by jointly training an encoder, decoder and classifier. [5] used an attention mechanism to merge the image features extracted by a pre-trained model into the joint features of text and social context fused by an LSTM for rumor detection. Consequently, the ability to effectively combine the textual and visual information of a post is the key to detecting false news. Typically, when evaluating the credibility of news, people first examine the images and then the text [17]. Images of news posts usually contain much visual content; without the guidance of textual information, it is difficult to grasp the key points of images at once, and similarly, without the clues hidden in the images, news posts with incomplete text or ambiguous language are hard to follow. By repeatedly observing images and texts, people gradually deepen their understanding and eventually determine the truthfulness of the news. Intuitively, there is an interconnection between the semantic information of images and text, and the interaction and fusion between different modal data contributes to multimodal fake news detection.
To address these constraints, we introduce MFCAN, a framework for multimodal fake news detection. MFCAN uses a dual-stream architecture, with the textual stream extracting features of the text content and the visual stream divided into two branches, where the spatial-domain branch captures representations of the original images. The frequency-aware branch employs base filters and learnable filters to discover forged features in the frequency domain.
We also develop a cross-modal fusion module to efficiently fuse information from multiple modalities in an interactive manner. First, we fuse frequency-aware branch features with spatial domain branch features to produce visual stream features, and then we merge text features with visual stream features to generate combined features. The integrated features are finally fed into a fully connected classifier to get prediction results.
The following is a brief summary of the key contributions of this study:
We present a new end-to-end multimodal method for detecting fake news that explores the inherent features of fake news images at both the semantic and physical levels, from the spatial and frequency domains respectively.
The proposed MFCAN exploits a well-designed attention fusion network to gradually fuse textual and visual features, which can effectively learn salient representations of fake news from intra-modality dependencies and inter-modality relationships.
Our proposed generic architecture for detecting fake news is feasible and flexible in engineering practice. MFCAN can be realized through the idea of modular design in software engineering: each component of the framework can be developed and tested as an independent module, and the modules can be connected through interfaces to realize the overall function of MFCAN. In addition, a suitable method can be chosen for each module according to the scale of the project, the hardware conditions, and so on. The ResNet50 network used here to extract visual semantic features can be replaced by other classical models (Inception, ResNeXt, etc.), by lightweight models (MobileNet, SqueezeNet, etc.), or by ViT and its variants. BERT can likewise be substituted by other word-vector-based methods (word2vec, LSTM, etc.) or pre-training methods (XLNet, BART, etc.).
Experimental results on two public benchmark datasets show that our approach outperforms the best baseline. The visual representation of the frequency domain learned by the model provides additional complementary information for fake news detection, and the cross-modal fusion module helps to improve the overall detection performance.
The remaining sections of this article are organized as follows. In Section 2, we provide an overview of previous research findings regarding the detection of false news. Section 3 defines the issue. Section 4 provides a comprehensive description of the proposed model. Section 5 provides the dataset, benchmarks, and experimental results. The study is summarised in Section 6.
Related work
False news can refer to a variety of misinformation and disinformation [18], such as biased news, imposter content, manipulated content, and satire. However, a general definition of fake news has yet to be established [19]. Similar to prior research [20, 21], we define fake news as "deliberately and verifiably false news articles". On the basis of the differences between the models, we can generally classify existing fake news detection models as either unimodal or multimodal.
Single-modal fake news detection
Unimodal analysis concentrates on extracting features from the content (textual or visual information) or social context of a post. Various studies have extracted content features from different perspectives, including statistical features such as the percentage of negatives [22, 23], the number of paragraphs in the text [24], the number of verbs and nouns, the number of emotional and casual phrases [25], lexical and syntactic features [26–30], and other writing styles. However, these features are manually designed and are not flexible or generalisable, so in order to automatically extract high-level features, deep learning models such as LSTM+CNN [31], GNN [32], and GAN [33] have been proposed. Social context-based research investigates the social connections between users and tweets, such as user profiles and the characteristics of news diffusion. [34] utilises user profiles along the news dissemination path to determine the authenticity of news. Other works [35–37] build heterogeneous credibility networks to model propagation patterns.
Multimodal fake news detection
The rapid growth of fake news posts consisting of multimedia content has prompted a shift towards multimodality in the analysis of fake news. Some early investigations employed statistical image features as complementary features, such as the image type [6], the number of images [38, 39] and the popularity of the image. However, these statistical features only capture simple patterns and are not sufficient to characterize the nuanced differences between real and fake news images.
Recent work has attempted to introduce pre-trained models, such as BERT [40] and VGG19, to extract high-level semantic features of text or vision, and to uncover connections between textual and visual features through attention mechanisms. Specifically, [8] learns distinguishable representations for identifying fake news, while event classification is introduced as a supplementary task for learning common features that are invariant to events. Variational autoencoders are employed in [9] to learn visual and textual correlations in order to obtain their shared representations and feed them into a classifier to detect fake news. Without relying on any subtasks, [41] utilised pre-trained BERT and VGG19 models to learn textual and visual features respectively and fed their concatenation into a classifier for prediction. [15] applied a neural network to extract multimodal textual and visual features separately and used cross-modal correlations between them to identify fake news. [5] employed an attention mechanism to fuse visual features into the joint features of text and social context for fusion classification. [42] supplemented the semantic representation of text with conceptual knowledge retrieved from knowledge graphs and learned event-invariant features to jointly improve rumour detection. However, these methods primarily consider the semantic features of images in the spatial domain and do not adequately investigate the correlation between multimodal features.
Frequency domain learning
Frequency analysis is a powerful tool in the fields of image processing and digital signal processing and has been widely used. In recent years, several studies of frequency analysis have been introduced into the field of deep learning, such as image classification [43], super-resolution [44] and detection of falsified regions [45]. Some studies have attempted to use frequency cues for forgery detection. [46] uses deep convolutional networks to learn frequency-domain image features, demonstrating the generalisability and effectiveness of frequency-domain features for classification, detection, and separation tasks. [47] uses CNNs to learn the difference between the histogram of discrete cosine transform coefficients in the tampered region (single compressed region) and the untampered region (double compressed region) for double JPEG compression detection and forgery localisation. [48] investigates aligned and unaligned double JPEG detection in the pixel and frequency domains, which embeds the computation of the DCT coefficient histograms into the CNN structure and uses two-dimensional convolution in the CNN to capture possible correlations between the DCT coefficient histograms. [49] utilises histogram features as input to a CNN, while merging quantization table vectors in the fully connected layer of the CNN to improve the network's ability to distinguish between single or double JPEG blocks.
Problem formulation
The detection of fake news can be classified into post level (identifying individual posts as fake/true news) [5, 9] and event level (identifying news posts contained in events as fake/true) [33, 42]. This study follows the former. The text and image of a news article $P = (T, I)$ are denoted by $T$ and $I$, respectively, and $t \in \mathbb{R}^d$ and $v \in \mathbb{R}^d$ represent the corresponding representations, where $t = F_t(T; \theta_t)$, $v = F_v(I; \theta_v)$, and $\theta_*$ are the parameters to be learned. Our research aims to use the text and image features of a post and their relationship to infer the veracity of $P$, that is, $F_p : (t, v, \theta_t, \theta_v) \rightarrow y \in \{0, 1\}$.
Methodology
Model overview
An overview of the MFCAN framework is presented in Fig. 3. It consists of five modules: (1) the text feature extractor (Sec. 4.2), (2) the visual semantic feature extractor (Sec. 4.3), (3) the frequency-aware tamper feature extractor (Sec. 4.4), (4) the cross-modality fusion module (Sec. 4.5), and (5) the fake news detector (Sec. 4.6). The BERT model generates the text embedding vector, and the image is fed to two distinct branches for feature extraction. In the first branch, the image is directly fed into the pre-trained ResNet50 model to extract spatial-domain features, whereas in the second branch, the image is fed into the frequency-aware tamper feature extractor for transformation in order to extract frequency-aware features. Then a cross-modality attention fusion network consisting of two Modality Fusion Blocks is used to fuse the features of the different modalities hierarchically. The proposed Modality Fusion Block explores intra- and inter-modal relationships and generates fused representations across modalities.

Illustration of the Multi-Modality Frequency-aware Cross-attention Network (MFCAN) architecture.
Text feature extractor
Pre-trained on a large corpus, the BERT [40] model is capable of capturing the interrelationships between words in a sentence and the semantic importance of each word in order to generate feature representations with underlying semantic and contextual information. The BERT model has demonstrated powerful performance in a variety of natural language processing tasks, including text classification [50, 51].
For a news text $W = \{w_1, w_2, \ldots, w_n\}$ consisting of a sequence of $n$ words, the words are first mapped to the corresponding indices by the BERT predefined vocabulary [40] and then processed by BERT to generate a sequence representation $T = \{t_1, t_2, \ldots, t_m\}$ of $W$, where $t_i$ corresponds to the transformed feature of $w_i$. The text lengths of all news are uniformly padded or truncated to the same length $m$. The above process is shown in Equation 1:
$$T = \mathrm{BERT}(W;\ \theta_{\mathrm{bert}}) \tag{1}$$
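As an illustration, the following is a minimal sketch of this text feature extraction step using the HuggingFace Transformers library; the multilingual cased checkpoint and the padding length of 30 follow the experimental settings reported in Section 5, while the example post is invented.

```python
import torch
from transformers import BertTokenizer, BertModel

# Pre-trained multilingual cased BERT, as used for the Twitter data (Sec. 5);
# the maximum length m follows the padding/truncation setting reported there.
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
bert = BertModel.from_pretrained("bert-base-multilingual-cased")

posts = ["Shark swimming on a flooded highway after the hurricane"]
tokens = tokenizer(posts, padding="max_length", truncation=True,
                   max_length=30, return_tensors="pt")

with torch.no_grad():
    out = bert(**tokens)

T = out.last_hidden_state          # (batch, m, 768): one 768-d vector t_i per token w_i
print(T.shape)                     # torch.Size([1, 30, 768])
```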
Visual semantic feature extractor
According to research [11], in the spatial domain (the RGB domain), fake news images exhibit some distinct characteristics compared to real news images, such as being visually impactful and emotionally provocative. These features have been associated with a variety of low-level to high-level visual factors [21]. Thus, we apply the ResNet50 [52] network, as depicted in Fig. 3, to fully capture the visual features of different semantic levels in the spatial domain. The last classification layer of the original ResNet50 is replaced by a fully connected $d$-dimensional layer with a ReLU activation function, which generates a $u \times d_s$-dimensional feature as the final visual representation of the spatial domain for the input image. The process is shown in Equation 2:
$$S = \mathrm{ResNet50}(I;\ \theta_s) \tag{2}$$
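A minimal sketch of this spatial-domain branch is shown below, assuming an ImageNet-pretrained ResNet50 from torchvision with its classification layer removed and a single ReLU-activated projection to 768 dimensions; the paper additionally keeps region-level ($u \times d_s$) features, which this simplified sketch collapses into one global vector.

```python
import torch
import torch.nn as nn
from torchvision import models

# Sketch of the spatial-domain branch: ImageNet-pretrained ResNet50 with its
# classification layer replaced by a ReLU-activated projection to d = 768.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = nn.Identity()                       # keep the 2048-d penultimate features

project = nn.Sequential(nn.Linear(2048, 768), nn.ReLU())

image = torch.randn(1, 3, 224, 224)               # a resized news image (Sec. 5)
with torch.no_grad():
    S = project(backbone(image))                  # (1, 768) spatial visual representation
print(S.shape)
```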
Frequency-aware tamper feature extractor
Compared to real news images, fake news images are usually uploaded and downloaded multiple times on social media platforms and have more severe recompression artifacts, such as the block effect. In addition, some fake news images inevitably carry traces of manipulation, such as splicing or copy-move. [53–55] demonstrate that artifacts of fake photos cannot be detected in the spatial domain due to perturbations such as JPEG compression, but they can be detected in the frequency domain. On this basis, we extract features from the frequency domain to provide additional clues for detecting fake news.
We present a novel frequency-aware tamper feature extractor (FATE) that uses a learnable frequency filter to perform adaptive decomposition of the input image in the frequency domain, as seen in Fig. 3. Frequency-aware visual components are produced by inverting the decomposed frequency components back to the spatial domain. After this, ResNet50 is employed to mine forgery patterns. $D$ indicates the application of the Discrete Cosine Transform (DCT) and $D^{-1}$ indicates the application of the Inverse Discrete Cosine Transform (IDCT).
DCT is widely applied in image processing, for example in the JPEG and H.264 compression algorithms [56]. The DCT transform of an image patch can be expressed as follows:
$$D(F) = A F A^{\mathsf T},$$
where $F$ is a $d \times d$ image patch and $A$ is the $d \times d$ DCT transformation matrix, whose entries are
$$A_{ij} = c_i \sqrt{\frac{2}{d}} \cos\!\left[\frac{(2j+1)\, i\, \pi}{2d}\right], \qquad c_i = \begin{cases} \frac{1}{\sqrt{2}}, & i = 0 \\ 1, & i > 0. \end{cases}$$
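The DCT matrix defined above can be constructed and verified numerically; the following sketch builds the orthonormal DCT-II matrix for an 8 × 8 patch and checks that applying $A F A^{\mathsf T}$ followed by the inverse transform recovers the patch.

```python
import numpy as np

def dct_matrix(d: int) -> np.ndarray:
    """Orthonormal DCT-II transformation matrix A (d x d), matching the formula above."""
    A = np.zeros((d, d))
    for i in range(d):
        c = np.sqrt(1.0 / d) if i == 0 else np.sqrt(2.0 / d)
        for j in range(d):
            A[i, j] = c * np.cos((2 * j + 1) * i * np.pi / (2 * d))
    return A

d = 8                                   # JPEG-style 8x8 block
F = np.random.rand(d, d)                # an image patch
A = dct_matrix(d)
F_dct = A @ F @ A.T                     # 2-D DCT of the patch
F_rec = A.T @ F_dct @ A                 # inverse DCT recovers the patch
assert np.allclose(F_rec, F)            # A is orthonormal, so the transform is invertible
```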
Specifically, taking the image $I$ as input, we start with some pre-processing, such as the standard transformation, resizing and cropping described in [57]. The DCT then transforms the spatial-domain image to the frequency domain along the spatial dimensions.

Process of extracting salient frequency features using the proposed frequency-aware tamper feature extractor (FATE).
The decomposed frequency components are filtered by the base filters and the learnable filters and inverted back to the spatial domain by the IDCT; the resulting frequency-aware components are then passed to ResNet50 to mine forgery patterns, and the frequency region features obtained are denoted as $F$.
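The sketch below illustrates the FATE idea under several assumptions: the band boundaries, the way the learnable filter is combined with the base band-pass masks, and the choice to concatenate the frequency-aware components along the channel dimension are illustrative, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

def dct_matrix(d: int) -> torch.Tensor:
    """Torch reimplementation of the orthonormal DCT-II matrix from the previous snippet."""
    i = torch.arange(d).unsqueeze(1).float()
    j = torch.arange(d).unsqueeze(0).float()
    A = torch.cos((2 * j + 1) * i * torch.pi / (2 * d)) * torch.sqrt(torch.tensor(2.0 / d))
    A[0, :] = A[0, :] / torch.sqrt(torch.tensor(2.0))
    return A

class FrequencyAwareExtractor(nn.Module):
    """Sketch of the FATE idea: decompose the image into frequency bands with fixed
    base filters plus a learnable filter, invert each band to the spatial domain, and
    hand the frequency-aware components to a CNN backbone (ResNet50 in the paper)."""
    def __init__(self, size=224, bands=((0.0, 1 / 16), (1 / 16, 1 / 8), (1 / 8, 1.0))):
        super().__init__()
        self.register_buffer("A", dct_matrix(size))
        base = []
        for lo, hi in bands:                       # base band-pass masks on the DCT plane
            idx = torch.arange(size).unsqueeze(1) + torch.arange(size).unsqueeze(0)
            base.append(((idx >= lo * 2 * size) & (idx < hi * 2 * size)).float())
        self.register_buffer("base", torch.stack(base))              # (bands, H, W)
        self.learnable = nn.Parameter(torch.zeros_like(self.base))   # adaptive adjustment

    def forward(self, x):                          # x: (B, 3, H, W)
        A = self.A
        freq = A @ x @ A.t()                       # 2-D DCT per channel
        filt = self.base + torch.sigmoid(self.learnable)   # base mask + learned adjustment
        comps = []
        for b in range(filt.shape[0]):
            band = freq * filt[b]                  # keep one frequency band
            comps.append(A.t() @ band @ A)         # IDCT back to the spatial domain
        return torch.cat(comps, dim=1)             # (B, 3 * bands, H, W) -> ResNet50

x = torch.randn(2, 3, 224, 224)
print(FrequencyAwareExtractor()(x).shape)          # torch.Size([2, 9, 224, 224])
```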
Cross-modality fusion module
In this section, we propose applying a cross-modality fusion network to the visual and textual features ($S$, $F$, $T$) to capture the inter-modality and intra-modality relationships in a post. As shown in Fig. 3, the cross-modality fusion network consists of two blocks: Modality Fusion Block 1 (MFB1) is utilized to fuse the two visual modalities (spatial and frequency), and Modality Fusion Block 2 (MFB2) focuses on blending the representations of the visual and textual modalities.
Inspired by the self-attention mechanism in transformer structures [58], we use the query-key-value mechanism to design the Modality Fusion Block, which consists of two types of attention unit: Self-Attention Unit and Cross-Attention Unit.
The Self-Attention Unit (SAU) and Cross-Attention Unit (CAU) are based on a multi-headed attention mechanism, which is depicted in detail in Fig. 5. We propose the Self-Attention Unit (SAU) to model within-modality relationships, which have been shown to be effective in object detection [59], image captioning, and the word-embedding pre-training of BERT. The SAU takes data from the same modality as input and captures the importance between visual regions and between textual words.

Illustration of the multi-headed attention mechanism.
The Cross-Attention Unit (CAU) first learns to capture the importance between two modality features. It then aggregates and updates one modality features through information flows passed from another modality according to the learned importance weights. Such a process of information flow is able to identify cross-modal relations.
Taking Modality Fusion Block 1 (MFB1) as an example, the inputs are the spatial region features $S$ and the frequency tamper features $F$. First, the Self-Attention Unit (SAU) is utilized to learn the intra-modality representation. Specifically, we compute the query, key and value using different learnable linear projections of the input; for the spatial visual features these are denoted as $Q_s$, $K_s$ and $V_s$. The scaled dot-product attention is
$$\mathrm{Att}(Q_s, K_s, V_s) = \mathrm{softmax}\!\left(\frac{Q_s K_s^{\mathsf T}}{\sqrt{d_k}}\right) V_s .$$
The attention output is passed through a fully connected layer, a residual connection and layer normalization,
$$\hat S = \mathrm{layer\_norm}\big(S + \mathrm{FC}(\mathrm{Att}(Q_s, K_s, V_s))\big),$$
where FC indicates a fully connected layer and layer_norm denotes the layer normalization. As shown in Fig. 3, the frequency visual features $\hat F$ are obtained from the SAU in the same way.
The spatial and frequency visual representations obtained from the SAU are learned independently without considering each other. Therefore, the output features of the SAU are then fed into the following CAU to further update the spatial and frequency visual features.
Similarly to the SAU, we first obtain the transformed spatial visual features $Q'_s$, $K'_s$, $V'_s$ and the transformed frequency visual features $Q'_f$, $K'_f$, $V'_f$ through learnable linear projections of $\hat S$ and $\hat F$. Each row of the resulting affinity matrix measures how strongly a region of one modality attends to the regions of the other modality. The process of updating the spatial visual features and the frequency visual features can be denoted as
$$S_{\mathrm{update}} = \mathrm{softmax}\!\left(\frac{Q'_s {K'_f}^{\mathsf T}}{\sqrt{d_k}}\right) V'_f ,\qquad F_{\mathrm{update}} = \mathrm{softmax}\!\left(\frac{Q'_f {K'_s}^{\mathsf T}}{\sqrt{d_k}}\right) V'_s .$$
Finally, $S_{\mathrm{update}}$ and $F_{\mathrm{update}}$ are combined through residual connections, fully connected layers and layer normalization to produce the frequency-aware visual representation $R_{SF}$.
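A compact sketch of the SAU/CAU design is given below using PyTorch's built-in multi-head attention; the way the two updated streams are finally aggregated into $R_{SF}$ is not fully specified in this excerpt, so concatenating them along the sequence dimension is only one plausible choice.

```python
import torch
import torch.nn as nn

class AttentionUnit(nn.Module):
    """One attention unit of the Modality Fusion Block: self-attention when the two
    inputs are the same modality (SAU), cross-attention otherwise (CAU). The residual
    connection + fully connected layer + layer normalization follow the description above."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fc = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feats, context_feats):
        att, _ = self.attn(query_feats, context_feats, context_feats)  # scaled dot-product
        return self.norm(query_feats + self.fc(att))                   # residual + FC + LN

class ModalityFusionBlock(nn.Module):
    """Sketch of MFB1/MFB2: intra-modal SAUs followed by bidirectional CAUs."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.sau_a, self.sau_b = AttentionUnit(dim, heads), AttentionUnit(dim, heads)
        self.cau_ab, self.cau_ba = AttentionUnit(dim, heads), AttentionUnit(dim, heads)

    def forward(self, feats_a, feats_b):
        a, b = self.sau_a(feats_a, feats_a), self.sau_b(feats_b, feats_b)  # intra-modal
        a_upd = self.cau_ab(a, b)          # modality A updated by information from B
        b_upd = self.cau_ba(b, a)          # modality B updated by information from A
        return torch.cat([a_upd, b_upd], dim=1)  # one simple aggregation of the two streams

S = torch.randn(2, 49, 768)                # spatial region features
F = torch.randn(2, 49, 768)                # frequency tamper features
print(ModalityFusionBlock()(S, F).shape)   # torch.Size([2, 98, 768])
```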
Next, we utilize Modality Fusion Block 2 (MFB2) to fuse the obtained frequency-aware visual features $R_{SF}$ and the textual features $T$. Note that MFB2 and MFB1 have the same structure but do not share weights.
Specifically, we first perform linear transformations on $R_{SF}$ and $T$, respectively, to obtain the transformed frequency-aware visual features and textual features (their queries, keys and values). As in MFB1, each modality is first refined by a Self-Attention Unit. Next, we acquire the affinity (or weight) matrix between the two modalities from the scaled dot-product of their queries and keys; each row of the affinity matrix indicates how strongly a visual region attends to the textual words (and vice versa), and the Cross-Attention Unit uses these weights to produce the updated representations $R_{\mathrm{update}}$ and $T_{\mathrm{update}}$. Furthermore, based on $R_{\mathrm{update}}$ and $T_{\mathrm{update}}$, the final representations of the frequency-aware visual features and the textual features are obtained through residuals, fully connected layers and layer normalisation operations, as shown in Fig. 3. Finally, we concatenate the final visual representation and the final textual representation to form the multimodal representation $R_{SFT}$.
Fake news detector
With the previous four components, we obtain the final features $R_{SFT}$ of multimodal news, which incorporate cross-modal textual, visual spatial-domain and frequency-domain features. Our aim is to map the textual and visual features of news to their labels to determine whether the news is fake. A fake news detector consisting of fully connected layers with activation functions implements the correspondence between features and labels.
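A minimal sketch of this detector is shown below; the 1536 → 512 layer sizes follow the classifier configuration reported in Section 5, the ReLU activation and dropout rate of 0.2 follow the training settings described there, and the cross-entropy loss is an assumption not stated in the text.

```python
import torch
import torch.nn as nn

# Sketch of the fake news detector: the fused multimodal representation R_SFT is
# mapped to the two classes through fully connected layers with ReLU and dropout.
detector = nn.Sequential(
    nn.Linear(1536, 512), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(512, 2),                       # logits for {real, fake}
)

R_sft = torch.randn(4, 1536)                 # fused textual + visual features
logits = detector(R_sft)
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1, 1, 0]))  # assumed training loss
```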
Experiments
In this section, we conduct experiments on two widely used datasets to evaluate the effectiveness of the proposed MFCAN. We first describe the two social media datasets and introduce the model settings and the baseline approaches for detecting fake news. Then we compare MFCAN with the baseline methods on the two datasets and present a detailed ablation study. Finally, we investigate some typical cases to demonstrate the importance of multimodal fake news detection.
Datasets
We conduct experiments on both an English (Twitter) dataset and a Chinese (Weibo) dataset to fairly evaluate the performance of the proposed model.
D1 [60]: Twitter dataset. As part of the MediaEval task, the Twitter dataset was used in a competition for detecting fake content on Twitter. Each post in the Twitter dataset contains textual content, relevant images/videos, and additional social contextual information. The original dataset is organised into two parts: a development set and a test set. We maintain the same data segmentation scheme as the benchmark, using the development set to train the model for feature learning and the test set for optimising parameters and testing. We only use the text and image information of the posts, to match the research objectives of this paper.
D2 [5]: Weibo dataset. It consists of textual content, user profiles and attached images. The real news in the Weibo dataset is collected from authoritative news sources in China, such as the Xinhua News Agency. The fake news was sourced from the official fake news debunking system of Sina Weibo from May 2012 to January 2016. With reference to previous studies, we first pre-processed the dataset to remove duplicate and low-quality images to ensure the quality of the whole dataset. The entire dataset was then divided into training and test sets as in [9, 61].
Experimental setup
In this section, we present the implementation details of MFCAN. For the textual feature extractor, we utilize the pre-trained multilingual cased BERT to extract textual features on the Twitter dataset; for the Weibo dataset, we use the Chinese BERT-base model. After padding and truncation, the maximum length of the news text is 30 on the Twitter dataset and 200 on the Weibo dataset, and the dimensionality of the text embedding is 768. For the visual semantic feature extractor, all images are resized to 224×224×3. We use the output of the penultimate layer of the ResNet50 [52] pre-trained on ImageNet [62]; the dimension of the visual vector is 2048, which is then reduced to 768 by means of two fully connected layers.
The frequency-aware tamper feature extractor uses the same ResNet50 network as the forged feature extraction module. The weights of BERT and ResNet50 are frozen during training on the Twitter dataset to prevent over-fitting, whereas they are fine-tuned on the Weibo dataset.
In the cross-modality fusion module, for the Self-Attention Unit and Cross-Attention Unit, the number of attention heads is set to 12 and 8 on the Twitter and Weibo datasets, respectively. The dimension of the output feature is the same as that of the input, which is 768. The fake news classifier consists of two fully connected layers of sizes 1536 and 512, respectively.
The dataset is divided into training, validation and test sets according to the ratio of 7:1:2. In the ablation analysis, we retrain the model after removing some of the branch networks. To obtain a reliable result, we run each experiment 5 times and report the run with the highest accuracy.
The initial learning rate of the Adam [64] optimizer is adjusted from 1e-6 to 1e-2 by grid search. The number of heads in the cross-attention unit is tuned over {4, 8, 12, 16}. To prevent overfitting, we use an L2 regularizer on the weights of our model and experiment with weight penalties of {0, 0.02, 0.05, 0.1, 0.2}. For the batch size, we try {16, 32, 64, 128}. We search over this grid, adjust the hyperparameters based on the accuracy on the validation set, and evaluate the results on the test set. Finally, we train the model for 100 epochs with a patience of 5, and the batch size is set to 32. Early stopping is also used to avoid overfitting. In each fully connected layer of the model except the final output layer, we apply the ReLU activation function and dropout with a dropout probability of 0.2. The learning rate of Adam is set to 1e-4 with a momentum of 0.9.
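The optimisation settings above can be summarised in the following runnable sketch; the tiny linear model and random tensors stand in for MFCAN and the real data loaders, the weight decay of 0.02 is one value from the searched grid, and the cross-entropy loss is an assumption not stated in the text.

```python
import torch
import torch.nn as nn

# Minimal runnable sketch of the optimisation setup described above; the linear
# model and the random tensors are placeholders for MFCAN and the real datasets.
model = nn.Linear(1536, 2)
features = torch.randn(320, 1536)
labels = torch.randint(0, 2, (320,))
loader = torch.utils.data.DataLoader(torch.utils.data.TensorDataset(features, labels),
                                     batch_size=32, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999),
                             weight_decay=0.02)     # L2 penalty, one of the searched values
criterion = nn.CrossEntropyLoss()

best_acc, patience, bad_epochs = 0.0, 5, 0
for epoch in range(100):
    model.train()
    for x, y in loader:
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():                            # accuracy on (placeholder) validation data
        val_acc = (model(features).argmax(1) == labels).float().mean().item()
    if val_acc > best_acc:
        best_acc, bad_epochs = val_acc, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                   # early stopping with patience 5
            break
```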
The statistics of the datasets
Figure 7 shows how the classification accuracy of MFCAN changes with the number of iterations during the training process. It can be seen that when the number of iterations is between 90 and 110, the training has basically converged and the performance of MFCAN reaches its best.

Accuracy changes with different iterations.
The proposed model is implemented with the PyTorch framework [63], and the experiments are performed with PyCharm 2022.3.2, Python 3.8.0, PyTorch 1.10.0, CUDA 10.2, and cuDNN 7.6.5.
The performance of the proposed MFCAN is contrasted with that of single-modality and multi-modality baseline models.
Single-modality models
We first contrast the proposed multi-modality approach with the following three single-modality models.
Text: the model analyzes only textual information to detect fake news. The post's text is represented by BERT as a 768-dimensional sequence of word vectors, which is then fed into a Bi-LSTM, and the Bi-LSTM output is sent to a 768-dimensional fully connected layer to derive the overall text features for the final prediction.
Visual-S: this model is part of the proposed MFCAN. It uses only the visual semantic features of images to recognize fake news. The pre-trained ResNet50 network is used to extract the visual features of the images, which are then fed into a 768-dimensional fully connected layer for prediction.
Visual-F: the model employs only the frequency-aware tamper feature extractor of MFCAN to detect tampering components, and a 768-dimensional fully connected layer is used to make the final prediction.
Multi-modality models
Multi-modality approaches typically use information from multiple modalities to detect fake news. The following multi-modal approaches are compared with our method.
VQA [65]: Visual Question Answering aims to answer a question given an image. The original VQA model was designed for a multiclass classification task. We adapt it to our binary classification task by replacing the final multiclass layer with a binary classification layer and using only one LSTM layer with 32 hidden units.
NeuralTalk [66]: a model that generates captions for images. It uses a convolutional neural network to encode images into compact representations and then generates the corresponding sentences through an LSTM network.
att-RNN [5]: a method that incorporates social contextual information into multimodal news detection. It first applies an LSTM to obtain a fused representation of text and social information, and then explores the semantic association between text and visual information through an attention mechanism. In our experiments, we removed the social contextual information for a fair comparison.
EANN [8]: EANN consists of a multimodal feature extractor, a fake news detector, and an event discriminator. The multimodal feature extractor obtains distinguishable textual and visual features from posts, and the event discriminator improves the multimodal features by learning event-invariant features based on adversarial ideas. We used a variant of EANN with the event discriminator removed in our experiments.
MVAE [9]: obtains a shared representation of vision and text to detect fake news by jointly training an encoder, a decoder, and a classifier. We use the official implementation of MVAE in our experiments.
SAFE [15]: uses a pre-trained model to generate a textual description of an image and computes its semantic similarity to the textual representation, which is then combined with multimodal fusion features to detect the falsity of news.
Evaluation metrics
Following the evaluation metrics commonly used for classification tasks, we use Accuracy, Precision, Recall and F1 score to assess the performance of the various methods:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP},$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$
where $TP$, $TN$, $FP$ and $FN$ denote true positives, true negatives, false positives and false negatives, respectively.
Tables 2 and 3 show the results of the performance comparison between the proposed MFCAN and the baselines on the Twitter and Weibo datasets. Note that the experimental results of the baseline models are cited directly from previous papers. Based on the results in the tables, we have the following observations.
MFCAN-Concat outperforms the single-modal models (e.g., Visual-S), indicating that multimodal methods are generally superior to single-modality methods. Introducing multimodality into a model is feasible, but not always effective in improving model performance; for example, the unimodal method "Text" performs better than MFCAN-Concat on the Weibo dataset. Our proposed MFCAN performs better than MFCAN-Concat and the other multimodal models, demonstrating that our strategy for fusing cross-modal features is superior to simple concatenation.
Among the multimodal models, att-RNN fuses textual and social contextual features through the attention mechanism and concatenates the fused textual features with image features, achieving better performance than VQA and NeuralTalk and demonstrating the effectiveness of the attention mechanism; however, it does not directly use the attention mechanism to explore the intra- and inter-modal associations of textual and visual information. The superior performance of the EANN and MVAE models over the single-modal models suggests that visual information can provide complementary evidence for the analysis of fake news. SpotFake performs better than all baselines on the two datasets, showing that the pre-trained BERT can learn more textual information to improve model performance. SpotFake, EANN, and MVAE, however, rely only on fused features obtained directly through concatenation or auxiliary tasks. These fused features do not provide enough discriminative power to classify fake news, because the separately extracted text and image features are not in the same semantic space and the relevant information of the text and images is not well attended to during the fusion process. Therefore, the experimental results of these methods are unsatisfactory.
Comparing Tables 2 and 3, the performance of the single-modality models Text (BERT), Visual-S (ResNet50), and Visual-F (ResNet50) on the Weibo dataset is significantly better than that on the Twitter dataset. The reason lies in issues with the Twitter dataset, such as short post length and class imbalance. Over 70% of the tweets in the Twitter dataset are related to a particular event, so the training data for BERT and ResNet50 are excessively similar, resulting in poor generalization capability. The average length of a post in the Weibo dataset is significantly longer than that of a tweet in the Twitter dataset, and the images are more varied, which helps in extracting discriminable representations of text and image.
Comparison results of different models on the Twitter dataset
Comparison results of different models on the Weibo dataset
On the Twitter dataset, our model outperforms the best baseline, SpotFake, by a margin of 1.1% in terms of accuracy. On the Weibo dataset, our model achieves a performance improvement of 0.6%. Overall, these results demonstrate the advantages of the proposed MFCAN for multi-modal fake news detection. The performance improvement is attributed to two aspects of MFCAN. First, the physical tampering clues of fake images captured in the frequency domain effectively complement the information in the spatial domain; second, the proposed cross-modal fusion network fully utilizes intra- and inter-modal interactions to enhance the fused feature representation of text and images for more accurate fake news detection.
Effectiveness of MFCAN components
In this section, we devise ablation experiments to evaluate the efficacy of the various MFCAN components, beginning with the most fundamental configurations and progressively adding components until the entire model architecture is built up.
The results are exhibited in Tables 4-5. "T" denotes the text feature extractor, "S" stands for the visual semantic feature extractor, "F" refers to the frequency-aware tamper feature extractor, "Cross-att" stands for the cross-modality fusion module which fuses the three modalities with the attention mechanism, and "Concat" denotes that the three modalities are merged by concatenation.
Architecture ablation analysis of MFCAN on Twitter dataset
Architecture ablation analysis of MFCAN on Weibo dataset
We begin with the text feature extraction module (T) and gradually add new sub-networks. As shown in Tables 4-5, according to the first row for each dataset, the accuracy achieved using only text features is 57.6% and 81.7%, respectively. Then, we add the visual semantic feature extractor (S) and combine the textual and visual semantic features. Compared to the text-only model, the results improve by 18% and 2%, respectively. In addition, combining the textual features with the tamper features from the frequency-aware tamper feature extractor (F) leads to improvements of 13.2% and 1.2%, respectively, compared to the text-only model.
Next, we use the concatenation operation (Concat) to combine the features of the T, S, and F modules, which boosts the recognition accuracy to 76.4% and 85.9%. Compared to T+S or T+F, the accuracy on Twitter improves by 0.8% and 5.6%, respectively, and on Weibo by 2.2% and 3%. This also demonstrates that the frequency-aware tamper feature is beneficial to performance.
Finally, instead of concatenation, we use the attention module (Cross-att), which leverages attention mechanisms to effectively blend intra- and inter-modal representations in the linguistic and visual domains. These experiments achieve the best performance of 79.4% and 89.8%. Compared to concatenation, the results improve by 3% and 3.9%, respectively.
We plot the findings to better illustrate the ablation experiments. As illustrated in Fig. 8, each component plays an important role in boosting MFCAN's performance. T+S+F+C outperforms T+S, indicating that frequency-domain information has the potential to detect fake news. T+S+F+A outperforms T+S+F+C, indicating that the cross-attention method we advocate is effective. Textual information contributes less to the overall model on the Twitter dataset than visual representations, whereas the opposite is true on the Weibo dataset. This is due to the short length of most posts in the Twitter dataset, which makes learning the complete semantic information challenging. Furthermore, due to the balanced data distribution, eliminating one or two components on the Weibo dataset does not significantly reduce MFCAN's performance.

MFCAN ablation analysis in terms of accuracy. (a) Twitter. (b) Weibo.
In order to further investigate the effectiveness of low-, mid- and high-frequency as well as full-frequency information in FATE (the frequency-aware tamper feature extractor) for forgery feature extraction, we evaluate the proposed FATE with different frequency components, namely: 1) FATE-Low, a variant that extracts only low-frequency band features; 2) FATE-Mid, a variant that extracts only mid-frequency band features; 3) FATE-High, a variant that extracts only high-frequency band features; and 4) FATE-All, which extracts features from all bands.
Tables 6-7 present the experimental results for the variants of FATE on both datasets. The variant FATE-High achieves the best results compared to FATE-Low and FATE-Mid, because the higher frequency bands contain richer information about the changes in a picture. This suggests that high-frequency cues in images contribute to the detection of forgeries, as high-frequency components correspond to the edges, textures, and detailed parts of the image, which are sensitive areas for forgery. FATE-All achieves the highest results by stringing the three frequency bands together, capturing low-frequency global information while also learning mid- and high-frequency details over a small range, which facilitates more comprehensive mining of forgery patterns and helps to obtain richer frequency-aware clues.
Ablation analysis of FATE on Twitter dataset
Ablation analysis of FATE on Weibo dataset
We selected several fake news images from the Twitter and Weibo datasets to visually demonstrate the effectiveness of FATE-All in capturing forgery traces. Note that, according to the above ablation experiments, we only choose FATE-All for demonstration because the difference in performance between FATE-Low and FATE-Mid is not significant. Since the original authentic images cannot be found, we employ the detection results of the Error Level Analysis (ELA) algorithm as a comparison to show the possible tampered regions. As shown in Fig. 9, the possible forged and non-forged regions of the second column of ELA-processed images are distinctly different in character, and the manipulated regions are generally highlighted. Our proposed FATE-All also captures similar regions, which are presented as heat maps in which potential forgery regions are shown in red, with darker colors indicating more attention.

The results of image manipulation traces detection and localization. The first column shows the forged images, the second column is the first column images processed by ELA, and the third column is the first column images after FATE-All processing.
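For reference, Error Level Analysis as used in the comparison above can be sketched as follows: the image is re-saved as JPEG at a fixed quality and the amplified pixel-wise difference is inspected; the quality factor, brightness scale and file names here are illustrative choices.

```python
from PIL import Image, ImageChops, ImageEnhance

def error_level_analysis(path: str, quality: int = 90, scale: float = 15.0) -> Image.Image:
    """Re-save the image as JPEG at a fixed quality and amplify the pixel-wise
    difference; regions edited after the last compression tend to show a different
    error level from the rest of the image."""
    original = Image.open(path).convert("RGB")
    original.save("_ela_tmp.jpg", "JPEG", quality=quality)
    resaved = Image.open("_ela_tmp.jpg")
    diff = ImageChops.difference(original, resaved)
    return ImageEnhance.Brightness(diff).enhance(scale)

# Hypothetical usage:
# error_level_analysis("suspect_post.jpg").save("suspect_post_ela.png")
```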
In this section, we investigate the influence of the number of cross-attention unit heads on the performance of the model. The range of the number of heads was set to [4, 16] because the size of the word vector must be divisible by the number of cross-attention heads. Figure 10 shows how the accuracy of MFCAN varies with different head counts. From this we can observe that the number of heads has little impact on the results and that the performance of MFCAN was best on the Twitter and Weibo datasets at 8 and 12 heads, respectively.

Experimental results of MFCAN at different numbers of cross-attention unit heads.
The time and computational complexity of deep neural networks (DNNs) is an important issue. The computational complexity of DNNs is closely related to factors such as hardware execution, the number of layers, and the number of operations required to produce results. We therefore investigate the computational complexity of the model by tracking the training and prediction times of several MFCAN variants on an RTX 3090 24G GPU, presented in hour, minute, second, and millisecond format (HH:MM:SS.ms). As seen in Table 8, the running time increases as the model components increase in size or complexity.
Training and prediction time for several variants of MFCAN
To visualize the features learned by our model on both datasets, Figs. 11-12 show t-SNE [67] visualizations. Panels (a) and (b) in Figs. 11-12 show the test-set features learned by fusing the representations of the three modalities through concatenation (MFCAN-Concat) and through the attention mechanism (MFCAN-Cross_att), respectively.

t-SNE feature visualization results on Weibo dataset. (a) MFCAN-Concat. (b) MFCAN-Cross_att.

t-SNE feature visualization results on Twitter dataset. (a) MFCAN-Concat. (b) MFCAN-Cross_att.
As shown in Figs. 11-12, the separability of the feature representations of MFCAN-Cross_att is much better than that of MFCAN-Concat. MFCAN-Concat can learn discriminative features, although many samples are still easily misclassified, as shown in Figs. 11(a) and 12(a). The features learned by MFCAN-Cross_att are more discriminable, with a wider zone of separation between the two sample types in Figs. 11(b) and 12(b). This is due to the attention module of MFCAN-Cross_att, which gradually fuses the intra- and inter-modal features of text and images and thoroughly combines the characteristics of the three modalities.
Based on the phenomena, we can conclude that the clustering results obtained by integrating features from three modalities with an attention mechanism are more compact than those obtained by simple concatenation, further validating the efficacy of MFCAN in enhancing detection ability for fake news.
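A sketch of how such a t-SNE visualisation can be produced with scikit-learn is given below; the random feature matrix and labels are placeholders for the fused test-set representations and ground-truth labels, and the perplexity and other parameters are illustrative defaults.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholder features/labels standing in for the fused test-set representations
# produced by MFCAN-Concat or MFCAN-Cross_att and the real/fake ground truth.
features = np.random.randn(500, 1536)
labels = np.random.randint(0, 2, 500)

embedded = TSNE(n_components=2, perplexity=30, init="pca",
                random_state=0).fit_transform(features)

for cls, color, name in [(0, "tab:blue", "real"), (1, "tab:red", "fake")]:
    pts = embedded[labels == cls]
    plt.scatter(pts[:, 0], pts[:, 1], s=8, c=color, label=name)
plt.legend()
plt.savefig("tsne_fused_features.png", dpi=200)
```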
To further demonstrate the significance of multimodal features in detecting fake news and to see the advantages brought by the attention mechanism in multimodal fusion, we generate class activation maps using Grad-CAM [68] and interpret the findings. Specifically, the idea is to observe which image regions contribute to the classification; ideally these should be the objects mentioned in the text or the forged regions of manipulated images.
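The sketch below shows one way to compute such class activation maps with manually registered hooks; it uses a plain ImageNet ResNet50 and a random image as stand-ins for the trained detector and a news image, so it illustrates the Grad-CAM mechanics rather than the authors' exact pipeline.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Minimal Grad-CAM sketch on a ResNet50 backbone; the ImageNet classifier and the
# random input are placeholders for the trained detector and a real news image.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).eval()
activations, gradients = {}, {}

def fwd_hook(module, inp, out):
    activations["value"] = out

def bwd_hook(module, grad_in, grad_out):
    gradients["value"] = grad_out[0]

layer = model.layer4[-1]                       # last convolutional block
layer.register_forward_hook(fwd_hook)
layer.register_full_backward_hook(bwd_hook)

image = torch.randn(1, 3, 224, 224)
score = model(image)[0].max()                  # score of the predicted class
score.backward()

weights = gradients["value"].mean(dim=(2, 3), keepdim=True)      # global-average gradients
cam = F.relu((weights * activations["value"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=(224, 224), mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)         # heat map in [0, 1]
```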
We now present some visual outputs that illustrate the interpretability of the MFCAN model. As we observe in Fig. 13, the Grad-CAM results are satisfactory. In Fig. 13(a), Grad-CAM places the heat map well over the object regions (streets and metro stations) and the most likely forged region (the shark). The heat map in Fig. 13(b) focuses on the "mount" area and, meanwhile, emphasizes the typical characteristics of the "lenticular cloud" zone that appears in the sentence, which is very likely the fabricated region. As seen in Fig. 13(c), the high-intensity values of the heat map (red) correspond precisely to the "Hurricane" and "Statue of Liberty" (located in New York) regions, which match the words "Hurricane" and "New York" in the text. Figure 13(d) demonstrates that our approach relies on the trunk and ears to identify the elephant and recognize the features of the stone.

Some fake news on the Twitter dataset is detected by MFCAN but missed by the Text-only model.
According to Fig. 13(e), the heat map concentrates on the snake’s curl feature, followed by the snake’s head feature. These tweets above were recognized as fake by MFCAN but misclassified by the text-only model.
These visualizations highlight the complementary nature of textual and visual modalities and the ability of our model to learn the appropriate features from the input text and image and classify effectively using those features.
From the experimental results, it can be seen that MFCAN clearly outperforms the other baseline methods. The main advantage of MFCAN over the baseline methods is that it takes into account not only the visual semantic information but also the physical falsification information of the image. The results of the ablation experiments are as expected: T+S+F+C outperforms T+S, suggesting that the frequency-domain information provides complementary cues that help to detect falsifications. To further investigate the importance of different frequency components in tampering feature extraction, we quantitatively evaluated the effectiveness of low-, mid-, and high-frequency as well as full-frequency information for tampering feature extraction. The results are in line with expectation: full-frequency information linking the low, mid, and high frequencies together helps to provide richer frequency-aware cues and enables more comprehensive mining of forgery patterns.
To investigate the role of the attention mechanism in the fusion of textual and visual information, we explored the potential intra-modal and inter-modal correlation information. T+S+F+A outperformed T+S+F+C, which suggests that the attention mechanism can effectively facilitate the interactive fusion of visual and textual information and enhance the expression of multimodal features. We also further study the effect of the number of heads in the cross-attention unit on model performance. The results show that the model performance is best when the number of heads is set to 8 and 12 on the Twitter and Weibo datasets, respectively.
Model limitations
The proposed MFCAN also has limitations. For example, it is sensitive to the unbalanced Twitter dataset, in which more than 70% of posts are related to a specific event, resulting in poor generalization. Our method also relies on manually labeled datasets for training, which makes it hard and expensive to obtain a substantial amount of labeled data. For this reason, [69] investigated a semi-supervised framework based on co-training to handle limited labeled data and improve co-training robustness on imbalanced data. [70] developed an effective semi-supervised feature selection framework for video semantic recognition tasks that utilises optimal neighbour assignment and adaptive loss measures to improve the accuracy and robustness of the model. Inspired by [70], news from the same event may have potentially similar features, and feature selection with the aid of event labels may bring enhancements to our model.
Conclusion
In this paper, we developed a new multimodal frequency-aware cross-attention network (MFCAN) that performs fake news detection by mining features of the textual and visual modalities and jointly establishing inter- and intra-modal relationships. We first extract salient features from the text, the spatial domain and the frequency domain using three sub-networks. Then, utilizing the proposed multimodal attention fusion module, we identify complicated fine-grained relationships between cross-modal features. Finally, we map the text and visual features into a fully connected network to obtain the classification results. Experimental results and comparisons on two publicly available benchmark datasets for fake news detection demonstrate the effectiveness of the proposed MFCAN.
Our work shows that information cues mined from forged images in the frequency domain are a useful complement to visual information, and that the fusion of features in the frequency and spatial domains benefits model performance. Carefully constructed attention mechanisms that model intra- and inter-modal dependencies and fuse textual and visual information also prove fruitful.
In subsequent work, we will continue to investigate and optimize feature fusion approaches for multimodal data based on this model, and we will also evaluate and enhance the applicability and robustness of the model by using various multimodal fake news datasets so that it can be adapted to more complicated problem scenarios.
Acknowledgments
This study was supported by the Key Cooperation Project of Chongqing Municipal Education Commission (HZ2021017, HZ2021008), and the “Fertilizer Robot” project of Chongqing Committee on Agriculture and Rural Affairs.
