Abstract
With the rapid development of Deepfake technologies, forged facial images and videos have become increasingly prevalent on social media. These technologies are frequently exploited for malicious purposes, posing significant threats to social security. Although existing detection methods have shown effectiveness in identifying high-quality facial forgeries, their performance drops significantly for low-quality forgeries, as compression causes detail loss and feature blurring. To address this problem, a Frequency Assisted Multiscale Dual-Stream Network (FAMDnet) is proposed for low-quality Deepfake detection in this article. First, the multiscale spatial feature extraction module is designed to extract RGB features from image patches of different sizes, thereby capturing forgery traces within multiscale regions. Then, the dynamic global frequency feature extraction module is constructed by combining learnable dynamic filters with the Fast Fourier Transform to extract frequency domain features, supplementing the forgery artifacts that may vanish in the spatial domain. Finally, the multimodal attention fusion module is utilized to explore the correlation between RGB features and frequency domain features, effectively capturing local texture details and global context patterns in facial images and thereby achieving a better fusion of the two feature types. Moreover, a parameter-free attention mechanism is introduced into the classifier to enhance the fused features, further improving the model's discriminative ability for facial forgery. Comparative experiments on the FaceForensics++, Celeb-DF, and WildDeepfake datasets demonstrate that FAMDnet achieves better performance in detecting low-quality fake images and videos. The code is available at https://github.com/daisy-12138/FAMDnet.
Introduction
In recent years, the rapid advancement of Deepfake technologies (Krichen, 2023) has enabled the generation of increasingly realistic images and videos, raising significant concerns about ethical abuses such as identity theft and online fraud, which pose tangible threats to social security. Consequently, developing effective methods for detecting Deepfake facial images has become a critical priority. However, due to bandwidth constraints, many forged facial images or videos experience substantial quality degradation after being uploaded to social media platforms. These low-quality Deepfakes, characterized by subtle forgery traces and smaller file sizes, are more likely to spread widely in real-world scenarios, presenting unique detection challenges (Zhou et al., 2024). This is primarily because low-quality facial images inherently suffer from reduced clarity, detail, and color accuracy, often caused by low resolution or excessive compression. Such degradation obscures the already subtle artifacts introduced by Deepfake algorithms, making it difficult for detection models to distinguish forged regions from authentic content. Consequently, the combination of compressed formats (for easier sharing) and obscured forgery traces creates a “double hurdle” for detection methods, enabling these low-quality forgeries to proliferate across platforms.
Most existing Deepfake detection methods (Afchar et al., 2018; Chen et al., 2021; Cozzolino et al., 2017; Li, Bao, et al., 2020; Nguyen, Fang, et al., 2019; Zhao et al., 2021) rely on Convolutional Neural Networks (CNNs), which are inherently constrained by their local receptive fields. This limitation restricts their capacity to model relationships between distant image regions, a critical shortfall for detecting low-quality forgeries. Subtle forgery artifacts in such images, including inconsistent lighting, texture mismatches, and color discrepancies, often span spatial scales larger than the networks’ receptive fields. As a result, CNNs fail to capture the global contextual inconsistencies that distinguish forged and authentic content, leading to fragmented feature analysis and diminished detection accuracy for degraded images. In contrast, Transformers (Vaswani et al., 2017) model long-range dependencies through their self-attention mechanism, which allows each image patch, or token, to attend to all others, enabling direct capture of global spatial relationships across the entire input. Wodajo et al. (Wodajo and Atnafu, 2021) pioneered the use of Vision Transformers (ViTs) (Dosovitskiy et al., 2020) for detecting forged video frames, achieving promising results. Varying patch sizes in Transformers prompt the model to analyze image features at multiple scales, where smaller patches focus on fine-grained local details like pixel-level artifacts and larger patches capture coarse-grained global structures like structural inconsistencies (Wang et al., 2021). This multiscale feature integration enhances the model's ability to detect subtle forgeries across both local and global levels, particularly in compressed low-quality content. Consequently, this article employs multiscale Transformers to extract forged facial features, leveraging their capacity to fuse diverse contextual information and overcome CNNs’ local receptive field limitations. However, compressed images complicate the task of distinguishing global feature discrepancies in the RGB color space, as compression artifacts such as blocky distortions from JPEG quantization often mask subtle spatial inconsistencies introduced by Deepfake algorithms. These artifacts disrupt the natural correlation between adjacent pixels in the spatial domain, making it difficult to isolate forged regions based on RGB intensity or texture alone (Durall et al., 1999; Huang et al., 2020; Qian et al., 2020; Wang, Wang, et al., 2019). In contrast, frequency domain analysis offers a critical advantage, as it decomposes images into spatial frequency components, including low-frequency structural trends and high-frequency texture details, to reveal intrinsic differences in the spectral distribution of forged and authentic content. Forged regions, often generated by neural networks with distinct synthesis patterns, exhibit anomalous frequency distributions including unnatural periodicity in high-frequency bands or inconsistent low-frequency gradient structures that persist even after compression. These frequency domain anomalies are less susceptible to degradation by compression algorithms, which typically alter spatial coherence rather than fundamentally changing the spectral signature of image content (Zhou et al., 2024). 
Consequently, recent studies (Zhou et al., 2024; Durall et al., 1999; Qian et al., 2020) have turned to frequency domain analysis to address these challenges because the significant differences in frequency distributions between forged and authentic regions provide a robust basis for detection, enabling models to bypass compression-induced noise while targeting the inherent spectral mismatches that characterize Deepfake manipulations.
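To make this idea concrete, the short sketch below (illustrative only, not part of FAMDnet) computes a log-magnitude FFT spectrum of a face crop in PyTorch so that the high-frequency bands of real and forged images can be compared; the tensors here are random placeholders.

```python
import torch

def log_magnitude_spectrum(image: torch.Tensor) -> torch.Tensor:
    """Return the log-magnitude FFT spectrum of an image tensor.

    image: (C, H, W) float tensor in [0, 1].
    The spectrum is shifted so that low frequencies sit at the center,
    which makes it easy to compare the high-frequency bands of real
    and forged faces.
    """
    spectrum = torch.fft.fft2(image)                    # complex (C, H, W)
    spectrum = torch.fft.fftshift(spectrum, dim=(-2, -1))
    return torch.log1p(spectrum.abs())                  # compress dynamic range

# Example: a forged/real pair could be compared band by band.
fake_crop = torch.rand(3, 320, 320)
real_crop = torch.rand(3, 320, 320)
diff = (log_magnitude_spectrum(fake_crop) - log_magnitude_spectrum(real_crop)).abs().mean()
```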
In summary, inspired by references (Tatsunami and Taki, 2024; Wang et al., 2021; Yang et al., 2021; Zhou et al., 2023), a Frequency-Assisted Multiscale Dual-Stream Network (FAMDnet) for low-quality Deepfake detection is proposed in this article. First, the multiscale spatial feature extraction module is designed by employing a ViT to capture RGB features from image patches of varying sizes, thereby accurately pinpointing structural anomalies in forged regions (such as unnatural boundaries and texture inconsistencies) and addressing the limitation of traditional methods in capturing the association between local details and global structure. Then, to overcome the limitations of previous static global filters, the dynamic global frequency feature extraction module is constructed, combining learnable dynamic filters with Fast Fourier Transform (FFT) to extract frequency-domain features. This dynamic filter can adaptively adjust the feature extraction pattern according to input characteristics, effectively supplementing the forgery traces that tend to disappear in the spatial domain. Next, the multimodal attention fusion module is proposed to achieve efficient integration of local texture details and global contextual patterns by deeply mining the intrinsic correlation between RGB and frequency-domain features. This module contains three components: linear angular attention, which leverages multihead attention to compute query-key-value relationships and generate weighted features to enhance spatial inconsistencies such as boundary mismatches in spliced regions; intercovariance matrix attention, which generates attention weights through dot product operations to capture channel dependencies of forgery-related frequency components, thereby enhancing the discriminability of frequency features; and cross-attention fusion, which adopts parallel average and max pooling strategies to fuse the enhanced spatial-frequency features while preserving complementary statistical information. Finally, the parameter-free attention mechanism is introduced into the classifier, which utilizes mean statistics along the channel dimension to adaptively calibrate the features in the fused feature space, significantly boosting the model's ability to recognize subtle forgery clues without increasing the parameter count. Experimental results on the FaceForensics++, Celeb-DF, and WildDeepfake datasets demonstrate that FAMDnet achieves superior detection performance on highly compressed facial forgeries.
The main contributions of this article are as follows: The multiscale spatial feature extraction module is designed to extract multiscale RGB features from image patches of varying sizes, thereby identifying structural mismatches in forged regions of different sizes. The dynamic global frequency feature extraction module is constructed by combining learnable dynamic filters with the FFT to extract frequency domain features, supplementing the forgery artifacts that may vanish in the spatial domain. The multimodal attention fusion module is utilized to integrate spatial and frequency features, strategically exploiting the correlation between RGB and frequency domains to effectively capture both local texture details and global contextual patterns in facial images. A parameter-free attention mechanism is introduced into the classifier, through which channel-wise mean statistics are used to adaptively recalibrate features in the fused feature space without introducing extra parameters, thereby enhancing the model's ability to distinguish facial forgeries.
Early approaches for facial forgery detection predominantly employ CNNs to extract subtle forgery traces from manipulated images in the spatial domain. MesoNet (Afchar et al., 2018) and FakeSpotter (Wang, Juefei-Xu, et al., 2019) leverage CNNs to automatically learn discriminative features of forged facial images by analyzing manipulated facial regions. MaDD (Zhao et al., 2021) frames Deepfake detection as a fine-grained classification task, utilizing multiple spatial attention heads to enhance texture feature representation and aggregating low-level texture cues with high-level semantic features. Miao et al. (Miao, Chu, et al., 2022) improve model generalization by introducing a forged region-guided self-attention module. Wang et al. (Wang and Deng, 2021) achieve significant detection performance by integrating spatial attention mechanisms to capture local forgery artifacts. Stehouwer et al. (2019) employ attention mechanisms to strengthen the classification of real/fake facial images and visualize manipulated regions. While CNNs effectively learn local features through restricted receptive fields, they face inherent limitations in modeling global contextual information. This constraint makes CNNs particularly challenged in detecting low-quality forgeries, as their local receptive fields often overlook subtle inconsistencies and artifacts that are more apparent in a global context, leading to failures in assessing overall image coherence. In contrast, the self-attention mechanism in Transformer architectures enables modeling of global relationships and long-range feature dependencies, thereby enhancing visual representation capabilities. Wodajo et al. (Wodajo and Atnafu, 2021) establish long-distance correlations between image patches using a hybrid CNN-ViT framework. Miao et al. (Miao et al., 2021) adopt a pure ViT backbone with feature bagging to encode interblock relationships, enabling forgery feature learning without explicit mask supervision. LiSiam (Wang et al., 2022) combines Transformer and CNNs by using original images and quality-degraded counterparts as paired inputs to generate segmentation maps, employing a local invariance loss to enforce consistency between outputs. Trans-FCA (Tan et al., 2022) introduces a local compensation module to fuse global Transformer features with local convolutional features, addressing Transformer's limitations in capturing fine-grained local details. Notably, most existing Transformer-based methods rely on fixed-size image patches for single-scale RGB feature extraction, with limited research on multiscale spatial feature learning using variable patch sizes. Wang et al. (2021) propose a multimodal multiscale Transformer that captures subtle manipulation artifacts through image patches of diverse sizes, enabling detection of local inconsistencies across multiple spatial levels. Zhou et al. (2024) develop a multiscale dual-branch network for facial forgery detection, leveraging Transformer to extract fine-grained multiscale texture features from raw RGB images.
A critical limitation of most detection methods is their exclusive focus on RGB features, which leads to performance degradation when handling low-quality Deepfake images and videos. To address this, recent studies have incorporated frequency domain information. Masi et al. (2020) demonstrate improved performance by amplifying artifacts and suppressing high-level content through Laplacian Gaussian multiband feature extraction. GFF (Luo et al., 2021) uses high-frequency noise signals to guide RGB feature extractors toward forgery traces. IAW (Jia et al., 2021) employs a dual-branch network to model inter- and intra-image inconsistencies, separately predicting image-level and pixel-level forgery labels. Dzanic et al. (2020) adopt spectral detection techniques, modifying networks to generate high-frequency attributes that better mimic real-image statistics. While most prior work remains single-domain, emerging studies explore dual-stream networks combining spatial and frequency domains for Deepfake detection (Nguyen, Yamagishi, et al., 2019). Liu et al. (2021) integrate spatial images with phase spectra to capture upsampling artifacts in facial forgeries, enhancing detection generalizability by shallowing network architectures to reduce receptive fields and focus on local regions. M2TR (Wang et al., 2021) fuses RGB and frequency domain information, using multiscale image patches to detect cross-level spatial inconsistencies. Liang et al. (2023) design a dual-stream framework integrating conventional spatial flow and frequency flow, effectively distinguishing high-quality and low-quality faces across diverse generation methods.
Methodology
This article proposes a FAMDnet designed to enhance the robustness of detecting low-quality Deepfake images and videos by uncovering subtle forgery traces within them. The overall structure of the method, illustrated in Figure 1, primarily consists of the multiscale spatial feature extraction module, the dynamic global frequency feature extraction module, the multimodal attention fusion module, and a classifier incorporating the simple parameter-free attention mechanism. First, the low-level texture features

The Overall Structure of the Frequency Assisted Multiscale Dual-Stream Network.
The multiscale spatial feature extraction module constructs Transformer encodings on image patches of varying scales. By employing the multihead self-attention mechanism, it extracts local anomaly patterns under different receptive fields, thereby capturing multiscale RGB features and revealing artifacts of varying scales in low-quality forged facial images. Initially, low-level texture features
In this article, four different sizes of image patches (original size,
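For illustration, the following sketch shows one way to embed an input at several patch sizes using strided convolutions, so that small patches expose fine-grained artifacts while large patches summarize coarse structure; the patch sizes (8, 16, 32) and the embedding width are placeholder assumptions rather than the values used by FAMDnet.

```python
import torch
import torch.nn as nn

class MultiScalePatchEmbed(nn.Module):
    """Embed a feature map into token sequences at several patch sizes."""

    def __init__(self, in_channels=3, embed_dim=256, patch_sizes=(8, 16, 32)):
        super().__init__()
        # One strided convolution per scale acts as a patch projector.
        self.projections = nn.ModuleList(
            [nn.Conv2d(in_channels, embed_dim, kernel_size=p, stride=p) for p in patch_sizes]
        )

    def forward(self, x):
        tokens_per_scale = []
        for proj in self.projections:
            tokens = proj(x)                             # (B, D, H/p, W/p)
            tokens = tokens.flatten(2).transpose(1, 2)   # (B, N_p, D)
            tokens_per_scale.append(tokens)
        return tokens_per_scale                          # each scale feeds its own attention

x = torch.rand(2, 3, 320, 320)
scales = MultiScalePatchEmbed()(x)  # three token sequences of different lengths
```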
The dynamic global frequency feature extraction module employs the FFT with the integration of dynamic filters to extract frequency features. Unlike static global filters used in previous work, these dynamic filters adaptively generate flexible feature extraction patterns that adjust to input characteristics, thereby capturing artifacts that are difficult to discern in the spatial domain. The structure is illustrated in Figure 2. Firstly, the FFT is applied to the multiscale spatial features

The Structure of the Dynamic Global Frequency Feature Extraction Module.
Here,
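As a rough, hedged sketch of this idea, the block below applies a 2D FFT, modulates each channel with a filter mixed from a small learnable bank by an MLP conditioned on the input, and maps the result back with an inverse FFT. Modulating each channel with a single complex gain is a simplification of the richer filters described above; all names and shapes are illustrative.

```python
import torch
import torch.nn as nn

class DynamicGlobalFrequencyFilter(nn.Module):
    """Sketch of an FFT-based block with input-conditioned (dynamic) filters."""

    def __init__(self, channels, num_filters=4):
        super().__init__()
        # Filter bank: num_filters learnable complex-valued gains per channel.
        self.filter_bank = nn.Parameter(torch.randn(num_filters, channels, 2) * 0.02)
        # MLP that mixes the filter bank conditioned on pooled input statistics.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // 4), nn.ReLU(),
            nn.Linear(channels // 4, num_filters), nn.Softmax(dim=-1),
        )

    def forward(self, x):                                   # x: (B, C, H, W)
        weights = self.mlp(x.mean(dim=(2, 3)))              # (B, num_filters)
        mixed = torch.einsum("bn,ncr->bcr", weights, self.filter_bank)
        dyn_filter = torch.view_as_complex(mixed.contiguous())[:, :, None, None]
        freq = torch.fft.fft2(x)                            # spatial -> frequency
        freq = freq * dyn_filter                            # adaptive modulation
        return torch.fft.ifft2(freq).real                   # frequency -> spatial

y = DynamicGlobalFrequencyFilter(64)(torch.rand(2, 64, 40, 40))
```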
The multimodal attention fusion module effectively captures the correlation between RGB and frequency features to integrate multiscale RGB features

The Structure of the Multimodal Attention Fusion Module.
Linear angular attention is introduced to enhance multiscale spatial features

The Structure of Linear Angular Attention and Cross-Covariance Matrix Attention.
The weight of v is set to 0.5 to balance its contribution in the final output, preventing it from dominating the output and avoiding gradient explosion.
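The following is an illustrative sketch only: it assumes an l2-normalized (angular) query-key similarity computed in linear-attention order, with the 0.5-weighted value shortcut mentioned above; the exact formulation used by FAMDnet may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearAngularAttention(nn.Module):
    """Sketch: angular similarity with linear complexity plus a 0.5-weighted value shortcut."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                   # x: (B, N, D)
        B, N, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split heads: (B, heads, N, D/heads).
        reshape = lambda t: t.view(B, N, self.num_heads, -1).transpose(1, 2)
        q, k, v = map(reshape, (q, k, v))
        # Normalizing q and k turns their dot product into an angular similarity.
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
        # Linear-attention ordering: O(N * d^2) instead of O(N^2 * d).
        context = k.transpose(-2, -1) @ v                   # (B, heads, d, d)
        out = q @ context + 0.5 * v                         # 0.5 keeps v from dominating
        out = out.transpose(1, 2).reshape(B, N, D)
        return self.proj(out)

out = LinearAngularAttention(256)(torch.rand(2, 100, 256))
```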
Dynamic global frequency features
In this article, N is set to 8. The outputs of all heads are then concatenated along the channel dimension. After passing through a linear layer and Dropout, the frequency features enhanced by the cross-covariance matrix attention are reshaped to obtain the output
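A hedged sketch of this channel-wise attention, in the spirit of cross-covariance attention with N = 8 heads, is given below; module and variable names are illustrative, and the attention weights are dot products between transposed queries and keys so that they act on channel dimensions rather than spatial tokens.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossCovarianceAttention(nn.Module):
    """Sketch of channel-wise (cross-covariance) attention with 8 heads."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.Dropout(0.1))

    def forward(self, x):                                   # x: (B, N, D)
        B, N, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        reshape = lambda t: t.view(B, N, self.num_heads, -1).permute(0, 2, 3, 1)
        q, k, v = map(reshape, (q, k, v))                   # (B, heads, d, N)
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature  # (B, heads, d, d): channel attention
        attn = attn.softmax(dim=-1)
        out = (attn @ v).permute(0, 3, 1, 2).reshape(B, N, D)
        return self.proj(out)

out = CrossCovarianceAttention(256)(torch.rand(2, 100, 256))
```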
Cross-attention fusion is employed to fuse
Subsequently, a spatial attention mechanism is applied to the two feature maps. For each feature map
Finally, the features
The variable i denotes the output feature of the i-th cross-attention fusion. Subsequently, the multiscale spatial feature extraction module, dynamic global frequency feature extraction module, and multimodal attention fusion module are iteratively repeated and stacked four times to generate the final fused features
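The sketch below illustrates one plausible form of this fusion step, assuming spatial attention maps built from parallel average and max pooling over channels followed by a 1 × 1 merge; it is a simplification for illustration, not the exact FAMDnet design.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Sketch: each stream is reweighted by a pooled spatial attention map, then merged."""

    def __init__(self, channels):
        super().__init__()
        # A shared 7x7 convolution turns the pooled statistics into an attention map.
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        self.merge = nn.Conv2d(channels * 2, channels, kernel_size=1)

    def spatial_attention(self, feat):                      # feat: (B, C, H, W)
        avg_map = feat.mean(dim=1, keepdim=True)            # average pooling over channels
        max_map = feat.amax(dim=1, keepdim=True)            # max pooling over channels
        attn = torch.sigmoid(self.spatial_conv(torch.cat([avg_map, max_map], dim=1)))
        return feat * attn

    def forward(self, rgb_feat, freq_feat):
        rgb_feat = self.spatial_attention(rgb_feat)
        freq_feat = self.spatial_attention(freq_feat)
        return self.merge(torch.cat([rgb_feat, freq_feat], dim=1))

fused = CrossAttentionFusion(64)(torch.rand(1, 64, 40, 40), torch.rand(1, 64, 40, 40))
```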
To effectively amplify the key information embedded in input features and thereby enhance the classifier's performance, a parameter-free attention mechanism is incorporated prior to fully connected layers after the entire feature extraction and fusion process is completed. Initially, the spatial-wise mean
Subsequently, the weighted coefficient y and the features
Here, y is conceptualized as the parameter-free attention mechanism that modulates the output by leveraging statistical properties (i.e., mean and variance) of the input, thereby accentuating discriminative features. By element-wise multiplication of S with y, the model dynamically adjusts the saliency of each pixel, enhancing the classification accuracy and robustness. Subsequently, the refined features
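Since the ablation study later refers to this mechanism as SimAM, a minimal SimAM-style implementation is sketched below; per-channel mean and variance over spatial positions define a pixel-wise energy whose sigmoid reweights the fused features without adding any parameters.

```python
import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Minimal SimAM-style parameter-free attention applied before the classifier."""

    def __init__(self, eps=1e-4):
        super().__init__()
        self.eps = eps

    def forward(self, x):                                   # x: (B, C, H, W)
        b, c, h, w = x.shape
        n = h * w - 1
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)   # squared deviation from the channel mean
        var = d.sum(dim=(2, 3), keepdim=True) / n           # channel-wise variance over spatial positions
        energy_inv = d / (4 * (var + self.eps)) + 0.5       # inverse of the minimal energy per pixel
        return x * torch.sigmoid(energy_inv)                # recalibrate the fused features

refined = SimAM()(torch.rand(2, 64, 10, 10))
```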
Experimental Settings
Datasets
We employ intra-dataset and cross-dataset evaluations to validate the effectiveness of the proposed method. For intra-dataset evaluation, the model is trained and tested on the FaceForensics++ (FF++) dataset (Rössler et al., 2019), which is composed of 1,000 authentic videos and 4,000 corresponding manipulated videos generated via four distinct methods: Deepfakes (DF), Face2Face (F2F), FaceSwap (FS), and NeuralTextures (NT). Following the partitioning strategy of FF++, 720 videos per manipulation category are selected for training, 140 for validation, and 140 for testing in the experiment, with 270 frames extracted per video. In this article, authentic images are replicated four times through resampling to mitigate class imbalance. To evaluate the stability of the model under different data splits on the same dataset, we design five independent experiments. In each experiment, a different random seed is used to control frame sampling and resampling processes, ensuring the independence of data partitioning. The final results are presented as the mean ± standard deviation across the five runs, which serve to assess the model's robustness to the internal data distribution of the FF++ dataset. To comprehensively assess the generalization of FAMDnet across unseen distributions and manipulation techniques, the FAMDnet trained solely on FF++ is evaluated on two external benchmarks: the Celeb-DF (Li, Yang, et al., 2020) and WildDeepfake (Zi et al., 2020) datasets. The Celeb-DF dataset is composed of 590 source videos with diverse demographics from public platforms, along with 5,639 synthesized Deepfake counterparts. The WildDeepfake dataset is aggregated from 3,509 forged sequences and 3,805 authentic sequences directly extracted from uncurated online environments, exhibiting substantial heterogeneity in scenarios, subjects, and activities. This uncontrolled acquisition method presents significant challenges due to the prevalence of low-resolution artifacts characteristic of real-world dissemination. To ensure standardized evaluation, 20,000 randomly sampled authentic frames and 20,000 forged frames are selected from each external dataset for testing. We perform three independent random sampling operations from the Celeb-DF and WildDeepfake datasets. The model's adaptability to unseen data distributions and tampering techniques is quantified by computing the mean ± standard deviation of the results across multiple testing rounds. This design effectively mitigates the impact of single-sampling bias on the evaluation of generalization performance.
Implementation Details
We utilize RetinaFace (Deng et al., 2020) to detect facial regions in the input images, align the extracted facial images, and resize them to 320 × 320. For the backbone network, we select EfficientNet-B4 pretrained on ImageNet (Deng et al., 2009). The learning rate is set to 0.0001 with a decay factor of 10 applied every 5 epochs, and the batch size is configured as 8. The model is trained for 30 epochs and optimized using binary cross-entropy loss. We evaluate the classification performance of FAMDnet with Accuracy (ACC) and the area under the receiver operating characteristic curve (AUC), and adopt an image-level evaluation approach applicable to both image forgery detection and video forgery detection. The model is implemented in PyTorch and trained on an NVIDIA GeForce RTX 4090 GPU.
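A hedged sketch of this training configuration follows; the optimizer choice (Adam), the timm model name, and the dummy data loader are assumptions for illustration, while the learning rate schedule, batch size, epoch count, and loss follow the settings above.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
import timm  # assumed source of the pretrained EfficientNet-B4 backbone

# Placeholder loader: in practice, batches of aligned 320x320 face crops with 0/1 labels.
train_loader = DataLoader(
    TensorDataset(torch.rand(32, 3, 320, 320), torch.randint(0, 2, (32,))),
    batch_size=8, shuffle=True,
)

backbone = timm.create_model("tf_efficientnet_b4", pretrained=True, num_classes=1)
criterion = nn.BCEWithLogitsLoss()                 # binary cross-entropy on logits
optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-4)   # optimizer choice is an assumption
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)  # lr / 10 every 5 epochs

for epoch in range(30):                            # 30 training epochs
    for images, labels in train_loader:
        logits = backbone(images).squeeze(1)
        loss = criterion(logits, labels.float())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```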
Intra-Dataset Evaluation
Evaluation on Four Deepfake Methods of FaceForensics++ Dataset
The comparative experimental results of the FAMDnet and other state-of-the-art detection methods in detecting low-quality facial images generated by four forgery algorithms are presented in Table 1. The terms DF, FS, F2F, and NT in Table 1 refer to Deepfakes, FaceSwap, Face2Face, and NeuralTextures, respectively. The model is trained on the high-quality dataset of FF++ and tested on the low-quality dataset of FF++. As illustrated in Table 1, our FAMDnet exhibits better performance, particularly in detecting forged images generated by NT, which are capable of producing intricate details and textures that are challenging to distinguish from real faces. Compared to DDBD (Nirkin et al., 2022), our method achieves AUC improvements of 3.3%, 12.87%, 15.62%, and 14.55%, respectively, in detecting facial forgery images generated by four different forgery methods. DDBD (Nirkin et al., 2022) first detects faces and expands the bounding boxes to include contextual regions, then utilizes a segmentation network to divide the image into facial and background regions, which are separately fed into an Xception-based facial recognition network and a contextual recognition network. By calculating the identity probability difference vector between their outputs, it captures forgery cues. Additionally, it trains dedicated networks to detect face swapping and reenactment manipulations, concatenating the difference vector with the feature activation values from these two networks before inputting them into a classifier. However, the performance of this method falls short of our FAMDnet. DDBD (Nirkin et al., 2022) merely distinguishes facial and background regions via a segmentation network before feeding them into Xception networks, failing to account for the distortion of multiscale details in low-quality images and thus struggling to capture pixel-level long-range dependencies. In contrast, FAMDnet employs ViTs to perform multiscale feature extraction on image patches of varying sizes, enabling effective capture of detail distortions at different resolutions in low-quality images. Furthermore, DDBD (Nirkin et al., 2022) relies solely on spatial-domain identity probability difference vectors to capture forgery cues, making it incapable of identifying periodic artifacts introduced by forgery operations such as resampling and compression in the frequency domain. When detecting face images generated by four forgery methods, the average AUC of our method exceeds that of DCDD (Hu, Liao, Wang, et al., 2022) by 8.3%. DCDD (Hu, Liao, Wang, et al., 2022) proposes a dual-dimensional compression-based Deepfake video detection method. At the frame level, it targets compression-induced artifacts such as block boundary distortions and quantization noise, employing a CNN with a pruning module to extract features and validating compression noise and structural distortions using peak signal-to-noise ratio and structural similarity index measure. At the temporal level, leveraging the characteristic of Deepfake videos being encoded frame-by-frame, which results in a lack of interframe temporal consistency, it confirms the low inter-frame correlation of forged videos through Hamming distance. By combining the principles of encoding, it derives the correlation between residual features and temporal information to capture sequential anomalies.
The frame-level stream extracts facial regions from I-frames using MesoNet combined with an iterative pruning network for feature extraction, while the temporal-level stream divides the video into three segments, extracts residual features, and inputs them into ResNet18, subsequently fusing the outputs of both streams. However, when DCDD (Hu, Liao, Wang, et al., 2022) relies on a CNN with a pruning module at the frame level to extract compression features, block boundary artifacts and quantization noise in low-quality forged images are easily blurred, making it difficult for the CNN to capture spatial-domain anomalies. At the temporal level, when analyzing inter-frame correlation through Hamming distance, it struggles to accurately capture the nonlinear inter-frame dependencies caused by dynamic lighting changes or multiframe synthesis resulting from frame-by-frame encoding of Deepfakes. These limitations may contribute to its performance being inferior to our method. The average AUC of our method exceeds that of PRRNet (Shang et al., 2021) by 4.73%. PRRNet (Shang et al., 2021), which is primarily designed for spliced facial image detection, locates tampered regions at the pixel level and extracts features from both tampered and original regions. By capturing inter-regional correlations and integrating three metric methods (cosine distance, Euclidean distance, and inner product), it calculates regional inconsistency scores. This approach not only leverages multilevel relationships to pinpoint tampered regions but also facilitates the classification of real and fake images. However, PRRNet (Shang et al., 2021) exhibits two limitations. First, it solely focuses on spatial-domain pixel and regional relationships. When confronted with low-quality forged facial images where artifacts vanish in the spatial domain, PRRNet (Shang et al., 2021) fails to capture the unique distortion traces left by forgery operations such as Deepfake in the frequency domain. Second, during global feature classification via fully connected layers, it lacks an adaptive weighting mechanism for noise in low-quality images, which may lead to misclassifying noise as tampered features or overlooking genuine tampering signals. These shortcomings may contribute to its inferior detection performance compared to the proposed method. When detecting face images generated by four forgery methods, the average AUC of our method outperforms SARB-DF (Prathibha et al., 2024) by 3.38%. SARB-DF combines self-attention mechanisms with continual learning, capturing long-range inter-frame dependencies through residual self-attention and integrating them into the XceptionNet backbone network to fuse local and global features. Additionally, it employs an elastic weight consolidation continual learning approach, leveraging the Fisher information matrix to constrain critical weights and prevent catastrophic forgetting. Furthermore, it introduces an uncertainty-based dynamic sampling strategy, selecting high-information samples near the decision boundary to optimize the model, thereby enhancing its generalization capability in detecting compressed videos and novel synthetic data. However, SARB-DF (Prathibha et al., 2024) solely relies on spatial-domain self-attention for feature extraction, failing to exploit frequency domain forgery traces. Under low-quality compression conditions, salient region features in the spatial domain are often obscured by noise. 
In comparison, our FAMDnet employs FFT and adaptive dynamic filters to capture frequency domain forgery artifacts, demonstrating heightened sensitivity to frequency anomalies induced by compression. FAMDnet effectively compensates for the lack of frequency domain features and improves the detection capability for low-quality Deepfake samples. The average AUC of our method exceeds that of Ensemble (Omar et al., 2024) by 3.8%. The Ensemble (Omar et al., 2024) combines the CoAtNet integrated model with CutMix augmentation, addressing the issue of ViT's lack of image inductive bias. CoAtNet integrates MBConv and self-attention with relative bias. The former employs a three-layer structure of expansion-depth convolution-projection along with residual connections to achieve lightweight and efficient feature extraction, while the latter leverages patch-based sequential modeling and 2D relative positional encoding to capture global dependencies. Additionally, it generates independent models through Bagging bootstrap sampling and fuses predictions using rules such as majority voting. By incorporating CutMix regional replacement to enhance data diversity, it synergistically optimizes convolutional local features and global modeling via self-attention, thereby improving detection robustness. However, the Ensemble (Omar et al., 2024) relies on the traditional majority voting strategy of conventional ensemble methods. This strategy fails to account for differences in feature importance and exhibits insufficient robustness in classifying low-quality samples sensitive to noise. In contrast, our method enhances robustness by introducing a parameter-free attention mechanism before the classifier. This mechanism calculates weighted coefficients based on feature mean and variance, dynamically amplifying the weights of critical forgery features in low-quality images to improve classification robustness. When detecting face images generated by four forgery methods, the average AUC of our method exceeds that of MLP-Mixer (Essa, 2024) by 3.61%. MLP-Mixer (Essa, 2024) integrates DaViT, iFormer, and GPViT. DaViT employs a dual spatial-channel self-attention mechanism to balance image details and global context. IFormer utilizes an Inception token mixer to decompose features into high-frequency and low-frequency paths, capturing different frequency information through pooling-convolution and self-attention operations, respectively. GPViT leverages group propagation blocks to efficiently propagate global information in high-resolution features. Additionally, MLP-Mixer (Essa, 2024) performs feature fusion via token mixing and channel mixing operations, and after multilayer processing, outputs classification results to enhance its detection capability for various Deepfake types. However, MLP-Mixer (Essa, 2024) only processes features in the spatial domain, making it difficult to capture forgery traces such as spectral anomalies in NT synthesis, which vanish in the spatial domain. In contrast, FAMDnet transforms spatial features into the frequency domain through two-dimensional Fourier transform, combines dynamic filters to select key frequency components, and then maps the frequency features back to the spatial domain via inverse transform. Our approach enhances the detection capability of subtle frequency domain forgery artifacts in low-quality samples. Overall, our FAMDnet achieves better performance in detecting low-quality forged images generated by four different forgery algorithms compared to other methods.
Comparative Experimental Results of Detecting Low-Quality Facial Images Generated by Four Forgery Methods.
The best results are marked in bold.
ACC: Accuracy; DF: Deepfakes; F2F: Face2Face; FS: FaceSwap; NT: NeuralTextures.
The comparative experimental results of FAMDnet and other state-of-the-art detection methods on the high-quality (HQ) and low-quality (LQ) datasets of FF++ are presented in Table 2. Our FAMDnet achieves better performance on both datasets, indicating its effectiveness in detecting Deepfakes across varying levels of compression, particularly demonstrating superior detection accuracy on the low-quality dataset. As shown in Table 2, most methods perform exceptionally well on the high-quality dataset. However, they perform significantly worse when applied to the low-quality dataset. Compared with MH-FFNet (Zhou et al., 2025), our method achieves an 8.76% higher AUC on low-quality datasets while only trailing by 0.04% on high-quality ones. MH-FFNet (Zhou et al., 2025) uses ConvNeXt as the backbone, enhancing local mid-high-frequency textures via DCT in shallow features and capturing global frequency domain semantics with self-attention in mid-level features. However, its low-quality performance declines due to the lack of noise alignment, making shallow features vulnerable and mid-level semantics insufficient in blurred regions. Our method leverages FFT and dynamic filters to mine imperceptible spatial forgery artifacts, enhancing frequency-spatial interaction through a multimodal attention fusion module. Its linear angular attention and cross-covariance matrix attentions suppress noise and boost discriminability, addressing low-quality issues like blurring and compression artifacts. Thus, it excels in low-quality scenarios, while MH-FFNet (Zhou et al., 2025) suits high-texture detection. Compared with FFD (Zhang et al., 2025), our method surpasses it by 6.45% and 0.7% in AUC on low-quality and high-quality datasets, respectively. FFD (Zhang et al., 2025) uses a dual-branch structure with EfficientNet-b4 as the backbone, inputting RGB images and extracting high-frequency noise features via trainable SRM convolutions. It employs multiscale convolutions for intermediate features, cross-stream attention matrices with learnable weighting, and multiscale global features via pooling and MLP channel weighting to filter noise. In contrast, our FAMDnet adopts FFT and learnable filters to adaptively extract frequency domain forgery traces, which are more flexible than FFD's fixed SRM convolutions. Additionally, it achieves bidirectional guidance and refined interaction between spatial and frequency features, which are more precise than FFD's single cross-stream attention, suppressing irrelevant noise and enhancing key cues. Compared with HFI-Net (Miao, Tan, et al., 2022), our method exceeds it by 5.45% and 0.74% in AUC on the low-quality and high-quality datasets, respectively. HFI-Net (Miao, Tan, et al., 2022) is a dual-branch hierarchical network composed of a Transformer branch for capturing global context and a CNN branch for extracting local details. It suppresses high-level semantic interference via mid-high-frequency forgery traces, purifies mid-high-frequency features via discrete cosine transform to generate attention weights, and calibrates feature responses to enhance forgery cues. However, its performance on both low-quality and high-quality datasets is inferior to ours. Our FAMDnet captures frequency domain artifacts in low-quality images via FFT and dynamic filters, while HFI-Net's frequency domain processing has weaker dynamic adaptability and global filtering capabilities. 
Our multiscale spatial feature extraction module leverages ViT to process image patches of different sizes, capturing subtle cross-scale pixel correlations in high-quality images and compensating for the insufficient fine-grained feature modeling of high-quality images by HFI-Net's Transformer branch. Our parameter-free attention classifier dynamically enhances key information to adapt to data of different qualities, while HFI-Net's dual-classifier average output lacks flexibility. Compared with DFDT (Khormali and Yuan, 2022), our method achieves 1.72% and 0.14% higher AUC on low-quality and high-quality datasets, respectively. DFDT (Khormali and Yuan, 2022) uses overlapping patch extraction to process images. It forms input sequences with latent embeddings and positional embeddings, fuses Transformer attention weights to extract key tokens, and employs multistream Transformers. The low-level branches of these Transformers learn local features via small-sized patches, while the high-level branches extract global features via large-sized patches. DFDT (Khormali and Yuan, 2022) makes initial predictions via residual blocks in low-level and high-level Transformers, then averages multiscale prediction results for final detection. However, it relies on spatial-domain multiscale patch extraction and attention averaging without frequency domain support. Our FAMDnet leverages FFT and dynamic filters to mine imperceptible artifacts in the frequency domain. This enhances its robustness to blurred features in low-quality datasets, thus enabling it to outperform DFDT (Khormali and Yuan, 2022). Compared with SGF (Khormali and Yuan, 2024), our method achieves 1.04% and 0.06% higher AUC on low-quality and high-quality datasets, respectively. SGF (Khormali and Yuan, 2024) first partitions images into patches and constructs graphs, using feature embeddings as graph nodes and spatial proximity as edges. Then it employs ViT combined with self-supervised contrastive learning. Through masked modeling and student–teacher networks, it generates cross-view features, while improving feature generalization via contrastive loss and self-distillation loss. Subsequently, it extracts node local features through graph convolutional layers and fuses global dependencies via Transformer attention mechanisms. Finally, it performs classification after reducing the number of nodes via min-cut pooling. However, its performance on the low-quality dataset is inferior. This may stem from SGF's insufficient global semantic modeling in spatial proximity graphs, whereas our FAMDnet uses ViT to compute cross-scale self-attention for multisize patches, capturing global–local dependencies and adapting to the pixel relationship changes in low-quality images. Additionally, SGF (Khormali and Yuan, 2024) lacks frequency domain artifact mining, whereas low-quality forgery traces often reside in the frequency domain, leading to incomplete feature representations and poor performance on low-quality data. Overall, our FAMDnet demonstrates better performance in detecting forgeries across various compression levels.
Comparative Experimental Results of Detecting High-Quality and Low-Quality Deepfake Facial Images.
The best results are marked in bold.
ACC: Accuracy.
The AUC of FAMDnet, compared with that of other state-of-the-art methods on the Celeb-DF and WildDeepfake datasets, is presented in Table 3. FAMDnet is trained on the high-quality dataset of FF++ and tested on Celeb-DF and WildDeepfake to evaluate its effectiveness and robustness when confronting low-quality forged images and videos in real-world scenarios. As shown in Table 3, our FAMDnet demonstrates superior performance on both datasets. Our method achieves AUC improvements of 12.2% and 10.3% over DCViT (Wodajo and Atnafu, 2021) on the two datasets, respectively. DCViT (Wodajo and Atnafu, 2021) adopts a hybrid architecture combining VGG and ViT, where its feature learning module first extracts spatial texture features from images using VGG, and then utilizes ViT to serialize these features into classification inputs. The model exhibits weak generalization capability, primarily because it relies solely on the VGG convolutional layers to extract spatial features, making it difficult to adapt to distributional differences across various datasets. Additionally, its Softmax classifier performs discrimination directly based on global features, which is easily susceptible to noise interference from low-quality images. Our method achieves AUC improvements of 7.79% and 7.61% over RFM (Wang and Deng, 2021) on the two datasets, respectively. RFM (Wang and Deng, 2021) identifies detector-sensitive regions by computing the gradient differences between authentic and forged image outputs. Unlike class activation mapping, it generates sensitive regions at the image level and focuses on critical areas. By occluding the top N sensitive regions, this method can preserve more facial information while mitigating detector overfitting caused by information leakage. However, RFM (Wang and Deng, 2021) has two limitations. First, it relies on detector-sensitive region localization specific to a particular dataset, and cross-dataset application may lead to shifts in sensitive regions due to variations in forgery patterns. Second, the occlusion operation (e.g., random rectangular patches) may disrupt key facial information, and cross-dataset application may disrupt the balance between information retention and forgery region erasure due to differences in data distributions. In contrast, our FAMDnet does not depend on gradient information from specific detectors but instead directly mines forgery cues from the spatial and frequency domain correlations within the image itself. Our method achieves an AUC score on the Celeb-DF dataset that surpasses LDFnet (Guo et al., 2024) by 7.3%. LDFnet (Guo et al., 2024) is designed to capture local salient artifacts and global subtle texture variations through two complementary approaches. The local artifact representation is obtained via five depthwise separable convolutional layers, which focus on local operational cues, while the global texture representation is extracted using depthwise separable convolutions to capture local features and multilayer perceptrons combined with max-pooling to capture global features, thereby extracting both low-level and high-level texture features. The local and global features are then concatenated and processed through depthwise separable convolutions to generate hybrid features. Subsequently, an attention matrix is generated from the global features to refine the features and ultimately produce the fused features.
However, its dynamic fusion mechanism relies on fixed concatenation and depthwise separable convolutions for feature fusion, making it difficult to adapt the fusion weights to different data distributions across datasets. In contrast, our FAMDnet incorporates a multimodal attention fusion module, which enhances long-range dependencies in spatial features through linear angular attention and captures channel correlations in frequency-domain features via cross-covariance matrix attention. Furthermore, the dual channel-spatial attention mechanism within the cross-attention fusion module adaptively learns the fusion weights for spatial and frequency-domain features, thereby improving the flexibility of feature fusion across different datasets. Our method achieves AUC improvements of 7.3% and 4.3% over M2TR (Wang et al., 2021) on two datasets, respectively. M2TR (Wang et al., 2021) first partitions features into patches of varying sizes to compute patch-level self-attention. Subsequently, it transforms the features into the frequency domain via 2D FFT and performs Hadamard product operations with learnable filters to obtain frequency-aware features. Finally, the RGB spatial features and frequency features are embedded as query-key-value pairs for fusion. The loss function framework of M2TR (Wang et al., 2021) includes cross-entropy loss for classification tasks, segmentation loss for forgery mask prediction, and contrastive loss to drive real sample features toward a feature center. However, the static global frequency filters employed by M2TR (Wang et al., 2021) struggle to adapt to frequency feature variations caused by differing compression levels and forgery algorithms across datasets. To address this, our FAMDnet dynamically generates adaptive filters via MLP, which can adjust frequency-domain weights in real time based on input features. This effectively captures the concealed frequency-domain traces of forgery obscured by compression in different datasets, significantly enhancing cross-dataset robustness. At the cross-modal fusion level, M2TR (Wang et al., 2021) merely implements feature fusion through a simple query-key-value mechanism, insufficiently exploring the interactions between spatial and frequency-domain features. In contrast, FAMDnet's multimodal attention fusion module introduces linear angle attention to strengthen long-range dependencies in spatial features. It employs cross-covariance matrix attention to enhance interchannel correlations in frequency features and further reinforces the complementarity of cross-modal features through a dual channel-space attention mechanism in cross-attention fusion, achieving deeper feature interactions. Our method achieves AUC improvements of 3.57% and 8.04% over FFD (Zhang et al., 2025) on two datasets, respectively. FFD (Zhang et al., 2025) employs trainable SRM convolutions to extract high-frequency noise, but its convolutional kernel parameters are fixed, making it difficult to adapt to variations in noise characteristics caused by differences in generation algorithms, compression quality, and other factors across different forgery datasets. This limitation hinders its effectiveness in cross-domain detection, as it cannot efficiently extract noise features under such conditions. In contrast, our FAMDnet can adaptively adjust filter weights based on input features through MLPs, while also leveraging FFT to extract frequency-domain features. 
Our approach enables dynamic optimization of filtering strategies according to the frequency distributions of different datasets, significantly enhancing the ability to capture cross-domain noise features. Consequently, our method outperforms FFD in performance. Our method achieves AUC improvements of 3.16% and 7.56% over FFDM (Zhang et al., 2024) on two datasets, respectively. FFDM (Zhang et al., 2024) employs EfficientNet as the backbone network to extract spatial features while utilizing learnable SRM filters to capture noise features. By generating noise attention maps and fusing them with spatial features, the model further processes these features through a channel attention mechanism to enhance feature representation. Additionally, it integrates Bi-Level Routing Attention mechanisms with 4 × 4 and 7 × 7 scales, leveraging self-attention to capture long-range dependencies, thereby improving the expressiveness of local features. The enhanced features are ultimately fed back into the backbone network to complete the classification task. However, FFDM (Zhang et al., 2024) suffers from insufficient cross-dataset generalization. It directly utilizes the backbone network for classification without optimizing for feature distribution discrepancies across datasets. Our FAMDnet introduces a parameter-free attention mechanism before the classifier. By computing the mean and variance of features to generate weighting coefficients, it dynamically enhances critical forgery features while suppressing noise interference, effectively improving the model's classification robustness across different datasets. On the Celeb-DF dataset, our method achieved a 2.95% improvement in AUC compared to MLP-Mixer (Essa, 2024). Although MLP-Mixer (Essa, 2024) relies on the integration of different Transformer modules, it lacks flexibility in capturing cross-scale pixel relationships in low-quality images, particularly in modeling small-scale forgery traces such as compression artifacts and edge halos. In contrast, our FAMDnet leverages ViT to perform self-attention computation on image patches of varying sizes, and combines residual blocks to achieve multiscale feature fusion, effectively capturing forgery cues at different scales in low-quality images. Our method achieves an AUC that surpasses FInfer's (Hu, Liao, Liang, et al., 2022) AUC by 2.4% on the Celeb-DF dataset. FInfer (Hu, Liao, Liang, et al., 2022) first extracts frames from videos and detects faces, then employs Laplacian of Gaussian pyramid transformations on facial data to reveal tampering boundaries. Subsequently, it encodes the source and target faces into a 128-dimensional low-dimensional space using an encoder to mitigate the curse of dimensionality and extract effective features. A Gated Recurrent Unit is then adopted to construct a regression model, leveraging gating mechanisms to handle temporal dependencies and predict target face representations based on source face representations. Finally, correlation learning is applied to compute the correlation between predicted and actual representations, with the model optimized end-to-end through cross-entropy loss. This approach is particularly effective for detecting high-visual-quality deepfakes. However, FInfer (Hu, Liao, Liang, et al., 2022) relies on temporal predictions between the current frame and future frames, and cross-dataset variations in temporal dynamics due to different forgery techniques (such as frame rates and expression patterns) lead to reduced generalization capability. 
In contrast, our FAMDnet weakens the dependency on temporal information through multiscale spatial feature extraction and dynamic global frequency feature extraction, directly capturing spatial and frequency-domain traces of forgery to enhance cross-dataset robustness. Overall, our method demonstrates commendable effectiveness when addressing low-quality forged images and videos in real-world scenarios.
Evaluation on Celeb-DF and WildDeepfake Datasets.
The best results are marked in bold.
Effectiveness of Different Modules
Our FAMDnet is primarily composed of the multiscale spatial feature extraction module, the dynamic global frequency feature extraction module, the multimodal attention fusion module, and the classifier based on the parameter-free attention mechanism. To verify the effectiveness of each module of FAMDnet, experiments are conducted on the FF++ dataset, comparing the changes in ACC and AUC of detection performance for various module combinations in two datasets of different qualities. The experimental results are shown in Table 4. In Table 4, “RGB” represents the multiscale spatial feature extraction module, “Freq.” denotes the dynamic global frequency feature extraction module, “Fusion” indicates the use of the multimodal attention fusion module to integrate spatial and frequency features. “SimAM” refers to the classifier based on parameter-free attention mechanism. As shown in Table 4, when all modules are combined, the model exhibits optimal performance on both low-quality and high-quality datasets. In the low-quality dataset, using only the multiscale spatial feature module causes a significant decrease of 12.90% in ACC and 10.20% in AUC, which is particularly obvious because image and video compression can lead to the disappearance of certain artifacts in the spatial domain. By contrast, the dynamic global frequency feature extraction module outperforms the multiscale spatial feature extraction module when used alone, as artifacts that vanish in the spatial domain can still be detected in the frequency domain. However, compared with FAMDnet, using only frequency features leads to a decrease of 7.08% in ACC and 4.80% in AUC. In the high-quality dataset, using only the multiscale spatial feature module causes a decrease of 0.55% in ACC and 0.60% in AUC, while using only the dynamic global frequency feature extraction module results in a decrease of 2.74% in ACC and 1.30% in AUC, indicating that RGB features play a more critical role in identifying forgery traces in high-quality datasets. For the multimodal attention fusion module, using simple feature concatenation instead of the proposed fusion mechanism causes a 7.14% decrease in ACC and 4.40% decrease in AUC on the low-quality dataset. On the high-quality dataset, this simple concatenation leads to a 7.24% drop in ACC and 2.20% drop in AUC. These results suggest that the multimodal attention fusion module effectively integrates spatial and frequency features to enhance FAMDnet's performance. For the classifier with the simple parameter-free attention mechanism, removing this mechanism in the low-quality dataset leads to a decrease of 3.33% in ACC and 2.30% in AUC. To more clearly understand the roles of different modules in facial forgery detection, the features learned by different module combinations are visualized. Figure 5 shows a comparative analysis of feature maps generated by these combinations, including original images and fake images generated by four different forgery methods of the FF++ dataset. It can be seen from Figure 5 that multiscale spatial features mainly focus on global facial images, while frequency features emphasize the acquisition of detailed information, with the frequency domain particularly focusing on the high-frequency components of images that usually contain complex details. When comparing the use and nonuse of the multimodal attention fusion module, it is found that using the fusion module allows simultaneous attention to both the overall facial images and their detailed aspects. 
Overall, the modules proposed in this article demonstrate significant effectiveness.

Comparison of Feature Maps of Different Module Combinations.
Experimental Ablation Results of Combination of Each Module.
ACC: Accuracy.
Our FAMDnet extracts frequency features via the FFT with dynamic filters that adapt to the contextual information of the input, applying distinct filters at different spatial locations and feature channels. Unlike traditional global filters that apply a uniform transformation, this design enables the model to capture local features and complex patterns more effectively and to discern fine-grained details in diverse contexts. To assess the impact of the number of dynamic filters on the performance of FAMDnet, the model is trained on the high-quality dataset of FF++ and tested on the low-quality dataset, with experiments using 2, 4, 6, and 8 dynamic filters, as presented in Table 5. As is evident from Table 5, when the number of dynamic filters is set to 2, FAMDnet has fewer parameters and lower complexity, yet exhibits low ACC and AUC, likely because its limited expressive capacity hinders the capture of sufficient feature information and the effective discrimination of forged facial images or videos. Increasing the number of dynamic filters to 8 results in the highest complexity and parameter count, yet test ACC and AUC remain low, potentially because the model overfits the training data. When comparing models with 4 and 6 dynamic filters, similar ACC and AUC values are observed. Although the FAMDnet with 6 dynamic filters has more parameters than the model with 4 filters, the two perform comparably, while the 6-filter model incurs higher computational costs and longer training times. Balancing performance and efficiency, the number of dynamic filters is set to 4 in this article.
Experimental Results of the Number of the Dynamic Filters.
ACC: Accuracy.

Visualization of the Distribution of Features Extracted by FAMDnet in the Feature Space. FAMDnet: Frequency Assisted Multiscale Dual-Stream Network.
To intuitively demonstrate the distribution of the features extracted by the proposed method in the feature space, a visual analysis is conducted using t-SNE on 1,000 images of each type (real and fake) selected from the FF++ dataset. As illustrated in Figure 6, yellow represents forged facial images, while purple denotes real facial images. From Figure 6, it can be observed that a small overlap exists between the real and fake categories in each subfigure, which may be attributed to misclassifications caused by our FAMDnet. Although this overlap exists, clear classification boundaries are evident in both categories of each subfigure. Specifically, the visual results from DF, F2F, FS, and NT show that each category forms a relatively concentrated cluster, highlighting the effectiveness of FAMDnet in the feature space. The distinct separation between these clusters further demonstrates the substantial robustness of FAMDnet in discriminating between real and forged facial images, including low-quality facial images.
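A minimal sketch of this visualization step is given below; the feature matrix here is a random placeholder standing in for the fused FAMDnet embeddings of 1,000 real and 1,000 fake FF++ images.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholder features and labels: 0 = real (purple), 1 = fake (yellow).
features = np.random.rand(2000, 512)
labels = np.concatenate([np.zeros(1000), np.ones(1000)])

embedded = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
plt.scatter(embedded[labels == 0, 0], embedded[labels == 0, 1], s=4, c="purple", label="real")
plt.scatter(embedded[labels == 1, 0], embedded[labels == 1, 1], s=4, c="gold", label="fake")
plt.legend()
plt.savefig("tsne_famdnet.png", dpi=200)
```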
Conclusion and Future Work
The existing Deepfake detection methods demonstrate high accuracy on high-quality facial datasets but perform poorly on low-quality datasets. To address this issue, FAMDnet is proposed for low-quality Deepfake detection in this article. FAMDnet employs a dual-stream architecture that leverages multiscale spatial features extracted by the multiscale spatial feature extraction module and dynamic global frequency features extracted by the dynamic global frequency feature extraction module to reveal traces of forgery. Additionally, the multimodal attention fusion module of FAMDnet is utilized to integrate spatial features with frequency features, and the fused features are further enhanced by the simple parameter-free attention mechanism. This enhancement significantly improves detection performance in scenarios involving low-quality forged images and videos. Comparative experiments conducted on publicly available datasets indicate that the proposed FAMDnet exhibits superior detection performance, particularly on low-quality datasets, surpassing many existing detection methods. In future work, we will continue to explore various interaction methods between spatial and frequency features to enhance the performance of the model, as well as investigate fusion methods between Transformers and CNNs to fully leverage the advantages of both structures, thereby adapting more effectively to diverse data distributions.
Footnotes
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China (62076246).
Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
