
Abstract

The image manipulation detection localization task differs from traditional computer vision tasks in that we focus more on capturing subtle and generic manipulation detection features in images. In this paper, we propose a novel method called irrelevant visual information suppression, which aims to alleviate the interference of irrelevant visual information in images on manipulation detection feature extraction, thereby obtaining generic manipulation traces that are more subtle and unrelated to semantic visual information. In general, most manipulation operations leave traces at manipulation edges. Therefore, we introduce a specially designed manipulated edge information enhancement branch aimed at identifying these edge artifacts more accurately. We construct a dual-branch network, where each branch uses ResNet-50 as the backbone to capture as many multi-scale manipulation features as possible. Finally, we adopt a multi-view feature learning method that combines the manipulated edge information enhancement branch with the irrelevant visual information suppression branch and is trained with multi-scale (pixel/edge/image/irrelevant visual information suppression) supervision. To validate the effectiveness of the proposed method, we conducted extensive experiments using five image manipulation localization datasets, including CASIAv1, CASIAv2, COVER, Columbia, and NIST16. The experimental results demonstrate that our proposed method can outperform state-of-the-art methods by a significant margin in terms of F1 score. Taking CASIAv1, COVER, and Columbia datasets as examples, compared with MVSS-Net published in ICCV 2021, our method has improved F1 scores by 7.1%, 6.3%, and 12.5%, respectively. The code used in this paper can be found at the following URL: https://github.com/ginwins/ISIE-Net.

Keywords

image manipulation detection; irrelevant visual information suppression; manipulated edge information enhancement; multi-scale supervision

1. Introduction

The editability of digital images has reached astonishing levels, enabling individuals to make modifications that are virtually imperceptible to the eye. The rapid development of this technology poses a challenge to the authenticity of digital photos, as individuals can easily manipulate various details of an image, altering colors, contrast, and brightness, and even adding or removing objects. In terms of manipulation methods, copy-move, splicing, and removal are three of the most common; all three can be easily achieved with software such as Photoshop and Meitu Xiu Xiu, as shown in Figure 1. However, the misuse of these technologies, such as generating deepfakes (Song et al., 2019), forged signatures (Liu et al., 2021), and rumors (Luo et al., 2021), has severely challenged the normal functioning of society, prompting widespread public concern (Zampoglou et al., 2017). Therefore, there is an urgent need to detect and localize these manipulated images in order to maintain the authenticity and credibility of information.

Figure 1.

Example of manipulated images with different manipulation techniques. Copy-move copies some areas of an image and moves them to other locations in the same image. Splicing cuts objects from another image and pastes them onto the target image. Removal removes unwanted elements.

Image manipulation detection is fundamentally different from general computer vision tasks such as object detection and semantic segmentation (Yang et al., 2022; Yi et al., 2022). While usual computer vision tasks focus on learning the visual semantic content of an image, image manipulation detection focuses more on capturing the minute traces and details left by manipulation operations. Traditional image manipulation detection methods are categorized based on different manipulation techniques, including those based on overlapping blocks (Luo et al., 2006), feature points (Amerini et al., 2013), image properties (Dong et al., 2009), and compression properties (Luo et al., 2007). However, these methods suffer from poor feature applicability and low efficiency. Currently, there is no single method in traditional approaches that can universally apply to all types of image manipulation techniques, and they fail to provide accurate pixel-level detection results.

In recent years, some researchers have attempted to perform image manipulation detection with deep learning methods and have achieved significant results in the recognition of multiple manipulation types. Existing deep learning image manipulation detection methods can usually be categorized into two types: noise-aware methods and edge-aware methods. The core idea of noise-aware methods is that tampered regions and normal image regions differ significantly in terms of noise. Therefore, these methods employ a predefined noise filter (Fridrich & Kodovsky, 2012) to generate a noise image, which is then fused with the original red–green–blue (RGB) features for manipulation detection. However, for manipulation operations that are performed only on the target image, such as the copy-move type (Wu et al., 2018; Zhong et al., 2022), the noise introduced is almost negligible relative to the original image, since copy-move does not introduce new elements. In this situation, noise-aware methods are relatively ineffective and are considered suboptimal choices. In contrast, edge-aware methods try to find boundary visual artifacts around the tampered region and use them as cues to locate the edges of the manipulation. This approach is independent of the type of manipulation, as visual artifacts often show inconsistencies in edge regions. Therefore, a common strategy is to introduce another branch for detecting edge artifacts (Zhou et al., 2020). However, the simple feature concatenation used in previous methods is not ideal, because the manipulated features exhibit obvious differences in feature maps at different scales. Linear aggregation of feature maps ignores that deep features may make the detected manipulation features semantically relevant, and it also ignores the importance of shallow features, resulting in suboptimal performance.
Additionally, previous edge detection methods fail to establish an effective link between RGB features and edge features, resulting in underutilized spatial contextual information and ineffective mining of unique and effective key information at each scale. Furthermore, distinguishing between edge artifacts of natural objects and genuine edges becomes even more challenging when edge artifacts are concealed by carefully designed post-processing methods such as local smoothing, image compression, and filtering.

To overcome the two limitations mentioned above, we propose a dual-supervised network, ISIE-Net, based on irrelevant information suppression and critical information enhancement. This is a novel end-to-end multi-scale supervised framework designed for image manipulation detection and localization. ISIE-Net can be divided into three parts: the manipulated edge information enhancement branch (MEIEB), the irrelevant visual information suppression branch (IVISB), and dual attention (DA) feature fusion. It aims to capture edge artifacts, subtle and generic signs of manipulation, and information about high-level semantic objects. The MEIEB and the IVISB are jointly optimized through a DA feature fusion module, and the information at each scale is effectively utilized. In the MEIEB, inspired by Chen et al. (2021) and Ma et al. (2021), we gradually aggregate features from shallow to deep to predict the target edge. In order to jointly use shallow and deep feature information at multiple scales, this branch extracts edge features from the features output by each layer of the network. First, the features are fed into the Sobel layer to extract the edge-related information, which is then processed by the edge extraction module; these edge features are then concatenated and fed into the edge attention module as the edge prediction features. The edge attention module achieves mutual optimization between the main task of manipulation segmentation and the task of manipulated edge information enhancement, effectively improving the performance of the entire network. In the IVISB, we employ multiple resolution down ( $R_{-down}$ ) blocks to reduce the number of pixels in the image and thereby suppress excessive irrelevant visual information. Then, the resolution of the image is restored through multiple resolution up ( $R_{-up}$ ) blocks to reconstruct feature maps of the same size as the original down-sampled feature maps.
Then, a subtraction operation is performed on the down-sampling feature map and the up-sampling reconstructed feature map to generate a unique irrelevant visual information suppression view. By using a supervised learning strategy to reduce the difference between the feature map obtained from the down-sampling operation and the feature map reconstructed from the up-sampling operation, this generated irrelevant visual information suppression view not only helps us to remove the interference of the visual semantic information in the image, but also effectively highlights the subtle and generic manipulation trace signals in the image. Subsequently, the irrelevant visual information suppression view is fed into the backbone network structure, which progressively combines features from different ResNet blocks and is used to extract multi-scale features from coarse to fine in order to dig deeper into the key information in the irrelevant visual information suppression view, which improves the detection sensitivity of the tampered locations. Extensive experimental results on five public datasets show that our proposed ISIE-Net is very effective and robust. In summary, our main contributions are in three aspects:

We develop a new end-to-end multi-scale supervised framework for image manipulation detection and localization tasks, called ISIE-Net. As shown in Figure 2, the MEIEB and the IVISB are jointly optimized by a DA feature fusion module, and the information at each scale is used efficiently.

We propose a new image manipulation detection and localization method based on the idea of irrelevant visual information suppression, which utilizes a subtraction operation to alleviate the impact of irrelevant visual information on manipulation feature extraction to obtain generic manipulation traces that are more subtle and unrelated to semantic visual information. In addition, we introduce an MEIEB, which establishes an effective link between the backbone segmentation task and the manipulated edge information enhancement task, and accurately locates tampered regions based on multi-scale edge artifact features, which makes full use of the inconsistency of edge artifact features.

Extensive experiments on five publicly available datasets show that our proposed method can outperform state-of-the-art (SOTA) methods in terms of pixel-level F1 score and area under the curve (AUC).

The rest of our work is organized as follows: Section 2 introduces related work, and Section 3 describes the proposed ISIE-Net model approach. Section 4 describes the experiments and discussions. Section 5 provides concluding remarks.

Figure 2.

Network architecture of the ISIE-Net method. ISIE-Net has two branches, both using ResNet-50 as the backbone. The top MEIEB is specifically designed to enhance subtle boundary artifacts around the tampered region, while the bottom IVISB is used to learn subtle and generic manipulation cues in the image. Finally, the features of the MEIEB and the IVISB are fused by a DA fusion module. Note. MEIEB = manipulated edge information enhancement branch; IVISB = irrelevant visual information suppression branch; DA = dual attention.

2. Related Work

This section reviews the most relevant research on deep learning methods for manipulation detection and localization. Then, we briefly introduce the attention mechanism, one of the core components of this network.

2.1. Image Manipulation Detection and Localization

The deep learning-based methods for image manipulation detection and localization have brought a new perspective to the field of image tampering detection. In recent years, many research works have made significant advancements in this area.

Bappy et al. (2017) introduced a hybrid convolutional neural network (CNN)-long short-term memory (LSTM) model called J-LSTM. After segmenting the input image into multiple blocks, this model utilizes the LSTM network to extract discriminative features regarding the correlation between blocks, aiming to discern manipulated and unmanipulated regions. Similarly, a novel high-confidence manipulation localization structure, H-LSTM, was proposed in Bappy et al. (2019), based on resampling features. Comprising LSTM and CNN, this structure is designed to locate manipulation regions, albeit constrained by the size of the partitioned blocks. Zhou et al. (2018) introduced a dual-stream localization architecture called RGB-N. In this architecture, the dual streams consist of an RGB stream, which extracts features from the RGB image, and a noise stream, which utilizes a steganalysis rich model (SRM) filter to extract noise features and discover noise inconsistency between the authentic and tampered regions. However, the method mainly focuses on the semantic information of the image, which limits its use for detecting tampered images: the features of the untampered region can affect the discriminative features of the tampered region, leading to an increase in the false detection rate. In addition, the method uses rectangular boxes to outline the falsified regions and does not achieve pixel-level localization. Yang et al. (2020) used BayarConv as the initial convolutional layer of their CR-CNN for more accurate prediction. Zhou et al. (2020) proposed GSR-Net, a splicing detection model based on a generative adversarial network structure, which constructs an edge detection task by selecting features from the middle three blocks of DeepLab (Chen et al., 2017) and uses the predicted manipulated edges to optimize the overall manipulation region prediction. Additionally, Zhuang et al. (2021) designed a fully convolutional encoder–decoder architecture, DenseFCN, that uses dense connectivity and dilated convolution to improve tampering localization performance. Kwon et al. (2021) proposed CAT-Net, an end-to-end fully convolutional neural network containing RGB and discrete cosine transform (DCT) streams to jointly learn forensic features for compression artifacts in both the RGB and DCT domains. Wu et al. (2019) proposed an end-to-end ManTra-Net network, which consists of two sub-networks: a manipulation trace feature extractor and a local anomaly detection network. It treats the manipulation detection task as a local anomaly detection task, capturing local anomalies using $Z$ -score features and then evaluating them using LSTM, but the lack of an edge optimization module may lead to blurring of edges in the detected tampered regions. In addition, ManTra-Net only models features of different sizes and does not model the spatial relationships between image blocks. Hu et al. (2020) improved ManTra-Net and proposed the spatial pyramid attention network (SPAN). SPAN consists of a feature extractor, a spatial pyramid attention module, and a prediction module, aimed at establishing relationships between image blocks at multiple scales and using a convolutional network to determine whether pixels have been tampered with. However, SPAN does not fully utilize spatial correlations, considering only local region correlations, and therefore cannot further refine predictions for tampered regions. Liu et al. (2022) proposed a progressive prediction model, PSCC-Net, which constructs a dual-path process, one path for extracting local and global features from top to bottom, and the other for detecting image tampering from bottom to top. Additionally, its spatio-channel correlation module (SCCM) aims to capture spatial and channel correlations for better tampering localization. Shi et al. (2022) constructed a progressively refined network called PR-Net, which consists of a mask generation module with three sub-modules. The rotated residual structures are employed in these three sub-modules to suppress image content and extract features from coarse to fine. Lin et al. (2023) proposed a novel network, EMT-Net, for learning and enhancing multiple tampering traces, including noise distribution and visual artifacts. This network extracts global and local noise features from noise maps using transformers and captures local visual artifacts from original RGB images using convolutional neural networks. Additionally, it employs an edge artifact enhancement module and an edge supervision strategy to enhance the boundary artifacts of the fused features. Notably, Chen et al. (2021) proposed MVSS-Net, which learns multi-view features under multi-scale supervision. The network consists of noise and boundary branches, where the noise branch aims to learn semantic-independent features, providing the network with broader applicability, while the boundary branch focuses on learning the discrepancies between tampered and authentic regions at their boundaries. Additionally, MVSS-Net integrates the traditional Sobel edge detection operator into the progressive edge extraction structure (Yu et al., 2018). Currently, MVSS-Net stands as one of the leading models, particularly excelling in cross-dataset scenarios, making it one of the main comparison subjects in this paper.

2.2. Attention Mechanisms

When a CNN is used to process images, we prefer to focus on important content rather than considering all information comprehensively. However, manually specifying what needs attention is impractical (Dai et al., 2023). Therefore, attention mechanisms in deep learning have become a crucial technology, mimicking the attentional mechanisms and information-processing methods of the human visual system. This allows models to selectively concentrate on specific parts of input data, thereby enhancing the recognition and utilization of critical information. In summary, attention mechanisms serve two purposes: determining which parts of the input should be attended to and how to effectively allocate limited information-processing resources to these important parts. For image manipulation detection tasks, the global correlation between each pixel and other pixels in the image is crucial. Ma et al. (2021) proposed BCANet, which introduced a boundary-guided context aggregation module based on an attention mechanism, aiming to capture distant dependencies between pixels in boundary regions and those inside the target. Fu et al. (2019) designed a positional attention module to encode broader contextual information into local features and devised a channel attention mechanism to model the interdependencies among channels.

3. The Proposed Model

Given an input RGB image of size $W \times H \times 3$ , our goal is to learn a model that can detect subtle inconsistency artifacts and critical edge artifacts hidden in the image content and use these artifacts for manipulation detection and localization. To this end, we believe that multi-scale image content still requires special attention to better obtain subtle and generic manipulation cues as well as enhanced edge manipulation features. As shown in Figure 2, we propose a model called ISIE-Net, which is a dual-branch network with the well-known ResNet50 architecture (He et al., 2016) as the backbone. The MEIEB at the top is specifically designed to enhance subtle edge artifacts surrounding the tampered regions. It extracts the manipulated edge information from the features output by each layer of the model backbone and combines all this edge information for input into the edge attention module. Through the use of the edge attention module, precise localization of the tampered regions is achieved. This module not only contributes to enhancing manipulation edge information but also improves performance in the segmentation task of the backbone network. The IVISB at the bottom utilizes subtraction operations to suppress the influence of irrelevant visual information on manipulation feature extraction. It aims to identify even more subtle and generic manipulation traces in the images. Finally, the features of the MEIEB and the IVISB are fused through the DA fusion module, and the fused features are up-sampled and convolved to obtain the final mask of the manipulated area. Below, we introduce the MEIEB, IVISB, and DA fusion module in turn, and finally give the loss function of ISIE-Net.

3.1. Manipulated Edge Information Enhancement Branch

In the image manipulation detection task, edge information is crucial for identifying manipulation traces. Therefore, edge information is used as a supervisory signal to guide the image manipulation detection task. However, the linear aggregation of feature maps adopted in previous research ignores that deep features may make the detected manipulation features semantically relevant, and it does not fully consider the importance of shallow features. Therefore, in order to better utilize the feature contents at different scales and to establish an effective link between the main segmentation task and the manipulated edge information enhancement task, we propose a multi-scale MEIEB. As shown in Figure 2, ResNet50 is used as the backbone network, where R1, R2, R3, and R4 are, respectively, the conv1-x, conv2-x, conv3-x, and conv4-x layers of ResNet50, and we extract manipulation edge information from the features output by each of these layers. In order to obtain stronger edge information, we introduce the Sobel layer, which enhances the edge-related information in the image so that the tampered regions can be located more precisely. Immediately after that, we employ the edge extraction block (EEB) to process the edge features extracted by the Sobel layer and concatenate all of this edge information as the edge feature. Furthermore, we believe that the manipulated edge information enhancement task and the backbone segmentation task are interrelated and can optimize each other. The focus of the manipulated edge information enhancement task is to capture the boundaries and transitions between the tampered and untampered regions of an image, and this boundary information is crucial for structural and semantic analysis of the image. At the same time, the backbone segmentation task can also provide enhanced contextual information for the edge detection task. Therefore, we introduce the edge attention module to mine manipulation position information more accurately. This module not only helps to enhance the performance of edge detection, but also improves the performance of the backbone network in the manipulation segmentation task.

Specifically, assume there is a natural RGB image $I \in R^{W \times H \times 3}$ , where $W$ and $H$ represent the width and height, respectively. The image $I$ is fed into the backbone network, from which we obtain the multi-scale feature maps $r_{1} \in R^{(W / 4) \times (H / 4) \times C_{1}}$ , $r_{2} \in R^{(W / 8) \times (H / 8) \times C_{2}}$ , $r_{3} \in R^{(W / 16) \times (H / 16) \times C_{3}}$ , and $r_{4} \in R^{(W / 16) \times (H / 16) \times C_{4}}$ , where $C_{1}$ , $C_{2}$ , $C_{3}$ , and $C_{4}$ are 256, 512, 1024, and 2048, respectively. All of these feature maps are input into the corresponding Sobel layer. The basic idea behind the Sobel layer is to distinguish edge-related pixels from other pixels in a given feature map by using edge-related weights. To obtain such attention maps, we first pass the feature maps through the classical Sobel filter, followed by the batch normalization layer and norm layer, and finally the Sigmoid layer. This approach effectively mines both low-level details and high-level semantics while preserving their consistency. The structure of the Sobel layer is shown in Figure 3(a). In order to make full use of edge information at different scales and accurately locate edge artifact information, the output of the Sobel layer is further input into the edge extraction block (EEB). The structure of the EEB is shown in Figure 3(b). Multiple convolution operations unify the channel dimension of the features carrying edge information, and all of this edge information is concatenated to form the edge feature $E$ .

Figure 3.

(a) Sobel layer and (b) EEB for manipulating edge detection in the manipulated edge information enhancement branch. Note. EEB = edge extraction block.
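The Sobel layer described above (Sobel filter, then batch normalization, then a norm layer, then Sigmoid) can be illustrated with a small NumPy sketch. This is only a rough illustration under stated assumptions, not the authors' implementation: the per-channel standardization stands in for a trained batch normalization layer, the "norm layer" is interpreted here as the L2 norm over the horizontal and vertical Sobel responses, and the function names are hypothetical.

```python
import numpy as np

# Classical 3x3 Sobel kernels for horizontal and vertical gradients.
SOBEL_X = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])
SOBEL_Y = SOBEL_X.T


def _filter3x3(feat, kernel):
    """Apply one 3x3 kernel to every channel of a (C, H, W) array."""
    _, h, w = feat.shape
    padded = np.pad(feat, ((0, 0), (1, 1), (1, 1)), mode="edge")
    out = np.zeros(feat.shape, dtype=np.float64)
    for di in range(3):
        for dj in range(3):
            out += kernel[di, dj] * padded[:, di:di + h, dj:dj + w]
    return out


def _normalize(x, eps=1e-5):
    """Per-channel standardization (assumed stand-in for batch norm)."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = x.std(axis=(1, 2), keepdims=True)
    return (x - mean) / (std + eps)


def sobel_attention(feat):
    """Return an edge attention map in [0.5, 1) for a (C, H, W) feature map."""
    gx = _normalize(_filter3x3(feat, SOBEL_X))
    gy = _normalize(_filter3x3(feat, SOBEL_Y))
    magnitude = np.sqrt(gx ** 2 + gy ** 2)   # assumed meaning of "norm layer"
    return 1.0 / (1.0 + np.exp(-magnitude))  # Sigmoid


# Toy feature map with a vertical step edge between columns 2 and 3.
feat = np.zeros((1, 6, 6))
feat[:, :, 3:] = 1.0
attn = sobel_attention(feat)
print(attn.shape)
```

In this toy example the attention values next to the step edge are larger than those in the flat regions, which is the "edge-related weights" behaviour the text describes. How the attention map is applied to the features afterwards (for example, element-wise weighting) is not specified here and is left out of the sketch.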

In order to establish an effective link between the RGB features and edge features of the backbone network, we introduce the edge attention module, which aims to mine the manipulation location information more precisely. The structure is shown in Figure 2. Specifically, given the semantic feature map $r_{4} \in R^{(W / 16) \times (H / 16) \times C_{4}}$ generated from the backbone network and the boundary feature map $E \in R^{(W / 4) \times (H / 4) \times C}$ from MEIEB, applying two distinct $1 \times 1$ convolutions to $r_{4}$ yields RGB features $A_{1} \in R^{(W / 16) \times (H / 16) \times 256}$ and $A_{2} \in R^{(W / 16) \times (H / 16) \times 256}$ , and E uses $1 \times 1$ convolution to obtain the edge features $B \in R^{(W / 16) \times (H / 16) \times 256}$ . Then the reshaping operation is performed on $A_{1}$ , $A_{2}$ , and $B$ to obtain $A_{1}^{^{'}} \in R^{256 \times N}$ , $A_{2}^{^{'}} \in R^{256 \times N}$ , and $B^{^{'}} \in R^{256 \times N}$ , where $N = (W / 16) \times (H / 16)$ . We perform matrix multiplication on the transpose of the reshaped $A_{1}$ and on $B$ , which allows us to generate the attention matrix by means of the softmax function, and the whole operation can be described as follows:

$$F (i, j) = \frac{\exp (B_{i} \cdot A_{1 j}^{T})}{\sum_{i = 1}^{N} \exp (B_{i} \cdot A_{1 j}^{T})} \tag{1}$$

where $F \in R^{N \times N}$ is the edge attention map, and each position $j$ on the $A_{1}^{'}$ feature is affected by the edge feature position $i$ , thereby more accurately determining the location of the tampered area. Then, the edge attention map is multiplied with the RGB features $A_{2}^{'}$ to guide manipulated edge information enhancement. Finally, through the up-sampling operation and Sigmoid layer, the output result is converted to $(W / 4) \times (H / 4) \times 1$ to obtain the final manipulated edge map prediction. In summary, the output of the MEIEB has two parts: the feature map output from the last block $R_{4}$ of the backbone network, which is used for the main task and denoted as $\{f_{{meie}_{1}}, \dots, f_{{meie}_{k}}\}$ , and the predicted manipulated edge map, denoted as $S_{edge} (x)$ , which is obtained by transforming the output of the edge attention module using a sigmoid layer.
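As a rough illustration of Equation (1), the following NumPy sketch computes the edge attention map from reshaped features and applies it to the second RGB feature. It is a sketch under assumptions rather than the authors' code: the inputs are random stand-ins for the outputs of the $1 \times 1$ convolutions, the channel count is reduced from 256 to 8 for readability, and the orientation of the final product (`a2r @ f`) is one plausible reading of "multiplied with RGB features", since the text does not fix the operand order. The function name is hypothetical.

```python
import numpy as np


def edge_attention(a1, a2, b):
    """a1, a2, b: (C, h, w) arrays. Returns refined features and the map F."""
    c, h, w = a1.shape
    n = h * w
    a1r = a1.reshape(c, n)  # A_1' in R^{C x N}
    a2r = a2.reshape(c, n)  # A_2' in R^{C x N}
    br = b.reshape(c, n)    # B'   in R^{C x N}

    # logits[i, j] = B_i . A_1j, an N x N matrix.
    logits = br.T @ a1r
    # Softmax over i (axis 0), matching the sum over i in Equation (1).
    logits = logits - logits.max(axis=0, keepdims=True)
    f = np.exp(logits)
    f = f / f.sum(axis=0, keepdims=True)

    # Assumed orientation: out[:, j] = sum_i A_2'[:, i] * F[i, j].
    out = a2r @ f
    return out.reshape(c, h, w), f


rng = np.random.default_rng(0)
a1, a2, b = (rng.normal(size=(8, 4, 4)) for _ in range(3))
out, f = edge_attention(a1, a2, b)
print(out.shape, f.shape)
```

Each column of `f` sums to one, so every position $j$ receives a convex combination of the features at all positions $i$, weighted by how strongly the edge feature at $i$ responds to the RGB feature at $j$.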

3.2. Irrelevant Visual Information Suppression Branch

In the image manipulation detection task, the traces of image manipulation are very subtle, and in order to capture the subtle and semantically irrelevant generic manipulation features of an image, we constructed an IVISB in parallel with MEIEB. Unlike AMTEN proposed by Guo et al. (2021), which subtracts image content from low-level features, the proposed IVISB obtains an irrelevant visual information suppression view by performing a subtraction operation on a set of feature maps obtained from a set of resolution down blocks and a set of feature maps reconstructed from a set of resolution up blocks and then taking the absolute value. The structures of the resolution down block ( $R_{-down}$ ) and resolution up block ( $R_{-up}$ ) layers are shown in Figure 4(a) and (b).

Figure 4.

(a) Resolution down block ( $R_{-down}$ ) and (b) resolution up block ( $R_{-up}$ ) layers.

Specifically, as shown in Figure 2, given an input RGB image of size $W \times H \times 3$ , the image is fed into a network consisting of several resolution down blocks $(R_{i -down}, i \in \{1, 2, 3, 4\})$ and several resolution up blocks $(R_{j -up}, j \in \{1, 2\})$ , which yields the multi-scale feature maps $D_{i} \in R^{(W / 2^{i}) \times (H / 2^{i}) \times C_{i}}, i \in \{1, 2, 3, 4\}$ , $U_{1} \in R^{(W / 8) \times (H / 8) \times C_{5}}$ , and $U_{2} \in R^{(W / 4) \times (H / 4) \times C_{6}}$ , where $C_{1}$ , $C_{2}$ , $C_{3}$ , $C_{4}$ , $C_{5}$ , and $C_{6}$ are 32, 64, 128, 256, 128, and 64, respectively. IVISB attempts to remove irrelevant visual semantic features learned by the convolutional block from the manipulated features. Formally, $R_{0} = | D_{2} - U_{2} |$ , where $D_{2}$ and $U_{2}$ denote the feature maps output from the $R_{2-down}$ block and the $R_{2-up}$ block, respectively, and $R_{0}$ is the obtained irrelevant visual information suppression view. We reduce the difference between the two feature maps $D_{2}$ and $U_{2}$ through a supervised learning strategy, so that the irrelevant visual information suppression view not only helps us to remove the interference of irrelevant visual information from the image, but also effectively highlights the subtle and generic manipulation trace signal of the image. Subsequently, we input the irrelevant visual information suppression view $R_{0}$ into the backbone network structure. This structure progressively combines features from different ResNet blocks and is utilized to extract multi-scale features from coarse to fine, digging deeper into the crucial manipulation information in the irrelevant visual information suppression view. This process enhances the detection sensitivity of manipulation locations. The output of this branch is the feature map from the last block $R_{4}$ of its backbone, denoted as $\{f_{{ivis}_{1}}, \dots, f_{{ivis}_{k}}\}$ .
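The data flow behind $R_{0} = | D_{2} - U_{2} |$ can be sketched in a few lines of NumPy. The sketch below is a simplified illustration, not the proposed network: the learned resolution down and up blocks are replaced by fixed $2 \times 2$ average pooling and nearest-neighbour up-sampling (both assumptions), so the channel count stays unchanged and nothing is trained. The function names are hypothetical.

```python
import numpy as np


def down2(x):
    """Halve the resolution by 2x2 average pooling (stand-in for R_down)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))


def up2(x):
    """Double the resolution by nearest-neighbour repeat (stand-in for R_up)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def ivis_view(img):
    """img: (C, H, W) with H and W divisible by 16. Returns R_0 at (H/4, W/4)."""
    d1 = down2(img)  # W/2
    d2 = down2(d1)   # W/4
    d3 = down2(d2)   # W/8
    d4 = down2(d3)   # W/16
    u1 = up2(d4)     # back to W/8
    u2 = up2(u1)     # back to W/4, same size as d2
    return np.abs(d2 - u2)


# A flat image with a small local perturbation in the top-left corner.
img = np.full((3, 32, 32), 0.5)
img[:, 0:4, 0:4] += 1.0
r0 = ivis_view(img)
print(r0.shape)
```

In this toy case the smooth background cancels in the subtraction, while the small local perturbation survives in `r0`, which mirrors the intent of suppressing coarse visual content while keeping fine-scale traces.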

3.3. Branch Fusion via DA

We concatenate the two sets of feature maps, $\{f_{{meie}_{1}}, \dots, f_{{meie}_{k}}\}$ and $\{f_{{ivis}_{1}}, \dots, f_{{ivis}_{k}}\}$ , from the MEIEB and the IVISB to form a feature map $A \in R^{(W / 16) \times (H / 16) \times 4096}$ , and then use a trainable DA feature fusion module to fuse them (Fu et al., 2019). The DA module has two attention mechanisms working in parallel, positional attention and channel attention (see Figure 5). The positional attention module aims to utilize the correlation between any two features to mutually enhance the respective feature representations. Specifically, the feature map $A \in R^{(W / 16) \times (H / 16) \times 4096}$ is first passed through three $1 \times 1$ convolutional layers to obtain three feature maps $B$ , $C$ , and $D$ . Then $B$ and $C$ are reshaped into $R^{(W H / 256) \times 512}$ . After that, the reshaped $C \in R^{(W H / 256) \times 512}$ is multiplied by the transpose of the reshaped $B$ , which has size $512 \times (W H / 256)$ , and the positional attention map $P \in R^{(W H / 256) \times (W H / 256)}$ is obtained by softmax. Then matrix multiplication is performed between the reshaped $D \in R^{(W H / 256) \times 4096}$ and the transpose of $P$ ; the result is multiplied by the scale factor $\alpha$ , reshaped to the original shape, and finally summed with $A$ to obtain the final output. Here, $\alpha$ is initialized to 0 and gradually learns a larger weight. The channel attention module aims to enhance the specific semantic response ability of each channel by modeling the association between channels. Its process is similar to that of the positional attention module. The difference is that the channel attention map is computed by reshaping the features and taking the matrix product between any two channel features, which gives the association strength between any two channels; the channel attention map is then obtained by the same softmax operation, and each channel map is updated with a weighted sum of all channel maps. Finally, the outputs of these two attention modules are fused to further enhance the feature representation and converted to a feature map of size $(W / 16) \times (H / 16)$ by $1 \times 1$ convolution, and then the feature map is up-sampled and convolved to obtain the final mask $S_{seg} (x)$ of the manipulated region.

Figure 5.

DA Module With Position Attention Module on Top and Channel Attention Module on Bottom. Note. DA = dual attention.
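The positional attention step described above can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the projected maps `B`, `C`, and `D` are taken as ready-made inputs (the $1 \times 1$ convolutions are omitted), each map is stored as a list of rows with one row per spatial position, and the row-wise softmax convention is an assumption borrowed from the general dual attention design of Fu et al. (2019). The function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def position_attention(A, B, C, D, alpha):
    """Sketch of positional attention on flattened feature maps.

    A, D: N x C_full lists of rows; B, C: N x C_reduced lists of rows,
    where N is the number of spatial positions (assumed layout).
    Returns the attention map P (N x N) and the output alpha * (P D) + A.
    """
    n = len(A)
    # Similarity of every position j to every position i, normalised per row.
    P = [
        softmax([sum(c * b for c, b in zip(C[j], B[i])) for i in range(n)])
        for j in range(n)
    ]
    out = []
    for j in range(n):
        # Attention-weighted sum of the rows of D for position j.
        agg = [sum(P[j][i] * D[i][k] for i in range(n)) for k in range(len(D[0]))]
        # Scaled residual connection back to the input feature A.
        out.append([alpha * g + a for g, a in zip(agg, A[j])])
    return P, out
```

With `alpha = 0`, the value used at initialization, the output equals the input `A`, so the attention term only starts to contribute as the scale factor is learned.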

3.4. Loss Function

The MEIEB provides powerful supervised signals through multi-scale edge information, while the IVISB enhances the detection performance by learning subtle and generic manipulation features in the image. Since the image context information and edge information can enhance each other, we jointly optimize the parameters of the MEIEB, the IVISB, and the DA fusion module in order to fully exploit the potential complementary relationship between them. We consider four different scales of losses, each with its own specific objective: a pixel-scale loss is employed to enhance the sensitivity of the model to pixel-level manipulation detection, an edge information loss is utilized to strengthen crucial information at the manipulation edges, an irrelevant visual information suppression loss is designed to learn subtle and generic manipulation features in the image and an image-scale loss is used to improve the specificity of the model for image-level manipulation detection. The optimized loss function can be defined as follows:

$$Loss = α \cdot {loss}_{seg} + β \cdot {loss}_{clf} + (1 - α - β) \cdot {loss}_{edge} + γ \cdot {loss}_{ivis} \quad (2)$$

where $α, β \in [0, 1]$ are trade-off parameters to balance the contributions of pixel-scale loss, edge information loss, and image-scale loss. In addition, we separately give a weighting factor $γ$ to the irrelevant visual information suppression loss for learning weak generic manipulation features of the image. $α$, $β$, and $γ$ are empirically set to 0.16, 0.04, and 10 in our experiments, respectively. $Loss$ denotes the total loss function of the ISIE-Net network, and ${loss}_{seg}$, ${loss}_{edge}$, ${loss}_{ivis}$, and ${loss}_{clf}$ denote the pixel-level loss of tampered region prediction, the loss of edge information prediction, the loss of irrelevant visual information suppression, and the image-scale loss, respectively.
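As a quick illustration of Equation (2), the following Python sketch combines four already-computed loss values using the weights reported in the text. The function name and argument order are hypothetical; only the formula and the default values 0.16, 0.04, and 10 come from the paper.

```python
def total_loss(loss_seg, loss_clf, loss_edge, loss_ivis,
               alpha=0.16, beta=0.04, gamma=10.0):
    """Weighted sum of the four losses following Equation (2)."""
    # The edge loss receives whatever weight is left after alpha and beta.
    edge_weight = 1.0 - alpha - beta
    return (alpha * loss_seg
            + beta * loss_clf
            + edge_weight * loss_edge
            + gamma * loss_ivis)
```

With the default settings the edge loss weight is $1 - 0.16 - 0.04 = 0.80$, so the edge term dominates the three trade-off terms, while the suppression loss is scaled separately by $γ = 10$.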

3.4.1. Pixel Level Loss

Due to the imbalance in the ratio of tampered edge pixels to other pixels, pixel level loss employs dice loss as a loss function, as it is very effective for learning from extremely imbalanced data (Wei et al., 2021). Its definition can be given by the following equation:

$${loss}_{seg} (x) = 1 - \frac{2 \sum_{i, j} S_{seg} (x_{i, j}) \cdot y_{i, j}}{\sum_{i, j} S_{seg}^{2} (x_{i, j}) + \sum_{i, j} y_{i, j}^{2}} \quad (3)$$

where $y_{i, j} \in \{0, 1\}$ is a binary label indicating whether the pixel $(i, j)$ is manipulated or not.
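A minimal Python sketch of the Dice loss in Equation (3) is given below, operating on nested lists for readability. The small `eps` term in the denominator is an assumption added here to avoid division by zero on empty masks; it is not part of the equation, and the function name is illustrative.

```python
def dice_loss(pred, target, eps=1e-8):
    """Dice loss between a predicted mask and a binary label mask.

    pred:   H x W nested list of predicted probabilities S_seg(x_ij).
    target: H x W nested list of binary labels y_ij.
    eps:    assumed stabiliser, not present in Equation (3).
    """
    intersection = sum(p * t
                       for pred_row, target_row in zip(pred, target)
                       for p, t in zip(pred_row, target_row))
    pred_sq = sum(p * p for row in pred for p in row)
    target_sq = sum(t * t for row in target for t in row)
    return 1.0 - 2.0 * intersection / (pred_sq + target_sq + eps)
```

A perfect prediction gives a loss close to 0, and a prediction with no overlap with the label gives a loss of 1, regardless of how few pixels are manipulated.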

3.4.2. Edge Loss

Since edge pixels are overwhelmed by non-edge pixels, we again use the Dice loss for manipulation edge detection, denoted as ${loss}_{edge}$. Following a previous study (Chen et al., 2021), because manipulation edge detection is an auxiliary task, we do not compute ${loss}_{edge}$ at the full size of $W \times H$. Instead, the loss is computed at the smaller size of $(W / 4) \times (H / 4)$ (see Figure 2). This strategy reduces the computational cost during training while slightly improving the performance.

3.4.3. Image-Scale Loss

In order to reduce false alarms, real images must be considered during the training phase. As shown in Figure 2, the $G_{clf}$ module (Dong et al., 2022) is responsible for converting the pixel-level segmentation map $S_{seg} (x)$ into the image-level prediction $C (x)$ . We use the image-scale binary cross-entropy loss, which is computed as follows:

$${loss}_{clf} (x) = - (y \cdot \log C (x) + (1 - y) \cdot \log (1 - C (x))) \quad (4)$$

where $y = \max (\{y_{i}\})$.
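The image-scale loss in Equation (4) can be sketched in Python as follows. The image-level prediction $C (x)$ is taken as a given number here, since the internals of $G_{clf}$ are not described in this section; the clamping constant `eps` and both function names are assumptions made for this illustration.

```python
import math


def image_label(mask):
    """Image-level label y: 1 if any pixel label in the mask is 1, else 0."""
    return max(v for row in mask for v in row)


def image_bce(y, c, eps=1e-7):
    """Binary cross-entropy between image label y and prediction c = C(x).

    eps clamps c away from 0 and 1 so the logarithms stay finite
    (an assumed safeguard, not part of Equation (4)).
    """
    c = min(max(c, eps), 1.0 - eps)
    return -(y * math.log(c) + (1 - y) * math.log(1.0 - c))
```

For an authentic image (`y = 0`), a high prediction is penalized more than a low one, which is how this term discourages false alarms.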

3.4.4. Irrelevant Visual Information Suppression Loss

As shown in Figure 2, the irrelevant visual information suppression view is obtained by subtracting a set of feature maps generated by a resolution down block from those reconstructed by a resolution up block. Our goal is to reduce the difference between these two feature maps, $D_{2}$ and $U_{2}$ , through a supervised learning strategy. To achieve this objective, we use the mean squared error loss function to calculate the loss between the obtained irrelevant visual information suppression view and a zero feature map with the same shape. The formula for the calculation is as follows:

$${loss}_{ivis} (y, y') = \frac{1}{N} \sum_{i = 1}^{N} (y_{i} - y_{i}')^{2} \quad (5)$$

where $y$ denotes the true value, $y'$ denotes the predicted value of the model, $N$ is the number of samples, $y_{i}$ denotes the true target value of the $i$th sample, and $y_{i}'$ denotes the predicted result of the model for the $i$th sample.
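Because the target in Equation (5) is a zero feature map, the loss reduces to the mean of the squared entries of the suppression view. The Python sketch below illustrates this on a single 2-D map; the function names are hypothetical, and treating the view as the element-wise difference between the up-block and down-block outputs follows the description above rather than the released code.

```python
def suppression_view(down, up):
    """Element-wise difference: up-block output minus down-block output."""
    return [[u - d for d, u in zip(down_row, up_row)]
            for down_row, up_row in zip(down, up)]


def ivis_loss(view):
    """Mean squared error between the view and an all-zero map of equal shape."""
    values = [v for row in view for v in row]
    return sum(v * v for v in values) / len(values)
```

The loss is zero only when the reconstructed features match the down-sampled ones exactly, so minimizing it pushes the two feature maps toward each other.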

4. Experiments

4.1. Datasets

To evaluate the performance of ISIE-Net, we conducted experiments on public image manipulation localization datasets, namely CASIA (Dong et al., 2013), COVER (Wen et al., 2016), Columbia (Hsu & Chang, 2006), and NIST16 (Guan et al., 2019). The composition of manipulation types in each dataset is shown in Table 1. CASIA consists of two components (CASIAv1 and CASIAv2), both providing ground-truth masks. CASIA provides images with splicing and copy-move manipulations, as well as images affected by post-processing operations such as rotation, blur, and distortion. COVER is a small-scale copy-move manipulation dataset comprising 100 manipulated samples along with their corresponding ground-truth masks. Columbia is a small dataset generated by splicing arbitrary regions into an image. NIST16 is a challenging dataset containing three manipulation types: copy-move, splicing, and removal. Manipulation traces in NIST16 are obscured by some post-processing operations.

Table 1.
Details of Training and Test Images for the Five Datasets Used in Our Experiments.

Split	Dataset	Au	Tampered	Copy-move	Splicing	Removal
Train	CASIAv2	7,491	5,063	3,195	1,828	0
Test	CASIAv1	800	920	209	461	0
	Columbia	183	180	0	180	0
	COVER	100	100	100	0	0
	NIST16	0	563	68	288	208

4.2. Experimental Setup

4.2.1. Implementation Details

The proposed ISIE-Net is implemented in PyTorch and trained with the Adam optimizer; the learning rate is periodically decayed from $10^{-4}$ to $10^{-6}$, and the batch size is 12. All training runs on an NVIDIA GeForce RTX 3090 with 24 GB of memory. We chose ResNet-50, pre-trained on the ImageNet dataset, as the backbone of ISIE-Net. The input images were uniformly set to $512 \times 512$ and then augmented with regular data augmentation, including flipping, cropping, scaling, Gaussian blurring, and JPEG compression. In many studies, it is common to train the model on other large datasets and then test it on the aforementioned datasets, or to use one part of each dataset for training and the other part for testing. However, we believe that this comparison method is not very reasonable, as it fails to adequately demonstrate the generalizability of the model in the face of unknown data. Therefore, we adopt the same evaluation method as Chen et al. (2021): we train the model only on the CASIAv2 dataset and then test it directly on the remaining datasets. This approach more directly reflects whether the model has learned how to detect image manipulation rather than just overfitting a particular dataset.

4.2.2. Evaluation Metrics

To evaluate the performance of our model, we follow previous work (Chen et al., 2021; Zhou et al., 2018, 2020) and use the pixel-level F1 score and AUC as our evaluation metrics. The F1 score and AUC are two widely used metrics for measuring per-pixel binary classification performance, and their values lie in the range $[0, 1]$. The higher the score, the better the performance. It is important to note that previous studies typically report performance using a decision threshold chosen for each test set, which allows model comparisons to be made under optimal conditions. However, this setup may lead to overly optimistic performance estimates because, in practice, the decision threshold of a model has to be specified and fixed in advance. Since it is not possible to predict the most appropriate threshold in the absence of ground-truth data, we used the midpoint value of 0.5 as the threshold to separate the positive and negative classes.

F1 score combines precision and recall. Precision is the ratio of correctly predicted positive data to the total data predicted to be positive, while recall is the ratio of correctly predicted positive data to all data marked as positive. The F1 score is calculated according to

$$Precision = \frac{TP}{TP + FP} \quad (6)$$

$$Recall = \frac{TP}{TP + FN} \quad (7)$$

$${F1}_{score} = \frac{2 \times Precision \times Recall}{Precision + Recall} \quad (8)$$

where TP, FP, and FN are the numbers of true positives, false positives, and false negatives, respectively. The AUC score is the area under the receiver operating characteristic curve, whose vertical coordinate is the true positive rate (TPR) and whose horizontal coordinate is the false positive rate (FPR). The equations for calculating TPR, FPR, and AUC are shown below:

$$FPR = \frac{FP}{FP + TN} \quad (9)$$

$$TPR = \frac{TP}{TP + FN} \quad (10)$$

$$AUC = \int_{0}^{1} TPR (FPR) \, dFPR \quad (11)$$

where TN is the true negative.
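The metrics in Equations (6) to (11) can be sketched in plain Python as follows. This is an illustrative version, not the evaluation script used in the paper: pixels are passed as flat lists, a score equal to the threshold is counted as positive, undefined ratios are returned as 0, and the AUC is computed through its equivalent pairwise-ranking form (the fraction of positive/negative pairs ranked correctly, with ties counted as one half). All of these choices and the function names are assumptions.

```python
def confusion_counts(scores, labels, threshold=0.5):
    """Return (TP, FP, FN, TN) for scores binarised at the given threshold."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f1_score(scores, labels, threshold=0.5):
    """F1 score from Equations (6)-(8) at a fixed threshold."""
    tp, fp, fn, _ = confusion_counts(scores, labels, threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def auc_score(scores, labels):
    """Area under the ROC curve via pairwise ranking of positives vs negatives."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))
```

The `ValueError` branch mirrors why AUC cannot be computed when one class is absent from the evaluated samples. The pairwise form is quadratic in the number of samples, so it is only suitable for small illustrative inputs.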

4.3. Baseline Models

For a fair comparison, we selected baseline results that follow a public evaluation protocol and fulfill one of the following two criteria: (a) results published by SOTA or representative methods, and (b) results obtained by retraining with the publicly available code of the method in question. We evaluated and compared ISIE-Net with six currently published leading methods, namely ManTra-Net¹ (Wu et al., 2019), GSR-Net² (Zhou et al., 2020), Constrained R-CNN³ (Yang et al., 2020), DenseFCN⁴ (Zhuang et al., 2021), CAT-Net⁵ (Kwon et al., 2021), and MVSS-Net⁶ (Chen et al., 2021). All models and methods either follow the same evaluation protocol or are retrained on the CASIAv2 dataset.

4.4. Comparison With SOTA Methods

In this section, the proposed ISIE-Net is compared quantitatively (at the pixel level and the image level) and qualitatively with SOTA methods. Section 4.4.1 presents the pixel-level and image-level quantitative comparison, and Section 4.4.2 presents the qualitative evaluation, which compares the visualization results of the different methods.

4.4.1. Quantitative Comparison

We first evaluate the performance of ISIE-Net on five public image manipulation localization datasets, and then compare the ISIE-Net results with the F1 and AUC values of the previous SOTA methods. Our experiments train the model only on the CASIAv2 dataset and then test it directly on the remaining datasets. If the code of a method was not available, the results reported in the corresponding reference were used. The comparison results are given in Tables 2 and 3 (some experimental results are from Chen et al., 2021).

Table 2.
AUC Comparison Results Obtained on Three Standard Datasets (NIST16 Is Excluded Because It Has No Authentic Images).

Methods	CASIAv1	COVER	Columbia
ManTra-Net	0.141	0.491	0.701
CR-CNN	0.783	0.566	0.783
GSR-Net	0.502	0.515	0.502
CAT-Net	0.604	0.553	0.697
DenseFCN	0.635	0.528	0.637
MVSS-Net	0.751	0.549	0.851
ISIE-Net(ours)	0.822	0.598	0.890

Note. AUC = area under the curve.

Table 3.

F1 Score Comparison Results Obtained on Four Standard Datasets.

Methods	CASIAv1	COVER	Columbia	NIST16
ManTra-Net	0.155	0.286	0.364	0.000
CR-CNN	0.405	0.291	0.436	0.198
GSR-Net	0.387	0.285	0.613	0.283
CAT-Net	0.205	0.129	0.298	0.173
DenseFCN	0.219	0.193	0.335	0.137
MVSS-Net	0.461	0.249	0.588	0.285
ISIE-Net(ours)	0.532	0.312	0.713	0.292

The AUC comparison results are shown in Table 2, while NIST16 is excluded due to the lack of real images. It is clear from Table 2 that our proposed model obtains the highest AUC scores of 82.2%, 59.8%, and 89.0% on the CASIAv1, COVER, and Columbia datasets, respectively, and these results clearly outperform ManTra-Net, GSR-Net, Constrained R-CNN, DenseFCN, CAT-Net, and MVSS-Net.

The results of the F1 score comparison are shown in Table 3, from which it can be seen that our proposed model achieves the highest F1 scores of 53.2%, 31.2%, 71.3%, and 29.2% on the CASIAv1, COVER, Columbia, and NIST16 datasets, respectively. ISIE-Net shows the best performance in both AUC and F1 scores, especially on the CASIAv1 and Columbia datasets, where the F1 scores are significantly improved compared to the SOTA algorithms. For example, the F1 score of ISIE-Net on the Columbia dataset is 0.713, while that of MVSS-Net is only 0.588, an absolute improvement of 12.5 percentage points. There are two key reasons why ISIE-Net outperforms previous methods. First, our proposed IVISB successfully suppresses the interference of irrelevant visual information in the image, while enhancing the ability to learn subtle and generic tampering features and critical tampering edge features. Specifically, we pay special attention to those subtle and generic manipulation features that have been neglected in previous methods, thus improving the generalization performance of ISIE-Net. Second, we introduce an MEIEB that establishes an effective link between edge features and RGB features. This branch improves the enhancement of manipulated edge information through the constraint of edge artifacts, while ensuring that the backbone network maintains high accuracy in predicting segmentation results. In contrast, other methods fail to fully utilize the details of the edge information around the tampered region.
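The F1 gains over MVSS-Net can be rechecked directly from Table 3. The short Python sketch below does this arithmetic; the dictionary and function names are illustrative, and the relative gain is an extra quantity computed here for context that the text does not report.

```python
# F1 scores copied from Table 3.
F1_MVSS_NET = {"CASIAv1": 0.461, "COVER": 0.249, "Columbia": 0.588, "NIST16": 0.285}
F1_ISIE_NET = {"CASIAv1": 0.532, "COVER": 0.312, "Columbia": 0.713, "NIST16": 0.292}


def f1_gains(ours, baseline):
    """Return {dataset: (absolute gain, relative gain)} of ours over baseline."""
    gains = {}
    for name, score in ours.items():
        diff = score - baseline[name]
        # Absolute gain is rounded to three decimals, matching the table precision.
        gains[name] = (round(diff, 3), diff / baseline[name])
    return gains


if __name__ == "__main__":
    for dataset, (absolute, relative) in f1_gains(F1_ISIE_NET, F1_MVSS_NET).items():
        print(f"{dataset}: +{absolute:.3f} absolute, +{relative:.1%} relative")
```

The absolute differences for CASIAv1, COVER, and Columbia are 0.071, 0.063, and 0.125, which correspond to the 7.1%, 6.3%, and 12.5% figures quoted for these datasets, so those figures are absolute differences in F1 rather than relative improvements.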

4.4.2. Qualitative Comparison

To further demonstrate the effectiveness of ISIE-Net, we visualize the results. The visualization results of manipulation detection and localization are shown in Figure 6, which clearly demonstrates the advantages of ISIE-Net over the baseline approaches. Specifically, regardless of which manipulation operation is used, ISIE-Net generates segmentation results that are very close to the ground truth, and mis-segmentation is rare. In contrast, ManTra-Net, GSR-Net, CAT-Net, and DenseFCN produce unsatisfactory results with a large number of mis-segmented regions. In addition, ManTra-Net, CAT-Net, DenseFCN, and CR-CNN perform poorly in boundary detection and are far less accurate than ISIE-Net, as they ignore the importance of edge artifact information. MVSS-Net and ISIE-Net, by comparison, employ edge supervision branches to enhance detection and localization performance by learning edge features. However, in the third row (copy-move), the ground truth shows that both the Arctic fox and its shadow in the water are tampered regions. ISIE-Net detects these regions well and produces very clear boundaries, whereas MVSS-Net also finds the Arctic fox and the shadow in the water but produces relatively fuzzy boundaries. Similarly, for the example in the first row (splicing), ISIE-Net obtains more accurate prediction results. Therefore, ISIE-Net can locate the tampered region more accurately and delineate the edges more finely, and its results are closer to the ground-truth labels.

Figure 6.

Qualitative results of different manipulation localization algorithms. The first column shows the manipulated images from the CASIA dataset. The second column shows the ground truth. The third to eighth columns show the final manipulation segmentation predictions of ManTra-Net, GSR-Net, CR-CNN, DenseFCN, CAT-Net, and MVSS-Net, respectively. The last column shows the results of our proposed ISIE-Net. Note that the images in the first two rows are manipulated by splicing, those in the third and fourth rows by copy-move, and those in the other rows by removal.

4.5. Ablation Studies

In the ablation study, all models were trained on the CASIAv2 dataset. The results of the ablation experiments are shown in Table 4. As mentioned earlier, ISIE-Net consists of three modules: the MEIEB, the IVISB, and the DA fusion module. To better illustrate the effect of each module in the model, we gradually added each component to ResNet-50 for training and compared the results. The model variants used in the ablation experiments are as follows:

SB: Use ResNet-50 as the base model and add the DA module for feature fusion.

DB: Use ResNet-50 to build a dual-branch network and add the DA module for feature fusion.

DB+IVISB: Add our proposed IVISB to the dual-branch network and add the DA module for feature fusion.

DB+MVSS-Net(NSB): Replace our IVISB with the noise branch of MVSS-Net (Chen et al., 2021) and add the DA module for feature fusion.

DB+MEIEB: Add our proposed MEIEB to the dual-branch network and add the DA module for feature fusion.

DB+MVSS-Net(ESB): Replace our MEIEB with the edge branch of MVSS-Net (Chen et al., 2021) and add the DA module for feature fusion.

DB+IVISB+MEIEB: Add our proposed IVISB and MEIEB to the dual-branch network and add the DA module for feature fusion, which constitutes our proposed final model.

Table 4.
Ablation Experiment Results.

	CASIAv1		Columbia		COVER
Models	F1	AUC	F1	AUC	F1	AUC
SB	0.363	0.714	0.387	0.711	0.149	0.525
DB	0.387	0.719	0.446	0.760	0.218	0.550
DB+MVSS-Net(NSB)	0.430	0.769	0.521	0.791	0.247	0.547
DB+IVISB	0.465	0.797	0.597	0.861	0.255	0.578
DB+MVSS-Net(ESB)	0.430	0.765	0.571	0.707	0.249	0.521
DB+MEIEB	0.505	0.752	0.666	0.767	0.290	0.560
DB+IVISB+MEIEB	0.532	0.822	0.713	0.890	0.312	0.598

Note. AUC = area under the curve.

Comparing SB and DB shows that the dual-branch network captures more detailed information and facilitates manipulation detection. The comparison of DB, DB+IVISB, and DB+MVSS-Net(NSB) confirms the limited effectiveness of the noise view that we mentioned in the introductory section, and also highlights the effectiveness of our proposed IVISB. Since DB+IVISB is obtained by adding the IVISB to DB, its better performance validates the effectiveness of the IVISB in improving the detection of pixel-level and image-level manipulations. The overall performance of DB+MVSS-Net(NSB) is lower than that of DB+IVISB, which demonstrates the advantage of the proposed IVISB over the existing technique.

The comparison between DB and DB+MEIEB reveals that the MEIEB significantly contributes to the overall detection performance. The results of DB+MEIEB compared to DB+MVSS-Net(ESB) further confirm the importance of the edge attention module in the MEIEB. The introduction of this module establishes a close connection between edge features and RGB features, more accurately mining manipulation location information. The incorporation of the edge attention module not only significantly improves edge detection performance but also enhances the performance of the backbone network in segmentation tasks.

5. Conclusions

In this paper, we introduce a novel approach based on enhancing critical information and suppressing irrelevant visual information to address the problem of image manipulation detection. Specifically, we propose a new supervised deep learning model called ISIE-Net, designed to detect tampered regions in digital images and predict manipulation masks. We discuss in detail the design and implementation of the network architecture, along with the convolutional neural network layers used therein, to learn a universal representation for image manipulation detection and forgery localization. By designing a branch based on edge information enhancement to capture enhanced edge artifacts and a branch based on suppressing irrelevant visual information to capture more subtle and generic manipulation cues in images, and by using a DA module to fuse the MEIEB with the IVISB, the network is able to fully exploit the differences in cues between tampered and untampered regions.

In comparison to previous methods, our model captures manipulation cues from both enhancement and suppression aspects, as the loss of critical edge artifact information can disrupt the results of manipulation detection, which can be supplemented by edge enhancement. Therefore, this method is more versatile and effective for complex image forgery. Additionally, an abundance of semantic information can interfere with tampering detection results, which can be effectively addressed by the suppression branch of irrelevant visual information. Furthermore, the introduced edge attention module optimizes the enhancement task of manipulated edge information and the main segmentation task, effectively improving the overall network performance. Experimental results demonstrate that our method achieves the best quantitative and qualitative results on all five standard datasets, and extensive ablation experiments also confirm the significant advantage of obtaining universal manipulation features by strengthening key information enhancement and suppressing irrelevant visual information. Moreover, our method performs well on images without reference evaluation and on images subjected to post-processing, which is one of the most important challenges in detecting small manipulated areas and unclear images. Additionally, another advantage of this method is its ability to simultaneously detect multiple tampered regions.

Based on the methods and results of this study, it is evident that our proposed approach of enhancing key information and suppressing irrelevant visual information is effective in tampering detection and localization, further advancing the practical application of image forensics. Since our proposed model has relatively few parameters, it can be trained end-to-end using large-scale datasets in a lightweight manner. Therefore, an important direction for future research is to establish large-scale, high-quality dataset benchmarks containing known instances of image forgeries and continuously incorporate additional forged data for model training. This will enable the model to maintain high discriminative power when faced with various novel manipulation operations in internet scenarios. Additionally, the accuracy and precision of image manipulation detection network models will also be enhanced. Furthermore, exploring various other network architectures can further accelerate the proposed methods. Future work involves exploring concepts such as feature fusion techniques and attention modules to further enhance the performance of the proposed approach. Additionally, investigating the vulnerability of the proposed network to adversarial attacks is another important task for future consideration.

Footnotes

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the National Natural Science Foundation of China under grant no. 62376017, and Fundamental Research Funds for the Central Universities (grant no. BUCTRC202221).

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Notes

References

Amerini

Ballan

Caldelli

Del Bimbo

Del Tongo

Serra

(2013). Copy-move forgery detection and localization by means of robust clustering with J-linkage. Signal Processing: Image Communication, 28(6), 659–669. https://doi.org/10.1016/j.image.2013.03.006

Bappy

J. H.

Roy-Chowdhury

A. K.

Bunk

Nataraj

Manjunath

(2017). Exploiting spatial structure for localizing manipulated image regions. In 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy (pp. 4980–4989). https://doi.org/10.1109/ICCV.2017.532

Bappy

J. H.

Simons

Nataraj

Manjunath

Roy-Chowdhury

A. K.

(2019). Hybrid LSTM and encoder–decoder architecture for detection of image forgeries. IEEE Transactions on Image Processing, 28(7), 3286–3300.

Chen

Dong

Cao

(2021). Image manipulation detection by multi-view multi-scale supervision. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada (pp. 14165–14173). https://doi.org/10.1109/ICCV48922.2021.01392

Chen

L.-C.

Papandreou

Kokkinos

Murphy

Yuille

A. L.

(2017). DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4), 834–848. https://doi.org/10.1109/TPAMI.2017.2699184

Dai

Chen

(2023). DS-Net: Dual supervision neural network for image manipulation localization. IET Image Processing, 17(12), 3551–3563.

Dong

Chen

Cao

(2022). MVSS-Net: Multi-view multi-scale supervised networks for image manipulation detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3), 3539–3553.

Dong

Wang

Tan

(2013). CASIA image tampering detection evaluation database. In 2013 IEEE China summit and international conference on signal and information processing (pp. 422–426). IEEE.

Dong

Wang

Tan

Shi

Y. Q.

(2009). Run-length and edge statistics based approach for image splicing detection. In Digital watermarking: 7th international workshop, IWDW 2008, Busan, Korea, 10–12 November 2008. Selected papers 7 (pp. 76–87). Springer.

10.

Fridrich

Kodovsky

(2012). Rich models for steganalysis of digital images. IEEE Transactions on information Forensics and Security, 7(3), 868–882. https://doi.org/10.1109/tifs.2012.2190402

11.

Liu

Tian

Bao

Fang

(2019). Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 3146–3154).

12.

Guan

Kozak

Robertson

Lee

Yates

A. N.

Delgado

Zhou

Kheyrkhah

Smith

Fiscus

(2019). MFC datasets: Large-scale benchmark datasets for media forensic challenge evaluation. In 2019 IEEE winter applications of computer vision workshops (WACVW) (pp. 63–72). IEEE.

13.

Guo

Yang

Chen

Sun

(2021). Fake face detection via adaptive manipulation traces extraction network. Computer Vision and Image Understanding, 204: 103170. https://doi.org/10.1016/j.cviu.2021.103170

14.

Zhang

Ren

Sun

(2016). Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA (pp. 770–778). https://doi.org/10.1109/CVPR.2016.90

15.

Hsu

Chang

(2006). Detecting image splicing using geometry invariants and camera characteristics consistency. In International Conference on Multimedia and Expo (ICME), Toronto, Canada, July 2006.

16.

Zhang

Jiang

Chaudhuri

Yang

Nevatia

(2020). SPAN: Spatial pyramid attention network for image manipulation localization. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020, proceedings, Part XXI 16 (pp. 312–328). Springer.

17.

Kwon

M.-J.

I.-J.

Nam

S.-H.

Lee

H.-K.

(2021). CAT-Net: Compression artifact tracing network for detection and localization of image splicing. In 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA (pp. 375–384) https://doi.org/10.1109/WACV48630.2021.00042

18.

Lin

Wang

Deng

Bai

Chen

Tang

(2023). Image manipulation detection by multiple tampering traces and edge artifact enhancement. Pattern Recognition, 133, 109026. https://doi.org/10.1016/j.patcog.2022.109026

19.

Liu

Huang

Yin

Chen

(2021). Offline signature verification using a region based deep metric learning network. Pattern Recognition, 118, 108009. https://doi.org/10.1016/j.patcog.2021.108009

20.

Liu

Chen

Liu

(2022). PSCC-Net: Progressive spatio-channel correlation network for image manipulation detection and localization. IEEE Transactions on Circuits and Systems for Video Technology, 32(11), 7505–7517. https://doi.org/10.1109/TCSVT.2022.3189545

21.

Luo

Huang

Qiu

(2006). Robust detection of region-duplication forgery in digital image. In: 18th international conference on pattern recognition (ICPR’06) (Vol. 4, pp. 746–749). IEEE.

22.

Luo

Yeo

C. K.

(2021). BCMM: A novel post-based augmentation representation for early rumour detection on social media. Pattern Recognition, 113, 107818. https://doi.org/10.1016/j.patcog.2021.107818

23.

Luo

Huang

Qiu

(2007). A novel method for detecting cropped and recompressed image block. In 2007 IEEE international conference on acoustics, speech and signal processing—ICASSP’07 (Vol. 2, pp. II–217). IEEE.

24.

Yang

Huang

(2021). Boundary guided context aggregation for semantic segmentation. arXiv preprint. https://doi.org/10.48550/arXiv.2110.14587

25.

Shi

Chang

Chen

Zhang

(2022). PR-NET: Progressively-refined neural network for image manipulation localization. International Journal of Intelligent Systems, 37(5), 3166–3188. https://doi.org/10.1002/int.22822

26.

Song

Zhao

Fang

Lin

(2019). Discriminative representation combinations for accurate face spoofing detection. Pattern Recognition, 85, 220–231. https://doi.org/10.1016/j.patcog.2018.08.019

27.

Wei

Zhang

Gong

Chen

Ding

et al. (2021). Learn to segment retinal lesions and beyond. In 2020 25th international conference on pattern recognition (ICPR) (pp. 7403–7410). IEEE.

28.

Wen

Zhu

Subramanian

T.-T.

Shen

Winkler

(2016). COVERAGE: A novel database for copy-move forgery detection. In 2016 IEEE international conference on image processing (ICIP) (pp. 161–165). IEEE.

29.

Abd-Almageed

Natarajan

(2018). Busternet: Detecting copy-move image forgery with source/target localization. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds) Computer Vision – ECCV 2018. ECCV 2018. Lecture Notes in Computer Science (vol. 11210). Springer, Cham. https://doi.org/10.1007/978-3-030-01231-1_11

30.

Abd-Almageed

Natarajan

(2019). Mantra-net: Manipulation tracing network for detection and localization of image forgeries with anomalous features. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9543–9552).

31.

Yang

Lin

Jiang

Zhao

(2020). Constrained R-CNN: A general image manipulation detection model. 2020 IEEE International Conference on Multimedia and Expo (ICME), London, UK (pp. 1–6). https://doi.org/10.1109/ICME46284.2020.9102825

32.

Yang

Zhou

Zhang

Sun

Wang

(2022). Multi-view correlation distillation for incremental object detection. Pattern Recognition, 131, 108863. https://doi.org/10.1016/j.patcog.2022.108863

33.

Wang

(2022). Weakly-supervised semantic segmentation with superpixel guided local and global consistency. Pattern Recognition, 124, 108504. https://doi.org/10.1016/j.patcog.2021.108504

34.

Wang

Peng

Gao

Sang

(2018). Learning a discriminative feature network for semantic segmentation. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA (pp. 1857–1866). https://doi.org/10.1109/CVPR.2018.00199

35.

Zampoglou

Papadopoulos

Kompatsiaris

(2017). Large-scale evaluation of splicing localization algorithms for web images. Multimedia Tools and Applications, 76(4), 4801–4834. https://doi.org/10.1007/s11042-016-3795-2

36. Zhong, J.-L., Gan, Y.-F., Vong, C.-M., Yang, J.-X., Zhao, J.-H., Luo, J.-H. (2022). Effective and efficient pixel-level detection for diverse video copy-move forgery types. Pattern Recognition, 122, 108286. https://doi.org/10.1016/j.patcog.2021.108286

37. Zhou, Chen, B.-C., Han, Najibi, Shrivastava, Lim, S.-N., Davis (2020). Generate, segment, and refine: Towards generic manipulation segmentation. In 34th AAAI Conference on Artificial Intelligence: AAAI-20, New York, USA, 7–12 February 2020 (Vol. 34, pp. 13058–13065).

38. Zhou, Han, Morariu, V. I., Davis, L. S. (2018). Learning rich features for image manipulation detection. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA (pp. 1053–1061). https://doi.org/10.1109/CVPR.2018.00116

39. Zhuang, Tan, Huang (2021). Image tampering localization using a dense fully convolutional network. IEEE Transactions on Information Forensics and Security, 16, 2986–2999. https://doi.org/10.1109/TIFS.2021.3070444

	CASIAv1		Columbia		COVER
Models	F1	AUC	F1	AUC	F1	AUC
SB	0.363	0.714	0.387	0.711	0.149	0.525
DB	0.387	0.719	0.446	0.760	0.218	0.550
DB+MVSS-Net(NSB)	0.430	0.769	0.521	0.791	0.247	0.547
DB+IVISB	0.465	0.797	0.597	0.861	0.255	0.578
DB+MVSS-Net(ESB)	0.430	0.765	0.571	0.707	0.249	0.521
DB+MEIEB	0.505	0.752	0.666	0.767	0.290	0.560
DB+IVISB+MEIEB	0.532	0.822	0.713	0.890	0.312	0.598

Image Manipulation Detection Based on Irrelevant Information Suppression and Critical Information Enhancement


Table 1. Details of Training and Test Images for the Five Datasets Used in Our Experiments.

Split	Dataset	Au	Tampered	Copy-move	Splicing	Removal
Train	CASIAv2	7,491	5,063	3,195	1,828	0
Test	CASIAv1	800	920	209	461	0
Test	Columbia	183	180	0	180	0
Test	COVER	100	100	100	0	0
Test	NIST16	0	563	68	288	208