Abstract
Camouflaged object detection (COD) is an emerging research direction in computer vision that aims to segment objects visually integrated with their background; it is a valuable task that has attracted increasing interest from researchers. Because camouflaged objects blend into their surroundings, their boundaries are highly blurred, and segmenting object edges accurately and completely is therefore an important issue in COD. To address this issue, in this article we propose a novel multi-level edge-enhanced fusion network (ME) for camouflaged object detection.
Introduction
There are many camouflaged objects in nature: objects that are highly similar to the background or visually blend into the environment. Camouflage is an important hiding mechanism that protects these organisms from detection by predators. Camouflaged object detection (COD), which aims to detect camouflaged objects in a visual scene and segment them from the background, has attracted the interest of many researchers in recent years. It enables many valuable applications across fields, such as computer vision (search and rescue, rare species discovery), medicine (polyp segmentation (Fan et al., 2020a), lung infection segmentation (Fan et al., 2020b), retinal image segmentation), agriculture (disaster detection (Amit et al., 2016), locust detection (Chudzik et al., 2020; Yi et al., 2019)), content creation (recreational art (Chu et al., 2010)), and so on. However, COD remains a very challenging task due to the high similarity between camouflaged objects and their environment.
To address these challenges, deep learning has shown strong potential in recent years. Fan et al. (2020) collected COD10K, the first large-scale dataset for COD, and proposed SINet, which simulates the hunting process of predators by using search and identification modules to locate and recognize camouflaged objects; this made it possible to learn camouflaged object segmentation from data with deep learning algorithms. Mei et al. (2021) proposed a distraction mining strategy and used it to construct PFNet (positioning and focus network), which simulates the predation process in nature. Lv et al. (2021) proposed a model that simultaneously localizes, segments, and ranks camouflaged objects, in which the ranking module ranks objects by their degree of camouflage.
Although many recently proposed approaches have made significant progress on this task, many questions remain to be explored and addressed. Because the edges of camouflaged objects may be missing or highly integrated with the environment, existing methods often fail to segment camouflaged objects completely and cannot adequately overcome the edge-blurring problem.
To this end, we propose a novel multi-level edge-enhanced fusion network (ME) for camouflaged object detection. We design an RTEM (residual texture enhanced module), which uses multi-scale receptive field expansion and residual feature fusion to obtain better feature representations. We also design an EEM (edge extraction module) and a BGFM (boundary-guided fusion module): the EEM captures edge semantics through a local channel attention mechanism, and the BGFM promotes the fusion of features at different levels with edge features to optimize the prediction. Extensive experiments on three challenging benchmark datasets show that ME outperforms state-of-the-art methods.
Related Work
Early work (Bhajantri & Nagabhushan, 2006; Ge et al., 2018; Pan et al., 2011) mainly relied on handcrafted features such as texture, convexity, and color boundaries to distinguish foreground from background, which was subject to many limitations.
In recent years, with the booming development of deep learning, researchers have proposed many effective COD methods and achieved good results. Fan et al. (2020) proposed SINet (search identification network), which mimics a hunter's hunting process by using a search module and an identification module to locate and identify camouflaged targets; they improved SINet in subsequent work to propose SINetV2 (Fan et al., 2021), which became an important baseline model in COD. Fan et al. (2020a) proposed PraNet (parallel reverse attention network), which first predicts rough regions and then refines boundaries. Yan et al. (2021) observed that flipped images can help detect camouflaged targets and proposed MirrorNet, which takes the original image and the flipped image as input. Li et al. (2021) proposed an adversarial network combining SOD (salient object detection) and COD, enhancing both tasks using their contradictory information. Yang et al. (2021) proposed UGTR (uncertainty-guided transformer reasoning), which first learns the conditional distribution of the backbone outputs to obtain initial estimates and associated uncertainties, and then reasons about these uncertain regions through an attention mechanism to produce final predictions. Sun et al. (2021) proposed C2FNet, which aggregates cross-level features with an attention-induced cross-level fusion module.
Edge-Focused COD Methods
In the COD task, segmentation of the boundary is critical: because camouflaged objects are visually embedded in the background, the boundary between object and background is unclear, and edge semantics provide useful constraints to guide feature extraction during detection. Zhai et al. (2021) therefore proposed the mutual graph learning (MGL) model, which decouples an image into two task-specific feature maps, one for roughly locating the target and the other for accurately capturing its boundary details, and fully exploits their mutual benefits by recurrently reasoning about their high-order relations through graphs. Ji et al. (2022) proposed the edge-based reversible re-calibration network (ERRNet) with two modules, selective edge aggregation (SEA) and the reversible re-calibration unit (RRU), to simulate visual perception behavior, obtain effective edge priors, and cross-compare potential camouflage regions with the background, achieving advanced results at high inference frame rates. Zhu et al. (2022) proposed BSANet (boundary-guided separated attention network), which utilizes two stream-separated attention modules to highlight the separation between the background and foreground of an image (this separation is also known as the boundary); the two streams are followed by a boundary-guided module (BGM) that combines them to enhance the understanding of the boundary and further improve COD accuracy. Zhou et al. (2022) proposed FAPNet (feature aggregation and propagation network) for COD and designed a BGM to explicitly model boundary features, providing boundary cues that improve COD performance. To capture the scale changes of camouflaged objects, they also designed a multi-scale feature aggregation module (MFAM) to characterize the multi-scale information of each layer and obtain aggregated feature representations. He et al. (2023a) designed the ordinary differential equation (ODE)-inspired edge reconstruction (OER) module, which reconstructs accurate and complete edge prediction maps using a high-order ODE solver, specifically the second-order Runge-Kutta method; incorporating this auxiliary task with the COD task facilitates precise segmentation results with accurate object boundaries. He et al. (2023b) proposed an edge-guided separated calibration (ESC) module, which leverages edge features to adaptively guide segmentation and reinforces feature-level edge information to sharpen the edges of segmentation results.
In contrast to the above work, we propose a new scheme to achieve more accurate camouflaged object localization and edge segmentation. Its clear difference and advantage is that the features we introduce at the decoder stage are finely extracted, rich prior features, and we show experimentally that this leads to better results.
Method
Overall Architecture
The overall architecture of our proposed ME is shown in Figure 1.

Overall network structure of the proposed model.
Inspired by the human visual system, Liu et al. (2018) designed the receptive field block (RFB) to enhance feature discriminability and robustness. Fan et al. (2021) added a branch with a larger dilation rate to expand the receptive field of the RFB and further designed a TEM that uses two asymmetric convolution layers instead of a standard convolution, achieving better results. Because the backbone network relies on stacked convolutional operations, its features lack rich context information, which is not conducive to segmenting camouflaged objects. Inspired by the Res2Net module (Gao et al., 2019), we design an RTEM based on the TEM to refine the backbone features and thus provide more effective prior information for the subsequent fusion. Specifically, as shown in Figure 2, the RTEM contains four sub-branches.

The detailed architecture of the proposed residual texture enhanced module (RTEM).
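As a concrete reference, the following is a minimal PyTorch sketch of an RTEM along these lines: four parallel branches with asymmetric convolutions and growing dilation rates expand the receptive field, and a residual shortcut fuses the multi-scale result with the input features. The kernel sizes, dilation rates, and channel settings are assumptions modeled on the TEM design, not the paper's exact configuration.

import torch
import torch.nn as nn

class ConvBNReLU(nn.Module):
    """Convolution followed by batch normalization and ReLU."""
    def __init__(self, in_ch, out_ch, kernel_size, padding=0, dilation=1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding,
                      dilation=dilation, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class RTEM(nn.Module):
    """Sketch of a residual texture enhanced module: four parallel
    branches with asymmetric convolutions and growing dilation rates,
    plus a residual shortcut (hypothetical configuration)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branches = nn.ModuleList()
        for i, d in enumerate((1, 3, 5, 7)):  # assumed dilation rates
            layers = [ConvBNReLU(in_ch, out_ch, 1)]
            if i > 0:  # asymmetric 1xk / kx1 convolutions, as in the TEM
                k = 2 * i + 1
                layers += [
                    ConvBNReLU(out_ch, out_ch, (1, k), padding=(0, i)),
                    ConvBNReLU(out_ch, out_ch, (k, 1), padding=(i, 0)),
                ]
            layers.append(ConvBNReLU(out_ch, out_ch, 3, padding=d, dilation=d))
            self.branches.append(nn.Sequential(*layers))
        self.fuse = ConvBNReLU(4 * out_ch, out_ch, 3, padding=1)
        self.shortcut = ConvBNReLU(in_ch, out_ch, 1)  # residual path

    def forward(self, x):
        feats = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.fuse(feats) + self.shortcut(x)  # residual feature fusion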
Several previous works (Ding et al., 2019; Ji et al., 2022; Zhai et al., 2021; Zhou et al., 2022) have demonstrated that edge semantics contribute to COD and provide useful constraints during detection. In COD, the edges of camouflaged objects are embedded in the surrounding background, so locating their boundaries becomes a critical issue. Information such as object edges and contours is usually preserved in low-level features, which, however, also contain a lot of extraneous noise; in high-level features, after the deep convolutions of the backbone network, object location information is more prominent but boundaries become blurred. Therefore, to better explore edge semantic information, we combine low-level features with high-level features and design the EEM (Figure 3; the edge features it extracts are visualized in Figure 4). Specifically, we reduce the channel dimensions of the features and integrate the low-level and high-level features.
However, different feature channels often contain different semantic information. Inspired by Wang et al. (2020), we introduce a simple local channel attention mechanism to explore the critical feature channels. We first aggregate the features with global average pooling (GAP), then reshape them and apply a 1D convolution to capture local cross-channel interactions, and finally obtain the channel attention through a Sigmoid function. We multiply the attention weights with the integrated features and output the edge features.
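Concretely, this local channel attention can be sketched in PyTorch as follows, following the ECA design of Wang et al. (2020); the 1D kernel size k is an assumed hyperparameter.

import torch
import torch.nn as nn

class LocalChannelAttention(nn.Module):
    """ECA-style local channel attention: global average pooling, a 1D
    convolution over neighboring channels, and a Sigmoid produce
    per-channel weights that re-weight the input features."""
    def __init__(self, k=5):  # k: assumed local neighborhood size
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                 # x: (B, C, H, W)
        y = x.mean(dim=(2, 3))            # GAP over space -> (B, C)
        y = self.conv(y.unsqueeze(1))     # 1D conv across channels
        w = self.sigmoid(y).squeeze(1)    # (B, C) attention weights
        return x * w[:, :, None, None]    # re-weight the integrated features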

The detailed architecture of the proposed edge extraction module (EEM), which is designed to extract edge features.

Visualization of edge feature maps extracted by edge extraction module (EEM).

The detailed architecture of the proposed boundary-guided fusion module (BGFM).
To incorporate the extracted edge semantic information into the network to guide learning, and to fuse features from different levels, we design a BGFM that fuses the edge semantic information extracted by the EEM with the global context information extracted by the PPM, while also aggregating features across levels. Unlike the attention-induced cross-level fusion module (Sun et al., 2021), which considers only the fusion of cross-level features, the BGFM considers not only cross-level feature fusion but also the fusion of edge features and global features. It injects edge features into the cross-level fusion to guide the prediction.
Similarly, we use the multi-scale channel attention (MSCA) module (Dai et al., 2021) to aggregate the fused features and exploit context information to enhance detection. MSCA has a double-branch structure: one branch uses global average pooling to obtain global contexts that emphasize globally distributed large objects, while the other branch keeps the original feature size to obtain local contexts so that small objects are not ignored; the multi-scale context information is then aggregated. The attention weights produced by MSCA are multiplied with the low-resolution features, the complementary weights are multiplied with the high-resolution features, and the two results are added and passed through a convolution layer.
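A minimal sketch of MSCA and the fusion step described above, following the double-branch design of Dai et al. (2021), is given below; the bottleneck ratio r and feeding the sum of the two feature maps into MSCA are assumptions.

import torch
import torch.nn as nn

class MSCA(nn.Module):
    """Sketch of multi-scale channel attention: a global branch (GAP)
    emphasizes large objects, a local branch keeps the spatial map so
    small objects are not ignored; both contexts are summed and squashed
    into attention weights."""
    def __init__(self, ch, r=4):  # r: assumed bottleneck ratio
        super().__init__()

        def bottleneck():
            return nn.Sequential(
                nn.Conv2d(ch, ch // r, 1), nn.BatchNorm2d(ch // r),
                nn.ReLU(inplace=True),
                nn.Conv2d(ch // r, ch, 1), nn.BatchNorm2d(ch),
            )

        self.local_ctx = bottleneck()                     # keeps H x W
        self.global_ctx = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                        bottleneck())     # 1 x 1 context
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        return self.sigmoid(self.local_ctx(x) + self.global_ctx(x))

def bgfm_fuse(low, high, msca, conv):
    """Fusion step as described in the text: the attention weights gate
    the low-resolution features and the complementary weights gate the
    high-resolution features (shapes are assumed to already match)."""
    w = msca(low + high)                  # attention from summed features
    return conv(low * w + high * (1.0 - w))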
The binary cross-entropy loss function is widely used in most COD methods. In our approach, there are two types of supervision: one for the camouflaged object mask and one for the object edges.
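The following is a minimal sketch of such a two-part supervision, assuming plain binary cross-entropy for both the mask and the edge terms; the resizing strategy and the edge weighting factor are illustrative assumptions.

import torch.nn.functional as F

def total_loss(pred_masks, gt_mask, pred_edge, gt_edge, edge_weight=1.0):
    """Two-part supervision: BCE on each (multi-level) mask prediction
    plus a BCE edge term; edge_weight is an illustrative factor."""
    # Each prediction is resized to the ground-truth resolution first.
    mask_loss = sum(
        F.binary_cross_entropy_with_logits(
            F.interpolate(p, size=gt_mask.shape[-2:], mode='bilinear',
                          align_corners=False),
            gt_mask)
        for p in pred_masks)
    edge_loss = F.binary_cross_entropy_with_logits(
        F.interpolate(pred_edge, size=gt_edge.shape[-2:], mode='bilinear',
                      align_corners=False),
        gt_edge)
    return mask_loss + edge_weight * edge_loss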
Implementation Details
The experimental platform runs Ubuntu 16.04.7 LTS with Python 3.8. Our network is implemented in PyTorch, and the GPU is an NVIDIA Tesla V100 32GB. We adopt a Res2Net-50 model pre-trained on ImageNet as the backbone network and resize all input images to 416 × 416.
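As an illustrative sketch of this setup, the multi-level backbone features could be extracted as follows; the use of the timm library and the res2net50_26w_4s variant name are our assumptions, since the paper does not specify its loading code.

import torch
import timm  # assumed dependency

# Res2Net-50 pre-trained on ImageNet, exposing multi-level feature maps
# that the RTEMs then refine.
backbone = timm.create_model('res2net50_26w_4s', pretrained=True,
                             features_only=True)
x = torch.randn(1, 3, 416, 416)   # inputs resized to 416 x 416
features = backbone(x)            # one feature map per backbone stage
for f in features:
    print(tuple(f.shape))         # channels and strides vary by stage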
Datasets
We evaluate our method on three public benchmark datasets, which are also the three most widely used datasets in COD: (1) CAMO (Le et al., 2019), which contains 1250 images (1000 for training and 250 for testing); (2) COD10K (Fan et al., 2020), which contains 5066 camouflaged images (3040 for training and 2026 for testing); and (3) NC4K (Lv et al., 2021), which contains 4121 images and is the largest test set. Consistent with most COD model setups, we use the CAMO and COD10K training sets to train our model, and use the CAMO test set, the COD10K test set, and NC4K for testing.
Evaluation Metrics
E-measure (Fan et al., 2021) is an enhanced alignment measure that combines local pixel values with image-level means, thus taking into account both local and global information.
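Its standard form is

$$E_\phi = \frac{1}{W \times H}\sum_{x=1}^{W}\sum_{y=1}^{H}\phi\,(x, y),$$

where $W$ and $H$ are the width and height of the prediction map and $\phi$ denotes the enhanced alignment matrix.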
S-measure (Fan et al., 2017) is a structure-based measure that considers both object-aware and region-aware structural similarity.

Weighted F-measure (Margolin et al., 2014) is an overall performance measure that jointly considers weighted precision and weighted recall.
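Their standard forms are

$$S = \alpha \cdot S_o + (1 - \alpha) \cdot S_r, \qquad F_\beta^w = \frac{(1 + \beta^2)\,\mathrm{Precision}^w \cdot \mathrm{Recall}^w}{\beta^2 \cdot \mathrm{Precision}^w + \mathrm{Recall}^w},$$

where $S_o$ and $S_r$ denote object-aware and region-aware structural similarity, $\alpha$ is typically set to 0.5, and the superscript $w$ marks the weighted versions of precision and recall.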
MAE (Perazzi et al., 2012), the mean absolute error, evaluates the average pixel-level absolute difference between the normalized prediction and the ground-truth label.
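Its standard form is

$$\mathrm{MAE} = \frac{1}{W \times H}\sum_{x=1}^{W}\sum_{y=1}^{H}\left|P(x, y) - G(x, y)\right|,$$

where $P$ is the normalized prediction and $G$ the ground-truth label.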
To demonstrate the effectiveness of our method, we compare it against 15 state-of-the-art methods, including PoolNet (Liu et al., 2019), EGNet (Zhao et al., 2019), UCNet (Zhang et al., 2020), PraNet (Fan et al., 2020a), SINet (Fan et al., 2020), C2FNet (Sun et al., 2021), PFNet (Mei et al., 2021), UGTR (Yang et al., 2021), LSR (Li et al., 2021), ERRNet (Ji et al., 2022), PreyNet (Zhang et al., 2022), BSANet (Zhu et al., 2022), TPRNet (Zhang et al., 2022), FAPNet (Zhou et al., 2022), and SINetV2 (Fan et al., 2021). For fair comparison, all evaluation results are taken from the original papers or obtained by retraining with the publicly released code.
Quantitative Evaluation
As shown in Table 1, we quantitatively compare our method with 15 advanced models on the three benchmark datasets under four evaluation metrics; our method clearly outperforms the other 15 models overall. Furthermore, our method also shows significant advantages over the methods that likewise use edge semantics to improve COD (ERRNet, BSANet, and FAPNet), increasing wFm by 9.
Quantitative Comparison With Advanced Methods for COD on Three Benchmarks Using Four Widely Used Evaluation Metrics.
The best three scores are highlighted.
Figure 6 presents a qualitative comparison of the prediction results of six methods on several typical samples from the datasets. In the third and fourth rows, the edge of the camouflaged object is particularly blurred; in this case, the other models produce inaccurate segmentations, while our model still accurately detects the camouflaged objects with rich edge details. As shown in the last row, the prediction of our model is more complete, demonstrating that enhancing edge information improves the detection of camouflaged objects.

Qualitative visual comparison of our model with six advanced methods.
To verify the effectiveness of the designed modules, we conduct several ablation experiments, as shown in Table 2. No.1 is the baseline model, in which we remove all the modules we designed, retain only the TEM and PPM, and fuse features by simple addition. In experiment No.2, we add the EEM on top of the baseline model and fuse it with the features of each layer by multiplication; all metrics improve after adding the EEM, which shows its effectiveness. In experiment No.3, we add the BGFM to the baseline model; compared with No.1, all metrics improve steadily, which demonstrates the effectiveness of the BGFM. In experiment No.4, we add both the EEM and the BGFM to the baseline model, and the overall performance is higher than when either module is used alone. In experiment No.5, we replace the TEM of experiment No.4 with the RTEM designed in this article and remove the PPM. The final experiment is our complete model. Comparing our complete model with experiment No.4, all metrics improve after replacing the TEM with our RTEM, which proves the effectiveness of the RTEM. Comparing experiment No.5 with our complete model, the model with the PPM achieves higher metrics and better performance, which demonstrates that the global context information captured by the PPM improves the detection of camouflaged objects. Overall, every component of the model is crucial.
Ablation Studies for Each Component on Three Test Datasets.
In this article, we propose a novel multi-level edge-enhanced fusion network (ME) for camouflaged object detection. The RTEM refines the backbone features, the EEM extracts edge semantics through local channel attention, and the BGFM fuses multi-level features with edge and global context information to optimize the prediction. Extensive experiments on three benchmark datasets demonstrate that our model outperforms 15 state-of-the-art methods.
It still has the following limitations: (1) Adaptability to complex dynamic scenarios. Segmentation accuracy may degrade in scenarios with extreme dynamic backgrounds (e.g., rapidly moving camouflaged objects or drastic lighting changes). For example, when camouflaged objects dynamically synchronize with background textures (e.g., chameleons in changing environments), the EEM may struggle to distinguish motion artifacts from true boundaries. (2) Segmentation of small-scale and highly fragmented targets. The model may miss pixel-level small targets (e.g., insect antennae) or highly fragmented camouflaged regions (e.g., broken twigs) due to limited receptive fields. This stems from the RTEM’s insufficient sensitivity to micro-features during multi-scale feature fusion. In future research, we will optimize the above two aspects through multimodal fusion and small target attention.
Footnotes
Ethical Considerations
The study did not involve any ethical issues.
Author Contributions
All authors contributed to the study conception and design. The experiments were conducted by Xuwei Tong under the guidance of Guangjian Zhang. Yuhao Yang contributed to parts of the writing and the use of the experimental equipment. The first draft of the manuscript was written by Xuwei Tong, and all authors commented on previous versions of the article. All authors read and approved the final article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
This study uses public datasets, which can be downloaded from the Internet. The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request.
Code Availability
The code is available from the corresponding author on reasonable request.
