Abstract
Camouflaged object detection (COD) aims to identify objects seamlessly embedded in their surrounding environment. The high inherent similarity between the texture of a camouflaged object and its complex background makes COD far more challenging than traditional object detection. To address these problems, we propose a method that uses holistic boundary information to optimize COD through a two-stage strategy. Specifically, a feature enhancement module is first applied to refine features at different scales and emphasize the boundary details of camouflaged entities. Our network then employs a boundary localization module, in which high-level global semantics guide low-level local edge features. Furthermore, a boundary-embedded feature aggregation module achieves cross-level fusion of multi-scale features by embedding and effectively activating boundary information, which reduces interference from cluttered backgrounds. Extensive experiments on four benchmark datasets demonstrate that our proposed model outperforms 17 state-of-the-art COD methods. The source code and results of our method are available at https://github.com/WObaibai/BSNet.
Introduction
In nature, animals often blend perfectly with their surroundings to avoid predators by changing color, shape, and pattern (Price et al., 2019). Unlike generic object detection tasks, camouflaged objects are inherently difficult to detect because hidden targets are often indistinguishable from the background. The camouflaged object detection (COD) task confronts two main challenges: inherent similarity and complex backgrounds. In the former, the camouflaged object shares similar color and texture with its background, making it difficult to distinguish the target accurately. In the latter, the background partially occludes the camouflaged object, often leading to inaccurate detection in challenging and complex scenes. Recently, researchers have proposed many deep-learning-based methods and made notable progress. Fan et al. (2020a) created COD10K, the largest camouflage dataset at present, and designed the SINet network, proposing a predator-inspired structure that uses two stages, search and identification, to accurately distinguish camouflage from the background. Bio-inspired solutions also produce good results in Jia et al. (2022), Li et al. (2021), and Mei et al. (2021). Zhou et al. (2022), in their work on FAP-Net, demonstrate the efficacy of using auxiliary boundary information to accurately locate targets in complex scenes, as shown in Figure 1.

Existing COD methods such as BGNet, FERDE, and SINet-V2 often cannot fully detect camouflaged objects in scenes where the camouflaged objects are highly similar to the background or have complex boundaries. In contrast, our method can accurately detect camouflaged objects. Note. COD = camouflaged object detection.
However, in a wide field of view, we usually perceive a low-contrast camouflaged object and its background as a whole, resulting in misleading vision (Stevens et al., 2006). When the field of view is reduced, we can only see a local area of the camouflaged object or background, making it challenging to categorize pixels accurately and increasing vulnerability to interference from similar details. To address this, we integrate various receptive fields in our network to match the boundary features of camouflaged objects of different sizes and shapes, which benefits the localization of object structures. Second, camouflaged objects often lie in complex environments (Jung, 1973), such as scenes with background occlusion, which easily introduce background noise. Therefore, effectively suppressing background interference to obtain accurate camouflage maps becomes crucial.
Although previous researchers used boundary information to help networks find the approximate location of camouflaged objects, they adopted a simple fusion of boundary information and feature maps (He et al., 2023; Song et al., 2023; Sun et al., 2022; Zhu et al., 2022). Their guidance mechanisms relied too much on the global context and neglected the accurate modeling of local boundary details, performing poorly on camouflaged objects with rich details but fuzzy local boundaries, as shown in the fourth column of Figure 1. Based on these shortcomings and inspirations, we propose a new COD framework, BSNet. Unlike general boundary guidance, we adopt a method of first locating the boundary and then integrating the complete target. This framework uses precise boundary semantic information to improve the performance of camouflaged object detection. Specifically, our method uses a two-stage strategy that differs from the common pipeline of coarse localization followed by precise segmentation: the model focuses first on precise boundary localization and then on feature fusion and refinement to minimize noisy background interference. BSNet consists of three key modules. We first design the feature enhancement module (FEM) to refine features at different scales. We then employ the boundary localization module (BLM), which uses different receptive fields in the spatial dimension to locate diverse camouflage edge areas and form more accurate edges. In addition, we construct an architecture that gradually integrates multi-level features from top to bottom to assemble more complete and accurate boundaries. Finally, we propose the boundary-embedded feature aggregation module (BFM), featuring a simple multiple-residual structure for cross-level fusion of multi-scale features. This module activates precise boundary information from the BLM to refine the object structure and reduce cluttered background noise. Experimental results show that, compared with current advanced methods, the proposed model effectively improves detection accuracy. In summary, our contributions are as follows:
We propose BSNet, a new two-stage strategy network. It first locates and integrates comprehensive boundary semantics, then directs the fusion of cross-layer, multi-scale features for precise segmentation.
We carefully design the BLM to extract local edge details and global structure along the spatial dimension to enhance boundary semantics. To exploit contextual information and explore the integrity of the camouflage boundary, we propose a boundary refinement architecture that gradually fuses multi-scale edge features from top to bottom.
We introduce the BFM to enhance cross-level context interaction and feature discrimination. This module facilitates correlations among multi-level features and diminishes background interference by embedding object boundary information, thus fostering more distinct feature generation.
Camouflaged Object Detection (COD)
Since the pattern and color of a camouflaged object are similar to the background, it is difficult to separate it from the background, resulting in poor detection results. Earlier COD work adopted hand-crafted detection methods, using texture, color, or brightness to distinguish the target from the background. However, since the camouflaged object is highly similar in color and texture to the background, these methods usually do not work well. Recent work utilizes deep neural networks, particularly convolutional neural networks (CNNs), to identify complex attributes of camouflaged objects and achieves excellent performance (Le et al., 2019; Ren et al., 2021; Song et al., 2023; Yan et al., 2023).
Some studies were inspired by the hunting process: they first search for and locate the target and then identify targets one by one, designing dedicated search and recognition modules (Fan et al., 2020a). Since then, a series of biologically inspired models have been designed (Jia et al., 2022; Mei et al., 2021; Wang et al., 2023). However, this type of model only roughly locates the area containing the camouflaged object and still cannot resolve the fuzzy boundary between the camouflaged object and its surroundings, resulting in blurry segmentation. Yan et al. (2020) used the original image and its flipped version as inputs to a two-stream mirror network, changing the perspective to identify camouflaged objects. Other models (Cong et al., 2023; Xie et al., 2023; Zhong et al., 2022) propose frequency representations, integrating high-frequency and low-frequency information into the network and designing frequency reasoning modules to mine informative clues. However, in areas with fuzzy boundaries, the background still causes interference, affecting COD accuracy.
Transformers provide richer background–foreground interaction information by modeling global context and complex pixel-to-pixel relationships. This allows them to effectively identify camouflaged objects when the background and target are similar in color or texture. Lyu et al. (2024) and Yang et al. (2021, 2023) take advantage of probabilistic models, uncertainty guidance, and transformer-based reasoning to learn deterministic and probabilistic information about camouflaged objects. However, these methods still suffer from high computational costs and may not be the best choice for real-time applications and large-scale dataset processing.
Multi-Task Learning
In contrast to single-task learning paradigms (Chen et al., 2022; Fan et al., 2020a) that develop attention modules to identify target regions, recent research focuses more on multi-task learning frameworks, for example using texture information (Ren et al., 2021; Song et al., 2023) to help the network locate camouflaged objects more accurately. However, these algorithms may still mistake the camouflaged object for the background due to their similar textures, resulting in detection errors. MGL (Zhai et al., 2021) is the first model to encode edge features together with object features into a graph convolutional network to improve COD performance, and UGTR (Yang et al., 2021) learns the certainty of camouflaged objects by exploiting confidence estimation models and a transformer-based backbone. The multi-task framework can also achieve robust COD learning by locating, segmenting, and ranking camouflaged objects (Wang et al., 2021). Zhou et al. (2022) proposed a boundary auxiliary module to enhance feature representation. However, this type of network only uses low-level features, and the edge cues contain redundant background information, which can easily interfere with the network.
Multi-Scale Contextual Information Learning
Multi-scale contextual information learning aims to effectively process feature information at different scales in images. For example, C2FNet (Sun et al., 2021) uses a design in which deeper features gradually enhance shallower features to improve hidden features from coarse to fine levels. In addition, some networks use contextual information to strengthen feature representation, which is crucial in detection tasks such as salient object detection and few-shot detection. To supplement global information, these models integrate low-level detailed features and high-level semantic features to alleviate the scale-variation problem and improve detection accuracy (Chen et al., 2020; Guo et al., 2023; Hu et al., 2021; Wang et al., 2021). By extracting features at different levels of the feature pyramid network (Lin et al., 2016) and fusing them, dense top-down and bottom-up propagation combines more comprehensive multi-context information to improve recognition and localization (Chen et al., 2022; Cheng et al., 2022; Chou et al., 2022; Huang et al., 2023; Wang et al., 2021; Zhang et al., 2022a, 2022c).
Boundary-Guided Learning
Since the boundaries of camouflaged targets usually differ slightly from the background, using this boundary information can effectively improve detection accuracy (He et al., 2023; Song et al., 2023; Sun et al., 2022; Zhu et al., 2022). BGNet (Sun et al., 2022) adds a boundary guidance mechanism to its network, which helps the network focus on the edge area of the camouflaged target by generating a boundary guidance feature map. However, its boundary guidance relies on the direct extraction and fusion of boundary information; for camouflaged targets with blurred boundaries or targets highly integrated with the background, it cannot fully extract details, resulting in limited detection performance and target loss. BSANet (Zhu et al., 2022) introduces a separation attention mechanism and a boundary guidance module to enhance attention to boundary areas. FSNet (Song et al., 2023) dynamically focuses on key areas of the camouflaged target boundary through a scanning mechanism. However, the size and shape of camouflaged targets may vary greatly; the above methods cannot effectively capture the boundaries of smaller or more distant targets and show weak multi-scale adaptation and limited robustness to complex backgrounds, which limits their performance.
In contrast, our method pays more attention to highlighting the complete edge details of the camouflaged object and provides effective constraints in the fusion stage, using multi-scale features at different levels to reduce the interference of redundant background information and thus improve the accuracy of the detection model.
Proposed Method
Overview
The overall architecture of the proposed BSNet is illustrated in Figure 2. It consists of two stages: the first stage aggregates features from different levels to generate an accurate target boundary map, and the second stage integrates boundary information with the camouflaged object features to produce the final prediction. Our network comprises three key components: the FEM, BLM, and BFM.

The overall architecture of BSNet for camouflage target detection. It mainly contains three key modules, namely, FEM, BLM, and BFM. The dashed red line represents supervision between the ground truth and predictions. The BSNet performs supervised training on the edge maps obtained by three BLMs and the prediction maps obtained by all BFMs at the same time. See Section 3 for details. Note. FEM = feature enhancement module; BLM = boundary localization module; BFM = boundary-embedded feature aggregation module.
Specifically, an input image is first fed into the backbone to extract multi-level features, which the three modules then progressively refine.
Since detailed information about the camouflage, such as edges and textures, is helpful for network detection, effectively capturing the difference between the camouflage and the background is particularly critical, especially in complex scenes. Existing networks (Song et al., 2023; Xu et al., 2022) directly feed encoder features into localization and recognition modules, which tends to incorporate background information. In contrast, to improve the perception of camouflaged target boundaries by integrating information from different perspectives, we adopt an asymmetric convolution strategy. By extracting features along the horizontal and vertical directions and capturing subtle, direction-specific feature differences, the network pays more attention to the directional information and boundary details of the camouflaged target.
As shown in Figure 2, asymmetric convolution is combined with residual connections to better capture and emphasize features in specific directions, thereby improving the perception of structures such as object boundaries. Our asymmetric convolution consists of a standard square convolution together with horizontal and vertical strip convolutions, whose outputs are merged through a residual connection.
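To make this concrete, the following PyTorch sketch shows one way such an enhancement block could be realized. The class name, kernel sizes, normalization, and fusion order are our assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class FEMSketch(nn.Module):
    """Illustrative feature enhancement block: a square convolution plus
    horizontal/vertical strip (asymmetric) convolutions, merged through
    a residual connection. All design choices here are assumptions."""

    def __init__(self, channels: int):
        super().__init__()
        self.square = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        # Strip convolutions emphasize direction-specific boundary cues.
        self.horizontal = nn.Conv2d(channels, channels, (1, 3), padding=(0, 1))
        self.vertical = nn.Conv2d(channels, channels, (3, 1), padding=(1, 0))
        self.fuse = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))

    def forward(self, x):
        y = self.square(x)
        y = self.horizontal(y) + self.vertical(y)
        return self.fuse(y) + x  # residual connection keeps original detail

x = torch.randn(1, 64, 88, 88)
print(FEMSketch(64)(x).shape)  # torch.Size([1, 64, 88, 88])
```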
Previous studies (Zhen et al., 2020) have demonstrated that incorporating boundary information improves the performance of computer vision tasks. In COD tasks, there is no obvious dividing line between the boundary of the camouflaged object and the background: the colors and textures are likely consistent with the background, and the shapes of the boundaries may resemble those of the surrounding scene. This blending of boundaries with the surroundings poses a challenge, so accurately locating the boundaries of camouflaged targets is crucial, and boundary information serves as an effective constraint for detecting camouflaged target features, reducing interference from redundant background information. Networks with boundary guidance mechanisms (Sun et al., 2022; Zhou et al., 2022) only obtain the boundary map from global features as an additional input, which is ill-suited to the complex features of camouflaged objects with diverse shapes and boundary pixels. In contrast, we first generate boundary localization information, obtain the target position, and form a clear contour map. We adopt multi-scale receptive fields to adapt to camouflaged targets of different sizes and gradually refine the boundary information. The design of this module emphasizes the adaptability of boundary details, especially for target boundaries of different sizes and shapes.
We use different receptive fields in the spatial dimension to match the edge regions of camouflaged objects of different sizes and shapes, forming more accurate edge information and obtaining a preliminary boundary localization. This is the first step, as shown in Figure 3. Specifically, the process first upsamples the higher-level features so that they can guide the current level, and then applies convolutions with different receptive fields to capture edge responses at multiple scales.

The architecture of our boundary localization module.
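The sketch below illustrates the core idea of this module: parallel dilated convolutions provide several receptive fields, and an upsampled higher-level feature guides the current level. The dilation rates, module interface, and fusion order are assumptions for illustration, not the paper's exact specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BLMSketch(nn.Module):
    """Illustrative boundary localization block: parallel dilated
    convolutions give multiple receptive fields, and a coarser,
    higher-level feature is upsampled to guide the current level."""

    def __init__(self, channels: int, rates=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r)
            for r in rates])
        self.fuse = nn.Conv2d(channels * len(rates), channels, 1)
        self.edge_head = nn.Conv2d(channels, 1, 3, padding=1)

    def forward(self, feat, higher=None):
        if higher is not None:  # inject upsampled global semantics
            feat = feat + F.interpolate(higher, size=feat.shape[2:],
                                        mode='bilinear', align_corners=False)
        multi = torch.cat([b(feat) for b in self.branches], dim=1)
        feat = self.fuse(multi)
        return feat, self.edge_head(feat)  # refined feature + edge logits

feat = torch.randn(1, 64, 44, 44)
higher = torch.randn(1, 64, 22, 22)
refined, edge = BLMSketch(64)(feat, higher)
print(refined.shape, edge.shape)  # (1, 64, 44, 44) (1, 1, 44, 44)
```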
Compared with the simple ways of obtaining edges in recent methods (Liu et al., 2023; Sun et al., 2022; Zhou et al., 2022), we adopt a strategy of gradually refining edges and focus on learning boundary-enhanced representations that preserve local features and boundary information, incorporating them into the decoder network.
Figure 4 shows the visual feature maps of the input and output of the three-layer BLMs, as well as the final generated edges. Comparing Figure 4(c), (d), and (e), we find that the strip structures along the edge of the camouflaged target are highlighted, indicating that the network effectively learns fine-grained boundary details and better preserves the integrity of camouflaged object boundaries.

Visualization feature maps in different layers of BLMs and BFMs. Note. BLM = boundary localization module; BFM = boundary-embedded feature aggregation module.
Different levels of features typically contain different information. Low-level features are closer to the original information and are important for detecting small targets, while high-level features contain richer semantic information and are more sensitive to large targets.
Having obtained accurate edge information in the first stage (as shown in the figure), and to effectively exploit the edge localization capability of the BLM and the enhanced features produced by the FEM, we propose the second stage of the network, built around the BFM. Unlike network models such as BGNet, where boundary maps and feature maps are fused using simple convolution operations, this module interactively integrates contextual information while embedding the position and size information of the target edge. It captures global contextual information through attention mechanisms and dynamically adjusts the importance of features at different scales. This deep fusion of local and global features handles complex camouflage scenes better, preserving details while fully accounting for global semantic information.
As shown in Figure 5, the BFM fuses the enhanced features with higher-level fused features under the guidance of the boundary map.

The architecture of our boundary-embedded feature aggregation module.
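The following sketch illustrates the boundary-embedded fusion idea: the stage-1 edge map gates the features so that responses concentrate near object boundaries, a coarser prediction feature is fused in, and a lightweight channel attention (standing in for the MSCA block referenced in Figure 7's ablation) reweights the result. The gating form, residual layout, and attention design are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BFMSketch(nn.Module):
    """Illustrative boundary-embedded fusion: edge-gated residual,
    cross-level fusion, and a simple channel attention."""

    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(channels * 2, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        # Lightweight channel attention standing in for the MSCA block.
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1), nn.Sigmoid())

    def forward(self, feat, coarser, edge):
        size = feat.shape[2:]
        coarser = F.interpolate(coarser, size=size, mode='bilinear',
                                align_corners=False)
        edge = torch.sigmoid(F.interpolate(edge, size=size, mode='bilinear',
                                           align_corners=False))
        feat = feat * edge + feat  # boundary-gated residual
        fused = self.fuse(torch.cat([feat, coarser], dim=1))
        return fused * self.attn(fused) + fused

feat = torch.randn(1, 64, 44, 44)
coarser = torch.randn(1, 64, 22, 22)
edge = torch.randn(1, 1, 22, 22)
print(BFMSketch(64)(feat, coarser, edge).shape)  # (1, 64, 44, 44)
```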
Finally, the detailed boundary map generated in the first stage is embedded into each BFM, and the output of the last BFM serves as the final prediction.
The loss function of the proposed BSNet comprises two types of supervision: one on the camouflaged object mask and one on the object boundary map.
In summary, the total loss of our model combines the mask supervision applied to all BFM predictions with the boundary supervision applied to the BLM edge maps.
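As an illustration, the sketch below pairs the weighted BCE + weighted IoU structure loss widely used in COD work (e.g., in F3Net) for the mask term with plain BCE for the edge term. We assume a loss of this family here; the function names and the weight lam are ours.

```python
import torch
import torch.nn.functional as F

def structure_loss(pred, mask):
    """Weighted BCE + weighted IoU loss common in COD work (e.g., F3Net).
    `pred` holds logits; `mask` is the binary ground truth (B, 1, H, W)."""
    # Pixels near the object boundary receive larger weights.
    weit = 1 + 5 * torch.abs(
        F.avg_pool2d(mask, 31, stride=1, padding=15) - mask)
    wbce = F.binary_cross_entropy_with_logits(pred, mask, reduction='none')
    wbce = (weit * wbce).sum(dim=(2, 3)) / weit.sum(dim=(2, 3))

    prob = torch.sigmoid(pred)
    inter = ((prob * mask) * weit).sum(dim=(2, 3))
    union = ((prob + mask) * weit).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)
    return (wbce + wiou).mean()

def total_loss(mask_preds, edge_preds, mask_gt, edge_gt, lam=1.0):
    """Deep supervision over all BFM predictions plus BLM edge supervision.
    Predictions are assumed already upsampled to ground-truth size."""
    loss = sum(structure_loss(p, mask_gt) for p in mask_preds)
    loss += lam * sum(
        F.binary_cross_entropy_with_logits(e, edge_gt) for e in edge_preds)
    return loss
```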
Implementation Details
We implemented the network in PyTorch and used Res2Net50 pretrained on ImageNet as our backbone. All input images and ground truths are resized to a fixed resolution for training and testing.
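For reference, one way to obtain a pretrained Res2Net50 multi-level feature extractor is through the timm library; the variant name and the chosen feature levels below are our assumptions, and the input resolution in the example is arbitrary.

```python
import timm
import torch

# A pretrained Res2Net50 returning feature maps from four stages.
backbone = timm.create_model('res2net50_26w_4s', pretrained=True,
                             features_only=True, out_indices=(1, 2, 3, 4))
feats = backbone(torch.randn(1, 3, 416, 416))
for f in feats:
    print(f.shape)  # four feature maps at strides 4, 8, 16, 32
```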
Datasets
To verify the effectiveness of the proposed model, we evaluate performance on four COD benchmark datasets: CAMO (Le et al., 2019), CHAMELEON (Przemysław et al., 2018), COD10K (Fan et al., 2020a), and NC4K (Lyu et al., 2021). NC4K is the newest large-scale testing dataset, covering a variety of camouflaged object scenes and environments, with a total of 4,121 images. COD10K is currently the largest dataset with pixel-level annotations, containing 3,040 training images and 2,026 testing images; it consists of 10 super-classes and 78 subclasses gathered from several photography websites. CAMO contains eight categories and 1,250 camouflage images. CHAMELEON is a small dataset of 76 COD images. Following the common training setting of existing methods, this work uses 3,040 samples from COD10K and 1,000 images from CAMO for training. During the testing phase, we compare the performance of our model and other competing models on the test sets of CAMO and COD10K as well as the entire CHAMELEON and NC4K datasets.
Evaluation Metrics
We quantitatively evaluate the effectiveness of the model with four metrics commonly used in COD tasks. The structure-measure (S-measure) evaluates the region-aware and object-aware structural similarity between the prediction map and the ground truth. The average enhanced-measure (E-measure) jointly captures image-level statistics and local pixel-level matching. The weighted F-measure provides a unified score of weighted precision and weighted recall. The mean absolute error (MAE; Perazzi et al., 2012) directly measures the absolute difference between the ground-truth value and the predicted value and is computed as

MAE = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \left| P(x, y) - G(x, y) \right|,

where W and H denote the width and height of the image, P(x, y) is the predicted value at pixel (x, y), and G(x, y) is the corresponding ground-truth value.
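For reference, a minimal NumPy sketch of the MAE computation above; the function name and dummy inputs are ours.

```python
import numpy as np

def mae(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute error between a prediction map in [0, 1] and a
    binary ground-truth mask of the same spatial size."""
    return float(np.abs(pred.astype(np.float64) -
                        gt.astype(np.float64)).mean())

pred = np.random.rand(416, 416)                      # dummy prediction map
gt = (np.random.rand(416, 416) > 0.5).astype(float)  # dummy binary mask
print(mae(pred, gt))
```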
Comparison to State-of-the-Arts
To demonstrate the effectiveness of our method, it is compared with 17 state-of-the-art (SOTA) COD methods, including SINet (Fan et al., 2020a), JSCOD (Li et al., 2021), S-MGL (Zhai et al., 2021), R-MGL (Zhai et al., 2021), PFNet (Mei et al., 2021), LSR (Lyu et al., 2021), C2FNet (Sun et al., 2021), BSANet (Zhu et al., 2022), BGNet (Sun et al., 2022), SINet-V2 (Fan et al., 2021), UGTR (Yang et al., 2021), SegMaR (Jia et al., 2022), CubeNet (Zhuge et al., 2022), C2FNet-pre (Chen et al., 2022), FEDER-R2N (He et al., 2023), OAformer (Yang et al., 2023), UEDG (Lyu et al., 2024), FSPNet (Huang et al., 2023), NCHIT (Zhang et al., 2022a), OCEnet (Liu et al., 2022), and PreyNet (Zhang et al., 2022b). For a fair comparison, all predictions from these methods are either provided by the authors or produced by models retrained with open-source code.

Qualitative comparisons of our BSNet with state-of-the-art methods.
The Performance Comparison With Eight State-of-the-Art Models on Four Datasets, Including the Maximum F-Measure.
Note.
Comparison of the Number of Model Params, the Number of FLOPs, and the Inference Speed (Speed).
Note. Params = parameters; FLOP = floating-point operation; FPS = frames per second.
In this part, we conduct comprehensive ablation experiments to verify the effectiveness of the various components and configurations. For the baseline, we remove all extra modules (FEM, BLM, and BFM) and the boundary refinement architecture, leaving only four plain convolution blocks to decode the backbone features. The variants are as follows:
Basic (A1): The basic model is equivalent to removing all FEMs, BLMs, and BFMs from our network.
Basic + FEMs (A2): Add the FEM to the basic model (A1).
Basic + ERF (A3): Add the BLM to the basic model (A1). The BLM used in this variant retains only the enlarged receptive field (ERF) component.
Basic + BFMs (A4): Add the BFMs to the basic model (A1).
Basic + FEMs + BLMs (B1): Add all BLMs to the Basic + FEMs model (A2).
Basic + FEMs + BFMs (B2): Add all BFMs to the Basic + FEMs model (A2).
According to the qualitative comparison results shown in Figure 7, our model performs best and produces satisfactory prediction maps: the structure of the target becomes increasingly complete, and the boundaries become clearer. Our final result is closest to the ground truth and achieves the best visual effect.

Visual comparisons of different variations. (a) RGB image, (b) ground truth, (c) A2, (d) B1, (e) w/o dilated conv, (f) w/o MSCA, (g) w/o res, (h) our. Note. RGB = red–green–blue; MSCA = multi-scale channel attention.
Ablation Studies for Different Baseline Methods and Key Components of Our Model.
Ablation Study of the Input Size. Four Widely Used Evaluation Metrics (i.e., the S-Measure, E-Measure, Weighted F-Measure, and MAE) Are Reported.
Ablation Study of Our BLMs and BFMs. Four Widely Used Evaluation Metrics (i.e., the S-Measure, E-Measure, Weighted F-Measure, and MAE) Are Reported.
Note. BLM = boundary localization module; BFM = boundary-embedded feature aggregation module.
Quantitative Results for Validating the Effectiveness of the Supervision Strategy Adopted in the Proposed Methods. Four Widely Used Evaluation Metrics (i.e., the S-Measure, E-Measure, Weighted F-Measure, and MAE) Are Reported.
Although our COD network performs well in accuracy and efficiency, it still lags behind some popular models in terms of parameter count. As shown in Table 7, the BFM has a significant impact on the speed, floating-point operations, and parameters of the entire network, indicating that the BFM plays a crucial role in determining the overall computational cost and model complexity. This is therefore an area we need to address in future work. Specifically, we will seek a good balance between performance and the number of parameters, aiming to optimize the efficiency of the model without compromising its ability to produce high-quality results. We plan to explore strategies such as lightweight backbones, knowledge distillation, and more efficient network architectures.
Comparison of Three Modules in BSNet: Number of Model Params, Number of FLOPs, and Reasoning Speed (Speed).
Note. Params = parameters; FLOP = floating point operation; FPS = frames per second; FEM = feature enhancement module; BLM = boundary localization module; BFM = boundary-embedded feature aggregation module.
Polyps are tumor-like lesions that grow in the colon. Accurate segmentation of polyps is crucial for detecting them in colonoscopy images, enabling timely surgical intervention. To evaluate the effectiveness of our method in polyp segmentation, we followed the same benchmark protocol as Fan et al. (2020b): our BSNet was retrained on the Kvasir-SEG (Jha et al., 2019) and CVC-ClinicDB (Bernal et al., 2015) datasets and tested on five commonly used polyp segmentation datasets. Table 8 lists the quantitative results of different polyp segmentation methods. Our BSNet achieves better metrics on all five datasets, demonstrating significant advantages. The visual comparisons of different polyp segmentation methods are shown in Figure 8, with one sample selected from each dataset. The proposed method achieves significantly better performance than the other SOTA methods.

The visual comparison of detection results obtained by different polyp segmentation methods.
Performance Comparison With Five State-of-the-Art Models on Five Datasets.
In this paper, we propose a COD framework, namely BSNet, that addresses blurred boundaries and background occlusion by employing clear and complete edge semantic information to refine camouflaged object detection through a two-stage strategy. We carefully design the FEM to refine features at different scales. At the same time, we introduce the BLM, which uses high-level global semantic information to guide low-level local edge information and achieve an orderly fusion of edge-related semantics. BSNet uses a unique top-down structure that gradually integrates multi-level features to assemble boundary information more accurately. Finally, the BFM enables cross-level fusion of multi-scale features, greatly reducing the interference of cluttered backgrounds. Extensive experiments show that BSNet significantly outperforms current SOTA methods on four widely used COD datasets, providing a more effective solution for camouflaged object detection and promoting research in this field. In the future, we also plan to incorporate super-resolution strategies to further improve model performance.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.