Abstract
The RGB-D salient object detection (SOD) task aims to pinpoint salient objects in paired RGB and depth images, with the key challenge being effective multimodal integration. This article presents MHINet, a network tailored for RGB-D SOD, comprising a dual-stream Swin Transformer encoder, an adaptive fusion enhancement module (AFEM), a multi-level feature interaction module (MFIM), and a decoder. The dual-stream Swin Transformers extract multi-level features and outperform traditional CNNs in capturing long-range dependencies. The AFEM dynamically adjusts the RGB-depth fusion ratio via channel attention, enhancing feature expression. The MFIM uses middle-layer features to enable stable cross-level feature interaction, improving fusion efficiency. The decoder restores edge details via residual convolution. Experimental results on DUT, LFSD, NJU2K, NLPR, and SIP show that MHINet outperforms state-of-the-art methods, validating its cross-modal detection capability for RGB-D SOD.
Introduction
Salient object detection (SOD) aims to detect the most prominent objects in images or videos containing various complex scenes. SOD has proven valuable in a range of computer vision tasks, such as photo cropping (Wang et al., 2018), video object segmentation (Wang et al., 2015), semantic segmentation (Wang et al., 2022), and medical image segmentation (Zhao et al., 2021). Most previous SOD work used only RGB images (Cong et al., 2022b; Zhang et al., 2022a; Zheng et al., 2021; Zhu et al., 2019). Traditional SOD methods (Jiang et al., 2013; Yan et al., 2013) generally rely on handcrafted features or heuristic priors (e.g., color contrast (Cheng et al., 2014), center prior (Jiang & Davis, 2013), background prior (Han et al., 2014)) to detect salient objects. However, these methods fail to capture sufficient semantic information. Recently, convolutional neural networks (CNNs) have been widely adopted in SOD (Chen et al., 2022; Li et al., 2023a; Tang et al., 2022), surpassing traditional methods (Klein & Frintrop, 2011); the work of Wei et al. (2020) is one representative example.
Although these methods have achieved promising results, detecting salient objects in complex scenes using only RGB information remains difficult. Fortunately, the rise of depth cameras such as Kinect and RealSense has made RGB-D SOD, which incorporates depth information, an increasingly appealing area of research. Researchers therefore introduced depth images to provide supplementary cues, such as spatial structure, three-dimensional layout, and object boundary information (Piao et al., 2019), which to some extent overcome the limitations of RGB images. However, accurately distinguishing salient objects remains challenging in complex scenes, including low-light conditions, image clutter, and depth blur (Chen & Li, 2019). RGB-D detection generally outperforms RGB-only detection in such scenarios, yet effectively fusing the complementary information of the RGB and depth modalities remains difficult. For instance, Peng et al. (2014) proposed a simple fusion strategy that combines an RGB-based scene saliency model with depth-induced saliency generated by a multi-context contrastive method. However, this strategy ignores the differences between modalities, resulting in inaccurate feature fusion. Wang and Gong (2019) designed a two-stream CNN that predicts saliency maps from RGB and depth maps separately, followed by adaptive fusion. Zhang et al. (2021) infer latent variables through a generator model and gradually update them with an inference model to achieve probabilistic cross-modal fusion. DMRA (Ji et al., 2022) uses depth refinement blocks and residual connections to fuse the complementary cues of the RGB and depth streams. However, such fusion approaches can hardly capture the complex interactions between the two modalities.
The fusion of multi-level features is also widely used in SOD, as effectively integrating high-level semantic features with low-level details is crucial for accurately distinguishing the image foreground from the background. The most common strategy is top-down fusion, such as concatenating or summing the features of adjacent layers (Ji et al., 2021; Wang et al., 2021; Zhou et al., 2021b) or using integration modules (Liu et al., 2021; Zhou et al., 2021a). However, because only adjacent layers are involved, the diversity of the fusion is limited and high-level and low-level features cannot be fully integrated. Therefore, some methods (Sun et al., 2021; Zhang et al., 2020) directly aggregate features from all levels before fusion to learn representations across layers. These methods ignore the characteristics of the features at each layer and limit the exploitation of the complementarity between different features. In addition, direct aggregation may destroy the original information contained in the features.
In response, this article designs an adaptive fusion enhancement and multi-level feature interaction network, MHINet, to improve RGB-D detection performance. The network adopts an encoder–decoder structure with three key components: feature extraction, feature fusion and interaction, and feature decoding, and ultimately achieves state-of-the-art performance. MHINet first uses Swin Transformers to extract multi-level features from the RGB and depth images; the Swin Transformer efficiently models global context through its shifted-window self-attention mechanism. An adaptive fusion enhancement module (AFEM) is then introduced, which combines channel attention with adaptive fusion to enhance the network's expressive ability while exploiting the semantic information of high-level features to provide rich contextual guidance. A multi-level feature interaction module (MFIM) is subsequently used to efficiently process the high-level semantic features and low-level detail features in the fused features. Finally, the decoder progressively integrates low-level features, recovers image details, and removes noise to produce precise predictions. The key contributions are as follows:
- We propose an innovative RGB-D SOD framework, MHINet, which follows an encoder–decoder structure. In comprehensive experiments on five challenging datasets, MHINet outperforms existing state-of-the-art algorithms on four evaluation metrics. This detection performance is achieved through the coordinated operation of several novel components within the framework.
- We design an adaptive fusion enhancement module (AFEM). By combining a channel attention mechanism with an adaptive fusion strategy, the AFEM enables effective multi-modal feature integration. It dynamically adjusts the fusion coefficients of the different modalities, realizing the complementarity between RGB and depth images. This enriches the network's expressiveness and significantly improves the model's ability to exploit cross-modal information.
- We develop a multi-level feature interaction module (MFIM). The MFIM takes into account the characteristics of the features at each level and uses middle-layer features as intermediaries to achieve stable interaction among multi-level features. Because direct interaction between high-level and low-level features may hinder fusion, this design effectively circumvents the problem, improving the integration of details and semantic information and thereby the detection accuracy.
- In the decoder, we gradually incorporate the interaction features generated by the MFIM during step-by-step decoding and use residual convolution blocks to suppress noise. This design addresses the loss of detail and the noise introduced during encoding and upsampling, further improving the accuracy of salient object detection.
Method
Overview Framework
The proposed MHINet consists of two Swin Transformer encoders, the AFEM, the MFIM, and a decoder module. Its overall architecture is shown in Figure 1. First, MHINet uses Swin Transformers as the backbone to extract multi-level features from the RGB and depth images, respectively. The AFEM then fuses the extracted RGB and depth multi-level features. The fused features are subsequently passed to the MFIM for cross-level interaction, and the decoder progressively restores spatial detail to produce the final saliency prediction.

The overall structure of MHINet.
In the RGB-D SOD task, feature extraction is crucial. Traditional models usually rely on CNNs for feature extraction, but the limited receptive field restricts the extraction of global features. The Swin Transformer models global information through shifted-window operations while reducing the computational complexity to linear in the image size, thereby significantly lowering the computational burden. In this article, the Swin Transformer is adopted as the feature extractor, and the Swin-B variant is chosen to balance complexity and efficiency.
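As a concrete illustration, the following sketch shows how a dual-stream encoder of this kind could be organized. It is a schematic under assumptions rather than the paper's exact implementation: `build_swin_b` is a placeholder for any constructor that returns the four stage outputs of a Swin-B backbone (strides 4/8/16/32 with 128/256/512/1024 channels are assumed).

```python
import torch.nn as nn

class DualStreamEncoder(nn.Module):
    """Schematic dual-stream encoder: two independent Swin-B backbones,
    one for RGB and one for depth (weights are not shared)."""

    def __init__(self, build_swin_b):
        super().__init__()
        self.rgb_stream = build_swin_b()    # RGB branch
        self.depth_stream = build_swin_b()  # depth branch

    def forward(self, rgb, depth):
        # Each backbone is assumed to return a list [f1, f2, f3, f4]
        # of multi-level feature maps from coarse to fine strides.
        rgb_feats = self.rgb_stream(rgb)
        depth_feats = self.depth_stream(depth)
        return rgb_feats, depth_feats
```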
Adaptive Fusion Enhancement Module
Research shows that RGB features contain rich color and texture information, while depth features convey the spatial layout, that is, specific spatial location information. One difficulty in the RGB-D SOD task is to effectively exploit the complementary information in the RGB and depth features when fusing them. Therefore, this article designs a cross-modal AFEM that combines channel attention with adaptive fusion. The channel attention mechanism helps the network automatically weight different feature channels, allowing it to focus on important channels and suppress irrelevant or noisy ones, thereby improving the network's feature expression ability. An attention-crossover strategy captures the correlation between the two modalities, and gating values learned through adaptive fusion adjust the ratio between the RGB and depth features. In this way, the AFEM can dynamically adjust the fusion coefficients of the two information sources according to the input features, enhancing the network's expressive ability and thus the fusion effect. In addition, smaller objects are generally difficult to detect, and most SOD methods are prone to missing them. We attribute this to the insufficient information contained in single-scale features, which makes it difficult to detect objects of various sizes. Therefore, the AFEM fuses high-level features into the adjacent low-level features to obtain richer contextual guidance, thereby improving the detection of multi-scale objects.
As shown in Figure 2, the AFEM first scales the adjacent high-level feature to the resolution of the current layer so that it can be combined with the current-layer features.

The structure of the AFEM.
Among them,
After this, channel attention is applied to the RGB and depth features separately to obtain RGB attention weights and depth attention weights, and the two attention weights are cross-multiplied with the corresponding features to enhance the complementary correlation between the two modalities, generating RGB attention features and depth attention features. The respective input features are then added directly to them through skip connections, passing information to deeper layers and alleviating gradient vanishing. Finally, adaptive fusion dynamically adjusts the fusion ratio of the RGB and depth features and yields the fused feature for the current level.
In summary, the adaptive fusion enhancement module obtains richer contextual information by integrating adjacent high-level features and can better adapt to multi-scale objects. In addition, the module uses a weighted crossover strategy to fully exploit the complementary information of the RGB and depth features, and it introduces adaptive fusion to dynamically adjust the fusion ratio of the two features, thereby enhancing the feature representation capability of the network.
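To make the above steps concrete, the following is a minimal sketch of an AFEM-style block. It assumes an SE-style channel attention, a cross-weighting interpretation of the attention-crossover strategy, additive injection of the adjacent high-level feature, and a single-channel sigmoid gate for the adaptive fusion ratio; all of these design details are assumptions and may differ from the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """SE-style channel attention; the exact design in the paper may differ."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))      # global average pooling -> channel weights
        return w.view(x.size(0), -1, 1, 1)

class AFEM(nn.Module):
    """Sketch of an adaptive fusion enhancement block as described above."""
    def __init__(self, channels, high_channels):
        super().__init__()
        self.reduce_high = nn.Conv2d(high_channels, channels, 1)  # align high-level channels
        self.ca_rgb = ChannelAttention(channels)
        self.ca_depth = ChannelAttention(channels)
        self.gate = nn.Sequential(nn.Conv2d(2 * channels, 1, 1), nn.Sigmoid())

    def forward(self, f_rgb, f_depth, f_high=None):
        if f_high is not None:
            # Adjacent high-level context, injected additively (assumed).
            f_high = F.interpolate(self.reduce_high(f_high), size=f_rgb.shape[2:],
                                   mode='bilinear', align_corners=False)
            f_rgb, f_depth = f_rgb + f_high, f_depth + f_high
        w_rgb, w_depth = self.ca_rgb(f_rgb), self.ca_depth(f_depth)
        # Cross-weighting: each modality is modulated by the other's attention,
        # with a skip connection keeping the original signal.
        a_rgb = f_rgb * w_depth + f_rgb
        a_depth = f_depth * w_rgb + f_depth
        # Adaptive fusion: a learned gate in (0, 1) balances the two modalities.
        g = self.gate(torch.cat([a_rgb, a_depth], dim=1))
        return g * a_rgb + (1 - g) * a_depth
```

In a full model, one such block would be instantiated per encoder level, with `f_high` taken from the fused feature of the next deeper level.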
The AFEM fuses the multi-modal features to obtain multi-level fused features. Some existing methods (Ji et al., 2021; Liu et al., 2021; Zhang et al., 2022a; Zhou et al., 2021a) usually adopt a top-down, layer-by-layer integration scheme for multi-level feature fusion. However, this scheme cannot exploit the diversity of the features and cannot fully integrate features from different layers. Other methods (Sun et al., 2021; Zhang et al., 2020) directly aggregate multi-level features to enhance the features of each layer, but they still cannot fully utilize the unique information of each level. It is known that low-level detail features, such as textures and borders, make the foreground and background of an image clearer, while high-level features contain rich semantic information that can identify object categories and highlight salient regions against the background. By fusing multi-level features, more comprehensive representations can be obtained in complex scenes, thereby improving detection performance.
To address these issues, this article introduces the MFIM. This module realizes the fusion of high-level and low-level features by introducing the middle-layer features as intermediaries, which bridge the gap between the two and stabilize the interaction.

The structure of MFIM.
Among them,
The UIM works in the opposite direction to the DIM: it upsamples and rescales the features so that they can interact with the corresponding levels through the middle-layer intermediary.
Among them,
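Because the exact formulations of the DIM and UIM are not reproduced here, the following is only one plausible reading of an interaction unit: a target-level feature and the middle-layer intermediary are resized to a common resolution and fused by concatenation, a 3x3 convolution, and a residual connection. The actual modules are likely more elaborate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def resize_to(x, ref):
    """Bilinearly resize x to the spatial size of ref."""
    return F.interpolate(x, size=ref.shape[2:], mode='bilinear', align_corners=False)

class InteractionUnit(nn.Module):
    """Hypothetical DIM/UIM-style unit: the target-level feature interacts with the
    middle-layer feature (used as an intermediary) via concatenation and a 3x3 conv."""
    def __init__(self, target_channels, mid_channels):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(target_channels + mid_channels, target_channels, 3, padding=1),
            nn.BatchNorm2d(target_channels), nn.ReLU(inplace=True))

    def forward(self, f_target, f_mid):
        # Residual connection preserves the original target-level information.
        return self.fuse(torch.cat([f_target, resize_to(f_mid, f_target)], dim=1)) + f_target
```

Under this reading, DIM-style units would be applied to the levels below the middle layer and UIM-style units to the levels above it, matching the per-layer configurations compared in the ablation study.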
Usually, when the encoder performs feature extraction, the edge details of the image may be lost, and noise will also be introduced during the upsampling process. To address this problem, this article designs a decoder that restores details by gradually introducing interactive features generated by the MFIM module, while using residual convolution blocks to reduce the impact of noise.
As shown in Figure 1, this article feeds the interaction features produced by the MFIM into the corresponding decoding stages, where they are progressively merged with the upsampled decoder features.
Among them,
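A minimal sketch of one decoding stage under these assumptions is given below; the channel configuration, the use of concatenation for merging, and the exact form of the residual block are placeholders rather than the paper's verified design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Plain residual convolution block, used here as a stand-in for the
    noise-suppressing blocks in the decoder."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels))

    def forward(self, x):
        return F.relu(self.body(x) + x)

class DecoderStage(nn.Module):
    """One decoding stage: upsample the coarser decoder feature, merge it with the
    MFIM interaction feature of the same level, and refine with a residual block."""
    def __init__(self, in_channels, skip_channels, out_channels):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels + skip_channels, out_channels, 1)
        self.refine = ResidualBlock(out_channels)

    def forward(self, decoder_feat, interaction_feat):
        x = F.interpolate(decoder_feat, size=interaction_feat.shape[2:],
                          mode='bilinear', align_corners=False)
        x = self.reduce(torch.cat([x, interaction_feat], dim=1))
        return self.refine(x)
```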
This article uses a deep supervision strategy (Lee et al., 2015), generating an intermediate saliency prediction map at multiple stages of the network so that each can be supervised by the ground truth.
The binary cross-entropy (BCE) loss, a classic loss in SOD tasks, is a pixel-level loss that computes the loss of each pixel independently. It can be calculated as follows:
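The equation itself is not reproduced in the source text; the standard pixel-wise BCE formulation, assumed here with S the predicted saliency map, G the ground truth, and W, H the image width and height, is:

```latex
\mathcal{L}_{bce} = -\frac{1}{W \times H}\sum_{x=1}^{W}\sum_{y=1}^{H}
\Big[ G(x,y)\log S(x,y) + \big(1 - G(x,y)\big)\log\big(1 - S(x,y)\big) \Big]
```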
Among them, G(x, y) denotes the ground-truth label of the pixel at (x, y) and S(x, y) denotes its predicted saliency value.
The overall loss combines the BCE losses computed on all supervised prediction maps.
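A common formulation for such a deep-supervision objective, assuming K supervised maps S_k and, as a simple default, equal weights λ_k, is:

```latex
\mathcal{L}_{total} = \sum_{k=1}^{K} \lambda_k \, \mathcal{L}_{bce}\big(S_k, G\big),
\qquad \lambda_k = 1 \ \text{(equal weighting assumed)}
```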
Experimental Settings
Datasets
To evaluate MHINet’s performance, this article evaluate five public RGB-D SOD benchmarks: NJU2K (Ju et al., 2014), NLPR (Peng et al., 2014), DUT(Wang et al., 2017), LFSD (Li et al., 2014), and SIP (Fan et al., 2020).
The dataset used in this article is specifically set as follows: the training set contains 2185 samples, including 1485 samples from NJU2K and 700 samples from NLPR. The test set contains the remaining samples of NJU2K and NLPR, as well as complete data from other datasets.
Evaluation Metrics
For quantitative evaluation, this article uses four widely adopted metrics: F-measure (Cong et al., 2018), E-measure (Fan et al., 2018), S-measure (Fan et al., 2017), and mean absolute error (MAE) (Perazzi et al., 2012).
F-Measure is a comprehensive evaluation metric, defined as the weighted harmonic mean of precision and recall. It is calculated as follows:
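The equation is not reproduced in the source text; the standard formulation used in the SOD literature, which we assume here, is:

```latex
F_{\beta} = \frac{(1+\beta^{2}) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}
{\beta^{2} \cdot \mathrm{Precision} + \mathrm{Recall}}
```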
Among them, Precision and Recall are computed between the binarized saliency map and the ground truth, and β² is commonly set to 0.3 to emphasize precision.
E-Measure combines local pixel values with the image-level mean to jointly capture image-level statistics and local pixel-matching information. It is calculated as follows:
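The equation is again not reproduced; the standard enhanced-alignment formulation, assumed here with ξ_FM denoting the enhanced alignment matrix, is:

```latex
E_{\xi} = \frac{1}{W \times H}\sum_{x=1}^{W}\sum_{y=1}^{H}\xi_{FM}(x, y)
```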
Among them, W and H are the width and height of the saliency map.
S-Measure evaluates the spatial structure similarity between the saliency map S and the ground truth Y. It is calculated as follows:
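The standard definition, assumed here, combines object-aware and region-aware structural similarity:

```latex
S = \alpha \cdot S_{o} + (1 - \alpha) \cdot S_{r}, \qquad \alpha = 0.5
```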
Among them, S_o and S_r denote the object-aware and region-aware structural similarity, respectively, and α is a balance parameter that is commonly set to 0.5.
MAE measures the average per-pixel error by computing the mean of the absolute differences between the normalized prediction and the ground truth.
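The corresponding formula, assumed here with S the normalized prediction and Y the ground truth, is:

```latex
\mathrm{MAE} = \frac{1}{W \times H}\sum_{x=1}^{W}\sum_{y=1}^{H}\big|\,S(x, y) - Y(x, y)\,\big|
```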
Among them, W and H are the width and height of the saliency map, S is the normalized prediction, and Y is the ground truth.
During the training and testing phases, the RGB and depth images are resized to a fixed input resolution before being fed into the network.

Training loss curve.
This article compares the proposed model with eight RGB-D SOD methods: HMHINet (Li et al., 2021), DSA2F (Sun et al., 2021), SPSN (Lee et al., 2022), C2DFNet (Zhang et al., 2022b), CIRNet (Cong et al., 2022a), DCBF (Li et al., 2023b), CAVER (Pang et al., 2023), and CPNet (Hu et al., 2024). For a fair comparison, the saliency maps of these methods are either provided by their authors or obtained by running their published code with default parameters.
Quantitative Analysis
The quantitative evaluation results presented in Table 1 illustrate the performance of MHINet on the five testing datasets. Notably, MHINet outperforms all compared methods on all four metrics, achieving the lowest MAE of 0.021 on the DUT dataset. This performance is accompanied by a high F-measure of 0.956, underscoring the model's efficacy in accurately identifying salient objects.
Quantitative Comparison With Eight Different Methods on Five Testing Datasets.
On the LFSD, NJU2K, and NLPR datasets, MHINet consistently achieves top results across multiple metrics. For instance, on the LFSD dataset it outperforms the other methods, improving the F-measure by approximately 1.7% over the next best result. This dataset is particularly challenging because it contains small objects and complex backgrounds, indicating that MHINet can handle difficult scenes effectively.
Additionally, on the NJU2K dataset, MHINet achieves a notable F-measure of 0.949 and a low MAE of 0.023, further validating its performance in diverse conditions. The results on the NLPR dataset also reflect strong metrics, with an MAE of 0.015 and an F-measure of 0.943, demonstrating the model’s versatility and adaptability.
Overall, the results substantiate the effectiveness and generalizability of MHINet, suggesting its potential as a leading solution for RGB-D SOD across various environments. The robust performance across datasets emphasizes the algorithm’s capability to perform reliably in complex and challenging scenarios.
To further illustrate the performance of MHINet, Figure 5 showcases representative results across various challenging scenarios. Rows 1 and 2 highlight the model's effectiveness in cluttered backgrounds, where it successfully isolates salient objects despite distractions. Rows 3 and 4 demonstrate MHINet's ability to accurately detect large objects while maintaining detail and precision. In rows 5 and 6, the model handles complex edges adeptly, delineating intricate boundaries. Lastly, rows 7 and 8 illustrate the resilience of MHINet to low-quality depth images, where it effectively mitigates depth noise to produce more accurate and detailed saliency maps. These results affirm the model's robustness and adaptability in diverse conditions, reinforcing its applicability in real-world scenarios.

Qualitative visual comparisons of MHINet with other methods.
To evaluate the effectiveness of the main components of MHINet, ablation experiments are conducted to analyze the importance of each part. First, the effectiveness of the AFEM is verified with the following configurations: (a) baseline model, which only uses channel attention (CA) to fuse the RGB and depth features of the current layer and does not use the MFIM; (b) the AFEM uses CA to cross-fuse the RGB and depth features of the current layer; (c) the AFEM uses CA to cross-fuse the RGB and depth features of the current layer and applies the adaptive fusion strategy to the RGB and depth attention features; (d) the full AFEM used in this article, in which the current-layer features and the adjacent high-level features of the RGB and depth streams are fused.
According to the results in Table 2, the performance of model (a) is relatively low on every dataset, and its MAE is particularly high on the DUT and SIP datasets, indicating that using CA alone to fuse RGB and depth features cannot fully exploit the complementary information of the two modalities. With cross-fusion, model (b) improves over model (a) on all datasets, showing that the complementary information of the RGB and depth images is better utilized. When the adaptive fusion strategy is added in model (c), performance continues to improve on datasets such as DUT and LFSD, indicating that adaptive weighting enhances the expressiveness of the fused features; however, the gains on LFSD and NJU2K are relatively limited, possibly because the scenes in these datasets are comparatively simple, so adaptive fusion yields limited benefit. Finally, model (d), proposed in this article, additionally fuses the current-layer features with the adjacent high-level features and achieves the best results on all datasets, with the MAE dropping markedly on DUT and NLPR. The fusion of multi-level features helps the model capture richer contextual information and significantly enhances its ability to handle complex scenes. In summary, the performance improves steadily as the module configuration is completed, verifying the effectiveness of the proposed AFEM.
Ablation Results of AFEM.
To evaluate the effectiveness of the MFIM, the following experiments were conducted: (a) baseline model; (b) the MFIM uses only the DIM; (c) the MFIM uses only the UIM; (d) the DIM and UIM are used only in the lowermost and uppermost layers, respectively; (e) the DIM and UIM are used in the two middle layers, respectively; (f) the full MFIM used in this article, which applies the DIM or UIM to all layers.
According to the results in Table 3, the performance of model (a) is relatively low on all datasets, especially on SIP, where the MAE reaches 0.040, indicating that without the MFIM the model is clearly limited in capturing features at different levels, resulting in weak overall performance. In contrast, model (b) with the DIM and model (c) with the UIM both improve performance. For model (b) in particular, the F-measure and E-measure on the DUT and NJU2K datasets improve significantly, indicating that the DIM can better capture depth features in these scenes. However, the MAE of model (c) increases slightly on the LFSD and SIP datasets, possibly because the UIM does not handle complex depth information as well as the DIM. Model (d) uses the DIM in the bottom layer and the UIM in the top layer; compared with (b) and (c), its performance is stable on NJU2K and SIP but does not improve noticeably on the other datasets, indicating that applying the two interaction modules only at the outermost levels has a limited impact on overall performance. Model (e) uses the DIM and UIM in the middle layers, which further improves performance on the DUT and NJU2K datasets, with both F-measure and E-measure rising; this shows that feature interactions at the intermediate levels can more effectively capture information at different scales, making the model perform better in these complex scenes. Finally, model (f), used in this article, applies the DIM or UIM to all layers and performs best. In particular, on the NLPR and DUT datasets the MAE drops to 0.014 and 0.030, respectively, and all metrics reach their best values. This shows that applying the combined DIM and UIM strategy at every level fully exploits the complementary advantages of multi-level features, giving the model strong robustness and generalization, especially in complex scenarios. In summary, the experimental results verify the effectiveness of the proposed MFIM.
Ablation Results of MFIM.
Finally, the effectiveness of the decoder module design was also verified with the following experiments: (a) baseline model using a single-layer residual block as the decoder module; (b) the decoder module used in this article.
According to the results in Table 4, the performance of baseline model (a) is relatively weak on all datasets, especially on LFSD, where the MAE reaches 0.063, indicating that a single-layer residual block struggles to restore the edge details of the image; the baseline also performs relatively poorly on the F-measure and E-measure. The decoder module (b) used in this article yields a significant improvement: on the DUT and NJU2K datasets, the MAE drops to 0.021 and 0.023, respectively, and both F-measure and E-measure rise noticeably; on the SIP dataset, the MAE drops from 0.043 to 0.035 and the S-measure increases from 0.933 to 0.942. In short, the comparison shows that the proposed decoder module effectively reduces the noise introduced by upsampling and significantly improves the model's detail recovery and overall prediction accuracy.
Ablation Results of Decoder.
This article proposes a straightforward yet effective framework to tackle the difficulties of cross-modal feature fusion and multi-scale object detection in RGB-D SOD. First, a cross-modal AFEM is designed that dynamically adjusts the fusion weights of the RGB and depth images through a channel attention mechanism and an adaptive fusion strategy to improve feature expression. Second, an MFIM is proposed to effectively integrate high-level and low-level features and improve multi-scale object detection. In addition, the designed decoder module further improves detail recovery by introducing residual convolution blocks to reduce noise. The proposed framework achieves excellent performance on five public datasets, significantly improving saliency prediction accuracy and demonstrating the effectiveness of the method.
However, the proposed method also has some limitations. In RGB-D SOD, although depth features provide information complementary to RGB features and improve performance, the quality of depth maps is limited by the resolution and accuracy of the sensors. As a result, redundant features may arise when they are fused with RGB features, which not only wastes computational resources but also interferes with the model's ability to capture key information. Moreover, when a salient object is very small, the model may have difficulty distinguishing it from the background, especially at low resolution or when the target is far from the camera; small target sizes may prevent the model from effectively extracting salient features, leading to missed detections or false alarms.
For future work, more effective fusion strategies could be designed to avoid redundant features. Additionally, exploring higher-resolution enhancement techniques and focusing strategies for specific targets could improve the model's ability to detect small objects.
Footnotes
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Competing Interest
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
