Abstract
RGB-D salient object detection (SOD) aims to locate salient objects using paired RGB and depth images; the key challenge is integrating the two modalities effectively. This article presents MHINet, a network tailored for RGB-D SOD that comprises a dual-stream Swin Transformer encoder, an adaptive fusion enhancement module (AFEM), a multi-level feature interaction module (MFIM), and a decoder. The dual-stream Swin Transformers extract multi-level features and capture long-range dependencies better than traditional CNNs. AFEM dynamically adjusts the RGB-depth fusion ratios via channel attention, enhancing feature expression. MFIM uses middle-layer features to enable stable cross-level feature interaction, improving fusion efficiency. The decoder restores edge details via residual convolution. Experimental results on the DUT, LFSD, NJU2K, NLPR, and SIP datasets show that MHINet outperforms state-of-the-art methods, validating its cross-modal detection capability for RGB-D SOD.
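To make the idea of channel-attention-based fusion ratios concrete, the following is a minimal illustrative sketch, not the AFEM described in the article, whose exact formulation the abstract does not give. It assumes each modality's feature map is a list of channels (each a flat list of values), that a channel descriptor is obtained by global average pooling, and that the per-channel RGB and depth ratios come from a softmax over two scores. The function names and the per-channel scalars `w_rgb` and `w_depth` (standing in for learned attention weights) are hypothetical.

```python
import math


def global_avg_pool(feat):
    """Return the mean of each channel; feat is a list of flat channel lists."""
    return [sum(ch) / len(ch) for ch in feat]


def fusion_ratios(rgb, depth, w_rgb, w_depth):
    """Per-channel (rgb_ratio, depth_ratio) pairs that sum to 1.

    Assumption: a real module would use a small learned network here;
    w_rgb and w_depth are hypothetical per-channel scalars replacing it.
    """
    ratios = []
    pooled = zip(global_avg_pool(rgb), global_avg_pool(depth), w_rgb, w_depth)
    for g_r, g_d, w_r, w_d in pooled:
        s_r, s_d = w_r * g_r, w_d * g_d
        m = max(s_r, s_d)  # subtract the max for a numerically stable softmax
        e_r, e_d = math.exp(s_r - m), math.exp(s_d - m)
        ratios.append((e_r / (e_r + e_d), e_d / (e_r + e_d)))
    return ratios


def adaptive_fuse(rgb, depth, w_rgb, w_depth):
    """Blend RGB and depth features channel by channel using the ratios."""
    ratios = fusion_ratios(rgb, depth, w_rgb, w_depth)
    return [
        [a_r * r + a_d * d for r, d in zip(ch_r, ch_d)]
        for (a_r, a_d), ch_r, ch_d in zip(ratios, rgb, depth)
    ]


# Toy input: 2 channels, 2 spatial positions per channel.
rgb_feat = [[1.0, 3.0], [0.0, 0.0]]
depth_feat = [[1.0, 3.0], [2.0, 4.0]]
weights = [1.0, 1.0]
fused = adaptive_fuse(rgb_feat, depth_feat, weights, weights)
```

In this toy input, the first channel has identical descriptors in both modalities, so the two are blended equally; in the second channel the depth response is stronger, so the depth ratio dominates. A trained module would learn the scoring function rather than use fixed scalars.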
