Sage Journals: Discover world-class research

Abstract

Transparent objects are ubiquitous in everyday life, but how to detect them is full of challenges. Transparent objects hardly reflect light, and they usually transmit the appearance of their surroundings, making it difficult to distinguish them from their surroundings. Existing methods usually use only RGB (Red Green Blue) images as input, ignoring the role of depth maps in transparent object detection. In this article, we try to improve the detection performance of transparent objects by fusing RGB and depth information. Specifically, we propose a multimodal fusion network that fuses RGB and depth modalities in a complementary way. Moreover, extensive experiments and ablation studies on the RGB-D (RGB-Depth) transparent object dataset demonstrate the excellent performance of our method.

Keywords

Transparent object detection RGB image depth information multimodal fusion

Introduction

Transparent objects are widely found in daily life, such as mineral water bottles, cups, measuring cylinders, test tubes, etc. Detecting these objects is very important in many practical applications. For example, when intelligent robots are grasping transparent objects such as test tubes and mineral water bottles, they first need to sense these objects. In addition, when intelligent robots are performing navigation and tracking tasks, they need to avoid crashing into transparent objects such as glass doors, windows, and walls. However, the difficulty of distinguishing transparent objects from the background makes detecting and segmenting transparent objects challenging.

There are few studies dedicated to transparent object detection and segmentation, and the mainstream methods for transparent object detection and segmentation are based on RGB information,^1,2 and they detect and segment transparent objects by their boundary cues. However, inaccurate results may occur when the light is dim or the boundaries of transparent objects are not clear.

Due to the unique visual properties of transparent objects, the depth map acquired by the depth camera has a depth missing, as shown in Figure 1. However, some recent studies^4,5 have shown that the depth discontinuity in the depth map of transparent objects makes the transparent objects have different semantic features from the background. From this perspective, we address the limitations of RGB-based transparent object detection and segmentation methods by fusing RGB features and depth features.

Figure 1.

Original depth maps of transparent object, from the TODD³ dataset.

There have been a number of different proposals for the fusion of RGB and depth maps.^6–9 For example, fusing RGB and depth information via BBS-Net,⁶ calibrating depth maps with RGB maps and fusing them via a depth calibration module and a cross-reference module,⁷ and augmenting RGB features and fusing them with depth maps via the DSA²F framework.⁸ However, there are still some challenges in the existing RGB-D fusion methods. The first challenge is the complementary fusion of multimodal features. RGB images contain color and texture information, while depth maps contain the distance between objects and shape of objects. The second challenge is the effective fusion of multi-level features. Low-level features provide rich details that facilitate refinement of segmentation boundaries but contain inherent noise. High-level features contain global information, which is good for object localization, but lack detail information.

To solve the above problem, we propose a new multimodal feature fusion network (MFFNet) to obtain better performance for transparent object detection. As shown in Figure 2, our MFFNet uses an encoding–decoding network architecture¹⁰ with ResNet¹¹ as backbone in the encoder. And a new feature fusion enhancement module (FFEM) is developed to fuse and enhance RGB and depth features. Our main contributions are summarized as follows:

We designed a dual-stream encoding–decoding framework for MFFNet to handle the detection of common transparent objects.

We introduce FFEM to fuse RGB features and depth features in a complementary way.

Comparisons with other methods on publicly available datasets demonstrate the excellent performance of our method.

Figure 2.

The overall architecture of MFFNet.

The rest of this article is structured as follows. The second section reviews related work. The third section describes our network in detail. The fourth section gives the experimental results and discussion. The last section gives a summary of the work and the outlook.

Related work

Semantic segmentation

Since Long et al.¹² proposed FCN in 2014, most of the current state-of-the-art semantic segmentation methods are based on deep learning approaches. They changed the then classification networks AlexNet,¹³ VGG-Net,¹⁴ and GoogleNet¹⁵ to fully convolutional networks for segmentation tasks. Ronneberger et al.¹⁶ proposed to solve the semantic segmentation problem in biomedical field by using U-Net network with skip connection. Badrinarayanan et al.¹⁰ proposed SegNet to introduce the encoding–decoding architecture to semantic segmentation for the first time. The Google team released four versions of the DeepLab series^17–20 from 2015 to 2018. However, all the above algorithms are trained and tested based on RGB images.

With the increasing use of depth cameras and thermal imaging cameras, various semantic segmentation algorithms based on RGB-D and RGB-T (RGB-Thermal) have been proposed. Deng et al.²¹ proposed a two-stage FEANet with feature enhancement module to enhance features and fuse RGB and thermal information in a complementary form. Sun et al.²² proposed a RTFNet that fuses RGB and thermal information and designed a new decoder to restore the resolution. Zhang et al.²³ proposed a semantic segmentation method that first reduces modal differences and then fuses them. Hu et al.²⁴ proposed ACNet to complementarily fuse RGB and depth information by considering the difference of RGB and depth feature distributions. Jiang et al.²⁵ proposed an RGB-D residual encoding–decoding structure RedNet and solved the vanishing gradient problem.

Transparent object detection

Xie et al.¹ collected Trans10 K, the large-scale transparent object dataset, and proposed a network TransLab that uses boundary cues to improve the segmentation performance of transparent objects. Later, Trans10K-v2 dataset²⁶ was proposed by extending Trans10 K,¹ and Trans2Seg, a transparent object segmentation network based on Transformer was proposed. DeepLabV3+²⁰ with DRN-D-54 as backbone was used in ClearGrasp²⁷ for segmentation of transparent objects. Mei et al.²⁸ constructed a large-scale glass detection dataset GOD and proposed a glass detection network GDNet. He et al.² solved the segmentation problem of glass-like objects by modeling their boundaries with graph convolutional networks. Lin et al.⁵ observed that the depth image generated by the depth camera produces gaps on the surface of transparent objects and used it as a supplement to the RGB image for glass object surface detection. Sun et al.²⁹ collected TROSD, a large-scale dataset containing transparent and reflective objects, and proposed TROSNet, a transparent and reflective object segmentation network. Zhang et al.³⁰ proposed Trans4Trans, a lightweight transparent object segmentation network that can be easily deployed to wearable devices.

Proposed method

In this section, we introduce the overall architecture of MFFNet, which comprises two feature extraction networks, an FFEM and an output decoding network. To fully fuse the complementary features from RGB and depth features, we propose FFEM to fuse and enhance multimodal features to achieve good performance in transparent object detection. We describe in detail the FFEM that fuses RGB features and depth features.

Overall architecture

Figure 2 shows the overall framework of our proposed RGB-D transparent object detection, which consists of three parts: a dual-stream encoder, a FFEM and a decoder. The dual-stream encoder extracts different levels of features ${F_{i}^{R G B}}_{i = 0}^{4}$ and ${F_{i}^{D e p t h}}_{i = 0}^{4}$ from three-channel RGB images and single-channel depth maps. High-level features (e.g., $F_{3}^{R G B}$ and $F_{4}^{R G B}$ ) are rich in semantic information, which is beneficial for object localization, but there is a large amount of detail information loss. Low-level features (e.g., $F_{0}^{R G B}$ and $F_{1}^{R G B}$ ) contain detailed details that help refine the boundaries of objects, but contain more noise interference information. We fuse and enhance different levels of RGB features ${F_{i}^{R G B}}_{i = 1}^{4}$ and depth features ${F_{i}^{D e p t h}}_{i = 1}^{4}$ extracted by the dual-stream encoder by the proposed FFEM. The decoder restores the resolution to 352 × 352 by upsampling, and refines the boundaries of transparent object detection by fusing multi-level features.

Encoder

In our proposed MFFNet, multi-level RGB and depth map features are extracted in the dual-stream encoder, and the process is shown in Figure 3. The figure shows that the second column extracts low-level features that contain detailed edges, colors, textures, and so on. And the model begins to extract higher-level features that increasingly contain rich semantic information as the depth of the network increases. Specifically, we use ResNet-50¹¹ as the backbone for RGB and depth map feature extraction. Since the output of the first stage (Layer0) contains more raw noise, we only use the output of the last four stages (Layer1, Layer2, Layer3, Layer4). We add an FFEM that we designed after the output of each stage to fuse and enhance multimodal features. The enhanced RGB features and depth features are used as inputs for the next stage. ResNet-50¹¹ is designed for three-channel RGB image feature extraction, while the depth map is a single-channel map. Therefore, in the depth feature extraction network, we modify the number of input channels of the first convolution to one, so that it is applicable for a single-channel depth map. Finally, Layer4 is modified to atrous convolution²⁰ to maintain the same output dimension as Layer3, and the final pooling layer and full connection layer of ResNet-50¹¹ are deleted.

Figure 3.

RGB images and depth maps for different level features.

Feature fusion enhance module

How the complementary fusion of RGB features and depth features is our focus. RGB and depth map have color texture information and position information. Some differences that are difficult to find in RGB image can be easily found in depth map. The depth map is sent to the dual-stream encoder along with the RGB image to generate multi-level features ${F_{i}^{R G B}}_{i = 0}^{4}$ and ${F_{i}^{D e p t h}}_{i = 0}^{4}$ . We have dropped the output features $F_{0}^{R G B}$ and $F_{0}^{D e p t h}$ , which have the highest resolution in the dual-stream encoder, to reduce the impact of noise. In order to fuse the output features of the dual-stream encoder, we design the FFEM, as shown in Figure 4. The input of FFEM includes RGB feature ${F_{i}^{R G B}}_{i = 1}^{4}$ and depth feature ${F_{i}^{D e p t h}}_{i = 1}^{4}$ , and the output includes RGB enhanced feature ${{\hat{F}}_{i}^{R G B}}_{i = 1}^{3}$ , depth enhanced feature ${{\hat{F}}_{i}^{D e p t h}}_{i = 1}^{3}$ and fused feature ${F_{i}^{F u s e d}}_{i = 1}^{4}$ .

Figure 4.

The architecture of FFEM.

We designed the FFEM to extract more detailed and complementary features from the input RGB image and the depth map. More specifically, the output features $F_{i}^{R G B}$ and $F_{i}^{D e p t h}$ of the ith stage of the RGB feature extraction stream and the depth feature extraction stream are used as inputs. For low-level features ( $F_{1}^{R G B}$ , $F_{1}^{D e p t h}$ , $F_{2}^{R G B}$ and $F_{2}^{D e p t h}$ ), we use the spatial attention to fuse them because of their small semantic differences between different channels. First, max pooling (MP) is used to reduce the dimensionality of RGB features and depth features. These two features are then fed into a 7 × 7 convolutional layer (Conv) and activation function $δ (\cdot)$ , to obtain the spatial attention weights $S_A t t_{i}$ . The process can be defined as follows:

S_A t t_{i} = δ (C o n v (M a x p o o l i n g (F_{i}))), i = 1, 2

(1)where Maxpooling(·) denotes max pooling. Next, the result of multiplying the spatial attention weights with the input features is added to the input features to obtain the enhanced features

{\hat{F}}_{i}

. The process can be defined as:

{\hat{F}}_{i} = S_A t t_{i} \cdot F_{i} + F_{i}, i = 1, 2

(2)Because of the high semantic extraction of high-level features (

F_{3}^{R G B}

F_{4}^{R G B}

F_{3}^{D e p t h}

and

F_{4}^{D e p t h}

), we use the channel attention to fuse them. The global average pooling (AP) is first used to obtain the global receptive fields of the RGB features and the depth features. Then these two features are fed into the fully connected layer (FC) and the activation function

δ (\cdot)

, to obtain the channel attention weight

C_A t t_{i}

. The process can be defined as follows:

C_A t t_{i} = δ (M L P (A v g p o o l i n g (F_{i}))), i = 3, 4

(3)where Avgpooling(·) denotes global average pooling. The process of obtaining high-level enhanced features is the same as that of low-level features, which is:

{\hat{F}}_{i} = C_A t t_{i} \cdot F_{i} + F_{i}, i = 3

(4)Finally, the RGB features and depth features are multiplied with their attention weight, and the concatenated features are fed into 1 × 1 convolution for dimensionality reduction to obtain the fused features. The process is defined as:

F_{i}^{F u s e d} = {\begin{matrix} C o n v (c a t (S_A t t_{i}^{R G B} \cdot F_{i}^{R G B}, S_A t t_{i}^{D e p t h} \cdot F_{i}^{D e p t h})), i = 1, 2 \\ C o n v (c a t (C_A t t_{i}^{R G B} \cdot F_{i}^{R G B}, C_A t t_{i}^{D e p t h} \cdot F_{i}^{D e p t h})), i = 3, 4 \end{matrix}

(5)

Decoder

After fusing the multi-level RGB features and depth features from the dual-stream encoder, the multi-level fused features are fed to the decoder to restore the resolution to the input image. Since high-level features help to localize transparent objects and low-level features help to refine the boundaries of transparent objects, our decoder is designed to efficiently utilize multi-level features for mask refinement.

Specifically, we utilize all levels of fused features ${F_{i}^{F u s e d}}_{i = 1}^{4}$ of RGB features and depth features. First, to obtain more global information, we feed the high-level features $F_{3}^{F u s e d}$ and $F_{4}^{F u s e d}$ into the ASPP¹⁶ module. Then, to reduce the importance of the $F_{1}^{F u s e d}$ , $F_{2}^{F u s e d}$ and $F_{3}^{A S P P}$ features, we reduce their dimensions using 1 × 1 convolution. The process can be defined as:

{\hat{F}}_{i}^{F u s e d} = {\begin{matrix} C o n v (F_{i}^{F u s e d}), i = 1, 2 \\ C o n v (A S P P (F_{i}^{F u s e d})), i = 3 \end{matrix}

(6)Secondly, to concatenate features of different levels, we use bilinear interpolation algorithm for upsampling to make adjacent features have the same size. Finally, the concatenated features are fed into three 3 × 3 convolution for processing and upsampled to restore them to the resolution of the input image (352 × 352). The process can be defined as:

F_{i - 1}^{P r e d} = {\begin{matrix} c a t ({\hat{F}}_{i - 1}^{F u s e d}, U P_{2} ({\hat{F}}_{i}^{F u s e d})), i = 2, 3 \\ c a t ({\hat{F}}_{i - 1}^{F u s e d}, {\hat{F}}_{i}^{F u s e d}), i = 4 \end{matrix}

(7)

O u t p u t = U P_{4} (T (F_{1}^{P r e d}))

(8)where Output denotes the final output and T(·) denotes the final set of convolutional modules.

Loss functions

To obtain good performance of our network, we used the SoftCrossEntropy Loss²¹ and Dice Loss³¹ to supervise the training of transparent object detection and weighted them to obtain the final loss function. SoftCrossEntropy Loss in the case of binary classification can be written as:

L_{S} = \frac{1}{N} \sum_{i}^{N} - [{\hat{y}}_{i} \log (y_{i}) + (1 - {\hat{y}}_{i}) \log (1 - y_{i})]

(9)where

{\hat{y}}_{i}

denotes ground true, y_i denotes the predicted value, and N is the total number of pixel points.

Dice Loss is calculated as follows:

L_{D} = 1 - \frac{2 \sum_{i}^{N} y_{i} {\hat{y}}_{i}}{\sum_{i}^{N} y_{i} + \sum_{i}^{N} {\hat{y}}_{i}}

(10)where

{\hat{y}}_{i}

denotes ground true, y_i denotes the predicted value, and N is the total number of pixel points.

The total loss function is:

L = α L_{S} + (1 - α) L_{D}

(11)where

α \in [0, 1]

is the weight.

Experimental verification

Datasets

To evaluate the performance of our proposed MFFNet, we chose the RGB-D dataset of transparent objects in TODD³ for validation. The TODD dataset contains a total of 14,659 images, with 10,302 images in the training set and 4357 images in the validation and test sets according to the official classification.

Evaluation metrics

For a comprehensive evaluation, we used not only the Intersection-over-Union (IoU), a common metric for semantic segmentation, but also the F-measure, mean absolute error (MAE), and balance error rate (BER), common metrics for saliency target detection and shadow detection. In calculating the IoU, we only calculate the IoU values of transparent objects and ignore the background. The IoU is calculated as:

I o U = \frac{T P}{T P + F P + F N}

(12)where TP indicates that the predicted result is positive class and the true result is also positive class; FP indicates that the predicted result is positive class and the true result is negative class; FN indicates that the predicted result is negative class and the true result is positive class.

MAE is calculated as follows:

M A E = \frac{1}{W \times H} \sum_{x = 1}^{W} \sum_{y = 1}^{H} | P (x, y) - Y (x, y) |

(13)where P is the predicted value, Y is the true value, and W and H are the width and height of the image.

BER is calculated as follows:

B E R = (1 - \frac{1}{2} (\frac{T P}{T P + F N} + \frac{T N}{T N + F P})) \times 100

(14)where TN indicates that the predicted result is in the negative category and the true result is also in the negative category.

The F-measure is calculated as:

F_{β} = \frac{(1 + β^{2}) P r e c i s i o n \times R e c a l l}{β^{2} P r e c i s i o n + R e c a l l}

(15)where

P r e c i s i o n = T P / T P + F P

R e c a l l = T P / T P + F N

. Following the suggestion in [32], we set β = 0.3 to place more emphasis on precision than recall.

Implementation details

We implemented MFFNet with PyTorch³³ and trained it using an RTX 3060 GPU with 12GB memory starting from pre-trained weights on ImageNet. All images used for training and testing are set to 352 × 352. Throughout the entire training process, the initial learning rate is 1 × 10⁻⁴, the optimizer is chosen as the Adam optimizer, the learning rate decreases to 0.1 times the original after every 7 epochs, and the Batch Size is set to 8. Empirically the weight $α$ in the loss function is set to 0.2 for training.

Comparative experiments

Our proposed method is compared with other semantic segmentation, saliency target detection, and transparent object detection/segmentation methods, as shown in Table 1. We chose DeepLabv3+ (ResNet50),²⁰ DeepLabv3+ (Xception),²⁰ DeepLabv3+ (MobileNet),²⁰ U-Net,¹⁶ FCN,¹² ICNet,³⁴ CSNet,³⁵ PGSNet,³² GDNet²⁸ methods. We retrained each model on the training set of the TODD dataset and evaluated it on the test set. To be fair, we set the size of all images used for training and testing to 352 × 352.

Table 1.

Quantitative comparison of results on test datasets using IoU, MAE, BER, and F _β .

Methods	IoU↑	MAE↓	BER↓	F _β ↑
ICNet³³	64.89	0.03482	10.70	0.7714
CSNet³⁴	71.73	0.02708	7.79	0.8149
U-Net¹⁶	77.51	0.02035	6.67	0.8730
GDNet²⁸	78.81	0.01953	5.23	0.8598
DeepLabv3+ (Xception)²⁰	79.15	0.01889	5.48	0.8832
FCN¹²	79.48	0.01839	5.91	0.8799
DeepLabv3+ (MobileNet)²⁰	79.55	0.01848	5.43	0.8857
PGSNet³⁵	79.68	0.01775	5.17	0.8738
DeepLabv3+ (ResNet50)²⁰	81.68	0.01637	4.78	0.8987
MFFNet(Ours)	82.79	0.01520	4.62	0.9057

Table 1 shows quantitative comparisons with other methods. Compared with other methods, our method has obtained optimal results in IoU, MAE, BER, and F _β (see bold values). And some comparison results with other methods are visualized, as shown in Figure 5. It can be seen from the figure that our method obtains a clearer boundary and a more complete mask.

Figure 5.

Visual comparison of MFFNet to other methods.

Ablation study

First, to verify the role of depth information in transparent object detection, we conducted an ablation study of the input RGB and depth information. Table 2 compares the results for different modal inputs: (1) MFFNet-R means that we use the RGB image as the input to the RGB channel and the all-zero tensor as the input to the depth channel; (2) MFFNet-D means that we use the depth map as the input to the depth channel and the all-zero tensor as the input to the RGB channel.

Table 2.

Comparison of different input modes on the test datasets.

Methods	IoU↑	MAE↓	BER↓	F_β↑
MFFNet-R	82.64	0.01533	4.70	0.9047
MFFNet-D	52.91	0.04136	21.37	0.6898
Ours	82 . 79	0.01520	4.62	0.9057

From the results in Table 2, we can find that the depth information is helpful for the detection of transparent objects. Figure 6 visualizes some comparison results.

Figure 6.

A visual example of ablation study of the input RGB and depth information.

Finally, we performed an ablation study for each module to verify its effectiveness. Table 3 compares the results of different modules: (1) Baseline means that FFEM is deleted, and for RGB features and depth features, we directly fuse them by adding, and for encoder, we only use $F_{4}^{F u s e d}$ ; (2) Baseline-CD means that we use our designed encoder on Baseline; (3) Baseline-FM means that we have used FFEM on Baseline.

Table 3.

Comparison of different modules on the test datasets.

Methods	IoU↑	MAE↓	BER↓	F_β↑
Baseline	81.46	0.0166	4.79	0.8976
Baseline-CD	82.14	0.01586	4.81	0.9017
Baseline-FM	82.15	0.01589	4 . 61	0.9018
Ours	82.79	0.01520	4.62	0.9057

The results in Table 3 show that our final model exhibits optimal performance on all four metrics. Figure 7 visualizes some of the comparison results, and we can see that both FFEM and encoder we designed can help us improve detection performance.

Figure 7.

A visual example of ablation study of different modules.

Robot grasping

The performance of MFFNet in robot operation can be verified by combining MFFNet with a robot grasping task. After detecting a transparent object, an end-to-end network was used to directly regress the 6-degrees-of-freedom grasping pose. Finally, a panda robot arm was used to perform the transparent object-grasping task, as shown in Figure 8.

Figure 8.

Robot grasping process.

Conclusion

In this thesis, we propose a multimodal fusion framework for transparent object detection. The framework employs a dual-stream encoder–decoder-based framework, and the three components: the feature extraction backbone network, the FFEM, and the decoder are described step by step. Comparison with other deep learning-based methods is performed on the challenging RGB-D transparent object dataset. The IoU of our method is improved by 5% compared to the glass detection method GDNet and by 3.9% compared to PGSNet. Compared to the semantic segmentation method DeepLabv3+, the IoU value of our method is improved by 1.36%. However, the current RGB-D transparent object public dataset only includes some simple transparent objects and a single scene, and we will collect a new large-scale RGB-D transparent object dataset as future work.

Footnotes

Authors contributions

Li Zhu and Tuanjie Li proposed the main idea of this article, and Li Zhu also wrote the programming related to multimodal feature fusion network. Yuming Ning debugs the relevant programs under the PyTorch framework. Yan Zhang revised the whole manuscript and gave detailed suggestions related to the experiments. All persons who have made substantial contributions to the work reported in the manuscript, including those who provided editing and writing assistance.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported in part by the National Natural Science Foundation of China under Grant 51775403, the Proof-of-Concept Foundation of Xidian University Hangzhou Institute of Technology under Grant GNYZ2023QC0404, and the Fundamental Research Funds for the Central Universities under Grant YJSJ24001.

ORCID iD

Tuanjie Li

References

Xie

Wang

, et al. Segmenting transparent objects in the wild. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020, pp. 696–711.

Cheng

, et al. Enhanced boundary learning for glass-like object segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021, pp. 15859–15868.

Wang

Eppel

, et al. Seeing glass: joint point cloud and depth completion for transparent objects. In: Conference on Robot Learning, 2021, pp. 827–838

Seib

Barthen

Marohn

, et al. Friend or foe: exploiting sensor failures for transparent object localization and classification. In: 2016 SPIE International Conference on Robotics and Machine Vision, Moscow, Russia, 14–16 September 2017, 10253, pp. 94–98.

Lin

Yeung

Lau

RWH

. Depth-aware glass surface detection with cross-modal context mining. arXiv preprint arXiv:2206.11250, 2022.

Fan D

Zhai

Borji

, et al. BBS-Net: RGB-D salient object detection with a bifurcated backbone strategy network. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020, pp. 275–292.

, et al. Calibrated RGB-D salient object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021, pp. 9471–9481.

Sun

Zhang

Wang

, et al. Deep RGB-D saliency detection with depth-sensitive attention and automatic multi-modal fusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021, pp. 1407–1417.

Liu

Tan

, et al. SwinNet: swin transformer drives edge-aware RGB-D and RGB-T salient object detection. IEEE Trans Circuits Syst Video Technol. 2021; 32: 4486–4497.

10.

Badrinarayanan

Kendall

Cipolla

. SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans Pattern Anal Mach Intell. 2017; 39: 2481–2495.

11.

Zhang

Ren

, et al. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016, pp. 770–778.

12.

Long

Shelhamer

Darrell

. Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015, pp. 3431–3440.

13.

Krizhevsky

Sutskever

Hinton

. ImageNet classification with deep convolutional neural networks. Commun ACM 2017; 60: 84–90.

14.

Simonyan

Zisserman

. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

15.

Szegedy

Liu

Jia

, et al. Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015, pp. 1–9.

16.

Ronneberger

Fischer

Brox

. U-Net: convolutional networks for biomedical image segmentation. Med Image Comput Comput Assist Interv, Munich, Germany, 5–9 October 2015, pp. 234–241.

17.

Chen L

Papandreou

Kokkinos

, et al. Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv preprint arXiv:1412.7062, 2014.

18.

Chen L

Papandreou

Kokkinos

, et al. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans Pattern Anal Mach Intell. 2017; 40: 834–848.

19.

Chen

Papandreou

Schroff

, et al. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.

20.

Chen

Zhu

Papandreou

, et al. Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018, pp. 801–818.

21.

Deng

Feng

Liang

, et al. FEANet: feature-enhanced attention network for RGB-thermal real-time semantic segmentation. In: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021, pp. 4467–4473.

22.

Sun

Zuo

Liu

. RTFNet: RGB-thermal fusion network for semantic segmentation of urban scenes. IEEE Robot Autom Lett 2019; 4: 2576–2583.

23.

Zhang

Zhao

Luo

, et al. ABMDRNet: adaptive-weighted bi-directional modality difference reduction network for RGB-T semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021, pp. 2633–2642.

24.

Yang

Fei

, et al. ACNet: attention based network to exploit complementary features for RGBD semantic segmentation. In: 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019, pp. 1440–1444.

25.

Jiang

Zheng

Luo

, et al. RedNet: Residual encoder-decoder network for indoor RGB-D semantic segmentation. arXiv preprint arXiv:1806.01054, 2018.

26.

Xie

Wang

, et al. Segmenting transparent object in the wild with transformer. arXiv preprint arXiv:2101.08461, 2021.

27.

Sajjan

Moore

Pan

, et al. ClearGrasp: 3D shape estimation of transparent objects for manipulation. In: 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020, pp. 3634–3642.

28.

Mei

Yang

Wang

, et al. Don't hit me! glass detection in real-world scenes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020, pp. 3687–3696.

29.

Sun

Zhang

Yang

, et al. TROSD: a new RGB-D dataset for transparent and reflective object segmentation in practice. IEEE Trans Circuits Syst Video Technol. 2023; 33: 5721–5733.

30.

Zhang

Yang

Constantinescu

, et al. Trans4Trans: efficient transformer for transparent object segmentation to help visually impaired people navigate in the real world. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021, pp. 1760–1770.

31.

Milletari

Navab

Ahmadi

. V-Net: fully convolutional neural networks for volumetric medical image segmentation. In: 2016 IEEE Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–26 October 2016, pp. 565–571.

32.

Mei

Dong

, et al. Progressive glass segmentation. IEEE Trans Image Process. 2022; 31: 2920–2933.

33.

Paszke

Gross

Massa

, et al. PyTorch: an imperative style, high-performance deep learning library. Adv Neural Inf Process Syst 2022; 721: 8026–8037.

34.

Zhao

Shen

, et al. ICNet for real-time semantic segmentation on high-resolution images. In: Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018, pp. 405–420.

35.

Cheng M

Gao S

Borji

, et al. A highly efficient model to study the semantics of salient object detection. IEEE Trans Pattern Anal Mach Intell. 2021; 44: 8006–8021.

MFFNet: Multimodal feature fusion network for RGB-D transparent object detection

Abstract

Keywords

Introduction

Related work

Semantic segmentation

Transparent object detection

Proposed method

Overall architecture

Encoder

Feature fusion enhance module

Decoder

Loss functions

Experimental verification

Datasets

Evaluation metrics

Implementation details

Comparative experiments

Ablation study

Robot grasping

Conclusion

Footnotes

Authors contributions

Declaration of conflicting interests

Funding

ORCID iD

References