Abstract
Despite advances in red–green–blue-depth (RGB-D)-based six degree-of-freedom (6D) pose estimation methods, severe occlusion remains challenging. To address this issue, we propose a novel feature fusion module that can efficiently leverage the color and geometry information in RGB-D images. Unlike prior fusion methods, our method employs a two-stage fusion process. Initially, we extract color features from RGB images and integrate them into a point cloud. Subsequently, a network similar to the anisotropic separable set abstraction network (ASSANet) is used to process the fused point cloud, extracting both local and global features, which are then combined to generate the final fusion features. Furthermore, we introduce a lightweight color feature extraction network to reduce model complexity. Extensive experiments conducted on the LineMOD, Occlusion LineMOD, and YCB-Video datasets demonstrate that our method significantly enhances prediction accuracy, reduces training time, and exhibits robustness to occlusion. Further experiments show that our model is significantly smaller than the latest popular 6D pose estimation models, indicating that it is easier to deploy on mobile platforms.
Introduction
Six degree-of-freedom (6D) object pose estimation aims to estimate the rigid transformation from the coordinate system of an object to the coordinate system of the employed camera. It is one of the core steps in a wide range of robotic tasks, such as grasping objects 1 and operating tools. 2 A key challenge in 6D object pose estimation is occlusion, where objects of interest are partially or fully hidden, hindering accurate pose estimation due to missing or obscured features. Handling such common real-world scenarios is vital for robust pose estimation models. The occlusion problem has historically been approached in various ways. Early methods3,4 relied on local feature matching, which could handle occlusion to some extent but was limited by viewpoint changes and performed poorly on texture-less objects. With the rapid development of deep learning in computer vision, several works use convolutional neural networks (CNNs) on red–green–blue (RGB) images to address the occlusion problem. Two main categories of approaches have been developed: one-stage5,6 and two-stage methods.7–9 The former directly regress the 6D pose of an object from input images, while the latter first establish three-dimensional to two-dimensional (3D to 2D) correspondences and then use the Perspective-n-Point (PnP) algorithm to calculate the 6D pose of the object of interest. Two-stage methods generally have stronger robustness to occlusion, with a representative method being the Pixel-wise Voting Network (PVNet). 7 It first uses a CNN to predict the directional vector from each pixel to the key points and then acquires the positions of the key points through vector voting. This voting mechanism enables it to learn the relationships between different parts of an object, thereby allowing occluded key points to be reliably recovered from visible parts. Despite having higher accuracy than traditional methods, these RGB-based methods have limited performance in occluded scenes because they lack depth information.
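As a concrete illustration of the two-stage idea (not of the method proposed in this paper), the following sketch recovers a pose with OpenCV's PnP solver from 2D-3D keypoint correspondences; the keypoints, ground-truth pose, and camera intrinsics are hypothetical placeholders used only to make the snippet self-contained.

```python
import cv2
import numpy as np

# Hypothetical 3D keypoints on the object model (object frame, metres).
object_points = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0],
                          [0.0, 0.0, 0.1], [0.1, 0.1, 0.0], [0.1, 0.0, 0.1]])
K = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])  # placeholder intrinsics

# Simulate keypoint detections by projecting with a known ground-truth pose.
rvec_gt = np.array([0.1, -0.2, 0.3])
tvec_gt = np.array([0.05, -0.02, 0.6])
image_points, _ = cv2.projectPoints(object_points, rvec_gt, tvec_gt, K, None)

# Second stage: solve PnP on the 2D-3D correspondences to recover the 6D pose.
ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, None)
R, _ = cv2.Rodrigues(rvec)  # rotation matrix; tvec is the translation vector
```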
With the emergence of consumer-level RGB-depth (RGB-D) cameras, more recent approaches have turned to RGB-D images to address this problem. Incorporating depth information enables these methods to outperform those that only use RGB images in occluded environments. Early works10–12 use depth information to refine previous estimations with the iterative closest point (ICP) algorithm or directly use depth information as an additional channel in their network architectures for RGB images. However, ICP is time-consuming, and due to the difference between the depth feature space and the color feature space, the depth information is not fully utilized. More recently, DenseFusion 13 uses different networks to extract color and geometry information and then densely fuses the extracted results at the pixel level to perform pixel-wise pose estimation, achieving great performance. However, this approach merely extracts features from each point individually, discarding the local information in the point cloud. Furthermore, Zhou et al. 14 and He et al. 15 used PointNet++ 16 to extract the local geometric features of the object, but all neighbor processing operations in PointNet++ are isotropic, which limits the geometric feature extraction performance of the network. Despite the decent achievements of 6D pose estimation methods based on RGB-D images, a persistent challenge lies in effectively leveraging the color information in the RGB images and the geometric information in the depth images. Indeed, this is a pivotal aspect of tackling 6D object pose estimation under occlusion.
Another challenge faced by deep learning-based 6D pose estimation algorithms is that they often have many model parameters and high computational complexity, which limits their deployment on mobile platforms with limited computing resources and small storage spaces. In addition, these methods often require hundreds of epochs to converge, which increases their time costs in practical applications.
In this paper, we introduce an end-to-end 6D object pose estimation method with a two-stage feature fusion module that can efficiently leverage the color and geometry information in RGB-D images. We observe that when viewing an object under heavy occlusion, human beings can simultaneously perceive the object’s appearance and geometric information and infer the whole pose from a local part. Therefore, we first extract color features from RGB images and integrate them into the target point cloud to obtain a point cloud with rich appearance information. Then, we process the point cloud with an anisotropic separable set abstraction network (ASSANet)-like network to obtain several local features and one global feature. The global feature is concatenated with each local feature to obtain the fusion features, and each fusion feature independently predicts a 6D pose with an associated confidence score. We choose the pose with the highest confidence as the final output. Finally, the pose is further improved via the iterative refinement module in Wang et al. 13 To further reduce the model size, we design a lightweight network based on a modified ShuffleNet V2 for color feature extraction. The proposed method is evaluated on three benchmark datasets: LineMOD, 17 Occlusion LineMOD, 18 and YCB-Video. 11 The experimental results show that our method exhibits better performance than competing approaches on all datasets. Moreover, our model is much smaller than current popular 6D pose estimation methods, highlighting its suitability for deployment on mobile platforms.
The contributions of this work can be summarized as follows:
1. We design a two-stage fusion module that can fully leverage the two complementary data sources in RGB-D images, making it robust to heavy occlusion in 6D object pose estimation tasks.
2. We propose a lightweight color feature extraction network that significantly reduces our model’s size and computational cost.
3. The experimental results obtained on the LineMOD, Occlusion LineMOD, and YCB-Video datasets show that our method achieves significantly boosted performance.
Related work
Pose estimation with RGB data
Since RGB images are easy to obtain, many studies use only RGB images for pose estimation. Traditional methods1,19 generally perform pose estimation based on key point extraction and matching algorithms. Such methods are often sensitive to cluttered backgrounds and lighting changes. With the great success of deep learning technology in 2D vision, many works have used deep learning to estimate poses from RGB images. PoseNet 20 was the first method to use a CNN to solve the 6D pose estimation problem and adapts well to complex environments. SSD-6D 6 discretizes the rotation space into a classifiable set of viewpoints and transforms the 6D pose estimation task into a classification problem. However, these two methods are highly dependent on time-consuming pose optimization operations to improve their performance. In contrast, EfficientPose 21 introduces the EfficientDet 22 structure to construct a pose estimation framework that directly predicts the 6D poses of objects. The utilization of a novel 6D augmentation strategy allows it to achieve better performance without optimization. Deep-6DPose 5 adds a pose estimation branch to a mask region-based CNN 23 to directly estimate the 6D poses of objects without any postprocessing.
Compared with directly predicting 6D object poses, establishing a 2D to 3D correspondence and using it as an indirect representation of the pose often yields more accurate results. BB8 24 predicts the 2D projections of the vertices of a 3D bounding box to construct the 2D to 3D correspondence. Similar to BB8, YOLO-6D 25 uses a single-shot neural network to directly detect the key points of an object. PVNet 7 adopts a voting-based key point localization strategy: it first trains a CNN to predict the direction vector from each pixel to the key points and then uses the vectors of pixels belonging to the target object to vote on the key point positions. Yu et al. 26 proposed an effective loss based on PVNet that achieves more accurate vector field estimation by incorporating the distances between pixels and key points into the training objective. However, these methods cannot be trained in an end-to-end manner. Therefore, Single-stage 27 and GDR-Net 28 both establish 2D to 3D correspondences but attempt to learn the PnP step in an end-to-end fashion. Nevertheless, due to the loss of geometric information caused by perspective projection, RGB-based methods are sensitive to illumination changes, serious occlusion, and cluttered backgrounds.
Pose estimation with depth/point clouds
The emergence of inexpensive depth sensors has led to methods that use point clouds or depth data. These methods take point clouds or depth images as inputs and utilize 3D CNNs or point cloud networks for geometric feature extraction to estimate 6D poses. VoxelNet 29 and Frustum PointNets 30 are both PointNet-like structures that have achieved great performance on the KITTI benchmark dataset. Zhang et al. 31 first used PointNet to extract high-dimensional features from an input 3D point cloud, then performed feature dimensionality reduction and fusion, and finally regressed the corresponding 6D pose. However, the lack of object appearance information limits the performance of these methods in challenging scenes.
Pose estimation with RGB-D data
With the rapid development of hardware, using RGB-D information to estimate the 6D poses of objects has gradually become a popular research direction. PoseCNN 11 estimates 6D poses from RGB images and then uses depth images to refine the poses. However, this refinement process is time-consuming and cannot be trained together with the pose estimation network. Li et al. 12 used depth information as an additional channel of a network constructed for RGB images, ignoring the difference between the depth feature space and the color feature space and thus failing to fully utilize the RGB-D information. More recently, DenseFusion 13 uses a CNN and PointNet 32 to process the data acquired from RGB and depth images, respectively, and performs pixel-wise fusion. However, this approach discards the local information in the point cloud. PVN3D, 15 a deep point-wise 3D keypoint voting network, predicts the depth information of key points to lift them from 2D to 3D space and optimizes the 6D pose estimation process according to the geometric information of the object itself. Furthermore, Zhou et al. 14 and He et al. 15 used PointNet++ 16 to extract local geometric information from the point cloud. However, PointNet++ treats all local points equally, which limits the extraction of geometric information. In contrast, we use an ASSANet-like 33 network to efficiently extract the geometric information from the point cloud and perform both color-depth and local–global fusion to obtain a better representation of the observed RGB-D information.
The proposed method
This paper aims to estimate the 6D pose of a known object from an RGB-D image. The 6D pose is the rigid transformation, composed of a rotation and a translation, from the object's own coordinate system to the camera coordinate system. Generally, the rotation and translation are represented by a rotation matrix R ∈ SO(3) and a translation vector t ∈ ℝ³, so the pose can be written as p = [R | t].
Overview of the network
Our overall network architecture, shown in Figure 1, is divided into three stages. In the first stage, the input RGB-D image is semantically segmented to obtain the target region. In the second stage, we first use a lightweight network to extract color features from the RGB image and integrate them into a point cloud (converted from the depth map) according to the correspondence between the depth map and the color map, performing color-depth fusion. Then, an ASSANet-like network is used to extract several local features and one global feature from the target point cloud, and the global feature is copied and fused with each local feature for local–global fusion. In the last stage, the fusion features are sent to a pose predictor network to obtain candidate 6D poses with confidence scores, and the pose with the highest score is chosen as the final output. In addition, we add the refinement network from Wang et al. 13 to obtain a better estimation. Each part of Figure 1 is described in detail in the subsequent sections.

The overall framework of our approach. The input RGB-D image is segmented to obtain the target region. The color features extracted by a lightweight network are fused with the point cloud, and several local features and one global feature are extracted from the point cloud. Then, we fuse the global feature with the local features and send the fused features to the pose predictor network to obtain the corresponding 6D poses. Finally, we add a refinement network to obtain a better estimation. RGB-D: red–green–blue-depth; 6D: six degree-of-freedom; CNN: convolutional neural network; PSP: pyramid scene parsing; ASSANet: anisotropic separable set abstraction network.
Preprocessing
Before performing feature extraction, we first segment the target region from the input image. According to the segmentation results, we crop the RGB image and convert the depth image to a point cloud using the camera's intrinsic parameters. Since this part is not our focus, we directly adopt the segmentation network provided by Xiang et al. 11
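A minimal sketch of the depth-to-point-cloud conversion described above, assuming a pinhole camera model; the intrinsic values in the usage comment are placeholders rather than the calibration of any particular dataset.

```python
import numpy as np

def depth_to_point_cloud(depth, mask, fx, fy, cx, cy, depth_scale=1000.0):
    """Back-project masked depth pixels into 3D camera coordinates (pinhole model)."""
    v, u = np.nonzero(mask & (depth > 0))   # pixel coordinates inside the segmented region
    z = depth[v, u] / depth_scale           # metric depth (assuming millimetre depth maps)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)      # (N, 3) point cloud in the camera frame

# Hypothetical usage with placeholder intrinsics and a boolean segmentation mask:
# cloud = depth_to_point_cloud(depth_image, seg_mask, fx=572.4, fy=573.6, cx=325.3, cy=242.0)
```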
Color feature extraction and color-depth fusion
The cropped image is passed through a color feature extraction network to produce pixel-wise color embeddings. Unlike previous works,13,14 we design a lightweight feature extraction network for color feature extraction, which reduces the number of parameters and the computational complexity of the model. This network comprises a variant of ShuffleNet V2, 34 a pyramid scene parsing (PSP) module, 35 and four upsampling layers, as illustrated in Figure 2. The original ShuffleNet V2 unit for spatial downsampling, shown in Figure 3, uses depthwise convolution (DWConv) and a channel shuffling strategy to reduce the model size. Among the operations in ShuffleNet V2, the 1 × 1 convolutions account for most of the complexity. The 1 × 1 convolutions before and after DWConv are mainly used to fuse information between channels or to change the dimensionality. Since we do not need to change the dimensionality here, a single 1 × 1 convolution is sufficient to provide the cross-channel information fusion that DWConv lacks. Therefore, we remove the 1 × 1 convolution after DWConv in branch 2 to further reduce the computational cost.

Structure of our color feature extraction network. RGB: red-green-blue; PSP: pyramid scene parsing.

The original ShuffleNet V2 unit for spatial downsampling. DWConv: depthwise convolution; BN: batch normalization; ReLU: rectified linear unit; Conv: convolution.
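The following PyTorch sketch shows the modified downsampling unit described above, with branch 2 keeping only the 1 × 1 convolution before DWConv; the channel widths and layer ordering are our reading of Figures 2 and 3, not the authors' released implementation.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    n, c, h, w = x.size()
    x = x.view(n, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

class ModifiedDownsampleUnit(nn.Module):
    """ShuffleNet V2 spatial-downsampling unit with the 1x1 convolution after
    DWConv in branch 2 removed (a sketch of the variant described in the text)."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        branch_ch = out_ch // 2
        # Branch 1: 3x3 DWConv (stride 2) -> BN -> 1x1 Conv -> BN -> ReLU
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, stride=2, padding=1, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.Conv2d(in_ch, branch_ch, 1, bias=False),
            nn.BatchNorm2d(branch_ch), nn.ReLU(inplace=True))
        # Branch 2: 1x1 Conv -> BN -> ReLU -> 3x3 DWConv (stride 2) -> BN
        # (the trailing 1x1 convolution of the original unit is cropped here)
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_ch, branch_ch, 1, bias=False),
            nn.BatchNorm2d(branch_ch), nn.ReLU(inplace=True),
            nn.Conv2d(branch_ch, branch_ch, 3, stride=2, padding=1, groups=branch_ch, bias=False),
            nn.BatchNorm2d(branch_ch))

    def forward(self, x):
        # Both branches process the full input; outputs are concatenated and shuffled.
        out = torch.cat([self.branch1(x), self.branch2(x)], dim=1)
        return channel_shuffle(out, groups=2)
```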
Specifically, the cropped RGB images of size
Geometric feature extraction and local–global fusion
Effectively extracting local geometric information from point clouds is a challenging task. Previous methods14,15 typically employed PointNet++ to extract information from the point cloud, yet PointNet++ treats all local points in an isotropic manner. In contrast, we use a network based on the anisotropic separable set abstraction (ASSA) module to deal with the target point cloud. The ASSA module performs set abstraction operations in separate directions independently, facilitating the capture of anisotropic patterns and enhancing the performance of local information extraction. Furthermore, the ASSA module is more efficient and has a faster inference speed, which aligns with our intention to create a lightweight and efficient model. As shown in Figure 4, this module consists of five layers: a subsampling layer, a grouping layer, a geometry-aware anisotropic reduction layer, and two multilayer perceptron (MLP) layers. Specifically, taking the resulting point cloud of size

ASSA module. ASSA: anisotropic separable set abstraction; MLP: multilayer perceptron.
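To make the data flow concrete, the following is a highly simplified, schematic sketch of one set-abstraction stage: centroids are subsampled, k neighbors are grouped, neighbor features are aggregated with per-axis weights derived from the relative offsets (a stand-in for the geometry-aware anisotropic reduction), and point-wise MLPs are applied before and after the reduction. The sampling, grouping, and reduction here are simplified placeholders, not the reference ASSA implementation of Qian et al. 33

```python
import torch
import torch.nn as nn

def index_points(points, idx):
    """Gather points[b, idx[b], :] for each batch element."""
    b = points.shape[0]
    batch_idx = torch.arange(b, device=points.device).view(b, 1, 1)
    return points[batch_idx, idx, :]

class SimplifiedSetAbstraction(nn.Module):
    """Schematic set-abstraction stage (illustrative only, not the reference ASSA module)."""

    def __init__(self, in_ch, out_ch, n_centroids=256, k=16):
        super().__init__()
        self.n_centroids, self.k = n_centroids, k
        self.pre_mlp = nn.Sequential(nn.Linear(in_ch, out_ch), nn.ReLU(inplace=True))
        self.post_mlp = nn.Sequential(nn.Linear(out_ch * 3, out_ch), nn.ReLU(inplace=True))

    def forward(self, xyz, feats):
        # xyz: (B, N, 3) coordinates, feats: (B, N, C) per-point features
        feats = self.pre_mlp(feats)                                   # point-wise MLP before grouping
        centroid_idx = torch.randperm(xyz.shape[1], device=xyz.device)[: self.n_centroids]  # stand-in for FPS
        new_xyz = xyz[:, centroid_idx, :]                             # (B, M, 3) subsampled centroids
        dist = torch.cdist(new_xyz, xyz)                              # (B, M, N) pairwise distances
        knn_idx = dist.topk(self.k, largest=False).indices            # (B, M, k) neighbor grouping
        grouped_xyz = index_points(xyz, knn_idx) - new_xyz.unsqueeze(2)   # relative offsets (B, M, k, 3)
        grouped_feat = index_points(feats, knn_idx)                   # (B, M, k, C)
        # Anisotropic-style reduction: aggregate neighbors separately along each axis.
        weights = torch.softmax(-grouped_xyz.abs(), dim=2)            # (B, M, k, 3) per-axis weights
        reduced = torch.einsum('bmkc,bmka->bmac', grouped_feat, weights)  # (B, M, 3, C)
        reduced = reduced.flatten(2)                                  # (B, M, 3*C)
        return new_xyz, self.post_mlp(reduced)                        # coordinates and features of the stage
```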
Our geometric feature extraction network consists of three ASSA modules. The second module outputs
Pose estimation and refinement
Pose estimation
We feed the obtained fusion features into a pose prediction network to obtain 6D poses. This network has three branches, each consisting of three 1 × 1 convolution layers; the three branches output the predicted rotation R, translation t, and confidence c, respectively. Each fusion feature independently predicts a 6D pose and a corresponding confidence score, and we choose the pose with the highest confidence as the final output. This design lets the model predict the pose independently from various local regions, each offering a distinct perspective, so even when parts of the object are obscured, the overall pose can be inferred from the less occluded regions. By selecting the most confident of these local predictions, the method effectively leverages local information to handle severe occlusion and maintains accurate pose estimation despite the object's partial visibility.
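A sketch of this prediction head, assuming a quaternion rotation parameterization and per-fusion-feature prediction as in Wang et al. 13; the intermediate channel widths and the input feature dimension are illustrative rather than the exact values used in our network.

```python
import torch
import torch.nn as nn

class PosePredictor(nn.Module):
    """Three branches of 1x1 convolutions over per-point fusion features, predicting
    rotation (quaternion), translation, and confidence (sketch; widths are illustrative)."""

    def __init__(self, feat_dim=1408):
        super().__init__()
        def branch(out_dim):
            return nn.Sequential(
                nn.Conv1d(feat_dim, 640, 1), nn.ReLU(inplace=True),
                nn.Conv1d(640, 256, 1), nn.ReLU(inplace=True),
                nn.Conv1d(256, out_dim, 1))
        self.rot_branch = branch(4)    # quaternion
        self.trans_branch = branch(3)  # translation
        self.conf_branch = branch(1)   # confidence score

    def forward(self, fusion_feats):
        # fusion_feats: (B, feat_dim, N), one fused feature per local region
        rot = self.rot_branch(fusion_feats)                          # (B, 4, N)
        rot = rot / (rot.norm(dim=1, keepdim=True) + 1e-8)           # normalize quaternions
        trans = self.trans_branch(fusion_feats)                      # (B, 3, N)
        conf = torch.sigmoid(self.conf_branch(fusion_feats))         # (B, 1, N)
        best = conf.squeeze(1).argmax(dim=1)                         # most confident prediction per sample
        b = torch.arange(fusion_feats.shape[0], device=fusion_feats.device)
        return rot[b, :, best], trans[b, :, best], conf[b, :, best]
```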
Pose refinement
We further optimize the predicted pose to improve the estimation accuracy. Standard ICP optimization is time-consuming and cannot be trained in an end-to-end manner. Therefore, we adopt the iterative refinement structure of Wang et al., 13 which is fast and can be jointly trained with our main network for end-to-end pose estimation. The refinement process is shown in Figure 5. The structure of the refinement network is similar to that of the main network, except that its fusion module outputs only one global feature instead of per-point fusion features. The global feature passes through two small regression networks composed of fully connected layers to regress a single pose. Specifically, the input point cloud is first transformed according to the initial pose predicted by the pose prediction network. Then, the transformed point cloud is fed, together with the original color features, into the refinement network to obtain a residual pose. Finally, we transform the target point cloud again according to the residual pose and send it to the next iteration. After T iterations, the per-iteration poses are composed to obtain the final pose estimate.

The pose refinement procedure. ASSANet: anisotropic separable set abstraction network; 6D: six degree-of-freedom.
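A minimal sketch of this refinement loop; `refiner` is a placeholder for the refinement network (assumed to return a residual rotation matrix and translation from the re-transformed cloud and the color features), and the composition follows from treating each residual as a correction expressed in the current object frame.

```python
import torch

def refine_pose(refiner, cloud, color_feats, R0, t0, iterations=2):
    """Iterative pose refinement (sketch): express the observed cloud in the object
    frame of the current estimate, predict a residual pose, and compose it."""
    R, t = R0, t0                                  # current estimate: (3, 3) and (3,)
    for _ in range(iterations):
        canonical = (cloud - t) @ R                # observed points mapped into the object frame
        dR, dt = refiner(canonical, color_feats)   # residual pose predicted by the refinement network
        t = t + R @ dt                             # compose the residual translation
        R = R @ dR                                 # compose the residual rotation
    return R, t
```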
Loss functions
During the training phase, we define two different loss functions for asymmetric and symmetric objects. For asymmetric objects, the loss minimizes the average offset between the point cloud randomly sampled on the object model under the ground-truth pose and the corresponding point cloud under the predicted pose.
Symmetric objects can have the same appearance under different poses in RGB-D images. Therefore, another loss function is introduced that minimizes the average distance between each point of the object model under the predicted pose and the closest point of the model under the ground-truth pose.
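For reference, a standard per-prediction form of these two losses (our notation, following the dense per-point formulation of Wang et al. 13) is given below, where the x_j are the M points sampled from the object model, [R | t] is the ground-truth pose, and [R̂_i | t̂_i] is the pose predicted from the i-th fusion feature:

```latex
% Asymmetric objects: average offset between corresponding points.
L_i = \frac{1}{M} \sum_{j=1}^{M} \left\| (\hat{R}_i x_j + \hat{t}_i) - (R x_j + t) \right\|

% Symmetric objects: each predicted point is matched to the closest
% ground-truth point instead of its fixed correspondence.
L_i = \frac{1}{M} \sum_{j=1}^{M} \min_{1 \le k \le M} \left\| (\hat{R}_i x_j + \hat{t}_i) - (R x_k + t) \right\|
```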
Experiments
Datasets and metrics
We adopt the average distance (ADD) 17 and ADD-S 11 metrics, which are commonly used in 6D pose estimation, to evaluate our method. The ADD metric is defined as the mean pairwise distance between the model points transformed by the predicted pose and the same points transformed by the ground-truth pose. For symmetric objects, the ADD-S metric instead measures, for each model point transformed by the ground-truth pose, the distance to the closest model point transformed by the predicted pose.
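A small sketch of how these two metrics can be computed from a sampled model point cloud and the predicted and ground-truth poses; this is a generic implementation of the standard definitions, not the official evaluation code of the benchmark toolkits.

```python
import numpy as np

def add_metric(model_pts, R_pred, t_pred, R_gt, t_gt):
    """ADD: mean distance between corresponding model points under the two poses."""
    pred = model_pts @ R_pred.T + t_pred
    gt = model_pts @ R_gt.T + t_gt
    return np.mean(np.linalg.norm(pred - gt, axis=1))

def adds_metric(model_pts, R_pred, t_pred, R_gt, t_gt):
    """ADD-S: for symmetric objects, match each ground-truth point to the closest predicted point."""
    pred = model_pts @ R_pred.T + t_pred
    gt = model_pts @ R_gt.T + t_gt
    pairwise = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=2)  # (N, N) distances
    return np.mean(pairwise.min(axis=1))
```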
Implementation details
The color feature extraction network takes an RGB image of size
All experiments are performed on an Intel Core i9-10900X CPU @ 3.70 GHz × 20 with a single NVIDIA GeForce RTX 3090 GPU. The training and test set division is the same as in previous work. 14
Evaluation conducted on the LineMOD dataset
Table 1 lists the quantitative evaluation results obtained for all 13 objects in the LineMOD dataset. As can be seen, the performance of RGB-D-based methods is generally better than that of RGB-based methods. Our method achieves an accuracy of
Quantitative 6D pose evaluation results obtained on the LineMOD dataset in terms of the ADD(-S) metric. Objects with bold names are symmetric.
6D: six degree-of-freedom; RGB: red–green–blue; RGB-D: RGB-depth; PVNet: pixel-wise voting network; PoseCNN: pose convolutional neural network; DeepIM: deep iterative matching; CPDN: coordinates-based disentangled pose network; DPOD: 6D pose object detector and refiner; PVN3D: a deep point-wise 3D keypoints voting network for 6D pose estimation; FFB6D: a full flow bidirectional fusion network for 6D pose estimation.
Evaluation conducted on the Occlusion LineMOD dataset
The quantitative evaluation results obtained for all eight objects in the Occlusion LineMOD dataset are reported in Table 2. Hinterstoisser et al. 38 only use depth images, while the other methods use RGB-D images. As the table shows, our method outperforms DenseFusion 13 and the approach of Zhou et al. 14 by
Quantitative 6D pose evaluation results obtained on the Occlusion LineMOD dataset in terms of the ADD(-S) metric. The bold value in each row represents the best performance achieved for one object category.
6D: six degree-of-freedom; PoseCNN: pose convolutional neural network; ICP: iterative closest point.
Evaluation conducted on the YCB-Video dataset
Table 3 shows the results obtained for all 21 objects in the YCB-Video dataset. The area under the ADD-S curve (AUC, computed with a maximum threshold of 0.1 m) and the percentage of ADD-S values smaller than 2 cm are used to measure the performance of these methods. All methods use the same segmentation masks as those in PoseCNN to ensure fairness. As seen from the table, our method outperforms PoseCNN + ICP and DenseFusion by

Visualization of the poses estimated by our method. The left panel shows the input images and the right panel shows the resulting 6D pose images.
Quantitative 6D pose evaluation results (ADD-S<2 cm and AUC) obtained on the YCB-Video dataset.
6D: six degree-of-freedom; AUC: area under the ADD-S curve; PoseCNN: pose convolutional neural network; ICP: iterative closest point.
Time efficiency results
Training time
Deep learning-based 6D pose estimation algorithms often need to be trained for hundreds of epochs to achieve good results, and this excessively long training time limits their practical application. Figure 7 shows the average accuracies and errors yielded by DenseFusion, 13 the approach of Zhou et al., 14 and our method on the LineMOD dataset during the first 10 training epochs. As can be seen, our method achieves higher accuracy and lower error than the other two methods for the same number of training epochs. Most importantly, our approach can achieve 90+

Training curves. (a) and (b) show the accuracy and average error on the LineMOD dataset over the first 10 training epochs, respectively.
Inference time
On the RTX 3090 GPU, our method takes 0.017 s for pose estimation and 0.007 s for refinement. With 0.03 s for the preceding instance segmentation, the overall runtime on the LineMOD dataset is approximately 0.054 s, which meets the requirements of real-time applications.
Model size results
Table 4 shows the comparison between the size of our model and those of the latest popular 6D pose estimation models. As we can see, the parameters of our model are
Comparison of our model with the latest popular models in terms of model size.
PVN3D: a deep point-wise 3D keypoints voting network for 6 degrees of freedom pose estimation; FFB6D: a full flow bidirectional fusion network for 6D pose estimation; MB: megabyte.
Ablation experiments
We conduct a series of ablation studies on the LineMOD dataset to verify the effects of different parts of our model.
Table 5 compares our model’s parameters, memory space, floating point operations (FLOPs), and accuracy under different color feature extraction networks. Compared to ResNet34+PSPNet and the color feature extraction network in DenseFusion, our color feature extraction network reduces the number of model parameters by
Comparison among different versions of our model using different color feature extraction networks on the LineMOD dataset in terms of accuracy and model complexity.
GFLOP: giga floating point operation.
Changes in FLOPs and accuracy on the LineMOD dataset with different numbers of local regions.
FLOP: floating point operation; GFLOP: giga FLOP.
Conclusions
In this paper, we introduce an end-to-end network for estimating an object’s 6D pose from RGB-D images. We develop a two-stage feature fusion module to better leverage the color and geometry information in RGB-D images, which is particularly advantageous in occluded environments. This module first extracts color features from the given RGB images and combines them into a point cloud for color-depth fusion. Then, it uses an ASSANet-like network to extract several local features and one global feature from the point cloud for local–global fusion. Furthermore, to reduce the complexity of our model, we develop a lightweight network based on ShuffleNet V2 for color feature extraction. Experimental results obtained on the LineMOD, Occlusion LineMOD, and YCB-Video datasets demonstrate that the proposed approach increases the overall accuracy of estimated 6D poses. Furthermore, our model is significantly smaller than the latest popular models for 6D pose estimation, making it well suited for deployment on mobile platforms with limited computational resources. In the future, we will apply the end-to-end advantages of our method to the field of self-supervised learning.
Footnotes
Acknowledgement
We would like to thank the anonymous reviewers and the editor for their comments.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China (62373016) and the Open Projects Program of the State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS-2023-22).
