Abstract
3D object instance segmentation plays a vital role in applications such as autonomous driving, robotics, and virtual reality. Tabletop scenes, however, contain objects of diverse complexity and widely varying size, and accurately segmenting multiple object instances in such scenes remains challenging; this directly limits a robot's ability to grasp and manipulate objects effectively. In this paper, we propose a multi-scale deep learning and clustering-based approach for object instance segmentation in tabletop scenes. Our approach incorporates a multi-scale neighborhood feature sampling (MNFS) module specifically designed to extract local features, and a clustering algorithm that suppresses noise and preserves instance integrity. We then combine the strengths of both components through ScoreNet and non-maximum suppression. We conducted extensive experiments on TO-Scene, the first large-scale dataset of 3D tabletop scenes, and observed an average mIoU improvement of approximately 4.07% over existing methods, highlighting the superior performance of our proposed method. In addition, we tested our algorithm on a real-scene robotics platform and showed that it performs and generalizes well enough to support future applications such as robot grasping.
Introduction
3D object instance segmentation is a fundamental task in computer vision that aims to recognize and delineate individual objects in a 3D scene. It plays a vital role in applications such as autonomous driving, robotics, healthcare, augmented reality, and virtual reality. As robots spread across many fields of human production and daily life, enabling them to autonomously complete a variety of human-like manipulation tasks in complex, unknown tabletop scenes has become an essential requirement for the development of intelligent robotic operation.
However, most current research focuses on large-scale scenarios such as autonomous driving or remote sensing, while small-scale scenarios such as tabletop settings remain underexplored. Tabletop scenes may contain a wide variety of objects of different categories and sizes (pens, cups, fruits, computers, desks, and so on). Even objects within the same category can exhibit diverse shapes (e.g., Cup, Mug, Travel Tumbler, Bowl Cup, Goblet, Champagne Flute), adding complexity due to the large variation in object types and sizes. Objects may also suffer from boundary ambiguity, further complicating instance segmentation. As a result, improving the accuracy of multi-object instance segmentation on tabletops is a challenging problem and a crucial component of intelligent perception and flexible grasping for robots. Instance segmentation of tabletop scenarios helps industrial and service robots understand their environment more deeply and facilitates subsequent high-level decision-making.
Traditional instance segmentation methods have focused on 2D images, segmenting objects in a single image plane.1,2 With the rapid development of 3D sensing technologies such as depth cameras and LiDAR, the acquired 3D data provide richer geometric, shape, and scale information than 2D data, and there is an increasing demand for accurate and efficient instance segmentation in 3D scenes. Point clouds, as a common representation of 3D data, have accordingly received extensive research attention for 3D object instance segmentation,3–5 which faces new challenges due to the extra spatial dimension and the complexity of dealing with occlusions, cluttered scenes, and varied object shapes.
Instance segmentation of point clouds is mainly categorized into clustering and deep learning methods. Clustering methods 6 can extend many existing 2D instance segmentation methods directly to 3D and offer improved judgment of object contours. However, accurate clustering is particularly challenging for several reasons: (1) point cloud scenes contain heavy interference from background points; (2) instance point sets vary greatly in size and density; (3) there is a large semantic gap between individual points and instance identities. Over-segmentation and under-segmentation are therefore common. In recent years, advances in deep learning, the availability of large-scale labeled datasets, and improved 3D perception capabilities have driven significant progress in 3D object instance segmentation.7–9 However, problems remain: (1) deep networks are susceptible to noise and misclassify some points, which leaves instances incomplete and hurts accuracy; (2) the object shapes covered by datasets are limited, and insufficient generalization means trained models transfer poorly to real-life scenarios. This motivated us to combine the strengths of both families of methods while compensating for their weaknesses.
In this paper, we propose a tabletop-aware learning and clustering-based approach to address these challenges and improve the accuracy and generalization of 3D object instance segmentation for robots in realistic tabletop scenes. Our approach builds on the popular Point Transformer deep learning network, to which we add a multi-scale neighborhood feature sampling (MNFS) module that accounts for the object sizes in tabletop scene datasets and the density contrast between objects and background. Features are extracted directly from the 3D point cloud by the network; these features drive a clustering segmentation method tailored to tabletop objects, and the two outputs are fused through ScoreNet and non-maximum suppression. We conducted extensive experiments on TO-Scene, 10 the first large-scale dataset of 3D tabletop scenes, to demonstrate the superior performance of our method compared to existing approaches: it achieves mIoU scores of 82.90%, 81.16%, and 73.58% on the TO-Vanilla, TO-Crowd, and TO-ScanNet datasets, respectively. In addition, we deploy the instance segmentation algorithm on a robot in a realistic scenario, provide an in-depth analysis of our method, and discuss potential future directions for advancing 3D object instance segmentation.
In a nutshell, our contributions are as follows:
(1) We propose a 3D instance segmentation framework based on the fusion of deep learning and clustering; through our algorithm and ScoreNet, we leverage the advantages of both to effectively achieve object instance segmentation. (2) We design a local information extraction module, multi-scale neighborhood feature sampling (MNFS), to effectively extract the features of small-scale objects on the tabletop. (3) We test the algorithm on datasets and on a robot platform in a realistic scenario; the results show excellent performance and generalization capabilities, supporting future applications such as robot grasping.
The rest of the paper is structured as follows: section “Related work” reviews related work in the field; section “Methods” explains our methodology in detail; section “Experiments” presents the experimental results and analysis; and section “Conclusion” concludes the paper.
Related work
Deep learning on 3D point clouds
Point clouds are widely used as a common format for 3D data with the development of 3D scene understanding. Deep learning methods for point clouds are mainly categorized into projection-based, voxel-based, and point-based networks. Projection-based methods project 3D point clouds onto various image planes and then use 2D CNN-based networks to extract feature representations,11–13 but the choice of projection planes strongly affects performance, and occlusion of objects in the projection reduces accuracy. Voxel-based methods turn irregular point clouds into regular representations by voxelization,14,15 and their efficiency has been improved by introducing sparse convolution, 16 but geometric information may still be lost when quantizing the point cloud into a grid at different resolutions. Point-based methods extract features directly from unstructured point sets, such as PointNet and PointNet++.3,4 With the recent popularity of Transformer17,18 in the NLP field, researchers have also introduced Transformer and self-attention modules to 3D point clouds; Point Transformer 5 is a classic example, achieving excellent results in several point cloud tasks such as recognition and segmentation. In this paper, Point Transformer is chosen as the backbone of our tabletop-aware learning.
Clustering-based instance segmentation
The basic principle of clustering-based 3D point cloud segmentation is to find discriminative rules that map points into a representation space where points belonging to the same instance exhibit similar features while points on different objects do not, and then to group sets of points together in that space. This parallels earlier pixel grouping in the 2D domain: for example, Fathi et al. 6 compute pixel likelihoods and group similar pixels together in an embedding space. In the 3D domain, SGPN 19 proposes a similarity matrix to represent pairwise similarity between points and generates instances by merging high-similarity points through a grouping algorithm. OccuSeg 20 employs learned occupancy signals to guide the clustering. MTML 21 learns feature and directional embeddings, performs mean-shift clustering on the feature embeddings to generate target segments, and scores the segments based on the consistency of their orientation features. A fundamental problem arises from the wide variation in size and point density of object instances in 3D scans, which can result in over- and/or under-segmentation when fixed clustering parameters are used. 22 This problem is especially evident in tabletop scenarios, where the contrast between background and objects is pronounced.
There have therefore been attempts to modify the clustering steps. 3D-MPA 23 predicts the centers of instances and then aggregates points into candidate instances. PointGroup 7 clusters points based on dual coordinate sets and introduces ScoreNet to predict the scores of instance objects. HAIS 8 and SoftGroup 9 obtain point cloud features through a 3D-UNet and then follow the clustering paradigm, introducing set aggregation and intra-instance prediction to refine instances at the object level. In this paper, we fuse tabletop-aware learning with a clustering-based algorithm by predicting instance object scores with ScoreNet.
Instance segmentation for robot manipulation
The ability to perceive the geometric space of three-dimensional objects is crucial for robot manipulation, and instance segmentation is widely used in robot grasping. Researchers24,25 used instance segmentation networks to segment and localize objects in logistics sorting scenes before grasping them; the core idea of both works is to improve target-pickup efficiency by jointly learning semantic and instance segmentation from RGBD images. However, these methods are difficult to apply to complex, cluttered industrial scenes because objects in logistics scenes are comparatively organized. Abbeloos et al. 26 use point pair feature matching between model points and scene points to solve instance segmentation in highly cluttered scenes, introducing a heuristic to reduce complexity, but computation time grows long due to the large number of points. PPR-Net 27 and FPCC 28 infer instance centers in the feature space and then quickly generate instance segmentation results based on point cloud clustering; this greatly improves computation speed but applies only to a single class or specific industrial objects. Currently, owing to the variety and complexity of objects in tabletop scenarios, research specifically focused on object instance segmentation in tabletop scenes is scarce, even though tabletop scenes are frequently encountered in robotic manipulation and are particularly relevant for grasping applications.
Methods
The general framework of this paper is shown in Figure 1. The learning network's inputs are divided into coordinate information coord and RGB information feat. Given the significant disparities between tabletop objects and the background in terms of density and size, an MNFS module is incorporated. The network produces the feature F, which then passes through a two-branch structure that extracts semantic labels and predicted points. Po denotes the tabletop-object points obtained by semantic segmentation of the input point cloud through this two-branch deep network. The point set Po then undergoes a clustering algorithm to yield the cluster set Pc. The semantically labeled predicted points Pso and the clustering results Pc are jointly fed to ScoreNet, whose output Sc provides proposal scores for evaluating both. Finally, non-maximum suppression (NMS) is applied to generate instance predictions.

Figure 1. The diagram of the algorithmic framework. It is composed of four key components: network input, deep learning backbone, the two-branch structure, and clustering fusion output.
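Before detailing each component, the following minimal sketch illustrates the final fusion step: mask-level NMS over scored instance proposals represented as point-index sets. The set representation, the toy proposals, and the IoU threshold are illustrative assumptions, not the framework's exact configuration.

```python
import numpy as np

def mask_nms(proposals, scores, iou_thresh=0.5):
    """Non-maximum suppression over instance proposals (point-index sets):
    keep the highest-scoring proposal, drop any later one that overlaps a
    kept proposal above `iou_thresh`. A generic sketch of the fusion step."""
    order = np.argsort(scores)[::-1]          # visit proposals by descending score
    kept = []
    for i in order:
        a = proposals[i]
        if all(len(a & proposals[j]) / len(a | proposals[j]) <= iou_thresh
               for j in kept):
            kept.append(i)
    return [proposals[i] for i in kept]

# Two overlapping candidates for one object plus one separate object:
props = [{0, 1, 2, 3}, {1, 2, 3, 4}, {10, 11, 12}]
print(mask_nms(props, np.array([0.9, 0.6, 0.8])))  # -> [{0, 1, 2, 3}, {10, 11, 12}]
```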
Points feature extraction network
The backbone of our points feature extraction network is Point Transformer, and an MNFS module is designed for small-size objects such as those in tabletop scenes to fully extract global and local information. In the implementation, we first pre-voxelize the point cloud inputs to obtain more regular structural and contextual information.10,29,30 We set the voxel size to 4 mm³ to match the small size of tabletop objects. Random sampling of the different segmented regions through the MNFS module supplies 60,000 points for training; all points are used for testing.
The transformer block within the Point Transformer enables the exchange of information among these local feature vectors, generating new feature vectors as output. Specifically, the transformer processes input feature vectors with a self-attention mechanism, enabling each vector to consider its connections with the other vectors in the sequence. This information is then used to update the feature vectors, capturing correlations and correspondences in the data. The newly generated, more discriminative feature vectors contribute to extracting complex spatial correlations and semantic relationships among objects in the tabletop scene. Information aggregation adapts both to the content of the feature vectors and to their three-dimensional layout. The network outputs the feature vectors F used by the subsequent two-branch structure.
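As an illustration of the pre-voxelization step described above, the sketch below keeps one representative point per occupied voxel, assuming a 4 mm voxel edge; the real pipeline also carries RGB features and voxel-level context, which are omitted here.

```python
import numpy as np

def voxel_downsample(points, voxel=0.004):
    """Pre-voxelize a point cloud: keep one point per occupied 4 mm voxel.
    `points` is an (N, 3) float array in meters; returns the subsampled array."""
    keys = np.floor(points / voxel).astype(np.int64)        # voxel index per point
    _, first = np.unique(keys, axis=0, return_index=True)   # one representative per voxel
    return points[np.sort(first)]

pts = np.random.rand(100_000, 3).astype(np.float32)         # toy cloud in a 1 m cube
print(voxel_downsample(pts).shape)
```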
Multi-scale neighborhood feature sampling
In order to finely segment tabletop backgrounds and objects, we need further aggregation of features. The MNFS module is designed to efficiently extract the local features of a point using the density information of its neighborhood space. Inspired by multi-scale local feature aggregation (LFA),31,32 the coordinate and feature information of points are jointly encoded and combined to provide efficient, discriminative region feature extraction. Our approach differs from the former, however, in that we dynamically adjust the scale and sampling of the point set region by learning density features from the neighborhood, taking into account both the intricacies of large-scale features and the nuances of smaller-scale features.
The large-scale features present in 3D point clouds highlight the positional and structural relationships among global objects throughout the entire scene. In contrast, small-scale features focus on local information, encompassing geometric normals, local point density distribution, subtle shape variations, and other attributes. These elements are crucial for discerning objects that are in close proximity or bear similarity to one another. Integrating multi-scale neighborhoods is an effective means of aggregating these basic local features, which benefits the segmentation of small objects and finer-grained instances.
Given a set of points Points = {pi}, a preliminary step employs farthest point sampling (FPS) to acquire a subset of sampled points {p1, p2, …, pn}. We construct the multi-scale neighborhood of each sampled point by grouping its surrounding points at several radii, from which density-aware local features are aggregated.

Figure 2. The multi-scale neighborhood feature sampling (MNFS) module.
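The following simplified sketch illustrates the neighborhood construction: FPS selects centers, and ball queries at several radii gather per-center density cues and local geometry. The radii (the largest echoing the r = 0.08 spherical-grouping radius from our implementation details), the number of centers, and the hand-crafted density/centroid features are illustrative stand-ins for the learned, density-adaptive aggregation in MNFS.

```python
import numpy as np
from scipy.spatial import cKDTree

def farthest_point_sampling(points, n_samples):
    """Greedy FPS: iteratively pick the point farthest from the chosen set."""
    idx = np.zeros(n_samples, dtype=np.int64)
    dist = np.full(len(points), np.inf)
    for i in range(1, n_samples):
        dist = np.minimum(dist, np.linalg.norm(points - points[idx[i - 1]], axis=1))
        idx[i] = dist.argmax()
    return idx

def multiscale_neighborhood_features(points, radii=(0.02, 0.04, 0.08), n_centers=256):
    """For each FPS center, gather neighbor counts (a density cue) at several
    radii plus the local centroid offset at the largest radius."""
    tree = cKDTree(points)
    centers = points[farthest_point_sampling(points, n_centers)]
    feats = []
    for c in centers:
        counts = [len(tree.query_ball_point(c, r)) for r in radii]  # density per scale
        nbrs = points[tree.query_ball_point(c, radii[-1])]
        feats.append(counts + list(nbrs.mean(axis=0) - c))          # density + centroid offset
    return centers, np.asarray(feats)

pts = np.random.rand(5000, 3).astype(np.float32)
centers, feats = multiscale_neighborhood_features(pts)
print(centers.shape, feats.shape)  # (256, 3) (256, 6)
```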
Semantic labels branch
A multi-layer perceptron (MLP) with a softmax layer is applied to the feature F to obtain the initial predicted semantic scores for each point, from which the semantic labels are derived.
Predicted points branch
We apply a 2-layer MLP to generate a gap-shifted prediction that separates instances that are in contact or very close together. It contains the shifted point coordinate information, which enlarges the gaps between different instances.
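A minimal PyTorch sketch of the two-branch structure follows: a semantic head (MLP with softmax) and a 2-layer MLP producing per-point coordinate shifts. The feature width and layer sizes are illustrative, not the network's actual configuration.

```python
import torch
import torch.nn as nn

class TwoBranchHead(nn.Module):
    """Sketch of the two-branch structure: semantic scores plus gap-shifted
    coordinates. 53 classes = 52 tabletop objects + background."""
    def __init__(self, feat_dim=32, n_classes=53):
        super().__init__()
        self.semantic = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                                      nn.Linear(feat_dim, n_classes))
        self.offset = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                                    nn.Linear(feat_dim, 3))

    def forward(self, F, coord):
        sem_scores = torch.softmax(self.semantic(F), dim=-1)  # per-point class scores
        shifted = coord + self.offset(F)                      # gap-shifted coordinates
        return sem_scores, shifted

F = torch.randn(1000, 32)        # toy per-point features
coord = torch.rand(1000, 3)
sem, shifted = TwoBranchHead()(F, coord)
print(sem.shape, shifted.shape)  # torch.Size([1000, 53]) torch.Size([1000, 3])
```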
Clustering algorithm
Based on the Objects output from the semantic labels branch and the point set result of the predicted points branch, the clustering algorithm groups the predicted tabletop points into candidate instances.
Our clustering algorithm estimates the watershed of instance boundaries by calculating the distance between nearest neighboring points, and then predicts the point set of each instance by region growing. First, for the input predicted point coordinates, the distance from each point to its nearest neighbor is computed, and a boundary threshold is estimated from these distances; points whose mutual distance falls within the threshold are then grouped into the same region.
In the predicted points of the learning network, small-size instances may be split into multiple instances, and many noisy points may appear within instances. Our clustering algorithm, in contrast, maintains the integrity of the instances well, as analyzed in section “Evaluation on TO-Scene.”
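The sketch below illustrates this idea under simplifying assumptions: a boundary threshold is estimated from nearest-neighbor distances, and points are grouped by region growing within that threshold. The median-based threshold and the multiplier k are illustrative choices, not our exact parameters.

```python
import numpy as np
from scipy.spatial import cKDTree

def region_grow(points, k=1.5):
    """Estimate a boundary threshold from nearest-neighbor distances, then
    grow regions: neighbors closer than the threshold join the same instance."""
    tree = cKDTree(points)
    nn_dist, _ = tree.query(points, k=2)      # column 1: distance to nearest other point
    thresh = k * np.median(nn_dist[:, 1])     # watershed-style boundary estimate
    labels = np.full(len(points), -1)
    current = 0
    for seed in range(len(points)):
        if labels[seed] != -1:
            continue
        stack = [seed]
        labels[seed] = current
        while stack:                          # grow the region from the seed
            p = stack.pop()
            for q in tree.query_ball_point(points[p], thresh):
                if labels[q] == -1:
                    labels[q] = current
                    stack.append(q)
        current += 1
    return labels

g = np.stack(np.meshgrid(*[np.arange(0, 0.05, 0.01)] * 3), -1).reshape(-1, 3)
pts = np.vstack([g, g + 0.5])        # two identical point blocks 0.5 m apart
print(np.unique(region_grow(pts)))   # [0 1]
```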
ScoreNet
We use a ScoreNet inspired by PointGroup 7 to evaluate the clustering results, in order to better fuse the outputs of the deep learning and clustering algorithms. In contrast to PointGroup, our tabletop-aware learning strategy excels at extracting semantic information from tabletop objects, and the completeness of our clustering algorithm further enhances the accuracy of instance segmentation. The structure of ScoreNet is shown in Figure 3, where the inputs to the network are Pso, Cl, and Lso. Specifically, Pso represents the predicted points of objects with semantic labels, Cl denotes the labels of the clustering-predicted instances, and Lso is the ground truth.

Figure 3. The structure of ScoreNet. The input is voxelized and then encoded into the network to predict scores.
Since the semantic labels already correspond to the ground truth, it is necessary to establish a mapping relationship between the clustered labels and the ground truth as well.
The loss function is defined using the cross-entropy loss.
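As an illustration of this label-mapping step, the following sketch assigns each predicted cluster to the ground-truth instance with which it has the highest IoU, so that clusters can be supervised consistently; the greedy matching shown here is a simplified stand-in for our exact procedure.

```python
import numpy as np

def match_clusters_to_gt(cluster_labels, gt_labels):
    """Map each predicted cluster to the ground-truth instance it overlaps
    most (by IoU). Both inputs are per-point label arrays of equal length."""
    mapping = {}
    for c in np.unique(cluster_labels):
        pred = cluster_labels == c
        best_iou, best_gt = 0.0, -1
        for g in np.unique(gt_labels):
            gt = gt_labels == g
            iou = (pred & gt).sum() / (pred | gt).sum()
            if iou > best_iou:
                best_iou, best_gt = iou, g
        mapping[int(c)] = best_gt
    return mapping

cluster = np.array([0, 0, 0, 1, 1, 1])
gt = np.array([7, 7, 7, 9, 9, 7])
print(match_clusters_to_gt(cluster, gt))  # {0: 7, 1: 9}
```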
Experiments
We conducted extensive experiments to evaluate the performance of our method on the first large-scale tabletop scene dataset, TO-Scene, as well as on a robotics platform for realistic scenes. Additionally, we tested our algorithm in ablation experiments and explored its applicability to grasping scenarios. The results demonstrate the effectiveness of our method.
Experimental settings
Dataset
The dataset for the experiments is TO-Scene, the first large-scale tabletop scene dataset. 10 TO-Scene contains a total of 16,077 tabletop scenes and 52 common tabletop object classes, subdivided into three subdatasets. The vanilla dataset, TO-Vanilla, contains 12,078 tabletop scenes and 60,174 tabletop instances belonging to the 52 classes. TO-Crowd, with a higher density of tabletop objects per scene, contains 3999 tabletop scenes and 52,055 instances. TO-ScanNet, built from excerpted whole-room scans of the ScanNet dataset and preserving the semantic labels of the original room furniture, covers 4663 scans and approximately 137k tabletop instances. Each subdataset is divided into a training set, a validation set, and a test set. Since the test set is not yet complete (the authors have not yet published its annotated labels), we train on the training set and report results on the validation set.
Evaluation metrics
Following state-of-the-art methods,5,33,34 we adopt the mean of classwise intersection over union (mIoU) as the main evaluation metric. The mIoU quantifies the extent of overlap between predicted and ground truth segmentation masks for each class, providing a comprehensive measure of segmentation accuracy that accounts for both false positives and false negatives. The 52 object classes plus background, totaling 53 classes, are evaluated on the validation sets of the TO-Vanilla, TO-Crowd, and TO-ScanNet datasets.
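For reference, a minimal computation of classwise IoU averaged over the classes present in the ground truth might look as follows (53 = 52 object classes + background):

```python
import numpy as np

def mean_iou(pred, gt, n_classes=53):
    """Classwise IoU, IoU_c = TP_c / (TP_c + FP_c + FN_c), averaged over
    the classes that appear in the ground truth labels."""
    ious = []
    for c in range(n_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if (gt == c).any():               # skip classes absent from this scene
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([0, 1, 1, 2, 2, 2])
gt = np.array([0, 1, 2, 2, 2, 2])
print(round(mean_iou(pred, gt), 3))       # 0.75
```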
Implementation details
Our model is trained on 4 × RTX 4090 (24 GB) graphics cards with a batch size of 32. We use the stochastic gradient descent (SGD) optimizer with momentum and weight decay set to 0.9 and 0.0001, respectively. To expedite convergence, we employ a MultiStepLR learning-rate schedule with milestones every 10 epochs and a gamma of 0.5; the initial learning rate is 0.1. Because of the small size of the objects in the tabletop scene, we set the radius hyperparameter of the spherical grouping in MNFS to r = 0.08. Due to memory limitations, the maximum number of points per scene after random sampling is capped at 60k.
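In PyTorch terms, this optimization setup corresponds roughly to the following; the placeholder model and the 100-epoch horizon are illustrative assumptions.

```python
import torch

# SGD with momentum 0.9, weight decay 1e-4, initial LR 0.1, halved every
# 10 epochs via MultiStepLR (gamma 0.5). `model` stands in for our network.
model = torch.nn.Linear(32, 53)  # placeholder module
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=0.0001)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=list(range(10, 100, 10)), gamma=0.5)

for epoch in range(100):
    # ... one training epoch over the 60k-point samples ...
    optimizer.step()   # after loss.backward() in the real loop
    scheduler.step()
```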
Evaluation on TO-Scene
Results
We compared our proposed method with several widely used point cloud segmentation techniques. Among these, PointNet++ 4 is a classical deep learning network for 3D point clouds, while PointTransFD 10 builds upon Point Transformer by incorporating feature vectors and a dynamic sampling strategy, and currently holds the highest score on the TO-Scene dataset. As shown in Table 1, our method achieves an mIoU of 82.90% on TO-Vanilla, 81.16% on TO-Crowd, and 73.58% on TO-ScanNet. Notably, it is on average 4.07% higher than the current best result on TO-Scene, which demonstrates the effectiveness of our method. Furthermore, we report the per-category mIoU values for a subset of the objects in Table 2.
Table 1. Test results of different methods for object segmentation on the TO-Scene dataset.
Table 2. Category-wise mIoU values for our method on the TO-Scene dataset.
Analysis and discussion
We have selected several typical examples from the dataset to visualize the intermediate and final results of our algorithmic process, as shown in Figure 4. The output of the tabletop-aware learning network distinguishes objects by category well, but may contain considerable noise or misclassify some regions. Our clustering algorithm maintains the integrity of the instances better, but may split an instance with poor connectivity (due to partial occlusion) into two or more parts, or merge stacked objects into one. The final instance prediction blends the advantages of both: ScoreNet scores the clustering results, and NMS is then applied. It is clear from the figure that the resulting instance segmentation is more accurate.

Figure 4. Visualization of test results on the TO-Scene dataset for our algorithmic process. Objects are the semantic output of the two-branch structure. Cluster is the visualization of the clustering result. Instance Pred represents the final instance segmentation outcome produced by the algorithm.
Table 2 reports the per-category scores of our method on the three datasets. The differences among these scores stem from several causes, which we attempt to analyze. First, variations in shape and appearance among object categories are crucial factors: the complexity of an object's shape, specifically the regularity of its geometric surface, directly influences segmentation difficulty. For instance, a nearly cylindrical object like a can is likely to achieve a higher score than a complexly shaped item such as a camera. Flat objects, including notebooks and keyboards, pose additional challenges because they are susceptible to noise from the surrounding desktop during segmentation. While object size may also contribute, the integration of MNFS has significantly improved results in this respect. High intra-class variation, as in the vegetables category, can likewise increase segmentation difficulty and thereby impact accuracy. Finally, factors such as data labeling quality and proximity to other objects can introduce variations in the results.
In Figure 5, we present a comparison of the outputs of different methods. Comparing PointNet++ and Point Transformer, we can clearly see that the latter, with its self-attention mechanism, understands the scene better, and its object classification accuracy improves greatly. Furthermore, introducing our MNFS module (without clustering) refines local details, leading to more precise segmentation of small-size objects such as erasers, pencil holders, and books. In brief, the deep learning network paired with the MNFS module performs strongly; coupled with the clustering method's ability to preserve object integrity, our method holds a clear advantage over the other methods.

Figure 5. Visual comparison of test results on the TO-Scene dataset for different methods. Different colors represent separate semantic categories of objects. Key regions where misclassification or mis-segmentation occurs are circled in red.
Ablation study
As shown in Table 3, we performed ablation experiments on the MNFS module and the clustering component for object segmentation. The results show that MNFS improves mIoU by 6.5% on the TO-Crowd dataset compared to the backbone model, and by approximately 2% on the TO-Vanilla and TO-ScanNet datasets, indicating that it is particularly beneficial in dense scenes. The clustering module improves mIoU by 5.03% on the TO-Vanilla dataset and by about 1.8% on the other two datasets, indicating enhanced effectiveness in scenarios with sparse objects. Both components contribute significantly to refining the overall algorithm's performance.
Table 3. Test results of the ablation study for the MNFS and clustering.
Experiments in real-world scenes
Experimental scenes setup
Our experiments are conducted on an autonomously designed robot platform equipped with a ROBOTIQ three-finger gripper on the end-effector. 3D reconstruction is performed using an Intel RealSense D435i camera, which acquires RGBD images of the scene from multiple angles. The robot's computational processor is the Nvidia Jetson AGX Xavier Developer Kit. The robot adopts a humanoid configuration with seven degrees of freedom per arm, comprising three shoulder joints, three wrist joints, and one elbow joint. Given the typical shapes of objects in a tabletop scenario, the payload at the end of the robot arm is set to 5 kg, and the arm's workspace spans 914 mm. To address end-localization accuracy, hand-eye calibration is performed between the depth camera and robot coordinate systems. To enhance stability during object grasping and transfer, we employ harmonic reducers as joint reducers, benefiting from their low backlash and high reduction ratio; the connecting rods were designed with a large safety margin to ensure stiffness and improve positioning accuracy. Simultaneously, we implement a force control program for the arm's rotating joints: force sensors mounted on the joint output axes provide joint force perception, allowing the joint controller to close the torque loop. This enables impedance control at both the joint and end-effector levels. The relative accuracy of the end force control is within 0.2 N of force and 0.1 Nm of torque, ensuring precise and stable gripping and operation. The experimental scene is illustrated in Figure 6; on the left is the structural diagram of our self-designed robotic arm. The robot platform with its depth camera applies the algorithms discussed in this paper to generate instance segmentation results for tabletop objects.

Figure 6. Diagram of the experimental scene and the robot platform.
Semi-autonomous experiments
Semi-autonomous grasping experiments are conducted in a realistic setting using a robotic platform equipped with the algorithms outlined in this paper. Initially, the depth camera captures RGBD images of the scene for 3D reconstruction; real-time 3D point cloud reconstruction is achieved through truncated signed distance function (TSDF) spatial fusion, global feature matching, and local optimization. After instance segmentation with our algorithm, researchers select targets through mouse clicks, and the point cloud corresponding to the grasping target is extracted from the segmentation results. Trajectory planning uses the sampling-based planners of the Open Motion Planning Library (OMPL) integrated into MoveIt!. Specifically, the URDF model of the physical robot is configured, obstacle constraints are extracted from the point cloud of the current scene, and OMPL's rapidly-exploring random trees (RRT) algorithm generates smooth, collision-free motion trajectories for the arm. The geometric center point and rotation angle are used to derive the desired end pose, enabling grasping experiments conducted with a strategy based on shape primitives and pose estimation.35–37 For arm control, we implement position control in the end Cartesian space using the moveLToPose function of the robot development API. Figure 7 offers a clearer view of the sequence in some experimental runs, while the experimental flow is shown in Figure 8.

Figure 7. Real-world scenarios of robot grasping experiments. The scene in Experiment 2 is denser than that in Experiment 1.

Figure 8. Flowchart of experimental semi-autonomous grasping for robots based on the object instance segmentation algorithm in realistic scenes.
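To illustrate how a grasp pose can be derived from one segmented instance, the sketch below computes the geometric center and a rotation angle about the vertical from the instance's dominant horizontal axis via PCA; it is a simplified stand-in for the shape-primitive and pose-estimation strategy of the cited works, not our exact pipeline.

```python
import numpy as np

def grasp_pose_from_instance(instance_pts):
    """Derive a simple grasp pose from an (N, 3) instance point cloud:
    geometric center plus a yaw angle from the dominant horizontal axis."""
    center = instance_pts.mean(axis=0)                      # geometric center point
    xy = instance_pts[:, :2] - center[:2]
    eigvals, eigvecs = np.linalg.eigh(np.cov(xy.T))         # 2D PCA in the table plane
    major = eigvecs[:, eigvals.argmax()]                    # dominant horizontal axis
    yaw = np.arctan2(major[1], major[0])                    # rotation angle about z
    return center, yaw

pts = np.random.rand(500, 3) * np.array([0.2, 0.05, 0.08])  # toy elongated object
center, yaw = grasp_pose_from_instance(pts)
print(center, np.degrees(yaw))
```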
Realization results and analysis
We evaluate the algorithm on the robot's reconstructions of realistic tabletop scenes, testing both sparse and crowded scenarios. The test results demonstrate instance segmentation of tabletop objects in various situations, as shown in Figure 9: the point cloud of the realistic scene generated by depth camera 3D reconstruction on our robot platform is presented on the left, the instance segmentation results for tabletop objects are displayed in the middle, and the corresponding grasping strategy is depicted on the right.

Figure 9. (a) 3D reconstruction of realistic scenes. (b) Instance segmentation. (c) Grasping strategy.
The time to completion (TOC) is defined as the duration from issuing the algorithm execution instructions to completion, that is, the interval between receiving the point cloud scene as input and obtaining the instance segmentation result as output. In our experiments, the average single-scene segmentation time is approximately 317 ms. We convert the contour centers of the object instances into the world coordinate system and compare them with actual measured values; repeated experiments show an average error of about 1 cm. These results highlight the algorithm's effectiveness at object instance segmentation in realistic tabletop scenes.
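The conversion of an instance's contour center into the world coordinate system amounts to applying the 4 × 4 hand-eye calibration transform; a minimal sketch follows, with an illustrative (not measured) transform.

```python
import numpy as np

def to_world(center_cam, T_cam_to_world):
    """Convert an instance contour center from camera to world coordinates
    using the 4x4 hand-eye calibration transform."""
    p = np.append(center_cam, 1.0)    # homogeneous coordinates
    return (T_cam_to_world @ p)[:3]

T = np.eye(4)
T[:3, 3] = [0.30, -0.10, 0.85]        # hypothetical camera offset in meters
print(to_world(np.array([0.05, 0.02, 0.60]), T))
```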
We encountered some challenges during the experiments, which will be the focus of our future work. Notably, the accuracy of our results is significantly influenced by the performance of the 3D reconstruction algorithm, as evidenced by the superior segmentation in sparse scenes compared to dense scenes: missing or distorted reconstructions directly impact segmentation accuracy. In particular, the portion of an object in contact with the tabletop may be reconstructed more sparsely than its top, which can affect the segmentation of small objects to a certain extent. Moreover, instance segmentation errors on object contours or incorrect grasping strategies can lead to failures during the grasping operation.
These aspects require further research on our part, including testing on more robot platforms. Overall, our instance segmentation algorithm achieves a success rate of up to 91.7% in repeated trials of robot grasping application.
Conclusion
We design an accurate and effective instance segmentation algorithm for tabletop scenes. Our tabletop-aware learning approach incorporates a multi-scale neighborhood feature sampling module within Point Transformer, enabling feature extraction from 3D point clouds, particularly for small-sized objects. In addition, we design a clustering algorithm that preserves instance integrity. Finally, we enhance segmentation accuracy by combining the advantages of the network and the clustering through ScoreNet.
We have conducted comprehensive experiments evaluating our algorithm on the TO-Scene dataset, observing an average mIoU improvement of approximately 4.07% over existing methods. The ablation study demonstrates the effectiveness of our MNFS module and clustering algorithm. In real-world scenes, we explore the application of our algorithm to robotic grasping; the results indicate a single-scene segmentation speed of approximately 317 ms and a grasping success rate of up to 91.7%. We are actively exploring further applications of instance segmentation in robotics. In future work, we would like to improve the algorithm's accuracy through multimodal fusion, and to investigate small-sample or unsupervised instance segmentation of objects in tabletop scenes.
Footnotes
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported in part by the National Natural Science Foundation of China (Grant No. U22B2079, 62103054, 62273049 and U2013602), Beijing Natural Science Foundation (Grant No. 4232054 and 4242050), Foundation of National Key Laboratory of Human Factors Engineering (Grant No. HFNKL2023WW06), Beijing Institute of Technology Research Fund Program for Young Scholars (Grant No. XSQD-6120220298).
