Abstract
Shared perception between robotic systems significantly enhances their ability to understand and interact with their environment, leading to improved performance and efficiency in various applications. In this work, we present a novel full-fledged framework for robotic systems to interactively share their visuo-tactile perception for the robust pose estimation of novel objects in dense clutter. This is demonstrated with a two-robot team that shares its visuo-tactile scene representation, declutters the scene using interactive perception and precisely estimates the 6 Degrees-of-Freedom (DoF) pose and 3 DoF scale of a target unknown object. This is achieved with the Stochastic Translation-Invariant Quaternion Filter (S-TIQF), a novel Bayesian filtering method with robust stochastic optimization for estimating the globally optimal pose of a target object. S-TIQF is also deployed to perform in situ visuo-tactile hand-eye calibration, since shared perception requires accurate extrinsic calibration between the two different sensing modalities, tactile and visual. Finally, we develop a novel active shared visuo-tactile representation and object reconstruction method employing a joint information gain criterion to improve the sample efficiency of the robot actions. To validate the effectiveness of our approach, we perform extensive experiments across standard datasets for pose estimation, as well as real-robot experiments with opaque, transparent and specular objects in randomised clutter settings, together with a comprehensive comparison against other state-of-the-art approaches. Our experiments indicate that our approach outperforms state-of-the-art methods in terms of pose estimation accuracy for both dense visual and sparse tactile point clouds.
1. Introduction
Humans are capable of seamlessly integrating perceptual information from vision and touch (haptics) to maintain a high level of cognitive understanding of the environment (Ernst and Banks, 2002; Hatwell, 1987). Robots should be able to achieve a similar level of scene understanding given that they are similarly equipped, for example, with visual and tactile sensing. Shared perception among complementary sensing modalities offers a comprehensive and accurate scene representation and addresses the weaknesses inherent in individual sensor systems. It also makes perception robust against sensor failure, as the robot can rely on the other modality to retain the same level of functionality (Murali et al., 2022d). As with humans, robots also have the option to enhance their perceptual information through purposeful manipulative actions, a technique known as interactive perception, which forges a symbiotic relationship between action and perception (Bohg et al., 2017). Thus, by leveraging shared and interactive perception, robots can potentially increase their autonomy and efficacy in real-world scenarios.
However, sharing multi-modal visuo-tactile perceptual information is challenging due to the weakly paired and complementary nature of the sensing modalities. Visual perception provides dense and global information about the scene, whereas tactile perception provides sparse and local contact information. Temporal misalignment also affects shared perception, as visual data can be captured in one shot while tactile data acquisition requires sequential contact interactions with objects (Li et al., 2020). Previous research on multi-robot and multi-sensor shared perception frequently relies on identical sensing modalities, typically multiple cameras, which simplifies the representation of the shared scene (Lauri et al., 2020). Active perception techniques, characterized by the proactive selection of sensor positions to enhance information gathering, are often utilized in the context of single-sensor setups or setups with multiple sensors of the same modality (Connolly, 1985; Delmerico et al., 2018). Nevertheless, the extension of active perception methods to multi-sensor configurations comprising different modalities, such as visual and tactile sensing, poses a non-trivial challenge. Similarly, there are recent works tackling the problem of category-level object pose estimation, wherein the exact CAD model of the object of interest is unknown but prior knowledge of objects belonging to the same category is available. These works typically regress a shared canonical representation of all possible object instances within a category and use the measured depth information to lift from 2D to 3D space to perform object pose estimation (Deng et al., 2022; Lee et al., 2021; Wang et al., 2019).
In summary, the state-of-the-art methods have several limitations: (a) category-level pose estimation techniques are predominantly tailored to visual sensing information (RGB and depth data), rendering them unsuitable for direct adaptation to other sensory modalities, such as tactile sensing; (b) these methods are also not evaluated on photometrically challenging objects such as transparent objects (Wang et al., 2022); (c) active perception methods that are designed for mono-modal settings cannot be directly extended to multi-modal settings; (d) misalignment between multi-modal visual and tactile data often arises due to calibration errors that affect the shared perceptual information. Addressing this misalignment requires specific calibration procedures, which are often laborious and time consuming.
In our previous work (Murali et al., 2021), we presented a recursive Bayesian filtering approach for object pose estimation through point cloud registration termed the translation-invariant quaternion filter (TIQF). However, we assumed a priori knowledge of the CAD model of the target object, and TIQF is prone to getting stuck in local minima if incorrectly initialized. Furthermore, in our recent work (Murali et al., 2023) we demonstrated a data-driven approach to reconstructing novel transparent objects belonging to known categories through tactile sensing alone. In this work, we address the limitations of our previous works (Murali et al., 2021, 2023) and present several new contributions as follows: I. We propose a novel shared visuo-tactile perception method for scene representation and object reconstruction through a data-efficient joint information-theoretic approach for active perception (vision or tactile). II. We present the Stochastic Translation-Invariant Quaternion Filter (S-TIQF), a recursive Bayesian filtering method with robust stochastic optimization for globally optimal pose estimation. S-TIQF estimates the 6 DoF pose and 3 DoF scale of unknown instances of categorical objects and relaxes the need for an a priori known model of the object. III. A necessary condition for shared perception is accurate calibration between the sensing modalities. We present a novel approach for in situ visuo-tactile hand-eye calibration using arbitrary objects, which removes the need for specific hand-eye calibration targets and time-consuming calibration procedures. IV. We integrate the developed methods into a full-fledged framework that enables multi-robot teams to share their perceptual information with the objective of decluttering a complex scene, reconstructing objects and robustly estimating their pose.
We conducted extensive experiments to validate our framework against state-of-the-art approaches, using various benchmark datasets and real-world robotic experiments (Figure 1). To the best of our knowledge, this is the first work tackling the problem of shared visuo-tactile interactive perception for robust object pose estimation.
[Figure 1: Experimental setup. A Universal Robots UR5 sensorised with tactile sensor arrays on the Robotiq gripper, a Franka Emika Panda robot equipped with an Azure Kinect RGB-D camera, and clutter objects containing the novel target object. The objective is to collaboratively declutter the scene, share the visuo-tactile perceptual information and find the pose of the target object.]
This paper is organized as follows: Section 2 summarizes the state-of-the-art in interactive perception, object reconstruction and pose estimation and highlights our contributions in the context of current and related research. Our framework and methodology are presented in Section 3. The experimental results are reported in Section 4 and finally concluded in Section 5.
2. Related work
We review the state-of-the-art methods for interactive perception, shared perception for object reconstruction and object pose estimation and their relation to our work in this section.
2.1. Interactive perception
Interactive perception, or perceptive manipulation, is any kind of purposeful manipulation action performed to simplify or enhance the perception of the environment (Bohg et al., 2017). Interactive perception techniques rely upon effective scene understanding in order to plan and execute manipulative actions, and the scene understanding iteratively improves upon performing those actions. In unstructured cluttered scenarios, the target object may have multiple other objects overlapping it in random configurations. A typical choice for scene understanding in computer vision is the scene graph, a data structure that describes objects in a scene and the relationships between these objects (Johnson et al., 2015). Support graphs, a type of scene graph, have been introduced to describe the support relations between objects in the scene through geometric reasoning (Kartmann et al., 2018; Mojtahedzadeh et al., 2015; Schwarz et al., 2018). Sui et al. (2017) presented an axiomatic scene estimation method to describe the relationship between objects and object poses as a scene graph for manipulation. Mitash et al. (2019) developed a Monte Carlo Tree Search-based technique for scene understanding leveraging physics priors of objects in clutter for pose estimation. Zhang et al. (2021) tackled the issue of inferring object relationships through a neural network performing a classification task on all possible pairwise permutations of the objects in the scene.
Scene understanding is followed by planning manipulation actions in clutter, which is a challenging task that has received immense research interest. Typically, research on grasping in clutter falls into model-based approaches or data-driven methods. Analytical or model-based approaches have been studied in depth in Bicchi and Kumar (2000). Berenson and Srinivasa (2008) devised an optimisation method for generating the gripper pose in clutter given knowledge of the object shape to ensure force closure. Some works such as Moll et al. (2017) and Dogar et al. (2012) rely upon physics simulators to model the robot-object and object-object interactions while performing manipulation actions in clutter. More recent studies involving data-driven grasping have leveraged deep neural networks (DNNs) to achieve significant results, as detailed in Bohg et al. (2013). Several works leveraging data-driven methods have focused on grasping a priori known objects (Detry et al., 2011; Goldfeder et al., 2007; Miller et al., 2003; Przybylski et al., 2011) or familiar objects of a known object class by matching against a database of grasping information or ranking based on prior grasping experience (Detry et al., 2012; Mahler et al., 2016; Patten et al., 2020). For grasping unknown objects, prior works rely upon global shape or features from sensory data and a set of heuristics (Bohg et al., 2013; Morrison et al., 2020; Schaub and Schöttl, 2020; Schmidt et al., 2018). For instance, Morrison et al. (2020) developed an object-independent grasp synthesis method from depth images using their generative grasping convolutional neural network (GG-CNN). Tactile sensors have also been used for manipulation of deformable objects (Kaboli et al., 2016). However, most of the prior works focused on singulated objects (single objects in the scene) or structured clutter, wherein there are multiple objects in the scene but they are spread apart with minimal contact between them. In dense unstructured clutter, by contrast, multiple objects are densely packed over one another in randomised configurations. In such scenarios, relying only upon grasping actions can severely handicap the robot, as some objects may be very hard to grasp due to the surrounding clutter. Some recent works have leveraged both prehensile manipulation, such as grasping, and non-prehensile manipulation, such as pushing actions (Danielczuk et al., 2019; Dutta et al., 2023; Murali et al., 2022b; Zeng et al., 2018). If the objective is to grasp and retrieve an object, pushing is often used to singulate the object away from clutter, aiding the subsequent grasp action (Grimm et al., 2021). In mechanical search, however, where the goal is to retrieve a target object, grasping and pushing actions are used in conjunction in order to declutter the workspace around the target object (Danielczuk et al., 2019; Murali et al., 2022b). The work by Danielczuk et al. (2019) focused on retrieving a target object from clutter using a heuristic that removes the largest object first; furthermore, the target object in their work is known a priori. Their framework is based on deep reinforcement learning to learn the synergy between grasping and pushing and is therefore data-hungry and compute-intensive.
In comparison with the prior work reviewed above, we focus on estimation of the pose of an unknown target object in dense unstructured clutter. We present a declutter scene graph-based approach that directly encodes the relationship between objects in the scene as well as the type of action to perform (grasp/push) to declutter the objects. Our formulation ensures that only necessary minimal actions are performed such that the target object is not occluded for pose estimation. The actions are also chosen automatically based on the grasp affordance of the object.
2.2. Object reconstruction
Object reconstruction is the process of capturing the shape and appearance of a 3D object by moving a suitable sensor around the object. Typically, object reconstruction approaches can be classified into statistical model-based methods and deep learning-based generative models (Phang et al., 2021). Some statistical methods involve capturing point clouds from various viewpoints and aggregating them into a common coordinate frame using point cloud registration (Delmerico et al., 2018). Xie et al. (2021) designed a generative model based on PointNet (Qi et al., 2017) capable of performing reconstruction and interpolation. We focus on active object reconstruction techniques, wherein the sensing locations are chosen autonomously by robots to improve performance efficiency and avoid an exhaustive search. In this regard, Delmerico et al. (2018) compared various next-best-view (NBV) strategies for an object in an uncluttered scene in simulation, demonstrating the usability of information-theoretic criteria for efficient reconstruction. Similarly, Bissmarck et al. (2015) devised an efficient volumetric NBV algorithm exploiting frontier voxels and spatial hierarchy. While these works are limited to one camera, more recently, multiple cameras have been used for reconstruction by exploiting the joint uncertainty in the modelling (Lauri et al., 2020). Along similar lines, Cui et al. (2019) devised a multi-sensor strategy for next-best-view calculation with a laser range sensor and RGB-D camera based on occupied voxel metrics.
Typical vision sensors, in particular depth sensors, are sensitive to transparent and specular objects, producing erroneous or missing regions in the measurements. To overcome this limitation, recent works have transferred information from the RGB modality, which is comparatively less sensitive to transparency and specularity, into the depth modality and utilized off-the-shelf depth-based grasping methods for manipulating such objects (Weng et al., 2020). Similarly, Sajjan et al. (2020) developed a technique to reconstruct transparent objects wherein convolutional neural networks infer the normals, contours and semantic segmentation from RGB images, which are then used to refine the depth estimate and recover the shape of transparent objects. Along similar lines, Zhang et al. (2022) developed a transformer-based architecture for depth completion, given instance-based segmentation and RGB images, for the reconstruction of transparent objects. A detailed review of robotic perception for transparent objects is found in Jiang et al. (2023). As the majority of vision-based techniques for reconstruction of transparent objects depend on the availability of high-fidelity RGB image priors, their performance cannot always be ensured in unstructured environments where the lighting conditions vary; low and bright lighting conditions cause erroneous results with transparent objects (Sajjan et al., 2020). Moreover, transparent and specular objects in cluttered environments are also challenging for vision-based sensing due to shadows and occlusions.
Tactile sensors, in contrast, are robust to ambient lighting conditions and are relatively insensitive to the transparency or specularity of objects compared to visual sensors. Thus, tactile sensors have been used independently or in conjunction with vision sensors for robotic object perception (Dutta et al., 2024; Kaboli et al., 2015, 2018; Murali et al., 2022c). Gaussian process implicit surfaces (GPIS) have been used for shape reconstruction from both vision and tactile inputs (Rustler et al., 2022; Suresh et al., 2022); however, GPIS approaches are known to be computationally expensive (Schulz et al., 2018). Similarly, Wang et al. (2018) designed a framework for generating the 3D shape of objects from a single visual image using learnt shape priors, which is then refined using tactile sensing. They also performed uncertainty-based next-best-touch (NBT) computation while keeping the camera static.
These previous works used touch to refine the shape prior estimated by vision in a two-step process. In comparison, in this work we present a novel approach for shared active object reconstruction using a joint information gain metric for sensor selection and next-best-view (NBV) or next-best-touch (NBT) execution. The robots, equipped with vision and tactile sensors, coordinate autonomously to reconstruct the object with minimal actions for the objective of object pose estimation while avoiding overlapping regions of data collection.
2.3. Object pose estimation
Object pose estimation is a broad field of research in computer vision and robotics, with approaches broadly categorised by whether they use 2D data (RGB images) or 3D data (RGB-D images or point clouds) as input (He et al., 2020). Here, we review point cloud-based approaches for object pose estimation as they are relevant to our work. Typically, with point cloud-based approaches, a point cloud that corresponds to the object CAD model is registered or matched with the sensor-acquired point cloud from the scene, and the output of the registration process yields the 6 DoF pose of the object. Iterative closest point (ICP) and its variants are popular point cloud-based approaches for pose estimation (Pomerleau et al., 2013). They fall into the category of simultaneous correspondence and pose estimation methods, wherein there is an iterative alternation between estimation of the closest point in the target point set and minimisation of the distance between the corresponding points (Besl and McKay, 1992). However, such methods are local approaches and require good initialisation as they tend to converge to local minima. In contrast, other approaches rely upon finding dense point-to-point correspondences using feature extraction and then optimise for the 6D pose (Gentner et al., 2023; Huang et al., 2021b; Rusu et al., 2009; Yang et al., 2020). Choukroun et al. (2006) devised a Bayesian filtering approach for rotation-only estimation (the so-called Wahba's problem) between two coordinate systems. In this work, by contrast, we tackle the problem of full SE(3) pose and scale estimation by converting the non-linear problem of pose estimation into decoupled rotation and translation estimation, exploiting the geometry of the measured point clouds. Recently, deep learning-based approaches have been used to learn robust features for generating correspondences, followed by an optimization such as RANSAC (Deng et al., 2018; Zeng et al., 2017). Deep learning approaches have also been used to regress the pose directly in an end-to-end manner, learning the pose parameters directly from the features of the input point clouds (Huang et al., 2021a; Pais et al., 2020; Yang et al., 2019).
Contrary to instance-based methods, recent works have addressed the pose estimation of unknown objects from known object categories without any instance-specific CAD models available, known as category-level object pose estimation. Wang et al. (2019) introduced the problem of category-level pose estimation and presented the Normalised Object Coordinate Space (NOCS), which produces a shared canonical representation for all object instances in each category. The predicted NOCS map is used together with the observed depth map to extract the pose and shape of objects. Lee et al. (2021) extended the NOCS map to CNN-based category-level pose estimation from RGB images with little or no depth information. Similarly, other works have used variational auto-encoders (VAE) for generating the canonical 3D point clouds, with the pose regressed using another deep neural network (Chen et al., 2020). Some works explicitly model the intra-class shape variations using deformation from pre-learned shape priors (Tian et al., 2020). In addition to pose estimation, Deng et al. (2022) combined their category-level auto-encoder with a particle filter framework for iterative tracking of unknown objects; the method relies upon accurate depth estimation and semantic segmentation as input. Similarly, Wen and Bekris (2021) performed 6D pose tracking for unknown objects using learnt networks for segmentation and keypoint extraction and pose graph optimisation for pose tracking. As the accuracy of category-level estimation is far from satisfactory in comparison to instance-level methods, some works perform iterative point cloud pose refinement after finding the categorical shape prior (Liu et al., 2022b).
2.3.1. Visuo-tactile-based pose estimation
Prior works used accurate depth or RGB images for category-level pose estimation; however, in the case of photometrically challenging objects (transparent, shiny, reflective), the input visual depth data are unreliable. Such transparent and shiny objects, like wine glasses and metallic cutlery, are ubiquitous in unstructured environments where robust operation of robots is necessary, as evidenced by the rise of datasets such as the PhoCal dataset (Wang et al., 2022). Prior works on object pose estimation for robots have leveraged high-fidelity tactile sensing embedded on the end-effector or body surface to improve the visual pose estimate (Kaboli and Cheng, 2018; Kaboli et al., 2017; Murali et al., 2021, 2022a). Although tactile sensing can provide accurate grounded information regarding the objects, it has characteristics complementary to those of vision sensing. Visual sensing provides dense information of the global scene, whereas current commercial tactile sensors generally provide sparse and local information about the object in contact. While vision sensors capture the entire scene in one shot, tactile sensors acquire information sequentially and require memory of previous acquisitions to iteratively build the scene information (Dahiya et al., 2019; Li et al., 2020; Liu et al., 2022a; Liu and Sun, 2018). In contrast to vision sensing, tactile data are action-conditioned, such that the kind of data acquired depends on the type of action performed (Kaboli et al., 2019). Point clouds are preferred for array-based tactile sensors, expressing the visual and tactile data in the same domain for pose estimation. In the case of known objects, point cloud registration provides the accurate 6 DoF pose of the target object in the scene (Pomerleau et al., 2015). However, state-of-the-art techniques as well as standard methods such as ICP and its variants perform poorly on sparse tactile data (Pomerleau et al., 2013). Hence, due to the sparsity and sequential nature of tactile data, prior works have used sequential filter-based methods for pose estimation (Murali et al., 2021; Petrovskaya and Khatib, 2011; Vezzani et al., 2017). Some works have also developed novel tactile descriptors regardless of the type of tactile sensor or method of tactile data extraction (Kaboli and Cheng, 2018). Another method extracts local geometric features using PCA and estimates the pose by matching the covariance between the extracted tactile data and the object model (Bimbo et al., 2016). On the other hand, methods based on vision-based tactile sensors express the tactile data as RGB images or pressure heatmaps and use feature extraction and pose estimation techniques that are typical in the computer vision literature (Bauza et al., 2019; Kuppuswamy et al., 2019; Li et al., 2014; Suresh et al., 2021). Recently, a rising number of works have used visuo-tactile sensing for accurate object pose estimation. Vision has been used to provide an initial estimate of the object pose that is then refined by tactile localisation using local or global optimization techniques (Hebert et al., 2011). Bhattacharjee et al. (2015) assumed that visually similar surfaces have similar haptic properties; based on this, they efficiently created a dense haptic map across visible surfaces from sparse haptic labels, allowing a humanoid to perform a reaching task in cluttered foliage. De Gregorio et al. (2018) leveraged vision and tactile sensing to accurately estimate the pose of a deformable wire in an insertion task. While manipulating objects in-hand, the objects are typically occluded from the line-of-sight of the camera. Prior works have fused vision and tactile sensing data to accurately measure and track the pose of in-hand objects using Bayesian filtering techniques and deep learning methods (Álvarez et al., 2019; Dikhale et al., 2022; Pfanne et al., 2018). Recent works used deep learning-based approaches along with pose-graph optimization to track and recover the shape of novel objects during in-hand manipulation by combining visual and tactile sensing (Qi et al., 2023; Suresh et al., 2023).
3. Methodology
3.1. Problem formulation and framework
The objective is to accurately identify the rotation, translation and scale of the target object in dense clutter from shared visuo-tactile perception. Figure 2 provides an overview of the framework.
[Figure 2: Our proposed framework for shared interactive visuo-tactile perception for active object reconstruction and robust pose estimation in dense clutter, including a novel approach for in situ visuo-tactile based hand-eye calibration. (a) Visuo-tactile based interactive scene decluttering. (b) Shared visuo-tactile based active object reconstruction. (c) Visuo-tactile based robust pose estimation.]
[Table: List of notations.]
3.2. Visuo-tactile-based interactive scene decluttering
Before being able to estimate the pose of individual objects using our proposed S-TIQF algorithm, we may need to declutter a possibly cluttered scene. As objects may be present in random configurations in the scene, a method and formalism are necessary to encode the spatial and support relationships between the objects. We encode such relationships in the form of a scene graph termed the declutter graph, which was presented in our previous work (Murali et al., 2022b); we briefly describe it in this section for the sake of completeness and readability. The decluttering process is shown in Figure 2(a).
The declutter graph is a directed graph whose vertices represent the objects detected in the scene and whose edges encode the spatial and support relations between them. Edges are created according to two metrics:
Overlap Metric: Two objects representing the vertices $v_i$, $v_j$ of the graph constitute an edge if one object overlaps the other in the scene.
Proximity Metric: Two objects representing the vertices $v_i$, $v_j$ of the graph constitute an edge if they lie in close proximity to each other.
Each edge thus encodes the relationship between a pair of objects and the decluttering action (grasp or push) to be performed, so that occluding objects are removed before the pose of the target object is estimated.
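To make the declutter graph concrete, the following is a minimal Python sketch (not the implementation from Murali et al. (2022b)) of how such a directed graph could be assembled; the bounding-box overlap test, the proximity threshold and the grasp-affordance flag are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class ObjectNode:
    """A detected object in the scene (fields are illustrative)."""
    name: str
    bbox: tuple       # axis-aligned bounding box (xmin, ymin, xmax, ymax) in image space
    graspable: bool   # e.g. derived from a grasp-affordance estimate

@dataclass
class DeclutterGraph:
    """Directed graph whose vertices are objects and whose edges carry the declutter action."""
    nodes: list = field(default_factory=list)
    edges: dict = field(default_factory=dict)   # (i, j) -> "grasp" or "push"

    def _overlap(self, a, b):
        # Overlap metric (illustrative): do the 2D bounding boxes intersect?
        return (a.bbox[0] < b.bbox[2] and b.bbox[0] < a.bbox[2] and
                a.bbox[1] < b.bbox[3] and b.bbox[1] < a.bbox[3])

    def _proximal(self, a, b, thresh=20.0):
        # Proximity metric (illustrative): bounding-box centres closer than a threshold.
        ac = ((a.bbox[0] + a.bbox[2]) / 2.0, (a.bbox[1] + a.bbox[3]) / 2.0)
        bc = ((b.bbox[0] + b.bbox[2]) / 2.0, (b.bbox[1] + b.bbox[3]) / 2.0)
        return ((ac[0] - bc[0]) ** 2 + (ac[1] - bc[1]) ** 2) ** 0.5 < thresh

    def build(self):
        for i, vi in enumerate(self.nodes):
            for j, vj in enumerate(self.nodes):
                if i != j and (self._overlap(vi, vj) or self._proximal(vi, vj)):
                    # Edge (i, j): object i is related to object j and is removed by
                    # grasping if it affords a grasp, otherwise by pushing it aside.
                    self.edges[(i, j)] = "grasp" if vi.graspable else "push"
        return self
```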
3.3. Shared visuo-tactile-based active object reconstruction
As the shape of the target object is unknown, reconstruction of the object is necessary for pose estimation and other possible downstream tasks such as precise manipulation. Our framework autonomously chooses (a) which sensor to use, (b) where to perform sensing and (c) how much of the object information is necessary for the chosen objective of pose estimation. In a single-sensor scenario, the next best action selection problem seeks to find the optimal next sensory action to perform, based on the current knowledge of the environment, in order to maximise the information gain calculated through an objective function. In a multi-agent and multi-sensor scenario, there is additionally the sensor selection problem, which seeks to find the optimal sensor to employ given the current knowledge of the environment; it incentivises coordination between the agents and reduces redundant data collection. In our case, the two robots, equipped with a visual RGB-D sensor and a tactile sensor array respectively as shown in Figure 1, are tasked with reconstructing the object in a coordinated and time-efficient manner.
3.3.1. Vision and tactile action sampling
For the Next-Best-View (NBV) and Next-Best-Touch (NBT) selection, we perform Monte-Carlo sampling of the visual and tactile actions, respectively, around the target object. The centroid $o_{centroid}$ of the target object is extracted from the semantic segmentation mask.
For NBV sampling, $N_{nbv}$ viewpoints are sampled on the hemisphere centred on $o_{centroid}$ of the target object. The radius of the hemisphere is determined empirically considering the maximum reach of the robot and any possible vertical offset between $o_{centroid}$ and the robot base frame. The Panda robot has a maximum kinematic reach of 855 mm and we set the radius to a nominal value of 550 mm to avoid singularities at the kinematic extremity. Each sampled viewpoint is a candidate camera pose on the hemisphere oriented towards $o_{centroid}$.
For the NBT sampling, we define the tactile action analogously as a candidate contact pose sampled around the target object.
[Figure: Next best view (NBV) and next best touch (NBT) action selection.]
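A minimal sketch of the hemisphere-based NBV candidate sampling described above; the uniform random sampling scheme, the look-at construction and the default radius and sample count are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def sample_nbv_viewpoints(o_centroid, radius=0.55, n_nbv=32, seed=0):
    """Sample candidate camera poses on a hemisphere centred on the object centroid,
    each oriented towards the centroid (simple look-at construction)."""
    rng = np.random.default_rng(seed)
    viewpoints = []
    for _ in range(n_nbv):
        az = rng.uniform(0.0, 2.0 * np.pi)      # azimuth
        el = rng.uniform(0.0, 0.5 * np.pi)      # elevation (upper hemisphere only)
        pos = o_centroid + radius * np.array([np.cos(el) * np.cos(az),
                                              np.cos(el) * np.sin(az),
                                              np.sin(el)])
        z = (o_centroid - pos) / np.linalg.norm(o_centroid - pos)   # camera z-axis looks at object
        x = np.cross([0.0, 0.0, 1.0], z)
        if np.linalg.norm(x) < 1e-6:            # degenerate case: viewpoint directly above
            x = np.array([1.0, 0.0, 0.0])
        x = x / np.linalg.norm(x)
        y = np.cross(z, x)
        viewpoints.append((pos, np.column_stack([x, y, z])))        # (position, rotation matrix)
    return viewpoints

candidates = sample_nbv_viewpoints(np.array([0.4, 0.0, 0.1]))
```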
3.3.2. Active sensor selection and next best action selection
At each iteration, the next best action $a^*$ is selected from the set of sampled candidate visual and tactile actions.
Given a point cloud captured by the camera or tactile sensor, the occupancy grid is updated with probabilities using the respective sensor models (Hornung et al., 2013). We define a virtual sensor measurement model for the visual and tactile sensors, respectively, for the NBV and NBT calculations. The visual sensor generates the point cloud using a time-of-flight (ToF) sensor. The virtual vision sensor model is defined by a set of beam measurements cast into the occupancy grid.
For the single-sensor case, the expected information gain of taking an action is computed from the expected reduction in uncertainty of the occupancy grid cells observed by that action (eq. (8)).
A naïve way of extending this to multiple sensors is to compute eq. (8) for each sensor. However, this would result in the collection of redundant data due to sensor data overlap. Furthermore, there would be no incentive for coordination between the robots or for leveraging the complementary properties of the vision and tactile sensors. Hence, we present a joint sensor selection and action selection method for vision and tactile sensors. We can utilise the same occupancy grid formulation for integrating the sensor information from vision and tactile sensors: for each cell of the occupancy grid, the probabilistic evidence from each sensor is updated. We define an energy cost $D(a_t)$ that encodes the time taken to perform the robot action. In general, performing visual actions is faster than performing tactile actions with the robot. Hence, we set the cost of tactile actions higher than that of visual actions, so that tactile exploration is favoured only where it is expected to be most informative.
3.3.3. Detection of transparent objects
Detecting transparent objects is a challenging task for off-the-shelf visual cameras with RGB and depth sensing. Many prior works detect transparent objects using specialized sensors or specific calibration setups, with analytical or data-driven methods (Ihrke et al., 2010). We design a simple heuristic approach to detect object transparency in order to set the energy cost $D(a_t)$ during object exploration. We extract the RGB image and point cloud of the target object from a perpendicular top-down view, extract the bounding box of the target object from both the RGB image and the point cloud, and compute the intersection-over-union (IoU) between the two boxes: a low IoU indicates largely missing depth measurements and hence a transparent object.
[Figure: Bounding box segmentation and IoU calculation using (a) RGB image and (b) point cloud for detecting transparent objects.]
The sensor acquired point cloud is used to reconstruct the object model as described in Section 3.3.4.
3.3.4. Category-level object shape reconstruction
In order to recognize the shape of category-level objects, we present a self-supervised learning approach with an auto-encoder network that aims to reconstruct the original point cloud when provided with a subsampled point cloud. The network is trained only on synthetic object models belonging to the same category as, but not identical to, the real-world objects. We generate a dataset of such synthetic object point clouds for training.
[Figure: Architecture for the reconstruction network.]
3.3.4.1. Feature-extraction encoder architecture
The encoder creates a high-dimensional feature vector from a possibly sub-sampled point cloud as its input. This feature vector encodes the overall geometric shape of the input point cloud. For the encoder, we employ a modified PointNet architecture (Qi et al., 2017). The encoder network produces a 1024-dimensional feature vector by selecting the informative and distinctive parts of the point cloud. The encoder consists of [1 × 1] convolutions with output channel sizes (64, 64, 128, 1024), with the first convolutional layer having kernel size [1 × 3] to encode the input point cloud of dimension N × 3. The convolution layers are followed by a max-pooling layer. Furthermore, we add a self-attention layer (Zhang et al., 2019) whose outputs are aggregated with the max-pooled features to provide the global feature vector. The self-attention mechanism allows the model to weigh the importance of each point with respect to the other points in the input point cloud.
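A minimal PyTorch sketch of an encoder of this kind, assuming an input batch of shape (B, N, 3); the batch-norm placement, the single-head dot-product attention and the way the attention output is aggregated with the max-pooled features are illustrative choices, not the exact architecture used in the paper.

```python
import torch
import torch.nn as nn

class PointCloudEncoder(nn.Module):
    """PointNet-style encoder: shared 1x1 convolutions (64, 64, 128, 1024), max-pooling,
    and a self-attention branch aggregated with the max-pooled global features."""

    def __init__(self, feat_dim=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, feat_dim, 1), nn.BatchNorm1d(feat_dim), nn.ReLU(),
        )
        # Simple single-head dot-product self-attention over points.
        self.q = nn.Conv1d(feat_dim, feat_dim // 8, 1)
        self.k = nn.Conv1d(feat_dim, feat_dim // 8, 1)
        self.v = nn.Conv1d(feat_dim, feat_dim, 1)

    def forward(self, pts):                      # pts: (B, N, 3)
        x = self.mlp(pts.transpose(1, 2))        # (B, C, N) per-point features
        attn = torch.softmax(self.q(x).transpose(1, 2) @ self.k(x), dim=-1)  # (B, N, N)
        attended = self.v(x) @ attn.transpose(1, 2)                          # (B, C, N)
        global_feat = x.max(dim=2).values + attended.max(dim=2).values       # (B, C)
        return global_feat

feat = PointCloudEncoder()(torch.rand(2, 256, 3))   # -> tensor of shape (2, 1024)
```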
3.3.4.2. Upsampling decoder architecture
The upsampling decoder enlarges the feature vector input to generate a more detailed and denser output point cloud.
3.3.4.3. Loss function
We utilize the Chamfer distance (Borgefors, 1986) as the loss function, which ensures the reconstructed point cloud follows the 3D shape of the ground truth point cloud. Given the input point cloud $\mathcal{P}$ to our network prior to subsampling and the reconstructed point cloud $\hat{\mathcal{P}}$, the Chamfer distance is
$$ CD(\mathcal{P}, \hat{\mathcal{P}}) = \frac{1}{|\mathcal{P}|}\sum_{p \in \mathcal{P}} \min_{\hat{p} \in \hat{\mathcal{P}}} \lVert p - \hat{p} \rVert_2 + \frac{1}{|\hat{\mathcal{P}}|}\sum_{\hat{p} \in \hat{\mathcal{P}}} \min_{p \in \mathcal{P}} \lVert \hat{p} - p \rVert_2 \qquad (10) $$
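A short sketch of this metric computed with numpy/scipy, as used for evaluation (during training one would normally use a differentiable implementation, e.g. in PyTorch):

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(p, q):
    """Symmetric Chamfer distance between point clouds p (N, 3) and q (M, 3):
    mean nearest-neighbour distance from p to q plus the mean from q to p."""
    d_pq, _ = cKDTree(q).query(p)   # for every point in p, distance to its nearest point in q
    d_qp, _ = cKDTree(p).query(q)
    return d_pq.mean() + d_qp.mean()

# Sanity check: identical clouds have (near) zero Chamfer distance.
pts = np.random.rand(512, 3)
assert chamfer_distance(pts, pts) < 1e-9
```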
3.4. Visuo-tactile-based robust pose estimation
3.4.1. Stochastic translation-invariant quaternion filter (S-TIQF)
Upon reconstructing the object point cloud, we are able to perform pose estimation, which is to compute the unknown 3 DoF scale, rotation and translation that align the reconstructed object point cloud with the scene point cloud.
3.4.1.1. Scale estimation
The reconstructed object point cloud and the sensor-acquired scene point cloud are used to estimate the 3 DoF scale of the object. Subsequently, we can scale the object point cloud with the estimated scale before estimating its rotation and translation.
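The exact scale estimation procedure is not reproduced here; as a purely illustrative placeholder (not necessarily the paper's method), one simple way to obtain a 3 DoF scale is the ratio of axis-aligned bounding-box extents between the scene and model point clouds:

```python
import numpy as np

def estimate_scale(model_pts, scene_pts):
    """Illustrative 3 DoF scale estimate (NOT necessarily the paper's method):
    per-axis ratio of bounding-box extents of the scene cloud to the model cloud."""
    model_extent = model_pts.max(axis=0) - model_pts.min(axis=0)
    scene_extent = scene_pts.max(axis=0) - scene_pts.min(axis=0)
    return scene_extent / np.maximum(model_extent, 1e-9)   # s = (sx, sy, sz)

# The model cloud can then be scaled before rotation/translation estimation:
# scaled_model = model_pts * estimate_scale(model_pts, scene_pts)
```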
3.4.1.2. Stochastic initial alignment
We presented our Translation-Invariant Quaternion Filter (TIQF) in our prior work (Murali et al., 2021), a Bayesian filtering approach for point cloud registration applicable to dense visual and sparse tactile point clouds. However, TIQF is sensitive to the initialisation conditions. Figure 6 shows an example error surface for point cloud registration with TIQF using the Stanford Bunny dataset (Levoy et al., 2005). It is obtained by varying the initial position about one axis in the range of [−5.0, 5.0] m and the initial orientation about one axis in the range of [−π, π]. The error is calculated as the root mean squared error of the distance metric between corresponding points. We note from Figure 6 that the error surface contains multiple local minima in which the optimization can be trapped depending upon the initial conditions. We solve this problem with a stochastic initialization method for TIQF that is robust against local optima, termed Stochastic TIQF (S-TIQF).
[Figure 6: Error surface calculated as the distance between corresponding points of two point clouds upon performing TIQF, with the initialisation parameters (translation and rotation) varied (best viewed on-screen and in colour).]
The stochastic initial alignment is performed through Simulated Annealing (Bertsimas and Tsitsiklis, 1993). Simulated Annealing (SA) is a well-known stochastic method for approximating the global optimum of a given function $f(b)$. In SA, a temperature variable is used to guide the exploration. An initial temperature $t = t_0$ is chosen and the temperature is reduced in each iteration according to the geometric cooling rate, that is, $t' = t\,\zeta$ where $\zeta$ is the cooling rate; this is termed the annealing schedule. At $t = t_0$, an initial state $b = b_0$ is chosen at random and the cost is computed using the cost function $c_0 \leftarrow f(b_0)$. At every iteration, a random state in the neighbourhood of the current state is chosen and the difference in cost $\Delta c$ is calculated. The probability of accepting the new state is given by the Metropolis criterion, $p(b') = \min\left(1, \exp(-\Delta c / t)\right)$.
The new state $b'$ is accepted if $p(b') > \text{random}(0, 1)$. The process is repeated until a pre-defined temperature threshold is reached, $t < t_{min}$, or for a fixed number of iterations. Random restarts are also used, wherein $t$ is reset to $t_0$ when $t \leq t_{min}$; in our experiments, 10 random restarts are performed. In order to use simulated annealing with TIQF, a cost function for SA needs to be designed that, upon finding the solution, provides a good initialization for TIQF to extract the rotation and translation estimates. The cost is defined as the root mean squared error of the nearest-neighbour point-to-point distances for each state $b = \{R, t\}$. The nearest-neighbour correspondence assignment allows fast computation of the costs, thereby allowing a larger number of SA iterations, and the state with minimal cost naturally minimizes the distance between the two point clouds. Hence, for the model point set $\mathcal{P}$ and scene point set $\mathcal{Q}$, the cost of a state $b = \{R, t\}$ is
$$ f(b) = \sqrt{\frac{1}{|\mathcal{P}|}\sum_{p \in \mathcal{P}} \big\lVert (R\,p + t) - \mathrm{NN}(R\,p + t,\, \mathcal{Q}) \big\rVert^2 } $$
where $\mathrm{NN}(x, \mathcal{Q})$ denotes the nearest neighbour of $x$ in $\mathcal{Q}$.
The temperature variable allows exploration in the initial phase, thereby escaping local minima, and the search gradually converges to an optimal solution. The estimated rotation and translation from the stochastic initial alignment are then used to initialize TIQF.
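The following is a compact Python sketch of such a stochastic initial alignment using numpy/scipy; the perturbation magnitudes, the single annealing run (no restarts) and the cooling parameters are illustrative values rather than the tuned settings used in the paper.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.spatial.transform import Rotation

def sa_cost(model, scene_tree, R, t):
    """RMSE of nearest-neighbour distances after applying the candidate pose (R, t)."""
    d, _ = scene_tree.query(model @ R.T + t)
    return np.sqrt(np.mean(d ** 2))

def stochastic_initial_alignment(model, scene, t0=1.0, t_min=1e-3, zeta=0.95, seed=0):
    """Simulated-annealing search over (R, t) used to initialise TIQF."""
    rng = np.random.default_rng(seed)
    tree = cKDTree(scene)
    R = Rotation.random(random_state=seed).as_matrix()          # random initial state b0
    t = rng.uniform(-0.5, 0.5, size=3)
    cost, temp = sa_cost(model, tree, R, t), t0
    best = (R, t, cost)
    while temp > t_min:
        # Propose a random neighbouring state (small rotation and translation perturbation).
        dR = Rotation.from_rotvec(rng.normal(scale=0.2, size=3)).as_matrix()
        R_new, t_new = dR @ R, t + rng.normal(scale=0.05, size=3)
        c_new = sa_cost(model, tree, R_new, t_new)
        delta = c_new - cost
        if delta < 0 or np.exp(-delta / temp) > rng.random():   # Metropolis acceptance
            R, t, cost = R_new, t_new, c_new
            if cost < best[2]:
                best = (R, t, cost)
        temp *= zeta                                            # geometric cooling schedule
    return best[0], best[1]                                     # initial (R, t) for TIQF
```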
3.4.1.3. Correspondence estimation
A crucial factor in point cloud registration from eq. (11) is the knowledge of point correspondences between the two point sets. In realistic scenarios, the point correspondences are not known a priori. On the one hand, simultaneous pose and correspondence estimation methods such as ICP and its variants rely upon nearest-neighbour search for extracting point correspondences while iteratively improving the pose in successive steps (Besl and McKay, 1992). On the other hand, correspondence-based methods extract point correspondences through feature matching and may employ rejection techniques to remove outlier correspondences prior to performing registration. In the case of visual and tactile point clouds, there are further challenges: (a) the point density difference between visual and tactile point clouds and (b) visual point clouds can be captured in one shot whereas tactile point clouds are aggregated through sequential tactile actions. Due to the point sparsity, typical feature-based correspondence matching algorithms are not accurate as they depend on local surface information. Similarly, the nearest-neighbour search used in ICP is not robust to outliers and can get stuck in local minima.
We use mutual nearest neighbours, or Best-Buddies Pairs (BBP) (Oron et al., 2017), to estimate the point correspondences. It has been shown in Oron et al. (2017) that the BBP measure is robust to outliers and to differences in point density, albeit in the context of template matching in the image domain. A pair of points is a best-buddies pair if each point is the nearest neighbour of the other in the opposite point set, and only such mutual nearest neighbours are retained as correspondences.
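A minimal sketch of this mutual nearest-neighbour filtering using scipy KD-trees:

```python
import numpy as np
from scipy.spatial import cKDTree

def best_buddies_pairs(src, dst):
    """Mutual nearest-neighbour (Best-Buddies) correspondences between two point clouds:
    a pair (i, j) is kept only if dst[j] is the nearest neighbour of src[i] AND
    src[i] is the nearest neighbour of dst[j]."""
    _, nn_src_to_dst = cKDTree(dst).query(src)     # j = nearest neighbour of src[i] in dst
    _, nn_dst_to_src = cKDTree(src).query(dst)     # i = nearest neighbour of dst[j] in src
    pairs = [(i, j) for i, j in enumerate(nn_src_to_dst) if nn_dst_to_src[j] == i]
    return np.array(pairs)                         # shape (K, 2): indices into src and dst
```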
3.4.1.4. Rotation estimation
The estimation of rotation and translation is decoupled and performed in consecutive steps. The decoupling is done by computing the relative vectors between pairs of corresponding points as
$$ p_{ij} = p_i - p_j, \qquad q_{ij} = q_i - q_j \qquad (17) $$
where $(p_i, q_i)$ and $(p_j, q_j)$ are pairs of corresponding points in the model and scene point sets. Eq. (17) is independent of translation, so the rotation can be estimated first from the relative vectors and the translation recovered subsequently.
We estimate the current belief over the rotation, represented as a unit quaternion, recursively with a Kalman filter as measurements arrive.
For a corresponding pair of relative vectors, the rotation constraint $q_{ij} = R\,p_{ij}$ can be written in terms of the unit quaternion $x$ representing $R$ as $x \otimes \bar{p}_{ij} = \bar{q}_{ij} \otimes x$, where $\bar{p}_{ij}$ and $\bar{q}_{ij}$ denote the pure quaternions formed from the relative vectors. The quaternion multiplication can be reformulated in matrix form, which turns this constraint into a pseudo-measurement model of the form $H_t\,x = 0$, that is, a linear equation in the state $x$ with a pseudo-measurement matrix $H_t$ built from the relative vectors. Hence, we can apply the standard Kalman filter prediction and update equations to this linear model. We must note that the Kalman filter does not preserve constraints on the state variables, such as the unit-norm property of the quaternion; hence, a common technique is to normalise the state and the associated covariance matrix after each update. The rotation estimate is recovered from the posterior mean of the quaternion state.
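A generic sketch of one such Kalman measurement update with post-update normalisation, assuming the pseudo-measurement matrix H has already been assembled from the current correspondences (its construction is omitted here); this illustrates the structure of the update rather than the exact filter equations of the paper.

```python
import numpy as np

def tiqf_measurement_update(x, P, H, R_noise):
    """One Kalman update for a pseudo-measurement of the form H x = 0, followed by
    re-normalisation of the quaternion state (the linear update does not preserve
    the unit-norm constraint).

    x       : (4,)   quaternion state estimate
    P       : (4, 4) state covariance
    H       : (m, 4) pseudo-measurement matrix built from the relative vectors
    R_noise : (m, m) measurement noise covariance
    """
    innovation = -H @ x                            # measurement is zero: z - H x = -H x
    S = H @ P @ H.T + R_noise
    K = P @ H.T @ np.linalg.inv(S)                 # Kalman gain
    x = x + K @ innovation
    P = (np.eye(4) - K @ H) @ P
    norm = np.linalg.norm(x)                       # re-impose the unit-norm constraint
    return x / norm, P / norm ** 2
```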
3.4.1.5. Translation estimation
Once the rotation estimate $\hat{R}$ is obtained, the translation is computed in closed form from the corresponding points as the least-squares solution $\hat{t} = \bar{q} - \hat{R}\,\bar{p}$, where $\bar{p}$ and $\bar{q}$ are the centroids of the corresponding model and scene points.
Thus, with each iteration of S-TIQF we obtain a new rotation and translation estimate that is used to transform the model. The transformed model is used to recompute correspondences and repeat the TIQF update steps. To check for convergence, we calculate the change in the homogeneous transformation between iterations and require $\Delta_{S\text{-}TIQF} < \xi_{conv}$, that is, the difference in the output pose must be less than a specified threshold (0.1 mm and 0.1° in our experiments), and/or we stop after a maximum number of iterations ($max\_it_{S\text{-}TIQF} = 100$). The pseudo-code of the S-TIQF algorithm is shown in Algorithm 2.
3.5. Visuo-tactile hand-eye calibration
As described in our framework in Figure 2(b)(i), if there is a discrepancy between the visual and tactile point clouds of a static object, it is typically due to incorrect hand-eye calibration (cf. Figure 18(a)). Conventionally, hand-eye calibration is performed using a specialized target such as a calibration grid, as shown in Figure 7(a). However, this process is time-consuming, as it adds additional overhead such as specialized targets and calibration procedures. Furthermore, the grid-based calibration technique may contain residual errors, as it depends on the chosen robot end-effector poses, lighting conditions and sensor noise. Recent works have introduced deep learning-based markerless hand-eye calibration methods using segmentation and differentiable rendering techniques to regress the camera-to-robot pose from input images of the robot and the associated joint kinematics (Labbé et al., 2021; Lu et al., 2023). While the disadvantages of relying solely upon visual images, such as occlusions, challenging backgrounds for segmentation and lighting conditions, still apply, there is the additional overhead of training for the various types and kinematic configurations of the robots. These methods are typically used to compute the camera-to-robot (eye-to-hand) transform and need to be extended for eye-in-hand cases, wherein the camera is attached to the end-effector of the robot, as in our scenario. In this work, we relax the need for a specific calibration artifact or target and demonstrate how to perform hand-eye calibration using any known object present in the workspace of the robot, hence termed in situ calibration. Furthermore, by combining visual and tactile perception, we effectively ground the estimation and correct the visual estimate with sparse tactile data to improve the hand-eye calibration.
[Figure 7: (a) Classical grid-based hand-eye calibration method and (b) our in situ visuo-tactile hand-eye calibration method.]
Consider the two-manipulator system shown in Figure 7(b). We cast the hand-eye calibration problem of finding the unknown camera-to-end-effector transform as a registration problem between the visual and tactile point clouds of a common object in the workspace. Let us denote the dense visual point cloud expressed in the camera frame $C$ as $^{C}P_v$ and the sparse tactile point cloud expressed in the world frame $W$ as $^{W}P_t$.
As the tactile data is high fidelity, we aim to register the dense visual point cloud $^{C}P_v$ to the sparse tactile point cloud $^{W}P_t$ using our S-TIQF algorithm as detailed in Section 3.4.1. Note that any point cloud registration method can be used, but as we demonstrate in Section 4, state-of-the-art point cloud registration methods perform poorly in dense-to-sparse registration, whereas our S-TIQF approach shows high accuracy even with a low number of points. The S-TIQF algorithm produces the homogeneous transform between the two point clouds, from which the corrected hand-eye transform is recovered through the known kinematic chain of the robot.
4. Experiments
4.1. Experimental setup
The experimental setup shown in Figure 1 consists of a Universal Robots UR5 robot with a tactile sensorised Robotiq 2F140 gripper and a Franka Emika Panda robot with the standard Panda gripper. The tactile sensor arrays of the two-finger gripper are acquired from XELA Robotics© and Contactile©. The Contactile sensors embedded on one finger on the outer and inner side comprise a 3 × 3 tactile array. The XELA sensors embedded on the other finger comprise a 6 × 4 array on the outer side and a 4 × 4 array on the inner side of the finger. The fingertip of the finger sensorised with the XELA sensors also has a 6 × 1 array to touch objects from the top. Each taxel of both types of sensor arrays provides three-axis force measurements. This configuration allows the robot to acquire tactile data while touching with the outer side and with the fingertip. We intentionally used two different types of tactile sensors based on different operating principles in order to show that our framework is agnostic to the tactile sensing technology. The normalised force values of the tactile sensors are measured and contact is established when the force exceeds the baseline threshold, $f_{ts} \geq \tau_f$, where $\tau_f = 1.1$. The contact points of the taxels in contact are aggregated to form the tactile point cloud.
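A small sketch of this contact extraction step; the taxel positions in the end-effector frame and the use of forward kinematics to express the contacts in the robot base frame are assumptions made for illustration.

```python
import numpy as np

def extract_contact_points(taxel_forces, taxel_positions_ee, T_base_ee, tau_f=1.1):
    """Keep taxels whose normalised force magnitude exceeds the contact threshold tau_f
    and express their positions in the robot base frame via the end-effector pose.

    taxel_forces       : (N, 3) three-axis force per taxel (normalised against baseline)
    taxel_positions_ee : (N, 3) taxel positions in the end-effector frame (assumed known)
    T_base_ee          : (4, 4) end-effector pose in the base frame (forward kinematics)
    """
    in_contact = np.linalg.norm(taxel_forces, axis=1) >= tau_f
    pts_ee = np.hstack([taxel_positions_ee[in_contact],
                        np.ones((in_contact.sum(), 1))])        # homogeneous coordinates
    return (T_base_ee @ pts_ee.T).T[:, :3]                      # contact points in base frame
```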
4.1.1. Object list
In order to be easily reproducible, widely available daily objects from the following categories are used for experimentation: (a) bottle, (b) cup, (c) mug, (d) spray, (e) detergent, and (f) wineglass. The objects in each category are shown in Figure 8(a). These objects are treated as unknown; their models are reconstructed and pose estimation is performed on them. Furthermore, a set of other objects, shown in Figure 8(b), is used to clutter the workspace and the target object. Each scene is composed of one target object from any category and a subset of clutter objects placed around the target object in randomised dense clutter scenarios.
[Figure 8: (a) Target unknown objects; the properties evaluated by human experts are T: transparency/specularity, C: shape complexity, S: symmetry, with +: medium and ++: high. (b) Objects used to clutter the workspace. (c) Visuo-tactile point cloud of an exemplary object demonstrating the need for tactile exploration in reflective regions where vision data is absent. (d) Visuo-tactile point cloud of a transparent object wherein visual data is completely missing and the object is reconstructed and localised with tactile data.]
4.2. Active visuo-tactile-based target object reconstruction
As the target object list contains both transparent and opaque objects, our framework automatically prefers tactile exploration for transparent objects and visuo-tactile exploration for opaque objects using the joint criterion defined in eq. (9). An exemplary case that demonstrates the benefit of our method is shown in Figure 8(c). The ketchup bottle has both opaque and reflective regions. The vision point cloud, shown in red, captures the overall shape but contains some missing points in the reflective region (highlighted in the green box). Due to our information gain method for object exploration, tactile acquisitions are performed only in the regions where the visual points are missing or where there is uncertainty due to noisy data (around the edges). Similarly, for the transparent object shown in Figure 8(d), the visual data is completely missing (shown in red points) and the tactile data captures the overall shape of the object (shown in blue points).
For reconstruction evaluation, we use the Chamfer distance (CD) metric defined in eq. (10) against the ground truth point clouds shown in Figure 10. The Chamfer distance is defined as the sum of the average nearest-neighbour distance from one point cloud to the other and vice versa; a lower CD value denotes higher reconstruction precision. For the ideal case where the reconstructed point cloud exactly matches the ground truth, CD ≈ 0.0 m. Qualitatively, through empirical analysis, we found that CD < 0.01 m denotes accurate reconstruction, 0.01 m < CD < 0.1 m denotes good reconstruction, while CD > 0.2 m implies poor reconstruction of the point cloud. The qualitative results for the reconstruction with vision and tactile data with our network are shown in Figure 9. Our method, with the help of the model learned over the category-level synthetic objects, is able to reconstruct the object even with sparse input point clouds. We performed five repeated experimental trials for each target object and each exploration strategy (active, random and uniform), resulting in 210 total trials (14 objects, 3 strategies, 5 repetitions). We note that for opaque objects, the shared visual and tactile data result in a higher accuracy of reconstruction (CD < 2 cm), as seen in Figure 10(a). The visual point clouds for opaque objects capture all sides of the objects due to the active visual exploration. On average, combining with tactile data improves the visual reconstruction accuracy by 17%. For opaque objects, the tactile reconstruction accuracy in isolation is relatively worse due to the fact that incomplete tactile point clouds are collected, as the robot only explores regions unseen by the camera.
[Figure 9: Visuo-tactile point clouds and the respective reconstructed point clouds using our reconstruction network.]
[Figure 10: Quantitative reconstruction results showing the Chamfer distance (CD) metric of the reconstructed point cloud compared with the ground-truth point cloud for (a) opaque objects and (b) transparent objects. The bar graph represents the average values and the error bars represent the standard deviation.]

For transparent objects, even a sparse input point cloud provides acceptable reconstruction accuracies, as seen in Figure 10(b). In this case, since no visual point cloud is available, the robot explores the object with only tactile sensing in an information-gain seeking strategy. All objects are accurately reconstructed (CD < 2 cm) by the network except one.
Furthermore, we performed a comparison study of our active exploration strategy against baseline random and uniform strategies for both the vision and tactile modalities. The baseline strategies for tactile exploration are defined as follows: the bounding box of the target object is discretised into a 3D grid, with each grid cell of size 3 cm × 3 cm, which corresponds to the size of the sensor patch. For the uniform strategy, the robot is moved to touch the grid cell closest to its base frame and then sequentially touches each cell in a uniform manner; in contrast, the random strategy chooses the next grid cell in a randomised manner. In a similar manner, we define random and uniform exploration strategies for the vision modality: viewpoints on the hemisphere are sampled in the same way as described in Section 3.3, and the uniform strategy starts from one extreme position and sequentially moves to the next viewpoint, while the random strategy chooses one of the sampled viewpoints at random. As we want to compare the sample efficiency of the actions and to have an unbiased comparison between strategies, we limit the exploration to 20 tactile-only actions for transparent objects and to 3 visual actions and 5 tactile actions for the remaining opaque objects. The results for the reconstruction of opaque objects with visuo-tactile sensing are shown in Figure 11(a) and for transparent objects with tactile sensing in Figure 11(b). As the exploration is performed for the objective of object reconstruction, CD is used as the comparison metric. For both transparent and opaque objects, increasing the number of exploratory actions reduces the CD value. For transparent objects, which rely upon only the sense of touch, active exploration converges to an accuracy of CD ≈ 2 cm within 20 touches. Random exploration converges to CD ≈ 5 cm in 20 touches, but with higher variance due to the stochasticity of the exploratory actions. Uniform exploration has the lowest accuracy due to the fixed nature of the exploration, which often collects redundant data, leading to long data collection times. For opaque objects, in contrast, random and active strategies perform similarly on average (CD ≈ 1.5 cm) and subsequently the active tactile strategy slightly improves the reconstruction accuracy (CD ≈ 1.1 cm). Negligible reduction in the CD value is observed with random and uniform tactile actions following visual perception for opaque objects.
[Figure 11: (a) Active visuo-tactile reconstruction accuracy for opaque objects and (b) active tactile-only reconstruction accuracy for transparent objects compared with random and uniform strategies. The error bars in (a) and shaded regions in (b) represent the standard deviation.]
4.3. Category-level visuo-tactile-based pose estimation
In order to benchmark our stochastic TIQF (S-TIQF) and the previously proposed TIQF method, we first perform instance-level pose estimation, where the object model point cloud is obtained from the ground truth mesh, on the Stanford Scanning Repository benchmark. The following state-of-the-art methods are used for comparison: Iterative Closest Point (ICP) (Besl and McKay, 1992), Sparse Iterative Closest Point (S-ICP) (Bouaziz et al., 2013), Random Sample Consensus (RANSAC) (Fischler and Bolles, 1981), Truncated least squares Estimation And SEmidefinite Relaxation (TEASER++) (Yang et al., 2020) and PREDATOR (Huang et al., 2021a). We thus compare against local registration methods such as ICP and S-ICP, global optimization methods such as RANSAC and TEASER++ and learning-based methods such as PREDATOR. We chose these popular baselines as they are often used in the literature for the point cloud registration task. Furthermore, some of these baselines, such as ICP and RANSAC, are also used to perform the final registration step in learning-based methods where the features are learnt using a neural network. The learning-based method PREDATOR (Huang et al., 2021a) learns to predict the registration of point clouds with low overlap between each other, as is the case with visuo-tactile point clouds. We used the pretrained model of PREDATOR (available with their open-source implementation¹) and set the hyper-parameters as suggested in the paper, with one exception.
4.3.1. Benchmark experiments
4.3.1.1. Stanford scanning repository benchmark
In order to benchmark our methods against the state-of-the-art, we use the standard point cloud registration benchmark from the Stanford Scanning Repository (Levoy et al., 2005). We used six CAD models from the dataset, namely bunny, dragon, happy Buddha, Lucy, statue and armadillo (Levoy et al., 2005). In order to have an unbiased comparison of pose estimation, we used the model point cloud derived from the CAD mesh in the dataset, because errors in shape reconstruction can propagate and influence pose estimation. Each model point cloud is sampled uniformly from the CAD mesh to have 1024 points. The scene point cloud is sampled randomly from the CAD mesh, with the number of points set to 20, 40, 80 and 120. The varying degree of sparsity tests the robustness of our approach against the state-of-the-art methods. The model and scene point clouds are normalized and scaled to lie within a [−1, 1]³ m cube. In order to evaluate the sensitivity of our method to local optima, the initial pose of the model point cloud is randomly chosen from a position range of [−5.0, 5.0] m and a rotation range of [−180°, 180°] for each experimental trial. The correspondence estimation for ICP and S-ICP is based on nearest-neighbour search, whereas RANSAC and TEASER++ are based on Fast Point Feature Histograms (FPFH) descriptors (Rusu et al., 2009). For each selected model from the Stanford Scanning Repository, the experiment is repeated five times with the initial pose randomly varied for each trial. The errors are measured using the Average Distance of model points for objects with Indistinguishable views (ADI) metric, which is insensitive to object symmetries (Hinterstoisser et al., 2013). The ADI metric is measured as
$$ e_{ADI} = \frac{1}{m}\sum_{x_1 \in \mathcal{M}} \min_{x_2 \in \mathcal{M}} \left\lVert (R\,x_1 + t) - (\hat{R}\,x_2 + \hat{t}) \right\rVert_2 $$
where $\mathcal{M}$ is the set of $m$ model points, $(R, t)$ is the ground truth pose and $(\hat{R}, \hat{t})$ is the estimated pose.
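A short sketch of this metric, assuming the poses are given as rotation matrices and translation vectors:

```python
import numpy as np
from scipy.spatial import cKDTree

def adi_error(model_pts, R_gt, t_gt, R_est, t_est):
    """ADI (average distance, indistinguishable views): for every ground-truth-posed model
    point, the distance to the closest estimated-pose model point, averaged over the model.
    Being a closest-point (not fixed-correspondence) distance, it is insensitive to symmetry."""
    gt = model_pts @ R_gt.T + t_gt
    est = model_pts @ R_est.T + t_est
    d, _ = cKDTree(est).query(gt)
    return d.mean()
```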
Qualitative results with the Stanford Bunny model are shown in Figure 12, and quantitative results evaluated over all the models selected from the Stanford Scanning Repository are provided in Figure 13. It can be seen that for all levels of point sparsity (20-120 points), our S-TIQF outperforms the baselines (p < .001 for Welch's t-test in all cases except for the scene cloud with 40 points against TIQF, where p < .01). Interestingly, S-TIQF also outperforms the TIQF method, which is due to the stochastic initial alignment used in S-TIQF. For instance, in the case of the scene point cloud with 20 points, S-TIQF outperforms the closest baseline S-ICP by 45% on average, and by 38% for the scene point cloud with 120 points. The results corroborate the known weaknesses of correspondence-based techniques such as RANSAC and TEASER++, as they rely upon features for estimating correspondences, which remain fixed throughout the pose estimation process; due to the point sparsity, feature extraction methods such as FPFH fail to generate valid correspondences. Similarly, our S-TIQF and TIQF outperform the learning-based method PREDATOR by more than 50% on average for all levels of point sparsity (20-120). The point sparsity and the absence of neighbourhood points make it challenging for the graph neural network in PREDATOR to extract good features for the overlapping regions. Furthermore, simultaneous pose and correspondence methods such as ICP and TIQF perform relatively well on sparse data but rely on good initialization; our S-TIQF approach removes the need for good initialization through the stochastic search for the initial alignment. Furthermore, to demonstrate the effectiveness of our approach, we performed an ablation study, shown in Appendix A.3, where we provide the output of the stochastic alignment (SA) module to the baseline methods ICP and S-ICP. We show that S-TIQF still outperforms the modified baseline algorithms SA + ICP and SA + S-ICP by at least 20% in terms of ADI error (cf. Table 3).
[Figure 12: Qualitative results on the Stanford Bunny dataset. The grey mesh represents the model at ground truth for reference, the blue sparse point cloud represents the scene point cloud and the red dense point cloud represents the transformed model point cloud after performing point cloud registration.]
[Figure 13: Pose error calculated as ADI error for models from the Stanford Scanning Repository. The object point cloud consisting of 1024 points is sampled from the models, while the scene point cloud is randomly sampled from the model and consists of (a) 20, (b) 40, (c) 80 and (d) 120 points, respectively. p values calculated by Welch's t-test are shown as ∗. The bar plot represents the average and the error bars represent the standard deviation.]

4.3.1.2. PhoCal dataset benchmark
We conducted a feasibility study with the PhoCal dataset (Wang et al., 2022) to demonstrate category-level pose estimation with our method. In contrast to our reconstruction network, which is trained on point clouds of synthetic objects belonging to the same categories as the real-world objects, the Normalized Object Coordinate Space (NOCS)-based framework (Wang et al., 2019) can also be used to generate the object point clouds. Given RGB inputs, the NOCS network learns a NOCS map, which is a shared canonical space of objects within a category. The NOCS map can be combined with the depth map to lift the 2D image to 3D point cloud space. This is used as the object point cloud, while the point cloud from the depth map is considered as the scene point cloud for point cloud registration. Furthermore, alignment methods such as the Umeyama algorithm (Umeyama, 1991) are used to perform pose estimation within the NOCS-based framework (Wang et al., 2019). The PhoCal dataset (Wang et al., 2022) contains the RGB, depth and learnt NOCS maps of real-world objects belonging to different categories, in particular photometrically challenging objects. In order to perform accurate 6D pose annotation, the authors of the PhoCal dataset (Wang et al., 2022) used a tool-tip on a robotic manipulator to manually touch each object at various sparsely distributed locations on its surface. We use these touch points as the tactile point cloud. The learnt NOCS maps are used to generate the object model point cloud and the depth map provides the visual point cloud. We compare our approach with the methodology introduced in Wang et al. (2019) for pose estimation. Figure 14 shows an example from the PhoCal dataset demonstrating the rendered NOCS map (Figure 14(b)) and the reconstructed models (Figure 14(d)). We note that the reconstructed point clouds are partial (see bottle and fork in Figure 14(d)), as only the visible portions of the scene are used to generate the NOCS map, which poses further challenges for pose estimation. Figure 15 shows the comparison of our S-TIQF method against the Umeyama approach (Umeyama, 1991) used in Wang et al. (2019). We demonstrate that our approach outperforms the baseline method by approximately 35% in median ADI error for tactile point clouds and by about 20% in ADI error when applied to dense visual point clouds (p < .001). We also evaluated the scale estimation approach we use for point clouds from visual and tactile sensing. The scale error is calculated as:
Figure 14. Qualitative results using the PhoCal dataset: (a) RGB input, (b) rendered NOCS map, (c) visual and tactile point cloud, and (d) reconstructed model point clouds from the NOCS maps in (b).
Figure 15. Comparison of our method against NOCS (Umeyama method) (Wang et al., 2019), performed as a feasibility study with the PhoCal dataset. p values calculated by Welch's t-test shown as ∗.
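For reference, the Umeyama baseline recovers scale, rotation and translation in closed form from corresponded point sets. The following is a minimal sketch of that closed-form solution (Umeyama, 1991) as we understand it, written in Python with numpy; the function name and interface are ours, not those of the paper or the NOCS code:

import numpy as np

def umeyama_alignment(src, dst, with_scale=True):
    # Closed-form least-squares similarity transform (Umeyama, 1991):
    # finds s, R, t minimising ||dst - (s * R @ src + t)||^2 for
    # corresponded (N, 3) point sets src and dst.
    n = src.shape[0]
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / n                      # cross-covariance matrix
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:   # correct for reflections
        S[2, 2] = -1.0
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / n               # variance of the source set
    s = np.trace(np.diag(D) @ S) / var_src if with_scale else 1.0
    t = mu_dst - s * R @ mu_src
    return s, R, t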


4.3.2. Robotic experiments
In order to validate our method in real-world settings, we carried out extensive experiments using the robotic setup shown in Figure 1 and the everyday objects shown in Figure 8. Similar to the benchmark experiments, our S-TIQF and TIQF methods are compared against the same baseline methods. The model point cloud is derived from the reconstructed point cloud produced by the reconstruction network. The scene point cloud comprises vision and/or tactile data. For each target object, the experiment is repeated five times, randomising the cluttered scene for each iteration. Similar to the previous experiments, the initial pose is sampled randomly from [−5.0, 5.0] m and [−180°, 180°] for each trial and the same initial pose is provided to all comparison methods. The quantitative results for transparent objects are shown in Figure 16(a) and for opaque objects in Figure 16(b). The results with transparent objects are similar to the benchmark experiments due to the sparse nature of the tactile point clouds, and S-TIQF outperforms the baseline approaches. For instance, S-TIQF outperforms the next best baseline method, S-ICP, by nearly 40% on average for sparse tactile point clouds (p < .001). In comparison, it can be seen from Figure 16(b) that for dense visual point clouds, nearly all the methods perform equally well and our S-TIQF method compares favourably with the state-of-the-art (p < .001). Our method achieves an average ADI error of 2.1 cm, whereas TEASER++ achieves an error of 2.5 cm for dense visuo-tactile point clouds (p = .04034). The learning-based approach PREDATOR also performs on par with the other baselines for dense visuo-tactile point clouds, with an average ADI error of 3.1 cm. However, the performance of PREDATOR with sparse tactile point clouds for transparent objects is much worse, with the average accuracy of S-TIQF being nearly 65% better than that of PREDATOR. The PREDATOR method assumes sufficiently dense local point features for its overlap-attention module even when the overlap is minimal; this assumption does not hold for sparse tactile point clouds and results in lower performance. In fact, it can be seen that the combined visuo-tactile point clouds result in better accuracy than visual or tactile point clouds alone, demonstrating the importance of shared perception. For instance, the accuracy improves by
Figure 16. Average pose error for real-world objects for (a) opaque objects with visuo-tactile perception and (b) transparent objects with tactile perception. p values calculated by Welch's t-test shown as ∗.
Figure 17. Object-wise pose estimation results with S-TIQF. The figure shows a standard box-and-whisker plot with the box extending between the upper and lower quartiles and the line showing the median. The whiskers show the range of the data.

4.4. Visuo-tactile hand-eye calibration
This section provides comparative studies of hand-eye calibration between our in situ approach and the standard method using a calibration grid with the algorithm originally presented by Tsai and Lenz (1989). For the calibration grid method, the grid was fixed at a suitable distance from the camera such that it is clearly within the camera's field of view. Ten different viewpoints were chosen manually, ensuring that different end-effector rotations were incorporated. The experiment was repeated five times. Our in situ visuo-tactile calibration approach does not require a specialized grid: any object in the workspace of the robot can be used as long as an accurate point cloud corresponding to the object is available. The object must be immobilised, and multiple visual point clouds are captured from different viewpoints. With an incorrect hand-eye calibration, the point clouds from different views would not overlap accurately, resulting in the scenario shown in Figure 18(a). Furthermore, when tactile data are collected from the same object to form the tactile point cloud, an incorrect hand-eye calibration leads to the scenario shown in Figure 18(b). The tactile sensors are rigidly attached to the end-effector and the robot kinematics are accurate enough to provide a grounding of the object pose. Using the calibration grid method, an acceptable accuracy can be achieved, but residual errors are still present; the qualitative results are shown in Figure 18(c). Using our in situ approach with the S-TIQF method, a highly accurate solution can be obtained (Figure 18(d)). Quantitative results are shown in Figure 18(e). Our approach achieves a lower calibration error than the grid-based method.
Figure 18. Qualitative results of hand-eye calibration: effects of incorrect calibration when point clouds are acquired from different viewpoints (a) and (b). The different colours for the point clouds in (a) highlight the effect of incorrect calibration when overlapped with each other. The accuracy of calibration using the grid-based method (c) and our method (d). Quantitative analysis of the error in hand-eye calibration (position and rotation) (e). p values calculated by Welch's t-test shown as ∗.
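The underlying principle can be sketched as follows; this is an illustration under our own assumptions, not the paper's implementation. Assuming an eye-in-hand camera, 4 × 4 homogeneous transforms, and a common base/world frame shared by the tactile point cloud and the camera-carrying robot, registering the visual point cloud to the tactile point cloud yields the object pose in the camera frame, which can then be composed with the robot kinematics to recover the hand-eye transform. All frame names and the function below are ours:

import numpy as np

def hand_eye_from_registration(T_base_obj, T_cam_obj, T_base_ee):
    # T_base_obj: object pose in the common base/world frame, grounded by
    #             the tactile point cloud and the robot kinematics.
    # T_cam_obj:  object pose in the camera frame, obtained by registering
    #             the visual point cloud to the tactile point cloud.
    # T_base_ee:  end-effector pose of the camera-carrying robot (kinematics).
    # All inputs are 4x4 homogeneous transforms; frame names are assumptions.
    T_base_cam = T_base_obj @ np.linalg.inv(T_cam_obj)   # camera in base frame
    T_ee_cam = np.linalg.inv(T_base_ee) @ T_base_cam     # hand-eye transform
    return T_ee_cam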
4.4.1. Robot kinematic accuracy benchmark for calibration
Our in situ visuo-tactile-based hand-eye calibration method depends on the accuracy of the kinematic calibration (especially for the robot with tactile sensing). Although this assumption is commonly used in the case of hand-eye calibration (Sun and Hollerbach, 2008), we benchmarked the kinematic accuracy of both robots using an external sensor system to evaluate the effect on hand-eye calibration. We used the OptiTrack motion capture system (NaturalPoint, Inc.) to track a specially designed coordinate marker frame with embedded markers that is attached to the gripper fingers, as shown in Figure 19(a). The motion capture system has an average accuracy of 0.1 mm. The UR5 robot, which is sensorized with tactile sensors, has a pose repeatability of 0.1 mm and is kinematically calibrated by the manufacturer.
The Franka Emika robot, which carries the camera, also has a pose repeatability of 0.1 mm according to the manufacturer's datasheet. Both robots' end-effectors were moved along arbitrary trajectories covering all 6 DoF by a human user through manual hand guiding, and the pose of the end-effector was extracted from the kinematic model and from the motion capture system, respectively. The static offset, which arises from the pose difference between the robot's end-effector frame and the marker frame attached to the end-effector, needs to be compensated. While the robot is stationary, the end-effector poses expressed in the world coordinate frame are extracted from the robot kinematic model and the poses of the marker frame expressed in the world coordinate frame are extracted from the OptiTrack system; the static offset is measured as the averaged RMSE between these two poses. We compare the accuracy between the poses in Figure 19(b) and (c), which show the end-effector trajectories with the kinematic pose estimate in blue and the motion-capture pose in red. The numerical results are shown in Table 2. We recognize the intrinsic uncertainties inherent in our comparative analysis: sporadically, the human operator might occlude the markers from the field of view of certain OptiTrack cameras during manual guidance, despite the strategic deployment of six OptiTrack cameras surrounding the workspace. Hence, we measure the median error and median absolute deviation to disregard spurious outlier points. We note that the kinematic accuracy of the UR5 robot measured with our benchmark is 0.303 ± 1.82 mm, which is crucial for the calculation of the tactile point clouds, and that this accuracy for the tactile measurements is within the tolerance bounds. The kinematic discrepancies observed in the Panda robot are more pronounced (±4 mm median absolute deviation); however, their impact on the hand-eye calibration process remains minimal, as the tactile point cloud serves as the reference for the registration of the corresponding visual point cloud. Our method works identically for the case where the camera is kept static and the visual point cloud of the object is registered to the tactile point cloud.
Figure 19. (a) Benchmarking the robot kinematics with a high-precision marker-based motion capture system (OptiTrack) for the (b) UR5 robot and (c) Franka Panda robot. The robot poses are shown in blue and the poses calculated by the motion capture system in red.
Table 2. Numerical results from the kinematic calibration benchmark showing the median error and median absolute deviation.
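A minimal sketch of the robust error statistics described above (Python with numpy; it assumes time-aligned position pairs after compensating the static offset; names are ours):

import numpy as np

def median_error_and_mad(kinematic_pos, mocap_pos):
    # kinematic_pos, mocap_pos: (N, 3) arrays of time-aligned end-effector
    # positions after compensating the static marker offset.
    err = np.linalg.norm(kinematic_pos - mocap_pos, axis=1)  # per-sample error
    med = np.median(err)
    mad = np.median(np.abs(err - med))   # robust to occlusion-induced outliers
    return med, mad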
4.5. Discussion
4.5.1. Individual sub-system evaluation
We presented a novel interactive shared visuo-tactile perception approach for unknown target object reconstruction and robust pose estimation. To retrieve the target object, the robots coordinate to declutter the scene. We used target objects of varying shape complexity and transparency to extensively evaluate our reconstruction and pose estimation pipeline. Our approach is able to accurately and efficiently reconstruct both transparent and opaque novel objects in an active, information-gain-seeking manner. We note from Figure 11(a) that for transparent objects, the uniform strategy requires a large number of tactile actions for accurate reconstruction, leading to increased data collection time. The random exploration strategy has a high standard deviation (CD ≈ 3 cm after 20 actions), which stems from the stochastic nature of the exploration. Our active strategy has lower variance and higher accuracy (CD < 2 cm) within 20 actions, outperforming both the random and uniform strategies. For vision-based object reconstruction, owing to the workspace limitations, the wide field of view of the camera and the limited size of the objects, on average three viewpoints are sufficient to completely explore the objects. However, as seen from Figure 11(b), the uniform strategy is less accurate than the random and active strategies for visual reconstruction. Furthermore, the subsequent tactile actions after visual perception improve the reconstruction accuracy by 17% with our active strategy, whereas the improvement is marginal with the random and uniform strategies. The tactile data acquired with the random and uniform strategies are redundant with the visual point cloud data, whereas with the active strategy, the regions unexplored by the visual modality are explored with the tactile modality. Shared visuo-tactile perception proves more advantageous than relying on visual or tactile perception alone, and the sharing of perceptual attributes between modalities is effective across object types, both transparent and opaque. Furthermore, active perception is required for effective shared perception, to avoid redundant data collection and overlap between the sensing data.
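The reconstruction accuracy above is reported as CD; the following is a minimal sketch of a symmetric Chamfer distance computation, which we assume underlies that metric (Python with numpy and scipy; the exact averaging convention used in the paper may differ):

import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(recon, gt):
    # Symmetric Chamfer distance between two (N, 3) point sets in metres.
    d_rg, _ = cKDTree(gt).query(recon)    # reconstruction -> ground truth
    d_gr, _ = cKDTree(recon).query(gt)    # ground truth -> reconstruction
    return d_rg.mean() + d_gr.mean()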
Similarly, for category-level pose estimation with the reconstructed point clouds of real-world objects, our S-TIQF method outperforms all the baseline methods for tactile-based pose estimation, owing to the sparsity of tactile point clouds, with an average ADI error of around 2 cm, as seen from Figure 16(a) (p < .001). For opaque objects, our method compares favourably with state-of-the-art methods for dense visual and visuo-tactile point clouds (p < .05). We have compared against geometry-based point cloud registration methods such as ICP (Besl and McKay, 1992), S-ICP (Bouaziz et al., 2013), RANSAC (Fischler and Bolles, 1981) and TEASER++ (Yang et al., 2020), which do not require any neural network learning component, as well as against PREDATOR (Huang et al., 2021a), which is a learning-based registration method. Furthermore, some of these popular baselines, such as ICP and RANSAC, are also used as a “backend” to perform the final registration task in learning-based methods where the features are learnt using a neural network. In fact, even the PREDATOR method (Huang et al., 2021a) learns the feature points where there is maximal overlap between the point clouds, and RANSAC is used to extract the final pose estimate from the correspondences found. Our S-TIQF approach allows for incorporating sparse as well as dense measurements for pose estimation. S-TIQF also outperforms TIQF by 35% for tactile and visuo-tactile-based pose estimation; our stochastic initialization strategy proves effective for escaping local minima. We note from Figure 17, which shows the object-wise pose estimation results, that increasing shape complexity results in a marginal reduction in pose accuracy, as is to be expected. Transparent objects show higher average errors compared to opaque objects as they rely solely upon tactile perception, resulting in sparse data. However, the worst-case error (for instance
Furthermore, we evaluated our approach on various point cloud registration datasets and on real-world objects with different sensors. The S-TIQF method can be applied to various sensing modalities capable of supplying point cloud data, including LiDAR or radar sensors. This versatility extends its utility to applications such as mapping, localization, and sparse-to-dense registration when integrated with depth cameras. Given that S-TIQF relies directly on raw point clouds containing only positional information (i.e., x, y, z coordinates), and potentially normal data when available, with minimal parameter tuning, it is an effective choice for diverse applications requiring precise point cloud registration.
4.5.2. Overall system evaluation
To evaluate the performance of the entire pipeline as shown in Figure 1, we choose the criterion such that if
4.5.3. Future work
We aim to extend our pose estimation approach to non-rigid objects, such as articulated and deformable objects, as part of future work. Extending our shared perception formulation to sensors other than visual and tactile sensing is also worth investigating, owing to the generalized formulation that depends only on 3D position data. As noted in Section 4.4.1, there is uncertainty in the tactile measurements that arises from the inherent uncertainty of the robot kinematics. Another avenue for future research is the integration of uncertainty quantification within the registration process, which has the potential to further mitigate calibration errors. While the shared perception in this work focused on shape information, further object properties may also be shared through multi-modal perception, allowing for more robust manipulation strategies.
5. Conclusions
In this work, we proposed a novel full-fledged framework for interactive shared visuo-tactile object reconstruction and pose estimation of unknown target objects in dense clutter. In our scenario, two robots equipped with vision and tactile sensors coordinate to declutter the workspace using our proposed declutter scene graph approach. Visual and tactile sensing are efficiently shared to explore the unknown target object using the joint information gain criterion. This ensures non-redundant actions performed in a greedy information-gain manner, improving the sample efficiency of the actions. Tactile perception is prioritised for transparent objects, which are challenging for visual perception. The extracted point cloud data are used to infer the reconstructed model of the object with our reconstruction network. Finally, our novel S-TIQF method is used for robust pose estimation and is accurate for both sparse and dense point cloud data. It provides a globally optimal pose estimate that is robust against local minima. Our method has been extensively validated on benchmark datasets and in real-robot experiments, and it outperforms state-of-the-art techniques. Furthermore, we demonstrate how our S-TIQF method can also be used for hand-eye calibration with arbitrary objects through visuo-tactile data, which is critical for shared multi-modal perception.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Funded in part by the BMW Group and EU Horizon Project PHASTRAC under Grant ID 101092096.
Supplemental Material
Supplemental material for this article is available online.
