Abstract
Robotic exploration of unknown soft objects presents significant challenges for autonomous systems due to unpredictable deformations and shape changes during manipulation. To address this, we propose a framework that integrates topology-aware 3D reconstruction with a topology-guided motion planner, enabling the discovery and reconstruction of previously hidden or concave regions. This topology-aware 3D reconstruction employs a novel representation of deformable objects by combining Cylinder-Čech Complexes with point clouds, enabling rapid tracking of significant topology changes and detection of non-manifold boundaries. The topology analysis and canonical reconstruction guide motion planning by optimising grasp points and planning trajectories to reveal previously unseen surfaces through two actions: turning over and stretching. We validated our algorithm through simulations and experiments using the da Vinci Research Kit, demonstrating successful exploration with two or three manipulators. We showed it can fully explore the surfaces of two everyday objects, a beanie and a rubber glove, and two cadaveric organs, a liver and a colon, within seven manipulations. Our method achieved a 45.6% improvement in 3D reconstruction accuracy compared to state-of-the-art point-cloud-based methods while also demonstrating the capability to detect and fix non-manifold geometry.
Introduction
Exploring deformable objects to fully understand their shape and appearance is a fundamental task for humans in everyday life. This occurs in various activities, such as cloth processing (Sanchez et al., 2018) and garbage disposal (Kiyokawa et al., 2022), as well as in skilled professions like laparoscopic surgery (Sánchez et al., 2011). This exploration is always accompanied by manipulation, since only part of the surface is initially visible. Typically, humans form an initial perception of a soft object through sight and mentally estimate its rough shape. Based on these perceptions, they manipulate the object to reveal more of its surface, refining their understanding with each interaction. In practice, however, it can be difficult to remember all surface details clearly due to the drastic changes in morphology.
For robots, achieving autonomous exploration of unknown deformable objects is even more challenging. In robotic-assisted minimally invasive surgery (RAMIS), deformations arise not only from robot-induced movement (Figure 1) but also from respiratory- and circulatory-induced motion (Attanasio et al., 2021). Three major challenges arise at the perception, decision, and action stages. First, perceiving the shape of a soft object is difficult. In the case of robotic manipulation of soft objects, a representative state of the explored object surface, such as a 3D model, is necessary for the robot to refer to when making control decisions. While camera systems can perceive raw images or depth maps of the object, these alone are insufficient as the representative state. Second, determining whether the soft object has been fully explored and identifying unexplored areas is challenging. During manipulations, inevitable deformations complicate the 3D reconstruction, leading to incorrect topology and making it difficult to analyse the object's shape. Third, manipulating the soft object to expose its hidden surface is inherently complex. Grip points and manipulation paths can cause unpredictable deformations, potentially leading to failed plans or suboptimal results.
Figure 1: System-level workflow illustrating robotic exploration with 3D canonical reconstruction, including (1) 3D point cloud collection from the camera during manipulation, (2) frame-to-frame point cloud registration to align observations, and (3) 3D reconstruction via shape fusion to build the canonical model.
Recent advances in key technologies for robotic exploration, such as 3D reconstruction and robotic manipulation of soft objects, have made significant progress in addressing these challenges. 3D reconstruction (Figure 1) involves recovering the shape of a 3D object from a series of discrete observations, depending on the input modality. Both registration-based and learning-based methods are used; however, they often fail when the object undergoes large deformations during the reconstruction process. Furthermore, robotic manipulation is essential for shaping soft objects, with shape representation being a critical factor in its solution. Especially in the context of robotic exploration, a proper representation is key to achieving real-time performance in online control.
While there has been extensive research on the exploration of unknown rigid objects (Browatzki et al., 2014; Okamura and Cutkosky, 2001), to the best of our knowledge, no published work specifically addresses the robotic exploration of soft objects. This paper presents a novel framework for a vision-guided, multi-arm robotic system designed to manipulate unknown soft objects with the goal of fully understanding their shape and texture. The framework integrates perception, analysis, grasp point selection, and trajectory planning. It offers novel solutions to these challenges, with its distinct advantages over existing methodologies highlighted in Table 1 and elaborated upon in subsequent sections. The key contributions of this work are:
1. A topology-aware representation that extends the grid-point-based weighted residual method (GP-WRM) by incorporating topological information, enabling real-time reconstruction and analysis of objects undergoing drastic topological changes during manipulation.
2. Full exploration and homeomorphic reconstruction of an unknown soft object, including complex non-convex and self-occluding features, achieved through active robotic manipulation by detecting and revealing previously hidden areas with a small reach. The approach also mitigates the effects of low-quality point cloud observations and down-sampling.
3. A comprehensive framework for robotic exploration of unknown soft objects, including grasp point selection and trajectory planning, supporting two types of manipulation: turning over and stretching.
4. Validation of the proposed approach through simulations and experiments on various objects, including human cadaveric tissues.
Table 1: Concise comparison of our framework with prior work in deformable object exploration.
Our previous work (Hu et al., 2024) focused on shape control by representing soft objects with the proposed GP-WRM, but faced limitations in completing complex tasks due to a reduced ability to track the deformation field and fully comprehend the overall shape of the soft object. In this work, we extend the deformation representation from the previous study, incorporating knowledge from past deformations to achieve a comprehensive understanding of the geometry, thereby enabling more complex tasks.
Related works
Robotic exploration of soft objects
Autonomous robotic exploration has traditionally focused on understanding environments, such as in the context of drones, service robots (Jiang et al., 2024), field robots (Wettergreen et al., 2005), and underwater vehicles (Mallios et al., 2016). It can also refer to exploring objects within the environment with active interactions, enabling the understanding of mechanical properties and scene graphs. Previous research has used active manipulations to determine boundary constraints in physical models (Boonvisut and Çavuşoğlu, 2014), and to identify parameters for reinforcement learning, allowing robots to manipulate unknown objects more effectively (Schneider et al., 2022). These applications generally rely on rigid objects or static environments, where the robot can infer properties through active interaction with minimal deformation. However, exploring and manipulating soft, deformable objects introduces a higher level of complexity.
In surgical robotics, exploration of soft tissues has been studied. For example, Goldman et al. (2013) proposed algorithms for surgical robots to autonomously explore the shape and stiffness of surgical fields. Shinde et al. (2024) studied active sensing of unknown boundaries, such as tissue attachments, in deformable surgical environments using stereo endoscopic observations. Additionally, haptic feedback systems have been integrated into robotic surgery to provide kinesthetic and cutaneous sensing, allowing surgeons, and theoretically future robots, to directly interpret the mechanical properties of soft tissues (Enayati et al., 2016). However, exploring the complete shape of unknown soft objects remains an open problem, especially in dynamic, deformable environments like laparoscopic surgeries.
Shape representation of soft objects
Accurate shape representation is critical in robotic manipulation, allowing the robot to understand and interact with the environment. Early works in the field relied heavily on physics-based models to represent the state of soft objects, including mass-spring systems, position-based dynamics, and continuum mechanics such as finite element methods (FEM) (Yin et al., 2021). FEM provides a detailed representation of soft object deformations (Leizea et al., 2015), but its high computational overhead limits real-time application, prompting the use of linear FEM as a simplified, though less accurate, alternative (Wang et al., 2015). Position-based dynamics, a mesh-free method, has gained popularity for real-time applications, enabling the modelling of dynamic deformations, plasticity, fluids, and rigid bodies (Macklin et al., 2014; Tang and Tomizuka, 2022). When the shape of the soft object is known, estimating the stress or strain fields and deformation can represent its real-time state using non-rigid structure from motion in image modalities (Badias et al., 2021). These methods generally balance computational efficiency and flexibility, making them well-suited for certain manipulation tasks. However, in real-time robotic control, particularly with unknown soft objects, physics-based models often fall short due to the difficulty of establishing mechanical properties in advance.
To address these challenges, some approaches have shifted toward simplified, mechanics-free representations that focus on extracting key geometric features, as an elaborate representation is not necessary for certain tasks. Sparse explicit features provide economical shape representation, including landmarks or geometric features (Navarro-Alarcon et al., 2014, 2016), Fourier surfaces (Kelemen and Gerig, 1996), contours (Navarro-Alarcon and Liu, 2017), latent topology (Zhou et al., 2024), and latent manifolds (Koganti et al., 2017). While these methods are faster and less computationally intensive, they are limited in handling more complex shapes. Additionally, the robustness of tracking algorithms remains a significant challenge.
Integrating these features within point cloud representations of the surface is an alternative technique commonly used in simultaneous localization and mapping (SLAM) problems in deformable environments (Song et al., 2018). More recently, our work proposed an efficient, mechanics-free state, utilising a down-sampled grid of surface points to represent the object’s instantaneous shape (Hu et al., 2024). Modal graphs have been introduced to capture low-dimensional deformation features from raw point clouds while preserving the robotic system’s spatial structure (Yang et al., 2023). These methods represent a promising direction for reducing the computational burden. Still, they are limited in handling extended time periods and complex cases, such as drastic changes in topology due to interactions. Overall, this limits their usefulness to basic shape control operations.
High-dimensional features extracted from big data, such as images or point clouds, are another alternative. Methods like PointNet (Qi et al., 2017) provide semantic understanding of scene features, enabling the identification of 3D objects and their components. Others have relied solely on predetermined, end-to-end pipelines in deep learning, bypassing the need for intermediate representations. These methods often overlook the semantic significance of internal features, thereby failing to provide a comprehensive interpretation of the analytical process (Matas et al., 2018). A more efficient approach uses semantically meaningful lower-dimensional space, or latent shape. Such methods have shown promise in capturing key features of the soft object through data training (Zhou et al., 2021). Latent features can also be integrated into other volumetric rendering techniques, such as neural radiance fields (NeRF), allowing for deformability (Li et al., 2024). However, a limitation of these learning-based approaches is their high dependency on datasets to provide prior understanding and generalisation ability for unfamiliar objects, which presents challenges for online comprehension of unknown objects.
These representations are inadequate for the demands of autonomous robotic exploration. Transitions between visible and invisible surfaces frequently occur, with topology changes over time causing features to disappear and reappear. Consequently, the representation’s effectiveness depends on the observations.
3D non-rigid reconstruction
3D non-rigid reconstruction refers to capturing the shape and appearance of deformable objects and representing them in 3D space. Like rigid reconstruction, various methods can be employed depending on the modality of source data, which includes images, depth sensing, and point clouds.
For image-based methods, non-rigid structure from motion (NRSfM) (Parashar et al., 2019; Torresani et al., 2008) is commonly used to generate a sparse reconstruction and estimate camera motion. This process is followed by multi-view stereo (MVS) (Wen et al., 2019) for detailed reconstruction. Both approaches fundamentally depend on establishing correspondences across images, often achieved using feature detection techniques such as SURF (Bay et al., 2006) or motion tracking methods like scene flow (Chen et al., 2024). In recent years, learning-based methods have been introduced to assist in building correspondences for registration. NeRF can reconstruct 3D scenes from 2D images using volumetric representation (Wang et al., 2022) but is an offline model requiring extensive time to train the neural network, making it unsuited for the real-time reconstructions needed for robotic control. In contrast, our definition of real-time reconstruction focuses on the ability to continuously update the 3D canonical model during ongoing robot manipulation, providing immediate feedback for control actions without requiring the robot to stop and wait for a complete model to be built.
Advances in depth sensing have facilitated 3D reconstruction across various applications, primarily using point clouds often combined with RGB texture. This process typically involves non-rigid registration to align point clouds captured at different times. For dynamic reconstructions, methods aim to align sequential, temporally-spaced point clouds but face robustness issues in highly dynamic scenes (Newcombe et al., 2015). Feature-based representations, like curvature, can enhance performance in these scenarios (Sharp et al., 2002; Tajdari et al., 2022). To handle highly dynamic cases more effectively, topology-aware methods have been developed (Zampogiannis et al., 2019), and using shape templates has been shown to improve non-rigid registration (Lamarca et al., 2021). These methods often solve complex optimisation problems to fit deformation models to the observed data, with the goal of minimising the difference between the captured data and the reconstructed model. A globally optimal solution for deformable SLAM has also been proposed (Bai et al., 2024). However, while these approaches are promising for offline reconstruction tasks, real-time 3D reconstruction remains challenging.
In robotic exploration, where instantaneous decision-making is essential, current methods often fail to achieve the real-time performance required for effective online robotic control. Furthermore, the capability to address topology changes is critical. However, existing 3D reconstruction approaches are insufficient for detecting and correcting such changes.
Motion planning of robotic exploration
Motion planning for active robotic exploration varies with the task and comprises two parts: the selection of manipulation points and the planning of the manipulation motion.
Identifying suitable grip points is crucial as they serve as the starting point for manipulating deformable objects, with subsequent motion depending on this choice. In robotic exploration tasks involving soft objects, deformation is influenced by the grip point location, which affects exploration efficiency. Some studies focus on finding optimal grip points to achieve a secure hold with minimal force (Nadon and Payeur, 2019). Task-oriented grip point selection has also been studied, with metrics developed to quantify grasp quality (Huang and Au, 2022). However, selecting grip points that maximise the exposure of hidden surfaces remains an open problem.
The actions in motion planning are often closely related to the representation of the environment and the object of interest. Various planners control the shape of deformable objects based on their representations, such as linearisation models (Navarro-Alarcon et al., 2016), physics-based models (Hu et al., 2024) or learning-based method (Thach et al., 2021). Physical interaction between the robot and the environment leads to dynamic planning. A big challenge in active exploration is the absence of a specific target shape or the difficulty in planning a sequence of target shapes, making the optimisation objective for existing methods unclear. As the robot gathers more information about the unpredictable environment, the motion planning often requires updates. Fully exploring a soft object with a single manipulation is difficult, indicating the need for more comprehensive motion planning based on multiple basic manipulations, aligned with the representation of the soft object.
Modelling and problem formulation
In this paper, we focus on achieving a comprehensive understanding of a single unknown deformable object through robotic exploration using position-controlled robotic end-effectors and a fixed camera. The definition of robotic exploration here involves utilising a camera to capture all surface information and concurrently generating a 3D canonical model of the soft object during active robotic manipulation. The canonical model represents the object’s 3D shape (as point cloud, mesh, or other formats) and is progressively updated using sensor data from depth or RGB-D cameras during exploration. This model, initialised from the object’s pre-manipulation state, serves as a reference frame for integrating incremental geometric information, despite the infinite deformation space of soft objects.
Modelling
As shown in Figure 2, the scenario includes a visual-integrated multi-arm robot system and a deformable object.
Figure 2: Illustration of robotic exploration of a soft object placed on a table, observed by a fixed RGB-D camera. The table plane is defined as …
Assumption 1. The kinematics of the manipulators are known, and the robot–camera and robot–robot extrinsic relations are calibrated before exploration. In practice, online correction can be achieved using vision-based tool tracking (Hu et al., 2023), which addresses calibration drift in multi-arm systems, especially when passive arms are moved.
Assumption 2. There is only one soft object placed on a planar surface, within the workspace of the multi-arm robotic system, and it is not affixed to this surface. The soft object cannot deform without external force.
Assumption 3. The camera is fixed in a top-down view relative to the table and the bases of the robots, and it remains stationary during exploration.
Assumption 4. The robots have at least six degrees of freedom (DoF) and can firmly grasp the soft object during manipulation.
Assumption 5. The observed shape of the soft object can be segmented from the raw point cloud captured by the fixed camera.
Assumption 6. The soft object is homeomorphic to a 3-ball, indicating that it is solid-core.
Problem formulation
Given an unknown soft object
For simplicity, all positions and orientations discussed in the following sections are referenced with respect to the camera coordinate system, as illustrated in Figure 2.
Overview of the methodology
The workflow of robotic exploration of an unknown soft object is depicted in Figure 3. This canonical shape is represented by the 3D canonical model …
Figure 3: Schematic representation of the proposed controller. The configurations …
Preliminaries
Weighted residual method based deformation field
We adopted the GP-WRM method in our previous work (Hu et al., 2024). The deformation field can be discretely described using a set of N deformation nodes
At time t, the positions and displacements of the nodes can be written as
The normal vector
2D manifold
An unknown soft object can be modelled as a manifold,
The surface of an unknown 3D object is assumed to be a 2D
A camera observing a closed manifold in 3D Euclidean space perceives a subset of
Definition (Camera-Observed Manifold). In camera coordinates, where the optical axis is +z, the portion of the object observed by the camera is defined as:
In practice, the camera-observed manifold is represented by the point cloud …
Figure 4: Illustration of the case when the reach of the manifold is small. The red dashed curve is the observation of the object (black Riemannian manifold). The distance between points A and B (black dots) in the observation (yellow arrow) is much shorter than it is on the real manifold (green arrows).

Definition (Estimated Geodesic Distance). The geodesic distance is defined as the length of the shortest path connecting two points on the manifold. The EGD is the shortest distance between two points (more precisely, 0-simplices) passing only through 1-simplices σ1 = {…}.
The topology of a manifold with point cloud observations can be estimated by constructing simplices. Here, we define ‘Distorted Topology’ as the part of the topology that does not fully match the real manifold. Figure 4 illustrates the presence of distorted topology (between points A and B) on the observed manifold
Definition (Distorted Topology). For a 1-simplex in the topology, if the Euclidean distance between its endpoints is significantly shorter than the actual geodesic distance on the manifold, any d-simplices (d ≥ 1) containing these endpoints are considered part of the distorted topology.
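To make this concrete, the following is a minimal sketch of how distorted 1-simplices could be flagged on a sampled 1-skeleton. Since the true geodesic distance is unknown, the EGD computed through the remaining edges serves as a proxy: an edge whose Euclidean length is far shorter than the detour around it is a likely shortcut across a fold. The edge-list representation and the ratio threshold are illustrative assumptions, not values from the paper.

```python
# Hedged sketch: flag candidate "distorted" edges by comparing each edge's
# Euclidean length with the detour (EGD proxy) through the remaining 1-simplices.
# Brute-force per-edge Dijkstra; fine for a small complex, not optimised.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra

def distorted_edges(points, edges, ratio_threshold=3.0):
    n = len(points)
    lengths = np.linalg.norm(points[edges[:, 0]] - points[edges[:, 1]], axis=1)
    distorted = []
    for k, (i, j) in enumerate(edges):
        # Remove edge (i, j) so the shortest path must detour around it.
        mask = np.ones(len(edges), dtype=bool); mask[k] = False
        e, w = edges[mask], lengths[mask]
        graph = csr_matrix((np.r_[w, w], (np.r_[e[:, 0], e[:, 1]],
                                          np.r_[e[:, 1], e[:, 0]])), shape=(n, n))
        egd = dijkstra(graph, directed=False, indices=i)[j]
        if egd > ratio_threshold * lengths[k]:  # shortcut across a fold (inf if disconnected)
            distorted.append((i, j))
    return distorted
```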
Canonical model reconstruction
The reconstruction of the canonical model is illustrated in Figure 5. As manipulations progress, more surface information from the soft object is uncovered. Although the observed surface may differ from the canonical model due to deformation, non-rigid registration allows the observed surface to be fused, resulting in an expanded canonical model. The pipeline for canonical model reconstruction involves the following steps: (1) establish point correspondences, (2) compute the deformation field, (3) merge point clouds, and (4) update the topology.
Figure 5: Illustration of the canonical model reconstruction in sectional views: from t0 to t4, two manipulators (…)
Representation of canonical model and deformation field
The canonical model in this work is represented as a combination of the point cloud and its underlying topology. For each observation with point cloud …
Figure 6: Illustration of the topology of the observed manifold …
Definition (Cylinder-Čech Complex). Let the (r, h)-cylinder in
Given a finite point cloud
There exist only d-dimensional (d ≤ 3) simplices σ for
Lemma 1 (Topology Approximation from Depth Map). The topological space of the observed manifold
Lemma 1 demonstrates the selection of appropriate cylinder parameters to reconstruct a representative topology from the sampled point cloud.
For
Since the manifolds are embedded in
For each 0-simplex, its normal vector is estimated using principal component analysis over the surrounding points. Unlike the Delaunay triangulation, where the normal vectors at the vertices are fixed, the normals in CČC are updated according to equation (5).
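The following sketch illustrates both ingredients under stated assumptions: per-vertex normals from PCA over the k nearest neighbours, and CČC 1-simplices built by testing whether a neighbour falls inside the (r, h)-cylinder around a point (radius r in the tangent plane, half-height h/2 along the estimated normal). The exact cylinder-intersection rule of the paper's CČC may differ; this is one plausible reading.

```python
# Hedged sketch of (a) PCA normal estimation and (b) CCC 1-simplex construction.
import numpy as np
from scipy.spatial import cKDTree

def pca_normals(points, k=10):
    tree = cKDTree(points)
    normals = np.zeros_like(points)
    for i, p in enumerate(points):
        _, idx = tree.query(p, k=k)
        nbrs = points[idx] - points[idx].mean(axis=0)
        # Eigenvector of the smallest covariance eigenvalue = surface normal.
        _, v = np.linalg.eigh(nbrs.T @ nbrs)
        normals[i] = v[:, 0]
    return normals

def ccc_edges(points, normals, r, h):
    tree = cKDTree(points)
    edges = set()
    for i, p in enumerate(points):
        # Bounding sphere of the cylinder limits the candidate neighbours.
        for j in tree.query_ball_point(p, np.hypot(r, h / 2)):
            if j <= i:
                continue
            d = points[j] - p
            axial = d @ normals[i]                       # offset along i's normal
            radial = np.linalg.norm(d - axial * normals[i])
            if abs(axial) <= h / 2 and radial <= r:      # j inside i's (r, h)-cylinder
                edges.add((i, j))
    return sorted(edges)
```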
The initial canonical model
Details of these operations are in Appendix C. The accumulated deformation field from the initial state, where
Measuring frame-to-frame deformation
Given the small gripper movement and the presence of specific landmarks that enhance non-rigid registration, the point clouds can initially be aligned using these landmarks. This alignment can then be further refined by establishing correspondences based on the closest points.
The soft object is firmly grasped by the grippers. Consequently, the grip points, whose movements are precisely recorded using forward kinematics, serve as landmarks. The posture of the grippers at time t is denoted as
Utilising feature detection can also rapidly facilitate the generation of landmark correspondences between frames. We use Ft pairs of corresponding SURF features from sequential frames, denoted as
The computed displacement of the features will be compared with the displacement estimated from the surrounding points. For any given pair of landmarks
Features with confidence values smaller than the threshold will be removed. The remaining features are represented by
The optimal solution of the problem equation (10) is given by:
The frame-to-frame deformation will be updated as follows:
Since the closest points are initially assumed to be the corresponding points, this optimisation needs to be iterated several times, as shown in Algorithm 1. The iteration process continues until the optimal deformation
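A minimal sketch of this iterative loop is given below, with the GP-WRM solve replaced by a simple Gaussian-weighted interpolation of anchor displacements (smooth_field is a hypothetical stand-in): the field is seeded from landmark correspondences (grip points and SURF features), then repeatedly refined with closest-point correspondences until the update is small.

```python
# Hedged sketch of the iterative refinement loop (Algorithm 1 in spirit).
import numpy as np
from scipy.spatial import cKDTree

def smooth_field(nodes, anchors, anchor_disp, sigma=10.0):
    """Gaussian-weighted interpolation of anchor displacements onto all nodes."""
    d2 = ((nodes[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2 * sigma ** 2)) + 1e-12
    return (w @ anchor_disp) / w.sum(axis=1, keepdims=True)

def frame_to_frame(nodes, target_cloud, lm_src, lm_dst, iters=10, tol=1e-3):
    disp = smooth_field(nodes, lm_src, lm_dst - lm_src)   # landmark seeding
    tree = cKDTree(target_cloud)
    for _ in range(iters):
        warped = nodes + disp
        _, idx = tree.query(warped)                       # closest-point correspondences
        update = smooth_field(warped, warped, target_cloud[idx] - warped)
        disp += update
        if np.linalg.norm(update, axis=1).max() < tol:    # converged
            break
    return disp
```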
Invisible topology prediction
The instantaneous diffeomorphism …
Figure 7: Illustration of invisible topology prediction in sectional views. The curves represent different manifold surfaces. The observation …
Lemma 2 (Singularity in the GP-WRM) (Hu et al., 2024). If the number of surrounding nodes of a given point is less than 4, the matrix
Lemma 2 demonstrates that not all points in the point cloud can be deformed, as insufficient displacement observations from their neighbouring points lead to a singularity in the matrix
Definition (Observability Matrix for Deformation on a 2-Manifold). After each measurement at time t, let the number of 0-simplices in the canonical model be denoted as Qt, and the number of observable vertices as
where
The deformation of the canonical model using the observable part is given by
When the rank of the observability matrix
To estimate the diffeomorphism
For the state of the system, there are spatial constraints on the topology, such as the external table plane and self-intersection of the topology. We use an inequality constraint:
Definition (External Constraints from the Table). Given the table plane function
Definition (Internal Constraints from Topology Self-Intersection). Let the set
At each time step, the predicted state estimation is
When the inequality constraints (equation (16)) are applied, the Kalman gain
Therefore, the diffeomorphism
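As a rough sketch of this constrained update, the following applies a standard Kalman correction and then projects each 3D point of the state onto the feasible half-space defined by the table plane. The paper instead modifies the Kalman gain when the inequality constraints of equation (16) are active, so this clipping projection is only an approximation for illustration; the identity measurement model is also a simplifying assumption.

```python
# Hedged sketch: Kalman-style update followed by projection onto the table
# half-space n.p <= b, standing in for the constrained EKF described above.
import numpy as np

def constrained_update(x_pred, P_pred, z_meas, H, R, n, b):
    # Standard Kalman gain and state/covariance update.
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x = x_pred + K @ (z_meas - H @ x_pred)
    P = (np.eye(len(x)) - K @ H) @ P_pred
    # Project each 3D point of the state onto the half-space n.p <= b.
    pts = x.reshape(-1, 3)
    viol = pts @ n - b
    pts[viol > 0] -= np.outer(viol[viol > 0], n)   # assumes |n| = 1
    return pts.ravel(), P
```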
Homeomorphic reconstruction
In Section 5.2, we discussed that point cloud observations might not accurately reconstruct the topology. Lemma 3 establishes the existence of non-manifold (NM) geometries (Definition 6) on the distorted topology using CČC, which can be identified as per Corollary 1. During exploration, however, observations with a small reach τ may undergo deformations, potentially uncovering the true topology, as demonstrated by Lemma 4. Thus, homeomorphic reconstruction of the manifold can be achieved after appropriate manipulations, accurately reflecting its true topology through the reconstructed canonical model.
Definition 6 (Non-manifold Geometry). A d-manifold
Lemma 3 (Existence of Local Non-Manifold Geometries). Consider a manifold
The neighbourhood of σ0 is denoted as Br(σ0), which is a ball of radius r.
Corollary 1 (Detection of Non-manifold Geometries). Given the topological space of
According to Lemma 3,
Lemma 4 (Full Observability of a Manifold). Given any 2-manifold
The diffeomorphism map, also known as the deformation field, is continuous due to the physical properties of the soft object. According to Lemma 1, NM geometries can be detected when
Two metrics for measuring the difference between two displacements are the Euclidean Distance (A1) and Cosine Similarity (A2).
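A minimal sketch of these two metrics applied to NM detection is shown below: edges of the topology whose endpoint displacements disagree strongly, either in magnitude (A1) or in direction (A2), are flagged as candidate NM geometry. The threshold values are illustrative.

```python
# Hedged sketch of the displacement-difference metrics (A1)-(A2) used to flag
# candidate non-manifold (NM) edges between neighbouring 0-simplices.
import numpy as np

def nm_edges(edges, disp, dist_thresh=5.0, cos_thresh=0.0):
    flagged = []
    for i, j in edges:
        di, dj = disp[i], disp[j]
        euclid = np.linalg.norm(di - dj)                 # metric (A1): magnitude gap
        denom = np.linalg.norm(di) * np.linalg.norm(dj) + 1e-12
        cosine = float(di @ dj) / denom                  # metric (A2): direction gap
        if euclid > dist_thresh or cosine < cos_thresh:
            flagged.append((i, j))
    return flagged
```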
The simplices containing the 0-simplices in the NM topological space
Shape merging
After each observation
Directly merging two surfaces may result in overlapping layers, and as additional wrapped point cloud surfaces are introduced, the point cloud of the canonical model may become cluttered, hindering further non-rigid registration. To address this, the moving least squares (MLS) method (Alexa et al., 2003) is applied to smooth the surface, using 0-simplices as nodes. Based on equation (20), the new canonical model is
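The sketch below illustrates a first-order variant of this smoothing step, assuming only numpy/scipy: each 0-simplex is projected onto a locally weighted least-squares plane, which flattens the overlapping layers left by merging. The paper's MLS, following Alexa et al. (2003), may use a higher-order local fit; the radius is illustrative.

```python
# Hedged sketch: degree-1 MLS smoothing by projection onto a weighted local plane.
import numpy as np
from scipy.spatial import cKDTree

def mls_smooth(points, radius=4.0):
    tree = cKDTree(points)
    out = points.copy()
    for i, p in enumerate(points):
        nbrs = points[tree.query_ball_point(p, radius)]
        w = np.exp(-np.sum((nbrs - p) ** 2, axis=1) / radius ** 2)
        centroid = (w[:, None] * nbrs).sum(0) / w.sum()
        cov = (w[:, None] * (nbrs - centroid)).T @ (nbrs - centroid)
        normal = np.linalg.eigh(cov)[1][:, 0]            # weighted plane normal
        out[i] = p - ((p - centroid) @ normal) * normal  # project onto the plane
    return out
```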
For topology merging, in the first frame, the canonical topology directly adopts the topology of …
Figure 8: Topology merge: the black, blue, and green meshes represent the topological spaces of the canonical model, the newly wrapped observation, and the new 1-simplices.

Planning of grip points
In this section, the selection of grasp points is introduced. Before the soft object is fully explored, the canonical model
Candidate grip points
As illustrated in Figure 9, the workspace …
Figure 9: Illustration of the workspace for a three-arm robotic system. The blue areas represent the workspace of each arm.
The candidate grip points
Optimal grip points based on null space analysis
Since the grip points are also on the surface, utilising the deformation field in equation (1), the relationship between the movement of the manipulation points and the displacement field is (Hu et al., 2024):
To find the optimal grip point
The details are shown in Algorithm 2.
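Algorithm 2 itself is not reproduced here; the following is a heavily simplified stand-in for the null-space-based scoring, assuming an influence matrix W (rows: canonical nodes, columns: candidate grip points) derived from the GP-WRM deformation field. Candidates that can move the largest number of nodes appreciably are preferred, mirroring the intuition behind the NSC that better grip points deform and expose more area; both W and the cutoff are illustrative assumptions.

```python
# Hedged sketch: score each boundary candidate by how many canonical nodes its
# motion influences appreciably, then pick the top candidates for the arms.
import numpy as np

def best_grip_points(W, n_arms=2, influence_cutoff=0.05):
    scores = (np.abs(W) > influence_cutoff).sum(axis=0)  # nodes moved per candidate
    return np.argsort(scores)[::-1][:n_arms]             # top-scoring candidates
```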
However, limitations arise because certain points on the boundary of the canonical model may be physically unreachable for the manipulators. For instance, the gripper might be obstructed by the soft object itself when attempting to approach a target located beneath it.
Orientation of grasping
The pitch of the gripper at the grasp point is determined by the normal vector
Motion planning
The goal of motion planning is to orient the invisible portion of the surface toward the camera, thereby enabling closure of the canonical model. Once the grasp points are determined, the corresponding trajectories and motions of these points must be precisely planned.
The procedure of robotic exploration is iterative. Before each iteration, the grippers are homed to minimise occlusion. Then the grippers move toward the designated grip points and securely grasp the object. Two manipulators serve as the primary actuators, while any additional arms, if available, provide support during manipulation. To avoid collisions between the manipulators and the soft object during manipulation, the orientation of the gripper is adjusted first, followed by the translation of the gripper. Based on the analysis of the topology of the wrapped canonical model, the actions are categorised into two types: turning over and stretching. The exploration process continues until the stop condition is met, as illustrated in Figure 10.
Figure 10: Schematic representation of the motion planning.
Topology analysis with Betti numbers
Motion planning is based on the shape of
For a given topological space
The identification of Betti numbers is based on constructing the boundary matrix
Given that computing Betti numbers incurs higher computational costs as the topology becomes more complex, this computation is performed only before each manipulation.
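For reference, the following sketch computes β0, β1, and β2 from boundary-matrix ranks over GF(2), assuming for brevity that the complex contains no 3-simplices; the vertex, edge, and triangle lists would come from the canonical model's CČC.

```python
# Hedged sketch: Betti numbers from boundary-matrix ranks over GF(2),
# beta_k = dim C_k - rank(d_k) - rank(d_{k+1}).
import numpy as np

def gf2_rank(M):
    M = M.copy() % 2
    r = 0
    for c in range(M.shape[1]):
        pivots = np.nonzero(M[r:, c])[0]
        if pivots.size == 0:
            continue
        M[[r, r + pivots[0]]] = M[[r + pivots[0], r]]   # move pivot row up
        mask = (M[:, c] == 1)
        mask[r] = False
        M[mask] ^= M[r]                                 # eliminate column c
        r += 1
        if r == M.shape[0]:
            break
    return r

def betti(n_vertices, edges, triangles):
    b1 = np.zeros((n_vertices, len(edges)), dtype=np.int64)   # boundary_1
    for c, (i, j) in enumerate(edges):
        b1[i, c] = b1[j, c] = 1
    edge_index = {tuple(sorted(e)): k for k, e in enumerate(edges)}
    b2 = np.zeros((len(edges), len(triangles)), dtype=np.int64)  # boundary_2
    for c, (i, j, k) in enumerate(triangles):
        for e in ((i, j), (j, k), (i, k)):
            b2[edge_index[tuple(sorted(e))], c] = 1
    r1, r2 = gf2_rank(b1), gf2_rank(b2)
    return n_vertices - r1, len(edges) - r1 - r2, len(triangles) - r2
```

For example, a closed surface homeomorphic to a 2-sphere yields (β0, β1, β2) = (1, 0, 1), while an unexplored fold appears as an extra 1D hole (β1 > 0).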
Stop condition
According to Assumptions 2 and 6, when the surface of the canonical model is closed, for a volumetric object, its Betti number sequence is
Trajectory planning
The control of soft objects varies depending on the configuration of the robots and the properties of the objects. Unlike traditional manipulation planning for rigid objects with high stiffness, we propose a generalised trajectory planning approach for robotic exploration and manipulation of soft objects. This approach includes actions such as turning over and stretching.
Turning over
In most cases, the hidden portions of the soft object are typically oriented downward, making it essential to turn the object over to explore these invisible areas. This action involves using a pivot to reverse the orientation of the soft object.
As shown in Figure 11(a), since the explored soft object has infinite DoF and is predicted to deform, an agent box is used to efficiently describe the shape of the wrapped canonical model …
Figure 11: Illustration of the trajectory planning during the manipulation. The green object represents the soft object, and the red arrows are the trajectory directions. (a) Turning over. When a third arm is not available, the friction between the soft object and the table serves as the pivot. (b) The middle image shows the camera view of the soft object in Phase 1. The yellow loop indicates the 1D holes in its topology (white mesh). The green dashed arrows represent the EGD from …

When G = 3,
There are two stages for the turning operation: In the
When the number of manipulators is limited to two (G = 2), the edge or friction between the soft object and the table serves as the pivot, as shown in Figure 11(b).
Stretching
Folds on the soft object can create additional hidden areas, particularly in depressed regions, which necessitates the use of a stretching operation. The invisible depressed areas are represented as 1D holes in the topology of
The stretching operation is based on the EGD on
When the conditions for both turning over and stretching are met simultaneously, the stretching can be incorporated directly into equations (28) and (30).
Manipulation speed
According to equation (23), the displacement of the grid points after the movement of the grip points is expressed as:
Due to the initial lack of deformation information during manipulation, the manipulation speed begins slowly and then gradually increases. Subsequently, the deformation
Therefore, the maximal manipulation speed is limited at
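A minimal sketch of this speed schedule is given below, assuming one plausible reading of the cap: the per-frame gripper motion vmax/F should not exceed the deformation threshold ξ at registration frequency F, giving vmax ≤ ξF. The linear ramp-up and the parameter values are illustrative, not the paper's.

```python
# Hedged sketch: ramp the gripper speed up from zero and cap it so the
# frame-to-frame deformation near the grippers stays below the threshold xi.
def gripper_speed(t, xi=5e-3, F=10.0, ramp_time=5.0, v_limit=None):
    v_max = min(v_limit, xi * F) if v_limit else xi * F  # candidate cap v_max <= xi*F
    return v_max * min(t / ramp_time, 1.0)               # linear ramp-up, then constant
```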
Simulated validation
The simulated validation was carried out using the SOFA framework (Faure et al., 2012). In the simulator, the geometry of the initial 3D model is known and can be regarded as the ground truth (GT) of the reconstructed canonical model. During manipulation, the positions of each vertex and the movements of the robots can also be recorded. To simulate the point cloud from the camera perspective, the vertices on the mesh facing the virtual camera serve as the point cloud
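As a sketch of this simulated observation step, the visible vertex set can be approximated by a back-facing test against the viewing direction, ignoring occlusion between surface parts (the simulator may additionally apply a depth test); the viewing direction shown is illustrative.

```python
# Hedged sketch: treat mesh vertices whose normals face the virtual camera
# (negative dot product with the viewing direction) as the observed point cloud.
import numpy as np

def visible_vertices(vertices, normals, view_dir=np.array([0.0, 0.0, 1.0])):
    facing = normals @ view_dir < 0.0   # normals pointing back toward the camera
    return vertices[facing]
```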
In this simulation, to provide a more intuitive result, we presented the relative measurement of reconstruction error and prediction error relative to the size of the canonical model, which is approximately
Simulations on liver model
We employed a 3D liver model as the soft object, specifying a Young's modulus of 500 N m−2 and a Poisson ratio of 0.3. The liver model has 2194 vertices and 4385 triangles. It was placed on a table whose plane function is defined as …
Figure 12: Simulation of robotic exploration of a 3D liver with three arms. Upper: the first row shows the SOFA simulator, with green points representing the observed surface. The second and third rows display the camera view and side view of the canonical model, respectively. The fourth and fifth rows show the topology from two different perspectives. Lower: the plots illustrate the canonical surface area, canonical shape difference, and prediction error throughout the robotic exploration.
Figure 13: Explored surface area percentage before each manipulation across tasks. Groups G-2 and G-3 represent simulated tasks with grip point planning using two and three arms, respectively, while N-2 and N-3 represent tasks without planning. M1 to M8 indicate the increase in reconstructed area percentage after each manipulation, with M0 showing the initial percentage. The number atop each row denotes the total manipulations performed.

To avoid the influence of collisions between the grippers and the liver during the homing and reaching procedures to the grip points, these steps are simplified. In the simulations, the virtual grippers are positioned directly at the optimal grip points once determined, and only the manipulations are simulated.
Table 2: Comparison of the 3D reconstruction performance with DF (Newcombe et al., 2015).
Comparative studies
Effect of the sample rate
Different sampling parameters were selected to validate the effect of the sample rate. In each simulation configuration, we used three levels of the sample rate, ϵ = 0.2 mm, ϵ = 0.6 mm, ϵ = 1.0 mm. All three groups were carried out five times with different initial poses. The shape difference between the reconstructed canonical model and the GT, the NM area, and the average registration time were measured.
Table 3: Comparison of the performance at different sample rates ϵ.
Effect of the grasp point planning
To demonstrate the effect of grasp point planning, we conducted experiments where the grip points were randomly selected on the boundary
Figure 13 compares the number of manipulations and the explored area of the canonical model between methods with and without grip point planning. The results show that random grip point selection requires more manipulations to complete the exploration. On average, 5.8 manipulations are needed with two arms and 6.0 with three arms. In contrast, with our method, the average number of manipulations needed is just 3.2 for both configurations. For each manipulation, the average area increase is 9.97% when grip points are randomly selected, compared to 18.38% when grip points are planned based on the NSC.
Effect of the manipulation speed
We also investigated the effect of the grippers' maximum speeds vmax, in equation (34), during manipulation in the three-arm configuration, using 10 different maximum speeds ranging from 0.1 mm s−1 to 10 mm s−1. In each manipulation, the grip points remained the same as those used for grip point planning at 0.1 mm s−1. In these experiments, the frequency of image and point cloud recording was consistently set at F = 10 Hz. Reconstruction errors, prediction errors, and total exploration time were measured in these simulations, as shown in Figure 14.
Figure 14: Effect of different maximal manipulation speeds vmax on reconstruction error and prediction error. The red cross marks indicate cases where the exploration failed.
When the maximum manipulation speed increases, the total exploration time decreases, but the reconstruction error increases as the difference between two sequential frames becomes larger. The results show that excessive speed reduces the success rate of frame-to-frame registration, increasing the risk of exceeding the registration error threshold. In this simulation environment, reconstruction fails when the maximum speed surpasses 5 mm s−1. To ensure successful exploration, the deformation threshold must satisfy
Experimental validation
Experiment setup
As shown in Figure 15, we use the da Vinci Research Kit (dVRK) (Intuitive, USA) (Kazanzides et al., 2014) for validation, consisting of two patient-side manipulators (PSMs). A fixed RGB-D camera (RealSense, Intel, USA) is utilised to capture the surface and texture information of the explored soft object.
Figure 15: Experimental setup.
The soft objects (rubber glove, beanie, colon, and liver) were placed on a table at a known position relative to the camera and manipulated by two 7-DoF PSMs. The Point Cloud Library (PCL) (Rusu and Cousins, 2011) was used for filtering the point cloud and computing features. The point cloud corresponding to the table was removed using a plane function, while the point cloud associated with the PSMs was excluded using kinematic data.
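A minimal sketch of the table-removal step under the known-plane assumption is given below; the plane coefficients and margin are illustrative, and the actual pipeline uses PCL.

```python
# Hedged sketch: remove raw points near or beyond the known table plane
# before registration. Plane given as a*x + b*y + c*z + d = 0.
import numpy as np

def remove_table(points, plane=(0.0, 0.0, 1.0, -0.5), margin=3e-3):
    a, b, c, d = plane
    n = np.array([a, b, c])
    signed = (points @ n + d) / np.linalg.norm(n)
    # Keep points on the object side; normal assumed to point from table toward object.
    return points[signed > margin]
```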
Two soft objects made of different materials and shapes (a rubber glove and a beanie) were used as they are commonly found in daily life. To validate the procedure in the context of RAMIS, cadaveric colon and liver tissues were used to replicate common scenarios in laparoscopic exploration. These experiments, including those on cadaveric tissues, aim solely to validate our general robotic exploration techniques for soft objects. They are not intended to develop controllers for actual RAMIS procedures, as the real surgical environment, with its need for highly specialised and delicate manipulations (e.g., folding, peeling, localised lifting), presents additional complexities beyond the scope of this foundational exploration framework.
To evaluate the performance of canonical reconstruction on real-world data, we employed a deep learning approach. OmniMotion (Wang et al., 2023) was used offline to obtain dense pixel correspondences. Since analysing long videos with this method requires significant memory due to extracting the optical flow and cycle consistency masks between all possible pairs of frames, we segmented the videos into sections by manipulation (around 45 s each) and sampled them at 10 frames per second. This allowed us to obtain reliable pixel correspondences over time windows long enough for the exploration. Given the intrinsic and extrinsic camera parameters, along with the depth information from the camera, the corresponding tracked pixels across frames were used to reconstruct the 3D canonical model in point cloud format.
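The back-projection used in this evaluation follows the standard pinhole model; a minimal sketch, assuming a depth map aligned with the RGB frame and illustrative variable names, is given below.

```python
# Hedged sketch: lift tracked (u, v) pixels to 3D camera-frame points using the
# depth map and pinhole intrinsics (fx, fy, cx, cy).
import numpy as np

def backproject(pixels, depth_map, fx, fy, cx, cy):
    """pixels: (N, 2) array of (u, v); returns (N, 3) points in camera coordinates."""
    u, v = pixels[:, 0], pixels[:, 1]
    z = depth_map[v.astype(int), u.astype(int)]   # depth indexed as (row, col)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)
```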
Robotic exploration of rubber glove
In this experiment, a blue rubber glove filled with a substance is employed, with two PSMs manipulating the glove, as illustrated in Figure 16. The parameters for the CČC are set with a radius of 5 mm and a height of 8 mm.
Figure 16: Upper: Robotic exploration of a rubber glove. The first row shows camera images. The second and third rows show the point cloud …. Lower: … represent homing, grasping, and up and down manipulation stages, respectively.
In this case, after four manipulation cycles, the glove was fully explored. As shown in Figure 16, the initial point cloud, captured at t = 0 before exploration, constitutes approximately 14.5% of the canonical model, and the Betti number is
Within four manipulations, the maximum speed of both grippers is vmax = 2.5 cm s−1, with a deformation cap of ξ = 5 mm and a registration frequency of 3 Hz. During manipulation, the average speed of both grippers is 1.84 mm s−1. Approximately 83.2% of the deformation fields between two consecutive frames are smaller than ξ, with an average deformation of 4.54 mm. The average shape difference between the canonical model reconstructed through robotic exploration and the learning-based method is 2.72 mm.
A drastic topology change occurred during the second manipulation phase when the index and middle fingers began to open around t = 37.2 s. Before this, they were considered closed. The NM geometry was detected between the two fingers based on the difference of the displacements, consisting of 23 0-simplices, 18 1-simplices, and 16 2-simplices, as shown in Figure 17. This NM geometry was removed, allowing the correct topology to replace it with new observations. The accompanying video (Extension 2) shows the entire exploration of the rubber glove.
Figure 17: Detection of the NM boundary on the topology of the canonical model (indicated by the red dashed line) in the robotic exploration of the rubber glove. The colormap shows the distribution of the Euclidean Distance metric and the Cosine Similarity metric as described in equation (19).
Robotic exploration of beanie
In this experiment, a beanie was used as the soft object. Compared to the rubber glove, it is softer and contains an inner lining. Our method successfully enables robotic exploration of the beanie. The parameters for the CČC are r = 5 mm and h = 8 mm. As shown in Figure 18, the beanie is manipulated five times to reach the stop condition. The areas of the canonical surfaces before each manipulation are 14.8%, 69.4%, 81.2%, 95.1%, and 95.7%, and the corresponding Betti number sequences are listed in Table 4. Stretching occurred only during the second and third manipulation phases. At around t = 19 s, the inner lining became visible to the camera, indicated by the emergence of a 1D hole on …
Figure 18: Upper: Robotic exploration of a beanie. The first row shows camera images. The second and third rows show the point cloud …. Lower: … represent homing, grasping, and up and down manipulation stages, respectively.
Table 4: The Betti numbers of the canonical model …
We validated the performance of the constraints in topology prediction. During the robotic exploration of the beanie, the constraints related to the table plane were not applied. Although the beanie was placed on the table, ideally requiring all points to remain above the table plane, parts of the canonical model appeared below it, as shown in Figure 19.
Figure 19: The side view of the canonical model at the first and last frames in the robotic exploration of the beanie.
Robotic laparoscopic exploration of cadaveric liver
As shown in Figure 20, robotic exploration of the human cadaveric liver was conducted. Due to the weight of the cadaveric liver, an additional surgical tool was used as a supporting arm for the dVRK. A planar label measuring 15 mm × 80 mm was attached to the liver specimen, and its point cloud was removed to ensure that only the cadaveric liver was reconstructed.
Figure 20: Upper: Robotic exploration of a cadaveric liver. The first row shows camera images. The second and third rows show the point cloud …. Lower: … represent homing, grasping, and up and down manipulation stages, respectively.
The liver was manipulated three times to complete the canonical model, and all manipulations were turning-over operations. In the first manipulation, grip points were selected on the gallbladder and round ligament, taking 27.4 s to turn the liver and expose more of its surface. In this experiment, the initial canonical model is partially occluded by the surgical instrument, with the initial surface area accounting for 25.1% of the final canonical model surface. In the second manipulation phase, the grip points were located on the gallbladder and right triangular ligament. This manipulation increased the explored area from 83.9% to 97.0%. In the third manipulation phase, the grip points were positioned at both ends of the falciform ligament. The maximal speed for the two active manipulators in these three manipulations was 1.95 cm s−1.
The average shape difference of the liver is 2.24 mm compared to its reconstruction using OmniMotion. The error is higher in the region around the gallbladder, as this area is softer and more prone to deformation. The accompanying video (Extension 4) shows the entire exploration of the cadaveric liver.
Robotic laparoscopic exploration of cadaveric colon
A human cadaveric colon, approximately 20 cm in length, was used for validation. As shown in Figure 21, a higher resolution of the CČC was set for the colon due to the presence of adhesive fats and folds, with a radius of 3.5 mm and a height of 6 mm.
Figure 21: Upper: Robotic exploration of a cadaveric colon. The first row shows camera images. The second and third rows show the point cloud …. Lower: … represent homing, grasping, and up and down manipulation stages, respectively.
The exploration required seven manipulations and 244.6 s in total to complete, starting from 30.4% of the final canonical surface area before exploration. The shape difference between the canonical model and the reconstruction from OmniMotion is 5.54 mm. In each manipulation, the turning-over operations were accompanied by stretching. Throughout the exploration, the rough shape of the colon remained unchanged, resembling a long stick, which resulted in similar maximal manipulation speeds and, consequently, similar manipulation times at 23.72 ± 1.07 s.
In the second manipulation, almost no new surface was explored during the lifting and retracting stages. The optimal grip points were not easily reachable due to partial occlusions that would collapse under the grippers as they approached these points, leading to exploration of new surfaces from these interactions rather than from the turning-over operation. The accompanying video (Extension 5) shows the entire exploration of the cadaveric colon.
Table 5: Summary of key experimental results.
Discussions and conclusions
Discussion on simulations
The simulations show that the topology-aware 3D reconstruction can not only align different shapes at different times but also reconstruct better topology compared to the point-cloud-based method (Newcombe et al., 2015). This improvement is due to the estimation of topology and the detection of NM geometry. Despite the additional topology estimation, the computational cost remains similar to that of the DF, as using the CČC to estimate topology is efficient. Compared to methods like Delaunay triangulation or the truncated signed distance field (Newcombe et al., 2015), which require mesh compliance checks, this approach is more straightforward. The additional computational cost mainly stems from predicting invisible deformations and merging topologies. The performance of the two-arm and three-arm systems is similar in both operation duration and number of manipulations, as their configurations are nearly identical. The key difference is that the two-arm system relies on friction, while the three-arm system uses the third arm as a pivot. In these simulations, friction is sufficient for turning-over manipulations, with the process primarily dominated by the two active arms.
Several factors affect performance according to the simulations. The sample rate ϵ is crucial, as there is a trade-off between computational cost and reconstruction performance. When ϵ is high, the nodes cannot represent the shape accurately, and the observation is more likely to be non-homeomorphic to the real manifold. Grip point planning can improve exploration efficiency compared to using random grip points. The exploration can be completed with fewer manipulations. This is because grip points with maximal NSC cause more area to be deformed and exposed during the movement of the grip points. When the grip points are randomly selected, the linked area is limited, and the invisible part has a higher possibility of not being indirectly manipulated by the grippers. The effect of the speed is related to the success rate. When the manipulation speed is too high, the non-overlapping area between two frames is small, and its deformation field cannot be directly observed. Therefore, the deformation prediction of the invisible part will be less robust. In extreme cases, the point clouds in two frames do not overlap at all due to fast manipulation, and there is no correspondence between them, leading to failed reconstruction. If the speed is low enough, there is sufficient time for real-time reconstruction, but the exploration time may be too long. The optimal speed in our method balances the trade-off between success rate and exploration time.
Discussion on experiments
The experiments show that autonomous robotic exploration of soft objects with varying materials, shapes, and functions can be completed using our framework for daily and medical applications.
In the glove and beanie experiments, parts of the canonical model appeared below the table, conflicting with the physical expectation that soft objects placed on a table should remain entirely above its surface. This discrepancy arises from relying on visible surfaces for 3D reconstruction, while the invisible portions must be predicted. When large deformations occur, often due to low stiffness or complex geometry, the prediction of invisible topology becomes less reliable, and the assumption of deformation continuity may break down. By contrast, the cadaveric liver, being stiffer and less deformable, produced a canonical model that remained above the table plane during reconstruction. This issue can be mitigated by adding constraints to the EKF when predicting the invisible topology, helping guide the canonical model toward a more realistic initial shape. Still, some regions may appear below the table, since these constraints have limited effect on visible regions where the deformation field is already known.
The robustness of the method relies on the performance of the canonical modelling. In addition to the sample rate and manipulation speed discussed in the simulations, there are other error sources that affect 3D reconstruction. These include the quality of the point cloud, particularly in the segmentation and reconstruction from the RGB-D image. In the context of robotic surgery, the quality of laparoscopic reconstruction is lower, and the point cloud is sparser in the dVRK system compared to that obtained with a structured-light camera (Chen et al., 2023). While many robotic open surgeries can directly use this technology, better 3D reconstruction methods for laparoscopic images are currently being investigated. Robotic position inaccuracies may disrupt smooth robotic movement, which is crucial for achieving relatively continuous deformation. Continuous deformation helps in predicting invisible parts and thereby improves reconstruction performance. Additionally, the thickness of the soft object influences reconstruction accuracy, as the raw point cloud from the RGB-D camera may not be adequately reconstructed in areas with extremely high curvature. This high curvature near thin regions can lead to discontinuous surface tracking with a stationary camera. In extreme cases, when the soft object is homeomorphic to a 2-disk, the approach may fail.
The planning of robotic trajectory also relies on understanding the underlying topology. The method demonstrates robustness to drastic topology changes, as shown in the experiments on the glove and colon. Misjudgements of topology, which occur when it is not homeomorphic to the actual manifold, often result from the low resolution of the RGB-D camera’s point cloud and the high sampling rate, particularly when the local reach of the manifold is small. As proven in Proposition 1, the true topology becomes apparent as the local reach increases during manipulation. The experiments on the glove and colon have shown that NM geometries can be detected during manipulation. It is important to note, however, that NM geometries may not always be observed unless the required conditions are met through deformation during manipulation according to Lemma 4; otherwise, the controller may never accurately perceive the true geometry. Using the Betti number is an efficient way to understand the overall geometry, but it may be influenced by the quality of the point cloud. The presence of holes or disconnected components in the raw point cloud, caused by environmental lighting or occlusions when using structured light for shape estimation, can lead to β0 > 1 or β1 > 1. However, as more frames are fused, a more complete and accurate canonical model is generated, allowing the Betti number to correctly represent the geometry and enabling the controller to plan trajectories effectively.
This method is also effective for long-term exploration of complex surfaces, as demonstrated in the experiment on the colon. In this case, the controller can operate with a relatively low-resolution point cloud. If the live point cloud and deformation field are stored, a high-fidelity canonical model can be constructed during the intervals between manipulations or afterwards.
Our framework further demonstrates strong robustness to occlusion, despite operating with a single top-down RGB-D sensor. This robustness stems from the shape-control strategy developed in our prior work (Hu et al., 2024), where occlusion-aware down-sampling was used to ensure reliable surface feedback. In the current system, the deformation field computed from this shape controller is further extended into a topology-aware representation. As shown in Figures 16, 18, and 20, the system successfully reconstructs and explores self-occluding objects such as the beanie and cadaveric organs. Occluded regions are gradually revealed and reconstructed over the course of manipulation cycles, enabling complete surface coverage. Furthermore, since our method relies on depth-only input, it is inherently more robust to variable lighting compared to photometric or deep learning-based approaches. This makes our framework particularly suited for surgical environments where visual conditions may be suboptimal.
A key advantage of our method is its model-free design. With non-rigid registration between surface observations, the controller adapts to any source of deformation including breathing, heartbeat, and patient motion without requiring a predefined physical model. This flexibility enhances robustness and applicability in realistic surgical scenarios.
Limitations
First, while our framework’s current ‘turning over’ and ‘stretching’ manipulations are highly effective for the autonomous robotic exploration of unknown soft objects in a general setting, challenges persist for more complex geometries. If the soft object is not homeomorphic to a 3-ball and has a very intricate shape, such as a t-shirt, exploring every part of its surface becomes challenging for the controller due to the lack of prior knowledge about the object’s topology. Specifically, operations like flipping it inside out pose a notable challenge for the robot with our current manipulation set. Additionally, thin soft objects, like planar structures, may not be accurately reconstructed in 3D due to their inherent thinness. Since the behaviour of the object cannot be precisely predicted, for some objects, like cloth, it may become entangled or form a mass after several explorations. Our framework currently does not incorporate more specialised or delicate manipulations, such as localised folding, unfolding, peeling, or precise lifting of small regions. Such operations, while crucial for specific sub-tasks in fields like minimally invasive surgery (e.g., tissue dissection or retraction), fall outside the primary goal of comprehensive surface exploration of an unknown object. Future work will explore integrating these finer manipulation primitives to enhance the system’s versatility for more complex and application-specific tasks, including direct applications to RAMIS.
The current system also assumes that only a single deformable object is present in the workspace, placed on a flat and known surface. This simplification avoids the need for tissue segmentation and recognition, which would be required in more complex settings with multiple or interacting objects, such as surgical scenes. While this limits direct applicability to real surgeries, it allows us to focus on the main contribution of this work, which is the autonomous exploration and 3D reconstruction of soft objects. In actual surgical environments, tissues are often partially attached to surrounding structures, making parts of them physically inaccessible. For example, in the liver experiment, the gallbladder was explored successfully only in its free regions, while the attached areas remained unreconstructed.
The trajectory planning may not be optimal because the optimal grasp points might not be physically reachable. Additionally, the optimal grip points may not suit the gripper due to the stiffness, thickness, and resilience of the soft object in that region. More comprehensive grip point planning, based on properties such as morphology and material, is necessary for more complex tasks and will improve the performance of autonomous exploration.
Another limitation is that the method may be influenced by environmental factors or the specifics of the robotic system. For instance, lighting conditions can affect the brightness of the texture, which impacts both the performance of non-rigid registration and the texture quality of the reconstructed canonical model. This effect is particularly pronounced when surfaces that are initially dark become brightly illuminated as they face the camera. Additionally, different systems have varying manipulator configurations, which can lead to issues such as self-collision.
In real laparoscopic surgery, organs are constrained by ligaments, nerve bundles, and blood vessels that must not be damaged. These connections add complexity to trajectory planning during exploration. Camera movement is also common in laparoscopic surgery, as moving the camera allows for faster exploration of target tissues. Additionally, some of the latest robotic systems, such as the da Vinci 5 from Intuitive Surgical, integrate advanced sensors to restore haptic feedback. This feedback can be analysed to help infer these constraints and improve surgical precision, which is not fully addressed in the current approach.
Conclusion
In this study, we proposed a novel framework for the autonomous robotic exploration of unknown soft objects with 3D reconstruction. The framework integrates topology-aware 3D reconstruction during manipulation with motion planning for robotic exploration. We introduced a novel representation of deformable objects by combining CČCs with point clouds, enabling fast tracking of drastic topology changes and detection of NM boundaries. The motion planning, guided by topology analysis, optimises grasp points and plans trajectories for two types of operations: turning over and stretching. We validated our algorithm through simulations and experiments on various soft objects using the dVRK. The results demonstrate that soft objects can be successfully explored using two or three robotic arms.
In the future, we plan to apply the proposed method to real-world scenarios, such as complete laparoscopic exploration of the small bowel and thin objects.
Acknowledgements
All the experiments involving human cadaveric tissues were performed under ethical approval from the University of Leeds. The authors would like to thank Intuitive Surgical, Inc., for the donation of the da Vinci system, the STORM Lab technician, Samwise Wilson, for hardware support, and the anatomy facilities technicians of the School of Medicine, Sarah Wilson and Charlotte Coleman, for their support in the cadaveric experiments.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported in part by the European Research Council (ERC) through the European Union’s Horizon 2020 Research and Innovation Programme under Grant 818045, in part by the Engineering and Physical Sciences Research Council (EPSRC) under Grant EP/V047914/1, and in part by the National Institute for Health and Care Research (NIHR) Leeds Biomedical Research Centre (BRC) (NIHR203331). Any opinions, findings, and conclusions or recommendations expressed in this article are those of the authors and do not necessarily reflect the views of the ERC, the EPSRC or the NIHR.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
