
Abstract

Robotic deformable object manipulation (DOM) faces critical challenges in industrial and medical applications due to under-actuation, unpredictable deformation, and partial observability. Model-free methods often suffer from unstable Jacobians arising from ill-conditioned observations, while physics-based models typically depend on precise parameters and volumetric meshing, limiting their real-time practicality. We propose a wavelet-boundary element method (BEM) framework that leverages multiscale wavelet descriptors to control 3D deformations directly from efficient feedback modalities, such as contours and curves. By coupling wavelets with BEM, we derive an analytical deformation Jacobian that functions independently of material stiffness (e.g., Young’s modulus), relying solely on an online-calibrated Poisson’s ratio. This mesh-free formulation significantly enhances real-time performance and robustness against sensor occlusion. Validated in simulation and on the da Vinci Research Kit (dVRK) with phantom and ex vivo animal tissue, our method achieves millimetre-level accuracy. Comparative studies against Fourier-based, model-free, and online finite element method (FEM) approaches demonstrate superior stability and computational efficiency. Notably, our framework achieves convergence speeds significantly faster than online FEM by avoiding volumetric computations, while resolving ill-conditioning through spatial–frequency localization. This work advances deformable object manipulation in unstructured environments, particularly in surgical robotics, where stability under partial observability is essential. Project page: https://junleihu.github.io/projects/dwtbem/.

Keywords

deformable object manipulation, robotic laparoscopy, wavelet transform, boundary element method

1. Introduction

Robotic manipulation of deformable objects is essential in both daily life (Longhini et al., 2025) and specialized applications, such as autonomous robotic surgery (Hu et al., 2024). DOM has emerged as a critical research area, yet it remains challenging due to the infinite degrees of freedom (DoFs) and complex material dynamics of deformable objects. In the absence of prior models or constraints, robots must primarily rely on shape feedback to infer deformation mechanics. This task is further complicated by unpredictable nonlinearities, particularly in scenarios that require precise control of intricate configurations.

Recent advances in DOM use shape feedback to compute robotic motions δR that correct deformations δS, by approximating the mapping between shape changes and robot actions via an unknown nonlinear function H, as proposed by Hu et al. (2024); Berenson (2013); Navarro-Alarcon et al. (2013, 2016); Navarro-Alarcon and Liu (2017); Hu et al. (2018, 2019); Lagneau et al. (2020); Qi et al. (2021); Zhu et al. (2021); Shetab-Bushehri et al. (2022); Yang et al. (2023a); Saghour et al. (2025):

δ R = H (δ S),

(1)

or its velocity-based counterpart:

\dot{R} = H (\dot{S}) .

(2)

Here, H encapsulates deformation mechanics, sensorimotor relationships, and control dynamics. Due to the inherent complexity of this mapping, H (or its inverse H⁻¹) is often linearized locally as a deformation Jacobian matrix (DJM). Specifically, $J_{δ R}$ represents the DJM in motion space, while $J_{δ S} = J_{δ R}^{- 1}$ represents the DJM in shape space (Lagneau et al., 2020; Navarro-Alarcon et al., 2016). These DJMs are estimated from observed interactions between the robot and the deformable object.

However, linearizing H or H⁻¹ introduces stability challenges. For instance, when the robot follows a near-linear trajectory (e.g., small translations toward a target) or when the deformation is spatially uniform (e.g., quasi-static manipulation of high-stiffness objects), the resulting Jacobian matrix $J_{δ R}$ or $J_{δ S}$ becomes ill-conditioned. As a result, even very small shape-feedback inaccuracies or kinematic errors produce unstable deformation predictions. Similar issues arise in learning-based approaches (Hu et al., 2018, 2019): while iterative in-place deformations may succeed in controlled settings, unstable estimations of deformation dynamics can cause divergent behaviour or task failure. Figure 1 illustrates this challenge by visualizing the condition numbers of key matrices across various manipulation scenarios in simulation. Elevated condition numbers correlate with numerical instability, highlighting the limitations of current methods in real-world applications.
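The effect of near-collinear robot motions on conditioning can be reproduced with a small numerical sketch. It is illustrative only and is not the estimator used in the cited works: it assumes planar motions δR ∈ R², a hypothetical least-squares DJM estimate whose normal matrix is ∑ δR δRᵀ, and made-up motion samples.

```python
import math

def gram_matrix(motions):
    """Entries (a, b, d) of the 2x2 normal matrix sum(dr dr^T) = [[a, b], [b, d]]."""
    a = sum(dx * dx for dx, _ in motions)
    b = sum(dx * dy for dx, dy in motions)
    d = sum(dy * dy for _, dy in motions)
    return a, b, d

def condition_number(a, b, d):
    """Condition number of the symmetric PSD matrix [[a, b], [b, d]]."""
    mean = 0.5 * (a + d)
    radius = math.hypot(0.5 * (a - d), b)
    lam_max, lam_min = mean + radius, mean - radius
    return math.inf if lam_min <= 0.0 else lam_max / lam_min

# Hypothetical motion samples in metres: well spread vs. almost collinear.
spread = [(1e-3, 0.0), (0.0, 1e-3), (1e-3, 1e-3), (-1e-3, 1e-3)]
collinear = [(1e-3, 1e-6), (2e-3, -1e-6), (3e-3, 2e-6), (4e-3, 0.0)]

kappa_spread = condition_number(*gram_matrix(spread))
kappa_collinear = condition_number(*gram_matrix(collinear))
print(f"spread motions:    kappa = {kappa_spread:.2f}")
print(f"collinear motions: kappa = {kappa_collinear:.2e}")
```

With the spread samples the normal matrix is close to a scaled identity, whereas the near-collinear samples leave one direction almost unobserved, so any least-squares Jacobian estimate built on them amplifies measurement noise along that direction.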

Figure 1.

Illustration of DOM and the unstable cases using DJM-based approach.

Physics-based methods derive analytical solutions by modelling material mechanics (e.g., strain energy minimization or stress equilibrium) (Ficuciello et al., 2018; Han et al., 2025), thereby avoiding the ill-conditioning inherent in DJM-based approaches. Other model-free approaches, such as the grid-point-based weight residual method (GP-WRM) (Hu et al., 2024), provide analytical displacement-motion relations to bypass these issues. Despite these advances, real-time deformation control remains a problem that involves effective shape representation and corresponding dynamics models.

1.1. Related works

1.1.1. Representation of soft object

Shape representation is pivotal in DOM, directly influencing the viability of model-based or data-driven approaches (Yin et al., 2021). Common shape feedback modalities include 2D contours (Zhu et al., 2021), sparse feature points, 3D point clouds (Hu et al., 2024; Yang et al., 2023a), and raw images. While feature points offer simplicity, their reliance on preprocessing and limited geometric fidelity restricts their utility in complex tasks (Hu et al., 2024). Similarly, raw images and point clouds necessitate non-trivial state abstractions or geometry-aware shape simplification for control. Examples include contour moments (Qi et al., 2021) for 2D contours; downsampled grid points (Hu et al., 2024), histogram features (Hu et al., 2018, 2019), and modal graphs (Yang et al., 2023b) for 3D surfaces; latent manifolds (Koganti et al., 2017), histogram-of-oriented-wrinkles (Jia et al., 2018, 2019), and variational autoencoders (Lippi et al., 2023) for raw images; and direct use of the 3D mesh or geometry for 3D objects.

Signal-processing techniques, such as the fast Fourier transform (FFT), have been explored for shape representation in 2D (Navarro-Alarcon and Liu, 2017) and 3D (Makiyeh et al., 2023). However, its global frequency-domain encoding lacks spatial localization, making direct Fourier coefficient subtraction unsuitable for resolving local geometric discrepancies. Furthermore, its fixed spectral resolution and periodicity requirements, often managed through artificial signal completion like spherical parametrization (Makiyeh et al., 2023), limit its applicability to partial or occluded observations.

1.1.2. Physics-based models in DOM

Physics-based models, such as the FEM, discretize deformable objects into tetrahedral meshes (Ficuciello et al., 2018; Han et al., 2025; Saghour et al., 2025) to predict post-interaction deformations via physics simulations. These models enable analytic Jacobian computation for displacement-to-motion mappings (Yang et al., 2023a), as in equations (1) and (2), offering a principled foundation for control. However, the practical adoption of these methods faces critical challenges. Most require predefined 3D templates such as meshes (Saghour et al., 2025; Yang et al., 2023a), lattices (Shetab-Bushehri et al., 2023), or grids (Han et al., 2025), or have only been validated on simple parametric shapes such as circles (Das and Sarkar, 2011), ellipsoids (Makiyeh et al., 2023), and hemispheres (Dometios and Tzafestas, 2022), which limits generalization to arbitrary geometries. In addition, partial observability during manipulation, which is common in camera-based systems, restricts access to full geometric data and complicates real-time mesh updates. The high dimensionality of FEM state spaces, which far exceeds the robot’s control inputs, necessitates parsimonious representations for computational efficiency. Strategies include node reduction (Koessler et al., 2021), simplified mass-spring models (Makiyeh et al., 2023), or mesh-free approaches like position-based dynamics (Yin et al., 2021), which enforce physical constraints on particles at the cost of mechanical fidelity. A persistent challenge lies in estimating unknown physical properties online, as model accuracy hinges on precise parametrization of material behaviour.

1.1.3. Model-free approaches in DOM

Model-free strategies for deformable manipulation frequently depend on DJM predictions from interaction-deformation dynamics, yet suffer from numerical instability due to poor Jacobian conditioning under complex deformations, as evidenced in Figure 1. Approaches like the GP-WRM (Hu et al., 2024) avoid explicit Jacobian modelling by linking gripper motion to displacement fields through FEM-derived partial differential equations (PDEs). While efficient, GP-WRM imposes restrictive assumptions: gripper contacts must align with observable point cloud regions to define the PDE boundary conditions (BCs), a requirement often violated in practice when interacting with occluded or edge geometries, leading to singularities in the nonlinear mapping. Furthermore, voxel-based downsampling only partially attenuates sensor noise while blurring fine deformation features vital for control precision. The method’s reliance on 3D point clouds also complicates integration with the 2D visual feedback common in robotic-assisted minimally invasive surgery (RAMIS), as cross-modal planning remains unaddressed.

Learning-based techniques constitute a major category of model-free approaches, utilizing neural networks to approximate deformation dynamics in equation (1) or equation (2). These methods have been integrated into frameworks that combine model predictive control (Shin et al., 2019), Gaussian process regression (Hu et al., 2018), and deep neural networks (Hu et al., 2019). Other recent approaches have employed implicit neural representations to model action-conditional visual dynamics for volumetric deformable objects (Shen et al., 2024). In the surgical domain, advancements have further explored hierarchical and embodied intelligence frameworks. For instance, SRT-H (Kim et al., 2025) proposes a language-conditioned imitation learning approach to automate complex surgical tasks, while the surgical embodied intelligence framework (Long et al., 2025) leverages sim-to-real reinforcement learning to achieve generalized autonomy in tasks such as soft tissue retraction. While such methods demonstrate impressive capability in high-level task planning and execution from demonstrations, they fundamentally rely on extensive datasets encompassing diverse deformations and interactions. Consequently, their generalization remains limited to the scope of the training data, often struggling with precise, unseen tissue mechanics or novel geometries. This dependence on exhaustive, task-specific data constrains their practicality in unstructured environments where such datasets are unavailable. In contrast, our approach focuses on the low-level stability of deformation control without requiring pre-collected training data.

1.2. Our contribution

In this work, we focus on the manipulation of volumetric deformable objects governed by three-dimensional continuum mechanics, such as biological organs. We distinguish our scope from low-dimensional objects, including one-dimensional linear objects (e.g., cables) and two-dimensional thin shells (e.g., cloth). We propose a physics-based framework, named DWT-BEM, which integrates the BEM with the discrete wavelet transform (DWT) to improve deformation control stability and mitigate the instability introduced by Jacobian linearization.

Unlike FEM, which requires volumetric meshing, the BEM formulation is particularly advantageous for volumetric objects because it directly models boundary interactions using surface observations. This characteristic makes it well suited to vision-guided systems where only partial deformations are visible. Consequently, our approach reduces computational overhead while preserving physical consistency through stress-strain equilibrium equations.

The DWT complements BEM by decomposing deformations into multiscale wavelet coefficients, capturing both coarse global shapes and fine details. This hierarchical representation avoids the periodicity assumptions of FFT-based methods that can distort boundary conditions in open or partially observed surfaces, thereby enabling adaptive control with prioritized large-scale corrections followed by incremental refinements. Notably, DWT retains positional context even under occlusion, addressing challenges in partially observable environments like laparoscopic surgery.

The contributions of this work are:

• A wavelet-BEM framework that combines the DWT with the BEM to represent 3D deformable objects. Unlike mesh-dependent approaches, this descriptor leverages the DWT’s spatial localization to handle occlusions and open surfaces, while BEM bypasses volumetric meshing for direct physical interpretation of boundary deformations.

• An analytical physics-based deformation Jacobian is derived to mitigate the ill-conditioning issues inherent in model-free approaches. Unlike traditional physics-based models that require precise material characterization, the proposed formulation reduces the dependency to a single online-calibrated Poisson’s ratio, rendering the framework effectively free of manual physical parameter inputs.

• A systematic benchmarking against a comprehensive suite of state-of-the-art (SOTA) strategies, contrasting our wavelet-centric method with spectral, data-driven, geometric model-free, and physics-based (FEM) approaches to rigorously quantify trade-offs in numerical stability, accuracy, and computational efficiency.

• Robotic validation in laparoscopic scenarios, demonstrating applicability to surgical tasks requiring high dexterity and partial observability, where contours and curves prove to be convenient and effective feedback modalities for real-world surgery.

2. Modelling and problem formulation

Consider a deformable object $O$ manipulated by a multi-robot system with K arms. The gripper contact points are defined as $R = \{(R_{i}^{m}, g_{i}^{m}) ∣ R_{i}^{m} \in SO(3), g_{i}^{m} \in R^{3}\}_{i = 1}^{K}$, where $R_{i}^{m}$ and $g_{i}^{m}$ represent the orientation and position of the i-th gripper in the camera frame. The system is vision-guided, utilizing both RGB images and depth maps as shape feedback (Figure 2). Depth observations are acquired with focal length f, enabling 3D reconstruction of the deforming surface.

Figure 2.

Model illustration of robotic DOM using contours, curves and surfaces shape feedbacks. The corresponding targets are shown as dashed curves.

In this study, we aim to design a physics-based controller for shaping a soft object, under the following assumptions:

Assumption 1

The object is modelled as a continuous, isotropic linear elastic material with quasi-static deformation.

Assumption 2

Camera-robot and inter-robot transformations are known with calibration (Hu et al., 2023).

Assumption 3

Hardware safety is maintained by joint-level kinematic limits, while tissue preservation relies on the assumption of quasi-static manipulation to minimize dynamic interaction forces.

Assumption 4

No slip occurs at the contact points, and the robotic manipulator is modelled as backlash-free, with deformation arising only from its controlled motion.

Remark 1

All poses and deformations are expressed in the camera coordinate system, as illustrated in Figure 2, unless stated otherwise.

Remark 2

In the proposed framework, the target shape $S_{d}$ and the gripper contact points serve as external inputs to the controller. We treat the proposed shape servoing as a fundamental primitive within a hierarchical autonomous manipulation system. In complex surgical scenarios, these inputs are typically generated by high-level planners. For example, in our previous work (Hu et al., 2026), we specifically addressed the planning of optimal grasp points and target configurations based on null space analysis for autonomous exploration. In this study, we purposefully decouple the high-level planning from the low-level control to rigorously evaluate the stability and convergence of the DWT-BEM deformation controller under predetermined inputs.

Problem Statement: Given a deformable object represented by a time-varying shape signal S (x, t), observed as either an RGB image $I (x, t)$ or a depth-derived 3D point cloud $P (x, t)$ , design a physics-based controller for the multi-arm system that drives the object to a desired configuration S_d(x) without prior knowledge of material parameters.

3. Shape representation using wavelet decomposition

3.1. DWT

Given a shape $S$ in D-dimensional space (D = 2, 3), it can be interpreted as one or several D₀-dimensional signals $f (x) : R^{D_{0}} \to R$, D₀ < D. For example, a 2D contour is represented as two 1D signals parametrized by arc length, while a 3D surface is represented as a 2D signal in spherical coordinates. The DWT decomposes f(x) into a hierarchical representation using scaling functions ϕ(x) and wavelet functions ψ(x), employing dyadic scaling and translation across J resolution levels. The decomposition is:

f (x) = \sum_{k \in K_{J}} c_{J, k} ϕ_{J, k} (x) + \sum_{j = 1}^{J} \sum_{a \in A} \sum_{k \in K_{j}} d_{j, a, k} ψ_{j, a, k} (x),

where c_J,k = ⟨f, ϕ_J,k⟩ = ∑_x f(x)ϕ_J,k(x) denotes the approximation coefficients capturing low-frequency features, and d_j,a,k = ⟨f, ψ_j,a,k⟩ = ∑_x f(x)ψ_j,a,k(x) represents the detail coefficients encoding high-frequency features at scale j. The translation indices $K_{j} = \prod_{i = 1}^{D_{0}} K_{i, j}$ are defined per dimension as $K_{i, j} = \{0, 1, \dots, ⌊ 2^{j - J} N_{i} ⌋ - 1\}$, where N_i is the signal length in dimension i. The orientation of the wavelet function, $a \in A$, adapts to the geometric structure of the shape. The scaling and wavelet functions are spatially localized through translation and dilation:

ϕ_{0, k} (x) = ϕ (x - k), ψ_{j, a, k} (x) = 2^{- j / 2} ψ_{a} (2^{j} x - k) .

A shape $S (x)$ can thus be reconstructed at resolution level Q (0 ≤ Q ≤ J) via the truncated inverse discrete wavelet transform (IDWT):

f_{Q} (x) = Φ^{⊺} (x) c + \sum_{j = Q}^{J} \sum_{a \in A} Ψ_{j, a}^{⊺} (x) d_{j, a},

(3)

where $c = {[c_{J, k}]}_{k \in K} \in R^{P_{J}}$ and $d_{j, a} = {[d_{j, a, k}]}_{k \in K} \in R^{P_{j}}$ are vectors of approximation and detail coefficients, respectively. Here, $P_{j} = \prod_{i = 1}^{D_{0}} ⌊ 2^{- j} N_{i} ⌋$ defines the dimensionality of the coefficient vectors, and

Φ (x) = {[ϕ_{0, k} (x)]}_{k \in K} \in R^{P_{0}}, Ψ_{j, a} (x) = {[ψ_{j, a, k} (x)]}_{k \in K} \in R^{P_{j}}

are basis vectors of scaling and wavelet functions.

The signal is represented by the coefficient vectors c and d_j,a, derived from the low-pass filtered approximation. These coefficients are spatially localized: each c_J,k and d_j,a,k corresponds to a specific region of the shape at scale j, with k encoding its spatial position. This ensures that geometric differences in the wavelet domain directly reflect physical discrepancies in Cartesian space. By capturing both spatial and spectral information, the DWT provides a multiscale framework for shape representation, where coefficient-level operations inherently preserve geometric consistency. This property underpins our deformation modelling framework, where shape discrepancies are computed as differences in wavelet coefficients, ensuring physically meaningful and computationally efficient deformations.
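The decomposition and the truncated reconstruction of equation (3) can be sketched for the simplest case, a 1D signal and the orthonormal Haar wavelet. This is a minimal standard-library illustration, not the implementation used in this work: the function names, the test signal, and the convention that level j = 1 is the finest scale and j = J the coarsest are assumptions made for the example.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_dwt(signal, levels):
    """J-level orthonormal Haar DWT.

    Returns (c, details): c holds the level-J approximation coefficients and
    details[j] holds the detail coefficients d_j for j = 1..J.
    """
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    approx, details = list(signal), {}
    for j in range(1, levels + 1):
        pairs = list(zip(approx[0::2], approx[1::2]))
        details[j] = [(a - b) / SQRT2 for a, b in pairs]
        approx = [(a + b) / SQRT2 for a, b in pairs]
    return approx, details

def haar_idwt(c, details, levels, q=1):
    """Truncated inverse DWT: detail levels finer than q are treated as zero."""
    approx = list(c)
    for j in range(levels, 0, -1):
        d = details[j] if j >= q else [0.0] * len(approx)
        rebuilt = []
        for a, b in zip(approx, d):
            rebuilt.extend([(a + b) / SQRT2, (a - b) / SQRT2])
        approx = rebuilt
    return approx

N, J = 16, 3
# Smooth shape plus a small sample-to-sample ripple (hypothetical data).
signal = [math.sin(2 * math.pi * x / N) + 0.1 * (-1) ** x for x in range(N)]
c, d = haar_dwt(signal, J)

full = haar_idwt(c, d, J, q=1)     # all detail levels kept
coarse = haar_idwt(c, d, J, q=2)   # finest detail level dropped
err_full = max(abs(a - b) for a, b in zip(signal, full))
err_coarse = max(abs(a - b) for a, b in zip(signal, coarse))
print(len(c), [len(d[j]) for j in (1, 2, 3)], err_full, err_coarse)
```

Keeping every level reproduces the signal up to rounding, while dropping the finest level removes the ripple and keeps the smooth shape, which is the behaviour the resolution level Q controls.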

3.2. Wavelet-decomposed represented shapes

The 1D DWT provides an effective framework for representing shape feedback from 2D contours or curves in an image. The contour or curve is defined by discrete 2D pixel points, denoted as $S (x) = [\begin{matrix} g (x) & h (x) \end{matrix}]$, where g(x) and h(x) represent the image coordinates. Here, x ∈ [0, L] parametrizes the contour or curve, with x representing the arc length from a reference point (e.g., a gripper point) and L being the total length. Each coordinate function, g(x) and h(x), is independently decomposed into a J-level DWT, yielding the wavelet representations $W_{J, g} = \{c_{J, g}, d_{J, g}, d_{J - 1, g}, \dots, d_{Q, g}\}$ and $W_{J, h} = \{c_{J, h}, d_{J, h}, d_{J - 1, h}, \dots, d_{Q, h}\}$, where c_J,g and c_J,h are the approximation coefficients, and d_j,g, d_j,h are the detail coefficients at scale j. For the 1D DWT, the directional index a is omitted, as the transform inherently lacks directional sensitivity.

For 3D curves, which are defined by discrete points $S (x) = [\begin{matrix} g (x) & h (x) & z (x) \end{matrix}]$ in Cartesian space, the decomposition applies an independent 1D DWT to each of the three coordinate functions. The resulting wavelet representations are $W_{J, g}$, $W_{J, h}$, and $W_{J, z} = \{c_{J, z}, d_{J, z}, \dots, d_{Q, z}\}$, preserving the multiscale structure across all dimensions.

In robotic DOM applications, 3D surfaces are typically captured using RGB-D cameras and represented as point clouds, with an injective mapping that assigns a depth value to each image coordinate. The shape signal $S$ is parametrized as its depth f(x) over a spatial domain x ∈ [0, L_g] × [0, L_h], where L_g and L_h define the dimensions of the image coordinates.

The 2D DWT is then employed to decompose f(x) into its directional components by using wavelet functions that align with horizontal, vertical, and diagonal orientations. In this framework, horizontal (a = (1, 0)) details are characterized by the function $ψ_{j, (1,0), k_{1}, k_{2}} (n_{1}, n_{2}) = ϕ_{j} (n_{1} - k_{1}) ψ_{j} (n_{2} - k_{2})$ , vertical (a = (0, 1)) details by $ψ_{j, (0,1), k_{1}, k_{2}} (n_{1}, n_{2}) = ψ_{j} (n_{1} - k_{1}) ϕ_{j} (n_{2} - k_{2})$ , and diagonal (a = (1, 1)) details by $ψ_{j, (1,1), k_{1}, k_{2}} (n_{1}, n_{2}) = ψ_{j} (n_{1} - k_{1}) ψ_{j} (n_{2} - k_{2})$ , where ϕ_j and ψ_j denote the scaling and wavelet functions at scale j.

For a J-level decomposition, the surface is represented as $W_{J} = \{c_{J}, d_{J}, d_{J - 1}, \dots, d_{Q}\}$. Here, c_J contains the approximation coefficients, and the detail coefficients at each scale j are aggregated as d_j = (d_j,(1,0), d_j,(0,1), d_j,(1,1)).
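The three directional bands can be made concrete with a one-level separable 2D Haar transform of a small depth map. The sketch below is an illustration under stated assumptions: the depth map is a hypothetical 4 × 4 array indexed as depth[n1][n2], the helper names are invented for the example, and the band labels follow the orientation indices a = (1, 0), (0, 1), (1, 1) as defined above.

```python
import math

def haar_pairs(values):
    """One orthonormal Haar step: returns (low-pass, high-pass) halves."""
    s = math.sqrt(2.0)
    low = [(a + b) / s for a, b in zip(values[0::2], values[1::2])]
    high = [(a - b) / s for a, b in zip(values[0::2], values[1::2])]
    return low, high

def transpose(m):
    return [list(col) for col in zip(*m)]

def along_n1(band):
    """Apply the Haar step down each column (the n1 direction)."""
    lows, highs = zip(*(haar_pairs(col) for col in transpose(band)))
    return transpose(lows), transpose(highs)

def haar_dwt2(depth):
    """One-level separable 2D Haar DWT of depth[n1][n2] (even dimensions)."""
    # Filter along n2 first (within each row).
    low_n2, high_n2 = zip(*(haar_pairs(row) for row in depth))
    c, d01 = along_n1(low_n2)      # phi(n1) phi(n2), psi(n1) phi(n2)
    d10, d11 = along_n1(high_n2)   # phi(n1) psi(n2), psi(n1) psi(n2)
    return c, {(1, 0): d10, (0, 1): d01, (1, 1): d11}

# Hypothetical depth map that varies only along n2.
depth = [[float(n2 % 2) for n2 in range(4)] for n1 in range(4)]
c, d = haar_dwt2(depth)
for a, band in d.items():
    print(a, band)
```

Because this depth map changes only along n2, all of its detail energy lands in the a = (1, 0) band, and the other two bands vanish.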

3.3. Adaptive sampling

The scalar parameter x (for contours and curves) or the vector parameter x (for image coordinates) is typically sampled uniformly. However, to accommodate non-uniform deformations such as stretching, adaptive sampling is employed when reliable feature points are available. This process utilizes two types of anchors: the robot’s grasping points, which are known from kinematics and serve as fixed boundary delimiters, and texture features near the contour, which are matched between consecutive frames using optical flow. These anchors divide the contour into segments, and the sampling density within each segment is adjusted to maintain a consistent node distribution relative to the previous time step. This ensures that the wavelet representation remains spatially aligned with the object’s geometric features throughout the manipulation, which is essential given that the DWT operates on fixed-length discrete signals.
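One way to realize the per-segment resampling is sketched below. It assumes, purely for illustration, that the contour is a 2D polyline, that anchors are given as indices into that polyline, and that each segment should keep the node count it had at the previous time step; the function names and data are hypothetical, and anchor detection (kinematics, optical flow) is outside the sketch.

```python
import math

def resample_segment(points, n_nodes):
    """Resample a 2D polyline to n_nodes (>= 2) points equally spaced in arc length."""
    cum = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        cum.append(cum[-1] + math.hypot(x1 - x0, y1 - y0))
    out, seg = [], 0
    for i in range(n_nodes):
        s = cum[-1] * i / (n_nodes - 1)
        while seg < len(points) - 2 and cum[seg + 1] < s:
            seg += 1
        span = cum[seg + 1] - cum[seg]
        t = 0.0 if span == 0.0 else (s - cum[seg]) / span
        (x0, y0), (x1, y1) = points[seg], points[seg + 1]
        out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return out

def adaptive_resample(points, anchor_idx, nodes_per_segment):
    """Resample each anchor-delimited segment to its previous node count."""
    resampled = []
    for (i0, i1), n in zip(zip(anchor_idx, anchor_idx[1:]), nodes_per_segment):
        seg = resample_segment(points[i0:i1 + 1], n)
        resampled.extend(seg if not resampled else seg[1:])  # anchors are shared
    return resampled

# The second segment has been stretched from length 1 to length 3.
contour = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (6.0, 0.0)]
anchors = [0, 3, 4]            # gripper point, texture feature, gripper point
nodes = adaptive_resample(contour, anchors, nodes_per_segment=[4, 4])
print(nodes)
```

The stretched segment receives the same number of nodes as before, so each node index keeps referring to the same piece of material and the wavelet coefficients stay comparable across frames.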

3.4. Missing data on the shape signal

In practice, occlusion often occurs when observing the shape, and point cloud observations may include background noise that must be filtered out, resulting in insufficient signal on the domain [0, L] or [0, L_g] × [0, L_h]. The DWT is well-suited to handle such cases due to properties like local support, orthogonality, and perfect reconstruction, especially when using wavelet functions such as the Haar wavelet. In the Haar wavelet transform, the approximation coefficients c_J,k depend only on the domain [2^Jk, 2^J (k + 1) − 1], while the detail coefficients d_j,a,k are similarly determined by [2^jk, 2^j (k + 1) − 1]. As a result, coefficients corresponding to missing portions of a signal can be effectively replaced with artificial or interpolated values. Due to the localized nature of the DWT, such modifications remain confined to their specific regions, preserving the decomposition of the original valid data. This makes the method particularly robust for handling locally occluded curves in signal processing.
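The locality argument can be checked directly: corrupting a short segment of a signal changes only the Haar coefficients whose support overlaps that segment. The sketch below uses a hypothetical 32-sample signal, loses samples 8 to 11, and compares the changed coefficient indices against the supports [2^j k, 2^j (k + 1) − 1]; all names are assumptions made for the example.

```python
import math

def haar_dwt(signal, levels):
    """Orthonormal Haar DWT; returns (c_J, {j: d_j}) with j = 1 the finest scale."""
    approx, details = list(signal), {}
    for j in range(1, levels + 1):
        pairs = list(zip(approx[0::2], approx[1::2]))
        details[j] = [(a - b) / math.sqrt(2.0) for a, b in pairs]
        approx = [(a + b) / math.sqrt(2.0) for a, b in pairs]
    return approx, details

def changed_indices(old, new, tol=1e-12):
    """Indices k whose coefficient differs between two decompositions."""
    return [k for k, (a, b) in enumerate(zip(old, new)) if abs(a - b) > tol]

def overlapping_indices(start, stop, j, n):
    """Indices k whose Haar support [2^j k, 2^j (k+1) - 1] meets [start, stop]."""
    width = 2 ** j
    return [k for k in range(n // width)
            if width * k <= stop and width * (k + 1) - 1 >= start]

LEVELS, N = 3, 32
signal = [math.sin(0.3 * x) for x in range(N)]
occluded = list(signal)
occluded[8:12] = [0.0] * 4           # samples 8..11 are lost

c_ref, d_ref = haar_dwt(signal, LEVELS)
c_occ, d_occ = haar_dwt(occluded, LEVELS)

for j in range(1, LEVELS + 1):
    print(j, changed_indices(d_ref[j], d_occ[j]), overlapping_indices(8, 11, j, N))
print("c", changed_indices(c_ref, c_occ), overlapping_indices(8, 11, LEVELS, N))
```

Every coefficient outside the listed indices is bit-for-bit identical in the two decompositions, which is why filling the gap with interpolated values does not disturb the representation of the visible part.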

Let the missing part be the set $M = \{M_{i}\}$ of missing domain segments, for example, $M_{i} = (m_{i}, m_{i}^{'}) \subset [0, L]$ for curves or $M_{i} = (m_{i}, m_{i}^{'}) \times (m_{i}^{''}, m_{i}^{'''}) \subset [0, L_{g}] \times [0, L_{h}]$ for surfaces. To quantify the extent of signal retention under occlusion, we define the completeness ratio η as the proportion of the visible domain relative to the total domain of the target shape. Specifically, for 2D/3D curves with total length L:

η = 1 - \frac{1}{L} \sum_{M_{i} \in M} | m_{i}^{'} - m_{i} |,

where $| m_{i}^{'} - m_{i} |$ represents the arc length of the i-th missing segment. Similarly, for 3D surfaces defined over a domain area L_g × L_h:

η = 1 - \frac{1}{L_{g} L_{h}} \sum_{M_{i} \in M} | m_{i}^{'} - m_{i} | | m_{i}^{'''} - m_{i}^{''} | .

It should be noted that because η is calculated relative to the target shape’s domain, significant material stretching between the initial and target configurations can distort the arc-length mapping, which may slightly skew this completeness estimation. When the completeness ratio η is small, indicating a large missing domain, the exact content of the signal in those regions becomes less relevant, as it mainly corresponds to high-frequency coefficients that are not utilized. To ensure computational efficiency, linear interpolation is applied for curves: for each coordinate in the missing domain M_i, the interpolated value is given by

f (x) = f (m_{i}) + \frac{f (m_{i}^{'}) - f (m_{i})}{m_{i}^{'} - m_{i}} (x - m_{i}), f \in \{g, h, z\} .

For surfaces, bilinear interpolation is used to estimate the depth in M_i:

\begin{aligned} S (x) = & (1 - t_{1}) (1 - t_{2}) S ([\begin{matrix} m_{i} \\ m_{i}^{''} \end{matrix}]) + t_{1} t_{2} S ([\begin{matrix} m_{i}^{'} \\ m_{i}^{'''} \end{matrix}]) \\ & + (1 - t_{1}) t_{2} S ([\begin{matrix} m_{i} \\ m_{i}^{'''} \end{matrix}]) + t_{1} (1 - t_{2}) S ([\begin{matrix} m_{i}^{'} \\ m_{i}^{''} \end{matrix}]), \end{aligned}

with $t_{1} = (x_{1} - m_{i}) / (m_{i}^{'} - m_{i})$ and $t_{2} = (x_{2} - m_{i}^{''}) / (m_{i}^{'''} - m_{i}^{''})$, so that t₁ and t₂ run from 0 to 1 across the missing rectangle. The corresponding coefficients with indices

V_{j} = \{k ∣ \exists M_{i} \in M, [2^{j} k, 2^{j} (k + 1) - 1] \subseteq M_{i}\}

are excluded, as they lie entirely within the missing domain.
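For the curve case, the completeness ratio, the linear gap filling, and the excluded index sets V_j fit in a few lines. The sketch below is illustrative: it assumes integer sample positions, treats each missing segment as the open index interval (m_i, m_i') with valid samples at both end points, and uses hypothetical function names and data.

```python
def completeness_ratio(missing, total_length):
    """eta = 1 - (sum of missing segment lengths) / L for a curve."""
    return 1.0 - sum(abs(m1 - m0) for m0, m1 in missing) / total_length

def fill_gaps_linear(samples, missing):
    """Linearly interpolate the samples strictly inside each interval (m0, m1)."""
    filled = list(samples)
    for m0, m1 in missing:
        f0, f1 = samples[m0], samples[m1]   # assumed-valid end points
        for x in range(m0 + 1, m1):
            filled[x] = f0 + (f1 - f0) / (m1 - m0) * (x - m0)
    return filled

def excluded_indices(missing, j, n):
    """V_j: indices k whose support [2^j k, 2^j (k+1) - 1] lies inside some M_i."""
    width = 2 ** j
    return [k for k in range(n // width)
            if any(m0 < width * k and width * (k + 1) - 1 < m1
                   for m0, m1 in missing)]

# Hypothetical coordinate signal with samples 5..11 occluded.
observed = [float(x * x) if not 4 < x < 12 else None for x in range(17)]
missing = [(4, 12)]

eta = completeness_ratio(missing, total_length=16.0)
filled = fill_gaps_linear(observed, missing)
v = {j: excluded_indices(missing, j, len(filled)) for j in (1, 2, 3)}
print(eta, filled[8], v)
```

Only the fine-scale coefficients that sit wholly inside the gap are discarded; coarser coefficients that straddle the gap are kept and rely on the interpolated values.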

4. Desired displacement field

The desired shape $S_{d}$ (in 2D or 3D Cartesian space) is first decomposed into its wavelet representation, denoted as $W_{d} = \{c_{d}, d_{1, d}, \dots, d_{J, d}\}$. To compute the displacement field required to deform the object into $S_{d}$, we derive a deformation field in the DWT domain as the coefficient-wise difference. Leveraging the multiscale structure of the DWT, we propose a hierarchical alignment strategy that progressively resolves geometric discrepancies from coarse to fine scales.

4.1. Deformation field in wavelet representation

With the position-encoding nature of DWT coefficients, the deformation field is defined directly in the wavelet domain as the multiscale difference between two shape vectors:

U = W_{1} - W_{2} = \{c_{1} - c_{2}, d_{J, 1} - d_{J, 2}, \dots, d_{Q, 1} - d_{Q, 2}\},

where $W_{i} = \{c_{i}, d_{J, i}, \dots, d_{Q, i}\}$, i ∈ {1, 2}. Subtraction of coefficients is physically meaningful only when both shapes share the same decomposition levels J, resolution levels Q, and vector lengths P_J, and when the coefficients are spatially aligned at each scale. This prerequisite guarantees that differences in c_J,k and d_j,a,k correspond directly to geometric discrepancies within the same spatial regions. The deformation field in Cartesian space is reconstructed via the IDWT in equation (3).

Under the assumption of non-extensible material (i.e., the deformable object undergoes no stretching), the spatial domain remains invariant in both 2D and 3D. This permits direct computation of shape differences using wavelet coefficients without altering their positional indices, preserving the intrinsic geometric relationships across scales.

To quantify shape discrepancies, a metric space is defined over the wavelet-represented shapes. Let $(W, H)$ denote this space, where $W$ is the set of shapes encoded in the DWT domain, and $H : W \times W \to R^{+}$ is the metric defined as:

H (W_{1}, W_{2}) : = \frac{{‖c_{1} - c_{2}‖}_{2}}{2^{J / 2} P_{J}} + \sum_{j = Q}^{J} \frac{{‖d_{j, 1} - d_{j, 2}‖}_{2}}{2^{j / 2} P_{j}} .

This metric aggregates the multiscale geometric differences, combining the coarse approximation (c₁ − c₂) and fine-scale details (d_j,1 − d_j,2) across all decomposition levels j = Q, …, J.
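The deformation field U and the metric H can be evaluated directly from two coefficient sets. The sketch below is a minimal illustration for the 1D case with a single orientation, so that P_j is simply the length of the level-j coefficient vector; the tuple layout (c, {j: d_j}) and the numerical values are assumptions made for the example.

```python
import math

def l2_distance(v1, v2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def deformation_field(w1, w2):
    """U = W1 - W2, coefficient by coefficient (layouts must match)."""
    (c1, d1), (c2, d2) = w1, w2
    if len(c1) != len(c2) or d1.keys() != d2.keys():
        raise ValueError("shapes must share J, Q and coefficient lengths")
    c = [a - b for a, b in zip(c1, c2)]
    d = {j: [a - b for a, b in zip(d1[j], d2[j])] for j in d1}
    return c, d

def wavelet_metric(w1, w2, levels):
    """H(W1, W2): scale-weighted sum of coefficient distances."""
    (c1, d1), (c2, d2) = w1, w2
    total = l2_distance(c1, c2) / (2 ** (levels / 2) * len(c1))
    for j in d1:
        total += l2_distance(d1[j], d2[j]) / (2 ** (j / 2) * len(d1[j]))
    return total

J = 2
# Hypothetical coefficients: P_2 = 2 and P_1 = 4.
w1 = ([2.0, 4.0], {2: [1.0, 0.0], 1: [0.0, 0.0, 0.0, 0.0]})
w2 = ([2.0, 0.0], {2: [1.0, 0.0], 1: [0.0, 2.0, 0.0, 0.0]})

u_c, u_d = deformation_field(w1, w2)
h12 = wavelet_metric(w1, w2, J)
print(u_c, u_d, h12)
```

In this example the coarse mismatch contributes 4 / (2 · 2) = 1 and the fine-scale mismatch contributes 2 / (√2 · 4), showing how the weights 2^{j/2} P_j balance the contributions of the different scales.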

4.2. Shape alignment in the wavelet space

The desired deformation field in the wavelet domain is

U_{d} = W_{d} - W,

where $W_{d}$ and $W$ are the DWT representations of the target and current shapes, respectively. Spatial alignment is a critical prerequisite: coefficients in $W_{d}$ and $W$ must correspond to the same spatial regions to ensure the subtraction yields a physically meaningful deformation field. This guarantees that each coefficient pair (c_d, c) or (d_j,d, d_j) reflects geometric discrepancies in their respective regions.

The proposed hierarchical alignment strategy iteratively resolves positional mismatches across scales, starting from coarse approximations and refining through finer details. This ensures wavelet-domain consistency before computing $U_{d}$ . During shape servoing, the target wavelet $W_{d}$ remains fixed, while the current shape’s parameters are optimized.

Similarity between shapes is quantified using the normalized cross-correlation (NCC) of their zero-mean approximation coefficients:

NCC (c_{1}, c_{2}) = \frac{{\bar{c}}_{1}^{⊺} {\bar{c}}_{2}}{‖ {\bar{c}}_{1} ‖_{2} ‖ {\bar{c}}_{2} ‖_{2}},

where ${\bar{c}}_{i} = c_{i} - \frac{1}{P_{J}} (1_{P_{J}}^{⊺} c_{i}) 1_{P_{J}}$, i ∈ {1, 2}, centres the coefficients.

For 2D/3D curves, alignment accounts for rotational and parametric phase shifts. The approximation coefficients are transformed as $c = [\begin{matrix} c_{g} & c_{h} \end{matrix}] R_{g, h}$ and $c = [\begin{matrix} c_{g} & c_{h} & c_{z} \end{matrix}] R_{g, h, z}$ , where R_g,h ∈ SO(2) and R_g,h,z ∈ SO(3) are rotation matrices. Contours additionally require arc-length phase alignment via translation τ, modifying the coefficients as

c_{i} (τ) = \sum_{x} f (x + τ) ϕ_{J, k} (x + τ), i \in \{g, h, z\} .

The optimization problem becomes:

\underset{R, τ}{\arg \max} \sum_{i} {NCC}^{2} (c_{i} (R, τ), c_{d, i}) .

(4)

For 3D surfaces, rotational alignment modifies coefficients as:

c (R) = \sum_{x} f (R x) ϕ_{J, k} (R x),

yielding the objective:

\underset{R}{\arg \max} {NCC}^{2} (c (R), c_{d}) .

(5)

To solve these optimizations, differentiable scaling functions (e.g., Shannon wavelets) enable gradient-based methods. The differentiability of ϕ(x) yields analytical gradients, so the optimal rotation R^⋆ (and, for contours, the phase shift τ) can be found efficiently by gradient-based optimization. It is important to note that this optimization is strictly limited to rigid alignment (6 DoF) to establish a consistent error metric. Since the shape feedback is continuous in time, we use the optimal rotation and translation from the previous time step as the initial values for the current step. This ensures that the solver starts very close to the true solution, effectively avoiding local minima in the alignment phase.
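The role of the NCC score in phase alignment can be illustrated with a deliberately simplified stand-in for the optimization above. Instead of the gradient-based solver with differentiable scaling functions, the sketch below uses an exhaustive search over integer circular shifts of the approximation coefficients of a closed contour; rotation is not handled, and the function names and coefficient values are hypothetical.

```python
import math

def ncc(c1, c2):
    """Normalized cross-correlation of the zero-mean coefficient vectors."""
    m1, m2 = sum(c1) / len(c1), sum(c2) / len(c2)
    z1 = [a - m1 for a in c1]
    z2 = [b - m2 for b in c2]
    num = sum(a * b for a, b in zip(z1, z2))
    den = math.sqrt(sum(a * a for a in z1) * sum(b * b for b in z2))
    return 0.0 if den == 0.0 else num / den

def best_phase_shift(current, target):
    """Exhaustively search the circular shift tau that maximizes NCC^2."""
    n = len(current)
    scores = [ncc(current[tau:] + current[:tau], target) ** 2 for tau in range(n)]
    tau = max(range(n), key=scores.__getitem__)
    return tau, scores[tau]

# Hypothetical target coefficients and a current contour whose start point
# is offset by three coefficient positions.
target = [0.0, 1.0, 3.0, 6.0, 3.0, 1.0, 0.0, 0.0]
current = target[-3:] + target[:-3]

tau, score = best_phase_shift(current, target)
print(tau, score)
```

Warm-starting from the previous time step, as described above, would correspond here to searching only a few shifts around the last optimum rather than the full range.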

5. Physical model

After representing the deformable object and computing the desired displacements, a physical model is introduced to determine the appropriate actions.

5.1. Boundary element method

The deformation of an elastic object is governed by Navier’s equation:

μ \nabla^{2} u + (λ + μ) \nabla (\nabla \cdot u) + f = 0,

(6)

where u is the displacement vector field, μ and λ are the Lamé parameters (shear modulus and first parameter, respectively), and f is the body force per unit volume. In our control framework, we solve for the incremental displacement u relative to the current equilibrium state, which already accounts for gravity-induced pre-stress. Since gravity is a constant field, the incremental change in body force is negligible during each control step. Therefore, we assume f = 0 in the differential formulation, allowing the problem to be solved purely via boundary integrals without volumetric discretization.

BEM solves partial differential equations by reformulating domain problems into boundary integral equations. This approach reduces the problem’s dimensionality by requiring discretization only along the boundary Γ, making it particularly efficient for problems involving infinite or semi-infinite domains. Using BEM, equation (6) can be reformulated into Somigliana’s Identity (Cruse and Suwito, 1993):

c_{i j} (x) u_{j} (x) + \int_{Γ} T_{i j} (x, y) u_{j} (y) d Γ = \int_{Γ} U_{i j} (x, y) t_{j} (y) d Γ,

(7)

where c_ij(x) is the geometric free term, determined by the local geometry at the field point x, u_j(x) denotes the displacement at x, and t_j(y) is the traction at the boundary point y ∈ Γ. Subscripts i, j correspond to Cartesian coordinate directions. The stress kernel T_ij (x, y) represents the stress at x induced by a unit displacement at y on Γ, while the displacement kernel U_ij (x, y) describes the displacement at x due to a unit force at y on Γ. For isotropic materials, the fundamental solutions U_ij (x, y) and T_ij (x, y) for D-dimensional (D = 2, 3) cases are given by

U_{i j} (x, y) = \frac{2^{- D}}{π μ (1 - ν)} [(3 - 4 ν) C_{D} δ_{i j} + \frac{r_{i} r_{j}}{r^{D}}],

\begin{aligned} T_{i j} (x, y) = {} & \frac{- {(2 r)}^{- D} r}{π (1 - ν)} [\frac{\partial r}{\partial n} ((1 - 2 ν) δ_{i j} + D \frac{r_{i} r_{j}}{r^{2}}) \\ & - (1 - 2 ν) (\frac{r_{i} n_{j}}{r^{2}} - \frac{r_{j} n_{i}}{r^{2}})], \end{aligned}

where r_i and r_j are the Cartesian components of r = x − y, with magnitude r = ‖r‖, ν = λ/(2(λ + μ)) is Poisson’s ratio, δ_ij is the Kronecker delta, and n_j represents the unit outward normal vector at the boundary point y ∈ Γ. The coefficient C_D is dimension-dependent, given by

C_{D} = \begin{cases} - \ln (r), & \text{if } D = 2, \\ r^{- 1}, & \text{if } D = 3 . \end{cases}
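As an illustration, the displacement kernel can be evaluated numerically as in the following Python sketch. It transcribes the expression for U_ij exactly as printed above, including the prefactor 2^{-D}, whose normalisation has not been checked independently here. The function name `kelvin_displacement_kernel` is hypothetical.

```python
import numpy as np

def kelvin_displacement_kernel(x, y, mu, nu):
    """Displacement kernel U_ij(x, y) for D = 2 or 3, transcribed from the text."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    D = x.size
    if D not in (2, 3) or y.size != D:
        raise ValueError("x and y must both be 2D or 3D points")
    r_vec = x - y                      # r = x - y
    r = np.linalg.norm(r_vec)          # magnitude r = ||r||
    if r == 0.0:
        raise ValueError("the kernel is singular at x = y")
    C_D = -np.log(r) if D == 2 else 1.0 / r
    prefactor = 2.0 ** (-D) / (np.pi * mu * (1.0 - nu))
    return prefactor * ((3.0 - 4.0 * nu) * C_D * np.eye(D)
                        + np.outer(r_vec, r_vec) / r ** D)
```

Two properties are visible directly from the expression: the kernel is symmetric in i and j, and it scales as 1/μ, which is the property exploited later when the shear modulus cancels out.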

5.2. Collocation points from DWT representation

Equation (7) is solved numerically by selecting collocation points on the boundary Γ, requiring discretization of the boundary topology. Traditional physics-based approaches rely on explicit meshing for deformation analysis. In contrast, our method exploits the inherent spatial localization of DWT coefficients, which encode both geometric and positional information, eliminating the need for manual meshing. Collocation points are adaptively placed using these coefficients across resolution levels, enabling direct computation of the integrals in equation (7) and matrix assembly in equation (8). The hierarchical ordering of wavelet basis functions further allows displacement field analysis via BEM, algorithmically constructing the virtual boundary topology from the DWT.

To balance precision and computational cost, collocation points are placed at the centroid of each boundary element. The multiscale nature of the DWT supports hierarchical control: coarse representations resolve large shape discrepancies (e.g., $H (W, W_{d}) ≫ 0$ ), while finer scales refine the solution as the discrepancy diminishes. The active decomposition level Q is determined by

Q ≔ ⌊ H (W, W_{d}) / k ⌋ \leq \log_{2} N,

where k is a scaling parameter and N is the signal length.
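A minimal Python sketch of this level selection is given below. Reading the inequality as a clamp at log₂ N is an assumption made here, as are the function name `active_level` and the use of the integer part of log₂ N when N is not a power of two.

```python
import math

def active_level(H, k, N):
    """Active decomposition level: floor(H / k), capped at floor(log2(N))."""
    if k <= 0:
        raise ValueError("the scaling parameter k must be positive")
    if N < 1:
        raise ValueError("the signal length N must be at least 1")
    max_level = int(N).bit_length() - 1   # floor(log2(N)) for integer N >= 1
    return min(int(math.floor(H / k)), max_level)
```

For example, with k = 1 and N = 256, a discrepancy of 3.7 gives level 3, while any discrepancy of 8 or more saturates at level 8.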

Boundary nodes are reconstructed from wavelet coefficients. For Haar wavelets, the scaling function $ϕ (t) = \begin{cases} 1, & 0 \leq t < 1, \\ 0, & otherwise, \end{cases}$ and the wavelet function $ψ (t) = \begin{cases} 1, & 0 \leq t < 1 / 2, \\ - 1, & 1 / 2 \leq t < 1, \\ 0, & otherwise, \end{cases}$ are piecewise constant, which simplifies node selection by collapsing duplicated values in Φ and Ψ_j,a.

For 2D curves, nodes at level Q = J are derived from approximation coefficients as $N_{J} = 2^{- J / 2} [\begin{matrix} c_{g} & c_{h} \end{matrix}] \in R^{P_{J} \times 2}$ . At finer levels (Q < J), nodes incorporate detail coefficients. When Q = J − 1, the nodes, denoted as $N_{J - 1} \in R^{2 P_{J} \times 2}$ , are

N_{J - 1} = \frac{1}{2^{J / 2}} [\begin{matrix} vec ([\begin{matrix} c_{g}^{⊺} - d_{0, g}^{⊺} \\ c_{g}^{⊺} + d_{0, g}^{⊺} \end{matrix}]) & vec ([\begin{matrix} c_{h}^{⊺} - d_{0, h}^{⊺} \\ c_{h}^{⊺} + d_{0, h}^{⊺} \end{matrix}]) \end{matrix}]

where vec (•) denotes the vectorization operation. More generally, the nodes $N_{Q} \in R^{2^{J - Q} P_{J} \times 2}$ are

N_{Q} = {[\begin{matrix} {vec ([\begin{matrix} n_{Q + 1, g}^{⊺} - 2^{- (Q + 1) / 2} d_{Q + 1, g}^{⊺} \\ n_{Q + 1, g}^{⊺} + 2^{- (Q + 1) / 2} d_{Q + 1, g}^{⊺} \end{matrix}])}^{⊺} \\ {vec ([\begin{matrix} n_{Q + 1, h}^{⊺} - 2^{- (Q + 1) / 2} d_{Q + 1, h}^{⊺} \\ n_{Q + 1, h}^{⊺} + 2^{- (Q + 1) / 2} d_{Q + 1, h}^{⊺} \end{matrix}])}^{⊺} \end{matrix}]}^{⊺},

where n_Q+1,g and n_Q+1,h are the two columns of $N_{Q + 1} = [\begin{matrix} n_{Q + 1, g} & n_{Q + 1, h} \end{matrix}]$ for Q ≤ J − 1. The nodes in N_Q are ordered along the curve, so two consecutive nodes form a segment. The collocation points are the midpoints of these segments, as shown in Figure 3.

Figure 3.

BEM element from DWT. The black segments are the mesh and the black dots are the nodes while the grey dots are the invisible nodes. The black crosses are gripper points. (a) Contour is in purple and the target is in dashed purple. (b) Curve is in purple and the target is in dashed purple. (c) The blue dots and mesh denote the wavelet representation in higher level.
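The hierarchical node reconstruction for one coordinate of a curve can be sketched in Python as follows. This is a minimal illustration under stated assumptions: an orthonormal Haar transform with the detail sign convention d = (odd − even)/√2 is assumed so that the recursion matches the "minus then plus" ordering of the vectorized update above, and all function names (`haar_decompose`, `refine_nodes`, `nodes_at_level`, `collocation_points`) are hypothetical.

```python
import numpy as np

def haar_decompose(f, levels):
    """Orthonormal Haar DWT of a 1D signal whose length is divisible by 2**levels.

    Returns (c, details), where details[0] holds the coarsest-level details.
    """
    a = np.asarray(f, dtype=float)
    details = []
    for _ in range(levels):
        even, odd = a[0::2], a[1::2]
        details.append((odd - even) / np.sqrt(2.0))  # assumed sign convention
        a = (even + odd) / np.sqrt(2.0)
    return a, details[::-1]

def refine_nodes(n_coarse, d, level):
    """Refine nodes from `level` to `level - 1` using the details d at `level`."""
    s = 2.0 ** (-level / 2.0) * np.asarray(d, dtype=float)
    fine = np.empty(2 * n_coarse.size)
    fine[0::2] = n_coarse - s   # interleaving reproduces the vec(.) ordering
    fine[1::2] = n_coarse + s
    return fine

def nodes_at_level(c, details, J, Q):
    """Node coordinates at level Q (0 <= Q <= J) for one coordinate channel."""
    n = 2.0 ** (-J / 2.0) * np.asarray(c, dtype=float)
    for level in range(J, Q, -1):
        n = refine_nodes(n, details[J - level], level)
    return n

def collocation_points(nodes):
    """Midpoints of consecutive nodes; `nodes` has one row per node."""
    nodes = np.asarray(nodes, dtype=float)
    return 0.5 * (nodes[:-1] + nodes[1:])
```

For a 2D curve, `nodes_at_level` would be applied to the g and h channels separately and the results stacked column-wise before calling `collocation_points`. Under the assumed convention, the nodes at level Q are the block averages of the original samples, and Q = 0 recovers the samples themselves.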

For 3D curves, nodes are reconstructed similarly to the 2D case, with the wavelet representation extended to 3D as $N_{J} = 2^{- J / 2} [\begin{matrix} c_{g} & c_{h} & c_{z} \end{matrix}] \in R^{P_{J} \times 3}$ . Finer resolution levels (Q < J) follow the same vectorized update scheme as in 2D, incorporating detail coefficients d_j,g, d_j,h, and d_j,z.

For 3D surfaces, nodes are parametrized over the 2D spatial domain as $n_{g, h} = (n_{g}, n_{h}) \in \{k L_{g} / N_{g} ∣ k = 0,1, \dots, N_{g}\} \times \{l L_{h} / N_{h} ∣ l = 0,1, \dots, N_{h}\}$ , where N_g and N_h define the grid resolution. The 3D coordinates of each node are $n = {[\begin{matrix} n_{g} & n_{h} & f_{Q} (n_{g, h}) \end{matrix}]}^{⊺}$ , where f_Q (n_g,h) maps the 2D domain point to its depth value. At the coarsest level (Q = J), this mapping simplifies to $f_{J} (n_{g, h}) = 2^{- J / 2} c_{i}$ , where c_i is the i-th element of the approximation coefficient vector c, corresponding to the spatial region containing n_g,h.

The topology of Γ is defined by connecting nodes into segments or quadrilaterals (Figure 3), with collocation points at the element centroids. This wavelet-driven approach ensures adaptive resolution while preserving geometric fidelity. Under occlusion, the coefficients with indices V_j are masked, which removes the collocation points with these indices. Consequently, the connections associated with the masked indices are also excluded, maintaining a consistent and accurate representation of the visible shape. After meshing, the topology of Γ contains E elements.

5.3. Matrix form

The shape function reconstructed using the IDWT in equation (3) is expressed as $u_{j} (y) = \sum_{k = 1}^{N} ϕ_{j}^{k} (y) u_{j}^{k}$ . Similarly, the traction field is represented as $t_{j} (y) = \sum_{k = 1}^{N} ϕ_{j}^{k} (y) t_{j}^{k}$ , using the same basis functions. Accordingly, the integral in equation (7) becomes:

\int_{Γ} U_{i j} (x, y) t_{j} (y) d Γ = \sum_{k = 1}^{N} t_{j}^{k} \int_{Γ} U_{i j} (x, y) ϕ_{j}^{k} (y) d Γ .

This integral is computed using Gaussian quadrature with Q quadrature points per element. Each quadrature point is given by $y_{q} = \sum_{k} ϕ_{k} (ξ_{q}) y_{k}$ , where the sum runs over the nodes of the element, q ∈ {1, 2, …, Q}, $ξ_{q} \in R^{D}$ is the local coordinate of the quadrature point, and y_k are the corresponding global coordinates of the nodes. Therefore, the discretized form of the integral becomes:

\begin{aligned} \int_{Γ} U_{i j} (x, y) ϕ_{j}^{k} (y) d Γ & \approx \sum_{e = 1}^{E} \sum_{q = 1}^{Q} U_{i j}^{⊺} (x, y_{q}) ϕ_{j}^{k} (y_{q}) | J_{e} | ω_{q} \\ & = \sum_{e = 1}^{E} | J_{e} | U_{i j} (x) W ϕ_{j}^{k}, \end{aligned}

where |J_e| is the Jacobian determinant of the e-th element for length or area scaling,

U_{i j} = {[\begin{matrix} U_{i j} (x, y_{1}) & U_{i j} (x, y_{2}) & \dots & U_{i j} (x, y_{Q}) \end{matrix}]}^{⊺} \in R^{Q}

collects the kernel values at the quadrature points,

W = diag ([\begin{matrix} ω_{1} & ω_{2} & \dots & ω_{Q} \end{matrix}]) \in R^{Q \times Q}

is the weight matrix, and

ϕ_{j}^{k} = [\begin{matrix} ϕ_{j}^{k} (y_{1}) & ϕ_{j}^{k} (y_{2}) & \dots & ϕ_{j}^{k} (y_{Q}) \end{matrix}] \in R^{Q}

collects the evaluations of the shape function at the quadrature points y_q (q = 1, 2, …, Q). Three-point quadrature with $W = diag ([\begin{matrix} 1 / 6 & 1 / 6 & 1 / 6 \end{matrix}])$ is used in 3D, while 2-point quadrature with $W = diag ([\begin{matrix} 1 / \sqrt{3} & 1 / \sqrt{3} \end{matrix}])$ is used in 2D.
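The element-wise quadrature can be sketched for a straight 2D boundary segment as follows. This illustration uses the textbook Gauss–Legendre rule on the reference interval [−1, 1] (abscissae ±1/√3 with unit weights for two points), linear shape functions, and |J_e| equal to half the segment length. These normalisation choices are assumptions and may differ from the weight matrices quoted above. The function name `integrate_on_segment` is hypothetical.

```python
import numpy as np

def integrate_on_segment(f, y_a, y_b, n_points=2):
    """Approximate the line integral of f over the segment from y_a to y_b."""
    xi, w = np.polynomial.legendre.leggauss(n_points)   # nodes and weights on [-1, 1]
    y_a = np.asarray(y_a, dtype=float)
    y_b = np.asarray(y_b, dtype=float)
    jac = 0.5 * np.linalg.norm(y_b - y_a)               # |J_e| for a straight segment
    # Linear shape functions map the local coordinate xi to global points y_q.
    y_q = 0.5 * (1.0 - xi)[:, None] * y_a + 0.5 * (1.0 + xi)[:, None] * y_b
    vals = np.array([f(y) for y in y_q])                # scalar or tensor per point
    return jac * np.tensordot(w, vals, axes=1)
```

Because `f` may return a scalar or an array, the same routine could accumulate a full D × D kernel block per element, for instance by passing a function that multiplies a kernel evaluation by a shape-function value.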

Similarly, the other integral in equation (7) is

\int_{Γ} T_{i j} (x, y) u_{j} (y) d Γ = \sum_{k = 1}^{N} u_{j}^{k} (\sum_{e = 1}^{E} | J_{e} | T_{i j} (x) W ϕ_{j}^{k})

where

T_{i j} = {[\begin{matrix} T_{i j} (x, y_{1}) & T_{i j} (x, y_{2}) & \dots & T_{i j} (x, y_{Q}) \end{matrix}]}^{⊺} \in R^{Q}

Therefore, for all nodes on the boundary, equation (7) can be expressed in matrix form as

H u = G t .

(8)

The matrix $G \in R^{D N \times D N}$ is composed of D² submatrices $G_{i j} \in R^{N \times N}$ , where the (l, k)-th entry of G_ij is given by

G_{i j}^{l k} = \sum_{e = 1}^{E} | J_{e} | U_{i j} (n_{l}) W ϕ_{k} .

Similarly, the matrix $H \in R^{D N \times D N}$ consists of submatrices $H_{i j} \in R^{N \times N}$ , where the (l, k)-th entry of H_ij is

H_{i j}^{l k} = \sum_{e = 1}^{E} | J_{e} | T_{i j} (n_{l}) W ϕ_{k} + c_{i j} (n_{l, j})

Both matrices depend solely on the geometry of the object and can be precomputed. The traction vector is assembled as $t = {[\begin{matrix} t_{1}^{⊺} & t_{2}^{⊺} & \dots & t_{N}^{⊺} \end{matrix}]}^{⊺} \in R^{D N}$ , and the displacement vector as $u = {[\begin{matrix} u_{1}^{⊺} & u_{2}^{⊺} & \dots & u_{N}^{⊺} \end{matrix}]}^{⊺} \in R^{D N}$ .

5.4. Solving interior points’ displacement with regularisation

Under Dirichlet BCs, the traction field solved from equation (8) is

t = G^{- 1} H u .

(9)

The G can be shown to be nonsingular, ensuring the solvability of the traction field.

Proposition 1

The matrix G in equation (8) is full-rank.

Proof

For the diagonal entries (i = j), the integral involves a neighbourhood of node i. The fundamental solution U (x_i, x_j) is singular at x_i = x_j, leading to a dominant contribution to the diagonal entry G_ii. This singularity ensures that |G_ii| ≫ 0.

For off-diagonal entries (i ≠ j), the kernel U (x_i, x_j) decays rapidly with increasing distance between nodes x_i and x_j, resulting in |G_ij|≪|G_ii|. Under sufficiently refined discretization, the diagonal dominance condition holds uniformly across all rows:

|G_{i i}| > \sum_{j \neq i} |G_{i j}| \forall i \in {1, 2, \dots, D N} .

By the Lévy–Desplanques theorem (strict diagonal dominance), G is nonsingular. Since G is square and nonsingular, it follows that G is full-rank.□

In practice, the condition of the BEM is not strictly satisfied. Thus, a regularization term is added to ensure the feasibility of this method in our cases by penalizing large gradients or discontinuities in the traction field, as

E_{p} = α \int_{Γ} {‖\nabla t‖}_{F}^{2} d Γ .

(10)

where the Frobenius norm of the traction gradient is given by ${‖\nabla t‖}_{F}^{2} = \sum_{i, j} {(\partial t_{i} / \partial x_{j})}^{2}$ . Within each element, the traction gradient is approximated as

\nabla t = \sum_{j = 1}^{n} \nabla N_{j} t_{j},

where $\nabla N_{j} = [\begin{matrix} \partial N_{j} / \partial x & \partial N_{j} / \partial y & \partial N_{j} / \partial z \end{matrix}]$ denotes the spatial gradient of the shape function N_j. The partial derivatives are computed via the Jacobian matrix of the mapping from local coordinates to global coordinates (x, y, z). Substituting into equation (10), the regularisation energy becomes

\begin{aligned} E_{p} & \approx α \sum_{j = 1}^{n} \sum_{k = 1}^{n} t_{j}^{⊺} (\sum_{e = 1}^{E} \sum_{q = 1}^{Q} {(\nabla N_{j})}^{⊺} (\nabla N_{k}) d Γ) t_{k} \\ & = α t^{⊺} D t, \end{aligned}

where D is the regularisation matrix that encodes the action of the gradient operator. Therefore, the optimisation of the traction field with regularisation becomes

\underset{t}{\arg \min} {‖G t - H u‖}_{F}^{2} + α t^{⊺} D t,

and the closed-form solution is

t = {(G^{⊺} G + α D)}^{- 1} G^{⊺} H u .

When the boundary is a complete contour, regularisation under static equilibrium can be applied, as detailed in Appendix A.
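The closed-form regularised solution can be sketched numerically as follows. The matrices here are placeholders rather than assembled BEM operators, and the first-difference penalty built by `first_difference_penalty` is only a hypothetical stand-in for the gradient-based matrix D. Both function names are assumptions made for illustration.

```python
import numpy as np

def first_difference_penalty(n):
    """Illustrative smoothness matrix D = L^T L, with L the first-difference operator."""
    L = np.diff(np.eye(n), axis=0)    # rows are e_{i+1} - e_i
    return L.T @ L

def solve_traction(G, H, u, D_reg, alpha):
    """Minimise ||G t - H u||^2 + alpha * t^T D t via the normal equations."""
    A = G.T @ G + alpha * D_reg
    # Solving the linear system avoids forming the explicit inverse.
    return np.linalg.solve(A, G.T @ (H @ u))
```

With α = 0 and a nonsingular G, the result coincides with the unregularised solution t = G⁻¹Hu of equation (9); increasing α trades data fidelity for a smoother traction field.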

For interior points not located on the boundary, the coefficient c_ij(x) ≡ 1. Accordingly, for such points, equation (7) can be written in a matrix form similar to equation (8):

\tilde{u} = \tilde{G} t - \tilde{H} u,

(11)

where $\tilde{u} = {[\begin{matrix} {\tilde{u}}_{1}^{⊺} & {\tilde{u}}_{2}^{⊺} & \dots & {\tilde{u}}_{G}^{⊺} \end{matrix}]}^{⊺} \in R^{D G}$ is the displacement vector at the interior points. The matrix $\tilde{H} \in R^{D G \times D N}$ consists of D² submatrices ${\tilde{H}}_{i j} \in R^{G \times N}$ , where the (l, k)-th element of ${\tilde{H}}_{i j}$ is given by

{\tilde{H}}_{i j}^{l k} = \sum_{e = 1}^{E} | J_{e} | T_{i j} (g_{l}) W ϕ_{k} .

Similarly, $\tilde{G} \in R^{D G \times D N}$ contains D² submatrices ${\tilde{G}}_{i j} \in R^{G \times N}$ , with the (l, k)-th element of ${\tilde{G}}_{i j}$ defined as

{\tilde{G}}_{i j}^{l k} = \sum_{e = 1}^{E} | J_{e} | U_{i j} (g_{l}) W ϕ_{k} .

By substituting the regularised traction solution, which reduces to equation (9) when α = 0 and G is nonsingular, into equation (11), the displacement at interior points becomes

\begin{aligned} \tilde{u} & = \underbrace{(\tilde{G} {(G^{⊺} G + α D)}^{- 1} G^{⊺} H - \tilde{H})}_{K (x, B ∣ μ, ν)} u \\ & = K (x, B ∣ μ, ν) u, \end{aligned}

(12)

where $x \in R^{D}$ denotes the positions of the interior points and $B \in R^{N \times D}$ denotes the positions of the boundary nodes. Here, $K : R^{3} \to R^{D G \times D N}$ is a mapping that encodes the material properties (via μ and ν) and the boundary geometry.

Equation (12) reveals the relationship between the observed boundary displacement and the displacement at interior locations. The positions of the gripper tips in the camera frame, either as 2D pixel coordinates or 3D world coordinates, are denoted by vectors $p_{i} \in R^{D}$ . These positions, obtained via kinematic transformations and hand-eye calibration, are treated as interior points for the BEM computation. Accordingly, the desired movement of the gripper at a specific coordinate is given by

δ p_{i} = K (p_{i}, B ∣ μ, ν) δ B,

(13)

where δB ≔ u is the boundary displacement field.

5.5. Mechanical parameters estimation

The Poisson’s ratio ν and shear modulus μ in equation (12) are unknown a priori and must be inferred from observed boundary deformations during robotic manipulation. They are critical for predicting material response and optimizing subsequent control actions. The estimation is formulated as a least-squares minimization of the discrepancy between predicted and observed gripper displacements across multiple observations:

\underset{ν, μ}{\arg \min} \sum_{t = 0}^{T} \sum_{i}^{K} {‖δ g_{i, t} - K (g_{i, t} ∣ B, μ, ν) u_{t}‖}_{2}^{2}

(14)

where g_i,t and δg_i,t are the i-th gripper’s position and displacement at time t. The shear modulus μ scales the compliance matrix G (or $\tilde{G}$ ) and appears as a factor 1/μ in the governing equations. Under Dirichlet BCs in the BEM framework, μ cancels out algebraically during displacement computation, making its explicit estimation unnecessary and isolating ν as the sole unknown. Even with regularization via equation (10), μ scales linearly with the parameter α, allowing arbitrary selection without loss of generality. This property is advantageous for image-based applications where pixel coordinates lack physical units. While ν can theoretically be estimated from a single observation, practical implementations leverage multiple observations over time to improve robustness against noise and model inaccuracies, although frequent updates are unnecessary due to the quasi-static nature of the deformations. The optimization is solved via gradient descent, initialized with ν = 0.25, a common default for isotropic elastic solids; near-incompressible materials such as rubber or soft tissues have values closer to 0.5.
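The cancellation of μ in the unregularised Dirichlet case can be checked numerically with the following sketch. Random placeholder matrices stand in for the assembled BEM operators. The only structural assumptions are those stated in the text: G and G̃ scale as 1/μ, H and H̃ do not depend on μ, and the interior map follows from equations (9) and (11). The function names are hypothetical.

```python
import numpy as np

def interior_map(G, H, G_tilde, H_tilde):
    """K = G~ G^{-1} H - H~, i.e. equation (11) with t = G^{-1} H u."""
    return G_tilde @ np.linalg.solve(G, H) - H_tilde

def interior_map_for_mu(mu, seed=0, n=6, m=2):
    """Build placeholder operators with the stated 1/mu scaling and return K."""
    rng = np.random.default_rng(seed)
    G_unit = rng.normal(size=(n, n)) + 5.0 * np.eye(n)   # compliance at mu = 1
    H = rng.normal(size=(n, n))                          # independent of mu
    G_tilde_unit = rng.normal(size=(m, n))               # interior compliance at mu = 1
    H_tilde = rng.normal(size=(m, n))                    # independent of mu
    return interior_map(G_unit / mu, H, G_tilde_unit / mu, H_tilde)
```

Evaluating `interior_map_for_mu` at two very different shear moduli returns the same matrix up to round-off, since the factor 1/μ in G̃ is cancelled by the factor μ in G⁻¹.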

6. Shape Servo

6.1. Iterative control

The motion of interior points in a deformable object can be related to gripper kinematics through the displacement formulation in equation (12). Given that the boundary B and its deformation field δB in equation (13) are reconstructed from the wavelet shape $W$ , the velocity of the i-th gripper ${\dot{g}}_{i}$ is governed by:

{\dot{g}}_{i} = K (g_{i}, b (W)) b (\dot{W}),

(15)

where K maps the wavelet coefficient rates $b (\dot{W})$ to gripper motion, and b (•) is the conversion from the wavelet shape to a vector containing the coordinates of all the nodes.

Robotic systems, such as the dVRK’s 7–DoF surgical forceps, enable dexterous manipulation beyond the 2D workspace constraints of prior methods (Navarro-Alarcon and Liu, 2017). To fully exploit this capability, we extend control to 3D motion by augmenting positional velocity with rotational and scaling components.

For 2D curves or contours, rotational velocity $ω_{i} \in R^{3}$ is derived from the optimization in equation (4) as $ω_{i} = {[\begin{matrix} 0 & 0 & \dot{θ} \end{matrix}]}^{⊺},$ where $θ = \cos^{- 1} (tr (R^{⋆}) / 2)$ . Axial motion ${\dot{z}}_{i}$ is determined by the scaling factor s, defined as the arc-length ratio between corresponding segments on the current and target shapes. The gripper velocity becomes ${\dot{g}}_{i} = z_{i} [\begin{matrix} p_{i} / f \\ \dot{s} - 1 \end{matrix}],$ with z_i denoting depth in the camera frame. The resultant 6-DoF controller velocity is ${\dot{R}}_{i} = [\begin{matrix} {\dot{g}}_{i} \\ ω_{i} \end{matrix}] .$

For 3D features, positional velocity ${\dot{g}}_{i}$ is directly computed in 3D space, while rotational velocity is obtained via equation (5): $ω_{i} = {({\dot{R}}^{⋆} (W) {(R^{⋆} (W))}^{⊺})}^{\lor},$ where (•)^∨ extracts the vector from a skew-symmetric matrix. The controller velocity ${\dot{R}}_{i}$ retains the same 6-DoF structure, enabling seamless integration with high-dimensional robotic systems.
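The rotational quantities used above can be sketched as follows. The finite-difference estimate of the rotation rate and the projection onto the skew-symmetric part are implementation assumptions made for illustration, and all function names are hypothetical. Note that the arccos expression returns an unsigned angle.

```python
import numpy as np

def vee(S):
    """Extract the vector w from a 3x3 skew-symmetric matrix [w]_x."""
    return np.array([S[2, 1], S[0, 2], S[1, 0]])

def rot_z(theta):
    """Rotation about the z-axis by angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def angular_velocity(R_prev, R_curr, dt):
    """Approximate (R_dot R^T)^vee using a backward finite difference."""
    R_dot = (R_curr - R_prev) / dt
    S = R_dot @ R_curr.T
    S = 0.5 * (S - S.T)          # keep only the skew-symmetric part
    return vee(S)

def planar_angle(R2):
    """Unsigned rotation angle of a 2x2 rotation matrix: acos(tr(R) / 2)."""
    return float(np.arccos(np.clip(np.trace(R2) / 2.0, -1.0, 1.0)))
```

For a rotation about the z-axis at a constant rate, `angular_velocity` returns a vector whose only non-zero component is that rate, up to the finite-difference error.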

The control strategy relies on estimating interior displacements from observed boundary deformations. However, directly actuating interior points to match a target displacement does not reach the desired boundary changes due to the nonlinear and heterogeneous properties of deformable objects. To overcome this, a closed-loop iterative controller is employed, as illustrated in Figure 4. Each iteration begins with boundary observation and the computation of the desired shape displacement toward a target configuration. The corresponding gripper motion is then calculated via equation (13) and executed accordingly. The resulting deformation is measured to update the shape discrepancy, thereby forming a feedback loop that iteratively reduces the deviation between the current and desired states. The discrepancy is quantified using the metric $H (W, W_{d})$ , which represents the mean shape error (MSE). This closed-loop approach enables adaptive compensation for unmodelled dynamics and environmental disturbances, which is essential for robust manipulation in real-world settings.

Figure 4.

Schematic representation of the proposed closed-loop controller. The DWT blocks decompose the current shape $S$ and target shape $S_{d}$ into wavelet coefficients $W$ and $W_{d}$ . MSE is the shape discrepancy. If the target is not reached, the deformation field $U_{d}$ and the boundary topology (Nodes N_Q) are fed into the BEM solver. The BEM module, coupled with the current robot state $R_{i}$ , computes the required gripper velocities ${\dot{R}}_{i}$ to drive the object $O^{k}$ toward $O^{k + 1}$ .

6.2. Stability analysis

As established in our previous work (Hu et al., 2024), the mapping from the robot’s finite DoFs to the deformable object’s infinite DoFs creates a non-trivial null space. While the controller explicitly minimizes error in the controllable subspace, ensuring the stability of the null space dynamics is critical for surgical safety to prevent unmodelled tissue trauma. Here, we extend the stability analysis by incorporating the physical properties of the BEM formulation.

Consider the differential relationship in equation (15):

δ g = K (g, b) δ b .

We decompose the shape variation vector δb into a controllable component δb_c and an uncontrollable (null space) component δb_u:

δ b = δ b_{c} + δ b_{u},

where δb_c = K^†δg lies in the range space of K^†, and δb_u lies in the null space of K, defined using the orthogonal projection matrix:

δ b_{u} = (I - K^{†} K) δ b .
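This decomposition can be sketched with the Moore–Penrose pseudoinverse, as below. A random matrix stands in for K, and the function name `decompose_shape_variation` is hypothetical. The controllable part is computed as K†K δb, which equals K†δg when δg = K δb.

```python
import numpy as np

def decompose_shape_variation(K, delta_b):
    """Split delta_b into controllable and null-space components."""
    K_pinv = np.linalg.pinv(K)
    delta_b_c = K_pinv @ (K @ delta_b)     # lies in the range space of K^dagger
    delta_b_u = delta_b - delta_b_c        # equals (I - K^dagger K) delta_b
    return delta_b_c, delta_b_u
```

Since K†K is an orthogonal projector, the two components are mutually orthogonal, and the null-space component produces no gripper motion (K δb_u = 0).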

Assumption 5

Local Linear Dynamics. In the neighbourhood of the equilibrium point (g*, b*), the shape dynamics can be linearized as

δ \dot{b} = J_{1} δ b + J_{2} δ g

where $\dot{b} ≔ b (\dot{W})$ , and $J_{1} \in R^{D N \times D N}$ and $J_{2} \in R^{D N \times D G}$ are the state and input Jacobian matrices, respectively.

Assumption 6

Controllable Subspace. The matrix $K (g, b)$ maintains constant rank DG in the operational workspace, ensuring the existence of a well-defined controllable subspace.

Property 1

Null Space Dissipation. The deformable object is modelled as a linear elastic material governed by the BEM formulation. Consequently, deformation dynamics in the null space (uncontrollable subspace δb_u) are driven by the minimization of the internal Strain Energy potential $E (δ b_{u})$ . In the quasi-static regime, this implies that the null space motion follows the negative gradient of the potential energy: $δ {\dot{b}}_{u} \propto - \nabla E$ .

Proposition 2

Asymptotic Stability of Controllable Error. Under Assumptions 5, 6 and Property 1, the control law

\dot{g} = - K^{⊺} Q δ b_{c}

where $Q \in R^{N G \times N G}$ is a positive definite gain matrix and δb_c = K^†δg is the controllable shape error, guarantees asymptotic convergence of δb_c to zero.

Proof

Consider a composite Lyapunov function candidate V_total representing the total energy of the closed-loop system, comprising the task-space error energy and the physical strain energy:

V_{total} = V_{task} + V_{phys} = \frac{1}{2} δ b_{c}^{⊺} Q δ b_{c} + E (δ b_{u}) .

Taking the time derivative of V_total:

{\dot{V}}_{total} = δ b_{c}^{⊺} Q δ {\dot{b}}_{c} + \nabla E^{⊺} δ {\dot{b}}_{u} .

For the first term (controllable subspace), substituting the control law $\dot{g}$ and the kinematic relationship $δ {\dot{b}}_{c} = K^{†} \dot{g}$ :

δ b_{c}^{⊺} Q δ {\dot{b}}_{c} = - δ b_{c}^{⊺} (Q K^{†} K^{⊺} Q) δ b_{c} .

Since K^†K^⊺ is positive definite by construction and $Q ≻ 0$ , this term is strictly negative definite for δb_c ≠ 0.

For the second term (null space), according to Property 1, the quasi-static material relaxation ensures that the unactuated deformation evolves to minimize potential energy:

\nabla E^{⊺} δ {\dot{b}}_{u} = - ‖ \nabla E ‖^{2} \leq 0 .

Therefore, the total derivative ${\dot{V}}_{total} \leq 0$ . This implies that the total energy of the system is non-increasing, confirming that the interaction is passive. The controllable error δb_c converges asymptotically to zero due to active control, while the uncontrollable error δb_u remains bounded by the initial strain energy and dissipates to an equilibrium state.□

Remark 3

The control law effectively exploits the kinematic structure of the system, with the gain matrix Q providing flexibility to prioritize specific shape features.

Remark 4

Although the null space is not actively controlled, the passivity proof $({\dot{V}}_{total} \leq 0)$ ensures surgical safety. The strict dissipation of strain energy guarantees that unmodelled deformations do not diverge or exhibit high-frequency oscillations that could cause tissue trauma, provided the manipulation remains within the quasi-static regime.

7. Simulation validation

Simulation experiments were conducted using the SOFA framework (Faure et al., 2012), where deformable object dynamics were modelled with a FEM-based physics engine. A liver model was rendered as 2D images via a virtual camera at 1280 × 720 resolution. From the 2D mesh projections, contours were extracted and feature curves, defined as connected sequences of vertices and edges, were projected onto the image plane. In 3D cases, shape features were represented as point clouds sampled from visible surface or curve regions captured by the virtual camera. Occlusion was simulated by removing selected shape features within predefined regions. Three robotic grippers with known positions and full 6-DoF actuation were used to realistically manipulate the virtual mesh and reproduce interaction dynamics. Representative tasks under various configurations are illustrated in Figures 5 and 6. Additional results are available in Supplemental Video 1.

Figure 5.

Simulations using 3D surfaces as shape feedback.

Figure 6.

Simulations using contours and curves as shape feedback.

7.1. Evaluation of the physical parameters estimation

The estimation accuracy of Poisson’s ratio ν was evaluated across various deformation scenarios at a frequency of 1 Hz during manipulation. As shown in Figure 7, parameter estimation was performed under multiple configurations with different ν values. The ground-truth values, directly obtained from the simulator’s physical model, are indicated in the figure legends.

Figure 7.

Estimation of the Poisson’s ratio in the simulations.

It was observed that the Poisson’s ratio ν could be approximated reasonably well when 2D contours were used as shape feedback. This is attributed to the completeness of the boundary feedback, which closely satisfies the BCs assumed by the BEM. Toward the end of the manipulation, estimation accuracy declined slightly due to reduced gripper motion. This behaviour is expected, as the estimation process depends on observable displacement. Nonetheless, the resulting error is of minor significance, given the small magnitude of the displacement field during this phase. In contrast, when shape feedback is sparse or the boundary is only partially represented, such as when the 3D curve comprises less than 30 % of the total boundary, the assumptions underlying the BEM are less well satisfied, leading to decreased estimation accuracy.

7.2. Comparison with Fourier-based shape representation

To demonstrate the advantages of wavelet decomposition, the proposed DWT-based method was compared against FFT-based strategies. Specifically, 2D contour control was implemented following the method by Navarro-Alarcon and Liu (2017), while 3D surface manipulation was adapted from Makiyeh et al. (2023).

To isolate the influence of shape representation independently from the physical modelling (e.g., BEM), a DJM-based controller was implemented using DWT features (referred to as DWT-Jac; see Appendix B). This enabled direct comparison with FFT representation under identical task conditions.

All simulations were conducted under consistent environmental settings, control gains, and target deformation profiles. Performance was evaluated based on two key metrics: shape accuracy and manipulation time. A total of 10 task groups were tested, with five focusing on contour deformations and the remaining five on surface deformations. Within each group, both FFT- and DWT-based methods were executed under the same conditions to ensure fairness.

Results shown in Figure 8 indicate that when there is a large rotational discrepancy between the initial and target shapes, FFT-based methods require more time to converge or are more likely to fail. This limitation arises because discrepancies in the frequency domain do not directly reflect geometric differences. In contrast, the wavelet-based BEM method produced more accurate gripper displacement estimates, as the physics-based model guided motion using a more spatially localized and physically meaningful representation.

Figure 8.

Comparison of shapes using FFT and DWT. The FFT method for 2D contours is from Navarro-Alarcon and Liu (2017), and that for 3D surfaces is from Makiyeh et al. (2023).

7.3. Comparison with GP-WRM

The primary advantage of the DWT-based method lies in its ability to filter noise effectively, allowing for accurate shape representation even when input data consist of partial or noisy point clouds. To illustrate this benefit, the DWT approach was integrated with the GP-WRM framework by reformulating the wavelet-approximated points as downsampled grid points, as described in Appendix C.

A known limitation of GP-WRM is its assumption that gripper contact points lie on the observable surface. To evaluate robustness under more realistic conditions, scenarios were designed in which grippers interacted with the object from positions outside the visible surface, mimicking occluded surgical instruments. Simulations were repeated ten times, with variations in the number of pre-grasped grippers.

To further assess performance under realistic conditions, varying levels of noise were artificially injected into the surface observations. In theory, the DWT mitigates such degradation by suppressing high-frequency noise during inverse transformation via low-pass filtering. Simulations were categorized into four noise levels and tested across four spatial configurations (IB, OB-, OB+, and MX; see Table 1 for definitions).

Table 1.

Comparison of DWT-BEM with GP-WRM using 3D surface as feedback.

Case	Metric	Low noise			Middle noise			High noise
Case	Metric	GP-WRM	DWT-WRM	DWT-BEM	GP-WRM	DWT-WRM	DWT-BEM	GP-WRM	DWT-WRM	DWT-BEM
IB	MT (s)	7.5 ± 0.1	7.4 ± 0.2	7.4 ± 0.3	7.6 ± 0.4	7.5 ± 0.1	7.4 ± 0.2	7.7 ± 0.3	7.6 ± 0.4	7.6 ± 0.3
	SE (mm)	0.9 ± 0.1	0.9 ± 0.2	0.9 ± 0.2	1.5 ± 0.2	1.4 ± 0.2	1.3 ± 0.3	2.2 ± 0.4	1.9 ± 0.3	1.9 ± 0.2
	SR (%)	100	100	100	100	100	100	100	100	100
OB-	MT (s)	11.2 ± 1.1	11.1 ± 0.9	7.5 ± 0.3	11.5 ± 0.9	11.5 ± 1.3	7.5 ± 0.4	11.1 ± 0.8	7.6 ± 0.4	7.5 ± 0.3
	SE (mm)	1.3 ± 0.2	1.3 ± 0.3	0.9 ± 0.1	1.9 ± 0.4	1.8 ± 0.5	1.4 ± 0.4	2.6 ± 0.6	2.5 ± 0.7	1.8 ± 0.5
	SR (%)	80	80	100	80	80	100	80	80	100
OB+	MT (s)	×	×	8.1 ± 0.4	×	×	8.2 ± 0.4	×	×	8.4 ± 0.5
	SE (mm)	×	×	1.3 ± 0.4	×	×	1.4 ± 0.1	×	×	2.3 ± 0.5
	SR (%)	0	0	100	0	0	100	0	0	100
MX	MT (s)	9.8 ± 0.7	10.3 ± 1.1	7.6 ± 0.4	14.2	×	7.9 ± 0.4	×	×	8.3 ± 0.5
	SE (mm)	1.9 ± 0.4	1.8 ± 0.5	1.2 ± 0.3	2.1	×	1.4 ± 0.2	×	×	2.1 ± 0.3
	SR (%)	40	40	100	20	0	100	0	0	100

MT: manipulation time, SE: shape error, SR: success rate, IB: gripper points inside the boundary, OB-: gripper points outside the boundary at a small distance, OB+: gripper points outside the boundary at a large distance, MX: some gripper points are inside and some are outside the boundary. × denotes unavailable data.

Results summarized in Table 1 demonstrate that GP-WRM performance declines significantly as gripper positions move beyond the observed surface. When the gripper distance exceeds the defined neighbourhood radius r, the upper limit for motion correlation, the moment matrix becomes singular, resulting in task failure. While increasing r can alleviate singularities, this introduces over-smoothed deformation fields and leads to a reduction in computational efficiency by approximately 62 % to 78 %. Under low noise conditions, both methods perform similarly, as wavelet-based positions directly approximate grid points. However, as noise levels increase, the DWT method yields better surface reconstructions than GP-WRM by using low-pass IDWT, effectively filtering high-frequency artifacts.

7.4. Comparison with FEM-based method

To benchmark our proposed framework against physics-based control, we conducted a comparative study with the method recently proposed by Saghour et al. (2025). This baseline employs an online FEM to continuously predict object deformations and update the interaction matrix based on the difference between predicted and observed shapes. Both methods were evaluated in the SOFA simulation environment (v24.06) using the same liver manipulation task on a standard CPU workstation (Intel Core i7 @ 3.6 GHz). For the FEM-based approach, we utilized the standard linear elasticity model with a tetrahedral mesh (μ ≈ 1071 Pa, ν = 0.4), comprising 3286 elements. While this mesh is denser than the one used in the original study, which comprises approximately 1600 elements (Saghour et al., 2025), the setup follows the same modelling principles. While Saghour et al. (2025) originally integrate visual feedback via iterative closest point (ICP) registration to correct the simulation model, in this comparative experiment, we provided the controller with direct access to the simulation’s mesh vertices. This idealized feedback loop was chosen to decouple control performance from perception noise, ensuring a fair evaluation of the physics-based solving strategies. In contrast, our DWT-BEM method operated solely on the boundary surface point cloud (3682 triangular elements) without requiring internal volumetric data.

The results, depicted in Figure 9, illustrate the trade-off between model fidelity and computational efficiency. Quantitative analysis across 10 trials reveals that both methods achieve comparable final accuracy: the FEM-based baseline reached a mean shape error of 2.70 ± 1.02 mm, whereas our DWT-BEM method achieved 2.82 ± 0.79 mm. This similarity confirms that although BEM relies on a boundary-only formulation and does not include the full volumetric continuum mechanics modelled by the tetrahedral FEM, it provides a sufficiently accurate approximation to guide deformation control. However, the temporal profiles differ significantly. Our method reaches the target shape in approximately 8 s, exhibiting a nonlinear error descent. Conversely, the FEM-based controller shows a smooth, linear reduction in shape error, validating its stability, but at a substantial computational cost. The method by Saghour et al. (2025) required an average of approximately 3 s per iteration, with the FEM simulation alone accounting for nearly 2.3 s to reach quasi-equilibrium. This computation time aligns with the performance reported in the original study (≈1.9 s per step), confirming the standard computational burden of volumetric solvers. In contrast, our BEM computation takes only about 0.1 s. This efficiency enables a substantially higher control frequency, leading to significantly faster overall convergence to the target configuration in real-time scenarios.
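
As a worked illustration of the control-frequency gap, the reported per-iteration costs can be converted into update rates. The sketch below only restates the approximate figures quoted above (about 3 s and about 0.1 s per iteration); the helper name `update_rate_hz` is illustrative.

```python
def update_rate_hz(cost_per_iteration_s):
    """Control update rate implied by a per-iteration compute cost."""
    return 1.0 / cost_per_iteration_s

fem_cost_s = 3.0  # approximate per-iteration cost of the online FEM baseline
bem_cost_s = 0.1  # approximate per-iteration cost of the BEM computation

fem_rate = update_rate_hz(fem_cost_s)
bem_rate = update_rate_hz(bem_cost_s)
speedup = fem_cost_s / bem_cost_s

print(f"FEM: {fem_rate:.2f} Hz, BEM: {bem_rate:.1f} Hz, ratio: {speedup:.0f}x")
```

Under these figures the BEM loop updates at roughly 10 Hz against roughly 0.33 Hz for the FEM loop, a factor of about 30 in per-iteration cost.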

Figure 9.

DWT-BEM converges significantly faster than the FEM baseline.

7.5. Effect of the level of DWT

The hierarchical decomposition level J and reconstruction level Q in DWT govern the granularity of shape representation, balancing computational efficiency against deformation precision. We evaluate this trade-off by analysing control accuracy and time costs across varying J and Q.

Higher J increases the resolution of wavelet coefficients, theoretically improving shape fidelity. However, this amplifies the computational load for solving the BEM problem, particularly for inverting the Green’s matrix G, whose dimension increases drastically due to finer spatial discretization.
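
The growth in solve cost can be sketched with a back-of-the-envelope model. It assumes, purely for illustration, that each additional level doubles the number of boundary nodes N along a contour and that a dense direct solve scales as N^3; neither the growth factor nor the function name `relative_solve_cost` comes from the paper.

```python
def relative_solve_cost(extra_levels, nodes_growth_per_level=2, exponent=3):
    """Relative cost of a dense direct solve after `extra_levels` refinements.

    Assumes the node count N is multiplied by `nodes_growth_per_level` at
    each level and that the solve scales as N**exponent (cubic for dense LU).
    """
    return float(nodes_growth_per_level ** (exponent * extra_levels))

for extra in range(4):
    print(f"+{extra} level(s): about {relative_solve_cost(extra):.0f}x the base cost")
```

For a surface, where the node count would plausibly grow by a factor of four per level, the same model gives a 64-fold increase per level (`nodes_growth_per_level=4`).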

Experiments across 2D contours, 3D curves, and 3D surfaces reveal diminishing returns: increasing J beyond Q = 2 yields marginal accuracy gains (Figure 10). This stems from two factors. First, robotic systems possess finite DoFs (n ≤ 12 for dual-arm setups), insufficient to exploit the infinite-dimensional deformation space of soft objects. Second, unpredictable nonlinear deformations during manipulation nullify benefits from higher-frequency coefficients, as cumulative errors dominate after initial adjustments (see equation (13)). Thus, initializing control with approximation coefficients (lowest J) suffices for most tasks, while detail coefficients (Q ≥ 1) provide negligible practical improvement at greater computational expense.

Figure 10.

Results of different levels of wavelet decomposition and reconstruction. Each block shows the MSE trend over time for one manipulation, with the horizontal axis representing time (0 s to 8 s) and the vertical axis representing MSE: 0 px to 120 px for the 2D contour, 0 mm to 50 mm for the 3D curve, and 0 mm to 66 mm for the 3D surface. The colour of each block indicates its final MSE.

In summary, a higher DWT level does not substantially improve shape-control accuracy, because the total number of DoFs of the robotic grippers is far lower than that of the deformable object, which can be regarded as having infinite DoFs (cf. the null space in equation (13)). Moreover, because deformation during manipulation is unpredictable, including higher-level detail coefficients reduces the shape difference after the initial manipulations but does not shorten the total manipulation time. It is therefore economical to use only the approximation coefficients at the beginning, and adding the first level of detail coefficients is sufficient. Supplemental Video 1 showcases all results.

7.6. Comprehensive comparative analysis

To validate the proposed framework, we benchmarked it against a diverse set of SOTA methods: spectral-based approaches (FFT-based and DWT-Jac), model-free controllers including GP-WRM, FTSMC (Qi et al., 2021), Modal Analysis (Yang et al., 2023a), and Modal-Graph (Yang et al., 2023b), data-driven Online GPR (Hu et al., 2018), and physics-based Online FEM. Table 2 summarizes the quantitative performance across four key metrics.

Table 2.

Comparison of the proposed DWT-BEM framework against SOTA methods. Data are derived from controlled simulation benchmarks (Section 7).

Category	Method^a	Need Mesh	Shape Repr. / Feedback	Mean Shape Error (mm) ↓	Comp. Cost / Iter. (s) ↓	Stability ($\log \kappa(J_{\delta S})$) ↓	Occlusion Robustness
Spectral	FFT-based	✗	Fourier Series	3.01 ± 0.42^c	Low $(< 0.1)$	Unstable (3 ∼ 4)^d	Low
Spectral	DWT-Jac	✗	Wavelet Coeff.	1.98 ± 0.96	Low $(< 0.1)$	Unstable (2 ∼ 3)^d	High
Model-free	GP-WRM	✗	Grid Point	2.20 ± 0.74^e	Medium (0.3)	Stable^e	High
Model-free	Standard DJM	✗	Feature Points	3.41 ± 1.45	Low $(< 0.1)$	Unstable $(> 4)$^d	Low
Model-free	FTSMC	✗	Contour Moment	2.71 ± 1.34	Low $(< 0.1)$	Unstable (3 ∼ 4)^d	Low
Model-free	Modal Analysis	✓	Modal Coord.	2.15 ± 0.92	Low $(\approx 0.1)$	Stable	Low
Model-free	Modal-Graph	✗	Modal Graph	1.59 ± 0.49	Low $(\approx 0.15)$	Stable	High
Data-driven	Online GPR	✗	Feature Points	2.42 ± 1.01	Medium $(\approx 0.3)$^f	Unstable (3 ∼ 4)^g	Low
Physics-based	Online FEM	✓	Vol. Mesh	1.70 ± 1.02	High $(\approx 3.0)$	Stable	High
Physics-based	DWT-BEM	✗	Boundary Wavelet	1.36 ± 0.39	Low $(\approx 0.1)$	Stable	High

^a Methods: FFT-based (Navarro-Alarcon and Liu, 2017), GP-WRM (Hu et al., 2024), FTSMC (Qi et al., 2021), Modal Analysis (Yang et al., 2023a), Online GPR (Hu et al., 2018), Online FEM (Saghour et al., 2025).

^cFails under large rotational discrepancies (see Section 7.2).

^d Estimated from condition number analysis in Figure 1(b).

^eData from Table 1 (Case OB+ and MX); note regarding stability is relative to valid grasp regions.

^f Excludes mandatory online exploration phase for initialization.

^g Based on the condition number of the effective Jacobian from Equation (1).

In terms of accuracy and efficiency, the proposed DWT-BEM framework achieves the lowest mean shape error (1.36 ± 0.39 mm), outperforming both the high-fidelity Online FEM (1.70 ± 1.02 mm) and the recent Modal-Graph method (1.59 ± 0.49 mm). Crucially, our method maintains a low computational cost per iteration (≈0.1 s), enabling real-time feedback, whereas the Online FEM is significantly slower (≈3.0 s).

Specific comparisons with contour-based methods reveal the limitations of global descriptors. Both the FFT-based method (3.01 ± 0.42 mm) and FTSMC (2.71 ± 1.34 mm) exhibit higher errors and numerical instability (log κ ≈ 3 ∼ 4). This indicates that global features like Fourier series or contour moments are less capable of capturing local deformations compared to the localized boundary wavelets used in DWT-BEM, particularly when the Jacobian becomes ill-conditioned.

Data-driven approaches such as Online GPR also suffer from high condition numbers (log κ ≈ 3 ∼ 4) because of their sensitivity to local linearization errors. In contrast, DWT-BEM ensures stability through its analytical BEM formulation. Furthermore, unlike Modal Analysis and Online GPR, which degrade under partial observability, our approach maintains high robustness to occlusion. Its performance is comparable to Modal-Graph, while not requiring any offline training or online initialization phases.
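
The stability column of Table 2 reports the logarithm of the Jacobian condition number. A minimal sketch of how such a quantity can be computed is given below; it assumes a base-10 logarithm and the 2-norm condition number (ratio of largest to smallest singular value), which the table does not state explicitly, and the two small matrices are hypothetical examples rather than Jacobians from the experiments.

```python
import numpy as np

def log10_condition_number(jacobian):
    """log10 of the 2-norm condition number of a (possibly rectangular) matrix."""
    singular_values = np.linalg.svd(np.asarray(jacobian, dtype=float),
                                    compute_uv=False)
    return float(np.log10(singular_values[0] / singular_values[-1]))

# Hypothetical 3x2 Jacobians mapping two gripper DoFs to three shape features.
well_conditioned = np.array([[1.0, 0.0],
                             [0.0, 0.5],
                             [0.0, 0.0]])
ill_conditioned = np.array([[1.0, 1.0],
                            [1.0, 1.001],   # nearly parallel columns
                            [0.0, 0.0]])

print(log10_condition_number(well_conditioned))
print(log10_condition_number(ill_conditioned))
```

Nearly parallel Jacobian columns mean that two actuation directions produce almost the same feature change, so inverting the mapping amplifies measurement noise by roughly the condition number.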

8. Experiments

8.1. Experimental setup

The experiments were conducted on the dVRK (Kazanzides et al., 2014; Xu et al., 2025), comprising three 7-DoF patient-side manipulators (PSMs) and one 4-DoF endoscopic camera manipulator (ECM). The ECM was equipped with a stereo laparoscope (Intuitive Surgical, Inc., U.S.) with a resolution of 1280 × 720, calibrated to sub-millimetre accuracy for 3D reconstruction (Figure 11). Notably, the dVRK platform features a hierarchical control architecture where the low-level servo controllers operate at high frequency (>1 kHz) (Xu et al., 2025), allowing for smooth execution of velocity commands derived from higher-level planners.

Figure 11.

Experimental setup for deformable object manipulation.

Soft tissue phantoms (EasySurg, HumanX Medical LLC, U.S.) mimicking liver, colon with appendix, and pancreas were used. These materials replicate the nonlinear hyperelasticity and viscoelastic damping observed in human organs, enabling validation of surgical relevance in RAMIS scenarios. To demonstrate the controller’s ability to handle different materials, ex vivo animal experiments using porcine liver and colon were also conducted in the RAMIS setting.

For vision feedback, the laparoscope streamed RGB-D video at 30 fps. With 2D image feedback, only the left channel image was used. Organs of interest were segmented in real time using Segment Anything Model 2 (SAM2) (Kirillov et al., 2023). Dense 3D point clouds were reconstructed via RAFT-Stereo disparity estimation (Lipson et al., 2021), with depth maps converted to Euclidean coordinates using intrinsic camera parameters. Specifically, to support the adaptive sampling in Section 3.3, we tracked approximately 3 to 8 texture feature pairs between consecutive frames alongside the gripper positions, ensuring robust node alignment even under large deformations. Point cloud updates occurred at 1 Hz to balance computational latency and deformation tracking accuracy.
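
The conversion from a disparity map to Euclidean coordinates can be sketched with the standard pinhole stereo model. The snippet assumes a rectified stereo pair, positive disparities in pixels, and intrinsics expressed in pixels; the function name `disparity_to_points` and all numeric values are hypothetical and do not correspond to the calibrated laparoscope.

```python
import numpy as np

def disparity_to_points(disparity, fx, fy, cx, cy, baseline):
    """Back-project a rectified disparity map to 3D points (left camera frame).

    Depth follows Z = fx * baseline / d; X and Y follow the pinhole model.
    Pixels with non-positive disparity are treated as invalid and dropped.
    """
    disparity = np.asarray(disparity, dtype=float)
    height, width = disparity.shape
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    valid = disparity > 0
    z = np.full(disparity.shape, np.nan)
    z[valid] = fx * baseline / disparity[valid]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1)
    return points[valid]  # shape (num_valid_pixels, 3)

# Hypothetical 2x2 disparity map with one invalid pixel.
disparity = np.array([[10.0, 10.0],
                      [0.0, 20.0]])
points = disparity_to_points(disparity, fx=1000.0, fy=1000.0,
                             cx=0.5, cy=0.5, baseline=0.005)
print(points)
```

The returned coordinates share the unit of `baseline` (metres in this example), so a segmentation mask can be applied to `disparity` beforehand to keep only the organ of interest.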

8.2. Experiment on phantom tissues

When the shape feedback is a 2D contour, the results are shown in Figure 12. In all cases, the contours were incomplete. For example, in the case of the colon, its elongated and folded structure prevents full reconstruction directly from the segmented image; only a partial region was selected, resulting in an open contour. This partial representation can be treated as an occlusion. Similarly, anatomical features such as the appendix or surrounding fat were partially occluded by surgical instruments or adjacent tissues. Supplemental Video 2 demonstrates all results.

Figure 12.

Manipulations on phantom tissues using contours as shape feedback.

Through systematic evaluation, we observed that successful manipulation was consistently achieved when the visible portion of the contour retained a completeness ratio η of at least approximately 60 %. This empirical threshold represents the limit below which the rigid registration between the partial view and the target shape becomes unstable. We further analysed the impact of the occlusion location by selectively masking different geometric features. The results indicated that the controller is robust to the loss of high-curvature features. This robustness arises because the wavelet-BEM framework prioritizes low-frequency approximation coefficients for global shape servoing, allowing the system to maintain convergence even when localized high-frequency details are occluded.

We also evaluated DJM-based methods on the pancreas using contour moment (Qi et al., 2021) and FFT representation (Navarro-Alarcon and Liu, 2017). Both methods struggled to compute shape discrepancies in their respective representation spaces when the contour was incomplete, leading to task failures.

As shown in Figure 13, both 2D and 3D curves were used as the shape feedback on the liver ridges for gallbladder exposure, and a 2D feature curve on the phantom appendix with occlusion of the surgical instrument. For control of the liver ridge, 3D-curve feedback yielded a higher success rate than 2D-curve feedback. In the 2D case, when grippers are positioned on opposite sides of a curve, the boundary element method requires the object to be divided into two subdomains, allowing each gripper to be treated as an interior point relative to its respective region. However, this division introduces ambiguity in interpreting shape feedback near the curve. In contrast, in 3D, the curve lies directly on the object’s surface, and local surface normals are naturally defined. As a result, all gripper contact points are inherently interior to the domain, and the BEM assumptions are more strictly satisfied, contributing to improved control robustness.

Figure 13.

Manipulations on the phantom tissues using curves as shape feedback.

When 3D surfaces were used as feedback (Figure 14), robotic manipulations were successfully performed on the pancreas, appendix, and large intestine. In the appendix case, the left gripper, and in the intestine case, both grippers, were positioned outside the observed surface point cloud. Despite this, the method remained effective in controlling the target shape. Although parts of the surface were occluded by surgical instruments, the desired deformations were still achieved, with an average shape discrepancy (MSE) of less than 3 mm.

Figure 14.

Manipulations on the phantom tissues using surfaces as shape feedback.

Poisson’s ratio estimations based on different shape feedback types showed slightly greater variation than those observed in simulation. Nonetheless, the results consistently indicate that 2D contours provide more reliable feedback than 2D curves, and 3D surfaces provide more reliable feedback than 3D curves. While precise identification of physical parameters is not the primary goal of this approach, the results confirm that the shape controller remains effective, owing to its iterative nature.

8.3. Experiment on ex vivo animal tissue

To validate the performance of the DOM in realistic RAMIS scenarios, particularly under realistic material properties, we used ex vivo animal tissue, including the full gastrointestinal tract. Manipulations were performed on organs such as the liver, spleen, and colon using three types of shape feedback, as shown in Figures 15–17. When using contour feedback, the colon with its mesentery was manipulated. For curve-based feedback, blood vessels on the colon and the superior border of the spleen were selected. For 3D surface feedback, the anterior end of the spleen and the liver lobe were manipulated. In all cases, the proposed controller successfully guided the organs to the target shapes within 10 s. Supplemental Video 3 demonstrates all results.

Figure 15.

Manipulations on ex vivo animal tissues using contours as shape feedback.

Figure 16.

Manipulations on ex vivo animal tissues using curves as shape feedback.

Figure 17.

Manipulations on ex vivo animal tissues using surfaces as shape feedback.

9. Discussion and conclusion

This work presents a wavelet-decomposed representation coupled with the BEM for deformable object manipulation, validated through robotic organ conformation tasks in the RAMIS platform. While traditional model-free approaches, such as those based on the DJM or nonlinear mappings, exhibit limitations in stability and observability, and physics-based methods require extensive prior knowledge of shape and mechanical properties with high computational costs, our method demonstrates superior efficiency and robustness. By leveraging BEM with only Poisson’s ratio estimation, our approach simplifies computation while maintaining accuracy, enabling real-time control through observed boundary displacements to infer interior deformations where grippers interact.

The wavelet decomposition eliminates traditional BEM meshing constraints by generating topology directly from multiscale coefficients. This integration preserves the interpretability of physics-based models while harnessing the spatial-frequency localization of wavelets, enabling targeted updates to the displacement field. Unlike Fourier-based methods, which require global recomputation and lack positional context, wavelet coefficient subtraction directly encodes shape differences at specific locations, bypassing inefficient spectral approximations.

A key feature of the proposed DWT-BEM framework, compared to purely geometric methods, is its adherence to physical laws. While geometric methods interpolate visual features and may produce physically implausible motions, our approach ensures the commanded velocity field satisfies the governing equations of static equilibrium. In this study, we focused on robustness to unknown materials. As shown in the derivation of equation (12), the traction field links boundary and interior displacements, and substituting it into the domain equations causes Young’s modulus to cancel out. This allows the controller to operate without knowledge of tissue stiffness, a major advantage in unstructured surgical environments. The parameter-free design comes with a trade-off for active safety: explicit traction values are not computed, so interaction forces cannot be directly monitored. Consequently, while the kinematic trajectories generated by the BEM provide a form of “passive” safety by strictly adhering to static equilibrium, this relies heavily on the assumption that the high-level target shape $S_{d}$ provided to the controller is anatomically safe and within the tissue’s physiological limits. However, this limitation is not fundamental. By reintroducing Young’s modulus, the framework can recover the traction field and enforce safety constraints, demonstrating its potential for future clinical integration.
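
The cancellation of Young’s modulus can also be seen from the standard Navier–Cauchy equations of linear elastostatics. The following is a textbook-level sketch, assuming a homogeneous isotropic material, zero body forces, and displacement-only boundary data; it is not a reproduction of the derivation of equation (12), and the symbols $\mathbf{u}$, $\lambda$, $\mu$, $E$, and $\nu$ follow common usage rather than the paper’s notation.

```latex
% Equilibrium in terms of the displacement field u (no body forces):
\mu \nabla^{2}\mathbf{u} + (\lambda + \mu)\,\nabla(\nabla\cdot\mathbf{u}) = \mathbf{0},
\qquad
\mu = \frac{E}{2(1+\nu)},
\quad
\lambda = \frac{E\,\nu}{(1+\nu)(1-2\nu)} .

% Since \lambda + \mu = \mu / (1 - 2\nu), dividing by \mu removes E entirely:
\nabla^{2}\mathbf{u} + \frac{1}{1-2\nu}\,\nabla(\nabla\cdot\mathbf{u}) = \mathbf{0} .
```

With displacements prescribed on the boundary, the resulting displacement field therefore depends on $\nu$ alone, whereas the tractions still scale with $E$, which is consistent with the trade-off described above.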

A critical consideration in surgical robotics is the balance between control bandwidth and system stability. While our BEM-based deformation planning operates at approximately 10 Hz, the system maintains robust stability through a hierarchical velocity-control architecture. The solver computes a continuous velocity command from equation (15), which is executed by the dVRK’s low-level servo controllers operating at high frequency (>1 kHz) (Xu et al., 2025). This separation ensures smooth trajectory interpolation between visual updates and prevents the discretization artifacts typically associated with low-bandwidth position control. Furthermore, the risk of latency-induced instability, which is common in rigid haptic systems, is inherently mitigated in our setting. The quasi-static manipulation of viscoelastic and overdamped soft tissues naturally suppresses high-frequency phase lag effects and contributes to stable closed-loop behaviour.
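
The effect of holding a planner velocity command between updates can be sketched as follows. The model is a deliberately simplified one-dimensional illustration, assuming a 10 Hz planner, a 1 kHz servo loop, and ideal velocity tracking; the function name `servo_positions` and the command values are hypothetical and do not describe the dVRK software interface.

```python
def servo_positions(p0, velocity_cmds, planner_hz=10, servo_hz=1000):
    """Integrate held velocity commands at the servo rate.

    Each planner command is held constant for servo_hz // planner_hz servo
    ticks, so the position moves in many small increments rather than in one
    jump per planner update.
    """
    ticks_per_update = servo_hz // planner_hz
    dt = 1.0 / servo_hz
    position = float(p0)
    trajectory = [position]
    for velocity in velocity_cmds:
        for _ in range(ticks_per_update):
            position += velocity * dt
            trajectory.append(position)
    return trajectory

# Two hypothetical planner commands in m/s.
trajectory = servo_positions(0.0, [0.01, -0.005])
largest_step = max(abs(b - a) for a, b in zip(trajectory, trajectory[1:]))
print(len(trajectory), trajectory[-1], largest_step)
```

In this sketch each servo tick moves the tool by at most 10 µm, whereas applying the same motion as one position jump per planner update would produce steps of 1 mm.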

Complementing this temporal stability, the framework also addresses potential spatial convergence issues, such as the risk of local minima in optimization-based servoing. It is essential to distinguish between the rigid alignment defined in equations (4) and (5) and the deformation control defined in equation (15). Even in scenarios where the rigid registration converges to a local optimum and results in an imperfect coordinate match, the shape discrepancy between the current and target objects typically remains non-zero. Consequently, the BEM-based controller continues to generate non-zero velocity commands. Since the Jacobian explicitly models the global physical connectivity of the object, any remaining shape difference is mapped to a physically admissible velocity field. This continuous motion drives the robot to deform the object, alters the geometric landscape, and prevents the system from becoming trapped in alignment-induced local minima.

In practical surgical scenarios, the planned target shape may occasionally be physically unreachable given the limited number of grippers and their specific grasp locations. Experimental results (see Supplemental Video 4) demonstrate that the proposed DWT-BEM controller exhibits robust behaviour in such cases. Rather than diverging or exhibiting unstable oscillations, the system converges to a bounded, non-zero steady-state error. This asymptotic convergence occurs because the control inputs eventually reach an equilibrium with the internal elastic restoring forces of the tissue. The inability to fully eliminate the error in these scenarios is attributed to Saint-Venant’s Principle (Timoshenko and Gere, 2012) and the viscoelastic nature of the material. The deformation energy applied by the grippers dissipates during propagation, which physically limits the control ability over boundary regions distant from the actuation points. While the controller ensures stability by maintaining bounded motion, consistent with the dissipation characteristics described in Property 1 and Remark 4, practical implementation requires higher-level supervisory logic, such as termination based on error gradients or time-out conditions, to manage these physical local minima.
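
The supervisory logic mentioned above could take a form such as the following sketch. It is one possible arrangement, not the logic used in the experiments: the function name `should_terminate`, the thresholds, and the units (mm for error, s for time) are all assumptions chosen for illustration.

```python
def should_terminate(errors, times, tol=1.0, min_rate=0.05,
                     window=5, timeout=30.0):
    """Decide whether a shape-servoing run should stop.

    errors: shape-error history (mm); times: matching timestamps (s).
    Returns "converged", "timeout", "stalled", or None to continue.
    """
    if errors[-1] <= tol:
        return "converged"
    if times[-1] - times[0] >= timeout:
        return "timeout"
    if len(errors) > window:
        elapsed = times[-1] - times[-1 - window]
        # Average error decrease per second over the last `window` samples.
        rate = (errors[-1 - window] - errors[-1]) / elapsed
        if rate < min_rate:
            return "stalled"  # error plateau, e.g. an unreachable target
    return None

# Hypothetical run that plateaus at a non-zero steady-state error.
errors = [5.0, 3.0, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5]
times = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
print(should_terminate(errors, times))
```

A "stalled" result corresponds to the bounded, non-zero steady-state error described above, at which point a higher-level planner could re-grasp or revise the target shape.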

The multiscale nature of wavelets further enhances computational efficiency. By hierarchically decomposing the boundary value problem, our framework adaptively discretizes the domain, reducing complexity without sacrificing accuracy. Spatial localization ensures that deformations are resolved at relevant scales, avoiding unnecessary computations in regions of minimal influence. Compared to registration-dependent approaches, which struggle with occlusion and noise, our method achieves robustness through direct wavelet-based correspondence, even under significant sensor limitations.

In addition to sensing robustness, the framework demonstrates resilience to modelling approximations. To enable real-time control, we adopt a linear elastic approximation (Assumption 1) to represent soft tissues that are inherently hyperelastic. This simplification is necessary to utilize the analytical solvability of the BEM (Cotin et al., 1999; Ficuciello et al., 2018; James and Pai, 1999), avoiding the high computational cost of iterative nonlinear solvers. Although this introduces constitutive discrepancies, our experimental results demonstrate that the closed-loop shape servoing framework effectively compensates for these modelling errors. By iteratively updating the deformation target based on visual feedback, the controller corrects residual errors caused by the linear approximation, ensuring convergence even when the constitutive model is imperfect.

Despite these advantages, our framework faces challenges under extreme operating conditions. A primary challenge arises when shape feedback is highly incomplete, for example, using sparse curves or occluded point clouds to represent complex or large deformable objects. This violates the BEM’s requirement for sufficient boundary data, leading to ill-posed boundary integral equations that amplify numerical instabilities. Another issue stems from unmodelled external forces (e.g., environmental interactions), which can introduce unanticipated dynamics that iterative control cannot fully compensate for without force feedback integration. In addition, while wavelet-based alignment avoids non-rigid registration, extreme deformations may decouple spatial correspondence between target and current shapes. This occurs when topological changes (e.g., folding or stretching) exceed the linear superposition assumptions of BEM, reducing robustness in large-displacement scenarios. Finally, while our feature tracking approach using optical flow (e.g., RAFT-Stereo) is effective for adaptive sampling in phantom and ex vivo setups, true in vivo environments introduce severe perceptual artifacts such as specular reflections from wet tissues, blood pooling, and surgical smoke. Addressing these visual disruptions is essential for practical clinical application.

This work advances deformable object manipulation in unstructured environments, particularly in surgical robotics where real-time responsiveness and partial observability are critical. Future directions include extending the framework to heterogeneous materials, dynamic occlusion handling, improving perception robustness in visually degraded in vivo environments, and integrating force feedback from sensorised grippers to compensate for unmodelled external forces.

Supplemental material

Supplemental Material - Multiscale deformable objects manipulation via wavelet-decomposed boundary element method

Supplemental Material for Multiscale deformable objects manipulation via wavelet-decomposed boundary element method by Junlei Hu, Dominic Jones, Majed Melibary, Jiannan Liu, and Pietro Valdastri in The International Journal of Robotics Research

Footnotes

Acknowledgements

All the experiments involving human cadaveric tissues were performed under ethical approval from the University of Leeds.

ORCID iDs

Junlei Hu

Majed Melibary

Pietro Valdastri

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported in part by the European Research Council (ERC) through the European Union’s Horizon 2020 Research and Innovation Programme under Grant 818045, in part by the Engineering and Physical Sciences Research Council (EPSRC) under Grant EP/V047914/1, and in part by the National Institute for Health and Care Research (NIHR) Leeds Biomedical Research Centre (BRC) (NIHR203331). Any opinions, findings, and conclusions or recommendations expressed in this article are those of the authors and do not necessarily reflect the views of the ERC, the EPSRC or the NIHR.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article. The authors would like to thank Intuitive Surgical, Inc., for the donation of the da Vinci system, the STORM Lab technician, Samwise Wilson, for hardware support, and Dr. Benjamin Calmé for his support in the animal experiments.

Supplemental material

Supplemental material for this article is available online.

Appendix

References

Berenson

(2013) Manipulation of deformable objects without modeling and simulating deformation. In: 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, pp. 4525–4532.

Cotin

Delingette

Ayache

(1999) Real-time elastic deformations of soft tissues for surgery simulation. IEEE Transactions on Visualization and Computer Graphics 5(1): 62–73. https://doi.org/10.1109/2945.764872

Cruse

Suwito

(1993) On the somigliana stress identity in elasticity. Computational Mechanics 11(1): 1–10. https://doi.org/10.1007/bf00370069

Das

Sarkar

(2011) Autonomous shape control of a deformable object by multiple manipulators. Journal of Intelligent and Robotic Systems 62: 3–27. https://doi.org/10.1007/s10846-010-9436-5

Dometios

Tzafestas

(2022) Interaction control of a robotic manipulator with the surface of deformable object. IEEE Transactions on Robotics 39(2): 1321–1340. https://doi.org/10.1109/tro.2022.3226143

Faure

Duriez

Delingette

, et al. (2012) SOFA: a multi-model framework for interactive physical simulation. In: Payan

(ed) Soft Tissue Biomechanical Modeling for Computer Assisted Surgery, Volume 11 of Studies in Mechanobiology, Tissue Engineering and Biomaterials. Springer, pp. 283–321.

Ficuciello

Migliozzi

Coevoet

, et al. (2018) Fem-based deformation control for dexterous manipulation of 3d soft objects 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, pp. 4007–4013.

Han

Liu

Wang

(2025) Quasi-static modeling and controlling for planar pushing of deformable objects. IEEE Transactions on Robotics 41: 1296–1315. https://doi.org/10.1109/tro.2025.3532500

Sun

Pan

(2018) Three-dimensional deformable object manipulation using fast online gaussian process regression. IEEE Robotics and Automation Letters 3(2): 979–986. https://doi.org/10.1109/lra.2018.2793339

10.

Han

Sun

, et al. (2019) 3-d deformable object manipulation using deep neural networks. IEEE Robotics and Automation Letters 4(4): 4255–4261. https://doi.org/10.1109/lra.2019.2930476

11.

Jones

Huang

, et al. (2026) Autonomous robotic exploration of unknown soft object. The International Journal of Robotics Research. Available at: https://doi.org/10.1177/02783649251415415

12.

Jones

Valdastri

(2023) Coordinate calibration of a dual-arm robot system by visual tool tracking 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, pp. 11468–11473.

13.

Jones

Dogar

, et al. (2024) Occlusion-robust autonomous robotic manipulation of human soft tissues with 3d surface feedback. IEEE Transactions on Robotics 40: 624–638. https://doi.org/10.1109/tro.2023.3335693

14.

James

Pai

(1999) Artdefo: accurate real time deformable objects. In: Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, pp. 65–72.

15.

Jia

Pan

, et al. (2018) Manipulating highly deformable materials using a visual feedback dictionary. In: 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, pp. 239–246.

16.

Jia

Pan

, et al. (2019) Cloth manipulation using random-forest-based imitation learning. IEEE Robotics and Automation Letters 4(2): 2086–2093. https://doi.org/10.1109/lra.2019.2897370

17.

Kazanzides

Chen

Deguet

, et al. (2014) An open-source research kit for the da vinci® surgical system. In: 2014 IEEE International Conference on Robotics and Automation (ICRA). IEEE, pp. 6434–6439.

18.

Kim

Chen

J-T

Hansen

, et al. (2025) Srt-h: a hierarchical framework for autonomous surgery via language-conditioned imitation learning. Science Robotics 10(104): eadt5254. https://doi.org/10.1126/scirobotics.adt5254

19.

Kirillov

Mintun

Ravi

, et al. (2023) Segment anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4015–4026.

20.

Koessler

Filella

Bouzgarrou

B-C

, et al. (2021) An efficient approach to closed-loop shape control of deformable objects using finite element models. In: 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, pp. 1637–1643.

21.

Koganti

Tamei

Ikeda

, et al. (2017) Bayesian nonparametric learning of cloth models for real-time state estimation. IEEE Transactions on Robotics 33(4): 916–931. https://doi.org/10.1109/tro.2017.2691721

22.

Lagneau

Krupa

Marchal

(2020) Active deformation through visual servoing of soft objects. In: 2020 Ieee International Conference on Robotics and Automation (ICRA). IEEE, pp. 8978–8984.

23.

Lippi

Poklukar

Welle

, et al. (2023) Enabling visual action planning for object manipulation through latent space roadmap. IEEE Transactions on Robotics 39(1): 57–75. https://doi.org/10.1109/tro.2022.3188163

24.

Lipson

Teed

Deng

(2021) Raft-stereo: multilevel recurrent field transforms for stereo matching. In: 2021 International Conference on 3D Vision (3DV). IEEE, pp. 218–227.

25.

Long

Lin

Kwok

DHC

, et al. (2025) Surgical embodied intelligence for generalized task autonomy in laparoscopic robot-assisted surgery. Science Robotics 10(104): eadt3093. https://doi.org/10.1126/scirobotics.adt3093

26. Longhini, Wang, Garcia-Camacho, et al. (2025) Unfolding the literature: a review of robotic cloth manipulation. Annual Review of Control, Robotics, and Autonomous Systems 8(1): 295–322. https://doi.org/10.1146/annurev-control-022723-033252

27. Makiyeh, Chaumette, Marchal, et al. (2023) Shape servoing of a soft object using Fourier series and a physics-based model. In: 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, pp. 6356–6363.

28. Navarro-Alarcon, Liu Y-H (2017) Fourier-based shape servoing: a new feedback method to actively deform soft objects into desired 2-D image contours. IEEE Transactions on Robotics 34(1): 272–279.

29. Navarro-Alarcon, Liu Y-H, Romero, et al. (2013) Model-free visually servoed deformation control of elastic objects by robot manipulators. IEEE Transactions on Robotics 29(6): 1457–1468. https://doi.org/10.1109/tro.2013.2275651

30. Navarro-Alarcon, Yip, Wang, et al. (2016) Automatic 3-D manipulation of soft objects by robotic arms with an adaptive deformation model. IEEE Transactions on Robotics 32(2): 429–441. https://doi.org/10.1109/tro.2016.2533639

31. Zhu, et al. (2021) Contour moments based manipulation of composite rigid-deformable objects with finite time model estimation and shape/position control. IEEE/ASME Transactions on Mechatronics 27(5): 2985–2996. https://doi.org/10.1109/tmech.2021.3126383

32. Saghour, Navarro-Alarcon, Fraisse, et al. (2025) Dual-arm shaping of soft objects in 3D based on visual servoing and online FEM simulations. The International Journal of Robotics Research 44(7): 1138–1155. https://doi.org/10.1177/02783649241301076

33. Shen, Jiang, Choy, et al. (2024) Action-conditional implicit visual dynamics for deformable object manipulation. The International Journal of Robotics Research 43(4): 437–455. https://doi.org/10.1177/02783649231191222

34. Shetab-Bushehri, Aranda, Mezouar, et al. (2022) As-rigid-as-possible shape servoing. IEEE Robotics and Automation Letters 7(2): 3898–3905. https://doi.org/10.1109/lra.2022.3145960

35. Shetab-Bushehri, Aranda, Mezouar, et al. (2023) Lattice-based shape tracking and servoing of elastic objects. IEEE Transactions on Robotics 40: 364–381. https://doi.org/10.1109/tro.2023.3331596

36. Shin, Ferguson, Pedram, et al. (2019) Autonomous tissue manipulation via surgical robot using learning based model predictive control. In: 2019 International Conference on Robotics and Automation (ICRA). IEEE, pp. 3875–3881.

37. Timoshenko, Gere (2012) Theory of Elastic Stability. Courier Corporation.

38. Deguet, et al. (2025) dVRK-Si: the next generation da Vinci research kit. In: 2025 International Symposium on Medical Robotics (ISMR). IEEE, pp. 185–191.

39. Yang, Chen, et al. (2023a) Model-free 3-D shape control of deformable objects using novel features based on modal analysis. IEEE Transactions on Robotics 39(4): 3134–3153. https://doi.org/10.1109/tro.2023.3269347

40. Yang, Sui, Zhong, et al. (2023b) Modal-graph 3D shape servoing of deformable objects with raw point clouds. The International Journal of Robotics Research 42(14): 1213–1244. https://doi.org/10.1177/02783649231198900

41. Yin, Varava, Kragic (2021) Modeling, learning, perception, and control methods for deformable object manipulation. Science Robotics 6(54): eabd8803. https://doi.org/10.1126/scirobotics.abd8803

42. Zhu, Navarro-Alarcon, Passama, et al. (2021) Vision-based manipulation of deformable and rigid objects using subspace projections of 2D contours. Robotics and Autonomous Systems 142: 103798. https://doi.org/10.1016/j.robot.2021.103798
