Abstract
Estimating the 6D pose of textureless objects from RGB images is an important problem in robotics. Due to appearance ambiguities, rotational symmetries, and severe occlusions, single-view based 6D pose estimators are still unable to handle a wide range of objects, motivating research towards multi-view pose estimation and next-best-view prediction that addresses these limitations. In this work, we propose a comprehensive active perception framework for estimating the 6D poses of textureless objects using only RGB images. Our approach is built upon a key idea: decoupling the 6D pose estimation into a two-step sequential process can greatly improve both accuracy and efficiency. First, we estimate the 3D translation of each object, resolving scale and depth ambiguities inherent to RGB images. These estimates are then used to simplify the subsequent task of determining the 3D orientation, which we achieve through canonical scale template matching. Building on this formulation, we then introduce an active perception strategy that predicts the next best camera viewpoint to capture an RGB image, effectively reducing object pose uncertainty and enhancing pose accuracy. We evaluate our method on the public ROBI and TOD datasets, as well as on our reconstructed transparent object dataset, T-ROBI. Under the same camera viewpoints, our multi-view pose estimation significantly outperforms state-of-the-art approaches. Furthermore, by leveraging our next-best-view strategy, our approach achieves high pose accuracy with fewer viewpoints than heuristic-based policies across all evaluated datasets. The accompanying video and T-ROBI dataset will be released on our project page: https://trailab.github.io/ActiveODPE.
Introduction
Textureless rigid objects occur frequently in industrial environments and are of significant interest in many robotic applications. The task of 6D pose estimation aims to detect objects of known geometry and estimate their six degree-of-freedom (6DoF) poses, that is, their 3D translations and 3D orientations, with respect to a global coordinate frame. In robotic manipulation tasks, accurate object poses are required for path planning and grasp execution (Deng et al., 2020; Song et al., 2017; Tremblay et al., 2018; Wang et al., 2019). For robotic navigation, 6D poses serve as valuable cues for localization and obstacle avoidance (Fu et al., 2021; Liao et al., 2024; Merrill et al., 2022; Salas-Moreno et al., 2013; Wang et al., 2021b).
Due to the absence of appearance features, 6D pose estimation for textureless objects has typically been addressed using depth data (Bui et al., 2018; Cai et al., 2022; Drost et al., 2010; Gao et al., 2020, 2021; Li and Stamos, 2023; Yang et al., 2021b) or RGB-D images (Doumanoglou et al., 2016; He et al., 2020; Li and Schoellig, 2023; Saadi et al., 2021; Tian et al., 2020; Wada et al., 2020; Wang et al., 2019; Wen et al., 2020, 2024). These methods demonstrate strong pose estimation performance when high-quality depth data is available. However, despite advances in depth sensing technology, commodity-grade depth cameras frequently produce inaccurate depth maps, with errors or missing data occurring on glossy or dark surfaces (Chai et al., 2020; Yang et al., 2024a; Yang and Waslander, 2022), as well as on translucent or transparent objects (Liu et al., 2020; Sajjan et al., 2020; Xu et al., 2021). These depth limitations can severely degrade object pose estimation performance. Therefore, RGB-based approaches have received a lot of attention over the past decade as a promising alternative (Brachmann et al., 2016; Hinterstoisser et al., 2011).
Due to the numerous advances in deep learning over the last decade, some learning-based approaches have recently been shown to significantly improve object pose estimation performance using only RGB images (He et al., 2023; Hodan et al., 2020; Kehl et al., 2017; Labbé et al., 2022; Li et al., 2018; Peng et al., 2019; Sun et al., 2025; Sundermeyer et al., 2018; Xiang et al., 2018; Xu et al., 2024). However, due to the inherent scale, depth, and perspective ambiguities from a single viewpoint, RGB-based solutions often suffer from low accuracy in the final 6D pose estimation. To this end, recent works leverage multiple RGB views to enhance their pose estimation results (Deng et al., 2021; Fu et al., 2021; Haugaard and Iversen, 2023; Labbé et al., 2020, 2022; Maninis et al., 2022; Merrill et al., 2022; Shugurov et al., 2021). Although fusing multi-view information can enhance overall performance, addressing challenges such as appearance ambiguities, rotational symmetries, and occlusions remains difficult. Additionally, even when multi-view fusion mitigates some of these issues, relying on a large number of viewpoints is often impractical for many real-world applications, such as robotic manipulation.
To address these challenges, we present a comprehensive framework for both object pose estimation and next-best-view prediction using multi-view RGB images. First, we introduce a multi-view object pose estimation method that decouples the 6D pose estimation into a two-step sequential process: we first estimate the 3D translation, followed by the 3D orientation of each object. This decoupled formulation first resolves scale and depth ambiguities from single RGB images, and then leverages the resulting translation estimates to simplify object orientation estimation in the second stage. To address the multimodal nature of orientation space, we develop an optimization scheme that accounts for object symmetries and counteracts measurement uncertainties. The second part of our framework focuses on next-best-view (NBV) prediction, which builds upon the proposed multi-view pose estimator. We introduce an information-theoretic approach to quantify object pose uncertainty. In each NBV iteration, we predict the expected object pose uncertainty for each potential viewpoint and select the next camera viewpoint that minimizes this uncertainty, ensuring more informative RGB measurements are collected. Figure 1 illustrates the effectiveness of our multi-view approach on transparent objects, demonstrating accurate 6D pose estimations even under challenging conditions.

Figure 1. 6D object pose estimation using multi-view acquired RGB images. (a) The input multi-view RGB images with known camera poses. (b) The pose estimation results using CosyPose and PVNet. (c) The pose estimation results using our approach. The green and red colors represent correct and incorrect pose estimations, respectively.
We conduct extensive experiments on the public ROBI (Yang et al., 2021a) and TOD (Liu et al., 2020) datasets, as well as on a challenging transparent object dataset, T-ROBI, that we present. To support network training, we also propose a large-scale synthetic dataset based on both ROBI and T-ROBI. Our approach significantly outperforms state-of-the-art RGB-based methods. Compared to depth-based methods, it achieves comparable performance on reflective objects and fully surpasses them on transparent objects, despite relying solely on RGB images. Furthermore, compared to baseline viewpoint selection strategies, our next-best-view strategy achieves high object pose accuracy while requiring fewer viewpoints.
Our work makes the following key contributions: (1) We propose a novel 6D object pose estimation framework that decouples the problem into a two-step sequential process, which resolves the depth ambiguities of RGB frames and greatly improves the estimate of orientation parameters. (2) Building on our proposed pose estimator, we introduce an information-theoretic active vision strategy that optimizes object pose accuracy by selecting the next-best camera viewpoint. (3) We introduce a multi-view dataset of transparent objects, specifically designed to evaluate 6D pose estimation for transparent parts in cluttered and occluded bin scenarios. (4) To support network training, we create a large-scale synthetic dataset that includes all parts from both the public ROBI and our T-ROBI dataset.
It is important to note that this work substantially extends our previous conference paper (Yang et al., 2023b), as follows: (1) Improved orientation estimation: we introduce a new head into the neural network architecture that extracts per-frame object edge maps, serving as more accurate and consistent shape inputs for the object orientation estimator. (2) Active vision: we extend our previous approach by integrating an active vision strategy that selects the next-best-view to improve object pose accuracy. (3) Transparent object dataset: we present T-ROBI, which allows evaluation of our method under challenging real-world scenarios, while also serving as a valuable benchmark for researchers working on transparent object pose estimation. (4) Synthetic dataset: we generate a large-scale synthetic dataset to provide a comprehensive benchmark for training and fair comparison on ROBI and our transparent object dataset. (5) Expanded real-world results: we include detailed ablation analysis, three additional baselines, and more extensive real-world results on the ROBI, TOD, and our transparent object datasets.
Related works
Object pose estimation from a single RGB image
Traditional methods
Due to the lack of appearance features, traditional methods usually tackle the problem via holistic template matching techniques (Hinterstoisser et al., 2011; Imperoli and Pretto, 2015), but are susceptible to failure under scale changes and occlusion. Later advances (Brachmann et al., 2016) improved the efficiency of template matching by jointly regressing object coordinates and labels via a learning-based framework, but their accuracy remains far below that of modern deep-learning methods.
End-to-end methods
With advances in deep learning, many works leverage convolutional neural networks (CNNs) (Kehl et al., 2017; Labbé et al., 2022; Li et al., 2018; Wang et al., 2021a; Xiang et al., 2018) or vision transformers (ViTs) (Amini et al., 2021; Jantos et al., 2023) to estimate object pose end-to-end. Among these, SSD-6D (Kehl et al., 2017) was the first to regress 6D object pose from a single RGB image with CNNs. To avoid complex rotation parametrization, it discretizes rotation estimation as a classification problem. PoseCNN (Xiang et al., 2018) improved this by decoupling translation and rotation with separate branches, leading to more accurate estimates. Building on these, DeepIM (Li et al., 2018) iteratively refines object poses by matching a rendered object image to the input image. To better facilitate end-to-end methods, a continuous 6D rotation representation (Zhou et al., 2019) was introduced, offering advantages over other parametrizations for network training. This representation was subsequently adopted in end-to-end works (Amini et al., 2021; Jantos et al., 2023; Labbé et al., 2022; Wang et al., 2021a), further improving 6D pose estimation.
Learning-based indirect methods
While classical feature and geometric-fitting methods fail on textureless objects, deep learning overcomes this by learning discriminative features. Recent indirect methods use deep networks to predict 2D object keypoints (He et al., 2023; Pavlakos et al., 2017; Peng et al., 2019; Rad and Lepetit, 2017) or dense 2D–3D correspondences (Haugaard and Buch, 2022; Hodan et al., 2020; Park et al., 2019; Su et al., 2022; Zakharov et al., 2019), then compute poses via RANSAC/PnP (Lepetit et al., 2009). More recently, the representation power of diffusion (Ho et al., 2020) and foundation models (Cherti et al., 2023; Oquab et al., 2023) has further improved indirect methods (Xu et al., 2024), enabling zero-shot 6D pose estimation (Ausserlechner et al., 2024; Deng et al., 2025; Fan et al., 2024; Sun et al., 2025).
Although these methods perform well on 2D metrics, their 6D pose accuracy is limited by depth ambiguities and occlusions from a single viewpoint. Consequently, depth data is often needed to refine object poses (Deng et al., 2021; Yang et al., 2024a; Zhang and Cao, 2019).
Object pose estimation from multi-view RGB images
Multi-view approaches address the scale and depth ambiguities that commonly occur in single-viewpoint scenarios, improving the accuracy of estimated poses. Traditional methods rely on local features (Collet and Srinivasa, 2010; Eidenberger and Scharinger, 2010) but struggle to handle textureless objects. More recently, multi-view object pose estimation has been revisited with neural networks. These approaches employ an offline, batch-based optimization framework, where all frames are processed simultaneously to produce a consistent interpretation of the scene (Chen and Jiang, 2024; Haugaard and Iversen, 2023; Kundu et al., 2018; Labbé et al., 2020; Liu et al., 2020; Shugurov et al., 2021). The most notable work is CosyPose (Labbé et al., 2020), which integrates single-view pose estimates into a globally consistent scene and is agnostic to the choice of pose estimator. Using a similar multi-view setup, a pose refiner further improves accuracy via differentiable rendering (Shugurov et al., 2021).
Other approaches address multi-view pose estimation in an online manner. 6D object pose tracking (Deng et al., 2020, 2021; Labbé et al., 2022; Moon et al., 2024) focuses on estimating object poses relative to the camera, whereas object-level SLAM simultaneously estimates both camera and object poses within a shared world coordinate frame (Chen and Jiang, 2024; Fu et al., 2021; Merrill et al., 2022; Wu et al., 2020; Yang and Scherer, 2019). PoseRBPF (Deng et al., 2021) represents an early effort in online 6D object pose tracking, combining particle filtering with deep neural networks to achieve robust estimation under challenging conditions. In contrast, MegaPose (Labbé et al., 2022) and GenFlow (Moon et al., 2024) use end-to-end frameworks, enabling 6D tracking of novel, previously unseen objects. While these methods handle single objects well, they cannot track multiple objects simultaneously. Object-level SLAM approaches (Fu et al., 2021; Merrill et al., 2022; Wu et al., 2020; Yang and Scherer, 2019), on the other hand, can recover the poses of multiple objects at once, offering a more comprehensive understanding of the scene.
While the above methods improve performance using only RGB images, they still face challenges in handling object scales, rotational symmetries, and measurement uncertainties. Our approach follows the principles of online object-level SLAM, but with known camera poses. Using per-frame neural network predictions as measurements, our approach resolves depth and scale ambiguities through a two-step sequential formulation. It also handles rotational symmetries and measurement uncertainties within an incremental online framework.
Pose uncertainty and visual ambiguity
In practice, the estimated object pose may carry state uncertainty or be subject to visual ambiguity. Representing these factors is crucial for many robotic applications, such as manipulation (Deng et al., 2020; Wang et al., 2019) or navigation (Fu et al., 2021; Salas-Moreno et al., 2013). Object pose uncertainty reflects variability in translation and orientation, potentially with different variances along each axis (e.g., depth uncertainty along the optical axis from a single viewpoint). A straightforward strategy is to assume a unimodal distribution and represent the pose with a single covariance matrix. To estimate this covariance, several works (Merrill et al., 2022; Peng et al., 2019; Richter-Klug, 2019; Yang and Pavone, 2023) adopt a structured strategy that first computes the uncertainty of 2D keypoint detections and then propagates it to the 6D pose. Extending this idea to depth information, studies such as (He et al., 2020; Liao et al., 2024; Salas-Moreno et al., 2013; Yang et al., 2024a) predict depth or 3D keypoint uncertainty and propagate it to the pose as well. Under the unimodal assumption, the resulting covariance can be further reduced by incorporating additional measurements.
Although the unimodal assumption is effective, it may fail to capture complex uncertainties, particularly when objects appear similar under different poses due to shape symmetries, occlusion, or repetitive textures, a phenomenon known as visual ambiguity (Bui et al., 2020; Deng et al., 2022; Höfer et al., 2023; Hsiao et al., 2024; Manhardt et al., 2019; Okorn et al., 2020). It is important to model such visual ambiguities using more expressive, multimodal distributions. Furthermore, when distinctive object features are visible, the distribution should naturally converge to a more confident, unimodal estimate. For this purpose, sampling-based methods (Haugaard et al., 2023; Kendall and Cipolla, 2016; Shi et al., 2021) generate multiple pose hypotheses to estimate this uncertainty, but achieving high accuracy requires many samples, making these approaches computationally costly. Alternatively, Manhardt et al. (2019) learn orientation distributions using a Winner-Takes-All (WTA) strategy (Rupprecht et al., 2017) over multiple hypotheses. To capture full orientation distributions, AAE (Sundermeyer et al., 2018) and PoseRBPF (Deng et al., 2021) adopt a discrete representation, whereas Okorn et al. (2020) use a histogram-based approach. While effective, these methods are inherently limited by discretization. To address this limitation, some approaches employ Gaussian mixture (Fu et al., 2021) or Bingham mixture models (Deng et al., 2022; Gilitschenski et al., 2019) to represent multimodal orientation distributions.
In this work, we focus on modeling pose uncertainty rather than explicitly resolving visual ambiguities, enabling more efficient representations for robotic tasks.
Active vision
Active vision (Aloimonos et al., 1988; Bajcsy et al., 2018; Chen et al., 2011), or more specifically Next-Best-View (NBV) prediction (Connolly, 1985), refers to actively manipulating the camera viewpoint to obtain the maximum information in the next frame for the required task. Active vision has received a lot of attention from the robotics community and has been employed in many applications, such as robot manipulation (Breyer et al., 2022; Fu et al., 2024; Morrison et al., 2019), calibration (Choi et al., 2023; Rebello et al., 2017; Xu et al., 2023; Yang et al., 2023a), object pose estimation (Doumanoglou et al., 2016; Eidenberger and Scharinger, 2010; Sock et al., 2020; Wu et al., 2015; Yang et al., 2024a), 3D reconstruction (Forster et al., 2014; Isler et al., 2016; Yang and Waslander, 2022), and localization (Davison and Murray, 2002; Falanga et al., 2018; Hanlon et al., 2024; Zhang and Scaramuzza, 2018, 2019). The next-best-view selection is often achieved by finding the viewpoint that maximizes the information gain or minimizes the expected entropy (Choi et al., 2023; Doumanoglou et al., 2016; Kiciroglu et al., 2020; Rebello et al., 2017; Xu et al., 2023; Yang et al., 2023a, 2024a; Zhang and Scaramuzza, 2018, 2019). To estimate 6D object poses, Doumanoglou et al. (2016) first present a single-shot object pose estimation approach based on Hough Forests, and predict the next-best-view by exploiting the capability of Hough Forests to compute the entropy. To eliminate reliance on Hough Forests, recent studies show that the next-best-view can be selected by maximizing the Fisher information of the robot state parameters (Forster et al., 2014; Rebello et al., 2017; Yang et al., 2023a, 2024a; Zhang and Scaramuzza, 2018, 2019). For example, in robot localization, Zhang and Scaramuzza (2018, 2019) use Fisher information maximization to find highly informative trajectories and achieve high localization accuracy. Our approach extends this principle to 6D object pose estimation: we actively move the camera to maximize the Fisher information, selecting viewpoints that most effectively reduce object pose uncertainty.
6D pose estimation using multi-view optimization
Problem formulation
Given a 3D object model and multi-view images, the goal of 6D object pose estimation is to estimate the rigid transformation from the object coordinate frame to the world coordinate frame, comprising a 3D rotation $\mathbf{R} \in SO(3)$ and a 3D translation $\mathbf{t} \in \mathbb{R}^{3}$.
Given measurements collected from K camera viewpoints, we decouple the estimation of this transformation into two sequential subproblems: the 3D translation is estimated first, followed by the 3D orientation.
Our decoupling formulation implies a useful correlation between the object’s translation and orientation in the image domain: the 3D translation estimate fixes the object’s scale and center in each image, which in turn simplifies the orientation estimation.

Figure 2. An overview of the proposed multi-view object pose estimation pipeline with a two-step optimization formulation. We decouple the 6D pose estimation into a sequential process: first estimating the 3D translation, then the 3D orientation of each object. At each viewpoint, MEC-Net predicts the object’s 2D center, mask, and edge map. Using observations from K frames, we estimate translation via multi-view optimization, which provides the object scale and center for re-cropping the edge map. Per-frame orientations are then obtained and fused via a max-mixture optimization.
To implement this formulation, a key step in our framework is estimating per-frame measurements using a neural network, which are then integrated into our two-step sequential process. The network outputs the object’s 2D edge map, segmentation mask, and the 2D projection of its 3D center. We refer to our network as MEC-Net (Mask, Edge, Center Network) for the remainder of this paper. With these estimates, we proceed to the first step, where we estimate the 3D translation of each object.
3D translation estimation
As illustrated in Figure 3, the 3D translation describes the position of the object’s 3D center with respect to the world coordinate frame.

Figure 3. Object, world, and camera coordinate frames.
Our MEC-Net architecture is shown in the upper part of Figure 2 and is based on PoseCNN (Xiang et al., 2018) and PVNet (Peng et al., 2019). To handle multiple instances within the scene, we first employ YOLOv8 (Sohan et al., 2024) to detect 2D bounding boxes of the objects. These detections are then cropped and resized to 128 × 128 before being passed to the network. To estimate the object 2D center, the MEC-Net first predicts pixel-wise binary labels and a 2D vector field pointing towards the object center. A RANSAC-based voting scheme is then applied to compute the mean and covariance of the object’s 2D center hypotheses.
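To make the voting step concrete, the following is a minimal NumPy sketch of a PVNet-style RANSAC voting scheme under our assumptions; the hypothesis count, inlier threshold, and function names are illustrative rather than the paper’s exact implementation.

```python
import numpy as np

def ransac_center_vote(pixels, vectors, n_hyp=128, inlier_thresh=0.99, rng=None):
    """Estimate the object's 2D center from a pixel-wise vector field.
    `pixels` is (N, 2) foreground pixel coordinates; `vectors` is (N, 2)
    unit vectors pointing toward the center. Returns the inlier-weighted
    mean and covariance of the center hypotheses."""
    rng = rng or np.random.default_rng(0)
    hypotheses, scores = [], []
    for _ in range(n_hyp):
        i, j = rng.choice(len(pixels), size=2, replace=False)
        # Intersect the two rays p_i + s*v_i and p_j + t*v_j.
        A = np.stack([vectors[i], -vectors[j]], axis=1)
        if abs(np.linalg.det(A)) < 1e-6:
            continue  # near-parallel rays carry no intersection information
        s, _ = np.linalg.solve(A, pixels[j] - pixels[i])
        c = pixels[i] + s * vectors[i]
        # A pixel votes for c if its vector agrees with the direction to c.
        d = c[None, :] - pixels
        d /= np.linalg.norm(d, axis=1, keepdims=True) + 1e-9
        inliers = (d * vectors).sum(axis=1) > inlier_thresh
        hypotheses.append(c)
        scores.append(inliers.sum())
    H = np.asarray(hypotheses)
    w = np.asarray(scores, dtype=float) + 1e-9
    w /= w.sum()
    mean = (w[:, None] * H).sum(axis=0)
    diff = H - mean
    cov = (w[:, None, None] * np.einsum('ni,nj->nij', diff, diff)).sum(axis=0)
    return mean, cov
```

The returned covariance is exactly what the downstream multi-view translation optimization consumes as a per-frame measurement uncertainty.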
Given a sequence of measurements, we estimate the object’s 3D translation by minimizing the reprojection error of the 2D center measurements across all viewpoints, formulated as a nonlinear least-squares (NLLS) problem.
We solve the NLLS problem (Equations (5) and (6)) using an iterative Gauss-Newton procedure:

$$\delta \mathbf{t} = -\left(\mathbf{J}^{\top} \boldsymbol{\Sigma}^{-1} \mathbf{J}\right)^{-1} \mathbf{J}^{\top} \boldsymbol{\Sigma}^{-1} \mathbf{r}, \qquad \mathbf{t} \leftarrow \mathbf{t} + \delta \mathbf{t},$$

where $\mathbf{J}$ stacks the per-view Jacobians of the projected 2D center with respect to the translation, $\boldsymbol{\Sigma}$ stacks the per-view measurement covariances, and $\mathbf{r}$ stacks the reprojection residuals.
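The sketch below shows how such a Gauss-Newton triangulation can be assembled from the per-frame center measurements and their covariances; it is a minimal illustration under our assumptions (pinhole projection, known world-to-camera poses), not a reproduction of the paper’s exact Equations (5)–(7).

```python
import numpy as np

def project(K, R_cw, t_cw, p_w):
    """Project world point p_w into a camera with world-to-camera pose
    (R_cw, t_cw) and intrinsics K. Returns the pixel and the camera point."""
    p_c = R_cw @ p_w + t_cw
    u = K @ (p_c / p_c[2])
    return u[:2], p_c

def estimate_translation(K, cam_poses, centers, covs, p0, iters=10):
    """Gauss-Newton triangulation of the object's 3D translation from
    per-frame 2D center measurements `centers` with 2x2 covariances `covs`."""
    p = p0.astype(float).copy()
    for _ in range(iters):
        H = np.zeros((3, 3)); b = np.zeros(3)
        for (R_cw, t_cw), z, S in zip(cam_poses, centers, covs):
            u, p_c = project(K, R_cw, t_cw, p)
            x, y, zc = p_c
            fx, fy = K[0, 0], K[1, 1]
            # Jacobian of the pinhole projection w.r.t. the camera-frame point,
            J_proj = np.array([[fx / zc, 0.0, -fx * x / zc**2],
                               [0.0, fy / zc, -fy * y / zc**2]])
            J = J_proj @ R_cw          # chained with d p_c / d p_w = R_cw
            r = u - z                   # reprojection residual
            W = np.linalg.inv(S)        # information of this measurement
            H += J.T @ W @ J
            b += J.T @ W @ r
        p -= np.linalg.solve(H, b)      # Gauss-Newton update
    return p
```

Note that the accumulated matrix `H` is the Fisher information of the translation; its inverse is the translation covariance reused later for uncertainty estimation and next-best-view prediction.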
3D orientation estimation
The procedure for estimating the object orientation consists of two stages: acquiring a per-frame orientation measurement at each viewpoint via template matching, and fusing these measurements across viewpoints through a max-mixture optimization.
Per-frame orientation measurement
The process of acquiring the per-frame object orientation measurement is illustrated in Figure 4.

Figure 4. The process of acquiring the per-frame object orientation measurement.
Predicting object edge map
To bridge the gap between rendered templates and RoI images, and to reduce the impact of spurious edges, we leverage our MEC-Net to directly generate the object’s edge map. As illustrated in Figure 2, we extend our previous approach (Yang et al., 2023b) by adding an extra network head specifically for estimating the object’s 2D edge map. To handle partial occlusion, which is common in real-world scenes, we incorporate occlusion augmentation in our training data, similar to the approach used in AAE (Sundermeyer et al., 2018). During training, we treat the edge map as a binary classification task and minimize the cross-entropy loss. At inference, we apply the sigmoid function to map edge pixel values to the range of [0, 1], with higher values indicating greater confidence that a pixel belongs to the object edge.
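A minimal PyTorch sketch of such an edge head is shown below. The layer dimensions and names are hypothetical (the paper does not specify the head architecture); only the training objective (per-pixel binary cross-entropy) and the sigmoid at inference follow the description above.

```python
import torch
import torch.nn as nn

# Hypothetical edge head: a light decoder on top of shared backbone features,
# trained with per-pixel binary cross-entropy as described in the text.
edge_head = nn.Sequential(
    nn.Conv2d(64, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(32, 1, kernel_size=1),   # one edge logit per pixel
)
criterion = nn.BCEWithLogitsLoss()     # binary cross-entropy on edge vs. non-edge

def edge_loss(features, gt_edges):
    logits = edge_head(features)       # (B, 1, H, W)
    return criterion(logits, gt_edges) # gt_edges: float tensor in {0, 1}

def edge_confidence(features):
    # At inference, the sigmoid maps logits to [0, 1] edge confidences.
    return torch.sigmoid(edge_head(features))
```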
Handling object scale change
To address the scale change issue, the original LINE-2D generates object templates at multiple distances and scales, which increases run-time complexity. In contrast, our approach fixes the 3D translation to a canonical centroid distance and re-crops the object RoI accordingly, as illustrated in Figure 5.

Figure 5. (a) The inference of object size $l_s$ from its projective ratio. (b) Incorrect object orientation estimates on the original edge map due to scale changes. (c) Re-cropped object RoI using the object translation estimate, resulting in correct orientation estimation.
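The re-cropping step can be written compactly from the projective ratio, as in the sketch below. It assumes the templates are rendered with the same intrinsics into a window of the template resolution; the function and variable names are illustrative.

```python
import numpy as np

def canonical_recrop(K, t_co, canonical_dist, template_size=128):
    """Compute the square RoI that, once resized to `template_size`, shows the
    object at the canonical centroid distance used to render the LINE-2D
    templates. `t_co` is the estimated object center in the camera frame."""
    z = t_co[2]                          # estimated centroid depth
    center = (K @ (t_co / z))[:2]        # projected 2D object center (px)
    # Projective ratio: apparent size scales as f*L/z. Resizing a crop of
    # width w to template_size rescales apparent size by template_size / w,
    # so w = template_size * d_canonical / z makes the re-cropped object
    # match its apparent size at the canonical distance.
    w = template_size * canonical_dist / z
    half = w / 2.0
    x0, y0 = center - half
    x1, y1 = center + half
    return x0, y0, x1, y1                # crop, then resize to template_size
```

With this re-crop in place, a single set of canonical-scale templates suffices, which is what removes the multi-distance template sweep of the original LINE-2D.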
Optimization formulation
Given the multi-view orientation measurements, we fuse them into a single consistent estimate by solving a nonlinear optimization over the object orientation.
We formulate the optimization problem by creating the residual between the object orientation state and each per-frame orientation measurement.
Measurement ambiguities
Due to the complex uncertainties, such unimodal estimates are insufficient to fully capture the uncertainty associated with the object orientation. To this end, we now consider the sum-mixture of Gaussians as the likelihood function:

$$p(\mathbf{z} \mid \mathbf{R}) = \sum_{i=1}^{M} w_{i}\, \mathcal{N}\!\left(\mathbf{z};\, \boldsymbol{\mu}_{i},\, \boldsymbol{\Sigma}_{i}\right),$$

where $w_i$, $\boldsymbol{\mu}_i$, and $\boldsymbol{\Sigma}_i$ denote the weight, mean, and covariance of the $i$-th Gaussian component.
For each Gaussian component, we optimize the object orientation following the max-mixture formulation, which replaces the sum over components with a max so that each iteration selects and optimizes the most likely component.
We perform the optimization in the tangent space of $SO(3)$, parametrizing orientation updates as minimal perturbations via the exponential and logarithm maps.
Similar to the 3D translation approach (Equation (7)), we optimize the object orientation using an iterative Gauss-Newton procedure.
To compute the weight, $w_i$, for each Gaussian component, we accumulate the LINE-2D confidence scores, $c_i$, from the orientation measurements within each component across the viewpoints. The weight can then be approximated by normalizing the accumulated scores, $w_i \approx c_i / \sum_{j} c_{j}$.

Figure 6. Max-mixtures for processing the object orientation measurements. Note that we show the distribution only on one axis for demonstration purposes. (a) Acquired orientation measurements from different viewpoints. (b) Mixture distribution after two viewpoints. (c) Mixture distribution after five viewpoints. (d) Mixture distribution after eight viewpoints.
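The per-iteration component selection at the heart of the max-mixture scheme can be sketched as follows. This is a minimal illustration under our assumptions (tangent-space residuals on $SO(3)$, information matrices per component); the paper’s exact residual and weighting may differ.

```python
import numpy as np

def log_so3(R):
    """Logarithm map SO(3) -> R^3 (axis-angle vector)."""
    theta = np.arccos(np.clip((np.trace(R) - 1) / 2, -1.0, 1.0))
    if theta < 1e-8:
        return np.zeros(3)
    v = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return theta / (2 * np.sin(theta)) * v

def max_mixture_residual(R_est, components):
    """Select the dominant Gaussian component under the max-mixture
    approximation: the sum over components is replaced by a max, so each
    Gauss-Newton iteration optimizes a single best-matching unimodal term.
    `components` holds tuples (weight w_i, mean rotation R_i, 3x3 info W_i)."""
    best, best_ll = None, -np.inf
    for w, R_i, W_i in components:
        r = log_so3(R_i.T @ R_est)       # tangent-space residual to this mode
        # Component log-likelihood up to a constant shared by all components.
        ll = np.log(w) + 0.5 * np.log(np.linalg.det(W_i)) - 0.5 * r @ W_i @ r
        if ll > best_ll:
            best, best_ll = (r, W_i), ll
    return best                           # optimize only this residual term
```

Because only the winning component contributes a residual per iteration, the optimization retains the cost of a unimodal problem while still tracking multiple hypotheses.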
Active pose estimation using next-best-view
In the previous section, we solved the multi-view object pose estimation problem using a two-step optimization formulation. However, the accuracy of the estimated object pose heavily depends on the RGB measurements collected from the selected camera viewpoints. Moreover, in many real-world applications, capturing a large number of viewpoints is impractical. To overcome this limitation, we introduce an active object pose estimation process. This approach not only estimates the uncertainty of the object pose but also predicts the next-best-view to minimize that uncertainty.
Initialization and uncertainty estimation
We initialize our active object pose estimation process with a collection of measurement sets acquired from an initial set of camera viewpoints.
3D translation
As discussed previously, we assume that the object’s translation follows a unimodal Gaussian distribution, with its covariance obtained from the NLLS solution.
3D orientation
In contrast, the uncertainty calculation for the object orientation is more complex due to the Gaussian-mixture formulation.

Figure 7. Orientation uncertainty under different edge alignments: (a) low orientation uncertainty when edge alignment is accurate.
We begin by deriving the Jacobian of the projected edge points and their associated measurement uncertainties, which will later be used to compute the orientation covariance. For a set of N 3D model edge points, we denote their stacked coordinates as a single vector in $\mathbb{R}^{3N}$.
The covariance matrix of each orientation component is then approximated as the inverse of the Fisher information constructed from these stacked Jacobians and measurement uncertainties.
To compute the total entropy over the Gaussian mixture, we apply Equations (37)–(39) to each Gaussian component and substitute the results into Equation (36).
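Since the entropy of a Gaussian mixture has no closed form, a standard way to combine the per-component terms is the weighted-sum upper bound shown below. This sketch illustrates one plausible instantiation of the computation; it is not a reproduction of the paper’s exact Equations (36)–(39).

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy of a k-dim Gaussian: 0.5 * ln((2*pi*e)^k |Sigma|)."""
    k = cov.shape[0]
    return 0.5 * np.log(((2 * np.pi * np.e) ** k) * np.linalg.det(cov))

def mixture_entropy_upper_bound(weights, covs):
    """Closed-form upper bound on the entropy of a Gaussian mixture:
    the weighted sum of component entropies plus the entropy of the
    mixture weights themselves."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    h_components = sum(wi * gaussian_entropy(c) for wi, c in zip(w, covs))
    h_weights = -np.sum(w * np.log(w + 1e-12))
    return h_components + h_weights
```

A confident, single-mode estimate drives both terms down: the dominant component’s covariance shrinks and the weight entropy collapses toward zero.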
Next-best-view prediction
In our next-best-view setup, we operate with a predefined set of camera viewpoints, from which the next viewpoint is selected at each iteration.
Suppose we have already collected object center measurements from the visited viewpoints. For each candidate viewpoint, we predict the expected center measurement and construct the corresponding translation covariance (Figure 8(b)).
For the orientation component, similarly, we define the stacked Jacobian for each Gaussian component and predict the expected orientation covariance at each candidate viewpoint.

Figure 8. Visualization of the covariance matrix construction for object 3D translation. (a) The construction from collected measurements and (b) the construction when predicting the NBV.
We determine our NBV from the candidate viewpoint set by selecting the viewpoint that minimizes the predicted object pose entropy.
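The selection rule itself reduces to a small loop over candidates, sketched below. The `predict_information` callback stands in for the Fisher-information construction described above (e.g., the predicted $\mathbf{J}^{\top}\boldsymbol{\Sigma}^{-1}\mathbf{J}$ at that view); its name and signature are illustrative.

```python
import numpy as np

def select_next_best_view(candidates, predict_information, prior_info):
    """Pick the candidate viewpoint minimizing the predicted pose entropy.
    `prior_info` is the Fisher information accumulated from the viewpoints
    visited so far; information from independent measurements is additive."""
    best_view, best_entropy = None, np.inf
    for view in candidates:
        info = prior_info + predict_information(view)   # information adds
        cov = np.linalg.inv(info)                       # predicted covariance
        k = cov.shape[0]
        entropy = 0.5 * np.log(((2 * np.pi * np.e) ** k) * np.linalg.det(cov))
        if entropy < best_entropy:
            best_view, best_entropy = view, entropy
    return best_view
```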
Once the next-best-view is selected, the camera is moved to the corresponding viewpoint, a new RGB image is captured, and the object pose estimates are updated with the new measurements.
Experiments
Datasets
We evaluate our framework on three challenging real-world datasets: the public ROBI (Yang et al., 2021a) and TOD (Liu et al., 2020) datasets, and a new dataset of textureless transparent objects, T-ROBI, which we created for this work. The ROBI dataset contains seven textureless reflective industrial parts placed in complex bin scenarios, recorded from multiple viewpoints using two sensors: a high-end Ensenso camera and a commodity-level RealSense camera. The TOD dataset contains 15 transparent objects across six categories, each captured in isolated settings with diverse backgrounds and multiple RGB-D views per scene.
T-ROBI dataset
To further validate the effectiveness of our approach, we introduce the T-ROBI (Transparent Reflective Objects in BIns) dataset. This dataset includes two representative components: a “Bottle” and a “Pipe Fitting,” as illustrated in Figure 9. Unlike other publicly available transparent object datasets (Liu et al., 2020; Sajjan et al., 2020; Xu et al., 2021), which typically focus on isolated objects, our dataset presents a more challenging scenario. It consists of images containing multiple identical parts randomly stacked within a bin, thereby significantly increasing the difficulty of object pose estimation. For each object, we captured six distinct scenes from 55 camera viewpoints using the high-end Ensenso N35 camera (IDS, 2025). For each viewpoint, both monochrome images and depth maps are provided. However, as illustrated in Figure 9(b), the transparency of the objects results in significant depth inaccuracies or missing data, making it particularly challenging to label ground-truth 6D object poses. To address this, we adopted the ground-truth labeling method from the ROBI dataset (Yang et al., 2021a), utilizing a scanning spray (AESUB, 2025) to capture accurate ground-truth depth maps of all bins. Example ground-truth depth maps, object CAD models, and annotated 6D object poses from the T-ROBI dataset are shown in Figure 9(c)–(e), respectively. Upon publication of this work, we will release a public version of our T-ROBI dataset. It is designed to support 6D pose estimation (Chen et al., 2023; Liu et al., 2020) as well as depth estimation tasks (Sajjan et al., 2020; Xu et al., 2021) for transparent objects in challenging cluttered and occluded bin-picking scenes.

Figure 9. T-ROBI dataset: (upper) the object “Bottle” and (lower) the object “Pipe Fitting.” (a) Monochrome images, (b) raw depth maps, (c) ground-truth depth maps, (d) 3D CAD models of the objects, and (e) ground-truth 6D object poses.
Synthetic dataset
To facilitate network training, we introduce a large-scale synthetic dataset comprising objects from both the ROBI and T-ROBI datasets, as illustrated in Figure 10. For each scene, we provide the RGB images, depth maps, object masks, and 6D poses. Our simulation environment is built using the Bullet physics engine (Coumans and Bai, 2016) in conjunction with Blender software (Community, 2018). The process begins with importing each object’s CAD model into Blender, where we manually specify its color and material properties. After preparing the object, we load it into the simulation and drop it from various positions and orientations within the bin using the Bullet physics engine. This approach allows us to generate a wide variety of object poses, clutter levels, and occlusions. Next, we adjust both the light source and camera pose to different viewpoints above the bin and render the scene using Blender, resulting in high-quality visual representations for our dataset. Finally, we utilize the Ensenso SDK (IDS, 2025) to generate synthetic depth images, as shown in Figure 10(c). For each object, we produce approximately 6000 to 13,000 images. We will also release our synthetic dataset upon publication.

Figure 10. Examples of our generated synthetic data using the Blender rendering software (Community, 2018) with the Bullet physics engine (Coumans and Bai, 2016). (a) The RGB images, (b) the object masks, (c) the depth maps, and (d) the ground-truth 6D object poses. From top to bottom: the objects “D-Sub Connector” and “Zigzag” from the ROBI dataset (Yang et al., 2021a) and “Bottle” from the T-ROBI dataset.
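The physics step of this pipeline can be sketched with the pybullet Python bindings, as below. The file names, part counts, and drop ranges are illustrative assumptions; rendering and depth synthesis are handled separately in Blender and the Ensenso SDK as described above.

```python
import numpy as np
import pybullet as p

def drop_parts(part_urdf, bin_urdf, n_parts=20, settle_steps=1000, seed=0):
    """Drop n_parts copies of a part into a bin and return their settled
    6D poses, mirroring the physics stage of our data generation."""
    rng = np.random.default_rng(seed)
    p.connect(p.DIRECT)                      # headless physics simulation
    p.setGravity(0, 0, -9.81)
    p.loadURDF(bin_urdf, useFixedBase=True)  # static bin geometry
    bodies = []
    for _ in range(n_parts):
        pos = [rng.uniform(-0.1, 0.1), rng.uniform(-0.1, 0.1),
               rng.uniform(0.3, 0.5)]        # random drop position above bin
        quat = p.getQuaternionFromEuler(rng.uniform(0, 2 * np.pi, 3).tolist())
        bodies.append(p.loadURDF(part_urdf, pos, quat))
    for _ in range(settle_steps):            # let the pile settle
        p.stepSimulation()
    poses = [p.getBasePositionAndOrientation(b) for b in bodies]
    p.disconnect()
    return poses                              # ground-truth 6D pose per part
```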
YOLO detection performance
Table 1. YOLOv8 object detection performance on the ROBI and T-ROBI datasets under different IoU thresholds. The table reports the detection rate and false detection rate for both thresholds.
As shown in Table 1, YOLOv8 achieves a high detection rate (above 90%) and a low false rate (below 15%) across all datasets when evaluated with the 0.7 IoU threshold. Even under the stricter 0.8 IoU criterion, the detection performance remains consistently strong. Representative results in Figure 11 further confirm that YOLOv8 provides sufficiently accurate detections for the subsequent object pose estimation stages.

Figure 11. Qualitative results of YOLOv8 object detection on the ROBI and T-ROBI datasets.
Baselines and implementations for pose estimation
We quantitatively evaluate our approach against three prominent baselines: Multi-View 3D Keypoints (MV-3D-KP) (Li and Schoellig, 2023) and two variants of CosyPose (Labbé et al., 2020). To ensure a fair comparison, all methods are trained only on the synthetic dataset. During run-time, we utilize identical object bounding box detections and provide ground-truth multi-view camera poses.

MV-3D-KP: Multi-View 3D Keypoints (Li and Schoellig, 2023) builds upon the single-view approach of PVN3D (He et al., 2020) and specializes in estimating 6D object poses by leveraging both RGB and depth data. MV-3D-KP provides excellent scalability, allowing the incorporation of additional views to enhance accuracy and reduce uncertainty in pose estimation. As shown in Li and Schoellig (2023), this method demonstrates exceptional performance on the ROBI dataset, setting a high standard in the field.

CosyPose+PVNet: CosyPose (Labbé et al., 2020) is a multi-view pose fusion solution that takes 6D object pose estimates from individual viewpoints as input and optimizes overall scene consistency. Note that CosyPose is an offline, batch-based solution that is agnostic to the choice of pose estimator. In our implementation, we utilize a learning-based approach, the Pixel-Wise Voting Network (PVNet) (Peng et al., 2019), to acquire the single-view pose estimates. PVNet first detects 2D keypoints and then solves a Perspective-n-Point (PnP) problem for pose estimation. This approach naturally deals with object occlusion and achieves remarkable performance.

CosyPose+LINE2D: To provide single-view pose estimates for CosyPose, we additionally utilize the LINE-2D pose estimator with the same object center, object edge, and segmentation mask (from our MEC-Net). To feed reliable single-view estimates to CosyPose, we use two strategies to obtain scale information. For the first strategy, we generate templates at multiple distances during training (nine distances in our experiments) and perform standard template matching at inference time. This strategy significantly improves single-view pose estimation performance at the cost of run-time speed and is treated as the RGB version. For the second strategy, we directly use the depth images at inference time to acquire the object scale and refer to it as the RGB-D version.
We implement our MEC-Net using the PyTorch library, employing ResNet-18 (He et al., 2016) as the backbone network. The MEC-Net is trained from scratch using the Adam optimizer (Kingma and Ba, 2015), with a batch size of 640 and a learning rate of 0.001 over 100 epochs on an RTX A6000 GPU. To ensure a fair comparison between MV-3D-KP and PVNet, we use the same ResNet-18 backbone and maintain consistent hyperparameters during training.
Evaluation metrics for pose estimation
In our evaluation, we consider a ground-truth pose only if its visibility score is larger than 75%. We adopt two metrics to evaluate pose estimation performance: the symmetry-aware average model distance (ADD*) and the stricter 5-mm/10-degree (5 mm, 10°) metric.
For objects with known geometric symmetries, the standard ADD metric (Hinterstoisser et al., 2012) may incorrectly penalize poses that are visually indistinguishable under the object’s symmetry group. A commonly used alternative, ADD-S, handles symmetry via nearest-neighbor matching but can tolerate large pose errors. To address this, we adopt a symmetry-aware variant of ADD, denoted as ADD*, that leverages the object’s known symmetry group. Let $\mathcal{S}$ denote the object’s discrete symmetry group; ADD* computes the standard ADD error against each symmetry-equivalent ground-truth pose and takes the minimum:

$$\text{ADD*} = \min_{\mathbf{S} \in \mathcal{S}} \frac{1}{m} \sum_{\mathbf{x} \in \mathcal{M}} \left\| \left(\mathbf{R}\mathbf{x} + \mathbf{t}\right) - \left(\bar{\mathbf{R}}\mathbf{S}\mathbf{x} + \bar{\mathbf{t}}\right) \right\|,$$

where $\mathcal{M}$ is the set of $m$ model points, $(\mathbf{R}, \mathbf{t})$ is the estimated pose, and $(\bar{\mathbf{R}}, \bar{\mathbf{t}})$ is the ground-truth pose.
To further evaluate pose accuracy, we also use the stricter 5-mm/10-degree (5 mm, 10°) metric. We reuse the symmetry-equivalent ground-truth poses defined for ADD*: the rotation and translation errors are evaluated against each symmetry-adjusted ground-truth pose, and a pose is accepted if, for any symmetry-equivalent pose, the translation error is below 5 mm and the rotation error is below 10°.
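A minimal NumPy sketch of the ADD* computation, following the definition reconstructed above, is shown below. The acceptance threshold in the trailing comment is the commonly used 10%-of-diameter rule for ADD-style metrics, stated here as an assumption rather than the paper’s confirmed setting.

```python
import numpy as np

def add_star(R_est, t_est, R_gt, t_gt, model_pts, symmetries):
    """Symmetry-aware ADD: evaluate the average model-point distance against
    every symmetry-equivalent ground-truth pose and keep the minimum.
    `symmetries` is the object's discrete symmetry group as 3x3 rotation
    matrices (including the identity); `model_pts` is (m, 3)."""
    errs = []
    for S in symmetries:
        R_sym = R_gt @ S                  # symmetry-adjusted ground truth
        d = (model_pts @ R_est.T + t_est) - (model_pts @ R_sym.T + t_gt)
        errs.append(np.linalg.norm(d, axis=1).mean())
    return min(errs)

# A pose is typically accepted if add_star(...) < 0.1 * object_diameter.
```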
Pose estimation results on ROBI
We conduct experiments on the ROBI dataset with a variable number of viewpoints (4 and 8), with the viewpoints carefully chosen to provide broad coverage of the scene. Figure 12(a) illustrates the qualitative superiority of our approach. Quantitative results on the Ensenso and RealSense test sets are presented in Tables 2 and 3, respectively. The results show our method outperforms the RGB baselines by a wide margin and is competitive with the RGB-D approaches, without the need for depth measurements.

Figure 12. Qualitative results of our approach on the ROBI and T-ROBI datasets. Pose estimation performance is depicted using color coding: green indicates detections that satisfy the ADD* metric, while red indicates those that do not. The results are generated using eight camera viewpoints. To enhance visualization, the estimated object poses are overlaid on the ground-truth depth map.

Table 2. Detection rates of 6D object pose estimation on the Ensenso test set from the ROBI dataset, evaluated with the ADD* and (5 mm, 10°) metrics. There are a total of nine scenes for each object. (a) In our evaluation, we treat the object “Gear” as symmetric about the Z-axis with an order of 12. (b) In our evaluation, we treat the object “D-Sub Connector” as symmetric about the Z-axis with an order of 2.

Table 3. Detection rates of 6D object pose estimation on the RealSense test set from the ROBI dataset, evaluated with the ADD* and (5 mm, 10°) metrics. There are a total of four scenes for each object.
In the Ensenso test set, it is noteworthy that “MV-3D-KP” demonstrates exceptional performance, achieving state-of-the-art results on the ROBI dataset. This success is largely attributed to the high-quality depth maps produced by the Ensenso 3D camera. Specifically, when utilizing RGB-D data, the “MV-3D-KP” method achieves an overall detection rate of 91.8% using four views and 94.6% using eight views, as measured by the ADD* metric. Additionally, it achieves an overall detection rate of 88.2% with four views and 92.0% with eight views using the (5 mm, 10°) metric. In comparison, despite relying solely on RGB data, our approach demonstrates competitive performance, with detection rates only 5.4% and 3.7% lower than MV-3D-KP for four-view and eight-view data, respectively, as measured by the ADD* metric. When utilizing only RGB images, our approach significantly outperforms both “CosyPose+PVNet” and “CosyPose+LINE2D,” achieving margins of at least 51.7% and 39.4% for the 4-view and 8-view configurations, respectively, as measured by the ADD* metric. With the availability of depth data, the performance of “CosyPose+LINE2D” shows substantial improvement, representing its upper bound. In contrast, our method exceeds this upper bound by a clear margin, achieving detection rates that are 13.0% and 7.2% higher on the 4-view and 8-view test sets, respectively, with the ADD* metric. A similar margin is observed when using the (5 mm, 10°) metric.
In the RealSense test set, the degraded quality of depth data presents challenges for both the “MV-3D-KP” and “CosyPose+LINE2D” (RGB-D version) methods. In contrast and as expected, our approach maintains a comparable detection rate. Specifically, for the 4-view configuration, our approach trails “MV-3D-KP” by only 3.8% and 10.2% under the ADD* and (5 mm, 10°) metrics, respectively. With the 8-view configuration, our approach achieves the best performance of 90.1% under the ADD* metric and is only 2.8% lower than “MV-3D-KP” under the (5 mm, 10°) metric.
Pose estimation results on T-ROBI
Table 4. Detection rates of 6D object pose estimation on the T-ROBI dataset, evaluated with the ADD* and (5 mm, 10°) metrics. There are a total of six scenes for each object.
In contrast, the “MV-3D-KP” and “CosyPose+LINE2D” (RGB-D version) approaches show low detection rates, largely due to significant depth missing and inaccuracies. These results highlight the advantage of our RGB-only approach for transparent objects that typically challenge depth-based methods.
Pose estimation results on TOD
For evaluation on the TOD dataset (Liu et al., 2020), we compare against KeyPose (Liu et al., 2020), the state-of-the-art method for 6D pose estimation of transparent objects and the leading approach reported on the TOD dataset. We follow the KeyPose experimental protocol, using the same training and testing data to ensure a fair comparison. In our experiments, we evaluate a total of six objects, selecting one representative object from each category. For each object, one texture is held out for testing, resulting in approximately 3000 training samples and 320 test samples per object.
Table 5. Comparison of our method with KeyPose+CosyPose on the TOD dataset (Liu et al., 2020). Evaluations use four stereo pairs (eight viewpoints per scene) with the ADD* and (5 mm, 10°) metrics. Bold marks the best result in each column.

Figure 13. Qualitative results of our approach on the testing and validation sets of the TOD dataset. The results are generated using four stereo pairs (eight viewpoints). From left to right: Mug4, Cup0, Tree, and Heart.
Comparison with RGB-D baselines
To evaluate the practical advantage of our RGB-only approach, we compare it against the MV-3D-KP baseline on the ROBI and T-ROBI datasets under different depth configurations:

Ground-Truth (GT) Depth: Captured with scanning spray and a high-end Ensenso sensor for optimal quality. This depth serves as an oracle reference and is not available in practice.

Raw Depth: Directly captured from either a high-end Ensenso sensor or a commodity-level RealSense camera.

DA+GT Scale: Depth is predicted from the RGB image using Depth-Anything V2 (Yang et al., 2024b). For each frame, the predicted relative depth is converted to metric depth by aligning it to the corresponding ground-truth depth using scale and shift (Ganj et al., 2025). This represents the upper-bound performance of Depth-Anything V2.
Table 6. Comparison of our RGB-only approach against MV-3D-KP with different depth sources on the ROBI and T-ROBI datasets. GT Depth is an oracle reference (not available in practice). Ensenso represents a high-end depth sensor, RealSense a commodity sensor, and DA+GT Scale refers to Depth-Anything V2 with per-frame GT scale and shift alignment. All results are computed using eight viewpoints per scene. Bold indicates the best practical result, and underline indicates the second-best result.
When using Depth-Anything V2, MV-3D-KP performance improves slightly only for objects that are completely invisible to the sensor, such as the transparent objects in T-ROBI. Even in these cases, it still performs worse than our RGB-only method by at least 35.1%. In contrast, on the ROBI dataset, where objects are reflective but depth can be directly sensed, DA+GT provides minimal improvement and performs worse than on T-ROBI, mainly due to the high clutter in ROBI bins. For these ROBI objects, MV-3D-KP with DA+GT remains far below both its performance with real depth and our RGB-only approach. Overall, these results demonstrate that while predicted depth can help in extreme cases, it cannot replace real depth measurements or multi-view RGB cues for robust pose estimation.
Ablation studies on pose estimation
Table 7. Ablation studies on different configurations for 6D object pose estimation on the ROBI and T-ROBI datasets. Results report detection rates based on the ADD* and 5-mm/10-degree metrics. Object Edge refers to utilizing the object’s 2D edge map from MEC-Net to obtain per-frame object orientation measurements. Sequential Process shows our method when 6D pose estimation is decomposed into a two-step sequential process. We report run-time for per-frame orientation estimation using LINE-2D in milliseconds per object, tested on a laptop with an Intel 2.60 GHz CPU. Bold indicates the best result, and underline indicates the second-best.
Edge map
For optimizing the object 3D orientation, we use a template matching-based orientation estimator, LINE-2D, to obtain the per-frame object orientation measurement. However, LINE-2D is susceptible to occlusion and spurious edges. Compared to our previous approach (Yang et al., 2023b), we address these problems by leveraging our MEC-Net to directly produce the object’s 2D edge map. To demonstrate the advantage of this approach, we compare the final results with and without the edge map. In cases where edge maps are unavailable, we take the object mask from the MEC-Net and feed the re-cropped object RoI into the LINE-2D estimator. Table 7 clearly shows a significant increase in the correct detection rate when the estimated edge map is utilized. This effect is more pronounced under the (5 mm, 10°) metric, which imposes a stricter criterion on orientation error.
Sequential process
As discussed in the problem formulation, the core idea of our method is the decoupling of 6D pose estimation into a two-step sequential process. This process first resolves the scale and depth ambiguities in the RGB images and greatly improves the orientation estimation performance. To justify its effectiveness, we consider an alternative version of our approach, one which simultaneously estimates the 3D translation and orientation. This version uses the same strategy to estimate the object translation. However, instead of using the provided scale from the translation estimates, it uses the multi-scale trained templates (similar to the RGB version of CosyPose) to acquire orientation measurements. Table 7 shows that, due to the large number of templates, the run-time for orientation estimation is generally slow for the simultaneous process version. In comparison, our sequential process not only operates with a much faster run-time speed but also has slightly better overall performance.
Next-best-view evaluation
In our setup, we operate with a predefined set of camera viewpoints, and the evaluation consists of selecting the next-best viewpoint from this set. We compare our approach against two heuristic-based baselines, “Random” and “Max-Distance.” “Random” selects viewpoints randomly from the candidate set, while “Max-Distance” moves the camera to the viewpoint farthest from previous observations.
The evaluation is performed on the ROBI, T-ROBI, and TOD datasets. For all view selection strategies, we use our object pose estimation method to ensure a fair comparison. To obtain the results, we initialize the object pose with two viewpoints and progressively refine it by incorporating RGB measurements selected according to each view selection strategy.
Table 8. Next-best-view evaluation. We show the object pose estimation results with different viewpoint selection strategies on the ROBI, T-ROBI, and TOD datasets. An object pose is considered correct if it satisfies the ADD* or (5 mm, 10°) metric. We initialize the pose estimation with two viewpoints. The maximum number of additional viewpoints is set to two (four viewpoints in total).
Figure 14 illustrates the trend more clearly when extending to four additional viewpoints (six viewpoints in total). Compared to the “Random” (red curve) and “Max-Distance” (green curve) baselines, our NBV policy (blue curve) consistently achieves higher or comparable ADD* performance on the ROBI, T-ROBI, and TOD datasets, showing its reliability across different scene complexities.

Figure 14. Evaluation of our next-best-view policy against heuristic-based baselines. We use our multi-view pose estimation approach for all viewpoint selection strategies. The results are evaluated using the correct detection rate with the ADD* metric on the ROBI, T-ROBI, and TOD datasets. Our approach achieves a high correct detection rate with fewer viewpoints.
Limitations and future work
Although we have demonstrated the effectiveness of our approach in real-world scenes, there are several limitations that future work can address. First, in our problem formulation, we model the object translation distribution as a unimodal Gaussian. While this assumption generally holds, it can fail in heavily occluded cases, such as a cylindrical object with both ends occluded, and adopting a multimodal distribution (Bui et al., 2020) could allow the model to capture multiple plausible translations.
Second, in our next-best-view prediction, although the object’s orientation is modeled as a multimodal distribution, the predicted viewpoints are intended to refine the final pose accuracy under the assumption that the initial pose is unambiguous. Consequently, if the object exhibits inherent visual ambiguity (e.g., from occlusion), this approach cannot resolve it, since it does not explicitly account for disambiguating between multiple plausible modes (Manhardt et al., 2019). Addressing these ambiguities is an important direction for future work.
Third, in our NBV setup, we assume a predefined set of camera viewpoints, which can be directly mapped to the robot’s poses for execution on a real robot platform but restricts the robot to discrete positions. Generating continuous motions via trajectory optimization (Falanga et al., 2018; Wang et al., 2020) could enable more informative observations, particularly on platforms equipped with an end-effector-mounted camera.
Finally, our current approach requires a 3D object CAD model and known camera poses, which limits its applicability. Future work will investigate joint estimation of object and camera poses and explore extending the active perception framework to CAD-less objects (Wang et al., 2021b; Liao et al., 2024).
Conclusion
In this work, we present a complete framework of multi-view pose estimation and next-best-view prediction for textureless objects. The core idea of our multi-view pose estimation approach is to decouple the posterior distribution into the 3D translation and 3D orientation of an object and to integrate the per-frame measurements with a two-step sequential formulation. This process first resolves the scale and depth ambiguities in the RGB images and greatly simplifies the per-frame orientation estimation problem. Moreover, our orientation optimization module explicitly handles object symmetries and counteracts measurement uncertainties with a max-mixture-based formulation. To find the next-best-view, we predict the object pose entropy via the Fisher information approximation, and new RGB measurements are collected from the corresponding viewpoint to improve the object pose accuracy. Experiments on the public ROBI and TOD datasets, along with our T-ROBI dataset, demonstrate the effectiveness and accuracy of our framework compared to state-of-the-art baselines.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by Epson Canada Ltd.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
