Abstract
A fundamental vision technique for industrial robots involves the six degrees of freedom pose estimation of target objects from a single image. However, the direct estimation of the six degrees of freedom object pose solely from a single image is subject to limited accuracy. Various refinement approaches have been proposed to improve the accuracy by utilizing rendered images from a 3D model. Nevertheless, balancing speed and accuracy in an industrial setting remains a challenge for these methods. In this study, we propose a novel six degrees of freedom pose refinement approach centered around matching real image patches. Unlike previous approaches, our method does not rely on a 3D model, resulting in increased speed and accuracy. In the offline phase, we construct an offline database using image patches obtained from real images. During the inference phase, our method initially identifies the image patch within the offline database that is closest to the initial pose. Subsequently, we refine the six degrees of freedom pose by matching the corresponding image patches from the offline database. Experimental results indicate that our six degrees of freedom pose refinement method achieves real-time capability with a frame rate of 71 Frames Per Second (FPS), along with high precision. When the threshold is set to 0.5% of the object diameter, the average distance of dots score on the test data surpasses 70%. Moreover, experiments involving gripping and assembling tasks on an industrial robot demonstrate the ability of our method to autonomously select appropriate grasping angles and positions in real time. It further generates suitable motion paths, ultimately ensuring production efficiency.
Introduction
Industrial robots are widely used in the electronics, manufacturing, logistics, chemical, and other industrial fields, where they perform grasping, transporting, welding, spraying, assembling, and other tasks. With the support of artificial intelligence technology, robots can achieve higher efficiency, lower energy consumption, and better environmental adaptability. To achieve high-quality development, the requirements for robot accuracy have greatly increased; for example, assembly accuracy has improved from the millimeter level to the submillimeter level, as in smart part assembly. On existing production lines, the grasping point of the industrial robot is fixed, and the workpiece must be placed in advance at a fixed position and pose, which is inefficient and cannot meet complex industrial production requirements. Using cameras as the primary sensor to perform high-precision assembly tasks in a semi-structured environment remains a serious challenge for robots.
High-precision pose estimation plays a crucial role in robot grasping and assembly, enabling precise manipulation of objects. Some six degrees of freedom (6 DOF) pose estimation methods 1 use red, green, blue (RGB) images 2,3 to predict key points and then employ the perspective-n-point (PnP) algorithm 4 to estimate the object pose. However, achieving high accuracy is challenging because the PnP algorithm is sensitive to small errors in the 2D representation. To enhance pose estimation accuracy, researchers have developed methods that refine the pose by analyzing the feature difference between the input image and an image rendered from a 3D model, such as the Deep Iterative Matching (DeepIM) 5 approach. These methods leverage high-quality 3D models to generate detailed rendered images. However, acquiring such models can be costly, and rendering generally requires significant computational resources.
These challenges significantly impact the speed, accuracy, and usefulness of pose estimation methods for accurate robotic manipulation. To address these limitations, our method enhances pose estimation accuracy by leveraging stereovision techniques. We draw inspiration from stereovision to identify pixel-level precise feature point pairs through feature matching. Instead of relying solely on corners or key points on the object's surface, we employ the scale-invariant feature transform (SIFT) 6 and fast library for approximate nearest neighbors (FLANN) 7 algorithms for feature point extraction and matching to generate 2D key points. We then utilize these 2D key points and calculate their corresponding 3D points in multiview images for the PnP algorithm. This approach ensures high-precision 2D-3D correspondence and avoids introducing errors from the 3D model acquisition process. Our method adopts patch matching, which is more robust than point matching, to refine the 2D key points used for the PnP algorithm. Inspired by the Siamese neural network 8 and Siam-Fc, 9 we designed a matching refinement network (M_R_Net) to refine 2D points using a template image patch. Rather than utilizing rendered images derived from a 3D model as in DeepIM, our approach employs real images obtained from a database to generate template image patches. This allows us to circumvent 3D model acquisition and to exploit the realistic textures present in real images, thereby yielding improved matching accuracy. The database can be easily obtained by collecting multiview images. To improve the inference speed of our method, we designed a multiway tree-based search method to efficiently access the database so that template image patches can be generated quickly.
In the initial stage, our approach involves constructing an offline database using real image patches. During the inference phase, our method begins by identifying image patches in the offline database that closely resemble the initial pose. Subsequently, we refine the 6 DOF pose by comparing and matching these image patches from the offline database with the sub-window extracted from the input image. The pose refinement process of our method is shown in Figure 1. Our approach operates in four stages: (a) find the template image patch according to the initial 6 DOF pose, (b) project the 3D points, (c) match the image patches, and (d) compute the refined pose.

The process of the algorithm: (a) find template image patch, (b) project 3D points, (c) image patch matching, and (d) compute refined pose.
More specifically, our contributions are as follows: We propose a tiny fully convolutional Siamese neural network that is fast, accurate, and robust enough for image patch matching. We propose a 6 DOF pose refinement method that does not require a 3D mesh model and is fast, accurate, versatile, and robust enough to be easily integrated with robotic systems in industrial environments, so that the robot can automatically select the appropriate grasping angle and position in real time according to the pose of the incoming material, and generate a suitable motion path to avoid collisions and complete high-precision grasping and assembly tasks.
The organization of this paper is as follows: the second section reviews related work. The "Proposed method" section describes our method in detail: the section "Find template image patch by initial 6 DOF pose" explains how to find the template image patch from the initial 6 DOF pose, the "Image patch matching" section introduces the image patch matching phase, and the section "6 DOF pose refinement" discusses the optimization process. In the fourth section, we conduct accuracy tests on a dedicated data set and verify the real-time performance, accuracy, and robustness of our monocular 6 DOF pose estimation method on a robot platform. Lastly, the fifth section summarizes the paper, its contributions, and future prospects.
Related work
Locating the 6 DOF pose of objects from RGB images alone is an important task for many robot applications. Recently, deep learning-based approaches have shown excellent performance in RGB-based 6 DOF pose estimation. At present, most deep learning-based methods train a model to establish correspondences between 2D points on the image and 3D points on the object model and then use PnP to recover the 6 DOF pose of the object. They mainly fall into two categories: (1) establish 2D-3D correspondences for pre-defined key points 10,11 and (2) establish correspondences between all pixels of the object on the image and the 3D points on the model. 12,13 BB8 14 estimates the corners of the 3D bounding box and runs the PnP algorithm. Inspired by the 2D object detector You Only Look Once (YOLOv2), 15 YOLO6D (a variation of the You Only Look Once algorithm designed for 6D object pose estimation) 16 trained an end-to-end neural network model to directly predict the projected points of the 3D corners. CullNet 17 introduced a confidence-based culling module to further improve performance over YOLO6D. Instead of regarding the corners as key points, PVNet (Pixel-wise Voting Network) 18 places key points on the object surface via the farthest point sampling algorithm and then regresses pixel-wise unit vectors pointing to the key points. However, since the PnP algorithm is sensitive to small errors in the 2D representation, obtaining high accuracy remains challenging.
6 DOF pose refinement using deep learning: To further improve the accuracy of 6 DOF pose estimation, some works optimize the pose by training a pose refinement network. DeepIM trains a deep iterative matching neural network and uses the pose difference output by the model to drive the generation of new rendered images. The deep point-based object detector 19 method directly extracts the feature difference between the input image and the image rendered with the initial pose and then estimates the optimized pose. Stevšić and Hilliges 20 introduce a spatial attention mechanism into pose refinement to recognize and utilize spatial detail information. These methods require high-quality rendered images of a 3D model for image comparison. However, high-quality 3D models are expensive to obtain, and generating richly detailed rendered images takes additional time.
Siamese neural network: The Siamese neural network measures the similarity between two inputs by sharing the weights of the two networks, which has great significance in the field of image matching. The traditional Siamese neural network accepts two inputs of the same size and then uses the distance between the two output vectors to represent the similarity of the two inputs. Siam-Fc combines the Siamese neural network and the fully convolutional neural network and improves the matching speed between image patches through parallel computing.
Proposed method
The overview of our method is depicted in Figure 1. In the offline phase, we use real images to build an offline database. In the inference phase, our method first identifies the image patches within the offline database that most closely match the initial pose, as outlined in Yang et al. 3 Subsequently, M_R_Net is utilized to refine the 6 DOF pose by aligning the image patches found in the offline database with the image patch extracted from the input image. The loss function used to train the network is described in the section "Image patch matching."
Find template image patch by initial 6 DOF pose
At the initial step, in the offline phase, we use a QR code placed at a known location near the object to establish the object coordinate system, then use N real images collected under discrete viewpoints to build an offline database, and estimate the corresponding 3D points in the object coordinate system,
where i and j are the subscripts of the m 3D points and the N real images, respectively.

Finding process: (a) compute Euler angles, (b) multiway tree-based finding, and (c) return image patch and 3D points.
Image patch matching
At the second step, we use image patch matching to refine the initial 2D coordinates obtained by projecting the 3D points with the initial pose. We designed a tiny fully convolutional image patch matching network, named M_R_Net, to find the refined position of each projected point.

M_R_Net architecture. M_R_Net: matching refinement network.
Because each image of the offline database is collected under a certain pose, in the inference stage the distance d x between the camera and the object may be greater or less than the distance d z between the camera and the object in the offline stage, that is, the distance at which the offline database image was collected. Therefore, before being passed to M_R_Net, the template image patch is rescaled to compensate for this scale difference.
Similar to correlation filter networks, 22 we use a training loss to calculate the gap between the matching position of the template image patch and the ground truth position. Following the logistic loss of Siam-Fc, it can be written as

ℓ(y, v) = log(1 + exp(−yv)),  L = (1/|D|) Σ_{u∈D} ℓ(y[u], v[u]),

where v is a value on the output matching score map D and y ∈ {+1, −1} is the corresponding ground truth label. The elements of the score map are considered to be positive if they are within radius R of the ground truth position c of the map.
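A minimal numpy sketch of such a radius-R labeled logistic loss (the exact form and the class weights are assumptions modeled on Siam-Fc; the positive/negative weighting of 1 and 2 follows the training settings in the experiments section):

```python
import numpy as np

def matching_loss(score_map, c, R, w_pos=1.0, w_neg=2.0):
    """Weighted logistic loss over a matching score map.

    Elements within radius R of the ground truth position c = (row, col)
    are labeled +1, all others -1; w_pos and w_neg weight the two classes.
    """
    h, w = score_map.shape
    ys, xs = np.mgrid[0:h, 0:w]
    dist = np.hypot(ys - c[0], xs - c[1])
    y = np.where(dist <= R, 1.0, -1.0)
    weight = np.where(y > 0, w_pos, w_neg)
    # log(1 + exp(-y*v)), computed stably via logaddexp.
    per_elem = np.logaddexp(0.0, -y * score_map)
    return float(np.sum(weight * per_elem) / score_map.size)
```

A score map that is strongly positive inside the radius and negative elsewhere drives the loss toward zero, which is the training target.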
6 DOF pose refinement
At the third step, we use the PnP algorithm to calculate the refined pose from the refined 2D positions and their corresponding 3D points in the object coordinate system.
The visualization of the pose refinement process is shown in Figure 5, where the orange, green, and blue 3D bounding boxes represent the initial, ground truth, and refined 6 DOF poses, respectively. When the refined box overlaps with the ground truth box, only the refined box is displayed. The orange, green, and blue image points represent the projected points under the initial, ground truth, and refined poses, respectively.

Pose refinement process.
Experiments
Implementation details
We captured images and built our data set using a color industrial camera (MER2-160-75GC-P) with a resolution of 1292 × 964 and a focal length of 6 mm. Our data set contains the six objects depicted in Figure 6. Around each object, we placed QR codes 23 to establish an object coordinate system. The camera was mounted on a robotic arm, and we collected 1700 images discretely and uniformly over the upper half of the camera's viewpoint space by setting robot motion trajectories. We estimated the camera pose for each image from the QR codes and then applied in-plane rotations to expand the data set to 170,000 images. For each object, we marked an initial 3D bounding box in the object coordinate system to generate ground truth for all the data.

Objects in our data set.
Search module
We use all the 170,000 images to build the offline database for each object, and set r, s, and t to 60, 45, and 60, respectively, in the section “Find template image patch by initial 6 DOF pose.”
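The multiway tree lookup over discretized Euler angles can be sketched as nested binning; a flat dictionary keyed by the (r, s, t) bin triple is an equivalent illustration, and the angle ranges below are assumptions.

```python
import numpy as np

class EulerAngleIndex:
    """Bucket database entries by discretized Euler angles so the nearest
    viewpoint can be fetched in O(k), k being the bucket size.

    Bin counts (60, 45, 60) follow the r, s, t settings in the text; the
    angle ranges are an illustrative assumption.
    """
    def __init__(self, bins=(60, 45, 60),
                 ranges=((0, 360), (0, 90), (0, 360))):
        self.bins, self.ranges = bins, ranges
        self.buckets = {}

    def _key(self, angles):
        # Map each angle to its bin index, clamped to the last bin.
        return tuple(
            min(int((a - lo) / (hi - lo) * n), n - 1)
            for a, n, (lo, hi) in zip(angles, self.bins, self.ranges))

    def add(self, angles, entry):
        self.buckets.setdefault(self._key(angles), []).append((angles, entry))

    def find(self, angles):
        # Linear scan inside one bucket: nearest entry by angle distance.
        bucket = self.buckets.get(self._key(angles), [])
        if not bucket:
            return None
        return min(bucket, key=lambda ae: np.linalg.norm(
            np.asarray(ae[0]) - np.asarray(angles)))[1]
```

At query time, the Euler angles computed from the initial pose select a bucket, and only its k entries are compared, which is what keeps the search real-time.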
Matching module
We train only one patch matching model for all objects. There are 6000 image patch pairs in the training data set; each pair consists of two image patches with resolutions 64 × 64 and 32 × 32, and the ground truth matching position is the center of the image patch. The 64 × 64 and 32 × 32 patches serve as the search patch x and the template patch z, respectively.
We train our model on a GeForce RTX Titan GPU with the stochastic gradient descent optimizer. We set the loss weights of positive and negative samples to 1 and 2, respectively. We train for 10 epochs with a learning rate of 1 × 10−2 and a momentum of 0.9, set the batch size to 16, and set the radius R of the training loss to 1. In the training phase, we randomly adjust the hue, saturation, and brightness of z and x.
Evaluation metric
In our experiment, we use the average distance of dots (ADD) 24 as the evaluation metric.
ADD computes the average distance between the 3D points in the object coordinate system transformed by the estimated 6 DOF pose and by the ground truth 6 DOF pose. If the average distance is less than a threshold, the estimated pose is considered correct. In our experiment, we set the thresholds to 0.1d, 0.01d, 0.005d, and 0.0025d, where d represents the diagonal length of the object's 3D bounding box; d acts as a scaling factor, so the threshold adapts to object size, and smaller thresholds correspond to stricter accuracy requirements. For the Cuboid, Car, Cylinder, Beer, Crest, and Hitiger, the diagonal lengths d were 133.7 mm, 172.3 mm, 113.3 mm, 180.5 mm, 206.1 mm, and 210.3 mm, respectively.
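The ADD computation described above can be written directly in numpy (point sets and pose conventions as defined in the text):

```python
import numpy as np

def add_score(pts_obj, R_est, t_est, R_gt, t_gt):
    """Average distance (ADD): mean Euclidean distance between object points
    transformed by the estimated and by the ground truth pose."""
    p_est = pts_obj @ R_est.T + t_est
    p_gt = pts_obj @ R_gt.T + t_gt
    return float(np.mean(np.linalg.norm(p_est - p_gt, axis=1)))

def is_pose_correct(pts_obj, R_est, t_est, R_gt, t_gt, d, frac=0.1):
    """A pose counts as correct if ADD < frac * d, with d the diagonal of
    the object's 3D bounding box (frac in {0.1, 0.01, 0.005, 0.0025})."""
    return add_score(pts_obj, R_est, t_est, R_gt, t_gt) < frac * d
```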
Pose refinement results
For training of YOLO6D, CullNet, and PVNet, we randomly selected 10% from the data set in the section “Implementation details.” Our method adopts YOLO6D as the initial pose estimator. For evaluation, we select 500 images for each object as test data from the remaining 90% of the data set.
We calculate ADD on the test data, as shown in Table 1; the values in the table indicate the percentage of images in which the estimated pose is deemed correct within a given threshold (larger values correspond to better accuracy). The results demonstrate that our method achieves high pose estimation accuracy. In addition, increasing the image resolution directly enhances the accuracy of our method.
ADD scores for various thresholds on our test data.
ADD: average distance of dots.
We change only the resolution of the image patches in our method and calculate ADD on the test data; the results are shown in Table 2. The accuracy of our patch matching network mainly depends on image patch z: a lower resolution of patch z means that fewer features are used for matching, resulting in lower accuracy.
ADD scores for various resolutions of image patch.
ADD: average distance of dots.
We change the resolution of the input image and calculate ADD on the test data; the results are shown in Table 3. We reduce the image resolution from 1292 × 964 to 640 × 480 through cubic spline interpolation, while adjusting the camera's intrinsic matrix accordingly. Since our patch matching model always outputs pixel-level results, reducing the resolution of the input image reduces the accuracy of our method. However, the camera resolution of a robotic platform is usually no lower than 640 × 480 in industrial environments.
ADD scores for various resolutions of input image.
ADD: average distance of dots.
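Adjusting the intrinsic matrix for a resized image, as done in this experiment, amounts to scaling the focal lengths and the principal point; a minimal sketch (ignoring the half-pixel center convention):

```python
import numpy as np

def scale_intrinsics(K, old_size, new_size):
    """Adjust a pinhole intrinsic matrix when the image is resized,
    e.g. from 1292 x 964 down to 640 x 480.

    old_size, new_size: (width, height) tuples.
    """
    sx = new_size[0] / old_size[0]  # width scale
    sy = new_size[1] / old_size[1]  # height scale
    S = np.diag([sx, sy, 1.0])
    return S @ K  # scales fx, cx by sx and fy, cy by sy
```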
We change the distance between the camera and the object, experimenting at three distances: 0.6 m, 0.8 m, and 1.0 m. For each distance, we collect the same number of images for database building, select the same number of images for testing, and use the same number of images to train the YOLO6D model. We calculate the ADD scores on the test data for each distance, as shown in Table 4, where the results at 0.6 m are from Table 1; the visualization of the results is shown in Figure 7. As the distance between the object and the camera increases, the pixel area of the object in the image decreases, so one pixel corresponds to a larger object region. The effect of the matching error of our patch matching network on the pose error is thus amplified, and the accuracy of our method decreases. However, the distance between the object and the camera is usually not fixed in practical robotics applications, and we can move the object or the camera to obtain a suitable distance, thereby reducing the abovementioned error and obtaining high pose estimation accuracy.
ADD scores for various distance on the test data.
ADD: average distance of dots.

Pose refinement visualization on test data.
Analysis of the matching module
The performance of our 6 DOF pose refinement method depends on the robustness of our image patch matching model. As shown in Figure 8(a), we fix one input image patch and perturb the other to evaluate this robustness.

Robustness of image patch matching model.
As shown in Table 5, we compared our M_R_Net with template matching methods implemented in OpenCV, namely the sum of squared distances (SSD) and normalized cross-correlation (NCC). We use 1500 image patch pairs with (64 × 64, 32 × 32) resolutions to calculate the matching accuracy of the three methods; for all pairs, the patch center is the ground truth matching position. We scale, change the HSV values of, rotate, and perspective-transform each image patch pair.
Matching refinement network.
HSV: Hue, Saturation, Value; SSD: sum of squared distance; NCC: normalized cross correlation.
The matching result is considered correct when the difference between the output matching position and the ground truth is within one pixel (inclusive), and the values in the table are the percentage of correct matching results. Experimental results show that our M_R_Net is more robust.
Running time
Our experiments are run on a desktop with an Intel i7-9700KF 3.60 GHz CPU and an NVIDIA GeForce RTX 2070 SUPER GPU. The running time of our method depends on the search module and the matching module.
Search module
The time complexity of the search is O(k): the search time grows linearly with the number of elements k in the search set, where k is the average size of the subsets of the offline database determined by r, s, and t (shown in Figure 3).
Matching module
The running time of the matching module depends on the batch-size and the resolution of the input image patch. As shown in Table 6, we change the batch-size (Bs) and the resolution (R-) of the input image patch to test our matching module. For each combination of batch-size and resolution, we calculate the average time of 1000 inferences.
Running time (ms) of matching module.
When considering both initial pose estimation and pose refinement, we set the resolution of the input image to 640 × 480, the batch size to 8, and the resolution of the image patch pair to (64 × 64, 32 × 32).
Running time of pose estimation.
Robotic evaluation
To evaluate the practicability of the proposed method, we apply it on the robotic manipulation platform to evaluate accuracy. We set up the eye-to-hand platform comprising an AUBO I5 robot, a 1.0 mm diameter Phillips screwdriver, and an industrial RGB camera, as shown in Figure 9 left, and set up the eye-in-hand platform comprising an AUBO I5 robot, an industrial RGB camera and a calibration board shown in Figure 9 right.

Robot operating platform.
Eye-to-hand
The evaluation process is shown in Figure 10. First, we calibrate the camera-robot system. Second, we select a 3D point A in the object coordinate system. Third, we use the QR code to calculate the pose of the object, move the tip of the robot's Phillips screwdriver to point A, and record the current robot end pose T1, as shown in the left column of Figure 10. Fourth, we use the proposed method to estimate the pose of the object, again move the screwdriver tip to point A, and record the current robot end pose T2, as shown in the middle and right columns of Figure 10. We use T1 and T2 to calculate the poses TP1 and TP2 of the screwdriver tip in the robot coordinate system, respectively, and then calculate the difference between the six components of TP1 and TP2, as shown in Table 8. These results verify the accuracy of the proposed pose estimation method: it obtains high pose estimation accuracy from only a single RGB image, which has high practical value for low-cost, high-precision robot scenarios.

The evaluation process visualization of eye-to-hand.
Evaluation results of the three objects.
Eye-in-hand
Similar to fast directional chamfer matching (FDCM), 25 we evaluate the consistency of the pose estimates across multiple camera viewpoints. More specifically, we placed the object and a calibration board in a complex environment and kept the object, the calibration board, and the base coordinate system of the robot relatively fixed throughout the experiment. We then command the robot arm to move through several rotations and translations so that images can be collected under various camera viewpoints. For each collected image, we use the YOLO6D model to estimate the initial pose and our method to refine it; the visual results of the refined poses are shown in Figure 11. Since the pose our method outputs represents the pose between the camera and the object, we transform all output poses into the base coordinate system of the robot. Since the object is static, the estimated pose of the object in the robot's base coordinate system should be identical irrespective of the camera viewpoint. For each target in Figure 6, we collect 20 images, transform the 20 poses into the robot's base coordinate system, and for each pose calculate the deviation from the median of the 20 poses, as shown in Figure 12, where Δrx , Δry , Δrz , Δtx , Δty , and Δtz represent the six components of the pose. For each subfigure, the horizontal axis represents the deviation from the median pose, and the vertical axis represents the proportion of poses whose deviations fall within a certain interval, such as [0.0, 0.1) or [0.1, 0.2).

The pose estimation results visualization of eye-in-hand.

Histograms of deviations from the median pose.
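Transforming the camera-frame pose output into the robot base frame is a chain of homogeneous transforms; a minimal sketch with illustrative frame names (eye-in-hand layout, camera on the flange):

```python
import numpy as np

def object_pose_in_base(T_base_end, T_end_cam, T_cam_obj):
    """Chain homogeneous transforms to express the camera-frame object pose
    in the robot base frame.

    T_base_end: robot end pose from forward kinematics.
    T_end_cam:  hand-eye calibration result (camera relative to the end).
    T_cam_obj:  pose output by the estimation method.
    """
    return T_base_end @ T_end_cam @ T_cam_obj
```

Because the object is static, this base-frame pose should be identical for every viewpoint, which is exactly the consistency checked by the histograms above.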
Nut and bolt assembly experiments
We use the proposed method to estimate the poses of the bolt and the nut relative to the camera, as shown in the left column of Figure 13, and then use the robot's suction end effector to lift the nut. To eliminate the pose disturbance caused by the suction process, we use the proposed method to re-estimate the pose of the nut held by the robot relative to the camera, as shown in the second column of Figure 13. Then, from the estimated nut-camera pose and the robot's hand-eye calibration result, we obtain the relative pose between the nut and the robot's end. At the same time, from the estimated bolt-camera pose and the hand-eye calibration result, we obtain the pose that the nut must reach to complete the assembly, expressed in the robot's base coordinate system, and finally the end-effector pose transformation required to complete the assembly in the base coordinate system. The robot's end is then controlled accordingly to complete the assembly, as shown in the right column of Figure 13. To evaluate the assembly accuracy, we fix the relative pose between the bolt and the robot base coordinate system and change the relative pose between the nut and the bolt in each experiment; each row in Figure 13 represents one assembly experiment. Because the bolt is fixed relative to the robot, in the ideal case the pose of the assembled nut in the robot base coordinate system should always be the same. Due to pose estimation error, the relative pose between the nut and the robot base frame differs slightly each time the assembly is completed. We calculate this pose for 20 completed assemblies and draw the deviation histogram of each component with respect to the median value.
The assembly accuracy is evaluated as shown in Table 9, where Δrx , Δry , Δrz , Δtx , Δty , and Δtz have the same meanings as in Figure 12.

The process of robot assembly.
Statistical histograms of 6 DOF pose estimation.
DOF: degree of freedom.
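The transform chain used to derive the assembly target can be sketched as follows; all frame names are illustrative, and the layout assumes the fixed (eye-to-hand) camera of this experiment:

```python
import numpy as np

def assembly_end_pose(T_base_cam, T_cam_bolt, T_bolt_nut_goal, T_cam_nut,
                      T_base_end):
    """Compute the robot end pose that places the picked-up nut at its
    assembly pose on the bolt.

    T_base_cam:      camera pose in the base frame (hand-eye calibration).
    T_cam_bolt:      estimated bolt pose in the camera frame.
    T_bolt_nut_goal: assembled nut pose relative to the bolt.
    T_cam_nut:       estimated pose of the nut held by the robot.
    T_base_end:      current end pose when the nut was observed.
    """
    # Where the nut must end up, in the robot base frame.
    T_base_nut_goal = T_base_cam @ T_cam_bolt @ T_bolt_nut_goal
    # Where the nut sits relative to the end effector (constant once gripped).
    T_end_nut = np.linalg.inv(T_base_end) @ T_base_cam @ T_cam_nut
    # End pose that realizes the goal nut pose.
    return T_base_nut_goal @ np.linalg.inv(T_end_nut)
```

Re-estimating T_cam_nut after pickup is what makes this chain robust to the pose disturbance introduced by suction.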
Experiments show that our method obtains high pose estimation accuracy; even in complex backgrounds it still performs well, which makes it suitable for practical robotic applications.
Conclusion
In this paper, we propose a pose refinement method based on real image patch matching. Compared with methods that require a 3D model, 26–28 our method does not need 3D object mesh models and avoids the difficulty of generating real-world 3D models with high-quality textures. Meanwhile, based on our efficient template search method and the fully convolutional patch matching network M_R_Net, our method achieves real-time performance and high precision. In the experiments, our 6 DOF pose refinement method achieves high speed (71 FPS) and accuracy (an ADD score higher than 70% on the test data when the threshold is 0.5% of the object diameter). The grasping and assembly experiments on industrial robots show that the method can automatically select the appropriate grasping angle and position in real time and generate a suitable motion path to avoid collisions, thus ensuring production efficiency. However, in the image patch matching process, some false positive matches occur due to the existence of very similar objects. In the future, we can further improve the system by adding a validation step that evaluates the quality and accuracy of the matched feature points, including assessing the geometric consistency of the matched features and using outlier exclusion techniques to eliminate mismatches.
Supplemental Material
Supplemental Material, sj-pdf-1-arx-10.1177_17298806241229270 for A patch-based real-time six degrees of freedom object pose refinement method for robotic manipulation by Yang Yang, Qian Jiang, Quan Mu, Hanlin Huang, Tian Yang, Wenbo Li, Yilin Yuan, Jian Wen and Fei Liu in International Journal of Advanced Robotic Systems
Footnotes
Authors’ note
YY and QJ contributed equally to this work.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported in part by National Natural Science Foundation of China [T2222018] and Innovation group science fund of Chongqing Natural Science Foundation [cstc2019jcyj-cxttX0003].
References
