Abstract
Model-based stereo vision pose estimation depends on the establishment of the model. The photo-model-based method simplifies model building by requiring only one photo: the shapes, colors, and patterns of objects need not be predefined in a programming language. Previously, however, it was necessary to calculate a pixel per metric ratio, that is, the number of pixels per millimeter of the object, from the photo’s shooting distance in order to generate a photo-model with the same size (length and width) as the actual object. This restricts real-world application. The proposed method extends the traditional photo-modeling algorithm and relaxes this prerequisite on the photo used for target pose determination. Various pixel per metric ratios are assumed to generate 3D photo-models of different sizes, and these models are then matched against stereo vision images to detect the pose of the target object. Since it is not a data-driven method, it does not require many pictures or pretraining time. This article applies the algorithm to the cleaning of seaports and aquaculture facilities, aiming to locate dead or diseased marine life on the water surface before collection. Pose estimation experiments were conducted to detect an object’s pose and a prepared photo’s pixel per metric ratio in real application scenarios. The results show that the expanded photo-model stereo vision method can estimate the pose of a target from a single photo whose pixel per metric ratio is unknown.
Introduction
Robots play an important role in marine biological fishing, invasive species control, and surface collection. Using robots such as autonomous underwater vehicles (AUVs) to precisely position and target marine creatures can reduce the capture of nontarget species and minimize the impact on the marine environment. Such robots are also used for water surface cleaning and retrieval in marine ports and aquaculture.
Estimating the six degrees of freedom (DOF) pose of a marine creature floating on the water surface is crucial for autonomous robots to track or grasp it effectively. For automated robots, visual information is expected to allow adaptation to different environments. 1,2 For a robot with vision sensors, such as cameras, it has so far been difficult to accurately detect the 3D pose of the target object, especially when the target cannot be predefined because its shape or size is arbitrary. On the other hand, some vision-based AUVs have been researched, for example, for eliminating invasive species with a monocular camera. 3 In contrast to the current state of robot vision research on land, the application of robot vision in water is at an earlier stage.
Since monocular vision requires only a single, inexpensive camera, it is widely utilized for visual pose detection. 4,5 A fish-catching robot has been developed to deal with the target’s 3DOF recognition. 6 However, its depth measurement is inherently inaccurate. Many studies have used Red, Green, Blue, and Depth (RGB-D) cameras, composed of one RGB camera and a depth sensor with infrared light, to improve the distance detection abilities of monocular vision. 7,8 However, depth information is not always readily available; for systems operating outdoors, common depth sensors yield noisy and sparse depth information. 9,10 Furthermore, optical laser sensors have been explored for vision tasks in previous studies, 11–14 but these sensors often come at a higher cost. With the development of deep-learning technology, monocular RGB images can also achieve pose recognition, 9,15 but this requires many pictures and pretraining time.
Unlike monocular vision and RGB-D methods, stereo vision is another way to estimate 3D pose and can handle a greater variety of target material properties and lighting conditions. 16 For pose detection, stereo vision methods can be roughly categorized into two types: stereo-matching and model-matching methods.
Stereo matching, also known as disparity estimation, utilizes epipolar geometry to compute the 3D coordinates of a physical point (2D–3D method). Among stereo-matching methods, feature-based approaches are commonly used for pose detection. These methods extract feature points and match them using techniques such as FAST, 17 SIFT, 18 and SURF. 19 However, mismatches are inevitable, and removing mismatched noise points is a complex problem.
Point-cloud-based methods use a scene point cloud generated by stereo matching, 20 which can be seen as a global extension of feature-based methods. However, it is generally necessary to organize and structure the 3D discrete points into a higher-level representation, such as voxels. 21 One of the challenges in this process is removing noise points that do not correspond to the target objects.
Additionally, identifying and segmenting the desired objects within the point clouds is a complex task. 22 However, mismatches are inevitable in stereo vision, and addressing them is crucial in achieving accurate results.
Model-matching methods, also known as model-based recognition methods, have the advantage of avoiding mismatches and are especially effective in handling occlusion. Although monocular model-based methods can estimate the 6D pose, their distance measurement accuracy is lower than that of stereo vision because they cannot exploit parallax displacement. 23,24 These methods first acquire the object model, projecting all points of a solid 3D model onto the stereo vision image planes. These projected points are then matched with their corresponding counterparts on the actual target, leading to accurate pose estimation (3D–2D method).
Traditional model-based pose estimation methods mainly rely on handcrafted models. These are made according to the shape and size of the original object, so they are generally used when the target size is given. An application utilizing a fixed 3D marker has been developed for AUV navigation and battery recharging. 25,26 In other situations, however, aquatic organisms are always on the move, making them difficult to measure and model accurately. An innovative approach using deformable models and stereo vision was employed to accurately measure the size of tuna. 27 However, the complexity of model building limits the generality of this method.
A photo-model-based pose estimation method has been proposed 28 to overcome the disadvantages encountered in constructing models. It simplifies the model-making process since it does not need to predefine the object’s shape, color, pattern, and coding in the programming language. 28,29 This method belongs to the model-based recognition category. It involves creating 3D models from 2D photos and then projecting these models onto binocular images to match actual objects for pose estimation (2D–3D–2D).
The photos employed for photo-model generation are pre-prepared instead of being captured on-site to minimize the environmental constraints at the specific application site. In the pose estimation process, the generated models are matched with binocular images taken on-site to find the best match. Regarding deformable objects, such as clothes, and the issue of partial occlusion, previous studies have investigated various environmental factors that influence the handling of such objects. 29,30 These studies have conducted experiments and provided empirical evidence supporting the effectiveness of the photo-model approach. Additionally, a 3D target object’s pose could be estimated and tracked in real time by using stereo vision and its 2D photo. 31 Moreover, a visual servoing system for catching marine creatures was developed using the photo-model approach. 24
However, previous algorithms based on photo-models rely on camera calibration to obtain the pixel per metric (PM) ratio and calculate the target size based on the shooting distance. The previous method uses a photo with a known PM ratio to generate a photo-model of the same size as the object; it cannot use photos with unknown PM ratios. The ratio measures the number of pixels per unit length of an object and is an important parameter for object size detection in pictures. 32–34 Existing studies usually rely on camera calibration with reference objects of known size to determine this ratio. 35,36
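For concreteness, the PM ratio can be sketched as a one-line computation from a reference object of known size; the function name is illustrative, and the 386-pixel/193-mm squid numbers reused here echo the example given later in the experiment section.

```python
def pm_ratio(pixel_length, real_length_mm):
    """Pixels per metric: the number of image pixels covering one
    millimeter of the real object, measured from a reference of
    known size (e.g. via camera calibration)."""
    return pixel_length / real_length_mm
```

A squid that is 193 mm long and spans 386 pixels in its photo thus has a PM ratio of 2 pixel/mm; the expanded method treats this quantity as unknown and searches over it.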
However, in practical scenarios, the size of targets on the ocean surface is often unknown, which poses a challenge for model-based approaches. To address this issue, a proposed expanded method overcomes the limitations of the previous size-fixed photo-model approach by assuming different PM ratios. This enables the generation of photo-models of different sizes from the same photo. By utilizing this approach, spatial model matching of the object with an unknown size can be facilitated, ultimately leading to accurate estimation of the target’s pose.
On the other hand, while the data-driven method with deep-learning techniques utilizes images for 3D pose detection, it necessitates a considerable amount of training data and pretraining time. 37,38 In contrast, the photo-model-based method, which belongs to the model-based approach, can accurately recognize the object’s pose with just a single photo. 31 This approach eliminates the need for large amounts of training data and simplifies the model-building process.
The main purpose of this article is to verify whether photo-models of assumed dimensions, that is, generated from different PM ratios, can be used in model-based stereo vision methods to estimate the pose of objects of uncertain dimensions. This article aims to enhance the current photo-model-based algorithm and propose a convenient pose detection method using stereo vision. The proposed method will serve as a foundation for future research on visual servoing in marine aquaculture, specifically for the collection of deceased or ill marine creatures on the surface of the water. This article does not discuss recognizing targets underwater.
In this study, we assume the PM ratio for model generation. To verify the generality of the algorithm, in addition to taking pictures of the target, we also downloaded a photo with an unknown shooting distance from Bing Images. Separate pose estimation experiments with the two photos confirmed that the target pose can be recognized using stereo vision and a photo of the same species taken at an unknown shooting distance.
More precisely, the contributions of this article are as follows: (1) a method is proposed to estimate the target pose using stereo vision and a photo whose shooting distance is unknown; (2) when a model of the same size as the object cannot be generated, 3D planar models of different sizes are generated by assuming the PM ratio of pixel length to the actual length of the object; (3) the target pose and PM ratio estimation problem is transformed into an optimization problem, so that the pose and ratio can be solved simultaneously.
The rest of this article is organized in the following sections: The second section presents expanded photo-model generation and the photo-model-based pose estimation method. In the third section, we discuss the adaptability of the proposed method for recognizing an object’s pose according to the pose–ratio fitness distribution and pose estimation experimental results. The conclusions and future work are described in the final section.
Expanded photo-model-based pose detection
This section introduces the methodology of the expanded photo-model-based recognition method with the variable PM ratio. The developed photo-model-based stereo vision system is shown in Figure 1. Each coordinate system is as follows:

Photo-model-based stereo vision system.
Figure 2 shows a perspective projection of the stereo vision system. Each coordinate system is as follows:

Perspective projection of stereo vision system. In the 3D search space, the spatial plane model is projected onto the left and right images through perspective projection.
2D pixel photo-model
This subsection describes 2D pixel photo-model generation before explaining 3D photo-model generation. Model generation has two central portions. The first is 2D pixel model generation, where the 2D pixel model size is fixed in pixel units. The second is 3D plane model generation, where the size (length and width) is variable and its unit is millimeters. Estimating the relative pose requires the generated 3D plane model.
The hue value in the HSV color representation is used to extract the target color. The advantage of HSV is that each of its attributes corresponds directly to basic color concepts, which makes it conceptually simple; the program for the image matching process is therefore easy to understand. In addition, the hue channel of the HSV color system is robust against changes in lighting intensity.
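As a minimal sketch of such hue-based target extraction (pure Python, no image library; the conversion follows the standard RGB-to-HSV hue formula, and the tolerance test and its parameters are illustrative, not the tuned thresholds of the actual system):

```python
def rgb_to_hue(r, g, b):
    """Hue (degrees, 0-360) of an 8-bit RGB pixel, per the HSV model."""
    r_, g_, b_ = r / 255.0, g / 255.0, b / 255.0
    mx, mn = max(r_, g_, b_), min(r_, g_, b_)
    if mx == mn:
        return 0.0            # achromatic: hue undefined, return 0
    d = mx - mn
    if mx == r_:
        return (60.0 * (g_ - b_) / d) % 360.0
    if mx == g_:
        return 60.0 * (b_ - r_) / d + 120.0
    return 60.0 * (r_ - g_) / d + 240.0

def is_target_hue(hue, center, tol):
    """True if hue lies within +/- tol degrees of center, with 0/360 wrap."""
    diff = abs(hue - center) % 360.0
    return min(diff, 360.0 - diff) <= tol
```

Because hue is an angle, the wrap-around test matters for reddish targets whose hue straddles 0/360 degrees.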
The model generation process is represented in Figure 3. Scan Figure 3(a) from outside to inside. The part of the image with target hue values is determined as the photo-model frame. As shown in Figure 4(a), the model is generated based on a photo. The coordinate system of the model

2D pixel photo-model generation processes are described as (a)–(d): (a) shows a photograph with a target object (the squid) in the background, (b) represents a model surface space
3D photo-model with specified PM ratio
To search for the object, the photo-model needs to be converted from a 2D pixel model to a 3D spatial plane model. For the jth 3D spatial plane model, the length and width are calculated as follows
where
Its unit is (pixel/mm). It is the ratio of the 2D pixel model to the 3D spatial plane model.
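The pixel-to-millimeter conversion described here can be sketched as follows; the function name is illustrative, and the 386 × 152 pixel squid model with a ratio of 2 pixel/mm echoes the worked example given later.

```python
def plane_model_size(pixel_length, pixel_width, pm_ratio):
    """Length and width (mm) of the jth 3D spatial plane model, obtained
    by dividing the 2D pixel model size by an assumed PM ratio (pixel/mm).
    Thickness is taken as 0 because a single photo carries no depth cue."""
    return pixel_length / pm_ratio, pixel_width / pm_ratio
```

Scanning over candidate PM ratios with this conversion is what produces the family of different-sized plane models that the expanded method matches against the stereo images.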
The coordinate of the ith point of jth model
Because the 3D searching model is generated from a 2D photo, the thickness of the target is unknown; therefore
Therefore,

The prepared photo and the generated 2D pixel photo-model. (a) The photo size is 640 × 480 pixels. The photo-model is composed of the inner portion and the outer portion with sampling points. (b) The 2D pixel photo-model is only a small part of the photo including the target; the whole photo is not the model. Sampling points are collected at a certain interval. Its coordinate system
Projective transformation of the photo-model
This subsection introduces the basic components of the projective transformation of the photo-model as follows. More details are given in the literature. 28,30,39
It should be noted that in the past
As shown in Figure 1, the pose of
Based on
For simplicity,

The summary of the calculation process from 2D pixel photo-model generation to the 3D photo-model’s stereo vision perspective projection.
About stereo vision, position
The position vector
Then
where
The projective transformation process is summarized in Figure 5.
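A minimal pinhole sketch of the stereo perspective projection summarized in Figure 5; the focal length, principal point, and baseline values below are illustrative defaults, not the calibrated parameters of the actual system.

```python
def project_point(p, cam_x, f=500.0, cx=320.0, cy=240.0):
    """Project a 3D point p = (x, y, z) in mm onto one image plane (pixels).
    cam_x is the camera centre's offset along the baseline; f is the focal
    length in pixels and (cx, cy) the principal point."""
    x, y, z = p
    u = f * (x - cam_x) / z + cx
    v = f * y / z + cy
    return u, v

def project_stereo(p, baseline=100.0, **kw):
    """Project the same 3D point onto the left and right image planes."""
    left = project_point(p, -baseline / 2.0, **kw)
    right = project_point(p, +baseline / 2.0, **kw)
    return left, right
```

A point 500 mm ahead of the baseline midpoint projects with a horizontal disparity of f·b/z pixels between the two images, which is the displacement cue that makes stereo matching more informative than a single camera.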
Photo-model-based matching in 3D space
As shown in Figure 6, 3D toys of marine creatures are prepared. The squid is used to explain the photo-model-based matching.

(a) Marine biological models. The three labels correspond to the model number, English name, and size (unit: cm). (b) Photos of the marine biological models. It should be noted that each model is only the part of the photo containing the target, that is, the region inside the black frame.
Figure 7 illustrates the experimental setup for the fitness distribution experiment, which will be elaborated upon in the subsequent subsection. The target object and manipulator remain stationary while the 3D photo-models vary in distance and size.
In Figure 7, two example models of equal size, generated from the same photo, are displayed. Therefore, they have the same PM ratio. The first model’s projection transformation result is depicted in Figure 8(a), while the second model’s stereo projection result is shown in Figure 8(b). Additional example results for 3D model projection onto stereo vision image planes are illustrated in Figure 8.
In Figure 8(c), through forward projection equation (9), the 3D model in 3D searching space is projected onto the left and right camera images.

Experimental environment for 3D model projection onto stereo vision image planes. The pose of the object is fixed as

A series of 2D projection transformation results of 3D photo-models at various positions in space with different PM ratios based on the perspective transformation. When
As shown in Figure 4, when the prepared photo is 640 × 480 pixels, the divided squid model is 386 × 152 pixels. According to equation (4), in Figure 8(a), the photo-model spatial size is 193 × 76 × 0 mm with
Compared to monocular vision, which observes only the projection result of the left camera, binocular vision exhibits a greater positional difference when the model is projected onto the two images. Therefore, stereo vision is more helpful in accurately identifying pose and size. As shown in Figure 8(b), when the distance between the model and the object is small and their sizes are similar, the coincidence is higher. This characteristic inspired the fitness function, which uses coincidence to describe the resemblance in pose and size between a photo-model and the target object.
Definition of the fitness function
Figure 9(a) shows the left image projection example of the jth model. The evaluation points of hue value,
The 2D model is composed of dots whose relative positions are predefined and fixed. Figure 9(b) shows another situation where the overlap area between the real target and the model is increased compared to the area depicted in (a).
The correlation between the projected model and captured images on the left and right 2D images is calculated by equations (11)–(13)
where

Calculation of the matched degree of each point in model space (
The evaluation values in equations (12) and (13) are tuned experimentally.
Calculating p of each sampling point (equations (12) and (13)) based on color similarity is considered to be constant, with a time complexity of O(1). Furthermore, for each photo-model (the jth photo-model), the fitness calculation complexity given in equation (11) is
In equation (12), if the hue value of each point of captured images, which lies inside the surface model frame
Similarly, in equation (13), if the hue value of each point in the left camera image, which are in
Likewise, a function
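Putting the above together, a toy version of the coincidence-based fitness for one camera image might look as follows; the +1/−1 point values and the plain hue-difference test are placeholders for the experimentally tuned evaluation values of equations (12) and (13), and all names are illustrative.

```python
def fitness_one_image(model_points, hue_image, target_hue, tol=20.0):
    """Toy coincidence fitness for one camera image.
    model_points: iterable of (u, v, is_inner) projected sampling points.
    Inner points landing on target-hue pixels score +1; outer
    (background-frame) points landing on target hue score -1, penalizing
    models that overlap the target with their background band.
    The result is normalized by the number of visible inner points."""
    score, n_inner = 0.0, 0
    h, w = len(hue_image), len(hue_image[0])
    for u, v, is_inner in model_points:
        if not (0 <= v < h and 0 <= u < w):
            continue                    # point projects outside the image
        match = abs(hue_image[v][u] - target_hue) <= tol
        if is_inner:
            n_inner += 1
            score += 1.0 if match else 0.0
        elif match:
            score -= 1.0
    return score / max(n_inner, 1)
```

In the full method this score is computed for the left and right projections of the jth model and combined, so that only a model whose pose and size make both projections coincide with the target reaches a high value.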
Feasibility of pose recognition of expanded photo-model with PM ratio search
Stereo image acquisition
The photo-model-based stereo vision system is shown in Figures 1 and 7. The manipulator used in the system is a PA-10 robot arm manufactured by Mitsubishi Heavy Industries, Tokyo, Japan. Two CCD cameras are mounted on the end effector. The resolution of the stereo images is 640 × 480 pixels. The PC is a Yoga Pro 13s (CPU: Core(TM) i5-1135G7, 2.42 GHz; RAM: 16 GB).
Fitness distribution experiment
Using still pictures of the target captured by the left and right cameras, the fitness value

Fitness distribution of C02 squid listed in Figure 6. (a) Left and right camera images. (b) Prepared photo for photo-model generation. (c)–(e) fitness distribution with position-ratio scan, that is,

Fitness distribution of C02 squid. The size of the prepared photo (b) is different from that in Figure 10. (c)–(e) fitness distribution with position-ratio scan. (f)–(h) fitness distribution with orientation-ratio scan.

Fitness distribution of C01 crab. The size of the prepared photo (b) is the same as that in Figure 11. (c)–(e) fitness distribution with position-ratio scan. (f)–(h) fitness distribution with orientation-ratio scan. In each subfigure of (c)–(h), the maximum fitness value and corresponding coordinate to give the maximum value are shown in text boxes.
Figures 10 to 12 illustrate the distribution of fitness results for the pose–ratio of the C01 crab and C02 squid, which are depicted in Figure 6.
The true pose of an object is set as
In this experiment, the target objects’ sizes and poses do not change. To imitate photos taken at different heights, photos of different sizes are prepared. Therefore, corresponding to different size photos, the true photo ratios are as follows
The relationship between
Figure 10(a) shows the left and right camera images of the C02 squid, and the size of the prepared photo (b) is smaller than that of Figure 11(b). Figure 10(c) to (e) shows the fitness distribution results with the position and photo-ratio scan, and (f) to (h) show those with the orientation and photo-ratio scan. All the fitness distributions (c) to (h) have peaks whose poses and ratios are near the actual values given by equations (14) and (15).
When the size of the prepared photo changes in Figure 11, the same conclusion can be drawn. As shown in Figure 11(e), when α changes from 1 to 4, that is, when the size of the squid model changes from 386 × 152 × 0 mm to 96.5 × 38 × 0 mm, the fitness changes dramatically even though z is still 480 mm. Only when α is close to the actual scale, that is, when the model size is close to the object’s actual size, does the fitness have a high value.
Concerning the C01 crab, as with the squid, Figure 12(c) to (e) shows the position-ratio fitness distribution, and (f) to (h) show the orientation-ratio fitness distribution. All the pose–ratio fitness distributions (c) to (h) also have peaks near the true values.
In this section, the fitness distribution experiment verified that the fitness function, equation (11), can transform the PM ratio estimation and target pose detection problems into optimization problems. It also confirmed that the proposed method can estimate the 3D target pose using stereo vision and one photo with an unknown PM ratio.
Pose estimation experiment with genetic algorithm and different photos
To verify the detection ability of the proposed expanded photo-model-based algorithm, pose and ratio detection experiments were conducted with different photos in real application scenarios. As shown in Figure 1, in this experiment, the squid object floats on the water in the pool without pose constraints. The distance between
While the fitness function transforms the main problem of recognizing the pose of an object and the ratio of a prepared photo into an optimization problem, the pose and ratio fitness distributions involve much computation. We choose the genetic algorithm (GA) as an optimization method to find the maximum fitness value because of its simplicity and effectiveness. 31,41 Because of the limited space, GA will not be introduced in detail here.
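As a rough sketch of how a seven-variable, 68-bit GA chromosome could encode a candidate pose and PM ratio: only the 68-bit total and the 10-bit-per-position-variable split are taken from the text; the remaining bit widths and all search ranges below are assumptions for illustration, not the actual GA configuration.

```python
def decode_field(bits, lo, hi):
    """Map a binary string to a real value in [lo, hi]."""
    return lo + (hi - lo) * int(bits, 2) / (2 ** len(bits) - 1)

# (bit width, lower bound, upper bound) per gene; the widths after the
# first three 10-bit position genes are assumed, as are all the ranges.
GENES = [(10, -200.0, 200.0),  # x (mm)
         (10, -200.0, 200.0),  # y (mm)
         (10,  300.0, 900.0),  # z (mm)
         (10,  -90.0,  90.0),  # roll (deg)
         (10,  -90.0,  90.0),  # pitch (deg)
         (10,  -90.0,  90.0),  # yaw (deg)
         ( 8,    0.5,   4.0)]  # PM-ratio scale alpha

def decode_chromosome(chrom):
    """Decode a 68-bit string into (x, y, z, roll, pitch, yaw, alpha)."""
    assert len(chrom) == sum(w for w, _, _ in GENES) == 68
    vals, i = [], 0
    for width, lo, hi in GENES:
        vals.append(decode_field(chrom[i:i + width], lo, hi))
        i += width
    return tuple(vals)
```

Each of the 30 individuals is decoded this way into a candidate model pose and ratio, evaluated with the fitness function, and evolved by the usual selection, crossover, and mutation operators.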

The 3D pose estimation results with GA and two different prepared photos. Figure 1 shows the experimental environment. (a) shows the original stereo images at a moment. The distance between
In previous studies, 31,41,42 the chromosome in the GA consisted of six variables, representing possible pose solutions. However, for PM ratio detection, as shown in equation (16), each chromosome is elongated and now comprises seven variables. Thirty GA individuals are used in this experiment, where the chromosome of an individual consists of 68 bits. The first three variables (1–30 bit) of an individual in 3D space are the jth model’s position
As shown in Figure 13(a), the cameras capture the left and right images at an arbitrary moment. Each experiment was performed using one prepared photo. Figure 13(c.2) was downloaded from Bing Images (http://cn.bing.com/images); its PM ratio is unknown. Guided by the GA, the 3D models with random initial poses and ratios, generated from the prepared photos (b.2) and (c.2), converge to the target objects in 3D space. The GA stops evolving after the 500th generation. (b.1) shows the estimation results using the model generated from photo (b.2), and (c.1) shows those using the model generated from photo (c.2).
The average one-generation evolution time of the model generated from Figure 13(b.2) is 0.213 s with
Table 1 summarizes the GA estimation results of the two experiments. The distance between the end effector and the target floating on the pool’s water surface is 680 mm. The length and width of the target were measured with a manual tape measure. The experimental results for the two different photos are close to the actual values. Since the two photos of different sizes correspond to the same target object, their PM ratios differ. Photo 1 is the target’s own photo, and its estimation result is closer to the true value than that with photo 2. Although photo 2 is not a photo of the target and its shooting distance is unknown, the detection result using it is still close to the true value: the detected distance, object length, and width are all near the true values. Thus, the expanded photo-model-based algorithm can detect the pose of objects using photos with unknown PM ratios in practical application scenarios.
Table 2 presents the relative error of GA in estimating the distance and size of objects on the water surface. The absolute and relative errors of the distance and length detection with photo 1 are computed from the Table 1 detection results. The relative error in the last row, when using photo 2, is calculated in the same manner. However, it should be noted that photo 2 in Table 2 (Figure 13(c.2)) and the target object C02 in Figure 6(a) have some shape differences, although they belong to the same species. While using a photo of the same species can help estimate the pose and size, it may slightly increase the error.
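The relative errors reported in Table 2 follow the usual definition; as a one-line sketch (the sample numbers in the test are illustrative, not the table's values):

```python
def relative_error(estimated, true_value):
    """Relative error of a measurement as a fraction of the true value:
    |estimated - true| / true."""
    return abs(estimated - true_value) / true_value
```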
For comparison, the performance of the expanded photo-model method is evaluated against other existing methods. The most common related research is size detection of marine creatures. Tuna research 27 makes use of stereo vision technology, employing finely constructed models. Fish research, 14,34 on the other hand, utilizes laser sensors for dimensional measurement, offering increased precision at the expense of higher cost. In billfish and tuna research, 43 the size of caught fish is detected on board using a reference object of known size.
For comparison, data on binocular size and position detection of objects in air have also been included. 14,34 Similar studies on agricultural products in air have been incorporated as well. 44,45 The results show that our research achieves high positioning accuracy but only average size detection accuracy. Overall, as can be seen from the table, our method is a low-cost and practical approach in terms of distance and size measurement.
The target detection results of GA.a
a Through perspective transformation, the projection results of two models on the left and right images corresponding to the pose and ratio are shown in Figure 13(b.1) and (b.2), respectively. The last row shows the measurement of the target under the tape measure. Even though photo 2 is not the target photo, the detection result is near the actual value.
Relative error in distance (mm) and size (mm) for different methods.a
a The most common research is the size detection of marine creatures. Since this research is specifically aimed at detecting objects on the water surface, we have also included data on binocular product size and positioning detection for comparison purposes. The results clearly demonstrate that our research achieves a high level of accuracy in terms of positioning. However, in terms of size detection, our results fall within the average range.
In the previous subsection, the fitness distribution experiments verified the feasibility of the proposed expanded photo-model-based recognition method. The pose–ratio fitness distributions in Figures 10 to 12 have maximum peaks at the true poses of the targets and the true ratios of the prepared photos. These results show that the problem of detecting the pose of a marine creature from a picture of unknown shooting status can be transformed into an optimization problem. The pose estimation experimental results in this subsection further confirm that the proposed expanded photo-model-based method can estimate a target object’s pose using stereo vision and only one photo with an unknown PM ratio; the fitness function, equation (11), transforms the target pose and PM ratio estimation problem into an optimization problem that the GA can solve.
We conducted the experiments with one target and two different photos to clarify that this proposed method can detect both the pose and size of an object in the actual application using just a single pre-prepared photo where the shooting distance is unknown.
The above three points are the contributions of this article and are verified by the pose estimation experiments.
Conclusion and future work
This study presents the expanded photo-model-based pose estimation method that overcomes the limitations of the previous fixed PM ratio approach. By utilizing photos with unknown PM ratios taken at unknown distances, the proposed method allows for the generation of photo-models of different sizes from the same photo. Experimental results have demonstrated the effectiveness of this approach in detecting the pose and size of objects. Moving forward, further research and optimization efforts are necessary to improve the performance and efficiency of this method.
Indeed, the proposed expanded photo-model-based method is still in its early stages. The model shape can be further refined. The addition of the new PM ratio parameter increases the computational complexity compared to the previous method, so the amount of calculation required for pose estimation also increases. It is therefore important to explore ways to optimize the algorithm and minimize the computational burden while maintaining accurate pose and size detection. Reducing the number of sampling points can improve speed, but accuracy may suffer. The real-time tracking performance of the GA needs further investigation. It is recommended that a wider variety of experimental objects, both on the water surface and underwater, be included to enhance the generalizability of the findings. In addition, marine organisms are treated as rigid bodies in this study; although dead organisms deform less, they still undergo deformation on the water surface. Conducting further experiments is therefore crucial for determining the reliability of the proposed method.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
