Simultaneous Pose and Correspondence Estimation Based on Genetic Algorithm

Abstract

Although several algorithms have been presented to solve the simultaneous pose and correspondence estimation problem, the correct solution may not be reached to with the traditional random-start initialization method. In this paper, we derive a novel method which estimates the initial value based on genetic algorithm, considering the influences of different initial guesses comprehensively. First, a set of random initial guesses is generated as candidate solutions. Second, the assignment matrix and the perspective projection error are computed for each candidate solution. And then each individual is modified (selection, crossover, and mutation) in current iterative process. Finally, the fittest individual is stochastically selected from the final population. With the presented initialization method, the proper initial guess could be first calculated and then the simultaneous pose and correspondence estimation problem could be solved easily. Simulation results with synthetic data and experiments on real images prove the effectiveness and robustness of our proposed method.

1. Introduction

Estimating the position and orientation using 2D images and known 3D targets, which is usually called pose estimation, is a basic problem in computer vision, including hand-eye coordination system, object tracking, augmented reality, and autonomous navigation [1–8]. However, when the correspondences between object features and image features are unknown, for example, because the scene contains repetitive patterns or because the 3D points are simply salient features on a geometric model without associated texture or because occlusion and clutter should be taken into account, it becomes difficult because no additional information is available.

In this case, pose and correspondence should be determined simultaneously, which is simultaneous pose and correspondence estimation problem. So far, designing a proper method that can be generally applied to estimate position and orientation when the correspondences are unknown is still a challenging problem. Particularly, this problem also takes occlusion, clutter, image noise, and model deformation into account. The problem can be iteratively solved by optimizing a global cost function. One limitation of such algorithm is that the global minimum cannot be guaranteed. This is alleviated by randomly initializing it at different initial guesses. Meanwhile, the algorithm is quite slow because hundreds of different initial poses are needed to try randomly and it may succeed only when the initial pose is close to the real pose. Obviously, the consideration of occlusion and clutter makes this problem complicated. Yet, certain configurations of the data or situations with large amounts of occlusion and clutter still cause the algorithm to fail.

In this paper, we formalize the simultaneous pose and correspondence problem comprehensively and present a novel initialization method based on genetic algorithm. We note that the relationships between each randomly initial guess are neglected by the traditional random-start initialization method. In this case, an initial guess may still be used to minimize the global objective function, even when a very similar guess has also been attempt. What is worse, an appropriate initial guess may not be selected in a long time if there are many local optima. Actually, each initial guess has different influence to the global optimum and a valid method should consider the different results of previous attempts during searching for the global optimum. Our initialization method is derived based on the analysis. Simulation results and experiments on real images suggest that our method has faster convergence speed and higher convergence rate than the traditional random-start method, which can be used to solve the simultaneous pose and correspondence problem effectively and robustly.

The rest of this paper is organized as follows: Section 2 carries out a study of simultaneous pose and correspondence method and some relevant proposals using the genetic algorithm. Section 3 provides a thorough analysis of the simultaneous pose and correspondence estimation problem and genetic algorithm. We describe our initialization method based on genetic algorithm in Section 4. Section 5 presents experimental results using both synthetic data and real images and compares our algorithm with the state-of-the-art pose estimation algorithms. Conclusions are drawn in Section 6.

2. Related Works

In literature, many algorithms have been developed to solve the simultaneous pose and correspondence problem. Most of them determine correspondences between 3D targets and 2D image features explicitly or implicitly before computing pose [9–11]. The hypothesize-and-test algorithm, for example, RANSAC, is frequently used. In this approach, a small set of 2D image features and 3D object features are selected first and the correspondences between them are then hypothesized. Based on the hypothesized correspondences, the pose can be easily computed. All the 3D object features are reprojected into the image plane with the estimated pose. If the original and reprojected images features are sufficiently similar, the pose is accepted; otherwise, a new hypothesis is formed and the process is repeated.

In the above algorithms, the correspondence problem and the pose estimation problem are solved separately, but the relationships between pose and correspondence are neglected. Time complexity may become great, especially when the number of 3D object points or 2D image points is high.

Skrypnyk and Lowe [12] presented another similar algorithm which uses view-variant 2D image features to index 3D models. In his approach, a process named off-line training is executed to learn 2D feature groupings associated with large number of different views of the 3D models. Then, the on-line recognition stage is performed to index 3D object models in a database of learned object-to-image correspondence hypotheses. Correspondences could be determined based on the object recognition results, which are used for pose estimation and final verification.

In [13], Wunsch and Hirzinger formalized the problem as the optimization of an objective function combining all the correspondence and pose constraints. Combined with a hybrid pose estimation algorithm, a random-start local search method is performed in the object-to-image space. However, in their algorithm, the correspondence constrains are not represented analytically. Instead, each object feature is explicitly matched to the closest sight line of the image features.

David et al. derived a very similar algorithm named SoftPOSIT in [14] by defining a global objective function, which combined an iterative correspondence assignment technique called Softassign [11] and an iterative pose estimation technique called POSIT [15] into a single iteration loop. SoftPOSIT stands out in the simultaneously estimating pose and correspondence approaches because of its accuracy and speed. The drawback is that it tries different initial poses and succeeds when the initial pose is close to the real pose.

SoftSI algorithm [16] proposed by Zhou et al. is also based on the global objective function, which can simultaneously obtain pose and correspondences. The SoftSI algorithm combines the SI algorithm and two singular value decomposition- (SVD-) based shape description theorems. By analyzing the calculation process of SI algorithm, the method can avoid pose ambiguity and quickly eliminate bad initial values.

This objective function can be also solved by different optimization algorithms [17–21] and can contains line features [22], cornerless images [23], and nonrigid shape [24]. Unfortunately, there is no guarantee of finding the global optimum given a single random-start initial guess in the algorithm, especially when the number of points is not equal because of occlusion, clutter, and image noise.

Genetic algorithm (GA), which simulated the natural evolution process, is a useful tool for many optimization problem. Recently, GA and its variations have been used in many fields, such as computer vision, artificial intelligence, pattern recognition, telecommunication, smart sensing, and mobile computing. In telecommunications, the quality of service (QoS) parameters may conflict with each other, so this problem is actually a multiobjective optimization [25]. A genetic algorithm can be used to solve complicated global optimization problems. In quantum computing, genetic algorithm can be used in stimulated annealing for combinatorial optimization so as to avoid premature convergence [26].

3. Problem Formulation

In this section, we first give a description of camera model and the formulation of pose estimation algorithm with known correspondences, using the closed-form global optimal function. Then the function is modified to characterize the global pose-correspondence problem without known correspondences. The genetic algorithm is introduced at last.

3.1. Camera Model

Generally, a camera can be seen as a pinhole model, which describes the mathematical relationship between the coordinates of a 3D point and its projection onto the image plane, where the camera aperture is described as a point. Some of the effects that the pinhole camera model does not take into account can be compensated, for example, by applying suitable coordinate transformations on the image coordinates. This means that the pinhole camera model often can be used as a reasonable description of how a camera depicts a 3D scene, for example, in computer vision and computer graphics.

The geometry related to the mapping of a pinhole camera is illustrated in Figure 1.

Figure 1

The pinhole model of a camera.

In the model, a 3D orthogonal coordinate system is abstracted with its origin at $O_{c}$ , which is also where the camera aperture is located. The three axes of the coordinate system are referred to as $X_{c}$ , $Y_{c}$ , and $Z_{c}$ . Axis $Z_{c}$ is pointing in the viewing direction of the camera and is referred to as the optical axis, principal axis, or principal ray. The image plane is parallel to axes $X_{c}$ and $Y_{c}$ and is located at a distance f from the origin $O_{c}$ in the negative direction of the $Z_{c}$ axis. A practical implementation of a pinhole camera implies that the image plane is located such that it intersects $Z_{c}$ axis at coordinate $- f$ where $f > 0$ . f is also referred to as the focal length of the pinhole camera.

A point $C_{p}$ at the intersection of the optical axis and the image plane is referred as the principal point or image center.

A point P somewhere in the world has the coordinates $(X, Y, Z)$ relative to the axes $X_{c}$ , $Y_{c}$ , and $Z_{c}$ . The projection line of point P into the camera (the green line in Figure 1) passes through point P and the point $O_{c}$ . The projection of point P onto the image plane is denoted as p. This point is given by the intersection of the projection line (green line) and the image plane. In any practical situation we can assume that $Z > 0$ , which means that the intersection point is well defined.

There is also a 2D coordinate system in the image plane, with origin at $C_{p}$ and with axes X and Y which are parallel to $X_{c}$ and $Y_{c}$ , respectively. The coordinates of point p relative to this coordinate system is $(x, y)$ .

As depicted in Figure 1, the projection function can be formulated as

\begin{matrix} x = f \frac{X}{Z}, \\ y = f \frac{Y}{Z} \end{matrix}

(1)

which can be rewritten in homogenous coordinate system:

\begin{matrix} Z [\begin{bmatrix} x \\ y \\ 1 \end{bmatrix}] = [\begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}] [\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}] . \end{matrix}

(2)

In image coordinate system, there is an equation as follows:

\begin{matrix} [\begin{bmatrix} u \\ v \\ 1 \end{bmatrix}] = [\begin{bmatrix} s_{x} & 0 & u_{0} \\ 0 & s_{y} & v_{0} \\ 0 & 0 & 1 \end{bmatrix}] [\begin{bmatrix} x \\ y \\ 1 \end{bmatrix}], \end{matrix}

(3)

where

u_{0}

and

v_{0}

are the image center in image coordinate system and

s_{x}

and

s_{y}

are scale factors.

According to (2) and (3), the pinhole model of a camera can be described as

\begin{matrix} u - u_{0} = f s_{x} \frac{X}{Z} = f_{x} \frac{X}{Z}, \\ v - v_{0} = f s_{y} \frac{Y}{Z} = f_{y} \frac{Y}{Z}, \end{matrix}

(4)

where

f_{x} = f s_{x}

and

f_{y} = f s_{y}

are defined as equivalent focal length of axes

X_{c}

and

Y_{c}

, respectively.

f_{x}

f_{y}

u_{0}

, and

v_{0}

are called the camera intrinsic parameters.

3.2. Pose Estimation with Known Correspondences

The mapping from 3D to 2D coordinates described by a pinhole camera is a perspective projection followed by a 180° rotation in the image plane, which corresponds to how a real pinhole camera operates. The resulting image is rotated 180° and the relative size of projected objects depends on their distance to the focal point and the overall size of the image depends on the distance f between the image plane and the focal point. In order to produce an unrotated image, which is what we expect from a camera, we can place the image plane so that it intersects the $Z_{c}$ axis at f instead of at $- f$ and rework the previous calculations. This would generate a virtual (or front) image plane which cannot be implemented in practice but provides a theoretical camera which may be simpler to analyze than the real one. This relationship is illuminated in Figure 2.

Figure 2

The perspective projection of a pinhole model.

Assuming that the coordinates of corresponding 3D object points and normalized 2D image points are $P_{i} = {[X_{i}, Y_{i}, Z_{i}]}^{T}$ and $v_{i} = {[x_{i}, y_{i}]}^{T}$ , the perspective projection camera model can be described as

\begin{matrix} l_{i} [\begin{bmatrix} v_{i} \\ 1 \end{bmatrix}] = R P_{i} + T, \end{matrix}

(5)

where

l_{i}

is a scale factor. R and T are the rotation matrix and translation vector, respectively, which describe the relationship between the camera and the 3D target. Pose estimation with known correspondences is then to seek the optimal rotation matrix R and translation vector T that best satisfy the equations in (5). According to the above equation, R is a

3 \times 3

matrix and T is a

3 \times 1

vector, which can described as

\begin{matrix} R = [\begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix}] = [\begin{bmatrix} R_{1}^{T} \\ R_{2}^{T} \\ R_{3}^{T} \end{bmatrix}], \\ T = [\begin{bmatrix} t_{x} \\ t_{y} \\ t_{z} \end{bmatrix}], \end{matrix}

(6)

where

R_{1}^{T}

R_{2}^{T}

, and

R_{3}^{T}

indicate the row vectors of rotation matrix R and each of them is a unit vector.

t_{x}

t_{y}

, and

t_{z}

represent the coordinates of the original point of object coordinate system onto the three axes of the camera.

The perspective projection camera model can then be rewritten as

\begin{matrix} [\begin{bmatrix} l_{i} x_{i} \\ l_{i} y_{i} \\ l_{i} \end{bmatrix}] = [\begin{bmatrix} R_{1}^{T} & t_{x} \\ R_{2}^{T} & t_{y} \\ R_{3}^{T} & t_{z} \end{bmatrix}] [\begin{bmatrix} P_{i} \\ 1 \end{bmatrix}] . \end{matrix}

(7)

If we multiply the same factor $s = 1 / t_{z}$ on all the elements of the right-side perspective projection matrix and the left-side homogeneous image point coordinates, the equality is not affected. Introducing a scaling factor $w_{i} = l_{i} / t_{z}$ , we obtain

\begin{matrix} [\begin{bmatrix} w_{i} x_{i} \\ w_{i} y_{i} \end{bmatrix}] = [\begin{bmatrix} s R_{1}^{T} & s t_{x} \\ s R_{2}^{T} & s t_{y} \end{bmatrix}] [\begin{bmatrix} P_{i} \\ 1 \end{bmatrix}] \end{matrix}

(8)

with

\begin{matrix} w_{i} = R_{3} \cdot \frac{P_{i}}{t_{z}} + 1 . \end{matrix}

(9)

In photogrammetry, the pose estimation problem is usually formulated as the problem of optimizing the following objective function:

\begin{matrix} E = \sum_{i = 1}^{N} ({(Q_{x} \cdot P_{i} - w_{i} x_{i})}^{2} + {(Q_{y} \cdot P_{i} - w_{i} y_{i})}^{2}), \end{matrix}

(10)

where N is the number of 2D image points. To simplify the subsequent notation, we introduce three new vectors

Q_{x}

Q_{y}

, and

{\tilde{P}}_{i}

with four homogeneous coordinates:

\begin{matrix} Q_{x} = s (R_{1}, t_{x}), \\ Q_{y} = s (R_{2}, t_{y}), \\ {\tilde{P}}_{i} = (P_{i}, 1) . \end{matrix}

(11)

We call $Q_{x}$ and $Q_{y}$ the pose vectors.

In the objective function E, the pose vectors are those for which all the partial derivatives of the objective function with respect to the coordinates of these vectors are zero. $Q_{x}$ and $Q_{y}$ can then be solved using the conditions:

\begin{matrix} Q_{x} = {(\sum_{i = 1}^{N} P_{i} P_{i}^{T})}^{- 1} (\sum_{i = 1}^{N} w_{i} x_{i} P_{i}), \\ Q_{y} = {(\sum_{i = 1}^{N} P_{i} P_{i}^{T})}^{- 1} (\sum_{i = 1}^{N} w_{i} y_{i} P_{i}) . \end{matrix}

(12)

3.3. Pose Estimation with Unknown Correspondences

According to the above analysis with known correspondences, pose vectors can be computed in each iterative loop by minimizing the object function E. When correspondences are unknown, each image point $v_{i}$ can potentially match any of the object points $P_{j}$ , and therefore the scaling factor w should be corrected specific to the coordinates of $P_{j}$ :

\begin{matrix} w_{j} = R_{3} \cdot \frac{P_{j}}{t_{z}} + 1, j = 1,2, \dots, M, \end{matrix}

(13)

where M is the number of 3D object points.

Therefore, the simultaneous pose and correspondence problem can then be formulated as follows:

\begin{array}{l} E \\ = \sum_{i = 1}^{N} \sum_{j = 1}^{M} m_{i j} ({(Q_{x} \cdot P_{i} - w_{i} x_{i})}^{2} + {(Q_{y} \cdot P_{i} - w_{i} y_{i})}^{2}), \end{array}

(14)

where

m_{i j}

(equal to 0 or 1) are correspondence weights.

In next stage, we should find a zero-one assignment matrix $m = \{m_{i j}\}$ , which specifies the matching between a set of N image points and a set of M object points explicitly. The assignment matrix m, which has one row for each of image point $v_{i}$ and one column for each of object point, must satisfy the constraint that each 2D image point matches at most one 3D object point and vice versa. Given that pose vectors have been estimated, the assignment matrix can be calculated by the iterative Softassign technique [11].

Assuming the correspondence variables $m_{i j}$ are known and fixed, the pose vectors $Q_{x}$ and $Q_{y}$ can then be determined through minimizing the objective function in an iteration step. The solutions are

\begin{matrix} Q_{x} = {(\sum_{j = 1}^{M} m_{j}^{'} P_{j} P_{j}^{T})}^{- 1} (\sum_{i = 1}^{N} \sum_{j = 1}^{M} m_{i j} w_{j} x_{i} P_{j}), \\ Q_{y} = {(\sum_{j = 1}^{M} m_{j}^{'} P_{j} P_{j}^{T})}^{- 1} (\sum_{i = 1}^{N} \sum_{j = 1}^{M} m_{i j} w_{j} y_{i} P_{j}), \end{matrix}

(15)

with

m_{k}^{'} = \sum_{i = 1}^{N} m_{i j}

The objective function is minimized iteratively, with the following three operations at each iteration step: (1)

Given the initial guess of pose vectors $Q_{x}$ and $Q_{y}$ , compute the assignment matrix m.

(2)

Assuming the scaling factor $w_{j}$ is known and fixed, calculate the pose vectors using the assignment matrix m.

(3)

Using the estimated pose vectors $Q_{x}$ and $Q_{y}$ , update the scaling factor $w_{j}$ .

The above iterative approach can be summarized as follows. First, given an appropriate initial guess for the pose vectors according to the set of 2D image points and 3D object points, then the assignment matrix which represents correspondence between the image points and object points is estimated. Finally, the pose vectors are updated, using the results of the correspondences between 2D image points and 3D object points. This process is repeated until these estimations converge. In this case, through the final pose vectors

\begin{matrix} Q_{x} = {(q_{x 1}, q_{x 2}, q_{x 3}, q_{x 4})}^{T}, \\ Q_{y} = {(q_{y 1}, q_{y 2}, q_{y 3}, q_{y 4})}^{T}, \end{matrix}

(16)

pose variables can be calculated as follows:

\begin{matrix} s = \sqrt{‖(q_{x 1}, q_{x 2}, q_{x 3})‖ \cdot ‖(q_{y 1}, q_{y 2}, q_{y 3})‖}, \\ R_{1} = \frac{{(q_{x 1}, q_{x 2}, q_{x 3})}^{T}}{s}, \\ R_{2} = \frac{{(q_{y 1}, q_{y 2}, q_{y 3})}^{T}}{s}, \\ R_{3} = R_{1} \times R_{2}, \\ t_{x} = \frac{q_{x 4}}{s}, \\ t_{y} = \frac{q_{y 4}}{s}, \\ t_{z} = \frac{1}{s} . \end{matrix}

(17)

Then the final rotation matrix R and translation vector T are

\begin{matrix} R = {[\begin{bmatrix} R_{1} & R_{2} & R_{3} \end{bmatrix}]}^{T}, \\ T = {(t_{x}, t_{y}, t_{z})}^{T} . \end{matrix}

(18)

3.4. Genetic Algorithm

Genetic algorithm (GA) was an optimal model which simulated the natural evolution process [25–27]. The algorithm was first presented by John Holland in 1975.

In a genetic algorithm, a population of candidate solutions, called individuals, to an optimization problem is evolved toward better solutions. Each candidate solution has a set of properties, its chromosomes, which can be mutated and altered. Traditionally, candidate solutions are represented in binary as strings of 0 and 1, but other encodings are also possible.

The evolution usually starts from a population of randomly generated individuals and is an iterative process, and the population in each iteration is called a generation. In each generation, the fitness of every individual in the population is evaluated, which is usually the value of the objective function in the optimization problem being solved. The more fit individuals are stochastically selected from the current population, and each individual's genome is modified to form a new generation. The new generation of candidate solutions is then used in the next iteration of the genetic algorithm. Commonly, the genetic algorithm terminates when either a maximum number of generations has been produced or a satisfactory fitness level has been reached for the population.

A typical genetic algorithm requires two aspects. The first one is a genetic representation of the solution domain, and the second one is a fitness function to evaluate the solution domain [25–27].

The flow chart of genetic algorithm is depicted in Figure 3, where $p_{c}$ is the probability of crossover and the $p_{m}$ is the probability of mutation.

Figure 3

The flow chart of genetic algorithm.

Once the genetic representation and the fitness function are defined, a GA proceeds to initialize a population of solutions and then to improve it through repetitive application of the mutation, crossover, inversion, and selection operators.

4. Our Initialization Method

To solve the simultaneous pose and correspondence problem, an iterative process starting from an initial guess is necessary [14]. Usually, the minimized objective function has many local optimal because of occlusion, clutter, and image noise. If the initial guess is not appropriate, the correct global optimum will not be reached to or the objective function will converge to its other local optimal. Therefore we present an effective strategy based on genetic algorithm to solve the simultaneous pose and correspondence problem.

4.1. Initialization

The random-start method is a common initialization method, which starts from a random initial guess. There are 12 different variables, including a $3 \times 3$ rotation matrix R and a $3 \times 1$ translation vector T. All the variables can be generated through six degrees of freedom and include three Euler angles and the translation parameters. Therefore the traditional random-start initialization method generates a random 6D vectors $p_{0} = (α, β, γ, x, y, z)$ in a 6D hypercube. Then the rotation matrix R and translation vector T are defined as

\begin{matrix} R = [\begin{bmatrix} \cos γ \cos β & - \sin γ \cos α + \cos γ \sin β \sin α & \sin γ \sin α + \cos γ \sin β \cos α \\ \sin γ \cos β & \cos γ \cos α + \sin γ \sin β \sin α & - \cos γ \sin α + \sin γ \sin β \cos α \\ - \sin β & \cos β \sin α & \cos β \cos α \end{bmatrix}], \\ T = {(x, y, z)}^{T} . \end{matrix}

(19)

4.2. Algorithm Framework

In the traditional random-start initialization method, beginning from a random initial guess, pose and correspondence are iteratively estimated until meeting a specified termination criterion. If the calculated pose and correspondence are not correct, all the steps are processed again with another new random initial guess. However, the random-start method does not consider the different influence of each random initial guess because they are independent with each other in this method. Even worse, the initial guess $p_{0}$ may repeat in some range, and the count of initial guess will increase rapidly and the convergence rate will decrease sharply.

To get an appropriate initial guess, we present a new initialization method based on genetic algorithm [21]. First, instead of generating an initial guess one by one, a set of K random initial guesses $\{p_{0}^{(k)}\}$ , $k = 1,2, \dots, K$ , is generated as candidate solutions called a population. Second, the assignment matrix m is computed for each candidate solution in the population and then the perspective projection error E is evaluated as the fitness of every individual in the population. Finally, the more fit individuals are stochastically selected from the current population, and each individual is modified (selection, crossover, and mutation) to form a new population. The new population of candidate solutions is then used in the next iteration of the algorithm.

To illustrate the three modified operations, we use two candidate solutions in an iterative step as example:

\begin{matrix} p_{0}^{(i)} = (α^{(i)}, β^{(i)}, γ^{(i)}, x^{(i)}, y^{(i)}, z^{(i)}), \\ p_{0}^{(j)} = (α^{(j)}, β^{(j)}, γ^{(j)}, x^{(j)}, y^{(j)}, z^{(j)}) . \end{matrix}

(20)

We define three genetic algorithm operations.

Selection. Consider

\begin{matrix} p_{0}^{(i + 1)} = (α^{(i)}, β^{(i)}, γ^{(i)}, x^{(i)}, y^{(i)}, z^{(i)}), \\ p_{0}^{(j + 1)} = (α^{(j)}, β^{(j)}, γ^{(j)}, x^{(j)}, y^{(j)}, z^{(j)}) . \end{matrix}

(21)

Crossover. Consider

\begin{matrix} p_{0}^{(i + 1)} = (α^{(i)}, β^{(i)}, γ^{(i)}, x^{(j)}, y^{(j)}, z^{(j)}), \\ p_{0}^{(j + 1)} = (α^{(j)}, β^{(j)}, γ^{(j)}, x^{(i)}, y^{(i)}, z^{(i)}) . \end{matrix}

(22)

Mutation. Consider

\begin{matrix} p_{0}^{(i + 1)} = (α^{'}, β^{'}, γ^{'}, x^{(i)}, y^{(i)}, z^{(i)}), \\ p_{0}^{(j + 1)} = (α^{''}, β^{''}, γ^{''}, x^{(j)}, y^{(j)}, z^{(j)}) . \end{matrix}

(23)

Commonly, the iterative process will terminate when either a maximum number of iterations has been produced or a satisfactory fitness level has been reached for the population.

In practice, the search for better solution should be terminated when the current solution is such that the number of matching points is smaller than the threshold $t_{m}$ , which is defined as

\begin{matrix} t_{m} = ρ N, 0 < ρ \leq 1, \end{matrix}

(24)

where N is the total number of the image points and ρ is a scale factor of matching rate, which determines what percent of the detected object points must be matched.

In this case, we can deduce that the current initial guess is not suitable for the simultaneous pose and correspondence problem and a new initial guess should be given. Due to image noise, occlusion, and measurement error, ρ will not reach to 1 even when a good initial guess is found. Therefore, in the experiments discussed below, we take $ρ \equiv 0.75$ .

This test is not perfect, as it is possible for an initial guess to be very accurate even when the number of matched points is less than this threshold, which occurs mainly in cases of high noise. Conversely, a wrong initial guess may be accepted when the ratio of clutter features to detected object points is high. However, these situations are relatively uncommon both in simulation case and in practice.

5. Experiments

We first show the simulation results to confirm the effectiveness of our initialization method, compared with traditional random-start initialization method. Then experiments with real images intuitively present the computed pose and correspondences.

5.1. Synthetic Experiments

We generate synthetic data as follows: 3D object has eight points, distributing in the vertexes of a cube; there are seven 2D image points with unknown correspondence to 3D object points, where one vertex of the cube is occluded, as shown in Figure 4. All the coordinates of image points have been normalized for the sake of convenience. Connecting edges are used for understanding, instead of computing.

Figure 4

3D object points and 2D image points with one occluded point.

In synthetic experiments, the convergence rate and convergence speed of global object function are evaluated, respectively, initialized with the traditional random-start method and our novel initial method.

An initial guess is convergent only when the global objective function E in (14) reaches to a stable minimum with finite iterations. It means that, with the initial guess, pose and correspondence can be calculated simultaneously in a few of iterations.

We first compute the convergence rate (CR) at different noise levels, which is defined as

\begin{matrix} CR = \frac{number of convergent initial guesses}{total number of initial guesses}, \end{matrix}

(25)

where total number of initial guesses is set to 1000 in each different noise level.

Figure 5 shows the convergence rates of our initial method and the traditional random start method as a function of noise level. In this figure, each point depicts results averaged over 1000 random initial guesses. It can be seen that our initial method has a higher convergence rate and decreases more slowly than the traditional random-start method. This is due to the fact that our initial method using genetic algorithm considers different influences between each initial guess and has a higher probability of approaching the minimum of the global objective function than the traditional random start method.

Figure 5

Convergence rate at different noise level. The results initialized with random start method are plotted as circles (○) and the results using our initialized method are plotted as squares (□).

In the synthetic experiment, a set of random initial guesses is first generated as candidate solutions in our method based on genetic algorithm. Then certain processes are done with all the initial guesses until that the proper pose is determined. Similarly, the traditional random-start method also generates a list of globally aware guesses and tests against these random guesses as a whole. At this time, if the list of initial guesses contains the proper pose, the selection process terminates. However, the proper pose may be not in the list, and in this case, another list of initial guesses is needed to randomly generate. The whole procedure does not consider the former initial guess that has already been computed. On the opposite, our method based on genetic algorithm takes all the initial guesses as a whole population and new initial guess is generated according the results of former population.

We then compare the convergence speed when an initial guess computed by our method or random start method is convergent. The results are depicted in Figures 6 and 7.

Figure 6

The value of object function E at each iteration without noise. The results initialized with random start method are plotted as circles (○) and the results using our initialized method are plotted as squares (□).

Figure 7

Number of iterations at different noise level.

From Figures 6 and 7, we can see that the value of objective function E decreases quickly with our initial guess and the number of iterations is always low. However, initialized with random start, the value of object function E decreases quickly only at the first several iterations and becomes very slowly at the subsequent iterations, which leads to a very large number of iterations. Meanwhile, the random start method does not consider the different influence of each random initial guess because every initial guess is independent with each other. Even worse, the initial guess may repeat in some range, and the count of iterations will increase rapidly and the convergence speed will decrease greatly.

When in presence of large amounts of clutter, occlusions, or image noise, the random-start method searches for the proper initial pose in the pose space by using a RANSAC-style approach. It should consider all possible correspondences between 2D image points and 3D object points. What is worse, the initial guesses are represented in a 6D pose space ( $p_{0} = (α, β, γ, x, y, z)$ ). As a consequence, it would lead to computational explosion and quickly become computationally intractable with realistic numbers of features.

Figure 8 depicts the average time consumption until the initial guess is convergence.

Figure 8

Average time consumption at different noise level.

When the noise level is low, the initial guess in the 6D pose space has high probability to converge. Only a few of times is needed to determine the proper guess with the randomly selected initial pose. However, when the noise level becomes high, the time consumption increases obviously because the convergence rate fast decreases and more times are needed to try to find a proper initial value. On the opposite, the average time consumption increases slowly with the different noise level.

Figure 9 shows an example computation of simultaneous pose and correspondence algorithm initialized with our method, using the cube model depicted in Figure 4.

Figure 9

The iteration step of perspective projection with an initial value calculated by our method.

In Figure 9, we can see that the perspective projection of 3D object points overlaps with original 2D image points gradually after few iterations, giving the appreciate initial guess calculated by our method, which means that pose between the coordinate system of 3D object and the coordinate system of camera can be estimated with unknown correspondences and occlusion.

5.2. Real Images

We test our initialization method for simultaneous pose and correspondence problem on real images.

In the test, there are 11 or 10 image points in 2D image and 28 object points in the 3D model. An initial pose is first given by random start method (the results are showed in Figure 10) and then by our initialization method (the results are showed in Figure 11). Figure 12 is another experiment results initialized using our method.

Figure 10

The results of real image initialized by random start method (11 image points): (a) 2D image points s (point number 9); (b) the projection of 3D model with the calculated initial pose; (c) the projection results at each iteration; (d) the final result of simultaneous pose and correspondence algorithm.

Figure 11

The results of real image initialized by our method (11 image points): (a) 2D image points; (b) the projection of 3D model with the calculated initial pose; (c) the projection results at each iteration; (d) the final result of simultaneous pose and correspondence algorithm.

Figure 12

The results of real image initialized by our method (10 image points): (a) 2D image points; (b) the projection of 3D model with the calculated initial pose; (c) the projection results at each iteration; (d) the final result of simultaneous pose and correspondence algorithm.

3D object model with 28 points is known and 2D image points are recognized by our implementation of Harris algorithm [12].

As depicted in Figure 10, with the unsuitable initial pose given by random start method, the final pose is not correct and the simultaneous pose and correspondence problem cannot be solved very well.

As shown in Figures 11 and 12, the number of image points is less than 3D object points; what is more, the correspondences between 2D image points and 3D object points are unknown due to occlusion, clutter, and image noise. Using our initialization method based on genetic algorithm, an appropriate initial guess is first calculated and then pose and correspondences are estimated with the initial value. Finally, 3D object model is accurately reprojected onto image plane using the rotation matrix and translation vector contained in pose. The effectiveness of the initial guess can be verified by observing the overlapping degree between the perspective projection and the original image.

6. Conclusions and Future Work

In this paper, we present an initialization method for determining the pose of 3D objects from images to solve the simultaneous pose and correspondence problem. Pose and correspondences can be computed simultaneously by minimizing a global objective function initialized using our initial pose, which is calculated based on genetic algorithm considering the influences of different initial values comprehensively. Compared with the traditional random-start initialization method, our proposed method has higher convergence rate and lower number of iterations, which has been verified through experiments.

Future work will involve initializing the simultaneous pose and correspondence algorithm automatically using special features extracted in 3D object and 2D image plane. The effectiveness of other optimization algorithms would be analyzed. We are also interested in implementing a more thorough formalism to include initialization, pose estimation, and correspondence determination.

Footnotes

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the Fund of National Natural Science Foundation of China (61231018) and National High Technology Research and Development Program of China (2013AA014601).

References

Ansar

Daniilidis

Linear pose estimation from points or lines

IEEE Transactions on Pattern Analysis and Machine Intelligence 2003 25 5 578 589

10.1109/TPAMI.2003.1195992

2-s2.0-0037619235

Hesch

J. A.

Roumeliotis

S. I.

A direct least-squares (dls) solution for PnP

Proceedings of the International Conference on Computer Vision

November 2011

Barcelona, Spain

Gao

X.-S.

Hou

X.-R.

Tang

Cheng

H.-F.

Complete solution classification for the perspective-three-point problem

IEEE Transactions on Pattern Analysis and Machine Intelligence 2003 25 8 930 943

10.1109/tpami.2003.1217599

2-s2.0-0041471327

Lepetit

Moreno-Noguer

Fua

EPnP: an accurate O(n) solution to the PnP problem

International Journal of Computer Vision 2009 81 2 155 166

10.1007/s11263-008-0152-6

2-s2.0-58149481343

Fiore

P. D.

Efficient linear solution of exterior orientation

IEEE Transactions on Pattern Analysis and Machine Intelligence 2001 23 2 140 148

10.1109/34.908965

2-s2.0-0035248928

Schweighofer

Pinz

Globally optimal O(n) solution to the PnP problem for general camera models

Proceedings of the 19th British Machine Vision Conference (BMVC' 08)

September 2008

Leeds, UK

1 10

10.5244/c.22.55

2-s2.0-84898459574

Malik

Roth

McDonald

Robust 2D tracking for real-time augmented reality

Proceedings of the Conference on Vision Interface

2002

Fischler

M. A.

Bolles

R. C.

Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography

Communications of the Association for Computing Machinery 1981 24 6 381 395

10.1145/358669.358692

MR618158

2-s2.0-0019574599

Gold

Rangarajan

C.-P.

Pappu

Mjolsness

New algorithms for 2D and 3D point matching: pose estimation and correspondence

Pattern Recognition 1998 31 8 1019 1031

10.1016/s0031-3203(98)80010-1

2-s2.0-0032141691

10.

DeMenthon

David

Samet

SoftPOSIT: an algorithm for registration of 3D models to noisy perspective images combining softassign and POSIT

2001 CS-TR-969, CS-TR 4257

College Park, Md, USA

University of Maryland

11.

Gold

Rangarajan

A graduated assignment algorithm for graph matching

IEEE Transactions on Pattern Analysis and Machine Intelligence 1996 18 4 377 388

10.1109/34.491619

2-s2.0-0030125842

12.

Skrypnyk

Lowe

D. G.

Scene modelling, recognition and tracking with invariant image features

Proceedings of the 3rd IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR '04)

November 2004

Arlington, Va, USA

IEEE

110 119

10.1109/ismar.2004.53

2-s2.0-18844419289

13.

Wunsch

Hirzinger

Registration of CAD-models to images by iterative inverse perspective matching

Proceedings of the 13th International Conference on Pattern Recognition (ICPR '96)

August 1996

78 83

10.1109/icpr.1996.545995

2-s2.0-84898820789

14.

David

Dementhon

Duraiswami

Samet

SoftPOSIT: simultaneous pose and correspondence determination

International Journal of Computer Vision 2004 59 3 259 284

10.1023/b:visi.0000025800.10423.1f

2-s2.0-9644281592

15.

Oberkampf

DeMenthon

D. F.

Davis

L. S.

Iterative pose estimation using Coplanar feature points

Computer Vision and Image Understanding 1996 63 3 495 511

10.1006/cviu.1996.0037

2-s2.0-0030142835

16.

Zhou

Zhang

Vision-based pose estimation from points with unknown correspondences

IEEE Transactions on Image Processing 2014 23 8 3468 3477

10.1109/TIP.2014.2329765

MR3234197

17.

Lepetit

Fua

Keypoint recognition using randomized trees

IEEE Transactions on Pattern Analysis and Machine Intelligence 2006 28 9 1465 1479

10.1109/TPAMI.2006.188

2-s2.0-33745821131

18.

Lin

W.-Y.

Cheong

L.-F.

Tan

Dong

Liu

Simultaneous camera pose and correspondence estimation with motion coherence

International Journal of Computer Vision 2012 96 2 145 161

10.1007/s11263-011-0456-9

MR2874564

2-s2.0-84856412680

19.

Raytchev

Kimura

Real-time 3D pose and correspondence from stereo image sequences by combinatorial optimization

Proceedings of the International Joint Conference on Neural Networks (IJCNN 2012)

June 2012

Brisbane, Australia

IEEE

1 8

10.1109/ijcnn.2012.6252596

2-s2.0-84865073743

20.

Xia

Xiong

Simultaneous pose and correspondence determination using Differential Evolution

Proceedings of the 8th International Conference on Natural Computation (ICNC '12)

May 2012

703 707

10.1109/icnc.2012.6234643

2-s2.0-84866181995

21.

Yang

Wang

Song

Chen

A novel initialization method based on genetic algorithm for simultaneous pose and correspondence estimation

Proceedings of the 4th International Conference on the Innovative Computing Technology

2014

22.

David

DeMenthon

Duraiswami

Samet

Simultaneous pose and correspondence determination using line features

Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition

June 2003

424 431

2-s2.0-0042441063

23.

Lin

W.-Y.

Dong

Tan

Cheong

L.-F.

Yan

C.-H.

Simultaneous camera pose and correspondence estimation in cornerless images

Proceedings of the IEEE 12th International Conference on Computer Vision

2009

1179 1186

24.

Śanchez-Riera

Östlund

Fua

Moreno-Noguer

Simultaneous pose, correspondence and non-rigid shape

Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '10)

June 2010

1189 1196

10.1109/cvpr.2010.5539831

2-s2.0-77956008444

25.

Dong

A QoS driven web service composition method based on ESGA (Elitist Selection Genetic Algorithm) with an improved initial population selection strategy

International Journal of Distributed Sensor Networks 2009 5 1 54

10.1080/15501320802540900

26.

Shu

Quantum-inspired genetic algorithm based on simulated annealing for combinatorial optimization problem

International Journal of Distributed Sensor Networks 2009 5 1 64 65

10.1080/15501320802554992

2-s2.0-60749103191

27.

Akbari

Ziarati

A multilevel evolutionary algorithm for optimizing numerical functions

International Journal of Industrial Engineering Computations 2011 2 2 419 430

10.5267/j.ijiec.2010.03.002

2-s2.0-84859147967