Abstract
Although several algorithms have been presented to solve the simultaneous pose and correspondence estimation problem, the correct solution may not be reached to with the traditional random-start initialization method. In this paper, we derive a novel method which estimates the initial value based on genetic algorithm, considering the influences of different initial guesses comprehensively. First, a set of random initial guesses is generated as candidate solutions. Second, the assignment matrix and the perspective projection error are computed for each candidate solution. And then each individual is modified (selection, crossover, and mutation) in current iterative process. Finally, the fittest individual is stochastically selected from the final population. With the presented initialization method, the proper initial guess could be first calculated and then the simultaneous pose and correspondence estimation problem could be solved easily. Simulation results with synthetic data and experiments on real images prove the effectiveness and robustness of our proposed method.
1. Introduction
Estimating the position and orientation using 2D images and known 3D targets, which is usually called pose estimation, is a basic problem in computer vision, including hand-eye coordination system, object tracking, augmented reality, and autonomous navigation [1–8]. However, when the correspondences between object features and image features are unknown, for example, because the scene contains repetitive patterns or because the 3D points are simply salient features on a geometric model without associated texture or because occlusion and clutter should be taken into account, it becomes difficult because no additional information is available.
In this case, pose and correspondence should be determined simultaneously, which is simultaneous pose and correspondence estimation problem. So far, designing a proper method that can be generally applied to estimate position and orientation when the correspondences are unknown is still a challenging problem. Particularly, this problem also takes occlusion, clutter, image noise, and model deformation into account. The problem can be iteratively solved by optimizing a global cost function. One limitation of such algorithm is that the global minimum cannot be guaranteed. This is alleviated by randomly initializing it at different initial guesses. Meanwhile, the algorithm is quite slow because hundreds of different initial poses are needed to try randomly and it may succeed only when the initial pose is close to the real pose. Obviously, the consideration of occlusion and clutter makes this problem complicated. Yet, certain configurations of the data or situations with large amounts of occlusion and clutter still cause the algorithm to fail.
In this paper, we formalize the simultaneous pose and correspondence problem comprehensively and present a novel initialization method based on genetic algorithm. We note that the relationships between each randomly initial guess are neglected by the traditional random-start initialization method. In this case, an initial guess may still be used to minimize the global objective function, even when a very similar guess has also been attempt. What is worse, an appropriate initial guess may not be selected in a long time if there are many local optima. Actually, each initial guess has different influence to the global optimum and a valid method should consider the different results of previous attempts during searching for the global optimum. Our initialization method is derived based on the analysis. Simulation results and experiments on real images suggest that our method has faster convergence speed and higher convergence rate than the traditional random-start method, which can be used to solve the simultaneous pose and correspondence problem effectively and robustly.
The rest of this paper is organized as follows: Section 2 carries out a study of simultaneous pose and correspondence method and some relevant proposals using the genetic algorithm. Section 3 provides a thorough analysis of the simultaneous pose and correspondence estimation problem and genetic algorithm. We describe our initialization method based on genetic algorithm in Section 4. Section 5 presents experimental results using both synthetic data and real images and compares our algorithm with the state-of-the-art pose estimation algorithms. Conclusions are drawn in Section 6.
2. Related Works
In literature, many algorithms have been developed to solve the simultaneous pose and correspondence problem. Most of them determine correspondences between 3D targets and 2D image features explicitly or implicitly before computing pose [9–11]. The hypothesize-and-test algorithm, for example, RANSAC, is frequently used. In this approach, a small set of 2D image features and 3D object features are selected first and the correspondences between them are then hypothesized. Based on the hypothesized correspondences, the pose can be easily computed. All the 3D object features are reprojected into the image plane with the estimated pose. If the original and reprojected images features are sufficiently similar, the pose is accepted; otherwise, a new hypothesis is formed and the process is repeated.
In the above algorithms, the correspondence problem and the pose estimation problem are solved separately, but the relationships between pose and correspondence are neglected. Time complexity may become great, especially when the number of 3D object points or 2D image points is high.
Skrypnyk and Lowe [12] presented another similar algorithm which uses view-variant 2D image features to index 3D models. In his approach, a process named off-line training is executed to learn 2D feature groupings associated with large number of different views of the 3D models. Then, the on-line recognition stage is performed to index 3D object models in a database of learned object-to-image correspondence hypotheses. Correspondences could be determined based on the object recognition results, which are used for pose estimation and final verification.
In [13], Wunsch and Hirzinger formalized the problem as the optimization of an objective function combining all the correspondence and pose constraints. Combined with a hybrid pose estimation algorithm, a random-start local search method is performed in the object-to-image space. However, in their algorithm, the correspondence constrains are not represented analytically. Instead, each object feature is explicitly matched to the closest sight line of the image features.
David et al. derived a very similar algorithm named SoftPOSIT in [14] by defining a global objective function, which combined an iterative correspondence assignment technique called Softassign [11] and an iterative pose estimation technique called POSIT [15] into a single iteration loop. SoftPOSIT stands out in the simultaneously estimating pose and correspondence approaches because of its accuracy and speed. The drawback is that it tries different initial poses and succeeds when the initial pose is close to the real pose.
SoftSI algorithm [16] proposed by Zhou et al. is also based on the global objective function, which can simultaneously obtain pose and correspondences. The SoftSI algorithm combines the SI algorithm and two singular value decomposition- (SVD-) based shape description theorems. By analyzing the calculation process of SI algorithm, the method can avoid pose ambiguity and quickly eliminate bad initial values.
This objective function can be also solved by different optimization algorithms [17–21] and can contains line features [22], cornerless images [23], and nonrigid shape [24]. Unfortunately, there is no guarantee of finding the global optimum given a single random-start initial guess in the algorithm, especially when the number of points is not equal because of occlusion, clutter, and image noise.
Genetic algorithm (GA), which simulated the natural evolution process, is a useful tool for many optimization problem. Recently, GA and its variations have been used in many fields, such as computer vision, artificial intelligence, pattern recognition, telecommunication, smart sensing, and mobile computing. In telecommunications, the quality of service (QoS) parameters may conflict with each other, so this problem is actually a multiobjective optimization [25]. A genetic algorithm can be used to solve complicated global optimization problems. In quantum computing, genetic algorithm can be used in stimulated annealing for combinatorial optimization so as to avoid premature convergence [26].
3. Problem Formulation
In this section, we first give a description of camera model and the formulation of pose estimation algorithm with known correspondences, using the closed-form global optimal function. Then the function is modified to characterize the global pose-correspondence problem without known correspondences. The genetic algorithm is introduced at last.
3.1. Camera Model
Generally, a camera can be seen as a pinhole model, which describes the mathematical relationship between the coordinates of a 3D point and its projection onto the image plane, where the camera aperture is described as a point. Some of the effects that the pinhole camera model does not take into account can be compensated, for example, by applying suitable coordinate transformations on the image coordinates. This means that the pinhole camera model often can be used as a reasonable description of how a camera depicts a 3D scene, for example, in computer vision and computer graphics.
The geometry related to the mapping of a pinhole camera is illustrated in Figure 1.

The pinhole model of a camera.
In the model, a 3D orthogonal coordinate system is abstracted with its origin at
A point
A point P somewhere in the world has the coordinates
There is also a 2D coordinate system in the image plane, with origin at
As depicted in Figure 1, the projection function can be formulated as
In image coordinate system, there is an equation as follows:
According to (2) and (3), the pinhole model of a camera can be described as
3.2. Pose Estimation with Known Correspondences
The mapping from 3D to 2D coordinates described by a pinhole camera is a perspective projection followed by a 180° rotation in the image plane, which corresponds to how a real pinhole camera operates. The resulting image is rotated 180° and the relative size of projected objects depends on their distance to the focal point and the overall size of the image depends on the distance f between the image plane and the focal point. In order to produce an unrotated image, which is what we expect from a camera, we can place the image plane so that it intersects the

The perspective projection of a pinhole model.
Assuming that the coordinates of corresponding 3D object points and normalized 2D image points are
The perspective projection camera model can then be rewritten as
If we multiply the same factor
In photogrammetry, the pose estimation problem is usually formulated as the problem of optimizing the following objective function:
We call
In the objective function E, the pose vectors are those for which all the partial derivatives of the objective function with respect to the coordinates of these vectors are zero.
3.3. Pose Estimation with Unknown Correspondences
According to the above analysis with known correspondences, pose vectors can be computed in each iterative loop by minimizing the object function E. When correspondences are unknown, each image point
Therefore, the simultaneous pose and correspondence problem can then be formulated as follows:
In next stage, we should find a zero-one assignment matrix
Assuming the correspondence variables
The objective function is minimized iteratively, with the following three operations at each iteration step:
Given the initial guess of pose vectors Assuming the scaling factor Using the estimated pose vectors
The above iterative approach can be summarized as follows. First, given an appropriate initial guess for the pose vectors according to the set of 2D image points and 3D object points, then the assignment matrix which represents correspondence between the image points and object points is estimated. Finally, the pose vectors are updated, using the results of the correspondences between 2D image points and 3D object points. This process is repeated until these estimations converge. In this case, through the final pose vectors
Then the final rotation matrix
3.4. Genetic Algorithm
Genetic algorithm (GA) was an optimal model which simulated the natural evolution process [25–27]. The algorithm was first presented by John Holland in 1975.
In a genetic algorithm, a population of candidate solutions, called individuals, to an optimization problem is evolved toward better solutions. Each candidate solution has a set of properties, its chromosomes, which can be mutated and altered. Traditionally, candidate solutions are represented in binary as strings of 0 and 1, but other encodings are also possible.
The evolution usually starts from a population of randomly generated individuals and is an iterative process, and the population in each iteration is called a generation. In each generation, the fitness of every individual in the population is evaluated, which is usually the value of the objective function in the optimization problem being solved. The more fit individuals are stochastically selected from the current population, and each individual's genome is modified to form a new generation. The new generation of candidate solutions is then used in the next iteration of the genetic algorithm. Commonly, the genetic algorithm terminates when either a maximum number of generations has been produced or a satisfactory fitness level has been reached for the population.
A typical genetic algorithm requires two aspects. The first one is a genetic representation of the solution domain, and the second one is a fitness function to evaluate the solution domain [25–27].
The flow chart of genetic algorithm is depicted in Figure 3, where

The flow chart of genetic algorithm.
Once the genetic representation and the fitness function are defined, a GA proceeds to initialize a population of solutions and then to improve it through repetitive application of the mutation, crossover, inversion, and selection operators.
4. Our Initialization Method
To solve the simultaneous pose and correspondence problem, an iterative process starting from an initial guess is necessary [14]. Usually, the minimized objective function has many local optimal because of occlusion, clutter, and image noise. If the initial guess is not appropriate, the correct global optimum will not be reached to or the objective function will converge to its other local optimal. Therefore we present an effective strategy based on genetic algorithm to solve the simultaneous pose and correspondence problem.
4.1. Initialization
The random-start method is a common initialization method, which starts from a random initial guess. There are 12 different variables, including a
4.2. Algorithm Framework
In the traditional random-start initialization method, beginning from a random initial guess, pose and correspondence are iteratively estimated until meeting a specified termination criterion. If the calculated pose and correspondence are not correct, all the steps are processed again with another new random initial guess. However, the random-start method does not consider the different influence of each random initial guess because they are independent with each other in this method. Even worse, the initial guess
To get an appropriate initial guess, we present a new initialization method based on genetic algorithm [21]. First, instead of generating an initial guess one by one, a set of K random initial guesses
To illustrate the three modified operations, we use two candidate solutions in an iterative step as example:
We define three genetic algorithm operations.
Selection. Consider
Crossover. Consider
Mutation. Consider
Commonly, the iterative process will terminate when either a maximum number of iterations has been produced or a satisfactory fitness level has been reached for the population.
In practice, the search for better solution should be terminated when the current solution is such that the number of matching points is smaller than the threshold
In this case, we can deduce that the current initial guess is not suitable for the simultaneous pose and correspondence problem and a new initial guess should be given. Due to image noise, occlusion, and measurement error, ρ will not reach to 1 even when a good initial guess is found. Therefore, in the experiments discussed below, we take
This test is not perfect, as it is possible for an initial guess to be very accurate even when the number of matched points is less than this threshold, which occurs mainly in cases of high noise. Conversely, a wrong initial guess may be accepted when the ratio of clutter features to detected object points is high. However, these situations are relatively uncommon both in simulation case and in practice.
5. Experiments
We first show the simulation results to confirm the effectiveness of our initialization method, compared with traditional random-start initialization method. Then experiments with real images intuitively present the computed pose and correspondences.
5.1. Synthetic Experiments
We generate synthetic data as follows: 3D object has eight points, distributing in the vertexes of a cube; there are seven 2D image points with unknown correspondence to 3D object points, where one vertex of the cube is occluded, as shown in Figure 4. All the coordinates of image points have been normalized for the sake of convenience. Connecting edges are used for understanding, instead of computing.

3D object points and 2D image points with one occluded point.
In synthetic experiments, the convergence rate and convergence speed of global object function are evaluated, respectively, initialized with the traditional random-start method and our novel initial method.
An initial guess is convergent only when the global objective function E in (14) reaches to a stable minimum with finite iterations. It means that, with the initial guess, pose and correspondence can be calculated simultaneously in a few of iterations.
We first compute the convergence rate (CR) at different noise levels, which is defined as
Figure 5 shows the convergence rates of our initial method and the traditional random start method as a function of noise level. In this figure, each point depicts results averaged over 1000 random initial guesses. It can be seen that our initial method has a higher convergence rate and decreases more slowly than the traditional random-start method. This is due to the fact that our initial method using genetic algorithm considers different influences between each initial guess and has a higher probability of approaching the minimum of the global objective function than the traditional random start method.

Convergence rate at different noise level. The results initialized with random start method are plotted as circles (○) and the results using our initialized method are plotted as squares (□).
In the synthetic experiment, a set of random initial guesses is first generated as candidate solutions in our method based on genetic algorithm. Then certain processes are done with all the initial guesses until that the proper pose is determined. Similarly, the traditional random-start method also generates a list of globally aware guesses and tests against these random guesses as a whole. At this time, if the list of initial guesses contains the proper pose, the selection process terminates. However, the proper pose may be not in the list, and in this case, another list of initial guesses is needed to randomly generate. The whole procedure does not consider the former initial guess that has already been computed. On the opposite, our method based on genetic algorithm takes all the initial guesses as a whole population and new initial guess is generated according the results of former population.
We then compare the convergence speed when an initial guess computed by our method or random start method is convergent. The results are depicted in Figures 6 and 7.

The value of object function E at each iteration without noise. The results initialized with random start method are plotted as circles (○) and the results using our initialized method are plotted as squares (□).

Number of iterations at different noise level.
From Figures 6 and 7, we can see that the value of objective function E decreases quickly with our initial guess and the number of iterations is always low. However, initialized with random start, the value of object function E decreases quickly only at the first several iterations and becomes very slowly at the subsequent iterations, which leads to a very large number of iterations. Meanwhile, the random start method does not consider the different influence of each random initial guess because every initial guess is independent with each other. Even worse, the initial guess may repeat in some range, and the count of iterations will increase rapidly and the convergence speed will decrease greatly.
When in presence of large amounts of clutter, occlusions, or image noise, the random-start method searches for the proper initial pose in the pose space by using a RANSAC-style approach. It should consider all possible correspondences between 2D image points and 3D object points. What is worse, the initial guesses are represented in a 6D pose space (
Figure 8 depicts the average time consumption until the initial guess is convergence.

Average time consumption at different noise level.
When the noise level is low, the initial guess in the 6D pose space has high probability to converge. Only a few of times is needed to determine the proper guess with the randomly selected initial pose. However, when the noise level becomes high, the time consumption increases obviously because the convergence rate fast decreases and more times are needed to try to find a proper initial value. On the opposite, the average time consumption increases slowly with the different noise level.
Figure 9 shows an example computation of simultaneous pose and correspondence algorithm initialized with our method, using the cube model depicted in Figure 4.

The iteration step of perspective projection with an initial value calculated by our method.
In Figure 9, we can see that the perspective projection of 3D object points overlaps with original 2D image points gradually after few iterations, giving the appreciate initial guess calculated by our method, which means that pose between the coordinate system of 3D object and the coordinate system of camera can be estimated with unknown correspondences and occlusion.
5.2. Real Images
We test our initialization method for simultaneous pose and correspondence problem on real images.
In the test, there are 11 or 10 image points in 2D image and 28 object points in the 3D model. An initial pose is first given by random start method (the results are showed in Figure 10) and then by our initialization method (the results are showed in Figure 11). Figure 12 is another experiment results initialized using our method.

The results of real image initialized by random start method (11 image points): (a) 2D image points s (point number 9); (b) the projection of 3D model with the calculated initial pose; (c) the projection results at each iteration; (d) the final result of simultaneous pose and correspondence algorithm.

The results of real image initialized by our method (11 image points): (a) 2D image points; (b) the projection of 3D model with the calculated initial pose; (c) the projection results at each iteration; (d) the final result of simultaneous pose and correspondence algorithm.

The results of real image initialized by our method (10 image points): (a) 2D image points; (b) the projection of 3D model with the calculated initial pose; (c) the projection results at each iteration; (d) the final result of simultaneous pose and correspondence algorithm.
3D object model with 28 points is known and 2D image points are recognized by our implementation of Harris algorithm [12].
As depicted in Figure 10, with the unsuitable initial pose given by random start method, the final pose is not correct and the simultaneous pose and correspondence problem cannot be solved very well.
As shown in Figures 11 and 12, the number of image points is less than 3D object points; what is more, the correspondences between 2D image points and 3D object points are unknown due to occlusion, clutter, and image noise. Using our initialization method based on genetic algorithm, an appropriate initial guess is first calculated and then pose and correspondences are estimated with the initial value. Finally, 3D object model is accurately reprojected onto image plane using the rotation matrix and translation vector contained in pose. The effectiveness of the initial guess can be verified by observing the overlapping degree between the perspective projection and the original image.
6. Conclusions and Future Work
In this paper, we present an initialization method for determining the pose of 3D objects from images to solve the simultaneous pose and correspondence problem. Pose and correspondences can be computed simultaneously by minimizing a global objective function initialized using our initial pose, which is calculated based on genetic algorithm considering the influences of different initial values comprehensively. Compared with the traditional random-start initialization method, our proposed method has higher convergence rate and lower number of iterations, which has been verified through experiments.
Future work will involve initializing the simultaneous pose and correspondence algorithm automatically using special features extracted in 3D object and 2D image plane. The effectiveness of other optimization algorithms would be analyzed. We are also interested in implementing a more thorough formalism to include initialization, pose estimation, and correspondence determination.
Footnotes
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work was supported by the Fund of National Natural Science Foundation of China (61231018) and National High Technology Research and Development Program of China (2013AA014601).
