
Abstract

With the expansion of underground infrastructure in urban areas, positioning in such scenarios has become a crucial problem for ubiquitous urban positioning applications. The visual odometry algorithm can provide an accurate position reference for ground applications. However, it encounters robust estimation problems underground because automatic feature matching on underground structures is difficult and errors are frequent. In this article, we present a novel structure-aware sample consensus algorithm to solve the robust estimation problem in stereo visual odometry. Features from the rigid structure provide a static reference and are more likely to be inliers; based on this, we introduce a structure feature-guided sampling procedure in place of the random sampling procedure used in random sample consensus. With this procedure, structure-aware sample consensus is more likely to generate a correct motion model and performs as a robust estimator for the underground visual odometry algorithm. Experiments with both synthetic and real-world data show that structure-aware sample consensus outperforms random sample consensus and its variants in robustness while maintaining a lower computational cost. In addition, the structure-aware sample consensus-based visual odometry algorithm maintains the same level of robustness and accuracy in both ground and underground scenarios, which makes the algorithm applicable to ubiquitous urban positioning systems.

Keywords

Visual odometry, underground positioning, robust estimation, random sample consensus, underground dataset

Introduction

In recent years, vision-based localization techniques have been well developed for ground mobile applications such as autonomous vehicles and mobile mapping systems.^1,2 As modern urban development and expansion have increasingly shifted to underground infrastructure to relieve ground traffic loads, extending ground platforms to underground use is imperative for ubiquitous positioning systems. According to recent research, both direct^3,4 and feature-based (indirect) visual odometry (VO)^5,6 provide accurate motion estimation results. By combining feature matching, motion estimation, and nonlinear refinement, feature-based VO achieves good invariance to illumination and viewpoint changes, which makes it more suitable for underground applications. Nonetheless, developing a feature-based VO solution for positioning applications in underground infrastructure remains non-trivial because the visual structures and appearances are notably different from those of the ground environment. Specifically, the major restrictions introduced by underground infrastructure are poor illumination conditions, featureless structures, and repetitive structures, which create challenges for feature-matching techniques, for example, scale-invariant feature transform (SIFT),⁷ speeded up robust features (SURF),⁸ and oriented FAST and rotated BRIEF (ORB).⁹ Consequently, the matching results underground are contaminated with a large portion of wrong matches. These wrong matches, which are called outliers, have a severe negative effect on the positioning accuracy, since commonly used optimization techniques, such as least squares regression, can produce arbitrarily bad model estimates in the presence of a single outlier.¹⁰ Therefore, robust estimation is indispensable for the VO algorithm. This problem has been well studied in recent years, and the most widely used method, particularly in computer vision, is random sample consensus (RANSAC).¹¹

The RANSAC scheme is a remarkably powerful technique for robust estimation. One compelling reason for its widespread adoption is its ability to tolerate a tremendous level of contamination, which provides a reliable parameter estimation even when well over half the data consist of outliers. While robust, the standard RANSAC has its own problem in balancing robustness and efficiency. In the underground VO application, which suffers from a high outlier ratio, RANSAC can select the inliers, but only at a high computational cost because of the inherently random nature of its sampling process. Recent studies on computational efficiency and robustness have helped drive forward the state of the art and enabled numerous real-world applications.^12–15 Although recent efforts have focused on improving the RANSAC algorithm, relatively little attention has been paid to discovering other clues from the environment to improve the performance.

In this article, we exploit the constructional attribute of the rigid underground infrastructure to detect the structure features and guide the sampling process to achieve robust and efficient motion estimation. As shown in Figure 1, some image features are part of the rigid structure, and their inter-point Euclidean distances in three-dimensional (3D) space are constant. Based on this observation, we first check the pairwise Euclidean rigidity constraint and then formulate a vertex cover problem to select the structure features. These structure features are more likely to be inlier matches and provide a static real-world coordinate reference for the motion estimation algorithm. After the structure feature selection, a structure-aware sample consensus (SASAC) algorithm is proposed to efficiently guide the sampling process to generate a correct motion model. When applying SASAC, the sampling procedure begins in the group of structure features, which has a higher inlier ratio, and then moves to the normal image feature group. Thus, SASAC is more likely to find the correct motion model than RANSAC. Therefore, within the same number of iterations, SASAC is more robust than RANSAC, particularly in cases with a high outlier ratio.

Figure 1.

The structure features come from static structures and objects. These structures are rigidly connected to each other, which makes their Euclidean distances constant in 3D space.

To apply SASAC in a VO scheme, the 3D positions of the corresponding features should be recovered before motion estimation. This task is trivial for stereo cameras, RGB-D cameras, and LiDAR, which can recover the depth directly. Due to the limited space, we focus on the stereo-camera case in this article, but our method can be extended to the other sensors. The performance of the SASAC-based stereo-VO algorithm is evaluated on the underground data collected using our own mobile platform and the ground data from the KITTI dataset.¹⁶ The main contributions of this work are twofold:

To achieve robust and efficient stereo-VO in the underground environment, we propose a SASAC algorithm to estimate the motion in the presence of rigid structures. Although our algorithm emphasizes solving the underground robust motion estimation problem, it also works with ground data and is able to remove the outliers caused by moving objects.

To evaluate the performance of the algorithm, we collected our own dataset with a self-designed unmanned vehicle. The ground truth was obtained with a total station with millimeter-level measurement accuracy. To contribute to the community, we will make this underground dataset with the ground truth available to the public.

In the remainder of this article, related works are discussed in section “Background.” Section “Structure feature definitions and detection” describes the primary structure feature definition and selection scheme. In section “Structure-aware sample consensus,” we describe the SASAC method and prove its robustness and efficiency theoretically. Then, the performance of SASAC and the comparison methods is evaluated with both synthetic and real-world datasets in section “Experiment.” Finally, we conclude the work and discuss further improvements in section “Conclusion and discussion.”

Background

The RANSAC algorithm¹¹ was originally presented as a robust estimator for model fitting in the presence of outliers. It operates in a hypothesize-and-verify manner by repeatedly sampling the subsets of the input data to hypothesize the model parameters. Each hypothesized model parameter is scored by other data points and the hypothesis that obtains the best score is returned as the solution. To ensure with the confidence $n_{0}$ that at least one outlier-free set of $m$ points is sampled in RANSAC, we must draw at least $K$ samples

K > \frac{\log (1 - n_{0})}{\log (1 - e^{m})}

(1)

where $e$ is the inlier ratio of the input data and $m$ is the sample size. The confidence $n_{0}$ is typically set to 0.95 or 0.99. To ensure that at least one outlier-free sample is selected, the RANSAC scheme must draw significantly more samples when the inlier ratio decreases. To provide a better intuition, we assume that the sample size $m$ is 5 and set $n_{0}$ to 0.95; the minimum number of samples $K$ increases from 37 to 1231 when the inlier ratio $e$ decreases from 0.6 to 0.3. This is the main drawback of applying RANSAC in the low inlier case: it either increases the computational burden or reduces the robustness of the entire algorithm.
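As a quick check of equation (1), the following sketch evaluates the bound for the two inlier ratios quoted above. The function name `ransac_min_samples` is hypothetical and not part of the original work.

```python
import math

def ransac_min_samples(inlier_ratio, sample_size, confidence):
    """Right-hand side of equation (1): log(1 - n0) / log(1 - e^m)."""
    p_clean = inlier_ratio ** sample_size  # probability that one sample is outlier-free
    return math.log(1.0 - confidence) / math.log(1.0 - p_clean)

for e in (0.6, 0.3):
    bound = ransac_min_samples(e, sample_size=5, confidence=0.95)
    print(f"e = {e}: K > {bound:.2f}")
```

The integer parts of the two bounds correspond to the 37 and 1231 samples quoted in the text.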

In fact, the low inlier case exists in several situations such as wide-baseline matching and VO applications in underground scenarios, and several approaches have been proposed to solve this problem in the recent literature. Some studies^17,18 focus on improving the matching performance to provide matches with a higher inlier ratio. However, these methods result in a reduced set of matches, which is not practical in our case, since the featureless structures underground can barely provide useful features. Other studies apply branch-and-bound-based methods^19,20 to provide an optimal solution to the robust estimation problem. However, this comes with a significantly higher computational cost and is not applicable to VO algorithms. Our work focuses on incorporating prior information from the environment and guiding the sampling to generate models that are more likely to be correct. This procedure can significantly improve the robustness and efficiency of RANSAC, particularly for cases with high outlier ratios. The idea of guided sampling has been applied in several ways in the RANSAC literature. A detailed review of some prominent works is given here.

The progressive sample consensus (PROSAC) algorithm²¹ measures the similarity of putative matches and preferentially generates hypotheses based on matches with high similarity. This algorithm assumes that putative matches with high similarity are more likely to be inliers. This assumption is proven to be valid in their initial work, and the algorithm achieves significant computational savings because good hypotheses are verified early in the sampling process. However, the quality scores are not always helpful: in the scenarios with repetitive artificial structures, the gains obtained with PROSAC are less significant.²²

The group sample consensus (GroupSAC) algorithm²³ uses the observation that inliers are often more similar to one another. GroupSAC separates data points into a number of similar groups according to some criterion such as image segmentation. Then, it assumes that the largest, more consistent clusters tend to have a higher inlier rate. As in PROSAC, the sampling scheme begins by testing a small subset of the most promising groups and gradually expands this subset to include all points. GroupSAC is shown to improve the sampling efficiency; however, the grouping stage, which is part of the algorithm, can be computationally expensive.

Both GroupSAC and PROSAC share the idea of starting the sampling in a group with a relatively high inlier ratio. However, they rely on assumptions derived from observation, which have not been proven to be universally applicable.²⁴

In this study, we use the Euclidean rigidity constraint to select the structure features as the high inlier rate group. Compared with the assumptions of PROSAC and GroupSAC, the Euclidean rigidity constraint is more restrictive and more practical in urban scenarios with rigid structures, particularly in underground or indoor environments. In addition, the Euclidean rigidity constraint is motion invariant and can be checked before the motion estimation. Thus, stereo-VO applications can integrate our algorithm without many changes while gaining the ability to work in indoor or underground situations.

Structure feature definitions and detection

In urban scenarios, we observe that man-made structures, for example, underground facilities, buildings, and bridges, are rigidly constructed. With this observation, we first recover the 3D positions of the structures and then check the inter-point Euclidean distances between pairs of 3D points. Given two pairs of corresponding 3D points from the rigid structures $(P_{i}, P'_{i})$ and $(P_{j}, P'_{j})$, the distance between them does not change, that is, $∥ P_{i} - P_{j} ∥ = ∥ P'_{i} - P'_{j} ∥$. This is known as the Euclidean rigidity constraint, and it implies that if the Euclidean distance between two measured 3D points is not constant, these 3D points cannot both belong to the rigid structure. Two kinds of features lead to this situation: features that belong to a non-rigid object and features that are wrongly matched. These non-rigid features and wrongly matched features are eliminated since they reduce the RANSAC efficiency and motion estimation accuracy. After the elimination, the remaining features all come from the rigid structure and are defined as the structure features. Since the Euclidean rigidity constraint is quite restrictive, the inlier ratio of the structure feature group is much higher than the inlier ratio of the original data.

Euclidean rigidity constraint for structure feature detection

In the case of stereo cameras, two identical cameras have parallel axes pointing in the same direction at right angles to the stereo base. With respect to an XYZ coordinate system located in the left perspective center, the 3D position of a point $P = (X, Y, Z)$ can be derived as

\begin{matrix} X = \frac{d}{c} \times x' \\ Y = \frac{d}{c} \times y' \\ Z = d = \frac{bc}{x' - x''} \end{matrix}

(2)

where $(x', y')$ and $(x'', y'')$ are the corresponding features in the left and right images, $c$ is the focal length, $b$ is the baseline, and $d$ is the depth. In practice, the measurements of the features and the depths are contaminated by noise. After applying error propagation, the uncertainties of the measured point in the X, Y, and Z directions are

\begin{matrix} S_{X, Y} = \frac{d}{c} \times ϵ \\ S_{Z} = \frac{d^{2}}{\sqrt{2} bc} \times ϵ \end{matrix}

(3)

where $ϵ$ is the measurement noise.²⁵ The 3D point uncertainty $D$ can be represented as

D \leq \sqrt{S_{X}^{2} + S_{Y}^{2} + S_{Z}^{2}}

(4)
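Equations (2) to (4) can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names and the rig parameters (0.5 m baseline, 700-pixel focal length, 0.5-pixel noise) are assumptions, image coordinates are assumed to be measured relative to the principal point, and $D$ is taken at its upper bound.

```python
import math

def triangulate(x_left, y_left, x_right, baseline, focal):
    """Equation (2): 3D point in the left-camera frame from a stereo match."""
    depth = baseline * focal / (x_left - x_right)  # d = bc / (x' - x'')
    return (depth / focal * x_left, depth / focal * y_left, depth)

def point_uncertainty(depth, baseline, focal, eps):
    """Upper bound of equation (4), built from the terms of equation (3)."""
    s_xy = depth / focal * eps                                    # S_X = S_Y
    s_z = depth ** 2 / (math.sqrt(2.0) * baseline * focal) * eps  # S_Z
    return math.sqrt(2.0 * s_xy ** 2 + s_z ** 2)

# Hypothetical rig and measurement, for illustration only.
P = triangulate(100.0, 50.0, 30.0, baseline=0.5, focal=700.0)
D = point_uncertainty(P[2], baseline=0.5, focal=700.0, eps=0.5)
print(P, D)
```

Because $S_{Z}$ grows with the square of the depth, distant points receive a much looser bound than nearby ones.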

Considering the 3D point uncertainties $(D_{i}, D_{j})$ of the points $(P_{i}, P_{j})$ in Figure 2, the Euclidean rigidity constraint can be derived as

| ∥ (P_{i} - P_{j}) ∥ - ∥ (P'_{i} - P'_{j}) ∥ | \leq D_{i} + D_{j}

(5)

Figure 2.

In the case of stereo camera, the 3D points have measurement uncertainty. (a) 2D illustration of the stereo triangulation model and the measurement uncertainty in XYZ direction. (b) Considering the uncertainty $D_{i}, D_{j}$ of the 3D points, the difference of the inter-point Euclidean distance should be less than a threshold.

Thus, the definition of the structure features is derived as follows:

Definition 1

Two pairs of corresponding features $(P_{i}, P'_{i})$ and $(P_{j}, P'_{j})$ are structure features if they satisfy the Euclidean rigidity constraint (cf. equation (5)).
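The pairwise test of equation (5) can be sketched as follows. The function name, the sample points, and the uncertainty values are hypothetical and chosen only to show one consistent pair and one inconsistent pair.

```python
import math

def is_rigid_pair(p_i, p_j, q_i, q_j, d_i, d_j):
    """Equation (5): p_* are 3D points in the first frame, q_* their matches in the second."""
    change = abs(math.dist(p_i, p_j) - math.dist(q_i, q_j))
    return change <= d_i + d_j

# Hypothetical static scene observed before and after a camera motion.
p1, p2 = (0.0, 0.0, 5.0), (1.0, 0.0, 5.0)
q1, q2 = (0.2, 0.0, 4.5), (1.2, 0.0, 4.5)   # both points shifted by the same motion
q2_bad = (2.2, 0.0, 4.5)                     # wrong match: the distance to q1 doubles

print(is_rigid_pair(p1, p2, q1, q2, 0.03, 0.03))
print(is_rigid_pair(p1, p2, q1, q2_bad, 0.03, 0.03))
```

The check needs no motion estimate, which is why it can run before the pose is solved.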

With the structure feature definition, we can easily check the pairwise consistency of two features. If two features are inconsistent with the rigidity constraint, at least one of them is not a structure feature. After checking the pairwise consistency of all features, we remove the non-rigid features and wrongly matched features to obtain a group of structure features. This problem can be summarized as follows.

Problem 1

For a set of features with known pairwise consistency, find the largest subset $I$ in which all features satisfy the pairwise Euclidean rigidity constraint.

To solve this problem, a graph is built with all features as vertices and with edges that connect the inconsistent ones. Finding the largest subset is equivalent to removing as few vertices as possible while eliminating all the edges. An example is shown in Figure 3: after checking the pairwise Euclidean rigidity constraint, a graph is built on the left. By removing the minimal vertex cover {2,5}, all edges are eliminated, such that {1,3,4,6} are consistent with one another. This is known as the vertex cover problem, which is a nondeterministic polynomial (NP)-complete problem.²⁶ This problem has been extensively discussed in previous work.^27,28 In this article, we use a combination of a factor-2 approximation and a branch-and-bound approach²⁹ to solve the vertex cover problem and remove the vertices that are non-rigid features and wrongly matched features. The algorithm can be summarized as follows.

Figure 3.

In the graph, the vertices correspond to the features; if two features cannot satisfy the Euclidean rigidity constraint, they are linked with an edge. To find a group of structure features, all the edges should be removed.

Algorithm 1: Structure Feature Detection
Check the pairwise Euclidean rigidity constraint.
Build the graph.
Let $N$ be an upper bound for the non-rigid/wrong matched features.
Repeat until convergence:
1. Pick a vertex $v$ from the graph.
2. Try to prove that $v$ lies in the minimal vertex cover.
   If this works
   - Remove $v$ and update the graph
   otherwise
   - Find a vertex cover $K$ with $v \notin K$
   - If $|K| < N$, update $N$.
Remove $N$ features.
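The reduction to vertex cover can be illustrated on a six-feature graph. The sketch below is a plain branch-and-bound search and omits the factor-2 approximation stage mentioned in the text; the edge list is hypothetical (the edges of Figure 3 are not given in the text) and was chosen so that the minimal cover is {2,5}, as in the example.

```python
def min_vertex_cover(edges):
    """Smallest vertex set touching every edge, found by branching on an uncovered edge."""
    best = {v for edge in edges for v in edge}  # trivial upper bound: all endpoints

    def branch(remaining, cover):
        nonlocal best
        if len(cover) >= len(best):
            return                              # bound: cannot improve on the best cover
        if not remaining:
            best = set(cover)                   # every edge is covered by a smaller set
            return
        u, v = remaining[0]                     # one endpoint of this edge must be chosen
        for pick in (u, v):
            branch([e for e in remaining if pick not in e], cover | {pick})

    branch(list(edges), set())
    return best

# Hypothetical inconsistency graph: an edge links two features that violate equation (5).
edges = [(1, 2), (2, 3), (2, 5), (4, 5), (5, 6)]
cover = min_vertex_cover(edges)
structure_features = {1, 2, 3, 4, 5, 6} - cover
print(cover, structure_features)
```

The search is exponential in the worst case, which is consistent with the text's remark that structure feature detection does not scale to a large number of features.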

After the structure feature detection, the structure features are used to guide the sampling process rather than estimate the motion directly. Because the structure feature detection algorithm cannot efficiently handle a large number of features, estimating the motion with only the structure features can lead to degeneracies. For example, if the sampled structure features are from the same plane, there can be an infinite number of solutions. So, we propose a more efficient method to use the structure features and set safeguards against degeneracies.

Structure-aware sample consensus

The stereo-VO algorithm is a sequential process of motion estimation from images of the real-world scene. Taking two sequential image pairs as an example, feature points are extracted from the images and matched with a feature detector, for example, SIFT.⁷ In the first pair, the triangulation procedure projects the 2D image features to 3D points in the world coordinate frame. In the second pair, a motion estimation algorithm, for example, Perspective-Three-Point (P3P),³⁰ can estimate the camera pose (rotation and translation) from the 3D–2D correspondences. Since the feature extraction and matching are performed automatically, a certain portion of the correspondences is wrong. In indoor or underground infrastructure, this portion increases due to repetitive structures and poor illumination. To address this situation, we first introduce our preprocessing strategy and then describe the guided sampling scheme in this section.

Preprocessing: 3D bucketing

Theoretically, checking the Euclidean rigidity constraint of the 3D points reveals the structure features. However, as previously noted, checking all the feature points is computationally expensive. In addition, when the points are close to one another, the chance of degenerate configurations increases. To reduce the computation cost and set safeguards against degeneracies, we present a 3D-bucketing algorithm to obtain a sparse subset of 3D points with a uniform distribution. The 3D-bucketing algorithm is outlined in Algorithm 2.

Algorithm 2: 3D-bucketing algorithm
Project the matched features into 3D via triangulation.
Set up cubes with certain length covering all the 3D points.
Push points into the cubes, according to their 3D position.
For each cube, draw $n$ points.
Return the sparse 3D points.

Then, the pairwise Euclidean rigidity constraints of the sparse 3D points are checked, and the vertex covering problem is solved to detect the structure features. The 3D-bucketing preprocessing reduces the computation burden of the pairwise Euclidean distance calculation and the structure feature detection, which is an important prerequisite for the SASAC-based VO algorithm.
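The bucketing steps of Algorithm 2 can be sketched as below, starting from already triangulated points. The function and parameter names are hypothetical, and drawing the $n$ points of each cube uniformly at random is an assumption, since the text only says that $n$ points are drawn.

```python
import math
import random
from collections import defaultdict

def bucket_3d(points, cube_len, n_per_cube, rng):
    """Return at most n_per_cube points from every occupied cube of side cube_len."""
    cubes = defaultdict(list)
    for p in points:
        key = tuple(math.floor(coord / cube_len) for coord in p)  # integer cube index
        cubes[key].append(p)
    sparse = []
    for members in cubes.values():
        sparse.extend(rng.sample(members, min(n_per_cube, len(members))))
    return sparse

rng = random.Random(0)
dense = [(0.1 * i, 0.0, 5.0) for i in range(10)]  # ten points falling into one cube
isolated = (7.5, 2.5, 9.5)                        # a single point in another cube
sparse = bucket_3d(dense + [isolated], cube_len=1.0, n_per_cube=2, rng=rng)
print(sparse)
```

Capping the number of points per cube thins out dense clusters while keeping isolated points, which is what spreads the retained points across the scene.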

Guided sampling

After the structure features are verified, we present SASAC, a variant of RANSAC that gains additional efficiency by exploiting the properties of the structure features. SASAC is based on the hypothesis-testing framework of RANSAC, except for a guided sampling procedure. Once the structure features are selected, the original image features are subdivided into two subsets: the structure feature subset with a high inlier ratio and the common feature subset with a low inlier ratio. Assuming that the sample size is $m$, the sampling configurations of SASAC can be written as $\{G_{m - u}^{u}\}_{u = m, \ldots, 0}$, where $G_{m - u}^{u}$ means selecting $u$ samples from the structure feature group and $m - u$ samples from the common feature group.

The sampling process starts with $u = m$, where the configuration is $G_{0}^{m}$. After a certain number of trials, the process moves to the next stage with $u = m - 1$ and $G_{1}^{m - 1}$. In this way, the sampling process gradually goes through all the configurations until $u = 0$. In the end, all the configurations will have had their opportunity to be selected. The process can be summarized as follows:

Algorithm 3: The SASAC algorithm
Generate all the sample configurations with $u = m$ to $u = 0$.
Repeat the following:
1. Check the maximum trials for the current configuration (equation (6)). If reached, move to the next configuration with $u = u - 1$.
2. Draw $u$ samples from the structure feature group and $m - u$ samples from the common feature group.
3. Generate the hypothesis with the sampled features.
4. Verify the current hypothesis with consistent features.
5. Check the termination criteria in equations (8) and (9).

With the prior knowledge that the structure feature group has a higher inlier ratio, SASAC biases the sampling toward hypotheses that are more likely to be correct. Thus, SASAC is more likely to meet the termination criteria and stop early, before the computation budget is exhausted. This guided sampling strategy has a dramatic effect on the efficiency and robustness of SASAC, especially in low inlier ratio cases.
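The sampling order described above can be sketched as a generator that walks through the configurations from $u = m$ down to $u = 0$. The function name, the per-configuration trial counts, and the feature labels are hypothetical; hypothesis generation, verification, and termination are left to the caller.

```python
import random

def sasac_samples(structure, common, m, trials, rng):
    """Yield (u, sample): u points from the structure group, m - u from the common group."""
    for u in range(m, -1, -1):                  # configurations with u = m, ..., 0
        if u > len(structure) or m - u > len(common):
            continue                            # this configuration cannot be formed
        for _ in range(trials[u]):              # trial budget of this configuration
            yield u, rng.sample(structure, u) + rng.sample(common, m - u)

rng = random.Random(0)
structure = ["s0", "s1", "s2", "s3", "s4"]
common = ["c0", "c1", "c2", "c3", "c4", "c5"]
schedule = list(sasac_samples(structure, common, m=4,
                              trials={4: 3, 3: 2, 2: 2, 1: 1, 0: 1}, rng=rng))
for u, sample in schedule:
    print(u, sample)
```

In a real loop, the caller would stop consuming the generator as soon as a termination criterion is met, so the later, less reliable configurations are often never reached.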

In the sampling process, it is crucial to make the correct number of trials for each configuration $\{G_{m - u}^{u}\}_{u = m, \ldots, 0}$. We assume that the total number of features is $N$ and the computation budget is $T_{N}$ trials in total for all $M_{N} = \binom{N}{m}$ possible sample sets. Let $T_{u}$ be the number of trials for the $M_{m - u}^{u}$ possible sample sets of each sampling configuration $G_{m - u}^{u}$. $T_{u}$ is chosen such that the computation budget per sample set of $M_{m - u}^{u}$ is equal to that of the entire $M_{N}$ sample sets

\frac{T_{u}}{M_{m - u}^{u}} = \frac{T_{N}}{M_{N}}

(6)

The possible sample sets $M_{m - u}^{u}$ can be computed as

M_{m - u}^{u} = \binom{S}{u} \binom{C}{m - u}

(7)

where $S$ is the number of features in the structure feature group and $C$ is the number of features in the common feature group. Equation (6) implies that when the computation budget is exhausted, all the features have had the same opportunity to be sampled; that is, in the worst case, SASAC behaves as RANSAC.
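Equations (6) and (7) can be evaluated directly. The sketch below assumes that the two groups partition the features, so that $N = S + C$; the group sizes and the budget are hypothetical. The resulting $T_{u}$ are generally fractional, and the text does not state how they are rounded, so they are left unrounded here.

```python
from math import comb

def trial_budgets(S, C, m, T_N):
    """T_u from equations (6) and (7), assuming N = S + C features in total."""
    M_N = comb(S + C, m)                                   # all possible sample sets
    return {u: T_N * comb(S, u) * comb(C, m - u) / M_N     # T_u = T_N * M_u / M_N
            for u in range(m, -1, -1)}

budgets = trial_budgets(S=20, C=180, m=4, T_N=1000)
for u, t in budgets.items():
    print(u, round(t, 3))
```

Under the partition assumption, the configuration counts add up to $M_{N}$, so the budgets add up to $T_{N}$; this matches the statement that SASAC falls back to RANSAC behavior once the whole budget is spent.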

Termination criteria

Since SASAC samples in a fashion similar to PROSAC,²¹ we use the same termination criteria in our implementation. First, we check non-randomness to guard against the case in which an incorrect model is accidentally supported by outliers and selected as the final solution. More precisely, the probability that the model is supported by random points should be smaller than a certain threshold $ψ$. The minimal number of inliers $n^{*}$ required to maintain the non-randomness is

\min \left\{ n^{*} : \sum_{i = n^{*}}^{N} β^{i - m} (1 - β)^{N - i + m} \binom{N - m}{i - m} < ψ \right\}

(8)

where $β$ is the probability of an incorrect model that is accidentally supported by other features. In our implementation, we use $ψ = 5 %$ for all the experiments.
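The threshold $n^{*}$ of equation (8) can be found by a direct scan. The sketch follows the equation as printed, including the exponent $N - i + m$; the function names and the values $N = 200$, $m = 4$, and $β = 0.05$ are hypothetical.

```python
from math import comb

def support_tail(n_star, N, m, beta):
    """Sum in equation (8): chance that a wrong model gathers at least n_star supporters."""
    return sum(beta ** (i - m) * (1.0 - beta) ** (N - i + m) * comb(N - m, i - m)
               for i in range(n_star, N + 1))

def min_nonrandom_inliers(N, m, beta, psi):
    """Smallest n_star whose tail probability falls below psi."""
    for n_star in range(m, N + 1):
        if support_tail(n_star, N, m, beta) < psi:
            return n_star
    return N + 1  # no support size is large enough

n_star = min_nonrandom_inliers(N=200, m=4, beta=0.05, psi=0.05)
print(n_star)
```

A model whose inlier count stays below this value is treated as possibly supported by chance, so it does not trigger termination.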

Another termination condition is the maximality constraint, which determines how many samples must be drawn so that the probability of not yet having drawn an outlier-free sample falls below a given threshold $η$

(1 - I_{u})^{k} \leq η

(9)

where $I_{u}$ is the inlier ratio of the current configuration $G_{m - u}^{u}$ and $k$ is the number of random sampling trials. Note that this termination criterion behaves differently from that of the standard RANSAC because the inlier ratio is quite high when the sampling begins with $u = m$. In this case, even if the raw dataset has a low inlier ratio, SASAC can terminate before reaching the maximum number of sampling trials.

SASAC-based VO

SASAC is integrated into the stereo-VO scheme to perform robust motion estimation. Assuming that the features have been detected and matched, the SASAC-based stereo VO is summarized in Algorithm 4.

Algorithm 4: SASAC-based stereo visual odometry
Input: Feature correspondences.
Output: Motion parameters $R^{*}, t^{*}$ and inliers $I^{*}$.
Project the features into 3D via triangulation;
3D-bucketing;
Structure feature selection;
Set $I^{*} = 0$, $u = m$;
while Sampling number $< T_{N}$ do
    if Current configuration trials $> T_{u}$ then
        Move to the next configuration, $u = u - 1$
    end
    Draw sample with current configuration $G_{m - u}^{u}$;
    Solve the P3P with the sample to get $R, t$;
    Find inliers $I_{r}$ with current $R, t$;
    if $I_{r} > I^{*}$ then
        Update $I^{*}, R^{*}, t^{*}$ with current $I_{r}, R, t$
    end
    if Meet the termination criteria then
        break
    end
end
Refine the $R^{*}, t^{*}$ with all the inliers $I^{*}$ via nonlinear optimization.

In the algorithm, the minimal sample size is $m = 4$, since the P3P algorithm requires four feature correspondences³¹ to generate a unique solution. Compared with standard stereo VO, our implementation only requires additional computation for structure feature detection, which can be accelerated by 3D bucketing. This additional step significantly benefits the SASAC-based VO at little additional computational cost. When the entire dataset has a high outlier ratio, SASAC can terminate early and obtain a correct model, while RANSAC needs many more iterations. Besides, if the computation budget is limited, the model from SASAC is more likely to be correct than that from RANSAC.

Experiment

Our algorithm was tested in C++ on a PC with an Intel i7 2.2 GHz CPU. We evaluated the performance of the SASAC and other comparisons using both synthetic and real-world data. Since our algorithm focuses on solving the VO problem in the underground environment and no related public dataset was available to us, we collected our own dataset together with the ground truth in an underground garage.

Synthetic data evaluation

For the simulation, random problems were generated in the following manner. The position of the first stereo-rig was fixed to the origin of the world frame, and its orientation was set to identity. The position of the second stereo-rig was randomly set so that the translation distance did not exceed 2.0 m. We bounded the rotation so that none of the Euler angles exceeded 0.5 rad (approximately 30°), which generated random motion estimation problems of the kind that appear in practical scenarios. In total, 200 random points were created with a uniformly varying depth of 5.0 m to the origin of the first stereo-rig. Then, noise with a mean of 0.5 pixel was added to the measurements in the camera frame. Outliers were added by randomly projecting the points into a false camera center.

Comparisons were made with the standard RANSAC scheme under different settings: RANSAC-100 and RANSAC-1000, which stop when reaching the maximum of 100 and 1000 iterations, respectively. We also included the result obtained from RANSAC-optimum with known inliers, which only samples in the inlier group and presents the optimum result that RANSAC can achieve for a certain outlier level. The termination criteria were identical for SASAC and the comparison methods.

We ran 1000 random experiments with outlier levels of 10%–80% to evaluate each algorithm. The rotation errors were expressed in terms of the axis-angle representation of the difference between the ground truth and the estimated rotation. The translation errors were expressed as the difference of the ground truth and estimated translation with scale. To compare the performance of the sample and consensus scheme, the result came from the P3P algorithm directly, and no further nonlinear refinement was performed.
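The two error measures can be sketched as follows. Reading the rotation error as the rotation angle of $R_{gt}^{T} R_{est}$ is an assumption (it is the magnitude of the axis-angle vector of the relative rotation); the helper `rot_z` and the test values are hypothetical.

```python
import math

def rot_z(theta):
    """Rotation matrix about the z axis, used here only to build test rotations."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rotation_error(R_gt, R_est):
    """Rotation angle (rad) of R_gt^T R_est, from trace(R_gt^T R_est) = 1 + 2 cos(angle)."""
    trace = sum(R_gt[i][j] * R_est[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp against round-off
    return math.acos(cos_angle)

def translation_error(t_gt, t_est):
    """Euclidean norm of the difference between the two translation vectors."""
    return math.dist(t_gt, t_est)

print(rotation_error(rot_z(0.10), rot_z(0.13)))
print(translation_error((0.0, 0.0, 0.0), (3.0, 4.0, 0.0)))
```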

The mean errors in rotation and translation are shown in Figure 4(a) and (b). As the results show, in both rotation and translation estimation, our SASAC solution outperforms the RANSAC-100 solution when the outlier ratio reaches 60%. With outlier ratios larger than 60%, the large estimation errors of RANSAC-100 indicate that, after 100 iterations, the random sampling process cannot be guaranteed to find one outlier-free sample set. To achieve results comparable with SASAC, the RANSAC scheme requires 1000 iterations, which is computationally expensive.

Figure 4.

The performance of each algorithm is evaluated with outlier ratios from 10% to 80%. (RANSAC-100) and (RANSAC-1000) denote the standard RANSAC with a maximum of 100 and 1000 iterations. (RANSAC-optimum) denotes RANSAC sampling only in the inlier group. SASAC is the method proposed in this article: (a) mean error in rotation, (b) mean error in translation, (c) the successful estimation rate, and (d) required iterations.

Our SASAC maintains a mean rotation error of 0.0024 rad with the outlier ratio varying from 10% to 70%, and this result is close to the optimum of the RANSAC algorithm. However, when the outlier ratio approaches 80%, our algorithm occasionally fails because all the features happen to be outliers after the 3D bucketing. Theoretically, this failure rate is lower than 1% if we sample 20 cubes in the 3D bucketing, whereas the theoretical failure rates of RANSAC-100 and RANSAC-1000 are 85.2% and 20.16%, respectively, in the 80% outlier ratio case.
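
The RANSAC failure rates quoted above can be reproduced with the usual probability that no outlier-free sample is drawn in a given number of iterations. The sketch below is illustrative only. The sample size of 4 is an assumption, chosen because it reproduces the quoted figures; the function name is hypothetical.

```python
def ransac_failure_rate(outlier_ratio, iterations, sample_size=4):
    """Probability that none of `iterations` random samples is outlier-free."""
    inlier_ratio = 1.0 - outlier_ratio
    # probability that one minimal sample contains only inliers
    p_clean_sample = inlier_ratio ** sample_size
    return (1.0 - p_clean_sample) ** iterations

# 80% outliers, sample size 4 (assumed)
print(f"RANSAC-100:  {ransac_failure_rate(0.8, 100):.1%}")
print(f"RANSAC-1000: {ransac_failure_rate(0.8, 1000):.2%}")
```

Under these assumptions the two values come out at about 85.2% and 20.16%, in line with the figures given in the text.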

To further analyze the successful estimation rate, we count the successful estimations over the 1000 simulations and show the results in Figure 4(c). The experimental results match the theoretical analysis that RANSAC-100 cannot robustly estimate the motion at outlier levels larger than 50%. Both RANSAC-1000 and SASAC can robustly estimate the motion at outlier levels ranging from 10% to 70%; however, when the outlier level reaches 80%, SASAC is still able to robustly estimate the motion with a successful estimation rate of 97%, while the rate of RANSAC-1000 is 86.8%. As this experiment shows, SASAC is more robust than the RANSAC-based methods when the data are contaminated with a high ratio of outliers.

Since the performance of RANSAC-1000 is comparable to that of SASAC, it is necessary to analyze the computational efficiency of these algorithms. The non-randomness threshold $ψ$ is set to 1%, and the maximality constraint is set to 5% for all the algorithms. We compute the theoretical number of iterations from equation (9) for reference. As shown in Figure 4(d), RANSAC theoretically requires 368 and 1870 iterations when the outlier ratio reaches 70% and 80%, respectively. This indicates that the 1000-iteration limit of RANSAC-1000 is insufficient when the outlier ratio reaches 80%. SASAC requires only 6 iterations across the different outlier levels, and RANSAC-optimum requires 4 iterations. This result shows that, by selecting the structure features, SASAC guides the sampling process in the correct direction and accelerates the convergence of the algorithm. Although the performances of SASAC and RANSAC-1000 are comparable, SASAC requires orders of magnitude fewer iterations than RANSAC-1000 at high outlier levels. Thus, our algorithm is more suitable for applications with limited computational resources, for example, smartphone platforms and self-driving cars.
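
As a rough check of the theoretical iteration counts, the sketch below evaluates the standard RANSAC stopping formula. It assumes that equation (9) takes this standard form, that the 5% maximality constraint is the allowed probability of missing an outlier-free sample, and that the sample size is 4; these assumptions are chosen because they reproduce the quoted values, and the function name is hypothetical.

```python
import math

def required_iterations(outlier_ratio, miss_probability=0.05, sample_size=4):
    """Iterations k such that (1 - w^m)^k equals `miss_probability` (real-valued)."""
    p_clean_sample = (1.0 - outlier_ratio) ** sample_size
    return math.log(miss_probability) / math.log(1.0 - p_clean_sample)

for ratio in (0.7, 0.8):
    k = required_iterations(ratio)
    print(f"outlier ratio {ratio:.0%}: k = {k:.1f} (truncated: {int(k)})")
```

Truncating the real-valued results gives 368 and 1870 iterations for outlier ratios of 70% and 80%.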

Real data evaluation

To further evaluate the performance in a real-world scenario, we installed sensors, computer hardware, and actuators on a prototype mobile platform for data collection. As shown in Figure 5, the platform was built on an unmanned electric vehicle. Considering the low-light conditions underground, two high-intensity LEDs were mounted on the front to provide additional illumination. For the vision sensors, the platform has a stereo camera rig that consists of two rigidly mounted digital cameras.

Figure 5.

The data collection platform designed for underground and indoor data collection.

These two cameras were simultaneously triggered by the interval signal from the control unit at a frequency of 2 Hz. Before the data collection, the stereo camera system was calibrated, and the images were rectified according to the calibration results.

The dataset was collected in an underground garage because this environment is unavoidable for cars driving in urban scenarios yet challenging for motion estimation owing to the limited illumination and repetitive artificial structures. Since GPS does not work underground, we collected the ground truth with vision marks instead. First, vision marks were attached to the walls of the garage, and their 3D positions were measured with total stations. Then, we ran a structure-from-motion (SFM) program with these accurately measured 3D points and obtained the camera poses from the SFM results. To confirm the accuracy of the camera poses, we measured some other 3D points with a total station and projected them into the image coordinates according to the camera poses. The mean re-projection error is less than 1 pixel, so these camera poses are sufficiently accurate to be referred to as the ground truth. According to our review, no similar open-access underground dataset with ground truth was found, and we would like to make this dataset available to the community.
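
The re-projection check used to validate the camera poses can be sketched as follows. This is a minimal illustration assuming a pinhole model with rectified images, a pose $(R, t)$ that maps world coordinates to camera coordinates, and an intrinsic matrix $K$; the numerical values are hypothetical and are not taken from the dataset.

```python
import numpy as np

def project(points_world, R, t, K):
    """Pinhole projection of Nx3 world points; (R, t) maps world to camera frame."""
    points_cam = points_world @ R.T + t
    pixels_h = points_cam @ K.T                # homogeneous pixel coordinates
    return pixels_h[:, :2] / pixels_h[:, 2:3]  # divide by depth

def mean_reprojection_error(points_world, observed_pixels, R, t, K):
    """Mean Euclidean distance (pixels) between projected and observed points."""
    residuals = project(points_world, R, t, K) - observed_pixels
    return float(np.mean(np.linalg.norm(residuals, axis=1)))

# Hypothetical intrinsics, pose, and surveyed check points
K = np.array([[700.0, 0.0, 320.0], [0.0, 700.0, 240.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
check_points = np.array([[0.0, 0.0, 5.0], [1.0, 0.0, 5.0]])
observed = project(check_points, R, t, K) + np.array([0.3, 0.4])  # 0.5 px offset
print(mean_reprojection_error(check_points, observed, R, t, K))
```

A pose would pass the check described above when this mean error stays below 1 pixel.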

Then, the performance of the SASAC was evaluated with this underground garage dataset. For further analyses, it was also evaluated with the ground dataset from KITTI.¹⁶ For experimental comparison, we configured our SASAC to compare the performance against the following:

RANSAC: The standard RANSAC algorithm,¹¹ which denotes the baseline for comparison.

GroupSAC: Efficient variant of the RANSAC with the grouping process, which uses progressive numbers of groups to boost the sampling procedure.²³ In the experiment, we employ the widely used graph-based model³² to segment the image and group the features.

PROSAC: RANSAC with nonuniform sampling based on ordering the data in descending order of feature-matching quality, which provides an efficient sampling process.²¹

In all the real data experiments, our SASAC implementation followed Algorithm 4; for the comparisons, we changed only the preprocessing strategy and kept the remaining process identical for all the algorithms. For GroupSAC, we ran the segmentation algorithm and grouped the features according to the segmentation results. For PROSAC, the features were ranked in descending order of their matching scores. Since the true inlier ratios (along with the ground truth inliers) were unknown for the experimental data, we first drew $10^{7}$ random samples to estimate the motion and took the largest inlier set found as the ground truth inliers. To ensure validity, manual inspection was performed to confirm that no outliers were included in the ground truth inliers. In all experiments, the results were averaged over 500 executions of each algorithm.

First, the evaluation was performed with the underground dataset, and the results are tabulated as Cases A–C in Table 1. The outlier ratios were 58.8%–76% in this dataset, which makes it challenging for robust estimation algorithms. The table lists the number of features, the ground truth inliers, the detected inliers, the motion estimation error (in millimeters), the number of iterations, and the total runtime (in milliseconds). The detected inliers from SASAC are shown in the image in the first column. As the results show, SASAC consistently produces accurate solutions with successfully detected inliers, while maintaining almost the lowest runtime. Although the standard RANSAC can estimate the motion with a low inlier ratio in the experiment, it requires 369–990 iterations to obtain the result, which is computationally costly. In particular, in Case B, where the outlier ratio reaches 76%, RANSAC requires 990 iterations and 873 milliseconds to pick out the inliers and estimate the motion, whereas SASAC requires 80 iterations, an order of magnitude fewer than RANSAC. PROSAC is faster than RANSAC, but its accuracy is worse than that of the other methods. This difference arises mainly because the points sampled by PROSAC are often poorly distributed in space: the highest-ranking points are typically spatially very close to one another, which generates degenerate solutions. In SASAC, this effect is mitigated by our 3D-bucketing strategy, which incurs a small additional computational cost but provides a much more uniformly distributed set of features. Another reason is that the quality score used by PROSAC is point-wise evidence, whereas the structure features used by SASAC reveal the correlation within the rigid structure and are shown to provide stronger evidence. We also find that, in our experiment, matched features with high quality scores are not guaranteed to be inliers because of the repetitive textures, whereas the structure features selected by SASAC do belong to the inlier group.
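
The effect of 3D bucketing on the spatial distribution of the samples can be illustrated with a simple voxel grid. The sketch below is not the bucketing procedure of this article; it assumes axis-aligned cubes of a fixed edge length and a uniformly random choice of one feature per selected cube, and the function and parameter names are hypothetical.

```python
import numpy as np

def bucket_3d(points, cube_size, num_cubes, rng):
    """Pick at most one feature index from each of `num_cubes` occupied cubes."""
    cube_ids = np.floor(points / cube_size).astype(int)
    # group feature indices by the cube they fall into
    cubes = {}
    for index, cube in enumerate(map(tuple, cube_ids)):
        cubes.setdefault(cube, []).append(index)
    keys = list(cubes)
    # choose distinct occupied cubes, then one feature inside each of them
    chosen = rng.choice(len(keys), size=min(num_cubes, len(keys)), replace=False)
    return [int(rng.choice(cubes[keys[c]])) for c in chosen]

rng = np.random.default_rng(0)
features = rng.uniform(0.0, 10.0, size=(200, 3))   # synthetic 3D feature positions
selected = bucket_3d(features, cube_size=2.0, num_cubes=20, rng=rng)
print(len(selected), "features drawn from distinct cubes")
```

Because no two selected features share a cube, the samples cannot all cluster on one small surface patch, which is the degenerate configuration described above for the top-ranked PROSAC matches.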

Table 1.

Results for inlier detection and motion estimation in underground and ground dataset.

		RANSAC	PROSAC	GroupSAC	SASAC
Case A: underground small motion
	$N$	613	613	613	613
	$n^{*}$	253	253	253	253
	$n$	$222 \pm 6$	$228 \pm 3$	$200 \pm 4$	$218 \pm 10$
	$e$	9.5	18.1	10.2	8.7
	$k$	369	180	40	23
	$time$	417	205	3167	92
Case B: underground dynamic motion
	$N$	466	466	466	466
	$n^{*}$	114	114	114	114
	$n$	$93 \pm 7$	$89 \pm 4$	$83 \pm 2$	$88 \pm 15$
	$e$	12.8	16.4	13.4	11.6
	$k$	990	369	160	80
	$time$	873	320	4724	73
Case C: underground moving object
	$N$	617	617	617	617
	$n^{*}$	247	247	247	247
	$n$	$233 \pm 8$	$228 \pm 3$	$218 \pm 4$	$220 \pm 10$
	$e$	11.5	17.3	14.6	10.7
	$k$	378	89	28	22
	$time$	420	99	3765	82
Case D: ground moving object
	$N$	468	468	468	468
	$n^{*}$	348	348	348	348
	$n$	$343 \pm 2$	$340 \pm 2$	$336 \pm 2$	$328 \pm 6$
	$e$	7.4	8.3	8.6	6.9
	$k$	60	28	29	32
	$time$	87	41	4056	88
Case E: ground motion
	$N$	399	399	399	399
	$n^{*}$	267	267	267	267
	$n$	$260 \pm 3$	$261 \pm 2$	$254 \pm 3$	$255 \pm 3$
	$e$	7.7	6.8	7.3	7.2
	$k$	70.7	40	38	42
	$time$	99	56	4103	81

The table lists the number of correspondences ($N$), the number of ground truth inliers ($n^{*}$), the number of detected inliers ($n$), the position error ($e$) in millimeters, the number of iterations ($k$), and the overall running time ($time$) in milliseconds. The results are averaged over 500 executions of each algorithm.

In some cases, GroupSAC requires fewer iterations, and its overall performance is comparable with that of SASAC; however, applying segmentation as preprocessing is computationally costly, which makes GroupSAC the most computationally expensive algorithm among the compared methods. In addition, in Case C, where there is a non-rigid moving object, GroupSAC takes samples from the moving object group, and the results are less accurate. In this case, SASAC begins sampling with the static structure features and readily discards the non-rigid moving features.

For the overall performance with the underground dataset, the RANSAC and GroupSAC algorithms estimate the motion in a computationally expensive manner, whereas SASAC achieves comparable results at a small fraction of the computational cost. Compared with PROSAC, SASAC uncovers stronger evidence from the features, which leads to a more accurate and robust result. Although it improves the performance, SASAC incurs only a small additional computational cost, which makes it more suitable for indoor applications with limited computational resources.

We also evaluated the performance with the ground KITTI dataset¹⁶ in Cases D and E. Since the matching quality is better in the ground case, all the algorithms work well in this situation. The results show that SASAC also reduces the number of sampling iterations; however, considering the preprocessing, its running time is nearly the same as that of RANSAC. PROSAC performs slightly better than SASAC in efficiency since the quality score is more significant in the ground scenario. As shown in Case D, with a moving car in the image, SASAC can eliminate the features from the moving object and deliver an accurate result. However, when the moving object becomes larger, for example, a moving truck, our structure feature detection algorithm may wrongly identify the moving object features as the structure features. This is mainly because the structure feature detection algorithm selects the largest subset of features that is consistent with the rigidity constraint. In the case of a large rigid moving object, the features from the object satisfy the Euclidean rigidity constraint and form a subset of rigid features; when this subset becomes larger than the static feature subset, our algorithm takes the moving object feature subset as the structure features.
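
This failure mode can be reproduced with a small synthetic example. The sketch below is not the structure feature detection algorithm of this article; it assumes that rigidity is tested by comparing pairwise 3D distances between two frames against a hypothetical tolerance, and it approximates the largest consistent subset greedily by repeatedly discarding the feature with the most violated pairs. All names and numerical values are illustrative.

```python
import numpy as np

def largest_rigid_subset(points_prev, points_curr, tolerance):
    """Greedy approximation of the largest subset with preserved pairwise distances."""
    dist_prev = np.linalg.norm(points_prev[:, None] - points_prev[None, :], axis=2)
    dist_curr = np.linalg.norm(points_curr[:, None] - points_curr[None, :], axis=2)
    inconsistent = np.abs(dist_prev - dist_curr) > tolerance
    active = np.ones(len(points_prev), dtype=bool)
    while True:
        # count violated pairs only among features that are still active
        conflicts = (inconsistent & active[None, :] & active[:, None]).sum(axis=1)
        worst = int(np.argmax(conflicts))
        if conflicts[worst] == 0:
            return np.flatnonzero(active)
        active[worst] = False

def make_scene(num_moving, rng):
    """Six static points plus a rigid object that shifts 1 m sideways between frames."""
    static = rng.uniform([-3, -1, 4], [0, 1, 8], size=(6, 3))
    moving = rng.uniform([2, -1, 4], [3, 1, 8], size=(num_moving, 3))
    c, s = np.cos(0.05), np.sin(0.05)
    R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])  # camera rotation about y
    t = np.array([0.0, 0.0, -0.5])                              # camera translation
    prev = np.vstack([static, moving])
    curr = np.vstack([static, moving + [1.0, 0.0, 0.0]]) @ R.T + t
    return prev, curr

rng = np.random.default_rng(0)
small = largest_rigid_subset(*make_scene(3, rng), tolerance=0.05)
large = largest_rigid_subset(*make_scene(8, rng), tolerance=0.05)
print("small object:", small)   # indices 0-5 are the static features
print("large object:", large)   # indices 6 and above belong to the moving object
```

With three moving features the static features (indices 0–5) are returned, whereas with eight moving features the rigid moving object outnumbers the static set and is selected instead.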

Then, RANSAC, PROSAC, GroupSAC, and SASAC were integrated into a VO scheme, and their position accuracies were evaluated with both the underground and ground datasets. The integrated VO algorithm is based on LIBVISO2,³³ and we kept the maximum number of iterations at 200, which is the default setting in LIBVISO2. The underground VO trajectories are shown in Figure 6(a)–(c). As the results show, in the underground dataset, SASAC achieves a better position accuracy than RANSAC, PROSAC, and GroupSAC.

Figure 6.

Performance of RANSAC, PROSAC, GroupSAC, and SASAC in the VO algorithm. The ground truth in the underground is obtained by the total station, and the KITTI ground truth is measured by GPS and IMU. (a) Visual odometry results of the underground dataset lap 1, (b) visual odometry results of the underground dataset lap 2, (c) visual odometry results of the underground dataset lap 3, and (d) visual odometry results of the KITTI dataset.

During the experiment, we further observed that the top-ranked features in PROSAC often lie on the same surface in the scene. In particular, when a large planar structure occupies the center of the image, all the features sampled by PROSAC come from this plane, which generates a degenerate solution and degrades the position accuracy. For the RANSAC-based VO, the RANSAC algorithm sometimes fails to select the correct model and causes a motion estimation failure. When the failure occurs, the VO algorithm assumes a linear movement and takes the motion parameters from the last frame. Thus, the overall performance of RANSAC is less accurate than that of SASAC. With correct segmentation, GroupSAC achieves a result comparable to that of SASAC; however, it requires an order of magnitude more computation time.
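
The fallback behavior on estimation failure can be sketched as follows. This is a simplified illustration, assuming poses and inter-frame motions are stored as 4×4 homogeneous matrices that are chained by right multiplication and that a failed estimate is signalled by `None`; it does not reproduce the LIBVISO2 code, and the function name is hypothetical.

```python
import numpy as np

def next_pose(pose, estimated_motion, last_motion):
    """Chain 4x4 poses; reuse the previous inter-frame motion when estimation failed."""
    motion = last_motion if estimated_motion is None else estimated_motion
    return pose @ motion, motion

# Example: 0.5 m forward per frame, with the robust estimator failing on frame 2
step = np.eye(4)
step[2, 3] = 0.5
pose, last = next_pose(np.eye(4), step, np.eye(4))   # frame 1: estimate available
pose, last = next_pose(pose, None, last)             # frame 2: failure, reuse last motion
print(pose[:3, 3])
```

Any error in the reused motion is carried into all later poses, which is consistent with the lower accuracy observed for the RANSAC-based VO when failures occur.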

For the ground dataset, these algorithms achieved comparable results. The numerical analysis is performed with the same evaluation metrics as KITTI,¹⁶ and the results are shown in Table 2. A comparison of the position accuracy in the ground and underground scenarios shows that the SASAC and GroupSAC maintain the same level of position performance in both scenarios, whereas the RANSAC and PROSAC show a lower accuracy in the underground scenario. Although the GroupSAC is accurate in both scenarios, it is too slow for the real-world applications. Thus, the VO results show that by integrating SASAC, the standard stereo-VO algorithm can be applied to the underground with the same level of position accuracy and computation cost as in the ground scenario.

Table 2.

Position accuracy of RANSAC, PROSAC, GroupSAC, and SASAC.

Error	RANSAC	PROSAC	GroupSAC	SASAC
Lap 1	2.01%	1.88%	1.21%	1.09%
Lap 2	1.87%	1.62%	1.4%	1.18%
Lap 3	2.13%	1.91%	1.36%	1.22%
KITTI	1.42%	1.28%	1.44%	1.24%

SASAC: structure-aware sample consensus; RANSAC: random sample consensus; PROSAC: progressive sample consensus.

The position accuracy is evaluated by averaging the relative position error at fixed distance.

Conclusion and discussion

In this article, we have designed a robust structure-aware sample consensus scheme to solve the stereo-VO problem in underground environments. To robustly estimate the motion, we use the Euclidean rigidity constraint to detect a group of structure features and then guide the sampling process with these features. Compared with the existing methods, our method can robustly estimate the motion in underground scenarios, which usually suffer from high outlier ratios. The performance of SASAC and the compared methods is evaluated on both synthetic and real-world data. The underground experiment results show that our method has advantages in robustness, accuracy, and efficiency compared with RANSAC and its variants. The SASAC-based VO algorithm achieves the same level of positioning accuracy in both ground and underground scenarios. As a second contribution of this work, we have collected an underground garage dataset with ground truth. Since there are fewer open underground datasets than ground datasets, we would like to make our dataset available to the community for further research (https://github.com/lizhengning/Underground-Garage-Dataset).

In this study, the definition and detection of structure features are derived purely geometrically, and in some situations, for example, with a large moving object, the method may wrongly select a subset as the structure features. In addition, the scope of this study is limited to the stereo-camera case, since the depth must be recovered before structure feature detection. Therefore, in future work, we will focus on discovering higher-level and more widely applicable clues about the structure features. With the recent development of deep learning, we believe that such clues can be learned by a deep neural network during training, and that the trained network can then efficiently separate the structure features at inference time. Combined with a learning-based structure feature selection method, the SASAC scheme can be applied in more VO applications, for example, monocular VO.

Footnotes

Handling Editor: Haosheng Huang

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Key Research and Development Program of China (grant nos 2016YFB0502102 and 2016YFB0502104), Natural Science Foundation of China (grant nos 41771481 and 41371333) and Shanghai International Corporation Program (grant no. 14530722400).

References

1. Mei, Sibley, Cummins, et al. RSLAM: a system for large-scale mapping in constant-time using stereo. Int J Comput Vision 2011; 94(2): 198–214.

2. Konolige, Agrawal, Sola. Large-scale visual odometry for rough terrain. In: Kaneko, Nakamura (eds) Robotics research: proceedings of the 13th international symposium. Berlin; Heidelberg: Springer, 2011, pp.201–212.

3. Engel, Koltun, Cremers. Direct sparse odometry. IEEE T Pattern Anal. Epub ahead of print 12 April 2017. DOI: 10.1109/TPAMI.2017.2658577.

4. Newcombe, Lovegrove, Davison. DTAM: dense tracking and mapping in real-time. In: Proceedings of the IEEE international conference on computer vision, Barcelona, 6–13 November 2011, pp.2320–2327. New York: IEEE.

5. Klein, Murray. Parallel tracking and mapping for small AR workspaces. In: Proceedings of the IEEE international symposium on mixed and augmented reality, Nara, Japan, 13–16 November 2007, pp.225–234. New York: IEEE.

6. Mur-Artal, Montiel JMM, Tardos. ORB-SLAM: a versatile and accurate monocular slam system. IEEE T Robot 2015; 31(5): 1147–1163.

7. Lowe. Distinctive image features from scale-invariant keypoints. Int J Comput Vision 2004; 60(2): 91–110.

8. Bay, Tuytelaars, Van Gool. SURF: speeded up robust features. In: Leonardis, Bischof, Pinz (eds) Computer vision: European conference on computer vision (ECCV). Berlin; Heidelberg: Springer, 2006, pp.404–417.

9. Rublee, Rabaud, Konolige, et al. ORB: an efficient alternative to SIFT or SURF. In: Proceedings of the IEEE international conference on computer vision, Barcelona, 6–13 November 2011, pp.2564–2571. New York: IEEE.

10. Nister, Naroditsky, Bergen. Visual odometry. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Washington, DC, 27 June–2 July 2004, vol. 1, pp.652–659. New York: IEEE.

11. Fischler, Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun ACM 1981; 24(6): 381–395.

12. Tanaka, Kondo. Incremental RANSAC for online relocation in large dynamic environments. In: Proceedings of the IEEE international conference on robotics and automation, Orlando, FL, 15–19 May 2006, pp.68–75. New York: IEEE.

13. Huang, Bachrach, Henry, et al. Visual odometry and mapping for autonomous flight using an RGB-D camera. In: Christensen, Khatib (eds) Robotics research: proceedings of the 15th international symposium. Cham: Springer, 2017, pp.235–252.

14. Schöps, Engel, Cremers. Semi-dense visual odometry for AR on a smartphone. In: Proceedings of the IEEE international symposium on mixed and augmented reality, Munich, 10–12 September 2014, pp.145–150. New York: IEEE.

15. Leutenegger, Lynen, Bosse, et al. Keyframe-based visual–inertial odometry using nonlinear optimization. Int J Robot Res 2015; 34(3): 314–334.

16. Geiger, Lenz, Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Providence, RI, 16–21 June 2012, pp.3354–3361. New York: IEEE.

17. Jung, Lacroix. A robust interest points matching algorithm. In: Proceedings of the 8th IEEE international conference on computer vision, Vancouver, BC, Canada, 7–14 July 2001, vol. 2, pp.538–543. New York: IEEE.

18. Sattler, Leibe, Kobbelt. SCRAMSAC: improving RANSAC’s efficiency with a spatial consistency filter. In: Proceedings of the IEEE international conference on computer vision, Kyoto, Japan, 29 September–2 October 2009, pp.2090–2097. New York: IEEE.

19. Fredriksson, Larsson, Olsson, et al. Optimal relative pose with unknown correspondences. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Las Vegas, NV, 27–30 June 2016, pp.1728–1736. New York: IEEE.

20. Fredriksson, Larsson, Olsson. Practical robust two-view translation estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Boston, MA, 7–12 June 2015, pp.2684–2690. New York: IEEE.

21. Chum, Matas. Matching with PROSAC-progressive sample consensus. In: Proceedings of the IEEE conference on computer vision and pattern recognition, San Diego, CA, 20–26 June 2005, vol. 1, pp.220–226. New York: IEEE.

22. Raguram, Frahm, Pollefeys. A comparative analysis of RANSAC techniques leading to adaptive real-time random sample consensus. In: Forsyth, Torr, Zisserman (eds) Computer vision: European conference on computer vision (ECCV). Berlin; Heidelberg: Springer, 2008, pp.500–513.

23. Jin, Dellaert. GroupSAC: efficient consensus in the presence of groupings. In: Proceedings of the IEEE international conference on computer vision, Kyoto, Japan, 29 September–2 October 2009, pp.2193–2200. New York: IEEE.

24. Raguram, Chum, Pollefeys, et al. USAC: a universal framework for random sample consensus. IEEE T Pattern Anal 2013; 35(8): 2022–2038.

25. Rumpler, Irschara, Bischof. Multi-view stereo: redundancy benefits for 3D reconstruction. In: Proceedings of the 35th workshop of the Austrian Association for Pattern Recognition, Graz, 26–27 May 2011, vol. 4. AAPR/OAGM.

26. Karp. Reducibility among combinatorial problems. In: Proceedings of a symposium on the complexity of computer computations, IBM Thomas J. Watson Research Center, Yorktown Heights, NY, 20–22 March 1972, pp.85–103. New York: Springer.

27. Hochbaum. Approximation algorithms for the set covering and vertex cover problems. SIAM J Comput 1982; 11(3): 555–556.

28. Chen, Kanj, Jia. Vertex cover: further observations and further improvements. J Algorithm 2001; 41(2): 280–301.

29. Enqvist, Josephson, Kahl. Optimal correspondences from pairwise constraints. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Kyoto, Japan, 29 September–2 October 2009, pp.1295–1302. New York: IEEE.

30. Haralick, Lee, Ottenberg, et al. Review and analysis of solutions of the three point perspective pose estimation problem. Int J Comput Vision 1994; 13(3): 331–356.

31. Kneip, Scaramuzza, Siegwart. A novel parametrization of the perspective-three-point problem for a direct computation of absolute camera position and orientation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Colorado Springs, CO, 20–25 June 2011, pp.2969–2976. New York: IEEE.

32. Felzenszwalb, Huttenlocher. Efficient graph-based image segmentation. Int J Comput Vision 2004; 59(2): 167–181.

33.