Abstract

Mobile robots should possess accurate self-localization capabilities in order to be successfully deployed in their environment. A solution to this challenge may be derived from visual odometry (VO), which is responsible for estimating the robot's pose by analysing a sequence of images. The present paper proposes an accurate, computationally-efficient VO algorithm relying solely on stereo vision images as inputs. The contribution of this work is twofold. Firstly, it suggests a non-iterative outlier detection technique capable of efficiently discarding the outliers of matched features. Secondly, it introduces a hierarchical motion estimation approach that produces refinements to the global position and orientation for each successive step. Moreover, for each subordinate module of the proposed VO algorithm, custom non-iterative solutions have been adopted. The accuracy of the proposed system has been evaluated and compared with competent VO methods along DGPS-assessed benchmark routes. Experimental results of relevance to rough terrain routes, including both simulated and real outdoors data, exhibit remarkable accuracy, with positioning errors lower than 2%.

Keywords

Visual Odometry, Outlier Detection, Incremental Motion Estimation, Mobile Robots

1. Introduction

Mobile robots require highly sophisticated and accurate algorithms to achieve accurate location estimations while moving in unknown environments and possessing no prior knowledge of the scene or the robot's motion [1]. Visual odometry (VO) algorithms are able to provide a solution in this respect, and the development of such effective algorithms has become an active research topic as a result. Compared to standard wheel odometry, VO encapsulates a number of advantages, such as the elimination of errors due to wheel slippage in estimating the camera position and the ability to obtain 3D motion estimations even with non-planar surfaces (where wheel odometry alone can only achieve 2D path estimation). Consequently, VO has become a prerequisite for many practical requirements of robotics technology, such as obstacle avoidance, simultaneous localization and mapping, and even in path-planning. It is only in applications where external disturbances of the camera do not allow for a VO solution that researchers investigate other odometry solutions, such as inertial-based solutions [2, 3]. The majority of VO algorithms utilize feature tracking to estimate the relative incremental motion between successive frames with sufficient spatial overlap. The estimation of the path followed is actually based on the summation of smaller, independent motion estimates. However, incremental egomotion estimation typically involves a certain amount of drift, which grows with the distance travelled, leading to accumulated pose errors as a result. In recent years, sophisticated methods have provided more efficient solutions, resulting in accurate localization systems. As described in [4], the trajectory of a robot can be refined by a computationally intensive iterative optimization technique known as sparse bundle adjustment. The computation time of this method increases with the number of images, and therefore it is only applicable to robots with augmented computational resources. 
Another method for increasing the efficacy of VO algorithms is loop-closure, as reported in [5]. This relies on the hypothesis that a robot following an arbitrary route will eventually return to a previously visited area, and thus a database containing the salient points of every visited scene should be retained. Once a query image is matched with an image in the database, refinements to the covered path of the robot are applied. This approximation is very efficient when robots operate in indoor environments; however, in the domain of field robotics, where long distances need to be covered, the probability of passing through the same area twice will be low. Moreover, the size of the database grows continuously, resulting in an increased computational burden every time a new query image appears in the field of view (FoV) of the robot. Other approaches in this domain address the issue of the occurrence of cumulative errors in the pose of the robot by integrating information derived from both the onboard visual system and the global positioning system (GPS) [6]. Although the resulting VO framework enjoys more accurate estimations of the trajectory, when the robot operates in outdoor environments, it suffers from uncertain motion estimations in places with poor GPS reception.

The proposed VO algorithm was developed within the SPARTAN (Sparing Robotics Technologies for Autonomous Navigation) project, funded by the European Space Agency (ESA). This project focuses on the hardware implementation of computer vision algorithms suitable for planetary exploration rovers that exhibit limited computational resources, as described in [7]. Therefore, the present paper discusses in detail the developed localization algorithm as part of the SPARTAN vision system, which has been integrated and tested on a custom-made robotic platform. More precisely, this paper also extends the early work presented in [8] by thoroughly presenting the theoretical background of the statistical outlier detection algorithm, by providing a relationship between the robot's velocity and the camera's frame rate (aiming to regulate the outlier detection procedure), and by assessing the proposed algorithm with numerous simulated and real world datasets.

It is apparent that the adopted solution should avoid computationally demanding strategies and aim at the development of an algorithm able to operate in real time, even in robots with limited computational resources. Emphasis was placed on developing custom, non-iterative solutions at each step of the proposed algorithm's execution, with the ultimate goal of achieving solutions implementable in hardware. The first step in the proposed VO method consists of the detection of the salient landmarks between successive images. We use the Speeded-Up Robust Features (SURF) feature detector and matcher [9], which ensures great repeatability and speed in the detection of the features between two consecutive frames with significant spatial overlap. A depth estimate of these features is then obtained by means of an enhanced stereo correspondence algorithm. Our solution is a rapidly executed local stereo algorithm embodying a bidirectional consistency check. The next step consists of a non-iterative outlier detection methodology able to discard both the mismatches between the features and the errors introduced by the 3D reconstruction procedure. A hierarchical motion estimation technique, which produces robust estimations for the movement of the robot, is then adopted, thus providing refinements to the robot's global position and orientation every time a new frame arrives. The resulting overall algorithm requires modest computational effort and can thus be exploited in applications where hardware implementation becomes a necessity. The subordinate modules of the introduced VO algorithm are summarized in Fig. 1.

Figure 1.

Block diagram of the proposed VO algorithm

In a nutshell, the contributions of this paper can be summarized as:

The introduction of a non-iterative outlier detection technique suitable for real-time robot applications.

The presentation of a hierarchical motion estimation method producing robust orientation and position estimations while providing refinements to the robot's trajectory at every single step.

The design of the subordinate modules of the proposed VO algorithm in order to maintain the overall computational payload as low as possible.

The rest of the paper is organized as follows: the related work is outlined in Sec. 2; the adopted feature detection and matching technique are presented in Sec. 3.1, while the stereo correspondence algorithm with the 3D reconstruction routines is demonstrated in Sec. 3.2. Sec. 3.3 introduces the outlier detection method and Sec. 3.4 describes the hierarchical motion estimation methodology. In Sec. 4, the experimental validation is presented, followed by the discussion in Sec. 5.

2. Visual Odometry Methods

Recent decades have seen the application of VO methods to a significant number of implementations involving both real-time and offline approaches. Due to the fact that several alternative robotic architectures have been proposed, any attempt to provide a categorization of VO algorithms faces various difficulties. A common separating line among existing techniques is that involving vision, i.e., robots bearing either monocular or stereo vision systems. With regards to systems involving monocular vision, successful results for long-range routes have been presented, exploiting both perspective and omnidirectional cameras, as described in [10, 11] and [12]. However, it is well known that a moving calibrated camera observing a scene can recover the camera geometry and motion only up to a certain scale factor [13]. Due to this fact, VO algorithms implemented for stereo vision exhibit more stable behaviour [14] and, as a result, they have been preferred in the majority of the research conducted in the area [15]. In a stereo VO system, correspondences among points of interest appearing in a stereo rig are computed first in order to produce a disparity map, and then, by exploiting this depth information, 3D points are triangulated and fed into a module to estimate 3D motion [16].

VO frameworks can also be categorized according to the landmark detection method used. For example, in [17], the author adopts the feature tracking-based technique, while in [18] the optical flow method, which is based on local neighbourhood correlation, is utilized. In contrast with the dense optical flow technique, the utilization of a feature tracking method leads to less computationally demanding motion estimation tasks. Moreover, the success of the solely vision-based localization algorithms is, to a large extent, due to the development of robust feature detection and description methods. The most common feature detector is the Harris corner detector [19], while more recently developed feature detection methods, such as SIFT [20], SURF [9] and even CenSurE [21] involve both feature detection and description practices. The mismatches within the tracked features, which result from either erroneous estimations of the feature detection algorithms or poor estimations of the depth during the 3D reconstruction, lead to cumulative errors when employed in an incremental motion estimation. Such limitations have led to multiple outlier detection techniques which aim to cope with this problem [22]. In the work presented in [23], the authors make use of the RANSAC algorithm [24] in a least-square motion estimation routine for the detection of the outliers. This technique has since been adopted in many localization algorithms, such as those described in [25] and [26]. An alternative approach to detecting and removing outliers in such a localization system is that described in [27], where only the salient points possessing a high confidence in the disparity space are retained, thus avoiding the introduction of noise due to mismatches in the stereo correspondence routine.

VO algorithms can be further categorized according to the means utilized in estimating robot motion. The majority of the developed monocular VO techniques rely on the theory of structure from motion (SFM)—the beginnings of which are traced in [28]. Methods exploiting SFM theory calculate - usually under a restriction of error - the relative position and orientation of a set of frames relying on the matched salient points of the respective scenes [29, 30]. In addition, VO algorithms are usually accompanied by a refinement step that results in small corrections to the estimated robot trajectory. One of the most common techniques used is referred to as sparse bundle adjustment. According to this framework, location estimations can become more accurate by retaining a portion of any recent features in an iterative structure [31], thereby making the need for increased computational resources imperative. A rather different approach - also leading to localization refinements - is that based on location recognition, i.e., loop-closure detection. This technique relies on content-based retrieval practices, in order to refine the robot's position estimation when features of a new frame are matched with those already existing in the database [5]. For a more comprehensive and in-depth literature survey of the VO algorithms that have been developed, the reader may refer to [32].

3. Algorithm Description

In this section, the overall description of the proposed methodology is presented. It provides details of the 2D feature detection and matching routine, the 3D vision algorithm and the intermediate filtering steps, the outlier detection and discarding routine, the motion estimation as well as the one-step position refinement of the front end of the proposed algorithm. A detailed schematic depiction of the VO single-frame incremental update is provided in Fig. 2.

Figure 2.

Schematic representation of the VO single-frame incremental update approach

3.1. Feature Detection and Matching

One of the most common traits among VO algorithms is that they employ a feature detection methodology to detect the salient points of an image. The most popular such methodologies employed in the localization problem are the SIFT, SURF and Harris corner detectors. The first two detectors are supplemented by their own description and matching algorithms. The Harris corner detection algorithm is a detector only, locating the corners of a scene; a common procedure to match such points with the corresponding points in successive images involves correlating the local neighbourhoods around the detected corners. Among the potential detectors that can be used in localization algorithms [33], the proposed system employs SURF, which comprises a scale- and rotation-invariant detector and a descriptor. This attribute ensures robustness in the motion estimation when the robot's movement involves large motions around or along the optical axis. Moreover, the main reason for choosing the SURF algorithm lies in its potential to achieve high repeatability, distinctiveness and robustness while remaining computationally efficient, thus allowing significantly faster computation times. SURF provides a list containing the image coordinates of N matched features in two images. When two consecutive left-reference images of the stereo rig, corresponding to the times t and t+1, respectively, are fed into the SURF algorithm, the number N of the resulting features depends on a specific threshold. Fig. 3 depicts the SURF features matched across two frames, overlaid on the first one. More specifically, two different scenarios are tested on the same set of successive images: one with a strict SURF threshold and the other with a more tolerant one. Note that in Fig. 3(a), where the threshold is strict, no mismatches occur, whereas in Fig. 3(b), where the threshold is loose, several mismatches can be found. In the proposed VO algorithm, a very stringent threshold is utilized, ensuring that the algorithm returns features that are robust enough, thus decreasing the occurrence of mismatches.

Figure 3.

Tracked features for two consecutive images using a) a strict threshold and b) a tolerant threshold. The red and green crosses denote the position of features on the current and the next captured images, respectively, while the blue lines indicate the obtained correspondences.

3.2. 3D Vision Module

The next step in the proposed algorithm comprises the 3D reconstruction module, in which the 2D points previously extracted are now transformed into 3D points, i.e., the image coordinates are converted into a world coordinate system. This procedure comprises two different steps: the first one is the depth estimation of the scene utilizing a stereo correspondence algorithm such that every salient point obtains a disparity value, while the second one is the triangulation of the 3D points, taking advantage of the previously estimated disparity values. The calculated 3D points correspond to the salient landmarks of the scene expressed in world coordinates.

3.2.1. Stereo Algorithm

Depth estimation constitutes the cornerstone of VO algorithms. The third dimension, i.e., the depth of the depicted objects, can be obtained by stereo vision setups, structured light devices, RGB-D cameras, 3D laser scanners or time-of-flight sensors, among others. By calculating the depth of the depicted objects and given the geometry of the vision system, a 3D point cloud of the scene can be computed easily. However, in the domain of field robotics, the utilization of stereo vision systems for depth perception seems to be the most reliable solution, since the other devices mentioned are sensitive to varying illumination conditions. Therefore, for the proposed VO algorithm, we assume a stereo rig and, in order to avoid the aforementioned drawbacks, a specially designed stereo correspondence algorithm is employed [34]. The efficiency of the stereo algorithm was extensively tested both in simulated and in real-world navigation instances, as with the example depicted in Fig. 4. Under this method, the initially captured images are processed in order to extract the edges in the depicted scene, and the output of the edge detection procedure is then superimposed onto the initial images, accentuating their most striking features and textured surfaces. The matching cost aggregation step consists of a Gaussian-weighted sum of absolute differences (SAD). Moreover, the disparity selection step is a simple winner-takes-all choice, as the absence of any iteratively updated selection process significantly reduces the computational payload of the overall algorithm. The final step comprises a bidirectional consistency check where the selected disparity values are only approved if they are consistent, irrespective of which image is the reference and which is the target, resulting in the rejection of false correspondences. Furthermore, the precision of the final disparity values is set at a quarter-pixel level as a result of a parabola curve-fitting procedure.
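The disparity selection and sub-pixel refinement steps can be illustrated with a minimal sketch. It assumes that an aggregated cost is already available for each integer disparity of one pixel; the function names, the one-pixel consistency tolerance and the rounding to quarter-pixel steps are illustrative assumptions and are not taken from [34].

```python
def wta_subpixel(costs, step=0.25):
    """Winner-takes-all disparity with parabola refinement.

    costs[d] is the aggregated matching cost (e.g., SAD) at integer
    disparity d. The result is rounded to multiples of `step`
    (assumed here to be a quarter pixel).
    """
    d = min(range(len(costs)), key=costs.__getitem__)  # winner-takes-all
    if d == 0 or d == len(costs) - 1:
        return float(d)  # no neighbour on one side, keep the integer value
    c_prev, c_min, c_next = costs[d - 1], costs[d], costs[d + 1]
    curvature = c_prev - 2.0 * c_min + c_next
    if curvature <= 0:
        return float(d)  # flat or degenerate cost curve, no refinement
    # Vertex of the parabola through the three costs around the minimum
    offset = 0.5 * (c_prev - c_next) / curvature
    return round((d + offset) / step) * step


def is_consistent(disp_left_ref, disp_right_ref, tol=1.0):
    """Bidirectional check: accept a disparity only if both directions agree.

    tol (in pixels) is a hypothetical tolerance.
    """
    return abs(disp_left_ref - disp_right_ref) <= tol


# Hypothetical cost curve with its minimum between disparities 2 and 3
disparity = wta_subpixel([9.0, 4.0, 1.0, 2.0, 8.0])
```

Because both steps are closed-form, no iteration is involved, which is in keeping with the non-iterative design goal stated for the overall algorithm.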

Figure 4.

A stereo image pair (a), (b) and the resulting disparity map (c), where the colours closer to red indicate larger disparity values (small distances) while the colours closer to blue indicate smaller disparity values (large distances). Dark blue indicates a lack of any disparity value for a particular pixel.

Building upon this stereo algorithm, the proposed method utilizes the same algorithmic process; however, it calculates the disparity values only for the pixels that correspond to the detected feature points in the image plane (instead of for all the pixels in the scene). The advantage of this method is the reduced computational burden for the stereo algorithm: the result is a very sparse disparity map containing reliable disparity values for the points of interest only. Consequently, this method is able to provide depth estimations for the most prominent points in a scene, retaining a frame rate suitable for real-time robotics applications.

3.2.2. 3D Reconstruction

Making use of the depth information calculated as the disparity, the positions of the detected features on the image plane are then expressed in 3D world coordinates. More specifically, pixels expressed in camera coordinates $(x_{c}, y_{c}, \mathrm{disp}(x_{c}, y_{c}))$ with respect to the stereo geometry are transformed into 3D points $(x, y, z)$. The $XY$ plane coincides with the image plane while the Z axis denotes the depth of the scene. The relation between the world coordinates of a point $P(x, y, z)$ and the coordinates on the image plane $(x_{c}, y_{c}, \mathrm{disp}(x_{c}, y_{c}))$ is expressed by the pin-hole model and the stereo setup as

[x, y, z] = \left[ \frac{x_{c} \cdot z}{f}, \frac{y_{c} \cdot z}{f}, \frac{b \cdot f}{\mathrm{disp}(x_{c}, y_{c})} \right]

(1)

where z is the depth value of a feature depicted in $(x_{c}, y_{c})$, b is the stereo camera's baseline, f is the focal length of the lenses expressed in pixels, and $\mathrm{disp}(x_{c}, y_{c})$ is the corresponding pixel's disparity value. In Eq. 1, x and y denote the abscissa and the ordinate in 3D world coordinates, corresponding to the $(x_{c}, y_{c})$ pixels in the image plane, respectively.
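As a minimal sketch of Eq. 1, the following function maps one feature pixel and its disparity to a 3D point. The function name and the numerical values are hypothetical, and the pixel coordinates are assumed to be measured relative to the principal point of the image.

```python
def triangulate(x_c, y_c, disp, f, b):
    """Map a pixel (x_c, y_c) with disparity disp to (x, y, z) per Eq. 1.

    f is the focal length in pixels and b the baseline in metres.
    x_c and y_c are assumed to be relative to the principal point.
    """
    if disp <= 0:
        raise ValueError("disparity must be positive to recover depth")
    z = b * f / disp  # depth from disparity
    return (x_c * z / f, y_c * z / f, z)


# Hypothetical rig: f = 800 px, b = 0.12 m, feature disparity of 16 px
point = triangulate(40.0, -20.0, 16.0, f=800.0, b=0.12)
```

Since z is inversely proportional to the disparity, features with small disparities are the ones most affected by a fixed disparity error.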

At this stage, it should be mentioned that the accuracy of the 3D reconstruction module is of great importance to the VO algorithm. The aforementioned 3D transformation strongly depends on the accuracy of the stereo process (i.e., any erroneous estimations of the depth will lead to incoherent estimations of the 3D coordinates of the features). The accuracy of the resulting disparity map depends on the range resolution, which is the minimal change in range that the stereo vision system can differentiate [35]. The function that calculates the range l (in metres) within which the resolution is better than or equal to a desired value s (in metres) is as follows

l = \sqrt{\frac{s \cdot b \cdot w}{2 \cdot c \cdot \tan(0.5 \cdot F)}}

(2)

where b is the baseline in metres, w is the horizontal image resolution in pixels, F is the camera's FoV in radians and c is the disparity accuracy in pixels. In general, the resolution deteriorates with distance and, therefore, the resolution accuracy decreases for the objects located far from the stereo camera. Apparently, the further a salient point is from the camera, the more erroneous the resulting depth estimations will be. In order to avoid the inclusion of erroneously estimated 3D points in the motion estimation step, a couple of different alternatives can be used. The first one involves the omission of those features corresponding to pixels with a disparity value smaller than a predefined threshold, depending on the range resolution that the current stereo geometry provides. The second alternative involves the adoption of a specific architecture for the stereo camera placement, whereby the latter should be tilted at a certain angle, ensuring that the 3D reconstructed features fall within the optimal disparity range [36]. Such an approach has been adopted during the construction of our robotic platform - the localization stereo camera is a Bumblebee2, placed 30 cm above the ground with a tilt of 31.55°. The existence of a tilted setup limits the largest observed distance, and as a result improves the obtained accuracy. However, the proposed methodology does not depend on a specific value of the tilt angle and can operate with any other value instead.
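Eq. 2 is consistent with the usual stereo depth-error relation: with a focal length of $w / (2 \tan(0.5 F))$ pixels, a disparity error of c pixels at range l gives a depth error of roughly $l^{2} \cdot c \cdot 2 \tan(0.5 F) / (b \cdot w)$, and setting this equal to s and solving for l yields Eq. 2. A minimal sketch of the computation follows; the function name and the example values are hypothetical and do not describe the Bumblebee2 rig.

```python
import math


def max_range_for_resolution(s, b, w, fov, c):
    """Range l (m) within which the range resolution is at least s (Eq. 2).

    s: desired range resolution in metres
    b: baseline in metres
    w: horizontal image resolution in pixels
    fov: horizontal field of view in radians
    c: disparity accuracy in pixels
    """
    return math.sqrt((s * b * w) / (2.0 * c * math.tan(0.5 * fov)))


# Hypothetical rig: 0.1 m baseline, 1000 px wide images, 90 degree FoV,
# quarter-pixel disparity accuracy, 2 cm desired resolution
l_max = max_range_for_resolution(s=0.02, b=0.1, w=1000, fov=math.pi / 2, c=0.25)
```

A value such as l_max could serve as the basis for the disparity threshold mentioned above, since features beyond that range would not meet the desired resolution.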

3.3. Outliers Detection

The feature detection, matching and 3D reconstruction modules result in several hundred pairs of features for two arbitrarily chosen consecutive frames. However, these matched features can include quite a few outliers, arising for two different reasons: one stems from erroneous estimations of the depth values during the stereo correspondence procedure (thus inheriting errors from the 3D reconstruction module), while the other is due to the feature matching algorithm itself (i.e., SURF), which introduces mismatches as described in Sec. 3.1. As a result, an outlier detection and removal step needs to be applied to the matched features prior to the motion estimation procedure. The outlier detection method described in this work is specifically designed to exploit the dense sampling of the imaging device while the robot is in motion, thus allowing real-time robot navigation. Let us assume that the robot observes a specific point ${}^{t}P$ in 3D space as shown in Fig. 5, such that ${}^{t}P = [x_{t}, y_{t}, z_{t}]^{T}$. At time $t + 1$, the robot undergoes a specific motion with rotation ${}_{t+1}^{t}R$ and translation ${}_{t+1}^{t}T = [T_{x}, T_{y}, T_{z}]^{T}$, and the corresponding point ${}^{t}P$ is next observed as ${}^{t+1}P = [x_{t+1}, y_{t+1}, z_{t+1}]^{T}$. The transformation from point ${}^{t}P$ to ${}^{t+1}P$ is given as follows [37]:

{}^{t}P = {}_{t+1}^{t}R \cdot {}^{t+1}P + {}_{t+1}^{t}T

(3)

Substituting the rotation and translation matrices in the previous equation, the following equation system is obtained,

\begin{array}{rll}
x_{t} & = x_{t+1} \cos\alpha \cos\beta + y_{t+1} (\cos\alpha \sin\beta \sin\gamma - \sin\alpha \cos\gamma) + z_{t+1} (\cos\alpha \sin\beta \cos\gamma + \sin\alpha \sin\gamma) + T_{x} & (a) \\
y_{t} & = x_{t+1} \sin\alpha \cos\beta + y_{t+1} (\sin\alpha \sin\beta \sin\gamma + \cos\alpha \cos\gamma) + z_{t+1} (\sin\alpha \sin\beta \cos\gamma - \cos\alpha \sin\gamma) + T_{y} & (b) \\
z_{t} & = - x_{t+1} \sin\beta + y_{t+1} \cos\beta \sin\gamma + z_{t+1} \cos\beta \cos\gamma + T_{z} & (c)
\end{array}

(4)

where α, β and γ are the $X - Y - Z$ fixed angles (roll, pitch and yaw, respectively).
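The coefficients of Eq. 4 are the rows of the standard rotation matrix for these fixed angles. The sketch below, with hypothetical function names, builds that matrix and applies Eq. 3 to a point. Note that the y coefficient in the third row is $\cos β \sin γ$, which is what keeps the matrix orthonormal.

```python
import numpy as np


def rotation_from_fixed_angles(alpha, beta, gamma):
    """Rotation matrix whose rows supply the coefficients of Eq. 4."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    return np.array([
        [ca * cb, ca * sb * sg - sa * cg, ca * sb * cg + sa * sg],
        [sa * cb, sa * sb * sg + ca * cg, sa * sb * cg - ca * sg],
        [-sb,     cb * sg,                cb * cg],
    ])


def transform_point(R, T, p_next):
    """Eq. 3: express the point observed at t+1 in the frame of time t."""
    return R @ np.asarray(p_next, dtype=float) + np.asarray(T, dtype=float)


# For vanishing angles the matrix reduces to the identity,
# so the point is only shifted by the translation.
R_small = rotation_from_fixed_angles(0.0, 0.0, 0.0)
p_t = transform_point(R_small, [0.1, 0.0, 0.3], [1.0, 0.5, 5.0])
```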

For cases where the stereo camera acquires images at a high frame rate while the robot moves at regular speed, the changes in the pose of the robot for two consecutive observations at times t and $t + 1$ can be considered to be negligible. We may therefore safely assume in Eq. 4 that $\alpha \to 0$, $\beta \to 0$ and $\gamma \to 0$. Applying these simplifications (i.e., considering only small angular and translational variations during successive frames) and dividing Eq. 4a and Eq. 4c by $x_{t}$ and $z_{t}$, respectively, we obtain the following,

\begin{array}{ll} \frac{x_{t+1}}{x_{t}} = 1 - \frac{T_{x}}{x_{t}} + \varepsilon_{1} & (a) \\ \frac{z_{t+1}}{z_{t}} = 1 - \frac{T_{z}}{z_{t}} + \varepsilon_{2} & (b) \end{array}

(5)

where $\varepsilon_{1}$ and $\varepsilon_{2}$ are two small deviation factors along the x and z directions, respectively. Considering that the robot moves slightly between the resulting frames, the majority of the observed features will have displacements quite close to zero, while the displacement that corresponds to the outliers will be erroneously estimated as $\varepsilon_{1}$ or $\varepsilon_{2}$ for the direction under consideration. Thus, the fractions described in Eq. 5 are related to the transition rates in the X and Z axes, respectively, and can be utilized for the detection of the outliers. Outliers are defined as numeric values in a dataset which have an unusually high deviation from either the statistical mean or the median value [38]. Thus, a margin around the region where these rates are close to the median rate value of all the matches can be established, denoting the inlier space. Consequently, all the rates that correspond to the matched features are computed, such as $Lx = [l_{x,1}, l_{x,2}, \ldots, l_{x,N}]$, with $l_{x,i} = \frac{x_{i,(t+1)}}{x_{i,t}}$, and $Lz = [l_{z,1}, l_{z,2}, \ldots, l_{z,N}]$, with $l_{z,i} = \frac{z_{i,(t+1)}}{z_{i,t}}$, where N corresponds to the number of the matched features. A classical approach to screen the outliers is to use the standard deviation method, where the intervals determining the inlier region are typically set to ± three standard deviations (σ). Such an approach, according to Chebyshev's theorem, retains at least $88.89\%$ of the dataset, rejecting all extreme values in a particular set of observations [39]. To increase the confidence level of the proposed method, we set a stricter margin of ± 2σ, and the framework to discard the outliers is as follows:

\begin{array}{ll} Lz_{i}^{N} = \begin{cases} | l_{z,i} - \tilde{L}z | < 2 \cdot \sigma, & \text{inlier} \\ \text{else}, & \text{outlier} \end{cases} & (a) \\ Lx_{i}^{N} = \begin{cases} | l_{x,i} - \tilde{L}x | < 2 \cdot \sigma, & \text{inlier} \\ \text{else}, & \text{outlier} \end{cases} & (b) \end{array}

(6)

Figure 5.

The stereo camera coordinate system

In Eq. 6, $\tilde{L}z$ and $\tilde{L}x$ correspond to the median value of the transition rate calculated for all the matched features on the Z and X axes, respectively. Any of the N matched features that falls outside the margins set by Eq. 6 is considered to be an outlier and is hence discarded. Examples of this procedure are illustrated in Fig. 6 and Fig. 7. Fig. 6 corresponds to a case in which the robot moves straight ahead without any rotation about its vertical axis. More specifically, Fig. 6(a) presents the feature matches on the $XZ$ plane for two successive frames. Note that the majority of the features possess a specific displacement, with the exception of a few matches showing greater displacements. The latter correspond to erroneous estimations of the depth by the stereo algorithm, leading to outliers, and they are excluded by the statistical filter defined by Eq. 6. In Fig. 6(b), the median value of the transition in the Z and X axes is denoted by the blue line, while the inlier region corresponding to the $2\sigma$ limit is set by the green lines. All the observation points outside this region comprise the outliers and are thus discarded. In Fig. 7, the robot undergoes a small rotation about its vertical axis as indicated by the matched features in Fig. 7(a). In this case, the existence of outliers is also noticeable. As shown in Fig. 7(b), the Z axis outliers outnumber those on the X axis, suggesting that even though those features have been correctly matched by the SURF algorithm (as denoted by the alignment of their ordinates), they still suffer from erroneous estimations in their abscissa component (i.e., the depth axis Z) and are therefore also discarded as outliers.
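The statistical filter can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: σ is taken to be the population standard deviation of the rates, a feature is kept only if it is an inlier on both axes, all $x_{t}$ and $z_{t}$ are assumed to be non-zero, and the function names are hypothetical.

```python
import statistics


def inlier_mask(rates, k=2.0):
    """Mark rates lying within k standard deviations of the median."""
    median = statistics.median(rates)
    sigma = statistics.pstdev(rates)  # assumption: population std. deviation
    if sigma == 0.0:
        return [True] * len(rates)  # identical rates, nothing to reject
    return [abs(r - median) < k * sigma for r in rates]


def filter_matches(points_t, points_t1, k=2.0):
    """Return the indices of matches that are inliers on both X and Z.

    points_t and points_t1 are equally long lists of matched (x, y, z)
    points at times t and t+1; x and z at time t must be non-zero.
    """
    lx = [p1[0] / p0[0] for p0, p1 in zip(points_t, points_t1)]
    lz = [p1[2] / p0[2] for p0, p1 in zip(points_t, points_t1)]
    keep_x = inlier_mask(lx, k)
    keep_z = inlier_mask(lz, k)
    return [i for i in range(len(lx)) if keep_x[i] and keep_z[i]]


# Hypothetical transition rates: nine plausible values and one gross error
rates = [1.0, 1.01, 0.99, 1.0, 1.02, 0.98, 1.0, 1.01, 0.99, 5.0]
mask = inlier_mask(rates)
```

The filter makes a single pass over the matches and needs no hypothesis sampling, in contrast to iterative schemes such as RANSAC.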

Figure 6.

The robot moving forwards: a) corresponding matches on the $X Z$ plane between times t and $t + 1$ , and b) graphical illustration of the outlier detection method

Figure 7.

The robot undergoing a turn around its vertical axis: a) corresponding matches on the $X Z$ plane between times t and $t + 1$ , and b) graphical illustration of the outlier detection method

It becomes obvious that the efficiency of the proposed framework in detecting the outliers is greatly affected by the displacement of the robot between two feature measurements (i.e., two successive frames), requiring the robot's motion to be as smooth as possible. In such cases, there is a great deal of overlap between successive frames and numerous common features are available for extraction. However, in applications where the speed of the robot is substantial, the distances travelled by the robot between successive frames can increase significantly. In such cases, the majority of the matched features will lie in the background of the scene, where the range resolution of the depth estimation is diminished (see Sec. 3.2.2), resulting in increased outlier numbers due to inaccurate depth estimations. This phenomenon can be counterbalanced by increasing the frame rate of the stereo camera, although this increases the amount of sampled data, which must still be processed within the real-time frame rate restrictions of the VO algorithm. Hence, a need arises to determine a function that relates the robot's linear velocity to the stereo camera's frame rate. The linear velocity is derived by differentiating Eq. 3, giving Eq. 7,

\frac{d\,{}^{t}P}{dt} = \frac{d({}_{t+1}^{t}R \cdot {}^{t+1}P)}{dt} + {}_{t+1}^{t}V_{P}

(7)

where ${}_{t+1}^{t}V_{P}$ is the robot's velocity in the interval of two consecutive time steps t and $t + 1$. The expansion of Eq. 7 is given as follows:

\frac{d\,{}^{t}P}{dt} = {}_{t+1}^{t}R \cdot \frac{d\,{}^{t+1}P}{dt} + \frac{d({}_{t+1}^{t}R)}{dt} \cdot {}^{t+1}P + {}_{t+1}^{t}V_{P}

(8)

By using the expansion described in [37], the term $d({}_{t+1}^{t}R \cdot {}^{t+1}P) / dt$ can be replaced by ${}_{t+1}^{t}\Omega \times ({}_{t+1}^{t}R \cdot {}^{t+1}P)$, where the operator × denotes the vector cross product. Therefore, Eq. 8 becomes:

{}_{t+1}^{t}V_{P} + {}_{t+1}^{t}\Omega \times ({}_{t+1}^{t}R \cdot {}^{t+1}P) = 0

(9)

Assuming small angular differences β, the rotation matrix ${}_{t+1}^{t}R$ can be safely substituted by the identity matrix I. In the next step, according to [37], $\Omega \times {}^{t+1}P$ can be replaced by the product $S \cdot {}^{t+1}P$, where S is the skew-symmetric matrix of the angular velocity. The latter is expressed as ${}_{t+1}^{t}\Omega = [0, \omega_{y}, 0]^{T}$, due to the fact that only the $XZ$ plane is examined. Since we consider two consecutive frames, the time interval between them is $d_{t}^{t+1} = 1 / r$, where r is the frame rate of the imaging equipment, and the rotational velocity $\omega_{y}$ is given as follows:

\omega_{y} = \beta \cdot r

(10)

Substituting Eq. 10 into Eq. 9 and applying the requisite transformations, it can be shown that the robot's velocity can be expressed as a linear function encompassing the frame rate as follows:

{}_{t + 1}^{t} V_{P} = - β \cdot r \cdot {[z_{t + 1}, 0, - x_{t + 1}]}^{T}

(11)

Eq. 11 shows that there is indeed a linear relationship between the linear velocity of the mobile robot and the feature sampling rate. Therefore, in order to retain the efficiency of the proposed outlier detection method when the robot moves at higher velocities, the frame rate should be increased accordingly and, moreover, it should be ensured that the entire procedure is completed within the same cycle. In short, the outlier detection method is most efficient when the robot's change in position between successive frames is small.
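As a minimal numerical sketch of Eqs. 10 and 11, the following Python snippet evaluates the velocity predicted for a feature at position $^{t + 1} P$ and inverts the relation to obtain the frame rate needed to keep the inter-frame rotation below a chosen bound. The function names, the bound `beta_max` and the sample values are illustrative assumptions and are not taken from the original implementation.

```python
import numpy as np

def skew(omega):
    """Skew-symmetric matrix S such that S @ p equals np.cross(omega, p)."""
    wx, wy, wz = omega
    return np.array([[0.0, -wz, wy],
                     [wz, 0.0, -wx],
                     [-wy, wx, 0.0]])

def velocity_from_rotation(beta, r, p_next):
    """Eq. 11: V = -beta * r * [z, 0, -x]^T, with R approximated by I."""
    omega = np.array([0.0, beta * r, 0.0])    # Eq. 10: omega_y = beta * r
    return -skew(omega) @ np.asarray(p_next, dtype=float)

def required_frame_rate(speed, p_next, beta_max):
    """Frame rate r (Hz) at which a speed |V| corresponds to a rotation beta_max (rad)."""
    x, _, z = p_next
    return speed / (beta_max * np.hypot(x, z))

# Hypothetical feature 4 m ahead and 1 m to the side, camera at 20 Hz.
p = [1.0, 0.5, 4.0]
v = velocity_from_rotation(beta=0.01, r=20.0, p_next=p)
print(v, required_frame_rate(np.linalg.norm(v), p, beta_max=0.01))
```

Under these assumptions, doubling the speed while holding `beta_max` fixed doubles the required frame rate, which is the linear relationship expressed by Eq. 11.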

3.4. Incremental Motion Estimation

The motion estimation module comprises the calculation of a transformation that maps the matched 3D reconstructed features between two successive time instances. In this module, only those correspondences that have passed the outlier detection procedure are utilized. Let us consider the resulting two 3D point clouds that correspond to times t and $t + 1$. The feature position vectors $^{t + 1} P$, expressed in the local coordinates of the reference image of the stereo pair at time $t + 1$, are related to the position vectors $^{t} P$ in the reference image of the corresponding stereo pair at time t according to Eq. 3. Ideally, three perfectly matched features should be sufficient to compute $_{t + 1}^{t} T$ and $_{t + 1}^{t} R$. However, in practice, where errors are likely to arise, several independent 3D points are needed for the effective calculation of $_{t + 1}^{t} T$ and $_{t + 1}^{t} R$, which should minimize the following least squares criterion,

\sum_{i = 1}^{N} {‖ {}^{t} P_{i} - {}_{t + 1}^{t} T - {}_{t + 1}^{t} R \cdot {}^{t + 1} P_{i} ‖}^{2}

(12)

where N is the number of features used.

Additionally, the problems of computing $_{t + 1}^{t} T$ and $_{t + 1}^{t} R$ are dealt with separately, since the $_{t + 1}^{t} T$ that minimizes Eq. 12 is:

_{t + 1}^{t} T =^{t} \bar{P} -_{t + 1}^{t} R \cdot^{t + 1} \bar{P}

(13)

In Eq. 13, $^{t} \bar{P}$ and $^{t + 1} \bar{P}$ are the centroids of each group of points $^{t} P$ and $^{t + 1} P$ , respectively:

{}^{t} \bar{P} = \frac{1}{N} \cdot \sum_{i = 1}^{N} {}^{t} P_{i} \quad \text{and} \quad {}^{t + 1} \bar{P} = \frac{1}{N} \cdot \sum_{i = 1}^{N} {}^{t + 1} P_{i}

(14)
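For completeness, the step from Eq. 12 to Eq. 13 can be reconstructed as follows. This is a standard least squares argument and not a derivation quoted from the original text; the time indices on T and R are dropped for brevity. Setting the gradient of Eq. 12 with respect to T to zero gives:

```latex
\frac{\partial}{\partial T} \sum_{i=1}^{N} \left\| {}^{t}P_{i} - T - R \cdot {}^{t+1}P_{i} \right\|^{2}
  = -2 \sum_{i=1}^{N} \left( {}^{t}P_{i} - T - R \cdot {}^{t+1}P_{i} \right) = 0
\;\Longrightarrow\;
N \, T = \sum_{i=1}^{N} {}^{t}P_{i} - R \cdot \sum_{i=1}^{N} {}^{t+1}P_{i}
\;\Longrightarrow\;
T = {}^{t}\bar{P} - R \cdot {}^{t+1}\bar{P}
```

Substituting this T back into Eq. 12 leaves a criterion that depends only on R and on the deviations of the points from their centroids, which is why the rotation can be estimated first and the translation recovered afterwards.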

Thus, the first step in this motion estimation procedure is to find the rotation matrix $_{t + 1}^{t} R$ that minimizes the error function [40],

\mathrm{error} = \sum_{i = 1}^{N} {‖ {}^{t} {\bar{P}}_{i} - {}_{t + 1}^{t} R \cdot {}^{t + 1} {\bar{P}}_{i} ‖}^{2}

(15)

where $^{t} {\bar{P}}_{i} = {}^{t} P_{i} - {}^{t} \bar{P}$ and $^{t + 1} {\bar{P}}_{i} = {}^{t + 1} P_{i} - {}^{t + 1} \bar{P}$ are the deviations with respect to the centroids. The rotation angle β is estimated by minimizing Eq. 15 using a full search scheme, with the examined orientation values lying within the range $[- π / 4, π / 4]$ in 0.01° steps. The output of this procedure comprises the instantaneous rotation angles forming the rotation matrices $_{t + 1}^{t} R$, which correspond to the rotations of the robot between successive frames. The rotation angle corresponding to the global displacement of the rover at time t is then estimated by adding all the previously estimated rotations. The global translation of the robot is computed next by means of Eq. 13.

The contribution of the outlier detection method is presented in Fig. 8. The motion estimation module has been applied to a simulated dataset (described in detail in Section 4.1) related to a straight route on a planar surface. This route consists of 650 successive frames and, since it is straight, the true rotation angle of the stereo camera between successive frames is zero. More specifically, in Fig. 8(a) the rotation estimations refer to the proposed VO algorithm with, as well as without, the inclusion of the outlier detection procedure. When the outliers are detected and removed, the rotation estimations show smaller variations than when the outliers are included in the motion estimation. Furthermore, if no outlier detection is applied, the maximum error in the rotation angle is about 2.5°, leading to increased cumulative error in the motion estimation. On the other hand, when the outlier detection method is used, the maximum rotation error is less than 0.35°, leading to smaller cumulative error.
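The full search over Eq. 15, followed by the translation of Eq. 13, can be sketched in Python as below. This is only an illustration under stated assumptions: a single rotation about the Y axis (motion in the $X Z$ plane), NumPy arrays of shape (N, 3) holding the matched inlier points, and hypothetical function names (`rot_y`, `estimate_motion`) that do not come from the original implementation.

```python
import numpy as np

def rot_y(beta):
    """Rotation by beta (rad) about the Y axis, i.e. within the XZ plane."""
    c, s = np.cos(beta), np.sin(beta)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def estimate_motion(P_t, P_t1, step_deg=0.01, half_range_deg=45.0):
    """Return (beta, T) such that P_t ~ rot_y(beta) @ P_t1 + T (Eq. 3)."""
    P_t = np.asarray(P_t, dtype=float)
    P_t1 = np.asarray(P_t1, dtype=float)
    c_t, c_t1 = P_t.mean(axis=0), P_t1.mean(axis=0)    # centroids, Eq. 14
    D_t, D_t1 = P_t - c_t, P_t1 - c_t1                 # deviations from centroids

    # Candidate angles covering [-pi/4, pi/4] in 0.01 degree steps.
    n = int(round(2.0 * half_range_deg / step_deg)) + 1
    betas = np.deg2rad(np.linspace(-half_range_deg, half_range_deg, n))
    c, s = np.cos(betas)[:, None], np.sin(betas)[:, None]

    # Rotate every deviation by every candidate (rows: angles, columns: points).
    x_rot = c * D_t1[:, 0] + s * D_t1[:, 2]
    z_rot = -s * D_t1[:, 0] + c * D_t1[:, 2]
    # Eq. 15; the Y term does not depend on beta, so it is left out of the search.
    err = ((D_t[:, 0] - x_rot) ** 2 + (D_t[:, 2] - z_rot) ** 2).sum(axis=1)

    beta = betas[np.argmin(err)]
    T = c_t - rot_y(beta) @ c_t1                       # Eq. 13
    return beta, T
```

The exhaustive search is non-iterative in the sense that its cost is fixed by the grid size and the number of points, at the price of quantizing the estimated angle to the grid step.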

Figure 8.

Instantaneous rotation estimations for 650 frames corresponding to a straight route: a) rotation estimations with (red) or without (blue) the outlier detection method enabled, and b) motion estimation with the outlier detection enabled (red) with the effect of the position refinement module superimposed on it (green)

The mean error together with the standard deviation for the examined sequence of frames is summarized in Table 1. It can be seen that the mean values for the two separate cases are close to each other; however, the standard deviation is lower when the motion estimation makes use of the outlier detection method, leading to more accurate location estimations as the cumulative error increases more slowly with time.

Table 1.

Mean error and standard deviation for the rotation angle estimation

	Motion estimation	Estimation with outlier detection	Estimation with refinement
Mean error	0.0020	0.0015	0.0002
Standard deviation	0.2174	0.0853	0.0320

3.5. One-step Position Refinement

The position refinement module provides corrections to the location estimations of the robot. The purpose of this module is to improve the estimations that come from the motion estimation procedure while keeping the computational cost low. The proposed methodology provides refinements to the initially estimated 3D rotational angles. This correction is embodied in the accumulated rotation (i.e., from the starting point up to the current point). The position refinement procedure takes place in each successive frame and is applied to the already estimated rotational variables.

The output of the motion estimation is a rotation matrix $_{t + 1}^{t} R$ and a translation vector $_{t + 1}^{t} T$, comprising the transformation of the features at time t that conform best to the features at time $t + 1$. This transformation is applied to the feature points $^{t} P$, resulting in a new set of estimated points $^{t + 1} P_{est}$. In the ideal case, with a perfect first-motion estimation, the points $^{t + 1} P_{est}$ should exactly coincide with the points $^{t + 1} P$. However, this rarely happens, and therefore the motion estimation module is applied once more between the points $^{t + 1} P_{est}$ and $^{t + 1} P$. In this way, a transformation of the points $^{t + 1} P_{est}$ that best conforms to the points $^{t + 1} P$ is determined. The output of this procedure is a vector n, providing corrections to the rotation angles for the robot's orientation. Consequently, the vector $n = [n_{α}, n_{β}, n_{γ}]$ comprises the angular corrections in the 3D space, ensuring that the final refinement leads to more accurate 3D location estimations for the robot displacement. The corrections n are added to the cumulative rotation estimations, and as such we obtain

{}^{refined} A = {}^{acc} A + n_{α}, \quad {}^{refined} B = {}^{acc} B + n_{β} \quad \text{and} \quad {}^{refined} Γ = {}^{acc} Γ + n_{γ}

(16)

where $^{refined} A$, $^{refined} B$ and $^{refined} Γ$ comprise the refined rotational angles from the starting point up to the current point corresponding to the roll, pitch and yaw angles, respectively.

The contribution of the position refinement module is illustrated in Fig. 8(b). It can be seen that, compared to the case where the outlier detection module is used alone, the variations in the rotation angle for a straight route are even lower. As indicated in Table 1, the mean error in the rotation angle for the straight route is close to zero, and the standard deviation is lower than in the other two cases. The utilization of the outlier detection module is crucial for the performance of the motion estimation procedure, while the position refinement procedure provides more accurate location estimations. Therefore, in this work, the trajectory of the robot is computed hierarchically, first calculating an initial transformation between the 3D point clouds of two successive frames, and then applying this transformation to the first 3D point cloud with the aim of producing refinements to the location estimations.
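The mechanics of the one-step refinement can be illustrated with the following Python sketch. It is deliberately simplified and rests on several assumptions that are not in the original text: only a single rotation about the Y axis is refined (the paper corrects three angles), a closed-form least squares estimate is swapped in for the full search scheme, the first-pass angle `beta0` is supplied by the caller, and all function names are hypothetical.

```python
import numpy as np

def rot_y(beta):
    """Rotation by beta (rad) about the Y axis."""
    c, s = np.cos(beta), np.sin(beta)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def estimate_beta(A, B):
    """Closed-form beta minimising sum ||A_i - rot_y(beta) B_i||^2 after centring.

    Stand-in for the full search of Section 3.4 (an assumption of this sketch).
    """
    Da, Db = A - A.mean(axis=0), B - B.mean(axis=0)
    num = np.sum(Da[:, 0] * Db[:, 2] - Da[:, 2] * Db[:, 0])
    den = np.sum(Da[:, 0] * Db[:, 0] + Da[:, 2] * Db[:, 2])
    return np.arctan2(num, den)

def residual_rotation(P_t, P_t1, beta0):
    """Correction n_beta left over after a first-pass estimate beta0."""
    T0 = P_t.mean(axis=0) - rot_y(beta0) @ P_t1.mean(axis=0)   # Eq. 13
    # Map the points at time t into frame t+1 with the first estimate
    # (inverse of Eq. 3); this is the row-vector form of R^T (p - T).
    P_est = (P_t - T0) @ rot_y(beta0)
    # Second application of the estimator, between P_est and the measured points.
    return estimate_beta(P_est, P_t1)

# Usage: the correction is added to the accumulated angle, as in Eq. 16.
# acc_beta is assumed to already include beta0 for the current step:
#     refined_beta = acc_beta + residual_rotation(P_t, P_t1, beta0)
```

With noise-free points and an imperfect first estimate, the returned correction equals the difference between the true angle and `beta0`, which is the behaviour the refinement step relies on.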

4. Experimental Validation

The performance of the proposed VO algorithm has been evaluated with simulated as well as real-world data. The simulated data were generated by 3D Studio Max, while the real-world data consist of two robot-acquired trajectories; the first one was recorded using the prototype platform described in this work, and the second one is a part of the New College dataset. During these experiments, the proposed outlier detection procedure is compared with the RANSAC-Homography outlier detection procedure. Initially, this technique was utilized by [10] for the robot's motion estimation based solely on visual input, whereas a more sophisticated aspect, for the detection of the independent motion in a given scene, has been described in [41]. Besides the outlier detection methodology proposed in this work, RANSAC-Homography has also been used and, as a result, two variants of the proposed VO algorithm are obtained and compared. The implementation and the parameterization of the RANSAC-Homography technique are based on the framework that has been described analytically in [42]. The proposed VO algorithm is designed as a lightweight solution to be employed in applications where few computational resources are available, such as in the case of space exploration. The novel part of this algorithm is the outlier detection routine, which is a single-frame operating tool that efficiently reduces the cumulative errors during the robot's successive motion estimates. Therefore, concerning the experimental comparison, we felt that it would be reasonable to compare our algorithm with a well-established methodology that also operates in each frame and does not retain the inliers for later optimization during the robot's traverse. The RANSAC-Homography outlier detection method is a widely employed technique and can be taken as a benchmark for evaluating the accuracy of the proposed method.
Given a set of computed salient features in the scene - which are matched considering a proximity measurement and a similarity measurement - the steps for this outlier detection method can be summarized as follows:

1. A random sample of four correspondences is selected and the homography H is computed.

2. The Euclidean distance for each putative correspondence is calculated.

3. The number of inliers that are consistent with the homography H, i.e., those with Euclidean distances smaller than a predefined threshold, is computed.

4. Steps 1–3 are repeated K times, where K is determined adaptively, and the homography transformation that is satisfied by the majority of the features is detected.

5. The features that satisfy the homography detected in step 4 are considered as inliers and the rest of them are discarded as outliers.
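The steps above can be sketched in Python as follows. This is a generic RANSAC-Homography illustration and not the benchmark implementation of [42]: the homography is fitted with a plain direct linear transform, the number of iterations `n_iters` is fixed rather than adaptive, and the function names, the seed and the default threshold are assumptions made for the example.

```python
import numpy as np

def fit_homography(src, dst):
    """Direct linear transform from four (or more) 2D correspondences."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1.0, 0.0, 0.0, 0.0, u * x, u * y, u])
        rows.append([0.0, 0.0, 0.0, -x, -y, -1.0, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(rows))
    return vt[-1].reshape(3, 3)       # null-space vector, defined up to scale

def transfer_distance(H, src, dst):
    """Euclidean distance between the H-mapped src points and the dst points."""
    homog = np.hstack([src, np.ones((len(src), 1))]) @ H.T
    with np.errstate(divide="ignore", invalid="ignore"):
        mapped = homog[:, :2] / homog[:, 2:3]
    return np.linalg.norm(mapped - dst, axis=1)

def ransac_homography(src, dst, threshold=0.05, n_iters=200, seed=0):
    """Boolean inlier mask for the homography supported by most features."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(src), dtype=bool)
    for _ in range(n_iters):
        sample = rng.choice(len(src), size=4, replace=False)   # step 1
        H = fit_homography(src[sample], dst[sample])
        dist = transfer_distance(H, src, dst)                  # step 2
        inliers = dist < threshold                             # step 3
        if inliers.sum() > best.sum():                         # step 4
            best = inliers
    return best                                                # step 5
```

Each iteration requires a singular value decomposition plus a pass over all correspondences, which is the source of the computational burden discussed in this section.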

In this experimental procedure, the parameters of the RANSAC-Homography outlier detection technique have been carefully selected so that the method performs optimally. First of all, a strict Euclidean distance threshold for the consideration of the inliers has been fixed at 0.05 m, within the normalized range of [0, 1] m. The reason for this selection is mainly to counterbalance as much as possible the discretization of the disparity image space, with the aim of capturing the transformations stemming from correspondences several metres away from the stereo camera. The number of iterations K was determined adaptively and the iterations cease when the number of the remaining outliers is greater than 30, thus ensuring robustness in the estimation of the transformation. However, in order to avoid the noisy cases where the algorithm converges too quickly (in fewer than 100 iterations), the experiment is repeated by randomly permuting the order of correspondences, and RANSAC is initialized again. Although RANSAC performs very well when applied to data with few outliers, it regularly fails to find an optimal model when the number of inliers is less than 50% of the utilized data. This is the case when the robot performs quick turns, where the correctly matched features decrease in number while noise due to blurred images is also introduced. On the other hand, the proposed method depends on fewer parameters and does not seek a sufficient subset of inliers to converge on an optimal model; instead, it discards the outliers separately and the remaining features are then utilized to compute the transformation.
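To give a feel for how the number of iterations depends on the inlier ratio, the sketch below evaluates the textbook RANSAC rule $K = \log (1 - p) / \log (1 - w^{s})$, where w is the inlier ratio, s the sample size and p the desired confidence. This rule is an assumption introduced for illustration; the stopping criterion actually used in the experiments is the one described in the text, and the function name is hypothetical.

```python
import math

def ransac_iterations(inlier_ratio, sample_size=4, confidence=0.99):
    """Iterations needed to draw at least one all-inlier sample with the given confidence."""
    p_good = inlier_ratio ** sample_size      # chance that one sample is outlier-free
    if p_good <= 0.0:
        return math.inf
    if p_good >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_good))

for w in (0.8, 0.5, 0.3):
    print(w, ransac_iterations(w))
```

Under this rule, the required number of four-point samples grows rapidly as the inlier ratio drops, which is consistent with the difficulties reported above when fewer than 50% of the correspondences are inliers.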

Obviously, the computational burden of the RANSAC method increases with the number of iterations K, while the estimation of the homography H at every step also constitutes a demanding procedure. On the other hand, the outlier detection method proposed in the present paper consists of a simple statistical filtering procedure with the computational cost depending only on the number of detected feature pairs in two successive frames.

4.1. Simulated Data

Simulated dataset #1. This dataset consists of 3,200 stereo pair frames corresponding to a 100 m route and was created using the 3DS-Max software. The camera was placed 30 cm above the ground and tilted by 31.55°, as dictated by the geometry of the vision system of the ESA ExoMars rover. The relative translation between the successive frames was constant and approximately 3 cm. The surface of the rendered images comprises a rocky and uneven environment, as depicted in Fig. 9. This simulated route can be considered as modestly challenging for a VO algorithm, as it mostly consists of straight paths with minor turns. Additionally, the dataset is accompanied by a very accurately measured ground-truth route. With this dataset, the performance of both the proposed VO algorithm and the version using the RANSAC-Homography outlier detection procedure is evaluated. Fig. 10(a) depicts the 2D position estimates against the ground-truth, while Fig. 10(b) presents the estimated positions for the method used as a benchmark. The total positioning accuracy of the proposed algorithm is 0.62 m for a 100 m route, while that of the benchmark is 2.15 m for the same route. The accuracy is evaluated on the basis of the Euclidean distance and includes the 3D displacements of the stereo camera. Moreover, the accuracy of the examined methods can be further analysed by examining the rotation estimations among all the successive frames, as presented in Fig. 10(c), which depicts the instantaneous rotation estimations for all the frames of the route against the ground-truth values. It should be noted that the VO framework that includes the RANSAC-Homography outlier detection method introduces more instabilities than that proposed here. Moreover, a full examination of the real potential of a VO algorithm also makes it necessary to examine the continuous deviation of the algorithm from the ground-truth. Fig. 10(d) presents the deviation of the compared VO algorithms against the ground-truth along the entire route. It is seen that the currently proposed method outperforms that based on the RANSAC-Homography outlier detection procedure, with its estimated trajectory being constantly closer to the ground-truth. Furthermore, the cumulative error denoted by the area under the curves in Fig. 10(d) is seen to be greater for the benchmark method (also shown by the increased deviation distances from the ground-truth as the robot moves away from the starting point).
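The evaluation quantities used in this section can be made concrete with a short Python sketch: the positioning error as a percentage of the route length, the per-frame Euclidean deviation from the ground-truth, and the area under the deviation curve as a measure of cumulative error. The function names are hypothetical, and the trapezoidal rule over the frame index is an assumption about how the area is approximated.

```python
import numpy as np

def positioning_error_percent(final_error_m, route_length_m):
    """Positioning error expressed as a percentage of the travelled distance."""
    return 100.0 * final_error_m / route_length_m

def deviation_per_frame(estimated_xyz, ground_truth_xyz):
    """Euclidean distance between estimated and ground-truth positions, per frame."""
    diff = np.asarray(estimated_xyz, dtype=float) - np.asarray(ground_truth_xyz, dtype=float)
    return np.linalg.norm(diff, axis=1)

def cumulative_error(deviation):
    """Area under the deviation curve (trapezoidal rule, unit frame spacing)."""
    d = np.asarray(deviation, dtype=float)
    return float(((d[1:] + d[:-1]) / 2.0).sum())

# Figures reported for simulated dataset #1 (100 m route).
print(positioning_error_percent(0.62, 100.0))   # proposed method
print(positioning_error_percent(2.15, 100.0))   # RANSAC-Homography benchmark
```

For the 100 m route, the reported 0.62 m and 2.15 m therefore correspond to 0.62% and 2.15% of the distance travelled, respectively.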

Figure 9.

Stereo pair sample from the simulated Dataset 1

Figure 10.

Experimental validation with the simulated dataset 1: a) trajectory estimation of the proposed VO algorithm superimposed on the ground-truth; b) trajectory estimation of the VO algorithm inclusive of the RANSAC-Homography method superimposed on the ground-truth; c) rotation estimations of all the successive frames for the two different methods compared with the ground-truth; and d) deviations in the Euclidean distance from the ground-truth for every single frame for the two examined methods. Note that in d) the area under each curve is a measure of the cumulative error for each approach.

Simulated dataset #2. This dataset was also created using the 3DS-Max software and consists of 1,836 stereo pair frames corresponding to a 110 m route. The architecture of the stereo geometry was different in this dataset, as the camera was placed one metre above the ground with a tilt of 35°. In addition, the sampling rate was sparser and the relative displacement between the successive frames was constant at approximately 6 cm. Fig. 11 presents the uneven and rocky environment depicted in the rendered images. In contrast with the previous route, the current one constitutes a greater challenge for the VO algorithm as it consists of a series of continuous turns instead of straight lines. This dataset is also accompanied by a very accurately measured ground-truth. The performance of the examined VO algorithms was also tested with this dataset and is presented in Fig. 12(a) and Fig. 12(b). The accuracy in terms of the 3D positioning of the proposed VO algorithm is better than 0.49 m for the 110 m route, whereas the accuracy of the benchmark method is no better than 1.30 m. The accuracies of the two compared methods are further analysed by examining the rotation estimations for all the successive frames as presented in Fig. 12(c), which depicts the instantaneous rotation estimations for all the frames of the route against the ground-truth values. As with the simulated dataset #1, the VO framework that includes the RANSAC-Homography method introduces errors in the estimations of the rotation angles as a result of inadequate outlier detection. On the other hand, the rotation angles resulting from the currently proposed VO framework are shown to be closer to the ground-truth. Finally, Fig. 12(d) presents the continuous deviation from the ground-truth, with the magnitude of the area under each curve being a measure of the cumulative error in the motion estimation, thus revealing the better performance of the proposed method versus the benchmark method.

Figure 11.

Stereo pair sample from the simulated Dataset 2

Figure 12.

Experimental validation with the simulated Dataset 2: a) trajectory estimation of the proposed VO algorithm superimposed on ground-truth; b) trajectory estimation of the VO algorithm inclusive of the RANSAC-Homography method superimposed on the ground-truth; c) rotation estimations of all the successive frames for the two different methods compared with the ground-truth; and d) the deviations in the Euclidean distance from the ground-truth for every single frame for the two examined methods. Note that in d) the area under each curve is a measure of the cumulative error for each different approach.

Figure 13.

Experimental validation with the robot-acquired outdoor dataset: a) the left reference image of a stereo pair, b) the right reference image of a stereo pair, and c) the prototype robot equipped with a differential global positioning system (DGPS) during its deployment in the field

4.2. Outdoors Data

Robot-acquired dataset. The accuracy of the VO algorithm has been examined for real-world data and, specifically, for an 11 m sequence acquired during the field deployment of our robot platform using the localization stereo camera, i.e., Bumblebee2. The sequence covers a rocky and sandy environment near the premises of Democritus University of Thrace, Xanthi, Greece, and it was acquired under low lighting conditions with a low elevation angle of the sun. The gathered images cover a route of 11 m with the relative translation among successive frames being approximately 0.10 m. As a result, the dataset consists of 116 consecutive stereo pairs accompanied by ground-truth measurements acquired by a DGPS, namely the Promark 500 Magellan DGPS (see Fig. 14). In the current experiment, the performance of the proposed algorithm and the benchmark algorithm are evaluated. Fig. 14(a) depicts the 2D position estimates of the currently proposed VO method, with remarkable accuracy for the entire route. The VO algorithm making use of the RANSAC-Homography outlier detection procedure (Fig. 14(b)) is seen to achieve less accurate results, which become worse towards the last part of the route, exhibiting increased cumulative errors. More precisely, the proposed VO method achieved almost 20 cm positioning accuracy across the 11 m route, i.e., a positioning error as low as 2% (Fig. 14(a)). On the other hand, the benchmark method achieved 1.8 m positioning accuracy for the same route, i.e., a positioning error of 16.3% (Fig. 14(b)). Although orientation ground-truth data were not recorded for this experiment, it can be clearly seen in Fig. 14(a) that the robot's estimated trajectory follows precisely the displacement ground-truth derived from the DGPS data. Moreover, the continuous deviation, expressed as the Euclidean distance from the ground-truth at each estimated position, is also depicted in Fig. 14(c), revealing that the proposed method constantly retains a deviation of less than 20 cm, in contrast with the benchmark method, where the cumulative error increases almost exponentially after the first 70 frames.

Figure 14.

Experimental validation with the robot-acquired outdoor dataset: a) trajectory estimation of the proposed VO algorithm superimposed on the ground-truth; b) trajectory estimation of the VO algorithm inclusive of the RANSAC-Homography method superimposed on the ground-truth; and c) the deviations in the Euclidean distance from the ground-truth for every single frame for the two examined methods. Note that in c) the area under each curve is a measure of the cumulative error for each different approach.

The New College dataset. The performance of the proposed VO algorithm has been evaluated further with a publicly available, real outdoors dataset, namely the “New College Dataset” [43], provided to the vision research community by the Oxford Mobile Robotics Group. The robotic platform used for the acquisition of these data was a two-wheeled Segway RMP200 base, equipped with a Bumblebee2 stereo camera. The stereo camera was placed around 1 m above the ground and tilted by approximately 13°. The frame rate of the camera was 20 Hz, providing grey-scale images with a resolution of 512×384 pixels, similar to those depicted in Fig. 15. This dataset is split into three different epochs; epoch B, which comprises a route across uneven terrain with a lot of rolling between successive frames (some of which were even dark), was selected for the evaluation of the proposed VO algorithm. Furthermore, this dataset consists of 9,000 frames corresponding to a 647 m route, and includes both smooth and sharp turns about the vertical axis of the robot. In addition, a person pushing a cart showed up during the run, although this had no bearing on the accuracy of the proposed algorithm, as any feature matched on them was rejected as an outlier. The ground-truth was measured by a DGPS. In Fig. 15(c), the aerial image of the route under consideration is illustrated with the ground-truth superimposed on it. In this experiment, the performance of the proposed algorithm and the benchmark algorithm were evaluated. Fig. 16(a) depicts the 2D position estimates of the currently proposed VO method, with consistent accuracy for almost the entire route being apparent. The VO algorithm making use of the RANSAC-Homography outlier detection procedure (Fig. 16(b)) is also seen to achieve credible results for up to about the middle of the route followed, although with increasingly inaccurate estimates due to increasing cumulative error. More specifically, the proposed VO method achieved 7.08 m positioning accuracy across a 647 m route (i.e., a positioning error as low as 1.1%), while the competing method could only achieve 13.60 m positioning accuracy, corresponding to a positioning error of 2.1%. The discrepancy in the accuracy between the two methods considered can be explained by reference to Fig. 16(c), where it can be seen that the instantaneous errors introduced in the rotation angles for the benchmark method are greater than 4°. Even if the two algorithms had only 1 m divergence along a 647 m route, which could well be considered negligible, the diagram in Fig. 16(d) reveals that the continuous deviation from the ground-truth is quite different for the two methods. The proposed algorithm exhibits robust behaviour during the entire travelled route and the cumulative error is smaller than that of the benchmark method, as indicated by the areas under the curves in Fig. 16(d). The behaviour of the two error curves is too irregular to reveal a clear-cut pattern. A reasonable conclusion would be that the error for both RANSAC-Homography and the proposed method grows linearly. However, in the RANSAC-Homography method, after the first 5,000 frames, the cumulative error seems to increase with a higher gradient, while the error in the proposed methodology evolves more smoothly, thus revealing its superiority.

Figure 15.

Stereo pair sample and the aerial image from the New College dataset

Figure 16.

Experimental validation with the outdoor dataset: a) trajectory estimation of the proposed VO algorithm superimposed on the ground-truth; b) trajectory estimation of the VO algorithm inclusive of the RANSAC-Homography method superimposed on the ground-truth; c) rotation estimations of all successive frames for the two different methods compared with the ground-truth; and d) the deviations in terms of the Euclidean distance from the ground-truth for every single frame for the two examined methods. Note that in d) the area under each curve is a measure of the cumulative error for each different approach.

The Atacama Desert dataset. The performance of the proposed VO algorithm has been evaluated further with a publicly available real outdoor dataset, namely the “Atacama Desert Dataset”. This image sequence comprises an additional dataset which was acquired in the Atacama Desert in Chile [44] and includes data from a 6 km route of highly representative Martian terrain [45]. This dataset is accompanied by sparse DGPS ground-truth measurements, while a reference stereo pair of images is illustrated in Fig. 17.

Figure 17.

Stereo pair sample of the Atacama Desert Dataset

A subset of this dataset (approximately 96.5 m, with a 6 cm distance between successive frames), in which the DGPS data were dense and consistent enough, was selected for use in this work. The subset considered consists of 1,530 frames. Although the robot motion is calculated for every frame, the accuracy could be computed only on those frames for which DGPS measurements were available. Fig. 18(a) shows that the estimated trajectory closely follows the DGPS ground-truth and performs better than the VO making use of the RANSAC-Homography outlier detection procedure (see Fig. 18(b)). Specifically, the proposed method achieved a 2.2% positioning error, which is better than the 4.6% that the benchmark method achieved. The superiority of the proposed outlier detection algorithm is also exhibited in Fig. 18(c), where the deviations in terms of the Euclidean distance from the ground-truth for every single frame are illustrated for both the examined methods. It should also be mentioned that representative videos that exhibit the performance of our algorithm are available at [46].
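Evaluating against sparse ground-truth amounts to comparing the estimated trajectory only at the frames that carry a DGPS fix. A minimal Python sketch is given below; it assumes that each DGPS measurement has already been associated with a frame index and expressed in the same coordinate frame as the estimates, and the function and argument names are hypothetical.

```python
import numpy as np

def sparse_deviation(estimated_xyz, gt_frame_indices, gt_xyz):
    """Euclidean deviation at the frames for which ground-truth is available.

    estimated_xyz    : (F, 3) positions estimated for every frame
    gt_frame_indices : (M,) indices of the frames that have a DGPS fix
    gt_xyz           : (M, 3) DGPS positions for those frames
    """
    est = np.asarray(estimated_xyz, dtype=float)
    idx = np.asarray(gt_frame_indices, dtype=int)
    return np.linalg.norm(est[idx] - np.asarray(gt_xyz, dtype=float), axis=1)

# Hypothetical example: 30 frames spaced 6 cm apart, DGPS fixes on three of them.
est = np.column_stack([np.zeros(30), np.zeros(30), 0.06 * np.arange(30)])
gt_idx = [0, 10, 20]
gt = est[gt_idx] + np.array([0.03, 0.0, 0.04])
print(sparse_deviation(est, gt_idx, gt))
```

The remaining frames still contribute to the estimated trajectory; they are simply left out of the accuracy figures.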

Figure 18.

Experimental validation with the Atacama Desert dataset: a) trajectory estimation of the proposed VO algorithm superimposed on the ground-truth; b) trajectory estimation of the VO algorithm inclusive of the RANSAC-Homography method superimposed on the ground-truth; and c) the deviations in terms of the Euclidean distance from the ground-truth for every single frame for the two examined methods. Note that in c) the area under each curve is a measure of the cumulative error for each different approach.

4.3. Additional Comparisons

Processing Times: To further evaluate the performance of the proposed outlier detection algorithm and demonstrate its suitability for setups with low computational resources, such as those utilized in space exploration, we conducted an additional experiment regarding the processing times. Specifically, we assessed the computational time required for the outlier detection procedure, which represents the main novelty of the proposed work, and compared it with the RANSAC outlier detection methodology. For each frame of the sequence, the processing time required to compute the inliers was measured and averaged across all the frames of the sequence, while the variance of the processing times was also computed. The same experiment was performed both for the proposed outlier detection technique and for the RANSAC method, and repeated across all the datasets utilized in the above-mentioned experimental procedure. The results are exhibited in Fig. 19(a), where it can be seen that the proposed method computes the inliers in each scene almost twice as fast as the RANSAC methodology, with indicative statistical significance. Additionally, it should be noted that both algorithms were tested using the same parameterization setup for all the examined datasets and that the experiments were conducted on the same PC.
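A measurement of this kind can be sketched in Python as follows. The snippet is only an illustration of the protocol described above (per-frame timing, then mean and variance over the sequence); `detect_inliers` stands for whichever outlier detection routine is being timed and is a hypothetical callable, as is the function name.

```python
import statistics
import time

def time_per_frame(detect_inliers, frames):
    """Mean and variance (in seconds) of the per-frame outlier detection time."""
    durations = []
    for frame in frames:
        start = time.perf_counter()
        detect_inliers(frame)                  # routine under test
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), statistics.pvariance(durations)

# Usage with a placeholder routine standing in for the real detector.
mean_s, var_s = time_per_frame(lambda matches: sorted(matches), [[3, 1, 2]] * 100)
print(mean_s, var_s)
```

Running both detectors on the same frames, with the same parameters and on the same machine, keeps the comparison consistent with the setup described in the text.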

Figure 19.

a) The averaged computational time needed for each frame evaluated on both methodologies for all five of the examined datasets; b) the averaged number of inliers retained in frames evaluated by both methodologies for all five of the examined datasets. Note that in both experiments the variance is provided besides the mean value to illustrate any indicative statistical significance of our findings.

Outlier Detection: An additional experiment that supports the efficiency of the proposed method compares the number of outliers discarded in this work with the number discarded by the RANSAC methodology. Based on the above-mentioned experimental procedure, where the trajectories are directly evaluated, it is expected that the number of inliers retained at each frame by the proposed method will be lower than that of the benchmark method. This assumption is confirmed by the additional experimental results shown in Fig. 19(b), where the averaged number of retained inliers for all the sequences is illustrated for both the examined methods. In all the datasets, the proposed method retains fewer inliers than the RANSAC methodology, with important statistical significance. These findings further explain the differences in the cumulative errors of the two methods and justify the superiority of the proposed VO algorithm with respect to the benchmark methodology.

5. Discussion

In this work, a computationally efficient stereo VO algorithm was presented. The algorithm is made up of a number of custom-tailored, non-resource consuming modules in order to keep the level of the computational burden low. In the first step, the salient features of the images are detected using the SURF algorithm. The scene's depth information is then acquired by exploiting a stereo correspondence method appropriate to robot applications, followed by the 3D reconstruction of any detected points by means of triangulation. In the next step, a novel statistical-based, non-iterative outlier detection technique is applied in order to discard the outlier features. The retained features are then utilized by the motion estimation module, which is responsible for the calculation of the robot's angular rotation and linear translation in each individual step. A one-step position refinement technique is then used at each frame to correct the global orientation of the robot. Due to its simplicity, the localization procedure can be applied in high-frame rate camera systems, ensuring at the same time that rapid robot movements will introduce no faults in its pose computation. This suggests that at very high sampling rates, the proposed outlier detection technique can efficiently discard the outliers. The ability of the introduced outlier detection method to operate efficiently is based on the fact that feature tracking only takes place between salient points with reliable depth estimations. The proposed framework can effectively remove outliers introduced by moving objects, thus maintaining robust behaviour even in cases where people or any other observed objects move. Yet in order to produce a statistically strong set of features, the camera needs to be tilted such that a considerable portion of the stereo pair lies below the horizon level.

The proposed VO algorithm was evaluated by means of both simulated and real outdoor data. It was also compared with a VO framework that makes use of a well-known outlier detection technique, viz. RANSAC-Homography. With the RANSAC-Homography method serving as the benchmark, the proposed algorithm achieved a total positioning error of approximately 1% with respect to the ground truth. Moreover, the continuously varying deviation from the ground truth remains low, suggesting that the proposed VO method maintains its accuracy over long-range routes, even those crossing uneven terrain. Furthermore, the proposed VO algorithm was implemented and tested on a custom-made robot platform, where it exhibited remarkable accuracy when evaluated on real-world outdoor data.
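To make the percentage figure concrete, the following Python sketch computes a positioning error relative to the distance travelled. The definition used here, final-position error divided by the ground-truth path length, is an assumption made for illustration; this section does not restate the exact metric used in the experiments. The function names are hypothetical.

```python
import math


def path_length(trajectory):
    """Sum of the straight-line distances between consecutive positions."""
    return sum(math.dist(a, b) for a, b in zip(trajectory, trajectory[1:]))


def positioning_error_percent(estimated, ground_truth):
    """Final-position error as a percentage of the ground-truth path length.

    Both arguments are equal-length lists of (x, y, z) positions sampled at
    the same instants, e.g. the VO estimates and DGPS reference positions.
    """
    if len(estimated) != len(ground_truth) or len(ground_truth) < 2:
        raise ValueError("need two equal-length trajectories with >= 2 poses")
    travelled = path_length(ground_truth)
    if travelled == 0.0:
        raise ValueError("ground-truth trajectory has zero length")
    final_error = math.dist(estimated[-1], ground_truth[-1])
    return 100.0 * final_error / travelled


# Illustrative values only: a 100 m straight route with 1 m of lateral drift.
truth = [(0.0, 0.0, 0.0), (50.0, 0.0, 0.0), (100.0, 0.0, 0.0)]
estimate = [(0.0, 0.0, 0.0), (50.0, 0.5, 0.0), (100.0, 1.0, 0.0)]
error_pct = positioning_error_percent(estimate, truth)
```

Under this assumed definition, a drift of 1 m at the end of a 100 m route corresponds to a 1% positioning error.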

6. Acknowledgements

This paper is an extension of our previous conference paper entitled “Visual Odometry for Autonomous Robot Navigation Through Efficient Outlier Rejection”, initially presented at the IEEE International Conference on Imaging Systems and Techniques, October 2013.

This paper was supported by the project “Sparing Robotics Technologies for Autonomous Navigation (SPARTAN)” funded by the European Space Agency (ESA/ESTEC). The authors would like to thank Mr. Gianfranco Visentin for his valuable insights and the resources that he provided to our team during the SPARTAN project.

