Abstract
This paper presents an optimized scheme of monocular ego-motion estimation to provide location and pose information for mobile robots with one fixed camera. First, a multi-scale hyper-complex wavelet phase-derived optical flow is applied to estimate the micro motion of image blocks. Optical flow computation overcomes the difficulties of unreliable feature selection and feature matching in outdoor scenes; at the same time, the multi-scale strategy overcomes the problem of road surface self-similarity and local occlusions. Second, a support probability is defined for each flow vector to evaluate the validity of the candidate image motions, and a Maximum Likelihood Estimation (MLE) optical flow model is constructed based not only on the image motion residuals but also on their distribution of inliers and outliers, together with their support probabilities, to evaluate a given transform. This yields an optimized estimation of the inlier parts of the optical flow. Third, a sampling and consensus strategy is designed to estimate the ego-motion parameters. Our model and algorithms are tested on real datasets collected from an intelligent vehicle. The experimental results demonstrate that the estimated ego-motion parameters closely follow the GPS/INS ground truth in complex outdoor road scenarios.
1. Introduction
Visual odometry involves the estimation of camera motion, and of the motion of the vehicle the camera is attached to, using a sequence of camera images. Ego-motion is one of the most important areas of visual odometry, in which the instantaneous relative motion of the camera is estimated. It has been applied in many areas, such as scene reconstruction by structure-from-motion (SFM), autonomous navigation, computer-vision-based driving assistance [1, 2, 3, 4, 5], obstacle avoidance, and short-term control (steering and braking) [6, 7]. Typically, visual ego-motion is used in cases where GPS is denied, is insufficiently accurate owing to signal attenuation, or is too heavy to carry [8, 9]; it also complements wheel odometry because it is robust to the wheel slippage that can cause serious errors in wheel odometry. Most visual ego-motion research has been based on stereo cameras [10]. Moravec [11] presented the first motion-estimation pipeline for stereo-vision-based ego-motion and tested it on a planetary rover. A completely different approach was proposed in 2004 by Nistér et al. [12]; their paper presents a 3D-to-2D camera-pose estimation problem and a real-time, long-run implementation with a robust outlier-rejection scheme. Recent developments show significant progress in stereo visual odometry, such as combining a dense probabilistic 5D ego-motion estimation with a sparse key-point-based stereo approach [13]. However, the use of stereo cameras reduces the field of view, because only features lying in the intersection of the two cameras' fields of view are used, and the accuracy depends on inter-camera calibration, which can be hard to ensure if the cameras are separated significantly. Finally, the costs of components, interfacing, synchronization, and computing are higher for stereo cameras than for a monocular camera [8, 14].
In recent decades, monocular ego-motion has attracted increasing attention. Here, only a monocular image sequence is taken as input, which is inexpensive and easy to calibrate relative to stereo vision. Successful results with a single camera over long distances have been obtained in the last decade using both perspective and omnidirectional cameras [15, 16].
Related work can be divided into two categories: feature-based methods [14, 17] and appearance-based methods [15]. Feature-based methods [18, 19] rely on salient and repeatable features that are tracked over the frames; appearance-based methods use the intensity information of all the pixels in the image or of sub-regions of it. Recently, Choi [20] presented a feature initialization and monocular EKF method for indoor-environment SLAM, and Milford and Wyeth [21] presented an appearance-based method to extract approximate rotational and translational velocity information from a single perspective camera mounted on a car, which was then used in a RatSLAM scheme; they used template tracking at the centre of the scene.
Most feature-based algorithms deal only with regular or structured environments, such as indoor scenes or outdoor scenes in urban environments [15, 22, 20, 27, 30], where corners and segments are salient. Appearance-based approaches are suitable for most environments; however, a major drawback of these methods is that they are not robust to occlusions, which can increase the ego-motion error and even lead to unreliable estimation. Many algorithms implement some form of the RANdom SAmple Consensus (RANSAC) outlier-rejection method to augment the image correspondence process [14, 23, 24], since visual ego-motion algorithms are sensitive to incorrect matches. In the basic random sampling method, successive minimal sets of correspondences are used to derive hypothesized solutions, and the remaining correspondences are used to assess the quality of each hypothesis. However, basic sampling algorithms assess quality by counting the number of matches that support the current hypothesis; the error distribution is not considered. In other, similar algorithms, prior probabilistic [28] or estimated ego-motion information is used to guide the search for matches.
In this paper, we describe the case of a single forward-facing camera mounted on a mobile robot. We design a Maximum Likelihood Estimation (MLE) optical flow model and algorithms for monocular ego-motion, which achieve good performance in real-world tests. First, a multi-scale hyper-complex wavelet phase-derived optical flow without feature matching is designed to estimate image micro motion. This is a multi-scale signal correlation algorithm and does not involve any detection of feature points, so it overcomes the difficulties of unreliable feature selection and feature matching in outdoor scenes; at the same time, the multi-scale strategy overcomes the problem of road surface self-similarity and local occlusions. An MLE ego-motion model is constructed to describe the error distribution between the estimated image optical flow and the hypothetical robot-motion-induced pixel motion, which yields an optimized estimation of the inlier parts of the optical flow. Then, a sampling and consensus strategy is designed to estimate the ego-motion parameters. In a development of RANSAC, we evaluate the likelihood of each hypothesis by representing the error distribution as a mixture model, and the validity of each optical flow vector is used as its support probability of being an inlier. These merits make our method suitable for outdoor mobile robot ego-motion estimation.
The remainder of this paper is organized as follows. Firstly, the basic theory and methods including multi-scale optical flow estimation and monocular ego-motion are described. Secondly, the support probability of the optical flow vectors and maximum-likelihood optical flow model are presented. Thirdly, the monocular ego-motion estimation algorithm using the MLE optical flow model is provided together with the sampling and consensus strategy. Fourthly, experimental results and analysis are given. Finally, some concluding remarks are provided.
2. Basic Theory and Methods
2.1. Hyper-complex Wavelet (HCW)-based optical flow field
When a camera is fixed on a moving vehicle, an image sequence of the environment is captured, and the vehicle's motion induces movement of the image pixels. Pixel movements in an image sequence thus reflect vehicle motion in the physical world. Phase-based optical flow vectors and support score estimation are introduced in this section for monocular ego-motion.
The phase localizes the frequency components of an image in space (a spatial structure) and is insensitive to luminance variation. The phase difference in HCW space describes the motion of image blocks [25, 29]. The HCW is based on the definition of the 2-D Hilbert Transform (HT) and the analytic signal [25] according to quaternion theory. It is composed of a discrete-wavelet-transform tensor wavelet plus three real wavelets obtained by 1-D HTs along either or both coordinates. Given a real tensor product wavelet
Each column of G is a sub-band of HCW, and four components of every column compose a quaternion wavelet. The diagonal sub-band quaternion wavelet is shown in Equation (2):
where the g-indexed wavelet filters are HTs of the h-indexed filters. Furthermore, a quaternion
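The paper's Equation (2) is not reproduced in this copy. In the standard quaternion wavelet construction, which matches the description above (the g-indexed filters being HTs of the h-indexed ones), the diagonal sub-band quaternion wavelet takes the form:

```latex
\psi_D(x,y) \;=\; \psi_h(x)\,\psi_h(y) \;+\; i\,\psi_g(x)\,\psi_h(y) \;+\; j\,\psi_h(x)\,\psi_g(y) \;+\; k\,\psi_g(x)\,\psi_g(y),
\qquad \psi_g = \mathcal{H}\{\psi_h\},
```

where i, j, k are the quaternion imaginary units and \(\mathcal{H}\) denotes the 1-D Hilbert Transform; this is a reconstruction consistent with the surrounding text, not the paper's exact typesetting.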
Assume two blocks Ot and Or in the image pair
That is to say,
where W is the search region, within which the disparities are assumed to vary smoothly.
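As an illustration only, the coarse-to-fine guidance described above can be sketched as follows. Sum-of-squared-differences (SSD) block matching over a decimation pyramid stands in for the HCW phase correlation of the paper, and all function names and parameter values here are ours:

```python
import numpy as np

def block_match(prev, curr, p, block=8, radius=4, init=(0, 0)):
    """Displacement of the block centred at p (row, col) in `prev` that
    best matches `curr`, searched within `radius` of the initial guess."""
    r, c = p
    h = block // 2
    ref = prev[r - h:r + h, c - h:c + h]
    best_cost, best_d = np.inf, init
    for dr in range(init[0] - radius, init[0] + radius + 1):
        for dc in range(init[1] - radius, init[1] + radius + 1):
            cand = curr[r + dr - h:r + dr + h, c + dc - h:c + dc + h]
            if cand.shape != ref.shape:
                continue  # skip displacements that leave the image
            cost = float(np.sum((ref - cand) ** 2))  # SSD matching cost
            if cost < best_cost:
                best_cost, best_d = cost, (dr, dc)
    return best_d

def coarse_to_fine_flow(prev, curr, p, levels=2):
    """Match at the coarsest scale first; the upscaled coarse displacement
    seeds the search at the next finer scale."""
    d = (0, 0)
    for lvl in range(levels - 1, -1, -1):
        s = 2 ** lvl
        pv, cu = prev[::s, ::s], curr[::s, ::s]  # decimation as a stand-in pyramid
        d = block_match(pv, cu, (p[0] // s, p[1] // s), init=d)
        if lvl > 0:
            d = (d[0] * 2, d[1] * 2)  # propagate the guess to the finer level
    return d
```

The key point mirrored from the text is that the coarse estimate only *guides* the finer search (via `init`); the fine level is still free to refine or reject it within its own search window.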
2.2. Monocular ego-motion based on optical flow
First, the coordinate system and some parameters are defined.
Definition 1: For the camera coordinate system in

Figure 1. Coordinate system
Definition 2: For the image coordinate system in
Definition 3: The camera motion Γ is modelled as a rigid motion: a translation
Then, according to perspective transform theory,
The basic problem for monocular ego-motion is to estimate the rigid body transformation

Figure 2. Projection of pixels in a translation
When the robot is running on relatively flat ground, the model can be simplified. Statistical analysis on gravel roads finds that the translation TY along the Y axis is small and can be neglected; however,
We can also notice that the translational and rotational components are separable, and a partition strategy can be used because larger
exhibits such small amounts that those measurements would be overwhelmed by tracking noise; therefore, the optical flow vectors in the upper image region mainly describe the rotation. Then, the rotations can be estimated according to Equation (8) and the optical flow vectors in the upper image region [16].
The estimated rotation parameters Ω and the optical flow field in the lower region of the image are then used to estimate translation parameters according to Equation (6).
Given optical flow vectors in the image space, we can estimate translation parameters T and rotation parameters Ω according to Equation (6) and Equation (8).
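Equations (6) and (8) are not reproduced in this copy. For reference, the standard perspective motion-field equations (Longuet-Higgins and Prazdny), which are consistent with the partition used above — the rotational terms are depth-independent, while the translational terms scale with 1/Z — read, for focal length f and depth Z (sign conventions vary between references):

```latex
u \;=\; \frac{x\,T_Z - f\,T_X}{Z} \;+\; \Omega_X\,\frac{xy}{f} \;-\; \Omega_Y\!\left(f + \frac{x^2}{f}\right) \;+\; \Omega_Z\, y,
\qquad
v \;=\; \frac{y\,T_Z - f\,T_Y}{Z} \;+\; \Omega_X\!\left(f + \frac{y^2}{f}\right) \;-\; \Omega_Y\,\frac{xy}{f} \;-\; \Omega_Z\, x.
```

With TY neglected as in Section 2.2, the translational part of v reduces to y TZ / Z; for distant scene points (large Z) the translational terms vanish, which is why the upper image region is dominated by rotation.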
3. Maximum-likelihood Optical-flow-model-based Ego-motion Estimation
3.1. Support probability and estimation
In the present paper, an improved method to describe image motion and the probability of its correctness is developed. Image motion between consecutive image pairs is small because of the high frame rate, and phases in the HCW field are multi-scale; therefore, large-scale estimates guide those at smaller scales. However, many incorrect image motion vectors are also found (Figure 3(c)). If the probability of optical flow correctness is estimated, the error distribution induced by the motion model can be described effectively. The similarity of the image blocks linked by an optical flow vector is therefore used to describe its correctness probability.

Figure 3. Phase used in optical flow vector and support probability estimation
The local phase difference corresponds to the local translation of the image, and the same image blocks have the same phase structure; therefore, the score of optical flow vector
where Ws is the support window, which is an image region around
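As a minimal sketch of this idea — normalized cross-correlation (NCC) stands in here for the phase-structure similarity of the paper, and the function name and the mapping to [0, 1] are our own choices:

```python
import numpy as np

def support_score(block_a, block_b):
    """Similarity of the two image blocks linked by a flow vector, mapped
    to [0, 1].  Normalized cross-correlation (NCC) stands in for the
    phase-structure similarity used in the paper."""
    a = block_a - block_a.mean()
    b = block_b - block_b.mean()
    denom = np.sqrt(np.sum(a ** 2) * np.sum(b ** 2))
    if denom == 0.0:
        return 0.0  # a flat block carries no structure to support a match
    ncc = float(np.sum(a * b) / denom)  # in [-1, 1]
    return 0.5 * (ncc + 1.0)            # map to [0, 1] as a support score
```

Identical blocks score 1, anti-correlated blocks score 0, and unrelated texture lands near the middle, so the score can be used directly as a per-vector inlier probability.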
The optical flow vector estimated in the HCW space on a larger scale only guides the candidate image motions on a smaller scale. Therefore, a probability function is introduced to describe the validity of the candidate image motions. First, it is found that the optical flow error of inliers generally obeys a Gaussian distribution [28], and there are
We use
At the same time, there is only one inlier optical flow for index
Then,
The candidate with the highest
Finally,
3.2. Maximum-likelihood optical flow model
In general, estimation is done analytically according to the optical flow between two consecutive images; however, noise and outliers can disturb the estimation. An optimized method is therefore needed to estimate more accurate and robust
where
Many studies use the image intensity
In the following, without loss of generality it is assumed that the noise in the optical flow of two consecutive images is Gaussian with zero mean and uniform standard deviation σ. Thus, the conditional probability density function of the disturbed motion is
where n is the number of optical flow vectors and R0 indicates the observed residuals.
where
In visual ego-motion the image sequence is shot at a very high frequency, which means there is little motion between consecutive frames. The noise of the optical flow estimate therefore lies within a limited range, and the parameters of the outlier uniform distribution can be determined. The standard deviation σ can be estimated from the measured residuals of the valid matches in a sample, which are well fitted by a Rayleigh distribution, so σ can be inferred accordingly. The negative log maximum likelihood of P is shown in Equation (17), which is called the MLE optical flow model:
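The exact form of Equation (17) is not shown in this copy. A form consistent with the description — a per-vector Gaussian-inlier/uniform-outlier mixture, weighted by the support probability p_i in the spirit of MLESAC — would be:

```latex
-\log P \;=\; -\sum_{i=1}^{n} \log\!\left( p_i \,\frac{1}{2\pi\sigma^{2}}\, e^{-\|r_i\|^{2}/(2\sigma^{2})} \;+\; \left(1 - p_i\right)\frac{1}{v} \right),
```

where r_i is the residual of the i-th flow vector against the hypothesized transform, p_i its support probability, and v the area of the limited region over which outlier flows are assumed uniform; this is a plausible reconstruction, not the paper's verbatim equation.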
In particular,
4. Monocular Ego-motion Using MLE on Optical Flow
Based on the above MLE optical flow model, an ego-motion parameter estimation method for autonomous vehicles is introduced in this section. A block diagram of the proposed approach is shown in Figure 4. The system is composed of three main parts: optical flow computation and support probability estimation, MLE optical flow model construction, and ego-motion computation.

Figure 4. Block diagram of the monocular ego-motion framework
The first block, “optical flow computation and support probability estimation”, receives the rectified images as input and outputs a list of optical flow vectors and support probabilities. Then, the MLE optical flow model is constructed according to the image motion residuals and their distribution of inliers and outliers, together with their support probabilities. Finally, the ego-motion block receives as input the image motion optical flow vectors together with the MLE optical flow model, and outputs the translation and rotation of the camera platform between the current and previous iterations. A sampling and consensus strategy is adopted to estimate the ego-motion parameters of the current moment relative to the previous one based on the proposed MLE optical flow model; the parameters that minimize the MLE optical flow model, Equation (17), are selected. A motion step then stores the rotation and translation of the camera between two consecutive frames.
A sampling and consensus strategy similar to RANSAC is adopted, in which samples are weighted by their support probability to estimate Γ, avoiding an exhaustive minimization of Equation (17). Minimal sample sets are then selected to pre-estimate the model parameters, guided by the support probabilities rather than purely at random. The steps are as follows:
1) At the beginning of each iteration, a minimal optical flow set is chosen, favouring vectors with higher support probability, according to a Monte Carlo method.
2) A hypothesis Th is estimated according to pinhole imaging theory [7] using the chosen minimal optical flow set.
3) All other observed optical flow vectors are compared with those induced by Th; then, the residuals
4) The likelihood
5) Return to step 1) to choose another minimal optical flow set. The process is repeated until the termination criterion is met, at which point the optimized Γ is found.
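The steps above can be sketched as follows. The `fit` and `residual_fn` callbacks stand in for the pinhole-model estimation of Th and its induced-flow residuals; all names, the 1-D toy model, and the parameter values (sigma, v, iteration count) are our illustrative assumptions, not the paper's:

```python
import numpy as np

def mixture_nll(residuals, support, sigma=1.0, v=100.0):
    """Negative log-likelihood of a Gaussian-inlier / uniform-outlier
    mixture, weighted per vector by its support probability."""
    g = np.exp(-residuals ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    return float(-np.sum(np.log(support * g + (1 - support) / v)))

def guided_consensus(flows, support, fit, residual_fn, k=3, iters=200, seed=0):
    """Sampling-and-consensus loop: minimal sets are drawn with probability
    proportional to support (guided, not uniform), and each hypothesis is
    scored by the mixture NLL rather than by an inlier count."""
    rng = np.random.default_rng(seed)
    prob = support / support.sum()
    best_nll, best_model = np.inf, None
    for _ in range(iters):
        idx = rng.choice(len(flows), size=k, replace=False, p=prob)
        model = fit(flows[idx])              # hypothesis from the minimal set
        nll = mixture_nll(residual_fn(model, flows), support)
        if nll < best_nll:                   # keep the most likely hypothesis
            best_nll, best_model = nll, model
    return best_model, best_nll
```

Scoring by likelihood instead of inlier count is what distinguishes this loop from plain RANSAC: two hypotheses with the same inlier count are still ranked by how well their residual distribution matches the mixture model.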
5. Experiment Results and Discussion
This section presents our experimental results, validating the framework and algorithms on a real vehicle equipped with a CMOS camera that computes the monocular ego-motion on a country road. Our monocular ego-motion output is compared to that of an integrated GPS/INS system, which is treated as ground truth. It is shown that the ego-motion estimates are accurate and reliable in realistic ground-vehicle scenarios without a priori knowledge of the motion.
The frame rate of the captured video is set to 30 frames per second, and each frame has 256 × 256 pixels. In our experiment, the road is covered with clay, gravel, sand, weeds, or asphalt and concrete. A GPS/INS joint system, whose angular accuracy is 0.05°, whose absolute localization accuracy is 0.1 m, and whose relative localization accumulative error is lower than 0.5% of the mileage, is used to validate the accuracy of the recovered vehicle ego-motion.
5.1. MLE optical flow computation
Figure 5 illustrates the results of our MLE optical flow estimation for two scenarios from the image sequence shot by a single on-robot camera while the vehicle moves along a country road. The figure shows the optical flow estimated through phase coherence and the MLE optical flow, respectively. The optical flow vectors with a lower probability of being inliers are discarded in Figures 5(b, d), from which it can be seen that some motion noise is eliminated by the MLE optical flow estimation, especially in the lower left and lower right of frame 1. From Figure 5(c), we find that there is considerable noise in the road-surface optical flow estimated by phase coherence, because of the texture similarity within the wavelet window. However, when the support probabilities of the optical flow vectors are estimated according to Equations (10) and (11), and the vectors with lower probability are discarded, as shown in Figure 5(d), we find that the support probability of obviously noisy vectors, i.e., their likelihood of being inliers, is low.

Figure 5. Comparison of optical flow and MLE optical flow
5.2. Ego-motion estimation and performance analysis
The ego-motion, including the instantaneous steering increment, pitch increment, roll increment, and translation of the vehicle between consecutive frames, estimated through the MLE optical flow model, is shown in Figures 6–9. Comparisons of these incremental motions among the MLE optical flow ego-motion estimation, phase coherence with RANSAC, and GPS/INS as the ground truth are shown in the following, to test the accuracy of the instantaneous relative motion measurement.

Figure 6. Steering angle estimation of the MLE optical flow and RANSAC methods
The vehicle runs on a road with two pronounced bends; one corresponds to the frames around the 50th frame and the other to those around the 520th frame. To present the estimation clearly, we show frames from the 50th to the 545th. From Figure 6, it can be seen that the steering increment estimates of both the MLE and RANSAC methods closely follow the ground truth. However, the MLE method improves estimation accuracy: the noise in the optical flow vectors disturbs the RANSAC estimate considerably more than the MLE estimate. Some obvious errors are also suppressed, such as in frames 60, 78, 343, 415, 516, and 531, where many optical flow vectors are misestimated because of texture self-similarity. Figure 6(b) also shows that at the two bends, around frames 60 and 520, the errors of the MLE-based method are larger than in the other frames. These scenes correspond to substantial motion, which deforms the local image and therefore introduces more error into the support probability estimation.
Figure 7 and Figure 8 show the pitch and roll increment estimation, respectively. It is found that the pitch increment estimate of our MLE method is close to the ground truth. The increment around frame 450 is larger because of a bumpy road at a wide crossroads with few features. We also find a larger luminance change between images around frame 450; this is a challenging environment for the monocular ego-motion method and induces more error in the image motion estimation. Our MLE-based method estimates the pitch and roll more accurately and robustly than the RANSAC-based method, as shown in Figures 7(b) and 8(b).

Figure 7. Pitch angle estimation of the MLE optical flow and RANSAC methods

Figure 8. Roll angle estimation of the MLE optical flow and RANSAC methods

Figure 9. Translation estimation of the MLE optical flow and RANSAC methods
Figure 9(a) shows that our translation estimates closely follow the ground truth, with the larger errors suppressed. Figure 9(b) shows that the error of our method was almost uniformly distributed during the test, whereas the RANSAC-based method produced larger translation errors, especially around frames 90 and 490. Analysis of the road scene shows that the road around frames 90 and 490 is bumpy; the roll or pitch angle error of the RANSAC-based method is notable there, inducing further error in the translation estimation. In contrast, our optimized model describes the error distribution and estimates the steering angle, pitch angle, roll angle, and translation more accurately.
Table 1 shows a comparison of the statistical ego-motion estimation error between our method and a RANSAC-based method [29]. It can be seen that both the estimated maximum error and the standard deviation of our method are smaller than for the RANSAC-based method. That is to say, compared to the RANSAC-based method, our method obtains higher accuracy and higher precision using the MLE optical flow model and corresponding methods.
Table 1. Ego-motion estimation error comparison
From the above experiments, we can conclude that the MLE optical flow model and the sampling and consensus parameter estimation algorithms represent an optimized scheme for ego-motion estimation on outdoor unstructured roads, even when bumps are encountered. Furthermore, the scheme overcomes the difficulties of absent salient features and road-scene self-similarity. All in all, the ego-motion estimates of our method closely follow the ground truth in most frames.
6. Conclusions
This paper has presented a monocular ego-motion estimation method based on an MLE optical flow model and sampling and consensus parameter estimation algorithms for mobile robots. Our algorithms use as input only those images provided by a single forward-looking camera and the output of the system is the instantaneous relative steering, pitch, roll and translation increment on the road.
The proposed scheme includes three main parts. Firstly, the multi-scale optical flow field is estimated through HCW, and then the support probability of optical flow vectors being inliers is estimated according to the phase structure to describe the image motion and estimation errors more accurately. Secondly, an MLE optical flow model is constructed based not only on image motion residuals but also their distribution of inliers and outliers, together with their support probabilities, to evaluate a given transform. At last, a sampling and consensus strategy guided by the support probability is used to estimate the ego-motion parameters.
The proposed method is tested on videos from a mobile robot platform. In experimental results using a sequence of more than 500 images, the estimated ego-motion parameters closely follow the high-precision GPS/INS joint system. Moreover, the errors in the steering angle, pitch angle, roll angle, and translational increments with MLE optical flow are smaller than with RANSAC-based methods.
In future research, we will improve the scheme and algorithm to estimate 6-DOF ego-motion parameters for large-scale and rugged terrain.
7. Acknowledgements
This work was partially supported by the National 863 High Technology Research Fund (2015AA8106043), the National Science Foundation of China (Grant Nos. 61402237 and 61302156), and the Jiangsu Key Laboratory of Image and Video Understanding for Social Safety (Nanjing University of Science and Technology, Grant No. 30920140122007).
