A square root unscented Kalman filter for multiple view geometry based stereo cameras/inertial navigation

Abstract

Exact motion estimation is one of the major tasks in autonomous navigation. Conventional Global Positioning System-aided inertial navigation systems are able to provide accurate locations. However, they are limited when used in a Global Positioning System-denied environment. In this paper, we present a square root unscented Kalman filter-based approach for navigation by using stereo cameras and an inertial sensor only. The main contribution of this work is the development of a novel measurement model by applying multiple view geometry constraints to the stereo cameras/inertial system. The measurement model does not require the three-dimensional feature position in the state vector of the filter, which substantially reduces the size of the state vector and the computational burden. To incorporate this nonlinear and complex measurement model, a variant of the square root unscented Kalman filter-based algorithm is also proposed. The root of the state covariance is propagated and updated directly in the square root unscented Kalman filter, thereby avoiding the decomposition of the state covariance and improving the stability of our algorithm. Experimental results based on a real outdoor dataset are presented to demonstrate the feasibility and the accuracy of the proposed approach.

Keywords

Autonomous navigation, inertial measurement unit, stereo cameras, multiple view geometry, square root unscented Kalman filter

Introduction

In recent years, the inertial measurement unit (IMU) has been widely used for the navigation tasks of robots, cars, unmanned aerial vehicles, and so forth, thanks to the remarkable advances in Micro-Electro-Mechanical System (MEMS) inertial sensors, such as lower cost, smaller size, lighter weight, and higher power efficiency. The IMU-based inertial navigation system (INS) is able to track fast motion with high precision over a short period of time. However, the accuracy of the INS deteriorates with time because of the accumulated error caused by sensor biases and noise. A conventional method to tackle this problem is to use the Global Positioning System (GPS) to correct the INS error periodically. A major limitation of this approach is that the GPS/INS system cannot be used in GPS-denied environments where GPS receivers do not function,¹ such as indoors, in forests, and underground. Besides, high-precision GPS receivers are usually expensive and bulky.

An alternative approach to restrain the IMU error is the use of visual sensors such as stereo cameras. Both position and attitude information could be extracted by tracking visual features between sequences of images captured from cameras. The estimation accuracy in such a system depends on the observed scene instead of the accumulative time, and the visual motion estimation achieves high accuracy in a slow motion scenario. For this reason, the fusion of visual and inertial sensors has garnered considerable attention from many researchers. Peter Corke² presents a tutorial introduction of inertial and visual sensing from a biological and engineering perspective. A set of algorithms of relative pose (translation and rotation) calibration for hybrid inertial/visual system can be found in Lang and Pinz,³ Lobo and Dias,⁴ and Mirzaei and Roumeliotis.⁵ Since this paper is mainly concerned about navigation, we focus on the data fusion algorithm of inertial and visual sensors for motion estimation.

An area of particular relevance to our work is visual odometry, which focuses on the use of either monocular or stereo vision to estimate the ego-motion of an agent with respect to the environment. For the monocular case, however, the detected features suffer from scale unobservability. To overcome the problem, Feng et al.⁶ and Davison et al.⁷ employ the fixed depth constraint, whilst a delayed feature initialization scheme is presented in Davison⁸ and Kim and Sukkarieh.⁹ The stereo cameras, inherently, are able to provide scale through the baseline between the cameras. This is demonstrated by Davison and Murray¹⁰,¹¹ in the presented active stereo visual Simultaneous Localization And Mapping (SLAM) systems. Nevertheless, the vision-only techniques depend heavily on the available features, causing difficulties in recovering the real track when all tracked features are lost.

In order to overcome the limitations of vision-only techniques, numerous works have been reported recently on hybrid stereo camera and inertial systems. A Kalman filter (KF) is employed in Kelly et al.¹² and Carillo et al.¹³ to fuse stereo visual odometry and inertial measurements for the estimation of unmanned aerial vehicle position and velocity. As a loosely coupled approach, it does not utilize inertial sensors in the prediction of the tracked features. A tightly coupled approach is often preferable since it makes better use of the information from the two sensors. Veth and Raquet¹⁴ develop an image-aided inertial navigation algorithm on the basis of a multi-dimensional stochastic feature tracker. An unscented KF (UKF)-based filter for localization, mapping, and self-calibration of inertial and visual sensors is presented in Kelly and Sukhatme.¹⁵ In order to handle both far and near features in large-scale outdoor environments, Xian et al.¹⁶ propose an iterative extended KF (IEKF)-based tightly coupled integration algorithm using stereo cameras and a low-cost IMU. In these tightly coupled approaches, the positions of the features are contained in the filter state. However, one of the main demerits of these approaches is that the state space maintained within the filter grows with the number of features, resulting in a higher computational burden.

A remedy to this issue is presented in Diel et al.¹⁷ and Nilsson et al.,¹⁸ where the epipolar geometry between current and previous images is used and combined with the IMU measurement. Also, in Mourikis and Roumeliotis,¹⁹ the authors present an extended KF (EKF)-based algorithm for vision-aided inertial navigation. Although three-dimensional (3D) feature positions are removed from the state vector of the filter, it still makes use of a least-squares solution to estimate the 3D position of feature points. In Indelman et al.,²⁰ both epipolar and three-view geometry constraints are used for monocular and inertial sensor integration. A similar approach can be found in Hu and Chen.²¹ The difference is that the former uses an EKF while the latter employs a UKF. The standard UKF utilizes the Cholesky decomposition in order to generate the sigma points at each time step. However, the main flaw of this approach is that the Cholesky decomposition may become unstable, especially when the covariance matrix is no longer positive definite. The square root UKF (SRUKF), a re-implementation of the general UKF presented by Van Der Merwe and Wan,²² is able to achieve exactly the same accuracy. Yet, it avoids the decomposition process by directly propagating the Cholesky factor instead of the covariance of the state. Compared with the EKF, which is accurate to the first order only, the SRUKF/UKF is accurate to the third order for a Gaussian state distribution.²³ Besides, the SRUKF/UKF is derivative-free, which means that the filter does not require the computation of Jacobian matrices.

In this paper, we propose a novel multiple view geometry constraints-based measurement model by applying the epipolar and trifocal geometry constraints to stereo vision. The stereo-vision-based approaches only require the previous and current moments to employ the three-view/trifocal geometry, compared with at least three moments in the monocular approaches.²⁰,²¹ Besides, the stereo cameras inherently contain the depth information, which will help improve the position estimation. The main benefit of using the measurement model is that the 3D position of the feature does not need to be contained in the state vector of the filter, resulting in less storage space and computational burden. Furthermore, an improved SRUKF algorithm is proposed in our approach for the purpose of fusing the stereo camera information with the inertial measurement. To improve the robustness and accuracy of the estimation, our approach employs a two-step method to reject outliers, which consists of an epipolar constraint check for left–right outlier removal and a random sample consensus (RANSAC) algorithm for previous–current outlier rejection.

The remainder of the paper is organized as follows. The process model and measurement model are detailed in the second section. In the third section, the background of the SRUKF is briefly introduced and its improved version in our approach is also presented. Multiple trials from a real dataset are used to evaluate our algorithm, with their results presented in the fourth section and the Appendix. Finally, the main conclusions are drawn in the fifth section.

System modeling

Reference frames

The goal of this paper is to estimate the pose of a carrier with respect to the world frame by integrating stereo cameras and an IMU. Three reference frames are used, as shown in Figure 1.

Figure 1.

The reference frames and their relationship. The yellow solid line between two frames represents a fixed connection while the yellow dashed one denotes a flexible connection. The two camera frames C_L, C_R and the IMU frame I are rigidly attached. The feature points M are modeled in the world frame W.

The world frame W: The pose of the IMU is estimated with respect to this frame, which is anchored to the earth. The features (like the feature point M shown in Figure 1) in the scene are modeled in this frame. It can be aligned in any way; however, in this paper it is defined as vertically aligned, with the x, y, and z-axis being aligned with the east, north, and up direction, respectively.

The camera frame C: This frame is attached to the moving stereo cameras, with its origin placed at the optical center of the camera, and with the z-axis pointing along the optical axis. There are two camera frames, as shown in Figure 1, and they are aligned with each other with a known translation. In this paper, we choose the left camera as the reference camera frame.

The IMU frame I: This is the frame of the IMU, with its origin at the center of the IMU body. The x, y, and z-axis respectively denote the front, left, and up direction of the IMU body.

The camera frame and IMU frame are rigidly attached. The features are modeled in the world frame and assumed to be static. The assumption may not always be true in practice, for there may be some moving objects in the scene, such as running cars and walking pedestrians. In such a circumstance, these moving features can be rejected by the RANSAC algorithm in the proposed approach, which will be detailed in a later section. Since the pose of the IMU is variable with respect to the world frame, the estimation of the IMU pose is the main objective of this paper.

State representation

The state vector used in the proposed approach consists of the current IMU-related state vector and the previous pose of the IMU, that is

x = {[q_{W I}^{T}, {(v_{I W}^{W})}^{T}, {(p_{I W}^{W})}^{T}, b_{g}^{T}, b_{a}^{T}, {}^{1} q_{W I}^{T}, {({}^{1} p_{I W}^{W})}^{T}]}^{T}

where q_WI is a unit quaternion²⁴ which denotes a rotation quaternion from frame I to frame W; $v_{I W}^{W}$ and $p_{I W}^{W}$ respectively denote the linear velocity and position of the IMU with respect to the frame W and are expressed in the frame W; $b_{g}, b_{a}$ are the IMU gyroscope and accelerometer biases respectively; and the pair $(^{1} q_{W I}^{T}, {(^{1} p_{I W}^{W})}^{T})$ denotes the last pose of the IMU with respect to the frame W. Please note that the state vector does not include the position of the features. This is different to the case in a SLAM approach, because the feature positions are not of interest to navigation tasks.

The state in equation (1) is not estimated in the SRUKF directly. The error-state is estimated in the filter and then used to compensate the state. Following equation (1), the error state is defined as follows

δ x = {[δ θ_{W I}^{T}, {(δ v_{I W}^{W})}^{T}, {(δ p_{I W}^{W})}^{T}, δ b_{g}^{T}, δ b_{a}^{T}, δ {}^{1} θ_{W I}^{T}, {(δ {}^{1} p_{I W}^{W})}^{T}]}^{T}

For the IMU position, velocity, and biases, the error is defined as $δ x = x - \tilde{x}$ , where x is a true quantity, and $\tilde{x}$ is an estimate of the quantity. However, for a quaternion, if the true quaternion is denoted as q and the estimate is $\tilde{q}$ , the error is defined as²⁵

q = \tilde{q} \otimes δ q = \tilde{q} \otimes {[1, δ θ^{T} / 2]}^{T}

where the operator ⊗ denotes quaternion multiplication. It is worthwhile to note that the attitude error δθ is a 3×1 vector, while the quaternion q is a 4×1 vector. Therefore, the size of the vector δx is 21.
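As an illustration of the error definition above, the following minimal sketch compensates an estimated quaternion with a small attitude error. It assumes scalar-first quaternions [w, x, y, z] and the Hamilton product convention; the paper does not state which convention it uses, and the function names `quat_mul` and `apply_attitude_error` are hypothetical.

```python
import numpy as np

def quat_mul(p, q):
    """Hamilton product p (x) q for scalar-first quaternions [w, x, y, z] (assumed convention)."""
    pw, pv = p[0], np.asarray(p[1:], dtype=float)
    qw, qv = q[0], np.asarray(q[1:], dtype=float)
    w = pw * qw - pv @ qv
    v = pw * qv + qw * pv + np.cross(pv, qv)
    return np.concatenate(([w], v))

def apply_attitude_error(q_est, dtheta):
    """Return q = q_est (x) [1, dtheta/2], renormalised.

    dtheta is the 3x1 small-angle attitude error (rad). The error quaternion
    [1, dtheta/2] is only approximately unit length, hence the renormalisation.
    """
    dq = np.concatenate(([1.0], 0.5 * np.asarray(dtheta, dtype=float)))
    q = quat_mul(np.asarray(q_est, dtype=float), dq)
    return q / np.linalg.norm(q)
```

The three-component error is what makes the error state 21-dimensional, while the full state, which stores two 4×1 quaternions, has 23 elements.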

Process model

The process model describes the time evolution of the state. In our approach, the biases of inertial sensors b_g, b_a are modeled as Gaussian random walk processes driven by zero mean white Gaussian noise. The kinematics of the states $q_{W I}$ , $v_{I W}^{W}$ , $p_{I W}^{W}$ , b_g, and b_a are the same as those in Xian et al.¹⁶ As regards the previous pose of the IMU $({}^{1} q_{W I}^{T}, {({}^{1} p_{I W}^{W})}^{T})$ , since it has no dynamics, its process model is

{}^{1} {\dot{q}}_{W I} = 0, \quad {}^{1} {\dot{p}}_{I W}^{W} = 0

As for the error-state, the propagation can be deduced from state propagation equations and the definition of the error-state. Here we present the error-state process model directly. Interested readers could refer to Xian et al.¹⁶ for details.

δ \dot{x} = F δ x + n

where

\begin{array}{l}
F = \begin{bmatrix} F_{11} & 0_{15 \times 6} \\ F_{21} & 0_{6 \times 6} \end{bmatrix} \\
n = {\begin{bmatrix} 0_{9 \times 1}^{T} & n_{g}^{T} & n_{a}^{T} & 0_{6 \times 1}^{T} \end{bmatrix}}^{T} \\
F_{11} = \begin{bmatrix} ⌊ (b_{g} - ω_{m}) \times ⌋ & 0_{3 \times 3} & 0_{3 \times 3} & - I_{3} & 0_{3 \times 3} \\ R_{W I} ⌊ (b_{a} - a_{m}) \times ⌋ & 0_{3 \times 3} & 0_{3 \times 3} & 0_{3 \times 3} & - R_{W I} \\ 0_{3 \times 3} & I_{3} & 0_{3 \times 3} & 0_{3 \times 3} & 0_{3 \times 3} \\ 0_{6 \times 3} & 0_{6 \times 3} & 0_{6 \times 3} & 0_{6 \times 3} & 0_{6 \times 3} \end{bmatrix} \\
F_{21} = \begin{bmatrix} I_{3} & 0_{3 \times 3} & 0_{3 \times 3} & 0_{3 \times 3} & 0_{3 \times 3} \\ 0_{3 \times 3} & 0_{3 \times 3} & I_{3} & 0_{3 \times 3} & 0_{3 \times 3} \end{bmatrix}
\end{array}
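The block structure of F can be made concrete with a short sketch that assembles the 21×21 matrix from the blocks given above. The function and argument names (`error_state_F`, `omega_m`, `a_m`) are hypothetical, and ⌊·×⌋ is taken to be the usual skew-symmetric cross-product matrix.

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix such that skew(v) @ w == np.cross(v, w)."""
    x, y, z = v
    return np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])

def error_state_F(R_WI, omega_m, a_m, b_g, b_a):
    """Assemble F = [[F11, 0], [F21, 0]] for the 21-dimensional error state.

    Error-state order: attitude, velocity, position, gyro bias,
    accelerometer bias, previous attitude, previous position.
    """
    omega_m, a_m = np.asarray(omega_m, float), np.asarray(a_m, float)
    b_g, b_a = np.asarray(b_g, float), np.asarray(b_a, float)
    I3 = np.eye(3)
    F11 = np.zeros((15, 15))
    F11[0:3, 0:3] = skew(b_g - omega_m)        # attitude error w.r.t. itself
    F11[0:3, 9:12] = -I3                       # attitude error w.r.t. gyro bias
    F11[3:6, 0:3] = R_WI @ skew(b_a - a_m)     # velocity error w.r.t. attitude
    F11[3:6, 12:15] = -R_WI                    # velocity error w.r.t. accel bias
    F11[6:9, 3:6] = I3                         # position error w.r.t. velocity
    F21 = np.zeros((6, 15))
    F21[0:3, 0:3] = I3
    F21[3:6, 6:9] = I3
    F = np.zeros((21, 21))
    F[:15, :15] = F11
    F[15:, :15] = F21
    return F
```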

Multiple view constraints-based measurement model

The measurement used in this paper is the multiple view geometry constraint between two consecutive stereo image pairs, or more precisely, the epipolar constraint and the trifocal tensor constraint. Both previous and current stereo images are taken into consideration. Assume the left and right images in the previous view are denoted respectively by Im1 and Im2, and those in the current view are denoted by Im3 and Im4. In our case, the epipolar constraints used for measurement are the pairs (Im1, Im3) and (Im2, Im4), out of six possible combinations. These two pairs are chosen because they contain the relative pose information between the previous and current poses, which is helpful to improve the accuracy of the state estimates. Similarly, the trifocal tensor constraints used in the measurement are the groups (Im1, Im2, Im3) and (Im1, Im2, Im4).

The epipolar geometry describes the intrinsic projective geometry between two views, which is independent of scene structure, and only depends on the cameras’ internal parameters and relative pose.

Suppose a point M in the 3D-space is imaged in two image planes, at m₁ in the first, and m₂ in the second, as shown in Figure 2. Here, m is the image coordinate, which is defined by m = [u,v,1]^T. The relation between the corresponding image points m₁ and m₂ is the well-known epipolar constraint. The two cameras are indicated by their centers O₁ and O₂ and image planes. The three points M, O₁, O₂ lie in a common plane called the epipolar plane. If we assume that R, t are respectively the rotation matrix and position of the frame O₂ with respect to the frame O₁, then the image points satisfy the relation ${\bar{m}}_{2}^{T} F {\bar{m}}_{1} = 0$ , in which F is the fundamental matrix, which for normalized image coordinates takes the form $F = R ⌊ t \times ⌋$ (⌊·×⌋ denotes the skew-symmetric cross-product matrix); $\bar{m} = K^{- 1} m$ is the normalized image coordinate, where K is the camera intrinsic parameter matrix.

Figure 2.

The epipolar geometry constraint between two views.
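The epipolar constraint can be checked numerically with the sketch below. It assumes a particular reading of the pose convention: R maps coordinates from the first camera frame into the second, and t is the position of O₂ expressed in the first frame, so that X₂ = R (X₁ − t). The names `essential_from_pose` and `epipolar_residual` are hypothetical.

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix such that skew(v) @ w == np.cross(v, w)."""
    x, y, z = v
    return np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])

def essential_from_pose(R, t):
    """F = R [t]x for normalized image coordinates (assumed pose convention)."""
    return np.asarray(R, float) @ skew(t)

def epipolar_residual(m1_bar, m2_bar, R, t):
    """Scalar m2_bar^T F m1_bar, which is zero for a perfect correspondence."""
    return float(np.asarray(m2_bar) @ essential_from_pose(R, t) @ np.asarray(m1_bar))
```

Under this convention the residual vanishes because (X₁ − t) is orthogonal to t × X₁; a mismatched or moving point generally yields a nonzero value.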

The three-view geometry has similar properties: it is independent of the scene structure and only depends on the projective relations between the cameras. The new capability of the three-view geometry compared with the two-view case is point transfer, that is, a correspondence in two views determines the position of the point in a third view. As shown in Figure 3, line l₂ in the second view back-projects to plane π₂ in the 3D-space. The point m₁ in the first image defines a ray in the 3D-space which intersects π₂ at point M. The point M is then imaged as point m₃ in the third view. The point-line-point correspondence is given by

Figure 3.

The point-line-point correspondence between three views.

{\bar{m}}_{3} = (\sum_{i} {\bar{m}}_{1, i} T_{i}^{T}) l_{2}

where T is the trifocal tensor; ${\bar{m}}_{1, i}$ is the i th element of the vector ${\bar{m}}_{1}$ ; and l₂ can be any line passing through m₂ in the second image. However, it is recommended that l₂ be chosen as the line perpendicular to the epipolar line.²⁶

The trifocal tensor T can be computed from camera matrices. If the canonical 3×4 camera matrices are $P_{1} = [I | 0], P_{2} = [a_{j}^{i}] = [a_{1}, a_{2}, a_{3}, a_{4}]$ and $P_{3} = [b_{j}^{i}] = [b_{1}, b_{2}, b_{3}, b_{4}]$ , then we have

T_{i} = a_{i} b_{4}^{T} - a_{4} b_{i}^{T}
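The following sketch builds the trifocal tensor from two camera matrices (with the first camera at [I | 0]) and performs the point-line-point transfer described above, choosing l₂ perpendicular to the epipolar line. The function names are hypothetical; points are assumed to be homogeneous with a unit third component.

```python
import numpy as np

def trifocal_tensor(P2, P3):
    """Stack T_i = a_i b_4^T - a_4 b_i^T (i = 1..3) into an array of shape (3, 3, 3)."""
    a4, b4 = P2[:, 3], P3[:, 3]
    return np.stack([np.outer(P2[:, i], b4) - np.outer(a4, P3[:, i])
                     for i in range(3)])

def line_perpendicular_to_epipolar(P2, m1, m2):
    """Line through m2 (third component 1) perpendicular to the epipolar line of m1."""
    le = np.cross(P2[:, 3], P2[:, :3] @ m1)   # epipolar line in view 2: e' x (A m1)
    return np.array([le[1], -le[0], -m2[0] * le[1] + m2[1] * le[0]])

def transfer_point(T, m1, l2):
    """m3 = (sum_i m1_i T_i^T) l2, scaled so that the third component is 1."""
    m3 = np.einsum('i,ijk,j->k', m1, T, l2)
    return m3 / m3[2]
```

The transfer degenerates when l₂ passes through the epipole in the second view, which is one reason a line perpendicular to the epipolar line is a convenient choice.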

Applying the two geometry constraints to our case, the measurement model can be presented as

y = \begin{bmatrix} 0 \\ 0 \\ m_{3} \\ m_{4} \end{bmatrix} = h (x, m_{1}, m_{2}, m_{3}, m_{4}) = \begin{bmatrix} {\bar{m}}_{3}^{T} (R_{31} ⌊ t_{31} \times ⌋) {\bar{m}}_{1} \\ {\bar{m}}_{4}^{T} (R_{42} ⌊ t_{42} \times ⌋) {\bar{m}}_{2} \\ K_{3} (\sum_{i} {\bar{m}}_{1, i} \, {}^{1} T_{i}^{T}) l_{2} \\ K_{4} (\sum_{i} {\bar{m}}_{1, i} \, {}^{2} T_{i}^{T}) l_{2} \end{bmatrix}

where ${}^{1} T, {}^{2} T$ are the trifocal tensors of the views (Im1, Im2, Im3) and the views (Im1, Im2, Im4) respectively, and K₃, K₄ are the intrinsic matrices of the left and right cameras respectively.

Square root unscented Kalman filter-based implementation

As for the fusing method, a variant of the SRUKF is used for integrating stereo images and the inertial measurement, since it has an excellent capacity for dealing with nonlinear problems and superior performance when compared with the EKF and UKF. State prediction and measurement update are the two main steps of the SRUKF. In order to predict the state, the standard SRUKF requires the sigma points to be calculated; however, in our approach, we predict the state directly based on our process model, which requires less computation. To implement the SRUKF, two linear algebra techniques are used, namely QR decomposition and Cholesky factor updating.

State prediction

In our case, the state prediction process is different from the standard SRUKF. The state is predicted based on the process model. Assume that the state and the Cholesky factor at time k are respectively $δ x_{k | k}$ and $S_{k | k}$ , where $S_{k | k}$ is the Cholesky factor of the state covariance in the decomposition $S_{k | k}^{T} S_{k | k} = P_{k | k}$ . Let $S_{Q}$ be the Cholesky factor of the process covariance; then the state and the related Cholesky factor at time k+1 can be predicted as

Φ_{k + 1 | k} = I + F_{k} d t + 0.5 F_{k}^{2} d t^{2}

δ x_{k + 1 | k} = Φ_{k + 1 | k} δ x_{k | k}

S_{k + 1 | k}^{-} = \operatorname{qr} (Φ_{k + 1 | k} S_{k | k})

S_{k + 1 | k} = \operatorname{cholupdate} (S_{k + 1 | k}^{-}, S_{Q}, 1)

where $F_{k}$ is computed from equation (6). The QR decomposition function $\operatorname{qr} (\cdot)$ and the Cholesky factor updating function $\operatorname{cholupdate} (\cdot)$ are defined in Van Der Merwe and Wan.²²
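The prediction step can be sketched as follows. This is not the authors' implementation: it assumes lower-triangular factors with S Sᵀ = P (the transpose of the factor as written in the text, and what `numpy.linalg.cholesky` returns), and it folds the process-noise factor into a single QR decomposition, which yields the same predicted covariance as a QR step followed by Cholesky updates with the columns of S_Q. The function names are hypothetical.

```python
import numpy as np

def transition_matrix(F, dt):
    """Second-order approximation Phi = I + F dt + 0.5 F^2 dt^2."""
    Fdt = F * dt
    return np.eye(F.shape[0]) + Fdt + 0.5 * (Fdt @ Fdt)

def predict_sqrt(dx, S, F, S_Q, dt):
    """Propagate the error state and the square-root covariance factor.

    S and S_Q are lower-triangular with S S^T = P and S_Q S_Q^T = Q
    (an assumed convention, see the lead-in).
    """
    Phi = transition_matrix(F, dt)
    dx_pred = Phi @ dx
    # compound @ compound.T == Phi P Phi^T + Q
    compound = np.hstack([Phi @ S, S_Q])
    # QR of the transpose gives upper-triangular R with R^T R = compound compound^T
    R = np.linalg.qr(compound.T, mode='r')
    return dx_pred, R.T
```

The returned factor may have negative diagonal entries, which does not affect the covariance it represents.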

Measurement prediction and update

To calculate the statistics of measurement y , we need to form the sigma points for measurement prediction, that is

\begin{array}{l} Χ_{k + 1 | k, 0} = δ x_{k + 1 | k} \\ Χ_{k + 1 | k, j} = δ x_{k + 1 | k} + γ S_{k + 1 | k, j}, \quad j \in \{1, \ldots, L\} \\ Χ_{k + 1 | k, j + L} = δ x_{k + 1 | k} - γ S_{k + 1 | k, j}, \quad j \in \{1, \ldots, L\} \end{array}

where L is the dimension of the state, in our case L = 21, $Χ_{k + 1 | k, j}$ is the j th sigma point, $S_{k + 1 | k, j}$ is the j th column of the matrix $S_{k + 1 | k}$ , $γ = \sqrt{(L + λ)}$ is the weight, $λ = α^{2} (L + κ) - L$ is a scaling parameter, α determines the spread of the sigma points around $δ x_{k + 1 | k}$ , and κ is a secondary scaling parameter. The corresponding weights and the mean of $Χ_{k + 1 | k, j}$ are computed as follows

\begin{array}{l} W_{0}^{(m)} = λ / (L + λ) \\ W_{0}^{(c)} = λ / (L + λ) + (1 - α^{2} + β) \\ W_{j}^{(m)} = W_{j}^{(c)} = 1 / (2 (L + λ)), \quad j = 1, \ldots, 2 L \end{array}

{\bar{Χ}}_{k + 1 | k} = \sum_{j = 0}^{2 L} W_{j}^{(m)} Χ_{k + 1 | k, j}

where $W_{j}^{(m)}, W_{j}^{(c)}$ are the scalar weights and β is used to incorporate prior knowledge of the distribution of $δ x_{k | k}$ ; for details, refer to Wan and Van Der Merwe.²⁷
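A minimal sketch of the sigma point and weight construction is given below. It assumes a lower-triangular factor S with S Sᵀ = P, so that the columns of S are the sigma directions; the values of α, β, and κ are left to the caller since the paper does not report them, and the function name `sigma_points` is hypothetical.

```python
import numpy as np

def sigma_points(dx, S, alpha, beta, kappa):
    """Return sigma points (one per row) with their mean and covariance weights.

    dx : (L,) predicted error state
    S  : (L, L) lower-triangular factor with S S^T = P (assumed convention)
    """
    L = dx.size
    lam = alpha**2 * (L + kappa) - L
    gamma = np.sqrt(L + lam)
    # Rows of S.T are the columns of S; broadcasting adds dx to every row.
    X = np.vstack([dx, dx + gamma * S.T, dx - gamma * S.T])
    Wm = np.full(2 * L + 1, 1.0 / (2.0 * (L + lam)))
    Wc = Wm.copy()
    Wm[0] = lam / (L + lam)
    Wc[0] = lam / (L + lam) + (1.0 - alpha**2 + beta)
    return X, Wm, Wc
```

For L = 21 this produces 43 sigma points. The mean weights sum to one, and the weighted mean of the symmetric point set reproduces δx.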

The sigma vector of the measurement then can be obtained through the nonlinear measurement function

y_{k + 1 | k, j} = h (Χ_{k + 1 | k, j}, m_{1}, m_{2}, m_{3}, m_{4}), \quad j = 0, \ldots, 2 L

The mean and the Cholesky factor of the measurement prediction are approximated as follows

{\bar{y}}_{k + 1 | k} = \sum_{j = 0}^{2 L} W_{j}^{(m)} y_{k + 1 | k, j}

S_{y}^{-} = \operatorname{qr} ([\operatorname{diag} (\sqrt{W_{1 : 2 L}^{(c)}}) (y_{k + 1 | k, 1 : 2 L} - {\bar{y}}_{k + 1 | k}), S_{R}])

S_{y} = \operatorname{cholupdate} (S_{y}^{-}, y_{k + 1 | k, 0} - {\bar{y}}_{k + 1 | k}, W_{0}^{(c)})

The measurement update equations are

P_{x y} = \sum_{j = 0}^{2 L} W_{j}^{(c)} (Χ_{k + 1 | k, j} - {\bar{Χ}}_{k + 1 | k}) {(y_{k + 1 | k, j} - {\bar{y}}_{k + 1 | k})}^{T}

K = (P_{x y} / S_{y}^{T}) / S_{y}

δ x_{k + 1 | k + 1} = {\bar{Χ}}_{k + 1 | k} + K (y_{k + 1} - {\bar{y}}_{k + 1 | k})

U = K S_{y}

S_{k + 1 | k + 1} = cholupdate (S_{k + 1 | k}, U, - 1)

where $S_{R}$ is the Cholesky factor of the measurement covariance R and $y_{k + 1}$ is the measurement at time k+1.
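The measurement update can be sketched as below. This is an illustrative reading rather than the authors' code: factors are assumed lower-triangular (S Sᵀ = P), `cholupdate` is a plain rank-one update/downdate written out here because NumPy does not provide one, and the gain is obtained with two general linear solves instead of dedicated triangular solves. The function names are hypothetical.

```python
import numpy as np

def cholupdate(S, v, sign):
    """Factor of S S^T + sign * v v^T for lower-triangular S (sign = +1 or -1)."""
    S, v = S.copy(), np.array(v, dtype=float)
    for k in range(v.size):
        r = np.sqrt(S[k, k]**2 + sign * v[k]**2)
        c, s = r / S[k, k], v[k] / S[k, k]
        S[k, k] = r
        S[k+1:, k] = (S[k+1:, k] + sign * s * v[k+1:]) / c
        v[k+1:] = c * v[k+1:] - s * S[k+1:, k]
    return S

def srukf_update(X, Y, Wm, Wc, S_pred, S_R, y_meas):
    """X: (2L+1, L) state sigma points, Y: (2L+1, m) measurement sigma points."""
    x_bar, y_bar = Wm @ X, Wm @ Y
    dX, dY = X - x_bar, Y - y_bar
    # QR over sigma points 1..2L together with the measurement-noise factor
    compound = np.hstack([(np.sqrt(Wc[1:])[:, None] * dY[1:]).T, S_R])
    S_y = np.linalg.qr(compound.T, mode='r').T
    # The zeroth point enters as a rank-one update, or downdate if its weight is negative
    sign0 = 1.0 if Wc[0] >= 0 else -1.0
    S_y = cholupdate(S_y, np.sqrt(abs(Wc[0])) * dY[0], sign0)
    P_xy = (Wc[:, None] * dX).T @ dY
    # K = (P_xy / S_y^T) / S_y, i.e. K = P_xy (S_y S_y^T)^{-1}
    K = np.linalg.solve(S_y.T, np.linalg.solve(S_y, P_xy.T)).T
    x_post = x_bar + K @ (y_meas - y_bar)
    S_post = S_pred.copy()
    for u in (K @ S_y).T:          # successive downdates with the columns of U
        S_post = cholupdate(S_post, u, -1.0)
    return x_post, S_post, K
```

For a linear measurement function the result should coincide with the standard Kalman filter update, which gives a convenient sanity check.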

Feature detection, tracking, and outlier rejection

Feature detection and tracking are vital steps for visual odometry, visual/inertial integration, and many other vision applications. In recent decades, a variety of methods have been proposed. Fraundorfer and Scaramuzza²⁸ compared the properties and performance of several popular feature detectors, such as FAST corner detection (FAST),²⁹,³⁰ Scale Invariant Feature Transform (SIFT),³¹ and Speeded-Up Robust Features (SURF).³² In Desai and Lee,³³ the authors compared the recently developed Binary Robust Independent Elementary Features (BRIEF)³⁴,³⁵ and Binary Robust Invariant Scalable Keypoints (BRISK),³⁶ and presented a novel feature descriptor called SYnthetic BAsis (SYBA)³⁷,³⁸ for accurate feature matching. In our visual/IMU approach, we mainly focus on the integration method of visual and IMU information. As for the feature detection algorithm, a fast, efficient, and sophisticated algorithm is preferred, so we employ the FAST detector for detecting corner features in the image. The major advantage of the FAST algorithm is its high efficiency in reaching accurate corner localization in an image. For feature tracking, the Kanade–Lucas–Tomasi (KLT) tracker³⁹ is used. The KLT tracker can track features over long image sequences, and through larger appearance changes, by applying an affine-distortion model to each feature.

However, matched points are usually contaminated by outliers due to image noise, occlusion, image blur, and changes in viewpoint. In this paper, a two-step method is introduced to reject the outliers. First, we use the epipolar geometry constraint for left–right outlier removal. If we take m₁ and m₂ as an example, the inliers are determined using the following criterion

{(m_{1}, m_{2})}_{inliers} = \{ (m_{1}, m_{2}) \mid ‖ {\bar{m}}_{2}^{T} (R_{21} ⌊ t_{21} \times ⌋) {\bar{m}}_{1} ‖ < th_{epi} \}

where th_epi is the threshold for the epipolar constraint.
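The left–right check can be sketched as a vectorized threshold test over all matched pairs. The function name and the pose convention (R₂₁ maps left-camera coordinates into the right camera, t₂₁ is the right camera position in the left frame) are assumptions, and the threshold value is left to the caller since the paper does not report it.

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix such that skew(v) @ w == np.cross(v, w)."""
    x, y, z = v
    return np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])

def epipolar_inlier_mask(m1_bar, m2_bar, R21, t21, th_epi):
    """Boolean mask of matches whose epipolar residual is below th_epi.

    m1_bar, m2_bar : (N, 3) normalized homogeneous image points (left, right)
    """
    E = np.asarray(R21, float) @ skew(t21)
    residual = np.abs(np.einsum('ni,ij,nj->n', m2_bar, E, m1_bar))
    return residual < th_epi
```

For an ideal rectified pair (identity rotation, baseline along x) the residual reduces to the baseline times the vertical disparity, so the test rejects matches that do not lie on the same image row.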

It is then followed by the RANSAC algorithm for the previous–current outlier rejection. To implement the RANSAC algorithm, the trifocal tensor constraint is used. Taking the features in Im1, Im2, and Im3 as an example, we reject the outliers by using the formula below

{(m_{1}, m_{2}, m_{3})}_{inliers} = \{ (m_{1}, m_{2}, m_{3}) \mid ‖ m_{3} - K_{3} (\sum_{i} {\bar{m}}_{1, i} \, {}^{1} T_{i}^{T}) l_{2} ‖ < th_{tri} \}

where th_tri is the threshold for the trifocal tensor constraint.

Experimental results

To evaluate the performance of the proposed algorithm, the KITTI dataset⁴⁰ is used. This dataset is available to the public and was chosen because it contains easy-to-process data from various types of sensors. The recording platform is equipped with two grayscale cameras, two color cameras, a rotating 3D laser scanner, and a combined GPS/IMU INS. Two versions of data are provided: raw (unsynced + unrectified) and processed (synced + rectified).

Since we focus on navigation tasks based on stereo and inertial sensors, we used the unsynced IMU measurements sampling at 100 Hz and the rectified grayscale stereo sequences sampling at 10 Hz. The pixel resolution of the rectified images is 1226×370. To synchronize the measurements from the IMU and cameras, we minimize the timestamp difference between the two types of measurements, resulting in synchronization errors no greater than 5 ms. The result of the GPS/IMU integrated system is used as the ground truth, as it provides open sky position/attitude precision as high as 0.02 m/0.1 deg.
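One simple way to realize the timestamp matching described above is a nearest-neighbour search over sorted IMU timestamps. The sketch below is an assumption about how such an association could be done, not the authors' procedure; `nearest_timestamps` is a hypothetical name.

```python
import numpy as np

def nearest_timestamps(imu_t, cam_t):
    """For each camera timestamp, find the index of the closest IMU timestamp.

    imu_t must be sorted in ascending order and contain at least two samples.
    Returns (indices into imu_t, absolute time differences).
    """
    imu_t = np.asarray(imu_t, dtype=float)
    cam_t = np.asarray(cam_t, dtype=float)
    idx = np.searchsorted(imu_t, cam_t)          # first IMU sample at or after cam_t
    idx = np.clip(idx, 1, imu_t.size - 1)
    left, right = imu_t[idx - 1], imu_t[idx]
    # Step back by one where the earlier IMU sample is at least as close
    idx = idx - ((cam_t - left) <= (right - cam_t)).astype(int)
    return idx, np.abs(imu_t[idx] - cam_t)
```

With IMU samples every 10 ms, the residual difference of a nearest-neighbour match is at most half a sample period, which is consistent with the 5 ms bound quoted above.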

We have tested our algorithm with most of the KITTI data and compared it with the pure inertial navigation solution, stereo visual odometry,⁴¹ and the UKF-based monocular aided inertial navigation solution (UKF-Mono/IMU).²¹ The results shown in this section are a typical representative of many trials. Additional test results can be found in the Appendix. The trajectory of the trial is about 4.1 km, taking 518 s with an average velocity of about 8 m/s. The initial attitude, speed, and position of the filter are provided by the results of GPS/IMU.

Trajectory estimation and comparison

The estimated trajectories are overlaid on the Google map as shown in Figure 4(a), and the corresponding two-dimensional (2D) horizontal position errors are depicted in Figure 4(b). Note that the large error from 75 s to 120 s is caused by ground truth error. For this reason, we do not take the period (75–120 s) into consideration when calculating the root mean square error (RMSE), the maximum error (MAXE), and the end-point error (ENDE). The overall RMSE, MAXE, and ENDE are shown in Table 1. Clearly, the worst one is the pure inertial navigation solution, with an end position error of about 21 km. Inertial navigation is an integration process (the position is computed through double integration while the attitude is obtained by single integration), and therefore even a small sensor bias or noise error accumulates quickly, resulting in a position error that grows quadratically over time. This is the main reason why a low-cost INS usually needs to be aided by other types of sensors. The stereo visual odometry solution is better than the inertial-only solution. The trajectory estimated by stereo visual odometry has high accuracy in the beginning, but the heading error increases gradually over time, resulting in a maximum position error of about 160 m.

Table 1.

Horizontal position error comparison

	Pure inertial navigation solution	Stereo visual odometry (Geiger, 2011)⁴¹	UKF-Mono/IMU (Hu, 2014)²¹	SRUKF-Stereo/IMU (ours)
RMSE (m)	10112.1	68.7	15.8	7.7
MAXE (m)	20733.9	158.8	45.1	17.9
ENDE (m)	20733.9	57.2	36.7	16.4

UKF: unscented Kalman filter; IMU: inertial measurement unit; SRUKF: square root unscented Kalman filter; RMSE: root mean square error; MAXE: maximum error; ENDE: end-point error.
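The three error statistics reported in Table 1 can be computed as in the sketch below, which also excludes a time window with unreliable ground truth, as done for the 75–120 s period. The function name and the dictionary layout are hypothetical, and the exact procedure used by the authors is not given in the paper.

```python
import numpy as np

def horizontal_error_stats(t, est_xy, ref_xy, exclude=None):
    """RMSE, MAXE, and ENDE of the 2D horizontal position error.

    t       : (N,) timestamps in seconds
    est_xy  : (N, 2) estimated east/north positions
    ref_xy  : (N, 2) reference (ground truth) positions
    exclude : optional (t_start, t_end) window left out of the statistics
    """
    t = np.asarray(t, dtype=float)
    err = np.linalg.norm(np.asarray(est_xy, float) - np.asarray(ref_xy, float), axis=1)
    keep = np.ones(t.size, dtype=bool)
    if exclude is not None:
        keep &= ~((t >= exclude[0]) & (t <= exclude[1]))
    e = err[keep]
    return {
        "RMSE": float(np.sqrt(np.mean(e**2))),
        "MAXE": float(e.max()),
        "ENDE": float(e[-1]),   # error at the last retained sample
    }
```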

Compared with the pure inertial navigation solution and stereo visual odometry, both the UKF-Mono/IMU solution (referred to below as UKF-VINS, where VINS stands for visual-aided INS) and our proposed method have much better performance, which shows the advantage of combining inertial and visual sensors. Both UKF-VINS and our proposed method are visual-aided inertial navigation approaches and use the trifocal tensor geometry as measurement information. The main difference is that UKF-VINS combines a monocular camera and the inertial sensor based on the UKF, while our proposed method fuses stereo cameras and the inertial sensor based on the SRUKF. As shown in both Figure 4 and Table 1, our proposed method has a smaller position error than UKF-VINS.

Figure 4.

Comparison of trajectory estimation. (a) An overview of trajectory overlaid on the road map. (b) Horizontal position errors of different solutions. Noteworthy is the fact that the position estimated by GPS/IMU (the black dashed ellipses in both (a) and (b)) has large errors in some areas, probably due to loss of GPS satellites.

Velocity and attitude estimation and comparison

The velocity and attitude estimated by our proposed method are compared with those of the UKF-VINS. The results of the pure inertial navigation solution and stereo visual odometry are not used for comparison here, since their performance is much poorer, and the stereo visual odometry cannot provide velocity directly. The velocity and attitude errors are compared in Figure 5 and the corresponding RMSE and MAXE are shown in Table 2. The jumps appearing in Figure 5 are again because of the loss of GPS signal. According to Figure 5 and Table 2, the velocity and attitude estimates of our proposed method are more accurate than those of UKF-VINS in every component except the roll RMSE (0.47 deg versus 0.43 deg), with the accuracy being better than 0.5 m/s for velocity RMSE and better than 1 deg for attitude RMSE in this experiment.

Figure 5.

Velocity error (a) and attitude error (b) comparison. The large error appearing in (a) from about 75 s to 120 s is caused by the GPS outage, which can be seen in Figure 4; Ve: east velocity; Vn: north velocity; Vu: up velocity.

Table 2.

Velocity and attitude error

	UKF-Mono/IMU (Hu, 2014)²¹ RMSE	UKF-Mono/IMU (Hu, 2014)²¹ MAXE	SRUKF-Stereo/IMU (ours) RMSE	SRUKF-Stereo/IMU (ours) MAXE
Ve (m/s)	0.63	2.42	0.36	1.10
Vn (m/s)	0.53	1.49	0.19	0.55
Vu (m/s)	0.11	0.49	0.10	0.27
Roll (deg)	0.43	1.64	0.47	1.22
Pitch (deg)	0.55	2.08	0.50	1.55
Yaw (deg)	2.12	8.22	0.83	3.04

Results of feature tracking and outlier rejection

The positions of tracked features from the image sequence are not included in our filter, but are used as the measurement. The quality of the features plays an important role in our algorithm. During feature tracking, we set a reference feature number of 80 in our case. Once the number of features falls below this value, new features are detected and added to the tracked set. As shown in Figure 6(b), the average number of features is about 80. The tracked features are selected by our RANSAC approach; the moving objects (the features in the yellow ellipse shown in Figure 6(a)) are rejected effectively. The inlier rate of the test is also shown in Figure 6(b). Most of the time, the inlier rate is above 0.8, which is an important guarantee for reliable integration. It is worth mentioning that the inlier rate drops to zero at around frame 200, as shown in Figure 6(b). This is mainly caused by the large timestamp difference between the IMU and the camera measurements while the vehicle is moving at a high speed in the test. When this issue occurs, the pose can be updated by using the INS only, demonstrating the higher reliability of the inertial/visual approach in particular circumstances.

Figure 6.

Image features and inlier rate. (a) Two sample images and tracked features. (b) Feature number and inlier rate of this particular experiment. The detected features are shown in the images in (a); the green ones are inliers and the red ones are outliers.
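
The feature bookkeeping described in this subsection can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in the experiment: the function names are hypothetical, the inlier flags are assumed to come from the RANSAC step, and the `min_rate` threshold is an assumption (the text only states that the INS alone is used when the inlier rate is zero).

```python
REFERENCE_FEATURE_COUNT = 80  # reference feature number quoted in the text


def features_to_add(num_tracked, reference=REFERENCE_FEATURE_COUNT):
    """Number of new features to detect so the tracked set is topped up."""
    return max(0, reference - num_tracked)


def inlier_rate(inlier_flags):
    """Fraction of tracked features flagged as inliers by outlier rejection."""
    flags = list(inlier_flags)
    if not flags:
        return 0.0
    return sum(1 for f in flags if f) / len(flags)


def select_update_mode(rate, min_rate=0.0):
    """Use the visual measurement update only if some inliers survive;
    otherwise fall back to INS-only propagation (min_rate is an assumption)."""
    return "visual" if rate > min_rate else "ins_only"


# Hypothetical frame: 65 features tracked, RANSAC keeps three out of four.
print(features_to_add(65))
print(select_update_mode(inlier_rate([True, True, False, True])))
print(select_update_mode(inlier_rate([False, False])))
```

Keeping the fallback decision separate from the tracker makes it easy to experiment with a stricter threshold than zero without touching the rest of the pipeline.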

Conclusions

In this paper, we present an SRUKF-based stereo camera and inertial sensor integration architecture for autonomous navigation. As our focus is on navigation and localization rather than environment reconstruction, the 3D positions of the features are not included in the filter. The measurement of the filter utilizes the multiple view geometry constraints (to be exact, the epipolar constraint and the trifocal tensor geometry constraint) between two consecutive image pairs. The system model, the implementation of the SRUKF-based algorithm, and the feature detection, tracking, and outlier rejection method are presented in this paper.
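
To illustrate why such constraints make the 3D feature positions unnecessary, the following sketch checks the textbook epipolar constraint on a synthetic point. It is not the measurement model of this paper, which also involves the trifocal tensor and the stereo pairs. The assumptions are: normalized image coordinates, the convention X2 = R X1 + t between the two camera frames, and made-up values for R, t, and the 3D point.

```python
import numpy as np


def skew(v):
    """Skew-symmetric matrix [v]_x such that skew(v) @ w == np.cross(v, w)."""
    x, y, z = v
    return np.array([[0.0, -z, y],
                     [z, 0.0, -x],
                     [-y, x, 0.0]])


def essential_matrix(R, t):
    """Essential matrix E = [t]_x R for the convention X2 = R @ X1 + t."""
    return skew(t) @ R


def epipolar_residual(E, x1, x2):
    """Scalar residual x2^T E x1; zero for a perfect static correspondence."""
    return float(x2 @ E @ x1)


# Synthetic relative motion (assumed values): 5 deg yaw about the camera y-axis.
theta = np.deg2rad(5.0)
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([0.5, 0.0, 0.1])

X1 = np.array([1.0, -0.5, 8.0])  # 3D point in the first camera frame
X2 = R @ X1 + t                  # same point in the second camera frame
x1 = X1 / X1[2]                  # normalized image coordinates
x2 = X2 / X2[2]

E = essential_matrix(R, t)
good = epipolar_residual(E, x1, x2)
bad = epipolar_residual(E, x1, x2 + np.array([0.0, 0.05, 0.0]))
print(good, bad)
```

The residual depends only on the relative pose and the image coordinates, so it can be evaluated from the navigation state without estimating the depth of the point. A correspondence on a moving object, imitated here by shifting `x2`, produces a clearly nonzero residual.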

The proposed algorithm has been validated using real outdoor data. The experimental results are compared with the pure inertial navigation solution, stereo visual odometry, and UKF-Mono/IMU approaches. The comparison shows that the proposed algorithm improves on the pure inertial navigation solution by multiple orders of magnitude, and gives more precise location and heading estimates than the stereo visual odometry and UKF-Mono/IMU approaches. The results also show the effectiveness of the outlier rejection method, which is able to reject the moving objects in the scene. The maximum 2D horizontal position error is less than 20 m for the typical vehicle path of about 4.1 km, that is, below 0.5% of the distance travelled. It is concluded that the proposed algorithm provides precise motion estimation and can be used in GPS-denied environments.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship and/or publication of this article: This work was supported by the National Natural Science Foundation of China (grant numbers 61503403, 61573371) and National University of Defense Technology Advanced Research Programs (grant numbers JC14-03-04, JC14-03-06).

Appendix: Results of other trials

The algorithm is also tested on four other trials. The estimated positions, attitude errors, and velocity errors are presented and compared in Figure 7. The video clips of all our results are available online.

References

1. Miller, Soloviev, Uijt de Haag. Navigation in GPS denied environments: Feature-aided inertial systems. Technical report, DTIC Document, 2010.
2. Corke, Lobo, Dias. An introduction to inertial and visual sensing. Int J Rob Res 2007; 26(6): 519–535.
3. Lang, Pinz. Calibration of hybrid vision/inertial tracking systems. In: InerVis, Barcelona, Spain, 2005.
4. Lobo, Dias. Relative pose calibration between visual and inertial sensors. Int J Rob Res 2007; 26(6): 561–575.
5. Mirzaei, Roumeliotis. A Kalman filter-based algorithm for IMU-camera calibration: Observability analysis and performance evaluation. IEEE Trans Rob 2008; 24(5): 1143–1156.
6. Feng, Wang. Observability analysis of a matrix Kalman filter-based navigation system using visual/inertial/magnetic sensors. Sensors 2012; 12(7): 8877–8894.
7. Davison, Reid, Molton. MonoSLAM: Real-time single camera SLAM. IEEE Trans Pattern Anal Mach Intell 2007; 29(6): 1052–1067.
8. Davison. Real-time simultaneous localisation and mapping with a single camera. In: Computer vision, 2003. Proceedings. Ninth IEEE international conference on, 2003, pp.1403–1410. IEEE.
9. Kim, Sukkarieh. Airborne simultaneous localisation and map building. In: Robotics and automation, 2003. Proceedings. ICRA’03. IEEE international conference on, 2003, vol.1, pp.406–411.
10. Davison, Murray. Mobile robot localisation using active vision. In: European Conference on Computer Vision, 1998, vol.1, pp. 406–411. Berlin and Heidelberg: Springer.
11. Davison, Murray. Simultaneous localization and map-building using active vision. IEEE Trans Pattern Anal Mach Intell 2002; 24(7): 865–880.
12. Kelly, Saripalli, Sukhatme. Combined visual and inertial navigation for an unmanned aerial vehicle. In: Field and Service Robotics. Springer, pp.255–264.
13. Carrillo LRG, López AED, Lozano. Combining stereo vision and inertial navigation system for a quad-rotor UAV. J Intell Rob Syst 2012; 65(1–4): 373–387.
14. Veth, Raquet. Fusing low-cost image and inertial sensors for passive navigation. Navigation 2007; 54(1): 11–20.
15. Kelly, Sukhatme. Visual-inertial sensor fusion: Localization, mapping and sensor-to-sensor self-calibration. Int J Rob Res 2011; 30(1): 56–79.
16. Xian, Lian. Fusing stereo camera and low-cost inertial measurement unit for autonomous navigation in a tightly-coupled approach. J Navig 2015; 68(03): 434–452.
17. Diel, DeBitetto, Teller. Epipolar constraints for vision-aided inertial navigation. In: Application of computer vision, 2005. WACV/MOTIONS’05 Volume 1. Seventh IEEE workshops on, volume 2, pp.221–228. IEEE.
18. Nilsson, Zachariah, Jansson. Realtime implementation of visual-aided inertial navigation using epipolar constraints. In: Position location and navigation symposium (PLANS), 2012 IEEE/ION, pp.711–718. IEEE.
19. Mourikis, Roumeliotis. A multi-state constraint Kalman filter for vision-aided inertial navigation. In: Robotics and automation, 2007 IEEE international conference on, pp. 3565–3572. IEEE.
20. Indelman, Gurfil, Rivlin. Real-time vision-aided localization and navigation based on three-view geometry. IEEE Trans Aerosp Electron Syst 2012; 48(3): 2239–2259.
21. Chen. A sliding-window visual-IMU odometer based on tri-focal tensor geometry. In: Robotics and automation (ICRA), 2014 IEEE international conference on, pp.3963–3968. IEEE.
22. Van Der Merwe, Wan. The square-root unscented Kalman filter for state and parameter-estimation. In: Acoustics, speech, and signal processing, 2001. Proceedings.(ICASSP’01). 2001 IEEE international conference on, volume 6, pp.3461–3464. IEEE.
23. Van Der Merwe, Wan, Julier. Sigma-point Kalman filters for nonlinear estimation and sensor-fusion: Applications to integrated navigation. In: Proceedings of the AIAA guidance, navigation & control conference, pp.16–19.
24. Kuipers. Quaternions and rotation sequences. Vol. 66. Princeton, NJ: Princeton University Press, 1999.
25. Titterton, Weston JL. Strapdown inertial navigation technology. Vol. 17. London, UK: IET, 2004.
26. Hartley, Zisserman. Multiple view geometry in computer vision. Cambridge, UK: Cambridge University Press, 2003.
27. Wan, Van Der Merwe. The unscented Kalman filter for nonlinear estimation. In: Adaptive systems for signal processing, communications, and control symposium 2000. AS-SPCC. The IEEE 2000, pp.153–158. IEEE.
28. Fraundorfer, Scaramuzza. Visual odometry: Part II: Matching, robustness, optimization, and applications. IEEE Rob Autom Mag 2012; 19(2): 78–90.
29. Rosten, Drummond. Fusing points and lines for high performance tracking. In: Computer vision, 2005. ICCV 2005. Tenth IEEE international conference on, volume 2, pp.1508–1515. IEEE.
30. Rosten, Drummond. Machine learning for high-speed corner detection. In: Computer vision–ECCV 2006, 2006, pp.430–443. Springer.
31. Lowe. Distinctive image features from scale-invariant keypoints. Int J Comput Vision 2004; 60(2): 91–110.
32. Bay, Tuytelaars, Van Gool. SURF: Speeded up robust features. In: European conference on computer vision, pp.404–417. Springer.
33. Desai, Lee. Visual odometry drift reduction using SYBA descriptor and feature transformation. IEEE Trans Intell Transp Syst 2016; 17(7): 1839–1851.
34. Calonder, Lepetit, Strecha. BRIEF: Binary robust independent elementary features. In: European conference on computer vision, pp.778–792. Springer.
35. Calonder, Lepetit, Ozuysal. BRIEF: Computing a local binary descriptor very fast. IEEE Trans Pattern Anal Mach Intell 2012; 34(7): 1281–1298.
36. Leutenegger, Chli, Siegwart. BRISK: Binary robust invariant scalable keypoints. In: 2011 International conference on computer vision, pp.2548–2555. IEEE.
37. Desai, Lee, Ventura. Matching affine features with the SYBA feature descriptor. In: International symposium on visual computing, pp.448–457. Springer.
38. Desai, Lee, Ventura. An efficient feature descriptor based on synthetic basis functions and uniqueness matching strategy. Comput Vision Image Understanding 2016; 142: 37–49.
39. Shi, Tomasi. Good features to track. In: Computer vision and pattern recognition, 1994. Proceedings CVPR’94., 1994 IEEE computer society conference on, pp.593–600. IEEE.
40. Geiger, Lenz, Stiller. Vision meets robotics: The KITTI dataset. Int J Rob Res 2013; 32(11): 1231–1237.
41. Geiger, Ziegler, Stiller. StereoScan: Dense 3D reconstruction in real-time. In: Intelligent vehicles symposium (IV), 2011 IEEE, pp.963–968. IEEE.