Abstract
In this paper, a new method for vision-aided navigation based on trifocal tensor geometry is presented. The main goal of the proposed method is to estimate the position of a vehicle in global positioning system (GPS)-denied environments, using a standard inertial navigation system and only a single camera. The geometric trifocal tensor relationship among three images is used as the measurement information from the camera, and the primary contribution of this work is the derivation of a measurement model that expresses the geometric constraints of the trifocal tensor in the global frame. This measurement model does not require including the three-dimensional feature positions in the state vector; in other words, the proposed method does not entail reconstructing the environment, but considers only the vehicle state. The computational complexity of the proposed vision-aided inertial navigation algorithm scales only with the number of features observed at the current time, and the algorithm is capable of estimating the pose in real environments. Simulations and real-world experiments were conducted to demonstrate the effectiveness of the proposed method.
Introduction
Accurate estimation of the navigational state is essential in many fields, and a global positioning system (GPS) aided by an inertial measurement unit (IMU) is the primary method used today. In many cases, however, GPS might be unavailable or unreliable, for instance when operating indoors, in urban environments, under water, or on other planets. In these settings, vision-based methods are an attractive alternative for navigation, owing to their low cost, low weight, and autonomy.
An important advantage of using a camera is that its images provide high-dimensional measurements with rich information content. By tracking feature points between several images, the motion of the camera can be estimated. However, the high volume of data poses a significant challenge when designing algorithms for such estimations. When real-time localization performance is required, we are faced with a fundamental trade-off between the computational complexity of the algorithm and the resulting accuracy of the estimation.
In this paper, we present an algorithm that is able to utilize the information provided by multiple measurements of visual features. The geometric trifocal tensor relationship among three images is used as the measurement information from the camera. The primary contribution of our work is a measurement model that expresses the geometric constraints of the trifocal tensor in the global frame. This measurement model does not require including the 3D feature positions in the state vector; consequently, the computational complexity scales only with the number of features observed at the current time.
After a brief discussion of related work in the following section, the details of the proposed estimator are described in the “Estimator description” section. In the “Simulation and experimental results” section, the performance of the algorithm is verified by experiments in simulated and real environments. Finally, conclusions are drawn in the last section.
Related work
Vision-aided navigation 1,2 has become an active research field over the past few decades. Many works have addressed visual-inertial motion estimation, and several camera-IMU solutions have been proposed to track the state of a system in real time on computationally constrained platforms and in real environments.
One family of algorithms for fusing inertial measurements with observations of visual features follows the simultaneous localization and mapping (SLAM) paradigm. With this family of methods, the current IMU pose and the 3D positions of all visual landmarks are jointly estimated. 3–6 These approaches typically use scale-invariant feature transform (SIFT) features as the visual features. Sharing the same basic principles as SLAM-based methods for camera-only localization, Dong et al. 7 showed that integral channel image patch (ICIMGP) features 8 perform better than other features, including SIFT. The main limitation of SLAM, however, is its high computational complexity: the correlations between the vehicle state and all landmark estimates must be maintained, which is computationally expensive. Thus, performing vision-based SLAM in environments with many features remains a challenging problem.
In contrast to SLAM, several algorithms have been proposed for estimating the pose of the camera exclusively, with the aim of achieving real-time operation. The most computationally efficient of these methods utilize feature measurements to derive constraints between pairs of images. Considering two images, epipolar constraints can be used as measurements. In Soatto et al., 9 epipolar geometry is employed in conjunction with a statistical motion model, while in Prazenica et al., 10 epipolar constraints are fused with a dynamical model of an airplane. In Indelman et al., 11,12 the constraints between the current and previous image are defined using epipolar geometry and combined with IMU measurements in an extended Kalman filter (EKF). With two-view-based methods for aiding navigation, however, it is only possible to determine camera rotations and up-to-scale translations 13 (translations during each interval are associated with an unknown scale). Therefore, two-view-based methods are incapable of eliminating the navigation errors in all states.
Given three overlapping images, it is possible to determine the camera motion up to a common scale 13 (translations during the intervals are associated with a common, unknown scale). The geometric constraints relating these three views are encapsulated by the trifocal tensor. Thus, we can exploit trifocal tensor constraints among multiple camera poses in order to improve the accuracy of the estimation. In 1997, Hartley 14 applied the trifocal tensor to compute the motion of a camera (along with the scene structure). For uncalibrated image sequences, Torr et al. 15 focused on the problem of structural degeneracy and motion recovery using the trifocal tensor. Guerrero et al. 16 solved for the trifocal tensor directly using singular value decomposition for localization and mapping. However, these trifocal tensor-based methods do not utilize IMU data. Based on an augmented state technique, 17 Hu and Chen 18 added the trifocal tensor to a multiview measurement model in order to estimate the current vehicle state and the states of the features during a particular interval. Nevertheless, this method also estimates the local features, which increases the computational time. Indelman et al. 19 presented a method based on three-view geometry, which differs from both SLAM and augmented state techniques. By linearizing the residual measurements and calculating the relevant Jacobian matrix, computational costs are reduced and accurate position results are obtained by applying the implicit extended Kalman filter (IEKF).
The constraints of the trifocal tensor and of three-view geometry are both derived from three overlapping images, so there should be some connection between them. Indeed, the former is a sufficient condition for the latter (see the “Estimator description” section for details). Motivated by this relation between the trifocal tensor and three-view geometry, we propose a vision-aided navigation method based on trifocal tensor geometry. The proposed method is an effective vision/inertial navigation system (INS) estimation algorithm that expresses the geometric constraints of the trifocal tensor in the global frame. It does not require the inclusion of 3D feature positions in the state vector, and its computational complexity scales only with the number of features observed at the current time. Moreover, the proposed method is capable of pose estimation in real environments. As demonstrated in the “Simulation and experimental results” section, the algorithm is efficient and achieves precise vision-aided inertial navigation in real environments.
Estimator description
The goal of the proposed method is to estimate the pose of a vehicle in the global frame. Three reference frames are used by the IEKF: the global (navigation) frame, the IMU (body) frame, and the camera frame.
Structure of the EKF state vector
The evolving INS state is described by the vector. 15
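A typical choice for this vector, sketched here under the common quaternion-based INS parameterization (the symbols are generic and not necessarily the paper's own notation), is

$$ \mathbf{x}_{\mathrm{INS}} = \begin{bmatrix} {}^{I}_{G}\bar{q}^{\top} & \mathbf{b}_g^{\top} & {}^{G}\mathbf{v}^{\top} & \mathbf{b}_a^{\top} & {}^{G}\mathbf{p}^{\top} \end{bmatrix}^{\top}, $$

where ${}^{I}_{G}\bar{q}$ is the quaternion describing the rotation from the global frame to the IMU frame, ${}^{G}\mathbf{v}$ and ${}^{G}\mathbf{p}$ are the velocity and position of the IMU in the global frame, and $\mathbf{b}_g$ and $\mathbf{b}_a$ are the gyroscope and accelerometer biases.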
For the position, velocity, and biases, the error is defined as the standard additive error, i.e. the difference between the true value and its estimate.
System model
The linearized continuous-time model 15 for the IMU error state is as follows.
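In generic notation (a sketch; the exact structure of the Jacobians depends on the chosen parameterization), such a linearized error-state model takes the form

$$ \dot{\delta\mathbf{x}} = F\,\delta\mathbf{x} + G\,\mathbf{n}, $$

where $\delta\mathbf{x}$ is the IMU error state, $\mathbf{n}$ collects the gyroscope and accelerometer measurement noise and the bias-driving noise, and $F$ and $G$ are the continuous-time system and noise Jacobians.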
Measurement model
The measurement model employed to update the estimate of the filter state is given by the trifocal tensor. The trifocal tensor encapsulates the geometric relationships among three different viewpoints, and it is independent of the scene structure. 13
Figure 1 shows the so-called point–point–point correspondence among the three views. This correspondence can be used to transfer an image point with the trifocal tensor.
Figure 1. Three-view point correspondence among three camera images.
Assuming the camera calibration matrix is known, the camera projection matrices can be defined as follows. 11
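As a sketch in normalized (calibrated) coordinates, with notation that is illustrative rather than the paper's own, the three projection matrices can be written as

$$ P_1 = [\,I \mid \mathbf{0}\,], \qquad P_2 = [\,R_2 \mid \mathbf{t}_2\,], \qquad P_3 = [\,R_3 \mid \mathbf{t}_3\,], $$

where $R_i$ and $\mathbf{t}_i$ denote the rotation and translation of the $i$-th camera with respect to the first.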
According to the projection matrices and a line in 3D space, the trifocal tensor can be derived as follows
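For reference, with the canonical choice $P_1 = [\,I \mid \mathbf{0}\,]$, $P_2 = [\,A \mid \mathbf{a}_4\,]$, and $P_3 = [\,B \mid \mathbf{b}_4\,]$, the textbook construction of the trifocal tensor is

$$ T_i = \mathbf{a}_i\,\mathbf{b}_4^{\top} - \mathbf{a}_4\,\mathbf{b}_i^{\top}, \qquad i = 1,2,3, $$

where $\mathbf{a}_i$ and $\mathbf{b}_i$ are the $i$-th columns of $A$ and $B$, and $T_1$, $T_2$, $T_3$ are the three $3\times 3$ slices of the tensor. The paper's own derivation may use a different but equivalent notation.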
Next, we use the trifocal tensor to transfer the point across the three views.
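For reference, the standard transfer and incidence relations on which a point–point–point correspondence such as equation (9) is built are (in generic notation, not necessarily the paper's)

$$ x''^{\,k} = x^{i}\, l'_{j}\, T_i^{\,jk}, \qquad [\mathbf{x}']_{\times}\Big(\sum_{i} x^{i} T_i\Big)[\mathbf{x}'']_{\times} = 0_{3\times 3}, $$

where $\mathbf{x}$, $\mathbf{x}'$, $\mathbf{x}''$ are the homogeneous images of the same 3D point in the three views, $\mathbf{l}'$ is any line through $\mathbf{x}'$ in the second view, and $[\,\cdot\,]_{\times}$ is the skew-symmetric cross-product matrix.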
The variables used in equation (9) are expressed in the camera frames. In order to estimate the pose of the vehicle in the global frame, we rearrange equation (9) to obtain the point–point–point correspondence in the navigation frame as follows (the proof is provided in Appendix 1).
Define matrix
The model in equation (13) contains nine trilinearities, but only four are linearly independent. Geometrically, these four trilinearities 11 arise from special choices of lines in the second and third images for the point–line–line relation.
Here, we select
First consider a single feature observed in the three images captured at times
In typical scenarios, there is more than one matching pair of features. Indeed, we assume that there are
As we explained in the “Related work” section, the constraints of the trifocal tensor geometry and of the three-view geometry are both derived from three overlapping images. Therefore, there should be some connection between them. Through our analysis, we have obtained the following lemma (the detailed proof is given in Appendix 2):
Lemma 1
The trifocal tensor constraints are sufficient conditions for the three-view geometric constraints.
According to Indelman et al., 12 given multiple matching features, one can determine the translation vectors
Implicit EKF
In this section, we present the IEKF used to analyze the performance of the fusion with a navigation system.
Equation (15) shows that the measurement model
During the time interval
The propagation covariance is similarly calculated by numerical integration of the Lyapunov equation
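With the notation of the error-state model above, this Lyapunov equation has the standard form

$$ \dot{P} = F P + P F^{\top} + G Q G^{\top}, $$

where $Q$ is the power spectral density matrix of the IMU noise $\mathbf{n}$ (a generic statement of the propagation step; the paper's equation may include discretization details not shown here).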
Noting that we are only interested in estimating the navigation errors at the current time instant
Then the Kalman gain matrix 16 can be given as follows.
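For an implicit measurement model of the form $\mathbf{h}(\mathbf{x},\mathbf{z}) = \mathbf{0}$, the IEKF gain typically takes the form (a generic sketch, not necessarily the paper's exact expression)

$$ K = P H^{\top}\big(H P H^{\top} + D R D^{\top}\big)^{-1}, \qquad H = \frac{\partial \mathbf{h}}{\partial \mathbf{x}}, \quad D = \frac{\partial \mathbf{h}}{\partial \mathbf{z}}, $$

where $R$ is the covariance of the raw image measurements.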
Since
Hence
Then the corrected state and covariance are given as
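In generic IEKF notation (an illustrative sketch consistent with the gain above, not necessarily the paper's exact equations), the correction step reads

$$ \delta\hat{\mathbf{x}} = K\big(\mathbf{0} - \mathbf{h}(\hat{\mathbf{x}},\mathbf{z})\big), \qquad P^{+} = (I - K H)\,P\,(I - K H)^{\top} + K\,D R D^{\top} K^{\top}, $$

where the Joseph form is used for the covariance to preserve symmetry and positive semidefiniteness.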
The estimated state
Referring to equation (23), the matrices
The other two cross-correlation terms,
Computation requirements
It is interesting to examine the computational complexity of the operations needed during the IEKF.
Assume that at some step
Note that during the IEKF process we do not estimate the states of the features, so the state dimension remains fixed regardless of the number of detected features.
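To make this scaling argument concrete, the following Python sketch (illustrative only; the function residual_and_jacobian and all variable names are hypothetical, not taken from the paper) stacks the at most four independent trilinearity residuals of each currently tracked feature and performs one implicit-EKF correction. The state dimension stays fixed, and the stacked matrices grow only with the number m of features detected at the current step.

import numpy as np

def iekf_update(x, P, features, residual_and_jacobian, sigma=1.0):
    # residual_and_jacobian(x, f) is assumed to return the (<= 4)-vector of
    # trilinearity residuals of feature f and its Jacobian with respect to
    # the navigation error state (a hypothetical user-supplied linearization).
    rs, Hs = zip(*[residual_and_jacobian(x, f) for f in features])
    r = np.concatenate(rs)                  # ~4*m stacked residuals
    H = np.vstack(Hs)                       # (~4*m) x dim(x) stacked Jacobian
    R = (sigma ** 2) * np.eye(len(r))       # simplistic isotropic noise model
    S = H @ P @ H.T + R                     # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)          # Kalman gain
    dx = K @ (-r)                           # implicit model: ideal residual is zero
    P_new = (np.eye(len(x)) - K @ H) @ P    # covariance correction
    return x + dx, P_new

Apart from the inversion of the stacked innovation covariance, every operation above scales linearly with m, and nothing is ever appended to the state vector. In a real implementation the attitude component would be corrected multiplicatively rather than simply added.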
Simulation and experimental results
In this section, we analyze the performance of the IEKF fused with a navigation system, first in a simulation and then in a real-world experiment.
Simulation results
Table: Initial navigation errors and IMU errors (IMU: inertial measurement unit).
Assume that the simulated trajectory is a loop repeated once, as shown in Figure 2. In order to demonstrate the performance of the algorithm in loop scenarios, two different update modes were evaluated: (1) a sequential update, in which all three images were acquired closely together; and (2) a loop update, in which the first two images were captured when the platform passed a given region for the first time, and the third image was obtained during the second pass through the same region. The total running time was approximately 215 s.
Figure 2. Loop path.
Figure 3 shows the Monte Carlo results (200 runs) using the proposed trifocal tensor method. The curves are the square roots of the filter covariance.
Figure 3. Two hundred Monte Carlo error estimations (covariance) using the trifocal tensor, three-view geometry, and INS: (a) position errors, (b) velocity errors, (c) accelerometer bias, (d) attitude errors, and (e) gyro bias.
Experimental results
In this section, we describe the results of an experiment we conducted to evaluate the proposed method in a real environment. A dataset package collected by Lee et al. 25 at the Swiss Federal Institute of Technology in Zurich was used to validate the proposed method. There are five synchronized datasets in this package; we selected the 1LoopDown dataset, in which the flight trajectory is a loop repeated once using a downward-looking camera. The inertial sensor measurements and camera images were recorded for postprocessing at 200 and 50 Hz, respectively. The true position of the quadrotor was obtained with a VICON system set up in the room for comparison. The features were detected and matched using the SIFT algorithm 26 coupled with the RANSAC method, 27 and an additional comparison between SIFT and ICIMGP features using the trifocal tensor method is presented at the end of this section.
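As an illustration of this front end (a sketch using OpenCV rather than the authors' original implementation; the ratio-test and RANSAC thresholds are assumptions), features can be detected, matched, and filtered as follows:

import cv2
import numpy as np

def match_features(img1, img2, ratio=0.8, ransac_thresh=1.0):
    # Detect SIFT keypoints and descriptors in both (grayscale) images.
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)
    # Match descriptors and keep matches that pass Lowe's ratio test.
    matcher = cv2.BFMatcher()
    pairs = matcher.knnMatch(des1, des2, k=2)
    good = [p[0] for p in pairs if len(p) == 2 and p[0].distance < ratio * p[1].distance]
    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
    # Reject outliers with RANSAC on the fundamental matrix.
    _, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, ransac_thresh, 0.99)
    inliers = mask.ravel().astype(bool)
    return pts1[inliers], pts2[inliers]

The same matching would be applied across the three overlapping images to obtain the point–point–point correspondences used by the measurement model.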
Figure 4(a) 25 shows the quadrotor collecting the dataset package in the indoor flight environment, and Figure 4(b) 25 shows the flight trajectory. The quadrotor flew for approximately 32 s and returned to its original position.
Figure 4. Experiment processing: (a) the quadrotor and flight environment and (b) flight trajectory.
Figure 5 shows the number of matched features detected during the experiment. At some time instants there were no matching features at all.
Figure 5. Number of detected features.
Figure 6 shows the computational complexity of the IEKF during the experiment, using MATLAB 7.1. As shown in the figure, the computational time does not depend on the total number of detected features (as it does with SLAM); rather, it depends on the number of features detected at each step. The longest calculation time was approximately 0.058 s.
Figure 6. Computational complexity.
The estimated positions are shown in Figure 7. The position errors increased considerably with the INS-only method, whereas they were reset to the meter level with both the proposed trifocal tensor method and the three-view geometry method.
Figure 7. Estimated position errors using different methods.
Table: The position errors.
Related work 7,8 showed that the performance of ICIMGP features is better than that of SIFT features. Here, we conducted an additional comparison experiment between them. Because the number of features detected with either method is small over the whole process, we compare only the loop update mode in order to illustrate the performance of the ICIMGP features.
Figure: Loop experimental results using SIFT and ICIMGP: (a) position error and (b) position error zoom. ICIMGP: integral channel image patch; SIFT: scale-invariant feature transform.
Conclusion and future work
This paper presented a new method for vision-aided/INS applications based on the trifocal tensor. The trifocal tensor constraints we derived are sufficient conditions for the three-view geometric constraints. The proposed method utilizes three overlapping images to formulate constraints relating the platform motions at the time instants of the three images. These constraints were fused with an INS using the IEKF. Simulated and real experimental results indicated that the performance of the proposed trifocal tensor constraints exceeded that of the three-view geometric constraints. Moreover, the trifocal tensor method performs better with ICIMGP features than with SIFT features.
This paper applied the IEKF to analyze navigation systems. The unscented Kalman filter (UKF) uses a selected set of sigma points to capture the probability distribution through the measurement model more accurately than the linearization used by the IEKF, which results in faster convergence from inaccurate initial conditions. Therefore, in future research, we shall use the UKF (with ICIMGP features) to analyze the performance of navigation systems in an effort to further improve their accuracy.
Acknowledgment
The authors thank in particular Xinghui Dong from the University of Manchester for his discussions and code on ICIMGP. They are also grateful to the editors and anonymous reviewers for their useful suggestions and detailed comments.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
