Abstract

State estimation is a fundamental necessity for any application involving autonomous robots. This paper describes a visual-aided inertial navigation and mapping system for application to autonomous robots. The system, which relies on Kalman filtering, is designed to fuse the measurements obtained from a monocular camera, an inertial measurement unit (IMU) and a position sensor (GPS). The estimated state consists of the full state of the vehicle: the position, orientation, their first derivatives and the parameter errors of the inertial sensors (i.e., the bias of gyroscopes and accelerometers). The system also provides the spatial locations of the visual features observed by the camera. The proposed scheme was designed by considering the limited resources commonly available in small mobile robots, while it is intended to be applied to cluttered environments in order to perform fully vision-based navigation in periods where the position sensor is not available. Moreover, the estimated map of visual features would be suitable for multiple tasks: i) terrain analysis; ii) three-dimensional (3D) scene reconstruction; iii) localization, detection or perception of obstacles and generating trajectories to navigate around these obstacles; and iv) autonomous exploration. In this work, simulations and experiments with real data are presented in order to validate and demonstrate the performance of the proposal.

Keywords

Visual-based Navigation, Mapping, Autonomous Vehicles, Sensor Fusion, State Estimation

1. Introduction

In recent years, many researchers have addressed the issue of making vehicles more and more autonomous. In this context, the state estimation of the six degrees of freedom (DOFs) of the vehicle (i.e., its attitude and position) is a fundamental necessity for any application involving autonomy.

This problem is seemingly solved for open outdoor domains by an on-board Global Positioning System (GPS) unit and inertial measurement units (IMUs), or by their integrated version, an inertial navigation system (INS). In contrast, unknown, cluttered and GPS-denied environments still pose a considerable challenge. While attitude estimation is well handled by available systems, GPS-based position estimation has some drawbacks: GPS is not a reliable service, as its availability can be limited by urban canyons, and it is completely unavailable in indoor environments. These issues have prompted recent studies to move towards the use of cameras for performing visual-based navigation in periods or circumstances where the position sensor (i.e., GPS) is not available. Cameras are well adapted for embedded systems: they are light, cheap and power-saving.

1.1. Related work

Several strategies related to visual-based navigation can be found in the literature. Some systems that use vision for navigation require previous knowledge of the whole environment. In this case, some kind of guided or automated pretraining is needed in order to learn visual patterns in the environment, which are recognized and used later as visual clues at the navigation stage. Examples of guided pretraining methods are [1] and [2]. In [3, 4], during navigation, the position is determined by matching the online image with previously recorded images. Usually, the application of this kind of method is limited to the environments where the training phase took place.

Another kind of system has been designed to perceive the environment as it is navigated through. Behaviour-based navigation systems can rely on qualitative image information (examples are [5, 6] and [7]). In this kind of system, the use, computation or generation of accurate numerical data, such as distances, position coordinates, velocity, projections from the image plane onto the real-world plane, or time to contact with obstacles, is avoided as much as possible. Due to its critical dependence on unprocessed sensor data, this kind of method is highly sensitive to changes in the imaging conditions.

Approaches such as [8] or [9] rely on optical flow information, as do [10] and [11], although the optical flow is combined with stereo information in their case. A related approach to the optical flow scheme is the tracking of visual features. In this case, however, instead of estimating the visual flow of the whole image (i.e., every pixel), a set of strong visual features, such as corners, lines and object outlines, is detected and tracked over time for as long as possible. Examples of methods based on the former approach are [12] and [13], in which the computation of the camera motion is obtained from a homography. In [14], inertial data obtained from an IMU is combined with visual tracking information.

Some systems can build a map of the environment while exploring it. The maps can have a topological representation, as in [15] and [16], or a metric representation. Occupancy grid-based strategies represent one of the early approaches to representing metric maps; some examples are [17] and [18]. However, occupancy grid-based strategies can be computationally inefficient for path planning and localization, especially in large, complex indoor environments.

Another family of approaches relies on simultaneous localization and mapping (SLAM) strategies. In this case, the system must operate in an a priori unknown environment using only on-board sensors to simultaneously build a map of its surroundings and localize itself [19]. Usually, this kind of system also relies on the tracking of a set of prominent visual features. As these visual features typically have a spatial meaning, the information obtained from the visual measurements can be used to replace range sensors: the lasers are often expensive and heavy, while the operation range of ultrasonics is limited. In this sense, the appearance and spatial information provided by cameras can be useful for multiple tasks, such as: i) terrain analysis; ii) 3D scene reconstruction; iii) localization, detection or perception of obstacles and generating trajectories to navigate around these obstacles; and for iv) autonomous exploration.

Currently, there are two main approaches to implementing vision-based SLAM systems: i) filtering-based methods (see [20–22] and [23]), and ii) optimization-based methods (see [24] and [25]). While the latter approach has been shown to give accurate results when enough computational power is available, filtering-based SLAM methods might still be beneficial if limited processing power is available [26]. While the benefit obtained from constructing a map of the robot's environment is self-evident, it comes together with a considerable increase in computational requirements. This increase is due to the fact that, in visual SLAM, the visual features are included in the state vector. When the number of map features in the system state increases, the computational cost grows rapidly, such that it becomes difficult to maintain real-time performance. This becomes more evident when limited hardware is available. On the other hand, in a context of visual odometry, this problem can be partially mitigated by removing old features from the state in order to stabilize the computational cost per frame [27].

In applications involving mobile robots, it is very common to have a set of sensors available for use. Incorporating complementary or redundant data from multiple sensors can improve the robustness and accuracy of the system. In this sense, sensor fusion that uses, for instance, IMU, GPS and cameras is not novel. Some examples of multi-sensor navigation algorithms are [28–31] and [32].

1.2. Objective

This work presents a novel method for implementing a visual odometry and mapping system. The method is mainly intended to be applied to autonomous mobile robots with limited computational resources. The proposed scheme is related, on the one hand, to visual odometry methods, which make use of the optical flow obtained from prominent visual features in order to compute the camera movement. On the other hand, it is related to visual SLAM methods in the sense that a map of the environment is built and used for refining the estimations. Some contributions and differences from previous work are:

In contrast with visual SLAM approaches, the system state in the proposed scheme is never augmented with the state of map features. Instead, each map feature is managed independently from the state of the vehicle, as well as from other map features. The above technique, which perhaps represents the main contribution of this work, allows a reduction in the computational requirements.

A novel technique based on a two-step architecture is proposed in order to initialize and measure the visual features in a robust and efficient manner. The proposed scheme makes use of the information obtained from monocular measurements and makes it possible to exploit the information provided by visual features whose depth estimation is not well conditioned.

The use of an extended Kalman filter (EKF) in direct configuration allows a straightforward integration with additional sensors, such as inertial sensors (i.e., gyroscopes and accelerometers), and GPS, when it is available.

It is important to note that the proposed system should not be considered a true SLAM system; rather, it should at most be considered an approximation of one. This is because the cross-correlation information of the feature uncertainty, which is contained in the system covariance matrix, is discarded for the sake of reducing the computational cost. Fortunately, the experimental results suggest that the proposed scheme is useful when applied in a context of visual-based navigation.

1.3. Paper outline

The paper is organized as follows. Section 2 presents the system parameterization and assumptions. Section 3 describes the proposed system in detail. In Section 4, experimental results, carried out using simulations and real data, are presented in order to validate and demonstrate the performance of the proposal. Finally, Section 5 presents conclusions and final remarks.

2. System Parametrization and Assumptions

The main goal of the proposed method is to estimate the following system state x:

x = {[q^{N R}, ω^{R}, r^{N}, v^{N}, a^{N}, x_{g}, x_{a}]}^{T}

(1)

where $q^{N R} = [q_{1}, q_{2}, q_{3}, q_{4}]$ represents the orientation of the vehicle with respect to the world (navigation) frame by a unit quaternion. $r^{N} = [x_{v}, y_{v}, z_{v}]$ represents the position of the vehicle (robot) expressed in the navigation frame. $ω^{R} = [ω_{x}, ω_{y}, ω_{z}]$ is the bias-compensated angular velocity of the robot expressed in the vehicle (robot) frame. $v^{N} = [v_{x}, v_{y}, v_{z}]$ denotes the linear velocity of the robot expressed in the navigation frame. $a^{N} = [a_{x}, a_{y}, a_{z}]$ represents the bias-compensated acceleration of the robot expressed in the navigation frame. $x_{g} = [x_{g_{x}}, x_{g_{y}}, x_{g_{z}}]$ is the bias of the gyroscopes. $x_{a} = [x_{a_{x}}, x_{a_{y}}, x_{a_{z}}]$ is the bias of the accelerometers.

The proposed method is intended for local autonomous vehicle navigation. In this case, the local tangent frame is used as the navigation reference frame and, in turn, the initial position of the vehicle defines its origin. The navigation coordinate frame follows the NED (north-east-down) convention. In this work, the magnitudes expressed in the i) navigation frame, ii) vehicle (robot) frame and iii) camera frame are respectively denoted by the superscripts $^{N}$ , $^{R}$ and $^{C}$ (see Figure 1). All the coordinate systems are right-handed. In this work, it is assumed that the coordinate frame of the IMU is aligned with the coordinate frame of the vehicle $^{R}$ .

Figure 1.

System parametrization. The local tangent frame is used as the navigation reference frame ^N. Furthermore, it is assumed that the coordinate frame of the IMU is aligned with the origin of the coordinate frame of the robot ^R.

The position and orientation of the camera with respect to the vehicle coordinate frame $^{R}$ are assumed to be known and fixed: $R^{C R}$ is the camera to robot rotation matrix, while $t_{c}^{R}$ is the position of the optical centre of the camera, expressed in the vehicle frame. The superscript $^{A B}$ denotes a reference frame $^{B}$ expressed with respect to the reference frame $^{A}$ .

For estimating the system state x, the following kinds of sensorial measurement are considered:

2.1. Visual measurements

A standard monocular camera is considered (see Figure 1). In this case, the availability, at each frame k, of a list of $m_{(k)} = [z_{u v (1)}, z_{u v (2)},…, z_{u v (n)}]$ visual measurements of n static landmarks of the environment is assumed. Visual measurements $z_{u v}$ are modelled by:

z_{u v (j)} = y_{u v (j)} + v_{u v}

(2)

where $y_{u v (j)} = [u_{d_{j}}, v_{d_{j}}]$ represents the true position (in the image plane) of the projection of landmark j. $u_{d_{j}}$ and $v_{d_{j}}$ are distorted pixel coordinates. The term $v_{u v}$ represents the uncertainty associated with visual measurements and is modelled by a Gaussian white noise with power spectral density (PSD) $σ_{u v}^{2}$ . In this work, the list of visual measurements $m_{(k)}$ is obtained using the Kanade-Lucas-Tomasi (KLT) tracker [33], but any other small baseline tracker could be used.

A central projection camera model is assumed. In this case, the image plane is located in front of the camera's origin, upon which a non-inverted image is formed (Figure 1). The camera frame $^{C}$ is right-handed with the z-axis pointing to the field of view.

The $ℝ^{3} \Rightarrow ℝ^{2}$ projection of a 3D point, located at $p^{N} = (x, y, z)^{T}$ to the image plane $p = (u, v)$ , is defined by:

u = \frac{x^{'}}{z^{'}}, \quad v = \frac{y^{'}}{z^{'}}

(3)

where u and v are the coordinates of the image point p expressed in pixel units and:

[\begin{matrix} x^{'} \\ y^{'} \\ z^{'} \end{matrix}] = [\begin{matrix} f & 0 & u_{0} \\ 0 & f & v_{0} \\ 0 & 0 & 1 \end{matrix}] p^{C}

(4)

where $p^{C}$ is the same 3D point $p^{N}$ , but expressed in the camera frame $^{C}$ by $p^{C} = R^{N C} (p^{N} - t_{c}^{N})$ , and where $R^{N C} = (R^{R N} R^{C R})^{T}$ is the rotation matrix transforming from the navigation frame $^{N}$ to the camera frame $^{C}$ . $R^{C R}$ is known and $R^{R N}$ is computed from the current robot quaternion $q^{N R}$ . $t_{c}^{N} = r^{N} + R^{R N} t_{c}^{R}$ is the position of the camera's optical centre as expressed in the navigation frame.
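As an illustration, the projection defined by Equations 3 and 4 can be sketched in a few lines of Python. This is only a minimal sketch: the function names `mat_vec` and `project_point` are hypothetical, the rotation $R^{N C}$ is assumed to be supplied as a 3x3 list of rows, lens distortion is not applied, and the check for points behind the camera is an added assumption that is not part of the model above.

```python
def mat_vec(R, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]


def project_point(p_n, t_c_n, R_nc, f, u0, v0):
    """Project a navigation-frame point to undistorted pixel coordinates.

    p_n    : 3D point p^N
    t_c_n  : camera optical centre t_c^N
    R_nc   : rotation R^{NC} (navigation frame to camera frame)
    f      : focal length in pixels; (u0, v0) is the principal point
    """
    # p^C = R^{NC} (p^N - t_c^N)
    p_c = mat_vec(R_nc, [p_n[i] - t_c_n[i] for i in range(3)])
    # Equation 4: [x', y', z']^T = K p^C
    x_p = f * p_c[0] + u0 * p_c[2]
    y_p = f * p_c[1] + v0 * p_c[2]
    z_p = p_c[2]
    if z_p <= 0.0:
        return None  # assumption: points behind the camera are rejected
    # Equation 3: u = x'/z', v = y'/z'
    return (x_p / z_p, y_p / z_p)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
uv = project_point([1.0, 2.0, 4.0], [0.0, 0.0, 0.0], identity, 100.0, 50.0, 40.0)
```

In this usage example the camera sits at the origin with its axes aligned to the navigation frame, so the point at depth 4 projects to u = 100·1/4 + 50 and v = 100·2/4 + 40.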

Inversely, a directional vector $h^{C} = [h_{x}^{C}, h_{y}^{C}, h_{z}^{C}]^{T}$ can be computed from the image point coordinates u and v.

h^{C} (u, v) = {[\frac{u_{0} - u}{f}, \frac{v_{0} - v}{f}, 1]}^{T}

(5)

The vector $h^{C}$ points from the camera's optical centre position to the 3D point location. $h^{C}$ can be expressed in the navigation frame by $h^{N} = R^{C N} h^{C}$ , where $R^{C N} = R^{R N} R^{C R}$ is the camera to navigation rotation matrix. Note that for the $ℝ^{2} \Rightarrow ℝ^{3}$ mapping case, as defined in equation 5, depth information is lost.

The distortion caused by the camera lens is handled using the model described in [34]. Using this model (and its inverse form), undistorted pixel coordinates $(u, v)$ can be obtained from $(u_{d}, v_{d})$ , and vice versa. In this case, it is assumed that the intrinsic parameters of the camera are already known: (i) focal length f, (ii) principal point $u_{0}, v_{0}$ , and (iii) radial lens distortion $k_{1},…, k_{n}$ .

2.2. Inertial measurements

The angular rate $ω^{R}$ of the vehicle, measured by the gyroscopes (in the same frame $^{R}$ ) and indicated as y_g, can be modelled by:

y_{g} = ω^{R} + x_{g} + v_{g}

(6)

where $x_{g}$ is the gyroscope bias and v_g is a Gaussian white noise with a PSD $σ_{g}^{2}$ .

The acceleration of the vehicle $a^{R}$ , measured by the accelerometers (in the same frame $^{R}$ ) as y_a, can be modelled by:

y_{a} = a^{R} + g^{R} + x_{a} + v_{a}

(7)

where $g^{R}$ is the gravity vector expressed in the vehicle frame $^{R}$ , $x_{a}$ is the accelerometer bias and v_a is a Gaussian white noise with PSD $σ_{a}^{2}$ .

2.3. GPS measurements

Position and course measurements (y_r and y_θ, respectively) are obtained from the GPS, whenever possible, and are modelled by:

\begin{array}{l} y_{r} = r^{N} + v_{r} \\ y_{θ} = θ^{N} + v_{θ} \end{array}

(8)

where v_r is a Gaussian white noise with PSD $σ_{r}^{2}$ . $θ^{N}$ is the course angle of the vehicle, measured by the GPS unit with respect to the geographic north, and v_θ is a Gaussian white noise with PSD $σ_{θ}^{2}$ . Commonly, position measurements are obtained from GPS devices in geodetic coordinates (latitude, longitude and height). Therefore, in equation 8, it is assumed that GPS position measurements have been previously transformed into their corresponding local tangent frame coordinates. It is also assumed that the offset between the GPS antenna and the vehicle frame was taken into account in the previous transformation.

3. Method Description

The architecture of the system is defined by the typical loop of prediction-update steps of an EKF in direct configuration, in which the EKF propagates the vehicle state. Whenever possible, the filter is updated, through the filter update equations, with the visual information obtained from the monocular camera as well as with the data obtained from the additional sensors. The estimation of the map location of the features is carried out independently of the main EKF process.

3.1. System initialization

An initial period of time is used for system initialization tasks. During this period, the vehicle is assumed to be non-accelerating. The initial system state $x_{i n i}$ is determined as follows:

{\hat{x}}_{i n i} = {[q_{i n i}^{N R}, 0_{1 \times 3}, 0_{1 \times 3}, 0_{1 \times 3}, 0_{1 \times 3}, x_{g (i n i)}, 0_{1 \times 3}]}^{T}

(9)

where $q_{i n i}^{N R}$ and $x_{g (i n i)}$ are estimated in the same manner as in [35]. In this case, magnetometers are only used for obtaining an initial estimation of the heading (yaw). The average of the initial GPS readings is subsequently used as a constant for transforming from geodetic to local tangent coordinates.

3.2. System prediction

At every step k, when gyroscope and accelerometer measurements are available, the predicted system state $\hat{x}$ is taken a step forward by the following (discrete) non-linear model:

\begin{array}{l} q_{k | k - 1}^{N R} = \left( \cos ‖ w_{k} ‖ I_{4 \times 4} + \frac{\sin ‖ w_{k} ‖}{‖ w_{k} ‖} W_{k} \right) q_{k - 1 | k - 1}^{N R} \\ ω_{k | k - 1}^{R} = y_{g_{k}} - x_{g_{k - 1 | k - 1}} \\ r_{k | k - 1}^{N} = r_{k - 1 | k - 1}^{N} + v_{k - 1 | k - 1}^{N} Δ t + a_{k - 1 | k - 1}^{N} \frac{Δ t^{2}}{2} \\ v_{k | k - 1}^{N} = v_{k - 1 | k - 1}^{N} + a_{k - 1 | k - 1}^{N} Δ t \\ a_{k | k - 1}^{N} = R_{k}^{R N} (y_{a_{k}} - x_{a_{k - 1 | k - 1}}) - g^{N} \\ x_{g_{k | k - 1}} = (I - λ_{x g} Δ t) x_{g_{k - 1 | k - 1}} \\ x_{a_{k | k - 1}} = (I - λ_{x a} Δ t) x_{a_{k - 1 | k - 1}} \end{array}

(10)

where $R_{k}^{R N}$ is computed from the quaternion $q_{k - 1 | k - 1}^{N R}$ . Parameters $λ_{x g}$ and $λ_{x a}$ are correlation time factors, which respectively model how fast the biases of the gyroscopes and accelerometers vary. $g^{N}$ is the gravity vector expressed in the navigation frame. $Δ t$ is the sample time of the system. In the model defined in equation 10, a closed form solution of $\dot{q} = 1 / 2 (W) q$ is used for integrating the angular velocity $ω^{R}$ over the quaternion $q^{N R}$ . In this case, $w_{k} = {[ω_{k | k - 1}^{R} Δ t / 2]}^{T}$ and:

W_{k} = [\begin{matrix} 0 & - w_{1} & - w_{2} & - w_{3} \\ w_{1} & 0 & - w_{3} & w_{2} \\ w_{2} & w_{3} & 0 & - w_{1} \\ w_{3} & - w_{2} & w_{1} & 0 \end{matrix}]

(11)
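The quaternion row of equation 10, together with the matrix of equation 11, can be sketched as follows. This is a minimal illustration rather than the implementation used in this work: the function name `propagate_quaternion` is hypothetical, and the small-angle guard (replacing $\sin ‖ w ‖ / ‖ w ‖$ by 1 when the rotation is negligible) is an added assumption to avoid a division by zero.

```python
import math


def propagate_quaternion(q, omega, dt):
    """Closed-form quaternion step: q_new = (cos|w| I + sin|w|/|w| W) q.

    q     : quaternion [q1, q2, q3, q4]
    omega : bias-compensated angular rate [wx, wy, wz] in rad/s
    dt    : sample time in seconds
    """
    # w_k = omega * dt / 2
    w1, w2, w3 = (0.5 * dt * wi for wi in omega)
    n = math.sqrt(w1 * w1 + w2 * w2 + w3 * w3)
    # Matrix W_k, transcribed from equation 11
    W = [
        [0.0, -w1, -w2, -w3],
        [w1, 0.0, -w3, w2],
        [w2, w3, 0.0, -w1],
        [w3, -w2, w1, 0.0],
    ]
    c = math.cos(n)
    # Assumption: sin(n)/n is taken as 1 for a negligible rotation
    s = math.sin(n) / n if n > 1e-12 else 1.0
    return [c * q[i] + s * sum(W[i][j] * q[j] for j in range(4)) for i in range(4)]


q_new = propagate_quaternion([1.0, 0.0, 0.0, 0.0], [0.0, 0.0, math.pi / 2.0], 1.0)
```

Because $W_{k}$ is skew-symmetric, the update matrix is orthogonal and the propagated quaternion keeps its unit norm up to rounding, so no explicit renormalization appears in this sketch.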

The state covariance matrix P is taken a step forward by:

P_{k | k - 1} = \nabla F_{x} P_{k - 1 | k - 1} \nabla F_{x}^{T} + \nabla F_{u} U \nabla F_{u}^{T}

(12)

The Jacobian $\nabla F_{x}$ is formed by the partial derivatives of the non-linear prediction model (equation 10) with respect to the system state $\hat{x}$ . The Jacobian $\nabla F_{u}$ is formed by the partial derivatives of the non-linear prediction model with respect to the system inputs y_g and y_a.

The measurement noise of the gyroscopes and accelerometers (respectively, v_g and v_a) are incorporated into the system by means of the process noise covariance matrix U through parameters $σ_{g}^{2}$ and $σ_{a}^{2}$ :

U = diag [σ_{g}^{2} I_{3 \times 3}, σ_{a}^{2} I_{3 \times 3}]

(13)

3.3. Attitude and position updates

In order to correct and limit drift in the estimated system state $\hat{x}$ , referenced information, along with assumptions about the system dynamics, are incorporated into the filter, whenever possible, through the EKF update equations:

\begin{array}{l} {\hat{x}}_{k | k} = {\hat{x}}_{k | k - 1} + K_{k} (z_{i} - h_{i}) \\ P_{k | k} = P_{k | k - 1} - K_{k} S_{k} K_{k}^{T} \\ K_{k} = P_{k | k - 1} \nabla H_{i}^{T} S_{k}^{- 1} \\ S_{k} = \nabla H_{i} P_{k | k - 1} \nabla H_{i}^{T} + R_{i} \end{array}

(14)

where z_i is the current measurement and $h_{i} = h (\hat{x})$ is the predicted measurement. K is the Kalman gain. S is the innovation covariance matrix. $\nabla H_{i}$ is the Jacobian formed by the partial derivatives of the measurement prediction model $h (\hat{x})$ with respect to the state $\hat{x}$ . R_i is the measurement noise covariance matrix.
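For a single scalar measurement, the update of equation 14 reduces to a few vector operations, which the following sketch illustrates. It is an assumption-laden simplification: the function name `ekf_scalar_update` is hypothetical, the Jacobian is a single row `H`, and the innovation covariance S is a scalar, so no matrix inversion is needed. Vector-valued measurements with a diagonal noise covariance could be processed one component at a time in the same way.

```python
def ekf_scalar_update(x, P, z, h, H, r):
    """One EKF update for a scalar measurement.

    x : predicted state (list of n floats)
    P : predicted covariance (n x n list of rows)
    z : measurement, h : predicted measurement h(x)
    H : Jacobian row of h with respect to x (list of n floats)
    r : measurement noise variance
    """
    n = len(x)
    # P H^T, an n-vector
    PHt = [sum(P[i][j] * H[j] for j in range(n)) for i in range(n)]
    # S = H P H^T + R (scalar)
    S = sum(H[i] * PHt[i] for i in range(n)) + r
    # K = P H^T S^-1
    K = [PHt[i] / S for i in range(n)]
    # x = x + K (z - h)
    x_new = [x[i] + K[i] * (z - h) for i in range(n)]
    # P = P - K S K^T
    P_new = [[P[i][j] - K[i] * S * K[j] for j in range(n)] for i in range(n)]
    return x_new, P_new


x_upd, P_upd = ekf_scalar_update([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], 2.0, 0.0, [1.0, 0.0], 1.0)
```

In the usage example only the first state is observed, so its estimate moves halfway towards the measurement and its variance is halved, while the second state is left untouched.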

Each filter update is carried out using equation 14, along with z_i, h_i, $\nabla H_{i}$ and R_i, which are determined according to each kind of i measurement. The following filter updates are considered:

3.3.1. Roll and pitch measurement model

If the vehicle is not accelerating (i.e., $a^{R} \approx 0$ ), then equation 7 can be approximated as $y_{a} \approx g^{R} + v_{a}$ ( $x_{a}$ is neglected). In this situation, accelerometer measurements y_a provide noisy observations of the gravity vector (in the robot frame). The gravity vector is used as an external reference for correcting roll and pitch estimations. In order to detect the time (corresponding to k instants) when the vehicle is in a non-accelerating mode, the detector described in [36] is used.

The gravity vector g is predicted to be measured by the accelerometers as h_g:

h_{g} = R^{N R} {[0,0, g_{c}]}^{T}

(15)

where g_c is the gravity constant and $R^{N R}$ is computed using the current quaternion $q^{N R}$ . For updating the filter: $z_{i} = y_{a}$ , $h_{i} = h_{g}$ , $R_{i} = I_{3 \times 3} σ_{a}^{2}$ and $\nabla H_{i} = \frac{\partial h_{g}}{\partial \hat{x}}$ .

3.3.2. Position and heading measurement model

In this work, a loosely coupled approach is used for incorporating data provided by the GPS unit, whenever it is possible. In a loosely coupled approach, the high level outputs provided by the GPS unit are directly incorporated into the system state through their corresponding measurement prediction model.

The model h_r, which is used for predicting GPS measurements about the vehicle position, is simply:

h_{r} = r^{N}

(16)

The course (heading) of the vehicle is predicted by the following measurement model h_γ:

h_{γ} = atan2 (2 (q_{2} q_{3} - q_{1} q_{4}),1 - 2 (q_{3}^{2} + q_{4}^{2}))

(17)

where $q^{N R} = [q_{1}, q_{2}, q_{3}, q_{4}]$ is the current quaternion. atan2 is a two-argument function that computes the arc tangent of $a / b$ , given a and b, within the range $[- π, π]$ . Alternatively, magnetometers could also be used in order to incorporate information into the system about the heading, as explained in [35]. For updating the filter: $z_{i} = {[y_{r}, y_{θ}]}^{T}$ , $h_{i} = {[h_{r}, h_{γ}]}^{T}$ , $R_{i} = diag [I_{3 \times 3} σ_{r}^{2}, σ_{θ}^{2}]$ and $\nabla H_{i} = {[\frac{\partial h_{r}}{\partial \hat{x}}, \frac{\partial h_{γ}}{\partial \hat{x}}]}^{T}$ .
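The heading prediction of equation 17 is a one-line computation, sketched below with the formula transcribed as written. The function names `predicted_heading` and `wrap_angle` are hypothetical. Wrapping the heading innovation $y_{θ} - h_{γ}$ back into $[- π, π]$ before the update is an added assumption, not something stated in the model above; it is a common precaution when two angles lie on opposite sides of the ±π discontinuity.

```python
import math


def predicted_heading(q):
    """Heading predicted from the quaternion [q1, q2, q3, q4] (equation 17)."""
    q1, q2, q3, q4 = q
    return math.atan2(2.0 * (q2 * q3 - q1 * q4), 1.0 - 2.0 * (q3 * q3 + q4 * q4))


def wrap_angle(a):
    """Map an angle in radians to the interval [-pi, pi]."""
    return math.atan2(math.sin(a), math.cos(a))


h_gamma = predicted_heading([1.0, 0.0, 0.0, 0.0])
# Hypothetical GPS course reading of 3.1 rad against a prediction of -3.1 rad
innovation = wrap_angle(3.1 - (-3.1))
```

In the usage example the raw difference of 6.2 rad is wrapped to a small negative angle, which reflects that the two headings are in fact close to each other.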

3.3.3. Dynamical constraints

Prior knowledge about the vehicle dynamics can be incorporated into the system through filter updates. For instance, in the case of a car-like vehicle, nonholonomic constraints are incorporated in the form of zero velocity updates. For this purpose, it can be assumed that the velocity of the vehicle, in its own reference frame, should be measured as $v^{R} = [v_{x}^{R},0 + v_{v},0 + v_{v}]$ , where v_v is a Gaussian white noise with a PSD $σ_{v}^{2}$ . The measurement prediction model $h_{v_{0}}$ is simply:

h_{v_{0}} = {[v_{y}^{R}, v_{z}^{R}]}^{T}

(18)

where ${[v_{x}^{R}, v_{y}^{R}, v_{z}^{R}]}^{T} = R^{N R} v^{N}$ . For updating the filter: $z_{i} = {[0,0]}^{T}$ , $h_{i} = h_{v_{0}}$ , $R_{i} = I_{2 \times 2} σ_{v}^{2}$ and $\nabla H_{i} = \frac{\partial h_{v_{0}}}{\partial \hat{x}}$ .

3.4. Visual aid

3.4.1. Initialization of visual features

At the first instant k, when a new visual feature $z_{u v}$ is detected, the following entry $f_{l}$ is stored:

f_{l} = [{(t_{c_{0}}^{N})}^{T}, θ_{0}, ϕ_{0}, d]

(19)

where $y_{l_{j}} = [t_{c}^{N}, θ, φ]$ models a 3D semi-line, defined on one side by the vertex $t_{c}^{N}$ , corresponding to the current optical centre coordinates of the camera, as expressed in the navigation frame, while pointing to infinity on the other side with azimuth and elevation, θ and φ, respectively. d indicates the depth of the feature, which is initialized with a NaN (not a number) value, and:

\begin{array}{l} θ = atan2 (h_{y}^{N}, h_{x}^{N}) \\ ϕ = acos (\frac{h_{z}^{N}}{\sqrt{{(h_{x}^{N})}^{2} + {(h_{y}^{N})}^{2} + {(h_{z}^{N})}^{2}}}) \end{array}

(20)

where $h^{N} = [h_{x}^{N}, h_{y}^{N}, h_{z}^{N}]^{T}$ is computed as indicated in Section 2.1.
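The initialization of a feature entry (equations 19 and 20) can be sketched as follows. The names `direction_to_angles`, `angles_to_direction` and `new_feature_entry`, as well as the use of a dictionary to hold $f_{l}$ , are hypothetical choices made for illustration. The second function evaluates the unit vector $m (θ, φ) = (\cos θ \sin φ, \sin θ \sin φ, \cos φ)^{T}$ , which inverts equation 20.

```python
import math


def direction_to_angles(h_n):
    """Azimuth and elevation of a direction vector h^N (equation 20)."""
    hx, hy, hz = h_n
    norm = math.sqrt(hx * hx + hy * hy + hz * hz)
    theta = math.atan2(hy, hx)
    phi = math.acos(hz / norm)
    return theta, phi


def angles_to_direction(theta, phi):
    """Unit vector m(theta, phi) pointing along the stored semi-line."""
    return (
        math.cos(theta) * math.sin(phi),
        math.sin(theta) * math.sin(phi),
        math.cos(phi),
    )


def new_feature_entry(t_c0_n, h_n):
    """Build the entry f_l of equation 19, with the depth d set to NaN."""
    theta, phi = direction_to_angles(h_n)
    return {"t_c0": tuple(t_c0_n), "theta": theta, "phi": phi, "d": float("nan")}


entry = new_feature_entry((0.0, 0.0, 0.0), (1.0, 2.0, 2.0))
m = angles_to_direction(entry["theta"], entry["phi"])
```

In the usage example the direction (1, 2, 2) has norm 3, so converting to angles and back recovers the normalized direction (1/3, 2/3, 2/3).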

3.4.2. Applying epipolar constraints

Depth information cannot be obtained in a single measurement when bearing sensors (i.e., monocular camera) are used. To infer the depth of a feature, the sensor must observe it repeatedly as it moves through its environment, estimating the angle from the feature to the sensor centre. The difference between those angle measurements is the parallax, which is the key that allows estimating feature depth. Nevertheless, features with unknown depth or with huge uncertainty in depth can still provide useful information, such as for very far away landmarks, which will never produce a parallax, but can improve camera pose estimates. The following technique, which is based on epipolar constraints, is used for incorporating information from visual features with unknown or ill-conditioned depth. Epipolar constraints have been successfully used in the past for improving estimates in visual-based navigation systems (e.g., [37]). In this work, a version of the constraint is derived by considering the particularities of the proposed system.

If the camera moves from the location at which a feature is initialized, then this feature, which is first modelled as a 3D semi-line $y_{l_{j}}$ , can be projected onto the current image plane. This projection, l_r, is the two-dimensional line, expressed in the image plane, defined by the epipole e_r and the point x_r (see Figure 2).

Figure 2.

Model measurement for visual features with ill-conditioned depth

The epipole e_r is computed by projecting $t_{c_{0}}^{N}$ , which is stored in $f_{l}$ , onto the current image plane by Equations 3 and 4. The point x_r is computed by projecting the 3D point $p^{N}$ , as defined by the data stored in $f_{l}$ , through Equations 3 and 4, while assuming a depth equal to one ( $d = 1$ ). In this case, $p^{N}$ models a 3D point located at:

p^{N} = t_{c}^{N} + d (m (θ_{j}, ϕ_{j}))

(21)

where $m (θ_{j}, φ_{j})$ is a unit directional vector defined by: $m (θ_{j}, φ_{j}) = {(\cos θ \sin φ, \sin θ \sin φ, \cos φ)}^{T}$ . The epipolar constraint implies that new undistorted measurements of the visual feature $z_{u v}$ should lie on the line l_r. Therefore, d_e, which is the distance from the current measurement $z_{u v}$ to the line l_r, is used as the innovation error (measurement-prediction) in order to update the filter. The distance d_e can be computed as follows (see Figure 2, upper right): If the line l_r is specified by two points, $e_{r} = (e_{r_{1}}, e_{r_{2}})$ and $x_{r} = (x_{r_{1}}, x_{r_{2}})$ , then a vector perpendicular to the line is given by $b = {[(x_{r_{2}} - e_{r_{2}}), - (x_{r_{1}} - e_{r_{1}})]}^{T}$ . Let c be a vector from the point $z_{u v} = (z_{u v_{1}}, z_{u v_{2}})$ to the first point on the line: $c = {[(e_{r_{1}} - z_{u v_{1}}), (e_{r_{2}} - z_{u v_{2}})]}^{T}$ . The distance from $z_{u v}$ to the line l_r is then given by projecting c onto the unit vector along b, resulting in:

d_{e} = \frac{| b \cdot c |}{‖ b ‖} = \frac{| (x_{r_{2}} - e_{r_{2}}) (e_{r_{1}} - z_{u v_{1}}) - (x_{r_{1}} - e_{r_{1}}) (e_{r_{2}} - z_{u v_{2}}) |}{\sqrt{{(x_{r_{1}} - e_{r_{1}})}^{2} + {(x_{r_{2}} - e_{r_{2}})}^{2}}}

(22)

Assuming that, for the current frame, n visual measurements are available for features with ill-conditioned depth, the filter is updated as follows: $z_{i} = {[0_{1}, 0_{2},…, 0_{n}]}^{T}$ , $h_{i} = [d_{e_{1}}, d_{e_{2}},…, d_{e_{n}}]^{T}$ , $R_{i} = I_{n \times n} σ_{e}^{2}$ and $\nabla H_{i} = {[\frac{\partial d_{e_{1}}}{\partial \hat{x}}, \frac{\partial d_{e_{2}}}{\partial \hat{x}},…, \frac{\partial d_{e_{n}}}{\partial \hat{x}}]}^{T}$ . For simplicity, the measurement noise is modelled as a Gaussian white noise with PSD $σ_{e}^{2}$ .
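The point-to-line distance d_e used as the epipolar innovation can be sketched as follows, with b and c built exactly as described above and the result normalized by the length of b. The function name `epipolar_distance` is hypothetical, and returning `None` when the epipole and the point x_r coincide (so that the line is undefined) is an added assumption.

```python
import math


def epipolar_distance(e_r, x_r, z_uv):
    """Distance from the measurement z_uv to the image line through e_r and x_r."""
    # b is perpendicular to the line direction (x_r - e_r)
    b = (x_r[1] - e_r[1], -(x_r[0] - e_r[0]))
    # c goes from the measurement to the first point on the line
    c = (e_r[0] - z_uv[0], e_r[1] - z_uv[1])
    norm_b = math.hypot(b[0], b[1])
    if norm_b == 0.0:
        return None  # assumption: degenerate line, no constraint available
    return abs(b[0] * c[0] + b[1] * c[1]) / norm_b


d_e = epipolar_distance((0.0, 0.0), (3.0, 4.0), (4.0, 3.0))
```

In the usage example b = (4, -3) and c = (-4, -3), so the distance is |-16 + 9| / 5 = 1.4 pixels. A measurement lying exactly on the line yields a distance of zero.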

3.4.3. Estimating feature depth

To minimize the computational cost, and unlike typical filter-based SLAM methods, where the state vector is augmented in order to incorporate the map features, the map in the proposed method is maintained independently of the state of the vehicle; hence, the state of the features cannot be updated directly through the filter update equations when new measurements are available. Instead, when a new measurement $z_{u v}$ is available (and throughout the entire life of the feature), the following procedure is applied for estimating the feature depth:

First, a hypothesis of depth d_j is computed by:

d_{j} = \frac{‖ e_{l} ‖ \sin γ}{\sin α}

(23)

where $α_{j} = π - (β + γ)$ is the parallax. $e_{l} = t_{c}^{N} - t_{c_{0}}^{N}$ indicates the displacement of the camera from the position at which the feature was first observed to its current position, and:

β = \cos^{- 1} \left( \frac{h_{1} \cdot e_{l}}{‖ h_{1} ‖ ‖ e_{l} ‖} \right), \quad γ = \cos^{- 1} \left( \frac{h_{2} \cdot (- e_{l})}{‖ h_{2} ‖ ‖ e_{l} ‖} \right)

(24)

where β is the angle defined by h₁ and e_l. h₁ is the normalized directional vector $m (θ_{j}, φ_{j})$ , which is computed by taking θ_j, φ_j from $f_{l}$ .

γ is the angle defined by h₂ and $- e_{l}$ . $h_{2} = h^{N}$ , which is the directional vector pointing from the current camera optical centre to the feature location, is computed as indicated in Section 2.1. The current estimation of depth d is computed by passing the hypothesis d_j through a low-pass filter. This is due to the fact that the depth estimated by triangulation varies considerably, especially for a low parallax. Finally, $f_{l}$ is updated with the new value of d.
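The triangulation of equations 23 and 24 can be sketched as follows. Several points are assumptions made for illustration: the names `depth_hypothesis` and `smooth_depth` are hypothetical; e_l is taken as the displacement from the initial to the current camera position, which is the orientation under which β and γ are the interior angles of the triangle formed by the two camera positions and the feature; and the low-pass filter is written as a first-order exponential smoother with a hypothetical `gain` parameter, since the type of filter is not specified above.

```python
import math


def _angle(a, b):
    """Angle between two 3-vectors, in radians."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return math.acos(max(-1.0, min(1.0, dot / (na * nb))))


def depth_hypothesis(t_c0, t_c, h1, h2):
    """Return (d_j, alpha) from two bearings; d_j is None without parallax."""
    # Assumption: e_l points from the initial to the current optical centre
    e_l = [t_c[i] - t_c0[i] for i in range(3)]
    baseline = math.sqrt(sum(x * x for x in e_l))
    if baseline == 0.0:
        return None, 0.0
    beta = _angle(h1, e_l)
    gamma = _angle(h2, [-x for x in e_l])
    alpha = math.pi - (beta + gamma)  # parallax
    if alpha <= 0.0:
        return None, alpha
    return baseline * math.sin(gamma) / math.sin(alpha), alpha


def smooth_depth(d_prev, d_hyp, gain):
    """First-order low-pass filter; a NaN depth is replaced by the hypothesis."""
    if math.isnan(d_prev):
        return d_hyp
    return d_prev + gain * (d_hyp - d_prev)


# Feature at (0, 0, 2), first seen from the origin, now seen from (1, 0, 0)
d_j, alpha = depth_hypothesis([0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
                              [0.0, 0.0, 1.0], [-1.0, 0.0, 2.0])
d = smooth_depth(float("nan"), d_j, 0.1)
```

In the usage example the recovered depth is the distance from the initial camera position to the feature, which is consistent with equation 21, where the depth is applied along the stored semi-line.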

3.4.4. 3D map management

As stated previously, the parallax produced by sensor movements permits the estimation of the depth of the features. In the case of an indoor sequence, a displacement of centimetres is enough to produce a parallax; on the other hand, the more distant the feature, the more the sensor has to travel to produce a parallax.

In [38], it is shown that only a few degrees of parallax are enough to reduce the uncertainty related to depth estimation. Therefore, when the parallax of a feature j is greater than a threshold ( $α_{j} > α_{m i n}$ ), it is assumed that its depth is sufficiently well-conditioned to consider the feature a component of the 3D map. From this moment, it is assumed that the data stored in f_l model a 3D point $y_{p_{j}} = [t_{c}^{N}, θ, φ, d]$ , with the location determined by equation 21. Therefore, instead of updating the filter with epipolar constraints, the following prediction model $h_{u v} = h (\hat{x})$ is used.

First, the Euclidean representation $p^{N}$ of $y_{p_{j}}$ is computed using equation 21. Then the undistorted pixel coordinates of the feature $(u, v)$ are found using the central projection camera model defined by Equations 3 and 4. Finally, the distorted pixel coordinates $h_{u v} = (u_{d}, v_{d})$ are obtained from $(u, v)$ by applying the corresponding distortion model.

Assuming that, for the current frame, n visual measurements are available for 3D features, the filter is updated as follows: $z_{i} = {[z_{u v_{1}}, z_{u v_{2}},…, z_{u v_{n}}]}^{T}$ , $h_{i} = {[h_{u v_{1}}, h_{u v_{2}},…, h_{u v_{n}}]}^{T}$ , $R_{i} = (I_{2 n \times 2 n}) σ_{u v}^{2}$ and $\nabla H_{i} = {[\frac{\partial h_{u v_{1}}}{\partial \hat{x}}, \frac{\partial h_{u v_{2}}}{\partial \hat{x}},…, \frac{\partial h_{u v_{n}}}{\partial \hat{x}}]}^{T}$ .
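As a rough illustration of how these stacked quantities are assembled, the following Python sketch builds the measurement vector, the prediction vector, the noise covariance and the stacked Jacobian from per-feature data. The function name and the plain-list data layout are assumptions made for this example; the per-feature Jacobian blocks are taken as given rather than derived here.

```python
def stack_visual_update(z_uv, h_uv, jacobians, sigma_uv=1.0):
    """Stack n per-feature terms into the vectors and matrices of the update.

    z_uv, h_uv: lists of n (u, v) pairs (measured and predicted pixels)
    jacobians:  list of n blocks, each with 2 rows of length dim(x)
    sigma_uv:   pixel noise standard deviation
    """
    n = len(z_uv)
    z = [c for uv in z_uv for c in uv]              # 2n measurements
    h = [c for uv in h_uv for c in uv]              # 2n predictions
    # R = sigma_uv^2 * I, a (2n x 2n) diagonal matrix.
    R = [[sigma_uv ** 2 if r == c else 0.0 for c in range(2 * n)]
         for r in range(2 * n)]
    H = [row for block in jacobians for row in block]   # (2n x dim(x))
    innovation = [zi - hi for zi, hi in zip(z, h)]
    return z, h, R, H, innovation
```

Because R is diagonal and every feature contributes an independent pair of rows, adding or dropping a tracked feature only adds or removes two rows of the stacked terms.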

4. Experimental Results

Experiments, using both synthetic data obtained by simulations and real data, were performed in order to validate the performance of the proposed method. A MATLAB implementation was used for this purpose.

4.1. Experiments with simulations

In the experiments, a mobile robot was simulated moving through environments composed of 3D points, which can be used by the system as visual landmarks. The simulated vehicle is equipped with: i) inertial sensors, ii) a position sensor, and iii) a monocular camera. The position sensor is used only at the beginning of the trajectory to recover the metric scale of the environment, after which it is turned off in order to validate the performance of the method under visual-based navigation.

The following parameter values were used in the simulations: i) for emulating inertial sensors, $σ_{a} =$ 0.002 m/s^2, $σ_{g} =$ 0.05/s and initial biases $x_{a}$ , $x_{g}$ , randomly determined with $σ_{x_{a}} =$ $4 \times 10^{-4}$ m/s^2 and $σ_{x_{g}} =$ 0.25/s, respectively (these values were taken from the data sheet of a commercial-grade IMU); ii) for emulating GPS measurements, $σ_{r} =$ 0.5; and iii) for emulating the precision of the visual feature tracking method, $σ_{u v} =$ 1 pixel.
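One simple way to emulate such sensors is sketched below in Python. It assumes a measurement equal to the true value plus a constant bias plus zero-mean Gaussian noise; this additive model, the function names and the seeded generator are assumptions made for the example, not a description of the authors' simulator. The numerical values are those listed above, in the units given in the text.

```python
import random

SIGMA_A = 0.002    # accelerometer noise std (m/s^2)
SIGMA_G = 0.05     # gyroscope noise std (per second, unit as given in the text)
SIGMA_XA = 4e-4    # std used to draw the initial accelerometer bias (m/s^2)
SIGMA_XG = 0.25    # std used to draw the initial gyroscope bias (per second)

def draw_bias(sigma, rng):
    """Draw a random 3-axis initial bias."""
    return [rng.gauss(0.0, sigma) for _ in range(3)]

def emulate_measurement(true_value, bias, sigma, rng):
    """Assumed model: measurement = truth + bias + Gaussian noise."""
    return [t + b + rng.gauss(0.0, sigma) for t, b in zip(true_value, bias)]

rng = random.Random(0)                  # seeded for repeatable runs
accel_bias = draw_bias(SIGMA_XA, rng)
accel_reading = emulate_measurement([0.0, 0.0, 9.81], accel_bias, SIGMA_A, rng)
```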

Figure 3.

A Monte Carlo simulation was carried out in order to validate the improvement obtained when visual information is incorporated into the system through the proposed scheme. The upper and middle plots give an example of the trajectories obtained with and without visual information. The lower plot shows the average mean absolute error (MAE) in the position obtained after 20 runs. The experimental results clearly show that the position drift is considerably lower when the system makes use of the visual information.

4.1.1. Visual aiding vs. no visual aiding

Figure 3 (upper plot) shows the simulated 3D environment that was used to perform this experiment. In this case, the land vehicle was moved along a spiral-like trajectory (see middle plot). The area in which the vehicle moves is surrounded by a wall of landmarks that can be used by the system as visual features. As expected, the drift in the estimates is considerably lower when visual information is incorporated into the system. Figure 3 (lower plot) shows the average MAE in the position obtained after 20 Monte Carlo runs of the simulation.
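The averaged error curve can be computed along the following lines. The sketch assumes that the MAE at each time step is the absolute position error averaged over the three axes, and that the curves of the individual runs are then averaged step by step; the exact definition used for the plots is not spelled out in the text, so both choices are assumptions.

```python
def mae_per_step(estimated, truth):
    """Mean absolute position error at each time step.

    estimated, truth: lists of (x, y, z) positions of equal length.
    """
    return [
        sum(abs(e - t) for e, t in zip(p_est, p_true)) / len(p_true)
        for p_est, p_true in zip(estimated, truth)
    ]

def average_over_runs(curves):
    """Average several MAE curves (one per Monte Carlo run) step by step."""
    n_runs = len(curves)
    return [sum(step) / n_runs for step in zip(*curves)]
```

In the experiment described above, `curves` would hold 20 lists, one per Monte Carlo run.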

4.1.2. Proposed method vs. standard EKF-SLAM

Figure 4 shows two of the simulated environments used in this experiment. In the first series of experiments (a), the robot was moved along a spiral-like trajectory. The area in which the vehicle moves is surrounded by a wall of landmarks. In these experiments, the camera was fixed, pointing along the axis of movement of the robot. In the second series of experiments (b), the robot was moved along a wave-like trajectory. The environment in which the vehicle moves is composed of clouds of landmarks located in random positions. In these experiments, the camera was fixed, pointing perpendicular to the axis of movement of the robot.

Figure 4.

In the simulations, a land vehicle was moved through different environments composed of 3D points emulating potential visual landmarks. The position sensor was only used at the beginning of the trajectory. The rest of the trajectory was performed under visual navigation conditions.

Figure 5.

MAE: Proposed method vs. system state augmentation. Experiment (a) - left plot. Experiment (b) - right plot.

In order to gain more insight into the performance of the proposed scheme, a variant of the proposal was also implemented using system state augmentation (the standard EKF-SLAM methodology). Figure 5 shows the average MAE in the position obtained after 20 Monte Carlo runs of the simulation, using the proposed method and system state augmentation. It can be observed that the progression of the MAE over time is very similar for both methods in scenarios (a) and (b). On the other hand, Table 1 clearly shows the benefits obtained with the proposed method in terms of computational cost. The performance of the variant that uses system state augmentation quickly degrades as the number of visual features being tracked increases. In experiments (a), where the average number of visual features being tracked per frame is low, the execution time of the variant is slightly higher than that of the proposed method. However, in experiments (b), where the average number of features being tracked is higher, the execution time of the variant is almost three times that of the proposed method.

Table 1.

Execution time obtained using the proposed method and system augmentation. The last column indicates the average number of visual features being tracked per frame.

Experiment	Proposed method	System augmentation	Visual features per frame
(a)	19.0 s	25.0 s	23.9 ±3.4σ
(b)	44.2 s	128.1 s	63.1 ±38.5σ
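The execution times in Table 1 can be turned into ratios with a few lines of Python, which makes the comparison between the two methods explicit. The dictionary layout and the function name are illustrative only; the numbers are copied from the table.

```python
# Execution times (s) and mean tracked features per frame, from Table 1.
table1 = {
    "a": {"proposed": 19.0, "augmentation": 25.0, "features": 23.9},
    "b": {"proposed": 44.2, "augmentation": 128.1, "features": 63.1},
}

def slowdown(row):
    """How many times slower system augmentation is than the proposed method."""
    return row["augmentation"] / row["proposed"]

for name, row in table1.items():
    print(f"({name}) augmentation / proposed = {slowdown(row):.2f}")
# (a) augmentation / proposed = 1.32
# (b) augmentation / proposed = 2.90
```

The ratio grows from about 1.3 in experiment (a) to about 2.9 in experiment (b), where roughly 2.6 times as many features are tracked per frame. Since the two experiments also differ in trajectory and environment, the feature count is not the only factor at play.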

4.2. Experiments with real data

In order to test the performance of the proposed method with real data, the MATLAB implementation was executed offline using the Malaga data set [39] as input signals. This data set is freely available online and contains several sources of sensor signals along with accurate ground truth. In the experiments, the following signals were used: i) inertial (accelerometer and gyroscope) measurements, available at 100; ii) GPS measurements, available at 1; iii) visual information, obtained from a frontal camera at seven frames per second (fps) with a resolution of 512×384 pixels. The sensors were mounted on an electric buggy-type vehicle (see [39] for a complete description). The experiments presented here correspond to the vehicle moving across a car park.

Figure 6.

Aerial view of the estimated map and trajectory, using real data from sensors mounted in an electric buggy-type vehicle, which is moving in a car park. The GPS signal (blue) was only used at the beginning of the trajectory for recovering the metric scale. The visual-based estimated trajectory is indicated in red.

In the same manner as the simulations, the proposed method was compared against the standard EKF-SLAM methodology (system augmentation), except that real data were used in this case. Figure 6 shows the map and trajectory obtained by the proposed method for a route with a duration of 110. Note that GPS was only used at the beginning of the trajectory for recovering the metric scale of the world. Table 2 shows the experimental results obtained with both the proposed method and the standard EKF-SLAM methodology.

Table 2.

Experimental results with real data

Method	Execution time	Visual features per frame	Average Euclidean error
Proposed method	30.89 s	49.9 ±17.6σ	1.51 m ±1.25σ
System augmentation	88.37 s	49.9 ±17.6σ	2.32 m ±2.64σ

The experiment was executed on a desktop PC with an i7-3.4 processor. For the presented case, the non-optimized MATLAB implementation of the proposed method ran almost three times faster than the standard EKF-SLAM methodology. The execution times above do not include the time spent in the KLT feature tracker. For this experiment, the proposal also showed a lower average Euclidean error.

Figure 7 shows the map and trajectory obtained for the entire data set. This route has a length of 524 and an approximate duration of 320. In the original data set, GPS is available over almost the whole trajectory. Nevertheless, GPS outages were artificially introduced in order to test the performance of the method under visual-based navigation conditions. In this case, the system was able to determine the actual trajectory of the vehicle fairly well, even in periods where the GPS signal was not available. Figure 8 shows two different instants corresponding to periods when the system is working without a GPS signal (i.e., in visual-based navigation mode). The spatial significance of the map can be appreciated. It is important to note that the video sequence used in this experiment contains a few dynamic elements, such as pedestrians and a moving car. There is also considerable variation in lighting conditions, which is typical of outdoor environments.

Figure 7.

Map and trajectory estimated for the entire data set. Segments, in which the GPS is available, are indicated in blue; however, most of the trajectory was determined in visual-based navigation mode.

Figure 8.

Estimated trajectory and map, corresponding to two different instants of time during periods of visual-based navigation. Visual features $y_{l_{j}}$ , whose depth is ill-conditioned, are indicated by yellow lines. Visual features $y_{p_{j}}$ , whose depth is considered as known, are indicated by small spheres. The corresponding video frames are shown to the left, while the visual features being tracked are indicated by small red circles. When comparing visual features, which correspond to elements in the environment (like trees or cars), with the estimated map, it can be appreciated that the physical structure of the environment is partially recovered.

5. Conclusion

In this work, a practical method for implementing a visual-aided inertial navigation and mapping system for autonomous robots has been presented. The testing of the method has been performed over real data collected from a land robot. However, it should be straightforward to extend the application of the method to other kinds of autonomous robot (e.g., flying robots).

A novel scheme has been introduced for taking advantage of the visual information provided by a monocular camera. The proposed technique handles the measurements corresponding to visual features in an efficient and robust manner. Experimental results, obtained through simulations and also with real data, show that the proposed method is able to estimate the trajectory of the vehicle fairly well, even in periods where the GPS signal is not available. In this case, the drift in the estimates is considerably reduced, especially when compared with the results obtained using inertial sensors only. It is important to highlight that a video sequence with a low resolution and a low frame rate was used in the experiments. In addition, the proposed method can operate with a moderate number of visual features being tracked.

Unlike visual odometry approaches, the proposed method makes it possible to estimate the spatial location of visual features whenever possible. Both the appearance and the spatial information provided by the system could potentially be used for multiple tasks in future versions of the system.

Filter-based methods are convenient when limited processing power is a concern. However, the computational cost of Kalman filters scales poorly with the size of the state vector. For this reason, the proposed method treats each map feature independently instead of augmenting the system state vector, as is common in monocular SLAM approaches. These characteristics should make the method suitable for application to robots with limited resources.
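A back-of-the-envelope count illustrates why state augmentation scales poorly. The sketch below only counts the entries of the filter covariance matrix; the base state dimension of 15 and the three parameters per feature are hypothetical values chosen for the example, not figures taken from the method.

```python
def covariance_entries(base_dim, n_features, dims_per_feature=3):
    """Number of entries in the covariance of an augmented state vector."""
    n = base_dim + dims_per_feature * n_features
    return n * n

BASE_DIM = 15  # hypothetical vehicle state size, for illustration only

for n_features in (0, 24, 63):
    entries = covariance_entries(BASE_DIM, n_features)
    print(n_features, "features ->", entries, "covariance entries")
# 0 features -> 225 covariance entries
# 24 features -> 7569 covariance entries
# 63 features -> 41616 covariance entries
```

Under these assumptions, a filter that keeps the features out of the state vector always works with the 225-entry covariance, whereas the augmented covariance grows quadratically with the number of tracked features.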

References

1. Sadeghi-Tehran, Behera, Angelov, Andreu. Autonomous visual self-localization in completely unknown environment. In Evolving and Adaptive Intelligent Systems (EAIS), 2012 IEEE Conference on, pages 90–95, May 2012.

2. Ricardo C. B. Rodrigues, Sergio Pellegrino, Hemerson Pistori. Artificial Intelligence and Soft Computing: 11th International Conference, ICAISC 2012, Zakopane, Poland, April 29-May 3, 2012, Proceedings, Part I, chapter Combining Color and Haar Wavelet Responses for Aerial Image Classification, pages 583–591. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012.

3. Sinisa Segvic, Anthony Remazeilles, Albert Diosi, François Chaumette. A mapping and localization framework for scalable appearance-based navigation. Computer Vision and Image Understanding, 113(2):172–187, 2009.

4. Chen, S. T. Birchfield. Qualitative vision-based path following. IEEE Transactions on Robotics, 25(3):749–754, June 2009.

5. Pablo De Cristóforis, Matías A. Nitsche, Tomáš Krajník, Marta Mejail. Real-time monocular image-based path detection. Journal of Real-Time Image Processing, 11(2):335–348, 2013.

6. Moghadam, J. A. Starzyk, W. S. Wijesoma. Fast vanishing-point detection in unstructured environments. IEEE Transactions on Image Processing, 21(1):425–430, Jan 2012.

7. C. K. Chang, Siagian, Itti. Mobile robot monocular vision navigation based on road region and boundary estimation. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 1043–1050, Oct 2012.

8. Volker Grabe, Heinrich H. Bülthoff, Davide Scaramuzza, Paolo Robuffo Giordano. Nonlinear ego-motion estimation from optical flow for online control of a quadrotor UAV. The International Journal of Robotics Research, 2015.

9. G. C. Barcelo, Panahandeh, Jansson. Image-based floor segmentation in visual inertial navigation. In 2013 IEEE International Instrumentation and Measurement Technology Conference (I2MTC), pages 1402–1407, May 2013.

10. Parra, M. A. Sotelo, D. F. Llorca, Ocaña. Robust visual odometry for vehicle localization in urban environments. Robotica, 28:441–452, May 2010.

11. Kurt Konolige, Motilal Agrawal, Joan Sola. Large-scale visual odometry for rough terrain. In Makoto Kaneko and Yoshihiko Nakamura, editors, Robotics Research, volume 66 of Springer Tracts in Advanced Robotics, pages 201–212. Springer Berlin Heidelberg, 2011.

12. Zhao, Davoine, Cui, Zha. Monocular visual localization using road structural features. In Intelligent Vehicles Symposium Proceedings, 2014 IEEE, pages 693–699, June 2014.

13. Courbon, Mezouar, Guenard, Martinet. Visual navigation of a quadrotor aerial vehicle. In Intelligent Robots and Systems, 2009. IROS 2009. IEEE/RSJ International Conference on, pages 5315–5320, Oct 2009.

14. Mingyang Li, A. I. Mourikis. Improving the accuracy of EKF-based visual-inertial odometry. In Robotics and Automation (ICRA), 2012 IEEE International Conference on, pages 828–835, May 2012.

15. Francisco Amoros, Luis Paya, Oscar Reinoso, Walterio Mayol-Cuevas, Andrew Calway. Global appearance applied to visual map building and path estimation using multiscale analysis. Mathematical Problems in Engineering, 2014.

16. Roy, Newman, Srinivasa. Visual Route Recognition with a Handful of Bits, pages 504–. MIT Press, 2013.

17. Kita, Kanehiro, Morisawa, Kaneko. Obstacle detection for a bipedal walking robot by a fisheye stereo. In System Integration (SII), 2013 IEEE/SICE International Symposium on, pages 119–125, Dec 2013.

18. You Li, Yassine Ruichek. Occupancy grid mapping in urban environments from a moving on-board stereo-vision system. Sensors, 14(6):10454, 2014.

19. Durrant-Whyte, Bailey. Simultaneous localization and mapping: Part I. Robotics Automation Magazine, IEEE, 13(2):99–110, June 2006.

20. Zhou, Zou, Pei, Ying, Liu. StructSLAM: Visual SLAM with building structure lines. IEEE Transactions on Vehicular Technology, 64(4):1364–1375, April 2015.

21. Zhang, Nian, Shen, Yan. Stereo visual SLAM system in underwater environment. In OCEANS 2014 - TAIPEI, pages 1–5, April 2014.

22. Adam Schmidt. Computer Vision and Graphics: International Conference, ICCVG 2014, Warsaw, Poland, September 15–17, 2014. Proceedings, chapter The EKF-Based Visual SLAM System with Relative Map Orientation Measurements, pages 570–577. Springer International Publishing, Cham, 2014.

23. Rodrigo Munguía, Sarquis Urzua, Yolanda Bolea, Antoni Grau. Vision-based SLAM system for unmanned aerial vehicles. Sensors, 16(3):372, 2016.

24. Meilland, A. I. Comport. On unifying keyframe and voxel-based dense visual SLAM at large scales. In Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on, pages 3677–3683, Nov 2013.

25. Song. Visual navigation using heterogeneous landmarks and unsupervised geometric constraints. IEEE Transactions on Robotics, 31(3):736–749, June 2015.

26. Strasdat, J. M. M. Montiel, A. J. Davison. Realtime monocular SLAM: Why filter? In Robotics and Automation (ICRA), 2010 IEEE International Conference on, pages 2657–2664, May 2010.

27. Rodrigo Munguia, Bernardino Castillo-Toledo, Antoni Grau. A robust approach for a filter-based monocular simultaneous localization and mapping (SLAM) system. Sensors, 13(7):8501–8522, 2013.

28. Oskiper, Samarasekera, Kumar. Multisensor navigation algorithm using monocular camera, IMU and GPS for large scale augmented reality. In Mixed and Augmented Reality (ISMAR), 2012 IEEE International Symposium on, pages 71–80, Nov 2012.

29. Wei Zheng, Fan Zhou, Zengfu Wang. Robust and accurate monocular visual navigation combining IMU for a quadrotor. Automatica Sinica, IEEE/CAA Journal of, 2(1):33–44, January 2015.

30. Lynen, M. W. Achtelik, Weiss, Chli, Siegwart. A robust and modular multi-sensor fusion approach applied to MAV navigation. In Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on, pages 3923–3929, Nov 2013.

31. Shen, Mulgaonkar, Michael, Kumar. Multi-sensor fusion for robust autonomous flight in indoor and outdoor environments with a rotorcraft MAV. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, pages 4974–4981, May 2014.

32. Elena López, Rafael Barea, Alejandro Gómez, Álvaro Saltos, Luis M. Bergasa, Eduardo J. Molinos, Abdelkrim Nemra. Robot 2015: Second Iberian Robotics Conference: Advances in Robotics, Volume 1, chapter Indoor SLAM for Micro Aerial Vehicles Using Visual and Laser Sensor Fusion, pages 531–542. Springer International Publishing, Cham, 2016.

33. Shi, Tomasi. Good features to track. In Computer Vision and Pattern Recognition, 1994. Proceedings CVPR ′94., 1994 IEEE Computer Society Conference on, 1994.

34. J. Y. Bouguet. Camera calibration toolbox for MATLAB. Online, 2008.

35. Rodrigo Munguia, Antoni Grau. A practical method for implementing an attitude and heading reference system. International Journal of Advanced Robotic Systems, 11, 2014.

36. Skog, Handel, J. O. Nilsson, Rantakokko. Zero-velocity detection, an algorithm evaluation. Biomedical Engineering, IEEE Transactions on, 57(11):2657–2666, Nov 2010.

37. D. D. Diel, DeBitetto, Teller. Epipolar constraints for vision-aided inertial navigation. In Application of Computer Vision, 2005. WACV/MOTIONS ′05 Volume 1. Seventh IEEE Workshops on, volume 2, pages 221–228, Jan 2005.

38. Munguia, Grau. Concurrent initialization for bearing-only SLAM. Sensors, 10(3):1511–1534, 2010.

39. Jose-Luis Blanco, Francisco-Angel Moreno, Javier Gonzalez. A collection of outdoor robotic datasets with centimeter-accuracy ground truth. Autonomous Robots, 27(4):327–351, November 2009.
