Abstract
This paper presents a novel method for mobile robot simultaneous localization and mapping (SLAM), implemented with a Rao-Blackwellised particle filter (RBPF) for a monocular vision-based autonomous robot in an unknown indoor environment. The particle filter is combined with an unscented Kalman filter (UKF) to extend the path posterior by sampling new poses that integrate the current observation. Landmark position estimation and update are implemented through the unscented transform (UT). Furthermore, the number of resampling steps is determined adaptively, which significantly alleviates the particle depletion problem. A monocular CCD camera mounted on the robot tracks 3D natural point landmarks, which are reconstructed from matched image feature pairs extracted by the Scale Invariant Feature Transform (SIFT). The high-dimensional SIFT features, made highly distinctive by their descriptors, are matched with a KD-tree at a time cost of O(log₂N). Experiments on a Pioneer 3 robot in a real indoor environment show that our method achieves high precision and stability.
Introduction
A key prerequisite for a truly autonomous robot is that it can simultaneously localize itself and accurately map its surroundings (Kortenkamp et al., 1998), a problem known as Simultaneous Localization and Mapping (SLAM). Particle filters provide an attractive approach to recursively updating posterior distributions (Doucet, 1998). Early successes of particle filters can be found in the area of robot localization (Dellaert et al., 1999). Recently, particle filters have been at the core of solutions to higher-dimensional robot problems such as SLAM, which, when phrased as a state estimation problem, involves a variable number of dimensions. Murphy adopted Rao-Blackwellised particle filters (RBPF) (Murphy, 2001) as an effective way of representing alternative hypotheses on robot paths and associated maps. Montemerlo et al. (Montemerlo & Thrun, 2003) extended this method to efficient landmark-based SLAM using Gaussian representations of the landmarks and were the first to successfully implement it on real robots.
The difficulty of SLAM depends on the robot's environment, its sensors, and the map representation. The environment can be relatively benign, such as indoors with flat floors, but it can also be quite hostile, as with aircraft and submarines. The most common sensors in use are sonar sensors, laser range finders, and video cameras.
Sonar readings are susceptible to high degrees of uncertainty, especially due to angular and radial errors. Laser range finders are accurate but heavy and expensive. Sonar and lasers are primarily used for 2D mapping. Cameras, on the other hand, are light and cheap and can provide abundant environmental information, but they are difficult to work with. Popular choices for the map representation include grid-based (Schultz & Adams, 1998), topological (Choset & Nagatani, 2001), and feature-based models (Chong & Kleeman, 1999). Grid-based models are easy to build and maintain but imply high data requirements and induce high computational costs. Topological maps have the advantage of being compact and more tolerant of errors in the robot location. Feature-based representations are more difficult to build but significantly less complex.
We primarily focus on investigating real-time, monocular vision-based SLAM for indoor environments and on constructing a 3D feature map from video data. Scale-invariant features are extracted through the Scale Invariant Feature Transform (SIFT) (Lowe, 2004) and used to build 3D landmarks, because they are invariant to image scale, rotation, and translation, as well as partially invariant to illumination changes. We present a fast and efficient algorithm for matching features in a KD-tree at a time cost of O(log₂N).
Background
Consider the case of a mobile robot moving through an unknown environment consisting of a set of landmarks $\theta$. The robot moves according to a known probabilistic motion model $p(s_t \mid s_{t-1}, u_t)$, where $s_t$ denotes the robot pose at time $t$ and $u_t$ the control input, and it senses the landmarks through an observation model $p(z_t \mid s_t, \theta, n_t)$, where $z_t$ is the current measurement and $n_t$ the index of the observed landmark. SLAM is the problem of estimating the joint posterior $p(s^t, \theta \mid z^t, u^t, n^t)$ over the path $s^t = s_1, \dots, s_t$ and the map $\theta$.
This can be done efficiently, since the factorization

$$p(s^t, \theta \mid z^t, u^t, n^t) = p(s^t \mid z^t, u^t, n^t) \prod_{k} p(\theta_k \mid s^t, z^t, u^t, n^t)$$

decouples the SLAM problem into a path estimation problem and individual conditional landmark location problems, and the quantity $p(s^t \mid z^t, u^t, n^t)$ can be tracked with a particle filter while each landmark posterior $p(\theta_k \mid s^t, z^t, u^t, n^t)$ is maintained by a separate low-dimensional Gaussian filter conditioned on the particle's path.
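Concretely, each particle carries its own conditional map. Below is a minimal sketch of this per-particle state; the class and field names are illustrative, not from the paper:

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class LandmarkEstimate:
    """Gaussian belief over one landmark, conditioned on the particle's path."""
    mean: np.ndarray   # estimated 3D position
    cov: np.ndarray    # 3x3 covariance

@dataclass
class Particle:
    """One sampled path hypothesis together with its own conditional map."""
    pose: np.ndarray                                # current pose (x, y, theta)
    weight: float = 1.0
    landmarks: dict = field(default_factory=dict)   # landmark index -> LandmarkEstimate
```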
A successful instance of RBPF SLAM is FastSLAM, which offers many improvements over the traditional EKF-based SLAM framework: it has excellent time complexity, it does not need to linearize the robot's motion model, and, notably, it can maintain several data association hypotheses. However, FastSLAM also has drawbacks. Each particle has a different view of the map, and integrating these views to obtain a single map is nontrivial; more importantly, data association must be performed for each particle independently, which introduces a significant computational burden. FastSLAM is also prone to diverge in regions where its measurements are not very informative, either due to high noise or to the sparseness of landmarks.
RBPF calculates the posterior over robot paths $p(s^t \mid z^t, u^t, n^t)$ with a set of $M$ weighted particles. At each time step a new pose is drawn for every particle from a proposal distribution $\pi$, and the particle is weighted by the importance ratio between the target distribution and the proposal.
Moving the samples in the prior to regions of high likelihood is important if the likelihood lies in one of the tails of the prior.
Here we need to calculate the posterior over robot paths conditioned on the latest observation, $p\!\left(s_t \mid s^{t-1,[m]}, z^t, u^t, n^t\right)$, which is the optimal proposal distribution because it already incorporates $z_t$. In our method, this proposal is approximated for each particle by a Gaussian whose mean and covariance are computed with the UKF, as described below.
The EKF approximates the distribution through a first-order Taylor-series expansion of the nonlinear function, which can introduce large linearization errors when the function is strongly nonlinear. The unscented transform (UT) avoids this by propagating a small, deterministically chosen set of sigma points. For an $n$-dimensional Gaussian with mean $\bar{x}$ and covariance $P_x$, deterministically generate $2n+1$ sigma points

$$\chi_0 = \bar{x}, \qquad \chi_i = \bar{x} + \left(\sqrt{(n+\lambda)P_x}\right)_i, \qquad \chi_{i+n} = \bar{x} - \left(\sqrt{(n+\lambda)P_x}\right)_i, \qquad i = 1, \dots, n, \tag{4}$$

with weights $W_0 = \lambda/(n+\lambda)$ and $W_i = 1/\big(2(n+\lambda)\big)$ otherwise. Propagate the sigma points through the nonlinear transformation: $y_i = g(\chi_i)$. Compute the mean and covariance as follows:

$$\bar{y} = \sum_{i=0}^{2n} W_i\, y_i, \qquad P_y = \sum_{i=0}^{2n} W_i \left(y_i - \bar{y}\right)\left(y_i - \bar{y}\right)^T.$$
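A NumPy sketch of this transform; for brevity it uses the simple weight set above (identical for mean and covariance), with `lam` standing for $\lambda$:

```python
import numpy as np

def unscented_transform(mean, cov, g, lam=2.0):
    """Propagate a Gaussian through a nonlinear function g via the
    2n+1 sigma points of Eq. (4)."""
    n = mean.shape[0]
    S = np.linalg.cholesky((n + lam) * cov)            # matrix square root
    sigma = np.vstack([mean, mean + S.T, mean - S.T])  # rows: chi_0 .. chi_2n
    w = np.full(2 * n + 1, 1.0 / (2 * (n + lam)))
    w[0] = lam / (n + lam)
    y = np.array([g(x) for x in sigma])                # y_i = g(chi_i)
    y_mean = w @ y
    d = y - y_mean
    return y_mean, (w[:, None] * d).T @ d              # mean and covariance
```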
Now we follow the UKF algorithm to extend the path posterior. First, calculate the sigma points of the previous pose estimate $\left(s_{t-1}^{[m]}, \Sigma_{t-1}^{[m]}\right)$ according to Eq. (4). Then use the motion model to predict: propagate the sigma points through $f(\cdot, u_t)$ and recombine them into the predicted pose mean and covariance. Finally, incorporate the new observation $z_t$ through the standard UKF measurement update, and sample the new pose $s_t^{[m]}$ from the resulting Gaussian proposal $\mathcal{N}\!\left(\mu_t^{[m]}, \Sigma_t^{[m]}\right)$.
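A compact sketch of this per-particle proposal, assuming generic motion and observation models `f(x, u)` and `h(x)` with noise covariances `Q` and `R` (all names illustrative, and heading angles are left unnormalized for brevity):

```python
import numpy as np

def ukf_proposal(pose, pose_cov, u, z, f, h, Q, R, lam=2.0):
    """One UKF predict/update cycle yielding the Gaussian proposal
    N(mu, Sigma) from which the particle's new pose is sampled."""
    n = pose.shape[0]
    w = np.full(2 * n + 1, 1.0 / (2 * (n + lam)))
    w[0] = lam / (n + lam)

    def sigma_points(m, P):
        # Rows are the 2n+1 sigma points of Eq. (4).
        S = np.linalg.cholesky((n + lam) * P)
        return np.vstack([m, m + S.T, m - S.T])

    # Predict: propagate sigma points through the motion model.
    X = np.array([f(x, u) for x in sigma_points(pose, pose_cov)])
    m_pred = w @ X
    P_pred = (w[:, None] * (X - m_pred)).T @ (X - m_pred) + Q

    # Update: incorporate the current observation z.
    Xp = sigma_points(m_pred, P_pred)
    Zp = np.array([h(x) for x in Xp])
    z_hat = w @ Zp
    Pzz = (w[:, None] * (Zp - z_hat)).T @ (Zp - z_hat) + R
    Pxz = (w[:, None] * (Xp - m_pred)).T @ (Zp - z_hat)
    K = Pxz @ np.linalg.inv(Pzz)
    mu = m_pred + K @ (z - z_hat)
    Sigma = P_pred - K @ Pzz @ K.T
    return mu, Sigma   # sample: rng.multivariate_normal(mu, Sigma)
```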
In this step, we update the posterior over the landmark estimates, represented by the mean $\mu_{k,t}^{[m]}$ and covariance $\Sigma_{k,t}^{[m]}$ of each observed landmark $\theta_k$, again using the unscented transform instead of an EKF linearization.
Calculate the sigma points of the landmark estimate $\left(\mu_{k,t-1}^{[m]}, \Sigma_{k,t-1}^{[m]}\right)$ according to Eq. (4). Then use the observation model to compute the mean and covariance of the predicted observation as follows: propagating the sigma points through $h\!\left(\cdot, s_t^{[m]}\right)$ yields the predicted observation $\hat{z}_{k,t}$, its covariance $P_{zz}$, and the cross-covariance $P_{\theta z}$. Under this approximation, the posterior for the location of landmark $\theta_k$ remains Gaussian and is updated with the Kalman gain $K = P_{\theta z} P_{zz}^{-1}$:

$$\mu_{k,t}^{[m]} = \mu_{k,t-1}^{[m]} + K\left(z_t - \hat{z}_{k,t}\right), \qquad \Sigma_{k,t}^{[m]} = \Sigma_{k,t-1}^{[m]} - K P_{zz} K^T.$$
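The same sigma-point machinery carries out this landmark update. A sketch operating on the `LandmarkEstimate` fields from the earlier sketch, with the sampled robot pose held fixed inside the illustrative measurement function `h_lm`:

```python
import numpy as np

def update_landmark(lm, z, h_lm, R, lam=2.0):
    """UT-based Kalman update of one landmark Gaussian; h_lm(theta) predicts
    the observation of landmark position theta from the sampled robot pose."""
    n = lm.mean.shape[0]
    w = np.full(2 * n + 1, 1.0 / (2 * (n + lam)))
    w[0] = lam / (n + lam)
    S = np.linalg.cholesky((n + lam) * lm.cov)
    X = np.vstack([lm.mean, lm.mean + S.T, lm.mean - S.T])
    Z = np.array([h_lm(x) for x in X])
    z_hat = w @ Z
    Pzz = (w[:, None] * (Z - z_hat)).T @ (Z - z_hat) + R
    Pxz = (w[:, None] * (X - lm.mean)).T @ (Z - z_hat)
    K = Pxz @ np.linalg.inv(Pzz)
    lm.mean = lm.mean + K @ (z - z_hat)     # mean update
    lm.cov = lm.cov - K @ Pzz @ K.T         # covariance update
```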
Next, we resample from the temporary particle set according to the importance weights. The target distribution is the path posterior $p(s^t \mid z^t, u^t, n^t)$, so each weight is the ratio between the target and the proposal the pose was sampled from. To calculate whether resampling is actually needed, we evaluate the effective sample size $N_{\mathrm{eff}} = 1 / \sum_{m=1}^{M} \left(w_t^{[m]}\right)^2$ and resample only when $N_{\mathrm{eff}}$ falls below a threshold such as $M/2$; in this way the number of resampling steps is determined adaptively, which alleviates the particle depletion problem. After the resampling, all particle weights are then reset to $1/M$.
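A minimal sketch of this adaptive resampling step; the $M/2$ threshold is a common choice and an assumption here:

```python
import copy
import numpy as np

def maybe_resample(particles, rng):
    """Resample the particle set only when the effective sample size
    signals depletion; otherwise keep the weighted set as-is."""
    w = np.array([p.weight for p in particles])
    w /= w.sum()
    n_eff = 1.0 / np.sum(w ** 2)               # effective sample size
    if n_eff < len(particles) / 2.0:           # assumed threshold M/2
        idx = rng.choice(len(particles), size=len(particles), p=w)
        particles = [copy.deepcopy(particles[i]) for i in idx]
        for p in particles:
            p.weight = 1.0 / len(particles)    # weights reset to 1/M
    return particles
```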
SIFT Feature Extraction
The Scale Invariant Feature Transform (SIFT) was proposed in (Lowe, 2004) as a method of extracting and describing key-points that are robustly invariant to common image transforms. The SIFT algorithm has four major stages. 1) Scale-space extrema detection. The first stage finds scale-space extrema of the Difference-of-Gaussians (DoG) function $D(x, y, \sigma) = \big(G(x, y, k\sigma) - G(x, y, \sigma)\big) * I(x, y)$, where $G$ is a Gaussian kernel, $I$ is the input image, and $k$ is the constant scale factor between neighboring scales. 2) Key-point localization, which rejects low-contrast extrema and edge responses. 3) Orientation assignment, which assigns one or more dominant gradient orientations to each key-point so that the descriptor can be expressed relative to them.

Maxima and minima are detected by comparing a pixel (marked with X) to its 26 neighbors in 3 × 3 regions at the current and adjacent scales (marked with circles).
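A minimal sketch of one DoG level as defined above, using OpenCV's Gaussian blur (an illustrative implementation, not the paper's code):

```python
import cv2
import numpy as np

def dog_level(img, sigma, k=np.sqrt(2.0)):
    """One Difference-of-Gaussians level: D = G(k*sigma)*I - G(sigma)*I."""
    g1 = cv2.GaussianBlur(img, (0, 0), sigma)        # blur at scale sigma
    g2 = cv2.GaussianBlur(img, (0, 0), k * sigma)    # blur at scale k*sigma
    return g2.astype(np.float32) - g1.astype(np.float32)
```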

Fig. 3. A key-point descriptor.

Fig. 4. Typical extracted SIFT features, with locations marked '+' and circle radii representing scale, for 320×240 test images taken at distances of (a) 1618 mm and (b) 756 mm; (a) yields 278 key-points and (b) 267 key-points.
4) Key-point descriptor. A key-point descriptor consists of 16 orientation histograms aligned in a 4 × 4 grid, each histogram having 8 orientation bins and being built over a support window of 4 × 4 pixels (Fig. 3). The resulting feature vector has 128 elements, with a total support window of 16 × 16 scaled pixels. For a more detailed discussion see (Lowe, 2004). The number of features generated depends on image size and content, as well as on algorithm parameters. In this paper, we use these 128-element vectors as the key-point descriptors. Fig. 4 shows an example of SIFT feature extraction for a cluttered and occluded image of size 320×240 pixels.
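For reference, the extraction step can be reproduced with OpenCV's SIFT implementation (OpenCV's API, not the paper's own code; requires opencv-python 4.4 or later):

```python
import cv2

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)  # illustrative filename
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
# Each keypoint carries position, scale and orientation; each descriptor
# is a 128-element vector, as described above.
print(len(keypoints), descriptors.shape)
```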
SIFT Feature Matching Based on KD-Tree

This section describes the KD-tree algorithm for determining matching SIFT feature pairs between successive images captured at relatively close positions along the robot's path by the monocular vision system mounted on the robot. Every time the CCD camera vision system is triggered, it captures consecutive 320×240 digital images; after SIFT feature extraction, SIFT feature match pairs between adjacent images are generated by the KD-tree-based feature matching algorithm. The matched pairs are then used to reconstruct the landmarks' 3D structure. Given a SIFT key-point set of $N$ 128-dimensional descriptor vectors, a KD-tree is built by recursively splitting the set at the median of the dimension with the greatest variance, so that each internal node partitions the descriptor space into two hyper-rectangles.
After constructing the KD-tree, a depth-first nearest-neighbor search is used: it first descends to the leaf cell that contains the query point and records the best candidate found, then backtracks. The space occupied by the key-point set is partitioned into hyper-rectangles, one per branch, and a branch can contain a closer neighbor only if the hypersphere centered at the query point with radius equal to the current best distance intersects that branch's hyper-rectangle. The two objects intersect only if the distance between the query point and the hyper-rectangle is smaller than the current best distance; branches failing this test are pruned, which gives the search its expected time cost of O(log₂N).

Fig. 5. SIFT feature matches based on the KD-tree; matching pairs are marked with red dots.

Fig. 6. Key-point descriptor histograms of matching key-points at different scales and orientations.
We implement the SIFT key-point matching algorithm with a nearest-neighbor search in a KD-tree, where the distance between key-points is the Euclidean distance between their corresponding 128-dimensional descriptor vectors. The basic matching process is as follows: a KD-tree is constructed using all key-points of the first image; for each key-point of the second image, its nearest and second-nearest neighbors in the tree are retrieved, and the pair is accepted as a match only if the nearest distance is sufficiently small relative to the second-nearest distance, following (Lowe, 2004).
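A sketch of this matching procedure using SciPy's `cKDTree`; the 0.8 distance-ratio threshold follows (Lowe, 2004) and is an assumption about the paper's exact setting:

```python
import numpy as np
from scipy.spatial import cKDTree

def match_sift(desc_a, desc_b, ratio=0.8):
    """Match 128-D SIFT descriptors of image B against those of image A.
    Returns (index_in_b, index_in_a) pairs passing the distance-ratio test."""
    tree = cKDTree(desc_a)                   # KD-tree over image A's descriptors
    dist, idx = tree.query(desc_b, k=2)      # nearest and second-nearest
    keep = dist[:, 0] < ratio * dist[:, 1]   # Lowe's ratio test
    return [(i, int(idx[i, 0])) for i in np.flatnonzero(keep)]
```

Note that in 128 dimensions an exact KD-tree search can degrade toward a linear scan, which is why approximate variants such as Lowe's best-bin-first search are common in practice; the O(log₂N) cost is the expected, not worst-case, figure.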

Fig. 7. Two-view geometry and the epipolar constraint.
After the SIFT feature matching, we obtain 2D SIFT image feature matching pairs, which are used to reconstruct the 3D spatial landmarks in a single world model. As seen from Fig. 7, the epipolar constraint requires all the entities involved, the two camera centers, the two image points, and the 3D point itself, to lie in a common epipolar plane.
The solution of the three unknown variables, the landmark's 3D coordinates $(X, Y, Z)$, is then obtained from the projection equations of the two views, e.g., by linear triangulation of each matched pair using the camera poses estimated by the filter.
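A standard linear (DLT) triangulation sketch for these three unknowns, assuming the 3×4 projection matrices `P1` and `P2` of the two views are available from the estimated camera poses (names illustrative):

```python
import numpy as np

def triangulate(p1, p2, P1, P2):
    """Linear (DLT) triangulation: recover the homogeneous 3D point X
    from the matched pixels p1 <-> p2 and projection matrices P1, P2."""
    A = np.vstack([
        p1[0] * P1[2] - P1[0],   # x1 * (row 3) - (row 1) = 0
        p1[1] * P1[2] - P1[1],   # y1 * (row 3) - (row 2) = 0
        p2[0] * P2[2] - P2[0],
        p2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                   # right singular vector of smallest value
    return X[:3] / X[3]          # dehomogenize to (X, Y, Z)
```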
Experiments

The experiments are performed on a Pioneer 3-DX mobile robot with an onboard 800 MHz Intel Pentium processor, as shown in Fig. 8(a). Motor control is performed on the onboard computer, while a 2.6 GHz PC connected to the robot by a wireless link provides the main processing power for vision processing and the SLAM software. A monocular color CCD camera is mounted at the front of the robot. The test environment is a robot laboratory with limited space, as shown in Fig. 8(b).

Fig. 8. (a) Pioneer 3 robot; (b) experiment environment.

Fig. 9. Frames of an image sequence with SIFT features marked: (a) 2nd frame; (b) 9th frame; (c) 19th frame; (d) 70th frame; (e) 79th frame; (f) 100th frame; (g) 150th frame; (h) 163rd frame; (i) 172nd frame.

Fig. 10. Bird's-eye view of the SIFT landmarks in the map. The dashed line indicates the estimated robot path and the solid line indicates the real robot path.

Fig. 11. The 3D SIFT landmark database map viewed from different angles, with each landmark appearing consistently in every view: (a) from the top; (b) from the left; (c) from the right.
The images are captured and processed, and the map is maintained and updated on the fly while the robot moves. The robot travels one loop around the laboratory and returns to its starting point. Fig. 9 shows some frames of the 320 × 240 image sequence (189 frames in total) captured while the robot is moving. A total of 4068 SIFT landmarks with 3D positions are gathered in the map. Fig. 10 shows the bird's-eye view of all these landmarks, and Fig. 11 shows three views of the 3D SIFT landmark map from different angles. The performance of our SLAM algorithm, including position and landmark errors for different numbers of particles as well as runtime and memory requirements for different numbers of landmarks, is shown in Fig. 12. Finally, we compare our method with the traditional EKF method; as shown in Fig. 13, our method exhibits superior performance.

Fig. 12. Performance of our RBPF SLAM algorithm: (a) robot position error and (b) landmark error with different numbers of particles; (c) runtime and (d) memory requirement with different numbers of landmarks.

Fig. 13. Comparison of our RBPF SLAM algorithm and EKF for (a) robot position error and (b) memory requirement.
Conclusion

A novel RBPF method has been presented to implement monocular vision-based mobile robot SLAM in indoor environments. The particle filter is combined with the UKF to sample new poses that integrate the current observation. Landmark position estimation and update are implemented through the UT. To address the particle depletion problem, the number of resampling steps is selected adaptively. A single camera tracks 3D natural point landmarks, which are reconstructed from matched feature pairs extracted through SIFT. The matching of the highly distinctive, high-dimensional SIFT descriptors is implemented with a KD-tree at a time cost of O(log₂N). Experiments on a real robot demonstrate the precision and stability of the method.
