Learning Long-range Terrain Perception for Autonomous Mobile Robots

Abstract

Long-range terrain perception is highly valuable for efficient autonomous navigation and risky intervention tasks performed by field robots, since it enables earlier recognition of hazards, better path planning, and higher speeds. However, stereo-based navigation systems can only perceive near-field terrain because of the nearsightedness of stereo vision. Many near-to-far learning methods, based on the appearance features of regions, have been proposed to predict the far-field terrain. We propose a statistical prediction framework to enhance long-range terrain perception for autonomous mobile robots. The main difference between our solution and existing methods is that our framework not only uses appearance features as its prediction basis, but also incorporates spatial relationships between terrain regions in a principled way. The experimental results show that our framework outperforms existing approaches in terms of accuracy, robustness and adaptability to dynamic unstructured outdoor environments.

Keywords

autonomous navigation; stereo vision; machine learning; conditional random fields

1. Introduction

Navigation in an unknown and unstructured outdoor environment is a fundamental and challenging problem for autonomous mobile robots. The navigation task requires identifying safe, traversable paths that allow the robot to progress toward a goal while avoiding obstacles. Standard approaches to this task use ranging sensors such as stereo vision or radar to recover the 3-D shape of the terrain. Various features of the terrain, such as slopes or discontinuities, are then analyzed to determine traversable regions (Matthies. 1992; Pagnot & Grandjean. 1995; Singh, Simmons, Smith. 2000; Rieder & Southall. 2002). However, ranging sensors such as stereo vision only supply short-range perception and give reliable obstacle detection to a range of approximately 5m (Ollis, Huang, Happold. 2008). Navigating solely on short-range perception can lead to incorrect classification of safe and unsafe terrain in the far field, inefficient path following or even the failure of an experiment due to nearsightedness (Jackel, Krotkov, Perschbacher. 2006; Michael J. Procopio. 2009).

To address nearsighted navigational errors, long-range perception approaches based on near-to-far learning have been developed. These approaches collect both appearance and stereo information from the near field as inputs for training appearance-based models, and then apply these models in the far field in order to predict safe terrain and obstacles farther out from the robot, where stereo readings are unavailable (Dahlkamp, Kaehler, Stavens. 2006; Happold, Ollis & Johnson. 2006; Max Bajracharya. 2009).

We restrict our discussion to online self-supervised learning, since the diversity of the terrain and the lighting conditions of outdoor environments make it infeasible to employ a database of obstacle templates or features, or other forms of predefined description collections. The winner of the DARPA Grand Challenge (Dahlkamp, Kaehler, Stavens. 2006) combines sensor information from a laser range finder and a pose estimation system to first identify a nearby patch (a set of neighboring pixels) of drivable surface. The vision system then takes this patch and uses it to construct appearance models to find the drivable surface outward into the far range. (Happold, Ollis & Johnson. 2006) propose a method for classifying the traversability of terrain by combining unsupervised learning of color models that predict scene geometry with supervised learning of the relationship between geometric features and the traversability. A neural network is trained offline on hand-labeled geometric features computed from the stereo data. An online process learns the association between color and geometry, enabling the robot to assess the traversability of regions for which there is little range information by estimating the geometry from the color of the scene and passing this to the neural network. The system of (Max Bajracharya. 2009) consists of two learning algorithms: a short-range, geometry-based local terrain classifier that learns from very few proprioceptive examples; and a long-range, image-based classifier that learns from geometry-based classification and continuously generalizes geometry to the appearance.

The appearance-based near-to-far learning methods mentioned above do support long-range perception, which provides the “look-ahead” capability that complements traditional short-range stereo- or LIDAR-based sensing. However, appearance-based methods assume that the near-field mapping from appearance to traversability is the same as the far-field mapping. Such an assumption does not necessarily hold, because of the complex terrain geometry and varying lighting conditions in unstructured outdoor environments. Therefore, strategies that compensate for this mapping deviation have begun to draw more attention.

(Lookingbill, Lieb & Thrun. 2007) use a reverse optical flow technique to trace back the current road appearance to how it appeared in previous image frames in order to extract road templates at various distances. The templates can be then matched with distant possible road regions in the imagery. However, trackable features, on which the reverse flow technique is based, are subject to the image saturation and scene elements occurrence patterns. Furthermore, changing illuminant conditions can result in unacceptable rates of misclassification. Noting that the visual size of features scales inversely with the distance from camera, (Hadsell, Sermanet, Ben. 2009) normalize the image by constructing a horizon-leveled input pyramid in which similar obstacles have similar heights, regardless of their distances from the camera. However, the distance estimation for different regions of images introduces extra uncertainties. In addition, this approach does not consider the influence of changing lighting conditions on appearances. (Michael J. Procopio. 2009) proposes the use of classifier ensembles to learn and store terrain models over time for the application to future terrain. These ensembles are validated and constructed dynamically from a model library that is maintained as the robot navigates terrain toward some goal. The outputs of the models in the resulting ensemble are combined dynamically and in real time. The main contribution of the ensembles approach is to leverage robots' past experience for classification of the current scene. However, since the validation of models is based on the stereo readings from the current scene, this approach is still subject to the mapping assumption.

In summary, all the existing near-to-far approaches rely excessively on appearance features and the mapping assumption. As a result, they lack robustness and self-adaptability to changing illuminant conditions. Furthermore, the problem of appearance ambiguity is inherent in unstructured outdoor environments. Consider the navigating scene in Fig. 1, which is taken from the natural datasets of actual logged test runs by robots competing in the DARPA LAGR (Learning Applied to Ground Vehicle) program (Jackel, Krotkov, Perschbacher. 2006). In this scene, shadows are prominent and the appearance of the shadows cast on the ground (traversable) is very similar to that of the side of the hay bale (non-traversable). The tops of the hay bales, which receive near-field stereo labels of “obstacle”, are very similar in appearance to both sky and groundplane. Therefore, the resulting instances have similar feature data but different class labels, which will easily confuse the navigating system.

Fig. 1.

A challenging navigating scene

In this paper, we propose a Conditional Random Fields (CRF) (Lafferty, McCallum & Pereira. 2001) based near-to-far perception framework (CRFNFP) to compensate for such appearance ambiguities and to enhance robustness and self-adaptability to changing illuminant conditions. The main difference between our solution and other existing methods is that CRFNFP uses not only local region features, but also the spatial relationships (spatial context) between different regions as its classification basis. The problem to be solved here is how to design a specific CRF framework, i.e., CRFNFP, for self-supervised, near-to-far learning in unstructured outdoor environments. To the best of our knowledge, ours is the first work that introduces and adapts the CRF-based framework to model the navigating scene contexts and to improve the long-range perception for mobile robot navigation.

In our solution, we first over-segment the current scene into superpixels (a superpixel is a set of neighboring pixels) and update the classification database using training samples from stereo readings. Then we model both the local appearance and the spatial relationships between regions under the CRFNFP framework. Thresholds on the CRFNFP prediction marginals are used to determine the terrain categories of superpixels.

An outline of this paper is as follows: We first briefly describe the generation of training samples from the stereo in section 2. The CRFNFP framework will be detailed in section 3 and section 4 provides the experiment results. We conclude our paper in section 5 with our further research in this area.

2. Generation of training samples from stereo

2.1. Generation of sample labels

In our proposed method, a ground plane (Fig. 3b) is first fitted in the disparity image and subtracted out, resulting in an estimate of the ground plane deviation (GPD) (Fig. 3c). Second, pixels with a large GPD are considered as candidate pixels for obstacles (Fig. 3f). Third, the RGB image (Fig. 2a) is over-segmented by the graph-based technique (Felzenszwalb & Huttenlocher. 2004). Finally, the rate of terrain-specific candidate pixels within a superpixel is used to determine the superpixel label.

Fig. 2.

(a) RGB image; (b) superpixels expression and specific graph structure

Fig. 3.

Generation of pixels of samples from stereo: (a) Stereo disparity; (b) Ground plane predicted; (c) Ground plane deviation; (d) Hand-labeled ground truth; (e) candidate pixels (ground); (f) candidate pixels (obstacles)

The extraction of the ground plane is the most important geometric analysis. We assume that there is a dominant ground plane within the near-field terrain and that the plane also is the surface of support for the robot. Since planar features in the world will project into a planar surface in the disparity image (Max Bajracharya. 2009), the ground plane model is extracted by applying the RANSAC algorithm (Fischler & Bolles. 1981) to the disparity image. However, the dominant plane may not be extracted directly from the disparity image sometimes due to the complexity of terrain geometry. And in such a case, we use the default plane computed from vehicle geometry instead. The assumption of the single ground plane has proven to work well in practice for the present. However, when the robot comes to a very challenging scene, the uneven terrain and the side slope may affect the heading of the robot, the extraction of the ground plane and further the accuracy of classification. In the future, we plan to relax the assumption above and extract the ground plane based on the multi-surfaces fitting. Furthermore, we also plan to use appearance features, which are collected online or given in advance, to label the training samples since the feature of GPD alone is not adequate for such a challenging scene.
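The plane extraction described above can be sketched roughly as follows. This is a minimal illustration, assuming the plane is parameterized in the disparity image as d = a·u + b·v + c; the function names, the inlier tolerance and the iteration count are hypothetical choices, not values reported in this paper, and the fallback to the default vehicle-geometry plane is left out.

```python
import numpy as np

def fit_plane_ransac(u, v, d, n_iters=200, inlier_tol=1.0, seed=0):
    """Fit d = a*u + b*v + c to disparity samples with a basic RANSAC loop.

    n_iters and inlier_tol are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    A = np.column_stack([u, v, np.ones_like(u, dtype=float)])
    best_inliers = None
    for _ in range(n_iters):
        idx = rng.choice(len(d), size=3, replace=False)
        try:
            params = np.linalg.solve(A[idx], d[idx])  # plane through 3 samples
        except np.linalg.LinAlgError:
            continue  # degenerate (collinear) sample, try again
        inliers = np.abs(A @ params - d) < inlier_tol
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    if best_inliers is None:
        raise ValueError("no valid plane hypothesis found")
    # Refine the best hypothesis with least squares over its inliers.
    params, *_ = np.linalg.lstsq(A[best_inliers], d[best_inliers], rcond=None)
    return params, best_inliers

def ground_plane_deviation(u, v, d, params):
    """Disparity minus the fitted plane, i.e. the GPD of each sample."""
    a, b, c = params
    return d - (a * u + b * v + c)
```

Samples whose deviation exceeds a chosen threshold would then become obstacle candidates; that threshold is not specified in the text and is left to the caller here.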

Another issue concerns the selection of training samples in the form of superpixels. In principle, the samples should be drawn from the near-field region, which corresponds to the “bottom” half of the image, because the stereo information in the near field is dense and reliable. However, we find that the validation flags, which are created based on the texture and uniqueness validation of the stereo algorithm, are very accurate indicators of the correctness of the stereo measurements. In other words, it is safe to conservatively select a superpixel as a training sample based on the rate of terrain-specific candidate pixels within the superpixel, even if it corresponds to the far-field region. Furthermore, such a selection is beneficial to the collection of obstacle samples, since there are few obstacle samples in the near field. We select training samples all over the image, and we do not need to create balanced training sets using undersampling as in (Michael J. Procopio. 2009), since our proposed algorithm is based on Bayes' rule.
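The conservative selection rule above can be illustrated with a short sketch. It assumes that a superpixel becomes a training sample only when the rate of candidate pixels of one category reaches a minimum ratio; the function name and the value `min_ratio=0.8` are hypothetical, since the paper does not state the ratio used.

```python
import numpy as np

def label_superpixels(segments, obstacle_mask, ground_mask, min_ratio=0.8):
    """Map superpixel id -> +1 (obstacle) or -1 (ground).

    Superpixels without enough candidate pixels of either category are
    left unlabeled. min_ratio is an illustrative assumption.
    """
    labels = {}
    for sp in np.unique(segments):
        inside = segments == sp
        n_pixels = inside.sum()
        obstacle_rate = obstacle_mask[inside].sum() / n_pixels
        ground_rate = ground_mask[inside].sum() / n_pixels
        if obstacle_rate >= min_ratio:
            labels[int(sp)] = 1
        elif ground_rate >= min_ratio:
            labels[int(sp)] = -1
    return labels
```

Leaving ambiguous superpixels unlabeled, rather than forcing a label, keeps the training set conservative in the sense described above.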

During the whole process mentioned above, category-specific candidate pixels can be generated rather accurately. However, the parameter MS (minimum component size) of the segmentation algorithm above has a great influence on the accuracy of the generated training samples and of our proposed method. With an extremely large MS, the algorithm creates superpixels that contain many pixels of different categories, and since pixels within a superpixel are classified as a whole, many pixels may be misclassified. In our experiments, we find that an MS of 80 reduces the accuracy by nearly 15% compared with an MS of 40. We therefore use the value 40 for MS throughout all our experiments.

2.2. Appearance feature of training samples

In our implementation, each superpixel produces one feature vector. The visual features used for our traversability classification task consist of color and texture information. Color information consists of the average color in the CIELAB color space. In addition, texture features of a superpixel are computed using eighteen filters selected from the LM filter bank (Leung & Malik. 2001). The average response of each filter in a superpixel, and the distribution of the filter index of the maximum response at each pixel, represent the texture features of a superpixel. The dimension of the feature vector is 39, as shown in Table 1.

Table 1.

Feature description

Type	Description	Dim
CIELAB value	CIELAB mean	3
LM Average	Average filter responses from LM filter set	18
LM Maximum	Histogram of maximum filter responses	18
	Total number of dimensions	39
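The layout of the 39-dimensional vector in Table 1 can be made concrete with a small sketch. It assumes the CIELAB values and the eighteen filter responses have already been computed per pixel; the function name is hypothetical, and taking the maximum over signed (rather than absolute) responses and normalizing the histogram are assumptions.

```python
import numpy as np

def superpixel_feature(lab_pixels, filter_responses):
    """Build the 3 + 18 + 18 = 39 dimensional feature of one superpixel.

    lab_pixels:       (n_pixels, 3) CIELAB values
    filter_responses: (n_pixels, 18) responses of the selected LM filters
    """
    lab_mean = lab_pixels.mean(axis=0)                # CIELAB mean, 3 dims
    response_mean = filter_responses.mean(axis=0)     # LM Average, 18 dims
    winners = np.argmax(filter_responses, axis=1)     # strongest filter per pixel
    n_filters = filter_responses.shape[1]
    histogram = np.bincount(winners, minlength=n_filters) / len(winners)
    return np.concatenate([lab_mean, response_mean, histogram])
```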

3. CRFNFP framework

Our CRFNFP framework is based on the concept of Conditional Random Field (CRF) proposed by (Lafferty, McCallum & Pereira. 2001) in the context of segmentation and labeling of 1-D text sequences. We first introduce the original definition of CRFs in our notations and pave the way for explaining our CRFNFP framework later.

3.1 Conditional Random Field framework

Let the observed data from an input image be given by X = { x _i}_i∈S, where S is the set of sites and x _i is the data from i th site. The corresponding labels at the image sites are given by L = {l_i}_i∈S. In this work, we will be concerned with binary classification, i.e., l_i ∈ {−1,1}, −1 for ground and 1 for obstacle.

CRF Definition: Let G = (S, E) be a graph such that L is indexed by the vertices of G. Then (L, X) is said to be a conditional random field if, when conditioned on X, the random variables l_i obey the Markov property with respect to the graph: P(l_i | X, L_{S−{i}}) = P(l_i | X, L_{N_i}), where S−{i} is the set of all nodes in the graph except the node i, and N_i is the set of neighbors of the node i in G.

Given the observation X , the CRF defines the joint distribution over the labels L as

P(L \mid X) = \frac{1}{Z} \exp\left\{ \sum_{i \in S} A_{i}(l_{i}, X) + \sum_{i \in S} \sum_{j \in N_{i}} I_{ij}(l_{i}, l_{j}, X) \right\}

(1)

where Z is a normalizing constant known as the partition function, and −A_i and −I_ij are the unary and pairwise potentials, respectively. In the rest of this paper, we will call A_i the association potential and I_ij the interaction potential.

In this work, we first over-segment the incoming scene into superpixels and then create a graph for the specific scene, as shown in Fig 2b. Superpixels correspond to nodes (yellow points in Fig 2b) in the graph, and the neighborhood system N_i is represented by the set of yellow lines. Only a small portion of the graph is shown, and capital letters are assigned to some nodes in Fig 2b for ease of further explanation.

3.2. CRFNFP framework

Our CRFNFP framework is an extended implementation of the Standard Conditional Random Field (SCRF) in the context of self-supervised, near-to-far learning for mobile robots. In order to better adapt CRFNFP to changing illuminant conditions and the complex scene geometry of unstructured outdoor environments, we build CRFNFP on SCRF with the 3 modifications below.

First, in unstructured outdoor environments, the feature distribution of terrain-specific samples is multimodal (samples are clustered around multiple centers in feature space). Discriminative classifiers with linear or nonlinear decision boundaries, which SCRF uses to construct its association potential, are therefore not suitable for our framework. Instead, we develop a new version of the traversability classification algorithm in (Kim, Oh & Rehg. 2007) to construct the association potential of CRFNFP, which supports incremental learning and multimodal classification. The corresponding algorithm is described in subsection 3.2.1.

Second, the lighting condition and other unpredicted factors in outdoor environments usually make neighboring regions look different but with the same class such as regions A and B in Fig 2b. However, SCRF only incorporates feature-dependent terms into its interaction potential to allow the data to speak for themselves (i.e., only when neighboring superpixels are close in feature space, SCRF prefers to label them as the same class). Thus, we introduce an extra feature-independent smoothing term into the interaction potential to encourage the neighboring superpixels, although with different appearance, to be labeled as same classes. Detailed interaction potential construction is provided in subsection 3.2.2.

Finally, the context of online learning requires the CRFNFP parameters to be adjusted continuously to changes in scene geometry and lighting conditions. However, during real experiments, the change from one frame to the next is unpredictable, and parameters learned directly from the maximum likelihood framework are subject to abrupt changes. Therefore, we adopt a modified sequential Bayesian parameter updating strategy to reduce parameter oscillations and to capture the overall trend of the parameter changes. We describe the parameter training and the Bayesian updating algorithm in subsections 3.2.3 and 3.2.4, respectively.

3.2.1 Association Potential

The construction of association potential is based on an accumulation process of training samples and a real-time classification algorithm.

The accumulation process of training samples is similar to that of (Kim, Oh & Rehg. 2007). We maintain two models throughout the whole process: the traversability model Θ_T and the non-traversability model Θ_N. Each model contains several prototypes (cluster centers in the feature space), and each prototype C_j has an associated count n_j. When a newly labeled training sample (x⁽ⁱ⁾, l⁽ⁱ⁾) arrives, it is put into one of the two models depending on its label l⁽ⁱ⁾ and updates the model in one of two ways: (1) it adds a new prototype if the distance between every existing prototype C_j and the new sample x⁽ⁱ⁾ exceeds a predefined threshold θ_d, i.e., ∀C_j, ||C_j − x⁽ⁱ⁾|| > θ_d, or (2) it increases the count of the closest prototype C_j by one, i.e., n_j ← n_j + 1. When all the training samples from a certain image have been added, we cut away the prototypes whose associated count equals 0. We remove these prototypes not only to save computational resources but also to make the navigation system more adaptive to varying environmental conditions.
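A minimal sketch of the accumulation step for one model (Θ_T or Θ_N) is given below. The class and method names are hypothetical, and starting a new prototype with a count of 1 is an assumption; the pruning of zero-count prototypes is omitted because the text does not specify how a count reaches zero.

```python
import numpy as np

class PrototypeModel:
    """Prototypes C_j with counts n_j for one terrain class (illustrative)."""

    def __init__(self, theta_d=0.6):
        self.theta_d = theta_d   # distance threshold for creating a prototype
        self.centers = []        # prototype feature vectors C_j
        self.counts = []         # associated counts n_j

    def add_sample(self, x):
        """Add one training feature vector; return the affected prototype index."""
        x = np.asarray(x, dtype=float)
        if self.centers:
            dists = [np.linalg.norm(c - x) for c in self.centers]
            j = int(np.argmin(dists))
            if dists[j] <= self.theta_d:
                self.counts[j] += 1          # case (2): n_j <- n_j + 1
                return j
        # case (1): every existing prototype is farther than theta_d
        self.centers.append(x)
        self.counts.append(1)                # initial count of 1 is an assumption
        return len(self.centers) - 1
```

In use, one such object would be kept per class and each labeled superpixel feature routed to the object matching its label.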

Given a novel superpixel with feature vector x, we compute the minimum distance between x and the prototypes in each model, d_T and d_N respectively. We also record the corresponding prototype counts, n_T and n_N, for later use. The classification is carried out using the four heuristic rules below.

First, if both d_T and d_N are larger than a predefined threshold θ_m, we simply assign the probability 0.5 to both P_T( x ) and P_N( x ) to indicate that we are not sure of which model the novel superpixel belongs to. P_T( x ) and P_N( x ) are the probabilities the novel superpixel belongs to the model Θ_T and Θ_N respectively.

Second, if d_T is larger than θ_m and d_N is not, we assign the probabilities 0.2 and 0.8 to P_T(x) and P_N(x) respectively, to indicate that we are more confident that the novel superpixel belongs to the model Θ_N.

Third, if d_T is smaller than θ_m and d_N is not, we assign the probabilities 0.8 and 0.2 to P_T(x) and P_N(x) respectively, to indicate that we are more confident that the novel superpixel belongs to the model Θ_T.

Finally, if both d_T and d_N are smaller than θ_m, we use Bayes' rule where l denotes the unknown traversability variable:

P (l ∣ x) \propto P (x ∣ l) P (l)

(2)

The equation above shows that the posterior probability for the terrain with feature x is proportional to the product of the prior P(l) and the class-conditional likelihood P(x|l). The prior P(l) is computed as the ratio between the numbers of training examples in the different classes: P(l) ∝ ∑_j Θ_l.n_j, where the sum runs over the prototypes of model Θ_l. The class-conditional likelihood term P(x|l) is approximated by the ratio between n_T or n_N and the total number of training examples in the model l. Finally, the association potential A_i(l_i, X) in Eq. (1) is replaced by P(x|l) or the combination of P_T(x) and P_N(x). The values 0.5, 0.2 and 0.8 above and the thresholds (i.e., θ_d and θ_m, which are 0.6 and 0.42 respectively in our experiments) are determined by comparing the performance of different combinations of values, and these values are used throughout both sets of experiments later. We leave the optimal selection of these parameters to further study.
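The four rules can be summarized in one function, shown here as a hedged sketch. The function name and argument names are hypothetical; treating a distance exactly equal to θ_m as "near" and normalizing the two Bayes posteriors so that they sum to 1 are assumptions, since the text only gives the proportionality in Eq. (2).

```python
def association_probabilities(d_T, d_N, n_T, n_N, total_T, total_N, theta_m=0.42):
    """Return (P_T, P_N) for a novel superpixel.

    d_T, d_N:         minimum distances to the prototypes of each model
    n_T, n_N:         counts of those closest prototypes
    total_T, total_N: total numbers of training examples in each model
    """
    near_T = d_T <= theta_m
    near_N = d_N <= theta_m
    if not near_T and not near_N:
        return 0.5, 0.5                      # rule 1: undecided
    if near_N and not near_T:
        return 0.2, 0.8                      # rule 2: favors Theta_N
    if near_T and not near_N:
        return 0.8, 0.2                      # rule 3: favors Theta_T
    # rule 4: Bayes' rule, posterior proportional to likelihood * prior
    total = total_T + total_N
    post_T = (n_T / total_T) * (total_T / total)
    post_N = (n_N / total_N) * (total_N / total)
    z = post_T + post_N
    return post_T / z, post_N / z
```

Under these approximations the normalized posterior in rule 4 reduces to n_T / (n_T + n_N), because the model totals cancel between the likelihood and the prior.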

3.2.2 Interaction Potential

The interaction potential in CRFNFP is defined as

I_{ij}(l_{i}, l_{j}, X) = c_{ij} \left( \mathit{KI} \, l_{i} l_{j} + \mathit{KD1} \left(1 - \frac{Pd_{ij}}{P_{max}}\right) δ(l_{i} = l_{j}) + \mathit{KD2} \, \frac{Pd_{ij}}{P_{max}} \, δ(l_{i} \neq l_{j}) \right)

(3)

where δ(x) = 1 if x is true and 0 otherwise. The term c_ij represents the connection strength between superpixels i and j, and it is defined as

c_{i j} = | {S P}_{i j} | / | {S P}_{i} |

(4)

where |SP_i| represents the number of pixels within the superpixel i and |SP_ij| represents the number of pixels within superpixel i that are adjacent to j. Eq. (4) indicates that if |SP_ij| is close to |SP_i|, the connection strength between superpixels i and j is strong and the superpixel j has a lot of influence on the superpixel i. Pd_ij is the Euclidean distance between the features extracted from superpixels i and j, and P_max is the maximum of all distances between superpixel i and its neighbors.
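As an illustration of Eq. (4), the connection strengths can be computed directly from a superpixel label map. The sketch below assumes that "adjacent" means 4-connectivity between pixels, which the text does not specify; the function name is hypothetical.

```python
import numpy as np
from collections import defaultdict

def connection_strengths(segments):
    """Return {(i, j): c_ij} with c_ij = |SP_ij| / |SP_i| (Eq. 4).

    |SP_ij| counts the pixels of superpixel i that touch superpixel j
    under 4-connectivity (an assumption).
    """
    h, w = segments.shape
    sizes = defaultdict(int)
    border = defaultdict(set)   # (i, j) -> pixels of i adjacent to j
    for y in range(h):
        for x in range(w):
            i = int(segments[y, x])
            sizes[i] += 1
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                yy, xx = y + dy, x + dx
                if 0 <= yy < h and 0 <= xx < w:
                    j = int(segments[yy, xx])
                    if j != i:
                        border[(i, j)].add((y, x))
    return {pair: len(pixels) / sizes[pair[0]] for pair, pixels in border.items()}
```

Note that c_ij is not symmetric: a small superpixel enclosed by a large one is strongly influenced by it, but not the other way around.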

The interaction potential represents the compatibility between the classes of neighboring superpixels. The first term KI*l_il_j serves as a data-independent smoothing function, which indicates a high degree of compatibility for neighboring superpixels with the same class. From the experiments we carried out, we find that this smoothing function is necessary for the robot to recognize a coherent region of ground plane, as shown in Fig. 6 and Fig. 7. The second term KD1*(1−Pd_ij/P_max)*δ(l_i=l_j) and the third term KD2*(Pd_ij/P_max)*δ(l_i ≠ l_j) both serve as data-dependent smoothing functions. To better demonstrate the effect of the data-dependent terms, consider superpixels D, E and F in Fig 2b, where superpixel E is adjacent to both superpixels D and F, and E is closer to F than to D in feature space. Since Pd_ij/P_max of E and F is close to 0, the third term reduces to 0 and the second term assigns a large probability to the case that E and F belong to the same class. In a similar way, superpixels D and E are more likely to fall into different classes since the Pd_ij/P_max of D and E is closer to 1.

In addition, KI, KD1 and KD2 are coefficients that modulate the effects of the potentials. The larger a coefficient, the greater the role the corresponding potential plays. In order to better interpret the underlying meaning of KI, KD1 and KD2, we list the average values of all parameters learned from the data sets DS1B and DS1A (Procopio. 2007a) in Table 2. DS1A is logged from the same scenario as DS1B but with a more difficult lighting condition, and typical images from DS1B and DS1A are shown in Fig 7. From Table 2, we find that the parameter KI of DS1A is larger than that of DS1B, while KD1 and KD2 are smaller. If there are many neighboring superpixels from the same class but with distinct appearance, the value of KI needs to be increased in order to compensate for the discontinuity of appearance between neighboring superpixels of the same class. Accordingly, the values of KD1 and KD2 decrease as KI increases.

Table 2.

Average values of parameters

Data sets	KI	KD1	KD2
DS1B	3.0646	0.6784	−0.9539
DS1A	3.4754	0.2118	−1.5259
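To make the roles of the three coefficients concrete, the sketch below evaluates Eq. (3) for a single pair of neighboring superpixels using the DS1B averages from Table 2. The function name and the chosen inputs (c_ij = 1, Pd_ij/P_max equal to 0 or 1) are illustrative assumptions.

```python
def interaction_potential(l_i, l_j, c_ij, pd_ij, p_max, KI, KD1, KD2):
    """Evaluate I_ij of Eq. (3) for labels l_i, l_j in {-1, +1}."""
    r = pd_ij / p_max                      # normalized feature distance in [0, 1]
    same = 1.0 if l_i == l_j else 0.0      # delta(l_i = l_j)
    return c_ij * (KI * l_i * l_j
                   + KD1 * (1.0 - r) * same
                   + KD2 * r * (1.0 - same))

# Average DS1B parameters from Table 2.
DS1B = dict(KI=3.0646, KD1=0.6784, KD2=-0.9539)

# Neighbors that look identical (r = 0): equal labels are rewarded.
same_look_same_label = interaction_potential(1, 1, 1.0, 0.0, 1.0, **DS1B)
# Neighbors that look maximally different (r = 1): unequal labels are
# penalized by both the smoothing term and the negative KD2 term.
diff_look_diff_label = interaction_potential(1, -1, 1.0, 1.0, 1.0, **DS1B)
```

With these values the data-independent term KI dominates, which matches the observation above that KI carries most of the smoothing when appearance is unreliable.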

3.2.3 Training of CRFNFP Parameters

Since the robot can only obtain near-field labeled images χ = {( L ⁿ, X ⁿ)} during navigation, we estimate the model's parameters based on the Conditional Maximum Likelihood criterion, that is,

\hat{Θ} = \underset{Θ}{\arg\max} \sum_{n} \log P(L^{n} \mid X^{n})

(5)

where Θ denotes all the parameters in the model including KI,KD1 and KD2. Parameters are estimated by gradient ascent:

\Delta \mathit{KI} \propto \sum_{n} \sum_{i \in S} \sum_{j \in N(i)} c_{ij} \left[ l_{i}^{n} l_{j}^{n} - ⟨ l_{i} l_{j} ⟩_{P(l_{i}, l_{j} \mid X^{n}, Θ^{c})} \right]

(6)

\Delta \mathit{KD1} \propto \sum_{n} \sum_{i \in S} \sum_{j \in N(i)} c_{ij} \left(1 - \frac{Pd_{ij}}{P_{max}}\right) \left[ δ(l_{i}^{n} = l_{j}^{n}) - ⟨ δ(l_{i} = l_{j}) ⟩_{P(l_{i}, l_{j} \mid X^{n}, Θ^{c})} \right]

(7)

\Delta \mathit{KD2} \propto \sum_{n} \sum_{i \in S} \sum_{j \in N(i)} c_{ij} \frac{Pd_{ij}}{P_{max}} \left[ δ(l_{i}^{n} \neq l_{j}^{n}) - ⟨ δ(l_{i} \neq l_{j}) ⟩_{P(l_{i}, l_{j} \mid X^{n}, Θ^{c})} \right]

(8)

where Θ^c represents the current parameter values and P(l_i, l_j | Xⁿ, Θ^c) is the marginal of the labels at sites i and j based on the current parameter set. The term 〈δ(l_i = l_j)〉_{P(l_i, l_j | Xⁿ, Θ^c)} is the average of δ(l_i = l_j) under the distribution P(l_i, l_j | Xⁿ, Θ^c).

3.2.4 Sequential Bayesian Updating of CRFNFP Parameters

We apply independent sequential Bayesian updating to each of the CRFNFP parameters and model each parameter as a Gaussian with known variance σ² and unknown mean μ. We continuously take the parameter training results as observations (input), construct the prior, likelihood and posterior functions for the mean μ, and then take the mode of the posterior as the current value of the corresponding parameter (output) of the CRFNFP framework.

The likelihood function, which is the probability of the observed data given μ, viewed as a function of μ, is given by

p(X \mid μ) = \prod_{n = 1}^{N} p(x_{n} \mid μ) = \frac{1}{(2 π σ^{2})^{N / 2}} \exp\left\{ - \frac{1}{2 σ^{2}} \sum_{n = 1}^{N} (x_{n} - μ)^{2} \right\}

(9)

where X = {x₁,…,x_N} is a set of N observations. We take our prior distribution, which is the conjugate distribution for likelihood function, to be

p (μ) = N (μ ∣ μ_{t - 1}, σ_{t - 1}^{2})

(10)

and the posterior distribution is given by

p (μ ∣ X) = N (μ ∣ μ_{t}, σ_{t}^{2})

(11)

where

μ_{t} = \frac{σ_{N}^{2}}{N σ_{t - 1}^{2} + σ_{N}^{2}} μ_{t - 1} + \frac{N σ_{t - 1}^{2}}{N σ_{t - 1}^{2} + σ_{N}^{2}} \frac{1}{N} \sum_{n = 1}^{N} x_{n}

(12)

\frac{1}{σ_{t}^{2}} = \frac{1}{σ_{t - 1}^{2}} + \frac{N}{σ_{N}^{2}}

(13)

In our implementation, we first collect 20 qualified images and train the CRFNFP parameters to obtain the initial values μ₀. (We define an image as “qualified” only when neither of the class-specific sample ratios falls below a certain threshold, in order to avoid overfitting of the model parameters.) We then retrain the parameters once for every 5 new qualified images and take the result as one observation. The mode of the posterior, i.e., μ_t, is updated using Eq. (12), where N is 1 and σ²_N / σ²_{t−1} is set to a constant, defined as the ratio of the numbers of images used for training (5/20). In addition, we do not update σ²_t using Eq. (13): under Eq. (13) σ²_t shrinks with every observation, and once it is small enough μ_t can hardly incorporate information from the newest observation and can no longer be altered, which is obviously unsuitable for the self-adaptation of the robot.
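With N = 1 and a fixed ratio r = σ²_N / σ²_{t−1}, dividing the numerators and denominators of Eq. (12) by σ²_{t−1} gives μ_t = r/(1+r) · μ_{t−1} + 1/(1+r) · x. The sketch below implements this reduced form; the function name is hypothetical, and reading the constant as r = 5/20 = 0.25 is an interpretation of the text above.

```python
def update_mean(mu_prev, observation, ratio=5 / 20):
    """One step of Eq. (12) with N = 1 and ratio = sigma_N^2 / sigma_{t-1}^2.

    Holding the ratio constant corresponds to not applying Eq. (13).
    """
    w_prior = ratio / (1.0 + ratio)      # weight on the previous mean
    w_obs = 1.0 / (1.0 + ratio)          # weight on the new training result
    return w_prior * mu_prev + w_obs * observation

# Example: a parameter at 3.0 observes a newly trained value of 4.0.
mu_t = update_mean(3.0, 4.0)
```

Because the weights stay constant, every new observation moves the estimate by the same fraction, which is what keeps the parameters adaptable throughout a run.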

3.2.5 Classification

When a new image X arrives, CRFNFP predicts classes of superpixels based on the Maximum Posterior Marginal (MPM) criterion:

l_{i}^{*} = \underset{l_{i} \in ζ}{\arg\max} \sum_{L \setminus l_{i}} \frac{1}{Z} \exp\left\{ \sum_{i \in S} A_{i}(l_{i}, X) + \sum_{i \in S} \sum_{j \in N_{i}} I_{ij}(l_{i}, l_{j}, X) \right\}

(14)

where L ∖ l_i represents the label set of all superpixels except the i-th superpixel, and the posterior marginal is computed through loopy belief propagation (Frey & MacKay. 1998). We classify a superpixel as an obstacle if its corresponding marginal exceeds 0.5, and as ground otherwise.
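For intuition, the MPM rule can be checked on a tiny graph by enumerating every labeling exactly instead of running loopy belief propagation. The sketch below is such a brute-force stand-in, not the inference procedure used in this paper; the function name, the toy potentials and the three-node chain are all hypothetical.

```python
import itertools
import math

LABELS = (-1, 1)   # -1 for ground, +1 for obstacle

def obstacle_marginals(assoc, neighbors, interaction):
    """Exact P(l_i = +1 | X) for a small graph by full enumeration.

    assoc:       list of dicts, assoc[i][label] = A_i(label, X)
    neighbors:   dict, neighbors[i] = list of neighbor indices N_i
    interaction: function (i, j, l_i, l_j) -> I_ij
    """
    n = len(assoc)
    scores = {}
    for labeling in itertools.product(LABELS, repeat=n):
        s = sum(assoc[i][labeling[i]] for i in range(n))
        s += sum(interaction(i, j, labeling[i], labeling[j])
                 for i in range(n) for j in neighbors[i])
        scores[labeling] = s
    top = max(scores.values())                       # for numerical stability
    weights = {lab: math.exp(s - top) for lab, s in scores.items()}
    Z = sum(weights.values())
    return [sum(w for lab, w in weights.items() if lab[i] == 1) / Z
            for i in range(n)]

# Toy chain 0 - 1 - 2: the ends look like ground, the middle weakly like obstacle.
assoc = [{-1: 2.0, 1: 0.0}, {-1: 0.0, 1: 0.5}, {-1: 2.0, 1: 0.0}]
neighbors = {0: [1], 1: [0, 2], 2: [1]}
no_context = obstacle_marginals(assoc, neighbors, lambda i, j, a, b: 0.0)
with_context = obstacle_marginals(assoc, neighbors, lambda i, j, a, b: 1.0 * a * b)
is_obstacle = [p > 0.5 for p in with_context]
```

In this toy case the middle node would be labeled obstacle from appearance alone, but the smoothing interaction pulls its marginal below 0.5, illustrating how spatial context can override an ambiguous local cue.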

4. Experiment results

We ran two sets of experiments, i.e., the classification experiment and the navigation experiment, to compare the relative performance of our CRFNFP framework and another appearance-based approach. In the classification experiment, we used the natural data sets taken from logged field tests conducted by DARPA evaluators (Procopio. 2007b). We analyzed the qualitative classification results of both algorithms and compared their prediction accuracy and robustness under two performance metrics. In the navigation experiment, we implemented the CRFNFP framework on our own UGV to confirm the extended perception range and the more efficient path planning capability of our CRFNFP algorithm.

The appearance-based approach compared with our CRFNFP is a modified k-nearest neighbor (MKNN) algorithm of (Kim, Oh & Rehg. 2007). The major difference between MKNN and our approach lies in the information used to classify novel images. MKNN classifies image superpixels based only on appearance features, while our CRFNFP uses not only feature information but also the spatial contexts among superpixels. It is worth mentioning that, since many other near-to-far learning algorithms (Dahlkamp, Kaehler, Stavens. 2006; Happold, Ollis & Johnson. 2006; Max Bajracharya. 2009) are also based on appearance features only, we assume that the superiority of our CRFNFP framework over the MKNN algorithm also applies to those algorithms.

4.1 Classification experiment

4.1.1 Data Sets

The natural data sets used here are taken from logged field tests conducted by DARPA evaluators (Procopio. 2007b). Overall, three scenarios are considered. Each scenario is associated with two distinct image sequences, each representing a different lighting condition. Thus there are six data sets, i.e., DS3B, DS3A, DS2B, DS2A, DS1B, DS1A, and each data set consists of hundreds of frames. The first 100 frames in each data set are hand-labeled, with each pixel being one of three classes: OBSTACLE, GROUNDPLANE, or UNKNOWN. The data sets are available on the internet (Procopio. 2007a).

4.1.2 Evaluation Metrics

We used precision and recall as the evaluation metrics. These two metrics are defined as follows:

\text{precision} = \frac{\text{No. of correctly labeled pixels in the category}}{\text{total No. of pixels labeled as the category}} \times 100\%

(15)

\text{recall} = \frac{\text{No. of correctly labeled pixels in the category}}{\text{total No. of pixels in the category}} \times 100\%

(16)
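The two metrics can be sketched in a few lines of Python. This is a minimal illustration rather than the evaluation code used in the experiments: the integer label encoding, the function name `precision_recall` and the tiny example images are assumptions made for the sketch. Pixels hand-labeled as UNKNOWN are skipped, in line with the quantitative evaluation in section 4.1.4.

```python
# Hypothetical integer encoding of the three hand-labeled classes.
OBSTACLE, GROUNDPLANE, UNKNOWN = 0, 1, 2


def precision_recall(predicted, truth, category):
    """Return (precision, recall) in percent for one category.

    predicted, truth: 2-D lists of labels with identical shape.
    Pixels whose ground-truth label is UNKNOWN are ignored.
    """
    correct = labeled_as = in_category = 0
    for pred_row, true_row in zip(predicted, truth):
        for pred, true in zip(pred_row, true_row):
            if true == UNKNOWN:
                continue
            if pred == category:
                labeled_as += 1          # denominator of precision
            if true == category:
                in_category += 1         # denominator of recall
            if pred == category and true == category:
                correct += 1             # shared numerator
    precision = 100.0 * correct / labeled_as if labeled_as else float("nan")
    recall = 100.0 * correct / in_category if in_category else float("nan")
    return precision, recall


# Tiny made-up example: a 2 x 3 "image".
truth = [[GROUNDPLANE, GROUNDPLANE, OBSTACLE],
         [GROUNDPLANE, OBSTACLE, UNKNOWN]]
predicted = [[GROUNDPLANE, OBSTACLE, OBSTACLE],
             [GROUNDPLANE, GROUNDPLANE, GROUNDPLANE]]

ground_precision, ground_recall = precision_recall(predicted, truth, GROUNDPLANE)
obstacle_precision, obstacle_recall = precision_recall(predicted, truth, OBSTACLE)
```

In this example two of the three pixels predicted as ground are correct and two of the three true ground pixels are found, so both ground metrics are about 66.7%; both obstacle metrics are 50%.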

4.1.3 Qualitative Results

The qualitative results for the different data sets are shown in Fig. 4, Fig. 6 and Fig. 7, in which the left columns show the original RGB images, the middle columns show the classification results of MKNN, and the right columns show the classification results of CRFNFP. In the classification results, white regions indicate obstacles and black regions correspond to the ground.

DS3A is from a LAGR test run from 2006. The course is a trail with dense, leafy foliage on either side, and the trail proceeds deep into the far field. Some areas to the side of the trail have the appearance of the non-traversable foliage, a tricky aspect of the data set. The scene is generally consistent from start to finish. DS3B is the same scenario as DS3A but with a different lighting condition, and the course appears generally darker. The classification results for typical images in DS3B and DS3A are shown in Fig. 4.

The result comparison in Fig. 4 indicates that CRFNFP tends to produce a continuous and coherent traversable region, which is of significant value for robot navigation in unstructured environments. In contrast, regions classified by MKNN are usually cut into pieces and full of noise due to stereo mismatch and unpredicted far-field appearance features, as shown in Fig. 4e.

Another finding concerns the adaptation to changing lighting conditions during a single run. Fig. 4a and Fig. 4d are taken from the same run but yield very different classification performance, as shown in Fig. 4b and Fig. 4e. A likely reason is that when the lighting condition changes as the robot runs, appearance features in the far field become more unpredictable and vivid (different from the dark appearance of the near field), and the classification database, which is collected during past experience and is the only classification basis for MKNN, can no longer account for the far-field appearance. As a consequence, the mapping assumption (introduced in section 1) from appearance to geometry generally no longer holds. In contrast, our CRFNFP still maintains a coherent traversable region (shown in Fig. 4c and Fig. 4f) and keeps relatively high precision at different time points of a run (Fig. 5). In other words, the incorporation of spatial contexts in CRFNFP compensates for the mapping deviation and makes the vision system more adaptive to lighting changes during a run.
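The effect of a spatial-context term on isolated misclassifications can be illustrated with a toy example. The sketch below is not the CRFNFP model or its inference procedure: it assumes a generic Potts smoothness term on a 4-connected pixel grid, minimized with iterated conditional modes (ICM), and the function name `icm_smooth`, the weight `beta` and the probability grid are all hypothetical.

```python
import math

GROUND, OBSTACLE = 0, 1


def icm_smooth(p_obstacle, beta=1.0, max_sweeps=10):
    """Label a grid by ICM: appearance cost plus a Potts neighbour penalty.

    p_obstacle: 2-D list of per-pixel obstacle probabilities from an
    appearance-only classifier (assumed input). beta weights the penalty
    paid for each 4-neighbour carrying a different label.
    """
    rows, cols = len(p_obstacle), len(p_obstacle[0])
    eps = 1e-6

    def unary(r, c, label):
        # Negative log-likelihood of the label under the appearance model.
        p = min(max(p_obstacle[r][c], eps), 1.0 - eps)
        return -math.log(p if label == OBSTACLE else 1.0 - p)

    # Start from the appearance-only decision (threshold at 0.5).
    labels = [[OBSTACLE if p > 0.5 else GROUND for p in row]
              for row in p_obstacle]

    for _ in range(max_sweeps):
        changed = False
        for r in range(rows):
            for c in range(cols):
                neighbours = [labels[rr][cc]
                              for rr, cc in ((r - 1, c), (r + 1, c),
                                             (r, c - 1), (r, c + 1))
                              if 0 <= rr < rows and 0 <= cc < cols]

                def energy(label):
                    disagreements = sum(n != label for n in neighbours)
                    return unary(r, c, label) + beta * disagreements

                best = min((GROUND, OBSTACLE), key=energy)
                if best != labels[r][c]:
                    labels[r][c] = best
                    changed = True
        if not changed:
            break
    return labels


# Made-up scene: obstacles in the top two rows, ground below, with one
# ambiguous pixel in each region.
probs = [[0.9] * 5 for _ in range(2)] + [[0.1] * 5 for _ in range(3)]
probs[0][2] = 0.4   # obstacle pixel that looks slightly like ground
probs[3][2] = 0.6   # ground pixel that looks slightly like an obstacle

appearance_only = [[OBSTACLE if p > 0.5 else GROUND for p in row] for row in probs]
smoothed = icm_smooth(probs)
```

With these values the appearance-only labeling contains two isolated errors, while the smoothed labeling recovers two coherent regions, because the weak appearance evidence at each ambiguous pixel is outweighed by the agreement of its neighbours.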

Fig. 4.

Results comparison for DS3B Frame 29, 76 and DS3A Frame 97

Fig. 5.

Frame-varying precision of DS3B for ground class

DS2A and DS2B are also logged from a LAGR test run from 2006, under different lighting conditions. The major challenges for the algorithms are that stereo usually struggles on the dense, leafless foliage contained in DS2A and DS2B, that obstacle examples can be very few, and that some areas of the traversable terrain have the same appearance as some of the obstacles (foliage). The classification results on DS2A and DS2B are shown in Fig. 6.

From Fig. 6, we observe that if insufficient training samples for a specific terrain category have been accumulated in the classification database during past experience, MKNN classifies such terrain regions randomly (e.g., the sky area and the upper part of the foliage in Fig. 6a) or even incorrectly (e.g., taking the lower part of the foliage in Fig. 6a as ground), since almost all the training samples in the classification database that have the same appearance as the lower part of the foliage belong to the ground category. As a result, many pseudo-paths appear, as shown in Fig. 6b and Fig. 6e. Such pseudo-paths guide the robot toward the foliage until stereo vision reveals the decision to be inefficient. In contrast, CRFNFP generally recognizes far-field obstacles (foliage) and guides the robot toward the right side ahead of time, endowing the robot with real long-range perception and planning abilities. Fig. 6e shows that although stereo vision has confirmed the lower part of the foliage as obstacle, MKNN still classifies the upper part of the foliage as ground, which highlights the limitation of appearance-only approaches and the necessity of including spatial contexts as part of the basis for classification.

Fig. 6.

Results comparison for DS2B Frame 368, 400 and DS2A Frame 285

DS1A is a very challenging data set due to the difficult lighting conditions. Some image saturation is present and shadows are prominent, as shown in Fig. 7d and Fig. 7g. Many training instances have similar feature data but different class labels, a situation that easily confuses the classifier. DS1B is logged from a different run with a better lighting condition. The result comparison in Fig. 7 shows that MKNN still suffers from random far-field classification and the generation of pseudo-paths, whereas CRFNFP generally achieves better classification results.

Fig. 7.

Results comparison for DS1B Frame 221 and DS1A Frame 178, 291

4.1.4 Quantitative Results

To further highlight the superiority of CRFNFP over the MKNN approach, we collected the average recall and precision results for all six data sets in Table 3. Note that in our experiment, only pixels manually labeled as OBSTACLE or GROUNDPLANE in (Procopio. 2007a) are considered in the computation of precision and recall. The high performance of both CRFNFP and MKNN on the ground category is due to the fact that many ground pixels are located within the perception range of stereo vision and can be easily recognized.

The three header rows of Table 3 give the performance metric, the terrain category and the classification approach for each column. The main findings from Table 3 are listed below.

Table 3.

Quantitative comparison results I

Data Sets	recall				precision
	ground		obstacle		ground		obstacle
	MKNN	CRFNFP	MKNN	CRFNFP	MKNN	CRFNFP	MKNN	CRFNFP
DS2A	96.45	99.81	75.10	83.15	85.04	89.77	93.68	99.67
DS2B	97.88	99.48	51.04	83.11	73.51	89.04	94.80	99.18
DS3A	96.64	98.32	94.42	96.53	93.90	96.28	96.81	98.48
DS3B	93.58	98.95	81.44	93.79	86.11	94.77	92.13	98.79
DS1A	96.10	98.94	48.06	67.21	92.75	94.50	61.55	89.53
DS1B	97.64	98.85	57.68	69.39	84.06	90.77	90.86	96.73

First, CRFNFP generally outperforms MKNN under various combinations of performance metrics and terrain categories. For example, consider the obstacle recall of both algorithms on all data sets, i.e., the values in columns 4 and 5. The increases of CRFNFP over MKNN, in percentage points, are 8.05, 32.07, 2.11, 12.35, 19.15 and 11.71 for DS2A, DS2B, DS3A, DS3B, DS1A and DS1B respectively. Since stereo vision correctly recognizes near-field terrain for both algorithms, the recall increase mainly corresponds to improved recognition of mid- and far-field terrain. Such an increase, even when small, is therefore of great value for the long-range perception of mobile robots. One example is shown in Fig. 6b and Fig. 6c, where the increase in recall mainly concerns the foliage in the far field. Such earlier recognition of obstacles (hazards) in the far field would greatly improve navigation efficiency, which is extremely important for tasks such as search and rescue.

The second finding is related to the robustness of classification. Consider the results for DS2A and DS2B, which are logged from the same scenario but on a different day and under a different lighting condition. The obstacle recall of MKNN is 75.10 on DS2A but drops to 51.04 on DS2B, and the ground precision of MKNN is 85.04 on DS2A but 73.51 on DS2B. In other words, the performance of MKNN, based on appearance only, is sensitive to the specific lighting conditions and other potential factors. In contrast, the difference between the obstacle recalls of CRFNFP on DS2A and DS2B is merely 0.04 and the ground precision difference is 0.73, which indicates that CRFNFP is more robust than MKNN with respect to lighting conditions and other factors. Similar comparisons can be made for the other data sets, which highlights the classification stability and robustness of our CRFNFP framework.
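The figures quoted in these two findings follow directly from Table 3 and can be recomputed as a check. The sketch below copies the table values; the dictionary layout and the helper names `gain` and `lighting_gap` are illustrative choices, not part of the original evaluation.

```python
# Values copied from Table 3. Each row holds four (MKNN, CRFNFP) pairs:
# recall ground, recall obstacle, precision ground, precision obstacle.
TABLE3 = {
    "DS2A": ((96.45, 99.81), (75.10, 83.15), (85.04, 89.77), (93.68, 99.67)),
    "DS2B": ((97.88, 99.48), (51.04, 83.11), (73.51, 89.04), (94.80, 99.18)),
    "DS3A": ((96.64, 98.32), (94.42, 96.53), (93.90, 96.28), (96.81, 98.48)),
    "DS3B": ((93.58, 98.95), (81.44, 93.79), (86.11, 94.77), (92.13, 98.79)),
    "DS1A": ((96.10, 98.94), (48.06, 67.21), (92.75, 94.50), (61.55, 89.53)),
    "DS1B": ((97.64, 98.85), (57.68, 69.39), (84.06, 90.77), (90.86, 96.73)),
}
RECALL_OBSTACLE, PRECISION_GROUND = 1, 2   # indices into a row
MKNN, CRFNFP = 0, 1                        # indices into a pair


def gain(dataset, metric):
    """Percentage-point improvement of CRFNFP over MKNN."""
    mknn, crfnfp = TABLE3[dataset][metric]
    return round(crfnfp - mknn, 2)


def lighting_gap(dataset_a, dataset_b, metric, method):
    """Absolute difference of one method between two lighting conditions."""
    a = TABLE3[dataset_a][metric][method]
    b = TABLE3[dataset_b][metric][method]
    return round(abs(a - b), 2)


obstacle_recall_gains = {name: gain(name, RECALL_OBSTACLE) for name in TABLE3}
mknn_recall_gap = lighting_gap("DS2A", "DS2B", RECALL_OBSTACLE, MKNN)
crfnfp_recall_gap = lighting_gap("DS2A", "DS2B", RECALL_OBSTACLE, CRFNFP)
mknn_precision_gap = lighting_gap("DS2A", "DS2B", PRECISION_GROUND, MKNN)
crfnfp_precision_gap = lighting_gap("DS2A", "DS2B", PRECISION_GROUND, CRFNFP)
```

The same helpers can be applied to the DS3 and DS1 pairs to compare the stability of the two methods across lighting conditions on the other scenarios.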

Table 4 lists the comparison results of a linear SVM and CRFNFP. Readers may refer to (Max Bajracharya. 2009) for the parameter selection for linear SVM. The results also reflect the superiority of CRFNFP.

Table 4.

Quantitative comparison results II

Data Sets	recall				precision
	ground		obstacle		ground		obstacle
	SVM	CRFNFP	SVM	CRFNFP	SVM	CRFNFP	SVM	CRFNFP
DS2A	92.33	99.81	72.40	83.15	86.40	89.77	92.55	99.67
DS2B	97.70	99.48	53.80	83.11	68.42	89.04	95.20	99.18
DS3A	92.45	98.32	91.50	96.53	90.10	96.28	95.50	98.48
DS3B	91.20	98.95	77.21	93.79	83.38	94.77	90.10	98.79
DS1A	94.56	98.94	52.89	67.21	92.80	94.50	63.30	89.53
DS1B	93.33	98.85	56.64	69.39	81.80	90.77	87.40	96.73

4.2. Navigation experiment

To confirm that CRFNFP does enhance long-range perception for mobile robots and helps plan more efficient paths, we conducted outdoor navigation experiments with our UGV (shown in Fig. 8), a four-wheeled, 8-DOF mobile robot in which each wheel is individually driven and steered to obtain the desired maneuverability. Our experiment field (Fig. 9) is a deserted playground at Nanjing Agricultural University (32°7'46.26"N, 118°41'28.66"E), containing grass, foliage, rocks and ground. We assume that the experiment field is generally even, with modest dips and rises, and that the most common obstacles are foliage and tall grass. We make no assumption about the shape of the traversable region, which is determined entirely by the CRFNFP classification. Navigation in more challenging or hilly scenes is left to future work.

The task of the robot was to reach a goal more than 200 meters away from the start point. The goal was specified by global positioning system (GPS) coordinates and, since we were planning in image space, we projected the goal into the image plane, assuming that the ground is flat. The task was considered successful when the distance between the robot and the goal, calculated and transformed from the GPS readings, fell below 50 centimeters. We used an AgGPS receiver (positioning error of 20 cm) and a PointGray Bumblebee stereo unit (reliable obstacle detection up to 5 m) in our navigation experiment.

During runs, near-field stereo information was used not only to update the near-field map but also to construct the classification database for far-field terrain prediction. The cost of a move to a pixel was determined by the probability of obstacle, which, in turn, was generated by the CRFNFP framework. The cost image is shown in Fig. 10, in which the darker a pixel is, the more easily it can be traversed. We performed an A* search on the cost image to find a pixel-to-pixel path to the goal pixel. A number of further details, although worth mentioning, are not expanded here since they are not the focus of this paper.
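The planning step can be sketched as a standard A* search over a per-pixel obstacle-probability image. This is a minimal illustration under stated assumptions, not the planner used on the robot: the 8-connected neighbourhood, the move cost `step_length * (1 + penalty * p_obstacle)`, the `penalty` value and the function name `astar_on_cost_image` are all hypothetical. Because every move costs at least its geometric length, the Euclidean distance to the goal is an admissible heuristic.

```python
import heapq
import math

# 8-connected pixel moves (row offset, column offset).
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1),
         (-1, -1), (-1, 1), (1, -1), (1, 1)]


def astar_on_cost_image(p_obstacle, start, goal, penalty=50.0):
    """Return a pixel-to-pixel path from start to goal, or None.

    p_obstacle: 2-D list of obstacle probabilities in [0, 1].
    start, goal: (row, col) tuples inside the image.
    Assumed move cost: step_length * (1 + penalty * p_obstacle[target]).
    """
    rows, cols = len(p_obstacle), len(p_obstacle[0])

    def heuristic(node):
        # Straight-line distance never overestimates the remaining cost.
        return math.hypot(node[0] - goal[0], node[1] - goal[1])

    best_g = {start: 0.0}
    parent = {start: None}
    open_heap = [(heuristic(start), 0.0, start)]

    while open_heap:
        _, g, node = heapq.heappop(open_heap)
        if g > best_g[node]:
            continue  # stale heap entry, a cheaper route was found later
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for dr, dc in MOVES:
            nxt = (node[0] + dr, node[1] + dc)
            if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols):
                continue
            step = math.hypot(dr, dc)
            new_g = g + step * (1.0 + penalty * p_obstacle[nxt[0]][nxt[1]])
            if new_g < best_g.get(nxt, math.inf):
                best_g[nxt] = new_g
                parent[nxt] = node
                heapq.heappush(open_heap, (new_g + heuristic(nxt), new_g, nxt))
    return None


# Made-up 5 x 5 cost image: a wall of certain obstacles in column 2,
# open only in the bottom row.
image = [[0.0] * 5 for _ in range(5)]
for r in range(4):
    image[r][2] = 1.0

path = astar_on_cost_image(image, start=(0, 0), goal=(0, 4))
```

In this example the planner detours through the gap at pixel (4, 2) rather than crossing the wall, since a single step onto a wall pixel costs more than the whole detour.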

Fig. 8.

UGV used for navigation experiment

Fig. 9.

Experiment field

Fig. 10.

Path planning in image space using the A* algorithm (Hart, Nilsson & Raphael. 1968): (a) CRFNFP & frame 22; (b) MKNN & frame 22; (c) CRFNFP & frame 84; (d) MKNN & frame 84. Note that although paths in the image are planned pixel to pixel, we have drawn them using straight lines between sampled points (large dots) for greater clarity.

In total, we performed six runs on the same scenario: three using CRFNFP and three using MKNN. In all runs, the robot successfully reached the goal. Running times using CRFNFP were 249s, 263s and 258s, with an average of 257s, whereas running times using MKNN were 334s, 378s and 288s, with an average of 333s. The discrepancy in average running time indicates that paths planned from CRFNFP classification results were more efficient than those planned from MKNN classification results. This discrepancy can be explained by two inefficiency modes (shown in Fig. 10b and Fig. 10d respectively), which, in turn, were caused by the intrinsic randomness of appearance-based long-range approaches. First, possibly due to the shortage of corresponding training samples in the classification database for far-field trees, as shown in Fig. 10b, MKNN misclassified part of the far-field trees as ground, and in order to minimize the overall cost, the corresponding path first went to the right side and then turned left in the far field. As a result, the robot turned right at its current position, while the better choice would have been to keep going straight or turn left slightly. The second inefficiency mode is shown in Fig. 10d, in which the robot, based on the MKNN classification result, found the "shortest" pseudo-path on the left. Consequently, the robot turned left sharply and approached tall grass (obstacle) until the stereo correction arrived, which usually took several extra seconds.
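The averages above can be reproduced from the six reported running times. The short sketch below also derives the relative time saving, which is not stated in the text and is computed here only as an illustration; the variable names are arbitrary.

```python
from statistics import mean

# Running times in seconds, as reported for the six runs.
crfnfp_times_s = [249, 263, 258]
mknn_times_s = [334, 378, 288]

mean_crfnfp = mean(crfnfp_times_s)   # about 257 s
mean_mknn = mean(mknn_times_s)       # about 333 s

# Relative reduction in average running time when planning on CRFNFP output.
relative_saving_percent = 100.0 * (mean_mknn - mean_crfnfp) / mean_mknn
```

With only three runs per method, the resulting saving of roughly 23% is best read as an indication rather than a statistically established figure.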

In addition, the results in Fig. 10 and Fig. 11 confirm that CRFNFP did enhance the long-range perception ability of the robot. CRFNFP recognized nearly the whole drivable region in the images, and the perception range usually reached up to 80 meters (much larger than the 5 meters of the stereo perception region). Appearance-based approaches, e.g., MKNN, also provided the robot with a similar long-range perception capability. However, their randomness (Fig. 10b) and tendency toward misclassification (Fig. 10d) counteracted the benefit of long-range perception.

Fig. 11.

Frames 022, 065, 084, 086, 088 from navigation experiment

In our experiments, the CRFNFP algorithm ran at 1.1 Hz on color images with 320 × 240 resolution, whereas the MKNN algorithm ran at 2.3 Hz. We implemented the CRFNFP algorithm using multithreaded programming under Visual C++ 6.0. The CPU in the robot is a 2.26 GHz Intel Core Duo P8400.

5. Conclusion and future work

In this paper, we proposed a new statistical prediction framework, CRFNFP, in the context of near-to-far terrain learning and perception for mobile robots. Compared with other existing near-to-far learning approaches, the CRFNFP framework incorporates not only the appearance features of far-field terrain but also the spatial contexts among terrain regions. Our original contribution is the design of a specific CRF framework, i.e., CRFNFP, for self-supervised, near-to-far learning in unstructured outdoor environments. The results from both experiments showed that CRFNFP outperformed appearance-only approaches in terms of accuracy, robustness and adaptability to unstructured outdoor environments.

In the future, we plan to enhance CRFNFP with other contexts such as temporal contexts, e.g., the temporal relationship between the current frame and the next. Another concern of our future study is how to dynamically select and combine various contexts, based on observations from the current scene, to better classify that scene.

6. Acknowledgement

This work is supported by the National High-Tech Research and Development Program of China (2006AA10A304, 2006AA10Z259, 2008AA100905).

References

Dahlkamp, Kaehler, Stavens, Thrun & Bradski (2006). Self-supervised monocular road detection in desert terrain, Proceedings of Robotics: Science & Systems.

Felzenszwalb, P. F. & Huttenlocher, D. P. (2004). Efficient graph-based image segmentation. International Journal of Computer Vision, Vol. 59. No. 2. pp. 167–181.

Fischler, M. A. & Bolles, R. C. (1981). Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, Vol. 24. No. 6. pp. 381–395.

Frey, B. J. & MacKay, D. J. C. (1998). A revolution: Belief propagation in graphs with cycles. Advances in Neural Information Processing Systems, Vol. 10. pp. 479–485.

Hadsell, Sermanet, Ben, Erkan, Scoffier, Kavukcuoglu, Muller & LeCun (2009). Learning long-range vision for autonomous off-road driving. Journal of Field Robotics, Vol. 26. No. 2. pp. 120–144, 1556–4959

Happold, Ollis & Johnson (2006). Enhancing supervised terrain classification with predictive unsupervised learning, Proceedings of Robotics: Science and Systems, Philadelphia, PA. Cambridge

Hart, P. E., Nilsson, N. J. & Raphael (1968). A Formal Basis for the Heuristic Determination of Minimum Cost Paths. IEEE Transactions on Systems, Man and Cybernetics, Vol. 4. No. 2. pp. 100–107

Jackel, L. D., Krotkov, Perschbacher, Pippine & Sullivan (2006). The DARPA LAGR program: Goals, challenges, methodology, and phase I results. Journal of Field Robotics, Vol. 23. No. 11–12. pp. 945–973.

Kim, Oh, S. M. & Rehg, J. M. (2007). Traversability classification for UGV navigation: A comparison of patch and superpixel representations, Proceedings of IEEE International Conference on Intelligent Robots and Systems.

Lafferty, J. D., McCallum & Pereira, F. C. N. (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, Proceedings of The Eighteenth International Conference on Machine Learning.

Leung & Malik (2001). Representing and recognizing the visual appearance of materials using three-dimensional textons. International Journal of Computer Vision, Vol. 43. No. 1. pp. 29–44.

Lookingbill, Lieb & Thrun (2007). Optical Flow Approaches for Self-supervised Learning in Autonomous Mobile Robot Navigation, In: Autonomous Navigation in Dynamic Environments, pp. 29–44, Springer Berlin/Heidelberg, 978-3-540-73421-5.

Matthies (1992). Stereo vision for planetary rovers: Stochastic modeling to near real-time implementation. International Journal of Computer Vision, Vol. 8. No. 1. pp. 71–91.

Bajracharya A. H. L. H. M. B. T. M. T. Max (2009). Autonomous off-road navigation with end-to-end learning for the LAGR program. Journal of Field Robotics, Vol. 26. No. 1. pp. 3–25, 1556–4967

Michael Procopio J. M. G. G. (2009). Learning terrain segmentation with classifier ensembles for autonomous robot navigation in unstructured environments. Journal of Field Robotics, Vol. 26. No. 2. pp. 145–175, 1556–4967

Ollis, Huang, W. H., Happold & Stancil, B. A. (2008). Image-based path planning for outdoor mobile robots, Proceedings of IEEE International Conference on Robotics and Automation, Pasadena, CA

Pagnot & Grandjean (1995). Fast cross-country navigation on fair terrains. Proceedings - IEEE International Conference on Robotics and Automation, Vol. 3. pp. 2593–2598.

Procopio, M. J. (2007a). Hand-labeled DARPA LAGR data sets. Available from: http://ml.cs.colorado.edu/~procopio/labeledlagrdata/, Accessed: 2009-04-11

Procopio, M. J. (2007b). An experimental analysis of classifier ensembles for learning drifting concepts over time in autonomous outdoor robot navigation, Department of Computer Science, University of Colorado at Boulder, Ph.D.

Rieder & Southall (2002). Stereo perception on an off-road vehicle. Proc. IEEE Intelligent Vehicle Symposium, Vol. 1. pp. 221–226.

Singh, Simmons, Smith, Stentz, Verma, Yahja & Schwehr (2000). Recent progress in local and global traversability for planetary rovers. Proceedings - IEEE International Conference on Robotics and Automation, Vol. 2. pp. 1194–1200.

Table 1.

Feature description

Type	Description	Dim
CIELAB value	CIELAB mean	3
LM Average	Average filter responses from LM filter set	18
LM Maximum	Histogram of maximum filter responses	18
Total number of dimensions		39
