Monocular Road Detection Using Structured Random Forest

Abstract

Road detection is a key task for autonomous land vehicles. Monocular vision-based road-detection algorithms are mostly based on machine learning approaches and are usually cast as classification problems. However, the pixel-wise classifiers are faced with the ambiguity caused by changes in road appearance, illumination and weather. An effective way to reduce the ambiguity is to model the contextual information with structured learning and prediction. Currently, the widely used structured prediction model in road detection is the Markov random field or conditional random field. However, the random field-based methods require additional complex optimization after pixel-wise classification, making them unsuitable for real-time applications. In this paper, we present a structured random forest-based road-detection algorithm which is capable of modelling the contextual information efficiently. By mapping the structured label space to a discrete label space, the test function of each split node can be trained in a similar way to that of the classical random forests. Structured random forests make use of the contextual information of image patches as well as the structural information of the labels to get more consistent results. Besides this benefit, by predicting a batch of pixels in a single classification, the structured random forest-based road detection can be much more efficient than the conventional pixel-wise random forest. Experimental results tested on the KITTI-ROAD dataset and data collected in typical unstructured environments show that structured random forest-based road detection outperforms the classical pixel-wise random forest both in accuracy and efficiency.

Keywords

Road Detection, Structured Random Forest

1. Introduction

Road detection is a key task for outdoor mobile robots like autonomous land vehicles (ALVs), which have been studied for decades [1]. Although various sensors, such as the monocular camera [2, 3], stereo camera [4] and laser scanner [5], have been employed to detect the road, the monocular camera is one of the most frequently used because it is inexpensive and rich in information. In highly structured environments like highways with well-painted markings, one can use lane detection to find the drivable area. However, in general urban or rural areas, the roads may not be marked or the markings may be damaged. This kind of general road detection is considered to be more difficult than lane-marking detection. In addition, the variations of road types and the presence of shadows, specularities and puddles caused by changes in weather and illumination make it even more challenging.

General monocular road-detection systems mostly use machine-learning algorithms to classify the pixels of a given image as road or background. In the literature, various machine-learning methods have been applied in road detection. However, most of these methods classify each pixel independently, ignoring the contextual interaction. Thus, the outputs are usually very noisy due to the ambiguity of the appearance of a single pixel. Although some methods take the image patch or superpixel as the classification unit, the binary label space may lead to inaccurate output boundaries. Structured prediction [6] is an effective method for modelling the interaction with the context. A commonly used structured prediction method in computer vision is the Markov random field (MRF) or conditional random field (CRF) [7, 8]. However, the optimization of pixel-wise random fields is often too time consuming to be used in real-time systems. In this paper, we propose employing the structured random forest method [9, 10] to classify the image patch in a structured manner. With an efficient mapping of structured labels to discrete labels, the structured random forests can be trained in a similar way to the traditional random forests [10]. The proposed structured random forest-based road detection method exploits the contextual information of the image and the structural information of the label patch to improve the performance. In addition, by predicting a batch of pixels in a single classification, the proposed method is very fast at run time. Extensive experiments show that the structured random forest-based road detection can achieve better results than the traditional pixel-wise random forest, the patch-wise random forest and even the CRF-based method, which is much more time consuming.

The rest of this paper is organized as follows. Section 2 briefly introduces the related work regarding road detection and structured random forests. In Section 3, we briefly review the training and testing of random forest classifiers. Then, in Section 4, we introduce the structured random forest-based road detection in detail. The image-feature extraction is presented in Section 5, and in Section 6 we describe some comparative experiments that we conducted to validate the proposed method. Finally, we draw conclusions in Section 7.

2. Related Work

Road detection has been a hot research topic for decades, and many road-detection algorithms have been developed. Monocular road-detection algorithms can be roughly divided into two classes: edge-based and region-based. Edge-based methods are widely used in structured environments with clearly delineated boundaries. Usually, a road model is assumed and some low-level cues are extracted to fit the parameters of the model [11–13]. A Kalman filter [12] or a particle filter [11, 13] is then employed to track the parameters. In addition, vanishing point detection can be utilized as a constraint for road edge detection [2]. In this paper, we focus on region-based methods.

Region-based road detection usually uses machine-learning approaches to learn a classifier using labelled samples and then to classify the pixels into two classes (road and background) or sometimes three classes (road, sky and others [14]). The training samples are acquired by either making the assumption that the central-lower part of the testing images always belongs to the road surface [15] or manually labelling the offline collected images [16]. After getting the training samples, the two most influential factors are the feature extraction and the classification approaches employed. For feature extraction, various kinds of image features have been investigated. Colour is the most commonly used feature, such as different colour space representation (HSI [17], LAB [14] and normalized R-G [4, 15]), colour histograms [18] and illumination invariant images [4, 19, 20]. Other frequently used features include texture filter responses [4, 21] and other texture descriptors [22]. Recently, feature learning has become a hot topic in computer vision. Various learning-based feature-extraction methods have been used in road detection, such as slow feature analysis [23], dictionary learning [24] and convolutional neural network (CNN) [3]. Currently, deep learning is popular in feature learning and has achieved great success in various computer-vision problems. In [25], Mohan proposed combining deep deconvolutional neural networks with CNNs for road scene parsing, and good performance was achieved. However, deep learning models are often complex and rely on the latest graphic processing unit (GPU) for fast computing.

For machine-learning approaches, various methods have been employed, such as mixture of Gaussians [26], support vector machines [27], boosting [21, 28], random forest [4, 29] and neural networks [30, 31]. However, these methods usually classify the pixels independently, and therefore they do not take the contextual information into consideration. Some methods take the image patch or superpixel as the classification unit, but they classify the whole patch or superpixel as road or background, leading to degraded resolution and inaccurate zigzag boundaries. The contextual information can be beneficial for the overall classification performance. In [18] and [32], the importance of contextual information in road detection is emphasized. These methods generate many road-probability maps according to the locations or the scene types and then use the global positioning system or scene classification to select a more suitable probability map for guiding the online road detection. However, they rely on extra sensory data or expensive online computation. In addition, they cannot adapt to new scenes. Alternatively, structured prediction is an effective approach for modelling the interaction between the pixel and its context [6]. In road detection, a widely used structured prediction method is the Markov random field or conditional random field [7, 8]. However, inference in random fields is often time consuming and unusable in real-time systems, especially for high-resolution images. Considering the need for structured prediction, and the rapid training and testing of random forests, Kontschieder et al. [9] proposed a method by which the random forests can be augmented with structured labels. The structured label space poses a challenge in selecting the best test functions during the training process. In [9], the authors used a single pixel label randomly sampled from the patch to represent the label of the patch and then trained the test functions in the manner of the traditional random forests. In [10], Dollár and Zitnick proposed an improved strategy for training structured random forests. They first map the structured label to a dimensionally reduced binary vector and then perform K-means clustering or principal component analysis (PCA) in order to obtain the discretized label. This method was used for edge detection and achieved state-of-the-art results.

Inspired by [10], this paper proposes utilizing structured random forest for rapid road detection. With each image patch predicted using a structured label, both the contextual information and the structural information of the label can be taken into account. By formulating the road detection as a binary classification problem, the patch label can be readily rearranged into a binary vector and then mapped to a discrete label with K-means clustering or PCA. Thus, the training of structured random forests can be performed in a similar way to that of classical random forests. During testing, a batch of pixels are predicted within one classification, making the algorithm very efficient and suitable for real-time vision navigation of ALVs.

3. Random Forest Classifier

In this section, we give a brief review of the random forest classifier. A random forest [33, 34] is an ensemble of $N$ independently trained decision trees. Each decision tree $T_{i} (x)$ consists of a collection of test functions that are organized into a tree structure. A sample $x$ is classified by recursively branching left or right down the tree until a leaf node is reached. Each non-leaf node (split node) is associated with a test function (split function) $h (x, θ_{j})$. While any classification model can be used for the test function, the most popular one is the decision stump:

$$h (x, θ_{j}) = \begin{cases} 0, & x_{k} \leq t, \\ 1, & x_{k} > t, \end{cases} \tag{1}$$

in which $k$ is the feature index and $t$ is the threshold value. A prediction model is attached to each leaf node. The most frequently used prediction model is the leaf statistics captured with the conditional distribution $p (c | x)$, where $c$ represents the categorical labels. Sometimes, the maximum a posteriori (MAP) estimate $c^{\star} = \arg\max_{c \in C} p (c | x)$ is employed instead.
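The traversal described above can be sketched as follows. This is only a minimal illustration, not the implementation used in the paper: the `Node` layout and the function names are hypothetical, and the leaf simply stores $p (c | x)$ as a dictionary.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass
class Node:
    """One tree node (hypothetical layout, for illustration only)."""
    feature_index: int = 0                          # k in Eq. (1); unused at leaves
    threshold: float = 0.0                          # t in Eq. (1); unused at leaves
    left: Optional["Node"] = None                   # followed when h(x, theta) = 0
    right: Optional["Node"] = None                  # followed when h(x, theta) = 1
    class_probs: Optional[Dict[int, float]] = None  # p(c | x), set only at leaves


def stump(x: Sequence[float], k: int, t: float) -> int:
    """Decision stump of Eq. (1): 0 if x[k] <= t, otherwise 1."""
    return 0 if x[k] <= t else 1


def predict_tree(root: Node, x: Sequence[float]) -> Dict[int, float]:
    """Branch left or right until a leaf is reached and return its p(c | x)."""
    node = root
    while node.class_probs is None:
        go_left = stump(x, node.feature_index, node.threshold) == 0
        node = node.left if go_left else node.right
    return node.class_probs


def map_label(class_probs: Dict[int, float]) -> int:
    """MAP estimate: the class with the highest leaf probability."""
    return max(class_probs, key=class_probs.get)
```

For example, a depth-one tree that tests feature 1 against 0.5 sends `[0.0, 0.7]` to its right leaf and `[0.0, 0.5]` to its left leaf.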

3.1. Decision tree training

During training, for each node $j$ and the incoming training set $S_{j}$, we seek the best split function $h (x, θ_{j})$ to split $S_{j}$ into $S_{j}^{L}$ and $S_{j}^{R}$, where $S_{j}^{L} = \{x \in S_{j} \mid h (x, θ_{j}) = 0\}$ and $S_{j}^{R} = \{x \in S_{j} \mid h (x, θ_{j}) = 1\}$.

For classification trees, usually the maximum information gain is sought:

$$θ_{j} = \underset{θ \in Θ_{j}}{\arg\max}\; I (S_{j}, θ) = \underset{θ \in Θ_{j}}{\arg\max} \left[ H (S_{j}) - \sum_{i \in \{L, R\}} \frac{| S_{j}^{i} |}{| S_{j} |} H (S_{j}^{i}) \right], \tag{2}$$

where $H (S)$ is the entropy of set S which is defined as $H (S) = - \sum_{c} p (c) \log p (c)$ .
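As a rough illustration of Eq. (2), the sketch below evaluates the information gain of candidate thresholds on a single feature and keeps the best one. The function names are hypothetical, and the natural logarithm is an assumption, since the text does not fix the base.

```python
import math
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def entropy(labels: Sequence[int]) -> float:
    """H(S) = -sum_c p(c) log p(c); natural logarithm assumed."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((m / n) * math.log(m / n) for m in Counter(labels).values())


def information_gain(labels: Sequence[int],
                     left: Sequence[int],
                     right: Sequence[int]) -> float:
    """I(S, theta) = H(S) - sum over {L, R} of |S^i| / |S| * H(S^i)."""
    n = len(labels)
    gain = entropy(labels)
    for part in (left, right):
        gain -= len(part) / n * entropy(part)
    return gain


def best_threshold(values: Sequence[float],
                   labels: Sequence[int],
                   candidates: Sequence[float]) -> Tuple[Optional[float], float]:
    """Return the candidate threshold with maximum information gain."""
    best_t, best_gain = None, -1.0
    for t in candidates:
        left: List[int] = [c for v, c in zip(values, labels) if v <= t]
        right: List[int] = [c for v, c in zip(values, labels) if v > t]
        gain = information_gain(labels, left, right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain
```

A threshold that separates the two classes perfectly recovers the full entropy of the parent set as gain, while a threshold that leaves both children as mixed as the parent gains nothing.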

3.2. Randomness and the ensemble model

Random forest is an ensemble model of the N decision trees. Randomness plays an important role in the performance of the whole forest. Randomness can be injected via random training set sampling and randomized node optimization. Randomness helps to reduce possible overfitting and improve the generalization capabilities.
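The two sources of randomness can be sketched as below. This is an illustrative assumption rather than the paper's procedure: the training set of each tree is drawn with replacement (bagging), and each node only evaluates a small random set of (feature index, threshold) candidates, with each threshold taken from the value of a randomly chosen sample. All names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def bootstrap_sample(samples: Sequence[Sequence[float]],
                     rng: random.Random) -> List[Sequence[float]]:
    """Random training set sampling: draw len(samples) items with replacement."""
    return [samples[rng.randrange(len(samples))] for _ in samples]


def random_split_candidates(samples: Sequence[Sequence[float]],
                            n_splits: int,
                            rng: random.Random) -> List[Tuple[int, float]]:
    """Randomized node optimization: propose n_splits (k, t) pairs.

    Assumption: the threshold t is the k-th feature value of a random sample.
    """
    dim = len(samples[0])
    candidates = []
    for _ in range(n_splits):
        k = rng.randrange(dim)
        t = rng.choice(samples)[k]
        candidates.append((k, t))
    return candidates
```

Each tree would then be grown on its own bootstrap sample, and each split node would pick the best of its random candidates with the information-gain criterion of Eq. (2).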

4. Structured Random Forest-based Road Detection

Random forest is popular for several reasons: 1) it is simple to implement; 2) it is fast in training and even faster in testing; 3) it is resistant to overfitting and 4) it is fully parallelizable. However, when applied in road detection, pixel-wise random forest classifiers predict each pixel independently, ignoring the interaction between the neighbouring pixels. This kind of method may thus generate inconsistent outputs, which can be extremely noisy. Some researchers have tried to classify each patch or superpixel instead of individual pixels, which partially utilizes the contextual information. However, these methods map each segment to a discrete label and therefore cause a reduction in the resolution of the final output, especially at the boundaries. A more effective solution to the problem is structured prediction [6]. In structured prediction, the prediction function maps the input domain $X$ to a structured label space $Y$, instead of to the discrete label space $C$. The conditional random field (CRF) is a structured prediction approach which is used widely in computer vision. However, the CRF is a probabilistic graphical model and its inference relies on time-consuming message passing or graph cutting. Inspired by the work of Kontschieder et al. [9] and Dollár and Zitnick [10] on structured random forests, we propose employing the much more efficient structured random forest for locally consistent road detection.

4.1. Formulation

In monocular vision-based road detection, we want to classify the pixels into road ($c = 1$) or background ($c = 0$). In structured random forest, we take the $d \times d$ image patch rather than the pixel as the classification unit. The structured binary label space then consists of $d \times d$ label patches $p$, with $p_{i j} \in \{0, 1\}$ denoting the label of the $(i, j)$ entry of the patch.

Figure 1.

Illustration of structured labels used in road detection. On the left is the ground-truth of a typical road image and on the right is the zoomed view of the corresponding patches, marked with boxes of the same colour used in the left image.

For a general labelled road image, we observed that the label patches are organized not randomly but in a highly structured manner. As is shown in Figure 1, the four boxes with different colours represent four typical kinds of structured labels, that is, all road, all background, partial road on the left boundary and partial road on the right boundary. This observation is usually ignored by the traditional pixel-wise classifiers. However, it can be exploited to improve the performance of road detection with structured random forests.

4.2. Structured random forest training

In random forests with discrete label spaces, we seek the best split parameter with the maximum information gain. However, in structured random forests, the structured label spaces are high-dimensional and much more complex. This leads to two problems in the training process: one is the prohibitive complexity of evaluating the candidate splits, and the other, which is more critical, is that the information gain over the structured labels may not be well defined [10]. These are the key problems to be solved in structured random forests. The common solution is to build a map between the structured labels and discrete labels. In [9], Kontschieder et al. proposed two schemes to obtain the discrete label: one is to take the label of the central pixel of the patch, and the other is to take the pixel label of a position that is drawn randomly within the patch. These intuitive methods fail to make full use of the holistic information of the label patch, and the use of a randomly selected or fixed single-position label makes the training less effective. To solve these problems, [10] proposed a two-stage approach for mapping the structured labels to a discrete label space. This approach first maps the structured labels to long binary vectors and then performs K-means clustering or principal component analysis (PCA) to obtain the discrete labels. This method is more effective because it takes the whole label patch into account.

In road detection, the structured label is a binary patch and we can readily rearrange the patch label to get a binary vector. Then, following [10], we can employ K-means clustering or PCA to get the corresponding discrete labels. In K-means clustering, we set $K = 2$ and use the Hamming distance for fast computation. Each structured label is then discretized as 1 or 0 according to the cluster centre to which it is assigned. Figure 2 shows an example of discretization with K-means clustering. Note that the road patches and the background patches are mostly separated as desired, which illustrates the effectiveness of mapping the structured label to a discrete label.

Figure 2.

Discretized labels with K-means clustering. The label patches of the samples that arrive at a certain internal node are mapped to the discrete label 1 (the left image) or 0 (the right image).
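A minimal sketch of the K-means discretization with $K = 2$ and the Hamming distance is given below. Several details are assumptions not stated in the text: the centres are kept binary and updated by a per-bit majority vote, the initial centres are the first vector and the vector farthest from it, and the resulting ids 0 and 1 are arbitrary cluster names.

```python
from typing import List, Sequence


def flatten(patch: Sequence[Sequence[int]]) -> List[int]:
    """Rearrange a d x d binary label patch into a binary vector."""
    return [v for row in patch for v in row]


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def kmeans2_hamming(vectors: Sequence[Sequence[int]], iters: int = 10) -> List[int]:
    """Assign each binary vector to one of two clusters (discrete label 0 or 1)."""
    c0 = list(vectors[0])
    c1 = list(max(vectors, key=lambda v: hamming(v, c0)))  # farthest from c0
    centres = [c0, c1]
    assign = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest centre under the Hamming distance.
        assign = [0 if hamming(v, centres[0]) <= hamming(v, centres[1]) else 1
                  for v in vectors]
        # Update step: per-bit majority vote keeps the centres binary.
        for k in (0, 1):
            members = [v for v, a in zip(vectors, assign) if a == k]
            if members:
                centres[k] = [1 if 2 * sum(col) > len(members) else 0
                              for col in zip(*members)]
    return assign
```

On a toy set of 2 x 2 patches, the all-road and mostly-road patches end up in one cluster and the all-background and mostly-background patches in the other.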

Compared to K-means clustering, PCA-based discretization is much more computationally efficient. We project the long binary vector onto the first principal direction and assign a discrete label 0 or 1 according to whether or not the projection falls on the positive semi-axis. Figure 3 shows the results of discretization using PCA on the same samples as in Figure 2. We can see that there is no significant discrepancy between these two discretizing methods. Considering the computational efficiency, we employed the PCA-based discretization in this paper.

Figure 3.

Discretized labels with PCA. The same samples and presentation scheme are used as in Figure 2.
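The PCA-based discretization described above might look like the following sketch. It assumes that the binary vectors are mean-centred before projection and estimates the first principal direction with power iteration; neither detail is specified in the text, and the function name is hypothetical. Because the sign of a principal direction is arbitrary, the ids 0 and 1 only name the two groups.

```python
import math
from typing import List, Sequence


def pca_discretize(vectors: Sequence[Sequence[float]], iters: int = 100) -> List[int]:
    """Label each vector by the sign of its projection on the first principal direction."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(col) / n for col in zip(*vectors)]
    centred = [[v - m for v, m in zip(row, mean)] for row in vectors]

    # Power iteration on X^T X without forming the covariance matrix.
    w = [float(i + 1) for i in range(dim)]  # arbitrary non-zero start (assumption)
    for _ in range(iters):
        proj = [sum(a * b for a, b in zip(row, w)) for row in centred]
        new_w = [sum(p * row[j] for p, row in zip(proj, centred)) for j in range(dim)]
        norm = math.sqrt(sum(v * v for v in new_w))
        if norm == 0.0:  # all vectors identical: no direction to find
            break
        w = [v / norm for v in new_w]

    # Discrete label 1 if the projection is on the positive semi-axis, else 0.
    return [1 if sum(a * b for a, b in zip(row, w)) > 0 else 0 for row in centred]
```

On the same kind of toy patches as used for K-means, the road-dominated and background-dominated vectors fall on opposite sides of the first principal direction.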

After discretizing the structured labels, we can train each decision stump of the decision tree with the normal information-gain criterion. For the test functions, we considered the single location feature and the feature difference between a couple of locations. For a testing image with a resolution of $h \times w$ , as will be introduced in Section 5, we extracted n feature channels for each pixel. Then for each testing image patch, we obtained a $d \times d \times n$ dimensional feature. The single location test function can be parametrized by:

$$h (x, θ_{j}) = \operatorname{sign} (f [θ_{j}^{r}, θ_{j}^{c}, θ_{j}^{k}] - θ_{j}^{t}), \tag{3}$$

where $θ_{j}^{r}$ and $θ_{j}^{c}$ are the row and column in the patch, $θ_{j}^{k}$ is the channel index of the feature, $θ_{j}^{t}$ is the threshold value and the function $\operatorname{sign}$ is defined as:

$$\operatorname{sign} (x) = \begin{cases} 0, & x \leq 0, \\ 1, & x > 0. \end{cases}$$

An alternative kind of test function can be the paired sites difference, i.e.,

$$h (x, θ_{j}) = \operatorname{sign} (f [θ_{j}^{r}, θ_{j}^{c}, θ_{j}^{k}] - f [θ_{j}^{r'}, θ_{j}^{c'}, θ_{j}^{k}] - θ_{j}^{t}), \tag{4}$$

in which $r, c$ and $r^{'}, c^{'}$ indicate two sites in the patch, and the other parameters are the same as the single-site test function.
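The two test functions of Eqs. (3) and (4) can be written down directly. In this sketch the patch feature `f` is assumed to be a nested list indexed as `f[row][column][channel]`, and the function names are hypothetical.

```python
from typing import Sequence

PatchFeature = Sequence[Sequence[Sequence[float]]]  # d x d x n, indexed [r][c][k]


def sign01(x: float) -> int:
    """The sign function used in the text: 0 if x <= 0, 1 if x > 0."""
    return 1 if x > 0 else 0


def single_site_test(f: PatchFeature, r: int, c: int, k: int, t: float) -> int:
    """Eq. (3): compare one feature value in the patch with a threshold."""
    return sign01(f[r][c][k] - t)


def paired_site_test(f: PatchFeature, r: int, c: int,
                     r2: int, c2: int, k: int, t: float) -> int:
    """Eq. (4): compare the difference of two sites in channel k with a threshold."""
    return sign01(f[r][c][k] - f[r2][c2][k] - t)
```

Both functions return 0 or 1, so either can serve as the split function of a node and be trained with the same information-gain criterion.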

4.3. Structured label prediction

For each leaf node of the tree, we gathered a set of training patches during the training procedure. We can parametrize the leaf node with either a conditional distribution or a MAP estimate. Denote by $P_{t} \subset Y$ the set of label patches gathered at the leaf node $t$. As introduced in [9], by assuming that the labels of the pixels in a patch are independent, we can obtain the joint distribution of the label patch predicted by the leaf node as:

$$\Pr (p | P_{t}) = \prod_{i j} \Pr^{i j} (p_{i j} | P_{t}), \tag{5}$$

where $\Pr^{i j} (c | P_{t})$ denotes the marginal class distribution of pixel position $(i, j)$. Then we can obtain the MAP label patch by:

$$p^{*} = \arg\max_{p \in P_{t}} \Pr (p | P_{t}). \tag{6}$$

In this paper, we chose to use a simpler scheme to parametrize the leaf node, in which we directly stored the marginal distribution of each pixel in the patch. This scheme is advantageous because it is efficient to compute and retains more information. In addition, because road detection is a binary classification problem, we only need to record the probability of being road for each pixel position in the patch.
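Both leaf parametrizations can be sketched in a few lines. The code below is illustrative only, with hypothetical names: `leaf_road_probability` stores the per-pixel marginal probability of road, which is the scheme chosen in this paper, while `map_patch` follows Eqs. (5) and (6) by picking the stored label patch with the highest joint probability under the independence assumption.

```python
from typing import List, Sequence

LabelPatch = Sequence[Sequence[int]]  # d x d, entries in {0, 1}


def leaf_road_probability(label_patches: Sequence[LabelPatch]) -> List[List[float]]:
    """Marginal probability of road at each pixel position of the leaf."""
    n = len(label_patches)
    d = len(label_patches[0])
    return [[sum(p[i][j] for p in label_patches) / n for j in range(d)]
            for i in range(d)]


def joint_probability(patch: LabelPatch, road_prob: Sequence[Sequence[float]]) -> float:
    """Eq. (5): product of the per-pixel marginals for a given label patch."""
    prob = 1.0
    for i, row in enumerate(patch):
        for j, label in enumerate(row):
            q = road_prob[i][j]
            prob *= q if label == 1 else 1.0 - q
    return prob


def map_patch(label_patches: Sequence[LabelPatch]) -> LabelPatch:
    """Eq. (6): the label patch in P_t with the highest joint probability."""
    road_prob = leaf_road_probability(label_patches)
    return max(label_patches, key=lambda p: joint_probability(p, road_prob))
```

Storing the marginals keeps a soft value per pixel, whereas the MAP patch commits to one of the training patches seen at the leaf.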

4.4. Road detection with label fusion

After training a structured random forest using $N$ decision trees, we need to predict the road area for a given testing image. Unlike the classical pixel-wise random forests, which classify each pixel independently to obtain the whole prediction, structured random forests classify the patch and obtain a structured prediction in the form of a $d \times d$ probability patch. Suppose the patches are sampled with stepsize δ. For every pixel of the image, we accumulate the predicted road probabilities of all patches covering it and count the number of times it is sampled, which gives the matrices $Q$ and $A$, respectively. The probability map of the test image is then obtained by element-wise division, $P = Q ./ A$. Figure 4 gives an example. To obtain a binary output, we only need to threshold the probability map.

Figure 4.

Road detection with label fusion. The upper part is the input image and the lower part is the road probability map.
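The fusion step can be sketched as follows. The callback `predict_patch` is a hypothetical stand-in for the forest: it is assumed to return the $d \times d$ road-probability patch for the image patch whose top-left corner is at `(top, left)`. Pixels that no patch covers are given probability 0, and the default threshold of 0.5 is also an assumption.

```python
from typing import Callable, List, Sequence


def fuse_predictions(height: int, width: int, d: int, step: int,
                     predict_patch: Callable[[int, int], Sequence[Sequence[float]]]
                     ) -> List[List[float]]:
    """Accumulate patch predictions (Q) and sampling counts (A), return P = Q ./ A."""
    Q = [[0.0] * width for _ in range(height)]
    A = [[0] * width for _ in range(height)]
    for top in range(0, height - d + 1, step):
        for left in range(0, width - d + 1, step):
            patch = predict_patch(top, left)
            for i in range(d):
                for j in range(d):
                    Q[top + i][left + j] += patch[i][j]
                    A[top + i][left + j] += 1
    # Element-wise division; uncovered pixels default to 0.0 (assumption).
    return [[Q[y][x] / A[y][x] if A[y][x] else 0.0 for x in range(width)]
            for y in range(height)]


def binarize(P: Sequence[Sequence[float]], threshold: float = 0.5) -> List[List[int]]:
    """Threshold the probability map to obtain the binary road mask."""
    return [[1 if p > threshold else 0 for p in row] for row in P]
```

With a stepsize smaller than the patch size, most pixels receive several overlapping predictions, and the averaging is what smooths the output.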

5. Feature Extraction

For each image, we extracted pixel-wise channel features, including the texture filter bank response, the illumination invariant image and colour. We also included the location cues in the feature.

Texture Filter Bank Response. The images are converted to the CIE-LUV colour space, and a filter bank is then applied to the grey-scale image and to each channel of the CIE-LUV image. Concretely, Gaussian filters are applied to each of the three CIE-LUV channels, while the horizontal and vertical Gaussian derivative filters and the Laplacian of Gaussian filter are applied to the grey-scale image. Therefore, for a given scale σ, we get a 6-dimensional feature for each pixel (three Gaussian responses, two derivative responses and one Laplacian of Gaussian response). In this paper, three scales are employed, so we get an 18-dimensional texture filter bank response for each pixel.

Illumination Invariant Image. The presence of shadows in the image is a challenging problem for image-based road detection. To reduce the impact of illumination, we included the illumination invariant image in the feature. In [35], Finlayson et al. proposed an efficient method to recover shadow-free image representation. For the sake of simplicity, we extracted the 1D shadow-free image feature [35] and took it to be one of the feature channels. Figure 5 shows an example of shadow-free image representation.

Colour and Location. RGB channels of each pixel are included in the feature. In addition, the location of the pixel is a useful cue for road detection because the road usually appears at the central lower part of the image. Hence, the 2D normalized locations of the pixels are also used as part of the feature.

Figure 5.

Illumination invariant image-feature channel: the upper part is the source image and the lower part is the shadow-free image

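In total, we obtain a 24-dimensional feature vector for each pixel in the image: 18 texture filter bank responses, one illumination invariant channel, three colour channels and two location coordinates.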

6. Experiments and Results

6.1. Experimental setting

The algorithm was implemented in C++ under Ubuntu 12.04. Experiments were run on a standard PC with 8 GB RAM and an Intel Core i5-3230 CPU @ 2.6 GHz.

There are several important parameters during the training of structured random forests: the size of the image patch $d$, the maximum depth $d_{\max}$ allowed for each tree, the minimum number of samples for a split $s_{\min}$, the number of random splits $N_{s}$ and the number of trees $N$. In this paper, we set $d_{\max} = 20$, $s_{\min} = 20$ and $N_{s} = 100$. We trained a forest with $N = 10$ trees and chose different patch sizes $d$ for different resolutions of the input image.

6.2. Performance evaluation

To evaluate the proposed method, we conducted some experiments on both a publicly available urban-road dataset and unstructured road data collected using our own autonomous land vehicle platform.

6.2.1. KITTI-ROAD dataset

The KITTI-ROAD benchmark dataset [16, 36] is a well-known dataset which is widely used for the evaluation of urban road-detection algorithms. According to the different scenes in which the data were collected, the KITTI-ROAD dataset is split into three subsets: urban unmarked (UU), urban marked (UM) and urban multiple-marked lanes (UMM), each of which contains about 100 training images and 100 testing images with a resolution of approximately $1242 \times 375$. The ground-truth labellings for the training images are provided within the released dataset, while the ground-truth for the testing images is not publicly available and one needs to upload the results to the website in order to get them evaluated online. For performance evaluation, several metrics, including precision (PRE), recall (REC), maximum F1-measure (MaxF), average precision (AP), false positive rate (FPR) and false negative rate (FNR), are used in the official development kit¹, which was released along with the dataset.
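For reference, the pixel-wise metrics listed above can be computed from a binary prediction and its ground truth as in the sketch below. It uses the standard textbook definitions and hypothetical function names; it is not the code of the official development kit. AP is omitted because its exact definition in the kit is not given here, and MaxF is approximated as the best F1 over a caller-supplied list of thresholds.

```python
from typing import Dict, Sequence


def road_metrics(pred: Sequence[Sequence[int]],
                 truth: Sequence[Sequence[int]]) -> Dict[str, float]:
    """Precision, recall, F1, FPR and FNR with road (label 1) as the positive class."""
    tp = fp = tn = fn = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 1:
                fn += 1
            else:
                tn += 1
    pre = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    fnr = fn / (fn + tp) if fn + tp else 0.0
    return {"PRE": pre, "REC": rec, "F1": f1, "FPR": fpr, "FNR": fnr}


def max_f(prob: Sequence[Sequence[float]],
          truth: Sequence[Sequence[int]],
          thresholds: Sequence[float]) -> float:
    """Best F1 over the given thresholds applied to a probability map."""
    best = 0.0
    for th in thresholds:
        pred = [[1 if p > th else 0 for p in row] for row in prob]
        best = max(best, road_metrics(pred, truth)["F1"])
    return best
```

Under these definitions REC and FNR sum to one, which matches the REC and FNR columns of the result tables, where each pair adds up to 100%.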

Firstly, we used only the training sets of KITTI-ROAD for our experiments. We randomly split each of the three training sets of different scenes into two equally numbered parts. One was used for training, and the other was used for testing. For example, we split the 95 training images of the UM subset with ground-truth provided into two sets: one with 47 images for training, and the other with 48 images for testing. For notational simplicity, we denoted these two sets as the new UM training set and the new UM testing set, and did the same for UMM and UU.

In order to verify the superiority of the proposed structured random forest (denoted the SRF)-based road detection, we took two kinds of random forest classifiers as the baseline algorithms: one is the basic pixel-wise random forest (denoted RF) and the other is the patch-wise random forest (denoted PatchRF), which classifies each patch into either 1 or 0. We used the same forest parameters and image features to train them all. For PatchRF and SRF, we used a patch size $d = 24$ and during testing we used a stepsize of $δ = 8$ . In addition, we took the conditional random field (CRF)-based approach as another comparison. Practically, we took the negative log-likelihood predicted by the pixel-wise random forest classifier as the unary potential and the 4-neighbourhood smooth prior as the pairwise potential. In addition, we used graph cutting² [37] to solve the optimization problem. This method is denoted as RF+CRF hereafter.
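To make the RF+CRF baseline more concrete, the sketch below evaluates the energy of a candidate labelling. The unary term is the negative log-likelihood of the pixel-wise road probability, as stated above. The pairwise term is assumed here to be a Potts penalty on 4-neighbours with a hypothetical weight `smooth_weight`, since the exact form of the smoothness prior is not specified. The graph-cut solver that minimizes this energy is not shown.

```python
import math
from typing import Sequence


def crf_energy(road_prob: Sequence[Sequence[float]],
               labels: Sequence[Sequence[int]],
               smooth_weight: float = 1.0,
               eps: float = 1e-6) -> float:
    """Unary negative log-likelihood plus a Potts penalty on 4-neighbour pairs."""
    h, w = len(labels), len(labels[0])
    energy = 0.0
    for y in range(h):
        for x in range(w):
            # Clamp the probability so that log(0) is never evaluated.
            p_road = min(max(road_prob[y][x], eps), 1.0 - eps)
            p = p_road if labels[y][x] == 1 else 1.0 - p_road
            energy += -math.log(p)
            # Count each horizontal and vertical neighbour pair once.
            if x + 1 < w and labels[y][x] != labels[y][x + 1]:
                energy += smooth_weight
            if y + 1 < h and labels[y + 1][x] != labels[y][x]:
                energy += smooth_weight
    return energy
```

With a positive smoothness weight, a labelling that flips one weakly supported pixel to match its neighbours can have lower energy than the pixel-wise decision, which is how the CRF removes isolated errors.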

Figure 6 shows some qualitative comparisons between these algorithms. We can see that the pixel-wise predictions of random forest are noisy, and that this can be improved by CRF optimization to a certain extent. However, when too many pixels are misclassified, as is shown in the right-hand column of Figure 6, the result after CRF optimization is also erroneous. In comparison, the patch-based methods (including PatchRF and SRF) can reduce the ambiguity by exploiting the contextual information. However, because the whole patch is predicted as being either 1 or 0, the outputs of PatchRF are lower in resolution and zigzag at the boundaries, and are therefore less accurate than those of the proposed SRF. The results confirm that the contextual information and the structural information of the label exploited by the proposed method help to improve the performance of road detection.

For quantitative evaluation, we used the official development kit to evaluate the aforementioned methods in the perspective view on the three subsets. The results are listed in Tables 1, 2 and 3. The best ones are marked in a bold typeface. From the results, we can see that our proposed method achieves the best or near-to-best performance for all indices on the three subsets.

Figure 6.

Results of the KITTI-ROAD dataset. The first row presents the source images; the second row is the ground-truth; the third row gives the results of the basic pixel-wise random forest; the fourth row is the results of the patch-wise random forest; the fifth row is the results of the pixel-wise random forest with CRF optimization and the last row is the results of the proposed structured random forest-based road detection.

Table 1.

Comparative Results on the New UM Set (Perspective View)

	RF	PatchRF	RF+CRF	SRF(ours)
MaxF	82.42	84.15	84.37	85.79
AP	83.95	84.67	70.88	86.70
PRE	76.22	79.98	82.92	82.58
REC	89.71	88.77	85.87	89.26
FPR	5.62	4.46	3.55	3.78
FNR	10.29	11.23	14.13	10.74

Table 2.

Comparative Results on the New UMM Set (Perspective View)

	RF	PatchRF	RF+CRF	SRF(ours)
MaxF	90.14	91.22	91.50	92.25
AP	89.18	89.12	82.86	91.45
PRE	85.56	88.52	88.75	89.81
REC	95.24	94.09	94.44	94.83
FPR	5.08	3.85	3.78	3.40
FNR	4.76	5.91	5.56	5.17

Table 3.

Comparative Results on the New UU Set (Perspective View)

	RF	PatchRF	RF+CRF	SRF(ours)
MaxF	74.10	78.60	76.60	79.84
AP	73.57	81.34	61.78	82.54
PRE	66.15	73.25	79.60	73.62
REC	84.22	84.81	73.82	87.21
FPR	7.17	5.15	3.15	5.20
FNR	15.78	15.19	26.18	12.79

In addition, we also used the precision-recall (P-R) curves to evaluate the methods. We show the results in Figure 7. Note that the CRF outputs binary results and, therefore, it is shown as a single point in the P-R figure. From the figures, we can see that the structured random forest outperforms the pixel-wise random forest by a considerable margin. The patch-based random forest gets better results than the pixel-wise random forest by exploiting the contextual information. However, it is inferior to the structured random forest because its binary classification of the patches may cause errors at the boundaries. In addition, CRF optimization can improve the results of pixel-wise random forest via complex global optimization, but the performance is still inferior to the proposed SRF, which is even more efficient.

To further validate the effectiveness of the proposed algorithm, we evaluated it on the testing set of the KITTI-ROAD dataset. We used all of the training images of the UM, UMM and UU subsets to train the models. The predictions for the testing images were transformed to a bird's-eye view (BEV) and uploaded to the website for evaluation [16]. We compared the results of our algorithm with those of several recently developed ones, including CN [3], ARSL-AMI [7], SPlane+BL [38], ANN [31] and the baseline (BL) [16] algorithm that was released with the dataset. The comparative results on the UM, UMM and UU subsets and the average results are listed in Tables 4, 5, 6 and 7, respectively. From the results, we can see that the proposed method achieves better maximum F1-scores (MaxF) than the others when applied to the UMM and UU subsets. However, the results for the UM subset are not as good. We analysed the detection results and found that the proposed method obtained poor results in several scenes with different road textures and large sections of heavy shadow. The performance in such scenes could therefore be improved by using more training samples and by designing more discriminative and illumination-invariant features. Overall, the average MaxF and AP of the proposed method are superior to those of the others. Apart from better performance, our method can be advantageous in terms of efficiency. The computational time is discussed further in Section 6.3.

Figure 7.

Precision-Recall curves tested on the KITTI-ROAD dataset. The top left, top right and bottom left show the results of the new UM, UMM and UU sets, respectively, and the bottom right shows the average results of the three sets.

Table 4.

Results of Online Evaluation on the UM (BEV)

Algorithm	MaxF	AP	PRE	REC	FPR	FNR
CN [3]	73.69	76.68	69.18	78.83	16.00	21.17
ARSL-AMI [7]	71.97	61.04	78.03	66.79	8.57	33.21
SPlane+BL [38]	85.23	88.66	83.43	87.12	7.89	12.88
ANN [31]	62.83	46.77	50.21	83.91	37.91	16.09
BL [16]	82.24	85.30	79.44	85.24	10.05	14.76
Ours	76.43	83.24	75.53	77.35	11.42	22.65

Table 5.

Results of Online Evaluation on the UMM (BEV)

Algorithm	MaxF	AP	PRE	REC	FPR	FNR
CN [3]	86.21	84.40	82.85	89.86	20.45	10.14
ARSL-AMI [7]	89.56	82.82	85.87	93.59	16.93	6.41
SPlane+BL [38]	82.04	85.56	75.11	90.39	32.93	9.61
ANN [31]	80.95	68.36	69.95	96.05	45.35	3.95
BL [16]	76.02	78.82	65.71	90.17	51.72	9.83
Ours	90.77	92.44	89.35	92.23	12.08	7.77

Table 6.

Results of Online Evaluation on the UU (BEV)

Algorithm	MaxF	AP	PRE	REC	FPR	FNR
CN [3]	72.25	66.61	71.96	72.54	9.21	27.46
ARSL-AMI [7]	70.33	61.97	83.33	60.84	3.97	39.16
SPlane+BL [38]	74.02	79.61	65.15	85.68	14.93	14.32
ANN [31]	54.07	36.61	39.28	86.69	43.67	13.31
BL [16]	69.50	73.87	65.87	73.56	12.42	26.44
Ours	76.07	79.97	71.47	81.31	10.57	18.69

Table 7.

Average Results of Online Evaluation on the KITTI-ROAD Dataset (BEV)

Algorithm	MaxF	AP	PRE	REC	FPR	FNR
CN [3]	79.02	78.80	76.64	81.55	13.69	18.45
ARSL-AMI [7]	80.36	70.23	83.24	77.67	8.61	22.33
SPlane+BL [38]	79.63	83.90	72.59	88.17	18.34	11.83
ANN [31]	67.70	52.50	54.19	90.17	41.98	9.83
BL [16]	75.80	79.85	69.31	83.63	20.40	16.37
Ours	82.44	87.37	80.60	84.36	11.18	15.64
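For readers unfamiliar with the column headers, the sketch below computes the pixel-level measures from confusion counts under the usual definitions (road as the positive class). The function names and the toy counts are assumptions for illustration; the benchmark's own evaluation code is not reproduced here. The last lines check that the F1-score computed from the PRE and REC values of the "Ours" row in Table 5 agrees with the reported MaxF, which suggests that PRE and REC are reported at the MaxF operating point.

```python
def road_metrics(tp, fp, tn, fn):
    """Pixel-level metrics in percent from confusion counts (road = positive)."""
    pre = 100.0 * tp / (tp + fp)   # precision
    rec = 100.0 * tp / (tp + fn)   # recall
    fpr = 100.0 * fp / (fp + tn)   # false positive rate
    fnr = 100.0 * fn / (tp + fn)   # false negative rate, equals 100 - REC
    return {"PRE": pre, "REC": rec, "FPR": fpr, "FNR": fnr,
            "F1": f1_from_pr(pre, rec)}


def f1_from_pr(pre, rec):
    """Harmonic mean of precision and recall (both given in percent)."""
    return 2.0 * pre * rec / (pre + rec)


def max_f(confusions):
    """MaxF: best F1 over confusion counts obtained at different thresholds."""
    return max(road_metrics(*c)["F1"] for c in confusions)


# Toy confusion counts (hypothetical, not taken from the paper).
m = road_metrics(tp=80, fp=20, tn=180, fn=20)

# Consistency check against Table 5, row "Ours": PRE = 89.35, REC = 92.23.
f1_umm = f1_from_pr(89.35, 92.23)
```

Rounded to two decimals, `f1_umm` matches the MaxF of 90.77 listed for the proposed method on UMM.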

6.2.2. Unstructured road dataset

We then tested the performance of the proposed algorithm with the actual data collected by our own autonomous land vehicle (ALV). Our ALV is a modified Toyota Land Cruiser equipped with cameras and other sensors. As is shown in Figure 8, the yellow box shows the mounting position of the camera used for road detection.

Our ALV aims to run autonomously on typical unstructured roads, using monocular vision for road detection. Here, we used two typical scenes to evaluate the proposed algorithm, namely Scene I and Scene II. Concretely, Scene I consists of 195 images which were collected on sunny days with highlights and shadows, and Scene II consists of 153 images which were collected on sunny days (some after rain) with puddles and specularities on the roads. As is shown in Figure 9, the first row shows some examples of Scene I and the second row shows some examples of Scene II. We manually labelled the images pixel-wise and split each set into two equally numbered subsets: one used for training and the other used for testing. To make the road detection run in real time, we cropped and downsampled the images to $500 \times 250$ before inputting them to the algorithms. We took a patch size of $d = 12$ and a testing step size of $\delta = 6$.

Figure 8.

Modified Toyota Land Cruiser platform. The yellow box indicates the camera used for road detection.

Figure 9.

Samples of unstructured road images collected using our own ALV. The first row is Scene I and the second row is Scene II.

To validate the superiority of the proposed method for unstructured road detection, we took the vanishing point-based general road detection (denoted as VP) [2] as a comparison. We used the MATLAB code released by the authors. Figure 10 presents some qualitative comparisons between the vanishing point-based road detection and the proposed method. From the figure, we can see that the results of the vanishing point-based road detection are coarse and far from accurate. Moreover, when the road is strongly curved or the heading of the vehicle deviates from the road direction, the vanishing point of the road becomes biased or even falls outside the camera's field of view, and this method may thus output poor results. This can be seen in the third and fifth rows of Figure 10. In comparison, the proposed structured learning-based method gives good results for different road shapes, textures, illuminations and weather conditions.

For quantitative evaluation, we again used the P-R curves. The results on the two scenes are shown in Figure 11. From the figures, we can see that the learning-based approaches achieve much better results than the vanishing point-based approach, which uses no data-related prior information. Among the learning-based approaches, the proposed structured random forest always achieves the best results, which are comparable to or even better than those of the pixel-wise random forest with CRF optimization. Compared to the pixel-wise random forest, the patch-based random forest exploits the contextual information to reduce ambiguity. However, the binary prediction of the patch reduces the resolution of the output, especially at the boundaries, and this can degrade performance. Therefore, the overall performance of the patch-based random forest may vary from case to case, as shown in Figure 11. In Scene I, the PatchRF gives slightly worse results than the pixel-wise RF. In Scene II, however, the results of PatchRF are better than those of the pixel-wise RF and only slightly worse than those of the SRF. The reason may be that the contextual information plays a more important role in Scene II, and the structural information of the label patch exploited by the SRF is less helpful due to the obscure boundaries in these images. Additionally, runtime is another important factor for real-time applications. The patch-based methods are considerably faster, as discussed in detail in the following subsection.

Figure 10.

Results of unstructured road detection. The first column presents the source images, the second column shows the results of the vanishing point-based road detection and the last column shows the results of the proposed method.

Figure 11.

P-R curves of unstructured road detection with different algorithms

6.3. Computational time

The time consumed by road detection is crucial for the navigation of autonomous land vehicles. Therefore, we investigated the computational time of the proposed method in the testing phase. Since patch-based methods predict a batch of pixels in a single classification, the number of classifications needed for a whole image is reduced substantially compared with pixel-wise classifiers. Taking unstructured road detection as an example, the resolution of the input images is $500 \times 250$. With the classical pixel-wise random forest, the classifier must therefore be run $500 \times 250$ times for each input image, because each pixel is classified independently. In comparison, with a patch size of $d = 12$ and a step size of $\delta = 6$, the patch-based methods need only approximately $500 \times 250 / (6 \times 6)$ classifications. The number of classifications is thus reduced by a factor of about 36, and the runtime can be reduced dramatically. We recorded the running time on our computer and conducted a quantitative comparison between the pixel-wise random forest, the patch-wise random forest, the pixel-wise random forest with CRF optimization and the proposed structured random forest-based road detection. The means and standard errors measured on the KITTI-ROAD dataset and on the unstructured road images collected using our platform are shown in Figure 12. We can see that the pixel-wise random forest is time consuming, especially for high-resolution images such as those in KITTI-ROAD. The CRF-based method is even more time consuming as a result of its extra graph-cut step. In comparison, the patch-based methods, PatchRF and SRF, are much more efficient, making them more suitable for real-time applications. For instance, they take only approximately 70 ms on average for unstructured road detection. Although PatchRF achieves an efficiency similar to that of the proposed SRF, its accuracy is inferior to that of the SRF. Therefore, taking both accuracy and runtime into consideration, we can conclude that the proposed SRF is more competitive than the others. In addition, the proposed structured random forest-based road detection can be easily parallelized, leaving room for further acceleration.
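The classification counts quoted above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and the exact grid count assumes that only patches lying fully inside the image are sampled (no padding), which the text does not specify.

```python
def pixelwise_calls(width, height):
    """One classifier call per pixel."""
    return width * height


def patch_calls_approx(width, height, step):
    """Approximation used in the text: one call per step x step block."""
    return width * height / (step * step)


def patch_calls_grid(width, height, patch, step):
    """Exact count on a regular grid of fully contained patches.

    Assumption: no padding, so border pixels that cannot host a full
    patch origin are not sampled.
    """
    nx = (width - patch) // step + 1
    ny = (height - patch) // step + 1
    return nx * ny


W, H, D, STEP = 500, 250, 12, 6

n_pixel = pixelwise_calls(W, H)            # 125000 calls
n_approx = patch_calls_approx(W, H, STEP)  # about 3472 calls
n_grid = patch_calls_grid(W, H, D, STEP)   # 82 * 40 = 3280 calls

speedup_approx = n_pixel / n_approx        # about 36
speedup_grid = n_pixel / n_grid            # about 38
```

Under the no-padding assumption the reduction is slightly larger than 36, because patches near the image border are not sampled; the figure of 36 in the text corresponds to the approximation that ignores border effects.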

Figure 12.

Running time statistics of different algorithms on the KITTI-ROAD dataset and the unstructured road images

7. Conclusions

Road detection is an essential task for the visual navigation of autonomous land vehicles. Most monocular road-detection algorithms employ certain machine-learning approaches to classify each pixel or patch as either road or background. However, these methods fail to make good use of the contextual information of the pixel and the structural information of the labels, which can be very helpful for reducing ambiguity and improving accuracy. In this paper, we proposed using the structured random forest for road detection. The benefits are twofold: first, the contextual information of the pixels is encoded and the structural information of the labels is exploited. Second, by predicting a batch of pixels in each classification, the computational complexity is significantly reduced compared with pixel-wise classifiers. Experiments tested on the KITTI-ROAD dataset and data collected in typical unstructured environments show that the structured random forest can substantially improve the accuracy of road detection over the classical pixel-wise and patch-wise random forest classifiers and, at the same time, be very computationally efficient.

8. Acknowledgements

This work is supported by the National Natural Science Foundation of China under Grants 61375050 and 91220301.

Footnotes

1 available at:

2 open-source software available at:

References

1. Ernst D Dickmanns, Birger D Mysliwetz. Recursive 3-d road and relative ego-state recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2):199–213, 1992.

2. Hui Kong, Jean-Yves Audibert, Jean Ponce. General road detection from a single image. IEEE Transactions on Image Processing, 2010.

3. Jose M Alvarez, Theo Gevers, Yann LeCun, Antonio M Lopez. Road scene segmentation from a single image. In Computer Vision–ECCV 2012, pages 376–389. Springer, 2012.

4. Scharwachter, Franke. Low-level fusion of color, texture and depth for robust road scene understanding. In Intelligent Vehicles Symposium (IV), 2015 IEEE, pages 599–604, June 2015.

5. Tongtong Chen, Bin Dai, Ruili Wang, Daxue Liu. Gaussian-process-based real-time ground segmentation for autonomous land vehicles. Journal of Intelligent & Robotic Systems (JINT), 76:563–582, Sep 2013.

6. Sebastian Nowozin, Christoph H Lampert. Structured learning and prediction in computer vision. Foundations and Trends in Computer Graphics and Vision, 6(3–4):185–365, 2011.

7. Passani, J. J. Yebes, L. M. Bergasa. CRF-based semantic labeling in miniaturized road scenes. In Intelligent Transportation Systems (ITSC), 2014 IEEE 17th International Conference on, pages 1902–1903, Oct 2014.

8. Liang Xiao, Bin Dai, Daxue Liu, Tingbo Hu, Tao Wu. CRF based road detection with multi-sensor fusion. In Intelligent Vehicles Symposium (IV), 2015 IEEE, pages 192–198. IEEE, 2015.

9. Peter Kontschieder, Rota Buló, Horst Bischof, Marcello Pelillo. Structured class-labels in random forests for semantic image labelling. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2190–2197. IEEE, 2011.

10. Piotr Dollár, Lawrence Zitnick. Structured forests for fast edge detection. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 1841–1848. IEEE, 2013.

11. Takeshi Chiku, Jun Miura. On-line road boundary estimation by switching multiple road models using visual features from a stereo camera. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 4939–4944. IEEE, 2012.

12. Heidi Loose, Uwe Franke. B-spline-based road model for 3d lane recognition. In Intelligent Transportation Systems (ITSC), 2010 13th International IEEE Conference on, pages 91–98. IEEE, 2010.

13. Liang Xiao, Bin Dai, Tingbo Hu, Tao Wu. Fast unstructured road detection and tracking from monocular video. In Control and Decision Conference (CCDC), 2015 27th Chinese, pages 3974–3980. IEEE, 2015.

14. Evgeny Levinkov, Matt Fritz. Sequential bayesian model update under structured scene prior for semantic road scenes labeling. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 1321–1328. IEEE, 2013.

15. Ceryen Tan, Tsai Hong, Tommy Chang, Michael Shneier. Color model-based real-time learning for road following. In Intelligent Transportation Systems Conference, 2006. ITSC'06. IEEE, pages 939–944. IEEE, 2006.

16. Jannik Fritsch, Tobias Kuhnl, Andreas Geiger. A new performance measure and evaluation benchmark for road detection algorithms. In International Conference on Intelligent Transportation Systems (ITSC), 2013.

17. Miguel Angel Sotelo, Francisco Javier Rodriguez, Luis Magdalena, Luis Miguel Bergasa, Luciano Boquete. A color vision-based lane tracking system for autonomous driving on unmarked roads. Autonomous Robots, 16(1):95–116, 2004.

18. Erke Shang, Xiangjing An, Jian Li, Lei Ye, Hangen He. Robust unstructured road detection: The importance of contextual information. International Journal of Advanced Robotic Systems, 10(179), 2013.

19. José M Álvarez, Antonio M López. Road detection based on illuminant invariance. Intelligent Transportation Systems, IEEE Transactions on, 12(1):184–193, 2011.

20. Bihao Wang, Vincent Frémont. Fast road detection from color images. In Intelligent Vehicles Symposium (IV), 2013 IEEE, pages 1209–1214. IEEE, 2013.

21. Yaniv Alon, Andras Ferencz, Amnon Shashua. Off-road path following using region classification and geometric projection constraints. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 1, pages 689–696. IEEE, 2006.

22. Stevica Graovac, Ahmed Goma. Detection of road image borders based on texture classification. International Journal of Advanced Robotic Systems, 2012-12-06.

23. Tobias Kühnl, Franz Kummert, Jannik Fritsch. Monocular road segmentation using slow feature analysis. In Intelligent Vehicles Symposium (IV), 2011 IEEE, pages 800–806. IEEE, 2011.

24. Liang Xiao, Bin Dai, Tao Wu, Yuqiang Fang. Unstructured road segmentation method based on dictionary learning and sparse representation (in Chinese). Journal of Jilin University (Engineering and Technology Edition), 43:384–388, 2013.

25. Rahul Mohan. Deep deconvolutional networks for scene parsing. arXiv:1411.4101, 2014.

26. Dahlkamp, Kaehler, Stavens, Thrun, G. R. Bradski. Self-supervised monocular road detection in desert terrain. In Robot. Sci. Syst. Conf. (RSS), 2006.

27. Shengyan Zhou, Jianwei Gong, Guangming Xiong, Huiyan Chen, Karl Iagnemma. Road detection using support vector machine based on online learning and evaluation. In 2010 IEEE Intelligent Vehicles Symposium (IV 2010), pages 256–261, 2010.

28. Jannik Fritsch, Kuehnl, Franz Kummert. Monocular road terrain detection by combining visual and spatial information. Transactions on Intelligent Transportation Systems, 15(4):1586–1596, 2014.

29. J.H. Choi, G.Y. Song, J.W. Lee. Road identification in monocular color images using random forest and color correlogram. International Journal of Automotive Technology, 13(6):941–948, 2012.

30. Patrick Y. Shinzato, Valdir Grassi Jr, Fernando S. Osorio, Denis F. Wolf. Fast visual road recognition and horizon detection using multiple artificial neural networks. In Intelligent Vehicles Symposium Proceedings, 2012 IEEE, pages 1090–1095, June 2012.

31. G. B. Vitor, D. A. Lima, A. C. Victorino, J. V. Ferreira. A 2d/3d vision based approach applied to road detection in urban environments. In Intelligent Vehicles Symposium (IV), 2013 IEEE, pages 952–957, 2013.

32. Jose Manuel Alvarez, Theo Gevers, Antonio M. Lopez. 3D Scene priors for road detection. In Computer Vision and Pattern Recognition, pages 57–64, 2010.

33. Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

34. Criminisi, Shotton. Decision Forests for Computer Vision and Medical Image Analysis. Springer Publishing Company, Incorporated, 2013.

35. Graham D Finlayson, Steven D Hordley, Cheng Lu, Mark S Drew. On the removal of shadows from images. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 28(1):59–68, 2006.

36. Andreas Geiger, Philip Lenz, Christoph Stiller, Raquel Urtasun. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32:1231–1237, 2013.

37. Vladimir Kolmogorov, Ramin Zabin. What energy functions can be minimized via graph cuts? Pattern Analysis and Machine Intelligence, IEEE Transactions on, 26(2):147–159, 2004.

38. Einecke, Eggert. Block-matching stereo with relaxed fronto-parallel assumption. In Intelligent Vehicles Symposium Proceedings, 2014 IEEE, pages 700–705, June 2014.