Abstract
An accurate hierarchical stereo matching method based on continuous 3D plane labeling of superpixels is proposed for rover stereo images. By combining a slanted-patch matching strategy with coarse-to-fine constraints, the method infers the 3D plane label of each pixel, which makes it especially suitable for large-scale scenes with low-texture or textureless regions. At every level, superpixel-based matching speeds up the convergence of the iteration and avoids a large amount of redundant computation. In the coarse-to-fine matching scheme, we propose a disparity constraint and a 3D normal vector constraint between adjacent levels, through which the disparity map and 3D normal vector map at a coarser level restrict the search range of disparity and normal vector at a finer level. Experimental results on the Chang’e-3 rover dataset and the KITTI dataset show that the proposed method is more efficient and more accurate than the state-of-the-art 3D labeling algorithm, especially in low-texture or textureless regions: it is about five to six times faster, and its accuracy is better.
Introduction
3D information is the main basis of robot 3D visual perception. 1,2 Stereo matching, an important step in disparity calculation and 3D reconstruction for binocular vision, 3 has been widely studied. However, occlusions and weak-texture or textureless regions remain challenging, especially for rover stereo images.
In recent years, matching methods based on 3D labels have increased the accuracy of stereo matching. Such methods estimate not only the disparity of each pixel but also its 3D normal vector. To infer 3D labels efficiently, many methods successfully use PatchMatch, 4–6 which can estimate a 3D plane at each pixel. 7 Inspired by the randomized search and propagation of PatchMatch, 6,8 many optimization methods with belief propagation (BP) 4 or graph cut (GC) 9 have been proposed for efficient inference in pairwise Markov random fields (MRFs) with large label spaces. To the best of our knowledge, the local expansion moves (LocalExp) method, 10 which uses local expansion moves based on GC, is the state of the art.
However, the LocalExp method has three problems. The first problem is that it uses three grid structures with cell sizes of 5 × 5, 15 × 15, and 25 × 25 pixels, which may lead to a large amount of redundant computation. In contrast, superpixel-based stereo matching assumes that the pixels in a superpixel belong to the same 3D surface. Therefore, we propose superpixel-based segmentation, which has low complexity and speeds up the convergence of the iteration. The second problem is that the LocalExp method cannot handle low-texture regions well. The image in Figure 1(a) is typical in that it contains low-texture or even textureless regions. Its disparity map generated with the LocalExp method (https://github.com/t-taniai/LocalExpStereo) is shown in Figure 1(b); there are many mismatches and unreliable matching points at the bottom and on the left side of the disparity map. The third problem is that it is very difficult to assign a suitable 3D label to each pixel from the infinite continuous label space. The LocalExp method 10 therefore iteratively applies local expansion moves using GC to update and propagate local disparity planes, which not only increases the computation but may also produce wrong matches.
In this article, building on the original LocalExp method, we introduce a coarse-to-fine stereo matching method using 3D plane labeling of superpixels. The coarsest level improves the matching robustness, especially in low-texture regions, while the other levels preserve this robustness because they make full use of the disparity constraint and the normal vector constraint between two adjacent levels.

(a) Image of Chang’e-3 Yutu rover and (b) its disparity map generated by the LocalExp method.
Overall, the major contributions of this work are as follows:
1. A coarse-to-fine stereo matching framework combined with 3D labels is proposed, which is especially suitable for the stereo matching of low-texture or textureless regions in large-scale real scenes.
2. A matching method based on superpixel segmentation is proposed, which makes the iteration converge faster and avoids a large amount of redundant computation.
3. A disparity constraint and a normal vector constraint between two adjacent levels are proposed, through which the disparity map and 3D normal vector map at a coarser level restrict the search range of disparity and normal vector at a finer level.
The remainder of the article is arranged as follows: related works are presented in the second section; the proposed method is given in the third section; the experimental results and conclusion are given in the fourth and fifth sections, respectively.
Related works
Coarse-to-fine matching
The coarse-to-fine stereo matching strategy was introduced into many matching methods decades ago. 11 This kind of method first calculates a coarse-resolution disparity map and then uses it to constrain the disparity search range when calculating the fine disparity. 12
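The general coarse-to-fine idea can be illustrated with a minimal sketch. It is not the constraint of any specific cited method: the function name `fine_search_ranges`, the `margin` parameter, the factor-of-two `scale`, and the nearest-neighbor upsampling are all assumptions made for illustration.

```python
def fine_search_ranges(coarse_disp, margin=2, scale=2):
    """Per-pixel (d_min, d_max) at the finer level from a coarser disparity map.

    coarse_disp is a list of rows of disparities at the coarser level.
    Assumption: each coarse pixel covers a scale x scale block of fine pixels,
    and a disparity d at the coarse level corresponds to scale * d at the
    finer level. The margin (in fine-level pixels) is a hypothetical tolerance.
    """
    rows, cols = len(coarse_disp), len(coarse_disp[0])
    ranges = [[None] * (cols * scale) for _ in range(rows * scale)]
    for y in range(rows * scale):
        for x in range(cols * scale):
            # Nearest-neighbor lookup of the coarse disparity, rescaled.
            d = scale * coarse_disp[y // scale][x // scale]
            ranges[y][x] = (max(0.0, d - margin), d + margin)
    return ranges


coarse = [[1.0, 4.0]]            # a 1 x 2 coarse disparity map
ranges = fine_search_ranges(coarse)
print(ranges[0][0], ranges[1][3])
```

Instead of scanning the full disparity range at every fine pixel, the matcher only needs to evaluate candidates inside the returned interval, which is where the savings in computation come from.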
To speed up convergence, this strategy has been widely used in global matching methods. A hierarchical stereo matching method using dynamic programming is proposed by Van Meerbergen et al., 13 which achieves a tremendous gain in memory and speed. A hierarchical strategy using semiglobal matching first generates a half-resolution disparity map, which is then used for the disparity computation of the original image. 14 A coarse-to-fine stereo matching method based on more-global matching (MGM) is proposed by Li et al., 15 which improves the disparity accuracy compared with the original MGM. 16 Some BP algorithms 17,18 use a coarse-to-fine strategy to reduce the complexity. A coarse-to-fine bilateral disparity structure method based on GC 19 is introduced to reduce the computational cost and improve the disparity accuracy. In addition to the global methods, local methods also adopt the coarse-to-fine strategy, mainly to reduce the disparity search range for the computation of the finer disparity. A multiscale, multiwindow stereo matching method is proposed to estimate the disparity used for restricting the disparity search range. 20 A confidence-based multiscale stereo matching strategy has also been proposed, 21 which obtains higher-resolution disparity maps by processing the existing lower-resolution ones.
3D label-based stereo matching
The matching cost calculation based on 3D labels differs from that of stereo matching based on discrete disparities. For each pixel in one image, many corresponding regions can be obtained from the slanted plane of the pixel according to different 3D labels. Therefore, the major issue with 3D label-based stereo matching is the computational complexity.
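To make the notion of a 3D label concrete, the sketch below uses the slanted disparity plane parameterization common in PatchMatch-style stereo, d(x, y) = a·x + b·y + c, together with the usual conversion from a point on the plane and a normal vector. The function names are hypothetical, and the exact parameterization used in this article is not recoverable from the text, so this is an assumption rather than the authors' formulation.

```python
def plane_from_normal(x0, y0, d0, normal):
    """Plane coefficients (a, b, c) of d(x, y) = a*x + b*y + c.

    The plane passes through (x0, y0, d0) in (x, y, disparity) space and
    has the given normal vector (nx, ny, nz), with nz != 0.
    """
    nx, ny, nz = normal
    if nz == 0:
        raise ValueError("nz must be nonzero for a valid disparity plane")
    a = -nx / nz
    b = -ny / nz
    c = (nx * x0 + ny * y0 + nz * d0) / nz
    return a, b, c


def plane_disparity(plane, x, y):
    """Disparity that the plane label assigns to pixel (x, y)."""
    a, b, c = plane
    return a * x + b * y + c


# A fronto-parallel label gives the same disparity to every pixel ...
flat = plane_from_normal(10, 20, 5.0, (0.0, 0.0, 1.0))
# ... while a slanted label changes the disparity across the window.
slanted = plane_from_normal(10, 20, 5.0, (1.0, 0.0, 2.0))
print(plane_disparity(flat, 13, 20), plane_disparity(slanted, 12, 20))
```

In a rectified pair, a left-image pixel (x, y) with disparity d corresponds to (x − d, y) in the right image, so each candidate label maps a window in the left image to a differently shaped region in the right image. The cost must be re-evaluated for every candidate label, which is why the label space is expensive to search.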
An energy function with a hybrid MRF is proposed by Yamaguchi et al., 22 which contains continuous variables for the 3D slanted planes and discrete random variables for each pair of neighboring segments. A slanted plane model is optimized with simple linear iterative clustering (SLIC) segmentation based on a pre-estimated disparity map, 23 and the states of occlusion and coplanarity over adjacent segment pairs are analyzed simultaneously. Bleyer et al. 7 propose to segment the scene into planar superpixels and estimate each pixel’s optimal 3D plane among all possible slanted planes. Olsson et al. 24 represent second-order surface smoothness with 3D labels. Heise et al. 25 propose to integrate the PatchMatch method into a variational smoothing formulation. Li et al. 26 use pixelwise 3D label optimization by fusing multilayer superpixels iteratively. Taniai et al. 9,10 propose to use LocalExp based on GC.
Methodology
The proposed matching method constructs a hierarchical architecture and works in a coarse-to-fine scheme. The advantage of this method is that some mismatches can be resolved at a coarse level, especially in low-texture or textureless regions. Given a robust disparity map at the coarser level, the disparity at the finer level can be correctly obtained based on the disparity constraint and the normal vector constraint.
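As a rough illustration of the hierarchical architecture, the following sketch builds an image pyramid by repeatedly halving the resolution. The 2 × 2 box filter and the function names are assumptions; the article's own decomposition (filter, scale factor) is not specified in the surviving text, and Gaussian filtering is a common alternative.

```python
def downsample_by_two(image):
    """Halve the resolution by averaging each 2 x 2 block.

    image is a list of rows of gray values; an odd trailing row or column
    is dropped.
    """
    rows, cols = len(image) // 2, len(image[0]) // 2
    return [
        [
            (image[2 * y][2 * x] + image[2 * y][2 * x + 1]
             + image[2 * y + 1][2 * x] + image[2 * y + 1][2 * x + 1]) / 4.0
            for x in range(cols)
        ]
        for y in range(rows)
    ]


def build_pyramid(image, levels):
    """Return [finest, ..., coarsest] with the requested number of levels."""
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(downsample_by_two(pyramid[-1]))
    return pyramid


image = [
    [0, 2, 4, 6],
    [2, 4, 6, 8],
    [8, 8, 0, 0],
    [8, 8, 0, 0],
]
pyramid = build_pyramid(image, levels=3)
print([(len(level), len(level[0])) for level in pyramid])
```

Matching would start at `pyramid[-1]` (the coarsest level) and proceed toward `pyramid[0]`, with each level's result constraining the next finer one.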
The two input images are first rectified by epipolar rectification, 27 and then they are decomposed into a

The flowchart of the proposed method. The upper row is the coarsest disparity map computation steps, while the lower row is the disparity map computation steps from level
Superpixel-based image segmentation
The stereo images are first decomposed into many nonoverlapping segments with the SLIC algorithm. 28 Figure 3(a) shows an image with about 200 SLIC superpixels. Then, as shown in Figure 3(b), we define three windows for each superpixel

(a) SLIC superpixels. (b) Unit window
Iterative randomized optimization based on superpixel
According to the three defined windows,
where
Data term
For a pixel
where
which is defined as
where
As shown in Figure 4, for each affine transformation window

(a) The affine transformation window and (b) its corresponding windows obtained by different 3D labels.
Smoothness term
The smoothness term is defined as follows, in the same way as in reference 10
where
where
where the first term measures the difference between
Iterative randomized optimization
We use a local expansion method similar to the LocalExp method 10 to perform iterative randomized optimization based on superpixels
for all possible labels
Our expansion method is also in two ways:
Our iterative local expansion is shown in Algorithm 1, which, like the LocalExp method, 10 includes propagation (lines 1–4), RANSAC (lines 5–8), and refinement (lines 9–14) steps. Except for the refinement step, the steps in our local expansion are the same as those in the LocalExp method. 10
Iterative randomized optimization
The candidate label
where
For normal vector, we add a random vector
After that, we update the current labels of pixels
Coarse-to-fine disparity constraint and 3D normal vector constraint
The above matching is based on continuous 3D labels of superpixels. However, a large disparity search range may lead to inaccurate disparity, especially in textureless or low-texture regions. To obtain accurate disparity in those regions, we build on our previous coarse-to-fine disparity constraint proposed by Li et al. 15 and introduce a new coarse-to-fine 3D normal vector constraint.
As can be seen from Figure 5, for each pixel

The disparity constraint 15 and 3D normal vector constraint between two adjacent levels. (a) Disparity at level
Step 1: Compute
Step 2: Find the minimum disparity
where
Step 3: Select the normal vector of any pixel in
where
Experimental results
To evaluate the performance of the proposed method, two datasets are used: the Chang’e-3 Yutu rover dataset (http://planetary.s3.amazonaws.com/data/change3/pcam.html) and the KITTI dataset (http://www.cvlibs.net/datasets/kitti/). Our method is compared with the state-of-the-art LocalExp method. 10
We use the following settings throughout the experiments. The parameters for the data term are set to
In the LocalExp method, three structures with cell sizes of 5 × 5, 15 × 15, and 25 × 25 pixels are used, and the iteration numbers
In our method, the number of pyramid levels is set to 3 for the Chang’e-3 rover dataset and 2 for the KITTI dataset. For each level, the number of superpixels is set to 1200, a choice that is analyzed in the “Analysis of parameter and processing time” section. The sizes of three windows
In all the experiments, we use only the CPU, not the GPU, when comparing our method with the LocalExp method. All the experiments are executed on a laptop with an Intel Core i5-8250 1.60 GHz CPU and 8 GB of memory, and our code is implemented in Microsoft Visual Studio 2017 with C++ and the OpenCV library.
Evaluation on Chang’e-3 rover dataset
The input Chang’e-3 stereo images are first rectified by the epipolar rectification method. 27 During the coarse-to-fine processing, we construct a three-level pyramid (

Disparity maps of Chang’e-3 images computed with the LocalExp method and our proposed method. (a–e) Each column shows, from top to bottom, the rectified left image, the disparity map estimated by the LocalExp method, and the disparity map estimated by our proposed method. Our method obtains more accurate and reliable disparity results.
We qualitatively compare our proposed method with the LocalExp method. 10 As can be seen from Figure 6, the disparity maps computed by our method (the third row of Figure 6) are superior to those computed by the LocalExp method (the second row of Figure 6). For example, stereo pair-1 (shown in Figure 6(a)) contains a huge rock, disparity discontinuities, and many low-texture regions. Our method obtains high-precision disparity in many of the low-texture regions, while the LocalExp method obtains false or unreliable disparity there. The other four stereo pairs (shown in Figure 6(b) to (e)) include low-texture, repetitive-texture, precipice, and disparity-discontinuity regions, and there is strong illumination on the left of Figure 6(d). All the precipice, low-texture, and repetitive-texture regions, as well as the regions under varying or strong illumination, are perfectly reconstructed with our proposed method, while the LocalExp method produces some wrong disparities in these regions.
Evaluation on KITTI dataset
Because no standard dataset with ground truth is available for rover stereo images, the accuracy of our method cannot be quantitatively evaluated on them. Therefore, we choose the KITTI dataset (http://www.cvlibs.net/datasets/kitti/), created from a driving platform, 29 whose imaging mode is similar to that of the rover; the image size is 1226 × 370 pixels. We construct a two-level pyramid (
We test the accuracy and efficiency of the disparity maps computed by our proposed method, our improved MGM method, 15 and the LocalExp method 10 on the KITTI dataset. The quantitative results of the different methods are listed in Table 1, which gives the average results over all 194 training image pairs in reflective regions. In this table, Out-Noc is the percentage of erroneous pixels in nonoccluded areas, Out-All is the percentage of erroneous pixels in total, Avg-Noc is the average disparity error in nonoccluded areas, and Avg-All is the average disparity error in total. The proposed method performs better than our previous MGM method 15 and the LocalExp method. 10 Compared with the LocalExp method, Out-Noc and Out-All decrease from 12.67% and 13.57% to 7.27% and 8.17%, respectively, while Avg-Noc and Avg-All decrease from 2.05 and 2.19 pixels to 1.37 and 1.53 pixels, respectively. Figure 7 shows several examples of disparity maps, and Table 2 gives the corresponding quantitative results of our previous MGM method, the LocalExp method, and the proposed method. The proposed method significantly improves both the percentage of erroneous pixels and the average disparity error.
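The two kinds of metrics described above can be sketched as follows, assuming the default KITTI convention that a pixel counts as erroneous when its disparity error exceeds 3 pixels. The function name and the use of `None` to mark pixels without ground truth are assumptions for illustration; the official development kit encodes invalid pixels differently.

```python
def kitti_errors(estimate, ground_truth, threshold=3.0):
    """Return (outlier percentage, average absolute error) in one pass.

    estimate and ground_truth are lists of rows of disparities. Pixels whose
    ground-truth value is None are skipped, so passing the nonoccluded ground
    truth yields Out-Noc/Avg-Noc and passing the full ground truth yields
    Out-All/Avg-All.
    """
    errors = [
        abs(e - g)
        for row_e, row_g in zip(estimate, ground_truth)
        for e, g in zip(row_e, row_g)
        if g is not None
    ]
    if not errors:
        raise ValueError("no pixels with ground truth")
    outliers = sum(1 for err in errors if err > threshold)
    return 100.0 * outliers / len(errors), sum(errors) / len(errors)


estimate = [[10.0, 20.0, 30.0], [5.0, 7.0, 9.0]]
ground_truth = [[10.0, 24.0, None], [5.0, 8.0, 9.0]]
out_pct, avg_err = kitti_errors(estimate, ground_truth)
print(out_pct, avg_err)
```

In this toy example five pixels have ground truth, one of them is off by more than 3 pixels, and the absolute errors sum to 5 pixels.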
The average quantitative results with our previous MGM, LocalExp method, and our proposed method for the training images on KITTI dataset using the default error threshold of 3 pixels in reflective regions.
Quantitative results of our previous MGM method, the LocalExp method, and our proposed method for Nos 0, 14, 102, 140, and 192 on KITTI dataset using the default error threshold of 3 pixels in reflective regions.

Disparity maps of five images in KITTI datasets with different methods, from left to right: Nos 0, 14, 102, 140, and 192. Disparity maps are visualized using color-map. (a) Left image, (b) ground truth pixels, (c) color map of disparity map with our previous MGM method, (d) disparity map with the LocalExp method, (e) color map of (d), (f) disparity map with our proposed method, and (g) color map of (f).
As shown in Figure 7, in low-texture or textureless regions, for example, shadowed regions, roads, and regions under strong light, the LocalExp method tends to generate errors due to nonconvergence. In the disparity maps generated by our method, the errors in these regions are almost eliminated.
Efficiency evaluation
We evaluate the efficiency of our method and the LocalExp method, both of which are implemented on the CPU. The two methods have two main computation parts: the data term of equation (3) and the smoothness term of equation (5). Both terms require
The computational complexity of our method is estimated by the sum of computation for all the superpixels at all levels
In contrast, the complexity of the LocalExp method is estimated by the sum of computation for all the cell sizes of 5 × 5, 15 × 15, and 25 × 25 pixels
where
We approximately compare the computational complexity of our method with that of the LocalExp method for the five stereo pairs of the Yutu rover. Table 3 gives the approximate computational complexity of our method estimated by equation (12) and that of the LocalExp method estimated by equation (13), together with the ratio between the two. As presented in Table 3, the computational complexity of our method is less than 20% of that of the LocalExp method.
Approximate comparison of computational complexity between our method and LocalExp method.
We first compare the processing time of our method with that of the LocalExp method on the five stereo pairs of the Yutu rover; the results are shown in Figure 8. Our method greatly reduces the processing time and is about six times faster than the LocalExp method. For example, generating the disparity map for stereo pair-1 takes about 3021 s with the LocalExp method but only about 493 s with our method.

Processing time of five stereo pairs of Chang’e-3 Yutu rover using LocalExp method and our method.
As shown in Figure 9, we also compare the processing time of our method with that of the LocalExp method on all the stereo pairs of the KITTI dataset; our method converges faster and has a much lower computational cost than the LocalExp method. For a fair comparison, the processing time of the LocalExp method was obtained by running the downloaded C++ source code (https://github.com/t-taniai/LocalExpStereo) on our laptop using only the CPU. Over all 194 training image pairs of the KITTI dataset, the average processing time is about 1879 s with the LocalExp method and about 350 s with our SLIC-based method, so our method is about 5.3 times faster.
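The speedup factors can be recomputed from the rounded timings quoted in the text, as in the short sketch below. The dictionary keys are labels chosen here for illustration; because the inputs are rounded ("about" values), the ratios are only approximate and can differ slightly from factors computed from unrounded measurements.

```python
# Rounded processing times in seconds quoted in the text: (LocalExp, ours).
reported = {
    "Chang'e-3 stereo pair-1": (3021.0, 493.0),
    "KITTI training average": (1879.0, 350.0),
}

# Speedup = reference time divided by our time.
speedups = {
    name: t_localexp / t_ours
    for name, (t_localexp, t_ours) in reported.items()
}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x faster")
```

Both ratios fall in the five-to-six-times range stated for the two datasets.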

Processing time for 194 training KITTI dataset based on the LocalExp method and our method.
We consider that two main factors contribute to this acceleration. First, SLIC segmentation makes our method converge faster. In these experiments, our method iterates three times in the main loop (one time only with data term and two times with data term and smoothness), which is enough to obtain good results, while the LocalExp method iterates eight times (two times only with data term and five times with data term and smoothness). Second, the LocalExp method uses three different structures with cell sizes of 5 × 5, 15 × 15, and 25 × 25 pixels, while in our method the number of superpixels at every level is set to 1200.
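To give a feel for the second factor, the sketch below counts the local regions each scheme has to visit on a 1226 × 370 KITTI image with a two-level pyramid. It only counts regions; it is not the complexity estimate of equations (12) and (13), it ignores the cost per region and the number of iterations, and the function name is hypothetical.

```python
def grid_cell_count(width, height, cell):
    """Number of cell x cell grid cells needed to cover the image."""
    cells_x = -(-width // cell)   # integer ceiling division
    cells_y = -(-height // cell)
    return cells_x * cells_y


width, height = 1226, 370         # KITTI image size quoted in the text

# Three grid structures, as used by the LocalExp method.
grid_cells = {c: grid_cell_count(width, height, c) for c in (5, 15, 25)}
grid_total = sum(grid_cells.values())

# 1200 superpixels at each of the two pyramid levels used for KITTI.
superpixel_total = 1200 * 2

print(grid_cells, grid_total, superpixel_total)
```

Under these assumptions the grids contain roughly nine times as many regions as the superpixel pyramid, which is consistent in direction with the reported speedup but should not be read as a runtime prediction.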
Analysis of parameter and processing time
We evaluate the sensitivity of our method to some parameters. The number of pyramid levels
Figure 10 shows, as an example, the performance on image pair no. 44 of the KITTI dataset, which contains large untextured and shadowed regions, with different numbers of pyramid levels and SLIC superpixels. We evaluate the disparity maps generated with five different numbers of SLIC superpixels (400, 800, 1200, 1600, and 2000) and four different numbers of pyramid levels (4, 3, 2, and 1). The error rates Out-Noc and Out-All for the different settings are shown in Figure 10(d) and (e). The error rate with two levels is much lower than that with one, three, or four levels. The Avg-Noc and Avg-All for the different settings are shown in Figure 10(f) and (g). The disparity error with two levels is also much lower than that with the other numbers of levels. Therefore, we choose a two-level pyramid. Figure 10(h) gives the processing time for the different settings. When the number of pyramid levels is fixed, the processing time increases almost linearly with the number of SLIC superpixels. Figure 10(i) shows the Out-Noc and Out-All error rates with different numbers of SLIC superpixels for the two-level pyramid. With 1200 superpixels, the disparity error rate is much lower than with the other numbers of superpixels.

Sensitivity analysis of parameters. (a) No. 44 stereo image in KITTI dataset. (b) Disparity map with the LocalExp method. (c) Disparity map with our method. (d)–(h) Out-Noc, Out-All, Avg-Noc, Avg-All, and processing time with different numbers of levels and SLIC superpixels. (i) Out-Noc and Out-All with different numbers of SLIC superpixels at two-level pyramid.
To further examine the relationship between the number of optimization iterations and the matching results, we report quantitative results for four settings: one iteration (with both data term and smoothness term), two iterations (one only with data term and one with both data term and smoothness term), three iterations (one only with data term and two with both data term and smoothness term), and five iterations (two only with data term and three with both data term and smoothness term). The results for the two-level pyramid with 1200 superpixels are given in Table 4. As the main loop increases from 1 to 5, the error rates of Out-Noc and Out-All decrease from 7.80%, 8.30% to 3.14%, 3.23%, and the average disparity errors of Avg-Noc and Avg-All become smaller, while the processing time becomes longer. When the number of iterations reaches 5, both the error rate and the processing time increase. Therefore, the number of iterations is set to three (one only with data term and two with both data term and smoothness term) in our method.
Quantitative results with different optimization times.
After that, we evaluate the sensitivity of our method to some other parameters. The coefficient
Quantitative results with different
Quantitative results with different Δ
Quantitative results with different Δ
Conclusion
An accurate stereo matching method based on continuous 3D plane label estimation has been proposed for rover stereo images. Unlike the previous LocalExp method, which uses three structures with different cell sizes, we propose a coarse-to-fine hierarchical stereo matching method based on superpixels, which makes the convergence faster. The experimental results on the Chang’e-3 lunar rover dataset and the KITTI dataset show that, compared with the state-of-the-art 3D labeling method, our method generates more accurate disparity maps, especially in low-texture regions of images.
Although our method is several times faster than the LocalExp method, it still needs several hundred seconds of processing time, which cannot meet the requirements of actual rover applications. Therefore, in the future we will implement a parallel version of the algorithm on a GPU to meet the speed requirements of practical applications.
Footnotes
Acknowledgements
The authors would like to thank the editors and anonymous reviewers for their valuable comments and helpful suggestions, which greatly improved the article’s quality.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China under grant no. 61773383.
