Abstract
Human beings process stereoscopic correspondence across multiple purposes like robot navigation, automatic driving, and virtual or augmented reality. However, this bioinspiration is ignored by state-of-the-art dense stereo correspondence matching methods. Cost aggregation is one of the critical steps in the stereo matching method. In this article, we propose an optimized cross-scale cost aggregation scheme with coarse-to-fine strategy for stereo matching. This scheme implements cross-scale cost aggregation with the smoothness constraint on neighborhood cost, which essentially extends the idea of the inter-scale and intra-scale consistency constraints to increase the matching accuracy. The neighborhood costs are not only used in the intra-scale consistency to ensure that the regularized costs vary smoothly in an eight-connected neighbors region but also incorporated with inter-scale consistency to optimize the disparity estimation. Additionally, the improved method introduces an adaptive scheme in each scale with different aggregation methods. Finally, experimental results evaluated both on classic Middlebury and Middlebury 2014 data sets show that the proposed method performs better than the cross-scale cost aggregation. The whole stereo correspondence algorithm achieves competitive performance in terms of both matching accuracy and computational efficiency. An extensive comparison, including the KITTI benchmark, illustrates the better performance of the proposed method also.
Introduction
Dense correspondence between stereo pair images of the same scene, termed stereo matching, is an important issue in computer vision especially in visual-based robotic navigation, automatic driving, and augmented reality. Stereo matching extracts the horizontal offsets relating correspondence using the binocular characteristic. These offsets, termed as disparities, could be easily converted to depth information, which represents the three-dimensional parameters of the scene. It is a challenging problem to obtain the accurate disparity maps in uncontrolled dynamic environments due to the inherent stereo vision limitations. Cost aggregation plays an important role in stereo matching methods, for instance, local methods like those of Pham and Jeon 1 and Lu et al. 2 and global methods like those of Sun et al. 3 and Xiang et al. 4 Cost aggregation could be considered as building the cost volume, termed disparity space image (DSI) with W × H × L dimensions, where W and H represent the width and height of stereo pair images, respectively, and L denotes the range of disparity. Traditional cost aggregation methods performed as image filters blur the depth boundaries during cost aggregation. Yoon and Kweon 5 adopted an area-based correspondence search in a given support window, which focuses on the dissimilarity calculation. However, the method is computationally more expensive than other area-based local methods. To improve the trade-off between the matching accuracy and computational efficiency, several papers have proposed some effective strategies. Yang 6 proposed a nonlocal cost aggregation (NLCA) framework adaptively based on pixel similarity on a minimum spanning tree (MST) structure derived from the stereo image pair to preserve depth edges with high efficiency, which combined the advantages of both the local and the global methods. Meanwhile, stereo matching methods follow one smoothness constraint: disparity value in a given neighborhood should be the same, except for depth boundaries. Zhang et al. 7 proposed that temporal information, gathered from consecutive frames, had been proven to improve the accuracy of stereo matching, which also reflects the changes in occlusion region happened in the disparity discontinuity area. While several methods, such as those of Cheng et al. and Zhang et al., 8, 9 have recently demonstrated impressive performance in this context, none of them explicitly takes advantage of the fact that bioinspiration with coarse-to-fine (CTF) framework could improve the cost aggregation in stereo matching.
In contrast, we propose an adaptive cross-scale cost aggregation (CSCA) scheme with CTF strategy for stereo matching. This scheme implements CSCA with the smoothness constraint on neighborhood cost, which essentially extends the idea of the inter-scale and intra-scale consistency constraints to increase the matching accuracy. The neighborhood costs are not only used in the intra-scale consistency to ensure that the regularized costs vary smoothly in an eight-connected neighbors region, but also incorporated with inter-scale consistency to optimize the disparity estimation. Additionally, one adaptive scheme of CSCA will be illustrated in following sections.
The rest of this article is organized as follows. In the “Related works” section, we introduce some state-of-the-art cost aggregation methods for stereo correspondences, and the traditional CSCA framework is introduced in the “CSCA framework” section. In the “The Proposed method” section, the adaptive CSCA stereo matching method is then presented. In the “Experiment and evaluation” section, the experimental results are evaluated with three kinds of data sets. Finally, the “Conclusion” section provides some concluding remarks regarding this research.
Related works
State-of-the-art cost aggregation methods have made great efforts to computer vision, especially in the domain of robotic stereo correspondences. With rapid development of hardware, robot vision tend to process stereoscopic correspondence across multiple scales like human beings. As aforementioned, Yang 6 proposed an NLCA method based on MST structure. However, the aggregation over a tree structure generated from truncated edge weight will suffer from edge blurring effect since it assumes that disparity smoothens at every point. Due to this reason, appropriate priors must be employed to indicate the potential locations of the depth boundaries. Based on NLCA, Cheng et al. 8 proposed one cross-tree structure that adopts edge and superpixel priors tackling the false cost aggregation across the depth boundaries. But it only verifies edge and superpixel priors with static paired images, without considering the moving objects in the urban scene. Meanwhile, Pham et al. 10 proposed one similar segment-simple-tree (SST) that is designed for practical driving images.
A common property of cost aggregation is operated at the finest scale of the paired images. Early researchers like Mallot et al. 11 proposed that information at CTF scales was processed interactively in the correspondence search of the human stereo vision system. Thus, it is reasonable that cost aggregation should be operated across multiple scales rather than only on the finest scale like conventional methods. The CTF strategy has been widely applied not only in global stereo matching methods such as dynamic programming and semi-global matching, but also in some local methods. Most CTF approaches explicitly process the disparity estimation in the scale space, and make sure the disparity consistency across multi-scales. Zhang et al. 9 reformulated the CTF strategy, and proposed one CSCA strategy for stereo matching, which built stereo correspondence search across multi-scales and aggregated costs across multi-scales rather than conventional methods at the finest scale. Different from the conventional CTF methods, CSCA builds the cost volume in the scale space and makes sure the consistency across multi-scales. Recently, Ma et al. 12 extended the aforementioned CSCA framework by integrating intra-scale smoothness constraint in stereo matching, which considered the inter-scale consistency of cost volume and intra-scale consistency of neighbors’ cost value respectively.
However, there are still some challenging problems for stereo matching. With these challenges, it is important for stereo matching algorithms to overcome those problems and generate an accurate disparity map. In this article, we reform the CSCA framework with adaptive cost aggregation methods at each scale. First, the consistency of cost volume at the adjoining scales could be enforced by a generalized Tikhonov regularizer, following the optimization perspective of Zhang et al. and Ma et al.
9,12
Then, the aforementioned conventional cost aggregation methods are employed to ensure the intra-scale consistency of the cost volume, which can be optimized to generate more robust cost volume and accurate disparity map. Experimental results show that the proposed method performs better than CSCA evaluated both on classic Middlebury and Middlebury 2014 data sets. An extensive comparison, including the KITTI benchmark, also illustrates the better performance of the proposed method than CSCA. The proposed stereo correspondence algorithm achieves competitive performance in terms of both matching accuracy and computational efficiency. The main contributions of this article are twofold: ensuring one intra-scale smooth consistency with eight-connected neighbors costs; proposing an adaptive and effective scheme with different aggregation methods for each scale.
CSCA framework
In this section, we first briefly introduce the CSCA framework. For paired images Ileft and Iright with the same width W and height H, one single pixel
il is the correspondence point of i with a disparity l and location
The cost volume
where Ni represents a neighbor region of i. K(i, j) is a kind of kernel, which measures the similarity between pixels i and j, while
Equation (3) summarizes the principle of most cost aggregation methods like those of Yoon and Kweon, Yang, Cheng et al., Pham et al., and Hosni et al.
5, 6,8,10,15
In filter methods, Ni is a local region around i, but Ni represents the whole image in the tree methods. During the cost aggregation procedure, the filter methods depend on the local similarity, while tree-based ones are based on such edges of different regions. In all, the state-of-the-art cost aggregation methods perform very well in the high-texture region, but usually fail in low-texture region. According to the research of Menz and Freeman,
16
cost matching with the coarse scale can obtain the more accurate correspondence in those low-texture regions than cost matching at fine scale. In general, previous CTF approaches reduced the search space at the current scale by using a disparity map estimated from the cost volume at the coarser scale, often provoking the loss of small disparity details. Here, s ∈ [0, S] is introduced as one scale parameter in this framework, while Cs represents the cost volumes at different scales. The CSCA builds multi-scale cost volume Cs with the different down-sampled images and a factor of ηs, which has the dimensions of
where

The flowchart of CSCA. Left side represents the cost volumes under different scales from 0 to S; corresponding variables
For
When we set
With the similar equations of s from 0 to S, we have S + 1 linear equations in total, which indicates a straight-forward expression as follows:
The matrix A is an (S + 1) × (S + 1) tridiagonal constant matrix, which can be easily derived from equation (9). Since A is tridiagonal, its inverse always exists. Thus,
In the CSCA framework, the final regularized cost volume C0(i0, l0) will be obtained through the adaptive combination of the cost aggregation results
The proposed method
Information from the finest scale is not enough but when inter-scale regularization is adopted, useful information from coarse scales reshapes the cost vector, generating disparity closer to the ground truth. Experimental results of Zhang et al. 9 validate the high overall accuracy and efficiency of CSCA with several data sets. Additionally, several different cost aggregation methods such as the NLCA, the guided filter (GF), the segment tree method (ST), and the bilateral filter method (BF) were evaluated in their paper. The guided filter (GF) outperformed other cost aggregation methods in the most scenes.
In this article, we proposed an optimized adaptive CSCA, denoted as A-CSCA scheme for stereo matching, that implements CSCA with adaptive strategy and extends the idea of the inter-scale and intra-scale consistency constraints on cost neighborhood to increase the matching accuracy. In the DSI, the neighborhood costs at each disparity level are not only used in the intra-scale consistency to ensure that the regularized costs vary smoothly in a local region, but also incorporated with inter-scale consistency to optimize the disparity estimation. One Winner-Takes-All (WTA) strategy is employed to generate the final disparity maps. Algorithm 1 describes a basic workflow of A-CSCA, where we can attempt to evaluate alternative cost aggregation methods in step 3. The proposed method brings a little increment in computational complexity, compared to conventional cost aggregation methods.
Pipeline of A-CSCA algorithm.
Matching cost computation
Similar with CSCA, two paired images Ileft and Iright take part in the disparity calculation of our framework. It first computes the matching cost between Ileft and Iright with the predefined scale S, according to the disparity range L. Thus, we build several DSIs under different scale space. Let Cd(p) denotes the matching cost for pixel p at disparity level l.
where Ccensus(l) and Cgrad(l) are census cost term and gradient-based measure cost term, respectively; α is a scale factor used to control the contribution of two cost terms.
Cost aggregation
Previous research 13 had proven that various cost aggregation methods can be formulated uniformly as WLS optimization problems, which enforce smoothness constraint over neighboring costs. As shown in equation (4), the first term reflects the data fidelity in the traditional CSCA model, additional regularization on cost volume is proposed on the different scale space that reflects the inter-scale consistency. Inspired by the research work of Ma et al., 12 our smoothness constraint is enforced with eight neighbor costs, in order to make sure that the disparity varies smoothly. Thus, we define the formulation of proposed cost aggregation as follow:
As shown in Figure 1, cost slice extracted from aggregated cost volume at different scales with a specified disparity value. Because the above optimization problem is convex, we can get the solution by finding the stationary point of the optimization objective in equation (12).
when β equals 0, equation (12) will become equation (9). The detailed explanation of equation (12) solution is illustrated in Ma et al. 12 Here, we proposed one adaptive alternation for CSCA at each scale aggregation. Different from traditional CSCA method, we adopt the SST aggregation method in the odd scale [1, 3, .., S − 1]. Additionally, in the even scale [0, 2, 4, ..., S], the GF is employed to aggregate costs. Then, we sum aggregated cost of different methods at each scale with adaptive weights into the finest scale. After the above cost aggregation step, we adopt WTA strategy to obtain raw disparity map of the finest scale, which represents the accurate disparity in most regions. Local minima always happen in the cost volume without CSCA structure, causing mismatch in disparity map. Generally, information only with the finest scale is not enough to generate accurate disparity map. But when inter-scale regularization is adopted, much more information from coarse scales reshapes the cost volume, result in disparity closer to the ground truth. Experimental results show that the combination of SST and GF brings obvious improvement in matching accuracy.
Experiment and evaluation
Environment setup
In this section, we compare the proposed A-CSCA with traditional CSCA method. Meanwhile, our experiments are evaluated on classic Middlebury data set as introduced in Scharstein and Szeliski, 17 the 2014 Middlebury data set of Scharstein et al., 18 and 194 training pairs of the KITTI data set proposed by Geiger et al., 19 under such hardware environments: Intel Xeon L5640 (2.2GHz), AMD R9 280x. The scale S is preset to 5, resulting totally six scales used both in A-CSCA and in CSCA frameworks. Thus, we employed the combination of SST and GF to aggregate cost, instead of aggregating with separate method. More specifically, in the even scale [0, 2, 4], we adopted the GF to aggregate costs. And SST aggregation is executed in the odd scale [1, 3, 5].
In stereo matching algorithms, matching accuracy is the most important criteria, which directly reflects the final quality of the disparity map. In our experiments, we did not apply any disparity optimization technique, in order to keep fairly comparison between A-CSCA and CSCA. Additionally, several parameters are defined in the proposed method. The cost balance parameter α is set to 0.3. Following the parameters setting in CSCA, 9 the regularization parameters λ is set to 0.3 for Middlebury data set, and 1.0 for KITTI and Middlebury 2014 data sets. Two regularized parameters λ and β of the proposed method are defined as (0.5, 0.7) for classic Middlebury data set and (1.0, 1.2) for KITTI and Middlebury 2014 data sets. Figure 2 shows the performance of varying inter-scale regularization parameter λ with different aggregation methods, when β = 0. In a similar way, the optimal value of parameter β could be chosen with lots of tuning experiments.

The performance of varying inter-scale regularization parameter λ for “GF” and “SST” methods with scales, when β = 0. GF: guided filter; SST: segment-simple-tree.
Classic Middlebury data set
First, the parameter setting needs to be analyzed for the Middlebury data sets “Cone,” “Teddy,” “Venus,” and “Tsukuba” image pairs before evaluating. Four classic Middlebury image pairs were utilized to determine the optimal parameter setting for the Middlebury data set. Figure 3 gives visual comparisons between CSCA and the proposed method, and the corresponding quantitative results are listed in Table 1, which shows the average rate of matching errors in “Noc,” “all,” and “disk” regions. “Noc” represents errors in nonoccluded regions, “all” represents errors in nonoccluded and half-occluded regions, and “disk” represents errors in visible pixels near occluded regions. Evaluations for CSCA and the proposed method on classic Middlebury data sets are analyzed with error thresholds 1. The results indicate that our method obtains better matching accuracy and time efficiency than CSCA. Through the visual comparison, our adaptive cost aggregation strategy performs better on textureless regions.

Evaluation visual results of classic Middlebury image pairs: (a) original “Cone”, “Teddy”, “Venus,” and “Tsukuba” image pairs, (b) the ground-truth maps, (c) disparity maps calculated by CSCA, and (d) disparity maps calculated by the proposed method. CSCA: cross-scale cost aggregation.
Evaluations for CSCA and our on classic Middlebury data set.a
CSCA: cross-scale cost aggregation; GF: guided filter.
KITTI data set
The KITTI data set consists of 194 training image pairs and 195 test image pairs for stereo correspondence matching evaluation. These image pairs are derived under practical condition with real-world illumination, which contains a large part of textureless regions. Meanwhile, related ground truth of disparity maps has been measured by laser scanner and GPS system. The same process about regularization parameters λ and β was proceeded on 10 randomly selected image pairs from KITTI. During our experiments, we randomly select 30 paired images available in KITTI data set to evaluate the proposed method. Figure 4 gives some visual comparisons between CSCA and the proposed A-CSCA. As shown in Table 2, the evaluation metric is the same as the KITTI benchmark with an error threshold of 3: Out-Noc and Out-All represent percentage of erroneous pixels in nonoccluded and all regions, respectively; Avg-Noc and Avg-All represent average disparity error in nonoccluded and all regions, respectively. According to the results illustrated in Figure 4 and Table 2, our method improved the matching accuracy with 3 − 5% approximately, compared with CSCA.

Evaluation visual results of KITTI image pairs: (a) original image pairs, (b) disparity maps calculated by CSCA, and (c) disparity maps calculated by the proposed method. CSCA: cross-scale cost aggregation.
Evaluations for CSCA and the proposed method on KITTI benchmark data set and the absolute disparity error threshold of 3.
CSCA: cross-scale cost aggregation.
aNoc represents errors in nonoccluded regions, all represents errors in nonoccluded and half-occluded regions, and disk represents errors in visible pixels near occluded regions. Shown values are the average and standard deviation of the percentage of bad pixels (threshold = 1).
Middlebury 2014 data set
Finally, we evaluated the proposed method on the Middlebury 2014 data set. Our evaluation is based on 15 training image pairs as illustrated in Middlebury 2014 data set. These image pairs involve more complex environments, providing a more challenging task than classic Middlebury data sets. Four criteria Out-Noc, Out-All, Avg-Noc, and Avg-All are the same with the previous KITTI evaluation, and the error threshold is 1. Table 3 also shows the quantitative comparison between CSCA and the proposed method. As shown in Figure 5, our method performs better in the flat and texture regions of disparities than CSCA, especially in the disparity discontinuity region.
Evaluations for CSCA and the proposed method on 2014 Middlebury data set and the absolute disparity error threshold of 1.
CSCA: cross-scale cost aggregation.

Evaluation visual results of Middlebury 2014 data sets “Bicycle,” “Motorcycle,” and “Piano” image pairs: (a) original image pairs, (b) disparity maps calculated by CSCA, and (c) disparity maps calculated by the proposed method.
Based on the above experimental results, the proposed method outperformed CSCA in most of data sets. A confirmation is thus made that the adaptive CSCA framework with different cost aggregation methods in each scale brings an improvement in the matching accuracy.
Conclusion
In this article, a novel optimized CSCA method with CTF strategy for stereo matching has been proposed. The scheme implements CSCA with the smoothness constraint on neighbor costs, which essentially extends the idea of the inter-scale and intra-scale consistency constraints to increase the matching accuracy. The neighborhood costs are not only used in the intra-scale consistency to ensure that the regularized costs vary smoothly in an eight-connected neighbors region, but also incorporated with inter-scale consistency to optimize the disparity estimation. Additionally, the improved method introduces an adaptive scheme in each scale with different aggregation methods. Experimental results have shown that the proposed adaptive CSCA method outperforms the original CSCA method on Middlebury and KITTI data sets.
As one extension, we plan to apply a global optimization scheme into current method for better temporal consistency, in order to evaluate the proposed method with some real scene simulations. Our future work concentrate on reducing the complexity of the proposed method and implementing it in real-time acceleration with the help of parallel structure like GPU platform.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work is funded by Zhejiang Provincial Natural Science Foundation of China under grants LY16F020033 and LY18F020034, the Natural Science Foundation of China under grants 61702275 and 41775008, the Scientific Research Foundation of Nanjing University of Information Science and Technology under grant S8113055001, and the Natural Science Foundation of JiangSu Province under grant BK20150931.
