Improving stereo matching algorithm with adaptive cross-scale cost aggregation

Abstract

Stereoscopic correspondence underpins applications such as robot navigation, automatic driving, and virtual or augmented reality, and human beings process it across multiple scales. However, this bioinspiration is ignored by state-of-the-art dense stereo correspondence matching methods. Cost aggregation is one of the critical steps in stereo matching. In this article, we propose an optimized cross-scale cost aggregation scheme with a coarse-to-fine strategy for stereo matching. This scheme implements cross-scale cost aggregation with a smoothness constraint on neighborhood costs, which essentially extends the idea of the inter-scale and intra-scale consistency constraints to increase the matching accuracy. The neighborhood costs are not only used in the intra-scale consistency to ensure that the regularized costs vary smoothly in an eight-connected neighborhood but also incorporated with the inter-scale consistency to optimize the disparity estimation. Additionally, the improved method introduces an adaptive scheme that uses different aggregation methods at different scales. Experimental results on both the classic Middlebury and Middlebury 2014 data sets show that the proposed method performs better than cross-scale cost aggregation. The whole stereo correspondence algorithm achieves competitive performance in terms of both matching accuracy and computational efficiency. An extensive comparison, including the KITTI benchmark, also illustrates the better performance of the proposed method.

Keywords

Stereo matching, three-dimensional reconstruction, depth map, image processing, computer vision

Introduction

Dense correspondence between the stereo pair images of the same scene, termed stereo matching, is an important problem in computer vision, especially in vision-based robotic navigation, automatic driving, and augmented reality. Stereo matching exploits the binocular geometry to extract the horizontal offsets between corresponding pixels. These offsets, termed disparities, can be easily converted to depth information, which represents the three-dimensional structure of the scene. Obtaining accurate disparity maps in uncontrolled dynamic environments is challenging because of the inherent limitations of stereo vision. Cost aggregation plays an important role in stereo matching methods, for instance, in local methods like those of Pham and Jeon¹ and Lu et al.² and global methods like those of Sun et al.³ and Xiang et al.⁴ Cost aggregation can be considered as building the cost volume, termed the disparity space image (DSI), with W × H × L dimensions, where W and H represent the width and height of the stereo pair images, respectively, and L denotes the disparity range. Traditional cost aggregation methods, which act as image filters, blur the depth boundaries during cost aggregation. Yoon and Kweon⁵ adopted an area-based correspondence search in a given support window, which focuses on the dissimilarity calculation. However, the method is computationally more expensive than other area-based local methods. To improve the trade-off between matching accuracy and computational efficiency, several papers have proposed effective strategies. Yang⁶ proposed a nonlocal cost aggregation (NLCA) framework that aggregates costs adaptively, based on pixel similarity, on a minimum spanning tree (MST) derived from the stereo image pair; it preserves depth edges with high efficiency and combines the advantages of both local and global methods.
Meanwhile, stereo matching methods follow a smoothness constraint: disparity values in a given neighborhood should be the same, except at depth boundaries. Zhang et al.⁷ showed that temporal information gathered from consecutive frames improves the accuracy of stereo matching and also reflects the changes of occluded regions in disparity discontinuity areas. While several methods, such as those of Cheng et al. and Zhang et al.,^{8, 9} have recently demonstrated impressive performance in this context, none of them explicitly exploits the fact that a bioinspired coarse-to-fine (CTF) framework can improve cost aggregation in stereo matching.

In contrast, we propose an adaptive cross-scale cost aggregation (CSCA) scheme with a CTF strategy for stereo matching. This scheme implements CSCA with a smoothness constraint on neighborhood costs, which essentially extends the idea of the inter-scale and intra-scale consistency constraints to increase the matching accuracy. The neighborhood costs are not only used in the intra-scale consistency to ensure that the regularized costs vary smoothly in an eight-connected neighborhood, but also incorporated with the inter-scale consistency to optimize the disparity estimation. Additionally, an adaptive variant of CSCA is presented in the following sections.

The rest of this article is organized as follows. In the “Related works” section, we introduce some state-of-the-art cost aggregation methods for stereo correspondence, and the traditional CSCA framework is introduced in the “CSCA framework” section. The adaptive CSCA stereo matching method is then presented in the “The proposed method” section. In the “Experiment and evaluation” section, the experimental results are evaluated on three data sets. Finally, the “Conclusion” section provides some concluding remarks regarding this research.

Related works

State-of-the-art cost aggregation methods have contributed greatly to computer vision, especially in the domain of robotic stereo correspondence. With the rapid development of hardware, robot vision tends to process stereoscopic correspondence across multiple scales, as human beings do. As mentioned above, Yang⁶ proposed an NLCA method based on an MST structure. However, aggregation over a tree structure generated from truncated edge weights suffers from an edge-blurring effect, since it assumes that disparity is smooth at every point. For this reason, appropriate priors must be employed to indicate the potential locations of depth boundaries. Based on NLCA, Cheng et al.⁸ proposed a cross-tree structure that adopts edge and superpixel priors to tackle false cost aggregation across depth boundaries. However, it was only verified on static image pairs, without considering moving objects in urban scenes. Meanwhile, Pham et al.¹⁰ proposed a similar segment-simple-tree (SST) designed for practical driving images.

A common property of cost aggregation methods is that they operate at the finest scale of the paired images. Early researchers like Mallot et al.¹¹ proposed that information at coarse and fine scales is processed interactively in the correspondence search of the human stereo vision system. Thus, it is reasonable that cost aggregation should operate across multiple scales rather than only at the finest scale, as conventional methods do. The CTF strategy has been widely applied not only in global stereo matching methods such as dynamic programming and semi-global matching, but also in some local methods. Most CTF approaches explicitly perform the disparity estimation in the scale space and enforce disparity consistency across scales. Zhang et al.⁹ reformulated the CTF strategy and proposed the CSCA strategy for stereo matching, which searches for stereo correspondence and aggregates costs across multiple scales rather than only at the finest scale. Different from conventional CTF methods, CSCA builds the cost volume in the scale space and enforces its consistency across scales. Recently, Ma et al.¹² extended the CSCA framework by integrating an intra-scale smoothness constraint in stereo matching, which considers both the inter-scale consistency of the cost volume and the intra-scale consistency of the neighbors' cost values.

However, stereo matching still faces challenging problems that algorithms must overcome to generate an accurate disparity map. In this article, we reform the CSCA framework with adaptive cost aggregation methods at each scale. First, the consistency of the cost volumes at adjoining scales is enforced by a generalized Tikhonov regularizer, following the optimization perspective of Zhang et al. and Ma et al.^{9, 12} Then, the aforementioned conventional cost aggregation methods are employed to ensure the intra-scale consistency of the cost volume, which can be optimized to generate a more robust cost volume and a more accurate disparity map. Experimental results show that the proposed method performs better than CSCA on both the classic Middlebury and Middlebury 2014 data sets. An extensive comparison, including the KITTI benchmark, also illustrates that the proposed method performs better than CSCA. The proposed stereo correspondence algorithm achieves competitive performance in terms of both matching accuracy and computational efficiency. The main contributions of this article are twofold:

ensuring intra-scale smoothness consistency with the costs of the eight-connected neighbors;

proposing an adaptive and effective scheme that uses different aggregation methods at different scales.

CSCA framework

In this section, we first briefly introduce the CSCA framework. Consider paired images I^left and I^right with the same width W and height H, and let $i = (x_{i}, y_{i})$ be a pixel in the left image, where x_i and y_i are its coordinates. C(i, l) denotes the cost of pixel i at disparity l, which is defined based on the intensity and gradient similarity between the two matching pixels i and i_l. In the CSCA framework, the initial matching cost is calculated as follows:

\begin{array}{l} C(i, l) = (1 - α) \cdot \min(∥ I(i) - I'(i_{l}) ∥) \\ \qquad + α \cdot \min(∥ \nabla_{x} I(i) - \nabla_{x} I'(i_{l}) ∥) \end{array} \tag{1}

Here, i_l is the pixel corresponding to i at disparity l, located at $(x_{i} - l, y_{i})$ in the right image. The parameter α balances the intensity and gradient terms. With equation (1), we obtain the cost volume $C \in ℝ^{W \times H \times L}$ at the finest scale, which stores the matching cost of each pixel at each disparity level.
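The construction of the cost volume can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments: images are grayscale floating-point arrays, gradients are central differences, no truncation is applied inside the `min` terms because equation (1) as printed gives no thresholds, and pixels whose match falls outside the right image keep an infinite cost. The function name `initial_cost_volume` is hypothetical.

```python
import numpy as np

def initial_cost_volume(left, right, max_disp, alpha=0.3):
    """Cost volume C[y, x, l] with the structure of equation (1).

    left, right: 2-D float arrays of equal shape (grayscale images).
    Assumption of this sketch: costs for x < l (no valid match in the
    right image) are left at +inf.
    """
    h, w = left.shape
    # Horizontal gradients (central differences in the interior).
    gx_left = np.gradient(left, axis=1)
    gx_right = np.gradient(right, axis=1)
    cost = np.full((h, w, max_disp), np.inf)
    for l in range(max_disp):
        # Pixel (x, y) in the left image is compared with (x - l, y) in the right.
        intensity = np.abs(left[:, l:] - right[:, : w - l])
        gradient = np.abs(gx_left[:, l:] - gx_right[:, : w - l])
        cost[:, l:, l] = (1 - alpha) * intensity + alpha * gradient
    return cost
```

Taking `np.argmin(cost, axis=2)` on this raw volume would give an unaggregated disparity estimate, which is typically noisy; the aggregation step is what smooths it.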

The cost volume $C \in ℝ^{W \times H \times L}$ holds many candidate matches, among which we search for the best one. Inspired by the weighted least squares (WLS) formulation of cost aggregation of Min and Sohn¹³ and Milanfar,¹⁴ cost aggregation can be formulated as denoising C:

\tilde{C}(i, l) = \arg\min_{z} \frac{1}{Z_{i}} \sum_{j \in N_{i}} K(i, j) ∥ z - C(j, l) ∥^{2} \tag{2}

where N_i represents a neighbor region of i, K(i, j) is a kernel that measures the similarity between pixels i and j, and $Z_{i} = \sum_{j \in N_{i}} K (i, j)$ is a normalization constant. $\tilde{C}$ denotes the denoised cost volume. The solution of the above WLS problem is

\tilde{C}(i, l) = \frac{1}{Z_{i}} \sum_{j \in N_{i}} K(i, j) C(j, l) \tag{3}
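The step from the WLS problem to its closed-form solution can be checked numerically for a single pixel and disparity. The sketch below is illustrative only: the neighbor costs and kernel weights are made-up numbers, and the helper names `wls_objective` and `wls_solution` are hypothetical.

```python
import numpy as np

def wls_objective(z, neighbor_costs, kernel):
    """Normalized weighted squared error of a candidate cost z (the WLS objective)."""
    return np.sum(kernel * (z - neighbor_costs) ** 2) / np.sum(kernel)

def wls_solution(neighbor_costs, kernel):
    """Closed-form minimizer: the kernel-weighted mean of the neighbor costs."""
    return np.sum(kernel * neighbor_costs) / np.sum(kernel)

# Illustrative values for C(j, l) and K(i, j) over a five-pixel neighborhood.
neighbor_costs = np.array([0.9, 0.2, 0.4, 0.3, 0.8])
kernel = np.array([0.1, 1.0, 0.8, 0.9, 0.2])

z_star = wls_solution(neighbor_costs, kernel)

# Brute-force search over a fine grid should land next to the closed form.
candidates = np.linspace(0.0, 1.0, 1001)
objective_values = [wls_objective(z, neighbor_costs, kernel) for z in candidates]
z_grid = candidates[int(np.argmin(objective_values))]
```

Because the objective is a convex quadratic in z, setting its derivative to zero gives exactly the weighted mean, which is why aggregation methods differ only in their choice of N_i and K(i, j).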

Equation (3) summarizes the principle of most cost aggregation methods, such as those of Yoon and Kweon, Yang, Cheng et al., Pham et al., and Hosni et al.^{5, 6, 8, 10, 15} In filter-based methods, N_i is a local region around i, whereas in tree-based methods, N_i represents the whole image. During cost aggregation, filter-based methods depend on local similarity, while tree-based ones rely on the edges between different regions. Overall, state-of-the-art cost aggregation methods perform very well in high-texture regions but usually fail in low-texture regions. According to the research of Menz and Freeman,¹⁶ matching at a coarse scale can obtain more accurate correspondence in low-texture regions than matching at a fine scale. In general, previous CTF approaches reduced the search space at the current scale by using a disparity map estimated from the cost volume at the coarser scale, often causing the loss of small disparity details. Here, s ∈ {0, 1, ..., S} is introduced as the scale parameter of the framework, and C^s represents the cost volume at scale s. CSCA builds the multi-scale cost volumes C^s from images down-sampled by a factor of η^s, so that C^s has dimensions $(W / η^{s}) \times (H / η^{s}) \times (L / η^{s})$. Instead of reducing the search space, CSCA enforces the inter-scale consistency on the cost volume by adding a generalized Tikhonov regularizer, leading to the optimization objective in equation (4):

\begin{array}{l} \hat{v} = \arg\min_{\{z^{s}\}_{s=0}^{S}} \sum_{s=0}^{S} \frac{1}{Z_{i^{s}}^{s}} \sum_{j^{s} \in N_{i^{s}}} K(i^{s}, j^{s}) ∥ z^{s} - C^{s}(j^{s}, l^{s}) ∥^{2} \\ \qquad + λ \sum_{s=1}^{S} ∥ z^{s} - z^{s-1} ∥^{2} \end{array} \tag{4}

where $\{i^{s}\}_{s=0}^{S}$ and $\{l^{s}\}_{s=0}^{S}$ represent the corresponding variables at each scale. i⁰ is the ith pixel at the finest scale and i^s is the pixel corresponding to i⁰ at scale s, as shown in Figure 1. $C^{s}(i^{s}, l^{s})$ is the cost value of pixel i^s at disparity level l^s on scale s. $N_{i^{s}}$ is the set of neighboring pixels of i^s on the sth scale, and K(i^s, j^s) measures the similarity between i^s and j^s. λ is a constant parameter that controls the strength of the regularization. In Figure 1, the red arrow indicates the data fidelity between the cost volumes and the corresponding regularized ones. If we ignore the scale space and only consider cost aggregation at the finest scale, the optimization objective reduces to equation (2). In the CSCA model, cost aggregation is performed at each scale and then regularized across multiple scales by enforcing the inter-scale consistency, as shown by the blue arrow. The intra-scale consistency is also designed to ensure that the regularized costs vary smoothly in a local region. Let $F(\{z^{s}\}_{s=0}^{S})$ denote the optimization objective of equation (4):

\begin{array}{l} F(\{z^{s}\}_{s=0}^{S}) = \sum_{s=0}^{S} \frac{1}{Z_{i^{s}}^{s}} \sum_{j^{s} \in N_{i^{s}}} K(i^{s}, j^{s}) ∥ z^{s} - C^{s}(j^{s}, l^{s}) ∥^{2} \\ \qquad + λ \sum_{s=1}^{S} ∥ z^{s} - z^{s-1} ∥^{2} \end{array} \tag{5}

Figure 1.

The flowchart of CSCA. The left side represents the cost volumes at different scales from 0 to S; the corresponding variables $\{i^{s}\}_{s=0}^{S}$ and $\{l^{s}\}_{s=0}^{S}$ are also introduced at each scale. The red arrow indicates the data fidelity between the cost volumes and the corresponding regularized ones. In the CSCA model, cost aggregation is performed at each scale and then regularized across multiple scales by enforcing the inter-scale consistency, as shown by the blue arrow. The intra-scale consistency is also designed to ensure that the regularized costs vary smoothly in a local region such as the eight-connected neighbors. CSCA: cross-scale cost aggregation.

For $s \in \{1, 2, \ldots, S - 1\}$, the partial derivative of F with respect to z^s is

\begin{array}{l} \frac{\partial F}{\partial z^{s}} = \frac{2}{Z_{i^{s}}^{s}} \sum_{j^{s} \in N_{i^{s}}} K(i^{s}, j^{s}) (z^{s} - C^{s}(j^{s}, l^{s})) \\ \qquad + 2 λ (z^{s} - z^{s-1}) - 2 λ (z^{s+1} - z^{s}) \\ \quad = 2 (- λ z^{s-1} + (1 + 2 λ) z^{s} - λ z^{s+1} - \tilde{C}^{s}(i^{s}, l^{s})) \end{array} \tag{6}

When we set $\frac{\partial F}{\partial z^{s}} = 0$ , equation (6) will become:

\tilde{C}^{s}(i^{s}, l^{s}) = - λ z^{s-1} + (1 + 2 λ) z^{s} - λ z^{s+1} \tag{7}

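Together with the analogous equations for s = 0 and s = S, we have S + 1 linear equations in total. Stacking the regularized costs as $\hat{v} = [z^{0}, z^{1}, ..., z^{S}]^{T}$ and the per-scale aggregated costs as $\tilde{v} = [\tilde{C}^{0}(i^{0}, l^{0}), ..., \tilde{C}^{S}(i^{S}, l^{S})]^{T}$, they can be written compactly as follows: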

A \hat{v} = \tilde{v} \tag{8}

The matrix A is an (S + 1) × (S + 1) tridiagonal constant matrix, which can be easily derived from equation (7). Since A is strictly diagonally dominant, its inverse always exists. Thus,

\hat{v} = A^{-1} \tilde{v} \tag{9}

In the CSCA framework, the final regularized cost volume at the finest scale, $\hat{C}^{0}(i^{0}, l^{0})$, is obtained as a weighted combination of the per-scale cost aggregation results $[\tilde{C}^{0}(i^{0}, l^{0}), ..., \tilde{C}^{S}(i^{S}, l^{S})]$, with the weights given by the first row of $A^{-1}$. From an optimization perspective, this combination fuses the cost aggregation results across scales.
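To make the role of A concrete, the following sketch builds the tridiagonal matrix and reads off the fusion weights for the finest scale. It is an illustration under assumptions: the interior rows follow the stationarity condition above, while the boundary rows for s = 0 and s = S are obtained by differentiating the objective with only one regularizer term attached, which gives a diagonal entry of 1 + λ instead of 1 + 2λ. The function name `inter_scale_matrix` is hypothetical, and six scales with λ = 0.3 are chosen only as an example.

```python
import numpy as np

def inter_scale_matrix(num_scales, lam):
    """Tridiagonal (S+1) x (S+1) matrix A, with num_scales = S + 1."""
    a = np.zeros((num_scales, num_scales))
    for s in range(num_scales):
        # Number of adjacent scales linked to scale s by the regularizer:
        # one at the two boundary scales, two in the interior.
        links = int(s > 0) + int(s < num_scales - 1)
        a[s, s] = 1.0 + links * lam
        if s > 0:
            a[s, s - 1] = -lam
        if s < num_scales - 1:
            a[s, s + 1] = -lam
    return a

A = inter_scale_matrix(num_scales=6, lam=0.3)
# First row of the inverse: weights that combine the per-scale aggregated
# costs into the regularized cost at the finest scale.
weights = np.linalg.inv(A)[0]
```

Each row of A sums to one, so the weights in every row of the inverse also sum to one; the fused cost is therefore a weighted average of the per-scale costs, and λ = 0 reduces it to the finest-scale cost alone.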

The proposed method

Information from the finest scale alone is not enough; when inter-scale regularization is adopted, useful information from the coarser scales reshapes the cost vector, generating disparities closer to the ground truth. The experimental results of Zhang et al.⁹ validate the high overall accuracy and efficiency of CSCA on several data sets. Additionally, several cost aggregation methods, such as NLCA, the guided filter (GF), the segment tree (ST) method, and the bilateral filter (BF) method, were evaluated in their paper. GF outperformed the other cost aggregation methods in most scenes.

In this article, we propose an optimized adaptive CSCA scheme for stereo matching, denoted as A-CSCA, that implements CSCA with an adaptive strategy and extends the idea of the inter-scale and intra-scale consistency constraints on the cost neighborhood to increase the matching accuracy. In the DSI, the neighborhood costs at each disparity level are not only used in the intra-scale consistency to ensure that the regularized costs vary smoothly in a local region, but also incorporated with the inter-scale consistency to optimize the disparity estimation. A winner-takes-all (WTA) strategy is employed to generate the final disparity maps. Algorithm 1 describes the basic workflow of A-CSCA, in which alternative cost aggregation methods can be evaluated in step 3. The proposed method brings only a small increase in computational complexity compared to conventional cost aggregation methods.

Algorithm 1.

Pipeline of A-CSCA algorithm.

input: Paired images $I^{left}$, $I^{right}$

1 Gaussian down-sampling to obtain multi-scale paired images $I_{s}^{left}$ and $I_{s}^{right}$, $s \in \{0, 1, ..., S\}$;

2 Compute the initial cost volume C_d at different scales according to equation (10);

3 Build the general cost aggregation model with different aggregation methods at each scale;

4 Strengthen the inter-scale consistency of cost volumes by regularization;

5 Weighted sum of the aggregated costs at each scale into the finest scale;

6 Disparity determination.

output: Final optimized disparity maps
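Step 1 of Algorithm 1 can be sketched as follows. This is a minimal illustration, assuming a down-sampling factor of 2 between adjacent scales and a five-tap binomial kernel as the approximation of the Gaussian; neither choice is stated in the text, and the names `blur` and `gaussian_pyramid` are hypothetical.

```python
import numpy as np

# Five-tap binomial kernel, a common approximation of a Gaussian (assumption).
KERNEL = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0

def blur(img):
    """Separable smoothing with reflected borders; output has the input shape."""
    padded = np.pad(img, 2, mode="reflect")
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, KERNEL, mode="valid"), 1, padded)
    return np.apply_along_axis(
        lambda c: np.convolve(c, KERNEL, mode="valid"), 0, rows)

def gaussian_pyramid(img, num_scales):
    """Return [I_0, I_1, ..., I_S] with I_0 the input and num_scales = S + 1."""
    levels = [np.asarray(img, dtype=float)]
    for _ in range(num_scales - 1):
        # Smooth first to limit aliasing, then keep every second pixel.
        levels.append(blur(levels[-1])[::2, ::2])
    return levels
```

The same function would be applied to the left and right images separately, and the disparity range at scale s would shrink by the same factor as the image width.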

Matching cost computation

Similar to CSCA, two paired images I^left and I^right take part in the disparity calculation of our framework. The framework first computes the matching cost between I^left and I^right at each scale up to the predefined scale S, according to the disparity range L. Thus, we build several DSIs in the scale space. Let C_d(p) denote the matching cost for pixel p at disparity level l.

\begin{array}{l} C_{d}(p) = (1 - α) \times C_{\mathrm{census}}(p) + α \times C_{\mathrm{grad}}(p) \\ C_{\mathrm{census}}(p) = | \mathrm{Census}_{left}(p) - \mathrm{Census}_{right}(p - l) | \\ C_{\mathrm{grad}}(p) = | \nabla_{x} I_{left}(p) - \nabla_{x} I_{right}(p - l) | + | \nabla_{y} I_{left}(p) - \nabla_{y} I_{right}(p - l) | \end{array} \tag{10}

where C_census(p) and C_grad(p) are the census cost term and the gradient-based cost term at disparity level l, respectively; α is a weighting factor that controls the contribution of the two cost terms.
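A possible realization of this matching cost is sketched below. Several details are assumptions rather than statements from the text: the census descriptor uses a 3 × 3 window, the census difference is measured as a Hamming distance normalized by the number of bits, gradients are central differences, and invalid matches keep an infinite cost. The function names are hypothetical.

```python
import numpy as np

def census_transform(img, radius=1):
    """Per-pixel census bits: bit k is True where neighbor k is darker than the center."""
    h, w = img.shape
    padded = np.pad(img, radius, mode="edge")
    bits = []
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            if dy == radius and dx == radius:
                continue  # the center pixel is not compared with itself
            bits.append(padded[dy:dy + h, dx:dx + w] < img)
    return np.stack(bits, axis=-1)  # shape (h, w, n_bits)

def census_gradient_cost(left, right, max_disp, alpha=0.3):
    """Cost volume combining a census term and an x/y gradient term."""
    h, w = left.shape
    census_l, census_r = census_transform(left), census_transform(right)
    gy_l, gx_l = np.gradient(left)   # derivatives along rows (y) and columns (x)
    gy_r, gx_r = np.gradient(right)
    n_bits = census_l.shape[-1]
    cost = np.full((h, w, max_disp), np.inf)
    for l in range(max_disp):
        # Hamming distance between descriptors, scaled to [0, 1] (assumption).
        hamming = np.sum(census_l[:, l:] != census_r[:, : w - l], axis=-1) / n_bits
        grad = (np.abs(gx_l[:, l:] - gx_r[:, : w - l])
                + np.abs(gy_l[:, l:] - gy_r[:, : w - l]))
        cost[:, l:, l] = (1 - alpha) * hamming + alpha * grad
    return cost
```

The census term depends only on the local intensity ordering, which is why it tolerates brightness differences between the two cameras better than a raw intensity difference.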

Cost aggregation

Previous research¹³ has shown that various cost aggregation methods can be formulated uniformly as WLS optimization problems, which enforce a smoothness constraint over neighboring costs. In equation (4), the first term reflects the data fidelity of the traditional CSCA model, and the additional regularization on the cost volume across the scale space reflects the inter-scale consistency. Inspired by the work of Ma et al.,¹² we enforce our smoothness constraint on the costs of the eight neighbors to make sure that the disparity varies smoothly. Thus, we define the proposed cost aggregation as follows:

\begin{array}{l} \hat{v} = \arg\min_{\{z^{s}\}_{s=0}^{S}} \sum_{s=0}^{S} \frac{1}{Z_{i^{s}}^{s}} \sum_{j^{s} \in N_{i^{s}}} K(i^{s}, j^{s}) ∥ z^{s} - C^{s}(j^{s}, l^{s}) ∥^{2} \\ \qquad + λ \sum_{s=1}^{S} ∥ z^{s} - z^{s-1} ∥^{2} + β \sum_{s=0}^{S} \sum_{N_{j}^{s} \in N_{8}(i^{s})} w^{s}(i^{s}, N_{j}^{s}) \end{array} \tag{11}

Figure 1 illustrates cost slices extracted from the aggregated cost volumes at different scales for a specified disparity value. Because the above optimization problem is convex, we can obtain the solution by finding the stationary point of the optimization objective in equation (11), which leads to the iterative update in equation (12).

v_{i}^{k+1} = (A + β w_{i})^{-1} (\tilde{v} + β v_{N_{i}}^{k}) \tag{12}

When β equals 0, equation (12) reduces to equation (9). A detailed derivation of the solution in equation (12) is given by Ma et al.¹² Here, we propose an adaptive alternation of the aggregation method used by CSCA at each scale. Different from the traditional CSCA method, we adopt the SST aggregation method at the odd scales (1, 3, ...), whereas at the even scales (0, 2, 4, ...), the GF is employed to aggregate costs. Then, we sum the costs aggregated by the different methods at each scale into the finest scale with adaptive weights. After this cost aggregation step, we adopt the WTA strategy to obtain the raw disparity map at the finest scale, which gives an accurate disparity in most regions. Without the CSCA structure, spurious local minima frequently occur in the cost volume, causing mismatches in the disparity map. Generally, information from the finest scale alone is not enough to generate an accurate disparity map. When inter-scale regularization is adopted, however, much more information from the coarser scales reshapes the cost volume, resulting in disparities closer to the ground truth. Experimental results show that the combination of SST and GF brings an obvious improvement in matching accuracy.
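The alternation of aggregation methods by scale parity, the weighted fusion into the finest scale, and the WTA step can be sketched together as follows. This is a simplified illustration, not the authors' implementation: GF and SST are not implemented here and are replaced by stand-in callables (uniform window means of two radii), the intra-scale term is switched off (β = 0), scales are assumed to differ by a factor of 2, and the fusion weights are passed in as a hypothetical argument. All function names are hypothetical.

```python
import numpy as np

def box_aggregate(volume, radius):
    """Stand-in aggregator: uniform window mean over every disparity slice."""
    h, w, _ = volume.shape
    padded = np.pad(volume, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    out = np.zeros_like(volume, dtype=float)
    size = 2 * radius + 1
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + h, dx:dx + w]
    return out / size ** 2

def fuse_to_finest(volumes, weights):
    """Weighted sum of per-scale volumes, sampled at i^s = i^0 // 2^s, l^s = l^0 // 2^s."""
    h, w, n_disp = volumes[0].shape
    ys, xs, ls = np.arange(h), np.arange(w), np.arange(n_disp)
    fused = np.zeros((h, w, n_disp))
    for s, (vol, weight) in enumerate(zip(volumes, weights)):
        f = 2 ** s
        yi = np.minimum(ys // f, vol.shape[0] - 1)
        xi = np.minimum(xs // f, vol.shape[1] - 1)
        li = np.minimum(ls // f, vol.shape[2] - 1)
        fused += weight * vol[np.ix_(yi, xi, li)]
    return fused

def adaptive_wta(raw_volumes, weights, even_agg, odd_agg):
    """Aggregate even/odd scales with different methods, fuse, and pick the best disparity."""
    aggregated = [(even_agg if s % 2 == 0 else odd_agg)(vol)
                  for s, vol in enumerate(raw_volumes)]
    fused = fuse_to_finest(aggregated, weights)
    return np.argmin(fused, axis=2)  # winner-takes-all over the disparity axis
```

A call such as `adaptive_wta(volumes, weights, lambda v: box_aggregate(v, 1), lambda v: box_aggregate(v, 2))` shows the intended plug-in structure: any aggregator that maps a cost volume to a cost volume of the same shape can be substituted for either slot.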

Experiment and evaluation

Environment setup

In this section, we compare the proposed A-CSCA with the traditional CSCA method. Our experiments are evaluated on the classic Middlebury data set introduced by Scharstein and Szeliski,¹⁷ the 2014 Middlebury data set of Scharstein et al.,¹⁸ and the 194 training pairs of the KITTI data set proposed by Geiger et al.,¹⁹ on the following hardware: Intel Xeon L5640 (2.2GHz) and AMD R9 280x. The scale S is preset to 5, resulting in six scales in total for both the A-CSCA and CSCA frameworks. We employ the combination of SST and GF to aggregate costs instead of a single method. More specifically, at the even scales [0, 2, 4], we adopt the GF to aggregate costs, and SST aggregation is executed at the odd scales [1, 3, 5].

In stereo matching algorithms, matching accuracy is the most important criterion, as it directly reflects the final quality of the disparity map. In our experiments, we did not apply any disparity optimization technique, in order to keep the comparison between A-CSCA and CSCA fair. Additionally, several parameters are defined in the proposed method. The cost balance parameter α is set to 0.3. Following the parameter setting of CSCA,⁹ its regularization parameter λ is set to 0.3 for the Middlebury data set and to 1.0 for the KITTI and Middlebury 2014 data sets. The two regularization parameters (λ, β) of the proposed method are set to (0.5, 0.7) for the classic Middlebury data set and to (1.0, 1.2) for the KITTI and Middlebury 2014 data sets. Figure 2 shows the performance for varying inter-scale regularization parameter λ with different aggregation methods when β = 0. In a similar way, the value of the parameter β was chosen through tuning experiments.

Figure 2.

The performance of varying inter-scale regularization parameter λ for “GF” and “SST” methods with scales, when β = 0. GF: guided filter; SST: segment-simple-tree.

Classic Middlebury data set

First, the four classic Middlebury image pairs “Cone,” “Teddy,” “Venus,” and “Tsukuba” were used to determine the parameter setting for the Middlebury data set. Figure 3 gives visual comparisons between CSCA and the proposed method, and the corresponding quantitative results are listed in Table 1, which shows the average rate of matching errors in the “noc,” “all,” and “disk” regions. “noc” represents errors in nonoccluded regions, “all” represents errors in nonoccluded and half-occluded regions, and “disk” represents errors in visible pixels near occluded regions. CSCA and the proposed method are evaluated on the classic Middlebury data set with an error threshold of 1. The results indicate that our method obtains better matching accuracy and time efficiency than CSCA. The visual comparison shows that our adaptive cost aggregation strategy performs better in textureless regions.

Figure 3.

Evaluation visual results of classic Middlebury image pairs: (a) original “Cone”, “Teddy”, “Venus,” and “Tsukuba” image pairs, (b) the ground-truth maps, (c) disparity maps calculated by CSCA, and (d) disparity maps calculated by the proposed method. CSCA: cross-scale cost aggregation.

Table 1.

Evaluations for CSCA and the proposed method on the classic Middlebury data set.^a

Method	Cone			Teddy			Venus			Tsukuba			Time (ms)
Method	noc	all	disk	noc	all	disk	noc	all	disk	noc	all	disk	Time (ms)
CSCA+GF	2.39	8.01	6.19	2.76	4.35	7.55	0.24	1.77	2.77	2.16	3.42	9.78	1.37
Ours	2.01	7.37	5.88	2.19	3.73	7.02	0.17	0.21	2.41	1.68	2.71	8.61	0.79

CSCA: cross-scale cost aggregation; GF: guided filter.

KITTI data set

The KITTI data set consists of 194 training image pairs and 195 test image pairs for stereo correspondence matching evaluation. These image pairs were captured under practical conditions with real-world illumination and contain large textureless regions. The ground-truth disparity maps were measured with a laser scanner and a GPS system. The regularization parameters λ and β were tuned in the same way as before, on 10 randomly selected image pairs from KITTI. In our experiments, we randomly selected 30 image pairs from the KITTI data set to evaluate the proposed method. Figure 4 gives some visual comparisons between CSCA and the proposed A-CSCA. As shown in Table 2, the evaluation metrics are the same as those of the KITTI benchmark with an error threshold of 3: Out-Noc and Out-All represent the percentage of erroneous pixels in nonoccluded and all regions, respectively; Avg-Noc and Avg-All represent the average disparity error in nonoccluded and all regions, respectively. According to the results in Figure 4 and Table 2, our method improves the matching accuracy by approximately 3–5% compared with CSCA.
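The two kinds of metrics described above can be computed as in the following sketch. It assumes that a pixel counts as erroneous when its absolute disparity error exceeds the threshold and that a Boolean mask selects the evaluated region (nonoccluded or all pixels with ground truth); the function name `disparity_errors` and the toy values are hypothetical.

```python
import numpy as np

def disparity_errors(disp, gt, valid, threshold=3.0):
    """Return (percentage of erroneous pixels, average absolute error in px).

    disp, gt: 2-D disparity arrays; valid: Boolean mask of evaluated pixels.
    """
    err = np.abs(disp - gt)[valid]
    out_percent = 100.0 * float(np.mean(err > threshold))
    avg_px = float(np.mean(err))
    return out_percent, avg_px

# Toy example: one of four pixels is off by 5 px, another by 1 px.
disp = np.array([[10.0, 20.0], [30.0, 40.0]])
gt = np.array([[10.0, 25.0], [30.0, 41.0]])
all_mask = np.ones_like(gt, dtype=bool)
noc_mask = np.array([[True, False], [True, True]])  # pretend one pixel is occluded

out_all, avg_all = disparity_errors(disp, gt, all_mask)
out_noc, avg_noc = disparity_errors(disp, gt, noc_mask)
```

Calling the function with the nonoccluded mask gives the Out-Noc and Avg-Noc style values, and calling it with the full mask gives the Out-All and Avg-All style values; changing `threshold` to 1 corresponds to the Middlebury setting.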

Figure 4.

Evaluation visual results of KITTI image pairs: (a) original image pairs, (b) disparity maps calculated by CSCA, and (c) disparity maps calculated by the proposed method. CSCA: cross-scale cost aggregation.

Table 2.

Evaluations for CSCA and the proposed method on the KITTI benchmark data set with an absolute disparity error threshold of 3.

Methods	Out-Noc	Out-All	Avg-Noc	Avg-All
CSCA-[9]	20.10%	22.12%	8.60 px	9.20 px
Ours	19.30%	20.98%	7.80 px	8.80 px

CSCA: cross-scale cost aggregation.

^a Note to Table 1: noc represents errors in nonoccluded regions, all represents errors in nonoccluded and half-occluded regions, and disk represents errors in visible pixels near occluded regions. Shown values are the average and standard deviation of the percentage of bad pixels (threshold = 1).

Middlebury 2014 data set

Finally, we evaluated the proposed method on the Middlebury 2014 data set. Our evaluation is based on the 15 training image pairs of the Middlebury 2014 data set. These image pairs involve more complex environments, providing a more challenging task than the classic Middlebury data set. The four criteria Out-Noc, Out-All, Avg-Noc, and Avg-All are the same as in the previous KITTI evaluation, and the error threshold is 1. Table 3 shows the quantitative comparison between CSCA and the proposed method. As shown in Figure 5, our method performs better than CSCA in flat and textured regions, and especially in disparity discontinuity regions.

Table 3.

Evaluations for CSCA and the proposed method on the 2014 Middlebury data set with an absolute disparity error threshold of 1.

Methods	Out-Noc	Out-All	Avg-Noc	Avg-All
CSCA-[9]	21.75%	27.29%	2.65 px	5.07 px
Ours	20.37%	26.32%	2.33 px	4.83 px

CSCA: cross-scale cost aggregation.

Figure 5.

Evaluation visual results of Middlebury 2014 data sets “Bicycle,” “Motorcycle,” and “Piano” image pairs: (a) original image pairs, (b) disparity maps calculated by CSCA, and (c) disparity maps calculated by the proposed method.

Based on the above experimental results, the proposed method outperformed CSCA on most of the data sets. This confirms that the adaptive CSCA framework, with different cost aggregation methods at different scales, improves the matching accuracy.

Conclusion

In this article, a novel optimized CSCA method with CTF strategy for stereo matching has been proposed. The scheme implements CSCA with the smoothness constraint on neighbor costs, which essentially extends the idea of the inter-scale and intra-scale consistency constraints to increase the matching accuracy. The neighborhood costs are not only used in the intra-scale consistency to ensure that the regularized costs vary smoothly in an eight-connected neighbors region, but also incorporated with inter-scale consistency to optimize the disparity estimation. Additionally, the improved method introduces an adaptive scheme in each scale with different aggregation methods. Experimental results have shown that the proposed adaptive CSCA method outperforms the original CSCA method on Middlebury and KITTI data sets.

As an extension, we plan to integrate a global optimization scheme into the current method for better temporal consistency, in order to evaluate the proposed method with real scene simulations. Our future work will concentrate on reducing the complexity of the proposed method and on accelerating it to real time with the help of parallel architectures such as GPU platforms.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work is funded by Zhejiang Provincial Natural Science Foundation of China under grants LY16F020033 and LY18F020034, the Natural Science Foundation of China under grants 61702275 and 41775008, the Scientific Research Foundation of Nanjing University of Information Science and Technology under grant S8113055001, and the Natural Science Foundation of JiangSu Province under grant BK20150931.

References

1. Pham, Jeon. Domain transformation-based efficient cost aggregation for local stereo matching. IEEE Trans Circuit Syst Video Technol 2013; 23(7): 1119–1130.

2. Shi, Min. Cross-based local multipoint filtering. In: 2012 IEEE conference on computer vision and pattern recognition (CVPR), Providence, RI, USA, pp. 430–437. IEEE.

3. Sun, Zheng, Shum. Stereo matching using belief propagation. IEEE Trans Pattern Anal Mach Int 2003; 25(7): 787–800.

4. Xiang, Zhang. Real-time stereo matching based on fast belief propagation. Mach Vision Appl 2012; 23(6): 1219–1227.

5. Yoon, Kweon. Adaptive support-weight approach for correspondence search. IEEE Trans Pattern Anal Mach Int 2006; 28(4): 650–656.

6. Yang. A non-local cost aggregation method for stereo matching. In: 2012 IEEE conference on computer vision and pattern recognition (CVPR), Providence, RI, USA, pp. 1402–1409. IEEE.

7. Zhang, Bai, Nezan. Joint motion model for local stereo video-matching method. Optic Eng 2015; 54(12): 123108–123108.

8. Cheng, Zhang, Sun. Cross-trees, edge and superpixel priors-based cost aggregation for stereo matching. Pattern Recognit 2015; 48(7): 2269–2278.

9. Zhang, Fang, Min. Cross-scale cost aggregation for stereo matching. In: 2014 IEEE conference on computer vision and pattern recognition, Columbus, OH, USA, pp. 1590–1597. IEEE.

10. Pham, Dinh, Jeon. Robust non-local stereo matching for outdoor driving images using segment-simple-tree. Sig Proc Image Comm 2015; 39: 173–184.

11. Mallot, Gillner, Arndt. Is correspondence search in human stereo vision a coarse-to-fine process? Biol Cybern 1996; 74(2): 95–106.

12. Zheng. Cross-scale cost aggregation integrating intrascale smoothness constraint with weighted least squares in stereo matching. JOSA A 2017; 34(4): 648–656.

13. Min, Sohn. Cost aggregation and occlusion handling with WLS in stereo matching. IEEE Trans Image Proc 2008; 17(8): 1431–1442.

14. Milanfar. A tour of modern image filtering: new insights and methods, both practical and theoretical. IEEE Sig Proc Magaz 2013; 30(1): 106–128.

15. Hosni, Rhemann, Bleyer. Fast cost-volume filtering for visual correspondence and beyond. IEEE Trans Pattern Anal Mach Int 2013; 35(2): 504–511.

16. Menz, Freeman. Stereoscopic depth processing in the visual cortex: a coarse-to-fine mechanism. Nature Neurosci 2003; 6(1): 59–65.

17. Scharstein, Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int J Comput Vision 2002; 47(1–3): 7–42.

18. Scharstein, Hirschmüller, Kitajima. High-resolution stereo datasets with subpixel-accurate ground truth. In: Pattern Recognition. GCPR 2014. Lecture Notes in Computer Science, vol 8753. Springer, Cham, pp. 31–42.

19. Geiger, Lenz, Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In: 2012 IEEE conference on computer vision and pattern recognition (CVPR), Providence, RI, USA, pp. 3354–3361. IEEE.