Efficient Depth Map Estimation Method Based on Gradient Weight Cost Aggregation Strategy for Distributed Video Sensor Networks

Abstract

Video sensor networking technologies have developed rapidly over the last ten years. In this paper, a cross-based cost aggregation strategy is presented for depth map estimation in video sensor networks. We formulate the process as a local regression problem consisting of two main steps, using a pair of video sensors. The first step calculates estimates for a set of points within a shape-adaptive local support region. The second step aggregates the matching cost over the support region, applying a gradient-based weight at its outermost pixels. Together, these two steps produce accurate results efficiently. Compared with the adaptive support weight (ASW) algorithm, the proposed algorithm achieves improvements of up to 6.9%, 8.4%, and 8.3%; compared with the cross-based algorithm, it achieves improvements of 2.0%, 1.3%, and 1.0% in terms of nonocclusion, all, and discontinuities, respectively.

1. Introduction

Wireless sensor networks (WSNs) have drawn the attention of the research community in the last few years, driven by a wealth of theoretical and practical applications [1, 2]. Recently, owing to rapid improvements and miniaturization in hardware, a single embedded device can be equipped with audio and visual information collection modules [3]. The availability of low-cost hardware is enabling the development of wireless multimedia sensor networks (WMSNs), that is, networks of resource-constrained wireless devices that can retrieve multimedia content such as video and audio streams, still images, and scalar sensor data from the environment [4–7].

Many algorithms have been developed for WMSN applications [8–12]. In [8], Chi et al. studied the problem of compressing video surveillance sequences collected by a wireless sensor network. In particular, they proposed a low-complexity coding framework based on change detection and JPEG-like compression of regions of interest, along with a suitable low-complexity change detection algorithm. Huang et al. proposed a robot wireless sensor network that can enhance multimedia surveillance and provide the foundation for strategies based on multimodal sensor integration [9].

In [10], DeBardelaben investigated techniques that can be applied at each layer of the network protocol stack to produce clandestine, power-efficient wireless microsensor network implementations. Also, a smart camera network has been demonstrated that provides extensive coverage of a large virtual public space, a train station populated by autonomously self-animating virtual pedestrians [11, 12].

Figure 1 shows that several adjacent sensors may have overlapping sensing areas because of the fields of view of their cameras. Since the overlapped image region contains information about the alignment of the images, stereoscopic view generation (depth estimation) becomes possible with a pair of neighboring camera sensors. This kind of processing can provide richer information for object recognition and tracking strategies.

Figure 1

Distributed camera sensors around a sink in WMSN.

For stereoscopic view generation (depth estimation), many different approaches have been taken to solve the stereo correspondence problem, and great progress has been made in the field during the last decade. Dense stereo matching forms the basis for extracting three-dimensional scene structure and involves identifying, for every pixel in one image, the corresponding pixel in another image [13].

Because individual pixel values are not unique, and because corresponding pixels may not have the same values in different views, searching for correspondence at the level of individual pixels produces questionable results. To overcome this problem, local stereo matching algorithms generally aggregate similarities over an area around each pixel, with the implied assumption that all pixels in the area have the same depth.

Several algorithms have been proposed in attempts to solve the problem of optimizing the size of the region. Kanade and Okutomi [14] presented a method that selects a window adaptively by evaluating the local variations in both the intensity and the initial disparities. They used a statistical model to represent the uncertainty in the disparity of points over a given window. Veksler [15] proposed an algorithm that chooses an appropriate window shape by optimizing over the class of compact windows, using the minimum ratio cycle algorithm. However, the shapes selected by the methods of Kanade and Okutomi and of Veksler are both constrained to rectangles.

To overcome these constraints, Okutomi et al. [16] determined regions by recovering precise object edges and obtaining smooth surfaces using multiple stereo pairs and multiple windowing. Yoon and Kweon [17] adjusted the support weight of pixels in a given support window based on color similarity and geometric proximity in order to reduce image ambiguity. However, their algorithm consumes a large amount of memory and has high computational complexity because it stores support weights that depend on the center pixel. To reduce the computational complexity, Richardt et al. [18] introduced a real-time stereo matching technique based on a reformulation of Yoon's adaptive support weight algorithm.

Zhang et al. [19] proposed a cross-based local support aggregation algorithm that uses color similarities and connectivity constraints to construct a shape-adaptive full support region on the fly, merging the horizontal segments of the crosses in the vertical neighborhood. However, the problem of finding the optimal support region and aggregating the matching costs remains open.

We propose a cross-based gradient weight cost aggregation algorithm to find the optimal support region. After constructing a support region, we aggregate the matching costs over it, applying gradient-based weights at the outermost pixels of the support region.

Section 2 describes our proposed algorithm in detail, with observations. We verify the performance of the proposed algorithm in Section 3, and conclusions are presented in Section 4.

2. The Proposed Algorithm

Our proposed algorithm is based on the following assumptions: $(1)$ color differences and a variation in depth values exist in a texture area; $(2)$ the variation in the depth value is small between adjacent pixels in the nontexture area; $(3)$ the gradient magnitude has a constant bandwidth within the object boundaries. In local stereo matching, it is important to find the area that lies at the same depth. Because the minimum matching cost is obtained by aggregating the matching costs of the same depth level, we use color similarity information to find the same depth area. Most regions at the same depth level, such as homogeneous regions, have similar colors. Although textured regions are sometimes an exception, the matching costs calculated in the previous aggregation step are discriminable at each disparity in the texture region. Hence, we assume that color differences and variations in depth values exist in texture areas and that the variation in the depth value is small between adjacent pixels in nontexture areas.

The final assumption is that the gradient magnitude has a constant bandwidth within the object boundaries. Because each object has a different bandwidth of edge magnitude at its boundary, we aggregate the matching costs using the edge magnitude as a weighting factor to distinguish different depth levels.

Figure 2 shows the overall procedure of the proposed algorithm for estimating the depth map. We focus on the cost aggregation step, which operates on the matching costs produced by the matching cost initialization.

Figure 2

A flowchart of the proposed algorithm.

The proposed cross-based gradient weight cost aggregation algorithm consists of several stages. First, for each pixel p in the image, a set of four limits on the support arm lengths is determined. Before constructing a support region, this step searches the nontexture area in order to limit the support region. Figure 3(a) shows the limited arm lengths $\{L_{L}, L_{R}, L_{U}, L_{B}\}$ for the left, right, up, and bottom directions, respectively. The nontexture area is identified using the variance of the intensities over a 3 × 3 window.
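The variance test over a 3 × 3 window can be illustrated with a minimal Python sketch. It rests on assumptions that the text does not state: the window is clipped at image borders, a pixel counts as nontexture when its variance is below the threshold T of Table 1, and the function names are hypothetical. How the four arm limits are then derived from this test is not specified here, so the sketch stops at the per-pixel decision. Plain Python lists are used to keep the example dependency-free.

```python
def local_variance(img, y, x):
    """Variance of the intensities in the 3x3 window centred at (y, x).

    The window is clipped at the image borders (an assumption).
    """
    h, w = len(img), len(img[0])
    vals = [img[j][i]
            for j in range(max(0, y - 1), min(h, y + 2))
            for i in range(max(0, x - 1), min(w, x + 2))]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)


def is_nontexture(img, y, x, T=50.0):
    # Assumption: "nontexture" means the local variance is below T.
    return local_variance(img, y, x) < T


flat = [[10] * 5 for _ in range(5)]
checker = [[255 if (i + j) % 2 else 0 for i in range(5)] for j in range(5)]
print(is_nontexture(flat, 2, 2))     # True: constant patch, zero variance
print(is_nontexture(checker, 2, 2))  # False: strongly textured patch
```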

Figure 3

Cross-based gradient weight aggregation processes: (a) region search to limit arm length, (b) construction of support region, and (c) aggregation of matching cost over support region.

Second, an upright cross with four arms is constructed for each pixel. Figure 3(b) shows construction of the support region. For a given pixel, the left arm stops when it finds an endpoint pixel $p_{e}$ that violates one of the two following conditions (see (1) and (4)). The first condition can be defined as

\begin{matrix} D_{c} (p_{e}, p^{'}) < τ, \end{matrix}

(1)

where $D_{c} (p_{e}, p^{'})$ is the color distance between $p_{e}$ and $p^{'}$ , and τ is a preset threshold value. The color distance is given as follows:

\begin{matrix} D_{c} (p_{e}, p^{'}) = \sqrt{\sum_{C \in \{L, a, b\}} {(C_{p_{e}} - C_{p^{'}}^{'})}^{2}}, \end{matrix}

(2)

where $C_{p_{e}}$ indicates the CIELab color value at the endpoint $p_{e}$ . $C_{p}^{'}$ is expressed as

\begin{matrix} C_{p}^{'} = (1 - α) C_{p}^{'} + α C_{p}, \end{matrix}

(3)

where $α = 1 / (h + 1)$ is a parameter used to control the update rate and h indicates the arm length in the constructed support region.

The second condition can also be defined as

\begin{matrix} D_{g} (p_{e}, p^{'}) < \{L_{L}, L_{R}, L_{U}, L_{B}\}, \end{matrix}

(4)

where $D_{g} (p_{e}, p^{'})$ is the spatial distance between $p_{e}$ and $p^{'}$ and $\{L_{L}, L_{R}, L_{U}, L_{B}\}$ is the set of the limited arm lengths.
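As an illustration of conditions (1) and (4), the following Python sketch grows the left arm of one pixel along a single image row. It reflects one possible reading of (1)–(3) and is not a definitive implementation: the reference color $C^{'}$ is assumed to start at the anchor pixel and to be updated by (3) with $α = 1 / (h + 1)$ each time the arm is extended, and a step h is accepted only while h is strictly smaller than the arm limit. The function and parameter names are hypothetical.

```python
import math


def color_distance(c1, c2):
    """Euclidean distance between two (L, a, b) triples, as in (2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))


def left_arm_length(row, x, tau, limit):
    """Length of the left arm of the pixel at column x.

    row   : list of (L, a, b) triples for one image row
    tau   : color threshold of condition (1)
    limit : arm limit L_L of condition (4); steps h must satisfy h < limit
    """
    ref = tuple(row[x])          # assumed initial value of C'
    length = 0
    for h in range(1, limit):    # condition (4): spatial distance < limit
        xe = x - h
        if xe < 0:               # image border reached
            break
        cand = row[xe]
        if color_distance(cand, ref) >= tau:   # condition (1) violated
            break
        alpha = 1.0 / (h + 1)                  # update rate from (3)
        ref = tuple((1 - alpha) * r + alpha * c for r, c in zip(ref, cand))
        length = h
    return length


row = [(90.0, 0.0, 0.0), (50.0, 0.0, 0.0), (50.0, 0.0, 0.0), (50.0, 0.0, 0.0)]
print(left_arm_length(row, 3, tau=20.0, limit=10))  # 2: stops at the color jump
print(left_arm_length(row, 3, tau=20.0, limit=2))   # 1: stops at the arm limit
```

The right, up, and bottom arms would follow the same pattern with the scan direction changed.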

The next step is to aggregate the matching costs over the constructed support region. Figure 3(c) shows the aggregation process. We assume that the gradient magnitude has a constant bandwidth within the object boundaries. Based on this assumption, our algorithm aggregates the matching cost, using the gradient as a weight at the outermost pixel of the support region (yellow block in Figure 3(c)). The aggregation cost is expressed as

\begin{matrix} E_{d} (p) = \sum_{q \in U_{d} (p)} e_{d} (q), \end{matrix}

(5)

where $U_{d} (p)$ is the support region at pixel p and $e_{d} (q)$ is the aggregation cost that is added in the horizontal direction at the pixel q within $U_{d} (p)$ . After summing the horizontal costs, the matching cost is merged and aggregated by summing aggregation costs vertically. $e_{d} (q)$ can be represented as

\begin{matrix} e_{d} (q) = \begin{cases} w_{g} (q) \cdot c (q), & \text{if } q \text{ is the outermost pixel}, \\ c (q), & \text{otherwise}, \end{cases} \end{matrix}

(6)

where $c (q)$ is the matching cost at pixel q. If q is an outermost pixel of the support region, its matching cost is multiplied by a weight $w_{g} (q)$ . The weight $w_{g} (q)$ is given as

\begin{matrix} w_{g} (q) = e^{- m (q) / ρ}, \end{matrix}

(7)

where $m (q)$ is the edge magnitude at pixel q and ρ is a preset threshold value. The edge magnitude $m (p)$ is usually defined as follows:

\begin{matrix} m (p) = \sqrt{G_{x} {(x, y)}^{2} + G_{y} {(x, y)}^{2}}, \\ G_{x} (x, y) = I (x - 1, y) - I (x + 1, y), \\ G_{y} (x, y) = I (x, y - 1) - I (x, y + 1), \end{matrix}

(8)

where $G_{x} (x, y)$ and $G_{y} (x, y)$ indicate the gradient responses in the horizontal and vertical directions, respectively, and I is the image intensity. Also, p and q can be considered as the mapped indices for searching color similarity vertically and horizontally, respectively.

The proposed gradient-based weight has a value in the range $[0,1]$ . For example, if the edge magnitude is large, the gradient weight is near 0. If the edge magnitude is small, the gradient weight is close to 1.
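The behavior of (7) and (8) can be checked with a short Python sketch. It assumes a grayscale image stored as nested lists and indexed as `img[y][x]`, clamps neighbor coordinates at the image borders (the border handling is not stated in the text), and uses ρ = 140 from Table 1 as the default; the function names are hypothetical.

```python
import math


def edge_magnitude(img, y, x):
    """m(p) from (8), using the differences I(x-1, y) - I(x+1, y) and
    I(x, y-1) - I(x, y+1). Neighbors are clamped at the borders (assumption)."""
    h, w = len(img), len(img[0])
    gx = img[y][max(x - 1, 0)] - img[y][min(x + 1, w - 1)]
    gy = img[max(y - 1, 0)][x] - img[min(y + 1, h - 1)][x]
    return math.sqrt(gx * gx + gy * gy)


def gradient_weight(img, y, x, rho=140.0):
    """w_g(q) = exp(-m(q) / rho) from (7); the result lies in (0, 1]."""
    return math.exp(-edge_magnitude(img, y, x) / rho)


flat = [[100] * 4 for _ in range(3)]
step = [[0, 0, 255, 255] for _ in range(3)]
print(gradient_weight(flat, 1, 1))  # 1.0: no edge, full weight
print(gradient_weight(step, 1, 1))  # about 0.16: strong edge, small weight
```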

Based on the edge response in the object, we designed an adaptive weighting scheme for cost aggregation. The proposed weighting scheme is based on the uncertainty of the edge feature. Following (7), as the edge at an outermost pixel becomes stronger, its weight decreases and its contribution to the aggregated cost is reduced. Where the edge is weak, the pixel contributes almost its full matching cost.

With the proposed gradient-based weight, the algorithm produces a more accurate depth map in the aggregation step for wireless video sensor network applications.

The overall procedure of the proposed algorithm is summarized as follows (in Figure 2).

(i) First, before constructing the support regions, we limit the size of each support region so that it covers a more homogeneous area.

(ii) Support regions are constructed based on the assumption that pixels at the same depth level have similar colors. To improve performance, the matching costs must be collected within the same depth level.

(iii) Finally, we aggregate the matching costs over the constructed support region based on the edge response in the object. The goal of the aggregation step is to collect the matching costs of each depth level. To find the correct depth level, we construct the support region in the same depth area using color similarity and aggregate the matching costs, weighted by the edge response, over the support region to reduce disturbance at the object boundary.
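The aggregation in (5) and (6) can be sketched in Python for a single pixel and a single disparity. The sketch makes several assumptions that go beyond the text: the support region of p is taken as the union of the horizontal segments of the pixels on the vertical arm of p, as in the cross-based construction of [19]; the "outermost" pixels are taken to be the two endpoints of each horizontal segment; and the names `aggregate_cost` and `arms` are hypothetical. The weights would come from (7).

```python
def aggregate_cost(cost, weight, arms, y, x):
    """E_d(p) from (5) and (6) for the pixel p = (y, x) at one disparity.

    cost[j][i]   : matching cost c(q) at pixel (j, i)
    weight[j][i] : gradient weight w_g(q) at pixel (j, i)
    arms[j][i]   : (left, right, up, bottom) arm lengths of pixel (j, i)
    """
    _, _, up, bottom = arms[y][x]
    total = 0.0
    for j in range(y - up, y + bottom + 1):     # vertical merge
        left, right, _, _ = arms[j][x]
        x0, x1 = x - left, x + right
        for i in range(x0, x1 + 1):             # horizontal sum
            if i == x0 or i == x1:              # assumed outermost pixel
                total += weight[j][i] * cost[j][i]
            else:
                total += cost[j][i]
    return total


cost = [[1.0] * 3 for _ in range(3)]
weight = [[0.5] * 3 for _ in range(3)]
arms = [[(1, 1, 1, 1)] * 3 for _ in range(3)]
print(aggregate_cost(cost, weight, arms, 1, 1))  # 6.0: each row gives 0.5 + 1 + 0.5
```

A practical implementation would normally replace the inner loops with integral images, as in [19]; the direct loops are kept here only for readability.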

3. Results and Discussion

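To verify the performance of the proposed algorithm, we tested our system using the Middlebury benchmark [20]. Middlebury defines three measures for evaluating performance: nonocclusion, all, and discontinuities. Each measure is the percentage of bad pixels in the corresponding subset of the estimated disparity map, that is, pixels s whose estimated disparity $d (s)$ differs from the ground-truth disparity $d_{τ} (s)$ by more than an error tolerance $δ_{d}$ , normalized by the number N of evaluated pixels: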

\begin{matrix} B_{Occ} = \frac{1}{N} \sum_{s \in Occ} (| d (s) - d_{τ} (s) | > δ_{d}), \\ B_{All} = \frac{1}{N} \sum_{s \in All} (| d (s) - d_{τ} (s) | > δ_{d}), \\ B_{Disc} = \frac{1}{N} \sum_{s \in Disc} (| d (s) - d_{τ} (s) | > δ_{d}) . \end{matrix}

(9)
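The measures in (9) share one form and differ only in the pixel subset, which the following Python sketch makes explicit. It assumes that each subset is supplied as a Boolean mask, that N is the number of pixels in that mask, and that the tolerance is passed by the caller; the function name is hypothetical. The value returned is a fraction, so it must be multiplied by 100 to obtain a percentage.

```python
def bad_pixel_rate(disp, truth, mask, delta):
    """Fraction of masked pixels whose disparity error exceeds delta, as in (9).

    disp, truth : estimated and ground-truth disparity maps (nested lists)
    mask        : True where a pixel belongs to the evaluated subset
    """
    n = 0
    bad = 0
    for d_row, t_row, m_row in zip(disp, truth, mask):
        for d, t, m in zip(d_row, t_row, m_row):
            if m:
                n += 1
                if abs(d - t) > delta:
                    bad += 1
    return bad / n if n else 0.0


disp = [[1, 5], [3, 3]]
truth = [[1, 2], [3, 9]]
everything = [[True, True], [True, True]]
subset = [[True, True], [True, False]]
print(bad_pixel_rate(disp, truth, everything, delta=1.0))  # 0.5
print(bad_pixel_rate(disp, truth, subset, delta=1.0))      # 1 bad pixel out of 3
```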

The parameters given in Table 1 were kept constant for all datasets. T is the threshold for detecting nontexture areas, τ is the threshold for comparing the color similarity between pixels, and ρ is the threshold used to obtain the weighting factor. The value of ρ was set from the average edge magnitude over all datasets. These parameters were selected experimentally. Before applying the proposed aggregation, we used a Census transform cost measure to initialize the cost volume [21]. The Census transform performs well on images with exposure and illumination changes.
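To show why a Census-based cost tolerates exposure changes, here is a small Python sketch of a Census transform with a Hamming-distance matching cost. The window size is not given in the text, so a 3 × 3 window is assumed; the border clamping, the convention that the pixel (x, y) in the left image is compared with (x − d, y) in the right image, and the function names are also assumptions.

```python
def census_3x3(img, y, x):
    """8-bit census code: one bit per neighbor, set when neighbor < centre."""
    h, w = len(img), len(img[0])
    bits = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            j = min(max(y + dy, 0), h - 1)   # clamp at borders (assumption)
            i = min(max(x + dx, 0), w - 1)
            bits = (bits << 1) | int(img[j][i] < img[y][x])
    return bits


def census_cost(left, right, y, x, d):
    """Hamming distance between the codes of left (y, x) and right (y, x - d)."""
    xr = max(x - d, 0)
    return bin(census_3x3(left, y, x) ^ census_3x3(right, y, xr)).count("1")


left = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
brighter = [[v + 100 for v in r] for r in left]   # same scene, offset exposure
print(census_3x3(left, 1, 1))                 # 240, i.e. 0b11110000
print(census_cost(left, brighter, 1, 1, 0))   # 0: the code ignores the offset
```

Because the code depends only on the ordering of intensities within the window, a uniform brightness offset leaves the cost unchanged.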

Table 1

Parameter settings.

T	τ	ρ
50	20	140

After performing the aggregation step, we used the winner-take-all (WTA) [13] local optimization technique to calculate the disparity map.
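Winner-take-all selection simply picks, for every pixel, the disparity with the lowest aggregated cost. A minimal Python sketch follows, assuming the aggregated costs are stored as `cost_volume[d][y][x]` (a hypothetical layout) and that ties are resolved in favor of the smaller disparity.

```python
def winner_take_all(cost_volume):
    """Return the disparity map that minimizes cost_volume[d][y][x] per pixel."""
    n_disp = len(cost_volume)
    h = len(cost_volume[0])
    w = len(cost_volume[0][0])
    return [[min(range(n_disp), key=lambda d: cost_volume[d][y][x])
             for x in range(w)]
            for y in range(h)]


# Three disparity levels over a 1 x 2 image.
volume = [[[3.0, 1.0]],
          [[2.0, 5.0]],
          [[9.0, 0.0]]]
print(winner_take_all(volume))  # [[1, 2]]
```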

Figure 4 shows the quantitative performance. We evaluated the results by comparison with the ASW algorithm [18] and the cross-based algorithm [19]. We achieved improvements of up to 6.9%, 8.4%, and 8.3% compared with the ASW algorithm and improvements of 2.0%, 1.3%, and 1.0% compared with the cross-based algorithm in terms of nonocclusion, all, and discontinuities, respectively.

Figure 4

Errors over the Tsukuba, Venus, Cones, and Teddy test image pairs in quantitative measures.

The proposed method produced an accurate disparity map, as shown in Figure 5. Figure 5(a) shows the ground truth for the tested Venus image. In particular, depth discontinuities were well preserved compared with the ground truth image, whereas the other methods failed to preserve them. With the proposed algorithm, defects inside near objects are also avoided.

Figure 5

Partial boundary regions in the Venus image: (a) the ground truth, (b) ASW, (c) cross-based, and (d) the proposed algorithm.

Table 2 shows the Middlebury ranking of the proposed algorithm. Although the Middlebury ranking of the overall procedure is low, we have improved the performance of the aggregation step, which is based on the edge response, compared with other aggregation algorithms. Our key idea is to improve accuracy in the aggregation step by using the edge magnitude at the object boundary as the weight and by constructing the support region from color similarity in homogeneous regions. In terms of the overall stereo matching procedure, the improvement may be smaller than when the aggregation step is considered alone.

Table 2

Middlebury rank of the proposed algorithm and the other implemented algorithms (TH < 1).

Algorithms	Avg. rank	Tsukuba			Venus			Teddy			Cones
Algorithms	Avg. rank	Nonocc	All	Disc.	Nonocc	All	Disc.	Nonocc	All	Disc.	Nonocc	All	Disc.
Proposed algorithm	133.3	15.84	16.14	22.11	8.15	8.11	7.59	13.91	16.91	18.04	14.58	18.17	7.04
ASW	143.4	20.84	20.96	32.15	10.69	10.84	13.31	22.50	29.14	23.85	26.10	32.37	19.02
Cross-based	141.7	18.24	17.50	24.31	9.32	9.27	8.09	15.70	18.59	18.91	17.22	19.22	7.56

The above results show that the proposed algorithm improves the average rank, which indicates that it makes a meaningful contribution to the aggregation step.

4. Conclusions

We have proposed a cross-based gradient weight cost aggregation algorithm for efficient depth map estimation in WMSNs. We constructed a shape-adaptive support region and aggregated the matching costs using a gradient-based weight within the support region, with a pair of neighboring camera sensors. With the proposed algorithm, we obtained promising results compared with the other aggregation algorithms.

Footnotes

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This work was supported by the agency specific research program of MSIP, Republic of Korea (Development of Multi-Sensor Platform Technology for Context Cognitive Smart Car).

References

1. Akyildiz I. F., Sankarasubramaniam, Cayirci. Wireless sensor networks: a survey. Computer Networks, vol. 38, no. 4, pp. 393–422, 2002. Scopus: 2-s2.0-0037086890. doi: 10.1016/S1389-1286(01)00302-4

2. Al-Karaki J. N., Kamal A. E. Routing techniques in wireless sensor networks: a survey. IEEE Wireless Communications, vol. 11, no. 6, pp. 6–28, 2004. Scopus: 2-s2.0-11144277843. doi: 10.1109/MWC.2004.1368893

3. Akyildiz I. F., Melodia, Chowdhury K. R. A survey on wireless multimedia sensor networks. Computer Networks, vol. 51, no. 4, pp. 921–960, 2007. Scopus: 2-s2.0-33845708421. doi: 10.1016/j.comnet.2006.10.002

4. Akyildiz I. F., Melodia, Chowdhury K. R. Wireless multimedia sensor networks: applications and testbeds. Proceedings of the IEEE, vol. 96, no. 10, pp. 1588–1605, 2008. Scopus: 2-s2.0-55549094400. doi: 10.1109/JPROC.2008.928756

5. Magli, Mancin, Merello. Low-complexity video compression for wireless sensor networks. Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '03), pp. 585–588, 2003.

6. Singh, Lobiyal. A novel energy-aware cluster head selection based on particle swarm optimization for wireless sensor networks. Human-Centric Computing and Information Sciences, vol. 2, no. 13, pp. 1–13, 2012.

7. Dubey, Sahu O. P. Self-localized packet forwarding in wireless sensor network. Journal of Information Processing Systems, vol. 9, no. 3, pp. 477–488, 2013.

8. Chi Y. M., Etienne-Cummings, Cauwenberghs, Carpenter, Colling. Video sensor node for low-power ad-hoc wireless networks. Proceedings of the 41st Annual Conference on Information Sciences and Systems (CISS '07), pp. 244–247, March 2007. Scopus: 2-s2.0-44049086917. doi: 10.1109/CISS.2007.4298307

9. Huang, Makedon, Popab, Stephanou, Ford. A feature extraction method for multimedia data analysis in robot wireless sensor networks. Proceedings of the 8th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS '07), pp. 1–4, June 2007. Scopus: 2-s2.0-46749144957. doi: 10.1109/WIAMIS.2007.3

10. DeBardelaben J. A. Multimedia sensor networks for ISR applications. Proceedings of the 37th Asilomar Conference on Signals, Systems and Computers, pp. 2009–2012, November 2003. Scopus: 2-s2.0-4143051571.

11. Qureshi, Terzopoulos. Smart camera networks in virtual reality. Proceedings of the IEEE, vol. 96, no. 10, pp. 1640–1656, 2008. Scopus: 2-s2.0-55549128202. doi: 10.1109/JPROC.2008.928932

12. Gallego, Huecas. An empirical case of a context-aware mobile recommender system in a banking environment. Journal of Convergence, vol. 3, no. 4, pp. 49–56, 2012.

13. Scharstein, Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, vol. 47, no. 1–3, pp. 7–42, 2002. Scopus: 2-s2.0-0036537472. doi: 10.1023/A:1014573219977

14. Kanade, Okutomi. A stereo matching algorithm with an adaptive window: theory and experiment. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, no. 9, pp. 920–932, 1994. Scopus: 2-s2.0-0028499182. doi: 10.1109/34.310690

15. Veksler. Stereo correspondence with compact windows via minimum ratio cycle. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp. 1654–1660, 2002. Scopus: 2-s2.0-0036941065. doi: 10.1109/TPAMI.2002.1114859

16. Okutomi, Katayama, Oka. A simple stereo algorithm to recover precise object boundaries and smooth surfaces. International Journal of Computer Vision, vol. 47, no. 1–3, pp. 261–273, 2002. Scopus: 2-s2.0-0036537012. doi: 10.1023/A:1014510328154

17. Yoon K.-J., Kweon I. S. Adaptive support-weight approach for correspondence search. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 4, pp. 650–656, 2006. Scopus: 2-s2.0-33144482417. doi: 10.1109/TPAMI.2006.70

18. Richardt, Orr D. A. H., Davies I. P., Criminisi, Dodgson N. A. Real-time spatiotemporal stereo matching using the dual-cross bilateral grid. Proceedings of the European Conference on Computer Vision, 2010.

19. Zhang, Lafruit. Cross-based local stereo matching using orthogonal integral images. IEEE Transactions on Circuits and Systems for Video Technology, vol. 19, no. 7, pp. 1073–1079, 2009. Scopus: 2-s2.0-68249086330. doi: 10.1109/TCSVT.2009.2020478

20. http://vision.middlebury.edu/stereo/

21. Zabih, Woodfill. Non-parametric local transforms for computing visual correspondence. Proceedings of the 3rd European Conference on Computer Vision, May 1994.