High Performance Imaging through Occlusion via Energy Minimization-Based Optimal Camera Selection

Abstract

Seeing an object in a cluttered scene with severe occlusion is a significantly challenging task for many computer vision applications. Although camera array synthetic aperture imaging has proven to be an effective way for occluded object imaging, its imaging quality is often significantly decreased by the shadows of the foreground occluder. To overcome this problem, some recent research has been presented to label the foreground occluder via object segmentation or 3D reconstruction. However, these methods usually fail in the case of complicated occluder or severe occlusion. In this paper, we present a novel optimal camera selection algorithm to handle the problem above. Firstly, in contrast to the traditional synthetic aperture photography methods, we formulate the occluded object imaging as a problem of visible light ray selection from the optimal camera view. To the best of our knowledge, this is the first time to “mosaic” a high quality occluded object image via selecting multi-view optimal visible light rays from a camera array or a single moving camera. Secondly, a greedy optimization framework is presented to propagate the visibility information among various depth focus planes. Thirdly, a multiple label energy minimization formulation is designed in each plane to select the optimal camera view. The energy is estimated in the 3D synthetic aperture image volume and integrates the multiple view intensity consistency, previous visibility property and camera view smoothness, which is minimized via graph cuts. Finally, we compare this approach with the traditional synthetic aperture imaging algorithms on UCSD light field datasets and our own datasets captured in indoor and outdoor environment, and extensive experimental results demonstrate the effectiveness and superiority of our approach.

Keywords

Occluded Object Imaging Computational Photography Synthetic Aperture Imaging Energy Minimization

1. Introduction

Occluded object imaging is a significantly challenging task in many computer vision application fields such as video surveillance and monitoring, hidden object detection and recognition, tracking through occlusion, etc. However, because traditional photography simply captures the 2D projection of the 3D world, essentially it cannot handle occlusion.

Recently, computational photography is changing the traditional way of imaging, which captures additional visual information by using generalized optics. Synthetic aperture imaging (SAI) [1 –7] is one of the key aspects of computational photograph. Figure 1 visualizes the principle of multiple camera synthetic aperture imaging. In a convex lens, rays from the red point on the plane of focus after refraction converge to a single point on the sensor plane, forming a sharp image (Figure 1a). Rays from the blue point, which is not in the plane of focus, form a circle of confusion on the sensor plane, resulting in a blurred image (Figure 1b). A camera array is analogous to a “synthetic” lens aperture - each camera being a sample point on a virtual lens (Figure 1c). We synthetically focus the camera array by choosing a plane of focus, and adding up all the rays corresponding to each point on the chosen plane to get a pixel in a “synthetic aperture” image. By warping and integrating the multiple view images, synthetic aperture imaging can simulate a virtual camera with a large convex lens, and focus on different frontal-parallel or oblique planes with a narrow depth of field. As a result, occluded objects focused on the virtual focal plane are visible, while those that are not are blurry.

Figure 1.

Explanation of the principle of multiple camera synthetic aperture imaging via geometric optics.

Synthetic aperture imaging photography [1, 2] provides a new concept in resolving the occluded object imaging problem, however it still suffers from the following limitations:

(1)

The clarity of the occluded object image is often significantly decreased by the shadows of the foreground occluder. Although some methods have been presented to label the foreground via object segmentation or 3D reconstruction, these methods would fail in the case of a complicated occluder or severe occlusion.

(2)

Because the state-of-the-art SAI methods use the intensity average of multiple cameras, various colour responses of cameras often significantly reduce the colour smoothness and consistency of synthetic aperture image.

(3)

The occluded object's contour and contrast are sensitive to calibration error, which is serious especially for unstructured light field synthetic aperture imaging with a moving camera.

In this paper, we address the above issues by proposing a new algorithm which for the first time formulates the occluded object imaging as an optimal camera selection problem. A multiple label energy minimization formulation is designed in each plane to select the optimal camera. The energy is estimated in the 3D synthetic aperture image volume, which integrates the multiple view intensity consistency clustering, visibility probability propagation and camera view smoothness. When focusing on a hidden object, instead of naively averaging all camera views in the synthetic aperture image, our method will actively select the rays from only one optimal camera via multiple label graph cuts-based energy minimization [8 –11].

The organization of this paper is as follows: Section 2 introduces several related works. Our algorithm is presented in Section 3. Following that, Section 4 presents the data set, implementation details and the experimental results. In addition, the performance analysis and discussions are proposed. Finally, we conclude the paper and point to future work in Section 5.

2. Related Work

In 1999, the first famous camera array setup was devised for the film The Matrix, in which a 1D camera array was used to create the impression of orbiting around a scene that has been frozen in time. Pioneering work on synthetic aperture imaging was proposed by Levoy[1]. They set up a two dimensional Stanford light field camera array which consisted of 128 Firewire cameras, and for the first time aligned multiple cameras to approximate a camera with a very large aperture.

The MIT computer graphics group [12] used 64 USB webcams for synthesizing dynamic depth of field effects. Lei et al. [13] developed a cluster-based system for camera array application, which consisted of eight nodes and 16 cameras. Ding et al. [14] constructed a 3×3 camera array to reconstruct the surface of fluid. Venkataraman et al. [15] presented an ultra-thin high performance monolithic camera array named PiCam (Pelican Imaging Camera-Array), that captured light fields and synthesized high resolution images. Maitre et al. [16] used a planar camera array to perform surface reconstruction. Schuchert et al. [17] estimated 3D object structure, motion and rotation based on 4D affine optical flow using a multi-camera array. Yang et al. [18] built a new hybrid synthetic aperture imaging system, which adopted a linear camera array and four top view cameras to continuously track and see people through occlusion. Later on, Yang et al. [19] presented a camera array autofocus algorithm by minimizing the temporal and spatial correspondences error subject to global loop constraint. To reduce the complexity and expensive cost of the large scale camera array, Davis et al. [2] presented a system for interactively acquiring and rendering light fields using a hand-held commodity camera, this system provided users with real-time feedback and guided them toward under-sampled parts of the light field. The constructed synthetic aperture image of the above light field photography systems have a shallow depth of field, so that objects off the focus plane disappear due to significant blur. This unique characteristic makes synthetic aperture imaging a powerful tool for occluded object imaging.

When focusing on the occluded object, outlier rays that actually hit occluders will blur the focused occluded object, and decrease the clarity and contrast of the synthesized image quality. Several methods have been presented to overcome this problem. Vaish et al. [20] studied four cost functions, including colour medians, entropy, focus and stereo for reconstructing an occluded surface using synthetic apertures. Their method achieved good results under slight occlusion; however the cost functions may fail under severe occlusion. Pei et al. [21] proposed a background subtraction method for segmenting and removing a foreground occluder before synthetic aperture imaging. Their results were encouraging for a simple static background, however the performance was very sensitive to the moving foreground segmentation result and may fail in crowded scenes with a dynamic cluttered background. Later on, Pei et al. [22] improved the imaging quality via graph cut-based foreground segmentation. Because of the colour variance-based segmentation, the performance of their approach was very sensitive to the illumination and colour response of different views in the camera array. Although their results were encouraging in a well-controlled laboratory experiment, their method may fail in complex outdoor environments. In addition, this method [22] cannot handle occluded object imaging through multiple occluders.

3. Our Approach

In this section we will introduce our optimal camera selection-based occluded object imaging method. Instead of averaging all rays from the entire camera array [1] or partial visible cameras[22], our approach selects only one optimal camera for each pixel from the camera array, and combines the selected rays together to create a synthetic image. The word optimal refers to three facts. (1) From the pixel perspective, the pixel should be visible in the corresponding optimal camera. (2) From the local image perspective, the adjacent pixels should be more likely to select the same or adjacent camera from the camera array. (3) From the global perspective, we hope to select as few as possible cameras to create the image of the occluded object.

3.1. Algorithm framework

In this subsection, we give an overview of the optimal camera selection and imaging approach by describing the flow of information with the Stanford light field dataset shown. The overall method mainly includes two parts: (1) Synthetic Aperture Imaging, and (2) Optimal Camera Selection, which are shown in Figure 2.

Figure 2.

Algorithm framework of our approach.

An imaging cycle begins when capturing multi-view images by multiple camera or a moving camera. The Synthetic Aperture Imaging Module (Figure 2, left part) takes the multiple view images as input, through camera calibration and synthetic aperture imaging, this module generates a set of multi-view warping images on the given focus plane, in which all the images are precisely aligned without parallax.

The image warping results are then fed as input to the Optimal Camera Selection Module (Figure 2, middle part). Then, for the entire image plane of focus at depth i, this module initializes a multi-label graph grid. The node in this graph denotes the pixel on the plane of focus. The edges between different nodes represent the spatial relations between connected pixels, and the label of each node denotes the optimal camera ID for this pixel. Through the above definition, we can formulate the optimal camera selection problem as a multi-label problem, and assign each pixel to a unique optimal camera via graph cuts based energy minimization. To take advantage of the information of multiple depth planes, we adopt a greedy searching strategy which applies the energy from the close to the distant focus plane, and propagates the previous labelling results to the distant layers. This greedy searching strategy only needs to scan the observed scene once, which is quite efficient.

Finally, the camera selection results are combined together to generate a high quality occluded object image result (Figure 2, right top image), which is much better than the traditional synthetic aperture imaging result (Figure 2, right bottom image).

3.2. Optimal camera selection via multi-labelling graph cut

Let C denotes the camera view number. For a given depth i, our goal is to find a labelling function f: Ω → C where Ω refers to all the pixels in all images and C represents the set of possible labels of each pixel. For a pixel x, if f(x) = c,c ∊ C, then x is considered to be visible in view c.

Consider the labelling redundancy of the camera array (the labels in different cameras are highly related), we label all the pixels in the reference camera view instead of in all camera views. Thus, we only seek a more succinct labelling, g: I_ref → C, where I_ref represents the whole image area in the reference camera.

The objective of choosing the visible view can be formulated as a following energy minimization problem:

E (g, i) = E_{d} (g) + E_{s} (g)

(1)

where the data term E_d is the sum of the data cost of each pixel and the smooth term E_s is a regularizer that encourages neighbouring pixels to share the same label.

Data term: If a scene point p is located in the focusing depth i, and pixel x refers to the corresponding pixel in reference image, then the colour value I₁ (x), I₂ (x),…, I_C (x) of the corresponding pixels in all camera views should approximate to a Gaussian distribution N(μ,σ²) with μ as the true colour value of the point p and σ² as the variance of colour values from different cameras. Those colour values close to μ usually refer to the colour value of point p from different cameras, while those distant from μ are usually from the occluding object with low probability. Thus, we define the data cost of labelling each pixel x in the reference image as follows:

E_{d} (g) = \sum_{x \in I_{r e f}} (1 - N (μ, σ^{2}, I_{g (x)} (x)))

(2)

where N(μ,σ², I_g(x) (x)) refers to the Gaussian probability of pixel labelling to camera g(x), g(x) ∊ C. To get the Gaussian distribution, we first apply K Means clustering in colour space, and then choose the cluster with the greatest number of samples. Let M and Var denote the sample mean and variance of this cluster. Finally, we can obtain the Gaussian distribution N(μ,σ²) with μ = M and σ² = Var.

Smooth term: The smoothness term E_s (g) at depth i is a prior regularizer that encourages overall labelling to be smooth. The prior regularizer is based on the fact that two neighbouring pixels often have a higher probability of choosing the same camera view as their visibility is similar in the same camera view.

Because of the different colour responses among multiple camera views and calibration errors of the camera position, even for the same visible point, the colour value of the corresponding visible pixels in multiple camera views are always different. Thus this prior is very important and it will determine the colour smoothness of the final synthetic image. Surprisingly the traditional synthetic aperture imaging methods seldom consider the problem.

Here, we adopt the standard four-connected neighbourhood system and penalize if the labels of two neighbouring pixels are different:

E_{s} (g) = \sum_{​_{q \in N_{p}}^{p \in I_{r e f}}} S_{p, q} (g (p), g (q))

(3)

S_{p, q} (g (p), g (q)) = {\begin{matrix} 1 & g (p) \neq g (q) \\ 0 & o t h e r w i s e \end{matrix}

(4)

The intuition behind the design of the above energy is that:

(1)

If the point is a fully visible focus point, then the appearance can be modelled as a unimodal distribution. In this case, the cost of choosing any visible camera is small and therefore the imaging quality will not be greatly influenced by the choice of labelling.

(2)

If the point is a partially occluded focus point, we are likely to get a unimodal distribution and the case is similar to 1).

(3)

If the point is a free point, the distribution will tend to be a uniform distribution. In this case all the cameras have relatively large cost and as a result the smoothness term will play the decisive role.

In the experiment, we adopt Boykov's graph cuts methods [8, 9] to solve the above energy minimization problem.

4. Experiments

4.1. Imaging system and datasets

In order to evaluate the performance of the proposed method under various circumstances, we have designed and set up several moving camera-based light field capture systems.

Figure 3(a) displays our moving linear camera array light field capture system. The vertical linear array contains eight Pointgrey Flea3 cameras, and it can move smoothly on a sliding track. The entire system has the ability to simulate a virtual camera with a three metre convex lens. In this experiment, we adopt the above moving linear array to capture multiple occluded toys in an indoor environment (as shown in Figure 5).

Figure 3.

Our moving camera synthetic aperture imaging systems. (a) and (b) are designed for light field capture in an indoor environment. (c) is designed for capturing an unstructured light field in an outdoor environment.

In order to capture a dense 3D light field of the scene, as well as to avoid the various colour responses among different cameras, we have also set up a single moving camera-based imaging system (as shown in Figure 3(b)). Through moving the camera in different directions, the system can simulate a virtual camera array with 900 camera views and a three metre convex lens (as shown in Figure 3(b), right image).

To evaluate our approach in outdoor scenes, we have also set up an outdoor unstructured light field imaging system based on a moving Canon 5D Marker III camera. Figure 3(c) displays the system and the examples of an outdoor building occluded by the front trees. We use Zhang's automatic camera tracking method [24] to estimate the moving camera's pose and position for synthetic aperture imaging. The imaging results of this system are shown in Figures 6 and 7. Besides our system and datasets, in this experiment section, we also adopt the public available UCSD light field datasets[12]in the experiment to evaluate and compare the imaging performance of our method.

4.2. UCSD Santa dataset

The UCSD Santa light field dataset was acquired using an eight-camera array and a linear translating gantry[23]. This dataset contains 120 views on a 120×1 grid with image resolution as 640×512. Figure 4(a) displays five examples of the Santa dataset. We adopt the view #57 as the reference camera view (as shown in Figure 4(b1)), and Figure 4(b2) shows the synthetic aperture imaging result using the Vaish's method [20]. Please note that when we focus on the distant chairs and windows, the shadows from the foreground Santa significantly blurred the image. By contrast, through multi-labelling graph cuts based optimal camera selection (as shown in Figure 4(b4)), our approach successfully removes the false shadows and produces a high quality occluded object image with far greater clarity (as shown in Figure 4(b3)).

Figure 4.

Imaging results of USCD Santa light field datasets [23]. (a) shows the examples of original camera views. (b1) to (b4) display the comparison results of the chair through occlusion.

4.3. Our multiple occluded toys dataset

To further test our method on severe occlusion cases, we have conducted another experiment with multiple objects. We use our moving linear camera array (as shown in Figure 3) to capture this set of images. As shown in Figure 5(a), the flower pot, teddy bear and the penguin are lined up in a column. It can be seen that the penguin is occluded by the teddy bear, which is further occluded by the front flower pot. In particular, we can see nothing but the feet of the teddy bear from the reference camera view (Figure 5(b1)). The standard synthetic aperture imaging results of the teddy bear and penguin are shown in Figure 5(b2) and Figure 5(c2) respectively. Due to the severe occlusion, Vaish's method [20] can only obtain a blurred image of the occluded object. By contrast, our method successfully selects the optimal camera views via energy minimization (as shown in Figures 5(b4) and 5(c4)), and provides a clear and complete object image of the teddy bear (Figure 5(b3)) and penguin (Figure 5(c3)) through severe occlusion.

Figure 5.

Imaging results through multiple occluders. Please note that in this challenging scene, the penguin is occluded by the teddy bear, which is further occluded by the front flower pot. Our approach successfully sees through multiple objects (b3) and (c3).

4.4. Our window and building dataset

Figure 6 shows the imaging results of the outdoor scene through a large window. We capture these test datasets with a hand-held single moving camera in our laboratory. Because of occlusion by the black window frame, we cannot get a complete image of the distant building (Figure 6(a)). Figure 6(b2) gives the synthetic aperture imaging results using the Vaish's method [20], which is blurry due to the foreground window frame and depth variation of the distant building. By contrast, our approach virtually “removes” the foreground occluder, and creates a complete and clear image with lots of details via optimal camera selection.

Figure 6.

Imaging results of the outdoor scene through the window. Please note that standard synthetic aperture imaging approach is very blurred (b2). By contrast, our method successfully removes the window and generates a clear image of the occluded building (b3).

4.5. Our street dataset

To evaluate our method on the challenging street view, we have conducted another experiment with a complex outdoor scene. As shown in Figure 7, the distant buildings are occluded by the nearby trees (Figure 7(a)). Our aim is to see the building to the rear of the scene through the trees in the foreground. Comparison results using Vaish's method [20] and our method are shown in Figures 7(b2) and 7(b3). As Vaish's method [20] simply averages the intensity value of the multiple view images, it cannot select different camera views for different regions, and the resulting image is significantly blurred. In addition, since Vaish's method only focuses on a given depth plane, the image clarity can be reduced due to the depth variation of the building's surface (Figure 7(b2)). Please note that through selecting the optimal camera views, our method accurately gives the desired imaging result even under depth changes and occlusion (Figure 7(b3)).

Figure 7.

Imaging results of the distant buildings through the trees. Please note that standard synthetic aperture imaging approach is very blurred (b2). By contrast, our method successfully removes the complex foreground trees and generates a clear image of the occluded building (b3).

5. Conclusion

A novel occluded object imaging approach has been presented. Experimental results with qualitative and quantitative analysis demonstrate that the proposed method can reliably select the optimal camera and generate a clear image even through severe occlusion. Moreover, the satisfied imaging results with a moving camera indicate that this approach has great potential for many applications. Our future work will focus on extending the method by developing new applications of the occluded object imaging techniques on smartphones.

Footnotes

6. Acknowledgements

The research in this paper is supported by the Project of National Natural Science Foundation of China under grant numbers 61272288 and 61231016, the Foundation of China Scholarship Council under grant number 201303070083, the NPU New People and New Directions Foundation under grant number 13GH014604, the NPU Foundation for Fundamental Research under grant numbers JC201120 and JC201148, and the Soaring Star of NPU under grant number 12GH0311.

References

Levoy

(2006) Light fields and computational imaging. IEEE Computer Maganize, 39:46–55.

Davis

Levoy

Durand

(2012) Unstructured light fields. Eurographics, 31 (2):305–314.

Vaish

Garg

Talvala

Antunez

Wilburn

Horowitz

Levoy

(2005) Synthetic aperture focusing using a shear-warp factorization of the viewing transform. IEEE Workshop on A3DISS, Computer Vision and Pattern Recognition, 129–135.

Joshi

Avidan

Matusik

Kriegman

D J

(2007) Synthetic aperture tracking: tracking through occlusions. International Conference on Computer Vision, 1–8.

Wilburn

Joshi

Vaish

Talvala

V E

Antunez

Barth

Adam

Horowitz

Levoy

(2005) High performance imaging using large camera arrays. ACM T GRAPHIC, 24 (3):765–776.

Basha

Avidan

Hornung

Matusik

(2012) Structure and Motion from Scene Registration. Computer Vision and Pattern Recognition, 1426–1433.

Vaish

Wilburn

Joshi

Levoy

(2004) Using plane + parallax for calibrating dense camera arrays. Computer Vision and Pattern Recognition, 1:2–9.

Boykov

Veksler

Zabih

(2001) Efficient approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 20 (12):1222–1239.

Boykov

Kolmogorov

(2004) An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 26 (9):1124–1137.

10.

Boykov

Kolmogorov

(2010) Basic graph cuts algorithms. Advances in Markov Random Fields.

11.

Kolmogorov

Zabih

(2004) What energy functions can be minimized via graph cuts? IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 26:147–159.

12.

Yang

J C

Everett

Buehler

McMillan

(2002) A real-time distributed light field camera. Eurographics Symposium on Rendering.

13.

Lei

Chen

Yang

Y H

(2009) A new multi-view spacetime-consistent depth recovery framework for free viewpoint video rendering. International Conference on Computer Vision, 1570–1577.

14.

Ding

Y Y

J Y

(2011) Dynamic fluid surface acquisition using a camera array. International Conference on Computer Vision, 2478–2485.

15.

Venkataraman

Lelescu

Duparré

McMahon

Molina

Chatterjee

Mullis

(2013) PiCam: An Ultra-Thin High Performance Monolithic Camera Array. ACM Transactions on Graphics, 32 (5):1–13.

16.

Maitre

Shinagawa

M N

(2008) Symmetric multi-view stereo reconstruction from planar camera arrays. Computer Vision and Pattern Recognition, 1–8.

17.

Schuchert

Scharr

(2010) Estimation of 3D object structure, motion and rotation based on 4D affine optical flow using a multi-camera array. 11th European Conference on Computer Vision, 5963–609.

18.

Yang

Zhang

Y N

Tong

X M

Zhang

X Q

(2013) A new hybrid synthetic aperture imaging model for tracking and seeing people through occlusion. IEEE Transactions on Circuits and Systems for Video Technology, 23 (9):1461–1475.

19.

Yang

Zhang

Y N

Chen

(2013) Exploiting loops in the camera array for automatic focusing depth estimation. International Journal of Advanced Robotic Systems, 10:232, DOI:10.5772/56321.

20.

Vaish

Szeliski

Zitnick

C L

Kang

S B

Levoy

(2006) Reconstructing occluded surfaces using synthetic apertures: stereo, focus and robust methods. Computer Vision and Pattern Recognition, 2331–2338.

21.

Pei

Zhang

Y N

Yang

Zhang

X W

Yang

Y H

(2011) A novel multi-object detection method in complex scene using synthetic aperture imaging. Pattern Recognition, 45 (4):1637–1658.

22.

Pei

Zhang

Y N

Chen

Yang

Y H

(2013) Synthetic aperture imaging using pixel labelling via energy minimization. Pattern Recognition, 46 (1):174–187.

23.

UCSD/MERL light field dataset (2007) http://vision.ucsd.edu/datasets/lfarchive/lfs.shtml.

24.

Zhang

G F

Jia

J Y

Wong

T T

Bao

H J

(2009) Consistent depth maps recovery from a video sequence. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 31:974–988.