Abstract
As interest in image-based rendering increases, the need for multiview inpainting is emerging. Despite rapid progress in deep learning-based single-image inpainting, such approaches impose no constraint for obtaining color consistency over multiple inpainted images. We target object removal in large-scale indoor spaces and propose a novel multiview inpainting pipeline that achieves color consistency and boundary consistency across multiple images. The first step of the pipeline is to create color prior information on masks by coloring point clouds from multiple images and projecting the colored point clouds onto the image planes. Next, a generative inpainting network accepts a masked image, a color prior image, an imperfect guideline, and two different masks as inputs and yields a refined guideline and an inpainted image as outputs. The color prior and guideline inputs ensure color and boundary consistency across multiple images. We validate our pipeline on real indoor data sets both quantitatively, using consistency distance and similarity distance, metrics we define for comparing multiview inpainting results, and qualitatively.
Introduction
The rendering of real indoor spaces is typically achieved via an image-based rendering (IBR) method 1 –3 that allows a free-viewpoint exploration in a virtual world. Recent IBR applications 1 have reduced geometric complexity for the real-time rendering of large-scale indoor spaces. The key idea of the method 1 is to render only the architectural components without objects, which requires a consistent object removal over multiple images, as suggested in the literature. 3,4
Multiview inpainting is a task to achieve image inpainting on multiple images while satisfying two conditions: color consistency and boundary consistency. For example, in Figure 1, the images from C2 and C3 require image inpainting in the ideal case. However, typical image inpainting methods may produce color or boundary inconsistencies (cf. Figure 1(b)).

Example of multiview inpainting: (a) an architectural component is directly visible in images C1 and C4 but occluded by an object in images C2 and C3; (b) image inpainting is required in C2 and C3 to match the ideal image while avoiding color or boundary inconsistencies.
Philip and Drettakis 3 proposed a method for multiview inpainting. Their method begins by estimating a plane, projects the visible quadrangle parts of the images onto a common rectified plane, and then fills in the occluded parts using a conventional patch-based method. 5 Although this approach can ensure color consistency up to the performance level of the traditional PatchMatch algorithm, 5 boundary consistency is not secured because the planar assumption does not hold in nonplanar or complex environments.
Recent advances in image inpainting are based on deep learning approaches, 6 –8 which have shown their efficiency and effectiveness with masks of arbitrary shape. Recent studies 7,8 further enhance performance by utilizing guidelines. Although the guidelines themselves ensure boundary consistency when the user sets consistent guidelines over the multiple images, color consistency over multiple images cannot be secured because such approaches consider only a single image.
In this study, we target object removal in the images covering large-scale indoor spaces. We propose a novel pipeline that fills in occluded regions of multiple images in such a way that both the color and the boundary consistencies are well preserved. The proposed pipeline extends the state-of-the-art GConv 8 applied to a single-image inpainting problem.
First, for color consistency, we create a color prior image, which contains pixel-wise color information observed from visible cameras (e.g. C1 and C4 in Figure 1(a)). This color prior is added alongside the other inputs of the inpainting network, such as a masked image, a guideline input, and a mask similar to that of GConv, 8 allowing the inpainted colors to be consistent across the multiple inpainted images.
Second, for boundary consistency, we extend GConv 8 in such a way that an imperfect guideline input can be refined within the network. Ideally, there should be no boundary inconsistency when the guidelines are perfectly set by users. However, because setting the guidelines over multiple images is laborious, typical guidelines include various errors, such as deviations in angle or location or missing segments.
For experimental validation, we utilized real indoor data sets 1,9 that enable real-time rendering of large-scale indoor spaces. For color consistency, we developed new measures to evaluate the intensity-free color consistency among multiple inpainted images, as well as the similarity between multiple inpainted images and images without occlusions. Given these metrics, the proposed method yields improvements of up to 46.4% and 30.6% over the previous single-image inpainting methods EC 7 and GConv, 8 respectively. Regarding boundary consistency, we conducted a qualitative comparison to determine whether the proposed pipeline predicts enhanced guidelines.
This article is organized as follows: the second section briefly summarizes related works. The third section describes the proposed pipeline. The fourth section details the experiment results of multiview inpainting on real data sets. Finally, some concluding remarks are given in the fifth section. Our codes are available at https://github.com/kimjh069/generative-MVI. 10
Related work
Indoor modeling and IBR
Typical IBR methods require a 3D map or model of the environment. Henry et al. 11 use depth cameras that integrate depth and color information for robust 3D mapping of indoor environments. The map, however, remains a point cloud rather than a mesh, which makes rendering novel viewpoints difficult. Shao et al. 12 introduce an interactive approach for the semantic modeling of indoor spaces. The approach retrieves 3D models from a predefined database, which limits its ability to reconstruct the real scene. More recent works focus on the architectural components for a better representation of indoor spaces. Ikehata et al. 13 devise a structural grammar that separates the elements of indoor environments, such as rooms, walls, and objects. The approach builds the indoor model under the Manhattan-world assumption in single-story buildings. Recent works overcome the Manhattan-world assumption for indoor modeling or integrate the 3D model with IBR. 9,14
Hedman et al. 2 proposed a real-time IBR method. However, this method has limitations in rendering spaces larger than room scale. Turner et al. 15 reconstruct an architectural mesh but do not remove objects in the images, which results in flattened object textures on the architectural mesh. In addition, recent studies 1,9 have focused on architectural component modeling and object removal in images. However, removing an object from a single image can cause color discordance between multiple images.
Recently, there have been attempts to use learning-based methods for IBR. Hedman et al. 16 use a deep learning approach for blending novel views in IBR. It showed the feasibility of deep learning-based blending but requires further research to reduce blurriness and flicker. Thies et al. 17 use a novel deep learning-based method for image synthesis and real-time rendering. However, the method needs training for every new object, requiring more research on generalization.
Multiview inpainting
To achieve color consistency through multiview inpainting, other studies 3,4 warp multiple images onto a common rectified target image plane and optimize an objective function or apply traditional patch-based algorithms. Li et al. 18 use an RGB-D sensor, a device that augments the image with depth information, and employ exemplar-based image inpainting with sequential data. However, RGB-D sensors are inappropriate for large-scale indoor scanning or rendering owing to their short depth range and the extremely large number of required images. Other approaches, such as video inpainting 19 for recovering consistent color from dense sequential data, are not applicable to the sparse or wide-baseline data used for IBR. 1,3,9
Single-image inpainting
Traditional inpainting algorithms 5,20 use patch-based or diffusion-based approaches with low-level features. Recent studies have focused on rapidly advancing deep learning-based approaches, such as convolutional neural networks and generative adversarial networks (GANs), 21 which are capable of using high-level features.
Yu et al. 6 employed a GAN and proposed a coarse-to-fine network as well as a contextual attention module (CAM). CAM attends to known patches of features directly and explicitly borrows patches to fill in unknown regions of features. It has inspired numerous recent studies 8,22 that make use of CAM for image inpainting. Yu et al. 8 proposed a gated convolution that enables a sketch channel input to propagate through the network along with semantic information.
An increasing number of recent studies 7,8,23 have applied edge information (i.e. guidelines) as an additional input to enhance quality or achieve editable results. However, no studies have combined color inputs and imperfect guidelines with guideline refinement at the same time.
Method
In this section, we introduce the initial data and a multiview inpainting pipeline, which is shown in Figure 2, including a method for creating the color prior images and an image inpainting network. The color prior image is used as the additional input for the image inpainting network.

Overview of our multiview inpainting pipeline. The “Initial data” section describes the initial input data of the pipeline, which were acquired by a mobile sensor robot system. The method of generating color prior images is described in the “Color prior generation” section. Using the color prior images, the “Image inpainting” section describes image inpainting that aims for color consistency across multiple views.
Initial data
The initial data consist of an architectural mesh, images, and the camera pose of each image of large-scale indoor spaces, as introduced in recent works. 1,9 The data sets were acquired by scanning the indoor spaces with a mobile robot system (cf. Figure 2). Global guidelines are obtained by an automatic algorithm 9 and overlaid with user guidelines. Masks were created manually to enable thorough multiview inpainting.
Color prior generation
A color prior image is a synthetic image created by projecting colored point cloud data (PCD) onto the image plane. We introduce a method for coloring the PCD with only the architectural component colors and projecting the colored PCD onto the images.
The PCD is initially sampled from the object-free architectural mesh. From the sampled PCD, we adopt the hidden point removal (HPR) algorithm, 24 a simple and fast algorithm that approximates the visibility of the PCD for a given camera pose by flipping the point cloud and computing a convex hull on it. To be more specific, for each given camera position,
where
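The HPR operator cited above can be sketched as follows. This is a minimal NumPy/SciPy illustration, not the paper's implementation; the `radius_factor` parameter is an assumption used to place the flipping sphere well outside the point cloud.

```python
import numpy as np
from scipy.spatial import ConvexHull

def hidden_point_removal(points, camera, radius_factor=100.0):
    """Approximate which points of a PCD are visible from a camera
    position via spherical flipping plus a convex hull (the HPR
    operator). points: (N, 3) array; camera: (3,) position.
    Returns the sorted indices of points estimated to be visible."""
    p = points - camera                            # move camera to origin
    norms = np.linalg.norm(p, axis=1, keepdims=True)
    R = norms.max() * radius_factor                # flipping-sphere radius
    flipped = p + 2.0 * (R - norms) * p / norms    # spherical flip
    # Add the origin (camera) and take the convex hull; flipped points
    # that end up on the hull correspond to visible original points.
    hull = ConvexHull(np.vstack([flipped, np.zeros(3)]))
    visible = set(hull.vertices)
    visible.discard(len(points))                   # drop the origin vertex
    return sorted(visible)
```

For example, of two points lying on the same ray from the camera, only the nearer one is reported as visible, since the farther one flips to the interior of the hull.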
The next step is coloring the PCD with architectural colors and creating the color prior images. For each point, we build candidate colors by projecting the point onto each image plane where the point is visible. We can refer to a row of the relation matrix,

Creating the color prior image: (a) The color for a given point is computed from a set of images that are not occluded by objects; (b) a colored point with the coverage space is projected onto the image plane to create the color prior image.
The colored points can be projected onto each visible image plane to produce a color prior image, where we assume a 360-camera model in a cubemap format
where
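Since the projection equations are not reproduced here, the following is a minimal sketch of the cubemap mapping: a camera-space direction is assigned to one of six faces by its dominant axis and converted to normalized (u, v) coordinates. The axis convention (+x right, +y up, +z forward) and the face naming are assumptions for illustration.

```python
def cubemap_face_uv(d):
    """Map a 3D direction d = (x, y, z) in camera coordinates to a
    cubemap face name and (u, v) in [0, 1]^2. The face is chosen by
    the component with the largest absolute value."""
    x, y, z = d
    ax, ay, az = abs(x), abs(y), abs(z)
    if az >= ax and az >= ay:                 # front/back faces
        face = 'front' if z > 0 else 'back'
        u, v = (x / az if z > 0 else -x / az), y / az
    elif ax >= ay:                            # right/left faces
        face = 'right' if x > 0 else 'left'
        u, v = (-z / ax if x > 0 else z / ax), y / ax
    else:                                     # up/down faces
        face = 'up' if y > 0 else 'down'
        u, v = x / ay, (-z / ay if y > 0 else z / ay)
    # Rescale from [-1, 1] to [0, 1]
    return face, (0.5 * (u + 1.0), 0.5 * (v + 1.0))
```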
The color prior image, however, depends on the sampling rate and on the distance from the camera pose to a point. To be more precise, the sparse PCD originates from a minimum sampling distance r from the mesh. In addition, points closer to an image plane appear even sparser in the image than those farther away (cf. Figure 4(a)).

Example of created color prior: (a) The colored point closer to the camera appears sparser than that farther away; (b) we resolve the sparsity issue by defining the coverage area; (c) the original image from the given viewpoint.
To resolve these dependencies, we define the point coverage area,
where
Combining these matrices, the coverage area
The coverage area,
where
Image inpainting
Typical image inpainting networks 7,8 take a masked image and a guideline image as inputs and output an inpainting result. By contrast, to preserve color consistency over multiple images, we train a network capable of receiving the color prior information as an additional input. In addition, the network performs two tasks concurrently: image inpainting and guideline refinement. In the following sections, we introduce the input data generated to train the network, the architecture, and the loss functions used to achieve our goals: color-consistent inpainting and guideline refinement.
Input data generation for training
Given the ground truth image

From the two ground truth images on the left (image and guideline), we generate the other five images using automatic generation algorithms for the training process of our model. The image is a sample validation image from the Places2 data set.
For the generation of color prior input
An imperfect guideline
The mask
The no-guideline zone is another binary mask used for a loss function. Its usage is explained in detail later. The codes for generating the training input data are also available at the link. 10
Network architecture
We modify the coarse-to-fine network used in GConv 8 based on key ideas from recent works, 22,26 as shown in Figure 6. The coarse network receives an input tensor

The blue, orange, yellow, and green blocks in the generator represent a gated convolution, dilated convolution, self-attention module, and CAM, respectively. Each upsampling layer consists of nearest-neighbor interpolation followed by a gated convolution. The refine network is simplified for visualization. Red blocks in the discriminator represent spectral-normalized convolutions. CAM: contextual attention module.
The size of the output tensor of the coarse network is
The discriminator is similar to the SN-patchGAN structure in GConv, 8 except that instead of a stride-1 convolution at the very first layer, our discriminator starts directly with a stride-2 convolution and uses Leaky ReLU as the activation function. We also adopt the spectral normalization technique recently proposed by Miyato et al. 27 The discriminator takes eight channels of input consisting of the image, guideline, mask, and color prior.
Loss functions
The coarse-to-fine network is trained in an end-to-end manner over several joint losses consisting of an adversarial loss,
Let G and D be the generator and discriminator, respectively. We use the hinge loss as the objective function
where
where
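The hinge objective referred to above can be sketched as follows. This is a NumPy illustration over discriminator score maps in the SN-patchGAN style, not the paper's exact implementation: the discriminator is pushed to score real patches above +1 and fake patches below -1, while the generator's adversarial term raises the scores of its outputs.

```python
import numpy as np

def d_hinge_loss(d_real, d_fake):
    """Hinge loss for the discriminator: penalize real scores
    below +1 and fake (inpainted) scores above -1, averaged over
    the patch score map."""
    return (np.mean(np.maximum(0.0, 1.0 - d_real)) +
            np.mean(np.maximum(0.0, 1.0 + d_fake)))

def g_hinge_loss(d_fake):
    """Adversarial term for the generator: maximize (i.e. minimize
    the negative of) the scores assigned to inpainted results."""
    return -np.mean(d_fake)
```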
We also include the perceptual loss
where
where
The guideline refinement is trained using the focal loss
where H is the binary cross entropy.
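With H denoting the binary cross entropy as above, a focal loss down-weights easy pixels by the factor (1 - p_t)^gamma. The sketch below is a minimal NumPy version; the default gamma = 2 is the common choice from the focal-loss literature, an assumption rather than the paper's reported hyperparameter.

```python
import numpy as np

def focal_loss(pred, target, gamma=2.0, eps=1e-7):
    """Focal loss for a binary guideline map. pred: predicted
    probabilities in [0, 1]; target: binary ground truth.
    The binary cross entropy H is weighted by (1 - p_t)**gamma,
    which suppresses the contribution of well-classified pixels."""
    pred = np.clip(pred, eps, 1.0 - eps)
    p_t = np.where(target == 1, pred, 1.0 - pred)  # prob. of true class
    bce = -np.log(p_t)                             # elementwise H
    return np.mean((1.0 - p_t) ** gamma * bce)
```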
As can be seen in Figure 7, we found that black regions in the color prior affect the prediction of the guideline and result in black holes. This is because the network regards a black region in the color prior as a proper color and predicts a borderline around the region. However, a black region is where no prior color information exists and should instead be filled in appropriately.

(a) Masked image is overlaid with imperfect guideline. (b) Masked image is overlaid with color prior. When the model is trained without
We resolve this problem by proposing two new losses, anti-specificity loss
That is, the guideline prediction should be negative in regions belonging to the negative ground truth guideline
where
An adversarial loss, a perceptual loss, and a style loss are used to regularize the entire generator, whereas the others are additionally used to regularize both the coarse network and the entire generator. The overall loss function for our generative network
In short,
Experiments and results
We evaluate our multiview inpainting pipeline on real indoor spaces introduced in recent works. 1,9 The properties of each space are summarized in Table 1. The degree of light difference, occlusion, or environmental complexity of a space is indicated qualitatively by “+” signs, where more “+” signs represent harder conditions for multiview inpainting. (Real-time rendering using our multiview inpainting results can be experienced at www.teevr.net:2019/generative-mvi. 33 )
Properties, statistics, and results on real data sets.
Among the different methods, the best performances in avg(dc) and avg(ds) are indicated in boldface.
For comparison, we selected two recent state-of-the-art methods that utilize edge information: EC 7 and GConv. 8 EC consists of two separate generative networks, one predicting the image edges and the other completing the inpainting using the edge prediction. By contrast, GConv uses a user-sketch input for the image inpainting but does not predict the edges. Our model uses the imperfect guideline as well as the color prior as input and concurrently refines the guideline and completes the image inpainting. We used 512
Color evaluation
Before beginning the color evaluation for multiview inpainting, pre-processing is necessary to alleviate the luminance dependency caused by the many light sources, which affect color intensity differently with respect to the camera poses. To resolve this problem, all images are normalized by conversion into the intensity-free rg chromaticity space 35 as
where r and g are bounded between 0 and 1.
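The conversion is a per-pixel normalization by intensity, r = R/(R+G+B) and g = G/(R+G+B); a minimal sketch:

```python
import numpy as np

def to_rg_chromaticity(img, eps=1e-8):
    """Convert an RGB image (H, W, 3, float) to intensity-free rg
    chromaticity: r = R/(R+G+B), g = G/(R+G+B). eps guards against
    division by zero on pure black pixels."""
    s = img.sum(axis=-1, keepdims=True) + eps  # per-pixel intensity
    return img[..., :2] / s                    # keep only r and g
```

Because the channels are divided by the total intensity, scaling a pixel's brightness leaves its rg chromaticity unchanged, which is exactly the luminance invariance the evaluation requires.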
The following data arrangement is also performed for pixel-wise evaluation. For a given 3D point pi from the sampled PCD, two sets of images are defined,
where rj and gj are the rg chromaticity of the pixel corresponding to the point pi in jth inpainted image belonging to the set
For the evaluation, we compare two factors, consistency over ni images (dc) and similarity between the sets
where

Let a given 3D point pi be occluded by objects in five images (
Second, similarity distance ds is the Euclidean distance between the means of the matrices
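The two metrics can be sketched as follows. The exact formula for dc is not reproduced here; a plausible reading consistent with the text is the mean distance of each observation to the set mean, while ds is, as stated, the Euclidean distance between the means of the two chromaticity matrices.

```python
import numpy as np

def consistency_distance(rg_inpainted):
    """dc: spread of one 3D point's rg chromaticities across its
    n_i inpainted images (assumed here to be the mean distance to
    the set mean). rg_inpainted: (n_i, 2) array of (r, g) values."""
    mean = rg_inpainted.mean(axis=0)
    return np.mean(np.linalg.norm(rg_inpainted - mean, axis=1))

def similarity_distance(rg_inpainted, rg_visible):
    """ds: Euclidean distance between the mean chromaticity of the
    inpainted observations and that of the occlusion-free ones."""
    return np.linalg.norm(rg_inpainted.mean(axis=0) -
                          rg_visible.mean(axis=0))
```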
Finally, since dc and ds are point-wise measures, they are averaged over all points. Evaluations were conducted on EC, GConv, OURs without the color prior (to verify the effects of the color prior), and OURs with the color prior. The results are presented in Table 1. Improvements of up to 46.4% and 30.6% in the consistency distance, and up to 35.6% and 27.3% in the similarity distance, were achieved compared to EC and GConv, respectively. Qualitative comparisons are shown in Figure 9.

Sample test results of Space 3. Other methods result in color inconsistency while our method using color prior preserves color consistency in the results of multiview inpainting.
Boundary evaluation
We evaluate boundary consistency qualitatively. As can be seen in Figure 10, we observed that EC is good at finding finer edges because it was trained using a Canny edge detector followed by nonmaximum suppression. However, it often fails to generate proper edges in areas where an architectural boundary should exist in a real data set (cf. Figure 10(b) and (c)). By contrast, both GConv and our approach achieve better results in terms of the architectural boundary because of the explicit guideline inputs. Furthermore, with guideline refinement in the network, Figure 11 shows that our model is able to recover missing guidelines in the images.

(a) The original image. (b) Boundary comparison on the left side pillar. (c) Boundary comparison on the right-side pillar.

A sample result for missing guidelines. Line patterns on the left-side pillar are not drawn in the input guideline image but are visible in the refined guideline output. For visualization, the output guideline corresponding to the mask is colored in blue and emphasized.
Ablation studies
For the quantitative evaluation of multiview inpainting on the real indoor data sets, 1,9 we defined the point-wise consistency metrics, consistency distance and similarity distance, which are averaged over all points. However, averaging over all points might obscure the goal of multiview inpainting, which is color-consistent inpainting regardless of the number of inpainted images.
Here, we further analyze the impact of the number of inpainted images on color consistency. To this end, each point pi is classified according to the number of times, ni, that it is used for inpainting. Then, the consistency distance and similarity distance of each classified group of points are averaged over the number of points in the group. Specifically, ni is the number of rows in the matrix

Quantitative comparisons of consistency distance for different models: (a), (b), and (c) are the results for Spaces 1, 2, and 3, respectively. The dashed vertical line represents the average value of ni for the given space. Our approach shows a lower value for every ni compared to the other methods, meaning that it outperforms the others in terms of consistency distance regardless of the number of inpainted images in the multiview inpainting task.

Quantitative comparisons of similarity distance for different models: (a), (b), and (c) are the results for Spaces 1, 2, and 3, respectively. The dashed vertical line represents the average value of ni for the given space. Our approach shows a lower value for every ni compared to the other methods, meaning that it outperforms the others in terms of similarity distance regardless of the number of inpainted images in the multiview inpainting task.
Figure 12 shows the impact of ni on the consistency distance, which tends to increase as ni increases. Our approach consistently shows a lower value for every ni compared to the other single-image inpainting methods. Likewise, Figure 13 shows similar results: our approach outperforms the other methods for every ni in terms of similarity distance. Both results suggest that our pipeline using the color prior is more effective in the multiview inpainting task, regardless of the number of inpainted images, than the single-image inpainting approaches. Figure 14 shows more qualitative comparisons on the real data sets.

Qualitative comparisons on the real data sets. The pairs of rows correspond to Spaces 1, 2, and 3 of the real data set, respectively.
Conclusions
We proposed a multiview inpainting pipeline consisting of color prior image creation and an image inpainting network. Our pipeline aims to achieve both color and boundary consistency for multiview inpainting by introducing a color prior and a guideline input, respectively. While training the network, we introduced new losses to resolve the black hole problem and to achieve the two goals, inpainting and guideline refinement, concurrently. We evaluated the results of our pipeline using the consistency distance and similarity distance, which indicate that the method outperforms other methods applied to multiview inpainting.
We plan to extend the proposed multiview image inpainting with a better pipeline to further stabilize lighting effects during the color prior generation step. There are also other promising applications and future directions for our pipeline. Since the inpainting is affected by the color prior images, user-interactive inpainting becomes possible: users may remodel their virtual spaces, as the inpainting can be modified by user input. Similarly, our method can be used for image synthesis in indoor or outdoor target spaces if the point clouds and image poses are given. Our pipeline may also serve as a preprocessing step for recent deep learning-based rendering approaches.
Declaration of conflicting interests
The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: Korea University and TeeLabs Inc.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the Technology Innovation Program (10073166) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea).
