Abstract
This article proposes a generative adversarial network (MiniGAN) to tackle both informative and uninformative image transfer. The generator of MiniGAN is based on the structure of StyleGANv2, in which an encoder and a style transform block are proposed to extract the high-level feature maps of the source image and to capture the latent representation of the target image, respectively. This information guides the generator in the final image generation. The proposed MiniGAN outperforms other models in style transfer while preserving the color information of informative images. To test the performance of MiniGAN on uninformative images, a new data set consisting of 10,000 fashion hand drawings is introduced. Extensive experiments and detailed analysis are presented to demonstrate the performance of MiniGAN.
Keywords
Introduction
The task of image-to-image translation1,2 is to learn an appropriate mapping function from a source image to a target image. Recently, generative adversarial network (GAN)-based methods have performed well at colorizing gray-scale real images,3,4 combining a content image with the style of a target image,1,5,6 translating between two or more object domains,2,7 and so on. Although these methods achieve promising results in scenarios that deal with informative images, such as day scene to night scene and real photo to Monet-style photo, to the best of our knowledge, the problem of uninformative image transfer has not been tackled.
As shown in Figure 1(a), most images that we can easily access in our daily life are informative enough. However, there are still large numbers of images that belong to minimalism (as shown in Figure 1(b)). This kind of image mainly appears in the design creation process, such as illustrations, technical drawings, or hand sketches. Uninformative image transfer thus has practical value comparable to that of informative image transfer.

Samples of informative images and uninformative images. The upper images show samples of informative images: the input image is full of details, no large region is filled with a single pure color, and the image has multiple depth layers. The lower images illustrate samples of uninformative images: the background of the input is clean, the color of the input is simple and pure, and the sketch has no sense of depth: (a) samples of informative images and (b) samples of uninformative images.
In this work, we tackle the problem of both informative and uninformative image transfer. The informative images we refer to here have the following characteristics: (1) rich colors, (2) diverse spatial distribution, and (3) multi-level depth. On the other hand, the characteristics of uninformative images are (1) relatively simple color, (2) line drawings, and (3) little sense of spaciousness.
To this end, we propose MiniGAN, based on an encoder–decoder network. Specifically, we propose a style transform block that contains nine independent residual blocks to carry out target-style image transfer. Besides, we use a style coder to support pixel-wise image generation. The style coder, inspired by the structure of StyleGANv2,8 captures the multi-level features of the target image, which then guide the synthesis network to generate images. The network creates new target-style images with the support of a multi-scale least-squares discriminator. To verify the effectiveness of our model, we evaluate it in both qualitative and quantitative ways. In addition, to test the performance on uninformative data sets, we introduce a new data set composed of fashion line drawings. Extensive experiments on both informative and uninformative images show that our model outperforms the aforementioned models in preserving image details. Analysing the challenges of uninformative image transfer can serve as a good reference for future exploration of related tasks.
In summary, the contributions of this article are as follows:
We define the differences between informative and uninformative images and introduce a novel task that synthesizes images in uninformative domains.
We propose MiniGAN, a neat and effective model that achieves good results in unsupervised style transfer while preserving the color information of the source image.
We introduce an uninformative data set that is composed of 10,000 high-resolution sketches to support this task and further explorations.
Related Work
GAN
GAN9 is an algorithm for image synthesis. In GAN-based models, there are two key issues to address: improving the quality of the generated image and avoiding mode collapse during image synthesis. In recent years, GAN-based models have achieved many promising results. Radford et al.10 (deep convolutional generative adversarial network (DCGAN)) first used a convolutional neural network (CNN) to perform GAN-based image synthesis. More recently, to enable high-resolution image generation, a set of progressive-like generators8,11–13 has been proposed to generate images with rich textures and details. However, these algorithms require expensive computation and detailed data sets. An Earth-Mover distance-based GAN14 was proposed to address the problem of mode collapse. Later, methods such as Wasserstein GANs15 and spectral normalization for GANs16 were proposed to stabilize the quality of image generation. Nevertheless, methods that achieve promising results in image generation still lack the ability to control the mode of the generated image during synthesis.
Style Transfer
In addition to improving the basic quality and diversity of image synthesis, GAN-based image generation has been adopted in other tasks. Style transfer is the task of generating a new target-like image, often using linear mapping methods, based on the content information and style information extracted from the content input and the target input. Gatys et al.17 first proposed an algorithm in which the generated image is iteratively optimized so that the correlations (Gram matrices) of its deep features, extracted by a pre-trained deep neural network, match those of the style input. Later, they introduced additional constraints18 to guide the stylization of the generated image in terms of color and texture. While the generated image fuses the content of the content input with the style of the target input to produce a positive result, the computational cost is relatively high. To achieve faster style mixing, single-forward neural networks19–23 were introduced, sharply decreasing computation time. To generate more artistic images, Jing et al.24 first took multi-scale strokes into consideration by utilizing a multi-scale encoder and discriminator. Yao et al.25 combined the advantages of a single-forward network with multi-stroke consideration and proposed an attention-aware method to improve the quality of the generated image. However, these methods all require style images as a necessary input. Furthermore, they alter not only the texture and details but also the color distribution during style mixing. In other words, these methods struggle to transfer source images into target-like images while preserving the color information of the source image.
Image-to-Image Translation
The aim of image-to-image translation is to learn a mapping from a source domain to a target domain. Recently, several studies have achieved promising outcomes. Pix2pix1 was the first GAN-based method to translate images between two different domains, but it requires a paired data set to generate high-quality images. To overcome this, architectures based on cycle consistency2 and a shared latent space26 were introduced. Very recently, algorithms7,27,28 based on these two architectures have been introduced to further improve image quality. Although image-to-image translation can achieve good quality as well as multi-modal results, the scenarios are limited because both unpaired data sets are informative. Furthermore, at present, no research focuses on uninformative style transfer.
Methodology
The main goal of this work is to deal with both informative and uninformative image transfer. For GAN-related algorithms, the task is to generate vivid images with rich details, such as reasonable texture and easily recognized shapes. Several image-to-image-based models7,28,29 work between two information-rich domains, such as real photos to paintings. However, as indicated in Figure 2, those methods do not perform well enough when one of the two domains is uninformative.

Different scenarios when performing image-to-image style transfer. The top of the figure illustrates the sample of two traditional informative image domains: the left-hand side is from a real image data set, while the right-hand side shows an artistic image. The bottom of the figure shows the uninformative scenario.
In order to fix this problem, we adopt Gram matrices, which capture high-level target-specific style statistics, to carry out image style transfer. Moreover, several loss functions are applied to help construct the final output. Figure 3 illustrates the main structure of the generator. In Figure 3(a), given a source image as input, the model applies an encoder with residual blocks30 to extract low-level details as well as high-level features of the source image. After being transformed by the style transform block, the high-level feature maps are fed into the generator to support the generation of the target-style image. Inspired by StyleGAN,8,13 a style coder captures the latent style representation of the target image, which guides the synthesis network during image generation.

Overview of the main architecture of the encoder–decoder-based network. (a) Details of the generator. In addition to the style latent vector applied in StyleGAN and StyleGANv2, deep-level feature maps extracted by the encoder are also used through skip connections to help image generation. The style transform block contains nine residual convolutional layers that transfer the style from the content image to the target image. (b) Process of the multi-scale discriminator. Through multi-scale outputs, the discriminator helps the generator capture both low-level information and high-level features.
Encoder
In order to obtain latent feature maps, a network composed of four residual blocks30 is adopted to extract features from the source image. The resulting multi-scale feature maps are later passed to the generator through skip connections.
On the other side, an intermediate latent vector is applied8,13 for style refinement. Following the implementation in StyleGANv2, the input latent code is sampled from a Gaussian distribution and mapped to this intermediate latent space by a mapping network.
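As a rough illustration of this design, the following is a minimal sketch of a four-residual-block encoder that returns multi-scale feature maps for the skip connections. The channel widths, normalization layers, and strides are assumptions, since the text does not specify them.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Standard residual block: two 3x3 convolutions with an identity skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels),
        )

    def forward(self, x):
        return x + self.body(x)

class Encoder(nn.Module):
    """Encoder sketch: a stem plus four residual stages; returns the feature map
    of every stage so the generator can reuse them through skip connections."""
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, base, 7, padding=3)
        self.stages = nn.ModuleList()
        ch = base
        for _ in range(4):
            self.stages.append(nn.Sequential(
                nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1),  # downsample by 2
                ResidualBlock(ch * 2),
            ))
            ch *= 2

    def forward(self, x):
        feats = []
        h = self.stem(x)
        for stage in self.stages:
            h = stage(h)
            feats.append(h)   # keep every scale for the skip connections
        return feats          # ordered from shallow to deep
```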
Style Transform Block
From previous style transfer methods, we found that models perform style transfer effectively at the level of global representation. Taking conditional style transfer31 as an example, the generated images are of high quality with fine details and rational textures. However, the shortcoming is that the color distribution of the output image is similar to that of the target image, which makes it look unreasonable when compared with the original input. To tackle this problem, inspired by Style-aware,6 we use a style transform block composed of nine residual convolutional blocks to transfer the image to a target-like image in the latent representation.
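A minimal sketch of such a block is given below, assuming it operates on the encoder's deepest feature map; the channel count is an assumption, since the paper does not state it.

```python
import torch.nn as nn

class StyleTransformBlock(nn.Module):
    """Sketch of a style transform block: nine residual convolutional blocks applied
    to the deepest encoder feature map."""
    def __init__(self, channels=512, n_blocks=9):
        super().__init__()

        def res_body(c):
            return nn.Sequential(
                nn.Conv2d(c, c, 3, padding=1), nn.InstanceNorm2d(c), nn.ReLU(inplace=True),
                nn.Conv2d(c, c, 3, padding=1), nn.InstanceNorm2d(c),
            )

        self.blocks = nn.ModuleList([res_body(channels) for _ in range(n_blocks)])

    def forward(self, x):
        for block in self.blocks:
            x = x + block(x)   # residual connection around each block
        return x
```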
Generator
Given the multi-scale features extracted by the encoder together with the latent style representation, the generator synthesizes the target-style image, fusing the skip-connected encoder feature maps at each scale. At the end of the generator, there is an additional convolutional block, named the RGB block, which maps the output features to the final image.
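The sketch below illustrates this decoder-style generator: upsampling blocks that fuse the transformed deep features with the encoder skips, followed by an RGB block. The channel widths are assumptions, and the StyleGANv2-style modulated convolutions driven by the style coder are omitted for brevity.

```python
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    """Upsample, fuse the matching encoder feature map (skip connection), and refine."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1),
            nn.InstanceNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)
        return self.conv(torch.cat([x, skip], dim=1))

class Generator(nn.Module):
    """Decoder sketch: fuses deep features with encoder skips, then an RGB block."""
    def __init__(self, base=64):
        super().__init__()
        chs = [base * 2, base * 4, base * 8, base * 16]   # assumed encoder channel widths
        self.ups = nn.ModuleList([
            UpBlock(chs[3], chs[2], chs[2]),
            UpBlock(chs[2], chs[1], chs[1]),
            UpBlock(chs[1], chs[0], chs[0]),
        ])
        self.to_rgb = nn.Sequential(                      # "RGB block": features -> final image
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(chs[0], 3, 7, padding=3),
            nn.Tanh(),
        )

    def forward(self, feats):
        # feats: encoder maps ordered shallow -> deep; the deepest one is assumed to have
        # already passed through the style transform block.
        x = feats[-1]
        for up, skip in zip(self.ups, reversed(feats[:-1])):
            x = up(x, skip)
        return self.to_rgb(x)
```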
Loss Functions
Adversarial Loss
GAN-based adversarial loss9 is an effective tool for matching the distribution of the generated images to that of the target images by playing a min–max game. In other words, the generator tries to deceive the discriminators by reproducing the distribution of the target domain as closely as it can, while the discriminator learns to distinguish real target-domain images from fake outputs. Instead of the prevalent losses,14,15 with which it is difficult to balance the adversarial term against the other losses in scale, a least-squares adversarial loss33 is applied to supervise the generator and the discriminator.
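The original equation is not reproduced in the text. In the standard formulation of the cited least-squares GAN (the authors' exact notation may differ), the objectives take the form

$$\mathcal{L}_{\mathrm{adv}}(D)=\tfrac{1}{2}\,\mathbb{E}_{y\sim p_{t}}\!\left[(D(y)-1)^{2}\right]+\tfrac{1}{2}\,\mathbb{E}_{x\sim p_{s}}\!\left[D(G(x))^{2}\right],\qquad \mathcal{L}_{\mathrm{adv}}(G)=\tfrac{1}{2}\,\mathbb{E}_{x\sim p_{s}}\!\left[(D(G(x))-1)^{2}\right],$$

where $G$ is the generator, $D$ the discriminator, $p_{s}$ the source distribution, and $p_{t}$ the target distribution.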
Inspired by Wang and Yu34 and Lu et al.,35 two more feature maps are extracted as guided feature maps. These two empirical priors help the discriminators distinguish real from generated images. To obtain edge features from the image, the traditional Sobel kernel is utilized to extract the edges (margins) of the image. Besides structural differences among images, texture difference is another key objective. Although it is challenging to obtain texture features directly in the RGB channels, converting images into a luminance–chrominance representation such as YUV or Lab reduces the difficulty, as the first channel carries the texture information and the other two channels carry the color information, which influences texture little. In order to capture information ranging from shape to details, the multi-scale discriminator shown in Figure 3(b), composed of several convolution blocks, is employed to distinguish inputs at both low- and high-level feature maps.
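A minimal sketch of these two guided feature maps follows, assuming inputs in [0, 1] and BT.601 YUV coefficients; the paper does not fix these details.

```python
import torch
import torch.nn.functional as F

def sobel_edges(img):
    """Edge map from the traditional Sobel kernels, applied to a grayscale version.
    img: (B, 3, H, W) tensor in [0, 1]."""
    gray = img.mean(dim=1, keepdim=True)
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=img.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)                     # vertical-gradient kernel
    gx = F.conv2d(gray, kx, padding=1)
    gy = F.conv2d(gray, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)

def rgb_to_yuv(img):
    """BT.601 RGB -> YUV conversion; the Y channel carries the luminance/texture
    information used as an additional discriminator input."""
    r, g, b = img[:, 0:1], img[:, 1:2], img[:, 2:3]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = -0.147 * r - 0.289 * g + 0.436 * b
    v = 0.615 * r - 0.515 * g - 0.100 * b
    return torch.cat([y, u, v], dim=1)
```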
Style Loss
Style loss is introduced to capture the high-level feature structure as well as the texture information. A Gram-matrix-based style loss17 is adopted in our work: the Gram matrices of the deep feature maps of the generated image and of the target image are computed, and the loss accumulates their mean-squared distance over the selected feature layers.
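The sketch below shows this loss in the usual Gatys-style form; the normalization and the choice of feature layers are assumptions, as they are not reproduced in the text.

```python
import torch

def gram_matrix(feat):
    """Gram matrix of a feature map: channel-wise correlations that encode style.
    feat: (B, C, H, W)."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def style_loss(feats_fake, feats_target):
    """Mean-squared distance between Gram matrices of generated and target features,
    summed over the chosen feature layers."""
    loss = 0.0
    for ff, ft in zip(feats_fake, feats_target):
        loss = loss + torch.mean((gram_matrix(ff) - gram_matrix(ft)) ** 2)
    return loss
```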
Content Loss
Content loss is utilized to preserve the global structure of the input image. A mean-squared loss is adopted to measure the distance between the deep feature maps of the generated image and those of the source image, both extracted by a content extractor.
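The paper does not specify the content extractor here, so the sketch below uses a frozen pre-trained VGG16 truncated at an intermediate layer purely as an illustrative assumption.

```python
import torch
import torchvision.models as models

class ContentExtractor(torch.nn.Module):
    """Content extractor sketch: a frozen pre-trained VGG16 truncated at an
    intermediate layer (the backbone and layer index are assumptions)."""
    def __init__(self, layer=16):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features[:layer]
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.net = vgg.eval()

    def forward(self, x):
        return self.net(x)

def content_loss(extractor, fake, src):
    """Mean-squared distance between deep features of the generated and source images."""
    return torch.mean((extractor(fake) - extractor(src)) ** 2)
```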
Total Variance Loss
Because of their specific characteristics, the frequency content of painting images differs from that of real photos, which makes generalization difficult. To maintain the continuity of the image, a total variance loss is adopted to decrease the probability of unwanted noise by penalizing differences between neighbouring pixels.
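A common (anisotropic, L1) form of this loss is sketched below; the exact variant used in the paper is not stated in the text.

```python
import torch

def total_variance_loss(img):
    """Total variation loss: penalizes differences between neighbouring pixels to
    suppress high-frequency noise. img: (B, C, H, W)."""
    dh = torch.mean(torch.abs(img[:, :, 1:, :] - img[:, :, :-1, :]))
    dw = torch.mean(torch.abs(img[:, :, :, 1:] - img[:, :, :, :-1]))
    return dh + dw
```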
Full Structure
Overall, our model can be described as a network composed of an encoder, a style transformer, a generator, and a multi-scale discriminator. The full loss, a weighted combination of the adversarial, style, content, and total variance losses, is used to optimize the generator in terms of high-level features as well as texture and detail representation.
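A sketch of the combined generator objective; the weights are placeholders, since the paper's hyper-parameter values are not reproduced in the text.

```python
def total_generator_loss(adv_loss, style_loss, content_loss, tv_loss,
                         w_adv=1.0, w_style=1.0, w_content=1.0, w_tv=1.0):
    """Weighted sum of the four generator objectives (weights are placeholder values)."""
    return (w_adv * adv_loss + w_style * style_loss
            + w_content * content_loss + w_tv * tv_loss)
```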
Experiment
In this section, extensive experiments were conducted on state-of-the-art cycle consistency2-based models, namely GDWCT,36 MUNIT,27 DRIT,28 and CycleGAN,2 on style transfer-based models such as Style-aware,6 and on our model, to evaluate the performance on both the informative data set and the uninformative data set. Training performance and general result analyses in both the informative and the uninformative domains are presented in the following subsections.
Implementation
For optimization, the Adam37 algorithm was used for both the generator and the discriminator.
Hyper-parameter: The default hyper-parameters were set as:
Data set: We sampled images from the Places36538 training data set as our source domain. For the target domain, the collections of Cezanne and Van Gogh images are adopted from WikiArt (https://www.wikiart.org/). For the source domain, there are more than

Samples of images from Places365 (real photos), artworks, BeautyU (illustrations), and sketches, respectively: (a) informative images and (b) uninformative images.
Evaluation
To evaluate the performance of style transfer, four other algorithms, including image-domain style transfer,36 content–style disentangled transfer,27,28 and zero-shot image style transfer,22 are benchmarked.
Qualitative evaluations: Figure 5 illustrates the comparison between the four benchmarked methods and our method on informative-domain image style transfer. Owing to the effectiveness of the skip connections, the images generated by our method have clear contours and details. The images generated by style-image-guided algorithms contain rich texture information in terms of style; however, they inevitably carry the color information of the guiding images, which is contrary to our expectation. Unsupervised algorithms such as MUNIT and DRIT capture both the content and the style latent representation of the target images, but their outputs lose the color information during style transfer. For the image-guided algorithm GDWCT, it is difficult for the outputs to obtain the style of the target images; furthermore, obtaining color-invariant generated images is another challenge.

Informative style transfer. To obtain the stroke of the target domain while retaining the global structure of the content image, our model outperforms the other models in detail preservation as well as style representation. For Style-aware, the generated image loses much of the content, so the contours of the image look messy.
To compare in more detail, Figure 6 illustrates the comparison between Style-aware, CycleGAN, and our algorithm. Because the style of the target domain is abstract, a model may fail to obtain both its characteristics and clear boundaries. The images generated by CycleGAN have clear boundaries and backgrounds; however, the main difference between the input images and the generated images is the color distribution, and in terms of global representation the generated images look more like photos than paintings. In other words, the generated images do not carry the characteristics of the painting. Style-aware transfers the style of the target domain successfully, but the details show messy contours and unreasonable line strokes. Taking the lower image as an example, the chair is a mess and the roof of the pavilion is mixed up with the trees on its right side; in the upper image, Style-aware distorts the shapes of the people and generates irregular noise. Although our method still has limitations in style representation and in reducing high-frequency artifacts, it preserves details such as clear contours and reduces the noise. In summary, our model outperforms all baselines in generating images with clear contours and a more rational trade-off between the real image and the target image.

Details in the generated images. Results from our model have clear contours as well as smaller color differences and more detail. Compared with (c), the image generated by our model has cleaner boundaries and a more rational distribution. For (d), there is little change after style transfer: (a) input, (b) ours, (c) Style-aware, and (d) CycleGAN.
Quantitative evaluations: Frechet inception distance (FID),39 which measures the Frechet distance between the two Gaussian distributions fitted to the Inception features of the real and the generated images, is adopted in this work, as it is an ideal measure of how close two distributions are and hence of the quality of the generated images. It takes the standard form

$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right),$$

where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ denote the mean and covariance of the Inception features of the real images and the generated images, respectively.
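As a sketch, assuming the feature means and covariances have already been estimated, the distance can be computed as follows.

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu_r, sigma_r, mu_g, sigma_g):
    """Frechet distance between two Gaussians fitted to Inception features (the FID).
    mu_*: (D,) means; sigma_*: (D, D) covariances."""
    diff = mu_r - mu_g
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real   # numerical error can introduce tiny imaginary parts
    return float(diff @ diff + np.trace(sigma_r) + np.trace(sigma_g)
                 - 2.0 * np.trace(covmean))
```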
Furthermore, an evaluation metric named learned perceptual image patch similarity (LPIPS)41 is also adopted to measure the quality of the generated images in our work. In Table 1, the three cycle consistency-based models obtain the best three FID scores in the photo domain, while their scores in the painting domain are relatively high. The FID score in the painting domain is relatively more important than that in the photo domain. Our model obtains relatively low FID scores compared with the other methods, which means that it can catch the latent style representation of the target images. MUNIT obtains the lowest FID scores in both the photo domain and the painting domain; however, its change of color distribution is undesirable. Furthermore, an extremely low FID score on the real data set means that the model changes the source image very little. With relatively lower scores compared with the other methods, our method preserves the details of the content images. For LPIPS, our method and Style-aware obtain the two best scores, which means that the images generated by Style-aware and by our method outperform the other methods in semantic structure representation.
The table illustrates the FID distances as well as the LPIPS scores with respect to the real photo data set and the painting data set.
FID: Frechet inception distance. A lower score indicates better stylization results. The aim of the task is to obtain a low score on the photo data set as well as on the painting data set, although there is a trade-off between the two. For LPIPS, a lower score means the generated image is more similar to the original image; in other words, the lower the score, the better the generated image preserves details. While our model does not achieve the best result on every measurement, it achieves balanced FID scores and the second-best result in LPIPS. Bold values indicate the best performance among the different methods.
Ablation Study
In order to carry out style transfer while preserving the color information, we use the regular encoder–decoder model Style-aware6 with a multi-scale discriminator as our baseline. Several components were added to our model for the sake of higher-quality image generation. We compared the outputs generated with and without these blocks to assess their effectiveness.
Figure 7 shows the ablation studies. In Figure 7(b), the images are generated with the normal GAN loss. There is an irrational color distribution over the whole image; besides, the tree in the first image is a mess, which means the model is limited in transferring some objects. As the structure representation of Figure 7(b) is similar to the input, the skip connections can catch both low- and high-frequency information; however, they preserve the details of the content image so well that the model cannot catch the style of the target images. In Figure 7(c), the output images are generated by the model trained without adding the noise. The color distribution of the generated image shifts compared with the input. Besides, mode collapse appears in several places, such as the branches of the trees and the top of the car. Images generated by the full model alleviate the color difference while learning the latent representation of the style image.

Ablation studies with or without some components. In (b), while it performs well on the lower image, there are some irrational black lines on the trees. In (c), the two images are not realistic, and in the upper image there is a black blob in some places; also, the color distribution is not as natural as in the original image. For our model, the generated image has a more realistic color distribution as well as painting-like strokes: (a) input, (b) w/o sq. loss, (c) w/o noise, and (d) full model.
In addition to the image demonstrations, Table 2 shows the FID distances and LPIPS scores for the three models. Model (b) obtained the highest FID distance on the real data set and the highest LPIPS score, which means its ability to preserve the content of the source image and the style of the target image is limited. Model (c) achieved the best LPIPS score, but the FID distance between the real data set and the generated images is relatively low while that between the paint data set and the generated images is high, which means its style transfer ability is ineffective. The full model obtained the lowest FID distance to the paint data set and a relatively high distance to the real data set, indicating that it captured the style of the target images while preserving the information of the source images well.
The table illustrates the FID distances and LPIPS scores with respect to the real photo data set and the painting data set in the ablation study.
According to Table 2, model (c) obtained the lowest FID-photo distance and the lowest LPIPS score, which means that model (c) changed little of the input image. The full model obtained the best result in FID-paint, which means it captured the style of the target images. Bold values indicate the best performance.
Analysis on Uninformative Data set
From the human-vision perspective, some images generated by existing algorithms can deceive even an expert to a certain extent, from layout to texture and color distribution. However, these well-performing image transfer models rely on two informative image domains; in other words, there is little research focused on uninformative image transfer. For our images, which belong to the uninformative domain, the aforementioned five algorithms and ours are compared, and the results are illustrated in Figure 8.

Uninformative style transfer. For the first three algorithms, there is more or less mode collapse. Images generated by the latter three algorithms change little from the original domain except in stroke and color distribution.
In Figure 8, the above methods do not perform well on this task. Among the four cycle consistency-based image translation methods (MUNIT, GDWCT, DRIT, and CycleGAN), CycleGAN preserves the content of the source image in this scenario, but the style of the generated image changes little. For the other three methods, mode collapse appears to a greater or lesser extent during image synthesis. In Figure 5, DRIT can carry the style of the target domain; here, however, it is limited in preserving the content of the input image. For MUNIT and GDWCT, while the generated images somehow capture the style representation of the target image, the color and the overall body shape in the images are out of control. For Style-aware, the strokes in the images differ from those of both the input images and the target images: it produces thick, straight lines, as in its real-to-painting transfers, instead of thin, curved ones; however, it preserves the body shape well. Our model preserves the details of the content domain well but still lacks style representation from the target domain; moreover, the output has a blurry background, which decreases the quality. In order to remove this unexpected blurring, two further methods are proposed. One is to use a mask to extract the main part and set the remaining region to clean white. The other is to concatenate the input with its mask to form a new four-channel input, which is then fed into the network; the added mask channel can be seen as a white-box attention (a sketch of both options is given below). Figure 9 shows the results. It is noted that in the third and fourth rows the generated images have clean backgrounds; however, the model cannot learn the layout and stroke of the target image. For the former method, which uses a mask to obtain the foreground, the region of interest is simply that of the mask, so only the regions of interest remain. For the latter method, while the mask channel shares the region of interest, it contributes too strongly to shape generation and limits the style of the synthesized image.
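A minimal sketch of the two background-cleaning options described above; the mask convention (1 = foreground) and the white background value are assumptions.

```python
import torch

def apply_mask(img, mask, background=1.0):
    """First option: keep only the masked region and set the rest to clean white.
    img: (B, 3, H, W); mask: (B, 1, H, W) with 1 = foreground, 0 = background."""
    return img * mask + background * (1.0 - mask)

def concat_mask(img, mask):
    """Second option: concatenate the binary mask as a fourth channel, producing the
    four-channel input that is fed into the network."""
    return torch.cat([img, mask], dim=1)   # (B, 4, H, W)
```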

Methods to clean the background of the images generated by our method. In the third and fourth rows, the background is cleaned, while the styles of the generated images are limited.
Conclusion
In this article, we propose MiniGAN to improve the performance of image style transfer on informative and uninformative image data sets. Through this model, we can generate high-quality target-style images when the input images are informative, such as real-life photos. The main structure of MiniGAN is a basic encoder–decoder network with residual blocks. To achieve better detail generation, we apply StyleGAN-like modulated convolution layers to facilitate the representation of content. In order to make the generated image look more like the target image, a multi-scale discriminator is applied to constrain the generator. Furthermore, a total variation loss is applied to reduce the irrational parts of the generated image. Qualitative and quantitative results show that our method can generate images with the target style when the data set is informative. For uninformative data, our method performs well in detail preservation but is still not satisfactory in style representation, which will be our future work.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research is funded by the Laboratory for Artificial Intelligence in Design (Project Code: RP3-1), Innovation and Technology Fund, Hong Kong Special Administrative Region.
