Abstract
In this article, we propose StylishGAN, a generative adversarial network that generates a fashion illustration sketch given an actual photo of a human model. The generated stylish sketches not only transfer the image style from real photos to hand drawings with a cleaner background, but also adjust the model's body into a perfectly proportioned shape. StylishGAN learns proportional transformation and texture information through a proposed body-shaping attentional module. Furthermore, we introduce a contextual fashionable loss that augments the design details, especially the fabric texture, of the clothing. To implement our method, we prepare a new fashion dataset, namely, StyleU, which consists of 3567 paired photo–sketch images. In each pair, we have one real photo collected from a fashion show and one corresponding illustration sketch created by professional fashion illustrators. Extensive experiments demonstrate the effectiveness of our method both qualitatively and quantitatively.
Introduction
Fashion illustration is a classical way of fashion communication. Compared with photographs, fashion illustrations are filtered through an individual vision, which allows them to convey more fictional narratives and a stronger sense of style. 1 However, this form of art is costly because it requires skillful drawing techniques acquired through long practice. At the same time, fashion illustrations help to better exhibit designed clothing and are in extensive demand from industry. 2 Thus, automatically generating fashion illustrations, which can significantly reduce the cost, would have substantial practical value for the fashion industry.
As shown in Figure 1, designers begin with a sketch of a body figure called a croquis and build a look on top of it. They typically illustrate clothing on a body figure with exaggerated 9-head proportions. The details of the clothing (silhouettes and fabrics) are carefully rendered using tools such as gouache, marker, and ink. In a word, the requirements for a standard fashion illustration can be summarized as follows: 2 (1) the body shape is adjusted to a specific proportion, that is, a head-to-body ratio of around 9 in a standing position; (2) the clothing identities (e.g. design attributes, color, print design) should be the same as those of the source image; and (3) the texture is rendered with high quality on a clean background.

Examples of fashion illustration.
To satisfy the above requirements, we transform the real fashion photo into a hand-drawn style while adjusting the body shape to a certain proportion and obtaining the texture information through a new body-shaping attentional module. According to the drawing principles of fashion illustration, 3 the fashion sketch is based on the spatial relations of human body keypoints rather than the full-body shape. We thus adopt the idea of pose transfer-guided person image generation4–6 while using keypoints instead of clothing parsing to represent the geometrical features of the body shape. Meanwhile, we introduce an attention mechanism with a non-local operation 7 to enhance the model's ability to select the region of interest (ROI) that needs to be transferred. Apart from the body shape, we introduce a contextual fashionable (coFa) loss, which preserves the edges of the clothing while enhancing the design details. The spatial pixel coordinates and pixel-level Red, Green, Blue (RGB) information are integrated into the image features. To further enhance the details, the Laplacian pyramid is adopted to decompose images into multiple scales. 8 For each output pyramid coefficient, we render a filtered version of the full-resolution image according to the corresponding local image value from the Gaussian pyramid at the same scale, build a new Laplacian pyramid from the filtered image, and copy the corresponding coefficient to the output pyramid.
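To make the non-local attention concrete, the sketch below shows a generic embedded-Gaussian non-local block in the spirit of Wang et al., 7 in which every spatial position attends to all others; the class name, channel reduction, and layer choices are illustrative assumptions rather than the exact block used in StylishGAN.

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Generic embedded-Gaussian non-local operation (illustrative sketch)."""
    def __init__(self, channels):
        super().__init__()
        inter = channels // 2                        # reduced embedding dimension
        self.theta = nn.Conv2d(channels, inter, 1)   # query projection
        self.phi = nn.Conv2d(channels, inter, 1)     # key projection
        self.g = nn.Conv2d(channels, inter, 1)       # value projection
        self.out = nn.Conv2d(inter, channels, 1)     # restore channel depth

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (b, hw, c')
        k = self.phi(x).flatten(2)                     # (b, c', hw)
        v = self.g(x).flatten(2).transpose(1, 2)       # (b, hw, c')
        attn = torch.softmax(q @ k, dim=-1)            # affinity of each position to all others
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                         # residual connection keeps the input signal
```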
In addition, we prepare a new dataset with 3567 paired images, that is, real photos and their corresponding fashion illustrations, for implementation. The real photos are catwalk images from various high-fashion brands such as LOEWE, GUCCI, and CHANEL. Ready-to-wear garments from the “Big Four,” that is, the New York, London, Milan, and Paris fashion weeks, contain more design attributes and details to be generated than outfits in our daily life. The illustrations, created by different professional fashion illustrators, are collected from the Internet.
Extensive experiments demonstrate the advantages of StylishGAN qualitatively and quantitatively compared with state-of-the-art generation methods on fashion illustration generation. We summarize our main contributions as follows. (1) We propose a fully automatic generative adversarial network that can adjust the body shape and transfer the style while preserving the design details of the clothing. (2) We introduce a new coFa loss that augments the design details, for example, fabric texture, of fashion illustrations. (3) We build a fashion illustration dataset, namely StyleU, with 3567 pairs of real photos and fashion illustrations. The dataset will be released to benefit the research community.
Related Work
Image-to-Image Translation
Image-to-image translation, which aims to learn a mapping that transforms images between two different domains, has recently gained much attention from computer science researchers. The task is divided into supervised and unsupervised settings. If paired data are unavailable, images can still be translated by sharing a latent space9–11 or using cycle consistency assumptions.12,13 Although previous works, for example, CycleGAN, 12 UNIT, 9 and U-GAT-IT, 14 achieved promising performance and could generate diverse results,13,15–17 they differ from our work, as we aim to translate all clothing identities, that is, colors, print, attributes, and so on. In contrast, when paired data are available, the GAN model can be trained in a supervised manner.18–20 Given a reference style image, a style-transfer network creates an output image with the same content as the input but with the style of the reference image.21–24 Our task differs from the above in that we transfer only the ROI, that is, the fashion model, to the hand-drawn style rather than applying the style to the whole input image, while also handling deformation between the source and target shapes.
Fashion Image Generation
Much research has been conducted on fashion image generation.25–28 Similar to pix2pix, 29 fashion sketches can be transformed into textured fashion items.30,31 Meanwhile, beyond texture transfer, studies have also focused on the virtual try-on task, that is, conditioned on new clothing, a desired clothing item is transferred onto the corresponding region of a person.5,32–37 Han et al. 37 adopted a coarse-to-fine strategy to match the shape of the warped clothing to the body shape of the target person, which preserves detailed texture information. Raj et al. 38 introduced a weakly supervised approach that generates training pairs from a single image via data augmentation to address the lack of paired data showing the same clothing on different bodies. Very recently, some studies have focused on fashion editing or inpainting, which target local region translation.27,28,39 Dong et al. 39 enabled editing fashion attributes such as print design, sleeve length, pant length, and so on. Hsiao et al. 27 proposed a network to minimally adjust a full-body clothing outfit to improve its fashionability. Han et al. 28 generated the missing part of fashion images. Unlike the above research, our work aims to transfer the fashion image to a hand-drawn style while adjusting the body shape of the original image to a certain ratio.
Dataset Construction
To the best of our knowledge, this is the first work focusing on fashion illustration generation. To implement our approach, we build a dataset, named StyleU, consisting of 3567 paired images. Each pair includes one real photo and its corresponding hand-drawn sketch. All real photos were collected from the “Big Four,” that is, the New York, London, Milan, and Paris fashion weeks, which feature more design attributes and details to be generated than the outfits in our daily life. The illustrations are correspondingly produced by different fashion illustrators. (Note that duplicate data are automatically removed.)
As shown in Figure 2, the images vary in many aspects. First, the styles of the illustrations differ because they were created by different illustrators, for example, the last four cases in Figure 2. Second, the backgrounds of the real photos are not clean, and the hand-drawn sketches are not always clean either. Third, most images are in front view, while the rest are in side view. To retain the original proportions of the fashion model, we pad the image with a white background instead of directly stretching it to a certain size. Finally, the size of all images is 384
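A minimal sketch of this white-padding step is shown below; the helper name and the target resolution are illustrative assumptions (only the width of 384 is stated above), not the exact preprocessing code used for the dataset.

```python
from PIL import Image

def pad_to_canvas(img, target_w=384, target_h=512, fill=(255, 255, 255)):
    """Scale the photo to fit the target canvas, then pad with white so the
    model's proportions are preserved instead of being stretched."""
    w, h = img.size
    scale = min(target_w / w, target_h / h)
    resized = img.resize((int(w * scale), int(h * scale)), Image.BICUBIC)
    canvas = Image.new("RGB", (target_w, target_h), fill)
    offset = ((target_w - resized.width) // 2, (target_h - resized.height) // 2)
    canvas.paste(resized, offset)
    return canvas
```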

Samples in the StyleU dataset.
Methodology
Given a source photo, our task is to translate it into a sketch image with a fashion illustrative style. The image from the domain of real photos is denoted as
Generator
As shown in Figure 3, we have three inputs for the generator, one source image

Flowchart for automatically generating fashion illustration.
Next, the extracted embedding features from both the source photo and the kShape maps are utilized collectively through an attention mechanism, which we define as the body-shaping attentional module. This module first produces a keypoint attention mask with pixel values ranging from 0 to 1, which indicate the importance of each keypoint element pixel; the source photo embedding and the attention mask are then combined through an element-wise product. Specifically, we build the module from a block of two sequential convolutional layers connected by a batch normalization layer and a ReLU layer between them. The down-sampled source photo embedding is fed into this convolutional block in the upper branch, and the down-sampled kShape embedding is fed into an identical convolutional block in the lower branch, producing two intermediary output embeddings. An element-wise sigmoid function is then adopted to map the pixel values of the kShape output embedding into the range of 0 to 1, forming the attention mask.
Inspired by Zhu et al., 5 we connect three body-shaping attentional modules in a chain so that the information in the embeddings is fully exploited. The two output embeddings from the upper and lower branches of one module become the inputs to the upper and lower branches of the next. It is worth noting that as the source photo embedding is updated through each module, the kShape embedding should also be updated to stay synchronized. We achieve this by concatenating the kShape embedding and the weighted source photo embedding along the depth axis at the end of each module, effectively doubling the depth of the output embedding from the lower branch. From the second module onward, we therefore adjust the lower-branch convolutional block to reduce the input feature map depth by half, so that the output kShape embedding depth of each subsequent module remains equal to that of the first one. Finally, following standard practice, the decoder generates the output image through several deconvolutional layers with the source photo embedding as input; the keypoint embedding is discarded after the final convolutional block.
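The following is a simplified PyTorch sketch of one such attentional block based on the description above; the class name, kernel sizes, and channel handling are assumptions for illustration and not the released implementation.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Two convolutions joined by batch normalization and ReLU, as described above."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1),
    )

class BodyShapingAttention(nn.Module):
    """One body-shaping attentional block (illustrative sketch)."""
    def __init__(self, channels, k_in_channels=None):
        super().__init__()
        # From the second module onward the kShape input has doubled depth,
        # so the lower branch reduces it back to `channels`.
        self.photo_branch = conv_block(channels, channels)                    # upper branch
        self.kshape_branch = conv_block(k_in_channels or channels, channels)  # lower branch

    def forward(self, photo_feat, kshape_feat):
        photo_out = self.photo_branch(photo_feat)
        kshape_out = self.kshape_branch(kshape_feat)
        mask = torch.sigmoid(kshape_out)          # keypoint attention mask in (0, 1)
        attended = photo_out * mask               # element-wise re-weighting of the photo embedding
        next_kshape = torch.cat([kshape_out, attended], dim=1)  # depth doubles for the next block
        return attended, next_kshape
```

When chaining three such blocks, the second and third would be constructed with k_in_channels equal to twice the base channel count, so that the concatenated state is reduced back before the next attention mask is computed.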
Discriminators
The previously mentioned target sketch image
As the training proceeds, we observe that a discriminator with low capacity is insufficient to differentiate real and fake images. Therefore, we build discriminators by adding three residual blocks after two down-sampling convolutions to enhance their capability.
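A sketch of such a higher-capacity discriminator is given below; the channel widths, normalization, and activation choices are assumptions, with only the overall layout (two stride-2 down-sampling convolutions followed by three residual blocks) taken from the description above.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1),
            nn.InstanceNorm2d(ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1),
            nn.InstanceNorm2d(ch),
        )

    def forward(self, x):
        return x + self.body(x)   # identity shortcut

class Discriminator(nn.Module):
    """Two down-sampling convolutions, three residual blocks, and a 1-channel prediction map."""
    def __init__(self, in_ch=3, ch=64):
        super().__init__()
        self.model = nn.Sequential(
            nn.Conv2d(in_ch, ch, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ch, ch * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            ResBlock(ch * 2), ResBlock(ch * 2), ResBlock(ch * 2),
            nn.Conv2d(ch * 2, 1, 3, padding=1),
        )

    def forward(self, x):
        return self.model(x)
```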
Training Procedure
To realize the proposed architecture for the fashion illustration generation task, as shown in Figure 3, we define the objective of a conditional GAN as follows:
Then, the
where G aims to minimize this objective and D tries to maximize it, that is,
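In a standard conditional GAN formulation (written here in generic notation that may differ from the symbols used above), this min-max objective reads:

```latex
\mathcal{L}_{cGAN}(G, D) =
  \mathbb{E}_{x, y}\big[\log D(x, y)\big] +
  \mathbb{E}_{x}\big[\log\big(1 - D\big(x, G(x)\big)\big)\big],
\qquad
G^{*} = \arg\min_{G}\max_{D}\ \mathcal{L}_{cGAN}(G, D).
```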
To ensure that the texture rendering of the fabric is as similar to the source image as possible and to make the appearance of the generated image more visually natural, we adopt the perceptual loss, 40 denoted as
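A common VGG-based formulation of the perceptual loss is sketched below; the backbone, the chosen cut-off layer (here up to relu4_1), and the L1 distance are assumptions for illustration and may differ from the exact configuration used in the paper.

```python
import torch.nn as nn
import torchvision.models as models

class PerceptualLoss(nn.Module):
    """L1 distance between VGG-19 feature maps of the generated and target images."""
    def __init__(self, cut=21):  # the first 21 layers of VGG-19 features end at relu4_1
        super().__init__()
        vgg = models.vgg19(pretrained=True).features[:cut].eval()
        for p in vgg.parameters():
            p.requires_grad = False   # the feature extractor stays frozen
        self.vgg = vgg
        self.criterion = nn.L1Loss()

    def forward(self, generated, target):
        return self.criterion(self.vgg(generated), self.vgg(target))
```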
CoFa Loss
To deal with unaligned data pairs, 41 we introduce the coFa loss to further improve the quality of the generated images. The original contextual loss (CX) is defined as:
where
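For completeness, the standard form of the contextual loss from Mechrez et al. 41 can be sketched as follows, in generic notation that may differ from the symbols defined above. Given feature sets of the two images with pairwise cosine distances d_ij:

```latex
\tilde{d}_{ij} = \frac{d_{ij}}{\min_{k} d_{ik} + \epsilon}, \qquad
w_{ij} = \exp\!\left(\frac{1 - \tilde{d}_{ij}}{h}\right), \qquad
\mathrm{CX}_{ij} = \frac{w_{ij}}{\sum_{k} w_{ik}},

\mathrm{CX}(x, y) = \frac{1}{N}\sum_{j}\max_{i}\,\mathrm{CX}_{ij}, \qquad
\mathcal{L}_{\mathrm{CX}} = -\log\big(\mathrm{CX}(x, y)\big),
```

where h is a bandwidth parameter and N is the number of features. This aggregates normalized pairwise feature similarities without requiring pixel-aligned supervision.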
More specifically, the generated fashion image should satisfy the following. (1) The overall silhouette should be the same at the semantic level, that is, a V-line neckline should remain a V-line neckline, although it need not be identical at the pixel level. (2) The texture of the fabric needs to be well rendered on the ROI. (3) The sketch lines should be natural but not too solid. Based on these requirements, we improve the quality of the generated image from the perspective of image augmentation, that is, preserving the edges of the clothing while enhancing the design details. Inspired by the bilateral filter, we integrate the spatial pixel coordinates and pixel-level RGB information into the image features. To further enhance the details, the Laplacian pyramid is adopted to decompose images into multiple scales. 8
For each output pyramid coefficient
where
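A minimal sketch of the multi-scale decomposition referred to above is given below, using OpenCV's pyramid operators; the function names and the number of levels are illustrative, and this is not the exact filtering pipeline of the paper.

```python
import cv2
import numpy as np

def build_laplacian_pyramid(img, levels=4):
    """Decompose an image into band-pass detail layers plus a low-frequency residual."""
    pyramid, current = [], img.astype(np.float32)
    for _ in range(levels):
        down = cv2.pyrDown(current)
        up = cv2.pyrUp(down, dstsize=(current.shape[1], current.shape[0]))
        pyramid.append(current - up)   # detail coefficients at this scale
        current = down
    pyramid.append(current)            # coarsest Gaussian level (residual)
    return pyramid

def collapse_laplacian_pyramid(pyramid):
    """Reconstruct the image by upsampling and adding the detail layers back in."""
    current = pyramid[-1]
    for band in reversed(pyramid[:-1]):
        current = cv2.pyrUp(current, dstsize=(band.shape[1], band.shape[0])) + band
    return current
```

Filtering or boosting individual detail layers before collapsing the pyramid is what allows clothing edges to be preserved while fine fabric texture is enhanced.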
Experiments
In this section, we conduct experiments to demonstrate the effectiveness of the proposed approach. We first describe the baseline model. Then, we show the visual results of our method compared with the state-of-the-art methods and present the analysis. Meanwhile, we demonstrate the importance of each part in our framework with the ablation study. Finally, we discuss some failure cases of the proposed method. We used images from the StyleU dataset to conduct all the experiments in this work. As described in Section 3, the dataset consists of 3567 paired images, that is, real photos and fashion illustrations. Each image is 384
Implementation Details
We use ADAM optimizers 43 in the PyTorch framework with
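As a sketch of this setup, the snippet below creates ADAM optimizers for a generator and a discriminator in PyTorch; the placeholder networks and the hyperparameter values (common GAN defaults) are assumptions, not the settings reported in the paper.

```python
import torch
import torch.nn as nn

# Placeholder networks standing in for the StylishGAN generator and discriminators.
generator = nn.Conv2d(3, 3, 3, padding=1)
discriminator = nn.Conv2d(3, 1, 3, padding=1)

# Typical GAN defaults; the exact learning rate and betas may differ from the paper's.
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))
```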
Baseline Model
We compared our method with various models from both image-to-image translation and fashion generation tasks. The mainstream frameworks, including pix2pix, 29 CycleGAN, 12 Unsupervised Image-to-Image Translation (UNIT), 9 and Unsupervised Generative Attentional Networks (U-GAT-IT), 14 are adopted. Meanwhile, in terms of fashion generation, we follow the flowchart of CP-VTON. 34 (Note that we train the baselines using their official implementations.)
Qualitative Analysis
The visual results of our method and the baselines for style transfer are shown in Figure 4. Among the first four baseline image-to-image translation networks, pix2pix produces over-stylized images, while the body shape of the fashion model remains consistent with the original photo. CycleGAN and U-GAT-IT reflect the silhouette of the clothing and the body shape of the fashion model to some extent; however, they do not retain the color and texture of the clothing well. UNIT generates more photo-like images with only minor changes toward the stylish sketch. In addition, we also consider the mainstream fashion generation frameworks for our task.32–34,37 The inputs in the VTON line of work include a model image in which a person wears a garment, a target clothing image, a body shape image, a body keypoint image, and a model face image, that is, more inputs than ours. We present the results of CP-VTON 34 simply to show how the current fashion generation flowchart performs on the illustration generation task (note that during the implementation, we reduce the global input image channel depth to accommodate the difference in inputs).

Overall, unlike StylishGAN, the above baselines either fail to adjust the body shape or fail to faithfully render the design details of the clothing. As shown in the last column of Figure 4, the images generated by our approach are highly consistent with the source image in terms of clothing identities, that is, color, silhouette, neckline design, print design, sleeve design, and so on. Meanwhile, the body shape of the generated fashion model is also adjusted to a certain ratio, similar to the ground truth in the second column of Figure 4. Moreover, we can see that the beading and tassel design on the surface of the dress is also well rendered by StylishGAN.
Quantitative Analysis
Metrics
It remains a challenge to effectively evaluate the appearance and shape consistency of the generated images. Following general practice, we adopt the inception score (IS) 45 and structural similarity (SSIM) 46 to assess the quality of the generated images from these two perspectives. However, as discussed in the study by Szegedy et al., 47 the IS metric, which is based on the entropy computed over the classification neurons of an external classifier, is not very suitable in our case. Thus, we further conduct a perceptual study.
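As an illustration of how the SSIM term can be computed with scikit-image, the snippet below compares two placeholder grayscale arrays; in practice the generated sketch and its ground truth would be loaded from the test set.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

# Random placeholders standing in for a generated sketch and its ground truth.
generated = np.random.rand(512, 384)
ground_truth = np.random.rand(512, 384)

score = ssim(generated, ground_truth, data_range=1.0)
print(f"SSIM: {score:.4f}")
```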
Perceptual Study
We also conduct a perceptual study from the perspective of fashion. As introduced before, the evaluation standard of fashion illustration focuses on three parts: (1) body-shape adjustment with a certain ratio to achieve whole-image balance, (2) consistency of clothing identities, and (3) quality of texture rendering. We thus propose three indexes for fashion illustrations:
Taking the outfit in the first row of Figure 4 as an example, the design attributes include deep blue (color), plain (print design), A-line overall with a mermaid bottom (silhouette), sleeveless (sleeve length), and full-length (bottom length). Our generated image has the same value in each attribute dimension, and thus the design consistency score of this outfit is 10. To compute the design consistency score automatically, one could adopt a model for design attribute recognition.48,49 However, such recognition belongs to a different domain and the score would be highly affected by its accuracy, which is beyond the scope of this work. We thus calculate the design consistency score manually (the labeled clothing identities of the test set will be released with the StyleU dataset). In addition, the remaining two parts are highly subjective and professional, and are hard to describe in a set of formulations. We therefore performed a human perceptual study to evaluate these two parts from the human point of view. Specifically, we invited five experts majoring in fashion to judge the quality of the generated images. The total score is set to 10. To decrease the bias of different judges, we use the trimmed mean, that is, the final score is the average of the three middle scores (excluding the highest and lowest ones). Thus, for
where
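As a small worked example of the trimmed mean described above, the helper below averages the three middle scores out of five judges; the function name and the sample scores are illustrative.

```python
def trimmed_mean(scores):
    """Drop the highest and lowest of five judge scores and average the rest."""
    assert len(scores) == 5
    middle = sorted(scores)[1:-1]          # keep the three middle scores
    return sum(middle) / len(middle)

print(trimmed_mean([6, 9, 7, 10, 8]))      # -> 8.0
```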
Quantitative comparisons are summarized in Table 1, and we make the following observations. (1) Regarding the IS, which indicates image quality, that is, whether the image looks like a specific object, StylishGAN obtains a relatively higher score than the other baselines, demonstrating that StylishGAN achieves better performance on object consistency. (2) In terms of SSIM, which is adopted to measure the shape similarity between the source image and the generated image, StylishGAN achieves the highest similarity. (3) For the remaining indexes, which evaluate the generated images from the fashion perspective, StylishGAN consistently outperforms the baselines. In a word, all results indicate the effectiveness of the proposed approach.
UNIT: unsupervised image-to-image translation; U-GAT-IT: unsupervised generative attentional networks; IS: inception score; SSIM: structural similarity.
All networks are trained on the whole training set of the StyleU dataset, and the results are evaluated on our test set. The metrics include IS, SSIM,
The bold values are the highest ones in certain evaluation indexes.
Ablation Study
To analyze the proposed method in detail, we conducted the ablation study from three aspects: (1) We demonstrated the importance of the coFa loss by the results of removing
Efficacy of the coFa Loss
As shown in Figure 5, the coFa loss can effectively augment the details of the clothing items and thus improve the visual quality of the generated images. Meanwhile, comparing the last two columns of Figure 5, the sketch lines become more solid as the weight of the coFa loss increases. Considering the fluency of the generated fabric, we do not recommend setting the weight of the coFa loss too high in the fashion generation task (

Ablation study: (a) source images, (b) ground truth, (c) without
Function of Each Module
We present comparison results with different parts of our framework removed one by one in Figure 6. It can be found that without

Ablation study: (a) source images, (b) our results, (c) without
Adaptive Weights
Meanwhile, we study the effect of the loss weight under different conditions. The visual results are shown in Figure 7. We can see that when the loss weights are changed, the generated images have some differences. For example, with the decrease of

Ablation study: (a) source images, (b) ground truth, results with different loss weight where
Discussion
Figure 8 presents failure cases of our method. On the one hand, as demonstrated above, the proposed framework achieves good performance in most cases and is expected to handle the common clothing in our daily life. On the other hand, as stated in the section “Dataset Construction,” the clothing in the StyleU dataset is ready-to-wear with more design attributes, and we find that the proposed method cannot deal with some specific situations. Take the samples in the first two columns of Figure 8, for example. Both garments are made from transparent gauze. Although the generated images are consistent with the source images in color, the overall silhouette fails to reproduce some design details. Specifically, the Peter Pan collar of the first garment is changed to a round neckline, and the round neckline of the second garment is replaced by an off-shoulder design. Another case is that when the fabric is textured and presents a lightweight feeling, the low saturation of the clothing's color leads to poor quality in the generated image. For example, for the white lace dress in the fourth column of Figure 8, the texture of the white dress is not rendered in the generated image. In addition, as shown in the last two columns of Figure 8, fur clothing is also a difficult category for texture rendering. We conclude that, in terms of texture rendering in fashion, (1) clothing made from light and transparent fabric is more difficult than solid fabric; (2) the saturation of the color affects the texture rendering: the lower the saturation of the clothing's color, the worse the quality of the generated image; and (3) a complicated fabric surface, for example, fur, increases the difficulty of texture rendering.

Failure cases of our method.
Advantage and Disadvantage
Fashion illustration is the art of communicating fashion ideas in a visual form. As demonstrated in the previous sections, StylishGAN has clear advantages in helping fashion labels or clothing firms generate fashion illustrations quickly at a lower cost, which can shorten the design process while saving the expense of hiring fashion illustrators. However, as discussed above, StylishGAN has limitations in some specific situations, such as transparent fabrics, low color saturation, complicated textures such as fur, and so on.
Conclusion
We proposed StylishGAN for fashion illustration generation. The body-shaping attentional module was adopted to adjust the body shape of the fashion model and obtain the texture information. A new coFa loss was introduced to improve the fabric texture rendering. To implement our task, we presented a fashion illustration dataset. Through the experiments, we demonstrated the effectiveness of the proposed method. Meanwhile, the newly introduced StyleU dataset, with paired images covering two domains, can facilitate the task of pixel-level image translation.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work is supported by Laboratory for Artificial Intelligence in Design (Project Code: RP3-1) under InnoHK Research Clusters, Hong Kong.
