Abstract
Style transfer between images is a research direction that has gained considerable attention in the field of image generation. CycleGAN is widely used because it does not require paired image data for training, which greatly reduces the cost of data collection. In 2018, a new model structure based on CycleGAN, InstaGAN, was proposed and applied to style transfer on a specific part of an image, called an instance. Since then, style transfer has been able to transform individual instances within an image. In the present study, building on CycleGAN and InstaGAN, we transformed images between different domains by combining style transfer with shape context and thin plate splines (TPS). Based on generative adversarial networks (GANs), we also designed a fusion network to refine the results. Applying this combination of style transfer and TPS to fashion, our experiments produced convincing results, along with a fusion network of good performance.
Introduction
With the rapid development of the social economy, people are paying more and more attention to individualized needs. In fashion, if online personalized try-on were easy to achieve, ordinary consumers would save much of the time spent choosing clothes in stores, and shopping by swiping a finger on a smartphone would become far more engaging. Deep learning techniques make online try-on possible; style transfer in particular can already support it. For example, if a person wants to try on a long-sleeve garment online (perhaps on a smartphone), but the database contains only short-sleeve images, then through the style transfer and thin plate splines (TPS) used in this study, the person can still see a specific long-sleeve image.
To achieve this goal, this work uses and modifies the framework of CycleGAN 1 with reference to the framework and design ideas of InstaGAN 2 . Fig. 1 shows our network structure. It resembles CycleGAN at first glance, but we combined it with a TPS module, so it is more complicated. The input data are (X, a) and (Y, b), where X and Y represent the images, and a and b are the masks corresponding to the images. Instance normalization 3 has been added to both the generator and the discriminator, and spectral normalization 5 has been added to the discriminator, which follows PatchGAN 17 . Both of these measures improved performance in our experiments. The basic GAN loss is the same as in CycleGAN; for more stable training, in the spirit of WGAN, the loss function of LSGAN is used. The reconstruction loss uses the L1 norm, and an identity-mapping loss is also used. Since mask information is added, a loss function to handle masks is needed; this study uses the context loss function for that purpose. The details of the loss functions are introduced in the following sections. The training dataset for style transfer between trousers and shorts comes from ATR 6 : of its more than 17,000 images, we processed 6000 images as the dataset of domain X and 1000 images as the dataset of domain Y. The dataset for swapping T-shirts and jackets was crawled from the Zalando website, with the corresponding annotations acquired using an FCN model pretrained on the ATR dataset.

Synopsis of the network structure.
Shape context 7 can transfer objects into specific shapes. The idea of shape context is to change the shape between two images by finding feature point sets, and to determine whether two objects are similar by calculating a matching cost. The objects handled in this study are the output of the GAN generator, a, and the related mask, b. First, an edge detector is used to detect the edges of the two binary masks; then the centroid is computed to establish a log-polar coordinate system; and finally the shape histogram of each matching point is obtained.
The shape histogram of each pair can be obtained by the above method. To judge the similarity between the feature points of a and b, it is necessary to calculate their matching cost. This work uses TPS 7 to minimize the bending energy and thereby acquire the mapping relations between the point sets and the matching cost matrix.
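To make the matching step concrete, the following is a minimal NumPy sketch of the log-polar shape histograms and a χ² matching cost as described above; the bin counts, radial range, and normalization are our assumptions rather than the exact settings of this study.

```python
import numpy as np

def shape_context(points, n_r=5, n_theta=12):
    """Log-polar shape histograms, one per point (a minimal sketch).

    points: (n, 2) array of contour points.
    Returns an (n, n_r * n_theta) array of normalized histograms.
    """
    n = len(points)
    diff = points[None, :, :] - points[:, None, :]   # pairwise offsets
    dist = np.linalg.norm(diff, axis=-1)
    theta = np.arctan2(diff[..., 1], diff[..., 0])   # angles in (-pi, pi]

    # Log-radial bin edges, normalized by the mean pairwise distance (assumed range).
    mean_d = dist[dist > 0].mean()
    r_edges = np.logspace(np.log10(0.125), np.log10(2.0), n_r + 1) * mean_d
    histos = np.zeros((n, n_r, n_theta))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            r_bin = np.searchsorted(r_edges, dist[i, j]) - 1
            if r_bin < 0 or r_bin >= n_r:
                continue  # point falls outside the log-polar diagram
            t_bin = int((theta[i, j] + np.pi) / (2 * np.pi) * n_theta) % n_theta
            histos[i, r_bin, t_bin] += 1
    histos = histos.reshape(n, -1)
    return histos / np.maximum(histos.sum(axis=1, keepdims=True), 1)

def matching_cost(h1, h2, eps=1e-9):
    """Chi-squared cost matrix between two sets of shape histograms."""
    num = (h1[:, None, :] - h2[None, :, :]) ** 2
    den = h1[:, None, :] + h2[None, :, :] + eps
    return 0.5 * (num / den).sum(axis=-1)
```

In the full pipeline, the resulting cost matrix drives the point correspondence that the TPS transformation is then fitted to.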
The organization of the paper is as follows. We briefly introduce the GAN family and other work related to image-to-image translation in fashion. Then, we introduce the details of the network, including the loss functions and the specific network structure. Next, experiments on two different datasets are performed and the results analyzed. Finally, we present the conclusion and discuss future work.
Related Work
GANs
Since Goodfellow proposed the generative adversarial network (GAN) 4 in 2014, the idea of two networks competing and progressing together has become a mainstream approach in the field of image generation. For example, CGAN 8 adds conditional information to the original GAN structure, so that the output of the network can be supervised. DCGAN 9 combines GANs with convolutional networks and alleviates the problem of unstable GAN training. WGAN, 10 WGAN-GP, 11 LSGAN, 12 and other methods make network training more stable and faster by changing the GAN loss function; their results are also greatly improved compared with the original GAN, effectively mitigating the problem of mode collapse. In image super-resolution, SRGAN 13 and ESRGAN 14 are able to generate realistic textures in the single-image super-resolution task, while TecoGAN 15 proposes a spatio-temporal discriminator to obtain more realistic and coherent video super-resolution. Nowadays, when the amount of data is insufficient, researchers also consider whether a GAN can be used to generate some “real” data.
Image Style Transfer
In the field of style transfer, pix2pix 16,17 needs paired data to train the network, but collecting such paired data raises many problems. CycleGAN was proposed to solve this problem. CycleGAN is composed of two mirror-symmetric GANs that share two generators, each paired with its own discriminator: two discriminators and two generators in all. The model takes an image from domain X and passes it to the first generator, G_XY, whose task is to convert the given image from domain X into an image X′ in the target domain Y. This output is passed to D_Y, a discriminator that judges whether X′ is real or not. X′ is also passed to the other generator, G_YX, whose task is to convert it back to an image resembling the original input; this requirement defines meaningful mappings that the unpaired dataset alone does not provide. The other half of CycleGAN is analogous in the opposite direction.
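For readers unfamiliar with the cycle, the following is a schematic PyTorch sketch of one CycleGAN direction; the generator and discriminator modules and the weight lambda_cyc are placeholders, not the exact configuration of this study.

```python
import torch
import torch.nn.functional as F

def cycle_x_to_y(x, G_XY, G_YX, D_Y, lambda_cyc=10.0):
    """One direction of the CycleGAN cycle (schematic sketch).

    x    -- a batch of images from domain X
    G_XY -- generator mapping X to Y; G_YX maps back to X
    D_Y  -- discriminator for domain Y
    """
    x_fake = G_XY(x)          # X': x translated into domain Y
    x_rec = G_YX(x_fake)      # round trip back into domain X

    # LSGAN adversarial term: the generator wants D_Y(X') scored as real (1).
    score = D_Y(x_fake)
    adv = F.mse_loss(score, torch.ones_like(score))

    # Cycle-consistency term: the round trip must reproduce the input.
    cyc = F.l1_loss(x_rec, x)
    return adv + lambda_cyc * cyc
```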
Based on CycleGAN, InstaGAN additionally provides the mask information corresponding to the images, and uses it to control the output of the network. This effectively realizes style transfer on specific pixels only, while keeping the others invariant. InstaGAN also proposes a sequential mini-batch training method: when there are multiple targets in a picture that need to be converted, the transfer is performed in batches instead of all at once. This method makes it possible to train on datasets with a large number of instances per image using limited GPU resources.
Online Try-On Methods
FashionGAN 19 needs two steps. The first step generates a mask: the mask of the original image is extracted by a fully convolutional network (FCN) 18 and downsampled, and the description of the image is encoded as conditional information input into a CGAN, which generates the desired mask. The second step generates the texture: another CGAN takes the result of the previous network together with the related encoding and finally outputs the image after replacement. VITON 20 also performs virtual dressing in two steps; it can replace the clothing seamlessly, and the texture features of the clothing are well reflected after dressing. Similarly, CP-VTON 21 converts the target garment into a shape suitable for the target person through a geometric matching module (GMM), and then integrates and renders the deformed garment with the person through a Try-On module.
Our work consists of two stages. The first stage uses image style transfer to obtain the desired masks. Then, TPS is used to replace the specific part. The part for replacement can be a specific piece of clothing, so that online try-on can be achieved.
Research Methods
Network Structure and Style Transfer
To perform style transfer between unpaired images, following CycleGAN, two generators and two discriminators are required. The generator (Figs. 2 and 3) consists of three parts: encoding, transformation, and decoding. For encoding, the features of the input image are extracted using 3 convolutional layers. In the experiment, the input image was cropped and resized to [300 (height), 200 (width)] px. Transformation converts a feature vector of an image in domain A into a feature vector in domain B by combining the dissimilar features of the image. For this, we used a stack of 6 ResNet 23 blocks; each block is a neural network layer composed of two convolutional layers, which retains the original image features as far as possible while transforming. For decoding, deconvolution layers restore low-level features from the feature vector, finally producing the generated image. The discriminator (Fig. 3) takes an image as input and attempts to predict whether it is an original image or an output of the generator. The discriminator itself is a convolutional network: it first extracts features from the image, and then adds a convolutional layer that produces a vector used to determine the category.

Generator and discriminator structure used for style transfer.

More specific generator and discriminator structure.
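The following is a condensed PyTorch sketch of the encode-transform-decode generator and the spectrally normalized, PatchGAN-style discriminator described above. The channel widths and kernel sizes are our assumptions; the 3 encoding convolutions, 6 ResNet blocks, deconvolution decoding, and IN/SN placement follow the text.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

class ResnetBlock(nn.Module):
    """Two conv layers with a residual connection, as used in the transformation stage."""
    def __init__(self, ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch),
        )
    def forward(self, x):
        return x + self.block(x)

def make_generator(in_ch=3, ngf=64, n_blocks=6):
    # Encoder: 3 convolutions (widths ngf, 2*ngf, 4*ngf are assumptions).
    layers = [
        nn.Conv2d(in_ch, ngf, 7, padding=3), nn.InstanceNorm2d(ngf), nn.ReLU(True),
        nn.Conv2d(ngf, ngf * 2, 3, stride=2, padding=1), nn.InstanceNorm2d(ngf * 2), nn.ReLU(True),
        nn.Conv2d(ngf * 2, ngf * 4, 3, stride=2, padding=1), nn.InstanceNorm2d(ngf * 4), nn.ReLU(True),
    ]
    # Transformation: 6 ResNet blocks.
    layers += [ResnetBlock(ngf * 4) for _ in range(n_blocks)]
    # Decoder: deconvolutions back to the input resolution.
    layers += [
        nn.ConvTranspose2d(ngf * 4, ngf * 2, 3, stride=2, padding=1, output_padding=1),
        nn.InstanceNorm2d(ngf * 2), nn.ReLU(True),
        nn.ConvTranspose2d(ngf * 2, ngf, 3, stride=2, padding=1, output_padding=1),
        nn.InstanceNorm2d(ngf), nn.ReLU(True),
        nn.Conv2d(ngf, in_ch, 7, padding=3), nn.Tanh(),
    ]
    return nn.Sequential(*layers)

def make_discriminator(in_ch=3, ndf=64):
    # PatchGAN-style discriminator with spectral normalization on each convolution.
    def sn_conv(ci, co, stride):
        return spectral_norm(nn.Conv2d(ci, co, 4, stride=stride, padding=1))
    return nn.Sequential(
        sn_conv(in_ch, ndf, 2), nn.LeakyReLU(0.2, True),
        sn_conv(ndf, ndf * 2, 2), nn.InstanceNorm2d(ndf * 2), nn.LeakyReLU(0.2, True),
        sn_conv(ndf * 2, ndf * 4, 2), nn.InstanceNorm2d(ndf * 4), nn.LeakyReLU(0.2, True),
        sn_conv(ndf * 4, 1, 1),  # per-patch real/fake score map
    )
```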
Loss Function
Because the loss function of LSGAN performs stably, we replaced the loss function of the original GAN with that of LSGAN (as in CycleGAN), as shown in Eq. 1.
D_A and D_B are two different discriminators; similarly, G_B2A and G_A2B are two different generators. The reconstruction loss L_cyc (Eq. 2) keeps the image invariant after passing through both generators. The identity mapping loss L_idt (Eq. 3) ensures that the image is invariant when the input image already comes from the target domain.
To transfer only a specific part of the image (called the instance in InstaGAN), while ensuring that the background information changes as little as possible, InstaGAN proposes a context loss L_ctx. First, the weight matrix between a and b′, denoted ω(a, b′), is calculated, where a and b′ take values in {0, 1}: 0 represents the background and 1 represents the instance. The weight matrix is calculated according to Eq. 4.
When a = 0 and b′ = 0, ω(a, b′) = 1; in all other cases ω(a, b′) = 0. This means that the loss is computed only where both masks represent the background, so that the background can remain as constant as possible. The loss function is given in Eq. 5.
Here, the symbol ⊙ represents element-wise multiplication.
Finally, we add all the losses together with weights to obtain the total loss function (Eq. 6).
The values of λ_cyc, λ_idt, and λ_ctx follow the settings in InstaGAN, which we further fine-tuned. In the shirt-replacement experiment they were set to 8, 8, and 8; in the pants-replacement experiment they were set to 10, 10, and 10. No further parameter sensitivity analysis was performed.
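As a concrete reading of Eqs. 4–6, the following PyTorch sketch computes the background-preserving context loss and the weighted total; the tensor shapes and the use of an L1 distance over the masked region are our assumptions.

```python
import torch.nn.functional as F

def context_loss(x, x_trans, a, b_fake):
    """Background-preserving context loss (our reading of Eqs. 4 and 5).

    x, x_trans -- original and translated images, shape (N, 3, H, W)
    a, b_fake  -- binary masks in {0, 1}; 1 marks the instance, 0 the background
    """
    # omega(a, b') = 1 only where BOTH masks are background (Eq. 4).
    w = (1.0 - a) * (1.0 - b_fake)
    # L1 difference restricted (element-wise, the "⊙" of the text) to shared background.
    return F.l1_loss(w * x, w * x_trans)

def total_loss(l_gan, l_cyc, l_idt, l_ctx, lam_cyc=10.0, lam_idt=10.0, lam_ctx=10.0):
    """Weighted sum of all terms (Eq. 6); lambdas of 8 or 10 as stated in the text."""
    return l_gan + lam_cyc * l_cyc + lam_idt * l_idt + lam_ctx * l_ctx
```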
Warping a Clothing Image
TPS warping consists of two stages (Fig. 4). Given the target clothing image and a clothing mask generated by the network, we used shape context matching to estimate the TPS transformation and generate the warped clothing. First, we used the two masks (one from the blue pants and one generated by the network) to produce a set of control points, which can also be used on the original image, just like the blue pants in Fig. 4. Second, the TPS transformation program changes the shape of the pants according to this control point set.

TPS transfer process to generate a warped clothing image.
Shape context is a shape-matching method, so the first step is to obtain the edge of the mask, which requires edge detection. Since the Canny edge detection operator has good noise resistance and can accurately locate feature points, it was chosen as the edge detection algorithm in this study. Next, we extracted the boundary contour point set p = {p_1, p_2, p_3, …, p_n}. In the experiment, n was set to 200, the number of contour points extracted, and the centroid was calculated from those points. Then, we established a polar coordinate system to calculate a shape histogram and the matching cost. This study used the TPS interpolation technique to calculate the matching cost between two sets of points. In more detail, point set A had 200 points, representing the edge of the clothes to be replaced, and point set B also had 200 points, representing the contour extracted from a mask map generated by the network. The deformation is done by the TPS transformation, which ensures that the 200 points are matched correctly. TPS finds a smooth surface with the smallest bending energy through all control points, and is often used for non-rigid shape transformations. The shape context algorithm yields a one-to-one correspondence between point sets when performing point set matching; therefore, TPS can be used to minimize the bending energy and solve for the mapping parameters and matching matrix between the point sets. In the experiment, the number of TPS control points was 20, which ensured accuracy while keeping the algorithm reasonably fast.
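The warping step itself can be sketched with OpenCV's shape module (available in opencv-contrib-python), as below. The Canny thresholds are assumptions, the identity point correspondence stands in for the shape-context matching, and the argument order follows OpenCV's ShapeTransformer convention.

```python
import cv2
import numpy as np

def extract_contour_points(mask, n_points=200):
    """Canny edges of a binary mask, subsampled to n points (n = 200 in the text).

    Note: np.nonzero yields unordered edge pixels; a contour tracer would give
    ordered points, which shape-context matching then puts into correspondence.
    """
    edges = cv2.Canny(mask, 50, 150)          # thresholds are assumptions
    ys, xs = np.nonzero(edges)
    idx = np.linspace(0, len(xs) - 1, n_points).astype(int)
    return np.stack([xs[idx], ys[idx]], axis=1).astype(np.float32)

def tps_warp(clothing_img, src_mask, dst_mask, n_ctrl=20):
    """Warp a standard clothing image toward the generated mask shape (20 control points)."""
    src = extract_contour_points(src_mask, n_ctrl)[None]  # shape (1, n, 2)
    dst = extract_contour_points(dst_mask, n_ctrl)[None]
    # Identity matches stand in for the shape-context correspondence.
    matches = [cv2.DMatch(i, i, 0) for i in range(n_ctrl)]
    tps = cv2.createThinPlateSplineShapeTransformer()
    tps.estimateTransformation(dst, src, matches)  # target shape first, per OpenCV usage
    return tps.warpImage(clothing_img)
```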
Experiments
ATR Dataset
The ATR dataset has 17,706 images in all, showing women in a variety of outfits against diverse backgrounds. We show some samples in Fig. 5. There are 17 label categories, with pixel category values ranging from 1 to 17.

Example images from the ATR dataset.
We picked out about 1000 images of shorts or short skirts and about 6000 images of trousers, together with their corresponding masks, as the two domains. These domains were marked as (X, a) and (Y, b), respectively, where X and Y represent the real images and a and b represent the corresponding binary masks.
Zalando Dataset
We collected the pictures shown in Fig. 6 from the Zalando website, but the network needs not only image data but also the corresponding masks. Therefore, we first trained an FCN on the ATR dataset, and then treated the inference output of the FCN as the mask information for the corresponding image. Specifically, we used the FCN-8s structure (Fig. 7) and adapted the label attributes of ATR (the original ATR dataset has 17 label categories). Since this research only needed to distinguish the upper outer garment from the background, we merged all labels other than "upper" into the background.

Example masks generated for the Zalando dataset.

FCN-8s network structure.
The FCN input and output size was [128, 128]. The resulting Zalando training set contained about 1000 T-shirts and about 1000 jackets, each with masks. In addition, standard clothing images for TPS transfer were also available on the Zalando website.
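A small sketch of the label merge follows, assuming a hypothetical label id UPPER_ID for the upper outer garment; the actual id is defined by the ATR annotation scheme.

```python
import numpy as np

# Hypothetical id of the "upper clothes" category in the 17-label ATR scheme;
# the real value depends on the ATR annotation definition.
UPPER_ID = 4

def merge_labels(label_map):
    """Collapse the 17 ATR categories into a binary upper-garment mask."""
    return (label_map == UPPER_ID).astype(np.uint8)
```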
TPS Transfer to Get Deformed Clothing
After the style transfer, we used the generated mask of the pants (Fig. 8, a2 and b2) together with the image of the standard pants and its mask (Fig. 6) to perform TPS transfer and obtain the deformed pants image (Fig. 8, a1 and b1). After that, the pants in the original picture can be replaced with the new ones. From the experiments, we found that if the original mask is relatively complete, that is, the pants in the original picture are not heavily occluded, the result after TPS is good; but if the pants in the original picture are incomplete, TPS cannot successfully match the contour points between them, and the performance becomes poor. When the TPS algorithm cannot find enough matching points, the conversion fails.

Example of TPS transfer results.
Online Try-On
We evaluated the performance of the model on the ATR dataset and compared it with CycleGAN and InstaGAN. As seen in Figs. 9 and 10, CycleGAN can also change the style, but it also changes the background information, and its transfer quality is mediocre. On this dataset, InstaGAN's style transfer is significantly better than CycleGAN's: it keeps the background information as invariant as possible, and the transformed image is realistic. However, style transfer alone cannot specify the color or texture of the converted pants, nor replace a specific garment. Our results, built on InstaGAN, can replace a specific pair of pants at will, as long as a standard picture of the replacement pants is available. As shown in Fig. 9, we randomly selected 3 pairs of trousers of different colors, and 3 pairs of shorts or skirts of different colors, to be replaced. Since TPS cannot produce a deformation identical to the mask, some local mismatch occurs during the replacement process. We simply calculated the average RGB pixel value of the pants to fill these areas. The consequence is that the realism of the picture is noticeably reduced, and the replaced part does not truly blend with the surrounding environment in terms of hue and brightness.
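The mean-fill itself is straightforward; the following sketch shows one way to do it, where hole_mask (our naming) marks the pixels left uncovered after the TPS warp.

```python
import numpy as np

def mean_fill(image, pants_mask, hole_mask):
    """Fill uncovered pixels with the average RGB of the warped pants.

    image      -- (H, W, 3) composited image
    pants_mask -- binary mask of the warped pants region
    hole_mask  -- binary mask of pixels left uncovered after warping (our naming)
    """
    mean_rgb = image[pants_mask > 0].mean(axis=0)
    out = image.copy()
    out[hole_mask > 0] = mean_rgb.astype(image.dtype)
    return out
```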

Pants and shorts transfer results. "Original" is the original picture, the subsequent columns are the results of CycleGAN and InstaGAN, and the last three columns are the replacement results with three different-colored pairs of pants randomly taken from the dataset.

T-shirt and jacket transfer results. "Original" is the original image, the subsequent columns are the results generated by CycleGAN and InstaGAN, and the last three columns are the replacement results with three different-colored upper garments randomly taken from the dataset.
The code in this study runs on PyTorch 0.4. The input image was resized and cropped, and the final input to the convolutional network was [300 (h), 200 (w)]; the output size was the same as the input. Instance normalization (IN) was added to both the generator and the discriminator, and spectral normalization (SN) was added to the discriminator, which gave a significant improvement in the experimental results. We trained on a single GPU, with the batch size for style transfer fixed at 1. For the other hyper-parameters, λ_cyc, λ_idt, and λ_ctx were each set to 8 on the jacket dataset and each set to 10 on the pants dataset. The training optimizer was Adam, 22 with a learning rate of 0.0002 for the generator and 0.0001 for the discriminator; the Adam parameters were set to β1 = 0.5 and β2 = 0.999. The learning rate decayed linearly during the first 10 epochs and then remained unchanged.
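The following sketch shows this training setup in PyTorch; G and D are stand-ins for the actual networks, and the decay floor of the schedule is an assumption, since the text only states that the rate decays linearly for 10 epochs and then stays constant.

```python
import torch
import torch.nn as nn

# Stand-in modules; in the real pipeline these are the style-transfer networks.
G = nn.Conv2d(3, 3, 1)
D = nn.Conv2d(3, 1, 1)

opt_G = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(D.parameters(), lr=1e-4, betas=(0.5, 0.999))

def lr_lambda(epoch, decay_epochs=10, floor=0.1):
    """Linear decay over the first 10 epochs, then constant (floor is an assumption)."""
    if epoch < decay_epochs:
        return 1.0 - (1.0 - floor) * epoch / decay_epochs
    return floor

sched_G = torch.optim.lr_scheduler.LambdaLR(opt_G, lr_lambda)
sched_D = torch.optim.lr_scheduler.LambdaLR(opt_D, lr_lambda)

# During training, call opt_G.step() / opt_D.step() per batch,
# then sched_G.step() and sched_D.step() once per epoch.
```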
Next, we experimented on the Zalando dataset; the procedure is basically the same as on the ATR dataset. Due to the different dataset sizes, we adjusted the number of training epochs, used the same preprocessing, and did not require complicated tuning. Fig. 10 shows a comparison of the results of this experiment; additional results are shown in Fig. 11.

More experimental results.
From the results of the experiments, it can be seen that after replacement the clothing still maintains its pattern, color, text, and so forth, and the clarity of the picture is effectively preserved. When the mask produced by the adversarial network is accurate, the TPS conversion of the standard clothing is also accurate, and a preliminary try-on result can be obtained. Specifically, for the ATR dataset the accuracy of the provided masks was relatively high, so the realism of the final dress-up images was higher than for the Zalando dataset, whose masks were generated by the FCN network. The output of InstaGAN was likewise usable for virtual dress-up. However, the final synthesized image did not look particularly real; some pictures are recognizably computer-synthesized rather than naturally captured. We designed a fusion network to solve this problem by training a network that makes the replacement process more realistic while keeping relatively high definition.
Fusion Network
The design of the fusion network mainly follows the idea of GANs. The entire network consists of two parts, as in a general GAN. The difference is that whereas the input to a traditional GAN generator is random noise, the generator input in this experiment is two RGB pictures, so the number of input channels is 6; the generator learns how to fuse the two input pictures and outputs a 3-channel picture of the same size as the input. The network structure is similar to pix2pix, which realizes a picture-to-picture translation process; the generator design therefore follows the pix2pix generator structure shown in Fig. 12. Compared with the generator, the structure of the discriminator is relatively simple. As shown in Fig. 13, the input picture size is [300, 200, 3], and the discriminator works like a binary classification network: when the input picture comes from the output of the generator, it is classified as 0 and considered fake; when the input picture comes from a real sample, it is classified as 1 and considered real. Through continuous learning, the discriminator learns to judge the authenticity of the input picture, the generator learns to fool it, and the two progress together.

Generator of fusion network.

Discriminator of fusion network.
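A compressed PyTorch sketch of the fusion generator is given below: two RGB inputs concatenated into 6 channels, one 3-channel output of the input size. The depth and channel widths are assumptions; the real generator follows the pix2pix structure of Fig. 12.

```python
import torch
import torch.nn as nn

class FusionGenerator(nn.Module):
    """Pix2pix-style fusion generator: two RGB inputs in, one RGB image out."""
    def __init__(self, ngf=64):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(6, ngf, 4, stride=2, padding=1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(ngf, ngf * 2, 4, stride=2, padding=1),
            nn.InstanceNorm2d(ngf * 2), nn.LeakyReLU(0.2, True),
        )
        self.decode = nn.Sequential(
            nn.ConvTranspose2d(ngf * 2, ngf, 4, stride=2, padding=1),
            nn.InstanceNorm2d(ngf), nn.ReLU(True),
            nn.ConvTranspose2d(ngf, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, img_a, img_b):
        x = torch.cat([img_a, img_b], dim=1)  # 6 input channels
        return self.decode(self.encode(x))

# Example: fuse two [300, 200] inputs into one output of the same size.
out = FusionGenerator()(torch.randn(1, 3, 300, 200), torch.randn(1, 3, 300, 200))
```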
From the results of our fusion network in Fig. 14, we can see that the output of the network is more realistic and clearer than the results in Figs. 9-11. However, compared with the ground truth (GT), some details and textures are lost because of the generator's limitations.

The results of the fusion network. In_A and In_B are the two parts of the input, GT means ground truth, and Out is the output of the net.
Conclusion
Online try-on can be realized in three steps using our method. In experiments on the ATR and Zalando datasets, the results maintained good definition and realism. However, the method involves many intricate steps, so we are continuing to look for a simpler approach.
Acknowledgements
This work was partially supported by the National Key Research and Development Program of China (2019YFC1521300), the National Natural Science Foundation of China (61971121), the Fundamental Research Funds for the Central Universities of China, and the DHU Distinguished Young Professor Program.
