Abstract
Current fashion image search technology, based on fine-grained recognition of fashion images, has recently achieved great success in online shopping. However, this technique is limited to a single domain—real product images—and is thus inflexible. Recognition and search performance degrade to a large extent when the distribution of the target data differs from the source training data. To improve the flexibility of fashion image retrieval, we propose multi-domain fashion image recognition in this work. We first established Fashion-DA, a large-scale fashion dataset comprising 14 fashion categories and a total of 13,435 images originating from three domains. We then propose an unsupervised domain adaptation approach based on adaptive feature norm to handle data with different feature distributions. Experiments demonstrate the effectiveness of the proposed method.
Introduction
Recent studies1–8 on cross-domain fashion image retrieval have achieved satisfying results and have been widely applied in daily life, such as in the “street-to-the-shop” application. However, current studies on cross-domain fashion image retrieval (except for some research6–8 focused on shoe sketch images) concentrate on real products in different scenarios (e.g., street and online images). This approach has a limitation when the desired photo cannot be obtained (e.g., it is not always convenient to take a picture of other people on the street). In fact, there are many other forms of expression, such as hand-drawn and sketch images, that could be used instead of a real photo. As shown in Fig. 1, imagine the following scenarios: a designer wants to check whether his or her new design already exists, or a customer wants to find a fashion item that exists only in his or her mind without any captured photo. How can a simple hand-drawn image be used to find the real fashion products?

The first row shows the current mainstream cross-domain fashion network designed for the street-to-the-shop task. The second and third rows show the application scenarios for which Fashion-DA was built.
Different from previous cross-domain retrieval (i.e., street-to-the-shop), the aim of this work is to realize fashion image recognition across multiple domains. Since no available dataset satisfies our requirements, we built a fashion dataset, namely Fashion-DA, for fashion recognition in multiple domains.
Dissimilar Domains
Different from previous cross-domain fashion datasets1,4,9 (which include online product images and street photos), we combined hand-drawn (D), sketch (S), and online product images (P) to build the Fashion-DA dataset.
Scales
The categories cover all the main products in fashion: tops, trousers, pullovers, dresses, coats, sandals, shirts, sneakers, bags, ankle boots, skirts, jumpsuits, sunglasses, and hats (note that the first ten categories are consistent with those included in Fashion-MNIST10). The whole Fashion-DA dataset contains 13,435 images, of which 5673 are hand-drawn, 1431 are sketch images, and 6331 are online product images.
Availability
The Fashion-DA database will be openly accessible to the public for research use. We expect that this dataset can serve as a benchmark for domain adaptation (DA) algorithms.
Transfer Tasks
Meanwhile, in the Fashion-DA dataset, two new transfer tasks of fashion recognition are proposed:
D → P. A designer creates a new design work and can see its ready-to-wear version from the retrieved real products.
S → P. A customer does not have a photo of the product he or she wants, but can search for it using sketch images.
We believe that the two tasks can further enhance the fashion image retrieval systems.
DA Methods
Generally, the ability to generalize across datasets that share characteristics such as classes but have different underlying data structures is important. DA11–14 attempts to find an isomorphic latent feature space in which samples are hard to attribute to either the source or the target domain. This approach could solve the domain shift problem in fashion image retrieval, since it enables the model to transfer knowledge between domains that share some characteristics, such as categories, but have different feature distributions.
Existing DA methods can be roughly divided into three types—supervised domain adaptation (SDA),15–17 semi-supervised domain adaptation,18–20 and unsupervised domain adaptation (UDA)21–24—based on the availability of labels in the target domain. Generally, SDA outperforms UDA because UDA algorithms have no labeled target data (and labeled target training samples may simply not be available).
However, the above approaches are evaluated on general datasets related to animals, transportation, and office supplies (e.g., Office-31,25 VisDA2017,26 and ImageCLEF-DA27). In other words, previous DA methods have not been tailored to fashion images. To solve this problem, we propose an unsupervised domain adaptation approach based on the feature norm28 to handle our proposed application scenarios and to demonstrate the advantages of Fashion-DA.
Summary
We summarize our contributions as follows:
We built a fashion dataset for fashion recognition in multiple domains with 13,435 images, namely Fashion-DA. The results of state-of-the-art DA algorithms on the proposed Fashion-DA dataset are also presented as benchmarks.
We propose an unsupervised DA approach based on adaptive feature norm to handle the two transfer tasks specific to fashion.
Related Work
Cross-Domain Datasets
We summarize the cross-domain datasets in two groups: the mainstream datasets commonly adopted for testing DA algorithms, and the cross-domain datasets for fashion image retrieval tasks.
The mainstream datasets for object recognition based domain adaptation are COIL, 29 Office-31, 25 Office-Caltech, 21 Office-Home, 30 VisDA2017, 26 and ImageCLEF-DA. 27 Note that these datasets are not related to fashion.
On the other hand, there are some cross-domain datasets for fashion recognition. The most typical are the street-to-the-shop datasets (covering two domains: online product images and real-world images), such as DeepFashion,9 WITB,5 and FashionAI.31 These datasets are large and have rich attribute annotations, but cover only two types of domains with similar feature distributions. Additionally, there are datasets7,32,33 whose domains (sketch images and product images) differ more substantially; however, all of them are limited to shoes and bags.
Unlike the datasets summarized above, we created a dataset with many different domains across more comprehensive fashion categories.
DA Approaches
To mitigate the generalization bottleneck and bridge different distributions, extensive studies have been conducted on DA.12,34–38 Some existing approaches39,40 attempted to align feature spaces by exploiting shift-invariant information to match the target domain with the source domain. Meanwhile, from the perspective of deep learning, some methods18,22,41–44 adopted Maximum Mean Discrepancy (MMD) and association-based losses.
Most recently, major developments45–48 have focused on adversarial learning to align feature distributions across domains. Some adversarial domain adaptation methods apply adversarial losses in the feature space, while others introduce adversarial losses in the pixel space. Liu et al.45 introduced CoGAN, which trains two generative adversarial networks (GANs) that generate images resembling the source and target domains, respectively. Additionally, Hoffman et al.49 used CycleGAN50 to address semantic segmentation.
The approach we propose in this work follows this direction to present a novel learning paradigm from the perspective of an adaptive feature norm which targets the fashion domain.
Fashion-DA Dataset
We developed the Fashion-DA dataset with three types of domains to solve the domain shift problem in fashion recognition and to serve as the benchmarking dataset for testing the performance of DA algorithms.
The proposed dataset has several properties. First, it targets the fashion recognition task. Fourteen categories, including tops, trousers, pullovers, dresses, coats, sandals, shirts, sneakers, bags, ankle boots, jumpsuits, skirts, sunglasses, and hats, cover all main fashion items. Second, it contains three types of domains with dissimilar feature distributions to provide a more comprehensive dataset for benchmarking.
DA Algorithms
Samples of the 14 categories from the three domains in the Fashion-DA dataset—drawing, sketch, and real product—are shown in Fig. 2. Each category contains at least 500 images, and the total number of images is 13,435. Details can be found in Table I.

Samples of 14 categories in the Fashion-DA dataset from three domains including drawing (1st column), sketch (2nd column), and real product (3rd column).
Image Numbers of the 14 Categories in the Fashion-DA Dataset
The detailed numbers of images for the 14 categories are shown in Fig. 3. Clearly, the three domains in Fashion-DA are imbalanced. The quality of some collected images in certain categories (e.g., trousers, dresses, and coats) is low, which makes categorization more difficult (note that the sketches in this dataset are not symmetric and are out of proportion).

Image numbers of the 14 categories in the Fashion-DA dataset.
Our Approach
To handle the proposed application scenarios and demonstrate the advantages of the Fashion-DA database, we used an adaptive unsupervised DA algorithm based on augmented feature norms.28 Generally, the traditional domain adaptation problem can be formulated as follows.
Given a source domain $\mathcal{D}_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$ of $n_s$ labeled images with label space $\mathcal{C}_s$ and a target domain $\mathcal{D}_t = \{x_j^t\}_{j=1}^{n_t}$ of $n_t$ unlabeled images with label space $\mathcal{C}_t$, the goal of unsupervised DA is to learn a model from the labeled source data that performs well on the unlabeled target data despite the distribution shift between the two domains.
Generally, there are two types of settings in DA: the vanilla setting (which we focus on in this study) and the partial setting. The partial setting means that the source label space subsumes the target label space (i.e., $\mathcal{C}_t \subset \mathcal{C}_s$), so part of the source labeled data is not related to the target task. The vanilla setting refers to the standard unsupervised DA that has been extensively explored; in this setting, the source and target domains share an identical label space (i.e., $\mathcal{C}_s = \mathcal{C}_t$). Adversarial learning-based methods developed under the vanilla setting are vulnerable to the negative transfer effect in the disjoint label space $\mathcal{C}_s \setminus \mathcal{C}_t$. Considering the specific target of the fashion domain in this study, the proposed approach is independent of the association between the label spaces of the two domains.
Following the definition of Maximum Mean Discrepancy,51 and noting that a norm is a non-negative scalar-valued function, we instantiate the function class as the L2-norm composed with a deep neural network and define the Maximum Mean Feature Discrepancy (MMFND) between the source and target domains in Eq. 1.
Here, $n_s$ and $n_t$ denote the numbers of images from the source and target domains, $F_l(G(x)) = p^{(l)}(F_{l-1}(G(x)))$, and $F_0(G(x)) = G(x)$. $G$ and $F$ correspond to the backbone network and classifier in our framework, respectively. Specifically, $G$ is a general feature extraction module inherited from a prevailing neural network architecture, and $F$ is a task-specific classifier with $L$ fully connected layers, where $p^{(l)}(\cdot)$ denotes the $l$-th layer operation of $F$. We call the first $L-1$ layers of the classifier $F$ the bottleneck, denoted $F_{L-1}$. The features computed by $F$ depend on the specific domain and cannot be guaranteed to transfer to a new domain. Thus, we compute the class probabilities from the last layer, followed by a SoftMax operation.
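Since Eq. 1 itself is not reproduced here, a plausible reconstruction—following the MMFND definition of the adaptive feature norm approach28 on which this formulation builds—is:

\[
\mathrm{MMFND}[\mathcal{H}; \mathcal{D}_s, \mathcal{D}_t] \;=\; \sup_{h \in \mathcal{H}} \left( \frac{1}{n_s} \sum_{x_i^s \in \mathcal{D}_s} h(x_i^s) \;-\; \frac{1}{n_t} \sum_{x_j^t \in \mathcal{D}_t} h(x_j^t) \right),
\qquad h(x) = \big( \lVert \cdot \rVert_2 \circ F_{L-1} \circ G \big)(x),
\]

where $\mathcal{H}$ is the class of functions formed by composing the L2-norm with the bottleneck representation $F_{L-1}(G(\cdot))$; the exact notation is an assumption based on the definitions above.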
Next, we consider the bottleneck $F_{L-1}$ that generates the task-specific feature embedding. Based on the vector of logits computed by $F$, we further calculate the class probabilities by applying the SoftMax function; the class probabilities are denoted as $p(y \mid x)$.
Based on an analysis of the characteristics of fashion images and preliminary testing of the proposed method, traditional feature selection performed better than feature-importance analysis when the dataset does not contain massive numbers of samples. Generally, feature selection is used to preprocess the input, which means it is isolated from the training process, and the number of selected features must be set manually. Such a setup can easily weaken the model's ability to resist overfitting and to generalize.
For multi-domain retrieval, we focus more on the generalization ability of the model. Thus, in this work, we augment $G$ by integrating traditional feature selection and feature evaluation into the training process of the deep neural network. Then, similar to an attention mechanism, the network is forced to focus more quickly and accurately on information-rich features, avoiding the influence of irrelevant and redundant features.
To this end, as stated in Eq. 2, we introduce an additional feature selection layer into a traditional neural network model (e.g., AlexNet,52 GoogLeNet,53 ResNet,54 and DenseNet55).
The weight $W_{m \times 1}$ is multiplied elementwise with the corresponding input feature $X_{n \times m}$. The weight in Eq. 2 mainly scales the input features; ReLU then truncates the scaled features, and the bias acts as the feature-selection threshold.
The deep neural network optimizes $W$ and $b$, which together act as an adaptive selection of input features. We initialize $W$ with the evaluation scores produced by traditional feature selection. Based on these scores, the feature selection layer can enhance or weaken the influence of certain features on network training; in this way, the proposed framework scales the input features according to the traditional feature selection evaluation.
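As a concrete illustration, the following is a minimal PyTorch sketch of such a feature selection layer, assuming the elementwise form ReLU(X ⊙ W − b) described above; the class name, shapes, and initialization are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn


class FeatureSelectionLayer(nn.Module):
    """Elementwise feature scaling with a learnable threshold, truncated by ReLU.

    A sketch of the layer described around Eq. 2: the weight W scales each input
    feature, the bias b acts as a selection threshold, and ReLU suppresses
    features that fall below that threshold.
    """

    def __init__(self, num_features, init_scores=None):
        super().__init__()
        if init_scores is not None:
            # Initialize W with evaluation scores from a traditional feature
            # selection method, as described in the text.
            weight = torch.as_tensor(init_scores, dtype=torch.float32)
        else:
            weight = torch.ones(num_features)
        self.weight = nn.Parameter(weight)                    # W: per-feature scale
        self.bias = nn.Parameter(torch.zeros(num_features))   # b: selection threshold

    def forward(self, x):
        # x: (batch_size, num_features); scale, subtract the threshold, truncate at zero.
        return torch.relu(x * self.weight - self.bias)


# Usage sketch: insert the layer between the backbone G and the classifier F.
layer = FeatureSelectionLayer(num_features=2048)
features = torch.randn(8, 2048)   # e.g., pooled backbone features
selected = layer(features)
```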
Meanwhile, inspired by the success of GANs and following current mainstream practice, we optimized this upper bound in a two-player adversarial manner. Specifically, we replaced the sup in Eq. 1 with a max operator over F and applied a min operator over G, obtaining Eq. 3.
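A plausible form of Eq. 3 under this replacement (an assumption consistent with the description above) is:

\[
\min_{G} \max_{F} \; \frac{1}{n_s} \sum_{x_i^s \in \mathcal{D}_s} \big( \lVert \cdot \rVert_2 \circ F_{L-1} \circ G \big)(x_i^s) \;-\; \frac{1}{n_t} \sum_{x_j^t \in \mathcal{D}_t} \big( \lVert \cdot \rVert_2 \circ F_{L-1} \circ G \big)(x_j^t).
\]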
However, in our case, this kind of operation lacks an explicit interpretation of the adversarial behavior. It may lead to a random walk in the feature space and thus fail to adapt the source and target samples at the semantic level.
Following the idea in reference 28, the L2-norm of a vector can be regarded as the radius from the origin of a hypersphere to the vector point. Based on this, we construct an equilibrium at a large radius R to bridge the gap between the source and target domains. The resulting feature norm objective is given in Eq. 4.
Eq. 4 minimizes the MMFND, since it strictly constrains the mean feature norms of the source and target domains to converge to R. Different from common adversarial feature alignment methods, the objective is optimized through the intermediate variable R. This is because the function class is rich enough to contain many positive real-valued functions of the input x; without a restriction on the function, the upper bound would deviate greatly from zero. Specifically, considering the characteristics of fashion image features, we replace R in Eq. 4 with Δr, the residual feature norm, which is expected to introduce instance-level information into the network. Eq. 4 can then be rewritten as Eq. 5.
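Since Eqs. 4 and 5 are not reproduced here, a hedged reconstruction following the hard and residual (stepwise) feature norm objectives of reference 28 is given below; the exact distance $L_d$ and the use of previous-step parameters $\theta_0$ are assumptions:

\[
\text{(Eq. 4, plausible form)}\quad
L_{\mathrm{norm}} = L_d\!\left( \frac{1}{n_s} \sum_{x_i^s \in \mathcal{D}_s} h(x_i^s),\, R \right) + L_d\!\left( \frac{1}{n_t} \sum_{x_j^t \in \mathcal{D}_t} h(x_j^t),\, R \right),
\]

\[
\text{(Eq. 5, plausible form)}\quad
L_{\mathrm{norm}} = \frac{1}{n_s} \sum_{x_i^s \in \mathcal{D}_s} L_d\!\big( h(x_i^s; \theta_0) + \Delta r,\, h(x_i^s; \theta) \big) + \frac{1}{n_t} \sum_{x_j^t \in \mathcal{D}_t} L_d\!\big( h(x_j^t; \theta_0) + \Delta r,\, h(x_j^t; \theta) \big),
\]

where $L_d$ denotes a distance (e.g., the squared L2 distance) and $\theta_0$ denotes the model parameters from the previous update step.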
Additionally, the supervised source domain classification loss can be written as Eq. 6.
Finally, we obtained the learning objective in Eq. 7, where λ and β are variable weights.
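Eqs. 6 and 7 are likewise not reproduced here; one plausible reconstruction is the standard cross-entropy source loss and a weighted combination of the terms introduced above (the assignment of β to a feature-selection-related term is an assumption, since the text only states that λ and β are variable weights):

\[
\text{(Eq. 6, plausible form)}\quad
L_{\mathrm{cls}} = -\frac{1}{n_s} \sum_{(x_i^s,\, y_i^s) \in \mathcal{D}_s} \log p\big( y_i^s \mid x_i^s \big),
\]

\[
\text{(Eq. 7, plausible form)}\quad
L = L_{\mathrm{cls}} + \lambda\, L_{\mathrm{norm}} + \beta\, L_{\mathrm{sel}}.
\]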
Experiments
To evaluate the effectiveness of our proposed feature-norm-based approach against state-of-the-art DA methods, we first conducted experiments on two widely used domain adaptation benchmark datasets. Then, we report the results of these approaches (the state-of-the-art methods and our proposed method) on the Fashion-DA dataset as baselines, and analyze the adaptation results in detail. Finally, we show a potential application based on the proposed transfer task (i.e., S → P) to demonstrate the practical value of this work.
Setup
Along with the Fashion-DA dataset, we also adopted several general datasets for evaluating DA approaches to demonstrate the effectiveness of the proposed method.
Office-31
Office-3125 comprises 31 categories in an office environment. It contains a total of 4652 images from three domains: Amazon (A), DSLR (digital single-lens reflex) (D), and Webcam (W), which contain online website images, digital SLR camera images, and web camera images, respectively. A total of six transfer tasks (A → D, D → A, …, W → A) can be conducted.
ImageCLEF-DA
ImageCLEF-DA27 is a balanced dataset in which each domain has the same number of images per category. It contains a total of 12 common categories, with images collected from Caltech-256 (C), ImageNet ILSVRC 2012 (I), and Pascal VOC 2012 (P). A total of six transfer tasks (C → I, I → C, …, C → P) can be conducted.
Similar to the literature, 28 we followed the standard protocol42,43,48 with the vanilla setting. All labeled source samples and all unlabeled target images that belong to the corresponding target label space were used.
Method Comparisons
We compared the proposed method with state-of-the-art deep learning and domain adaptation approaches ResNet-50, 54 Domain-Adversarial Neural Networks (DANN), 48 Deep Adaptation Network (DAN), 43 Conditional Domain Adversarial Networks (CDAN), 56 Hard Adaptive Feature Norm (HAFN), and Instance Adaptive Feature Norm (IAFN). 28
We conducted our experiments on the PyTorch platform and fine-tuned a ResNet-50 pretrained on ImageNet.57 Following common practice, we adopted a unified set of hyper-parameters across the Office-31 and ImageCLEF-DA databases. Mini-batch SGD with a momentum of 0.9 and a learning rate decay of 0.001 was used for the classifier and backbone network. All experiments were repeated three times, and the average accuracy with standard deviation is reported.
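For reference, a minimal PyTorch sketch of this optimization setup is shown below; the classifier head and the per-group learning rates are illustrative assumptions, and the stated "learning rate decay of 0.001" could refer either to a weight decay or to a scheduler parameter, so it is only indicated in a comment.

```python
import torch
from torchvision import models

# Backbone G: ResNet-50 pretrained on ImageNet; classifier F: a placeholder head
# for the 14 Fashion-DA categories (assumed, not the authors' exact architecture).
backbone = models.resnet50(pretrained=True)
backbone.fc = torch.nn.Identity()          # expose pooled 2048-d features
classifier = torch.nn.Linear(2048, 14)

optimizer = torch.optim.SGD(
    [
        {"params": backbone.parameters(), "lr": 1e-3},     # assumed base learning rate
        {"params": classifier.parameters(), "lr": 1e-2},   # assumed higher rate for the new head
    ],
    momentum=0.9,
    # weight_decay=1e-3,  # one possible reading of the 0.001 decay stated in the text
)
```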
Results
The classification results under the vanilla setting for the Office-31, ImageCLEF-DA, and proposed Fashion-DA datasets are presented in Tables II–IV, respectively. The reported accuracies of the compared methods (e.g., DAN43 and DANN48) are cited directly from the corresponding papers. Our proposed model outperformed the other benchmarked methods.
General DA Datasets
As shown in Table II, the proposed method achieved better performance on most transfer tasks on Office-31 (except A → D and D → W, on which our approach achieved classification performance comparable to the state-of-the-art methods). The proposed approach also achieved the highest average classification accuracy over the six tasks.
Classification Accuracy (%) for Vanilla Setting on Office-31 Dataset
Additionally, as reported in Table III, the proposed method yielded better performance on the standard DA dataset ImageCLEF-DA. However, the accuracy of the proposed method was lower than the best result on I → C, I → P, and C → P. Specifically, IAFN28 achieved higher accuracy on the I → C and C → P tasks. The main difference between our approach and IAFN is the definition of R; the images in these two domains are closer to the central samples and therefore carry richer information.
Classification Accuracy (%) for Vanilla Setting on ImageCLEF-DA Dataset
Application to Fashion-DA Dataset
We also applied these approaches to the proposed Fashion-DA dataset. As indicated in Table IV, our method obtained consistently higher accuracy on most transfer tasks. The images in this dataset were specially collected from fashion domains that differ greatly from previous DA datasets. The dataset is expected to serve as a benchmark for evaluating DA algorithms from a different perspective.
Classification Accuracy (%) for Vanilla Setting on Fashion-DA Dataset
Feature Visualization
We visualized the t-SNE embeddings58 of the features learned by HAFN, IAFN, and our approach on the Fashion-DA dataset (covering the three transfer tasks I → C, I → P, and C → P) in Fig. 4 (with class information). It can be intuitively observed that the features in the first row (source only on Fashion-DA) are mixed together. The second row (IAFN on Fashion-DA) and third row (HAFN on Fashion-DA) remain hard to separate, whereas the fourth row (our method on Fashion-DA) produces better visualization results. In other words, the proposed network obtained clearer clusters than the other two methods.

The t-SNE visualization of our proposed model, HAFN, and IAFN on the Fashion-DA dataset.
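For reproducibility, a minimal sketch of this visualization step is shown below; the random features stand in for bottleneck embeddings F_{L-1}(G(x)) extracted from a trained model, and the hyper-parameters are illustrative assumptions.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholder data standing in for extracted bottleneck features and class labels.
feats = np.random.randn(1400, 256).astype(np.float32)   # 1400 samples, 256-d features (assumed)
labels = np.repeat(np.arange(14), 100)                   # 14 fashion categories

# Project the features to 2-D with t-SNE and color points by class.
emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(feats)
plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab20", s=5)
plt.title("t-SNE of bottleneck features (sketch)")
plt.show()
```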
Application
To demonstrate the practical value of the proposed method, we applied it to a fashion image retrieval task (i.e., sketch-to-the-shop). This application scenario is based on the defined transfer task S → P. The images of the retrieval dataset were collected from the Internet, and their format is similar to that of the training data. Note that our model recognizes the categories (but not the specific design attributes) learned from the training dataset and, at the same time, achieves consistent performance on the retrieval dataset. As shown in Fig. 5, we present the top ten retrieval results for several categories, including hats, tops, skirts, and sandals. Our model successfully recognized the fashion categories in a different domain; the retrieval results within the same category are randomly ranked. On top of these results, other features can be applied to conduct more specific retrievals. For example, we adopted the FOCO system59 to rank the retrieval results by color. As shown in Fig. 6, the retrieval results are ranked according to the color of the fashion items, and other pre-labeled attributes can be used as filters.

Top ten retrieval results based on the proposed method.

Top five retrieval results based on the proposed method and ranked by the black color.
Conclusion
This paper presents Fashion-DA, a fashion recognition dataset spanning multiple domains. Fashion-DA contains 13,435 images covering 14 main fashion categories, with different distributions across three domains: sketch, drawing, and product images. It differs from current cross-domain fashion datasets, which contain only two domains (street photos and online shopping images). The proposed dataset targets application cases such as sketch-to-product and drawing-to-product retrieval. To demonstrate the advantages of the proposed dataset, we designed a deep model—an adaptive unsupervised DA algorithm based on augmented feature norms—that outperformed state-of-the-art DA approaches on the Fashion-DA dataset and on common DA datasets (Office-31 and ImageCLEF-DA). Meanwhile, Fashion-DA is also expected to serve as a benchmark for DA algorithms.
Acknowledgement
This project was supported by the research fund of The Hong Kong Polytechnic University (project code: RUWZ).
