
Abstract

Fabric image retrieval, a form of content-based image retrieval, is a high-value research topic with potential applications in many fields, such as e-commerce and inventory management. However, it faces two major challenges: the demanding requirements on retrieval results and the peculiarities of fabric images. Unlike general image retrieval, fabric image retrieval systems have to pay more attention to texture and color features. To address these challenges, we propose a novel framework for fabric retrieval that combines self-supervised learning and deep hashing. The framework consists of two modules, one for feature learning and one for hash learning. During the feature learning phase, the color and texture information in the image is decoupled, driven by augmentation-based pretext tasks. In hash learning, a Bi-half layer is introduced to generate high-quality hash codes. The visualization results indicate that the proposed method represents fabric images well, and the experimental results show that the proposed retrieval framework achieves good performance (best mAP 0.903) and outperforms other methods, including several deep hashing methods and our previous work.

Keywords

Deep hashing, CBIR, fabric retrieval, feature learning, self-supervised learning

Introduction

To cater to the rapidly changing needs of the fast fashion industry, the textile industry has evolved to favor a variety of fabric designs produced in small batches, instead of the mass production of a single type of fabric. This production mode has caused the textile industry to accumulate an extensive catalog of historical designs, making it increasingly difficult to search for similar fabric designs. The traditional way of finding and comparing fabrics in a huge and ever-growing catalog is both time-consuming and labor-intensive, a burden that can be alleviated with a Content-based Image Retrieval (CBIR) system.

When receiving a query image, a CBIR system is expected to output a list of images whose visual content is similar to the query. Technically, CBIR has two core components: image representation and feature matching. Image representation vectorizes the input images (the queries and the images in the database), and feature matching ranks the database images by similarity and outputs the most similar ones. The most challenging task in CBIR is to associate pixel-based low-level features with human-perceived high-level semantic features. In many previous works,^1–3 hand-crafted feature descriptors are used to represent the visual content of images, such as Scale Invariant Feature Transform (SIFT),^4–6 Local Binary Pattern (LBP),^7–9 and Global Image Structure (GIST).^10–12 Even though these pixel-level feature-based methods have achieved some success, they rely heavily on feature engineering, which limits their robustness. Recently, Convolutional Neural Networks (CNNs) have achieved outstanding performance in many vision tasks, such as image classification and object detection, which demonstrates their strength in visual feature description. Naturally, many researchers have applied CNN models to image retrieval tasks. CBIR has achieved a significant breakthrough due to the replacement of earlier low-level feature-based algorithms with end-to-end frameworks based on deep learning. Inspired by this trend, this study focuses on the use of deep CNNs to solve the problem of fabric retrieval. Krizhevsky et al.¹³ directly used the output of the convolutional layer in a CNN as the index for image retrieval, and its excellent retrieval performance demonstrated the superiority of deep CNNs for image retrieval. However, the disadvantage of this method is its high computational cost, which results in a long retrieval time.

To optimize retrieval efficiency, many feature dimensionality reduction methods have been introduced, of which the most commonly used approach is Approximate Nearest Neighbor (ANN) search. Deep hashing is an emerging and efficient ANN search method designed to automatically learn optimal hash functions and generate image hash codes. Nearest neighbors are obtained by computing the Hamming distance between these hash codes. Recently, several deep CNN-based hashing methods, such as Convolutional Neural Networks Hashing (CNNH),¹⁴ Central Similarity Quantization (CSQ),¹⁵ and Deep Supervised Discrete Hashing (DSDH),¹⁶ have been proposed and have improved image retrieval significantly.
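To make the role of the Hamming distance concrete, the following minimal Python sketch ranks a toy database by the number of differing bits between hash codes. The function names, the image identifiers, and the 8-bit codes are hypothetical and chosen only for illustration; they are not part of the proposed framework, which uses codes of 16 to 256 bits.

```python
def hamming_distance(a, b):
    """Number of differing bits between two hash codes packed into integers."""
    return bin(a ^ b).count("1")


def rank_by_hamming(query, database):
    """Return (image_id, distance) pairs sorted from nearest to farthest."""
    scored = [(name, hamming_distance(query, code)) for name, code in database.items()]
    return sorted(scored, key=lambda item: item[1])


# Hypothetical 8-bit codes for three database images.
database = {
    "fabric_a": 0b10110010,
    "fabric_b": 0b10110011,
    "fabric_c": 0b01001101,
}
ranking = rank_by_hamming(0b10110010, database)
print(ranking)
```

Because the comparison reduces to an XOR followed by a bit count, ranking with binary codes is much cheaper than comparing high-dimensional floating-point features.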

Textile or fabric images do not contain the rich discriminant features, such as 3-D shapes, that are prevalent in natural images. Instead, texture and color are the dominant features of fabric images, as shown in Figure 1. This difference in feature types is the reason why general image retrieval algorithms are ill-suited for fabric retrieval. To make matters worse, fabric images lack distinct objects that can be annotated. Take Figure 1(a) as an example: the mother swan and the baby swan can be clearly identified, whereas in Figure 1(b) no distinct object can be annotated. Many previous studies^11,17–20 represented fabric images using low-level hand-crafted feature descriptors and achieved good performance on fabric image retrieval. However, the success of hand-crafted methods is limited to small datasets or specific fabric types. Our previous work^21,22 tried to use annotation information (single-view and multi-view) to guide the model to learn fabric image representations. However, because the retrieval results are limited to the annotated classes, those methods are difficult to apply in the textile industry. To address this problem, this paper proposes a novel deep learning framework that simultaneously learns the texture and color representations of fabrics by using self-supervised learning. Self-supervision is a learning framework in which supervisory signals for pretext tasks are created automatically, in an effort to learn representations useful for solving real-world downstream tasks in an unsupervised manner.

Figure 1.

General image and fabric image: (a) general natural image and (b) fabric image.

The motivation for designing the fabric retrieval framework comes from two aspects: (1) color and texture are the main features in fabric images, and decoupling these two features can improve retrieval performance and robustness; (2) rotation and scaling only slightly change the period of the texture in fabric images, while its structure is not changed. However, adjusting the hue of the image can change the color type, as shown in Figure 3 below. In summary, this fabric retrieval framework is designed to take full advantage of the dominant feature types available in fabric images, while enforcing strict rotation and scale invariance.

The rest of this paper is organized as follows. Section II introduces the main technical components involved in this paper. Section III presents the experimental setup configuration, including introduction of the used dataset, evaluation metrics, comparison methods, and implementation details. Section IV analyzes and discusses the experimental results. Conclusion is presented in Section V.

Fabric image representation

Unlike general image retrieval, fabric image retrieval systems have to pay more attention to texture and color features. In this section, we propose a fabric image representation framework based on self-supervised learning with designed pretext tasks. The framework consists of three main components: image transformation, Convolutional Neural Network, and learning algorithm, as illustrated in Figure 2.

Figure 2.

The overview of the proposed fabric representation framework. The framework consists of three main components: image transformation, Convolutional Neural Network, and learning algorithm. When receiving an input image I, first perform texture transformation and color transformation on I, and then input it into CNN for nonlinear transformation, and finally the parameters of the model are optimized under the supervision of the objective function.

Notations

Let $I \in ℝ^{w \times h}$ be an input fabric image. $y_{C} \in \{1, \dots, N_{C}\}$ and $y_{T} \in \{1, \dots, N_{T}\}$ denote the color and texture labels of $I$, where $N_{C}$ and $N_{T}$ are the numbers of corresponding classes. We also let $I_{C} = τ_{C} (I)$ and $I_{T} = τ_{T} (I)$ denote the augmented samples obtained with a color transformation $τ_{C}$ and a texture transformation $τ_{T}$ . Let $R_{•} = f (I_{•}; θ) = \begin{pmatrix} R_{•}^{C} \\ R_{•}^{T} \end{pmatrix}$ be the embedding vector of $I_{•}$ , where $f$ is a convolutional neural network with parameters $θ$ , the superscript denotes the feature type (C for color, T for texture), and the subscript • denotes the type of the input to $f$ (no subscript for the original sample, C for the sample augmented by the color transformation, and T for the sample augmented by the texture transformation). For example, $R_{C}^{T}$ denotes the embedding vector representing the texture feature of the sample augmented by the color transformation, as shown in Figure 2. $ℒ_{CE}$ represents the cross-entropy loss function, and $σ (\cdot; u)$ represents the Softmax classifier, that is, $σ_{i} (R; u) = \exp (u_{i}^{⊺} R) ∕ \sum_{k} \exp (u_{k}^{⊺} R)$ . In the rest of the paper, $σ^{C}$ represents the color classifier, and $σ^{T}$ represents the texture classifier.

Fabric image augmentation

Since color and texture are the main features of fabric images, we consider two types of transformations for fabric image augmentation, color transformation and texture transformation, as illustrated in Figure 3. Generally, fabric retrieval systems are expected to output results with high similarity in texture, regardless of scale and rotation. We therefore adopt two types of texture transformation: (1) crop the fabric image first and then resize the cropped sample to a fixed size, as shown in Figure 3(b); (2) rotate the fabric image within a certain range first and then crop a fixed-size sample, as shown in Figure 3(c).

Figure 3.

Fabric data augmentation. (a) Original image. (b) Crop-based augmentation. (c) Rotation-based augmentation. (d)–(f) Color-jitter-based augmentation.

However, the fabric is very sensitive to color. Due to the limitations of the commonly used RGB color space, the results of color transformation in this color space are often uncontrollable. HSV is closer to human perception of color than RGB. The hue in HSV is an important reference for many color classifications, as is the color feature classification of textile fabrics. This study considers three levels of color transformation by adjusting the hue $(+ π / 2, + π, + 3 π ∕ 2)$ , and the transformed samples are shown in Figure 3(d)–(f).
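The hue adjustment described above can be sketched with Python's standard `colorsys` module. This is a minimal illustration on a single RGB pixel rather than the augmentation code used in this study. It assumes that the hue circle of $2π$ is mapped to the interval [0, 1), so that the three shifts become 1/4, 1/2, and 3/4; the helper name `shift_hue` is hypothetical.

```python
import colorsys


def shift_hue(rgb, shift_fraction):
    """Rotate the hue of one RGB pixel (components in [0, 1]) by a fraction of the hue circle."""
    h, s, v = colorsys.rgb_to_hsv(*rgb)
    # Saturation and value are left untouched; only the hue is rotated.
    return colorsys.hsv_to_rgb((h + shift_fraction) % 1.0, s, v)


# Assumed mapping: pi/2, pi and 3*pi/2 correspond to 1/4, 1/2 and 3/4 of the hue circle.
HUE_SHIFTS = (0.25, 0.5, 0.75)

red = (1.0, 0.0, 0.0)
shifted = [shift_hue(red, fraction) for fraction in HUE_SHIFTS]
print(shifted)
```

Applying the same function to every pixel of an image changes its color type while leaving the texture structure unchanged, which is the property the color transformation relies on.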

Feature learning

In the training phase of a CNN, data augmentation is a trick commonly used to improve the generalization ability of the target network by exploiting transformations that preserve the semantics of the image, such as cropping, contrast enhancement, rotation, and flipping. The training objective $ℒ_{DA}$ with data augmentation can be written as follows:

ℒ_{DA} (I, y; θ, u) = 𝔼_{t \sim T} [ℒ_{CE} (σ (f (\tilde{I}; θ); u), y)]

(1)

where $T$ is the distribution of the transformations used for data augmentation and $\tilde{I} = t (I)$ is the transformed sample. The classifier $σ (f (\tilde{I}; θ); u)$ is forced to be invariant to the transformations by optimizing the above loss function. However, since the statistical characteristics of the augmented training samples may be very different from those of the original training samples, it may not make sense to enforce this invariance. Furthermore, the distribution $T$ of the transformations may be difficult to compute. In this case, enforcing invariance to these transformations can make learning more difficult and even degrade performance.

Our idea is to remove the unnecessary transformations and use only two types of transformations. The model learns the texture representation by using invariance to the texture transformation as supervision, and learns the color representation by using the variation caused by the color transformation. We call the proposed feature learning model SDA. We expect the trained model to be invariant to subtle texture changes and sensitive to drastic color changes. The training objective can then be written as

\begin{aligned} ℒ_{SDA} (I, y; θ, u) = {} & \sum_{* \in \{C, T\}} ℒ_{CE} (σ^{*} (R^{*}; u^{*}), y^{*}) \\ & + λ_{C} ℒ_{Triplet} (R^{C}, R_{T}^{C}, R_{C}^{C}) \\ & + λ_{T} ℒ_{Sim} (R^{T}, R_{T}^{T}, R_{C}^{T}) \end{aligned}

(2)

where $ℒ_{C E}$ is the cross-entropy loss. $λ_{C}$ and $λ_{T}$ are two non-negative scalar weights for balancing different losses. And the meaning of different R in the above equation is as follows:

f (I; θ) = \begin{pmatrix} R^{C} \\ R^{T} \end{pmatrix}

(3)

\begin{cases} f (I_{C}; θ) = f (τ_{C} (I); θ) = \begin{pmatrix} R_{C}^{C} \\ R_{C}^{T} \end{pmatrix}; \\ f (I_{T}; θ) = f (τ_{T} (I); θ) = \begin{pmatrix} R_{T}^{C} \\ R_{T}^{T} \end{pmatrix}. \end{cases}

(4)

$ℒ_{Triplet}$ is the triplet loss, first presented in Schroff et al.²³ In this study, the triplet loss can be written as:

ℒ_{Triplet} = {[{‖ R^{C} - R_{T}^{C} ‖}_{2}^{2} - {‖ R^{C} - R_{C}^{C} ‖}_{2}^{2} + α]}_{+}

(5)

Here we use the Euclidean distance to measure the distance between representations and set the margin $α$ to 1.0. The subscript “+” means that the value in the brackets is taken as the loss when it is greater than zero, and the loss is zero otherwise. $ℒ_{Sim}$ is a loss function supervised by texture similarity. It can be computed by:

ℒ_{Sim} = {‖ R^{T} - R_{C}^{T} ‖}_{2}^{2} + {‖ R^{T} - R_{T}^{T} ‖}_{2}^{2}

(6)
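As an illustration of Equations (5) and (6), the following Python sketch evaluates both losses on toy two-dimensional embeddings. It is a minimal sketch using plain lists; the function names and the example vectors are hypothetical, and an actual implementation would operate on batched tensors inside the training loop.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Equation (5): hinge on d(anchor, positive) - d(anchor, negative) + margin."""
    return max(squared_l2(anchor, positive) - squared_l2(anchor, negative) + margin, 0.0)


def sim_loss(r_t, r_t_color_aug, r_t_texture_aug):
    """Equation (6): pull the texture embeddings of both augmented views towards the original."""
    return squared_l2(r_t, r_t_color_aug) + squared_l2(r_t, r_t_texture_aug)


# Toy colour embeddings (illustrative values only).
r_c = [1.0, 0.0]        # R^C: original image
r_c_tex = [0.9, 0.1]    # R_T^C: texture-augmented view, same colour (positive)
r_c_col = [-1.0, 0.0]   # R_C^C: hue-shifted view, different colour (negative)

print(triplet_loss(r_c, r_c_tex, r_c_col))  # margin satisfied, loss is zero
print(triplet_loss(r_c, r_c_tex, r_c))      # negative collapses onto the anchor, loss is positive
```

The triplet term keeps the color embedding stable under texture augmentation while pushing away the hue-shifted view, whereas the similarity term asks the texture embedding to ignore both augmentations.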

Then, we introduce a multi-loss collaborative gradient descent optimization for the above problem. In this work, the parameters, which need to be learned, contain the parameters $θ$ in f and the parameters $u$ in σ . We first compute the partial derivative of the objective function for the parameters of the two classifiers.

\frac{\partial ℒ_{SDA}}{\partial u^{*}} = \begin{cases} \frac{\partial ℒ_{CE} (σ^{C} (R^{C}; u^{C}), y^{C})}{\partial u^{C}}, & \text{if } * = C; \\ \frac{\partial ℒ_{CE} (σ^{T} (R^{T}; u^{T}), y^{T})}{\partial u^{T}}, & \text{if } * = T. \end{cases}

(7)

The partial derivative of the objective function for the parameters of neural network can be computed by:

\frac{\partial ℒ_{SDA}}{\partial θ} = \frac{\partial ℒ_{SDA}}{\partial R} \cdot \frac{\partial R}{\partial θ}

(8)

There are six types of $R$ , and three parameter updates are required in each back-propagation. All forward propagation is differentiable, so we can solve this optimization problem using stochastic gradient descent algorithm.

The strong performance of CNNs in feature learning has been demonstrated, so a compact CNN is designed as the base network in this study, as shown in Figure 4. To enhance the performance, the joint learning of multiple tasks adopts a parameter soft-sharing framework. It is well known that the larger the number of parameters, the more data is required. Therefore, we only take the first three convolution-pooling modules of VGG16 for visual feature abstraction, and their parameters are initialized with a model pre-trained on the ImageNet dataset. The rationale for this choice is that the dataset used has a relatively small number of fabric images, so training deep networks from scratch on it is prone to overfitting. An effective way to reduce the data requirement is to use a pre-trained model and reduce its depth. Furthermore, fabric images mainly consist of textures and colors. It has been shown that low-level features, that is, colors and textures, also appear in natural images and can be extracted by the first several layers of a convolutional neural network trained on natural images (ImageNet). On the other hand, features from deep layers capture more semantically relevant abstract features, so deep layers trained on natural image datasets may not be able to accurately represent fabric images. The trained model receives a fabric image of size w×h and outputs two feature embeddings: the color feature embedding $R^{C}$ and the texture feature embedding $R^{T}$ .

Figure 4.

The architecture of the adopted CNN. We use a shallow CNN, taken from the first few layers of VGG-16, for fabric image representation. The reason for using this configuration is that the low-level features of an image mostly appear in the shallow layers of a CNN.

Deep hashing for feature aggregation

The role of the feature learning model proposed above is to extract the color feature embedding and the texture feature embedding for fabric image representation. Generally, using high-dimensional features directly for retrieval greatly increases the computational cost and thereby reduces retrieval efficiency. Hashing is very efficient in terms of computation and storage. It converts the original image features into compact binary codes while preserving the data structure of the original space. The transition from the continuous variable $R$ to a binary code variable $B$ can generally be regarded as a lossy communication channel. The transition can be expressed as:

g : R \mapsto B \in {\{- 1, + 1\}}^{K}

(9)

where $g$ is the hashing function and $K$ is the length of the binary codes. Recently, Li et al.²⁴ presented a new parameter-free network layer which can minimize the optimal transport cost measured by the Wasserstein distance. Here we briefly introduce the principle of this method and how to graft it into our framework.

The principle of bi-half layer

Maximizing hash channel capacity: The authors first introduced the concepts of channel capacity C and entropy $ℋ$ , and their relationship can be expressed as:

C = \max_{p (r)} I (R; B) = \max_{p (r)} (ℋ (B) - ℋ (B | R))

(10)

where the maximum is taken over all possible input distributions $p (r)$ and $I (R; B)$ denotes the mutual information between the continuous variable $R$ and the binary variable $B$ . $ℋ (B)$ and $ℋ (B | R)$ represent the entropy and the conditional entropy. Thus, maximizing $C$ is equivalent to maximizing $ℋ (B)$ and minimizing $ℋ (B | R)$ . When $p (B = + 1) = p (B = - 1) = 0.5$ (half-half distribution), the entropy of the binary variable $B$ is maximized. The conditional entropy is computed by:

\begin{aligned} ℋ (B | R) & = \int_{r \in R} p (r) \, ℋ (B | R = r) \, d r \\ & = - \int_{r \in R} p (r) \left( p_{r} (pos) \log p_{r} (pos) + p_{r} (neg) \log p_{r} (neg) \right) d r \\ \text{subject to:} \quad & p_{r} (neg) + p_{r} (pos) = 1, \\ & 0 \leq p_{r} (neg), p_{r} (pos) \leq 1 \end{aligned}

(11)

where $p_{r} (pos)$ and $p_{r} (neg)$ are the probabilities that the binary output value is $+ 1$ or $- 1$ , respectively. The value of $ℋ (B | R)$ lies between 0 and 1, and its minimum of 0 is obtained when either $p_{r} (pos)$ or $p_{r} (neg)$ is equal to 1.
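The two extreme cases of the binary entropy discussed above can be checked numerically. The following Python sketch is only an illustration with a hypothetical helper name; it computes the entropy in bits of a binary variable as a function of the probability of the value $+ 1$.

```python
import math


def binary_entropy(p_pos):
    """Entropy in bits of a binary variable with P(+1) = p_pos and P(-1) = 1 - p_pos."""
    if p_pos in (0.0, 1.0):
        # By convention 0 * log(0) = 0, so a deterministic output has zero entropy.
        return 0.0
    p_neg = 1.0 - p_pos
    return -(p_pos * math.log2(p_pos) + p_neg * math.log2(p_neg))


for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, round(binary_entropy(p), 4))
```

The entropy peaks at 1 bit for the half-half distribution and drops to 0 when the output is deterministic, which matches the two conditions used to maximize the channel capacity: balanced bits overall, and a deterministic bit for each individual input.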

Bi-half layer for quantization: To align the distribution of continuous feature with the ideal prior half-half distribution, the authors introduced Optimal Transport²⁵ and 1-Wasserstein distance. For a randomly sampled mini-batch of M samples, the empirical distributions, P_r (for continuous variable R) and P_b (for binary variable B), are computed by:

P_{r} = \sum_{i = 1}^{M} p_{i} δ_{r_{i}}, \quad P_{b} = \sum_{j = 1}^{2} q_{j} δ_{b_{j}}

(12)

where $δ_{x}$ denotes the Dirac function at location x. p_i and q_j are the probability mass of the corresponding location. Then, the optimization problem of the hash function can be written as:

π_{0} = \min_{π \in Π (P_{r}, P_{b})} \sum_{i} \sum_{j} π_{i j} {(r_{i} - b_{j})}^{2}

(13)

where $Π (P_{r}, P_{b})$ is the set of all joint probability distributions $π_{i j}$ with marginals $P_{r}$ and $P_{b}$ . The minimization problem is solved by a simple procedure: (1) sort the elements of $r$ over the mini-batch; (2) assign the top half of the elements to $+ 1$ and the remaining elements to $- 1$ . The method can be expressed as:

b = π_{0} (r) = \begin{cases} + 1, & \text{top half of sorted } r \\ - 1, & \text{otherwise} \end{cases}

(14)
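The sort-and-split rule of Equation (14) can be sketched in a few lines of Python. This is a minimal illustration of the forward assignment only, not the bi-half layer of Li et al.²⁴ itself. It assumes an even mini-batch size and applies the rule to each hash bit independently across the mini-batch; the function names and the toy batch are hypothetical.

```python
def bi_half(values):
    """Assign +1 to the larger half of one hash bit's values across a mini-batch, -1 to the rest."""
    m = len(values)
    order = sorted(range(m), key=lambda i: values[i], reverse=True)
    codes = [-1] * m
    for i in order[: m // 2]:
        codes[i] = 1
    return codes


def bi_half_batch(batch):
    """Apply the rule to every bit position of a batch of M feature vectors of length K."""
    columns = list(zip(*batch))                  # one column per hash bit
    coded_columns = [bi_half(col) for col in columns]
    return [list(row) for row in zip(*coded_columns)]


# Toy mini-batch of M = 4 samples with K = 2 continuous features each.
batch = [[0.9, -0.2], [0.1, 0.4], [-0.5, 0.3], [0.2, -0.7]]
codes = bi_half_batch(batch)
print(codes)
```

Unlike a plain sign function, this assignment yields exactly as many $+ 1$ as $- 1$ values for each bit even when all inputs share the same sign, for example `bi_half([0.1, 0.2, 0.3, 0.4])`, which is what the half-half prior requires.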

The idea is implemented as a simple new hash coding layer, called the bi-half layer, which can be embedded into many architectures to generate binary codes. During training, the forward propagation and back propagation are summarized as:

\begin{aligned} \text{Forward-propagation:} \quad & B = π_{0} (R) \\ \text{Back-propagation:} \quad & \frac{\partial ℒ}{\partial R} = \frac{\partial ℒ}{\partial B} + η (R - B) \end{aligned}

(15)

where $ℒ$ is the loss function. In this work, we combine cosine similarity and MSE for training in an unsupervised manner. $η = \frac{1}{l_{r}}$ , where $l_{r}$ is the learning rate.

The framework of hashing model

We graft the bi-half layer into our hashing model, as shown in Figure 5. The architecture of the hashing model consists of three layers: the input $R$ , a fully connected (fc) layer for feature dimensionality reduction, and a bi-half layer for generating the hash codes $B$ . The fc layer converts the size of the input continuous variable into the size of the final encoding. In this work, we need to train two hashing models, one for encoding the color feature $R^{C}$ and one for encoding the texture feature $R^{T}$ . Since the architectures of the two hashing models are identical, we introduce only one of them here. We use $R \in \{R^{C}, R^{T}\}$ to represent the color or texture feature embedding extracted by the above feature learning model. $R_{+}$ and $R_{-}$ denote the first half and the second half of the input sample $R$ , respectively, and likewise for $B_{+}$ and $B_{-}$ . We use the following loss function to encourage the model to learn the structural features of the input, thereby reducing the information loss.

ℒ_{mse} = {(S_{R} (R_{+}, R_{-}) - S_{B} (B_{+}, B_{-}))}^{2}

(16)

S_{R} (R_{+}, R_{-}) = \frac{R_{+} \cdot R_{-}}{‖ R_{+} ‖ \, ‖ R_{-} ‖}

(17)

S_{B} (B_{+}, B_{-}) = \frac{B_{+} \cdot B_{-}}{‖ B_{+} ‖ \, ‖ B_{-} ‖}

(18)
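Equations (16) to (18) can be illustrated with the following Python sketch, which compares the cosine similarity of two continuous halves with that of their binary codes. It is a minimal sketch on single vectors; the function names and the toy values are hypothetical, and a practical implementation would evaluate the loss on batched tensors.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors, as in Equations (17) and (18)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def similarity_mse(r_plus, r_minus, b_plus, b_minus):
    """Equation (16): squared gap between continuous and binary cosine similarities."""
    return (cosine_similarity(r_plus, r_minus) - cosine_similarity(b_plus, b_minus)) ** 2


# Toy continuous halves and the binary codes they are mapped to (illustrative values only).
r_plus, r_minus = [0.8, 0.6], [0.6, 0.8]
b_plus, b_minus = [1, 1], [1, 1]
loss = similarity_mse(r_plus, r_minus, b_plus, b_minus)
print(loss)
```

A small loss means that the binary codes preserve the similarity structure of the continuous features, so little information is lost in the hashing step.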

For the optimization, we adopt stochastic gradient descent algorithm to optimize hashing model. The hash model is trained in an unsupervised manner (also called self-supervised).

Figure 5.

The framework of the hashing model. $R_{+}$ and $R_{-}$ denote the first half and the second half of the input sample $R$ , respectively, and likewise for $B_{+}$ and $B_{-}$ . The MSE loss is used to maintain the distribution of features before and after the hash transformation, thereby improving the quality of the generated hash codes.

Experimental configuration

Dataset

Learning-based methods learn and induce representations from data, driven by the target task, so data is the foundation of deep learning models. A standard dataset is also indispensable for evaluating the retrieval performance of different methods. In this study, a fabric image dataset named MFT-fabric-v1 is built. Specifically, the dataset consists of 46,868 Mélange fabric images as the training-set and 3672 Mélange fabric images as the testing-set. Mélange fabrics are directly woven from Mélange yarns without dyeing or printing, and Mélange yarn is made of two or more fibers of different colors that are spun after being fully mixed, creating a unique mixed color effect (http://www.e-huafu.com/). All images in the dataset are annotated from three different viewpoints, namely color, texture, and raw materials. This work mainly focuses on the color and texture of fabric, so the raw material label is ignored. The color of a Mélange fabric is defined as the color of the special colored yarn or fiber (other than the basic colors, such as white and black). This definition makes it difficult for general hand-crafted methods to represent the color of Mélange fabrics, so this paper proposes to use a deep learning based method to describe the features of Mélange fabric. According to this definition, there are nine colors of Mélange fabric in the dataset, namely grey (8232), red (5241), orange (3928), yellow (5096), brown (6315), green (4751), blue (6894), purple (2867), and colorful grey (3544). With respect to texture, we divide the weft-knitted fabric into six levels (VOL47: 6593; VOL48: 6796; VOL49: 6729; VOL50: 8924; VOL51: 7939; VOL52: 9887) based on the thickness and weight of the yarn used. The testing-set contains 72 sets of Mélange fabric images (different sets belong to different categories), each of which consists of a query and its 50 related images.
To evaluate the robustness of the retrieval methods, the 72 queries are augmented with rotation, flipping, and scaling, so that each query image is expanded to 10 images. To avoid the influence of capture conditions, the images in MFT-fabric-v1 are collected in a stable light-box environment using the DigiEye system, which is equipped with a Nikon D7000 camera, a special pick-up head, and a standard D65 illuminant, and which offers small color difference and stable conditions. The resolution of the collected images is 96 dpi.

Evaluation metrics

In this work, two metrics are used to evaluate the retrieval performance of different methods, namely the precision-recall curve and the mAP (Mean Average Precision) value. To compute them, it is necessary to define TP (true positive), FN (false negative), FP (false positive), and TN (true negative). As shown in Table 1, TP refers to the number of relevant images retrieved; FN refers to the number of relevant images not retrieved; FP refers to the number of non-relevant images incorrectly retrieved as relevant; TN refers to the number of non-relevant images correctly left unretrieved. The precision and recall of the retrieval results are then defined as:

\mathrm{Precision} = \frac{TP}{TP + FP} \times 100\%

(19)

\mathrm{Recall} = \frac{TP}{TP + FN} \times 100\%

(20)
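A minimal Python sketch of Equations (19) and (20) is given below. The function name and the image identifiers are hypothetical; the sketch simply counts TP, FP, and FN from the set of retrieved images and the set of relevant images.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) in percent for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)   # relevant images that were retrieved
    fp = len(retrieved - relevant)   # non-relevant images that were retrieved
    fn = len(relevant - retrieved)   # relevant images that were missed
    precision = 100.0 * tp / (tp + fp) if retrieved else 0.0
    recall = 100.0 * tp / (tp + fn) if relevant else 0.0
    return precision, recall


# Hypothetical query: four images retrieved, three images actually relevant.
precision, recall = precision_recall(["a", "b", "c", "d"], ["a", "c", "e"])
print(precision, recall)
```

Sweeping the number of retrieved images and recording the resulting pairs gives the points of a precision-recall curve.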

mAP is the mean of average precision (AP), which can be calculated by:

AP (q) = \frac{1}{N_{Tr} (q) @ n} \sum_{i = 1}^{n} \left( Tr (q, i) \frac{N_{Tr} (q) @ i}{i} \right)

(21)

where $Tr (q, i) \in \{0, 1\}$ is an indicator function: if the query $I_{q}$ and the $i$th retrieval result $I_{i}$ have the same label, $Tr (q, i) = 1$ ; otherwise, $Tr (q, i) = 0$ . $N_{Tr} (q) @ i$ denotes the number of relevant images within the top $i$ images. Let $Q$ denote the number of queries. Then the mAP can be calculated by:

mAP = \frac{1}{Q} \sum_{q = 1}^{Q} AP (q)

(22)
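The computation of AP and mAP can be sketched in Python as follows. This is an illustrative sketch rather than the evaluation code used in the experiments; each query is represented by a hypothetical list of 0/1 relevance flags $Tr (q, i)$ for its ranked top-$n$ results.

```python
def average_precision(relevance):
    """AP of one query from the 0/1 relevance flags of its ranked results."""
    hits = 0       # running count of relevant images within the top i results
    total = 0.0
    for rank, flag in enumerate(relevance, start=1):
        if flag:
            hits += 1
            total += hits / rank
    # Normalize by the number of relevant images found in the top n results.
    return total / hits if hits else 0.0


def mean_average_precision(relevance_lists):
    """mAP: the mean of AP over all queries."""
    return sum(average_precision(r) for r in relevance_lists) / len(relevance_lists)


# Two hypothetical queries with their ranked relevance flags.
map_value = mean_average_precision([[1, 0, 1, 1], [0, 1]])
print(map_value)
```

For the first query the relevant results sit at ranks 1, 3, and 4, so its AP is (1/1 + 2/3 + 3/4) / 3; rankings that place relevant images earlier receive a higher AP.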

Table 1.

The definition of TP (true positive), FN (false negative), FP (false positive), and TN (true negative).

	Relevant	Non-relevant
Retrieved	TP	FP
Not retrieved	FN	TN

In evaluation, two images are considered semantically similar if they have the same annotation. For the two evaluation metrics used in this work, a larger area under the precision-recall curve and a larger mAP indicate better retrieval performance.

Implementation details

To avoid overfitting, a VGG16 model pre-trained on ImageNet is used to initialize the convolutional layers of the proposed model. The two hyperparameters, $λ_{C}$ and $λ_{T}$ in Equation (2), are used to balance the gradients of the different objective functions for parameter optimization during training. The experimental results demonstrate that the model achieves better performance on MFT-fabric-v1 when $λ_{C} = 0.1$ and $λ_{T} = 0.1$ . When training the feature learning model, the other parameters of the model are configured as follows: batch_size = 64, weight_decay = 5e-5, initial learning_rate = 1e-3, optimizer = Adam. To prevent training from getting stuck in local optima, we set a decay strategy for the learning rate, which can be denoted by:

lr_{n} = \begin{cases} lr_{n - 1}, & \text{if } n < a \\ lr_{n - 1} \times {(1 - \frac{n - a}{N - a})}^{p}, & \text{otherwise} \end{cases}

(23)

where $lr_{n}$ is the learning rate of the $n$th epoch and $N$ is the total number of epochs configured. $a$ denotes the starting epoch at which the learning rate begins to decay. $p \in (0, 1)$ is a parameter used to control the intensity of each decay. In this work, we set $a = 20, N = 60, p = 0.7$ , which were adjusted empirically based on the results of multiple training experiments.
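The decay rule of Equation (23) can be traced with a short Python sketch. It applies the recursion literally as written, with the reported settings as default arguments, and it assumes that epoch 0 starts from the initial learning rate; the function name is hypothetical and the sketch is not tied to any particular training framework.

```python
def learning_rate_schedule(initial_lr=1e-3, start_epoch=20, total_epochs=60, power=0.7):
    """Return [lr_0, ..., lr_N] following the recursive rule of Equation (23)."""
    rates = [initial_lr]
    for n in range(1, total_epochs + 1):
        previous = rates[-1]
        if n < start_epoch:
            rates.append(previous)  # constant phase before decay starts
        else:
            factor = (1.0 - (n - start_epoch) / (total_epochs - start_epoch)) ** power
            rates.append(previous * factor)
    return rates


rates = learning_rate_schedule()
for epoch in (0, 19, 20, 21, 30, 60):
    print(epoch, rates[epoch])
```

Because each epoch multiplies the previous rate by a factor smaller than one, the learning rate stays constant up to epoch 20, then shrinks progressively faster, and reaches zero at the final epoch.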

For the 3-layer hashing model, we simply adopt the following configuration: learning_rate = 1e-4 (fixed), weight_decay = 4e-5, optimizer = SGD. The proposed framework is implemented with the PyTorch (https://www.pytorch.org/) toolkit. To make a fair comparison, all the compared methods are reimplemented with the PyTorch toolkit and use VGG16 as the backbone. The hardware environment is as follows: CPU: E5-2623 V4@2.60GHz, RAM: 32G, GPU: GeForce RTX 3090 (24G).

Experimental results

The performance of fabric image representation

To demonstrate the effectiveness of the proposed framework for fabric image representation, we first visualize the extracted high-dimensional features by using t-SNE, which is an unsupervised method. The trained feature learning model outputs two feature embeddings: the color feature embedding $R^{C}$ and the texture feature embedding $R^{T}$ . We visualize the two feature embeddings, and the results are shown in Figure 6(a) and (b). The visualization results show a great improvement compared to our previous research. Figure 6(a) shows the representation of the color features of the fabrics. Except for gray and color-gray, the colors are well separated in the figure. In fact, color-gray and gray have a high degree of similarity, which makes it difficult to distinguish them. Figure 6(b) presents the visualization result of the texture features.

Figure 6.

The visualization results of feature learning: (a) the representation effect of color features, (b) the representation effect of texture features formed by different organizational structures, (c) the representation effect of binarization color features, and (d) the representation effect of binarization texture features.

Then we visualize the binary codes ( $B^{C}$ and $B^{T}$ ) output from the bi-half layer, and the visualization results are presented in Figure 6(c) and (d). We observe that the features output from the feature learning model are entangled with each other. After binarization with the bi-half layer, most images belonging to the same category are gathered into separate, more concentrated regions, and the boundaries between regions are more obvious. Furthermore, the proposed self-supervised feature learning framework can drive different task branches to focus on specific feature representations, which can be regarded as the decoupling of color and texture features. In summary, the visualization results demonstrate the effectiveness of the proposed feature learning method for fabric image representation.

Ablation experiments

In this section, we conduct ablation experiments to verify the rationality of our model configuration, including MTL²¹ + Bi-half layer (general multi-task learning framework + Bi-half layer) and SDA + Sign layer (the proposed feature learning framework + Linear layer with Tanh + Sign layer). In addition, we also test the robustness of these configurations by evaluating the mAP on two testing sets: the 72 original queries and the augmented queries (described in the Dataset subsection).

Table 2 presents the ablation results for different code lengths (16, 32, 64, 128, 256). MTL + Bi-half adopts a multi-task learning framework to learn the fabric image representation. It achieves good performance on the 72-original-query testing-set, but it performs poorly on the augmented testing-set (a decrease of 0.138 at 256 bits). This shows that the MTL model is sensitive to rotation and scale, which leads to its weak robustness. Comparing MTL + Bi-half and SDA + Sign, their performance on the 72-original-query testing-set is not much different, but there is a big gap on the augmented testing-set. This gap results from the better generalization of the proposed SDA model. When the proposed SDA model is combined with the Bi-half layer, retrieval performance is improved further. These results demonstrate that the Bi-half layer can automatically generate higher-quality binary codes and that the proposed SDA can improve retrieval performance.

Table 2.

Ablation experiments on MFT-fabric-v1 (mAP at each code length).

Configurations	Testing-set	16 bits	32 bits	64 bits	128 bits	256 bits
MLT + Sign	ori	0.651	0.718	0.773	0.829	0.837
MLT + Sign	aug	0.514	0.594	0.627	0.651	0.707
MLT + Bi-half	ori	0.672	0.747	0.792	0.849	0.864
MLT + Bi-half	aug	0.524	0.615	0.644	0.673	0.726
SDA + Sign	ori	0.729	0.789	0.815	0.849	0.862
SDA + Sign	aug	0.679	0.772	0.807	0.832	0.847
Ours	ori	0.748	0.825	0.843	0.889	0.903
Ours	aug	0.726	0.806	0.826	0.858	0.886

“Ori” denotes the result of the corresponding experiment on the 72-original-query testing-set, and “aug” denotes the result on the augmented testing-set.
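The robustness comparison can be made explicit by computing, for each configuration, the drop in mAP from the original to the augmented testing-set. The following short sketch uses only the values reported in Table 2; the dictionary layout and the function name are our own illustrative choices.

```python
from typing import Dict, List

BITS = [16, 32, 64, 128, 256]

# mAP values copied from Table 2.
TABLE_2: Dict[str, Dict[str, List[float]]] = {
    "MLT + Sign": {"ori": [0.651, 0.718, 0.773, 0.829, 0.837],
                   "aug": [0.514, 0.594, 0.627, 0.651, 0.707]},
    "MLT + Bi-half": {"ori": [0.672, 0.747, 0.792, 0.849, 0.864],
                      "aug": [0.524, 0.615, 0.644, 0.673, 0.726]},
    "SDA + Sign": {"ori": [0.729, 0.789, 0.815, 0.849, 0.862],
                   "aug": [0.679, 0.772, 0.807, 0.832, 0.847]},
    "Ours": {"ori": [0.748, 0.825, 0.843, 0.889, 0.903],
             "aug": [0.726, 0.806, 0.826, 0.858, 0.886]},
}


def robustness_gap(config: str) -> List[float]:
    """mAP drop (original minus augmented) at each code length."""
    rows = TABLE_2[config]
    return [round(o - a, 3) for o, a in zip(rows["ori"], rows["aug"])]


for name in TABLE_2:
    gaps = dict(zip(BITS, robustness_gap(name)))
    print(f"{name:14s} {gaps}")
```

The two MLT configurations lose between 0.124 and 0.178 mAP under augmentation, whereas the two SDA-based configurations lose at most 0.050.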

Comparisons and retrieval performance

In this section, we conduct comparative experiments with 10 state-of-the-art methods, including three unsupervised hashing methods, three supervised hashing methods, and four methods for fabric image retrieval. The mAP values of all methods with various hash code lengths (16, 32, 64, 128, and 256 bits) are presented in Table 3. All experiments are conducted on the MFT-fabric-v1 dataset, and retrieval performance is evaluated on both the 72-original-query testing-set and the augmented-query testing-set.
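For readers who want to reproduce the metric, the following is a minimal sketch of mAP over a Hamming ranking. It assumes the common definition in hashing-based retrieval: database items are ranked by Hamming distance to the query code, and an item is relevant when it shares the query's label. The exact protocol used in our experiments is the one given in the evaluation metrics section; the toy codes, labels, and function names below are hypothetical.

```python
from typing import List, Sequence


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


def average_precision(query_code: Sequence[int], query_label: str,
                      db_codes: List[Sequence[int]], db_labels: List[str]) -> float:
    """Average precision of one query over the full Hamming ranking."""
    # sorted() is stable, so ties in distance keep the database order.
    order = sorted(range(len(db_codes)), key=lambda i: hamming(query_code, db_codes[i]))
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if db_labels[i] == query_label:
            hits += 1
            precisions.append(hits / rank)  # precision at each relevant position
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(q_codes, q_labels, db_codes, db_labels) -> float:
    """Mean of the per-query average precisions."""
    aps = [average_precision(c, l, db_codes, db_labels) for c, l in zip(q_codes, q_labels)]
    return sum(aps) / len(aps)


# Toy 4-bit database with two categories (assumed data).
DB_CODES = [[1, 1, -1, -1], [1, 1, 1, -1], [-1, -1, 1, 1], [1, -1, -1, -1]]
DB_LABELS = ["A", "A", "B", "B"]
Q_CODES = [[1, 1, -1, -1], [-1, -1, 1, 1]]
Q_LABELS = ["A", "B"]

print(round(mean_average_precision(Q_CODES, Q_LABELS, DB_CODES, DB_LABELS), 3))  # 0.917
```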

Table 3.

mAP comparison results on MFT-fabric-v1 dataset.

Methods	72-original 16 bits	72-original 32 bits	72-original 64 bits	72-original 128 bits	72-original 256 bits	Augmented 16 bits	Augmented 32 bits	Augmented 64 bits	Augmented 128 bits	Augmented 256 bits
CNNH¹⁴	0.711	0.787	0.807	0.815	0.821	0.616	0.676	0.725	0.742	0.769
DPSH²⁶	0.732	0.798	0.813	0.827	0.839	0.635	0.663	0.718	0.739	0.754
CSQ¹⁵	0.763	0.819	0.834	0.859	0.878	0.654	0.683	0.727	0.741	0.793
UHBDNN²⁷	0.691	0.766	0.793	0.807	0.812	0.603	0.652	0.678	0.705	0.719
SSDH²⁸	0.683	0.769	0.789	0.810	0.816	0.606	0.647	0.671	0.692	0.711
SADH²⁹	0.707	0.783	0.812	0.826	0.837	0.612	0.654	0.693	0.716	0.735
FRHS²¹	0.709	0.749	0.763	0.799	0.812	0.629	0.663	0.706	0.710	0.739
FRMT²²	0.739	0.807	0.831	0.847	0.882	0.647	0.693	0.726	0.757	0.796
CMGF¹¹	0.569					0.547
MR-LBP¹⁹	0.612					0.601
Ours	0.748	0.825	0.843	0.889	0.903	0.726	0.806	0.826	0.858	0.889

The best result in each column is marked in bold. The three groups of rows below the header present the results of three different types of methods, in order: supervised hashing methods, unsupervised hashing methods, and fabric image retrieval methods.

To make a fair comparison, we employ the VGG-16 network as the backbone of all deep-learning based methods. For the three supervised deep hashing methods (CNNH,¹⁴ DPSH²⁶ and CSQ¹⁵), we also apply the framework based on soft parameter sharing to build the feature learning network, and then use the hashing method proposed by the respective authors to learn hash codes from each view. Note that the pairwise matrix S is generated from the fabric annotations of each view. The three unsupervised hashing methods (UHBDNN,²⁷ SSDH²⁸ and SADH²⁹) are all implemented in PyTorch according to the corresponding papers and then trained on the MFT-fabric-v1 dataset. We also compare with our previous work on fabric retrieval, FRHS²² and FRMT.²¹ In addition, the two hand-crafted descriptor based fabric retrieval methods, CMGF¹¹ and MRI-LBP,¹⁹ are implemented in MATLAB.
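The construction of the pairwise matrix S for each view can be sketched as follows. The sketch assumes the usual convention of pairwise supervised hashing, in which S[i][j] is 1 when samples i and j share an annotation in the given view and 0 otherwise; the label values and the function name are hypothetical and serve only as an illustration.

```python
from typing import List, Sequence


def pairwise_similarity(labels: Sequence[str]) -> List[List[int]]:
    """S[i][j] = 1 if samples i and j carry the same annotation, else 0."""
    return [[1 if a == b else 0 for b in labels] for a in labels]


# Hypothetical annotations of three fabric images in the two views.
color_labels = ["red", "red", "blue"]
texture_labels = ["twill", "plain", "twill"]

S_color = pairwise_similarity(color_labels)      # supervises the color view
S_texture = pairwise_similarity(texture_labels)  # supervises the texture view

print(S_color)
print(S_texture)
```

Because the two matrices are built independently, two images can be similar in one view and dissimilar in the other, as the first and third samples are here.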

Table 3 presents the mAP comparison results on the MFT-fabric-v1 dataset. The proposed retrieval method achieves the best performance at almost every code length, especially on the augmented testing-set; the only exception is 16 bits on the 72-original-query testing-set, where CSQ is higher (0.763 versus 0.748). For example, when the code length is set to 128 bits, the proposed method achieves an mAP of 0.889 on the 72-original-query testing-set and 0.858 on the augmented testing-set, which surpasses all compared methods. In addition, we observe that the performance of all compared methods improves as the code length increases from 16 bits to 256 bits, which indicates that longer hash codes provide more discriminative power in most deep hashing models. Furthermore, when comparing the results on the 72-original-query testing-set and the augmented testing-set, there is a large difference in the performance of most methods. For example, CSQ achieves an mAP of 0.834 with a code length of 64 bits on the former testing-set, but only 0.727 on the latter. In contrast, the proposed method performs well on both testing-sets, which demonstrates that the proposed feature learning model (SDA) has high robustness and generalization. The two hand-crafted descriptor based methods also have a certain degree of robustness, but due to the limitations of feature engineering, their performance on our dataset is poor. Finally, the supervised methods perform better than the unsupervised methods on MFT-fabric-v1 for fabric image retrieval.

In Figure 7, we present the precision-recall curves of the compared learning-based methods on the two testing-sets. The area under the curve of the proposed method is larger than that of the other methods, indicating that the retrieval performance of our method is better. Moreover, the results in Figure 7(b) again verify the robustness and generalization of our method. In Figure 8, we also present five retrieval examples using our method, in which the retrieval results and the queries are very similar in color and texture.
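As an illustration of how such a curve can be obtained for a single query, the sketch below records one (recall, precision) point for every cut-off position k in the Hamming ranking. This is one common way to draw precision-recall curves for hash codes and is stated here as an assumption; the toy codes, labels, and function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


def precision_recall_points(query_code: Sequence[int], query_label: str,
                            db_codes: List[Sequence[int]],
                            db_labels: List[str]) -> List[Tuple[float, float]]:
    """(recall, precision) after retrieving the top-k items, for k = 1..N."""
    total_relevant = sum(label == query_label for label in db_labels)
    if total_relevant == 0:
        return []  # recall is undefined without relevant items
    # Stable sort: ties in Hamming distance keep the database order.
    order = sorted(range(len(db_codes)), key=lambda i: hamming(query_code, db_codes[i]))
    points, hits = [], 0
    for k, i in enumerate(order, start=1):
        hits += db_labels[i] == query_label
        points.append((hits / total_relevant, hits / k))
    return points


# Toy 4-bit database with two categories (assumed data).
db_codes = [[1, 1, -1, -1], [1, 1, 1, -1], [-1, -1, 1, 1], [1, -1, -1, -1]]
db_labels = ["A", "A", "B", "B"]

points = precision_recall_points([1, 1, -1, -1], "A", db_codes, db_labels)
for recall, precision in points:
    print(f"recall={recall:.2f} precision={precision:.2f}")
```

Averaging such points over all queries gives a curve for one method; a curve that stays closer to the top-right corner encloses a larger area.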

Figure 7.

The Precision-Recall curve of the comparison methods on 72-original-query testing-set and augmented testing-set: (a) the PR curve on 72-original-query testing-set and (b) the PR curve on augmented testing-set.

Figure 8.

Retrieval results of five samples.

Conclusion

In this paper, we present an efficient fabric retrieval framework based on our previous work. Texture and color are the main features in fabric images (2D). The proposed framework consists of two modules: feature learning and hashing learning. During feature learning, we decouple the color information and texture information in the image. The decoupling is driven in a self-supervised manner through several pretext tasks. Two types of transformations are used in this study: texture transformations (rotation and scaling) and color transformations (color jitter). Then we introduce a Bi-half layer for hashing learning. The visualization results of the trained features and hash codes indicate that the proposed method performs well for the representation of fabric images. Experimental results demonstrate that our method outperforms other methods for fabric image retrieval, with a best mAP of 0.903. In real applications, our method can be deployed on the information management platforms of fabric manufacturing or trading companies. Given fabric samples, it lets customers, engineers, or salespersons search for similar historical fabric varieties, improving the efficiency of production design and trade interaction.

The proposed framework is suitable for most types of fabrics because it addresses the common issues in fabric image representation: color and texture. However, additional validation is still required for the broader range of fabric types beyond Mélange fabrics. In future research, we plan to collaborate with more textile manufacturing and trading companies to collect more images of different types of fabrics and, on this basis, to study and improve the universality of the fabric image retrieval framework.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The authors gratefully acknowledge the financial support from the National Natural Science Foundation of China (grant number 62202202), the Natural Science Foundation of Jiangsu Province (grant number BK20221061) and the Fundamental Research Funds for the Central Universities (grant number JUSRP121030).

ORCID iDs

Ning Zhang

Jun Xiang

References

1. Datta, Joshi, et al. Image retrieval: Ideas, influences, and trends of the new age. ACM Comput Surv Csur 2008; 40: 1–60.

2. Deselaers, Keysers, Ney. Features for image retrieval: an experimental comparison. Inf Retr 2008; 11: 77–107.

3. Rui, Huang, Chang S-F. Image retrieval: current techniques, promising directions, and open issues. J Vis Commun Image Represent 1999; 10: 39–62.

4. Bakar, Hitam, Yussof WNJHW. Content-based image retrieval using SIFT for binary and greyscale images. In: Presented at the conference 2013 IEEE International Conference on Signal and Image Processing Applications (ICSIPA), Melaka, Malaysia, October 2013, pp.83–88.

5. Sukthankar. PCA-SIFT: A more distinctive representation for local image descriptors. In: Presented at the proceedings of the 2004 IEEE computer society conference on computer vision and pattern recognition, 2004, CVPR 2004, Washington, DC, USA, June, 2004, p.II.

6. Ledwich, Williams. Reduced SIFT features for image retrieval and indoor localisation. In: Presented at the Australian conference on robotics and automation Australia, Citeseer, December, 2004, p.3.

7. Dubey, Singh RK. Multichannel decoded local binary patterns for content-based image retrieval. IEEE Trans Image Process 2016; 25: 4018–4032.

8. Liu, Guo J-M, Chamnongthai, et al. Fusion of color histogram and LBP-based features for texture image retrieval and classification. Inf Sci 2017; 390: 95–111.

9. Sotoodeh, Moosavi, Boostani. A novel adaptive LBP-based descriptor for color image retrieval. Expert Syst Appl 2019; 127: 342–352.

10. Douze, Jégou, Sandhawalia, et al. Evaluation of gist descriptors for web-scale image search. In: Presented at the proceedings of the ACM international conference on image and video retrieval, Santorini, Fira Greece, July, 2009, pp.1–8.

11. Jing, et al. A new method of printed fabric image retrieval based on color moments and gist feature description. Text Res J 2016; 86: 1137–1150.

12. Xie, Qin, Xiang, et al. An image retrieval algorithm based on gist and sift features. IJ Netw Secur 2018; 20: 609–616.

13. Krizhevsky, Sutskever, Hinton GE. Imagenet classification with deep convolutional neural networks. Adv Neural Inf Process Syst 2012; 25: 1097–1105.

14. Xia, Pan, Lai, et al. Supervised hashing for image retrieval via image representation learning. AAAI 2014; 28(1). DOI: 10.1609/aaai.v28i1.8952.

15. Yuan, Wang, Zhang, et al. Central similarity quantization for efficient image and video retrieval. In: 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2020, pp.3080–3089. https://openaccess.thecvf.com/content_CVPR_2020/html/Yuan_Central_Similarity_Quantization_for_Efficient_Image_and_Video_Retrieval_CVPR_2020_paper.html

16. Sun, et al. Deep supervised discrete hashing. IEEE T Image Processing 2018; 27(12): 5996–6009.

17. Jing, et al. Patterned fabric image retrieval using color and space features. J Fiber Bioeng Inform 2015; 8: 603–614.

18. Xiang, Pan, Gao. Fabric retrieval based on multi-task learning. IEEE Trans Image Process 2020; 30: 1570–1582.

19. Zhang, Liu, et al. Lace fabric image retrieval based on multi-scale and rotation invariant LBP. In: Presented at the proceedings of the 7th international conference on internet multimedia computing and service, Zhangjiajie, Hunan, China, August

20. Zhang, Xiang, Wang, et al. Image retrieval of wool fabric. Part I: based on low-level texture features. Text Res J 2019; 89: 4195–4207.

21. Xiang, Zhang, Pan, et al. Fabric image retrieval system using hierarchical search based on deep convolutional neural network. IEEE Access Pract Innov Open Solut 2019; 7: 35405–35417.

22. Zhang, Xiang, Wang, et al. Image retrieval of wool fabric. Part II: based on low-level color features. Text Res J 2020; 90: 797–808.

23. Schroff, Kalenichenko, Philbin. Facenet: A unified embedding for face recognition and clustering. In: Presented at the 2015 IEEE conference on computer vision and pattern recognition (CVPR), Boston, MA, USA, June 2015, pp.815–823.

24. van Gemert. Deep unsupervised image hashing by maximizing bit entropy. AAAI, December 2020. https://arxiv.org/abs/2012.12334

25. Peyré, Cuturi. Computational optimal transport. Found Trends Mach Learn 2018; 11: 355–607.

26. W-J, Wang, Kang W-C. Feature learning based deep supervised hashing with pairwise labels. arXiv:1511.03855. DOI: 10.48550/ARXIV.1511.03855.

27. T-T, Doan A-D, Cheung N-M. Learning to hash with binary deep neural network. In: European conference on computer vision, 2016, pp.219–234. https://arxiv.org/abs/1607.05140

28. Yang, Deng, Liu, et al. Semantic structure-based unsupervised deep hashing. In: Presented at the proceedings of the 27th international joint conference on artificial intelligence, Stockholm, Sweden, July

29. Shen, Liu, et al. Unsupervised deep hashing with similarity-adaptive and discrete optimization. IEEE Trans Pattern Anal Mach Intell 2018; 40: 3034–3044.

Fabric image retrieval based on decoupling of texture and color feature
