
Abstract

Deep learning has shown promise in textile defect detection, but its reliance on large high-quality labeled datasets poses challenges in real-world industrial applications. This study presents a novel unsupervised defect detection framework that effectively detects various types of texture defects with limited defect-free texture samples. The framework integrates texture and semantic information using a bilateral-branch network architecture (TSUBB-Net). Specifically, TSUBB-Net employs a weighted centering loss to cluster complex texture units, emphasizing semantic information within defect regions through a channel attention mechanism. It further fuses contextual semantic information to achieve precise defect localization. Thus, the efficient fusion method combines texture and semantic information, enabling the representation of complex texture structures and mitigating the impact of image acquisition quality on defect recognition. To evaluate the effectiveness of our proposed method, we build a unique dedicated database of textile defect image segmentation, which serves as the benchmark for textile defect detection. Experimental results demonstrate that TSUBB-Net surpasses state-of-the-art methods, exhibiting excellent performance in textile defect detection. The proposed framework holds significant potential for practical applications in the textile industry, improving defect detection capabilities.

Keywords

Deep learning, weighted centering loss, texture units, contextual semantic, bilateral branch

In industrial textile production, various types of surface defects often occur due to complex manufacturing processes. These defects manifest as localized regions with texture structure destruction or light intensity variation, which can seriously affect product quality. Over the past few decades, numerous methods for detecting surface texture defects have been developed to address these challenges. These methods can be broadly categorized into two classes based on their feature extraction strategies: traditional methods and deep learning methods.

Traditional detection methods encompass several processing methods for texture defects, roughly classified into five groups: spectral-based,¹ structured-based,² statistical-based,³ learning-based,⁴,⁵ and model-based.⁶ These methods have demonstrated relatively good detection accuracy for complex textile defects. However, the efficacy of these methods relies on acquiring a high-quality set of defective samples with manual labeling of each type of defect. Meanwhile, substantial computing resources are required to ensure detection accuracy. In addition, these traditional detection methods fail to provide a suitable manual feature annotation approach to deal with defect images with various texture surfaces.

Recently, texture defect detection has greatly benefited from the rapid development of deep learning, a method recognized for its efficient texture feature extraction. Within the context of deep learning, these techniques are often differentiated into supervised and unsupervised learning methods, depending on the labeling of the training data. For instance, Zhang et al.⁷ devised a semi-supervised convolutional network that merges spectral (graph convolutional network (GCN)) and spatial (convolutional neural network (CNN)) features with adaptive weight adjustments, thereby alleviating label scarcity and sample deficiency. Xiao et al.⁸ used a coordinate attention mechanism and spatial pyramid pooling to improve feature extraction, and combined a compound loss function to address challenges such as complex backgrounds and relatively small defect sizes. However, collecting sufficient defective samples in textile industrial applications remains challenging. Moreover, accurately distinguishing defects from complex background textures poses a significant difficulty.

In contrast, the unsupervised method offers the advantage of processing large quantities of unlabeled data and defining appropriate clusters, highlighting its potential to bring significant benefits to textile industrial production. For example, Schlegl et al.⁹ proposed the AnoGAN network, which effectively utilizes generative adversarial networks (GANs) in an unsupervised manner to distinguish defects by identifying significant deviations from the learned normal data distribution. Yi and Yoon¹⁰ proposed an advanced unsupervised approach that leverages the support vector data description to detect anomalies by capturing the global structure and local texture variations in the data distribution. However, practical industrial textile datasets are often composed of complex, intertwined textures, as shown in Figure 1. These textures are variable, encompassing both regular and irregular, dense and sparse patterns. Furthermore, defects in these textures can result in significant variations in appearance, including irregular changes in brightness, variable shapes and sizes, and low contrast. Previous unsupervised deep learning methods have faced limitations in terms of their performance. While they could efficiently represent basic texture features, their training relied on high-quality datasets. More adaptability and robustness are therefore needed to account for real-world environmental influence.

Figure 1.
Examples of industrial textile defects datasets: (a) irregular texture; (b) dense texture; (c) irregular shape; (d) fuzzy border; (e) low contrast and (f) uneven lighting.

In this study, we introduce a novel framework, TSUBB-Net, which innovatively integrates texture and semantic information via a bilateral-branch network architecture. The central objective of this work is to navigate the complex challenge of effectively representing intricate texture structures while evading limitations imposed by sample quality. Unlike conventional methods that primarily focus on texture or semantic information in isolation—resulting in potential omissions or inaccuracies—TSUBB-Net simultaneously handles these two dimensions, providing a more comprehensive solution. The network's unique operational approach entails accurate texture background image reconstruction to capture inherent texture patterns in the input. At the same time, it identifies substantial reconstruction errors in defective regions and accurately segments and locates these defects, thus ensuring thorough defect detection. Then, the residual images are obtained by subtracting the texture backgrounds from the input images. Finally, the defect detection result is obtained by fusing the residual images from the two branches via the feature fusion module (FFM). Our proposed network simultaneously tackles multi-scale texture feature clustering and semantic feature fusion. Therefore, it can efficiently and simultaneously detect various texture defects by utilizing only a small number of defect-free industrial textile surface samples. The main contributions of this study are threefold.
We propose an innovative bilateral-branch network framework that adeptly integrates texture and semantic information, adapts to complex textile textures, and minimizes the impact of image acquisition quality on defect recognition.

We present a novel texture-reconstruction-based deep clustering method that trains an autoencoder (AE) by minimizing the weighted centering loss of the texture unit and cluster center, utilizing soft group assignments as sample weights, consequently boosting the effectiveness and accuracy of texture classification.

We construct a self-attention fusion module (SAFM) that integrates contextual semantic information and enhances semantic prediction by incorporating additional low-level information near boundaries, thereby enabling precise localization of defect positions.

Related works

Autoencoders and their extensions

AEs are a type of artificial neural network model that has been widely applied in various fields, including shape retrieval,¹¹ target recognition,¹² and object detection,¹³ for learning effective encoding of unlabeled data. Autoencoders utilize the input data itself as a supervisory signal, enabling the neural network to learn a mapping relationship $F_{θ} : x \to x$ to achieve image reconstruction. The AE $F_{θ}$ can be divided into two subnetworks.

The first subnetwork, commonly referred to as the encoder, maps the input data $x \in R^{d}$ to a low-dimensional latent variable $z$ , achieving dimensionality reduction of the original data features:
$z = E_{ϕ} (x) \in R^{k}, k < d$
(1)
where $E_{ϕ} : R^{d} \to R^{k}$ represents the mapping function of the encoder. The second subnetwork, commonly known as the decoder, can be seen as the data decoding process, reconstructing the learned low-dimensional variable $z$ into approximate high-dimensional input data $x^{'}$ :
$x^{'} = D_{θ} (z) \in R^{d}$
(2)
where $D_{θ} : R^{k} \to R^{d}$ represents the mapping function of the decoder. The goal of the mapping relationship learned by the AE is to make the reconstructed image $x^{'}$ as similar as possible to the original input image $x$. Therefore, a reconstruction loss function $L (θ, ϕ)$, similar to the mean squared error, is generally used to evaluate the performance of the encoder $E_{ϕ}$ and the decoder $D_{θ}$:
$\min_{θ, ϕ} L (θ, ϕ), \quad \text{where} \quad L (θ, ϕ) = \frac{1}{N} \sum_{i = 1}^{N} {‖ x_{i} - D_{θ} (E_{ϕ} (x_{i})) ‖}_{2}^{2} + λ \cdot \sum_{w \in \{W, W^{'}\}} {‖ w ‖}_{F}$
(3)
where $N$ is the number of training samples, $x_{i}$ is the original image of the $i$th training data, $\sqrt{{‖ x_{i} - D_{θ} (E_{ϕ} (x_{i})) ‖}_{2}^{2}}$ is the reconstruction residual of the $i$th image, and $λ$ is a constant that balances the contribution of reconstruction and regularization terms. To enhance the non-linear representation in the encoder and decoder modules, AEs are improved by using deep networks called multilayer perceptron (MLP) methods. However, the fully connected (FC) layers used in AEs and MLP methods exhibit some limitations in the image processing process, including the loss of spatial information in images, imposing global features on each feature, and requiring redundant parameters.
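As a minimal sketch of Equation (3), assuming a purely linear encoder and decoder (an illustrative simplification, not the network used in this study), the objective can be evaluated with NumPy as follows. The names `ae_loss`, `W_enc`, `W_dec`, and `lam` are hypothetical and stand in for $E_{ϕ}$, $D_{θ}$, and $λ$.

```python
import numpy as np

def ae_loss(x, W_enc, W_dec, lam):
    """Equation (3) for a linear encoder z = x W_enc and decoder x' = z W_dec.

    x: (N, d) batch of flattened inputs; W_enc: (d, k); W_dec: (k, d).
    The linear maps are an assumption made only for illustration.
    """
    z = x @ W_enc                                        # (N, k) latent codes
    x_rec = z @ W_dec                                    # (N, d) reconstructions
    recon = np.mean(np.sum((x - x_rec) ** 2, axis=1))    # mean squared L2 residual
    reg = np.linalg.norm(W_enc, "fro") + np.linalg.norm(W_dec, "fro")
    return recon + lam * reg                             # reconstruction + regularization

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 6))                  # N = 8 samples, d = 6
W_enc = 0.1 * rng.normal(size=(6, 3))        # k = 3 < d
W_dec = 0.1 * rng.normal(size=(3, 6))
loss = ae_loss(x, W_enc, W_dec, lam=1e-3)
```

The first term shrinks as the reconstructions approach the inputs, while `lam` controls how strongly large weights are penalized.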

Recently, to overcome these limitations, convolutional autoencoder (CAE) networks¹⁴ have been widely applied in machine vision applications, combining the powerful feature representation capability of CNNs with the foundation of an MLP. Mei et al.¹⁵ proposed a multi-scale convolutional denoising autoencoder (MSCDAE) network that incorporates a certain amount of random noise before the sample input and can obtain powerful representations from the original noisy data. However, optimizing the noise level selection is required, and adding noise affects the model's training speed, requiring a balance between noise size and training efficiency. Zhou et al.¹⁶ proposed a semi-supervised fabric defect detection method based on variational autoencoders (VAEs) and Gaussian mixture models (GMMs). This method can more accurately construct defect region boundaries by synthesizing detection results from image content and latent space. However, a moderate number of labeled samples is needed to improve performance, which is greatly influenced by the quality and quantity of the labeled samples. In this study, aiming to accommodate actual industrial production and boost efficiency, a novel bilateral-branch network framework is proposed to integrate texture and semantic information to adapt to the complex and diverse irregular texture structures of textile images, while eliminating the impact of image acquisition quality on defect recognition.

Clustering methods

Over the past few decades, clustering methods have been proposed that can be broadly classified into four categories: connectivity-based,¹⁷ centroid-based,¹⁸ distribution-based,¹⁹ and density-based.²⁰ However, these methods have several limitations that need to be considered. For instance, clustering is sensitive to parameter choice, impacting the quality of grouping results. In addition, clustering can be computationally intensive, especially when dealing with large datasets or high-dimensional feature spaces. Another limitation is the dependence on specific assumptions or probability distributions, which may not hold for all data types. Lastly, clustering algorithms can be susceptible to outliers in the data, which may affect the accuracy and reliability of the clustering results. Recently, Qian et al.²¹ proposed a deep embedding-based unsupervised clustering method. This method first applies a self-organizing map algorithm for an initial round of clustering, mapping data samples from high-dimensional space to low-dimensional space. Subsequently, an efficient dense subspace clustering algorithm performs a second round of clustering, reducing human interference with the parameters. However, it still requires a complex preprocessing step for the images. Yang et al.²² proposed a novel center-constrained clustering method that detects anomalous features by learning the distribution of latent features in the intermediate layer of an AE. However, the accuracy of locating clustering centers depends largely on the initial $k$ -means of clustering results. In this study, considering the characteristics of textile images, which exhibit fine and abundant background textures with a certain degree of repetitiveness and disorder, we propose a soft assignment weighted centering loss to stabilize the feature space representation of fundamental texture units.

Proposed method

The training parameters and learning procedure of TSUBB-Net are described in the final subsection of this section. The overall architecture of TSUBB-Net is shown in Figure 2, and consists of five components: (I) the feature extraction network (FEN); (II) the feature decoding network (FDN); (III) the texture reconstruction branch (TRB); (IV) the semantic localization branch (SLB); and (V) the FFM.

Figure 2.
Overall architecture of the proposed TSUBB-Net method. TSUBB-Net consists of a feature extraction network (FEN), a feature decoding network (FDN), a texture reconstruction branch (TRB), a semantic localization branch (SLB), and a feature fusion module (FFM). The original input image undergoes the FEN and FDN to extract and reconstruct multi-scale information. The TRB and SLB combine the bilateral branch at each scale to process texture and semantic information. In the testing stage, the reconstructed background image is fused by the FFM to obtain the final defect image.

Feature extraction network and feature decoding network

The FEN and FDN serve the respective purposes of directly extracting feature maps with varying receptive fields from the entire input image and reconstructing original images from feature maps at different scales, thereby maximizing the representation of information within the textile images.

Feature extraction network

Here, we propose an enhanced textile image feature extraction model that adopts ConvNeXt²³ as the primary FEN. This technique is renowned for its high-quality representation in image feature extraction. The proposed module comprises four layers, which can capture a diverse range of perceptual scales. Each layer consists of various stacked blocks arranged in a 1:3:1:1 configuration, which deviates from the original ConvNeXt model's 1:1:3:1 configuration. Texture features in high-layer spaces are unordered and can extract irrelevant information with conventional convolutional layers operating with sliding windows. In contrast, low-layer spaces contain a rich texture and fine-grained information that aids in uncovering local texture structures. However, low feature dimensions can introduce significant noise impairing performance. Our module effectively captures texture features at various levels to enhance feature representation. To optimize model performance while mitigating over-encoding-induced reconstruction distortion, we adjust the quantities to 2:6, instead of the original ConvNeXt model's 3:9 configuration. Ultimately, the input image $I_{input}$ is mapped to latent features $Z \in R^{\frac{W}{8} \times \frac{H}{8} \times C}$ .

Feature decoding network

Within the FEN, outputs from each layer are processed through a bilateral branch, generating comprehensive feature representations.

As depicted in Figure 3, the original image $I_{input}$ undergoes processing via the function $E_{ϕ} (\cdot)$ of the FEN, and then the FDN reconstructs the background image $I_{out}$ . The procedure can be formulated as follows:
$I_{out} = F_{θ} (I_{input})$
(4)
where $F_{θ}$ denotes the composite function that combines the FEN, TRB, SLB, and FDN, while $θ$ represents their parameters.

Figure 3.
Architecture on the fourth scale in TSUBB-Net.

Texture reconstruction branch

Current unsupervised texture defect detection methods, based on deep CNNs, are significantly influenced by the initial centroid choice in the k-means clustering process. Moreover, these strategies estimate texture information by analyzing data distribution, resulting in redundant and entangled positive samples. To address this issue, we propose a novel clustering method for texture reconstruction to attain stable texture unit representations in high-dimensional feature spaces, thereby enhancing the discriminative power for complex textures.

As illustrated in Figure 4, the TRB consists of encoding, texture clustering, and decoding modules. The entire process can be represented with the following mathematical expressions:
$F_{in} = σ (W \circ Z + b)$
(5)

$F_{out} = Ψ (F_{in})$
(6)

$Z^{'}_{Texture} = σ (W^{'} \circ F_{out} + b^{'})$
(7)
where the symbol $\circ$ represents the convolution operator; $Z$ and $Z^{'}_{Texture}$ denote the input and output feature vectors for the TRB, respectively; $σ$ represents the activation function GELU²³; $Ψ (\cdot)$ characterizes the clustering process; and $F_{in}, F_{out} \in R^{C_{code} \times 1}$ represent the code feature vectors before and after passing through the clustering module, with $C_{code}$ set to 20. In addition, $W$ and $W^{'}$ are the parameter matrices for the convolution kernels in the encoding and decoding modules, respectively, while $b$ and $b^{'}$ correspond to the bias vectors in the encoding and decoding modules.

Figure 4.
Schematic diagram of the texture reconstruction branch. Textures of the same class are close to each other, and textures of different classes are far away from each other.

Standard texture images are characterized by basic microstructures, herein referred to as 'textures'. As illustrated in Figure 4, the texture code feature vector $F_{j} = \{f_{1}, f_{2}, f_{3}, \dots, f_{N}\}$ is obtained through the encoding module, where $n \in (1, 2, 3, 4)$ represents different scales and $f_{i} \in R^{C_{code} \times 1}, i \in (1, 2, \dots, W_{n} \times H_{n})$ can be considered as a set of local features. However, in the texture images of industrial products, normal and defective textures often exhibit certain similarities. As displayed in Figure 4(a), texture code feature vectors are obtained through the encoding module, with different scales representing a set of local features. To derive stable and highly discriminative texture units, our study suggests employing a weighted reconstruction and center updating approach during the training phase to enhance the stability and uniqueness of these fundamental texture units. Our texture clustering module further stabilizes the representation of diverse texture information based on these feature codes. To achieve this objective, various types of local features in the texture image should be separated and clustered into $K$ classes, as exemplified in Figure 4(b). This study proposes training the FEN and FDN in $K$ consecutive runs, where, in the $k$th run, the FEN and FDN focus on the reconstruction and centering of texture units more likely to belong to the $k$th basic texture class. Notably, texture clustering is an unsupervised task, and the identities of texture units are unknown at the outset. Consequently, during the $k$th run, the texture clustering module uses $z = [z^{(1)}, z^{(2)}, \dots, z^{(K)}]$ to represent the class centers, where $z^{(k)} \in R^{C_{code} \times 1}$.
The squared Euclidean distance between $f_{i}$ and $z^{(k)}$ (obtained in previous training iterations), that is, ${‖ f_{i} - z^{(k)} ‖}^{2}$, serves as a measure of the similarity of texture units, indicating the membership of $f_{i}$ in the $k$th basic texture class in $z$ space. Class membership is employed as the sample weight in the $(k + 1)$th run. Specifically, texture units closer to the group center $z^{(k)}$ are assigned higher weights and therefore contribute more during the $k$th run. The membership of the $i$th texture unit in $z$ space to the $k$th texture class is denoted as $p_{i k}^{'}$:
$p_{i k}^{'} = \frac{1 / {‖ f_{i} - z^{(k)} ‖}_{2}^{2 / (m - 1)}}{\sum_{j = 1}^{K} 1 / {‖ f_{i} - z^{(j)} ‖}_{2}^{2 / (m - 1)}}$
(8)
where $m$ represents the fuzziness degree, which is set to 1.5 in all experiments. This setting ensures that different texture units are more focused on obtaining membership within a specific class. Smaller intra-class distances and larger inter-class distances imply a greater disparity between the feature probabilities of different classes. To achieve this, we pass $p_{i k}^{'}$ through a softmax function to obtain the final $p_{i k}$:
$p_{i k} = \frac{\exp (p_{i k}^{'})}{\sum_{j = 1}^{K} \exp (p_{i j}^{'})}$
(9)
where $p_{i k}$ will also serve as an essential weight in the weighted centering loss, aiming to enhance the discriminative capability of texture units, which will be elaborated upon in the subsequent section. Following this, during every $T$ training iterations (set to 2 in our experiments), the texture class centers are updated to the weighted average of texture units in the $z$ space, meaning that samples closer to $z^{(k)}$ contribute more to the update:
$z^{(k)} = \frac{\sum_{f_{i} \in F_{j}} p_{i k}^{m} f_{i}}{\sum_{f_{i} \in F_{j}} p_{i k}^{m}}$
(10)
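The membership and center-update steps in Equations (8) to (10) can be sketched in NumPy as below. This is an illustrative sketch, not the authors' implementation: the function names `memberships` and `update_centers`, the small constant `eps` that guards against division by zero, and the toy data are all assumptions.

```python
import numpy as np

def memberships(F, Z, m=1.5, eps=1e-12):
    """Equations (8) and (9). F: (N, C) texture units; Z: (K, C) class centers."""
    d2 = ((F[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
    inv = 1.0 / np.maximum(d2, eps) ** (1.0 / (m - 1.0))      # 1 / ||f_i - z_k||^(2/(m-1))
    p_raw = inv / inv.sum(axis=1, keepdims=True)              # Equation (8)
    e = np.exp(p_raw - p_raw.max(axis=1, keepdims=True))      # numerically stable softmax
    return e / e.sum(axis=1, keepdims=True)                   # Equation (9)

def update_centers(F, P, m=1.5):
    """Equation (10): membership-weighted mean of the texture units."""
    Wt = P ** m                                               # (N, K) weights p_ik^m
    return (Wt.T @ F) / Wt.sum(axis=0)[:, None]               # (K, C) new centers

# Toy data: two tight groups of 2-D "texture units" and two initial centers.
F = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
Z = np.array([[0.0, 0.1], [5.0, 5.1]])
P = memberships(F, Z)
Z_new = update_centers(F, P)
```

Each row of `P` sums to one, and units near a center receive the larger membership for that center, so they dominate its update.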

Semantic localization branch

In textile defect detection, most existing methods primarily focus on the reconstruction of defects based on texture features. However, some samples may lack clear texture information due to limited industrial sampling conditions. In addition, interference from noise points can easily lead to false detections in the reconstructed images, directly reducing the precision of defect detection. In this study, we propose an SLB that includes a channel attention module (CAM) and a SAFM. This branch aims to accurately localize textile defects based on semantic information.

Channel attention module

High-level features possess rich semantic information, with each channel mapping considered a class-specific response. These responses affect the final semantic prediction to varying degrees. We improve the CAM based on the Convolutional Block Attention Module (CBAM).²⁴ CBAM is an attention mechanism module that combines both spatial and channel attention mechanisms. Our enhanced approach selectively emphasizes the importance of defect information to better locate defects. Firstly, it obtains the global average information and global maximum information of each channel, denoted $GAP_{c}$ and $GMP_{c}$, which characterize the base texture information and defect information, respectively. On this basis, the absolute value of their difference is weighted by $GAP_{c}$ to give the importance of the $c$th channel:
$I_{c} = | GAP_{c} - GMP_{c} | \times GAP_{c}$
(11)
where $I_{c}$ represents the importance of the $c$ th channel. Consequently, the attention weight for the $c$ th channel can be expressed as follows:
$w_{c} = δ (W_{1} σ (W_{0} (I_{c})))$
(12)
where the symbol $σ$ stands for the activation function GELU²⁴, while $W_{0} \in R^{C \times \frac{C}{r}}$ and $W_{1} \in R^{\frac{C}{r} \times C}$ represent the FC layers, with the reduction ratio $r$ set to 2 in our experiments. The symbol $δ$ denotes the output activation function, for which the sigmoid function is used. By employing the CAM, the feature map is optimized to foreground semantically rich defect information, concurrently suppressing less relevant normal information. This focus on semantic cues enables a more precise and meaningful localization of textile defects.
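A minimal NumPy sketch of Equations (11) and (12) is given below. It is hedged in several ways: the weights are random rather than trained, GELU uses the common tanh approximation, and the final channel-wise multiplication of the feature map by $w_{c}$ is an assumption about how the attention weights are applied. The names `channel_attention`, `gelu`, and `sigmoid` are illustrative.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(Z, W0, W1):
    """Z: (H, W, C) feature map; W0: (C, C // r); W1: (C // r, C)."""
    gap = Z.mean(axis=(0, 1))                 # global average per channel, (C,)
    gmp = Z.max(axis=(0, 1))                  # global maximum per channel, (C,)
    importance = np.abs(gap - gmp) * gap      # Equation (11)
    w = sigmoid(gelu(importance @ W0) @ W1)   # Equation (12)
    return Z * w, w                           # assumed: reweight channels by broadcasting

rng = np.random.default_rng(0)
C, r = 8, 2
Z = rng.normal(size=(4, 4, C))
W0 = 0.1 * rng.normal(size=(C, C // r))
W1 = 0.1 * rng.normal(size=(C // r, C))
Z_att, w = channel_attention(Z, W0, W1)
```

Channels whose average and maximum responses differ strongly receive a larger importance score before the two FC layers map it to a weight in (0, 1).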

Self-attention fusion module

High-dimensional feature maps contain semantically rich information but may lose some fine-grained details, while low-dimensional feature maps have a higher resolution. Combining features from different layers is crucial for predicting semantic information. However, effective alignment of different feature maps can be challenging due to the complex texture structure. To address this limitation, we propose the SAFM, which computes the correlation between pixels in different feature maps using matrix multiplication, and uses the correlation as a weight vector for low-dimensional features. This approach fuses rich semantic information from a higher-dimensional perspective into the low-dimensional features, enhancing their ability to discriminate defects.

As illustrated in Figure 5, the SAFM receives information from two dimensions. One is the feature map $Z^{'} \in R^{W \times H \times C}$ with more prominent defect information obtained after the channel weighting module. The other is the high-dimensional feature derived from $Z^{'}$ through the ASPP²⁵ module:
${Z^{'}}_{ASPP} = f_{ASPP} (Z^{'})$
(13)
where $f_{ASPP} (\cdot)$ is a function representing the application of the ASPP module to the low-dimensional feature map $Z^{'}$ to obtain the high-dimensional feature map ${Z^{'}}_{ASPP} \in R^{W \times H \times C}$ . In this context, the high-dimensional and low-dimensional features refer to the further high-dimensional and low-dimensional information obtained from each scale through the FEN:
$A = reshape (Conv (Z^{'}), (C^{'}, N))$
(14)

$B = reshape (Conv ({Z^{'}}_{ASPP}), (C^{'}, N))$
(15)
where $A, B \in R^{C^{'} \times N}$ represent the reshaped feature maps. The operation $Conv (\cdot)$ denotes a 1 × 1 convolution, while $C^{'}$ signifies the compressed channel count (set to $\frac{C}{2}$ in our experiments). This reduces dimensionality by compressing the channel count of feature maps, integrates information across various channels by effectively merging features at different scales, and preserves spatial resolution, thereby maintaining fine-grained details across different scale feature maps. The variable $N$ , equal to $W \times H$ , represents the number of pixels in the feature map. Our methodology entails the computation of pixel-level correlations, presented as follows:
$s_{j i} = \frac{\exp (A_{i} \cdot B_{j}^{'})}{\sum_{i = 1}^{N} \exp (A_{i} \cdot B_{j}^{'})}$
(16)
where $s_{j i}$, an element of the spatial attention map $S \in R^{N \times N}$, quantifies the correlation between the $i$th position in the lower-level feature map and the $j$th position in the higher-level feature map. Here, $N$ represents the total number of pixels. In addition, $B^{'} \in R^{N \times C^{'}}$ denotes the transpose of the feature map $B$. A higher similarity between the feature representations of two pixel positions results in a greater correlation between them. Subsequently, the feature map $Q \in R^{W \times H \times C}$ is generated by performing matrix multiplication between the spatial attention map $S$ and $B$:
$Q_{j} = \sum_{i = 1}^{N} s_{j i} B_{i}$
(17)
where $Q_{j}$ represents the self-attention weight of the higher-dimensional feature map relative to the lower-dimensional feature map at the $j$th position, while $B_{i}$ denotes the feature at the $i$th position of $B$. Ultimately, the final output $O \in R^{W \times H \times C}$ is obtained by performing an element-wise summation operation between $Q$ and ${Z^{'}}_{ASPP}$, as illustrated below:
${(Z^{'}_{Semantics})}_{j} = α Q_{j} + B_{j}$
(18)
where the parameter $α$ is initialized to 0 and gradually learns to assign more weight during the training process. The final feature, denoted as $Z^{'}_{Semantics}$, at each position is the weighted sum of features from all positions in the higher-dimensional feature map.
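The attention computation in Equations (14) to (18) can be sketched with NumPy as follows. This is an illustration under stated assumptions: the 1 × 1 convolutions are written as per-pixel matrix products with hypothetical weights `Wa` and `Wb`, the ASPP output is replaced by a random array, and the output keeps $C^{'}$ channels because any projection back to $C$ channels is not specified in the text.

```python
import numpy as np

def safm(Z_low, Z_high, Wa, Wb, alpha):
    """Z_low, Z_high: (H, W, C) feature maps; Wa, Wb: (C, C') 1x1-conv weights."""
    H, W, C = Z_low.shape
    N = H * W
    A = (Z_low.reshape(N, C) @ Wa).T               # (C', N), Equation (14)
    B = (Z_high.reshape(N, C) @ Wb).T              # (C', N), Equation (15)
    logits = B.T @ A                               # (N, N); logits[j, i] = A_i . B_j
    logits = logits - logits.max(axis=1, keepdims=True)
    S = np.exp(logits)
    S = S / S.sum(axis=1, keepdims=True)           # Equation (16): softmax over i
    Q = S @ B.T                                    # (N, C'); Q_j = sum_i s_ji B_i, Equation (17)
    out = alpha * Q + B.T                          # Equation (18)
    return out.reshape(H, W, -1), S

rng = np.random.default_rng(0)
H, W, C = 4, 4, 8
Z_low = rng.normal(size=(H, W, C))                 # stands in for Z'
Z_high = rng.normal(size=(H, W, C))                # stands in for the ASPP output
Wa = 0.1 * rng.normal(size=(C, C // 2))
Wb = 0.1 * rng.normal(size=(C, C // 2))
out0, S = safm(Z_low, Z_high, Wa, Wb, alpha=0.0)   # alpha starts at 0
out1, _ = safm(Z_low, Z_high, Wa, Wb, alpha=0.5)
```

With `alpha=0.0` the module passes `B` through unchanged, which matches the stated initialization. A non-zero `alpha` mixes in the attention-weighted features.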

Figure 5.
Schematic diagram of the self-attention fusion module structure.

Feature fusion module

To maximize the utilization of texture and semantic information, we propose an innovative FFM. This module integrates the dimensionality-reduced multi-scale images from separate branches, thus capturing information across diverse abstraction levels while mitigating the risk of overfitting. It employs a unique AND operation to integrate the defect detection results from these branches, culminating in a comprehensive final detection outcome.

The FDN reconstructs the feature maps $Z^{'}_{Texture}$ and $Z^{'}_{Semantics}$ from the bilateral branches into background images $I_{Texture}^{n}$ and $I_{Semantics}^{n} \in R^{W \times H \times C}$ , representing the reconstructed images at scale $n$ , where $n \in (1, 2, 3, 4)$ . To connect the texture background maps at different scales, capture information from various abstraction levels, comprehensively represent images, and simultaneously minimize the risk of overfitting or redundancy in feature maps, thereby enhancing efficiency, the reconstructed images from each scale are fused. Initially, this is done by concatenating images along the channel dimension:
$I_{ConT} = Concat (I_{Texture}^{1}, I_{Texture}^{2}, I_{Texture}^{3}, I_{Texture}^{4}) \in R^{W \times H \times 4 C}$
(19-1)

$I_{ConS} = Concat (I_{Semantics}^{1}, I_{Semantics}^{2}, I_{Semantics}^{3}, I_{Semantics}^{4}) \in R^{W \times H \times 4 C}$
(19-2)

A 1 × 1 convolution is then applied to reduce the channel dimension from $4C$ to $C$:
$I_{Texture} = Conv (I_{ConT}) \in R^{W \times H \times C}$
(20-1)

$I_{Semantics} = Conv (I_{ConS}) \in R^{W \times H \times C}$
(20-2)
where $Concat (\cdot)$ signifies the concatenation function and $Conv (\cdot)$ refers to the 1 × 1 convolution function. The resulting $I_{Texture}$ and $I_{Semantics}$ are the fused representations that integrate information from all considered scales. By subtracting the original input image, $I_{input}$, the residual maps of the two branches are obtained:
$I_{ResT} = abs (I_{Texture} - I_{input})$
(21-1)

$I_{ResS} = abs (I_{Semantics} - I_{input})$
(21-2)
where $I_{ResT}, I_{ResS} \in R^{W \times H \times C}$ represent the residual images derived from the bilateral branches, while $abs (\cdot)$ denotes the absolute value operation. Furthermore, we incorporate median filtering and morphological opening operations to enhance the reconstructed image quality. Subsequently, defect segmentation is performed on the residual images. In this step, we follow the strategy presented by Yang et al.²² and employ a double-threshold segmentation, resulting in the definition of the defect map as follows:
$Γ (i, j) = \begin{cases} 0, & \text{if } T_{lb} < I_{Res} (i, j) < T_{ub} \\ 1, & \text{otherwise} \end{cases}$
(22)
where $i \in (1, 2, \dots, W)$ and $j \in (1, 2, \dots, H)$. Here, $T_{lb}$ and $T_{ub}$ represent the lower and upper bounds of the segmentation thresholds, respectively:
$T_{lb} = μ - ε σ$
(23-1)

$T_{ub} = μ + ε σ$
(23-2)
where $μ$ and $σ$ represent the mean and standard deviation, respectively, computed based on the training samples. Next, the defects from both branches are fused. Let $Γ_{Semantics} \in R^{W \times H \times C}$ and $Γ_{Texture} \in R^{W \times H \times C}$ denote the binary defect maps obtained from the SLB and TRB, respectively, via Equation (22). We propose a novel fusion method that complements the strengths of both branches. Firstly, it identifies the defect locations in $Γ_{Semantics}$:
$D_{b} = \{(i, j) \mid Γ_{Semantics} (i, j) > 0\}$
(24)
where $i \in (1, 2, \dots, W)$ and $j \in (1, 2, \dots, H)$, while $D_{b}$ denotes the defect locations in $Γ_{Semantics}$. Morphological dilation is then applied to $D_{b}$ to expand the defect area:
$E_{b} = dilate (D_{b}, K, ζ)$
(25)
where $dilate (\cdot)$ represents the morphological dilation operation, $K$ is a kernel of size 5 × 5, and $ζ$ is the number of dilation iterations (three in this case). Finally, the defect locations in $Γ_{Texture}$ are combined with $E_{b}$ through an AND operation to give the final defect locations:
$I_{defects} = \{(i, j) \mid Γ_{Texture} (i, j) > 0 \land E_{b} (i, j) > 0\}$
(26)
where $\land$ denotes the AND operation, and $I_{defects} \in R^{W \times H \times C}$ represents the final defect map. This residual image fusion strategy allows for precise defect segmentation based on accurate defect localization.
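The segmentation and fusion steps of Equations (22) to (26) can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the released TSUBB-Net code: it assumes single-channel residual maps, omits the median filtering and morphological opening, and uses placeholder values for $μ$, $σ$, and $ε$. The function names `double_threshold`, `dilate`, and `fuse` are hypothetical.

```python
import numpy as np


def double_threshold(residual, mu, sigma, eps):
    """Eqs. (22)-(23): 0 where T_lb < residual < T_ub, 1 otherwise."""
    t_lb = mu - eps * sigma
    t_ub = mu + eps * sigma
    inside = (residual > t_lb) & (residual < t_ub)
    return (~inside).astype(np.uint8)


def dilate(mask, ksize=5, iterations=3):
    """Eq. (25): binary dilation with a ksize x ksize square kernel."""
    out = np.asarray(mask).astype(bool)
    r = ksize // 2
    h, w = out.shape
    for _ in range(iterations):
        padded = np.pad(out, r, mode="constant", constant_values=False)
        grown = np.zeros_like(out)
        # OR together every shifted copy of the mask covered by the kernel.
        for dy in range(ksize):
            for dx in range(ksize):
                grown |= padded[dy:dy + h, dx:dx + w]
        out = grown
    return out.astype(np.uint8)


def fuse(gamma_semantics, gamma_texture):
    """Eqs. (24)-(26): keep texture-branch defects inside the dilated semantic map."""
    e_b = dilate(gamma_semantics > 0)
    return ((gamma_texture > 0) & (e_b > 0)).astype(np.uint8)


# Toy single-channel residuals (placeholder values, not from the paper).
res_texture = np.full((32, 32), 0.10)
res_semantics = np.full((32, 32), 0.10)
res_texture[15:18, 15:18] = 0.90   # defect seen in full by the texture branch
res_semantics[16, 16] = 0.90       # same defect, located by the semantic branch
res_texture[0, 31] = 0.90          # isolated noise, texture branch only

mu, sigma, eps = 0.10, 0.05, 3.0   # assumed training statistics and coefficient
gamma_texture = double_threshold(res_texture, mu, sigma, eps)
gamma_semantics = double_threshold(res_semantics, mu, sigma, eps)
defects = fuse(gamma_semantics, gamma_texture)
```

In this toy case the isolated noise pixel survives thresholding in the texture map but is discarded by the AND with the dilated semantic map, while the full 3 × 3 defect shape from the texture branch is kept.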

Model training strategy

To accurately reconstruct texture background images, stably represent complex texture units, and maximize the utilization of defect information, we propose a novel and effective loss function composed of a background reconstruction loss and a weighted centering loss at multiple scales to train TSUBB-Net. The background reconstruction loss aims to minimize the gap between the background reconstruction image and the original input image, enabling TSUBB-Net to fully learn the global and local information of the positive fabric samples. We employ the mean squared error as the metric for training:
$L_{r}^{branch} = \sum_{n = 1}^{4} \frac{1}{H_{n} \times W_{n}} ({‖ I_{Input} - I_{Texture}^{n} ‖}^{2} + {‖ I_{Input} - I_{Semantics}^{n} ‖}^{2})$
(27-1)

$L_{r}^{Con} = \frac{1}{H \times W} ({‖ I_{Input} - I_{ConT} ‖}^{2} + {‖ I_{Input} - I_{ConS} ‖}^{2})$
(27-2)

$L_{r} = L_{r}^{Con} + L_{r}^{branch}$
(27-3)
where $n \in (1, 2, 3, 4)$ represents different scales and $I_{Texture}^{n}$ and $I_{Semantics}^{n}$ denote the images reconstructed by the bilateral branches at scale $n$ , respectively. In the clustering module, the weighted centering loss function serves to enhance the discriminability of texture units and steer the focus of the TSUBB-Net towards reconstructing one texture category at a time. Through this methodology, it is ensured that texture units closer to their class center significantly influence the overall loss, leading to an effective representation of texture units with an emphasis on their unique textural characteristics. This allows us to maintain a high level of discrimination between different texture classes, thereby boosting the representational power of our model. This weighted centering loss function is defined during the $k$ th iteration based on Equations (8) and (9):
${L_{c}^{n}}^{(k)} = \sum_{f_{i} \in F} p_{i k}^{m} {‖ f_{i} - z^{(k)} ‖}_{2}^{2}$
(28-1)

$L_{c} = \sum_{n = 1}^{4} \sum_{k = 1}^{K} γ_{n} {L_{c}^{n}}^{(k)}$
(28-2)
where ${L_{c}^{n}}^{(k)}$ represents the weighted distance between the texture units and the texture center when focusing on reconstructing the $k$ th texture category at scale $n$ . Here, $p_{i k}$ assigns more influence to texture units closer to the center of their respective texture classes during training, thereby enhancing the discriminative capability of our model for different texture classes. The clustering weights at the four scales are set to $γ_{n} = [0.005, 0.01, 0.05, 0.1]$ because encoded features at deeper layers exhibit greater discriminability and thus warrant more weight. Combining the reconstruction and centering losses above, the final loss function for TSUBB-Net is expressed as follows:
$L = L_{r} + L_{c} + λ \sum_{w \in W} {‖ w ‖}_{F}$
(29)
where $W$ represents the set of parameter matrices in the TSUBB-Net model, $w$ is one such matrix, and ${‖ \cdot ‖}_{F}$ denotes the Frobenius norm. To limit the model's complexity, a penalty factor $λ$ is introduced with a value in the range of 0 < $λ$ < 1 (set to 0.001 in our experiments).
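To make the structure of Equations (27) to (29) concrete, the following NumPy sketch evaluates each term on tiny hand-made arrays. It is a hedged illustration, not the training code: the memberships $p_{i k}$ and the exponent $m$ come from Equations (8) and (9), which are not reproduced in this section, so they are passed in as plain inputs, and `m=2.0` is a placeholder default. All function names are hypothetical.

```python
import numpy as np


def reconstruction_term(target, recon_texture, recon_semantics):
    """One summand of Eq. (27): squared errors of both branches over H x W."""
    h, w = target.shape[:2]
    err_t = np.sum((target - recon_texture) ** 2)
    err_s = np.sum((target - recon_semantics) ** 2)
    return float((err_t + err_s) / (h * w))


def centering_term(features, centers, memberships, m=2.0):
    """Eq. (28-1), summed over the K centers at one scale.

    features: (N, D) texture units, centers: (K, D), memberships: (N, K) = p_ik.
    """
    diff = features[:, None, :] - centers[None, :, :]   # (N, K, D)
    sq_dist = np.sum(diff ** 2, axis=-1)                # (N, K)
    return float(np.sum(memberships ** m * sq_dist))


def total_loss(l_r, centering_per_scale, weights, lam=0.001,
               gammas=(0.005, 0.01, 0.05, 0.1)):
    """Eq. (29): reconstruction + scale-weighted centering + Frobenius penalty."""
    l_c = sum(g * c for g, c in zip(gammas, centering_per_scale))
    penalty = lam * sum(np.linalg.norm(w) for w in weights)  # Frobenius for 2-D
    return float(l_r + l_c + penalty)


# Hand-checkable toy inputs (not data from the paper).
target = np.zeros((2, 2))
l_r = reconstruction_term(target, np.ones((2, 2)), np.zeros((2, 2)))  # 4 / 4

features = np.array([[0.0, 0.0], [2.0, 0.0]])
centers = np.array([[0.0, 0.0]])
memberships = np.array([[1.0], [0.5]])
l_c_scale = centering_term(features, centers, memberships)  # 0.25 * 4

loss = total_loss(l_r, [l_c_scale] * 4, [np.array([[3.0, 4.0]])])
```

With these inputs the reconstruction term is 1.0, each per-scale centering term is 1.0, the four scale weights sum to 0.165, and the penalty is 0.001 × 5, so the total is 1.17.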

As the background reconstruction loss and the weighted centering loss of TSUBB-Net pull the optimization in opposite directions, TSUBB-Net adopts a two-stage optimization strategy. Firstly, the part of the model belonging to the feature clustering module is executed on its own, which gives the model the ability to express image features. Then, standard k-means clustering is performed in the feature space to obtain $K$ initial centroids $z$ ( $K$ is set to 16 in our experiments). With these initializations completed, TSUBB-Net is trained through the joint optimization of multi-scale background reconstruction and texture feature clustering.
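The centroid initialization in the first stage can be illustrated with a plain Lloyd's k-means in NumPy. This is only a sketch under stated assumptions: the feature vectors are random placeholders standing in for encoded texture units, the feature dimension of 8 is arbitrary, and `kmeans_init` is a hypothetical helper rather than a function from the TSUBB-Net repository.

```python
import numpy as np


def assign(features, centroids):
    """Index of the nearest centroid (squared Euclidean distance) per feature."""
    dist = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dist.argmin(axis=1)


def kmeans_init(features, k=16, n_iter=20, seed=0):
    """Standard k-means (Lloyd's algorithm); returns (k, D) centroids and labels."""
    rng = np.random.default_rng(seed)
    # Start from k distinct feature vectors chosen at random.
    start = rng.choice(len(features), size=k, replace=False)
    centroids = features[start].copy()
    for _ in range(n_iter):
        labels = assign(features, centroids)
        for j in range(k):
            members = features[labels == j]
            if len(members):  # keep the previous centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, assign(features, centroids)


# Placeholder features: 500 texture units with 8 dimensions each.
features = np.random.default_rng(1).normal(size=(500, 8))
z, labels = kmeans_init(features, k=16)
```

The resulting `z` would then serve as the initial centers for the weighted centering loss during joint training.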

Experimental evaluation and discussion

Datasets and evaluation metrics

Dataset overview

In this study, as depicted in Figure 6(a), raw textile images are acquired with a charge-coupled device (CCD) camera. Some of the images are captured under varied illumination conditions, which introduces variations in scale, orientation, and lighting, and the complex pattern textures differ markedly from one another. This yields a challenging defect detection dataset that is consistent with samples encountered in industrial production. The textile defect segmentation image dataset (DHU-DS1100) comprises defect-free (normal) samples and five types of defective samples (broken picks, holes, stains, cracks, and felters). Some representative samples, both normal and defective, are illustrated in Figure 6(b).

Figure 6.
(a) The charge-coupled device camera utilized for fabric defect detection and (b) Normal and defective samples collected by the test bench.

The DHU-DS1100 dataset is composed of 1100 samples, inclusive of 1000 defective images (with 200 images per defect type) and 100 defect-free textile images. For the sake of convenient processing, all images have been standardized to a resolution of 224 × 224. Subsequent experimentation involved validation of our model against this database, serving the dual purpose of evaluating both its robustness and accuracy.

Evaluation metrics

To evaluate the effectiveness of the TSUBB-Net model, several multi-label metrics are selected. We use the criteria of $Recall$ , $Precision$ , and $F 1 - Measure$ to evaluate the performance of the model:
$Recall = \frac{T P}{T P + F N} \times 100 \%$
(30)

$Precision = \frac{T P}{T P + F P} \times 100 \%$
(31)

$F 1 - Measure = \frac{2 \times Precision \times Recall}{Precision + Recall}$
(32)
where $F P$ refers to the proportion of defect areas erroneously detected in the background region, $T P$ signifies the proportion of defect areas correctly detected in the defect region, and $F N$ represents the proportion of defect areas undetected in the defect region. The $F 1 - Measure$ serves as a comprehensive evaluator, employing both $Precision$ and $Recall$ metrics simultaneously. All comparative experiments are conducted on the same machine, furnished with an Intel(R) i9-10900X CPU @ 3.70GHz processor and Quadro RTX 6000, CUDA 11.7, and cuDNN v8. The TSUBB-Net method is implemented in Python and will be open-sourced at the following address: https://github.com/WaterrrForever/TSUBB-Net/tree/master.
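As a small worked example of Equations (30) to (32), the sketch below computes the three criteria from a predicted and a ground-truth binary defect mask. It assumes that $T P$ , $F P$ , and $F N$ are measured as pixel counts, which is one common way to realize the area proportions described above; the helper names are hypothetical and the scores are returned as fractions rather than percentages.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, and FN pixels for two binary masks given as nested lists."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1      # defect pixel correctly detected
            elif p and not t:
                fp += 1      # background pixel flagged as defect
            elif t and not p:
                fn += 1      # defect pixel missed
    return tp, fp, fn


def scores(tp, fp, fn):
    """Recall, Precision, and F1-Measure as fractions in [0, 1]."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return recall, precision, f1


# Toy 3 x 4 masks: four true defect pixels, three found, one false alarm.
truth = [[0, 1, 1, 0],
         [0, 1, 1, 0],
         [0, 0, 0, 0]]
pred = [[0, 1, 1, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 1]]

tp, fp, fn = confusion_counts(pred, truth)
recall, precision, f1 = scores(tp, fp, fn)
```

Here three of the four defect pixels are found and one background pixel is flagged, so all three scores equal 0.75.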

Comparing TSUBB-Net with the state-of-the-art models

To verify the performance of the proposed TSUBB-Net method, a comparative analysis was conducted against an array of existing detection techniques, which included conventional methods such as PHOT²⁶ and TEXEMS,²⁷ and unsupervised methods including the AE,²⁸ AE-SSIM,²⁹ OCGAN,³⁰ PaDiM,³¹ and Reverse_distillation.³² The evaluation benchmark chosen was the DHU-DS1100 dataset, with the TSUBB-Net model being trained on merely 100 non-defective samples derived from industrial real-world scenarios and subsequently tested on the dataset.

Our experiments highlight the varying performance of different methods when processing the five types of texture samples. Conventional approaches, such as PHOT and TEXEMS, struggle to handle intricate textile textures because they rely on pre-determined feature sets, which limits their adaptability to unknown or complex texture surfaces, as shown in Figure 7(e). The AE, a deep learning-based approach, excels at learning texture representations through convolution operations but struggles in high-noise environments with intricate defect boundaries, as evidenced in Figures 7(a)–(d). AE-SSIM augments the AE with the structural similarity index (SSIM), which assesses image quality rather than detecting defects directly, so it is less accurate in situations where defects exhibit textures similar to those of normal samples, as evidenced in Figures 7(a) and (d). OCGAN, an unsupervised method that learns the distribution of normal samples to detect defects, may require extensive computational resources and struggles with interconnected defects that deviate from the original distribution, as seen in Figures 7(a) and (c). PaDiM identifies regions that encapsulate normal data samples and eliminates anomalous samples rather than directly detecting defects, leading to misclassifications of normal samples as defective in situations where defect textures subtly intertwine with normal textures, as illustrated in Figures 7(b) and (e). Reverse_distillation generates training samples through inverse distillation, which could compromise the integrity of the original texture and structural information, ultimately introducing erroneous data during model learning, as exemplified in Figure 7(e).

Figure 7.
Examples of the defect inspection performances of the compared methods: (a) Broken pick. (b) Hole. (c) Stains. (d) Cracks. (e) Felters.

In contrast to the above methods, our proposed TSUBB-Net outperforms existing methods in detecting surface defects across diverse textures. On average, TSUBB-Net improves the $F 1 - Measure$ by 0.098% across the five categories compared to the second-best method, as shown in Table 1. These results indicate that TSUBB-Net delivers the best performance among the compared methods on textured surfaces. This is attributed to its accurate reconstruction of defect shapes using texture information and the incorporation of semantic information for precise defect localization, which mitigates noise effects and ensures reliable detection.

Table 1.
$F 1 - Measures$ of different methods on five types of defects

| Defect type ( $F 1 - Measure$ ) | PHOT | TEXEMS | AE | AE-SSIM | OCGAN | PaDiM | Reverse_distillation | TSUBB-Net (Ours) |
|---|---|---|---|---|---|---|---|---|
| Broken picks | 0.235 | 0.232 | 0.112 | 0.642 | 0.556 | 0.667 | 0.721 | 0.874 |
| Hole | 0.075 | 0.142 | 0.215 | 0.167 | 0.445 | 0.621 | 0.691 | 0.794 |
| Stain | 0.088 | 0.563 | 0.357 | 0.148 | 0.521 | 0.657 | 0.605 | 0.749 |
| Crack | 0.246 | 0.112 | 0.270 | 0.567 | 0.589 | 0.706 | 0.748 | 0.823 |
| Felter | 0.314 | 0.174 | 0.315 | 0.214 | 0.368 | 0.698 | 0.504 | 0.842 |

TSUBB-Net ablation experiment

In this section, we argue that the TRB and the SLB incorporated into the proposed TSUBB-Net model enhance both the completeness and precision of textile defect detection.

Texture reconstruction branch

The TRB is developed to acquire a stable representation of texture units in a high-dimensional feature space. This is achieved through a texture reconstruction-based clustering approach, where each class of textures in z space is treated as a texture unit cluster and each texture unit is assigned to the nearest cluster center. As depicted in the second row of Figure 8, in the absence of the TRB, latent features become entangled. However, the presence of the TRB results in the scattering of texture units from different classes and the condensing of similar texture units, as shown in the third row of Figure 8. In experiments comparing models with and without the TRB, TSUBB-Net improves $Recall$ and $F 1 - Measure$ by 20.92% and 11.43%, respectively. While the $Precision$ metric decreases by 1.18%, as shown in Table 2, this could be attributed to the potential omission of some defect information during the subsequent fusion process. This decrease can be deemed negligible, given the substantial improvements observed in the other metrics. The results suggest that the TRB can improve defect detection by employing a texture clustering method.

Figure 8.
Influence of clustering in the texture reconstruction branch. (a)–(f) represent different types of defects. The first row shows the original image, the second row shows the distribution of texture units before clustering, and the third row shows the distribution of texture units after the clustering module.

Table 2.
Component analysis for the TSUBB-Net on the DHU-DS1100

| Method | $Recall$ (%) | $Precision$ (%) | $F 1 - Measure$ (%) |
|---|---|---|---|
| TSUBB-Net (without CAM) | 79.77 | 87.99 | 81.55 |
| TSUBB-Net (without SAFM) | 78.99 | 79.83 | 79.32 |
| TSUBB-Net (without SLB) | 78.66 | 75.1 | 76.45 |
| TSUBB-Net (without TRB) | 59.71 | 91.50 | 71.92 |
| TSUBB-Net | 80.63 | 90.32 | 83.35 |

CAM: channel attention module; SAFM: self-attention fusion module; SLB: semantic localization branch; TRB: texture reconstruction branch.

Semantic localization branch

Existing methods for textile defect detection rely on texture feature-based defect reconstruction, which can be prone to misidentification due to interference from noise points. The SLB is developed to locate textile defects with greater precision based on semantic information. Two essential modules underpin the SLB's success: the SAFM and the CAM.

The SAFM serves to efficiently fuse and align high- and low-dimensional features, thereby enhancing the system's capacity to accurately detect defects. When the SAFM is omitted, $Recall$ , $Precision$ , and $F 1 - Measure$ decrease by around 1.64%, 10.49%, and 4.03%, respectively, as shown in Table 2, underlining the vital role the SAFM plays in optimizing defect detection. The CAM, on the other hand, is crucial in foregrounding semantically rich defect information while suppressing less relevant data, leading to more precise defect detection and localization. Without the CAM, the same three metrics dip by approximately 0.86%, 2.33%, and 1.8%, respectively, further emphasizing the importance of the CAM in bolstering the model's performance.

Experimental results illustrate that both the SAFM and the CAM significantly contribute to the efficacy of the SLB. When these modules are incorporated, the TSUBB-Net sees substantial improvements in $Recall$ , $Precision$ , and $F 1 - Measure$ by 1.97%, 15.22%, and 6.9%, respectively, compared to models without the SLB. This is achieved through context information aggregation and channel re-weighting, allowing the SLB to highlight defects and employ rich semantic information for better defect localization. Figure 9 illustrates how the SLB effectively eliminates noise points and accurately pinpoints defect locations.
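The ablation differences quoted in this section follow directly from Table 2. The short sketch below recomputes them as the full model's score minus each ablated variant's score, in percentage points; the dictionary simply transcribes Table 2, and `gain_over` is a hypothetical helper name.

```python
# (Recall, Precision, F1-Measure) in percent, transcribed from Table 2.
TABLE_2 = {
    "without CAM": (79.77, 87.99, 81.55),
    "without SAFM": (78.99, 79.83, 79.32),
    "without SLB": (78.66, 75.1, 76.45),
    "without TRB": (59.71, 91.50, 71.92),
    "full": (80.63, 90.32, 83.35),
}


def gain_over(variant):
    """Full-model score minus the ablated variant's score, per metric."""
    full = TABLE_2["full"]
    return tuple(round(f - v, 2) for f, v in zip(full, TABLE_2[variant]))


for name in ("without CAM", "without SAFM", "without SLB", "without TRB"):
    recall_gain, precision_gain, f1_gain = gain_over(name)
    print(f"{name}: Recall {recall_gain:+.2f}, "
          f"Precision {precision_gain:+.2f}, F1 {f1_gain:+.2f}")
```

The negative $Precision$ entry for the variant without the TRB corresponds to the 1.18% decrease discussed for the texture reconstruction branch.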

Figure 9.
Effect of semantic localization branch (SLB) defect localization. The first row shows the original images, and the second row shows the defect locations identified by the SLB.

Feature fusion module

The FFM in our proposed model optimizes the usage of texture and semantic information, yielding superior defect detection results. While the TRB effectively exposes defect information after texture reconstruction, as shown in Figure 10(b), it also exhibits a higher false detection rate due to the intricate textures inherent in industrial samples, leading to a lower $Precision$ of 75.1% when used alone (TSUBB-Net without SLB in Table 2). On the other hand, the SLB takes a more targeted approach to defect localization through the fusion of multidimensional data, resulting in more accurate defect localization, as shown in Figure 10(c). However, it often fails to preserve the complete shape information of defects, contributing to a lower $Recall$ of 59.71% when used alone (TSUBB-Net without TRB in Table 2).

Figure 10.
Influence of the feature fusion module (FFM) in TSUBB-Net: (a) original input image; (b) defect map detected by the texture reconstruction branch; (c) defect map detected by the semantic localization branch and (d) final defect map after the FFM.

The combination of texture and semantic data serves to compensate for the individual limitations of the TRB and SLB. When these branches operate in isolation, either noise is misinterpreted as defects (in the case of the TRB) or the complete shape information of defects is lost (in the case of the SLB). However, when texture and semantic information are amalgamated via the FFM, a significant increase in $Precision$ , $Recall$ , and $F 1 - Measure$ rates is observed, which leads to precise defect detection, as illustrated in Figure 10(d). This underscores the efficacy of the FFM in integrating texture and semantic information, offering a more holistic approach to defect detection.

Analysis and discussion

Selection of the $k$ value

Within the framework of the TRB, the number of clusters $k$ plays a pivotal role in texture decomposition. To determine the ideal $k$ value, we propose initial verification of the accurate reconstruction capability of TRB on the texture background, followed by setting $k$ within a reasonable range based on the actual defect detection performance.

Background reconstruction validation

To assess background reconstruction, we use the mean absolute error (MAE) and SSIM metrics. The former indicates pixel-level image accuracy, while the latter evaluates visual quality. Our results show that very small and very large $k$ values lead to diminished background reconstruction quality, whereas moderate values produce the best results, as shown in Figure 11(a). The SSIM trend reveals a swift increase, peaking at $k$ = 16, and then a decrease with increasing $k$ . This decrease may result from a higher density of texture units and cluster centers, leading to structural information loss. A resurgence in SSIM is observed at $k$ = 25, possibly attributable to overfitting. Therefore, to ensure the quality of reconstruction, it is recommended to set the number of clusters $k$ within a moderate range. In our study, the optimal range for $k$ is between 14 and 19, which provides a practical guideline for selecting the number of clusters.
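For readers who want to reproduce this kind of check, the two reconstruction metrics can be sketched as follows. The MAE is standard; the SSIM shown here is a simplified single-window (global) variant computed over the whole image, whereas common implementations average SSIM over local sliding windows, so absolute values will differ from those in Figure 11(a). The images are random placeholders scaled to [0, 1], and the function names are hypothetical.

```python
import numpy as np


def mae(a, b):
    """Mean absolute error between two images of equal shape."""
    return float(np.mean(np.abs(a - b)))


def global_ssim(a, b, data_range=1.0):
    """Single-window SSIM over the whole image (simplified, no sliding window)."""
    c1 = (0.01 * data_range) ** 2   # stabilizer for the luminance term
    c2 = (0.03 * data_range) ** 2   # stabilizer for the contrast/structure term
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return float(numerator / denominator)


# Placeholder "original" image and a noisy stand-in for its reconstruction.
rng = np.random.default_rng(0)
original = rng.random((64, 64))
noise = rng.normal(0.0, 0.05, original.shape)
reconstruction = np.clip(original + noise, 0.0, 1.0)

error = mae(original, reconstruction)
similarity = global_ssim(original, reconstruction)
```

A perfect reconstruction gives an MAE of 0 and an SSIM of 1; sweeping $k$ and recording both values for each setting would yield curves of the kind summarized in Figure 11(a).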

Figure 11.
Influence of the number of clusters $k$ : (a) influence of the number of clusters $k$ on the mean absolute error (MAE) and structural similarity index (SSIM) and (b) influence of the number of clusters $k$ on the criteria.

Inspection verification

The selection of the ideal cluster number $k$ requires a simultaneous consideration of $Precision$ and $Recall$ . However, there is often a trade-off between these two metrics. A smaller $k$ value might enhance $Recall$ due to fewer feature categories, which leads to easier recognition, but it compromises $Precision$ due to poorer reconstruction, increasing the chance of false positives. Conversely, an excessively large $k$ value could reduce texture discernibility and increase erroneous reconstruction of defect backgrounds, thereby reducing $Precision$ . To balance both $Precision$ and $Recall$ , we employ the $F 1 - Measure$ as an efficient approach. Based on our experimental results, as shown in Figure 11(b), the optimal $k$ value is the one that maximizes the $F 1 - Measure$ . Pertaining to the dataset used in this study, we deduced that a $k$ value of 16 seems to be an appropriate selection.

FEN and FDN ratio settings

The FEN and FDN play a critical role in representing feature information of textile images, where they are employed to extract feature maps with four levels of receptive fields directly from the entire input image, thereby maximizing the representation of internal information. The stacked block ratio and specific number in the FEN and FDN determine the quality of texture background reconstruction. Our experiments with ConvNeXt,²³ a model with a 1:1:3:1 stacked block ratio, serve as a baseline for comparison. We vary the ratios to 3:1:1:1, 1:3:1:1, 1:1:3:1, and 1:1:1:3 to optimize model performance. We assess $F 1 - Measure$ as the evaluation criterion, as shown in Figure 12. In determining specific numbers, we experiment with various scenarios, such as 1:3, 2:6, 3:9, 4:12, etc. We find that excessive stacking of blocks (e.g. 3:9, 4:12) results in over-encoding of features and distorted images, while too few blocks (e.g. 1:3) limits learning capabilities and hinders effective information representation. Therefore, balanced module configurations that consider feature representation capabilities and model complexity are essential. We choose a 1:1:3:1 stacked block ratio with a 2:6 specific quantity configuration to achieve optimal texture background reconstruction results. As depicted in Figure 13, a visual representation offers a comparison between varying scales, substantiating the efficacy of feature extraction and integration across these dimensions.
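The relationship between the stacked block ratio and the specific block numbers can be illustrated with a few lines of Python. This sketch rests on an interpretation that the text does not state explicitly: a configuration such as 2:6 is read as the number of blocks in a ratio-1 stage versus the ratio-3 stage, so the unit count is multiplied through the ratio. The helper `stage_depths` is hypothetical.

```python
def stage_depths(ratio, unit):
    """Blocks per stage when each ratio entry is multiplied by `unit` blocks."""
    return tuple(r * unit for r in ratio)


# Ratios compared in the text, each evaluated with the chosen unit of 2 blocks.
ratios = [(3, 1, 1, 1), (1, 3, 1, 1), (1, 1, 3, 1), (1, 1, 1, 3)]
candidates = {ratio: stage_depths(ratio, unit=2) for ratio in ratios}

# Configuration selected in the text: ratio 1:1:3:1 with a 2:6 block count
# (interpreted here as 2 blocks in the ratio-1 stages and 6 in the ratio-3 stage).
chosen = stage_depths((1, 1, 3, 1), unit=2)
```

Under this reading, the chosen setting stacks 2, 2, 6, and 2 blocks across the four stages, while the 3:9 and 4:12 settings correspond to units of 3 and 4.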

Figure 12.
$F 1 - Measure$ under different ratio settings.

Figure 13.
Defect maps at various scales: (a) original input image; (b)–(e) defect maps across different scales and (f) final defect map.

Conclusions and future work

In this paper, we propose a novel bilateral-branch network structure, known as TSUBB-Net, which efficiently amalgamates texture and semantic information. This is intended to handle the diverse and complex irregular texture structures inherent in textile images, while mitigating the influence of image acquisition quality on defect recognition. Our model resourcefully employs multiple TSUBB-Net subnetworks of varying scale levels to reconstruct the background image. Each layer undergoes clustering of texture units in the bilateral-branch network and integrates context-semantic information to maximize the utilization of textile feature information. Experimental results show that the TSUBB-Net framework can achieve the best performance compared to the current state-of-the-art methods. Within the purview of this research, we observe that TSUBB-Net encounters difficulties in extracting defect information from elongated textures, a factor that curtails its efficacy. Addressing this limitation will be a focus of our future research.


Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported in part by the Shanghai Sailing Program (no. 22YF1401300), Cultivation Project of Discipline Innovation (XKCX202313), the Fundamental Research Funds for the Central Universities (2232021D-32,2232021A-10), National Natural Science Foundation of China (nos. 61806051, 61903078).

ORCID iDs

Bing Wei

Kuangrong Hao

References

1. Wei, Hao, et al. Textile defect detection using multilevel and attentional deep learning network (MLMA-Net). Text Res J 2022; 92: 3462–3477.
2. Wei, Hao, Tang, et al. A new method using the convolutional neural network with compressive sensing for fabric defect classification based on small sample sizes. Text Res J 2019; 89: 3539–3555.
3. Zheng. A new method for the detection and classification of weave pattern repeat. Text Res J 2014; 84: 1586–1599.
4. Zhou, Wang. Fabric defect detection using adaptive dictionaries. Text Res J 2013; 83: 1846–1859.
5. Jun, Wang, Zhou, et al. Fabric defect detection based on a deep convolutional neural network using a two-stage strategy. Text Res J 2021; 91: 130–142.
6. Huang, Wang, et al. Unsupervised fabric defect detection based on a deep convolutional generative adversarial network. Text Res J 2020; 90: 247–270.
7. Zhang, Han, Wei, et al. A spatial–spectral adaptive learning model for textile defect images recognition with few labeled data. Compl Intell Syst 2023; 1–13. DOI: https://doi.org/10.1007/s40747-023-01070-y
8. Xiao, Guo, Wang. TOF-UNet: high-precision method for terry towel defect detection. Text Res J 2023; 93: 925–935.
9. Schlegl, Seebock, Waldstein, et al. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In: Niethammer M, Styner M, Aylward S, et al. (eds) information processing in medical imaging: 25th international conference (IPMI 2017), Boone, NC, USA, 25–30 June 2017, pp. 146–157. Cham: Springer International Publishing. DOI: https://doi.org/10.1007/978-3-319-59050-9_12
10. Yoon. Patch svdd: patch-level svdd for anomaly detection and segmentation. In: Ishikawa H, Liu CL, Pajdla T, et al. (eds) proceedings of the Asian conference on computer vision, Kyoto, Japan, November 30–December 4, 2020, pp. 375–390. Berlin, Heidelberg: Springer-Verlag. https://doi.org/10.1007/978-3-030-69544-6_23
11. Zhu, Wang, Bai, et al. Deep learning representation using autoencoder for 3D shape retrieval. Neurocomputing 2016; 204: 41–50.
12. Kang, Leng, et al. Synthetic aperture radar target recognition with feature fusion based on a stacked autoencoder. Sensors 2017; 17: 192.
13. Jia, Qiao, et al. Self-taught learning based on sparse autoencoder for e-nose in wound infection detection. Sensors 2017; 17: 2279.
14. Masci, Meier, Cireşan, et al. Stacked convolutional auto-encoders for hierarchical feature extraction. In: Honkela T, Duch W, Girolami M, et al. (eds) artificial neural networks and machine learning–ICANN 2011: 21st international conference on artificial neural networks, Espoo, Finland, 14–17 June 2011, Part I 21, pp. 52–59. Berlin Heidelberg: Springer.
15. Mei, Wang, Wen. Automatic fabric defect detection with a multi-scale convolutional denoising autoencoder network model. Sensors 2018; 18: 1064.
16. Zhou, Mei, Zhang, et al. Semi-supervised fabric defect detection based on image reconstruction and density estimation. Text Res J 2021; 91: 962–972.
17. Defays. An efficient algorithm for a complete link method. Comput J 1977; 20: 364–366.
18. Dempster AP. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc 1977; 39: 1–22.
19. Kriegel, Kroger, Sander, et al. Density‐based clustering. Wiley Interdisc Rev Data Mining Knowl Disc 2011; 1: 231–240.
20. Cheng, Zeng, Bruniaux, et al. Research on intelligent clustering of male upper body. Text Res J 2022; 92: 2174–2193.
21. Qian, Wang, Huang, et al. Color segmentation of multicolor porous printed fabrics by conjugating SOM and EDSC clustering algorithms. Text Res J 2022; 92: 3488–3499.
22. Yang, Chen, Song, et al. Multiscale feature-clustering-based fully convolutional autoencoder for fast accurate visual inspection of texture surface defects. IEEE Trans Autom Sci Eng 2019; 16: 1450–1467.
23. Liu, Mao, et al. A convnet for the 2020s. In: proceedings of the IEEE/CVF conference on computer vision and pattern recognition, New Orleans, LA, USA, June 19–24, 2022, pp.11976–11986. Piscataway, NJ: IEEE. https://doi.org/10.48550/arXiv.2201.03545
24. Woo, Park, Lee, et al. Cbam: Convolutional block attention module. In: Ferrari V, Hebert M, Sminchisescu C, et al. (eds) proceedings of the European conference on computer vision (ECCV), Munich, Germany, September 8–14, 2018, pp.3–19. Berlin, Heidelberg: Springer-Verlag. https://doi.org/10.1007/978-3-030-01234-2_1
25. Chen, Zhu, Papandreou, et al. Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Ferrari V, Hebert M, Sminchisescu C, (eds.) proceedings of the European conference on computer vision (ECCV), Munich, Germany, 8–14 September 2018, pp.801–818. Berlin, Heidelberg: Springer-Verlag. DOI: https://doi.org/10.1007/978-3-030-01234-2_49
26. Aiger, Talbot. The phase only transform for unsupervised surface defect detection. In: Krečo A, Ovsenek M, Marolt S (eds.) 2010 IEEE Computer Society conference on computer vision and pattern recognition, San Francisco, CA, USA, 13–18 June 2010, pp.295–302. Piscataway, NJ: IEEE. https://doi.org/10.1109/CVPR.2010.5540198
27. Xie, Mirmehdi. TEXEMS: texture exemplars for defect detection on random textured surfaces. IEEE Trans Patt Anal Mach Intell 2007; 29: 1454–1464.
28. Hinton, Salakhutdinov RR. Reducing the dimensionality of data with neural networks. Science 2006; 313: 504–507.
29. Bergmann, Lowe, Fauser, et al. Improving unsupervised defect segmentation by applying structural similarity to autoencoders. arXiv preprint arXiv:1807.02011, 2018.
30. Perera, Nallapati, Xiang. Ocgan: One-class novelty detection using gans with constrained latent representations. In: proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Piscataway, NJ, 16–20 June 2019, pp.2898–2906. Piscataway, NJ: IEEE. https://doi.org/10.1109/CVPR.2019.00301
31. Defard, Setkov, Loesch, et al. Padim: a patch distribution modeling framework for anomaly detection and localization. In: Del Bimbo A (ed) pattern recognition. ICPR international workshops and challenges: virtual event, 10–15 January 2021, Part IV, pp.475–489. Cham: Springer International Publishing.
32. Deng. Anomaly detection via reverse distillation from one-class embedding. In: proceedings of the IEEE/CVF conference on computer vision and pattern recognition, New Orleans, LA, USA, 19–24 June 2022, pp.9737–9746. Piscataway, NJ: IEEE. https://doi.org/10.48550/arXiv.2201.10703

An efficient unsupervised approach for textile defect detection through unifying texture reconstruction and semantic localization
