Abstract
In the textile industry, surface defects can greatly reduce the value of fabric, and the coexistence of subtle defects and elongated defects poses a significant challenge to their localization. Existing convolutional neural network (CNN)-based deep learning methods, especially the YOLO series, have shown promising fabric defect detection performance. However, they are limited in simultaneously learning local and global features, leading to inaccurate localization results. To address this issue, this paper proposes a Fabric Defect Detection Network (FDDNet) based on Spatial Depth-Transforming Convolution (SDTC) and a Multiscale Dilated Self-attention Fusion Module (MDSFM). Firstly, to enhance the local feature characterization capability of the backbone network, FDDNet introduces spatial depth-transforming convolution to preserve more fine-grained information. Subsequently, to effectively integrate global and local information and enhance global-local modeling capability, the multiscale dilated self-attention fusion module is introduced by combining self-attention mechanisms and dilated convolutions, enabling the model to perceive scale changes and achieve multi-scale defect localization. Experimental results on the publicly available Tianchi fabric dataset and a self-made denim dataset show that the proposed FDDNet achieves AP50 of 54% and 56.8%, respectively, outperforming mainstream state-of-the-art methods.
Keywords
Introduction
The textile industry, as one of the traditional pillar industries in China, holds an important position in the national economy. With the improvement of people's living standards and the intensification of market competition, there is an increasing demand for high-quality textiles. Fabric defects (such as Loose warp, Thick bar, Flower of error, etc.) are one of the crucial factors affecting the quality of textile products, as illustrated in Figure 1. Fast and accurate detection of fabric defects is of great significance for improving product quality, reducing production costs and enhancing enterprise competitiveness. Early fabric defect detection relied mainly on manual visual inspection, which is not only inefficient but also susceptible to the subjective judgment of inspectors, making it challenging to ensure the accuracy and reliability of the inspection results. Furthermore, prolonged manual inspection leads to inspector fatigue, which further degrades detection quality. Therefore, developing an efficient, accurate, and automated fabric defect detection method has become an urgent issue in the textile industry.

Visualization of typical fabric defects in denim (a) and Tianchi (b) datasets.
Currently, most fabric defect detection methods are based on computer vision technologies, including traditional algorithms and deep learning algorithms. Traditional algorithms include spectral detection,1,2 mathematical statistics,3,4 image differencing5 and image texture analysis.6 These traditional methods can only detect specific types of defects in simple textures, are sensitive to noise, and lack robustness. However, fabric defects often appear in multi-scale forms and the background textures are complex and diverse. Specifically, fabric defect detection faces two challenges: (1) Small defects: as shown in Figure 1(a), these are small-sized defects with weak contrast produced during the textile manufacturing process, making them difficult to differentiate accurately during detection. (2) Multi-scale defects: in textile production, defects of different scales coexist on the same surface, mainly in the form of points and lines. The coexistence of these subtle defects and elongated defects often leads to feature confusion or information loss, increasing the difficulty of detection, as shown in Figure 1. Therefore, solving these two problems in fabric defect detection is a challenging and urgent task.
With the rapid development of computer technology and the enhancement of hardware processing capabilities, deep learning methods in computer vision have reached highly advanced levels in recent years. In the field of fabric defect detection, convolutional neural network (CNN)7-based detection methods have been proven to be more effective in handling complex scenarios, gradually replacing traditional detection methods. To improve defect detection accuracy, Li et al.8 proposed FD-YOLOv5 as an enhancement over YOLOv5. By introducing a coordinate attention module in YOLOv5 to replace bottleneck structures, they improved the network's feature extraction capability and enhanced the detection of small defects. Meanwhile, they utilized the smooth Mish activation function, the SIoU loss function, and a combination of focal loss and GHM loss to address the dataset sample imbalance issue. To improve time efficiency, Zhao et al.9 introduced the Dynamic Inference Network (DI-Net), which dynamically allocates computational resources based on image complexity. This network includes an "AND" gate control module for adjusting network depth, and its inference unit allows early network exit under specific conditions to improve efficiency. However, DI-Net is primarily suitable for plain textile designs, and its performance may degrade with complex textures and backgrounds, with limitations in extracting complex features and training on defect samples. Moreover, Wan et al.10 proposed an unsupervised high-frequency feature mapping model for fabric defect detection, addressing the lack of labeled fabric images and the difficulty of finding discriminative features. Despite its impressive performance, the uneven distribution of high-frequency information in defect areas may impact pixel-level segmentation, and the imbalance between foreground defects and background textures can affect the accuracy of detection and segmentation. Li et al.11 introduced PEI-YOLOv5 for fabric defect detection, which incorporates the Particle Depth Convolution method to reduce redundant computation and memory access, thus improving detection speed and feature extraction efficiency. Through the Enhanced-BiFPN structure, they strengthened attention to spatial and channel feature mapping as well as the fusion of information at different scales. However, due to its lightweight design, PEI-YOLOv5 may have insufficient capability in detecting certain defects. To address the weak feature representation of traditional autoencoders, Zhang et al.12 proposed an unsupervised method called the Triple Attention Multi-Scale U-Shape Denoising Convolutional Autoencoder. By introducing a triple attention mechanism and using noisy defect-free samples during training to reconstruct and repair defective areas, they enhanced the feature representation capability of the autoencoder. Nonetheless, there is still room for improvement in detection performance due to the limitations of convolutional neural networks in capturing long-range dependencies.
To address these issues, a fabric defect detection network based on spatial depth-transforming convolution and a multiscale dilated self-attention fusion module, named FDDNet, is proposed in this paper. Specifically, the feature extraction backbone is improved by replacing the convolutional downsampling layers with spatial depth-transforming convolution, which extracts more fine-grained information and enables the network to better detect small defects with low-contrast characteristics. Then, dilated convolutions and the self-attention mechanism are combined to form the multiscale dilated self-attention fusion module. The dilated convolutions expand the receptive field without increasing the number of parameters, thus better capturing defect features of various sizes, while the self-attention mechanism captures dependencies between global information and different regions, which enhances the ability to perceive scale variations and further improves model performance.
Our contributions can be summarized as follows:
(1) We propose a novel fabric defect detection method, FDDNet, which achieves top performance on both a public dataset and a self-made dataset.
(2) The SDTC is proposed to enhance fine-grained feature extraction with lower computational cost, allowing better detection of small defects with low-contrast features.
(3) The MDSFM is proposed to enhance global-local modeling capability, highlight defects of different scales and improve multi-scale defect detection capability.
The remainder of this paper is organized as follows. We first review related work on fabric defect detection and the relevant technologies. Then, the proposed network is described in detail. Next, the experimental section provides a thorough validation of the feasibility of the proposed method. Finally, the conclusion of this paper is presented.
Related work
Fabric defect detection methods
In the field of fabric defect detection, many researchers have proposed various innovative methods and technologies. Lu et al.13 introduced a texture-aware single-stage fabric defect detection network, which explicitly considers fabric texture during defect detection through a defect texture recognition task. However, some defects remain undetected, especially long and thin linear structures and small region-based defects. To address these issues, Lu et al.14 proposed the channel-wise adaptive feature pyramid network, which integrates the anchor-free detection strategy AutoAssign to form the flexible anchor-free detector CA-AutoAssign. Zhao et al.15 proposed an attention-based multi-scale feature fusion method to guide the model to focus more on defects rather than the background. Guo et al.16 introduced a method that captures multi-scale information using dilated convolution pooling and introduced a convolutional compression excitation module. Wu et al.17 proposed a lightweight network structure based on Faster R-CNN, utilizing dilated convolutions and multi-scale anchor boxes for fabric detection. Liu et al.18 employed Generative Adversarial Networks to develop a fabric defect detection system capable of adapting to various fabric textures. Zeng et al.19 introduced a reference-based defect detection network that incorporates template and context references. Lastly, Mo et al.20 presented a weighted double low-rank decomposition method to preserve the most salient features of fabric images. These studies collectively represent frontier work in the field of fabric defect detection and provide important references and insights for related research and applications. However, current methods still have considerable room for improvement in detecting small defects and elongated defects. This paper introduces a new fabric defect detection network, FDDNet, to further integrate global and local information and enhance the perception of scale variations, thus achieving multi-scale defect localization.
Small object detection methods
Small object detection has always been a challenging task in object detection, especially in fabric defect detection. Existing small object detection methods tend to integrate well-designed strategies into state-of-the-art frameworks and show outstanding performance in general object detection tasks. To address the loss of information about small objects as the network depth increases, Kong et al.21 proposed the multi-scale fusion network HyperNet, which enhances object detection performance by combining shallow high-resolution features, deep semantic features and intermediate features. To address the limited number of training samples for small objects, Zhang et al.22 utilized partitioning and size functions to augment the dataset. RRNet23 introduced an adaptive augmentation strategy called AdaResampling. CRANet24 proposed an algorithm for adaptively searching clustering regions. TridentNet25 constructed a parallel multi-branch architecture in which each branch has an optimal receptive field for objects of a different scale. QueryDet26 designed a cascaded query strategy to avoid redundant computation on low-level features, achieving efficient detection of small objects. To optimize detection efficiency and reduce computational cost, DS-GAN27 proposed a novel data augmentation pipeline for generating high-quality synthetic data for small objects. However, most small object detection methods struggle to fully leverage contextual information around the target, and challenges such as interference from complex backgrounds and the inadequate representation of tiny objects in images remain unresolved. In this paper, from the perspective of backbone feature extraction, the SDTC is employed to enhance the model's feature extraction capability for small defects.
Self-attention
Vanilla Transformers28,29 have been successfully applied to various tasks such as natural language processing28,30 and visual tasks.31-40 In Transformers used for visual tasks, images are divided into multiple patches that are fed into the Transformer, and self-attention is employed to aggregate extensive contextual features between image patches. Many methods have since improved on Transformers. For instance, CrossFormer41 utilized different convolution operations or patch sizes to design patch embeddings. Parallel Transformers42 employed multi-scale token aggregation to acquire keys and values of various sizes. MPViT43 consists of multi-scale patch embeddings and multi-path Transformer blocks. Conformer,44 Mobile-Former45 and ViTAE33 incorporate additional convolution branches inside and outside the self-attention block to integrate multi-scale information. These methods require intricate designs, which inevitably introduce additional parameters and computational costs. In contrast, the proposed MDSFM extracts multi-scale features by setting different dilation rates, offering a simple approach that introduces no extra parameters or computational cost.
Proposed network
In this section, the proposed fabric defect detection network FDDNet based on SDTC and MDSFM is introduced in detail. As shown in Figure 2, it consists of a backbone network, a neck network, and multiple detection heads. The backbone network uses spatial depth-transforming convolution for downsampling feature extraction, generating three feature maps at different levels (P5, P4, P3). Subsequently, in the multiscale dilated self-attention fusion module of the neck network, the Feature Pyramid Network (FPN) first enhances multi-scale representation by transferring high-level semantic features top-down, and the Path Aggregation Network (PAN) then propagates the low-level localization information to higher levels using the dilated self-attention mechanism, producing discriminative feature maps at different scales (N3, N4, N5). Finally, based on the feature maps N3, N4 and N5, three detection heads perform defect localization and classification at three scales, enhancing the multi-scale defect detection capability.

The overall framework diagram of FDDNet.
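The data flow in Figure 2 can be summarized in a few lines. The sketch below only illustrates the pipeline described above and is not the authors' implementation: the backbone, neck and head classes are hypothetical placeholders for the components detailed in the following subsections.

```python
import torch.nn as nn

class FDDNetSketch(nn.Module):
    """Minimal sketch of the FDDNet data flow (component classes are placeholders)."""
    def __init__(self, backbone, neck, heads):
        super().__init__()
        self.backbone = backbone            # DarkNet53 with SDTC downsampling
        self.neck = neck                    # FPN top-down + PAN bottom-up with MDS
        self.heads = nn.ModuleList(heads)   # one detection head per output scale

    def forward(self, x):
        p3, p4, p5 = self.backbone(x)          # multi-level backbone features
        n3, n4, n5 = self.neck(p3, p4, p5)     # fused multi-scale features
        return [head(n) for head, n in zip(self.heads, (n3, n4, n5))]
```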
Enhanced backbone
Object detection algorithms usually start by extracting multi-scale features from input images, so the backbone structure directly affects the quality of feature extraction and has a great impact on the detection results. YOLOv8 is widely used in the field of object detection due to its effectiveness and efficiency, and mainly comprises five basic versions: YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l and YOLOv8x, with network depth increasing in sequence. As the number of network layers increases, the network gains more potential for feature representation, but model efficiency decreases; moreover, when training data is limited, deeper networks may overfit, reducing detection performance. In textile factories, the number of fabric defect images is usually small, and these images mainly exhibit low-level texture features and contain less deep semantic information. In addition, practical applications often require real-time performance. Therefore, the DarkNet53 of YOLOv8n is chosen as the backbone network, as it offers the best balance between performance and efficiency. The specific structure of the enhanced backbone DarkNet53 is shown in Table 1. It can be seen that the enhanced DarkNet53 likewise comprises five stages. In Stage 1, besides the convolutional layer of the original DarkNet53, the proposed SDTC is included to extract more detailed and intricate information. Building upon the original model, the proposed SDTC is also incorporated into Stages 2, 3, 4 and 5 to merge feature maps of different stages. The last feature maps of Stages 3, 4 and 5, named P3, P4 and P5, are used as the inputs to the neck network. These feature maps have different resolutions and preserve local features and semantic information at different hierarchical levels, which is conducive to detecting defects of different scales.
The specific structure of the enhanced DarkNet53.
s means the image size.
Spatial depth-transforming convolution (SDTC)
Currently, although CNN-based fabric defect detection algorithms have achieved outstanding results, they are often limited in handling small targets due to inaccurate feature extraction and information loss. To address this issue, the proposed SDTC serves as a substitute for traditional strided convolution and acts as a pivotal component of the backbone network to extract more intricate and detailed features. The structure of the spatial depth-transforming convolution is shown in Figure 3. A spatial-to-depth transformation is applied to the feature map X (of size S×S×C1) through slicing operations, resulting in a series of sub-feature maps that can be represented as:

$$f_{i,j}(m, n, \cdot) = X(\text{scale} \cdot m + i,\ \text{scale} \cdot n + j,\ \cdot), \qquad i, j \in \{0, 1, \ldots, \text{scale}-1\}$$

where scale denotes the equidistant sampling interval, so that each sub-feature map f_{i,j} has a size of (S/scale) × (S/scale) × C1.

The specific process of SDTC: (b and c): Implement equidistant sampling of feature maps (a). (c and d): Implement spatial-to-depth transformation. (d and e): Use point convolution to merge features from different channels.
Then, these sub-feature maps are concatenated along the channel dimension, which transfers the spatial information into the depth dimension and yields the intermediate feature map X1 of size (S/scale) × (S/scale) × (scale² · C1). Finally, a 1 × 1 convolution is used to compress the channel dimension of the feature map X1 to C2, thus enhancing information interaction across channels and reducing model computation. This non-strided convolution preserves as much discriminative feature information as possible, allowing the model to learn more fine-grained information.
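Since the paper does not release code, the following is a minimal PyTorch sketch of the SDTC operation as described above; the class and argument names are ours, and the BatchNorm/SiLU following the 1 × 1 convolution is an assumption borrowed from the standard YOLOv8 convolution block.

```python
import torch
import torch.nn as nn

class SDTC(nn.Module):
    """Spatial depth-transforming convolution sketch (default scale = 2)."""
    def __init__(self, c1, c2, scale=2):
        super().__init__()
        self.scale = scale
        # non-strided point convolution that merges and compresses channels
        self.pw = nn.Sequential(
            nn.Conv2d(c1 * scale * scale, c2, kernel_size=1, bias=False),
            nn.BatchNorm2d(c2),   # assumed normalization/activation
            nn.SiLU(),
        )

    def forward(self, x):
        s = self.scale
        # equidistant sampling: one sub-feature map per (i, j) pixel offset
        subs = [x[:, :, i::s, j::s] for i in range(s) for j in range(s)]
        x = torch.cat(subs, dim=1)   # spatial-to-depth: (B, s*s*C1, S/s, S/s)
        return self.pw(x)            # compress channels to C2

# e.g. SDTC(64, 128)(torch.randn(1, 64, 160, 160)).shape -> (1, 128, 80, 80)
```

Unlike a stride-2 convolution, every input pixel contributes to the downsampled output, which is why fine, low-contrast details are retained.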
Multiscale dilated self-attention fusion module (MDSFM)
The feature maps generated at different stages of the backbone network have different resolutions. High-resolution feature maps contain more local details and richer fine-grained information, while, as the resolution decreases, the information encapsulated in the feature maps becomes more abstract and semantic. Therefore, detecting defects of varying sizes requires both local detailed information and global semantic information. To achieve this, the MDSFM is designed to merge feature maps of different scales, thus enhancing the multi-scale defect detection capability. Based on the FPN, we aggregate information from feature maps of different resolutions to enrich their fine-grained details and high-level semantics. Subsequently, within the PAN, we design the MDSFM, which combines dilated convolutions and attention mechanisms to capture contextual semantic dependencies at different scales. The feature maps generated by the MDSFM contain rich fine-grained information and high-level semantic features, further enhancing the model's multi-scale detection capability. The following section provides a detailed introduction to the MDSFM.
As shown in Figure 4, the feature maps P3, P4 and P5 output by the backbone network are input to the MDSFM. Firstly, the Feature Pyramid Network achieves top-down multiscale information fusion, producing feature maps P′5, P′4 and P′3. Subsequently, the bottom-up pyramid of the PAN passes the strong localization features of the lower layers upward and gradually fuses multiscale features to obtain the output feature maps (N3, N4, N5). In addition, as shown in Figure 4, before the output feature maps are obtained, the multiscale dilated self-attention (MDS) is used for multiscale information aggregation. The proposed multiscale dilated self-attention applies a local attention mechanism to the feature maps and utilizes varying dilation rates to capture defect information at different scales within a local context, which enhances the model's ability to learn multiscale object information. The specific structure of the MDS is depicted in Figure 5. Firstly, the channels of the feature map are divided into multiple heads. Then, a self-attention operation is performed between the red query patch and the colored patches selected within its surrounding window, using a different dilation rate for each head. The outputs of the different heads are then concatenated and fed into a linear layer. By default, a kernel size of 3×3 and dilation rates r = 1, 2, 3, 4 are used, so that the effective receptive fields of the four heads are 3×3, 5×5, 7×7 and 9×9, respectively. Specifically, the output feature vector of the MDS can be expressed as:
$$h_i = \text{Attention}(Q_i, K_i, V_i; r_i) = \text{Softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i, \qquad i = 1, 2, 3, 4$$

$$\text{MDS}(X) = \text{Linear}\big(\text{Concat}[\,h_1, h_2, h_3, h_4\,]\big)$$

where Q_i, K_i and V_i denote the query, key and value tensors of the i-th head, whose keys and values are gathered from the 3×3 sliding window with dilation rate r_i, d_k is the channel dimension of each head, and r controls the local scope of the self-attention mechanism.

The specific structural of the MDSFM.

Illustration of multiscale dilated self-attention.
By controlling the dilation rate r of each head, the self-attention is performed over windows with different effective receptive fields, so that the MDSFM can simultaneously capture fine local details and broader contextual cues without introducing extra parameters, thereby strengthening the representation of defects at different scales.
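A compact PyTorch sketch of the multiscale dilated self-attention is given below. It follows the description above (3 × 3 sliding windows, dilation rates 1-4, one rate per head, concatenation followed by a linear projection); the class, method and argument names are our own assumptions, and the 1 × 1 convolutions that form the queries, keys, values and the output projection stand in for the linear layers.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleDilatedSelfAttention(nn.Module):
    """Sketch of MDS: each head attends inside a 3x3 sliding window sampled with
    its own dilation rate, giving effective receptive fields of 3x3, 5x5, 7x7, 9x9."""
    def __init__(self, dim, num_heads=4, kernel_size=3, dilations=(1, 2, 3, 4)):
        super().__init__()
        assert dim % num_heads == 0 and len(dilations) == num_heads
        self.head_dim = dim // num_heads
        self.kernel_size = kernel_size
        self.dilations = dilations
        self.qkv = nn.Conv2d(dim, dim * 3, kernel_size=1, bias=False)
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)  # final linear layer

    def _window_attention(self, q, k, v, r):
        # q, k, v: (B, d, H, W); keys/values are gathered from a k x k window
        # around every query position with dilation rate r
        B, d, H, W = q.shape
        ks = self.kernel_size
        pad = r * (ks - 1) // 2
        k = F.unfold(k, ks, dilation=r, padding=pad).reshape(B, d, ks * ks, H * W)
        v = F.unfold(v, ks, dilation=r, padding=pad).reshape(B, d, ks * ks, H * W)
        q = q.reshape(B, d, 1, H * W)
        attn = (q * k).sum(dim=1, keepdim=True) / math.sqrt(d)  # (B, 1, k*k, H*W)
        attn = attn.softmax(dim=2)                               # over window positions
        out = (attn * v).sum(dim=2)                              # (B, d, H*W)
        return out.reshape(B, d, H, W)

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=1)
        heads = []
        for i, r in enumerate(self.dilations):        # one dilation rate per head
            c = slice(i * self.head_dim, (i + 1) * self.head_dim)
            heads.append(self._window_attention(q[:, c], k[:, c], v[:, c], r))
        return self.proj(torch.cat(heads, dim=1))     # concat heads + linear projection

# e.g. MultiScaleDilatedSelfAttention(128)(torch.randn(1, 128, 20, 20)) keeps the shape
```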
Experiment
Experimental settings
All experiments in this study are conducted on a server running the Ubuntu operating system, with an AMD EPYC 7H12 64-Core CPU and an NVIDIA A100 GPU. During training, the SGD optimizer with a weight decay of 0.0005 is adopted for model optimization, and the initial learning rate is set to 0.01. The entire training process comprises 300 epochs with a batch size of 32, and the input images are uniformly resized to 640 × 640.
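For reference, these settings map directly onto a standard YOLOv8-style training call. The snippet below is only an illustration: the paper does not state which code base was used, and fddnet.yaml and fabric.yaml are hypothetical model and dataset configuration files.

```python
from ultralytics import YOLO

model = YOLO("fddnet.yaml")        # hypothetical model definition
model.train(
    data="fabric.yaml",            # hypothetical dataset configuration
    epochs=300,
    batch=32,
    imgsz=640,
    optimizer="SGD",
    lr0=0.01,                      # initial learning rate
    weight_decay=0.0005,
)
```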
Experimental dataset
The Alibaba Cloud Tianchi fabric dataset and a self-built denim fabric dataset are used for model training and testing. The Tianchi fabric dataset was created and released by the Alibaba Tianchi competition team in 2019, with images collected at the Nanhai textile factory in Foshan, China. It contains 5913 images, with 5413 images for training and 500 images for testing. Each image has a resolution of 2446×1000, and the dataset covers 34 defect categories, encompassing various small point defects such as Knot head, Broken spandex and Capillus, as well as longer and thinner defects such as Check Jump, Wavy crotch and Double Welf. In the self-built denim dataset, 8000 images are adopted for training and the remaining 1000 images for testing. The resolution of each image is 3072×2048, and all images were taken from the denim production line of a textile company in Foshan, Guangdong. The defect types in the self-built dataset mainly include Nep, Cotton kernels, Warp knot, Loose warp, Thick bar, Flower of error and Strain Barre. The variable scale of surface defects makes accurate detection challenging.
Evaluation metrics
Six common object detection evaluation metrics are employed to assess model performance: precision, recall, mean average precision (mAP), model parameters (Params), theoretical computational complexity (FLOPs) and frames per second (FPS). Precision is the proportion of samples predicted as positive by the model that are truly positive, while recall is the proportion of actual positive samples that the model correctly identifies as positive. Average precision (AP) is the area under the precision-recall curve, and mAP is the average of the AP values over all tested classes. These metrics are defined by the following formulas:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$

$$AP = \int_{0}^{1} P(R)\, dR, \qquad mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$$

where TP, FP and FN denote the numbers of true positives, false positives and false negatives, P(R) is the precision at recall R, and N is the number of defect classes.
Experimental results
To verify the rationality of the scale parameter in the SDTC module, we evaluated different settings (1, 2, 3, 4) on the Tianchi dataset. Table 2 shows that scale = 2 achieves the highest mAP of 26.0%, outperforming scale = 1 by about 2%. Increasing the scale beyond 2 leads to a consistent decline in mAP, AP50, precision and recall, with scale = 4 dropping to 22.1%. This demonstrates that too small a scale restricts feature representation, while an excessively large scale introduces redundancy and weakens detail extraction. Therefore, scale = 2 is adopted as the final configuration in this study.
Results under different scale settings on the Tianchi dataset (%).
Performance comparison of MDSFM with different parameters on Tianchi dataset (%).
Note. The bold entries mean the best performance.
Next, to validate the contribution of each component of FDDNet, a series of ablation experiments is conducted on the Tianchi dataset. We take YOLOv8n as the baseline model and integrate one component at a time, namely SDTC and MDSFM. As shown in Table 4, applying SDTC to the baseline increases mAP and AP50 by 2% and 0.8%, respectively. Using MDSFM alone results in an increase of 3.2% in mAP and 7.6% in AP50. When SDTC and MDSFM are combined, the best performance is achieved, with mAP and AP50 reaching 27.9% and 54%, respectively, an improvement of 3.9% and 7.8% over the baseline.
Performance comparison of ablation experiments on the Tianchi dataset (%).

Visualization of Typical Fabric Defects and Feature Maps. (a) Displays typical fabric defects. (b and c) Show the visualized feature maps obtained by YOLOv8n and our FDDNet, respectively. It can be observed that the defect features extracted by FDDNet are more prominent and easier to distinguish.

Visualization comparison of different models on the Tianchi dataset (a-f).
The YOLO series models are widely used in industrial inspection due to their high precision and efficiency. To further validate the superiority of our algorithm, we conducted comparisons with other models in the YOLO series as well as several other mainstream models. As shown in Table 5, our FDDNet performs the best among all YOLO models, with mAP and AP50 of 27.9% and 54%, respectively, clearly surpassing the second-best YOLOv8s in the YOLO series by 0.5% and 3.1%.
Performance comparison of different models on the Tianchi dataset (%).
Note. The bold entries mean the best performance.
Performance demonstration of the model under different parameter settings on the denim dataset (%).
Note. The bold entries mean the best performance.
Demonstration of ablation experiments on the denim dataset (%).

Visualization heatmap demonstration of typical features on the denim dataset. Here (a) represents the original image, while (b and c) depict the feature heatmaps obtained by YOLOv8n and FDDNet, respectively.
To further illustrate the superiority of the proposed method, we conducted comparative experiments with other lightweight YOLO models and several mainstream models on our self-built dataset, as shown in Table 8. The proposed network achieves the best mAP of 31.2%, which is 2.2% higher than the baseline YOLOv8n. This improvement is attributed to the model's enhanced detection capability for small and elongated defects. Of note, although the AP50 of YOLOv8s reaches the same value of 56.8% as our method, the proposed model remains the better choice when its smaller model size and lower computational complexity are taken into account.
Comparison demonstration of different models on the denim dataset (%).
Note. The bold entries mean the best performance.
Subsequently, a visual comparison of the detection results between the proposed method and other methods is conducted to validate the effectiveness of FDDNet on the denim dataset, with representative sample images shown in Figure 9. The results clearly indicate that our FDDNet outperforms the baseline network YOLOv8n in both the classification and localization of defect regions. This advantage is particularly pronounced in cases where the contrast is weak and the defect regions are tiny and elongated.

Visualization comparison on the denim dataset.
Efficiency analysis
Apart from the above analysis on model accuracy, model efficiency is also a crucial factor in assessing algorithm performance, especially in the textile industry where production speeds are high. Therefore, FLOPs, parameters, and FPS are employed to compare the model efficiency of our FDDNet with other models. As shown in Table 9, the proposed model has 5.14M parameters, ranking just behind YOLOv5n and YOLOv6n, making it highly lightweight and suitable for online deployment. In terms of detection speed, although our method may not be the fastest, it still achieves an FPS of 176, which can meet the requirements for real-time detection. Additionally, the FLOPs of 21.5G are slightly lower compared to YOLOv6s and YOLOv8s. Consequently, the proposed algorithm achieves a good balance between efficiency and accuracy, making it more suitable for practical applications.
Model efficiency comparison.
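The parameter, FLOPs and FPS figures of the kind listed in Table 9 can be obtained with standard tooling; the paper does not specify its measurement procedure, so the sketch below only shows one common approach (model stands for an already loaded detector, and whether the thop output is quoted as MACs or FLOPs varies between papers).

```python
import time
import torch
from thop import profile  # pip install thop

dummy = torch.randn(1, 3, 640, 640).cuda()
model = model.eval().cuda()          # assumes a loaded detection model

macs, params = profile(model, inputs=(dummy,))
print(f"Params: {params / 1e6:.2f} M, MACs: {macs / 1e9:.2f} G")

with torch.no_grad():
    for _ in range(10):              # warm-up iterations
        model(dummy)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(100):
        model(dummy)
    torch.cuda.synchronize()
    print(f"FPS: {100 / (time.time() - start):.1f}")
```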
Model generalization analysis
To further verify the generalization performance of the proposed method, in addition to the experiments conducted on the denim dataset, supplementary comparisons with the baseline method YOLOv8n were carried out on the ZJULEAPER 58 (patterned fabric) dataset. Since the original annotations of this dataset are in segmentation format, they were first converted into YOLO-compatible detection labels. A total of 14,233 images were selected for the training set and 4,742 images for the test set, with the experimental settings kept consistent with those used on the denim dataset. As shown in Table 10, the proposed method outperforms YOLOv8n by 3.8% and 0.5% in terms of mAP and AP50, respectively. Moreover, the visualization results in Figure 10 demonstrate that FDDNet exhibits stronger defect recognition capability. In summary, the proposed method can better adapt to the feature variations of different types of fabrics, shows strong generalization ability, and has the potential to be applied to diverse fabric defect detection tasks in complex production environments.
Performance demonstration of FDDNet on the ZJULEAPER dataset (%).
Note. The bold entries mean the best performance.

The performance demonstration of FDDNet and YOLOv8n (baseline) on the ZJULEAPER dataset.
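The conversion from segmentation annotations to YOLO detection labels mentioned above is not detailed in the paper. A minimal sketch of the usual reduction is shown below, assuming each annotated region is available as a pixel-coordinate polygon; if the dataset provides binary masks instead, the box can be taken from the nonzero extent of each connected component.

```python
def polygon_to_yolo_box(points, img_w, img_h):
    """points: [(x, y), ...] in pixels -> (cx, cy, w, h) normalized to [0, 1]."""
    xs, ys = zip(*points)
    x0, x1 = max(min(xs), 0), min(max(xs), img_w)
    y0, y1 = max(min(ys), 0), min(max(ys), img_h)
    return ((x0 + x1) / 2 / img_w, (y0 + y1) / 2 / img_h,
            (x1 - x0) / img_w, (y1 - y0) / img_h)

# One "class_id cx cy w h" line per defect region in the image's .txt label file:
# cx, cy, w, h = polygon_to_yolo_box(region_points, 512, 512)
```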
Limitations
Although the proposed method demonstrates promising results in fabric defect detection, there are certain limitations that need to be addressed in future work. One key limitation is the reliance on a fixed set of feature extraction techniques, which may not be optimal for capturing all defect types, particularly in fabrics with highly variable textures. This can limit the model’s ability to generalize across diverse defect types and fabric materials. Additionally, the method may face challenges in handling rare or unseen defect patterns due to the limited diversity of the training data, which can impact its performance on less common defect types. Moreover, factors such as lighting conditions, fabric texture, and image noise can affect the robustness of the model in real-world applications. To improve detection accuracy in the future, several methodologies could be explored. Multi-source information fusion 55 and image representation 56 offer techniques for combining diverse data sources, which could enhance the performance of defect detection. Additionally, latent features and graph neural networks 57 could be used to model complex relationships and incorporate additional information, leading to improved detection accuracy. Finally, self-paced semi-supervised learning 58 addresses the common challenge of limited labeled data, and its integration could enhance the model’s performance in scenarios where labeled data is scarce.
Conclusion
In the textile industry, the coexistence of subtle defects and elongated defects poses a significant challenge to fabric defect detection. This paper introduces a novel fabric defect detection model, FDDNet, which leverages SDTC and MDSFM to enhance defect feature characterization and multi-scale defect localization. FDDNet overcomes the limitations of traditional methods in learning local and global features, resulting in more accurate defect localization. The SDTC enhances local feature characterization by preserving fine-grained information, which aids the accurate identification of subtle and elongated fabric defects. The MDSFM effectively integrates global and local information through a combination of self-attention mechanisms and dilated convolutions, enabling the model to adapt to scale variations and enhancing multi-scale defect localization. Experimental results show that, on both the public Tianchi fabric dataset and a self-made denim dataset, FDDNet demonstrates significant advantages over mainstream methods, outperforming the baseline model by 3.9% and 2.2% in mAP, respectively.
Footnotes
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by NSFC (No.62472463, 61873293, 62301623), Leading talents of science and technology in the Central Plain of China (234200510009), Henan Key Research and Development Projects (241111220700), China National Textile and Apparel Council Technology Guiding Projects (2025055, 2025010).
