Abstract

Fabric defect detection is a pivotal step in quality control in the textile manufacturing industry. Due to the diversity and complexity of defects, manual visual inspection and traditional fabric defect detection methods suffer from low efficiency and accuracy. To address these issues, this paper proposes a saliency model for fabric defect detection, named ACCTNet, that mines local and global information with a CNN and a vision Transformer. Specifically, to enhance the feature interaction across different scales, an adjacent context coordination module composed of one local branch and two adjacent branches is proposed. Meanwhile, a contrast-aggregation module is proposed to highlight defects against low-contrast backgrounds using pooling and subtraction operations. In addition, a vision Transformer is adopted to capture global contextual information with long-range dependencies, which guides the local information to further refine the defect detection results. Experimental results demonstrate that the proposed method can accurately inspect defects on plain and patterned fabric surfaces, achieving $E_{m}$ values of 78.49% and 97.19% respectively, which surpass those of the existing state-of-the-art fabric defect detection methods.

Keywords

Fabric defect detection, saliency model, adjacent context coordination, vision Transformer

Introduction

During textile manufacturing, various defects such as warp breakage, stains and yarn removal are inevitably produced by flaws in textile materials or machine malfunctions, resulting in significant resource waste and economic losses. Early fabric defect detection generally relied on manual visual inspection, which is strongly affected by light intensity, fatigue and the inspector’s experience, making it difficult to provide reliable and stable detection results.^1,2 With the development of computer vision technology, automatic visual inspection approaches have become a promising solution for fabric defect detection.³ The traditional automatic fabric defect detection methods can be broadly categorized into six main groups: statistical-based methods,^4–7 structure-based methods,^8,9 spectral-based methods,^10–13 model-based methods,^14–16 learning-based methods,^17–22 and hybrid methods.^23,24 Meanwhile, fabric defect detection methods can also be classified into supervised and unsupervised methods, according to whether training samples are required. These traditional methods can generally produce favorable defect detection results for some specific fabrics. However, they rely heavily on hand-crafted low-level features and heuristic cues, such as color, texture, intensity, background prior and orientation, and fail to exploit high-level semantic information, thus lacking adaptability and generalization for detecting defects.

Visual saliency can quickly localize salient targets or regions by simulating the human visual system.²⁵ Although the texture of fabric images is complex and the defect types are diverse, the defects remain prominent against the complex texture background. Therefore, detecting fabric defects with visual saliency, which is computed by simulating the human visual perception mechanism and combining the characteristics of the fabric image, is a promising solution. In the past few years, with the breakthrough of deep learning in computer vision, salient object detection (SOD) models have gradually replaced the traditional methods in fabric defect detection and achieved significant performance improvements.^26–30 Nonetheless, most existing SOD models are based only on convolutional neural networks (CNNs), which cannot explicitly learn global contextual information with long-range relationships, although such information plays a crucial role in detecting objects against cluttered and complex backgrounds. Recently, the vision Transformer³¹ has been shown to model long-distance dependencies with a self-attention mechanism in the image classification task. However, only a few studies have applied the Transformer to SOD, especially in fabric defect detection.

To address the above problem, this paper proposes a novel saliency model, ACCTNet, for fabric defect detection, which integrates the respective strengths of the CNN in local spatial details and the vision Transformer in global contextual information. Specifically, the four contributions are as follows:

(1) An adjacent context coordination module is proposed, which utilizes the cross-attention fusion of a local branch and two adjacent branches to realize the mutual coordination of contextual information at different scales.

(2) A contrast-aggregation module is proposed to capture contrast information through pooling and subtraction operations.

(3) The global information is mined via vision Transformer and fused with the local features for accurate prediction of fabric defects.

(4) The proposed saliency model achieves 78.49% and 97.19% E_m on the plain and patterned datasets, which are significantly better than other existing fabric defect detection algorithms.

Related work

Deep learning has led to a variety of CNN-based computer vision methods, and the gradual improvement in their performance has greatly contributed to the development of SOD. Among them, the most significant are visual saliency models based on the fully convolutional network (FCN),³² which was originally designed for semantic segmentation. They allow end-to-end spatial saliency representation learning and preserve spatial information, thus allowing for effective saliency prediction during the feed-forward process. Therefore, a series of fabric defect detection algorithms have been devised that exploit FCN-based visual saliency.

Liu et al.³³ proposed an FCN-based saliency model with an attention mechanism to achieve accurate segmentation of defective fabric regions. The model first extracts multi-level and multi-scale features using the FCN to enhance the characterization of fabric texture. Then, the attention mechanism is integrated into the backbone network to assign different weights to different feature maps, further improving the effectiveness of feature extraction. Finally, the multi-level saliency maps are generated by deconvolution and fused through a series of short connection layers to better detect defective regions. Liu et al.³⁴ proposed a two-branch balanced saliency model based on FCN for automatic fabric defect inspection to enhance quality control in textile manufacturing. The model addresses scale variation and contextual information fusion through the two-branch network, thereby improving the performance of defect detection. Furthermore, Liu et al.³⁵ developed a saliency model that accurately identifies fabric defects even under low contrast, owing to its multi-scale attention mechanism and two-way information interaction. Jing et al.³⁶ proposed an efficient Mobile-Unet model by introducing depthwise separable convolution and addressed the data imbalance problem in fabric defect datasets by introducing a median frequency loss function. Wang et al.³⁷ proposed an enhanced encoder-decoder network with hierarchical supervision for surface defect detection, which produces satisfactory region consistency and boundary localization of defects. Cui et al.³⁸ developed an autocorrelation-aware aggregation SOD network for identifying surface defects against cluttered, low-contrast backgrounds. To obtain fine edge predictions of defect regions, Sun et al.³⁹ proposed a semantics-guided detection method, which includes a guided atrous pyramid module and a query context module that enable semantic features to be learned throughout the inference phase. However, the above CNN-based methods cannot fully utilize contextual semantic information and neglect its effective coordination, resulting in limited detection performance.

More recently, the Transformer architecture, originally used in natural language processing, has made great progress in image recognition.³¹ It uses the self-attention mechanism to explicitly model long-range dependencies in images, thus significantly improving recognition ability. To overcome the lack of global semantic information in CNNs, some efforts have begun to introduce the Transformer into defect detection. Shang et al.⁴⁰ replaced convolution with Transformer in the encoder module to learn long-range dependencies and constructed a dynamic graph to implicitly encode position information. Wang et al.⁴¹ explored the advantages of Transformer and CNN for defect detection and proposed a simplified yet effective hybrid model, DefT, to accurately identify defective regions. However, there is still limited research on high-precision and high-efficiency detection methods specifically designed for fabric surface defects.

Methodology

In this paper, we propose a saliency model based on adjacent context coordination and Transformer, named ACCTNet, for fabric defect detection. The network architecture, illustrated in Figure 1, consists of three main parts: the encoder network, the adjacent context coordination module, and the contrast-aggregation module. The encoder network is used for multi-scale feature extraction. The adjacent context coordination module realizes the coordinated fusion of features from adjacent levels. The contrast-aggregation module captures the difference between each feature and its local neighborhood; meanwhile, its output is fused with the contextual semantic information learned by the vision Transformer to enhance the feature representation. Finally, all side outputs are supervised by a loss function to guide the model to generate more accurate defect prediction maps.

Figure 1.

Saliency model based on adjacent context coordination and Transformer.

The encoder network

The basic encoder network of our model is VGG16 without the last pooling layer and the three fully connected layers, as shown in Figure 2. More concretely, the encoder network consists of five convolutional blocks, denoted as $E_{t}$ ($t \in \{1, 2, 3, 4, 5\}$, where $t$ is the block index). In this paper, the feature maps of the last convolutional layer of each block, that is, conv1-2, conv2-2, conv3-3, conv4-3, and conv5-3, are extracted for the subsequent operations. The input size of the encoder network is $352 \times 352 \times 3$, and the spatial size of the output feature maps of block $t$ is $h_{t} = w_{t} = \frac{352}{2^{t - 1}}$.
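As a quick check of the size formula above, the following minimal sketch lists the side-output shapes for the five blocks. The channel widths are the standard VGG16 values and are not stated in the text; the function name `side_output_shape` is only illustrative.

```python
# Spatial size of the feature map taken from each encoder block E_t,
# following h_t = w_t = 352 / 2^(t-1) from the text.
INPUT_SIZE = 352

# Assumption: standard VGG16 channel widths for conv1-2 ... conv5-3.
VGG16_CHANNELS = [64, 128, 256, 512, 512]


def side_output_shape(t, input_size=INPUT_SIZE):
    """Return (height, width, channels) of block t, with t in 1..5."""
    if not 1 <= t <= 5:
        raise ValueError("t must be in 1..5")
    side = input_size // 2 ** (t - 1)
    return side, side, VGG16_CHANNELS[t - 1]


shapes = [side_output_shape(t) for t in range(1, 6)]
for t, shape in enumerate(shapes, start=1):
    print(f"E_{t}: {shape}")
```

Because the last pooling layer is removed, the deepest block keeps a resolution of 22 × 22 rather than 11 × 11.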

Figure 2.

The encoder network.

Adjacent context coordination module

The high-level features from deeper convolutional layers provide abundant semantic information, while the low-level features from shallower convolutional layers offer detailed information. Therefore, the interactions between features at adjacent levels are able to capture complementary information across different levels, facilitating the determination of defect regions and the refinement of defect details. To accomplish this, the adjacent context coordination module is proposed in this paper, which coordinates the cross-scale features of the previous, current and subsequent blocks in the encoder network. Generally, the adjacent context coordination module consists of three branches, namely, the local branch located in the middle and the two adjacent branches on both sides (as illustrated by A2, A3, and A4 in Figure 1). In particular, A1 and A5 contain only two branches: a local branch and an adjacent branch. Concretely, the processing of the adjacent context coordination module is denoted as $F(\cdot)$, and its output feature can be formulated as follows:

$$f_{accm}^{t} = \begin{cases} F(f_{a}^{t}, f_{a}^{t + 1}), & t = 1 \\ F(f_{a}^{t - 1}, f_{a}^{t}, f_{a}^{t + 1}), & t = 2, 3, 4 \\ F(f_{a}^{t - 1}, f_{a}^{t}), & t = 5 \end{cases} \tag{1}$$

where $f_{a}^{t - 1}$ , $f_{a}^{t}$ and $f_{a}^{t + 1}$ are the output features of previous, current and subsequent blocks respectively.

Local branch: The local branch operates on the current feature $f_{a}^{t} \in \mathbb{R}^{h_{t} \times w_{t} \times c_{t}}$ and consists of two steps. First, a receptive field block (RFB)⁴² with four dilated convolutions is applied to expand the receptive field without increasing the number of parameters and to extract multi-receptive-field features. Its output $f_{dc}^{t}$ can be represented as:

$$f_{dc}^{t} = \mathrm{RFB}(f_{a}^{t}) \tag{2}$$

Then, $f_{dc}^{t}$ is further refined in an adaptive manner by a coordinate attention (CA)⁴³ module, which is formulated as:

$$f_{local}^{t} = \mathrm{CA}(f_{dc}^{t}) \tag{3}$$

where $f_{local}^{t}$ stands for the features generated by the local branch. In this way, the model can capture not only cross-channel information but also direction-aware and location-aware information, thus more accurately identifying and localizing the defective regions.

Adjacent branches: There are two types of adjacent branches: the previous-to-current branch and the subsequent-to-current branch. The output of the former, $f_{p}^{t}$, and the output of the latter, $f_{l}^{t}$, are computed as:

$$f_{p}^{t} = \mathrm{SA}(\mathrm{Down}(f_{a}^{t - 1})) \otimes f_{dc}^{t}, \quad t = 2, 3, 4, 5 \tag{4}$$

$$f_{l}^{t} = \mathrm{SA}(\mathrm{Up}(f_{a}^{t + 1})) \otimes f_{dc}^{t}, \quad t = 1, 2, 3, 4 \tag{5}$$

where $\mathrm{SA}$ denotes spatial attention (SA),⁴⁴ $\mathrm{Down}$ is 2× downsampling by max pooling, $\mathrm{Up}$ is 2× upsampling by bilinear interpolation and ⊗ represents element-wise multiplication.
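The two ingredients of the previous-to-current branch can be illustrated with a small pure-Python sketch: 2× max pooling brings the previous level to the resolution of the current level, and an attention map then gates the current feature element-wise. The spatial attention itself is not implemented here; the `attention` values are hypothetical stand-ins, and the sketch assumes the attention map has a single channel that is shared across all feature channels.

```python
def max_pool_2x(fmap):
    """2x downsampling of an H x W map (list of rows) by 2 x 2 max pooling, stride 2."""
    height, width = len(fmap), len(fmap[0])
    return [
        [
            max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
            for j in range(0, width - 1, 2)
        ]
        for i in range(0, height - 1, 2)
    ]


def gate(attention, feature):
    """Multiply every channel of a C x H x W feature by one H x W attention map."""
    return [
        [[a * v for a, v in zip(a_row, f_row)] for a_row, f_row in zip(attention, channel)]
        for channel in feature
    ]


# Toy "previous level" map at twice the resolution of the current level.
previous = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
pooled = max_pool_2x(previous)  # now 2 x 2, matching the current level

# Hypothetical attention weights in [0, 1], standing in for SA(Down(...)).
attention = [[0.0, 1.0], [0.5, 1.0]]
# Toy current-level feature f_dc with two channels of size 2 x 2.
f_dc = [[[2.0, 4.0], [6.0, 8.0]], [[1.0, 1.0], [1.0, 1.0]]]
f_p = gate(attention, f_dc)

print(pooled)
print(f_p)
```

The subsequent-to-current branch follows the same pattern, with bilinear upsampling in place of max pooling.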

Branch fusion: After the feature coordination described above, the output features of the branches are fused with the input $f_{a}^{t}$. Consequently, the output of the adjacent context coordination module, $f_{accm}^{t}$, is formulated as:

$$f_{accm}^{t} = \begin{cases} f_{local}^{t} \oplus f_{l}^{t} \oplus f_{a}^{t}, & t = 1 \\ f_{local}^{t} \oplus (f_{p}^{t} \oplus f_{l}^{t}) \oplus f_{a}^{t}, & t = 2, 3, 4 \\ f_{local}^{t} \oplus f_{p}^{t} \oplus f_{a}^{t}, & t = 5 \end{cases} \tag{6}$$

where ⊕ is an element-wise summation operation. As a result, a large amount of valuable contextual information is used for the detection of defective regions.
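The case analysis in equations (1) and (6) amounts to dropping the branch whose neighbor does not exist at the first and last levels. A minimal sketch of that bookkeeping follows, with features reduced to flat lists of numbers; the helper names `active_branches` and `fuse` are illustrative and not taken from the paper.

```python
def active_branches(t, num_levels=5):
    """Names of the branches that contribute at encoder level t (1-indexed)."""
    if not 1 <= t <= num_levels:
        raise ValueError("t out of range")
    branches = ["local"]
    if t > 1:
        branches.append("previous")    # f_p^t, equation (4)
    if t < num_levels:
        branches.append("subsequent")  # f_l^t, equation (5)
    return branches


def fuse(f_a, f_local, f_p=None, f_l=None):
    """Element-wise sum of the available terms of equation (6) on flat lists."""
    terms = [f_local, f_a] + [f for f in (f_p, f_l) if f is not None]
    return [sum(values) for values in zip(*terms)]


for t in range(1, 6):
    print(t, active_branches(t))

# Middle level: all four terms are present.
fused = fuse(f_a=[1.0, 2.0], f_local=[0.5, 0.5], f_p=[0.25, 0.0], f_l=[0.25, 1.0])
print(fused)
```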

Contrast-aggregation module

In fabric images, defects often exhibit low contrast with the background texture, making it difficult to localize defect boundaries accurately. To overcome this problem, the contrast-aggregation module is proposed in our model to capture sufficient contrast information. Inspired by Achanta et al.,⁴⁵ the contrast feature $[f_{accm}^{t}]^{'}$ is calculated as the difference between $f_{accm}^{t}$ and its average-pooled version, that is,

$$[f_{accm}^{t}]^{'} = f_{accm}^{t} - \mathrm{AvgPool}(f_{accm}^{t}) \tag{7}$$

where $\mathrm{AvgPool}(\cdot)$ represents an average pooling operation with a kernel size of 3.
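The effect of equation (7) can be seen on a toy single-channel map. The sketch below assumes a stride of 1 and border windows that average only the pixels inside the map, so that the pooled map keeps the input size and the subtraction is well defined; the text specifies only the kernel size, so these choices are assumptions.

```python
def avg_pool_3x3(fmap):
    """3 x 3 average pooling with stride 1; border windows average only the
    pixels that fall inside the map, so the output keeps the input size."""
    height, width = len(fmap), len(fmap[0])
    pooled = []
    for i in range(height):
        row = []
        for j in range(width):
            window = [
                fmap[y][x]
                for y in range(max(0, i - 1), min(height, i + 2))
                for x in range(max(0, j - 1), min(width, j + 2))
            ]
            row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


def contrast_feature(fmap):
    """Equation (7): the map minus its local average."""
    pooled = avg_pool_3x3(fmap)
    return [[v - p for v, p in zip(f_row, p_row)] for f_row, p_row in zip(fmap, pooled)]


# A single bright pixel on a flat background.
fmap = [[0.0, 0.0, 0.0], [0.0, 9.0, 0.0], [0.0, 0.0, 0.0]]
contrast = contrast_feature(fmap)
for row in contrast:
    print(row)
```

The bright pixel keeps a large positive response while its flat surroundings become slightly negative, which is the local-contrast behavior the module relies on.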

Then, to further incorporate global contextual information into fabric defect detection, the last four side outputs of the encoder network $P_{t} (t = 2, 3, 4, 5)$ are fused by a stack of Transformer blocks. As illustrated in Figure 1, the whole process can be expressed as:

$$P_{g} = \mathrm{cat}(D_{8}(P_{2}), D_{4}(P_{3}), D_{2}(P_{4}), P_{5}) \tag{8}$$

$$G = R(\mathrm{LP}(\mathrm{Transformer}(\mathrm{LP}(P_{g})))) \tag{9}$$

where $\mathrm{cat}(\cdot)$ is a concatenation operation along the channel dimension, $D_{i}$ $(i = 2, 4, 8)$ represents the downsampling operation with stride $i$, $R(\cdot)$ denotes the reshape operation and $\mathrm{LP}(\cdot)$ is the linear projection operation. $\mathrm{Transformer}(\cdot)$ denotes a stack of Transformer blocks, each including a multi-head self-attention (MHSA) layer, a feed-forward network (FFN) and two layer normalizations (LN).
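The shape bookkeeping behind equation (8) can be checked with a short sketch. It assumes that $P_{2}$ to $P_{5}$ are the raw VGG16 side outputs for a 352 × 352 input with the standard channel widths (128, 256, 512, 512); neither assumption is stated explicitly in the text.

```python
# Assumed (height, width, channels) of the side outputs P_2 .. P_5.
SIDE_OUTPUTS = {2: (176, 176, 128), 3: (88, 88, 256), 4: (44, 44, 512), 5: (22, 22, 512)}
# Downsampling stride applied to each side output in equation (8); P_5 is used as is.
STRIDES = {2: 8, 3: 4, 4: 2, 5: 1}


def downsample(shape, stride):
    """Shape after downsampling the spatial dimensions by an integer stride."""
    height, width, channels = shape
    return height // stride, width // stride, channels


def concat_channels(shapes):
    """Shape after concatenation along the channel dimension."""
    if len({(h, w) for h, w, _ in shapes}) != 1:
        raise ValueError("spatial sizes must match before concatenation")
    height, width, _ = shapes[0]
    return height, width, sum(c for _, _, c in shapes)


aligned = [downsample(SIDE_OUTPUTS[t], STRIDES[t]) for t in (2, 3, 4, 5)]
p_g = concat_channels(aligned)
num_tokens = p_g[0] * p_g[1]  # one token per spatial position

print(p_g)
print(num_tokens)
```

Under these assumptions all four maps meet at 22 × 22, so the Transformer operates on 484 tokens before the result is reshaped back into a feature map.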

Finally, the feature $G$ is fed into the contrast-aggregation module for the fusion of local and global features. To generate finer feature maps, a progressive upsampling process is adopted, as shown in Figure 1. For the $t$-th block in the encoder network, the fused feature $S^{t}$ used to predict defects is obtained by combining $f_{accm}^{t}$, $[f_{accm}^{t}]^{'}$, $S^{t - 1}$ and $G$, which can be expressed as:

$$S^{t} = \mathrm{Deconv}(\mathrm{cat}(f_{accm}^{t}, [f_{accm}^{t}]^{'}, G, S^{t - 1})) \tag{10}$$

where $\mathrm{Deconv}(\cdot)$ is a deconvolution layer with kernel size 5 and stride 2.
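If each deconvolution is meant to double the spatial resolution, the padding has to be chosen accordingly. The sketch below uses the usual transposed-convolution size formula with dilation 1; the values `padding=2` and `output_padding=1` are assumptions, since the text gives only the kernel size and stride.

```python
def deconv_output_size(size, kernel=5, stride=2, padding=2, output_padding=1):
    """Output length of a transposed convolution along one dimension (dilation 1)."""
    return (size - 1) * stride - 2 * padding + (kernel - 1) + output_padding + 1


# Progressive upsampling from the deepest level back to the input resolution.
sizes = [22]
for _ in range(4):
    sizes.append(deconv_output_size(sizes[-1]))
print(sizes)

# Without padding the same layer would not give an exact 2x upsampling.
unpadded = deconv_output_size(22, padding=0, output_padding=0)
print(unpadded)
```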

Experimental results and analysis

Experimental configuration

Datasets

A series of experiments and evaluations of the model are carried out on the plain and patterned fabric datasets, and some examples from the datasets are shown in Figure 3. The plain fabric dataset comprises 2200 images for training and 500 images for testing, encompassing five primary types of defects: indentations, creases, detachments, stains, and holes. This dataset includes numerous elongated defects and tiny defects with low contrast. The patterned fabric dataset contains 5948 images for training and 500 images for testing, with more complex background textures and irregular defects. It contains six main types of defects: stripped yarns, broken yarns, yarns, cotton balls, holes, and stains.

Figure 3.

Sample fabric image dataset (the first two rows are plain fabric and the last two rows are pattern fabric).

Evaluation indicators

The proposed fabric defect detection model is evaluated in terms of six commonly adopted metrics, that is, Precision-Recall (PR) curve, F-measure curve, Maximum F-measure ( $F_{β}$ ), Mean Absolute Error (MAE), S-measure ( $S_{m}$ ) and E-measure ( $E_{m}$ ).

Precision and Recall are calculated from the binarized saliency map and the ground truth map. Precision is the proportion of pixels predicted as defective that are truly defective, and Recall is the proportion of defective pixels in the ground truth map that are correctly predicted, as represented below:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{11}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{12}$$

where TP, FP, and FN denote the true positive, false positive, and false negative pixels, respectively. A set of thresholds from 0 to 255 is applied, each of which produces a pair of Precision-Recall values to form the PR curve for performance comparison.
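The threshold sweep can be sketched on a toy 2 × 2 saliency map as follows. Binarizing with `>=` and returning 0 when a denominator is empty are conventions assumed here, not details given in the text.

```python
def precision_recall(saliency, gt, threshold):
    """Precision and Recall of a saliency map (values 0..255) binarized at a threshold."""
    tp = fp = fn = 0
    for s_row, g_row in zip(saliency, gt):
        for s, g in zip(s_row, g_row):
            predicted = s >= threshold  # assumed binarization rule
            if predicted and g:
                tp += 1
            elif predicted and not g:
                fp += 1
            elif g:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


saliency = [[200, 90], [30, 220]]
gt = [[1, 0], [0, 1]]

# Each threshold contributes one point of the PR curve.
curve = {th: precision_recall(saliency, gt, th) for th in (50, 128, 210)}
for th, (p, r) in curve.items():
    print(th, round(p, 3), round(r, 3))
```

A low threshold admits a false positive (lower Precision), while a high threshold misses one defective pixel (lower Recall).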

$F_{β}$ is the weighted harmonic mean of Precision and Recall and evaluates the overall performance. The maximum $F_{β}$ is the peak value of the F-measure curve, and the larger the area under the F-measure curve, the better the performance. $F_{β}$ is defined as:

$$F_{β} = \frac{(1 + β^{2}) \times \mathrm{Precision} \times \mathrm{Recall}}{β^{2} \times \mathrm{Precision} + \mathrm{Recall}} \tag{13}$$

where $β^{2}$ is set to 0.3.
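A small worked example of equation (13) with $β^{2} = 0.3$ is given below. The Precision-Recall pairs are illustrative values, and returning 0 for an empty denominator is an assumed convention.

```python
def f_beta(precision, recall, beta_sq=0.3):
    """Weighted harmonic mean of Precision and Recall, equation (13)."""
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta_sq) * precision * recall / denominator


# Illustrative (Precision, Recall) pairs, one per binarization threshold.
pairs = [(2 / 3, 1.0), (1.0, 1.0), (1.0, 0.5)]
scores = [f_beta(p, r) for p, r in pairs]
max_f = max(scores)  # the maximum F-measure is the peak of the curve

print([round(s, 4) for s in scores])
print(max_f)
```

With $β^{2} < 1$, Precision is weighted more heavily than Recall: the pair (1.0, 0.5) scores higher than (2/3, 1.0).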

The MAE⁴⁶ score is calculated as the average absolute difference between the predicted saliency map S and its ground truth G. The lower the MAE score, the better the performance. MAE score is computed as:

$$\mathrm{MAE} = \frac{1}{W \times H} \sum_{x = 1}^{W} \sum_{y = 1}^{H} | S(x, y) - G(x, y) | \tag{14}$$

where H and W denote the height and width of G respectively, and $(x, y)$ is the coordinate of the pixel position.
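Equation (14) reduces to a mean over all pixels, as in this minimal sketch; it assumes the saliency map has already been normalized to $[0, 1]$.

```python
def mean_absolute_error(saliency, gt):
    """Average absolute difference between a [0, 1] saliency map and a binary mask."""
    total = 0.0
    count = 0
    for s_row, g_row in zip(saliency, gt):
        for s, g in zip(s_row, g_row):
            total += abs(s - g)
            count += 1
    return total / count


saliency = [[0.9, 0.1], [0.2, 0.8]]
gt = [[1, 0], [0, 1]]
mae = mean_absolute_error(saliency, gt)
print(round(mae, 4))
```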

The S-measure ( $S_{m}$ )⁴⁷ differs from the above metrics, which deal only with pixel-level errors. It evaluates the structural similarity between the predicted saliency maps and the corresponding ground truth maps and is composed of the region-aware structural similarity $S_{r}$ and the object-aware structural similarity $S_{o}$, which can be expressed as:

$$S_{m} = λ \cdot S_{o} + (1 - λ) \cdot S_{r} \tag{15}$$

where $λ$ is set to 0.5. The higher the value of S-measure, the better the performance of the model.

The E-measure ( $E_{m}$ )⁴⁸ takes into account both the pixel-level matching information and the image-level statistics, and can be expressed as:

$$E_{m} = \frac{1}{W_{S} \times H_{S}} \sum_{x = 1}^{W_{S}} \sum_{y = 1}^{H_{S}} θ(x, y) \tag{16}$$

where $θ(x, y)$ represents the enhanced alignment matrix, which reflects the correlation between the normalized saliency map $S \in [0, 1]^{W \times H}$ and the ground truth map $G \in \{0, 1\}^{W \times H}$ after subtracting their respective average values.
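A minimal sketch of equation (16) is given below. It uses the enhanced alignment form from the original E-measure definition, $θ = (1 + ξ)^{2} / 4$ with $ξ = 2 φ_{S} φ_{G} / (φ_{S}^{2} + φ_{G}^{2})$, where $φ$ denotes a map minus its mean. The small `eps` term is an assumption added to avoid division by zero, and edge cases such as an empty ground truth are not handled.

```python
def e_measure(pred, gt, eps=1e-12):
    """Mean of the enhanced alignment matrix between a prediction and a binary mask."""
    flat_p = [float(v) for row in pred for v in row]
    flat_g = [float(v) for row in gt for v in row]
    n = len(flat_p)
    mean_p = sum(flat_p) / n
    mean_g = sum(flat_g) / n
    total = 0.0
    for p, g in zip(flat_p, flat_g):
        phi_p, phi_g = p - mean_p, g - mean_g  # bias (mean-subtracted) values
        xi = 2.0 * phi_p * phi_g / (phi_p ** 2 + phi_g ** 2 + eps)  # alignment in [-1, 1]
        total += (1.0 + xi) ** 2 / 4.0  # enhanced alignment in [0, 1]
    return total / n


gt = [[1, 0], [0, 0]]
perfect = e_measure(gt, gt)
partial = e_measure([[1, 1], [0, 0]], gt)
inverted = e_measure([[0, 1], [1, 1]], gt)
print(round(perfect, 4), round(partial, 4), round(inverted, 4))
```

A perfect prediction scores close to 1, a fully inverted one close to 0, and a partially correct one falls in between.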

Implementation details

The proposed saliency model for fabric defect detection is implemented with PyTorch on NVIDIA V100 GPUs. During the training and testing phases, the input fabric images are uniformly resized to $352 \times 352$. The weights of the encoder network are initialized from the pre-trained VGG-16 model, and the parameters of all newly added layers are initialized from a normal distribution. We set the initial learning rate to 5e-5, the batch size to 6 and the number of training epochs to 70. The learning rate is reduced by a factor of 10 at the 51st epoch.
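The step schedule described above can be written as a small function. The sketch assumes 1-indexed epochs and that the reduced rate applies from epoch 51 through epoch 70; the function name `learning_rate` is illustrative.

```python
BASE_LR = 5e-5
NUM_EPOCHS = 70
DROP_EPOCH = 51   # assumed: first epoch that uses the reduced rate
DROP_FACTOR = 10.0


def learning_rate(epoch):
    """Learning rate for a 1-indexed training epoch under a single step decay."""
    if not 1 <= epoch <= NUM_EPOCHS:
        raise ValueError("epoch out of range")
    return BASE_LR / DROP_FACTOR if epoch >= DROP_EPOCH else BASE_LR


schedule = [learning_rate(epoch) for epoch in range(1, NUM_EPOCHS + 1)]
print(schedule[49], schedule[50])  # epochs 50 and 51
```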

Comparative results with state-of-the-art methods

We compare the model proposed in this paper with 12 current state-of-the-art salient object detection methods on two fabric datasets in terms of both qualitative and quantitative results.

Quantitative analysis: From Table 1, it can be observed that our method ACCTNet achieves the best score on every evaluation metric except MAE and $S_{m}$ on the plain dataset, which indicates that our method identifies the defective regions more accurately and performs particularly well on the patterned dataset. Figure 4 shows the PR curves and F-measure curves of the different methods on the two datasets. The PR curves of our model on both datasets are closer to the upper right corner than those of the other methods, which demonstrates the excellent performance of our method in fabric defect detection. In addition, the F-measure curves of our method enclose those of the other methods in both cases, which also shows that the performance of our method is superior.

Figure 4.

Comparison of PR curves and F-measure curves.

Table 1.

Quantitative comparison of different methods.

Methods	Plain fabric dataset				Pattern fabric dataset
Methods	$F_{β}$	MAE	$S_{m}$	$E_{m}$	$F_{β}$	MAE	$S_{m}$	$E_{m}$
NLDF⁴⁸	0.6095	0.0041	0.7046	0.7838	0.8391	0.0105	0.8703	0.9108
DSS⁴⁹	0.6006	0.0059	0.6591	0.6963	0.8647	0.0108	0.8840	0.7898
PiCANet⁵⁰	0.5985	0.0069	0.6634	0.4016	0.5939	0.1088	0.6375	0.6039
R³Net⁵¹	0.6125	0.0053	0.6287	0.7070	0.6331	0.0416	0.6514	0.7798
BASNet⁵²	0.6161	0.0674	0.6901	0.7537	0.6787	0.1906	0.6952	0.7714
PoolNet⁵³	0.6004	0.0053	0.6776	0.5303	0.7413	0.0199	0.7942	0.6932
GateNet⁵⁴	0.6191	0.0057	0.6152	0.4678	0.7753	0.0257	0.7964	0.7259
F³Net⁵⁵	0.6069	0.0049	0.7253	0.7396	0.8619	0.0095	0.8835	0.9025
C2FNet⁵⁶	0.5820	0.062	0.7081	0.7011	0.6282	0.0329	0.8575	0.6008
PSGLoss⁵⁷	0.5945	0.0037	0.6714	0.7686	0.8627	0.0069	0.8744	0.9569
TPRNet⁵⁸	0.6073	0.0053	0.7027	0.7085	0.8836	0.0067	0.9056	0.9373
ICON⁵⁹	0.5889	0.0044	0.6096	0.7375	0.8374	0.0106	0.8694	0.8952
ACCTNet	0.6511	0.0040	0.7239	0.7849	0.9077	0.0064	0.9208	0.9719

The bold entries mean the best performance.

We also compare the number of parameters and the running speed of each model in Table 2. Among the compared methods, ACCTNet has the fewest parameters (26.72 M) while running at 27 FPS, second only to PoolNet (30 FPS), which demonstrates the high efficiency of our method.

Table 2.

Efficiency comparison of different models.

Methods	Params (M)	FPS
NLDF⁴⁸	35.48	11
DSS⁴⁹	62.24	12
PoolNet⁵³	68.26	30
PiCANet⁵⁰	47.22	15
R³Net⁵¹	56.16	18
GateNet⁵⁴	128.63	22
BASNet⁵²	87.06	25
ACCTNet	26.72	27

Qualitative analysis: To better visualize the validity of our method, some of the saliency maps generated by the different methods are shown in Figure 5. Various scenarios such as tiny defects, line defects, dot defects, slightly larger defects, irregular defects, and defects with low contrast to the background on different fabric textures are included. It can be seen that our method performs well in detecting dot defects, with more complete region segmentation results and clearer contours than most models. The line defects are also detected with clear and noise-free contours. In the case of large defects, the detection maps are more accurate and closer to the ground truth maps. Meanwhile, for the irregular defects and the defects with low contrast, our method locates them more accurately and segments them more completely. For example, as shown in the seventh row of Figure 5, the comparison methods are unable to detect the defective regions accurately. BASNet has the worst performance with a high false detection rate, and the contours produced by the other methods appear blurred and rough. Our method generates defective regions with high region consistency and clear boundary localization on different fabric textures.

Figure 5.

Visual comparison of different methods.

Ablation experiments

In this section, we perform a series of ablation studies on the plain fabric dataset to investigate the effectiveness of the three main components proposed in our approach. We first perform an ablation study of the encoder network by replacing VGG16 with ResNet50 and Res2Net,⁶⁰ and the results using different encoder networks are shown in Table 3. It can be seen that the model achieves the best performance when using VGG16 as the encoder network. Moreover, the results with different encoder networks show little variation, indicating that the proposed method is robust to the choice of encoder.

Table 3.

Comparison results of different encoder networks.

Backbone structures	$F_{β}$	MAE	$S_{m}$	$E_{m}$
VGG16	0.6511	0.0040	0.7239	0.7849
ResNet50	0.6290	0.0044	0.7206	0.7806
Res2Net	0.6218	0.0044	0.7170	0.7707

The bold entries mean the best performance.

The comparison results using different components of our method are reported in Figure 4 and Table 4. From Table 4, we can observe that, on the plain fabric dataset, the network with only the adjacent context coordination module outperforms the networks with either of the other two modules alone, which shows the effectiveness of the adjacent context coordination module; on the patterned fabric dataset, the contrast-aggregation module alone gives the best single-module result. In addition, the network with only the Transformer blocks performs the worst on both datasets, indicating that relying solely on global information without local texture information is insufficient for achieving satisfactory performance. The networks with either of the other two components outperform the network with only Transformer blocks, which suggests that local texture information matters more than global contextual information when each is used alone. Since the network with all three components combines local and global features effectively, it performs better than the configurations that consider only one type of feature, which demonstrates the necessity of the three components we designed. In brief, local texture information and global contextual information are complementary, and both are needed to achieve the best model performance.

Table 4.

Comparison results with the different components.

	Plain fabric dataset				Pattern fabric dataset
Configurations	$F_{β}$	MAE	$S_{m}$	$E_{m}$	$F_{β}$	MAE	$S_{m}$	$E_{m}$
Baseline + A	0.6401	0.0043	0.7196	0.7587	0.8437	0.0110	0.8637	0.9166
Baseline + B	0.6310	0.0045	0.7165	0.7459	0.8768	0.0074	0.8988	0.9510
Baseline + C	0.6291	0.0047	0.7031	0.6985	0.6892	0.0167	0.8086	0.8152
Baseline + A + B	0.6438	0.0042	0.7259	0.7823	0.8936	0.0068	0.8968	0.9498
Baseline + A + B + C	0.6511	0.0040	0.7239	0.7849	0.9036	0.0064	0.9180	0.9705

A denotes the adjacent context coordination module, B denotes the contrast-aggregation module, and C denotes the Transformer block.

To demonstrate the effectiveness of each component more intuitively, Figure 6 shows the different output feature maps in the proposed method. From Figure 6, it is evident that the adjacent context coordination module captures more discriminative features, directing more attention to the defective regions and thereby making the defects more salient. After the contrast-aggregation module, the feature map has less background texture interference, giving a more distinct visual representation. Additionally, under the global guidance of the Transformer blocks, the output feature map of the last layer highlights the defects more clearly and brightly. The final prediction map confirms that the proposed method can generate precise fabric defect detection results.

Figure 6.

Visual comparison of different output feature maps in the proposed method. (a) Fabric image, (b) ground truth; the output maps of (c) encoder, (d) adjacent context coordination module, (e) contrast-aggregation module, (f) the last layer of model, and (g) prediction map.

We also investigate the effect of the number of Transformer blocks on the defect detection performance by setting the number of Transformer blocks to $\{1, 2, 3, 4, 5, 6\}$ in turn. The results are shown in Table 5. The proposed model achieves the best $F_{β}$, MAE and $S_{m}$ when employing four Transformer blocks, while $E_{m}$ is marginally higher with a single block. When the number of Transformer blocks exceeds four, all four metrics are worse than with four blocks. This suggests that a saturation point is reached with four Transformer blocks, and that additional blocks could introduce redundant information, thereby leading to a decline in performance.

Table 5.

Performance comparison of different numbers of Transformer blocks.

Numbers of Transformer blocks	$F_{β}$	MAE	$S_{m}$	$E_{m}$
1	0.6498	0.0042	0.7234	0.7865
2	0.6470	0.0042	0.7207	0.7807
3	0.6493	0.0043	0.7156	0.7760
4	0.6511	0.0040	0.7239	0.7849
5	0.6467	0.0042	0.7178	0.7832
6	0.6460	0.0043	0.7210	0.7796

The bold entries mean the best performance.

To show the effectiveness of the adopted deep supervision strategy, we compare two supervision strategies, multi-level supervision and single-level supervision, and report the results in Table 6. The multi-level supervision strategy achieves the best performance, with a 6.98% decrease in MAE and increases of 1.89%, 0.21% and 0.78% in $F_{β}$, $S_{m}$ and $E_{m}$, respectively, relative to the single-level supervision strategy. In addition, a visual comparison of the two supervision strategies is shown in Figure 7. The multi-level supervision strategy produces feature maps and prediction maps with more accurate localization, more complete regions and clearer contours.

Table 6.

Performance comparison of different supervision learning strategies.

Supervision learning strategies	$F_{β}$	MAE	$S_{m}$	$E_{m}$
Single-level supervision	0.6390	0.0043	0.7224	0.7788
Multi-level supervision	0.6511	0.0040	0.7239	0.7849

The bold entries mean the best performance.
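The relative changes quoted for the two supervision strategies can be reproduced directly from the values in Table 6, as in this short check.

```python
# Values copied from Table 6.
single = {"F_beta": 0.6390, "MAE": 0.0043, "S_m": 0.7224, "E_m": 0.7788}
multi = {"F_beta": 0.6511, "MAE": 0.0040, "S_m": 0.7239, "E_m": 0.7849}


def relative_change(new, old):
    """Change of `new` relative to `old`, in percent."""
    return (new - old) / old * 100.0


changes = {name: round(relative_change(multi[name], single[name]), 2) for name in single}
for name, value in changes.items():
    print(f"{name}: {value:+.2f}%")
```

The negative sign for MAE corresponds to the reported 6.98% decrease, since a lower MAE is better.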

Figure 7.

Visual comparison of different supervision learning strategies. (a) Fabric image; (b) ground truth; (c–g) feature maps of the five side outputs; (h) prediction map.

Conclusion and future work

In this paper, we propose a saliency model based on adjacent context coordination and Transformer for high-precision and high-efficiency fabric defect detection. First, an adjacent context coordination module is adopted to coordinate the adjacent features effectively, thereby enhancing the interaction of features at different scales. Then, a contrast-aggregation module is proposed to explore discriminative information and highlight salient defective regions against low-contrast backgrounds. Finally, a vision Transformer is used to capture long-range relationships with global contextual semantic knowledge, which guides the model to focus more on defective regions. Extensive experiments on the plain and patterned fabric defect datasets demonstrate that our method can generate complete defective regions with well-defined boundaries, outperforming the existing saliency detection methods. Meanwhile, the model has only 26.72 M parameters and runs at 27 FPS on a single GPU.

As a supervised fabric defect detection method, our ACCTNet requires a large number of labeled defective samples for model training, which limits its applicability in industrial textile scenarios where defective samples are scarce. Therefore, to reduce the reliance on defective fabric images with pixel-level annotations, we will explore semi-supervised fabric defect detection methods based on the theory of graph signal processing^61–63 in the future.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by NSFC (No. 62072489, No. 61873293), Leading Talents of Science and Technology in the Central Plain of China (234200510009), and Henan Province Key Science and Technology Research Projects (222102210008, 232102211002, 232102211030).

ORCID iD

Junpu Wang

References

1. Korman, Reichman, Tsur, et al. Fast-match: fast affine template matching. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Portland, OR, USA, 23–28 June 2013, pp. 2331–2338.
2. Zhou, Wang. Unsupervised fabric defect segmentation using local patch approximation. J Text Inst 2016; 107(6): 800–809.
3. Yapi, Allili, Baaziz. Automatic fabric defect detection using learning-based local textural distributions in the contourlet domain. IEEE Trans Autom Sci Eng 2017; 15(3): 1014–1026.
4. Chetverikov, Hanbury. Finding defects in texture using regularity and local orientation. Pattern Recognit 2002; 35(10): 2165–2180.
5. Mak, Peng, Yiu KFC. Fabric defect detection using morphological filters. Image Vision Comput 2009; 27(10): 1585–1592.
6. Hamdi, Sayed, Fouad, et al. Fully automated approach for patterned fabric defect detection. In: 2016 Fourth international Japan-Egypt conference on electronics, communications and computers (JEC-ECC), Cairo, Egypt, 31 May–2 June, 2016. Piscataway, NJ: IEEE, 2016, pp. 48–51.
7. Zhang, Bresee RR. Fabric defect detection and classification using image analysis. Text Res J 1995; 65(1): 1–9.
8. Zuo, Wang, Yang, et al. Fabric defect detection based on texture enhancement. In: 2012 5th international congress on image and signal processing, Agadir, Morocco, 28–30 June, 2012. Piscataway, NJ: IEEE, 2012, pp. 876–880.
9. Gan, Yang. Texture enhancement though multiscale mask based on RL fractional differential. In: 2010 international conference on information, networking and automation (ICINA), Kunming, China, 17–19 October, 2010. Piscataway, NJ: IEEE, pp. V1-333–V1-337.
10. Chan, Pang GKH. Fabric defect detection by Fourier analysis. IEEE Trans Indus Appl 2000; 36(5): 1267–1276.
11. Kumar, Pang GKH. Defect detection in textured materials using Gabor filters. IEEE Trans Indus Appl 2002; 38(2): 425–440.
12. Mak, Peng. Detecting defects in textile fabrics with optimal Gabor filters. Int J Comput Sci 2006; 1(4): 274–282.
13. Wen, Cao, Liu, et al. Fabric defects detection using adaptive wavelets. Int J Cloth Sci Technol 2014; 26(3): 202–211.
14. Liu. Fabric defect detection based on low-rank decomposition with structural constraints. Visual Comput 2022; 38(2): 639–653.
15. Wang, et al. Surface defects detection using non-convex total variation regularized RPCA with kernelization. IEEE Trans Instrument Measure 2021; 70: 1–13.
16. Bao, Liang, Xia, et al. Low-rank decomposition fabric defect detection based on prior and total variation regularization. Visual Comput 2022; 38(8): 2707–2721.
17. Stojanovic, Mitropulos, Koulamas, et al. Real-time vision-based system for textile fabric inspection. Real Time Imaging 2001; 7(6): 507–518.
18. Kumar. Neural network based detection of local textile defects. Pattern Recognit 2003; 36(7): 1645–1659.
19. Kuo CFJ, Lee, Tsai CC. Using a neural network to identify fabric defects in dynamic cloth inspection. Text Res J 2003; 73(3): 238–244.
20. Kuo CFJ, Lee, Tsai CC. Using a neural network to identify fabric defects in dynamic cloth inspection. Text Res J 2003; 73(3): 238–244.
21. Yin, Zhang WB. Textile flaw classification by wavelet reconstruction and BP neural network. In: Advances in neural networks–ISNN 2009: 6th international symposium on neural networks, ISNN 2009, Part II 6, Wuhan, China, 26–29 May 2009. Berlin Heidelberg: Springer, 2009, pp. 694–701.
22. Castilho, Gonçalves PJS, Pinto JRC, et al. Intelligent real-time fabric defect detection. In: Image analysis and recognition: 4th international conference, ICIAR 2007, Montreal, Canada, 22–24 August 2007. Berlin Heidelberg: Springer, pp. 1297–1307.
23. Ngan HYT, Pang GKH. Defect detection of patterned objects. In: Mechatronic systems. Boca Raton, FL: CRC Press, 2007, pp. 24-1–24-10.
24. Ngan HYT, Pang GKH, Yung, et al. Defect detection on patterned jacquard fabric. In: 32nd applied imagery pattern recognition workshop, 2003, Washington, DC, 15–17 October 2003. Piscataway, NJ: IEEE, 2003, pp. 163–168.
25. Ren, Gao, Chia, et al. Region-based saliency detection and its application in object recognition. IEEE Trans Circuits Syst Video Technol 2013; 24(5): 769–779.
26. Zahra, Amin, El-Samie FEA, et al. Fabric defect detection based on saliency map and keypoints. J Optics 2023; 52(4): 1750–1757.
27. Zahra, Amin, El-Samie FEA, et al. Efficient utilization of deep learning for the detection of fabric defects. Neural Comput Appl 2024; 99: 1–14.
28. Guan. Fabric defect detection using an integrated model of bottom-up and top-down visual attention. J Text Inst 2016; 107(2): 215–224.
29. Liu, Zheng. Fabric defect detection based on information entropy and frequency domain saliency. Visual Comput 2021; 37(3): 515–528.
30. Wan, Deng, et al. Fabric defect detection based on saliency histogram features. Comput Intellig 2019; 35(3): 517–534.
31. Dosovitskiy, Beyer, Kolesnikov, et al. An image is worth 16×16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
32. Long, Shelhamer, Darrell. Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Boston, MA, USA, 7–12 June 2015, pp. 3431–3440.
33. Liu, Wang, et al. Fabric defect detection using fully convolutional network with attention mechanism. In: ICCPR’19: 2019 8th international conference on computing and pattern recognition, Beijing, China, 23–25 October 2019.
34. Liu, Wang, et al. A dual-branch balance saliency model based on discriminative feature for fabric defect detection. Int J Cloth Sci Technol 2022; 34(3): 451–466.
35. Liu, Wang, et al. Saliency model based on discriminative feature and bi-directional message interaction for fabric defect detection. In: International forum on digital TV and wireless multimedia communications. Singapore: Springer, 2020, pp. 181–193.
36. Jing, Wang, Rätsch, et al. Mobile-Unet: an efficient convolutional neural network for fabric defect detection. Text Res J 2022; 92(1–2): 30–42.
37. Wang, et al. Sddet: an enhanced encoder–decoder network with hierarchical supervision for surface defect detection. IEEE Sensors J 2022; 23(3): 2651–2662.
38. Cui, Song, Feng, et al. Autocorrelation aware aggregation network for salient object detection of strip steel surface defects. IEEE Trans Instrument Measure 2023; 99: 1.
39. Sun, Yan, Song. QCNet: query context network for salient object detection of automatic surface inspection. Visual Comput 2023; 39(10): 4391–4403.
40. Shang, Sun, Liu, et al. Defect-aware transformer network for intelligent visual surface defect detection. Adv Eng Inform 2023; 55: 101882.
41. Wang, Yan, et al. Defect transformer: an efficient hybrid transformer architecture for surface defect detection. Measurement 2023; 211: 112614.
42. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
43. Hou, Zhou, Feng. Coordinate attention for efficient mobile network design. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Nashville, TN, USA, 20–25 June 2021, pp. 13713–13722.
44. Woo, Park, Lee, et al. Cbam: convolutional block attention module. In: Proceedings of the European conference on computer vision (ECCV), Munich, Germany, 8–14 September 2018, pp. 3–19.
45. Achanta, Hemami, Estrada, et al. Frequency-tuned salient region detection. In: 2009 IEEE conference on computer vision and pattern recognition, Miami, FL, USA, 20–25 June 2009. Piscataway, NJ: IEEE, pp. 1597–1604.
46. Perazzi, Krähenbühl, Pritch, et al. Saliency filters: contrast based filtering for salient region detection. In: 2012 IEEE conference on computer vision and pattern recognition, Providence, RI, USA, 16–21 June 2012. Piscataway, NJ: IEEE, pp. 733–740.
47. Fan, Cheng, Liu, et al. Structure-measure: a new way to evaluate foreground maps. In: Proceedings of the IEEE international conference on computer vision, Venice, Italy, 22–29 October 2017, pp. 4548–4557.
48. Fan, Gong, Cao, et al. Enhanced-alignment measure for binary foreground map evaluation. arXiv preprint arXiv:1805.10421, 2018.
49. Luo, Mishra, Achkar, et al. Non-local deep features for salient object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Honolulu, HI, USA, 21–26 July 2017, pp. 6609–6617.
50. Hou, Cheng, et al. Deeply supervised salient object detection with short connections. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Honolulu, HI, USA, 21–26 July 2017, pp. 3203–3212.
51. Deng, Zhu, et al. R3net: recurrent residual refinement network for saliency detection. In: Proceedings of the 27th international joint conference on artificial intelligence. Menlo Park, CA, USA: AAAI Press, 2018, pp. 684–690.
52. Qin, Zhang, Huang, et al. Basnet: boundary-aware salient object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Long Beach, CA, USA, 15–20 June 2019, pp. 7479–7489.
53. Liu, Hou, Cheng, et al. A simple pooling-based design for real-time salient object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Long Beach, CA, USA, 15–20 June 2019, pp. 3917–3926.
54. Zhao, Pang, Zhang, et al. Suppress and balance: a simple gated network for salient object detection. In: Computer Vision–ECCV 2020: 16th European conference, Proceedings, Part II 16, Glasgow, UK, 23–28 August, 2020. Cham: Springer International Publishing, pp. 35–51.
55. Wei, Wang, Huang. F³Net: fusion, feedback and focus for salient object detection. Proc AAAI Conf Arti Intell 2020; 34(07): 12321–12328.
56. Sun, Chen, Zhou, et al. Context-aware cross-level fusion network for camouflaged object detection. arXiv preprint arXiv:2105.12555, 2021.
57. Yang, Lin, et al. Progressive self-guided loss for salient object detection. IEEE Trans Image Proc 2021; 30: 8426–8438.
58. Zhang, et al. TPRNet: camouflaged object detection via transformer-induced progressive refinement network. Visual Comput 2022; 39: 4593–4607.
59. Zhuge, Fan, Liu, et al. Salient object detection via integrity learning. IEEE Trans Pattern Anal Mach Intell 2022; 45(3): 3738–3752.
60. Gao, Cheng, Zhao, et al. Res2net: a new multi-scale backbone architecture. IEEE Trans Pattern Anal Mach Intell 2019; 43(2): 652–662.
61. Ortega, Frossard, Kovačević, et al. Graph signal processing: overview, challenges, and applications. Proc IEEE 2018; 106(5): 808–828.
62. Giraldo, Javed, Sultana, et al. The emerging field of graph signal processing for moving object segmentation. In: International workshop on frontiers of computer vision. Cham: Springer International Publishing, 2021, pp. 31–45.
63. Leus, Marques, Moura JMF, et al. Graph signal processing: history, development, impact, and outlook. IEEE Signal Proc Magaz 2023; 40(4): 49–60.

Fabric defect detection via saliency model based on adjacent context coordination and transformer
