
Abstract

To address the shortcomings of most current anomaly detection models, such as low detection accuracy and poor generalization, this paper proposes a few-shot anomaly detection model that achieves feature registration with a convolutional multidimensional attention module (abbreviated as RCM-FSAD). The module strengthens the model’s perception of the image as a whole. A spatial transformer network is used to obtain the spatial transformation features of the image and to improve sensitivity to the relevant features, so that the model learns the commonality between categories and generalizes better. The spatial transformations and local structures of the input data are captured by deformable convolutional networks v2 to ensure the spatial invariance of the input data. The model is trained with only normal samples and performs both anomaly detection and localization of anomalous regions. On the challenging MVTec AD dataset, the unsupervised model not only improves anomaly detection accuracy but also generalizes better than current state-of-the-art unsupervised anomaly detection methods.

Keywords

Anomaly detection, registration, convolutional multidimensional attention module, few-shot learning, deformable convolutional networks v2

1. Introduction

Anomaly detection, also known as outlier detection, is a data analysis method that detects “differences” between normal samples by comparing them with each other to determine whether there is an anomaly or not, and further identifies anomalous subregions in the image (Xie et al., 2023). Due to its high efficiency and accuracy, it is widely used in various fields such as industrial anomaly detection (Bergmann et al., 2019), medical image analysis (Fernando et al., 2021), and video surveillance (Liu et al., 2018). In the field of computer vision, anomaly detection of images covers two main tasks: anomaly detection and anomaly localization. Anomaly detection focuses on the overall image anomalies for image-level judgment and classification, while anomaly localization judges pixel-level anomalies of an image and pinpoints the location of the anomalies. Currently, the variety of anomaly samples and the relatively limited number of anomalies available in practical applications pose a considerable challenge for the anomaly detection task.

With the rapid development of deep learning technology, there are numerous deep learning-based methods for image anomaly detection, and the common methods mainly include classification-based, few-shot-based, and feature registration-based methods. A common solution for classification-based methods is to model the distribution of normal samples to identify abnormal samples. To achieve this goal, Bergmann et al. (2020), Defard et al. (2021), and others proposed a strategy where separate models are trained for different classes of objects. However, this one-class-one-model scheme may lead to an increase in memory consumption, especially when the number of classes increases. Moreover, this strategy may not be applicable when the normal sample has more categories.

Few-shot-based approaches train with only a limited number of normal images for each category (Fengping and Peng Yunfa, 2022). Sheynin et al. (2021) proposed few-shot learning to improve the performance of a model on few-shot data by learning shared features across categories. This can be achieved through transfer learning (Jiang et al., 2022), meta-learning, or generative adversarial networks (Goodfellow et al., 2014). These methods try to exploit the commonalities between classes to enhance the generalization ability of the model and thus achieve better results in few-shot anomaly detection tasks. In addition, to address the imbalance problem of few-shot datasets, He and Garcia (2009) proposed using resampling techniques to adjust the number of samples from different categories so that the model can better learn the data from minority categories. Gupta et al. (2020) proposed weighting the loss of samples from different categories to give more importance to the minority categories. However, the above methods do not utilize the commonalities between categories, which can affect the effectiveness of the model in learning the minority categories.

Feature registration-based methods mainly rely on the steps of feature extraction, feature matching, geometric transformation modeling, and registration optimization to ensure registration between different images, making anomalies easier to detect and analyze in images. Huang et al. (2022) proposed a generic anomaly detection model based on feature registration that learns jointly across multiple categories and generalizes to new categories. However, deformations and rotations in the image hinder accurate registration of the features, so the model is less robust to these changes in appearance, which degrades anomaly detection performance.

In general, the problems of the above methods lie mainly in the uneven distribution of data samples, the failure of few-shot learning to fully exploit the commonalities between categories, and the poor robustness of feature registration. Techniques such as deformable convolution, few-shot learning, and attention mechanisms are now widely used in image anomaly detection. Few-shot learning can provide the model with richer anomaly data and image features, deformable convolution lets the model capture changing features more efficiently, and the attention mechanism focuses on the important features in the data. In addition, feature registration helps improve the recognition of critical features and may reduce the effect of noise on the model. Therefore, this paper proposes RCM-FSAD, a class-independent few-shot anomaly detection model based on a multidimensional convolutional attention module (MAM) and feature registration. The model can be divided into two main parts: feature extraction, and anomaly detection and localization. The features comprise structural and distributional features, and anomaly judgment is realized mainly through the decoder, which segments and localizes the anomalous regions of an image.

The main contributions of this paper are summarized below:

To address the problem of learning the commonality between categories in a few-shot model, a MAM is proposed. Combined with the feature registration technique, it makes the model attend more closely to the pixel regions that play a decisive role in classification, enabling the model to learn the commonalities between categories during training and mitigating poor generalization.

To address the poor robustness of feature-based alignment and the difficulty of capturing changing features, a Siamese network trained with deformable convolution v2 (deformable convolutional network v2 [DCNv2]) is proposed. It improves the expressive ability and adaptability of the model, enabling it to handle complex image data better and improving task performance and generalization.

To address the one-class-one-model problem that arises when classification relies on a small amount of anomaly sample data, a multidimensional spatial generalized anomaly detection model (RCM-FSAD) is proposed, in which a single unified model is trained for all classes with a small number of samples. The model first feeds normal images into a pretrained feature extraction network for feature extraction. Next, the extracted features are ensemble-trained by feature registration to learn a universal feature representation.

Figure 1.

Schematic structure of the TAM. Note. TAM = triplet attention module.

Figure 2.

Schematic diagram of the detailed process of the TAM. Note. TAM = triplet attention module.

2. Relevant Theoretical Basis

2.1. Convolutional Triplet Attention Module

Dosovitskiy et al. (2020) proposed the transformer self-attention mechanism, initially applied to aligned text with remarkable results. Later, researchers began to explore the introduction of attention mechanisms in computer vision tasks and convolutional neural networks (CNNs) to enhance network performance (Fraga et al., 2021). For example, models such as SENet (Hu et al., 2018), convolutional block attention module (CBAM; Woo et al., 2018), and bottleneck attention module (BAM; Park et al., 2018) have been proposed to compute attention from different perspectives, such as channel, space, and receptive field. Although these methods significantly improve performance, they cannot solve the cross-dimensional interaction problem.

For this reason, Misra et al. (2021) proposed the triplet attention module (TAM), which not only reduces the computational cost of the model, but also eliminates the indirect correspondence between the channels and the weights, and achieves significant results in terms of efficiency and multidimensional correlation, as compared to previous attention methods. The structure of the TAM is shown in Figure 1.

The TAM contains three main branches, as shown in Figure 2. Two of them capture cross-dimension interactions between the channel dimension C and the spatial dimensions width/height: the input of each of these branches is rotated $90^{\circ}$ counterclockwise along the height/width (H/W) axis, and the output is rotated $90^{\circ}$ clockwise along the same axis. The third branch is the traditional computation of spatial attention weights.

Z-pool: this layer reduces the zeroth dimension of the tensor to two by concatenating the max-pooled and average-pooled features across that dimension. It preserves a rich representation of the actual tensor while shrinking its depth, which lightens the subsequent computation, as in equation (1):

$$\text{Z-pool}(x) = [\text{MaxPool}_{0d}(x), \text{AvgPool}_{0d}(x)] \tag{1}$$

where $0d$ is the zeroth dimension, across which the max and average pooling operations take place.
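As a rough illustration of equation (1), the following NumPy sketch pools a tensor over its zeroth axis and stacks the two results. The function name `z_pool` and the toy tensor are assumptions made for the example; the paper does not give an implementation.

```python
import numpy as np

def z_pool(x: np.ndarray) -> np.ndarray:
    """Stack the max-pooled and mean-pooled maps taken over axis 0 of x."""
    max_map = x.max(axis=0)    # shape (H, W)
    avg_map = x.mean(axis=0)   # shape (H, W)
    return np.stack([max_map, avg_map], axis=0)  # shape (2, H, W)

# Toy tensor with 4 channels on a 3 x 3 grid (illustrative values only).
x = np.arange(36, dtype=float).reshape(4, 3, 3)
pooled = z_pool(x)
print(pooled.shape)  # (2, 3, 3)
```

Whatever the size of the zeroth dimension, the output always has depth two, which is what keeps the following convolution cheap.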

2.2. Deformable Convolution

Dai et al. (2017) proposed a deformable convolutional network v1 (DCNv1), the core idea of which is to improve the model’s ability to adapt to changes in object geometry. Although DCNv1 spatially extracts features that are more consistent with the image structure, its computational region may be far beyond the region of interest, which leads to features being affected by irrelevant image content. To overcome this problem, Zhu et al. (2019) proposed DCNv2, which enhances the network’s ability to focus on relevant image regions. The modeling capability is further enhanced by integrating deformable convolution more comprehensively within the network while introducing a modulation mechanism and training strategy that extends the scope of deformation modeling. The modulation process in DCNv2 can be expressed as follows:

$$y(p) = \sum_{k=1}^{K} \omega_{k} \cdot x(p + p_{k} + \Delta p_{k}) \cdot \Delta m_{k} \tag{2}$$

Given a convolution kernel with $K$ sampling points, $\omega_{k}$ and $p_{k}$ denote the weight and prespecified offset of the $k$-th sampling point, respectively. $p$ is the true pixel coordinate, $p_{k}$ is the convolution kernel position, $\Delta p_{k}$ is the learned offset, and $\Delta m_{k}$ is the modulation scalar (Zhang et al., 2020).
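As a rough illustration of equation (2), the sketch below evaluates the modulated sum at a single output location of a single-channel 2-D input with a 3 × 3 kernel. The function names, the (row, column) convention, the zero padding outside the image, and the use of bilinear interpolation for fractional sampling positions are assumptions made for the example, not details taken from the paper's implementation.

```python
import numpy as np

def bilinear(x, r, c):
    """Bilinearly interpolate the 2-D array x at fractional (r, c); zero outside x."""
    r0, c0 = int(np.floor(r)), int(np.floor(c))
    value = 0.0
    for rr in (r0, r0 + 1):
        for cc in (c0, c0 + 1):
            if 0 <= rr < x.shape[0] and 0 <= cc < x.shape[1]:
                weight = (1 - abs(r - rr)) * (1 - abs(c - cc))
                value += weight * x[rr, cc]
    return value

def modulated_deform_at(x, p, weights, offsets, modulation):
    """Evaluate equation (2) at output location p for a 3 x 3 kernel (K = 9)."""
    grid = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]  # fixed positions p_k
    y = 0.0
    for k, (dr, dc) in enumerate(grid):
        r = p[0] + dr + offsets[k][0]   # row of p + p_k + delta p_k
        c = p[1] + dc + offsets[k][1]   # column of p + p_k + delta p_k
        y += weights[k] * bilinear(x, r, c) * modulation[k]
    return y

# With zero offsets and unit modulation the sum reduces to an ordinary 3 x 3 window sum.
x = np.arange(25, dtype=float).reshape(5, 5)
y_plain = modulated_deform_at(x, (2, 2), np.ones(9), np.zeros((9, 2)), np.ones(9))
print(y_plain)  # 108.0
```

Setting a modulation scalar to zero removes the corresponding sampling point from the sum, which is how the modulation term lets the layer suppress irrelevant image content.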

2.3. Spatial Transformer Network

Jaderberg et al. (2015) introduced a learnable spatial transformer network (STN), which is a differentiable module that can be embedded into existing CNN architectures. The STN allows a neural network to automatically perform spatial transformations on feature maps, conditional on the input feature maps, without additional training supervision or modification of optimization methods. One of the key advantages of STNs is the ability to learn the spatial transformation parameters of a picture or feature without labeled key points. By learning these parameters, STNs can spatially align the input picture or learned features, thus reducing the impact that geometric transformations of objects, such as rotation, translation, scaling, and distortion, have on classification, localization, and other tasks.

STN is more accurately called spatial transformer layer (STL), which is a layer in the network, and STL can be added between any two layers and is usually used in CNNs. As shown in Figure 3, the STN model consists of three main components: the localization network, the grid generation module, and the sampler. The localization network is responsible for generating spatial transformation parameters that describe the input image U, such as translation, rotation, and scaling. The grid generation module uses these parameters to generate a regular grid of sampled grid points that can be used to geometrically transform the input image U. Finally, the sampler uses the generated grid to interpolate with the input image to obtain a transformed image, which is then mapped onto the original image U to obtain the output V.

Figure 3.

Schematic diagram of spatial transformer network structure.

3. Proposed Approach

The few-shot anomaly detection model (RCM-FSAD) based on the MAM and feature registration proposed in this paper is shown in Figure 4.

Figure 4.

RCM-FSAD model architecture.

The RCM-FSAD model is mainly composed of a feature optimization module, a feature extraction and transformation module, and a Siamese network. First, the MAM is used to focus more finely on specific regions or features in the input data, improving the model’s ability to perceive key information. Second, the feature extraction and transformation module $B_{i \times 3}$ learns an effective representation of the data, which provides the model with more expressive and discriminative features and improves its performance and generalization ability. Finally, the Siamese network with DCNv2 improves the model’s adaptive feature learning ability and enlarges the receptive field, thus improving the model’s registration ability. During training, the model uses normal feature alignment to learn category-independent feature alignment. For testing, a statistically based distribution estimator is used to estimate the normal distribution of the registration features for the target category, and test samples that fall outside the learned normal distribution are considered anomalous.

3.1. Feature Optimization Module

This paper builds on the triplet attention module (Misra et al., 2021). In practice, TAM focuses only on the interaction between spatial width and height. Although this helps improve model performance, TAM is relatively limited in handling interactions in the channel dimension and fails to adequately capture the complex dependencies between different channels. To solve this problem, we introduce the MAM as a feature optimization module; by extending the functionality of the TAM, it enhances the model’s ability to express features.

Based on TAM, MAM constructs interactions across the channel dimension. In the first branch, the input $\chi$ is rotated $90^{\circ}$ counterclockwise along the C-axis. The rotated tensor is denoted $\hat{\chi}_{1}$ and has shape $(C \times W \times H)$. Passing $\hat{\chi}_{1}$ through the Z-pool gives $\hat{\chi}_{1}^{*}$, whose number of channels is reduced so that its shape becomes $(2 \times W \times H)$. $\hat{\chi}_{1}^{*}$ is then passed through a standard convolutional layer with kernel size $k \times k$, followed by a batch normalization layer, which gives an intermediate output of shape $(1 \times W \times H)$. This output is passed through a sigmoid activation layer $(\sigma)$ to generate the attention weights, which are applied to $\hat{\chi}_{1}$; the result is then rotated $90^{\circ}$ clockwise along the C-axis to restore the original input shape of $\chi$. The other two branches preserve the spatial height and spatial width interactions of the TAM, as shown in Figure 5.

Figure 5.

Schematic diagram of the detailed process of multidimensional attention module.

For an input tensor $\chi \in R^{C \times H \times W}$, the refined attention tensor $y$ produced by the multidimensional attention module can be represented by equation (3):

$$y = \frac{1}{3} \left( \overline{\hat{\chi}_{1} \sigma(\varphi_{1}(\hat{\chi}_{1}^{*}))} + \overline{\hat{\chi}_{2} \sigma(\varphi_{2}(\hat{\chi}_{2}^{*}))} + \overline{\hat{\chi}_{3} \sigma(\varphi_{3}(\hat{\chi}_{3}^{*}))} \right) \tag{3}$$

where $\sigma$ denotes the sigmoid activation function, and $\varphi_{1}$, $\varphi_{2}$, and $\varphi_{3}$ denote the standard two-dimensional convolutional layers, defined by the kernel size $k$, in the three branches of the multidimensional attention module. Equation (3) can be simplified to equation (4):

$$y = \frac{1}{3} \left( \overline{\hat{\chi}_{1} \omega_{1}} + \overline{\hat{\chi}_{2} \omega_{2}} + \overline{\hat{\chi}_{3} \omega_{3}} \right) = \frac{1}{3} \left( \bar{y_{1}} + \bar{y_{2}} + \bar{y_{3}} \right) \tag{4}$$

where $\omega_{1}$, $\omega_{2}$, and $\omega_{3}$ are the three cross-dimensional attention weights computed in triplet attention. The $\bar{y_{1}}$, $\bar{y_{2}}$, and $\bar{y_{3}}$ in equation (4) denote the $90^{\circ}$ clockwise rotation that restores the original input shape of $(C \times H \times W)$.
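The three-branch averaging in equations (3) and (4) can be sketched in NumPy as follows. This is only a minimal sketch under stated assumptions: each rotation is modeled as an axis permutation, the particular permutations in `perms` are hypothetical, and the $k \times k$ convolution with batch normalization is replaced by a fixed 1 × 1 mix of the two pooled maps (`conv_weights`), so the numbers it produces are not those of the trained module.

```python
import numpy as np

def z_pool(x):
    """Max and mean over axis 0, stacked into a 2-channel map."""
    return np.stack([x.max(axis=0), x.mean(axis=0)], axis=0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def branch(x, perm, conv_weights):
    """One branch: permute, Z-pool, conv stand-in, sigmoid, reweight, permute back."""
    x_hat = np.transpose(x, perm)              # "rotated" tensor
    pooled = z_pool(x_hat)                     # 2 x . x .
    # Stand-in for the k x k convolution + batch norm: 1 x 1 mix of the pooled maps.
    logits = conv_weights[0] * pooled[0] + conv_weights[1] * pooled[1]
    omega = sigmoid(logits)[None, ...]         # 1 x . x . attention weights
    y_hat = x_hat * omega                      # broadcast over the leading axis
    inverse = tuple(int(i) for i in np.argsort(perm))
    return np.transpose(y_hat, inverse)        # back to C x H x W

def multidim_attention(x, conv_weights=(0.5, 0.5)):
    """Average of three branches, as in equation (4)."""
    perms = [(1, 0, 2), (2, 1, 0), (0, 1, 2)]  # hypothetical axis orderings
    return sum(branch(x, p, conv_weights) for p in perms) / 3.0

x = np.random.default_rng(0).normal(size=(4, 5, 6))  # toy C x H x W input
y = multidim_attention(x)
print(y.shape)  # (4, 5, 6)
```

Because every attention weight lies in (0, 1), each output entry is a damped copy of the corresponding input entry, and the permutation back guarantees that the output keeps the input shape.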

3.2. Feature Extraction and Transformation Module

Two images $T_{a}$ and $T_{b}$ of the same category are randomly selected from the training set as input images. The outputs of the feature optimization module then pass through the feature extraction and transformation module (shown as $B_{i \times 3}$ in Figure 4). This module uses four convolutional residual blocks ($C_{1}$, $C_{2}$, $C_{3}$, $C_{i}$) in ResNet as feature extractors, and each residual block is followed by a feature transformation module (STN, $S_{i}$). The convolutional residual block is responsible for extracting the input image features and retaining the spatial information at the output, which makes it easier for the next feature transformation module to learn the feature mapping. Specifically, the transformation function $S_{i}$ $(i = 1, 2, 3)$ is applied to the input features $f_{i}^{s}$:

$$\begin{pmatrix} x_{i}^{t} \\ y_{i}^{t} \end{pmatrix} = S_{i}(f_{i}^{s}) = A_{i} \begin{pmatrix} x_{i}^{s} \\ y_{i}^{s} \\ 1 \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x_{i}^{s} \\ y_{i}^{s} \\ 1 \end{pmatrix} \tag{5}$$

where $(x_{i}^{t}, y_{i}^{t})$ are the target coordinates of the output feature $f_{i}^{t}$, $(x_{i}^{s}, y_{i}^{s})$ are the source coordinates of the same point in the input feature $f_{i}^{s}$, and $A_{i}$ is the affine transformation matrix. The module $S_{i}$ is used to learn the feature mapping of the convolutional block $C_{i}$, with the same structure as used in the spatial transformer.
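To make equation (5) concrete, the short sketch below applies a 2 × 3 affine matrix to a source coordinate. The function name `affine_map` and the example matrices are illustrative assumptions; in the model, the entries $\theta_{11}, \dots, \theta_{23}$ would be predicted by the transformation module rather than fixed by hand.

```python
def affine_map(theta, xs, ys):
    """Apply the 2 x 3 affine matrix theta to the source point (xs, ys)."""
    (t11, t12, t13), (t21, t22, t23) = theta
    xt = t11 * xs + t12 * ys + t13
    yt = t21 * xs + t22 * ys + t23
    return xt, yt

identity = [[1, 0, 0], [0, 1, 0]]
rotate_90 = [[0, -1, 0], [1, 0, 0]]   # counterclockwise rotation about the origin
translate = [[1, 0, 2], [0, 1, -1]]   # shift by (+2, -1)

print(affine_map(identity, 3, 4))     # (3, 4)
print(affine_map(rotate_90, 1, 0))    # (0, 1)
print(affine_map(translate, 3, 4))    # (5, 3)
```

The six parameters cover translation, rotation, scaling, and shear, which is why a single matrix per block is enough to express the alignment the module is meant to learn.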

Figure 6.

Schematic diagram of Siamese network structure.

3.3. Siamese Network

The Siamese network is a parameter-sharing neural network applied to multiple inputs. To avoid the problem of collapse during gradient computation, the feature encoder in this paper is designed as a Siamese network, inspired by SimSiam (Chen and He, 2021). In Siamese networks, it is crucial to accurately capture the spatial transformations and local structure of the input data. DCNv2 adapts strongly to spatial transformations, which makes it better able to compare the similarity between input pairs, the core task of such networks. By introducing a dynamic convolutional kernel, DCNv2 is better able to handle rotational, translational, and other transformations of the input data, improving the network’s invariance to these transformations.

The overall Siamese network structure is shown in Figure 6. Given the pairwise extracted features $f_{3, a}^{t}$ and $f_{3, b}^{t}$ as the final transformed output, the same encoder network (Encoder, $E$) processes both features, and a prediction head (Predictor, $P$) is then applied on one branch. A stop-gradient operation is applied to the other branch to prevent model collapse. With $P_{a} \triangleq P(E(f_{3, a}^{t}))$ and $Z_{b} \triangleq E(f_{3, b}^{t})$, the negative cosine similarity loss is applied, as shown in equation (6):

$$D(P_{a}, Z_{b}) = -\frac{P_{a}}{\| P_{a} \|_{2}} \cdot \frac{Z_{b}}{\| Z_{b} \|_{2}} \tag{6}$$

where $P_{a}$ is the encoded feature of the image to be detected, obtained by the predictor, $Z_{b}$ is the feature of the support image after encoding by the encoder, and $\| \cdot \|_{2}$ is the $L_{2}$ norm. Here we use a feature-level registration loss instead of registering the images pixel by pixel to achieve better robustness. The symmetric feature registration loss is defined as shown in equation (7):

$$L = \frac{1}{2} \left( D(P_{a}, Z_{b}) + D(P_{b}, Z_{a}) \right) \tag{7}$$
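Equations (6) and (7) can be checked on small vectors with the plain-Python sketch below. The function names are assumptions made for the example, and features are treated as flat lists of numbers. The stop-gradient operation has no counterpart here because no gradients are computed; in a training framework the `z` arguments would be detached from the computation graph.

```python
import math

def neg_cosine(p, z):
    """Negative cosine similarity D(p, z) from equation (6)."""
    dot = sum(a * b for a, b in zip(p, z))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_z = math.sqrt(sum(b * b for b in z))
    return -dot / (norm_p * norm_z)

def symmetric_loss(p_a, z_b, p_b, z_a):
    """Symmetric registration loss L from equation (7)."""
    return 0.5 * (neg_cosine(p_a, z_b) + neg_cosine(p_b, z_a))

print(neg_cosine([1.0, 0.0], [2.0, 0.0]))   # -1.0, same direction
print(symmetric_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]))  # -0.5
```

The loss is bounded between -1 and 1 and reaches its minimum of -1 when the predicted and encoded features point in the same direction, regardless of their magnitudes.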

3.4. Estimation of Normal Distribution

After obtaining the aligned features, a distribution estimation model is fitted to the aligned features of the support images to obtain the feature distribution model. The specific process is as follows.

Divide the support image into an $A \times B$ grid, where $A \times B$ is the resolution of the features used to estimate the normal distribution. The aligned features of the support images are computed at each position, and the sample covariance matrix $\Sigma_{I J}$ is computed as:

$$\Sigma_{I J} = \frac{1}{C - 1} \sum_{r = 1}^{C} (f_{I J}^{r} - \mu_{I J}) (f_{I J}^{r} - \mu_{I J})^{T} + \epsilon I \tag{8}$$

where $1 \leq I \leq A$, $1 \leq J \leq B$, $f_{I J}^{r}$ is the aligned feature of the $r$-th support image at grid position $(I, J)$, $F_{I J} = \{ f_{I J}^{r}, r \in [1, C] \}$ is the set of transformed features of the $C$ support images at that position, $\mu_{I J}$ is the sample mean of $F_{I J}$, and $\epsilon I$ is the regularization term.
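A minimal NumPy sketch of the estimate in equation (8) for a single grid position is given below. The function name `fit_gaussian`, the array layout (one row per support image), and the value `eps=0.01` are assumptions chosen for illustration; the paper does not state the value of $\epsilon$ in this section.

```python
import numpy as np

def fit_gaussian(features, eps=0.01):
    """features: array of shape (C, d), one d-dimensional feature per support image."""
    mu = features.mean(axis=0)                      # sample mean
    centered = features - mu
    n = features.shape[0]
    # Unbiased sample covariance plus the regularization term eps * I.
    sigma = centered.T @ centered / (n - 1) + eps * np.eye(features.shape[1])
    return mu, sigma

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 3))   # toy data: 8 support features of dimension 3
mu, sigma = fit_gaussian(feats)
print(sigma.shape)  # (3, 3)
```

Adding $\epsilon I$ keeps the covariance matrix positive definite even when there are fewer support images than feature dimensions, so its inverse is always defined.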

3.5. Inference

In the inference process, test samples that fall outside the normal distribution are considered anomalous. For each test image, this paper uses an anomaly scoring function to compute the Mahalanobis distance between the aligned features of the image under test and the feature distribution model at every grid position. These distances form the anomaly score matrix, which indicates the anomalous regions in the image under test and thereby realizes anomaly detection. The Mahalanobis distance $M(F_{I J})$ is given by equation (9):

$$M(F_{I J}) = \sqrt{(F_{I J} - \mu_{I J})^{T} \Sigma_{I J}^{-1} (F_{I J} - \mu_{I J})} \tag{9}$$

where $F_{I J}$ here denotes the aligned feature of the image under test at grid position $(I, J)$, and $\Sigma_{I J}^{-1}$ is the inverse of the sample covariance matrix $\Sigma_{I J}$.
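The scoring step in equation (9) can be sketched in NumPy as follows. The function names and the array layouts (an $A \times B$ grid of $d$-dimensional features, with one mean vector and one covariance matrix per position) are assumptions made for the example.

```python
import numpy as np

def mahalanobis(f, mu, sigma):
    """Mahalanobis distance of feature f from the Gaussian (mu, sigma)."""
    diff = f - mu
    return float(np.sqrt(diff @ np.linalg.inv(sigma) @ diff))

def anomaly_map(test_feats, mus, sigmas):
    """test_feats, mus: (A, B, d); sigmas: (A, B, d, d). Returns an (A, B) score matrix."""
    rows, cols, _ = test_feats.shape
    scores = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            scores[i, j] = mahalanobis(test_feats[i, j], mus[i, j], sigmas[i, j])
    return scores

# With an identity covariance the distance reduces to the Euclidean distance.
print(mahalanobis(np.array([3.0, 4.0]), np.zeros(2), np.eye(2)))  # 5.0
```

Positions whose score is large relative to the scores of normal images are the ones flagged as anomalous regions.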

4. Experiment

4.1. Datasets and Evaluation Indicators

To effectively evaluate the RCM-FSAD anomaly detection model, experiments are conducted in this paper using the MVTec AD dataset (Bergmann et al., 2019), which is a widely used dataset for surface anomaly detection tasks. The dataset includes real images from 15 different categories, with 3,629 images for training and validation and 1,725 images for testing. The training set contains only normal images without defects. The test set contains both defective (abnormal) images of various kinds and defect-free (normal) images. Each category has only a few defect types, about five on average, giving 73 different defect types in total. The resolution of the images ranges from $700 \times 700$ to $1024 \times 1024$ pixels, and each image is labeled at the pixel level. The anomaly types include scratches, spots, cracks, etc., which helps the image anomaly detection model to better learn and understand different types of anomalies and improves the performance and application range of the model.

Evaluation metrics commonly used in anomaly detection tasks, including image-level area under receiver operating characteristic (AUROC) and pixel-level AUROC, are selected in this paper to accurately evaluate the RCM-FSAD anomaly detection model’s performance. These two metrics are used to measure the effectiveness of the model in overall anomaly detection and anomaly localization, respectively.

AUROC is the area under the receiver operating characteristic (ROC) curve, which plots the true positive rate (TPR) against the false positive rate (FPR). TPR is the proportion of truly positive samples that are correctly predicted as positive, and FPR is the proportion of truly negative samples that are incorrectly predicted as positive. The ROC curve is obtained by sweeping the classification threshold from small to large, which yields a series of (TPR, FPR) points that are plotted as the curve. The area under this curve is then calculated; the larger the AUROC, the better the performance. AUROC takes a value between 0 and 1: the closer it is to 1, the better the classifier, and a value close to 0.5 means the classifier performs similarly to random guessing.
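For a small set of scores, the AUROC can be computed directly from its pairwise interpretation: it equals the probability that a randomly chosen anomalous sample receives a higher score than a randomly chosen normal one, with ties counted as one half. The sketch below uses this equivalent formulation instead of explicitly sweeping thresholds; the function name and toy data are assumptions for illustration.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))

# Anomaly scores for four samples; label 1 marks the anomalous ones.
print(auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```

The same function gives the image-level metric when applied to one score per image and the pixel-level metric when applied to one score per pixel.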

4.2. Implementation Details

The model is designed as a unified model, but independent training is used to enhance its ability to recognize the features of each category during training. The model uses ResNet18 as the backbone network. The encoder contains three $1 \times 1$ convolutional layers, while the predictor contains two $1 \times 1$ convolutional layers, and no pooling operation is used. Training and inference are run on a Tesla V100-SXM2-16GB GPU. For each data category, training runs for 50 iterations with a batch size of 32, and input images are resized to $224 \times 224$. The model parameters are updated using stochastic gradient descent with momentum, with an initial learning rate of 0.0001 and a single-cycle cosine learning rate schedule as the decay strategy. Data augmentation methods such as rotation, translation, and flipping are applied to each image of the support set.
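The learning rate schedule described above might look like the following sketch. It assumes a single cosine decay from the stated initial rate of 0.0001 down to zero over the 50 training iterations, with one update per iteration; the final rate, the step granularity, and the function name are assumptions, since the text does not specify them.

```python
import math

def cosine_lr(step, total_steps=50, base_lr=1e-4, min_lr=0.0):
    """Learning rate after `step` of `total_steps` under a single cosine decay."""
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Print the rate at the start, midpoint, and end of the cycle.
for step in (0, 25, 50):
    print(step, cosine_lr(step))
```

The rate starts at the initial value, falls slowly at first, fastest around the midpoint, and flattens out again as it approaches the end of the cycle.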

4.3. Comparison With Other Methods

To validate the effectiveness and superiority of RCM-FSAD, it is evaluated against several mainstream deep learning-based anomaly detection methods in terms of anomaly detection and anomaly localization, respectively.

4.3.1. Anomaly Detection

The experiments were conducted using a leave-one-out setup, selecting one target category for testing while using the other categories in the dataset for training. The k-shot anomaly detection results on the MVTec AD dataset are shown in Tables 1–3. The results for each of the 15 categories are reported as the average AUROC (%) over 10 runs and are listed separately for each category. The average results of RCM-FSAD anomaly detection over the 15 categories are 87.33% ($k = 2$), 89.83% ($k = 4$), and 92.09% ($k = 8$). Compared to RegAD, the mean image-level AUROC on MVTec AD increased by 2.03, 1.18, and 1.78 percentage points at $k = 2$, 4, and 8, respectively. Among the compared methods, RCM-FSAD achieved the best detection accuracy in 8, 10, and 10 categories, respectively, and its results in the remaining categories were close to the best values. Some categories did not achieve the best results, mainly for three reasons. First, the abnormal regions of a very small number of abnormal images are difficult to recognize, which causes the model to mistake abnormal images for normal ones. Second, some categories have more complex or harder-to-capture anomalies (e.g., changes involving multiple features, transformations in appearance, and noise interference), and the model performs relatively poorly in these categories. Third, RCM-FSAD was tested without any parameter fine-tuning, so there is no guarantee that optimal performance was achieved for each category.
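The leave-one-out protocol can be sketched as a simple loop over the 15 MVTec AD categories. The function name and the lowercase category strings are illustrative; the actual training and evaluation code is not shown in the paper.

```python
CATEGORIES = [
    "bottle", "cable", "capsule", "carpet", "grid", "hazelnut", "leather",
    "metal_nut", "pill", "screw", "tile", "toothbrush", "transistor",
    "wood", "zipper",
]

def leave_one_out_splits(categories):
    """Yield (training categories, held-out target category) pairs."""
    for target in categories:
        train = [c for c in categories if c != target]
        yield train, target

splits = list(leave_one_out_splits(CATEGORIES))
print(len(splits))          # 15
print(len(splits[0][0]))    # 14
```

Each target category is therefore never seen during training, and only its $k$ support images are available at test time.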

Table 1.
Image-Level AUROC Used to Evaluate the Anomalous Detection of Each Model on the MVTec AD Dataset for $k = 2$ (%).

Category	STPM (Wang et al., 2021)	RD4AD (Deng and Li, 2022)	DiffNet+ (Rudolph et al., 2021)	RegAD	LeMO (Gao et al., 2023)	RCM-FSAD
Bottle	93.8	91.2	99.3	99.2	–	99.6
Cable	60.2	65.3	85.3	70.5	–	79.1
Capsule	45.2	50.5	73.0	67.3	–	69.6
Carpet	90.8	92.8	78.4	95.6	–	96.1
Grid	72.6	75.2	62.1	78.2	–	79.7
Hazelnut	90.3	93.4	94.9	94.6	–	98.9
Leather	95.8	96.7	90.7	98.7	–	99.9
Metal Nut	59.4	63.4	61.9	95.9	–	95.6
Pill	58.7	62.8	83.2	62.3	–	74.2
Screw	51.9	54.3	73.4	53.2	–	55.4
Tile	91.4	88.9	97.0	98.3	–	97.9
Toothbrush	76.5	77.1	60.8	90.0	–	90.8
Transistor	82.4	78.1	61.8	83.1	–	84.9
Wood	95.8	93.7	98.1	99.1	–	99.7
Zipper	47.6	49.5	89.2	93.5	–	88.5
Average	74.16	75.53	80.6	85.30	87.9	87.33

Note. AUROC = area under receiver operating characteristic; STPM = student–teacher feature pyramid matching; RD4AD = Reverse Distillation from One-Class Embedding for Anomaly Detection.

Table 2.

Image-Level AUROC Used to Evaluate the Anomalous Detection of Each Model on the MVTec AD Dataset for $k = 4$ (%).

Category	STPM (Wang et al., 2021)	RD4AD (Deng and Li, 2022)	DiffNet+ (Rudolph et al., 2021)	RegAD	LeMO (Gao et al., 2023)	RCM-FSAD
Bottle	93.9	92.1	99.3	99.1	–	99.4
Cable	61.3	68.4	85.2	84.9	–	82.7
Capsule	47.4	51.7	80.3	74.3	–	76.2
Carpet	91.5	93.2	78.6	97.3	–	97.9
Grid	75.3	76.4	60.5	87.8	–	89.0
Hazelnut	91.4	93.8	95.8	94.8	–	98.8
Leather	96.9	96.8	91.2	97.6	–	99.9
Metal Nut	60.8	65.3	67.3	91.9	–	95.8
Pill	61.3	62.8	84.0	69.0	–	74.7
Screw	52.8	55.7	72.5	59.8	–	57.7
Tile	90.4	90.8	98	96.8	–	98.6
Toothbrush	80.4	76.7	62.5	94.8	–	94.9
Transistor	82.4	79.3	62.2	85.7	–	88.3
Wood	95.8	94.2	96.4	99.0	–	99.9
Zipper	47.6	56.7	84.8	96.9	–	93.6
Average	74.77	76.93	81.3	88.65	88.7	89.83

Note. AUROC = area under receiver operating characteristic; STPM = student–teacher feature pyramid matching; RD4AD = Reverse Distillation from One-Class Embedding for Anomaly Detection.

Table 3.

Image-Level AUROC Used to Evaluate the Anomalous Detection of Each Model on the MVTec AD Dataset for $k = 8$ (%).

Category	STPM (Wang et al., 2021)	RD4AD (Deng and Li, 2022)	DiffNet+ (Rudolph et al., 2021)	RegAD	LeMO (Gao et al., 2023)	RCM-FSAD
Bottle	94.1	92.8	99.4	99.7	–	99.9
Cable	62.6	69.2	87.9	85.3	–	88.6
Capsule	57.8	58.5	78.6	75.2	–	78.2
Carpet	91.6	93.8	78.5	98.7	–	98.6
Grid	76.9	77.9	78.5	91.9	–	93.8
Hazelnut	91.8	94.2	97.9	97.8	–	98.3
Leather	97.2	97.2	92.2	100	–	100.0
Metal Nut	61.3	65.6	67.6	96.5	–	97.8
Pill	64.2	63.6	82.1	75.7	–	78.4
Screw	55.9	59.3	75	63.7	–	66.0
Tile	91.2	91.2	99.6	98.9	–	99.4
Toothbrush	82.3	77.9	60.8	94.7	–	98.2
Transistor	84.6	81.2	63.3	88.3	–	91.0
Wood	95.8	95.6	99.4	99.3	–	99.9
Zipper	57.2	58.9	87.3	88.9	–	93.2
Average	77.63	78.46	83.2	90.31	90.8	92.09

Note. AUROC = area under receiver operating characteristic; STPM = student–teacher feature pyramid matching; RD4AD = Reverse Distillation from One-Class Embedding for Anomaly Detection.

4.3.2. Anomaly Localization

The results of the k-shot anomaly localization experiments on the MVTec AD dataset are presented in Tables 4–6. RCM-FSAD achieves the highest mean pixel-level AUROC in all three settings. The model uses the feature representations learned during training to localize anomalous regions accurately. Localization was evaluated on all 15 categories, and each per-category result is the average AUROC (%) over 10 runs. The mean localization AUROC of RCM-FSAD over the 15 categories is 95.37% ( $k = 2$ ), 96.77% ( $k = 4$ ), and 97.14% ( $k = 8$ ). Among the compared methods, RCM-FSAD achieves the best or tied-best localization accuracy in 8, 13, and 11 categories, respectively, and most of its remaining results are close to the best values.

Table 4.
Pixel-Level AUROC Used to Evaluate the Anomalous Localization of Each Model on the MVTec AD Dataset for $k = 2$ (%).

Category	STPM (Wang et al., 2021)	RD4AD (Deng and Li, 2022)	CFA (Lee et al., 2022)	RegAD	RCM-FSAD
Bottle	84.6	81.7	93.5	97.7	98.3
Cable	51.6	65.4	88.9	94	95.1
Capsule	59.2	78.2	85.9	97.4	97.5
Carpet	60.5	74.2	97.9	98.6	98.6
Grid	61.2	76.3	81.4	76.6	76.9
Hazelnut	74.5	64.8	98.2	97.9	98.6
Leather	75.2	86.5	99.3	99.2	99.2
Metal Nut	51.1	68.9	89.7	97.8	97.3
Pill	49.9	70.2	91.5	96.2	96.8
Screw	51.8	60.8	96.7	94.5	94.4
Tile	58.2	59.2	81.8	95.1	95
Toothbrush	66.3	78.3	93.9	98	98.1
Transistor	47.5	67.8	80.3	93.8	92.4
Wood	48.4	93.8	92.4	94	95.4
Zipper	56.3	51.2	94.1	98.5	97
Average	59.75	71.82	91.03	95.29	95.37

Note. AUROC = area under receiver operating characteristic; STPM = student–teacher feature pyramid matching; RD4AD = Reverse Distillation from One-Class Embedding for Anomaly Detection; CFA = coupled-hypersphere-based feature adaptation.
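As a check on how the summary figures for $k = 2$ follow from Table 4, the sketch below recomputes the RCM-FSAD column mean and counts the categories in which RCM-FSAD is best. The values are transcribed from the table; the helper names are hypothetical, and counting ties as best is an assumption of this sketch.

```python
# Pixel-level AUROC (%) for k = 2, transcribed from Table 4.
METHODS = ["STPM", "RD4AD", "CFA", "RegAD", "RCM-FSAD"]
TABLE4 = {
    "Bottle":     [84.6, 81.7, 93.5, 97.7, 98.3],
    "Cable":      [51.6, 65.4, 88.9, 94.0, 95.1],
    "Capsule":    [59.2, 78.2, 85.9, 97.4, 97.5],
    "Carpet":     [60.5, 74.2, 97.9, 98.6, 98.6],
    "Grid":       [61.2, 76.3, 81.4, 76.6, 76.9],
    "Hazelnut":   [74.5, 64.8, 98.2, 97.9, 98.6],
    "Leather":    [75.2, 86.5, 99.3, 99.2, 99.2],
    "Metal Nut":  [51.1, 68.9, 89.7, 97.8, 97.3],
    "Pill":       [49.9, 70.2, 91.5, 96.2, 96.8],
    "Screw":      [51.8, 60.8, 96.7, 94.5, 94.4],
    "Tile":       [58.2, 59.2, 81.8, 95.1, 95.0],
    "Toothbrush": [66.3, 78.3, 93.9, 98.0, 98.1],
    "Transistor": [47.5, 67.8, 80.3, 93.8, 92.4],
    "Wood":       [48.4, 93.8, 92.4, 94.0, 95.4],
    "Zipper":     [56.3, 51.2, 94.1, 98.5, 97.0],
}


def column_mean(table, col):
    """Mean of one method's column over all categories, rounded to 2 decimals."""
    return round(sum(row[col] for row in table.values()) / len(table), 2)


def best_or_tied(table, col):
    """Categories where the given method reaches the row maximum (ties included)."""
    return [cat for cat, row in table.items() if row[col] == max(row)]


ours = METHODS.index("RCM-FSAD")
mean_ours = column_mean(TABLE4, ours)
winners = best_or_tied(TABLE4, ours)

print(mean_ours)     # 95.37
print(len(winners))  # 8
print(winners)
```

Under this counting rule, Carpet is a tie with RegAD, so the count of 8 depends on treating ties as best.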

Table 5.

Pixel-Level AUROC Used to Evaluate the Anomalous Localization of Each Model on the MVTec AD Dataset for $k = 4$ (%).

Category	STPM (Wang et al., 2021)	RD4AD (Deng and Li, 2022)	CFA (Lee et al., 2022)	RegAD	RCM-FSAD
Bottle	84.9	81.8	93.6	97.9	98.5
Cable	52.2	66.2	89.1	95.1	96.2
Capsule	59.3	78.4	86.2	98.1	98.8
Carpet	60.6	74.8	98.2	98.8	98.8
Grid	61.8	76.9	82.5	84.7	86.8
Hazelnut	74.9	65.2	98.5	98.1	98.5
Leather	75.3	86.7	99.3	99.2	99.3
Metal Nut	51.8	69.2	89.9	96.2	98.2
Pill	50.6	70.4	91.6	96.8	98.1
Screw	51.9	60.9	96.8	96.2	96.1
Tile	58.5	59.5	82.3	91.9	95.4
Toothbrush	66.9	78.9	94.2	98.4	98.4
Transistor	57.5	67.9	80.5	94.1	93.8
Wood	48.9	94.2	92.6	94.7	96.7
Zipper	56.4	52.3	94.8	97.7	97.9
Average	60.77	72.22	91.34	95.86	96.77

Table 6.

Pixel-Level AUROC Used to Evaluate the Anomalous Localization of Each Model on the MVTec AD Dataset for $k = 8$ (%).

Category	STPM (Wang et al., 2021)	RD4AD (Deng and Li, 2022)	CFA (Lee et al., 2022)	RegAD	RCM-FSAD
Bottle	85.2	82.1	93.6	98.4	98.4
Cable	53.3	68.2	89.2	96.3	96.7
Capsule	59.3	78.5	86.5	98.1	98.4
Carpet	60.7	79.2	98.4	98.9	98.9
Grid	61.8	76.9	82.8	88.9	89
Hazelnut	74.9	65.5	98.6	98.4	98.7
Leather	75.3	86.9	99.4	99.1	99.1
Metal Nut	54.6	69.5	89.9	97.9	98.2
Pill	55.7	70.5	91.7	97.7	97.7
Screw	52.3	61.9	96.9	97.1	97.4
Tile	58.9	60.8	83.4	95	96
Toothbrush	66.9	79.1	94.5	98.7	98.6
Transistor	58.2	67.9	81.5	94.2	97
Wood	49.2	94.5	92.7	96.3	95.5
Zipper	57.8	52.8	94.9	97.8	97.5
Average	61.61	72.95	91.60	96.85	97.14

Typical anomaly localization results of the model on the MVTec AD dataset are shown in Figure 7. The figure shows the input images, the ground-truth anomaly segmentation maps, and the anomaly segmentation maps predicted by the model, illustrating how closely the predicted anomalous regions match the ground truth.

Figure 7.

Qualitative results of anomaly localization on the MVTec AD dataset.

4.4. Ablation Studies

To further validate the effectiveness of the added modules, ablation experiments were conducted for each module on the MVTec AD dataset; the k-shot results are shown in Table 7. The baseline uses ResNet18 as the feature extraction network without the MAM or DCNv2 (denoted as R18 in the table) and serves as the reference for the other variants. The effectiveness of the MAM is assessed by comparing R18+TAM and R18+MAM, in which ResNet18 is combined with the TAM and the MAM, respectively.

Table 7.
Ablation Study of the Model in MVTec AD Dataset.

					$K = 2$		$K = 4$		$K = 8$
Model	ResNet18	TAM	MAM	DCNv2	Det%	Loc%	Det%	Loc%	Det%	Loc%
R18	✓				85.3	95.29	88.65	95.86	90.31	96.85
R18+TAM	✓	✓			85.61	94.92	88.29	96.29	90.55	96.71
R18+MAM	✓		✓		86.56	95.32	89.35	96.41	90.77	96.85
R18+DCNv2	✓			✓	85.43	95.33	88.77	96.34	90.82	96.97
RCM-FSAD	✓		✓	✓	87.33	95.37	89.83	96.77	92.09	97.14

Note. For each setting, the Det% column reports image-level AUROC (%) and the Loc% column reports pixel-level AUROC (%). TAM = triplet attention module; MAM = convolutional multidimensional attention module; DCNv2 = deformable convolutional network v2.

The experimental results show that adding the MAM improves performance more than adding the TAM: R18+MAM matches or exceeds the baseline on every metric, whereas R18+TAM falls below the baseline in localization at $k = 2$ and $k = 8$ and in detection at $k = 4$. This indicates that the MAM strengthens the model’s feature representation by boosting the interaction of channel dimensions, which particularly benefits feature registration. The role of DCNv2 is also validated. Adding DCNv2 to the baseline (R18+DCNv2) introduces deformable convolutional kernels, which help the model capture the interrelationships between pixels in the image. This variant improves on the baseline on every metric, suggesting that DCNv2 plays an active role in feature learning. Finally, combining the MAM and DCNv2 (R18+MAM+DCNv2, that is, RCM-FSAD) gives the best result in every setting, which demonstrates the effectiveness of the two modules in anomaly detection tasks, especially in improving feature representation and modeling spatial relationships in the image.
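The per-module contributions in Table 7 can be made explicit by subtracting the R18 baseline from each variant. The following is a minimal sketch with values transcribed from the table; the function and variable names are hypothetical.

```python
# AUROC (%) from Table 7, ordered as:
# [k=2 Det, k=2 Loc, k=4 Det, k=4 Loc, k=8 Det, k=8 Loc]
COLUMNS = ["k=2 Det", "k=2 Loc", "k=4 Det", "k=4 Loc", "k=8 Det", "k=8 Loc"]
VARIANTS = {
    "R18":       [85.30, 95.29, 88.65, 95.86, 90.31, 96.85],
    "R18+TAM":   [85.61, 94.92, 88.29, 96.29, 90.55, 96.71],
    "R18+MAM":   [86.56, 95.32, 89.35, 96.41, 90.77, 96.85],
    "R18+DCNv2": [85.43, 95.33, 88.77, 96.34, 90.82, 96.97],
    "RCM-FSAD":  [87.33, 95.37, 89.83, 96.77, 92.09, 97.14],
}


def gains_over_baseline(variants, baseline="R18"):
    """Percentage-point difference of each variant relative to the baseline."""
    base = variants[baseline]
    return {
        name: [round(v - b, 2) for v, b in zip(values, base)]
        for name, values in variants.items()
        if name != baseline
    }


gains = gains_over_baseline(VARIANTS)
for name, deltas in gains.items():
    row = ", ".join(f"{col}: {d:+.2f}" for col, d in zip(COLUMNS, deltas))
    print(f"{name:10s} {row}")
```

Read this way, the full model gains 2.03, 1.18, and 1.78 percentage points in detection AUROC over the baseline for $k = 2$, $4$, and $8$, while the localization gains are smaller (0.08, 0.91, and 0.29 points).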

Overall, the combination of R18+MAM+DCNv2 modules was chosen because they complement each other and work together to improve the performance of the model. ResNet18 provides stable feature extraction capability, the MAM enhances the feature representation capability, and DCNv2 improves the flexibility of feature learning. This combination allows the model to more accurately capture and differentiate between normal and abnormal samples in anomaly detection tasks, resulting in better performance on datasets such as MVTec.

5. Conclusion

For image anomaly detection and localization tasks, this paper proposes an unsupervised anomaly detection model based on a multidimensional attention module and feature registration (RCM-FSAD) to detect and localize surface anomalies in images. The MAM serves as a feature optimization module, which helps the feature extraction and transformation modules learn the features common to the categories. Deformable convolution improves the Siamese network, which helps locate the anomalous region more precisely and improves the localization accuracy of the model. On the MVTec AD dataset, the overall performance of RCM-FSAD is better than that of the compared state-of-the-art methods. In practical applications, there is still room for improvement. Image anomaly detection usually needs to consider multiple information sources, including color, texture, and shape. Future work can investigate how to effectively fuse multimodal information into RCM-FSAD to improve the model’s anomaly detection accuracy and robustness.

Footnotes

Author Contributions

Xin Xie and Shenping Xiong designed the study, conducted the anomaly detection experiments, and wrote the manuscript, and Tijian Cai and Wenbin Zheng provided technical support and assistance with data analysis. All authors reviewed and edited the manuscript.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study is supported by the National Natural Science Foundation of China (Grant No. 62162026) and the Jiangxi Provincial Natural Science Foundation (Grant Nos. 20232BAB202055, 20242BAB26019, and 20242BAB25066).

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

Bergmann, Fauser, Sattlegger, Steger. (2019). MVTec AD–a comprehensive real-world dataset for unsupervised anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9592–9600). IEEE. https://doi.org/10.1109/CVPR.2019.00982

Bergmann, Fauser, Sattlegger, Steger. (2020). Uninformed students: Student–teacher anomaly detection with discriminative latent embeddings. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 4183–4192). IEEE. https://doi.org/10.1109/CVPR42600.2020.00424

Chen. (2021). Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 15750–15758). IEEE. https://doi.org/10.1109/CVPR46437.2021.01549

Dai, Xiong, Zhang, Wei. (2017). Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision (pp. 764–773). IEEE. https://doi.org/10.1109/ICCV.2017.89

Defard, Setkov, Loesch, Audigier. (2021). PaDiM: A patch distribution modeling framework for anomaly detection and localization. In International conference on pattern recognition (pp. 475–489). Springer. https://doi.org/10.48550/arXiv.2011.08785

Deng, Li. (2022). Anomaly detection via reverse distillation from one-class embedding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9737–9746). IEEE. https://doi.org/10.1109/CVPR52688.2022.00951

Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, Dehghani, Minderer, Heigold, Gelly, et al. (2020). An image is worth $16 \times 16$ words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

Fengping, Peng Yunfan, L. Y. (2022). Research on insulator self-explosion detection with small sample based on deep learning. Journal of East China Jiaotong University, 39(2), 110. https://10.16749/j.cnki.jecjtu.20220314.010

Fernando, Gammulle, Denman, Sridharan, Fookes. (2021). Deep learning for medical anomaly detection–a survey. ACM Computing Surveys (CSUR), 54(7), 1–37. https://doi.org/10.1145/3464423

Fraga V. A., Schreiber L. V., Silva M. A. C., Kunst, Barbosa J. L., Ramos G.d.O. (2021). A machine learning pipeline for extracting decision-support features from traffic scenes. AI Communications (preprint), 1–13. https://doi.org/10.3233/AIC-220317

Gao, Luo, Shen, Zhang. (2023). Towards total online unsupervised anomaly detection and localization in industrial vision. arXiv preprint arXiv:2305.15652.

Goodfellow, Pouget-Abadie, Mirza, Warde-Farley, Ozair, Courville, Bengio. (2014). Generative adversarial nets. Advances in neural information processing systems. arXiv preprint arXiv:1406.2661.

Gupta, Tatbul, Marcus, Zhou, Lee, Gottschlich. (2020). Class-weighted evaluation metrics for imbalanced data classification. https://doi.org/10.48550/arXiv.2010.05995

Garcia E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263–1284. IEEE. https://doi.org/10.1109/TKDE.2008.233331

Shen, Sun. (2018). Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7132–7141). IEEE. https://doi.org/10.1109/TPAMI.2019.2913372

Huang, Guan, Jiang, Zhang, Spratling, Wang Y.-F. (2022). Registration based few-shot anomaly detection. In European conference on computer vision (pp. 303–319). Springer. https://doi.org/10.1007/978-3-031-20053-3_18

Jaderberg, Simonyan, Zisserman, et al. (2015). Spatial transformer networks. Advances in neural information processing systems, 28. https://doi.org/10.5555/2969442.2969465

Jiang, Shu, Wang, Long. (2022). Transferability in deep learning: A survey. arXiv preprint arXiv:2201.05867.

Lee, Song B. C. (2022). CFA: Coupled-hypersphere-based feature adaptation for target-oriented anomaly localization. IEEE Access, 10, 78446–78454. https://doi.org/10.1109/ACCESS.2022.3193699

Liu, Luo, Lian, Gao. (2018). Future frame prediction for anomaly detection–a new baseline. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 6536–6545). IEEE. https://doi.org/10.1109/CVPR.2018.08578782

Misra, Nalamada, Arasanipalai A. U., Hou. (2021). Rotate to attend: Convolutional triplet attention module. In Proceedings of the IEEE/CVF winter conference on applications of computer vision (pp. 3139–3148). IEEE. https://doi.org/10.1109/WACV48630.2021.00318

Park, Woo, Lee J.-Y., Kweon I. S. (2018). BAM: Bottleneck attention module. arXiv preprint arXiv:1807.06514.

Rudolph, Wandt, Rosenhahn. (2021). Same same but different: Semi-supervised defect detection with normalizing flows. In Proceedings of the IEEE/CVF winter conference on applications of computer vision (pp. 1906–1915). IEEE. https://doi.org/10.1109/WACV48630.2021.00195

Sheynin, Benaim, Wolf. (2021). A hierarchical transformation-discriminating generative model for few shot anomaly detection. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 8495–8504). IEEE. https://doi.org/10.1109/ICCV48922.2021.00838

Wang, Han, Ding, Huang. (2021). Student–teacher feature pyramid matching for anomaly detection. arXiv preprint arXiv:2103.04257.

Woo, Park, Lee J.-Y., Kweon I. S. (2018). CBAM: Convolutional block attention module. In Proceedings of the European conference on computer vision (ECCV) (pp. 3–19). Springer. https://doi.org/10.1007/978-3-030-01234-2_1

Xie, Huang. (2023). A weakly supervised anomaly detection method based on deep anomaly scoring network. Signal, Image and Video Processing, 17(8), 3903–3911. https://doi.org/10.1007/s11760-023-01111-1

Zhang, Chen, Liu. (2020). Deep object co-segmentation via spatial-semantic network modulation. In Proceedings of the AAAI conference on artificial intelligence (vol. 34, pp. 12813–12820). AAAI. https://doi.org/10.1609/aaai.v34i07.6977

Zhu, Lin, Dai. (2019). Deformable convnets v2: More deformable, better results. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9308–9316). IEEE. https://doi.org/10.1109/CVPR.2019.00953

Registration and Convolutional Multidimensional Attention Module for Few-Shot Anomaly Detection
