
Abstract

The detection of metal surface defects is crucial in industrial production. In practice, however, detectors must cope with ambiguous defect orientations, large scale differences, and strong background interference. This paper proposes an improved multi-scale fine-grained object detection framework based on YOLOv11, referred to as “DA-YOLO.” First, a 3D attention module (3DAM) is used to strengthen the modeling of spatial directional features and the perception of fine-grained structures. Second, a feature enhancement module (AMFEM) employing multi-scale convolution and a spatial-channel attention mechanism is constructed, improving recognition accuracy for multi-scale targets and defects with blurry boundaries. Furthermore, an IoU-aware joint loss function (IAJ-Loss) is proposed, which further enhances the quality awareness and stability of the model in complex detection scenarios. Experimental results show that DA-YOLO improves mAP@0.5 by 5.42 and 4.05 percentage points on the GC10-DET and NEU-DET datasets, respectively, compared to the baseline YOLOv11 model, demonstrating superior defect detection performance.

Keywords

attention mechanism, defect detection, fine particle size, fuzzy defect, multi-scale, YOLOv11

Introduction

Steel surface defect detection is crucial for shipbuilding quality. Defects including cracks, pores, inclusions, scratches, and indentations^1–3 arising throughout production and processing compromise hull integrity and pose risks to structural safety and service life.

Early steel defect detection relied on manual visual inspection, which fails to meet industrial demands. While techniques like infrared and eddy current detection⁴ excel in specific scenarios, they struggle to balance accuracy, stability, and real-time performance. Recent automated systems using neural networks,⁵ Bayesian networks,⁶ and support vector machines⁷ demonstrate strong capabilities in end-to-end learning and complex feature representation.

Deep learning-based object detection comprises two mainstream approaches. Two-stage methods like Mask R-CNN⁸ and Faster R-CNN⁹ first generate region proposals and then perform classification and regression. In contrast, single-stage frameworks including SSD¹⁰ and YOLO¹¹ achieve end-to-end recognition directly. The YOLO series excels in speed and accuracy, making it particularly suitable for metal surface defect detection, with advancements in YOLOv3, YOLOv7, and YOLOv10 providing key insights into multi-scale feature extraction and real-time performance.^12–17 Recent variants include ST-YOLO¹⁸ with a sample screening strategy, EML-YOLO¹⁹ featuring large-kernel and multi-branch fusion, and SDMS-YOLOv10²⁰ incorporating a dynamic backbone and a redesigned detection head. Despite this progress, challenges remain in detection efficiency and in performance on dense defects.

Unlike routine detection tasks, steel surface defects exhibit distinctive characteristics including small size, irregular shape, blurred boundaries,²¹ and sensitivity to background noise, significantly increasing detection difficulty. This paper proposes DA-YOLO, a novel extension to the YOLOv11 architecture that introduces two new modules: 3DAM and AMFEM. These modules address the limitations of YOLOv11 in detecting directional features and small-scale defects. 3DAM enhances directional feature modeling, while AMFEM improves multi-scale detection, especially in complex environments with blurred boundaries. The main contributions are:

The YOLOv11 backbone incorporates a 3D Attention Module (3DAM) that integrates horizontal, vertical, and channel-wise attention with multi-pooling feature statistics. This significantly improves the detection of directional defects, such as slender cracks and fine scratches.

This paper proposes an Attention-Guided Multi-Scale Feature Enhancement Module (AMFEM) that integrates multi-scale convolution, spatial-channel attention, and group gating mechanisms. AMFEM improves multi-scale defect detection while suppressing background interference, particularly enhancing accuracy for small notches with blurred edges.

An IoU-Aware joint loss (IAJ-Loss), integrating BCE-IoU and Focal EIoU, is proposed to coordinate classification and regression branches, significantly improving the detection of challenging targets such as small defects and elongated cracks.

Evaluations on the GC10-DET and NEU-DET datasets demonstrate DA-YOLO’s effectiveness, achieving mAP@0.5 improvements of 5.42% and 4.05% over YOLOv11, respectively. This significantly enhances detection performance and demonstrates the practical application potential of DA-YOLO in steel defect inspection.

Related work

Accuracy and real-time performance are two important metrics for evaluating steel surface defect detection algorithms. In recent years, object detection technology has continuously developed, from the original two-stage detection methods to single-stage detection methods with simpler structures and higher efficiency.²² Classic one-stage detection networks include the YOLO series,^23–29 SSD series,¹⁹ RetinaNet,³⁰ etc. To meet the demands of more rigorous industrial production, researchers have been exploring detection algorithms that can achieve both faster detection speeds and higher defect detection accuracy. Most of the improvements are based on the one-stage detection framework of YOLO series algorithms.

Steel surface defect detection is a small-target detection task with a complex background. Wang et al.³¹ improved a fatigue detection model based on YOLOv7 and enhanced its ability to recognize key features by introducing a coordinate attention mechanism. Su et al.³² designed a new convolution, ModSConv, which improves on ordinary convolution in cross-channel feature interaction. Wang et al.,³³ Li et al.,³⁴,³⁵ and Zhang et al.³⁶ all modify the base YOLO model, primarily to improve its ability to detect small objects.

YOLOv11 is the 11th generation of the YOLO series, which was first proposed in 2016; it was developed by Ultralytics on the basis of YOLOv8.^37,38 Its architecture (Figure 1, left) features a convolutional backbone that extracts multi-level features through progressive downsampling. The model introduces key enhancements: C3K2 improves feature extraction efficiency, SPPF enhances multi-scale context modeling, and C2PSA focuses on key regions to boost detection robustness.

Figure 1.

Comparison between the baseline YOLOv11 and the proposed DA-YOLO framework. Newly introduced modules (3DAM and AMFEM) are highlighted with colored blocks and star markers. 3DAM improves directional feature modeling, while AMFEM enhances multi-scale defect detection in complex backgrounds.

In the neck network, YOLOv11 employs feature fusion to integrate multi-scale features from shallow and deep layers, improving detection across object sizes. It utilizes multiple C3K2 modules to efficiently extract and fuse semantic information at different levels. The detection head handles classification and bounding box regression, incorporating depthwise separable convolutions (DWConvs) to enhance class discrimination.

Although a large body of research builds on YOLO-based defect detection models, existing methods still fall short of industrial inspection requirements for accuracy and stability on steel surface defects, whose shapes are varied and complex. With its efficient backbone design, multi-scale fusion mechanism, and good module scalability, YOLOv11 provides a suitable baseline for introducing attention mechanisms, multi-scale enhancement structures, and improved loss functions.

To this end, this paper improves the YOLOv11m framework and proposes a new defect detection model, DA-YOLO, whose network structure is shown on the right of Figure 1. Through module-level enhancements and an improved loss function, the model achieves stronger feature modeling and detection capabilities. The specific structure is introduced in the Method section.

Method

Overall network architecture

To address key challenges in metal defect detection—including ambiguous orientations, significant scale variations, blurred edges, and background interference—this paper systematically enhances YOLOv11 at both architectural and loss function levels, proposing DA-YOLO with strengthened fine-grained perception.

The framework introduces 3DAM to enhance directional perception and AMFEM to improve multi-scale feature representation in complex backgrounds. An IoU-Aware joint loss optimizes classification-regression consistency, collectively boosting accuracy on complex defects. Experiments confirm DA-YOLO’s superior robustness and stability in detecting small targets, slender cracks, and high-noise scenes, demonstrating strong industrial application potential. For clarity of notation, a comprehensive list of symbols and their definitions is provided in Appendix “Nomenclature.”

Three-dimensional attention module

Unlike the conventional convolutions used in YOLOv11, which struggle with detecting directional features, 3DAM introduces a novel approach by integrating three distinct attention mechanisms (horizontal, vertical, and channel-wise) to enhance directional feature modeling. This allows the model to effectively detect slender defects such as cracks and scratches, which are often overlooked by YOLOv11 due to its limited ability to capture fine-grained directional dependencies.

The decoupled attention mechanism, comprising three parallel branches for H, W, and C dimensions, is motivated as follows:

Separation of Spatial and Channel Dependencies: Explicitly decoupling spatial (H, W) and channel (C) dependencies allows the model to independently capture location-specific patterns (e.g. defect regions) and channel-wise attributes (e.g. texture, color). This enhances flexibility in handling defects with complex spatial layouts or diverse channel characteristics.

Dimensional Attention: The tri-branch design enables the model to assign distinct weights to horizontal, vertical, and channel dependencies, improving its focus on directional features critical for defect detection across diverse materials.

Empirical Validation: Comparative experiments demonstrate that the proposed decoupled attention mechanism (3DAM) outperforms its single combined counterpart, achieving higher mAP50 and mAP50-95 scores—particularly on defects characterized by complex spatial structures or channel-dependent features. Detailed results are presented in the Experimental Results section.

In summary, 3DAM enhances directional feature modeling and is well-suited for metal surface defect detection, where spatial and channel dependencies are best addressed separately.

Figure 2 depicts the overall structure of the 3DAM module. Given the input feature map $X \in R^{C \times H \times W}$ , the module consists of three parallel branches that model attention weights along the following dimensions: the horizontal dimension (H), which focuses on information changes between different “rows”; the vertical dimension (W), which focuses on the response relationship between different “columns”; and the channel dimension (C), which focuses on the relative significance of different feature channels.

Figure 2.

3DAM module structure diagram.

The directional attention is computed in the following steps:

1. Rearrange the input features

To handle the H, W, and C dimensions separately, the tensor dimensions of the input feature X are rearranged with a Permute operation. Taking attention along the H direction as an example, the original feature is transformed into:

X_{H} = Permute (X), X_{H} \in R^{H \times C \times W} .

(1)

2. Multi-pooling compression

For the features of each dimension, average pooling (AvgPool) and standard deviation pooling (StdPool) are applied for statistical compression to extract global structural features. Taking the horizontal dimension as an example:

\begin{matrix} F_{H} = AvgPool (X_{H}) + StdPool (X_{H}), \\ F_{H} \in R^{H \times 1 \times 1} . \end{matrix}

(2)

Similarly, the processing methods for the vertical dimension $F_{W}$ and the channel dimension $F_{C}$ are the same.

3. Directional convolution generates an attention map

A $1 \times K$ convolution is applied to the pooled statistical features to extract local dependencies, where K denotes the size of the directional convolution kernel and is set adaptively based on the number of input channels. Taking the H direction as an example:

A_{H} = σ ({Conv}_{1 \times K} (F_{H})), A_{H} \in R^{H \times 1 \times 1} .

(3)

Where $σ$ represents the Sigmoid activation function. Similarly, we can obtain $A_{W} \in R^{W \times 1 \times 1}$ and $A_{C} \in R^{C \times 1 \times 1}$ .

4. Attention weighting and fusion

Apply the attention maps in each direction to the original feature maps respectively (through the broadcasting mechanism) and fuse them proportionally. The fusion weights of the three attention branches are controlled by the learnable parameter vector $α = [α_{1}, α_{2}, α_{3}]$ , and normalized to:

w_{i} = \frac{e^{α_{i}}}{\sum_{j = 1}^{3} e^{α_{j}}}, i = 1, 2, 3 .

(4)

The final fused attention feature is:

\begin{matrix} X_{attn} = w_{1} \cdot (X \otimes A_{H}) + w_{2} \cdot (X \otimes A_{W}) \\ + w_{3} \cdot (X \otimes A_{C}), \end{matrix}

(5)

Where ⊗ represents broadcast multiplication.

5. Residual connection output

To preserve the original features and keep the attention from overfitting to shallow representations, the final output uses a residual connection:

Y = X + γ \cdot X_{attn},

(6)

Here $γ$ is the learnable scaling coefficient, which is initialized to 0.1 to stabilize the training.

By incorporating tri-directional spatial attention, 3DAM enhances structural feature modeling and improves representation of directional patterns.
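The steps above can be traced on a toy tensor. The following is a minimal pure-Python sketch, not the authors' implementation: it assumes population standard deviation for StdPool, replaces the learned $1 \times K$ convolution of equation (3) with the identity, and uses hypothetical helper names (`avg_plus_std`, `three_dam`). It only illustrates how equations (2), (4), (5), and (6) combine.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def avg_plus_std(values):
    """AvgPool + StdPool of one slice, as in equation (2); population std is an assumption."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean + math.sqrt(var)


def softmax(alpha):
    """Equation (4): normalise the three learnable fusion parameters."""
    peak = max(alpha)
    exps = [math.exp(a - peak) for a in alpha]
    total = sum(exps)
    return [e / total for e in exps]


def three_dam(x, alpha=(0.0, 0.0, 0.0), gamma=0.1):
    """x is a nested list indexed as x[c][h][w]; returns Y with the same shape."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    # Equation (2): one pooled statistic per row (H), per column (W), per channel (C).
    f_h = [avg_plus_std([x[c][h][w] for c in range(C) for w in range(W)]) for h in range(H)]
    f_w = [avg_plus_std([x[c][h][w] for c in range(C) for h in range(H)]) for w in range(W)]
    f_c = [avg_plus_std([x[c][h][w] for h in range(H) for w in range(W)]) for c in range(C)]
    # Equation (3), with the learned 1xK convolution replaced by the identity (assumption).
    a_h = [sigmoid(v) for v in f_h]
    a_w = [sigmoid(v) for v in f_w]
    a_c = [sigmoid(v) for v in f_c]
    w1, w2, w3 = softmax(list(alpha))
    # Equations (5) and (6): broadcast weighting, weighted fusion, scaled residual.
    return [[[x[c][h][w] * (1.0 + gamma * (w1 * a_h[h] + w2 * a_w[w] + w3 * a_c[c]))
              for w in range(W)] for h in range(H)] for c in range(C)]


x = [[[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]  # toy 2x2x2 input
y = three_dam(x)
```

Because $X \otimes A$ is linear in $X$, the three weighted branches of equation (5) collapse into a single per-element factor, which is why the sketch multiplies each entry once.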

Attention-guided multi-scale feature enhancement module

Metal surface defects present significant scale variations, from tiny scratches to large corrosion areas, posing challenges for consistent feature representation.

AMFEM is introduced to address the challenges of multi-scale detection and small defect recognition, which YOLOv11 struggles to handle due to its limited multi-scale fusion capabilities. By integrating multi-scale convolutions and spatial-channel attention, AMFEM improves the model’s ability to detect small defects with blurred boundaries, such as corrosion or scratches, in complex industrial environments. This enhancement makes DA-YOLO more robust to background noise and more accurate in detecting fine-grained defects. Positioned between SPPF and C2PSA, AMFEM participates in upsampling and cross-layer fusion of mid-to-high-level features (Figure 3). The EUCB module upsamples feature maps for fusion with prior-level MFEM features, while the LGAG module gates salient regions to recover fine spatial structures and strengthen inter-layer semantic associations.

Figure 3.

AMFEM module structure diagram.

AMFEM comprises three sub-modules (Figure 3): MFEM for channel-spatial attention and multi-scale context fusion to improve fine-grained target detection, EUCB for semantic-preserving upsampling, and LGAG for salient region enhancement via dynamic gating. MFEM sequentially integrates CAB (channel attention), SAB (spatial attention), and MSCB (multi-scale context extraction) to boost feature discriminability.

1. Channel Attention Block (CAB)

CAB guides the model to highlight semantically relevant channel responses by modeling the global importance of each channel. Given the input feature map $X \in R^{C \times H \times W}$ , adaptive average pooling (AAP) and adaptive max pooling (AMP) each compress it into a global channel vector:

F_{avg} = AAP (X), F_{avg} \in R^{C \times 1 \times 1},

(7)

F_{\max} = AMP (X), F_{\max} \in R^{C \times 1 \times 1} .

(8)

Both vectors are passed through two shared convolutional layers (with a ReLU nonlinearity in between):

M_{avg} = W_{2} \cdot ReLU (W_{1} \cdot F_{avg}),

(9)

M_{\max} = W_{2} \cdot ReLU (W_{1} \cdot F_{\max}) .

(10)

The two results are summed and passed through the Sigmoid activation function to obtain the channel attention map:

M_{c} = σ (M_{avg} + M_{\max}), M_{c} \in R^{C \times 1 \times 1} .

(11)

The final channel weighted output is:

X_{c} = X ⊙ M_{c},

(12)

Where ⊙ denotes element-wise multiplication, with $M_{c}$ broadcast over the spatial dimensions.

2. Spatial Attention Block (SAB)

SAB reduces localization offsets caused by complex backgrounds by guiding the model to focus on the target area. Average pooling and max pooling are computed on the input feature $X_{c}$ along the channel dimension and concatenated:

\begin{matrix} F_{s} = Concat ({Avg}_{c} (X_{c}), {Max}_{c} (X_{c})), \\ F_{s} \in R^{2 \times H \times W} . \end{matrix}

(13)

The spatial attention map is then generated after 7 × 7 convolution with Sigmoid activation:

M_{s} = σ ({Conv}_{7 \times 7} (F_{s})), M_{s} \in R^{1 \times H \times W} .

(14)

The final feature map is updated as follows:

X_{s} = X_{c} ⊙ M_{s} .

(15)

3. Multi-Scale Convolution Block (MSCB)

MSCB extracts scale-invariant features to alleviate the perception bias of the model for small objects or large-size objects. The channels are first compressed using 1 × 1 convolutions:

X_{r} = ReLU (BN ({Conv}_{1 \times 1} (X_{s}))) .

(16)

Then it is fed into the multi-scale branch convolution path:

X_{m} = Concat [f_{3 \times 3} (X_{r}), f_{5 \times 5} (X_{r}), f_{7 \times 7} (X_{r})] .

(17)

Finally, the multi-scale context is fused by BN and a channel shuffle operation, and the highly expressive features $X_{m} \in R^{C \times H \times W}$ are output.

4. Efficient Up-convolution Block (EUCB)

EUCB is designed as a lightweight and semantically sensitive upsampling architecture. Firstly, it uses bilinear interpolation to upsample the feature map to the target resolution, and then performs depthwise separable directional convolution:

X_{d} = {DWC}_{3 \times 1} ({DWC}_{1 \times 3} (X)) .

(18)

Normalization is then performed with nonlinearity:

X_{e} = ReLU (BN (X_{d})) .

(19)

Finally, the channels are compressed and fused:

Y = {Conv}_{1 \times 1} (X_{e}) .

(20)

The whole structure has the characteristics of few parameters, large receptive field and high localization accuracy.

5. Large Kernel Group Attention Gating Module (LGAG)

To dynamically adjust the information fusion strategy during the upsampling process, we use the LGAG module to construct the attention gated channel. Given the main feature map x and guide feature map g, it is first processed by large kernel group convolution:

F_{x} = GroupConv (x),

(21)

F_{g} = GroupConv (g) .

(22)

After fusion, BN, ReLU, and Sigmoid are applied:

M = σ (ReLU (BN (F_{x} + F_{g}))) .

(23)

The output gating feature is as follows:

Y = x ⊙ M .

(24)

LGAG guides cross-level semantic fusion through structural information, suppresses responses from irrelevant regions, and improves attention to key defects.
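Two of the gating computations above are small enough to trace by hand. The sketch below is illustrative only: it implements CAB from equations (7) to (12) with the 1 × 1 convolutions written as matrix-vector products, and the LGAG gate of equation (23) with BN treated as the identity. The weights `w1` and `w2` and the input values are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def channel_attention(x, w1, w2):
    """CAB, equations (7)-(12). x[c][h][w]; w1 and w2 act as 1x1 convolutions."""
    f_avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in x]  # eq. (7)
    f_max = [max(map(max, ch)) for ch in x]                            # eq. (8)

    def shared_mlp(f):
        # eqs. (9)-(10): the same two layers are applied to both pooled vectors.
        hidden = [max(0.0, h) for h in matvec(w1, f)]
        return matvec(w2, hidden)

    m_c = [sigmoid(a + b) for a, b in zip(shared_mlp(f_avg), shared_mlp(f_max))]  # eq. (11)
    x_c = [[[v * m for v in row] for row in ch] for ch, m in zip(x, m_c)]          # eq. (12)
    return x_c, m_c


def lgag_gate(f_x, f_g):
    """Equation (23) per element, with BN treated as the identity (assumption)."""
    return [sigmoid(max(0.0, a + b)) for a, b in zip(f_x, f_g)]


x = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]]]
w1 = [[1.0, 1.0]]       # hypothetical weights reducing 2 channels to 1
w2 = [[1.0], [-1.0]]    # hypothetical weights restoring 2 channels
x_c, m_c = channel_attention(x, w1, w2)
gate = lgag_gate([-3.0, 0.0, 2.0], [1.0, 0.0, 1.0])
```

One consequence of equation (23) as written is that the Sigmoid receives a non-negative input after the ReLU, so every gate value lies in [0.5, 1): the gate attenuates responses by at most half rather than zeroing them.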

The AMFEM module is embedded at each resolution level during feature decoding, forming a closed-loop enhancement mechanism with semantic guidance and hierarchical feedback. The specific feature enhancement process follows:

For the input feature map $X_{i}$ of layer $i$ , firstly, the channel, spatial and scale attention enhanced features $F_{i}$ are extracted through the MFEM module:

X_{i} \overset{MFEM}{\to} F_{i} .

(25)

Then, the EUCB module is used to upsample $F_{i}$ to the same resolution as the $i - 1$ layer to generate the upsampled feature ${\hat{F}}_{i - 1}$ :

F_{i} \overset{EUCB}{\to} {\hat{F}}_{i - 1} .

(26)

${\hat{F}}_{i - 1}$ is fused with the lateral feature $X_{i - 1}$ from the backbone network, and the LGAG module is used to guide the saliency enhancement to generate the fused feature ${\overset{\lor}{X}}_{i - 1}$ :

({\hat{F}}_{i - 1}, X_{i - 1}) \overset{LGAG}{\to} {\overset{\lor}{X}}_{i - 1} .

(27)

Finally, ${\overset{\lor}{X}}_{i - 1}$ is fed into the MFEM module again to generate the next layer of enhanced features $F_{i - 1}$ and continue the next level of fusion processing:

{\overset{\lor}{X}}_{i - 1} \overset{MFEM}{\to} F_{i - 1} .

(28)

This design maintains semantic consistency during hierarchical feature transfer.
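The top-down flow of equations (25) to (28) can be written as a short loop. This sketch assumes the three sub-modules are supplied as callables and uses simple numeric stand-ins in place of real feature maps; the function name `amfem_decode` and the deepest-first ordering of the inputs are choices made for illustration.

```python
def amfem_decode(features, mfem, eucb, lgag):
    """features: backbone maps ordered from deepest (X_i) to shallowest."""
    enhanced = [mfem(features[0])]                 # eq. (25): X_i -> F_i
    for lateral in features[1:]:
        upsampled = eucb(enhanced[-1])             # eq. (26): F_i -> upsampled F_{i-1}
        fused = lgag(upsampled, lateral)           # eq. (27): gate with lateral X_{i-1}
        enhanced.append(mfem(fused))               # eq. (28): fused map -> F_{i-1}
    return enhanced


# Numeric stand-ins, used only to trace the order of operations.
outputs = amfem_decode(
    [10, 20, 30],
    mfem=lambda x: x + 1,
    eucb=lambda f: f * 2,
    lgag=lambda up, lat: up + lat,
)
```

Each level's enhanced feature depends on every deeper level, which is the hierarchical feedback the text describes.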

Loss function

In object detection, classification confidence and localization quality are often decoupled, which may degrade performance on hard samples. While YOLOv11 employs BCE and CIoU losses, two limitations persist: (1) classification confidence fails to reflect localization quality due to task decoupling; (2) regression loss inadequately addresses hard samples like small or slender defects. To address these, we propose an IoU-Aware joint loss with dual-branch optimization: the classification branch uses IA-BCE Loss³⁹ to align confidence with box quality, while the regression branch adopts Focal-EIoU to emphasize low-quality examples. This structure enhances quality awareness and stability in complex detection scenarios without increasing architectural complexity.

Traditional BCE loss employs hard labels (0/1), ignoring localization quality differences between predicted and ground-truth boxes. This often causes high-confidence false detections with poor box quality, especially for small or dense defects. To address this, we introduce an IoU-Aware label $t_{i} \in [0, 1]$ as the classification target and replace BCE with IA-BCE Loss, defined as:

t_{i} = s_{i}^{α} \cdot u_{i}^{1 - α},

(29)

Here, $s_{i}$ represents the confidence of the predicted box, $u_{i}$ represents the IoU between the predicted box and the ground-truth box, and $α \in [0, 1]$ is a weight coefficient that balances the classification score against the IoU; it is set to $α$ = 0.25 in this paper. The classification loss is then:

L_{cls} = \sum_{i \in P} BCE ({\hat{p}}_{i}, t_{i}) + \sum_{j \in N} {\hat{p}}_{j}^{2} \cdot BCE ({\hat{p}}_{j}, 0),

(30)

Here, the soft label $t_{i}$ is used for positive samples to align classification and regression information, while the squared weight ${\hat{p}}_{j}^{2}$ is applied to background samples to perform hard negative mining. $P$ is the set of positive samples; $N$ is the set of negative samples; ${\hat{p}}_{i} \in [0, 1]$ represents the prediction confidence of the $i$th sample; $BCE (a, b)$ is the binary cross-entropy loss, defined as follows:

BCE (a, b) = - [b \log a + (1 - b) \log (1 - a)] .

(31)

In order to further suppress the influence of low-quality positive samples, this paper introduces a decaying weight mechanism based on IoU ranking. Let the IoU rank of the ith prediction box within its true box matching group be $r_{i}$ , then its weight is defined as follows:

w_{i} = \exp (- \frac{r_{i}}{r}),

(32)

Here, $r$ controls the strength of the ranking penalty and is set to 3.0 in this paper (it is distinct from the exponent $r$ of the Focal-EIoU modulation factor introduced later, which is set to 1.5). The IA-BCE classification loss with rank weighting is finally expressed as follows:

L_{cls} = \sum_{i \in P} w_{i} \cdot BCE ({\hat{p}}_{i}, t_{i}) + \sum_{j \in N} {\hat{p}}_{j}^{2} \cdot BCE ({\hat{p}}_{j}, 0) .

(33)

This design can significantly suppress the influence of low-quality positive samples and improve the consistency between the classification output of the model and the actual positioning quality.
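Equations (29) to (33) can be combined into a few lines. The sketch below is a plain-Python illustration under stated assumptions: ranks are counted from 0 for the highest-IoU box in a matching group (the text does not fix the starting index), $s_{i}$ and ${\hat{p}}_{i}$ are passed as separate inputs, probabilities are clamped for numerical safety, and the sample values are hypothetical.

```python
import math


def bce(a, b, eps=1e-12):
    """Binary cross-entropy of equation (31); a is clamped away from 0 and 1."""
    a = min(max(a, eps), 1.0 - eps)
    return -(b * math.log(a) + (1.0 - b) * math.log(1.0 - a))


def iou_aware_label(s, u, alpha=0.25):
    """Equation (29): soft target from confidence s and IoU u."""
    return (s ** alpha) * (u ** (1.0 - alpha))


def rank_weight(rank, r=3.0):
    """Equation (32): decaying weight; rank 0 = best box is an assumption."""
    return math.exp(-rank / r)


def ia_bce_loss(positives, negatives, alpha=0.25, r=3.0):
    """Equation (33). positives: (p_hat, s, u, rank) tuples; negatives: p_hat values."""
    pos = sum(rank_weight(rank, r) * bce(p, iou_aware_label(s, u, alpha))
              for p, s, u, rank in positives)
    neg = sum((p ** 2) * bce(p, 0.0) for p in negatives)
    return pos + neg


# Hypothetical samples: two positives in one matching group, two background boxes.
loss = ia_bce_loss(
    positives=[(0.9, 0.9, 0.8, 0), (0.6, 0.6, 0.55, 1)],
    negatives=[0.1, 0.7],
)
```

With $α$ = 0.25 the soft target is dominated by the IoU term, so a confident prediction on a poorly localized box still receives a low classification target.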

YOLOv11 employs CIoU loss to measure IoU overlap, center offset, and aspect ratio difference. However, its weak gradient distribution for slender defects—particularly in low-IoU regions with sparse learning signals—often causes regression failure on hard examples. Therefore, this paper adopts Focal-EIoU loss, which incorporates a modulation factor ${(1 - IoU)}^{r}$ to emphasize high-error candidate boxes and enhance regression learning on low-quality examples. The loss is defined as:

L_{reg} = {(1 - IoU)}^{r} \cdot L_{EIoU}, r = 1.5,

(34)

Here, the EIoU loss is as follows:

\begin{matrix} L_{EIoU} = 1 - IoU + \frac{{(x - x^{*})}^{2} + {(y - y^{*})}^{2}}{c^{2}} \\ + \frac{{(w - w^{*})}^{2}}{w_{\max}^{2}} + \frac{{(h - h^{*})}^{2}}{h_{\max}^{2}} . \end{matrix}

(35)

Here, $x$ , $y$ , $w$ , $h$ , and $x^{*}$ , $y^{*}$ , $w^{*}$ , $h^{*}$ are the center point coordinates and width and height of the predicted box and the real box respectively, and c is the diagonal length of the minimum bounding box of the two. The loss is designed to assign larger gradients to low IoU intervals. This mechanism boosts the model’s capacity to learn from difficult targets and improves regression convergence speed.
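A small numeric example makes equations (34) and (35) concrete. This sketch is not the authors' code: it assumes boxes are given as (center x, center y, width, height) with positive sizes, and it takes $w_{\max}$ and $h_{\max}$ to be the width and height of the smallest enclosing box, as in the standard EIoU formulation, since the text above does not define them.

```python
def focal_eiou(pred, target, r=1.5):
    """Return (loss, iou) for two boxes given as (cx, cy, w, h)."""
    x, y, w, h = pred
    xs, ys, ws, hs = target
    # Corner coordinates of both boxes.
    px1, py1, px2, py2 = x - w / 2, y - h / 2, x + w / 2, y + h / 2
    tx1, ty1, tx2, ty2 = xs - ws / 2, ys - hs / 2, xs + ws / 2, ys + hs / 2
    # Intersection over union.
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    iou = inter / (w * h + ws * hs - inter)
    # Smallest enclosing box: its sides stand in for w_max and h_max (assumption).
    enc_w = max(px2, tx2) - min(px1, tx1)
    enc_h = max(py2, ty2) - min(py1, ty1)
    c2 = enc_w ** 2 + enc_h ** 2
    # Equation (35): overlap, center-distance, width and height terms.
    l_eiou = (1.0 - iou
              + ((x - xs) ** 2 + (y - ys) ** 2) / c2
              + (w - ws) ** 2 / enc_w ** 2
              + (h - hs) ** 2 / enc_h ** 2)
    # Equation (34): modulation factor, guarded against tiny negative round-off.
    return max(0.0, 1.0 - iou) ** r * l_eiou, iou


loss, iou = focal_eiou((0.0, 0.0, 2.0, 2.0), (1.0, 0.0, 2.0, 2.0))
```

For this pair of equal-sized boxes shifted by half a width, the IoU is 1/3 and only the overlap and center-distance terms are non-zero.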

Finally, the training objective function of DA-YOLO is defined as:

L_{IAJ} = λ_{1} \cdot L_{cls} + λ_{2} \cdot L_{reg} + λ_{3} \cdot L_{dfl} .

(36)

In the experiments of this paper, $λ_{1} = 1.0$ weights the classification task, $λ_{2} = 2.5$ emphasizes the localization branch to speed up its convergence, and $λ_{3} = 0.5$ retains the fine-grained regulation effect of DFL.

The proposed loss function introduces quality-aware classification supervision and hard-sample-focused regression, effectively bridging the gap between classification and regression tasks. This improves alignment between classification confidence and localization quality.

The results in the Appendix “Parameter Sensitivity Analysis” show that $α = 0.25$ and $r = 3.0$ consistently give the highest mAP@0.5 on both the GC10-DET and NEU-DET datasets, which suggests that these values are robust across different defect detection scenarios.

Experiment

Environment and parameters

Experiments were run on an NVIDIA RTX 3090 with CUDA 11.8, PyTorch 2.0.1, and Python 3.10. This paper employs the SGD optimizer with a cosine annealing scheduler (initial LR $1 \times 10^{- 3}$ , final LR $1 \times 10^{- 5}$ , lrf = 0.01). Mosaic augmentation was disabled in the last 10 epochs to reduce late-stage interference, and early stopping was triggered after 50 epochs without validation improvement.
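The learning-rate settings are mutually consistent: the final rate equals the initial rate times lrf. The sketch below uses the standard cosine annealing formula to show the schedule shape; the total number of epochs is not stated in this section, so `total_epochs=100` is a placeholder, and the exact scheduler implementation used in the experiments may differ in detail.

```python
import math


def cosine_lr(epoch, total_epochs, lr0=1e-3, lrf=0.01):
    """Cosine annealing from lr0 down to lr0 * lrf over total_epochs."""
    lr_final = lr0 * lrf
    return lr_final + 0.5 * (lr0 - lr_final) * (1.0 + math.cos(math.pi * epoch / total_epochs))


total_epochs = 100  # placeholder value, not taken from the paper
schedule = [cosine_lr(e, total_epochs) for e in range(total_epochs + 1)]
```

The schedule starts at $1 \times 10^{- 3}$ , passes through the midpoint of the two rates halfway through training, and ends at $1 \times 10^{- 5}$ .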

Dataset

To evaluate DA-YOLO’s performance in metal defect detection, this paper uses two public industrial datasets: GC10-DET⁴⁰ and NEU-DET.⁴¹ Both are widely adopted in industrial vision and cover diverse, challenging steel defects with strong practical relevance.

GC10-DET contains 6380 RGB images (640 × 640) with 10 defect types including cracks, scratches, and peeling. Images may contain multiple defects with imbalanced class distribution.

NEU-DET includes 1800 grayscale images of hot-rolled steel strips, featuring six defect types (crazing, inclusion, patches, pitted surface, rolled-in scale, and scratches). Each class has 300 images, originally 200 × 200 and resized to 640 × 640. Its small, blurry targets effectively test small-defect detection capability.

All datasets are split 8:1:1 into training, validation, and test sets for model training, hyperparameter tuning, and evaluation.
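For reference, the 8:1:1 ratio gives the following subset sizes for the image counts stated above. The rounding rule (floor for training and validation, remainder to test) is an assumption, as is the helper name; the text does not say whether the split was random or stratified by class.

```python
def split_counts(n_images, ratios=(8, 1, 1)):
    """Return (train, val, test) sizes for an integer ratio split."""
    total = sum(ratios)
    train = n_images * ratios[0] // total
    val = n_images * ratios[1] // total
    test = n_images - train - val  # remainder goes to the test set (assumption)
    return train, val, test


gc10_split = split_counts(6380)
neu_split = split_counts(1800)
```

Both totals are divisible by 10, so the rounding assumption does not affect these two datasets.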

Evaluation metrics

The main evaluation indicators included:

1. Mean Average Precision (mAP)

We use mAP@0.5 and mAP@0.5:0.95 as core metrics. mAP@0.5 measures detection accuracy at IoU threshold 0.5, while mAP@0.5:0.95 averages AP across IoU thresholds from 0.5 to 0.95 (step 0.05). Together they comprehensively evaluate model performance under varying matching strictness.

2. Precision and Recall

Precision is the fraction of samples predicted as positive by the model that are actually positive; it reflects how well the model avoids false positives. Recall is the fraction of all positive samples that the model successfully detects; it reflects how well the model avoids missed detections. They are defined as follows:

Precision = \frac{TP}{TP + FP},

(37)

Recall = \frac{TP}{TP + FN} .

(38)

Here, TP is the number of predicted positives that are truly positive, FP is the number of predicted positives that are actually negative, and FN is the number of predicted negatives that are actually positive.

3. Frames Per Second (FPS)

FPS measures how many frames the model can process per second, reflecting its real-time performance, crucial for fast decision-making in industrial applications.

4. Computational Complexity (FLOPs)

FLOPs quantify the computational cost per image, indicating the number of floating-point operations required. This metric is key for assessing model efficiency and deployment feasibility, especially on resource-constrained hardware.
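The accuracy metrics above reduce to a few lines of arithmetic. The sketch below computes precision and recall from equations (37) and (38) and lists the ten IoU thresholds averaged by mAP@0.5:0.95; the counts in the usage line are hypothetical, and the per-threshold AP values would come from a full evaluation that is not reproduced here.

```python
def precision_recall(tp, fp, fn):
    """Equations (37) and (38); returns 0.0 when a denominator is empty."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def iou_thresholds():
    """IoU thresholds from 0.5 to 0.95 in steps of 0.05."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


def mean_over_thresholds(ap_by_threshold):
    """mAP@0.5:0.95 style average of per-threshold AP values."""
    return sum(ap_by_threshold) / len(ap_by_threshold)


precision, recall = precision_recall(tp=80, fp=20, fn=10)  # hypothetical counts
thresholds = iou_thresholds()
```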

Ablation experiment

This paper performs ablation studies on GC10-DET and NEU-DET to evaluate individual contributions of DA-YOLO’s key modules. Using YOLOv11m as baseline, we progressively add 3DAM, AMFEM, and IAJ-Loss, with results shown in Tables 1 and 2.

Table 1.

Ablation study of DA-YOLO on GC10-DET dataset.

YOLO11m	3DAM	AMFEM	IAJ-Loss	P (%)	R (%)	mAP@0.5 (%)	mAP@0.5:0.95 (%)
✓	✕	✕	✕	80.85	81.56	76.29	53.59
✓	✓	✕	✕	85.44	85.08	78.67	53.83
✓	✕	✓	✕	88.17	87.61	79.28	53.61
✓	✕	✕	✓	81.07	82.03	77.83	53.72
✓	✓	✓	✕	84.92	89.41	81.44	52.55
✓	✓	✕	✓	89.14	89.85	79.25	53.05
✓	✕	✓	✓	87.93	88.05	79.21	54.39
✓	✓	✓	✓	89.14	89.93	81.71	53.99

Table 2.

Ablation study of DA-YOLO on NEU-DET dataset.

YOLO11m	3DAM	AMFEM	IAJ-Loss	P (%)	R (%)	mAP@0.5 (%)	mAP@0.5:0.95 (%)
✓	✕	✕	✕	77.19	70.93	78.19	45.75
✓	✓	✕	✕	84.97	83.39	79.88	49.77
✓	✕	✓	✕	83.60	83.78	80.78	55.41
✓	✕	✕	✓	72.58	68.35	79.14	46.18
✓	✓	✓	✕	84.03	87.76	81.68	55.64
✓	✓	✕	✓	84.04	84.20	79.67	56.72
✓	✕	✓	✓	83.44	84.21	81.30	56.11
✓	✓	✓	✓	86.70	86.79	82.24	58.45

As shown in Table 1, 3DAM alone raises mAP@0.5 by 2.38 points (76.29% to 78.67%) and also improves precision, demonstrating its structural adaptability to diverse targets. AMFEM improves recall and overall accuracy, particularly for small and blurry defects. IAJ-Loss enhances classification-regression consistency, raising mAP@0.5 to 77.83% on its own, above the baseline. The full combination achieves the best mAP@0.5 with a gain of 5.42 points, confirming module complementarity.

Table 2 shows consistent improvements on small grayscale images. Among the two-module combinations, 3DAM + AMFEM gives the largest mAP@0.5 gain (+3.49 points), while AMFEM + IAJ-Loss gains +3.11 points, confirming AMFEM's efficacy for small defect recognition. The full combination yields the best performance (+4.05 points mAP@0.5), demonstrating strong versatility and robustness.

Across both datasets, 3DAM, AMFEM, and IAJ-Loss enhance structural modeling, multi-scale fusion, and loss alignment respectively. Each module contributes stable gains individually while showing complementary effects when combined, forming a structural foundation for high-accuracy, robust metal defect detection in DA-YOLO.
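The reported gains are absolute differences in mAP@0.5 between a configuration and the YOLOv11m baseline. The short check below recomputes them from the values in Tables 1 and 2; the dictionary layout and function name are illustrative.

```python
# mAP@0.5 (%) copied from Tables 1 and 2.
map50 = {
    "GC10-DET": {"baseline": 76.29, "3DAM": 78.67, "full": 81.71},
    "NEU-DET": {"baseline": 78.19, "3DAM": 79.88, "full": 82.24},
}


def gain(dataset, config):
    """Absolute mAP@0.5 gain over the baseline, in percentage points."""
    return round(map50[dataset][config] - map50[dataset]["baseline"], 2)


gc10_full_gain = gain("GC10-DET", "full")
neu_full_gain = gain("NEU-DET", "full")
```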

Contrast test

Figure 4 compares DA-YOLO with YOLOv11m on both datasets: the top two groups show GC10-DET results, the bottom two NEU-DET. DA-YOLO improves all metrics across both datasets.

Figure 4.

Comparison on the GC10-DET dataset and NEU-DET dataset.

DA-YOLO is further evaluated against state-of-the-art defect detection algorithms on the GC10-DET and NEU-DET datasets. The compared models include Faster-RCNN,⁴² TOOD,⁴³ YOLOX,⁴⁴ RDN,⁴⁵ CenterNet,⁴⁶ DCAM-NET,⁴⁷ ADE-YOLO,⁴⁸ PCP-YOLO,⁴⁹ and various YOLO models.⁵⁰

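Tables 3 and 4 compare DA-YOLO with recent detection methods on GC10-DET and NEU-DET. DA-YOLO attains the highest mAP@0.5 and mAP@0.5:0.95 on GC10-DET and the highest mAP@0.5:0.95 on NEU-DET, where YOLOv9 (83.3%) and DCAM-NET (82.3%) report a higher mAP@0.5 than DA-YOLO (82.23%).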

Table 3.

Main metric comparison on the GC10-DET dataset.

Model	Params (M)	mAP@0.5 (%)	mAP@0.5:0.95 (%)	FPS	FLOPs
FASTER-RCNN (2016)	41.4	73.9	43.1	15	10.5B
TOOD (2021)	32.0	79.1	51.1	18	9.7B
YOLOX (2021)	9.0	81.1	51.5	55	7.2B
CENTERNET (2019)	11.5	80.5	51.7	45	8.4B
YOLOv8 (2025)	11.1	80.0	50.8	50	7.8B
PCP-YOLO (2025)	3.8	77.6	51.2	60	6.5B
YOLOv9 (2025)	2.61	81.5	52.4	75	6.8B
DA-YOLO	3.91	81.71	53.99	70	6.9B

Table 4.

Main metric comparison on the NEU-DET dataset.

Model	Params (M)	mAP@0.5 (%)	mAP@0.5:0.95 (%)	FPS	FLOPs
FASTER-RCNN (2016)	41.4	80.9	45.6	14	10.9B
TOOD (2021)	32.0	81.8	50.8	17	9.8B
YOLOX (2021)	9.0	82.1	47.9	54	7.3B
RDN (2022)	16.5	79.0	51.6	45	8.2B
CENTERNET (2019)	11.3	78.3	51.2	42	8.6B
DCAM-NET (2023)	9.7	82.3	56.4	52	8.2B
YOLOv8 (2025)	11.1	78.2	53.4	49	7.9B
ADE-YOLO (2025)	21.3	81.5	53.2	51	8.2B
YOLOv9 (2025)	2.61	83.3	54.6	78	6.8B
DA-YOLO	3.91	82.23	58.45	68	6.9B

These comparisons demonstrate DA-YOLO’s effectiveness for industrial metal defect detection. It outperforms YOLOv11m in convergence speed, accuracy, and stability, and it matches or exceeds mainstream methods on most metrics with a comparatively small parameter count, indicating good generalization and robustness.
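The per-metric rankings in Table 4 can be checked mechanically. The following is a minimal sketch, not part of the original experiments: it copies the NEU-DET rows from Table 4 into a Python list and reports the best model and DA-YOLO's rank for each accuracy metric. The names `NEU_DET_ROWS`, `best_model`, and `rank_of` are assumptions introduced here for illustration.

```python
# Rows copied from Table 4 (NEU-DET):
# (model, params in millions, mAP@0.5 %, mAP@0.5:0.95 %, FPS)
NEU_DET_ROWS = [
    ("FASTER-RCNN", 41.4, 80.9, 45.6, 14),
    ("TOOD", 32.0, 81.8, 50.8, 17),
    ("YOLOX", 9.0, 82.1, 47.9, 54),
    ("RDN", 16.5, 79.0, 51.6, 45),
    ("CENTERNET", 11.3, 78.3, 51.2, 42),
    ("DCAM-NET", 9.7, 82.3, 56.4, 52),
    ("YOLOv8", 11.1, 78.2, 53.4, 49),
    ("ADE-YOLO", 21.3, 81.5, 53.2, 51),
    ("YOLOv9", 2.61, 83.3, 54.6, 78),
    ("DA-YOLO", 3.91, 82.23, 58.45, 68),
]

# Column name -> tuple index (hypothetical helper mapping).
COLUMNS = {"params_M": 1, "mAP50": 2, "mAP50_95": 3, "fps": 4}


def best_model(rows, column, higher_is_better=True):
    """Return the name of the model with the best value in `column`."""
    idx = COLUMNS[column]
    pick = max if higher_is_better else min
    return pick(rows, key=lambda row: row[idx])[0]


def rank_of(rows, model, column):
    """Return the 1-based rank of `model` in `column` (higher is better)."""
    idx = COLUMNS[column]
    ordered = sorted(rows, key=lambda row: row[idx], reverse=True)
    return [row[0] for row in ordered].index(model) + 1


for metric in ("mAP50", "mAP50_95"):
    print(metric, "best:", best_model(NEU_DET_ROWS, metric),
          "| DA-YOLO rank:", rank_of(NEU_DET_ROWS, "DA-YOLO", metric))
```

With the Table 4 values, DA-YOLO ranks first on mAP@0.5:0.95 and third on mAP@0.5, behind YOLOv9 and DCAM-NET.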

Visualization

To visually demonstrate DA-YOLO’s detection performance, we select representative samples from both GC10-DET and NEU-DET datasets for analysis, with results shown in Figure 5.

Figure 5.

Visualization of DA-YOLO detection results on GC10-DET dataset and NEU-DET dataset.

Visualizations on both datasets demonstrate DA-YOLO’s accurate localization of diverse steel defects. It maintains stable detection under challenging industrial conditions, which supports its practical applicability in real manufacturing environments.

Analysis of deployment feasibility

To further assess the industrial applicability of DA-YOLO, we analyze its deployment feasibility on resource-constrained edge devices. As shown in Tables 3 and 4, DA-YOLO achieves 70 FPS on GC10-DET and 68 FPS on NEU-DET, surpassing the typical industrial requirement of 30 FPS. The model’s memory footprint is 15 MB in FP32 and falls below 5 MB when quantized to INT8, making it suitable for low-power devices such as the NVIDIA Jetson Nano. With 6.9B FLOPs, DA-YOLO balances accuracy and efficiency, which makes it well suited to real-time industrial defect inspection.
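The footprint and frame-rate figures above can be sanity-checked with simple arithmetic. The sketch below is an illustration under stated assumptions: it estimates weight storage only, from the 3.91M parameter count in Tables 3 and 4, taking 4 bytes per FP32 weight, 1 byte per INT8 weight, and 1 MB = 10^6 bytes, and ignores activations and runtime overhead. The function names are hypothetical.

```python
PARAMS_MILLIONS = 3.91                      # DA-YOLO parameters (Tables 3 and 4)
BYTES_PER_WEIGHT = {"FP32": 4, "INT8": 1}   # assumed storage per weight


def weight_size_mb(params_millions, bytes_per_weight):
    """Approximate weight storage in MB (1 MB = 10**6 bytes)."""
    return params_millions * bytes_per_weight


def frame_time_ms(fps):
    """Average time available per frame, in milliseconds, at a given FPS."""
    return 1000.0 / fps


for precision, nbytes in BYTES_PER_WEIGHT.items():
    size = weight_size_mb(PARAMS_MILLIONS, nbytes)
    print(f"{precision}: about {size:.2f} MB of weights")

for name, fps in [("GC10-DET", 70), ("NEU-DET", 68), ("30 FPS requirement", 30)]:
    print(f"{name}: {frame_time_ms(fps):.1f} ms per frame")
```

Under these assumptions the weights occupy about 15.6 MB in FP32 and about 3.9 MB in INT8, consistent with the reported 15 MB and sub-5 MB figures. The reported 70 and 68 FPS correspond to roughly 14 to 15 ms per frame, against a budget of about 33 ms at 30 FPS; throughput on a specific edge device would still need to be measured on that device.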

Discussion and conclusion

This paper proposes DA-YOLO, an enhanced YOLOv11 framework that incorporates two novel modules, 3DAM and AMFEM, together with the IAJ-Loss. These components improve the model’s ability to detect small defects, slender cracks, and blurred boundary defects in industrial environments. 3DAM enhances directional perception, while AMFEM improves multi-scale feature representation, allowing DA-YOLO to outperform YOLOv11 in challenging detection scenarios.

The experimental results on the GC10-DET and NEU-DET datasets show that the proposed model achieves strong detection results, especially in identifying small targets and weak-contrast defects. However, we recognize that both datasets exhibit class imbalance, with certain defect types having significantly fewer samples. This imbalance could affect the model’s performance, particularly for rare or small defects.
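One simple way to quantify such imbalance is the ratio between the largest and smallest class counts, together with inverse-frequency class weights. The sketch below is illustrative only: the label list is hypothetical and does not reflect the actual GC10-DET or NEU-DET class counts, and `imbalance_report` is a name introduced here.

```python
from collections import Counter


def imbalance_report(labels):
    """Return per-class counts, the max/min count ratio, and class weights.

    Weights are inverse-frequency values normalized so that a perfectly
    balanced dataset gives every class a weight of 1.0.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    n_classes = len(counts)
    ratio = max(counts.values()) / min(counts.values())
    weights = {cls: total / (n_classes * n) for cls, n in counts.items()}
    return counts, ratio, weights


# Hypothetical annotation labels, NOT taken from either dataset.
labels = ["defect_a"] * 600 + ["defect_b"] * 300 + ["defect_c"] * 100

counts, ratio, weights = imbalance_report(labels)
print("counts:", dict(counts))
print("imbalance ratio:", ratio)
for cls, weight in weights.items():
    print(f"{cls}: weight {weight:.3f}")
```

A ratio well above 1 marks the classes for which additional or synthetic samples would matter most.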

To address this, we plan to use conditional GANs to generate synthetic data for underrepresented defect types and transfer learning to improve the model’s generalization on small-sample datasets.⁵¹ These strategies have proven successful in similar tasks and are expected to mitigate the impact of class imbalance.

In future work, we will make the model more lightweight to improve its real-time performance in industrial scenarios, and we will incorporate multimodal information to improve its adaptability in complex detection situations. Additionally, we will test the model on a broader variety of datasets, including those with different defect types and materials, to validate DA-YOLO’s generalization capability across industries.

Footnotes

Appendix

Acknowledgements

The authors are deeply grateful to the editor and referees for their careful review and valuable suggestions, which have significantly contributed to enhancing the overall quality of the paper.

Handling Editor: Aarthy Esakkiappan

ORCID iDs

Zhen Liu

Kuan-Ching Li

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

1. Dang, Wang. FD-YOLO11: a feature-enhanced deep learning model for steel surface defect detection. IEEE Access 2025; 13: 63981–63993. https://doi.org/10.1109/access.2025.3559733

2. Sanaei, Fatemi. Defects in additive manufactured metals and their effect on fatigue performance: a state-of-the-art review. Prog Mater Sci 2021; 117: 100724. https://doi.org/10.1016/j.pmatsci.2020.100724

3. Usamentiaga, Lema, Pedrayes, et al. Automated surface defect detection in metals: a comparative review of object detection and semantic segmentation using deep learning. IEEE Trans Ind Appl 2022; 58(3): 4203–4213. https://doi.org/10.1109/tia.2022.3151560

4. Xiang, Wang, Zhang, et al. AGCA: an adaptive graph channel attention module for steel surface defect detection. IEEE Trans Instrum Meas 2023; 72: 1–12. https://doi.org/10.1109/tim.2023.3248111

5. Wang. Computer vision based system for apple surface defect detection. Comput Electron Agric 2002; 36(2–3): 215–223. https://doi.org/10.1016/s0168-1699(02)00093-5

6. Heinemann, Sherry. Neural network and Bayesian network fusion models to fuse electronic nose and surface acoustic wave sensor data for apple defect detection. Sens Actuators B Chem 2007; 125(1): 301–310. https://doi.org/10.1016/j.snb.2007.02.027

7. Xue-wu, Yan-qiong, Yan-yun, et al. A vision inspection system for the surface defects of strongly reflected metal based on multi-class SVM. Expert Syst Appl 2011; 38(5): 5930–5939. https://doi.org/10.1016/j.eswa.2010.11.030

8. Shi, Zhou, Tai, et al. An improved faster R-CNN for steel surface defect detection. In: 2022 IEEE 24th international workshop on multimedia signal processing (MMSP), 2022, pp.1–5. IEEE. https://doi.org/10.1109/MMSP55362.2022.9949353

9. Zheng, Zhang. Wafer surface defect detection based on background subtraction and faster R-CNN. Micromachines 2023; 14(5): 905. https://doi.org/10.3390/mi14050905

10. Liu, Anguelov, Erhan, et al. SSD: Single shot multibox detector. In: Leibe, Matas, Sebe, et al. (eds) Computer vision – ECCV 2016. Lecture notes in Computer Science. Vol. 9905. Springer International Publishing, 2016, pp.21–37. https://doi.org/10.1007/978-3-319-46448-0-2

11. Han, Chang, Wang. You only look once: unified, real-time object detection. Procedia Comput Sci 2021; 183: 61–72. https://doi.org/10.1016/j.procs.2021.02.044

12. Alif, Hussain. YOLOv1 to YOLOv10: a comprehensive review of YOLO variants and their application in the agricultural domain. arXiv preprint arXiv:2406.10139. 2024.

13. Redmon, Farhadi. YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767. 2018. https://doi.org/10.48550/arXiv.1804.02767

14. Wang, Xin. Efficient detection model of steel strip surface defects based on YOLO-V7. IEEE Access 2022; 10: 133936–133944. https://doi.org/10.1109/access.2022.3230894

15. Gallo, Rehman, Dehkordi, et al. Deep object detection of crop weeds: performance of YOLOv7 on a real case dataset from UAV images. Remote Sens 2023; 15(2): 539. https://doi.org/10.3390/rs15020539

16. Sharma, Kumar, Longchamps. Comparative performance of YOLOv8, YOLOv9, YOLOv10, YOLOv11 and faster R-CNN models for detection of multiple weed species. Smart Agric Technol 2024; 9: 100648.

17. Liao, Song, et al. A novel YOLOv10-based algorithm for accurate steel surface defect detection. Sensors 2025; 25(3): 769. https://doi.org/10.3390/s25030769

18. Sheng, Zou. Steel surface defect detection based on improved YOLOv8 algorithm. In: 2024 4th international conference on industrial automation, robotics and control engineering (IARCE), 2024, pp.109–113. IEEE. https://doi.org/10.1109/IARCE59021.2024.10688492

19. Arwidiyarti. Single shot multibox detector (SSD) in object detection: a review. Int J Adv Comput Inform 2025; 1(2): 118–127.

20. Xie, Zhou, Chen, et al. SDMS-YOLOv10: improved Yolov10-based algorithm for identifying steel surface flaws. Nondestructive Test Eval 2026; 41: 782–802. https://doi.org/10.1080/10589759.2025.2474103

21. Chen, Jin, Liu, et al. Multi-scale and dynamic snake convolution-based YOLOv9 for steel surface defect detection. J Supercomput 2025; 81(4): 4981–5001. https://doi.org/10.1007/s11227-025-07036-w

22. Chao, Guo, et al. IAMF-YOLO: metal surface defect detection based on improved YOLOv8. IEEE Trans Instrum Meas 2025; 74: 1–17. https://doi.org/10.1109/tim.2025.3548198

23. Jiang, et al. YOLOv6: a single-stage object detection framework for industrial applications. arXiv preprint arXiv:2209.02976. 2022.

24. Wang, Bochkovskiy, Liao HYM. YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In: 2023 IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2023, pp.7464–7475. IEEE. https://doi.org/10.1109/CVPR52729.2023.00720

25. Mulajkar, Yede. YOLO version v1 to v8 comprehensive review. In: 2024 international conference on inventive computation technologies (ICICT), 2024, pp.472–478. IEEE. https://doi.org/10.1109/ICICT60155.2024.10512345

26. Gui, Jiang, et al. FS-YOLOv9: a frequency and spatial feature-based YOLOv9 for real-time breast cancer detection. Acad Radiol 2025; 32(3): 1228–1240.

27. Wang, Chen, Liu, et al. Yolov10: real-time end-to-end object detection. Adv Neural Inf Process Syst 2024; 37: 107984–108011. https://doi.org/10.48550/arXiv.2405.14458

28. Zhao, Shu, Yan, et al. RDD-YOLO: a modified YOLO for detection of steel surface defects. Measurement 2023; 214: 112776. https://doi.org/10.1016/j.measurement.2023.112776

29. Redmon, Divvala, Girshick, et al. You only look once: unified, real-time object detection. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR), 2016, pp.779–788. IEEE. https://doi.org/10.1109/CVPR.2016.91

30. Lin, Goyal, Girshick, et al. Focal loss for dense object detection. In: 2017 IEEE international conference on computer vision (ICCV), 2017, pp.2980–2988. IEEE. https://doi.org/10.1109/ICCV.2017.324

31. Wang, Huang, et al. Enhancing YOLOv7-based fatigue driving detection through the integration of coordinate attention mechanism. In: 2023 IEEE international conference on image processing and computer applications (ICIPCA), 2023, pp.725–729. IEEE. https://doi.org/10.1109/ICIPCA59266.2023.10234567

32. Han, Liu, et al. MOD-YOLO: rethinking the YOLO architecture at the level of feature information and applying it to crack detection. Expert Syst Appl 2024; 237: 121346. https://doi.org/10.1016/j.eswa.2023.121346

33. Wang, Han, Chen, et al. FastPFM: a multi-scale ship detection algorithm for complex scenes based on SAR images. Conn Sci 2024; 36(1): 2313854. https://doi.org/10.1080/09540091.2024.2313854

34. Crespi, et al. Hyper-IIoT: a smart contract-inspired access control scheme for resource-constrained industrial Internet of Things. IEEE Trans Sustain Comput 2025; 10(5): 820–829. https://doi.org/10.1109/tsusc.2025.3542466

35. Crespi, Minerva, et al. DPS-IIoT: non-interactive zero-knowledge proof-inspired access control towards information-centric industrial Internet of Things. Comput Commun 2025; 233: 108065. https://doi.org/10.1016/j.comcom.2025.108065

36. Zhang, Han, Chen. Swin-PAFF: a SAR ship detection network with contextual cross-information fusion. Comput Mater Contin 2023; 77(2): 2657–2675. https://doi.org/10.32604/cmc.2023.042311

37. Khanam, Hussain. Yolov11: an overview of the key architectural enhancements. arXiv preprint arXiv:2410.17725. 2024.

38. Frey, Facente, Wei, et al. Optimizing intraoperative AI: evaluation of YOLOv8 for real-time recognition of robotic and laparoscopic instruments. J Robot Surg 2025; 19(1): 131–142. https://doi.org/10.1007/s11701-025-02284-7

39. Cai, Liu, Wang, et al. Align-detr: improving detr with simple iou-aware bce loss. arXiv preprint arXiv:2304.07527. 2023.

40. Duan, Jiang, et al. Deep metallic surface defect detection: the new benchmark and detection network. Sensors 2020; 20(6): 1562. https://doi.org/10.3390/s20061562

41. Song, Meng, et al. An end-to-end steel surface defect detection approach via fusing multiple hierarchical features. IEEE Trans Instrum Meas 2020; 69(4): 1493–1504. https://doi.org/10.1109/tim.2019.2915404

42. Ren, Girshick, et al. Faster R-CNN: towards real-time object detection with region proposal networks. Adv Neural Inf Process Syst 2015; 28: 91–99. https://doi.org/10.48550/arXiv.1506.01497

43. Feng, Zhong, Gao, et al. TOOD: Task-aligned one-stage object detection. In: 2021 IEEE/CVF international conference on computer vision (ICCV), 2021, pp.3490–3499. IEEE Computer Society. https://doi.org/10.1109/ICCV48922.2021.00346

44. Liu, Wang, et al. YOLOX: exceeding YOLO series in 2021. arXiv preprint arXiv:2107.08430. 2021.

45. Wang. Few-shot steel surface defect detection. IEEE Trans Instrum Meas 2022; 71: 1–12. https://doi.org/10.1109/tim.2021.3128208

46. Duan, Bai, Xie, et al. CenterNet: keypoint triplets for object detection. In: 2019 IEEE/CVF international conference on computer vision (ICCV), 2019, pp.6569–6578. IEEE. https://doi.org/10.1109/ICCV.2019.00667

47. Chen, et al. DCAM-Net: a rapid detection network for strip steel surface defects based on deformable convolution and attention mechanism. IEEE Trans Instrum Meas 2023; 72: 1–12. https://doi.org/10.1109/tim.2023.3238698

48. Wei, Wang, Zhang, et al. ADE-YOLO: real-time steel surface flaw recognition through enhanced adaptive attention and dilated convolution fusion. Signal Image Video Process 2025; 19(6): 5681–5695. https://doi.org/10.1007/s11760-025-03990-3

49. Wang, Shi, Aguilar. PCP-YOLO: an approach integrating non-deep feature enhancement module and polarized self-attention for small object detection of multiscale defects. Signal Image Video Process 2025; 19(1): 201–215. https://doi.org/10.1007/s11760-024-03666-4

50. Wang, Yeh, Mark Liao, et al. YOLOv9: learning what you want to learn using programmable gradient information. In: Avidan, Brostow, Cissé (eds) Computer Vision – ECCV 2024. Lecture Notes in Computer Science. Vol. 15078. Springer Nature, 2024, pp.198–217. https://doi.org/10.1007/978-3-031-71254-3-12

51. Guo, Liu, Xie, et al. Weld defect detection from imbalanced radiographic images based on contrast enhancement conditional generative adversarial network and transfer learning. IEEE Sens J 2021; 21(9): 10844–10853. https://doi.org/10.1109/jsen.2021.3059860

DA-YOLO: A dual attention-driven YOLOv11 framework for multi-scale fine-grained metal surface defect detection
