Abstract
Detecting small objects reliably is particularly difficult in modern neural architectures, where scale imbalance, background clutter, and high object density frequently degrade feature quality and prediction accuracy. To address these challenges, we propose YOLO-Super Resolution and Attention (YOLO-SRA), a multi-scale neural architecture enhanced with attention and super-resolution. The architecture introduces High-Resolution Feature Enhancement (HRFE) to better represent small objects without incurring high computational cost, a Grouped Multi-Scale Split Attention (GMSA) mechanism to efficiently extract features from densely distributed objects, and a Weighted Fine-Grained Cross-Scale Fusion (WFCF) network for adaptive multi-scale feature integration with Unmanned Aerial Vehicle (UAV)-specific adjustments. A Spatial-Attentive Non-Maximum Suppression (SA-NMS) strategy is further employed to reduce missed detections in overlapping regions. Extensive experiments on the VisDrone dataset demonstrate that YOLO-SRA outperforms the YOLOv11 baseline, achieving 12.5% and 11.3% increases in mAP50 and mAP50:95, respectively.
Keywords
Introduction
Accurate detection of small objects in complex scenes remains a fundamental challenge in machine learning, owing to large scale variations, dense spatial distributions, and environmental factors such as occlusion and illumination changes.1–3 Small and densely packed objects pose particular difficulties for conventional detectors, 4 as minor feature degradation or misalignment can lead to missed detections. 5 Over the past decade, deep learning has provided a transformative pathway to address such challenges. With the success of convolutional neural networks (CNNs),6,7 deep learning-based object detectors now surpass traditional approaches in both accuracy and speed.8–10 Current mainstream detectors can be broadly categorised into two-stage and one-stage methods. Two-stage models, represented by Faster R-CNN 11 and Mask R-CNN, 12 first generate region proposals and then refine them through classification and bounding box regression. One-stage models, exemplified by the YOLO series13–25 and SSD, 26 predict object categories and locations directly on grids or anchor boxes through a single forward pass, achieving higher efficiency. Given the stringent requirements of real-time applications, one-stage detectors are generally favoured for unmanned aerial vehicle (UAV) aerial imaging, which has been widely applied in agriculture, forestry, emergency rescue, and environmental management.27–32
Despite these advances, existing methods are not yet fully adapted to UAV aerial data. UAV aerial images differ substantially from natural images, as illustrated in Figure 1: targets are often small-to-medium in scale, densely distributed, and embedded in complex backgrounds. 33 Environmental factors such as vegetation occlusion, shadows, seasonal variation, and illumination changes further complicate feature extraction. In contrast, standard benchmarks such as PASCAL VOC 34 and MS COCO 35 predominantly contain objects with clearer contours and simpler scene composition, and therefore do not fully reflect the density, scale variation, occlusion, and distant small targets commonly observed in practical UAV applications. Consequently, models trained on these datasets often struggle to learn robust representations, resulting in missed detections of dense small objects and tiny distant targets and, ultimately, suboptimal recognition performance.

Examples of manually collected images and UAV aerial images.
Low-resolution imagery further exacerbates these challenges, 36 as blurred object boundaries degrade detection accuracy and hinder the extraction of discriminative features for small objects. Super-resolution techniques37–40 have been shown to reconstruct high-frequency details and enhance fine-grained object representation. However, existing approaches often rely on computationally expensive preprocessing or fail to integrate efficiently with detection networks, limiting real-time applicability in UAV scenarios. Similarly, attention mechanisms41–44 can selectively focus on salient spatial and channel-wise features, suppress irrelevant background, and strengthen responses to densely distributed objects, yet conventional attention modules may introduce redundancy or overlook cross-scale dependencies, reducing their effectiveness for small and densely packed objects in complex aerial scenes.
To overcome these problems, we propose YOLO-Super Resolution and Attention (YOLO-SRA), a multi-scale neural architecture enhanced with super-resolution and attention and tailored to UAV aerial image detection.
The main contributions of this paper are summarised as follows: (1) a High-Resolution Feature Enhancement (HRFE) method that integrates super-resolved details with original features via attention, enhancing small-object representation while maintaining computational efficiency; (2) a Grouped Multi-Scale Split Attention (GMSA) mechanism that applies selective attention across grouped feature maps, improving extraction efficiency and the detection of densely distributed small objects; (3) a Weighted Fine-Grained Cross-Scale Fusion (WFCF) network that achieves adaptive fusion with UAV-specific detection head adjustments; and (4) a Spatial-Attentive NMS (SA-NMS) strategy that alleviates suppression among overlapping objects by replacing hard deletion with smooth confidence decay.
The article is organised as follows. Section 2 provides related work on YOLO and its algorithmic variants. Section 3 describes the proposed method. Section 4 presents numerical results through examples. Finally, Section 5 concludes the work and suggests potential future developments.
Advances in YOLO-based methods for unmanned aerial vehicles
YOLO 18 reformulates object detection as a single-stage regression problem, directly predicting bounding boxes and class probabilities on image grids; the resulting balance between accuracy and speed has enabled wide adoption in real-time applications. 46 Since the introduction of the first YOLO model, successive versions have continuously advanced detection performance. To mitigate gradient vanishing, YOLOv9 25 introduces Programmable Gradient Information (PGI). Meanwhile, YOLO-World enhances YOLO with open-vocabulary detection capabilities via vision–language modelling and pre-training on large-scale datasets. YOLOv11 15 emphasises edge deployment by incorporating the C3K2 module for parameter reduction. Regarding feature fusion and optimisation, YOLO-ACR 47 introduces adaptive channel–spatial feature fusion and augments the loss with dynamic scaling and coordinate distribution modelling to improve localisation and detection accuracy on natural-image benchmarks. YOLOv13 16 employs a Hypergraph-based Adaptive Correlation Enhancement (HyperACE) mechanism for global cross-location and cross-scale feature fusion. In contrast to YOLO-ACR, which focuses on fusion and loss optimisation for general natural-image detection, YOLO-SRA targets UAV aerial imagery and introduces HRFE, GMSA, WFCF, and SA-NMS to better handle dense small targets and complex backgrounds.
These innovations not only drive continuous performance improvements on standard benchmark datasets, but also highlight key optimisation directions—such as parameter reduction and enhanced feature fusion—that are particularly relevant for addressing the challenges of UAV aerial image detection. Such advances provide a useful foundation for adapting and deploying YOLO-based models in UAV scenarios.
With the increasing application of UAVs across various fields,48–51 researchers have proposed UAV-oriented adaptations of YOLO to address domain-specific challenges. For instance, GGT-YOLO 1 integrates transformers52,53 to enhance feature extraction. To mitigate small-object information loss, CF-YOLO 2 and Drone-YOLO 3 refine multi-scale feature extraction and fusion through neck network optimisation. YOLO-DKR 54 explores automated design by applying Differentiable Architecture Search (DARTS) 55 with kernel reuse technology, overcoming constraints of manually crafted architectures. Additionally, some researchers optimise performance via loss function improvements: PARE-YOLO 56 uses an Exponential Moving Average (EMA) to balance positive and negative sample weights, while HFC-YOLO11 57 integrates boundary alignment penalties and scale-adaptive weight mechanisms into the Complete Intersection over Union (CIoU) geometric constraint framework. Despite these advances, detecting small, densely distributed UAV objects remains challenging. We adopt YOLOv11 as the baseline for our UAV-specific improvements, owing to its strong generalisation and real-time efficiency.
Applications of super-resolution algorithms in object detection
Super-resolution algorithms aim to reconstruct high-resolution images from low-resolution counterparts. Early methods were primarily based on interpolation 40 and reconstruction, 38 while deep learning 58 has since established neural network-based models as the dominant approach. Low resolution often degrades detection accuracy because blurred object boundaries obscure discriminative details, and super-resolution methods can enhance detection by recovering finer details. Two main strategies exist. The first is data preprocessing, 59 in which super-resolution networks are applied prior to detection; DCASR 37 and YOLO-MST 39 employ this two-step pipeline, enhancing image quality before passing the data into the detection model. The second is tight integration of super-resolution and detection networks: SRNet-YOLO 60 introduces a feature map resolution reconstruction module that jointly learns super-resolved features for detection, reconstructing P5 feature maps to the size of P3 to recover fine-grained details.
Motivated by these findings, we design the High-Resolution Feature Enhancement (HRFE) module for UAV detection. Instead of directly substituting inputs with super-resolved images (an approach that incurs high computational cost), HRFE extracts high-resolution features and fuses them with base image features via an attention mechanism. This design enhances image detail representation while maintaining efficiency, effectively improving detection performance on UAV aerial images.
Attention mechanism
Attention mechanisms selectively emphasise informative features to improve representation learning. 61 Channel attention models inter-channel dependencies to strengthen discriminative feature channels,42,43 whereas spatial attention highlights object-relevant regions and suppresses background interference.41,44 Self-attention (e.g., scaled dot-product attention) captures long-range dependencies by modelling global correlations among features. 52
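As a concrete illustration of the channel-attention idea, the following PyTorch sketch implements a generic squeeze-and-excitation style block; it is an illustrative example rather than a module from this paper, and the class name and reduction ratio are assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Generic squeeze-and-excitation style channel attention (illustrative sketch)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)           # squeeze: global spatial context per channel
        self.fc = nn.Sequential(                      # excitation: learn per-channel weights
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                  # re-weight channels, keep spatial layout

x = torch.randn(2, 64, 80, 80)
print(ChannelAttention(64)(x).shape)                  # torch.Size([2, 64, 80, 80])
```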
Integrating attention mechanisms into YOLO has demonstrated notable improvements in real-time detection tasks. 62 For example, SCCA-YOLO 63 sequentially combines spatial attention with shared semantic and channel self-attention to improve YOLOv8 22 accuracy in autonomous driving applications. YOLO-FaceV2 64 employs a separated enhancement attention module to handle occluded faces, focusing on affected regions within YOLOv5. Inspired by these approaches, we propose Grouped Multi-Scale Split Attention (GMSA), which enhances responses to dense small objects and strengthens feature extraction through selective attention.
NMS algorithm
Non-maximum suppression (NMS) 45 is a standard technique in object detection for retaining high-confidence bounding boxes while suppressing redundant ones. Traditional NMS applies hard deletion based on a fixed intersection-over-union (IoU) threshold, which can lead to missed detections in dense small-object scenarios due to mutual suppression. Variants such as DIoU-NMS 65 and CIoU-NMS 66 refine overlap calculation, and Adaptive-NMS 67 dynamically adjusts the IoU threshold according to candidate box density. Soft-NMS 68 replaces hard deletion with smooth confidence decay, and Softer-NMS 69 further incorporates regression uncertainty, yet neither explicitly accounts for spatial relationships between objects. To overcome these limitations, we propose Spatial-Attentive NMS (SA-NMS), which calculates spatial attention weights based on the distance between candidate box centers, quantifying spatial correlations. Combined with smooth confidence attenuation, SA-NMS fully abandons hard deletion and more accurately mitigates mutual suppression among densely packed objects.
Proposed method
The overall architecture of the proposed YOLO-SRA is illustrated in Figure 2. The network follows the standard YOLOv11 design, consisting of three main components: the backbone, neck, and head.

Overall architecture of the YOLO-SRA model. The blue dashed boxes highlight the newly introduced detection heads and enhanced feature fusion pathways in comparison to the YOLOv11 baseline.
Let the input image be denoted by $I \in \mathbb{R}^{H \times W \times 3}$, where $H$ and $W$ are the spatial height and width.
To enhance tiny object detection, the High-Resolution Feature Enhancement (HRFE) module enriches the input image representation by injecting high-resolution details. Additionally, the Grouped Multi-Scale Split Attention (GMSA) mechanism is integrated into the backbone to improve feature extraction for densely distributed small objects.
Although YOLOv11 15 performs well on natural-image benchmarks, its default design is less effective for UAV aerial object detection, primarily due to the pronounced domain shift between natural images and UAV aerial imagery and the prevalence of small, densely distributed targets in aerial scenes. Specifically, YOLOv11 is largely optimised for natural-image characteristics, where objects are typically larger, less crowded, and exhibit clearer contours; in contrast, UAV aerial images often contain numerous tiny instances embedded in complex backgrounds, which requires stronger fine-grained feature preservation and more effective cross-scale fusion. To address this, we propose the Weighted Fine-Grained Cross-Scale Fusion (WFCF) network, as illustrated in Figure 2.
Existing multi-scale fusion designs can be suboptimal in UAV aerial imagery for two reasons: (i) detection heads designed for large objects may be under-utilised in UAV scenes dominated by small targets, and (ii) uniform fusion schemes may not sufficiently emphasise the fine-grained cues needed for densely distributed small objects. To address these issues, we introduce three design choices in WFCF: adaptive detection-head configuration, learnable fusion weights, and PixelShuffle-based upsampling. Together, these choices prioritise small and medium objects and improve feature alignment in dense UAV scenarios. To realise these design choices, WFCF simplifies the backbone by pruning redundant layers while preserving core feature extraction capacity, reducing model size and computation. To improve sensitivity to tiny objects, we add a P2 detection head and remove the original P5 detection head (primarily intended for large objects), thereby reallocating capacity towards small and medium scales. In the neck, conventional upsampling is replaced with PixelShuffle, which rearranges channel information into spatial resolution to better preserve fine details and reduce feature blurring. For the learnable weighted fusion mechanism, let $\{F_i\}$ denote the feature maps to be fused at a given scale and $\{w_i\}$ the corresponding learnable weights, which adaptively balance the contribution of each input during fusion.
Cross-layer connections ensure accurate alignment between feature maps of different resolutions, improving the discriminability of dense small objects. These modifications allow the network to focus computational resources effectively on UAV-specific challenges, significantly improving detection accuracy in aerial scenarios.
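For illustration, the following PyTorch sketch shows one plausible realisation of the learnable weighted fusion and PixelShuffle-based upsampling described above. The BiFPN-style weight normalisation, the module names (WeightedFusion, PixelShuffleUp), the small constant, and the channel counts are assumptions for the example rather than the exact WFCF implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Fuse two same-resolution feature maps with learnable, normalised weights (sketch)."""
    def __init__(self, channels: int, eps: float = 1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(2))    # one learnable weight per input branch
        self.eps = eps
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, shallow: torch.Tensor, deep_up: torch.Tensor) -> torch.Tensor:
        w = F.relu(self.w)                      # keep weights non-negative
        w = w / (w.sum() + self.eps)            # normalise so the weights sum to ~1
        return self.conv(w[0] * shallow + w[1] * deep_up)

class PixelShuffleUp(nn.Module):
    """Upsample by rearranging channel information into spatial resolution (sketch)."""
    def __init__(self, in_ch: int, out_ch: int, scale: int = 2):
        super().__init__()
        self.expand = nn.Conv2d(in_ch, out_ch * scale * scale, 1)
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.shuffle(self.expand(x))

# Example: fuse an upsampled deeper P3 feature with a shallower P2 feature.
p3 = torch.randn(1, 128, 40, 40)
p2 = torch.randn(1, 64, 80, 80)
p3_up = PixelShuffleUp(128, 64)(p3)             # -> (1, 64, 80, 80)
print(WeightedFusion(64)(p2, p3_up).shape)      # torch.Size([1, 64, 80, 80])
```

In this sketch, the normalised weights play the role of the learnable fusion weights, while PixelShuffleUp stands in for interpolation-based upsampling.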
Traditional super-resolution-based detection methods typically upsample the entire input image to improve detection accuracy. Although such enlargement can enhance small-object visibility, it substantially increases computational cost and often compromises real-time performance, limiting practical deployment. The proposed HRFE module addresses this trade-off through a lightweight dual-path design that fuses high-resolution feature information with base features while preserving the original input resolution. Sigmoid-based attention maps are used to adaptively weight detail-enhanced and original features according to target scale and scene complexity, and residual enhancement injects the fused information back into the input representation. This design enables effective detail enhancement without the computational burden associated with full-image super-resolution pipelines. The structure of HRFE is illustrated in Figure 3.

Structure of the HRFE module. The super-resolution (SR) path extracts high-resolution features and upsamples them via PixelShuffle before fusion with the original feature path.
Let the input image be denoted by $I$. A super-resolution path extracts high-resolution detail features from $I$ and upsamples them via PixelShuffle, while a parallel path produces base features at the original resolution. The two feature maps are concatenated along the channel dimension, and a Sigmoid-activated attention map is then applied to adaptively weight the detail-enhanced and original features. Finally, a residual connection adds the fused features back to the original input.
This process improves the recognition of small and blurred objects in UAV images without significantly increasing computational cost, balancing enhancement and efficiency.
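To make the HRFE pipeline concrete, the following PyTorch sketch reproduces the dual-path structure described above (super-resolution path with PixelShuffle, concatenation, Sigmoid attention, residual enhancement). The specific layer choices, channel widths, and the strided convolution used to return to the input resolution are illustrative assumptions, not the exact module.

```python
import torch
import torch.nn as nn

class HRFE(nn.Module):
    """High-Resolution Feature Enhancement: a minimal sketch of the dual-path idea."""
    def __init__(self, in_ch: int = 3, hidden: int = 16, scale: int = 2):
        super().__init__()
        # Super-resolution path: extract detail features and upsample via PixelShuffle.
        self.sr_path = nn.Sequential(
            nn.Conv2d(in_ch, hidden * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),                        # hidden channels at 2x resolution
            nn.Conv2d(hidden, hidden, 3, padding=1),
            nn.ReLU(inplace=True),
        )
        # Strided convolution brings the detail features back to the input resolution.
        self.down = nn.Conv2d(hidden, in_ch, 3, stride=scale, padding=1)
        self.base_path = nn.Conv2d(in_ch, in_ch, 3, padding=1)
        # Sigmoid attention map computed from the concatenated features.
        self.attn = nn.Sequential(nn.Conv2d(in_ch * 2, in_ch, 1), nn.Sigmoid())
        self.fuse = nn.Conv2d(in_ch * 2, in_ch, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        detail = self.down(self.sr_path(x))                # high-resolution detail features
        base = self.base_path(x)                           # original-resolution base features
        cat = torch.cat([detail, base], dim=1)             # concatenate along channels
        a = self.attn(cat)                                 # adaptive weighting map in (0, 1)
        fused = self.fuse(torch.cat([a * detail, (1 - a) * base], dim=1))
        return x + fused                                   # residual enhancement of the input

x = torch.randn(1, 3, 640, 640)
print(HRFE()(x).shape)                                     # torch.Size([1, 3, 640, 640])
```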
Although channel and spatial attention mechanisms are well established and have been incorporated into YOLO-style detectors, many existing attention designs are not tailored to UAV aerial small-object detection, where targets are tiny, densely distributed, and embedded in complex backgrounds. In particular, attention is often applied at a single feature level or without explicitly exploiting multi-level cues, which can limit sensitivity to dense small objects.
To address these challenges, we propose the Grouped Multi-Scale Split Attention (GMSA) module with UAV-oriented design choices. GMSA groups input channels and processes them in parallel to control computational overhead while maintaining feature extraction capacity. Each group is further processed by dual parallel branches, and multi-scale feature aggregation is employed to enhance sensitivity to dense small targets. Moreover, channel attention is applied to key feature partitions rather than the full feature map, helping suppress background interference while strengthening responses on small-object regions. The structure is illustrated in Figure 4.

The structure of the GMSA module. GMSA employs a grouped parallel strategy and an attention mechanism to enhance the model’s feature extraction capability.
Given an input feature map, GMSA first splits its channels into several groups and processes the groups in parallel. Each group is passed through dual parallel branches with different receptive fields, and the branch outputs are aggregated to capture multi-scale cues. Finally, all group-level features are concatenated along the channel dimension and split into two partitions. To suppress background noise and enhance small-object features, channel-wise attention is applied to the second partition. Subsequently, the unweighted first partition and the attention-weighted second partition are recombined to form the output feature map.
This grouped multi-scale split attention mechanism enhances small object representation by combining local detail extraction with global contextual suppression, improving detection performance in dense UAV aerial images while maintaining computational efficiency.
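A minimal PyTorch sketch of the GMSA idea is given below; the number of groups, the kernel sizes of the dual branches, and the half-and-half partitioning are illustrative assumptions rather than the exact configuration.

```python
import torch
import torch.nn as nn

class GMSA(nn.Module):
    """Grouped Multi-Scale Split Attention: a minimal sketch of the idea."""
    def __init__(self, channels: int, groups: int = 4):
        super().__init__()
        assert channels % (2 * groups) == 0
        self.groups = groups
        g = channels // groups
        # Dual parallel branches per group with different receptive fields (multi-scale).
        self.branch3 = nn.ModuleList([nn.Conv2d(g, g, 3, padding=1, groups=g) for _ in range(groups)])
        self.branch5 = nn.ModuleList([nn.Conv2d(g, g, 5, padding=2, groups=g) for _ in range(groups)])
        # Channel attention applied only to the second partition of the aggregated features.
        half = channels // 2
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(half, half, 1),
            nn.Sigmoid(),
        )
        self.out = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        chunks = torch.chunk(x, self.groups, dim=1)          # split channels into groups
        feats = [b3(c) + b5(c) for b3, b5, c in zip(self.branch3, self.branch5, chunks)]
        y = torch.cat(feats, dim=1)                          # concatenate group-level features
        y1, y2 = torch.chunk(y, 2, dim=1)                    # split into two partitions
        y2 = y2 * self.attn(y2)                              # re-weight the second partition
        return self.out(torch.cat([y1, y2], dim=1))          # recombine with the unweighted partition

x = torch.randn(1, 64, 80, 80)
print(GMSA(64)(x).shape)                                     # torch.Size([1, 64, 80, 80])
```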
Traditional NMS 45 employs a hard suppression mechanism with a fixed threshold. Let the set of candidate boxes be $\mathcal{B} = \{b_1, \dots, b_N\}$ with confidence scores $\{s_1, \dots, s_N\}$; any box whose intersection over union (IoU) with the currently selected highest-scoring box exceeds the threshold has its score set to zero and is discarded.
Due to the globally uniform threshold, standard NMS tends to over-suppress in dense object scenarios, erroneously removing valid predictions.
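Expressed with the notation above (a conventional rendering of the rule rather than an equation transcribed from this paper), with $M$ the currently selected highest-scoring box and $N_t$ the fixed IoU threshold, hard suppression sets

$$ s_i \leftarrow \begin{cases} s_i, & \mathrm{IoU}(M, b_i) < N_t, \\ 0, & \mathrm{IoU}(M, b_i) \ge N_t, \end{cases} $$

whereas soft variants such as Gaussian Soft-NMS replace the hard zeroing with a smooth decay of the form $s_i \leftarrow s_i\, e^{-\mathrm{IoU}(M, b_i)^2/\sigma}$.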
To address the limitations of existing NMS strategies in dense UAV scenarios, we further propose Spatial-Attentive NMS (SA-NMS). Unlike conventional NMS and Soft-NMS variants that rely primarily on IoU-based suppression, SA-NMS incorporates a spatial attention weight computed from the distance between candidate box centres, normalised so that it quantifies the spatial correlation between boxes: nearby boxes receive large weights and distant boxes receive small ones. This weight is combined with a Gaussian-based smooth confidence decay, so that candidate scores are attenuated rather than deleted outright.
This design allows suppression to adapt dynamically to spatial distance, providing strong suppression for nearby boxes and weak suppression for distant ones, thereby balancing precision and recall in dense and edge object scenarios.
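The following PyTorch sketch illustrates the SA-NMS idea under stated assumptions: the precise form of the distance-normalised attention weight, its combination with the Gaussian decay, and the hyper-parameters are assumptions for the example, not the exact definition used in the paper.

```python
import torch

def pairwise_iou(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """IoU between boxes in a (N, 4) and boxes in b (M, 4), boxes as (x1, y1, x2, y2)."""
    lt = torch.max(a[:, None, :2], b[None, :, :2])
    rb = torch.min(a[:, None, 2:], b[None, :, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter + 1e-6)

def sa_nms(boxes: torch.Tensor, scores: torch.Tensor,
           sigma: float = 0.5, score_thr: float = 0.001) -> torch.Tensor:
    """Spatial-attentive soft suppression: a minimal sketch of the SA-NMS idea."""
    boxes, scores = boxes.clone().float(), scores.clone().float()
    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    diag = ((boxes[:, 2] - boxes[:, 0]) ** 2 + (boxes[:, 3] - boxes[:, 1]) ** 2).sqrt()
    keep = []
    idx = torch.arange(boxes.size(0))
    while idx.numel() > 0:
        m = idx[scores[idx].argmax()]                   # current highest-scoring box
        keep.append(m.item())
        idx = idx[idx != m]
        if idx.numel() == 0:
            break
        # Centre distance, normalised by the selected box's diagonal.
        d = ((cx[idx] - cx[m]) ** 2 + (cy[idx] - cy[m]) ** 2).sqrt() / (diag[m] + 1e-6)
        iou = pairwise_iou(boxes[m].unsqueeze(0), boxes[idx]).squeeze(0)
        w = torch.exp(-(d ** 2))                        # spatial attention: larger for nearby boxes
        scores[idx] = scores[idx] * torch.exp(-(w * iou) ** 2 / sigma)  # smooth confidence decay
        idx = idx[scores[idx] > score_thr]              # no hard IoU deletion, only score pruning
    return torch.tensor(keep, dtype=torch.long)

boxes = torch.tensor([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=torch.float)
scores = torch.tensor([0.9, 0.8, 0.7])
print(sa_nms(boxes, scores))
```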
To evaluate the effectiveness of the proposed YOLO-SRA for UAV aerial image detection, systematic experiments were conducted on the VisDrone dataset, 33 assessing multiple aspects including detection precision and recall. Results demonstrate that YOLO-SRA significantly improves detection accuracy compared with mainstream methods, highlighting the adaptability of the proposed modules to UAV-specific challenges.
Dataset
The VisDrone2019 dataset, 33 developed by the AISKYEYE team at Tianjin University, consists of 288 video clips (261,908 frames) and 10,209 static images captured by multiple UAV platforms. It covers diverse scenes, object densities, weather conditions, and lighting environments. The dataset includes 10 object categories, with 6,471 images for training, 548 for validation, and 3,190 for testing, of which 1,610 test images are annotated.
Evaluation metrics
We evaluate detection performance using Precision (P), Recall (R), mAP50, and mAP50:95.
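For reference, these metrics follow the standard definitions,

$$ P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad \mathrm{AP} = \int_0^1 P(R)\,\mathrm{d}R, $$

where mAP50 averages AP over all categories at an IoU threshold of 0.5, and mAP50:95 further averages over IoU thresholds from 0.5 to 0.95 in steps of 0.05.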
Experimental setup
The model is implemented in PyTorch and trained on an NVIDIA GeForce RTX 4090 GPU. Input images are resized to
Experimental results
Our model is built on the YOLOv11 architecture. Therefore, we conducted a systematic comparison against different variants of YOLOv11 on the dataset, reporting per-category detection accuracy, with detailed results provided in Tables 1 and 2.
Comparison of per-category detection accuracy with YOLOv11 on the VisDrone validation set.
Comparison with YOLOv11 on the VisDrone validation set.
The experimental results indicate that the YOLO-SRA series consistently improves performance while reducing model parameters. Across the n, s, and m sizes, YOLO-SRA outperforms YOLOv11 with fewer parameters; for example, YOLO-SRA-s has only 36.2% of the parameters of YOLOv11-s. Although the computational load increases slightly, precision, recall, mAP50, and mAP50:95 all show consistent gains. Overall, YOLO-SRA achieves an approximate 10% improvement in mAP50, with YOLO-SRA-s demonstrating an 11.3% increase in mAP50:95.
Comparisons with a diverse set of object detectors in Table 3 further demonstrate the competitiveness of YOLO-SRA. A key strength of YOLO-SRA is its compact parameter footprint: across comparable model scales, it uses fewer parameters than most competing methods. While YOLO-SRA incurs a modest increase in GFLOPs, this reflects a deliberate trade-off to improve feature representation for small and densely distributed targets. We report GFLOPs as a proxy for computational cost (and, by extension, potential energy demand on UAV platforms), noting that actual energy consumption depends on hardware and deployment conditions. Overall, YOLO-SRA achieves strong performance on the main evaluation metrics (e.g., mAP50 and mAP50:95).
Comparison with other object detectors on the VisDrone validation set.
In particular, YOLO-SRA attains competitive accuracy relative to larger baselines such as Drone-YOLO (large) and YOLO-DKR, despite using substantially fewer parameters. Moreover, several recent YOLO-style variants (e.g., PARE-YOLO and CF-YOLO) do not achieve higher accuracy despite having larger parameter counts, and the RT-DETR series shows lower detection performance under the considered setting while requiring higher computational cost. These results indicate that YOLO-SRA offers a favourable balance between model compactness and detection performance, supporting its suitability for UAV-based object detection.
To provide a detailed assessment of the model’s performance in detecting small objects, we present per-category precision-recall curves for YOLO-SRA-s on the VisDrone validation set in Figure 5. The results are computed using an IoU threshold of 0.5, and the corresponding mAP50 values are determined by calculating the area under each curve, offering a clear visualisation of detection accuracy across different object classes.

The precision-recall curve of YOLO-SRA-s on the VisDrone validation set.
Validation of module effectiveness
To evaluate the impact of each proposed component, ablation studies were conducted on the VisDrone validation set using YOLOv11-s as the baseline model. The results, summarized in Table 4, illustrate the progressive performance improvements achieved by integrating individual modules and their combined configurations.
Ablation experiment results on the VisDrone validation set.
Each module exhibits distinct functional advantages. The WFCF module delivers a substantial performance gain while markedly reducing model parameters. SA-NMS slightly affects recall but significantly improves precision. HRFE and GMSA enhance precision and optimise feature extraction. The combination of all four modules achieves the best overall performance, maintaining low parameter count, balancing precision and recall, and attaining peak detection accuracy.
Regarding computational efficiency, WFCF and its combinations demonstrate clear lightweight advantages, preserving high performance despite the parameter reduction. SA-NMS introduces negligible computational overhead while providing notable gains. The overall computational cost increase from HRFE and GMSA is modest relative to the performance improvements, confirming the effectiveness and synergy of the proposed modules.
The WFCF module achieves substantial parameter reduction primarily by simplifying the backbone network. Specifically, decreasing the backbone depth reduces the channel dimensions of its output features, which propagates to the neck and head due to inter-layer feature dependencies, leading to a chain reduction of parameters. Figure 6 illustrates this effect, showing that backbone simplification directly decreases parameter counts in subsequent network components, confirming the efficiency of WFCF in reducing model complexity.
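As an illustrative calculation (the channel widths below are hypothetical and do not correspond to specific WFCF layers), the parameter count of a standard convolutional layer scales approximately as

$$ \#\text{params} \approx C_{\text{in}} \times C_{\text{out}} \times k^{2}, $$

so halving both the input and output channel widths reduces a layer to roughly a quarter of its parameters; for example, a $3 \times 3$ layer mapping $512 \to 512$ channels requires about 2.36M parameters, whereas $256 \to 256$ channels require about 0.59M. Because neck and head layers consume the backbone's outputs, this reduction cascades through the subsequent components, consistent with the chain effect shown in Figure 6.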

Comparison of network parameters between WFCF and YOLOv11. In the backbone, the parameter-free layers correspond to components removed by WFCF. In the neck, the parameter-free layers are operations such as Concat and Upsample, which inherently contain no trainable parameters.
To investigate the influence of detection head design on model performance and efficiency, three strategies were evaluated: removing P5, adding P2, and combining both. The results, compared with the original YOLOv11-s, are presented in Table 5.
Performance comparison results of different detection head structures on the VisDrone validation set.
Removing the P5 detection head reduced both parameters and computational load with minimal impact on performance. Adding the P2 detection head slightly increased parameters and computational cost but substantially improved detection accuracy. The combined strategy reduced parameters relative to adding P2 alone, achieved lower computational load, and maintained comparable performance, with higher precision. These results validate the design choices for the detection head in WFCF, balancing efficiency and performance.
To further evaluate cross-dataset robustness and practical real-time feasibility, we conduct comparative experiments between YOLO-SRA-s and the baseline YOLOv11-s on five representative datasets covering diverse UAV and small-object scenarios. The datasets include VisDrone, 33 DOTA, 70 UAVVaste, 71 UAVUOD-10, 72 and CARPK. 73
As shown in Table 6, YOLO-SRA-s achieves consistent performance gains or maintains comparable performance relative to YOLOv11-s across all evaluated datasets, indicating good robustness under varying scene characteristics and data distributions. In terms of efficiency, YOLO-SRA-s maintains real-time inference capability, achieving FPS values in the range of 80–87 across datasets. Although this is moderately lower than YOLOv11-s, the reduction reflects an accuracy–efficiency trade-off introduced by enhanced fine-grained and cross-scale feature processing. Moreover, YOLO-SRA provides a lightweight configuration with as few as 3.4M parameters (in the smallest variant) and multiple model scales, supporting deployment under different resource constraints on UAV platforms.
Cross-dataset performance of YOLO-SRA.
To quantify the performance differences between HRFE and standard super-resolution (SR) preprocessing strategies, we conduct comparative experiments in this study. As shown in Table 7, increasing effective image resolution can improve UAV small-object detection performance; however, conventional SR preprocessing often involves a non-trivial accuracy–efficiency trade-off. For example, interpolation-based upsampling (e.g., Bicubic) introduces no additional learnable parameters, but increases inference cost due to the enlarged input resolution, while providing limited accuracy gains. In contrast, learning-based SR methods such as Real-ESRGAN 74 can yield larger accuracy improvements, but at the expense of substantially increased model complexity and computational overhead. By comparison, HRFE enhances the detector’s input representation by injecting high-resolution feature details while preserving the original input resolution. This design reduces the overhead associated with full-image upsampling and supports flexible adaptation to different input sizes, making it well suited to UAV small-object detection scenarios.
Comparison of super-resolution preprocessing methods on the VisDrone validation set.
To evaluate the effectiveness of the proposed SA-NMS, we conducted comparative experiments on the VisDrone validation set. As reported in Table 8, replacing standard NMS with SA-NMS improves the main detection metrics for the baseline detector under the same evaluation protocol. Moreover, SA-NMS achieves favourable overall performance compared with representative NMS variants, indicating that incorporating spatial-awareness into the suppression strategy is beneficial for dense small-target UAV scenes. These results support the effectiveness of the proposed SA-NMS design.
Comparison of NMS methods on the VisDrone validation set.
To assess experimental reproducibility and performance robustness, we conducted five independent runs of the proposed SRA-s model under the same experimental setting, varying only the random seed. Each run includes a complete training and evaluation cycle, and the aggregated results are reported in Table 9.
Performance of the SRA-s model with different random seeds on the VisDrone validation set.
Using the results in Table 9, we computed the mean and standard deviation for the four primary metrics across the five runs, and constructed 95% confidence intervals using Student’s t-distribution. All statistics are reported to four significant figures, and the detailed results are summarised in Table 10.
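For completeness, the 95% confidence intervals described above correspond to the standard Student's t construction for $n = 5$ runs,

$$ \bar{x} \pm t_{0.975,\,n-1}\,\frac{s}{\sqrt{n}}, \qquad t_{0.975,\,4} \approx 2.776, $$

where $\bar{x}$ and $s$ denote the sample mean and standard deviation of a metric across the five runs.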
Statistical results of evaluation metrics.
To further strengthen statistical rigour, we performed significance testing at a significance level of
To provide a balanced qualitative comparison, we selected challenging scenes from the VisDrone dataset. Figure 7 shows a representative example:

Visualisation results on the VisDrone dataset: (a) original image, (b) detections produced by YOLOv11-s, and (c) detections produced by YOLO-SRA-s. Object categories are indicated by colour-coded bounding boxes.
This paper presents YOLO-SRA, a novel neural architecture for small-object detection. To fully leverage fine-grained information in high-resolution images, the approach introduces High-Resolution Feature Enhancement (HRFE), which extracts detailed features via a super-resolution path and fuses them with the original image, enhancing edge textures and small-object representation while avoiding excessive computational cost. To efficiently extract both local and global features, a Grouped Multi-Scale Split Attention (GMSA) mechanism is proposed, combining grouped parallel processing with selective attention to enhance the model’s ability to capture both local details and global context. Multi-scale feature integration is further refined through Weighted Fine-Grained Cross-Scale Fusion (WFCF), which uses learnable attention weights and adaptive upsampling to improve representation of small, densely distributed objects, with the added P2 detection head further boosting sensitivity to tiny objects. Finally, Spatial-Attentive Non-Maximum Suppression (SA-NMS) optimises detection results in dense scenes through a spatially adaptive suppression mechanism combined with a Gaussian-based confidence decay.
YOLOv11 and other general-purpose object detectors are primarily developed for natural-scene imagery and may be less effective in UAV aerial settings characterised by dense small targets and cluttered backgrounds. In contrast, YOLO-SRA is designed to better match these characteristics by strengthening fine-grained feature representation and cross-scale aggregation. Experiments on the VisDrone dataset indicate that YOLO-SRA improves detection performance on key metrics while maintaining a relatively compact parameter footprint. These results support its suitability for small and densely distributed objects in aerial scenarios. Nevertheless, achieving higher accuracy can incur additional computational cost, which remains a practical constraint for resource-limited UAV platforms. In future work, we will further explore strategies to improve the accuracy–efficiency balance to enhance deployability in real-world UAV applications.
Acknowledgments
This work was partially supported by the National Natural Science Foundation of China (62376127, 61876089, 61876185, 61902281, 61403206), the Natural Science Foundation of Jiangsu Province (BK20141005), the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (14KJB520025), and the Jiangsu Distinguished Professor Programme.
Funding
The author(s) received no financial support for the research, authorship and/or publication of this article.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
