
Abstract

This paper proposes improvements to YOLOv8n that enhance its small-target detection capability. First, coordinate attention (CA) is introduced into the C2f module to sharpen the focus on spatial information and local details, which strengthens spatial feature representation and small-object recognition. Second, the Path Aggregation Network in the neck is replaced with a Bidirectional Feature Pyramid Network (BiFPN), which fuses features at different scales more effectively. Third, a smaller detection head is added to improve the perception of very small targets; this additional head draws on more shallow feature information. Finally, Omni-dimensional Dynamic Convolution (ODConv) is incorporated to adaptively adjust convolution kernels, allowing flexible capture of critical information for various target patterns and sizes. Experimental results show that the proposed improvements lead to better performance on small object detection tasks, with increases in mean average precision (mAP), precision, and recall. The combined enhancements aim to address common challenges in small target detection, such as low contrast, large scale differences, and the need for fine-grained feature capture. Compared to the original YOLOv8n algorithm, the proposed algorithm improves detection on small targets by 3.2 percentage points for mAP@50, 4.4 for mAP@75, 3 for precision, and 4 for recall.

Keywords

YOLOv8n, object detection, small target, attention mechanism, coordinate attention

Introduction

Recent improvements in YOLOv8 focus primarily on attention mechanisms, multi-scale feature fusion networks, and regression loss. To identify fruits, the YOLOv8 model¹ employs cross-stage partial (CSP) and C2f (CSP2) modules to facilitate lightweight processing. The incorporation of these modules is aimed at improving efficiency and performance.

Small object detection has important application value in computer vision and artificial intelligence, especially in areas such as security monitoring, autonomous driving, and disease treatment. Small targets usually refer to targets that occupy small areas in an image, such as distant pedestrians, cancerous areas, and lesion areas. Because small targets carry limited feature information and are frequently disturbed by background noise, traditional object detection algorithms handle them poorly. The common issues can be summarized as follows.

Small targets occupy a small area in the image and contain less feature information, making it difficult for traditional convolutional neural networks (CNNs) to capture sufficient useful information during feature extraction. Deep neural networks typically rely on higher-level features, where small targets may have already been eliminated. After multiple downsampling and pooling operations, the information of small targets may be lost, even on high-resolution inputs.

Small targets are prone to interference from complex backgrounds or occlusion by other objects, which can affect detection accuracy. The scale variation of small targets across scenarios increases the difficulty of detection: the distance between the object and the camera causes significant changes in target size, and the detection model must adapt to this multi-scale problem.

There is limited annotated data for small targets, which affects the training effectiveness of the model. Small targets make up a relatively small proportion of the dataset, making the model susceptible to class imbalance during training. Collecting and annotating small-target data is also more difficult, especially in specific fields such as military and remote sensing imagery.

Repeated downsampling in deep learning (DL) models reduces the spatial resolution of small targets, further weakening the network's ability to detect them. Weighting features with spatial and channel attention mechanisms has improved the accuracy of small object detection.² The feature pyramid network (FPN) improves small-target detection by building a feature pyramid that combines high-resolution features from low-level feature maps with the semantic information of high-level feature maps.³ An innovative plug-and-play feature enhancement module incorporates multi-scale local contextual information to bolster detection performance for small objects.⁴ Ref.⁵ developed an efficient detection framework that improves the detection of small targets at different resolutions through fine parameter adjustments and network design. Building on the Feature Aggregation and Propagation Network, a multi-scale feature aggregation module that integrates the features of adjacent layers and exploits cross-level correlations was designed to improve the accuracy of small target recognition.⁶

To address the misdetection and missed detection of small targets, the Bi-PAN-FPN approach⁷ enhances the neck of the YOLOv8-s model, while the backbone of the benchmark model incorporates the GhostblockV2 structure in place of part of the C2f module. To enhance the network's ability to extract diverse features, MHSA-YOLOv8⁸ uses the Multi-Head Self-Attention mechanism. In Ref.⁹, the traditional convolution in the intermediate neck layer is replaced by the Ghost Shuffle Convolution mechanism, which reduces the number of model parameters and improves the convergence speed. In Ref.¹⁰, the authors modify the structure of the FPN-Path Aggregation Network (PANet) in YOLOv8 to achieve multi-level feature fusion among different layers by strengthening the multipath fusion of the networks. To detect brain tumors, Ref.¹¹ added a Convolutional Block Attention Module (CBAM) to YOLOv7 to enhance feature extraction and enable the model to better focus on significant regions related to brain tumors. Twins¹² and CAT¹³ alternately perform local and global attention at different layers. Swin transformers¹⁴ execute local attention within a window and introduce a shifted-window partition method for cross-window connections.

For more research on YOLO, see Refs.¹⁵,¹⁶

In the healthcare sector, researchers have employed a diverse array of machine learning algorithms to enhance cancer diagnosis. These methods include traditional algorithms such as support vector machines, k-nearest neighbors, decision trees, and Naive Bayes, as well as DL techniques, including CNNs, VGGNets,¹⁷ GoogLeNet,¹⁸ and ResNets.¹⁹ However, existing classification methods often exhibit limitations in precision and recall, leading to inefficiencies and prolonged classification times that may delay the initiation of treatment.²⁰ These limitations mainly include overfitting, data imbalance, insufficient feature extraction, and computational constraints. These techniques have also been applied to the diagnosis of neurological disorders and the analysis of brain tumor images.²¹,²² To address these problems, we introduce the following improved YOLOv8 scheme.

The main contributions of this article are summarized as follows. Effective solutions have been proposed to address the following shortcomings of YOLOv8n.

YOLOv8n has low detection accuracy for small targets, often leading to missed detections. This is due to the features of small targets being diluted or lost in deep networks, making it difficult for the detector to capture these tiny defects.

YOLOv8n has limitations in feature fusion, especially when dealing with complex backgrounds and small targets. Traditional FPNs have limited effectiveness in multi-scale feature fusion and fail to fully utilize shallow features to improve the detection accuracy of small targets.

Although YOLOv8n optimizes computational complexity while maintaining high detection accuracy, further optimization is still needed when applied to high-density tumor detection to meet real-time detection requirements.

To address the aforementioned issues, an improved tumor detection solution has been proposed, which includes the following innovative designs.

The Efficient Coordinate Attention Network (ECANet) is utilized to improve the C2f module (from CSPDarknet53 to 2Stage FPN), strengthening the channel attention mechanism and effectively enhancing the detection accuracy of small targets.

The Bidirectional Feature Pyramid Network (BiFPN) is adopted to replace the Neck component of the original network. This enhances the utilization of shallow features and thereby improves overall detection performance.

A dedicated small target detection head with coordinate attention (CA) is added to the network to enhance its ability to perceive small targets, reducing the incidence of missed detections for small targets.

Omni-dimensional dynamic convolution (ODConv) is applied. Through multi-dimensional dynamism, ODConv can adaptively extract delicate and complex features, effectively responding to various input changes.

Compared to the original YOLOv8n algorithm, the proposed algorithm improves detection on small targets by 3.2 percentage points for mean average precision (mAP)@50, 4.4 for mAP@75, 3 for precision, and 4 for recall.

YOLOv8

As shown in Figure 1, YOLOv8n consists of Backbone (backbone network), Neck (neck network) and Head (head network).

Figure 1.

YOLOv8 network structure.

The backbone network uses the C2f module, which is an improved CSPBottleneck (cross-stage partial bottleneck) structure. The core idea of the CSPBottleneck structure is to split the input feature maps of the traditional convolutional layers or residual blocks, process the parts through different paths, and fuse the two paths of features at a later stage.²³ Building on CSPBottleneck, the C2f module extracts features through cross-stage connections, effectively reducing the number of model parameters and the computational complexity. This design not only improves the running speed of the model but also maintains efficient feature extraction capability.

In the Neck network of YOLOv8n, a PANet structure is adopted. PANet is an improvement based on the FPN, which adopts a top-down feature pyramid structure, while PANet introduces a bottom-up information transmission mechanism, allowing low-level signals to be transmitted to high-level features, thereby enhancing the expressive power of the entire feature hierarchy. In addition, PANet also introduces the Adaptive Feature Pooling Module, which can aggregate feature maps from different levels, allowing useful information from each layer to be directly transmitted to subsequent sub-networks.

In the Head section of YOLOv8n, a Decoupled Head structure is adopted, which separates the classification head from the detection head, making the model more flexible in processing objects of different sizes. In addition, YOLOv8n adopts an anchor-free method, abandoning the traditional anchor-based design and directly predicting the position and other attributes of the target from image pixels. This simplified detection process significantly improves inference speed and further enhances the overall performance of the model.

The ability of the aforementioned model to detect small targets still needs improvement. To address this issue, we focus on the following aspects. With the introduction of CA, the network can pay more attention to local details in the image, enhancing its ability to recognize small targets, especially in complex backgrounds. BiFPN does more than simply fuse features of different scales; it improves the complementarity between shallow and deep features by gradually optimizing the feature fusion process. This helps to solve common challenges in multi-scale object detection, such as the low contrast of small targets and large scale differences, thereby improving the detection accuracy of small targets. In traditional convolution, the convolution kernel is fixed and may not effectively capture the detailed information of small targets. By dynamically adjusting the convolution kernel, ODConv can perform adaptive convolution based on the characteristics of small targets, enabling better capture of their features and improving the accuracy of small target detection. Based on these aspects, this paper proposes an improved YOLOv8n, shown in Figure 2, to enhance the detection capability of YOLOv8n for small targets.

Figure 2.

Improved YOLOv8 network structure.

Coordinate attention

The C2f module is a part of CSPDarknet53 (Cross-Stage Partial Darknet53),²⁴,²⁵ mainly used for feature extraction and transmission. It improves the expressive power and efficiency of the model by dividing the feature map into two parts for parallel processing and then merging them. This structure can effectively reduce computational overhead while maintaining the representational power of the model. CA is a commonly used attention mechanism in computer vision tasks, aimed at enhancing the network's ability to focus on spatial information. It enhances feature expression by introducing coordinate information, enabling the network to capture important spatial locations and contextual information more effectively. Introducing CA into the C2f module (CSPDarknet53 to Stage 2 FPN) can significantly improve the performance of the network.

The core idea of CA is to enhance the network's attention at specific locations by encoding the spatial positions of feature maps. Specifically, CA is achieved through the following steps. First, the input feature map is processed in two branches, one for horizontal information and the other for vertical information. Second, average pooling is performed in each branch to obtain two one-dimensional coordinate vectors, representing the attention distribution in the horizontal and vertical directions, respectively. Third, the one-dimensional coordinate vectors are mapped to the channel dimension through two fully connected layers to generate the corresponding attention weights. These weights reflect the relative importance of the feature map at different spatial positions. Finally, the original feature map is multiplied by the generated attention weights channel by channel to achieve feature recalibration, allowing the network to focus more on important feature regions.
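The steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the feature map is a nested list indexed as [channel][row][column], and the two fully connected layers are reduced to two hypothetical channel-mixing matrices `w_h` and `w_w` followed by a sigmoid. Shared transforms, normalization, and channel reduction that a practical CA block may contain are omitted.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def coordinate_attention(x, w_h, w_w):
    """Recalibrate x ([C][H][W] nested lists) with direction-aware gates.

    w_h and w_w are hypothetical C x C channel-mixing matrices standing in
    for the two fully connected layers described in the text.
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])

    # Directional pooling: one value per row (averaged over the width)
    # and one value per column (averaged over the height).
    pool_h = [[sum(x[c][i]) / W for i in range(H)] for c in range(C)]
    pool_w = [[sum(x[c][i][j] for i in range(H)) / H for j in range(W)]
              for c in range(C)]

    # Map the pooled vectors to per-channel attention weights in (0, 1).
    att_h = [[sigmoid(sum(w_h[o][c] * pool_h[c][i] for c in range(C)))
              for i in range(H)] for o in range(C)]
    att_w = [[sigmoid(sum(w_w[o][c] * pool_w[c][j] for c in range(C)))
              for j in range(W)] for o in range(C)]

    # Recalibration: each element is scaled by its row gate and column gate.
    return [[[x[c][i][j] * att_h[c][i] * att_w[c][j] for j in range(W)]
             for i in range(H)] for c in range(C)]


# With all-zero weights every gate equals sigmoid(0) = 0.5,
# so each element is scaled by 0.5 * 0.5 = 0.25.
example = coordinate_attention([[[4.0, 8.0], [0.0, 4.0]]], [[0.0]], [[0.0]])
```

Because the row and column gates are computed separately, an element is emphasized only when both its row and its column are judged important, which is how the mechanism keeps positional information along both axes.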

First, CA (coordinate attention)²⁸ is used to improve the C2f (CSPDarknet53 to 2-Stage FPN)²⁶ module. CA aggregates global information along spatial dimensions, allowing each pixel to consider global contextual information when generating attention weights. This aggregation of global information significantly improves the richness and discriminability of the feature representation. Because the CA mechanism can accurately locate and enhance important spatial positions and channels, it helps to capture more refined and meaningful features, which is particularly important for complex image tasks. By focusing on important spatial positions and channels, the CA mechanism can effectively suppress the interference of noise and irrelevant features and improve the sensitivity of the model to key information in the input image.

Many attention mechanisms, such as self-attention, typically consider the relationships between features on a global scale but may overlook subtle differences in local spatial information. Channel attention mainly focuses on the interrelationships between channels, and its attention to spatial information is not as direct as that of CA. The complexity of self-attention is O((H × W)² × C), whereas that of CA is O(H × W × C), where H and W are the spatial dimensions of the feature map and C is the number of channels. CA has lower computational complexity than traditional self-attention because it only requires two one-dimensional global average pooling operations and does not compute the interrelationships between all positions on the feature map. Comparing the structures of SE, CBAM, and CA, we observe that SE uses only channel features, CBAM uses channel features and location features, and CA uses channel features, location features, and direction-aware features. Although other attention mechanisms also attempt to capture contextual information, CA enhances the spatial relationships of features through explicit coordinate directions, further improving the performance of the model in complex scenes. CA can effectively adjust its focus along each direction when dealing with targets of different sizes, shapes, and poses, making the model more adaptable. Some attention mechanisms may perform well on specific types of targets but be limited in complex scenes. The design of CA makes it more universal and able to maintain good performance in various scenarios.
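To make the complexity comparison concrete, the following sketch evaluates both expressions for a 160 × 160 feature map, the size of the added detection head. The channel count of 64 is an assumption chosen only for illustration, and the functions count abstract operations according to the two formulas rather than measured FLOPs.

```python
def self_attention_cost(h, w, c):
    # O((H x W)^2 x C): every position interacts with every other position.
    return (h * w) ** 2 * c


def coordinate_attention_cost(h, w, c):
    # O(H x W x C): two one-dimensional poolings over the feature map.
    return h * w * c


h, w, c = 160, 160, 64  # c = 64 is a hypothetical channel count
ratio = self_attention_cost(h, w, c) // coordinate_attention_cost(h, w, c)
print(ratio)  # the ratio equals h * w
```

The ratio is H × W regardless of the channel count, so the gap widens on exactly the high-resolution feature maps that matter for small targets.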

To address the missed detection of small targets, this paper adds a smaller detection head to enhance the network's perception of small targets. To enhance the utilization of shallow features, the original network in the Neck part is replaced with the BiFPN²⁷ network. To limit the significant increase in network layers caused by these two changes, GSConv³² (Cross-Stage Channel wise Convolution) technology is introduced in the Neck section.

CABiFPN structure

In order to further enhance the utilization of shallow features, this paper replaces the original PANet structure with BiFPN structure in the Neck section, as shown in Figure 3.

Figure 3.

CABiFPN.

BiFPN introduces bidirectional connections, allowing the network to effectively fuse features across different scales through both top-down and bottom-up pathways. This bidirectional nature ensures that high-level features and low-level features can complement each other, enhancing the richness of feature representation. BiFPN also adds additional connections between the same layers, further improving feature fusion capability. In contrast, PANet primarily relies on top-down paths, which makes it difficult to fully leverage low-level feature information.

BiFPN allows the network to dynamically adjust the weights of input features based on their importance during the feature fusion process. This adaptive weighting strategy enables the network to ﬂexibly select and integrate features, leading to better performance across different scenarios and target sizes. By dynamically adjusting weights, BiFPN effectively avoids the information loss that can result from ﬁxed-weight fusion and reduces the risk of overﬁtting.
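The weighted fusion described above can be illustrated with a short sketch. It assumes the fast normalized fusion rule commonly associated with BiFPN, in which each input feature map receives one learnable non-negative scalar weight and the weights are normalized by their sum; the text does not state which fusion rule is used, so this choice, the function name, and the `eps` value are assumptions.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse equally sized 2D feature maps with learnable scalar weights.

    features: list of [H][W] nested lists (already resized to a common shape).
    weights:  one scalar per input; negative values are clipped to zero.
    """
    clipped = [max(0.0, wi) for wi in weights]  # ReLU keeps weights >= 0
    total = sum(clipped) + eps                  # eps avoids division by zero
    H, W = len(features[0]), len(features[0][0])
    return [[sum(wi * f[i][j] for wi, f in zip(clipped, features)) / total
             for j in range(W)] for i in range(H)]


shallow = [[1.0, 1.0]]
deep = [[3.0, 3.0]]
# Equal weights give a plain average of the two inputs.
fused = fast_normalized_fusion([shallow, deep], [1.0, 1.0], eps=0.0)
```

During training the weights would be updated by backpropagation, so an input that contributes little to the loss can be down-weighted without being removed from the graph.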

One of the design principles of BiFPN is to eliminate nodes with only one input edge, thereby simplifying the network structure and reducing unnecessary computational overhead. This simpliﬁcation not only improves the computational efficiency of the model but also accelerates inference speed. BiFPN excels at processing multi-scale features, allowing YOLOv8 to more effectively utilize low-level features in small object detection tasks, thereby improving detection accuracy for small objects. This is particularly important in object detection tasks, as small objects are often overlooked or misclassiﬁed.

Combining the CA mechanism with BiFPN can further enhance object detection in the following ways. In the feature fusion process of BiFPN, the CA mechanism is applied to adjust the weights of features, so that specific regions of the features receive higher attention. Before the features enter each BiFPN module, CA processing is performed to enhance their expressive power, thereby yielding more effective feature representations during fusion. Multi-level feature fusion combined with the CA mechanism can better capture the spatial information of the target and improve the complementarity of features between different levels.

Detection head

The original model is designed with three detection heads. In small target detection tasks, however, there are often very small targets to be detected, especially in the brain tumor dataset, which contains many targets smaller than 3*3 pixels. The shallow feature maps of the backbone network and the Neck contain richer information about these small targets, but this information is not fully utilized. To make full use of it, we derive a new detection head on the YOLOv8 model from the features after the second downsampling. To highlight the positional features, the CA mechanism is integrated here. YOLOv8 downsamples a total of five times, and each downsampling halves the width and height of the feature map. In this paper, the initial image size is set to 640*640 pixels. The feature map of the new detection head is therefore 160*160 pixels; because it has undergone only two downsamplings, it retains richer low-level feature information about the target.
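The feature map sizes follow directly from the halving rule. The sketch below lists the side length after each of the five downsamplings for a 640-pixel input; the function name and its defaults are illustrative.

```python
def feature_map_sizes(input_size=640, num_downsamples=5):
    """Side length of the feature map after each stride-2 downsampling."""
    return [input_size // (2 ** k) for k in range(1, num_downsamples + 1)]


sizes = feature_map_sizes()
print(sizes)  # [320, 160, 80, 40, 20]

# The added small-target head reads the map after the second downsampling.
small_head_size = sizes[1]
```

On a 160 × 160 map, each cell covers a 4 × 4 pixel patch of the input, so a target of about 3*3 pixels still falls within a single cell instead of being averaged into a much larger patch as on the coarser maps.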

Omni-dimensional dynamic convolution

ODConv²⁹ is a convolution method aimed at improving the adaptability of neural networks in feature extraction and representation. It dynamically adjusts the weights of convolution kernels along the spatial, input channel, output channel, and kernel dimensions based on the features of the input data, enabling the network to better adapt to different inputs. The spatial dimension adjusts the convolution kernel based on the spatial variation of the input feature map; the channel dimension adaptively adjusts weights on each channel; and the convolutional window dimension adjusts the weights of convolution windows of different sizes. To this end, ODConv introduces a multi-dimensional attention mechanism that simultaneously learns four types of attention, targeting the position, channel, filter, and kernel dimensions of the convolutional kernel. These four types of attention are learned in parallel, and applying them sequentially to the different dimensions of the convolution kernel enhances the feature extraction ability of the convolution operation. Position attention makes the convolution operation differ across spatial positions. Channel attention makes the convolution operation differ across input channels. Filter attention adjusts the behavior of each filter (or convolution kernel) based on the input features. Kernel attention adjusts the weights of convolutional kernels at a finer granularity.

Position attention focuses on spatial positions within feature maps. It generates a weight for each location that determines the contribution of that location to the ﬁnal feature map. In a concrete implementation, a positional code can be used to weight the input feature map so that the features at each spatial location are different during convolutional computation. By assigning different weights to each spatial location, positional attention allows convolutional operations to capture spatial information more ﬂexibly, enhancing the model's sensitivity to detail. It can be represented mathematically as follows:

A_{pos}(x, y) = \sigma(W_{pos} \cdot F(x, y) + b_{pos}) \qquad (6.1)

where A_{pos}(x, y) is the position attention score for the spatial position (x, y), F(x, y) is the feature map at position (x, y), W_{pos} is a learnable weight matrix for position attention, b_{pos} is a learnable bias for position attention, and \sigma is an activation function (e.g. sigmoid or softmax).

The channel attention mechanism focuses on the different channels of the input feature map. It generates a weight for each channel that reflects the importance of that channel in feature extraction. The global features for each channel are usually obtained by performing global average pooling or global maximum pooling on the feature map, and the channel weights are then generated by a small network (e.g. a fully connected layer). Channel attention adaptively adjusts the contribution of each channel, thereby reinforcing the impact of important feature channels and suppressing unimportant ones. It can be defined as follows:

A_{channel}(c) = \sigma(W_{channel} \cdot G(c) + b_{channel}) \qquad (6.2)

where A_{channel}(c) is the channel attention score for channel c, G(c) is a pooled representation of the feature maps across the spatial dimensions for channel c, and W_{channel} and b_{channel} are learnable parameters.
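Equation (6.2) can be traced with a small numerical sketch. It assumes that G(c) is global average pooling, that \sigma is the sigmoid, and that W_{channel} is a C × C matrix acting on the vector of pooled values; the function name and the example values are illustrative only.

```python
import math


def channel_attention(x, w, b):
    """Channel attention scores following the form of equation (6.2).

    x: feature map as [C][H][W] nested lists.
    w: hypothetical C x C weight matrix, b: bias vector of length C.
    """
    C = len(x)
    # G(c): global average pooling over the spatial dimensions of channel c.
    g = [sum(sum(row) for row in x[c]) / (len(x[c]) * len(x[c][0]))
         for c in range(C)]
    # sigma(W . G + b), evaluated per output channel.
    return [1.0 / (1.0 + math.exp(-(sum(w[c][k] * g[k] for k in range(C)) + b[c])))
            for c in range(C)]


x = [[[1.0, 3.0], [1.0, 3.0]],   # channel 0: mean activation 2.0
     [[0.0, 0.0], [0.0, 0.0]]]   # channel 1: mean activation 0.0
identity = [[1.0, 0.0], [0.0, 1.0]]
scores = channel_attention(x, identity, [0.0, 0.0])
```

With identity weights and zero bias, the more active channel receives the larger score, while the all-zero channel receives sigmoid(0) = 0.5. Equations (6.1), (6.3), and (6.4) share the same form and differ only in what is fed to the affine map.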

Filter attention focuses on the output of each convolutional filter, adjusting the response of each filter to the input feature map. The global features of the output feature map of each filter are computed, and these features are used to generate the weights of the filters. Filter attention allows the model to adaptively adjust the influence of each convolutional filter to improve sensitivity to specific features. It can be represented as

A_{filter}(k) = \sigma(W_{filter} \cdot H(k) + b_{filter}) \qquad (6.3)

where A_{filter}(k) is the attention score for the k-th filter, H(k) represents the feature responses of the k-th filter, and W_{filter} and b_{filter} are learnable parameters.

The kernel attention focuses on the weights and properties of the convolutional kernels themselves. By weighting the weights of each convolutional kernel, this mechanism is able to dynamically adjust the performance of the convolutional kernel in the feature extraction process. Kernel attention can be achieved by combining the weights of a convolutional kernel with speciﬁc input features to generate a new convolutional kernel that can be used to enhance or inhibit the extraction of speciﬁc features. Kernel attention allows the model to dynamically adjust the behavior of the convolution kernel according to changes in the input features, thus improving the ﬂexibility and adaptability of feature extraction.

A_{kernel}(k, x, y) = \sigma(W_{kernel} \cdot K(k, x, y) + b_{kernel}) \qquad (6.4)

where A_{kernel}(k, x, y) is the attention score for the k-th kernel at position (x, y), K(k, x, y) is the kernel weights or features, and W_{kernel} and b_{kernel} are learnable parameters.

Through multi-dimensional dynamism, ODConv can adaptively extract delicate and complex features, effectively responding to various input changes. Due to weight adjustments in multiple dimensions, ODConv can capture and fuse more feature information, thereby improving the overall representation ability of the network. The adaptability of multi-dimensional dynamic convolution kernels can reduce overﬁtting and improve the generalization performance of the network on different test data, making it suitable for various tasks and scenarios. This flexibility helps improve the performance of the model on various datasets, especially in scenarios where feature distributions vary greatly. Due to its ability to simultaneously optimize the weights of convolution kernels in multiple dimensions, ODConv is typically more effective than traditional convolution methods in capturing complex feature relationships, thereby improving the overall performance and accuracy of the model. The combination of ODConv and attention mechanism enables the model to dynamically adjust the weights of convolution kernels during feature extraction, and better focus on the information required for the current task during feature selection, thereby enhancing the interpretability of the model.
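The way the four attentions act on the convolution kernels can be sketched as follows. The text does not spell out the combination rule, so this sketch assumes a multiply-and-sum form: each of n candidate kernels is scaled along its position, input channel, and filter dimensions, weighted by its kernel attention, and the results are summed into one effective kernel. All names and the tensor layout are assumptions made for illustration.

```python
def modulate_kernels(kernels, a_kernel, a_filter, a_channel, a_pos):
    """Combine n candidate kernels into one input-dependent kernel.

    kernels:   [n][C_out][C_in][k][k] nested lists of weights.
    a_kernel:  [n]            one scalar per candidate kernel.
    a_filter:  [n][C_out]     one scalar per output filter.
    a_channel: [n][C_in]      one scalar per input channel.
    a_pos:     [n][k][k]      one scalar per spatial position of the kernel.
    """
    n = len(kernels)
    c_out, c_in = len(kernels[0]), len(kernels[0][0])
    k = len(kernels[0][0][0])
    return [[[[sum(a_kernel[i] * a_filter[i][o] * a_channel[i][c]
                   * a_pos[i][y][x] * kernels[i][o][c][y][x]
                   for i in range(n))
               for x in range(k)]
              for y in range(k)]
             for c in range(c_in)]
            for o in range(c_out)]


# Two 1x1 candidate kernels with one input and one output channel.
candidates = [[[[[2.0]]]], [[[[4.0]]]]]
ones_vec = [[1.0], [1.0]]
ones_pos = [[[1.0]], [[1.0]]]
effective = modulate_kernels(candidates, [0.25, 0.75], ones_vec, ones_vec, ones_pos)
```

In a full layer the attention values would be produced from the input feature map, for example by pooling followed by small fully connected branches, and the resulting kernel would then be used in an ordinary convolution.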

Motivated by the VoVGSCSPC³² architecture, we propose the VoVODCSPC architecture. As shown in Figure 4(a), ODConv introduces a multi-dimensional attention mechanism that simultaneously learns four types of attention, targeting the position, channel, filter, and kernel dimensions of the convolutional kernel. A concatenation operation then combines the resulting features, followed by a shuffle mechanism that mixes the multi-dimensional channel features from both convolutions. This ensures comprehensive exchange and integration of information between channels. The design promotes efficient interaction of information between network channels while reducing the overall computation time and resource consumption, thus improving efficiency while maintaining model performance.

Figure 4.

VDC.

As depicted in Figure 4(b), the ODBottleneck architecture is an innovative design that integrates the core components of two ODConv units along with a DWConv module. In this architecture, raw input feature data is fed in parallel to both ODConv modules and a separate DWConv module for independent processing. Each module uniquely extracts and transforms the input features. Subsequently, the outputs of these three modules are overlaid and combined to form the ﬁnal feature representation.

As shown in Figure 4(c), the VoVODCSPC architecture adopts a more efficient one-time aggregation strategy based on the ODBottleneck design. This means that VoVODCSPC optimizes and integrates the operation of multiple ODBottleneck structures, reﬁning key features through a single computation process. This aims to signiﬁcantly reduce the overall computational load and parameter size of the network while maintaining the model's prediction accuracy.

By using these designs, ODConv modules and their derivatives ensure a high level of efficiency and accuracy in CNNs, providing enhanced performance for various neural network applications.

Experimental results

The configuration and network environment are presented in Table 1. The dataset used in this article comes from a publicly available Kaggle dataset (source: https://www.kaggle.com/sartajbhuvaji/brain-tumor-classifica), which includes four categories: pituitary adenoma, meningioma, glioma, and no-tumor. After augmenting the original images, we used a total of 7500 images, in which tumors appear in different positions, postures, and angles. Of these, 4500 images were randomly selected as the training set, 2000 as the validation set, and 1000 as the test set. Each of the four categories accounts for 25% of the dataset.
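A random split with these sizes can be sketched as follows. The fixed seed, the function name, and the use of integer placeholders in place of image paths are assumptions for illustration; the text does not describe how the split was implemented.

```python
import random


def split_dataset(items, n_train=4500, n_val=2000, n_test=1000, seed=0):
    """Shuffle items reproducibly and cut them into train/val/test subsets."""
    if n_train + n_val + n_test != len(items):
        raise ValueError("split sizes must add up to the dataset size")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local RNG, global state untouched
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Placeholder identifiers standing in for the 7500 augmented images.
train, val, test = split_dataset(range(7500))
```

A plain shuffle like this does not by itself preserve the 25% share of each category within every subset; a stratified split would be needed to guarantee that.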

Table 1.

Configuration and network environment.

Hardware configuration
CPU	Intel(R) Core(TM) i7-7700HQ CPU @ 2.8 GHz
GPU	NVIDIA GeForce RTX 3060
Memory	24 GB
Software configuration
System	Windows 10
Interpreter	Python 3.8
Framework	PyTorch 1.13.1
Parameter configuration
Image size	640*640
Learning rate	0.01
Batch	4
Epoch	150
Optimizer	SGD

To verify model performance, the precision rate (P), recall rate (R), average precision (AP), and mean average precision (mAP) are selected to evaluate the detection performance of the model. If a positive instance is correctly predicted as positive, it is counted as a True Positive (TP). If a negative instance is correctly predicted as negative, it is counted as a True Negative (TN). If a negative instance is incorrectly predicted as positive, it is counted as a False Positive (FP) (Type I error). If a positive instance is incorrectly predicted as negative, it is counted as a False Negative (FN) (Type II error). For more details on these metrics, see Ref.³¹
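The relationship between these counts and the reported metrics can be written out directly, using the standard definitions P = TP / (TP + FP) and R = TP / (TP + FN), with mAP taken as the mean of the per-class AP values. The counts and AP values in the example are made up for illustration and are not taken from the experiments.

```python
def precision(tp, fp):
    """Share of predicted positives that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of actual positives that are found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def mean_average_precision(ap_per_class):
    """mAP: arithmetic mean of the per-class average precision values."""
    return sum(ap_per_class) / len(ap_per_class)


# Hypothetical counts: 90 correct detections, 10 false alarms, 30 misses.
p = precision(tp=90, fp=10)
r = recall(tp=90, fn=30)
m = mean_average_precision([0.9, 1.0, 0.8, 0.9])  # four made-up class APs
```

A missed small target increases FN and lowers recall, whereas a spurious box increases FP and lowers precision, which is why both are reported alongside mAP.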

Based on Table 2, the experimental results show that all of the added components contribute to the performance improvement of the model. Each component improves precision, recall, and mAP, and their combined effect exceeds that of any individual component. In particular, enabling ODConv has a significant impact on precision and recall, which is related to its role in optimizing convolutional layer processing, helping the model better identify targets and reduce false positives. C2f ECA and CA Head also improve precision and recall, as they enhance the model's contextual understanding in object detection. Table 3 compares the proposed Improved YOLOv8 with other methods and demonstrates its effectiveness.

Table 2.

Comparison of ablation experiments.

Number	C2f ECA	CA Head	CABiFPN	ODConv	mAP@50	P	R
1	×	×	×	×	0.952	0.961	0.921
2	✓	×	×	×	0.965	0.964	0.935
3	×	✓	×	×	0.967	0.968	0.926
4	×	×	✓	×	0.970	0.971	0.938
5	×	×	×	✓	0.976	0.981	0.944
6	✓	✓	×	×	0.972	0.981	0.941
7	✓	✓	✓	×	0.974	0.982	0.953
8	✓	✓	✓	✓	0.984	0.991	0.961

ODConv: omni-dimensional dynamic convolution; mAP: mean average precision.
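As a quick arithmetic check, the gain of the full model (row 8) over the baseline (row 1) can be computed directly from Table 2. The sketch below expresses the differences in percentage points; the helper name is our own.

```python
# (mAP@50, P, R) copied from rows 1 and 8 of Table 2.
baseline = {"mAP@50": 0.952, "P": 0.961, "R": 0.921}
full = {"mAP@50": 0.984, "P": 0.991, "R": 0.961}


def gains_in_points(before, after):
    """Absolute improvement per metric, in percentage points."""
    return {k: round((after[k] - before[k]) * 100, 1) for k in before}


gains = gains_in_points(baseline, full)
print(gains)  # {'mAP@50': 3.2, 'P': 3.0, 'R': 4.0}
```

These differences agree with the improvements quoted in the abstract for mAP@50, precision, and recall. They are absolute differences in percentage points, not relative changes.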

Table 3.

Comparison with several typical networks.

Model	mAP@50	mAP@75	P	R
EfficientNetV2	97.9%	93.2%	95.3%	94.8%
DenseNet³³	94.9%	94.7%	95.1%	94.6%
VGG19³⁴	95.6%	95.4%	95.7%	95.2%
Hybrid ensemble³⁵	95.8%	95.4%	95.9%	95.3%
YOLOv5s	94.1%	91.1%	94.2%	92.4%
YOLOv8n	95.2%	92.4%	96.1%	92.1%
MobileViT	94.7%	91.3%	95.3%	94.5%
MobileNet	96.8%	91.6%	95.2%	95.3%
ResNet-18¹⁹	97.1%	91.5%	94.6%	95.2%
Efficient¹⁹	93.7%	90.8%	94.2%	94.1%
Improved YOLOv8	98.4%	96.8%	99.1%	96.1%

mAP: mean average precision.

Figures 5 and 6 provide visual results that demonstrate the effectiveness of our proposed method. In Figure 5, the precision rate of most images is less than 95%, while in Figure 6, the precision rate of most images is greater than 98%.

Figure 5.

YOLOv8 image.

Figure 6.

Improved YOLOv8.

To further demonstrate the effectiveness of the method, we also tested it on a second dataset: 879 mammography region of interest (ROI) images from the DDSM dataset were used for tumor and non-tumor recognition experiments. Of these, 703 ROIs were used as training samples to train the CNN models, and the remaining 176 ROIs were used as test samples. The data partitioning is shown in Table 4.

Table 4.

Dataset partitioning and augmentation.

Identify targets	Training set	Enhanced training set	Test set
Tumor	320	8600	180
No Tumor	383	9800	220
Total	703	18,400	400

The AP value of each model reflects its ability to distinguish between categories on the test set. A higher AP value indicates a better balance between precision and recall. Based on the results shown in Figure 7, the model proposed in this study exhibits the best performance, maintaining high precision across most recall levels. This suggests that the model performs well in identifying positive samples. Among the compared models, the ResNet series performs well overall, particularly ResNet34 and ResNet101, whereas the ConvNeXt series performs slightly worse.

Figure 7.

Precision-recall (PR) curve.
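AP summarizes a precision-recall curve as the area under it, and mAP is the mean of the per-class AP values. The text does not state which interpolation scheme was used, so the sketch below assumes the common all-point interpolation, in which precision is first made non-increasing in recall. The sample points are hypothetical.

```python
def average_precision(recalls, precisions):
    """Area under the PR curve using all-point interpolation.

    `recalls` must be sorted in ascending order, with `precisions`
    giving the precision measured at each recall value.
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangles between consecutive recall values.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_per_class):
    """Mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


# Hypothetical curve: precision 1.0 up to recall 0.5, then 0.5 up to recall 1.0.
ap = average_precision([0.5, 1.0], [1.0, 0.5])
```

In mAP@50 and mAP@75, the number denotes the IoU threshold (0.50 or 0.75) above which a predicted box counts as a true positive before the curve is built.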

To compare the methods more comprehensively, we also evaluated multiple methods on the second dataset. As shown in Table 5, our method performs better than the other models.

Table 5.

Comparison with several typical networks.

Model	P	R	F1
AlexNet³⁶	95.6%	94.9%	94.9%
AANet³⁷	94.5%	94.6%	94.1%
VGG-16³⁸	95.7%	95.7%	96.2%
VGG-19³⁸	95.4%	95.2%	94.9%
ResNet-34³⁰	93.2%	93.5%	93.1%
ResNet-50³⁰	92.4%	92.3%	91.9%
ConvNeXt-Small³⁹	90.3%	90.3%	90.4%
Ours	98.6%	97.9%	98.2%
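The F1 column in Table 5 combines precision and recall through their harmonic mean, F1 = 2PR / (P + R). A minimal sketch of this formula, applied to the values reported for our method, is given below; the function name is our own.

```python
def f1_score(p, r):
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Precision and recall reported for "Ours" in Table 5, as fractions.
f1_ours = round(f1_score(0.986, 0.979), 3)
```

Because it is a harmonic mean, F1 always lies between P and R and is pulled toward the smaller of the two.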

Conclusion

In conclusion, this paper presents effective solutions to address the shortcomings of YOLOv8n in small target detection, feature fusion, and computational efficiency. The proposed improvements are presented as follows.

CA enhances the representation of spatial information in feature maps, enabling better integration of spatial positions and contextual information and thus improving feature representation. By introducing global contextual information for each pixel and focusing on important spatial locations in the image, CA improves the ability to recognize small objects. By strengthening attention to specific spatial positions, CA also improves object localization accuracy, particularly when the target position is complex or the target scale is small, allowing accurate identification and localization of object boundaries.

BiFPN enables efficient feature fusion, allowing features of different scales to combine more effectively. This is especially important for small object detection, as it helps the network leverage shallow, fine-grained features. CABiFPN optimizes multi-scale feature fusion and improves localization accuracy, especially in complex multi-object detection tasks.

ODConv dynamically adjusts the convolution kernel weights based on the input features, allowing the model to flexibly capture critical information when dealing with targets of varying patterns and structures. This has a significant impact on precise localization and classification, particularly in complex backgrounds or small object detection tasks. Because the kernels adapt to the characteristics of small objects, ODConv captures small object features better and thus improves the precision of small object detection.
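The weighted multi-scale fusion attributed to BiFPN above can be illustrated with a minimal sketch. It assumes the fast normalized fusion rule of the original BiFPN design, in which each input is scaled by a non-negative learnable weight divided by the sum of all weights plus a small constant. Whether CABiFPN keeps exactly this rule is an assumption, and the feature values and weights below are hypothetical.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-length feature vectors with normalized non-negative weights.

    Each output element is sum_i(w_i * f_i) / (eps + sum_j(w_j)),
    where negative weights are clipped to zero first (ReLU).
    """
    if len(features) != len(weights):
        raise ValueError("one weight is required per feature map")
    clipped = [max(0.0, w) for w in weights]
    total = eps + sum(clipped)
    norm = [w / total for w in clipped]
    length = len(features[0])
    return [sum(n * f[i] for n, f in zip(norm, features)) for i in range(length)]


# Hypothetical shallow and deep features, already resized to the same shape.
shallow = [1.0, 2.0]
deep = [3.0, 4.0]
fused = fast_normalized_fusion([shallow, deep], weights=[1.0, 1.0], eps=0.0)
```

With learned weights, the network can give the shallow, fine-grained input a larger share when small targets dominate, instead of summing all scales equally.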

By incorporating a dedicated small target detection head with CA, the model's ability to perceive and accurately detect small targets is significantly improved, reducing the incidence of missed detections.

The ECANet is utilized to improve the C2f module (from CSPDarknet53 to 2Stage FPN), strengthening the channel attention mechanism and effectively enhancing the detection accuracy of small targets.
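The channel attention that ECANet contributes can be sketched in a few lines. The sketch assumes the usual ECA formulation: a per-channel descriptor obtained by global average pooling is passed through a 1D convolution across channels and a sigmoid, and the kernel size is chosen adaptively from the channel count with the defaults gamma = 2 and b = 1. The kernel weights here are hypothetical, since in practice they are learned, and how the module is wired into C2f is not shown.

```python
import math


def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive 1D kernel size: nearest odd value of (log2(C) + b) / gamma."""
    k = int(abs((math.log2(channels) + b) / gamma))
    return k if k % 2 else k + 1


def eca_gates(channel_means, kernel):
    """Zero-padded 1D convolution over the channel descriptor, then sigmoid."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(channel_means) + [0.0] * pad
    gates = []
    for c in range(len(channel_means)):
        s = sum(w * padded[c + j] for j, w in enumerate(kernel))
        gates.append(1.0 / (1.0 + math.exp(-s)))
    return gates


# Hypothetical descriptor for 4 channels and a hypothetical size-3 kernel.
gates = eca_gates([0.2, -0.1, 0.4, 0.0], kernel=[0.5, 1.0, 0.5])
```

Each gate lies in (0, 1) and multiplies the corresponding channel of the feature map, so informative channels are emphasized at the cost of only a handful of parameters.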

Through these comprehensive improvements, the enhanced YOLOv8n model achieves better detection accuracy for small targets, more effective feature fusion, and higher computational efficiency, making it more suitable for real-time detection applications, especially in high-density tumor detection tasks.

Limitations and future research directions

If the spatial resolution is not sufficiently high, the ability to detect small targets accurately is weakened. Future research can focus on enhancing the spatial resolution or on more advanced multi-scale attention mechanisms to further improve small target detection.

In complex scenes, where both small targets and intricate backgrounds exist, the fusion of multi-scale features is not fully effective, leading to suboptimal detection accuracy. Developing more advanced feature fusion techniques that better integrate shallow features with deeper layers may help address this issue.

The current system may not be fast enough for real-time applications, particularly when detection must be performed on dense or rapidly changing data. Future research should focus on balancing the trade-off between accuracy and computational efficiency, potentially exploring lightweight models or optimization algorithms tailored to real-time detection.

While ECANet enhances the channel attention mechanism, there may still be limitations in handling highly complex and noisy scenes. Further refining the attention mechanisms, particularly for small targets in noisy or cluttered environments, could improve model performance.

While ODConv helps with feature extraction in complex scenarios, the model may not yet fully handle extreme variations in target size, shape, and pose. Future research can explore more adaptable or context-aware convolution techniques that further enhance feature extraction for small target detection in diverse environments.

Footnotes

Acknowledgements

The authors would like to acknowledge the support from Jiangsu Second Normal University for providing the necessary resources and funding to conduct this research.

Ethical committee approval

Not applicable.

Author contributions/CRediT

LB was involved in conceptualization, methodology, software, validation, formal analysis, investigation, writing – original draft, writing – review and editing, and visualization. ZS contributed to supervision, project administration, funding acquisition, and resources.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was funded by Jiangsu Second Normal University, Grant Number 927801/013 and Jiangsu Province Engineering Research Center of Basic Education Big Data.

Competing interest

The author declares that he has no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

ORCID iD

Lufeng Bai

References

1. Xiao, Nguyen, Yan. Fruit ripeness identification using YOLOv8 model. Multimed Tools Appl 2024; 83: 28039–28056.

2. Bochkovskiy, Wang, Liao HYM. YOLOv4: Optimal speed and accuracy of object detection. 2020, arXiv:2004.10934.

3. Lin, et al. Feature pyramid networks for object detection. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, 2017.

4. Jiang. ESL-YOLO: Small object detection with effective feature enhancement and spatial-context-guided fusion network for remote sensing. Remote Sens (Basel) 2024; 16: 1–21.

5. Liang, et al. Small object detection in unmanned aerial vehicle images using feature fusion and scaling-based single shot detector with spatial context analysis. IEEE Trans Circuits Syst Video Technol 2019; 30: 1758–1770.

6. Zhou, Gong, et al. Feature aggregation and propagation network for camouflaged object detection. IEEE Trans Image Process 2022; 31: 7036–7047.

7. et al. A modified YOLOv8 detection network for UAV aerial image recognition. Drones 2023; 7: 04.

8. Zheng, et al. Tomato maturity detection and counting model based on MHSA-YOLOv8. Sensors 2023; 23: 6701.

9. Saydirasulovich, Mukhiddinov, Djuraev, et al. An improved wildfire smoke detection based on YOLOv8 and UAV images. Sensors 2023; 23: 8374.

10. Kang, et al. BGF-YOLO: Enhanced YOLOv8 with multiscale attentional feature fusion for brain tumor detection. International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer Nature Switzerland, 2024.

11. Abdusalomov, Mukhiddinov, Whangbo. Brain tumor detection based on deep learning approaches and magnetic resonance imaging. Cancers (Basel) 2023; 15: 4172.

12. Chu, Tian, Wang, et al. Twins: revisiting the design of spatial attention in vision transformers. In Proceedings of the Advances in Neural Information Processing Systems 34 (NeurIPS 2021), Virtual, 2021; vol. 6, (14), 9355–9366.

13. Lin, Cheng, et al. CAT: cross attention in vision transformer. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, July 2022; IEEE: New York, NY, USA, 2022; vol. 18, 1–6.

14. Liu, Lin, Cao, et al. Swin transformer: hierarchical vision transformer using shifted windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, October 2021; vol. 1, 10012–10022.

15. Montalbo FJP. A computer-aided diagnosis of brain tumors using a fine-tuned YOLO-based model with transfer learning. KSII Trans Internet Inform Syst (TIIS) 2020; 14: 4816–4834.

16. Dipu, Shohan, Salam KMA. Deep learning based brain tumor detection and classification. 2021 International Conference on Intelligent Technologies (CONIT). IEEE, 2021.

17. Simonyan, Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.

18. Szegedy, Liu, Jia, et al. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Boston, MA, USA, June 2015; vol. 7, (12), 1–9.

19. Zhang, Ren, et al. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, June 2016; vol. 27, (30), 770–778.

20. Muhammad, Khan, Ser, et al. Deep learning for multi-grade brain tumor classification in smart healthcare systems: a prospective survey. IEEE Trans Neural Netw Learn Syst 2021; 32: 507–522.

21. Noreen, Palaniappan, Qayyum, et al. A deep learning model based on concatenation approach for the diagnosis of brain tumor. IEEE Access 2020; 8: 55135–55144.

22. Abd-Ellah, Awad, Khalaf AAM, et al. Two-phase multimodel automatic brain tumour diagnosis system from magnetic resonance images using convolutional neural networks. EURASIP J Image Video Process 2018; 2018: 97. doi:https://doi.org/10.1186/s13640-018-0332-4

23. Ebert, Stricker, Wasenmüller. Transformer-based detection of microorganisms on high-resolution Petri dish images. Proc IEEE Int Conf Comput Vis 2023; 3961–3970.

24. Liu, et al. SSD: Single shot multibox detector. Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14. Springer International Publishing, 2016.

25. Kumari, et al. Fog computing for healthcare 4.0 environment: opportunities and challenges. Comput Electr Eng 2018; 72: 1–13.

26. Jocher, Chaurasia, Qiu. Ultralytics YOLO (Version 8.0.0) [Computer software]. 2023, https://github.com/ultralytics/ultralytics

27. Syazwany, Nam, Lee. MM-BiFPN: multi-modality fusion network with Bi-FPN for MRI brain tumor segmentation. IEEE Access 2021; 9: 160708–160720.

28. Hou, Zhou, Feng. Coordinate attention for efficient mobile network design. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2021, pp.13713–13722.

29. Zhou, Yao. Omni-dimensional dynamic convolution. arXiv preprint arXiv:2209.07947, 2022.

30. Zhang, et al. Deep residual learning for image recognition. Proc IEEE Conf on Comp Vis Pattern Recogn 2016: 770–778.

31. Bai. Improved printed circuit board defect detection scheme. Sci Rep 2025; 15: 2389.

32. Wei, et al. Slim-neck by GSConv: A better design paradigm of detector architectures for autonomous vehicles. arXiv preprint arXiv:2206.02424, 2022.

33. Yahyaoui, Ghazouani, Farah. Deep learning guided by an ontology for medical images classification using a multimodal fusion. In: Proceedings of the International Congress of Advanced Technology and Engineering (ICOTEN 2021), Virtual Conference, July 2021; vol. 4, (5), 1–6.

34. Khan, Harouni, et al. Brain tumor segmentation using K-means clustering and deep learning with synthetic data augmentation for classification. Microsc Res Tech 2021; 84: 1389–1399.

35. Bhatele, Bhadauria. Machine learning application in glioma classification: review and comparison analysis. Arch Comput Methods Eng 2021; 29: 247–274.

36. Krizhevsky, Sutskever, Hinton. ImageNet classification with deep convolutional neural networks. Commun ACM 2017; 60: 84–90.

37. Tay, Roy, et al. AANet: attribute attention network for person re-identifications. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2020.

38. Zisserman. Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations (ICLR). San Diego: ICLR, 2015: 1–15.

39. Liu, et al. A ConvNet for the 2020s. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.

Omni-dimensional dynamic convolution with coordinate attention detection scheme
