Abstract
As a pivotal task within computer vision, object detection finds application across a diverse spectrum of industrial scenarios. The advent of deep learning technologies has significantly elevated the accuracy of object detectors designed for general-purpose applications. Nevertheless, in contrast to conventional terrestrial environments, remote sensing object detection scenarios pose formidable challenges, including intricate and diverse backgrounds, fluctuating object scales, and pronounced interference from background noise, rendering remote sensing object detection an enduringly demanding task. In addition, despite the superior detection performance of deep learning-based object detection networks compared to traditional counterparts, their substantial parameter and computational demands curtail their feasibility for deployment on mobile devices equipped with low-power processors. In response to these challenges, this paper introduces an enhanced lightweight remote sensing object detection network, denoted as YOLO-Faster, built upon the foundation of YOLOv5. First, the lightweight design and inference speed of the object detection network are improved by adopting a lightweight network as the foundational network within YOLOv5, satisfying the demand for real-time detection on mobile devices. Second, to tackle the issue of detecting objects of different scales against large, complex backgrounds, an adaptive multiscale feature fusion network is introduced, which dynamically adjusts a large receptive field to capture dependencies among objects of different scales, enabling better modeling of remote sensing detection scenes. Finally, the robustness of the network under background noise is enhanced by incorporating a decoupled detection head that separates the classification and regression processes. Results on the public remote sensing object detection dataset DOTA show that the proposed method achieves a mean average precision of 71.4% and a detection speed of 38 frames per second.
Introduction
Remote sensing object detection, as a fundamental visual task, has found widespread applications in various domains such as national defense and security inspections, maritime ship detection, and disaster prediction.1,2,3,4 With the advent of deep learning technologies, remote sensing object detection networks have witnessed remarkable progress, achieving significant improvements in both detection accuracy and speed. 5 However, compared to standard object detection images, remote sensing images present challenges such as large scale, complex backgrounds, varying object scales, and severe interference from background noise, as shown in Figure 1. The variability in object orientations also makes it difficult for object detection networks to accurately locate bounding boxes. These issues collectively hinder further improvement of remote sensing object detection performance.1,6

In response to these problems, researchers have put forward various solutions. Su et al. proposed a multiscale striped convolutional attention mechanism (MSCAM) that reduces the introduction of background noise, fuses multiscale features, and enhances the model's focus on foreground objects of various sizes. 7 Li et al. strengthened the fused features of the neural network through spatial and channel attention mechanisms, transforming the original network's fusion structure into a weighted structure for more efficient and richer feature fusion. 8 Shen et al. integrated the Swin Transformer into the neck module of YOLOX, enabling the recognition of high-level semantic information and enhancing sensitivity to local geometric features. 9 However, while these methods improve accuracy, they overlook the network's capacity for adaptive adjustment and the role of rotated bounding boxes in localizing objects against large-scale backgrounds.

To tackle the challenges outlined above, this paper introduces an enhanced lightweight remote sensing object detection network, denoted as YOLO-Faster. First, to minimize network parameters and increase inference speed, we integrate the lightweight FasterNet10,11 into YOLOv5 as the foundational network. Second, to address the detection of objects of varying scales in large-scale scenes, an adaptive multiscale feature fusion network is introduced by embedding a receptive field adaptive adjustment (RFAA) block within the original YOLOv5 feature integration network. The RFAA block uses a gated selection mechanism to refine and filter features extracted by large-kernel convolutions, dynamically adjusting the large receptive field to capture dependencies among different scales and thereby better modeling diverse objects in remote sensing scenes. Finally, a decoupled head is employed at the network's output, with two parallel prediction branches for bounding box regression and target category classification, respectively; this improves the expressive capability and convergence speed of the network in complex, noisy backgrounds. In this way, YOLO-Faster meets the lightweight requirements for mobile deployment while maintaining the detection accuracy expected of remote sensing object detection networks.

Data samples of remote sensing object detection scenes.
To sum up, our contributions can be summarized as follows:
1. We introduce YOLO-Faster, a lightweight remote sensing object detection network that merges the strengths of YOLOv5 and FasterNet. Experiments on the public DOTA dataset show that YOLO-Faster achieves network lightweighting without compromising accuracy in object detection tasks.
2. To address the challenges of complex backgrounds and varying object scales in remote sensing scenarios, we introduce an adaptive multiscale feature fusion network to replace the original YOLOv5 feature integration network. It uses a gated mechanism to filter features extracted by different large-kernel convolutions, dynamically adjusting the receptive field of the detection network.
3. By integrating a decoupled object detection head into YOLOv5, we separate the responsibilities of bounding box regression and classification, enhancing both the expressiveness and the convergence speed of the network, especially in challenging, noisy scenarios.
Related works
Method of deep learning object detection
Presently, deep learning-based object detection networks can be primarily categorized into one-stage and two-stage object detection networks. 12 On the one hand, one-stage networks, exemplified by SSD and the YOLO series (YOLOv3, YOLOv4, YOLOv5),13,14,15,16 eliminate the generation of object candidate regions and instead directly predict an object's bounding box and category from the grid cell in which the object center resides. On the other hand, two-stage networks, represented by the RCNN series of algorithms (including Fast R-CNN and Faster R-CNN),12,17 use a region proposal network to generate candidate regions for the objects to be detected; these candidate regions are then refined to accurately locate the objects' bounding boxes. Both families have their respective advantages and disadvantages. One-stage algorithms achieve faster detection speeds but, lacking a bounding box refinement stage, often sacrifice detection accuracy, which is typically inferior to that of two-stage networks. In contrast, two-stage networks deliver superior detection accuracy at the cost of slower detection speeds, owing to their dual-stage bounding box localization procedure. The aforementioned general-purpose detection algorithms are primarily applied to land-based object detection and perform well on datasets such as COCO 18 and PASCAL VOC. 19 However, when faced with remote sensing detection tasks involving large-scale scenes and varying object sizes, their performance is often limited.
Application in deep learning remote sensing object detection
With the development of deep learning technology, its application to remote sensing object detection tasks is increasing, and various improvements have been proposed to solve the aforementioned problems. Li et al. proposed RSI-YOLO, 20 which integrates channel and spatial attention mechanisms into the YOLOv5 network and modifies the loss function to improve detection accuracy in remote sensing tasks. Su et al. proposed MSA-YOLO, 7 which introduces an MSCAM to reduce the introduction of background noise and fuse multiscale features, enhancing the focus on foreground objects of various sizes. Chen et al. embedded a pyramid squeeze attention mechanism for key feature extraction and designed a context information module to enhance the network's contextual understanding. 21 Lang et al. constructed efficient channel attention layers to improve channel information sensitivity, 22 while a differential evolution algorithm can automatically search for the optimal anchor configuration to cope with large variations in target scale. Gao et al. proposed a novel global-to-local scale-aware detection network for remote sensing object detection. 23 Nevertheless, deep learning models often require substantial computational resources, which hinders deployment at the edge.
Research on lightweight networks
As mentioned above, despite the substantial gains in detection accuracy brought by deep learning-based object detection networks, they frequently incur larger parameter sizes and higher computational demands than conventional methods. Lightweighting object detection networks so that they can run in real time on the low-power processors of mobile devices therefore remains a prominent research focus. Sandler et al. proposed the MobileNet network, 24 which introduces an inverted residual structure to reduce parameter size while maintaining the network's feature extraction capability. Zhang et al. introduced the ShuffleNet network,25,26 which enhances interaction between features extracted by depthwise separable convolutions through a channel shuffle operation. Chen et al. pointed out that although lightweight networks based on depthwise separable convolutions reduce the number of floating-point operations (FLOPs), their effective throughput (FLOPS) is often low due to frequent memory access. To address this, they proposed an efficient lightweight network called FasterNet, which attains substantially higher running speed than its counterparts on a wide range of devices without compromising accuracy on various vision tasks. 11
To address these detection challenges in remote sensing, this paper focuses on enhancing the network's lightweight design, its capability to adapt to large-scale backgrounds, and the robustness of the algorithm.
Methods
In this section, we provide a detailed introduction to the proposed YOLO-Faster; its overall structure is depicted in Figure 2. Specifically, Section Review of YOLOv5 revisits the theoretical foundations of YOLOv5. Section The backbone of YOLO-Faster then describes the backbone network, FasterNet, employed by YOLO-Faster. Section Adaptive multiscale feature fusion network introduces the proposed adaptive multiscale feature fusion network. Section Multiscale decoupled object detection head covers the application of decoupled detection heads within YOLOv5. Finally, Section Loss function outlines the loss function used.

The overall structure of YOLO-Faster. “Faster Block” represents the backbone feature extraction module, whose detailed structure is introduced in the following section; “RFAA Block” denotes the receptive field adaptive adjustment block; “Conv2d” denotes a standard convolution operation; “BatchNorm” denotes batch normalization; “SiLU,” “GELU,” and “Hardsigmoid” denote activation functions; and “Adaptive Avgpool2d” denotes an adaptive average pooling operation.
Review of YOLOv5
YOLOv5, positioned as a pioneering object detection algorithm, achieves a remarkable equilibrium between detection precision and processing speed. The YOLOv5 network architecture consists of four main components: the Input, Backbone, Neck, and Output. Figure 3 illustrates the structural configuration of YOLOv5.

The structure of YOLOv5.
At the input stage, data augmentation techniques such as random flipping, random rotation, and Mosaic augmentation are employed to enrich the training data and mitigate overfitting during training. Notably, Mosaic augmentation randomly stitches four images together before feeding the result into the YOLOv5 network. For feature extraction, YOLOv5 uses CSPDarkNet, incorporating modules such as Focus slicing, convolutional residual blocks, and the spatial pyramid pooling module. 27 In the neck, the path aggregation network (PANet) 28 blends high-level semantic information from different scales with low-level semantic details. Finally, at the output stage, cascaded convolutional modules predict object bounding boxes, confidence scores, and category information. Together, these techniques and architectural components allow YOLOv5 to maintain robust detection performance while optimizing detection speed, making it a popular choice for many real-world applications.
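For illustration, the following is a minimal Python sketch of the Mosaic idea; the gray fill value, the center sampling range, and the assumption that each source image is at least as large as the output are implementation details borrowed from common practice, not YOLOv5's exact code (which also rescales images and remaps their box labels):

```python
import random
import numpy as np

def mosaic_4(images, out_size=1024):
    """Minimal sketch of Mosaic augmentation: four source images are
    stitched around a random center into one training image. Assumes
    each source image is at least out_size x out_size."""
    cx = random.randint(out_size // 4, 3 * out_size // 4)  # random center x
    cy = random.randint(out_size // 4, 3 * out_size // 4)  # random center y
    canvas = np.full((out_size, out_size, 3), 114, dtype=np.uint8)  # gray fill
    quadrants = [(0, 0, cx, cy), (cx, 0, out_size, cy),
                 (0, cy, cx, out_size), (cx, cy, out_size, out_size)]
    for img, (x1, y1, x2, y2) in zip(images, quadrants):
        canvas[y1:y2, x1:x2] = img[:y2 - y1, :x2 - x1]  # crop to fit quadrant
    return canvas
```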
The backbone of YOLO-Faster
Despite the robust feature extraction capabilities of the CSPDarkNet employed in the original YOLOv5 object detection network, it carries a substantial parameter size and demands significant computational resources. This paper explores a lightweight backbone network for integration into the YOLOv5 architecture, enabling deployment of object detection networks on low-power mobile processors. By examining the memory access patterns of convolutional operators, FasterNet introduces an efficient convolutional operator called PConv to raise the effective operations per second. For feature maps carrying significant primary information, conventional convolution is used for feature extraction; for feature maps with redundant information, identity mappings transfer them directly to the next layer, avoiding unnecessary computation on ineffective information. For an input feature map $X \in \mathbb{R}^{c \times h \times w}$, PConv convolves only the first $c_p$ channels and forwards the remaining channels unchanged:

$$Y = \mathrm{Concat}\big(\mathrm{Conv}_{k \times k}(X_{[1:c_p]}),\; X_{[c_p+1:c]}\big)$$
Compared to conventional convolution, the computational cost (FLOPs) of PConv in FasterNet is significantly reduced. For a feature map of height $h$, width $w$, and $c$ channels, and a $k \times k$ kernel, the costs of a regular convolution and of PConv are

$$\mathrm{FLOPs}_{\mathrm{Conv}} = h \times w \times k^2 \times c^2, \qquad \mathrm{FLOPs}_{\mathrm{PConv}} = h \times w \times k^2 \times c_p^2,$$

where $c_p$ is the number of channels actually convolved. With the typical ratio $c_p = c/4$, PConv requires only $1/16$ of the FLOPs of a regular convolution.
Based on PConv, an efficient convolutional module called Faster Block is further proposed. The final lightweight backbone network, FasterNet, is constructed by stacking multiple layers of Faster Blocks. By replacing the backbone network of YOLOv5 with FasterNet, the operational efficiency of the object detection network is enhanced while maintaining detection accuracy. A schematic diagram illustrating the structures of PConv and Faster Block is provided in Figure 4.

Diagram of the structures of PConv and Faster Block.
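As a concrete reference, here is a minimal PyTorch sketch of the PConv operator; the split ratio of 1/4 is the typical value from the FasterNet paper, and the surrounding Faster Block (pointwise convolutions, normalization, GELU) is omitted:

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Sketch of partial convolution (PConv): a regular k x k convolution
    is applied to the first c_p channels only, while the remaining
    channels pass through as an identity mapping."""
    def __init__(self, channels: int, kernel_size: int = 3, ratio: float = 0.25):
        super().__init__()
        self.cp = int(channels * ratio)  # channels that are actually convolved
        self.conv = nn.Conv2d(self.cp, self.cp, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = x[:, :self.cp], x[:, self.cp:]  # split along the channel dim
        return torch.cat((self.conv(x1), x2), dim=1)
```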
Adaptive multiscale feature fusion network
The original YOLOv5 object detection network employs PANet as the neck network to concatenate features of different dimensions along the channel dimension, followed by convolutional operations to integrate semantic information of various scales. For land-based images with rich target features, PANet effectively fuses multiscale semantic information. However, for remote sensing images with large-scale complex backgrounds, a substantial amount of irrelevant background noise interferes with the multiscale feature fusion process. To address this issue, this paper introduces an RFAA module that adaptively adjusts the receptive field of the detection network to filter out ineffective semantic information from the background. A schematic diagram of the RFAA Block structure is provided in Figure 5. Within this block, GAP and GMP denote adaptive global average pooling and adaptive global max pooling, respectively, while S represents the sigmoid activation function. The RFAA Block follows a Transformer-style design: it first performs adaptive adjustment of the features and then refines them through a feed-forward network (FFN) consisting of two multilayer perceptrons. With layer normalization $\mathrm{Norm}(\cdot)$, this corresponds to the standard pre-norm residual form:

$$X' = X + \mathrm{RFAA}(\mathrm{Norm}(X)), \qquad Y = X' + \mathrm{FFN}(\mathrm{Norm}(X'))$$

Schematic diagram of the RFAA block structure. Norm represents layer normalization, S represents sigmoid function, GMP and GAP denote global max pooling and global average pooling, respectively.
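Since the exact RFAA implementation is not reproduced here, the following PyTorch sketch is only a hypothetical reading of the description above: depthwise large-kernel branches, a sigmoid gate driven by GAP and GMP statistics, and a two-layer FFN in a pre-norm residual structure. The kernel sizes, gating layout, and normalization choice are all assumptions:

```python
import torch
import torch.nn as nn

class RFAABlock(nn.Module):
    """Hypothetical sketch of the RFAA idea: features from large-kernel
    convolutions are gated by pooled statistics (GAP/GMP -> sigmoid)
    inside a pre-norm Transformer-style residual structure."""
    def __init__(self, c: int, kernels=(5, 9)):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, c)  # layer-norm-like normalization
        self.branches = nn.ModuleList(
            nn.Conv2d(c, c, k, padding=k // 2, groups=c) for k in kernels)
        self.gate = nn.Sequential(nn.Conv2d(2 * c, len(kernels), 1), nn.Sigmoid())
        self.norm2 = nn.GroupNorm(1, c)
        self.ffn = nn.Sequential(nn.Conv2d(c, 2 * c, 1), nn.GELU(),
                                 nn.Conv2d(2 * c, c, 1))  # two-MLP FFN

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.norm1(x)
        feats = [b(y) for b in self.branches]          # large receptive fields
        s = sum(feats)
        gap = torch.mean(s, dim=(2, 3), keepdim=True)  # global average pooling
        gmp = torch.amax(s, dim=(2, 3), keepdim=True)  # global max pooling
        w = self.gate(torch.cat((gap, gmp), dim=1))    # one gate per branch
        y = sum(w[:, i:i + 1] * f for i, f in enumerate(feats))
        x = x + y                                      # first residual
        return x + self.ffn(self.norm2(x))             # FFN refinement
```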
Multiscale decoupled object detection head
YOLOv5 adopts detection heads at different scales to detect targets of different sizes, fusing multiscale features through the adaptive multiscale feature fusion network (AMFFN) to obtain a set of feature maps. The backbone and AMFFN together produce feature maps at three different scales. At each scale, we replace the original coupled head with a decoupled detection head, in which two parallel branches perform target classification and bounding box regression separately; the difference between the two designs is illustrated in Figure 6.

The difference between coupled and decoupled detection heads. Cls denotes the class of the target and Bbox denotes the bounding box of the target.
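A minimal sketch of the decoupled head idea compared in Figure 6 follows; channel widths and branch depths are illustrative rather than the paper's exact configuration, and a rotated-box variant would add an angle term to the regression output:

```python
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Sketch of a decoupled detection head: after a shared stem, two
    parallel branches predict class scores and box parameters separately."""
    def __init__(self, in_ch: int, num_classes: int, num_anchors: int = 3):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(in_ch, in_ch, 1), nn.SiLU())
        self.cls_branch = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(in_ch, num_anchors * num_classes, 1))
        self.reg_branch = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(in_ch, num_anchors * 5, 1))  # x, y, w, h, objectness

    def forward(self, x: torch.Tensor):
        x = self.stem(x)
        return self.cls_branch(x), self.reg_branch(x)
```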
Loss function
We propose YOLO-Faster to address rotated object detection in remote sensing images. Compared to ordinary object detection tasks, rotated object detection is more complex because it requires an additional regression parameter, the rotation angle, as illustrated in Figure 7.

Prediction box parameters for rotating object detection.
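To make the extra regression parameter concrete, the following sketch converts a rotated box (cx, cy, w, h, θ) into its four corner points; the angle convention used here is an assumption, as conventions differ across rotated detection frameworks:

```python
import numpy as np

def rbox_corners(cx, cy, w, h, theta):
    """Corner points of a rotated box parameterized as (cx, cy, w, h, theta),
    with theta the rotation angle in radians. Illustrative only."""
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])                  # 2-D rotation matrix
    half = np.array([[-w, -h], [w, -h], [w, h], [-w, h]]) / 2.0
    return half @ rot.T + np.array([cx, cy])           # 4 x 2 corner array
```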
Experiments and results
Dataset
In this paper, we adopt the commonly used public dataset DOTA 30 in the field of remote sensing object detection for training and testing the object detection network. The DOTA dataset contains 2806 images with resolutions of up to 4000 × 4000, covering 15 object categories and 188,282 object instances. During training, we preprocess the DOTA dataset with image cropping to increase the sample size: the original large images are cropped into multiple patches with a resolution of 1024 × 1024, yielding 19,472 training images and 5297 test images. Some cropped images are shown in Figure 8.

Cropped images in the DOTA dataset.
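A minimal sketch of such sliding-window cropping is shown below; the stride (and hence the overlap between neighboring patches) is an assumed value, since the paper does not report it:

```python
import numpy as np

def sliding_crops(image, crop=1024, stride=824):
    """Tile a large image into overlapping crop x crop patches.
    stride < crop leaves an overlap so objects on patch borders appear
    whole in at least one patch; real pipelines also remap box labels
    into each patch's coordinates."""
    h, w = image.shape[:2]
    # Regular grid positions plus a final position flush with the border.
    ys = sorted(set(list(range(0, max(h - crop, 1), stride)) + [max(h - crop, 0)]))
    xs = sorted(set(list(range(0, max(w - crop, 1), stride)) + [max(w - crop, 0)]))
    return [((x, y), image[y:y + crop, x:x + crop]) for y in ys for x in xs]
```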
Evaluation metrics
This article uses the average precision (AP) of each category, the mean average precision (mAP) over all categories, and the detection speed of the network as evaluation metrics. AP is obtained from the precision-recall (P-R) curve, where precision and recall are defined as

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN},$$

with $TP$, $FP$, and $FN$ denoting true positives, false positives, and false negatives, respectively. AP is the area under the P-R curve, and mAP is the mean of the per-category APs:

$$AP = \int_0^1 P(R)\, dR, \qquad mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i.$$
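For concreteness, here is a small sketch of the standard all-point AP computation from a P-R curve (the widely used VOC-style protocol, not necessarily the exact evaluation code used here):

```python
import numpy as np

def average_precision(recall, precision):
    """All-point interpolated AP: area under the P-R curve, with the
    precision made monotonically non-increasing first. Inputs are
    assumed sorted by ascending recall."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]   # envelope of the P-R curve
    idx = np.where(r[1:] != r[:-1])[0]         # points where recall changes
    return np.sum((r[idx + 1] - r[idx]) * p[idx + 1])
```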
Implementation details
The experimental environment setup of this paper is shown in Table 1. The operating system is Ubuntu 20.04, and Python 3.8 with PyTorch 1.10 is adopted as the deep learning framework for deploying the remote sensing object detection network. During training, the input image size is set to 1024 × 1024, the batch size to 8, and the initial learning rate to 0.00025, with the Adam optimizer updating the network weights. Apart from random flipping, no other data augmentation techniques are used.
The experimental environment setup.
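A minimal sketch of the stated training configuration follows; the module and loss are placeholders standing in for the real network and rotated detection loss:

```python
import torch
import torch.nn as nn

# Sketch of the stated optimizer setup: Adam, initial learning rate
# 0.00025, batch size 8, 1024 x 1024 inputs. The module and loss below
# are placeholders, not the actual YOLO-Faster network or loss.
model = nn.Conv2d(3, 16, 3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4)

images = torch.randn(8, 3, 1024, 1024)  # one batch of 1024 x 1024 crops
loss = model(images).abs().mean()       # placeholder objective
optimizer.zero_grad()
loss.backward()
optimizer.step()
```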
Quantitative analysis of experimental results
This paper compares the improved remote sensing object detection network with mainstream object detection networks, including SSD, 13 Faster-RCNN, 17 RetinaNet, 31 YOLOv3, 14 YOLOv4, 15 YOLOv5, and TPH-YOLOv5, 32 to verify the effectiveness of the proposed method. The comparison results are shown in Table 2. As the table shows, compared with mainstream object detection methods, the proposed network achieves advanced results in terms of mean detection accuracy, improving on the baseline YOLOv5s by 1.9%. In terms of running efficiency and detection speed, thanks to the efficient backbone, which accounts for both FLOPs and memory access to achieve higher effective throughput (FLOPS), YOLO-Faster improves accuracy while preserving detection speed. Compared with one-stage networks such as YOLOv5s and SSD, the detection frame rate remains at a similar level while achieving a higher mAP. Compared with TPH-YOLOv5, which has similar detection accuracy, our detection frame rate is higher by a significant margin of 20 FPS. In addition, compared with other one-stage and two-stage detection networks, the proposed YOLO-Faster performs better in both detection accuracy and detection speed.
The quantitative results on the DOTA dataset.
mAP: mean average precision.
To verify the generalization of YOLO-Faster, we further evaluated its performance on the DOTAv2 dataset, which includes 18 object categories in total. We select Faster-RCNN, RetinaNet, FCOS, ATSS, and Oriented R-CNN as comparison networks. The quantitative comparison results are shown in Table 3. As the table shows, YOLO-Faster still outperforms general object detection networks represented by Faster-RCNN. Although YOLO-Faster falls below Oriented R-CNN on the mAP metric, it retains an advantage in lightweight performance.
The quantitative results on the DOTAv2 dataset.
mAP: mean average precision.
To verify the lightweight property of the proposed YOLO-Faster, we further compare its parameter count and computational complexity with those of other object detectors. The results are shown in Table 4. As the table shows, compared with other general-purpose detectors, YOLO-Faster achieves the best results in network parameter count, computational complexity, and detection speed, demonstrating its lightweight design. Upon replacing the original YOLOv5 backbone with the lightweight counterpart, both parameter count and FLOPs decline precipitously.
Lightweight comparison of different methods.
In theory, this substitution should increase the detection speed of the detector. Nevertheless, owing to the substantial computational complexity introduced by the large convolutional kernels and the gated selection mechanism within AMFFN, the network's inference slows somewhat, partially offsetting the speed advantage conferred by the lightweight backbone. Even so, the inference speed remains comparable to that of YOLOv5s.
In addition, we visualize the trend of the loss function during training, as shown in Figure 9. As the figure shows, YOLO-Faster converges quickly and remains stable in subsequent training, while the mAP metric shows a steady upward trend, without the overfitting problem that often occurs when training small models.

The loss curve during the training process.
Ablation study
We set up ablation experiments to verify the effectiveness of the proposed modules. The results are shown in Table 5. As the table shows, after replacing the original YOLOv5s backbone with FasterNet, the mean detection accuracy of the detector increases from 69.5% to 70.6%. Replacing the original PANet neck with the proposed AMFFN raises the mean detection accuracy from 70.6% to 71.0%. Furthermore, replacing the original YOLOv5 detection head with the proposed multiscale decoupled detection head raises the detection accuracy from 71.0% to 71.3%. These ablation experiments show that each of the proposed improvements has a positive effect on the detection accuracy of remote sensing object detection networks, further confirming the effectiveness of our method. We also visualize the mAP curve in Figure 10; as the figure shows, YOLO-Faster leads in mAP at almost all stages compared with the other settings.

The mAP curve.
Ablation experiments for different modules.
AMFFN: adaptive multiscale feature fusion network; mAP: mean average precision.
Qualitative visual analysis
In this section, the effectiveness of the proposed YOLO-Faster is further verified by visualizing detection results. Figure 11 shows the detection results of YOLOv5s, Faster-RCNN, and our YOLO-Faster, with regions where YOLO-Faster is more accurate than the other two networks marked by red boxes. For example, in the first column, YOLOv5s misses the “plane” target marked by the red box and Faster-RCNN misclassifies it as another category, while our method classifies and locates this target accurately. In the second column, YOLOv5s does not locate the “harbor” target precisely and Faster-RCNN produces numerous duplicate detections of it, whereas our YOLO-Faster avoids these problems and detects the target accurately. These visualizations also show our advantages in reducing false detections and improving recall and localization accuracy.

The visualization results compared with YOLOv5s and Faster-RCNN.
In addition, we visualize the detection results of the ablation experiments in Figure 12. The first row shows the detection results of the original YOLOv5s network, while the second row shows the results after embedding the AMFFN. The third row shows the results after further adopting FasterNet as the backbone together with the multiscale decoupled detection head. Compared with the first and second rows, the third row shows improved detection accuracy for targets of different sizes, further demonstrating the effectiveness of our proposed method.

The visualization of detection results. “AMFFN” represents embedding AMFFN in the YOLOv5s network.
Discussion
YOLO-Faster is predominantly geared toward adaptively detecting targets against large-scale complex backgrounds. However, remote sensing images also contain large numbers of small objects.33,34,35,36 YOLO-Faster is suboptimal at detecting small targets, and some failure cases are shown in Figure 13. As the figure shows, in (a) and (c) YOLO-Faster misses cars on the road, which are typically regarded as small targets; in (b) and (d), it mistakenly detects small vehicles as large vehicles. This illustrates the limitations of our method on small targets.

Failure cases. In (a) and (c), YOLO-Faster misses cars on the road; in (b) and (d), it mistakenly detects small vehicles as large vehicles.
Improving detection accuracy on small targets is key to raising the overall performance of the detector. In future work, we will explore more small-target detection methods and combine them with our network.
Conclusions
In response to the challenges of large-scale complex scenes, varying target scales, and severe interference from background noise in remote sensing images, this paper introduces an improved lightweight remote sensing object detection network called YOLO-Faster. First, a lightweight backbone feature extraction network replaces the original YOLOv5 backbone CSPDarkNet, improving the lightweight nature of the detector while maintaining its detection accuracy. Second, an adaptive multiscale feature fusion network is proposed, in which an RFAA module adjusts the receptive field of the detector to filter out invalid semantic information from the background. Additionally, a decoupled object detection head replaces the coupled head, with two parallel branches handling classification and box regression respectively to further improve detection accuracy. Experimental results on common remote sensing object detection datasets show that the proposed algorithm improves mean detection accuracy by 1.9% over the original network while maintaining a comparable inference time. These innovations provide effective solutions for remote sensing object detection and useful references for subsequent research and applications.
Footnotes
Author contributions
The contribution of each author to this research article is specified as follows: conceptualization: Congling Tian; methodology: Yicheng Tong, Guan Yue, and Longfei Fan; validation: Deya Zhu and Yan Liu; data curation: Guosen Lyu and Boyuan Meng; formal analysis: Shu Liu and Xiaokai Mu; investigation: Deya Zhu, Yan Liu, Shu Liu, and Xiaokai Mu; writing—original draft preparation: Yicheng Tong and Guan Yue; project administration: Congling Tian; writing—review and editing: all authors. All authors have read and agreed to the published version of the manuscript.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China (Grant No. 52301369).
