
Abstract

Unauthorized ("black") flights of UAVs (Unmanned Aerial Vehicles) at low altitude can cause serious safety risks. Consequently, it is crucial to detect and manage low altitude small UAVs. Existing methods of low altitude small UAV detection suffer from problems such as a high false alarm rate and poor real-time performance. To solve these problems, we present a novel approach, named AD-YOLOv5s, that achieves low altitude small UAV detection with high precision and high real-time performance. Firstly, a feature enhancement method is used to expand the dataset. We optimize the model feature fusion, the prediction head structure, and the loss function. Based on the CBAM (Convolutional Block Attention Module) attention mechanism, feature enhancement is performed to improve the detection accuracy. Secondly, the ghost module and depthwise separable convolution are used to reduce the number of parameters of the model, and we propose a lightweight model design to improve the detection speed. Experimental results show that, compared with the YOLOv5s model, the proposed AD-YOLOv5s model improves mAP by 2.2% and Recall by 1.8%, reduces GFLOPs by 29.9% and parameters by 38.8%, and achieves 27.6 FPS when deployed on a low-cost edge computing device (Jetson Nano).

Keywords

Low altitude security, object detection, embedded deployment, deep learning, UAV

Introduction

In recent years, with the rapid development of wireless communications,¹⁻³ panoramic vision technology⁴ and UAV technologies, the civilian UAV market has expanded rapidly. Most low altitude small UAVs are quadrotors, which are characterized by low altitude, slow speed and small size. However, the pressure of low-altitude UAV control is also increasing. Borders, airports and nuclear power plants have become the areas most seriously affected by UAV intrusion. UAVs have caused numerous safety accidents, so low-altitude security against UAVs is imperative.

This paper is mainly oriented toward the border defense application scenario. Because communication coverage in border areas is weak, the communication equipment cannot transmit large-scale video data. At the same time, environmental constraints in border areas, such as a shortage of power resources, make it impossible to deploy large equipment such as ground stations with powerful GPUs. Consequently, this scenario calls for detection algorithms that are deployed at the edge and that run correctly and in real time on systems with limited processing power. Because borders are long, numerous devices need to be deployed, so the edge devices must also be low cost.

The detection methods for low altitude small UAVs⁵ include radio,⁶ audio,⁷ radar⁸ and image⁹ detection, as well as multi-source detection information fusion.¹⁰ Radio can locate a UAV quickly, but frequency hopping, which allows airborne stations and ground stations to change channels at the same time, makes the UAV much harder to locate. Audio works to some extent for large, noisy UAVs, but it is not suitable for small, quiet UAVs. Radar can detect large UAVs well; however, it does not work well for small ones. Image detection methods are widely studied because of their low cost and easy expansion, but they rely on manual features based on prior knowledge and experience, so their design is difficult and inefficient.

With the increasing pressure on low-altitude security in recent years, traditional detection methods can no longer meet detection needs. At the same time, deep learning object detection algorithms have been widely used and have played an important role in vehicle detection, face detection, automatic driving,¹¹ safety systems, and other fields. Compared with traditional detection methods, deep learning detection methods have the advantages of high detection accuracy and fast speed. Deep learning detection models fall into two types: one-stage and two-stage. One-stage detection models include YOLO,¹²⁻¹⁵ SSD,¹⁶ etc. Their core idea is regression, and they do not need regional candidate networks: they predict the object category and location directly from the feature extraction network. The two-stage detection models are represented by R-CNN (Region-Convolutional Neural Networks),¹⁷ Fast R-CNN (Fast Region-Convolutional Neural Networks),¹⁸ Faster R-CNN (Faster Region-Convolutional Neural Networks),¹⁹ and Mask R-CNN (Mask Region-Convolutional Neural Networks).²⁰ Their basic idea is to extract image features with convolutional neural networks, after which the regional candidate network forms candidate boxes. The candidate regions are then cropped (“matting”) and given a further feature representation. Finally, classification and regression give the type and location of the objects. The performance of the two kinds of model also differs. The two-stage model has advantages in detection accuracy and positioning accuracy. However, due to its complex detection process, it struggles to meet real-time requirements. The one-stage model is quicker in detection. Low-altitude security scenarios have high requirements for detection speed, so this paper uses the one-stage YOLO model.

The YOLO family contains various variants, such as YOLOv1-v7, YOLOX and YOLOR. YOLOv5 offers good portability and strong real-time performance, and YOLOv5s is a lightweight network in the YOLOv5 series. The network structure of YOLOv5s can be divided into three parts: the feature extraction network (backbone), the feature fusion network (neck), and the prediction layer (detection head).

In practice, the detection of low altitude small UAVs needs to be low cost, low delay ($\geq 24$ FPS) and high precision. Therefore, it is necessary to deploy the model on low-cost edge devices and to complete real-time and accurate data processing and analysis there. We use the YOLOv5s model as the base model in this paper. However, the YOLOv5s model has numerous parameters, so it is not well suited to deployment on embedded devices. At the same time, the network structure of YOLOv5s limits its performance on small UAVs, so it cannot meet the accuracy requirements of low altitude small UAV detection. We propose AD-YOLOv5s (AntiDrone-YOLOv5s) to detect low altitude small UAVs. The innovations are as follows:

A feature enhancement method is proposed to optimize the feature fusion layer and detection head structure. An attention mechanism is used to enhance the features of a single feature map. The EIoU (Efficient Intersection over Union) loss²¹ is used as the loss function to make the model converge faster and more accurately. Experimental results show that the proposed method improves mAP (Mean Average Precision) by 5.4% and Recall by 5.5%.

Based on a lightweight model design that uses the ghost module and depthwise separable convolution, AD-YOLOv5s is proposed to achieve lightweight low altitude small UAV detection. At the same time, TensorRT²² is used to accelerate the proposed model. Experimental results show that the proposed method reduces the number of floating-point operations by 29.9%.

Experiment results show that the proposed method is suitable for deployment in low-cost edge equipment.

Related work

In recent years, many vision-based UAV detection algorithms have been proposed. Mejias et al. proposed a method for tracking UAVs with morphological preprocessing and Hidden Markov Models that detects UAV objects within a distance of 1000 meters, but its detection speed is slow.²³ Rozantsev et al. proposed a regression-based video UAV detection method, which solved the problem of UAV size change, but it still cannot detect fast UAVs.²⁴ To address the difficulty of detecting fast-flying UAVs, Lian Du combined a CNN (Convolutional Neural Network) and an SVM (Support Vector Machine) to propose a UAV detection method for a moving camera. The method can detect fast UAVs, but its detection speed is poor.²⁵ For the problem of multi-category detection, Sommer et al. studied UAV detection with dynamic and static data and proposed a convolutional neural network to detect and classify six types of flying objects (including UAVs).²⁶ Hu et al. replaced the feature extraction network of DiagonalNet with the improved Hourglassnet, which increased the detection accuracy, but the speed was not tested further.²⁷ Seidaliyeva et al. detected moving objects against a static background and then classified them to further improve the accuracy of the model, but the model had poor real-time performance.²⁸ Makirin used a web application for real-time detection when participating in the UAV Chasing Challenge. The precision and recall could be improved in the detection process, but this method relied on high computing resources.²⁹

Current YOLO-based anti-drone detection algorithms have two limitations. Firstly, they can only identify targets with a minimum size of 8*8 pixels. Secondly, the models have so many parameters that they cannot run in real time on low-cost edge computing devices.

The analysis shows that there are many excellent algorithms for UAV detection, but these models are unsuited to detecting UAVs in low-altitude security scenarios. The reasons are that the detection object is much smaller than a conventional object in such scenarios and that its posture changes over time. Therefore, researchers are committed to developing a robust low-altitude object detection algorithm suitable for detecting small UAVs in low-altitude security scenarios.

Because the low-altitude small UAV detection task places high requirements on both detection accuracy and real-time performance, both must be considered when designing the detection algorithm. However, the existing models are relatively bloated, which leads to poor real-time performance on intelligent monitoring equipment. Deep learning has powerful processing ability, and its application to low-altitude security has become a hot research topic. In summary, we propose a lightweight low-altitude small UAV detection model based on deep learning.

AD-YOLOv5s low altitude small UAV detection algorithm

The algorithm framework

The YOLOv5s model has two problems when applied to low altitude small UAV detection. Firstly, the object size distribution in low altitude small UAV detection is extremely inhomogeneous: it includes numerous small objects and even smaller ones, that is, tiny objects. The existing YOLOv5s detection model cannot detect small and tiny objects very well, so we need to optimize the model according to the object size distribution of low altitude small UAV detection. Secondly, low altitude small UAV detection scenarios need to be supported by low-cost edge devices. Although YOLOv5s is a relatively lightweight model in the YOLOv5 family, the limited memory and computing resources of low-cost edge devices prevent it from performing effectively when deployed on them. Therefore, we propose a low altitude small UAV detection algorithm based on AD-YOLOv5s. Figure 1 shows the network structure of the AD-YOLOv5s model, obtained with feature enhancement optimization and structural lightweight improvement, and Figure 2 shows the object detection algorithm framework.

Figure 1.

AD-YOLOv5s network structure.

Figure 2.

Low altitude small UAV detection algorithm based on AD-YOLOv5.

First, we optimize accuracy based on the YOLOv5s model: we add the CBAM module to the backbone network to improve the model’s extraction of image features; we use feature enhancement methods in the neck layer to enhance the model’s extraction of tiny target features; and we use the EIoU loss in the prediction head for more accurate bounding box regression. Secondly, we make the accuracy-improved model lightweight: we use the ghost module to reduce the computation of the backbone, and we use depthwise separable convolution to replace regular convolution and reduce the number of parameters of the neck network modules.

The performance optimization of low altitude small UAV detection based on feature enhancement

The feature enhancement of feature fusion layer

The standard YOLOv5s model uses the feature maps obtained by 8 times, 16 times and 32 times down-sampling as prediction layers to detect objects. The shallow features of an image include contour, edge, colour, texture and shape features, and the edges and contours reflect the image content. Low-level features carry relatively little semantic information, but they locate the target accurately. Although shrinking the feature map enriches the semantic information, it loses shallow feature information. Small UAV detection depends more on shallow feature information, so without it the improvement of detection accuracy is inhibited.

We analyzed the object size distribution. The receptive field mapping relationship of the convolutional neural network is:

$$N = \frac{X}{Y} \tag{1}$$

where $X$ is the size of the input image, $Y$ is the size of the feature map, and $N$ is the number of pixels in the original image, along each dimension, that correspond to one point on the feature map.

The default input size of the YOLOv5s model is $640 \times 640$, and 8 times down-sampling yields an $80 \times 80$ feature map. One pixel on this feature map corresponds to eight pixels along each dimension of the input image, which means that this detection layer can be used to detect objects larger than $8 \times 8$. Similarly, feature maps of size $40 \times 40$ and $20 \times 20$ are suitable for detecting objects larger than $16 \times 16$ and $32 \times 32$, respectively. Based on this analysis, Table 1 shows the size classification of objects detected by YOLOv5s.

This analysis shows that the detection layer of the original YOLOv5s has detection heads of three sizes and can detect targets of size 8*8 and larger. However, when the original image is down-sampled by up to 32 times, an object smaller than 8*8 in the original image loses a lot of shallow feature information. Even if the 80*80 feature layer is used for detection, such an object cannot be detected.
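The mapping in equation (1) can be checked with a few lines of Python. This is a minimal sketch, assuming a square $640 \times 640$ input; the helper name `receptive_step` is illustrative and does not come from the paper.

```python
def receptive_step(input_size: int, feature_map_size: int) -> int:
    """Equation (1): N = X / Y, input pixels per feature-map point (per side)."""
    if input_size % feature_map_size != 0:
        raise ValueError("input size must be a multiple of the feature map size")
    return input_size // feature_map_size


INPUT_SIZE = 640  # default YOLOv5s input size stated in the text

# Feature-map sizes of the three original YOLOv5s detection layers.
for fmap in (80, 40, 20):
    n = receptive_step(INPUT_SIZE, fmap)
    print(f"{fmap}x{fmap} feature map -> objects of about {n}x{n} pixels and larger")
```

Under the same assumption, a $160 \times 160$ feature map gives $N = 4$, which matches the lower bound of the tiny-object range in Table 1.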

Table 1.

Object Size Classifications.

Object size classification	Original feature map size	Optimized feature map size
Tiny object ($4 \times 4 - 8 \times 8$)	–	$160 \times 160$
Small object ($8 \times 8 - 16 \times 16$)	$80 \times 80$	$80 \times 80$
Medium object ($16 \times 16 - 32 \times 32$)	$40 \times 40$	$40 \times 40$
Big object ($> 32 \times 32$)	$20 \times 20$	–
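The size bins of Table 1 can be applied to ground-truth boxes to obtain the proportion of each object class in a dataset. The sketch below is only illustrative: classifying by the longer side of the box, the handling of the bin boundaries, and the sample boxes are all assumptions, since the paper does not state how boxes were assigned to bins.

```python
from collections import Counter


def size_class(width: float, height: float) -> str:
    """Assign a box to a Table 1 bin by its longer side (an assumption)."""
    side = max(width, height)
    if side < 4:
        return "below range"  # smaller than the smallest bin in Table 1
    if side < 8:
        return "tiny"
    if side < 16:
        return "small"
    if side < 32:
        return "medium"
    return "big"


# Hypothetical (width, height) pairs in pixels, not taken from the paper's dataset.
boxes = [(5, 6), (12, 9), (20, 30), (40, 25), (7, 7)]

counts = Counter(size_class(w, h) for w, h in boxes)
for name in ("tiny", "small", "medium", "big"):
    print(f"{name}: {100 * counts[name] / len(boxes):.1f}%")
```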

In the low altitude security scene, the UAV to be detected is generally small and offers few usable features, so it often appears as a tiny object smaller than 8*8 in the image. To further improve the detection performance of the model, it is necessary to count the proportion of objects of each size in the low altitude security dataset.

As can be seen from Figure 3, big objects account for only 6.7%, medium objects for 54.0%, small objects for 24.5%, and tiny objects for 14.8%. The dataset of this paper therefore contains a large number of tiny objects, and the existing object detection methods are not suitable for the low-altitude security dataset: they ignore the large number of small and tiny objects in the dataset, which seriously affects detection accuracy and recall.

Figure 3.

Statistics of object sizes in the low altitude security dataset.

To solve the above problems, we add a detection layer for tiny objects, retain the detection layers for small and medium objects, and delete the detection layer for big objects. The optimized object size classification is shown in Table 1.

Since the feature map of the detection layer comes from the feature fusion layer, once the above modification of the detection layer is determined, we optimize the corresponding feature fusion layer for feature enhancement. The feature fusion layer of YOLOv5s adopts the structure of Feature Pyramid Network (FPN) + Path Aggregation Network (PAN),¹² which adds a bottom-up feature pyramid behind the FPN layer.³⁰ The feature pyramid contains two PAN structures.³¹ Figure 4 shows the structure.

Figure 4.

Conventional neck feature fusion layer.

The detection head layer uses three types of feature maps: 8 times down-sampling gives the 80*80 feature maps, 16 times down-sampling gives the 40*40 feature maps, and 32 times down-sampling gives the 20*20 feature maps. Based on the above analysis, the existing feature fusion layer needs to be optimized to match the improved object size classification table. Figure 5 shows the optimized feature fusion layer network structure.

Figure 5.

The optimized neck feature fusion layer.

Firstly, a new up-sampling operation is added after the two up-sampling operations in the FPN structure, so the pyramid structure of the FPN layer grows from the initial three layers to four layers. The Concat operation is performed with the feature map of the same size from the feature extraction layer, which yields a feature map of size 160*160 for detecting tiny objects. Although the prediction layer does not use the 20*20 feature map, the FPN layer retains it so that more detailed semantic feature information can be extracted and passed to the other feature maps through the FPN structure, which benefits the detection of tiny objects. At the same time, two PAN structures are added behind the 160*160 feature map to generate feature maps of sizes 80*80 and 40*40, respectively. The FPN layer conveys strong semantic features from top to bottom, and the PAN conveys strong positioning features from bottom to top. This is the feature enhancement operation of the feature fusion layer. This paper places no detection requirements on big objects of size 32*32 or more, so we delete the PAN structure for the 20*20 feature map.

Meanwhile, to match the improvement of the feature fusion layer, we further optimize the detection head layer, as shown in Figure 6. The two detection heads for small and medium objects, fed by the 80*80 and 40*40 feature maps, are retained. We delete the detection head for big objects and the related sampling and convolution processes. At the same time, the 160*160 feature map is used as a detection head for tiny objects.

Figure 6.

Detection head structure.

Feature enhancement by attention mechanism

Figure 7 shows the feature map generated by the convolution operation. The small and tiny objects occupy little feature information on the feature map, and the background of the detection object is mostly sky. Most of the feature map has a single colour gamut, with no obvious colour distinction. As a result, a large amount of invalid feature information occupies computing resources in a single feature map, and the model cannot locate the detection object quickly, which affects the detection performance.

Figure 7.

Input image and convolved feature map.

To focus on the more important feature information, suppress invalid feature information, and improve the detection accuracy of small and tiny objects, we need to apply feature enhancement to the feature extraction on a single feature map. In this paper, we propose feature enhancement based on the mixed-domain Convolutional Block Attention Module (CBAM),³² which gives a stronger feature representation ability. Figure 8 shows the structure. CBAM is a lightweight convolutional attention module that combines channel and spatial attention. It includes the CAM (Channel Attention Module) and SAM (Spatial Attention Module) sub-modules, which perform attention operations on the channel and spatial dimensions, respectively. Given a feature map, the CBAM module sequentially infers attention maps along the channel and spatial dimensions, and then multiplies each attention map with the input feature map for adaptive feature refinement to generate the final feature map. The CBAM module not only has a small number of parameters and a low computational cost, but can also be embedded in any existing network architecture to improve performance. The attention module enriches the extracted high-level features in the channel and spatial dimensions by applying global average pooling and global maximum pooling. After the spatial and channel weights are obtained, they are applied to the initial features to complete the dual attention adjustment.

Figure 8.

CBAM Attention Mechanism Module Structure.
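The channel-then-spatial weighting described above can be illustrated with a small NumPy forward pass. This is a sketch under stated assumptions rather than the authors' implementation: the weights are random placeholders instead of trained parameters, and the reduction ratio of 4, the $7 \times 7$ spatial kernel and the function names are illustrative choices.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(x, w1, w2):
    """x: (C, H, W). A shared two-layer MLP scores avg- and max-pooled vectors."""
    avg = x.mean(axis=(1, 2))  # global average pooling -> (C,)
    mx = x.max(axis=(1, 2))    # global max pooling -> (C,)

    def mlp(v):
        return w2 @ np.maximum(w1 @ v, 0.0)  # ReLU hidden layer

    return sigmoid(mlp(avg) + mlp(mx))  # one weight per channel, shape (C,)


def spatial_attention(x, kernel):
    """x: (C, H, W); kernel: (2, k, k) applied to channel-wise avg and max maps."""
    k = kernel.shape[-1]
    pad = k // 2
    pooled = np.stack([x.mean(axis=0), x.max(axis=0)])         # (2, H, W)
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    windows = np.lib.stride_tricks.sliding_window_view(
        padded, (k, k), axis=(1, 2)
    )                                                          # (2, H, W, k, k)
    return sigmoid(np.einsum("chwij,cij->hw", windows, kernel))  # (H, W)


def cbam(x, w1, w2, kernel):
    """Apply channel attention first, then spatial attention."""
    x = x * channel_attention(x, w1, w2)[:, None, None]
    return x * spatial_attention(x, kernel)[None, :, :]


rng = np.random.default_rng(0)
C, H, W, reduction = 16, 20, 20, 4  # illustrative sizes
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // reduction, C)) * 0.1  # placeholder MLP weights
w2 = rng.standard_normal((C, C // reduction)) * 0.1
kernel = rng.standard_normal((2, 7, 7)) * 0.1        # placeholder conv kernel
y = cbam(x, w1, w2, kernel)
print(y.shape)
```

Because both attention maps pass through a sigmoid, every output value is the input value scaled by a factor between 0 and 1, so the module reweights features without changing the shape of the feature map.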

To show the improvement that the attention module brings to the feature information of small and tiny objects, two typical attention modules, the Squeeze-and-Excitation (SE)³³ module and the Efficient Channel Attention (ECA)³⁴ module, are compared with the CBAM used in this paper, as shown in Table 2. The CBAM attention module improves the model performance more than the SE and ECA modules. This is because the CBAM attention module can learn the importance of the different feature channels of the feature map at a deeper level, reinforcing the important information of the object and weakening irrelevant information. It can enhance the shallow perception and representation of small target features. Since the CBAM module considers feature extraction in both the channel dimension and the spatial dimension, the high-level features are enriched and the accuracy of small and tiny object detection improves.

Table 2.

Comparison of attention modules.

Method	mAP/%	Recall/%
YOLOv5s	85.2	83.8
YOLOv5s with SE	85.6	85.3
YOLOv5s with ECA	86.0	85.8
YOLOv5s with CBAM	87.5	85.9

Therefore, this paper embeds the CBAM module into the feature extraction network backbone of YOLOv5s, yielding the improved CBAM-YOLOv5s model shown in Figure 9.

Figure 9.

Feature extraction network structure.

Loss function

The loss function in the YOLOv5s model consists of three parts: the bounding box regression score, the objectness score and the class probability score. Its expression is as follows.

$$Loss = L_{obj} + L_{cls} + L_{box} \tag{2}$$

$L_{obj}$ is the objectness score, $L_{cls}$ is the class probability score, and $L_{box}$ is the bounding box regression score. The most important value that affects the loss in the object detection task is $L_{box}$, which is directly related to the accuracy of the predicted box. The YOLOv5s model uses the GIoU³⁵ loss function as the bounding box regression loss function. Its definition and calculation diagram are given in equation (3) and Figure 10.

$$GIoU = IoU - \frac{|C - (A \cup B)|}{|C|} \tag{3}$$

Figure 10.

The definition and calculation diagram of GIoU loss.

GIoU loss adds a measure of the enclosing area to the IoU. On the one hand, this solves the problem that the IoU loss provides no useful gradient when IoU = 0, that is, the IoU loss cannot optimize the case where the prediction box and target box do not overlap. On the other hand, it also handles the case where the IoU loss cannot distinguish between prediction boxes that have the same size and the same IoU. However, GIoU loss cannot deal with the case where prediction boxes of the same size lie inside the target box. In this case, the difference set between the prediction box and the target box is the same, so the GIoU values of the three states shown in Figure 11 are also the same.

Figure 11.

The situation of the same GIoU.
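Both properties of GIoU discussed above can be reproduced numerically. The following sketch assumes axis-aligned boxes given as `(x1, y1, x2, y2)` and takes $C$ in equation (3) to be the smallest axis-aligned box enclosing $A$ and $B$; the function names and the example coordinates are illustrative.

```python
def area(box):
    """Area of an (x1, y1, x2, y2) box; empty boxes have zero area."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def iou_giou(a, b):
    """Return (IoU, GIoU) for two boxes, following equation (3)."""
    inter = area((max(a[0], b[0]), max(a[1], b[1]),
                  min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    enclosing = area((min(a[0], b[0]), min(a[1], b[1]),
                      max(a[2], b[2]), max(a[3], b[3])))
    iou = inter / union
    return iou, iou - (enclosing - union) / enclosing


target = (0, 0, 10, 10)

# Non-overlapping predictions: IoU is 0 for both, GIoU still ranks them.
print(iou_giou(target, (12, 0, 22, 10)))
print(iou_giou(target, (30, 0, 40, 10)))

# Same-size predictions inside the target: IoU and GIoU are identical.
for pred in [(0, 0, 4, 4), (3, 3, 7, 7), (6, 6, 10, 10)]:
    print(iou_giou(target, pred))
```

In the last loop the enclosing box always equals the target box, so the penalty term vanishes and GIoU reduces to IoU regardless of where the prediction sits.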

In order to solve the shortcomings of the above GIoU loss, the Complete Intersection over Union loss (CIoU loss)³⁶ is used to achieve prediction. The formula is as follows.

$$IoU = \frac{|A \cap B|}{|A \cup B|} \tag{4}$$

$$L_{CIoU} = 1 - IoU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha \nu \tag{5}$$

$$\alpha = \frac{\nu}{(1 - IoU) + \nu} \tag{6}$$

$$\nu = \frac{4}{\pi^{2}} \left( \arctan \frac{w^{gt}}{h^{gt}} - \arctan \frac{w}{h} \right)^{2} \tag{7}$$

Here $b$ and $b^{gt}$ represent the centre points of the prediction box and target box, $\rho$ represents the Euclidean distance between the two centre points, and $c$ represents the diagonal distance of the smallest closure area that can contain both the prediction box and target box. $w$ and $w^{gt}$ represent the widths of the prediction box and target box, respectively, and $h$ and $h^{gt}$ represent their heights. IoU is the ratio of the intersection to the union of the prediction box and target box.

The CIoU loss considers the overlapping area, centre point distance and aspect ratio in bounding box regression, but it ignores the real differences in width and height and their confidence, which limits the effectiveness of model optimization. To solve this problem, Zhang et al. split the aspect ratio term of the CIoU loss²¹ and proposed the Efficient Intersection over Union loss (EIoU loss). The penalty term of the EIoU loss builds on that of the CIoU loss: the aspect ratio factor is split up, and the width and height of the target box and anchor box are handled separately. The loss function consists of three parts: overlap loss, centre distance loss and width-height loss, where the width-height loss minimizes the difference between the widths and heights of the target box and the anchor box. The EIoU loss is given by:

$$L_{EIoU} = L_{IoU} + L_{dis} + L_{asp} = 1 - IoU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \frac{\rho^{2}(w, w^{gt})}{c_{w}^{2}} + \frac{\rho^{2}(h, h^{gt})}{c_{h}^{2}} \tag{8}$$

where $c_{w}$ and $c_{h}$ are the width and height of the minimum bounding box that covers the prediction box and target box. The width and height loss in the EIoU loss makes convergence faster and accuracy higher than with the GIoU loss of the original network, so this paper adopts the EIoU loss as the loss function.
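The difference between the CIoU and EIoU penalties is easiest to see on a prediction that has the correct centre and aspect ratio but the wrong size. The sketch below assumes boxes given as `(cx, cy, w, h)` and uses plain Python scalars; the function names and the example boxes are illustrative and are not taken from the paper's code.

```python
import math


def _corners(box):
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


def _common_terms(pred, target):
    """Return IoU, squared centre distance, and enclosing-box width and height."""
    px1, py1, px2, py2 = _corners(pred)
    tx1, ty1, tx2, ty2 = _corners(target)
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    union = pred[2] * pred[3] + target[2] * target[3] - inter
    cw = max(px2, tx2) - min(px1, tx1)  # enclosing box width
    ch = max(py2, ty2) - min(py1, ty1)  # enclosing box height
    rho2 = (pred[0] - target[0]) ** 2 + (pred[1] - target[1]) ** 2
    return inter / union, rho2, cw, ch


def ciou_loss(pred, target):
    """CIoU loss: overlap + centre distance + aspect-ratio term."""
    iou, rho2, cw, ch = _common_terms(pred, target)
    v = (4 / math.pi ** 2) * (
        math.atan(target[2] / target[3]) - math.atan(pred[2] / pred[3])
    ) ** 2
    alpha = v / ((1 - iou) + v) if v > 0 else 0.0  # guard against 0/0
    return 1 - iou + rho2 / (cw ** 2 + ch ** 2) + alpha * v


def eiou_loss(pred, target):
    """EIoU loss: overlap + centre distance + separate width and height terms."""
    iou, rho2, cw, ch = _common_terms(pred, target)
    return (1 - iou + rho2 / (cw ** 2 + ch ** 2)
            + (pred[2] - target[2]) ** 2 / cw ** 2
            + (pred[3] - target[3]) ** 2 / ch ** 2)


target = (50.0, 50.0, 20.0, 10.0)
pred = (50.0, 50.0, 10.0, 5.0)  # same centre and aspect ratio, half the size
print(ciou_loss(pred, target), eiou_loss(pred, target))
```

For this pair the aspect-ratio term of CIoU is zero, so CIoU only sees the overlap loss, whereas EIoU adds explicit penalties for the width and height mismatch.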

The design of lightweight model

The design of redundancy feature map

The feature extraction of object detection relies on conventional convolution: a certain number of convolution kernels are convolved with the input image to generate a corresponding number of feature maps. In Figure 12, the Focus convolution operation on one image generates feature maps with 32 channels.

Figure 12.

Feature map of different channels with convolution operation.

The feature map of Figure 12 contains rich and redundant feature information to ensure the understanding of the input and to identify and locate the object. However, a large amount of redundant information also brings unnecessary computation.

To reduce the computation of the model, this paper optimizes the feature map generation of the feature extraction network backbone by introducing the ghost module. It generates the redundant feature maps in a low-cost way, reducing the model parameters and computation without losing vital feature information. The ghost module adopts a new convolution method that works as follows: first, a conventional convolution with less computation generates a few feature maps; then, cheap linear operations on these few feature maps generate new, similar feature maps; finally, the two sets of feature maps are combined and output as the full feature information.

Suppose the size of the input feature map is $H \times W \times C_{in}$, the size of the output feature map is $H' \times W'$, the number of output channels is $m \times r$, and the size of the convolution kernel is $k \times k$. The ratio of the computation of conventional convolution to that of the ghost module is:

$$\frac{(m \times r) \times H' \times W' \times C_{in} \times k \times k}{m \times H' \times W' \times C_{in} \times k \times k + (r - 1) \times m \times H' \times W' \times k \times k} = \frac{C_{in} \times r}{C_{in} + r - 1} \tag{9}$$

Normally $C_{in} \gg r$, so we get equation (10):

$$\frac{C_{in} \times r}{C_{in} + r - 1} \approx r \tag{10}$$

For example, suppose the conventional convolution generates 36 feature maps. With the ghost module, the first step generates 6 feature maps, and then each of them is mapped to 5 similar feature maps. This reduces the computation of the model by a factor of about 6, so the ghost module can effectively solve the problem that redundant feature maps occupy a lot of computing resources.
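The 36-map example can be checked by counting multiplications directly. The sketch below is an illustration under stated assumptions: the cheap linear operations are counted as $k \times k$ operations applied $(r - 1)$ times to each of the $m$ intrinsic maps, and the input channel count, kernel size and output resolution are illustrative values that do not come from the paper.

```python
def conv_cost(out_channels, h_out, w_out, c_in, k):
    """Multiplications of a conventional k x k convolution."""
    return out_channels * h_out * w_out * c_in * k * k


def ghost_cost(m, r, h_out, w_out, c_in, k):
    """Multiplications of a ghost module producing m * r output maps."""
    primary = m * h_out * w_out * c_in * k * k  # m intrinsic maps
    cheap = (r - 1) * m * h_out * w_out * k * k  # (r - 1) ghosts per intrinsic map
    return primary + cheap


def speedup(m, r, h_out, w_out, c_in, k):
    """Ratio of conventional cost to ghost-module cost."""
    return conv_cost(m * r, h_out, w_out, c_in, k) / ghost_cost(m, r, h_out, w_out, c_in, k)


m, r = 6, 6                        # 6 intrinsic maps, 36 output maps in total
c_in, k, h_out, w_out = 256, 3, 80, 80  # illustrative values

ratio = speedup(m, r, h_out, w_out, c_in, k)
closed_form = c_in * r / (c_in + r - 1)
print(f"direct count: {ratio:.3f}, closed form: {closed_form:.3f}, r = {r}")
```

With these values the ratio is slightly below $r$, and it approaches $r$ as $C_{in}$ grows, which is the approximation stated in equation (10).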

The feature extraction network of YOLOv5s is composed of four layers of conventional convolution and three layers of convolution-fused cross-stage local bottleneck network (bottleneckCSP). If we directly replaced the original bottleneckCSP structure and conventional convolution with the ghost module, it would, on the one hand, bring a large amount of model computation and increase the complexity of the model; on the other hand, it would lead to repeated gradient information and even to gradient vanishing.

In order to solve the above problems, based on the ghost module, this paper designs ghost-bottleneck structures and ghost-bottleneck CSP structures, as shown in Figure 13.

Figure 13.

Improved ghost-bottleneck structure and ghost-bottleneck CSP structure.

The bottleneck structure is a structural improvement that was proposed to reduce the number of parameters. Table 3 analyzes the number of parameters.

Table 3.

Comparison of introducing ghost-bottleneckCSP structure.

Network	Number of parameters	GFLOPs
YOLOv5s	7063542	16.4
ghost-bottleneckCSP YOLOv5s	6414022	12.4

In Figure 14, it is assumed that both the input and the output feature maps have 256 channels. If a $3 \times 3$ convolution kernel performs the convolution, the number of parameters is $589824$. If the 256-channel input is instead processed by a $1 \times 1 \times 64$ convolution layer, then by a $3 \times 3 \times 64$ convolution layer, and finally by a $1 \times 1 \times 256$ convolution layer, the total number of parameters is $69632$. The bottleneck convolution thus has only 11.8% of the parameters of the conventional convolution, which shows that the bottleneck structure helps to reduce the parameters of the model.

Figure 14.

Comparison between conventional convolution and bottleneck convolution.
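The parameter counts quoted for Figure 14 follow from the usual weight count of a convolution layer, $C_{in} \times C_{out} \times k \times k$. A minimal check, assuming bias terms are ignored (the helper name `conv_params` is illustrative):

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights of a k x k convolution from c_in to c_out channels (bias ignored)."""
    return c_in * c_out * k * k


# Conventional 3 x 3 convolution, 256 -> 256 channels.
plain = conv_params(256, 256, 3)

# Bottleneck: 1 x 1 reduce to 64, 3 x 3 at 64 channels, 1 x 1 expand to 256.
bottleneck = (conv_params(256, 64, 1)
              + conv_params(64, 64, 3)
              + conv_params(64, 256, 1))

print(plain, bottleneck, f"{100 * bottleneck / plain:.1f}%")
```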

The ghost-bottleneck structure is therefore designed with the ghost module, as shown in Figure 13(a). The ghost-bottleneck is composed of two stacked ghost modules. The first ghost module serves as an expansion layer that increases the number of channels. The second ghost module reduces the number of channels to match the shortcut path. The shortcut then connects the input and output of these ghost modules. Batch normalization and the Leaky ReLU activation function are added after the first ghost module, while only batch normalization is added after the second. After that, the residual structure superimposes the features on the input, which strengthens the gradient in back-propagation between layers. This also avoids the gradient vanishing caused by deepening the model, so more fine-grained features can be extracted without worrying about network degradation.

To further reduce the computational bottleneck and alleviate gradient vanishing, the CSP structure is introduced to form the ghost-bottleneckCSP structure, as shown in Figure 13(b). The input is divided into two branches. One branch passes through a standard GBL module (ghost module + batch normalization + Leaky ReLU activation function) and then through the ghost-bottleneck structure, with batch normalization used to reduce internal covariate shift and accelerate network training. The other branch is processed by a conventional convolution, and the outputs of the two branches are concatenated. This CSP design reduces computational bottlenecks and memory consumption and alleviates gradient vanishing, so that richer feature information can be extracted.

This paper replaces the original bottleneckCSP structure with the new ghost-bottleneckCSP structure to obtain the ghost-bottleneckCSP YOLOv5s model. Figure 15 shows the feature extraction network backbone.

Figure 15.

Comparison of backbones.

Table 3 shows the number of parameters and floating-point operations of the model.

As Table 3 shows, after using the new module, the number of parameters of the model decreases by 9.2%, while the number of floating-point operations decreases by 24.3%. This confirms the effect of the lightweight improvement.

The design of depth separable convolution

The feature fusion network (neck) of YOLOv5s has four CBL structures (Conv + Batch Normalization + Leaky ReLU); that is, the feature fusion network contains numerous conventional convolutions. Conventional convolution leads to a large number of parameters, which makes the model difficult to deploy on low-cost edge devices, so a lightweight replacement is needed. Based on existing research,³⁷ we use Depthwise Separable Convolution (DSConv)³⁸ in place of conventional convolution, which substantially reduces the number of parameters in the convolution operation.

Suppose $M$ is the number of channels of the input feature map, $D_{K} \times D_{K}$ is the size of the conventional convolution kernel, $D_{F}$ is the size of the output feature map of the convolution layer, and $N$ is the number of output channels. DSConv first applies one $D_{K} \times D_{K}$ depthwise filter to each input channel; $N$ convolution kernels of size $1 \times 1$ then combine the multi-channel results of the depthwise convolution layer to output $N$ feature maps. The theoretical ratio of the computational cost of DSConv to that of conventional convolution is:

$$\frac{D_{K} \times D_{K} \times M \times D_{F} \times D_{F} + M \times N \times D_{F} \times D_{F}}{D_{K} \times D_{K} \times M \times N \times D_{F} \times D_{F}} = \frac{1}{N} + \frac{1}{D_{K}^{2}} \tag{11}$$

According to the above formula, for a convolution kernel of size $3 \times 3$, DSConv reduces the computation by a factor of about 8 to 9. All conventional convolutions in the feature fusion layer of YOLOv5s are replaced with DSConv to obtain DWConv-YOLOv5s, and its complexity is compared with that of the original model in Table 4.

Table 4.

Comparison of models using DSConv.

Network	Number of parameters	GFLOPs
YOLOv5s	7063542	16.4
DWConv-YOLOv5s	6166646	15.2

Table 4 shows that, after using DSConv, the number of parameters is reduced by 12.7% and the number of floating-point operations by 7.3%, which shows that DSConv contributes to making the model lightweight.
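The cost ratio in Equation (11) can be checked numerically. The sketch below counts multiply-accumulate operations for both layer types using the symbols defined above; the concrete values chosen for $M$, $N$ and $D_{F}$ are illustrative assumptions, not settings taken from the model.

```python
def conv_cost(d_k: int, m: int, n: int, d_f: int) -> int:
    """Multiply-accumulates of a conventional D_K x D_K convolution."""
    return d_k * d_k * m * n * d_f * d_f


def dsconv_cost(d_k: int, m: int, n: int, d_f: int) -> int:
    """Multiply-accumulates of a depthwise separable convolution."""
    depthwise = d_k * d_k * m * d_f * d_f  # one D_K x D_K filter per input channel
    pointwise = m * n * d_f * d_f          # N kernels of size 1 x 1 across M channels
    return depthwise + pointwise


# Illustrative sizes (assumed): 3x3 kernel, 128 -> 128 channels, 80x80 output.
d_k, m, n, d_f = 3, 128, 128, 80

ratio = dsconv_cost(d_k, m, n, d_f) / conv_cost(d_k, m, n, d_f)
closed_form = 1 / n + 1 / d_k**2  # right-hand side of Equation (11)

print(f"ratio = {ratio:.4f}, reduction = {1 / ratio:.2f}x")
```

For a $3 \times 3$ kernel the reduction factor approaches 9 as $N$ grows, which is consistent with the 8 to 9 times stated in the text.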

Experimental results and analysis

Dataset

In this paper, low-altitude small UAVs are the detection targets. However, no suitable dataset exists for low-altitude small UAV detection, so we build the dataset used in this paper on the basis of the UAV dataset provided by Google. The original dataset contains 9000 images of civil UAVs in different environments. First, we filter the dataset to remove unqualified images. Then, we expand the dataset by collecting relevant pictures from the Internet. Finally, we further expand it with frames extracted from UAV videos. The resulting AntiDrone dataset contains 7500 images, 6750 of which are used for training and 750 for testing and validation. Figure 16 shows some data samples.

Figure 16.

Sample images from the dataset.

Lightweight deployment and experimental configuration

Embedded device selection

We choose the Jetson Nano as the deployment platform for AD-YOLOv5s. The device delivers 472 GFLOPs for running modern AI algorithms and is attractively priced. It has a 64-bit quad-core ARM A57 processor running at 1.43 GHz, 128 CUDA cores, and 4 GB of LPDDR4 memory, and it draws only 5 to 10 watts. The JetPack 4.4 operating environment developed by NVIDIA provides a complete desktop Linux environment that supports the CUDA Toolkit and cuDNN. It is an AI edge computing device suited to developing compact, low-cost, low-power systems.

TensorRT acceleration

The Jetson Nano can use TensorRT to accelerate model inference. TensorRT optimizes and accelerates neural network models through inter-layer fusion (tensor fusion) and data precision calibration. During model inference, the GPU performs the computation by launching CUDA kernels. Because INT8 can represent only 256 distinct values in the range $[-128, 127]$, quantizing FP32 values to INT8 precision loses information and can weaken the performance of the model. In this case, TensorRT optimizes the quantization process automatically.³⁹
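The information loss caused by INT8 quantization can be illustrated with a small example. The sketch below uses plain symmetric max-abs scaling, which is an assumption made for illustration: TensorRT selects its quantization ranges through its own calibration procedure, which is not reproduced here. The function names and sample values are hypothetical.

```python
def quantize_int8(values):
    """Symmetric quantization: map [-max|x|, max|x|] onto [-127, 127]."""
    # Fall back to a scale of 1.0 when every input is zero.
    scale = (max(abs(v) for v in values) / 127.0) or 1.0
    quantized = [max(-128, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map INT8 codes back to approximate floating-point values."""
    return [q * scale for q in quantized]


weights = [0.0123, -0.5, 0.2501, 0.9999, -1.27]  # made-up FP32 values
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]

print(codes)        # [1, -50, 25, 100, -127]
print(max(errors))  # worst-case rounding error, at most scale / 2
```

Each value is rounded to the nearest of 256 levels, so the rounding error is bounded by half the step size, and values far smaller than the largest one lose most of their relative precision.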

Evaluation indicator

For object detection, Precision and Recall are commonly used to evaluate the detection performance of a model. Because Precision and Recall depend on the confidence threshold, they cannot fully reflect the performance of the detection model. In the experiments, we therefore use the mean average precision $mAP$ and $mAP_{@0.5:0.95}$ to evaluate the model's performance. The $mAP$ is one of the most important metrics for evaluating mainstream object detection algorithms. The average precision $AP$ is defined as follows:

$$AP = \int_{0}^{1} P(R) \, dR \tag{12}$$

$AP$ is the area under the Precision-Recall curve. For single-class object detection, $AP$ is the same as $mAP$. $mAP_{@0.5:0.95}$ represents the average $mAP$ over IoU thresholds from 0.5 to 0.95 in steps of 0.05 (0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95).
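The two metrics can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the integral in Equation (12) is approximated by the trapezoidal rule over sampled points of the Precision-Recall curve, whereas practical evaluation code usually interpolates the curve first. The function names and the sample curve are hypothetical.

```python
def average_precision(recalls, precisions):
    """Trapezoidal area under a Precision-Recall curve.

    Points must be sorted by increasing recall.
    """
    area = 0.0
    for i in range(1, len(recalls)):
        width = recalls[i] - recalls[i - 1]
        area += width * (precisions[i] + precisions[i - 1]) / 2.0
    return area


def iou_thresholds():
    """IoU thresholds from 0.5 to 0.95 in steps of 0.05."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


def map_50_95(ap_per_threshold):
    """Average the AP values obtained at each IoU threshold."""
    return sum(ap_per_threshold) / len(ap_per_threshold)


# Made-up three-point curve: precision falls as recall rises.
ap = average_precision([0.0, 0.5, 1.0], [1.0, 0.8, 0.6])
thresholds = iou_thresholds()
print(round(ap, 3), thresholds)
```

For a single class, `map_50_95` would be called with ten AP values, one per entry of `iou_thresholds()`.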

Experiment configuration and training

The model is trained on a server running Ubuntu 16.04 with an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40 GHz and an NVIDIA GeForce RTX 2080 Ti graphics card (with 12G video memory). The model is built on the PyTorch deep learning framework, and the development environment is PyTorch 1.4, CUDA 10.1 and Python 3.7.

During training, we use the Adam optimizer with an initial learning rate of 0.01, a weight decay of 0.0001, a momentum of 0.9, and a batch size of 16. Single-scale training is used in all experiments, and the input image size is $640 \times 640$ pixels. In keeping with the characteristics of the model, the pre-trained weights yolov5s.pt are used for initialization. Each experiment runs for 200 epochs.

As Figures 17 and 18 show, after 200 epochs of training, the loss curves and the evaluation indicator curves of the model become stable, which indicates that the model has converged.

Figure 17.

Loss function curves of the model.

Figure 18.

Line chart of evaluation indicator values.

Comparison experimental analysis of object detection performance

Table 5 compares the AD-YOLOv5s model with the YOLOv5s model. The model that only embeds CBAM modules is named CBAM-YOLOv5s, the model with the optimized feature fusion layer is named improved-YOLOv5s, and the model that combines structural lightweight design with feature enhancement is named AD-YOLOv5s. As shown in Table 5, introducing the CBAM attention mechanism improves the mAP, Recall and $mAP_{@0.5:0.95}$ over the YOLOv5s model, and the mAP of CBAM-YOLOv5s increases by 2.3%. This experiment shows that introducing the CBAM attention mechanism into the original model effectively improves its feature extraction ability and enhances the detection performance for small UAV targets.

Table 5.

Comparison results of different models.

Model	mAP/%	Recall/%	$mAP_{@0.5:0.95}$/%
YOLOv5s	85.2	83.8	80.1
improved-YOLOv5s	90.6	89.3	84.5
CBAM-YOLOv5s	87.5	85.9	82.6
YOLOv5s with GIoU loss	85.2	83.8	-
YOLOv5s with CIoU loss	85.5	85.2	80.5
YOLOv5s with EIoU loss	86.0	85.8	81.0
AD-YOLOv5s	87.4	85.6	82.5

The $mAP_{@0.5:0.95}$ of the improved-YOLOv5s model increases from 80.1% to 84.5%, its Recall increases by 5.5% compared with the original model, and its mAP increases by 5.4%, which shows that the feature enhancement method significantly improves small UAV detection. Using the EIoU loss function increases the mAP of the YOLOv5s model by 0.8% and the Recall by 2.0%. Among the GIoU, CIoU and EIoU loss functions, EIoU yields the greatest improvement in model performance, so it is the most suitable for the low-altitude small UAV detection task in this paper.

Finally, Table 5 shows that, compared with the YOLOv5s model, the proposed AD-YOLOv5s model achieves a 2.2% increase in mAP, a 1.8% increase in Recall, and a 2.4% increase in $mAP_{@0.5:0.95}$. The proposed model thus outperforms the baseline YOLOv5s on all three metrics. In addition, to illustrate the effect of the proposed network structure, Figure 19 compares selected test results of the YOLOv5s and AD-YOLOv5s models.

Figure 19.

Comparison of test results of different models.

Figure 19 shows that, after improving the feature fusion layer of the YOLOv5s model, adding the CBAM attention mechanism and improving the loss function, the new model detects small UAVs more accurately and with a lower false detection rate, which demonstrates the feasibility of the method proposed in this paper.
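The gains quoted in this subsection are differences, in percentage points, between rows of Table 5. The sketch below recomputes them from the table values; the dictionary layout and the function name `gains` are chosen here for illustration.

```python
# (mAP, Recall, mAP@0.5:0.95) in percent, copied from Table 5.
table5 = {
    "YOLOv5s": (85.2, 83.8, 80.1),
    "improved-YOLOv5s": (90.6, 89.3, 84.5),
    "CBAM-YOLOv5s": (87.5, 85.9, 82.6),
    "YOLOv5s with EIoU loss": (86.0, 85.8, 81.0),
    "AD-YOLOv5s": (87.4, 85.6, 82.5),
}


def gains(model, baseline="YOLOv5s"):
    """Percentage-point gains of `model` over `baseline` for each metric."""
    return tuple(
        round(new - old, 1) for new, old in zip(table5[model], table5[baseline])
    )


for name in table5:
    if name != "YOLOv5s":
        print(name, gains(name))
# AD-YOLOv5s -> (2.2, 1.8, 2.4)
```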

Comparison experimental analysis of model lightweight

Table 6 analyzes the AD-YOLOv5s network framework of this paper.

Table 6.

AD-YOLOv5s network framework.

	Stage	Args	Input size	Convolutional kernel	Output sizes
0	Focus	[3, 32, 3]	[640,640,3]	3*3	[320,320,32]
1	Conv	[32, 64, 3, 2]	[320,320,32]	3*3,S=2	[160,160,64]
2	CBAM	–	[160,160,64]	-	[160,160,64]
3	Ghost-BottleneckCSP	[64, 64, 1]	[160,160,64]	1*1	[160,160,64]
4	Conv	[64, 128, 3, 2]	[160,160,64]	3*3,S=2	[80,80,128]
5	CBAM	–	[80,80,128]	–	[80,80,128]
6	Ghost-BottleneckCSP	[128, 128, 3]	[80,80,128]	3*3	[80,80,128]
7	Conv	[128, 256, 3, 2]	[80,80,128]	3*3,S=2	[40,40,256]
8	CBAM	–	[40,40,256]	–	[40,40,256]
9	Ghost-BottleneckCSP	[256, 256, 3]	[40,40,256]	3*3	[40,40,256]
10	Conv	[256, 512, 3, 2]	[40,40,256]	3*3,S=2	[20,20,512]
11	SPP	[512, 512, [5, 9, 13]]	[20,20,512]	5*5,9*9,13*13	[20,20,512]
12	BottleneckCSP	[512, 512, 1, False]	[20,20,512]	1*1	[20,20,512]
13	Conv	[512, 256, 1, 1]	[20,20,512]	1*1,S=1	[20,20,256]
14	Upsample	[None, 2, ’nearest’]	[20,20,256]	–	[40,40,256]
15	Concat	[1]	[40,40,256],[40,40,256]	–	[40,40,512]
16	BottleneckCSP	[512, 256, 1, False]	[40,40,512]	1*1	[40,40,256]
17	Conv	[256, 128, 1, 1]	[40,40,256]	1*1	[40,40,128]
18	Upsample	[None, 2, ’nearest’]	[40,40,128]	–	[80,80,128]
19	Concat	[1]	[80,80,128],[80,80,128]	–	[80,80,256]
20	BottleneckCSP	[256, 128, 1, False]	[80,80,256]	1*1	[80,80,128]
21	Conv	[128, 64, 3, 2]	[80,80,128]	3*3	[80,80,64]
22	Upsample	[None, 2, ’nearest’]	[80,80,64]	–	[160,160,64]
23	Concat	[1]	[160,160,64],[160,160,64]	–	[160,160,128]
24	BottleneckCSP	[128, 64, 1, False]	[160,160,128]	1*1	[160,160,64]
	Conv	–	[160,160,128]	–	[80,80,128]
	Concat	[1]	[80,80,128],[80,80,128]	–	[80,80,256]
	BottleneckCSP	[256, 128, 1, False]	[80,80,256]	1*1	[80,80,128]
25	Conv	[128, 128, 3, 2]	[80,80,128]	3*3,S=2	[40,40,128]
26	Concat	[1]	[40,40,128],[40,40,128]	-	[40,40,256]
27	BottleneckCSP	[256, 256, 1, False]	[40,40,256]	1*1	[40,40,256]
28	Detect	[1, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [64,128, 256]]
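The spatial sizes in the backbone rows of Table 6 follow from the standard convolution output-size formula. The sketch below assumes a padding of `kernel // 2`, which is not listed in the table, and treats the Focus layer as a plain halving of the resolution; the function name is hypothetical.

```python
def conv_out(size: int, kernel: int, stride: int, padding=None) -> int:
    """Output width/height of a convolution on a square feature map."""
    if padding is None:
        padding = kernel // 2  # assumption: 'same'-style padding
    return (size + 2 * padding - kernel) // stride + 1


size = 640 // 2  # Focus layer (row 0): 640 -> 320
sizes = [size]

# Rows 1, 4, 7 and 10 of Table 6: 3*3 convolutions with stride 2.
for _ in range(4):
    size = conv_out(size, kernel=3, stride=2)
    sizes.append(size)

print(sizes)  # [320, 160, 80, 40, 20]
```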

We calculate the parameters and GFLOPs of the network models, as shown in Table 7. The AD-YOLOv5s network model in this paper has about 4.32 million parameters, whereas the original YOLOv5s network model has about 7.06 million. Compared with the original model, the number of parameters of AD-YOLOv5s is reduced by 38.8%, and its GFLOPs are 70.1% of those of the original model. The experimental results show that the parameter count and computational complexity of the AD-YOLOv5s network model are lower than those of the YOLOv5s network model.

Table 7.

Comparison of parameters and GFLOPs.

Model	Parameters	GFLOPs
YOLOv5s	7.06M	16.4
AD-YOLOv5s	4.32M	11.5
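The percentages derived from Table 7 can be rechecked directly. This is a minimal sketch; the helper `reduction_pct` is introduced here for illustration.

```python
def reduction_pct(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return round((before - after) / before * 100, 1)


# Values copied from Table 7.
params_reduction = reduction_pct(7.06, 4.32)    # parameters, in millions
gflops_reduction = reduction_pct(16.4, 11.5)    # GFLOPs
gflops_remaining = round(11.5 / 16.4 * 100, 1)  # share of the original GFLOPs

print(params_reduction, gflops_reduction, gflops_remaining)  # 38.8 29.9 70.1
```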

We perform a comparative experiment on the lightweight optimization of the model. In this experiment, TensorRT is used to accelerate model inference, and the suffix "-trt" denotes an accelerated model. Table 8 shows the comparison results.

Table 8.

Comparison of mAP, FLOPs and frame rate with TensorRT acceleration.

Detection method	mAP/%	FLOPs	Frame rate/fps
YOLOv5s	85.2	16.4GFLOPs	12.6
YOLOv5s-trt	84.9	16.4GFLOPs	20.5
Ghost-bottleneckCSP-YOLOv5s-trt	86.1	12.4GFLOPs	24.3
DWConv-YOLOv5s-trt	87.2	15.2GFLOPs	23.5
AD-YOLOv5s-trt	87.2	11.5GFLOPs	27.6

Table 8 shows that TensorRT acceleration greatly improves the detection speed of each network model. First, compared with the YOLOv5s model, the YOLOv5s-trt model loses 0.3% mAP (from 85.2% to 84.9%), while its frame rate rises from 12.6 fps to 20.5 fps, about 1.6 times the original detection speed. Second, compared with YOLOv5s-trt, Ghost-bottleneckCSP-YOLOv5s-trt improves the mAP by 1.2% (from 84.9% to 86.1%) and also reduces the GFLOPs by 24.3%. Third, compared with YOLOv5s-trt, the DWConv-YOLOv5s-trt model reduces the GFLOPs by 7.3% and improves the mAP by 2.3% (from 84.9% to 87.2%). Finally, compared with the YOLOv5s model, the AD-YOLOv5s-trt model improves the mAP by 2% and reduces the GFLOPs by 29.9%, and it raises the frame rate to 27.6 fps. The experiment shows that depthwise separable convolution can reduce the computational complexity while maintaining the model's detection performance.
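The speed figures in Table 8 can be turned into speed-up factors and per-frame latencies with a short calculation. The sketch below uses only the values in the table; the dictionary layout and function names are chosen here for illustration.

```python
# (mAP in %, GFLOPs, frame rate in fps), copied from Table 8.
table8 = {
    "YOLOv5s": (85.2, 16.4, 12.6),
    "YOLOv5s-trt": (84.9, 16.4, 20.5),
    "Ghost-bottleneckCSP-YOLOv5s-trt": (86.1, 12.4, 24.3),
    "DWConv-YOLOv5s-trt": (87.2, 15.2, 23.5),
    "AD-YOLOv5s-trt": (87.2, 11.5, 27.6),
}


def speedup(model, baseline="YOLOv5s"):
    """Frame-rate ratio of `model` to `baseline`."""
    return round(table8[model][2] / table8[baseline][2], 2)


def latency_ms(model):
    """Average time per frame in milliseconds."""
    return round(1000.0 / table8[model][2], 1)


for name in table8:
    print(f"{name}: {speedup(name)}x, {latency_ms(name)} ms/frame")
# YOLOv5s-trt -> 1.63x, AD-YOLOv5s-trt -> 2.19x and 36.2 ms/frame
```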

The proposed method enables the YOLOv5s algorithm to achieve real-time detection on embedded devices. To show the real-time detection effect, we run the YOLOv5s and AD-YOLOv5s-trt models and show the detection results at the bottom of Figure 19. The proposed AD-YOLOv5s model detects low-altitude small UAVs effectively and improves the detection rate. However, the model's confidence for small objects is strongly affected by the environment. This issue needs to be studied in future work.

Conclusion

This paper proposes an AD-YOLOv5s model for low-altitude small UAV detection. First, we propose a feature enhancement optimization method to address small object detection and introduce the EIoU loss function to replace the original GIoU loss function. Second, based on the ghost module and depthwise separable convolution, we optimize the feature extraction network (backbone) and the feature fusion layer (neck) of the YOLOv5s model. Finally, we use TensorRT to accelerate the model on the Jetson Nano. The experimental results show that the proposed model improves the mAP by 2.2% and the Recall by 1.8%, and reduces the GFLOPs by 29.9% and the parameters by 38.8%. Deployed on a low-cost edge computing device (Jetson Nano), the proposed model raises the detection frame rate from 12.6 FPS to 27.6 FPS, meeting both the detection accuracy required by low-altitude security tasks and the lightweight requirements of deployment.

Footnotes

Declaration of conflicting interests

The authors declare that there are no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work is supported by the National Key R&D Program of China, No.2022YFC3320800 and Zhejiang Provincial Key R&D Plan of China, No.2021C01040.

ORCID iD

Dawei Qiu

References

1. Liu, Zhou, Yuan, et al. Economically optimal ms association for multimedia content delivery in cache-enabled heterogeneous cloud radio access networks. IEEE J Sel Areas Commun 2019; 37: 1584–1593.

2. Zhou, Liu, Pan, et al. Cooperative multicast with location aware distributed mobile relay selection: performance analysis and optimized design. IEEE Trans Veh Technol 2017; 66: 8291–8302.

3. Zhou, Liu, Wang, et al. Service-aware 6g: An intelligent and open network based on the convergence of communication, computing and caching. Digital Commun Netw 2020; 6: 253–260.

4. Luo, Yan, et al. Autonomous detection of damage to multiple steel surfaces from 360 panoramas using deep neural networks. Comput-Aided Civ Infrastruct Eng 2021; 36: 1585–1599.

5. Yuan, Xia, et al. Low altitude small uav detection based on yolo model. In: 2020 39th Chinese Control Conference (CCC). IEEE, 2020, pp. 7362–7366.

6. Shuzheng, Sun, Liying, et al. Uav intrusion detection based on software defined radio. Res Explor Lab 2018; 37: 64–67.

7. Bernardini, Mangiatordi, Pallotti, et al. Drone detection by acoustic signature identification. Electron Imaging 2017; 10: 60–64.

8. Ritchie Matthew, Fioranelli Francesco, Borrion Hervé, et al. Multistatic micro-doppler radar feature extraction for classification of unloaded/loaded micro-drones. IET Radar, Sonar Nav 2017; 11: 116–124.

9. Park, Kim, Shin, et al. A comparison of convolutional object detectors for real-time drone tracking using a ptz camera. In: 2017 17th International Conference on Control, Automation and Systems (ICCAS), 2017.

10. Chen, Yin, Wang, et al. Low-altitude protection technology of anti-uavs based on multisource detection information fusion. Int J Adv Robot Syst 2020; 17: 1729881420962907.

11. Zhou, Liu, et al. Traffic-aware task offloading based on convergence of communication and sensing in vehicular edge computing. IEEE Internet Things J 2021; 8: 17762–17777.

12. Bochkovskiy, Wang, Liao HYM. Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020.

13. Redmon, Divvala, Girshick, et al. You only look once: Unified, real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. pp. 779–788.

14. Redmon, Farhadi. Yolo9000: better, faster, stronger. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2017. pp. 7263–7271.

15. Redmon, Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.

16. Liu, Anguelov, Erhan, et al. Ssd: Single shot multibox detector. In: European conference on computer vision. Springer, 2016. pp. 21–37.

17. Girshick, Donahue, Darrell, et al. Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2014. pp. 580–587.

18. Girshick. Fast r-cnn. In: Proceedings of the IEEE international conference on computer vision. 2015, pp. 1440–1448.

19. Ren, Girshick, et al. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv Neural Inf Process Syst 2015; 28: 1–9.

20. Gkioxari, Dollár, et al. Mask r-cnn. In: Proceedings of the IEEE international conference on computer vision. 2017. pp. 2961–2969.

21. Zhang, Ren, Zhang, et al. Focal and efficient iou loss for accurate bounding box regression. Neurocomputing 2022; 506: 146–157.

22. Song, Shui. Research on the acceleration effect of tensorrt in deep learning. Sci J Intell Syst Res Vol 2019; 1: 45–50.

23. Mejias, McNamara, Lai, et al. Vision-based detection and tracking of aerial targets for uav collision avoidance. In: 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2010. pp. 87–92.

24. Rozantsev, Lepetit, Fua. Detecting flying objects using a single moving camera. IEEE Trans Pattern Anal Mach Intell 2016; 39: 879–892.

25. Gao, Feng, et al. Small uav detection in videos from a single moving camera. In: Computer Vision: Second CCF Chinese Conference, CCCV 2017, Tianjin, China, October 11–14, 2017, Proceedings, Part III. Springer, 2017. pp. 187–197.

26. Sommer, Schumann, Müller, et al. Flying object detection for automatic uav recognition. In: 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). IEEE, 2017. pp. 1–6.

27. Duan, Mao, et al. Diagonalnet: Confidence diagonal lines for the uav detection. IEEJ Trans Electr Electron Eng 2019; 14: 1364–1371.

28. Seidaliyeva, Akhmetov, Ilipbayeva, et al. Real-time and accurate drone detection in a video with a static background. Sensors 2020; 20: 3856.

29. Makirin, Wastupranata, Daffa. Onboard visual drone detection for drone chasing and collision avoidance. In: AIP Conference Proceedings, volume 2366. AIP Publishing LLC, 2021. p. 060013.

30. Lin, Dollár, Girshick, et al. Feature pyramid networks for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2017. pp. 2117–2125.

31. Liu, Qin, et al. Path aggregation network for instance segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2018. pp. 8759–8768.

32. Woo, Park, Lee, et al. Cbam: Convolutional block attention module. In: Proceedings of the European conference on computer vision (ECCV). 2018. pp. 3–19.

33. Shen, Sun. Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2018. pp. 7132–7141.

34. Borah, Sahu. Ecasr: Efficient channel attention based super-resolution. In: International Conference on Computer Vision and Image Processing. Springer, 2020. pp. 374–386.

35. Rezatofighi, Tsoi, Gwak, et al. Generalized intersection over union: A metric and a loss for bounding box regression. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019. pp. 658–666.

36. Wang, Song. Iciou: Improved loss based on complete intersection over union for bounding box regression. IEEE Access 2021; 9: 105686.

37. Howard, Zhu, Chen, et al. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

38. Kaiser, Gomez, Chollet. Depthwise separable convolutions for neural machine translation. arXiv preprint arXiv:1706.03059, 2017.

39. Jeong, Kim, Tan, et al. Deep learning inference parallelization on heterogeneous processors with tensorrt. IEEE Embed Syst Lett 2021; 14: 15–18.

AD-YOLOv5s based UAV detection for low altitude security
