Abstract
Accurate identification and monitoring of aircraft on the airport surface can assist managers in rational scheduling, reduce the probability of aircraft conflicts, and is of significant application value for constructing a "smart airport." Airport surface video monitoring faces complex conditions: aircraft targets can be small, aircraft can obscure one another, and different weather conditions lower the clarity of aircraft targets. In this paper, a lightweight network model for aircraft recognition in airport surface video under complex environments is proposed, based on the SSD network and incorporating a coordinate attention mechanism. First, the model uses a newly designed lightweight feature extraction network with five feature extraction layers. Each feature extraction layer consists of two modules, Block_A and Block_I. The Block_A module incorporates the coordinate attention mechanism and the channel attention mechanism to improve the detection of obscured aircraft and to enhance the detection of small targets. The Block_I module uses multi-scale feature fusion to extract feature information with rich semantics, enhancing the feature extraction capability of the network in complex environments. The designed feature extraction network is then applied to an improved SSD detection algorithm, which raises the recognition accuracy for airport surface aircraft in complex environments. The model was tested, and ablation experiments were conducted, under different complex weather conditions. The results show that, compared with the Faster R-CNN, SSD, and YOLOv3 models, the detection accuracy of the improved model increased by 3.2%, 14.3%, and 10.9%, respectively, while the model parameters were reduced by 83.9%, 73.1%, and 78.2%, respectively. Compared with the YOLOv5 model, the model parameters are reduced by 38.9% at similar detection accuracy, and the detection speed is increased by 24.4%, reaching 38.2 fps, which meets the demand for real-time detection of aircraft on the airport surface.
Introduction
With the rapid development of the aviation industry, airport areas are expanding, flight volumes are increasing rapidly, and traffic conditions on runways, taxiways, aprons, and other airport areas are becoming more and more complex. The probability of aircraft conflicts on the airport surface is also increasing, which places higher requirements on the operational safety and efficiency of airports [1]. In recent years, countries have paid more and more attention to the construction of "intelligent airports"; in particular, some airports currently under construction or renovation are beginning to introduce technological means to enhance security measures, and airport surface aircraft detection, as an important part of intelligent airport construction, has gradually attracted attention [2]. Accurate identification and detection of aircraft on the airport surface can assist managers in rational scheduling and reduce the probability of aircraft conflicts on the airport surface. Currently, most aircraft detection research focuses on aircraft detection in remote sensing images [3, 4]. The difference between airport surface aircraft detection and remote sensing aircraft detection is that surveillance images of the airport surface do not capture the complete outline of the aircraft the way remote sensing aerial images do, and aircraft in airport surface video surveillance vary in size and attitude [5]. Moreover, under different weather conditions (snow, rain, fog, night), the clarity of the targets captured by the monitoring equipment is relatively low, which significantly increases the difficulty of airport surface aircraft target detection.
To address the above problems, this paper proposes a lightweight aircraft recognition model for the airport surface in complex environments. First, we design a new lightweight feature extraction network with five feature extraction layers, each consisting of two modules, Block_A and Block_I. The Block_A module incorporates a coordinate attention mechanism and a channel attention mechanism. The coordinate attention mechanism enables the network to focus on a larger range of location information, which helps the model better localize and identify the target. The channel attention mechanism is used to enhance feature extraction for small targets in images. The Block_I module is mainly used for feature map scaling and uses a multi-scale feature map fusion method to generate feature maps with rich semantic information, enhancing the feature extraction capability of the network. Second, the designed feature extraction network is applied to the SSD target detection network, and the SSD network is improved to increase target recognition accuracy. Experiments show that the proposed lightweight recognition model accurately recognizes aircraft on the airport surface in complex environments with superior detection speed, meeting the demand for real-time detection of aircraft on the airport surface.
The main contributions of this paper are as follows:
(1) We design a new lightweight feature extraction network that incorporates attention mechanisms to obtain richer semantic information from the original image and to fully extract aircraft features of the airport scene in complex environments.
(2) We improve the SSD network and apply the designed feature extraction network to it, which improves target detection accuracy and effectively identifies aircraft on the airport surface.
(3) Our improved recognition model is lightweight, allowing it to be deployed on resource-constrained devices, which will further promote the application of computer vision in smart airport construction.
Related work
In recent years, target detection methods based on convolutional neural networks have performed very well in different application areas, for example, face recognition [6, 8–10], pest recognition [10–13], defect detection [14–16], medical image pathology detection [17–19], and target detection in remote sensing images [20–22]. Commonly used target detection methods fall into two categories: one-stage detection algorithms, mainly YOLO [23], SSD [24], YOLO9000 [25], YOLOv3 [26], and so on; and two-stage detection algorithms, mainly R-CNN [27], Fast R-CNN [28], Faster R-CNN [29], SPPNet [30], and so on. The SSD algorithm borrows ideas from both YOLO and Faster R-CNN: from YOLO, the regression-based design that directly regresses the class and location of objects; from Faster R-CNN, the design of candidate regions; and it outputs feature maps of different scales for detection. The original SSD algorithm uses low-level feature maps to recognize small targets, resulting in low recognition accuracy. Improvements to the SSD algorithm for insufficient small target detection focus on using different feature extraction networks and adding attention mechanisms to the network. Different feature extraction networks achieve different results according to their characteristics. The feature extraction networks commonly used in SSD algorithms include VGG [31], ResNet [32], and MobileNet [33]. ResNet increases the network depth compared to VGG and can improve feature extraction ability. Fu et al. [34] designed the DSSD network, and Yi et al. [35] designed the ASSD network, using ResNet-101 instead of VGG as the feature extraction network to improve the detection accuracy of small targets. However, using ResNet-101 as the feature extraction network leads to a large number of training parameters and reduced detection speed. MobileNet is a lightweight feature extraction network that is superior in terms of model parameters and inference speed. Chiu et al. [36] used MobileNet-v2 [37] as the feature extraction network of the SSD algorithm to meet the requirements of running on an embedded platform, improving detection speed. However, the feature extraction capability of MobileNet-v2 is weaker than that of ResNet, resulting in insufficient detection capability for small targets.
Attention mechanisms [38–42], inspired by properties of the human retina, have been proposed and widely used. An attention mechanism can enhance model performance by assigning dynamic weights that reinforce key information according to its importance. The main attention mechanisms used in computer vision are the channel attention mechanism [38, 43–45], the spatial attention mechanism [39, 46–48], and attention mechanisms that fuse channel and spatial attention [41, 50]. A typical channel attention mechanism is the Squeeze-and-Excitation Network (SENet) [38], which captures the importance of each channel of the feature map and uses this importance to assign a weight to each feature, allowing the neural network to focus on specific feature channels. A representative spatial attention mechanism is the Spatial Transformer Network (STN) [39]; in target detection, not all regions of an image contribute equally to the task, and STN can focus on the regions relevant to the task. Channel attention focuses the network on "what" is in the image, while spatial attention focuses on "where" the object is in the image. Woo et al. [41] proposed the Convolutional Block Attention Module (CBAM) in 2018, which merges channel attention with spatial attention. To solve the problem that ordinary attention mechanisms are poorly suited to mobile networks, Hou et al. [51] proposed Coordinate Attention (CA) in 2021, which embeds location information into channel attention, allowing mobile networks to attend to larger regions without introducing large overheads.
In the field of aircraft detection on airport surfaces, few papers are available due to the specificity of industry data. The complexity of the airport surface environment poses a significant challenge to the accuracy of aircraft detection. For the problem of incomplete target contours and varying attitudes in aircraft detection, Dai et al. [52] proposed a static aircraft detection method for the airport surface based on Faster R-CNN and multi-part combination. Li et al. [5] proposed an airport aircraft detection method based on a part model and distance tradeoff to improve detection accuracy. To address the difficulty of detecting small aircraft targets, Guo et al. [53], building on the YOLOv3 detection network, replaced the convolutional layers in the backbone network with dilated convolutions to maintain higher resolution and a larger receptive field, improving the model's accuracy for small target detection. Han et al. [54] proposed a small target detection algorithm for airport surfaces based on Faster R-CNN combined with multi-scale feature fusion. Li et al. [55] constructed a new feature extraction network, RPDNet4, designed a four-scale prediction module, and used an adjacent-scale feature fusion technique to fuse features at different abstraction levels to improve the detection accuracy of small aircraft targets at airports.
Materials and methods
SSD detection algorithm
The SSD detection algorithm uses a multi-layer feature map structure to learn semantic information hierarchically, so that low-level feature maps detect small targets and high-level feature maps detect large targets, and finally uses non-maximum suppression (NMS) to remove duplicate prediction boxes and keep the best ones. Detecting on feature maps of different scales can significantly improve target detection accuracy. The original SSD network uses VGG16 as the base model and adds new convolutional layers to VGG16 to obtain more feature maps for detection. The structure of the SSD network is shown in Fig 1; it consists of two parts, the base network and the extended network. The algorithm uses the outputs of the conv4_3, conv7, conv8_2, conv9_2, conv10_2, and conv11_2 layers as feature maps at different scales for detection. The sizes of the corresponding feature maps are 38×38×512, 19×19×1024, 10×10×512, 5×5×256, 3×3×256, and 1×1×256, respectively. The six feature maps of different sizes perform classification and location regression for objects of different sizes.

Original SSD network structure.
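As an illustration of this final filtering step, the sketch below applies confidence thresholding and NMS to the decoded predictions of a single class using torchvision; the threshold values are typical defaults and all names are illustrative, not taken from the paper's implementation.

```python
# Minimal sketch of SSD-style post-processing for one class, assuming
# boxes have already been decoded from the default boxes of all scales.
import torch
from torchvision.ops import nms

def filter_detections(boxes, scores, score_thresh=0.5, iou_thresh=0.45):
    """boxes: (N, 4) in (x1, y1, x2, y2); scores: (N,) class confidences."""
    keep = scores > score_thresh          # drop low-confidence predictions
    boxes, scores = boxes[keep], scores[keep]
    idx = nms(boxes, scores, iou_thresh)  # suppress duplicate overlapping boxes
    return boxes[idx], scores[idx]
```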
The loss function of the original SSD detection algorithm is the weighted sum of the localization loss and the confidence loss, as shown in formula (1):

L(x, c, l, g) = \frac{1}{N}\big(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\big)    (1)

In formula (1), N is the number of default boxes matched to ground-truth boxes, x is the matching indicator, c denotes the class confidences, l denotes the predicted boxes, g denotes the ground-truth boxes, and \alpha is the weight that balances the two losses.

The localization loss is a Smooth L1 loss between the predicted box and the ground-truth box, as shown in formula (2):

L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k}\, \mathrm{smooth}_{L1}\big(l_i^m - \hat{g}_j^m\big)    (2)

In formula (2), x_{ij}^{k} \in \{0, 1\} indicates whether the i-th default box is matched to the j-th ground-truth box of class k, and \hat{g}_j^m is the encoded offset (center coordinates cx, cy, width w, and height h) of the ground-truth box relative to the default box.

The confidence loss is a softmax loss over the class confidences, as shown in formulas (3) and (4):

L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log\big(\hat{c}_i^{p}\big) - \sum_{i \in Neg} \log\big(\hat{c}_i^{0}\big)    (3)

\hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_{p} \exp(c_i^{p})}    (4)

In formulas (3) and (4), \hat{c}_i^{p} is the softmax-normalized confidence that the i-th default box belongs to class p, Pos and Neg are the sets of positive and negative default boxes, and class 0 denotes the background.
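For concreteness, the following is a minimal PyTorch sketch of this MultiBox loss. It assumes ground-truth boxes have already been matched to default boxes and encoded as offsets, and it omits the hard negative mining used by the full SSD loss; all names are illustrative.

```python
# Simplified sketch of the MultiBox loss in formulas (1)-(4).
import torch
import torch.nn.functional as F

def multibox_loss(loc_pred, conf_pred, loc_gt, labels, alpha=1.0):
    """loc_pred: (B, D, 4); conf_pred: (B, D, K); loc_gt: (B, D, 4);
    labels: (B, D) class indices with 0 = background."""
    pos = labels > 0                                  # positive default boxes
    num_pos = pos.sum().clamp(min=1)
    # Localization loss: Smooth L1 over positive boxes only, formula (2)
    l_loc = F.smooth_l1_loss(loc_pred[pos], loc_gt[pos], reduction="sum")
    # Confidence loss: softmax cross-entropy, formulas (3) and (4)
    l_conf = F.cross_entropy(conf_pred.reshape(-1, conf_pred.size(-1)),
                             labels.reshape(-1), reduction="sum")
    return (l_conf + alpha * l_loc) / num_pos         # formula (1)
```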
As a new and efficient attention mechanism, coordinate attention enables lightweight networks to obtain information over larger areas by embedding location information into channel attention, reducing the number of attention module parameters while avoiding excessive computational overhead. Its structure is shown in Fig 2.

Coordinate attention structure.
The specific operation of the CA module is as follows. Assume the input feature map has size C×H×W, where C is the number of channels, H the height, and W the width. The CA module first applies global average pooling to the input feature map along the horizontal (X) and vertical (Y) directions to obtain a pair of direction-aware feature maps, as shown in formulas (5) and (6):

z_c^{h}(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i)    (5)

z_c^{w}(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w)    (6)

In formulas (5) and (6), x_c is channel c of the input feature map, z_c^{h}(h) is the pooled output of channel c at height h, and z_c^{w}(w) is the pooled output of channel c at width w.

The feature maps output by the average pooling in the x-direction and the y-direction are concatenated and passed through a shared 1×1 convolutional transformation, followed by batch normalization and a nonlinear activation, to obtain the intermediate feature map, as shown in formula (7):

f = \delta\big(F_1([z^{h}, z^{w}])\big)    (7)

In formula (7), [\cdot, \cdot] denotes concatenation along the spatial dimension, F_1 is the shared 1×1 convolutional transformation, \delta is the nonlinear activation function, and f \in \mathbb{R}^{C/r \times (H+W)} is the intermediate feature map, where the reduction ratio r controls the module size.

After normalization and nonlinear processing, f is split along the spatial dimension into two separate tensors f^{h} and f^{w}. Two 1×1 convolutions F_h and F_w transform them back to the original number of channels, and the sigmoid function \sigma produces the attention weights in the two directions, as shown in formulas (8) and (9):

g^{h} = \sigma\big(F_h(f^{h})\big)    (8)

g^{w} = \sigma\big(F_w(f^{w})\big)    (9)

Using this approach reduces the model's complexity and computational overhead. The obtained attention weights are expanded and multiplied element-wise with the input feature map to produce the final output of the coordinate attention module, as shown in formula (10):

y_c(i, j) = x_c(i, j) \times g_c^{h}(i) \times g_c^{w}(j)    (10)
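The following PyTorch sketch implements formulas (5) to (10); the reduction ratio default and the use of ReLU in place of the nonlinearity \delta are assumptions on our part.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Sketch of the CA module, formulas (5)-(10)."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)            # C/r channels
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # average over width,  formula (5)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # average over height, formula (6)
        self.conv1 = nn.Conv2d(channels, mid, 1)       # shared 1x1 transform F1
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)               # stand-in for delta
        self.conv_h = nn.Conv2d(mid, channels, 1)      # F_h, formula (8)
        self.conv_w = nn.Conv2d(mid, channels, 1)      # F_w, formula (9)

    def forward(self, x):
        b, c, h, w = x.size()
        x_h = self.pool_h(x)                           # (B, C, H, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)       # (B, C, W, 1)
        y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))  # formula (7)
        f_h, f_w = torch.split(y, [h, w], dim=2)       # split back into two maps
        g_h = torch.sigmoid(self.conv_h(f_h))                       # (B, C, H, 1)
        g_w = torch.sigmoid(self.conv_w(f_w.permute(0, 1, 3, 2)))   # (B, C, 1, W)
        return x * g_h * g_w                           # formula (10)
```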
The structure of our proposed feature extraction network MatbNet is shown in Fig 3; it consists of two main modules, Block_A and Block_I (Figs 4 and 5). The Block_A module consists of pointwise convolution, 3×3 depthwise convolution, the ReLU6 activation function, an SE module, a linear activation function, a CA module, and a summation operation. The Block_I module consists of 5×5 depthwise convolution, 3×3 depthwise convolution, pointwise convolution, the ReLU6 activation function, a linear activation function, a CA module, and a summation operation.

MatbNet feature extraction network.

Block_A module.

Block_I module.
The Block_A module is based on the inverted residual block in MobileNet-v2. MobileNet-v2 replaces ordinary convolution with depthwise separable convolution, composed of depthwise convolution and pointwise convolution. The depthwise convolution convolves each channel of the input feature map separately, with the number of convolution kernels equal to the number of input channels. The pointwise convolution uses 1×1 convolution kernels whose channel number equals the number of channels of the input feature map. For a 3×3 kernel, the depthwise separable structure reduces the number of parameters by roughly a factor of nine compared with regular convolution, making it a lightweight structure well suited to embedded devices. After the inverted residual block's pointwise convolution and linear activation, we add the CA module, which encodes channel relationships and long-range dependencies and enables the network to focus on a larger range of location information. The CA module improves the algorithm's ability to learn and process details, enhances the extraction of aircraft features in the complex environment of the airport surface, and further improves the accuracy of aircraft detection on the airport surface. Many small targets are present in airport surface aircraft detection, and the SE attention mechanism has been shown to be helpful for small target detection, so we add the SE module to the proposed Block_A module to further enhance the detection of small targets on the airport surface. The Block_I module is based on the inception module in GoogLeNet [56]. The original inception module contains many branches; although this improves model performance to some extent, it also introduces more parameters and increases computational complexity. Considering the need for a lightweight model, we propose a feature fusion module with two branches based on the inception idea. Convolution kernels of different sizes imply different receptive fields, and the summation operation fuses features at different scales, enabling more effective extraction of feature information.
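To make the module structure concrete, the following PyTorch sketch gives our reading of Block_A from Fig 4: an inverted residual block with an SE block after the depthwise convolution and a CA block after the linear pointwise projection. The exact layer ordering and the SE reduction ratio are assumptions; CoordinateAttention refers to the sketch given earlier.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Minimal squeeze-and-excitation block (reduction ratio assumed)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid())

    def forward(self, x):
        return x * self.fc(x)                      # reweight channels

class BlockA(nn.Module):
    """Sketch of Block_A: inverted residual + SE + CA."""
    def __init__(self, c_in, c_out, stride=1, expand=2):
        super().__init__()
        mid = c_in * expand
        self.use_res = stride == 1 and c_in == c_out
        self.body = nn.Sequential(
            nn.Conv2d(c_in, mid, 1, bias=False),   # pointwise expansion
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, mid, 3, stride, 1, groups=mid, bias=False),  # 3x3 depthwise
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            SEBlock(mid),                          # channel attention
            nn.Conv2d(mid, c_out, 1, bias=False),  # linear pointwise projection
            nn.BatchNorm2d(c_out),
            CoordinateAttention(c_out))            # coordinate attention

    def forward(self, x):
        y = self.body(x)
        return x + y if self.use_res else y        # residual summation
```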
The feature extraction network MatbNet is set up with five feature extraction layers in the direction from input to output; the differently colored modules in Fig 3 show the first to fifth feature extraction layers. The first and second feature extraction layers each consist of one Block_I module followed by two Block_A modules, arranged in order from input to output. The third, fourth, and fifth feature extraction layers each consist of one Block_I module followed by three Block_A modules, arranged in order from input to output. The specific parameter settings of the feature extraction network are shown in Fig 6.

Parameter settings of feature extraction network.
For the first feature extraction layer, the parameter settings are as follows:
(1) An input image of size 512×512×3 passes through one Block_I module with step size s = 2, output channel number c = 32, and channel expansion factor e = 2, producing a 256×256×32 tensor.
(2) The output of (1) passes through one Block_A module with s = 1, c = 32, and e = 2, producing a 256×256×32 tensor.
(3) The output of (2) passes through one Block_A module with s = 1, c = 32, and e = 2, producing a 256×256×32 tensor.
For the second, third, fourth, and fifth feature extraction layers, the parameter setting steps are the same as for the first feature extraction layer.
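Under the same assumptions, the first layer's shape progression can be traced step by step. The Block_I sketch below is our reading of the two-branch structure in Fig 5 (a pointwise expansion followed by parallel 5×5 and 3×3 depthwise branches whose outputs are summed); it reuses BlockA and CoordinateAttention from the earlier sketches.

```python
import torch
import torch.nn as nn

class BlockI(nn.Module):
    """Sketch of Block_I: two depthwise branches fused by summation."""
    def __init__(self, c_in, c_out, stride=1, expand=2):
        super().__init__()
        mid = c_in * expand
        self.expand = nn.Sequential(
            nn.Conv2d(c_in, mid, 1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True))
        self.branch5 = nn.Conv2d(mid, mid, 5, stride, 2, groups=mid, bias=False)
        self.branch3 = nn.Conv2d(mid, mid, 3, stride, 1, groups=mid, bias=False)
        self.project = nn.Sequential(
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, c_out, 1, bias=False),  # linear pointwise projection
            nn.BatchNorm2d(c_out),
            CoordinateAttention(c_out))

    def forward(self, x):
        y = self.expand(x)
        y = self.branch5(y) + self.branch3(y)      # multi-scale feature fusion
        return self.project(y)

# Shape trace of the first feature extraction layer, steps (1)-(3):
layer1 = nn.Sequential(
    BlockI(3, 32, stride=2, expand=2),   # 512x512x3  -> 256x256x32
    BlockA(32, 32, stride=1, expand=2),  # 256x256x32 -> 256x256x32
    BlockA(32, 32, stride=1, expand=2),  # 256x256x32 -> 256x256x32
)
print(layer1(torch.randn(1, 3, 512, 512)).shape)  # torch.Size([1, 32, 256, 256])
```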
The original SSD network resizes the input image to 300×300×3 before detection. In airport surface aircraft detection, resizing the image to 300×300 leaves most small targets with only a few pixels, or makes them disappear entirely, so that small target features are no longer apparent and small target detection accuracy cannot meet the task requirements. The improved SSD network (shown in Fig 7) resizes the input image to 512×512×3, making the lower-layer feature maps rich in small target features and thereby enhancing small target detection. The lower-layer feature maps detect smaller targets, and the higher-layer feature maps detect larger targets. Studies have shown that the highest-layer feature maps contribute little to the detection task. Based on the characteristics of the airport surface aircraft detection dataset, the improved SSD network uses feature maps at five different scales for the classification and location regression of objects of different sizes. The improved SSD network uses the proposed MatbNet feature extraction network as the base network, and two new convolutional layers are added after MatbNet to obtain more feature maps for the detection task. MatbNet outputs feature maps at three scales, 64×64×96, 32×32×128, and 16×16×256, and the two newly added convolutional layers output feature maps at two further scales, 8×8×512 and 4×4×256.

The improved SSD network.
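To make the detection wiring concrete, the sketch below shows how five feature maps with the channel counts listed above could feed per-scale classification and localization heads; the number of default boxes per cell and the class count are assumptions, not values taken from the paper.

```python
# Sketch of per-scale SSD heads over the five MatbNet feature maps.
import torch
import torch.nn as nn

NUM_CLASSES, BOXES_PER_CELL = 2, 4          # aircraft + background; assumed anchors
scale_channels = [96, 128, 256, 512, 256]   # 64x64, 32x32, 16x16, 8x8, 4x4 maps

cls_heads = nn.ModuleList(
    nn.Conv2d(c, BOXES_PER_CELL * NUM_CLASSES, 3, padding=1) for c in scale_channels)
loc_heads = nn.ModuleList(
    nn.Conv2d(c, BOXES_PER_CELL * 4, 3, padding=1) for c in scale_channels)

def detect(feature_maps):
    """feature_maps: five tensors from MatbNet and the two extra layers."""
    cls_out = [h(f).permute(0, 2, 3, 1).flatten(1)
               for h, f in zip(cls_heads, feature_maps)]
    loc_out = [h(f).permute(0, 2, 3, 1).flatten(1)
               for h, f in zip(loc_heads, feature_maps)]
    # Concatenate predictions from all scales into one tensor per task
    return torch.cat(cls_out, 1), torch.cat(loc_out, 1)
```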
Experimental dataset
The dataset used in the experiments comes from surveillance video of Zhengzhou Xinzheng International Airport. The acquired videos were pre-processed by extracting one image every 70 frames, yielding 4146 images at four resolutions: 1920×1080, 1920×1200, 1858×974, and 640×640. The images include different aircraft types, scenes with a single passenger aircraft, scenes with multiple passenger aircraft, and other small-target images, covering various weather conditions including sunny, foggy, rainy, snowy, and night scenes. The images were annotated with the labelImg tool to build a VOC-format dataset and generate the corresponding configuration files. The labeled dataset was divided as follows: the training set contains 2984 images, the validation set 747 images, and the test set 415 images.
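As an illustration of the frame-sampling step, the following OpenCV sketch extracts one frame every 70; the paths and file naming are placeholders, not the authors' actual data layout.

```python
# Sample one frame every `step` frames from a surveillance video.
import os
import cv2

def extract_frames(video_path, out_dir, step=70):
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:                                   # end of video
            break
        if idx % step == 0:                          # keep one frame per 70
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:05d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```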
Experimental environment
The experiments were conducted on the Ubuntu 18.04 operating system with an Intel® Core processor.

Model training loss.
Evaluation metrics
To better evaluate the improved detection model, Average Precision (AP), Recall, frames per second (FPS), and the number of model parameters are used as evaluation metrics. Precision and Recall are computed as shown in formulas (11) and (12):

Precision = \frac{TP}{TP + FP}    (11)

Recall = \frac{TP}{TP + FN}    (12)

In formulas (11) and (12), TP (true positives) is the number of aircraft correctly detected, FP (false positives) is the number of detections that do not correspond to an aircraft, and FN (false negatives) is the number of aircraft that are missed. AP is the area under the precision-recall curve.
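As a quick reference, the sketch below computes precision, recall, and AP from formulas (11) and (12), assuming detections have already been sorted by descending confidence and matched to ground truth (the matching step is omitted).

```python
# Precision/recall curve and AP for one class, formulas (11)-(12).
import numpy as np

def average_precision(tp, fp, num_gt):
    """tp, fp: 0/1 arrays over detections sorted by descending score;
    num_gt: total number of ground-truth objects."""
    tp_cum, fp_cum = np.cumsum(tp), np.cumsum(fp)
    precision = tp_cum / (tp_cum + fp_cum)    # formula (11)
    recall = tp_cum / num_gt                  # formula (12)
    # AP as the area under the precision-recall curve
    return np.trapz(precision, recall)
```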
Comparison experiments
To better show the advantages of the improved model, we conducted comparison experiments against the Faster R-CNN, SSD, YOLOv3, and YOLOv5 models. All models were trained and validated on the same dataset, and the comparison results in four aspects, AP, Recall, FPS, and model parameters, are shown in Table 1.
Comparison of the performance of different models
As shown in Table 1, compared with the Faster R-CNN model, the improved model achieves similar average precision and recall, but processes five times as many frames per second, with about 1/6 of the model parameters. The improved model outperforms the YOLOv3 and SSD models in all four aspects: average precision, recall, frames per second, and model parameters. YOLOv5 is slightly superior to the improved model in average precision and recall, but its parameters are about 1.6 times those of our improved model, and its inference speed is slightly lower. The improved model reaches an average precision of 95.6%, processes 38.2 frames per second, and has 25.6 M parameters, making it a lightweight detection model that meets the requirements of real-time aircraft detection on the airport surface.
To further validate the effectiveness of the improved model, the detection results of Faster R-CNN (Fig 9), SSD (Fig 10), YOLOv3 (Fig 11), YOLOv5 (Fig 12), and the improved model (Fig 13) are visualized under different environments.

Detection results of Faster R-CNN in different environments.

Detection results of SSD in different environments.

Detection results of YOLOv3 in different environments.

Detection results of YOLOv5 in different environments.

Detection results of the improved model in different environments.
In the sunny environment of part (a) of the figures, SSD, YOLOv3, and YOLOv5 miss small targets far away on the airport surface, while Faster R-CNN and the improved model detect them well. In the evening environment of part (b), Faster R-CNN, SSD, and YOLOv3 do not extract enough semantic information to separate the target from the background and miss detections under the influence of airport lighting, whereas YOLOv5 and the improved model give better results. In the night environment of part (c), Faster R-CNN, SSD, and YOLOv3 cannot identify the aircraft on the airport runway, while YOLOv5 and the improved model identify them accurately. In the rainy environment of part (d), SSD and YOLOv3 cannot accurately recognize partially obscured aircraft, while Faster R-CNN, YOLOv5, and the improved model detect them well. In the foggy environment of part (e), Faster R-CNN, SSD, and YOLOv3 cannot recognize aircraft in fog, while YOLOv5 and the improved model fully extract aircraft features in fog and recognize them accurately. In the smoggy environment of part (f), SSD and YOLOv3 cannot accurately identify aircraft in images blurred by haze, while Faster R-CNN, YOLOv5, and the improved model detect them well.
Ablation experiments
To evaluate the performance of our proposed method, and in particular to verify the contribution of the two attention modules, we designed ablation experiments on the CA module and the SE module; the results are shown in Table 2. The results show an improvement in accuracy in each case, indicating that each attention module is valuable. Without the SE module, small targets in the image are missed during detection (Fig 14), which indicates that the SE attention module further enhances the detection of small targets on the airport surface. Without the CA module, aircraft features are not sufficiently extracted in the complex environment of the airport surface, and aircraft targets are missed in complex environments (Fig 15). The best results are achieved when the two modules are used in combination, indicating that our method is feasible and effective.
Comparison of results of ablation experiments

Without SE test results.

Without CA test results.
Conclusion
This paper addresses the difficulty of detecting aircraft targets on the airport surface in complex environments, caused by the limitations of monitoring equipment and weather conditions, and meets the demand for real-time aircraft detection on the airport surface. We propose a lightweight aircraft recognition model for the airport surface in complex environments based on the SSD network and attention mechanisms. First, a new feature extraction network, MatbNet, is designed based on MobileNet-v2 and the inception module in GoogLeNet; it contains two modules, Block_A and Block_I. Block_A is used to improve the detection of obscured aircraft and enhance the detection of small targets. Block_I is used for feature fusion at different scales, extracting feature information with rich semantics and enhancing the feature extraction capability of the network in complex environments. Second, the designed feature extraction network is used in the improved SSD network. The experimental results show that the proposed method outperforms the Faster R-CNN, SSD, and YOLOv3 detection algorithms in detection accuracy and model parameters, achieving 95.6% accuracy with 25.6 M parameters. With detection accuracy close to that of YOLOv5, the model's detection speed of 38.2 fps is superior and meets the demand for real-time detection. The experiments show that the proposed method recognizes aircraft on the airport surface well in complex environments.
However, the method in this paper is less effective for small target detection in night scenes. This is because small aircraft targets on distant runways have very low clarity under night lighting, so effective feature information cannot be extracted and the targets are difficult to recognize; as a result, the proposed method does not achieve higher accuracy in such scenes. In future research, we will focus on recognizing small targets in nighttime environments through image feature enhancement and small target position regression, and we will deploy the proposed model in practical applications.
Acknowledgments
The authors express gratitude to the Open Fund of the Research Platform of the Grain Information Processing Center at Henan University of Technology (No. KFJJ2022012) and the Key Scientific Research Projects of Colleges and Universities in Henan Province (No. 23A170013).
Conflict of interest
The authors declare that they do not have any conflict of interest.
