Abstract
This paper addresses critical challenges in Unmanned Aerial Vehicle (UAV) target detection and 3D positioning, specifically inaccuracies in localization and lack of robustness in complex environments. The objective of this research is to improve UAV detection and positioning accuracy by proposing an enhanced Deformable DETR (Detection Transformer) model integrated with multi-view geometry theory. To achieve this, the study first preprocesses UAV-collected data, then optimizes the convolutional layers of the original Deformable DETR model to better handle object occlusion and scale variations. Furthermore, the research incorporates multi-view geometric modeling and multimodal fusion strategies to enhance detection accuracy during the target recognition process. Experimental results demonstrate that the proposed approach achieves over 70% detection accuracy, significantly outperforming traditional methods. The findings underscore the effectiveness of combining the improved Deformable DETR model with multi-view geometry for high-precision detection and 3D localization in complex environments. This research has significant implications for UAV-based applications, such as autonomous navigation, surveillance, and search-and-rescue missions, where precise target detection and 3D positioning are critical for successful operation.
Introduction
UAV (Unmanned Aerial Vehicle) object detection and 3D positioning are critical fields within computer vision.1,2 With the rapid advancement of both object detection and UAV technologies, UAV object detection algorithms have gained widespread application across various domains.3,4 However, detecting objects in aerial views captured by UAVs presents significant challenges. These include the high density and distribution of small objects, substantial variations in object scales, and the complex, dynamic environments depicted in multiple UAV aerial images. Consequently, adapting object detection algorithms designed for ground-level or frontal perspectives to UAV aerial views proves challenging.5,6
In object detection, numerous models have emerged with the goal of enhancing precision while maintaining computational efficiency. Notable examples include EfficientDet, which builds on the EfficientNet design philosophy to achieve efficient object detection.7,8 DETR (Detection Transformer) recasts object detection as a sequence-to-sequence task, leveraging a Transformer architecture for end-to-end detection and positioning.9,10 FCOS (Fully Convolutional One-Stage) is a simple and efficient single-stage detector based on fully convolutional networks.11,12 RepPoints (Representative Points) performs detection and positioning through point regression and exhibits strong robustness.13,14 CenterNet, in turn, detects objects through center point prediction.15,16 For specific applications, such as vehicle recognition under weak lighting and harsh weather conditions, Wang et al.17 proposed RODNet (Radar Object Detection Network), which achieves an average accuracy of 86%. For very-high-resolution (VHR) remote sensing images, Gong et al.18 proposed the CA-CNN (Context-Aware Convolutional Neural Network) model to mine the underlying contextual information between objects. Despite this progress, these models remain limited to a single perspective, which makes it difficult to fully exploit the multi-view geometric information of a scene, and their performance may degrade in complex scenarios.19,20 Existing models therefore still have limitations and require further improvement to fully address the challenges of object detection.
Object detection based on multiple view geometry has demonstrated significant potential to improve detection and positioning accuracy by combining information from diverse perspectives. On a small-sample dataset, Xiao et al.21 applied a straightforward category-agnostic viewpoint estimation method that integrates information from different viewpoints, surpassing the performance of established benchmarks such as PASCAL (Pattern Analysis, Statistical Modelling and Computational Learning) and COCO (Common Objects in Context). Considering multiple view geometry, Zheng et al.22 proposed the CIoU (Complete-Intersection over Union) loss and Cluster-NMS (Non-Maximum Suppression), which notably improved the average precision and recall of detection. The DETR model, recognized for its efficiency, combines several advanced components, including a decoder,23,24 an encoder,25,26 and feedforward neural networks.27,28 It maintains high accuracy while offering fast inference, making it suitable for object detection in complex scenarios.29 Nevertheless, the performance of existing methods deteriorates in large-scale scenarios and is often poor under complex occlusion, so researchers continue to refine existing models to improve their robustness in object detection. Jiang et al.30 refined the Deformable DETR model with a dynamic attention mechanism and multi-scale feature fusion for vehicle detection in UAV video streams, performing multi-level feature extraction on vehicle targets to strengthen the detection of small and fast-moving targets. Cui et al.31 optimized the feature extraction network of the Deformable DETR model and introduced adaptive light correction and color enhancement to mitigate illumination variation and color distortion in underwater images; the refined model performed well in complex underwater environments and low-visibility conditions. Zhang et al.32 designed a novel attention module and bounding box regression mechanism on top of the Deformable DETR model, enabling it to process text of arbitrary shape in natural scenes with high-precision detection. Hamid et al.33 surveyed deep-learning-based stereo matching algorithms and found that they have achieved remarkable progress. Based on this research, the Deformable DETR model is well suited to improving the robustness of object detection, and integrating it with multiple view geometry information holds promise for precise object detection and positioning by UAVs.
In summary, to improve UAV object detection and 3D positioning performance, this paper first collects data with UAVs and preprocesses it through denoising and related operations. The convolutional layers are then refined to address the shortcomings of the original Deformable DETR model, and 3D positioning is improved by incorporating multiple view geometry. The resulting data are further fused, so that the fused features support more accurate 3D positioning. The experimental results show that the proposed method improves the accuracy of UAV 3D positioning and detection in complex environments.
Data preprocessing
In the first step of data processing, the image data collected from different UAV perspectives are converted into a common reference frame, and the disparity between two views is computed with a stereo matching algorithm, from which the positions of points in three-dimensional space are derived. The calculation formula is:
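The original formula is not reproduced here. As a standard assumption for a rectified stereo pair with focal length $f$, baseline $B$, principal point $(c_x, c_y)$, and disparity $d$ at pixel $(u, v)$, the recovered 3D point is

$$Z = \frac{fB}{d}, \qquad X = \frac{(u - c_x)Z}{f}, \qquad Y = \frac{(v - c_y)Z}{f}.$$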
Under variable lighting conditions, multiple Gaussian filters at different scales are used to estimate the illumination, and the results are combined by weighted averaging:
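As the original expression is not reproduced, the following is a minimal sketch of this step, assuming a weighted average of Gaussian-smoothed copies of the image; the scales and equal weights are illustrative, not the paper's settings:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def estimate_illumination(img, sigmas=(15, 80, 250), weights=None):
    """Multi-scale Gaussian illumination estimate (illustrative parameters)."""
    weights = weights or [1.0 / len(sigmas)] * len(sigmas)
    img = img.astype(np.float64)
    # Weighted average of the image smoothed at several Gaussian scales.
    return sum(w * gaussian_filter(img, sigma=s) for w, s in zip(weights, sigmas))
```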
To enhance image contrast and detail, histogram equalization is first applied globally: the gray levels of the image are redistributed to expand its dynamic range. The adjustment process is as follows:
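The original adjustment formula is not reproduced here; in its standard form, for an $M \times N$ image with $L$ gray levels and $n_j$ pixels at level $j$, the equalized output for input level $k$ is

$$s_k = (L - 1)\sum_{j=0}^{k}\frac{n_j}{MN}.$$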
Then, adaptive histogram equalization is applied to local regions: histogram bins exceeding a clip threshold are truncated, and the clipped excess is redistributed evenly across all gray levels. The formula is as follows:
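The paper's clipping formula is not reproduced here; the sketch below illustrates the two-stage adjustment with OpenCV, where the clip limit and tile grid are illustrative assumptions rather than the paper's reported settings:

```python
import cv2

gray = cv2.imread("uav_frame.png", cv2.IMREAD_GRAYSCALE)
globally_eq = cv2.equalizeHist(gray)  # global redistribution of gray levels
# Contrast-limited adaptive equalization: per-tile histograms are clipped and
# the excess is redistributed before equalizing each local region.
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
locally_eq = clahe.apply(globally_eq)
```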
To meet the challenges of object detection and positioning in complex environments, additional annotation is applied to the images captured by UAVs. Using a strategy that combines manual and automated labeling, specific object types and positional details are marked in the captured images. These clearly labeled data guide model training, helping the model understand the characteristics and positions of object entities in depth, thereby achieving accurate object detection and 3D positioning.
Deformable DETR model optimization
Optimizing the Deformable DETR model involves increasing the depth and width of the convolutional layers of the original architecture and applying residual connections to accelerate information flow.34
Increasing the depth of the convolutional layers means stacking more convolutional layers in the model architecture, which improves the model's ability to extract image features. Increasing the width means adding more channels, which diversifies the dimensions along which information is processed.
Applying residual connections not only promotes the efficient flow of information in deep networks but also allows signals to skip several layers directly, alleviating vanishing and exploding gradients and significantly improving training speed and generalization.
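As a minimal sketch of such a deepened, residual convolutional block (the layer sizes and normalization choices are illustrative assumptions, not the paper's configuration):

```python
import torch
from torch import nn

class ResidualConvBlock(nn.Module):
    """Illustrative residual block for a deepened/widened backbone."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The identity shortcut lets gradients bypass the convolutional stack,
        # mitigating vanishing/exploding gradients as the network deepens.
        return self.act(x + self.body(x))
```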
Through the above operations, the high-speed detection characteristics of Deformable DETR are retained while its accuracy and adaptability are improved, meeting the requirements of higher-level computer vision tasks.
In addition, to ensure the accuracy of object positioning and confidence, a new loss function is further designed, with the following formula:
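The exact formula is not reproduced here (its symbols are defined in Table 1). As an assumption for orientation, loss functions in this family typically take the DETR-style composite form

$$\mathcal{L} = \lambda_{cls}\,\mathcal{L}_{cls} + \lambda_{L1}\left\| b - \hat{b} \right\|_{1} + \lambda_{giou}\,\mathcal{L}_{GIoU}\big(b, \hat{b}\big),$$

weighting a classification term against L1 and generalized-IoU box regression terms.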
The definition of the loss function formula is shown in Table 1.
Table 1. Definition of loss function formula.
The optimized Deformable DETR model is used to perform object detection on a randomly obtained image, requiring the recognition of the yellow building in the image. The specific effect is shown in Figure 1.

Figure 1. Optimized model detection effect.
According to the recognition task, the original image in Figure 1 has a total of two detection points. It is evident that the optimized Deformable DETR model accurately completes the recognition task.
Multiple view geometry modeling
The multiple view geometry modeling method can accurately capture the three-dimensional structure and morphology of target objects through multi-angle, multi-camera techniques.35 Observing the target from various perspectives and combining the data collected from them yields a deeper understanding of the target's detailed characteristics and morphology, enabling more accurate construction and reconstruction of the 3D model.
The five selected R-series UAVs have built-in cameras, and the object position expression under each camera is:
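The original expression is not reproduced here. Under the standard assumption that camera $i$ has rotation $R_i$ and translation $t_i$, a world point $X_w$ is expressed in that camera's coordinate system as

$$X_{c_i} = R_i X_w + t_i.$$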
Observing the same object from different perspectives, it is found that the projection coordinates of the object on the camera’s image plane are:
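Again, the paper's expression and symbol definitions are not reproduced; under the standard pinhole model with intrinsic matrix $K$, the projection onto camera $i$'s image plane is

$$\lambda \begin{bmatrix} u_i \\ v_i \\ 1 \end{bmatrix} = K\left(R_i X_w + t_i\right),$$

where $\lambda$ is the projective depth and $(u_i, v_i)$ are the image coordinates.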
Next, triangulation is used to convert the projection coordinates of the object from the five R-series camera perspectives into normalized image-plane coordinates. The three-dimensional coordinates of the object in the camera coordinate system are then computed from the normalized plane coordinates and the camera poses (rotations and translations).
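A minimal sketch of this triangulation step, assuming normalized coordinates and known 3x4 pose matrices $[R_i \mid t_i]$ (the linear DLT formulation; the paper's exact procedure may differ):

```python
import numpy as np

def triangulate(points_2d, poses):
    """Linear (DLT) triangulation of one 3D point from N views.

    points_2d: list of (x, y) normalized image-plane coordinates, one per camera.
    poses:     list of 3x4 matrices [R_i | t_i]; intrinsics are already removed
               because the coordinates are normalized.
    """
    A = []
    for (x, y), P in zip(points_2d, poses):
        # Each view contributes two linear constraints on the homogeneous point.
        A.append(x * P[2] - P[0])
        A.append(y * P[2] - P[1])
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(A))
    X = vt[-1]
    return X[:3] / X[3]  # de-homogenize
```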
To comply with multiple view geometry optimization, the objective function is used to optimize the calculated 3D coordinates:
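The objective function itself is not reproduced; a standard choice, assumed here, is to minimize the total squared reprojection error over the five views:

$$\min_{X}\sum_{i=1}^{5}\left\| x_i - \pi\left(R_i X + t_i\right)\right\|^{2}, \qquad \pi\left([X, Y, Z]^{\top}\right) = \left(X/Z,\; Y/Z\right).$$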
Finally, the three-dimensional coordinates are converted into coordinates in a unified coordinate system to obtain the accurate position of the object in three-dimensional space.
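Assuming the extrinsics above map world coordinates to camera coordinates, this final conversion simply inverts the rigid transform of the chosen camera:

$$X_w = R_i^{\top}\left(X_{c_i} - t_i\right).$$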
Multimodal fusion
LiDAR is an active sensor that acquires high-precision 3D point clouds for modeling the environment in three dimensions and extracting object information from it. To organize the collected detection information, the LiDAR point cloud is segmented with a Euclidean-distance-based K-means clustering algorithm into clusters, each representing an object. Through this segmentation and clustering, the target objects in the point cloud are extracted, and their position and shape features are computed.
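A minimal sketch of this segmentation step, assuming the number of objects is known in advance (the function name and feature choices below are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def segment_point_cloud(points: np.ndarray, n_objects: int):
    """Cluster an (N, 3) LiDAR point cloud into candidate objects with
    Euclidean-distance K-means, then compute simple per-object features."""
    labels = KMeans(n_clusters=n_objects, n_init=10).fit_predict(points)
    objects = []
    for k in range(n_objects):
        cluster = points[labels == k]
        objects.append({
            "centroid": cluster.mean(axis=0),                     # position feature
            "extent": cluster.max(axis=0) - cluster.min(axis=0),  # shape (bounding box) feature
        })
    return objects
```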
The image object data obtained through the improved Deformable DETR model is fused with the data obtained from LiDAR, and the fusion steps are shown in Figure 2.

Figure 2. Data fusion steps.
As shown in Figure 2, features are extracted from the acquired data to obtain image features and point cloud features, respectively. To relate the two and put them into correspondence, the optical flow method is first used for alignment and matching, followed by direct fusion to generate multimodal features. Based on this multimodal representation, the motion state of the object is estimated and predicted, achieving precise 3D positioning and tracking.
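A minimal sketch of the alignment-then-fusion step, assuming dense Farneback flow between two grayscale frames and channel-wise concatenation as the direct fusion (all names and parameters here are illustrative, not the paper's pipeline):

```python
import cv2
import numpy as np

def fuse_features(prev_gray, curr_gray, curr_image_feat, lidar_feat):
    """Warp current-frame image features into the previous frame's coordinates
    using dense optical flow, then fuse with LiDAR features by concatenation."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = prev_gray.shape
    grid = np.dstack(np.meshgrid(np.arange(w), np.arange(h))).astype(np.float32)
    # Farneback gives prev(x) ~ curr(x + flow(x)), so sampling curr at
    # grid + flow aligns the current features to the previous frame.
    remap = grid + flow
    aligned = cv2.remap(curr_image_feat, remap[..., 0], remap[..., 1],
                        cv2.INTER_LINEAR)
    # Direct fusion: channel-wise concatenation of the two modalities.
    return np.concatenate([aligned, lidar_feat], axis=-1)
```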
Experimental verification
To verify the effectiveness of multi-UAV object detection and 3D positioning based on the improved Deformable DETR model and multiple view geometry, comparative experiments are conducted against multi-UAV object detection and 3D positioning based on the improved YOLOv4 model, the DETR model, and multiple view geometry alone. The experiments revolve around three randomly selected scenarios (shown in Figure 3) for detection and positioning.

Figure 3. Comparative experimental scenarios.
In Figure 3, the images are named Scenario 1, Scenario 2, and Scenario 3 in order from left to right.
In this paper, R1 denotes the improved YOLOv4 model, R2 the DETR model, R3 the use of multiple view geometry alone, and R4 the proposed approach, which integrates the improved Deformable DETR model with multiple view geometry for multi-UAV object detection and 3D positioning.
One of the performance indicators for evaluating object detection in R1, R2, R3, and R4 is confidence. In object detection tasks, each detection box carries a confidence score indicating whether the box contains an object and how certain the model is of this judgment. In this paper, the optimal distribution interval for detection confidence is taken as (0.6, 1); the closer the confidence is to 1, the more reliable the detection result.
First, R1, R2, R3, and R4 are used to detect Scenario 1. In general, more detections and fewer misdetections indicate better performance. The object detection of Scenario 1 by R1 is shown in Figure 4.

Figure 4. Object detection of R1 in Scenario 1: (a) initial object detection of R1 in Scenario 1 and (b) object misdetection of R1 in Scenario 1.
Figure 4 presents the object detection of R1 in Scenario 1. From Figure 4(a), the total number of detections made by R1 is 17. The yellow dashed boxes in Figure 4(b) mark misdetections (i.e., false positives), of which there are four.
The object detection of Scenario 1 by R2 is shown in Figure 5.

Figure 5. Object detection of R2 in Scenario 1: (a) initial object detection of R2 in Scenario 1 and (b) object misdetection of R2 in Scenario 1.
Figure 5 presents the object detection of R2 in Scenario 1. From Figure 5(a), the total number of detections made by R2 is 19, and the number of misdetections in Figure 5(b) is 2.
The object detection of Scenario 1 by R3 is shown in Figure 6.

Figure 6. Object detection of R3 in Scenario 1: (a) initial object detection of R3 in Scenario 1 and (b) object misdetection of R3 in Scenario 1.
Figure 6 presents the object detection of R3 in Scenario 1. From Figure 6(a), the total number of detections made by R3 is 20, and the number of misdetections in Figure 6(b) is 4.
The object detection of Scenario 1 by R4 is shown in Figure 7.

Figure 7. Object detection of R4 in Scenario 1: (a) initial object detection of R4 in Scenario 1 and (b) object misdetection of R4 in Scenario 1.
Figure 7 presents the object detection of R4 in Scenario 1. From Figure 7(a), the total number of detections made by R4 is 34, and the number of misdetections in Figure 7(b) is 0.
According to the data summarized in Figures 4 to 7, the distribution of object detection confidence is shown in Table 2.
Table 2. Distribution of object detection confidence.
According to the object detection confidence summarized in Table 2, it can be found that the object detection confidence of R1 is most distributed within (0.4, 0.5); the object detection confidence of R2 is most distributed within (0.6, 0.7), followed by (0.4, 0.5); the object detection confidence of R3 is most distributed within (0.5, 0.6); the object detection confidence of R4 is most distributed within (0.7, 0.9).
In addition, the numbers of detections falling within the optimal confidence interval for R1, R2, R3, and R4 are 1, 8, 4, and 26, respectively.
These data clearly demonstrate the advantages of R4 for object detection: it not only produces more detections with fewer misdetections, but its confidence is also higher than that of the other three methods.
Then, based on the data in Figures 4 to 7, the accuracy of object detection is summarized, as shown in Table 3.
Table 3. Object detection accuracy.
In Table 3, R4 has the highest accuracy in object detection, followed by R2. This further proves that R4 performs excellently in 3D positioning in Scenario 1 and has a stronger ability to extract object feature information.
Following the same experimental procedure as for Scenario 1, R1, R2, R3, and R4 are used to detect Scenario 2. The difference is that Scenario 2 has fewer and simpler constituent elements, and the detection targets are dogs, rabbits, and grass. Scenarios with very few elements generally yield few misdetections but carry a high risk of missed detections, which makes object detection and 3D positioning in Scenario 2 particularly challenging: accurate 3D positioning makes it easier to identify indistinct elements.
The detection of Scenario 2 by R1, R2, R3, and R4 is shown in Figure 8.

Figure 8. Object detection in Scenario 2.
In Figure 8, it is evident that R4 completes the detection task in Scenario 2 very well, successfully identifying the grass element given in the scenario. R1, R2, and R3 identify only the two obvious elements, the dog and the rabbit, which shows that R4 has the best 3D positioning effect.
Although the confidence levels of R1, R2, R3, and R4 in object detection are all within the optimal range, R4's confidence is the most prominent, with its highest value particularly close to 1.
After completing the object detection for Scenario 1 and Scenario 2, the object detection for Scenario 3 is carried out. Compared to Scenarios 1 and 2, Scenario 3 has more numerous and more complex elements. As before, R1, R2, R3, and R4 are used to detect Scenario 3. The detection results are shown in Figures 9 to 12.

Figure 9. Object detection of R1 in Scenario 3.

Figure 10. Object detection of R2 in Scenario 3.

Figure 11. Object detection of R3 in Scenario 3.

Figure 12. Object detection of R4 in Scenario 3.
Scenario 3 contains 11 object categories to be detected, with 15 object instances in total. According to Figure 9, R1's detection in Scenario 3 includes three detections that do not meet the standards and four missed detections. In addition, the number of R1's detections within the optimal confidence interval is 4.
Figure 10 shows the object detection of R2 in Scenario 3. R2's detection also includes three detections that do not meet the standards, along with three missed detections. The number of R2's detections within the optimal confidence interval is 2.
Figure 11 shows the object detection of R3 in Scenario 3. R3's detection includes two detections that do not meet the standards and two missed detections. The number of R3's detections within the optimal confidence interval is 6.
Figure 12 shows the object detection of R4 in Scenario 3. All of R4's detections meet the standards, with no missed detections, and all 15 of them fall within the optimal confidence interval.
Based on the confidence data of R1 to R4 in Scenarios 1, 2, and 3, the mean confidence is calculated, as shown in Table 4.
Table 4. Mean confidence data.
According to the mean confidence values in Table 4, R4 has the highest mean confidence in every scenario.
Based on a comparison between the positioning results derived from the UAVs' Global Positioning System (GPS) position and attitude information in Scenario 1 and the actual positions, the 3D positioning errors are calculated, as shown in Table 5.
Table 5. 3D positioning performance of multiple UAVs under different methods.
In Table 5, R4 shows the highest positioning accuracy and the lowest average positioning error in Scenario 1. In particular, R4's positioning accuracy reaches 95%, far exceeding the other methods.
Conclusions
This paper combined an improved Deformable DETR model with multiple view geometry for multi-UAV object detection and 3D positioning. Preprocessing the data collected by UAVs provided a clean data foundation for subsequent processing. By refining the convolutional layers, applying residual structures, and designing a new loss function, the Deformable DETR model was optimized to improve the accuracy and robustness of object detection. Multiple view geometry modeling and multimodal fusion were then applied to determine object information, further improving the accuracy of detection and positioning. Finally, experimental verification in different scenarios showed that the proposed method produced the fewest misdetections and missed detections, verifying its effectiveness and reliability for UAV object detection and 3D positioning. Future research can further explore algorithmic optimization and performance improvement to meet growing application demands and challenges.
Footnotes
Handling Editor: Divyam Semwal
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported in part by Shaanxi Provincial Science and Technology Department under Grant 2023-YBGY-342, and in part by the National Natural Science Foundation of China under Grant 62073256.
