Abstract
Absolute visual localization is of significant importance for unmanned aerial vehicles when satellite-based localization is unavailable. With the rapid evolution of deep learning, real-time visual detection and tracking of landmarks can be implemented onboard an unmanned aerial vehicle. This study demonstrates a landmark-based visual localization framework for unmanned aerial vehicles flying at low altitudes. YOLOv5 and DeepSORT are used for multi-object detection and tracking, respectively. The unmanned aerial vehicle localization is achieved from the geometric similarity between the geotagged transmission towers and their annotations in images captured by a monocular camera. Validation is accomplished both in an Rflysim-based simulation and in a quadrotor-based real flight. The localization precision is about 10 m, and the location update frequency reaches 5 Hz on a commercially available entry-level edge artificial intelligence platform. The proposed visual localization strategy needs no satellite image as a reference map, which saves a significant amount of GPU memory and makes end-to-end implementation on small unmanned aerial vehicles possible.
Introduction
Nowadays, small-sized unmanned aerial vehicles (UAVs) are becoming popular solutions for expanding military and civilian applications. Applications of UAVs have emerged in scenarios like military and defense,1,2 post-disaster assessment,3 infrastructure inspection,4 precision agriculture,5 flight testing,6 and so on. To navigate autonomously, UAVs in these applications depend highly on Global Navigation Satellite Systems (GNSS), such as GPS and Beidou, to acquire their locations. However, signals from GNSS satellites can be degraded when the receiver is close to vegetation, water bodies, hostile environments, and inside structures.7 These situations make it impossible to get precise latitude and longitude estimates of the UAV during flight. An auxiliary or redundant localization approach is needed to ensure UAV navigation when GNSS signals cannot be received.
Setting GNSS aside, common sensors capable of ranging and supporting a localization system on a UAV include cameras, radar, and Light Detection and Ranging (LiDAR). Cameras of various types can be used to establish a vision-based localization system. Stereoscopic cameras and depth cameras already have reliable UAV ranging applications.8 However, their limited range9,10 confines their localization to indoor or short-distance scenarios. Radar and LiDAR are common ranging sensors, but their size, weight, cost, and power consumption make them more suitable for large-scale UAVs.11 The monocular camera may be the cheapest among these sensors, but its limitation is that ranging information is not directly measured. In this study, a vision-based localization framework using a monocular camera is explored.
Vision-based UAV localization comprises two main approaches: relative visual localization (RVL) and absolute visual localization (AVL). The essential difference between them is whether the visual localization is accomplished by matching from frame to frame or from frame to reference. Popular RVL methods include visual odometry (VO)12 and simultaneous localization and mapping (SLAM).13 The core issue with RVL is similar to localization solely with an inertial measurement unit (IMU): the accumulation of error makes the localization drift over time. Although SLAM outperforms VO by mitigating the drift problem with bundle adjustment and loop closure detection, long-distance localization still needs regular correction by absolute localization information.
A state-of-the-art review of AVL for UAVs is given by Couturier and Akhloufi.14 AVL is inherently immune to drift over time since precisely geo-referenced information is used during localization. According to Couturier and Akhloufi,14 major AVL methods can be divided into template matching, feature point matching, deep learning, and modified VO, based on their matching mechanism. Template matching15 directly uses the UAV image as a template to search in a reference map. In contrast, feature point matching16 first detects the feature points in the UAV image and localizes through descriptor extraction from the reference image. Deep learning17 uses convolutional neural network (CNN)-based computer vision to deal with the feature extraction and matching problems. VO can also be used for AVL when the reference map is a set of preexisting frames with location information.18,19 The essential localization scheme in this study is based on deep learning.
The major deep learning tools used in this study are You Only Look Once (YOLO)20 and DeepSORT.21 YOLO is used for multi-object detection, and DeepSORT is used for multi-object tracking across video frames. Both are recently developed methods that can be implemented on UAVs for edge computation. They have already demonstrated their reliability in applications such as UAV surveillance,22 autonomous vehicles,23 monitoring COVID-19 social distancing,24 and droplet tracking in fluid mechanics.25 In this study, the combination of YOLO and DeepSORT is used to detect and track transmission towers as landmarks. The landmarks' geometric information is extracted from images as bounding boxes and compared with the dimensions and geo-locations of the preexisting landmarks, from which the UAV location is estimated. We validate this localization framework in both simulation and a real flight experiment. It turns out that this framework achieves real-time onboard localization with precision comparable to GPS.
Related work
Deep learning-based AVL for UAV
Although deep learning methods have caused a widespread sensation in the field of robotics, their application in AVL for UAVs is still in its infancy.14 The difficulty rests in building an end-to-end architecture for localization, since the computing resources are constrained by the limited onboard load capacity. In addition, the size of a reference map with acceptable resolution is always beyond the limit of the GPU memory, making real-time matching impractical. Several studies have made efforts in this field, and we review them below to motivate the current study.
Amer et al.26 and Mughal et al.27 used deep architectures to extract a "Deep Signature" from aerial imagery based on the unique visual appearance of the architecture and landscape style. Such information was then identified in a predetermined, geotagged image, and the localization was performed. Nassar et al.28 and Marcu et al.29 performed aerial image semantic segmentation using deep learning and matched the semantic shapes to realize localization. Goforth and Lucey18 combined deep learning and VO inside an optimization framework, where the reference map is treated in VO like a neighboring frame, and thus the localization is accomplished. Other deep learning-based UAV localization strategies include conditional generative adversarial networks30 and deep reinforcement learning,31 and more information can be found in the literature.14,32
Most of these AVL methods take advantage of deep learning to match the extracted visual features of the landscape with a generalized map. Only by comparing abstracted features from both the input image and the reference map can a lightweight, real-time, vision-based localization framework be accomplished. The proposed method follows this essential routine, and real-time UAV localization has been implemented onboard without a satellite-image reference map.
Landmark-based AVL for UAV
Landmarks are conspicuous objects that mark a locality. Landmark-based localization has already been applied in UAV applications.33,34 However, when it comes to AVL, most strategies are based on maps to get latitude and longitude in a larger region. Many methods provide UAVs with satellite or aerial imagery as a map before flight, which is used to perform template matching to realize localization.7,26–29 The challenge is that the matching might fail due to changing image-capturing conditions, such as seasonal and perspective changes.18 Even though deep learning tools can generalize over the changed features, training the CNN to remember all the reference maps accurately is impractical due to the limited GPU memory size.14 In addition, it is also worth noting that the UAV needs to fly at a favorable altitude to acquire images with a resolution comparable to the reference map, which might prevent UAVs from conducting AVL at lower altitudes.
Landmark-based AVL has many advantages. Firstly, AVL by landmarks enables the UAV to be self-aware of its surroundings, which is intuitively consistent with human navigation and beneficial to autonomous obstacle avoidance due to the enhanced perception of surrounding hazards. Secondly, AVL by landmarks requires far fewer memory resources than saving high-resolution map images, since only the learned feature parameters and the location parameters need to be stored. Considering the generalization capability of deep learning, the storage occupancy for landmarks of the same class is even less. Thirdly, AVL by landmarks can utilize the side view of the landmarks to extract more significant features, making it a useful alternative strategy when the flat-world assumption35 is not valid and planar homography is unavailable for localization.
The transmission tower is one of the most competitive landmark candidates for AVL of UAVs. Its height is comparable to the flight altitude of low-altitude UAVs, and it has conspicuous visual features that are easy to detect by computer vision.36,37 Hui et al.38 used transmission towers as landmarks to navigate a UAV for power line inspection. Schofield et al.39 developed a transmission tower detection algorithm to navigate a drone to a safe distance around towers within the power grid. In this study, the proposed localization framework depends mainly on transmission towers as landmarks. Other vision-based landmarks could include wind turbines,40 natural vegetation,41 and buildings.42 Taking advantage of extended infrastructure coverage, a UAV equipped with an AVL system can plan paths and navigate over long distances once a series of landmarks has been appropriately selected.
Contribution
We introduce a localization framework suitable for small UAVs flying at a relatively low altitude, where the flat-world assumption no longer holds. In this case, UAVs fly at a height comparable to the heights of the landmarks, and the visual features from the camera's downward view are too limited to be matched with a satellite map. The method localizes more like a human does: it depends mainly on recognizing, and estimating the distance to, typical landmarks that were previously memorized.
This method needs no satellite image as a reference map. Instead, only the landmarks' locations and geometric sizes are required, which saves a large amount of GPU memory. The unified landmark appearance also keeps the data set small when learning the features. Typical landmarks include transmission towers, light poles, wind turbines, and other upright, thin infrastructure. UAVs can locate themselves among landmarks as long as the consecutively appearing landmarks have been registered.
Our method realizes vision-based UAV localization in real time. With deep learning tools like YOLO and DeepSORT, we built an end-to-end AVL architecture for UAVs. The localization frequency reaches 5 Hz, as demonstrated during a flight in this study. It is also a lightweight system that requires only a front-facing monocular RGB camera, an onboard computer implementing landmark detection and tracking, and the preexisting landmark information. In addition, it costs much less than a LiDAR-based localization system, which makes it friendly to most small UAVs in typical GNSS-denied scenarios.
Organization
The remainder of the article is structured as follows. The methodology introducing detection, tracking, and localization is presented in the second section. Details of the implementation of the simulation and the experiment are described in the third section. Results from the simulation part and the experimental part are presented in the fourth section. Discussion is given in the fifth section. Finally, the conclusions are summarized in the sixth section.
Methodology
General process
In this study, the UAV gets its location by analyzing the video stream captured by its monocular camera. The localization framework mainly includes three parts, namely object detection, object tracking, and localization, as shown in Figure 1. We take YOLOv5 and DeepSORT to realize object detection and object tracking, respectively. Finally, the UAV’s location is acquired through a so-called localization by landmarks (LbL) process. In this section, we introduce these three modules separately.

Figure 1. Overview of the localization framework.
Object detection
YOLO is a state-of-the-art object detection algorithm that has been widely used in the community. The original YOLO algorithm was introduced in 201520 and has been continuously optimized ever since.43 YOLOv120 is known for its unique recognition scheme, which uses a single network to complete both bounding box labeling and object class prediction. The drawbacks of YOLOv1 lie in its poor generalization ability and low localization accuracy. YOLOv244 introduced a new backbone network, Darknet-19, to overcome these drawbacks. Other improvements introduced by YOLOv2 include batch normalization, a high-resolution classifier, convolution with anchor boxes, direct location prediction, and multi-scale training, which increased both detection accuracy and speed. YOLOv345 applies Darknet-53 instead of Darknet-19, which effectively improves the network's ability to extract image features, leading to an upgrade in recognition accuracy. In addition, since YOLOv3 outputs feature maps at three different scales to detect targets of different sizes, the performance of small object detection is even better. YOLOv446 uses CSPDarknet53 in the backbone network: the feature mapping of the base layer is first divided into two parts and then combined through a cross-stage hierarchy, which not only reduces the amount of computation but also preserves the accuracy of the model. A detailed introduction to the development from YOLOv1 to YOLOv5 can be found in the literature.43
In this study, YOLOv547 is used to complete the object detection tasks. A detailed picture explaining the network architecture of YOLOv5 can be found in the study by Glenn.47 YOLOv5's network structure can be divided into three parts: the backbone, neck, and head. The backbone mainly uses the Focus and CSP structures. The Focus structure is a slice operation that concentrates the plane information into the channel space without information loss, making the feature extraction more sufficient. The CSP structure comes from the Cross Stage Partial Network,47 which reduces the large amount of computation during inference. The neck introduces a multi-scale prediction mechanism comprising the feature pyramid network (FPN) and path aggregation network (PAN) structures, which enhance the feature fusion of different layers of the feature pyramid from top to bottom and from bottom to top, respectively. The head makes the final predictions by applying basic convolution operations to the neck's outputs. To further improve the detection performance, operations like mosaic data augmentation, adaptive anchor box calculation, and adaptive image scaling are applied before the network processes the image input.47 Compared with the previous versions, YOLOv5 has the advantages of a smaller model scale, lower deployment cost, higher flexibility, and faster detection speed.47
Version 5.0 of YOLOv5 is used, and it provides four different network structures: s, m, l, and x.47 These networks share a similar architecture, differing mainly in two parameters: depth_multiple and width_multiple. With the increase of these two parameters, the s, m, l, and x networks have increasing structural complexity as well as improving AP accuracy. More complicated networks tend to detect more small targets at the expense of lowered detection speed.47 As a result, the selection of the network structure needs to be thoroughly weighed according to the target size and the detection speed, which are determined by the task of real-time localization and the capability of the onboard processor. YOLOv5 has been tested on the Microsoft Common Objects in Context (MS COCO) data set, one of the most popular large-scale object detection, segmentation, key-point detection, and captioning data sets available for public use.48 It shows that the performance of YOLOv5-s, -m, and -x fully spans the range of COCO AP values, while the performance of YOLOv5-l is a compromise between YOLOv5-m and YOLOv5-x.47 As a result, this study mainly compares YOLOv5-s, -m, and -x when detecting the landmarks for localization.
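As a concrete illustration, a trained YOLOv5 model can be loaded and queried through the repository's public torch.hub entry point. In the following minimal sketch, the weight file 'best.pt', the image path, and the frame width are placeholder assumptions; extracting the box height h and centerline deviation e anticipates the LbL model described later:

```python
import torch

# Load custom-trained YOLOv5 weights via the public torch.hub entry point.
# 'best.pt' and 'frame.jpg' are placeholder paths, not from this study.
model = torch.hub.load('ultralytics/yolov5', 'custom', path='best.pt')

results = model('frame.jpg')
img_w = 1280                        # assumed frame width in pixels

# Each detection row of results.xyxy[0] is (x1, y1, x2, y2, confidence, class).
for x1, y1, x2, y2, conf, cls in results.xyxy[0].tolist():
    h = y2 - y1                     # bounding box height h (pixels)
    e = (x1 + x2) / 2 - img_w / 2   # deviation e from the vertical centerline
    print(int(cls), round(conf, 2), round(h, 1), round(e, 1))
```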
Object tracking
Simple Online and Realtime Tracking (SORT)49 is an algorithm that tracks detected objects between successive frames. SORT solves the multiple object tracking problem with a focus on efficient association of objects for online, real-time applications. Key components of SORT include detection, object state propagation, ID association, and lifespan management of tracked objects.49 High-quality detection is of fundamental importance in the first step of tracking. A motion model with a Kalman filter is then used to estimate and propagate target identities across frames. The ID assignment is optimally solved using the Hungarian algorithm according to the IoU distance, that is, the intersection-over-union distance between each detection and all predicted bounding boxes from the existing targets. Finally, tracking identities are created or destroyed when objects enter and leave the image.49 Due to its simplicity and efficiency, this fast and accurate tracker can follow multiple objects and provides a state-of-the-art solution for onboard UAV target tracking scenarios.
DeepSORT21 is an algorithm that improves on SORT. SORT returns a relatively high number of identity switches and has a deficiency when tracking through occlusions.21 DeepSORT overcomes this problem by integrating a CNN into the framework, introducing a more informed association metric that combines motion and appearance information. The motion information is evaluated by the Mahalanobis distance between predicted Kalman states and new measurements. The appearance information is evaluated by the cosine distance between the appearance descriptors of previous tracks and current detections.21 With decreased misses and mismatches, DeepSORT is preferred since it remains applicable to online scenarios.
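To illustrate the combined metric, the following is a minimal numpy sketch of a DeepSORT-style association cost. It follows the general form described in the DeepSORT paper, but the gate thresholds and the λ weighting used here are illustrative assumptions:

```python
import numpy as np

def cosine_distance(track_descriptors, detection_descriptor):
    # Smallest cosine distance between a track's stored appearance
    # descriptors and a new detection's descriptor (appearance term).
    a = np.asarray(track_descriptors, dtype=float)
    b = np.asarray(detection_descriptor, dtype=float)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b)
    return float(np.min(1.0 - a @ b))

def combined_cost(maha_dist, cos_dist, lam=0.0,
                  maha_gate=9.4877, cos_gate=0.2):
    # Weighted sum of the Mahalanobis (motion) and cosine (appearance)
    # distances, gated by both thresholds; associations outside either
    # gate are inadmissible. Gate values here are illustrative.
    if maha_dist > maha_gate or cos_dist > cos_gate:
        return np.inf
    return lam * maha_dist + (1 - lam) * cos_dist
```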
In this study, the detection and tracking of standard landmarks like transmission towers are of interest for UAV localization. However, these objects always share a unified appearance, which gives rise to severe mismatching when associating their identities. As a result, directly introducing SORT or DeepSORT leads to the following problems. Firstly, SORT encounters a large number of identity switches and misses50 in this case, since unique object features of the unified tower models can hardly be captured within and between frames. Secondly, DeepSORT tends to mistakenly re-identify new towers with previously allocated identifications. This side effect arises because DeepSORT is trained to discriminate reappearing objects on a large-scale re-identification data set.21,50 These problems need to be solved with a more sophisticated strategy when allocating the object identities. This study evaluates and compares SORT and DeepSORT in the UAV transmission tower tracking scenario. A timeline-based identity correction algorithm is applied so that newly appeared towers are not assigned a previous identification.
As shown in Algorithm 1, the identity correction algorithm is conceived according to the continuity of the landmark appearance during a stable flight. It can easily eliminate sporadic tracking IDs, which are mostly assigned to false-positive targets. In addition, it is especially applicable to one-way flight paths, where the uniform landmarks mostly appear only once, so that reappearing tracking IDs can be allocated a new ID and mismatching is avoided. After the identity correction process, the randomly allocated IDs from DeepSORT become true tracking IDs, which means the same landmark is always assigned the same ID. These tracking IDs are preliminary IDs that can easily be correlated with the real landmark identities according to their landmark categories, order of appearance, and landmark distance (or the size of the bounding box) when several landmarks appear in the same frame.
Algorithm 1. Identity correction.
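The timeline logic of Algorithm 1 can be sketched as follows; the thresholds min_frames and max_gap are illustrative assumptions rather than the tuned values used in this study:

```python
from collections import defaultdict
from itertools import count

def correct_identities(id_timeline, min_frames=5, max_gap=30):
    """Timeline-based identity correction (a sketch under assumed thresholds).

    id_timeline: iterable of (frame_index, tracking_id) pairs sorted by frame.
    Returns a dict mapping (frame_index, tracking_id) -> corrected_id.
    """
    frames_per_id = defaultdict(list)
    for f, tid in id_timeline:
        frames_per_id[tid].append(f)

    # 1) Drop sporadic IDs: mostly false positives that live only a few frames.
    valid = {tid for tid, fs in frames_per_id.items() if len(fs) >= min_frames}

    new_id = count(start=1)
    current = {}        # tracking_id -> (corrected_id, last_frame_seen)
    corrected = {}
    for f, tid in id_timeline:
        if tid not in valid:
            continue
        if tid not in current or f - current[tid][1] > max_gap:
            # A long gap means a uniform landmark "reappeared": on a one-way
            # path it must be a new tower, so allocate a fresh identity.
            current[tid] = (next(new_id), f)
        cid, _ = current[tid]
        current[tid] = (cid, f)
        corrected[(f, tid)] = cid
    return corrected
```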
Localization by landmarks
After landmark detection and tracking, the landmark information needed to calculate the UAV location in each frame of the video has been prepared. A localization model is constructed here in three-dimensional (3-D) space, and the perspective relationship is taken into account. As shown in Figure 2, three coordinate frames are defined: the global inertial frame, the UAV body frame, and the camera frame.

Figure 2. A scene during localization, shown with the definition of the coordinate frames.
The transformation relations among these frames follow directly from the UAV attitude and the camera installation. In this study, we are looking for the global location of the UAV, that is, its coordinates expressed in the global inertial frame.
After processing by the object detection module and the object tracking module, the four transmission towers are correctly recognized. Here, we assume that the position of the UAV is approximately known when it captures the initial frame. This is a reasonable assumption in real applications, even in a GNSS-denied flight, where accurate or approximate GNSS data at the takeoff position are known. The correct identification of the transmission towers is then ensured iteratively from this initial condition.
As shown at the top of Figure 3, a sample image captured from the scene in Figure 2 is provided with the tracking anchors noted on the transmission towers. Each transmission tower provides the height of its bounding box h and the bounding box deviation e from the vertical image centerline. The image's vertical centerline lies in the direction of the camera's optical axis.

Figure 3. The localization by landmark model. An image captured from the scene in Figure 2 with tracking anchor information noted (top); geometric similarity of the parameters (lower left); geometric relation between the landmarks and the UAV (lower right). UAV: unmanned aerial vehicle; LbL: localization by landmark.
The geometric similarity of the parameters is depicted at the lower left of Figure 3. Taking the horizontal plane and landmark B1's projection plane as an example, the relations from geometric similarity are presented as

$$\frac{h}{H} = \frac{e}{E} = \frac{f}{D}$$

where $H$ is the landmark height, $E$ is the landmark deviation from the camera's optical axis, $D$ is the landmark distance, and $f$ is the focal length of the camera. The focal length $f$ is known either from the lens parameters or from the intrinsic parameters obtained from the calibration process. It is worth noting that $f$ should be converted to the unit of pixels to take part in the calculations with $h$ and $e$. In this LbL model, $H$ is measured prior to flight and saved onboard as landmark information. As a result, the only two unknown parameters, $E$ and $D$, can be calculated from the geometric similarity as

$$D = \frac{fH}{h}, \qquad E = \frac{eH}{h}$$
After getting $E$ and $D$, the landmark's coordinates in the camera frame follow directly, with $D$ measured along the optical axis and $E$ perpendicular to it in the horizontal plane.
Putting these camera-frame coordinates together with the landmark's known global position into the transformation among the frames defined in Figure 2 finally yields the global location of the UAV.
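To make the two steps concrete, the following is a minimal sketch of the LbL computation under simplifying assumptions: a level camera and a heading (yaw) known from the flight controller. The function names and the yaw-based transform are ours, not the exact formulation of the model above:

```python
import math

def landmark_offsets(h_px, e_px, H_m, f_px):
    """Geometric-similarity step: bounding box height h and centerline
    deviation e (pixels), true landmark height H (m), focal length f (pixels).
    Returns (D, E): distance along the optical axis and lateral deviation."""
    D = f_px * H_m / h_px
    E = e_px * H_m / h_px          # equivalently e_px * D / f_px
    return D, E

def uav_position_from_landmark(landmark_xy, D, E, yaw_rad=0.0):
    # Simplified camera-to-global step: rotate the (D, E) offset by the UAV
    # heading and subtract it from the landmark's known global position.
    lx, ly = landmark_xy
    x = lx - (D * math.cos(yaw_rad) - E * math.sin(yaw_rad))
    y = ly - (D * math.sin(yaw_rad) + E * math.cos(yaw_rad))
    return x, y

# Example: a 55 m tower seen 300 px tall, 40 px left of center, f = 800 px.
D, E = landmark_offsets(h_px=300, e_px=-40, H_m=55.0, f_px=800.0)
print(uav_position_from_landmark((500.0, 250.0), D, E, yaw_rad=0.0))
```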
Implementation
Simulation
The simulation is conducted on the Rflysim 51 platform, which is efficient in developing a fully functional simulation environment for the UAV model. The scenes with landmarks are constructed using the popular 3D engine, Unreal Engine 4, to ensure fidelity in visual scene rendering. The model of the electric transmission tower is commercially available in the Epic Games marketplace. As shown in Figure 4, four types of typical transmission towers with steel lattices are used as landmarks.

Figure 4. Four types of transmission towers as the landmarks in simulation.
Multiple transmission towers are allocated on a flat plate of size 1000 m × 500 m, and the general layout is shown in Figure 5. There are 18 transmission towers in total, each noted with its type and identity. Each of the transmission towers is also provided with its global position information in the inertial frame.

Figure 5. The distribution and IDs of the landmarks in simulation.
Experiment
To implement the localization framework in practice, a self-assembled SFR 450 frame quadrotor is employed as the test platform (as shown in Figure 6). The Nora Autopilot running PX4 is used as the flight controller, and the CUAV P9 radio is used as the datalink to communicate with the ground station. A front-view monocular camera is fixed on the UAV for detection and tracking. The UAV is also equipped with a real-time kinematic (RTK) differential GPS with an accuracy of 0.02 m, which provides the reference against which the visual localization results are checked. An NVIDIA Jetson Nano is used as the embedded processor; its graphics processing unit runs the visual detection–tracking–localization framework. The modules of YOLOv5-s, DeepSORT, LbL, and off-board control are used as the software packages. The robot operating system (ROS) runs on Ubuntu 18.04, and the sensor information and control commands are exchanged via ROS topics (a minimal publishing sketch is given after Figure 6).

Figure 6. The quadrotor test platform.
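As an illustration of that ROS interface, the following minimal rospy node publishes a position estimate at the 5 Hz update rate achieved in this study. The node name, topic name '/lbl/position', message type, and frame id are our assumptions, not the actual onboard implementation:

```python
#!/usr/bin/env python
import rospy
from geometry_msgs.msg import PointStamped

def publish_lbl(x, y, z, pub):
    # Wrap one LbL position estimate in a stamped ROS message.
    msg = PointStamped()
    msg.header.stamp = rospy.Time.now()
    msg.header.frame_id = 'map'
    msg.point.x, msg.point.y, msg.point.z = x, y, z
    pub.publish(msg)

if __name__ == '__main__':
    rospy.init_node('lbl_localizer')
    pub = rospy.Publisher('/lbl/position', PointStamped, queue_size=10)
    rate = rospy.Rate(5)                 # 5 Hz location update rate
    while not rospy.is_shutdown():
        publish_lbl(0.0, 0.0, 0.0, pub)  # placeholder coordinates
        rate.sleep()
```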
During the flight test, the UAV flies a path planned on the ground station in coordination with the autopilot, and the RTK GPS is enabled. Simultaneously, the detection–tracking–localization framework is automatically initialized and runs until the end of the flight. The localization results are finally downloaded and compared with the position information from the flight log.
Training
Three scales of YOLOv5 are trained to identify the electric tower targets in the image. For training the YOLO models, only a simple data augmentation pipeline of left–right flipping is used to expand the number of images to more than 300. Images are divided into the training set, the validation set, and the test set at a ratio of 8:1:1 (a split sketch is given below). LabelImg is used to complete the data set annotation. The annotated information includes the target category and the location of the annotated box (top-left and bottom-right point coordinates). When labeling, the upper end of the annotated box should be as close to the tower top as possible, and the lower end should pass through the diagonal intersection of the tower base.
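The 8:1:1 division can be reproduced with a seeded shuffle. This sketch assumes a flat images/ directory (an assumption about the layout) and writes per-split file lists of the kind YOLOv5's data configuration accepts:

```python
import random
from pathlib import Path

# Gather all annotated images; 'images/*.jpg' is an assumed layout.
images = sorted(Path('images').glob('*.jpg'))
random.seed(0)              # fixed seed for a reproducible split
random.shuffle(images)

n = len(images)
n_train, n_val = int(0.8 * n), int(0.1 * n)
splits = {
    'train': images[:n_train],
    'val':   images[n_train:n_train + n_val],
    'test':  images[n_train + n_val:],
}
for name, files in splits.items():
    # YOLOv5 accepts plain text files listing image paths per split.
    Path(f'{name}.txt').write_text('\n'.join(str(p) for p in files))
```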
In the object detection experiments, the default hyper-parameters are set as follows: the global batch size is 6, the learning rate is initialized as 0.001, and the Adam optimizer is used with a momentum coefficient of 0.9 and weight decay.
The generalized metric mean Average Precision (mAP) is used to measure the performance of detection models. All categories' Average Precision (AP) values are averaged to obtain the mAP for a specific method. AP is the area under the Precision–Recall curve, where Precision and Recall are defined as

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$
True Positive (TP), False Positive (FP), and False Negative (FN) correspond, respectively, to an object correctly detected, background mistakenly detected as an object, and an object not detected. By training the YOLOv5 model, the landmarks in the video stream can be detected successfully, after which the follow-up target tracking and localization are completed.
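For reference, the counting step behind these metrics reduces to simple ratios once detections have been matched to the ground truth; the numbers in the example below are illustrative, not results from this study:

```python
def precision_recall(tp, fp, fn):
    # Precision and Recall from the detection counts defined above.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Example: 90 towers correctly detected, 10 false alarms, 5 towers missed.
p, r = precision_recall(tp=90, fp=10, fn=5)
print(f'Precision = {p:.3f}, Recall = {r:.3f}')  # 0.900, 0.947
```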
Results
Simulation results
Detection
The simulation results are used to compare the three different versions (s, m, x) of YOLOv5. The detection results for the same frame are presented in Figure 7. Multiple transmission towers close to the UAV have been detected, and the bounding boxes are annotated with the tower type and the corresponding classification confidence. It turns out that all three detection schemes correctly recognize and classify the four types of towers. However, the m and x versions have relatively higher detection confidence, while version s detected more distant, smaller towers, even though the corresponding bounding box confidence is lower.

Figure 7. Comparison of the detection results of transmission towers on the simulated data set.
To quantify the performance of the three detection schemes, the general detection statistics from the testing set are presented in Table 1. The s, m, and x detection schemes have increasing mAP. YOLOv5-x has the best detection performance on almost every metric, owing to the fact that it has the most sophisticated network. It is noted that the precision and recall of YOLOv5-s are acceptable considering its light-scale network.
Table 1. Performance comparison of the transmission tower detection schemes.
mAP: mean Average Precision.
The performance of the detection schemes is further compared according to the four types of transmission towers. The P-R curves are presented in Figure 8. The three detection schemes have similar P-R curves for the type A and type B towers, which are closer to the UAV in the x direction. Differences mainly come from the P-R curves of the type C and type D transmission towers, where YOLOv5-x has the best overall performance and YOLOv5-s performs relatively poorly. The reason for this phenomenon is that the type C and type D towers are located farther from the UAV and are more likely to be occluded by the other two types of towers, since the flight path lies in the middle. In this case, the more sophisticated network tends to maintain a higher Recall even when the Precision is high.

Figure 8. Comparison of P-R curves for the detection of the transmission towers on the simulated data set.
Tracking
The tracking performance is compared between DeepSORT and SORT, both using YOLOv5-s and YOLOv5-x, based on the simulated data set. Taking the result from DeepSORT + YOLOv5-s as an example, Figure 9 presents the direct output of the tracker. The transmission towers are assigned their categories and tracking IDs. Multiple towers are recognized in each frame of the time series, marked by their bounding boxes. A detailed list of the tracking results is kept for the identity analysis that follows.

Figure 9. Tracking results with DeepSORT + YOLOv5-s on the simulated data set.
The tracking indices of the transmission towers are presented in Figure 10. The list indices have been sorted according to image index I and tower category C. The correct tower identities are noted above the lines of randomly generated tracking ID numbers by checking the result images. On the one hand, the same tracking indices can be allocated to different towers repeatedly. As shown in Figure 10, the wrongly tracked tower identities, such as B4, B5, C4, C5, and D4, are noted in red. Taking the case of B4 and B5 from DeepSORT with YOLOv5-s as an example, these towers are repetitively assigned the tracking IDs 5 and 9, as shown in Figure 9 at I = 100, because they are mismatched with B1 and B2 from Figure 9 at I = 10 due to their identical appearance after B1 and B2 disappear from view. On the other hand, the tracking ID of the same tower from SORT hops among multiple numbers, indicating that SORT frequently mismatches the same tower between frames. This causes the count of distinct tracking indices from DeepSORT to be much smaller than that from SORT. However, these tracking errors can be corrected by the identity correction algorithm by directly manipulating the result list. Figure 10 indicates that the result from DeepSORT with YOLOv5-s is the easiest for the algorithm to correct.

Figure 10. Comparison of the tracking indices.
Some major indicators of the tracking policies are compared in Table 2. The simulated data set has 18 towers, so a correct count of tracking indices should be 18. DeepSORT turns out to assign fewer tower IDs because some newly appeared towers are allocated previous track IDs (error of repetitive labeling). This mismatching is rooted in the unified appearance shared by the corresponding towers. On the contrary, SORT usually assigns many more tracking IDs, in that the same tower in consecutive frames often fails to be matched, so several IDs are allocated (error of hop labeling). Moreover, DeepSORT only generates repetitive labeling errors, while SORT suffers from hop labeling errors. It is also found that more sophisticated detectors are more likely to create mismatches among the unified targets. In general, DeepSORT with YOLOv5-s has the highest tracking precision in this study and is recommended for tracking the transmission tower landmarks before the localization task.
Table 2. Detailed comparison of the sorting results.
Localization
Landmark lateral deviation E
The localization results of the transmission towers from the simulation data set are presented first. The vision-based calculation results of the lateral deviation E are compared with the given landmarks' position information. It is clear that the estimated E values are related to the transmission tower types. The mAP from YOLOv5-s and the corresponding standard deviation of E for transmission towers A, B, C, and D are listed in Table 3. More accurate detection with higher mAP values tends to decrease the error of the E calculation. This is consistent with the intuition that inaccurate detection leads to an inaccurate bounding box position e, and thus the error in E increases.
Table 3. Comparison of mAP and the standard deviation of the E measurement for different transmission towers.
mAP: mean Average Precision; std: standard deviation.
Another factor influencing the accuracy of E is the height h of the detected landmark in the image. According to Figure 11, E values based on smaller h values have larger errors. The reason is that a smaller h indicates the landmark in the image is small and more distant in reality, and the bounding box position tends to be inaccurate when the target size is small.

Figure 11. Results of E and its error analysis.
Landmark distance D
The landmark distance $D_e$ calculated in equation (6) is compared with the real landmark distance D during the whole flight along the straight-line path. Their consistency is validated by the slope when $D_e$ is plotted versus D in Figure 12. The probability density function (PDF) of the error distribution shows that different tower types have similar error PDFs with reasonable accuracy. The $D_e$ error is further analyzed as a function of the bounding box height h and the real landmark distance D. It indicates that smaller h and farther landmarks generate higher distance measurement errors, the same situation as in the E measurement results. It suggests that, in the current simulation setup, bounding boxes with h larger than 300 pixels or landmark distances D smaller than 150 m provide better vision-based distance estimations.

Figure 12. Results of D and its error analysis.
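This dependence on h follows directly from the LbL relation $D = fH/h$. A first-order sensitivity analysis (our own derivation, consistent with the observed trends) gives

$$\left|\Delta D\right| \approx \left|\frac{\partial D}{\partial h}\right|\left|\Delta h\right| = \frac{fH}{h^{2}}\left|\Delta h\right| = \frac{D}{h}\left|\Delta h\right|$$

so a fixed pixel-level error $|\Delta h|$ in the bounding box height produces a distance error that grows with $D/h$, that is, largest for distant landmarks with small bounding boxes.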
UAV locations
The UAV locations along the flight path are finally obtained after calculations with equations (7) and (3). We compare the estimated locations and the real locations of the UAV on two different paths. As shown in Figure 13, the detection–tracking–localization framework can localize the UAV with acceptable accuracy. Here we directly use the mean location value when multiple landmarks provide localization information in the same frame. The standard deviation in the x direction is smaller than 10 m. However, the y direction has a larger standard deviation, which can reach 20 m in the simulated case. Both directions share a comparable relative error, since the distance measured in the y direction is larger than that in the x direction.

Figure 13. Measured localization results in comparison with the ground truth from two paths of the simulated flight.
Experimental results
Detection and tracking
The detection and tracking results of the transmission towers from the experiment are shown in Figure 14. The left column is recorded when the UAV is flying toward the northwest, and the images in the right column are taken after the UAV turns around and flies to the southeast. Labels above the bounding boxes are produced by the DeepSORT algorithm, while labels below the bounding boxes are the final tracking identities after the identity correction algorithm is applied.

Figure 14. Detection and tracking results of the towers from the experiment.
The whole flight takes about 5 min, and about 200 s of the video are taken for measurement while the flight is stable. It turns out that the combination of YOLOv5-s and DeepSORT generates reliable initial landmark tracking identities, and by implementing the identity correction algorithm, all the landmarks are successfully recognized. Because the training set was collected from a previous flight in this region and the landmarks were meticulously annotated, the bounding boxes provide reliable position information even with interference from the diverse autumn landscape. Note that the towers far away are not detected. They were not considered in the first place when making the annotations for the training set, since such small targets tend to produce location estimates with much larger uncertainties.
Localization
The vision-based localization results are compared with the GPS-measured locations as presented in Figure 15. During the experiment, the flight path is scheduled along a river bank where several transmission towers are located. The size of the region presented in Figure 15 is about 2.0 km × 1.3 km, and the flight path has a total length of about 1 km with a height below 30 m. The takeoff point is near the midpoint between towers B2 and B3. The flight is outside of the air-exclusive zone defined by the transmission lines.

Figure 15. Comparison between the vision-based localization results and the GPS locations.
In comparison with the RTK-GPS-measured locations, the vision-based locations reveal acceptable consistency. Considering the state space of the output (lat, lon), the standard deviation of the localization error is 11.1 m in this experiment. These errors originate from the estimations in the (E, D) space, where the standard deviations are 13.1 m and 7.6 m for tower A and tower B, respectively. At a similarly low flight altitude, the localization precision is comparable to that of Masselli et al.52 and higher than that of Sun et al.,13 where a stereo camera is used. According to the jtop report (the system monitoring utility for Jetson stats), the highest resource consumption when running the localization framework onboard is 98% of the GPU, 50% of the CPU, and 3.7 GB of memory. The localization result can be provided at a best rate above 5 Hz.
Discussion
The performance of the proposed localization method is compared in Table 4 with the papers18,26–30,34 cited in the related works. The discussion mainly focuses on works studying AVL with a monocular camera for UAV outdoor applications. The experiment type, strategy and method, edge computing capability, trajectory length and altitude, and average localization error are compared. Different from the strategies of other studies, this study uses a deep learning architecture to realize landmark detection and tracking and solves the localization problem using geometric similarity. Although the proposed method is implemented on an entry-level edge computing platform, the NVIDIA Jetson Nano, it stands out as a real-time, low-latency AVL solution that can be mounted on small UAVs. Moreover, the proposed method can be implemented at a much lower altitude (<50 m), since it needs no nadir-view maps for feature matching during flight. The localization precision of the current study is moderate among the state-of-the-art studies listed and will be improved in our future work.
Table 4. Performance comparison with the state-of-the-art UAV monocular AVL works using deep learning.
UAV: unmanned aerial vehicle; AVL: absolute visual localization; SLAM: simultaneous localization and mapping.
The precision of the current visual localization framework is determined by various factors, including the resolution of the original video frames, the precision of the bounding boxes, and the real landmark size and location information. Although current commercial aerial cameras allow this localization system to generate UAV locations with acceptable accuracy, higher resolution cameras would provide better location estimates. The precision of the bounding boxes is mainly influenced by the image quality and the care taken when preparing the training set. Further improvement of the bounding box regression can draw on the work of Wang et al.53 The acquisition and establishment of a reliable real landmark database is also crucial to generating accurate localization results for UAVs.
In this study, the localization results could be improved by implementing weighted localization according to the quality of the landmarks when multiple landmarks appear in one frame. The error analysis results show that landmarks with higher mAP and larger bounding boxes always provide more accurate location results. It is reasonable to give these landmarks higher weight when combining their results, instead of averaging them, to calculate the localization. It is also possible to further optimize the localization precision through multi-sensor fusion, for example with the IMU, VO, and laser-based rangers. This study focuses on visual localization and leaves out the discussion of UAV visual navigation, for which the literature7,54 can be consulted.
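As a sketch of the weighted combination suggested above, per-landmark estimates could be fused with weights tied to bounding box height, a proxy for estimate quality in the error analysis; this weighting rule is our illustrative assumption, not the final design of this study:

```python
import numpy as np

def fuse_landmark_estimates(positions, box_heights):
    # Weighted fusion when several landmarks localize the UAV in one frame:
    # larger bounding boxes (closer landmarks) showed smaller errors in the
    # analysis above, so weight each estimate by its box height.
    w = np.asarray(box_heights, dtype=float)
    w /= w.sum()
    return tuple(np.average(np.asarray(positions, dtype=float),
                            axis=0, weights=w))

# Example: three per-landmark UAV position estimates (meters).
est = fuse_landmark_estimates(
    positions=[(102.0, 48.5), (98.7, 52.1), (105.3, 50.0)],
    box_heights=[320, 180, 90],
)
print(est)
```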
Despite achieving real-time autonomous localization based on transmission towers, the current AVL strategy has some shortcomings. The localization framework depends intimately on the detection and tracking of the visual landmarks, which means at least one recognizable landmark must appear in the view of the UAV. This necessitates a landmark system that ensures a stable landmark appearance to enable UAV navigation, which means UAVs cannot localize themselves over unexplored land with bare ground. On the other hand, since landmarks with a unified appearance are selected, the UAV easily gets lost if the iteratively inferred landmark identities are occluded or mismatched. It is necessary to select some landmarks with exclusive visual features to occasionally recalibrate the localization process. By overcoming these disadvantages, it is possible to establish a sparse landmark system in a GNSS-denied region that maintains reliable UAV localization while any intruder relying on traditional GNSS remains disabled.
Conclusions and future work
In this study, a vision-based absolute localization framework is developed for UAVs. The deep learning tools YOLO and DeepSORT are used to detect and track transmission towers, which serve as landmarks for localization. The simulation is conducted on the Rflysim platform, where a UAV navigates different paths among a landmark system constructed from four types of transmission towers. The detection performance is evaluated by comparing different versions of YOLOv5 (s, m, and x), and the tracking performance is compared between SORT and DeepSORT. It turns out that the combination of YOLOv5-s and DeepSORT provides a reliable implementation for UAV localization when the landmark identities are correctly assigned.
Using a monocular camera, the same localization framework is then demonstrated on a small UAV. Real-time localization is realized with a location update rate of 5 Hz on the entry-level edge artificial intelligence platform NVIDIA Jetson Nano. The total length of the path is about 1 km, and the average localization error is about 10 m compared with the results measured with RTK GPS. The proposed visual localization strategy saves a significant amount of GPU memory, since no satellite image is needed as a reference map. It is also an economical localization system that makes end-to-end implementation on small UAVs possible.
As future work, the localization precision can be improved, in that deep learning tools that generate more accurate bounding boxes for the landmarks would promote the reliability of the location results. In the following stage, we expect to fuse other sensors with the proposed visual localization framework, aiming to achieve autonomous navigation with a real UAV under GNSS-denied conditions.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
