Abstract
Object tracking for unmanned aerial vehicle applications in outdoor scenes is a very complex problem. In videos captured by unmanned aerial vehicles, most tracking methods fail because of frequent variation in illumination, motion blur, image noise, deformation, lack of image texture, occlusion, fast motion, and other degradations. This article focuses on object tracking in severely degraded videos. To deal with these degradations, a real-time object tracking method for highly dynamic backgrounds is developed. By integrating the histogram of oriented gradient, the RGB histogram, and the motion histogram into a novel statistical model, our method can robustly track the target in videos captured by unmanned aerial vehicles. Compared with existing methods, the proposed approach consumes fewer resources during tracking and runs faster than most of the compared state-of-the-art methods. Our approach also achieves satisfactory tracking results on the challenging visual tracking benchmark, object tracking benchmark 2013, and supplementary experiments demonstrate that our method is more effective and accurate than other methods.
Introduction
During the last 10 years, notable progress has been made in the technology and application of unmanned aerial vehicles (UAVs). Military drones have come to play an important role in battlefield scouting. The reliability and efficiency of UAVs make them easy to operate and maintain on the battlefield. In the civilian field, UAVs have been applied to aerial surveillance, fast disaster monitoring, and short-distance parcel delivery. Many companies are developing UAV systems to keep their technology competitive, for example, Amazon with Prime Air and Google with Project Wing. Flight security will be a significant issue in the future; thus, a UAV must be able to sense its surroundings during flight.
Vision-based tracking systems are becoming ever more important in current UAV applications. Visual cameras are lightweight and inexpensive and, above all, they provide more useful information than other sensors. With the abundant information provided by the vision system, a drone can detect hidden military threats and take the appointed action. This technology has attracted much attention in the last 10 years. A reliable UAV vision system should be able to track objects automatically. Research on this problem, a central theme of computer vision, has been active for decades and has produced many notable achievements.
However, UAV captured videos pose various challenges. As shown in Figure 1, because of cost limitations, UAV captured videos often contain many quality degradations, such as defocus and motion blur. Even if the video is captured by a high frames per second (FPS) camera (1080p @ 60 FPS), severe degradation occurs once the UAV undergoes abrupt rotation or other motion. Thus, how to deal with these degradations becomes an important problem in object tracking on UAV captured videos.

Figure 1. Tracking results of the proposed method in severely degraded video (the yellow rectangles mark our tracking results).
The proposed approach has been compared with state-of-the-art methods on a sequence with challenging degradations, including highly dynamic background, video transmission error, violent rotation, illumination and contrast change, motion blur, and fast motion. Common object tracking methods struggle to achieve high-quality tracking results on UAV captured video under motion blur, camera rotation, and fast motion because they do not further process the degradation information; previously proposed trackers are seldom specialized to handle degradations such as motion blur and fast motion. Figure 1 shows some successful tracking results of the proposed method in severely degraded video, where the yellow rectangles indicate the position of the target.
To sum up, our contribution has three aspects: (1) the proposed method introduces three different features, the histogram of oriented gradient (HOG), the RGB histogram, and the histogram of the optical flow result, into the statistical model; (2) we propose a tracking method for motion degradation in video that analyzes the optical flow of the target; and (3) the proposed technique performs better than the state-of-the-art tracker when evaluated on a widely applied benchmark, object tracking benchmark 2013 (OTB2013). The overview of our algorithm is given in Figure 2.

Figure 2. Overview of our algorithm. First, we calculate the color, gradient, and optical flow distributions of the frame sequence. Second, we combine those distributions into our statistical degradation model with optical flow. Finally, we use a correlation filter to track the target.
The rest of the article is organized as follows: related work on object tracking is briefly reviewed in the “Related works” section; the “The proposed approach” section presents a detailed description of the proposed algorithm; the “Experiments” section gives the quantitative and qualitative experimental results and discusses some limitations of our approach; and the last section concludes the article.
Related works
UAV-related computer vision applications are very promising for the coming decades. Tracking a target in UAV captured videos requires an online object tracking algorithm. Among algorithms that aim to improve the robustness of the results, 1 Struck 2 has a concise formulation whose purpose is to minimize a structured output objective for localization. 3 However, the variety of features and the number of training samples make those algorithms inefficient.
The correlation filter 4 instead minimizes a least-squares loss over all cyclic shifts of the positive example. Although this may seem a weaker approximation of the real problem, it allows densely sampled examples and high-dimensional feature images to be used in real time by working in the Fourier domain. The disadvantage of correlation filters is that they are limited to learning from all cyclic shifts. Several recent works 5–7 have attempted to solve this problem, and the spatially regularized formulation 8 in particular has shown good tracking results. However, this comes at the expense of real-time operation. Correlation filters 9 have been extended to multiple feature channels, so that the HOG feature 10 enabled the technique to achieve state-of-the-art performance 11 in VOT14. The winner of that challenge, DSST, 8 incorporates a multiscale template using a one-dimensional correlation filter for discriminative scale space tracking.
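To make the Fourier-domain formulation concrete, the following is a minimal single-channel sketch in the style of a regularized least-squares correlation filter. It is an illustration under stated assumptions, not the implementation used in this article: the function names, the Gaussian label width `sigma`, and the regularizer `lam` are all hypothetical choices.

```python
import numpy as np

def gaussian_response(shape, sigma=2.0):
    """Desired response: a Gaussian whose peak sits at index (0, 0)."""
    h, w = shape
    ys = np.arange(h) - h // 2
    xs = np.arange(w) - w // 2
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    g = np.exp(-(yy ** 2 + xx ** 2) / (2.0 * sigma ** 2))
    return np.fft.ifftshift(g)  # move the peak from the center to (0, 0)

def train_filter(patch, target, lam=1e-3):
    """Closed-form ridge regression over all cyclic shifts of `patch`.

    Returns the conjugate filter in the Fourier domain. `lam` is an assumed
    regularization weight, not a value taken from the article.
    """
    X = np.fft.fft2(patch)
    Y = np.fft.fft2(target)
    return (Y * np.conj(X)) / (X * np.conj(X) + lam)

def detect(filter_conj, patch):
    """Correlation response of a new patch; the argmax gives the cyclic shift."""
    Z = np.fft.fft2(patch)
    return np.real(np.fft.ifft2(filter_conj * Z))

rng = np.random.default_rng(0)
patch = rng.standard_normal((32, 32))
filt = train_filter(patch, gaussian_response(patch.shape))
shifted = np.roll(patch, (3, 5), axis=(0, 1))   # simulate target motion
response = detect(filt, shifted)
peak = tuple(int(i) for i in np.unravel_index(np.argmax(response), response.shape))
```

Because training and detection are element-wise operations on FFTs, the cost per frame is dominated by a few transforms, which is what makes dense sampling affordable in real time.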
Current correlation filter-based methods 12 are inherently limited to learning a rigid template. This is a problem when a target undergoes shape deformation during a sequence. Perhaps the simplest way to achieve robustness to deformation is to use a representation 13 that is insensitive to shape changes. Image histograms have this property because they discard the position of each pixel. In fact, the histogram can be considered orthogonal to the correlation filter, because the correlation filter is learned from cyclic shifts, while the histogram is invariant to cyclic shifts. 14,15 However, histograms alone are usually insufficient to distinguish objects from backgrounds.
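The invariance of a histogram to cyclic shifts can be checked with a short sketch. The helper name `intensity_histogram` and the bin count are assumptions made for illustration only.

```python
import numpy as np

def intensity_histogram(img, bins=16):
    """Normalized intensity histogram; pixel positions are discarded."""
    hist, _ = np.histogram(img, bins=bins, range=(0, 256))
    return hist / hist.sum()

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(16, 16))
rolled = np.roll(img, (3, 5), axis=(0, 1))  # a cyclic shift of the patch

same_histogram = np.array_equal(intensity_histogram(img),
                                intensity_histogram(rolled))
same_pixels = np.array_equal(img, rolled)
```

Here `same_histogram` is true while `same_pixels` is false: the shift changes the template that a correlation filter would see but leaves the histogram untouched, which is why the two cues complement each other.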
The primary alternative for achieving deformation robustness is to learn a deformable model. We think that learning a deformable model from a single video, where the only supervision is the position in the first frame, is ambitious, so a simple bounding box is adopted. Although our approach is superior to recent sophisticated parts-based models 16,17 in benchmarks, deformable models have richer representations that may support better tracking.
In most common situations, the target template selected from the frame generates the highest score for the target. However, the scores of the background are also unavoidably high. To overcome this issue, a variety of correlation filters have been proposed recently. ASEF 18 combines different learned filters by averaging over them, while the MOSSE 9 filter attempts to train on all the images. Kernelized correlation filters (KCF) are derived for object tracking by solving a ridge regression problem with circulant matrices. In theory, KCF 19 with a linear kernel is the same as the MOSSE 9 filter if multiple single-channel samples are used in training. The general case, which trains filters on multichannel information, incurs expensive computation, which is quite inappropriate for real-time online visual tracking. STC 20 differs from the other training schemes in the following aspects. First, STC 20 models the interrelationship between the object and its local spatial context, while ordinary correlation filter-based trackers (CFTs) just model the input appearance with trained filters. Second, the values of the confidence map in STC 20 can be considered prior probabilities given the tracked object, while the values in the confidence maps of other CFTs are correlation responses. Third, STC 20 is capable of estimating scale variations, which is challenging for CFTs such as MOSSE 9 and KCF. 19 Recently proposed discriminative correlation filter tracking algorithms have attracted great attention and accomplished some remarkable achievements.
The optical flow filter 21 is a real-time optical flow algorithm implemented on the GPU. It obtains pixelwise motion information from the image, which helps us calculate the degradation information of the target.
The proposed approach
Statistical degradation model with optical flow
In our statistical degradation model, we use the
where
The parameter space is
The whole model parameter
And the image loss function in each image could be
in which
Those three feature scores are learnt in our model
where
Finally, we give an overall combination of the three scores, using
Statistical degradation features
In this article, three different statistical features are applied to describe the target during the object tracking process. The color-based statistical feature is used to cope with deformation and defocus, the gradient distribution is used to eliminate the influence of illumination change, and the degradation distribution is used to deal with fast motion and motion blur. Figure 3 shows these features for an image patch: the original input image, the optical flow result, the degradation color map of the optical flow, and the histogram of optical flow.

Figure 3. (a) The original input image, (b) the result of the optical flow, which shows that the UAV is moving toward the top-left direction, (c) the degradation color map of the optical flow, in which the color indicates the direction of the patches and the RGB value indicates the amplitude of motion, and (d) the histogram of optical flow of the input image.
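A histogram of optical flow like the one in panel (d) can be sketched as follows. This is an illustrative version only: it assumes a dense flow field stored as an `(H, W, 2)` array of `(dx, dy)` displacements, and the magnitude weighting and the bin count are assumptions rather than details given in this article.

```python
import numpy as np

def flow_direction_histogram(flow, bins=8):
    """Histogram of motion directions, weighted by motion magnitude.

    flow: array of shape (H, W, 2) holding per-pixel (dx, dy) displacements.
    The weighting by magnitude is an assumed design choice.
    """
    dx, dy = flow[..., 0], flow[..., 1]
    magnitude = np.hypot(dx, dy)
    angle = np.arctan2(dy, dx) % (2.0 * np.pi)   # map to [0, 2*pi)
    hist, _ = np.histogram(angle, bins=bins, range=(0.0, 2.0 * np.pi),
                           weights=magnitude)
    total = hist.sum()
    return hist / total if total > 0 else hist

# A patch moving uniformly up and to the left (image y axis points down).
flow = np.zeros((4, 4, 2))
flow[..., 0] = -2.0
flow[..., 1] = -1.0
hist = flow_direction_histogram(flow, bins=8)
dominant_bin = int(np.argmax(hist))
```

With uniform motion all the mass falls into one direction bin; under motion blur or fast motion the dominant bin indicates the direction in which the target appearance is smeared.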
Gradient distribution
The gradient distribution is obtained from the HOG feature. With a least-squares correlation filter formulation, the image loss in each frame is
where
Color distribution
The RGB histogram score is calculated from the samples in each image using the correct location as a positive sample. We use
Degradation distribution
The degradation information
The degradation distribution is the histogram of
The location of the maximum final score is taken as the center of the target in the current frame. Figure 4 visualizes the individual score maps and the final score.

Figure 4. (a) The input frame patch, (b) the per-pixel score of the color response map, (c) the gradient distribution score map, (d) the color distribution score map, (e) the optical flow distribution score map, and (f) the final score map. The maximum of the score map indicates the predicted position of the target.
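The step from the three score maps to a predicted position can be sketched as a weighted sum followed by an argmax. The convex combination below is an assumption made for illustration, since the exact combination rule is not recoverable from the text; `color_factor` is a hypothetical value, while `flow_factor=0.15` follows the optical flow merging factor reported in the parameter evaluation.

```python
import numpy as np

def merge_scores(grad, color, flow, color_factor=0.3, flow_factor=0.15):
    """Combine three dense score maps and locate the maximum.

    Assumed rule: weights sum to one, with the gradient map taking the
    remainder. Returns the merged map and the (row, col) of its peak.
    """
    grad_factor = 1.0 - color_factor - flow_factor
    final = grad_factor * grad + color_factor * color + flow_factor * flow
    row, col = np.unravel_index(np.argmax(final), final.shape)
    return final, (int(row), int(col))

grad = np.zeros((5, 5)); grad[2, 3] = 1.0    # template response peak
color = np.zeros((5, 5)); color[0, 0] = 1.0  # distractor with similar color
flow = np.zeros((5, 5)); flow[2, 3] = 1.0    # motion agrees with the template
final, center = merge_scores(grad, color, flow)
```

In this toy case the color map alone would point at the distractor, but the merged map still peaks at the position supported by both the gradient and the optical flow cues.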
Experiments
The proposed approach is implemented in Matlab and runs on a computer vision server with dual Intel Xeon E5-2670 2.60 GHz CPUs, 32 GB RAM, and a GTX 1080 Ti graphics card. The proposed tracker is evaluated on the popular OTB2013 1 and on some other sequences captured by our UAVs. The common frame attributes in our video is
The OTB2013 data set contains 50 sequences covering a variety of scenes with challenging conditions, such as in-plane rotation, out-of-plane rotation, out-of-view, background clutter, low resolution, illumination variation, scale variation, occlusion, deformation, motion blur, and fast motion.
Parameters evaluation
In our implementation, the input image patch is first resized to 150 × 150 to achieve real-time tracking. We compare the proposed approach with eight state-of-the-art trackers on the popular OTB2013 benchmark. The OTB2013 database has been manually tagged with 11 attributes, which represent the challenging aspects of visual tracking, and 29 publicly available visual trackers have already been tested on the benchmark.
Two experiments are designed to determine the parameters of our method. In Figure 5, we plot as line charts the results for different optical flow merging factors (from 0.10 to 0.30) and different optical flow learning rates (from 0.10 to 0.40). From the line charts, the best optical flow merging factor is 0.15 and the best optical flow learning rate is 0.20.

Figure 5. (a) Results for different optical flow merging factors (from 0.10 to 0.30) and (b) results for different optical flow learning rates (from 0.10 to 0.40).
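The role of a learning rate in an online tracker can be illustrated with a running-average model update. This is a generic sketch that assumes the optical flow model is updated by linear interpolation, which the text does not state explicitly; `update_model` is a hypothetical helper, and the default of 0.20 is the best learning rate found above.

```python
import numpy as np

def update_model(model, observation, learning_rate=0.20):
    """Blend the stored model with the current frame's observation.

    A larger learning rate adapts faster but forgets the past sooner.
    """
    return (1.0 - learning_rate) * model + learning_rate * observation

model = np.array([1.0, 0.0])        # stored histogram from earlier frames
observation = np.array([0.0, 1.0])  # histogram measured in the current frame
updated = update_model(model, observation)
```

After one update the model is 80% old and 20% new, so a single corrupted frame cannot overwrite the stored distribution.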
Comparison with the state-of-the-art trackers
In this section, we compare our tracker with some state-of-the-art trackers. The performance of our algorithm is evaluated quantitatively, following the method used by Kristan et al. 11 We compare the proposed method with eight state-of-the-art trackers: Staple, 4 Struck, 2 TLD, 22 CXT, 23 TM, 24 LOT, 25 OAB, 26 and MTT. 27
As shown in Figure 6, our method outperforms the other state-of-the-art trackers. Our method obtains the best average precision (0.748), which is 13.8% better than that of the second-best tracker, Staple. In addition, our tracker achieves first or second place in 11 sequences. The comparison of success rate and execution speed on the 13 sequences is given in Table 1, where the best results are in boldface and the second-best results are underlined. However, the use of optical flow makes our method slightly slower than Staple, 4 although it is still faster than the other seven algorithms.

Figure 6. Results of some popular methods on OTB2013: (a) to (c) are the precision plots for the fast motion data sets, the motion blur data sets, and all data sets; (d) to (f) are the success rate plots for the fast motion data sets, the motion blur data sets, and all data sets. The red curve is the result of our method, which has the best performance; the green curve is the state-of-the-art method Staple; 4 and the score in the legend is the average score on OTB2013. OTB2013: object tracking benchmark 2013.
Table 1. Success rate and FPS on the 13 sequences.
FPS: frames per second.
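For readers unfamiliar with the two measures, the sketch below computes a success rate from bounding-box overlap and a precision from center location error. The box format `(x, y, w, h)` and the thresholds (overlap above 0.5, center error within 20 pixels) are common conventions assumed here for illustration; the article does not state which thresholds it uses.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def success_rate(pred, gt, overlap_threshold=0.5):
    """Fraction of frames whose overlap exceeds the (assumed) threshold."""
    return float(np.mean([iou(p, g) > overlap_threshold
                          for p, g in zip(pred, gt)]))

def precision(pred, gt, pixel_threshold=20.0):
    """Fraction of frames whose center error is within the (assumed) threshold."""
    errors = [np.hypot((p[0] + p[2] / 2) - (g[0] + g[2] / 2),
                       (p[1] + p[3] / 2) - (g[1] + g[3] / 2))
              for p, g in zip(pred, gt)]
    return float(np.mean(np.array(errors) <= pixel_threshold))

gt = [(0, 0, 10, 10), (0, 0, 10, 10)]
pred = [(0, 0, 10, 10), (100, 100, 10, 10)]  # second frame: tracker lost
sr = success_rate(pred, gt)
pr = precision(pred, gt)
```

Sweeping either threshold over a range and plotting the resulting fraction gives curves of the kind shown in the precision and success rate plots.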
We can also see that most of the trackers perform well on common sequences. However, when a sequence contains perturbations such as motion blur and fast motion, many of the trackers fail to complete the tracking procedure. For example, many trackers perform well on the boy sequence, but on the soccer sequence, which contains severe degradation, our method achieves a satisfying result (0.732) while the other trackers can hardly reach 0.3.
Various kinds of degradation exist in videos captured by UAVs, and they are the most important issues to cope with in object tracking for UAV videos. Table 2 reports the success rate and execution speed on the four UAV sequences for the proposed tracker and the eight competing trackers.
Table 2. Success rate and FPS on the UAV sequences.
FPS: frames per second; UAV: unmanned aerial vehicle.
In summary, motion blur and fast motion are great challenges in object tracking. In our method, we combine three different features to overcome the tracking problem, and we outperform the other trackers in the experiments, especially on UAV videos. The OTB2013 results in Figure 6 show that our results are much better than the others on fast motion and motion blur sequences. In the overall evaluation, we achieve results similar to Staple, which suggests that our method does not introduce additional error.
Limitation
Our method obtains satisfying results in most conditions, even when there is severe motion blur and camera distortion in the sequences (Figure 7). However, when it encounters long-term occlusion, our method can fail to track the target, because the occlusion could change the template and influence the tracking result.

Figure 7. Tracking results (first and third columns) and optical flow results (second and fourth columns) of the proposed method in severely degraded situations (the yellow rectangles mark our results).
In Table 3, we report the values of the most important parameters we use.
Table 3. Parameters in our method.
HOG: histogram of oriented gradient.
Advantages
Our algorithm performs more satisfactorily when there is obvious variation in motion (Figure 6). The sequences captured by UAVs often contain severe degradations, as shown in Figure 6, and according to the optical flow the motion in the frame is very severe.
Furthermore, severe movements, such as fast motion, motion blur, and nonuniform degradation, are usually a grim challenge for object tracking, because these abnormal movements usually do not follow the motion hypothesis. Most algorithms fail to track the target when it accelerates or changes its direction of motion in the nonuniform degradation data sets. Staple 4 also suddenly fails when the target abruptly turns in another direction, which makes the target very blurry and introduces a lot of useless information.
As shown in Figure 6, our method performs much better than other methods on severely degraded videos because we incorporate degradation information into the tracking procedure. Degradation information is a misplaced resource that becomes useful once it is combined into the tracking process. In our algorithm, the degradation is treated as the motion direction; unlike the HOG feature, the motion direction gives more information about the moving target, which makes our method outperform the others.
All the methods in OTB2013 are compared with our method, and only the best 10 are shown in Figure 6. Common tracking methods (such as Struck, 2 MIL, 28 TLD, 22 CT, 29 and KCF 19) inevitably fail to track the target in severely degraded videos, where our method clearly outperforms them. However, because of the limitations of the HOG and RGB histogram features, our method performs similarly to Staple 4 when the sequence has no degradation.
Conclusion
We proposed a statistical degradation model in this article, in which three complementary features are combined so that the model can cope with deformation, color change, and degradation. The color distribution is generated simply from the RGB histogram, the gradient distribution is calculated from the HOG feature, and the degradation distribution of the target is obtained by calculating the histogram of motion direction. With these three features, our method achieves outstanding results when the video is degraded and performs as well as Staple 4 when there is no degradation. Although the proposed tracker performs very well on most image sequences in our experiments, it does not handle occluded scenes well.
In future work, we plan to improve our model with features computed by deep learning, which should further increase the overall performance of our algorithm. The speed of our algorithm is approximately 80 FPS. With the help of deep neural networks, we look forward to improving its results in the future.
Footnotes
Authors’ note
This paper was presented in part at the CCF Chinese Conference on Computer Vision, Tianjin, 2017.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by Zhejiang Provincial Natural Science Foundation of China under Grant number LY15F020031 and LQ16F030007, National Natural Science Foundation of China (NSFC) under Grant numbers 11302195 and 61401397.
