
Abstract

Recently, due to the excellent computational efficiency and interpretability, discriminative correlation filter (DCF)-based tracking methods have received extensive attention in the field of unmanned aerial vehicles (UAVs). However, existing methods are usually susceptible to interference from significant appearance changes of the target object or background occlusion, which leads to tracking failure. To effectively address these issues, we propose a distortion-aware correlation filter with target mask (DACFTM) for UAV, which introduces a target regularization term to enhance the target perception ability of the tracking model. Specifically, we construct a target mask matrix based on the highest peak of the response map of the previous frame, thereby leveraging prior reliable localization confidence, and multiply it with the current feature map to obtain a regularization term containing only target information, effectively distinguishing the target from the background. In addition, to deal with tracking failure caused by large appearance changes, we propose a distortion-aware mechanism. When the quality of the response map corresponding to the filter is higher than a set threshold, we consider the filter is reliable and adopt the filter fusion strategy; otherwise, the saved high-quality filter is selected for the tracking in the next frame. Finally, we comprehensively evaluate the performance of DACFTM on three mainstream UAV benchmark datasets, and experimental results demonstrate that the DACFTM achieves impressive tracking performance.

Keywords

discriminative correlation filters, distortion-aware, target mask, visual object tracking

1. Introduction

Visual object tracking is an important research topic in the field of computer vision. The tracker is initialized in the first frame, aiming to continuously track the target in the subsequent video sequences. However, due to the fact that the target is constantly disturbed by significant deformations, occlusions, complex environments and other challenges during the moving process, the design of high-performance trackers remains a difficult task. Currently, visual object tracking for unmanned aerial vehicles (UAVs) has been widely used in practical applications, including traffic monitoring (Xuan et al., 2019), autonomous driving, aerial film shooting (Bonatti et al., 2019), and other fields.

In recent years, discriminative correlation filter (DCF)-based tracking methods have achieved excellent results, especially in the field of UAVs (Chen et al., 2024; Jin et al., 2024). Early DCF-based trackers (Danelljan et al., 2015b; Henriques et al., 2014) benefited from circular shift sampling to obtain abundant training samples. Meanwhile, discrete Fourier transform (DFT) can convert the convolution calculation in the spatial domain into the dot product calculation in the frequency domain, which greatly reduces computational complexity and promotes the rapid development of related technologies. Based on previous works, the tremendous success of high-performance DCF-based trackers (Zhang et al., 2020, 2022c) can mainly be attributed to the following three aspects: various regularization terms with specific significance, various interference suppression strategies, and robust feature representations. Firstly, the temporal regularization term can emphasize the continuity of the time series to improve the generalization ability (Li et al., 2018b), and the channel regularization term enhances the tracking ability by giving the attention mechanism for learning each channel (Xu et al., 2019b). Secondly, the most representative one of the interference suppression strategies is background-aware correlation filter (BACF) (Kiani Galoogahi et al., 2017), which effectively suppresses background interference by modeling the background, thereby ensuring accurate tracking. Thirdly, as a necessary component for tracker design, robust feature representation methods have developed rapidly, such as histogram of oriented gradient (HOG) (Dalal & Triggs, 2005), color names (CNs) (Van De Weijer et al., 2009), and convolutional neural network (CNN) features (Lu et al., 2018). Among them, deep CNN features are currently the main factors to improve the precision and success rate of the tracker.

Although deep CNN features have made significant progress in improving the tracking performance, the complementarity of hand-crafted features and deep features is often overlooked. Specifically, hand-crafted features are more sensitive to spatial information compared to deep features (Zhang et al., 2022b), while deep features pay more attention to channel information (Xu et al., 2019b). Secondly, the current tracker design lacks effective improvements to specific designs in previous tracking frameworks. For example, when the quality of the response map deviates, existing tracker frameworks rarely check this situation, resulting in tracking failure. This not only loses the guiding significance that the response map can bring to the tracking quality, but also fails to correct abnormal situations such as multi-modality in a timely manner. Finally, most current trackers usually do not enhance the target information but weaken the influence of the background, which ignores the possibility of other candidate targets as the actual tracking objects. To sum up, in order to further improve the tracking performance, the problems analyzed above all need to be urgently addressed.

To solve the above problems, we propose an object tracking method based on a target mask regularization and distortion-aware mechanism. Inspired by Xu et al. (2019b), we used hand-crafted features and CNN features to perform grouped feature selection in the channel dimension, obtaining a compact feature representation. Secondly, we designed a distortion-aware mechanism to guide the filter selection by detecting the quality of the response map.

Extensive experiments conducted on multiple challenging tracking datasets show that the proposed tracking method achieves excellent performance. Our main contributions are as follows:

A target mask regularization correlation filter model is proposed. We construct the target mask matrix through the highest peak of the response map of the previous frame, and multiply it with the current feature map to obtain the regularization term that only contains the target information, which can effectively distinguish the target from the background.

A distortion-aware mechanism to guide the filter selection is proposed. We use the peak-to-sidelobe ratio and the highest peak to evaluate the quality of the response map. When the quality of the response map corresponding to the filter is higher than a set threshold, the filter is considered reliable; otherwise, the saved high-quality filter is selected.

The optimal derivation process of the proposed model is presented. The proposed objective function with the target mask is decomposed into several sub-problems, and each sub-problem has a closed-form solution. In this way, the implementation of the tracking algorithm is simplified.

We comprehensively evaluated the effectiveness of our method on UAV123@10fps, DTB-70 and UAVDT. The results show that our tracker achieves competitive performance compared to other advanced trackers, achieving superior accuracy due to the synergistic effect of these innovations and robust multi-feature fusion. The code and data will be made public at https://github.com/upup99/DACFTM.

Our Distortion-Aware Correlation Filter with Target Mask (DACFTM) is a general CF model. The rest of this article is organized as follows: in Section 2, we will review the tracking methods closely related to this work. Section 3 will briefly discuss the proposed method and its technical details. Section 4 will present the experimental details and corresponding results of this article. Finally, in Section 5, this article is summarized.

2. Related Work

2.1. DCFs for Object Tracking

In 2010, the minimum output sum of squared error (MOSSE) algorithm (Bolme et al., 2010) was the first to introduce correlation filtering theory into the target tracking task, and it performed excellently in both accuracy and speed. However, because of boundary effects, it copes poorly with complex target changes. The circulant structure of tracking-by-detection with kernels (CSK) algorithm (Henriques et al., 2012) followed closely, building on MOSSE with cyclic dense sampling to obtain training samples and adopting a kernelized correlation filtering method. The kernelized correlation filter (KCF) algorithm (Henriques et al., 2014) further improved on CSK with multichannel HOG features. Danelljan et al. (2014) recognized the value of color features in tracking and, after comprehensively evaluating feature extraction in multiple color spaces, proposed the multi-channel color feature CN. Li and Zhu (2015) combined the complementary HOG and CN features, which became the hand-crafted features commonly used in subsequent correlation filter trackers. DeepSRDCF (Danelljan et al., 2015a) replaces the HOG feature with single-layer convolutional features from a VGG (visual geometry group) network, greatly improving tracking accuracy. GFSDCF (Xu et al., 2019b) uses a ResNet network and integrates HOG and CN features, which improves performance but reduces speed. In addition, DCF trackers employ the alternating direction method of multipliers (ADMM) to find fast-converging and accurate iterative solutions; ADMM is also widely used in image processing (Xu et al., 2019a), finance (Lai et al., 2020), and other fields.

In UAV object tracking, the target often undergoes more severe deformation, and the background also changes dramatically. To deal with this kind of interference, Zhang et al. (2022a) proposed a target-aware background-suppression method with dual regression and established a new regularization term to improve the discriminative ability of the filter. Zheng et al. (2021) used adaptive hybrid labels to make the model more robust to sudden appearance changes, arguing that the quality of the predefined label has a great impact on the robustness of the tracker. Although these trackers improve tracking performance to a certain extent, they ignore the rapid change of target appearance in the real world. In our work, we use a target mask and a distortion-aware mechanism to deal with such challenges. As a result, the proposed tracker shows strong discriminative ability in UAV object tracking and adapts better to complex and changeable practical scenarios.

2.2. CNNs for Object Tracking

CNN has achieved significant success in the field of computer vision. In object tracking, CNN can continuously track targets in video sequences by learning their appearance features and motion patterns. In recent years, tracking algorithms (Zhang et al., 2024b, 2024c, 2025) have been widely inspired by this method. MDNet (Nam & Han, 2016) is an early target tracking algorithm based entirely on CNN, which improves the robustness and generalization ability of trackers through multi-domain learning. C-COT (Danelljan et al., 2016) effectively improves tracking performance by introducing continuous convolution operators to handle target deformation and appearance changes, and adopting online learning strategies. CREST (Song et al., 2017) introduced the concept of residual learning for the first time. When there is a significant difference between the output of the base map and the true Gaussian response, the residual network supplements the output of the base map through summation operations, which helps to make the output closer to the true result. The SiameseRPN (Li et al., 2018a) algorithm is based on SiameseFC (Bertinetto et al., 2016) and applies the idea of extracting target candidate boxes from region candidate networks in object detection to target tracking, which greatly solves the problem of severe object deformation in target tracking. In addition, in the past three years, with the popularity of transformer-based models, many scholars have applied them to the field of visual tracking. STARK (Yan et al., 2021) uses a score discriminator to evaluate the score of the object. When the score exceeds a predetermined threshold, the tracking result will be used as an online template. Mixformer (Cui et al., 2022) has designed an efficient single-stage tracking network. Although it still adopts online template update strategy, it also shows strong tracking performance. 
In the domain of three-dimensional (3D) object tracking, several methods have been developed to enhance the discriminative capability of search area features through target-aware information propagation. For instance, Wang et al. (2021) introduced MLVSNet, which incorporates a target guided attention (TGA) module designed to transmit target information and emphasize relevant points in the search area. Similarly, Xiao et al. proposed a target-specific feature enhancement (TSFA) module (Qi et al., 2020) based on MLP layers to embed the target point cloud features into the search region, thereby improving feature matching. Extending this idea, Hui et al. (2021) presented a Siamese voxel-to-BEV tracker that performs template feature embedding to integrate template features into potential candidates, aiming to better capture the 3D structural characteristics of the target object. Deep learning algorithms show great potential in the field of target tracking. However, state-of-the-art vision transformer-based models (Zhang et al., 2025a, 2025b) typically rely on large-scale offline pre-training and demand heavy computational resources. These constraints often render them unsuitable for real-time deployment on resource-limited UAV platforms, highlighting the continued necessity for efficient, lightweight tracking solutions.

3. Proposed Method

In this section, a detailed description of DACFTM will be provided. Its process can be summarized into three aspects: (1) target mask regularization; (2) distortion-aware mechanism; (3) high-quality filter selection. The overall flowchart of DACFTM is shown in Figure 1.

Figure 1.

The framework of our proposed distortion-aware correlation filter with target mask (DACFTM) tracker. Here, $⊛$ represents the circular convolution operator and $⊙$ denotes the element-wise product. In the target mask, the black part represents the all-ones region, and all entries outside it are zeros.

3.1. Preview of DCF

DCF (He et al., 2017) learns multichannel filters with multiple real training samples. Let $H_{t} \in R^{N \times N \times C}$ be the correlation filter of the $t$ -th frame, $X_{t} \in R^{N \times N \times C}$ be the tensor composed of the channel features extracted in the $t$ -th frame, and $Y \in R^{N \times N}$ be the expected response map of the target position in the $t$ -th frame. To learn the correlation filter $H_{t}$ , the objective function of the standard DCF is as follows:

$$E(H_{t}) = ‖ Y - \sum_{k=1}^{C} H_{t}^{k} ⊛ X_{t}^{k} ‖_{2}^{2} + \frac{λ}{2} \sum_{k=1}^{C} ‖ H_{t}^{k} ‖_{2}^{2}, \tag{1}$$

where $⊛$ is the circular convolution operator, $X_{t}^{k} \in R^{N \times N}$ is the feature representation of the $k$-th channel, $H_{t}^{k} \in R^{N \times N}$ is the corresponding discriminative filter, $C$ is the number of channels, and $λ$ is the hyper-parameter that weights the regularization term. In the frequency domain, we can easily obtain the closed-form solution of the above objective function.
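To make the frequency-domain closed form concrete, the following is a minimal sketch for the single-channel, one-dimensional case of equation (1). It is an illustration under stated assumptions rather than the tracker's implementation: the helper names `dft`, `circular_convolve`, and `learn_filter`, the toy signals, and the value of λ are all hypothetical, and a naive DFT is used so that the example needs only the standard library. For a single channel, setting the gradient of equation (1) to zero gives $\hat{H} = \bar{\hat{X}} ⊙ \hat{Y} / (|\hat{X}|^{2} + λ/2)$ per frequency bin.

```python
import cmath
import math


def dft(signal, inverse=False):
    """Naive O(n^2) DFT: unnormalized forward transform, 1/n-scaled inverse."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t, value in enumerate(signal):
            acc += value * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def circular_convolve(h, x):
    """Circular convolution of two equal-length real sequences."""
    n = len(x)
    return [sum(h[m] * x[(i - m) % n] for m in range(n)) for i in range(n)]


def learn_filter(x, y, lam):
    """Single-channel minimizer of ||y - h (*) x||^2 + (lam / 2) ||h||^2.

    Each frequency bin is solved independently, which is what makes the
    frequency-domain formulation cheap.
    """
    x_hat, y_hat = dft(x), dft(y)
    h_hat = [xc.conjugate() * yc / (abs(xc) ** 2 + lam / 2)
             for xc, yc in zip(x_hat, y_hat)]
    return [value.real for value in dft(h_hat, inverse=True)]


# Hypothetical toy data: one feature channel and a Gaussian-shaped label.
x = [1.0, 2.0, 0.5, -1.0, 0.3, 0.0, 0.7, -0.4]
y = [math.exp(-0.5 * min(i, 8 - i) ** 2) for i in range(8)]
h = learn_filter(x, y, lam=0.1)
response = circular_convolve(h, x)  # approximates y, up to the ridge penalty
```

The multi-channel case couples the channels at each frequency bin, which is why the later derivation works with per-pixel sub-problems rather than this scalar division.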

DCF is an effective visual tracking method, but it also brings inevitable boundary effects. The SRDCF algorithm based on spatial penalty thus emerged. At the same time, in order to solve the problem of online update, SRDCF established multiple training image models. Although this improves the tracking performance, it inevitably increases the complexity of the algorithm. Aiming at the problem of high complexity and low tracking efficiency, a correlation filter based on spatial–temporal regularization has emerged. This method can not only make a reasonable approximation of SRDCF in the case of multiple training samples, but also provide a more robust appearance model when significant appearance changes occur. The specific objective function is expressed as follows:

$$E(H_{t}) = \frac{1}{2} ‖ Y - \sum_{k=1}^{C} H_{t}^{k} ⊛ X_{t}^{k} ‖_{2}^{2} + \frac{1}{2} \sum_{k=1}^{C} ‖ W_{t}^{k} ⊙ H_{t}^{k} ‖_{2}^{2} + \frac{λ}{2} \sum_{k=1}^{C} ‖ H_{t}^{k} - H_{t-1}^{k} ‖_{2}^{2}, \tag{2}$$

where the second term is the spatial regularization term, and the third term is the temporal regularization term. Although STRCF (Li et al., 2018b) has achieved favorable outcomes, there are still some problems. The fixed and unaltered spatial regularization term is unable to address the appearance variations of the target in the unknown and complex aerial environment. The preset temporal penalty is not applicable in all circumstances. When there is an excessive amount of background information, more background noise will be introduced. Consequently, the filter is more prone to learn the appearance model from the environment rather than the target. This situation, combined with the target appearance changes caused by reasons such as complete or partial occlusion and illumination variations, makes distortion more likely to occur during the detection process, significantly reducing the credibility of the detection results.

3.2. Target Mask

Generally, a moving object changes continuously after the initial frame. Because UAV videos usually observe moving objects from a bird’s-eye view, changes in the position of the camera and of the moving objects can cause larger changes in the background and foreground of the target. Therefore, in UAV videos, it is particularly crucial to focus on the information of moving objects or to reduce the influence of background information. In fact, enhancing the information of moving objects should be given primary consideration, because weakening the influence of background information often also suppresses potential targets, which subsequently degrades the performance of the tracker.

In this article, we propose to adopt the target mask method to enhance the information of the target object. As shown in Figure 1, DACFTM will generate a target feature block, where the target features are retained and the background feature values are set to zero. To ensure the accurate separation of the target and background regions in the template, we take the position with the highest peak in the response map of the previous frame as the center of the current target region and adopt the scale information of the tracking target. Let $M_{t}$ be the feature of only the target region in the $t$ -th frame, and let $m_{t}$ be the target mask of the $t$ -th frame as follows:

$$M_{t} = X_{t} ⊙ m_{t}. \tag{3}$$

The above operation serves as a crucial spatial gating mechanism. It effectively emphasizes features within the estimated target region while attenuating or zeroing out background features. By incorporating this $M_{t}$ into our objective function as a regularization term, we explicitly guide the correlation filter to focus its learning on the target’s distinctive appearance. This direct emphasis on target-specific information enhances the filter’s discriminative power, making it more robust against background clutter, occlusions, and appearance changes, which are prevalent in UAV tracking. Based on the above discussion, we apply the idea of the target mask to the objective function and obtain a correlation filter that includes the target mask. Our objective function is described as follows:

$$E(H_{t}) = \frac{1}{2} ‖ Y - \sum_{k=1}^{C} H_{t}^{k} ⊛ X_{t}^{k} ‖_{2}^{2} + \frac{λ_{1}}{2} \sum_{k=1}^{C} ‖ H_{t}^{k} ‖_{2}^{2} + \frac{λ_{2}}{2} \sum_{k=1}^{C} ‖ H_{t}^{k} ⊛ M_{t}^{k} ‖_{2}^{2} + \frac{λ_{3}}{2} \sum_{k=1}^{C} ‖ H_{t}^{k} - H_{t-1}^{k} ‖_{2}^{2}, \tag{4}$$

where the first term is a ridge regression term, $X_{t} = [X_{t}^{1}, X_{t}^{2}, \dots, X_{t}^{C}]$ represents all the training data of the $t$-th frame image, and $H_{t} = [H_{t}^{1}, H_{t}^{2}, \dots, H_{t}^{C}]$ represents the filters to be trained in order to obtain the response $R$. The second term is the channel regularization term, the third term is the target mask regularization term, and the fourth term is the temporal regularization term. Inspired by STRCF (Li et al., 2018b), the temporal term appears in the design of many DCF-based trackers. It smooths sudden changes of the filters and enhances their robustness. Here, $H_{t-1}^{k}$ represents the filter of the $(t-1)$-th frame, and $λ_{1}$, $λ_{2}$, and $λ_{3}$ are the corresponding regularization parameters.
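The masking step of equation (3) can be sketched as follows. This is a minimal illustration, assuming a rectangular mask whose size is given in feature-map cells and whose center is the peak of the previous response map; the function names `peak_position`, `build_target_mask`, and `apply_mask` are hypothetical and are not taken from the released code.

```python
def peak_position(response):
    """Return (row, col) of the highest value in a 2-D response map."""
    best = max((value, r, c)
               for r, row in enumerate(response)
               for c, value in enumerate(row))
    return best[1], best[2]


def build_target_mask(shape, center, target_size):
    """Binary mask m_t: ones inside the target rectangle, zeros elsewhere.

    Cells of the rectangle that fall outside the map are simply dropped.
    """
    rows, cols = shape
    center_row, center_col = center
    height, width = target_size
    top = center_row - height // 2
    left = center_col - width // 2
    return [[1.0 if top <= r < top + height and left <= c < left + width else 0.0
             for c in range(cols)]
            for r in range(rows)]


def apply_mask(features, mask):
    """M_t = X_t (element-wise product) m_t, applied to every channel."""
    return [[[f * m for f, m in zip(feature_row, mask_row)]
             for feature_row, mask_row in zip(channel, mask)]
            for channel in features]


# Hypothetical 5x5 example: previous response peaks at (2, 3), target is 3x2 cells.
previous_response = [[0.0] * 5 for _ in range(5)]
previous_response[2][3] = 1.0
mask = build_target_mask((5, 5), peak_position(previous_response), (3, 2))
target_features = apply_mask([[[2.0] * 5 for _ in range(5)]], mask)
```

Because the same mask multiplies every channel, the background entries of all channels are zeroed together and only the target block contributes to the third term of equation (4).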

3.3. Distortion-Aware Mechanism and Filter Selection

Different from traditional visual tracking tasks, in UAV videos, the tracked objects are more likely to be occluded and may even exceed the image boundary. Most existing trackers have difficulty repositioning the target when it reappears. In view of this, this article proposes a distortion-aware mechanism to guide the selection of filters. This mechanism is designed to detect such unreliable response maps and prevent the tracker from learning incorrect information. Instead of updating the filter with bad data, it allows us to select a previously reliable filter, thus preventing tracking failure and maintaining robustness.

Above all, one problem that we must solve is how to detect the tracking failure caused by conditions such as occlusion or deformation with the help of the response map. Under normal circumstances, a clear response map has a sharp peak at the target position and a smooth shape in other areas. However, once the target is occluded or deformed, the response map will change drastically, usually showing a multi-peak shape, and the height of each peak will also decrease accordingly. Based on the above discussion, we can find that there are two indicators to evaluate the quality of the response map, namely (1) to evaluate the degree of the response map being a single peak, we adopt the peak-to-sidelobe ratio (Bolme et al., 2010), $PSR (R)$ ; and (2) the highest peak, $max (R)$ . The following is the definition of the peak-to-sidelobe ratio of the response map $R_{t}$ :

$$PSR(R_{t}) = \frac{\max(R_{t}) - μ}{σ}, \tag{5}$$

where $μ$ represents the average value of the response map, while $σ$ represents the standard deviation of the response map. When the target is occluded, the value of $PSR$ will decrease significantly. Both $\max(R)$ and $PSR(R)$ can evaluate the quality of the tracking results to a certain extent.
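Equation (5) can be computed directly from the response map. The sketch below follows the definition given in the text, with the mean and standard deviation taken over the whole map; the function name `psr` and the zero-variance guard are assumptions added for illustration.

```python
from statistics import mean, pstdev


def psr(response):
    """Peak-to-sidelobe ratio of a 2-D response map, as in equation (5)."""
    values = [value for row in response for value in row]
    sigma = pstdev(values)
    if sigma == 0.0:
        return 0.0  # assumption: a flat map carries no localization evidence
    return (max(values) - mean(values)) / sigma


# Hypothetical maps: one sharp peak versus two weaker peaks.
single_peak = [[0.0] * 5 for _ in range(5)]
single_peak[2][2] = 1.0
double_peak = [[0.0] * 5 for _ in range(5)]
double_peak[1][1] = 0.5
double_peak[3][3] = 0.5
```

On these toy maps the single-peak response scores higher than the two-peak one, which matches the behavior described for occluded or deformed targets.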

In view of this, we decided to adopt a combined form of these two indicators. In order to be able to adaptively select the appropriate threshold, finally we give the specific definition of the evaluation index based on the peak-to-sidelobe ratio and the highest peak:

$$Ω(R_{t}) = PSR(R_{t}) \cdot \frac{\max(R_{t})}{mean(\max(R_{1:t}))}, \tag{6}$$

where $Ω(R_{t})$ represents the evaluation score of the $t$-th frame, and $mean(\max(R_{1:t}))$ represents the average maximum peak value from the first frame to the $t$-th frame. Occlusion or deformation events usually occur within a relatively short period of time, so we use the average evaluation score of the response maps of the most recent $k$ reliable tracked frames, that is, $Ω^{avg}(R_{(t-k):t})$. We finally adopt the following form as the quality evaluation parameter:

$$ω_{t} = \frac{Ω(R_{t})}{Ω^{avg}(R_{(t-k):t})}, \tag{7}$$

where $ω_{t}$ represents the tracking certainty. When $ω_{t}$ is less than or equal to the set threshold, it indicates that the current response map quality is poor.
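Equations (6) and (7) amount to a small amount of bookkeeping per frame, sketched below. The class name `CertaintyTracker` and its parameters are hypothetical. Two details are assumptions, since the text does not fix them: the reference average is taken over the reliable scores stored before the current frame, and the first frame is trusted with certainty 1. Peak values are assumed to be positive.

```python
from collections import deque


class CertaintyTracker:
    """Running computation of the score of eq. (6) and the certainty of eq. (7)."""

    def __init__(self, window, epsilon):
        self.peaks = []                               # max(R_1), ..., max(R_t)
        self.reliable_scores = deque(maxlen=window)   # scores of recent reliable frames
        self.epsilon = epsilon                        # reliability threshold

    def update(self, psr_value, peak_value):
        self.peaks.append(peak_value)
        mean_peak = sum(self.peaks) / len(self.peaks)
        score = psr_value * peak_value / mean_peak            # equation (6)
        if self.reliable_scores:
            reference = sum(self.reliable_scores) / len(self.reliable_scores)
            certainty = score / reference                     # equation (7)
        else:
            certainty = 1.0  # assumption: nothing to compare with on the first frame
        if certainty > self.epsilon:
            # Only reliable frames feed the running reference.
            self.reliable_scores.append(score)
        return certainty


# Hypothetical sequence: two clean frames followed by an occluded one.
tracker = CertaintyTracker(window=3, epsilon=0.5)
history = [tracker.update(10.0, 1.0),
           tracker.update(10.0, 1.0),
           tracker.update(4.0, 0.4)]
```

Keeping unreliable frames out of the reference window prevents a long occlusion from gradually lowering the bar for what counts as a good response.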

According to equation (7), we further propose a filter selection mechanism:

$$H_{t}^{next} = \begin{cases} β H^{saved} + (1 - β) H_{t}, & ω_{t} > ε, \\ H^{saved}, & ω_{t} \leq ε, \end{cases} \tag{8}$$

where $ε$ represents a threshold, $H_{t}^{next}$ represents the finally adopted filter, and $H^{saved}$ represents the historically retained filter. When the detection result is abnormal ($ω_{t} \leq ε$), we do not update the filter, which prevents subsequent filters from being affected by a response map that has changed due to occlusion. When the detection result is normal ($ω_{t} > ε$), we adopt the filter fusion strategy to learn a better filter. This strategy ensures robust adaptation to normal changes while preventing degradation from abnormal detection results. The update mechanism of $β$ is as follows:

$$β = \begin{cases} β_{1}, & ω_{t-2} - ω_{t-1} > θ \text{ and } ω_{t-1} - ω_{t} > θ, \\ β_{2}, & ω_{t-1} - ω_{t-2} > θ \text{ and } ω_{t} - ω_{t-1} > θ, \\ 1 - ω_{t}, & \text{otherwise}. \end{cases} \tag{9}$$

When $β_{1}$ is selected, the tracking certainty at this time is constantly decreasing, indicating that the quality of the filter is decreasing at this time; when $β_{2}$ is selected, the tracking certainty at this time is constantly increasing, indicating that the quality of the filter is increasing at this time. If it does not belong to these two situations, it indicates that the overall quality of the filter has little change at this time, and we use the tracking certainty to weight the filter.
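The selection rules of equations (8) and (9) can be sketched as two small functions. The names `select_beta` and `next_filter` are hypothetical, filters are represented as flat lists of coefficients for brevity, and the numeric values of $β_{1}$, $β_{2}$, $θ$, and $ε$ used in the example are placeholders rather than the settings of the paper.

```python
def select_beta(omega_history, theta, beta_1, beta_2):
    """Fusion weight of equation (9); omega_history = (w_{t-2}, w_{t-1}, w_t)."""
    w_prev2, w_prev1, w_now = omega_history
    if w_prev2 - w_prev1 > theta and w_prev1 - w_now > theta:
        return beta_1            # certainty keeps falling
    if w_prev1 - w_prev2 > theta and w_now - w_prev1 > theta:
        return beta_2            # certainty keeps rising
    return 1.0 - w_now           # otherwise weight by the certainty itself


def next_filter(h_current, h_saved, omega_now, epsilon, beta):
    """Filter used for the next frame, following equation (8)."""
    if omega_now <= epsilon:
        return list(h_saved)     # unreliable response: fall back to the saved filter
    return [beta * saved + (1.0 - beta) * current
            for saved, current in zip(h_saved, h_current)]


# Hypothetical values: a steadily falling certainty that is still above epsilon.
beta = select_beta((0.9, 0.7, 0.55), theta=0.1, beta_1=0.8, beta_2=0.2)
fused = next_filter([2.0, 4.0], [0.0, 0.0], omega_now=0.55, epsilon=0.5, beta=beta)
```

Note that $β$ weights the saved filter, so a larger $β$ leans on history and a smaller $β$ trusts the filter learned in the current frame.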

3.4. Model Optimization

To enhance computational efficiency, the computation in the spatial domain is typically transformed to the frequency domain when optimizing the correlation filter. Applying Parseval's theorem to equation (4), the following equivalent optimization problem can be obtained:

$$E(H_{t}) = \frac{1}{2} ‖ \hat{Y} - \sum_{k=1}^{C} \hat{H}_{t}^{k} ⊙ \hat{X}_{t}^{k} ‖_{2}^{2} + \frac{λ_{1}}{2} \sum_{k=1}^{C} ‖ \hat{H}_{t}^{k} ‖_{2}^{2} + \frac{λ_{2}}{2} \sum_{k=1}^{C} ‖ \hat{H}_{t}^{k} ⊙ \hat{M}_{t}^{k} ‖_{2}^{2} + \frac{λ_{3}}{2} \sum_{k=1}^{C} ‖ \hat{H}_{t}^{k} - \hat{H}_{t-1}^{k} ‖_{2}^{2}, \tag{10}$$

where $\hat{\cdot}$ denotes the discrete Fourier transform of a signal. The operation between the filter $H_{t}^{k}$ and the training sample $X_{t}^{k}$ changes from the circular convolution $⊛$ to the element-wise product $⊙$. The four terms on the right side of equation (10) are all convex. Since equation (10) is an unconstrained convex optimization problem, it can be optimized by ADMM (Boyd et al., 2011). To facilitate the solution and calculation, an auxiliary variable $H' = H$ is introduced, and the resulting new objective function is as follows:

$$\begin{aligned} \min_{\hat{H}, \hat{H}'} \quad & \frac{1}{2} ‖ \hat{Y} - \sum_{k=1}^{C} \hat{H}_{t}^{k} ⊙ \hat{X}_{t}^{k} ‖_{2}^{2} + \frac{λ_{1}}{2} \sum_{k=1}^{C} ‖ \hat{H}_{t}^{\prime k} ‖_{2}^{2} \\ & + \frac{λ_{2}}{2} \sum_{k=1}^{C} ‖ \hat{H}_{t}^{k} ⊙ \hat{M}_{t}^{k} ‖_{2}^{2} + \frac{λ_{3}}{2} \sum_{k=1}^{C} ‖ \hat{H}_{t}^{k} - \hat{H}_{t-1}^{k} ‖_{2}^{2}, \\ \text{s.t.} \quad & \hat{H} - \hat{H}' = 0. \end{aligned} \tag{11}$$

After introducing the auxiliary variable, the above equation becomes a bi-variable optimization problem. Usually, we can use the augmented Lagrangian multiplier to increase the convergence speed of the ADMM method. After augmenting the above equation, we can obtain the following equation:

$$\begin{aligned} L(\hat{H}, \hat{H}', \hat{Γ}) = {} & \frac{1}{2} ‖ \hat{Y} - \sum_{k=1}^{C} \hat{H}_{t}^{k} ⊙ \hat{X}_{t}^{k} ‖_{2}^{2} + \frac{λ_{1}}{2} \sum_{k=1}^{C} ‖ \hat{H}_{t}^{\prime k} ‖_{2}^{2} + \frac{λ_{2}}{2} \sum_{k=1}^{C} ‖ \hat{H}_{t}^{k} ⊙ \hat{M}_{t}^{k} ‖_{2}^{2} + \frac{λ_{3}}{2} \sum_{k=1}^{C} ‖ \hat{H}_{t}^{k} - \hat{H}_{t-1}^{k} ‖_{2}^{2} \\ & + \hat{Γ}^{⊤} \sum_{k=1}^{C} (\hat{H}_{t}^{k} - \hat{H}_{t}^{\prime k}) + \frac{μ}{2} \sum_{k=1}^{C} ‖ \hat{H}_{t}^{k} - \hat{H}_{t}^{\prime k} ‖_{F}^{2}, \end{aligned} \tag{12}$$

where $Γ = [Γ_{1}, Γ_{2}, \dots, Γ_{C}]$ is the Lagrange multiplier, $μ$ is the penalty coefficient of the augmented term, and $\hat{Γ}$ in equation (12) is the Fourier transform of $Γ$. For simplicity of notation, let $\hat{γ} = [\hat{γ}^{1}, \hat{γ}^{2}, \dots, \hat{γ}^{C}] = \frac{\hat{Γ}}{μ}$, from which the scaled form of equation (12) can be derived as follows:

$$\begin{aligned} L(\hat{H}, \hat{H}', \hat{γ}) = {} & \frac{1}{2} ‖ \hat{Y} - \sum_{k=1}^{C} \hat{H}_{t}^{k} ⊙ \hat{X}_{t}^{k} ‖_{2}^{2} + \frac{λ_{1}}{2} \sum_{k=1}^{C} ‖ \hat{H}_{t}^{\prime k} ‖_{2}^{2} + \frac{λ_{2}}{2} \sum_{k=1}^{C} ‖ \hat{H}_{t}^{k} ⊙ \hat{M}_{t}^{k} ‖_{2}^{2} + \frac{λ_{3}}{2} \sum_{k=1}^{C} ‖ \hat{H}_{t}^{k} - \hat{H}_{t-1}^{k} ‖_{2}^{2} \\ & + \frac{μ}{2} \sum_{k=1}^{C} ‖ \hat{H}_{t}^{k} - \hat{H}_{t}^{\prime k} + \hat{γ}^{k} ‖_{2}^{2} - \frac{μ}{2} \sum_{k=1}^{C} ‖ \hat{γ}^{k} ‖_{F}^{2}. \end{aligned} \tag{13}$$

Next, by alternately updating $\hat{H}$ and ${\hat{H}}^{'}$ in the above equation, the following iterative solution equation can be obtained:

$$\begin{cases} \hat{H}^{(i+1)} = \underset{\hat{H}}{\arg\min} \; L(\hat{H}, \hat{H}^{\prime (i)}, \hat{γ}^{(i)}), \\ \hat{H}^{\prime (i+1)} = \underset{\hat{H}'}{\arg\min} \; L(\hat{H}^{(i)}, \hat{H}', \hat{γ}^{(i)}), \\ \hat{γ}^{(i+1)} = \hat{γ}^{(i)} + \hat{H}^{(i+1)} - \hat{H}^{\prime (i+1)}. \end{cases} \tag{14}$$

Here, the first two equalities are the update formulas of the two sub-problems, and both $\hat{H}$ and $\hat{H}'$ have closed-form solutions. The $\hat{γ}$ in the last equality is the dual variable, which is updated after $\hat{H}$ and $\hat{H}'$. We solve the two sub-problems and the dual update in turn.

Subproblem $\hat{H}$: First, we assume that $\hat{H}^{\prime (i)}$ and $\hat{γ}^{(i)}$ are fixed and known in the $(i+1)$-th iteration. Then, substituting them into the scaled form in equation (13) gives the following equation:

$$\begin{aligned} \hat{H}^{(i+1)} = \underset{\hat{H}_{t}}{\arg\min} \; & \frac{1}{2} ‖ \hat{Y} - \sum_{k=1}^{C} \hat{H}_{t} ⊙ \hat{X}_{t} ‖_{2}^{2} + \frac{λ_{2}}{2} \sum_{k=1}^{C} ‖ \hat{H}_{t} ⊙ \hat{M}_{t} ‖_{2}^{2} + \frac{λ_{3}}{2} \sum_{k=1}^{C} ‖ \hat{H}_{t} - \hat{H}_{t-1} ‖_{2}^{2} \\ & + \frac{μ}{2} \sum_{k=1}^{C} ‖ \hat{H}_{t} - \hat{H}_{t}^{\prime (i)} + \hat{γ}^{(i)} ‖_{2}^{2}, \end{aligned} \tag{15}$$

where we omit the superscript $k$ for the sake of concise expression. The computational complexity of solving equation (15) directly is $O(T^{3} C^{3})$, so it cannot be calculated directly. Observing equation (15), it can be found that it is a calculation on all $C$ channels at the $t$-th frame and the $(i+1)$-th iteration.

Therefore, for simple calculation, the dimension can be decomposed into the coordinate position of each pixel, that is, equivalently decomposed into $T$ smaller sub-problems. The following formula is the representation of the $j$ -th sub-problem:

$$\begin{aligned} \hat{H}^{(i+1)}(j) = \underset{\hat{H}_{t}}{\arg\min} \; & \frac{1}{2} ‖ \hat{Y}(j) - \sum_{k=1}^{C} \hat{H}_{t}(j) ⊙ \hat{X}_{t}(j) ‖_{2}^{2} + \frac{λ_{2}}{2} \sum_{k=1}^{C} ‖ \hat{H}_{t}(j) ⊙ \hat{M}_{t}(j) ‖_{2}^{2} \\ & + \frac{λ_{3}}{2} \sum_{k=1}^{C} ‖ \hat{H}_{t}(j) - \hat{H}_{t-1}(j) ‖_{2}^{2} + \frac{μ}{2} \sum_{k=1}^{C} ‖ \hat{H}_{t}(j) - \hat{H}_{t}^{\prime (i)}(j) + \hat{γ}^{(i)}(j) ‖_{2}^{2}. \end{aligned} \tag{16}$$

The $\hat{H}^{(i+1)}(j)$ in the above equation represents the $j$-th element across the $C$ channels. To solve it, we set the first derivative of equation (16) to zero and obtain:

$$\begin{aligned} \hat{H}^{(i+1)}(j) = {} & \left( \hat{X}_{t}(j) \hat{X}_{t}(j)^{⊤} + λ_{2} \hat{M}_{t}(j) \hat{M}_{t}(j)^{⊤} + (λ_{3} + μ) T I \right)^{-1} \\ & \times \left( \hat{X}_{t}(j) \hat{Y}(j) + λ_{3} T \hat{H}_{t-1}(j) + μ T \hat{H}_{t}^{\prime (i)}(j) - μ T \hat{γ}^{(i)}(j) \right). \end{aligned} \tag{17}$$

For the composite inverse matrix in the above equation, it can be further simplified using the Sherman-Morrison formula (Sherman & Morrison, 1950). Therefore, we can obtain as follows:

\begin{aligned} {\hat{H}}^{(i + 1)} (j) & = \frac{1}{λ_{3} + μ} (I - \frac{{\hat{X}}_{t} (j) {\hat{X}}_{t} {(j)}^{⊤} + λ_{2} {\hat{M}}_{t} (j) {\hat{M}}_{t} {(j)}^{⊤}}{(λ_{2} + μ) T + {\hat{X}}_{t} {(j)}^{⊤} {\hat{X}}_{t} (j) + λ_{2} {\hat{M}}_{t} {(j)}^{⊤} {\hat{M}}_{t} (j)}) \\ \times ({\hat{X}}_{t} (j) \hat{Y} (j) + λ_{3} T {\hat{H}}_{t - 1} (j) + μ T {\hat{H}}_{t} (j) - μ T \hat{γ} (j)) . \end{aligned}

(18)

The above equation no longer requires a matrix inversion, only multiplications and additions of vectors, so it can be computed much more efficiently.
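The inversion-free form relies on the Sherman-Morrison identity. The following NumPy sketch illustrates the rank-one case only: it keeps the $\hat{X}$ term of equation (17) and omits the mask term, uses the conjugate transpose for complex Fourier-domain vectors, and the name `solve_rank_one` and all numeric values are hypothetical choices made for illustration.

```python
import numpy as np

def solve_rank_one(x, a, b):
    """Solve (x x^H + a I) h = b for h without forming or inverting a matrix.

    Sherman-Morrison: (a I + x x^H)^{-1} = (1/a) (I - x x^H / (a + x^H x)).
    """
    xhb = np.vdot(x, b)         # x^H b (np.vdot conjugates its first argument)
    xhx = np.vdot(x, x).real    # x^H x, a non-negative real scalar
    return (b - x * (xhb / (a + xhx))) / a

rng = np.random.default_rng(0)
C = 6                           # number of feature channels (illustrative)
x = rng.normal(size=C) + 1j * rng.normal(size=C)
b = rng.normal(size=C) + 1j * rng.normal(size=C)
a = 2.5                         # stands in for the scalar that multiplies I

h_fast = solve_rank_one(x, a, b)
# Reference solution with an explicit C x C linear solve, for comparison only.
h_ref = np.linalg.solve(np.outer(x, x.conj()) + a * np.eye(C), b)
print(np.allclose(h_fast, h_ref))
```

In this simplified setting the per-pixel cost drops from a $C \times C$ linear solve to a few inner products of length $C$.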

Subproblem ${\hat{H}}^{'}$: We assume that ${\hat{H}}^{(i)}$ and ${\hat{γ}}^{(i)}$ are fixed and known in the $(i + 1)$-th iteration. Then, substituting them into equation (12) gives the following equation:

\begin{aligned} \hat{H} {^{'}}^{(i + 1)} & = \underset{{\hat{H}}_{t}^{'}}{\arg min} \frac{1}{2} λ_{1} \sum_{k = 1}^{C} {‖ {\hat{H}}_{t}^{'} ‖}_{2}^{2} + \frac{μ}{2} \sum_{k = 1}^{C} {‖ {\hat{H}}_{t}^{(i)} - {\hat{H}}_{t}^{'} + {\hat{γ}}^{(i)} ‖}_{2}^{2} . \end{aligned}

(19)

Following Chen et al. (2009) and Zhang et al. (2012), we set the first derivative of the above equation to zero and obtain the optimal solution:

\hat{H} {^{'}}^{(i + 1)} = max (0, 1 - \frac{λ_{1}}{μ {‖ q^{k} ‖}_{2}}) q^{k},

(20)

where $q^{k} = {\hat{H}}_{t}^{k} + {\hat{γ}}^{k}$ $(k = 1, \dots, C)$.
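The shrinkage rule in equation (20) can be sketched per channel as follows. This is a minimal illustration, not the authors' implementation: the function name `group_shrink`, the zero-norm guard, and the numeric values are assumptions.

```python
import numpy as np

def group_shrink(q, lam1, mu):
    """Return max(0, 1 - lam1 / (mu * ||q||_2)) * q for one channel q."""
    norm = np.linalg.norm(q)
    if norm == 0.0:
        # Guard added for the sketch: a zero channel stays zero.
        return np.zeros_like(q)
    scale = max(0.0, 1.0 - lam1 / (mu * norm))
    return scale * q

q_weak = np.array([0.3, 0.4])     # ||q|| = 0.5, so 1 - lam1 / (mu * ||q||) < 0
q_strong = np.array([3.0, 4.0])   # ||q|| = 5, so the scale factor is 1 - 1/5
shrunk_weak = group_shrink(q_weak, lam1=2.0, mu=2.0)
shrunk_strong = group_shrink(q_strong, lam1=2.0, mu=2.0)
```

Channels with a small norm are set to zero as a whole, while channels with a large norm are only scaled down.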

Subproblem $\hat{γ}$: Finally, we can directly update ${\hat{γ}}^{(i + 1)}$ using the following expression:

{\hat{γ}}^{(i + 1)} = {\hat{γ}}^{(i)} + {\hat{H}}^{(i + 1)} - \hat{H} {^{'}}^{(i + 1)} .

(21)

The ${\hat{H}}^{(i + 1)}$ and ${\hat{H}}^{' (i + 1)}$ in the equation have been obtained in equations (18) and (20), respectively. To sum up, the closed-form solutions of the three sub-problems regarding $\hat{H}$ , ${\hat{H}}^{'}$ , and $\hat{γ}$ have all been obtained.

3.5. Lagrangian Update

The Lagrangian multipliers can be updated as follows:

{\hat{Γ}}^{i + 1} = {\hat{Γ}}^{i} + μ ({\hat{H}}^{i + 1} - \hat{H} {^{'}}^{i + 1}),

(22)

where ${\hat{H}}^{i + 1}$ and $\hat{H} {^{'}}^{i + 1}$ are the solutions of the corresponding sub-problems in the $(i + 1)$-th iteration of ADMM. The penalty parameter is updated as $μ^{i + 1} = \min (μ^{\max}, δ μ^{i})$. Then, the optimization is completed.
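The growth of $μ$ under this rule can be traced with a short sketch. The values $δ = 1.5$ and $μ^{\max} = 20$ are those reported in Section 4.1; the starting value `mu0 = 1.0`, the iteration count, and the function name are assumptions, since the initial $μ$ is not stated here.

```python
def penalty_schedule(mu0, delta, mu_max, n_iters):
    """List mu^0, mu^1, ... under mu^{i+1} = min(mu_max, delta * mu^i)."""
    mus = [mu0]
    for _ in range(n_iters):
        mus.append(min(mu_max, delta * mus[-1]))
    return mus

# mu0 is an assumed starting value; delta and mu_max follow Section 4.1.
schedule = penalty_schedule(mu0=1.0, delta=1.5, mu_max=20.0, n_iters=10)
```

With these settings the penalty grows geometrically and is clipped at $μ^{\max}$ from the eighth update onward.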

3.6. Online Tracking

Similar to other trackers (Kiani Galoogahi et al., 2017; Li et al., 2018b), we adopt the online adaptive template scheme to train the correlation filter, which enhances the robustness of the tracker against appearance changes and illumination variations of the target. The online adaptive update method of the filter model is defined as follows:

{\hat{x}}_{model}^{t} = (1 - α) {\hat{x}}_{model}^{t - 1} + α {\hat{x}}^{t},

(23)

where $α$ represents the adaptive updating ratio, ${\hat{x}}_{model}^{t}$ denotes the updated template of the current frame, ${\hat{x}}_{model}^{t - 1}$ refers to the template of the previous frame, and ${\hat{x}}^{t}$ represents the template in the current frame.
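Equation (23) is a linear interpolation between the previous model and the current observation. A minimal sketch, assuming the templates are stored as NumPy arrays; the function name and the toy data are hypothetical:

```python
import numpy as np

def update_template(model_prev, x_curr, alpha):
    """Blend the previous template with the current one using ratio alpha."""
    return (1.0 - alpha) * model_prev + alpha * x_curr

model_prev = np.zeros((4, 4))   # toy template from frame t - 1
x_curr = np.ones((4, 4))        # toy template observed in frame t
slow = update_template(model_prev, x_curr, alpha=0.05)
fast = update_template(model_prev, x_curr, alpha=0.6)
```

A small $α$ keeps the model close to its history, whereas a large $α$ lets the current frame dominate the template.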

In the target detection process, we need to extract the multichannel feature ${\hat{X}}_{t}$ , and then correlate ${\hat{X}}_{t}$ with the filter ${\hat{H}}_{t - 1}^{next}$ from the previous frame in the Fourier domain to obtain the frequency response. Subsequently, an inverse Fourier transform is performed on the computed result to obtain the final time-domain response:

R_{t} = F^{- 1} (\sum_{k = 1}^{C} {\hat{H}}_{t - 1}^{k} ⊙ {\hat{X}}_{t}^{k}),

(24)

where $R_{t}$ denotes the response score of the $t$-th frame, and $F^{- 1} (\cdot)$ represents the inverse discrete Fourier transform (IDFT). Finally, the maximum position of its response is determined using Newton’s iteration, which yields the tracking result for that frame. The specific process is shown in Algorithm 1.
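The detection step of equation (24) can be sketched with NumPy as follows. This is an illustration under assumptions: features and filters are Fourier-domain arrays of shape `(C, h, w)`, the product is taken exactly as written in equation (24) without conjugation, and the sub-pixel Newton refinement is replaced by a plain integer `argmax`.

```python
import numpy as np

def detect(H_hat, X_hat):
    """Sum the per-channel products, invert the DFT, and locate the peak."""
    response = np.real(np.fft.ifft2(np.sum(H_hat * X_hat, axis=0)))
    peak = np.unravel_index(np.argmax(response), response.shape)
    return response, peak

# Toy check: an impulse at (3, 5) passed through an all-pass filter.
x = np.zeros((1, 8, 8))
x[0, 3, 5] = 1.0
X_hat = np.fft.fft2(x, axes=(-2, -1))
H_hat = np.ones_like(X_hat)
response, peak = detect(H_hat, X_hat)
```

In this toy case the response equals the input impulse, so the peak is found at the impulse position.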

4. Experiments

To verify the tracking performance of the proposed DACFTM, we conducted extensive comparative and ablation experiments on three mainstream UAV tracking datasets against 18 advanced trackers. These evaluations include detailed ablation studies that analyze the impact of each component, as well as comparisons with recent state-of-the-art methods.

4.1. Implementation Details

To guarantee that the comparison experiments are conducted fairly and impartially, only the ground truth of the target in the first frame is provided for all the tested videos. We implemented DACFTM in MATLAB 2020b, using MatConvNet-v1.0 as a toolbox with CUDA 10.2. The computational platform was a PC with a 2.5 GHz CPU and 16 GB of RAM. We combined hand-crafted features (CN and HOG) with deep CNN features extracted from ResNet-50 (He et al., 2016). With HOG features alone, DACFTM reaches an average processing speed of 69.7 FPS on DTB-70, which satisfies the real-time requirements of UAV tracking; the richer feature configurations are slower, as reported in Table 4. The parameters were chosen as follows. For the objective function, we set $λ_{1} = 10$ , $λ_{2} = 16$ , $λ_{3} = 0.1$ , and adaptive updating ratio $α = 0.6$ when using hand-crafted features, and $λ_{2} = 12$ , $λ_{3} = 0.1$ , and adaptive updating ratio $α = 0.05$ when using deep features. For the distortion-aware mechanism, we set $ε = 1$ , $θ = 0.3$ , $β_{1} = 0.9$ , $β_{2} = 0.1$ , $k = 5$ . For the Lagrangian update, we set $δ = 1.5$ , $μ^{\max} = 20$ .
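For reference, the reported settings can be collected in one configuration object. The sketch below is only a convenient restatement of the values above: the dictionary layout and key names are hypothetical, and $λ_{1}$ is listed only where the text states it.

```python
# Hyperparameters as reported in the text; the key names are illustrative.
DACFTM_PARAMS = {
    "hand_crafted": {"lambda1": 10, "lambda2": 16, "lambda3": 0.1, "alpha": 0.6},
    # lambda1 is not restated for deep features in the text, so it is omitted.
    "deep": {"lambda2": 12, "lambda3": 0.1, "alpha": 0.05},
    "distortion_aware": {"epsilon": 1, "theta": 0.3, "beta1": 0.9, "beta2": 0.1, "k": 5},
    "admm": {"delta": 1.5, "mu_max": 20},
}

def params_for(feature_type):
    """Merge the feature-specific values with the shared settings."""
    merged = dict(DACFTM_PARAMS[feature_type])
    merged.update(DACFTM_PARAMS["distortion_aware"])
    merged.update(DACFTM_PARAMS["admm"])
    return merged

deep_params = params_for("deep")
```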

To obtain the results of all compared methods under the same conditions as described in this section, we ran the publicly available code of each tracker on the drone benchmarks (UAV123@10fps, DTB-70, and UAVDT). In cases where direct code execution was not feasible or was less efficient, we used published benchmark scores, so that all comparison results were obtained within a standardized testing framework.

4.2. Comparison With Advanced Trackers

In this section, we compare our method with other advanced trackers.

4.2.1. DTB-70

We evaluated DACFTM on the Drone Tracking Benchmark 70 (DTB-70), which contains 70 RGB video sequences captured from the perspective of unmanned aerial vehicles. Table 1 reports the per-attribute results, and Table 2 compares DACFTM with eleven other tracking methods: ARCF (Huang et al., 2019), BACF (Kiani Galoogahi et al., 2017), IBRI (Fu et al., 2020), STRCF (Li et al., 2018b), LADCF (Xu et al., 2019c), AutoTrack (Li et al., 2020), MSEFCF (Yu et al., 2024), BiCF (Lin et al., 2020), ASTWR (Chen et al., 2024), ASTSCF (Li et al., 2025), and FWRDCF (Jia et al., 2025).

Table 1.
A Detailed Comparison of the 11 Challenge Attributes on the DTB-70 Dataset. Our Approach Outperformed Other Advanced Trackers on DTB-70.

Trackers	IPR $↑$	FCM $↑$	DEF $↑$	OCC $↑$	SV $↑$	MB $↑$	SOA $↑$	BC $↑$	OV $↑$	OPR $↑$	ARV $↑$
MCCT	0.376	0.410	0.354	0.377	0.439	0.334	0.399	0.296	0.349	0.243	0.334
BACF	0.371	0.435	0.302	0.348	0.392	0.412	0.411	0.337	0.419	0.203	0.273
IBRI	0.427	0.494	0.431	0.410	0.469	0.463	0.466	0.398	0.464	0.315	0.408
STRCF	0.391	0.460	0.400	0.401	0.426	0.437	0.444	0.341	0.407	0.260	0.340
LADCF	0.391	0.474	0.443	0.447	0.425	0.430	0.458	0.350	0.452	0.323	0.315
AutoTrack	0.454	0.497	0.452	0.415	0.493	0.468	0.473	0.394	0.407	0.343	0.405
MSEFCF	0.469	0.532	0.460	0.472	0.508	0.510	0.527	0.401	0.404	0.370	0.423
BiCF	0.439	0.472	0.444	0.372	0.482	0.448	0.444	0.381	0.389	0.354	0.398
Ours	0.521	0.552	0.534	0.473	0.528	0.550	0.537	0.511	0.503	0.432	0.503

DTB-70 = Drone Tracking Benchmark 70; SV = scale variation; OCC = occlusion; DEF = deformation; OV = out of view; IPR = in-plane rotation; FCM = fast camera motion; MB = motion blur; SOA = similar object around; BC = background clutter; OPR = out-of-plane rotation; ARV = abrupt range variation; MCCT = multi-cue correlation tracker; BACF = background-aware correlation filter; IBRI = disruptor-aware interval-based response inconsistency; STRCF = spatial-temporal regularized correlation filter; LADCF = learning adaptive discriminative correlation filter; MSEFCF = multi-scale enhanced features correlation filter; BiCF = bidirectional incongruity-aware correlation filter.

Table 2.

Comparison on DTB-70 and UAVDT Datasets.

Datasets	Scores	Trackers
DTB-70		ASTSCF	FWRDCF	ARCF	BACF	IBRI	STRCF	LADCF	AutoTrack	MSEFCF	BiCF	ASTWR	Ours
	AUC $↑$	0.483	0.493	0.472	0.402	0.471	0.432	0.428	0.479	0.502	0.462	0.494	0.546
	DP $↑$	0.702	0.732	0.694	0.590	0.686	0.644	0.629	0.694	0.728	0.657	0.751	0.834
UAVDT		ASTSCF	FWRDCF	ASRCF	TADT	RTDG	ECO	HiFT	MCCT	BSTCF	FBACF	MSEFCF	Ours
	AUC $↑$	0.469	0.510	0.437	0.431	0.458	0.454	0.468	0.437	0.441	0.465	0.448	0.506
	DP $↑$	0.739	0.750	0.702	0.678	0.728	0.714	0.641	0.671	0.685	0.739	0.733	0.768

ASTSCF = adaptive spatial-temporal structured correlation filter; FWRDCF = target-background feature blocks with aberrance repressed DCF; ARCF = aberrance repressed correlation filter; BACF = background-aware correlation filter; IBRI = disruptor-aware interval-based response inconsistency; STRCF = spatial-temporal regularized correlation filter; LADCF = learning adaptive discriminative correlation filter; MSEFCF = multi-scale enhanced features correlation filter; BiCF = bidirectional incongruity-aware correlation filter; ASTWR = adaptive spatial-temporal weighted regularization; AUC = area under the curve; DP = distance precision; ASRCF = adaptive spatially-regularized correlation filters; TADT = target-aware deep tracking; RTDG = response temporal difference guided tracker; ECO = efficient convolution operators; HiFT = hierarchical feature transformer; MCCT = multi-cue correlation tracker; BSTCF = background-aware and spatial-temporal regularized correlation filter; FBACF = feature block-aware correlation filters.

As shown in Table 2, our tracker achieved the best scores on DTB-70. The success rate (AUC) was 5.2% higher than ASTWR, 6.7% higher than AutoTrack, and 8.4% higher than BiCF. The precision (DP) was 8.3% higher than ASTWR and 10.6% higher than MSEFCF. All margins are absolute differences between the scores in Table 2. We also compared the performance with eight tracking methods in specific challenging scenarios. As shown in Table 1, DACFTM achieved the best performance in scenarios such as scale variation (SV), occlusion (OCC), deformation (DEF), and out of view (OV). Because our tracker is mainly designed to address occlusion and appearance changes of moving objects in UAV tracking, it exhibits the best results in these challenging scenarios.

4.2.2. UAV123@10fps

We comprehensively evaluated DACFTM on the UAV123@10fps dataset, which contains 91 UAV video sequences. We compared DACFTM with 13 other tracking methods: MCCT_H (Wang et al., 2018), BACF (Kiani Galoogahi et al., 2017), IBRI (Fu et al., 2020), STRCF (Li et al., 2018b), LADCF (Xu et al., 2019c), AutoTrack (Li et al., 2020), MSEFCF (Yu et al., 2024), ARCF (Huang et al., 2019), BiCF (Lin et al., 2020), ReCF (Lin et al., 2021), ECO_HC (Danelljan et al., 2017), UDT (Wang et al., 2019), and SRDCF (Danelljan et al., 2015b).

As shown in Figure 2, our tracker achieved the best scores in both tracking success rate and precision. The success rate was 8.2% higher than AutoTrack, 10.5% higher than LADCF, and 7.1% higher than MSEFCF. In terms of precision, it was 8.8% higher than IBRI, 7.9% higher than MSEFCF, and 9.9% higher than BiCF.

To evaluate the robustness of DACFTM, we also compared the performance of each tracker in different challenging scenarios. As shown in Figures 3 and 4, our method achieved the best performance in scenarios such as scale variation (SV), partial occlusion (PO), low resolution (LR), and out of view (OV).

Finally, we performed a qualitative evaluation on this dataset against four other tracking methods, using several different video sequences to assess the trackers. Figure 5 shows the tracking results of these trackers on five sequences. Overall, our tracker demonstrates higher accuracy in tracking the target. For instance, in the wakeboard sequence, all the other trackers lost the target, but our tracker was still able to locate it.

To further analyze the limitations of the proposed method, we present several representative failure cases in Figure 6, where the tracker performs poorly under extremely challenging conditions. As shown in these sequences, tracking failure mainly occurs in situations involving long-term severe occlusion, abrupt camera motion, fast target deformation, and significant background interference. Under such conditions, the response map becomes highly ambiguous, and even the distortion-aware mechanism may fail to recover reliable target information because discriminative visual cues are absent.

Figure 2.

Success rate and precision of compared algorithms on UAV123@10fps.

Figure 3.

Comparisons of success rates for 12 challenging attributes on UAV123@10fps. For ARC, BC, CM, FM, FO, IV, LR, OV, PO, SO, SV and VC, our DACFTM achieved better tracking performance compared to other advanced trackers. ARC = aspect ratio change; BC = background clutter; CM = camera motion; FM = fast motion; FO = full occlusion; IV = illumination variation; LR = low resolution; OV = out of view; PO = partial occlusion; SO = similar object; SV = scale variation; VC = viewpoint change; DACFTM = distortion-aware correlation filter with target mask.

Figure 4.

Comparisons of precision for 12 challenging attributes on UAV123@10fps. For ARC, BC, CM, FM, FO, IV, LR, OV, PO, SO, SV and VC, our DACFTM achieved better tracking performance compared to other advanced trackers. ARC = aspect ratio change; BC = background clutter; CM = camera motion; FM = fast motion; FO = full occlusion; IV = illumination variation; LR = low resolution; OV = out of view; PO = partial occlusion; SO = similar object; SV = scale variation; VC = viewpoint change; DACFTM = distortion-aware correlation filter with target mask.

Figure 5.

Comparison of tracking quality with four other advanced trackers in five challenging image sequences (from top to bottom: boat4, wakeboard2, bike1, car16_2, and wakeboard8).

Figure 6.

Failure cases compared with four other advanced trackers in two challenging image sequences (from top to bottom: bike2 and person14_1). All sequences in this figure show poor tracking performance under extremely difficult conditions, including abrupt camera motion, fast target deformation, and background clutter. These examples highlight the limitations of the proposed method when the visual appearance of the target is heavily corrupted.

Figure 7.

Ablation experiment of the initial value selection of $θ$ on Drone Tracking Benchmark 70 (DTB-70).

Figure 8.

Ablation experiment of the initial value selection of $ε$ on Drone Tracking Benchmark 70 (DTB-70).

4.2.3. UAVDT

Finally, we evaluated DACFTM on the UAVDT dataset, which contains 50 video sequences from the perspective of unmanned aerial vehicles, including 14 challenge attributes. We compared DACFTM with 11 other trackers: ASRCF (Dai et al., 2019), TADT (Li et al., 2019), RTDG (Lin et al., 2021), ECO (Danelljan et al., 2017), HiFT (Cao et al., 2021), MCCT (Wang et al., 2018), BSTCF (Zhang et al., 2023), FBACF (Zhang et al., 2024a), MSEFCF (Yu et al., 2024), ASTSCF (Li et al., 2025), and FWRDCF (Jia et al., 2025).

We report the performance of each tracker on this dataset in Table 2. DACFTM achieved the best precision (0.768) and the second-best success rate (0.506, slightly behind the 0.510 of FWRDCF). The success rate was 6.5% higher than BSTCF and 4.1% higher than FBACF. The precision was 3.5% higher than MSEFCF and 4.0% higher than RTDG.

4.3. Ablation Studies

4.3.1. The Effectiveness of Different Components

We further conducted ablation experiments to demonstrate the effectiveness of the individual components of the DACFTM tracker, namely the target mask (TM) and the distortion-aware mechanism (DA). The baseline tracker uses neither the target mask nor the distortion-aware mechanism. Starting from the baseline, we verified the effectiveness of our method by combining the different components.

According to the results reported in Table 2, the proposed tracker shows a better tracking success rate and precision than the other trackers on the DTB-70 dataset; for example, it is 4.4% higher in success rate and 10.6% higher in precision than MSEFCF. In addition, as shown in Table 3, the proposed target mask and filter update strategy both improve the performance of the baseline. On DTB-70, the target mask and the distortion-aware mechanism increase the success rate by 1.4% and 1.8% and the precision by 1.8% and 1.0%, respectively. Note that the contributions of the two components do not combine linearly. The target mask enhances the model’s perception of the target by acting on the objective function, whereas the distortion-aware mechanism maintains a high-quality filter by judging whether the current response map is reliable. When both components are enabled, our tracker achieves its best performance, which is 2.8% higher in success rate and 3.8% higher in precision than the baseline. These results verify the effectiveness of our proposed method.
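The gains quoted above can be recomputed from the DTB-70 columns of Table 3. The short sketch below does so; the variable and function names are illustrative, and the values are copied from the table.

```python
# (DP, AUC) on DTB-70, copied from Table 3.
TABLE3_DTB70 = {
    "Baseline": (0.796, 0.518),
    "Baseline + TM": (0.814, 0.532),
    "Baseline + DA": (0.806, 0.536),
    "Baseline + TM + DA": (0.834, 0.546),
}

def gains(table, base="Baseline"):
    """Absolute gains over the baseline, in percentage points, as (DP, AUC)."""
    dp0, auc0 = table[base]
    return {
        name: (round(100 * (dp - dp0), 1), round(100 * (auc - auc0), 1))
        for name, (dp, auc) in table.items()
        if name != base
    }

dtb70_gains = gains(TABLE3_DTB70)
```

The individual AUC gains (1.4 and 1.8) sum to 3.2, while the combined gain is 2.8; the individual DP gains (1.8 and 1.0) sum to 2.8, while the combined gain is 3.8.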

4.3.2. The Effectiveness of Different Features

To allow comparison with trackers that do not use ResNet-50 as the feature extraction network, we conducted an experiment on the influence of the feature model. As shown in Table 4, when only HOG features are used, our tracker reaches an AUC of 0.307 on DTB-70 and runs nearly 18 times faster than the model after feature fusion. When HOG and CN features are used, it is still more than 6 times faster. HOG provides essential structural information, and CN further enriches the representation with discriminative color cues. Integrating DCNN features (e.g., from ResNet-50) then provides a significant performance boost by introducing high-level semantic information. For instance, on the DTB-70 dataset, adding DCNN features to HOG + CN raises the AUC from 0.442 to 0.546 and the DP from 0.664 to 0.834. DCNN features are therefore the largest contributor to accuracy and robustness in our experiments, primarily because they capture high-level semantic information that is highly discriminative and robust to the severe appearance changes and occlusions inherent in aerial video sequences.

Table 3.
Experimental Results Using Different Components on DTB-70 and UAV123@10fps.

	Components		DTB-70		UAV123@10fps	
Methods	TM	DA	DP $↑$	AUC $↑$	DP $↑$	AUC $↑$
Baseline			0.796	0.518	0.745	0.652
Baseline $+$ TM	✓		0.814	0.532	0.755	0.663
Baseline $+$ DA		✓	0.806	0.536	0.748	0.664
Baseline $+$ TM $+$ DA	✓	✓	0.834	0.546	0.761	0.667

DTB-70 = Drone Tracking Benchmark 70; UAV = unmanned aerial vehicle; TM = target mask; DA = distortion-aware mechanism; DP = distance precision; AUC = area under the curve.

Table 4.

Experimental Results Using Different Feature Fusion on DTB-70 and UAV123@10fps.

	DTB-70			UAV123@10fps
Features	AUC $↑$	DP $↑$	FPS $↑$	AUC $↑$	DP $↑$	FPS $↑$
HOG	0.307	0.460	69.718	0.319	0.443	70.632
HOG $+$ CN	0.442	0.664	25.677	0.445	0.609	26.789
HOG $+$ CN $+$ DCNN	0.546	0.834	3.900	0.667	0.761	3.974

DTB-70 = Drone Tracking Benchmark 70; UAV = unmanned aerial vehicle; AUC = area under the curve; HOG = histogram of oriented gradient; CN = color name; DCNN = deep convolutional neural network.

4.4. Sensitivity Analysis

(1) Selection of $θ$ : We conducted an ablation experiment on $θ$ , which represents the allowable fluctuation range of the quality evaluation parameters of adjacent frames. When the quality evaluation parameters of consecutive frames satisfy equation (9), we adopt the corresponding response map learning rate. As shown in Figure 7, we tested $θ \in \{0, 0.1, 0.2, 0.3, 0.4, 0.5\}$ . The overall performance is best when $θ = 0.3$ and worst when $θ = 0.5$ .

(2) Selection of $ε$ : We determined the value of the quality evaluation parameter $ε$ through ablation experiments. As shown in Figure 8, we tested $ε \in \{1, 1.5, \dots, 3.5\}$ . The best overall tracking performance is achieved when $ε = 1$ . This value acts as the threshold: if the current response map quality $Ω (R_{t})$ is less than or equal to the average quality of the past $k$ reliable frames ( $Ω^{avg} (R_{(t - k) : t})$ ), the filter is considered unreliable, and a previously saved high-quality filter is used instead. This empirical choice keeps tracking robust by preventing distorted filters from being integrated into the model.
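The reliability test described in item (2) can be sketched as follows. Several parts are assumptions made for illustration: the quality measure `response_quality` (peak over mean magnitude) is a hypothetical stand-in for $Ω$, $ε$ is assumed to scale the average quality, the class and method names are invented, and the filter fusion step for reliable frames is omitted.

```python
from collections import deque
import numpy as np

def response_quality(response):
    """Hypothetical stand-in for the quality measure: peak over mean magnitude."""
    return float(response.max() / (np.abs(response).mean() + 1e-12))

class FilterSelector:
    def __init__(self, eps=1.0, k=5):
        self.eps = eps
        self.history = deque(maxlen=k)   # qualities of the last k reliable frames
        self.saved_filter = None         # most recent filter judged reliable

    def select(self, response, current_filter):
        quality = response_quality(response)
        if self.history and quality <= self.eps * np.mean(self.history):
            # Unreliable response: fall back to the saved high-quality filter.
            return self.saved_filter, False
        self.history.append(quality)
        self.saved_filter = current_filter
        return current_filter, True

selector = FilterSelector(eps=1.0, k=5)
sharp = np.zeros((8, 8))
sharp[4, 4] = 1.0                        # single clear peak
flat = np.full((8, 8), 0.5)
flat[4, 4] = 0.6                         # ambiguous, nearly flat response
first, first_ok = selector.select(sharp, "filter_t1")
second, second_ok = selector.select(flat, "filter_t2")
```

In this toy run the first response is accepted and its filter is saved, while the nearly flat second response is rejected and the saved filter is returned instead.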

5. Conclusion

In this article, we designed and implemented a correlation filter with target mask regularization and proposed a distortion-aware mechanism for guiding the selection of the filter. These two methods can be applied to any discriminative correlation filter design and are particularly practical in the UAV tracking scenario. On the DTB-70 dataset, the precision of our method reaches 83.4%. However, compared with other trackers, the proposed method has no obvious advantage in tracking speed. Future research could consider baseline models with faster tracking speeds to improve usability on UAV video sequences. In addition, through the observation of video sequences, we found that the proposed method is more robust in small-target scenarios and can be especially effective in such specific scenarios. Finally, the experiments conducted on multiple datasets demonstrate the strong performance of our method.

Footnotes

ORCID iDs

Jianming Zhang

Jiangxin Dai

Xiaokang Jin

Ke Nai

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported in part by Jinhua Public Welfare Technology Application Research Project under Grant 2025-4-043.

Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

Bertinetto

Valmadre

Henriques

J. F.

Vedaldi

Torr

P. H.

(2016). Fully-convolutional siamese networks for object tracking. In Computer Vision–ECCV 2016 Workshops: Amsterdam, The netherlands, October 8–10 and 15–16, 2016, Proceedings, Part II 14 (pp. 850–865). Springer.

Bolme

D. S.

Beveridge

J. R.

Draper

B. A.

Lui

Y. M.

(2010). Visual object tracking using adaptive correlation filters. In 2010 IEEE computer society conference on computer vision and pattern recognition (pp. 2544–2550). IEEE.

Bonatti

Wang

Choudhury

Scherer

(2019). Towards a robust aerial cinematography platform: Localizing and tracking moving targets in unstructured environments. In 2019 IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 229–236). IEEE.

Boyd

Parikh

Chu

Peleato

Eckstein

(2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1), 1–122.

Cao

(2021). Hift: Hierarchical feature transformer for aerial tracking. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 15457–15466).

Chen

Pan

Kwok

J. T.

Carbonell

J. G.

(2009). Accelerated gradient method for multi-task sparse learning problem. In 2009 Ninth IEEE international conference on data mining (pp. 746–751). IEEE.

Chen

Liu

(2024). Toward robust visual tracking for uav with adaptive spatial-temporal weighted regularization. The Visual Computer, 40(12), 8987–9003.

Cui

Jiang

Wang

(2022). Mixformer: End-to-end tracking with iterative mixed attention. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 13608–13618).

Dai

Wang

Sun

(2019). Visual tracking via adaptive spatially-regularized correlation filters. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 4670–4679).

10.

Dalal

Triggs

(2005). Histograms of oriented gradients for human detection. In 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05) (Vol. 1, pp. 886–893). IEEE.

11.

Danelljan

Bhat

Shahbaz Khan

Felsberg

(2017). ECO: Efficient convolution operators for tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 6638–6646).

12.

Danelljan

Hager

Shahbaz Khan

Felsberg

(2015a). Convolutional features for correlation filter based visual tracking. In Proceedings of the IEEE international conference on computer vision workshops (pp. 58–66).

13.

Danelljan

Hager

Shahbaz Khan

Felsberg

(2015b). Learning spatially regularized correlation filters for visual tracking. In Proceedings of the IEEE international conference on computer vision (pp. 4310–4318).

14.

Danelljan

Robinson

Shahbaz Khan

Felsberg

(2016). Beyond correlation filters: Learning continuous convolution operators for visual tracking. In Computer Vision–ECCV 2016: 14th european conference, amsterdam, the netherlands, October 11–14, 2016, Proceedings, Part V 14 (pp. 472–488). Springer.

15.

Danelljan

Shahbaz Khan

Felsberg

Van de Weijer

(2014). Adaptive color attributes for real-time visual tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1090–1097).

16.

Lin

(2020). Disruptor-aware interval-based response inconsistency for correlation filters in real-time aerial tracking. IEEE Transactions on Geoscience and Remote Sensing, 59(8), 6301–6313.

17.

Zhang

Ren

Sun

(2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778).

18.

Fan

Zhuang

Dong

Bai

(2017). Correlation filters with weighted convolution responses. In Proceedings of the IEEE international conference on computer vision workshops (pp. 1992–2000).

19.

Henriques

J. F.

Caseiro

Martins

Batista

(2012). Exploiting the circulant structure of tracking-by-detection with kernels. In Computer Vision–ECCV 2012: 12th European conference on computer vision, Florence, Italy, October 7–13, 2012, Proceedings, Part IV 12 (pp. 702–715). Springer.

20.

Henriques

J. F.

Caseiro

Martins

Batista

(2014). High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3), 583–596.

21.

Huang

Lin

(2019). Learning aberrance repressed correlation filters for real-time uav tracking. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 2891–2900).

22.

Hui

Wang

Cheng

Xie

Yang

(2021). 3d siamese voxel-to-bev tracker for sparse point clouds. Advances in Neural Information Processing Systems, 34, 28714–28727.

23.

Jia

Liu

Wang

(2025). Target-background feature blocks and aberrance repressed correlation filters for real-time UAV tracking. Signal, Image and Video Processing, 19(7), 527.

24.

Jin

Zhang

Xiao

Zhao

Zheng

(2024). Improved siamcar with ranking-based pruning and optimization for efficient UAV tracking. Image and Vision Computing, 141, 104886.

25.

Kiani Galoogahi

Fagg

Lucey

(2017). Learning background-aware correlation filters for visual tracking. In Proceedings of the IEEE international conference on computer vision (pp. 1135–1143).

26.

Lai

Z. R.

Tan

Fang

(2020). Loss control with rank-one covariance estimate for short-term portfolio optimization. Journal of Machine Learning Research, 21(97), 1–37.

27.

Yan

Zhu

(2018a). High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 8971–8980).

28.

Tian

Zuo

Zhang

Yang

M. H.

(2018b). Learning spatial-temporal regularized correlation filters for visual tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4904–4913).

29.

Zhao

Liu

Tang

(2025). Learning adaptive spatial-temporal structured correlation filters for uav object tracking. In ICASSP 2025-2025 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 1–5). IEEE.

30.

Yang

M. H.

(2019). Target-aware deep tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 1369–1378).

31.

Ding

Huang

(2020). Autotrack: Towards high-performance visual tracking for uav with automatic spatio-temporal regularization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11923–11932).

32.

Zhu

(2015). A scale adaptive kernel correlation filter tracker with feature integration. In Computer vision-ECCV 2014 workshops: Zurich, Switzerland, September 6–7 and 12, 2014, Proceedings, Part II 13 (pp. 254–265). Springer.

33.

Lin

Guo

Tang

(2020). Bicf: Learning bidirectional incongruity-aware correlation filter for efficient uav object tracking. In 2020 IEEE international conference on robotics and automation (ICRA) (pp. 2365–2371). IEEE.

34.

Lin

Xiong

(2021). Recf: Exploiting response reasoning for correlation filters in real-time UAV tracking. IEEE Transactions on Intelligent Transportation Systems, 23(8), 10469–10480.

35.

Yang

Reid

Yang

M. H.

(2018). Deep regression tracking with shrinkage loss. In Proceedings of the European conference on computer vision (ECCV) (pp. 353–369).

36.

Nam

Han

(2016). Learning multi-domain convolutional neural networks for visual tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4293–4302).

37.

Feng

Cao

Zhao

Xiao

(2020). P2b: Point-to-box network for 3d object tracking in point clouds. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 6329–6338).

38.

Sherman

Morrison

W. J.

(1950). Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. The Annals of Mathematical Statistics, 21(1), 124–127.

39.

Song

Gong

Zhang

Lau

R. W.

Yang

M. H.

(2017). Crest: Convolutional residual learning for visual tracking. In Proceedings of the IEEE international conference on computer vision (pp. 2555–2564).

40.

Van De Weijer

Schmid

Verbeek

Larlus

(2009). Learning color names for real-world applications. IEEE Transactions on Image Processing, 18(7), 1512–1523.

41.

Wang

Song

Zhou

Liu

(2019). Unsupervised deep tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 1308–1317).

42.

Wang

Zhou

Tian

Hong

Wang

(2018). Multi-cue correlation filters for robust visual tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4844–4853).

43. Wang, Xie, Lai, Y. K., Long, Wang (2021). MLVSNet: Multi-level voting Siamese network for 3D visual tracking. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 3101–3110).

44. Zhao, Zhang (2019a). An image reconstruction model regularized by edge-preserving diffusion and smoothing for limited-angle computed tomography. Inverse Problems, 35(8), 085004.

45. Feng, Z. H., X. J., Kittler (2019b). Joint group feature selection and discriminative filter learning for robust visual object tracking. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 7950–7960).

46. Feng, Z. H., X. J., Kittler (2019c). Learning adaptive discriminative correlation filters via temporal consistency preserving spatial feature selection for robust visual object tracking. IEEE Transactions on Image Processing, 28(11), 5596–5609.

47. Xuan, Han, Wan, Xia, G. S. (2019). Object tracking in satellite videos by improved correlation filters with motion estimations. IEEE Transactions on Geoscience and Remote Sensing, 58(2), 1074–1086.

48. Yan, Peng, Wang (2021). Learning spatio-temporal transformer for visual tracking. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 10448–10457).

49. Y. F., Zhang, Chen, C. P. (2024). Multi-scale enhanced features correlation filters learning with dual second-order difference for UAV tracking. IEEE Transactions on Intelligent Vehicles, 9(2), 3232–3245.

50. Zhang, Qiu (2022a). Learning target-aware background-suppressed correlation filters with dual regression for real-time UAV tracking. Signal Processing, 191, 108352.

51. Zhang, Liu, Yuan, Yang (2024a). Feature block-aware correlation filters for real-time UAV tracking. IEEE Signal Processing Letters, 31, 840–844.

52. Zhang, Feng, Yuan, Wang, Sangaiah, A. K. (2022b). SCSTCF: Spatial-channel selection and temporal regularized correlation filters for visual tracking. Applied Soft Computing, 118, 108485.

53. Zhang, Chen, Kuang, L. D., Zheng (2024b). Corrformer: Context-aware tracking with cross-correlation and transformer. Computers and Electrical Engineering, 114, 109075.

54. Zhang, Feng, Wang, Xiong, N. N. (2023). Learning background-aware and spatial-temporal regularized correlation filters for visual tracking. Applied Intelligence, 53(7), 7697–7712.

55. Zhang, Jin, Sun, Wang, Sangaiah, A. K. (2020). Spatial and semantic convolutional features for robust visual object tracking. Multimedia Tools and Applications, 79, 15095–15115.

56. Zhang, Tao, Huang, Zhang (2024c). A robust real-time anchor-free traffic sign detector with one-level feature. IEEE Transactions on Emerging Topics in Computational Intelligence, 8(2), 1437–1451.

57. Zhang, Sun, Wang, Chen (2022c). An object tracking framework with recapture based on correlation filters and Siamese networks. Computers & Electrical Engineering, 98, 107730.

58. Zhang, Yang, Liu, Wang (2025). RGBT tracking via frequency-aware feature enhancement and unidirectional mixed attention. Neurocomputing, 616, 128908.

59. Zhang, Yang, Qin, Xiao, Wang (2025a). MGNet: RGBT tracking via cross-modality cross-region mutual guidance. Neural Networks, 190, 107707.

60. Zhang, Wang (2025b). Crack segmentation network via difference convolution-based encoder and hybrid CNN-Mamba multi-scale attention. Pattern Recognition, 167, 111723.

61. Zhang, Ghanem, Liu, Ahuja (2012). Robust visual tracking via multi-task sparse learning. In 2012 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2042–2049).

62. Zheng, Lin, Ding (2021). Mutation sensitive correlation filter for real-time UAV tracking with adaptive hybrid label. In 2021 IEEE international conference on robotics and automation (ICRA) (pp. 503–509). IEEE.

Methods	TM	DA	DTB-70 DP ↑	DTB-70 AUC ↑	UAV123@10fps DP ↑	UAV123@10fps AUC ↑
Baseline	-	-	0.796	0.518	0.745	0.652
Baseline + TM	✓	-	0.814	0.532	0.755	0.663
Baseline + DA	-	✓	0.806	0.536	0.748	0.664
Baseline + TM + DA	✓	✓	0.834	0.546	0.761	0.667

Learning Distortion-Aware Correlation Filters With Target Mask for UAV Tracking
