Abstract
Template dictionary construction is an important issue in sparse representation (SP)-based tracking algorithms. In this article, a drift-free visual tracking algorithm is proposed via the construction of an effective template dictionary. The constructed dictionary is composed of three categories of atoms (templates): nonpolluted atoms, variational atoms, and noise atoms. Moreover, the linear combinations of nonpolluted atoms are also added to the dictionary for the diversity of atoms. All the atoms are selectively updated to capture appearance changes and alleviate the model drifting problem. A bidirectional tracking process is used and each process is optimized by two-step SP, which greatly reduces the computational burden. Compared with other related works, the constructed dictionary and tracking algorithm are both robust and efficient.
Introduction
In many industrial scenarios, such as intelligent traffic systems, industrial robots, and smart security surveillance systems, visual sensors have become increasingly common due to their low cost and nonintrusiveness. Object tracking is an important problem in visual sensor-based intelligent systems and has been studied extensively in recent decades. Technically, object tracking involves locating a specified region in a video sequence and has significant potential applications in various fields, including visual surveillance, 1 intelligent transportation systems, 2 human–computer interaction, 3 and intelligent driving. 4
A variety of tracking algorithms have been investigated in the existing literature. From the perspective of appearance modeling, there are two kinds of tracking algorithms: generative and discriminative. For generative models, modeling the object using Gaussian mixture models is both effective and computationally efficient. 2,5 –7 Zhou et al. 2 assume that the observations are composed of different components and model the object by utilizing an adaptive mixture of Gaussians. Taking the spatial distribution of the object into consideration, Yu and Wu 5 propose a spatial-appearance model that captures the properties of both local appearance changes and global spatial changes to fit nonrigid appearance variations. In the literature, 7,8 a spatial-color mixture of Gaussians model is introduced to model the object. It considers not only the common similarity measure based on color histograms but also the spatial layout of the colors.
These generative model-based tracking algorithms all use a pixelwise object representation and are apt to be corrupted by noise. To consider the object appearance as a whole, linear subspace learning, which represents the object as a vector, has been widely applied to visual tracking. Based on the subspace constancy assumption, Black and Jepson 9 propose a subspace learning algorithm for template tracking. However, this algorithm does not work well if the appearance of the object gradually changes. The model proposed in the literature 10,11 learns and represents the object appearance by a low-dimensional subspace in an incremental way, and it can therefore capture nonrigid appearance variations and recover all motion parameters efficiently. In the literature, 12 the observation model is decomposed into multiple basic observation models that are constructed by the sparse principal component analysis of a set of feature templates. To utilize more spatial layout information, higher-order subspace learning algorithms have been proposed. 13 –16 In the literature, 13,14 an online tensor decomposition framework is introduced for object tracking. It can adapt to the appearance changes of a target by gradually learning a low-order tensor eigenspace representation. Because appearance variations are highly nonlinear, nonlinear manifold learning methods have also been proposed. In the research, 17 Porikli et al. utilize a covariance matrix 18 descriptor to capture the spatial correlation information in the appearance of the object.
For discriminative trackers, many models have been proposed to select different discriminative features for tracking. The support vector tracker 19 uses an offline learned support vector machine as the classifier and embeds it into an optical flow-based tracker. In the literature, 20 a discriminative classification rule is learned to distinguish between the object and background. These algorithms require a large hand-labeled data set for training, and the support vector machine classifier is not updated once trained. To adapt to the object appearance changes, discriminative trackers have been extended to include online learning. Collins et al. 21 treat tracking as a classification problem between the tracked object and the background. A variance ratio is used to measure feature discriminability and select the best color space feature from a feature pool for tracking. Avidan 22 labels pixels by a combination of weak classifiers and constructs a probability map to represent the probabilities that particular pixels belong to the tracked object or its background. However, the pixel-based features (color and gradient) have very limited discriminative power, especially if the background shares a similar color with the object. To overcome this disadvantage, Grabner et al. 23,24 select discriminative local tracking features from a large feature pool by online boosting. The weak classifiers discriminate the object from the background to obtain corresponding features. Zhang et al. 25 propose a graph embedding-based subspace learning method, which can simultaneously learn the subspace of the object and its local discriminative subspace against the background. Li et al. 26 propose a novel correlation filter-based tracker, which is robust to background clutter and scale variations of the target. Zhou et al. 27 exploit appearance and the background context to design a robust correlation filter-based tracker.
In recent years, l 1-norm constrained sparse representation (SP) has attracted increasing attention and has been applied to object tracking. 28 –34 Despite the great success of SP in the field of tracking, little research has focused on how to establish an effective template dictionary for visual tracking. SP requires an overcomplete template dictionary, so that a linear combination of these templates can approximate new samples with very sparse coefficients. For an online video sequence, constructing an overcomplete dictionary beforehand is difficult or even impossible. Therefore, a dictionary with a small number of atoms is collected when tracking is started, and the atoms are updated online during the tracking. However, there are two limitations to this method: (1) the atoms are far from complete and (2) the atoms are gradually contaminated by the tracking errors, ultimately resulting in drift problems. In contrast, because all video frames are available, overcomplete atoms can be built for offline video sequences. Thus, the construction of a dictionary with overcomplete atoms is a key problem that remains to be solved, and some methods have been proposed in the signal processing field to address this problem. 35 –38 In the literature, 35 Aharon et al. propose a method that alternates between updating the dictionary atoms and sparse-coding the examples based on the current dictionary. Yaghoobi et al. 36 introduce diverse constraints to extend the dictionary learning problem and use optimization methods to solve it. In the literature, 37 an iterative algorithm based on the least-squares cost is proposed to construct dictionaries. Nevertheless, the labeling of so much sample data in a video is time-consuming.
Based on the preceding analysis of existing research, an effective dictionary construction method for offline video tracking is proposed in the present study. The major characteristics of the proposed method are summarized as follows: (1) Three categories of atoms are constructed: nonpolluted atoms and their linear combinations, variational atoms, and noise atoms. (2) All the atoms are selectively updated to capture the appearance changes and alleviate the model-drifting problem. (3) The algorithm adopts a two-step method, which solves the optimization problem via two sets of SP and reduces the heavy computational burden. (4) From the perspective of control theory, the presented pursuit algorithm combines the key frame constraint and the bitracking constraint, which makes the essentially open-loop tracking problem well-posed.
The remainder of the article is organized as follows. The “SP-based tracker” section reviews the SP framework in the tracking context. The details of the proposed tracker are presented in the “Proposed tracking system” section. Experimental results are shown in the “Experimental results” section, and conclusions are drawn in the “Conclusion” section.
Motivation
Before introducing the motivation of this article, a short review of the traditional SP-based tracker is first presented to make the article self-contained.
SP-based tracker
In the applications of object tracking, it is assumed that the manifold of the object lies in a linear subspace for a short period of time. The assumption is rational because the appearances of the object are similar among the consecutive frames. This implies that regardless of how the appearance of the object changes, it can be represented by some atoms.
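The idea that a sample can be represented by a few atoms can be illustrated with a minimal sketch. It assumes the common l 1-regularized least-squares formulation, a dictionary made of object atoms followed by identity (noise) atoms, and the iterative shrinkage-thresholding algorithm (ISTA) as the solver. The function names, the regularization weight `lam`, and the synthetic data are illustrative assumptions and are not taken from this article.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code(D, y, lam=0.01, n_iter=500):
    """Approximately solve min_c 0.5 * ||y - D c||_2^2 + lam * ||c||_1 with ISTA."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2   # 1 / Lipschitz constant of the gradient
    c = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ c - y)             # gradient of the least-squares term
        c = soft_threshold(c - step * grad, step * lam)
    return c

# Synthetic data (assumption): 10 unit-norm object atoms of dimension 64.
rng = np.random.default_rng(0)
d, n_atoms = 64, 10
T = rng.standard_normal((d, n_atoms))
T /= np.linalg.norm(T, axis=0)
D = np.hstack([T, np.eye(d)])                # object atoms followed by noise atoms

y = 0.7 * T[:, 2] + 0.3 * T[:, 5]            # a sample built from two object atoms
c = sparse_code(D, y)
residual = np.linalg.norm(y - D @ c)         # reconstruction error of the sparse code
```

In a tracker of this kind, each candidate region would be coded in this way and the candidate with the smallest reconstruction error on the object atoms would typically be kept; that selection rule is stated here as general background rather than as this article's exact criterion.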
Suppose there are some image atoms
where
However, the image samples
where
where
where
where
Motivation of the article
In the traditional SP-based trackers, 28,33 the atom dictionary is constructed in two steps: (1) labeling the position of the object and (2) sampling several samples near the position. However, dictionaries constructed in this way are not complete, so the tracking performance is limited. Moreover, the atoms used in SP-based trackers are updated in a simple manner; if a new tracking result has a low similarity with the object atoms, then the atom with the lowest weight is updated by the tracking result. From the perspective of control theory, this updating strategy is essentially an open-loop process with no feedback, which is ill-posed. In this process, the tracking errors gradually accumulate, ultimately leading to the drifting problem.
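The simple updating rule described above can be sketched as follows. This is only an illustration of the open-loop behavior, assuming cosine similarity as the similarity measure, a hypothetical threshold `sim_threshold`, and the median weight as the weight of the new atom; none of these details are specified in the text.

```python
import numpy as np

def update_dictionary(atoms, weights, result, sim_threshold=0.8):
    """Replace the lowest-weight atom when the new result is dissimilar to all atoms.

    atoms:   (d, n) array with unit-norm columns
    weights: (n,) array of atom weights
    result:  (d,) vector holding the new tracking result
    Returns the (possibly new) atoms and weights, and the replaced index or None.
    """
    result = result / np.linalg.norm(result)
    similarity = np.max(atoms.T @ result)    # best cosine similarity to any atom
    if similarity >= sim_threshold:
        return atoms, weights, None          # result is already well represented
    k = int(np.argmin(weights))              # atom with the lowest weight
    atoms, weights = atoms.copy(), weights.copy()
    atoms[:, k] = result                     # no check that the result is correct
    weights[k] = np.median(weights)          # assumed weight for the new atom
    return atoms, weights, k

atoms = np.eye(4)[:, :3]                     # three toy atoms
weights = np.array([0.5, 0.1, 0.4])
result = np.array([0.0, 0.0, 0.0, 2.0])      # dissimilar to every atom
new_atoms, new_weights, replaced = update_dictionary(atoms, weights, result)
```

Note that nothing in the rule verifies that `result` actually contains the object, which is why an erroneous tracking result can enter the dictionary and propagate.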
Proposed tracking system
In consideration of these problems, a robust tracker for offline sequences is proposed in the present work. As shown in Figure 1, the proposed tracker contains four parts, namely the construction of the effective dictionary, two-step SP optimization, the bitracking procedure, and the updating strategy of the dictionary. The details of each part are presented in the following subsections.

The flowchart of the proposed tracking algorithm.
Dictionary construction
Inspired and guided by the key frame-based trackers, 40 –42 this article proposes the construction of a valid and large template dictionary via the use of a key frame-based algorithm. Because the goal is to collect as many representative atoms as possible, the objects in several key frames are manually selected as the nonpolluted atoms; however, these atoms are not sufficient. To enlarge the atom set, three categories of atoms are introduced, namely (1) the given nonpolluted atoms and their linear combinations, (2) some variational atoms that are used to adapt to the appearance changes, and (3) some noise atoms that deal with occlusion and noise. The noise atom was defined in “SP-based tracker” section, and the other two categories of atoms are introduced in the following subsections.
Nonpolluted atoms
The upper portion of Figure 2 shows that, in the k user-specified key frames, the target area to be tracked is manually marked. For the j-th frame, the target region is perturbed by 0–2 pixels to generate new image regions. Such intensive sampling around the target area alleviates the mismatch problem. The cropped regions are then adopted as the nonpolluted atoms

An illustration of the template dictionary construction process.
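The sampling of nonpolluted atoms around a labeled target can be sketched as below. The sketch assumes a grayscale frame, an axis-aligned box given as (top, left, height, width), exhaustive integer shifts of up to 2 pixels in each direction, and unit-norm vectorized patches; the function name and these conventions are illustrative assumptions.

```python
import numpy as np

def sample_nonpolluted_atoms(frame, box, max_shift=2):
    """Crop the labeled box at every integer offset within +/- max_shift pixels.

    frame: 2-D grayscale array; box: (top, left, height, width) in pixels.
    Returns an array with one unit-norm atom per column.
    """
    top, left, h, w = box
    atoms = []
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            r, c = top + dy, left + dx
            if r < 0 or c < 0 or r + h > frame.shape[0] or c + w > frame.shape[1]:
                continue                     # skip crops that would leave the image
            patch = frame[r:r + h, c:c + w].astype(float).ravel()
            norm = np.linalg.norm(patch)
            if norm > 0:
                atoms.append(patch / norm)
    return np.stack(atoms, axis=1)

frame = np.arange(20 * 20, dtype=float).reshape(20, 20)   # toy key frame
atoms = sample_nonpolluted_atoms(frame, box=(5, 5, 8, 8))
```

With a box well inside the image, this yields 25 atoms per key frame; boxes near the border yield fewer because out-of-image crops are skipped.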
Variational atoms
To effectively capture the changes in the appearance of the target object, the variational atoms can be initialized by the linear combination of the randomly selected nonpolluted atoms in the two corresponding frames (see the second line of Figure 2).
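A minimal sketch of this initialization is given here. It assumes a convex combination of one randomly chosen nonpolluted atom from each of the two key frames and treats the range of the random weight as a parameter, `alpha_range`, whose default of (0, 1) is an assumption; the function name and the toy data are likewise hypothetical.

```python
import numpy as np

def init_variational_atoms(atoms_a, atoms_b, n_new, alpha_range=(0.0, 1.0), seed=0):
    """Blend randomly chosen atoms from two key frames into new unit-norm atoms.

    atoms_a, atoms_b: (d, n) arrays holding the nonpolluted atoms of the two frames.
    alpha_range:      assumed range of the uniform random weight.
    """
    rng = np.random.default_rng(seed)
    lo, hi = alpha_range
    new_atoms = []
    for _ in range(n_new):
        a = atoms_a[:, rng.integers(atoms_a.shape[1])]   # random atom from frame A
        b = atoms_b[:, rng.integers(atoms_b.shape[1])]   # random atom from frame B
        alpha = rng.uniform(lo, hi)
        atom = alpha * a + (1.0 - alpha) * b             # assumed convex combination
        new_atoms.append(atom / np.linalg.norm(atom))
    return np.stack(new_atoms, axis=1)

rng = np.random.default_rng(3)
atoms_a = np.abs(rng.standard_normal((16, 5)))           # toy nonnegative atoms
atoms_b = np.abs(rng.standard_normal((16, 5)))
variational = init_variational_atoms(atoms_a, atoms_b, n_new=7)
```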
More precisely, consider the subsequence between the first and second key frames. Let
where α is a random weight that is uniformly generated in the interval
Two-step SP optimization
According to the preceding section, the number of atoms inside the dictionary is large, and solving equation (5) is therefore time-consuming. To solve this problem, two-step SP optimization is proposed.
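The general benefit of a two-step scheme can be illustrated with a small sketch: a cheap first step keeps only a few atoms, and the l 1 problem is then solved on that reduced dictionary. The screening rule used here (keeping the `k` atoms most correlated with the sample) is a stand-in chosen for illustration and should not be read as the selection algorithm of this article; the ISTA solver, `k`, and `lam` are also assumptions.

```python
import numpy as np

def ista(D, y, lam=0.01, n_iter=300):
    """Approximately solve min_c 0.5 * ||y - D c||_2^2 + lam * ||c||_1."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2
    c = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = c - step * (D.T @ (D @ c - y))
        c = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return c

def two_step_code(D, y, k=20, lam=0.01):
    """Step I: screen atoms. Step II: sparse coding on the selected atoms only."""
    k = min(k, D.shape[1])
    # Step I (assumed screening rule): keep the k atoms most correlated with y.
    keep = np.argsort(-np.abs(D.T @ y))[:k]
    # Step II: solve the small l1 problem and scatter the result back.
    c = np.zeros(D.shape[1])
    c[keep] = ista(D[:, keep], y, lam)
    return c, keep

rng = np.random.default_rng(1)
D = rng.standard_normal((64, 500))           # a large toy dictionary
D /= np.linalg.norm(D, axis=0)
y = 0.8 * D[:, 7] + 0.4 * D[:, 42]
c, keep = two_step_code(D, y)
```

The second step works on a 64 x 20 matrix instead of 64 x 500, which is where the saving comes from. The trade-off is that an atom discarded in the first step can never receive a nonzero coefficient.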
The notations used in this article are first reviewed. The dictionary contains the nonpolluted atoms
Step I: selection algorithm.
Accordingly, most of the atoms in
Bitracking procedure
To take both forward and backward sequential information into consideration, the tracker is managed by a bitracking procedure. As shown in Figure 3, the tracking process is not conducted in chronological order. The tracking process in the left part of the figure is responsible for obtaining forward sequential information, and that in the right part of the figure captures backward sequential information. The tracker begins from frame 1; the first tracking step is from frame 1 to frame 3, the second tracking step is from frame 2 to frame 4, the third tracking step is from frame 3 to frame 5, and so on. The tracking procedure continues until the forward and backward loops are complete. In this way, the object in every frame has two tracking results. To obtain more reliable results, the frame with the smallest difference between the two results is selected as the intersection of the bitracking process.

An example of the tracking process.
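The choice of the intersection frame can be sketched as follows. The sketch assumes that each pass yields one object center per frame, that the difference between the two results is the Euclidean distance between centers, and that the final track uses the forward results up to the intersection and the backward results after it. The splicing rule and the function name are assumptions made for illustration.

```python
import math

def merge_bidirectional(forward, backward):
    """Splice forward and backward tracks at the frame where they agree best.

    forward, backward: lists of (x, y) centers, one entry per frame.
    Returns the merged track and the index of the intersection frame.
    """
    if len(forward) != len(backward) or not forward:
        raise ValueError("tracks must be non-empty and of equal length")
    gaps = [math.hypot(f[0] - b[0], f[1] - b[1]) for f, b in zip(forward, backward)]
    meet = min(range(len(gaps)), key=gaps.__getitem__)   # smallest disagreement
    merged = forward[:meet + 1] + backward[meet + 1:]    # assumed splicing rule
    return merged, meet

forward = [(0, 0), (1, 1), (2, 2), (5, 5), (9, 9)]       # drifts late in the clip
backward = [(3, 3), (2, 2), (2, 2.5), (3, 3), (4, 4)]    # drifts early in the clip
merged, meet = merge_bidirectional(forward, backward)
```

In this toy example the two passes agree best at frame index 2, so the merged track follows the forward pass for the first three frames and the backward pass afterward.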
Atom updating
In the majority of tracking applications, the changes of both the target and the environment must be handled by the tracker simultaneously. If the atoms used in the tracker are updated frequently, they will be gradually polluted by tracking errors, leading to the model-drifting problem. Therefore, it is necessary to design an appropriate updating strategy for tracking. In the proposed model, nonpolluted atoms and variational atoms are updated online in different manners.
Discussion
This section discusses the reasons for the effectiveness of the proposed dictionary and tracking process.
From the perspective of control theory, tracking is essentially an open-loop problem; there is no feedback in the tracking process. Therefore, tracking errors inevitably accumulate, leading to the model-drifting problem. To alleviate this problem, some special constraints must be introduced. Traditionally, there are two kinds of constraints, the first of which is the key frame-based constraint. 41,42 In this work, the manually labeled ground truth in the key frames acts as special feedback. The optimization is conducted on the whole sequence, which minimizes the tracking errors in the key frames. The second kind of constraint is the bidirectional tracking constraint, 43,44 which leads to a new minimization criterion that combines both the forward and backward tracking errors. Both types of constraints can improve the robustness and accuracy of tracking; however, they are time-consuming and therefore cannot fulfill the real-time requirement.
In the present work, the object templates in the key frames can be naturally incorporated into the SP framework and are used for the tracking of the whole sequence. Additionally, the optimization process is solved efficiently by the algorithm in “Two-step SP optimization” section.
Iterative extension
As stated in “Discussion” section, tracking is an open-loop problem that is inevitably corrupted by image noise. Although the labeled templates from key frames provide a constraint for the drifting problem, the tracker still cannot successfully track any object in arbitrary video sequences.
The proposed tracking framework allows the tracking results to be refined iteratively when the performance of the tracker is not satisfactory. When the tracker deviates from the true position of the object and cannot recover, the tracking process is paused, the most representative frames are selected, and the image region of the object is extracted. Through random linear combinations of the regions in the two corresponding frames, their offspring are generated. They are then added into the template dictionary, and the tracking process is restarted. From a control-theoretic perspective, the interactive process together with the key frame-based constraints essentially forms a feedback to the tracking process and thus can, in theory, greatly improve the tracking performance.
Experimental results
Two comparative experiments, each involving different dictionaries and tracking processes, are first presented to verify the asserted contributions of this work. Next, to confirm the performance of the proposed tracker, several traditional tracking algorithms are compared, and an iterative tracking example is presented. The average pixel error is adopted to measure the tracking accuracies of the different methods.
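The accuracy measure can be made concrete with a short sketch. It assumes that the average pixel error is the mean Euclidean distance, in pixels, between the predicted and ground-truth object centers over all frames; the exact definition used in the experiments is not spelled out in the text, so this interpretation and the function name are assumptions.

```python
import math

def average_pixel_error(predicted, ground_truth):
    """Mean Euclidean distance in pixels between predicted and true centers."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
    ]
    return sum(errors) / len(errors)

predicted = [(10, 10), (13, 14), (20, 20)]
ground_truth = [(10, 10), (10, 10), (20, 25)]
error = average_pixel_error(predicted, ground_truth)   # per-frame errors: 0, 5, 5
```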
Different dictionaries
In this section, the proposed method is compared with the classic L 1 tracker. 28 To make a fair comparison, the constraints of the key frames and the bitracking procedure are not used.
As shown in Figure 4, when the object undergoes large changes in pose and illumination, the L 1 tracker is unable to follow the object quickly. There are two reasons for these results: (1) The atoms adopted in the L 1 tracker are inadequate, so they cannot capture the appearance changes, thus leading to tracking failure. (2) The template update in the L 1 tracker introduces errors into the template, and thus, the template deviates from the tracking target, leading to the drifting problem. In our method, the nonpolluted templates are chosen from the key frames to construct the dictionary, which has a significant effect on avoiding target drift. Figure 5 shows that, as compared with the L 1 tracker, the proposed method achieves superior tracking performance.

Tracking performances of two dictionaries. The red boxes are the proposed method and the green boxes are L 1 tracker. (a) Football sequence and (b) shaking sequence.

Tracking accuracies of two dictionaries. The red boxes are the proposed method and the blue boxes are L 1 tracker. The numbers in the upper right corner of the images denote average errors. (a) Football and (b) shaking.
Different tracking procedures
To demonstrate that the bitracking process outperforms the normal tracking procedure, the proposed algorithm is tested on two sequences using each of the two tracking procedures.
As shown in Figure 6, the normal tracking process cannot track the sudden motion of the target, although many nonpolluted templates are used. In contrast, the bitracking process obtains accurate results, as the motion of the object is estimated from both tracking directions, resulting in the improvement of the robustness of the tracking algorithm. As shown in Figure 7, the bitracking process achieves better performance than the normal tracking process.

Tracking performances of two tracking procedures. The red boxes are bitracking procedure and the blue boxes are normal tracking procedure. (a) Basketball sequence and (b) skate sequence.

Tracking accuracies of two tracking procedures. The red boxes are bitracking procedure and the blue boxes are normal tracking procedure. The numbers in the upper right corner of the images denote average errors. (a) Basketball and (b) skate.
Comparison with state-of-the-art methods
The proposed algorithm (normal tracking process and bitracking process) is compared with several state-of-the-art tracking algorithms, namely the L 1 tracker, 28 incremental visual tracking (IVT), 10 semi-supervised online boosting (SSOB), 24 and multiple instance learning (MIL). 45 Furthermore, these methods are tested on multiple video sequences that contain illumination changes, occlusion, background interference, and posture changes.
In the first experiment, the proposed algorithm is compared with SSOB and MIL. As shown in Figure 8, MIL fails to track the object in the second image and the remaining sequence, because the Haar features employed in MIL lose their discriminative power under the large change in illumination. SSOB loses track of the object in the fourth image. As shown in Figure 9, the average error of the proposed algorithm is much lower than those of SSOB and MIL. These results demonstrate that the proposed algorithm exhibits a tracking accuracy superior to those of SSOB and MIL.

(a)–(e) Experiment 1 tracking performances of the compared algorithms (red: the proposed method; green: MIL; magenta: SSOB).

Experiment 1 tracking accuracies of the compared algorithms (red: the proposed method; green: MIL; magenta: SSOB). The numbers in the upper right corner of the images denote average errors.
In the second experiment, all the tracking algorithms, including the normal tracking process, bitracking process, L 1 tracker, MIL, IVT, and SSOB, are compared. Figure 10 shows that most of the algorithms involved in the experiment achieve good results, excluding SSOB and MIL in the car sequence. The reason for this is that SSOB and MIL are susceptible to illumination changes and deformations, and thus tracking failure occurs. As presented in Table 1, the proposed method achieves the best performance in terms of the tracking speed. These results demonstrate that the proposed method not only exhibits accurate tracking performance but is also characterized by a greatly reduced time consumption.

Experiment 2 tracking performances of the compared algorithms (red: the proposed method; blue: bitracking procedure; green: L 1 tracker; purple: MIL; yellow: IVT; white: SSOB). (a) Car sequence and (b) occlusion sequence.
Video tracking speed (fps).
Conclusion
This article proposes a drift-free visual tracking algorithm based on a constructed template dictionary. A set of templates, namely some nonpolluted templates, their offspring, one stable template, and variable templates, is used in the dictionary. To accommodate changes and prevent the model-drifting problem, these templates are selectively updated. In addition, the tracking process is bidirectional, which improves the tracking performance of the proposed algorithm. The effectiveness of the proposed tracking algorithm is demonstrated by several comparison experiments.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported in part by the National Natural Science Foundation of China [Grant No. 61922064], in part by the Zhejiang Provincial Natural Science Foundation [Grant Nos LR17F030001 and LQ19F020005], in part by the Project of Science and Technology Plans of Wenzhou City [Grant Nos C20170008, G20150017, and ZG2017016].
