Abstract
Template matching and template updating are crucial steps in visual object tracking. In this article, we propose a two-stage object tracking algorithm using a dual-template. By design, the initial state of a target is estimated at the first stage using a prior fixed template within a particle-filter-based tracking framework. The use of the prior fixed template maintains the stability of the tracking algorithm, because it consists of invariant and important features. At the second stage, a mean shift is used to obtain the optimal location of the object with the stage update template. The stage template improves the ability of target recognition using a classified update method. The complementarity of the dual templates improves the quality of template matching and the performance of object tracking. Experimental results demonstrate that the proposed algorithm improves tracking performance in terms of accuracy and robustness, and it exhibits good results in the presence of deformation, noise and occlusion.
Introduction
Object detection and tracking are prerequisites for a number of practical applications in computer and robotic vision. 1–5 In recent years, the particle filter (PF) has become one of the most popular visual tracking methods, due to its outstanding performance on optimal estimation problems in non-linear and non-Gaussian systems. 6–9 PF-based methods are usually used to tackle complicated tracking problems through model matching. The key step of PF is the extraction of the feature model, also known as the prior template. In a tracking system, the template is used to find the best candidate target as the tracking result, hence a good template is crucial for robust object tracking. However, the effectiveness of the template cannot be guaranteed for long-term object tracking in the wild, in the presence of pose, illumination, occlusion and background variations. Therefore, the template must be updated iteratively to improve the matching efficiency for robust long-term object tracking. To address this problem, a number of state-of-the-art algorithms have been developed in recent years. The fixed-frame update is one of the earliest methods for template updates. 7,10 However, this method has not been widely used, because a fixed update interval cannot adapt to the variety of changes in video sequences. An alternative is to update the object template frame by frame. For example, Collins et al. 11,12 used three components with different weights in Red-Green-Blue (RGB) space to generate a colour feature model, and improved the reliability of object tracking through a fusion scheme that selected the feature with the biggest difference between the tracked target and the background. However, this method is inefficient because the optimal update scheme must be constantly reselected during tracking. Zhang et al. 13 assigned different weights to the initial and candidate templates, but this causes under-updating or over-updating when the weights are selected improperly.
In the work by Peng et al., 14 Babu et al. 15 and Shan et al., 16 the template is updated by a Kalman filter (KF) during object tracking. A single model does not hold enough prior knowledge of an object for visual tracking, hence multi-model methods have been adopted to improve the target recognition capability by using an optimization function for model selection. However, even multi-model methods with template updates cannot simultaneously achieve stability and accuracy in video tracking. Studies of the human cognitive vision system suggest that a tracking strategy modelled on human vision can improve the performance of target tracking. 17–28
To deal with the issues stated above and to improve the stability and accuracy of video target tracking, this article presents a two-stage visual tracking algorithm with a dual-template. Firstly, in order to improve the target matching efficiency, a fusion method combining a prior fixed template and a stage update template is introduced. Secondly, motivated by human cognitive vision, the fixed template is constructed using prior knowledge at the beginning of tracking, in order to keep the tracking stable. In addition, the stage update template is built from the target and the background during tracking to improve the precision. The proposed two-stage tracking algorithm guarantees stability and improves the matching accuracy for each frame in a video, and the resulting pipeline is similar to the human vision system. Lastly, extensive experiments validate the effectiveness of the proposed algorithm in terms of robustness and accuracy under complicated scenarios.
An overview of particle filter and mean-shift
Particle filter
A particle filter is an effective algorithm for solving tracking problems in non-linear and non-Gaussian systems. The algorithm consists of two steps: prediction and update. The prediction step uses the system transition model x_k = f(x_{k−1}, v_{k−1}), where x_k is the target state at time k, f(·) is the state transition function and v_{k−1} is the process noise. The key idea of a particle filter is to approximate the posterior probability distribution using a set of weighted samples, p(x_k | z_{1:k}) ≈ Σ_{i=1}^{N} w_k^i δ(x_k − x_k^i), where {x_k^i, w_k^i} are the particles and their normalized weights (Σ_i w_k^i = 1) and z_{1:k} denotes the observations up to time k. It has been shown that this approximation approaches the true probability density function as the number of particles N becomes sufficiently large.
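As an illustration only, the predict/update cycle described above can be sketched for a 1-D state. The Gaussian motion and measurement models, the noise parameters and the resampling threshold below are all assumptions chosen for this example, not details taken from the proposed tracker.

```python
import numpy as np

def particle_filter_step(particles, weights, measurement,
                         motion_std=1.0, meas_std=1.0, rng=None):
    """One predict/update cycle of a 1-D bootstrap particle filter (sketch)."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Prediction: propagate each particle through the transition model
    # (here an illustrative random-walk model x_k = x_{k-1} + v_{k-1}).
    particles = particles + rng.normal(0.0, motion_std, size=particles.shape)
    # Update: reweight each particle by the measurement likelihood.
    likelihood = np.exp(-0.5 * ((measurement - particles) / meas_std) ** 2)
    weights = weights * likelihood
    weights = weights / weights.sum()
    # The state estimate is the weighted mean of the particles.
    estimate = float(np.sum(weights * particles))
    # Resample when the effective sample size collapses (threshold assumed).
    if 1.0 / np.sum(weights ** 2) < 0.5 * len(particles):
        idx = rng.choice(len(particles), size=len(particles), p=weights)
        particles = particles[idx]
        weights = np.full(len(particles), 1.0 / len(particles))
    return particles, weights, estimate
```

In a tracker, the measurement likelihood would come from template matching rather than the Gaussian used here.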
Mean shift
A mean shift 27,28 is an object tracking algorithm based on pattern matching. The method finds the candidate location y that maximizes the Bhattacharyya coefficient ρ(y) = Σ_{u=1}^{m} sqrt(p_u(y) q_u), where q = {q_u} is the m-bin histogram of the target model and p(y) = {p_u(y)} is the histogram of the candidate region centred at y. Starting from an initial position, the mean-shift iteration moves the candidate window step by step towards the local maximum of ρ(y).
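The Bhattacharyya coefficient itself is straightforward to compute from two histograms; a minimal sketch (with histogram normalization included for convenience) might look like this:

```python
import numpy as np

def bhattacharyya(p, q):
    """Bhattacharyya coefficient between two histograms.

    Both inputs are normalized to sum to 1 first. The coefficient is
    1.0 for identical distributions and 0.0 for disjoint support.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(np.sqrt(p * q)))
```

The mean-shift tracker repeatedly evaluates this coefficient for candidate windows and moves toward the window that increases it.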
Target model representation and update method
The key of the tracking algorithm presented in this article is the construction of a prior fixed template and a stage template update, which play important roles in template matching and in the long-term stability of the tracking results. In practical applications, the initial location of an object is usually identified by the key features of the target, and then discriminated according to the scene features. Hence the key features of the target are vital in the process of visual tracking, and the main task of a successful tracker is to extract robust features and construct the prior fixed template using the target state, scene changes and prior knowledge. Building a correct template is the first step in visual tracking, and the template holds invariant properties throughout the tracking process. A good template improves the stability of a tracking algorithm.
Because extracted target regions are usually influenced by noise, background and other factors, we use filtering algorithms, such as median filtering and the wavelet transform, to enhance the effectiveness of the extracted target area. We then classify the information in video frames into four common types: background features, key features, mixed features and stage features. There are a large number of background features in video sequences; they often change from frame to frame and interfere with the performance of a tracking algorithm. Key features are usually concentrated in the target area and remain unchanged during tracking, hence they play the determining role in target recognition. Mixed features are generated by camera settings and illumination variations; they change greatly in videos and are usually considered useless for tracking, but in this article we try to make full use of them. Stage features are sparse in target areas, change quickly and are strongly time-dependent; they are very useful when they differ from the background.
Generation of prior fixed template
The main task of the construction of a prior fixed template is to extract the second kind of features stated above, that is to say key features. Key features are constant in the whole tracking process and are the main components of target areas.
To construct a prior fixed template, we first define a background region for a video sequence to obtain background statistical features. As shown in Figure 1, the red box is the background region and the blue box is the target region.

The construction of a prior-fixed template: (1) the background and target regions; (2) the extracted histogram vector of the target region; (3) the extracted key features for the target region; (4) the constructed template; and (5) the final optimized template using morphological image processing methods.
In the same manner, a colour histogram vector is extracted for the target region; Figure 1-(2) shows the extracted histogram vector of the target region. Second, the bins of this histogram vector that also occur frequently in the background statistics are suppressed, so that only the bins dominated by the target remain; the result is the set of key features of the target region, as shown in Figure 1-(3).
These features are invariant and play important roles in the whole tracking process. The prior fixed template can be created after obtaining these features; the template construction method, whose output is the prior fixed template, is summarized in Algorithm 1.
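As a rough sketch of this idea, key features can be isolated by keeping only the histogram bins in which the target clearly dominates the background. The dominance ratio used here is an illustrative assumption, not the rule from the paper's Algorithm 1.

```python
import numpy as np

def key_features(target_hist, background_hist, ratio=2.0):
    """Keep only the bins where the normalized target histogram
    dominates the normalized background histogram by `ratio`;
    all other bins are set to zero. `ratio` is an assumption."""
    t = np.asarray(target_hist, dtype=float)
    b = np.asarray(background_hist, dtype=float)
    t_norm = t / t.sum()
    b_norm = b / b.sum()
    # A bin is a key feature when it is far more frequent in the
    # target region than in the background region.
    return np.where(t_norm > ratio * b_norm, t, 0.0)
```

Bins that survive this test correspond to colours that belong to the target rather than the scene, which is what makes them stable over the whole sequence.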
Stage template update
A stage template reflects the variation of the template during tracking, which helps a tracking algorithm to accurately locate the target using the stage distinctiveness of video sequences. Hence the template update is very important in video tracking. In traditional tracking algorithms, the whole model is updated regardless of whether its components are primary or secondary, which can lead to the under-update or over-update problem.
The mixed features usually change due to illumination variations, camera settings and object deformations. In this article, we consider the neighbourhood of the key features as mixed features. To extract the mixed features, we first find the indexes of the bins with non-zero values in the key features, and then take the adjacent bins of the target histogram as the mixed features.

Because of the relationship between the background and target regions, the external background statistical features can be calculated using the region between the red box and the blue box in Figure 1-(1). By statistical analysis, they can be estimated by removing the key-feature bins from the histogram of that region, where the symbol ‘−’ denotes an operator that sets to zero the values of the bins in the first histogram that are non-zero in the second histogram.

Stage features reflect the variation of the appearance. A small amount of these exists in the target area, and they can be estimated by removing the key-feature and mixed-feature bins from the target histogram, where the symbol ‘−’ has the same meaning as above. Lastly, the stage template is composed of the union of the key features, mixed features and stage features, where the symbol ‘∪’ obtains the super-set of two histogram vectors by using all of the bins with non-zero values.
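The ‘∪’ composition can be sketched as follows. Taking the element-wise maximum over the component histograms is one simple way to keep every bin that is non-zero in any of them; this choice of combining rule for overlapping bins is an assumption made for illustration.

```python
import numpy as np

def stage_template(key, mixed, stage):
    """Compose the stage template as the union of the three feature
    histograms: a bin is kept if it is non-zero in any component.
    Overlapping bins take the maximum value (an assumed rule)."""
    return np.maximum.reduce([np.asarray(key, dtype=float),
                              np.asarray(mixed, dtype=float),
                              np.asarray(stage, dtype=float)])
```

Since key, mixed and stage features occupy largely disjoint bins by construction, the combining rule matters only for the rare overlaps.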
Once we obtain the stage template, the remaining key tasks are to update the template and give full play to the advantages of the stage. In this article, in order to update the template by class and provide more useful information to matching and tracking, we propose a new update strategy that analyses the structure and relationship of the features from temporal and spatial distributions respectively.
In the proposed template update strategy, the stage template is used for matching, and the current template is the ultimate goal of each step in tracking. For each sub-feature (histogram bin), there are four different situations, which are as follows.

(i) The sub-feature is absent from both the stage template and the current template. In this situation neither template holds the sub-feature, so no update is needed.

(ii) The sub-feature is present in the current template but absent from the stage template. In this situation the stage template can be updated by admitting the new sub-feature, as in equation (13), whose condition determines when a newly appeared sub-feature is reliable enough to be admitted.

(iii) The sub-feature is present in the stage template but absent from the current template. This situation implies that the sub-model of the stage template has disappeared from the current template because of object deformation, rotation or occlusion. In such a case, the stage template can be updated by attenuating the stale sub-feature, as in equation (14).

(iv) The sub-feature is present in both the stage template and the current template. In this situation both templates hold this sub-model, and the stage template can be updated by blending the two values, as in equation (15), whose condition controls the blending weight.
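A per-bin sketch of the four-case update might look like the following. Only the four-case structure follows the text; the learning rates `alpha` and `beta` and the exact admit/decay/blend rules are illustrative assumptions standing in for equations (13) to (15).

```python
import numpy as np

def update_stage_template(stage, current, alpha=0.1, beta=0.5):
    """Classified per-bin update of the stage template (sketch).

    (i)   bin zero in both        -> no update
    (ii)  bin in current only     -> admit it, scaled by alpha
    (iii) bin in stage only       -> decay it by (1 - beta)
    (iv)  bin in both             -> exponential blend with rate alpha

    alpha and beta are assumed parameters, not the paper's.
    """
    stage = np.asarray(stage, dtype=float)
    current = np.asarray(current, dtype=float)
    out = stage.copy()
    new_bins = (stage == 0) & (current > 0)    # case (ii)
    gone_bins = (stage > 0) & (current == 0)   # case (iii)
    both_bins = (stage > 0) & (current > 0)    # case (iv)
    out[new_bins] = alpha * current[new_bins]
    out[gone_bins] = (1.0 - beta) * stage[gone_bins]
    out[both_bins] = (1.0 - alpha) * stage[both_bins] + alpha * current[both_bins]
    return out
```

Treating each class of bin differently is what distinguishes this classified update from the whole-model updates criticized earlier, which apply one rule to every bin.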
Two-stage target tracking strategy using dual-template
In this section, we present the proposed two-stage visual tracking algorithm using a dual-template, which consists of two steps: (a) the first step embeds the prior fixed template in the particle filter framework and estimates the preliminary target state; (b) the second step uses the stage update template with the mean-shift matching method to achieve an accurate state estimate. By design, the proposed two-stage method in series is more consistent with the human vision system.
In the first step, the particle weights are updated after state prediction, so that the particles propagate in the correct direction and effectively describe the target state. The similarity between the prior fixed template and each candidate template is calculated by the sum-of-squared differences (SSD) method, which reflects the quality of particle matching in terms of distance. The SSD method calculates the distance as the sum of the squared per-bin differences between the template and the candidate, so a smaller distance indicates a better match. Each particle weight is then computed from this distance (the smaller the distance, the larger the weight), the weights are normalized, and the weighted mean of the particles gives the preliminary target state that is passed to the second, mean-shift stage.
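The SSD distance and a distance-to-weight mapping can be sketched as below. The Gaussian mapping and the `sigma` parameter are illustrative assumptions, since the paper's exact weighting function is not reproduced here.

```python
import numpy as np

def ssd(template, candidate):
    """Sum-of-squared differences between two feature vectors."""
    d = np.asarray(template, dtype=float) - np.asarray(candidate, dtype=float)
    return float(np.sum(d * d))

def particle_weight(template, candidate, sigma=1.0):
    """Map the SSD distance to a particle weight: a perfect match gets
    weight 1.0 and the weight decays with distance. The Gaussian form
    and sigma are assumptions for this sketch."""
    return float(np.exp(-ssd(template, candidate) / (2.0 * sigma ** 2)))
```

In the tracker, these unnormalized weights would be divided by their sum before the weighted-mean state estimate is taken.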
Experimental results and analysis
In order to verify the efficiency and effectiveness of the proposed algorithm, we present our experimental results in this section. The experiments were carried out on a personal computer with Intel Dual-Core 2.13 GHz CPU and 4 GB memory.
In this article, we test the proposed algorithm using three different data sets. The first one tests object tracking of human faces, and compares the proposed algorithm with other methods. The second experiment tests the proposed algorithm for object tracking with occlusions. The third one tests the proposed algorithm in terms of object tracking with deformations.
The face video (128 × 96) used in the first experiment has 500 frames. A fixed-template-based algorithm cannot meet the requirements of object tracking here, due to background changes and object rotation. In this experiment, we compare the proposed method with a single-fixed-template-based algorithm (method 1) and a weighted-update-template-based algorithm (method 2) in terms of effectiveness in face tracking. Some tracking results of these three algorithms are shown in Figure 2. The figure illustrates that the single-fixed-template-based method achieves correct template matching only when the variation of the appearance of the tracked face is minor; the tracking accuracy decreases dramatically when the face rotates, because the template is never updated during tracking. Influenced by target rotation and non-target occlusion, the weighted-update-template-based method also performs poorly in changing scenes, because of over-updating or under-updating. In the proposed method, the prior fixed template contains the key features, so the tracker can locate the target more accurately; this process is similar to the human vision system. During tracking, the stage template is updated by feature classification for each frame, which ensures the quality of the extracted features and the convergence of the tracking result.

A comparison of the proposed algorithm and other methods in face tracking (from left to right are frames 9, 43, 89, 101, 129 and 310 respectively). (a) Tracking results of method-1. (b) Tracking results of method-2. (c) Tracking results of the proposed method.
Table 1 shows the root mean square error (RMSE) of the three methods.
Comparison of tracking root mean square error (RMSE) for face sequences (pixels).
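Assuming the RMSE reported here is computed over tracked versus ground-truth target centre positions in pixels, the metric can be sketched as:

```python
import numpy as np

def center_rmse(tracked, ground_truth):
    """RMSE (pixels) between tracked and ground-truth centre points,
    one (x, y) pair per frame. The centre-based definition is an
    assumption about how the table's errors are measured."""
    t = np.asarray(tracked, dtype=float)
    g = np.asarray(ground_truth, dtype=float)
    per_frame = np.linalg.norm(t - g, axis=1)  # Euclidean error per frame
    return float(np.sqrt(np.mean(per_frame ** 2)))
```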
The second experiment verifies the performance of the proposed algorithm by tracking human faces with occlusions. Occlusions are harmful to the performance of template matching and may cause tracking failure, because an occlusion inevitably leads to the loss of part or all of the target information in the target region. To address this problem, this article uses the proposed two-stage visual tracking algorithm with the dual-template, in which the prior fixed template contains the key features of the tracked target, so it can obtain the correct initial position of the target. The stage template update strategy updates the effective information and guarantees the accuracy of the target template matching, hence the accuracy of the tracker is improved. Even when the appearance of the whole target is occluded, we can use the key features extracted from the prior fixed template for template matching, and correctly and promptly update the template after the occlusion. As shown in Figure 3, the proposed algorithm can address the problem of occlusion and keep its accuracy for long-term target tracking. The RMSE of this experiment is less than 2.5 pixels.

Tracking results of the proposed algorithm with occlusions. (a) Video sequence 1 (from left to right are frames 10, 13, 20, 33, 41 and 46 respectively). (b) Video sequence 2 (from left to right are frames 8, 45, 93, 196, 584 and 897 respectively).
The third experiment tests the proposed algorithm on a pedestrian with deformations. The shape of the target varies greatly during the movement of the pedestrian, which makes robust target tracking difficult; moreover, an improper template update may have a negative impact. To tackle this problem, this article uses the prior fixed template to predict the initial position of the target in the whole tracking process, because it contains key features; the invariance and importance of the key features ensure that the initial position is correct. The frame-by-frame update strategy perceives the variation information of each frame, hence it ensures the accuracy of the tracking result. As shown in Figure 4, the proposed algorithm performs very well in terms of accuracy and robustness even when the shape of the tracked pedestrian changes. The RMSE of this experiment is less than 1.65 pixels.

Pedestrian tracking results in surveillance videos. (a) Video sequence 1 (from top to bottom, left to right are frames 4, 15, 38, 46, 55 and 66 respectively). (b) Video sequence 2 (from top to bottom, left to right are frames 8, 43, 72, 102, 130 and 183 respectively).
Conclusion
In template-matching-based visual tracking systems, template construction is crucial. In this article, we propose a two-stage tracking algorithm with a dual-template, in order to address the problems caused by occlusion and object deformation. By design, the proposed method consists of two steps to achieve robust video target tracking: the first step is the initial target localization using a prior fixed template, and the second step is accurate target matching by a stage template update. The key of the proposed algorithm is its consistency with the human vision system. In the proposed algorithm, the mean shift algorithm is embedded into the particle filter framework, and we take full advantage of these two algorithms to improve the performance of target tracking. In particular, the classified update method is the key to improving the effectiveness of the stage template and to solving the under-update or over-update problem. In our future work, the template update method will be studied further, and sparse representation and dictionary learning algorithms will be introduced to improve the efficiency of the template update.
Footnotes
Declaration of conflicting interest
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship and/or publication of this article: This work was supported by the Major Program for Scientific and Technological Research in the University of China (grant number 311024), the National Natural Science Foundation of China (grant numbers 61373055 and 41501461), the Natural Science Foundation of Jiangsu Province of China (grant number BK20140419), and the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (grant numbers 14KJB520001 and 16KJD520001).
