A two-stage visual tracking algorithm using dual-template

Abstract

Template matching and updates are crucial steps in visual object tracking. In this article, we propose a two-stage object tracking algorithm using a dual-template. By design, the initial state of a target can be estimated using a prior fixed template at the first stage with a particle-filter-based tracking framework. The use of prior templates maintains the stability of an object tracking algorithm, because it consists of invariant and important features. In the second step, a mean shift is used to gain the optimal location of the object with the stage update template. The stage template improves the ability of target recognition using a classified update method. The complementary of dual-template improves the quality of template matching and the performance of object tracking. Experimental results demonstrate that the proposed algorithm improves the tracking performance in terms of accuracy and robustness, and it exhibits good results in the presence of deformation, noise and occlusion.

Keywords

Visual tracking particle filter mean shift prior fixed template stage update template

Introduction

Object detection and tracking are prerequisites for a number of practical applications in computer and robotic vision.^1

–5 In recent years, particle filter (PF) has become one of the most popular visual tracking methods, due to its outstanding performance in optimal estimation problems of non-linear and non-Gaussian systems.^6

–9 The PF-based method is usually used to tackle complicated tracking problems using model matching. The key step of PF is feature model extraction that is also known as prior template. In a tracking system, we use the template to find the best candidate target as the tracking result, hence a good template is crucial for robust object tracking. However, the effectiveness of the template cannot be guaranteed for long-term object tracking in the wild, in the presence of pose, illumination, occlusion and background variations. Therefore, the template must be updated iteratively to improve the matching efficiency for robust long-term object tracking. To address this problem, a number of state-of-the-art algorithms have been developed in recent years. The fixed frame update is one of the earliest methods for template updates.^7,10 However, this method has not been widely used, because the strategy of using a fixed frame cannot adapt to a variety of changes in video sequences. In contrast, an alternative is to update an object template frame by frame. For example, Collins et al.^11,12 used three components with different weights in Red-Green-Blue (RGB) space to generate a colour feature model, and improved the reliability of object tracking through a fusion scheme that selected the biggest difference between the tracked target and backgrounds. But this method is inefficient when the optimal update scheme is constantly reselecting when tracking. Zhang et al.¹³ assigned initial and candidate templates with different weights, but it caused less-update or over-update when the weights were selected improperly. In the work by Peng et al.,¹⁴ Babu et al.¹⁵ and Shan et al.,¹⁶ the template is updated by a Kalman filter (KF) in the object tracking. The single model does not hold enough prior knowledge of an object for visual tracking, hence the multi-model method is adopted to improve the target recognition capability by using an optimization function in model selection. However, multi-model and template updates cannot perform well in terms of stability and accuracy in video tracking. With the study of human cognitive vision system, the tracking strategy can gain the performance of target tracking which conforms to human vision.^{17

–28}

To deal with the issues stated above and improve the stability and accuracy of video target tracking, this article presents a two-stage visual tracking algorithm with double-template. Firstly, in order to improve the target matching efficiency, a fusion method with a prior fixed template and a stage update template is introduced. Secondly, motivated by the human cognitive vision, a fixed template is constructed, using the prior knowledge at the beginning of object tracking, to keep the stability of tracking. In addition, the stage update template is built using the target and background in tracking to improve the precision. The proposed two-stage tracking algorithm guarantees the stability and improves the matching accuracy for each frame in a video. The proposed tracking pipeline is similar to the human vision system. Lastly, extensive experiments validate the effectiveness of the proposed algorithm in terms of robustness and accuracy under complicated scenarios.

An overview of particle filter and mean-shift

Particle filter

A particle filter is an effective algorithm for solving tracking problems of non-linear and non-Gaussian systems. The algorithm consists of two steps: Prediction and update. The prediction step uses the system transition model $p (x_{k} | x_{k - 1})$ to predict the state of a target at time k. This task is accomplished using an observation likelihood model $p (z_{k} | x_{k})$ in the update step where $z_{k}$ is the observation at time k, that is extracted features at the state $x_{k}$ .

The key idea of a particle filter is to approximate the probability distribution using a set of weighted samples ${x_{k}^{i}, w_{k}^{i}}_{i = 1}^{N}$ , where ${x_{k}^{i}}_{i = 1}^{N}$ is a set of support points with associated weights ${w_{k}^{i}}_{i = 1}^{N}$ , and N is the number of particles. The weights are usually normalized by $\sum_{i = 1}^{N} w_{k}^{i} = 1$ . The posterior density of probability (pdf) is given by

p (x_{k} | z_{1 : k}) \approx \sum_{i = 1}^{N} w_{k}^{i} δ (x_{k} - x_{k}^{i})

where $δ ()$ is the Dirac delta and the weight is defined by

w_{k}^{i} \propto w_{k - 1}^{i} \frac{p (z_{k} | x_{k}^{i}) p (x_{k}^{i} | x_{k - 1}^{i})}{π (x_{k}^{i} | x_{k - 1}^{i}, z_{k})}

It has been shown that more pdf approaches are true if we set N to a larger value.

Mean shift

A mean shift^27,28 is an object tracking algorithm based on pattern matching. The method finds the maximal Bhattacharyya coefficient ρ between a target model $q_{u}$ and a candidate model $p_{u}$ . The coefficient ρ can be obtained by

ρ \approx \frac{1}{2} \sum_{u = 1}^{M} \sqrt{p_{u} (y_{0}) q_{u}} + \frac{C}{2} \sum_{i = 1}^{n_{k}} w_{i} k (∥ \frac{p x_{i} - y}{h} ∥^{2})

where M is the number of feature bins, $n_{k}$ is the number of pixels, C is a normalization constant, k is a kernel profile that is convex and monotonic decreasing, h is the kernel bandwidth, $y_{0}$ is the estimated location of the target in the previous frame, and $w_{i}$ is obtained by

w_{i} = \sum_{u = 1}^{M} δ [b (p x_{i}) - u] \sqrt{\frac{q_{u}}{p_{u} (y_{0})}}

where $b (p x_{i})$ is the colour histogram at the location $p x_{i}$ and $δ ()$ is the Delta function that determines whether the value of the pixels at $p x_{i}$ belongs to the u th bin.

Target model representation and update method

The key of the tracking algorithm presented in this article is the construction of a prior fixed template and a stage template update, which play important roles in template matching and the long-term stability of the tracking results. In practical applications, the initial localization of an object is usually identified by the key features of the target, and then discriminated according to the scene features. Hence the key feature of the target is vital in the process of visual tracking, and the main task of a success tracker is to extract robust features and construct prior fixed template using a target state, scene change and prior knowledge. To build a correct template is the first step in visual tracking, and it holds invariant properties in the whole tracking process. A good template can improve the stability of a tracking algorithm.

Because extracted target regions are usually influenced by noise, background and other factors, we use the filter algorithm to enhance the effectiveness of an extracted target area, such as median filtering and wavelet transform. Then we classify the target information in video frames into four common types: Background features, key features, mixed features and stage features.

There are a large number of background features in video sequences. Background features often change frame to frame and interfere with the performance of a tracking algorithm.

Key features are usually around the target area and remain unchanged in the tracking process, hence they play determining roles in target recognition.

Mixed features are generated by camera settings and illumination variations. They change greatly in videos and are usually useless for tracking. However, in this article, we try to make full use of these features.

Stage features exist few in target areas, but they are changeable and show strong time effectiveness. They are very useful when they are different from backgrounds.

Generation of prior fixed template

The main task of the construction of a prior fixed template is to extract the second kind of features stated above, that is to say key features. Key features are constant in the whole tracking process and are the main components of target areas.

To construct a prior fixed template, we first define a background region for a video sequence to obtain background statistical features. As shown in Figure 1, the red box is the background region $R_{b}$ and the blue box is the target region $R_{t}$ . In this article, we encode the RGB space to obtain the colour histogram for a given region. The M-bin background feature extracted from the background region $R_{b}$ is a histogram vector with M elements

Figure 1.

The construction of a prior-fixed template: (1) the background and target regions; (2) the extracted histogram vector of the target region; (3) the extracted key features for the target region; (4) the constructed template; and (5) the final optimized template using morphological image processing methods.

S = [S_{1}, S_{2},..., S_{M}]^{T}

In the same manner, for the target region $R_{t}$ , we can obtain the histogram vector as follows

P = [P_{1}, P_{2},..., P_{M}]^{T}

Figure 1-(2) shows the extracted histogram vector of the target region $R_{t}$ . However, the target region is a mixture of the tracked object and backgrounds. To remove the backgrounds from the extracted histogram from the target region, we modify the histogram vector P by setting the value of some bins that dominate the background region $R_{b}$ to 0

P_{u} = 0, i f u \in {i d x_{1},..., i d x_{T}}

where $i d x_{t}$ is the index of the top-t bin of the histogram of background region R_b, T is a threshold. The value of the threshold T is obtained by experience or using a cross validation set.

Second, the extracted histogram vector P can be considered as the target features filtered by background features. The bins top-ranked in the target area are used as key features $I = [I_{1}, I_{2},..., I_{M}]^{T}$ , and we ignore the bins with lower probabilities by setting the values of these bins to 0

I_{u} = {\begin{array}{l} P_{u}, & i f P_{u} ≧ ε \\ 0, & i f P_{u} < ε \end{array}, for u = 1, \dots, M

where ε is a threshold used to remove less-important features. The value of this threshold can be estimated by experience or from a cross validation set. Figure 1-(3) shows the histogram vector of extracted key features for the target region.

These features are invariant and play important roles in the whole tracking process. The prior fixed template can be created after obtaining these features. The template construction method is summarized in Algorithm 1. Regarding Algorithm 1 the prior fixed template KR is generated at the beginning of a tracking process. After that, we can optimize the template using the information provided by video sequences during the tracking process. As in the prior fixed template construction shown in Figure 1, we set the regions of target and background areas for each frame of a video. In Figure 1-(1), the red box is the background area and the blue box is the target area. Then we extract the histogram features for the target region as shown in Figure 1-(2) and obtain the key features. The extracted key features are illustrated in Figure 1-(3). Also, by the use of Algorithm 1, we can generate the prior fixed template, as shown in Figure 1-(4). Last, we optimize the template by morphological image processing algorithms, as shown in Figure 1-(5).

Stage template update

A stage template reflects the variation of the template in the tracking step, which helps a tracking algorithm to accurately locate the target using stage distinctiveness of video sequences. Hence a template update is very important in video tracking. In traditional tracking algorithms, the total model update strategy will all be updated regardless of primary or secondary, hence it could lead to less-update or over-update problem.

The mixed features are usually changed due to illumination variations, camera settings and object deformations. In this article, we consider the neighbourhood of key features as mixed features. To extract mixed features, we first find the indexes of the bins that are with non-zero values in key features I. We use a set $i d x = {i d x_{1}, i d x_{2},..., i d x_{N}}$ to express these indexes. Then the mixed features can be obtained by using the neighbourhoods of these indexes of the background histogram P. To this end, the set $i d x$ is first expanded to $i d x = {i d x_{1} - T_{2},..., i d x_{1} - 1, i d x_{1}, i d x_{1} + 1, ..., i d x_{1} + T_{2},..., i d x_{N} - T_{2},..., i d x_{N} - 1, i d x_{N}, i d x_{N} + 1, ..., i d x_{N} + T_{2}}$ , then the values of the bins of the mixed features can be assigned by

J_{u} = {\begin{array}{l} P_{u}, & if u \in i d x \\ 0, & otherwise \end{array}, for u = 1, \dots, M

where $T_{2}$ is the radius of the neighbourhood used to extract mixed features. The value of T₂ can be obtained by experience or a cross validation set.

Because of the relationship between background and target regions, the external background statistical features can be calculated using the region between the red box and the blue box in Figure 1-(1). By the statical analysis method, it can be estimated by

B = S - P

where the symbol ‘−’ is an operator that sets the values of the bins in S, corresponding to the bins in P with non-zero values, to 0.

Stage features reflect the variation of the appearance. A small amount of these exists in the target area, which can be estimated by

L = P - I - J

where the symbol ‘−’ has the same meaning as in the last equation. Lastly, the stage template can be composed of key features, mixed features and stage features

Q = I \cup J \cup L

where the symbol ‘∪’ obtains a super-set of two histogram vectors by using all of the bins with non-zero values.

Once we obtain the stage template, the remaining key tasks are to update the template and give full play to the advantages of the stage. In this article, in order to update the template by class and provide more useful information to matching and tracking, we propose a new update strategy that analyses the structure and relationship of the features from temporal and spatial distributions respectively.

In the proposed template update strategy, the stage template is used for matching and the current template is the ultimate goal of each step in tracking. There are four different situations which are as follows.

(i) $P_{u}^{k} = 0, Q_{u} = 0$

This situation indicates that the stage template and the current template does not have the sub-features, so we do not have to update the template.

(ii) $P_{u}^{k} > 0, Q_{u} = 0$

This situation indicates that the current template has this sub-feature, but the stage template does not have it. So the stage template can be updated using

Q_{u} = {\begin{array}{l} Q_{u} & B_{u}^{k - 1} > 0 \\ β Q_{u} + (1 - β) P_{u}^{k} & J_{u}^{k - 1} > 0 \\ γ Q_{u} + (1 - γ) P_{u}^{k} & L_{u}^{k - 1} > 0 \end{array}

where β and γ are updated speed factors of the mixed features and stage features, which are used to adjust the update speed.

In equation (13), the condition $B_{u}^{k - 1} > 0$ represents the current template having background features in the previous frame. In order to prevent error accumulation, we do not update the stage template. The condition $J_{u}^{k - 1} > 0$ represents the current template having additional features because of the camera setting, rotation or deformation variations. If these features do not belong to the background of frame $k - 1$ , they have to be updated for accurate localization of the target. The condition $L_{u}^{k - 1} > 0$ represents the current template having few features that do not belong to the background. They have to be updated because they can support template matching.

(iii) $P_{u}^{k} = 0, Q_{u} > 0$

This situation implies that the sub-model of the stage template in the current template disappears, because of object deformation, rotation and occlusion. In such a case, the stage template can be updated by

Q_{u} = {\begin{array}{l} Q_{u} & I_{u}^{k - 1} > 0 \\ β Q_{u} + (1 - β) P_{u}^{k} & J_{u}^{k - 1} > 0 \\ γ Q_{u} + (1 - γ) P_{u}^{k} & L_{u}^{k - 1} > 0 \end{array}

where $I_{u}^{k - 1} > 0$ indicates that the feature is a key feature which is about to disappear, so need not to be updated, $J_{u}^{k - 1} > 0$ or $L_{u}^{k - 1} > 0$ indicates that the features will be updated because the mixed features and stage features will change as the video frames play.

(iv) $P_{u}^{k} > 0, Q_{u} > 0$

This situation indicates that both the stage model and current template hold this sub-model. The stage template can be updated by

Q_{u} = {\begin{array}{l} α Q_{u} + (1 - α) P_{u}^{k} & I_{u}^{k - 1} > 0 \\ Q_{u} & J_{u}^{k - 1} > 0, B_{u}^{k} ≧ Q_{u} \\ β Q_{u} + (1 - β) P_{u}^{k} & J_{u}^{k - 1} > 0, B_{u}^{k} < Q_{u} \\ Q_{u} & L_{u}^{k - 1} > 0, B_{u}^{k} ≧ Q_{u} \\ γ Q_{u} + (1 - γ) P_{u}^{k} & L_{u}^{k - 1} > 0, B_{u}^{k} < Q_{u} \end{array}

where α is the update speed factor of the key features, which is used to adjust the update speed.

In equation (15), the condition $I_{u}^{k - 1} > 0$ implies that the sub-model is the key feature and will be updated. The condition $J_{u}^{k - 1} > 0, B_{u}^{k} ≧ Q_{u}$ means that the sub-model exists mainly in the background, and only a small amount is in the current template hence the contribution is tiny. To prevent the accumulation of error we do not update it. The condition $J_{u}^{k - 1} > 0, B_{u}^{k} < Q_{u}$ indicates that the sub-model in the current template is more than that in the background, so that it is better for matching and we will update it. The condition $L_{u}^{k - 1} > 0, B_{u}^{k} ≧ Q_{u}$ indicates that the sub-model exists mainly in the background, but only a small amount is in the current template. The contribution of it is very small, hence we do not need to update it. Lastly, the condition $L_{u}^{k - 1} > 0, B_{u}^{k} < Q_{u}$ implies that the sub-model in the current template is more than that of the background, hence it supports the matching procedure better and we will update it.

Two-stage target tracking strategy using dual-template

In this section, we present the proposed two-stage visual tracking algorithm using a dual-template, which consists of two steps: (a) the first step uses prior fixed template to embed the particle filter framework and estimates the preliminary target state; (b) the second step uses the stage update template to mean shift the matching method, so as to achieve an accurate state estimation. By design, the proposed two-stage method in series is more consistent with the human vision system.

In the first step, the particle weights are updated after state prediction, so that the particle can propagate in the correct direction and can effectively describe the target state. The similarity between a priori fixed template and the candidate template is calculated by the differences sum-of-squared (SSD) method, which reflects the performance of particle matching in terms of distance. The SSD method calculates the distance using

d_{i} = \sum_{(a, b) = (1, 1)}^{size (K R)} {[K R (a, b) - H R_{i} (a, b)]}^{2}

where $H R_{i}$ is the candidate template and $(a, b)$ is the template pixel coordinate. The construction method of $H R_{i}$ is summarized in Algorithm 2. Then the weight of each particle can be calculated for the k th frame

w_{k}^{i} = \frac{1}{\sqrt{2 π} σ} e^{- \frac{d_{i}^{2}}{2 σ^{2}}}

where σ is the variance of Gaussian distribution, and we set $σ {= 10}^{- 2}$ in this article. The first step of the initial target state can be estimated by

{\tilde{x}}_{k} = \sum_{i = 1}^{N} w_{k}^{i} x_{k}^{i}

where $x_{k}^{i}$ is the state of a particle. In the second step, the target state uses mean shift by the stage update template after the first stage estimation. This template shows the difference of each frame, and ensures the template matching is precise. The pipeline of the proposed method is summarized in Algorithm 2.

Experimental results and analysis

In order to verify the efficiency and effectiveness of the proposed algorithm, we present our experimental results in this section. The experiments were carried out on a personal computer with Intel Dual-Core 2.13 GHz CPU and 4 GB memory.

In this article, we test the proposed algorithm using three different data sets. The first one tests object tracking of human faces, and compares the proposed algorithm with other methods. The second experiment tests the proposed algorithm for object tracking with occlusions. The third one tests the proposed algorithm in terms of object tracking with deformations.

The face video (128 x 96) used in the first experiment has 500 frames. The fixed-template-based algorithm cannot meet the requirements of object tracking due to background changes and object rotation. In this experiment, we compare the proposed method with a single-fixed-template-based algorithm (method 1) and a weighted-update-template-based algorithm (method 2) in terms of effectiveness in face tracking. Some tracking results of these three algorithms are shown in Figure 2. The figure illustrates that the single-fixed-template-based method achieves the correct template matching only when the variation of the appearance of the tracked face is minor. The tracking accuracy decreases dramatically when the face rotates because we do not update the template during tracking. Influenced by target rotation and non-target occlusion, the weighted-update-template-based method still has shortages under changed scenes because of an over-update or a less-update. In this article, the prior fixed template contains key features. The tracking accuracy can locate the target better by the use of key features. This process is similar to the human vision system. In the process of tracking, the stage template is updated by the feature classification for each, which ensures the quality of extracted features and the convergence of the tracking result.

Figure 2.

A comparison of the proposed algorithm and other methods in face tracking (from left to right are frames 9, 43, 89, 101, 129 and 310 respectively). (a) Tracking results of method-1. (b) Tracking results of method-2. (c) Tracking results of the proposed method.

Table 1 shows the root mean square error (RMSE) of the X-axis and Y-axis in a 500 frames face video. By using the dual-template, the method of this article is better than method 1 and method 2 in terms of RMSE.

Table 1.

Comparison of tracking root mean square error (RMSE) for face sequences (pixels).

RMSE	Method 1	Method 2	The proposed method
x-axis	15.75	4.14	1.60
y-axis	14.74	5.34	2.25

The second experiment is used to verify the performance of the proposed algorithm by tracking human faces with occlusions. Occlusions are harmful to the performance of template matching and may cause failure in tracking, because an occlusion inevitably leads to the loss of part or all of the target information in the target region. To address this problem, this article uses the proposed two-stage visual tracking algorithm with the dual-template, in which the prior fixed template contains the key features of the tracked target, so it can obtain the correct initial position of the target. The stage template update strategy updates the effective information and guarantees the accuracy of the target template matching, hence the accuracy of the tracker is improved. Even when the appearance of the whole target is occluded, we can use the key features extracted from the prior fixed template for template matching, and correctly and timely update the template after occlusion. As shown in Figure 3, the proposed algorithm can address the problem of occlusion and keep the accuracy for long-term target tracking. The RMSE of this experiment is less than 2.5 pixels.

Figure 3.

Tracking results of the proposed algorithm with occlusions. (a) Video sequence 1 (from left to right are frames 10, 13, 20, 33, 41 and 46 respectively). (b) Video sequence 2 (from left to right are frames 8, 45,93,196, 584 and 897 respectively).

The third experiment tests the proposed algorithm for a pedestrian with deformations. The shape of the target greatly varies during the movement of the pedestrian, which leads to difficulties in robust target tracking. Also, an improper template update may have a negative impact. To tackle this problem, this article uses the prior fixed template to predict the initial position of the target in the whole tracking process because it contains key features. The invariance and the importance of key features ensures that the initial position is correct. The update strategy frame by frame can perceive the variation information of each frame hence it ensures the accuracy of the tracking result. As shown in Figure 4, the proposed algorithm performs very well in terms of accuracy and robustness even when the shape of the tracked pedestrian changes. The RMSE of this experiment is less than 1.65 pixels.

Figure 4.

Pedestrian tracking results in surveillance videos. (a) Video sequence 1 (from top to bottom, left to right 4, 15, 38, 46, 55 and 66 respectively). (b) Video sequence 2 (from top to bottom, left to right 8, 43, 72, 102, 130 and 183 respectively).

Conclusion

In template-matching-based visual tracking systems, template construction is crucial. In this article, we propose a two-stage tracking algorithm with a dual-template, in order to address the problems caused by occlusion and object deformation. By design, the proposed method consists of two steps to achieve robust video target tracking. The first step is the initial target location using a prior fixed template, and the second step is accurate target matching by a stage template update. The key of the proposed algorithm is its consistency with the human vision system. In the proposed algorithm, the mean shift algorithm is embedded to the particle filter algorithm. Also, we take full advantage of these two algorithms to improve the performance of target tracking. However, the classified update method is the breach point to improve the effectiveness of the stage template, and to solve the less-update or over-update. In our future work, the template update method will be further studied, and the sparse representation and dictionary learning algorithm will be introduced to improve the efficiency of template update.

Footnotes

Declaration of conflicting interest

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship and/or publication of this article: This work was supported by the Major Program for Scientific and Technological Research in the University of China (grant number 311024), the National Natural Science Foundation of China (grant numbers 61373055 and 41501461), the Natural Science Foundation of Jiangsu Province of China (grant number BK20140419), and the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (grant numbers 14KJB520001 and 16KJD520001).

References

Viola

Jones

. Rapid object detection using a boosted cascade of simple features. In: IEEE conference on computer vision and pattern recognition (CVPR), Kauai, HI, USA, 8–14 December 2001, volume 1, pp.I–511. IEEE.

Feng

Z-H

Guosheng

Kittler

. Cascaded collaborative regression for robust facial landmark detection trained using a mixture of synthetic and real images with dynamic weighting. IEEE Trans Image Process 2015; 24: 3425–3440.

Guo

Zhao

. Pedestrian tracking based on camshift with kalman prediction for autonomous vehicles. Int J Adv Robotic Syst 2016; 13: 120.

Feng

Z-H

Huber

Kittler

. Random cascaded-regression copse for robust facial landmark detection. IEEE Signal Process Lett 2015; 22(1): 76–80.

Huber

Feng

Z-H

Christmas

. Fitting 3D morphable face models using local features. In: IEEE international conference on image processing (ICIP), Quebec, Canada, 27–30 September 2015, pp.1195–1199. IEEE.

Gustafsson

Gunnarsson

Bergman

. Particle filters for positioning, navigation, and tracking. IEEE Trans Signal Process 2002; 50: 425–437.

Zhou

Chellappa

Moghaddam

. Visual tracking and recognition using appearance-adaptive models in particle filters. IEEE Trans Image Process 2004; 13(11): 1491–1506.

Choi

Savarese

. Understanding collective activities of people from videos. IEEE Trans Pattern Anal Mach Intell 2014; 36: 1242–1257.

Khatoonabadi

Bajić

. Video object tracking in the compressed domain using spatio-temporal Markov random fields. IEEE Trans Image Process 2013; 22: 300–313.

10.

Liu

T-L

Chen

H-T

. Real-time tracking using trust-region methods. IEEE Trans Pattern Anal Mach Intell 2004; 26: 397–402.

11.

Collins

Liu

Leordeanu

. Online selection of discriminative tracking features. IEEE Trans Pattern Anal Mach Intell 2005; 27: 1631–1643.

12.

Nummiaro

Koller–Meier

Van Gool

. An adaptive color-based particle filter. Image Vis Comput 2003; 21: 99–110.

13.

Zhang

Tian

Jin

. Joint tracking algorithm using particle filter and mean shift with target model updating. Chin Opt Lett 2006; 4: 569–572.

14.

Peng

Yang

Liu

. Mean shift blob tracking with kernel histogram filtering and hypothesis testing. Pattern Recognit Lett 2005; 26: 605–614.

15.

Babu

Pérez

Bouthemy

. Robust tracking with motion estimation and local kernel-based color modeling. Image Vis Comput 2007; 25: 1205–1216.

16.

Shan

Tan

Wei

. Real-time hand tracking using a mean shift embedded particle filter. Pattern Recognit 2007; 40: 1958–1970.

17.

Zhang

Yang

M-H

. Real-time object tracking via online discriminative feature selection. IEEE Trans Image Process 2013; 22: 4664–4677.

18.

Zhong

Yang

M-H

. Robust object tracking via sparse collaborative appearance model. IEEE Trans Image Process 2014; 23: 2356–2368.

19.

Possegger

Mauthner

Bischof

. In defense of color-based model-free tracking. In: IEEE conference on computer vision and pattern recognition, Boston, MA, USA, 7–12 June 2015, pp.2113–2120. IEEE.

20.

Vatavu

Danescu

Nedevschi

. Stereovision-based multiple object tracking in traffic scenarios using free-form obstacle delimiters and particle filters. IEEE trans Intell Transp Syst 2015; 16: 498–511.

21.

Cavallaro

Steiger

Ebrahimi

. Tracking video objects in cluttered background. IEEE T Circ Syst Vid 2005; 15: 575–584.

22.

Wen

Cai

Lei

. Robust online learned spatio-temporal context model for visual tracking. IEEE Trans Image Process 2014; 23: 785–796.

23.

Fan

Shen

. Scribble tracker: A matting-based approach for robust tracking. IEEE Trans Pattern Anal Mach Intell 2012; 34: 1633–1644.

24.

Nguyen

Worring

Van Den Boomgaard

. Occlusion robust adaptive template tracking. In: IEEE International Conference on Computer Vision (ICCV), Vancouver, Canada, 7–14 July 2001, volume 1, pp.678–683. IEEE.

25.

Kim

Han

. Optimal colour-based mean shift algorithm for tracking objects. IET Comput Vis 2014; 8: 235–244.

26.

Kwon

Lee

. Highly nonrigid object tracking via patch-based dynamic appearance modeling. IEEE Trans Pattern Anal Mach Intell 2013; 35: 2427–2441.

27.

Comaniciu

Ramesh

Meer

. Kernel-based object tracking. IEEE Trans Pattern Anal Mach Intell 2003; 25: 564–577.

28.

Leichter

. Mean shift trackers with cross-bin metrics. IEEE Trans Pattern Anal Mach Intell 2012; 34: 695–706.