Distribution fields visual tracking based on Newton style convergence descending

Abstract

The traditional mean shift method is limited in that it cannot represent the object accurately or converge quickly to the correct target position. To address this problem, we propose a novel tracking algorithm that uses a Newton style descent for convergence, built on the distribution fields target representation scheme. In contrast with the traditional mean shift algorithm, computational efficiency is greatly improved owing to the SSD form of the histogram in distribution fields and the efficient Newton style search. Our method adds uncertainty information to the target model and ensures more accurate convergence to the true target. Moreover, we use a Kalman filter to predict the target location within the modified mean shift framework, which helps to shorten the convergence time of the algorithm. Experimental results on several challenging video sequences verify that the proposed algorithm is efficient and effective in many complicated scenes.

Keywords

Object tracking, distribution fields, Kalman filter, mean shift

Introduction

Visual tracking is a challenging research topic in the field of computer vision. The task of a tracker is to generate the trajectories of moving objects in a sequence of images. In the previous literature, the key components of most object trackers are object representation, a similarity measure and a seeking method. The target is often represented by a histogram that models its appearance.¹ A similarity measure between the reference model and candidate targets is then used to discriminate the object of interest. Moreover, a local mode seeking method, such as mean shift, is introduced to find the most similar location in subsequent frames.² The target location is typically found by searching for the position at which the objective function reaches its global optimum. A common way to smooth the objective function is to blur the image. However, blurring the image pollutes image information, which may cause the target to be lost. To address this problem, we build the target descriptor with distribution fields (DFs), a method introduced in the literature.³ The DF representation allows the objective function to be smoothed without polluting information about pixel values. We then introduce a Newton-style minimization procedure on distribution fields. We also show that the Newton-style search framework is more efficient than mean shift, which is fundamentally a gradient descent method.

The rest of this paper is organized as follows. The next section reviews related target representation methods in tracking. The Distribution fields motion estimation section introduces our tracking framework, i.e. the distribution fields tracking method based on Newton style descent. The Experiments section presents extensive experiments that compare our method with state-of-the-art ones. The last section concludes the paper.

Previous work

It has been many years since the mean shift algorithm became one of the most popular hill climbing methods. It was introduced in the literature.¹ Since then it has been adopted to solve various computer vision problems, such as segmentation and object tracking.⁴ The mean shift method is popular for its ease of implementation, real-time response and robust tracking performance. The original algorithm uses histograms as the target representation and the Bhattacharyya coefficient as the similarity measure. At the same time, an isotropic kernel is used as a spatial mask to smooth a histogram-based appearance similarity function between the model and target candidate regions. The tracker searches for a local minimum of this smooth similarity function to estimate the translational offset of the target in each frame. Although it has promising performance, the traditional mean shift method suffers from two main limitations: inaccurate target representation and trapping in local minima. Both can result in target drift. To alleviate this problem, many authors have recently performed self-learning⁵⁻⁷ during tracking. This approach finds the position of the target and updates the model with positive and negative samples. The strategy lets the tracker adapt to new appearances and backgrounds, but breaks down as soon as the target is mislabeled. In the literature,⁵ the authors propose co-training classifiers in context to alleviate this problem. The tracker demonstrated re-detection capability and scored well on challenging video sequences. Another solution is MIL learning,⁶ where the training examples are selected from spatially related regions rather than as independent training samples. In Kalal et al.,⁷ a tracking method was introduced that combines adaptive tracking with object detection. Growing and pruning events are performed during tracking.

Rather than self-learning, a superior method based on the DF target representation and a MIL classifier was recently proposed in Ning et al.⁸ Compared with self-learning, which needs a large sample pool, the layer feature based on DFs can represent the target more effectively without maintaining an ever-growing sample pool. Following this framework, we also use a similar DF representation to make the model more generative. Our method helps in disregarding outliers during tracking without modeling them explicitly. In addition, we demonstrate a connection between the distribution fields target representation and a Newton style iterative optimization method, which further improves the tracker’s performance.

Distribution fields motion estimation

Target representation scheme

Distribution fields were proposed by Sevilla-Lara and Learned-Miller.³ A DF is an array of probability distributions that defines, for each pixel, the probability of taking each feature value. A DF model is a matrix with (2 + N) dimensions. The first two dimensions are the height and width of the image, and the other N dimensions index the feature space. Compared with a histogram-based representation, a DF preserves the spatial structure of the object by keeping a distribution at each pixel. It can be viewed as a generalization of many previous histogram-based descriptors. In Figure 1, the number of feature bins is 8; the model matrix is shown on the left and the reshaped matrix on the right. In the histogram-based representation, each element of the matrix is either 0 or 1, so the matrix is binary. A DF model is represented as a matrix d of size $h \times w \times m$ whose entries are probability masses giving the likelihood that the target pixel falls in a certain bin, where m is the number of feature intensity levels, or bins. In particular, expanding an image I into d with as many bins as feature values is defined by

$d(i, j, k) = \begin{cases} 1 & \text{if } \lfloor I(i, j) / (256 / m) \rfloor = k \\ 0 & \text{otherwise,} \end{cases}$

(1)

where i and j index the row and column of the image, 256/m is the bin width, and k indexes the possible values of the pixel, that is, the kth layer of the DF. Note that the components of each column of the DF (the values over all k at a fixed pixel) sum to 1, which yields a probability distribution at each pixel.
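The expansion in equation (1) can be sketched in a few lines of NumPy. This is a minimal illustration that assumes an 8-bit grayscale image; the function name `expand_to_df` is hypothetical and not part of the original implementation.

```python
import numpy as np

def expand_to_df(image, m=16):
    """One-hot expansion of an 8-bit grayscale image into an (h, w, m) array.

    Layer k is 1 where floor(I(i, j) / (256 / m)) == k and 0 elsewhere,
    following equation (1).
    """
    image = np.asarray(image)
    bin_width = 256 / m
    layer = np.floor(image / bin_width).astype(int)   # bin index per pixel, 0..m-1
    d = np.zeros(image.shape + (m,), dtype=float)
    rows, cols = np.indices(image.shape)
    d[rows, cols, layer] = 1.0                        # exactly one active layer per pixel
    return d

if __name__ == "__main__":
    demo = np.array([[0, 15, 16], [128, 255, 200]], dtype=np.uint8)
    df = expand_to_df(demo, m=16)
    print(df.shape)            # (2, 3, 16)
    print(df.sum(axis=2))      # every pixel sums to 1
```

Before smoothing, each pixel holds a single unit mass, so the per-pixel sum of 1 noted above holds trivially.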

Figure 1.

Comparison of target models. (a) Bright histogram. (b) Distribution fields histogram.

The 3D DF is first convolved with a 2D Gaussian filter that spreads values in the x and y dimensions. The result is then convolved with a 1D Gaussian filter that spreads values in the feature dimension. That is, each layer k of the spatially smoothed DF $\bar{d}$, and each pixel column of the final smoothed DF $\tilde{d}$, is computed as

$\bar{d}(:, :, k) = d(:, :, k) * h_{\sigma_s}, \quad \tilde{d}(i, j, :) = \bar{d}(i, j, :) * h_{\sigma_f}$

(2)

where $h_{\sigma_s}$ is a 2D Gaussian kernel of standard deviation $\sigma_s$, $h_{\sigma_f}$ is a 1D Gaussian kernel of standard deviation $\sigma_f$, and $*$ is the convolution operator. To preserve the property that the values in each column of the DF sum to 1, the missing information outside the image boundaries is first filled with uniform distributions. Once spatial filtering is complete, these borders are removed. The DF is then padded with zeros, reshaped, and filtered in 1D along the feature dimension. Finally, the matrix is restored to its original dimensions and the borders are folded back so that no weight is lost. In summary, we obtain a smooth distribution field in which “for any non-zero value in a layer L, there is a pixel of value L somewhere near this location in the original image”.³ For a compact representation, we reshape the $h \times w \times m$ array $\tilde{d}$ into an $n \times m$ matrix u, where $n = h \times w$:

$u(n, m) = \mathrm{reshape}(\tilde{d}(h, w, m))$

(3)
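The smoothing in equation (2) and the reshape in equation (3) might look as follows in NumPy. This is only a sketch under stated assumptions: the helper names are hypothetical, the kernel width is assumed odd, the region outside the image is padded with the uniform value 1/m, and "folding the borders" is interpreted here as adding any mass that spills past the first or last bin back onto that bin.

```python
import numpy as np

def gaussian_kernel(sigma, width):
    """Normalized 1D Gaussian kernel with `width` taps (width assumed odd)."""
    x = np.arange(width) - (width - 1) / 2.0
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def smooth_spatial(d, sigma_s, width):
    """Separable 2D Gaussian blur of every layer; outside the image is uniform (1/m)."""
    m = d.shape[2]
    r = width // 2
    k = gaussian_kernel(sigma_s, width)
    padded = np.pad(d, ((r, r), (r, r), (0, 0)),
                    mode="constant", constant_values=1.0 / m)
    tmp = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, tmp)

def smooth_feature(d, sigma_f, width):
    """1D Gaussian blur along the feature axis; overflow is folded onto the end bins."""
    m = d.shape[2]
    r = width // 2
    k = gaussian_kernel(sigma_f, width)
    full = np.apply_along_axis(lambda v: np.convolve(v, k, mode="full"), 2, d)
    out = full[:, :, r:r + m].copy()
    out[:, :, 0] += full[:, :, :r].sum(axis=2)        # mass that fell below bin 0
    out[:, :, -1] += full[:, :, r + m:].sum(axis=2)   # mass that fell above bin m-1
    return out

def smooth_df(d, sigma_s, width_s, sigma_f, width_f):
    """Equation (2): spatial smoothing followed by feature smoothing."""
    return smooth_feature(smooth_spatial(d, sigma_s, width_s), sigma_f, width_f)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = np.eye(16)[rng.integers(0, 16, size=(10, 12))]   # one-hot DF, shape (10, 12, 16)
    d_tilde = smooth_df(d, sigma_s=1.0, width_s=9, sigma_f=0.625, width_f=5)
    u = d_tilde.reshape(-1, d_tilde.shape[2])            # equation (3): n x m matrix
    print(u.shape, np.allclose(u.sum(axis=1), 1.0))
```

Because both kernels are normalized and no mass is discarded at the borders, every row of u remains a probability distribution.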

It is well known that target representation is critical to robustness and precision in tracking. To demonstrate the performance of DFs, we compared the DF descriptor with other descriptors in terms of the basin of attraction. For an objective function and a point p, the basin of attraction is the region around p from which descending the gradient of the objective function leads to p. The patches are of size $48 \times 80$ and were displaced by 1 to 40 pixels in each direction along the horizontal axis. Five other descriptors are compared with DFs: normalized cross correlation (ncc), sum of squared distances (ssd), Bhattacharyya distance (bhatt), mean shift using Bhattacharyya distance (ms) and L1 in DF. In our method, the Matusita distance is used for the DF descriptor. Figure 2 shows the basin of attraction for the different target representations. The df-matu descriptor has a wider convergence radius than the ncc and ssd descriptors and a more oblique profile than the ms and bhatt descriptors.³ As shown in Figure 1, it is superior to the traditional descriptors because it does not mix the values of different pixels. It is also superior to the histogram-based descriptor because weights are introduced and there is more specificity among patches.

Figure 2.

Comparison of target models. (a) Original image. (b) The different objective function evaluated translating a patch 1–40 pixels in both directions horizontally. (c) Objective function in df-L₁ descriptor. (d) Projection map in df-L₁ descriptor. (e) Objective function in df-matu descriptor. (f) Projection map in df-matu descriptor.

Search mechanism

Kalman filter

In general, we assume a linear process governed by an unknown inner state that produces a set of measurements. More specifically, there is a discrete-time system whose state at time n is given by the vector $X_n$. The state and measurement at the next time step n + 1 are given by

$X_{n+1} = F_{n+1|n} X_n + \omega_{n+1}, \quad Z_{n+1} = H_{n+1} X_{n+1} + \upsilon_{n+1}$

(4)

where $F_{n+1|n}$ is the transition matrix from state $X_n$ to $X_{n+1}$ and $\omega_{n+1}$ is white Gaussian noise with zero mean and covariance matrix $Q_{n+1}$. Moreover, $H_{n+1}$ is the measurement matrix and $\upsilon_{n+1}$ is white Gaussian noise with zero mean and covariance matrix $R_{n+1}$. In equation (4), the measurement $Z_{n+1}$ depends only on the current state $X_{n+1}$, and the noise vector $\upsilon_{n+1}$ is independent of the noise $\omega_{n+1}$. The Kalman filter computes the minimum mean-square error estimate of the state $X_k$ given the measurements $Z_1, \dots, Z_k$. The solution is a recursive procedure,³ which is described as follows

$\hat{X}_n^- = F_{n|n-1} \hat{X}_{n-1}, \quad P_n^- = F_{n|n-1} P_{n-1} F_{n|n-1}^T + Q, \quad G_n = P_n^- H_n^T [H_n P_n^- H_n^T + R]^{-1}$

(5)

$\hat{X}_n = \hat{X}_n^- + G_n (Z_n - H_n \hat{X}_n^-), \quad P_n = (I - G_n H_n) P_n^-$

(6)
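The recursion in equations (5) and (6) can be sketched as two small NumPy functions. The constant-velocity matrices below are an illustrative assumption: a state of (x, y, vx, vy) with a unit time step, and Q and R set to identity matrices. Only the prediction and correction formulas themselves come from the equations above.

```python
import numpy as np

def kalman_predict(x, P, F, Q):
    """Equation (5), first two terms: a priori state and covariance."""
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    return x_pred, P_pred

def kalman_update(x_pred, P_pred, z, H, R):
    """Gain from equation (5) and the correction step of equation (6)."""
    S = H @ P_pred @ H.T + R
    G = P_pred @ H.T @ np.linalg.inv(S)
    x = x_pred + G @ (z - H @ x_pred)
    P = (np.eye(len(x_pred)) - G @ H) @ P_pred
    return x, P

# Assumed constant-velocity model: state (x, y, vx, vy), measurement (x, y).
F = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)
Q = np.eye(4)   # process noise covariance (illustrative value)
R = np.eye(2)   # measurement noise covariance (illustrative value)

if __name__ == "__main__":
    x, P = np.array([0.0, 0.0, 1.0, 2.0]), np.eye(4)
    x_pred, P_pred = kalman_predict(x, P, F, Q)
    x_new, P_new = kalman_update(x_pred, P_pred, np.array([1.2, 1.9]), H, R)
    print(x_pred, x_new)
```

In a tracker of this kind, the predicted position would seed the search in the new frame, and the location returned by the search would serve as the measurement z.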

Newton style mean shift

As shown in equation (7), a target model can be represented by a histogram $q = (q_1, q_2, \dots, q_m)^t$.

$q_j = C \sum_{i=1}^{n} K(x_i - c) \, u(i, j), \quad j = 1, \dots, m, \quad C = \frac{1}{\sum_{i=1}^{n} K(x_i - c)}$

(7)

where c is the kernel center location and $u(i, j)$ is the j-th layer of the distribution field representation. Any eligible kernel function k(x), such as the commonly used Epanechnikov and Gaussian kernels, can be used. It has been shown that the two kernels lead to almost the same tracking results. Here, we select the Epanechnikov kernel, so that g(x) = −k′(x) = 1. Note that the definition of C implies that $\sum_{j=1}^{m} q_j = 1$. As in equation (3), the DF can be written as an n by m sifting matrix $U = [u_1, u_2, \dots, u_m]$. Similarly, we can define a vector version of the kernel function K by $K_i(c) = K(x_i - c)$. Suppose we are now given a candidate region centered at c in a subsequent image acquired at time $t'$. With this notation, we can rewrite the target model and the candidate model in a more concise form

$q = U^t K(c), \quad p(c) = p(c, t') = U^t(t') K(c)$

(8)

The Matusita metric takes the form of a sum of squared differences (SSD) objective function:

$O(\Delta c) = \| \sqrt{q} - \sqrt{p(c + \Delta c)} \|^2$

(9)
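The link between the Matusita form in (9) and the Bhattacharyya coefficient can be checked numerically. For two histograms that each sum to 1, expanding the square gives $\| \sqrt{q} - \sqrt{p} \|^2 = 2 - 2 \sum_j \sqrt{q_j p_j}$, so minimizing the SSD error is the same as maximizing the coefficient. A minimal sketch, with illustrative function names:

```python
import numpy as np

def matusita(q, p):
    """SSD between the square roots of two histograms, as in equation (9)."""
    return float(np.sum((np.sqrt(q) - np.sqrt(p)) ** 2))

def bhattacharyya_coefficient(q, p):
    """Sum over bins of sqrt(q_j * p_j)."""
    return float(np.sum(np.sqrt(q * p)))

if __name__ == "__main__":
    q = np.array([0.2, 0.3, 0.5])
    p = np.array([0.1, 0.6, 0.3])
    lhs = matusita(q, p)
    rhs = 2.0 - 2.0 * bhattacharyya_coefficient(q, p)
    print(np.isclose(lhs, rhs))
```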

As a result, the minima of (9) coincide with the maxima of the Bhattacharyya coefficient. By substituting the SSD error for the Bhattacharyya distance, we can work equivalently with the traditional mean shift algorithm. We derive a Newton-style iterative procedure to solve this optimization by expanding the expression for $\sqrt{p(c)}$ in a Taylor series and dropping higher order terms

$O(\Delta c) = \| \sqrt{q} - \sqrt{p(c)} - \frac{1}{2} d(p)^{-1/2} U^t J_K \Delta c \|^2$

(10)

where $d(p)$ denotes the matrix with p on its diagonal and $J_K$ is the n by 2 matrix of the form

$J_K = \left[ \frac{\partial K}{\partial c_1}, \frac{\partial K}{\partial c_2} \right] = \left[ \nabla_c K(x_1 - c), \nabla_c K(x_2 - c), \dots, \nabla_c K(x_n - c) \right]^t$

(11)

Setting the partial derivatives of the objective function to zero, $\partial O(\Delta c) / \partial (\Delta c) = 0$, the minimum of this objective function is the solution of the linear system

$J_K^t U d(p)^{-1} U^t J_K \Delta c = 2 J_K^t U d(p)^{-1/2} (\sqrt{q} - \sqrt{p(c)})$

(12)

Provided that $J_U = d(p)^{-1/2} U^t J_K$ has column rank 2, the solution to this optimization exists. Comparing (12) with standard mean shift, we find that the term $U d(p)^{-1/2} (\sqrt{q} - \sqrt{p(c)})$ corresponds to the weighting vector ω. Further, if the kernel satisfies the assumptions required by the mean shift procedure, then $\nabla_c K(x_i - c) = g(\| x_i - c \|^2)(x_i - c)^t$, and solving equation (12) for c yields a modified mean shift operator. More importantly, the solution of the linear system in (12) attempts to jump directly to the minimum in a single step. At this point, it is interesting to consider the convergence of the Newton style mean shift on the SSD-like objective function. In the modified mean shift algorithm, DF bins with zero values must be ignored so that the inverse of $d(p)$ exists. Furthermore, because the DF model sums to 1, the value of the mth column of bins is a function of the previous $m - 1$ columns, which reduces the rank of the matrix $U^t J_K$. Note that any zero values in the DF further lower the effective m. Regardless of how the feature values are distributed, at least three different feature values are needed to track two degrees of translational freedom. The Hessian matrix $J_U^t J_U$ has dimensions 2 × 2. Compared with the common mean shift algorithm, the additional computational complexity of inverting the Hessian matrix is $O(d^3)$, where $d = 2$. Because of this low dimension, the computational cost is small. In fact, our method needs 1–2 iterations on average, whereas common mean shift needs 5–6. The savings in iterations outweigh the cost of inverting the Hessian matrix. Compared with gradient ascent algorithms, our method therefore tends to shorten the convergence time.
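One Newton-style step of equation (12) can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the kernel is an Epanechnikov profile with an assumed bandwidth parameter, the normalization constant is treated as fixed when differentiating with respect to c, and the 2 × 2 system is solved by least squares so that a rank-deficient case does not raise an error.

```python
import numpy as np

def epanechnikov(coords, c, bandwidth):
    """Normalized kernel weights K_i(c) and their gradients with respect to c.

    coords: (n, 2) pixel positions, c: (2,) kernel center.
    The normalization constant is held fixed in the gradient (an assumption).
    """
    diff = coords - c
    r2 = np.sum(diff ** 2, axis=1) / bandwidth ** 2
    inside = r2 < 1.0
    k = np.where(inside, 1.0 - r2, 0.0)
    norm = k.sum()
    K = k / norm
    # d/dc of (1 - |x - c|^2 / h^2) is 2 (x - c) / h^2 inside the support.
    J_K = np.where(inside[:, None], 2.0 * diff / bandwidth ** 2, 0.0) / norm
    return K, J_K

def newton_step(U, coords, c, q, bandwidth, eps=1e-12):
    """Solve J_U^t J_U dc = 2 J_U^t (sqrt(q) - sqrt(p(c))) for the displacement dc."""
    K, J_K = epanechnikov(coords, c, bandwidth)
    p = U.T @ K                                   # candidate model p(c) = U^t K(c)
    keep = p > eps                                # skip empty bins so d(p)^(-1/2) exists
    J_U = (U[:, keep].T @ J_K) / np.sqrt(p[keep])[:, None]
    residual = np.sqrt(q[keep]) - np.sqrt(p[keep])
    delta, *_ = np.linalg.lstsq(J_U.T @ J_U, 2.0 * J_U.T @ residual, rcond=None)
    return delta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h, w, m = 12, 12, 4
    U = np.eye(m)[rng.integers(0, m, size=(h, w))].reshape(h * w, m)
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    c0 = np.array([5.5, 5.5])
    q = U.T @ epanechnikov(coords, c0, 6.0)[0]    # target model at the true center
    print(newton_step(U, coords, c0 + 1.0, q, 6.0))
```

When the candidate already matches the model, the residual vanishes and the step is zero; otherwise the step would be added to c and the procedure repeated until the displacement becomes small.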

Model update

During tracking, a model of the target is first generated by expanding the target image into a DF and smoothing it. The same is then done for the relevant part of a larger candidate region in the new frame. The search for the target follows the gradient of the Matusita distance between the target model and the candidate. The location of the target is estimated by finding the global minimum. After that, the target model is updated to adapt to changes in the target's appearance. As in equation (13), a linear combination of the model and the observation is used to update the target model.

$d_{t+1}(i, j, k) = \lambda d_t(i, j, k) + (1 - \lambda) d_{t-1}(i, j, k)$

(13)

where λ is the learning factor and $d_t$ is the model of the target in frame t. For better performance, a hierarchical approach is introduced in our search process. We use multiple DF models to represent the target. Each is produced with an increasing value of the parameter $\sigma_s$, which regulates the amount of spatial blur. The resulting DF models contain information at different frequencies. During the search, a coarse-to-fine strategy is applied. The most smoothed DF model is used to start the search, which continues until the convergence condition is satisfied. The location found with the first DF is then the starting point for the search in the second DF. Combining the information of multiple DFs helps not only to reduce search time but also to improve the accuracy of the target's location. By combining multiple DFs, a non-parametric model of the distribution at each pixel is built. At the same time, the statistics of the appearance are learned during tracking. We now summarize the procedure in pseudo-code.

Algorithm 1

Distribution fields tracking

Input: S = video sequence.

I = patch containing the target in the first frame.

$\sigma_s$ = set of spatial smoothing parameters ( $[\sigma_s^1, \sigma_s^2] = [1, 2], [\varpi_s^1, \varpi_s^2] = [9, 15]$ ).

$\sigma_f$ = brightness smoothing parameter ( $\sigma_f = 0.625, \varpi_f = 5$ ).

m = number of brightness bins (m = 16).

λ = learning factor ( $\lambda = 0.95$ ).

Output: $(x, y)_f$ {positions of the target at each video frame f in S}

1. Initialize the target representation model

$\tilde{d}_m^i = \mathrm{expand}(I) * h_{\sigma_s^i} * h_{\sigma_f}, \quad i \in 1, \dots, |\sigma_s|$

Reshape $u_m^i(n, m) = \mathrm{reshape}(\tilde{d}_m^i(h, w, m))$

2. Initialize the target location $(x_1, y_1)$ to the center of patch I.

3. for $k = 2 \to |S|$ do

4. Predict the target location $(x_{k|k-1}, y_{k|k-1})$ with the motion equation of the Kalman filter.

5. for $i = 1 \to |\sigma_s|$ do

6. Build the candidate object representation model in distribution fields from frame f.

$\tilde{d}_f^i = \mathrm{expand}(f) * h_{\sigma_s^i} * h_{\sigma_f}$

Reshape $u_f^i(n, m) = \mathrm{reshape}(\tilde{d}_f^i(h, w, m))$

7. Find the minimum of the objective function at $(x', y')$, updating the weights as in Eq. (12).

8. end for

9. Update the target location $(x_k, y_k)$ with the observed value $(x', y')$.

10. Model update:

$\tilde{d}_m^i = \lambda \tilde{d}_m^i + (1 - \lambda) \tilde{d}_f^i(x_k, y_k)$

11. end for
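Steps 5 to 8 and step 10 of Algorithm 1 can be sketched in Python as follows. The function names are hypothetical, and the per-level search is passed in as a callable `refine(level, location)` that returns a displacement; this stands in for the Newton-style step and is an assumption made to keep the sketch self-contained. The stopping tolerance and iteration cap are illustrative values.

```python
import numpy as np

def update_model(model, observation, lam=0.95):
    """Step 10: blend the stored model with the DF observed at the new location."""
    return lam * model + (1.0 - lam) * observation

def coarse_to_fine(start, levels, refine, max_iter=10, tol=0.5):
    """Steps 5-8: search each DF level in turn, most smoothed first.

    levels: level identifiers ordered from coarse to fine.
    refine: callable (level, location) -> displacement, e.g. one Newton step.
    The location found at one level seeds the search at the next.
    """
    location = np.asarray(start, dtype=float)
    for level in levels:
        for _ in range(max_iter):
            step = np.asarray(refine(level, location), dtype=float)
            location = location + step
            if np.linalg.norm(step) < tol:     # converged at this level
                break
    return location

if __name__ == "__main__":
    target = np.array([8.0, 4.0])
    toy_refine = lambda level, loc: 0.5 * (target - loc)   # toy stand-in for the search
    print(coarse_to_fine([0.0, 0.0], levels=[2.0, 1.0], refine=toy_refine))
    print(update_model(np.zeros(3), np.ones(3), lam=0.95))
```

With λ = 0.95, each update moves the model only 5% of the way toward the new observation, so appearance changes are absorbed gradually.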

Experiments

Experimental setup

In the experiments, 10 publicly available video sequences are used to evaluate tracking performance. Captured in different scenarios, these sequences contain diverse events such as occlusion, object pose variation, lighting changes and out-of-plane rotation. For each tracker, the default parameters supplied with the source code are used in all evaluations. Our method performs object localization using the distribution fields target representation scheme with a Kalman filter and the Newton style mean shift algorithm. The average running time of our C++ implementation on OpenCV 3.0 is about 0.03 s per frame on a workstation with an Intel i7 3770 CPU (3.4 GHz) and 6 GB of RAM.

For quantitative performance comparison, two popular evaluation criteria are used: center location error (CLE) and overlap ratio (OVR) between the predicted bounding box $B_p$ and the ground truth bounding box $B_{gt}$, such that $OVR = \frac{area(B_p \cap B_{gt})}{area(B_p \cup B_{gt})}$. We apply the recent benchmark evaluation of tracking methods⁹ to all sequences. The performance of the tracking algorithms is evaluated in MATLAB 7.8 using precision rate P and success rate S. A detection is considered correct if its overlap with the ground truth bounding box is larger than 25% or the center location distance between the object and the ground truth bounding box is less than 20 pixels.
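The two criteria and the correctness rule above can be written compactly. The sketch below assumes boxes given as (x, y, width, height) tuples with the origin at the top-left corner; the box format and the function names are illustrative assumptions, while the thresholds of 0.25 and 20 pixels come from the text.

```python
import math

def overlap_ratio(bp, bgt):
    """OVR = area(Bp ∩ Bgt) / area(Bp ∪ Bgt) for (x, y, width, height) boxes."""
    ix = max(0.0, min(bp[0] + bp[2], bgt[0] + bgt[2]) - max(bp[0], bgt[0]))
    iy = max(0.0, min(bp[1] + bp[3], bgt[1] + bgt[3]) - max(bp[1], bgt[1]))
    inter = ix * iy
    union = bp[2] * bp[3] + bgt[2] * bgt[3] - inter
    return inter / union if union > 0 else 0.0

def center_location_error(bp, bgt):
    """Euclidean distance in pixels between the two box centers."""
    return math.hypot((bp[0] + bp[2] / 2) - (bgt[0] + bgt[2] / 2),
                      (bp[1] + bp[3] / 2) - (bgt[1] + bgt[3] / 2))

def is_correct(bp, bgt, ovr_thresh=0.25, cle_thresh=20.0):
    """Correct if overlap exceeds 25% or center distance is below 20 pixels."""
    return (overlap_ratio(bp, bgt) > ovr_thresh
            or center_location_error(bp, bgt) < cle_thresh)

if __name__ == "__main__":
    pred, truth = (0, 0, 10, 10), (5, 5, 10, 10)
    print(overlap_ratio(pred, truth))          # 25 / 175
    print(center_location_error(pred, truth))  # sqrt(50)
    print(is_correct(pred, truth))
```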

Empirical comparison of trackers

We compare our method with several state-of-the-art trackers both qualitatively and quantitatively. These trackers are referred to as OAB, SBT, BSBT, MIL and TLD.⁵⁻⁷ In our method (DF), the parameters are set as follows. The number of bins is 16. The widths of the spatial filters are [9, 15], the width of the feature filter is 5, the standard deviations of the spatial filters are [1, 2], the standard deviation of the feature filter is 0.625 and the learning factor λ is 0.95. The motion model of the Kalman filter is a constant velocity model, and its initial parameters are set to 1.

Qualitative comparison

We evaluate the performance of all six trackers on 10 video sequences. Figure 3 shows the qualitative tracking results of the six trackers over several representative frames of the video sequences. From this figure, we observe that our method achieves the best tracking performance on most video sequences. In particular, our method obtains more robust tracking results in the presence of complicated appearance changes. An example of severe illumination variation and occlusion is the ‘coke’ sequence, shown in the top left of Figure 3. The tracked coke can is lost by all other trackers at the 79th frame, when the target is partially occluded by leaves. Our method succeeds in tracking the target throughout the whole sequence. The ‘bolt’ video sequence, bottom left of Figure 3, contains object deformation and out-of-plane rotation followed by partial occlusion. SBT, MIL and TLD break down after the 10th frame. OAB and BSBT lose the target after the 14th frame and the 116th frame, respectively. Our method stays locked on the target regardless of changes in illumination or body pose. In the ‘dog1’ sequence, third row, left of Figure 3, the object undergoes out-of-plane and in-plane rotation. Our method performs better than the other algorithms. In the ‘football1’ sequence, fourth row, right of Figure 3, the object suffers from heavy background clutter. Our method handles the clutter effectively and efficiently. The DF descriptor is able to follow a gradient back to its original position quickly within a wide basin of attraction.

Figure 3.

Qualitative tracking results of the six trackers over several representative frames of the 10 video sequences (i.e. ‘coke’, ‘faceocc1’, ‘boy’, ‘fleetface’, ‘dog1’, ‘tiger1’, ‘football1’, ‘freeman1’, ‘bolt’, ‘fish’), arranged from left to right and from top to bottom.

Quantitative comparison

We evaluate the CLE and OVR performance of all six trackers on 10 video sequences. Table 1 reports the median CLEs, OVRs and FPS of the six trackers on all 10 video sequences. Figure 4 plots the frame-by-frame CLEs and OVRs obtained by the six trackers for the ‘bolt’ video sequence. We count tracking as correct at a CLE threshold of 20 pixels and an OVR threshold of 0.25. Our tracker obtains more accurate tracking results than the other trackers. It is known that visual trackers can be sensitive to initialization. To evaluate robustness to initialization, we follow the protocol proposed in the benchmark evaluation.⁹ We compare our method with the other trackers using the distance and overlap precision plots for the TRE and SRE experiments. The trackers are ranked by the area under the curve (AUC). Both the precision and success plots show the mean precision scores over all sequences. The results are shown in Figure 5. In all evaluations, our method obtains the best results.

Figure 4.

Quantitative evaluation of the six methods in terms of CLE and OVR on the ‘bolt’ video sequence.

Figure 5.

Precision and success plots over all 10 sequences. The mean precision scores of each tracker are reported in the legends. Note that our method improves on the second-best tracker by 16% in mean distance precision. In both cases, our approach performs favorably against state-of-the-art tracking methods.

Table 1.

Comparison of different methods for tracking.

Metric	DF	OAB	SBT	BSBT	MIL	TLD
mean CLE	13.41	64.06	58.37	31.30	59.46	22.11
mean OVR	0.63	0.43	0.33	0.51	0.42	0.54
mean FPS	35.50	4.82	5.96	4.11	28.85	22.58

Note: The best two results are shown in red and blue fonts. The results are presented using median overlap precision (OVR) (%), center location error (CLE) (in pixels) and frames per second (FPS) over all 10 sequences.

Discussion and analysis

Each component is evaluated to show its contribution to the overall performance and sensitivity. Because multiple DFs are introduced, the tracking method can overcome challenges such as moderate illumination changes and occlusion. Further study shows that $\sigma_s$ and $\sigma_f$ are the most important parameters and strongly affect the tracker’s performance. The rule of selection is that small targets use smaller kernels. However, other factors also influence the tracker's performance, such as the objective function between background and target in the appearance model. It is shown that, under suitable conditions, the SSD-like measure can be optimized using Newton-style iterations. The starting position is another important factor that influences convergence speed. A predicted position with a small bias converges quickly to the true target center, whereas performance degrades when the starting position is far from the target. By introducing a Kalman-based prediction model, the target search starts at a suitable position, which shortens the search time. For other online learning tracking methods, the processing speed during tracking depends on both the number of samples and the learning method. When a dense sampling strategy is selected, accuracy improves and speed drops. The trade-off is set by the preference between accuracy and speed. However, because we use a Newton style descent, the actual run speed is no lower than that of standard mean shift. At the same time, the Kalman filter is used to predict the starting position of the search. This boosts convergence speed and makes the tracker more accurate. As can be seen in Table 1, the frame rate of our method is about 23% higher than that of the second-ranked tracker (35.50 vs. 28.85 FPS), while its precision also ranks highly. During the search, the inverse of the Hessian matrix $J_U^t J_U$ should be updated to stay synchronized with changes in the target's appearance.

The computation time of the matrix inverse affects convergence speed. Owing to the low dimension of the Hessian, this factor is greatly alleviated. Although the cost still exists, the results show that our method runs faster, and thus fewer steps are needed to reach the same tracking accuracy.

Conclusion

We have proposed a new target representation scheme based on distribution fields for tracking. It resolves ambiguity and overcomes under-sensitivity to spatial structure. It also resolves the over-sensitivity that other descriptors have to the geometric structure of the target, and it is able to model slow changes in appearance and pose while remaining robust to moderate occlusions. We extend the mean shift scheme with a Newton style descent that needs fewer iterations. Furthermore, we merge the Kalman filter and the modified mean shift framework. It is well known that speed is a crucial factor for many real-world applications such as robotics and real-time surveillance. Our method maintains state-of-the-art accuracy while operating at real-time speed. We believe that DFs are a fertile framework for visual tracking and that Newton style descent is especially suitable for real-time applications.

Footnotes

Acknowledgement

We thank the reviewers for their helpful comments.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work is granted by Science Project of Fujian Education Department (JA12263) and Science and technology cooperation projects of Fuzhou City (2013-G-86).

References

Fukunaga K and Hostetler LD. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Trans Inf Theory 1975; 21(1): 32–40.

Tian-Jian and Zu-Tao. Adaptive double Kalman filter and mean shift for robust fast object tracking. Int J Advance Comput Technol 2013; 5: 349–356.

Sevilla-Lara L and Learned-Miller E. Distribution fields for tracking. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, 2012, pp.1910–1917. USA: IEEE.

Cheng. Mean shift, mode seeking, and clustering. IEEE Trans Pattern Anal Machine Intell 1995; 17: 790–799.

Yu Q, Dinh TB and Medioni G. Online tracking and reacquisition using co-trained generative and discriminative trackers. In: Forsyth D, Torr P and Zisserman A (eds) ECCV 2008, Part II. LNCS, 2008, pp.678–691.

Babenko B, Yang MH and Belongie S. Visual tracking with online multiple instance learning. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, 2009, pp.983–990. USA: IEEE.

Kalal Z, Matas J and Mikolajczyk K. ‘P-N learning: bootstrapping binary classifiers by structural constraints’. In: Proceedings of the IEEE conference on computer vision and pattern recognition, San Francisco, USA, 2010, pp. 49–56.

Ning, Shi and Yang. Visual tracking based on distribution fields and online weighted multiple instance learning. Image Vision Comput 2013; 31: 853–863.

Wu Y, Lim J and Yang M. Online object tracking: A benchmark. In: IEEE Conference on Computer vision and pattern recognition (CVPR), Portland, OR, 2013, pp.2411–2418. USA: IEEE.