Abstract
Object tracking is a fundamental task in computer vision and underpins high-level applications such as intelligent surveillance, motion-based recognition, video indexing, traffic monitoring and vehicle navigation. However, the recent widespread use of wireless consumer cameras often produces low-quality videos with frame-skipping, which makes object tracking difficult. Previous tracking methods generally depend heavily on object appearance or motion continuity and cannot be directly applied to frame-skipping videos. In this paper, we propose an improved particle filter for object tracking to overcome the frame-skipping difficulties. The novelty of our particle filter lies in using the detection result of erratic motion to improve the transition model and thereby obtain a better trial distribution. Experimental results show that the proposed approach improves the tracking accuracy in comparison with the state-of-the-art methods, even when both the object and the consumer camera are in motion.
1. Introduction
Object tracking, in general, is the tracking of an object or objects over a sequence of images. Object tracking is an important task in the field of computer vision and it is usually performed at an early stage in the context of higher-level applications such as automated surveillance, motion-based recognition, video indexing, traffic monitoring and vehicle navigation [1]-[3]. For example, in an automated surveillance system there are at least three key steps: detection of interesting objects, tracking of such objects over frames and analysis of object trajectories to recognize their behaviour. Therefore, object tracking is a critical task in many high-level applications.
Object tracking is a very challenging problem because many difficulties arise from non-rigid object structures, occlusions, changing appearance patterns of both the object and the scene, etc. Many methods have been designed to overcome these common difficulties. Meanwhile, the availability of low-cost hardware, such as CMOS cameras and microphones that can ubiquitously capture video content from the environment, has fostered the development of wireless video sensor networks (WVSNs) [4], [5]. Wireless devices (Fig. 1 shows the wireless consumer cameras used in our experiment in Section 5) allow videos to be retrieved, and tracking in WVSNs is a practical requirement of many real-time applications. The retrieved videos, however, usually suffer from two difficulties that are together referred to as the frame-skipping problem (see Fig. 2): one is unexpected frame dropping (missing frames in a continuous video sequence) and the other is a low frame rate. The frame-skipping problem can be caused by various factors, e.g., low-cost hardware, low or unstable processing speed in the video sources, frame dropping caused by the transmission conditions, or online compression or decompression which limits the frame rate. Therefore, frame-skipping videos are common and “normal” for WVSNs; frame-skipping is a property of the video flow itself and has become an important issue in many applications.

Fig. 1. The wireless cameras used in the experiments (Section 5).

Fig. 2. (a) Normal video frame sequences; (b) Frame dropping; (c) Low frame rate.
Previous tracking methods, in general, depend heavily on object appearance or motion continuity (see Section 2 for a detailed review). These methods often rely on the assumption of temporal continuity, whereas in frame-skipping videos the continuity of a target is often too weak to follow. Essentially, frame-skipping videos make it difficult to obtain the transition model (which describes how objects move between frames). Meanwhile, identifying the target from frame to frame is difficult due to the absence of context, so we cannot rely on image processing techniques alone. Therefore, most previous tracking methods cannot be directly applied to frame-skipping videos. One feasible solution proposed previously, whether or not its original motivation was frame-skipping tracking, is the integration of object detection and tracking, because object detection allows the target to be discriminated from other objects [6]-[14]. This solution can partially overcome the frame-skipping difficulties, but applying reliable object detection over a large search space is often costly [6]-[12]. Furthermore, identifying the target requires strong discriminative power, which is usually achieved by massive offline training [13], [14]. However, in many applications of WVSNs, offline training is impossible because both targets and scenarios are unpredictable. Our method, on the other hand, is highly efficient in frame-skipping videos and does not require offline training.
Another important challenge of object tracking is that consumer cameras are frequently rotated or moved during the video capturing process. The motion of consumer cameras also makes object tracking more difficult because the non-stationary background is an obstacle to the extraction of moving objects, thus static background subtraction-based methods [6]-[10] are not applicable. When camera motion and possible cluttered backgrounds appear in some applications, a particle filter (using a dynamic model to guide the particle propagation within a limited sub-space of target state) has been used previously to solve object tracking effectively [15], [21]. However, when object motion becomes unpredictable under frame-skipping conditions, the standard particle filter will cause departure of the sample set from the true target state and this eventually leads to tracking loss.
In this paper, we propose an improved particle filter with a better transition model to overcome the difficulties of object tracking in frame-skipping videos. In our method, it is motion detection, rather than object detection as in [6]-[14], that plays the key role in object tracking. We apply fast and reliable detection which produces a global description in an acceptable search space. The novelty of our particle filter lies in using the detection result of erratic motion to improve the transition model and thereby obtain a better trial distribution. We compare the tracking accuracy of the proposed approach with the state-of-the-art methods and show that our method achieves higher accuracy, even when both the object and the consumer camera are in motion.
The remainder of this paper is organized as follows: Section 2 briefly summarizes related works on frame-skipping videos. Section 3 is devoted to analysing the essence of the frame-skipping problem in the probabilistic framework and then reveals the deficiency of the standard particle filter. Section 4 introduces the extraction of erratic motion and then proposes our particle filter with a newly defined transition model for a better trial distribution. Section 5 presents the experimental results which show that our new method outperforms the state-of-the-art methods and we also discuss the limitations of our method. Section 6 concludes this paper.
2. Related Works
In most cases, frame-skipping events are equivalent to uncertain erratic motion. Many state-of-the-art methods, such as mean shift [18], [19], require the kernels or feature patches in consecutive frames to overlap or to lie in very close vicinity of each other, a condition that erratic motion violates. Nevertheless, some existing publications [6]-[14] have attempted to tackle similar difficulties, whether or not their motivation was specifically frame-skipping tracking. A common feature of all these methods is the integration of object detection and tracking. We classify these works into three categories.
i. “Global object detection” for object tracking. These methods use an independent detector to guide the search of an existing tracker when target motion becomes unpredictable; in most cases they require an object detector fast enough to be applied to the whole frame. Okuma et al. [13] use a boosted detector to amend the trial distribution of the particle filter. However, the boosted detector requires massive offline training. Similar research on a mixture trial distribution is described in [6]. Porikli et al. [17] extend the standard mean shift technique by using multiple kernels at motion areas detected by background subtraction, in order to track in fixed-camera videos at both 6 fps (frames per second) and 1 fps. In our method, we use erratic motion detection, which requires no offline learning, to overcome the frame-skipping problem.
ii. “Object detection and connection” for object tracking. These methods detect the objects of interest and then construct trajectories by analysing motion continuity, object appearance similarity, etc. However, the algorithms of this category [8]-[10] are limited to static background scenes, where a fast change detector is easy to realize. Besides, the trajectories are uncertain and usually cannot be recognized in frame-skipping videos. The methods in category ii are therefore not applicable in many WVSN applications because of non-stationary backgrounds, and in most cases they also require an object detector fast enough to be applied to the whole frame. In our method, the improved particle filter can be applied to dynamic background scenes.
iii. “Multi-scale or multi-stage object detection” for object tracking. These methods increase the discriminative power by layered sampling of multi-scale likelihoods or multi-stage observations. In [11], multi-scale approaches are designed for erratic motion by layered sampling of multi-scale likelihoods [12]. However, the multi-scale approaches adopt the same observation model at every scale and lose image information in the down-scaling process. Li et al. [14] propose a cascade particle filter with discriminative observers of different life spans. This method treats tracking as a classification problem, in the sense of distinguishing the tracked human face from the background. Besides, the long-life-span observer of this method requires massive offline training that takes several days. In our method, we integrate erratic motion detection and tracking to obtain a better transition model (rather than, e.g., the observation model proposed in [14]) and thus overcome the frame-skipping problem without massive offline learning.
3. Problem Analysis
The standard particle filter can effectively overcome difficulties such as camera motion and cluttered backgrounds, which usually appear in applications of WVSNs. However, when object motion becomes erratic and unpredictable under frame-skipping conditions, the standard particle filter leads to tracking loss. We first briefly review the basic rules of object tracking in a probabilistic framework and then reveal the deficiency of the standard particle filter in the case of frame-skipping.
3.1 Tracking in a Probabilistic Framework
We first briefly review the basic idea of the particle filter, i.e., tracking in a probabilistic framework.
To define the tracking problem, we can consider a dynamic system represented by the stochastic process {
where
Then tracking problem can be converted to recursively estimate
where
Given the data
Where the normalizing constant α depends on the likelihood function defined by the observation model in (2) and the known statistics of
Note that in (3):
Equation (3) describes the optimal Bayesian solution. However, this recursive propagation of the posterior density is only a conceptual solution: in general it cannot be determined analytically. We therefore need a method that approximates the optimal Bayesian solution, such as a particle filter.
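On a small discrete state space the recursion can be evaluated exactly, which makes its two steps (prediction with the transition model, then update with the observation likelihood) concrete. The following is only an illustrative sketch: the three-state example, the function name `bayes_filter_step` and all numbers are assumptions made for illustration and do not come from the experiments in this paper.

```python
def bayes_filter_step(prior, transition, likelihood):
    """One prediction/update step of the Bayesian recursion on a finite state space.

    prior[j]         : p(x_{k-1} = j | z_{1:k-1})
    transition[j][i] : p(x_k = i | x_{k-1} = j)   (transition model)
    likelihood[i]    : p(z_k | x_k = i)           (observation model)
    """
    n = len(prior)
    # Prediction: sum over previous states (an integral in the continuous case).
    predicted = [sum(prior[j] * transition[j][i] for j in range(n)) for i in range(n)]
    # Update: weight the prediction by the likelihood, then normalise.
    unnormalised = [likelihood[i] * predicted[i] for i in range(n)]
    total = sum(unnormalised)  # reciprocal of the normalising constant
    if total == 0.0:
        raise ValueError("observation has zero probability under the prediction")
    return [u / total for u in unnormalised]


# Hypothetical toy example: three positions on a line; the target either
# stays where it is or moves one step to the right.
transition = [
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
]
prior = [1.0, 0.0, 0.0]        # target known to start at position 0
likelihood = [0.1, 0.8, 0.1]   # the new observation favours position 1
posterior = bayes_filter_step(prior, transition, likelihood)
```

In a continuous, high-dimensional state space the sum in the prediction step becomes an integral that cannot be evaluated this way, which is why a sampling-based approximation is needed.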
A particle filter [15], [21] uses a probabilistic framework to formulate tracking as an inference in a Hidden Markov Model (HMM). It is based on random measurement density approximated by a set of weighted particles. Each particle consists of the state domains and its corresponding probability (weight) is denoted by
where δ(·) is the Dirac delta measure. Let xi ~ q(x), (i = 1,…, Ns) be samples that are generated from a proposal called an importance density. If the samples were drawn from an importance density q(
where the importance density becomes only dependent on
where wik is given in (6). When Ns → ∞ the approximation presented in (7) approaches the true posterior density p(
3.2 Deficiency of the Standard Particle Filter
In the standard particle filter, the distribution p(
Prediction:
• Update: p(
According to the standard particle filter, the calculation of the integral in (8) is carried out by importance sampling, which means samples that are generated from a proposal distribution.
In practice, a presumed prior distribution p(
The prior distribution is simply an impulse based on user input. It describes the initial distribution of object states and could also be based on an object detector.
Observation model is a simple HSV histogram-based model (Fig. 3). We specify the likelihood of an object being in a specific state. Likelihood is based on a distance metric D[·,·] between histograms h0(or h(0)) (the original one) and h(
The transition model is a Gaussian window around the current state, from which the standard particle filter usually samples the next state. For a given particle at time k-1, the prediction and measurement process is shown in Fig. 4.

Fig. 3. Examples of HSV histogram-based observation model.

Fig. 4. Prediction and measurement. D[·,·] is a distance metric between two histograms.
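The prediction and measurement loop of the standard particle filter described above can be sketched as follows, under several assumptions. The state is reduced to a 1-D position, the histogram distance is taken to be the Bhattacharyya distance, and the likelihood is taken to decay exponentially with the squared distance; these choices are common in colour-based particle filters but are not confirmed by the text here. The names `sir_step`, `histogram_likelihood`, `toy_likelihood` and the constant `lam` are hypothetical.

```python
import math
import random


def bhattacharyya_distance(h_ref, h_cand):
    """Distance between two normalised histograms (0 means identical)."""
    coefficient = sum(math.sqrt(a * b) for a, b in zip(h_ref, h_cand))
    return math.sqrt(max(0.0, 1.0 - coefficient))


def histogram_likelihood(h_ref, h_cand, lam=20.0):
    """Likelihood that decays with histogram distance (lam is a hypothetical constant)."""
    d = bhattacharyya_distance(h_ref, h_cand)
    return math.exp(-lam * d * d)


def sir_step(particles, likelihood_fn, sigma, rng):
    """One step of a bootstrap particle filter on 1-D positions."""
    # Prediction: sample each next state from a Gaussian window around the current one.
    predicted = [x + rng.gauss(0.0, sigma) for x in particles]
    # Measurement: weight each particle by the observation likelihood.
    weights = [likelihood_fn(x) for x in predicted]
    total = sum(weights)
    if total == 0.0:
        # No particle received any weight: the sample set has left the target.
        return predicted, None
    weights = [w / total for w in weights]
    estimate = sum(w * x for w, x in zip(weights, predicted))
    # Resampling: draw an equally weighted set in proportion to the weights.
    resampled = rng.choices(predicted, weights=weights, k=len(predicted))
    return resampled, estimate


def toy_likelihood(x, target=3.0):
    """Stand-in for the histogram likelihood: peaks at a hypothetical target position."""
    return math.exp(-0.5 * (x - target) ** 2)


rng = random.Random(0)
particles = [0.0] * 200
particles, estimate = sir_step(particles, toy_likelihood, sigma=2.0, rng=rng)
```

When the target moves much further between two frames than the Gaussian window allows, almost every particle receives a negligible weight. This is the situation in which the standard filter loses the target.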
Fig. 5 (a) illustrates an example of tracking loss using the standard particle filter in our case. Theoretically, to solve the tracking loss problem under frame-skipping conditions, we can choose among four options:
1) Increasing the number of samples, which at the same time greatly decreases efficiency.
2) Considering the effect of p(
3) Strengthening discriminative powers of the observation
4) Improving the transition model.
Fig. 5. Tracking in a frame-skipping video lab.avi at 3–4 fps. (a) A standard particle filter (particle number = 400) begins at #2 and completely fails at #17. (b) Our method (particle number = 50) tracks successfully (#13 to #22).
In this paper, we choose to improve the transition model directly (Case 4). The other choices either degrade the system's efficiency (Cases 1 and 2) or require massive offline training (Case 3). The question is then how to overcome the frame-skipping difficulties when the results of erratic motion detection are available.
4. Our Improved Particle Filter
Erratic motion detection can produce a global description in a search space. In Section 4.1, we present how to represent and extract erratic motion as a local motion vector. In Section 4.2, the transition model is redefined in terms of target drift over frames based on the local motion vector, and we describe how this new transition model is integrated into the probabilistic framework to tackle the frame-skipping problem.
4.1 Representation and Extraction of Erratic Motion
In this paper, we use the motion history image (MHI) and its hierarchical mechanism [16], [17], a simple and fast method that represents motion as successively layered silhouettes which directly encode system time. This representation can be used to segment and measure the motions induced by the object in a video scene. The segmented regions are not “motion blobs”, but motion regions naturally connected to the moving parts of the object of interest. First, we label as foreground those pixels that lie more than a set number of standard deviations from the mean RGB background. Then a pixel dilation and region growing method is applied to remove noise and extract the silhouette. The MHI representation is then constructed by successively layering selected image regions over time using a simple update rule:
where each pixel (x, y) in the MHI is marked with the current timestamp τ if the function Φ indicates object (or motion) presence in the current video frame I(x, y); the remaining timestamps are removed if they are older than the decay threshold (τ – ϑ).
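The update rule described in words above can be sketched in a few lines. This is a minimal illustration written from that verbal description only: the function name `update_mhi`, the list-of-lists image layout and the toy frames are assumptions, and the silhouette is taken as already computed by the presence function Φ.

```python
def update_mhi(mhi, silhouette, timestamp, duration):
    """Update a motion history image in place and return it.

    mhi        : 2-D list of floats, time at which motion was last seen (0 = none)
    silhouette : 2-D list of 0/1 flags, presence of the object in the current frame
    timestamp  : current time (tau)
    duration   : decay value; entries older than (timestamp - duration) are cleared
    """
    for y, row in enumerate(silhouette):
        for x, present in enumerate(row):
            if present:
                # Object or motion present: stamp the pixel with the current time.
                mhi[y][x] = timestamp
            elif mhi[y][x] < timestamp - duration:
                # Stale timestamp: remove it from the history.
                mhi[y][x] = 0.0
    return mhi


# Hypothetical toy sequence: a one-pixel object moving right along a single row.
mhi = [[0.0, 0.0, 0.0, 0.0]]
frames = [
    [[1, 0, 0, 0]],
    [[0, 1, 0, 0]],
    [[0, 0, 1, 0]],
    [[0, 0, 0, 1]],
]
for t, silhouette in enumerate(frames, start=1):
    update_mhi(mhi, silhouette, timestamp=float(t), duration=2.0)
```

After the four frames the row holds increasing timestamps in the direction of motion, and the oldest entry has been cleared because it fell outside the decay window.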
Since motion can be perceived from the displayed timestamp gradients in the template, we could convolve gradient masks with the timestamp values in the MHI to extract a motion vector at each pixel. Gradients of the MHI can be calculated efficiently by convolution with separable Sobel filters in the X and Y directions yielding the spatial derivatives: Fx(x, y) and Fy(x, y).
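As a small illustration of this gradient step, the sketch below applies the 3 × 3 Sobel masks directly at one pixel of an MHI and converts the result to an orientation. It uses the direct (non-separable) form of the masks for brevity, and computing the orientation with `atan2` is an assumption about how the per-pixel motion direction is obtained. The helper names `sobel_at` and `orientation_deg` are hypothetical.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_at(img, x, y):
    """Return (Fx, Fy) at an interior pixel by applying the 3x3 Sobel masks."""
    fx = fy = 0.0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            value = img[y + dy][x + dx]
            fx += SOBEL_X[dy + 1][dx + 1] * value
            fy += SOBEL_Y[dy + 1][dx + 1] * value
    return fx, fy


def orientation_deg(fx, fy):
    """Gradient orientation in degrees in [0, 360), with y growing downwards."""
    return math.degrees(math.atan2(fy, fx)) % 360.0


# Hypothetical MHI patch: timestamps increase to the right, i.e. the most
# recent motion is on the right-hand side.
patch = [
    [1.0, 2.0, 3.0],
    [1.0, 2.0, 3.0],
    [1.0, 2.0, 3.0],
]
fx, fy = sobel_at(patch, 1, 1)
angle = orientation_deg(fx, fy)
```

Here the gradient points along the positive X axis, which matches a target moving to the right.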
A simple calculation for the global weighted orientation is denoted as (11), where
However, the use of a discrete fixed-sized gradient mask (i.e., 3 × 3 Sobel gradient masks) limits the range of recoverable motion. When an object moves at different velocities, a fixed-sized mask can result in detection failure. Therefore, we use a hierarchical MHI mechanism [16], which extends the original MHI representation into a hierarchical pyramid format, to appropriately extract the MHI motions of different velocities. An image pyramid is constructed by recursively low-pass filtering and sub-sampling an image until the desired spatial reduction is reached. This permits us to use fixed-sized gradient masks at each pyramid level to calculate motions of different speeds. To create the corresponding MHI pyramid, each level of the Φ pyramid is used to update an MHI of that particular resolution. The algorithm to segment motion regions is then as follows:
Some samples (#8 to #13) of erratic motion detection are shown in Fig. 6. This method is simple and affords completely real-time performance. It is noteworthy that the local motion vector (

Fig. 6. Erratic motion detection. Top row: the frames of a frame-skipping video. Bottom row: MHIs and the local motion vectors (the lines in the red circles indicate the motion orientations) corresponding to the frames in the top row.
4.2 The Integration of Motion Detection and Tracking
As analysed in Section 3, when target motion becomes erratic and unpredictable under frame-skipping conditions, the standard particle filter is not applicable; for example, sampling the next state from a Gaussian window around the current state is no longer a good choice. Therefore, based on the local motion vector mentioned above, we redefine the transition model described in (1) as follows:
Given a state
where Drift(·) describes that the state
Theoretically, a good implementation of the transition model should take into account previous states for velocity and acceleration information. In this paper, we use an erratic motion detected, second-order, auto-regressive dynamical model to predict the next state based on the previous two plus the Gaussian noise. Let
Where

Fig. 7. The flowchart of the new prediction and measurement process in our transition model.
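To make the idea of a motion-guided transition concrete, the sketch below shifts every particle by a detected local motion vector before adding Gaussian noise. This is only one plausible reading of a drift-based transition and should not be taken as the exact definition of Drift(·) used in this paper. The function name `propagate`, the 2-D position state and the numbers in the example are all assumptions.

```python
import random


def propagate(particles, motion_vector, sigma, rng):
    """Move (x, y) particles by the detected motion vector, then add Gaussian noise.

    motion_vector is (dx, dy) in pixels, or None when no erratic motion was
    detected, in which case this reduces to a plain Gaussian window.
    """
    dx, dy = motion_vector if motion_vector is not None else (0.0, 0.0)
    return [
        (x + dx + rng.gauss(0.0, sigma), y + dy + rng.gauss(0.0, sigma))
        for x, y in particles
    ]


# Hypothetical situation: the target jumps 40 pixels to the right between two
# frames because several frames were skipped.
rng = random.Random(1)
particles = [(0.0, 0.0)] * 100
blind = propagate(particles, None, sigma=3.0, rng=rng)
guided = propagate(particles, (40.0, 0.0), sigma=3.0, rng=rng)
mean_blind_x = sum(x for x, _ in blind) / len(blind)
mean_guided_x = sum(x for x, _ in guided) / len(guided)
```

With the same small noise level, the unguided particles stay around the old position while the guided particles are centred on the new one, so far fewer particles are needed to cover the target.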
5. Experimental Results
In this section we demonstrate the benefits of the proposed method. We present the experimental setup in Section 5.1, and the metrics for evaluating our method are summarized in Section 5.2. Following these metrics, we analyse the efficiency of the sampling process in our method in Section 5.3 and make quantitative comparisons of object tracking in Section 5.4. Finally, a discussion of our method is given in Section 5.5.
5.1 Experimental Setup
In our experiments, the improved particle filter is implemented in C++ on a PC with a Pentium IV 3.0 GHz CPU. Some test videos are taken using wireless consumer cameras (D-Link DCS-5300G, 1/4 inch colour CCD, using the 802.11g wireless technology, Fig. 1). The video resolution of these cameras is fixed at 704×576, whereas their frame rates go down to 6–10 fps.
The parameters of the nonlinear function in (15) correspond to a priori knowledge about object movements, e.g., the previous two states and the randomness of the movement. These parameters essentially exploit low-level information about the movement characteristics. In our experiments, we set them empirically, i.e., κ1 = 2.0, κ2 = -1.0, κ3 = 1.0. Moreover, we abbreviate the standard particle filter as PF in the experiments. The number of particles adopted in different cases ranges from 50 to 1500.
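For intuition about these values, the sketch below shows a plain second-order auto-regressive prediction in which κ1 and κ2 weight the previous two states and κ3 scales the Gaussian noise. This assignment of the three parameters is an assumption made for illustration; the motion-detection term of the model is left out, and the function name `ar2_predict` is hypothetical.

```python
import random

KAPPA1, KAPPA2, KAPPA3 = 2.0, -1.0, 1.0


def ar2_predict(x_prev, x_prev2, sigma, rng):
    """Predict the next 1-D state from the previous two states plus Gaussian noise."""
    noise = rng.gauss(0.0, sigma)
    return KAPPA1 * x_prev + KAPPA2 * x_prev2 + KAPPA3 * noise


rng = random.Random(0)
# With zero noise, a target that moved from 10 to 12 is extrapolated to 14.
noiseless = ar2_predict(12.0, 10.0, sigma=0.0, rng=rng)
# With noise, the prediction scatters around the same extrapolated position.
noisy = [ar2_predict(12.0, 10.0, sigma=1.0, rng=rng) for _ in range(500)]
mean_noisy = sum(noisy) / len(noisy)
```

Under this reading, κ1 = 2.0 and κ2 = -1.0 amount to a constant-velocity extrapolation, since 2x(k-1) - x(k-2) equals x(k-1) plus the last displacement.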
5.2 Metrics for Evaluation
To verify the effectiveness of the proposed method, we carry out a systematic, objective evaluation chiefly via the following metrics [14]: the effective sample size (ESS) [21] of the sampling process, the tracking error, and the computational cost at similar tracking performance.
ESS analysis in the sampling process. To analyse sampling efficiency, we compute the ESS of importance sampling. Over the same set of tracking sessions, ESS measures the uniformity of the particle weights and is defined by ESS = 1/
Tracking error. We compare the effectiveness of the tracking methods in two aspects: position error in pixels and position error in size. Position error in pixels is the drift of the target location measured in pixels, while position error in size reflects the scale inaccuracy of the target during tracking.
Computational cost with similar tracking performance. The cost of our tracker can be divided into two parts:
1) A cost shared by the prediction and update steps of the particle filter, which mainly depends on Ns, namely the number of particles calculated for each observer.
2) The calculation of the local motion vector and the basic image processing involved in that calculation.
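The first two metrics can be illustrated with a short sketch. It assumes the commonly used ESS estimate, the reciprocal of the sum of squared normalised weights, and takes the position error in pixels to be the Euclidean distance between the estimated and the reference target centres. Both are assumptions about the exact definitions, and the function names are hypothetical.

```python
import math


def effective_sample_size(weights):
    """ESS estimate: 1 / sum of squared normalised weights."""
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("weights must have a positive sum")
    normalised = [w / total for w in weights]
    return 1.0 / sum(w * w for w in normalised)


def position_error(estimate, reference):
    """Euclidean distance in pixels between two (x, y) target centres."""
    return math.hypot(estimate[0] - reference[0], estimate[1] - reference[1])


uniform_ess = effective_sample_size([0.25, 0.25, 0.25, 0.25])    # all particles useful
degenerate_ess = effective_sample_size([1.0, 0.0, 0.0, 0.0])     # one particle dominates
error = position_error((103.0, 54.0), (100.0, 50.0))
```

The ESS ranges from 1, when a single particle carries all the weight, to the number of particles, when the weights are uniform; a higher value indicates that fewer particles are wasted.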
5.3 Efficiency of the Sampling Process
In comparison with PF (with different particle numbers), a quantitative analysis of sampling efficiency is carried out on a test sequence (CAVIAR.avi) from the CAVIAR datasets (free and open). CAVIAR.avi has been randomly down-sampled to 4–6 fps. Fig. 8 (a) shows the quantitative analysis of sampling efficiency as ESS curves.

Fig. 8. (a) ESS curve on CAVIAR.avi. (b) Position error curve on CAVIAR.avi. The green windows indicate the person positions in each current frame, while those in yellow denote the person positions in each previous frame.
We can simply describe ESS as that
5.4 Quantitative Comparison
To validate the sampling effectiveness of the proposed method, quantitative comparisons of tracking are done via tracking errors in two aspects.
In comparison with PF with different numbers of particles, a quantitative comparison of position error in pixels is carried out on the test sequence (CAVIAR.avi). Fig. 8 (b) illustrates the sampling effectiveness: the error of our method remains low even with the smallest particle number, and the tracking error curves also show that enlarging the particle number of PF can compensate for its poor accuracy under frame-skipping conditions.
A comparison of position error in size is carried out among PF, mean shift using a colour histogram [18] and our method on three test sequences: CAVIAR.avi, lab.avi and football.avi. lab.avi (Fig. 5) was taken by a wireless consumer camera just outside our lab (3–4 fps). football.avi (Fig. 9 (a)) records a player on a pitch (2–4 fps). All videos are randomly down-sampled to the corresponding frame rate. As the curves in Fig. 10 show, our method achieves higher accuracy than the others under the frame-skipping condition.

Fig. 9. (a) Results of tracking a player on football.avi (2–4 fps) with the motion of both the player and the camera. (b) Vehicle tracking in a real-life vehicle counting system (6–10 fps) using a wireless consumer camera. (c) Unsuccessful tracking when the camera moves too quickly (lab2.avi, 3–4 fps).

Fig. 10. Tracking error curve on three videos.
In addition, in our tracker: 1) the cost of the prediction and update stages of the particle filter mainly depends on Ns. 2) For the calculation of the local motion vector, a hierarchical MHI mechanism is adopted to handle different velocities of moving targets efficiently, and a pyramid of images is built so that, in each localized search space, an image is recursively low-pass filtered and sub-sampled until the desired spatial reduction is reached. Our experimental results show that the cost of calculating the local motion vector, including the basic image processing involved, is roughly comparable to that of PF with 50 particles.
Therefore, the computational cost is compared in terms of the average number of particles calculated by each observer per frame on three test sequences: CAVIAR.avi, lab.avi and football.avi. The result is shown in Fig. 11. To reach tracking performance similar to that of our approach, PF has to calculate many more particles.

Fig. 11. Comparison of the computational costs: average number of particles calculated by each observer per frame on different test sequences.
5.5 Discussion
Fig. 9 (a) shows a player engaged on a pitch. Challenges involved in this video sequence are common in real-life cases, e.g., the camera is moving and shaking when following the player's movements, the player's motion itself is unstable, with zoom in and zoom out, and the pose changes when he stands up to kick. Our method tracks the player successfully.
We also validate the proposed method in a real-life vehicle counting system. Fig. 9 (b) shows that our method correctly tracks a car, even though the video stream acquired by a wireless consumer camera is at only 6–10 fps.
However, the discriminative power of our method decreases when cameras move too quickly, because quick camera motion causes too many false motion alarms. Fig. 9 (c) shows the movement of a student (lab2.avi was also taken by a wireless consumer camera just outside our lab at 3–4 fps). In contrast with the two earlier cases, the great challenge in this video is that the camera moves too quickly, and our method fails to track the student. Another main limitation of our method is multi-object tracking. Tracking multiple interacting objects is itself a much more challenging problem, and multi-object tracking in frame-skipping videos is left as future work.
6. Conclusion
The key to successful tracking lies in the effective extraction of useful information about the target's state from observations. A good transition model of the target helps this to a great extent. Generally one can say without exaggeration that a good model is worth a thousand pieces of data [22]. This paper has introduced a redefined transition model for frame-skipping tracking without any restrictions on a priori offline training, image quality, or objects' shapes and speed. Our contribution improves the standard particle filter for tracking under frame-skipping conditions. Compared with the state-of-the-art solutions in the literature, it achieves higher tracking accuracy in our experiments, even when both the object and the camera are moving. In our future work, we will study how to increase the discriminative power for object tracking when cameras move quickly. Multi-object tracking in frame-skipping videos is another issue to be addressed in our future studies.
Footnotes
7. Acknowledgments
This work is supported by the National Natural Science Foundation of China under Grant No. 60903072; National Key Technology Research and Development Program of the Ministry of Science and Technology of China under Grant No. SQ2013SF13E00579.
Anlong Ming is with the Beijing Key Lab of Intelligent Telecomm. Software and Multimedia, Beijing University of Posts and Telecomm., Beijing, 100876, China (e-mails:
Huadong Ma is with Beijing Key Lab of Intelligent Telecomm. Software and Multimedia, Beijing University of Posts and Telecomm., Beijing, 100876, China (e-mails:
Charles X. Ling is with the Department of Computer Science at The University of Western Ontario, London, Ontario N6A 5B7, Canada; (e-mails:
