Abstract
Traditional tracking-by-detection methods use an online classifier to track the object, and the classifier degrades easily under self-learning. This article presents a multiple instance learning (MIL) tracking method based on a semi-supervised learning model with Fisher linear discriminant (MILFLD). First, the overlap rate between the sampled instances and the tracked object serves as prior information. By using both labeled and unlabeled data, the tracking drift problem in the learning model is alleviated. Second, the loss function of MILFLD is built using the Fisher linear discriminant model incorporated with priors, so the optimal classifier can be selected directly at the instance level. Finally, the classifiers are chosen by the gradient descent method, which ensures the maximum descent of the loss function. As a result, the classifiers selected in previous frames remain discriminative for future frames, which helps to constrain error propagation. Comparison experiments show that the center location errors of online AdaBoosting (OAB), online MIL tracking (MILTrack), weighted MIL tracking (WMIL), compressive tracking (CT), Struck tracking, and MILFLD are 78, 66, 62, 74, 59, and 25 pixels, respectively, which demonstrates the tracking accuracy of our method. Robot motion tracking experiments in a realistic scenario were also conducted for comparison. Despite variations in illumination, deformation, or occlusion of the objects, the proposed method tracks the target accurately and achieves high real-time performance.
Introduction
Visual tracking 1–3 has been employed extensively in many practical applications, for example, robot navigation, video surveillance, and satellite measurement, and the “tracking-by-detection” method has become a focus of the research literature. By building an appearance model of the object and the background, the position of the target of interest is estimated. In general, tracking methods can be divided into generative and discriminative models according to the appearance model they use.
A generative model describes the whole distribution of the data from a statistical perspective to represent the specific object, and searches each frame for the image region with the maximum resemblance to the object. To cope with illumination variation and deformation, Jepson et al. 4 proposed a tracking method using the WSL combination, which involves three components: the wandering, the stable, and the “lost” components. Based on online estimation, Han et al. 5 proposed to model the pixels around the target with a Gaussian mixture. Similarly, Ross et al. 6 constructed a subspace by online learning. The appearance model was updated using an incremental eigenbasis, which improved the subspace accuracy. The method has shown good results under large illumination, pose, and scale variations. Yang et al. 7 presented an adaptive subspace appearance model with three data-driven constraints: the negative data, the bottom-up pairwise data, and the adaptation dynamics. A generative model uses a compact appearance model to describe the object changes. However, it discards the surrounding context of the object, which could help to separate the object from the background more precisely.
In contrast to the generative model, the discriminative model treats visual tracking as a binary classification task and locates the object with an optimal decision surface. In this model, a discriminative classifier separates the target from the background. Online AdaBoosting (OAB) 8 searches for the object by training an online classifier with positive and negative samples. Because only one positive instance is sampled in the observation set, the tracking accuracy is sensitive to incorrect samples. Once an error is introduced, it can cause a tracking error, and such errors may accumulate over time and lead to tracking failure. Zhang et al. 9 proposed a long-term compressive tracking algorithm with adaptive scale features for complex environments. A naive Bayes classifier makes decisions based on low-dimensional features and updates its parameters through online learning. The image region with the maximal classification score is selected as the new tracking result; hence, the tracking location and scale are updated adaptively. Recently, Babenko et al. proposed a tracking method called MILTrack (also known as online MIL tracking) 10 based on multiple instance learning (MIL) theory. In this method, multiple instances are grouped under a single bag label, which can eliminate the ambiguity of incorrect samples. Given the initial position of the tracked object and using ensemble learning to construct a strong classifier, Luo et al. 11 proposed a mean shift tracking method integrated with MIL for long-term tracking. Similarly, according to the contribution of each sample to the learning, WMIL 12 weights the bag probability on the basis of the distance between the samples and the target, thereby improving the robustness of the bag model in MILTrack. All of the above methods treat visual tracking as a classification problem, in which the task of the tracker is to determine the decision boundary between the target and its background.
Algorithms based on the discriminative model work well in complex environments. However, because the classifiers are trained on preceding frames, the target is hard to recapture in future frames once drift occurs in the sequence.
In video tracking, unlabeled data are relatively inexpensive to acquire, so the classifier can exploit them in addition to the labeled samples to improve learning performance. Grabner et al. proposed a tracking method based on semi-supervised learning 13; this method assumes that only the label of the initial frame is known, and the initial labeled data are used to train an off-line detector. The trained detector then provides prior information about the training samples during tracking. Based on the same semi-supervised learning model, Grabner 14 adds an extra step to recognize similar samples, so that the three parts of detecting, recognizing, and tracking are well coordinated. These methods exploit only the data of the initial frame; because the scale and shape of the target change continuously, they become ineffective. The data can be exploited further with the help of an adaptive data structure. Recently, an MIL method based on co-training was proposed that uses the predicted results to update the classifier itself. 15 In long-term tracking, the error introduced by noise accumulates and eventually causes tracking drift. With unlabeled data trained from different views, Chen et al. 16 improved the original co-training framework by continuously updating the object appearance. Using a semi-supervised learning model called P-N learning, Tracking Learning Detection (TLD) 17,18 searches the data structure in the temporal and spatial dimensions and corrects errors with an efficient detector. Similarly, Liu et al. 19 proposed to search the error upper bound of boosting to update the appearance model of the co-training framework. Based on data fusion, Zhong et al. 20 proposed a novel weakly supervised tracking method to cope with various complicated environments.
Based on online support vector machines (SVM), Struck 21 uses the overlap information as the prior information of the classifier, which unifies the training of the classifier and the labeling of the samples.
To deal with the degeneration of the classifier in the self-learning process, this article presents an MIL tracking method based on a semi-supervised learning model 22 with prior information. The main contributions of this article are as follows. First, the bag model of MIL is constructed with a semi-supervised learning model, and the prior information given by the overlap rate between the sampled instances and the current object is exploited to prevent the classifier from drifting. Second, the Fisher linear discriminant (FLD) model is employed to simplify the construction of the MIL bag model; the optimal classifier is selected at the instance level, which reduces the computational complexity. Third, the selection of classifiers is optimized from the perspective of gradient descent: the weak classifiers are chosen so as to ensure the maximum descent of the loss function, which effectively suppresses the error propagation arising from the selection process and ensures that the strong classifier trained on the current frame remains discriminative for the next frame. We evaluated our algorithm on challenging video sequences, and the results verified its robustness and real-time performance. We then present empirical results obtained by applying the MIL tracking method based on a semi-supervised learning model with FLD (MILFLD) to the MT-AR robot tracking system, which show that the proposed method is robust when tracking moving objects under occlusion.
The rest of this article is organized as follows. The “approach overview” section briefly introduces the MILFLD algorithm. The “MIL with prior information” section then gives a detailed description of MIL with prior information. The “experiments” section demonstrates the advantages of MILFLD by comparing the proposed method with existing tracking methods on several benchmark data sets and by running tracking experiments with the MT-AR research robot in a real environment. Finally, the “conclusion” section concludes the article.
Approach overview
In this article, Haar-like features 23 are used, and each image patch is represented by a feature vector. Different rectangles with random weights inside the feature vector are grouped together to determine the response of each sample. Following compressive tracking (CT), only the nonzero elements of the Haar-like features need to be stored, and the sparse random matrix is initialized off-line only once. 24 The various combinations of Haar-like features contribute differently to the response of each sample and establish the basic structure of the classifiers in this article. The different rectangles within the Haar-like features form the basis of the weak classifiers, and the various structures of Haar-like features maintain a pool of diverse weak classifiers. Since the history of the samples can be exploited and the response of each sample is already known, a Bayesian learning model based on the Gaussian distribution can be constructed.
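As an illustration of how such a feature could be evaluated, the following minimal sketch computes the response of a Haar-like feature made of 2 to 6 random rectangles using an integral image. The function names, the choice of weights from {-1, +1}, and the uniform rectangle sampling are assumptions made for this example and are not taken from the article.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero row and column prepended."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of pixels in the rectangle with top-left (x, y), width w, height h."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def random_haar_feature(patch_w, patch_h, rng, min_rects=2, max_rects=6):
    """Draw 2 to 6 random rectangles inside a patch (patch sides >= 2).

    Weights are drawn from {-1, +1}; this is an assumption of the sketch.
    """
    n = int(rng.integers(min_rects, max_rects + 1))
    rects = []
    for _ in range(n):
        x = int(rng.integers(0, patch_w - 1))
        y = int(rng.integers(0, patch_h - 1))
        w = int(rng.integers(1, patch_w - x + 1))
        h = int(rng.integers(1, patch_h - y + 1))
        weight = float(rng.choice([-1.0, 1.0]))
        rects.append((x, y, w, h, weight))
    return rects

def feature_response(ii, rects, ox=0, oy=0):
    """Weighted sum of rectangle sums for a patch whose top-left is (ox, oy)."""
    return sum(wt * rect_sum(ii, ox + x, oy + y, w, h)
               for x, y, w, h, wt in rects)
```

Because the integral image is computed once per frame, each rectangle costs four lookups regardless of its size, which is what makes a large pool of such features affordable.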
This article proposes an MIL tracking method with prior information based on the FLD model, and the tracking framework of our method is built on MIL. In traditional MIL tracking, 10 the instances are sampled in groups (bags). The label of each individual sample is assumed to be unknown beforehand, the discriminative model is constructed per bag, and the probability of a bag is determined by the whole distribution of its samples. In tracking with an online classifier, however, the label of each instance is known in advance. Therefore, the discriminative model can be built from the labeled samples at the instance level, which makes its construction more flexible. For the binary classification in tracking, this article chooses the FLD classifier and employs gradient descent to find the extrema of the discriminative function. The selection of classifiers in MIL is thereby transformed into the error propagation of the discriminative model. The selected weak classifiers generalize well, so the classifiers selected in previous frames remain discriminative for future frames.
The basic flow of our tracking method is shown in Figure 1. First, the sample sets of the positive and negative bags are densely sampled around the position of the tracked object in frame t. Then, the prior classifier is used to obtain the label of each sample, so that the labels of all samples in the positive bag are redefined, and the samples with pseudo labels are used to construct and update the classifiers. Finally, samples are drawn around the target in frame t + 1, and the confidence of each sample is obtained with the previously trained classifier, which updates the tracked position of the target. In frame t, the positive and negative samples are first sampled around the tracked object. Suppose the position of target at current frame is

Figure 1. The basic flow of the proposed MILFLD for tracking. MILFLD: multiple instance learning tracking method based on a semi-supervised learning model with Fisher linear discriminant.
MIL with prior information
Constructing the appearance model of MIL requires two classifiers to be designed: a prior classifier and an online classifier. The prior classifier assigns a prior pseudo label and a pseudo weight to the instances in the positive bag; these are used to obtain the error of each weak classifier in MIL and provide the theoretical basis for constructing the bag model at the instance level. The online classifier is built on MIL; this article simplifies traditional MIL tracking with the FLD model and treats the selection of weak classifiers from the perspective of error propagation.
Prior information of sample overlap
In the traditional MIL tracking method, the importance of each sample depends only on the sample response from the online classifiers. Because the distance of the samples from the object is ignored, a sample that carries a large weight but lies far from the object position in the previous frame may be wrongly taken as the current object. By evaluating the prior label and weight with the prior classifier in the semi-supervised learning model, the influence of such noisy samples can be eliminated.
In general, each sample contributes differently to the whole distribution: samples located near the object position are more similar to the object and contribute more to the selection of weak classifiers. It is therefore necessary to treat the samples in a bag differently. In this article, the prior information is built from the overlap rate between the sampled instances and the current object; that is, when computing the probability of the samples in the positive bag, the overlap rate between the sampled instances and the previous object must be known in advance. Suppose that the position of object
where area denotes the area of the overlapping region.
After the positive samples have been drawn, each sample is assigned a prior value
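To make the notion of overlap rate concrete, the sketch below computes the overlap between two axis-aligned boxes. It assumes the common intersection-over-union definition and an (x, y, width, height) box layout; both are assumptions of this example, since the exact expression used in the article is not reproduced here.

```python
def overlap_rate(box_a, box_b):
    """Overlap of two boxes given as (x, y, width, height).

    Assumed definition: intersection area divided by union area.
    """
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the intersection, clamped at zero when disjoint.
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```

With this definition, a sample that coincides with the object scores 1, and a sample that does not touch it scores 0, so the value can be used directly as a measure of how strongly a sample should be trusted.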
where σ denotes the sigmoid function
As the samples in the negative bag are rather far from the object, the influence of their overlap rate can be neglected, and the probability of each sampled instance can be written as
where φ is the scale factor that adjusts the probability within the whole distribution; its value is 1.8 in the experiments.
After obtaining the probability of samples, the pseudo label
In this article, a sample whose probability exceeds the threshold θ is labeled positive, and a sample whose probability does not is labeled negative. The more confident the prior information of a sample, the less likely its label is to change, so the threshold θ is set to 0.5. When the classifier is updated, samples whose pseudo label is inconsistent with the bag label are removed from the bag; this decreases the risk of tracking drift caused by noisy samples.
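The thresholding and filtering step described above can be sketched as follows. The sample probabilities are taken as given inputs, the function names are hypothetical, and the handling of a probability exactly equal to θ (treated as negative here) is an assumption of the sketch.

```python
def pseudo_labels(probabilities, theta=0.5):
    """Assign pseudo label 1 when the prior probability exceeds theta, else 0."""
    return [1 if p > theta else 0 for p in probabilities]

def filter_bag(samples, probabilities, bag_label, theta=0.5):
    """Keep only the samples whose pseudo label agrees with the bag label."""
    labels = pseudo_labels(probabilities, theta)
    return [s for s, y in zip(samples, labels) if y == bag_label]
```

For example, with probabilities 0.9, 0.5, and 0.2 in a positive bag, only the first sample would be kept for the classifier update.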
Finally, each instance in the sample set is assigned an importance weight
In importance weight
Construction of online classifier
In traditional MIL tracking, the appearance model is constructed with the Noisy-OR (NOR) discriminative model, 10 which gives a likelihood estimate based on the response of the samples. In this kind of model, the label of a bag is positive as long as the bag contains one positive instance. When the sampled instances are trained, both the response of the training examples and the pseudo label of the samples with overlap information can be obtained easily from the respective classifiers, so the discriminative model of the bag and its boundary characteristics can be analyzed at the instance level. Based on this instance-level feature selection rule, this article presents a novel MIL tracking method in which the loss function of the bag model is constructed with the FLD model and the optimal weak classifier is chosen by the gradient descent method.
FLD model
In the sampling process, two sets of samples with different labels are obtained. As the label of each set has been determined, the boundary characteristics can be analyzed with a linear discriminative model. The FLD model has been used as a measure for selecting unlabeled samples in active learning. 25 In this article, the FLD model 26,27,28 is used to construct the loss function of the bag model. For binary classification, the FLD is able to separate the two sample sets. The loss function is chosen to achieve the maximum bag margin, so that the selected classifier has better discriminative quality. With the FLD model, the loss function of the bag model can be built as
where fk denotes the feature representation of the image. And
The positive and negative bags have the same cardinality: both bags contain
where
The procedure of features selection maintains a pool consisting of
where Hk−1 is the strong classifier built with first (
The classifiers are selected sequentially. When new examples arrive, all weak classifiers are first updated in parallel. In each iteration, the optimal weak classifier is added to the strong classifier obtained in the previous iteration to form the current strong classifier.
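The sequential selection described above can be illustrated with a greedy loop. This is only a sketch under stated assumptions: a plain two-class Fisher criterion (squared difference of class means over the sum of class variances) stands in for the article's loss function, the overlap priors are omitted, and the names `fisher_score` and `greedy_select` are hypothetical.

```python
import numpy as np

def fisher_score(pos, neg, eps=1e-12):
    """Between-class separation over within-class scatter for 1-D responses."""
    return (pos.mean() - neg.mean()) ** 2 / (pos.var() + neg.var() + eps)

def greedy_select(pos_resp, neg_resp, k):
    """Pick k of M weak classifiers one at a time.

    pos_resp, neg_resp: arrays of shape (n_samples, M) with weak responses.
    At each step the candidate that maximizes the Fisher score of
    (previous strong response + candidate response) is added.
    """
    m = pos_resp.shape[1]
    k = min(k, m)
    chosen = []
    h_pos = np.zeros(pos_resp.shape[0])
    h_neg = np.zeros(neg_resp.shape[0])
    for _ in range(k):
        best, best_score = None, -np.inf
        for j in range(m):
            if j in chosen:
                continue
            score = fisher_score(h_pos + pos_resp[:, j], h_neg + neg_resp[:, j])
            if score > best_score:
                best, best_score = j, score
        chosen.append(best)
        h_pos += pos_resp[:, best]
        h_neg += neg_resp[:, best]
    return chosen
```

Because every candidate is scored together with the strong classifier built so far, a weak classifier is chosen for what it adds to the current ensemble rather than for its stand-alone quality.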
As mentioned before, the overlap rate of the samples is known in advance in each sampling process. In the construction of the bag model, the loss function can therefore incorporate the overlap prior information according to semi-supervised theory. The loss function can be further formulated as
where the prior information
Gradient descent model
Similar to AnyBoost, 29 a more efficient selection strategy based on the gradient descent method 30 is proposed in this article. In this method, each time a new classifier is incorporated, the loss function over all samples decreases by the maximum amount. With the combination of weak classifier h, loss function
where the optimal weak classifier
where
Since the influence of the error in the classifier selection process has been considered, the vote weight of each weak classifier should also be included in the construction of the strong classifier, and the strong classifier combined with the vote weights can be rewritten as
Each time a weak classifier is updated, the vote weight αn must be recomputed. Computing αn involves the error rate of the samples, so both the prior and the posterior probabilities of the samples must be known. However, for the bag model constructed with the Fisher discriminant model described above, the strong classifier is a simple linear accumulation
By employing the semi-supervised technique, 31 the vote weight αn is obtained in a way similar to online boosting. Each time a classifier is obtained, its classification error is defined as
where wn,c and wn,w accumulate the weights of the samples whose prior and posterior labels are consistent and inconsistent, respectively. Recall from the computation of the overlap rate in the positive bag that the importance weight wn of each sample is obtained at the very beginning. When a sample is classified correctly, its importance weight wn is added to wn,c; otherwise, wn is added to wn,w. The computation of the voting weight then becomes very simple, in the selection process of
As the voting weight is introduced into the strong classifier, the strong classifier is transformed into
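The weighting scheme of this section can be sketched as follows. The error is taken as the share of importance weight on misclassified samples, and the vote weight uses the familiar boosting rule 0.5·ln((1 − e)/e); this rule is an assumption borrowed from standard online boosting, since the article's own expressions are not reproduced here, and the function name is hypothetical.

```python
import math

def vote_weight(importance, predicted, pseudo, eps=1e-10):
    """Return (error, alpha) for one weak classifier.

    importance: positive per-sample importance weights w_n
    predicted:  labels output by the weak classifier (posterior labels)
    pseudo:     pseudo labels from the prior classifier (prior labels)
    """
    # Accumulate the weight of consistent (w_c) and inconsistent (w_w) samples.
    w_c = sum(w for w, p, y in zip(importance, predicted, pseudo) if p == y)
    w_w = sum(w for w, p, y in zip(importance, predicted, pseudo) if p != y)
    err = w_w / (w_c + w_w)
    # Clamp to avoid log(0) or division by zero for perfect or useless classifiers.
    err = min(max(err, eps), 1.0 - eps)
    alpha = 0.5 * math.log((1.0 - err) / err)
    return err, alpha
```

Under this rule a classifier with an error of 0.5 receives a vote weight of zero, and the vote weight grows as the weighted error falls.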
Basic flow of tracking method
The basic flow of the proposed MIL tracking method based on FLD is summarized in Figure 2.

Figure 2. The flowchart of the proposed MILFLD. MILFLD: multiple instance learning tracking method based on a semi-supervised learning model with Fisher linear discriminant.
Experiments
To verify the robustness and stability of our tracking method, quantitative experiments under different scenes were performed. Six “tracking-by-detection” methods are compared: OAB, MILTrack, weighted MIL tracking (WMIL), compressive tracking (CT), Struck, and the proposed MILFLD. The parameters of each method were tuned to make the comparison fair. Table 1 lists the prior information, the likelihood function of the bag model, and the feature selection strategy of the six methods. Real-world environments exhibit illumination variation, occlusion, fast motion, motion blur, dynamic background, scale variation, and so on. This article therefore uses 10 distinct video sequences to compare the six methods.
Table 1. Comparison of six methods.
OAB: online AdaBoosting; MIL: multiple instance learning; MILTrack: online MIL tracking; WMIL: weighted MIL tracking; CT: compressive tracking; MILFLD: MIL tracking method based on a semi-supervised learning model with Fisher linear discriminant.
Parameter setting
In the initialization of feature selection, the weak classifier model of all tracking methods except Struck is represented as a combination of different rectangles in a Haar-like feature. The number of rectangles is a random number between 2 and 6, and the learning rate of the classifiers is denoted by η. A smaller learning rate lets the tracker adapt quickly to fast appearance changes, whereas a larger learning rate reduces the likelihood that the tracker drifts off the object. The learning rate is therefore set to η = 0.85. M should be as large as possible to obtain enough candidate features and make the selected features more discriminative, but this must be balanced against the increased computational time. The pool of candidate weak classifiers is maintained at a size of M = 250, and K = 50 weak classifiers are selected from the pool to build the strong classifier. The classifier of Struck is built on a structured output SVM framework, in which the overlap rate information acts as the restriction kernel of the SVM. Except for OAB and Struck, all tracking methods use a sampling radius of four pixels for cropping positive image patches in each frame, which generates nearly 45 samples. In contrast, OAB uses a single positive sample; its performance was observed to degrade with more samples. For all tracking methods except Struck, the negative samples are drawn from an annular region with radius ranging from 9 to 37.5 pixels, which yields 45 randomly selected samples. For Struck, a polar grid with 5 radial and 16 angular divisions is used, which gives 81 instance locations. The motion models of all six methods follow the model in MILTrack. To account for randomness, all experiments were repeated 10 times on each sequence, and the averaged results were used for comparison.
Finally, all six methods were implemented in VS2012 on a workstation with an Intel i5 3.2 GHz processor and 4 GB of memory.
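To make the role of the learning rate η concrete, the sketch below updates the Gaussian parameters of one weak classifier as a running average. It assumes the convention new = η·old + (1 − η)·observed, which matches the statement that a smaller η adapts faster; the variance term is the variance of the corresponding two-component mixture. The function name and the exact form of the update are assumptions of this example.

```python
def update_gaussian(mu, var, samples, eta=0.85):
    """Blend stored Gaussian parameters with those of the new samples.

    mu, var: stored mean and variance of a feature response
    samples: feature responses observed in the current frame
    eta:     learning rate; larger values keep more of the old model
    """
    n = len(samples)
    mu_new = sum(samples) / n
    var_new = sum((s - mu_new) ** 2 for s in samples) / n
    mu_upd = eta * mu + (1.0 - eta) * mu_new
    # Variance of the eta / (1 - eta) mixture of the old and new Gaussians.
    var_upd = (eta * var + (1.0 - eta) * var_new
               + eta * (1.0 - eta) * (mu - mu_new) ** 2)
    return mu_upd, var_upd
```

With η = 0.85, a stored mean of 0 moves only 15% of the way toward a new observed mean in one frame, whereas η = 0.5 would move it halfway, which illustrates the trade-off between fast adaptation and resistance to drift.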
Simulation experiment result and analysis
To cover a variety of challenging scenes, the trackers are evaluated on 10 sequences: Tiger1, Boy, Fish, Lemming, Bolt, David, FleetFace, Carscale, Deer, and Pedestrian. All sequences except Pedestrian are publicly available with ground truth on the websites. 32,33 The goal of the multiple experiments is to demonstrate the robustness and the computational complexity of MILFLD under diverse challenging factors. The challenging factors contained in each sequence are listed in Table 2: illumination variation, occlusion, fast motion, motion blur, rotation, background clutter, deformation, dynamic background, and scale variation. Figure 3 shows screen captures of the tracking results on the 10 sequences. These screenshots are analyzed in more detail below.
Table 2. Overview of tested sequences.

Figure 3. Screenshots of tracking results (blue: OAB; green: MILTrack; black: WMIL; cyan: CT; yellow: Struck; magenta: MILFLD). OAB: online AdaBoosting; MIL: multiple instance learning; MILTrack: online MIL tracking; WMIL: weighted MIL tracking; CT: compressive tracking; MILFLD: MIL tracking method based on a semi-supervised learning model with Fisher linear discriminant.
Tiger1, Boy
These two sequences involve changes in object position, out-of-plane rotation, and motion blur. In the Tiger1 sequence, besides fast motion, the object also rotates and is occluded at times. OAB fails at frame 108, where the object is partially occluded, and WMIL, Struck, and CT fail at frame 242, where a similar object in the background interferes; all four methods lose the target. Between frames 192 and 218, MILTrack also loses the target when the target is suddenly occluded. Although it recovers the target at frame 220, it loses the target again after frame 242. Among the six tracking methods, our method performs second best on the Tiger1 sequence. In the Boy sequence, the tracked object moves all the time, so a unified model is very hard to establish. In the WMIL method, the distance between the object and the sampled instance is used to weight the classifier directly; overfitting occurs, which degrades the classifier after frame 500. Because Struck uses the overlap information as the prior probability, an adaptive SVM classifier is formed during training and overfitting is well handled, which makes Struck the best of the six methods. Our method performs second best on this sequence because the prior information places more weight on the positive samples when the classifiers are updated.
Fish, Lemming
These two image sequences pose challenging tracking problems in complicated conditions. In the Fish sequence, the shooting angle changes constantly, so the scale of the tracked object varies as well. The light reflected toward the camera also varies all the time, and reflections from different angles can blur the object. In this sequence, Struck performs the best among the six methods, followed by CT and our method. The Lemming sequence combines a complicated background with object deformation and motion blur. When severe occlusion appears at the 400th frame, only our method, Struck, and CT survive the occlusion. In our method, the discriminative model of the MIL bag is formed at the instance level using the FLD method with filtered samples, which significantly reduces the computational complexity of evaluating the bag model. The target can therefore be easily separated from the background in different complicated environments.
David, FleetFace
These two image sequences involve scale, illumination, and position variations. In the David sequence, the illumination varies during tracking, and because the person moves at the same time, the appearance of the tracked object changes as well. Some background interference appears between frames 198 and 219, accompanied by motion blur of the tracked object. All tracking methods except ours suffer from tracking drift. At frame 398, where the appearance of both the foreground and the background has changed, MILFLD, CT, WMIL, and Struck maintain relatively solid performance, while the other two methods fail after frame 398. In our method, the Haar-like features contribute substantially to handling scale variation, and the prior information helps to control the weights of the background samples in the classifier score, which makes our tracker better at separating the target from its background. The FleetFace sequence shows an object that deforms considerably while its scale varies. As the illumination in FleetFace varies less than in David, most of the tracking methods perform quite well. In the feature initialization process, the combination with compressive features efficiently reduces the dimension of the high-dimensional features. The features preserve the original multi-scale information of the image, so that both illumination and scale variations are well handled.
Carscale, Deer, Bolt
These three sequences involve tracking under fast motion, motion blur, and occlusion. On the Deer and Bolt sequences, our method also achieves high accuracy. In the Deer sequence, the moving target blurs so much that our method partly misses the object between frames 41 and 60, and our method ranks only fifth among the six tracking methods on this sequence. In the Carscale sequence, WMIL and our method produce better tracking results, probably because of the prior information used by these two methods. The Bolt sequence contains especially fast motion and swift changes, and our method scores best among the six compared methods. As the training speed of the classifier is critical in these conditions, our method, which forms the discriminative model of the MIL bag at the instance level using the Fisher method, is well suited to fast motion and swift changes.
Pedestrian
This sequence was shot outdoors by us and is used to test tracking under severe occlusion. The whole tracking process is subject to occlusion by branches. The object becomes occluded at frame 200; all tracking methods except CT and WMIL continue tracking to the end of the sequence. In the WMIL method, the distance between the object and the sampled instance is used to weight the classifier directly, and overfitting occurs during the fitting process. Our method constructs the overlap prior information using a semi-supervised model, so overfitting is well handled.
Figure 4 plots the center location error (CLE) curves of the six methods on the 10 sequences, and Table 3 gives the quantitative CLE results. The CLE is defined as the Euclidean distance between the tracked location and the true location of the target:
$$\mathrm{CLE} = \sqrt{(x_i - x_c)^2 + (y_i - y_c)^2}$$
where (xi, yi) is the output position of the tracker and (xc, yc) is the true center position of the target in the current frame. If the CLE is smaller than 20 pixels in a frame, the tracking result in that frame is considered a success. Comparison experiments under different scenarios show that the center location errors of OAB, MILTrack, WMIL, CT, Struck, and MILFLD are 78, 66, 62, 74, 59, and 25 pixels, respectively. In general, a smaller CLE means better tracking accuracy.
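The CLE and the 20-pixel success criterion described above can be computed as in the following minimal sketch. The function names and the list-of-tuples input format are assumptions of this example.

```python
import math

def center_location_error(tracked, truth):
    """Per-frame Euclidean distance between tracked and true centers.

    tracked, truth: sequences of (x, y) centers, one pair per frame.
    """
    return [math.hypot(xi - xc, yi - yc)
            for (xi, yi), (xc, yc) in zip(tracked, truth)]

def summarize(errors, threshold=20.0):
    """Return the mean CLE and the fraction of frames with CLE below threshold."""
    mean_cle = sum(errors) / len(errors)
    success = sum(e < threshold for e in errors) / len(errors)
    return mean_cle, success
```

For a three-frame sequence with errors of 5, 0, and 30 pixels, the mean CLE is about 11.7 pixels and two of the three frames count as successes.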

Figure 4. Comparison results of center location error with six methods (blue: OAB; green: MILTrack; black: WMIL; cyan: CT; yellow: Struck; magenta: MILFLD). OAB: online AdaBoosting; MIL: multiple instance learning; MILTrack: online MIL tracking; WMIL: weighted MIL tracking; CT: compressive tracking; MILFLD: MIL tracking method based on a semi-supervised learning model with Fisher linear discriminant.
Table 3. Center location error (pixel).
As can be seen from Figure 4, our method maintains consistent tracking under challenging conditions, including fast motion, illumination variation, and occlusion, whereas the Struck method performs well only in slow-motion scenarios. Because the discriminative model is built at the instance level using FLD and combined with a more efficient selection strategy, gradient descent, the strong classifier trained on the current frame remains discriminative for the next frame. The overlap prior on the sampled instances also helps to handle motion blur caused by fast motion. With compressive features projected from a high-dimensional space, tracking drift can be alleviated under illumination and scale variations.
In statistics, the variance measures how far each value of a variable lies from the population mean. In this article, the variance of the center point error is used to evaluate the stability of the algorithm: the smaller the variance, the more stable the algorithm. The variance of the center point error is shown in Table 4. A two-sample t test is used to analyze whether the difference between the proposed algorithm and each comparison algorithm is statistically significant; the test checks whether the means of two samples differ significantly. In this article, the statistic of the independent-samples t test is as follows
where
Variance of center point error.
Results of statistical significance analysis.
OAB: online AdaBoosting; MIL: multiple instance learning; MILTrack: online MIL tracking; WMIL: weighted MIL tracking; CT: compressive tracking.
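As a rough illustration of the statistical comparison described above, the following Python sketch computes the sample variance of per-frame center errors and an independent two-sample t statistic. The exact form of the statistic used in this article is not reproduced here, so the unequal-variance (Welch) form below is an assumption, as are the function name `welch_t` and the example data.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Independent two-sample t statistic (Welch form, an assumed choice).

    Each sample is a list of per-frame center errors for one tracker.
    """
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a), variance(sample_b)  # sample variances
    standard_error = math.sqrt(va / na + vb / nb)
    return (mean(sample_a) - mean(sample_b)) / standard_error


# Hypothetical per-frame errors for two trackers (not data from the article).
errors_a = [1, 2, 3, 4, 5]
errors_b = [2, 3, 4, 5, 6]
t_value = welch_t(errors_a, errors_b)
```

A large absolute value of the statistic, relative to the critical value for the chosen significance level, indicates that the difference in mean error between two trackers is unlikely to be due to chance.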
To find the optimal number M of candidate weak classifiers and the number K of weak classifiers used to construct the strong classifier, this article uses a grid search to find the best combination from two sets of candidate parameter values.
Results of FPS, center point error, and variance with different parameters.
FPS: frames per second.
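The grid search over M and K can be sketched as follows. This is a minimal illustration under stated assumptions: `evaluate` is a hypothetical callback that runs the tracker with a given (M, K) and returns a score where lower is better (for example, the mean center point error), and the candidate values in the usage example are placeholders rather than the sets used in this article.

```python
from itertools import product


def grid_search(m_values, k_values, evaluate):
    """Try every (M, K) pair and return (best_score, best_m, best_k)."""
    best = None
    for m, k in product(m_values, k_values):
        if k > m:
            # K classifiers are selected from a pool of M, so K cannot exceed M.
            continue
        score = evaluate(m, k)
        if best is None or score < best[0]:
            best = (score, m, k)
    return best


# Toy stand-in for a real tracking run; its minimum is at M=150, K=15.
def toy_score(m, k):
    return abs(m - 150) + abs(k - 15)


result = grid_search([50, 150, 250], [5, 15, 25], toy_score)
```

In practice the score would have to balance accuracy against speed, since a larger pool M or a larger K generally lowers the frame rate; how that trade-off is weighted is a design choice not specified here.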
Experiments of mobile robot tracking
Experiments on binocular mobile robot tracking were also carried out in a real environment. The experimental platform is an MT-AR research robot equipped with a Bumblebee2 binocular camera, as shown in Figure 5. The binocular vision system outputs the spatial position of the object during tracking, and this information is used as feedback to control the robot motion: moving forward and backward, and turning left and right. The image sequence read from the left view of the binocular camera is passed to MILFLD for tracking, and once the object is detected, its pixel position is used to control the left and right turning of the robot. While MILFLD trains classifiers at each frame of the video stream, the 3D coordinates of the object are obtained from disparity, and the depth information is used to control the robot motion.
To enhance the robustness of the mobile robot, bootstrap and retrieval modules are also included. When the robot starts, the bootstrap module models the initial tracking object using an off-line library of facial Haar-like features: the initialization program searches for a face in the starting frame and uses MILFLD to train on the object. For the retrieval strategy, two classifiers are instantiated in the initial frame. One is used to construct the tracker for the video sequence and to retrieve the tracked object, and its internal parameters are updated at each frame. The other is a detector, an off-line classifier used for detection. When the object is lost, the detector is restarted automatically to retrieve the object over the whole view of the current frame. A threshold is set to judge whether the target has been retrieved.

MT-AR mobile robot.
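The control loop described above, where pixel position drives turning and stereo depth drives forward and backward motion, can be sketched as below. The sketch assumes a rectified stereo pair, for which depth follows the standard relation Z = f * B / d. The function names, the dead-band values, and the target following distance are all hypothetical and are not taken from the MT-AR platform.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in meters from disparity for a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def motion_command(center_x, image_width, depth_m,
                   target_depth_m=1.5, x_deadband_px=40, depth_deadband_m=0.3):
    """Map the tracked center and depth to a (turn, drive) command pair."""
    # Horizontal offset of the target from the image center decides turning.
    offset = center_x - image_width / 2
    if offset > x_deadband_px:
        turn = "right"
    elif offset < -x_deadband_px:
        turn = "left"
    else:
        turn = "none"
    # Difference from the desired following distance decides driving.
    gap = depth_m - target_depth_m
    if gap > depth_deadband_m:
        drive = "forward"
    elif gap < -depth_deadband_m:
        drive = "backward"
    else:
        drive = "stop"
    return turn, drive
```

The dead bands keep the robot from oscillating when the target is already close to the image center or to the desired distance.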
Figures 6 to 8 show experimental results in different real-world scenes. Each figure includes two rows of image sequences: the upper row shows screenshots of the tracking window on the mobile robot, and the lower row shows the same scene recorded by a digital video (DV) camera. Figure 6 shows frames 232, 356, 357, 358, and 428 of a sequence chosen to demonstrate severe occlusion of the object. The object is occluded at frames 356, 357, and 358. Because the tracking method combines the overlap prior with the FLD model, when the object is fully or partially occluded, the overlap prior limits the weight given to the occluding object. Figure 7 shows frames 24, 113, 123, 179, and 412 of another sequence captured under nonuniform illumination. The overlap-rate prior contributes markedly to the classifiers and suppresses interference from noisy background samples. Figure 8 shows mobile robot tracking when the object disappears for a while: the tracked person hides behind a pillar and passes behind it from right to left. Frames 154, 222, 223, 229, and 233 are extracted from the video sequence. The object leaves the visible range at frame 201 and reappears at frame 225. Because the retrieval module is used on the robot platform, when the object is lost from view, the off-line classifier is restarted to search for the object over the whole image, as indicated by the black tracking window in Figure 8. A threshold is set to judge whether the object has been retrieved. When the detection response for the object is too small, tracking is considered to have failed and the tracking window is left at its position in the previous frame.

Mobile robot tracking with severe occlusions.

Mobile robot tracking with complicated background.

Mobile robot tracking with the disappearance of object.
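The retrieval strategy used in these experiments can be summarized as a small per-frame decision step. The sketch below is an assumed simplification: `tracker` and `detector` are hypothetical callables that each return a bounding box and a response score, and the use of two separate thresholds is an illustrative choice.

```python
def track_step(frame, last_box, tracker, detector,
               track_threshold, detect_threshold):
    """Run one frame of tracking with detector-based retrieval.

    Returns (box, status), where status is "tracked", "retrieved", or "lost".
    """
    # First try the online tracker near the previous position.
    box, response = tracker(frame, last_box)
    if response >= track_threshold:
        return box, "tracked"
    # The tracker response is too weak: search the whole frame with the
    # off-line detector.
    box, response = detector(frame)
    if response >= detect_threshold:
        return box, "retrieved"
    # Retrieval failed: keep the window at its previous position.
    return last_box, "lost"
```

Keeping the previous window when both stages fail matches the behavior described above, where the tracking window stays at its position in the previous frame until the object is found again.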
For binocular mobile robot tracking in the real-world environment, further experiments were performed to compare OAB, MILTrack, and MILFLD, as shown in Figures 9 and 10. Figure 9(a) shows the tracking results of OAB and MILTrack when the object suddenly disappears; OAB cannot handle the severe appearance variation and drifts away. For MILTrack, the error accumulates; although the detector is restarted automatically, the classifiers have all degenerated, which leads to the final tracking failure. Figure 9(c) shows tracking with MILFLD at frames 76, 87, 92, 102, and 110. Note that even when the object is lost for a while, the tracker can still retrieve it. When the object is partially occluded and its weight becomes so small that the tracker fails to retrieve it, the off-line detector is restarted to search for the object over the full image. As can be observed, the binocular mobile robot tracks the object robustly: even if the object suddenly disappears, the built-in detector is able to retrieve it. Figure 10 shows the results of OAB, MILTrack, and MILFLD when the object is occluded by another face. The figures show that MILFLD yields more robust and accurate results, because it constructs the overlap prior using a semi-supervised model, which effectively handles the overfitting problem. Because the discriminative model is formed at the instance level using FLD, the classifier trains quickly and MILFLD runs at 5 FPS, so the robot can keep tracking the object in real time; by comparison, OAB runs at 1 FPS and MILTrack runs at 3 FPS.
The MILTrack algorithm learns the classifier by maximizing the likelihood of all bags, so it has to compute the instance probabilities and bag probabilities M times when selecting each weak classifier, which increases the computation time. In contrast, the proposed MILFLD algorithm uses a classifier built on a sparse representation of Haar-like features, whose distributions are modeled as Gaussian within a Bayesian learning process. The compressed sensing encoding preserves the inherent characteristics of the image model: the sampling step projects the high-dimensional image features into a low-dimensional feature space, which reduces the computation time. MILFLD also selects more effective classifiers through the gradient descent strategy. Each time a new weak classifier is added, it is the one that reduces the loss function over the sample distribution fastest, so the real-time performance is better.
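The idea of choosing, at each step, the weak classifier that makes the loss fall fastest can be illustrated with a generic functional gradient descent sketch. This is not the MILFLD loss: a plain logistic loss is used as a stand-in, and the function name `select_weak_classifiers` and the response-matrix input format are assumptions made for illustration.

```python
import math


def select_weak_classifiers(responses, labels, k):
    """Greedily pick k weak classifiers by steepest descent of a logistic loss.

    responses[m][n] is the output of weak classifier m on sample n,
    and labels[n] is 1 for a positive sample and 0 for a negative one.
    """
    strong = [0.0] * len(labels)  # current strong classifier scores
    chosen = []
    for _ in range(k):
        # Negative gradient of the logistic loss with respect to each score.
        residual = [
            y - 1.0 / (1.0 + math.exp(-f)) for y, f in zip(labels, strong)
        ]
        # Pick the unused weak classifier best aligned with that direction.
        candidates = [m for m in range(len(responses)) if m not in chosen]
        best = max(
            candidates,
            key=lambda m: sum(h * r for h, r in zip(responses[m], residual)),
        )
        chosen.append(best)
        strong = [f + h for f, h in zip(strong, responses[best])]
    return chosen
```

Each round costs one pass over the candidate pool, so there is no need to recompute bag probabilities for every candidate as in the likelihood-based selection of MILTrack.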

Comparisons of robot tracking with the disappearance of object of OAB, MILTrack, and MILFLD. OAB: online AdaBoosting; MIL: multiple instance learning; MILTrack: online MIL tracking; MILFLD: MIL tracking method based on a semi-supervised learning model with Fisher linear discriminant.

Comparisons of robot tracking with the occlusion of object of OAB, MILTrack, and MILFLD. OAB: online AdaBoosting; MIL: multiple instance learning; MILTrack: online MIL tracking; MILFLD: MIL tracking method based on a semi-supervised learning model with Fisher linear discriminant.
As the above experiments show, our method copes with various challenging factors in mobile robot tracking, such as fast motion, occlusion, and illumination variation. In this article, a novel MIL tracking algorithm based on semi-supervised prior information is proposed, and a feature selection method based on a sparse matrix is incorporated into the multi-instance learning tracking framework. The discriminative model constructs the loss function of the bag model using FLD, so a classifier can be built directly at the instance level. By combining gradient descent with online boosting, the selection of weak classifiers is treated from the perspective of gradient descent on the propagated error. Hence the strong classifier trained on the current frame remains discriminative for future frames. Experimental results show that the proposed algorithm is robust and stable under a variety of complex environmental changes.
Conclusion
This article presents an MIL tracking method based on FLD, in which the overlap information serves as prior knowledge about the sampled instances in the bag model. The online classifier is constructed with MIL at the instance level, which avoids the computational cost of discrimination at the bag level. After the discriminative model is constructed, the weak classifiers are selected one by one using a gradient descent rule in terms of error propagation. As the results of numerous experiments show, our method outperforms the other methods under diverse challenging conditions: illumination variation, occlusion, fast motion, motion blur, dynamic background, and scale change and deformation. With the initialization program and retrieval strategy added, the results of porting the MILFLD algorithm to the MT-AR robot object tracking system show that the proposed method is robust when tracking faces under occlusion and against complex backgrounds.
Footnotes
Acknowledgements
This work is supported by the Zhejiang Provincial Natural Science Foundation of China (No. LY18F030018), the NSFC-Zhejiang Joint Fund for the Integration of Industrialization and Informatization (No. U1609205), and the Natural Science Foundation of China (No. 51376055).
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
