Abstract
In this paper, we present a tracking-by-multiple hypotheses framework that detects and tracks multiple vehicles accurately and precisely. The framework consists of obstacle detection, vehicle recognition, visual tracking, global position tracking, data association and particle filtering. The multiple hypotheses come from the obstacle detection, vehicle recognition and visual tracking modules: obstacle detection finds all the obstacles on the road, vehicle recognition classifies the detected obstacles as vehicles or non-vehicles, and 3D feature-based visual tracking estimates the current target state from the previous one. The multiple hypotheses must then be linked to the corresponding tracks to update the target states; a hierarchical data association method assigns multiple tracks to the correct hypotheses using multiple similarity functions. Within the particle filter framework, the target state is updated using a Gaussian motion model and an observation model built from the associated multiple hypotheses. The experimental results demonstrate that the proposed method enhances the accuracy and precision of the region of interest.
1. Introduction
For vehicles to navigate automatically, it is essential to perceive the external environment accurately and reliably through object detection and tracking. This typically requires various expensive sensors, such as radar, lidar, cameras and GPS. For example, Team Tartan Racing's vehicle “Boss”, Carnegie Mellon University's winning entry in the 2007 DARPA Urban Challenge, is equipped with 13 different perception sensors [1]. Various vehicle detection methods [2] for the intelligent vehicle field have been introduced in recent decades, and many algorithms and systems have been reported and demonstrated to enhance reliability and robustness.
Range sensors, such as lidar [3] and radar [4], have been used as standard approaches for robust object detection and localization systems. However, these sensors give only point information for the detected target, and it is very difficult to recognize the class of the detected object. Camera-based perception methods have been proposed to detect and recognize moving objects while localizing their positions with prior perspective information [5, 6]. More recently, many researchers have turned toward stereo vision-based approaches to provide more reliable detection and localization performance [7-11].
No state-of-the-art detection and recognition algorithms [2, 12] can detect and recognize all the objects on the road without false alarms. Recently, several multiple object tracking methods have integrated object detectors and visual trackers to provide reliable object detection and localization [13-15]. In [13], an integrated system combining a WaldBoost detector [16] and tracking-learning-detection (TLD) [17] is proposed to detect and track vehicles in real time using a single camera. However, the method incurs a delay in confirming a target object, because the detector runs every three frames and three consecutive correct detections are required. The tracking-by-detection framework utilizes the output of an object detector as the observation model of a Bayesian filter [15]. This framework reduces false detections during track initialization, owing to the sparse occurrence of false alarms, and increases the detection probability by estimating the target state with a visual tracker, such as the Kanade-Lucas-Tomasi (KLT) tracker [14] or a particle filter [15], when the detector misses an object in the current frame. A tracking method using a particle filter was proposed to re-initialize the tracking algorithm automatically whenever its performance severely deteriorates [18]. In [15], the framework consists of an object-specific detector, a visual tracker and data association for tracking multiple objects. The region of interest (ROI) is updated by the particle filter with a motion model and an observation model: a constant velocity model serves as the motion model, and the associated detection output together with an online classifier serves as the observation model.
However, the updated ROI is mainly dependent on the output of the associated detection because the motion model in an image plane is inaccurate due to the nonlinearity of the target's movement. Only a very small number of works [7, 14] have introduced a stereo-based tracking-by-detection framework for detecting and tracking multiple vehicles.
Stereo-based multiple object tracking methods have the advantage of localizing objects not only in the 2D image plane but also in 3D global coordinates. A method using an occupancy grid and an interacting multiple model (IMM) filter [11], and methods [8-10] that combine depth and motion information, have been proposed to detect and track multiple vehicles or pedestrians using 3D information. The state of a target vehicle, including its position, orientation, velocity, acceleration and yaw rate, is estimated while tracking a 3D point cloud in global coordinates [9]; this method can automatically detect the target object by fusing vision and radar. In [11], the researchers reconstructed 3D points from a depth image and mapped them onto an occupancy grid using an inverse sensor model. Clustered and segmented objects are associated with tracks and serve as the input of the IMM filter. The method extracts obstacles on a road using an occupancy grid, but it does not classify specific target objects such as pedestrians or vehicles.
In the field of intelligent vehicles, most stereo-based multiple object tracking methods have been concerned with object detection and localization problems in 3D global coordinates [8-11]. There has been little effort to increase the accuracy and precision of the ROI in the image plane. To enhance the precision as well as the accuracy of the ROI, we propose a tracking-by-multiple hypotheses framework based on a Bayesian probability model. The proposed method uses a hierarchical data association method, 3D feature-based visual tracking and a particle filter driven by the associated multiple hypotheses. The particle filter updates a target state with not only vehicle recognition outputs, but also obstacle detection and visual tracking outputs.
This paper is structured as follows. Our stereo vision system and the tracking-by-multiple hypotheses framework are introduced in Section 2. In Section 3, the proposed multiple vehicle tracking approach is described; it consists of global position tracking, 3D feature-based tracking, hierarchical data association and a particle filter. A quantitative evaluation metric is detailed, and experimental results and analysis are presented in Section 4. Finally, Section 5 provides the conclusion and directions for future work.
2. System overview
2.1 Stereo vision system for intelligent vehicles
Our stereo vision system consists of stereo matching, obstacle detection, vehicle recognition and multiple object tracking modules, as shown in Fig. 1. The stereo matching module, based on the belief propagation algorithm [19], is implemented on an embedded FPGA platform for real-time processing. It supplies two grey images (left and right) and a depth image to the software platform at VGA resolution and 15 fps. The dense depth image has 128 disparity levels.

Architecture of stereo vision system
The obstacle detection module extracts the road information using the v-disparity method [20] and then detects all the obstacles on the road using a disparity histogram [21]. The vehicle recognition module classifies the obstacles as vehicles or non-vehicles using the cascaded AdaBoost algorithm [22]. Search regions are restricted to the regions determined by the obstacle detection module [23]. This approach not only removes false positive alarms, but also reduces the computation time for vehicle detection. The number of false positive alarms can be drastically reduced in the recognition module; on the other hand, the vehicle detection probability is slightly decreased by the errors of obstacle detection and vehicle recognition. The multiple vehicle tracking module updates the state of each vehicle (global position and velocity, ROI position and size) and minimizes the number of false alarms caused by the imperfect obstacle detection and vehicle recognition algorithms. One of the advantages of the stereo vision system is that the global position and motion of the target object can be estimated accurately and reliably, which is very helpful for distinguishing the target object from other objects [14].
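The v-disparity road extraction step described above can be sketched as follows. This is an illustrative reimplementation, not the authors' code; the `levels` parameter reflects the 128 disparity levels of our depth image, and the function only builds the histogram in which the road surface appears as an oblique line and upright obstacles as near-vertical segments.

```python
import numpy as np

def v_disparity(disp, levels=128):
    """Row-wise disparity histogram (v-disparity). In this image the road
    surface projects to an oblique line, while upright obstacles project
    to near-vertical segments that can be separated from the road profile."""
    height = disp.shape[0]
    vd = np.zeros((height, levels), dtype=np.int32)
    for v in range(height):
        row = disp[v][disp[v] > 0]              # ignore invalid (zero) disparities
        vd[v] = np.bincount(row, minlength=levels)[:levels]
    return vd
```

A line fit on this histogram then gives the road profile, and pixels whose disparity deviates from it are obstacle candidates.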
2.2 Tracking-by-multiple hypotheses framework
The tracking-by-multiple hypotheses framework consists of obstacle detection, vehicle recognition, global position tracking, visual tracking, data association and a particle filter, as shown in Fig. 2.

Block diagram of tracking-by-multiple hypotheses
In global position tracking, the accurate sub-pixel disparity of the object can be calculated using the stripe-based accurate disparity (S-BAD) estimation method; the global 3D position and velocity of an object can be updated using the inverse perspective mapping-based extended Kalman filter (IPM-based EKF) [24].
Feature-based visual tracking estimates the ROI of the current target object from the previous ROI. One of the difficult problems in feature-based tracking is finding corresponding feature pairs in the current image. Another is removing the outlier features that correspond to other objects or to the background. The Kanade-Lucas-Tomasi (KLT) [25] feature tracker has been widely used for real-time tracking problems due to its fast computation and generality [26, 27]. However, the KLT is vulnerable to severe illumination changes and abrupt object movement, and it easily fails to track a target when there are many outlier features in a cluttered environment. Our 3D feature-based tracking method is proposed to overcome these problems.
All the existing tracks are connected to correct observations in order to update the target states (global position and velocity, ROI position and size) in a multiple object tracking problem. A hierarchical data association approach deals with the track-to-multiple hypotheses assignment problem. The hierarchical data association utilizes the sub-pixel disparity and global position of the global position tracking module [24], outputs of the visual tracking module, outputs of obstacle detection module [21], and outputs of the vehicle recognition module [23] to assign multiple hypotheses to multiple tracks. In [14], the association cost is calculated by considering the similarity of the sub-pixel disparity and the longitudinal and lateral distance. In this work, we improve the robustness of data association by adding the criteria of the local distance and appearance similarity.
The ROI update module utilizes three types of hypotheses from the obstacle detection, vehicle recognition and visual tracking modules, designated as the general hypothesis (GH), object-specific hypothesis (OSH) and target-specific hypothesis (TSH), respectively. GH gives a very high detection probability, but provides poor ROI precision and a high false positive alarm rate. OSH has the advantage of removing many of the false positive alarms of GH and improving the ROI precision, although the number of false negative alarms increases slightly and OSH often provides noisy and unstable ROI outputs. The ROI of TSH is strongly dependent on the ROI state of the tracked object, and the track drifting problem often occurs when a target is tracked for a long time without GH or OSH. The particle filter, based on Bayesian probability, updates the current ROI with the associated multiple hypotheses in order to enhance the ROI precision and accuracy.
3. Multiple vehicle tracking using tracking-by-multiple hypotheses framework
3.1 Global position tracking with IPM-based EKF
Global position tracking estimates the position and velocity of a target object on the road using a stereo vision system. The accuracy of the longitudinal distance mainly depends on the accuracy of the disparity, so accurate disparity estimation is very important for estimating the distance accurately and precisely. In [24], we proposed the S-BAD estimation method to estimate sub-pixel disparity accurately and reliably. The experimental results show that the method can estimate the sub-pixel disparity with about 0.1 pixel error, and distances up to 50 m with approximately 2% error.
The IPM-based EKF method reduces the error covariance of the position and velocity of the target. In the prediction step of the EKF, the state vector (xi,k) is propagated by the system equation with a state transition matrix (Fk/k–1) based on a constant velocity model, where wk and Qk are the process noise and the process noise covariance, respectively.
In the update step, the predicted state is corrected using the measurement vector (zj,k) of the associated observation.
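The constant-velocity prediction step can be sketched as below. The 4-state layout [X, Z, VX, VZ] (lateral/longitudinal position and velocity) and the noise values are illustrative assumptions; the paper's exact state vector and noise model may differ, and the full IPM-based EKF in [24] additionally corrects this prediction with the sub-pixel disparity measurement.

```python
import numpy as np

def ekf_predict(x, P, dt, q=0.1):
    """Kalman/EKF prediction under a constant-velocity model.
    x: state [X, Z, VX, VZ], P: state covariance,
    dt: time step, q: process-noise intensity (illustrative)."""
    F = np.array([[1.0, 0.0, dt,  0.0],
                  [0.0, 1.0, 0.0, dt ],
                  [0.0, 0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]])
    Q = q * np.eye(4)                    # simplified process-noise covariance
    x_pred = F @ x                       # positions advance by velocity * dt
    P_pred = F @ P @ F.T + Q             # uncertainty grows with motion
    return x_pred, P_pred
```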
3.2 3D feature-based tracking
The 3D feature-based visual tracking module consists of feature extraction, feature tracking, feature selection, 3D feature clustering, model selection and ROI estimation, as shown in Fig. 3. The features from accelerated segment test (FAST) detector [28] is used to extract distinctive features due to its speed and high repeatability. The FAST detector classifies a point as a corner feature if a contiguous arc of pixels on a circle around the candidate point are all brighter or darker than the centre pixel by a given threshold.

Block diagram of 3D feature-based visual tracking
One problem with a feature-based tracker is that it is very difficult to select only the features corresponding to the target object. When an object is estimated with a misaligned ROI, many outlier features correspond to the background or to other objects (Fig. 4). Consequently, the outlier features cause the model parameters to be incorrectly estimated. The 3D feature clustering method deals with this problem by minimizing the number of outlier features: the features are clustered in the 3D global position and motion spaces using an iterative scheme. In 3D global position clustering, the features are projected into 3D global coordinates using the IPM model [24].

Many outlier features in misaligned ROI
where Pm and ΣP denote the mean and covariance of the features in 3D global position, respectively.
where Mi indicates the motion vector of the i-th feature between consecutive frames.
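The iterative 3D position clustering can be sketched as below, assuming a Mahalanobis gate as the inlier criterion (7.81 is the chi-square 95% threshold for three degrees of freedom; the paper's exact gating rule may differ). Pm and SigmaP correspond to the cluster mean and covariance defined above.

```python
import numpy as np

def cluster_3d_features(pts, gate=7.81, max_iter=10):
    """Iteratively keep features whose squared Mahalanobis distance to the
    cluster mean Pm (covariance SigmaP) is below the gate.
    pts: N x 3 array of feature positions in 3D global coordinates.
    Returns a boolean inlier mask; outliers from the background or other
    objects fall outside the gate and are discarded."""
    inlier = np.ones(len(pts), dtype=bool)
    for _ in range(max_iter):
        mean = pts[inlier].mean(axis=0)                      # Pm
        cov = np.cov(pts[inlier].T) + 1e-6 * np.eye(3)       # SigmaP, regularized
        diff = pts - mean
        d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
        new_inlier = d2 < gate
        if new_inlier.sum() < 4 or np.array_equal(new_inlier, inlier):
            break                        # converged, or too few points left
        inlier = new_inlier
    return inlier
```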
3.3 Hierarchical data association
Data association problems were originally addressed in the multiple object tracking problem for radar systems. In recent decades, data association methods have been applied to the intelligent vehicle [14] and surveillance [15, 26] fields for multiple object tracking. To solve the assignment problem, an association cost matrix (C) is calculated using similarity functions. In [14], similarity functions of the global distance and the sub-pixel disparity are used to calculate the association cost. Although no track identity switching errors were observed in the experiments, a few tracks often link to false detections such as guard rails or side walls. In this study, we enhance the discriminating power by using the local position distance and appearance similarity as well as the 3D global distance.
The global distance function is represented by the global and disparity distance between the track's prediction and the measurement.
The local distance is computed using the overlap ratio between the predicted ROI of the track and the ROI of the hypothesis.
The histogram of oriented gradients (HOG) [30] is used as the appearance similarity function to distinguish a correct hypothesis from an incorrect one.
The hierarchical data association method assigns the existing tracks to multiple hypotheses in three stages: track-to-OSH, track-to-GH and track-to-TSH association. In the track-to-OSH stage, all the existing tracks are assigned to OSHs using the GNN data association algorithm, which yields the optimal assignment matrix that minimizes the total association cost.
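The assignment step of each stage can be sketched with a min-cost matching; here `linear_sum_assignment` (the Hungarian algorithm) stands in for the GNN solver, and the gate value is illustrative. In the paper, the cost matrix combines the global distance, local overlap ratio and HOG appearance similarity; any such combined matrix can be passed in.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def gnn_associate(cost, gate=1.0):
    """Globally optimal track-to-hypothesis assignment on a cost matrix
    (rows: tracks, cols: hypotheses). Pairs whose cost exceeds the gate
    are left unassigned rather than forced into a bad match."""
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < gate]
```

Tracks left unassigned here are passed on to the next stage (track-to-GH, then track-to-TSH) as described above.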
3.4 Particle filter using multiple hypotheses
The Bayesian object tracking framework consists of a motion model and an observation model; the target states are estimated by maximum a posteriori (MAP) estimation.
In our tracking-by-multiple hypotheses framework, several measurements are used in the observation model; these observations correspond to the multiple hypotheses GH, OSH and TSH. All tracks are initialized by a few consecutive associated OSHs and terminated after a few consecutive frames without an associated OSH or GH. GH contains many false detections and poor ROI precision, but provides a high detection probability, because the approach extracts all the obstacles on the road regardless of their object class. The state of a track that is not linked to an OSH is updated with the associated GH, which allows the track to be maintained for a longer time. TSH is mainly dependent on the previous target state; it provides relatively good results in general conditions, but is prone to failing during abrupt motions or illumination changes. At the same time, TSH enables a track to maintain a stable state in the presence of abrupt variations of the GH and OSH.
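The initialization and termination logic described above amounts to simple hit/miss counting per track. A minimal sketch follows; the thresholds (3 consecutive hits to confirm, 5 consecutive misses to terminate) are illustrative assumptions, not the paper's values.

```python
class TrackState:
    """Minimal track lifecycle: confirm after `init_n` consecutive
    associated detections, terminate after `miss_n` consecutive frames
    with no associated OSH or GH."""

    def __init__(self, init_n=3, miss_n=5):
        self.hits, self.misses = 0, 0
        self.init_n, self.miss_n = init_n, miss_n
        self.confirmed = False

    def update(self, associated):
        if associated:
            self.hits += 1
            self.misses = 0
            if self.hits >= self.init_n:
                self.confirmed = True    # track leaves the initializing state
        else:
            self.misses += 1
            self.hits = 0

    @property
    def terminated(self):
        return self.misses >= self.miss_n
```

Counting consecutive hits before confirmation is what suppresses sparse false detections during track initialization, as discussed in Section 4.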
In the observation model of the tracking-by-multiple hypotheses framework, the likelihood term is calculated by the weighted sum of these noisy multiple hypotheses.
where Σ denotes the covariance matrix of the residual between each hypothesis and the predicted target state.
The number of samples (N) per object is set to 1,000; these samples are used to estimate the optimal target state in the particle filter. The posterior probability density is recursively propagated using the probabilities of the samples at every time step.
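One step of the multi-hypothesis likelihood weighting can be sketched as below, with a reduced state for clarity; the hypothesis weights and sigma are illustrative assumptions, and the real filter operates on the full ROI state (position and size) with the covariance Σ above rather than a scalar.

```python
import numpy as np

def pf_update(particles, hypotheses, hyp_weights, sigma=5.0, seed=0):
    """Weight N particles by a weighted sum of Gaussian likelihoods, one
    per associated hypothesis (GH, OSH, TSH), then resample.
    particles: N x d array of sampled ROI states."""
    lik = np.zeros(len(particles))
    for z, a in zip(hypotheses, hyp_weights):
        d2 = np.sum((particles - z) ** 2, axis=1)
        lik += a * np.exp(-0.5 * d2 / sigma ** 2)   # weighted sum of likelihoods
    w = lik / lik.sum()
    estimate = w @ particles                        # posterior-mean state estimate
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx], estimate
```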
4. Experimental results
4.1 Experimental setup
Real-world stereo sequences were captured in various scenarios to test and verify the performance of our method. All the images are 640 × 352 × 8 bpp at 15 fps, from a stereo camera with a 0.3 m baseline mounted on a moving vehicle (Fig. 5). Depth images are obtained with the belief propagation algorithm; because this is time-consuming in software, the algorithm is implemented on the FPGA system for real-time processing. Our software platform includes the obstacle detection, vehicle recognition and multiple vehicle tracking modules (Fig. 1).

Stereo vision system mounted on vehicle
Four different scenarios (Fig. 6) were selected for quantitative evaluation; many more test scenarios were used for qualitative analysis. The four scenes were captured in the following settings: urban roads in heavy traffic, cluttered roads with severe illumination change, urban roads on rainy days and highways with curves. Ground truths for each scenario were manually annotated. The tracking performance is evaluated using a metric widely used in the multiple object tracking field [14, 15, 26, 32]. Two ground truths are used to count the numbers of false negative and false positive alarms while considering limited distance and occlusion conditions. The first is a mandatory ground truth, which covers all vehicles with full appearance at less than 70 m, including tracked vehicles that are partially occluded at less than 70 m. The second is an optional ground truth, which covers partially occluded vehicles being initialized and vehicles at more than 70 m. The vehicle recognition system fails to classify partially occluded vehicles correctly, and distant vehicles are difficult to recognize due to their small size. A false negative alarm is counted when a vehicle in the mandatory ground truth is not detected; a false positive alarm is counted when an estimated ROI corresponds to neither the mandatory nor the optional ground truth.

Test datasets for quantitative evaluation
The CLEAR MOT metric [32] gives both a multiple object tracking precision (MOTP) score and a multiple object tracking accuracy (MOTA) score. MOTP measures the localization precision of the estimated ROI; it is calculated using the intersection over union of the estimated and ground-truth bounding boxes.
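With this definition, MOTP reduces to the mean intersection over union across matched pairs. A minimal sketch, with boxes given as (x, y, w, h):

```python
def iou(a, b):
    """Intersection over union of two ROIs given as (x, y, w, h)."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def motp(matched_pairs):
    """Mean IoU over all matched (estimated ROI, ground-truth ROI) pairs."""
    return sum(iou(e, g) for e, g in matched_pairs) / len(matched_pairs)
```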
4.2 Evaluation and analysis
We tested and analysed the proposed method qualitatively and quantitatively using several image sequences captured in various real road environments. Most walls, guardrails and trees around roads are extracted by the obstacle detection module, because the obstacle detection algorithm detects all the obstacles on the road. The vehicle recognition module often mistakes these false detections for vehicles due to errors in the vehicle recognition algorithm (Fig. 7(a)). In multiple vehicle tracking, most false detections are removed during track initialization due to their sparse occurrence (Fig. 7(b)). Vehicle detection misses a partially occluded vehicle (first row), one of two vehicles that are close together (second row), and a vehicle in the far distance (third row), as shown in Fig. 8(a). However, the visual tracking module estimates the missed target ROI using the previous ROI (Fig. 8(b)). The vehicle detection module often gives an unstable ROI state, such as a bigger ROI (first row), smaller ROI (second row), or misaligned ROI (third row), as shown in Fig. 9(a). The target states are smoothed even when the ROI states change abruptly due to noisy vehicle recognition (Fig. 9(b)). A track is terminated if it is not linked to corresponding observations for several consecutive frames. In Fig. 10, the tracks are not associated with any vehicle recognition outputs even though the obstacle detection module estimates the ROI of the vehicle correctly. Errors in vehicle recognition often occur for a small ROI (first row), an ROI containing only part of a vehicle (second row), and an ROI in dark lighting conditions (third row). Unassigned tracks determine their corresponding GH using the hierarchical data association, so the tracks can be updated and maintained with the associated GH.

False detection removal. (a) Vehicle detection results: White boxes and black boxes represent results of obstacle detection and vehicle recognition, respectively. Red circles indicate false detections. (b) Results of multiple vehicle tracking: Colour boxes denote the tracking vehicles and white box indicates that the vehicles are being initialized, which are not regarded as detected vehicles in this frame.

Recovery of the ROI of missed detection. (a) Vehicle detection results: Red circles indicate missed detections. (b) Results of multiple vehicle tracking: Black circles indicate tracked ROIs.

Smoothness of unstable ROI. (a) Vehicle detection results: Red circles indicate the misaligned ROI. (b) Results of multiple vehicle tracking: Black circles indicate updated ROIs.

Track-to-GH association for track maintenance. (a) Vehicle detection results: Red circles indicate the ROI of GH. (b) Results of multiple vehicle tracking: Black circles indicate the track states are updated with the GH.
In scenario 1, there are many missed detections when vehicles are close to or occluded by other vehicles for several tens of frames; a track cannot be initialized from such sparse detection outputs, so the number of false negative alarms increases in this period (Fig. 11). In scenario 2, when false detections (walls and guard rails) are associated with incorrect tracks for a few consecutive frames, the false detections are propagated by visual tracking even though they are not detected in subsequent frames (Fig. 12). In some scenarios, there are a few visual tracking errors for far away vehicles in heavy traffic (Fig. 13(a)), vehicles in bad illumination conditions (Fig. 13(b)), and vehicles in images that are noisy due to raindrops (Fig. 13(c)).

Track initialization failure. (a) Vehicle detection results: Red circles indicate two consecutive vehicles are detected, but the vehicle is not detected in the third image. (b) Results of multiple vehicle tracking: White box indicates a track-initializing vehicle. Black circle indicates track initialization failure due to deficiency of consecutive detections.

False detection propagation error caused by visual tracking. (a) Vehicle detection results: Red circles indicate false detections. (b) Results of multiple vehicle tracking: Black circle indicates false detection propagation error.

Visual tracking errors (a) Far away vehicle in heavy traffic. (b) Vehicle in bad illumination condition (c) Vehicle in noisy image due to raindrops.
Table 1 shows the quantitative evaluation results for four different real-world scenarios. Recall and precision as well as MOTA and MOTP are reported to allow indirect comparison with other methods. In the tracking-by-multiple hypotheses framework, the target state is estimated by a stochastic particle filter, so we executed the method ten times to determine the means and standard deviations. The experimental results show that the MOTA and MOTP scores of our proposed method outperform those of the vehicle detection method in all the test scenarios. In scenario 1, there were many missed detections due to close and occluded vehicles. In scenario 2, false detection propagation errors occurred due to a few consecutive false detections. In scenario 3, there were some errors in the vehicle recognition module due to very noisy images, and the recall of the vehicle detection method is very low. However, the obstacle detection module can still detect many vehicles, and the tracking-by-multiple hypotheses framework can update and maintain the target state with the GH (Fig. 14). In scenario 4, a track was not initialized when vehicles were occluded by other vehicles for dozens of frames, and most of the false negative alarms occurred in this period. Our videos of the experimental results are available on YouTube [33-36]. In future research, we will adopt a more advanced object detection method and show the effectiveness of the proposed approach combining object recognition with obstacle detection.

(a) Missed detection in vehicle recognition. (b) Updated ROI with GH in tracking-by-multiple hypotheses framework.
Quantitative evaluation results
Fig. 15 shows the vehicle trajectories of longitudinal distance and lateral distance for scenario 3. The experimental results verify that our method can estimate the trajectories of target vehicles reliably even though noisy stereo images were captured on a rainy day.

Longitudinal and lateral distance estimation for two target vehicles in scene 3.
All the software algorithms were implemented in Visual C++ using OpenCV 2.2 on a PC platform with a quad-core 2.83 GHz CPU. The values of the parameters used for the experiments are summarized in Table 2. The frame rate of the complete software pipeline (obstacle detection, vehicle recognition and multiple vehicle tracking) is about 10 to 15 frames per second; the multiple vehicle tracking algorithm alone runs at about 15 to 19 frames per second. The processing time for all the test scenes is given in Table 3.
Values of parameters used in our experiments
Processing time per frame
5. Conclusions
In this paper, we proposed a tracking-by-multiple hypotheses framework to improve multiple object tracking accuracy and precision. Most false detections are removed during track initialization, and the number of missed detections is minimized using 3D feature-based visual tracking. A hierarchical data association method was proposed to assign multiple tracks to multiple hypotheses. The particle filter updates the target state using the motion model and the observation model with the associated multiple hypotheses. Experimental results on challenging test scenarios demonstrate that both the MOTA and MOTP scores are remarkably improved compared with those of the vehicle detection method. Two limitations remain: irregular detections caused by occluded vehicles prevent a track from being initialized, and false detection propagation errors occur through visual tracking when a track is initialized from consecutive false detections. We will work on the track management method to solve these problems. In addition, the software will be optimized and the processing time improved using a parallel programming scheme.
6. Acknowledgement
This work was supported by the DGIST R&D Program of the Ministry of Education, Science and Technology of Korea.
