Abstract
We present a novel framework for multiple pedestrian tracking using overlapping cameras in which the problems of object detection and data association are solved alternately. In each round of our algorithm, the people are detected by inference on a factor graph model at each time slice. The outputs of the inference, namely, the probabilistic occupancy maps, are used to define a cost network model. Data association is achieved by solving a min-cost flow problem on the resulting network model. The outputs of the data association, namely, the ground occupancy maps, are used to control the size of factors in graph model in the next round. By alternating between object detection and data association, a desirable compromise between complexity and accuracy is obtained. Experiments results on public datasets demonstrate the competitiveness of our method compared with other state-of-the-art approaches.
1. Introduction
In recent years, there are lots of research interests in multiple people detection and tracking using camera networks with overlapping views. Generally speaking, proposed methods can be categorized into image based ones and state based ones. In the image based methods [1–4], typically, some multiview geometry is utilized for people localization using the foreground detection results in multiple cameras. In [1, 2], the vertical axes of the foreground blobs corresponding to a single person in different views are warped into a common view using ground plane homography, and the ground position of the person is determined by the intersection of the mapped axes. Similarly, in [3] the foreground likelihood maps in different views are warped into a common reference view to generate a synergy map, in which the occupying positions are assigned with higher values. In [4], the vertical axis of a person is projected from one view to another by jointly using the ground plane homography and epipolar geometry. The main drawback of the image based methods is that the localization performance depends heavily on the foreground segmentation quality and deteriorates rapidly when complex occlusions occur. In contrast, occlusions can be treated naturally in state based methods [5–7], in which the ground plane is discretized into regular grid and the occupancy states on each position are estimated by inference on a MRF (Markov random field) or CRF (conditional random field) model. The main difficulties in state based methods are the combinatorial explosion of the state space and the higher-order dependency structure in the underlying graphical model. Therefore, the state based methods have to resort to either random sampling algorithms [5] or assumption of simplified structure in graphical models [6, 7]. In [6], the ground is discretized into regular locations and the occupancy state is represented by introducing a Boolean variable. With the hypothesis of independency between each of the locations, the joint posterior occupancy distribution is approximated by the product of marginal probability of the occupancy state on each location by minimizing the KL (Kullback-Leibler) divergence. Performance of the method significantly decreases in case of intense crowd due to the independent hypothesis. In [7], the dependency caused by occlusion between people is modeled by higher-order MRFs and the occupancy state is determined by cascaded optimization on a sequence of MRFs with increasing clique size. The optimization problem in each stage, that is, in each MRF with specific clique size, is solved using the pattern based optimization algorithm. And the optimizer of the lower order MRF is used to impose constraint on the state space of the higher-order MRF in the next level, in order to maintain the computation to be tractable. Finally, by considering the whole ground plane as a single clique, the global optimizer can be obtained. However, when the monitored area is crowded with people, the speed of the algorithm [7] will be slowed down significantly. In fact, there are several works that deal with occlusions by inferring on a graphical model.
The tracking-by-detection method has the inherent weakness that it requires the output of the detectors to be reliable in order for the data association module to work properly. However, fast and reliable detection in crowded scene is still an unresolved problem. It is advantageous to solve for detections and data association jointly, rather than computing detections first and then linking them into trajectories. The method presented in [8] uses quadratic Boolean programming to solve detection and tracking in coupled manner. Multiple tracking hypotheses are kept and the most likely trajectories are found by searching forward as well as backwards in time. The authors in [9] expressed the problems of detection and tracking in optimizing a single objective function and solved the optimization problem using Lagrange dual decomposition strategy. Specifically, the subproblem of detection is solved by using spars recovery technique and subproblem of data association is solved by network flow and the two subproblems reach consensus iteratively by projected subgradient optimization.
Inspired by the joint detection and tracking strategy, in this paper we proposed a new method based on previous work [7] for multiple people tracking using camera networks with overlapping views. The main drawback of [7] is its speed, especially in case of crowded scenarios. The key to speed up [7] is to eliminate false alarms generated by inference on MRF with small-sized cliques as fast as possible. In this work we try to accelerate the elimination process in two ways. Firstly, we exploit spatiotemporal information to delete false alarms by running object detection and data association modules alternatively. But we note that the performance of data association relies heavily on the probabilistic distribution of occupancy state over the monitored area. So secondly, we incorporate the pedestrian detector into our factor graph model for occupancy map inference, resulting in a more peaky distribution. In fact, addition of pedestrian detector to foreground mask to infer the occupancy maps has been studied in several works. The main features of our work include the following.
In each time slice, a probabilistic occupancy map is calculated by inference on a factor graph model using belief propagation algorithm [10]. The potential function of the factor graph is defined using output of a HOG-based person detector and a generative model that compares back-projected synthetic images with foreground masks. By inferring on factor graph model, information from multiple views is fully exploited, and the problem of occlusion between people is well addressed. In the data association phase, the probabilistic occupancy maps are used to define a cost network model, and the ground occupancy maps in each time slice are determined by solving a min-cost flow problem on the cost network model using the efficient Our algorithm works iteratively. The resulting ground occupancy maps in current round are used to determine the elements contained in each factor in the next round. In each round, each factor contains only state variables corresponding to 1-state (occupied) locations in the ground occupancy map within a fix-sized sliding window. Along with iteration, the number of occupied positions in ground occupancy maps decreases monotonously. Thus the factor window can be enlarged gradually, taking into account occlusions in larger and larger area, while maintaining the computation to be tractable. To avoid the computational complexity in inference, we begin with small factor window in the first round. The experimental results show that our algorithm provides a desirable compromise between complexity and accuracy. By alternating between object detection and data association, the size of the factors can be well controlled. Compared with the state-of-the-art methods, it can reduce false alarms and missed detections in some cases.
The rest of the paper is organized as follows. In Section 2 we formulate the problem of detecting people using multiple cameras as an inference problem on factor graph model and define the potential functions in the factor graph model. In Section 3 we present our framework in which person detection (calculating probabilistic occupancy map) and data association (calculating ground occupancy map) are performed alternatively and show how to solve the two problems using belief propagation and
2. Multipeople Localization
In this section we deal with the problem of multiple people localization using camera networks with overlapping views, as shown in Figure 1. In case of intensive crowd and frequent occlusions, multipeople localization and tracking using single camera are difficult to achieve ideal performance. However, the comprehensive utilization of multiple cameras can significantly improve the accuracy of both.

The scheme of multicamera localization.
Assume that
2.1. Factor Graph Model
Factor graph [10], a bipartite graph that expresses which variables are arguments of which local function, is a graphical model that can be used to visualize the factorization of a complicated probabilistic distribution. Assume that the joint posterior probability distribution of
The potential function
2.2. Potential Function
In this subsection we define the potential function

Pedestrian detection.
Potential function
Occupancy state on all locations determined simply by pedestrian detector is difficult to achieve ideal effect due to the occlusions between pedestrians. So we must consider the interaction of occupancy states between different locations, which is caused by occlusions in certain view. Next we show the generation model of observation

Observation model of foreground detection.
The potential function
2.3. Determination of Factor a
Factor

A typical factor-in-factor graph model.
The complexity of the inference on the factor graph model increases exponentially with the factor size. Therefore it is intractable to use belief propagation algorithm directly for the above factor graph model as Figure 4. Based on the above consideration, we propose a novel framework, which can effectively reduce the computational complexity and ensure the accuracy of localization and tracking.
3. A Novel Framework Based on Probabilistic Inference and Data Association
As mentioned above, due to occlusion between pedestrians and overlap between camera views, the factor

The influence of factor window on target detection.
Figure 5 shows people detection results under different size of factor windows for single camera view. When factor window covers the whole ground, correct results are obtained by inferring based on the graph model and thresholding, as shown in Figures 5(b) and 5(c). When the size of factor window is 2 × 2, the ground occupancy map and the image annotation according to it are shown in Figures 5(d) and 5(e). As we can see, when the window becomes smaller, the locations on the ground occluded by pedestrians labeled with 1 and 2 are mislabeled as occupied position, leading to the occurrence of false alarms but no missed detection.
Based on the above assumption, we propose a novel framework, as shown in Figure 6. The factor window becomes larger and larger along with iterations. At each round, we first specify a fixed window size for factors and determine variables contained in each factor via sliding the factor window on the ground. Initially, we assume the size of factor window is 3 × 3, as

The proposed framework.
When we calculate the probabilistic occupancy map
Locations marked with 1 in
In the next round, we enlarge the size of factor window and determine variables of each factor by masking the factor window on suspected positions. Due to the expansion of the size of factor window, we can consider the dependency in longer distance, making the probabilistic occupancy map more peaky on the locations where pedestrians exist. According to
The iteration continues until the factor window covers the whole ground, as
3.1. Probabilistic Inference
In the
The messages are usually initialized to
3.2. Data Association
The purpose of data association is to determine the number of pedestrians and the trajectory of each pedestrian according to probabilistic occupancy maps
We take probabilistic occupancy maps
To model occupancy over time, let us consider a labeled directed graph with

Network flow model.
To consider that the number of tracked objects may vary over time, we introduce two additional nodes
A directed edge
The cost value of the edges emanating from the source node is set to zero to allow objects to appear at any entrance position and at any time instant at no cost. Each edge
For all
Furthermore, since a location cannot be occupied by more than one object at a time, we can set an upper bound of 1 to the sum of all outgoing flows from a given location and impose
A similar constraint applies to the incoming flows, but we do not need to explicitly state it, since it is implicitly enforced by (9). Finally, the flows have to be nonnegative and we have
Finally, we introduce an additional constraint that ensures that all flows departing from
Under the conditions of (10)–(13), data association problem is converted into minimum cost flow problem and can be written as
Here,
As a result, flow variable
4. Results
4.1. Test Data
Our data set consists of Terrace sequence, Basketball sequence, Passageway sequence [14] and PETS 2009 S2/L1 sequence [15]. Terrace sequence is 3 1/2-minute outdoor sequence consisting of up to 9 people appearing one after the other and walking in front of the cameras. It tests the ability of our algorithm to cope with a moderately crowded environment. Basketball sequence involves 10 players and 1 referee in a game on half a Basketball court. It is a difficult sequence with fast moving people and many occlusions. Passageway sequence involves 7 people passing through a public underground Passageway. This is very challenging for several reasons. First, lighting conditions are very poor, which is typical in real-world surveillance situations. A large portion of the images is either underexposed or saturated. Second, the area covered by the system is large, and people get very small on the far end of the scene, making their precise localization challenging. Finally, large parts of the scene are filmed by only two cameras or even a single camera. PETS 2009 sequence filmed at a road corner of a university campus involves about 10 people, due to the large monitored area, and part of the scene is only filmed by one camera, making it a certain difficulty. In the first three pedestrian environments, the scene was filmed by 4 DV cameras with overlapping fields of view, each of which is placed in a corner of the monitored area. The video format is DV PAL, downsampled to 360
Datasets used for test.
4.2. Implementation Details
In foreground segmentation, we adopt the method proposed in [16] to obtain binary image; in pedestrian detection, HOG feature is calculated by method proposed in [17] and the kernel function of support vector machine is linear, realized by LIBSVM tools [18]. For Basketball datasets, the trained model of INRIA datasets [19] cannot achieve good performance due to the rapid motion of players, so we add extra data depicted in [20] to retrain more robust model. In probabilistic inference, the manner of factor window enlargement is very flexible. In order to speed up the algorithm, the factor window is enlarged in an adaptive manner. In each round, we choose the maximum factor window size ensuring that the number of unresolved states in each individual factor does not exceed 8.
Our original formulation is not applicable to cases where the numbers of locations in the ground and frames in the video are very large. In Basketball sequence, the resulting cost-flow network consists of about 21 million nodes and 197 million edges. This is beyond the memory capability of common PC. So in our experiments, following [6], we process the video sequence by batches of 200 frames instead of the whole sequence and keep the results of the first 20 frames and slide our temporal window to achieve consistency over successive batches.
4.3. Case Study
The frame 3250 of Terrace sequence is chosen as a typical case to illustrate the process of the framework proposed in this paper.
Each row in Figure 8 shows the result of a single round. The first column represents the probabilistic occupancy map given by belief propagation algorithm; the second column represents the ground occupancy map obtained by thresholding the probabilistic occupancy map; the third column represents the ground occupancy map achieved by

A typical case illustrating our proposed framework.
We can see from the second and third columns in Figure 8 that the true occupied positions can be detected by both thresholding and data association, but the number of suspected positions in the latter is less than the former. This proves that data association using spatiotemporal constraints can reduce false alarms. Consequently, taking such ground occupancy map as input for the next round can greatly improve the speed of inference.
In the first round, we set the size of factor window to 3 × 3. Thus we cannot handle occlusions in long distance. As a result, the probabilistic occupancy map is relatively flat and the number of suspected positions is large. In the following round, the factor window is enlarged in an adaptive manner. Thus the occlusions in longer distance are considered and the resulting probabilistic occupancy map is more peaked. Moreover, the number of trajectories given by
Now we focus on the red arrow pointing to the people labeled with 1. In the third iteration, this target is lost due to the large number of false alarms around it in the preceding and following frames. However, after the enlargement of factor window in the fourth iteration, these false alarms are deleted because of the consideration of occlusions in larger area. Consequently, target 1 is recovered by data association using spatiotemporal constraints.
4.4. Effect of Pedestrian Detection
It is interesting to ask to what extent does our method rely on the suspected position removing effect of data association. In our experiments, we find that the behavior of data association relies heavily on the flatness of probabilistic occupancy map. If the probabilistic occupancy map is flat,
We find that the incorporation of the pedestrian detector into the potential function of factor graph is critical to the performance of our method. Figure 9 illustrates the effect of pedestrian detection. The first row shows the resulting probabilistic occupancy map and ground occupancy map at each round when

Illustration of the effect of pedestrian detection.
In some extremely difficult cases, for example, very crowded group of people moving in a small region, the output probabilistic occupancy map may be rather flat and the data association cannot be able to remove the suspected position effectively. In this case, our framework reduces to the method in [7]. However, in our experiments we find our method can always remove the suspected position effectively by proper combination of information from pedestrian detector and foreground masks.
4.5. Baseline and Evaluation Metrics
In order to demonstrate the effectiveness of our algorithm, we compare the following four algorithms: probabilistic occupancy map (POM) proposed in [6]; cascaded optimization on higher-order MRFs (HO-MRF) proposed in [7];
We adopt the commonly used detection metrics, multiple object detection accuracy (MODA), multiple object detection precision (MODP) and tracking metrics multiple object tracking accuracy (MOTA), and multiple object tacking precision (MOTP) [21]. MODA takes into account false alarm, missed detections. MODP measures the spatial overlap information between the ground truth and the detections. MOTA takes into account false alarm, missed detections, and identity switches. MOTP measures the average distance between ground truth trajectories and the system-generated trajectories. The higher the above metrics are, the better the performance of the algorithm is.
4.6. Comparison Results
Figure 10 shows detection metrics of the above algorithms. On all sequences, HO-MRF and our proposed method are better than POM and POM + KSP in terms of MODA and MODP because of the modeling power of factor graph model. POM approximates the joint posterior distribution of occupancy states by a product law minimizing the KL divergence; therefore the performance of detection deteriorates in case of dense crowd or serious occlusions; POM + KSP is slightly better than POM, because POM + KSP takes the output of POM as input and takes advantage of spatiotemporal constraints, leading to reduction of false alarms and missed detections. By using factor graph (or MRF) model with varying-sized factors (or cliques), our method and HO-MRF can deal with occlusions in large area, leading to better detection results. Ideally, if none of the true occupied positions were misdeleted during removing of suspected positions, upon convergence HO-MRF and our method could reach the equivalent result to that given by optimizing directly on the joint distribution of occupancy states, which is usually computationally intractable. Actually, we find in our experiments that true occupied positions are seldom misdeleted in HO-MRF and our method during removing of suspected positions. This can be verified in Figure 13 that the FN metrics of HO-MRF and our method are always equal or less than that of POM and POM + KSP. In contrast to HO-MRF which makes hard decision through optimization on MRF model at each iteration, our method takes the soft occupancy probabilistic map as input and exploits spatiotemporal information to find the ground occupancy map and thus can reduce the suspected positions more efficiently and effectively. Consequently, the number of suspected positions decreases much faster in our method than in HO-MRF, leading to significant improvement in speed.

Detection metrics.
It is noticeable that for Passageway sequence, POM + KSP and our method, which exploit the spatiotemporal constraints, perform worse than POM and HO-MRF. As mentioned above, the observing condition in Passageway sequence is rather poor, resulting in large number of false alarms and missing detections, which usually occurs continuously. Utilization of spatiotemporal constraints in this case, however, may further deteriorate the performance of detection. Figure 11 shows a typical case.

Identity switches of targets.
As we can see from Figure 11, at the beginning, target 1 and target 2 are detected in frame 200 and assigned with blue box and green box, respectively. During the period starting from frame 205 to frame 225, the occupancy probabilities of corresponding target 1 are small due to the poor lighting conditions. This leads to the association of target 1's track with false alarms in the neighborhood of target 1, as shown by the blue box in frames 205~210. At frame 215, the blue box is assigned to target 2. Because of the hard constraint in KSP that a single position can be associated with only one track, the green box originally assigned to target 2 is moving to target 3, as shown in frames 210~240. After frame 225, the targets 1 and 2 are close to the camera and the calculated occupancy probability maps are much reliable. However, target 1 cannot be assigned to any track until another target moves into its neighborhood. This is mainly because the track corresponding to target 1, that is, the blue box, is now associated with target 2. And at the same time, in our network model the virtual nodes
Figure 12 shows tracking metrics of POM + KSP and our method on different sequences. We can see that our method is superior to POM + KSP on all sequences in terms of MOTA and MOTP. There are two main reasons. (1) POM + KSP follows the paradigm of detection-tracking, so detection results directly determine the performance of tracking. In order to avoid the computational intractability in the joint state space, POM approximates the probability of occupancy on each individual location by the marginal of a product law which minimizes the KL divergence from the posterior joint distribution. In contrast, by alternating between detection and tracking, our method can deal with the whole state space by enlarging the factor size sequentially and in principle is able to find the global optimal solution with reasonable computation provided that no misdeletion of true occupied positions occurs. (2) POM utilizes foreground information only, while our method combines the information from foreground detection and pedestrian detection in a principle way by defining a novel potential function, thanks to the modeling power of factor graph model.

Tracking metrics.

Components of tracking metrics.
Figure 14 shows some examples of pedestrian detection and tracking results in our experiments. Rows 1 and 2 come from the Terrace sequence; rows 3 and 4 come from the PETS 2009 sequence; rows 5 and 6 come from the Passageway sequence; rows 7 and 8 come from the Basketball sequence. Here, the odd rows are the results of POM + KSP, and even rows are the results of ours.

Tracking examples.
In Terrace sequence, we can see persons labeled 6 and 7 are occluded seriously in views 1, 2, and 4. POM + KSP fails to detect them and produces a false alarm near person 6. In contrast, our method locates all the 7 persons successfully without producing any false alarm. In PETS 2009 sequence, there is a period during which two pedestrians labeled 5 and 6 walk in parallel and can only be seen in one view. This is a very difficult scenario for detection and tracking, and POM + KSP loses the tracks of these two persons completely during this period. But our method can still track them. In Passageway sequence, the foreground segmentation is unreliable due to the underexposed or saturated image. As a result, in the frame shown in Figure 12, POM + KSP takes a false alarm as detection of person 3 and misses the true person. In contrast, our method can locate person 3 correctly. In Basketball sequence, due to the complex occlusion and severe illumination, POM + KSP produces a lot of false alarms and fails to detect persons labeled 6, 7, 8, and 9. Our method takes advantage of the factor graph model and pedestrian detector, improving the accuracy of detection and tracking significantly.
4.7. Run Time
We implement HO-MRF and the proposed algorithm with Matlab code in common PC and use the code provided by the authors of [6, 11] for implementing POM and POM + KSP. The runtime is varying with the length of sequence and the number of locations. We report the average time for processing a single frame of each sequence, as shown in Table 2.
Run time (s/frame).
The probabilistic inference of HO-MRF and our method take long time due to complicated dependency structure between locations. However, by alternating detection and data association, our method reduces the number of suspect positions significantly and speeds up the detection. Thus our method achieves good compromise between accuracy and speed. We note here that POM and POM + KSP are implemented with highly optimized C code, which is very efficient in computing. The speed of our algorithm can be improved by optimizing the coding.
4.8. Limitations
In detection phase, we choose HOG detector, which represents the contour information of the whole detection window. The performance of HOG detector is not ideal in the case of crowd or serious occlusions. However, the detection result is still reliable because of the fusion of multiple visual observations. Nevertheless, if pedestrian detector based on deformable part model is adopted, occlusion can be dealt with more effectively, and detection accuracy can be further improved.
In tracking phase, we care about not only the location of targets, but also the identity of different targets. In this paper, we associate with the whole sequence to achieve the global optimal solution for multiple target trajectories. But identity switches still exist due to undistinguished observation between pedestrians in case of dense crowd or frequent occlusions. In order to solve this problem, establishing distinguished appearance model may be helpful.
5. Conclusions
This paper focuses on people tracking using camera networks with overlapping views and proposes a novel framework alternating localization and data association. The localization is realized by probabilistic inference on factor graph model through belief propagation algorithm. It makes modeling the likelihood of the entire image possible and thus avoids nonmaximum suppression. Data association is converted to min-cost flow problem solved by
Footnotes
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgment
This work is supported by the National Natural Science Foundation of China, under Grant no. 61174020.
