Abstract
Kinect sensors are able to achieve considerable skeleton tracking performance in a convenient and low-cost manner. However, Kinect sensors often generate poor skeleton poses due to self-occlusion, which is a common problem among most vision-based sensing systems. A simple way to solve this problem is to use multiple Kinect sensors in a workspace and combine the measurements from the different sensors. However, this method creates a new issue known as the data fusion problem. In this research, we developed a human skeleton tracking system using the Kalman filter framework, in which multiple Kinect sensors are used to correct inaccurate tracking data from a single Kinect sensor. Our contribution is to propose a method to determine the reliability of each tracked 3D position of a joint and then combine multiple observations based on measurement confidence. We evaluate the proposed approach by comparison with the ground truth obtained using a commercial marker-based motion-capture system.
1. Introduction
Capturing human motion can provide useful information for various robotic applications. For example, human activity recognition based on skeleton tracking [26] might be a fundamental component for human robot interaction [24, 23], whereby humans and robots can cooperate in tasks in industrial environments based on human skeleton tracking [21]. Transferring human skills and teleoperation of human motions in robots require 3D body/hand poses [22, 25]. There are two ways to capture human motion: marker-based motion capture, and markerless motion capture. One drawback of marker-based motion capture is that the performing subject has to wear a suit covered with sensors or markers. On the other hand, it is not necessary for the subject to wear a suit in markerless motion capture, which has helped the technology gain significant attention recently. However, markerless motion capture is challenging in the absence of accurate depth information. The release of the first version of Microsoft Kinect made a low-cost sensor for fast and high-quality dense depth images available for the first time. The introduction of Kinect had a significant impact in robotics as well as full-body pose tracking.
The second version of the kit, Microsoft Kinect for Windows v2 (Kinect v2), was released and made available to researchers in 2014. This new generation offers a higher resolution and a wider field of view compared to the original Kinect technology. Further, in terms of depth, Kinect v2 is based on the time-of-flight principle, whereas the previous version utilized structured light to reconstruct the third dimension. This has led to a considerable improvement in the accuracy of depth sensing.
To enable the use of Kinect sensors for developers and researchers, the official Microsoft SDKs (Software Development Kits) 1.0 and 2.0 are freely available for Kinect v1 and v2, respectively. These SDKs provide not only the appropriate drivers, but also a set of functions such as a skeleton tracker and code samples that can be used for custom implementations. The skeleton tracker works by classifying each pixel of the depth image as part of a joint using trained decision forests [1]. The initial version of the Kinect SDK allowed tracking of up to 20 body joints, and the second allows tracking of up to 25, including fingertips and thumbs. Moreover, due to the enhanced depth sensor, tracking accuracy has been significantly improved. Therefore, in this work we developed our skeleton tracking system based on Kinect v2.
Although Kinect v2 provides better tracking results compared to Kinect v1, it often generates poor skeleton poses in the presence of self-occlusion, which is common among most vision-based sensing systems. One simple way to solve this problem is to use multiple cameras in the workspace. For instance, if a view of a body part is blocked from one camera, it might be possible to obtain a view of the body part from another camera. Appropriately combining data obtained from multiple Kinect sensors in this way can be used to achieve more accurate tracking compared with a single sensor.
However, the use of multiple sensors introduces a new problem called “data fusion”. Therefore, it is necessary to determine a way to estimate the confidence of each measurement and to combine multiple observations in a suitable manner based on these confidence values.
The data fusion problem can be investigated based on various existing models. Before determining a specific model to tackle the problem, it is necessary to consider the characteristics of the tracked skeleton pose. Specifically, the estimated position of a joint often varies discontinuously, and is noisy. Probabilistic filtering methods such as Kalman filtering may be suitable to address this problem. The transition model in probabilistic filtering usually helps to remove jitters and promote temporal continuity.
Data fusion with Kalman filtering has been studied by several researchers [2–5]. The two most commonly used kinds of fusion method for Kalman filtering are state-vector fusion methods and measurement fusion methods. State-vector fusion methods use a group of Kalman filters to obtain individual sensor-based state estimates, which are then fused to acquire an improved joint state estimate. Conversely, measurement fusion methods directly fuse the sensor measurements to a weighted or combined measurement, and use a single Kalman filter to access the final state estimate based upon the fused observation. Measurement fusion methods generally provide better overall estimation performance, while state-vector fusion methods have a lower computation requirement and the advantage of allowing parallel implementation. As mentioned above, the number of joints in the Kinect body model is only 25. Thus, even if we consider the x, y, and z-coordinate for each joint, the dimension of the state vector is only 75, meaning that the computational cost of Kalman filtering is relatively low. Therefore, we used the measurement fusion method in our model.
In this paper, we developed a human skeleton tracking system in which Kalman filtering employing a weighted measurement fusion method was used to combine different tracking results acquired from multiple Kinect sensors. Our contribution in this work is to propose a method to determine the reliability of each tracked 3D position of joints and combine multiple observations according to measurement confidence.
The remainder of the paper is organized as follows. Section 2 provides a survey of the current literature related to skeleton tracking. Section 3 briefly describes Kalman filtering with weighted measurement fusion. In Section 4, the proposed model is described with an emphasis on the details of the proposed measurement fusion method. Section 5 presents our experimental setup and an evaluation of the performance of the proposed model. Finally, we present our conclusions in Section 6.
2. Related Works
Skeleton tracking algorithms can be classified in several different ways. First, we can consider tracking-based methods [6–8] and single-frame-based approaches [1, 9]. The former has a tracking step while the latter estimates the body pose based on a single frame, and has no tracking step. Tracking-based methods require initialization and encounter the issue of re-initialization of the tracked model whenever the tracking fails, while single-frame-based approaches are difficult because they do not make any assumptions for time coherence.
Shotton
Skeleton tracking algorithms can also be classified into single-view-based models [10–12] and multi-view-based models [13, 14]. The 3D body pose that is estimated using a single view frequently has problems determining positions of joints during self-occlusion motions. Therefore, approaches that utilize multiple views have recently begun to receive significant attention.
For example, Zhang
The method presented in this paper is based on the pose estimation results provided by Kinect SDK 2.0, and differs from the methods used by the studies described above. Specifically, our goal was not to develop a method that estimates 3D positions of body joints directly from raw depth images or RGB images, but rather to investigate how to generate more realistic and accurate full-body poses in real time by combining the initially-estimated multiple 3D body poses.
Skeleton tracking performed by a single Kinect sensor also has problems of discontinuous movements, such as unwanted vibration and bone-length variation, because self-occlusion usually results in failure of 3D joint position estimation. Relatively few studies have sought to tackle this problem using multiple Kinect sensors.
Masse
Yeung
Both studies described above adopt approaches similar to that presented in the following. First, they do not conduct skeleton tracking based directly on raw depth or RGB images. Second, instead of this, they use estimated skeleton tracking based on multiple Kinect sensors as input, and then refine the initial tracking results based on a different sensor fusion method. However, if we consider the detail of the proposed method, our approach differs from the two methods mentioned above in its determination of the reliabilities of each 3D pose, which are obtained from multiple Kinects. Furthermore, unlike in our method, Yeung
3. Kalman Filter with Weighted Measurement Fusion
In this section, we briefly describe Kalman filtering and the data fusion method for achieving Kalman-filter-based multi-sensor data fusion. As mentioned in the Introduction, the two most commonly used fusion methods for Kalman filtering are state-vector fusion and measurement fusion. Because the method used in this study involves directly combining sensor measurements to generate a weighted fused measurement, we only describe the measurement fusion method. For further detail see [3].
Consider a linear system. The dynamics and the sensors are modelled by the following discrete-time state-space model:
where
where the dimension of measurement is
In measurement fusion, the
As shown in Eq. (9), the fused measurement vector
Given the state-space model described by (1) and (2), the Kalman filter provides an unbiased and optimal estimate of the state-vector in the sense of minimizing estimate covariance. The algorithm works in a two-step process, comprising prediction and update steps. The prediction step of the Kalman filter is given by:
where
where
4. Proposed Measurement Fusion Method
Equations (7–9) tell us that in order to fuse measurements appropriately, the most important issue is to determine the covariance matrix of each measurement. Specifically, our method controls the fusion of data by adjusting the augmented measurement noise vector,
However, before explaining the detail of the proposed measurement fusion method, it should be noted that to the best of our knowledge, it is not currently possible for multiple Kinect v2 sensors to be attached to a single PC. Therefore, in this study, each Kinect device was connected to a separate PC. To fuse group sets of measurements, we time-synchronized the local PCs and compared the times at which they acquired measurements.
4.1. Determining measurement noise, v
k
, based on the predicted state
A measurement vector at time index
where
In order to assign reliability of the estimated 3D joint positions acquired from the
where
In our model, the state transition matrix
where
The mean vector of the distribution in Eq. (17) is
Equation (9) tells us that if the measurement noise value of the observation is small, the observation contributes strongly towards determining the fused measurement, whereas the observation might have a weak influence on the combined observations if the value is large. Here, it should be noted that the measurement noise value in Eq. (19) decreases as the probability becomes greater, and vice versa. Consequently, if the probability in Eq. (19) is low, the corresponding observation
4.2. Determining measurement noise, v
k
, based joint motion continuity
The second technique used in this study for assigning reliability to a tracked skeleton is based on estimating continuity in human motion. As mentioned in Section 2, discontinuous motion is usually caused by a failure to estimate the 3D position of a joint, and can be further categorized into two cases. In the first case, the joint is actually moving fast, but the velocity of the joint calculated based on both the current and previous measurements is slow. In the first case, the joint is actually moving slowly, but the velocity of the joint as calculated based on the current and previous measurements is fast. In both cases, it is important to classify whether the joint is actually moving fast or slow: referred to respectively as
Algorithm I outlines the proposed method to detect the two types of discontinuous joint motion and assign high measurement noise value to the joint moving discontinuously. In this approach, the velocity of each measurement is not actually estimated, but instead is implicitly calculated by computing the distance between the current observation and previously estimated 3D joint positions in Kalman filtering. If the distance is greater than the threshold
Note that the maximum value among
5. Experiments
5.1. Experimental setup
We implemented the algorithm proposed in this paper using Visual C++ and the Microsoft Kinect SDK 2.0 on Windows 8 OS. All experimental tests were run on a PC with an Intel Core i5 1.8GHz processor and 4GB RAM. The Microsoft Kinect SDK 2.0 can extract skeleton data at approximately 30 frames per second (fps). The proposed skeleton enhancement algorithm generates results at a speed of 10 msec per frame, on average. As a result, the final tracking ran at approximately 20-25 fps, allowing real-time extraction of skeleton data.
To evaluate the proposed method in terms of the precision of 3D positions of the human body joints, we employed an OptiTrack motion-capture system to provide a set of ground truth trajectories. In addition, for quantitative comparison, the joint trajectories were recorded by a single Kinect and averaged across multiple Kinect sensors. We deployed five Kinect sensors covering 180°, resulting in an average angle of approximately 45° between two Kinect sensors. Berger [20] summarized state-of-the-art papers which have built on RGB-D camera setups that include two or more Kinect v1 sensors. In this study, different Kinect sensor setups for motion estimation, reconstruction, tracking, and recognition are introduced, including the right number of sensors with appropriate position, orientation, etc. These recommended conditions are determined mainly to mitigate interference error caused by inherent properties of the Kinect v1 sensor, and are changed depending on the type of task, for example tracking, reconstruction, or recognition, and the experimental setup, for example size of room.
However, there seems no agreement on the right number and appropriate poses of Kinect v2 sensors for effective skeleton tracking. Two mutually exclusive criteria that can be suggested intuitively are that the number of Kinect sensors should be as small as possible and that the size of the self-occlusion region should be reduced as much as possible. Some researchers [17, 19] have employed three cameras to tackle the occlusion issue in skeleton tracking, but we found in our experimental setup that five Kinect sensors can achieve the best performance. The distance from each Kinect to the centre position was about 3 m and the height of each Kinect above the ground plane was 130 cm.
Five Kinect sensors were extrinsically calibrated using a checkerboard. Specifically, the local coordinates of the centre Kinect sensor were considered as the global reference frame, and each Kinect and the motion-capture system were then calibrated in relation to the centre Kinect sensor, so that each 3D skeleton pose was represented in the global coordinate system.
A total of 25 joints are supported by Kinect v2:
The parameters used in our method were sampling interval
5.2. Results and discussion
Experiments were conducted to compare the precision of the proposed algorithm with that of a single Kinect sensor and simple averaging of five skeleton poses. We tested ten activity classes consisting of
Figure 1 shows the average 3D position error of the ten activities compared to the motion sequence captured by the motion-capture system. In the figure, the blue, green, and red bars represent the level of error produced by the centre-Kinect, simple-average, and proposed methods, respectively. In evaluating these data, it should be noted that the positions of the corresponding joints between the motion-capture system and Kinect were not matched perfectly, For example, if a performing subject was just standing facing the centre Kinect, i.e., the centre Kinect observed the performing subject without self-occlusion, the average position error of 16 joints observed by the centre Kinect was approximately 5.5 cm.

The average 3D position error of ten activities compared to the ground truth in a motion sequence. Blue, green, and red represent errors produced by the centre-Kinect, simple-average, and proposed methods, respectively. Standard deviations are represented by black lines.
We noted that for the activities of
Figure 2 (a-c) shows the trajectories of the x, y, and z positions of the right knee joint during

Graphs are obtained during
For qualitative comparison, the tracking results generated by three comparative approaches are displayed along with increasing frame number. Figure 3 shows 3D skeleton poses obtained by the motion-capture system and the centre-Kinect, simple-average, and proposed methods during

6. Conclusions
The goal of this paper was to enhance the skeleton poses extracted using a single Kinect v2, which often generates poor poses due to self-occlusion. To solve this problem, we proposed a real-time motion-tracking technique utilizing multiple Kinect sensors. In this approach, the major technical problem was how to resolve inconsistencies between the positions of each joint reported by the multiple Kinects. To address this issue, rather than changing the motion extraction method, we employed a Kalman filter framework to fuse five skeletons extracted per frame.
The main contribution of the method described in this paper is the ability to determine the reliability of each tracked 3D position of a joint and combine multiple observations based on confidence. We evaluated the proposed method by comparison with the ground truth acquired from a commercial motion-capture system and results generated by a single Kinect sensor and a simple averaging technique. Among the different methods, our approach was the most robust and accurate, and performed considerably better than the two comparative approaches in the presence of occlusions.
Footnotes
7. Acknowledgements
This work was supported by the Global Frontier R&D Program on “Human-centered Interaction for Coexistence” funded by the National Research Foundation of Korea, grant-funded by the Korean Government (MSIP) (2010-0029744).
