Abstract
Human object detection, tracking, and recognition have applications in many areas, such as the development of assistance robots and intelligent monitoring systems. The emergence of RGB-D cameras such as the Kinect v2 has simplified human object detection and tracking. Color-space methods depend on lighting conditions, whereas skeleton-tracking algorithms, being based on depth images, are comparatively light invariant. However, skeleton information may occasionally be incorrect or become lost, so an algorithm for human-target recognition is required. This study therefore proposes a human-target tracking and recognition system that combines RGB images, depth images, body index information, and skeleton information. The system first extracts the color information of five body parts (the two upper arms, the torso, and the two thighs) using color, depth, and skeleton information. It then analyzes this color information using a mixed nine-dimensional histogram and single-color analysis method. The algorithm also includes overlap detection during human-target tracking to prevent misidentification caused by occlusion. To test the proposed system, various scenarios were designed to simulate the complex environmental changes characteristic of the real world, and a dynamic event statistics method was used to collect the results. Experiments revealed that the proposed method is robust under varying lighting conditions and increases the success rate for individuals wearing similar monochrome clothing.
Introduction
Assistance robots have attracted increasing attention in recent years. They are being developed not only for workplaces but also for the daily lives of individuals. Assistance robots for daily life should be easy to operate and human-friendly; thus, the software may enable users to interact with the robot using gestures1,2 or their voice,2 and an assistance robot may be programmed to follow a specific person.3,4 Such following robots have a variety of applications, including security, monitoring, elder care, and helping humans pick up objects.
Human object detection, tracking, and recognition are critical for robots that must follow a particular person (hereafter called following robots). Many previous studies used a stationary camera to avoid the difficulty of separating the foreground from the background,5 leaving camera movement as a problem requiring further attention. The emergence of the Kinect v2 RGB-D camera (Microsoft) has simplified background removal by providing skeleton-tracking information with a player ID.6 In general, skeleton-tracking information is reliable for use by following robots and for person reidentification. However, skeleton tracking may be subject to error when the target is occluded by another person on the edge of the detection region of the infrared sensor or when occlusion by more than one person occurs. Herein, an error in skeleton-tracking ID is defined as a change in the tracking ID from one person to another between successive frames. This phenomenon is observed when a tracked individual is occluded by an untracked individual and must be considered when using the Kinect v2 and its software development kit (SDK) in a human-tracking algorithm. In addition to errors in skeleton-tracking ID, the skeleton-tracking ID becomes lost when the tracked person is occluded or moves out of the tracking range of the Kinect v2. Therefore, an algorithm for person reidentification is required.
Several methods have been proposed for person reidentification,7–9 such as the use of pictorial structures for estimating body parts.10 Full-body maximally stable color regions (MSCR) and HSV color histograms of body parts are then extracted to perform reidentification. Experiments have demonstrated that although the significance of body parts varies among data sets, the torso is always the most critical.
One study used depth information to extract the shirt region.11 The HSV color histogram of the shirt could then be extracted to perform reidentification. Because only nine histogram bins were extracted, the feature was less sensitive to varying lighting conditions but also less discriminative for similar colors compared with other methods. How et al.12 also used shirt information for reidentification, adopting the speeded-up robust features algorithm to extract the shirt pattern as the reidentification feature. In their experiments, a dark-colored shirt with a pattern in the middle was identified as the most effective. However, the algorithm was ineffective for bright colors because folded parts of a shirt were marked as features.
Imani and Soltanizadeh 13 proposed a method based on depth and skeleton information in which a human-based object was separated into three parts (head, torso, and legs), and histograms were extracted from each depth-image region. The histograms were produced using descriptors of local binary patterns, local derivative patterns, or local tetra patterns. Additionally, nine anthropometric measurements were extracted using skeleton information provided by Kinect. The feature used for person reidentification was information obtained from anthropometric measurements and the histograms of local pattern descriptors. Because this method was based on depth information, it was more invariant to illumination conditions than were methods based on color information.
In another study,14 anthropometric measurements were combined with information on human color appearance to form an individual's reidentification feature. During experimentation, the fused feature achieved favorable performance compared with clothing appearance alone. Zheng et al.7 offered a fairly comprehensive survey of the development of person reidentification techniques. Although their focus was on reidentification after certain changes, such as those caused by aging, they provided a useful classification of the available techniques; results specific to following robots, however, were less common. Following the study by Pala et al.,14 Sun et al.6 proposed a similar method in which information on clothing appearance and body size was combined into a reidentification feature; their experiments revealed superior performance over other methods. Furthermore, the clothing-appearance method outperformed the body-size method because of substantial joint-position errors in some frames in the latter. To overcome the problem of measurement errors, they used 50 samples to construct the information on a selected human target. How et al.12 proposed the speeded-up robust features (SURF) algorithm as a fast and robust method for moving human-target identification. They targeted human-following applications, but their experiments were based on a stationary camera. Cao and Hashimoto15 proposed a method based on Kinect skeleton information; however, this method was affected by noise during skeleton measurement that could not be mitigated even when 10 skeleton data samples were used. Abdul Shukor et al.16 used only the Kinect for human recognition. They reported the most appropriate scene for skeleton-based human detection, but all test cases took seconds to achieve detection. In a separate study, Xiao et al.17 used a single laser scanner to detect human legs.
Their research did not cover cases involving multiple persons. In fact, laser range finders have been proposed in many studies.18,19 Ultrawideband technology has also been proposed for human tracking.20 That method is robust to environmental change, but the user must wear a tag. Given a successfully identified human location, many studies have discussed tracking strategies for the robot.21–23 To make use of more available information from a person, Khedher et al.24 proposed using multiple shots to fuse movement style into the recognition procedure. They achieved strong performance; however, the camera in their setup was fixed. Tsun et al.25 considered the trajectory-planning problem of a person-following robot. They proposed an elegant fusion algorithm but presented only simulation results. A small survey of human tracking is also available,26 but it is aimed at the use of multiple stationary cameras. An interesting and very recent result is by Wang et al.,27 who proposed the fusion of a monocular camera with an ultrasonic sensor. They did not address the multiple-person scenario, but they were able to operate their robot outdoors.
This study introduces a human object recognition and tracking method based on the consumer camera of the Kinect v2. Human object detection and tracking using the Kinect v2 and its SDK with an occlusion detector can reduce the occurrence of errors in skeleton-tracking ID. After the human object has been detected, the color information of five body parts (two upper arms, torso, and two thighs) is extracted for recognition. The extraction of features consists of two phases. First, the depth and skeleton information obtained using the Kinect v2 sensor is employed to segment the five body parts in the color image. Second, color features are obtained using mixed nine-dimensional histograms and the single-color analysis method.
The main contribution of this research is the segmentation of five body parts and the development of an integrated mixed nine-dimensional histogram and single-color analysis method. The remainder of this study is organized as follows. The second section introduces the system structure and the segmentation of the five body parts. The third section presents the mixed nine-dimensional histogram and single-color analysis method. The experimental results are provided in the fourth section, and conclusions are presented in the fifth section.
System structure and body-part segmentation method
System structure
Figure 1 is a flowchart of the vision system used in human tracking. Tracking human targets is crucial to the functionality of following robots, which require interfaces for selecting their human targets. In this study, a human target was selected through the recognition of a series of hand-gesture sequences implemented using the gesture recognition technique of the Kinect v2. Using hand gestures to select a human target makes the system more user-friendly. When first obtaining information regarding the human target, the system must obtain information for all five body parts.

Vision system flowchart.
The flowchart comprises two decisions. First, the tracking state of the human target is determined using the player ID information provided by the Kinect v2. When the same human-target player ID is detected in the subsequent image frame, reidentification is not performed, which minimizes computation time. The second decision is the "overlap-detection method," which is discussed later regarding its use in preventing misidentification caused by occlusion.
Body-part extraction
The body-part extraction method involves the series of steps displayed in Figure 2. First, the color image and body index image, which are provided by the Kinect v2 sensor, are used to remove the background. If the pixel value of the body index image is between 0 and 5, then the pixel belongs to player ID k; if it is greater than 5, then the pixel belongs to the background.
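This per-pixel rule can be sketched as follows; the function name and the plain nested-list representation of the body index frame are illustrative conventions, not part of the Kinect SDK:

```python
def extract_player_mask(body_index, player_id):
    """Binary mask keeping only pixels that belong to `player_id`.

    Kinect v2 body index values 0-5 identify up to six tracked players;
    larger values (typically 255) mark the background.
    """
    return [[1 if pixel == player_id else 0 for pixel in row]
            for row in body_index]
```

Multiplying the color image by this mask yields the no-background image used in the next step.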

Body-part extraction flowchart.
Second, the background-free color image is combined with skeleton information to extract five specific body parts: the two upper arms, the torso, and the two thighs. The extraction regions are indicated by yellow rectangles in Figure 3, and Table 1 lists the corresponding joints. The lengths of the torso and the two thighs are adjusted during body-part extraction; the middle three-fifths is defined as the length because an uncertain area between the two parts may influence both. The lengths of the two upper arms are also adjusted; the lower three-fifths is defined as the length because, when a person is facing away from the Kinect camera, some areas of the upper arms fall within the torso region. The extraction region is constructed from points C, D, E, and F (Figure 4). The vertex locations in the image plane are determined from the corresponding joint positions.

Five body-part extraction regions.

Extraction region.
Reference points of the body-part extraction regions.
Widths of body-part extraction.
If line
Mixed nine-dimensional histogram and single-color analysis method
The feature vectors of the body parts are produced using the color information of the extracted body-part regions, which consist of the torso, the mixed upper arms, and the mixed thighs. The upper arms and the thighs are each mixed to prevent problems of left-right reversal and changes in thigh lighting conditions during walking. First, the method transforms the extracted body-part regions from an RGB into an HSV color space representation to increase robustness under varying lighting conditions. Second, based on the concept of Southwell and Fang,11 nine-dimensional histogram feature vectors are constructed in the HSV color space. Third, if one of the six chromatic colors (red, yellow, green, cyan, blue, and magenta) in the nine-dimensional histogram feature vector accounts for more than 90% of all color during the process of human-target selection (the first-time construction of human-target information), this region is labeled as monochrome, and the single-color analysis method is used to produce a dominant-color feature vector. In the reidentification process, the condition becomes 85% if the stored human-target region is labeled as monochrome and 95% if it is labeled as nonmonochrome. Fourth, the confidence ruling method11 or fuzzy sets are used to compare the feature vectors and recognize the human target. These steps are described in detail as follows.
Color space conversion
The RGB information of the body parts is converted into the HSV color space using equations (7) to (9).
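Equations (7) to (9) are not reproduced in this excerpt; the sketch below assumes the standard RGB-to-HSV conversion, which such methods typically follow:

```python
def rgb_to_hsv(r, g, b):
    """Standard RGB-to-HSV conversion (assumed form of equations (7)-(9)).

    r, g, b are in [0, 255]; returns h in degrees [0, 360), s and v in [0, 1].
    """
    r_, g_, b_ = r / 255.0, g / 255.0, b / 255.0
    v = max(r_, g_, b_)
    c = v - min(r_, g_, b_)          # chroma
    s = 0.0 if v == 0 else c / v
    if c == 0:
        h = 0.0                      # hue undefined for gray; use 0 by convention
    elif v == r_:
        h = 60 * (((g_ - b_) / c) % 6)
    elif v == g_:
        h = 60 * (((b_ - r_) / c) + 2)
    else:
        h = 60 * (((r_ - g_) / c) + 4)
    return h, s, v
```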
Feature vector extraction
Feature vector of the nine-dimensional histogram method
The nine-dimensional histogram feature vectors are constructed using the concept proposed by Southwell and Fang.11 The feature vector for each body-part region is formed from the nine histogram bins.
HSV regions for feature-vector extraction.
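As an illustration, the nine-dimensional histogram can be sketched as below. The bin boundaries (`s_th`, `v_black`, `v_gray`, and the 60-degree hue sectors) are assumptions standing in for the paper's HSV-region table, not its actual values:

```python
CHROMATIC = ["red", "yellow", "green", "cyan", "blue", "magenta"]

def hsv_to_bin(h, s, v, s_th=0.2, v_black=0.2, v_gray=0.8):
    # Assign an HSV pixel to one of 9 bins: 6 chromatic + black, gray, white.
    if v < v_black:
        return "black"
    if s < s_th:
        return "white" if v > v_gray else "gray"
    # Six 60-degree chromatic sectors, centered on red at hue 0.
    return CHROMATIC[int(((h + 30) % 360) // 60)]

def nine_dim_histogram(pixels):
    # pixels: iterable of (h, s, v) tuples; returns bin-name -> fraction.
    bins = {name: 0 for name in CHROMATIC + ["black", "gray", "white"]}
    for h, s, v in pixels:
        bins[hsv_to_bin(h, s, v)] += 1
    n = max(1, len(pixels))
    return {name: count / n for name, count in bins.items()}
```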
Feature vector of the single-color analysis method
Whether body-part region i is monochrome is determined after the nine-dimensional histogram feature vectors have been extracted. The region is labeled as monochrome if one of the six chromatic color regions accounts for more than 90% of all color during the process of human-target selection. During the reidentification process, the condition is that one of the six chromatic color regions accounts for more than 85% of all color when the stored human-target region is labeled as monochrome, or more than 95% of all color when the stored human-target region is labeled as nonmonochrome.
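The three thresholds can be collected into one decision rule. The sketch below is a hypothetical helper, with the histogram passed as a name-to-fraction mapping:

```python
CHROMATIC_COLORS = ("red", "yellow", "green", "cyan", "blue", "magenta")

def monochrome_label(hist, phase, stored_label=None):
    """Decide whether a body-part region is monochrome.

    Thresholds follow the text: >90% at human-target selection; during
    reidentification, >85% if the stored region was monochrome and >95%
    otherwise.
    """
    dominant = max(CHROMATIC_COLORS, key=lambda c: hist.get(c, 0.0))
    if phase == "selection":
        threshold = 0.90
    else:  # "reidentification"
        threshold = 0.85 if stored_label == "monochrome" else 0.95
    is_mono = hist.get(dominant, 0.0) > threshold
    return is_mono, dominant
```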
When body-part region i is labeled as monochrome, the single-color analysis method is used.
In step 1, the hue and saturation values of body-part region i* are extracted and collected to form a hue vector
HSV regions for mapping.
Expectation calculation
Expectation of the nine-dimensional histogram method
The expectation of the nine-dimensional histogram method is calculated using the confidence ruling method,11 which compares the current feature vector with the stored nine-dimensional histogram feature vector of body-part region i of player ID k.
Expectation of the single-color analysis method
The expectation of the single-color analysis method is calculated using the dominant color j* and fuzzy sets, as demonstrated in Figure 5. The symbol "*" indicates that a region has been labeled as a single color and appears only after other symbols. If the dominant color of the human target

Fuzzy sets.
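The membership functions of Figure 5 are not fully specified in this excerpt; the sketch below assumes simple triangular fuzzy sets over the hue circle, with an illustrative 60-degree support:

```python
def hue_distance(h1, h2):
    # Shortest angular distance on the 360-degree hue circle.
    d = abs(h1 - h2) % 360.0
    return min(d, 360.0 - d)

def fuzzy_expectation(h_target, h_candidate, width=60.0):
    # Triangular membership: 1 for identical hues, falling to 0 at `width` degrees.
    return max(0.0, 1.0 - hue_distance(h_target, h_candidate) / width)
```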
Subtotal and total expectation
The total expectation is calculated using equations (21) to (23). Subtotal expectation 1,
Human-target reidentification
After the total and subtotal expectations have been calculated using equations (21) to (23), human-target recognition is performed in several steps using two thresholds, as illustrated in Figure 6.

Flowchart of human-target reidentification.
Overlap-detection method
The Kinect skeleton-tracking player ID may alternate from one person to another between frames. This occurs in the event of occlusion between two people on the edge of the detection region of the infrared sensor or during occlusion by more than one person; in particular, it occurs when one person is tracked and another is untracked. This phenomenon is problematic when the player ID is used as the tracker for a human target. To overcome this problem, a method for overlap detection was developed in which a specific joint on the skeleton of the human target is tracked and its displacement between consecutive frames is monitored.
In this study, spine_shoulder was selected as the tracking joint because the probability of a "not tracked" state in the event of overlap is lower for this joint than for other joints. The displacement threshold is selected according to the estimated speed of the human target.
The estimated speed of the human target is used to determine whether the target is moving at a high or low speed. The speed is estimated by dividing the spine_shoulder displacement by the time elapsed between frames.
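The displacement-and-speed test above can be sketched as follows. The threshold values and the frame interval are placeholders chosen for illustration, since the paper selects its parameters by trial and error:

```python
def displacement(p_prev, p_curr):
    # Euclidean displacement of the spine_shoulder joint between frames.
    return sum((a - b) ** 2 for a, b in zip(p_prev, p_curr)) ** 0.5

def overlap_detected(p_prev, p_curr, dt, low_th, high_th, speed_th):
    """Flag a possible ID swap when the joint jumps farther than expected.

    A larger displacement threshold is applied when the target moves fast.
    All thresholds here are hypothetical placeholders, not the paper's values.
    """
    d = displacement(p_prev, p_curr)
    speed = d / dt                        # meters per second
    threshold = high_th if speed > speed_th else low_th
    return d > threshold
```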
Experimental results
A ThinkPad T540p laptop computer with a Core i7-4710MQ 2.5-GHz CPU and 8 GB of RAM was used to evaluate the performance of the vision system. Videos were recorded using the Kinect v2 sensor with Kinect Studio. The system was developed using Visual Studio 2012 in Windows 8, and OpenCV 3.0 was used to display the resulting images.
Database
To test the proposed method, we recorded approximately 200 videos using Kinect Studio v2.0, provided with Kinect SDK v2.0. The duration of each prerecorded video was approximately 15 s. The videos contained 8 scenes and 26 users. Some scenes were recorded under more than one illumination condition, and most users were occluded in at least one video. Furthermore, some of the users in the prerecorded videos were jogging, holding objects, or jumping. The videos were recorded on different days; therefore, the illumination conditions may have differed for the same scene. Data regarding the prerecorded videos and the corresponding user numbers are presented in Tables 6 and 7. The first column of Table 7 indicates the teaching environment of the human target.
Prerecorded video data.
Scenes, illumination conditions, and users in the prerecorded videos.
Event statistics method
A confusion matrix was used to evaluate the performance of the proposed method. Four possible outcomes were defined as follows. True positive (TP): skeleton of human target exists and TRUE human target is reidentified. True negative (TN): skeleton of human target does NOT exist and human target is NOT reidentified. False positive (FP): FALSE human target is reidentified. False negative (FN): skeleton of human target exists and human target is NOT reidentified.
The recorded videos were used to evaluate the tracking methods, and a method was required for presenting the observation results numerically. An event statistics method was proposed for this purpose; its primary aim is to use events to delimit sections that can be tracked without problems. An event is defined as a moment at which a tracking problem occurs, and exactly one outcome (TP, TN, FP, or FN) exists between two consecutive events.
In this study, events are defined at the beginning and end of a video, whenever a skeleton is lost because of occlusion or because the person is located outside the skeleton-tracking range of the Kinect, and whenever misidentification is caused by occlusion. When a skeleton is tracked as the human target, events are based on this skeleton, regardless of the tracking accuracy. When no skeleton is tracked as the human target, events are based on the skeleton of the true human target. Only one outcome is possible between each pair of consecutive events, and these outcomes have different implications; in this study, an FP was considered more significant than an FN. Moreover, because of the design of the statistical method, a TN was used only for the final event interval.
In this study, the event statistics method was used to present the observation results. Accuracy rates and receiver operating characteristic (ROC) graphs 28 are used for discussing the experimental results.
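From the event counts, the reported accuracy and the ROC coordinates follow directly; a minimal sketch:

```python
def event_metrics(tp, tn, fp, fn):
    """Accuracy plus the ROC coordinates (TP rate on the y-axis, FP rate on the x-axis)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    tp_rate = tp / (tp + fn)   # sensitivity
    fp_rate = fp / (fp + tn)   # fall-out
    return accuracy, tp_rate, fp_rate
```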
Thresholds and parameter selection
For the samples collected on a given day, two to four videos and one or two users were randomly selected to determine the thresholds and parameters. In addition to the randomly selected videos, those used for human-target selection were also considered. Each video and user was tested across recording days. A few specific videos that included jogging and jumping were used for parameter selection in the overlap-detection method. The remaining prerecorded videos were used for performance testing. The threshold and parameter selection procedures were as follows. First, the proposed method was tested without the overlap-detection method. The testing included two cases:
The values of the parameters in the overlap-detection method, selected through trial and error, were as follows:
The experimental results obtained from 60 prerecorded videos with 81 occlusions, 15 movements out of range, and 13 misidentifications caused by occlusion are displayed in Figures 7 and 8. The label “our” indicates that the proposed method was used, and “our + odm” indicates that the proposed method was used in combination with the overlap-detection method. First, considering only the results without the overlap-detection method, when

Accuracy results for threshold selection: (left) fixed

ROC graph for threshold selection with fixed
The concepts that aided the selection of small versus large threshold values are as follows: robustness under varying lighting conditions is desired.
Therefore, we selected
Evaluation of the single-color analysis method
As explained in the introduction, the target system is an assistive robot that operates in a dynamic environment; the robot rocks as it follows its master around. As a result, most of the methods we tested failed to provide the continuous recognition needed to fulfill the tracking requirement. After many failed attempts, we realized that the algorithm must not only be accurate but also very fast. We thus faced constraints from both ends: on one end, the algorithm must distinguish its master in a changing environment; on the other, it must complete its computation within a very limited amount of time before the vision system drops frames. Because most previous studies used stationary cameras, few applicable algorithms were available. Here, we compared our approach with the shirt recognition method proposed by Southwell and Fang,11 which is simple enough to achieve fast computation. We set the weighting factors as
Results of the evaluation of the single-color analysis method.
TP: true positive; TN: true negative; FP: false positive; FN: false negative; ODM: overlap-detection method.
In this experiment, we discovered that the single-color analysis method improved performance for monochrome objects. Although the overlap-detection method can overcome the problem of misidentification caused by occlusion, it still has some limitations.
Human-target recognition
To test the performance of the selected thresholds and parameters, two videos were randomly selected for each user, corresponding scene, and day from the remaining prerecorded videos. The results obtained using our proposed method, our method combined with the overlap-detection method, and the shirt recognition method11 were compared. The experimental results, obtained using 164 prerecorded videos with 278 occlusions, 56 movements out of range, and 27 misidentifications caused by occlusion, are presented in Table 9 and Figure 9. Our proposed method with the overlap-detection method demonstrated the highest accuracy, and the shirt recognition method had the lowest. The accuracy of the proposed method was 13.9% and 11.1% higher, and its FP rates were 25.1% and 22.4% lower, than those of the shirt recognition method in the performance test and parameter selection test, respectively. Nevertheless, the TP rates were 2% and 2.5% lower than those of the shirt recognition method. This was because, in some scenes, users wore the same or similar shirts, which are difficult to distinguish with the shirt recognition method, as was the case for users 1 and 2 on day 1; users 7 and 10 on day 3; and all users on days 6-1 and 6-3. Our method outperforms the shirt recognition method because it uses information from five body parts and improves the resolution of the nine-dimensional histogram with respect to chromatic color. However, our proposed method has some limitations with respect to achromatic color. For example, user 18 was often recognized as user 20 on day 6-1. In our experiments, the lighting conditions of a user's thighs changed while the user was walking, and some chromatic colors were converted into achromatic colors when illumination was low; the pants of user 18 had this type of color. This phenomenon was also observed for users 4 and 6 on day 2.
Although the information of five body parts was used in our method, distinguishing between user 23 and 24 on day 6-2 was still difficult.
Results of performance test and parameter selection.
Note: our proposed method with the overlap-detection method has the highest accuracy rate in the performance test, marked in boldface. TP: true positive; TN: true negative; FP: false positive; FN: false negative.

ROC graph: change from parameter selection to performance test. ROC: receiver operating characteristic.
Incorporating the overlap-detection method increased the accuracy of our method by approximately 3.5% and 4.3% in the performance test and parameter selection test, respectively; moreover, the FP rates decreased by 14.8% and 13.4%, and the TP rates decreased by approximately 1% and 0.3%, respectively. The decrease in TP rate occurred because person reidentification is difficult when occlusion occurs. Moreover, skeletons sometimes reported an incorrect position before the correct person was reidentified. Our method is based on the Kinect skeleton-tracking method; therefore, the problem of misidentification caused by occlusion affected system performance. The testing results indicate that incorporating the overlap-detection method increases the performance of the proposed method.
Table 9 and Figure 9 reveal that all accuracies, TP rates, and FP rates obtained were lower in the performance test than the parameter selection test. This is because person reidentification was more difficult in the performance test than in the parameter selection test. For example, in the scenes used in the performance test, the profile of the human target was presented between two events, and the target enacted the motions of pitching and sitting on a chair.
The experimental results revealed a lower accuracy than that reported in other studies because of misidentifications caused by occlusion during dynamic testing; moreover, the scenarios designed in this study are more complex than those used in other studies.
Computation time
To test the computation time of the vision system, images were extracted from the Kinect v2 and processed using the proposed method and shirt recognition method. 11 The computation time was recorded under five conditions: zero, one, two, three, and four people in the field of view (Figure 10). The additional computation time required to add one person to the field of view was then calculated. Additionally, all of the people in Figure 10 were selected as the human target for this test. Finally, the amounts of time required for additional computation were averaged to obtain an average computation time per skeleton.

Test of computation time.
The experimental results indicate that the average computation time per skeleton for the shirt recognition method 11 and proposed method with and without the overlap-detection method were 7.5 ms, 25.9 ms, and 24.5 ms, respectively. The average computation time per skeleton was higher for the proposed method than the shirt recognition method, 11 because more extracted pixels were employed in the method of segmenting five body parts; moreover, identifying monochrome colors required more computation time in the proposed method than in the shirt recognition method. 11 The average computation time per skeleton increased by only 1.4 ms when the overlap-detection method was included.
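The averaging procedure can be sketched as follows; the timing values in the example are made up for illustration, not the paper's measurements:

```python
def per_skeleton_time(total_times_ms):
    """Average additional computation time per skeleton.

    `total_times_ms[i]` is the total processing time with i people in view;
    consecutive differences give the extra cost of each added person.
    """
    extras = [t1 - t0 for t0, t1 in zip(total_times_ms, total_times_ms[1:])]
    return sum(extras) / len(extras)
```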
Conclusions
This study introduced a human object recognition and tracking method based on an RGB-D sensor (the Kinect v2). The method uses depth and skeleton information together with the color information of five body parts for human object recognition. First, a segmentation method is employed to identify the five body parts based on depth and skeleton information. Second, the color information of the five body parts is extracted using the nine-dimensional histogram method or the single-color analysis method. Third, the overlap-detection method is used to prevent misidentification caused by occlusion during the tracking process. The proposed method increases the success rate for distinguishing similar clothing with monochrome colors, and its robustness is maintained under varying lighting conditions. However, the single-color analysis method is not applicable to achromatic colors, and chromatic colors can be converted into achromatic colors under low-illumination conditions. Using information from five body parts for reidentification can distinguish more users than using shirt information alone. The proposed method can also mitigate the problem of misidentification caused by occlusion, which would otherwise produce FPs between events during tracking. As accuracy increased, the FP rate decreased and the performance of the system improved. These results indicate that the proposed method is suitable for recognizing and tracking a person wearing chromatic clothing; further research is required for recognizing individuals wearing achromatic clothing.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported in part by Hon Hai Precision Industry Company Ltd (under contract no. 104-S-A26) and in part by the Ministry of Science and Technologies, Taiwan (under grant 104-2221-E-002-197-MY3).
Supplemental material
Supplemental material for this article is available online.
References
