Hand Gesture Modeling and Recognition for Human and Robot Interactive Assembly Using Hidden Markov Models

Abstract

Gesture recognition is essential for human and robot collaboration. Within an industrial hybrid assembly cell, the performance of such a system significantly affects the safety of human workers. This work presents an approach to recognizing hand gestures accurately during an assembly task while in collaboration with a robot co-worker. We have designed and developed a sensor system for measuring natural human-robot interactions. The position and rotation information of a human worker's hands and fingertips are tracked in 3D space while completing a task. A modified chain-code method is proposed to describe the motion trajectory of the measured hands and fingertips. The Hidden Markov Model (HMM) method is adopted to recognize patterns via data streams and identify workers' gesture patterns and assembly intentions. The effectiveness of the proposed system is verified by experimental results. The outcome demonstrates that the proposed system is able to automatically segment the data streams and recognize the gesture patterns thus represented with a reasonable accuracy ratio.

Keywords

Hybrid Assembly System Human-robot Collaboration Artificial Cognition Hidden Markov Model

1. Introduction

There is strong demand within the manufacturing industry to meet the high-mix, low-volume requirements of changing consumer markets. These demands are accompanied by an ever-increasing number of product variants and smaller lot sizes [1, 2]. A fully robotic manufacturing cell has already been designed and adopted for this purpose [3]. However, a fully robotic manufacturing process cannot achieve sufficient flexibility for a highly variable product line, and it must at the same time remain cost-effective in order to support this demand. By taking advantage of a human's adaptability and flexibility, we can exploit the concept of a hybrid assembly system for medium-sized manufacturing processes. Hybrid assembly creates a modern assembly mode whereby the robot works as a co-worker that collaborates with the human and shares the same working space and time [1, 4, 5]. Within this scenario, the advantages of human-robot collaboration are exploited through an optimized task-scheduling system while the shortcomings are avoided. The realization of a hybrid system can also have a positive impact on society [6]. Previous work has shown that effective collaboration between human and robot can achieve cost-effective performance and reduce the total assembly time (i.e., makespan) and cost of production [6].

The importance of hybrid assembly systems has become increasingly apparent both in industry and academia as regards improving production efficiency in manufacturing processes [7–10]. The main objective of this system is to make use of an individual's intelligence, expertise and flexibility. In this way, the robotic system is also able to take advantage of the human operator's sense, sensibility and resourcefulness to complete a required task. On the other hand, the human can utilize the high-precision, strength and repeatability of the robotic system, and thus reduce fatigue and the risk of injury as well as increase overall work safety [1, 11].

A hybrid assembly cell (HAC), where human and robot work collaboratively within a limited space, is commonly accepted as one realization of hybrid assembly systems [1, 12]. Compared with human and robot collaboration in an open environment, a cell-based assembly can concentrate much more on multi-functional modular manufacturing with small-volume requirements. This task-oriented assembly can be quickly deployed and allocated between human and robot co-workers [6]. However, it remains difficult to model such cooperation quantitatively, and recent research papers offer few effective models describing the interaction between humans and robots. For industrial assembly, it is nevertheless very important to evaluate cooperation performance quantitatively in terms of the time span or errors, since these directly affect the profits that a company can make. We study the mathematical model for human and robot cooperation by building a stochastic Petri net system, as in [13]. It is an event-driven system, whereby the robot detects certain ‘trigger’ events in order to carry out corresponding reactions.

Similar work can be found in [14], where a HAC called the ‘multi-modal assembly-support system’ (MASS) is developed. MASS is equipped with physical support and information support, guaranteeing human workers' safety as well as the assembly task-flow. However, MASS mainly focuses on the safety rules category within the hybrid assembly and disregards any collaboration between the human and the robot co-worker. The recent research in [11] addresses a safety-control strategy for a robot co-worker by monitoring the position of the robot end-effector. Speed control for the end-effector is categorized into different stages within the working space. However, this configuration method profoundly restricts the performance of the robot and lowers collaboration levels. Another platform called ‘joint-action for humans and industrial robots’ (JAHIR, within the CoTeSys Project) is introduced in [15]. JAHIR focuses on monitoring the status of human workers and the assembly work-flow. Two cameras mounted on top of the working area are used to determine the 3D position of the human operator's hands. Tracking of the hands is achieved based on a 3D occupancy-map generated by the cameras. This sensor configuration can only obtain the raw position of the hands, and therefore it cannot achieve accurate pattern recognition. Online 3D occupancy-map generation is also time-consuming, which makes it difficult for the robot co-worker to respond quickly.

Within a HAC system, human action-pattern recognition and intention estimation are the key issues that must be addressed [16]. The assembly tasks assigned to the human and the robot are defined in advance based on a selection of optimal rules [6, 13]. According to this task-flow, the human worker can easily perceive and understand what his partner is doing, while the robot co-worker is limited and unable to perceive and react accordingly. Vision-based sensor systems are already widely applied for non-contact environment-awareness in human-robot interaction (HRI) [17, 18]. However, the recognition of static and dynamic gestures within dynamic environments is difficult. It is essential to isolate the objects from complex and dynamic scenes with cluttered backgrounds. Consequently, pattern recognition on RGB image data combined with depth information (RGB-D data) has been introduced in recent years [19, 20]. In line with this research, methods for addressing static and dynamic gestures have been presented: single-frame data-based human gesture recognition [21], object recognition [22] and 3D environment reconstruction [23] have been reported. There are already a number of commercial RGB-D cameras available on the market at prices ranging from 200 USD to 50,000 USD and capture times from 0.033 seconds to three seconds. These prices and capture times have restricted the wider usage of RGB-D cameras in industry. Among these cameras, Microsoft Kinect costs less than 200 USD and has a frame rate of 30 FPS [24]. One drawback of Kinect is its inaccurate measurement output: subtle movements of the human hand and fingertips are difficult to measure. In [25], the authors developed algorithms to detect the human palm and fingertips, and in [26] the authors developed an algorithm based on flocking in order to interact with computers more naturally.
In [27], the pose of a human was computed from the fusion of data from a gypsy-giro suit based on accelerometers and UWB sensors for assembly and disassembly tasks in collaboration between humans and industrial robots. In the present research, we assume that the illumination within the assembly cell is always stable. When a human worker is collaborating with a robot, his hands are tracked only when they are completely exposed to Kinect and the light source. Supporting sensor data are sometimes provided for recognizing such actions.

The HMM is widely applied to pattern recognition, including speech recognition [28], handwriting recognition [29, 30], human behaviour recognition [31] and trajectory learning [32]. It can also be applied to human-action pattern recognition via sensor-data streams. Related works demonstrate pattern recognition [33] and prediction [34, 35]. One of the challenges in using an HMM for online applications is how to segment the data streams and recognize patterns from short segments.

The original contribution of this work is the design of an intelligent human-robot collaborative hybrid assembly (iHRCHA) cell. In this cell, a sensor system serves as a natural interface between a human and a robot co-worker. Via this interface, the robot co-worker is able to identify a human worker's hand gestures accurately and rapidly. This mechanism utilizes an RGB-D camera and a supportive glove to produce the position data and rotation information of the human's palm and fingertips in 3D space. An algorithm is developed to obtain accurate information from the raw data streams of the sensor system. The HMM is combined with this interface for online hand-gesture recognition using a segmentation technique. This system is cheap and can be deployed quickly. It improves on the current setup of HACs by providing a robot co-worker with the capability of carrying out collaborative tasks rapidly, effectively and safely.

The remainder of this paper is organized as follows: Section II introduces the basic assumption, problems and challenges in iHRCHA. Section III describes the experimental setup, including the sensor system and the algorithm to process the sensor information and obtain the featured data. A trajectory descriptor is introduced to encode the human's palm and fingertip movement trajectories. Section IV describes a task scenario which assesses the presented system. Section V and Section VI explain the experimental results, discussion and conclusion respectively.

2. An intelligent human and robot collaborative hybrid assembly cell

2.1 Basic problem

In iHRCHA cells, human workers and robot co-workers must work closely with each other. In [1], the author describes several typical human and robot coordination models:

The human alternately collaborates with a robot co-worker in performing a task. In this case, the human and the robot perform the assembly sequentially (Fig. 1-(a)). The human and the robot do not share the working time but they do share the working space.

The human collaborates with the robot co-worker in performing a task. In this case, the human and the robot collaboratively perform the assembly task simultaneously (Fig. 1-(b)). The human and the robot share both the working time and the working space.

Figure 1.

Gantt diagram for a “human/robot tasks shift”

When the human worker is performing the assembly task, the robot monitors, detects and estimates his actions and intentions. Being aware of the predefined scheduled assembly tasks, the robot co-worker can assist the human and carry out its own assembly task. The human cannot send explicit or implicit commands using a traditional human-machine interface, such as a keyboard or a mouse, because he has to concentrate on more urgent tasks. As a consequence, it is reasonable to design a human-robot interface in a natural way, using gestures and language. Language is the most natural way in which humans communicate. Despite the last decade of development, natural language processing technology is still incapable of wide adoption for mutual applications due to the complicated processing involved. Human gestures, alternatively, are believed to be the most convincing natural interface while still retaining the rich information of human-robot communication.

Human gestures usually refer to those gestures represented via the body, arm or hands of a human. In iHRCHA, human and robot co-workers mainly perform electronic manufacturing assembly tasks on a bench. It is unwise to monitor human body gestures because these will not change very often or obviously during the assembly task. It is also not practical to monitor human arm gestures because it is difficult to capture sufficient information to describe them. Hand gestures are appropriate to describe a human worker's working status during manual assembly. The posture of a human worker's hands varies due to the different requirements of the current assembly task. Therefore, it is of interest to examine the relationship between a human worker's hand gestures and the actual assembly actions. Human intentions, as associated with assembly work, are defined as the assembly action that a given human worker is performing or intending to perform.

2.2 Hand gesture analysis in iHRCHA

In electronic manufacturing systems, assembly requires a human to perform accurate and quick assembly operations. In current manufacturing systems, human operators carry out most operations by hand. Therefore, we look to identify human assembly actions by monitoring hand gestures. Previous visual-object recognition technologies cannot provide an effective method for tracking and identifying complex palm and fingertip gestures in complex environments. In this research, a human-robot interface based on human hand gestures is designed and an effective algorithm for palm and fingertip recognition is proposed by analysing the RGB-D data of hands. Although it is possible to analyse palm rotation information based on such RGB-D data, the result is still not reliably accurate. Therefore, we design a simple glove for the human worker, with a three-axis acceleration sensor and a gyro-sensor attached, to collect accurate palm movement and rotation information.

There are two types of hand gestures in iHRCHA (Fig. 2). One is the static gesture, which takes place at a single instant, and the other is the dynamical gesture, which takes place over a period of time. The former can be described using a single frame of data from the data streams, while the latter can be described using sequential frames of data from the data streams.

Figure 2.

Human hand gesture using OpenSim [36]

For a static gesture, a single frame of data includes not only the 3D position of the palm and fingertips but also the rotation information. For a dynamical gesture, besides the static gesture data for each frame, it also contains the hand movement trajectory.

In this research, the sensor system for the human-robot interface is built by two sets of sensors: a vision system using Kinect, and a data glove using a three-axis acceleration sensor and gyro. The information acquired by this sensor system is represented in two respects: the position of the human hands and fingertips, and the rotation information of the hands (including 3D acceleration and the angular velocity). Therefore, the action of a human's hand can be uniquely described by the combination of these two types of information.

2.3 Challenges for pattern recognition via data streams

In [37, 38], the authors list some particularly challenging issues for pattern recognition in data streams. Our iHRCHA is designed based on the following considerations:

Cost-effectiveness: The sensor should be cheap for practical deployment while providing promising results.

Naturalness: This interface system should not require the human to wear additional heavy devices or otherwise cause physical or psychological stress for the human [39]. Therefore, a conventional (i.e., complicated) data glove is not acceptable [40].

Interaction space: Traditionally, the system requires a human standing within a fixed environment without moving. However, the noisy background associated with a HAC can greatly affect the output of the system (irrespective of considerations of retrieving 3D position information).

Outlier point detection on data streams: Data streams-based pattern recognition requires that the system can automatically detect outlier points and segment data streams. Therefore, the cue study for detecting the segment containing potential human action is important [41].

Responsiveness: When a human performs an action, the robotic system should respond in near real-time. 45 ms [42] is thought to be the threshold value for a real-time response, as with a human. After we determine the start point, the length of the data segment should not be too long, so that the robot can analyse that data segment and respond in real-time.

3. Experimental setup and methodology

3.1 Experimental environment setup

A concept of human-robot coordinated assembly is shown in Fig. 3, and the control diagram for this whole system is shown in Fig.4.

Figure 3.

Concept of human-robot coordinated assembly. (a) Sensor system configuration. (b) Process human-hand movement and rotation information

Figure 4.

Control chart of human-robot coordinated assembly. If we treat it as a robotic system, the human worker could be viewed as a disturbance which can cause the robot co-worker to change its work flow. A “human intention estimation module” is used to recognize the human worker's actions based on the information from the sensor system.

The 3D region where the human and robot collaboration takes place is called the ‘hazard zone’. A Kinect camera is used to monitor this hazard zone. Detection and tracking are only triggered when the human worker moves his hands into the “human/robot coordinated working area”, as in Fig. 3-(a). There are two types of cameras mounted in a Kinect: a normal camera and an infrared camera. By combining these two cameras, one can get the depth image of an area of interest, as shown in Fig. 3-(a). Based on the depth image, Algorithm 1 is developed to calculate the position of human hands and fingers in 2D. In Fig. 3-(b), we can see that Algorithm 1 can effectively detect the human operator's palm and fingertips when performing different actions. We are mostly interested in the fingertip positions of the thumb, index finger and middle finger. However, in the algorithm, we try to detect all five fingertips within each 20 ms sampling time. It is sometimes more difficult to detect the third finger or the little finger when the human worker is grabbing or holding than when the hand is simply moving. However, we still use the data of all five fingertips when training the HMMs. Moreover, we do not consider occlusion by the robotic manipulator. The robot is supposed to stay in the standby zone and monitor the human worker. When a cooperation decision is made by the central control module, the robot will carry out the task; afterwards, it goes back to standby status once again.

A human gesture is represented in two ways, as described in Sect. II: (1) in a static frame, and (2) via sequential frames. The former contains the information about human hands, while the latter contains information about the movement trajectory.

Algorithm 1 Position of the palm- and fingertip-detection algorithm.

Based on the binary image, determine the polygon of the hand

Based on the polygon, we determine the convex hull (Sklansky algorithm)

Determine the convex hull vertices where θ<160° as the fingertips

Determine the centre point of the region (shown in Fig. 3-(b)) as the palm position

The vertices points above the centre point are the fingertips.

Use the Lucas-Kanade tracking algorithm to track the points of interest in 2D

Map the depth data of each detected point into height data
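The hull-based steps of Algorithm 1 can be illustrated with a minimal sketch. It makes several assumptions that are not stated in the text: the hand polygon is given as a list of (x, y) pixel coordinates, the convex hull is computed with Andrew's monotone chain rather than the Sklansky algorithm, θ is taken to be the angle at a hull vertex between the edges to its two hull neighbours, the centre point is approximated by the mean of the polygon vertices, and "above" means a smaller y value (image coordinates). The function names are hypothetical, and the tracking and depth-mapping steps are not covered.

```python
import math

def convex_hull(points):
    """Andrew's monotone chain (stand-in for Sklansky); returns hull vertices in order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def vertex_angle(prev_pt, pt, next_pt):
    """Angle in degrees at pt between the edges to its two hull neighbours."""
    v1 = (prev_pt[0] - pt[0], prev_pt[1] - pt[1])
    v2 = (next_pt[0] - pt[0], next_pt[1] - pt[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

def detect_palm_and_fingertips(polygon, max_angle_deg=160.0):
    """Return (palm centre, fingertip candidates) for a hand polygon."""
    # Assumption: the palm centre is the mean of the polygon vertices.
    cx = sum(x for x, _ in polygon) / len(polygon)
    cy = sum(y for _, y in polygon) / len(polygon)
    hull = convex_hull(polygon)
    tips = []
    if len(hull) >= 3:
        n = len(hull)
        for i, p in enumerate(hull):
            angle = vertex_angle(hull[i - 1], p, hull[(i + 1) % n])
            # Assumption: image coordinates, so "above the centre" means y < cy.
            if angle < max_angle_deg and p[1] < cy:
                tips.append(p)
    return (cx, cy), tips

# Toy hand outline: two base corners, three fingertips, two valleys.
hand = [(0, 10), (1, 2), (3, 5), (5, 0), (7, 5), (9, 2), (10, 10)]
palm, fingertips = detect_palm_and_fingertips(hand)
```

In this toy outline the two valleys lie inside the hull and the base corners lie below the centre, so only the three sharp upper vertices remain as fingertip candidates.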

A hand information descriptor for a static frame

The rotation information of a human's hand is as important as position information. It is assumed that human static gestures include palm rotation information. A three-axis accelerometer and gyro-sensor mounted on the human's glove is constructed to meet this objective. Although there are several data gloves that can help collect hand information, these are inconvenient to use and intrusive for sensing. In this study, we only attach a sensor chip on the working glove. It is a non-intrusive sensor method, never interfering with the human's assembly actions. The measurement range of the three-axis acceleration sensor is from −3.6g to 3.6g (g: acceleration of gravity), and the measurement range of the gyro-sensor is from −110°/s to 110°/s. As shown in Fig. 3-a, the human and the robot are working on opposite sides of the work bench.

A movement trajectory descriptor for sequential frames

A major element of the motion information is contained in the motion trajectory. Accordingly, a trajectory descriptor is designed. The chain code method [43] (Fig. 5-(a)) has been widely used in image processing to describe the boundary of a region. The basic principle of chain codes is to separately encode each connected component in the image. For each such region, a point on the boundary is selected and its coordinates are transmitted. The encoder then moves along the boundary of the region and, at each step, transmits a symbol representing the direction of this movement. In contrast to the conventional Freeman chain code descriptor, a modified chain code (Fig. 5-(b)) is used in this study. In this method, the neighbouring points P_pre and P_curr within the human movement trajectory are stored and compared. If P_curr lies within a limited range of P_pre (i.e., the distance between them is smaller than r), the chain code for P_curr is set to 0. Otherwise, P_curr lies in one of the 8 directions from P_pre, as shown in Fig. 5-(b), and the chain code for P_curr is given the corresponding value.
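The modified chain code can be sketched as follows. The direction numbering of Fig. 5-(b) is not reproduced in the text, so this sketch assumes that code 1 points along the positive x-axis and that codes 2 to 8 follow counter-clockwise in 45° steps; it also assumes that each point is compared with its immediate predecessor. The function name `modified_chain_code` is hypothetical.

```python
import math

def modified_chain_code(trajectory, r=10.0):
    """Encode a 2D trajectory as codes 0..8 (0 = movement shorter than r)."""
    codes = []
    for (x0, y0), (x1, y1) in zip(trajectory, trajectory[1:]):
        dx, dy = x1 - x0, y1 - y0
        if math.hypot(dx, dy) < r:
            codes.append(0)  # P_curr stays within range r of P_pre
            continue
        # Quantize the movement direction into eight 45-degree sectors.
        angle = math.atan2(dy, dx) % (2 * math.pi)
        sector = int((angle + math.pi / 8) // (math.pi / 4)) % 8
        codes.append(sector + 1)  # assumed numbering: 1 = +x, counter-clockwise
    return codes

path = [(0, 0), (3, 0), (23, 0), (23, 20), (3, 40)]
codes = modified_chain_code(path, r=10.0)
```

The extra symbol 0 lets the descriptor represent pauses and small jitters explicitly instead of forcing them into one of the eight directions.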

Figure 5.

Traditional eight-directional Freeman chain code

3.2 A human intention estimation module

In order to study the cooperation language using hand gestures for a robot to understand human workers' assembly intentions implicitly, as shown in Fig. 4, a “human intention estimation module” is developed to act as the connection between the human worker and the robot co-worker. Because only a segment of the data streams can be used for HMM, an automatic sliding window method is used to choose these segments. There are three steps to move the window via the data streams (details can be found in Section IV).

Find out the outlier points based on the data streams.

From each outlier point, take 10 consecutive samples, O = {o_i, o_{i+1}, …, o_{i+9}}, where o_i denotes an outlier point.

Input O to the HMM classifier to determine which action the human is performing by comparing the output likelihood.
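The three steps above can be sketched as a small routine. This is only an illustration under stated assumptions: the outlier indices are assumed to be already known, and each HMM is represented by a hypothetical scoring callable that returns a log-likelihood for a segment. The toy scorers in the usage part are placeholders, not trained HMMs.

```python
def classify_segments(stream, outlier_indices, models, window=10):
    """For each outlier index, take `window` samples and pick the best-scoring model.

    models: dict mapping an action name to a callable(segment) -> log-likelihood.
    Returns a list of (start index, action name, score).
    """
    results = []
    for i in outlier_indices:
        segment = stream[i:i + window]
        if len(segment) < window:
            continue  # not enough samples after this outlier point yet
        scores = {name: score(segment) for name, score in models.items()}
        best = max(scores, key=scores.get)
        results.append((i, best, scores[best]))
    return results

# Toy stand-ins for two trained HMMs (hypothetical scorers on scalar samples).
toy_models = {
    "move and grab": lambda seg: -sum((x - 1) ** 2 for x in seg),
    "move and hold": lambda seg: -sum((x + 1) ** 2 for x in seg),
}
stream = [0] * 5 + [1] * 10 + [0] * 5 + [-1] * 10
decisions = classify_segments(stream, [5, 20, 25], toy_models)
```

The last outlier index is skipped because fewer than 10 samples follow it, which mirrors the online case where the window is not yet full.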

3.3 HMM: feature vector construction

The HMM definition is listed as follows:

The number of states of the model is N.

The number of observation symbols in the alphabet {v₁,v₂,…,v_M} is M. If the observations are continuous, then M is infinite.

A set of hidden state transition probabilities is denoted A={a_ij}:

a_{i j} = p {s_{t + 1} = j | s_{t} = i},1 \leq i, j \leq N

(1)

where s_t denotes the current hidden state at time t.

A probability distribution of the observation is denoted B={b_j(k)}:

b_{j} (k) = p {o_{t} = v_{k} | s_{t} = j},1 \leq j \leq N,1 \leq k \leq M

(2)

v_k denotes the k-th observation symbol and o_t denotes the current parameter vector at time t.

If the observations are continuous, then we must use a continuous probability density function instead of a set of discrete probabilities. In this case, we specify the parameters of the probability density function. Usually, the probability density is approximated by a weighted sum of M Gaussian distributions N, whereby:

b_{j} (o_{t}) = \sum_{m = 1}^{M} c_{j m} N (μ_{j m}, Σ_{j m}, o_{t})

(3)

and where:

c_jm: weighting coefficients

μ_jm: mean vectors

Σ_jm: covariance matrices

The initial state distribution, Π={π_i}, is such that:

π_{i} = p {s_{1} = i},1 \leq i \leq N

(4)
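As an illustration of the mixture density in Eq. (3), the following sketch evaluates b_j(o_t) for a single state j. To keep it dependency-free it assumes diagonal covariance matrices, which is a simplification not stated in the text; the function names are hypothetical.

```python
import math

def gaussian_diag(o, mean, var):
    """Density of a Gaussian with diagonal covariance (assumption) at observation o."""
    log_p = 0.0
    for x, m, v in zip(o, mean, var):
        log_p += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return math.exp(log_p)

def emission_density(o, weights, means, variances):
    """b_j(o_t) = sum_m c_jm * N(mu_jm, Sigma_jm, o_t) for one state j."""
    return sum(c * gaussian_diag(o, mu, var)
               for c, mu, var in zip(weights, means, variances))

# Two-component mixture in 2D; the weights c_jm sum to 1.
b = emission_density(
    o=[0.0, 0.0],
    weights=[0.5, 0.5],
    means=[[0.0, 0.0], [0.0, 0.0]],
    variances=[[1.0, 1.0], [1.0, 1.0]],
)
```

With both components identical and standard normal, the value at the origin reduces to 1/(2π).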

Therefore, we can use the compact notation:

λ = (A, c_{j m}, μ_{j m}, Σ_{j m}, Π)

(5)

to denote an HMM with continuous probability distributions. Define $o_{f} = \{(x_{f i}, y_{f i}, z_{f i}), I, φ\}, i = 1,2, \dots, I$. Accordingly, (x_fi,y_fi,z_fi) denotes the position of fingertip i in 3D space, I denotes the number of detected fingertips and φ denotes the detected area of the palm in pixels.

Define $o_{h} = \{(x_{h}, y_{h}, z_{h}), a_{x}, ω_{x}, a_{y}, ω_{y}, a_{z}\}$. Here, (x_h, y_h, z_h) denotes the centre of the palm in 3D space, (a_x, a_y, a_z) denotes the acceleration along the X, Y and Z axes, and (ω_x, ω_y) denotes the angular velocity around the X and Y axes.

The feature vector for gesture identification is o = {o_f,o_h}.
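A possible way to assemble o = {o_f, o_h} into a fixed-length numeric vector is sketched below. The text does not say how frames with fewer than five detected fingertips are handled, so zero-padding is an assumption made here; the ordering of the components and the name `build_feature_vector` are likewise illustrative choices.

```python
def build_feature_vector(fingertips, palm_area, palm_pos, accel, gyro, max_tips=5):
    """Flatten o_f and o_h into one list of floats.

    fingertips: list of (x, y, z) for the detected fingertips (at most max_tips used)
    palm_area:  detected palm area in pixels (phi)
    palm_pos:   (x_h, y_h, z_h)
    accel:      (a_x, a_y, a_z)
    gyro:       (w_x, w_y)
    """
    tips = list(fingertips)[:max_tips]
    o_f = []
    for x, y, z in tips:
        o_f.extend([float(x), float(y), float(z)])
    # Assumption: pad missing fingertips with zeros to keep a fixed length.
    o_f.extend([0.0] * (3 * (max_tips - len(tips))))
    o_f.extend([float(len(tips)), float(palm_area)])  # I and phi
    o_h = [float(v) for v in (*palm_pos, *accel, *gyro)]
    return o_f + o_h

o = build_feature_vector(
    fingertips=[(1, 2, 3), (4, 5, 6)],
    palm_area=1200,
    palm_pos=(10, 20, 30),
    accel=(0.1, 0.2, 0.3),
    gyro=(5.0, -5.0),
)
```

A fixed length is convenient because the Gaussian-mixture observation model expects every observation vector to have the same dimension.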

The task is to compute the probability of a particular output sequence given the parameters of the model. This requires summation over all possible state-sequences.

The probability of observing a sequence:

O = {o (0), o (1), \dots, o (l - 1)}

(6)

of a length l is given by:

P (O) = Σ_{S} p (O | S) p (S)

(7)

where S ={s(0),s(1),…,s(l-1)}.

This problem can be handled efficiently using a forward algorithm.
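A minimal sketch of the forward algorithm is given below. It assumes that the emission values b_j(o_t) have already been evaluated and are passed in as a table B[t][j]; per-step scaling is used to avoid numerical underflow, and the function returns log p(O | λ). The function name and the toy parameters are hypothetical.

```python
import math

def forward_log_likelihood(pi, A, B):
    """Scaled forward algorithm.

    pi: initial state distribution, length N
    A:  N x N transition matrix, A[i][j] = a_ij
    B:  l x N table, B[t][j] = b_j(o_t) for the observed sequence
    Returns log p(O | lambda).
    """
    n = len(pi)
    alpha = [pi[j] * B[0][j] for j in range(n)]
    log_lik = 0.0
    for t in range(len(B)):
        if t > 0:
            alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[t][j]
                     for j in range(n)]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence impossible under this model
        alpha = [a / scale for a in alpha]
        log_lik += math.log(scale)
    return log_lik

# Toy two-state model and a length-3 observation sequence.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.1], [0.4, 0.3], [0.1, 0.6]]
ll = forward_log_likelihood(pi, A, B)
```

The recursion costs on the order of N²·l operations, whereas the direct summation in Eq. (7) runs over all N^l state sequences.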

We then determine the likelihood of a sequence O with respect to an HMM λ.

For a sequence O, we define the log-likelihood:

L (O | λ) = \log p (O | λ)

(8)

For a sample of observed sequences Oⁿ, we define:

L (λ) = Σ_{n} L (O^{n} | λ)

(9)

Here, L(λ) is the log-likelihood of observing this sample given the model λ, which can be calculated recursively. The likelihood approach is frequently used to decide which HMM is most likely to have generated a given data sequence, by comparing the likelihood against a threshold value or against that of other HMMs.

4. Gesture pattern recognition via data streams

4.1 Task scenario

The assembly task is predefined in order to study the human ‘fetch’ and ‘grasp’ patterns (Fig. 6). In the assembly area, there are two pairs of connectors (C₁ and C₂), each of which has two connector heads (C_1-1 and C_1-2, and C_2-1 and C_2-2, respectively) connected by a single cable. There are also two connector slots, called P₁ and P₂. The objective of this assembly work is to insert the connector C₁ or C₂ into the slots. The restriction is that C_1-1 (or C_2-1) should be inserted into P₁ and that C_1-2 (or C_2-2) should be inserted into P₂. Because it is human-robot coordinated work, the human worker will choose and insert C_1-1 (or C_2-1) while the robot will choose and insert C_1-2 (or C_2-2). The human's and the robot's actions occur simultaneously.

Figure 6.

Task scenario

4.2 Possible human intention

It is a human-conducted, robot-supported assembly task; therefore, the robot moves after it successfully identifies the human worker's action. Only when the robot knows what the human worker is doing will the robot estimate the human's movement trajectory and coordinate itself correspondingly in safe mode.

However, due to human nature, the human's intentions may change during assembly, as categorized according to four types (Table 1).

Table 1.

Action description within this task scenario

Intention	Intention Description
A	Human directly fetches to $C_{1 - 1}$ , picks it up and moves to $P_{1}$ and inserts.
B	Human changes his mind, and then directly fetches to $C_{2 - 1}$ , picks it up and holds it to $P_{1}$ and inserts.
C	Human fetches to $C_{1 - 1}$ , but on the way he changes his mind, fetches to $C_{2 - 1}$ , picks it up and holds it to $P_{1}$ and inserts.
D	Human fetches to $C_{1 - 1}$ and begins to pick it up, but changes his mind and fetches to $C_{2 - 1}$ , picks it up and moves to $P_{1}$ and inserts.

Subject to different intentions, the robot co-worker performs its actions correspondingly, as shown in Table 2. The problem is that the robot needs to know when the human will carry out the “move and grab” action and the “move and hold” action, which together comprise the most basic human gesture patterns in this assembly process (Fig. 8).

Table 2.

Robot action corresponding to different human intentions

Intention	Robot Reaction Description
A	Robot picks up $C_{1 - 2}$ and inserts it to $P_{2}$ .
B	Robot picks up $C_{2 - 2}$ and inserts it to $P_{2}$ .
C	Robot picks up $C_{2 - 2}$ and inserts it to $P_{2}$ .
D	Robot picks up $C_{2 - 2}$ and inserts it to $P_{2}$ .

According to the intention described in Table 1, the human movement trajectory is shown in Fig. 7.

Figure 7.

Human hand movement trajectory in an XY 2D panel. “Micro-slip” describes the phenomenon whereby the human is carrying out an action and then hesitates or changes his mind. In an assembly system, this phenomenon is widespread due to the fact that humans are non-controllable.

Figure 8.

Data streams from the sensor system (y-axis) for the “move and grab” action (A1) and the “move and hold” action (A2) versus the length of the samples (x-axis) with 100 sampling steps. The former 50 and the latter 50 contain the “move and grab” action and the “move and hold” action twice, respectively. (a) Acceleration of the palm via the x-axis, the y-axis and the z-axis, respectively. (b) Rotation information of the palm around the x-axis and the y-axis, respectively. (c) Number of detected fingertips. (d) Area of the detected palm in a 2D panel. (e) and (f) represent two criteria: palm movement acceleration and rotational energy for sample segmentation.

Modified eight-directional chain code is adopted to describe the trajectory shown in Fig. 7.

4.3 Segment start point detection

A segment start point detection method involves finding out the outlier point via the data streams. It is assumed that when a human wants to carry out an action, his hands will move in an obvious manner. This movement can be represented by the change in data via acceleration in a 3D space. We can use the following C₁ (combined acceleration) to detect the start point of the significant action segment by comparing it with a certain threshold value. Similarly, for the rotation movement, we can observe the movement action represented by the gyro-data change, as shown in C₂ (rotational energy).

C_{1} = a = \sqrt{(a_{x}^{2} + a_{y}^{2} + a_{z}^{2})}

(10)

C_{2} = \frac{1}{2} I ω^{2} = \frac{1}{2} m l^{2} (ω_{x}^{2} + ω_{y}^{2})

(11)

where a denotes the magnitude of the combined acceleration, (a_x,a_y,a_z) denotes the acceleration along the three axes and (ω_x,ω_y) denotes the angular velocity (all acquired from the data glove). m is the mass of the hand and l is the distance between the mass point and the rotational axis. I = ml² is the moment of inertia, which is treated as a constant in this study, and so we have the following relation:

C_{2} \propto (ω_{x}^{2} + ω_{y}^{2})

(12)
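The two criteria can be turned into a simple start-point detector, sketched below. The text does not specify how C₁ and C₂ are combined, so this sketch assumes that a start point is declared on the first sample where either criterion rises above its threshold; the constant factor ½ml² in C₂ is dropped, following Eq. (12). The function names and the threshold values in the usage example are placeholders.

```python
import math

def combined_acceleration(ax, ay, az):
    """C1: magnitude of the three-axis acceleration, Eq. (10)."""
    return math.sqrt(ax * ax + ay * ay + az * az)

def rotational_energy_proxy(wx, wy):
    """C2 up to the constant factor (1/2) m l^2, Eq. (12)."""
    return wx * wx + wy * wy

def detect_start_points(samples, c1_threshold, c2_threshold):
    """Return indices where C1 or C2 first exceeds its threshold.

    samples: iterable of (ax, ay, az, wx, wy) tuples from the data glove.
    Assumption: either criterion alone is enough to trigger a start point,
    and only the rising edge is reported.
    """
    starts = []
    active = False
    for i, (ax, ay, az, wx, wy) in enumerate(samples):
        above = (combined_acceleration(ax, ay, az) > c1_threshold
                 or rotational_energy_proxy(wx, wy) > c2_threshold)
        if above and not active:
            starts.append(i)
        active = above
    return starts

rest = (0.0, 0.0, 1.0, 0.0, 0.0)
samples = [rest] * 3 + [(2.0, 0.0, 0.0, 0.0, 0.0)] * 2 + [rest] * 2 + [(0.0, 0.0, 1.0, 3.0, 0.0)]
starts = detect_start_points(samples, c1_threshold=1.7, c2_threshold=5.4)
```

Reporting only the rising edge avoids emitting a new start point for every sample of the same movement.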

4.4 Segment start point decision

The segment should be neither too long nor too short. If the length is set too long, the robot cannot react to the human worker quickly. If the length is set too short, the data segment cannot represent any meaningful information. In this study, we suppose that all meaningful human actions are completed within 1 s. With the 20 ms sampling interval of our experimental setup, this corresponds to 50 time intervals.

5. Experiment and discussion

5.1 Segment start point and length

When a human worker collaborates with a robot on assembly tasks, the robot must be aware of the task sequence that has been predefined. The only issue here is that the human worker cannot guarantee stable performance in time domain. Accordingly, the robot must constantly monitor the human in order to know what to do next. In this case, the robot must detect some “trigger” action so that it can estimate which task the human is performing or else is going to perform by checking the task sequence defined in the database. We assume that grasping (move and grab) and handling (move and hold) are the two most typical “trigger” actions. In this section, two actions including “move and grab” and “move and hold” for a human worker's left hand are recognized by the HMM model.

We compare the movement acceleration C₁ and the rotational energy C₂ with threshold values in order to detect the start point of a meaningful action. In this experiment, we observe the action samples (Fig. 8 (e) and (f)) over time (each sample interval is 20 ms). Based on this observation, the segment length for each of the two actions is chosen as 10 samples, forming a 0.2 s data segment (Table 4).
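The relation between segment length and duration, and the cutting of a fixed-length window from the detected start point, can be sketched as follows. The function names and the placeholder stream are hypothetical; only the 20 ms interval and the lengths of 10 and 50 samples come from the text.

```python
SAMPLE_INTERVAL_S = 0.02  # 20 ms per sample, as stated in the text


def segment_duration(num_samples, interval_s=SAMPLE_INTERVAL_S):
    """Duration in seconds covered by num_samples consecutive samples."""
    return num_samples * interval_s


def extract_segment(stream, start_index, length):
    """Return `length` consecutive samples beginning at start_index.

    Returns None when the stream does not yet contain enough samples,
    so that the caller can wait for more data.
    """
    end_index = start_index + length
    if start_index < 0 or end_index > len(stream):
        return None
    return stream[start_index:end_index]


stream = list(range(100))  # placeholder for 100 recorded samples
segment = extract_segment(stream, start_index=37, length=10)
short_duration = segment_duration(10)  # 10 samples at 20 ms, about 0.2 s
long_duration = segment_duration(50)   # 50 samples at 20 ms, about 1 s
```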

Table 3.

Chain code to describe the trajectory in Fig. 7

Four trajectories in Fig. 7	Trajectory descriptor using modified chain code $(r = 10)$
A	{44444473400002507000718870027107000030710000730030007070}
B	{434205000000000775030500030000737000030370}
C	{47220456003700878707515841003070005370}
D	{44463330510378230741100788782100036050000010}

Table 4.

Start point detection and segment length decision

Motion type	Threshold for start point detection	Segment length
move and grab	$1.7 (C_{1}) / 5.4 (C_{2})$	>10
move and hold	$1.7 (C_{1}) / 5.3 (C_{2})$	10

5.2 Feature vector for HMM

“Move and grab” action

In this experiment, we assume that the grab action usually occurs without any rotation about the 3D axes; therefore, the feature vector for training consists only of data from the data glove and the Kinect sensor.

We define f_h = {x_a, y_a, z_a, x_h, y_h, z_h}, where x_a, y_a, z_a denote the three-axis acceleration and x_h, y_h, z_h denote the hand position in 3D.

“Move and hold” action

In this experiment, we assume that the rotation action usually occurs without any coordinate shift in 3D space; therefore, the feature vector for training consists only of data from the data glove.

We define f_h = {a_x, a_y, a_z, ω_x, ω_y}, where a denotes the acceleration and ω denotes the angular velocity.

We sampled the data segments (Fig. 8 (a)-(d)) for training based on the start-point detection and segment-length decision method described in the previous subsection.
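As a rough illustration of the two feature vectors defined in this subsection, the per-sample readings can be assembled as shown here. The function names, the tuple layout of each frame and the numeric values are assumptions made for the sketch.

```python
def grab_feature(accel, hand_position):
    """'Move and grab': three-axis acceleration plus 3D hand position."""
    ax, ay, az = accel
    xh, yh, zh = hand_position
    return [ax, ay, az, xh, yh, zh]


def hold_feature(accel, angular_velocity):
    """'Move and hold': three-axis acceleration plus (w_x, w_y)."""
    ax, ay, az = accel
    wx, wy = angular_velocity
    return [ax, ay, az, wx, wy]


def feature_sequence(frames, feature_fn):
    """Turn a segment of (first_reading, second_reading) pairs into feature vectors."""
    return [feature_fn(first, second) for first, second in frames]


# Two made-up frames of a 'move and grab' segment.
grab_segment = [
    ((0.1, 0.2, 0.3), (10.0, 20.0, 30.0)),
    ((0.4, 0.5, 0.6), (11.0, 21.0, 31.0)),
]
grab_sequence = feature_sequence(grab_segment, grab_feature)
```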

5.3 HMM for the identification of two actions in Fig. 7

We acquire the sample segments using the segmentation method introduced in the previous section and use the Baum-Welch algorithm to train an HMM on these samples for each action. Fifty sets of experiments are performed for the “move and grab” action and another 50 for the “move and hold” action. For each action, 30 sets are used as training data and 20 sets are used to validate the trained HMM. The resulting HMM is denoted Λ = (A,B,Π), as shown in Fig. 9. Each action lasts less than 0.2 s (the sample length is at most 10 intervals). When we train an HMM with more than three hidden states, we find that some hidden states have no obvious physical meaning. Moreover, because the samples are short, a small model helps to avoid over-fitting during training. We therefore choose three hidden states as the most appropriate number for modeling these hand-movement actions.

Figure 9.

HMM construction with three hidden states for the “move and grab” and “move and hold” actions. Each hidden state represents a single frame of a gesture.

Each hidden state represents a frame of the “move and grab” or “move and hold” action. Fig. 9 shows the representative frames within “move and grab” and “move and hold”, as described in Fig. 8.

Table 5 shows the performance of the trained HMM in identifying the “move and grab” and “move and hold” actions. The log likelihood (Equation 9) is the main criterion for evaluating how likely it is that a given data segment was generated by the HMM. We define an experiment with P positive instances and N negative instances for a given condition. According to the experimental setup, both P and N are equal to 20. We use the true-positive (TP) rate to evaluate the sensitivity (Equation 13) and the true-negative (TN) rate to evaluate the specificity (Equation 14) of the trained classifier. TP is the number of samples correctly identified as containing the pattern, while TN is the number of samples correctly identified as not containing it. Moreover, the identification accuracy rate γ_accuracy is calculated according to Equation 15 [44].
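The log-likelihood criterion can be illustrated with a scaled forward algorithm for a discrete-observation HMM Λ = (A, B, Π). This is a sketch under stated assumptions, not the trained models of this study: observations are taken to be discrete symbols (for instance chain codes), the three-state left-to-right parameters are made up, and training with Baum-Welch is not shown.

```python
import math


def log_likelihood(obs, A, B, pi):
    """Scaled forward algorithm: log P(obs | A, B, pi) for a discrete HMM.

    A[i][j]: transition probability from state i to state j.
    B[i][k]: probability that state i emits symbol k.
    pi[i]:   initial probability of state i.
    """
    n = len(pi)
    log_p = 0.0
    alpha = None
    for t, symbol in enumerate(obs):
        if t == 0:
            alpha = [pi[i] * B[i][symbol] for i in range(n)]
        else:
            alpha = [
                sum(alpha[i] * A[i][j] for i in range(n)) * B[j][symbol]
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot generate this sequence
        log_p += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid numerical underflow
    return log_p


def classify(obs, models):
    """Return the name of the model under which obs has the highest log-likelihood."""
    return max(models, key=lambda name: log_likelihood(obs, *models[name]))


# Hypothetical three-state left-to-right models over four symbols (0..3).
LEFT_TO_RIGHT = [[0.6, 0.4, 0.0], [0.0, 0.6, 0.4], [0.0, 0.0, 1.0]]
START = [1.0, 0.0, 0.0]
B_GRAB = [[0.1, 0.1, 0.7, 0.1], [0.1, 0.1, 0.7, 0.1], [0.1, 0.1, 0.1, 0.7]]
B_HOLD = [[0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7], [0.1, 0.7, 0.1, 0.1]]
models = {
    "move_and_grab": (LEFT_TO_RIGHT, B_GRAB, START),
    "move_and_hold": (LEFT_TO_RIGHT, B_HOLD, START),
}
label = classify([2, 2, 2, 3, 3], models)
```

Choosing the model with the larger log-likelihood is one common decision rule; a rejection threshold on the log-likelihood would be needed to label a segment as containing neither action.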

Table 5.

Hidden states and recognition performance for the patterns “move and grab” and “move and hold”

Gesture pattern	Parameter	Chain code	C ₁	C ₂	Palm area	Fingertip number	TP rate	TN rate	Accuracy rate
Move and grab	S1	2	1.7	5.5	3660	2	0.85	0.85	0.85
	S2	2	2.1	5.6	2990	1
	S3	3	2.1	5.4	2770	1

Move and hold	S1	1	2.1	5.5	2990	2	0.85	0.90	0.875
	S2	3	1.9	5.5	3880	3
	S3	1	2.1	5.5	2880	1

TP_{rate} = \frac{TP}{P}

(13)

TN_{rate} = \frac{TN}{N}

(14)

γ_{accuracy} = \frac{TP + TN}{P + N}

(15)

where P denotes the number of samples containing the movement pattern to be detected and N denotes the number of samples which do not contain the movement pattern to be detected.
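Equations 13 to 15 can be checked against Table 5 with a short calculation. The counts used here (17 and 18 correct identifications out of 20) are not reported in the text; they are back-computed from the published rates with P = N = 20, and the function name is hypothetical.

```python
def recognition_rates(tp, tn, p, n):
    """Return (TP rate, TN rate, accuracy) following Equations 13, 14 and 15."""
    tp_rate = tp / p
    tn_rate = tn / n
    accuracy = (tp + tn) / (p + n)
    return tp_rate, tn_rate, accuracy


# Counts inferred from the rates in Table 5, assuming P = N = 20.
grab_rates = recognition_rates(tp=17, tn=17, p=20, n=20)
hold_rates = recognition_rates(tp=17, tn=18, p=20, n=20)
```

With these counts the function reproduces the accuracy rates of 0.85 and 0.875 listed in Table 5.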

It is noted that the classifier generated by this method has a high TP rate as well as a relatively high TN rate, yielding an accuracy rate of at least 85% (Table 5). This suggests that the robot co-worker makes the right decision most of the time. It should also be pointed out that, because the pattern recognition is based on a 0.2 s segment, the robot is capable of responding to the human worker's actions in near real-time. This feature is significant not only for effective and efficient assembly by a human and a robot, but also for safety.

6. Conclusion

The design, development and evaluation of a novel natural human-robot interface for human gesture recognition within an HRC have been presented. An RGB-D camera is used to detect the human operator's palm and fingertip positions, and an accelerometer and gyro-sensor attached to a glove are used to detect the rotation and movement of the hand. This offers a novel research element for pattern recognition via data streams in a hybrid manufacturing environment. Based on this technology, an HMM-based method is adopted to identify human workers' gestures automatically. The results show that the proposed system can automatically segment the data streams and recognize the represented action patterns with an acceptable accuracy ratio.

The main contribution of this paper lies in the design of an iHRCHA cell within which a sensor system serves as a natural interface between a human and a robot co-worker. With the help of this system, several challenges in realizing human-robot collaboration in HRC are addressed. In particular, the system is affordable without sacrificing naturalness; it can effectively monitor hazardous areas where collaboration occurs; and it can automatically detect the outlier point, segment the data streams and reduce the data-processing time. Therefore, the robot co-worker can respond to an action within around 0.2 s. This short response time can also eliminate the potential danger that a robot co-worker may hurt the human worker.

In the illustrated example, when recognizing the gesture patterns “move and grab” and “move and hold”, our approach yields a high true-positive rate and a high accuracy rate. It demonstrates that our interface performs better than the previous, ordinary vision-based HRC interfaces, since we consider more information in describing the gesture of the human's hand and fingertips while improving the processing algorithms. The system can be easily extended to recognize more action patterns. Future research will focus on human intention estimation by testing various stream-pattern recognition methods. One direction would be to consider the reconstruction of the data in subsequent frames based on the currently recognized action patterns.

7. Acknowledgements

This material is based upon work funded by the Natural Science Foundation of China under Grant No. 61203360, Zhejiang Provincial Natural Science Foundation of China under Grant Nos. LQ12F03001, LQ12D01001 and LY12F01002, Ningbo City Natural Science Foundation of China under Grant Nos. 2012A610009 and 2012A610043, the State Key Laboratory of Robotics and Systems (HIT) Foundation of China under Grant No. SKLRS-2012-MS-06, and the China Post-doctoral Science Foundation under Grant No. 2013M531022.

References

1. Krüger, Lien, Verl. Cooperation of human and machines in assembly lines. CIRP Annals-Manufacturing Technology, 58(2):628–646, 2009.

2. Bannat, Bautze, Beetz, Blume, Diepold, Ertelt, Geiger, Gmeiner, Gyger, Knoll. Artificial cognition in production systems. Automation Science and Engineering, IEEE Transactions on, 8(1):148–174, 2011.

3. Haraguchi, Domae, Shiratsuchi, Kitaaki, Okuda, Noda, Sumi, Matsuno, Kaneko, Fukuda. Development of production robot system that can assemble products with cable and connector. Journal of Robotics and Mechatronics, 23(6):939, 2011.

4. Dautenhahn. Methodology and themes of human-robot interaction: A growing research field. International Journal of Advanced Robotic Systems, Vol. 4, No. 1, 2007.

5. Scott A. Green, Mark Billinghurst, XiaoQi Chen, J. Geoffrey Chase. Human robot collaboration: An augmented reality approach - a literature review and analysis. In ASME 2007 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, pages 117–126. American Society of Mechanical Engineers, 2007.

6. Chen, Sekiyama, Sasaki, Huang, Sun, Fukuda. An assembly strategy scheduling method for human and robot coordinated cell manufacturing. In International Journal of Intelligent Computing and Cybernetics, volume 4, pages 487–510. IEEE, 2011.

7. Krüger, Katschinski, Surdilovic, Schreck. PISA: Next generation of flexible assembly systems - from initial ideas to industrial prototypes. In Robotics (ISR), 2010 41st International Symposium on and 2010 6th German Conference on Robotics (ROBOTIK), pages 1–6. VDE, 2010.

8. New Energy and Industrial Technology Development Organization, “Project for Strategic Development of Advanced Robotics Elemental Technologies”, http://www.nedo.go.jp/english, Accessed on 01 Apr 2006.

9. National Science Foundation, “National Robotics Initiative”, Available: http://www.nsf.gov/pubs/2014/nsf14500, Accessed on 21 Jan 2014.

10. European Robotics Technology Platform, “The Strategic Research Agenda for robotics”, http://www.roboticsplatform.eu/sra, Accessed on 01 Jul 2009.

11. Morioka, Adachi, Sakakibara, J.T.C. Tan, Kato, Arai. Cooperation between a high-power robot and a human by functional safety. Journal of Robotics and Mechatronics, 23(6):926, 2011.

12. Takata, Hirano. Human and robot allocation method for hybrid assembly systems. CIRP Annals-Manufacturing Technology, 2011.

13. Fei Chen, Sekiyama, Cannella, Fukuda. Optimal subtask allocation for human and robot collaboration within hybrid assembly system. Automation Science and Engineering, IEEE Transactions on, 11(4):1065–1075, Oct 2014.

14. Duan, J.T.C. Tan, J.G. Tong, Kato, Arai. Application of the assembly skill transfer system in an actual cellular manufacturing system. Automation Science and Engineering, IEEE Transactions on, (99):1–1, 2012.

15. Wallhoff, Blume, Bannat, Rösel, Lenz, Knoll. A skill-based approach towards hybrid assembly. Advanced Engineering Informatics, 24(3):329–339, 2010.

16. Salter, Dautenhahn, Boekhorst. Learning about natural human–robot interaction styles. Robotics and Autonomous Systems, 54(2):127–134, 2006.

17. A.A. Chaaraoui, Climent-Pérez, Flórez-Revuelta. A review on vision techniques applied to human behaviour analysis for ambient-assisted living. Expert Systems with Applications, 2012.

18. C.L. Bethel, R.R. Murphy. Review of human studies methods in HRI and recommendations. International Journal of Social Robotics, 2(4):347–359, 2010.

19. Henry, Krainin, Herbst, Ren, Fox. RGB-D mapping: Using depth cameras for dense 3D modeling of indoor environments. In the 12th International Symposium on Experimental Robotics (ISER), volume 20, pages 22–25, 2010.

20. Lai, Ren, Fox. Sparse distance learning for object recognition combining RGB and depth information. In Robotics and Automation (ICRA), 2011 IEEE International Conference on, pages 4007–4013. IEEE, 2011.

21. Shotton, Fitzgibbon, Cook, Sharp, Finocchio, Moore, Kipman, Blake. Real-time human pose recognition in parts from single depth images. In CVPR, volume 2, page 7, 2011.

22. Lai, Ren, Fox. A large-scale hierarchical multi-view RGB-D object dataset. In Robotics and Automation (ICRA), 2011 IEEE International Conference on, pages 1817–1824. IEEE, 2011.

23. R.B. Rusu, Cousins. 3D is here: Point cloud library (PCL). In Robotics and Automation (ICRA), 2011 IEEE International Conference on, pages 1–4. IEEE, 2011.

24. Billy Y.L. Li, Ajmal S. Mian, Wanquan Liu, Aneesh Krishna. Using Kinect for face recognition under varying poses, expressions, illumination and disguise. In Applications of Computer Vision (WACV), 2013 IEEE Workshop on, pages 186–192. IEEE, 2013.

25. Pablo Gil Vázquez, Carlos Mateo Agulló, Fernando Torres Medina. 3D visual sensing of human hand for remote operation of a robotic hand. International Journal of Advanced Robotic Systems, 11:26, 2014.

26. Zihong Chen, Lingxiang Zheng, Yuqi Chen, Yixiong Zhang. 2D hand tracking based on flocking with obstacle avoidance. International Journal of Advanced Robotic Systems, 11:22, 2014.

27. Juan Antonio Corrales Ramón, Gabriel Jesús García Gómez, Fernando Torres Medina, Véronique Perdereau. Cooperative tasks between humans and robots in industrial environments. International Journal of Advanced Robotic Systems, 9:1–10, 2012.

28. L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.

29. M.K. Brown, Turin. HMM based online handwriting recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 18(10):1039–1045, 1996.

30. B.L. Van, Garcia-Salicetti, Dorizzi. On using the Viterbi path along with HMM likelihood information for online signature verification. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 37(5):1237–1247, 2007.

31. R.R. Brooks, J.M. Schwier, Griffin. Behavior detection using confidence intervals of hidden Markov models. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 39(6):1484–1492, 2009.

32. Vakanski, Mantegh, Irish, Janabi-Sharifi. Trajectory learning for robot programming by demonstration using hidden Markov model and dynamic time warping. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 42(4):1039–1052, 2012.

33. Yamato, Ohya, Ishii. Recognizing human action in time-sequential images using hidden Markov model. In Computer Vision and Pattern Recognition, 1992. Proceedings CVPR'92., 1992 IEEE Computer Society Conference on, pages 379–385. IEEE, 1992.

34. Yamada, Morizono, Umetani, Konosu. Warning: To err is human [human-friendly robot dependability]. Robotics & Automation Magazine, IEEE, 11(2):34–45, 2004.

35. Lenz, Sotzek, Roder, Radrich, Knoll, Huber, Glasauer. Human workflow analysis using 3D occupancy grid hand tracking in a human-robot collaboration scenario. In Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on, pages 3375–3380. IEEE, 2011.

36. S.L. Delp, F.C. Anderson, A.S. Arnold, Loan, Habib, C.T. John, Guendelman, D.G. Thelen. OpenSim: Open-source software to create and analyze dynamic simulations of movement. Biomedical Engineering, IEEE Transactions on, 54(11):1940–1950, 2007.

37. J.P. Wachs, Kölsch, Stern, Edan. Vision-based hand-gesture applications. Communications of the ACM, 54(2):60–71, 2011.

38. M.M. Gaber, Zaslavsky, Krishnaswamy. Mining data streams: A review. ACM Sigmod Record, 34(2):18–26, 2005.

39. Triesch, Von Der Malsburg. Robotic gesture recognition by cue combination. Proceedings of the Informatik, 98:21–25, 1998.

40. D.J. Sturman, Zeltzer. A survey of glove-based input. Computer Graphics and Applications, IEEE, 14(1):30–39, 1994.

41. Baudel, Beaudouin-Lafon. Charade: Remote control of objects using free-hand gestures. Communications of the ACM, 36(7):28–35, 1993.

42. T.B. Sheridan, W.R. Ferrell. Remote manipulative control with transmission delay. Human Factors in Electronics, IEEE Transactions on, (1):25–29, 1963.

43. Bribiesca. A new chain code. Pattern Recognition, 32(2):235–251, 1999.

44. Kohavi, Provost. Glossary of terms. Machine Learning, 30(2-3):271–274, 1998.