Abstract
Programming robots by human demonstration is an intuitive approach, especially through gestures. Because robot pick-and-place tasks are widely used in industrial factories, this paper proposes a framework to learn robot pick-and-place tasks by understanding human hand gestures. The proposed framework is composed of a gesture recognition module and a robot behaviour control module. For gesture recognition, transport empty (TE), transport loaded (TL), grasp (G), and release (RL) from Gilbreth's therbligs are the hand gestures to be recognized. A convolution neural network (CNN) is adopted to recognize these gestures from a camera image. To achieve robust performance, a skin model based on a Gaussian mixture model (GMM) is used to filter out non-skin colours of an image, and a calibration of position and orientation is applied to obtain the neutral hand pose before the training and testing of the CNN. For robot behaviour control, the robot motion primitives corresponding to TE, TL, G, and RL are implemented in the robot. To manage the primitives in the robot system, a behaviour-based programming platform based on the Extensible Agent Behavior Specification Language (XABSL) is adopted. Because XABSL provides flexibility and re-usability of the robot primitives, the hand motion sequence from the gesture recognition module can be easily used in the XABSL programming platform to implement robot pick-and-place tasks. The experimental evaluation of seven subjects performing seven hand gestures showed an average recognition rate of 95.96%. Moreover, an experiment on the XABSL programming platform showed that a cube-stacking task was easily programmed by human demonstration.
Keywords
1. Introduction
Human-Robot interaction (HRI) has emerged as a crucial, interdisciplinary research topic which has many promising applications [1]. Sato et al. [2] showed that understanding human intention through human behaviours is an essential function for a robot because about 35% of the information is delivered to a listener's side by means of language and about 65% is obtained by nonverbal communication. Compared to previous work using verbal commands in human-robot interaction [3], understanding human hand gestures provides a novel, natural method to program robots. This task, however, is full of challenges because essential hand features and learning algorithms are severely influenced by hand shape and environmental conditions such as lighting. To implement a good hand gesture recognition system, a large training database is usually required and various gestures should be modelled. In addition, the recognized human hand gesture should be interpreted by a robot system to manipulate objects and interact with an environment. In other words, the robot should achieve the task goal according to the stimulus conditions and action priorities of the plan understood from the human hand gestures. In this paper, we develop a human gesture recognition system based on a convolution neural network (CNN). Without much effort on modelling gestures, an image of a hand is simply fed to the CNN. The second challenging problem is to manage the input and output resources of the robot system. This work develops a behaviour-based robot system to plan the task demonstrated by a human hand. To the robot system, the recognized gesture commands are regarded as high-level task sequence inputs. Once the human hand motion is understood by the gesture recognition system, the robot task sequence is generated for the behaviour-based robot system to execute the task. Figure 1 illustrates the proposed idea that enables a robot to understand human hand gestures and learn pick-and-place tasks.

Proposed idea of understanding human hand gestures and learning pick-and-place tasks
The novelty of our research is a robot task-learning system with a front-end CNN-based gesture recognition system and a back-end behaviour-based robot programming platform to deal with pick-and-place tasks. We propose a computational approach based on a CNN to recognize human hand gestures without using complicated algorithms to extract hand features such as hand contour and curvature. Through the convolution and sub-sampling layers of a CNN, the extracted features tolerate small dislocations. To reduce the effect of varied hand poses on recognition accuracy, the principal axis of the hand is found to calibrate the image in this work. Calibrated images help the CNN learn and recognize gestures correctly. In addition, the light condition affects the skin colour, which severely degrades the recognition rate. To solve this problem, we adopt a Gaussian mixture model (GMM) to derive a robust skin model. In a continuous hand motion, transitive gestures may be ambiguous for the CNN to recognize, so we apply forward-and-backward search rules that assign a transitive gesture the label of the most similar image among its adjacent frames. The rules correct misrecognized transitive gestures without changing the semantic meaning of the continuous motion.
In the proposed robot system, sensor inputs and action outputs are encapsulated in the form of behaviours to react to or interact with an environment. Basic skills for pick-and-place tasks such as Gilbreth's transport empty (TE), grasp (G), transport loaded (TL), release (R), and search (SH) [4] are initially created. The planner of behaviours is generated by the front-end gesture recognition system and described in the Extensible Agent Behavior Specification Language (XABSL) [5] to maintain the multi-tasking performance of the robot system. The XABSL programming platform handles the multi-tasking scheduling to process images from cameras and readings from robot encoders, and commands the robot to behave under proper conditions. The advantages of the XABSL programming platform are: first, a task is implemented by a hierarchical state machine of robot behaviours and planned by an option graph in XABSL. Second, behaviours are reusable and modular for pick-and-place tasks.
In the following sections, Section 2 presents a brief review of related research. Section 3 introduces the proposed framework and describes the CNN-based gesture recognition system in detail, and Section 4 presents the behaviour-based robot programming platform. Section 5 provides experimental evaluations, and concluding remarks are summarized in the final section.
2. Review of related research
2.1 Programming by demonstration
Programming by demonstration (PbD) [6] is an alternative HRI methodology to program robots without programming expertise. PbD works by understanding human motion and generating a corresponding robot motion to accomplish tasks. Most previous PbD work represented human demonstration by motion trajectories and made robots replay these encoded trajectories. The human motion trajectories could be joint, Cartesian, or force trajectories [7]. Human motion was modelled as a cyclic motion [8], a discrete motion [9], or a combination of cyclic and discrete motions [10]. Since these demonstrated trajectories had high dimensionality, local [11] and global [12] transformation methods were adopted to reduce the dimensionality. To abstract human motion, motor primitives were used in PbD. They could be perceptual-motor primitives [13], movement primitives [14], or control modules [15]. Simply speaking, these primitives encoded the mapping between sensory stimuli and action responses and were treated as the building blocks for robots to accomplish tasks. The main advantage of using motor primitives to program a robot is that they are reusable, which facilitates robot programming.
For a robot, pick-and-place motions are commonly used in a variety of manipulative tasks. In [16], a 6D-tracker was placed on a human demonstrator and used to capture motion trajectories. The velocity profile of the trajectories was analysed to decompose a pick-and-place task into task primitives. These task primitives were basic movements such as MoveZ (move along the z-axis) and MoveJ (move to the position using joint interpolation). In [17], a robot was taught low-level action primitives by kinaesthetic teaching, and a graphical interface was developed to plan these primitives into a behaviour. For example, a pick-and-place behaviour was implemented by primitives such as Input, MoveArm, Close/Open, and Gripper Output. Zoliner et al. [18] proposed a hierarchical task representation to describe manipulation problems such as pick-and-place tasks, device handling, and tool handling. A task was implemented by elementary operators, and the knowledge was extracted from human demonstration. These previous works defined their own motor primitives customized to specific manipulative tasks, and much of the work focused on analysing human motion data and mapping it to the customized primitives. However, human motion is high dimensional and changeable, making it difficult for a robot to duplicate the same motion with primitives. Thus, during PbD a robot should understand the human demonstration, including the purpose and the type of each action.
In this paper, a set of basic skills should be created for pick-and-place tasks. Before that, it is necessary to clarify the commonalities and differences between a skill and a task. Friedrich and Kaiser [19] stated that acquiring a new program schema at the task level is ‘learning of a sequence of elementary skills including proper application conditions'. Voyles and Khosla [20] assumed ‘a set of robotic skills that approximate the set of primitives that a human uses to perform a task from the particular application domain'. Nicolescu [21] considered ‘a task to be an activity involving the coordination of an existing set of skills in order to achieve a given set of goals'. Aramaki et al. [22] considered that a task sequence is defined by a concatenation of skills. Gilbreth [4] investigated 18 types of elemental motions (so-called therbligs) to represent human motion in a workspace; any human motion in a workspace can be represented by these 18 therbligs. In this paper, the therbligs transport empty (TE), grasp (G), transport loaded (TL), release (R), and search (SH) are used to understand human demonstration of pick-and-place tasks. Accordingly, a set of robot basic skills is created to achieve the pick-and-place tasks.
2.2 Hand gesture recognition
Hand gesture recognition is a promising topic that has been applied to many practical applications [23]. For example, hand gestures are observed and recognized by surveillance cameras to prevent criminal behaviour [24]. Hand gesture recognition has also been investigated in a variety of studies [25], such as sign language recognition [26], lie detection [27], and robot control [28]. Gesture is a body language that humans use to express emotions and thoughts; the varied configurations of the five fingers and the palm carry physical meanings. Hand gesture recognition is a complicated process including feature extraction, gesture modelling, and learning. The scale-invariant feature transform (SIFT) has been used to extract hand features from images [29], where the image features were represented as vectors describing the hand shape. Central moments were used in [30, 31] to represent the hand shape because they are invariant to translation and scaling and physically meaningful; they describe the inclination, the displacement of the centre, and the stretch of a hand. The hand contour has often been adopted as the feature in gesture recognition. Huang [32] used Fourier descriptors to describe the hand contour and applied the Hausdorff distance to measure the similarity between two hand shapes. McElroy and Wilson [33] used Fourier descriptor coefficients to describe the hand shape and then applied a neural network to the coefficients to classify different hand shapes. Tantachun et al. [34] considered that biometrics such as fingerprints and hand geometry could be extracted by the Eigenface method [35, 36]. Al-Jarrah [37] and Holden [38] plotted the hand contour in polar coordinates by measuring the distance from the hand centre to the edge. In previous work on gesture modelling, the Hidden Markov Model (HMM) was applied to a real-time semantic-level American Sign Language recognition system [39]; a gesture can also be modelled as an HMM state sequence.
In [40], a Finite State Machine (FSM) model was adopted to recognize human gestures, and in [41], a Time Delay Neural Network (TDNN) was used to match motion trajectories and train gesture models. To recognize gestures in a continuous motion, Jo et al. [42] used an FSM to deal with a task-level recognition problem where a task was represented by a state transition diagram and every state represented a possible gesture. Some researchers used rule-based methods for gesture recognition; Cutler and Turk [43] designed a set of rules to recognize waving, jumping, and marching gestures. In recent years, deep learning has been widely applied to many applications. In particular, the CNN is a proper method for image-based learning; for example, [44] used a CNN to recognize open and closed hands.
3. Framework of the robot task-learning system
The flowchart of the proposed robot task-learning system is shown in Fig. 2. The figure shows that the system is composed of a gesture recognition system and a behaviour-based robot programming platform. The proposed system aims at understanding human hand gestures and learning to accomplish a pick-and-place task in the same manner.

Framework of the proposed robot task-learning system
For the gesture recognition system, human hand motion and the environmental conditions are captured by a camera, and the captured images are processed to generate a sequence of gestures and position changes of objects. The behaviour-based robot programming platform consists of basic skills and a planner. Every skill is treated as a behaviour that is activated by stimuli from camera images, a force sensor, or robot encoders. The planner is described in the XABSL syntax to implement the sequence and the stimulus conditions of behaviours. The XABSL programming platform maintains the multi-tasking schedule among behaviours. Thus, any sequence of gestures understood from human demonstration can be implemented on the XABSL programming platform. The details of the gesture recognition system and the behaviour-based robot programming platform are described as follows.
3.1 Gesture recognition system
Figure 3 shows the flowchart of the gesture recognition system. From an input image, the hand is extracted by skin-colour segmentation. The skin model is trained with a Gaussian mixture model to classify skin and non-skin colours. After that, a calibration of the hand position and orientation is used to translate and rotate the hand image to the neutral pose. The calibrated image is fed to the CNN to train or test the network. For continuous hand motion, post-processing is used to filter out the noise in the CNN results. Each block of the flowchart is described as follows.

Flowchart of the proposed human gesture recognition system
3.1.1 Human demonstration
One of the salient features of the proposed robot task-learning system is that the input to the robot system is given by means of human demonstration. In other words, a human can use gestures captured by a camera to command the robot to perform a pick-and-place task. Currently, the proposed system uses Gilbreth's therbligs TE (transport empty), TL, G, and R to understand human motion. Figure 4 shows that the human demonstrates a cube-stacking task and the robot learns to understand the demonstration and re-perform the task.

(a) A human demonstrates a cube-stacking task. (b) The robot re-performs the same task.
3.1.2 Skin model for colour segmentation
In a vision system, the light condition is an important factor affecting system performance. Using colour thresholds to classify skin and non-skin colours is common in conventional approaches, but colour thresholds are not enough to describe the statistical properties of skin colour under various light conditions. Even when the YCbCr colour space, which is less sensitive to the light condition than the RGB colour space, is used, the result is still defective. For example, the pixel value B of a binarized image is determined by a YCbCr threshold of the form

B(x, y) = 1 if Cb_min ≤ Cb(x, y) ≤ Cb_max and Cr_min ≤ Cr(x, y) ≤ Cr_max, and B(x, y) = 0 otherwise. (1)
Figure 5 shows the YCbCr image and its binarized image by the threshold in Eq. (1). In Fig. 5(b), some pixels on the ring and little fingers are classified as non-skin pixels. In this paper, we use a GMM to solve this problem. A GMM with K Gaussian components is represented as

P(x) = Σ_{k=1}^{K} P(k) P(x | k) = Σ_{k=1}^{K} π_k 𝒩(x | μ_k, Σ_k) (2)

Image binarization using a threshold in Eq. (1). (a) YCbCr image; (b) binarized image.
where P(k) is the prior probability (π_k) and P(x | k) is the conditional probability formulated as a Gaussian distribution with mean μ_k and covariance Σ_k. The parameters {π_k, μ_k, Σ_k} are optimized by the expectation-maximization (EM) algorithm and used to express P(x) in Eq. (2). The detailed steps of the GMM were described in [45]. For GMM training, we divide the workspace into 15 cells. In every cell, we sample the colours within the yellow area around the centre of the hand. Figure 6 shows the workspace divided into 15 cells in which skin colours are sampled as GMM training data. A greater number of cells means that more skin-colour data are sampled, allowing the GMM to capture the colour distribution of hand skin even where the light condition varies across the environment. The number 15 was chosen according to the lighting conditions of the experimental environment.

GMM training. (a) Workspace is divided into 15 cells; (b) Skin colour within the yellow area of the hand is sampled for every cell.
For an extreme case, we put a light source on the upper-right corner (see Fig. 7). Figure 8 compares the results between a threshold and a GMM. The result in Figure 8(c) is obviously superior to the one in Figure 8(b).

A light source on the upper-right side

Image binarization using a threshold in Eq. (1) and GMM. (a) Original RGB image; (b) binarized image by a threshold; (c) binarized image by GMM.
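To make the skin model concrete, the following sketch fits a Gaussian mixture to sampled colour vectors with plain EM and evaluates the mixture density of Eq. (2), which can then be thresholded to binarize pixels. This is a minimal illustration under our own assumptions (a two-component mixture, synthetic CbCr-like samples, and our function names), not the paper's implementation.

```python
import numpy as np

def gauss_pdf(X, mu, cov):
    """Multivariate normal density N(x | mu, cov) for each row of X."""
    d = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(cov)
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return norm * np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff))

def fit_gmm(X, K=2, iters=100, seed=0):
    """Fit a K-component Gaussian mixture to the rows of X with plain EM."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, K, replace=False)].astype(float)
    cov = np.stack([np.cov(X.T) + 1e-3 * np.eye(d) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    for _ in range(iters):
        # E-step: responsibilities r[i, k] proportional to pi_k * N(x_i | mu_k, cov_k)
        r = np.stack([pi[k] * gauss_pdf(X, mu[k], cov[k]) for k in range(K)], axis=1)
        r /= r.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate weights, means, and covariances
        Nk = r.sum(axis=0) + 1e-12
        pi = Nk / n
        mu = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            cov[k] = (r[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, cov

def density(X, pi, mu, cov):
    """P(x) of Eq. (2); threshold this value to classify skin vs non-skin pixels."""
    return sum(pi[k] * gauss_pdf(X, mu[k], cov[k]) for k in range(len(pi)))
```

In use, the CbCr values sampled from the 15 cells would form `X`; a pixel is labelled skin when `density` exceeds a chosen threshold, which is how the GMM replaces the fixed threshold of Eq. (1).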
3.1.3 Calibration of hand position and orientation
To obtain better recognition results, the hand images should be calibrated in position and orientation. Once the binarized image is derived, we calculate the hand centre (x̄, ȳ) as follows

x̄ = Σ_x Σ_y x·f(x, y) / Σ_x Σ_y f(x, y),  ȳ = Σ_x Σ_y y·f(x, y) / Σ_x Σ_y f(x, y)

where (x, y) are the pixel coordinates and the pixel value f(x, y) is set to 1 for a skin pixel and 0 otherwise.
For the orientation, we use moment invariants to find the principal axis of the hand [30]. The angle of orientation is expressed as

θ = (1/2)·arctan(2μ₁₁ / (μ₂₀ − μ₀₂))

where μ_pq are the second-order central moments of the binarized image. Once (x̄, ȳ) and θ are calculated, they are used to translate and rotate the image to the neutral pose.
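The centroid and principal-axis computations above can be sketched directly from the image moments. A minimal version, assuming a NumPy binary mask and our own function names (the atan2 form handles the degenerate μ₂₀ = μ₀₂ case):

```python
import numpy as np

def hand_centre(mask):
    """Centroid (x_bar, y_bar) of the binary skin mask f(x, y)."""
    ys, xs = np.nonzero(mask)
    return xs.mean(), ys.mean()

def principal_angle(mask):
    """Principal-axis angle from the second-order central moments:
    theta = 0.5 * atan2(2*mu11, mu20 - mu02)."""
    ys, xs = np.nonzero(mask)
    xc, yc = xs.mean(), ys.mean()
    mu11 = np.mean((xs - xc) * (ys - yc))
    mu20 = np.mean((xs - xc) ** 2)
    mu02 = np.mean((ys - yc) ** 2)
    return 0.5 * np.arctan2(2.0 * mu11, mu20 - mu02)
```

Translating the image by the negative centroid offset and rotating by −θ then yields the neutral pose used for CNN training and testing.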
3.1.4 Gesture recognition by a convolution neural network
To recognize human gestures, we adopt a CNN to classify the seven types of human hand gestures. The CNN is a type of feed-forward neural network. Small overlapped portions of the original image are represented by neurons in a layer of the network, so that the extracted features tolerate small translations. The neurons in a convolution layer share the same weights to save memory and improve efficiency. Compared to previous approaches using complicated image feature extraction, the CNN provides a robust and systematic methodology to classify hand gesture types. Figure 9 shows the architecture of the CNN, which has eight layers, denoted I1, C2, S3, C4, S5, N6, H7, and O8 in sequence [46]. I1 is the input layer, C2 and C4 are convolution layers, S3 and S5 are sub-sampling layers, N6 is the input layer of a feed-forward neural network, H7 is its hidden layer, and O8 is its output layer. C2, S3, C4, and S5 are interleaved, and S5 is connected to N6 of the feed-forward neural network. The input to I1 is a 28×28-pixel image. Applying six 5×5-pixel convolution patches to I1 generates six 24×24-pixel feature maps in C2. Each feature map in C2 then undergoes 2×2-pixel sub-sampling, resulting in six 12×12-pixel feature maps in S3. Applying 12 5×5-pixel convolution patches to the six feature maps in S3 generates 12 8×8-pixel feature maps in C4. Finally, 2×2-pixel sub-sampling of the feature maps in C4 generates 12 4×4-pixel feature maps in S5. The 12 4×4-pixel feature maps are expanded and concatenated into a 192×1 vector as N6. The 192 neurons of N6 are fully connected to the 192 neurons of H7, and H7 is fully connected to the seven neurons of O8. The implementation steps were described in [46].

Architecture of the CNN
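The layer sizes quoted above can be verified with simple arithmetic: a valid 5×5 convolution shrinks each side by 4, and non-overlapping 2×2 sub-sampling halves it. A small sketch (the function names are ours):

```python
def conv_out(size, kernel):
    """Valid convolution with stride 1: output side = input - kernel + 1."""
    return size - kernel + 1

def pool_out(size, factor):
    """Non-overlapping sub-sampling: output side = input // factor."""
    return size // factor

def feature_sizes(inp=28):
    """Trace the feature-map side lengths through C2, S3, C4, S5
    and the flattened length of N6 for 12 maps in S5."""
    c2 = conv_out(inp, 5)   # C2: six 24x24 feature maps
    s3 = pool_out(c2, 2)    # S3: six 12x12 feature maps
    c4 = conv_out(s3, 5)    # C4: twelve 8x8 feature maps
    s5 = pool_out(c4, 2)    # S5: twelve 4x4 feature maps
    n6 = 12 * s5 * s5       # N6: flattened 192x1 vector
    return c2, s3, c4, s5, n6
```

Running `feature_sizes()` reproduces the sequence 24, 12, 8, 4 and the 192-element vector described in the text.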
3.1.5 Post-processing in continuous motion
Since a continuous hand motion contains transitive hand gestures, these frames are ambiguous to the trained CNN. We design rules to post-process the gesture type of the transitive image frames. The rules are as follows
Using the rules, the transitive gestures are assigned to the certain gesture type in the closest preceding or successive frame.
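One plausible reading of this forward-and-backward search can be sketched as follows. The thresholds 0.9 and 0.5 are taken from the continuous-motion experiment reported later (a frame is trusted when its O8 output exceeds 0.9, and marked undetermined when all outputs fall below 0.5); the exact rule set and function names here are our assumptions, not the paper's code.

```python
def postprocess(labels, confs, hi=0.9, lo=0.5, undetermined=None):
    """Forward-and-backward search over a gesture sequence.

    labels[i] is the CNN's top class for frame i and confs[i] its O8 output.
    Frames below `hi` borrow the label of the nearest confident frame
    (ties broken toward the later frame); frames whose outputs are all
    below `lo` are marked undetermined.
    """
    confident = [i for i, c in enumerate(confs) if c >= hi]
    out = list(labels)
    for i, c in enumerate(confs):
        if c >= hi:
            continue  # already a confident frame; keep its label
        if c < lo or not confident:
            out[i] = undetermined  # no trustworthy evidence for this frame
        else:
            # nearest confident frame, preferring the successive frame on ties
            j = min(confident, key=lambda j: (abs(j - i), -j))
            out[i] = labels[j]
    return out
```

Under this reading, a transitive frame between two grasps inherits the adjacent confident label, so the semantic meaning of the continuous motion is preserved.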
4. Behaviour-based robot programming platform
4.1 Basic skills
As mentioned in Section 2, Gilbreth's 18 therbligs are used to analyse human demonstration. Accordingly, we create a corresponding set of robot basic skills, TE, TL, G, R, SH, and search precisely (SHP), to accomplish pick-and-place tasks. Once a human demonstration is recognized as a sequence of therbligs, a plan scheduling the robot basic skills is directly synthesized for the robot to accomplish the same task. Table 1 shows the robot basic skills and their descriptions.
Basic skills for pick-and-place tasks
4.1.1 Sensors
Each basic skill serves as a behaviour whose stimuli are the pre-conditions that activate it to act properly. Stimuli in the proposed system come from the cameras, the force sensor, and the robot encoders. Two cameras are used: camera 1 is hung above the workspace and camera 2 is an eye-on-hand camera. Camera 1 and camera 2 are used for the basic skills SH and SHP, respectively, because camera 1 is far from an object and the acquired object location is less precise, whereas camera 2 is close to an object and is used to acquire a more precise object location.
4.1.2 Planner
The planner is the key block of the proposed system. In particular, the plan for managing behaviours is easily described on the XABSL platform [5]. The XABSL representation describing the execution plan of behaviours is an option graph, which is simply a diagram of hierarchical state machines. Option graphs allow users to plan the execution sequence of behaviours simply. In the proposed system, options are planned to manage basic skills to accomplish pick-and-place tasks. A basic skill is described by a simple option that has inputs (stimuli), outputs (actions), and a decision. Thus, the basic skills presented in Table 1 are implemented by options. Figure 10 shows the XABSL syntax of TE and the option graph to execute the sequence of TL, G, and R.

(a) XABSL syntax of TE. (b) Option graph to execute the sequence of TL, G, and R.
To illustrate the XABSL platform for pick-and-place tasks, the cube-stacking task is implemented. The task is to stack cubes in the order blue, red, and green, so XABSL needs to plan a sequence of basic skills to stack the cubes. Figure 11 shows the option graph of the task. In the option graph, the sequence for picking the red cube is described as: SH searches for the red cube using camera 1, TE moves the robot over the red cube, SHP searches for the red cube again using camera 2 to acquire a more accurate location, TE moves the robot to the red cube, and G grasps the red cube. The remaining pick-and-place sequences follow the same scheduling logic.
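The option graph above can be mirrored by a plain Python analogue (not XABSL syntax): each cube is handled by a pick sequence (coarse search, approach, precise search, final approach, grasp) followed by a place sequence (search the target, transport loaded, release). The decomposition into `pick`/`place` helpers and the exact per-cube primitive ordering are our assumptions based on the description of Fig. 11.

```python
def pick(cube):
    """Pick sequence: SH with camera 1, approach (TE), SHP with camera 2,
    final approach (TE), and grasp (G)."""
    return [("SH", cube), ("TE", cube), ("SHP", cube), ("TE", cube), ("G", cube)]

def place(cube, target):
    """Place sequence: search the target cube, transport loaded, release."""
    return [("SH", target), ("TL", cube), ("R", target)]

def stack_plan(order):
    """Plan for stacking cubes bottom-to-top in the demonstrated order."""
    plan, top = [], order[0]          # the first cube stays on the desk
    for cube in order[1:]:
        plan += pick(cube) + place(cube, top)
        top = cube                    # the next cube goes on top of this one
    return plan
```

For the blue-red-green order, `stack_plan(["blue", "red", "green"])` first picks the red cube and places it on the blue one, then picks the green cube and places it on the red one, matching the option-graph schedule.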

Option graph to perform cube stacking in blue, red, and green order
5. Experimental evaluation
5.1 Validation of GMM
In the experiment, the GMM for modelling the skin colour was validated. Figure 12 shows the testing environment with light on/off. For either lighting condition, 600 training and 200 testing images of a gesture type were chosen to test the GMM. Figure 13 shows the seven gesture types that were tested in the experiment. Tables 2 and 3 show the recognition rates by the threshold of Eq. (1) and GMM of Eq. (2), respectively. Obviously, when the light was off, the result of the threshold was degraded.
Recognition rates using the colour threshold of Eq. (1)
Recognition rates using the GMM of Eq. (2)

Testing environment for GMM. (a) Light on; (b) Light off.

Seven gesture types to be recognized
5.2 Gesture recognition
In this paper, seven gesture types (see Fig. 13) were chosen for recognition. The images were taken by an XBOX Kinect camera set up right above the hand. The image size was 200×200 pixels. First, each image was converted to the YCbCr colour space and the background was removed by the GMM. After the image was down-sized and its background was removed, it became a 28×28-pixel grey-scale or binarized image (see Fig. 14) to save computational time in the CNN. For each gesture type, we took 600 images from a subject to train the CNN and 200 to test it. To validate the proposed method, we used four-fold cross-validation. Tables 4 and 6 show the recognition rates of the seven gesture types using their binarized and grey-scale images, respectively, when the background was removed. From the results, the recognition rates were high for gesture types (a)-(f). However, the recognition rate of gesture (g) was poor using the binarized images; with the grey-scale images, it improved. The reason is that the binarized images were similar among gesture types (d), (e), and (g), whereas their grey-scale images contained more detail, such as the finger positions. Figure 15 shows the binarized and grey-scale images of gesture types (d), (e), and (g). Tables 5 and 6 compare the recognition rates of the grey-scale images with and without the background, respectively. Even though the images were grey-scale, the recognition rate without a background was clearly superior to the one with a background.
Recognition rates of the seven gesture types using their binarized images when the background was removed
Recognition rates of the seven gesture types using their grey-scale images when the background was not removed
Recognition rates of the seven gesture types using their grey-scale images when the background was removed

Images after the original images were down-sized and the background was removed

Binarized and grey-scale images of gestures (d), (e), and (g)
To assess the contribution of the proposed method, two image recognition methods, Eigenface [35, 36] and Fourier Descriptor [33], were compared with the proposed CNN method using the same training and testing data set. For the Eigenface method, 3×3 Eigenface images were chosen. Tables 7 and 8 show the recognition rates of the seven gesture types using the Eigenface and Fourier Descriptor methods, respectively. Table 9 shows the performance comparison among the CNN, Eigenface, and Fourier Descriptor methods. The CNN method was superior to the other two methods in total average recognition rate. The average execution time for 200 testing images was 5.97 seconds for the CNN method, much faster than 12.65 seconds for the Fourier Descriptor method and 34.58 seconds for the Eigenface method.
Recognition rates of the seven gesture types using the Eigenface method
Recognition rates of the seven gesture types using the Fourier Descriptor method
Performance comparison
To test the generality of the trained CNN model, 300 untrained images of each gesture type from three testers were tested, where each tester provided 100 testing images. Table 10 shows the recognition rates. From Table 10, the rates for gesture types (d) and (e) from testers 2 and 3 were poor. Figure 16 shows the testing images of gesture types (d) and (e) from testers 2 and 3. Because the hand of tester 2 was much smaller than those in the training images, gesture types (d) and (e) of tester 2 could not be recognized clearly. On the contrary, the hand of tester 3 was much larger than those in the training images, resulting in failures in recognizing gesture types (d) and (e). Thus, we trained on these 300 testing images and tested another 300 untrained images for gesture types (d) and (e). Table 11 shows that the average recognition rates then improved to above 90%. To validate the system with a larger database, we trained and tested 700 images from seven subjects for each gesture type, where each subject provided 100 training and 100 testing images. Table 12 shows the recognition rates of the testing images of the seven gesture types; the average rate was 95.96%.
Recognition rates of the seven gesture types where the hand images of testers 1, 2, and 3 were untrained
Recognition rates of the seven gesture types where the hand images of testers 1, 2, and 3 were trained
Recognition rates of the seven gesture types from seven subjects

Gestures (d) and (e) from testers 2 and 3. (a) and (b) are gesture types (d) and (e) from tester 2; (c) and (d) are gesture types (d) and (e) from tester 3.
To test the pose calibration of the proposed method, the hand waved continuously through about ±90°. The frame rate was 40. After the principal axis was calibrated, the hand was orientated to a neutral pose. Figure 17 shows frames 1∼40 of the handwaving motion. The grey-scale and white images are before and after calibration, respectively. The tilted hands were rotated to their neutral poses correctly.

Frames (a) 1∼20 and (b) 21∼40 of a handwaving motion
In addition, our experimental evaluation of a continuous motion was performed with the gestures shown in Fig. 18. The original result was incorrect in frames 2 and 21. For frame 2, the closest frame whose O8 value in the CNN was greater than 0.9 was frame 3. However, all the O8 values of frame 21 were less than 0.5. By applying the proposed post-processing rules, frame 2 was changed from gesture type (d) to (e) and frame 21 was denoted as an undetermined frame.

Frames 1∼21 of a continuous motion
5.3 Spatial and temporal reasoning
After a hand gesture was recognized, its corresponding robot basic skill, TE, TL, G, or R, was determined. However, some gesture types corresponded to several possible robot basic skills; for example, gesture type (a) could be a state of TE or R. To solve this problem, spatial and temporal rules were applied to recognize the corresponding robot basic skills TE, TL, G, and R from human hand gestures. Table 13 and Fig. 19 show the spatial and temporal rules, respectively. In Table 13, two spatial properties are used to determine the possible basic skills for the current gesture type. For example, when the hand position was near an object and the gesture type was (a), the corresponding robot basic skill was TE or R. After the possible gesture types were detected, they were fed to the temporal state machine to determine the exact corresponding robot basic skill. For example, in Fig. 19, when the current state was G and the possible gesture types TE/R were predicted from Table 13, the estimated next state was R instead of TE because G could not jump to TE.
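The temporal disambiguation step can be sketched as a small transition check. The transition table below is our assumption, inferred from the pick-and-place cycle described in the text (it reproduces the worked example in which G cannot jump to TE); the actual state machine of Fig. 19 may differ.

```python
# Assumed allowed transitions of the temporal state machine (our reading
# of the pick-and-place cycle TE -> G -> TL -> R -> TE).
ALLOWED_NEXT = {
    "TE": {"G"},        # after reaching an object, the hand grasps it
    "G":  {"TL", "R"},  # after grasping, transport loaded (or release in place)
    "TL": {"R"},        # after transporting, release the object
    "R":  {"TE"},       # after releasing, reach for the next object
}

def disambiguate(prev_skill, candidates):
    """Keep only the candidate skills reachable from the previous state;
    return the skill when exactly one remains, otherwise None."""
    feasible = ALLOWED_NEXT[prev_skill] & set(candidates)
    if len(feasible) == 1:
        return feasible.pop()
    return None  # still ambiguous, or no feasible transition
```

With this table, the spatial candidates TE/R after state G resolve to R, matching the example above.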

Temporal rules
Spatial rules
5.4 Cube-stacking task
In the cube-stacking task, a teacher demonstrated how to stack cubes. There were green, red, and blue cubes on the desk. The teacher stacked these three cubes in an arbitrary order, and the robot then followed the same order to stack the cubes. Figures 20(a) to 20(k) show the teacher's demonstration. Figure 20(a) shows that the teacher performed TE to reach for the green cube. Figure 20(b) shows that G was performed to grasp the green cube.

(a) TE was performed to reach the green cube; (b) G was executed to grasp the green cube; (c) TL was performed to transport the green cube; (d) the green cube was brought over the blue one; (e) R was performed to place the green cube on the blue one; (f) TE was performed; (g) the hand reached the red cube; (h) G was executed to grasp the red cube; (i) TL was performed to transport the red cube; (j) the red cube was brought over the green one; (k) R was performed to place the red cube on the green one
Figures 20(c) and 20(d) show that the teacher performed TL to transport the green cube and bring it over the blue one. Figure 20(e) shows that R was executed to place the green cube on the blue one. Figures 20(f) and 20(g) show that the teacher performed TE to reach the red cube. Figure 20(h) shows that G was performed to grasp the red cube. Figures 20(i) and 20(j) show that the teacher performed TL to transport the red cube and bring it over the green one. Figure 20(k) shows that R was executed to place the red cube on the green one.
After the human demonstration, Figures 21(a) to 21(j) show that the robot stacked the cubes in the same order. After the therbligs of the human demonstration were recognized, the robot searched for the red, green, and blue cubes and placed them in the same order. In this experiment, the motion primitives were performed as follows: SH (search for the green cube), TE (move to the green cube), G (grasp the green cube), SH (search for the blue cube), TL (transport the green cube), R (place the green cube on the blue one), SH (search for the red cube), TE (move to the red cube), G (grasp the red cube), SH (search for the green cube), TL (transport the red cube), and R (place the red cube on the green one). The motion sequence was automatically synthesized and implemented as XABSL options. In particular, the motion primitives SH (search) and SHP (search precisely) were inserted into the XABSL options because they were not observable in the human demonstration.
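The synthesis of the executed sequence from the observed therbligs can be sketched as a simple insertion rule: before the robot moves toward a target (TE or TL), it must first locate that target, so a search primitive is inserted. This is an illustrative sketch of how the XABSL option sequence could be assembled, not the paper's actual planner; the function name and the rule of inserting SH before every TE and TL are assumptions consistent with the sequence listed above.

```python
def synthesize(observed):
    """Insert the unobservable search primitive SH before each TE and TL,
    since the robot must locate a cube before moving toward it.
    `observed` is the therblig sequence recognized from the demonstration."""
    plan = []
    for skill in observed:
        if skill in ("TE", "TL"):
            plan.append("SH")  # search for the target cube first
        plan.append(skill)
    return plan
```

Applied to the recognized demonstration TE, G, TL, R, TE, G, TL, R, this reproduces the twelve-primitive sequence executed by the robot in the experiment.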

(a) Search for the green cube and reach it; (b) grasp the green cube; (c) search for the blue cube and transport the green one toward it; (d) reach the blue cube; (e) place the green cube on the blue one; (f) search for the red cube and reach it; (g) grasp the red cube; (h) search for the green cube and transport the red one toward it; (i) reach the green cube; (j) place the red cube on the green one
6. Concluding remarks and future work
We have introduced a novel robot task-learning system for pick-and-place tasks based on understanding human hand gestures. To achieve robust performance, the system is implemented as two subsystems: the hand gesture recognition system and the behaviour-based robot programming platform. To recognize human hand gestures, the proposed system adopts a CNN to learn seven types of gestures. The underlying idea is that, with a CNN, we do not need handcrafted gesture models or features (e.g., hand shape). Since the CNN learns the gestures through convolution and subsampling of the images, the proposed system works on various types of gestures. In addition, calibration of colour and hand pose is presented to achieve a robust recognition result. To account for the effect of the lighting condition, a GMM is used to model the skin colour. Moment invariants are used to find the principal hand axis for calibrating the hand pose to a neutral position.
After a gesture is recognized, the behaviour-based robot programming platform provides a direct approach to synthesizing the task sequence of the human demonstration. The underlying idea is to use XABSL options and basic skills such as TE, TL, G, and R to implement the demonstrated task sequence. Each skill serves as a behaviour that is activated by stimuli from the images and robot encoders to perform the proper action.
In the experiments, four-fold cross-validation was used to validate the system, where 600 and 200 images from a subject were used to train and test the CNN, respectively; the results showed that the average recognition rate over the seven gesture types was around 98.73%. The CNN method outperformed the other two methods, Eigenface and Fourier descriptor, in recognition rate. In particular, the execution time of the CNN (5.73 s) was much shorter than that of the Eigenface (34.58 s) and Fourier descriptor (12.65 s) methods in testing 200 images. In addition, the Eigenface method was vulnerable to changes in the lighting condition, and the Fourier descriptor, which analysed only the hand contour, could not provide enough information to represent delicate gestures. In contrast, the proposed method using the GMM and CNN was robust against changes in the lighting condition and provided detailed gesture features, achieving better results. The proposed system also achieved satisfactory results on transitive gestures in a continuous motion using the proposed post-processing rules. Finally, the system implemented a cube-stacking task with the XABSL-based platform: after observing the human demonstration, the robot stacked the cubes in the same order.
The salient feature of the proposed system is learning pick-and-place tasks by understanding human hand gestures. The success of the system rests on the marriage of the gesture recognition system and the behaviour-based programming platform: the former provides high-level task sequence inputs to the latter to accomplish pick-and-place tasks from human demonstration.
A limitation of the CNN is that when an image has low resolution yet many details, it is hard for the CNN to extract its features because the convolution and subsampling blur the image. Moreover, in the proposed system, a gesture type does not uniquely correspond to a robot motion; for example, gesture types (a) and (c) can be a state of robot motion TE or R. In the proposed method, spatial and temporal rules were adopted to determine the robot motion. Thus, another limitation is that such simple rules are insufficient for complex motions. A potential solution is to apply a systematic method such as a hidden Markov model (HMM) to infer the spatial and temporal relationship between gesture and motion.
A potential application of the system is a human-robot interface that allows human operators to program an industrial robot by demonstration. Operators need no robot-programming experience and yet can command a robot to perform industrial tasks such as pick and place. Currently, the proposed system interprets human demonstration from hand gestures rather than intentions. In future work, understanding human intentions to accomplish robot tasks would be an interesting research topic.
7. Acknowledgements
This work was supported in part by the cooperative educational research project NTUT-MMH-102-02 from National Taipei University of Technology and Mackay Memorial Hospital, Taipei, Taiwan. Special thanks go to W. K. Chen, C. H. Cheng, and M. H. Hsu for preparing the experimental work.
