Abstract
With rapid developments in biometric recognition, much attention is being paid to robots that interact smartly with humans through certain types of biometric information. Such human–machine interaction (HMI), here in the form of human–robot interaction (HRI), is expected to become an important development for automotive manufacturing applications. Hand gesture recognition-based HRI designs are already in practical use in various areas of automotive manufacturing, including assembly lines, supply chains, and collaborative inspection. However, very few studies focus on material-handling robot interaction driven by the operator's hand gesture communication. The current work develops a depth sensor-based dynamic hand gesture recognition scheme for continuous-time operations with material-handling robots. The proposed approach employs the Kinect depth sensor to extract Hu moment invariant features from depth data, on which feature-based template-matching hand gesture recognition is built. To construct continuous-time robot operations from dynamic hand gestures, concatenated as a series of gesture actions, a wake-up reminder scheme based on fingertip detection is established to accurately denote the starting, ending, and switching timestamps of the series. To enable template matching on continuous-time dynamic hand gestures with real-time recognition, representative frame estimates using centroid, middle, and middle-region voting approaches are also presented and combined with the template-matching computations. Experimental results show that, over certain continuous-time periods, the proposed hand gesture recognition framework provides smoother operation of the material-handling robot than control based only on extraction of full frames; the representative frames estimated by middle-region voting maintain fast computation while still reaching a competitive recognition accuracy of 90.8%. The method proposed in this study can facilitate the smart assembly line and human–robot collaboration in automotive manufacturing.
Introduction
Rapid developments in artificial intelligence techniques are giving rise to many smart human–machine interaction (HMI) applications and are further contributing to the field of smart manufacturing.1–4 The main idea of smart HMI is to create a specific interaction between the target end-device and the person using certain types of biometric features derived from that person. It is well known that speech communication using speech recognition is a common and widely accepted way of interacting with a specific device, for example when a smart phone or smart speaker uses speech recognition to facilitate device operations. 5 In addition to speech communication-type HMI, HMI design using hand gesture communication has served as an alternative mode for establishing interaction with the machine. The main purpose of human–machine interaction using hand gesture communication is to acquire the activity information of hand gesture actions made by the person, through which a specific interaction with the machine can then be carried out. Hand gesture communication with hand gesture recognition has become an extremely interesting HMI technique, with numerous related studies appearing in recent years.6–8 As speech recognition matures, it is believed that hand gesture recognition will also become a widely accepted HMI mode in the near future.
Among all HMI fields, human–robot interaction (HRI) is attracting much attention.9–11 HRI is an important HMI form in which the machine device interacting with the human is specifically a "robot." The above-mentioned speech communication with speech recognition, and hand gesture communication with hand gesture action recognition, have also been attempted in order to make fine connections with specific robot systems. In such studies,12,13 voice control-based speech recognition is effectively used for humanoid robot and ROS robot operations. Because of the inevitable performance restrictions of speech recognition in noisy environments, hand gesture recognition will perhaps become a very acceptable path for HRI application system developments. 14 In fact, with the rapid growth of advanced image sensors such as three-dimensional (3D) and depth sensors, the performance degradation that conventional RGB sensor-based image processing suffers under complicated and variable backgrounds can be significantly alleviated. There has been a remarkable emergence of works on 3D sensor-based or depth sensor-based hand gesture recognition as an interaction interface between the robot system and the user.15–27 Studies concerning HRI via hand gesture communication can be categorized into various applications, such as imitation robots,15,16 show robots, 17 sport or rehabilitation instructor expert robots,18,19 surgery robots, 20 and industrial robots.21–34
With regard to applications and developments of industrial robots, hand gesture communication-based HRI is attracting a great deal of interest. In Van den Bergh et al., 21 3D hand gesture interaction is carried out with a robot that can search for the person to interact with, ask for a direction, and then detect the indicated 3D pointing direction. The HRI system developed in Van den Bergh et al. 21 belongs to the category of "reconnaissance robots." Scene recognition, which makes it possible to plan the walking area and avoid obstacle collisions by combining cloud computing and artificial intelligence techniques, is explored in the application of "clean robots." 22 Moreover, a "mine rescue robot," which works in an underground environment with low illumination, is presented in Shang et al., 23 where depth sensor-based hand gesture recognition is used as the interface for interacting with the rescue robot. On the other hand, with the greatly increasing demand for automated manufacturing, substantial interest is being attracted by the "collaborative robot," sometimes also referred to as the "Cobot," controlled by the hand gestures of the operator. This kind of hand gesture control-based human–robot collaboration mode is generally employed on the assembly line in automotive manufacturing. Studies can also be found on human–robot collaborative assembly, supply chain management, and material inspection with hand gesture recognition.24–34 As can be seen in Figure 1, the operator must always stay close to the image sensor to perform specific hand gesture actions in order to carry out collaborative operations with the robot. Hand gesture recognition in such automotive manufacturing applications24–34 can be primarily divided into three categories of approaches, namely (1) recognition of static hand poses using only a spatial approach without involving kernel pattern recognition techniques, such as detection of finger segmentations 24 and estimation of hand joint variations observed in the 3D (x, y, z) space of the 3D image sensor (e.g. using the Leap Motion controller),25–27 (2) statistical model-based recognition techniques using the typical context-dependent statistical model, namely the hidden Markov model (HMM), for classifying continuous-time dynamic hand gestures,28–30 and (3) deep learning model-based recognition approaches using the popular convolutional neural network (CNN), recurrent neural network (RNN), long short-term memory (LSTM), or integrated designs of the above-mentioned models.31–34 Note that, for hand pose detection using the relatively simple spatial approach with a specific 3D image sensor, development of the recognition system mainly consists of continually tracking variations of the 3D (x, y, z) location of each hand joint defined in the specific 3D image sensing space. This kind of spatial 3D (x, y, z) hand pose detection approach is appropriate for automotive manufacturing tasks of collaborative material inspection or collaborative assembly that do not require semantic cognition of human behavior intention.
On the other hand, as for the model-based category of recognition approaches, where dynamic hand gesture recognition is carried out using the statistical HMM or a deep learning model of various network types, such methods are more suitable for automotive manufacturing scenarios that require a certain degree of cognition of the continuous-time action intentions of the operator (e.g. providing assistance to the operator). Model-based hand gesture recognition is therefore commonly used on the more complex assembly line. In the development of such a model-based hand gesture recognition system, model establishment (also known as model training) of the hand gesture classification models, with its extremely high cost in time, is the primary technical issue.

The HRI type of human–computer collaboration with hand gesture recognition of the operator, as employed in assembly-line, supply-chain, or material-inspection tasks in new-generation automotive manufacturing.
Collaborative robot design combined with human hand gesture interaction has been an ongoing trend in the innovative "Industry 4.0," with the goal of complete automation in the manufacturing field. Although many studies on human–robot collaborative systems with hand gesture recognition have been applied in automotive manufacturing, most of those works focus on assembly, supply chains, and material inspection. Extremely few hand gesture recognition works focus on HRI development for the smart material-handling robot. Figure 2 depicts the operation of the typical material-handling robot, namely the automated guided vehicle (AGV), in automotive manufacturing today. The AGV robot, loaded with the handled material, moves to the target location (assembly or inspection areas) in the factory. Conventional approaches to AGV navigation mainly include path-following and simultaneous localization and mapping (SLAM). In both of these robot navigation schemes, the human operator on the manufacturing line cannot provide path guidance information to the material-handling robot. In addition, when the robot moves close to the indicated location, a robot positioning approach using an RFID tag is typically employed. However, in some situations, such navigation and positioning schemes for the material-handling robot may be unsafe, inflexible, and not collaborative. It is well known that the "Atlas Robot," designed by the company Boston Dynamics, has powerful material-handling ability. 35 The Atlas Robot is not categorized as a wheeled-type AGV, but instead belongs to the humanoid robot class. This kind of humanoid robot will be able to take part in further material handling with the operator in a "collaborative material handling" mode after it reaches the indicated location and finishes positioning. To tackle this issue, a smart material-handling robot system with human–robot interaction via Kinect depth sensor-based hand gesture communication is developed in the present study. The rationale behind this design is that the operator located at a fixed position on the assembly line, performing a human–computer collaborative task via hand gesture recognition (assembly or inspection), can also use dynamic hand gestures to carry out navigation and positioning operations of material-handling robots, whether the typical AGV or an Atlas-like humanoid (see Figure 3).

Navigation and positioning of the typical material-handling robot (AGV) by simple path-following (or SLAM) and RFID tagging approaches, respectively, on the assembly line of automotive manufacturing.

An advanced HRI design for operating the material-handling robot, in which hand gesture actions of the specific operator on the assembly line provide semantic communication with the robot and enable possible material handling in a collaborative manner.
Compared with the hand gesture recognition approaches that have been explored in the field of automotive manufacturing, such as the finger segmentation detection approach, 24 the spatial 3D (x, y, z) hand pose detection method (which observes hand joint variations from a specific 3D image sensor),25–27 and the model-based recognition approaches using a well-trained HMM statistical model28–30 or a deep learning neural network with iterative learning of image characteristics,31–34 the hand gesture recognition presented in this study for smart material-handling robot operations is a feature-based approach. Depth sensor images of continuous-time dynamic hand gestures are reduced to the significant feature of the Hu invariant moments and then classified via template matching. From the technical viewpoints of calculation cost and real-time computation, the feature-based recognition approach presented in this work keeps a competitive merit because it avoids the exhausting requirement of gesture classification model establishment found in model-based hand gesture recognition. On the other hand, compared with static hand pose recognition approaches such as finger segmentation detection and 3D (x, y, z) hand joint variation estimation, the approach presented in this work can recognize continuous-time dynamic hand gesture actions, because the estimate of representative frames of the acquired HRI hand gesture actions is incorporated into the template matching; the context-dependent semantics revealed by continuous-time dynamic hand gesture actions can therefore still be extracted. Furthermore, in order to recognize commands formed by a series of continuous-time dynamic hand gesture actions and interact smoothly with the robot, a wake-up reminder scheme that regulates the starting, ending, and switching of meaningful gesture actions is also provided using fingertip detection designs.
Aiming at full-line automation in next-generation manufacturing, the proposed HRI scheme allows the conventional material-handling robot to be upgraded with considerably more intelligence. The presented HRI by hand gesture communication for the material-handling robot also provides a feasible way to increase the efficiency of current collaborative robot missions on the smart assembly line. Details of the proposed HRI design using depth sensor-based hand gesture communication for smart material-handling robot operations are described in the following sections.
The main contributions of this study are summarized below:
A complete framework of advanced HRI by hand gesture communication for intelligent material-handling robot operations is proposed to facilitate the smart assembly line and human–robot collaboration in automotive manufacturing.
A depth sensor feature-based hand gesture recognition approach is developed using template matching with representative frame estimates, attaining real-time recognition and extracting the semantic information of continuous-time dynamic hand gesture actions without the exhausting model-training cost of model-based recognition approaches.
Fingertip detection estimates for system wake-up reminders are designed to be appropriately combined with hand gesture recognition in order to smoothly regulate a series of gesture communication actions for the material-handling robot.
The developed smart material-handling robot system with HRI designs of Kinect depth sensor-based hand gesture communication
Figure 4 illustrates the framework of the presented system. As shown in the figure, the developed system contains three main technical components, namely data acquisition and preprocessing of Kinect depth sensor-based hand gesture actions, HRI with hand gesture recognition using the extracted continuous-time depth sensing hand gesture data, and operation of the material-handling robot with HRI integration of continuous-time dynamic hand gesture recognition. A series of depth sensing images containing hand gesture action information is captured by the Kinect depth sensor, after which hand detection, hand tracking, and noise removal of the extracted hand gesture depth image are performed in order. When a clean hand gesture depth image is obtained, feature extraction and recognition of this depth image are then carried out. Finally, the classification label of the recognized hand gesture, denoting the corresponding robot operation command, is sent to the material-handling robot to complete the robot manipulation actions. It is noted that, in this study, the robot operation command derived from hand gesture recognition is transmitted to the robot device via Bluetooth (BT) wireless transmission.

Framework of the presented smart material-handling robot system with depth sensing-based hand gesture communication HRI designs.
Data acquisition and preprocess of Kinect depth sensor-based hand gesture actions
The Kinect sensor device employed in this work for capturing depth information of active hand gestures is depicted in Figure 5.36,37 As shown in the figure, the Kinect device essentially includes a 3D depth sensor, an RGB camera, acoustic sensors in a microphone array, and a motor control module to adjust the tilt angle of the image capture. For image data acquisition by the Kinect device, both the RGB camera and the 3D depth sensor can be utilized. However, for hand gesture action data, RGB camera-based images inevitably encounter the problems of complex backgrounds and poor brightness. In contrast, the depth sensor uses the infrared receiver to capture the signal emitted by the infrared transmitter in order to extract depth image characteristics, and is therefore much more tolerant of environments with complex backgrounds and poor brightness. However, when a hand gesture action is performed, the depth image acquired from the Kinect device reveals the overall contour of the hand gesture-making person; the desired hand gesture image segment still has to be separated from the overall depth image for further analysis and recognition.

Depth sensor-based hand detection by OpenNI SDK
Since the raw parameters of the image data cannot be derived directly from the Kinect sensor-acquired depth image, the OpenNI SDK, developed by the open natural interaction (OpenNI) organization, is employed for data acquisition from the 3D sensor. 38 The OpenNI SDK essentially provides an efficient natural user interface and human–computer interaction tool that allows the system developer to acquire the required raw sensing data from various types of 3D sensor devices. With the support of the OpenNI SDK, a human–machine interaction application using 3D sensors can be quickly developed. This work employs the OpenNI SDK to obtain the depth value of the captured scene image from the infrared information of the Kinect depth sensor. A gray-scale depth image is obtained by converting the depth values in the captured scene to gray-scale map information using equation (1). As observed from equation (1), the index

The Kinect sensor gray-scale depth image obtained by the OpenNI SDK, in which the overall contour of the hand gesture-making person, including the hand segment, is captured.
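The exact scaling used in equation (1) is not reproduced above, so the following is only a minimal sketch of one plausible linear mapping from raw Kinect depth values (in millimetres) to an 8-bit gray-scale image; `MAX_DEPTH_MM` is an assumed parameter rather than the value used in the paper.

```python
import numpy as np

MAX_DEPTH_MM = 4000  # assumed maximum sensing range of the Kinect depth sensor (~4 m)

def depth_to_gray(depth_mm: np.ndarray) -> np.ndarray:
    """Map a raw 16-bit Kinect depth frame (millimetres) to an 8-bit gray-scale image.

    Pixels with no depth reading (value 0) are kept black; valid depths are
    scaled linearly so that nearer objects appear brighter.
    """
    depth = depth_mm.astype(np.float32)
    gray = np.zeros_like(depth, dtype=np.uint8)
    valid = depth > 0
    # Linear mapping: 0 mm -> 255 (nearest), MAX_DEPTH_MM -> 0 (farthest).
    gray[valid] = np.clip(255.0 * (1.0 - depth[valid] / MAX_DEPTH_MM), 0, 255).astype(np.uint8)
    return gray
```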
Hand tracking and segmentation by NiTE SDK
After obtaining the gray-scale depth image, the significant hand part must be separated from the overall depth image. The purpose of this image segmentation is essentially to retain the significant foreground segment and discard the unnecessary background segment. This work employs the NiTE SDK to ascertain the location of the hand center point. 39 Figure 7 shows the hand center point estimated by the NiTE SDK (i.e. the red point depicted in Figure 7).

The center point of the hand estimated by the NiTE SDK (the red point in the figure).
When the location of the NiTE SDK hand center point is derived, the gray-scale depth value at this location is taken as the reference for the binary image process of equation (2), which separates the hand foreground from the remaining background.

The binary image process with equation (2) calculations on the Kinect sensor gray-scale depth image of Figure 6 (binary-valued and original RGB raw images on the left and right, respectively).

The binary image process with equation (2) calculations on the Kinect sensor gray-scale depth image in the situation where the hand of the gesture-making user is very close to the sensor (binary-valued and original RGB raw images on the left and right, respectively).
As shown in Figure 9, after the binary image process of equation (2), the significant foreground part of the overall image, mainly the hand and the arm, can be clearly seen. In this work, considering real-time calculation and the responsiveness of hand gesture communication in operating the material-handling robot, only the hand part (i.e. the palm and five fingers) is focused on when feature extraction is performed to represent a specific hand gesture action. The binary-valued image segmentation, with a fixed size of

The binary-valued image segmentation with the fixed size of
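Equation (2) itself is not reproduced above, so the following sketch only assumes that the binarisation keeps pixels whose gray-level depth is close to the NiTE hand-centre value and that a fixed-size window is then cropped around that centre; `DEPTH_TOLERANCE` and `CROP_SIZE` are illustrative assumptions, not the paper's settings.

```python
import numpy as np

DEPTH_TOLERANCE = 12   # assumed gray-level tolerance around the hand-centre depth
CROP_SIZE = 200        # assumed side length of the fixed-size hand segment (pixels)

def segment_hand(gray_depth: np.ndarray, hand_cx: int, hand_cy: int) -> np.ndarray:
    """Binarise the gray-scale depth image around the NiTE hand-centre point and
    crop a fixed-size region containing the hand."""
    centre_val = float(gray_depth[hand_cy, hand_cx])
    # Keep only pixels whose depth is close to the hand-centre depth (foreground),
    # discard the rest (background) -- one plausible reading of equation (2).
    binary = np.abs(gray_depth.astype(np.float32) - centre_val) <= DEPTH_TOLERANCE
    binary = (binary * 255).astype(np.uint8)

    half = CROP_SIZE // 2
    top, left = max(hand_cy - half, 0), max(hand_cx - half, 0)
    return binary[top:top + CROP_SIZE, left:left + CROP_SIZE]
```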
Noise removal of the segmented hand gesture image by OpenCV SDK
As can be seen in Figure 10, the hand shape used to describe hand gesture variation information in the segmented binary image still contains noise and a jagged contour; smoothing of the hand contour is therefore required before feature extraction.

The smoothness of the hand contour is significantly increased by median filtering calculations.

The opening operation scheme employed for further improving the smoothness of the hand contour.
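This section names median filtering and the morphological opening operation, both available in the OpenCV SDK; a minimal sketch follows, with the 5 × 5 kernel sizes as assumptions rather than the values used in the paper.

```python
import cv2
import numpy as np

def denoise_hand_mask(binary_hand: np.ndarray) -> np.ndarray:
    """Smooth the segmented hand mask: median filtering removes salt-and-pepper
    noise, and a morphological opening (erosion followed by dilation) removes
    small spurs so the hand contour becomes smoother."""
    smoothed = cv2.medianBlur(binary_hand, 5)                      # 5x5 median filter (assumed size)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))  # assumed kernel size
    opened = cv2.morphologyEx(smoothed, cv2.MORPH_OPEN, kernel)
    return opened
```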
Hand gesture recognition with features of Hu moment invariants
The Hu moments approach is essentially an improved version of the geometric moment (GM) method.41,42 Generally, a moment invariant approach pursues a specific goal: the moments calculated for an observed object in an image should remain unchanged after the image is rotated, translated, zoomed, or mirrored. The GM invariants presented by Hu are essentially "rotation moment invariants." Owing to the invariance of Hu moments under such geometric transformations, the moment invariant approach presented early on in the work of Hu 41 has been proven to be an effective and efficient feature extraction scheme. The Hu moment features extracted from specific types of images have been widely used in many image-related applications, such as image analysis (shape circularity measure in Žunić et al. 43) and visual-based pattern recognition (face recognition in Paul et al. 44; 3D human action recognition in Al-Azzo et al. 45). The image-extracted Hu moments belong to 2D geometric characteristics, are invariant to image translation, scaling, and rotation, and can be calculated from seven functions, which are shown in equations (5)–(11). The centroid-referenced (central) moment of a 2D image $I(x, y)$ is generally defined as
$$\mu_{pq} = \sum_{x}\sum_{y} (x - \bar{x})^{p} (y - \bar{y})^{q}\, I(x, y),$$
in which $\bar{x} = m_{10}/m_{00}$ and $\bar{y} = m_{01}/m_{00}$ are the centroid coordinates obtained from the raw moments $m_{pq}$, and the scaled (normalized) central moments are given by $\eta_{pq} = \mu_{pq}/\mu_{00}^{\,1+(p+q)/2}$. In terms of the scaled central moments, a total of seven Hu moment feature parameters ($\phi_{1}$–$\phi_{7}$) can be derived from equations (5) to (11). Note that these seven Hu moment parameters are used in this work to describe the object of interest, namely the hand segment, in an image; the seven invariant moments $\phi_{1}$–$\phi_{7}$ of each hand gesture frame form the feature vector used for recognition.
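As an illustration, the seven invariants can be obtained directly from a binary hand image with OpenCV; the optional log scaling below is a commonly used normalisation and an assumption of this sketch, not a step stated in the paper.

```python
import cv2
import numpy as np

def hu_features(hand_mask: np.ndarray, log_scale: bool = True) -> np.ndarray:
    """Compute the seven Hu moment invariants of a binary hand image."""
    moments = cv2.moments(hand_mask, binaryImage=True)
    hu = cv2.HuMoments(moments).flatten()          # shape (7,)
    if log_scale:
        # Optional log transform to compress the very different magnitudes of the
        # seven invariants (an assumption; the paper does not state this step).
        hu = -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)
    return hu
```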
In the present work on HRI designs, in order to use hand gesture communication for operating the material-handling robot, K different specific hand gesture actions, Gesture 1, Gesture 2, …, and Gesture K, are set, each of which indicates a corresponding robot action behavior. Each of these K hand gesture actions is composed of numerous frames over a certain continuous-time period. Figure 13 depicts the test and training database establishments through feature extraction of Hu moment invariants. As can be seen in Figure 13, each of the different hand gesture actions contains n continuous-time frames, Frame 1, Frame 2, …, and Frame n, and each of these n hand gesture frames is represented by its seven estimated Hu moments, $\phi_{1}$–$\phi_{7}$.

Test and training database establishments of Hu moment invariant features by feature extraction applied to each frame of every hand gesture action.
In this study, considering its simplicity and rapid computation, the template match method is adopted for recognition calculations. Equations (12) and (13) detail the template-matching calculation process used in this study: in the test recognition phase, the test hand gesture action is compared, in terms of its Hu moment features, with the stored template of each gesture class, and the classification label of the closest-matching template is assigned.
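A minimal sketch of this template-matching step is given below, assuming a Euclidean distance between Hu feature vectors and a nearest-template decision; the exact form of equations (12) and (13) is not reproduced in the text, so the distance measure here is an assumption.

```python
from typing import Dict

import numpy as np

def template_match(test_frames: np.ndarray, templates: Dict[str, np.ndarray]) -> str:
    """Classify a hand gesture from its Hu-moment feature frames by template matching.

    test_frames : array of shape (m, 7) -- Hu features of the m frames used for
                  recognition (a single representative frame gives m = 1).
    templates   : mapping from gesture label to a (7,) template feature vector.

    The label whose template has the smallest accumulated Euclidean distance to
    the test frames is returned (an assumed distance; equations (12) and (13)
    may differ in detail).
    """
    best_label, best_dist = None, np.inf
    for label, template in templates.items():
        dist = np.linalg.norm(test_frames - template, axis=1).sum()
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label
```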
The material-handling robot system with HRI integrations of continuous-time hand gesture communication
The above-mentioned hand gesture recognition system is then integrated with a material-handling robot device to construct a smart HRI system. Figure 14 illustrates the integration between the hand gesture recognition system and the robot device, where the Bluetooth (BT) wireless transmission scheme is employed to send the label of the recognized hand gesture action to the robot device, triggering the corresponding robot operation. It is noted that equipping robot end-devices with a wireless transmission scheme, including BT, is a mainstream trend in robot development nowadays; the HRI system presented in this work can therefore be employed in most robot devices with wireless transmission modules.

Human–robot interaction by hand gesture communication, where the recognized hand gesture action is sent to the robot device via Bluetooth to denote the corresponding robot operation.
Start and end time-point determination of specific HRI hand gesture action acquisitions
To construct a practical hand gesture communication-based HRI system for real-life applications, real-time detection and recognition of the specific hand gesture communication sent to the robot device is undoubtedly crucial. In order to accurately and completely capture all hand gesture frames of the specific robot operation gesture carried out by a human operator, this work presents a hand gesture detection scheme. The developed detection of the significant hand gesture action is mainly designed to denote the start and end time-points of hand image acquisition. When the starting point is detected, a series of gray-scale depth hand gesture images is acquired from the Kinect depth sensor; this image capturing process is repeated until the end time-point is detected. Figure 15 depicts this detection design for the start and end of the hand gesture action. It is noted that hand gesture images acquired between the start and end time-points are then considered for further feature extraction and hand gesture recognition. As can also be seen in Figure 15, both the start and end indications of the significant hand gesture action are the hand pose "open the hand fully." Hand gesture action detection therefore essentially becomes a problem of detecting the hand pose "open the hand fully."

Start and end time point determination by detection of the hand pose “open the hand fully” for image acquisitions of the specific hand gesture robot operation action.
In this work, the problem of detecting the hand pose "open the hand fully" can essentially be viewed as a fingertip detection issue, that is, detection of the hand pose with five fingertips. In order to tackle this issue, the Douglas–Peucker algorithm, which belongs to the category of iterative end-point fit algorithms, is adopted. 46 As can be seen in Figure 16, when the pose "open the hand fully" is made, proper utilization of the Douglas–Peucker algorithm transforms the hand contour, a curve of many piecewise line segments, into a much simpler but similar curve composed of only a few points. As observed from Figure 16, the image after the noise removal process (see Figure 12) can thus be represented by a small set of hand critical points. In order to rapidly implement the Douglas–Peucker calculations mentioned herein, this work directly adopts the OpenCV SDK. 40
After performing the Douglas–Peucker algorithm, a fingertip search over the derived simple hand curve is carried out, in which all critical points distributed on the curve are evaluated. The main fingertip verification scheme in this work considers three consecutive critical points on the curve using equations (14)–(17). Figure 17 illustrates the fundamental rationale of this fingertip verification. In this study, when the

The use of the Douglas–Peucker algorithm for obtaining a simple hand curve composed of only some critical points.

Fingertip verification by evaluations of three consecutive hand critical points.
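A minimal sketch of this fingertip search is given below, using OpenCV's cv2.approxPolyDP for the Douglas–Peucker simplification and an angle test over three consecutive critical points; the tolerance and angle threshold are illustrative assumptions and do not reproduce equations (14)–(17). (The cv2.findContours call assumes OpenCV 4.)

```python
import cv2
import numpy as np

APPROX_EPSILON = 15.0   # assumed Douglas-Peucker tolerance in pixels
FINGERTIP_ANGLE = 60.0  # assumed maximum angle (degrees) at a fingertip candidate

def count_fingertips(hand_mask: np.ndarray) -> int:
    """Count fingertips on a denoised binary hand mask.

    The hand contour is simplified with the Douglas-Peucker algorithm
    (cv2.approxPolyDP); each critical point is then evaluated together with its
    two neighbours, and sharp corners lying farther from the hand centroid than
    their neighbours are taken as fingertips. The thresholds above are
    illustrative assumptions, not the values used in the paper.
    """
    contours, _ = cv2.findContours(hand_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return 0
    contour = max(contours, key=cv2.contourArea)
    approx = cv2.approxPolyDP(contour, APPROX_EPSILON, True).reshape(-1, 2)

    m = cv2.moments(contour)
    centroid = np.array([m["m10"] / m["m00"], m["m01"] / m["m00"]])

    fingertips = 0
    n = len(approx)
    for i in range(n):
        prev_pt, pt, next_pt = approx[i - 1], approx[i], approx[(i + 1) % n]
        v1, v2 = prev_pt - pt, next_pt - pt
        cos_angle = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9)
        angle = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
        # A fingertip is a sharp corner farther from the centroid than its neighbour.
        if angle < FINGERTIP_ANGLE and np.linalg.norm(pt - centroid) > np.linalg.norm(prev_pt - centroid):
            fingertips += 1
    return fingertips
```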
Estimate of the representative frame of the acquired specific HRI hand gesture action
It is worth noting that using all captured hand gesture frames between the start and end detection points for feature extraction and recognition is unlikely to be the most satisfactory strategy when both recognition accuracy and real-time computation are considered. Figure 18 depicts how a series of hand gesture images (i.e. hand gesture frames) contained in a specific HRI hand gesture action, denoting a certain gesture classification label, can be acquired between the detected action start and end time-points. To tackle this problem and construct a hand gesture communication-based HRI system with both high recognition accuracy and low computation cost, this work presents three different strategies for determining the representative frame of the specific HRI hand gesture action acquired by the "open the hand fully" detection scheme presented in Section 2.3.1. Note that only the representative frame of the specific HRI hand gesture action is employed for extraction of Hu moment invariant features and determination of the action classification label. Apart from the baseline of extracting full frames without a representative frame estimate, frames unrelated to the estimated representative frame are not considered in the recognition calculation process and are neglected directly, in order to reduce calculation cost and achieve real-time recognition. Additionally, when the extracted Hu moment invariant features of two (or more) adjacent gesture frames are identical, all of these seemingly identical frames are still taken into account in hand gesture action recognition in this design; from the technical viewpoint of dynamic visual human action recognition with context sensitivity over certain continuous-time periods, such consecutive and invariant frames generally denote the starting or ending timestamps of a significant action. The three strategies presented for determining the representative frame of the specific HRI hand gesture action in this study are the centroid representative frame estimate, the middle representative frame estimate, and the representative frame estimate by middle-region voting.

Estimate of the representative frame from all hand gesture images between the action start and end time-point of the specific HRI hand gesture action with certain classification label.
Extraction of full frames without estimate of the representative frame
Without real-time computation considerations, using the full frames of the specific hand gesture action for recognition is a simple and direct approach. This approach, without calculation of a representative frame, uses all n gesture action frames of the specific hand gesture action to make one template match computation, that is,
The centroid representative frame estimate
The estimate of the centroid frame, which is set as the representative frame for recognition calculations, is a simple averaged determination over all n gesture action frames of the specific hand gesture action. Equation (18) details the estimate of the centroid representative frame as follows:
Compared with extraction of complete n frames for recognition, the centroid representative frame estimate approach employs only one estimated centroid frame derived from equation (18) for recognition.
The middle representative frame estimate
When a specific HRI hand gesture action is performed over a certain continuous-time period, the most significant frames of the gesture action are located around the middle frame. This is because the beginning and finishing segments of each specific hand gesture action are similar to each other; the main action data that can distinguish the different gesture actions are concentrated in the neighborhood of the middle frame. Equations (19) and (20) show the determination of the middle representative frame.
Note that, in the situation where all n frames of the specific hand gesture action are acquired, the middle frame value can be derived from equation (20). Similar to the centroid representative frame estimate, the middle representative frame estimate is also a computationally rapid approach, with only one estimated middle frame used for template match recognition.
The representative frame estimate by middle-region voting
The approach of middle-region voting is designed to facilitate voting among various different action classification labels on recognition determination using the hand gesture frame region of
Compared with the middle representative frame estimate approach, the middle-region voting scheme will inevitably have a slightly higher computation cost due to more action frames considered for recognition evaluations. However, by accumulating more recognition decisions of template match, each of which is separately made by one of these frames in the middle-region, the recognition result of the constructed HRI hand gesture action classification system will be much more reliable.
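The three representative-frame strategies can be sketched as follows, operating on the per-frame Hu feature matrix of one gesture action. Treating the centroid frame as the average of the frame features is one assumed reading of equation (18), and the middle-region half-width is an assumed parameter; the voting routine reuses the template_match sketch shown in the template-matching section above.

```python
from typing import Dict

import numpy as np

def centroid_frame(features: np.ndarray) -> np.ndarray:
    """Centroid representative frame: average the Hu features of all n frames
    (one assumed reading of equation (18)). Input shape (n, 7), output (1, 7)."""
    return features.mean(axis=0, keepdims=True)

def middle_frame(features: np.ndarray) -> np.ndarray:
    """Middle representative frame: the single frame at index n // 2
    (one assumed reading of equations (19) and (20)). Output shape (1, 7)."""
    return features[len(features) // 2][None, :]

def middle_region(features: np.ndarray, half_width: int = 2) -> np.ndarray:
    """Middle-region frames for voting: a small window centred on the middle
    frame; `half_width` is an assumed region size."""
    mid = len(features) // 2
    lo, hi = max(mid - half_width, 0), min(mid + half_width + 1, len(features))
    return features[lo:hi]

def recognise_by_voting(region: np.ndarray, templates: Dict[str, np.ndarray]) -> str:
    """Middle-region voting: each frame in the region casts one template-match
    vote; the most frequent label wins."""
    # template_match(...) is the sketch from the template-matching section above.
    votes = [template_match(frame[None, :], templates) for frame in region]
    return max(set(votes), key=votes.count)
```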
Experiments and results
Experiments on the presented smart material-handling robot operation system with HRI designs of depth sensor-based hand gesture communication are conducted in a laboratory office environment. The related experimental settings, the hand gesture database established for performance evaluation of the developed system, and the experimental results are detailed in the following sections.
Experimental settings and establishments of the HRI hand gesture action database
So that the constructed material-handling robot with HRI operations of hand gesture communication can be practically used in real-life applications, Microsoft Windows with the Visual Studio C++ coding environment is used as the system development platform. The Kinect device with Kinect software development kit version 1.8.0, employed for acquiring depth gray images of hand gesture actions, is deployed in the laboratory office. When capturing hand gesture actions, the sampling rate of the Kinect device is set at 30 Hz and its tilt angle is set at 21°. In addition, the distance between the Kinect and the ground is approximately 1.2 m, while the distance between the Kinect and the hand gesture-making user is set in the range of 0.7–1.2 m.
For the material-handling robot device, the "Robotis Bioloid" robot, manufactured by the Robotis company in Korea, is adopted. 47 The Bioloid robot, a humanoid robot, has a total of 18 modular servomechanisms, that is, 18 degrees of freedom (DOF), each of which represents one robot joint and is driven by an AI motor called a "Dynamixel." In addition, three types of mechanical robot designs with different functional complexities, namely Type-A, Type-B, and Type-C, are available in the Bioloid robot series, and this work uses the Type-A Bioloid for the hand gesture communication HRI system developments. Through the motion editor of the RoBoPlus software, 48 each robot operation action denoted by the corresponding hand gesture action can be implemented on the Bioloid humanoid robot. Note that the robot adopted for material handling in this study is equipped with a wireless Bluetooth module, 47 so that command-based HRI applications are easy to construct. In this work, to achieve efficient gesture command communication via Bluetooth, only the index value (numbered 1–6) of the gesture action is sent to the robot device (from the Bluetooth transmitter in the computer running the developed depth sensor-based hand gesture recognition application to the Bluetooth receiver in the Bioloid robot device). When the robot receives the action command index, the specific robot operation action with the corresponding number (regulated in advance by the RoBoPlus motion editor, as mentioned) is triggered and performed.
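A minimal sketch of sending such a command index is shown below, assuming the robot's Bluetooth module is exposed to the host computer as a serial port (for example through pyserial); the port name, baud rate, and label-to-index mapping are illustrative assumptions, not values stated in the paper.

```python
import serial  # pyserial; assumes the robot's Bluetooth link appears as a serial port

# Assumed mapping of recognized gesture labels to the command indices 1-6.
GESTURE_TO_INDEX = {"Forward": 1, "Left": 2, "Right": 3, "Whirl": 4, "Pick": 5, "Open": 6}

def send_gesture_command(label: str, port: str = "COM5", baudrate: int = 57600) -> None:
    """Send the index (1-6) of the recognised gesture to the robot over Bluetooth.
    Port name and baud rate are illustrative assumptions."""
    with serial.Serial(port, baudrate, timeout=1) as link:
        link.write(bytes([GESTURE_TO_INDEX[label]]))
```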
For the HRI hand gesture action collection, a total of six different hand gesture actions are designed for remotely operating the behavior of the material-handling robot. If these six hand gesture actions are carried out in the proper order, the robot completes an object-handling task: picking up the indicated object at the initial location, walking (i.e. handling and moving), and finally putting the object down at the destination location. In this work, the six hand gesture actions for robot operations are "Forward," "Left," "Right," "Whirl," "Pick," and "Open," representing the robot actions of "robot goes forward," "robot turns left," "robot turns right," "robot makes a whirl," "robot picks up the indicated target object," and "robot opens both arms to put down the target object," respectively. The rationale behind these hand gesture action designs is that standard hand gestures denoting the 26 English letters (A–Z) are well defined in the dictionary of American Sign Language (ASL). 49 Six ASL-based hand gesture actions, representing the English letters "F," "L," "R," "W," "P," and "O," are therefore designed to represent "Forward," "Left," "Right," "Whirl," "Pick," and "Open," respectively. Table 1 shows the six designed ASL-based hand gesture actions and the robot action behavior triggered after receiving the recognized hand gesture communication. For ASL-based hand gesture action collection, two different databases of the "F," "L," "R," "W," "P," and "O" actions are established: one contains only static hand poses, and the other is built from continuous-time hand gestures. The database of static hand poses is used for comparative evaluation of recognition of the six ASL-alphabet hand poses, while the collected continuous-time hand gesture database is employed in real robot action operation, where the start and end time-point determination and the representative frame estimate of the significant HRI hand gesture action, as mentioned before, are considered. A total of 300 hand poses are contained in the database of static hand poses – 50 acquired for each of "F," "L," "R," "W," "P," and "O." The continuous-time robot operation hand gesture action database contains a total of 120 gesture actions – 20 obtained for each of the six indicated ASL alphabet gestures. As mentioned, the hand pose "open the hand fully" is employed as the mark denoting the start and end time-points of the required HRI hand gesture action. For the collection of timestamp-marked hand pose data, in order to thoroughly evaluate the performance of the presented fingertip detection method, two different categories of hand pose data are taken into account, namely hand poses with different fingertip numbers and open-hand poses with different degrees of openness. For the database of hand poses with different fingertip numbers, hand pose data with one, two, three, four, and five fingertips are collected, as shown in Table 2; for these five types, a total of 250 hand poses are acquired (50 for each type). For the open-hand poses with different openness degrees, three types of open hand poses, namely "slightly open," "half open," and "totally open," are considered, with 50 hand poses collected for each type (see Table 3).
The designed six different ASL-based hand gesture actions, each of which denotes the specific robot action operation for completing a specific task of object handling.
Database establishments by collecting hand pose data with different fingertip numbers.
Database establishments by collecting hand pose data with different degrees of open hand (the action “open hand” used to denote start and end time-points of the specific HRI action with certain classification label; also see Table 6, effects of different degrees of open hand on performances of Douglas–Peucker fingertip detection).
Experimental results
Performance evaluation of the proposed human–robot interaction system with depth sensor-based hand gesture communication for smart material-handling robot operations covers recognition performance of static poses of the ASL hand gesture actions, recognition performance comparisons of hand detection using different fingertip numbers, recognition accuracy measurements of hand poses with different degrees of hand openness, and recognition rate comparisons of the four approaches for deriving the representative frame of the acquired specific HRI hand gesture action.
In the recognition experiment for classifying static poses of ASL hand gesture actions, six different static poses, representing the English letters "F," "L," "R," "W," "P," and "O," are classified via preprocessing of the hand data (see Section 2.1), feature extraction of Hu moment invariants, and then template match calculations (see Section 2.2). Table 4 lists the recognition performance for these six static poses of ASL hand gesture actions. As observed from Table 4, each of the "F," "L," "R," "W," "P," and "O" static hand poses achieves satisfactory performance. The three static hand poses "F," "R," and "W" achieve complete recognition (i.e. recognition accuracy of 100%). The average recognition rate of these six static ASL hand poses is 97%. Without consideration of continuous-time hand gesture HRI operations for robot control, the static ASL hand poses of the six English letters can be categorized almost completely.
Recognition performances of six different static poses of ASL-hand gesture actions.
For smart material-handling robot control by continuous-time HRI hand gesture communication, as mentioned previously, the technical issues of start and end time-point determination for image acquisition of the specific hand gesture robot operation action (see Section 2.3.1) and of the representative frame estimate representing the acquired robot operation action (see Section 2.3.2) additionally need to be taken into account. For start and end time-point determination, hand pose recognition of "open the hand fully" by the presented fingertip detection scheme is used. Table 5 lists the recognition rate comparison results of five different hand pose detection situations, that is, hand poses with one to five fingertips (see Table 2). It can be clearly seen in Table 5 that the hand pose with all five fingers shows excellent hand detection performance, achieving a recognition accuracy of 98%. The average recognition accuracy of hand detection over these five fingertip numbers is 96%. Note that, in Table 5, although the hand poses with one or two fingertips have the best hand detection performance (a perfect 100%), out of concern for easy memorization and the continuity of specific robot operation actions, the hand pose "open the hand fully" with all five fingertips is still chosen as the reminder action to denote the start, end, or switching of the HRI hand gesture robot operation action. Table 6 shows the recognition accuracy comparison of three hand poses with different degrees of hand openness (also see Table 3). As can be seen in Table 6, of all the conditions, the presented hand detection performs best for the totally open type of open hand action, achieving a satisfactory 98%. The slightly open type of hand pose is the most disadvantageous for hand detection, achieving only an uncompetitive 16%.
Recognition rate comparisons of five hand poses with different fingertip numbers.
Recognition accuracy of three hand poses with different degrees of the open hand gesture.
Tables 7 to 10 show the recognition performance results of six continuous-time HRI robot operation hand gesture actions by extraction of full frames without estimate of the representative frame, the centroid representative frame estimate, the middle representative frame estimate, and the representative frame estimate by middle-region voting, respectively. Among all these approaches, without consideration of the representative frame estimate, full frames of the continuous-time HRI hand gesture action employed for recognition have the highest average recognition rate, achieving 91.7%, as can be seen in Table 7. For presented representative frame estimate approaches on continuous-time HRI hand gesture action recognition, the representative frame estimate by middle-region voting is apparently the best method, reaching an average recognition accuracy of 90.8% (see Table 10); the middle representative frame estimate method, with an average recognition rate of 86.7%, follows (see Table 9). The centroid representative frame estimate approach performs worst on classifications of these six continuous-time HRI hand gesture actions, attaining only 62.5% (see Table 8). Note that, from the viewpoint of simultaneous considerations on both recognition accuracy and real-time computation, the presented representative frame estimate by middle-region voting will be the most competitive approach to be used in the practical application of hand gesture communication-based human–robot interaction.
Recognition rates of six continuous-time HRI hand gesture action recognition for robot control by extraction of full frames without estimate of the representative frame.
Recognition rates of six continuous-time HRI hand gesture action recognition for robot control by the centroid representative frame estimate.
Recognition rates of six continuous-time HRI hand gesture action recognition for robot control by the middle representative frame estimate.
Recognition rates of six continuous-time HRI hand gesture action recognition for robot control using the representative frame estimate by middle-region voting.
As observed from all of the above-mentioned experiment results compiled in this work, a primary remark can be made in relation to three different aspects, which are (1) performance comparisons between static (see Table 4) and various kinds of dynamic hand gesture action recognition (Tables 7–10), (2) performance analysis of presented fingertip detection for providing system wake-up reminders on appearances of a significant dynamic hand gesture action (Tables 5 and 6), and (3) recognition accuracy comparisons of continuous-time dynamic hand gesture action recognition by only extracting full frames (Table 7) and three representative frame estimate approaches (Tables 8–10).
Static hand gesture recognition performs better than dynamic hand gesture action recognition because only a single gesture frame has to be determined, without the issues of context sensitivity and of system wake-up reminder performance that arise in dynamic hand gesture action recognition. However, for the practical goal of continuous-time interaction with the material-handling robot in this study, dynamic hand gesture action recognition, with its relatively better cognition of action intention, is more feasible.
The hand poses with one or two fingertips can be verified with perfect recognition, achieving 100% accuracy; however, these two hand poses are easily confused with other common hand gestures used on the assembly line (e.g. gestures made by the inspection operator to indicate the quality of materials). Considering both detection performance and the practical application of material-handling robot operations, the hand pose with five fingertips (i.e. "open the hand fully"), achieving a competitive 98% accuracy, is a sound choice.
Recognition by extraction of full frames, without estimating the representative frame, exhibits the best recognition performance (reaching 91.7%) for continuous-time dynamic hand gesture recognition. However, using the full frames of a hand gesture action for recognition brings much greater computation costs, which largely reduces the real-time responsiveness of the recognition system and is not suitable for automotive manufacturing applications. Taking into consideration both real-time computation and high recognition accuracy, the representative frame estimate by middle-region voting (achieving a competitive 90.8%) will probably be the more appropriate approach for hand gesture communication with the material-handling robot in a smart factory.
Discussions
In this study, the Microsoft Kinect device with its depth sensor is employed for developing depth sensor-based hand gesture recognition. According to the specification of the Kinect sensor device, the maximum distance at which depth images can be captured is 4 m, and this specification is effective in many practical applications, including the assembly line where hand gesture recognition is used to interact with the collaborative robot. Besides the Kinect device, other types of depth sensors can be used to develop the hand gesture recognition presented in this work, such as the well-known Intel RealSense depth sensor. No matter which type of depth sensor is used, setting a distance of about 1 m between the hand and the depth sensor is general and appropriate for most applications of the smart assembly line with collaborative robots24–34 (also see Figures 1–3). In these automotive manufacturing applications,24–34 the operator who uses hand gesture communication to cooperate with the collaborative robot is generally located in a fixed position (sitting or standing near the deployed depth sensor without frequent movements), and therefore the sensitive factor of the image sensing distance between the operator and the depth sensor can be effectively regulated, so that fine recognition performance is maintained.
With regard to handling robot operations by wireless communication, this work uses BT wireless transmission to send the intention label of the hand gesture action to the robot. Owing to the peer-to-peer connection property of BT, the specified handling robot matched with the developed hand gesture recognition system can always be regulated without interference from other robots or related devices. Recently, works on utilizing a group of numerous collaborative robots to complete a specific task have been explored. When the presented hand gesture recognition is employed in a scenario with multiple handling robots, Wi-Fi wireless transmission is an alternative for establishing a master–client operation scheme.
In this work, which focuses on the development of dynamic hand gesture recognition, the design of the indicated actions for operating the handling robot is based on the ASL dictionary. Recently, studies on ASL-based recognition have appeared.50,51 Most ASL-based recognition works aim to develop the recognition calculation model, such as the deep learning model using 3D CNN in Sharma and Kumar 50 and the statistical HMM model in Starner and Pentland; 51 moreover, the main purpose of these ASL recognition systems is a translation scheme for communication between a specific disabled group and the general population. ASL recognition has also been employed in robot interaction applications.52,53 Most of these studies on human–robot interaction by ASL recognition use only classifications of various static ASL hand poses without consideration of context dependency, which is clearly different from the continuous-time dynamic hand gesture recognition developed in this paper. On the other hand, with regard to ease of system use, ASL may not be accepted by all people. Some people (or operators on the assembly line) are not familiar with ASL, and having to learn the operating hand gestures imposes an additional load. In view of this problem, the dictionary of hand gestures need not be designed completely according to the standard ASL dictionary. For example, without consideration of ASL finger-spelling, the letter "S," denoting "Stop movements," can be designed as the action of "making a fist," which is easy to understand without exhausting memorization. It is also noted that the indicated gesture action "Open," operated according to the symbol "O" defined in the ASL dictionary, does not have ideal recognition accuracy (see Tables 7–10), with the symbol "O" often being recognized as "R" because of the similarity of these two symbols' actions. This classification error phenomenon of ASL-based recognition has been a challenging issue, which can be significantly alleviated by an action design that does not completely follow the ASL definition.
Finally, regarding the interaction latency (the system response) between the human and the handling robot, the time needed to recognize each hand gesture action for operating the robot is extremely small and will not place a great load on real-time responses. In this work, as mentioned previously, three different representative frame estimate approaches are designed to extract only a few significant gesture frames from a hand gesture action with full gesture frames for the template match calculations. Assuming that a dynamic hand gesture action lasts 2 s, a total of 60 depth images of this action will be obtained from the Kinect depth sensor (with the general Kinect frame rate specification of 30 frames per second). In this case, for each of the representative frame estimate approaches presented in this study, the number of frames used to represent this gesture action becomes very small, restricted to around only 10 (millisecond levels or much less for recognition). Furthermore, each frame contributes only seven Hu invariant moment parameters to the template match calculations (features with far more than seven parameters are employed in current 3D image sensor-based hand gesture tasks). Unlike using all frames without representative frame estimation, or other model-based recognition approaches, the calculations made using the presented approach, with template matching over six classification labels, are highly competitive in terms of computation cost. On the other hand, although the typical approach of robot control via an operational joystick (or panel) is also frequently used in the factory, such an approach may be inconvenient and encounter many limitations in a "collaborative handling" task carried out by both the operator and the robot. As can be seen in Figure 19, when the operator makes a specific hand gesture action to take over the material from the robot or to handle the material with the robot in a cooperative mode, traditional joystick or panel control cannot be carried out smoothly. Advanced hand gesture recognition with collaborative HRI will be explored in future work.

A collaborative operator–robot way to complete a material handling task cooperatively, using hand gesture communication to interact with the material-handling robot, to be explored in future work.
Conclusions
This work develops a smart human–robot interaction scheme by depth sensor-based hand gesture communication for continuous-time material-handling robot operations. The Kinect depth sensor-based hand gesture recognition system is presented for achieving smart remote robot control via recognized hand gesture action. In order to achieve the goal of performing continuous-time robot operations with efficiency and effectiveness, this paper also investigates detection of start and end time-points of each different type of hand gesture action and the representative frame estimate of the hand gesture action with multiple gesture frames. The presented smart material-handling robot control via hand gesture communication is very competitive when it comes to real-time recognition and human–robot interaction.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research is partially supported by the Ministry of Science and Technology (MOST) in Taiwan under Grant MOST 109-2221-E-150-034-MY2.
