Abstract
Considering that the distinctions among static hand gestures are the differences in which fingers are stretched out, a method of grouping and classifying hand gestures step by step by using the information of the quantity, direction, position and shape of the outstretched fingers was proposed in this paper. Firstly, the gesture region was segmented by using the skin color information of the hand, and the gesture direction was normalized by using the direction information of the gesture contour lines. Secondly, the fingers were segmented one by one by convex decomposition of the hand gesture image, based on the convex characteristics of the gesture shape. Thirdly, the features of quantity, direction, position and shape of the segmented fingers were extracted. Lastly, a hierarchical decision classifier embedded with deep sparse autoencoders was constructed. The quantity of fingers was used to divide the gesture images into groups first, and then the direction, position and shape features of the fingers were used to subdivide and recognize gestures within each group. The experimental results show that the proposed method is robust to lighting, direction and scale changes, and significantly superior to traditional methods in both recognition rate and recognition stability.
Introduction
Nowadays, almost all man-machine interactions are done by hand: inputting data into or reading information from devices by manipulating a keyboard, mouse or touch screen, or using data gloves to perceive the grasping, movement, rotation and other motions of the hand for interaction with virtual reality systems. Traditional ways of man-machine interaction require tedious and boring operations, and sometimes even complex hardware systems; all of this brings great inconvenience to users. The development of artificial intelligence technology provides the conditions for vision based, non-contact human-computer interaction. Vision based hand gesture recognition technology dramatically relieves the constraints that traditional man-machine interaction places on users, and it is helpful for the natural expression of human gestures. It has wide application prospects in the fields of human-computer interaction,1,2 human-robot interaction,3,4 sign language interaction,5,6 surgeon-computer interaction,7,8 smart home appliances,9,10 virtual reality11,12 and game interfaces,13,14 etc.
Vision based hand gesture recognition technology refers to the technology that processes images or videos containing hand gestures by using computer vision algorithms, and further identifies the messages that users send with different hand gestures. Hand gesture recognition can be divided into static hand gesture recognition and dynamic hand gesture recognition.15–17 Static hand gesture recognition obtains the meaning represented by each category of hand gesture by processing hand gesture images according to the combination of fingers stretched out, while dynamic hand gesture recognition identifies the meaning expressed by a hand gesture by processing hand gesture videos according to the trajectory, velocity and angle of the hand motion. Although research on gesture recognition has lasted for decades, it remains an open problem. Existing gesture recognition methods lack the ability to adapt to changes of illumination, direction and scale. The essence of the problem is that none of these methods can fully extract the changeable characteristics of gestures. The motivation of this paper is to explore a novel hand gesture representation and recognition method that is robust to lighting, direction and scale changes.
A static hand gesture conveys a different meaning by sticking out a different combination of fingers. Typical static hand gestures are shown in Figure 1. The distinctions among static hand gestures are the differences in which fingers are stretched out, so the essence of static hand gesture recognition lies in judging the state of the outstretched fingers. When the number of gesture categories is large, the gestures can be grouped and classified step by step by using the information of the quantity, direction, position and shape of the outstretched fingers. This gradually narrows the classification scope of the hand gestures, while also performing classification according to the characteristics of each gesture.

Typical static hand gestures (a) palm (b) three-left (c) V (d) loose.
A hierarchical decision and classification method for hand gesture recognition using finger features was proposed in this paper. Firstly, the gesture region was segmented by using the skin color information of the hand, and the gesture direction was normalized by using the direction information of the gesture contour lines. Secondly, the fingers were segmented one by one by convex decomposition of the hand gesture image, based on the convex characteristics of the gesture shape. Thirdly, the features of quantity, direction, position and shape of the segmented fingers were extracted. Lastly, hand gestures were grouped and recognized by a hierarchical decision classifier embedded with deep sparse autoencoders. The flow chart of the proposed method is shown in Figure 2. The main contributions of this paper are summarized as follows:
(1) A gesture preprocessing method was proposed for gesture images with unconstrained backgrounds. (2) Specific convex decomposition conditions were established especially for hand gesture decomposition. (3) A new descriptor based on the direction, position and shape of the fingers was constructed for hand gestures. (4) A new hierarchical decision classifier embedded with deep sparse autoencoders was proposed for hand gesture recognition. (5) Special considerations for robustness to lighting, direction and scale variations were included in our method.

Flow chart of gesture recognition.
The remainder of the paper is organized as follows. Section ‘Related work’ reviews related research on hand gesture recognition in recent years. The segmentation of the gesture region and the normalization of the gesture direction are introduced in Section ‘Preprocessing of hand gesture image’. The method of finger segmentation based on convex decomposition is explained in detail in Section ‘Convex decomposition of hand gesture’. Extraction methods for the features of the direction, position and shape of the segmented fingers are described in Section ‘Feature extraction of fingers’. The hierarchical decision classifier and corresponding classification method are established in Section ‘Hand gesture classification’. Experimental results and comparisons in different scenarios are demonstrated in Section ‘Experiment and discussion’. Conclusions and future work are given in Section ‘Conclusion’.
Related work
Vision based gesture recognition technology is a challenging frontier that attracts great interest from researchers. The main tasks of static hand gesture recognition are feature extraction and feature classification of the hand gesture images. In the aspect of feature extraction, studies in the literature mainly focused on geometric features, moment features, contour features, histograms of oriented gradients and wavelet features of hand gesture images. Lopez-Casado et al.7 introduced a hand gesture geometric descriptor based on the distances and orientations of the lines connecting the convex and concave extrema of the hand contour, and a linear Support Vector Machine classifier was used for hand gesture recognition. Wu et al.18 presented a hand gesture recognition method based on hand shape. Fingertips and the concave points between fingers were detected by using the convex hull, and the numbers of fingertips and concave points were used for hand gesture recognition. Dominio et al.19 extracted four different sets of hand gesture features: the distances of the fingertips from the hand center and from the palm plane, the curvature of the hand contour and the geometry of the palm region. Marin et al.20 employed the distances from the hand centroid, the curvature of the hand contour and the convex hull of the hand shape as gesture feature descriptors. The extracted feature sets were used for hand gesture recognition. Park et al.21 applied masked Zernike moment features for hand gesture recognition. They presented two categories of masks to handle the overlapped information of hand images caused by their shape characteristics: an internal mask to eliminate overlapped information in hand images, and an external mask to weight the salient features of hand images.
Priyal and Bora22 presented a hand gesture recognition method using geometry based normalizations and Krawtchouk moment features. The regions constituting the hand and the forearm were extracted through skin color detection and anthropometric measures. Rotation normalization was used to align the extracted hand, and Krawtchouk moment features were used to represent the hand gestures. Chevtchenko et al.23 studied the multi-objective optimization problem in hand gesture feature selection. They used Hu moments and Gabor features to represent a hand gesture, and gesture recognition was performed by a multilayer perceptron; both the feature vector and the neural network were tuned by a multi-objective evolutionary algorithm. Elouariachi et al.24 proposed quaternion Tchebichef moment invariants by using quaternion algebra to extract gesture features. Based on the algebraic properties of the discrete Tchebichef polynomials, the invariants derived directly from the orthogonal moments are robust to geometric distortion, noisy conditions and complex backgrounds. Ren YY et al.25 proposed a contour based static hand gesture recognition method. They performed direction normalization of hand gesture images by applying a multi-scale weighted histogram of contour direction, in which each contour point was weighted by its position and direction, and a time-series curve feature was extracted from the hand contour for recognition. Ren Z et al.26 applied near-convex decomposition to obtain the finger clusters in the time-series curves of the hand gesture. A distance metric called the Finger-Earth Mover's Distance, which considers each finger as a cluster and penalizes empty finger-holes, was used for hand gesture recognition. As it matches only the fingers rather than the whole hand shape, it can better distinguish hand gestures with slight differences.
Feng and Yuan27 proposed a static hand gesture recognition method based on gradient direction histogram features. The gradient direction histogram operates on local grid cells of the image, so it maintains good invariance to geometric and photometric deformation. Ding et al.28 presented a cascade feature combining histograms of oriented gradients and an improved local binary pattern to represent the hand gesture. Pansare et al.29 used an edge orientation histogram to extract features of hand gesture images. Huang et al.30 applied Gabor filtered images for hand gesture representation and the PCA method for feature dimensionality reduction. Parvathy and Subramaniam31 applied the 2D Discrete Wavelet Transform to reduce the hand gesture image size and the Harris corner detector to extract key points of the hand; a geometric contour feature was extracted for a window centered on each detected key point. Liu et al.32 proposed a tortoise model to describe the hand gesture. The tortoise model is composed of hand gesture features such as the radius of the palm, the radius of the wrist, the number of fingers, and the length and width of the fingers; these features were extracted by using concentric circular scan lines over the palm. Yang et al.33 employed saliency based features and sparse representation for hand gesture recognition. Block radial histogram based saliency features of the hand gestures were extracted, and the histogram intersection kernel function was used to map the extracted features into the kernel feature space. Wu et al.34 extracted length, angle and angular velocity features based on hand joint coordinates acquired by the Leap Motion, and fed these features into a long short-term memory recurrent neural network to predict the gesture. Zhang et al.35 proposed a hand gesture recognition algorithm based on geometric features.
Some length parameters of the palm were first used to divide the hand gestures into different types, and then the area-perimeter ratio and effective-area ratio of the hand gesture were extracted for recognition. Zhang et al.36 proposed a distinctive fingertip gradient orientation with a finger Fourier descriptor and modified Hu moments for depth gesture images collected by a Kinect sensor. A weighted AdaBoost classifier based on the finger-earth mover's distance and SVM models was used to realize hand gesture recognition.
In the aspect of classification and recognition of hand gestures, neural networks based on deep learning show great potential.37 Oyedotun and Khashman15 applied deep learning-based networks to the task of recognizing hand gestures. Segmented binary hand gesture images were used to train a Convolutional Neural Network and a stacked denoising autoencoder respectively, and the trained networks were tested for hand gesture recognition. Hu et al.38 used a Deep Belief Network composed of three Restricted Boltzmann Machines for hand gesture recognition. Tang et al.39 applied Deep Neural Networks to automatically learn features from hand gesture images that are insensitive to movement, scaling and rotation. Chen et al.40 proposed a pose guided structured region ensemble network for hand pose estimation. This network extracts regions from the feature maps of a Convolutional Neural Network and generates optimal and representative features for hand pose estimation. Jain et al.41 used Shift Invariant Convolutional Deep Structured Neural Learning with Long Short-Term Memory and a Bivariate Fully Recurrent Deep Neural Network with Long Short-Term Memory for gesture classification. The proposed method can automatically learn the features and the data to minimize time complexity in gesture recognition. Zhang et al.42 proposed a gesture recognition method based on the Deconvolutional Single Shot Detector. They used the K-means clustering algorithm to select the aspect ratios of the prior boxes to improve detection accuracy, and by using transfer learning, the detection accuracy on a small gesture data set was improved. Bhaumik et al.43 proposed a hybrid feature attention network that stacks four multi-scale refined edge extraction modules for hand gesture recognition. The purpose of the edge extraction module is to capture refined edge information of hand gestures by incorporating a hybrid feature attention block. Noreen et al.44 proposed a 2D CNN model with four parallel streams to classify hand gestures from depth data. Each stream received input samples from the gesture data, the 2D convolution was processed in parallel, and SoftMax was applied for the final classification. Iglesias et al.45 specifically designed a CNN with a small architecture for use on computationally limited devices. The network adopted the Darknet reference model, which has high detection speed while having a simple architecture. Kowdiki and Khaparde46 developed a dynamic hand gesture segmentation and deep learning-based strategy for gesture recognition. Gesture segmentation was performed by an adaptive Hough transform in which the theta value was optimized by the Whale Optimization Algorithm, and the segmented gesture images were classified by an optimized deep CNN. Al-Hammadi et al.47,48 proposed a system for dynamic hand gesture recognition using multiple deep learning architectures for hand segmentation, local and global feature representation, and sequence feature globalization and recognition. Two 3DCNN instances were used separately for learning the fine-grained features of the hand shape and the coarse-grained features of the global body configuration.
The human hand is composed of the palm, fingers and joints, and the joints have more than 20 degrees of freedom. The acquisition of hand gesture images is conducted in unconstrained environments,49 and there are physiological differences among individual hands. In addition, there is interference from changes in lighting, occlusion, background, direction, scale, position and viewing angle; all these make the patterns of hand gesture images very complicated. Although there has been much research on hand gesture recognition, the robustness of existing methods is still far from meeting the needs of practical applications.50
Preprocessing of hand gesture image
The purpose of preprocessing is to segment the hand region from the captured hand gesture image and conduct direction correction for the hand gesture. The skin area was segmented by Bayesian decision in YCrCb color space firstly. Then, the center point of the hand gesture was determined by using the maximum inscribed circle in the gesture region, and the forearm was removed based on the obtained center point. Lastly, Hough transform was used to detect the direction of the linear features on the hand contour and the gesture was rotated to the vertical direction according to the direction of the detected linear features.
Skin area segmentation
The commonly used skin area segmentation is completed by threshold processing. Such fixed-threshold segmentation cannot adapt to illumination changes, so a Bayesian decision method based on posterior probability was used to segment the skin color area in the hand gesture image. Considering that skin color features have excellent clustering characteristics in the YCrCb color space, the Cr and Cb components of hand gesture images in YCrCb color space were used to form the skin color feature vector. Firstly, the prior probabilities of the feature vectors of the skin color region and the background region were established respectively. Then, according to the values of the Cr and Cb components in different regions of the image, the skin color gesture region was segmented by the Bayesian decision formula.
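As a concrete illustration of this step, the following minimal numpy sketch trains class-conditional (Cr, Cb) histograms and applies the Bayesian decision rule per pixel. It is not the authors' implementation; the bin count, the Laplace smoothing and the synthetic training pixels are assumptions made for the example.

```python
import numpy as np

def train_bayes_skin_model(crcb_skin, crcb_bg, bins=32):
    """Class-conditional histograms P(Cr,Cb|skin), P(Cr,Cb|background)
    plus the skin prior, estimated from labeled training pixels (0..255)."""
    bounds = [[0, 256], [0, 256]]
    h_skin, _, _ = np.histogram2d(crcb_skin[:, 0], crcb_skin[:, 1], bins=bins, range=bounds)
    h_bg, _, _ = np.histogram2d(crcb_bg[:, 0], crcb_bg[:, 1], bins=bins, range=bounds)
    # Laplace smoothing avoids zero likelihood for unseen (Cr, Cb) cells.
    lik_skin = (h_skin + 1) / (h_skin.sum() + bins * bins)
    lik_bg = (h_bg + 1) / (h_bg.sum() + bins * bins)
    p_skin = len(crcb_skin) / (len(crcb_skin) + len(crcb_bg))
    return lik_skin, lik_bg, p_skin, bins

def is_skin(cr, cb, model):
    """Bayesian decision: classify a pixel as skin if P(skin|Cr,Cb) > 0.5."""
    lik_skin, lik_bg, p_skin, bins = model
    i = min(int(cr) * bins // 256, bins - 1)
    j = min(int(cb) * bins // 256, bins - 1)
    post = lik_skin[i, j] * p_skin / (lik_skin[i, j] * p_skin + lik_bg[i, j] * (1 - p_skin))
    return post > 0.5

# Synthetic labeled pixels: skin clusters near (Cr, Cb) = (150, 110),
# background pixels are spread uniformly (an assumption for the demo).
gen = np.random.default_rng(0)
skin = np.clip(gen.normal([150, 110], 8, size=(2000, 2)), 0, 255)
bg = gen.uniform(0, 255, size=(2000, 2))
model = train_bayes_skin_model(skin, bg)
print(is_skin(150, 110, model), is_skin(30, 220, model))  # → True False
```

Because the decision compares full posteriors rather than fixed Cr/Cb thresholds, the classifier adapts to whatever skin/background distributions the labeled samples exhibit.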
When segmenting the skin area, the pixels in the image were divided into two categories, namely skin color pixels and background pixels, labeled as category 1 and category 2 respectively. Some sample images were selected from the hand gesture image database to build the Bayesian model. In the YCrCb space of the hand gesture image, the pixels of the skin area and of the background area were marked with category labels respectively, and the prior probabilities of the feature vectors of the skin area and the background area were established by using the Cr and Cb components of the pixels.
Suppose samples of category 1 in the labeled pixels are
Then, calculate the conditional probability as
As for the input pixel

Skin area segmented results (a) palm (b) three-left (c) V (d) loose.
Hand region segmentation
After skin color segmentation, the obtained binary gesture image may still contain non-gesture areas such as the face, wrist and arm. These areas are redundant for hand gesture recognition, and their presence will interfere with the extraction and recognition of hand gesture features. Therefore, areas of the image that are not related to the hand gesture need to be removed. The hand gesture area was segmented by the following steps. Traverse the whole hand gesture image and judge the connectivity of the 3 by 3 neighborhood of each pixel, assigning connected pixels to the same connected region. In this way, the connected areas of the image are obtained, corresponding to the hand region, face region and other regions respectively. Calculate the perimeter, area and roundness features of each connected area; according to the values of these features, retain the hand gesture area and remove the others. The processed results are shown in Figure 4.
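The connected-region step can be sketched as follows. This is an illustrative numpy implementation, not the paper's code; the roundness measure 4πA/P² with a boundary-pixel perimeter is an assumed, commonly used choice.

```python
import numpy as np
from collections import deque

def connected_regions(mask):
    """Label 4-connected foreground regions of a binary mask via BFS."""
    labels = np.zeros(mask.shape, dtype=int)
    h, w = mask.shape
    n = 0
    for sy in range(h):
        for sx in range(w):
            if mask[sy, sx] and labels[sy, sx] == 0:
                n += 1
                labels[sy, sx] = n
                q = deque([(sy, sx)])
                while q:
                    y, x = q.popleft()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and labels[ny, nx] == 0:
                            labels[ny, nx] = n
                            q.append((ny, nx))
    return labels, n

def region_features(labels, k):
    """Area, perimeter (boundary-pixel count) and roundness 4*pi*A/P^2
    of region k; roundness is near 1 for a disc, smaller for ragged blobs."""
    ys, xs = np.nonzero(labels == k)
    area = len(ys)
    h, w = labels.shape
    perim = 0
    for y, x in zip(ys, xs):
        if any(not (0 <= ny < h and 0 <= nx < w) or labels[ny, nx] != k
               for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))):
            perim += 1
    return area, perim, 4 * np.pi * area / perim ** 2

mask = np.zeros((12, 12), dtype=bool)
mask[2:10, 2:10] = True   # large square blob standing in for the hand
mask[0, 11] = True        # one-pixel noise speck to be discarded
labels, n = connected_regions(mask)
areas = {k: region_features(labels, k)[0] for k in range(1, n + 1)}
hand_label = max(areas, key=areas.get)   # keep the largest (hand) region
print(n, areas[hand_label])  # → 2 64
```

In practice the decision would combine all three features rather than area alone, but the pattern of "label regions, score each, keep the hand-like one" is the same.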

Connectivity area segmented results (a) palm (b) three-left (c) V (d) loose.
The segmented hand gesture area of the image still contains the arm part. The maximum inner circle of the hand gesture area was used to determine the hand gesture center, and the arm part was removed by using this center point as the reference. All pixels in the hand gesture area in the binary image are marked as set I. Take any point in the gesture area as the circle center, draw a circle with the radius r, and mark all pixels inside the circle as the set
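A brute-force sketch of locating the maximum inner circle (and hence the palm center) is shown below; in practice a distance transform (e.g. OpenCV's cv2.distanceTransform) computes the same quantity efficiently. The square "palm" mask is a toy stand-in.

```python
import numpy as np

def palm_center(mask):
    """Center of the maximum inscribed circle of a binary region: the
    foreground pixel farthest from any background pixel (brute force)."""
    fg = np.argwhere(mask)
    bg = np.argwhere(~mask)
    best, best_r = (0, 0), -1.0
    for p in fg:
        # Radius of the largest circle centered at p staying inside the region.
        r = np.sqrt(((bg - p) ** 2).sum(axis=1)).min()
        if r > best_r:
            best_r, best = r, (int(p[0]), int(p[1]))
    return best, best_r

mask = np.zeros((15, 15), dtype=bool)
mask[2:13, 2:13] = True          # square "palm"; true center is (7, 7)
(cy, cx), r = palm_center(mask)
print(cy, cx, float(r))  # → 7 7 6.0
```

With the center and radius known, pixels beyond the wrist (below the circle, relative to the gesture direction) can be discarded to remove the forearm.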

Arm removing process (a) Traverse circle position (b) Change circle radius (c) Maximum inner circle (d) Arm removed.
Gesture direction normalization
Hand gesture interaction is carried out under unconstrained conditions, so the gesture direction in the acquired image is somewhat arbitrary. In the process of hand gesture recognition, in order to facilitate the comparison between different hand gesture images, it is necessary to unify all gestures into the same direction by rotating the gesture image by a certain angle. The gesture direction is mainly embodied in the gesture contours, so the rotation angle needed for correction was obtained by calculating the contour direction. Thus all gesture directions can be normalized by rotating the images.
Considering that the contours on both sides of the fingers and the palm show the characteristics of straight lines, the Hough transform was used to detect the direction of the linear features on the gesture contours. The needed correction angle can then be determined from the average direction of the detected lines. For a straight line in the pixel coordinate space
According to the above equation, for every point on a straight line in the
When detecting straight lines on the gesture contours, a two-dimensional accumulative array of
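The voting scheme described above can be sketched as follows; the accumulator resolution and the toy contour points are assumptions made for illustration.

```python
import numpy as np

def hough_dominant_angle(points, theta_bins=180, rho_res=0.5):
    """Vote contour points into a 2-D (rho, theta) accumulator and return
    the angle (in whole degrees) of the strongest detected line."""
    pts = np.asarray(points, dtype=float)
    thetas = np.deg2rad(np.arange(theta_bins))              # 0..179 degrees
    rho_max = np.hypot(pts[:, 0].max() + 1, pts[:, 1].max() + 1)
    acc = np.zeros((int(2 * rho_max / rho_res) + 1, theta_bins), dtype=int)
    for x, y in pts:
        rhos = x * np.cos(thetas) + y * np.sin(thetas)      # rho = x cos(t) + y sin(t)
        idx = np.round((rhos + rho_max) / rho_res).astype(int)
        acc[idx, np.arange(theta_bins)] += 1                # one vote per theta
    _, theta_idx = np.unravel_index(acc.argmax(), acc.shape)
    return int(theta_idx)

# A vertical contour line x = 5: its normal direction is theta = 0 degrees.
print(hough_dominant_angle([(5, y) for y in range(40)]))    # → 0
```

Averaging the angles of the strongest accumulator peaks, rather than taking only the single maximum, gives the mean contour direction used for the rotation correction.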

Direction normalized results (a) palm (b) three-left (c) V (d) loose.
Convex decomposition of hand gesture
Convex decomposition26 is the process of decomposing convex shapes from the original shape in the image. The purpose of the convex decomposition of a hand gesture is to separate the fingers from the gesture silhouette; the information of the decomposed fingers is then used to construct the feature vector for hand gesture recognition. In order to perform near-convex decomposition of the hand gesture in the image, edge detection is carried out on the pre-processed hand gesture image to obtain the contour of the hand. Firstly, according to the physiological characteristics of the human hand silhouette, the candidate cut lines required for the convex decomposition were obtained. Then, the optimal cut line for each finger in the gesture was determined according to the convexity of the decomposed finger shape and the number of spacing contour points between the two endpoints of the candidate cut line.
Determination of candidate cut lines
In the process of convex decomposition of the hand gesture, in order to decompose the fingers and reduce computation, points were sampled on the gesture contour at intervals. Starting from the contour point directly below the center of the gesture region, the contour points were numbered clockwise. As shown in Figure 7, suppose

Visible pair for convex decomposition.
Because a finger is narrow, the cut lines used in convex decomposition of the hand gesture are short. The candidate cut lines of convex decomposition are the connecting lines between the visible pairs conforming to the short cut rule. To find the visible pairs that conform to the short cut rule, calculate the distance between each pair of visible points in the interval contour point set, and normalize them to be

Candidate cut lines for hand gesture decomposition (a) palm (b) three-left (c) V (d) loose.
Selection of optimal cut line
In the convex decomposition of hand gestures, each decomposed shape may not be strictly convex, but it should be as convex as possible. Suppose
Calculate the convexity
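Although the exact convexity formula appears in the original equations, a common measure, assumed here purely for illustration, is the ratio of a part's area to the area of its convex hull (1 for a perfectly convex shape):

```python
def polygon_area(poly):
    """Shoelace area of a simple polygon given as (x, y) vertices."""
    s = 0.0
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

def convex_hull(points):
    """Andrew's monotone chain; returns the hull vertices in order."""
    pts = sorted(set(map(tuple, points)))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def convexity(poly):
    """Area of the shape divided by the area of its convex hull."""
    return polygon_area(poly) / polygon_area(convex_hull(poly))

square = [(0, 0), (4, 0), (4, 4), (0, 4)]                      # convex
l_shape = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)]     # concave notch
print(round(convexity(square), 3), round(convexity(l_shape), 3))  # → 1.0 0.857
```

Under such a measure, a candidate cut line whose resulting finger part scores close to 1 would be preferred over one leaving a concave remainder.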

Convex decomposition results (a) palm (b) three-left (c) V (d) loose.
Feature extraction of fingers
A hand gesture is composed of the palm and fingers; the differences among hand gestures lie in the quantity, direction, position and shape of the fingers. After convex decomposition, the outstretched fingers have been determined, so they were used to establish the features for gesture recognition. Firstly, each finger was thinned into a single-pixel line that approximates the center line of that finger, and the direction of the finger was calculated from the pixel coordinates of this center line. Then, the position feature of the finger was determined according to the distribution of all pixels of each decomposed finger along the circumference of the gesture. Finally, the shape feature of the decomposed finger was constructed from scale invariant Hu moment features.
Direction feature of finger
Each finger area was thinned into a single-pixel line that approximates the center line of the decomposed finger by using an image thinning algorithm, and the pixel coordinates of this line were then used to calculate the direction of the outstretched finger. Check each pixel of the image in its 3 × 3 neighborhood; if it satisfies: 1) it has no upper adjacent pixel but has lower, left and right adjacent pixels; 2) it is not an isolated point or the endpoint of a line; 3) removing it will not disconnect the region, then remove the checked pixel. Scan the whole finger area and repeat this step until no pixels can be removed. This process is realized by an iterative method, which removes the boundary layer by layer until the finger area is thinned to a central line. After the thinning processing, the original finger is represented by its center line.
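The conditions above match classical iterative thinning. As an illustrative stand-in (not necessarily the authors' exact algorithm), the well-known Zhang-Suen thinning scheme can be sketched as:

```python
import numpy as np

def zhang_suen_thin(mask):
    """Zhang-Suen thinning: repeatedly peel deletable boundary pixels
    (two sub-iterations per pass) until a one-pixel-wide line remains."""
    img = mask.astype(np.uint8).copy()
    h, w = img.shape
    changed = True
    while changed:
        changed = False
        for step in (0, 1):
            to_del = []
            for y in range(1, h - 1):
                for x in range(1, w - 1):
                    if not img[y, x]:
                        continue
                    # Neighbours P2..P9, clockwise starting from north.
                    P = [img[y-1, x], img[y-1, x+1], img[y, x+1], img[y+1, x+1],
                         img[y+1, x], img[y+1, x-1], img[y, x-1], img[y-1, x-1]]
                    B = sum(P)                                  # nonzero neighbours
                    A = sum(P[i] == 0 and P[(i + 1) % 8] == 1 for i in range(8))
                    if not (2 <= B <= 6 and A == 1):
                        continue                                # keeps endpoints, connectivity
                    if step == 0 and P[0]*P[2]*P[4] == 0 and P[2]*P[4]*P[6] == 0:
                        to_del.append((y, x))
                    elif step == 1 and P[0]*P[2]*P[6] == 0 and P[0]*P[4]*P[6] == 0:
                        to_del.append((y, x))
            for y, x in to_del:
                img[y, x] = 0
            changed = changed or bool(to_del)
    return img.astype(bool)

finger = np.zeros((9, 15), dtype=bool)
finger[3:6, 2:13] = True              # thick horizontal "finger" blob
line = zhang_suen_thin(finger)
print(line.sum() < finger.sum(), line.sum(axis=0).max() <= 2)  # → True True
```

The B and A tests implement conditions 2) and 3) of the text (endpoints and connectivity are preserved), while the two sub-iterations alternate which side of the boundary is peeled, matching condition 1).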
Select two pixels at a certain pixel interval on the thinning center line of the finger, and their coordinates are
Position feature of finger
In order to avoid the influence of inconsistent gesture directions on the recognition accuracy, the average of all finger directions of each gesture was defined as the main direction of the gesture, and this main direction was used as the benchmark for further direction correction of the hand gesture. The direction angle of each finger was calculated according to equation (15), and the main direction of the gesture was obtained by averaging the direction angles of all fingers. The gesture image was then rotated by the angle given by its main direction, so that the main directions of all gesture images were adjusted to the vertical direction.
With the center of the hand gesture region as the circle center and the horizontal rightward direction as the starting position, the gesture region was divided into 360 equal parts counterclockwise along the circumference, as shown in Figure 10. Each part was numbered in turn as
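The angular partitioning can be sketched as a 360-bin histogram of finger-pixel angles around the palm center. The (x right, y up) convention and the toy vertical finger are assumptions for the example; with image coordinates (y down) the angle sign would need to be flipped.

```python
import numpy as np

def angular_position_histogram(finger_pixels, center, bins=360):
    """Count finger pixels in equal angular sectors around the palm center,
    numbered counterclockwise from the horizontal-right direction."""
    pts = np.asarray(finger_pixels, dtype=float) - np.asarray(center, dtype=float)
    ang = np.degrees(np.arctan2(pts[:, 1], pts[:, 0])) % 360.0
    idx = np.minimum((ang * bins / 360.0).astype(int), bins - 1)
    return np.bincount(idx, minlength=bins)

# Hypothetical finger pointing straight up from the palm center (0, 0).
pixels = [(0, r) for r in range(1, 20)]
hist = angular_position_histogram(pixels, (0, 0))
print(int(hist.argmax()), int(hist[90]))  # → 90 19
```

The sector numbers occupied by each decomposed finger then serve directly as its position feature.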

Partitions of gesture image.
Shape feature of finger
Each finger was separated from the palm through convex decomposition. In addition to the differences in direction and position of the fingers, their shapes also vary greatly. Since the acquisition of gesture images is carried out under unconstrained conditions, the same finger undergoes scaling, rotation, translation and other variations. Hu invariant moments have rotation and translation invariance. Based on Hu invariant moments,51 moments with scale invariant characteristics were constructed here to describe the shapes of different fingers.
Suppose the value of the pixel with coordinates
The normalized central moment is:
According to the above relation, the normalized central moment of the scaled image is:
From the above formula, it can be seen that the scaling factor
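To illustrate how the normalization cancels the scale factor, the sketch below computes the first Hu moment φ₁ = η₂₀ + η₀₂ for the same rectangular "finger" at two scales; the values agree up to discretization error. The masks and sizes are assumptions for the demo.

```python
import numpy as np

def phi1(mask):
    """First Hu moment phi1 = eta20 + eta02, where the normalized central
    moment eta_pq = mu_pq / mu00^(1 + (p+q)/2) cancels the scale factor."""
    ys, xs = np.nonzero(mask)
    m00 = float(len(xs))
    cx, cy = xs.mean(), ys.mean()
    mu = lambda p, q: (((xs - cx) ** p) * ((ys - cy) ** q)).sum()
    eta = lambda p, q: mu(p, q) / m00 ** (1 + (p + q) / 2)
    return eta(2, 0) + eta(0, 2)

small = np.zeros((40, 40), dtype=bool)
small[10:20, 12:18] = True            # 10 x 6 rectangular "finger"
big = np.zeros((80, 80), dtype=bool)
big[20:40, 24:36] = True              # same shape at twice the scale
print(abs(phi1(small) - phi1(big)) < 0.01)  # → True (equal up to discretization)
```

The same normalization applies to the higher-order Hu moments, so the full shape descriptor of a finger is insensitive to its apparent size in the image.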
Hand gesture classification
Due to the unconstrained image acquisition environment and the physiological differences among individual hands, the acquired gesture images exhibit pattern variations in direction, scale, position and perspective, which bring great difficulty to the classification and recognition of hand gestures. When the number of gesture categories is large, the gestures can be grouped and classified step by step by using the information of the quantity, direction, position and shape of the outstretched fingers. This gradually narrows the classification scope of the hand gestures while performing classification according to the characteristics of each gesture. Based on this consideration, a hierarchical decision classifier with embedded deep sparse autoencoders was established to classify the hand gestures step by step. By threshold judgment, hand gestures were classified step by step using the quantity, direction and position of the outstretched fingers; the final recognition of hand gestures was then realized by recognizing the shape of the outstretched fingers with the output of the deep network.
Deep sparse autoencoder
A deep sparse autoencoder establishes correlations in the data by learning the characteristics of the input data. In this paper, the deep sparse autoencoder used for finger shape classification has a four-layer network structure: one input layer, two feature layers and one Softmax classification layer.
The adopted network structure of the deep sparse autoencoder is shown in Figure 11. It was trained with a layer by layer training method. Firstly, train the network between the Input layer and the Feature I layer by using the input feature data. Then, train the network between the Feature I layer and the Feature II layer by using the data of the Feature I layer as input. Lastly, train the network between the Feature II layer and the Softmax layer by using the data of the Feature II layer as input. After the training of all layers was finished, fine-tuning was conducted on the whole network: all layers were considered as one model, and all connection weights were optimized in each iteration.
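A minimal numpy sketch of one sparse-autoencoder building block is given below; stacking two such layers plus a Softmax output yields the four-layer structure described. The L1 activation penalty stands in for the usual sparsity term, and all sizes, rates and data here are assumptions, not the paper's settings.

```python
import numpy as np

gen = np.random.default_rng(1)
X = gen.random((200, 16))                  # toy stand-in for finger shape features
n_in, n_hid, lr, beta = 16, 8, 0.5, 1e-3   # beta weights the sparsity penalty

W1 = gen.normal(0, 0.1, (n_in, n_hid)); b1 = np.zeros(n_hid)
W2 = gen.normal(0, 0.1, (n_hid, n_in)); b2 = np.zeros(n_in)
sig = lambda z: 1.0 / (1.0 + np.exp(-z))

def forward_backward(X):
    H = sig(X @ W1 + b1)                   # hidden (feature) activations
    Y = sig(H @ W2 + b2)                   # reconstruction of the input
    loss = ((Y - X) ** 2).mean() + beta * np.abs(H).mean()
    dY = 2 * (Y - X) / X.size * Y * (1 - Y)             # backprop through output sigmoid
    dH = (dY @ W2.T + beta * np.sign(H) / H.size) * H * (1 - H)
    return loss, X.T @ dH, dH.sum(0), H.T @ dY, dY.sum(0)

loss_start = forward_backward(X)[0]
for _ in range(500):                       # plain gradient descent
    loss, gW1, gb1, gW2, gb2 = forward_backward(X)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2
loss_end = forward_backward(X)[0]
print(loss_end < loss_start)  # → True: reconstruction improves with training
```

In layer-wise pretraining, the learned H of this block becomes the training input of the next block, and a final fine-tuning pass updates all weights jointly, as described above.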

Network structure of deep sparse autoencoder.
Hierarchical decision classifier
A hierarchical decision classifier embedded with deep sparse autoencoders was constructed for hand gesture recognition, as shown in Figure 12. The quantity of fingers was used to divide the gesture images into groups first, then the direction, position and shape features of the fingers were used to subdivide and recognize gestures within each group. The classification process is as follows:

Hierarchical decision classifier.
Firstly, the hand gestures were divided into six groups according to the quantity of fingers stretching out, the groups corresponding to 0, 1, 2, 3, 4 and 5 outstretched fingers respectively. Then, classification within each group was performed.
Classification of the first group: the quantity of outstretched fingers in the first group of gestures is 0, which corresponds only to the hand gesture “fist”, so it was directly recognized as “fist”.
Classification of the second group: “thumb-left” was distinguished according to the direction of the outstretched finger, and then “one” and “one-right” were distinguished according to the position of the outstretched finger. In order to avoid confusion between the gesture “one” and the gesture “thumb-left”, shape features were adopted to classify them again.
Classification of the third group: the gestures were divided into two groups according to the position of the left outstretched finger, and then each group was classified according to the position of the right outstretched finger; thus the gestures “loose”, “two-left”, “lock” and “V” can be recognized.
Classification of the fourth group: the gestures were divided into two groups according to the position of the left outstretched finger. The first group was then divided into “ILY” and “three-left” according to the shape of the right outstretched finger, and the second group was divided into “W” and “OK” according to the position of the right outstretched finger. To avoid confusion between the gesture “W” and the gesture “OK”, they were classified again by using the shape feature of the right outstretched finger.
Classification of the fifth group: the hand gestures were divided into “four-right” and “four-left” according to the shape feature of the first outstretched finger on the left.
Classification of the sixth group: the quantity of outstretched fingers in the sixth group is 5, which corresponds to only one category of hand gesture, “palm”, so it is directly recognized as “palm”.
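The two-stage decision flow above can be sketched as a dispatch on the finger count. The sub-classifier entries below are placeholders (in the paper each group's decision embeds the deep sparse autoencoders operating on direction, position and shape features); the helper names are hypothetical.

```python
def classify_gesture(fingers, subclassifiers):
    """Stage 1: group the gesture by the number of outstretched fingers.
    Stage 2: delegate to that group's sub-classifier, which uses the
    direction, position and shape features of the fingers."""
    n = len(fingers)
    if n == 0:
        return "fist"   # group 1 contains a single gesture
    if n == 5:
        return "palm"   # group 6 contains a single gesture
    return subclassifiers[n](fingers)

# Placeholder sub-classifiers for groups 2-5 (each would really be a
# decision rule embedding the trained deep sparse autoencoders).
subclassifiers = {
    1: lambda fingers: "one",
    2: lambda fingers: "V",
    3: lambda fingers: "OK",
    4: lambda fingers: "four-left",
}
```

Grouping by finger count first keeps each sub-classifier small: it only has to separate the handful of gestures that share the same number of outstretched fingers.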
Experiment and discussion
In order to evaluate the effectiveness and robustness of our proposed method, experimental evaluations were performed on our self-built gesture image dataset and the gesture image dataset20 built by the University of Padova.
Evaluation on our self-built dataset
Our self-built gesture image dataset contains 15 categories of hand gestures, as shown in Figure 13. The categories are numbered 1 to 15. Each category of gesture was acquired 40 times from each of 5 people, and a total of 3000 gesture images were collected. The size of the images is 300 × 400. Among them, 900 images were used for training the hierarchical decision classifier, and the remaining 2100 images were used for the recognition test. The effectiveness and robustness of the proposed method were verified by comparing it with gesture recognition methods in the literature under variations of lighting conditions, gesture direction and gesture scale respectively.

Hand gestures from our image dataset: 1. Fist 2. One 3. Thumb-left 4. One-right 5. Loose 6. Two-left 7. Lock 8. V 9. ILY 10. Three-left 11. W 12. OK 13. Four-right 14. Four-left 15. Palm.
Comparison on illumination variation
In order to verify the robustness of the proposed method to illumination variation, hand gesture images were captured under four different illumination conditions: a shaded environment, natural lighting, indoor lighting and artificial lighting. Since robustness to the variation of lighting conditions is mainly reflected in the segmentation and preprocessing of the hand gesture region, the Bayesian decision method in YCrCb space used in this paper, the k-means clustering method in RGB space52 and the Gaussian model method in HSV space53 were respectively adopted to segment the gesture images. Then, gesture recognition experiments were carried out using the processed images; the results are shown in Figure 14. It can be seen from the figure that the proposed method is significantly superior to the traditional methods in both the recognition rate and the recognition stability of different gestures when the lighting condition of gesture image acquisition changes.

Recognition results for illumination variation (a) Shaded environment (b) Natural lighting (c) Indoor lighting (d) Artificial lighting.
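To illustrate why a chrominance-based color space suits this task, the sketch below converts RGB to Cr/Cb (ITU-R BT.601) and applies fixed, commonly cited skin thresholds. This is a simplified stand-in for the Bayesian decision rule actually used in the paper, not its implementation; the threshold values are assumptions for illustration.

```python
import numpy as np

def skin_mask_ycrcb(rgb):
    """Fixed-threshold skin detection in YCrCb space. Luminance Y is
    ignored entirely, which is what makes the rule comparatively
    tolerant to changes in lighting conditions."""
    r = rgb[..., 0].astype(float)
    g = rgb[..., 1].astype(float)
    b = rgb[..., 2].astype(float)
    cr = 128 + 0.5 * r - 0.4187 * g - 0.0813 * b   # ITU-R BT.601
    cb = 128 - 0.1687 * r - 0.3313 * g + 0.5 * b
    # Commonly used skin chrominance ranges (illustrative values).
    return (cr > 133) & (cr < 173) & (cb > 77) & (cb < 127)
```

Because only the Cr/Cb chrominance plane is thresholded, darkening or brightening a pixel (which mostly moves Y) changes the decision far less than it would in RGB.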
Comparison on gesture direction variation
In order to verify the robustness of the proposed method to gesture direction variation, hand gesture images were rotated by −20°, −10°, +10° and +20° respectively. The proposed method, the CNN method,54 the Hu moment feature method23 and the HOG feature method27 were respectively adopted to conduct gesture recognition experiments; the results are shown in Figure 15. It can be seen from the figure that the proposed method is significantly superior to the traditional methods in both the recognition rate and the recognition stability of different gestures when the direction of the hand gesture changes.

Recognition results for rotation variation (a) Rotated by −20° (b) Rotated by −10° (c) Rotated by + 10° (d) Rotated by + 20°.
Comparison on gesture scale variation
In order to verify the robustness of the proposed method to gesture scale variation, the size of the gesture in the hand gesture image was rescaled by 0.5, 0.75, 1.5 and 2 times respectively, as shown in Figure 16. The proposed method, the CNN method,54 the Hu moment feature method23 and the SIFT method55 were respectively adopted to conduct gesture recognition experiments; the results are shown in Figure 16. It can be seen from the figure that the proposed method is significantly superior to the traditional methods in both the recognition rate and the recognition stability of different gestures when the scale of the hand gesture changes.

Recognition results for scale variation (a) Rescaled by 0.5 times (b) Rescaled by 0.75 times (c) Rescaled by 1.5 times (d) Rescaled by 2.0 times.
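The scale robustness reported above rests partly on scale-invariant moment descriptors. As a sketch, the first Hu invariant φ1 = η20 + η02, computed from the normalized central moments of a binary silhouette, is (up to discretization error) unchanged when the shape is rescaled. This is illustrative code, not the paper's exact finger descriptor.

```python
import numpy as np

def hu_phi1(binary_img):
    """First Hu moment invariant: phi1 = eta20 + eta02, where
    eta_pq = mu_pq / mu00^(1 + (p+q)/2) are normalized central moments.
    For a binary shape this is invariant to translation and scale."""
    ys, xs = np.nonzero(binary_img)
    m00 = xs.size                      # mu00 of a binary image
    xbar, ybar = xs.mean(), ys.mean()
    mu20 = ((xs - xbar) ** 2).sum()
    mu02 = ((ys - ybar) ** 2).sum()
    return (mu20 + mu02) / m00 ** 2    # eta20 + eta02

# A filled square and its 2x-rescaled copy give (almost) the same phi1.
small = np.zeros((40, 40)); small[10:20, 10:20] = 1
big = np.zeros((80, 80)); big[20:40, 20:40] = 1
```

The normalization by a power of μ00 (the shape's area) is what cancels the scale factor, so the descriptor depends on the silhouette's shape rather than its size in pixels.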
Evaluation on publicly available dataset
In order to verify the generalization ability of the proposed method on a different gesture image dataset, the publicly available gesture image dataset20 built by the University of Padova was used for evaluation. The dataset contains 10 different categories of gestures, as shown in Figure 17. These gestures were acquired from 14 different persons; each person performed each gesture 10 times. A total of 1400 RGB gesture images were obtained. The size of each image is 1280 × 960.

Hand gestures from the publicly available dataset:20 G1, G2, G3, G4, G5, G6, G7, G8, G9, G10.
Four tests using the leave-three-out method were performed on this dataset. In each test, 300 images from 3 persons were selected for training the classifier, and 1100 images from the remaining 11 persons were used for the recognition test. Different training sets and testing sets were selected from different persons for each test. For example, the images from persons P1 to P3 were used for training in Test 1. The training set and testing set selection of each test are shown in Table 1.
Training sets and testing sets selection for each test.
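Assuming, as Test 1 suggests, that each test trains on a consecutive triple of persons (the actual per-test assignment is the one given in Table 1), the split construction can be sketched as:

```python
def leave_three_out_splits(persons, n_tests=4):
    """Build (train, test) person-level splits: each test trains on 3
    persons and tests on the remaining 11. Consecutive triples are an
    assumption here; the paper's actual assignment is in Table 1."""
    splits = []
    for t in range(n_tests):
        train = persons[3 * t: 3 * t + 3]
        test = [p for p in persons if p not in train]
        splits.append((train, test))
    return splits

persons = [f"P{i}" for i in range(1, 15)]   # P1 ... P14
splits = leave_three_out_splits(persons)
```

Splitting by person rather than by image ensures that no subject appears in both the training and testing sets, which is what makes the test a measure of generalization to unseen hands.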
The proposed method, the CNN method,54 the Hu moment feature method23 and the SIFT method55 were respectively adopted to conduct gesture recognition experiments; the results are shown in Figure 18. It can be seen from the figure that the proposed method is significantly superior to the traditional methods in both the recognition rate and the recognition stability of different gestures on the publicly available dataset.

Recognition results on publicly available gesture dataset (a) Test 1 (b) Test 2 (c) Test 3 (d) Test 4.
Conclusion
In this paper, the hand gesture region was obtained by skin color segmentation in YCrCb space. This processing utilizes the good clustering characteristics of skin color in YCrCb space and overcomes the influence of changes in lighting conditions on gesture recognition. Before the finger segmentation, the hand gesture direction was corrected by using the direction of the gesture contour lines. After the finger segmentation, the hand gesture direction was corrected further by using the direction of the fingers. These steps overcome the influence of changes in gesture direction on gesture recognition. The shapes of different fingers were described by constructing Hu moment features with scale invariance. This representation overcomes the influence of changes in gesture scale on gesture recognition. In the gesture recognition stage, the generalization ability of the classifier was improved by embedding deep sparse autoencoders in the classifier. The experimental results show that the proposed method is robust to lighting, direction and scale changes, and significantly superior to the traditional methods in both the recognition rate and the recognition stability. Further research will be considered in the following aspects. Firstly, feature extraction will not be limited to the silhouette-image-based method in this paper, but will also be conducted from the contour curves of the binary gesture images or the texture of the grayscale gesture images. Secondly, optimization, selection and weighting of the extracted features will be studied to simplify calculations. Finally, in algorithm design, not only the recognition accuracy and robustness but also the convenience of use and the efficiency of operation will be considered.
Footnotes
Acknowledgements
None.
Author contributions (roles)
Yunfeng Li: proposed the conceptualization and methodology, performed the calculations.
Pengyue Zhang: performed experiments and data processing.
Ethical Approval /Patient consent
The topic of the paper is not based on human subjects; thus, no ethical approval or patient consent was required.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
