Abstract
Gesture recognition has remained a challenging problem in the field of human-robot interaction. With the development of depth sensors such as Kinect, different modalities have become available for gesture recognition, although their advantages have not been fully exploited. One of the critical issues for multi-modal gesture recognition is how to fuse features from different modalities. In this article, we present a unified framework for multi-modal gesture recognition based on dynamic time warping. The 3D implicit shape model is applied to characterize the space-time structure of the local features extracted from different modalities. All votes from the local features are then incorporated into a common probability space, which is used to build the distance matrix. Meanwhile, an upper-bounding method, UB_Pro, is proposed to speed up dynamic time warping. The proposed approach is evaluated on the challenging ChaLearn Isolated Gesture Dataset and shows performance comparable to state-of-the-art approaches for the multi-modal gesture recognition problem.
Introduction
Gestures are elementary movements of the human arms and hands and are a natural and intuitive means of human-robot interaction (HRI). 1 Due to their potential applications in many fields, such as HRI, 2 sign language recognition, 3 industrial control, 4 and computer games, 5 there has been great interest in the machine learning community in analyzing human gestures from visual data so that robots can read and understand human commands. Recently, the emergence of commercial 3D sensors, such as Kinect, has greatly promoted research on hand gesture recognition with an extreme enrichment of visual data. As an advanced sensing technology, 3D sensors can capture more comprehensive signals, including RGB video, depth video, and audio. Depth images contain structural information, which allows gesture recognition to cope with illumination changes and complex backgrounds and to be less sensitive to variations in clothing and skin. 6 Moreover, salient information can be extracted from the RGB-D stream, as can the skeletal information, both of which can be important supplements to the RGB-D videos. As a consequence, with more complete information, multi-modal algorithms can improve performance significantly.
To push the boundary of multi-modal methods for gesture recognition, ChaLearn has organized several challenges on gesture recognition with Kinect as the 3D sensor. 7 In 2013, the challenge focused on user-independent multiple gesture learning. 8 A large multi-modal dataset of 13,358 gestures was released, providing audio, skeletal models, user masks, and RGB and depth videos. In 2016, the organizers presented two larger multi-modal datasets, the ChaLearn LAP RGB-D Isolated Gesture Dataset (IsoGD) and the Continuous Gesture Dataset (ConGD). 9 The results of the two challenges show that deep learning methods dominate this field. Neverova et al. 10 developed a deep learning-based framework with an effective fusion strategy that placed first in the first challenge. In the second challenge, the team ASU ranked first, using the C3D network 11 and the Temporal Segment Network (TSN) to extract features. 7 However, one drawback of deep learning-based algorithms is their tedious and time-consuming training process: when a new gesture is defined, the model must be retrained, which requires large amounts of processing power.
In this article, we focus on building an effective and efficient framework to fuse features extracted from different modalities. Our approach integrates several modalities, including RGB and depth images, optical flow fields from the RGB channel, and salient information extracted from the depth channel, as shown in Figure 1. We use the different modalities to extract representative features. The 3D implicit shape model (3D-ISM) is then applied to characterize the space-time structure of the local features. All votes from the local features are incorporated into a common probability space, which is used to build the distance matrix. Finally, a dynamic programming algorithm is applied to find an optimal alignment path in the distance matrix. To speed up dynamic time warping (DTW), we use a cheap upper-bounding calculation to abandon expensive computations. The major contributions of our work are the following: (1) we propose a unified framework for multi-modal gesture recognition, which can incorporate all features extracted from different modalities into a common voting space; (2) a consensus voting-based extension of the DTW method is proposed to achieve a sequence-to-class alignment, which aligns one test sequence to a specific class implicitly; (3) a cheap upper-bounding calculation is used to speed up DTW; (4) the proposed approach performs competitively with the state of the art on the ChaLearn IsoGD dataset.

Examples of different types of modalities. (a) RGB image, (b) depth image, (c) optical flow field, and (d) saliency.
The remainder of the article is organized as follows. The second section briefly reviews the related works on gesture recognition. The third section describes the details of the proposed approach. Experimental results are presented in the fourth section. The fifth section concludes the article.
Related work
The field of gesture recognition has advanced rapidly; researchers have proposed a large number of gesture recognition algorithms that achieve impressive accuracy and speed.
Gesture recognition
Early gesture recognition approaches relied on wearable devices. Keskin et al. 12 developed a real-time hand tracking system with the help of a colored glove, from which the trajectory of the hand can be obtained via 3D reconstruction. Using electromyogram sensors, the “Myo” armband measures electrical activity from muscles to detect five gestures. 13 Although these wearable devices perform well, the additional burden makes users uncomfortable when performing gestures, which is a fatal weakness for HRI applications. 1
Vision-based gesture recognition offers a natural way for humans to communicate with robots efficiently. Traditional machine learning algorithms pay more attention to hand-crafted features. Wu et al. 14 proposed the Extended-Motion-History-Image (Extended-MHI) feature to capture holistic structural information. Hernández et al. 15 proposed a mixed feature descriptor named Viewpoint Feature Histogram Camera Roll Histogram (VFHCRH) to address the information loss of the VFH feature on the roll view. Wan et al. 16 proposed a novel spatiotemporal feature named Mixed Features around Sparse Keypoints (MFSK), which uses the Speeded Up Robust Features (SURF) detector 17 for initial keypoint detection and an optical flow method to filter out points with little motion. However, these methods use the bag-of-visual-words model 16 to integrate all features; their main drawback is that they neglect global structural information.
Feature extraction is the first step of gesture recognition, and gesture classification is the last but most important step. The Hidden Markov Model (HMM) is a widely used method for gesture recognition. Wu et al. 18 used audio data to segment continuous gestures into isolated clips, and a late fusion strategy was employed to fuse the results of HMM classifiers applied to both audio and skeleton features. Miranda et al. 19 represented gestures as sequences of key poses and applied a decision forest to label gestures. Zhang et al. 20 trained Support Vector Machines (SVMs) to classify the dynamic American Sign Language (ASL) gestures in the MSRGesture3D dataset. Lin et al. 21 combined Kernel Principal Component Analysis and Nonparametric Discriminant Analysis to extract discriminative features and achieved robust gesture recognition with a simple Nearest Neighbor (NN) classifier. The DTW approach has also been successfully used for recognizing signed digits. 22 Recently, specific neural networks have been proposed to extract robust features 7 or to act as end-to-end classifiers directly. 23
Multi-modal fusion
A survey by Escalera et al. 7 provided a high-level overview of recent approaches for multi-modal gesture recognition. As it shows, the performance of gesture recognition has been improved by the enrichment of information. Wu and Cheng 6 proposed a novel Bayesian Co-Boosting framework for multi-modal gesture recognition: multiple HMM classifiers were trained collaboratively to construct a strong classifier, which achieved 97.63% on the ChaLearn Multi-modal Gesture Recognition dataset. Molchanov et al. 23 used 3D convolutional deep neural networks to fuse information from multiple spatial scales for gesture recognition in a driver-assistance scenario. Also for dynamic car-driver gesture recognition, Molchanov et al. 24 applied convolutional deep neural networks to fuse data from multiple sensors (RGB camera, time-of-flight depth sensor, and radar). To handle variable-length gestures, Long Short-Term Memory (LSTM) cells were integrated into recurrent neural networks to capture temporal dynamics. 25 Neverova et al. 10 proposed a multi-scale and multi-modal deep learning framework to classify gestures robustly with a restricted number of free parameters. Duan et al. 26 proposed a unified framework to fuse information from multiple modalities, in which a depth-saliency stream was used to remove background noise.
DTW
Since gestures are sequential data, dynamic gesture recognition can also be handled by DTW. Typically, DTW has been used as a preprocessing step to perform begin–end segmentation of continuous gestures. Reyes et al. 27 presented a begin–end gesture recognition approach using feature weighting in the DTW framework. Cheng et al. 28 proposed a Windowed DTW to detect the beginning and end of a specific gesture in an infinite trajectory gesture sequence; a searching window was used to handle the problem of gesture overlapping. Hernández-Vela et al. 15 presented a probability-based DTW that utilizes Gaussian Mixture Models to model the variance caused by environmental factors, and the approach was used to segment continuous gestures into isolated clips. Moreover, combined with the NN algorithm, DTW is a powerful method for recognizing gestures as a distance measure. Konecny et al. 29 aggregated Histogram of Oriented Gradients (HOG) and Histogram of Flow (HOF) features as the representation of each frame, used the Quadratic-Chi distance to build the temporal cost matrix, and finally applied the Viterbi algorithm to find the shortest path. Krishnan and Sarkar 30 proposed a conditional distance that computes warp vectors between two gesture sequences using a third sequence; DTW is then performed again between the two warp vectors to obtain the final label.
Methodology
The overview of the proposed approach
In this article, we focus on isolated gesture recognition with RGB-D video. The global pipeline of the proposed approach is illustrated in Figure 2. The first step is feature extraction: besides RGB and depth data, optical flow fields and saliency maps are used to extract efficient features. Then, 3D-ISM is applied to model the spatiotemporal structure of the local features. In the classification step, after feature extraction, all votes from the local features are incorporated into a common probability space, which is then used to build the distance matrix of DTW. Finally, we adopt a dynamic programming algorithm to find the optimal path and label the gesture with the NN method.

General pipeline of the proposed approach.
Preprocessing
The purpose of preprocessing is to improve the robustness of feature extraction. First, a median filter and in-painting are applied to remove the noise in the depth data, especially the holes that usually occur along edges. Second, because of the structural information contained in depth data, image segmentation algorithms can be applied to achieve background subtraction. The effects of the median filter and background subtraction are shown in Figure 3.

The results of the preprocessing step. (a) Origin image, (b) image smoothing, and (c) background subtraction.
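The two preprocessing operations above can be sketched in pure Python. This is a hedged illustration on a toy depth map (0 marks a hole); the actual filter size and in-painting algorithm used in the paper are not reproduced here.

```python
# Sketch of the depth preprocessing step: naive hole filling (in-painting by
# the mean of valid 8-neighbours) followed by a 3x3 median filter.

def fill_holes(depth, hole=0):
    """Replace each hole pixel by the mean of its valid 8-neighbours."""
    h, w = len(depth), len(depth[0])
    out = [row[:] for row in depth]
    for y in range(h):
        for x in range(w):
            if depth[y][x] == hole:
                vals = [depth[yy][xx]
                        for yy in range(max(0, y - 1), min(h, y + 2))
                        for xx in range(max(0, x - 1), min(w, x + 2))
                        if depth[yy][xx] != hole]
                if vals:
                    out[y][x] = sum(vals) // len(vals)
    return out

def median_filter(depth, k=3):
    """k x k median filter; border pixels are copied unchanged."""
    h, w = len(depth), len(depth[0])
    r = k // 2
    out = [row[:] for row in depth]
    for y in range(r, h - r):
        for x in range(r, w - r):
            window = sorted(depth[yy][xx]
                            for yy in range(y - r, y + r + 1)
                            for xx in range(x - r, x + r + 1))
            out[y][x] = window[len(window) // 2]
    return out

depth = [[50, 50, 50, 50],
         [50,  0, 50, 50],  # a hole along the hand edge
         [50, 50, 50, 50],
         [50, 50, 50, 50]]
clean = median_filter(fill_holes(depth))   # the hole is filled to 50
```

In practice, a library routine (e.g. an OpenCV median blur and in-painting call) would replace these loops; the sketch only makes the data flow of the step explicit.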
Feature extraction
In this article, we extract two kinds of local features to represent the gesture: MFSK and Salient Features based on Reference Frame (SFRF). The MFSK feature is a novel spatiotemporal feature that is robust and invariant to scale, rotation, and partial occlusion. 16 MFSK uses the SURF detector 17 to initiate keypoint detection. Optical flow fields are then acquired to calculate the velocities of all keypoints; keypoints with small velocities are treated as motionless points and deleted. The velocity threshold of layer
where

The descriptors of the MFSK feature. 15 MFSK: Mixed Features around Sparse Keypoints.
The SFRF feature extracts the salient region through a bidirectional reference-frame search algorithm, and features outside the salient region are deleted. The reference-frame search consists of a forward search and a backward search. The reference frame
where
Finally, keypoints are filtered by the salient region, and the HOG descriptor is computed on small patches around each keypoint in both the RGB and depth images. The local patch is a square area which has
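The velocity-based filtering step described for MFSK can be sketched as follows. This is a hedged illustration: the keypoints, flow vectors, and threshold value are toy inputs, not the paper's settings, and a real pipeline would obtain them from a SURF detector and an optical flow estimator.

```python
# Keep only keypoints whose optical-flow velocity magnitude reaches a
# threshold; slower points are treated as motionless and discarded.
import math

def filter_keypoints(keypoints, flows, v_min=1.0):
    """Return the keypoints whose flow magnitude is at least v_min."""
    kept = []
    for (x, y), (dx, dy) in zip(keypoints, flows):
        if math.hypot(dx, dy) >= v_min:
            kept.append((x, y))
    return kept

pts   = [(10, 12), (40, 33), (70, 5)]
flows = [(0.2, 0.1), (3.0, -2.0), (0.0, 0.0)]
moving = filter_keypoints(pts, flows)   # only the second point survives
```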
3D-ISM learning
The ISM is a classical algorithm for detecting and localizing objects of a visual category. It has two basic steps: learning an appearance codebook and learning spatial distributions. 31 The first step extracts local features at interest points and groups them with an agglomerative clustering scheme. The second step performs a second iteration over all features and stores all relative positions for each codebook entry. The 3D-ISM is an extension of the ISM to the temporal domain; Figure 5 shows the difference between them.

ISM and 3D-ISM. ISM: implicit shape model.
Given a training dataset
where
where
where
Based on the above steps, each leaf is treated as a codebook entry, and the next step is to learn the spatiotemporal probability distribution. As shown in Figure 6, a second iteration is performed over all features in the training data to find the best-matching codebook entry, and the position information is merged into the list of occurrences, which implicitly represents the spatiotemporal probability distribution.

Building the voting space of the 3D-ISM. ISM: implicit shape model.
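The occurrence-learning iteration can be sketched as follows. This is an illustrative stand-in: the codebook entries and features are toy 2D vectors, and the reference point (here the gesture center) and distance metric are assumptions for the sketch rather than the paper's exact choices.

```python
# Second iteration of 3D-ISM learning: match every training feature to its
# best codebook entry and append its spatiotemporal offset (dx, dy, dt)
# relative to a reference point to that entry's occurrence list.

def nearest_entry(codebook, feat):
    """Index of the codebook entry with minimal squared distance to feat."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], feat)))

def learn_occurrences(codebook, samples, center):
    """samples: list of (feature_vector, (x, y, t)) tuples."""
    occurrences = {i: [] for i in range(len(codebook))}
    cx, cy, ct = center
    for feat, (x, y, t) in samples:
        idx = nearest_entry(codebook, feat)
        occurrences[idx].append((x - cx, y - cy, t - ct))
    return occurrences

codebook = [[0.0, 0.0], [1.0, 1.0]]
samples = [([0.1, 0.0], (12, 20, 3)),
           ([0.9, 1.1], (30, 22, 7))]
occ = learn_occurrences(codebook, samples, center=(20, 20, 5))
```

At recognition time, the stored offsets let each matched feature cast votes for where (and when) the gesture reference point should lie, which is the voting space shown in Figure 6.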
Consensus voting strategy
Let
We define the 3D-ISM as
The former term refers to the similarity based on spatial shifts and is computed as:
where
where
Therefore, given a feature
where each column is the temporal probability of each feature and each row refers to the temporal probability of each frame. Finally, we can obtain the cost matrix
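Building the cost matrix from the per-frame vote distributions can be sketched as follows. The histogram-intersection similarity used here is our own illustrative choice, not necessarily the similarity defined by the equations above; each frame is summarized by a probability vector over the common voting space.

```python
# Each entry C[i][j] of the cost matrix is the similarity between the vote
# distribution of test frame i and that of reference frame j.

def intersection(p, q):
    """Histogram intersection of two probability vectors."""
    return sum(min(a, b) for a, b in zip(p, q))

def cost_matrix(test_frames, ref_frames):
    return [[intersection(p, q) for q in ref_frames] for p in test_frames]

test = [[0.7, 0.3], [0.2, 0.8]]   # per-frame vote distributions (toy)
ref  = [[0.6, 0.4], [0.1, 0.9]]
C = cost_matrix(test, ref)
```

Note that, unlike a classical DTW distance matrix, larger entries here mean better matches, which is why the alignment in the next section maximizes rather than minimizes the accumulated value.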
Classification
The naive DTW attempts to find an optimal alignment between two temporal sequences
As shown in “Consensus voting strategy” section, 3D-ISM is applied to characterize the space-time structure of features extracted from all training samples. And then we can extract features from test sequence
Even when the three basic constraints (boundary, continuity, and monotonicity) are followed, various warping paths can still be found
where
Contrary to classical DTW, our cost matrix represents the similarity of the two sequences, which means the optimal path is the one with the maximal value
This path optimization problem can be solved by dynamic programming 28
where
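The dynamic-programming recursion can be sketched as follows. Because the matrix stores similarities, the recursion accumulates the maximal (rather than minimal) value along a monotone path; the toy matrix is an assumption for illustration.

```python
# DTW over a similarity matrix: D[i][j] = C[i][j] + max of the three
# admissible predecessors, respecting boundary, continuity, and
# monotonicity constraints.

def dtw_max(C):
    n, m = len(C), len(C[0])
    D = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            prev = [D[i - 1][j]] if i > 0 else []
            prev += [D[i][j - 1]] if j > 0 else []
            prev += [D[i - 1][j - 1]] if i > 0 and j > 0 else []
            D[i][j] = C[i][j] + (max(prev) if prev else 0.0)
    return D[-1][-1]

C = [[0.9, 0.4],
     [0.6, 0.9]]
score = dtw_max(C)   # 2.4, along the path (0,0) -> (1,0) -> (1,1)
```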
As is well known, DTW is a robust distance measure, but it requires massive computation. Lower-bounding methods address this problem by using a cheap lower-bounding function to abandon unpromising matches. 33 Similarly, we propose an upper-bound function named UB_Pro, which is defined as
where
An algorithm that uses UB_Pro to speed up DTW.
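The pruning idea can be sketched as follows. The bound used here (sum of row maxima plus column maxima, which no monotone warping path can exceed, since every cell on such a path is the first visited in its row or in its column) is our own illustrative stand-in rather than the exact UB_Pro; the pruning logic itself mirrors the algorithm above: skip the full DTW whenever the cheap bound cannot beat the best score so far.

```python
# Upper-bound pruning in a nearest-neighbour search over similarity matrices.

def dtw_max(C):
    """Full DTW over a similarity matrix, maximizing the accumulated score."""
    n, m = len(C), len(C[0])
    D = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            prev = [D[i - 1][j]] if i > 0 else []
            prev += [D[i][j - 1]] if j > 0 else []
            prev += [D[i - 1][j - 1]] if i > 0 and j > 0 else []
            D[i][j] = C[i][j] + (max(prev) if prev else 0.0)
    return D[-1][-1]

def cheap_upper_bound(C):
    """Every cell on a monotone path is the first on its row or its column."""
    rows = sum(max(row) for row in C)
    cols = sum(max(row[j] for row in C) for j in range(len(C[0])))
    return rows + cols

def nn_with_pruning(matrices):
    """Index of the best-matching reference; skips DTW when the bound fails."""
    best_idx, best_score = -1, float("-inf")
    for idx, C in enumerate(matrices):
        if cheap_upper_bound(C) <= best_score:
            continue  # cannot beat the current best, skip the expensive DTW
        score = dtw_max(C)
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx
```

The earlier a strong match is found, the more of the remaining candidates the bound can discard, which is where the speed-up comes from.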
Experiments and results
Datasets
The ChaLearn LAP RGB-D IsoGD is a representative gesture dataset released for the Look At People challenge, and it is the first large-scale RGB-D gesture dataset. 9 The dataset contains 47,933 gesture samples labeled into 249 categories, including 35,878 training samples, 5784 validation samples, and 6271 test samples. Each sample contains only one gesture. The RGB and depth videos were recorded with a depth camera by 21 different people. Figure 7 shows some examples from the dataset.

Some samples from the ChaLearn IsoGD. IsoGD: Isolated Gesture Dataset.
Experimental setup
In this article, all models are trained on the training samples, parameters are validated on the validation set, and we report our results on the test set. We extract two local features: the MFSK feature and the SFRF feature. The MFSK features are extracted using the code given in the study by Wan et al., 15 with the default parameter settings. The dimension of the MFSK feature is 1024. We use the salient local features, which are based on the reference frame and from which invalid features have been filtered out, to extract local HOG descriptors from the depth and RGB images respectively, obtaining 256-dimensional feature descriptors.
In the process of 3D-ISM learning, it is unnecessary to use all training samples, so we used one-tenth of the training set to train the random forest and all samples to learn the spatiotemporal probability distribution. The number of trees is set to 100 and the tree depth to 15, as given in the study by Yu et al. 32 To ensure the balance of category voting, the same number of samples from each category is used in the random-forest learning process. To ensure that the spatial information is not disturbed by the movement of the human body, we detect the center of the human face using the code from the study by Zhang et al. 34 and treat it as the reference center to correct the movement bias.
To verify the effectiveness of the proposed approach, we compare it with the winners of the ChaLearn 2017 large-scale gesture recognition challenge, all of whom used deep learning fully or partly. 9 The teams ASU and SXDETVP used deep neural networks to extract features, and a support vector machine was then applied to achieve classification, while the other teams used late-fusion frameworks to integrate multiple deep neural network streams to predict the label. 9 We use the same data and evaluation measure as the above approaches; the results are shown in the following section.
Results
The proposed approach can fuse multiple features, such as the MFSK and SFRF features used in this article. To examine the effectiveness of each feature, we evaluate the features separately and jointly in our experiments. Table 2 shows the results. Although the MFSK feature is more powerful, its individual result is poorer. The reason for this gap may be that SFRF retains information in both the movement and holding phases, which is more suitable for DTW. Overall, the combination of the two features achieves better performance than either individual feature: the results improve by 6.78% on the validation set and 5.91% on the test set.
The performance of different features on the dataset.
MFSK: Mixed Features around Sparse Keypoints; SFRF: Salient Features based on Reference Frame.
We expect the number of training samples to be an important constraint on codebook learning in 3D-ISM. Hence, we quantitatively measure the impact of different numbers of samples

Results on ChaLearn IsoGD for using different numbers of samples to learning the codebook of 3D-ISM. IsoGD: Isolated Gesture Dataset; ISM: implicit shape model.
Table 3 shows that, compared with state-of-the-art approaches, the proposed approach achieves competitive results. However, compared with the best approach, ASU, 9 it is 0.93% lower on the validation set and 2.49% lower on the test set. This is because they use C3D and TSN networks to extract powerful features, which strongly supports their performance, whereas our approach is a traditional machine learning approach that does not require a complex and tedious training process.
Comparison with the state-of-the-art approaches.
TSN: Temporal Segment Network; LSTM: Long Short-Term Memory; ISM: implicit shape model.
Conclusion and future work
In this article, a unified framework for multi-modal feature fusion based on consensus voting is proposed, in which 3D-ISM is used to learn the space-time structure of features. All votes from the local features are then incorporated into a common probability space, which is used to build the distance matrix. The proposed approach achieves comparable results on the large-scale ChaLearn IsoGD. Although there is a slight gap between our final results and those of deep learning-based approaches, our method avoids their tedious and fussy training process, and the flexible framework can merge new samples efficiently without retraining.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
