Abstract
Gestures have long been recognized as an interaction technique that can provide a more natural, creative, and intuitive way to communicate with computers. However, several difficulties remain: the same movement performed at different speeds is often recognized as a different category; backgrounds may be cluttered, occluded, and of low resolution; and features of very different types are hard to fuse. To this end, we propose a novel framework that integrates RGB with motion skeletons at different scales to obtain higher recognition accuracy from multiple features. Specifically, we design a network architecture that combines a three-dimensional convolutional neural network (3D CNN) with post-fusion to better embed the different features, and we combine RGB and motion-skeleton information at different scales to mitigate the speed and background issues. Experiments on several public gesture recognition datasets show strong results, validating the superiority of the proposed gesture recognition method. Finally, we conduct a human-computer interaction experiment to demonstrate its practicality.
Introduction
In natural scenes, gesture acquisition faces background interference from occlusion, lighting variations, and low resolution. To cope with these challenges, most existing methods use skeleton-based action recognition instead of RGB-based action recognition [1-3]. However, skeleton-based gesture recognition discards the information about the interaction between the background and the human body, leading to misjudgment of similar actions. As shown in Fig. 1, the skeleton sequences of stair climbing and running are very similar, while the stairs themselves clearly distinguish the two actions. We therefore address this problem by fusing RGB features with the skeleton sequences, so that both are taken into account.

For a model to generalize, the dataset must contain videos of different people performing the movements, but people vary individually: different people perform the same movement at very different speeds, and a large part of the error rate stems from this. To address this problem, we collect fast and slow global motion features at two temporal scales, sampling frames at two different intervals, to overcome individual variability [4].

Most of the time, the features to be fused are of the same class and share the same data format. Features whose data formats differ greatly are hard to fuse, yet it is precisely this kind of fusion that is the most valuable: the information overlap between such features is very low, so fusing them can yield a larger gain in accuracy. Skeleton sequences, RGB, and graph structures [2] are features of this kind. In this paper, we select the skeleton sequence and RGB for fusion, and we convert the skeleton sequence into a pseudo-heatmap so that it can be fused with RGB; this not only speeds up the computation but also allows the skeleton sequence to be fused seamlessly with RGB. The innovations and contributions of this paper are summarized below.

Fig. 1. Skeleton sequences of stair climbing and running.
(1) A multimodal recognition method is proposed that fuses skeleton sequences at different motion scales with RGB images; it not only mitigates the influence of the environmental background but also preserves RGB features and accounts for temporal differences between subjects. In addition, we reformulate 2D poses into a 3D heatmap volume, which both accelerates computation and resolves the difficulty of fusing dissimilar features.
(2) A post-fusion network structure is designed, and its superiority is verified by comparison with mid-term fusion. The complexity of fusing data from different modalities is thereby avoided.
(3) State-of-the-art performance of the proposed recognition method is verified on public action recognition datasets, and gesture control of a robot is demonstrated on a simulation platform.
The remainder of this paper is organized as follows: Section 2 presents the related work, including 3D convolutional neural networks, the three features (skeletons at different motion scales and RGB), and the feature fusion method used in this paper. Section 3 presents the design of the main network structure, the primary data processing methods, and the post-fusion method. Section 4 presents experiments that validate the proposed recognition method, including comparison experiments, ablation experiments, and a simulated robot experiment. Section 5 presents the conclusion and future work.

Fig. 2. Key points and limbs annotation example.
Related work
In the following, we discuss the neural networks closely related to our work and the features involved.
3D CNNs
A simple way to apply deep convolutional neural networks (CNNs) to video is to apply a CNN to each frame, as in image classification [5-11]. However, this approach fails to capture the motion information between consecutive frames. 3D CNNs [12] capture this motion information well: several consecutive frames are stacked into a cube, and 3D convolution kernels are applied to the cube. In this structure, each feature map in a convolution layer is connected to multiple neighboring frames in the previous layer, thus capturing motion information. 3D CNNs [13] are therefore widely used in motion recognition, and many advanced architectures have been proposed. In this paper, we construct a 3D CNN-based multimodal gesture recognition network that combines RGB with 3D skeletal pseudo-heatmaps at two motion scales into a single framework, providing complementary RGB features to skeleton-based methods to improve recognition accuracy.
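As a concrete illustration of this idea, the minimal PyTorch sketch below shows a single 3D convolution consuming a short stack of frames; the tensor sizes here are illustrative, not the exact configuration used later in the paper.

```python
import torch
import torch.nn as nn

# A clip of 8 consecutive RGB frames: (batch, channels, time, height, width)
clip = torch.randn(1, 3, 8, 56, 56)

# A 3D convolution slides a 3x3x3 kernel over time as well as space,
# so each output activation mixes information from neighbouring frames.
conv3d = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
features = conv3d(clip)
print(features.shape)  # torch.Size([1, 64, 8, 56, 56])
```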
Skeleton-based methods and RGB-based gesture recognition
Skeleton-based gesture recognition is a popular research topic in computer vision and has been widely used in video understanding [14], human-computer interaction [15], robot vision, autonomous driving, aerospace, medicine [16], and other fields. Skeletal data consists of the 3D coordinates of multiple spatiotemporal skeletal joints and effectively represents motion dynamics. With the emergence of better pose estimation algorithms, it can be acquired easily not only from low-cost depth sensors [17-19] but also directly from 2D images by video-based pose estimation algorithms [1, 20-22]. Unlike RGB and optical flow, skeleton data is small, computationally efficient, and highly robust to illumination changes and background disturbances. Nevertheless, when faced with similar actions it can still produce unsatisfactory recognition results, because it lacks information about the objects interacting with the human body. So even though skeletal data carries rich and vital information, it remains complementary to RGB.

With the rapid development of 3D skeletal data acquisition, skeleton-based gesture recognition [23-26] is flourishing, and various advanced and effective recognition methods [27-30] have been proposed. For methods based on Recurrent Neural Networks (RNNs) [31], the skeleton sequence is a natural time series of joint coordinates that can be regarded as a sequence of vectors, and the RNN's structure makes it well suited to such time series; Long Short-Term Memory (LSTM) networks are likewise suitable. Despite the good results of RNN-based methods, they cannot effectively learn spatial relationships between skeletal joints. To exploit spatial information explicitly, many researchers reformulate 2D poses into a 3D heatmap volume; such representations give CNN-based methods a natural ability to learn spatial information from the skeletal joints. The very successful GCN-based approach [32] represents the skeleton as a spatial graph, with joints as vertices and both the natural connections of the human body (arms, legs) and the temporal connections of the same joint as edges. It combines spatial graph convolution with interleaved temporal convolution for spatiotemporal modeling, mining discriminative information in both the spatial and temporal domains. However, this skeleton graph also greatly restricts fusion with features of other modalities and has limited scalability. For this reason, we choose a CNN-based approach that fuses RGB with a 3D heatmap volume derived from 2D skeleton data, which fuses complementary modalities more smoothly and improves the final accuracy.
Different scale motion
Because of human variability, different people perform the same action differently, and even the same person performs the same action differently in different situations; the postural form and timing differ [33]. In a running race, for example, fast and slow runners differ in stride frequency and stride length, and the same person waves at a different tempo when happy than when upset. We therefore consider fast and slow motion scales simultaneously and fuse features from the different motion scales into multi-scale motion features, improving the robustness of the model and the final accuracy.
Methodology
In the following, we present the overall approach, including the preliminary data processing steps, the network framework for the feature inputs, and the post-fusion method.
Global network framework
Take the experiment on the JHMDB dataset as an example. As shown in Fig. 3, the network consists of three branches, which receive RGB, slow motion, and fast motion features, respectively. The second layer, like the first, contains a 3D convolutional layer and a downsampling layer; it halves the spatial size of the feature map and doubles the number of channels (to twice the base channels). The third and fourth layers have the same structure, each containing two 3D convolutional layers and a downsampling layer, but the second convolutional layer in each does not increase the number of channels. Finally, a pooling layer and a fully connected layer produce a 1×21 array giving the confidence of each gesture category, and the recognition result is the category with the highest confidence. We place the multimodal fusion at the end, in the prediction layer: the predictions obtained from the three features are each multiplied by their corresponding weights and then summed to obtain the final fused result, which is also a 1×21 array. The weights are set according to the accuracy of each feature trained separately in the ablation experiments.

Fig. 3. Concept of the gesture recognition framework based on 3D CNN multimodal fusion: feature information from all modalities is extracted and fused for prediction in the prediction layer. "layer3: 2*3Dconv3 and AvgPool3D" indicates that the third layer of the network consists of two 3D convolutions and an average-pooling downsampling; the other layers are labeled in the same format. base channels = 64, and AAP denotes adaptive average pooling.
We use a pose estimation algorithm to extract the skeleton instead of taking it directly from a public dataset. This facilitates training on our own data, eliminates tedious labeling work, and allows the skeletal information of non-key people appearing in the video to be included, making the dataset more realistic and natural and improving the robustness of the results. Faster R-CNN [34] is used to localize the people appearing in the video for subsequent keypoint detection; the Faster R-CNN detection system achieves a frame rate of about 5 frames per second on a GPU with the very deep VGG-16 model (including all steps). Unlike most networks that go from high resolution to low resolution, HR-Net [35] maintains a high-resolution representation throughout, so the predicted keypoint heatmaps are more accurate and spatially precise.
For dataset annotation we adopt the COCO [36] keypoint annotation format. As shown in Fig. 2, this symmetric labeling format facilitates later test-time augmentation with horizontally flipped feature maps.
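For reference, the 17 COCO keypoints and their left/right pairs are listed below; the symmetric pairs are what make the horizontal-flip augmentation straightforward. The pair list is our own illustration of the format, not code from the paper.

```python
# The 17 keypoints of the COCO annotation format.
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

# Symmetric left/right pairs; swapping them (plus mirroring x-coordinates)
# yields the flipped pose used for flip augmentation.
FLIP_PAIRS = [(1, 2), (3, 4), (5, 6), (7, 8), (9, 10), (11, 12), (13, 14), (15, 16)]
```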
Generating skeletal feature maps
The popular features in motion recognition are RGB images and skeletal images. Although RGB images have become less popular with the emergence of skeletal features, they undeniably still carry irreplaceable information; in a sense, RGB and skeletal images complement each other. The skeleton image is composed of key points predicted from RGB images by a pose estimation algorithm, and keeping only the skeleton while removing the background eliminates the influence of background, lighting, and other factors on the RGB image. However, when the background is removed, the information linking the background and the acting subject is removed as well. To retain this information and improve recognition accuracy, we keep both kinds of features and fuse them.
Previous approaches simply split RGB images and skeletal images into two feature inputs to form a two-branch network, without considering the robustness of the motion features: different people move at different rhythms, so both fast and slow rhythms should be considered.
RGB image data consists of per-pixel R, G, and B values. The data format of a video is T×C×H×W, referring to the number of frames, the three RGB channels, and the height and width of the image, respectively. The skeleton information consists of the 2D or 3D coordinates of each key point, with data format T×N×D, referring to the number of frames, the number of key points, and the coordinate dimension. The two data formats are completely different, so for better integration we convert the extracted 2D pose into image form [37, 38]. Each of the N key points is mapped, via a Gaussian centered at the key point, onto an image matrix of height H and width W, yielding data of shape N×H×W; stacking along the time dimension then gives data of format T×N×H×W, analogous to an RGB video. The specific flow is shown in Fig. 4.

Fig. 4. Skeletal feature generation framework. The input action video stream is processed with Faster R-CNN (body detection) + HR-Net (pose estimation) to extract the 2D human pose. The key point coordinates are saved and converted to a pseudo-heatmap using Gaussian mapping. Finally, the data is stacked along the spatial and temporal dimensions to obtain the four-dimensional input.
There may be more than one person in the same image, i.e., more than one key point of the same kind. In this case, to keep the data format uniform, key points of the same kind are mapped onto the same image, which both preserves the uniform format and allows the motion of multiple people to be recognized.
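A minimal NumPy sketch of the Gaussian mapping just described, with multiple people accumulated into the same map by an element-wise maximum; the heatmap size and σ are placeholder values, not the exact settings used in our experiments.

```python
import numpy as np

def keypoints_to_heatmaps(people, H=56, W=56, sigma=2.0):
    """people: list of (N, 3) arrays of (x, y, confidence), one per person.
    Returns an (N, H, W) pseudo-heatmap volume for one frame."""
    n_kpts = people[0].shape[0]
    heatmaps = np.zeros((n_kpts, H, W), dtype=np.float32)
    ys, xs = np.mgrid[0:H, 0:W]
    for person in people:
        for k, (x, y, conf) in enumerate(person):
            if conf <= 0:
                continue  # skip undetected key points
            g = conf * np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
            # Key points of the same kind from different people share one map.
            heatmaps[k] = np.maximum(heatmaps[k], g)
    return heatmaps

# Stacking the per-frame heatmaps over T frames gives the T x N x H x W input.
```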
To obtain the two motion rhythms of the skeleton images, frame-skipping sampling was used in the past, i.e., sampling one frame at a fixed interval in the video. This approach is too crude, so we adopt random frame sampling for better results: the video is divided into T segments according to the selected number of frames T (e.g., 32 frames in this experiment), one frame is selected at random within each segment, and the selected frames are reassembled into a T-frame clip. The fast-paced sampling is similar, except that the number of selected frames is reduced by a factor of K: the video is divided into T/K segments, so the final number of frames is T/K.
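A possible implementation of this uniform-segment random sampling is sketched below; the function and variable names are ours, and it assumes the video has at least T frames.

```python
import random

def sample_frame_indices(num_video_frames, T):
    """Split the video into T equal segments and pick one random frame per segment."""
    indices = []
    for seg in range(T):
        start = seg * num_video_frames // T
        end = max(start + 1, (seg + 1) * num_video_frames // T)
        indices.append(random.randrange(start, end))
    return indices

slow_idx = sample_frame_indices(300, T=32)       # slow-motion branch
fast_idx = sample_frame_indices(300, T=32 // 2)  # fast-motion branch, K = 2
```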
The three input features fall into two categories, RGB and skeleton, which differ enough that they are well suited to being fused as complementary features to improve recognition accuracy. Previous work used pre-fusion or mid-term fusion, stacking multiple tensors into one tensor along a dimension; forcibly stacking features with low similarity and non-uniform data formats corrupts them and loses feature information, and is less effective than post-fusion.
We design four network layers and perform post-fusion in the prediction layer [39]. The first layer is a stem layer (conv3D (3×3×3), base channels); the second layer is max pool (1×2×2) → stage1 (conv3D (3×3×3), 2*base channels); the third layer is max pool (1×2×2) → stage2 (conv3D (3×3×3), 4*base channels)×2; and the fourth layer is max pool (1×2×2) → stage3 (conv3D (3×3×3), 8*base channels)×2. The first layer is the only place where the three branches differ, because the channel dimension of the three input features differs. Finally, each branch produces a one-dimensional class-score tensor after pooling and the fully connected layer, and the three tensors are passed to the prediction layer for fusion.
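Under our reading of this layer specification, one branch can be sketched roughly as follows. This is a simplified outline, not the exact training code: in_ch would be 3 for the RGB branch and the number of key points for a pseudo-heatmap branch, and the normalization and activation choices are our assumptions.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm3d(out_ch),
        nn.ReLU(inplace=True),
    )

class Branch(nn.Module):
    """One of the three 3D CNN branches: stem + three stages + pooling + FC."""
    def __init__(self, in_ch, num_classes=21, base=64):
        super().__init__()
        self.stem = conv_block(in_ch, base)                          # layer 1
        self.stage1 = nn.Sequential(nn.MaxPool3d((1, 2, 2)),
                                    conv_block(base, 2 * base))      # layer 2
        self.stage2 = nn.Sequential(nn.MaxPool3d((1, 2, 2)),
                                    conv_block(2 * base, 4 * base),
                                    conv_block(4 * base, 4 * base))  # layer 3
        self.stage3 = nn.Sequential(nn.MaxPool3d((1, 2, 2)),
                                    conv_block(4 * base, 8 * base),
                                    conv_block(8 * base, 8 * base))  # layer 4
        self.head = nn.Sequential(nn.AdaptiveAvgPool3d(1), nn.Flatten(),
                                  nn.Linear(8 * base, num_classes))

    def forward(self, x):  # x: (batch, in_ch, T, H, W)
        x = self.stage3(self.stage2(self.stage1(self.stem(x))))
        return self.head(x)  # (batch, num_classes) class scores
```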
The post-fusion part uses a custom weighted fusion. The fused prediction is y = α1·y_RGB + α2·y_slow + α3·y_fast (Equation (2)), where the weights α1, α2, α3 are set according to the accuracy of each feature when trained separately (see the ablation experiments).
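The fusion step itself then reduces to a few lines, sketched below; the weight values are placeholders and in practice follow the single-branch accuracies from the ablation study.

```python
import torch

# Placeholder weights for the (RGB, slow skeleton, fast skeleton) branch scores.
ALPHAS = (0.2, 0.5, 0.3)

def late_fuse(y_rgb, y_slow, y_fast, weights=ALPHAS):
    """Each y_* is a (batch, num_classes) score tensor from one branch."""
    fused = weights[0] * y_rgb + weights[1] * y_slow + weights[2] * y_fast
    return fused, fused.argmax(dim=1)  # fused scores and predicted class index
```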
Finally, we use the cross-entropy loss function [40]. Cross-entropy measures the discrepancy between the predicted distribution q and the true data distribution p: the closer q comes to p during training, the smaller the loss. Moreover, when used with a sigmoid (or softmax) output in gradient descent, the cross-entropy loss avoids the slow learning caused by vanishing gradients that affects the mean-squared-error loss. For these reasons, most neural network models for classification use cross-entropy as the loss function.
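For completeness, the multi-class form of this loss can be written as

$$\mathcal{L}_{\mathrm{CE}} = -\sum_{i=1}^{C} p_i \log q_i ,$$

where C is the number of gesture classes, p is the one-hot ground-truth distribution, and q is the predicted (softmax) distribution.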
Experiments
This section presents the experimental validation. First, we report the results and validation figures for the 3D CNN-based multimodal fusion gesture recognition network. Second, we describe the results and details of the gesture-driven robot simulation on the Unity platform.
Method validation of 3D CNN-based multimodal fusion gesture recognition
Dataset description
Our models are trained and tested on the JHMDB [41], HMDB51 [42], and UCF101 [43] datasets; the experimental setup is shown in Table 1. The JHMDB dataset contains 928 videos in 21 categories, of which 664 are used for training and 264 for testing. For HMDB51, 5100 videos in 51 categories are used, of which 3570 are for training and 1530 for testing. The UCF101 dataset contains 13320 videos in 101 categories, of which 9537 are for training and 3783 for testing.
Table 1. Setup of the experimental datasets
The HMDB51 dataset is an extensive collection of real videos from different sources, including movies and web videos. It contains 51 action categories with 6766 video clips; for each category, 70 clips are used for training and 30 for testing. The 51 classes fall into five major groups: general facial actions (e.g., smile), facial actions with object manipulation, general body movements, body movements with object interaction, and body movements with human interaction.
The JHMDB dataset is a secondary, joint-annotated version of HMDB (joint-annotated HMDB). It is annotated frame by frame and covers 21 categories involving only a single person: sit, run, pull up, walk, shoot gun, brush hair, jump, pour, pick, kick ball, golf, shoot bow, catch, clap, swing baseball, climb stairs, throw, wave, shoot ball, push, and stand. Each category has 35-55 samples; each sample includes the start and end time of the behavior and contains 14-40 frames. There is at most one target behavior per video, and the bounding box marks only the person performing the target behavior.
UCF101 is an action recognition dataset of realistic action videos collected from YouTube, providing 13320 videos from 101 action categories. The clips vary in length (from under a second to a dozen seconds), are 320×240 in size, have a variable frame rate (typically 25 or 29 fps), and each contains only one category of human behavior. Each category is divided into 25 groups of 4 to 7 short videos. Examples include boxing punching bag, boxing speed bag, head massage, and playing guitar.
The JHMDB dataset provides RGB video, 2D key points (2D coordinates and confidences), and 3D key points (3D coordinates and confidences). The provided key points cover only the target actor and exclude other people in the video; while this makes recognition easier, it does not reflect realistic conditions well. We therefore use only the video files from the dataset, and the skeleton information input to the model is computed by the pose estimation algorithm. The computed results contain the key points of everyone in the video, which improves generality and resistance to interference and better simulates natural scenarios. Finally, the JHMDB dataset is split into training and test sets at a ratio of 2.5:1.
The experiments were conducted on an RTX A4000 GPU; the models were trained and tested with the PyTorch framework, and all environments were set up on Ubuntu. For training on JHMDB, the number of epochs was set to 24, the batch size to 32, and the initial learning rate to 0.4, and the learning rate was adjusted with CosineAnnealingLR with a lower limit of 0. Because the other two datasets are much larger, transfer learning was used: the model was initialized from a Kinetics-400 pre-trained model, and the learning rate was adjusted with a step schedule (StepLR), set to 0.01 initially, 0.001 after the 10th epoch, and 0.0001 after the 11th epoch. The parameter K characterizing the different motion scales is set to 2. The number of input frames is 32 or 48, so the slow-motion branch receives 32 or 48 frames, the RGB branch 8 frames, and the fast-motion branch 16 frames. Images are scaled to 56×56. The specific input sizes and network details are shown in Table 2.
Table 2. Network composition of the training experiment (T is the temporal dimension, i.e., the number of frames)
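The two learning-rate schedules named above correspond to standard PyTorch utilities; a condensed sketch of both configurations is given below. The SGD optimizer and momentum value are our assumptions (the optimizer is not stated above), and the stand-in model is only there to make the snippet runnable.

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR, MultiStepLR

model = torch.nn.Linear(10, 21)  # stand-in for the full fused network

# JHMDB, trained from scratch: cosine annealing from 0.4 down to 0 over 24 epochs.
opt_jhmdb = torch.optim.SGD(model.parameters(), lr=0.4, momentum=0.9)
sched_jhmdb = CosineAnnealingLR(opt_jhmdb, T_max=24, eta_min=0.0)

# HMDB51 / UCF101, fine-tuned from a Kinetics-400 model: step decay reproducing
# the milestones described above (0.01 -> 0.001 after epoch 10 -> 0.0001 after epoch 11).
opt_large = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
sched_large = MultiStepLR(opt_large, milestones=[10, 11], gamma=0.1)

for epoch in range(24):
    # ... run one training epoch ...
    sched_jhmdb.step()
```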
Table 3. Results on JHMDB (using 2D skeletons from HR-Net)
Table 4. Results on HMDB51 (using 2D skeletons from HR-Net)
Table 5. Results on UCF101 (using 2D skeletons from HR-Net)

Fig. 5. Confusion matrix obtained on the JHMDB dataset (21 action classes).

Fig. 6. Confusion matrix obtained on the HMDB51 dataset (51 action classes).

Fig. 7. Confusion matrix obtained on the UCF101 dataset (101 action classes).
Comparison experiments were conducted on JHMDB, HMDB51, and UCF101, and the results were compared with other advanced methods, as shown in Tables 3 to 5. It is worth noting that not all training used a pre-trained model. Since HMDB51 and UCF101 are relatively large, we used transfer learning with a model pre-trained on the Kinetics dataset to shorten training time and improve accuracy, and to address slow convergence we used the ResNet-50 backbone to increase the depth and capacity of the network. We achieved accuracies of 0.812, 0.724, and 0.887 on the JHMDB, HMDB51, and UCF101 datasets, respectively. More details are given in the confusion matrices for the three datasets, shown in Figs. 5-7; the matrices confirm the robustness of the designed network structure. Overall, the 3D CNN-based multimodal fusion gesture recognition method achieves good results on the JHMDB, HMDB51, and UCF101 datasets.
Ablation experiments
In the ablation experiments, we examine recognition performance when only some of the input features are used, keeping all other settings constant. In addition, we examine how performance varies with the post-fusion weights by adjusting them.
The ablation results show that, under the same input conditions, RGB-based recognition has the lowest accuracy. Skeleton-based recognition is considerably more accurate, since it avoids some drawbacks of RGB features such as background interference. The result of combining RGB and skeleton is in turn higher than that of the skeleton alone, indicating that the skeleton features lose some interaction information that is only available in the RGB images. In addition, the table shows that mid-term fusion performs slightly worse than the weighted post-fusion we use, confirming that forcing data of different formats together loses information and demonstrating, as argued earlier, the superiority of our network structure.
Robot simulation experiment
Implementation design
The simulation experiments are conducted with the Unity3D engine, a mature game engine widely used for game development, with advanced rendering support for both 2D and 3D content. The expected behavior is that the robot imported into Unity performs the corresponding action according to the recognition result of the network model, as shown in Fig. 8. To make the robot's movements more fluid, we abandoned hard-coding the corresponding movements and instead store the movements as videos and have the robot imitate them, i.e., the robot mimics the movements of the person in the video, which makes its motion more natural and fluid. For this purpose, four example action videos are provided as references, and the robot is instructed to perform the specified action according to the recognition result.

Fig. 8. Robot simulation experiment framework.
Simulation performance index
Using mediapipe’s human pose estimation method [50] to capture the movements of the person in the reference video (i.e., saving the coordinates of key points that change over time). By using socket and UPD communication and data transfer in localhost, the motion capture data is transferred in real-time to achieve the effect of imitating the motion in real-time. Connecting the armature in the robot model with the 33 key points obtained from the attitude estimation. The specific mapping is shown in Fig. 9.

Fig. 9. Mapping of human body joint points to robot joints.
The midpoint of key points 7 and 8 is mapped to the robot's head, and the midpoint of key points 11 and 12 to the robot's body; vector (12, 14) maps to the robot's left upper arm, vector (14, 22) to the left lower arm, vector (11, 13) to the right upper arm, vector (13, 15) to the right lower arm, vector (24, 26) to the left thigh, vector (26, 28) to the left lower leg, vector (23, 25) to the right thigh, vector (25, 27) to the right lower leg, vector (27, 31) to the left foot, and vector (28, 32) to the right foot.
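The capture-and-transfer step described above can be realized in a few lines of Python; the port number and packet format below are illustrative choices, not necessarily those used in our experiments.

```python
import json
import socket

import cv2
import mediapipe as mp

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)  # UDP on localhost
UNITY_ADDR = ("127.0.0.1", 5065)                         # illustrative port

cap = cv2.VideoCapture("reference_action.mp4")
with mp.solutions.pose.Pose() as pose:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if result.pose_landmarks:
            # 33 landmarks, each with normalized x, y, z coordinates.
            pts = [[lm.x, lm.y, lm.z] for lm in result.pose_landmarks.landmark]
            sock.sendto(json.dumps(pts).encode(), UNITY_ADDR)
cap.release()
```

On the Unity side, a listener on the same port parses each packet and drives the mapped armature joints frame by frame.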
Some experimental results are visualized to further demonstrate the robot simulation, as shown in Fig. 10. According to the recognition results, the robot performed four actions: clap, jump, walk, and wave.

Fig. 10. Robot simulation experiments on the Unity3D platform, comparing the real and simulated videos for four movements: clap, jump, walk, and wave.
To better demonstrate the performance of the robot simulation, the experimental performance indicators are summarized in Table 6.
Conclusion
We convert the two motion-scale skeleton sequences into pseudo-heatmaps before combining them with the RGB images. A three-branch post-fusion 3D CNN network model was designed, together with a three-branch mid-term fusion model for experimental comparison. The results demonstrate improved accuracy on the experimental datasets, even though only plain RGB videos were used as input. The superiority of the proposed method is visible in the training results on JHMDB, HMDB51, and UCF101, and its application value is demonstrated by the simulated robot experiments on the Unity3D platform.
This design contains only two types of features, RGB and skeleton. Other features, such as SIFT, which is robust to changes in viewpoint position, could be added to improve accuracy. Some advanced networks such as I3D could also be used to process the feature information, and combining the method with target segmentation algorithms is another promising direction toward a complete recognition pipeline for robots. During the experiments we also noticed a small detail: the maximum execution time differs greatly between action classes, yet we train with a uniform number of frames, which inevitably loses information. Multiple execution durations will therefore be considered in future work.
Acknowledgment
This work was supported by the National Key Research and Development Program of China under Grant No. 2018YFB1304600; the National Natural Science Foundation of China under Grants 51775541 and 62006204; the CAS Interdisciplinary Innovation Team under Grant No. JCTD-2018-11; and in part by the Shenzhen Science and Technology Program under Grant RCBS20210609104516043.
