Hand Gesture and Character Recognition Based on Kinect Sensor

Abstract

The purpose of this research was to determine whether a Kinect sensor can recognize numeric and alphabetic characters written in the air with the hand. The Kinect sensor can capture motion without any device being attached to the user's body. The input screen has two modes, one for numerals and one for the alphabet. To measure the recognition rate, users wrote the numerals from zero to nine and the letters from A to Z twice. Alphabet recognition relied on Palm's Graffiti. The input numerals and letters were recognized by dynamic programming (DP) matching based on interstroke information. In addition, the system can perform arithmetic operations such as +, −, ×, and /. Most people are not used to writing in the air and are unfamiliar with the Kinect sensor, so the user first needs some time to become accustomed to both. Average recognition rates of 95.0% and 98.9% were obtained for numerals and alphabetic characters, respectively.

1. Introduction

Hand gesture recognition is an important research issue in the field of human-computer interaction because of its extensive applications in virtual reality, sign language recognition, and computer games. Despite much previous work, building a robust hand gesture recognition system that is applicable in real-life applications remains a challenging problem. Existing vision-based approaches are greatly limited by the quality of the input image from optical cameras [1, 2]. Consequently, these systems have not been able to provide satisfactory results for hand gesture recognition. Hand gesture recognition faces two challenging problems: hand detection and gesture recognition. Implementing hand gestures also involves significant usability challenges, including fast response time, high recognition accuracy, speed of learning, and user satisfaction, which helps to explain why few vision-based gesture systems have matured beyond prototypes or reached the commercial market for human-computer devices [3, 4]. Related human-computer interaction applications include image categorization, facial emotion recognition, and smartphone use [5–7].

As regards input systems for handwriting in the air, some researchers have suggested the use of a wearable video camera to recognize characters written in the air [8, 9]. Their system provides letter input by capturing the movement of the operator's hand with a video camera and analyzing the images on a computer. However, it assumes that the hand-mounted video camera operates continuously. The use of multiple cameras has also been proposed in order to cut out the silhouette of a person and perform fingertip detection, trajectory detection, and character recognition [10]. By using multiple cameras, this system achieves a high recognition rate (96.6%) and recognizes handwriting in the air from all directions.

Our system, which uses the Kinect sensor made by Microsoft (Figure 1) [11], requires no equipment on the hand or body. The Kinect sensor provides a natural dialogue between users and electronic devices. Kinect, connected to a PC, provides image data, voice data, and depth data [12]. This research mainly used image data and depth data to detect and estimate the joint positions of the user's body. With these data, Kinect can detect the hand and recognize gestures relatively easily. In addition, Kinect can capture motion without the sensor device being attached to the user's body. When we write a letter on paper, we usually use a pen and remove any mistakes with an eraser; with Kinect, we need neither. Brushstrokes in the air are detected as the coordinates of the hand on the $X Y$ plane. The hand is recognized by Kinect, which can detect the coordinates of the main parts of the human body. Character recognition treats a brushstroke as a sequence of two-dimensional coordinates on the $X Y$ plane; numerals are written in succession with one stroke per character and are compared with reference data on the basis of interstroke information. Alphabet recognition uses Graffiti, the normal alphanumeric gestures from Palm. The input numerals and letters are recognized by DP matching and interstroke information. Finally, the system can perform arithmetic operations such as +, −, ×, and / [13].

Figure 1

Kinect sensor from Microsoft.

2. Hand Gesture and Character Recognition with Kinect Sensor

2.1. Hand Gesture Recognition

We use Kinect for Windows SDK as the development environment for the Kinect sensor. The SDK makes it possible to detect and estimate the joint positions of the user's body from depth data. As shown in Figure 2, the Kinect sensor can track 20 joints. From these data, Kinect can easily detect the hand and recognize gestures. As shown in Table 1, this research used the gestures “Click” and “Wave.” “Click” is used to start tracking the hand and “Wave” is used to stop tracking.

Table 1

Kinect can recognize four gesture types.

Gesture	Meaning
Click	Before and after the operation of the arm
Wave	Behavior of the left and right arm
Raising hand	Behavior to raise the arm
Moving hand	Movement of the arm

Figure 2

Recognition of the human skeleton (from Microsoft Kinect for Windows SDK).

2.2. Hand Writing Detection by Kinect Sensor

We selected the method of writing numerals in the air with the hand. The hand is recognized using the function described in Section 2.1, and a character is written in the air by tracking the movement of the hand and recording its coordinates on the $X Y$ plane. The difficulty lies in separating the time spent writing a character from the preparation period in which the hand moves to the start of the next character, that is, in switching the pen of a one-stroke character “ON” and “OFF.” In this identification method, one tracking case is treated as pen “ON” and three tracking cases as pen “OFF”:

(i) “Click” starts tracking, and the pen is “ON.”

(ii) When the hand is placed on a designated point, the pen is switched “OFF” according to the distance from Kinect to that point.

(iii) When the hand is outside the screen, the pen is “OFF.”

(iv) “Wave” ends tracking, and the pen is “OFF.”
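The four cases above can be read as a small state update applied to each tracked hand sample. The following Python sketch is only an illustration under stated assumptions: the `HandSample` fields, the `update_pen` function, and the 640 × 480 screen size are hypothetical names and values, and the 500–1000 mm depth range is borrowed from Section 2.3. It is not the authors' implementation, which is built on Kinect for Windows SDK.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Depth range (mm) in which the pen is switched off; taken from Section 2.3.
PEN_OFF_DEPTH_MM: Tuple[float, float] = (500.0, 1000.0)


@dataclass
class HandSample:
    """One tracked hand position (hypothetical structure for illustration)."""
    x: float                       # screen X coordinate of the hand
    y: float                       # screen Y coordinate of the hand
    depth_mm: float                # distance from the sensor to the hand
    over_point: bool = False       # hand is on the designated on-screen point
    gesture: Optional[str] = None  # "click", "wave", or None


def update_pen(pen_on: bool, sample: HandSample,
               screen: Tuple[int, int] = (640, 480)) -> bool:
    """Return the new pen state after one sample, following cases (i)-(iv)."""
    width, height = screen
    if sample.gesture == "wave":                      # case (iv): end tracking
        return False
    inside = 0 <= sample.x < width and 0 <= sample.y < height
    if not inside:                                    # case (iii): off screen
        return False
    low, high = PEN_OFF_DEPTH_MM
    if sample.over_point and low <= sample.depth_mm <= high:  # case (ii)
        return False
    if sample.gesture == "click":                     # case (i): start tracking
        return True
    return pen_on                                     # otherwise keep the state


if __name__ == "__main__":
    pen = False
    pen = update_pen(pen, HandSample(100, 100, 1500, gesture="click"))
    print(pen)  # pen is on after "Click"
    pen = update_pen(pen, HandSample(-5, 100, 1500))
    print(pen)  # pen is off once the hand leaves the screen
```

Only samples recorded while the pen is on would be appended to the stroke that is later passed to the recognizer.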

2.3. Numeral Recognition and Operation

If the left hand is on the blue point and between 500 and 1000 mm away from Kinect, the pen is switched off and, at the same time, DP matching is performed on the data of the written numeral. The system can identify any numeral from zero to nine. The yellow point on the left is used for the arithmetic operators +, −, ×, and ÷, as shown in Figures 3(a) and 3(b). A wave of the hand deletes the previous numeral, and the operation is performed by using the second input numeral.
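Once two numerals and an operator have been recognized, the operation itself is a simple lookup. The sketch below shows one way this step could be written in Python; the function name `apply_operation` and the choice to return `None` for division by zero are assumptions made for illustration, since the paper does not describe how that case is handled.

```python
import operator
from typing import Optional

# Map each on-screen operator symbol to the corresponding Python function.
OPERATIONS = {
    "+": operator.add,
    "−": operator.sub,
    "×": operator.mul,
    "÷": operator.truediv,
}


def apply_operation(first: int, symbol: str, second: int) -> Optional[float]:
    """Apply the selected operator to two recognized numerals (0-9)."""
    if symbol == "÷" and second == 0:
        return None  # assumed behavior: no result for division by zero
    return OPERATIONS[symbol](first, second)


if __name__ == "__main__":
    print(apply_operation(6, "×", 7))
    print(apply_operation(8, "÷", 2))
```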

Figure 3

(a) Writing the numeral by hand. (b) Simulation results of operation by Kinect sensor.

2.4. Alphabet Recognition

The alphabet can be recognized by using Graffiti. Graffiti is essentially a single-stroke shorthand handwriting recognition system used in PDA (personal digital assistant) and based on Palm OS. Graffiti was originally written by Palm Inc. as a recognition system for GEOS-based devices such as HP OmniGo 120 and Magic Cap-line. The software is based primarily on uppercase characters that can be drawn blindly with a stylus on a touch-sensitive panel. Since the user typically cannot see the character as it is being drawn, complexities have been removed from four of the most difficult letters: “A,” “F,” “K,” and “T” can be drawn without any need to match a cross-stroke [14, 15].

2.5. DP Matching and Interstroke Information

The following features are used to recognize characters:

(i) DP (dynamic programming) distance,

(ii) interstroke information.

This system starts the DP matching when the distance from Kinect to the blue point in Figures 3(a) and 4(a) becomes constant. DP matching is based on the degree of similarity between the elements of the pattern. DP matching finds the ordered correspondence between time series of two pattern elements with the aim of minimizing the distance. It is a matching method that takes into account the expansion and contraction of the pattern [16].

Figure 4

(a) Writing the alphabet by hand. (b) Graffiti of normal alphanumeric gestures. (c) Input results of alphabet using Kinect sensor.

The input and reference pattern are represented by the time series of features as follows:

$$\begin{matrix} A = a_{1} a_{2} \dots a_{i} \dots a_{I}, \\ a_{i} = i\text{th input pattern}, \\ B = b_{1} b_{2} \dots b_{j} \dots b_{J}, \\ b_{j} = j\text{th reference pattern}. \end{matrix} \tag{1}$$

DP matching calculates the distance from the reference pattern B through the following steps:

(1) Initial condition:

$$g(0, 0) = d(0, 0). \tag{2}$$

(2) DP recursive expression:

$$g(i, j) = \min \begin{bmatrix} g(i - 1, j) \\ g(i - 1, j - 1) \\ g(i - 1, j - 2) \end{bmatrix} + d(i, j), \tag{3}$$

$$d(i, j) = w \times | a_{i} - b_{j} | + (1 - w) \sum_{k = 0}^{n} | \mathrm{count}_{1}[k] - \mathrm{count}_{2}[k] |, \tag{4}$$

where $d(i, j)$ is the local distance between $a_{i}$ and $b_{j}$, $\mathrm{count}_{1}$ is the interstroke information of the input, $\mathrm{count}_{2}$ is the interstroke information of the reference, and $w$ is a weight; $g(i, j)$ is computed sequentially from the initial point.

(3) The pattern distance is calculated as follows:

$$D(A, B) = g(I, J). \tag{5}$$

This research uses DP matching for both the $X Y$ plane and the interstroke information [17, 18].
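Equations (2) to (5) can be sketched in Python as follows. This is a minimal illustration under several assumptions that the paper does not state: pattern elements are taken to be $X Y$ points, $| a_{i} - b_{j} |$ is taken to be the Euclidean distance, indices are zero-based so that the pattern distance is read from the last cell, and the default weight of 0.5 and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def local_distance(a: Point, b: Point, count_in: Sequence[int],
                   count_ref: Sequence[int], w: float) -> float:
    """Equation (4): weighted point distance plus interstroke difference."""
    point_term = math.dist(a, b)  # assumed: Euclidean distance on the XY plane
    stroke_term = sum(abs(c1 - c2) for c1, c2 in zip(count_in, count_ref))
    return w * point_term + (1.0 - w) * stroke_term


def dp_distance(A: List[Point], B: List[Point], count_in: Sequence[int],
                count_ref: Sequence[int], w: float = 0.5) -> float:
    """Equations (2), (3), and (5) with zero-based indices."""
    I, J = len(A), len(B)
    g = [[math.inf] * J for _ in range(I)]
    g[0][0] = local_distance(A[0], B[0], count_in, count_ref, w)  # Eq. (2)
    for i in range(1, I):
        for j in range(J):
            # Eq. (3): predecessors (i-1, j), (i-1, j-1), (i-1, j-2).
            best = g[i - 1][j]
            if j >= 1:
                best = min(best, g[i - 1][j - 1])
            if j >= 2:
                best = min(best, g[i - 1][j - 2])
            if best < math.inf:
                g[i][j] = best + local_distance(A[i], B[j],
                                                count_in, count_ref, w)
    return g[I - 1][J - 1]  # Eq. (5); inf if no admissible path exists


def classify(A: List[Point], count_in: Sequence[int],
             references: Dict[str, Tuple[List[Point], Sequence[int]]],
             w: float = 0.5) -> str:
    """Return the label of the reference pattern with the smallest distance."""
    return min(references,
               key=lambda label: dp_distance(A, references[label][0],
                                             count_in, references[label][1], w))
```

Because every step of the recursion advances the input index by one and the reference index by at most two, an input that is much shorter than the reference has no admissible path and receives an infinite distance in this sketch; resampling strokes to comparable lengths before matching would avoid this.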

Interinformation between feature points, such as shape context [18], is introduced. In addition to using character intrastroke information such as the $X Y$ position and the direction of each point, information regarding the relative positions between strokes can provide the topological properties of a character such that character recognition is more effective. In particular, the length ratios and crossing relation between strokes are important features in discriminating several particular characters and are commonly referred to as interstroke information.

In this study, we use a different kind of interstroke information, developed on the basis of [17] and illustrated in Figure 5. The interstroke information is calculated as the hit count of eight direction lines drawn from the start point: the crossing points of each direction line with the stroke, shown in (2) of Figure 5, are counted to give the counts shown in (3). Using these counts, the distance is calculated by comparison with the interstroke information of the reference pattern. Further, we increase the recognition rate by also using the starting coordinates in the matching.

Figure 5

Calculation of interstroke information.
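One possible reading of the hit count in Figure 5 is sketched below in Python: eight rays are cast from the start point of the stroke at 45° intervals, and for each ray the number of stroke segments it crosses is counted. The ordering of the directions, the treatment of segments that touch the start point or run parallel to a ray (both ignored here), and all function names are assumptions made for illustration and may differ from the authors' method.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]

# Eight direction lines at 45-degree steps (assumed ordering).
DIRECTIONS: List[Tuple[int, int]] = [
    (1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1),
]


def _cross(ax: float, ay: float, bx: float, by: float) -> float:
    """Two-dimensional cross product."""
    return ax * by - ay * bx


def ray_hits_segment(origin: Point, direction: Tuple[int, int],
                     p: Point, q: Point, eps: float = 1e-9) -> bool:
    """True if the ray origin + t*direction (t > 0) crosses segment p-q."""
    dx, dy = direction
    sx, sy = q[0] - p[0], q[1] - p[1]
    denom = _cross(dx, dy, sx, sy)
    if abs(denom) < eps:  # parallel or collinear: not counted as a crossing
        return False
    rx, ry = p[0] - origin[0], p[1] - origin[1]
    t = _cross(rx, ry, sx, sy) / denom  # position along the ray
    u = _cross(rx, ry, dx, dy) / denom  # position along the segment
    # Half-open range for u so a shared vertex is counted only once.
    return t > eps and 0.0 <= u < 1.0


def interstroke_counts(stroke: Sequence[Point]) -> List[int]:
    """Hit count of the eight direction lines from the stroke's start point."""
    start = stroke[0]
    segments = list(zip(stroke, stroke[1:]))
    return [sum(ray_hits_segment(start, d, p, q) for p, q in segments)
            for d in DIRECTIONS]


def interstroke_distance(count_in: Sequence[int],
                         count_ref: Sequence[int]) -> int:
    """Sum of absolute differences between two count vectors."""
    return sum(abs(c1 - c2) for c1, c2 in zip(count_in, count_ref))


if __name__ == "__main__":
    print(interstroke_counts([(0, 0), (3, 1), (1, 3), (-3, 1)]))
```

The resulting eight counts play the role of $\mathrm{count}_{1}$ and $\mathrm{count}_{2}$ in the second term of the local distance.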

3. Experimental Result

First, we performed an experiment to investigate how quickly users become familiar with Kinect-based handwriting recognition.

Five writers each wrote the numerals 0–9 five times, and we measured the total time of each trial. The total times of the 1st, 3rd, and 5th trials are shown in Table 2.

Table 2

Total writing time of the 1st, 3rd, and 5th trials for each writer.

Trial	Writer 1	Writer 2	Writer 3	Writer 4	Writer 5
1	1:06	2:00	2:01	4:10	1:03
3	1:17	1:30	1:51	1:01	0:58
5	0:44	1:05	1:12	0:46	0:48

Table 2 shows the total writing times of the 1st, 3rd, and 5th trials. The first trial takes a long time; by the fifth trial, however, every writer is familiar with Kinect-based handwriting interaction, and hand recognition, writing in the air, and deletion of mistaken characters all proceed smoothly. These results indicate that a user needs three to five trials to become familiar with handwriting interaction using Kinect, which is a short period of trial and error.
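The trend in Table 2 can be summarized by averaging the times over the five writers. The short Python sketch below does this under the assumption that the table entries are in minutes:seconds format, which the table does not state explicitly; the helper names are hypothetical.

```python
from typing import Dict, List

# Entries of Table 2, assumed to be in minutes:seconds format.
TABLE_2: Dict[int, List[str]] = {
    1: ["1:06", "2:00", "2:01", "4:10", "1:03"],
    3: ["1:17", "1:30", "1:51", "1:01", "0:58"],
    5: ["0:44", "1:05", "1:12", "0:46", "0:48"],
}


def to_seconds(entry: str) -> int:
    """Convert a minutes:seconds string to seconds."""
    minutes, seconds = entry.split(":")
    return int(minutes) * 60 + int(seconds)


def mean_seconds(trial: int) -> float:
    """Average total writing time of one trial over all writers."""
    times = [to_seconds(entry) for entry in TABLE_2[trial]]
    return sum(times) / len(times)


if __name__ == "__main__":
    for trial in sorted(TABLE_2):
        print(trial, mean_seconds(trial))
```

Under this assumption, the average total time falls from 124 s in the first trial to 55 s in the fifth.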

Second, we performed an experiment to investigate the recognition rate of characters handwritten with the Kinect sensor. In this experiment, ten writers each wrote the alphabetic and numeric characters twice.

The overall recognition rate of characters is 96.9%. This result is better than that of [10]. Figure 6 shows the recognition rate of each numeral. The average recognition rate is 95.0%. The numerals with recognition rates below 100% are “4,” “6,” “8,” and “9,” with rates of 70%, 95%, 95%, and 90%, respectively. The recognition rate of most numerals is high, and only “4” is clearly lower than the other numerals. The reason for the lower recognition rate is that some numerals have similar shapes and a similar one-stroke writing style. Figure 7 shows the recognition rate of each letter. The average recognition rate is 98.9%. The recognition rate of most letters is high, but for “D” and “P” it is lower, at 90% and 80%, respectively. The reason for the lower recognition rates is that some letters have a similar style in the Graffiti of normal alphanumeric gestures.
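The 95.0% average for numerals follows directly from the per-numeral rates quoted above. The Python check below assumes, as the text implies, that every numeral not listed as below 100% was recognized at 100%; the variable names are illustrative only.

```python
# Per-numeral recognition rates (%), assuming 100% where no lower value is reported.
numeral_rates = {digit: 100.0 for digit in range(10)}
numeral_rates.update({4: 70.0, 6: 95.0, 8: 95.0, 9: 90.0})

# Unweighted mean over the ten numerals.
average_rate = sum(numeral_rates.values()) / len(numeral_rates)

if __name__ == "__main__":
    print(average_rate)
```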

Figure 6

Recognition rate of each numeral.

Figure 7

Recognition rate of each letter.

4. Discussion

4.1. Consideration of Recognition Rate

The recognition rate of numerals was 95.0%; however, “4” was the worst among the numerals because the system permits several different writing patterns for each numeral in order to accommodate the variety of human handwriting styles. For example, apart from the general pattern, handwriting varies in stroke order and in the shape that results from one-stroke writing. As a result, the shapes of “4” and “9” became similar in one-stroke writing, as shown in Figure 8.

Figure 8

Input numeral and recognition.

The recognition rate of alphabetic characters was 98.9%. This is higher than the numeral recognition rate because we used Graffiti characters, which have low ambiguity. If the numerals were also written in Graffiti style, an improvement in their recognition rate could be expected.

The detection performance of the Kinect sensor is very high; however, it takes some time for a user to have the hand detected and to write, because most people are not familiar with writing in the air in front of a Kinect sensor. Therefore, the user needs to become familiar with Kinect-based handwriting interaction in order to increase the recognition rate.

4.2. Misrecognition

There are two types of misrecognition, as follows.

(1) The numeral recognition rate of “4” was the lowest in this research because “4” and “9” have a similar shape when written in one stroke, as shown in Figure 8.

(2) The alphabet recognition rates of “D” and “P” were the lowest in this research because “D” and “P” are similar in shape and writing style, as shown in Figure 9.

Even humans have difficulty distinguishing “4” from “9” and “D” from “P” when they are written in this way, so these cases are difficult to deal with. Methods of solving the other misrecognitions include postprocessing that recognizes parts of letters and the spatial relationship between the parts.

Figure 9

Input alphabet and recognition.

4.3. Recognition Rate of Numeral Character

As shown in Section 4.1, the numeral recognition rate was slightly lower than the rate for the alphabet written in Graffiti. From the viewpoint of human-computer interaction, our system accepts the natural writing style used in daily life and is therefore easy for any user to use. However, if a higher recognition rate is required, the writing style of the numerals could be changed to Graffiti style.

5. Conclusion

This paper considered the recognition by the Kinect sensor of characters written in the air with the hand. The recognition rate was 96.9%, higher than that of a study using multiple cameras [10], even though our system uses only one camera. The researchers in [9] used DP matching and their own alphanumeric notation; however, their recognition rate was 75.3%. We obtained improved recognition rates by introducing interstroke information into DP matching. Kinect performs very well in terms of hand detection, but most people are unfamiliar with writing characters in the air or indeed with Kinect. Therefore, the user needs to become accustomed to using Kinect. To reduce the chances of misrecognition, a method that eliminates similarity between written characters, such as a dedicated writing style, would be helpful.

In future work, we will address the recognition of hand shapes, because hand shapes allow users to communicate with a computer faster than drawing characters by hand. The aim of this line of research is to enable a simple form of sign language that is easier to remember than conventional sign language. It could enable people who do not know conventional sign language to communicate with others via a computer.

Footnotes

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

1. Chua, Guan. Model-based 3D hand posture estimation from a single 2D image. Image and Vision Computing, vol. 20, no. 3, pp. 191–202, 2002. doi: 10.1016/S0262-8856(01)00094-4

2. Stenger, Thayananthan, Torr P. H. S., Cipolla. Filtering using a tree-based estimator. Proceedings of the IEEE International Conference on Computer Vision, Nice, France, October 2003, pp. 1063–1070.

3. Wachs J. P., Kölsch, Stern, Edan. Vision-based hand-gesture applications. Communications of the ACM, vol. 54, no. 2, pp. 60–71, 2011. doi: 10.1145/1897816.1897838

4. Cho S. J., J. K., Bang, Chang, Kim D. Y. Magic wand: a hand-drawn gesture input device in 3-D Space with inertial sensors. Proceedings of 9th International Workshop on Frontiers in Handwriting Recognition (IWFHR-9 '04), October 2004, pp. 106–111. doi: 10.1109/IWFHR.2004.66

5. Kang, Sugimoto. Image categorization and semantic segmentation using scale-optimized textons. IT CoNvergence PRActice (INPRA), vol. 2, no. 1, pp. 2–14, 2014.

6. Liu, Wang, Bai. YaVNC: a virtual application solution for smartphone. IT Convergence Practice, vol. 1, no. 4, pp. 39–49, 2013.

7. Lee Y.-H. Detection and recognition of facial emotion using bezier curves. IT CoNvergence PRActice (INPRA), vol. 1, no. 2, pp. 11–19, 2013.

8. Han, Seki, Kamiya, Hikizu. Wearable handwriting input device using magnetic field. Proceedings of the Society of Instrument and Control Engineers Annual Conference (SICE '07), September 2007, pp. 365–368. doi: 10.1109/SICE.2007.4421009

9. Sonoda, Muraoka. A letter input system of handwriting gesture. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. J86-D-II, no. 7, pp. 1015–1025, 2003.

10. Horo, Inaba. A handwriting recognition system using multi cameras. Proceedings of the 17th Workshop on Interactive Systems and Software, 2006, pp. 121–122.

11. Yosihiro, Aishi, Omikazu. Examination of aerial handwritten character input system using Kinect sensor. Proceedings of the Workshop on Interactive Systems and Software, December 2012.

12. Zhang. Microsoft kinect sensor and its effect. IEEE Multimedia, vol. 19, no. 2, pp. 4–10, 2012. doi: 10.1109/MMUL.2012.24

13. Ren, Meng, Yuan, Zhang. Robust hand gesture recognition with kinect sensor. 19th ACM International Conference on Multimedia (ACM Multimedia 2011, MM'11), USA, December 2011, pp. 759–760. doi: 10.1145/2072298.2072443

14. Yoon H. S., Soh, Min B. W., Yang H. S. Recognition of alphabetical hand gestures using hidden Markov model. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. 82, no. 7, pp. 1358–1366, 1999.

15. Költringer, Grechenig. Comparing the immediate usability of Graffiti 2 and virtual keyboard. Proceedings of the Conference on Human Factors in Computing Systems (CHI EA '04), April 2004, pp. 1175–1178. doi: 10.1145/985921.986017

16. Satoru, Kanyuu, Kazutaka, Nobuo. Efficiency through dimensional reduction: the similarity. Search of time-series data based on Time Warping. DBSJ Letters, vol. 4, no. 1, pp. 1–4.

17. Shin, Ali M. M., Katayama, Sakoe. Stroke order free on-line character recognition algorithm using inter-stroke information. IEICE Transactions on Information and Systems, vol. 82, no. 3, pp. 382–389, 1999.

18. Belongie, Malik, Puzicha. Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 4, pp. 509–522, 2002. doi: 10.1109/34.993558