Sage Journals: Discover world-class research

Abstract

Facial expression is one of the major cues for emotional communications between humans and robots. In this paper, we present emotional human robot interaction techniques using facial expressions combined with an exploration of other useful concepts, such as face pose and hand gesture. For the efficient recognition of facial expressions, it is important to understand the positions of facial feature points. To do this, our technique estimates the 3D positions of each feature point by constructing 3D face models fitted on the user. To construct the 3D face models, we first construct an Active Appearance Model (AAM) for variations of the facial expression. Next, we estimate depth information at each feature point from frontal- and side-view images. By combining the estimated depth information with AAM, the 3D face model is fitted on the user according to the various 3D transformations of each feature point. Self-occlusions due to the 3D pose variation are also processed by the region weighting function on the normalized face at each frame. The recognized facial expressions - such as happiness, sadness, fear and anger - are used to change the colours of foreground and background objects in the robot displays, as well as other robot responses. The proposed method displays desirable results in viewing comics with the entertainment robots in our experiments.

Keywords

Human robot interaction facial expression 3D face model face pose hand gesture

1. Introduction

Currently, various types of robots - such as intelligent service robots, entertainment robots, etc. - are in various stages of development. One of the key issues for these robots is human-robot interaction (HRI). For successful HRI, it is desirable for a robot to recognize and interact with the user's facial expressions and pose, as well as their gestures and voice. For instance, this applies to entertainment robots: a new type of media machine with the ability to transfer various contents to audiences. Children can read and hear fairy tales, comics and sing songs through a robot. However, almost all methods for HRI are developed for controlling robots, not for interacting with them. For natural communications between robots and humans, it is necessary for the robot to be able to respond according to the user's emotional state [1]. In general, the user may feel more satisfied with and friendly towards the robot when it responds in parallel with the user's emotion.

Figure 1.

Face pose, facial expression and hand gesture-based human robot interaction

Facial expression is one of the most important means of interaction in using emotions [1, 2, 3]. To recognize the facial expressions of the input face, it is important for the robot to get an accurate reading of the positions of facial feature points, such as the nose, mouse, eyes and eyebrows, since the locations of the facial feature points are largely varied according to the facial expression. To define these feature points, some methods [2, 3] detect skin regions and extract feature points by searching for minima in the topographic grey level relief. More recently, some approaches [4, 5] have used face models - such as the Active Appearance Model (AAM) [6] - for real-time feature extraction. However, almost all of the proposed methods have the limitation of requiring a frontal view for facial input, which ignores the 3D transformations of each feature point. For accurate feature point extraction and natural HRI, the 3D positions of feature points rather than their 2D positions should be considered.

To acquire the 3D positions of feature points, one method is to directly use the 3D information of a face captured by special equipment, such as 3D scanners or depth cameras. The most popular such method is the 3D Morphable Model (3DMM) [7]. The 3DMM is the statistical representation of both the 3D shape and texture of a face. Sun and Yin [8] detected the eyes and the tip of a nose by using 3D information acquired from the 3dMD face imaging systems. Next, they estimated the 3D shape of the input face using information about the detected eyes and the tip of nose. However, since special and expensive cameras are necessary to acquire 3D information, this method cannot be easily used [9].

A more practical method for gaining 3D information is to extrapolate from 2D inputs rather than directly acquiring the 3D information. To acquire 3D information from 2D images, stereo vision [10] is generally used. Stereo vision uses two different images to estimate depth information. Chen and Wang [11] proposed the 3D AAM method combining the AAM and depth information estimated from input pair images using stereo imaging. Sung and Kim [12] proposed the 2D+3D AAM, which fits a face onto the View-based AAM [13] and reconstructs it as a 3D model through stereoscopic approximation. However, the method of combining the stereo images cannot estimate the 3D face model in real-time or near real-time, since it requires complex calculations to compute the depth information using the stereo method. In addition, the stereo for a face often results in inaccurate depth information since the facial region frequently has consistent skin colour (except for some small regions, such as the eyes, eyebrows, lips, etc.).

In this paper, we describe a new method for the recognition of facial expressions using an estimation of 3D feature points based on the AAM. In our approach, we acquire simple depth information from frontal and two-sided face views in the training phase, but we efficiently control 3D shape variations and self-occlusion by combining the acquired depth information with the AAM. With the extracted 3D facial feature points from the 3D face model, we approximate a Gaussian Mixture Model (GMM) where each Gaussian represents each different facial expression. By defining the probability density function of the input facial expression from the GMM, a robot is able to recognize the user's emotions and respond with appropriate actions.

As an example of emotional interaction with a robot, we present an efficient experiment in which a comic book is viewed on a robot's display, as shown in Figure 1. We designed a non-linear control for the viewing order of the comic book panels and in order to adjust the size and colour of the objects in the specified panel according to the user's emotional state. For control, we also use face pose and hand gesture recognition. The content provided is generated by a multi-layer structure that includes both a foreground and a background layer. Several objects are segmented in the foreground layer. According to the recognized emotions of the user, the objects in the panel can be modified in respect of colour, size, orientation or location. Based on the content of the comic book, the robot exhibits responses such as “move forward”, “move backward” or “make a circle”. As a result, a user can appreciate the comics in his or her own style.

The remainder of the paper is organized as follows. Section II describes the proposed facial expression recognition method. In Section III, we discuss face pose recognition. The hand gesture recognition method is presented in Section IV. In Section V, we show our content manipulation scheme for viewing comics. Section VI presents the experimental results of our proposed method.

2. Facial Expression Recognition

2.1. Active Appearance Model

The AAM represents a face by combining its shape and appearance. The shape and appearance are acquired from manually marked landmarks (facial feature points) on face images. The shapes used for the exercise are represented by a $d \times N$ matrix composed of $d = 2 \times M$ vectors where a set of M is the number of 2D point landmarks and N is the number of face images used. In this paper, we manually marked $M = 86$ landmarks on each face image and composed 138 triangle meshes for the appearance model, as shown in Figure 2. We located the landmarks at the edges, corners and additional 3D topographic locations, such as the tip of a nose and the middle of the forehead, in order to deal with the 3D rotations of facial feature points.

The shape subspace is constructed by applying Principal Component Analysis (PCA) to the set of shape vectors which are aligned by the mean shape $\bar{s}$ using the Procrustes Analysis method [14]. The shape model $S_{M}$ is defined as a linear combination of l shape parameters $s = {s_{1}, \dots, s_{l}}$ with the weights of shape eigenvectors $p = {p_{1}, \dots, p_{l}}$ derived from the shape subspace:

S_{M} = \bar{s} + \sum_{i = 1}^{l} p_{i} s_{i}

(1)

To create the appearance subspace, each training image is warped in order to remove spurious texture variations due to shape differences. As a result, we can obtain shape-free images. The appearance of each image is represented as a vector in the same order and with a dimension corresponding to these shape-free images. The appearance subspace is constructed by applying PCA to the set of appearance vectors, centred on the mean appearance vector, $\bar{A}$ . The appearance model $G_{M}$ is defined as a linear combination of m appearance parameters $A = {A_{1}, \dots, A_{m}}$ with weights of appearance eigenvectors $g = {g_{1}, \dots, g_{m}}$ with respect to the shape model:

G_{M} = \bar{A} + \sum_{i = 1}^{m} g_{i} A_{i}

(2)

Figure 2.

The AAM fitted onto four different facial expressions

Figure 3.

The AAM variations for the first three AAM parameters

The AAM combines the shape model and the appearance model by applying PCA on the connected vector of the shape and appearance parameters:

{[κ S_{M}, G_{M}]}^{T} = \sum_{i = 1}^{n} q_{i} c_{i}

(3)

where κ is the weight parameter for the shape calculations, allowing for the difference in units between the shape and the grey models of the appearance. We set this weight parameter as the ratio of the total intensity variations to the total shape variation in our experiments. q and c are n eigenvectors and AAM parameters, respectively. Figure 3 shows the variation results of the AAM for the first three AAM parameters $c_{1}$ , $c_{2}$ and $c_{3}$ .

To fit the input face with the AAM, the transformation parameters are necessary. We assume that any arbitrary input face X can be generated by applying the similarity transformation $t$ to the AAM. The similarity transformation is represented by the rotation ${θ_{x}, θ_{y}}$ , scaling ${s}$ and translation ${t_{x}, t_{y}}$ . For global illumination changes, gamma correction with the parameter ${u}$ can be applied. Therefore, the transformation parameters for any input face with the AAM model are defined by $t = {θ_{x}, θ_{y}, s, t_{x}, t_{y}, u}$ .

The AAM utilizes the parameter $p$ to fit the input face, as follows:

p^{T} = [c^{T} | t^{T}]

(4)

To estimate the parameter $p$ , the appearance error between the input face and the AAM model is minimized using an optimization method, as in the following equation:

\sum {[\bar{A} + \sum_{i = 1}^{m} g_{i} A_{i} - W (I; p)]}^{2}

(5)

where $W (\cdot)$ is the warping function which adjusts the input face I to the mean shape $\bar{s}$ with the parameter $p$ .

2.2. Depth Estimation

We calculate the depth information of each facial feature point by using a frontal and two side-view faces, as shown in Figure 4. Although this method shows lower accuracy than some other recently proposed methods, it is sufficiently accurate to estimate the 3D face model of the input face with the aid of the AAM.

Figure 4.

The estimation of the 3D face model for a frontal and two side-view faces

To estimate the depth information of a face, we manually marked landmarks on a frontal and two side-view face images, as shown in Figure 4. The frontal face provides the x and y locations and the side faces provide depth locations. However, the images captured by the cameras show the face at differing scales due to variations in the distance between the camera and the captured face. To take account of this scale difference, we use Procrustes Analysis [16] to minimize the shape difference error as follows:

E = \sum {| x_{i} - \bar{x} |}^{2}

(6)

where $x_{i}$ is the ith shape vector and $\bar{x}$ is the mean shape vector. However, since our goal is to align the frontal and two side faces which share y coordinates, we redefine Equation (6) as the error between y coordinates:

E = \sum {| y_{s} - y_{f} |}^{2}

(7)

where $y_{s}$ and $y_{f}$ are the y coordinates of the side faces and the frontal face respectively.

2.3. 3D Face Model Estimation

To estimate the 3D face model according to the 3D transformation of the input face, the transformation parameter $t$ from Equation (4) has to be redefined as follows with respect to 3D rotation:

t = {θ_{x}, θ_{y}, θ_{z}, s, t_{x}, t_{y}, u}

(8)

If the AAM is learned with 3D shapes, as with [11], the Jacobian representation of the variations in the appearance with respect to the 3D rotation $θ_{x}$ and $θ_{y}$ can be learned in the training phase. However, since the variation of the appearance for the 3D rotation is not large, and since it confused the variations of other parameters, it cannot efficiently fit a face with pre-learned Jacobian parameters. Therefore, we update the 2D parameters ( $θ_{z}, s, t_{x}, t_{y}, u$ ) and the 3D parameters ( $θ_{x}, θ_{y}, t_{x}, t_{y}$ ) stage-by-stage.

Figure 5 shows the overview of the proposed 3D face model estimation. We first fit the input face using the AAM in the same way as with the typical AAM method. This updates the 2D parameters. Next, we fit the input face again using the 3D face model to update the 3D parameters. We repeatedly apply these two steps until convergence is achieved with respect to parameter $t$ .

Figure 5.

Overview of the estimation method for the 3D face model fitted on the input face

To fit the AAM to the input face, we estimate the parameter $t$ , which minimizes the difference between the appearance vector $G_{I}$ of the input face and $G_{M}$ of the AAM model. Given the parameter $t$ , the error according to the difference is defined as:

e (p) = G_{I} - G_{M}

(9)

A first-order Taylor expansion of Equation (9) gives the following Equation (10):

e (p + δ p) = e (p) + \frac{\partial e}{\partial p} δ p

(10)

For the current residual r, $δ p$ is chosen by minimizing ${| e (p + δ p) |}^{2}$ . By setting (10) to zero, we obtain the RMS solution as in the following Equation (11):

\begin{matrix} δ p = - Re (p) & w h e r e & R = {({\frac{\partial e}{\partial p}}^{T} \frac{\partial e}{\partial p})}^{- 1} {\frac{\partial e}{\partial p}}^{T} \end{matrix}

(11)

Since the typical AAM fits the input face with the parameters in Equation (4), it is hard to deal with 3D pose variations when compared with the construction of the 3D face model with the parameters in (8) in our method. We update them by using Equation (11).

The face that fits with a typical AAM defines the x, y coordinates of the model. With this fitting result, we update the rotation parameters $θ_{x}$ and $θ_{y}$ by re-fitting with the additive z coordinates from the estimated depth information in the previous section. In this step, since the Jacobian values for $θ_{x}$ and $θ_{y}$ change according to the pose location of the input face, we recalculate these at each frame. In addition, we also update the parameters $t_{x}$ and $t_{y}$ according to the movement caused by the centre of rotation. The Jacobian values for this step include only four parameters ( $θ_{x}, θ_{y}, t_{x}, t_{y}$ ), and these do not require a great deal of computational time.

Although the 3D face fitting with the parameters in (8) deals with the variations of 3D pose and facial expression, this does not provide sufficiently accurate fitting results when the input face pose is far from its frontal position due to self-occlusion. For this reason, further processing for self-occlusion is essential.

We use the direction of the mesh construction to describe its occlusion in relation to the occurrence of the x and y rotations. The direction of the three points in the mesh is reversed when it is occluded. Let us consider the case in Figure 6(a), in which the red mesh is constructed by A-B-C. After the y rotation of the face in the case of Figure 6(b) is applied, the red mesh is constructed by C-B-A due to the occlusion which occurs when the mesh is at the back of the face. Therefore, we check the directions of all of the mesh construction parameters and we assign a weight of zero to the mesh which has the reverse direction. A zero weight value means that the mesh is replaced by the mean appearance. The direction of the mesh construction can be decided efficiently by the inner-product between the vectors.

Figure 6.

The change of direction of the mesh construction according to 3D pose variation

2.4. Facial Expression Recognition

Assuming that we have K states of emotion resulting from recognized facial expressions, such as happiness, sadness, fear and anger, we represent the facial expression appearance manifold by a set of simple linear K sub-manifolds, as constructed by the parameters of 3D AAM in the previous section. The facial expression appearance is represented as an approximated GMM where each Gaussian represents a different facial expression. We can define the probability density function of the input facial expression from the approximated GMM as:

P (x) = \sum_{i = 1}^{K} α_{i} G a u s s (x | μ_{i}, Σ_{i})

(12)

where $μ_{i}$ , $Σ_{i}$ and $α_{i}$ are the mean, the covariance matrix and the weight of the i^th Gaussian component, respectively. We have estimated the mean $μ_{i}$ and the covariance $Σ_{i}$ of the distribution from the given training images that belong to the i^th facial expression. We first initialize each $α_{i}$ with $1 / K$ and update it using the EM (Expectation-Maximization) algorithm [15]. The probability of each Gaussian component $G a u s s (x | μ_{i}, Σ_{i})$ is calculated by its likelihood. According to the AAM parameters, we can approximate the i^th Gaussian component using each AAM. In addition, the likelihood can be calculated as:

P (x | Ω_{i}) = \exp [- \frac{1}{2} ‖ c ‖]

(13)

where $c$ is the projected parameter with respect to the i^th 3D AAM. Thus, the probability for each facial expression can be represented by the combination of the likelihood and its weight. We can simply classify the facial expression at each frame by evaluating:

k^{*} = \underset{i}{\arg \max} α_{i} P (x | {\hat{Ω}}_{i})

(14)

3. Face Pose Recognition

The interaction with robots using the commands of face poses is very useful and helps users to more easily understand their potential for interaction. As we have approximated the facial expression manifold in a linear way, we are also able to approximate the face pose manifold using the approximated GMM that each Gaussian represents with respect to each different face pose. That is, we can use Eq. (12) for the probability density function of the input face pose derived from the approximated GMM. The likelihood of an input image x that has N dimensions is given by:

P (x | Ω_{i}) = \frac{\exp [- \frac{1}{2} {(x - μ_{i})}^{T} Σ_{i}^{- 1} (x - μ_{i})]}{{(2 π)}^{N / 2} {| Σ_{i} |}^{1 / 2}}

(15)

From [15, 16], we can approximate the i^th Gaussian component $Ω_{i}$ as an affine subspace ${\hat{Ω}}_{i}$ constructed using PCA. Given a training set that belongs to the i^th pose, we can form ${μ_{i}, Σ_{i}, Φ, Λ}$ , where $μ_{i}$ , $Σ_{i}$ , Φ and Λ are the mean vector, the covariance matrix, the eigenvector matrix of $Σ_{i}$ and the corresponding diagonal matrix of the eigenvalues, i.e., $Λ_{j j} = λ_{j}$ , respectively. The linear projection from I to ${\hat{Ω}}_{i}$ may be expressed as $y = {[y_{1}, y_{2}, …, y_{M}]}^{T} = Φ^{T} (I - μ_{i})$ . Using PCA projection, the likelihood probability Eq. (2) is expressed as the product of two marginal Gaussian densities. In other words:

P (x | {\hat{Ω}}_{i}) = [\frac{\exp (- \frac{1}{2} \sum_{j = 1}^{M} \frac{y_{j}^{2}}{λ_{j}})}{{(2 π)}^{M / 2} \prod_{j = 1}^{M} λ_{j}^{1 / 2}}] \cdot [\frac{\exp (- \frac{ε^{2} (x)}{2 ρ})}{{(2 π ρ)}^{(N - M) / 2}}]

(16)

where M and $ε^{2} (x)$ denote the dimension of the subspace ${\hat{Ω}}_{i}$ and the residual reconstruction error, which is defined as:

ε^{2} (x) = \sum_{j = M + 1}^{N} y_{j}^{2} = {‖ x - μ_{i} ‖}^{2} - \sum_{j = 1}^{M} y_{j}^{2}

(17)

The parameter ρ in Equation (16) can be represented as $\frac{1}{N - M} \sum_{j = M + 1}^{N} λ_{j}$ , or simply $\frac{1}{2} λ_{M + 1}$ [16] (in our experience, we use the latter).

Finally, we can linearly approximate Eq. (12) as:

P (x) = \sum_{i = 1}^{K} α_{i} P (x | {\hat{Ω}}_{i})

(18)

When the input face image is similar to the i^th face pose, the weight parameter $α_{i}$ for the i^th pose density probability $P (x | {\hat{Ω}}_{i})$ has a larger value. Otherwise, it has a smaller value. According to the first-order Markov assumption, we can estimate each parameter $α_{i}^{t}$ at the t frame by the parameters $α^{t - 1}$ and the pose density probabilities $P^{t - 1} (x | \hat{Ω})$ at the $t - 1$ frame using the EM algorithm:

\hat{s} = f (t), where f : [0, 1] \to [θ_{\min}, θ_{\max}]

(19)

4. Hand Gesture Recognition

The user executes a small set of commands on the robots using hand gestures. In our system, the user manipulates the object's size and position. To recognize the user's hand gestures, we first extract skin-like regions based on skin colour statistics and then classify the skin-like regions with respect to face and hand regions. Face candidate regions are detected using the Viola and Jones method [17]. From skin-like regions, we delete face candidates and eliminate false hand candidates using size constraint values.

Among the hand candidate regions, it is necessary to extract hand regions by removing forearm regions. To do this, we compute a skeleton of the hand candidate regions and then draw a circle to detect the hand region along the skeleton. The circle will be small at the wrist point and large in the hand region. When we find the largest circle along the skeleton - as in Figure 7 - the centre of the circle is labelled as the COG of the hand.

From the COG, we detect the farthest and the nearest points in the hand. Usually, the farthest point is the tip of the longest active finger. To count the number of active fingers, we draw a circle from the COG, as in [18]. The radius is 0.7 of the farthest distance, which is the distance from the COG to the farthest point. Following this, we can recognize the intersection area for counting the active fingers of the hand. The fist shape of the hand is detected by comparing the nearest and farthest distances from the COG. If two-times the shortest distance is greater than the farthest distance, the shape of the hand will be a fist. We can thus classify hand gestures as either a fist or an open palm, with a further distinction made for the number of active fingers. These classifications are used for the control of object manipulation in the activated panel of the comic book [19].

Figure 7.

COG of the hand

In our system, zooming in and out of the panel is controlled by counting the number of active fingers. Object translation and rotation are also controlled by the number of active fingers.

5. Contents Manipulation

To control the viewing style of the comics and manipulate contents at will, the user will utilize face pose, facial expression and hand gestures. For various manipulations, object-based multi-layer comic contents were created. The scene in the panel of the comics is represented by the scene graph. Each leaf node in the scene graph contains information about the location, rotation and scale of the object.

The viewing style is determined by the face pose and hand gesture recognition results. Only one panel is activated while the other panels are deactivated. The activation order of panels in the comic book is one of four directions, i.e., up, down, right or left. Hand gesture recognition results control the activated panel with respect to “zoom in” or “zoom out” using the user's left fist and the open right hand with five fingers recognized. When the panel is zoomed in, the robot moves towards to the user by about 20cm, following which the user can translate the object into one of the four directions by using the motion of the index finger. Scaling is performed by indicating three or four. Rotation is performed using two fists. When the object is rotated, the robot also makes a circle.

Based on the facial expression, we also manipulate the colour of the objects in the comics. To analyse emotional characteristics in the colour feature, we performed an empirical study of the images. The statistical baseline for the four emotions was manually determined by surveying 10 students with regard to 300 images. If the image is labelled with the same emotion by at least 7 of the 10 students, we assign the image as having one of four emotions. From these labelled images, we extract emotion-related colour features in the HSV colour space. Figure 8 shows our hue and value template. For example, one main sector covers an area around 60 degrees for the happiness hue template. In the value template, the upper arc contrast is selected for “happiness” and the lower arc contrast is selected for “fear, sadness and anger.”

Figure 8.

(a) Hue template, (b) Brightness template: happiness-upper arc contrast, sadness, fear, anger-lower arc contrast

To transform the pixel values according to the emotions, we use hue templates [20, 21]. The new hue value $H' (p)$ is computed by:

H' (p) = C (p) + \frac{w}{2} (1 - G_{σ} (| | H (p) - C (p) | |))

(20)

where $C (p)$ is the mean value of the main sector in the hue template, w is the arc length of the template sector and $G_{σ}$ is the normalized Gaussian function with variance σ.

To perform brightness transformations, we use a value template, as in Figure 8(b). The brightness value transformation is performed by non-linear characteristics like:

y = T (x) = c x^{γ}

(21)

The brightness value at pixel p is transformed by:

V' (p) = {(\frac{V (p)}{255})}^{\frac{1}{γ}} \times 255

(22)

where γ is 2.0 for happiness, 0.4 for sadness, 0.5 for anger and 0.3 for fear, respectively.

6. Experimental Results

To evaluate the performance of the proposed method, we first evaluated the fitting result of the proposed 3D AAM, because the estimated positions of the facial feature points are important in recognizing the facial expressions of the input user. To do this, we compared our method with the typical AAM [6] learned from face images, including various facial expressions and the View-based AAM [13] learned from three distinct pose images with various facial expressions for each pose. For learning, we made training videos with nine volunteers. All of the videos have a 640×480 image size, 15 fps, are about 23 seconds in length and are taken by a web camera. The test video and training video were independently captured for each person. The training video includes variations between four facial expressions: expressionless, happiness, surprise and disgust, as shown in Figure 2, but do not include variations of scale or translation. However, the test video includes all such variations of facial expressions, scale and translation; all of the variation combinations were randomly changed. For training, we sampled the training videos at a ten-frame interval.

For the AAMs, we marked 86 landmarks using 35 training frontal faces per subject. The AAM trained by frontal faces is also used in our method. For the view-based AAM, the training faces are captured from three perspectives: −45, 0, 45 degrees of viewing. With each view, 35 training faces with various facial expressions are used.

Figure 9.

Some results of the depth estimation (top: frontal face, middle: one side face, bottom: the estimated 3D face model)

To estimate the depth information for each person, we used frontal and two side-view faces. We marked landmarks on these three faces and constructed a simple 3D face model. Figure 9 shows some examples of the frontal and one sided faces used for estimation and the subsequently obtained 3D face results. As shown in Figure 9, our depth estimation is calculated from subjects who include various face shapes: lean-faced, round faces, etc. The estimated 3D face models using the proposed method did not have high accuracy. However, they do express the adaptive shape for each person and efficiently help our method by defining a relationship of fit between the input face and 3D face transformations.

We combined the estimated depth information with the AAM learned from frontal faces. From this combination, we fitted the input faces using the method explained in Figure 5. Figure 10 shows one of the fitting results on the training video, which includes frequent changes of 3D pose and facial expression. The graph in Figure 10 shows the errors that occur between the appearances of the input face and the model from Equation (9). When the method fits the input face correctly, we observe lower error values. However, higher error values are observed when the method does not have an accurate fitting result or else misses the input face. In the case of a typical AAM, it is not possible to fit the input faces when the input face is far from the frontals, because the AAM is learned from frontal faces. In Figure 10, the AAM attempts to fit the input faces unsuccessfully until the 190^th frame, but not subsequently, because significant 3D pose variation has occurred. After the 190^th frame, the AAMs missed all of the input faces due to non-correspondence in the scale and translation parameters of its model. The view-based AAM can fit on a wider selection of input faces than the AAM since it selects the adaptive pose model for the input face from the training data. However, and as shown around the 50^th and 250^th frames in Figure 10, this model provides unstable fitting results due to the difference between the poses of the input and the training. Although the fitted face model is similar to the input face, the pose difference leads to inaccurate fitting. As a result, these unstable fitting results have a higher rate of error than our method, which has accurate fitting results. It also missed the input faces after 305^th frame due to the incorrectly selected pose model. Because of the wrong pose, this model misses all of the parameters, including those in Equation (4). By comparison, the proposed method fitted the input faces with the lowest rate of error and more accurate face models, as shown at the bottom of Figure 10. Our method does not require the choice of an adaptive pose model and does not include pose-varied face images in the training data. Nevertheless, our method had very accurate fitting results on input faces which have significant pose variations. In our method, the pose of the input face is estimated with reference to the results of the previous frame. Since our method calculates the Jacobian values for the 3D parameters at each frame, our fitting results are not dependent on the pose variations in the training data. Although our training data includes only frontal faces, our method provides accurate fitting results, even with respect to dramatic pose changes. Moreover, any error caused by appearance differences between frontal- and pose-varied views is decreased by means of the region weighting function proposed; thus, it continues to provide accurate results with a lower error rate, even where the input pose is more distant than the frontal view.

Figure 10.

Comparison of face fitting results between our proposed method and other methods

Figure 11.

Human Robot Interaction

Using the 3D facial feature points extracted from the fitted 3D face model, we designed an HRI interface on the viewing comic robot. Our human-robot interaction scheme is shown in Figure 11. The wheeled robot is equipped with a webcam, as in Figure 1. To interface with the designed robot, we used the recognition of the facial expression, face pose and hand gesture of the user. The input image is taken from the camera and, initially, skin-like regions are extracted. Next, we use morphological filters to filter out noise and holes for gesture recognition. Face detection and hand segmentation are executed from probable face and hand regions. A face candidate is detected from the grey image using Viola and Jones' method [17]. Subsequently, the face is recognized and a tracking process is executed for video-based face pose recognition. Figure 12 shows our strategy for controlling the viewing order by means of recognized face poses and hand gestures.

Figure 12.

The viewing panel is changed using face poses. Objects are manipulated by hand gesture recognition.

Figure 13.

Examples of hand gesture images

Table 1.

Pose Estimation Results

Pose (Total Frame No.)	Correct Frame No.	Proposed method
0 (560)	527	94.11 %
1 (1,263)	1,202	95.17 %
2 (575)	570	99.13 %
3 (331)	287	83.99 %
4 (430)	387	90.00 %
Total (3,159)	2,964	93.83 %

To control the viewing order using the face pose information, the user's face should first be detected and then recognized. Next, the face tracking process is executed for video-based face pose recognition. For face tracking, 80 sample windows with different sizes and orientations are created around the previous face position. Each sample window is converted to a 20 × 20 size window and skin regions are extracted. For the skin regions, the colour histogram is equalized. Afterwards, the sample windows are compared with the identified person's face pose sub-manifold. At this time, the face pose sub-manifold is the face pose of the previous frame. If the minimal distance between the sample window and the sub-manifold is larger than the threshold value, we conclude that the face tracked is incorrect, and a new face detection process is initiated. We evaluated our face pose recognition method on 10 sequences captured from 6 subjects. Table I shows our results.

Figure 14.

Re-coloured image. (a) happiness, (b) sadness, (c) fear and (d) anger.

Figure 15.

Facial expression-based re-colouring

Hand gesture recognition is also one of the important factors in controlling our viewing system. To evaluate the performance of the gesture recognition, we generated 120 hand gesture images captured at in-door and out-door environments, as shown in Figure 13. We evaluated hand gesture recognition rates between our proposed method and that of Malima [18]. Table 3 shows the recognition results. Our method demonstrated more accurate classified results from the various input hand gesture images than Malima's.

To evaluate the performance of the facial expression recognition system, we labelled its facial expression on the training set used for the face fitting evaluation. The facial expressions are classified into five categories: normal, sadness, anger, fear and happiness. The test sequences used for this evaluation have lengths of about 30 seconds and 30fps. Our system classified each facial expression at each frame and compared our results with the ground-truth data. Table 2 shows the recognition results for each emotion.

Table 2.

Facial Expression Recognition Results

Facial Expression	Recognition rate
Normal	84.6
Happy	94.7
Fear	90.2
Angry	82.0
Sadness	88.4

Table 3.

Hand Gesture Recognition Results

Method	Recognition rate
Malima's	84.84
Proposed	92.24

For emotional interaction with our robot, we performed the content re-colouring method based on the user's emotional states. To do this, we transform the hue and brightness value using Eq. (21) and (22). The resulting images are shown in Figure 14. After obtaining the images, we carry out a survey of the perception of emotion for each image obtained from our emotion-based contents re-colouring method. 87% of the students show a strong preference for our re-colouring images. Figure 15 also shows one of the re-colouring examples based on the user's emotion, which is recognized by the facial expression of the user. By the manipulation of the colour of the contents, the user has more fun in viewing the contents and interacting with the robot.

7. Conclusions

In this paper, we proposed an emotional interaction method with a robotic interface for viewing comic books. To acquire accurate facial feature points, we constructed a 3D AAM which combines the AAM with the estimated depth information. From the fitted 3D face model, we extracted 3D facial feature points to accurately recognize the facial expressions of the user. The facial expression of the user is estimated by computing a minimal distance between the given distance and the facial expression sub-manifold. Our proposed face pose and hand gesture recognition method also displays good performance in controlling the viewing order of the comics. By re-colouring contents based on the user's emotion from a recognized facial expression, we could provide an intuitive HRI interface which provides more enjoyment for the user in viewing visual materials. It is worth noting that our proposed method provides various new and promising viewing methods for comics with a robotic interface.

Footnotes

8. Acknowledgments

This research is supported by the Basic Science Research Program through the National Research Foundation (NRF) funded by the Ministry of Education, Science and Technology (2010-0024641).

References

Wimmer

MackDonald

B. A.

Jayamuni

Yadav

, (2008) Facial Expression Recognition for Human-Robot Interaction – A Prototype. 2^nd Workshop on Robot Vision.

Khan

M. I.

Bhuiyan

M. A.-A

, (2009) Facial Expression Recognition for Human-Robot Interface. International Journal of Computer Science and Network Security. 9(4):300–306.

Yang

S. S.

Lee

T. H.

Wang

, (2008) Facial expression recognition and tracking for intelligent human-robot interaction. Intelligent Service Robotics. 1(2):143–157.

Tang

Deng

, (2007) Facial Expression Recognition using AAM and Local Facial Features. Proceedings of the Third International Conference on Nattural Computation. 2: 632–635.

Zhao

Shang

, (2012) Facial expression recognition using local binary pattern and discriminant kernel locally linear embedding. Journal on Advances in Signal Processing.

Cootes

Edwards

Taylor

C. J.

(2001) Active appearance models. Transactions on Pattern Analysis and Machine Intelligence. 23(6):681–685.

Faggian

Paplinski

A. P.

Sherrah

, (2006) Active Appearance Models for Automatic Fitting of 3D Morphable Models. IEEE International Conference on Video and Signal Based Surveillance.

Sun

Yin

, (2008) Automatic pose estimation of 3D facial models. ICPR.

Skoglund (2003) Three-dimensional face modeling and analysis. M.S. thesis. Informatics and Mathematical Modelling. Tech. Univ. Denmark. Lyngby. Denmark.

10.

Yoon

K.-J

Kweon

I. S.

(2006) Adaptive Support-Weight Approach for Correspondence Search. IEEE Transactions on Pattern Analysis and Machine Intelligence. 28(4):650–656.

11.

Chen

C.-W

Wang

C.-C

, (2008) 3D Active Appearance Model for Aligning Faces in 2D Images. Proceedings of the IEEE/RS International Conference on Intelligent Robots and Systems. 3133–3139.

12.

Sung

Kim

, (2008) Pose-Robust Facial Expression Recognition Using View-Based 2D+3D AAM. IEEE Transactions On Systems, Man and Cybernetics, Part A: SYSTEMS AND HUMANS. 38(4):852–866.

13.

Cootes

T. F.

Wheeler

G. V.

Walker

K. N.

Taylor

C. J.

(2002) View-based active appearance models. Image and Vision Computing. 20: 657–664.

14.

Goodall

, (1991) Procrustes methods in the statistical analysis of shape. Journal of the Royal Statistical Society B. 53(2):285–339.

15.

Moghaddam

Pentland

, (1997) Probabilistic visual learning for object recognition. IEEE Trans. PAMI.

16.

Lee

Kriegman

, (2005) Online learning of probabilistic appearance manifold for video-based recognition and tracking. In Proc. IEEE Conf. CVPR.

17.

Viola

Jones

, (2001) Rapid object detection using a boosted cascade of simple features. Proc. CVPR.

18.

Malima

Ozgur

Cetin

, (2006) A Fast algorithm for vision-based hand gesture recognition for robot control. In Proc. IEEE Conf. Signal Proc. And Comm. Applications.

19.

Kang

H.-B

M.-H

, (2007) Face and Gesture-based Interaction for Displaying Comic Books. PSIVT 2007. LNCS 4872.

20.

Tokumaru

Muranaka

Imanishi

, (2002) Color Design support system considering color harmony. In Proc. IEEE Conf. Fuzzy Systems.

21.

Cohen-Or

Sorkine

Cal

Leyvand

Y.-Q

, (2006) Color Harmonization. ACM SIGGRAPH.