Robust facial landmark detection based on initializing multiple poses

Abstract

For robot systems, robust facial landmark detection is the first and critical step for face-based human identification and facial expression recognition. In recent years, the cascaded-regression-based method has achieved excellent performance in facial landmark detection. Nevertheless, it still has certain weaknesses, such as high sensitivity to the initialization. To address this problem, regression based on multiple initializations is established in a unified model; face shapes are then estimated independently according to these initializations. With a ranking strategy, the best estimate is selected as the final output. Moreover, a face shape model based on restricted Boltzmann machines is built as a constraint to improve the robustness of ranking. Experiments on three challenging datasets demonstrate the effectiveness of the proposed facial landmark detection method against state-of-the-art methods.

Keywords

Facial landmark detection, cascaded regression, multiple initialization, restricted Boltzmann machines

Introduction

Facial landmarks encode significant information about face shape deformation. Accurate detection of facial landmark points has aroused great interest, owing to its importance in such applications as human identification, facial animation, and facial expression recognition. It has been proven that the performance of face recognition can be remarkably elevated when using extra facial landmark locations.^1,2 Although several methods have been proposed for facial landmark detection, facial point estimation for real-world images with facial expressions, poses, or occlusion has always been a challenging problem, since current approaches struggle to handle the outliers that are generated under these conditions.

Previous work on facial landmark estimation, such as active shape modeling³ and active appearance modeling,⁴ focuses on explicit modeling of the shapes of objects. However, these methods tend to fail under the condition of large pose variations or facial expressions, especially for occluded faces in real-world applications. In general, they suffer from poor generalization performance and slow training. Even on datasets “in the wild,” they still cannot deliver state-of-the-art performance.

Lately, popular cascaded-regression-based approaches, such as the supervised descent method⁵ and models based on cascaded pose regression^6,7 and local binary features,⁸ have achieved similar state-of-the-art performance. Under the supervised descent method, scale-invariant feature transform features or histogram of oriented gradient (HOG) features are utilized for facial landmark detection by learning descent directions that minimize the mean of a non-linear least squares function. Models based on cascaded pose regression^6,7 select random ferns as primitive regressors instead of using a single regression, as done under previous methods. The robust cascaded pose regression model⁹ is an improvement of the cascaded pose regression model for handling occlusions and large shape variations by detecting occlusions and using shape-indexed features; this method is more robust. Methods based on local binary features help realize facial landmark detection by learning the local binary features, which can be discriminated independently, and utilize the features to learn a global linear regression for the final output. All of these methods have achieved the state-of-the-art results on the most challenging datasets available currently. Nevertheless, such cascade regression methods are highly sensitive to the initialization. If the initial face shape differs too much from the ground truth, the performance of the cascade regression method will decrease, especially for extreme poses and expressions.

Some approaches estimate the pose by first detecting the eyes and mouth independently and then choosing the best initialization. However, such methods are highly sensitive to occlusion. Cui et al.¹⁰ use the eyes and mouth to form a triangle, which designates the rough facial pose direction. Nonetheless, the dependency between the triangle and the pose direction is not strong enough to ensure the robustness of the prediction. Burgos-Artizzu et al.⁹ propose a smart restart scheme that can alleviate this shortcoming to some extent.

In addition, using several initializations seems to be a good strategy. By setting several different poses as the initializations instead of using the mean face shape (derived from training data), which is often far from the ground truth, and then estimating face landmarks independently based on these initializations using a cascade regression method, there is a higher probability of approximating the ground truth. Cao et al.,⁶ Burgos-Artizzu et al.,⁹ and Yang et al.¹¹ initialize with several shapes and then take the median value of the outputs as the final result. However, the accuracy of the median value is still unsatisfactory. Based on this intuition, we propose a novel strategy to rank these estimates and obtain the final shape output by predicting the probability that an estimate will approximate the ground truth. Unfortunately, it is difficult to calculate this probability directly from the appearance, because the prediction is not robust enough, owing to variations in the facial appearance, especially in the more challenging images. In addition, it is difficult to rank results when there is not much difference in the landmark estimates.

To address these problems, a novel ranking method is proposed to select the best estimate. Firstly, a score variable is defined to measure the deviation of a predicted landmark value from its true value. To predict the score more accurately, a deviation evaluation model based on two-level regression is put forward, which has the capacity of distinguishing estimates that are only slightly different by utilizing local appearance information. Secondly, a deviation evaluation model based on a prior shape model is used to calculate the reconstruction errors for measuring the difference between the typical face shape pattern and each face landmark estimate.

However, training a face shape model is still challenging, because it is difficult to capture face shapes in different poses or with different facial expressions. Recently, the use of restricted Boltzmann machines (RBMs) and their variants has been proposed to solve this challenging problem and has achieved outstanding results.^12,13 It is worth mentioning that most RBM-based models that can handle the pose problem are difficult to train. In this article, a simple and effective model based on conditional Boltzmann machines¹⁴ is applied to capture face shape patterns with different poses and facial expressions, which are then embedded in the ranking model as constraints to improve robustness.

The major contributions of this research are described as follows.

In contrast to using the median value of the several shape estimates as the final output, like most current methods, we propose a novel ranking model, which selects the best shape by estimating the deviation of a predicted landmark value from its true value.

A deviation evaluation model based on two-level regression is trained to distinguish two similar estimates by utilizing local appearance information.

A deviation evaluation model based on an RBM-based shape model is utilized to calculate the reconstruction error to evaluate the global shape deviation, which can help rank the shape estimates robustly.

This method is evaluated using Annotated Faces in the Wild (AFW), IBUG, and Caltech Occluded Faces in the Wild (COFW) datasets.^9,15,16 The algorithm in this article shows results that are comparable with or better than state-of-the-art methods.

The rest of this article is organized as follows. The details of the proposed robust facial landmark detection method are described. Experimental results are then presented and compared with state-of-the-art methods. Then follows a discussion of the impact of the facial landmark detection method on image-based applications. The article concludes with a summary of this research work and its future extensions.

Proposed method

Overview of the proposed method

The general framework of the proposed method is shown in Figure 1, which includes three major parts: (1) a pre-processing module for facial landmark detection; (2) a cascade regression module for estimating the facial landmarks; and (3) a ranking model for estimating the deviation between a predicted facial landmark and its true value.

Figure 1.

Overall procedure for the proposed method.

In the first step, for a test face image, it is necessary to crop and resize all faces to a unified scale by utilizing the pre-processing module for facial landmark detection. After the pre-processing procedure, we have a new face image with unified size and the face bounding box.

In the second step, a cascade regression module is used to estimate facial landmarks by starting from an initial shape. In this method, to overcome the face poses problem, N face shapes under different poses are set as the initializations, based on which facial landmarks are estimated independently with the cascade regression module. After this procedure, N facial landmark estimates corresponding to different initializations are obtained.

The final step is to select the best estimate, that is, the one closest to the real value, from the N facial landmark estimates by utilizing our ranking module. Since we do not have the ground truth landmark locations for any test image in a real application, the true deviation between the predicted facial landmarks and the ground truth landmarks cannot be obtained. Therefore, in this module, we establish two deviation evaluation models on the basis of appearance and shape information, separately, to estimate the deviation between the predicted facial landmarks and the ground truth. After combining the two deviation evaluation results, a ranking score can be obtained, and the estimate with the minimum score is output as the final landmark detection result.

Pre-processing for facial landmark detection

The pre-processing flow of the cascaded-regression facial landmark detection method is shown in Figure 2. Firstly, the face is detected using the OpenCV face detector. Then the image is cropped based on the bounding box proposed by the face detector. In the subsequent facial landmark detection procedure, the HOG feature is extracted within a fixed-size region around each landmark. Nonetheless, the content covered by this fixed-size region can differ, owing to the different pixel densities of images. For example, a 50 pixel × 50 pixel region may cover the whole face or only the mouth in different images. Thus, it is necessary to resize all faces to a unified scale by normalizing the images to the mean shape.

Figure 2.

Pre-processing steps for landmark prediction.

Cascaded regression

Algorithm 1 shows the main process of the cascaded-regression method. The n facial landmarks of an image can be expressed as a shape vector $S = (l_{x 1}, l_{y 1}, l_{x 2}, l_{y 2}, \dots, l_{x n}, l_{y n})$, whose entries are the coordinates of the facial landmarks, valued in pixels. Cascaded regression is formed by a cascade of T regressors {R¹, R², …, R^T} that start from a raw initial shape guess S₀ and progressively refine the estimate, outputting the final shape estimate S_T.

Algorithm 1 Cascaded regression for landmark prediction.
Input: Image I, initial shape S₀, regressors $\{R^{1}, R^{2}, \dots, R^{T}\}$
Output: Final estimate S_T
1: for K = 1 : T do
2: Compute HOG features $Φ (S^{K - 1}, I)$
3: Apply regression, using $Δ S^{K} = R^{K} \cdot Φ (S^{K - 1}, I)$
4: Update estimate, using $S^{K} = S^{K - 1} + Δ S^{K}$
5: end for

Specifically, the method starts with an image I and an initial shape estimate S₀. At each iteration, the regressor R^K computes a shape increment ΔS^K that updates the shape, until the last iteration T. At the K th iteration, the shape is estimated through

S^{K} = S^{K - 1} + Δ S^{K}

Δ S^{K} = R^{K} \cdot Φ (S^{K - 1}, I)

where I is the image and $Φ (S^{K - 1}, I)$ denotes the HOG features of the local patches centered at the current landmark locations. Figure 3 shows the facial landmark detection results in different iterations.

Figure 3.

Facial landmark detection results in different iterations.
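The loop in Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: `extract_features` is a hypothetical stand-in for the HOG descriptor Φ(S, I), and `toy_features` is a made-up placeholder (it simply returns the offset to a known target) so that the example runs without an image pipeline.

```python
import numpy as np

def cascaded_regression(image, s0, regressors, extract_features):
    """Refine an initial shape with a cascade of linear regressors.

    s0: initial shape vector (x1, y1, ..., xn, yn).
    regressors: list of matrices R^K, each of shape (2n, feature_dim).
    extract_features: hypothetical callable standing in for Phi(S, I);
        it must return a 1-D feature vector.
    """
    shape = np.asarray(s0, dtype=float).copy()
    for R in regressors:
        phi = extract_features(shape, image)  # Phi(S^{K-1}, I)
        delta = R @ phi                       # Delta S^K = R^K . Phi
        shape = shape + delta                 # S^K = S^{K-1} + Delta S^K
    return shape

def toy_features(shape, image):
    # Placeholder for HOG (assumption): here "image" is just a target
    # shape vector, and the feature is the offset to it plus a bias term.
    return np.append(np.asarray(image, dtype=float) - shape, 1.0)

# Toy usage: each stage moves the shape halfway towards the target.
target = np.array([8.0, 4.0])
half_step = np.hstack([0.5 * np.eye(2), np.zeros((2, 1))])
estimate = cascaded_regression(target, np.zeros(2), [half_step] * 3, toy_features)
```

With real data, the features would be recomputed from image patches around the current landmarks at every stage, which is why each stage needs its own regressor.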

Learning the multi-initialization cascaded-regression model

Generally, the traditional cascaded-regression method sets the initial shape as the mean face shape of the training images. On the one hand, the mean face shape is reasonably close to the facial shapes under different poses, so the corresponding model can cover most situations; on the other hand, accuracy is more or less sacrificed for each individual shape.

To solve this problem, the initialization should be discriminative for different facial shapes. It should be emphasized that the discriminative strategy is not used in the test phase, so a rough pose estimation method is proposed. Instead of training a classification model, it is preferable to utilize the triangle formed by the eyes and nose from the ground truth shape to choose the initialization, as shown in Figure 4. To estimate the linear regression function with parameter R in each iteration, a standard least squares formulation is employed. Specifically, given N training examples $\{(I_{i}, {\hat{S}}_{i})\}_{i = 1}^{N}$ and the currently estimated landmark locations S^K, the appearance and shape features Φ can be calculated.

Figure 4.

Selection of the initialization by utilizing the triangle formed by the eyes and nose.

To obtain each regressor R^K, efforts are made to minimize the difference between the ground truth ${\hat{S}}_{i}$ and the shape estimated in the current iteration. The HOG features are extracted around the current landmark locations and thus vary in each iteration. The regressor in each iteration is expressed as

R_{K}^{*} = \underset{R_{K}}{argmin} {‖ Δ S_{i}^{K} - R_{K} \cdot Φ (S_{i}^{K - 1}, I_{i}) ‖}_{2}

where I is the image, $Φ (S^{K - 1}, I)$ denotes the HOG features of the local patches centered at the current landmark locations, and $Δ S_{i}^{K}$ is the difference between the ground truth and the shape estimated in the current iteration.
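As a rough illustration of the least squares step, one stage regressor can be fitted from a matrix of training features and a matrix of target increments. The sketch below is an assumption-laden example rather than the authors' code: the function name `fit_stage_regressor` is hypothetical, the objective is summed over all training samples, and a small ridge term is added purely for numerical stability.

```python
import numpy as np

def fit_stage_regressor(features, targets, ridge=1e-6):
    """Least-squares fit of R such that targets ≈ features @ R.T.

    features: (N, d) matrix; row i is Phi(S_i^{K-1}, I_i).
    targets:  (N, 2n) matrix; row i is Delta S_i^K = S_hat_i - S_i^{K-1}.
    ridge: small regularizer (an assumption, not part of the paper).
    """
    d = features.shape[1]
    gram = features.T @ features + ridge * np.eye(d)
    r_transposed = np.linalg.solve(gram, features.T @ targets)  # (d, 2n)
    return r_transposed.T                                       # (2n, d)

# Toy usage with synthetic data generated from a known regressor.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
R_true = rng.normal(size=(3, 4))
Y = X @ R_true.T
R_fit = fit_stage_regressor(X, Y)
```

In a full training loop, the fitted regressor would be applied to every training shape and the features recomputed before fitting the next stage.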

Ranking model

To reduce the influence of the initialization, several different poses are set as the initializations, instead of using the mean face shape; face landmarks are then estimated independently from these initializations through cascaded regression. The goal of this facial landmark detection algorithm is to select the best estimate as the final output.

Figure 5 gives an example of the detection results with five initializations. Although there are significant differences between these results, it is still difficult for a computer vision system to distinguish which one is closer to the ground truth because we do not have the true landmark locations for any test image in a real application. Fortunately, as we can see in the leftmost and rightmost images of Figure 5, for the same landmark, the local appearance around each landmark location is widely divergent. This kind of difference can help us evaluate the deviation between the predicted facial landmarks and the ground truth landmarks by using the local appearance features, because human faces have similar local appearance features around the same landmarks. In this method, a deviation evaluation model based on two-level regression is proposed to establish the mapping relationship between the appearance features around the i th landmark and the deviation of a predicted value from its true value. A HOG feature descriptor is applied to represent the appearance features of the local patches centered at each landmark location. The model is explained in detail in the next section.

Figure 5.

An example of the detection results with five initializations.

Referring again to Figure 5, obviously, the central and rightmost images present the worst results. As the features of the local patches centered at the current landmark locations are totally different from those of the other three images, it is easy to exclude these two estimates.

Meanwhile, the remaining three images have similar features around the landmarks, and it is difficult to distinguish them by the appearance features alone. In the detection result in the fourth image, we can see that, although the features around the landmarks are roughly correct, the global face shape does not look like a normal human face.

For the second image, the landmark location of the nose does not match the 45° profile face. In fact, many studies have reported particular patterns for human face shapes under different facial expressions and poses. This provides an opportunity for us to calculate the difference between the typical face shape pattern and each face landmark estimate. This strategy can also help in estimating the deviation of a predicted landmark value from its true value.

In this method, a deviation evaluation model based on a global shape model is proposed to establish the mapping relationship between the face shape and the deviation of a predicted landmark value from its true value. The evaluation model needs a prior face shape to capture face shape patterns with different poses. In this method, an effective model based on conditional Boltzmann machines can be utilized to model the face shape pattern, and then calculate the reconstruction errors that measure the difference between the typical face shape pattern and each face landmark estimate. The model is explained in detail in the next but one section.

To combine the two deviation evaluation models, a score variable is defined as P, which measures the level to which the landmark estimate approximates the ground truth to rank each estimate. The ranking score prediction depends on the local appearance and the current shape. Then the score P can be calculated as

P = Ψ (S, I) + λ Θ (S)

where I is the image, S denotes the current predicted landmark value, Ψ(S, I) represents the deviation evaluation model based on appearance features, and the deviation evaluation model based on global shape is expressed by Θ(S). The weight of each model is determined by λ. The two parts of the function can be trained separately.
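The combination of the two deviation models and the final selection can be sketched as follows. The function and argument names are hypothetical; `appearance_score` and `shape_score` stand in for Ψ(S, I) and Θ(S), and the estimate with the minimum score is returned, as described in the overview.

```python
def rank_estimates(estimates, appearance_score, shape_score, lam=1.0):
    """Score each shape estimate with P = Psi + lam * Theta and pick the minimum.

    estimates: list of candidate shapes (one per initialization).
    appearance_score, shape_score: callables standing in for Psi and Theta
        (assumed to be already bound to the image where needed).
    Returns the index of the best estimate and the list of all scores.
    """
    scores = [appearance_score(s) + lam * shape_score(s) for s in estimates]
    best_index = min(range(len(scores)), key=scores.__getitem__)
    return best_index, scores

# Toy usage with made-up scoring functions on 2-element "shapes".
candidates = [[0, 0], [1, 1], [3, 3]]
best, all_scores = rank_estimates(
    candidates,
    appearance_score=lambda s: abs(s[0] - 1),
    shape_score=lambda s: abs(s[1] - 1),
    lam=0.5,
)
```

Because the two terms are simply added, the two models can be trained and tuned separately, with λ controlling how strongly the shape prior influences the ranking.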

Deviation evaluation model based on appearance feature

To encode the appearance information, the HOG features of the local patches centered at the i th landmark locations S are utilized and denoted as Φ(S, I). The deviation evaluation model Ψ(S, I) based on appearance feature can be expressed as

Ψ (S, I) = {‖ T^{K} \cdot Φ (S, I) ‖}_{2}

where I is the image, and Φ(S, I) denotes the HOG features of the local patches centered at the current landmark locations. A linear regression function with parameter T_K is utilized to predict the deviation of a predicted value from its true value based on the appearance features Φ(S, I).

In experiments for this research, it was found that a single regressor can effectively exclude estimates that are widely different from the ground truth. However, it performs poorly in ranking similar estimates. Although the face shape model is utilized to help enhance the robustness of the ranking model, there are still a number of failure cases. This is because it is too difficult for a single regressor to cover all kinds of estimates.

In fact, a large amount of training data that are widely different from the ground truth should be utilized to cover as many kinds of feature as possible. The advantage is that any kind of failed estimate can be excluded even if it has a perfect shape; the disadvantage is that estimation accuracy is sacrificed. It is important to acquire a good regressor that can accurately cover all kinds of estimates. For this purpose, the regressor T is acquired through a second-level regression.

Unlike in cascaded regression, where the HOG features are recomputed from S^{K−1} at every iteration, the HOG features here are the same at both levels, because they are extracted once around the landmark estimate S. The Ψ(S, I) regression function can be changed to

Ψ (S, I) = {‖ T^{1} \cdot Φ (S, I) ‖}_{2} + {‖ T^{2} \cdot Φ (S, I) ‖}_{2}

where I is the image and Φ(S, I) denotes the HOG features of the local patches centered at the current landmark locations. The two-level regressors, T₁ and T₂, should be trained on different datasets. To estimate the linear regression function with parameters T₁ and T₂, a standard least squares formulation is also utilized.

Before introducing the training process, it is necessary to define the ground truth feature score

\hat{P} = \sum_{i = 1}^{l} {‖ {\hat{s}}_{i} - s_{i} ‖}_{2}

where s_i is the i th landmark of the corresponding estimate and ${\hat{s}}_{i}$ is the corresponding ground truth landmark. As mentioned, training data very different from the ground truth should be utilized to cover as many kinds of feature as possible.

Two training sets are prepared for the two-level regression. To estimate T₁, the proposed cascaded-regression model is utilized to detect all the training data for the cascaded regression R_K under N different initializations, and the results are regarded as the first training set. Then the feature scores are calculated through equation (4) and the two lowest scores are rejected. The corresponding shape estimate is put into the second training set.

Again, given the training image and the currently estimated landmark locations S, we can calculate the appearance and shape features Φ and the ground truth score. Then T could be estimated as

T_{K}^{*} = \underset{T_{K}}{argmin} {‖ \hat{P} - T_{K} \cdot Φ (S, I) ‖}_{2}

where I is the image, Φ(S, I) denotes the HOG features of the local patches centered at the current landmark locations, and $\hat{P}$ is the ground truth feature score.
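The ground truth feature score and the least squares fit of a score regressor can be illustrated with the short sketch below. It is a simplified example under stated assumptions: the function names are hypothetical, the features are random stand-ins for HOG descriptors, and a single regressor is fitted, whereas the method described above trains T₁ and T₂ on two different training sets.

```python
import numpy as np

def ground_truth_score(estimate, truth):
    """P_hat: sum over landmarks of the Euclidean distance to the ground truth.

    Both inputs are flat shape vectors (x1, y1, ..., xl, yl).
    """
    est = np.asarray(estimate, dtype=float).reshape(-1, 2)
    gt = np.asarray(truth, dtype=float).reshape(-1, 2)
    return float(np.linalg.norm(gt - est, axis=1).sum())

def fit_score_regressor(features, scores):
    """Least-squares fit of t such that scores ≈ features @ t."""
    t, *_ = np.linalg.lstsq(features, scores, rcond=None)
    return t

# Toy usage: one landmark is off by (3, 4), the other is exact.
score = ground_truth_score([0, 0, 1, 1], [3, 4, 1, 1])

# Synthetic regression problem with a known weight vector (assumption).
rng = np.random.default_rng(0)
phi = rng.normal(size=(20, 3))
t_true = np.array([1.0, -2.0, 0.5])
t_fit = fit_score_regressor(phi, phi @ t_true)
```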

Deviation evaluation model based on global shape

The deviation evaluation model Θ(S) based on global shape can be expressed as

Θ (S) = \sum_{i = 1}^{l} {‖ S_{i} - E (S_{i}) ‖}_{2}

where S_i is the i th landmark of the corresponding estimate and $∥ S_{i} - E (S_{i}) ∥_{2}$ is the reconstruction error, which is calculated by the face shape model to measure the difference between the current shape and the face shape pattern.

Although human faces are different in both the size and shape of facial components, such as the size of eyes and the contour of mouth, the position distributions of facial feature points are similar. Therefore, there are many patterns for human face shapes. In fact, if the prior face shape pattern function E(S_i) can be obtained, the difference between the ground truth value and the estimates can be measured, as shown in Figure 6.

Figure 6.

Reconstruction errors.

Prior face shape models have been proven helpful in improving facial point detection accuracy, as they make it possible to evaluate the quality of the estimates and constrain and correct the corruptions. However, building a strong probabilistic model of face shapes is still challenging because of its complex properties.

Recent research has shown that RBMs can also be used to model shapes.^14,17 A deep Boltzmann machine (DBM) has been constructed to capture face shape variations caused by facial expressions for near-frontal views, and facial points can be tracked robustly and accurately in case of significant changes in facial expressions and poses using this model.^12,13 Nevertheless, this method is too complex and difficult to learn. In this article, a model based on a conditional RBM¹⁴ is advocated and utilized to construct a face shape prior model based on the landmark labels of the training data.

As shown in Figure 7, the proposed face shape model consists of two parts: pose-DBM and front-RBM. The model describes the joint probability distribution of the visible unit vector S and hidden unit vector H₁, with a conditional RBM, which depends on the hidden input Z₂, where S_i is the facial landmark estimate achieved by using the cascaded-regression method and Z₁, Z₂, and H₁ are binary hidden nodes. In pose-DBM, a two-layer DBM extracts the pose information, and the hidden nodes Z₂ represent the head pose variations of the current face shape. In front-RBM, H₁ captures the shape variations of the estimate.

Figure 7.

Proposed face shape model.

This face shape model is built to represent the function E(S_i). Based on the pretrained pose-DBM and the front-RBM, a Markov-Chain-Monte-Carlo-based Gibbs sampling method is used to generate samples for the reconstruction R_i according to the face shape estimate S_i (Figure 8). Then the reconstruction errors can be calculated as

{err}_{i} = {‖ S_{i} - R_{i} ‖}_{2}

Figure 8.

Markov Chain Monte Carlo Gibbs sampling scheme.

Figure 9 presents the module that evaluates the quality of the face shape estimate by calculating the reconstruction R_i from the shape S_i

S_{i} = [S_{1, x}, S_{1, y}, S_{2, x}, S_{2, y}, \dots, S_{n, x}, S_{n, y}]

where S_i are the states of visible nodes, Z ∈ {0, 1}^m are the states of hidden nodes, and parameters

θ = {W_{1 i j}, W_{2 j k}, a_{1 i}, a_{2 j}, b_{1 j}, b_{2 k}}

contain the weight matrix, as well as bias vectors for visible and hidden nodes. If the visible layer is known, the activation function for the hidden layer Z₁,Z₂ can be written as

P (Z_{1 j} = 1 | S_{i}) = \frac{1}{1 + \exp (- \sum_{i} W_{1 i j} S_{i} - b_{1 j})}

P (Z_{2 k} = 1 | Z_{1 j}) = \frac{1}{1 + \exp (- \sum_{j} W_{2 j k} Z_{1 j} - b_{2 k})}

where the symmetric weight W_1ij connects visible nodes S to hidden nodes Z₁, W_2jk connects hidden nodes Z₁ to hidden nodes Z₂, and b_1j and b_2k are the biases of hidden nodes Z₁ and Z₂, respectively.

Figure 9.

Graphical depiction of the face shape estimate evaluation module.

Then the hidden layer H₁ for the front-RBM can be calculated according to the rules of conditional Boltzmann machines. The hidden units are determined by both the input received from the current observation and the input Z₂. The effect of Z₂ on each hidden unit can be viewed as a dynamic bias

{\hat{b}}_{j} = b_{j} + \sum_{k = 1} B_{j k} Z_{2 k}

where the symmetric weight B_jk connects hidden unit Z₂ to hidden unit H and b_j is the bias of hidden nodes H.

The hidden layer H₁ can be written as

P (H_{1 j} = 1 | S_{i}, Z_{2 k}) = \frac{1}{1 + \exp (- \sum_{i} W_{3 i j} S_{i} - {\hat{b}}_{j})}

where the symmetric weight W_3ij connects visible unit S to hidden unit H and ${\hat{b}}_{j}$ is the dynamic bias. If the hidden layer is known, the activation function for the visible layer can be expressed as

{\hat{a}}_{i} = a_{i} + \sum_{k = 1} A_{i k} Z_{2 k}

P (S_{i} | H_{1}, Z_{2 k}) = N (S_{i} | μ_{i}, σ_{i}^{2})

μ_{i} = {\hat{a}}_{i} + σ_{i}^{2} \sum_{j} W_{3 i j} H_{j}

where the symmetric weight A_ik connects hidden unit Z₂ to visible unit S, a_i is the bias of visible unit S, and $σ_{i}^{2}$ is the variance of the Gaussian noise for visible unit S. The result $P (S_{i} | H_{1}, Z_{2 k})$ gives the reconstruction of shape S_i.
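One pass through the shape model, from a shape estimate to its reconstruction and reconstruction error, can be sketched as below. This is a simplified illustration, not the authors' procedure: it propagates activation probabilities instead of drawing binary Gibbs samples, it returns the Gaussian mean as the reconstruction, and the parameter dictionary with keys such as `"W1"`, `"B"`, and `"sigma2"` is a hypothetical container chosen for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reconstruct_shape(s, p):
    """Deterministic (mean-field style) reconstruction of a shape vector s.

    p holds hypothetical parameter arrays: W1 (v, m1), b1 (m1,), W2 (m1, m2),
    b2 (m2,), W3 (v, h), B (h, m2), A (v, m2), a (v,), b (h,), sigma2 (scalar).
    """
    z1 = sigmoid(s @ p["W1"] + p["b1"])    # P(Z1 = 1 | S)
    z2 = sigmoid(z1 @ p["W2"] + p["b2"])   # P(Z2 = 1 | Z1), pose code
    b_hat = p["b"] + p["B"] @ z2           # dynamic bias of the hidden units
    h1 = sigmoid(s @ p["W3"] + b_hat)      # P(H1 = 1 | S, Z2)
    a_hat = p["a"] + p["A"] @ z2           # dynamic bias of the visible units
    return a_hat + p["sigma2"] * (p["W3"] @ h1)  # Gaussian mean mu

def reconstruction_error(s, p):
    """Euclidean distance between the shape and its reconstruction."""
    return float(np.linalg.norm(s - reconstruct_shape(s, p)))

# Toy usage: with all weights zero, the reconstruction equals the visible bias.
v, m1, m2, h = 4, 3, 2, 3
params = {
    "W1": np.zeros((v, m1)), "b1": np.zeros(m1),
    "W2": np.zeros((m1, m2)), "b2": np.zeros(m2),
    "W3": np.zeros((v, h)), "B": np.zeros((h, m2)), "A": np.zeros((v, m2)),
    "a": np.array([1.0, 2.0, 3.0, 4.0]), "b": np.zeros(h), "sigma2": 1.0,
}
err_same = reconstruction_error(np.array([1.0, 2.0, 3.0, 4.0]), params)
err_far = reconstruction_error(np.array([4.0, 6.0, 3.0, 4.0]), params)
```

A shape that matches the learned pattern yields a small error, while an implausible shape is pulled back towards the pattern and therefore yields a large one.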

The parameters $θ = \{W_{1 i j}, W_{2 j k}, a_{1 i}, a_{2 j}, b_{1 j}, b_{2 k}\}$ for pose-DBM are learned following the method of Salakhutdinov and Hinton.¹⁸ The training data $\{{\hat{S}}_{i}\}_{i = 1}^{N}$ include the face shape S under various poses. The learning procedure is described in detail in Algorithm 2.

Algorithm 2 Parameter estimation.
Input: Training data $\{{\hat{S}}_{i}\}_{i = 1}^{N}$; learning rate ∊; maximum iterations maxepoch
Output: $θ = \{W_{3 i j}, A_{k}, B_{k}, a_{i}, b_{j}\}$
1: Train the pose-DBM to get the parameters $θ = \{W_{1 i j}, W_{2 j k}, a_{1 i}, a_{2 j}, b_{1 j}, b_{2 k}\}$
2: Initialize W_3ij, A_k, B_k randomly; initialize a and b as zero vectors
3: Initialize learning rate ∊
4: for epoch = 1, 2, …, maxepoch do
5: Sample from $P (H_{1}^{+} | S_{i}^{+}, Z_{k}^{+})$ by using equation (12), equation (13), and equation (15)
6: Sample from $P (S_{i}^{-} | Z_{k}^{-}, H_{1}^{-})$ by using equation (16), equation (17), and equation (18)
7: Update:
W_t+1 ← W_t + ∊(〈S_iH_j〉_data − 〈S_iH_j〉_recon)
A_t+1 ← A_t + ∊(〈S_iZ_k〉_data − 〈S_iZ_k〉_recon)
B_t+1 ← B_t + ∊(〈H_jZ_k〉_data − 〈H_jZ_k〉_recon)
a_t+1 ← a_t + ∊(〈S_i〉_data − 〈S_i〉_recon)
b_t+1 ← b_t + ∊(〈H_j〉_data − 〈H_j〉_recon)
8: end for

Focusing on the learning process for conditional Boltzmann machines, a contrastive divergence algorithm¹⁹ is chosen to help learn this model. The parameters θ are learned by maximizing the log-likelihood function:

\hat{θ} = \underset{θ}{argmax} \log P (S | θ, Z)

The contrastive divergence algorithm¹⁹ gives the following parameter updates

Δ W_{3 i j} = {〈 S_{i} H_{j} 〉}_{data} - {〈 S_{i} H_{j} 〉}_{recon}

Δ A_{k i} = {〈 S_{i} Z_{k} 〉}_{data} - {〈 S_{i} Z_{k} 〉}_{recon}

Δ B_{k j} = {〈 H_{j} Z_{k} 〉}_{data} - {〈 H_{j} Z_{k} 〉}_{recon}

Δ a_{i} = {〈 S_{i} 〉}_{data} - {〈 S_{i} 〉}_{recon}

Δ b_{j} = {〈 H_{j} 〉}_{data} - {〈 H_{j} 〉}_{recon}

where 〈〉_data is an expectation with respect to the data distribution and 〈〉_recon is an expectation with respect to the distribution of “reconstructed” data. The reconstructed data are obtained by alternately sampling the hidden and visible units K times.
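A single contrastive divergence step with one reconstruction (K = 1) could look like the following sketch. Several simplifications are assumptions made for this example only: the pose units Z are treated as fixed inputs, the visible units use unit variance and are reconstructed by their mean, the hidden statistics use probabilities rather than binary states, and the function name `cd1_step` and the parameter dictionary are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(S, Z, p, lr, rng):
    """One CD-1 update on a batch.

    S: (N, v) training shapes; Z: (N, m) pose activations from the pose-DBM.
    p: dict with W3 (v, h), A (v, m), B (h, m), a (v,), b (h,); updated in place.
    """
    n = S.shape[0]
    # Positive phase: hidden probabilities and a binary sample given the data.
    h_data = sigmoid(S @ p["W3"] + Z @ p["B"].T + p["b"])
    h_sample = (rng.random(h_data.shape) < h_data).astype(float)
    # Negative phase: reconstruct the visibles, then recompute the hiddens.
    S_rec = h_sample @ p["W3"].T + Z @ p["A"].T + p["a"]
    h_rec = sigmoid(S_rec @ p["W3"] + Z @ p["B"].T + p["b"])
    # Updates: data statistics minus reconstruction statistics.
    p["W3"] += lr * (S.T @ h_data - S_rec.T @ h_rec) / n
    p["A"] += lr * (S.T @ Z - S_rec.T @ Z) / n
    p["B"] += lr * (h_data.T @ Z - h_rec.T @ Z) / n
    p["a"] += lr * (S - S_rec).mean(axis=0)
    p["b"] += lr * (h_data - h_rec).mean(axis=0)
    return p

# Toy usage starting from all-zero parameters.
rng = np.random.default_rng(0)
v, h, m, N = 4, 3, 2, 5
params = {"W3": np.zeros((v, h)), "A": np.zeros((v, m)), "B": np.zeros((h, m)),
          "a": np.zeros(v), "b": np.zeros(h)}
params = cd1_step(np.ones((N, v)), np.ones((N, m)), params, lr=0.1, rng=rng)
```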

Experiment results

Implementation details

For learning the face shape model, the training sets of the HELEN and LFPW datasets and images from part of the Multi-PIE database are used. The Multi-PIE database²⁰ was collected in a laboratory environment, and contains 337 subjects imaged from 15 viewpoints and under 19 illumination conditions with six expressions: neutral, smiling, surprised, squinting, disgusted, and screaming. The ground truth coordinates of the facial landmarks in these datasets were annotated manually, and coordinate values are measured in pixels.

It is easy to calculate the distance between the landmark estimate and the ground truth coordinates. We call this distance the “error”; obviously, the smaller the error, the more accurate the detection result. However, the faces can be dramatically different, owing to scale variations in the images; for example, for a 50 pixel × 50 pixel patch, we may retrieve the whole face from a small face and only the nose from a large face. Therefore, it is necessary to normalize the error using the distance between the pupils (the centers of the eyes), which we call the inter-ocular distance. The inter-ocular distance error is the distance between the landmark estimate and the ground truth, normalized by the inter-ocular distance, and is defined as

{error}_{i} = \frac{{‖ p_{d}^{i} - p_{m}^{i} ‖}_{2}}{d_{IOD}}

where $p_{m}^{i}$ represents the manually labeled coordinate of point i, $p_{d}^{i}$ indicates the detection result of point i, and d_IOD represents the distance between the eye centers. In this way, the error measure does not depend on the size of the image.
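The normalized error is straightforward to compute. The sketch below uses only the Python standard library; the function name and the choice to also return the mean over all points are assumptions made for illustration, since the formula above is defined per point.

```python
import math

def inter_ocular_errors(detected, labeled, left_eye, right_eye):
    """Per-point errors normalized by the inter-ocular distance, plus their mean.

    detected, labeled: lists of (x, y) points in pixels.
    left_eye, right_eye: (x, y) eye centers from the ground truth.
    """
    d_iod = math.dist(left_eye, right_eye)
    errors = [math.dist(p_d, p_m) / d_iod for p_d, p_m in zip(detected, labeled)]
    return errors, sum(errors) / len(errors)

# Toy usage: the first point is off by 5 pixels, the eyes are 10 pixels apart.
per_point, mean_error = inter_ocular_errors(
    detected=[(0, 0), (10, 0)],
    labeled=[(3, 4), (10, 0)],
    left_eye=(0, 0),
    right_eye=(10, 0),
)
```

Doubling the image resolution doubles both the numerator and the denominator, so the value stays the same, which is the scale invariance the text describes.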

Performance evaluation on AFW dataset

The AFW dataset¹⁶ was randomly sampled from Flickr images. This dataset consists of 337 face images with large variations in both face viewpoint and appearance (for example, aging, sunglasses, make-up, skin color, and expression). Each face is labeled with 68 landmarks. The training part of the experiment used the training images of the LFPW and HELEN datasets, with 2811 samples in total.

Figure 10 and Table 1 show the comparison with the results of the supervised descent method, which is one of the most classic cascaded-regression-based approaches. In fact, our multiple-initializations-based method is an improvement on the supervised descent method. We use the same HOG feature and a similar cascaded-regression framework. The primary difference is that we propose a ranking strategy to select the best shape estimate. To prove the validity of our algorithm, the supervised descent method is therefore the obvious choice for comparison. As can be seen, based on the limited comparable results on the AFW dataset, the performance of the proposed method on this challenging dataset is better than that of the state-of-the-art supervised descent method. Furthermore, the proposed face shape model is effective in helping to enhance the robustness of the facial landmark detection model. We also divided the faces in the AFW dataset manually into indoor (219 faces) and outdoor (118 faces) images. We performed experiments on the indoor and outdoor subsets separately to determine whether there are differences when using the method to process images taken indoors and outdoors. Figure 10 shows the resulting images compared with those obtained using the supervised descent method and the ground truth; as we can see, our detection method has significant advantages over the supervised descent method.

Figure 10.

Comparison results on AFW. Top row: supervised descent method; middle row: proposed method; bottom row: ground truth.

Table 1.

Comparison using AFW dataset.

Method	Inter-ocular distance error	Inter-ocular distance error (indoor)	Inter-ocular distance error (outdoor)
Supervised descent method⁵	7.75	7.98	7.31
Our method (without face shape)	7.13	7.35	6.72
Our method (with face shape)	6.88	7.11	6.45

AFW: Annotated Faces in the Wild.

Table 2 lists the mean error for different numbers of initializations. The error decreases gradually as the number of initializations increases, which indicates that the multi-initialization strategy is effective for images with a large variety of poses. However, although using several initial poses improves accuracy, it also increases the computational cost. In this experiment, we find that most estimates can be eliminated in the early iterations of the cascaded regression, which means that the number of iterations can differ between initializations in the cascaded-regression module. For instance, in Figure 5, it is easy to eliminate the third, fourth, and fifth images at the second iteration, but the first two images cannot be distinguished until the fourth iteration. Inspired by this, instead of using a fixed number of six iterations, as in the supervised descent method, we vary the number of iterations within the range of two to four. Table 2 also compares the detection speeds for different numbers of initializations on the AFW dataset. Considering both the errors and the computational costs, five initializations is the best choice. Our results were obtained using an i7-4770 CPU and MATLAB.
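The early-elimination idea in the preceding paragraph can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `step` stands in for one cascaded-regression stage, `score` stands in for the ranking probability, and the pruning threshold `keep_ratio` is a hypothetical parameter that does not appear in the article.

```python
def cascade_with_pruning(initial_shapes, step, score,
                         min_iters=2, max_iters=4, keep_ratio=0.8):
    """Refine several initializations, dropping weak candidates early.

    step(shape, iteration) -> refined shape (one regression stage, assumed).
    score(shape) -> positive confidence (stand-in for the ranking probability).
    From iteration min_iters onward, candidates scoring below
    keep_ratio * best_score are eliminated (hypothetical rule).
    Returns the best surviving shape and the number of stage evaluations.
    """
    candidates = list(initial_shapes)
    if not candidates:
        raise ValueError("at least one initialization is required")
    stage_evaluations = 0
    for iteration in range(1, max_iters + 1):
        candidates = [step(shape, iteration) for shape in candidates]
        stage_evaluations += len(candidates)
        if iteration >= min_iters and len(candidates) > 1:
            scores = [score(shape) for shape in candidates]
            best = max(scores)
            candidates = [shape for shape, s in zip(candidates, scores)
                          if s >= keep_ratio * best]
    return max(candidates, key=score), stage_evaluations


# Toy problem: a "shape" is a scalar offset from the true position 0.
def halve(shape, iteration):
    return shape * 0.5           # each stage halves the remaining offset


def confidence(shape):
    return 1.0 / (1.0 + abs(shape))   # higher when closer to the target


best_shape, evaluations = cascade_with_pruning(
    [8.0, 1.0, -6.0, 0.5], halve, confidence
)
print(best_shape, evaluations)
```

In this toy run, two of the four initializations are discarded after the second stage, so 12 stage evaluations are needed instead of the 16 that a fixed four-stage cascade would require for all candidates.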

Table 2.

Comparison of detection speed on AFW dataset.

Method	Inter-ocular distance error	Speed (fps)
Supervised descent method⁵	7.75	32
Our method (1 initialization)	7.82	36
Our method (2 initializations)	7.46	25
Our method (3 initializations)	7.31	19
Our method (4 initializations)	7.05	15
Our method (5 initializations)	6.98	12
Our method (6 initializations)	6.92	9
Our method (7 initializations)	6.88	7
Our method (8 initializations)	6.85	5

Performance evaluation on IBUG dataset

The IBUG dataset is the most challenging subset of the 300 Faces In-the-Wild Challenge (300-W) dataset; it was created for facial landmark detection in the real world.¹⁵ All the images in this database contain faces with extreme poses and expressions. However, it only provides training images. For comparison with recent work, the training images of the LFPW, HELEN, and AFW databases, with 3148 samples in total, were chosen for the training part of the experiment, following the experimental method of Ren et al.⁸ The testing set consists of 135 images altogether. Landmark annotations in this experiment followed the Multi-PIE²⁰ 68-point mark-up.

The comparison is shown in Table 3. The results of the supervised descent method, explicit shape regression method, robust cascaded pose regression method, and local binary features method are quoted from Ren et al.⁸ Table 3 shows that the proposed method achieves a lower error on the challenging IBUG dataset than the other state-of-the-art methods. Because this facial landmark detection method focuses on handling images with different head poses, it is not optimized for approximately frontal face images. Moreover, local binary features are a much stronger feature than HOG descriptors; even so, our HOG-based method still achieves a better result. Therefore, it can be concluded that the proposed method has a strong ability to handle difficult poses on this challenging dataset.

Table 3.

Comparison on the IBUG dataset.

Method	Inter-ocular distance error	Speed (fps)
Explicit shape regression⁶	17.00^a	20^a
Robust cascaded pose regression⁹	15.50^a	12^a
Supervised descent method⁵	15.40^a	32
Local binary features⁸	11.98^a	320^a
Our method	11.52	12

^aReported results from the original articles.

Figure 11 compares the resulting images. The supervised descent method is one of the most classic cascaded-regression-based approaches, and its code and parameters are publicly available. Since we use the same HOG feature as, and a similar cascaded-regression framework to, the supervised descent method, this method was selected for comparison. Ren et al.⁸ have published their well-trained model, with results obtained using the training images of the LFPW, HELEN, and AFW datasets (3148 samples in total), so we can obtain the resulting images shown in the second row of Figure 11. For the other algorithms listed in Table 3, we could not guarantee that we would obtain results similar to those in the original articles, because the parameters of these models were not published. Therefore, to be fair, we do not show resulting images for the other two algorithms listed in Table 3.

Figure 11.

Comparison results for the IBUG dataset. First row: supervised descent method; second row: local binary features method; third row: proposed method; fourth row: ground truth.

Referring to Table 3, we can see that the local binary features method is the most efficient algorithm so far, because it uses local binary features that are learned in the training stage. Although this method is much faster than the conventional HOG feature method, we note that it needs to train hundreds of thousands of trees, which greatly increases the computational cost of the training stage. In addition, the speed of the local binary features method⁸ is quoted from the original article, where the method is implemented in C++, which is much faster than MATLAB. We believe that if we incorporate the learning-based feature into our framework in future, the accuracy and efficiency of our method can be improved considerably.

Performance evaluation on COFW dataset

As a more challenging dataset, the COFW dataset⁹ includes images with large variations in shape and occlusions due to differences in pose and expression. All 1007 images are annotated using 29 landmarks. A total of 1345 images (845 LFPW faces and 500 COFW faces) were selected for the training part of the experiment. The remaining 507 COFW faces were used for testing.

Table 4 exhibits the comparison between the proposed method and recent state-of-the-art methods, including the cascaded-regression copse method,²¹ supervised descent method,⁵ robust cascaded pose regression method,⁹ and explicit shape regression method.⁶ Compared with other cascaded-regression methods, the proposed multi-initialization scheme is very effective in further decreasing the mean error.

Table 4.

Comparison on COFW dataset.

Method	Inter-ocular distance error	Speed (fps)
Explicit shape regression⁶	11.20^a	20^a
Robust cascaded pose regression⁹	8.50^a	12^a
Supervised descent method⁵	7.70^a	32
Cascaded-regression copse²¹	7.30^a	21^a
Our method	7.38	12

^aReported results from the original articles. COFW: Caltech Occluded Faces in the Wild.

Because their model can iteratively predict the landmark occlusions, the result presented by Feng et al.²¹ is slightly better than ours (7.30 versus 7.38). The reason may lie in the fact that the average landmark occlusion for the COFW dataset is > 23%, for which no countermeasure is designed in our model.

Applications with robust facial landmark detection

Automatic facial landmark detection has been an active area in computer vision during the past decades. Although facial landmark detection algorithms for “in-the-wild” images have recently improved tremendously, the new applications that these methods make possible are rarely discussed. In this section, we discuss the impact of the facial landmark detection method on image-based applications in three major areas: (1) face recognition; (2) facial animation; (3) facial expression recognition.

Face recognition

For a robotic vision system, face recognition is the basis of access control and video face spotting. In addition, as social networking services develop, the automatic organization of photo collections based on accurate face recognition, which Google and Facebook have begun to apply, is becoming more and more popular. Formerly, with detected facial landmarks, human facial shapes and appearance features around key points could be utilized for facial attribute analysis. Recently, with the development of deep learning in computer vision, feature extraction and classification can proceed simultaneously, and it may seem that facial landmark detection is no longer needed.

However, in reality, face images are often taken under conditions such as different viewpoints, lighting, rotation, and occlusion, which significantly decrease the performance of face recognition. To overcome these problems, three successive steps are generally applied in face recognition: face detection, face alignment, and face recognition. In the first step, face detection finds the coarse location of faces in an image. In the second step, face alignment uses landmark localization for geometric face normalization, which can increase the performance of face recognition very effectively, owing to the geometric invariance of the human face. The importance of face alignment has been demonstrated;^1,2 a face alignment step clearly improves the performance of face recognition. It should be stressed that FaceNet,² one of the best face recognition systems, which was designed by Google and is based on deep learning, also greatly improves its performance when using extra face alignment. This demonstrates the importance of automatic facial landmark detection in face recognition.
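As an illustration of landmark-based geometric normalization, the following minimal sketch computes the similarity transform (rotation, uniform scale, and translation) that maps two detected eye centers onto fixed canonical positions, and applies it to landmark coordinates. The canonical eye positions and the function names are assumptions made for this example; the article does not specify a particular alignment procedure.

```python
def similarity_from_eyes(left_eye, right_eye,
                         target_left=(30.0, 40.0), target_right=(70.0, 40.0)):
    """Return (a, b) such that z -> a * z + b maps the eyes onto the targets.

    Points are treated as complex numbers x + iy, so multiplying by `a`
    rotates and scales, and adding `b` translates. The target positions
    are hypothetical canonical eye locations.
    """
    s1, s2 = complex(*left_eye), complex(*right_eye)
    t1, t2 = complex(*target_left), complex(*target_right)
    if s1 == s2:
        raise ValueError("eye centers must be distinct")
    a = (t2 - t1) / (s2 - s1)
    b = t1 - a * s1
    return a, b


def apply_similarity(a, b, points):
    """Apply z -> a * z + b to a list of (x, y) points."""
    mapped = [a * complex(x, y) + b for x, y in points]
    return [(z.real, z.imag) for z in mapped]


# Toy example: a face rotated by 90 degrees, eyes stacked vertically.
detected_landmarks = [(100.0, 100.0), (100.0, 140.0), (140.0, 120.0)]
a, b = similarity_from_eyes(detected_landmarks[0], detected_landmarks[1])
aligned = apply_similarity(a, b, detected_landmarks)
print(aligned[0], aligned[1])
```

In a full pipeline the same transform would be used to warp the image itself before feature extraction; here only the landmark coordinates are transformed to keep the example self-contained.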

Facial animation

The generation of realistic facial animations for virtual characters is frequently used in film and game production. With the development of virtual reality technology, this form of facial animation is becoming more and more popular through online games. Traditionally, to synthesize facial expressions convincingly, many sensors had to be attached to the human face, which greatly limits the scope of application. Recently, thanks to facial landmark detection algorithms, facial landmarks can be used as the input to drive the animation of a virtual character.²² In implementation, once the facial landmarks are obtained, the rigid transformation and facial expression parameters are calculated from the detected landmarks and then transferred to a digital avatar to generate the corresponding animation. Because our facial landmark detection algorithm is more robust for images of different face poses, we believe it can help to enhance the performance of facial animation generation.
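The separation of rigid motion from expression can be sketched in a simplified 2D setting. The example below fits a least-squares rotation and translation between a neutral template and detected landmarks, and treats the remaining per-point residuals as the non-rigid (expression) component. This is an illustrative simplification: the work cited above operates in 3D, and the template, function name, and coordinates here are hypothetical.

```python
import cmath
import math


def fit_rigid_2d(template, detected):
    """Least-squares 2D rotation + translation from template to detected.

    Returns (angle_in_radians, translation_as_complex, residuals), where
    residuals[i] is what remains of detected[i] after removing rigid motion.
    """
    if len(template) != len(detected) or len(template) < 2:
        raise ValueError("need at least two corresponding points")
    src = [complex(x, y) for x, y in template]
    dst = [complex(x, y) for x, y in detected]
    src_mean = sum(src) / len(src)
    dst_mean = sum(dst) / len(dst)
    # Sum of conj(centered source) * centered target; its phase is the
    # rotation angle that minimizes the squared alignment error.
    correlation = sum((s - src_mean).conjugate() * (d - dst_mean)
                      for s, d in zip(src, dst))
    if correlation == 0:
        raise ValueError("rotation is undetermined for these points")
    rotation = correlation / abs(correlation)
    translation = dst_mean - rotation * src_mean
    residuals = [d - (rotation * s + translation) for s, d in zip(src, dst)]
    return cmath.phase(rotation), translation, residuals


# Toy example: the template rotated by 90 degrees and shifted by (5, 5).
template = [(0.0, 0.0), (2.0, 0.0), (1.0, 2.0)]
detected = [(5.0, 5.0), (5.0, 7.0), (3.0, 6.0)]
angle, translation, residuals = fit_rigid_2d(template, detected)
print(math.degrees(angle), translation, max(abs(r) for r in residuals))
```

Because the toy input contains only rigid motion, the residuals vanish; with a real expression, the nonzero residuals would be the quantities passed on to drive the avatar's expression parameters.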

Facial expression recognition

Facial expression recognition is another important application of our robust facial landmark detection method. With the detected facial landmarks, the human facial shape and the appearance features around the key points can be utilized for facial expression analysis. In real-world applications, people tend to move their heads when they make expressions. Furthermore, depending on the camera position, facial images can be taken from multiple views. For these reasons, facial expression recognition should be robust to multiple face poses. Because our algorithm can robustly handle images of faces in different poses, we believe it can help to enhance the performance of facial expression recognition.

Conclusions

In this article, an initialization based on multiple poses for robust facial landmark detection is proposed. Several different poses are set as initializations to estimate face landmarks independently, using a cascaded-regression method. To pick out the best estimate, the candidates are ranked by a probability that is calculated from appearance and shape information in each iteration. An RBM-based face shape model is trained to improve the robustness of the ranking. Finally, experiments using three challenging datasets show that the proposed method performs better than state-of-the-art methods.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

This project was funded in part by a scholarship from the China Scholarship Council, whose support we gratefully acknowledge. Furthermore, this work was sponsored by the National Natural Science Foundation of China (Grant no. 61401117). In addition, Yongqiang Li is partly supported by the National Natural Science Foundation of China (no. 61402129) and Postdoctoral Foundation Projects (nos. LBH-Z14090 and 2015M571417).

References

1. Köstinger, Wohlhart, Roth. Annotated Facial Landmarks in the Wild: a large-scale, real-world database for facial landmark localization. 2011 IEEE international conference on computer vision workshops, Barcelona, Spain, 6–13 November 2011, pp.2144–2151. Piscataway, NJ: IEEE.

2. Schroff, Kalenichenko, Philbin. FaceNet: a unified embedding for face recognition and clustering. 2015 IEEE conference on computer vision and pattern recognition, Boston, MA, USA, 7–12 June 2015, pp.815–823. Piscataway, NJ: IEEE.

3. Cristinacce, Cootes. Boosted regression active shape models. 2007 British machine vision conference, University of Warwick, UK, 10–13 September 2007, pp.880–889. Durham: BMVA.

4. Matthews, Baker. Active appearance models revisited. Int J Comput Vision 2004; 60: 135–164.

5. Xiong, De la Torre. Supervised descent method and its applications to face alignment. 2013 IEEE conference on computer vision and pattern recognition, Portland, OR, USA, 23–28 June 2013, pp.532–539. Piscataway, NJ: IEEE.

6. Cao, Wei, Wen. Face alignment by explicit shape regression. 2012 IEEE conference on computer vision and pattern recognition, Providence, RI, USA, 16–21 June 2012, pp.2887–2894. Piscataway, NJ: IEEE.

7. Dollár, Welinder, Perona. Cascaded pose regression. 2010 IEEE conference on computer vision and pattern recognition, San Francisco, CA, USA, 13–18 June 2010, pp.1078–1085. Piscataway, NJ: IEEE.

8. Ren, Cao, Wei. Face alignment at 3000 fps via regressing local binary features. 2014 IEEE conference on computer vision and pattern recognition, Columbus, OH, USA, 23–28 June 2014, pp.1685–1692. Piscataway, NJ: IEEE.

9. Burgos-Artizzu, Perona, Dollár. Robust face landmark estimation under occlusion. 2013 IEEE international conference on computer vision, Sydney, Australia, 1–8 December 2013, pp.1513–1520. Piscataway, NJ: IEEE.

10. Cui, Zhang, Guo. Robust facial landmark localization using classified random ferns and pose-based initialization. Signal Process 2015; 110: 46–53.

11. Yang, Jia. Robust face alignment under occlusion via regional predictive power estimation. IEEE Trans Image Process 2015; 24: 2393–2403.

12. Wang. Facial feature tracking under varying facial expressions and face poses based on restricted Boltzmann machines. 2013 IEEE conference on computer vision and pattern recognition, Portland, OR, USA, 23–28 June 2013, pp.3452–3459. Piscataway, NJ: IEEE.

13. Discriminative deep face shape model for facial point detection. Int J Comput Vision 2015; 113: 37–53.

14. Taylor, Sigal, Fleet. Dynamical binary latent variable models for 3D human pose tracking. 2010 IEEE conference on computer vision and pattern recognition, San Francisco, CA, USA, 13–18 June 2010, pp.631–638. Piscataway, NJ: IEEE.

15. Sagonas, Tzimiropoulos, Zafeiriou. 300 faces in-the-wild challenge: the first facial landmark localization challenge. 2013 IEEE international conference on computer vision workshops, Sydney, Australia, 2–8 December 2013, pp.397–403. Piscataway, NJ: IEEE.

16. Zhu, Ramanan. Face detection, pose estimation, and landmark localization in the wild. 2012 IEEE conference on computer vision and pattern recognition, Providence, RI, USA, 16–21 June 2012, pp.2879–2886. Piscataway, NJ: IEEE.

17. Ali Eslami, Heess, Winn. The shape Boltzmann machine: a strong model of object shape. 2012 IEEE conference on computer vision and pattern recognition, Providence, RI, USA, 16–21 June 2012, pp.406–413. Piscataway, NJ: IEEE.

18. Salakhutdinov, Hinton. Deep Boltzmann machines. Proceedings of the international conference on artificial intelligence and statistics, Clearwater Beach, FL, USA, 16–18 April 2009, pp.448–55. Brookline, MA: Microtome Publishing.

19. Hinton. Training products of experts by minimizing contrastive divergence. Neural Comput 2002; 14: 1711–800.

20. Gross, Matthews, Cohn. Multi-PIE. Image Vision Comput 2010; 28: 807–813.

21. Feng, Huber, Kittler. Random cascaded-regression copse for robust facial landmark detection. IEEE Signal Process Lett 2015; 22: 76–80.

22. Cao, Weng, Lin. 3D shape regression for real-time facial animation. ACM Trans Graphics 2013; 32: 96.