Abstract
Detecting faces in images is a key step in numerous computer vision applications, such as face recognition or facial expression analysis. Automatic face detection is a difficult binary classification problem because of the large face intra-class variability which is due to the important influence of the environmental conditions on the face appearance. We propose a cascade face detection method based on histograms of oriented gradients (HOG), using different kinds of features and classifier to exclude non-face step by step. The candidate feature set was constructed by HOG feature of different grain size; at different stages the support vector machine (SVM) was used as the weak classifier with different parameters. Experimental results showed a better performance compared to the state-of-the-art on Carnegie Mellon University/ Massachusetts Institute of Technology (CMU/MIT) datasets.
Introduction
Video surveillance systems are very suitable for the physical security, since the video sequences from many remote areas can be presented to watchmen at a time. Face detection is interesting because it is usually an indispensable step of autonomous video surveillance system. In recent 20 years many researchers were devoted to the study of face detection and related surveys are given in Yang et al. 1 and Zhang and Zhang. 2 Its goal is to detect the presence of human faces in a still image and to return their position. The faces in different positions are detected by sliding a window and the faces of different sizes by scaling. Thus the key problem in face detection is the classification between face and non-face, in which the characteristics has the following two: (1) the number of non-face is far greater than that of face and the distribution of non-face pattern is very widespread, so it is hard to classify face and non-face by one model; (2) the computing capacity is large and a high demand for real time is required.
Sung and Poggio 3 obtain representative non-face samples with a bootstrap algorithm. The false detected non-face samples were used as new samples to train the classifier and the new trained classifier classified the testing dataset, which is repeated until the classifier meets the requirements. Sung and Poggio used this kind of bootstrap to get the non-face samples and trained neural network. They tried hard to construct a strong classifier rather than a cascade classifier, but their work inspires the later research.
Viola and Jones 4 presented a multi-stage detection method, which has milepost significance to improve the speed of face detection. Their success is largely due to the following: (1) extracting Haar feature and introducing integral map to speed up the calculation; (2) the cascade classifier was used to remove most of the background regions in the early stages; (3) bootstrap was used to generate a effective classifier: misclassified samples were focused in the later and a small part of the key features were selected from a large feature set. However, this approach has two weaknesses: (1) discriminatory ability is not strong for Haar features only reflect brightness differences in different regions; (2) just the threshold is used to distinguish between face and non face, which cannot adapt to the complex mode-distribution.
Inspired by the Viola and Jones’s work, many people tried to construct a cascade classifier.5–11 The different levels of classifier were constructed by selecting different features and change the weak-classifier parameters. We expect the classifier used in the early is relatively simple and complex in the later. Support vector machine was used as the weak classifier, which is a typical two-class classification method and have various linear and nonlinear kernel. In the different stage the distribution of face and non-face changes, so different SVM kernels and corresponding parameters can be used to adapt to this kind of change.
Histograms of oriented gradient (HOG) was used to construct feature pool. HOG is a kind of discriminative local descriptor with 2D rotation invariance, which was widely applied to the field of computer vision, e.g. pedestrian detection,12,13 human detection,14–17 face recognition,16–18 and face detection,21,22 etc. The number of HOG bins and the size of the block were adjusted to generate the feature pool with different size and granularity. We expect different features can adapt to the distribution change in different stages.
The rest of the paper is organized as follows. “Basic idea” section describes the basic idea on the cascade classifier. “HOG feature pool” section describes how to construct the hog feature pool. In “Support vector machines” section, we describe the basic principle of SVM. In “Construct cascade classifier by AdaBoost” section we describe how to construct the cascade classifier. We show the experiments and results in “Experiments and results” section. And in “Conclusions” section, we finalize the paper with the main conclusions and describe the future work.
Basic idea
In this section, we describe the basic idea on the Cascade classifier.
The basic idea of face detection based on cascade strategy is to exclude a majority of background areas quickly in the early stage to reduce the amount of computation in the later stage. In the early stage of face detection the algorithm should meet the following three requirements:
Fast detection: as mentioned before, the face is of different sizes and located in different positions and it is detected by scaling and sliding window. In the early the amount of computation is large for the image of various scales need be detected in each position, so the early detection speed determines the speed of the whole detection system. High detection rate (DR). Early detection is a preliminary screening. In this process the region judged as non-face is discarded and the majority of face region need be retained. High ability to exclude non-face region. The more non-face region can be excluded in the early, the higher the speed of the whole face detection system.
Requirements (2) and (3) are contradictory, so a balance should be strike between them. Therefore, we introduced two concepts: face detection rate (FDR) and background rejection rate (BRR). FDR means what percentage the detected face samples account for the total face samples; BRR means what percentage the detected non-face samples account for the total non-face samples.
The features in pool introduced in “Basic idea” section were selected from small to large according to the dimension, and in every stage several kinds of support vector machine were trained based on the selected features until the DR reached more than 98% and the removal rate more than 30%. The optimal SVM was selected as the classifier of this stage and sometimes the classifier may be the superposition of multiple support vector machines. According to this standard, we construct the classifier of every stage by taking the following measures: (1) selecting different features from the pool introduced in 2; (2) adjusting the parameters of weak classifier; (3) stacking the multiple classifier; (4) adding the misclassified samples into training set for training the classifier of next stage.
Hog feature pool
In this section, we describe how to construct the hog feature pool.
The basic concept of the HOG is to replace the original pixel value with gradient direction of adjacent pixels to improve the robustness to illumination variance. In this paper, operator (−1, 0, 1) and operator (−1, 0, 1)' are used to calculate gradient Gx in x direction and gradient Gy in y direction, respectively. These two gradient values are then used to calculate the corresponding angle of each pixel O = arctan(Gy/Gx), which is used as a kind of non-directional gradients and the angle value range from 0° to 180°. The calculated angle needs to be quantified. Assuming that the number of the histogram bins of the quantified histogram is “nb”, then each bin represents an angle range of B = 180/nb. Assuming that the histogram is represented by H, then when the angle O of pixel is in the interval (B(2 t − 1)/2,B(2 t + 1)/2)(t = 1, 2, 3,…, nb), the value of H(t) increases by 1. The number of histogram bins reflects the sensitivity of gradient variance, the larger the number of histogram bins, the more flexible the gradient variance. Thus, decrease of the number of histogram bins lowers the ability to identify face and non-face; increase of the number of histogram bins makes the calculation more complicated and makes it more sensitive to rotation.
HOG feature reflects how many pixels in each gradient direction and do not reflect the position distribution of each pixel's gradient. So we partition a face into regular non-overlapping grids. Assuming that “w” represents the number of rows and “h” the number of columns, then the entire face is partitioned into w × h regions; gradient histogram is calculated in each region and the histogram in each region is arranged from top to bottom and from left to right. Assuming that “nb” represents the number of histogram bins of each region, the dimension of feature vector of a region to be detected is w × h × nb.
The feature pool is constructed as follows: (1) setting six kinds of number of histogram bin: 3, 6, 9, 12, 15, 18, (2) the entire face is partitioned in the following 15 kinds of ways: 1 × 1, 1 × 2, 2 × 2, 1 × 3, 2 × 3, 3 × 3, 2 × 4, 1 × 1, 3 × 4, 2 × 5, 3 × 5, 4 × 5, 5 × 5, 8 × 10, 10 × 10. Thus the feature pool contains 90(6 × 15) kinds of features.
Support vector machines
In this section, we describe the basic principle of SVM.
Consider a set of n vectors {xi}, xi ∈ Rn, 1 ≤ i ≤ n, representing input samples and set of labels {yi}, yi ∈ {#x000B1;1}, that divide input samples into two classes, positive and negative. If the two classes are linearly separable, there exists a separating hyperplane (w, b) defining the function:
The dual representation of the decision function f(x) is then:
Training a linear SVM means finding the embedding strengths {
In real-life problems it is rarely the case that positive and negative samples are linearly separable. Non-linear support vector classifiers map input space X into a feature space F via a usually non-linear map ϕ: X to F, x to
Usually F is a high-dimensional space where images of training samples are highly separable, but working directly in such a space would be computationally expensive. However we can choose a space F which is induced by kernel K, defined by a kernel function K(x, y) that computes the dot product in F, K(x, y) ≤
Commonly used kernels include polynomial kernels K(x, y) = (x + y)
d
and the Gaussian kernel
In our implementation we use the different kernel in various stages, and it is interesting to choose an optimal kernel for the given input data.
Construct cascade classifier by AdaBoost
In this section, we describe how to construct the cascade classifier.
AdaBoost has been proposed by Freund and Schapire 23 as an efficient learning algorithm for constructing a strong classifier as an additive combination of boosted weak classifiers that are individually only slightly better than random. At each iteration AdaBoost determines a new weak classifier relative of the lowest value in the weight distribution in the training set. The principal advantage of AdaBoost is that the training error converges exponentially towards zero and the generalization performance grows at each iteration when the null training error is reached by the algorithm. 23
AdaBoost combines the hypotheses generated by a set of classifiers trained one after the other. The tth classifier is trained with more emphasis on certain patterns, using a cost function weighted by a probability distribution Dt over the training data. Some learning algorithms do not permit training with respect to a weighted cost function, e.g. decision trees. In this case resampling with replacement (using the probability distribution Dt) can be used to approximate a weighted cost function. Examples with high probability would then occur more often than those with low probability, while some examples may not occur in the sample at all although their probability is not zero. This is particularly true in the simple resampling version (labeled “R” earlier), and probably unlikely when a new training set is resampled after each epoch (“E” version). Neural networks can be trained directly with respect to a distribution over the learning data by weighting the cost function (this is the “W” version): the squared error on the pattern is weighted by the probability Dt(i). The result of training the tth classifier is a hypothesis ht where Y = {1,…,K} is the space of labels, and X is the space of input features. After the tth round the weighted error of the resulting classifier is calculated and the distribution Dt + 1 is computed from Dt, by increasing the probability of incorrectly labeled examples. The global decision f is obtained by weighted voting. It converges (learns the training set) if each classifier yields a weighted error that is less than 50%, i.e. better than chance in the two-class case. There is also a multiclass version, called pseudoloss-AdaBoost, that can be used when the classifier computes confidence scores for each class. AdaBoost has very interesting theoretical properties, in particular it can be shown that the error of the composite classifier on the training data decreases exponentially fast to zero. More importantly, however, bounds on the generalization error of such a system have been formulated.
7
These are based on a notion of margin of classification, defined as the difference between the score of the correct class and the strongest score of a wrong class. In the case in which there are just two possible labels {#x02212;1, +1}, this is f(x), where f f is the composite classifier and y the correct label. Obviously, the classification is correct if the margin is positive. We now state the theorem bounding the generalization error of AdaBoost (and other classifiers obtained by convex combinations of other classifiers). Let H be a set of hypotheses (from which the ht are chosen), with VC-dimension d. Let f be any convex combination of hypotheses from H. Let S be a sample of N examples chosen independently at random according to a distribution D. Then with probability at least 1 − ξ over the random choice of the training set S from D, the following bound is satisfied for all θ > 0:
Note that this bound is independent of the number of combined hypotheses and how they are chosen from H. The distribution of the margins however plays an important role. It can be shown that the AdaBoost algorithm is especially well suited to the task of maximizing the number of training examples with large margin. 24
Support vector machine
25
was used as weak classifiers, whose undetermined parameters were listed as follows:
selecting the kernel function
26
of SVM and determining the corresponding parameter; the number of support vector; nonzero Lagrange factor and the corresponding support vector; threshold (determining the position of the classification hyper plane); the penalty factor.
Among the above, (2), (3), and (4) may be obtained in training; (1) and (5) need be determined in advance and (4) can be manually adjusted.
To construct the classifier of the various stages we take the following measures: (1) increasing the penalty factor 23 of positive cases in training to improve the DR; (2) adjusting the DR and false detection by changing the threshold, that is, moving the position of SVM hyperplane; (3) parts of the classifiers were weighted summed and the high-detection-rate classifier were given greater weight; (4) change the kernel of support vector machines and its parameters, and the kernel function used includes: polynomial function, Gauss radial basis function, and multilayer perception function (sigmoid function); (5) selecting the features from coarse to fine. Measures (1), (2), and (3) can ensure high DR, and (4) and (5) can avoid the over fitting.
Experiments and results
In this section, we show the experiments and results.
The DR and BRR were used as the standard to evaluate the detection results in the experiments. The testing set comes in two forms: one is as same as the training set, containing face and non-face images and collected by ourselves; the other is the big images containing multiple faces, e.g. CMU/MIT datasets.
We collected the training set which contains face and non-face images. Face images were collected from the standard face database, e.g. Yale database,
27
FERET,
28
BioID,
29
etc. Totally, 10100 face images were obtained by scaling, rotation, and translation. Non-face samples are randomly collected from various scene pictures and 50,000 non-face images were got. The size of face and non-face images all were 92 × 112. Of all, 2000 images were randomly selected from the face datasets and non-face datasets, respectively, half of which were used to train and half to test. By cross validation the DR and the BRR in each stage were shown in Figure 1.
Detection rate and background rejection rate of various stages.
Figure 1 showed that the DR can reach 94.6% and 98.3% of the non-face can be excluded (i.e. the false DR was 1.7%) after eight-stage detecting. In the initial stage high DR was achieved and a considerable part of the non-face can be excluded; in the latter period, FDR decreased and the proportion of non-face excluded increased dramatically.
The complexity of the feature is represented by its dimension and Figure 2 depicts the complexity variation curves. Figure 2 shows that the selected features get more complex as the detection process goes on for the rest of images become later more and more difficult to distinguish and more details of feature need be extracted. At the fifth stage the feature complexity increased sharply, however, we need not worry about the increase in the amount of calculation for 87% of non-face images have been excluded at the fourth stage as shown in Figure 2.
Complexity curve in various stages.
The parameters of the selected SVM were adjusted. Polynomial kernel function was used as the kernel to train SVM in each stage, respectively. And we found that the polynomial kernel function of D = 1 is more suitable for the detection of the early four stages than the other and in the later the radial basis kernel function and sigmoid kernel function perform better, which may be explained in Figure 3: solid circles represent the face and hollow circles represent non-face. Face concentrate in one area while the non-face was widely distributed. In the early only four lines can exclude the majority of non-face while in the later the face and the rest of the non-face are staggered distributed, it is difficult to separate the two classed by line.
Schematic diagram of gradually eliminating non-face.
The four lines are similar to the linear SVM hyperplane and the polynomial kernel function of d = 1 is linear; in the later the nonlinear kernel need be used to transform the samples to a linearly separable state.
In Figure 4 and Figure 5, the missing and correctly classified face images are showed, respectively. Figure 4 shows that the detection result is relatively sensitive to glasses and big beard. HOG mainly is an edge-direction feature, glasses and big beard correspond to add an occlusion to the face, which can cause strong edge effects and affect the calculation result of the HOG. As showed in Figure 5, the result is robust against changes of illumination, gesture and facial expression.
Part of missing faces in test dataset. Part of human face images correctly detected as faces in test dataset.

Figure 6 lists part of non-face images false classified as face. The non-face images contain various scenes.
Part of non-human face images false detected as faces in test dataset.
We can recognize these images as non-face all of a sudden, however, the computer cannot recognize by our method. Why? We think that hog feature is a kind of quantitative distribution about different angles of the direction-gradient and hog feature does not describe the holistic structure. Thus there maybe exist a coincidence: the face image and the non-face image have very close direction-gradient distribution, of course, its low-probability that such a coincidence occurs. Further, we use the support vector machine as a weak classifier and construct a cascade-classifier, which is a kind of machine learning method. The current machine learning requires a lot of sample to train a performative classifier. This is far worse than human's ability to recognize, for human only need a small number of samples to be able to find the essential difference between face and non-face and recognize them. Thus, there is a very long way to go for the machine learning reach the human's level to recognize, the machine learning method need be radically innovated.
The presented methods were used to detect the faces contained in the big picture and the related experiments were conducted on MIT/CMU datasets. A sliding window was used to detect the face in the whole picture. We used the window with the size of 20 × 15 to scan the entire area and moved half of the window at a time (i.e. 10 pixels in the horizontal direction and 8 vertical). To detect faces of different scales, the image to detect is then dilated by powers of 1.2. We performed C- means clustering on adjacent windows detected as face and the center window of each class was looked on as a detection result.
In the Intel i5 core 2410M computer the picture with the size of 380 × 280 is processed with 40 frames per second. The CMU/MIT test set was first introduced by Rowley for testing, which contains 130 pictures and 507 faces. We detected 479 faces in CMU/MIT (i.e. the DR is 94.5%) and the number of false detected faces (NFD) is 48. Figure 7 shows the DR and NFD at each stage.
Performances on CMU/MIT.
Figure 8 shows part of the detection results in crowded scene. Background in the image includes a variety of scenes: indoor and outdoor, simple and complex. This shows that the classifier trained by small images can effectively detect the face in the big picture.
Results on images of CMU/MIT.
Figure 9 indicated that the rotated face can be detected. Our method can adapt to the variations of pose. The main reason is that the hog feature of multi-granularity was made use of.
Results on images of CMU/MIT.
A picture may contain many faces and these faces can be detected. Some examples are showed in Figure 10. This indicated that the sliding-window strategy is powerful.
Results on images of CMU/MIT.
Comparisons of various methods on the CMU/MIT database.
Conclusions
In this section, we finalize the paper with the main conclusions and describe the future work.
We designed multi-stage cascaded detection methods, gradually excluding non-face. According to the different distribution between face and non-face at different stages, on the one hand we adjust the parameters of support vector machine to adapt to this change; on the other hand we use the histogram of oriented gradients to construct different-granularity features, observing the image to detect from coarse to thin. From the actual effect the detection speed can be compared with the state-of-the-art methods and achieve a higher DR. Theoretically, this can effectively avoid the over-fitting phenomenon, compared with a single weak-classifiers (such as threshold) classification method. The research results have important implication for future research: we can try to use other more classification methods in different stages, such as nearest neighbor, neural networks, etc. For the different pattern recognition problems people put forward the corresponding features and classification method according to the characteristics of the problem. Although the problem of face detection based on cascade is still the classification problem of face and non-face in different stages, the distribution between face and non-face is completely different, so it may be necessary to extract different features and to use different classification methods. Therefore, we will try to integrate more features and classification methods in different stages in future research.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: partially by the Natural Science Foundation of China (Grant Nos. 60902083).
