Abstract
Detecting faces in images is a key step in numerous computer vision applications, such as face recognition and facial expression analysis. Automatic face detection is a difficult binary classification problem because of the large intra-class variability of faces, which is due to the strong influence of environmental conditions on face appearance. We propose a cascade face detection method based on histograms of oriented gradients (HOG), using different kinds of features and classifiers to exclude non-faces step by step. The candidate feature set was constructed from HOG features of different granularities; at different stages a support vector machine (SVM) with different parameters was used as the weak classifier. Experimental results showed better performance than the state-of-the-art on the Carnegie Mellon University/Massachusetts Institute of Technology (CMU/MIT) datasets.
Introduction
Video surveillance systems are well suited to physical security, since video sequences from many remote areas can be presented to watchmen at the same time. Face detection is interesting because it is usually an indispensable step of an autonomous video surveillance system. In the past 20 years many researchers have devoted themselves to the study of face detection, and related surveys are given in Yang et al. 1 and Zhang and Zhang. 2 Its goal is to detect the presence of human faces in a still image and to return their positions. Faces in different positions are detected by sliding a window, and faces of different sizes by scaling. Thus the key problem in face detection is the classification between face and non-face, which has the following two characteristics: (1) the number of non-faces is far greater than that of faces and the distribution of non-face patterns is very widespread, so it is hard to separate face from non-face with a single model; (2) the computational load is large and a high demand for real-time performance must be met.
Sung and Poggio 3 obtained representative non-face samples with a bootstrap algorithm: falsely detected non-face samples were used as new samples to retrain the classifier, and the retrained classifier was applied to the testing dataset again; this was repeated until the classifier met the requirements. Sung and Poggio used this kind of bootstrap to collect non-face samples and trained a neural network. They tried to construct a single strong classifier rather than a cascade classifier, but their work inspired later research.
Viola and Jones 4 presented a multi-stage detection method, which was a milestone in improving the speed of face detection. Their success is largely due to the following: (1) extracting Haar features and introducing the integral image to speed up the calculation; (2) using a cascade classifier to remove most of the background regions in the early stages; (3) using bootstrap to generate an effective classifier: misclassified samples were focused on in the later stages, and a small number of key features were selected from a large feature set. However, this approach has two weaknesses: (1) the discriminatory ability is not strong, because Haar features only reflect brightness differences between regions; (2) only a threshold is used to distinguish between face and non-face, which cannot adapt to complex mode distributions.
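The integral-image trick mentioned in point (1) can be sketched as follows; this is a minimal NumPy illustration of the standard technique, not the authors' implementation:

```python
import numpy as np

def integral_image(img):
    """Pad with a zero row/column so ii[y, x] holds the sum of img[:y, :x];
    any rectangle sum then needs only four lookups."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, y, x, h, w):
    """Sum of img[y:y+h, x:x+w] in O(1) using the integral image."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]
```

A two-rectangle Haar-like feature is then just the difference of two `rect_sum` calls, which is why the cascade's early stages can be evaluated so quickly.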
Inspired by Viola and Jones's work, many people have tried to construct cascade classifiers.5–11 The different levels of the classifier are constructed by selecting different features and changing the weak-classifier parameters. We expect the classifiers used in the early stages to be relatively simple and those in the later stages to be complex. The support vector machine was used as the weak classifier; it is a typical two-class classification method with various linear and nonlinear kernels. The distribution of face and non-face changes from stage to stage, so different SVM kernels and corresponding parameters can be used to adapt to this kind of change.
Histograms of oriented gradients (HOG) were used to construct the feature pool. HOG is a discriminative local descriptor with 2D rotation invariance, which has been widely applied in computer vision, e.g. pedestrian detection,12,13 human detection,14–17 face recognition,16–18 and face detection.21,22 The number of HOG bins and the size of the block were adjusted to generate features of different sizes and granularities for the pool. We expect the different features to adapt to the distribution change across the stages.
The rest of the paper is organized as follows. “Basic idea” section describes the basic idea of the cascade classifier. “HOG feature pool” section describes how to construct the HOG feature pool. In “Support vector machines” section, we describe the basic principle of SVM. In “Construct cascade classifier by AdaBoost” section we describe how to construct the cascade classifier. We show the experiments and results in “Experiments and results” section. In “Conclusions” section, we finalize the paper with the main conclusions and describe future work.
Basic idea
In this section, we describe the basic idea of the cascade classifier.
The basic idea of face detection based on a cascade strategy is to exclude the majority of background areas quickly in the early stages, so as to reduce the amount of computation in the later stages. In the early stages of face detection the algorithm should meet the following three requirements:
(1) Fast detection: as mentioned before, faces are of different sizes and located in different positions, and they are detected by scaling and sliding a window. In the early stages the amount of computation is large, because images at various scales need to be examined at each position, so the early detection speed determines the speed of the whole detection system.
(2) High detection rate (DR): early detection is a preliminary screening. In this process the regions judged as non-face are discarded, and the majority of face regions must be retained.
(3) High ability to exclude non-face regions: the more non-face regions that can be excluded early, the higher the speed of the whole face detection system.
Requirements (2) and (3) are contradictory, so a balance should be struck between them. Therefore, we introduce two concepts: face detection rate (FDR) and background rejection rate (BRR). FDR is the percentage of the total face samples that are detected as faces; BRR is the percentage of the total non-face samples that are rejected as non-faces.
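The two rates can be computed directly from per-stage predictions; a small sketch, where `pred` and `labels` are hypothetical arrays with 1 for face and 0 for non-face:

```python
def fdr_brr(pred, labels):
    """FDR: detected faces / total faces.
    BRR: rejected non-faces / total non-faces."""
    pairs = list(zip(pred, labels))
    face_preds = [p for p, y in pairs if y == 1]
    nonface_preds = [p for p, y in pairs if y == 0]
    fdr = sum(face_preds) / len(face_preds)
    brr = sum(1 - p for p in nonface_preds) / len(nonface_preds)
    return fdr, brr
```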
The features in the pool introduced in “HOG feature pool” section were selected from small to large according to their dimension, and at every stage several kinds of support vector machine were trained on the selected features until the DR exceeded 98% and the rejection rate exceeded 30%. The optimal SVM was selected as the classifier of that stage; sometimes the stage classifier is a superposition of multiple support vector machines. According to this standard, we construct the classifier of every stage by taking the following measures: (1) selecting different features from the pool; (2) adjusting the parameters of the weak classifier; (3) stacking multiple classifiers; (4) adding the misclassified samples to the training set for training the classifier of the next stage.
HOG feature pool
In this section, we describe how to construct the HOG feature pool.
The basic concept of the HOG is to replace the original pixel values with the gradient directions of adjacent pixels, to improve robustness to illumination variation. In this paper, the operator (−1, 0, 1) and its transpose (−1, 0, 1)' are used to calculate the gradient.
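With the stated operators, the gradient computation can be sketched in NumPy as follows; this is a simplified illustration in which boundary pixels are left at zero, one common convention:

```python
import numpy as np

def gradients(img):
    """Central differences with the (-1, 0, 1) kernel horizontally and its
    transpose vertically; returns gradient magnitude and unsigned
    orientation in degrees, as used by the HOG descriptor."""
    img = img.astype(np.float64)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]
    gy[1:-1, :] = img[2:, :] - img[:-2, :]
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # fold into [0, 180)
    return mag, ang
```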
HOG features reflect how many pixels fall into each gradient direction but do not reflect the position distribution of each pixel's gradient. So we partition a face into regular non-overlapping grids. Assuming that “
The feature pool is constructed as follows: (1) six settings for the number of histogram bins: 3, 6, 9, 12, 15, 18; (2) the entire face is partitioned in the following 15 ways: 1 × 1, 1 × 2, 2 × 2, 1 × 3, 2 × 3, 3 × 3, 2 × 4, 1 × 1, 3 × 4, 2 × 5, 3 × 5, 4 × 5, 5 × 5, 8 × 10, 10 × 10. Thus the feature pool contains 90 (6 × 15) kinds of features.
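One way to realize such a pool is sketched below: a simplified per-cell orientation histogram without block normalization, enumerated over bin counts and a subset of the grid partitions; the paper's exact binning and normalization are not specified here:

```python
import numpy as np

def hog_feature(img, n_bins, grid):
    """Orientation histogram per cell of a rows x cols partition of img;
    feature dimension is n_bins * rows * cols."""
    img = img.astype(np.float64)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]
    gy[1:-1, :] = img[2:, :] - img[:-2, :]
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0
    rows, cols = grid
    feat = []
    for ys in np.array_split(np.arange(img.shape[0]), rows):
        for xs in np.array_split(np.arange(img.shape[1]), cols):
            hist, _ = np.histogram(ang[np.ix_(ys, xs)], bins=n_bins,
                                   range=(0, 180),
                                   weights=mag[np.ix_(ys, xs)])
            feat.append(hist)
    return np.concatenate(feat)

BINS = (3, 6, 9, 12, 15, 18)
GRIDS = ((1, 1), (2, 2), (3, 3), (5, 5), (10, 10))  # subset of the 15 partitions
pool = [(b, g) for b in BINS for g in GRIDS]
```

Coarse grids (e.g. 1 × 1 with 3 bins) give short, cheap features for the early stages; fine grids (10 × 10 with 18 bins) give the high-dimensional features used later.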
Support vector machines
In this section, we describe the basic principle of SVM.
Consider a set of N training samples {(x_i, y_i)}, i = 1, …, N, where x_i is a feature vector and y_i ∈ {−1, +1} is its class label.
The dual representation of the decision function is f(x) = sign(Σ_i α_i y_i K(x_i, x) + b), where K is the kernel and b the threshold.
Training a linear SVM means finding the embedding strengths {α_i} that maximize the margin, subject to Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C, where C is the penalty factor; only the samples with nonzero α_i (the support vectors) enter the decision function.
In real-life problems it is rarely the case that positive and negative samples are linearly separable. Non-linear support vector classifiers map the input space into a higher-dimensional feature space via a mapping φ, in which a linear separation is sought.
Usually φ is not computed explicitly; instead, a kernel K(x, x') = ⟨φ(x), φ(x')⟩ is evaluated directly on pairs of input samples.
Commonly used kernels include polynomial kernels K(x, x') = (⟨x, x'⟩ + c)^d, Gaussian radial basis function kernels K(x, x') = exp(−‖x − x'‖²/2σ²), and sigmoid kernels K(x, x') = tanh(κ⟨x, x'⟩ + θ).
In our implementation we use different kernels at the various stages; choosing an optimal kernel for the given input data is an interesting problem.
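The kernels above, and the dual decision rule they plug into, can be written directly; a sketch in which the parameter defaults are illustrative, not the paper's tuned values:

```python
import numpy as np

def poly_kernel(x, z, p=2, c=1.0):
    return (x @ z + c) ** p

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x, z, kappa=0.01, theta=0.0):
    return np.tanh(kappa * (x @ z) + theta)

def decide(x, support_vecs, alphas, labels, b, kernel):
    """Dual SVM decision: sign(sum_i alpha_i * y_i * K(x_i, x) + b)."""
    s = sum(a * y * kernel(sv, x)
            for sv, a, y in zip(support_vecs, alphas, labels))
    return 1 if s + b >= 0 else -1
```

Swapping `kernel` (and, in measure (2) below, the threshold `b`) is all that changes between stages, which is what makes the per-stage adaptation cheap.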
Construct cascade classifier by AdaBoost
In this section, we describe how to construct the cascade classifier.
AdaBoost was proposed by Freund and Schapire 23 as an efficient learning algorithm for constructing a strong classifier as an additive combination of boosted weak classifiers that are individually only slightly better than random guessing. At each iteration AdaBoost trains a new weak classifier with respect to the current weight distribution over the training set, which concentrates on the samples misclassified so far. The principal advantage of AdaBoost is that the training error converges exponentially towards zero, and the generalization performance often continues to improve even after the training error reaches zero. 23
AdaBoost combines the hypotheses generated by a set of classifiers trained one after the other, each on a reweighted version of the training set. The resulting error bound is independent of the number of combined hypotheses and of how they are chosen from the hypothesis space.
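The boosting loop can be sketched generically; here `train_weak` stands in for the stage-wise weighted weak learner (an SVM in this paper), and the names are illustrative:

```python
import numpy as np

def adaboost(X, y, rounds, train_weak):
    """y is a NumPy array in {-1, +1}; train_weak(X, y, w) returns a
    predictor X -> {-1, +1} fitted to the weighted sample."""
    n = len(y)
    w = np.full(n, 1.0 / n)           # uniform initial weights
    ensemble = []
    for _ in range(rounds):
        h = train_weak(X, y, w)
        pred = h(X)
        err = np.sum(w[pred != y])    # weighted training error
        if err == 0:                  # perfect weak classifier: keep, stop
            ensemble.append((1.0, h))
            break
        if err >= 0.5:                # no longer better than chance
            break
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * pred)  # upweight misclassified samples
        w /= w.sum()
        ensemble.append((alpha, h))

    def strong(X):
        s = sum(a * h(X) for a, h in ensemble)
        return np.where(s >= 0, 1, -1)
    return strong
```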
Support vector machines 25 were used as weak classifiers, whose undetermined parameters are as follows: (1) the kernel function 26 of the SVM and its corresponding parameters; (2) the number of support vectors; (3) the nonzero Lagrange factors and the corresponding support vectors; (4) the threshold (determining the position of the classification hyperplane); (5) the penalty factor.
Among the above, (2), (3), and (4) may be obtained in training; (1) and (5) need to be determined in advance; and (4) can also be manually adjusted.
To construct the classifiers of the various stages we take the following measures: (1) increasing the penalty factor 23 of positive samples in training to improve the DR; (2) adjusting the DR and false detection rate by changing the threshold, that is, moving the position of the SVM hyperplane; (3) weighted summing of several classifiers, with high-detection-rate classifiers given greater weight; (4) changing the kernel of the support vector machine and its parameters; the kernels used include the polynomial function, the Gaussian radial basis function, and the multilayer perceptron function (sigmoid function); (5) selecting the features from coarse to fine. Measures (1), (2), and (3) ensure a high DR, while (4) and (5) help avoid over-fitting.
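Measure (2), moving the hyperplane by shifting the decision threshold, can be sketched as follows; `scores_pos` is a hypothetical array of raw SVM scores for the face samples of a stage:

```python
import numpy as np

def threshold_for_dr(scores_pos, target_dr):
    """Lowest threshold t such that at least target_dr of the face samples
    satisfy score >= t; lowering t raises the DR at the cost of more
    false detections."""
    s = np.sort(np.asarray(scores_pos))[::-1]    # face scores, descending
    k = int(np.ceil(target_dr * len(s))) - 1
    return s[k]
```

Each stage can then use `threshold_for_dr(scores_pos, 0.98)` to meet the 98% per-stage DR requirement before checking its rejection rate.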
Experiments and results
In this section, we show the experiments and results.
The DR and BRR were used as the standards to evaluate the detection results in the experiments. The testing set comes in two forms: one is of the same form as the training set, containing face and non-face images collected by ourselves; the other consists of large images containing multiple faces, e.g. the CMU/MIT datasets.
We collected a training set containing face and non-face images. Face images were collected from standard face databases, e.g. the Yale database, 27 FERET, 28 BioID, 29 etc. In total, 10,100 face images were obtained by scaling, rotation, and translation. Non-face samples were randomly collected from various scene pictures, yielding 50,000 non-face images. The size of all face and non-face images was 92 × 112. Of these, 2000 images were randomly selected from the face and non-face datasets, respectively; half were used for training and half for testing. The DR and BRR at each stage, obtained by cross-validation, are shown in Figure 1.
Detection rate and background rejection rate of various stages.
Figure 1 shows that the DR can reach 94.6% and 98.3% of the non-faces can be excluded (i.e. the false detection rate was 1.7%) after eight stages of detection. In the initial stages a high DR was achieved and a considerable part of the non-faces could already be excluded; in the later stages, the FDR decreased while the proportion of excluded non-faces increased dramatically.
The complexity of a feature is represented by its dimension, and Figure 2 depicts the complexity variation curve. Figure 2 shows that the selected features become more complex as the detection process goes on, because the remaining images become more and more difficult to distinguish and more feature details need to be extracted. At the fifth stage the feature complexity increases sharply; however, we need not worry about the increase in the amount of computation, since 87% of the non-face images have already been excluded by the fourth stage, as shown in Figure 2.
Complexity curve in various stages.
The parameters of the selected SVMs were adjusted, and a polynomial kernel function was used to train the SVM at each stage, respectively.
Schematic diagram of gradually eliminating non-face.
The four lines are similar to the hyperplanes of the linear SVM and of the polynomial-kernel SVM.
In Figures 4 and 5, the missed and correctly classified face images are shown, respectively. Figure 4 shows that the detection result is relatively sensitive to glasses and big beards. HOG is mainly an edge-direction feature; glasses and a big beard add an occlusion to the face, which causes strong edge effects and affects the calculation of the HOG. As shown in Figure 5, the result is robust against changes of illumination, pose, and facial expression.
Part of the missed faces in the test dataset.
Part of the human face images correctly detected as faces in the test dataset.

Figure 6 lists part of the non-face images falsely classified as faces. These non-face images contain various scenes.
Part of the non-human-face images falsely detected as faces in the test dataset.
A human can recognize these images as non-faces at a glance, yet the computer cannot with our method. Why? We think the reason is that the HOG feature is a quantitative distribution over the angles of the direction gradient and does not describe the holistic structure. Thus a coincidence may occur: a face image and a non-face image can have very similar direction-gradient distributions, although such a coincidence has low probability. Further, we use the support vector machine as a weak classifier within a cascade, which is a machine learning method, and current machine learning requires a large number of samples to train a well-performing classifier. This is far worse than the human ability to recognize, since humans need only a small number of samples to find the essential difference between face and non-face. Thus there is a very long way to go before machine learning reaches the human level of recognition, and machine learning methods may need to be radically innovated.
The presented method was also used to detect faces in large pictures, and the related experiments were conducted on the MIT/CMU datasets. A sliding window was used to detect faces over the whole picture. We used a window of size 20 × 15 to scan the entire area and moved it by half a window at a time (i.e. 10 pixels in the horizontal direction and 8 in the vertical). To detect faces of different scales, the image is rescaled by powers of 1.2. We performed C-means clustering on adjacent windows detected as faces, and the center window of each cluster was taken as a detection result.
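The scan described above can be sketched as a generator over scales and positions; a minimal sketch with the stated window size and step lengths:

```python
def pyramid_windows(img_h, img_w, win_h=15, win_w=20,
                    step_y=8, step_x=10, scale=1.2):
    """Yield (factor, y, x): a 20 x 15 window slid with half-window steps
    over the image rescaled by successive powers of 1.2, until the
    rescaled image no longer fits the window."""
    factor = 1.0
    while img_h / factor >= win_h and img_w / factor >= win_w:
        h, w = int(img_h / factor), int(img_w / factor)
        for y in range(0, h - win_h + 1, step_y):
            for x in range(0, w - win_w + 1, step_x):
                yield factor, y, x
        factor *= scale
```

Each yielded window would then be passed through the cascade, with most windows rejected in the first stages; detections at factor `f` map back to a face of roughly `f * 20 x f * 15` pixels in the original image.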
On an Intel Core i5-2410M computer, a picture of size 380 × 280 is processed at 40 frames per second. The CMU/MIT test set, first introduced by Rowley for testing, contains 130 pictures and 507 faces. We detected 479 faces in CMU/MIT (i.e. the DR is 94.5%), and the number of false detections (NFD) is 48. Figure 7 shows the DR and NFD at each stage.
Performances on CMU/MIT.
Figure 8 shows part of the detection results in crowded scenes. The backgrounds in the images include a variety of scenes: indoor and outdoor, simple and complex. This shows that the classifier trained on small images can effectively detect faces in large pictures.
Results on images of CMU/MIT.
Figure 9 indicates that rotated faces can be detected; our method can adapt to variations of pose. The main reason is that multi-granularity HOG features are used.
Results on images of CMU/MIT.
A picture may contain many faces, and these faces can all be detected. Some examples are shown in Figure 10. This indicates that the sliding-window strategy is powerful.
Results on images of CMU/MIT.
Comparisons of various methods on the CMU/MIT database.
Conclusions
In this section, we finalize the paper with the main conclusions and describe the future work.
We designed a multi-stage cascaded detection method that gradually excludes non-faces. According to the different distributions of face and non-face at different stages, on the one hand we adjust the parameters of the support vector machine to adapt to this change; on the other hand we use histograms of oriented gradients to construct features of different granularities, observing the image from coarse to fine. In practice the detection speed is comparable with state-of-the-art methods while a higher DR is achieved. Theoretically, this can effectively avoid over-fitting compared with a single weak-classifier (such as a threshold) method. The results have an important implication for future research: we can try other classification methods in the different stages, such as nearest neighbor, neural networks, etc. For different pattern recognition problems, people put forward corresponding features and classification methods according to the characteristics of the problem. Although cascade-based face detection remains the classification problem of face versus non-face at every stage, the distribution of face and non-face differs from stage to stage, so it may be necessary to extract different features and to use different classification methods. Therefore, we will try to integrate more features and classification methods into the different stages in future research.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: supported in part by the Natural Science Foundation of China (Grant No. 60902083).
