Abstract
Humans are the most important tracking objects in surveillance systems. However, human tracking is not enough to provide the required information for personalized recognition. In this paper, we present a novel and reliable framework for automatic age estimation based on computer vision. It exploits global face features based on the combination of Gabor wavelets and orthogonal locality preserving projections. In addition, the proposed system can extract face aging features automatically in real-time. This means that the proposed system has more potential in applications compared to other semi-automatic systems. The results obtained from this novel approach could provide clearer insight for operators in the field of age estimation to develop real-world applications.
1. Introduction
A human face image contains abundant information about personal characteristics, including identity, emotional expression, gender and age. Generally, a human face image can be considered a complex signal composed of many facial attributes, such as skin colour and geometric facial features. These attributes play a crucial role in real applications of facial image analysis, where attributes estimated from a captured face image can guide subsequent system responses. Age is among the most significant of these attributes. For example, users may require an age-specific human-computer interaction system that can estimate age for secure system access control or intelligence gathering. Automatic human age estimation using facial image analysis therefore has numerous potential real-world applications.
An automatic face image age estimation system is composed of two parts: face detection and age estimation. The purpose of face detection is to localize the faces in an image. Detecting faces is challenging because the results depend heavily on conditions such as environment, movement, lighting, orientation and facial expression. These factors may change the colour, luminance, shadows and contours of images. To address this, Viola and Jones proposed their well-known face detector in 2004 [1]. The Viola-Jones detector uses AdaBoost at each node of a cascade to learn a multi-tree classifier with a high detection rate at the cost of a low rejection rate. The algorithm incorporates several innovative features: 1) Haar-like input features, in which a threshold is applied to sums and differences of rectangular image regions; 2) the integral image technique, a data structure that enables rapid computation of the sums over rectangular regions (or such regions rotated by 45 degrees) and thereby accelerates computation of the Haar-like features; 3) statistical boosting with AdaBoost to create binary (face versus non-face) classification nodes characterized by high detection and weak rejection; and 4) organization of these weak classifier nodes into a rejection cascade. In other words, the first group of classifiers is the one that most effectively detects image regions containing an object while allowing many mistaken detections; the next group is the second-best at detection with weak rejection, and so on. In test mode, an object is detected only if it passes through the entire cascade.
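The integral image and a two-rectangle Haar-like feature can be illustrated with a minimal sketch. The function names, the zero-padded table layout and the polarity convention are assumptions made for illustration and are not taken from [1].

```python
import numpy as np

def integral_image(img):
    """Summed-area table with an extra zero row and column at the top-left."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.astype(np.int64).cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of img[top:top+height, left:left+width] using four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

def vertical_edge_feature(ii, top, left, height, width):
    """Two-rectangle feature: left half minus right half (width must be even)."""
    half = width // 2
    left_sum = rect_sum(ii, top, left, height, half)
    right_sum = rect_sum(ii, top, left + half, height, half)
    return left_sum - right_sum

def weak_classify(value, threshold, polarity=1):
    """Threshold a feature value; returns 1 (face-like) or 0 (assumed convention)."""
    return 1 if polarity * value < polarity * threshold else 0
```

Once the table is built, every rectangle sum costs four lookups regardless of the rectangle size, which is what makes evaluating many features per search window affordable.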
Although automatic face detection is a mature technique used in many real-world applications, estimating human age from face images remains challenging. The aging process differs not only between races but also within races, so it is almost personal. It is also influenced by external factors such as health, lifestyle, location and weather conditions. Therefore, how to find a robust feature representation remains an open problem.
Overall, there are three categories of feature extraction for human facial age estimation in the literature. The first category is statistical-based approaches. Xin Geng et al. [2][3] proposed the AGing pattErn Subspace (AGES) method for automatic age estimation. The idea of AGES is to model the aging pattern, which is defined as a sequence of personal aging face images, by learning a representative sub-space using an EM-like (expectation-maximization) iterative Principal Component Analysis (PCA). In other major studies [4][5], Guodong Guo et al. compared three typical dimensionality reduction and manifold embedding methods, namely PCA, Locally Linear Embedding (LLE) and Orthogonal Locality Preserving Projections (OLPP). According to the data distribution in the OLPP sub-space, they proposed the Locally Adjusted Robust Regression (LARR) method for learning and prediction of human ages. LARR applies Support Vector Regression (SVR) to obtain a coarse prediction and then uses a Support Vector Machine (SVM) to determine a local adjustment within a limited range of ages centred on the predicted result.
The second category comprises appearance-based approaches. Using appearance information is the most intuitive method in facial image analysis. Young H. Kwon et al. [6] used visual aging features to construct an anthropometric model. The primary features are the eyes, nose, mouth and chin. The ratios of those features are computed to distinguish different age ranges. In secondary feature analysis, a wrinkle geography map is used to guide the detection and measurement of wrinkles. Jun-Da Txia et al. [7] proposed an age estimation method using the Active Appearance Model (AAM) to extract the regions of age features. Each face requires 28 feature points and is divided into ten wrinkle feature regions. Shuicheng Yan et al. [8] presented a patch-based appearance model named Patch-Kernel. This method is designed to characterize the Kullback-Leibler divergence between the models which are derived from the global Gaussian Mixture Model (GMM) using Maximum a Posteriori (MAP) for any two images. The discriminating power is further enhanced using a weak learning process, called “inter-modality similarity synchronization”. Kernel regression is employed for estimating age.
The third category comprises frequency-based approaches. In image processing and pattern recognition, frequency domain analysis is one of the most popular methods for extracting image features. Guodong Guo et al. [9] investigated biologically inspired features (BIF) for human age estimation from faces. Unlike the previous works in [4][5], Guo simulated the human visual process based on bio-inspired models [10] by applying Gabor filters. A Gabor filter is a linear filter used in image processing for edge detection. Frequency and orientation representations of Gabor filters are similar to those of the human visual system, and have been found to be particularly appropriate for textural representation and discrimination. Furthermore, they modified previous bio-inspired models by introducing a novel “STD” operation.
Our proposed system uses the cascaded AdaBoost learning algorithm for face detection and performs age estimation using Gabor wavelets and OLPP. The face detection system comprises histogram lighting normalization, feature selection, the cascaded AdaBoost classifier and a region-based clustering algorithm. The age estimation process comprises feature extraction using Gabor wavelets, feature reduction and selection, and age classification.
This paper proposes a fully automatic age estimation system using Gabor wavelets to represent aging progress. The proposed system has four main modules: 1) face detection; 2) Gabor wavelet analysis; 3) OLPP reduction; and 4) SVM classification. The input image comes from a camera frame or an image file. First, the face is captured from the image by a face detector based on the AdaBoost approach presented in [12], and the face image is resized to 64 by 64 pixels. After face detection, features are extracted using 40 Gabor wavelet kernels and reduced by OLPP. Lastly, the age is estimated from the reduced features using SVM classification.
The remainder of this paper is organized as follows. Section 2 describes the sub-system of face detection using AdaBoost. Section 3 shows a facial age estimation algorithm including textural analysis using Gabor wavelets, data reduction based on orthogonal locality preserving projections and classification. Section 4 shows experimental results and comparisons. Finally, the conclusions on this system are presented in Section 5.
2. Face Detection
Figure 1 displays the architecture of the automatic age estimation system in our work. The system consists of a face detection system, which localizes the facial regions in a captured image, and an age estimator for the extracted face. Search windows of various sizes are applied to an image to find facial candidates at multiple scales, because the apparent size of a face depends on its distance from the camera during image capture. There are twelve search windows in total, and the window size increases from the smallest size (24×24) with a scaling factor of 1.25. While a camera is acquiring an image, it may produce images of varying illumination intensity depending on the environment. The image can be recognized more accurately after its brightness is normalized.
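As a small worked example, the twelve window sizes implied by a 24×24 base window and a scaling factor of 1.25 can be enumerated as below. Rounding each size to the nearest pixel is an assumption; the text does not state how fractional sizes are handled.

```python
def search_window_sizes(base=24, factor=1.25, count=12):
    """Side lengths (in pixels) of square search windows, smallest first."""
    # Rounding to the nearest integer is an assumed convention.
    return [int(round(base * factor ** k)) for k in range(count)]

sizes = search_window_sizes()
for size in sizes:
    print("%d x %d" % (size, size))
```

Under this assumption the largest window is roughly 279 pixels on a side, so a face filling most of a typical camera frame is still covered.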

System overview.
2.1 Lighting normalization
The lighting normalization is based on the histogram fitting method. The primary task of histogram fitting is to transform the original histograms
Lighting normalization. (a) Target image. (b) Input images. (c) Lighting normalization images.
where
2.2 Feature selection
The intensity-based features employed in our work were based on Haar features. We selected four types of rectangular features, as illustrated in Figure 3: the vertical edge, horizontal edge, vertical line and diagonal edge, as proposed by Papageorgiou [13]. It is feasible to use a composition of multiple different brightness rectangles to present the light and dark regions in the image. The features are defined as:
Four types of rectangle features.
where (
A single rectangle feature which most effectively separates the face and non-face samples can be considered as a weak classifier
The weak classifier

Database of face detection system. (a) Face images. (b) Non-face images.
The AdaBoost method combines a collection of weak classifiers to form a stronger classifier. Although the stronger classifier is effective for face detection, it is still time consuming. A structure of cascaded classifiers, which improves the detection performance and reduces the computation time, was proposed by Viola and Jones [14]. Based on this idea, our cascaded AdaBoost classifier works stage by stage to classify a face. In stage 1, if an image block is classified as a face it passes to stage 2; otherwise it is rejected. Likewise, stage 3 is reached only if the block has been classified as a face at stage 2. The number of stages must be sufficient to achieve an excellent detection rate while minimizing computation. For example, if each stage has a detection rate of 0.99, a detection rate of 0.9 can be achieved using a 10-stage classifier (since 0.9 ≈ 0.99^10). While achieving this detection rate may sound like a daunting task, it is made significantly easier by the fact that each stage need only achieve a false positive rate of about 30%.
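The arithmetic behind this example can be checked directly. The sketch below assumes that the stages behave independently, so that the overall rate of the cascade is the product of the per-stage rates; the function name is illustrative.

```python
def cascade_rate(per_stage_rate, stages):
    """Overall rate of a cascade whose stages are assumed to act independently."""
    return per_stage_rate ** stages

detection = cascade_rate(0.99, 10)       # about 0.904
false_positive = cascade_rate(0.30, 10)  # about 5.9e-06

print("overall detection rate: %.4f" % detection)
print("overall false positive rate: %.2e" % false_positive)
```

A per-stage false positive rate of 30% looks weak in isolation, but ten such stages together pass only about six non-face blocks per million.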
The procedure of the Adaboost process is described as follows: if
The weights are updated by Eq. (5) in each iteration; if the object is classified correctly then
The final classifier for
2.3 Region based clustering
The face detector usually finds more than one face candidate even though only a single face appears in an image, as illustrated in Figure 5. Therefore, a region-based clustering method is used to solve this problem. The proposed method consists of two levels of clustering: local-scale and global-scale. The local-scale clustering groups the blocks found at the same scale and applies a simple filter based on the number of blocks within each cluster. If a cluster contains more than one block, it is retained as a possible face candidate; otherwise it is discarded. The local-scale clustering judges whether the blocks meet the decision rule in:
Face detector result.
In Eq. (7), the overlap rate (
Figure 6 shows several cases of the clustering process. In Figure 6(a), the two blocks are processed as the same cluster, and in Figure 6(b) the two blocks are processed as different clusters because the distance of the centres does not satisfy

The chart of overlapped regions and distances of the centres of two blocks. (a) Case 1. (b) Case 2. (c) Special case in the cluster: more than two blocks overlapping.

Region-based clustering result. (a) The results of clustering in local-scale and (b) in global-scale.
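The local-scale clustering step can be sketched as follows. The exact decision rule and its thresholds are not reproduced in this sketch: the definition of the overlap rate (intersection area divided by the smaller block's area), the threshold values `min_overlap` and `max_dist_ratio`, and the function names are all assumptions chosen for illustration. Blocks are `(x, y, size)` squares.

```python
from math import hypot

def overlap_rate(a, b):
    """Intersection area over the smaller block's area (assumed definition)."""
    w = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    h = min(a[1] + a[2], b[1] + b[2]) - max(a[1], b[1])
    if w <= 0 or h <= 0:
        return 0.0
    return (w * h) / float(min(a[2], b[2]) ** 2)

def centre_distance(a, b):
    """Euclidean distance between the centres of two square blocks."""
    return hypot((a[0] + a[2] / 2.0) - (b[0] + b[2] / 2.0),
                 (a[1] + a[2] / 2.0) - (b[1] + b[2] / 2.0))

def same_cluster(a, b, min_overlap=0.5, max_dist_ratio=0.5):
    """Hypothetical rule: enough overlap and centres close together."""
    close = centre_distance(a, b) <= max_dist_ratio * min(a[2], b[2])
    return overlap_rate(a, b) >= min_overlap and close

def cluster_blocks(blocks):
    """Group blocks transitively; keep only clusters with more than one block."""
    parent = list(range(len(blocks)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(blocks)):
        for j in range(i + 1, len(blocks)):
            if same_cluster(blocks[i], blocks[j]):
                parent[find(i)] = find(j)

    groups = {}
    for i, block in enumerate(blocks):
        groups.setdefault(find(i), []).append(block)
    return [g for g in groups.values() if len(g) > 1]
```

Merging transitively with a union-find structure handles the special case of more than two overlapping blocks, and the final filter discards isolated detections, which are more likely to be false positives.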
3. Age Estimation
There are three major parts to our age estimation system: age feature extraction, feature reduction and feature classification. The age feature extractor is constructed using Gabor wavelets, which have been widely used for image analysis because of their biological relevance and computational properties. Gabor wavelet kernels are similar to the 2D receptive field profiles of the mammalian cortical simple cells, exhibiting strong characteristics of spatial locality and orientation selectivity, and are optimally localized in space and frequency domains. The Gabor wavelet transform is
3.1 Feature extraction using Gabor wavelets
A Gabor wavelet
where
where
In most cases, researchers use Gabor wavelets at five different scales,

The Gabor wavelet representation of an image is the convolution of the image with a family of Gabor kernels as defined using Eq. (8). Let
where
To apply the convolution theorem, the Fast Fourier Transform (FFT) is used to derive the convolution output. Eq. (11) and Eq. (12) are the definition of convolution via FFT.
where 𝔍 and 𝔍⁻¹ denote the Fourier and inverse Fourier transform, respectively.
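A minimal sketch of the 40-channel Gabor representation computed through the convolution theorem is given below. The kernel form and the parameter values (σ = 2π, k_max = π/2, spacing factor √2, five scales and eight orientations) are common choices in the Gabor face-analysis literature and are assumed here. The convolution is circular (no zero padding), and the image is assumed to be square with an even side length, such as the 64 × 64 face images used in this work.

```python
import numpy as np

def gabor_kernel(mu, nu, size=64, sigma=2 * np.pi, k_max=np.pi / 2, f=np.sqrt(2)):
    """Complex Gabor kernel at orientation mu (0..7) and scale nu (0..4)."""
    k = (k_max / f ** nu) * np.exp(1j * np.pi * mu / 8)  # wave vector as a complex number
    half = size // 2
    y, x = np.mgrid[-half:half, -half:half]
    k2 = np.abs(k) ** 2
    z2 = x ** 2 + y ** 2
    envelope = (k2 / sigma ** 2) * np.exp(-k2 * z2 / (2 * sigma ** 2))
    # Oscillatory part minus a constant term that removes the DC response.
    carrier = np.exp(1j * (k.real * x + k.imag * y)) - np.exp(-sigma ** 2 / 2)
    return envelope * carrier

def gabor_magnitudes(image):
    """Magnitudes of the 40 convolution outputs, shape (40, H, W)."""
    image_f = np.fft.fft2(image)
    outputs = []
    for nu in range(5):
        for mu in range(8):
            # Move the kernel centre to index (0, 0) before transforming.
            kernel = np.fft.ifftshift(gabor_kernel(mu, nu, size=image.shape[0]))
            response = np.fft.ifft2(image_f * np.fft.fft2(kernel))
            outputs.append(np.abs(response))
    return np.stack(outputs)
```

Transforming the image once and reusing its spectrum for all 40 kernels is the main saving; in practice the kernel spectra can also be precomputed, since they do not depend on the input.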
Figure 10 shows the magnitude of convolution outputs of a sample image. The outputs exhibit strong characteristics of spatial locality, as well as scale and orientation selectivity corresponding to those displayed in Fig. 9. Such characteristics produce salient local features that are suitable for visual event recognition. Hereafter, we indicate with

Sample image and magnitude of 40 convolution outputs.

Parallel Dimension Reduction Scheme.
3.2 Feature reduction by scheme
Generally, Principal Component Analysis (PCA) or another algorithm follows Gabor wavelet feature extraction to reduce the dimensionality of the transformed data [19][20]. The convolution results corresponding to all Gabor wavelets are put together as a whole to enhance the computational efficiency when PCA is applied for dimensionality reduction. Three different schemes have been proposed: (a) Parallel Dimension Reduction Scheme (PDRS): Gabor wavelet features are extracted from each sample as shown in Figure 10. A PCA projection matrix is trained for every channel, and the resulting features are combined using a voting method. (b) Ensemble Dimension Reduction Scheme (EDRS): the EDRS is the most common scheme used for Gabor wavelet features. As shown in Figure 11, the difference between PDRS and EDRS is that the EDRS concatenates Gabor wavelet features instead of using them in parallel. (c) Multi-channel Dimension Reduction Scheme (MDRS): Xiaodong Li et al. [21] proposed MDRS in 2009. As shown in Figure 13, the main idea of MDRS is to train a PCA projection matrix for the same channel across different samples. In [21], Xiaodong Li et al. showed that MDRS performs better than EDRS in facial feature extraction using a Gabor wavelet transform.

Ensemble Dimension Reduction Scheme.

Multi-channel Dimension Reduction Scheme.

Sample images of the same subject at different ages.
To compare the performance of PDRS and MDRS, the K-Nearest Neighbour (KNN) classifier is used for experimentation. For PDRS, we used a voting method called “Gaussian voting” to combine 40 channels. The concept of Gaussian voting is described as using a KNN classifier for each channel to predict 40 ages. Each predicted age is treated as the mean value of a Gaussian distribution and is counted as a histogram. The highest peak is the final predicted answer. For MDRS, we use the concatenated feature directly.
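The Gaussian voting step described above can be sketched as follows. The standard deviation `sigma`, the one-year bin width and the function name are assumptions, since the text does not specify them; the age range of 0 to 69 matches the FG-NET database.

```python
import math

def gaussian_vote(predicted_ages, sigma=2.0, min_age=0, max_age=69):
    """Combine per-channel age predictions; returns the age at the highest peak."""
    ages = list(range(min_age, max_age + 1))
    histogram = [0.0] * len(ages)
    for mean in predicted_ages:
        # Each prediction contributes a Gaussian bump centred on itself.
        for i, age in enumerate(ages):
            histogram[i] += math.exp(-((age - mean) ** 2) / (2.0 * sigma ** 2))
    best = max(range(len(ages)), key=lambda i: histogram[i])
    return ages[best]
```

Compared with a plain majority vote, nearby predictions reinforce each other: channels predicting 20, 21 and 22 jointly outvote a single outlier at 60 even though no two channels agree exactly.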
The FG-NET Aging Database [22] is adopted for experiments. The database contains 1,002 high-resolution colour and grey-scale face images with large variations in lighting, pose and expression. There are a total of 82 subjects (multiple races) ranging in age from 0 to 69 years. We used the mean absolute error (MAE) criterion to evaluate the performance of each age estimation. The MAE denotes the average of the absolute errors between the estimated ages and the ground truth ages. The mathematical function is defined as:
$$\mathrm{MAE} = \frac{1}{N}\sum_{k=1}^{N}\left|\hat{l}_k - l_k\right|$$
where $l_k$ is the ground truth age of test image $k$, $\hat{l}_k$ is the estimated age, and $N$ is the total number of test images.
MAE of PDRS and MDRS.
3.3 Feature selection
The dimensionality of the Gabor wavelet feature space is overwhelmingly high, even after the dimension reduction scheme has been applied. Therefore it is important to select the more significant features and to further reduce the dimension to a low-dimensional space. Besides PCA, three typical dimensionality reduction methods have been proposed in past research. (a) Linear Discriminant Analysis (LDA) is similar to PCA [23]; the difference is that LDA uses class information to improve discrimination. (b) Locality Preserving Projections (LPP) search for the sub-space that preserves the essential manifold structure by measuring local neighbourhood distance information [24]. (c) Orthogonal Locality Preserving Projections (OLPP) produce orthogonal basis functions based on LPP and preserve the metric structure [25]. To determine which of these reduction methods is most suitable for age features from Gabor wavelets, we used the KNN classifier for experimentation and the MAE criterion to evaluate performance. In the experiment, we varied the affinity weight of LPP and OLPP to obtain more detail. Table 2 shows the MAE of each reduction method. The OLPP with cosine distance affinity weight has the best performance in age estimation.
MAEs of different reduction methods.
3.4 Age classification
The Gabor wavelet features are used in the SVM classifier to identify how old the face is. Support Vector Machines (SVMs) have considerable potential as classifiers of sparse training data, as they were developed to solve classification and regression problems. SVMs have similar roots to neural networks, and they demonstrate the well-known ability of being universal approximators of any multivariable function to any desired degree of accuracy. This approach was developed by Vapnik et al. using statistical learning theory [25–27]. Table 1 and Figure 11 show the comparisons of results using our conditional entropy-based feature selection approach with those by others for feature selection and classification. All the comparisons in this paper used the same training and testing database. The database was composed of 1002 high-resolution colour or grey-scale face images with large variations in lighting, pose and expression. There are 82 subjects (multiple races) in total with ages ranging from 0 to 69 years. The input dimension of the SVM in the comparison process was 43, as shown in Table 2. In addition, we also compared the accuracy rate obtained with the same Gabor wavelet features and KNN classification.
4. Experimental Results
The database that we adopt for age estimation experiments is the FG-NET Aging Database [20]. This is a publicly available age database containing 1002 high-resolution colour or grey-scale face images with large variations in lighting, pose and expression. There are 82 subjects (multiple races) in total with ages ranging from 0 to 69 years. Figure 13 shows a series of sample images of the same subject at different ages.
To evaluate the age estimation performance, the facial area in each image was located by the face detector described in Section 2. A leave-one-person-out (LOPO) test scheme was used in the experiments. Each face image was cropped and resized to 64 × 64 pixels and the colour information was transformed to 256 grey level. We used the classifier of SVMs with parameters of the RBF kernel, where cost
The performance of age estimation can be measured by two different measures: the mean absolute error (MAE) and the cumulative score (CS). The MAE is defined as the average of the absolute errors between the estimated ages and the ground truth ages. The MAE measure has been used previously in [2–10]. The cumulative score is defined as
where
Table 3 shows the experimental results. We compare our results with all previous methods reported on the FG-NET age database. The Gabor-OLPP method of this study has MAEs of 8.43 and 5.71 years using KNN and SVM respectively, which are clearly smaller than most previous results under the same experimental protocol. Our method reduces the MAE by approximately 16% relative to the result of AGES [2]. In Table 3, we can see that the LARR [4] and BIF [9] methods achieve lower MAEs than ours, 5.07 and 4.77 years respectively.
MAEs of different methods.
As mentioned previously, our purpose is to build a “fully-automatic” age estimation system. The LARR method directly uses the AAM features provided with FG-NET, meaning it usually needs human involvement in aligning the feature points. In our survey, there is still no efficient method that can automatically align feature points quickly and correctly. For applications, the LARR method may therefore require considerable effort in aligning feature points. The reported MAE of BIF is clearly more favourable than that of the method we propose. In order to verify this result, we implemented the BIF method ourselves. Our implementation performed poorly, with an MAE of 10.32. Furthermore, the BIF method requires a large amount of time to extract the aging features: its extraction time is more than twice that of our system. Our method raises the speed of feature extraction to approximately 12 to 15 images per second.
The comparisons of cumulative scores are shown in Figure 14. Our Gabor-OLPP method performs much better than the WAS and MLP methods. The cumulative score of AGES is close to that of our Gabor-OLPP method at low error levels, but falls below it when the error level is larger than five years.

CS of each method.
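The evaluation protocol and the two measures used in this section can be sketched as follows. The MAE follows the definition given in the text. The cumulative score is implemented using the definition commonly adopted in the age estimation literature, namely the percentage of test images whose absolute error does not exceed a given level; this definition and all function names are assumptions.

```python
def leave_one_person_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test

def mean_absolute_error(estimated, truth):
    """Average absolute difference between estimated and ground truth ages."""
    return sum(abs(e - t) for e, t in zip(estimated, truth)) / float(len(truth))

def cumulative_score(estimated, truth, level):
    """Percentage of images with absolute error <= level years (assumed definition)."""
    hits = sum(1 for e, t in zip(estimated, truth) if abs(e - t) <= level)
    return 100.0 * hits / len(truth)
```

Splitting by subject rather than by image keeps every image of the held-out person out of the training set, so the reported error reflects generalization to unseen individuals.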
5. Conclusions
In this paper, we propose a new framework for automatic age estimation of face images. A Gabor wavelet transform is first introduced for age estimation to achieve real-time and fully-automatic aging feature extraction; SVMs have considerable potential as classifiers of sparse training data and provide robust generalization ability.
Most previous studies have used PCA only to reduce the dimensionality of the Gabor wavelet features; but PCA exhibits inadequate efficiency when we use general Gabor wavelet features directly. By exchanging efficiency for accuracy of classification, previous researchers have usually attempted to select only the features they require, rather than using all the features. Therefore, data reduction methods are more convenient for selecting the target features. We compare four different typical data reduction methods; OLPP provides the lowest dimensionality of feature vectors and the most favourable discrimination from further extraction.
6. Acknowledgments
This work was supported in part by the Department of Industrial Technology under grant: 100-EC-17-A-02-S1–032, and supported in part by the Taiwan National Science Council under grant: NSC-100-2218-E-009-023.
