
Abstract

In this paper, an object detector is proposed based on a convolution/subsampling feature map and a two-level cascade classifier. First, a convolution/subsampling operation alleviates variations in illumination, rotation and noise. Then, two classifiers are cascaded to check a large number of windows using a coarse-to-fine strategy. Since the sub-sampled feature map with enhanced pixels is fed into the coarse-level classifier, the number of checked windows is reduced to a quarter of that in the original image. The few remaining windows are then checked against the detailed original data by a fine-level classifier.

In addition to speeding up the detection process, the proposed mechanism also speeds up the training process. Features generated from the prototypes within the small window are selected and trained to obtain the coarse-level classifier. Moreover, a feature ranking algorithm reduces the large feature pool to a small set, thus speeding up the training process without losing detection performance. The contribution of this paper is twofold: first, the coarse-to-fine scheme shortens both the training and detection processes; second, the feature ranking algorithm reduces training time. Finally, experiments were conducted for evaluation. The results show that the proposed method outperforms the fast Adaboost and forward feature selection methods.

Keywords

Face Detection, Coarse-to-fine Strategy, Convolution/Subsampling Feature Map, Feature Ranking

1. Introduction

Object detection has received significant attention owing to its many applications in the computer vision field. For robot systems, robust and real-time object detection is a first and critical step for object recognition. High success rates, especially in face detection and license plate detection, have been achieved using various algorithms such as principal component analysis (PCA)[1], artificial neural networks (ANNs)[2, 3] and support vector machines (SVMs)[4, 5], among others. Generally, a window-based object detection process consists of four stages: pyramid image construction, window sliding, window verification and post-processing. First, pyramid images are constructed to account for various object sizes: the original image is repeatedly scaled down by a factor of 1.25 so that objects of different sizes fit within a fixed-size window. Since an object can appear at any location, a window slides through all pyramid images. Next, a classifier checks whether each window contains an object or not. Finally, post-processing operations or fusing algorithms are applied to determine exact object locations and sizes.
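The pyramid construction and window sliding stages can be illustrated with a short sketch. It is not taken from the paper's implementation: the function names, the stopping rule (stop once a level is smaller than the window) and the one-pixel sliding step are assumptions made for illustration. Only the 1.25 scale factor, the 24 by 24 window and the 320 by 240 image size come from the text.

```python
def pyramid_sizes(width, height, win_w, win_h, factor=1.25):
    """Return the (width, height) of each pyramid level, largest first.

    Assumption: levels are generated until the image no longer fits
    a single win_w x win_h window.
    """
    sizes = []
    scale = 1.0
    while True:
        w, h = int(width / scale), int(height / scale)
        if w < win_w or h < win_h:
            break
        sizes.append((w, h))
        scale *= factor
    return sizes


def count_windows(width, height, win_w, win_h, step=1):
    """Number of window positions when sliding with the given step (assumed 1)."""
    return ((width - win_w) // step + 1) * ((height - win_h) // step + 1)


levels = pyramid_sizes(320, 240, 24, 24)
total = sum(count_windows(w, h, 24, 24) for w, h in levels)
print(len(levels), "levels,", total, "windows to verify")
```

Even for a small image, the number of windows to verify is large, which is why both the number of windows and the cost per window matter for real-time detection.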

The first challenge for object detection is that objects have to be detected in real time. The detector has to check a large number of windows to complete this process in real time. Decreasing the number of windows and reducing the checking time per window are two strategies for improving efficiency. The second challenge is that the training process for the detector has to be completed as soon as possible. For example, a large number of weights are trained in ANN, while many feature candidates are checked in Adaboost-based algorithms.

ANNs are widely used in face detection. In Rowley's method [6], two independent MLPs (multiple layer perceptrons) are constructed using a bootstrapping training strategy. Each window has to be checked twice to determine if it contains a face or not. In Han's method [8], morphology-based preprocessing removes most backgrounds to speed up the detection process. In addition to the traditional MLP methods, a convolutional neural network (CNN) proposed by LeCun et al. [15] has been successfully used in handwritten character recognition (HCR) [16] and face detection [7, 18].

Three basic components, namely local receptive fields, shared weights and spatial subsampling, are used for feature extraction. Feature maps are robust to rotation, scaling, shifting, pixel distortion and illumination variations. Additional applications, e.g., document analysis [17], face pose estimation [19], facial expression recognition [20], face recognition [21], generic object categorization [22] and hand tracking [23], have been developed. In Viola [9], a face detector based on the Adaboost algorithm is proposed. Asymmetric cascade classifiers and integral image-based features rapidly filter out background regions. Moreover, many variants of the Adaboost algorithm have been proposed for improvement. The FloatBoost algorithm [12] was designed to solve the monotonic problem on sequential Adaboost training for multi-view face detection. Forward feature selection (FFS) [13] reduces the training time of feature selection using a pre-computing strategy. The LDA plus FFS algorithm [14] enhances classification ability. The high-performance face detectors designed in [12–14] are all robust to face variations.

Although the Adaboost-based algorithm has been successfully applied for face detection, many researchers have applied various other features for face detection. Jeong et al. [24] successfully applied semi-local structure patterns in face detection. Gunasekar et al. [25] proposed qualHOG features for face detection on distorted images. The Adaboost-based feature selection algorithm proposed by Jun et al. [26] selects local gradient patterns, binary histograms of oriented gradient patterns and hybridization for improving face detection performance. Face detection is also a critical step in other applications. For example, the performance of face recognition was dramatically affected by the results of face detection [27]. The first step of facial expression recognition or facial feature tracking was also the application of successful face detection [28].

In addition to face detection, license plate detection (LPD) is an important application in object detection, particularly for intelligent transportation systems (ITS). Wang and Lee [29] propose a cascade framework based on the Adaboost algorithm for license plate detection. The feature pool is composed of original Haar-like features, as well as skewed Haar-like features. Zhou et al. [31] present a plate detection algorithm using SIFT features. Owing to the invariance of SIFT features, variations in view angle, scale and illumination can be tolerated. Giannoukos et al. [32] propose an LPD algorithm that analyses the context of the sliding window. The performance of LPD was improved in this case because only regions of interest were scanned. Wang et al. [33] present a method based on the discrete wavelet transform, in which LP features are found and verified in the HL sub-band, and horizontal lines are checked in the LH sub-band to determine the LP location. Li et al. [34] presented a component-based license plate detection algorithm that decomposed license plates into several candidate regions of digital characters. These candidates were then verified by a constructed conditional random field model. Al-Ghaili et al. [35] presented a fast vertical edge detection algorithm for locating license plates. Additionally, they pointed out that LP detection is the first step in LP recognition, traffic data collection, crime prevention and ITS.

The CNN and Adaboost methodologies have been integrated in this study. A brief review of these two methods is given below. The CNN-based object detector uses several convolutional/subsampling feature maps and a multilayer perceptron for feature extractors and face classifiers, respectively. According to results obtained by Garcia [7], although CNN can extract features to alleviate the variations of rotation, scaling, shifting, distortion and illumination, its extensive architecture renders implementation and training difficult. Additionally, CNN detection speed is also slow. The convolutional and subsampling operations in CNN not only alleviate variations among facial images, but also condense the pixel information within a checked window. This approach was adopted in Rowley [6] and Han [8], in which one complex pre-processing step extracts facial features for detection. In our approach, multiple feature maps in convolutional and subsampling layers were reduced to one simple map only.

An Adaboost-based detector in the method applied by Viola [10] detects facial regions in real time using several cascade strong-classifiers composed of thousands of weak classifiers. Asymmetric cascade classifiers and integral image-based features quickly filter out background regions. However, significant training time is needed for selecting 38 cascade strong-classifiers from hundreds of thousands of block-based features within a window of 24 by 24 pixels.

In this study, a two-level cascade detector, as shown in Fig. 1, was used based on a simple feature map and a coarse-to-fine strategy. According to our experience, image enhancement improves detection performance and shortens the training process. Window pixels were convoluted and subsampled by the trained coefficients for enhancement. The detector was composed of a coarse-level classifier and a fine-level classifier. Small windows with enhanced pixels were verified by the coarse-level classifier, while large windows with detailed data were verified by the fine-level classifier. The coarse-level and fine-level classifiers were composed of eight and 12 cascade strong-classifiers, respectively. The coarse-level classifier applied only a small number of features and filtered out most backgrounds using the map with condensed data at layer S. The fine-level classifier rechecked the remaining windows using the original data in fine resolution. The main function of the fine-level classifier was to reduce false alarms (FAs) while retaining high detection rates (DRs). In the detection process, pyramid-based localization techniques were proposed for fusing the candidates and identifying the objects' regions. In order to detect objects of various sizes, an input image was repeatedly sub-sampled by a factor of 1.25 and a set of pyramid sub-images was generated at various scales. The windows in each sub-image were verified by the trained detector. Finally, a clustering-based fusing method was performed to determine object location.

Figure 1.

The proposed system architecture

The rest of this paper is organized as follows: the mechanism for feature extraction using convolutional and subsampling operations is presented in section 2, where a single feature map is constructed and its weights are trained. Two-level cascaded classifiers and the coarse-to-fine strategy are described in section 3. This section also presents how a small feature set is pre-generated by ranking the features in order to speed up the training process. The experiments conducted to show the feasibility and efficiency of the proposed method are given in section 4. Finally, concluding remarks are provided in section 5.

2. Single Convolution-Subsampling Feature Map

Using the architecture of CNN, convolutional and subsampling operations were performed for feature extraction without any preprocessing. In Garcia [7], feature maps were responsible for extracting and fusing a set of appropriate features. The input plane received a window of 32 by 36 pixels to be classified as either face or non-face. Successive convolutional and subsampling operations were performed on the feature maps from layers C1 to S2. Layers C1, S1, C2 and S2 were composed of four, four, fourteen and fourteen feature maps, respectively. In this study, consider a window of w by h pixels as shown in Fig. 2(a), e.g., a window of 24 by 24 pixels for face detection. A single feature map with a convolutional layer C and a subsampling layer S was constructed for feature extraction. Edge features, such as those of the two eyes, two eyebrows, the nose and the mouth, were enhanced to reinforce discrimination ability. Additionally, the weights in each layer were trained by examples. The corresponding feature extraction for license plate detection is shown in Fig. 2(b). Edge features were also enhanced by the convolution and subsampling operations.

Figure 2.

CNN with a single feature map for (a) human face detection; (b) plate detection

At layer C, the values in the feature map were convoluted from the original window by the trained weights as follows:

I_{C}(x, y) = \tanh\left(\sum_{i = 0}^{2} \sum_{j = 0}^{2} w_{C}(i, j)\, I(x + i, y + j) + b_{C}\right), \quad 1 \leq x \leq w - 2, \quad 1 \leq y \leq h - 2.

(1)

Here, nine trained weights w_C(i, j), i, j = 0, 1, 2, and a bias b_C (ten parameters in total) were the shared coefficients in the convolution operation. Next, a subsampling operation reduced the size between layers C and S. Each element at layer S received a 2 by 2 neighbouring field from layer C as shown below:

I_{S}(x, y) = \tanh\left(\sum_{i = 0}^{1} \sum_{j = 0}^{1} w_{S}(i, j)\, I_{C}(2x + i - 1, 2y + j - 1) + b_{S}\right), \quad 1 \leq x \leq \frac{w - 2}{2}, \quad 1 \leq y \leq \frac{h - 2}{2}.

(2)

The elements at layer S were generated from the non-overlapping receptive fields of 2 by 2 at layer C. Four weights and one bias were also trained between layers C and S. The window size w_S by h_S at layer S (11 by 11 for face and 27 by 7 for LP) was a quarter, in area, of that at layer C (22 by 22 for face and 54 by 14 for LP). The convolution-subsampling mechanism was designed to simultaneously enhance the texture features and reduce the window size using the trained weights. The windows of the new feature map, a quarter of the original image in size, were verified by a cascade classifier to filter out most background regions. In practice, the operations were applied to the whole input image rather than to each window separately: convolution and subsampling were first performed on the entire image using the trained weights, and the sliding windows were then checked by the two-level classifier.

In order to train the face detector, 5000 facial images (positive samples¹) and 5000 non-facial images (negative samples) were collected. Facial images of various sizes were collected from web sites. Face-only images without any hair or backgrounds were cropped and normalized to a size of 24 by 24. In addition, negative samples of the same size were randomly collected from natural scene images that contained no faces. Several facial and non-facial images are given in the first and third rows of Fig. 3(a), respectively. The corresponding convoluted images are shown in the second and fourth rows. The edge features of the two eyes, two eyebrows, the nose and the mouth were all enhanced.

For training the LP detector, 8721 plate images and 8721 non-plate images were manually collected. LP images were scaled to a size of 56 by 16, as shown in Fig. 3(b). Similarly, LP and non-LP images are shown in rows one and three, and their corresponding convoluted images are shown in the second and fourth rows of Fig. 3(b).

Figure 3.

Training samples for (a) face detection; (b) plate detection

Fig. 3(a) shows the facial and non-facial training patterns. The images in the first and third rows are facial and non-facial images, respectively, and their corresponding convoluted images are shown in the second and fourth rows. As can be seen, the edge features of the two eyes, two eyebrows, the nose and the mouth were all enhanced. The scene images used for the negative samples contained no target objects. The shared weights in layers C and S were trained with these 10 000 samples, a process described below.

In order to train the weights, a CNN (see Fig. 2) was constructed from layers C and S, together with a classic MLP fully connected to layer S, in a supervised classification setting. Target values were assigned to the training samples, e.g., 1 for positive samples and −1 for negative samples. Moreover, a hyperbolic tangent function was used as the activation function at both layers C and S. In total, fifteen trainable parameters at layers C and S (3×3+1 for the convolution and 2×2+1 for the subsampling operation) performed the feature extraction in the proposed method.
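Equations (1) and (2) can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the authors' C implementation: the function names are hypothetical, indices are zero-based (the equations are one-based), and the coefficients below are placeholder values chosen for the example, not the trained weights from the paper.

```python
import math


def convolve(img, w_c, b_c):
    """Eq. (1): 3x3 shared-weight convolution followed by tanh.

    img is a list of h rows with w values each; the result has
    (h - 2) rows and (w - 2) columns.
    """
    h, w = len(img), len(img[0])
    return [[math.tanh(sum(w_c[i][j] * img[y + j][x + i]
                           for i in range(3) for j in range(3)) + b_c)
             for x in range(w - 2)]
            for y in range(h - 2)]


def subsample(c_map, w_s, b_s):
    """Eq. (2): non-overlapping 2x2 weighted subsampling followed by tanh."""
    h, w = len(c_map), len(c_map[0])
    return [[math.tanh(sum(w_s[i][j] * c_map[2 * y + j][2 * x + i]
                           for i in range(2) for j in range(2)) + b_s)
             for x in range(w // 2)]
            for y in range(h // 2)]


def feature_map(img, w_c, b_c, w_s, b_s):
    """Layer C followed by layer S: fifteen parameters in total."""
    return subsample(convolve(img, w_c, b_c), w_s, b_s)


# Placeholder coefficients (assumed for illustration, NOT trained values).
W_C = [[0.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 0.0]]
B_C = 0.0
W_S = [[0.25, 0.25], [0.25, 0.25]]
B_S = 0.0

# A 24 x 24 toy face-sized window and a flat 56 x 16 plate-sized window.
face_window = [[float((x + y) % 2) for x in range(24)] for y in range(24)]
plate_window = [[0.5] * 56 for _ in range(16)]

face_s = feature_map(face_window, W_C, B_C, W_S, B_S)
plate_s = feature_map(plate_window, W_C, B_C, W_S, B_S)
print(len(face_s[0]), "x", len(face_s))    # layer-S size for the face window
print(len(plate_s[0]), "x", len(plate_s))  # layer-S size for the plate window
```

With these sizes, the sketch reproduces the window dimensions stated in the text: 24 by 24 becomes 22 by 22 at layer C and 11 by 11 at layer S, and 56 by 16 becomes 54 by 14 and then 27 by 7.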

3. Two-level Classifier

Although CNN can simultaneously train feature maps and classifiers [7], its large architecture makes it difficult to implement and slows down detection. To accelerate the detection process, the CNN was simplified to a single convolution-subsampling feature map, and a two-level classifier was trained for window checking using a coarse-to-fine strategy. In the detection scheme, original images were first convoluted and sub-sampled. In the coarse-level stage, the w_S by h_S windows were verified using the convoluted data throughout the entire image. Owing to the condensed information in the feature map, only a quarter as many windows as in the original image were checked by the coarse-level classifier, and most non-face/non-plate backgrounds were filtered out quickly. The small number of remaining face/plate-like w_S by h_S windows were further verified using the original image data in the fine-level stage.
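The coarse-to-fine scan can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's detector: `coarse` and `fine` are hypothetical callables standing in for the trained cascades, `toy_feature_map` is a simple averaging stand-in for the trained layers C and S, and the mapping of a layer-S position (xs, ys) to the original position (2·xs, 2·ys) follows from the 3 by 3 convolution and 2 by 2 subsampling geometry.

```python
def toy_feature_map(image):
    """Stand-in for layers C and S (assumption): each output value averages
    the 4 x 4 image patch that one layer-S element depends on."""
    rows, cols = (len(image) - 2) // 2, (len(image[0]) - 2) // 2
    return [[sum(image[2 * y + j][2 * x + i]
                 for i in range(4) for j in range(4)) / 16.0
             for x in range(cols)]
            for y in range(rows)]


def coarse_to_fine(image, s_map, coarse, fine, win=(24, 24), win_s=(11, 11)):
    """Return top-left corners (original pixels) of windows passing both levels."""
    w, h = win
    ws, hs = win_s
    hits = []
    for ys in range(len(s_map) - hs + 1):
        for xs in range(len(s_map[0]) - ws + 1):
            patch_s = [row[xs:xs + ws] for row in s_map[ys:ys + hs]]
            if not coarse(patch_s):
                continue  # most windows are rejected cheaply here
            x, y = 2 * xs, 2 * ys  # corresponding window in the original image
            patch = [row[x:x + w] for row in image[y:y + h]]
            if fine(patch):  # detailed check on full-resolution data
                hits.append((x, y))
    return hits


def mean(patch):
    return sum(map(sum, patch)) / (len(patch) * len(patch[0]))


# Toy 32 x 30 image with one bright 24 x 24 "object" at (4, 2).
image = [[1.0 if 4 <= x < 28 and 2 <= y < 26 else 0.0 for x in range(32)]
         for y in range(30)]
s_map = toy_feature_map(image)

# Hypothetical classifiers: a loose coarse test and a strict fine test.
hits = coarse_to_fine(image, s_map,
                      coarse=lambda p: mean(p) > 0.95,
                      fine=lambda p: mean(p) == 1.0)
print(hits)
```

In this toy example the loose coarse test accepts a few near-miss windows, which the fine test then rejects. For a 320 by 240 image at a single scale, the layer-S map offers 149 × 109 = 16 241 window positions compared with 297 × 217 = 64 449 in the original image, i.e., roughly a quarter.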

In this section, the cascade classifier is introduced based on the FFS algorithm. To speed up the training process, the features in a huge pool were evaluated by a ranking algorithm. A new, small feature pool was generated by selecting the features with high discriminating ability, so that most redundant features did not need to be checked in each loop of the FFS algorithm.

3.1 Adaboost and the Forward Feature Selection (FFS) Algorithm

In Viola [10], a well-known algorithm based on the Adaboost algorithm [11] was proposed for building a detector by cascading numerous classifiers. In addition to facial detection, it was also trained for license plate detection [29] and pedestrian detection [30]. At the start of the training process, all samples were equally weighted. Features with smaller weighted errors were iteratively selected from a feature pool. The selected features formed a strong classifier, and the strong classifiers were cascaded to construct a detector. However, significant training time was needed for assembling the detector.

To alleviate the time-consuming training of the Adaboost-based classifier, Wu et al. [13] propose an FFS scheme. Its pre-computing strategy significantly speeds up the training process [13]. In addition, FFS required only 3% of the memory used by the fast Adaboost algorithm, because each entry in its table V had a binary value and required one bit. All operations were therefore performed in main memory rather than through slow disk access. Considering N training samples and a pool of M features, the flowcharts of the naive Adaboost implementation and the FFS algorithm are given in Fig. 4(a) and 4(b), respectively. Comparing these two algorithms, the block ‘train all weak classifiers’ is executed inside the loop in naive Adaboost, whereas in FFS it is moved outside the loop. This block is expensive: O(NM logN) was needed to re-train the decision functions of the weak classifiers after the sample weights were updated. The complexity of the FFS algorithm was O(NMT + NM logN), while that of the naive implementation of the Adaboost algorithm was O(T + NMT logN) when T features were selected; T was approximately 6000 for an entire classifier. Moreover, an improved implementation of naive Adaboost, called the fast Adaboost implementation, was given by Wu et al. [13], as shown in Fig. 4(c). Similarly, the ‘train all weak classifiers’ block was moved out of the loop of the naive Adaboost implementation. The complexities of the FFS and fast Adaboost implementations were both O(NMT + NM logN).
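To give a feeling for these complexity expressions, the sketch below plugs in numbers mentioned elsewhere in the paper: N = 10 000 training samples, M = 16 233 Haar-like features and T = 6000 selected features. The results are rough operation counts only: constant factors are ignored, the base-2 logarithm is an assumption, and the function names are hypothetical.

```python
import math


def naive_adaboost_cost(n, m, t):
    """O(T + NMT logN): weak classifiers are re-trained in every loop."""
    return t + n * m * t * math.log2(n)


def ffs_cost(n, m, t):
    """O(NMT + NM logN): weak classifiers are trained once, outside the loop."""
    return n * m * t + n * m * math.log2(n)


n, m, t = 10_000, 16_233, 6_000
ratio = naive_adaboost_cost(n, m, t) / ffs_cost(n, m, t)
print(f"naive Adaboost / FFS operation-count ratio: {ratio:.1f}")
```

Under these assumptions the ratio is close to log2(N), i.e., a little over 13, because for large T the NMT term dominates the FFS cost while the naive implementation pays the extra logN factor in every loop.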

Figure 4.

Flowcharts for (a) naïve Adaboost [20]; (b) FFS [20]; (c) fast Adaboost [20]; (d) feature ranking algorithms

3.2 Feature Ranking

When the window was large, the feature pool, i.e., the number of possible features M, was large, too. For example, when the window size was increased from 11 by 11 to 24 by 24, the number of possible features increased rapidly from 6556 to 143 900 using five prototypes, as shown in Fig. 5. Similarly, when the window size for LPD was increased from 27 by 7 to 56 by 16, the number of possible features increased from 41 887 to 77 365 using the prototypes shown in Fig. 9. As already noted, the sliding window size was reduced to 11 by 11 at layer S to decrease the number of features for the coarse-level classifier. To alleviate the same problem for the fine-level classifier, a feature ranking algorithm was used to reduce the large feature pool to a smaller set by ranking the discriminating power of the features. It was then unnecessary to check the redundant features with low discriminating power during the iterations.

Figure 5.

Feature prototypes for facial detection

A pre-selection strategy was adopted in the feature ranking algorithm. There is a trade-off between shortening the training process by keeping only a few features and maintaining the generality of the original feature pool. The goal of the proposed ranking algorithm was to reduce the features to a small set without losing the generality of the feature pool. The pre-selection strategy is presented below.

First, all features in the original pool were ranked according to their weighted errors, i.e., the mis-classification error of each weak classifier. Second, the P features with the lowest errors were chosen to generate a small feature set R, e.g., P = M / 20, which was 1/20 of the original features. Third, features were repeatedly selected from set R to minimize the error of the current ensemble until Q features had been selected, e.g., Q = 10 to 20. Once Q features had been chosen, a new small feature set was re-generated according to the weighted errors. The flowchart of the proposed ranking algorithm is shown in Fig. 4(d). In other words, a pre-selection module was added prior to the FFS algorithm in the proposed method. Using the pre-selection strategy, the training efficiency of the FFS algorithm was improved without losing the generality of the original feature pool; this was especially the case for large feature pools. The complexity of the proposed method can be shown to be O(T(NM + M logM)/Q + NM logN + NPT).
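The three steps above can be sketched on toy data as follows. This is a simplified illustration, not the authors' implementation: the names are hypothetical, the weak classifiers are assumed to be already trained and stored as a binary table V (one row of 0/1 outputs per feature), and the ensemble is a thresholded vote as in FFS. The text does not specify how the sample weights used for re-ranking are updated, so the sketch simply doubles the weight of samples that the current ensemble misclassifies; that rule is purely an assumption.

```python
def weighted_error(outputs, labels, weights):
    """Weighted mis-classification error of one weak classifier."""
    wrong = sum(w for o, y, w in zip(outputs, labels, weights) if o != y)
    return wrong / sum(weights)


def ensemble_error(votes, labels):
    """Smallest number of errors over all vote thresholds, with that threshold."""
    best = (len(labels) + 1, 0)
    for theta in range(max(votes) + 2):
        err = sum((v >= theta) != bool(y) for v, y in zip(votes, labels))
        best = min(best, (err, theta))
    return best


def ranked_forward_selection(V, labels, T, P, Q):
    """Select up to T features, searching only a ranked subset R of size P
    that is re-generated after every Q selections."""
    n = len(labels)
    weights = [1.0] * n
    votes = [0] * n
    selected, chosen, pool = [], set(), []
    while len(selected) < T:
        candidates = [i for i in pool if i not in chosen]
        if len(selected) % Q == 0 or not candidates:
            # Steps 1-2: rank the remaining features, keep the P best as set R.
            remaining = [i for i in range(len(V)) if i not in chosen]
            remaining.sort(key=lambda i: weighted_error(V[i], labels, weights))
            pool = candidates = remaining[:P]
        if not candidates:
            break
        # Step 3: pick the feature in R that minimises the ensemble error.
        best = min(candidates, key=lambda i: ensemble_error(
            [v + o for v, o in zip(votes, V[i])], labels)[0])
        selected.append(best)
        chosen.add(best)
        votes = [v + o for v, o in zip(votes, V[best])]
        _, theta = ensemble_error(votes, labels)
        # Assumed weight update: emphasise currently misclassified samples.
        weights = [2.0 if (v >= theta) != bool(y) else 1.0
                   for v, y in zip(votes, labels)]
    return selected


labels = [1, 1, 1, 1, 0, 0, 0, 0]
V = [
    [1, 1, 1, 1, 0, 0, 0, 0],  # feature 0: no errors
    [1, 1, 1, 0, 0, 0, 0, 1],  # feature 1: two errors
    [0, 0, 0, 0, 1, 1, 1, 1],  # feature 2: always wrong
    [1, 1, 0, 0, 0, 0, 1, 1],  # feature 3: four errors
]
selected = ranked_forward_selection(V, labels, T=2, P=2, Q=1)
print(selected)
```

Because each selection loop only evaluates the P features in R instead of all M, the per-loop cost drops from O(NM) to O(NP), which is where the NPT term in the complexity expression comes from.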

During detection, candidates of various sizes were obtained when the detector with multi-scale windows was moved over the original image. A post-process was then performed to project all candidates of various scales to the original scale for fusing the overlapped regions. For face detection, post-processing was the same as that in [13]. The candidate regions were further checked using the vertical and horizontal gradients in order to remove any false alarms in LP detection.

4. Experimental Results

In this section, some experimental results are discussed to indicate the feasibility of the proposed method in object detection, e.g., in facial and license plate detection.

The algorithms were implemented in the C programming language and run on a PC with a 3.4 GHz Intel CPU and 11 GB of RAM. In order to show the efficiency of the proposed method, two state-of-the-art algorithms, FFS [13] and fast Adaboost [13], were run in the same environment.

4.1 Facial Detection

In face detection, the 24 by 24 windows were checked by the trained detector. The benchmark test dataset ‘CMU+MIT’², composed of 130 images containing 507 faces, was used for evaluation. In the training phase, 6556 and 16 233 features, respectively, were generated for training the coarse- and fine-level classifiers using five feature prototypes, as shown in Fig. 5. In the detection process, an input image was first processed by the convolution and subsampling operations. Each sliding window was then checked, and it was accepted as a face only if it passed both the coarse-level and fine-level tests. The sub-sampled feature map (FM) and the source image were fed into the coarse-level and fine-level classifiers, respectively, to decrease the number of checked windows. Since the features in the FM were enhanced, only eight stages with 392 features were used in the coarse-level classifier, and most non-facial background regions were filtered out. Only a few windows then remained to be verified by the fine-level classifier, which located the faces using 12 stages with 1650 features. Various scales with a scaling factor of 1.25 were applied to detect objects of various sizes. Some detected results for the dataset ‘CMU+MIT’ are shown in Fig. 6(a). All detected faces are drawn within bounding boxes. Here, the definition of a correctly detected face follows [13]. Moreover, missed faces and false alarms are shown in Fig. 6(b). The missed faces occurred in the following contexts: small sizes, dark faces, profiles and faces wearing sunglasses. In the final three cases, faces could be detected given appropriate training images.

Figure 6.

Detection results for dataset ‘CMU+MIT’

Detection performance was assessed using several indices: detection rate, false alarms, detection time, number of features and training time. Two state-of-the-art cascade face detection algorithms based on Haar-like features, FFS and fast Adaboost, were trained by programs available on the Internet³. Since the detection rate trades off against the number of false alarms, a cascade classifier with more stages produces fewer false alarms but delivers a lower detection rate and slower detection speed. The detection rate of a classifier should be higher than 90% for a practical system. Thus, the detection results satisfying the condition “detection rate higher than 90 per cent and fewer than 100 false alarms” were chosen for comparison.

The face detectors with 2042, 3132 and 3932 features were trained using the proposed method, the FFS algorithm and the fast Adaboost algorithm, respectively. The best detection rates and false alarms are tabulated in Table 1. In the empirical tests, the errors that occurred in facial detection can be summarized as follows: 1) missed faces: due to the lack of diversity in the training samples, some faces, such as those wearing sunglasses, in profile, rotated or obstructed, were missed; 2) false alarms: we found that “eyes” were the most important features of faces, and we observed that false alarms occurred in the presence of “eye-like” features. Furthermore, the ROC curves of the detection rates versus the false alarms are given in Fig. 7. Table 1 and Fig. 7 show that the proposed algorithm slightly outperformed the other algorithms.

Table 1.

The performance of the proposed method, FFS and fast Adaboost algorithms for face detection

Method	DR (‘CMU+MIT’, 130 images)	FA (‘CMU+MIT’, 130 images)	Layers/Features	Training time (mins)	Speed (FPS, image size 320 × 240)
FFS	90.34%	97	23/3132	37	11.6
Fast Adaboost	90.93%	84	27/3932	85	9.8
Our method with coarse-level classifier	86.19%	7670	8/392	5	20.8
Our method without ranking	91.12%	91	20/2042	27	20.4

Figure 7.

The ROC curves for the detection rates and the number of false alarms for three algorithms using the ‘CMU+MIT’ database

Next, detection time was analysed. One thousand (1000) testing images at 320 × 240 were collected for evaluating detection efficiency. For a fair comparison, all algorithms were run on the same machine. In addition, pyramid images were constructed for a fair comparison with the FFS algorithm, as per Wu's source code. The FFS algorithm with 3132 features and the fast Adaboost algorithm with 3932 features achieved detection speeds of 11.6 frames/second (f/s) and 9.8 f/s, respectively. Both speeds were less than 15 f/s. Owing to the coarse-to-fine strategy and the smaller number of trained features, a detection speed of 20.4 f/s was achieved by the proposed mechanism. The statistics of the filtered windows in the first eight stages are illustrated in Fig. 8 to indicate the filtering power of FFS, fast Adaboost and the proposed method. Due to the different scales, the total numbers of checked windows for the proposed method, FFS and fast Adaboost algorithms were 18 436 485, 73 621 281 and 73 621 281, respectively. As shown in Fig. 8(a), the number of verified windows in the proposed method was significantly lower than in the other two algorithms during the first eight stages. Moreover, the filtering rate was defined as the ratio of filtered windows to the total number of checked windows in each stage. Since the proposed method filtered out more than 70% of the checked windows in the first stage (see Fig. 8(b)), the checking time was rapidly reduced and detection speed was significantly increased, even though lower filtering rates were achieved in later stages. The performances of the three algorithms are summarized in Table 1.
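The two quantities used in this comparison can be checked with a short calculation. The window totals are those reported above; the per-stage counts are hypothetical numbers chosen only to illustrate the filtering-rate definition, and the reading that the rate is computed relative to the windows entering each stage is an assumption.

```python
def filtering_rates(entering):
    """Per-stage filtering rate: windows rejected in a stage divided by the
    windows entering that stage. entering[k] is the number of windows
    entering stage k; the last entry is the number surviving the final stage.
    """
    return [(a - b) / a for a, b in zip(entering, entering[1:])]


# Reported totals of checked windows (proposed method vs. FFS / fast Adaboost).
proposed_total = 18_436_485
baseline_total = 73_621_281
window_ratio = proposed_total / baseline_total
print(f"checked-window ratio: {window_ratio:.4f}")

# Hypothetical per-stage counts, for illustration only.
rates = filtering_rates([1000, 300, 150, 120])
print([round(r, 2) for r in rates])
```

The ratio of the reported totals is very close to one quarter, consistent with the coarse-level classifier operating on the sub-sampled feature map.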

Figure 8.

Comparisons for three algorithms in the first eight stages: (a) verified windows; (b) filtering rates

Furthermore, the efficiency of the coarse-level classifier is also given in Table 1. The results in Table 1 and Fig. 8 show that most of the checked windows were filtered out by the coarse-level classifier and that a reasonable detection rate was preserved by the fine-level classifier. Moreover, in order to show the effectiveness and robustness of the proposed method, three cascade face detection algorithms based on local transform features, LBP, LGP and LBP hybrid LGP [36], were trained. The best detection rates and false alarms are tabulated in Table 2. According to this table, the methods with local transform features slightly outperformed the proposed method. However, the training time of the proposed method (27 minutes) was approximately 47, 47 and 166 times shorter than those of the LBP, LGP and LBP hybrid LGP methods (1260, 1263 and 4483 minutes), respectively. This indicates that the proposed method is highly practical.

Table 2.

The performance of the proposed method and local transformation features for face detection

Method	DR (‘CMU+MIT’, 130 images)	FA (‘CMU+MIT’, 130 images)	Layers/Features	Training time (mins)	Speed (FPS, image size 320 × 240)
LBP	90.72%	81	4/212	1260	60.6
LGP	90.92%	42	4/364	1263	58.8
LBP hybrid LGP	91.17%	21	4/606	4483	15.6
Our method without ranking	91.12%	91	20/2042	27	20.4

Finally, the efficiency of the training process was analysed. For a fair comparison of training times, the pool of Haar-like features used in Wu et al. [13] was adopted for training these three cascade classifiers⁴. The convolution-subsampling pre-processing, the coarse-level classifier and the fine-level classifier were trained in 3, 3 and 21 minutes, respectively. The training times of the proposed method, FFS and fast Adaboost algorithms were 27, 37 and 85 minutes, respectively. Although the proposed method achieved a shorter training time, the benefit of its feature ranking was not apparent, because the feature pool was very small (16 233 Haar-like features were included in this experiment). In a later experiment, training time was substantially improved when the feature pool for LPD was large.

4.2 License Plate Detection

The results for license plate detection are given in this subsection. They demonstrate the efficiency of the proposed feature ranking scheme in the training phase when the pool of Haar-like features is large. The proposed method, FFS and fast Adaboost algorithms were evaluated using a license plate dataset; 6498 vehicle images were collected from surveillance systems installed on urban roads, at toll stations and in a basement parking lot. The captured image sizes were 320 by 240. Since drivers switch on car headlights in a basement, the LP areas were frequently dark in the captured images. On the other hand, for images captured on roads, weather conditions and cluttered backgrounds increased image complexity. For images captured at toll stations, night-time capture affected image quality. The training set was composed of 8721 positive plate images manually selected from the Internet and 8721 negative samples randomly generated from scene images; all samples were normalized to a size of 56 by 16.

The capture scenes were located on urban roads. The feature prototypes used in the training phase were the same as those used in Wang and Lee [29], as shown in Fig. 9. Furthermore, 41 880 and 77 365 features were generated for training the coarse- and fine-level classifiers, respectively. The dataset was tested by the trained detector under three testing conditions. The correctly detected results are displayed in Fig. 10, and the mis-detected LPs and false alarms are displayed in Fig. 11. All detected LPs are marked with bounding boxes. Most of the mis-detected LPs were skewed, polluted, obscured or overexposed, as shown in Fig. 11(c).

Furthermore, in order to evaluate the robustness of the proposed method for LP detection, images with noise caused by weather conditions (e.g., rain and snow) were also tested. Due to the lack of benchmark datasets for these conditions, several testing samples were collected from the Internet to serve as a reference. The LPD results for rainy and snowy days are shown in Fig. 12 and Fig. 13, respectively. In Fig. 12(a), all detected LPs for rainy days are marked with bounding boxes, while the LPs in Fig. 12(b) were missed because they were not visible. Similarly, all detected LPs for snowy days are marked with bounding boxes in Fig. 13(a), and the mis-detected LPs in Fig. 13(b) also occurred where the LP characters were invisible.

Figure 9.

Feature prototypes of upright and skewed Haar-like features for LPD [29]

Figure 10.

Correctly detected results. Images grabbed from cameras (a) in a parking lot in a basement; (b) on urban roads; (c) at toll stations

Figure 11.

Mis-detected results and false alarms. Images grabbed from cameras (a) in a parking lot in a basement; (b) on urban roads; (c) at toll stations

Figure 12.

Detection results for license plates on a rainy day

Figure 13.

Detection results for license plates on a snowy day

In order to show the efficiency of the feature ranking scheme, the detection rates and training times of fast Adaboost, FFS and the proposed method were analysed for various values of the parameters P and Q, as shown in Figs. 14 and 15. Fig. 14 shows that the ROC curves (detection rate vs. number of false alarms) were very similar across the parameter settings for all three algorithms; that is, the generality of the feature pool was retained when a small feature set was chosen. Fig. 15 compares the training times of the three algorithms, measured for the processes within the loops under the various parameter settings. On average, the in-loop training time without feature ranking was 10.4, 9.3 and 9.4 times that with feature ranking. However, when all blocks in the algorithms were included, the total training time without feature ranking was only 2.65, 1.5 and 1.83 times that with feature ranking. The gain in total training time was smaller because the block “train all weaker classifiers” took a significant amount of time to complete. Moreover, the ROC curves of the detection rates vs. the number of false alarms for the three algorithms on the LPD database are shown in Fig. 16. The performance of the compared algorithms in terms of LP detection is tabulated in Table 3. In our experience, the detection rate (DR) and the number of false alarms (FA) trade off against each other, so the DRs in Table 3 are reported for comparison under the condition “the FAs are less than 1600”. The configurations of the trained detectors are also shown in this table. Owing to the coarse-to-fine strategy, the detection frame rate of the proposed method was higher than that of the fast Adaboost method, even though its layer numbers and feature sizes were larger. The table also shows that the proposed method detected faster than the other two methods.

Table 3.

The performance of the proposed method, FFS and fast Adaboost algorithms for LPD

Methods	DR (dataset of 6498 images)	FA (dataset of 6498 images)	Layers/Features	Training time (mins)	FPS (image size 320 × 240)
FFS	91.26%	1765	9/612	62	24.4
Fast Adaboost	90.44%	1200	8/487	84	18.5
Our method with coarse-level classifier	84.30%	4649	6/250	17 (4 + 13)	21.1
Our method without ranking	90.75%	1567	(6+4)/580	59 (4 + 55)	36.4
Our method with ranking (M/20, Q=20)	92.15%	1488	(6+4)/580	36 (4 + 32)	38.5
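
As a reading aid for Table 3, the short sketch below derives the training-time and frame-rate ratios of each full detector relative to the proposed method with ranking. The numbers are copied from the table; the function and variable names are hypothetical, and the ratios describe only this single configuration, not the averages over P and Q reported in the text.

```python
# Rows copied from Table 3: (method, DR in %, FA, training minutes, FPS).
TABLE3 = [
    ("FFS", 91.26, 1765, 62, 24.4),
    ("Fast Adaboost", 90.44, 1200, 84, 18.5),
    ("Our method without ranking", 90.75, 1567, 59, 36.4),
    ("Our method with ranking", 92.15, 1488, 36, 38.5),
]


def ratios_vs(reference_name, rows):
    """Return {method: (training-time ratio, FPS gain of the reference)}."""
    ref = next(r for r in rows if r[0] == reference_name)
    return {
        name: (round(train / ref[3], 2), round(ref[4] / fps, 2))
        for name, _dr, _fa, train, fps in rows
        if name != reference_name
    }


ratios = ratios_vs("Our method with ranking", TABLE3)
for name, (train_ratio, fps_gain) in ratios.items():
    print(f"{name}: {train_ratio}x the training time, "
          f"{fps_gain}x slower at detection than the ranked detector")
```

For this configuration, training without ranking takes about 1.64 times as long as training with ranking, while fast Adaboost takes about 2.33 times as long and runs at less than half the frame rate.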

Figure 14.

ROC curves for the detection rates versus the number of false alarms for various parameters P and Q on LPD

Figure 15.

Training time for the cascade classifiers with and without feature ranking

Figure 16.

ROC curves for the detection rate versus the number of false alarms for three algorithms with better results regarding LPD

In summary, the proposed method with feature ranking improved both the detection and training processes for LP detection.

4.3 Discussion

The experimental results of this study show that the proposed mechanism efficiently improves both the detection and training processes for LP detection, as summarized below.

Detection process: two issues were considered for speeding up the detection process, namely decreasing the checking time in each window and decreasing the number of checked windows. First, under the two-level verifying scheme, as many background windows as possible were filtered out at the coarse level. In addition, because of the condensed feature maps, the trained detector contained few features, so the average checking time in each window decreased. Second, the number of checked windows was decreased. Since the subsampled feature map was fed into the coarse-level cascade classifier, the number of checked windows was reduced to a quarter of that in the original image. In this way, the checking time and the number of checked windows were reduced at the same time.
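
The two-level verification described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: plain 2 by 2 averaging stands in for the trained convolution/subsampling map, the coarse window is taken as half the fine window (12 rather than the 11 by 11 used in the paper), and the threshold tests `coarse` and `fine` are hypothetical placeholders for the trained cascades.

```python
def subsample_2x2(image):
    """Average each non-overlapping 2x2 block (stand-in for the trained feature map)."""
    h, w = len(image) // 2, len(image[0]) // 2
    return [
        [
            (image[2 * y][2 * x] + image[2 * y][2 * x + 1]
             + image[2 * y + 1][2 * x] + image[2 * y + 1][2 * x + 1]) / 4.0
            for x in range(w)
        ]
        for y in range(h)
    ]


def window_mean(grid, top, left, size):
    """Mean value inside a size x size window."""
    total = sum(sum(row[left:left + size]) for row in grid[top:top + size])
    return total / float(size * size)


def detect(image, coarse_accepts, fine_accepts, fine_win=24):
    """Check every coarse window; re-check only the survivors at full resolution."""
    coarse_map = subsample_2x2(image)
    coarse_win = fine_win // 2
    rows = len(coarse_map) - coarse_win + 1
    cols = len(coarse_map[0]) - coarse_win + 1
    detections, coarse_checked, fine_checked = [], 0, 0
    for y in range(rows):
        for x in range(cols):
            coarse_checked += 1
            if not coarse_accepts(coarse_map, y, x, coarse_win):
                continue  # background rejected cheaply at the coarse level
            fine_checked += 1
            top, left = 2 * y, 2 * x  # map back to full-resolution coordinates
            if fine_accepts(image, top, left, fine_win):
                detections.append((top, left))
    return detections, coarse_checked, fine_checked


# Toy input: a dark 40x40 image containing one bright 24x24 square at (8, 8).
image = [[1.0 if 8 <= y < 32 and 8 <= x < 32 else 0.0 for x in range(40)]
         for y in range(40)]
coarse = lambda grid, y, x, size: window_mean(grid, y, x, size) > 0.9   # hypothetical
fine = lambda grid, y, x, size: window_mean(grid, y, x, size) > 0.99    # hypothetical

detections, coarse_checked, fine_checked = detect(image, coarse, fine)
full_res_windows = (40 - 24 + 1) ** 2
print(detections, coarse_checked, fine_checked, full_res_windows)
```

In this toy case the coarse level inspects 81 windows instead of the 289 that a full-resolution scan would need, and only 5 of them reach the fine-level check, which keeps the single true location.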

Training process: the training process for cascade classifiers was also improved. Since training time is proportional to the size of the feature pool, two methodologies were used to speed up the training process. First, the window size was reduced. Because the window in the coarse-level classifier was a quarter of the original window size, the number of features dropped sharply. For example, when the window size was reduced from 24 by 24 to 11 by 11, the number of features generated from five prototypes decreased from 143 900 to 6556. Second, a feature ranking algorithm pre-selected the features from a large pool. Based on this pre-selection strategy, features with high discriminating power were chosen to obtain a smaller feature set for the training phase. The proposed scheme decreased the training time compared to the traditional Adaboost-based algorithms. Meanwhile, because the ranking algorithm removed redundant features during the training process, training time was decreased while the detection rate of the detector was retained. Furthermore, fewer weights were trained in a single feature map than in an ANN. Training the feature enhancement took four minutes for either facial detection or LPD.
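
The effect of window size on the feature pool can be illustrated with a small counting sketch. It assumes the common exhaustive enumeration of five upright Haar-like base shapes (2×1, 1×2, 3×1, 1×3 and 2×2, each scaled in whole multiples and shifted by one pixel); the prototype set and constraints used in this paper are not specified here, so the absolute counts differ from the 143 900 and 6556 quoted above. The names are hypothetical.

```python
def count_features(win_w, win_h, base_shapes):
    """Count every scaled and shifted copy of each base shape inside the window."""
    total = 0
    for base_w, base_h in base_shapes:
        # Widths and heights grow in whole multiples of the base shape.
        for w in range(base_w, win_w + 1, base_w):
            for h in range(base_h, win_h + 1, base_h):
                total += (win_w - w + 1) * (win_h - h + 1)
    return total


# Assumption: the five classic upright prototypes, given as (width, height).
FIVE_PROTOTYPES = [(2, 1), (1, 2), (3, 1), (1, 3), (2, 2)]

large = count_features(24, 24, FIVE_PROTOTYPES)
small = count_features(11, 11, FIVE_PROTOTYPES)
print(large, small, round(large / small, 1))
```

Under this enumeration the pool shrinks from 162 336 to 7236 features, a factor of about 22, which is close to the ratio of the counts reported in the paper.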

5. Conclusions

In this study, a coarse-to-fine cascade classifier was designed to detect facial or LP objects. In the two-level classifier, small windows with enhanced pixels were verified by the coarse-level classifier, and large windows with detailed data were further verified by the fine-level classifier. A small feature map with enhanced pixels (i.e., 1/4 of the original image size) was trained by the convolution and subsampling operations. Using this mechanism, most background windows were filtered out by the coarse-level classifier and the few surviving foreground windows were further verified by the fine-level classifier. In addition, the feature pool was reduced to speed up the training process using the following two strategies: (1) features were generated from the prototypes within a small window at the coarse level, and (2) a small feature set was generated by the ranking algorithm at the fine level. Experiments were conducted to show the feasibility and effectiveness of the proposed method. In the future, this two-level mechanism will be extended to detect more complex objects such as text characters, cars and pedestrians.

6. Acknowledgements

This research was supported by the National Science Council under grant nos. NSC 100-2221-E-239-035 and NSC 101-2221-E-239-034.

Footnotes

1

The training facial samples were provided by Dr. Wu and are available from: . Accessed 09/12/2014.

2

. Accessed 09/12/2014.

3

. Accessed 09/12/2014.

4

The features used in the training of a fine-level classifier were the same as those used in Wu's program, i.e., 16 233 features. However, the feature number used in the training of a coarse-level classifier was 6556, due to the smaller window size.

References

1. Turk, A. P. Pentland, “Face recognition using eigenfaces,” IEEE Conf. Computer Vision and Pattern Recognition, pp. 586–591, 1991.

2. Yang, Kriegman, Ahuja, “Detecting faces in images: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 1, pp. 34–58, 2002.

3. Hjelmas, B. K. Low, “Face detection: A survey,” Computer Vision and Image Understanding, vol. 83, no. 3, pp. 236–274, 2001.

4. C. A. Waring, Liu, “Face detection using spectral histograms and SVMs,” IEEE Transactions on Systems, Man, and Cybernetics-Part B: Cybernetics, vol. 35, no. 3, pp. 467–476, 2005.

5. Gong, Sherrah, Liddell, “Support vector machine based multi-view face detection and recognition,” Image and Vision Computing, vol. 22, no. 5, pp. 413–427, 2004.

6. H. A. Rowley, Baluja, Kanade, “Neural network-based face detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 1, pp. 23–38, January 1998.

7. Garcia, Delakis, “Convolutional face finder: A neural architecture for fast and robust face detector,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 11, pp. 1408–1423, 2004.

8. C. C. Han, H. Y. Liao, G. J., L. H. Chen, “Fast face detection via morphology-based pre-processing,” Pattern Recognition, vol. 33, no. 10, pp. 1701–1712, 2000.

9. Viola, Jones, “Rapid object detection using a boosted cascade of simple features,” in Proc. Intl Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 511–518, 2001.

10. Viola, Jones, “Robust real-time face detection,” International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.

11. Freund, Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” J. Comp. Syst. Sci., vol. 55, no. 1, pp. 119–139, 1997.

12. Zhu, Zhang, Blake, Zhang, Shum, “Statistical learning of multi-view face detection,” in Proc. Seventh European Conf. Computer Vision, pp. 67–81, 2002.

13. Brubaker, M. D. Mulin, J. M. Rehg, “Fast asymmetric learning for cascade face detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 3, pp. 369–382, 2008.

14. Shen, Paisitkriangkrai, Zhang, “Efficiently learning a detection cascade with sparse eigen-vectors,” IEEE Transactions on Image Processing, vol. 20, no. 1, pp. 22–35, 2011.

15. LeCun, Bottou, Bengio, Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, pp. 2278–2324, 1998.

16. LeCun, Boser, Denker, Henderson, Howard, Hubbard, Jackel, “Handwritten digit recognition with a back-propagation network,” Advances in Neural Information Processing Systems, pp. 396–404, 1990.

17. P. Y. Simard, Steinkraus, J. C. Platt, “Best practices for convolutional neural networks applied to visual document analysis,” Proc. IEEE Conf. Document Analysis and Recognition, pp. 958–963, 2003.

18. Kwolek, “Face detection using convolutional neural networks and Gabor filters,” Lecture Notes in Computer Science, vol. 3696, pp. 551–556, 2005.

19. Osadchy, LeCun, Miller, “Synergistic face detection and pose estimation with energy-based model,” Journal of Machine Learning Research, vol. 8, pp. 1197–1215, 2007.

20. Matsugu, Mori, Mitari, Kaneda, “Subject independent facial expression recognition with robust face detection using a convolutional neural network,” Neural Networks, vol. 12, no. 5–6, pp. 555–559, 2003.

21. Lawrence, C. L. Giles, A. C. Tsoi, A. D. Back, “Face recognition: A convolutional neural network approach,” IEEE Transactions on Neural Networks, vol. 8, no. 1, pp. 98–113, 1997.

22. F. J. Huang, LeCun, “Large-scale learning with SVM and convolutional nets for generic object categorization,” in Proc. Computer Vision and Pattern Recognition Conference (CVPR'06), pp. 284–291, 2006.

23. S. J. Nowlan, J. C. Platt, “A convolutional neural network hand tracker,” in Advances in Neural Information Processing Systems, vol. 7, pp. 901–908, The MIT Press, 1995.

24. Jeong, J. J. Choi, G. J. Jang, “Semi-local structure patterns for robust face detection,” IEEE Signal Processing Letters, vol. 22, no. 9, pp. 1400–1403, 2015.

25. Gunasekar, Ghosh, A. C. Bovik, “Face detection on distorted images augmented by perceptual quality-aware features,” IEEE Transactions on Information Forensics and Security, vol. 9, no. 12, pp. 2119–2131, 2014.

26. Jun, Choi, Kim, “Local transform features and hybridization for accurate face and human detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 6, pp. 1423–1436, 2013.

27. Cho, Roberts, Jung, Choi, Moon, “An efficient hybrid face recognition algorithm using PCA and GABOR wavelets,” International Journal of Advanced Robotic Systems, 11:59, pp. 1–8, 2014.

28. Cho, Lee, I. H. Suh, “Facial feature tracking using efficient particle filter and active appearance model,” International Journal of Advanced Robotic Systems, 11:154, pp. 1–11, 2014.

29. S. Z. Wang, H. J. Lee, “A cascade framework for a real-time statistical plate recognition system,” IEEE Transactions on Information Forensics and Security, vol. 2, no. 2, pp. 267–282, 2007.

30. Enzweiler, D. M. Gavrila, “Monocular pedestrian detection: Survey and experiments,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 12, pp. 2179–2195, 2009.

31. Zhou, Tian, “Principal visual word discovery for automatic license plate detection,” IEEE Transactions on Image Processing, pp. Early Access, 2012.

32. Giannoukos, C. N. Anagnostopoul, Loumos, Kayafas, “Operator context scanning to support high segmentation rates for real time license plate recognition,” Pattern Recognition, vol. 43, no. 11, pp. 3866–3878, 2010.

33. Y. R. Wang, W. H. Lin, S. J. Horng, “A sliding window technique for efficient license plate localization based on discrete wavelet transform,” Expert Systems with Applications, vol. 38, no. 4, pp. 3142–3146, 2011.

34. Tian, Wen, “Component-based license plate detection using conditional random field model,” IEEE Transactions on Intelligent Transportation Systems, vol. 14, no. 4, pp. 1690–1699, 2013.

35. A. M. Al-Ghaili, Mashohor, A. R. Ramli, Ismail, “Vertical-edge-based car-license-plate detection method,” IEEE Transactions on Vehicular Technology, vol. 62, no. 1, pp. 26–38, 2013.

36. Jun, Choi, Kim, “Local transform features and hybridization for accurate face and human detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 6, pp. 1423–1436, 2013.

Facial/License Plate Detection Using a Two-Level Cascade Classifier and a Single Convolutional Feature Map
