Multiple Kernel Learning in Fisher Discriminant Analysis for Face Recognition

Abstract

Recent applications and developments based on support vector machines (SVMs) have shown that using multiple kernels instead of a single one can enhance classifier performance. However, there are few reports on the performance of the kernel-based Fisher discriminant analysis (kernel-based FDA) method with multiple kernels. This paper proposes a multiple kernel construction method for kernel-based FDA. The constructed kernel is a linear combination of several base kernels with a constraint on their weights. Using the maximum margin criterion (MMC) as the objective, we present an iterative scheme for weight optimization. Experiments on the FERET and CMU PIE face databases show that our multiple kernel Fisher discriminant analysis (MKFD) achieves higher recognition accuracy than single-kernel-based FDA. The experiments also show that the constructed kernel relaxes parameter selection for kernel-based FDA to some extent.

Keywords

Multiple Kernel Learning (MKL); Kernel-based Fisher Discriminant Analysis (kernel-based FDA); Maximum Margin Criterion (MMC); Weight Optimization

1. Introduction

Because of image variations such as pose, illumination and facial expression, face recognition is a highly complex and nonlinear problem that cannot be handled adequately by linear methods such as principal component analysis (PCA) [1] and Fisher discriminant analysis (FDA) [2]. It is therefore reasonable to expect that nonlinear methods, such as the so-called kernel machine techniques [3], offer a better solution to this inherently nonlinear problem. Following the success of the kernel trick in SVMs, many kernel-based PCA and FDA methods have been developed and applied in pattern recognition tasks, such as kernel PCA (KPCA) [4], kernel Fisher discriminant (KFD) [5], generalized discriminant analysis (GDA) [6], and kernel direct discriminant analysis (KDDA) [7].

It has been shown that the kernel-based FDA method is a feasible approach to the nonlinear problems in face recognition. However, its performance is sensitive to the choice of kernel function and kernel parameters. To date, kernel parameters are mainly selected by cross-validation [8], which is computationally expensive and does not guarantee that the selected parameters are optimal. Furthermore, a single fixed kernel can characterize only some aspects of the geometrical structure of the input data and is therefore not always suited to applications that involve data from multiple, heterogeneous sources [9][10], such as face images under broad variations of pose, illumination, facial expression, aging, etc.

Recent applications and developments based on SVMs [11][12] have shown that using multiple kernels (i.e., a combination of several “base kernels”) instead of a single fixed one can enhance classifier performance, which gave rise to the so-called multiple kernel learning (MKL) method. With M kernels, input data can be mapped into M feature spaces, where each feature space can be taken as one view of the original input data [10]. Each view is expected to exhibit some geometrical structure of the original data from its own perspective, so that the views complement one another in the subsequent learning task. It has been shown that MKL offers the needed flexibility and handles well the case of multiple, heterogeneous data sources [9][13][14]. However, MKL was proposed for SVMs, and there have been few reports on the performance of the kernel-based FDA method with multiple kernels.

In this paper, we propose multiple kernel Fisher discriminant analysis (MKFD), in which the constructed kernel is a linear combination of several base kernels with a constraint on their weights, and we give an iterative scheme for weight optimization.

The rest of this paper is organized as follows. First we describe the kernel construction for MKFD in section 2. Then in section 3, the optimization scheme for the multi-kernel weights is presented. The experimental results are reported in section 4, while we draw our conclusion in section 5.

2. Kernel construction for MKFD

Given M Mercer kernel functions k^(m)(x, y), m = 1, 2,…,M, defined on ℝ ^d × ℝ ^d , which are M base kernels, we construct the multiple kernel function as the following linear combination

k (x, y) = \sum_{m = 1}^{M} γ_{m}^{2} k^{(m)} (x, y), \quad s.t. \quad \sum_{m = 1}^{M} γ_{m} = 1,

(1)

where γ_m² is the weight of the base kernel k^(m)(x, y). Since the weights γ_m² are nonnegative, it is easy to show that k(x, y) is also a Mercer kernel defined on ℝ^d × ℝ^d.

We apply the multiple kernel function (1) in kernel-based FDA, which produces what we call MKFD. To achieve good performance of MKFD for face recognition, we consider the problem of learning proper weights of the base kernels, i.e., choosing the best γ = (γ₁, γ₂, …, γ_M)^T, without regard to the specific structures of the base kernel functions.
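The construction in Eq. (1) can be illustrated at the level of kernel matrices. The following is a minimal sketch, not the authors' implementation: the function name `combine_kernels`, the toy data, and the two base kernels are assumptions made for illustration. It forms the weighted sum with squared weights and checks numerically that the result stays symmetric and positive semidefinite.

```python
import numpy as np

def combine_kernels(base_kernels, gamma):
    """Return K = sum_m gamma_m^2 * K^(m), with the weights gamma summing to 1."""
    gamma = np.asarray(gamma, dtype=float)
    if not np.isclose(gamma.sum(), 1.0):
        raise ValueError("weights gamma must sum to 1")
    return sum(g ** 2 * K for g, K in zip(gamma, base_kernels))

# Toy data (illustrative only): 6 samples in R^3.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
K_lin = X @ X.T                                          # linear kernel matrix
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K_rbf = np.exp(-sq_dist / 2.0)                           # RBF kernel matrix, sigma = 1

K = combine_kernels([K_lin, K_rbf], [0.3, 0.7])
min_eig = np.linalg.eigvalsh(K).min()                    # >= 0 up to round-off
```

Because each γ_m enters squared, the combination remains a valid kernel even if an unconstrained optimizer drives some γ_m negative.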

3. Weight optimization for MKFD

3.1 Some notations on MKFD

Let 𝕏 ⊂ ℝ^d be the original sample space. Let X be a training set of N samples and C be the number of sample classes. Assume the i-th class X_i contains N_i samples, i.e., $X_{i} = \{x_{1}^{i}, x_{2}^{i}, \dots, x_{N_{i}}^{i}\}$, $i = 1, 2, \dots, C$, so that $N = \sum_{i = 1}^{C} N_{i}$.

Denote M nonlinear mappings as $Φ_{m} : x \in X \to Φ_{m} (x) \in F$ , $m = 1, 2, ..., M$ , where F is the mapped feature space, with df = dim F (dimensionality of F). Denote M base kernel matrices (N × N) as

K^{(m)} = {[k^{(m)} (x_{j}^{i}, x_{h}^{l})]}_{\begin{array}{l} i = 1, \dots, C, j = 1, \dots, N_{i} \\ l = 1, \dots, C, h = 1, \dots, N_{l} \end{array}}, m = 1, ..., M,

(2)

where $k^{(m)} (x_{j}^{i}, x_{h}^{l}) = Φ_{m} {(x_{j}^{i})}^{T} Φ_{m} (x_{h}^{l})$ , each base kernel corresponding to one nonlinear mapping.

Given γ = (γ₁, γ₂, …, γ_M)^T subject to $\sum_{m = 1}^{M} γ_{m} = 1$, the N × N multiple kernel matrix is

K = \sum_{m = 1}^{M} γ_{m}^{2} K^{(m)} .

(3)

We call the mapping from 𝕏 to F the multiple nonlinear mapping, denoted as Φ, which is implicitly defined by the multiple kernel matrix K and can be understood as the compound of Φ_m, m = 1, …, M.

Under the multiple nonlinear mapping Φ, the i-th mapped class and the mapped sample set are respectively given by

\begin{array}{l} Φ (X_{i}) = \{Φ (x_{1}^{i}), Φ (x_{2}^{i}), \dots, Φ (x_{N_{i}}^{i})\}, \\ Φ (X) = \{Φ (X_{1}), Φ (X_{2}), \dots, Φ (X_{C})\} . \end{array}

Also, the mean of the mapped class Φ(X_i) and that of the mapped sample set Φ(X) are respectively given by

m_{i} = \frac{1}{N_{i}} \sum_{j = 1}^{N_{i}} Φ (x_{j}^{i}), m = \frac{1}{N} \sum_{i = 1}^{C} \sum_{j = 1}^{N_{i}} Φ (x_{j}^{i}) .

In the kernel feature space F, the within-class scatter matrix S_w^Φ and the between-class scatter matrix S_b^Φ are respectively defined as

S_{w}^{Φ} = \frac{1}{N} \sum_{i = 1}^{C} \sum_{x \in X_{i}} (Φ (x) - m_{i}) {(Φ (x) - m_{i})}^{T} = Φ_{w} Φ_{w}^{T},

(4)

S_{b}^{Φ} = \frac{1}{N} \sum_{i = 1}^{C} N_{i} (m_{i} - m) {(m_{i} - m)}^{T} = Φ_{b} Φ_{b}^{T},

(5)

where

\begin{array}{c} Φ_{w} = {[φ_{1}^{1}, ..., φ_{N_{1}}^{1}, φ_{1}^{2}, ..., φ_{N_{2}}^{2}, ..., φ_{1}^{C}, ..., φ_{N_{C}}^{C}]}_{df \times N}, \\ φ_{j}^{i} = \frac{1}{\sqrt{N}} (Φ (x_{j}^{i}) - m_{i}), \\ Φ_{b} = {[φ_{1}, ..., φ_{C}]}_{df \times C}, \quad φ_{i} = \sqrt{\frac{N_{i}}{N}} (m_{i} - m) . \end{array}

(6)

The kernel Fisher criterion is defined as

J^{Φ} (W) = \frac{tr (W^{T} S_{b}^{Φ} W)}{tr (W^{T} S_{w}^{Φ} W)},

where W = [w_1, …, w_q] is a df × q (df > q) projection matrix. MKFD seeks an optimal projection matrix W* : ℝ^df → ℝ^q in the mapped feature space F such that W* = arg max_W J^Φ(W).

3.2 Diagonalization strategy

We use the same diagonalization strategy as KDDA [7] to deal with the small sample size (SSS) problem in our MKFD, i.e., first diagonalizing S_b^Φ to I (the identity matrix) and then diagonalizing S_w^Φ to Λ_w. The strategy is briefly expressed in the MKFD notation as follows.

3.2.1 Eigen-analysis of S_b^Φ in the feature space

Φ_b^T Φ_b can be expressed using the multiple kernel matrix K as follows:

\begin{matrix} Φ_{b}^{T} Φ_{b} = \frac{1}{N} D \cdot (A_{N C}^{T} \cdot K \cdot A_{N C} - \frac{1}{N} A_{N C}^{T} \cdot K \cdot 1_{N C} \\ - \frac{1}{N} 1_{N C}^{T} \cdot K \cdot A_{N C} + \frac{1}{N^{2}} 1_{N C}^{T} \cdot K \cdot 1_{N C}) \cdot D, \end{matrix}

(7)

where $D = diag (\sqrt{N_{1}}, ..., \sqrt{N_{C}})$ (a C × C diagonal matrix), 1_NC is an N × C matrix with all terms equal to one, A_NC = diag(a_{N_1}, …, a_{N_C}) is an N × C block diagonal matrix, and a_{N_i} is an N_i × 1 vector with all terms equal to $\frac{1}{N_{i}}$.
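Eq. (7) can be checked numerically for a kernel whose feature map is known. The sketch below is illustrative only: the helper names `class_matrices` and `between_gram` and the toy data are assumptions, and samples are assumed to be ordered class by class. With the linear kernel, Φ(x) = x, so Φ_b can be built explicitly and compared against the kernel-only expression.

```python
import numpy as np

def class_matrices(class_sizes):
    """Build D (C x C), A_NC (N x C) and 1_NC (N x C) for samples ordered by class."""
    sizes = np.asarray(class_sizes)
    N, C = int(sizes.sum()), len(sizes)
    D = np.diag(np.sqrt(sizes.astype(float)))
    A = np.zeros((N, C))
    start = 0
    for i, n_i in enumerate(sizes):
        A[start:start + n_i, i] = 1.0 / n_i   # block a_{N_i}: N_i entries equal to 1/N_i
        start += n_i
    ones = np.ones((N, C))
    return D, A, ones

def between_gram(K, class_sizes):
    """Phi_b^T Phi_b computed from the kernel matrix K as in Eq. (7)."""
    D, A, ones = class_matrices(class_sizes)
    N = K.shape[0]
    inner = (A.T @ K @ A
             - (A.T @ K @ ones) / N
             - (ones.T @ K @ A) / N
             + (ones.T @ K @ ones) / N ** 2)
    return D @ inner @ D / N

# Check against explicit features for the linear kernel (Phi(x) = x).
rng = np.random.default_rng(1)
sizes = [3, 4, 2]
X = rng.normal(size=(sum(sizes), 5))
labels = np.repeat(np.arange(len(sizes)), sizes)
m = X.mean(axis=0)
Phi_b = np.stack(
    [np.sqrt(n / len(X)) * (X[labels == i].mean(axis=0) - m) for i, n in enumerate(sizes)],
    axis=1,
)                                             # d x C, columns as in Eq. (6)
P = between_gram(X @ X.T, sizes)
```

The resulting C × C matrix has rank at most C − 1, because the columns of Φ_b weighted by √N_i sum to zero.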

Let λ_i and e_i (i = 1, …, C) be the i-th largest eigenvalue and the corresponding eigenvector of Φ_b^T Φ_b. Let r (≤ C − 1) be the rank of S_b^Φ (= Φ_b Φ_b^T), which is also the rank of Φ_b^T Φ_b. Denote E_r = (e_1, …, e_r) and V = (v_1, …, v_r) = Φ_b E_r. It can be derived that V^T S_b^Φ V = Λ_b, with Λ_b = diag(λ₁², …, λ_r²), a nonsingular diagonal matrix. Let U = V Λ_b^{−1/2}. Then U^T S_b^Φ U = I.

3.2.2 Eigen-analysis of S^Φ _w in the feature space

Based on the analysis in section 3.2.1, it can be seen that

U^{T} S_{w}^{Φ} U = {(E_{r} Λ_{b}^{- 1 / 2})}^{T} (Φ_{b}^{T} S_{w}^{Φ} Φ_{b}) (E_{r} Λ_{b}^{- 1 / 2}),

where $Φ_{b}^{T} S_{w}^{Φ} Φ_{b}$ can be expressed using K, with the similar details seen in [7].

Let z_j be the eigenvector of U^T S_w^Φ U corresponding to the j-th smallest eigenvalue λ'_j, j = 1, …, r. Denote Z = (z₁, …, z_r). Defining Y = UZ, it can be derived that Y^T S_w^Φ Y = Λ_w, with Λ_w = diag(λ'₁, …, λ'_r).

Based on the derivation presented in section 3.2.1 and 3.2.2, an optimal projection matrix for MKFD is obtained as

W * = Y Λ_{w}^{- 1 / 2} = Φ_{b} E_{r} Λ_{b}^{- 1 / 2} Z Λ_{w}^{- 1 / 2} .

(8)

Since the multiple nonlinear mapping Φ is only implicitly defined by the multiple kernel function (or matrix), Φ_b (defined by Eq. (6)) remains unknown, and W* cannot be evaluated explicitly. What Eq. (8) actually provides is the matrix $E_{r} Λ_{b}^{- 1 / 2} Z Λ_{w}^{- 1 / 2}$, which can be computed from the multiple kernel matrix K. This is the core result of diagonalization for MKFD.
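The two eigen-analysis steps of Sections 3.2.1 and 3.2.2 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function name `diagonalize`, the rank tolerance `tol`, and the toy data are hypothetical, and all eigenvalues λ'_j are assumed strictly positive so that Λ_w^{−1/2} exists. The inputs are the C × C matrices Φ_b^T Φ_b and Φ_b^T S_w^Φ Φ_b, which in practice come from K; here they are built from explicit linear-kernel features so the result can be verified.

```python
import numpy as np

def diagonalize(P, R, tol=1e-10):
    """Return (G, lam_w) with G = E_r Lb^{-1/2} Z Lw^{-1/2}.

    P = Phi_b^T Phi_b and R = Phi_b^T S_w Phi_b, both C x C.
    """
    lam, E = np.linalg.eigh(P)
    order = np.argsort(lam)[::-1]             # largest eigenvalues first
    lam, E = lam[order], E[:, order]
    r = int((lam > tol * lam[0]).sum())       # numerical rank of P
    # Lambda_b = diag(lam^2), hence Lambda_b^{-1/2} = diag(1 / lam).
    T = E[:, :r] / lam[:r]                    # E_r Lambda_b^{-1/2}
    M_w = T.T @ R @ T                         # equals U^T S_w U
    lam_w, Z = np.linalg.eigh((M_w + M_w.T) / 2)   # ascending: smallest first
    G = (T @ Z) / np.sqrt(lam_w)              # assumes lam_w > 0
    return G, lam_w

# Toy check with the linear kernel, where Phi(x) = x is explicit.
rng = np.random.default_rng(2)
sizes = [4, 3, 5]
X = rng.normal(size=(sum(sizes), 6))
labels = np.repeat(np.arange(3), sizes)
N = len(X)
means = np.stack([X[labels == i].mean(axis=0) for i in range(3)])
Phi_w = ((X - means[labels]) / np.sqrt(N)).T                                  # d x N
Phi_b = (np.sqrt(np.array(sizes) / N)[:, None] * (means - X.mean(axis=0))).T  # d x C
S_w, S_b = Phi_w @ Phi_w.T, Phi_b @ Phi_b.T

G, lam_w = diagonalize(Phi_b.T @ Phi_b, Phi_b.T @ S_w @ Phi_b)
W = Phi_b @ G                                 # W* of Eq. (8), available here only because Phi is explicit
```

In this toy setting W whitens the within-class scatter (W^T S_w W = I) while W^T S_b W = Λ_w^{−1}, so the directions with the smallest λ'_j carry the largest between-class scatter.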

3.3 Optimization criterion and objective

We adopt the maximum margin criterion (MMC) [15] as the objective function to optimize weight γ:

F (W, γ) = tr (W^{T} S_{b}^{Φ} W) - tr (W^{T} S_{w}^{Φ} W),

(9)

where W is a projection matrix and γ = (γ₁, γ₂, …, γ_M)^T, subject to 1^T γ = $\sum_{m = 1}^{M} γ_{m}$ = 1, with γ_m² being the weight of the m-th base kernel matrix K^(m).

Based on the result (8) in Section 3.2, the optimal projection matrix is $W * = Φ_{b} E_{r} Λ_{b}^{- 1 / 2} Z Λ_{w}^{- 1 / 2}$. Denote $G = E_{r} Λ_{b}^{- 1 / 2} Z Λ_{w}^{- 1 / 2}$, which can be computed from the multiple kernel matrix K, so that W* = Φ_b G. Then the objective function (9) can be reformulated as

\begin{array}{l} F (γ) = tr ({W^{*}}^{T} S_{b}^{Φ} W^{*} - {W^{*}}^{T} S_{w}^{Φ} W^{*}) \\ = tr (G^{T} Φ_{b}^{T} Φ_{b} Φ_{b}^{T} Φ_{b} G - G^{T} Φ_{b}^{T} Φ_{w} Φ_{w}^{T} Φ_{b} G) \\ = tr (G^{T} P P^{T} G - G^{T} Q Q^{T} G), \end{array}

(10)

where P = Φ_b^T Φ_b and Q = Φ_b^T Φ_w can be expressed in terms of the multiple kernel matrix K as follows.

\begin{array}{l} P = \frac{1}{N} D \cdot (A_{N C}^{T} \cdot K \cdot A_{N C} - \frac{1}{N} A_{N C}^{T} \cdot K \cdot 1_{N C} \\ - \frac{1}{N} 1_{N C}^{T} \cdot K \cdot A_{N C} + \frac{1}{N^{2}} 1_{N C}^{T} \cdot K \cdot 1_{N C}) \cdot D \\ = \sum_{m = 1}^{M} γ_{m}^{2} P^{(m)}, \end{array}

(11)

where

\begin{matrix} P^{(m)} = \frac{1}{N} D \cdot (A_{N C}^{T} \cdot K^{(m)} \cdot A_{N C} - \frac{1}{N} A_{N C}^{T} \cdot K^{(m)} \cdot 1_{N C} \\ - \frac{1}{N} 1_{N C}^{T} \cdot K^{(m)} \cdot A_{N C} + \frac{1}{N^{2}} 1_{N C}^{T} \cdot K^{(m)} \cdot 1_{N C}) \cdot D, \\ m = 1, 2, ..., M, \end{matrix}

with D, A_NC and 1_NC defined as in Eq. (7);

\begin{array}{l} Q = \frac{1}{N} D \cdot (A_{N C}^{T} \cdot K - A_{N C}^{T} \cdot K \cdot H_{N N} \\ - \frac{1}{N} 1_{N C}^{T} \cdot K + \frac{1}{N} 1_{N C}^{T} \cdot K \cdot H_{N N}) \\ = \sum_{m = 1}^{M} γ_{m}^{2} Q^{(m)}, \end{array}

(12)

where

\begin{matrix} Q^{(m)} = \frac{1}{N} D \cdot (A_{N C}^{T} \cdot K^{(m)} - A_{N C}^{T} \cdot K^{(m)} \cdot H_{N N} \\ - \frac{1}{N} 1_{N C}^{T} \cdot K^{(m)} + \frac{1}{N} 1_{N C}^{T} \cdot K^{(m)} \cdot H_{N N}), \\ m = 1, 2, ..., M, \end{matrix}

H_NN = diag(h_{N_1}, …, h_{N_C}) is an N × N block diagonal matrix, and h_{N_i} is an N_i × N_i matrix with all terms equal to $\frac{1}{N_{i}}$.
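As with Eq. (7), the expression for Q in Eq. (12) can be checked against explicit features, and its linearity in K gives the decomposition into the Q^(m). The sketch below is illustrative: the function name `cross_gram`, the toy data, and the second (RBF) base kernel are assumptions, and samples are assumed to be ordered class by class.

```python
import numpy as np

def cross_gram(K, class_sizes):
    """Q = Phi_b^T Phi_w (C x N) computed from K as in Eq. (12)."""
    sizes = np.asarray(class_sizes)
    N, C = int(sizes.sum()), len(sizes)
    D = np.diag(np.sqrt(sizes.astype(float)))
    A = np.zeros((N, C))                      # A_NC: block i holds 1/N_i
    H = np.zeros((N, N))                      # H_NN: block i is N_i x N_i filled with 1/N_i
    start = 0
    for i, n_i in enumerate(sizes):
        stop = start + n_i
        A[start:stop, i] = 1.0 / n_i
        H[start:stop, start:stop] = 1.0 / n_i
        start = stop
    ones = np.ones((N, C))
    inner = A.T @ K - A.T @ K @ H - (ones.T @ K) / N + (ones.T @ K @ H) / N
    return D @ inner / N

rng = np.random.default_rng(3)
sizes = [3, 2, 4]
X = rng.normal(size=(sum(sizes), 5))
labels = np.repeat(np.arange(len(sizes)), sizes)
N = len(X)
means = np.stack([X[labels == i].mean(axis=0) for i in range(len(sizes))])
Phi_b = (np.sqrt(np.array(sizes) / N)[:, None] * (means - X.mean(axis=0))).T  # d x C
Phi_w = ((X - means[labels]) / np.sqrt(N)).T                                  # d x N

K1 = X @ X.T                                                     # linear kernel
K2 = np.exp(-((X[:, None] - X[None]) ** 2).sum(axis=-1) / 2.0)   # RBF kernel, sigma = 1
Q_lin = cross_gram(K1, sizes)
gamma = np.array([0.6, 0.4])
Q_multi = cross_gram(gamma[0] ** 2 * K1 + gamma[1] ** 2 * K2, sizes)
Q_sum = gamma[0] ** 2 * cross_gram(K1, sizes) + gamma[1] ** 2 * cross_gram(K2, sizes)
```

Since `Q_multi` and `Q_sum` agree, the per-kernel matrices Q^(m) can be computed once from the base kernels and reused in every iteration of the weight update.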

Therefore, to find the best weights (γ₁, γ₂, …, γ_M) for the multiple kernel matrix K defined in Eq. (3), we need to solve the following constrained optimization problem:

\begin{array}{l} \max_{γ} F (γ) = tr (G^{​ T} P P^{​ T} G - G^{​ T} Q Q^{​ T} G) \\ s . t . 1^{T} γ = \sum_{m = 1}^{M} γ_{m} = 1. \end{array}

(13)

3.4 Solving the optimization problem

We introduce a Lagrangian

L (γ, α) = F (γ) + α (\sum_{m = 1}^{M} γ_{m} - 1),

(14)

with one multiplier α. From Eq. (11) and (12), we can obtain

\frac{\partial P}{\partial γ_{m}} = 2 γ_{m} P^{(m)}, ​ ​ ​ \frac{\partial Q}{\partial γ_{m}} = 2 γ_{m} Q^{(m)} .

Moreover, temporarily regarding G as constant, we have

\begin{array}{l} \frac{\partial F (γ)}{\partial γ_{m}} = \frac{\partial}{\partial γ_{m}} (tr (G^{T} P P^{T} G - G^{T} Q Q^{T} G)) \\ = tr (\frac{\partial}{\partial γ_{m}} (G^{T} P P^{T} G - G^{T} Q Q^{T} G)) \\ = tr ((G^{T} \frac{\partial P}{\partial γ_{m}} P^{T} G + G^{T} P \frac{\partial P^{T}}{\partial γ_{m}} G) \\ - (G^{T} \frac{\partial Q}{\partial γ_{m}} Q^{T} G + G^{T} Q \frac{\partial Q^{T}}{\partial γ_{m}} G)) \\ = 2 γ_{m} \sum_{k = 1}^{M} γ_{k}^{2} {tr}_{m, k}, \end{array}

(15)

where

\begin{matrix} {tr}_{m, k} = tr ((G^{​ T} P^{(m)} P^{(k)} ​^{​ ​ T} G + G^{​ T} P^{(k)} P^{(m)} ​^{​ ​ T} G) \\ ​ ​ - (G^{​ T} Q^{(m)} Q^{(k)} ​^{​ ​ T} G + G^{​ T} Q^{(k)} Q^{(m)} ​^{​ ​ T} G)), \\ m, k = 1, 2, ..., M . \end{matrix}
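The gradient formula in Eq. (15) is a purely algebraic identity once G is held fixed, so it can be verified on arbitrary matrices. The sketch below uses random stand-ins for G, P^(m) and Q^(m) (assumptions for illustration, not data from the paper) and compares the closed-form gradient with central finite differences; the function names are hypothetical.

```python
import numpy as np

def tr_matrix(G, Ps, Qs):
    """M x M matrix of the scalars tr_{m,k} defined after Eq. (15)."""
    M = len(Ps)
    T = np.empty((M, M))
    for m in range(M):
        for k in range(M):
            p = G.T @ (Ps[m] @ Ps[k].T + Ps[k] @ Ps[m].T) @ G
            q = G.T @ (Qs[m] @ Qs[k].T + Qs[k] @ Qs[m].T) @ G
            T[m, k] = np.trace(p - q)
    return T

def objective(gamma, G, Ps, Qs):
    """F(gamma) of Eq. (10) with G regarded as constant."""
    P = sum(g ** 2 * Pm for g, Pm in zip(gamma, Ps))
    Q = sum(g ** 2 * Qm for g, Qm in zip(gamma, Qs))
    return np.trace(G.T @ P @ P.T @ G - G.T @ Q @ Q.T @ G)

def gradient(gamma, T):
    """Closed form of Eq. (15): dF/dgamma_m = 2 gamma_m sum_k gamma_k^2 tr_{m,k}."""
    gamma = np.asarray(gamma, dtype=float)
    return 2 * gamma * (T @ gamma ** 2)

# Random stand-ins: C = 3 classes, N = 7 samples, r = 2, M = 3 kernels.
rng = np.random.default_rng(4)
G = rng.normal(size=(3, 2))
Ps = [rng.normal(size=(3, 3)) for _ in range(3)]
Qs = [rng.normal(size=(3, 7)) for _ in range(3)]
T = tr_matrix(G, Ps, Qs)

gamma = np.array([0.5, 0.3, 0.2])
h = 1e-6
numeric = np.array([
    (objective(gamma + h * e, G, Ps, Qs) - objective(gamma - h * e, G, Ps, Qs)) / (2 * h)
    for e in np.eye(3)
])
analytic = gradient(gamma, T)
```

The matrix of tr_{m,k} values is symmetric, and with G fixed the objective reduces to the quartic form F(γ) = ½ Σ_m Σ_k γ_m² γ_k² tr_{m,k}.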

Now, differentiating L(γ, α) with respect to γ₁,…, γ _M and α gives the following partial derivatives:

\begin{array}{r} \frac{\partial L (γ, α)}{\partial γ_{m}} = 2 γ_{m} \sum_{k = 1}^{M} γ_{k} ​^{​ 2} {tr}_{m, k} + α \\ m = 1, ..., M, \end{array}

(16)

\frac{\partial L (γ, α)}{\partial α} = \sum_{m = 1}^{M} γ_{m} - 1.

(17)

Setting these partial derivatives to zero, we get the following set of M + 1 equations:

\begin{cases} 2 γ_{m} \sum_{k = 1}^{M} γ_{k}^{2} {tr}_{m, k} + α = 0, \quad m = 1, ..., M \\ \sum_{m = 1}^{M} γ_{m} - 1 = 0 \end{cases}

(18)

We use Newton's iteration method to solve these nonlinear equations. Let

F (\tilde{γ}) = [\begin{matrix} 2 γ_{1} \sum_{k = 1}^{M} γ_{k}^{2} {tr}_{1, k} + α \\ ⋮ \\ 2 γ_{M} \sum_{k = 1}^{M} γ_{k}^{2} {tr}_{M, k} + α \\ \sum_{m = 1}^{M} γ_{m} - 1 \end{matrix}],

(19)

where $\tilde{γ} = {(γ_{1}, \dots, γ_{M}, α)}^{T}$.

Then the iteration formula is

{\tilde{γ}}^{(i + 1)} = {\tilde{γ}}^{(i)} - {[F^{'} ({\tilde{γ}}^{(i)})]}^{- 1} F ({\tilde{γ}}^{(i)}),

(20)

where ${[F^{'} ({\tilde{γ}}^{(i)})]}^{- 1}$ is the inverse of the Jacobian matrix of $F (\tilde{γ})$ evaluated at ${\tilde{γ}}^{(i)} = {(γ_{1}^{(i)}, ..., γ_{M}^{(i)}, α^{(i)})}^{T}$, $i = 0, 1, 2, ...$.
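The Newton iteration of Eq. (20) can be sketched for a fixed matrix of tr_{m,k} values. The code is an illustration under stated assumptions: the function names are hypothetical, the Jacobian is derived here by differentiating Eq. (19) with tr_{m,k} held constant, and the test matrix T = −I is chosen only because its solution is known in closed form (equal weights 1/M and α = 2/M³).

```python
import numpy as np

def residual(z, T):
    """F(gamma_tilde) of Eq. (19); z = (gamma_1, ..., gamma_M, alpha)."""
    gamma, alpha = z[:-1], z[-1]
    return np.append(2 * gamma * (T @ gamma ** 2) + alpha, gamma.sum() - 1)

def jacobian(z, T):
    """Jacobian of Eq. (19), treating the entries of T as constants."""
    gamma = z[:-1]
    M = len(gamma)
    J = np.zeros((M + 1, M + 1))
    # d/dgamma_j of 2 gamma_m sum_k gamma_k^2 T[m, k]
    J[:M, :M] = np.diag(2 * (T @ gamma ** 2)) + 4 * np.outer(gamma, gamma) * T
    J[:M, M] = 1.0                            # derivative with respect to alpha
    J[M, :M] = 1.0                            # derivative of the constraint
    return J

def newton(z0, T, eps=1e-10, max_iter=50):
    """Iterate Eq. (20) until successive iterates differ by at most eps."""
    z = np.asarray(z0, dtype=float)
    for _ in range(max_iter):
        z_next = z - np.linalg.solve(jacobian(z, T), residual(z, T))
        if np.linalg.norm(z_next - z) <= eps:
            return z_next
        z = z_next
    return z

T = -np.eye(3)                                # toy case with a known solution
z = newton([0.5, 0.3, 0.2, 0.0], T)
gamma_opt, alpha_opt = z[:-1], z[-1]
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse Jacobian explicitly, which is equivalent to Eq. (20) but numerically preferable. Newton's method is only locally convergent, so the starting point matters for general T.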

3.5 Weight optimization procedure

Based on the analysis above, the detailed weight optimization procedure for MKFD is described as follows.

Input: K^(m) = [k^(m)(x_r, x_t)]_{N × N}, m = 1, …, M, i.e., M base kernel matrices.

Output: γ = (γ₁, …, γ_M)^T, with γ_m² being the weight of base kernel K^(m).

S1. Given ε > 0, initialize the iteration counter $i = 0$ and the starting point $(γ_{1}^{(0)}, ..., γ_{M}^{(0)}, α^{(0)})$, subject to $\sum_{m = 1}^{M} γ_{m}^{(0)} = 1$.

S2. Using the diagonalization strategy of MKFD with the constructed multiple kernel matrix $K = \sum_{m = 1}^{M} {(γ_{m}^{(i)})}^{2} K^{(m)}$, find an optimal projection matrix in the i-th iteration

W *^{(i)} = Φ_{b} G^{(i)} = \arg \max_{W} J^{Φ} (W);

S3. Regarding G^(i) as constant, construct the constrained optimization

\begin{array}{c} \max_{γ} F (γ) = tr ({G^{(i)}}^{T} P P^{T} G^{(i)} - {G^{(i)}}^{T} Q Q^{T} G^{(i)}), \\ s.t. \quad 1^{T} γ = \sum_{m = 1}^{M} γ_{m} = 1, \end{array}

and calculate the matrix $({tr}_{m, k})_{M \times M}$, where

\begin{matrix} {tr}_{m, k} = tr (({G^{(i)}}^{T} P^{(m)} {P^{(k)}}^{T} G^{(i)} + {G^{(i)}}^{T} P^{(k)} {P^{(m)}}^{T} G^{(i)}) \\ - ({G^{(i)}}^{T} Q^{(m)} {Q^{(k)}}^{T} G^{(i)} + {G^{(i)}}^{T} Q^{(k)} {Q^{(m)}}^{T} G^{(i)})), \\ m, k = 1, ..., M . \end{matrix}

S4. (Update weights) Compute ${\tilde{γ}}^{(i + 1)}$ from ${\tilde{γ}}^{(i)}$ using Eqs. (19) and (20). If $‖ {\tilde{γ}}^{(i + 1)} - {\tilde{γ}}^{(i)} ‖ > ε$, then set $i = i + 1$ and go to S2; otherwise stop.

In our experiments reported in Section 4, the initial values are chosen with Eq. (18) in mind: we set $γ_{1}^{(0)} = ... = γ_{M}^{(0)} = \frac{1}{M}$ and $α^{(0)} = - \frac{2}{M^{4}} \sum_{m = 1}^{M} \sum_{k = 1}^{M} {tr}_{m, k}$, and ε is set to $5 \times 10^{-3}$.
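The initial multiplier can be read off from Eq. (18). The short derivation below is a sketch of one way to arrive at it, assuming that the M stationarity equations are averaged over m after the uniform weights are substituted; the averaging step is an interpretation of the formula, not something the text states explicitly.

```latex
% Substitute \gamma_m = 1/M into the m-th equation of (18):
\frac{2}{M} \sum_{k=1}^{M} \frac{1}{M^{2}} \, {tr}_{m,k} + \alpha = 0
\quad\Longrightarrow\quad
\alpha = -\frac{2}{M^{3}} \sum_{k=1}^{M} {tr}_{m,k}, \qquad m = 1, \dots, M.
% The M equations generally give different values, so average them over m:
\alpha^{(0)} = \frac{1}{M} \sum_{m=1}^{M} \left( -\frac{2}{M^{3}} \sum_{k=1}^{M} {tr}_{m,k} \right)
             = -\frac{2}{M^{4}} \sum_{m=1}^{M} \sum_{k=1}^{M} {tr}_{m,k}.
```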

4. Experiments

To evaluate the performance of our MKFD for face recognition, we have made experimental comparisons with KDDA based on single kernels, in terms of low-dimensional representation and image recognition. Images are from two face databases, namely the FERET and the CMU PIE databases.

In our experiments, three base kernels ($M = 3$) are adopted to construct the multiple kernel function: the linear kernel $k_{1} (x_{i}, x_{j}) = x_{i}^{T} x_{j}$; the Gaussian RBF kernel $k_{2} (x_{i}, x_{j}) = \exp (- \frac{{‖ x_{i} - x_{j} ‖}^{2}}{2 σ^{2}})$, where σ is set to the average distance between the original samples, $\frac{1}{N (N - 1)} \sum_{i = 1}^{N} \sum_{j = 1}^{N} ‖ x_{i} - x_{j} ‖$; and the polynomial kernel $k_{3} (x_{i}, x_{j}) = {(x_{i}^{T} x_{j} + 1)}^{d}$, where d is set to 0.5. Thus the multiple kernel is $k (x_{i}, x_{j}) = \sum_{m = 1}^{3} γ_{m}^{2} k_{m} (x_{i}, x_{j})$, with $\sum_{m = 1}^{3} γ_{m} = 1$. We demonstrate the effectiveness of the multiple kernel by comparing its performance with that of the single base kernels.
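The three base kernel matrices described above can be computed as in the following sketch. The function name `base_kernels` and the toy data are assumptions for illustration. Features are assumed to be nonnegative (as with pixel values normalized to [0, 1]), so that x_i^T x_j + 1 is positive and the fractional power d = 0.5 is well defined.

```python
import numpy as np

def base_kernels(X, degree=0.5):
    """Return (K_lin, K_rbf, K_poly, sigma) for samples stored as the rows of X."""
    gram = X @ X.T
    sq_norms = np.diag(gram)
    # Squared Euclidean distances from the Gram matrix, clipped against round-off.
    sq_dist = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2 * gram, 0.0)
    N = len(X)
    sigma = np.sqrt(sq_dist).sum() / (N * (N - 1))   # average pairwise distance
    K_lin = gram
    K_rbf = np.exp(-sq_dist / (2 * sigma ** 2))
    K_poly = (gram + 1.0) ** degree
    return K_lin, K_rbf, K_poly, sigma

# Toy data (illustrative only): 8 samples with features in [0, 1].
rng = np.random.default_rng(5)
X = rng.uniform(0.0, 1.0, size=(8, 4))
K_lin, K_rbf, K_poly, sigma = base_kernels(X)
```

Tying σ to the average pairwise distance is a data-driven heuristic that avoids a cross-validation search over the RBF width.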

4.1 Face image datasets

From the FERET database [16], we select 72 people, with 6 frontal-view images for each individual. Face image variations in these 432 images include illumination, facial expression, wearing glasses, and aging. All the images are aligned by the centers of the eyes and the mouth and then normalized to a resolution of 92 × 112. The pixel values of each image are normalized to the range between 0 and 1. The 92 × 112 images are then reduced to wavelet feature faces with resolution 49 × 59 by a 1-level Daubechies-4 (Db4) wavelet decomposition. Images from one individual are shown in Fig. 1.

Figure 1.

Images of one person from the FERET database

The CMU PIE face database [17] contains 68 people in total. Each person has 13 pose variations, ranging from the full right profile to the full left profile, and 43 different lighting conditions (21 flashes with ambient light on or off). In our experiments, for each person, we select 56 images: 13 poses with neutral expression and 43 different lighting conditions in the frontal view. All frontal-view images are aligned based on the two eye centers and the nose center; no alignment is applied to the other pose images. All the segmented images are rescaled to a resolution of 92 × 112 and then reduced to wavelet feature faces with resolution 49 × 59 by a 1-level Daubechies-4 (Db4) wavelet decomposition. Some images of one person are shown in Fig. 2.

Figure 2.

Some images of one person from the CMU PIE face database
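The 49 × 59 size of the wavelet feature faces used for both datasets follows from simple arithmetic on the filter length. The sketch below assumes a one-level decomposition with non-periodic boundary extension, for which common wavelet implementations return floor((n + L − 1) / 2) coefficients per axis, and assumes that Db4 denotes the 8-tap Daubechies filter; neither detail is stated in the text, and the function name is hypothetical.

```python
def dwt_output_length(n, filter_length):
    """Coefficients per axis after one DWT level with non-periodic boundary extension."""
    return (n + filter_length - 1) // 2

DB4_FILTER_LENGTH = 8   # assumption: Db4 is the Daubechies filter with 8 taps

width = dwt_output_length(92, DB4_FILTER_LENGTH)
height = dwt_output_length(112, DB4_FILTER_LENGTH)
print(width, height)    # 49 59
```

Under these assumptions the approximation sub-band of a 92 × 112 image has exactly the 49 × 59 resolution reported above.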

4.2 Distribution of extracted features

This section aims to provide insights on how the proposed MKFD simplifies the face pattern distribution, compared with KDDA based on single kernels, when the patterns are subject to pose and illumination variations.

We select five subjects, 56 images per subject, with varying pose and illumination, from the CMU PIE dataset described above (56 × 5 = 280 images in all). Four types of feature bases are generated from the images using KDDA with the linear kernel, KDDA with the Gaussian RBF kernel, KDDA with the polynomial kernel, and our MKFD, respectively. All 280 images are then projected onto the four subspaces. For each image, its projections onto the two most significant feature bases of each subspace are visualized in Fig. 3.

Figure 3.

Distribution of 280 images of five subjects under varying pose and illumination in four types of subspaces

Fig. 3(a)–(d) depict the first two most discriminant features extracted by utilizing KDDA with linear kernel, KDDA with Gaussian RBF kernel, KDDA with polynomial kernel and MKFD, respectively. Obviously, our MKFD extracts the most discriminant features.

4.3 Recognition results

This section reports the recognition results of MKFD and KDDA with single kernels on the FERET and the CMU PIE datasets. For each subject in the FERET dataset, we randomly select n (n = 2 to 5) out of 6 images for training, with the rest used for testing. In the CMU PIE dataset, the number of randomly selected training images ranges from 10 to 18 out of 56 for each individual, with the rest used for testing. The average recognition accuracies over 10 runs on the FERET and CMU PIE datasets are shown in Fig. 4(a)–(b), respectively.

Figure 4.

Comparison of accuracies obtained by MKFD and KDDA with single kernels
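The random-split protocol used in this section can be sketched as follows. This is an illustration only: the function name `split_per_subject`, the seed, and the label layout (72 subjects with 6 images each, as in the FERET subset) are assumptions, and the classifier itself is omitted.

```python
import numpy as np

def split_per_subject(labels, n_train, rng):
    """Randomly pick n_train images per subject for training; the rest are for testing."""
    labels = np.asarray(labels)
    train = []
    for subject in np.unique(labels):
        idx = np.flatnonzero(labels == subject)
        train.extend(rng.choice(idx, size=n_train, replace=False))
    train = np.sort(np.array(train))
    test = np.setdiff1d(np.arange(len(labels)), train)
    return train, test

labels = np.repeat(np.arange(72), 6)          # 72 subjects x 6 images = 432 images
rng = np.random.default_rng(0)
splits = [split_per_subject(labels, 3, rng) for _ in range(10)]   # 10 runs, n = 3
```

Averaging the accuracy over the 10 index pairs in `splits` gives a mean and standard deviation of the kind reported for each kernel.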

Table 1 shows the average and standard deviation of the accuracies for FERET (n = 3: 3 images per subject for training with the rest for testing) and CMU-PIE (n = 14: 14 images per subject for training with the rest for testing), respectively.

Table 1.

Performance comparison between MKFD and KDDA with single kernels

Type of kernel	FERET (n=3), mean	FERET (n=3), std	CMU PIE (n=14), mean	CMU PIE (n=14), std
Linear kernel	84.94%	0.015	71.79%	0.014
RBF kernel	73.84%	0.040	69.03%	0.010
Polynomial kernel	84.80%	0.017	76.50%	0.010
Multi-kernel with same weights	80.84%	0.030	71.81%	0.014
Multi-kernel with optimized weights	86.34%	0.015	78.74%	0.007

From the results in Fig. 4 and Table 1, it can be seen that blending multiple kernels in the proposed MKFD achieves higher accuracy than any of the three single kernels. A simple equal-weight summation of the kernels, however, does not help: in Table 1 it scores below the linear and polynomial kernels on FERET (80.84% versus 84.94% and 84.80%) and below the polynomial kernel on CMU PIE (71.81% versus 76.50%). Only the multiple kernel function with optimized weights improves on every single kernel (86.34% on FERET and 78.74% on CMU PIE).

Note that in the experiments, neither the parameter of the RBF kernel nor that of the polynomial kernel is optimally selected, and the linear kernel has no parameter at all. This suggests that, to a certain extent, the multiple kernel function relaxes parameter selection for the base kernels.

5. Conclusion

In this paper, on the assumption that multiple kernels can characterize geometrical structures of the original data from multiple views that complement one another and thereby improve recognition performance, we apply kernel-based FDA with multiple kernels, which we call MKFD, to the recognition of face images under variations of pose, illumination, facial expression, etc. The constructed kernel for MKFD is a linear combination of several base kernels with a constraint on their weights. Using the maximum margin criterion as the objective, we propose an iterative scheme based on the method of Lagrange multipliers for the weight optimization. The resulting kernel weights yield higher recognition accuracy on the FERET and CMU PIE face databases than the single kernels. The experiments also demonstrate that the multiple kernel function relaxes parameter selection to some extent.

It is important to point out that the proposed weight optimization scheme is generic and, with minor modifications, can be applied to all kernel-based FDA algorithms.

6. Acknowledgement

This work is partially supported by the National Natural Science Foundation of China under grant No. 60975083.

References

1. Turk and Pentland. Eigenfaces for recognition. J. Cogn. Neurosci., 3(1):71–86, 1991.

2. Belhumeur P. N., Hespanha J. P., and Kriegman D. J. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Anal. Mach. Intell., 19(7):711–720, Jul. 1997.

3. Ruiz and de Teruel López P.E. Nonlinear kernel-based statistical pattern analysis. IEEE Trans. Neural Netw., 12(1):16–32, Jan. 2001.

4. Schölkopf, Smola, and Müller. Nonlinear component analysis as a kernel eigenvalue problem. MPI für biologische Kybernetik, Tübingen, Germany, Tech. Rep. 44, 1996.

5. Mika, Rätsch, Weston, Schölkopf, and Müller K. R. Fisher discriminant analysis with kernels. Proc. IEEE Workshop Neural Netw. Signal Process. IX, 1999, pp. 41–48.

6. Baudat and Anouar. Generalized discriminant analysis using a kernel approach. Neural Comput., 12(10):2385–2404, 2000.

7. J. W., Plataniotis, and Venetsanopoulos A. N. Face recognition using kernel direct discriminant analysis algorithms. IEEE Trans. Neural Netw., 14(1):117–126, Jan. 2003.

8. Chapelle, Vapnik, Bousquet, and Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46(1–3):131–159, 2002.

9. Sonnenburg, Rätsch, and Schäfer. A general and efficient multiple kernel learning algorithm. Neural Information Processing Systems, 2005.

10. Wang, Chen, and Sun. MultiK-MHKS: A novel multiple kernel learning algorithm. IEEE Trans. Pattern Anal. Mach. Intell., 30(2):348–353, Feb. 2008.

11. Zhang and Bennett. Column-generation boosting methods for mixture of kernels. Proc. Int'l Conf. Knowledge Discovery and Data Mining, pp. 521–526, 2004.

12. Lanckriet G.R.G., Cristianini, Bartlett, Ghaoui L.E., and Jordan M.I. Learning the kernel matrix with semidefinite programming. J. Machine Learning Research, vol. 5, pp. 27–72, 2004.

13. Bach, Lanckriet G.R.G., and Jordan M.I. Multiple kernel learning, conic duality, and the SMO algorithm. Proc. 21st Int'l Conf. Machine Learning, 2004.

14. Bennett K.P., Momma, and Embrechts M.J. MARK: A boosting algorithm for heterogeneous kernel models. Proc. ACM SIGKDD, pp. 24–31, 2002.

15. Jiang and Zhang. Efficient and robust feature extraction by maximum margin criterion. Advances in Neural Information Processing Systems 16, Thrun, Saul, and Schölkopf, Eds. Cambridge, MA, MIT Press, pp. 157–165, 2004.

16. Phillips P. J., Moon, Rizvi S. A., and Rauss