Label consistent transform learning for pattern classification

Abstract

Transform learning has been successfully applied to various image processing tasks in recent years. However, it learns the representation in an unsupervised fashion. To make transform learning suitable for pattern classification, we introduce a label consistency constraint into transform learning and propose label consistent transform learning, which enhances its classification performance. The resulting optimization problem can be solved by an alternating minimization strategy. Experimental results on publicly available databases demonstrate that label consistent transform learning outperforms several dictionary learning approaches and the recently proposed discriminative transform learning. More importantly, label consistent transform learning requires the least training time in most experiments, which makes it promising for practical applications.

Keywords

Transform learning, dictionary learning, label consistent, pattern classification

Introduction

Recent years have witnessed increasing interest in the study of dictionary learning (DL) in various domains, such as face recognition,¹ image fusion,² and person re-identification.³ DL can be divided into two types, i.e. synthesis dictionary learning (SDL) and analysis dictionary learning (ADL), according to how the input data are encoded. SDL aims to learn a dictionary from which the input data can be reconstructed as a linear combination of its atoms. The most classic SDL approach is the K-SVD algorithm,⁴ which has been successfully applied to image compression and denoising. However, K-SVD mainly focuses on the representational ability of the dictionary without considering its capability for classification. To address this problem, Zhang and Li⁵ proposed a discriminative K-SVD (D-KSVD) method by introducing the classification error into the framework of K-SVD. Jiang et al.⁶ further incorporated a label consistency constraint into K-SVD and presented a label consistent K-SVD (LC-KSVD) algorithm. Yang et al.⁷ developed a Fisher discrimination dictionary learning method which imposes the Fisher discrimination criterion on the coding coefficients. Wang and Kong⁸ presented an SDL approach called COPAR which explicitly learns the shared patterns and the class-specific dictionaries. Li et al.⁹ proposed a locality constrained and label embedding dictionary learning algorithm to take the locality and label information of atoms into account. Many SDL methods employ the ℓ₀ or ℓ₁ norm to promote sparsity of the coding, which leads to high computational complexity. To alleviate this problem, Zhao et al.¹⁰ designed an orthogonal collaborative dictionary learning method which can derive analytical solutions for both code learning and dictionary updating.

Although SDL achieves impressive results in classification tasks, learning the dictionary is time-consuming. Recently, ADL has attracted considerable attention owing to its efficacy and efficiency. Rubinstein et al.¹¹ presented the analysis K-SVD, which parallels the synthesis K-SVD.⁴ Gu et al.¹² developed projective dictionary pair learning (DPL), which simultaneously learns a synthesis dictionary and an analysis dictionary. Guo et al.¹³ explored a discriminative ADL by integrating structure preserving and discriminative properties into the basic ADL model. Yang et al.¹⁴ proposed a discriminative analysis-synthesis dictionary learning, in which a linear classifier based on the coding coefficient is jointly learned with the dictionary pair. Wang et al.¹⁵ designed a synthesis linear classifier-based ADL (SLC-ADL) algorithm by introducing a synthesis-linear-classifier-based error term into the basic ADL framework. Similarly, Wang et al.¹⁶ also incorporated the linear classification error term into ADL. The difference between SLC-ADL¹⁵ and the approach proposed by Wang et al.¹⁶ is that SLC-ADL actually uses the classification error term from the analysis viewpoint.

Transform learning (TL) is a new representation learning technique which was presented by Ravishankar et al.¹⁷,¹⁸; it utilizes a transform matrix to attain the representation of input data. In fact, TL and ADL have similar formulations. TL is mainly used in signal and image denoising, and there is little work on discriminative transform learning (DTL). Very recently, Maggu et al.¹⁹ developed a label consistent TL for hyperspectral image classification. Although the method proposed by Maggu et al.¹⁹ is termed label consistent TL, it essentially introduces the linear classification error term into the framework of TL, so this method is referred to as DTL in this paper. To further enhance the classification capability of TL, we introduce the label consistency term into the framework of TL and present label consistent transform learning (LCTL) for pattern classification. Compared with DTL, our proposed LCTL leverages the label information of both training data and dictionary atoms to form an ideal representation matrix. Through the ideal matrix, samples belonging to the same class are forced to have similar representation vectors, while those from different classes are forced to have distinct representation vectors. LCTL is evaluated on widely used databases and the experimental results demonstrate that LCTL is superior to DTL and several SDL approaches in terms of recognition accuracy and training time. The source code of our proposed LCTL can be downloaded from https://github.com/yinhefeng/label-consistent-transform-learning

Related work

Synthesis DL

Let $X = [x_{1}, x_{2}, \dots, x_{N}] \in ℝ^{n \times N}$ be the training data matrix from C classes, where n and N denote the dimensionality and total number of training samples, respectively. The objective function of K-SVD⁴ is formulated as follows

min_{D, Z} {‖ X - DZ ‖}_{F}^{2}, s.t. {‖ z_{i} ‖}_{0} \leq T_{0}

(1)

where $D = [d_{1}, d_{2}, \dots, d_{K}] \in ℝ^{n \times K}$ is the dictionary to be learned, $Z = [z_{1}, z_{2}, \dots, z_{N}] \in ℝ^{K \times N}$ is the coding coefficient matrix, and T₀ is a given sparsity level. Equation (1) can be solved by alternately updating D and Z. Although K-SVD delivers promising results in image compression and denoising, it is not tailored for classification. To make K-SVD applicable to pattern classification, Zhang and Li⁵ presented the D-KSVD algorithm by introducing a classification error term into the framework of K-SVD

\begin{array}{l} min_{D, W, Z} {‖ X - DZ ‖}_{F}^{2} + β {‖ H - WZ ‖}_{F}^{2} + λ {‖ W ‖}_{F}^{2} \\ s.t. {‖ z_{i} ‖}_{0} \leq T_{0} \end{array}

(2)

where $H = [h_{1}, h_{2}, \dots, h_{N}] \in ℝ^{C \times N}$ is the label matrix of the training data, $h_{i} = {[0, 0, \dots, 1, \dots, 0, 0]}^{T} \in ℝ^{C \times 1}$ is the label vector of $x_{i}$, and W contains the parameters of a linear classifier. The dictionary and the classifier are thus jointly learned in D-KSVD. Jiang et al.⁶ then developed LC-KSVD by solving the following optimization problem

min_{D, W, A, Z} {‖ X - DZ ‖}_{F}^{2} + α {‖ Q - AZ ‖}_{F}^{2} + β {‖ H - WZ ‖}_{F}^{2}, s.t. {‖ z_{i} ‖}_{0} \leq T_{0}

(3)

where $Q = [q_{1}, q_{2}, \dots, q_{N}] \in ℝ^{K \times N}$ is an ideal representation matrix and A is a linear transformation matrix.

Transform learning

TL was presented by Ravishankar et al.,¹⁷,¹⁸ and its objective function is formulated as follows

min_{T, Z} {‖ TX - Z ‖}_{F}^{2} + μ ({‖ T ‖}_{F}^{2} - log det T) + λ {‖ Z ‖}_{1}

(4)

where $T = [t_{1}; t_{2}; \dots; t_{K}] \in ℝ^{K \times n}$ is the transform matrix, $t_{i} \in ℝ^{1 \times n}$ is the i-th row of T, $-\log\det T$ imposes a full rank constraint to avoid the trivial solution ($T = 0, Z = 0$), and ${‖ T ‖}_{F}^{2}$ is used to address the scale ambiguity of the transform matrix. TL is an unsupervised representation learning approach and is therefore not directly suited to classification tasks. Recently, Maggu et al.¹⁹ developed a supervised version of TL by solving the following problem

min_{T, Z, W} {‖ TX - Z ‖}_{F}^{2} + μ ({‖ T ‖}_{F}^{2} - log det T) + λ {‖ Z ‖}_{1} + β {‖ H - WZ ‖}_{F}^{2}

(5)

where H and W are the label matrix of training data and parameters of linear classifier, respectively.

Label consistent TL

As mentioned earlier, we want to employ the label information of both training data and dictionary atoms to boost the discriminative capability of TL, and the objective function of our proposed LCTL is formulated as

min_{T, Z, A} {‖ TX - Z ‖}_{F}^{2} + μ ({‖ T ‖}_{F}^{2} - log det T) + λ {‖ Z ‖}_{1} + α {‖ Q - AZ ‖}_{F}^{2}

(6)

where $Q = [q_{1}, q_{2}, \dots, q_{N}] \in ℝ^{K \times N}$ is an ideal representation matrix formed from the label information of the training data and the dictionary atoms, with $q_{i} = {[0, 0, \dots, 1, 1, \dots, 0, 0]}^{T} \in ℝ^{K \times 1}$. An entry of $q_{i}$ is 1 when the training sample $x_{i}$ and the corresponding dictionary atom share the same class label, and 0 otherwise. For instance, suppose $X = [x_{1}, x_{2}, \dots, x_{6}]$ and $T = [t_{1}; t_{2}; \dots; t_{6}]$, where $x_{1}, x_{2}, t_{1}$, and $t_{2}$ belong to the first class, $x_{3}, x_{4}, t_{3}$, and $t_{4}$ belong to the second class, and $x_{5}, x_{6}, t_{5}$, and $t_{6}$ belong to the third class; then Q can be defined as

Q = [\begin{matrix} 1 & 1 & 0 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 & 1 & 1 \end{matrix}]
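The construction of Q can be illustrated with a minimal NumPy sketch. The function name `ideal_representation` and the integer label encoding are assumptions made for illustration; they are not taken from the released LCTL code.

```python
import numpy as np

def ideal_representation(atom_labels, sample_labels):
    """Build Q (K x N): Q[k, i] = 1 if atom k and sample i share a class label.

    Hypothetical helper; labels are assumed to be integer class indices.
    """
    atom_labels = np.asarray(atom_labels)
    sample_labels = np.asarray(sample_labels)
    # Broadcast a (K, 1) column against a (1, N) row to compare every pair.
    return (atom_labels[:, None] == sample_labels[None, :]).astype(float)

# Six atoms and six samples, two per class, as in the example above.
Q = ideal_representation([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 2])
print(Q)
```

When atoms and samples are both sorted by class, as here, Q is block diagonal with one block of ones per class.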

Equation (6) can be solved by alternately updating T, Z, and A, and the optimization process is presented as follows:

Update T: When Z and A are fixed, we can obtain the following optimization problem with respect to T

min_{T} {‖ TX - Z ‖}_{F}^{2} + μ ({‖ T ‖}_{F}^{2} - log det T)

(7)

Equation (7) has a closed-form solution.¹⁸ First, we factorize $X X^{T} + μ I$ as $L L^{T}$; then we let $L^{- 1} X Z^{T} = U Σ V^{T}$ be a full singular value decomposition (SVD). Finally, the solution to Equation (7) is given by

T = 0.5 V (Σ + {(Σ^{2} + 2 μ I)}^{\frac{1}{2}}) U^{T} L^{- 1}

(8)
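As a rough illustration of Equations (7) and (8), the following NumPy sketch computes the closed-form update. It assumes a square transform (K = n), so that log det T is defined and Σ is a square diagonal matrix, and it uses a Cholesky factor for L; the function name `update_transform` is hypothetical.

```python
import numpy as np

def update_transform(X, Z, mu):
    """Closed-form T update of Equation (8), assuming a square transform (K = n)."""
    n = X.shape[0]
    # Factorize X X^T + mu I = L L^T (Cholesky is one valid choice of factor).
    L = np.linalg.cholesky(X @ X.T + mu * np.eye(n))
    L_inv = np.linalg.inv(L)
    # Full SVD of L^{-1} X Z^T = U diag(s) V^T.
    U, s, Vt = np.linalg.svd(L_inv @ X @ Z.T)
    # Diagonal of 0.5 * (Sigma + (Sigma^2 + 2 mu I)^{1/2}).
    gamma = 0.5 * (s + np.sqrt(s ** 2 + 2.0 * mu))
    # T = V diag(gamma) U^T L^{-1}
    return (Vt.T * gamma) @ U.T @ L_inv
```

One way to sanity-check the result is to verify that the gradient of Equation (7), $2(TX - Z)X^{T} + 2μT - μT^{-T}$, vanishes at the returned T.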

Update Z: Fix the other variables and update Z by solving the following problem

min_{Z} {‖ TX - Z ‖}_{F}^{2} + λ {‖ Z ‖}_{1} + α {‖ Q - AZ ‖}_{F}^{2}

(9)

Equation (9) can be rewritten as

min_{Z} {‖ (\begin{matrix} TX \\ \sqrt{α} Q \end{matrix}) - (\begin{matrix} I \\ \sqrt{α} A \end{matrix}) Z ‖}_{F}^{2} + λ {‖ Z ‖}_{1}

(10)

Letting $X_{new} = (\begin{matrix} TX \\ \sqrt{α} Q \end{matrix})$ and $A_{new} = (\begin{matrix} I \\ \sqrt{α} A \end{matrix})$, Equation (10) can be reformulated as

min_{Z} {‖ X_{new} - A_{new} Z ‖}_{F}^{2} + λ {‖ Z ‖}_{1}

(11)

Equation (11) is an ℓ₁-norm regularized sparse coding problem, which can be solved by the Homotopy algorithm.²⁰
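To make the stacking in Equations (10) and (11) concrete, the sketch below builds $X_{new}$ and $A_{new}$ and then solves the ℓ₁-regularized problem with plain ISTA (iterative soft-thresholding). ISTA is swapped in here only because it fits in a few lines; it is not the Homotopy solver used in the paper. The function names and the iteration count are assumptions.

```python
import numpy as np

def soft_threshold(V, tau):
    """Elementwise soft-thresholding, the proximal operator of tau * ||.||_1."""
    return np.sign(V) * np.maximum(np.abs(V) - tau, 0.0)

def update_codes(T, X, Q, A, alpha, lam, n_iter=200):
    """Approximately solve Equation (11) with ISTA (a stand-in for Homotopy)."""
    K = T.shape[0]
    X_new = np.vstack([T @ X, np.sqrt(alpha) * Q])      # stacked target, 2K x N
    A_new = np.vstack([np.eye(K), np.sqrt(alpha) * A])  # stacked operator, 2K x K
    # Lipschitz constant of the gradient of ||X_new - A_new Z||_F^2.
    lip = 2.0 * np.linalg.norm(A_new, 2) ** 2
    Z = np.zeros((K, X.shape[1]))
    for _ in range(n_iter):
        grad = 2.0 * A_new.T @ (A_new @ Z - X_new)
        Z = soft_threshold(Z - grad / lip, lam / lip)
    return Z
```

With α = 0 the problem decouples and the update reduces to soft-thresholding TX at level λ/2, which gives a quick check of the implementation.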

Update A: With the other variables fixed, Equation (6) reduces to the following problem with respect to A, where a regularization term on A weighted by η₁ is added

min_{A} {‖ Q - AZ ‖}_{F}^{2} + η_{1} {‖ A ‖}_{F}^{2}

(12)

Equation (12) has the following closed-form solution

A = Q Z^{T} {(Z Z^{T} + η_{1} I)}^{- 1}

(13)
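Equation (13) is a ridge-regression solution and can be sketched in NumPy as follows. The function name `update_A` is hypothetical, and a linear solve replaces the explicit inverse, which is a common numerical choice rather than something stated in the paper.

```python
import numpy as np

def update_A(Q, Z, eta1):
    """Closed-form A update of Equation (13): A = Q Z^T (Z Z^T + eta1 I)^{-1}."""
    K = Z.shape[0]
    M = Z @ Z.T + eta1 * np.eye(K)  # symmetric positive definite for eta1 > 0
    # Since M is symmetric, A^T = M^{-1} (Z Q^T); solve instead of inverting.
    return np.linalg.solve(M, Z @ Q.T).T
```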

When the training phase is completed, we can obtain the transform matrix T, and the representation matrix of training data can be computed by $Z_{t r} = TX$ . We employ a simple yet effective linear classifier to classify the test samples. A linear classifier W can be learned based on the label matrix H of training data and the representation matrix $Z_{t r}$ , and the following problem is utilized to learn W

W^{*} = arg min_{W} {‖ H - W Z_{t r} ‖}_{F}^{2} + η_{2} {‖ W ‖}_{F}^{2}

(14)

Equation (14) has the following closed-form solution

W^{*} = H Z_{t r}^{T} {(Z_{t r} Z_{t r}^{T} + η_{2} I)}^{- 1}

(15)

For a test sample x , first we can compute its coding vector by $z = T x$ , then the identity of x is given by

label (x) = arg max_{i} {(W^{*} z)}_{i}

(16)
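The classification stage in Equations (14) to (16) can be sketched as below. The helper names (`one_hot`, `train_classifier`, `classify`) and the zero-based integer labels are assumptions for illustration only.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Label matrix H (C x N) with a single 1 per column."""
    labels = np.asarray(labels)
    H = np.zeros((num_classes, labels.size))
    H[labels, np.arange(labels.size)] = 1.0
    return H

def train_classifier(Z_tr, H, eta2):
    """Equation (15): W = H Z_tr^T (Z_tr Z_tr^T + eta2 I)^{-1}."""
    K = Z_tr.shape[0]
    M = Z_tr @ Z_tr.T + eta2 * np.eye(K)
    return np.linalg.solve(M, Z_tr @ H.T).T

def classify(T, W, x):
    """Equation (16): code the test sample with T, then take the largest score."""
    z = T @ x
    return int(np.argmax(W @ z))

# Toy usage: identity transform and two samples per class.
X = np.hstack([np.eye(3), np.eye(3)])
T = np.eye(3)
W = train_classifier(T @ X, one_hot([0, 1, 2, 0, 1, 2], 3), eta2=0.1)
print(classify(T, W, X[:, 1]))
```

Note that testing only requires one matrix-vector product for the code and one for the scores, with no sparse coding step.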

Experimental results and analysis

In this section, we evaluate the classification performance of our proposed LCTL on four benchmark datasets: the Yale database, the Extended Yale B database, the AR database, and the Scene 15 dataset. To illustrate the superiority of LCTL, we compare LCTL with the following approaches: SRC,²¹ LLC,²² K-SVD,⁴ D-KSVD,⁵ LC-KSVD,⁶ and DTL.¹⁹ Apart from the recognition accuracy, we also present the training time (in seconds) of all the competing methods. All experiments are run with MATLAB R2019a under Windows 10 on a PC equipped with an Intel i9-8950HK 2.90 GHz CPU and 16 GB RAM.

Experiments on the Yale database

The Yale database consists of 165 images of 15 subjects, with 11 images per individual. These images have illumination and expression variations. Figure 1 shows some example images from this database. All the images are resized to 24 × 24, resulting in a 576-dimensional vector. Six images per subject are randomly selected for training and the remaining images are used for testing. The learned dictionary contains 60 atoms, i.e. each class has four atoms. Experimental results are summarized in Table 1. We can see that LCTL has the highest recognition accuracy and the least training time. Specifically, the accuracy gain of LCTL over DTL is 2.6%.

Figure 1.

Example images from the Yale database.

Table 1.

Recognition accuracy and training time on the Yale database.

Methods	Accuracy (%)	Training time (s)
SRC	88.0	No need
LLC	93.3	No need
K-SVD	96.0	0.27
D-KSVD	90.7	0.80
LC-KSVD	94.7	0.80
DTL	94.7	0.20
LCTL	97.3	0.16

Note: Bold values signify the best recognition accuracy and the least training time. DTL: discriminative transform learning; LCTL: label consistent transform learning.

Experiments on the Extended Yale B database

The Extended Yale B face database contains 2414 images of 38 individuals. Each individual has 59–64 images taken under different illumination conditions; example images from this dataset are shown in Figure 2. In our experiments, each 192 × 168 image is projected onto a 504-dimensional space via random projection. We randomly select half of the images per category for training and use the remaining images for testing. The dictionary consists of 570 items, which corresponds to an average of 15 atoms per person. Experimental results are listed in Table 2. Although the accuracy gain of LCTL over LC-KSVD is small, LCTL is about 31 times faster than LC-KSVD.
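The random projection step is not specified further in the text. A minimal sketch, assuming a Gaussian projection matrix with unit-norm rows (a common choice, but an assumption here) and a hypothetical function name, might look like this:

```python
import numpy as np

def random_projection(images, out_dim, seed=0):
    """Project vectorized images (num_pixels x N) onto out_dim dimensions.

    Assumption: entries of the projection matrix are drawn from a zero-mean
    Gaussian and each row is scaled to unit l2 norm.
    """
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((out_dim, images.shape[0]))
    P /= np.linalg.norm(P, axis=1, keepdims=True)
    return P @ images

# Toy usage with small sizes; in the setting above the input would have
# 192 * 168 rows and out_dim would be 504.
features = random_projection(np.random.default_rng(1).random((64, 10)), out_dim=8)
print(features.shape)
```

The same projection matrix (here fixed through the seed) has to be applied to both training and test images.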

Figure 2.

Example images from the Extended Yale B database.

Table 2.

Recognition accuracy and training time on the Extended Yale B database.

Methods	Accuracy (%)	Training time (s)
SRC	80.5	No need
LLC	82.2	No need
K-SVD	93.1	1.79
D-KSVD	94.1	32.43
LC-KSVD	95.0	38.96
DTL	94.3	3.91
LCTL	95.2	1.27

Note: Bold values signify the best recognition accuracy and the least training time. DTL: discriminative transform learning; LCTL: label consistent transform learning.

Experiments on the AR database

The AR database has more than 4000 face images of 126 subjects with variations in facial expression, illumination conditions, and occlusions. Figure 3 shows example images from the database. In our experiments, we use a subset of 2600 images of 50 male and 50 female subjects from the database. Each 165 × 120 face image is projected onto a 540-dimensional vector by random projection. For each person, 20 images are randomly selected for training and the remaining images are used for testing. The learned dictionary has 500 dictionary items, i.e. five atoms per person. Table 3 shows the experimental results on this database. Although the AR database contains occluded images, which pose a challenge to face recognition, LCTL achieves the best recognition accuracy and the highest efficiency.

Figure 3.

Example images from the AR database.

Table 3.

Recognition accuracy and training time on the AR database.

Methods	Accuracy (%)	Training time (s)
SRC	66.5	No need
LLC	69.5	No need
K-SVD	86.5	1.97
D-KSVD	88.8	45.64
LC-KSVD	93.7	54.48
DTL	95.7	4.75
LCTL	96.2	1.32

Note: Bold values signify the best recognition accuracy and the least training time. DTL: discriminative transform learning; LCTL: label consistent transform learning.

Experiments on the Scene 15 dataset

The Scene 15 dataset, introduced by Lazebnik et al.,²³ contains 15 natural scene categories that cover a wide range of indoor and outdoor scenes, such as bedroom, office, and mountain; example images from this dataset are shown in Figure 4. For fair comparison, we employ the 3000-dimensional SIFT-based features used in LC-KSVD.⁶ Following the common experimental settings, we randomly select 100 images per category as training data and use the remaining images for testing. The learned dictionary has 450 atoms. Experimental results are presented in Table 4. One can see that LCTL outperforms DTL by a large margin: it improves accuracy by 7.2 and 4.7 percentage points over DTL and LC-KSVD, respectively. The confusion matrix for LCTL is shown in Figure 5. As can be seen from Figure 5, LCTL obtains over 99% recognition accuracy for the categories of suburb, coast, forest, insidecity, opencountry, street, office, and kitchen.

Figure 4.

Example images from the Scene 15 dataset.

Table 4.

Recognition accuracy and training time on the Scene 15 dataset.

Methods	Accuracy (%)	Training time (s)
SRC	91.8	No need
LLC	89.2	No need
K-SVD	91.6	7.74
D-KSVD	89.9	99.20
LC-KSVD	93.8	111.01
DTL	91.3	11.45
LCTL	98.5	8.56

Note: Bold values signify the best recognition accuracy and the least training time. DTL: discriminative transform learning; LCTL: label consistent transform learning.

Figure 5.

Confusion matrix on the Scene 15 dataset.

To examine how the dictionary size influences the performance of DL approaches, here we compare LCTL with DTL, LC-KSVD, D-KSVD, and K-SVD. We evaluate all the competing methods under dictionary sizes of 150, 450, 750, 1050, 1350, and 1500, which correspond to 10, 30, 50, 70, 90, and 100 atoms per category, respectively. Recognition accuracy with different dictionary sizes is plotted in Figure 6. We can observe that LCTL consistently outperforms all other approaches for all dictionary sizes.

Figure 6.

Recognition accuracy on the Scene 15 dataset with varying dictionary size.

Conclusion

In this paper, we present the LCTL algorithm. To enhance the discriminative ability of TL and apply it to pattern classification tasks, a label consistency constraint is incorporated into the framework of TL to form a unified objective function. The variables in our optimization problem are updated alternately, and some of them have closed-form solutions. Experiments are conducted on three benchmark face databases and one scene dataset, and the results validate the effectiveness of our proposed LCTL. Moreover, LCTL shows high efficiency in terms of training time, which suggests its potential for real-time scenarios.

Footnotes

Acknowledgements

The authors would like to thank Zhuolin Jiang for providing the source code of LC-KSVD at http://users.umiacs.umd.edu/∼zhuolin/projectlcksvd.html, and Jyoti Maggu for releasing the code of DTL at .

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China (Grant Nos. 61672265, U1836218), the 111 Project of Ministry of Education of China (Grant No. B12018), the Postgraduate Research & Practice Innovation Program of Jiangsu Province under Grant No. KYLX_1123, the Overseas Studies Program for Postgraduates of Jiangnan University and the China Scholarship Council (Grant No. 201706790096).

References

1. Song, Chen, Feng, et al. Collaborative representation based face classification exploiting block weighted LBP and analysis dictionary learning. Pattern Recognit 2019; 88: 127–138.

2. XJ. Multi-focus image fusion using dictionary learning and low-rank representation. In: ICIG, Shanghai, China, 13–15 September 2017, Springer, Cham, pp.675–686.

3. Shao. Person re-identification by cross-view multi-level dictionary learning. IEEE Trans Pattern Anal Mach Intell 2017; 40: 2963–2977.

4. Aharon, Elad, Bruckstein AK. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans Signal Process 2006; 54: 4311–4322.

5. Zhang, Li. Discriminative K-SVD for dictionary learning in face recognition. In: CVPR, San Francisco, California, 13–18 June 2010, pp.2691–2698.

6. Jiang, Lin, Davis LS. Learning a discriminative dictionary for sparse coding via label consistent K-SVD. In: CVPR, Colorado Springs, Colorado, 20–25 June 2011, pp.1697–1704.

7. Yang, Zhang, Feng, et al. Fisher discrimination dictionary learning for sparse representation. In: CVPR, Colorado Springs, Colorado, 20–25 June 2011, pp.543–550.

8. Wang, Kong. A classification-oriented dictionary learning model: explicitly learning the particularity and commonality across categories. Pattern Recognit 2014; 47: 885–898.

9. Li, Lai, et al. A locality-constrained and label embedding dictionary learning algorithm for image classification. IEEE Trans Neural Netw Learning Syst 2015; 28: 278–293.

10. Zhao, Feng, Zhang, et al. Novel orthogonal based collaborative dictionary learning for efficient face recognition. Knowledge Based Syst 2019; 163: 533–545.

11. Rubinstein, Peleg, Elad. Analysis K-SVD: a dictionary-learning algorithm for the analysis sparse model. IEEE Trans Signal Process 2012; 61: 661–677.

12. Gu, Zhang, Zuo, et al. Projective dictionary pair learning for pattern classification. In: NeurIPS, Montreal, Canada, 8–13 December 2014, pp.793–801.

13. Guo, Kong, et al. Discriminative analysis dictionary learning. In: AAAI, Phoenix, USA, 12–17 February 2016, pp.1617–1623.

14. Yang, Chang, Luo. Discriminative analysis-synthesis dictionary learning for image classification. Neurocomputing 2017; 219: 404–411.

15. Wang, Guo, et al. Synthesis linear classifier based analysis dictionary learning for pattern classification. Neurocomputing 2017; 238: 103–113.

16. Wang, Guo, et al. Synthesis K-SVD based analysis dictionary learning for pattern classification. Multimed Tools Appl 2018; 77: 17023–17041.

17. Ravishankar, Bresler. Learning sparsifying transforms. IEEE Trans Signal Process 2012; 61: 1072–1086.

18. Ravishankar, Wen, Bresler. Online sparsifying transform learning – part I: algorithms. IEEE J Sel Top Signal Process 2015; 9: 625–636.

19. Maggu, Aggarwal, Majumdar. Label-consistent transform learning for hyperspectral image classification. IEEE Geosci Remote Sensing Lett 2019; 16: 1502–1506.

20. Yang, Sastry, Ganesh, et al. Fast ℓ₁-minimization algorithms and an application in robust face recognition: a review. In: ICIP, Hong Kong, China, 26–29 September 2010, pp.1849–1852.

21. Wright, Yang, Ganesh, et al. Robust face recognition via sparse representation. IEEE TPAMI 2009; 31: 210–227.

22. Wang, Yang, et al. Locality-constrained linear coding for image classification. In: CVPR, San Francisco, California, 13–18 June 2010, pp.3360–3367.

23. Lazebnik, Schmid, Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: CVPR, New York, USA, 17–22 June 2006, pp.2169–2178.