Boosting sparsity-induced autoencoder: A novel sparse feature ensemble learning for image classification

Abstract

As a kind of unsupervised learning model, the autoencoder is usually adopted to perform the pretraining to obtain the optimal initial value of parameter space, so as to avoid the local minimality that the nonconvex problem may fall into and gradient vanishment of the process of back propagation. However, the autoencoder and its variants have not taken the statistical characteristics and domain knowledge of the train set and also lost plenty of essential representaions learned from different levels when it comes to image processing and computer vision. In this article, we firstly add a sparsity-induced layer into the autoencoder to exploit and extract more representative and essential features exist in the input and then combining the ensemble learning mechanism, we propose a novel sparse feature ensemble learning method, named Boosting sparsity-induced autoencoder, which could make full use of hierarchical and diverse features, increase the accuracy and the stability of a single model. The classification results on different data sets illustrated the effectiveness of our proposed method.

Keywords

Sparse representation sparsity-induced mechanism image denoising image classification

Introduction

When dealing with classification tasks, the performance of the general learning algorithm depends to a large extent on the quality of the feature representation learned from the original input. Good features can not only remove the irrelevant or redundant features coexisting in the original input space but also retain the essential information. Therefore, how to build a good feature extractor is one of the key issues to achieve high performance for various tasks. With the development of deep learning, the feature extractors constructed by unsupervised learning methods are further used in the field of computer vision. At the same time, deep hierarchical features generated by stacked unsupervised models have been proved to be a powerful tool to employ the computer vision tasks, which have brought about the widespread attention in the field of computer vision.^1,2

The sparseness of the representation has been shown to have advantages in robustness to denoising and also improved classification performance in high-dimensional spaces.^3,4 Meanwhile, the sparse representation has proven its significant impact on computer vision.^5
–7 Some works pointed out that the performance of a specific task can be improved if the input image can be represented by sparse representation, and in classification tasks, the networks with sparse structures always have better performance than the dense one, especially for original data set with degraded quality.^7,8 A significant advantage of the sparse representation method is that it can produce coding or feature representation that is more robust to noise. Some methods, such as sparse encoding strategy, sparse regularization term and sparse filtering, have taken the input samples into sparse depth-related neural network model. Based on these strategies, the sparse deep model is proposed recently.

In recent years, deep learning has been widely used in image classification tasks.^9

–12 Autoencoder, a deep learning tool with special structure, has been stacked to pretrain a deep neural network by using a greedy layer-wise means.¹³ Then, the unsupervised pretraining method is adopted to initialize the every layer of autoencoder, and back propagation is used for fine-tuning by a supervised learning algorithm, leading to solve the problem of inadequate expressive ability of shallow network. Denoising autoencoder is employed based on image patches,¹⁴ which leads a denoising result among various deep networks due to its peculiar unsupervised feature learning mechanism. Sparse autoencoder,¹⁵ serves as one typical model for unsupervised feature learning, proposed with a sparsity constraint on the hidden units. Since it consists of an encoding stage and a decoding stage and also requires output as much as input data, autoencoder can be recognized as an identical mapping that could reconstruct the original input.

However, when it comes to design a deep network, autoencoder and its variants do not take the statistical characteristics and the domain knowledge of train set into account. What’s more, they also abandon a large number of features obtained from multilayers. Besides, due to the large variance and low generalization ability on the unknown test set, autoencoder usually serves as a pretraining method. Accordingly, how to take full advantage of the characteristics of input is one of the most important focuses in our work. It is well-known that ensemble learning is considered as a practical technique for improving accuracy and stability, which could employs several weak single classifiers according to some combination rules to construct a stronger classifier. To ensure a better performance, two key issues should be considered, namely the diversity and accuracy of each learner and the combination rules used in ensemble learning.¹⁶

In this work, we propose a new sparsity encourage mechanism for autoecndoer and further build a sparsity-induced autoencoder (SparsityAE) that can exploit more representative and intrinsic features from input. Then, to benefit from the ability of ensemble learning method and boost the autoencoders performance, an ensemble sparse feature learning algorithm based on SparsityAE is proposed, named Boosting sparsity-induced autoencoder (BoostingAE). In our work, multilevel sparse features will be obtained when pretraining SparsityAE is complete, and ensemble learning is used in our study, which could effectively improve the accuracy and stability of single learner. Experimental results for classification tasks on different data sets show that the advantages of SparsityAE in sparse feature extraction and that our BoostingAE can significantly improve the overall performance compared with single learners.

Related work

Some related models and methods used in our work will be briefly introduced in this section.

Sparse representation

Sparse coding

Sparse coding provides a family of methods for acquiring the condense representations of input. Usually, we try to learn features from input directly and then reconstruct the data from the learned sparse codes by transforming which from the feature space to the data space. Although sparse coding is closely related to the traditional sparse coding techniques on image denoising, its main disadvantage is the high computational cost. In addition, it is well-known that sparse coding is not “smooth,”^17,18 which means that a small change in original input data space may lead to a significant difference in the code space.

Sparse filtering

Compared with existing feature learning models, an important property of sparse filtering is that it needs only one hyper-parameter for its simple cost function¹⁹

min \sum_{i = 1}^{M} | | {\hat{f}}^{(i)} {| |}_{1} = \sum_{i = 1}^{M} | | \frac{{\tilde{f}}^{(i)}}{| | {\tilde{f}}^{(i)} {| |}_{2}} {| |}_{1}

where f represents features learned from input sample, $\tilde{f}$ is defined by $L_{2}$ norm of f, and M is the sample size.

Sparse regularization

Different from sparse coding, there is no need for sparsity regularization to perform an extra separate stage to induce sparsity and to encourage sparse representations of input, which can make sparsity be induced more automatically and efficiently even a new term is added to its loss function. Various sparse regularization methods are used in deep belief networks or autoencoders,²⁰ and the results have proved that these methods have beneficial effects for a particular scene.

Softmax regression

Softmax regression is a generalized version of logistic regression, usually applied to classification problems where the class label y can be defined as any one in a label set that has no less than two values. Assume that there are m training samples and k labels: ${(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2))}, \dots, (x^{(i)}, y^{(i)})} (1, 2, \dots, m)$ , where $x^{i}$ is the input sample and $y^{i} \in {1, 2, \dots, k}$ is the corresponding label. In our work, utilizing the softmax regression to realize the image classification. For every input, the output probability function can be defined as follows

h_{θ} (x^{i}) = [\begin{matrix} p (y^{(i)} = 1 | x^{(i)}; θ) \\ p (y^{(i)} = 2 | x^{(i)}; θ) \\ ⋮ \\ p (y^{(i)} = k | x^{(i)}; θ) \end{matrix}] = \frac{1}{\sum_{j = 1}^{k} e^{θ_{j}^{T} x^{i}}} [\begin{matrix} e^{θ_{1}^{T} x^{(i)}} \\ e^{θ_{2}^{T} x^{(i)}} \\ ⋮ \\ e^{θ_{k}^{T} x^{(i)}} \end{matrix}]

where θ represents the parameter of softmax model, and the probability of its category is estimated to be

p (y = k | x^{(i)}; θ) = \frac{e^{θ_{k}^{T} x^{(i)}}}{\sum_{j = 1}^{k} e^{θ_{j}^{T} x^{(i)}}}

Ensemble learning

In ensemble learning, according to the combination rules, some weak classifiers are used to construct a strong classifier whose generalization error much lower than any weak classifier. Compared to random guessing, the weak learner means that whose generalization performance is slightly better on the unknown test sets. From a mathematical point of view, ensemble learning could effectively reduce variance and achieve more stable performance. To get a better integration result, each weak learner needs to be as different as possible, that is to say, there is a high degree of diversity between the base weak learners, which will be helpful for the performance of ensemble learning.

Boosting method is a widely used method of statistical learning and serves as an important means of ensemble learning. By changing the weights of training samples, boosting method trains multiple base learners, according to a combination rule to get the final decision result.

Figure 1 shows a schematic diagram of the boosting method. Usually, boosting method trains a set of individual learners that learning ability is relatively weak and then adjusts the distribution of training samples according to the classification accuracy of the classifier; after that, the adjusted sample distribution will be used to train the next learner. The above process is operated repeatedly until the number of base learners meets the preset value. Finally, combining the pretrained classifiers with a combination rule to give the final decision output.

Figure 1.

Overview of boosting method.

AdaBoost is a well-known ensemble method, which can be adopted in conjunction with many various types of learners to serve as a strong learner. It trains the single classifier in “rounds,” and at each round the importance of training examples is increased that were misclassified in the previous round. The final ensemble is then aggregated using weights that were calculated during the training process.

Boosting sparsity-induced autoencoder

Sparsity-induced autoencoder

Based on the assumptions of the sparse representation and the efficient reconstruction of low-dimensional feature representation obtained in the encoding phase of deep models, a novel sparse feature learning approach, named SparsityAE, is proposed firstly by employing the sparsity property of input and adding an sparsity-induced layer that is placed to the last layer of encoding phase of a stacked deep model, which could efficiently capture the sparse feature representations from multiple layers.

The structure of the SparsityAE is shown in Figure 2. From the topology map, we can tell that the new sparsity-induced layer is added next to the last layer of the encoder phase, also just in the middle of the whole deep hierarchical architecture, which could apply the sparsity constraint to obtain the sparse robust and highly intrinsic features. We input the high-dimensional data set into the encoding stage. With the deepening of the encoder, the length of the code learned from each layer becomes shorter and shorter. The last layer of the encoding stage is followed by a novel sparsity-induced layer to generate sparse code. In contrast, the decoding stage reconstructs the compressive and sparse codes given by the sparsity-induced layer.

Figure 2.

The topology of the proposed SparsityAE.

In SparsityAE, the number of neuron decreases layer-by-layer. In the sparsity-induced layer, according to the threshold set in advance, the neurons whose activation value lower than threshold will be set to zero, while retaining other neurons whose activation value is significantly higher than that threshold. The introduction of sparsity-induced layer could not only compress the original input space by reducing the number of neurons but also eliminate the correlation among attributes. That is to say, it reduces the complexity of the model and further improves the generalization ability of the model. In the view of the neural network, the deeper the layers are, the more abstract and essential features can be extracted.

Assuming that $y_{i} (i = 1, 2, \dots, N)$ is the original input and $x_{i}$ is its degraded version, so the input can be mapped by a hidden representation as follows

\hat{y} (x_{i}) = σ (W^{'} h (x_{i}) + b^{'})

where $\hat{y}$ is an approximation evaluation of y and $σ (•)$ is the map function.

For our SparsityAE, we optimize the reconstruction loss regularized by a weight decay and a sparsity-inducing term. The cost function can be designed as follows

L (X, Y; θ) = {‖ y_{i} - \hat{y} (x_{i}) ‖}_{2}^{2} / N + β • K L (\hat{ρ} ‖ ρ)

where $θ = {W^{'}, b^{'}}$ represents weights and biases, $β • K L (\hat{ρ} | | ρ)$ is the sparse regularization and used to extract sparse representation, and $\hat{ρ}$ is the average output of all hidden neurons

K L (\hat{ρ} | | ρ) = \sum_{j = 1}^{| \hat{ρ} |} ρ log (\frac{ρ}{{\hat{ρ}}_{j}}) + (1 - ρ) log (\frac{1 - ρ}{1 - {\hat{ρ}}_{j}})

\hat{ρ} = (1 / N) \sum_{i}^{N} h (x_{i})

As shown in Algorithm 1, we initialize the data by computing the pairwise data similarity and determine the reconstruction coefficient for each data firstly. After that, minimizing the loss function by using the stochastic gradient descent to update network parameters. Then, updating the data according to the hidden representation derived from the SparsityAE and setting the neurons without significant activation value to zero. Repeat the steps (2), (3), and (4) until convergence.

Algorithm 1

SparsityAE

Notation:

Ω_{i}

: the reconstruction error for

x_{i}

;

S_{i}

: the reconstruction coefficient for

x_{i}

;

y_{i} ​_{1}^{N}

: the hidden representation for every input.

Input:

train set

D = {x_{i}}_{1}^{N}

;

parameters k (constant) and

θ = {W, b, W^{'}, b^{'}}

Process:

(1) Compute the

S_{i}

for train set

{x_{i}}_{1}^{N}

;

(2) Minimize the cost function by stochastic gradient descent and update θ;

(3) Compute

y_{i} ​_{1}^{N}

for each input;

(4) Keep the neurons whose activation value higher than threshold k and others are set to zero, and then update

S_{i}

and

Ω_{i}

;

(5) Repeat the step (2), (3) and (4) until convergence.

Output:

reconstruction representation of the input.

Feature ensemble method

With the SparistyAE introduced above, we can obtain multiple sparse features. And considering that ensemble feature learning can significantly increase the accuracy and stability of the single classifier, a BoostingAE based on SparsityAE and ensemble learning is proposed.

To make the whole structure easy to understand, Figure 3 shows a detailed construction of BoostingAE, which indicates that BoostingAE is theoretically possible to obtain multiple sparse representations that come from the cascaded SparsityAEs. Suppose a deep net with N hidden layers for image classification has been trained, N SparsityAE s will be pretrained to set initial values of boosting network and, therefore, produce N representations of input data. To make full use of the learned representations, a classification algorithm is selected to train the first M representations to obtain M corresponding classifiers. Finally, a fusion rule is used to aggregate the results of multiple classifiers so as to get the final prediction result.

Figure 3.

The topology of the proposed BoostingAE.

Specifically, we train three SparsityAEs to establish the BoostingAE. First, we train SparsityAE_1 in Figure 3. Let x be the original input, $W^{(i)}$ be the weight matrix connecting the input layer and the hidden layer and $b^{(i)}$ be the bias vector, so the output of SparsityAE_1 can be mapped by equation (4)

{\hat{y}}_{1} = σ (x • W^{(1)} + b^{(1)})

Treat ${\hat{y}}_{1}$ as the input of the next SparsityAE to further train SparsityAE_2. With reference to the above operation, the output of SparsityAE_2 can be obtained, which can also be used as the input of next SparsityAE

{\hat{y}}_{2} = σ ({\hat{y}}_{1} • W^{(2)} + b^{(2)})

From the horizontal direction, the sparse features learned from the current layer will be passed to the next SparsityAE through the above process. As a result, all of three SparsityAEs can be trained. At the same time, three trained classifiers and their corresponding optimal parameters will be obtained by the softmax regression and the training of the feature representations at the encoding stage.

Combination method of voting

When all three classifiers are set up, final prediction is given by the fusion result of all classifiers with a specific fusion rule. Here, the Naive Bayes combination rules²¹ are applied, which assume that individual classifiers are mutually independent. Naive Bayes combination rules include MAX rule, MIN rule, and AVG rule. In our work, we will adopt all three combination mehtods mentioned above.

Given a sample x and its label y, where y has L possible values. Suppose that BoostingAE contains N base classifiers, and $P_{n i} (x)$ is the probability that the category of x is i in the n th classifier, so the label y can be predicted with three different combination mehtods as follows:

MAX rule:

y = \underset{i = 1, 2, \dots, L}{argmax} max_{n = 1, 2, \dots, N} P_{n i} (x)

MIN rule:

y = \underset{i = 1, 2, \dots, L}{argmax} min_{n = 1, 2, \dots, N} P_{n i} (x)

AVG rule:

y = \underset{i = 1, 2, \dots, L}{argmax} \frac{1}{N} \sum_{n = 1}^{N} P_{n i} (x)

BoostingAE algorithm

From Figure 3, BoostingAE includes an input layer, some hidden layers and an output layer to carry out specific tasks, of which SparsityAE are mainly employed to provide sparse features for next layer.

Here, two key problems will directly affect the performance of the BoostingAE: how to measure the importance of feature of every layer and how to select the optimal model for each base learner. A main contribution of this article is that Adaboost will be employed to supervise the adjustment of parameters and weight coefficients. Algorithm 2 gives the detailed explanation of BoostingAE. The traditional sparse stacked autoencoder, especially in the encoding stage will extract multilayer features. However, it only uses the output of the last layer of the encoding stage, which wastes the features learned from other layers when it comes to the image processing.

Algorithm 2

BoostingAE

Notation:

T: the number of base learner;

SparsityAE: the algorithm of base learner.

Input:

train set

D = {(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}),..., (x^{(m)}, y^{(m)})}

Process:

Initialize: assign the default parameters.

Loop: for

t = 1, 2, \dots, T

(1) The new dataset D′ is generated by bootstrapping original dataset;

(2) The individual classifier

g_{t} =SparsityAE(D^{'})

will be obtained.

Output:

Obtain ensemble classifier BoostingAE.

BoostingAE can make full use of the multilayer features by integrating base classifiers, thus avoiding the waste of computing and storage resources. And the representations extracted from multiple layers enhance the diversity of base learners in integrated learning architecture, which is more conducive to the performance of integrated learning. More concretely, BoostingAEs characteristics are shown in following aspects:

Similar to AdaBoost, BootingAE also adopts the cascade serialization mechanism rather than the parallel one to combine base learners, which makes the individual learners interrelated but different.

The output of the current layer will be the input of the next layer to get more features. While further deepening the network and acquiring abundant feature representation, it also makes the “training samples” received by every individual learners have great differences.

In the designing of individual learners, the topology of each model can be defined separately, rather than by a unified model. This can not only maintain the homogeneity of basic learners but also further increase the diversity of individual learners.

Experiments

In this section, we will verify the performance of SparsityAE with respect to sparse feature learning first. Next, to accurately illustrate the performance and stability of the proposed ensemble sparse feature learning method on real-world image processing and computer vision tasks, BoostingAE will be carried out on three widely employed image data sets. Moreover, we also provide the compared results by employing some state-of-the-art methods on same data sets.

The sparse feature learning of SparsityAE

In this part, we mainly focus on denoising of gray scale images. A group of natural images that come from http://decsai.ugr.es/cvg/dbimagenes are employed as the train set, and a group of standard natural images as the testing set has been widely used in the image processing.

Before the stage of training, we select a clean image y randomly from the data set and corrupt it with a specific strength of additive white Gaussian noise to generate the corresponding noisy image x. At the end of the training process, given any noisy image, the model can reconstruct its corresponding clean image. To avoid the local minimum, we adopt the layerwise pretraining procedure.¹³

First, Figure 4 shows the visualization results of the learned weight matrix of autoencoder with KL-divergence sparsity constraint only and SparsityAE, respectively, which means that the features obtained from SparsityAE can describe the edge, contour, and texture details of the image more accurately and also indicates that SparsityAE could learn more representative features from the inputs. Moreover, the comparison with the autoencoder with KL-divergence sparsity constraint which also shows good performance of SparsityAE is due to the introduction of sparsity-induced layer.

Figure 4.

The learned weight matrix visualization results of different autoencoders. (a) Left one is the visualization of autoencoder with KL-divergence sparsity constraint only, and (b) the right is the result of our method.

Next, to compare the denoising performance of our SparsityAE with some classic methods such as KSVD²² and BM3D,²³ we implemented the above methods on standard testing images, which have been degraded by different noise levels. Figure 5 illustrates that when the model is trained with the noise, image degradation is only $σ = 25$ , SparsityAE (magenta line) is competitive, which is because it knows nothing about the noise level, while other methods were provided. When we train the above methods on several noise levels, the green line shows that SparsityAE is more robust to the change of noise levels.

Figure 5.

Denoising performance comparison of various methods with various noise levels.

Table 1 gives the numerical denoising results of SparsityAE, KSVD, and BM3D, which is measured by the peak signal-to-noise ratio (PSNR in dB). And we highlighted the best result of each image in bold. Besides, we also compare SparsityAE and other state-of-the-art denoising methods, WNNM,²⁴ and two training-based methods, MLP²⁵ and TNRD,²⁶ in Table 2. The results show that KSVD and BM3D are more suitable for processing images with a large number of repetitive structures, but the adopted method in this study is better in all images, except Couple and Barbara. In addition, our SparsityAE can compete with WNNM, MLP, and TNPD.

Table 1.

Comparison between SparsityAE, KSVD, and BM3D measured by PSNR.^a

Image	KSVD	BM3D	Present study
Lena	31.35	32.08	32.46
House	32.14	32.86	32.96
Cameraman	28.72	29.45	29.98
Moarch	28.81	29.25	29.65
Couple	28.84	29.71	29.68
Man	29.09	29.61	29.78
Babara	29.60	30.72	29.20
Boat	29.32	29.91	29.91
Pepper	29.71	30.16	30.37

SparsityAE: sparsity-induced autoencoder.

^aThe best result of each image is provided in boldface.

Table 2.

Comparison between SpasityAE and several state-of-the-art denoising measured by PSNR.^a

Image	WNNM	MLP	TNRD	Present study
Lena	32.24	32.25	32.00	32.46
House	33.23	32.56	32.53	32.96
Cameraman	29.64	29.61	29.72	29.98
Moarch	29.85	29.61	29.85	29.65
Couple	29.82	29.76	29.69	29.68
Man	29.76	29.88	30.11	29.78
Babara	31.24	29.54	29.41	29.20
Boat	30.03	29.97	30.21	29.91
Pepper	30.42	30.30	30.57	30.37

SparsityAE: sparsity-induced autoencoder.

^aThe best result of each image is provided in boldface.

From the above results, we known that SparsityAE, the crux of the subsequent work, can project the original high-dimensional space to a lower dimensional and more intrinsic space from the dimensionality reduction point of view but also capture the more representative sparse features derived from multilayers to make full use of the information contained in inputs. SparsityAE is a more streamlined and targeted network structure and effectively improves its performance in image denoising.

BoostingAE for classification on MNIST

MNIST is a widely used handwritten digits data set, and the size of each image is 28 × 28. It consists of 60,000 training images and 10,000 testing images.

For illustration of the performance of BoostingAE on MINIST, we need to establish the topological structure of the BoostingAE for MNIST. The topology is $784 \times 500 \times 250 \times 100 \times 10$ , where 784 is the image size and 10 is the number of labels. First, Classifier_1 in Figure 3 is obtained by pretraining and fine-tuning SparsityAE_1. And then the saprse features derived from SparsityAE_1 will be used as the input of SparsityAE_2, so as to get Classifier_2. According to the process mentioned earlier, we got all base classifiers and completed the construction and the training of the whole BoostingAE model. When predicting actual samples, three Naive Bayes combination rules will be used to vote on the integrative results of the base classifiers.

Table 3 lists that the classification results of three base classifiers are better than those of KNN and SVM because of the introduction of the sparsity-induced layer. And under the effect of three different fusion rules, BoostingAE achieves 98.37%, 98.43%, and 98.87% accuracy rate, respectively, which are better than any individual classification and very close to Lpnorm AE.²⁷ Moreover, benefiting from convolution operations and hidden layers, stacked CAE²⁸ and CASE²⁹ can extract more abstract features of input, so the result of our BostingAE is slightly worse than theirs.

Table 3.

Numerical results of classification on MNIST.

	MNIST (%)
KNN	91.32
SVM	90.42
Lpnorm AE (KNN)	97.44
Lpnorm AE (SVM)	98.64
Stacked CAE	99.29
CSAE	99.39
Classifier_1	96.73
Classifier_2	96.3
Classifier_3	95.89
BoostingAE (MAX)	98.37
BoostingAE (MIN)	98.43
BoostingAE (AVG)	98.87

BoostingAE for classification on CIFAR-10 and SVHN

CIFAR-10 includes 10 types of color images, and each category contains 60,000 28 × 28 color images. The train set consists of 5000 images in each category, a total of 50,000 training images, and the remaining images are used for test. Its images are obtained from the real life so the background is more complex and comparing with other simple data sets are difficult to identify.

SVHN data set is also drawn from the real world and can be viewed as the deformation and upgrade of MNIST, but it has a more complex objective environment than MNIST because there is not one number in each image (the images are cut from house number plates so the category is decided by middle number). Same as MNIST, SVHN data set also contains 10 types of color images, as well as be marked as 0–9. SVHN includes a train set, a testing set, and an extra set. The validation set is derived from the train set and the extra set. It is generated randomly: two-third of it comes from the train set (400 samples per class), and the rest are from the extra set (200 samples per class).

Before the experiment, it is necessary to transform the images of CIFAR-10 and SVHN from RGB space into gray space and then normalize them. In the process of pretraining and fine-tuning the model, a mini-batch gradient descent algorithm will be implemented to improve the training efficiency. Taking into account, the unsupervised learning mechanism of autoencoder, both CIFAR-10 and SVHN use a certain proportion of unlabeled samples as train set in pretraining. At the stage of fine-tuning, both two data sets require ground truth to classify the image.

Next, similar to the experimental process on MNIST, the SparsityAE model is firstly determined, whose topology is $1024 \times 500 \times 250 \times 100 \times 10$ . In addition, the follow-up operations in the experiment are similar to those performed on the MNIST data set.

Table 4 lists the numerical results of comparison methods, individual classifiers that are based on the sparsity-induced layer, and the proposed method. From the results in Table 3, we get a conclusion same as that of MNIST. In all comparison methods, our method has achieved the best result.

Table 4.

Numerical results of classification on CIFAR-10 and SVHN.

	CIFAR-10 (%)	SVHN (%)
KNN	84.4	78.32
SVM	88.45	83.24
Lpnorm AE (KNN)	/	67.2
Lpnorm AE (SVM)	/	71.19%
Stacked CAE	79.20	/
CDSAE³⁰	74.18	/
Classifier_1	91.46	88.46
Classifier_2	91.83	88.93
Classifier_3	90.96	87.69
BoostingAE (MAX)	92.32	89.94
BoostingAE (MIN)	91.49	90.35
BoostingAE (AVG)	92.63	90.87

From the experiments of BoostingAE on three data sets, the individual classifiers Classifier_1, Classifier_2, and Classifier_3 have better classification results than or close to comparison methods because of the introduction of sparsity-encouraged mechanism in SparsityAE. Moreover, the BoostingAE with three different fusion rules achieve better performance than any other methods, and compared with the single classifier, more than three percentage points is improved, among which AVE rule is more stable, the MAX rule is worse, MIN rule in the middle. All of the experimental results illustrated that BoostingAE can not only capture more sparse representation but also utilize the multilayer features, resulting in the improvement of accuracy and diversity of overall model.

Conclusions

Based on the sparse representations of general signal and the efficient reconstruction of low-dimensional features obtained in the encoding stage of deep models, we first introduced a novel sparsity-encouraged mechanism to build SparsityAE by adding a sparsity-induced layer, which could efficiently induce the sparse feature representations of input data. And then, on the basis of SparsityAE and ensemble learning, considering that the multilayer features of input data will be learned after training several SparsityAE s and the input has not been exploited and reused reasonably, we proposed BoostingAE that could make full use of the more abstract and intrinsic features learned from multilayers and improve the performance of individual learners under the premise of guaranteeing the accuracy and diversity of base learning classification. An experiment about image denoising and the experiments of image classification on three different data sets validated the effectiveness of the proposed algorithm.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The work was jointly supported by the National Natural Science Foundations of China under grant No. 61772396.

ORCID iD

Jian Ji

References

Goh

Thome

Cord

. Learning deep hierarchical visual feature coding. IEEE Trans Neural Networks Learn Syst 2017; 25(12): 2212–2225.

Huang

Yang

. Hierarchical convolutional features for visual tracking. In: IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015, pp. 3074–3082. IEEE Computer Society.

Ranzato

Boureau

Lecun

. Sparse feature learning for deep belief networks. In: International Conference on Neural Information Processing Systems (eds Platt

Koller

Singer

Roweis

), Vancouver, Canada, 8–11 December 2008, pp. 1185–1192. Curran Associates Inc.

Jeong

Xiang

J L

Yang

Li Q

. Dictionary learning and sparse coding-based denoising for high-resolution task functional connectivity MRI analysis. In: International Workshop on Machine Learning in Medical Imaging, Quebec, Canada, 10 September 2017, pp. 45–52.

Shahnawazuddin

Sinha

. Sparse coding over redundant dictionaries for fast adaptation of speech recognition system. Comput Speech Lang 2016; 43: 1–17.

Srinivas

Lin

Liao

. Learning deep and sparse feature representation for fine-grained object recognition. In: IEEE International Conference on Multimedia & Expo, Hong Kong, China, 10–14 July 2017, pp. 1458–1463. IEEE Computer Society.

Ghifary

Kleijn

Zhang

. Sparse representations in deep learning for noise-robust digit classification. In: Image & Vision Computing New Zealand, Wellington, New Zealand, 27–29 November 2013, pp. 340–345. IEEE.

Wright

Mairal

. Sparse representation for computer vision and pattern recognition. Proc IEEE 2010; 98(6): 1031–1044.

Yang

. Hyperspectral image classification with deep learning models. IEEE Trans Geosci Remote Sens 2018; 56(9): 1–16.

10.

Mei

Jiang

. Invariant feature extraction for image classification via multi-channel convolutional neural network. In: International Symposium on Intelligent Signal Processing & Communication Systems, Amoy, China, 6–9 November 2017, pp. 491–495. IEEE.

11.

Durand

Mordan

Thome

. Wildcat: weakly supervised learning of deep convnets for image classification, pointwise localization and segmentation. In: IEEE Conference on Computer Vision & Pattern Recognition, Hawaii, USA, 21–26 July 2017, pp. 642–651. IEEE.

12.

Marino

Salakhutdinov

Gupta

. The more you know: using knowledge graphs for image classification. In: IEEE Conference on Computer Vision & Pattern Recognition, Hawaii, USA, 21–26 July 2017, pp. 2673–2681.

13.

Bengio

Lamblin

Popovici

. Greedy layer-wise training of deep networks. Adv Neur In 2007; 19: 153–160.

14.

Hinton

Salakhutdinov

. Reducing the dimensionality of data with neural networks. Science 2006; 313(5786): 504–507.

15.

Wen

Wang

. Learning structured sparsity in deep neural networks. In: International Conference on Neural Information Processing Systems (eds Lee

Sugiyama

Luxburg

Guyon

Garnett

), Barcelona, Spain, 5–10 December 2016, pp. 2074–2082. Curran Associates, Inc.

16.

Zhang

Zhou

. Sparse ensembles using weighted combination methods based on linear programming. Pattern Recognit 2011; 44(1): 97–106.

17.

Wang

Yang

Kai

. Locality-constrained linear coding for image classification. In: Computer Vision & Pattern Recognition, San Francisco, USA, 13–18 June 2010, pp. 3360–3367. IEEE.

18.

Gao

Tsang

Chia

. Local features are not lonelylaplacian sparse coding for image classification. In: Computer Vision & Pattern Recognition, San Francisco, USA, 13–18 June 2010, pp. 3555–3561. IEEE.

19.

Ngiam

Pang

Chen

. Sparse filtering. In: International Conference on Neural Information Processing Systems, Granada, Spain, 12–17 December 2011, pp. 1125–1133. Curran Associates Inc.

20.

Ngiam

Coates

. On optimization methods for deep learning. In: Proceedings of the 28th International Conference on Machine Learning (eds Getoor

Scheffer

), Bellevue, USA, 28 June–2 July 2011, pp. 265–272. Omnipress.

21.

Kuncheva

. Combining pattern classifiers: methods and algorithms, 1st ed. Hoboken: Wiley, 2004.

22.

Elad

Aharon

. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Trans Image Process 2007; 15(12): 3736–3745.

23.

Dabov

Foi

Katkovnik

. Image denoising by sparse 3d transform-domain collaborative filtering. IEEE Trans Image Process 2007; 16(8): 2080–2095.

24.

Lei

Zuo

. Weighted nuclear norm minimization with application to image denoising. In: Computer Vision & Pattern Recognition, Columbus, USA, 24–27 June 2014, pp. 2862–2869. IEEE Computer Society.

25.

Burger

Schuler

Harmeling

. Image denoising: can plain neural networks compete with bm3d? In: Computer Vision & Pattern Recognition, Providence, Rhode Island, 16–21 June 2012, pp. 2392–2399. IEEE.

26.

Chen

Pock

. Trainable nonlinear reaction diffusion: a flexible framework for fast and effective image restoration. IEEE Trans Pattern Anal Mach Intell 2017; 39(6): 1256–1272.

27.

Mehta

Gupta

Gogna

. Stacked robust autoencoder for classification. In: Neural Information Processing (eds Hirose

Ozawa

Doya

Ikeda

Lee

Liu

), Kyoto, Japan, 16–21 October 2016, pp. 600–607. Cham: Springer.

28.

Masci

Meier

Cirean

. Stacked convolutional autoencoders for hierarchical feature extraction. In: International Conference on Artificial Neural Networks (eds Honkela

Duch

Girolami

Kaski

), Espoo, Finland, 14–17 June 2011, pp. 52–59. Berlin, Heidelberg: Springer.

29.

Luo

Yang

. Convolutional sparse autoencoders for image classification. IEEE Trans Neural Netw Learn Syst 2018; 29(7): 3289–3294.

30.

Chen

Liu

Zeng

. Image classification based on convolutional denoising sparse autoencoder. Math Prob Eng 2017; 2017: 1–16.