
Abstract

The Extreme Learning Machine (ELM) is a highly efficient model for real-time network retraining because, unlike traditional machine learning methods, it learns very quickly. However, the performance of ELM can be degraded by the random initialization of weights and biases. Moreover, poor input feature quality can further degrade performance, particularly with complex visual data. To overcome these issues, this paper proposes optimizing the input features as well as the initial weights and biases. We combine features extracted by a Convolutional Neural Network (CNN) and a Convolutional AutoEncoder (CAE) to improve the quality of the input features. We then use our hybrid Grey Wolf Optimizer-Multi-Verse Optimizer (GWO-MVO) metaheuristic to initialize the weights and biases, applying four fitness functions built from three properties: the norm of the output weights, the error rate on the training set, and the error rate on the validation set. Our method is evaluated on image classification tasks using two benchmark datasets: CIFAR-10 and CIFAR-100. Since image quality may vary in real-world applications, we trained and tested our models on both the original and noisy versions of each dataset. The results demonstrate that our method provides a robust and efficient alternative for image classification tasks, offering improved accuracy and reduced overfitting.

Keywords

Extreme learning machine, grey wolf optimizer, feature extraction, convolutional neural network, convolutional autoencoder, image classification

1. Introduction

Traditional learning techniques face challenging issues, such as slow learning speed, intensive human intervention, and poor learning scalability. Extreme Learning Machine (ELM) [1] is a simple machine learning method that offers advantages over traditional learning algorithms. This method is known for its breakneck learning speed compared to conventional methods, as it requires no iterative gradient-based training. Moreover, no parameters need to be tuned except the architecture size. Therefore, ELM can be used to address problems that require real-time network retraining. ELM has also demonstrated exemplary performance in various fields such as medicine [2, 3], chemistry [4], transportation [5], economics [6], and robotics [7].

However, the random initialization of input weights and biases, together with a learning procedure that only aims to minimize the error on the training set, can reduce prediction accuracy and cause overfitting. In addition, it has been shown that ELM struggles with complex visual data, particularly when the spectral information of each pixel in the image is given directly as input [8].

Since the quality of images can vary in real-world applications, some works have investigated the effect of noise on the image classification task. Models trained with a noisy version of the training set and tested on data with the same noise configuration have performed worse than models trained and tested on the original data [9, 10]. However, most image classification systems ignore preprocessing and suppose that the image quality does not vary.

Multiple contributions have been made to improve the performance of ELMs. Many of these contributions aim to optimize the initialization of weights and biases to increase the prediction accuracy while ignoring the overfitting problem. Another part of the contributions focuses on improving the quality of ELM input features. Feeding the ELM classifier with high-level feature representations extracted by a deep neural network has become a promising direction [8, 11, 12, 13, 14]. However, little attention has been paid to image quality during the feature extraction phase.

In this work, we are interested in improving the basic ELM model in two aspects: the optimization of the initialization of weights and biases and the preprocessing of images with or without noise.

Metaheuristics have proven their efficiency in optimizing the input parameters of ELM. Most contributions in this field aim to minimize the error rate on the training or validation set. This paper investigates the effectiveness of the hybrid Grey Wolf Optimizer-Multi-Verse Optimizer (GWO-MVO) metaheuristic, which has proven its performance in estimating the optimal dropout value in [15].

On the other hand, Convolutional Neural Networks (CNNs) have been widely used for feature extraction in image classification tasks. While current CNN-based methods have shown improved robustness over previous approaches, overfitting and noise present additional challenges [10]. Convolutional Autoencoders (CAE) [16], based on the Autoencoder (AE) architecture, have shown promising performance in handling highly variable conditions. However, they still provide poor precision compared with traditional deep networks such as VGG16 [17]. Therefore, in this article, we propose to combine the features extracted by the basic CNN with those extracted by CAE.

To the best of our knowledge, this is the first study to optimize the ELM model’s input features and initial parameters simultaneously for noisy image classification tasks. The main contributions of our work can be summarized as follows:

• Feeding ELM with relevant features extracted by both models, CNN and CAE.
• Proposing a GWO-MVO-based approach to select initial values of weights and biases for the ELM model.
• Improving the ELM performance and generalization capacity by single and multi-objective optimization using these three properties:
  – the norm of the output weights of the ELM network,
  – the error rate on the training set, and
  – the error rate on the validation set.
• Applying our method to image classification problems, whether noisy or not.

The remainder of this article is organized as follows: Section 2 presents related work. Section 3 gives the theoretical background of the techniques used in this work. Section 4 presents the concepts of the hybrid GWO-MVO algorithm. Sections 5 and 6 explain our method. Section 7 reports and discusses the obtained results. Finally, Section 8 concludes the paper, highlights its limitations, and discusses potential future research directions.
2. Related works

In this section, we first present some research studies of the ELM model using metaheuristics. Then, we describe the most important studies that used deep learning models to extract relevant features from images. Finally, we discuss the effect of image quality on the classification task.

2.1 Metaheuristic-based ELM

Random initialization of weights and biases further improves the speed of ELM’s learning algorithm. However, it reduces the performance of ELM in complex classification tasks and causes overfitting, as it affects the classification boundaries. As a result, ELM needs more hidden nodes than classical training approaches to generalize well, which significantly increases the testing time.

Metaheuristics represent a family of methods that have proven to be effective in solving what are known as complex optimization problems. As a result, metaheuristic-based approaches have been widely used with success in weight and bias initialization.

A hybrid evolutionary extreme learning machine (E-ELM) algorithm was proposed in [18]. The approach used the Differential Evolution (DE) algorithm to optimize initial weight and bias values. The root mean squared error (RMSE) was used as the fitness function, and the sigmoid as the activation function. A new selection strategy was presented such that when the difference in fitness between individuals is slight, the one with the smallest norm of the output weights is selected. The proposed approach was evaluated on both regression and classification tasks. Although E-ELM outperformed traditional machine learning algorithms in terms of performance and speed, the results cannot be generalized, as the tested problems are simple and do not require a large number of hidden nodes. The DE algorithm was replaced by the Particle Swarm Optimization (PSO) algorithm in [19], with the input weights and hidden biases bounded in the interval $[-1,1]$. The sigmoid activation function was also employed in this work. We assume that the fitness function was the RMSE on the training set, as the paper does not specify it. The stopping criteria were based on the maximum number of iterations and the rate of change of the RMSE. PSO-ELM was evaluated on the prediction of farm production using a small and simple dataset. The results showed that PSO-ELM could achieve similar accuracy to traditional BP algorithms on simple regression problems.

Mohapatra et al. used an improved cuckoo search-based extreme learning machine (ICSELM) to classify medical datasets [20]. The performance was analyzed and compared with traditional ELM, online sequential extreme learning algorithm (OSELM), and other conventional artificial neural networks. The results showed that ICSELM could outperform the other models. The authors in [21] presented an efficient computer-aided diagnosis (CAD) model to classify mammography images as benign or malignant. The model uses conventional feature extraction techniques to extract relevant features from mammography images. Then, a modified variant of PSO, named MODPSO, is employed to optimize the parameters of the ELM hidden layer by considering the RMSE on the validation set as the fitness function. The effectiveness of the proposed model was assessed across three mammography datasets, where it surpassed other state-of-the-art models. In [22], a hierarchical approach based on the ELM classifier was introduced to detect and classify cyber-attacks effectively. The Harris Hawks Optimizer (HHO) metaheuristic was employed to determine the optimal input feature subset and fine-tune the ELM weights. Two objective functions were used: the first seeks to reduce the number of features and increase classification accuracy, while the second aims to minimize the crossover error rate (CER). This approach led to substantial enhancements in detection rates, particularly on the UNSW-NB15 dataset. Dogan et al. proposed a hybrid methodology that combines the ELM model with GoogLeNet transfer learning to detect dry beans [23]. The research investigated the application of the Salp Swarm Algorithm (SSA) to optimize ELM parameters and enhance the performance of the ELM classifier. The RMSE on the validation set was employed as the fitness function. The SSA-ELM model successfully classified 14 different types of dry beans. The comparative results highlighted the superior performance of this hybrid approach over traditional machine learning algorithms.

Numerous studies have explored this optimization domain, investigating other metaheuristic approaches, such as Firefly Algorithm (FA) [24], Dolphin Swarm Algorithm (DSA) [25], Whale Optimization Algorithm (WOA) [26], Ant Lion Algorithm (ALA) [27], GWO [28], and Harmony Search (HS) [29].

2.2 Image feature extraction based on deep learning models

Feature extraction is an essential phase in the construction of computer vision systems. The main goal of this phase is to preserve only the critical aspects of the input data in robust and discriminative representations. Recently, machine learning algorithms have been widely used in this context as they can efficiently compress the data into a lower dimensional representation without significant loss of information.

Niu and Suen proposed a hybrid method for handwritten digit recognition [30]. This method consists of two classifiers: CNN and Support Vector Machine (SVM). First, CNN is trained on the training set to build the new feature vector that represents the input of the SVM. Then, the SVM classifier will be trained to perform the recognition task. The proposed approach was evaluated on the MNIST dataset. The results showed that this method outperformed previous works.

Feature extraction from complex industry data was studied in [31]. The objective was to estimate the etch rate from OES data of a plasma etching process. Since the input data structure is two-dimensional, the CAE model adapted for visual data was used for feature extraction. Initially, CAE is trained on the OES data. After that, the outputs of the encoder pooling layers will be concatenated and passed to the Support Vector Regression (SVR) model. The SVR is trained in a supervised manner to make the final predictions. The results clearly demonstrated the importance of the feature extraction step.

CNN was also used to extract features from Synthetic Aperture Radar (SAR) images in [13]. The output of the CNN model fed an ELM classifier to perform recognition. The tests were conducted on the MSTAR database. The proposed CNN-ELM model outperforms other traditional methods for SAR image recognition in terms of accuracy and time.

To solve the problem of CNN overfitting during feature extraction while taking advantage of the learning speed of the ELM model, CNN was replaced by wide ResNet (WRN) in [8]. Like the previous contributions, the feature extractor, WRN, in this case, is trained in a supervised manner to construct the feature vector that feeds the ELM model. The proposed DW-ELM approach was evaluated on five benchmark datasets (CIFAR-100, CIFAR-10, STL-10, Flower-102, and Fashion-MNIST). The results showed a significant improvement in the accuracy as well as the training stability of the proposed DW-ELM compared to SVM-based classification approaches.

Pintelas et al. propose a hybrid method between CAE and CNN to classify high-dimensional visual data [32]. This data type usually contains noise and redundant information, so CAE extracts relevant features to feed the CNN model with a high-level feature representation. The proposed model is evaluated on datasets from three application domains: plant diseases, skin cancer, and DeepFake detection. The results proved that this approach represents a helpful attempt to improve the performance of deep learning models.

In [33], a methodology employing exemplary pyramid-based deep feature extraction was introduced to detect cervical cancer. The main objective was classifying cervical cells in pap-smear images to identify cancerous cases. Initially, this approach utilizes transfer learning-based feature extraction with DarkNet19 or DarkNet53 networks in an exemplary pyramid structure to generate 21,000 features. Subsequently, the Neighborhood Component Analysis (NCA) technique selects the 1,000 most informative and weighted features. These selected features are then used for classification via the SVM algorithm. The effectiveness of this model was validated successfully using the SIPaKMeD and Mendeley LBC datasets.

Jiang et al. proposed a new image classification architecture named CofaNet [34]. This architecture combines CNN and transformer-based fused attention to overcome the limitations of transformers in capturing local context. Experimental results on three benchmark datasets demonstrated CofaNet’s excellent performance compared with some transformer-based networks.

2.3 Image quality effect on classification task

Due to the importance of the image classification problem in computer vision, it has been widely studied in machine learning. However, most works focus on high-quality datasets, neglecting that real-world image quality varies significantly depending on the capture devices used and the lighting conditions. Few works have investigated the presence of noise in training and/or test data.

Dodge and Karam showed that deep learning networks are sensitive to blur and noise [35]. Four state-of-the-art CNN architectures were studied, where each model was trained on the original images and tested on the same images in their original state, with noise, and with blur. The results showed that image classification performance degrades under noise or blur. The authors proposed training the models on low-quality images, but did not study the effect of noise and blur in the training set.

Paranhos da Costa et al. considered the case where the training set contains low-quality images [9]. The principle was to extract features from the input images by two hand-crafted feature extraction methods (LBP and HOG) and then pass them to the SVM classifier. Several versions of the training and testing set were created (the original version without noise and versions with only one type of noise). Each model is trained on one of the training set versions and tested on each test set version. The paper also analyses the effects of denoising techniques by restoring noisy images. The results show that noise makes the classification problem more difficult since none of the classifiers was able to exceed the performance of the classifier trained and tested on the original dataset, even when applying denoising techniques. However, this study only considered conventional hand-crafted feature extraction methods.

Following the same methodology as in [9], Nazaré et al. evaluated the performance of CNN models [10]. Both Gaussian and salt & pepper noise were used with different degrees of noise. Denoising methods were applied for each dataset version. As a result, 21 classifiers were created. The results indicate that although injecting noise into the training data makes classification more complicated, it can be beneficial for applications where image quality is subject to variation after deployment. On the other hand, applying denoising methods does not guarantee an improvement and can even degrade performance.

Hossain et al. address the vulnerability of CNNs to image distortion [36]. They note that even minor levels of noise or blur can significantly impact CNN performance. To overcome these limitations, the authors propose a distortion-robust module called DCT-Net. This module integrates the Discrete Cosine Transform (DCT) into a deep network built on top of the VGG16 architecture. This approach effectively handles diverse types of distortion, generalizes well to unseen distortions, and performs competently on benchmark datasets.

Researchers in [37] also addressed the problem of the sensitivity of deep neural networks to image quality variations during testing. They introduced an approach that uses discrete vector quantization (VQ) to create a robust, quality-independent representation. This model demonstrated substantial progress by applying self-attention, achieving a new state-of-the-art result on ImageNet-C and a notable improvement in accuracy on other benchmark datasets.

3. Background

To make this work easier to follow, this section provides a brief theoretical overview of the techniques used.

3.1 Extreme Learning Machine

The Extreme Learning Machine (ELM) was first proposed in 2004 by Huang et al. [1]. ELM belongs to the family of single hidden-layer feedforward networks (SLFN). Unlike conventional neural networks based on the backpropagation (BP) algorithm, ELM uses the Moore-Penrose (MP) generalized inverse to compute the output weights analytically. This significantly reduces the training time.

For a training set $\{(x_{i},t_{i})\}_{i=1}^{N}$ with $N$ distinct instances, where $x_{i}=[x_{i,1},x_{i,2},\ldots,x_{i,n}]^{T}$ is the vector of input attributes and $t_{i}=[t_{i,1},t_{i,2},\ldots,t_{i,m}]^{T}$ is the vector of output training values of $m$ classes, a basic SLFN with $\check{N}$ hidden nodes and a single hidden layer can be modeled mathematically as follows:

$\displaystyle\sum_{i=1}^{\check{N}}\beta_{i}f_{i}(x_{j})=\sum_{i=1}^{\check{N}}\beta_{i}f_{i}(a_{i}\cdot x_{j}+b_{i})$ (1)

where $a_{i}=[a_{i,1},a_{i,2},\ldots,a_{i,n}]^{T}$ is the weight vector connecting the $i^{th}$ hidden node with the input nodes, and $\beta_{i}=[\beta_{i,1},\beta_{i,2},\ldots,\beta_{i,m}]^{T}$ is the weight vector connecting the $i^{th}$ hidden node with the output nodes. $b_{i}$ represents the bias of the $i^{th}$ hidden node, and $f$ is an activation function. The symbol “$\cdot$” denotes the scalar product.

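Requiring the network output in Eq. (1) to equal the target $t_{j}$ for every training instance $j=1,\ldots,N$, the resulting $N$ equations can be written compactly in the form: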

$\displaystyle H\cdot\beta=T$ (2)

$H$ represents the hidden layer output matrix of the neural network, $\beta$ the matrix of output weights, and $T$ the matrix of targets. In the ELM algorithm, the input weights and hidden layer biases are randomly initialized, and then the hidden layer output matrix $H$ is determined. Training therefore reduces to computing the output weight matrix as the least-squares solution of this linear system:

$\displaystyle\check{\beta}=H^{+}\cdot T$ (3)

where $H^{+}$ is the Moore-Penrose generalized inverse of matrix $H$ .

The steps of the training procedure of an ELM are outlined in Algorithm 1.

Algorithm 1: Learning procedure of ELM
Input
Training set, test set, activation function, number of hidden nodes
Steps
1. Randomly initialize the input weights and biases
2. Calculate the hidden layer output matrix $H$ on the training set
3. Calculate the output weight matrix $\check{\beta}$ using Eq. (3)
4. Use $\check{\beta}$ to make predictions on the test set
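
The steps of Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `elm_train` and `elm_predict`, the sigmoid activation, the uniform $[-1,1]$ initialization range, and the toy two-class data are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_train(X, T, n_hidden, rng, W=None, b=None):
    """Train a basic ELM and return (W, b, beta).

    X: (N, n) inputs; T: (N, m) one-hot targets.
    W: (n, n_hidden) input weights and b: (n_hidden,) biases.
    If not supplied, they are drawn uniformly from [-1, 1] (assumed range).
    """
    if W is None:
        W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))
    if b is None:
        b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = sigmoid(X @ W + b)           # step 2: hidden layer output matrix
    beta = np.linalg.pinv(H) @ T     # step 3: Moore-Penrose solution, Eq. (3)
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Step 4: predicted class index for each row of X."""
    return np.argmax(sigmoid(X @ W + b) @ beta, axis=1)

# Toy usage on synthetic, linearly separable data (hypothetical example).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
T = np.eye(2)[y]                     # one-hot targets
W, b, beta = elm_train(X, T, n_hidden=50, rng=rng)
train_acc = float(np.mean(elm_predict(X, W, b, beta) == y))
print(f"training accuracy: {train_acc:.3f}")
```

Accepting `W` and `b` as optional arguments is a design choice for this sketch: it lets an external optimizer supply candidate initial values instead of the random draw.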

3.2 Convolutional neural network

The convolutional neural network (CNN) is a feedforward neural network adapted to grid-like topologies like images. Inspired by Hubel and Wiesel’s 1962 work on simple and complex cells, Fukushima [38] developed the first model under the name “neocognitron”. Other architectures were later proposed, such as LeNet-5 [39], AlexNet [40], VGGNet [17], GoogLeNet [41], and ResNet [42].

Rather than having only the classification part like a multi-layer perceptron (MLP) model, a CNN model has two parts: feature extraction and classification, as shown in Fig. 1. The feature extraction part compresses the images by retrieving significant features. And the classification part combines the obtained features to classify the images.

Figure 1.

CNN architecture, adapted from [43].

The feature extraction part is mainly composed of two components: the convolution layer and the pooling layer. The convolution layer consists of a set of filters (kernels). The output of a convolution layer is calculated by a dot product between each local patch of the input matrix and the kernel; this is known as the convolution operation. This operation can be represented mathematically as follows:

$\displaystyle z_{i,j,k}^{[l]}=w_{k}^{[l]}x_{i,j}^{[l]}+b_{k}^{[l]}$ (4)

where, for a layer $l$, $x_{i,j}$ is the input and $z_{i,j,k}$ is the output value of the $k^{th}$ kernel at location $(i,j)$. The filter weights are denoted by $w$, and $b$ is the bias. Next, we apply the activation function $g$ to $z_{i,j,k}^{[l]}$:

$\displaystyle g_{i,j,k}^{[l]}=g\left(z_{i,j,k}^{[l]}\right)$ (5)

Usually, a convolutional layer is followed by a pooling layer that applies a statistical measure such as the maximum, mean, or median over local regions. This reduces the output resolution of the convolutional layer:

$\displaystyle y_{i,j,k}^{[l]}=\textit{pooling}\left(g_{i,j,k}^{[l]}\right)$ (6)
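
To make Eqs. (4) to (6) concrete, the following sketch applies one kernel, an activation, and a pooling step to a small single-channel input. The choices of ReLU as the activation, 2×2 max pooling, "valid" borders, and the toy input are assumptions for illustration only; the helper names are hypothetical.

```python
import numpy as np

def conv2d_single(x, w, b):
    """Eq. (4) for one kernel: slide w over x (no padding, stride 1) and add the bias."""
    kh, kw = w.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    z = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Dot product between the local patch at (i, j) and the kernel.
            z[i, j] = np.sum(w * x[i:i + kh, j:j + kw]) + b
    return z

def relu(z):
    """Eq. (5) with g chosen as ReLU (an assumption; any activation could be used)."""
    return np.maximum(z, 0.0)

def max_pool(g, size=2):
    """Eq. (6) with non-overlapping max pooling; trailing rows/columns are cropped."""
    h, w = g.shape[0] // size, g.shape[1] // size
    g = g[:h * size, :w * size]
    return g.reshape(h, size, w, size).max(axis=(1, 3))

x = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 "image"
w = np.array([[-1.0, 0.0], [0.0, 1.0]])        # toy 2x2 kernel
z = conv2d_single(x, w, b=0.5)                 # shape (5, 5)
y = max_pool(relu(z))                          # shape (2, 2)
print(z.shape, y.shape)
```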

The classification part consists of one or more fully connected (FC) layers and a softmax output layer used for classification. The main idea is to take the outputs of the previous part and combine them to determine the output label.

3.3 Convolutional Autoencoder

Convolutional Autoencoder (CAE) [16] is a variant of the AE network adapted for 2D image structure as it is composed of convolutional layers. CAEs are mainly used to reconstruct input images by compressing them and removing noise while keeping the most significant features. They can also be seen as unsupervised dimensionality reduction models.

CAEs are composed of two parts based on convolutional layers: the encoder and the decoder. The encoder extracts the features of the input image into a lower-dimensional representation. In contrast, the decoder reconstructs the image from this compressed representation by generating an output image similar to the input image. Figure 2 illustrates the basic architecture of the CAE model.

Figure 2.

Basic architecture of CAE model, adapted from [32].

Besides the convolutional and pooling layers explained before, CAEs contain two other types: deconvolutional and unpooling layers. The deconvolutional and unpooling layers perform the opposite operations of the convolutional and pooling layers, respectively, to allow the reconstruction of the original size of each sub-region. These two types of layers are only found at the decoder level. Note that the deconvolution operation is, in fact, a transposed convolution.

3.4 Single and multi-objective optimization

3.4.1 Single-objective optimization

A single objective optimization problem (SOOP) has a single objective function $f(X)$ , where the goal is to minimize or maximize it. Mathematically, a minimization problem can be posed as follows [44]:

$\displaystyle X^{*}=\underset{\forall X\in S}{\textit{arg}\ \textit{min}}f(X)$ (7)

where $X=(X_{1},X_{2},X_{3},\ldots,X_{N})$ is an $N$-dimensional vector of variables and $S$ represents an arbitrary search space. $X^{*}$ is the solution considered as the minimum of the objective function, i.e., $f(X^{*})\leqslant f(X)$ for all $X\in S$.

3.4.2 Multi-objective optimization

The concept of Pareto optimality is introduced when two or more objective functions need to be optimized. The goal in a multi-objective optimization problem (MOOP) is to find the global minimum/maximum $X^{*}$ that minimizes/maximizes a set of $M$ functions [44]:

$\displaystyle X^{*}=\underset{\forall X\in S}{\textit{arg}\ \textit{min}}f(X)=\underset{\forall X\in S}{\textit{arg}\ \textit{min}}\{f_{1}(X),f_{2}(X),\ldots,f_{M}(X)\}$ (8)
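
Since several objectives rarely share a single minimizer, Pareto optimality is usually expressed through a dominance test. The sketch below shows one common formulation for minimization; the helper names and the candidate objective values are made up for illustration and are not taken from the paper.

```python
def dominates(u, v):
    """True if objective vector u Pareto-dominates v (all objectives minimized):
    u is no worse in every objective and strictly better in at least one."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (f1, f2) values for five candidate solutions.
candidates = [(0.10, 0.30), (0.20, 0.20), (0.30, 0.10), (0.25, 0.25), (0.30, 0.30)]
front = pareto_front(candidates)
print(front)
```

Here the first three candidates trade one objective against the other, so none of them dominates another, while the last two are dominated by `(0.20, 0.20)`.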

4. Grey wolf optimizer-multi-verse optimizer

Metaheuristics have proven to be very effective in solving continuous optimization problems. Grey Wolf Optimizer (GWO) [45] is a swarm intelligence algorithm designed for continuous optimization problems and inspired by the social behavior of wolves in the wild. Because it is simple to implement and requires few parameter settings, it has been applied with success to several complex optimization problems. However, it is prone to stagnation in local optima. In [15], we proposed a hybrid search method based on GWO and the Multi-Verse Optimizer (MVO) [46] to ensure a good balance between exploration and exploitation. MVO is a metaheuristic inspired by the multiverse theory in astrophysics, which explains how big bangs create multiple universes and how these universes interact with each other. Motivated by the results obtained in estimating the dropout probability in a deep neural network, we investigate the performance of GWO-MVO for image classification. In the following, we provide the inspiration and mathematical modeling of the proposed GWO-MVO.

4.1 Inspiration

Like GWO, the GWO-MVO algorithm is mainly inspired by grey wolves’ leadership hierarchy and hunting behavior. Grey wolves often prefer to live in packs with a strict social hierarchy. Four hierarchical levels are established to keep discipline within the group. The pack is led by the leader wolves called alpha ( $\alpha$ ). The alphas are responsible for making the major decisions. In the second level, we find the beta ( $\beta$ ) wolves. Betas not only communicate important choices to the other wolves but also help the alphas in their duties. The third-level wolves are called delta ( $\delta$ ). Deltas can be scouts, sentinels, elders, hunters, or guardians. The wolves of the last level are called omega ( $\omega$ ). These wolves are the last ones allowed to eat prey. The second interesting social behavior of grey wolves is hunting in groups. Three main stages describe hunting: tracking, encircling, and attacking the prey.

4.2 Mathematical modeling

4.2.1 Social hierarchy

Wolves are ranked in the population according to their fitness value. The first, second, and third best are alpha, beta, and delta, respectively, while the remaining wolves are considered $\omega$. The search for prey is guided by $\alpha$, $\beta$, and $\delta$.

4.2.2 Encircling prey

The encircling strategy around the prey can be modeled mathematically by the following equations:

$\displaystyle X_{(t+1)}=X_{p,t}-A\cdot D$ (9)

$\displaystyle D=\lvert C\cdot X_{p,t}-X_{t}\rvert$ (10)

$\displaystyle A=2\cdot a\cdot r_{1}-a$ (11)

$\displaystyle C=2\cdot r_{2}$ (12)

where $t$ indicates the $t^{th}$ iteration, $X$ denotes the position vector of a grey wolf, and $X_{p,t}$ represents the location of the prey at the $t^{th}$ iteration. $A$ and $C$ are coefficient vectors, and $r_{1}$ and $r_{2}$ are random vectors drawn from a uniform distribution on $[0,1]$. $a$ is a vector that decreases from 2 to 0 as the iterations proceed. It is an important control parameter of the convergence speed and can be considered the key to a good balance between exploration and exploitation. Unlike the basic GWO algorithm, the parameter $a$ decreases in a non-linear way in our proposed GWO-MVO algorithm. The decrease equation is inspired by the calculation of the TDR (Travelling_distance_rate) parameter of MVO:

$\displaystyle a=2\times\left(1-\frac{t^{\frac{1}{p_{1}}}}{T^{\frac{1}{p_{1}}}}\right)$ (13)

where $t$ indicates the current iteration, and $T$ is the maximum number of iterations. $p_{1}$ is the degree of exploitation over iterations. The higher $p_{1}$ is, the faster and more accurate the local exploitation/search is. Therefore, $p_{1}$ must be small to improve exploration and reduce the speed of convergence.
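
The effect of $p_{1}$ in Eq. (13) is easiest to see by tabulating $a$ at a few iterations. The short sketch below does this; the function name `control_parameter` and the values of $T$ and $p_{1}$ are illustrative assumptions, not settings reported by the authors.

```python
def control_parameter(t, T, p1):
    """Eq. (13): a = 2 * (1 - t^(1/p1) / T^(1/p1))."""
    return 2.0 * (1.0 - (t ** (1.0 / p1)) / (T ** (1.0 / p1)))

T = 100
for p1 in (0.5, 1.0, 2.0):
    schedule = [round(control_parameter(t, T, p1), 3) for t in (0, 25, 50, 75, 100)]
    print(f"p1={p1}: {schedule}")
```

With $p_{1}=1$ the schedule is linear in $t$. A larger $p_{1}$ makes $a$ drop sooner (earlier exploitation), while a smaller $p_{1}$ keeps $a$ large for longer (more exploration), which matches the description above.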

4.2.3 Hunting

Assuming that the alpha, beta, and delta leaders have good knowledge of the prey, these three wolves guide the search mechanism. Thus, each wolf updates its position according to the following equations:

$\displaystyle X_{\alpha}=X_{\alpha}-A_{\alpha}\cdot D_{\alpha}$ (14)

$\displaystyle X_{\beta}=X_{\beta}-A_{\beta}\cdot D_{\beta}$ (15)

$\displaystyle X_{\delta}=X_{\delta}-A_{\delta}\cdot D_{\delta}$ (16)

$\displaystyle X_{(t+1)}=\frac{X_{\alpha}+X_{\beta}+X_{\delta}}{3}$ (17)
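
A single position update following Eqs. (14) to (17) can be sketched as below. The text does not spell out $D_{\alpha}$, $D_{\beta}$, and $D_{\delta}$, so this sketch assumes the usual GWO convention of applying Eqs. (10) to (12) with each leader in the role of the prey and with fresh random vectors per leader; the function name `hunting_step` and the example positions are hypothetical.

```python
import numpy as np

def hunting_step(X, X_alpha, X_beta, X_delta, a, rng):
    """Return the new position of wolf X, guided by the three leaders."""
    candidates = []
    for leader in (X_alpha, X_beta, X_delta):
        r1 = rng.uniform(0.0, 1.0, size=X.shape)
        r2 = rng.uniform(0.0, 1.0, size=X.shape)
        A = 2.0 * a * r1 - a                 # Eq. (11)
        C = 2.0 * r2                         # Eq. (12)
        D = np.abs(C * leader - X)           # Eq. (10), leader used as prey estimate (assumption)
        candidates.append(leader - A * D)    # Eqs. (14)-(16)
    return np.mean(candidates, axis=0)       # Eq. (17)

rng = np.random.default_rng(0)
X = np.array([0.2, -0.4, 0.9])
X_alpha = np.array([0.1, 0.1, 0.1])
X_beta = np.array([0.3, 0.0, -0.3])
X_delta = np.array([-0.2, 0.5, 0.2])
X_next = hunting_step(X, X_alpha, X_beta, X_delta, a=1.0, rng=rng)
print(X_next)
```

When $a=0$ the coefficient $A$ vanishes and the update collapses to the plain average of the three leaders, which is a convenient sanity check.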

4.2.4 Reinforcing the exploitation

To further enhance the exploitation/local search process around the best individual $\alpha$ , the worst wolf position can be replaced by a new position close to $\alpha$ , as described in Eq. (18). This equation is inspired by the MVO algorithm.

$\displaystyle X_{\textit{worst}}=\begin{cases}X_{\textit{new}}&\text{if }f^{*}(X_{\textit{new}})<f^{*}(X_{\textit{worst}})\\ X_{\textit{worst}}&\text{otherwise}\end{cases}$ (18)

$\displaystyle X_{\textit{new}}=X_{\alpha}+\textit{TDR}\times((ub-lb)\times r_{3}+lb)$ (19)

$\displaystyle\textit{TDR}=1-\frac{t^{\frac{1}{p_{2}}}}{T^{\frac{1}{p_{2}}}}$ (20)

where $f^{*}$ represents the fitness function, and $t$ and $T$ indicate the current and maximum number of iterations, respectively. $p_{2}$ is a control parameter of the exploitation. The higher $p_{2}$ is, the faster and more accurate the local exploitation/search is. Therefore, $p_{2}$ must be large to improve exploitation around the best agent. $X_{\alpha}$ and $X_{\textit{worst}}$ indicate the positions of the alpha wolf and the worst wolf, respectively. $r_{3}\in[-1,1]$ is a random number drawn from a uniform distribution. $lb$ and $ub$ are the lower and upper bounds, respectively.
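
The replacement rule of Eqs. (18) to (20) can be sketched as follows for a minimization problem. The function names, the sphere function used as a stand-in fitness, and the parameter values are assumptions for illustration; $r_{3}$ is drawn as a single scalar, following the description above.

```python
import numpy as np

def tdr(t, T, p2):
    """Eq. (20): travelling distance rate, decreasing from 1 to 0."""
    return 1.0 - (t ** (1.0 / p2)) / (T ** (1.0 / p2))

def reinforce_exploitation(wolves, fitness, f, t, T, p2, lb, ub, rng):
    """Try to replace the worst wolf with a point near alpha; arrays are updated in place."""
    alpha, worst = np.argmin(fitness), np.argmax(fitness)
    r3 = rng.uniform(-1.0, 1.0)
    x_new = wolves[alpha] + tdr(t, T, p2) * ((ub - lb) * r3 + lb)   # Eq. (19)
    f_new = f(x_new)
    if f_new < fitness[worst]:                                      # Eq. (18)
        wolves[worst], fitness[worst] = x_new, f_new
    return wolves, fitness

def sphere(x):
    """Stand-in fitness function (not from the paper)."""
    return float(np.sum(x ** 2))

rng = np.random.default_rng(0)
wolves = np.array([[0.1, 0.1], [0.5, -0.5], [0.9, 0.9]])
fitness = np.array([sphere(w) for w in wolves])
worst_before = fitness.max()
wolves, fitness = reinforce_exploitation(
    wolves, fitness, sphere, t=50, T=100, p2=6.0, lb=-1.0, ub=1.0, rng=rng
)
print(worst_before, fitness.max())
```

Because the worst wolf is only overwritten when the new point is strictly better, the worst fitness in the pack can never increase through this step.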

5. Design of the proposed approach

In image classification, generating relevant features that enable the classifier to identify different classes is crucial. In this work, we propose a method that leverages the strengths of two models: CNN and CAE. Our approach uses the extracted features from both models to feed the ELM classifier.

First, the CNN and CAE models are trained on the training set for a few epochs. The CNN model performs feature extraction through its designated feature extraction part, while the CAE model employs its encoder to generate feature vectors. These two sets of feature vectors, derived from the CNN and CAE models, are combined to form the input vector for the ELM classifier.
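
One simple way to combine the two feature sets is to flatten each model's per-image output and concatenate the results. The paper does not state the exact combination operator at this point, so plain concatenation, the helper name `combine_features`, and the array shapes below are assumptions for illustration.

```python
import numpy as np

def combine_features(cnn_features, cae_features):
    """Flatten per-image CNN and CAE feature maps and concatenate them column-wise."""
    n_images = cnn_features.shape[0]
    return np.concatenate(
        [cnn_features.reshape(n_images, -1), cae_features.reshape(n_images, -1)],
        axis=1,
    )

# Hypothetical feature maps for a batch of 8 images.
cnn_feats = np.zeros((8, 4, 4, 16))   # output of the CNN feature extraction part
cae_feats = np.ones((8, 2, 2, 32))    # output of the CAE encoder
elm_input = combine_features(cnn_feats, cae_feats)
print(elm_input.shape)
```

Each row of `elm_input` is then one input vector for the ELM classifier, with the CNN features first and the CAE features after them.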

Once the ELM input vector is defined, the GWO-MVO $+$ ELM method determines the optimal initial values for the model weights and biases. This optimization technique improves the performance of the ELM classifier by finding the most suitable parameter values. The overall approach is outlined in Fig. 3.

Figure 3.

Flowchart of the proposed approach.

6. The proposed GWO-MVO method for ELM initialization

In this section, we present and explain in depth the GWO-MVO method used in initializing the weights and biases of the ELM classifier.

6.1 Solution representation and initialization

A solution is defined by a wolf’s position, representing a vector containing possible values of weight $w$ and bias $b$ . The position of an agent has $m$ dimensions:

$\displaystyle m=n\times k+n$ (21)

where $n$ is the hidden size of the ELM model, and $k$ is the size of the input features.

The position of an agent $t$ can be represented as follows:

$\displaystyle X^{t}=\langle\underbrace{x_{1}^{t},x_{2}^{t},\ldots,x_{\textit{nk}}^{t}}_{\text{Weights}},\ \underbrace{x_{(\textit{nk}+1)}^{t},\ldots,x_{m}^{t}}_{\text{Bias}}\rangle$ (22)

The initial weights and biases of all individuals are randomly generated in the interval $[-1,1]$ .
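The encoding of Eqs. (21) and (22) can be sketched as follows. The helper names are hypothetical, and the row-major reshaping of the first $n\times k$ entries into a weight matrix is an assumption, since the text fixes only the order "weights, then biases".

```python
import numpy as np


def position_size(n_hidden, n_inputs):
    """Eq. (21): m = n * k + n."""
    return n_hidden * n_inputs + n_hidden


def decode_position(position, n_hidden, n_inputs):
    """Split a flat position vector into input weights and hidden biases."""
    position = np.asarray(position, dtype=float)
    if position.size != position_size(n_hidden, n_inputs):
        raise ValueError("position length does not match n * k + n")
    split = n_hidden * n_inputs
    weights = position[:split].reshape(n_hidden, n_inputs)  # assumed row-major
    biases = position[split:]
    return weights, biases


def random_population(n_agents, n_hidden, n_inputs, rng):
    """Draw every coordinate uniformly from [-1, 1]."""
    m = position_size(n_hidden, n_inputs)
    return rng.uniform(-1.0, 1.0, size=(n_agents, m))
```

For example, with 100 hidden nodes and the 2305-dimensional VGG $+$ CAE vector, each wolf has $100\times 2305+100=230600$ dimensions.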

6.2 Fitness functions

The choice of the fitness function is crucial for any metaheuristic, as it guides the search. For this reason, we have defined four fitness functions that will be used separately in the solution evaluation of the two approaches described in the previous section. The evaluation of each solution must first go through the ELM learning procedure.

According to Bartlett’s theory [47], the smaller the norm of the weights, the better the generalization performance of the network. For this reason, we aim to minimize the norm of the weight matrix. Our first fitness function is defined as follows:

$\displaystyle F_{1}=\parallel\check{\beta}\parallel$ (23)

Most previous works have assumed that it is not vital to evaluate the training set since the ELM training procedure will indeed reduce the training error. However, the choice of initial weights and biases can easily affect the training error. Therefore, our second evaluation function is defined by the classification error rate on the training set.

$\displaystyle F_{2}=\textit{error}\_\textit{rate}_{\textit{training}\_\textit{set}}$ (24)

$\displaystyle\textit{error}\_\textit{rate}=1-\textit{accuracy}=1-\frac{\textit{tp}+\textit{tn}}{\textit{tp}+\textit{fp}+\textit{tn}+\textit{fn}}$ (25)

where tp, tn, fp, and fn are the numbers of true positives, true negatives, false positives, and false negatives, respectively.
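To make the first two fitness functions concrete, the following sketch evaluates $F_{1}$ and $F_{2}$ for one candidate solution. It assumes the standard ELM training step (ReLU hidden layer, as listed in Table 2, with output weights obtained from the Moore-Penrose pseudo-inverse) and the Frobenius norm for Eq. (23); the function names and the choice of norm are assumptions, not details confirmed by the text.

```python
import numpy as np


def hidden_output(X, weights, biases):
    """Hidden-layer activations with ReLU."""
    return np.maximum(X @ weights.T + biases, 0.0)


def train_elm(X, y, weights, biases, n_classes):
    """Compute output weights from one-hot targets via the pseudo-inverse."""
    H = hidden_output(X, weights, biases)
    targets = np.eye(n_classes)[y]  # one-hot encoding of integer labels
    return np.linalg.pinv(H) @ targets


def predict(X, weights, biases, beta):
    return np.argmax(hidden_output(X, weights, biases) @ beta, axis=1)


def error_rate(y_true, y_pred):
    """1 - accuracy, i.e. the share of misclassified samples (Eq. (25))."""
    return float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))


def fitness_f1(X, y, weights, biases, n_classes):
    """Eq. (23): norm of the output weights (Frobenius norm assumed)."""
    beta = train_elm(X, y, weights, biases, n_classes)
    return float(np.linalg.norm(beta))


def fitness_f2(X, y, weights, biases, n_classes):
    """Eq. (24): error rate on the training set."""
    beta = train_elm(X, y, weights, biases, n_classes)
    return error_rate(y, predict(X, weights, biases, beta))
```

In the multi-class setting, counting misclassified samples gives the same value as $1-\textit{accuracy}$, which is why `error_rate` does not track tp, tn, fp, and fn separately.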

For better generalization, our third function is defined by the classification error rate on the validation set. We used 5-fold cross-validation to save time. Five-fold cross-validation has proven to be useful in many models [48]. The training set is randomly and equally divided into five parts. At each step, one of the partitions is chosen for validation, and the four remaining are used for training. The procedure is repeated five times so that each partition is used exactly once as a validation set. The fitness function is calculated by the average error rate over the five validation sets, as shown in the following formula:

$\displaystyle F_{3}=\frac{1}{5}\sum_{i=1}^{5}\textit{error}\_\textit{rate}_{% \textit{validation}\_\textit{set}_{i}}$ (26)

In order to prevent the overfitting problem, the fourth fitness function aims at optimizing the error rate on both the training and validation sets at once. This is what we call a multi-objective optimization problem. The weighted sum method combines the two objective functions into a single function. We also used the 5-fold cross-validation method. The final fitness function is calculated using the formula (27).

$\displaystyle F_{4}=w_{1}F+w_{2}F^{\prime}$ (27)

where

• $w_{1}=w_{2}=\frac{1}{2}$ ,

• $F=\frac{1}{5}\sum_{i=1}^{5}\textit{error}\_\textit{rate}_{\textit{training}\_\textit{set}_{i}}$ , and

• $F^{\prime}=\frac{1}{5}\sum_{i=1}^{5}\textit{error}\_\textit{rate}_{\textit{validation}\_\textit{set}_{i}}$ .
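The cross-validated fitness functions of Eqs. (26) and (27) can be sketched as below. The callable `evaluate` is a hypothetical placeholder that trains one ELM on the training folds and returns the pair (training error, validation error); the fold-splitting helper and all function names are likewise assumptions made for illustration.

```python
import numpy as np


def k_fold_indices(n_samples, k, rng):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    return np.array_split(rng.permutation(n_samples), k)


def cross_validated_errors(X, y, evaluate, k=5, rng=None):
    """Return the mean training and validation error rates over k folds."""
    rng = np.random.default_rng() if rng is None else rng
    folds = k_fold_indices(len(X), k, rng)
    train_errors, val_errors = [], []
    for i, val_idx in enumerate(folds):
        # The k - 1 remaining folds form the training part.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        tr_err, va_err = evaluate(X[train_idx], y[train_idx], X[val_idx], y[val_idx])
        train_errors.append(tr_err)
        val_errors.append(va_err)
    return float(np.mean(train_errors)), float(np.mean(val_errors))


def fitness_f3(X, y, evaluate, rng=None):
    """Eq. (26): mean validation error over five folds."""
    _, val_error = cross_validated_errors(X, y, evaluate, k=5, rng=rng)
    return val_error


def fitness_f4(X, y, evaluate, w1=0.5, w2=0.5, rng=None):
    """Eq. (27): weighted sum of mean training and validation errors."""
    train_error, val_error = cross_validated_errors(X, y, evaluate, k=5, rng=rng)
    return w1 * train_error + w2 * val_error
```

Because each fitness evaluation trains five models instead of one, $F_{3}$ and $F_{4}$ cost several times more than $F_{2}$, which is consistent with the training times reported in Table 8.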

6.3 The overall GWO-MVO $+$ ELM algorithm

The GWO-MVO algorithm starts by initializing a population of candidate solutions and then iteratively searches for a good solution. During the search, the algorithm mimics the search mechanism of grey wolves to evolve the population across generations. An ELM classifier is built for each solution to evaluate its fitness. The overall procedure is repeated for a specified number of iterations, fixed empirically. The complete GWO-MVO $+$ ELM method is given in Algorithm 2.

Algorithm 2: GWO-MVO for ELM initialization
Initialization
    Initialize the grey wolf population $X_{i}\ (i=1,2,\ldots,n)$
    Initialize parameters: $T$ , $p_{1}$ and $p_{2}$
Evaluation
    Evaluate the wolves by training ELM models
    Select the leaders $X_{\alpha}$ , $X_{\beta}$ and $X_{\delta}$ of the wolf pack
while $t<T$ do
    Update $a$ using Eq. (13)
    for each search agent $i$ do
        Update the agent position using Eq. (17)
        Evaluate the fitness by training an ELM model
    end for
    Update the leaders $X_{\alpha}$ , $X_{\beta}$ and $X_{\delta}$ of the wolf pack
    Update TDR using Eq. (20)
    Replace the worst wolf using Eq. (18)
    Update the leaders $X_{\alpha}$ , $X_{\beta}$ and $X_{\delta}$ of the wolf pack
    $t=t+1$
end while
Return $X_{\alpha}$
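The control flow of Algorithm 2 can be sketched as follows. This is an illustrative skeleton under explicit assumptions, not the authors' code: Eqs. (13) and (17) are not reproduced in this section, so the textbook GWO rules (linear decay of $a$ and the average of three leader-guided moves) stand in for them, the parameter $p_{1}$ is therefore omitted, and clipping updated positions to $[lb, ub]$ is an added assumption. The function name `gwo_mvo` is hypothetical, and fitness is assumed to be minimized.

```python
import numpy as np


def gwo_mvo(fitness, dim, n_agents=7, T=20, p2=10, lb=-1.0, ub=1.0, seed=0):
    """Skeleton of Algorithm 2 with textbook GWO updates as stand-ins."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(lb, ub, size=(n_agents, dim))
    f = np.array([fitness(x) for x in X])

    def select_leaders():
        # Alpha, beta, delta: the three best wolves of the current pack.
        return X[np.argsort(f)[:3]].copy()

    leaders = select_leaders()
    for t in range(T):
        a = 2.0 - 2.0 * t / T  # assumed linear decay, not the paper's Eq. (13)
        for i in range(n_agents):
            moves = []
            for leader in leaders:
                A = 2.0 * a * rng.random(dim) - a
                C = 2.0 * rng.random(dim)
                D = np.abs(C * leader - X[i])
                moves.append(leader - A * D)
            # Textbook GWO update, standing in for Eq. (17); clipping is assumed.
            X[i] = np.clip(np.mean(moves, axis=0), lb, ub)
            f[i] = fitness(X[i])
        leaders = select_leaders()
        # Worst-wolf replacement, Eqs. (18)-(20).
        tdr = 1.0 - (t ** (1.0 / p2)) / (T ** (1.0 / p2))
        r3 = rng.uniform(-1.0, 1.0, size=dim)
        x_new = leaders[0] + tdr * ((ub - lb) * r3 + lb)
        f_new = fitness(x_new)
        worst = int(np.argmax(f))
        if f_new < f[worst]:
            X[worst] = x_new
            f[worst] = f_new
        leaders = select_leaders()
    best = int(np.argmin(f))
    return X[best].copy(), float(f[best])
```

In the ELM setting, `fitness` would decode a position into weights and biases, train an ELM, and return one of $F_{1}$ to $F_{4}$; any cheap test function can be substituted to check the loop itself.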

7. Experiments

In this section, we first introduce our test environment. Then, we present the datasets used for the evaluation and the experimental setup. Finally, we discuss the different results obtained.

7.1 Software and hardware setup

We ran all our tests on a desktop computer with the following characteristics: an AMD RYZEN 7 3700X CPU with 16 GB of RAM, a single GPU (RTX 2060 super), and an Ubuntu 18.04 operating system.

All our programs were developed in Python using the Keras API and the basic data science libraries (scikit-learn, NumPy, SciPy, and pandas).

7.2 Experimental data

Table 1
Dataset description

Dataset	#Training set	#Testing set	Image size	Color	#Classes
CIFAR-10	50000	10000	$32\times 32$	RGB	10
CIFAR-100	50000	10000	$32\times 32$	RGB	100

Table 2

Parameter settings

Method	Parameter	Value
ELM	Activation function	ReLU
	#hidden nodes	100, 200, 300, 400, 500, and 1000
GWO-MVO	#iterations	20
	#agents	7
	$p_{1}$	2
	$p_{2}$	10
GWO	#iterations	20
	#agents	7
PSO	#iterations	20
	$c_{1}$	2
	$c_{2}$	2
	$w_{\textit{min}}$	0.2
	$w_{\textit{max}}$	0.9
KNN	#neighbors	5
	Metric	Euclidean
SVM	Kernel	RBF
	Gamma	0.001
	C	100

Figure 4.

Examples of original and noisy images for each training set. The first line shows the original image and the second represents images with Gaussian noise ( $\sigma=50$ ).

Figure 5.

CAE architecture.

Figure 6.

Comparative performance of the ELM model initialized by GWO-MVO using 100 hidden nodes based on different input types on the CIFAR-10 dataset (both test versions).

All the proposed approaches have been evaluated on the image classification task. Therefore, we used two benchmark datasets: CIFAR-10 and CIFAR-100. They are popular benchmark datasets for image classification because they are well-known, standardized, and provide a good range of complexity to test. Besides, the datasets are large enough to provide a meaningful challenge but not so large that they require extensive computational resources. Table 1 gives a summary of the datasets used.

• CIFAR-10 dataset [49] is a popular classification dataset in machine learning. It contains 60,000 color images of size $32\times 32$ in ten classes: trucks, planes, boats, horses, frogs, dogs, deer, cats, birds, and cars. The dataset is divided into 50,000 training instances and 10,000 test instances.

• CIFAR-100 dataset [50] is similar to CIFAR-10, except that it has 100 classes containing 600 color images each, with 500 training images and 100 test images per class.

Figure 7.

Comparative performance of the ELM model initialized by GWO-MVO using 100 hidden nodes based on different input types on the CIFAR-100 dataset (both test versions).

Figure 8.

Generalization gaps on CIFAR-10 and CIFAR-100 based on the feature input type.

Figure 9.

Test accuracy values across ten executions for studied ELM initialization techniques with VGG $+$ CAE input feature vector and 100 hidden nodes.

We created a noisy version for each dataset to consider the effect of noise on the training and testing set. We applied Gaussian noise with a mean of 0 and a standard deviation of 50. This particular standard deviation value was selected based on its ability to create a challenging test environment for Gaussian noise, as highlighted in [10]. Figure 4 gives an example of both original and noisy images from the training set for each dataset.
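A noisy copy of a dataset can be produced along the following lines. This is a minimal sketch: it assumes 8-bit pixel values in $[0,255]$ (so that $\sigma=50$ is expressed on that scale) and assumes that noisy values are rounded and clipped back to the valid range; neither detail is stated in the text, and the function name is hypothetical.

```python
import numpy as np


def add_gaussian_noise(images, sigma=50.0, rng=None):
    """Add zero-mean Gaussian noise to 8-bit images and clip to [0, 255]."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = images.astype(np.float64) + rng.normal(0.0, sigma, size=images.shape)
    # Rounding and clipping are assumptions needed to return valid uint8 images.
    return np.clip(np.rint(noisy), 0, 255).astype(np.uint8)


# Two placeholder 32x32 RGB images with a constant mid-grey value.
images = np.full((2, 32, 32, 3), 128, dtype=np.uint8)
noisy_images = add_gaussian_noise(images, rng=np.random.default_rng(0))
```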

Figure 10.

Generalization gaps on CIFAR-10 and CIFAR-100 based on the initialization type.

Figure 11.

ELM model performance using GWO-MVO initialization and 100 hidden nodes on the four fitness functions for the CIFAR-10 dataset (both test versions).

7.3 Experimental setup

In this part, we give the implemented CNN and CAE architectures and the configuration of the different parameters.

We have opted for straightforward architectures that ensure the reduction of input features, thereby enhancing the learning speed of ELM.

7.3.1 CNN architectures

In our work, we adopted the CIFAR-VGG architecture, a modified version of VGG16 that addresses the overfitting problem by using dropout and weight decay [17, 51]. This architecture is considered the gold standard as it offers a simple yet effective network, unlike more complex and manually optimized modern architectures. To ensure performance, we used the same architecture provided at https://github.com/geifmany/cifar-vgg.

7.3.2 CAE architectures

We employed a straightforward and standard CAE architecture to enable dimensionality reduction and feature extraction. Figure 5 describes the CAE structures used.

We further used the Adam optimizer, a learning rate equal to $0.001$ , a batch size equal to 64, and a total number of learning epochs of 200.

7.3.3 Parameter settings

The control parameters were tuned empirically through an experimental study considering the learning time and the memory space. The values of the parameters used in our experimental study are given in Table 2.

7.4 Results and discussion

In the rest of this section, we report the results of our tests. We performed ten separate runs for each method.

7.4.1 Input feature impact

The ELM model is initialized using the GWO-MVO technique and evaluated with four input types: the raw pixel image, the CAE feature vector, the VGG feature vector, and the combined VGG-CAE feature vector. Figures 6 and 7 compare the different approaches on the two datasets, CIFAR-10 and CIFAR-100, with a fixed set of 100 hidden nodes. Each figure includes both evaluation scenarios: original data and data with noise.

In both evaluation scenarios and across the four fitness functions, we observe that the VGG feature vector and the combined VGG $+$ CAE feature vector outperform the other input types by a notable margin. Furthermore, the VGG $+$ CAE vector demonstrates a slightly higher performance than the VGG vector.

Conversely, we notice that the CAE feature vector yields the lowest results on the original version of the datasets. However, the model fed directly with image pixels, without the feature extraction process, performs the worst on the noisy version.

One of the critical factors in machine learning is the ability of the model to fit new data correctly. Figure 8 illustrates the generalization gap of the two feature vectors, VGG and VGG $+$ CAE, for the four fitness functions. The results presented in the figure were obtained with 100 hidden nodes. We conclude that combining the VGG and CAE feature vectors reduces the generalization gap observed when using the VGG features alone.
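For reference, the generalization gap can be computed as in the sketch below. The formula is not spelled out in this section, so the sketch assumes the common definition, training accuracy minus test accuracy, in percentage points; the input values in the usage line are hypothetical and are not results from the paper.

```python
def generalization_gap(train_accuracy, test_accuracy):
    """Gap in percentage points, assuming gap = training - test accuracy."""
    return train_accuracy - test_accuracy


def mean_gap(train_accuracies, test_accuracies):
    """Average the per-run gaps over several independent runs."""
    gaps = [generalization_gap(tr, te)
            for tr, te in zip(train_accuracies, test_accuracies)]
    return sum(gaps) / len(gaps)


# Hypothetical accuracies (%) for two runs.
print(mean_gap([99.0, 98.0], [93.0, 94.0]))  # 5.0
```

A smaller gap indicates that training and test accuracy are closer, which is the sense in which the text speaks of reduced overfitting.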

7.4.2 ELM initialization impact

We use box plots, which clearly show the dispersion of the data and the outliers, to analyze the impact of different initialization methods for the ELM input parameters. We compare five initialization methods: random initialization and GWO-MVO with each of the four fitness functions $F_{1}$ , $F_{2}$ , $F_{3}$ , and $F_{4}$ . For each technique, we use the VGG $+$ CAE vector as input features and a fixed number of 100 hidden nodes.

The comparison presented in Fig. 9 reveals the superiority of GWO-MVO, regardless of the fitness function employed. In addition to the significance of the chosen input features, these results strongly highlight the necessity of initializing the ELM with appropriate input parameters.

Figure 10 depicts the generalization gap observed in the two versions of the two datasets. The figure focuses on two initialization methods, random initialization and our GWO-MVO method, using the $F_{4}$ fitness function. These results were obtained by VGG $+$ CAE input features using 100 hidden nodes.

In addition to the influence of VGG $+$ CAE input features on the generalization ability (as discussed in Section 7.4.1), using our GWO-MVO initialization method further reduces the generalization gap. This observation reinforces the crucial significance of proper ELM parameter initialization.

7.4.3 Fitness function impact

Figures 11 and 12 give an overview of how the ELM model initialized by GWO-MVO performs on the four fitness functions for the two datasets, CIFAR-10 and CIFAR-100, respectively. The evaluation includes the four input types presented in Section 7.4.1, using 100 hidden nodes. Both figures show the two evaluation scenarios, including the original dataset and the dataset with added noise.

Figure 12.

ELM model performance using GWO-MVO initialization and 100 hidden nodes on the four fitness functions for the CIFAR-100 dataset (both test versions).

The results demonstrate a close similarity in the performance of the four fitness functions, making it challenging to identify a clear best function. However, the $F_{1}$ function shows the highest instability, as its results are inconsistent with those of the other functions in some cases.

Table 3

Best average accuracy and standard deviation for the original data version

Input feature	# Input features	Initialisation	Fitness function	CIFAR-10		CIFAR-100	
				# Hidden nodes	Best result (%)	# Hidden nodes	Best result (%)
Raw pixel image	3072	Random	$-$	1000	44.33 $\pm$ 0.23	1000	18.13 $\pm$ 0.33
		GWO-MVO	$F_{1}$	500	39.41 $\pm$ 0.27	500	14.3 $\pm$ 0.21
		GWO-MVO	$F_{2}$	1000	44.41 $\pm$ 0.31	500	14.34 $\pm$ 0.15
		GWO-MVO	$F_{3}$	1000	44.27 $\pm$ 0.4	1000	18.25 $\pm$ 0.23
		GWO-MVO	$F_{4}$	1000	44.47 $\pm$ 0.24	1000	18.25 $\pm$ 0.16
CAE	1793	Random	$-$	1000	46.2 $\pm$ 0.39	1000	18.92 $\pm$ 0.25
		GWO-MVO	$F_{1}$	1000	49.23 $\pm$ 0.23	1000	20.27 $\pm$ 0.18
		GWO-MVO	$F_{2}$	1000	49.39 $\pm$ 0.25	1000	20.24 $\pm$ 0.15
		GWO-MVO	$F_{3}$	1000	49.37 $\pm$ 0.26	1000	20.14 $\pm$ 0.26
		GWO-MVO	$F_{4}$	1000	49.3 $\pm$ 0.2	1000	20.21 $\pm$ 0.22
VGG	512	Random	$-$	500	93.51 $\pm$ 0.03	1000	70.99 $\pm$ 0.04
		GWO-MVO	$F_{1}$	1000	93.6 $\pm$ 0.03	1000	70.58 $\pm$ 1.55
		GWO-MVO	$F_{2}$	300	93.58 $\pm$ 0.06	1000	71.07 $\pm$ 0.1
		GWO-MVO	$F_{3}$	500	93.6 $\pm$ 0.08	1000	71.07 $\pm$ 0.05
		GWO-MVO	$F_{4}$	500	93.6 $\pm$ 0.06	1000	71.06 $\pm$ 0.11
VGG $+$ CAE	2305	Random	$-$	300	93.59 $\pm$ 0.06	1000	71.05 $\pm$ 0.11
		GWO-MVO	$F_{1}$	300	93.69 $\pm$ 0.04	200	70.44 $\pm$ 0.11
		GWO-MVO	$F_{2}$	1000	93.68 $\pm$ 0.05	1000	71.16 $\pm$ 0.1
		GWO-MVO	$F_{3}$	500	93.68 $\pm$ 0.04	1000	71.2 $\pm$ 0.18
		GWO-MVO	$F_{4}$	200	93.71 $\pm$ 0.06	1000	71.18 $\pm$ 0.11

Figure 13.

Average accuracy over the two test sets based on the number of ELM hidden nodes.

7.4.4 Hidden node impact

Table 4
Best average accuracy and standard deviation for the noisy data version

Input feature	# Input features	Initialisation	Fitness function	CIFAR-10		CIFAR-100	
				# Hidden nodes	Best result (%)	# Hidden nodes	Best result (%)
Raw pixel image	3072	Random	$-$	1000	36.99 $\pm$ 0.26	1000	12.86 $\pm$ 0.25
		GWO-MVO	$F_{1}$	1000	34.79 $\pm$ 0.32	300	11.18 $\pm$ 0.14
		GWO-MVO	$F_{2}$	1000	37.75 $\pm$ 0.29	500	10.74 $\pm$ 0.2
		GWO-MVO	$F_{3}$	1000	36.82 $\pm$ 0.31	1000	13.48 $\pm$ 0.28
		GWO-MVO	$F_{4}$	1000	38.02 $\pm$ 0.37	300	11.28 $\pm$ 0.22
CAE	1793	Random	$-$	1000	45.99 $\pm$ 0.35	1000	18.74 $\pm$ 0.27
		GWO-MVO	$F_{1}$	1000	47.07 $\pm$ 0.18	1000	19.01 $\pm$ 0.1
		GWO-MVO	$F_{2}$	1000	47.46 $\pm$ 0.31	1000	19.03 $\pm$ 0.1
		GWO-MVO	$F_{3}$	1000	46.57 $\pm$ 0.26	1000	18.82 $\pm$ 0.1
		GWO-MVO	$F_{4}$	1000	47.43 $\pm$ 0.41	1000	19.06 $\pm$ 0.21
VGG	512	Random	$-$	500	83.06 $\pm$ 0.07	1000	54.08 $\pm$ 0.14
		GWO-MVO	$F_{1}$	1000	83.2 $\pm$ 0.05	1000	54.11 $\pm$ 0.1
		GWO-MVO	$F_{2}$	500	83.16 $\pm$ 0.12	1000	54.08 $\pm$ 0.1
		GWO-MVO	$F_{3}$	1000	83.22 $\pm$ 0.07	1000	54.03 $\pm$ 0.1
		GWO-MVO	$F_{4}$	500	83.17 $\pm$ 0.07	1000	54.14 $\pm$ 0.1
VGG $+$ CAE	2305	Random	$-$	1000	83.18 $\pm$ 0.08	1000	53.97 $\pm$ 0.14
		GWO-MVO	$F_{1}$	100	82.97 $\pm$ 0.08	500	54.06 $\pm$ 0.1
		GWO-MVO	$F_{2}$	1000	83.28 $\pm$ 0.09	1000	54.23 $\pm$ 0.11
		GWO-MVO	$F_{3}$	1000	83.3 $\pm$ 0.11	1000	54.21 $\pm$ 0.17
		GWO-MVO	$F_{4}$	1000	83.35 $\pm$ 0.07	1000	54.22 $\pm$ 0.07

Table 5

Ablation study of key components in our method using 100 hidden nodes on CIFAR-10.

Original
Settings	SD	Mean Acc (%)	Best Acc (%)	Worst Acc (%)	Mean GG (%)
ELM	0.54	34.46	35.27	33.5	0.11
$+$ VGG	0.1	93.37	93.57	93.24	6.62
$+$ CAE	0.08	93.51	93.62	93.39	6.47
$+$ GWO-MVO ( $F_{4}$ )	0.07	93.66	93.8	93.59	6.33
Noisy
ELM	0.54	28.61	29.61	27.63	0.39
$+$ VGG	0.12	82.73	82.93	82.55	13.02
$+$ CAE	0.17	82.71	82.92	82.37	12.92
$+$ GWO-MVO ( $F_{4}$ )	0.08	83.01	83.18	82.9	12.86

SD: Standard deviation, Acc: Accuracy, GG: Generalization gap.

Table 6

Ablation study of key components in our method using 100 hidden nodes on CIFAR-100

Original
Settings	SD	Mean Acc (%)	Best Acc (%)	Worst Acc (%)	Mean GG (%)
ELM	0.22	11.2	11.63	10.98	1.18
$+$ VGG	0.09	69.18	69.35	69.05	30.28
$+$ CAE	0.27	68.24	68.59	67.68	30.86
$+$ GWO-MVO ( $F_{4}$ )	0.11	69.82	69.99	69.66	29.66
Noisy
ELM	0.29	8.34	8.91	7.89	1.44
$+$ VGG	0.33	51.67	52.1	50.96	29.61
$+$ CAE	0.37	49.86	50.37	48.97	28.29
$+$ GWO-MVO ( $F_{4}$ )	0.16	52.03	52.26	51.79	29.18

SD: Standard deviation, Acc: Accuracy, GG: Generalization gap.

Figure 13 displays the average accuracy achieved on the test set, according to the number of hidden nodes, for both the original and noisy versions of the CIFAR-10 and CIFAR-100 datasets. Regarding input features, the comparison focuses on VGG and VGG $+$ CAE. We explore both random initialization and initialization using the GWO-MVO method with the $F_{4}$ fitness function. The results indicate that network size plays a crucial role in determining performance.

We observe that increasing the number of hidden nodes generally enhances accuracy. However, this trend does not hold consistently for the CIFAR-10 dataset, particularly with the original data. Moreover, although approaches using VGG $+$ CAE features perform better, the intelligent initialization of ELM input parameters remains a crucial factor: it enables high accuracies with relatively few hidden nodes.

Table 7

Comparing average accuracy between different metaheuristic-based ELM approaches using 100 Hidden Nodes

Dataset	Data version	Fitness function	PSO	GWO	GWO-MVO
CIFAR-10	Original	$F_{1}$	93.46	93.44	93.67
		$F_{2}$	93.57	93.45	93.65
		$F_{3}$	93.42	93.47	93.68
		$F_{4}$	93.43	93.49	93.66
	Noisy	$F_{1}$	82.78	82.83	82.97
		$F_{2}$	82.98	82.83	83
		$F_{3}$	82.72	82.85	83.09
		$F_{4}$	82.91	82.75	83.01
CIFAR-100	Original	$F_{1}$	69.45	69.59	69.73
		$F_{2}$	69.59	69.57	69.9
		$F_{3}$	69.28	69.52	69.68
		$F_{4}$	69.28	69.63	69.82
	Noisy	$F_{1}$	52.03	51.86	52.06
		$F_{2}$	51.82	51.88	52.01
		$F_{3}$	51.88	51.91	51.99
		$F_{4}$	51.56	51.86	52.03

Table 8

Comparison between different classifiers performance

Method	CIFAR-10				CIFAR-100
	Original		Noisy		Original		Noisy
	Acc (%)	TT (s)	Acc (%)	TT (s)	Acc (%)	TT (s)	Acc (%)	TT (s)
VGG $+$ CAE $+$ KNN	93.56	N/A	82.61	N/A	70.78	N/A	52.26	N/A
VGG $+$ CAE $+$ SVM	93.54	10.88	82.41	134.3	70.92	135.18	52.91	322.85
VGG $+$ CAE $+$ ELM	93.56	8.77	83.18	7.66	71.05	8.47	53.97	7.96
Proposed ( $F_{1}$ )	85.32	1131.54	61.28	909.71	36.25	701.7	22.44	614.32
Proposed ( $F_{2}$ )	93.68	1234.53	83.28	1245.72	71.16	1293.38	54.23	1267.83
Proposed ( $F_{3}$ )	93.65	5100.58	83.3	5095.76	71.2	5302.27	54.21	5301.16
Proposed ( $F_{4}$ )	93.63	5106.73	83.35	5089.35	71.18	5238.393	54.22	5300.96

1000 hidden nodes were used for the ELM classifier. Acc: Accuracy, TT: Training time.

Table 9

Comparison with similar methods on CIFAR-10 and CIFAR-100

Work	Feature extraction network	Approach	CIFAR-10				CIFAR-100
			Original		Noisy		Original		Noisy
			Acc (%)	Err (%)	Acc (%)	Err (%)	Acc (%)	Err (%)	Acc (%)	Err (%)
VGG16 [17]	VGG16	$-$	88.1	11.9	$-$	$-$	63.5	36.5	$-$	$-$
DCNN [10]	DCNN	Without using denoising methods	81.92	18.08	66.08	33.92	$-$	$-$	$-$	$-$
		With denoising methods	$-$	$-$	62.77	37.23	$-$	$-$	$-$	$-$
VGG-16-pruned-A [52]	VGG16	Pruning Filters	93.4	6.6	$-$	$-$	$-$	$-$	$-$	$-$
VGG16-AFP-E [53]	VGG16	Pruning Filters	92.94	7.06	$-$	$-$	$-$	$-$	$-$	$-$
DCT-Net [36]	VGG16	DCT ${}^{1}$	88.44	11.56	62.92	37.08	67.5	32.5	49.81	50.19
SNN-VGG-15 [54]	VGG16	Hard reset	$-$	$-$	$-$	$-$	62.85	37.15	$-$	$-$
		Soft reset	$-$	$-$	$-$	$-$	63.09	36.91	$-$	$-$
		TF-reset reset	$-$	$-$	$-$	$-$	64.27	35.73	$-$	$-$
CIFAR-VGG [51]	CIFAR-VGG	$-$	93.56	6.44	81.57	18.43	70.48	29.52	52.37	47.63
VGG-CAE-ELM (our work)	CIFAR-VGG $+$ CAE	GWO-MVO $F_{1}$	93.76	6.24	83.29	16.71	71.36	28.64	54.37	45.63
		GWO-MVO $F_{2}$	93.77	6.23	83.4	16.6	71.36	28.64	54.43	45.57
		GWO-MVO $F_{3}$	93.78	6.22	83.44	16.56	71.4	28.6	54.57	45.43
		GWO-MVO $F_{4}$	93.8	6.2	83.47	16.53	71.3	28.7	54.37	45.63

Acc: Best accuracy, Err: Best error rate. ${}^{1}$ Discrete cosine transform.

To facilitate the numerical comparison of the results, we provide in Tables 3 and 4 the best outcomes achieved on the test set for each tested method. The best results are highlighted in bold, while the worst results are underlined.

Upon initial observation, it is evident that our approach achieved superior performance in both versions of the two datasets. For the CIFAR-10 dataset, the best results on the test set, when models were trained and tested on the original and noisy data, were 93.71% $\pm$ 0.06 and 83.35% $\pm$ 0.07, respectively. These results were obtained using our method with the $F_{4}$ function. Notably, the best outcome for the original version was achieved with only 200 hidden nodes, indicating the effectiveness of our approach in reducing resource requirements and training time. Regarding the original CIFAR-100 dataset, the best average result of 71.2% $\pm$ 0.18 was achieved with the $F_{3}$ function. However, the $F_{2}$ function performed best on the noisy data, with an average accuracy of 54.23% $\pm$ 0.11.

We further find that models fed directly with the input images, without going through the feature extraction process, perform the worst, especially when using the GWO-MVO optimization algorithm with the $F_{1}$ function. This indicates that optimizing the initial ELM parameters should be done with high-quality input features.

7.4.5 Ablation study

To investigate the effectiveness of the critical components of our approach, we conduct ablation studies. Tables 5 and 6 report the standard deviation, mean accuracy, best accuracy, worst accuracy, and generalization gap for the CIFAR-10 and CIFAR-100 datasets. The outcomes for the two datasets in both versions show remarkable similarity. First, we observe a significant improvement in test accuracy and standard deviation when VGG features are used. However, this gain is accompanied by a deterioration in the generalization gap due to the overfitting problem of VGG. Subsequently, integrating CAE features may decrease the accuracy in some cases, but it improves the generalization gap in most cases. Furthermore, by optimizing ELM weights and biases using GWO-MVO, our method achieves greater competitiveness in prediction stability, evident from the further improvements in almost all metrics.

To validate our approach, GWO-MVO-ELM is compared with two other metaheuristic-based ELM models: PSO-ELM and GWO-ELM. Table 7 presents the average accuracy results. Note that the VGG $+$ CAE feature vector is used to feed these classifiers. GWO-MVO-ELM achieved the highest average accuracy on all datasets, regardless of the fitness function employed.

Additional experiments are conducted with other classifiers to further verify the superiority of the proposed method. The classifiers evaluated were KNN, RBF-kernel SVM, standard ELM, and our optimized ELM. Each classifier was fed with the VGG $+$ CAE feature vector. The results, including average accuracy and training time for both dataset versions, are reported in Table 8. To obtain an accurate estimate of the training time for the ELM classifier, we employed 1000 hidden nodes. Moreover, training time was ignored for KNN, given the simplicity of this classifier.

Our proposed method with the three fitness functions $F_{2}$ , $F_{3}$ , and $F_{4}$ outperforms the other classifiers in terms of accuracy. This superiority is more noticeable with the noisy data version, with an improvement of about 2 percentage points over the KNN classifier on CIFAR-100. However, despite its good performance in previous experiments, the $F_{1}$ function shows the worst accuracy. This indicates that this function is a poor choice when the number of hidden nodes is high.

On the other hand, our approach shows the worst training time, particularly with the two functions that use cross-validation, $F_{3}$ and $F_{4}$ . At the same time, the standard ELM classifier achieved the fastest training time. Despite the excellent performance of the ELM classifier in terms of training time, optimizing its weights and biases is time-consuming, requiring additional focus and attention.

7.5 Comparison with previous works

To conduct a comprehensive evaluation of our method, we compare it with similar works of equivalent complexity on the CIFAR-10 and CIFAR-100 datasets. The proposed method is compared with several other approaches, including VGG16 [17], DCNN [10], VGG-16-pruned-A [52], VGG16-AFP-E [53], DCT-Net [36], SNN-VGG-15 [54], and CIFAR-VGG [51]. The results are shown in Table 9, and the best ones are in bold.

As can be seen, our method outperformed all the compared approaches, achieving the highest accuracies on both versions of the two datasets. This improvement is particularly notable in the case of the noisy version of the data, where our technique significantly outperforms the others. We reached an accuracy of 93.8% on the standard CIFAR-10 dataset and 83.47% on the noisy version. For CIFAR-100, our method achieved accuracies of 71.4% and 54.57% on the original and noisy data, respectively. These outcomes further highlight the strong performance of our approach across both datasets and the superior stability obtained on noisy data.

8. Conclusion

In this study, we proposed a new method for improving the basic ELM model for the image classification task. Our proposal focuses on the input features as well as the initial weights and biases of the model. Features extracted from CNN and CAE were used as input to the ELM model, while the hybrid GWO-MVO metaheuristic was used for weight and bias optimization. Four fitness functions were applied by considering three properties: the norm of the output weights, the training set’s error rate, and the validation set’s error rate. We validated our method on two benchmark datasets: CIFAR-10 and CIFAR-100. We also considered the variation in real-life image quality by training and testing our models on the original and noisy datasets. The obtained results revealed the effectiveness of our method.

The proposed method capitalizes on the strengths of the ELM classifier and the VGG and CAE models in a synergistic manner that effectively mitigates the limitations of each model. While using VGG features enhances the performance of the ELM model, combining these features with those extracted by CAE not only boosts accuracy but also reduces overfitting. Furthermore, using the GWO-MVO algorithm to initialize the ELM weights and biases provides an additional advantage. An essential benefit of our approach is its ability to achieve good performance with few hidden nodes, significantly reducing hyperparameter tuning time. Experimental results demonstrate the robustness of our method, even when applied to datasets with high noise levels. The technique maintains its efficacy in such challenging conditions, showcasing its suitability for real-world scenarios. Furthermore, our method outperforms the other classifiers considered, namely KNN, SVM, conventional ELM, PSO-ELM, and GWO-ELM.

In summary, the superior performance of our proposed approach can be attributed to several crucial factors, including the consistent prediction stability, the ability to fully exploit the classification power of the ELM classifier, and the capacity to reduce overfitting. On the other hand, even though reducing input features using deep learning extraction can improve training time, optimizing ELM weights and biases is still time-consuming. Hence, reducing training time remains an area that requires attention and improvement. In our future work, we plan to further improve our model by optimizing both the CNN and CAE architectures. Future efforts will also focus on optimizing the architecture of our approach by adopting parallelism in the ELM parameter tuning phase to reduce training time.

References

Huang

Zhu

Siew

. Extreme learning machine: a new learning scheme of feedforward neural networks. In: 2004 IEEE international joint conference on neural networks (IEEE Cat. No. 04CH37541). vol. 2. IEEE; 2004. pp. 985-990.

Pan

Lü

Wang

Wei

Chen

. Novel battery state-of-health online estimation method using multiple health indicators and an extreme learning machine. Energy. 2018; 160: 466-477.

Khishe

Mohammadi

Parvizi

Karim

SHT

Rashid

. Real-time COVID-19 diagnosis from X-Ray images using deep CNN and extreme learning machines stabilized by chimp optimization algorithm. Biomedical Signal Processing and Control. 2021; 68: 102764.

Kang

Zhao

. Predicting refractive index of ionic liquids based on the extreme learning machine (ELM) intelligence algorithm. Journal of Molecular Liquids. 2018; 250: 44-49.

Wang

. Determinants investigation and peak prediction of CO2 emissions in China’s transport sector utilizing bio-inspired extreme learning machine. Environmental Science and Pollution Research. 2021; 28(39): 55535-55553.

Milačić

Jović

Vujović

Miljković

. Application of artificial neural network with extreme learning machine for economic growth estimation. Physica A: Statistical Mechanics and its Applications. 2017; 465: 285-288.

Yuan

Chen

Wang

Cao

Cai

Xue

. A compensation method based on extreme learning machine to enhance absolute position accuracy for aviation drilling robot. Advances in Mechanical Engineering. 2018; 10(3): 1687814018763411.

Qing

Zeng

Huang

. Deep and wide feature based extreme learning machine for image classification. Neurocomputing. 2020; 412: 426-436.

Paranhos da Costa

Contato

Nazare

Batista Neto

Ponti

. An empirical study on the effects of different types of noise in image classification tasks. arXiv e-prints. 2016; p. arXiv-1609.

10.

Nazaré

Costa

Contato

Ponti

. Deep convolutional neural networks and noisy images. In: Iberoamerican Congress on Pattern Recognition. Springer; 2017; p. 416-424.

11.

Kölsch

Afzal

Ebbecke

Liwicki

. Real-time document image classification using deep CNN and extreme learning machines. In: 2017 14th IAPR international conference on document analysis and recognition (ICDAR). vol. 1. IEEE; 2017. p. 1318-1323.

12.

Yang

Tang

Zhang

Wong

. Multi-view CNN feature aggregation with ELM auto-encoder for 3D shape recognition. Cognitive Computation. 2018; 10(6): 908-921.

13.

Wang

Zhang

Hao

. A method combining CNN and ELM for feature extraction and classification of SAR image. Journal of Sensors. 2019; 2019.

14. Dos Santos, da Silva Filho, dos Santos. Deep convolutional extreme learning machines: Filters combination and error model validation. Neurocomputing. 2019; 329: 359-369.

15. Kali Ali, Boughaci. Hybrid Approach Based on Grey Wolf Optimizer for Dropout Regularization in Deep Learning. In: International Symposium on Modelling and Implementation of Complex Systems. Springer; 2023. pp. 121-134.

16. Masci, Meier, Cireşan, Schmidhuber. Stacked convolutional auto-encoders for hierarchical feature extraction. In: International Conference on Artificial Neural Networks. Springer; 2011. pp. 52-59.

17. Simonyan, Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. 2014.

18. Zhu, Qin, Suganthan, Huang. Evolutionary extreme learning machine. Pattern Recognition. 2005; 38(10): 1759-1763.

19. Shu. Evolutionary extreme learning machine – based on particle swarm optimization. In: International Symposium on Neural Networks. Springer; 2006. pp. 644-652.

20. Mohapatra, Chakravarty, Dash. An improved cuckoo search based extreme learning machine for medical data classification. Swarm and Evolutionary Computation. 2015; 24: 25-49.

21. Muduli, Dash, Majhi. Fast discrete curvelet transform and modified PSO based improved evolutionary extreme learning machine for breast cancer detection. Biomedical Signal Processing and Control. 2021; 70: 102919.

22. Alzaqebah, Aljarah, Al-Kadi. A hierarchical intrusion detection system based on extreme learning machine and nature-inspired optimization. Computers & Security. 2023; 124: 102957.

23. Dogan, Taspinar, Cinar, Kursun, Ozkan, Koklu. Dry bean cultivars classification using deep CNN features and salp swarm algorithm based extreme learning machine. Computers and Electronics in Agriculture. 2023; 204: 107575.

24. Zhang, Liu. A New Extreme Learning Machine Optimized by Firefly Algorithm. In: 2013 Sixth International Symposium on Computational Intelligence and Design. vol. 2. IEEE; 2013. pp. 133-136.

25. Yao, Yang. Dolphin swarm extreme learning machine. Cognitive Computation. 2017; 9(2): 275-284.

26. Nayak, Dash, Majhi. Pathological Brain Detection using Extreme Learning Machine Trained with Improved Whale Optimization Algorithm. In: 2017 Ninth International Conference on Advances in Pattern Recognition (ICAPR). IEEE; 2017. pp. 1-6.

27. Zheng, Wang, Zhao, Wang. Research of bearing fault diagnosis method based on multi-layer extreme learning machine optimized by novel ant lion algorithm. IEEE Access. 2019; 7: 89845-89856.

28. Shariati, Mafipour, Ghahremani, Azarhomayun, Ahmadi, Trung, et al. A novel hybrid extreme learning machine – grey wolf optimizer (ELM-GWO) model to predict compressive strength of concrete with partial replacements for cement. Engineering with Computers. 2022; pp. 1-23.

29. Jiang, Jia, Chen. The two-stage machine learning ensemble models for stock price prediction by combining mode decomposition, extreme learning machine and improved harmony search algorithm. Annals of Operations Research. 2022; pp. 1-33.

30. Niu, Suen. A novel hybrid CNN-SVM classifier for recognizing handwritten digits. Pattern Recognition. 2012; 45(4): 1318-1325.

31. Maggipinto, Masiero, Beghi, Susto. A convolutional autoencoder approach for feature extraction in virtual metrology. Procedia Manufacturing. 2018; 17: 126-133.

32. Pintelas, Livieris, Pintelas. A convolutional autoencoder topology for classification in high-dimensional noisy image datasets. Sensors. 2021; 21(22): 7731.

33. Yaman, Tuncer. Exemplar pyramid deep feature extraction based cervical cancer image classification model using pap-smear images. Biomedical Signal Processing and Control. 2022; 73: 103428.

34. Jiang, Cui. Transformer-Based Fused Attention Combined with CNNs for Image Classification. Neural Processing Letters. 2023; pp. 1-15.

35. Dodge, Karam. Understanding how image quality affects deep neural networks. In: 2016 Eighth International Conference on Quality of Multimedia Experience (QoMEX). IEEE; 2016. pp. 1-6.

36. Hossain, Teng, Zhang, Lim. Distortion robust image classification using deep convolutional neural network with discrete cosine transform. In: 2019 IEEE International Conference on Image Processing (ICIP). IEEE; 2019. pp. 659-663.

37. Yang, Dong, Huang, Sun, Shi. Vector Quantization with Self-Attention for Quality-Independent Representation Learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2023. pp. 24438-24448.

38. Fukushima. Neocognitron: A Self-Organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position. Biological Cybernetics. 1980; 36: 193-202.

39. LeCun, Bottou, Bengio, Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE. 1998; 86(11): 2278-2324.

40. Krizhevsky, Sutskever, Hinton. ImageNet classification with deep convolutional neural networks. Communications of the ACM. 2017; 60(6): 84-90.

41. Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, et al. Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2015. pp. 1-9.

42. He, Zhang, Ren, Sun. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016. pp. 770-778.

43. Tupe, Vibhute, Sayyad. An Architecture Combining Convolutional Neural Network (CNN) with Batch Normalization for Apparel Image Classification. In: 2020 IEEE International Symposium on Sustainable Energy, Signal Processing and Cyber Security (iSSSC). IEEE; 2020. pp. 1-6.

44. Marler, Arora. Survey of multi-objective optimization methods for engineering. Structural and Multidisciplinary Optimization. 2004; 26(6): 369-395.

45. Mirjalili, Lewis. Grey wolf optimizer. Advances in Engineering Software. 2014; 69: 46-61.

46. Mirjalili, Hatamlou. Multi-verse optimizer: a nature-inspired algorithm for global optimization. Neural Computing and Applications. 2016; 27(2): 495-513.

47. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory. 1998; 44(2): 525-536.

48. Marcot, Hanea. What is an optimal value of k in k-fold cross-validation in discrete Bayesian network analysis? Computational Statistics. 2021; 36(3): 2009-2031.

49. Krizhevsky, Nair, Hinton. CIFAR-10 (Canadian Institute for Advanced Research). URL http://www.cs.toronto.edu/~kriz/cifar.html. 2010; 5(4): 1.

50. Krizhevsky, Nair. CIFAR-100 (Canadian Institute for Advanced Research).

51. Liu, Deng. Very deep convolutional neural network based image classification using small training sample size. In: 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR). IEEE; 2015. pp. 730-734.

52. Li, Kadav, Durdanovic, Samet, Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710. 2016.

53. Ding, Han, Tang. Auto-balanced filter pruning for efficient convolutional neural networks. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 32; 2018.

54. Niu, Wei, Liu. Event-driven spiking neural network based on membrane potential modulation for remote sensing image classification. Engineering Applications of Artificial Intelligence. 2023; 123: 106322.

Improving extreme learning machine model using deep learning feature extraction and grey wolf optimizer: Application to image classification