Abstract

Microarray expression datasets generate a huge number of genes, but only a few genes provide information about cancer diseases. In this context, feature selection approaches have been developed to deal with this problem. Filter-based methods, in particular, select the relevant genes and remove the irrelevant ones using different evaluation metrics. In this study, we shed light on nine univariate filter methods. Three categories of filter methods were investigated using eight microarray datasets, including binary and multi-class samples. The support vector machine and Naive Bayes classifiers were used to assess classification accuracy. Different comparison methods were used to assist the researchers in visualizing the performance of each studied filter. Precisely, statistical tests were applied in terms of classification accuracy, and the feature ranking similarity of the filter methods was studied based on a rank correlation measure.

Keywords

Feature selection, univariate filter method, classification, microarray datasets

Introduction

Analyzing large-scale biological data such as microarray datasets is currently a challenge for biostatistics and machine learning researchers because of the structure of this kind of dataset: a huge number of features (genes) versus a small number of instances. Microarray data are used to obtain information about genes that can be helpful for cancer identification or diagnosis. In a bi-class problem, microarray datasets are used to distinguish between normal and cancer patients. In the case of multi-class problems, the datasets are used to classify various types of cancer.

Based on this consideration, feature selection approaches were developed to remove the non-informative genes and retain only the relevant ones. Filter, wrapper, and hybrid methods are the most widely used types of feature selection. Filter-based methods rank the features using a specific relevance measure without invoking any classification algorithm. A subset of features is then selected, and its cardinality is either predefined or set according to a certain threshold.¹ The most commonly used filters are correlation,² mutual information (MI),³ information gain (IG),⁴ reliefF,⁵ and relevance.⁶ In contrast, wrapper-based methods search for the subset of features that gives the best results with a classification algorithm, guided by search strategies such as the genetic algorithm,⁷ firefly algorithm,⁸ binary bat algorithm,⁹ particle swarm optimization,¹⁰ flower pollination algorithm,¹¹ and artificial bee colony.¹² Besides filters and wrappers, hybrid-based approaches¹³ combine the merits of both. Feature selection approaches can also be grouped into three categories according to the presence of the target class label in the datasets: supervised, semi-supervised, and unsupervised methods.

Historically, filter-based methods¹⁴ have been considered the most important feature selection approach because of their low computational cost compared to the other methods, which allows them to handle high-dimensional classification problems. Rouhi and Nezamabadi-pour¹⁵ developed a filter based on an improved binary gravitational search algorithm. Kavitha et al.¹⁶ combined the symmetric uncertainty (SU) and relief filters using score normalization in order to select the relevant and non-redundant genes. Based on an estimation of relevance between features and the class label, Ke et al.¹⁷ presented a score-based criteria fusion. In addition, many hybrid-based methods have been developed using a filter ranking method as a first stage. Chamlal et al.¹⁸ used a score of importance based on the preordonnances theory to remove the irrelevant features as a first step before the wrapper stage. Zhang et al.¹⁹ proposed a hybrid approach where the IG was applied in the filter step. Minimum redundancy maximal relevance and analysis of variance filters were used by Dabba et al.²⁰ and Baliarsingh et al.,²¹ respectively.

In this article, we present a comparative study of nine univariate filter-based methods. The studied filters were divided into three groups based on how the relevance between the features and the class was assessed: entropy, statistics, and similarity-based methods.

The aspects used in this study are described as follows:

Nine univariate filter-based methods from different domains were studied.

Eight microarray datasets of varying sizes are used, with features ranging from 2000 to 12,600 and instances ranging from 60 to 181. Binary and multi-class cases were considered.

Two commonly used classifiers, support vector machine (SVM) and Naive Bayes (NB), are considered to test the performance of the filter methods based on their classification accuracy.

Different numbers of selected features were evaluated, ranging from 2 to 60 in steps of 2.

Pairwise comparison of the filters was carried out in terms of accuracies using the Friedman statistical test and in terms of feature ranking similarity using a rank correlation metric.
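The selection step implied by these aspects can be sketched in a few lines. This is only an illustration under assumptions: `scores` is a hypothetical list of relevance scores (one per gene, higher meaning more relevant) produced by any filter, and `top_k_features` is a helper name introduced here, not one taken from the study.

```python
def top_k_features(scores, k):
    """Return the indices of the k highest-scoring features, best first."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]

# Subset sizes evaluated in the study: 2, 4, ..., 60.
subset_sizes = list(range(2, 61, 2))

# Hypothetical relevance scores for six genes (illustrative values only).
scores = [0.10, 0.80, 0.05, 0.60, 0.30, 0.75]
print(top_k_features(scores, 2))  # indices of the two most relevant genes
print(len(subset_sizes))          # number of subset sizes tried
```

Each subset size would then be passed to the classifier, so a single ranking per filter and dataset is enough to produce every point of an accuracy-versus-size curve.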

The remainder of this article is organized as follows: the “Filter methods” section presents the studied filter-based methods. In the “Experimental protocol and results” section, we describe the protocol used for the experiments and analyze the results. The “Conclusion” section contains the conclusions of this study.

Filter methods

In this study, we shed light on the most frequently used univariate filter methods for high-dimensional datasets. The studied methods were divided into three categories: entropy-based, statistics-based, and similarity-based methods. A brief description of the used filters is presented in this section.

Entropy-based methods

Entropy-based methods are among the most widely used metrics in filter feature selection. These methods are based on the information-theoretic concept of entropy. The basic idea of this metric is to measure the uncertainty of a feature. The uncertainty of a discrete feature is greatest when all possible values occur with the same probability. Let X be a discrete variable and p be its probability mass function. The entropy of X is defined as follows:

H(X) = -\sum_{x} p(x) \log_{2} p(x)

(1)
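As a small illustration of equation (1), the sketch below estimates the entropy of a discrete feature from observed values, using relative frequencies as the probabilities p(x). The function name and the toy sequences are assumptions made for this example.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy in bits, with p(x) estimated by relative frequency."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

uniform = ["a", "b", "c", "d"]  # four equally likely values
skewed = ["a", "a", "a", "b"]   # one dominant value

print(entropy(uniform))  # largest possible entropy for four distinct values
print(entropy(skewed))   # lower, because the outcome is more predictable
```

The uniform sequence reaches the maximum of 2 bits for four values, which matches the remark that uncertainty is greatest when all values are equally probable.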

Besides entropy, joint entropy and conditional entropy are two extensions of the entropy metric that can be used to describe a relationship between two discrete features. Let X and Y be two discrete variables. The joint entropy is calculated as follows:

H(X, Y) = -\sum_{x} \sum_{y} p(x, y) \log_{2} p(x, y)

(2)

The conditional entropy measures the uncertainty of Y when X is known. It is computed as follows:

H(Y \mid X) = -\sum_{x} p(x) \sum_{y} p(y \mid x) \log_{2} p(y \mid x)

(3)
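Equations (2) and (3) can be checked numerically on a toy example. The sketch below is an illustration with assumed function names and data: it computes the joint entropy by treating each pair (x, y) as a single symbol, and the conditional entropy as the weighted entropy of Y within each value of X.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy in bits, with probabilities estimated by frequency."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def joint_entropy(xs, ys):
    """H(X, Y): entropy of the paired observations (x, y)."""
    return entropy(list(zip(xs, ys)))

def conditional_entropy(ys, xs):
    """H(Y | X): sum over x of p(x) times the entropy of Y restricted to X = x."""
    n = len(xs)
    total = 0.0
    for x, n_x in Counter(xs).items():
        ys_given_x = [y for xi, y in zip(xs, ys) if xi == x]
        total += (n_x / n) * entropy(ys_given_x)
    return total

xs = [0, 0, 1, 1]
ys = [0, 1, 1, 1]
print(joint_entropy(xs, ys))
print(conditional_entropy(ys, xs))
# Chain rule check: H(Y | X) should equal H(X, Y) - H(X).
print(joint_entropy(xs, ys) - entropy(xs))
```

The last two printed values coincide, which is the chain rule H(X, Y) = H(X) + H(Y | X) linking the two definitions.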

The entropy-based methods used in this study are presented below:

Mutual information²²:

The MI measure is used to calculate the statistical dependency between an explanatory variable X and the target class label Y. It can be understood as the amount of information shared by X and Y, that is, the decrease in uncertainty about one variable once the other is known. In the context of filter feature selection, this measure can be used to calculate the relevance between X and Y. The greater the MI value between X and Y, the stronger the discriminative power of the explanatory variable. It can be calculated as follows:

\begin{aligned} MI(X, Y) &= H(X) - H(X \mid Y) \\ &= H(Y) - H(Y \mid X) \\ &= H(X) + H(Y) - H(X, Y) \end{aligned}

(4)
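A minimal sketch of equation (4), using its third form because it only needs entropies of observed sequences. The feature is assumed to be already discrete; how continuous expression values would be discretized is not specified here, and the toy vectors are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy in bits, with probabilities estimated by frequency."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(xs, ys):
    """MI(X, Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

labels = [0, 0, 1, 1]
informative = ["low", "low", "high", "high"]  # follows the class exactly
noise = ["low", "high", "low", "high"]        # unrelated to the class

print(mutual_information(informative, labels))  # equals H(Y) here
print(mutual_information(noise, labels))        # no shared information
```

A filter based on MI would rank the informative feature above the noisy one, since the former removes all uncertainty about the class in this toy case.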

Gain ratio²³:

To overcome the bias of the MI metric toward features with many distinct values, the gain ratio was developed as a normalization of the MI between a random variable and the class label. It is defined as the ratio of the MI to the entropy of the variable. As a result, for a given MI value, this metric favors features with smaller entropy. It is defined as follows:

GR(X, Y) = \frac{MI(X, Y)}{H(X)}

(5)

Symmetric uncertainty²⁴:

Symmetric uncertainty is defined as a modification of the MI that reduces the bias toward variables with a large number of different values and normalizes the MI to the range [0, 1]. A value of 0 indicates that the feature and the class label are independent, while a value of 1 indicates that each one completely determines the other. It is computed as follows:

SU(X, Y) = \frac{2 \times MI(X, Y)}{H(X) + H(Y)}

(6)
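The effect of the two normalizations in equations (5) and (6) can be seen on a toy case where two features have the same MI with the class but a different number of distinct values. The sketch below is illustrative only; the function names, the zero-entropy guards, and the data are assumptions.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy in bits, with probabilities estimated by frequency."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(xs, ys):
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def gain_ratio(xs, ys):
    """Equation (5): MI divided by the entropy of the feature."""
    h_x = entropy(xs)
    return mutual_information(xs, ys) / h_x if h_x > 0 else 0.0

def symmetric_uncertainty(xs, ys):
    """Equation (6): MI normalized to [0, 1]."""
    denom = entropy(xs) + entropy(ys)
    return 2 * mutual_information(xs, ys) / denom if denom > 0 else 0.0

labels = [0, 0, 1, 1]
two_values = [0, 0, 1, 1]   # binary feature aligned with the class
four_values = [0, 1, 2, 3]  # one distinct value per sample

for name, feature in [("two_values", two_values), ("four_values", four_values)]:
    print(name,
          mutual_information(feature, labels),
          gain_ratio(feature, labels),
          symmetric_uncertainty(feature, labels))
```

Both features reach the same MI, but the gain ratio and SU are lower for the feature with one distinct value per sample, which is the bias correction described above.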

Gini impurity²⁵:

Gini impurity is frequently used for feature selection in decision tree algorithms, such as classification and regression trees. The score is obtained by summing, over the class labels, the product of the probability of choosing a sample with a given class label and the probability of that sample being wrongly classified. This metric is defined as follows:

G = 1 - \sum_{i} p_{i}^{2}

(7)

where $i \in \{1, 2, \dots, C\}$ indexes the classes, $C$ is the number of classes, and $p_{i}$ is the proportion of samples that belong to class $i$.
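Equation (7) gives the impurity of a set of labels; the text does not spell out how it is turned into a score for a feature. The sketch below assumes one common choice, the decrease in impurity obtained when the samples are grouped by the (discrete) feature value, as done at a decision tree split. Both the choice and the names are assumptions for illustration.

```python
from collections import Counter

def gini_impurity(labels):
    """Equation (7): 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_decrease(feature, labels):
    """Impurity of the labels minus the weighted impurity within each feature value."""
    n = len(labels)
    weighted = 0.0
    for value, count in Counter(feature).items():
        subset = [y for x, y in zip(feature, labels) if x == value]
        weighted += (count / n) * gini_impurity(subset)
    return gini_impurity(labels) - weighted

labels = [0, 0, 1, 1]
informative = ["low", "low", "high", "high"]
noise = ["low", "high", "low", "high"]

print(gini_impurity(labels))
print(gini_decrease(informative, labels))
print(gini_decrease(noise, labels))
```

Under this assumed scoring, a feature that separates the classes perfectly removes all the impurity, while an unrelated feature removes none.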

Statistics-based methods

These types of filter-based methods evaluate the importance of features based on statistical metrics. The studied statistical methods are described below:

Preordonnances association²⁶:

The preordonnances association offers a way to measure the relevance between the preordonnances induced by an explanatory variable and the class label. This metric is defined as the Kendall rank coefficient between the ranks induced by the preordonnances. The strength of this coefficient is that it can deal with heterogeneous or mixed datasets. It also does not require any preprocessing of the variables. The higher the value of the preordonnances association, the stronger the relevance.

Let X and Y be a random variable and the class label, respectively; $P_{X}$ and $P_{Y}$ represent the variables’ preordonnances; $r_{P_{X}}$ and $r_{P_{Y}}$ are the rank variables induced by these preordonnances. The preordonnances association $ψ_{cor}$ is calculated as follows, where $τ$ denotes the Kendall rank coefficient:

ψ_{cor}(X, Y) = τ(r_{P_{X}}, r_{P_{Y}})

(8)
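The coefficient τ in equation (8) can be sketched as follows. This example only covers the Kendall step and assumes the two rank vectors have already been derived from the preordonnances; how those ranks are built is left to the cited reference. The tau-b variant, which corrects for ties, is an assumption made here because rank vectors related to a class label typically contain many ties.

```python
import math
from itertools import combinations

def kendall_tau_b(r1, r2):
    """Kendall tau-b between two equally long rank vectors (ties allowed)."""
    concordant = discordant = ties_1 = ties_2 = 0
    for i, j in combinations(range(len(r1)), 2):
        s1 = (r1[i] > r1[j]) - (r1[i] < r1[j])  # sign of the difference in r1
        s2 = (r2[i] > r2[j]) - (r2[i] < r2[j])  # sign of the difference in r2
        if s1 == 0 and s2 == 0:
            continue            # tied in both vectors: ignored
        elif s1 == 0:
            ties_1 += 1         # tied only in r1
        elif s2 == 0:
            ties_2 += 1         # tied only in r2
        elif s1 == s2:
            concordant += 1
        else:
            discordant += 1
    denom = math.sqrt((concordant + discordant + ties_1)
                      * (concordant + discordant + ties_2))
    return (concordant - discordant) / denom if denom else 0.0

# Hypothetical rank vectors, for illustration only.
print(kendall_tau_b([1, 2, 3, 4], [1, 2, 3, 4]))  # identical orderings
print(kendall_tau_b([1, 2, 3, 4], [4, 3, 2, 1]))  # reversed orderings
print(kendall_tau_b([1, 2, 3], [1, 3, 2]))        # one swapped pair
```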

Spearman correlation²⁷:

The Spearman correlation is a non-parametric metric. It measures the strength of the monotonic relationship between the ranks induced by a feature X and the target class Y. The closer the absolute value of the coefficient is to 1, the stronger the relationship. Precisely, if Y tends to increase when X increases, the coefficient is positive, while if Y tends to decrease when X increases, the coefficient is negative. The Spearman correlation is computed as follows:

ρ(X, Y) = 1 - \frac{6 \sum_{i} d_{i}^{2}}{n(n^{2} - 1)}

(9)

where $d_{i}$ denotes the difference between the two ranks induced by X and Y for each observation i, and n is the number of observations.
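A worked sketch of equation (9) follows. It assumes that neither variable contains tied values, since the closed form in equation (9) is exact only in that case; with a class label as Y, ties are unavoidable and a tie-aware rank correlation would be needed instead. Function names and data are illustrative.

```python
def ranks(values):
    """Ranks 1..n by increasing value (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman_rho(xs, ys):
    """Equation (9): 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

xs = [10.0, 20.0, 30.0, 40.0, 50.0]
print(spearman_rho(xs, [1, 2, 3, 4, 5]))  # same ordering
print(spearman_rho(xs, [5, 4, 3, 2, 1]))  # opposite ordering
print(spearman_rho(xs, [2, 1, 4, 3, 5]))  # two adjacent swaps
```

In the last call the rank differences are 1, 1, 1, 1 and 0, so the sum of squares is 4 and the coefficient is 1 - 24/120.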

Chi-square²⁸:

The chi-square filter is a statistical test that measures the divergence between the observed distribution and the distribution expected if the feature and the class label were independent. The greater the value of this statistic, the greater the dependency between the feature and the class label. It can be computed as given in equation (10), where $O_{ij}$ and $E_{ij}$ represent the observed and expected frequencies, respectively.

χ^{2} = \sum_{i} \sum_{j} \frac{(O_{ij} - E_{ij})^{2}}{E_{ij}}

(10)

Here, $i \in \{1, \dots, r\}$ indexes the $r$ bins used for the discretization of numerical variables, and $j \in \{1, \dots, c\}$ indexes the $c$ classes.
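Equation (10) can be sketched on a small contingency table. The expected frequency of a cell is taken as (row total × column total) / n, the standard value under independence. The equal-width binning helper is an assumption for illustration, since the discretization scheme is not specified in the text.

```python
from collections import Counter

def equal_width_bins(values, r):
    """Assign each value to one of r equal-width bins (assumed scheme)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / r or 1.0  # avoid a zero width for constant features
    return [min(int((v - lo) / width), r - 1) for v in values]

def chi_square(bins, labels):
    """Equation (10) computed from the bin-by-class contingency table."""
    n = len(bins)
    row_totals = Counter(bins)
    col_totals = Counter(labels)
    observed = Counter(zip(bins, labels))
    stat = 0.0
    for i, n_i in row_totals.items():
        for j, n_j in col_totals.items():
            expected = n_i * n_j / n
            stat += (observed[(i, j)] - expected) ** 2 / expected
    return stat

labels = [0, 0, 1, 1]
expression = [0.0, 1.0, 2.0, 3.0]          # hypothetical expression values
bins = equal_width_bins(expression, 2)     # two bins
print(bins)
print(chi_square(bins, labels))            # bins follow the class
print(chi_square([0, 1, 0, 1], labels))    # bins unrelated to the class
```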

Similarity-based methods

The similarity-based filters rank the features according to their ability to preserve the similarity of the studied datasets based on the class label. The used filters are described below:

Fisher score²⁹:

The Fisher score is the most popular similarity-based method. It is a supervised method that ranks the features according to the class variable. This measure favors features that keep instances of the same class close to each other and instances of different classes far apart. Let X be a feature and $\bar{x}$ be the mean of X, while ${\bar{x}}^{(k)}$ denotes the mean of X in the $k$th class, $n_{k}$ is the number of instances in the $k$th class, and $x_{j}^{(k)}$ represents the value of X for the $j$th instance in the $k$th class. The Fisher score is defined in equation (11) as the ratio of the between-class scatter of X to the within-class scatter of X summed over the C classes.

FS(X) = \frac{\sum_{k = 1}^{C} n_{k} ({\bar{x}}^{(k)} - \bar{x})^{2}}{\sum_{k = 1}^{C} \sum_{j = 1}^{n_{k}} (x_{j}^{(k)} - {\bar{x}}^{(k)})^{2}}

(11)
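Equation (11) translates directly into code. The sketch below is an illustration with hypothetical values; returning infinity when the within-class scatter is zero is a convention chosen here, not one stated in the text.

```python
from collections import defaultdict

def fisher_score(xs, labels):
    """Equation (11): between-class scatter over within-class scatter."""
    overall_mean = sum(xs) / len(xs)
    groups = defaultdict(list)
    for x, y in zip(xs, labels):
        groups[y].append(x)
    between = within = 0.0
    for values in groups.values():
        class_mean = sum(values) / len(values)
        between += len(values) * (class_mean - overall_mean) ** 2
        within += sum((v - class_mean) ** 2 for v in values)
    return between / within if within > 0 else float("inf")

labels = [0, 0, 0, 1, 1, 1]
separated = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]  # classes far apart
mixed = [1.0, 8.0, 3.0, 7.0, 2.0, 9.0]      # classes overlap

print(fisher_score(separated, labels))  # 54 / 4
print(fisher_score(mixed, labels))      # 6 / 52
```

For the separated feature, the class means are 2 and 8 around an overall mean of 5, giving a between-class scatter of 54 against a within-class scatter of 4.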

Laplacian score³⁰:

The Laplacian score is an unsupervised method that ranks the features according to the Laplacian matrix and the variance. A feature is relevant if it preserves the graph structure represented by the Laplacian matrix. Let X be an explanatory variable and W the similarity weight matrix computed from the nearest neighbor graph. If observations $x_{i}$ and $x_{j}$ are adjacent in this graph, $W_{ij}$ is equal to $\exp(-\frac{|x_{i} - x_{j}|}{cte})$, where cte is a constant; otherwise, $W_{ij} = 0$. A diagonal matrix D is constructed from W, where each diagonal entry is the corresponding row sum of W and all off-diagonal entries are 0. The Laplacian score is computed as follows:

L S (X) = \frac{{\tilde{X}}^{T} L \tilde{X}}{{\tilde{X}}^{T} D \tilde{X}}

(12)

where $\tilde{X} = X - \frac{X^{T} D I}{I^{T} D I} I$ is the transformed feature, $I$ is the vector of ones, and $L = D - W$ is the graph Laplacian.
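The steps behind equation (12) can be sketched in plain Python. Several choices below are assumptions for illustration: the graph is a symmetrized k-nearest-neighbor graph built from Euclidean distances between samples, `k` and `cte` are free parameters, and a constant feature returns NaN because its score is undefined.

```python
import math

def knn_weight_matrix(samples, k, cte):
    """Symmetric k-nearest-neighbor graph with weights exp(-distance / cte)."""
    n = len(samples)
    dist = [[math.dist(a, b) for b in samples] for a in samples]
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        neighbors = sorted((j for j in range(n) if j != i),
                           key=lambda j: dist[i][j])[:k]
        for j in neighbors:
            W[i][j] = W[j][i] = math.exp(-dist[i][j] / cte)
    return W

def laplacian_score(f, W):
    """Equation (12) for one feature vector f, given the weight matrix W."""
    n = len(f)
    d = [sum(row) for row in W]  # diagonal of D (row sums of W)
    shift = sum(fi * di for fi, di in zip(f, d)) / sum(d)
    ft = [fi - shift for fi in f]  # transformed feature
    f_d_f = sum(di * v * v for di, v in zip(d, ft))
    f_w_f = sum(W[i][j] * ft[i] * ft[j] for i in range(n) for j in range(n))
    if f_d_f == 0:
        return float("nan")  # constant feature: score undefined
    return (f_d_f - f_w_f) / f_d_f  # uses L = D - W

# Hand-made graph: samples 0-1 are neighbors, samples 2-3 are neighbors.
W = [[0.0, 1.0, 0.0, 0.0],
     [1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 1.0],
     [0.0, 0.0, 1.0, 0.0]]
smooth = [0.0, 0.0, 1.0, 1.0]  # constant within each neighborhood
rough = [0.0, 1.0, 0.0, 1.0]   # changes across every edge

print(laplacian_score(smooth, W))
print(laplacian_score(rough, W))
```

With equation (12) as written, the feature that stays constant along graph edges obtains the smaller value, so a ranking based on this score has to treat small values as the ones that preserve the graph structure.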

Experimental protocol and results

This section presents the methodology followed in this article to evaluate the filter-based methods: the datasets, performance measures, classifiers, and techniques used for pairwise comparison. It also discusses and interprets the results.

Datasets description

To compare the performance of the studied filter-based methods and evaluate their suitability for microarray data, we use eight real-world microarray datasets, which include binary and multi-class datasets. Detailed statistics about these datasets are given in Table 1. The datasets are high-dimensional and are available in R packages such as “sdwd” and “varbvs.”

Table 1.

Summary of the studied microarray datasets.

No.	Dataset	No. of observations	No. of features	No. of classes
1	Colon³⁵	62	2000	02
2	Leukemia³⁶	72	3571	02
3	Breast³⁷	78	4348	02
4	CNS³⁸	60	7129	02
5	Lung³⁹	181	12,534	02
6	Prostate⁴⁰	102	12,600	02
7	Lymphoma⁴¹	62	4026	03
8	SRBCT⁴¹	83	2309	04

SRBCT: small, round blue-cell tumor; CNS: central nervous system.

Performance measure and classifiers

The performance measure used to evaluate the studied filter methods is the classification accuracy described in equation (13), where TP and TN represent the number of positive and negative samples correctly classified, respectively, and FP and FN denote the number of samples wrongly classified as positive and negative, respectively. To assess the filters with different classifiers, we used the two classifiers most commonly employed to validate a filter ranking method. The SVM³¹ classifier is widely used in feature selection and classification tasks because of its performance and robustness on high-dimensional datasets, such as microarray datasets. The main idea of this classifier is to select a small number of boundary vectors for each label and then separate them using a linear hyperplane. The NB³² classifier is a probabilistic classifier that assumes the features are conditionally independent given the class; it calculates the probability of each label for an instance. For the evaluation, 10-fold cross-validation is used.

Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \times 100

(13)
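Equation (13) can be illustrated for the binary case with the short sketch below. The label encoding (1 for the positive class) and the example predictions are assumptions made for illustration.

```python
def accuracy(y_true, y_pred, positive=1):
    """Equation (13): percentage of correctly classified samples."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return 100 * (tp + tn) / (tp + tn + fp + fn)

y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
print(accuracy(y_true, y_pred))  # TP = 2, TN = 1, FP = 1, FN = 1
```

In a 10-fold cross-validation, this quantity would be computed on each held-out fold and then averaged.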

Statistical tests for performance comparison

In order to establish a statistical analysis and comparison of the studied filter-based methods, the Friedman test and post hoc techniques are used. First, the non-parametric Friedman test³³ is employed to test for statistical differences among the methods over the microarray datasets. This test ranks the studied methods according to their results (accuracy) on each dataset. Under the null hypothesis, all methods perform equally; the test is carried out at the significance level $α$, which is set to 0.05 in this study. If the null hypothesis is rejected (statistically significant differences are detected), the post hoc test is then applied for a pairwise comparison of the studied filters.
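The ranking step and the Friedman statistic can be sketched as follows. The statistic uses the standard form χ²_F = 12N / (k(k + 1)) · (Σ_j R_j² − k(k + 1)² / 4), with N datasets, k methods, and R_j the average rank of method j; no tie correction is applied, which is a simplification. The input is a small subset of Table 2 (MI, SpC, and Laplacian on the first four datasets) used purely for illustration, so the output is not the value behind Table 6.

```python
def average_ranks(scores):
    """scores[d][m]: accuracy of method m on dataset d. Rank 1 is the best."""
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda m: -row[m])
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2 + 1  # tied methods share the mean rank
            for m in order[pos:end + 1]:
                totals[m] += shared
            pos = end + 1
    return [t / len(scores) for t in totals]

def friedman_statistic(scores):
    """Friedman chi-square statistic (no tie correction)."""
    n, k = len(scores), len(scores[0])
    r = average_ranks(scores)
    return 12 * n / (k * (k + 1)) * (sum(x * x for x in r) - k * (k + 1) ** 2 / 4)

# Columns: MI, SpC, Laplacian. Rows: Colon, Leukemia, Breast, CNS (SVM, Table 2).
scores = [
    [89.60, 89.31, 87.46],
    [99.10, 97.84, 86.06],
    [77.41, 87.43, 61.31],
    [87.14, 80.43, 65.04],
]
print(average_ranks(scores))
print(friedman_statistic(scores))
```

The statistic is compared with a chi-square distribution with k − 1 degrees of freedom; for this three-method illustration the 0.05 critical value is 5.991.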

Feature ranking similarity of the filter methods

The key idea of this analysis is to evaluate the similarity of the studied filter methods based on feature ranking. First, we generate the selection orders for all methods and datasets. Then, for each dataset, we compute the rank correlation between the selection orders of all pairs of methods, and the average results of all datasets are used to assess the feature ranking similarity.
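The procedure described above can be sketched as follows. The text does not name the rank correlation used, so Spearman's coefficient is an assumption here, as are the method names "A", "B", "C" and the gene identifiers, which are hypothetical.

```python
from itertools import combinations

def spearman_between_orders(order_a, order_b):
    """Spearman correlation between two selection orders of the same features."""
    position_b = {feature: rank for rank, feature in enumerate(order_b)}
    n = len(order_a)
    d_squared = sum((rank - position_b[feature]) ** 2
                    for rank, feature in enumerate(order_a))
    return 1 - 6 * d_squared / (n * (n * n - 1))

def mean_pairwise_similarity(orders_by_dataset):
    """Average, over datasets, of the rank correlation for each pair of methods."""
    methods = sorted(orders_by_dataset[0])
    result = {}
    for a, b in combinations(methods, 2):
        values = [spearman_between_orders(d[a], d[b]) for d in orders_by_dataset]
        result[(a, b)] = sum(values) / len(values)
    return result

# One dict per dataset: method name -> features ordered from most to least relevant.
orders_by_dataset = [
    {"A": ["g1", "g2", "g3", "g4"],
     "B": ["g1", "g2", "g4", "g3"],
     "C": ["g4", "g3", "g2", "g1"]},
    {"A": ["g1", "g2", "g3", "g4"],
     "B": ["g1", "g2", "g3", "g4"],
     "C": ["g1", "g2", "g3", "g4"]},
]
for pair, value in mean_pairwise_similarity(orders_by_dataset).items():
    print(pair, value)
```

The resulting dictionary holds one mean correlation per pair of methods, which is the kind of quantity summarized in a pairwise similarity matrix.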

Results

Tables 2 and 3 present the accuracies of the nine studied filter methods for the eight microarray datasets, using the SVM and NB classifiers. An examination of these tables reveals that the gain ratio (Gain) method achieves the greatest number of best accuracies (marked in bold): it was the best seven times, followed by MI, Gini, and the preordonnances association (PrC), which were the best six times with results close to those obtained by the gain ratio. The SU, chi-square, and Fisher methods were the best four times, while the Spearman correlation (SpC) method obtained the best results two times.

Table 2.

Results (accuracy) of the three studied types of filter methods using SVM classifier.

Dataset	Entropy-based methods				Statistics-based methods			Similarity-based methods
Dataset	MI	Gain	SU	Gini	PrC	SpC	Chi-square	Fisher	Laplacian	Original
Colon	89.60	92.87	90.72	89.94	90.42	89.31	89.95	90.65	87.46	84
Leukemia	99.10	98.82	98.55	98.65	98.87	97.84	98.85	98.77	86.06	98.4
Breast	77.41	77.51	81.90	79.59	79.77	87.43	79.33	78.84	61.31	60.76
CNS	87.14	87.60	87.32	83.87	74.76	80.43	84.01	84.83	65.04	64.71
Lung	99.45	99.44	99.41	99.42	99.98	99.84	99.42	99.44	93.89	99.13
Prostate	95.96	95.77	95.34	95.10	95.96	94.29	95.19	95.29	69.72	91.9
Lymphoma	100	100	100	100	100	100	100	100	98.28	99.89
SRBCT	100	100	100	100	100	99.95	99.98	100	81.52	99.87
Average rank	3.93	3.56	4.12	5.43	3.43	8	5.25	4.43	9	-

SRBCT: small, round blue-cell tumor; CNS: central nervous system; SVM: support vector machine; MI: mutual information; SU: symmetric uncertainty.

Table 3.

Results (accuracy) of the three studied types of filter methods using NB classifier.

Dataset	Entropy-based methods				Statistics-based methods			Similarity-based methods
Dataset	MI	Gain	SU	Gini	PrC	SpC	Chi-square	Fisher	Laplacian	Original
Colon	91.89	93.51	91.60	91.90	91.96	88.78	91.86	91.96	90.22	83.37
Leukemia	98.22	98.25	98.41	98.50	98.22	97.14	98.69	98.13	81.67	98.43
Breast	79.83	80.06	81.35	80.70	84.21	77.68	80.82	82.46	65.93	62.05
CNS	81.97	82.01	78.19	82.09	77.70	77.53	82.04	81.07	58.48	61.07
Lung	100	100	100	100	100	99.54	100	99.44	92.93	98.34
Prostate	94.45	94.34	94.99	94.57	94.40	93.11	94.52	95.05	61.90	62.76
Lymphoma	100	100	100	100	99.48	98.74	100	100	98.35	95.1
SRBCT	99.89	99.85	99.90	99.98	98.73	95.73	99.94	99.93	64.50	98.43
Average rank	4.81	4.25	4	2.87	4.93	8	3.25	4	8.87	-

SRBCT: small, round blue-cell tumor; CNS: central nervous system; NB: Naive Bayes; MI: mutual information; SU: symmetric uncertainty.

Comparing the average ranks, the gain ratio and Gini methods perform best among the entropy-based methods using the SVM and NB classifiers, respectively. Among the statistics-based methods, the PrC obtains the best average rank with the SVM classifier, followed by the chi-square filter, whereas the chi-square filter obtains the best average rank with the NB classifier. For both classifiers, the PrC method achieves an accuracy of >90% for all the datasets except the breast and CNS datasets, where the accuracies were nevertheless far better than those obtained with the original datasets. Finally, the similarity-based methods as a group do not perform as well as the other methods. Among them, the Fisher score is far better than the Laplacian score, which obtains the worst average rank with both classifiers.

To investigate the impact of the number of selected features on the performance of the filter methods, Tables 4 and 5 display the accuracies for subset sizes from 2 to 60 features, in steps of two, for the different categories of filters (entropy, statistics, and similarity-based methods), using the SVM (Table 4) and NB (Table 5) classifiers with 10-fold cross-validation repeated 100 times. On the one hand, we observe that accuracy improves as the number of features increases in almost all cases, and that some filters give similar results when the number of selected features is large enough. On the other hand, the results show once more that the gain ratio outperforms the other entropy-based methods, although their results remain close; in particular, the Gini index agrees with the gain ratio in 98% of the cases, which was demonstrated mathematically by Raileanu and Stoffel.³⁴ Among the statistics-based methods, PrC performs better than SpC and chi-square. However, for some datasets, the SpC performs extremely well compared to the PrC method. It can be concluded that the PrC is stable when compared to the other statistics-based methods studied. Finally, for similarity-based methods, the Fisher score performs far better than the Laplacian score for all the datasets using both classifiers.

Table 4.

Classification accuracy versus the number of selected features for different types of filter methods using SVM classifier.

[Table body not recovered. Rows: Colon, Leukemia, Breast, CNS, Lung, Prostate, Lymphoma, SRBCT. Columns: entropy-based, statistics-based, and similarity-based methods.]

SRBCT: small, round blue-cell tumor; CNS: central nervous system; SVM: support vector machine; MI: mutual information; SU: symmetric uncertainty.

Table 5.

Classification accuracy versus the number of selected features for different types of filter methods using NB classifier.

[Table body not recovered. Rows: Colon, Leukemia, Breast, CNS, Lung, Prostate, Lymphoma, SRBCT. Columns: entropy-based, statistics-based, and similarity-based methods.]

SRBCT: small, round blue-cell tumor; CNS: central nervous system; NB: Naive Bayes; MI: mutual information; SU: symmetric uncertainty.

The results of the Friedman test on the accuracies of the studied filter methods are presented in Table 6. For both classifiers, the p-value is below 0.05, which means that statistically significant differences are detected. Tables 7 and 8 present the post hoc pairwise comparison of the accuracies of the studied filter methods with the SVM and NB classifiers, respectively. According to the SVM results (Table 7), the Laplacian score from the similarity-based methods is clearly the worst method: it differs significantly from every other studied method. With the NB classifier (Table 8), there is a significant difference between the SpC method and the MI, gain ratio, SU, Gini, chi-square, and Fisher filters. The results also confirm that the Laplacian score is the worst method and should be avoided.

Table 6.

Results (p-value) of the Friedman test for the comparison among the studied filter methods using SVM and NB classifiers.

Classifier	p-value
SVM	$3.47 \times 10^{-4}$
NB	$4.194 \times 10^{-6}$

SVM: support vector machine; NB: Naive Bayes.

Table 7.

Results (p-value) achieved on post hoc comparison for $α = 0.05$ using SVM classifier.

	MI	Gain	SU	Gini	PrC	SpC	Chi-square	Fisher	Laplacian
MI	1	0.815	0.852	0.246	0.779	0.152	0.307	0.6750	0.00015
Gain	–	1	0.675	0.165	0.962	0.0975	0.211	0.515	0.00015
SU	–	–	1	0.675	0.852	0.2116	0.403	0.815	0.0058
Gini	–	–	–	1	0.152	0.779	0.888	0.457	0.0099
PrC	–	–	–	–	1	0.088	0.195	0.485	0.0001
SpC	–	–	–	–	–	1	0.675	0.307	0.0204
Chi-square	–	–	–	–	–	–	1	0.543	0.0068
Fisher	–	–	–	–	–	–	–	1	0.0011
Laplacian	–	–	–	–	–	–	–	–	1

SRBCT: small, round blue-cell tumor; CNS: central nervous system; SVM: support vector machine; MI: mutual information; SU: symmetric uncertainty.

Table 8.

Results (p-value) achieved on post hoc comparison for $α = 0.05$ using NB classifier.

	MI	Gain	SU	Gini	PrC	SpC	Chi-square	Fisher	Laplacian
MI	1	0.686	0.560	0.168	0.591	0.0282	0.265	0.53097	0.0043
Gain	–	1	0.857	0.326	0.348	0.0102	0.474	0.822	0.0013
SU	–	–	1	0.421	0.265	0.006	0.591	0.964	0.0007
Gini	–	–	–	1	0.057	0.0005	0.788	0.447	$5.5 \times 10^{-5}$
PrC	–	–	–	–	1	0.092	0.101	0.246	0.018
SpC	–	–	–	–	–	1	0.0013	0.005	0.4742
Chi-square	–	–	–	–	–	–	1	0.6223	0.0001
Fisher	–	–	–	–	–	–	–	1	0.0001
Laplacian	–	–	–	–	–	–	–	–	1

SRBCT: small, round blue-cell tumor; CNS: central nervous system; NB: Naive Bayes; MI: mutual information; SU: symmetric uncertainty.

In terms of feature ranking, Figure 1 shows the mean rank correlation between all pairs of methods. The higher the correlation between two methods, the more similar their rankings are. The gain ratio and MI methods produce the highest value (0.883), which makes sense because the gain ratio is a modified version of MI. They are followed by the chi-square and Gini methods (0.713), the chi-square and SU methods (0.132), the Laplacian score and Fisher score methods (0.085), and the PrC and chi-square methods (0.075). The remaining pairs of methods are not similar: their mean rank correlation is negative.

Figure 1.

Mean rank correlation between the selection order of all pairs of studied filters.

Conclusion

In this article, nine univariate filters were studied using eight microarray datasets. We employed various methods for filter evaluation and pairwise comparison. On the one hand, the comparative assessment in terms of classification accuracy used statistical tests. On the other hand, the feature ranking similarity used a rank correlation measure. The filter-based methods were grouped into three categories for a more accurate study.

As a general finding from our study, the preordonnances association measure performed the best among all the methods with the SVM classifier in terms of average rank. The gain ratio and MI were the second and third best filters, respectively. One of the most important findings was that the accuracies obtained were very close to each other, with the exception of the Laplacian score, which was significantly lower. In contrast, the filters rank the features differently, except for those with similar evaluation criteria. These extensive experiments can serve as a reference for researchers choosing a filter feature selection method depending on the datasets or classifiers used.

Footnotes

Author’s note

The paper was selected from ICAMDS22.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Fatima Ezzahra Rebbah

References

1. Mai, Zou. The Kolmogorov filter for variable screening in high-dimensional binary classification. Biometrika 2013; 100(1): 229–234.

2. Uzma, Al-Obeidat, Tubaishat, et al. Gene encoder: a feature selection technique through unsupervised deep learning-based clustering for large gene expression data. Neural Comput Appl 2022; 34(11): 8309–8331.

3. Vijay SAA, GaneshKumar. Fuzzy expert system based on a novel hybrid stem cell (HSC) algorithm for classification of micro array data. J Med Syst 2018; 42(4): 61.

4. Salem, Attiya, El-Fishawy. Classification of human cancer diseases by gene expression profiles. Appl Soft Comput 2017; 50: 124–134.

5. Sangaiah, Vincent Antony Kumar. Improving medical diagnosis performance using hybrid feature selection via relieff and entropy based genetic search (RF-EGA) approach: application to breast cancer prediction. Cluster Comput 2019; 22(3): 6899–6906.

6. Chamlal, Ouaderhman, Rebbah. A novel filter based feature selection approach for microarray dataset. In 2021 Fifth International Conference on Intelligent Computing in Data Sciences (ICDS). pp. 1–6.

7. Yan. Evaluating Ensemble Learning Impact on Gene Selection for Automated Cancer Diagnosis. In Shaban-Nejad A and Michalowski M (eds.) Precision Health and Medicine: A Digital Revolution in Healthcare. Studies in Computational Intelligence, Cham: Springer International Publishing. ISBN 978-3-030-24409-5, 2020. pp. 183–186.

8. Almugren, Alshamlan. FF-SVM: New FireFly-based Gene Selection Algorithm for Microarray Cancer Classification. In 2019 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB). pp. 1–6.

9. Chatra, Kuppili, Edla, et al. Cancer data classification using binary bat optimization and extreme learning machine with a novel fitness function. Med Biol Eng Comput 2019; 57(12): 2673–2682.

10. Chinnaswamy, Srinivasan. Hybrid Feature Selection Using Correlation Coefficient and Particle Swarm Optimization on Microarray Gene Expression Data. In Snášel, Abraham, Krömer, et al. (eds.) Innovations in Bio-Inspired Computing and Applications. Advances in Intelligent Systems and Computing, Cham: Springer International Publishing. ISBN 978-3-319-28031-8, pp. 229–239.

11. Alomari, Khader, Al-Betar, et al. A Hybrid Filter-Wrapper Gene Selection Method for Cancer Classification. In 2018 2nd International Conference on BioSignal Analysis, Processing and Systems (ICBAPS). pp. 113–118.

12. Musheer, Verma, Srivastava. Novel machine learning approach for classification of high-dimensional microarray data. Soft Comput 2019; 23(24): 13409–13421.

13. Chamlal, Benzmane, Ouaderhman. Elastic net-based high dimensional data selection for regression. Expert Syst Appl 2024; 244: 122958.

14. Janane, Ouaderhman, Chamlal. A filter feature selection for high-dimensional data. J Algorithm Comput Technol 2023; 17: 17483026231184171.

15. Rouhi, Nezamabadi-pour. Filter-based feature selection for microarray data using improved binary gravitational search algorithm. In 2018 3rd Conference on Swarm Intelligence and Evolutionary Computation (CSIEC). pp. 1–6.

16. Kavitha, Prakasan, Dhrishya. Score-Based Feature Selection of Gene expression Data for Cancer Classification. In 2020 Fourth International Conference on Computing Methodologies and Communication (ICCMC). pp. 261–266.

17. Ke, et al. A new filter feature selection based on criteria fusion for gene microarray data. IEEE Access 2018; 6: 61065–61076.

18. Chamlal, Ouaderhman, Aaboub. A graph based preordonnances theoretic supervised feature selection in high dimensional data. Knowl Based Syst 2022; 257: 109899.

19. Zhang, Hou, Wang, et al. Feature selection for microarray data classification using hybrid information gain and a modified binary Krill Herd algorithm. Interdiscipl Sci: Computat Life Sci 2020; 12(3): 288–301.

20. Dabba, Tari, Meftali. Hybridization of Moth flame optimization algorithm and quantum computing for gene selection in microarray data. J Ambient Intell Humaniz Comput 2021; 12(2): 2731–2750.

21. Baliarsingh, Vipsita, Dash. A new optimal gene selection approach for cancer classification using enhanced Jaya-based forest optimization algorithm. Neural Comput Appl 2020; 32(12): 8599–8616.

22. Macedo, Valadas, Carrasquinha, et al. Feature selection using decomposed mutual information maximization. Neurocomputing 2022; 513: 215–232.

23. Alanni, Hou, Azzawi, et al. New Gene Selection Method Using Gene Expression Programing Approach on Microarray Data Sets. In Lee R (ed.) Computer and Information Science. Studies in Computational Intelligence, Cham: Springer International Publishing. ISBN 978-3-319-98693-7, 2019. pp. 17–31.

24. Potharaju, Sreedevi. Distributed feature selection (DFS) strategy for microarray gene expression data to improve the classification performance. Clin Epidemiol Glob Health 2019; 7(2): 171–176.

25. Bommert, Sun, Bischl, et al. Benchmark for filter methods for feature selection in high-dimensional classification data. Comput Stat Data Anal 2020; 143: 106839.

26. Chamlal, Ouaderhman, Rebbah. A hybrid feature selection approach for microarray datasets using graph theoretic-based method. Inf Sci (Ny) 2022; 615: 449–474.

27. Shukla, Singh, Vardhan. DNA Gene Expression Analysis on Diffuse Large B-Cell Lymphoma (DLBCL) Based on Filter Selection Method with Supervised Classification Method. In Behera, Nayak, Naik, et al. (eds.) Computational Intelligence in Data Mining. Advances in Intelligent Systems and Computing, Singapore: Springer. ISBN 978-981-10-8055-5, pp. 783–792.

28. Ahakonye LAC, Nwakanma, Lee, et al. SCADA intrusion detection scheme exploiting the fusion of modified decision tree and Chi-square feature selection. Internet of Things 2023; 21: 100676.

29. Sun, Wang, Ding, et al. Feature selection using Fisher score and multilabel neighborhood rough sets for multilabel classification. Inf Sci (Ny) 2021; 578: 887–912.

30. Javandel, Vakilian, Firuzi. Multiple partial discharge sources separation using a method based on laplacian score and correlation coefficient techniques. Electric Power Systems Research 2022; 210: 108070.

31. Boser, Guyon, Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on Computational learning theory. COLT ’92, New York, NY, USA: Association for Computing Machinery. ISBN 978-0-89791-497-0, pp. 144–152. DOI:10.1145/130385.130401.

32. Langley, Iba, Thompson. An analysis of Bayesian classifiers. Proceedings of the Tenth National Conference on Artificial Intelligence 1998; 90.

33. Friedman. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Am Stat Assoc 1937; 32(200): 675–701.

34. Raileanu, Stoffel. Theoretical comparison between the Gini index and information gain criteria. Ann Math Artif Intell 2004; 41(1): 77–93.

35. Alon, Barkai, Notterman, et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences 1999; 96(12): 6745–6750.

36. Golub, Slonim, Tamayo, et al. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 1999; 286(5439): 531–537.

37. Van’t Veer, Dai, Van De Vijver, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002; 415(6871): 530–536.

38. Pomeroy, Tamayo, Gaasenbeek, et al. Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature 2002; 415(6870): 436–442.

39. Gordon, Jensen, Hsiao, et al. Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Res 2002; 62(17): 4963–4967.

40. Singh, Febbo, Ross, et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 2002; 1(2): 203–209.

41. Zhu, Ong, Dash. Markov blanket-embedded genetic algorithm for gene selection. Pattern Recognit 2007; 40(11): 3236–3248.

Accurate analysis for univariate-based filter methods for microarray data classification
