Abstract
In a classification problem, before building a prediction model, it is very important to identify the informative features rather than using tens of thousands of them, which may penalize some learning methods and increase the risk of over-fitting. To overcome these problems, the best solution is feature selection. In this article, we propose a new filter method for feature selection that combines the Relief filter algorithm and the multi-criteria decision-making method called TOPSIS (Technique for Order Preference by Similarity to Ideal Solution): we model the feature selection task as a multi-criteria decision problem. Exploiting the Relief methodology, a decision matrix is computed and delivered to TOPSIS in order to rank the features. The proposed method thus produces a ranking of the features from the best to the worst. To evaluate the performance of the suggested approach, a simulation study including a set of experiments and case studies was conducted on three synthetic dataset scenarios. The obtained results confirm the effectiveness of the proposed filter in detecting the most informative features.
Introduction
Feature selection 1 has become a primordial process in all classification problems. Given the recent technologies in the field of high-dimensional data collection, the large number of features has become a real curse for researchers and data analysts: their presence can penalize learning methods and increase the risk of over-fitting. The benefits of feature selection include facilitating the analysis, understanding, and interpretation of data, improving predictive results, and reducing computation time. Feature selection is a procedure characterized by four principal elements: the type of approach, the search strategy and search direction, the evaluation function, and the stopping criterion. In the literature, there are many feature selection methods, which are subdivided into three types2–4 including filters, wrappers, and embedded methods. One of the major challenges for feature selection is the dimension of the search space, which increases exponentially with the number of features; this makes an exhaustive search strategy computationally difficult to implement. To overcome this problem, many search strategies have been developed to find a near-optimal solution. As for the search direction, it can be of three types: forward selection, backward elimination, and bidirectional search. In order to evaluate the discriminating power of a feature, or of a set of features, an evaluation function is applied; the evaluation criteria can generally be subdivided into five categories, based on the quantity of information, distance, independence, consistency, and the precision provided by a classifier. Information-based criteria rely on the amount of information a variable contributes about the target one. Independence criteria include all measures of association or correlation and assess the dependence between the target feature and the explanatory ones, while precision criteria use classifiers as the evaluation function. The stopping criterion is used to stop the selection process; it can be a number predefined by the user, and its choice can also depend on the obtained results, as when adding or removing a feature yields no performance improvement.
Filters 5 for feature selection evaluate features according to their intrinsic properties; among their advantages are their rapidity and their independence of the learning methods. We can also distinguish between univariate and multivariate filters: in univariate filters, the features are evaluated according to their individual discriminating power, whereas multivariate filters essentially look for subsets that can collectively answer the given classification problem, taking into consideration the notions of complementarity and interaction between features. We cite in this category mutual information, 6 gain ratio, 7 relevance-complementarity preordonnance-based selection, 8 symmetrical uncertainty, 9 the Fisher score, 10 information gain, 11 the chi-square test, 12 minimum redundancy maximum relevance, 13 relevance-redundancy preordonnance-based selection, 14 correlation-based filter selection, 15 Pearson correlation, 16 the fast correlation-based filter, 17 and Relief-based algorithms. 18 In wrapper methods, the process of selecting features is carried out in parallel with the learning process, where each subset of features is evaluated by a learning method; this simultaneous evaluation of subsets of features allows us to take into account the interactions between features. Among the existing wrapper methods, we cite the recursive feature elimination algorithm, 19 methods with a population-based search strategy, such as the genetic algorithm20,21 and the particle swarm optimization algorithm, 22 and those with a sequential search strategy, such as the sequential forward selection algorithm 23 and the sequential backward selection algorithm. 23 Finally, in embedded methods, feature selection is integrated into the learning method. They have the advantage of being generally faster than wrapper methods by avoiding the back-and-forth process between selection and learning evaluation; they can be influenced by the choice of the learning method and also allow for the consideration of interactions between features. For example, we cite the Lasso 24 and elastic net 24 algorithms, the random forest algorithm, 25 recursive feature elimination for support vector machines, 19 etc. Among the best-known filters in the field of feature selection is Relief, 26 an algorithm invented by Kira and Rendell and inspired by instance-based learning methods. It is exclusively designed for binary classification problems and evaluates features with a global score computed through randomly selected instances. In order to adapt Relief to handle problems such as missing data, multiclass data, and regression, as well as noise handling, several variants of Relief have been developed, the most famous being ReliefF; 27 many extensions 18 have also been introduced after the ReliefF algorithm, such as RReliefF, EReliefF, SURF, SWRF*, MultiSURF, MultiSURF*, TuRF, Evaporative Cooling ReliefF, ReliefMSS, LH-RELIEF, ReliefSeq, and iVLSReliefF. In the evaluation process of Relief, some instances may contribute excessively to the scoring, leading to the penalization of many features. The main objective of this work is to propose a new filter for feature selection: by combining the Relief method and the multi-criteria decision-making (MCDM) method named TOPSIS (Technique for Order Preference by Similarity to Ideal Solution), 28 we have developed a multi-criteria-based approach to select the relevant features.
Through the Relief method, we have constructed a decision matrix on which we have applied TOPSIS: each feature is scored individually without being penalized, and at the end we obtain a ranking of all the features.
The main contributions of this work are presented as follows:
- Modeling feature selection as an MCDM problem exploiting the fundamental concepts of Relief.
- By combining a filter and a powerful MCDM method, a new filter has been proposed allowing the ranking and selection of relevant features.
- The TOPSIS method is used to determine the relevant features.
- A simulation study including different scenarios was conducted to validate the efficiency of the proposed method.
This article is structured as follows: after this introduction, the related work and the background are given in the “Related work and background” section; our proposed approach is described in “The proposed method” section; the experimental results are reported in the “Experiments and results” section; and finally, a conclusion is presented in the “Conclusion” section.
Related work and background
In the field of feature selection, filter methods are commonly used due to their rapidity and simplicity. In general, filters use the intrinsic or statistical characteristics of the features as an evaluation measure and adopt ranking techniques. Instead of an explicit optimal subset of features, filter methods generate a ranked list of features. They have the advantage of being independent of the classification algorithms, which leads to general selection results and avoids the problem of over-fitting. Filters generally require less computational time than other feature selection techniques, which makes them easy to adapt to high-dimensional data. As already mentioned, they can be divided into two categories: univariate and multivariate. Univariate methods assess features individually, whereas multivariate methods consider subsets of features in the assessment procedure.
Relief
Relief is a very well-known and commonly used univariate filter in feature selection due to its simplicity. It was created by Kira and Rendell, inspired by instance-based learning methods, and it is designed only for binary classification problems. The Relief algorithm is also designed to handle noisy and redundant data. The feature weighting process in the Relief algorithm is essentially based on an analysis of instance pairs, more precisely, on a study of the differences and similarities between an instance and its nearest neighbor in the same class as well as its nearest neighbor in the other class. The quality of a feature increases if it brings two instances of the same class closer together and separates two instances of two different classes; if the opposite holds, its quality decreases. In other words, the more a feature varies within a class, the less relevant it is; conversely, the more it varies between different classes, the more its importance score increases. Given a randomly selected instance $x_i$, Relief identifies its nearest neighbor from the same class, called the nearest hit $H$, and its nearest neighbor from the other class, called the nearest miss $M$, and updates the weight of each feature $X_j$ as follows:

$$W[X_j] \leftarrow W[X_j] - \frac{\mathrm{diff}(X_j, x_i, H)}{m} + \frac{\mathrm{diff}(X_j, x_i, M)}{m} \quad (1)$$

where $m$ is the number of randomly selected instances.
Pseudo code of the Relief algorithm
The $\mathrm{diff}$ function measures the difference between the values of a feature $X_j$ for two instances $I_1$ and $I_2$. For discrete (categorical) features, it is defined as

$$\mathrm{diff}(X_j, I_1, I_2) = \begin{cases} 0 & \text{if } \mathrm{value}(X_j, I_1) = \mathrm{value}(X_j, I_2) \\ 1 & \text{otherwise} \end{cases} \quad (2)$$

For continuous (ordinal or numerical) features, the delta function is normalized by the range of the feature:

$$\mathrm{diff}(X_j, I_1, I_2) = \frac{|\mathrm{value}(X_j, I_1) - \mathrm{value}(X_j, I_2)|}{\max(X_j) - \min(X_j)} \quad (3)$$
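To make the procedure concrete, the following is a minimal sketch of the Relief weight update in Python, assuming a NumPy feature matrix X of shape (n, p) with continuous features scaled to [0, 1] (so that equation (3) reduces to an absolute difference) and a binary label vector y; the function name and the choice of Manhattan distance for the neighbor search are illustrative assumptions, not a reference implementation.

```python
# A minimal sketch of the original Relief algorithm (binary classes).
# Assumes continuous features scaled to [0, 1]; discrete features would
# use the 0/1 diff of equation (2) instead.
import numpy as np

def relief(X, y, m, rng=None):
    """Return one weight per feature, estimated from m random instances."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    W = np.zeros(p)
    for _ in range(m):
        i = rng.integers(n)
        same = (y == y[i])
        same[i] = False                        # the instance itself is not a hit
        dists = np.abs(X - X[i]).sum(axis=1)   # Manhattan distance to all instances
        hit = np.argmin(np.where(same, dists, np.inf))       # nearest same-class
        miss = np.argmin(np.where(y != y[i], dists, np.inf)) # nearest other-class
        # equation (1): penalize within-class variation, reward between-class variation
        W += (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / m
    return W
```

A call such as relief(X, y, m=20) returns larger weights for features that separate the two classes well.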
Relief-based algorithm family
After the creation of the Relief 26 algorithm in 1992, many extensions have been developed over the years. The use of the k nearest neighbors for updating the scores after each selected instance, instead of restricting the update to one hit and one miss, was introduced by the ReliefA 27 extension in 1994; it is addressed to noisy data, with time complexity $O(m \cdot n \cdot p)$.
Two extensions of ReliefF were subsequently developed. The first, called RReliefF, 29 was introduced in 1996 and is essentially dedicated to regression problems. The second, named Relieved-F, 30 is a deterministic variant of ReliefF created in 1997: it uses the set of all instances and, for each instance, determines the set of all its nearest neighbors; it is designed to handle missing data.
To eliminate the bias of ReliefF in the case of non-monotonic features, an iterative version named Iterative Relief 31 was created in 2003.
In 2007, three methods were introduced. An extension called TuRF 33 was proposed to improve the quality of the scores computed by ReliefF by removing a percentage of noisy features at each iteration. By combining ReliefF and mutual information, Evaporative Cooling ReliefF 34 was developed in the same year to address noisy features. Finally, in order to solve the problem of missing data for some instances, a new version of ReliefF named EReliefF 35 was created.
Another extension of ReliefF was proposed in 2008 under the name VLSReliefF 36 to detect interactions between features and to reduce the influence of a large number of features on the selection of the nearest neighbors, as well as on the weights computed by ReliefF. The method relies primarily on random subsets of features in the scoring mechanism.
In 2009, a method named ReliefMSS 37 was introduced, which proposes a modification of the weight update of ReliefF. The same year saw the introduction of SURF, 38 which replaces the choice of the k nearest neighbors with all neighbors lying within a distance threshold.
Then, in 2010, an extension of the SURF method was proposed under the name SURF*, 39 which introduced the use of the farthest instances in the scoring process to improve the detection of interactions between features.
In 2012, a variant of SURF* called SWRF* 40 was introduced; it uses a sigmoid function to weight the nearest neighbors.
In 2013, an improvement of the SURF* method was developed under the name MultiSURF*, 42 which is faster than its predecessor.
In 2018, a new variant of MultiSURF* was developed under the name MultiSURF, 44 which restricts the scoring to the near neighbors, thereby reducing the execution time.
MCDM concepts and TOPSIS
When we have to make decisions in a complex reality where several conflicting viewpoints coexist, MCDM methods are suited to solve the problem, to enlighten our decisions, or to make recommendations to the decision-makers. In general, a multi-criteria decision problem is characterized by a set of alternatives $A = \{a_1, \ldots, a_q\}$, a set of criteria $C = \{c_1, \ldots, c_r\}$, and a weight vector $w = (w_1, \ldots, w_r)$ expressing the relative importance of the criteria; the evaluation $b_{ij}$ of each alternative $a_i$ on each criterion $c_j$ is gathered in a decision matrix $B$, shown below.
Multi-criteria decision matrix B.
The first step in TOPSIS is to normalize the decision matrix B:

$$g_{ij} = \frac{b_{ij}}{\sqrt{\sum_{i=1}^{q} b_{ij}^2}} \quad (4)$$

The weighted normalized matrix is then obtained by multiplying each column of the normalized matrix by the corresponding criterion weight:

$$h_{ij} = w_j \, g_{ij} \quad (5)$$

Next, the ideal solution $A^+$ and the anti-ideal solution $A^-$ are determined by taking, for each criterion, the best and the worst value of $h_{ij}$, respectively (the maximum and the minimum for benefit criteria):

$$A^+ = (h_1^+, \ldots, h_r^+), \qquad A^- = (h_1^-, \ldots, h_r^-) \quad (6)$$

The separation of each alternative from the ideal and anti-ideal solutions is measured by the Euclidean distances

$$d_i^+ = \sqrt{\sum_{j=1}^{r} (h_{ij} - h_j^+)^2}, \qquad d_i^- = \sqrt{\sum_{j=1}^{r} (h_{ij} - h_j^-)^2} \quad (7)$$

Finally, the relative closeness of each alternative to the ideal solution is computed as

$$S_i = \frac{d_i^-}{d_i^+ + d_i^-} \quad (8)$$

and the alternatives are ranked in decreasing order of $S_i$.
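As a concrete illustration of steps (4) to (8), the following is a minimal NumPy sketch of TOPSIS, assuming a decision matrix B whose rows are the alternatives and whose columns are benefit criteria, together with a weight vector w summing to one; the function name is illustrative.

```python
# A sketch of the TOPSIS steps (4)-(8) for benefit criteria.
import numpy as np

def topsis(B, w):
    """Return the closeness coefficient S_i of each alternative (row of B)."""
    g = B / np.sqrt((B ** 2).sum(axis=0))              # equation (4): normalization
    h = g * w                                          # equation (5): weighting
    h_plus, h_minus = h.max(axis=0), h.min(axis=0)     # equation (6): ideal / anti-ideal
    d_plus = np.sqrt(((h - h_plus) ** 2).sum(axis=1))  # equation (7): separations
    d_minus = np.sqrt(((h - h_minus) ** 2).sum(axis=1))
    return d_minus / (d_plus + d_minus)                # equation (8): closeness
```

Ranking the alternatives then amounts to sorting the returned scores in decreasing order.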
Overview of using MCDM for feature selection
In the literature, many MCDM-based selection methods have been introduced recently; among them, two approaches for the evaluation of selection methods have been developed. The first one 45 is based on TOPSIS and was proposed to evaluate selection methods on the Network Traffic dataset by considering different performance measures as decision-making criteria. The second one 46 targets text classification and small data, suggesting an evaluation of selection methods by comparing several MCDM methods. For feature selection, a method named MFS-MCDM 47 has been introduced for multi-label datasets; it models the selection task as an MCDM problem by adopting the correlations between features and labels as criteria and using Ridge regression to compute the decision matrix, and it then applies TOPSIS to generate a ranking of the features. Another algorithm was introduced under the name EFS-MCDM: 48 on a decision matrix constructed from the rankings provided by different selection algorithms, the MCDM method called VIKOR is applied to rank the features. An approach named VMFS 49 was developed for multi-label data, where a decision matrix is constructed by computing the correlation between features and labels using cosine similarity; the matrix is then provided to the VIKOR multi-criteria method to rank the variables. The TOPSIS method was also used in a hybrid method called TOPSIS-JAYA: 50 after combining several filters in a decision matrix, TOPSIS ranks the features, and a wrapper method then uses the best features selected by TOPSIS to find the optimal subset. Finally, another method 51 intended for cost-sensitive data with missing values uses an MCDM strategy to evaluate the features, after which a forward greedy selection is conducted to find the optimal subset.
The proposed method
The main objective of our proposed approach is to select the relevant features and eliminate the noise by modeling feature selection as a multi-criteria decision problem. The proposed method uses the fundamental concepts of the Relief method to compute a decision matrix, considering the feature weights corresponding to the randomly chosen instances as decision-making criteria (Figure 1). As shown in Algorithm 2, at each iteration the features are weighted with respect to a randomly chosen instance, and the resulting weights form one column of the decision matrix.

Figure 1. Schematic diagram of the proposed approach.
Pseudo code of the proposed algorithm
We represent binary-class data by a matrix X containing the features and a vector Q of labels, as shown in Figure 2. The rows of the matrix X represent the instances and the columns represent the features; the vector Q contains the labels of the instances.

Figure 2. The structure of the data.
In the first step, we calculate the feature weights corresponding to the selected instances using the Relief method: the weight of a feature with respect to a chosen instance is obtained from the diff comparisons with its nearest hit and its nearest miss, as in equation (1), and these weights form the entries of the decision matrix B.
Next, the decision matrix B and the instance weight vector w are provided to the TOPSIS method to generate the ranking of the features. The first step in TOPSIS is to normalize the decision matrix B using equation (4). Then, the weighted normalized matrix h is constructed by multiplying the normalized matrix g by the weight vector w using equation (5). The weighted normalized matrix is structured as shown:

$$h = \begin{pmatrix} h_{11} & h_{12} & \cdots & h_{1m} \\ h_{21} & h_{22} & \cdots & h_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ h_{p1} & h_{p2} & \cdots & h_{pm} \end{pmatrix}$$

where the rows correspond to the p features and the columns to the m selected instances. The ideal and anti-ideal solutions, the separation distances, and the closeness coefficients are then computed as in equations (6) to (8), and the features are ranked in decreasing order of their TOPSIS scores, from the best to the worst.
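Putting the two phases together, here is a minimal end-to-end sketch of the proposed filter under stated assumptions: each column of B holds the per-instance Relief contributions of all p features, the instance weights w are taken to be uniform (the paper's computation of w is not reproduced here), and topsis is the sketch from the previous section; all names are illustrative.

```python
# End-to-end sketch: Relief-style decision matrix + TOPSIS ranking.
# Assumes continuous features scaled to [0, 1] and binary labels y.
import numpy as np

def relief_topsis_rank(X, y, m, rng=None):
    """Return (feature indices ranked best-to-worst, TOPSIS scores)."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    B = np.empty((p, m))                       # alternatives = features, criteria = instances
    for j, i in enumerate(rng.integers(n, size=m)):
        same = (y == y[i])
        same[i] = False                        # the instance itself is not a hit
        dists = np.abs(X - X[i]).sum(axis=1)
        hit = np.argmin(np.where(same, dists, np.inf))
        miss = np.argmin(np.where(y != y[i], dists, np.inf))
        # column j: contribution of instance i to each feature's weight
        B[:, j] = np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    w = np.full(m, 1.0 / m)                    # uniform instance weights (assumption)
    scores = topsis(B, w)                      # TOPSIS sketch from the previous section
    return np.argsort(-scores), scores
```

Note that the vector normalization of equation (4) also applies when B contains negative Relief contributions.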
Time complexity
In general, one of the factors that makes an algorithm efficient is its ability to keep the execution time low as a function of the input size, which determines the number of operations performed by the algorithm. In this study, a time complexity analysis is proposed. The time complexity of our algorithm depends mainly on the identification of the hit and the miss: for each of the m randomly selected instances, the distances to the n instances must be computed over the p features, so this phase has time complexity $O(m \cdot n \cdot p)$. The construction of the decision matrix therefore dominates the overall cost, the subsequent TOPSIS phase requiring only $O(p \cdot m)$ operations.
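One quick way to observe this behavior empirically is to time the filter for a growing number of features; the sketch below reuses the relief_topsis_rank function defined earlier, and the dataset sizes (other than n = 20, taken from the text) are illustrative assumptions.

```python
# Micro-benchmark sketch: run-time of the filter as p grows, with n = 20.
import time
import numpy as np

rng = np.random.default_rng(0)
for p in (100, 1000, 10000):                 # illustrative feature counts
    X = rng.random((20, p))                  # n = 20 samples, as in the text
    y = rng.integers(0, 2, size=20)
    t0 = time.perf_counter()
    relief_topsis_rank(X, y, m=20, rng=0)
    print(f"p = {p}: {time.perf_counter() - t0:.3f} s")
```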
Experimentally, let us consider a dataset with 20 samples ($n = 20$). The average run-times needed to compute the decision matrix for different numbers of features and iterations are reported below.

Microbenchmark timings to compute the decision matrix.
Average run-time in seconds to compute the decision matrix.
After the construction phase of the decision matrix, TOPSIS generates the ranking of the features with an average time presented in Table 3. The TOPSIS algorithm is generally characterized by its low time complexity. According to Table 3, the run-time of the TOPSIS phase remains of the order of milliseconds for all tested configurations.
Table 3. Average run-time in milliseconds of the Technique for Order Preference by Similarity to Ideal Solution (TOPSIS) phase.
Experiments and results
To test the proposed algorithm, we propose a simulation study evaluating its performance and its ability to select relevant features. In this simulation study, we generate data under three scenarios: two scenarios contain numerical features, and one contains categorical ones.
In each dataset, 50% of the instances are labeled with class 1 and 50% with the second class. The relevant features are generated from class-dependent distributions, following one regime for 70% of the observations and another for the remaining 30%. In the scenarios where a target vector Y is generated from a binomial distribution, the relevant features are drawn conditionally on Y (in scenario 3, the strength of this dependence is controlled by the beta values). All remaining features are generated independently of the class, so that they are irrelevant by construction.
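Since the exact generating distributions are specified by the scenario equations, the following generator is only an illustrative stand-in that follows the same pattern (balanced binary labels, class-dependent relevant features, independent noise features); every parameter and distribution in it is an assumption.

```python
# Illustrative data generator in the spirit of the simulation scenarios.
import numpy as np

def make_scenario(n=200, p=50, n_relevant=6, rng=None):
    """Return (X, y) with balanced labels and class-shifted relevant features."""
    rng = np.random.default_rng(rng)
    y = rng.permutation(np.repeat([0, 1], n // 2))   # exactly 50/50 labels
    X = rng.normal(size=(n, p))                      # noise features by default
    X[:, :n_relevant] += y[:, None] * 1.5            # class-dependent shift (assumption)
    return X, y
```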
The goal of this study is to evaluate how well the suggested algorithm can identify the relevant features indicated above and reject the irrelevant ones. The three models already described are random; for this reason, the experimental protocol adopted in our simulation study consists mainly in generating M different datasets according to each of the three scenarios, after choosing the number of instances n, the number of features p, and the number of iterations m. On each dataset, we applied the proposed method and obtained a feature scoring vector; the scores were then averaged over the M datasets, and we recorded how often each feature was ranked among the top features.
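A compact version of this protocol, reusing the illustrative helpers defined above, could look as follows; M, n, p, m, and the cutoff k are example values, not the paper's settings.

```python
# Sketch of the evaluation protocol: average TOPSIS scores and
# top-k selection frequencies over M generated datasets.
import numpy as np

M, n, p, m, k = 100, 200, 50, 20, 6
avg_score = np.zeros(p)
top_k_freq = np.zeros(p)
for s in range(M):
    X, y = make_scenario(n=n, p=p, n_relevant=k, rng=s)
    ranking, scores = relief_topsis_rank(X, y, m=m, rng=s)
    avg_score += scores / M
    top_k_freq[ranking[:k]] += 1.0 / M       # fraction of runs in the top k
print("average TOPSIS scores of the relevant features:", avg_score[:k])
print("top-%d selection frequencies of the relevant features:" % k, top_k_freq[:k])
```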

Average Technique for Order Preference by Similarity to Ideal Solution (TOPSIS) score of features generated using scenario 1 for many options of n, p, and m.

Average Technique for Order Preference by Similarity to Ideal Solution (TOPSIS) score of features generated using scenario 2 for many options of m.

Average Technique for Order Preference by Similarity to Ideal Solution (TOPSIS) score of features generated using scenario 3 for many options of m according to beta values.

Selection frequency in the top 6 of features generated using scenario 1 for many options of n, p, and m.

Selection frequency in the top 2 of features generated with scenario 2 for many options of m.

Selection frequency in the top 8 of features generated with scenario 3 for many options of m according to beta values.
Selection frequency in the top 6 of relevant features generated with scenario 1 for many options of n, p, and m.
Selection frequency in the top 2 of relevant features generated with scenario 2 for many options of m.
Selection frequency in the top 8 of relevant features generated with scenario 3 for many options of m according to beta values.
Conclusion
Before dealing with a classification problem, extracting the relevant information is a crucial step. In this article, we proposed a filter method for feature selection, mainly designed to deal with two-class databases. Through the Relief method, we computed a decision matrix, considering the feature weights corresponding to the instances as our decision criteria; this matrix is provided to the TOPSIS method to generate the ranking of the features. Through a simulation study with three scenarios, we evaluated the performance of our method. The obtained results confirm the efficiency of our method in detecting the relevant features.
In future work, to improve the proposed approach, the variant named ReliefF can be used to address the multiclass selection problem and make the method more robust. We also intend to apply our method to real datasets, such as microarray datasets, and to compare it with existing methods.
Footnotes
Authors’ note
The paper was selected from ICAMDS22.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
