Knowledge discovery in medical and biological datasets by integration of Relief-F and correlation feature selection techniques

Abstract

Feature selection is a pre-processing method that identifies the significant features in high-dimensional data and reduces the computational cost of the learning algorithm by removing irrelevant and redundant features. It has traditionally been applied to a wide range of problems, including biological data processing, pattern recognition, and computer vision. The aim of this paper is to identify, from benchmark datasets, the feature subsets that best improve classifier performance. Existing filter-based feature selection approaches fail to choose the relevant features from the original feature set. To obtain a small subset of relevant features, we introduce a novel filter-based feature selection method called ReCFS. The proposed method combines feature-feature correlation with nearest-neighbour feature weighting to find an optimal subset of features that minimizes the correlation among the selected features. The effectiveness of the feature subset selected by the proposed method is evaluated using two classifiers, Naïve Bayes and K-Nearest Neighbour, on real-life datasets. To cover diverse performance measurements, the experiments are conducted on eight real-life datasets of varied dimensionality and number of instances. The results demonstrate that the proposed method finds promising feature subsets that improve classification accuracy over competing feature selection methods.

Keywords

Machine learning; Relief-F; correlation; feature selection; classification; Naïve Bayes
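The idea summarized in the abstract, ranking features by nearest-neighbour weights and then discarding features that are strongly correlated with ones already chosen, can be illustrated with a rough sketch. This is not the authors' ReCFS algorithm, whose details are not given here. It assumes a simplified Relief weighting (one nearest hit and one nearest miss per sample, rather than full Relief-F), a greedy selection loop, and an absolute Pearson correlation cut-off. The function names `relief_weights` and `select_features` and the parameters `k` and `max_corr` are hypothetical.

```python
import numpy as np


def relief_weights(X, y):
    """Simplified Relief weights (assumption: not the full Relief-F).

    Uses one nearest hit and one nearest miss per sample, and assumes
    every class contains at least two samples.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, d = X.shape
    # Scale each feature to [0, 1] so per-feature differences are comparable.
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Xs = (X - X.min(axis=0)) / span
    w = np.zeros(d)
    for i in range(n):
        dist = np.abs(Xs - Xs[i]).sum(axis=1)  # Manhattan distance to all samples
        dist[i] = np.inf                       # exclude the sample itself
        same = y == y[i]
        hit = np.argmin(np.where(same, dist, np.inf))    # nearest same-class sample
        miss = np.argmin(np.where(~same, dist, np.inf))  # nearest other-class sample
        # Reward features that differ on the miss, penalize those that differ on the hit.
        w += np.abs(Xs[i] - Xs[miss]) - np.abs(Xs[i] - Xs[hit])
    return w / n


def select_features(X, y, k, max_corr=0.8):
    """Greedily pick up to k features by descending Relief weight,
    skipping any feature whose absolute Pearson correlation with an
    already selected feature exceeds max_corr (hypothetical cut-off)."""
    X = np.asarray(X, dtype=float)
    w = relief_weights(X, y)
    corr = np.abs(np.corrcoef(X, rowvar=False))
    selected = []
    for j in np.argsort(-w):
        if len(selected) == k:
            break
        if all(corr[j, s] <= max_corr for s in selected):
            selected.append(int(j))
    return selected
```

Calling `select_features(X, y, k=10)` on a samples-by-features array returns the column indices of the retained features, which could then be passed to a classifier such as Naïve Bayes or K-Nearest Neighbour. The greedy loop is only one of several reasonable ways to combine the two criteria.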
