An efficient search algorithm for biomarker selection from RNA-seq prostate cancer data

Abstract

RNA-sequencing technology makes it possible to measure the expression of thousands of genes simultaneously. The resulting large-scale gene expression data contain a huge number of genes but only a few samples. Algorithms that can accurately detect the genes associated with a specific disease among a huge number of unrelated genes can therefore help experts detect and treat the disease early.

This paper proposes a two-phase search algorithm to discover biomarkers in an RNA-seq gene expression dataset for prostate cancer diagnosis. After statistical noise is removed from the original large-scale dataset, a multi-objective optimization process selects the best non-dominated subset of genes, simultaneously maximizing the classification accuracy and minimizing the number of genes. Finally, the proposed cache-based modification of the sequential forward floating selection (CMSFFS) algorithm is applied to the selected subset of genes to discover the most discriminant genes.
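The two ideas named above, non-dominated selection over (accuracy, gene count) and a floating forward search that caches subset evaluations, can be illustrated with a minimal Python sketch. The abstract does not specify how CMSFFS works internally, so this is a generic memoized SFFS rather than the authors' algorithm; the names `non_dominated`, `cached_sffs` and `score` are hypothetical, and `score` stands in for whatever subset quality measure is used (for example a cross-validated classifier accuracy).

```python
from typing import Callable, Dict, FrozenSet, List, Sequence, Tuple

Subset = FrozenSet[int]


def non_dominated(candidates: List[Tuple[float, int]]) -> List[Tuple[float, int]]:
    """Keep the (accuracy, n_genes) pairs that no other pair dominates.

    One pair dominates another if it is at least as accurate with no more
    genes, and strictly better on at least one of the two objectives.
    """
    front = []
    for acc, n in candidates:
        dominated = any(
            (a >= acc and m <= n) and (a > acc or m < n)
            for a, m in candidates
        )
        if not dominated:
            front.append((acc, n))
    return front


def cached_sffs(
    features: Sequence[int],
    score: Callable[[Subset], float],  # hypothetical subset quality measure
    k: int,
) -> Tuple[float, Subset]:
    """Generic SFFS with memoized scores; returns the best subset of size k."""
    if not 1 <= k <= len(features):
        raise ValueError("k must be between 1 and the number of features")

    cache: Dict[Subset, float] = {}

    def cached_score(subset: Subset) -> float:
        # Each distinct subset is evaluated at most once.
        if subset not in cache:
            cache[subset] = score(subset)
        return cache[subset]

    selected: Subset = frozenset()
    best: Dict[int, Tuple[float, Subset]] = {}  # subset size -> best found

    while len(selected) < k:
        # Forward step: add the feature that yields the highest score.
        remaining = [f for f in features if f not in selected]
        add = max(remaining, key=lambda f: cached_score(selected | {f}))
        selected = selected | {add}
        s = cached_score(selected)
        if len(selected) not in best or s > best[len(selected)][0]:
            best[len(selected)] = (s, selected)

        # Floating step: drop features while that beats the best smaller subset.
        while len(selected) > 2:
            drop = max(selected, key=lambda f: cached_score(selected - {f}))
            reduced = selected - {drop}
            s_red = cached_score(reduced)
            if s_red <= best[len(reduced)][0]:
                break
            selected = reduced
            best[len(reduced)] = (s_red, reduced)

    return best[k]
```

In this sketch the cache pays off during the floating step, where subsets already scored in earlier forward steps are revisited; with an expensive `score` such as repeated classifier training, those repeated evaluations are what memoization avoids.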

The results show that the proposed algorithm achieves 100% classification accuracy, sensitivity and specificity on the large-scale RNA-seq prostate cancer dataset while selecting only three biomarkers.
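For reference, the three reported measures follow from the counts of a binary confusion matrix. The sketch below assumes labels where one value (here `1`, an arbitrary choice) marks the cancer class; the function name `diagnostic_metrics` is hypothetical and the example is not taken from the paper's evaluation code.

```python
from typing import Dict, Sequence


def diagnostic_metrics(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Dict[str, float]:
    """Accuracy, sensitivity and specificity for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    def ratio(num: int, den: int) -> float:
        # Undefined ratios (no samples of that class) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),  # true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
    }
```

A result of 100% on all three measures corresponds to a confusion matrix with no false positives and no false negatives.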

Keywords

RNA-seq, large-scale prostate cancer data, two-phase search algorithm, multi-objective-based optimization, CMSFFS
