Abstract
RNA-sequencing technology makes it possible to measure the expression of thousands of genes simultaneously. Large-scale gene expression datasets contain a huge number of genes but only a few samples. Algorithms that can accurately detect the genes associated with a specific disease among a huge number of unrelated genes can therefore help experts detect and treat the disease early.
This paper proposes a two-phase search algorithm to discover biomarkers in an RNA-seq gene expression dataset for prostate cancer diagnosis. After statistical noise removal from the original large-scale dataset, a multi-objective optimization process selects the best non-dominated subset of genes, simultaneously maximizing classification accuracy and minimizing the number of genes. Finally, the proposed cache-based modification of the sequential forward floating selection (CMSFFS) algorithm is applied to the selected subset to discover the most discriminant genes.
The results show that the proposed algorithm achieves 100% classification accuracy, sensitivity, and specificity on the large-scale RNA-seq prostate cancer dataset while selecting only three biomarkers.
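The second phase can be illustrated with a minimal sketch of sequential forward floating selection in which every evaluated gene subset is memoized, so that no subset is scored twice. This is a generic illustration under stated assumptions, not the CMSFFS algorithm itself, whose details the abstract does not give. The names `cached_sffs`, `score_fn`, and `max_size` are hypothetical, and `score_fn` stands in for whatever classifier-based accuracy estimate is used.

```python
from typing import Callable, Dict, FrozenSet, Sequence, Tuple

Subset = FrozenSet[str]


def cached_sffs(
    features: Sequence[str],
    score_fn: Callable[[Subset], float],
    max_size: int,
) -> Tuple[Subset, float, int]:
    """Generic floating forward selection with a score cache (illustrative only).

    Returns the best subset found, its score, and the number of distinct
    subsets that were actually evaluated.
    """
    cache: Dict[Subset, float] = {}

    def score(subset: Subset) -> float:
        # Assumption: scoring is expensive (e.g. cross-validation),
        # so each subset is evaluated at most once.
        if subset not in cache:
            cache[subset] = score_fn(subset)
        return cache[subset]

    selected: Subset = frozenset()
    best_by_size: Dict[int, Tuple[float, Subset]] = {}

    while len(selected) < max_size:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        # Forward step: add the single feature that gives the best score.
        selected = max((selected | {f} for f in candidates), key=score)
        k = len(selected)
        if k not in best_by_size or score(selected) > best_by_size[k][0]:
            best_by_size[k] = (score(selected), selected)

        # Floating step: drop features while that beats the best known
        # subset of the smaller size.
        while len(selected) > 2:
            reduced = max((selected - {f} for f in sorted(selected)), key=score)
            if score(reduced) > best_by_size[len(reduced)][0]:
                selected = reduced
                best_by_size[len(reduced)] = (score(reduced), reduced)
            else:
                break

    # Prefer the highest score; break ties in favour of fewer features.
    best_score, best_subset = max(
        best_by_size.values(), key=lambda t: (t[0], -len(t[1]))
    )
    return best_subset, best_score, len(cache)
```

The tie-break in the final step mirrors the two objectives named in the abstract (maximum accuracy, minimum number of genes), but how the authors actually combine them is not stated here.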
