Improving microarray classification with a graph-based gene selection method

Abstract

High-dimensional microarray data presents a major challenge for accurate disease classification due to feature redundancy and limited sample sizes. We hypothesize that modeling multivariate gene relationships using a graph-based structure can enhance the effectiveness of gene selection without relying on labelled data. This paper proposes Feature Graph-based Unsupervised Gene Selection (FGUGS), a novel unsupervised filter method that constructs a threshold-free directed graph to represent pairwise gene similarities and selects a compact subset of genes by identifying high in-degree nodes, thereby minimizing redundancy. FGUGS eliminates the need for manual parameter tuning, enhances interpretability through graph analysis, and scales efficiently with the number of genes while preserving structural relationships. Experimental results on four benchmark datasets show that FGUGS outperforms both traditional and recent state-of-the-art gene selection methods, achieving up to a 20% improvement in classification accuracy and demonstrating strong clustering performance. FGUGS provides a reproducible and scalable solution for biomedical data analysis, particularly in scenarios where class labels are scarce or unavailable.
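The selection idea summarized above (a threshold-free directed graph over genes, followed by picking high in-degree nodes) can be illustrated with a minimal sketch. The abstract does not specify how FGUGS builds its edges or measures similarity, so everything below is an assumption made for illustration: similarity is taken to be the absolute Pearson correlation, each gene gets a single outgoing edge to its most similar other gene, and the function name `select_genes_by_in_degree` and its parameter `n_selected` are hypothetical. This is not the authors' implementation.

```python
import numpy as np


def select_genes_by_in_degree(X, n_selected):
    """Illustrative sketch only, not the published FGUGS algorithm.

    X is a (samples x genes) expression matrix. No class labels are used.
    Returns the indices of the selected genes and the in-degree of every gene.
    """
    X = np.asarray(X, dtype=float)
    n_genes = X.shape[1]

    # Assumed similarity: absolute Pearson correlation between gene columns.
    sim = np.abs(np.corrcoef(X, rowvar=False))
    # Constant genes produce NaN correlations; treat them as dissimilar.
    sim = np.nan_to_num(sim, nan=0.0)
    # Rule out self-loops so a gene never points to itself.
    np.fill_diagonal(sim, -1.0)

    # Assumed edge rule: one directed edge from each gene to its most
    # similar other gene. No similarity threshold is needed.
    target = sim.argmax(axis=1)

    # In-degree counts how many genes name a given gene as their closest match.
    in_degree = np.bincount(target, minlength=n_genes)

    # Keep the genes with the highest in-degree; a stable sort breaks ties
    # in favour of the lower gene index.
    order = np.argsort(-in_degree, kind="stable")
    return order[:n_selected], in_degree
```

Under these assumptions, a gene with a high in-degree acts as a representative of the genes pointing to it, which is one way to read the abstract's claim that selecting such nodes reduces redundancy. Calling `select_genes_by_in_degree(X, 50)` on an expression matrix `X` would return 50 gene indices together with the full in-degree vector for inspection.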

Keywords

High-dimensional biomedical data; feature graph representation; unsupervised nonlinear dimensionality reduction; graph-based multivariate gene selection; graph-theoretic feature filtering; improved classification performance

