Predicting cancer subtypes from microarray data using semi-supervised fuzzy C-means algorithm

Abstract

Microarray technologies make it possible to observe the expression levels of thousands of genes. Analysis of the gene expression data arising from these experiments provides insight into disease subtypes and gene functions. Gene expression data are characterized by a large number of genes and few samples. Traditional supervised classifiers require adequate labeled data for prediction, but the limited number of samples makes the prediction of disease subtypes a difficult task. Hence, we investigate the potential of semi-supervised learning to delineate tissue samples using only a few labeled data. The available labeled samples were exploited to guide the clustering of unlabeled samples. A classification system was built by integrating feature selection techniques with a semi-supervised fuzzy c-means algorithm. The system was evaluated on publicly available gene expression datasets, and the results showed that a few labeled tissue samples can assist in the accurate prediction of disease subtypes.
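The idea of letting a few labeled samples guide the clustering of unlabeled ones can be illustrated with a minimal sketch. It is not the system evaluated in the article: it is a simplified variant of fuzzy c-means in which labeled samples keep a fixed one-hot membership while unlabeled samples follow the standard membership update. The function name `ss_fcm`, its parameters, and the toy data are all hypothetical, and the sketch assumes every class has at least one labeled sample and that feature selection has already been applied.

```python
import math

def ss_fcm(X, labels, n_clusters, m=2.0, n_iter=50, eps=1e-9):
    """Simplified semi-supervised fuzzy c-means (illustrative only).

    X      : list of feature vectors (lists of floats)
    labels : class index per sample, or None if the sample is unlabeled
    m      : fuzzifier (> 1)
    """
    n, d = len(X), len(X[0])
    # Labeled samples get a fixed one-hot membership; unlabeled start uniform.
    U = []
    for lab in labels:
        if lab is None:
            U.append([1.0 / n_clusters] * n_clusters)
        else:
            U.append([1.0 if k == lab else 0.0 for k in range(n_clusters)])

    centers = []
    for _ in range(n_iter):
        # Cluster centers: membership-weighted means of all samples.
        centers = []
        for k in range(n_clusters):
            w = [U[i][k] ** m for i in range(n)]
            total = sum(w) or eps
            centers.append(
                [sum(w[i] * X[i][j] for i in range(n)) / total for j in range(d)]
            )
        # Standard fuzzy c-means membership update, unlabeled samples only.
        for i in range(n):
            if labels[i] is not None:
                continue
            dist = [max(math.dist(X[i], c), eps) for c in centers]
            inv = [dk ** (-2.0 / (m - 1.0)) for dk in dist]
            s = sum(inv)
            U[i] = [v / s for v in inv]
    return U, centers

# Toy data: two well-separated groups, one labeled sample per group.
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
labels = [0, None, None, 1, None, None]
U, centers = ss_fcm(X, labels, n_clusters=2)
# Predicted subtype = cluster with the highest membership.
predicted = [max(range(2), key=lambda k: row[k]) for row in U]
```

Clamping the labeled memberships is the simplest way to inject supervision; formulations that add a supervision term to the clustering objective are an alternative that this sketch does not attempt.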

Keywords

Gene expression, clustering, semi-supervised, fuzzy c-means

