
Abstract

The use of machine learning (ML) approaches to extract gene expression information from microarray studies has increased in recent years, especially in cancer-related work. Despite this continued interest in applying ML to cancer biomedical research, no curated repository focuses exclusively on providing quality data sets for benchmarking and testing such techniques. In this work, we therefore present the Curated Microarray Database (CuMiDa), a database of 78 handpicked Homo sapiens microarray data sets, carefully selected from more than 30,000 microarray experiments in the Gene Expression Omnibus using rigorous filtering criteria. Each data set was individually subjected to background correction, normalization, and sample quality analysis, and was manually edited to eliminate erroneous probes. All data sets were examined with principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) to observe sample separation, and were additionally tested with various ML approaches to provide a baseline accuracy for the major techniques applied to microarray data sets. CuMiDa is a database created solely for benchmarking and testing ML approaches applied to cancer research.
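To make the idea of a baseline accuracy concrete, the following is a minimal sketch of how one might score a simple classifier on an expression matrix with stratified k-fold cross-validation. It is not the procedure used to build CuMiDa: the nearest-centroid classifier, the synthetic two-class data, and every function name (`make_synthetic_expression`, `stratified_folds`, `cross_validated_accuracy`) are assumptions introduced here for illustration only.

```python
import random
from statistics import mean


def make_synthetic_expression(n_per_class=20, n_genes=50, n_informative=5, seed=0):
    """Hypothetical stand-in for a data set: rows are samples, columns are genes."""
    rng = random.Random(seed)
    samples, labels = [], []
    for label in ("normal", "tumor"):
        shift = 2.0 if label == "tumor" else 0.0
        for _ in range(n_per_class):
            # Only the first n_informative genes differ between the classes.
            row = [rng.gauss(shift if g < n_informative else 0.0, 1.0)
                   for g in range(n_genes)]
            samples.append(row)
            labels.append(label)
    return samples, labels


def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, i in enumerate(indices):
            folds[pos % k].append(i)
    return folds


def class_centroids(samples, labels, train_idx):
    """Mean expression profile of each class, computed on training samples only."""
    grouped = {}
    for i in train_idx:
        grouped.setdefault(labels[i], []).append(samples[i])
    return {y: [mean(col) for col in zip(*rows)] for y, rows in grouped.items()}


def predict(centroids, row):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def sqdist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda y: sqdist(centroids[y]))


def cross_validated_accuracy(samples, labels, k=5, seed=0):
    """Fraction of samples classified correctly when each is held out once."""
    correct = 0
    for test_idx in stratified_folds(labels, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        centroids = class_centroids(samples, labels, train_idx)
        correct += sum(predict(centroids, samples[i]) == labels[i] for i in test_idx)
    return correct / len(labels)


if __name__ == "__main__":
    X, y = make_synthetic_expression()
    print(f"5-fold baseline accuracy: {cross_validated_accuracy(X, y):.2f}")
```

Stratifying the folds keeps the class balance of each held-out set close to that of the full data set, which matters for microarray collections where one class often has few samples. A real benchmark would replace the synthetic matrix with a downloaded data set and the nearest-centroid rule with the classifiers under comparison.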




CuMiDa: An Extensively Curated Microarray Database for Benchmarking and Testing of Machine Learning Approaches in Cancer Research
