Classification, Predictive Modelling, and Statistical Analysis of Cancer Data (A)

Abstract

Supplement Aims and Scope

Cancer Informatics represents a hybrid discipline encompassing the fields of oncology, computer science, bioinformatics, statistics, computational biology, genomics, proteomics, metabolomics, pharmacology, and quantitative epidemiology. The common bond or challenge that unifies the various disciplines is the need to bring order to the massive amounts of data generated by researchers and clinicians attempting to find the underlying causes and effective means of diagnosing and treating cancer.

The future cancer informatician will need to be well-versed in each of these fields and have the appropriate background to leverage the computational, clinical, and basic science resources necessary to understand their data and separate signal from noise. Knowledge of, and communication among, these specialty disciplines, acting in unison, will be the key to success as we strive to find answers underlying the complex and often puzzling diseases known as cancer.

Authors of articles in this supplement were asked to focus on classification, predictive modelling, and statistical analysis of cancer data, including one or more of the following topics:

□ Random Forest Algorithms
□ Fuzzy-Set Analysis
□ Non-Linear Signal Processing
□ Bootstrapping Methods
□ Imputation Algorithms
□ Bayesian Classifiers
□ Support Vector Machines
□ Time-to-Event Models
□ K-Means Cluster Analysis
□ Discriminant Analysis Classifiers
□ K-Nearest Neighbor Methods
□ Multiple Comparison Strategies
□ Cancer Influence Modelling
□ Hyperplane Bernoulli Ring Sets
□ Network Analysis
□ Target Prediction and Cross-Validation Algorithms
□ Hybrid and Hierarchical Partitioning
□ Self-Organizing Maps
□ Causal Validity and Benchmarking

The availability of microarray and next-generation sequencing technologies has made it possible to study human cancers at the whole-genome, transcriptomic, epigenetic, and proteomic levels. While these high-throughput technologies motivate advanced approaches to studying cancer-associated genetic variants, genes, and pathways, they also yield large-scale, complex-structured “-omics” cancer data, which pose unique analytic challenges for the computer science and statistics communities. Novel computational algorithms and statistical methods, and/or appropriate choices among available tools and software, are needed to analyze and make sense of such rich datasets, with the overall goal of understanding the genomic landscape of cancer development and progression. This supplement includes articles by leading researchers in statistics, biostatistics, and bioinformatics covering a wide spectrum of scientific questions and hypotheses generated by such data. Some of the main areas are summarized below.

Classification of tumor types is of great importance in cancer diagnosis and treatment. Many modern machine learning and data mining algorithms have been proposed in the literature for cancer classification based on gene expression data. Choosing among tens of thousands of genes and identifying which ones are relevant to different cancer types or subtypes is a challenging task. Popular methods include Bayesian networks, k-nearest neighbors, neural networks, nearest shrunken centroids, logistic regression, random forests, and support vector machines. Previous studies show that the accuracy of a classifier depends on the specific dataset and that no single classifier universally outperforms the others. While classification focuses on identifying patterns in the data that are associated with predefined cancer types and on classifying future observations, clustering analysis, an unsupervised learning method, can be used to extract information from gene expression data and discover new cancer types or subtypes. Popular clustering methods include hierarchical clustering, k-means, self-organizing maps, and model-based clustering.
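The supervised setting described above can be sketched with one of the listed methods, k-nearest neighbors, evaluated by leave-one-out cross-validation. This is a minimal illustration under stated assumptions, not a method from any article in this supplement: the expression values, the subtype labels "A" and "B", and the function names are all hypothetical, and a real analysis would involve thousands of genes and a gene-selection step.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_x, train_y, query, k=3):
    """Predict a label by majority vote among the k nearest training samples."""
    neighbors = sorted(
        zip(train_x, train_y), key=lambda pair: euclidean(pair[0], query)
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def loocv_accuracy(xs, ys, k=3):
    """Leave-one-out cross-validation: hold out each sample once and predict it."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if knn_predict(train_x, train_y, xs[i], k) == ys[i]:
            correct += 1
    return correct / len(xs)

# Hypothetical toy data: rows are tumor samples, columns are expression
# levels of three genes. The third gene carries no subtype information.
expression = [
    [5.1, 1.2, 3.3], [4.8, 1.0, 3.1], [5.3, 1.4, 3.0], [5.0, 0.9, 3.4],
    [1.1, 4.9, 3.2], [0.8, 5.2, 3.0], [1.3, 5.0, 3.3], [1.0, 4.7, 3.1],
]
subtype = ["A"] * 4 + ["B"] * 4

print(loocv_accuracy(expression, subtype, k=3))  # 1.0
print(loocv_accuracy(expression, subtype, k=7))  # 0.0
```

With k = 7 every held-out sample is outvoted by the four training samples of the other subtype, so the same algorithm goes from perfect to useless on the same data. Even on a toy example, measured accuracy depends on the tuning choice and the dataset rather than on the algorithm alone.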

Clustering is generally considered a more difficult problem than classification because the number of cancer types is unknown and no learning set of labeled observations is available.
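The difficulty of not knowing the number of groups can be seen in a small k-means sketch. The data below are hypothetical, the initialization (the first k samples) is a simplifying assumption made only to keep the example deterministic, and the function names are illustrative rather than taken from any library.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of profiles."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def kmeans(points, init_centroids, max_iter=100):
    """Lloyd's algorithm; returns (assignment, centroids, within-cluster SS)."""
    centroids = [list(c) for c in init_centroids]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: attach each sample to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)),
                key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no sample changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_point(members)
    wcss = sum(squared_distance(p, centroids[a])
               for p, a in zip(points, assignment))
    return assignment, centroids, wcss

# Hypothetical unlabeled expression profiles (samples x genes).
profiles = [
    [5.1, 1.2, 3.3], [4.8, 1.0, 3.1], [5.3, 1.4, 3.0], [5.0, 0.9, 3.4],
    [1.1, 4.9, 3.2], [0.8, 5.2, 3.0], [1.3, 5.0, 3.3], [1.0, 4.7, 3.1],
]

for k in (1, 2, 3):
    labels, _, wcss = kmeans(profiles, profiles[:k])
    print(k, labels, round(wcss, 3))
```

The within-cluster sum of squares keeps shrinking as k grows, so it cannot by itself indicate how many subtypes are present. Some additional criterion or external validation is needed, which is one reason the unsupervised problem is harder.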

Genetic variants, together with established risk factors, have been shown to improve the performance of cancer risk prediction models in breast cancer and prostate cancer studies. With decreasing cost, DNA sequencing provides a practical and useful tool for studying cancer-associated single nucleotide variants, small insertions or deletions, copy number variations, and other structural variants at the whole-genome level. Data analysis usually involves multiple steps, including quality control of raw reads, read mapping to the reference genome, variant calling, and annotation and prioritization of potentially cancer-related variants. Although many software tools and pipelines are available, there is an urgent need to evaluate these bioinformatics tools on both simulated and benchmark datasets so that users can make appropriate choices for their own data analysis.
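One simple form such an evaluation can take is to compare each tool's variant calls against a benchmark truth set and report precision, recall, and F1. The sketch below assumes variants are represented as (chromosome, position, reference allele, alternate allele) tuples. The truth set, the two callers, and their calls are invented for illustration and do not correspond to any real tool or dataset.

```python
def evaluate_calls(called, truth):
    """Compare a call set with a benchmark truth set of variants."""
    called, truth = set(called), set(truth)
    tp = len(called & truth)   # true positives: called and in the truth set
    fp = len(called - truth)   # false positives: called but not in the truth set
    fn = len(truth - called)   # false negatives: in the truth set but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}

# Hypothetical benchmark: (chromosome, position, ref allele, alt allele).
truth_set = [
    ("chr1", 1000, "A", "G"),
    ("chr1", 2500, "C", "T"),
    ("chr2", 300, "G", "A"),
    ("chr3", 7200, "T", "C"),
]

# Hypothetical output of a conservative caller: misses one variant, one false call.
caller_a = truth_set[:3] + [("chr2", 900, "A", "T")]

# Hypothetical output of a permissive caller: finds everything, many false calls.
caller_b = truth_set + [
    ("chr1", 1100, "G", "C"), ("chr2", 450, "T", "G"),
    ("chr3", 10, "A", "C"), ("chr3", 8800, "C", "G"),
]

for name, calls in (("caller_a", caller_a), ("caller_b", caller_b)):
    print(name, evaluate_calls(calls, truth_set))
```

Neither hypothetical caller dominates here: one has higher precision and the other higher recall, so the appropriate choice depends on whether missed variants or false calls are more costly in a given study.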

The last few years have seen an explosion of Bayesian models for high-throughput genomics data, partly aided by rapid developments in computational machinery. Bayesian models are particularly appealing in these settings because they provide coherent probabilistic formulations of scientific hypotheses, quantify uncertainty appropriately, and allow the incorporation of prior knowledge, which together allow more refined biological interpretations of the analysis. This issue contains several novel developments in this area. Cassese et al. propose Bayesian hierarchical models for integrative analysis of gene expression levels with comparative genomic hybridization array measurements in lung cancer. Ni et al. propose a network-based Bayesian model for integrative analysis of diverse genomics data for glioblastoma. Zhang et al. propose variable selection methods for joint selection of genes and pathways that are associated with multiple myeloma progression and development. Guha et al. use a Bayesian mixture model for disease classification of breast tumor types using copy number data.
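The general appeal of the Bayesian approach, combining prior knowledge with data and reporting uncertainty, can be illustrated with the simplest conjugate case, a Beta prior on a proportion updated by binomial data. This sketch is far simpler than the hierarchical, network, and mixture models in the articles above and is not drawn from them. The prior parameters and the counts (6 altered tumors out of 20) are hypothetical.

```python
def beta_binomial_update(prior_a, prior_b, successes, failures):
    """Posterior Beta(a, b) parameters after observing binomial data."""
    return prior_a + successes, prior_b + failures

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

def beta_variance(a, b):
    """Variance of a Beta(a, b) distribution."""
    return a * b / ((a + b) ** 2 * (a + b + 1))

# Hypothetical data: 6 of 20 tumors carry a given alteration.
altered, unaltered = 6, 14

# Flat prior Beta(1, 1): no prior knowledge about the alteration frequency.
flat_post = beta_binomial_update(1, 1, altered, unaltered)

# Informative prior Beta(2, 8): prior belief that the frequency is near 0.2.
informed_post = beta_binomial_update(2, 8, altered, unaltered)

for name, (a, b) in (("flat", flat_post), ("informed", informed_post)):
    print(name, (a, b), round(beta_mean(a, b), 3), round(beta_variance(a, b), 5))
```

The informative prior pulls the posterior mean toward 0.2 and reduces the posterior variance, which is the sense in which prior knowledge is incorporated and uncertainty is quantified.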

Footnotes

Lead Guest Editor

Dr. Hongmei Jiang

Dr. Hongmei Jiang is an Associate Professor of Statistics at Northwestern University. She completed her PhD at Purdue University. She now works primarily in statistical genomics, metagenomics, computational biology, and bioinformatics. Dr. Jiang is the author or co-author of 29 published papers and one book chapter and has presented at over 30 conferences.

Guest Editors

Dr. Lingling An

Dr. Lingling An is an Assistant Professor of Biometry at the University of Arizona. She completed her PhD at Purdue University. Her research focus is primarily on the development and application of statistical methods to the analysis of high-dimensional genomic, epigenomic and metagenomic data. Dr. An is the author or co-author of 20 published papers and one book chapter and has presented at 13 conferences.

Dr. Veerabhadran Baladandayuthapani

Dr. Veerabhadran Baladandayuthapani is currently an Associate Professor of Biostatistics at UT MD Anderson Cancer Center. He received his PhD in Statistics from Texas A&M University and his Bachelor's (honors) degree in Mathematics from the Indian Institute of Technology, Kharagpur, India. His research interests are in Big Data analytics, particularly in developing new statistical and machine learning frameworks and software for analyzing datasets characterized by high dimensionality and complex structures, such as high-throughput genomics, proteomics, and imaging. These frameworks include hierarchical and spatial functional data analysis, semi-/non-parametric modeling, non-linear methods for classification and prediction, graphical models, and machine learning approaches. He has published over 60 articles in top statistical, biostatistical, bioinformatics, and biomedical journals and has authored a book on Bayesian analysis of gene expression data. He also holds several grants from NIH and NSF as PI and co-investigator. He has given over 70 presentations at national and international conferences and academic departments around the world.

Dr. Paul Livermore Auer

Dr. Paul Livermore Auer is an Assistant Professor of Biostatistics at the University of Wisconsin-Milwaukee and an Affiliate Investigator of the Fred Hutchinson Cancer Research Center. He completed his PhD at Purdue University and has previously worked at the U.S. Census Bureau. He now works primarily in bioinformatics. Dr. Auer is the author or co-author of 12 published papers and has presented at 11 conferences.