Abstract
Whole-genome analyses have uncovered that most cancer-relevant genes cluster into 12 signaling pathways. Knowledge of the signaling pathways and associated gene signatures not only allows us to understand the mechanisms of oncogenesis inherent to specific cancers but also provides us with drug targets, molecular diagnostic and prognostic factors, and biomarkers for patient risk stratification and treatment. Publicly available genomic data sets constitute a wealth of gene mining opportunities for hypothesis generation and testing. However, the increasingly recognized genetic and epigenetic inter- and intratumor heterogeneity, combined with the preponderance of small-size cohorts, hampers reliable analysis and discovery. Here, we review two methods that are used to infer meaningful biological events from small-size data sets and discuss some of their applications and limitations.
Introduction
Next-generation sequencing and microarray technologies have generated massive amounts of data that can be mined for disease–gene expression correlates in search of molecular mechanisms, biomarkers, or drug targets. As of August 15, 2015, slightly fewer than 4,000 publicly available Gene Expression Omnibus (GEO) data sets (GDSs) could be retrieved from GEO alone (the NIH gene expression data set repository at the National Center for Biotechnology Information; http://www.ncbi.nlm.nih.gov/gds/), several hundred of which are dedicated to human cancers. Current gene expression arrays encompass some 45k and 22k probesets for protein-encoding and noncoding genes, respectively (eg, Affymetrix's GeneChip® Human Transcriptome Array 2.0; Illumina's HumanHT-12 v4 Expression BeadChip), allowing investigators to probe gene expression variation in clinical samples or cell lines at an unprecedented depth. The analytical power of whole-genome analyses, however, remains limited, mostly owing to two practical parameters: (i) most cancers are relatively low-incidence diseases (eg, Ewing's sarcoma affects 1–2 children/year/million1 and subcutaneous panniculitis-like T-cell lymphoma afflicts about 1 person/year/10 million2,3), and most laboratories or even institutions therefore have access to only a limited number of tumor samples, and (ii) the cost of the technology remains too high for most low- to mid-budget laboratories, forcing investigators to limit the number of tested samples and biological replicates, which in turn yields mostly underpowered studies.
In 2010, McClellan and King highlighted the complex interplay linking genetic diversity to disease heterogeneity.4 Accordingly, discovery of many disease-associated genetic risk variants requires exceedingly large cohorts in genome-wide association studies, as recently exemplified by a large-size cohort analysis of lung adenocarcinoma, wherein 26 research departments from several countries pooled their resources to conduct the study.5 The problem posed by high interindividual allelic variability can be further exacerbated by that of epigenetic diversity (eg, in follicular lymphoma and diffuse large B-cell lymphomas),6 whereby stochastic and/or environmental factors can lead to different epigenetic (and gene expression) landscapes, even in presumably otherwise genetically identical monozygotic twins.7 It is now increasingly appreciated that several cancers exhibit high intratumor variability, including those of the breast,8–10 colon,11,12 head and neck,13 ovary,14 prostate,15 and stomach,16 as well as glioblastoma.17–19 In fact, somatic mutation frequency analysis of more than 3,000 tumor samples encompassing 27 cancer types revealed mutation rates varying by three or more orders of magnitude between tumors (eg, in lung adenocarcinoma and melanoma),20 underscoring the scale of heterogeneity. Furthermore, tumor heterogeneity can increase in response to chemotherapeutic intervention, adding to the complexity of the analysis.21,22 Since heterogeneity can increase over time and/or when tumors are exposed to different microenvironments, heterogeneity can be high when comparing metastases to primary tumors, particularly in cases whereby metastases take up to several decades to evolve, allowing time for stochastic genotypic or epigenetic changes.23 Thus, for instance, 3%–24% of breast cancer metastases display a different estrogen, progesterone, or HER2 (erb-b2 receptor tyrosine kinase 2) receptor status from the primary tumors,24 either because of a receptor switch or because the tested metastases arose from sections of the primary tumor not included in the analyses. These intra- and interindividual differences notwithstanding, recurrent alterations in key biological processes often underlie a given disease,4 and, for example, only a dozen or so core signaling pathways appear to drive the tumorigenic phenotype of most cancers.25–27 Whereas cancer genetic and epigenetic diversity offers opportunities for biomarker discovery and risk stratification,28 uncovering genes and pathways associated with specific disease states remains challenging, owing to sample-size requirements. Fortunately, bioinformatic methods have begun to address this problem, and we briefly summarize below those that have proved useful in the analysis of small-size cohorts.
How Small are Small-Size Cohorts?
It is common knowledge that most childhood cancer cohorts are relatively small. This is not only because these diseases are relatively rare but also because less funding is devoted to research on these neoplasms than on their adult counterparts. Thus, for example, the combined NIH budget for all types of pediatric sarcomas is only about 1/15th of the budget allocated to breast cancer alone.29

But what about the cohort size of gene data sets of more frequent childhood cancers (eg, leukemia, with 88 cases/year/million in 1–4-year-old children30) or adult cancers (80–690 cases/year/million in men and 73–724 cases/year/million in women for the top 10 cancer types; based on our estimates taking into account the current US population projection31 and cancer frequencies in adults in the US population in 201532)? To address this question, we ran a meta-analysis of all cancer gene data sets deposited to date in GEO. As shown in Figure 1, the majority of data sets contained 50 or fewer samples and qualified as small-size cohorts.

Figure 1. Probability density distribution of all cancer gene data sets in Gene Expression Omnibus (GEO). All cancer data sets were retrieved from GEO (query performed on August 15, 2015) and plotted against sample size (x-axis). Gene data set size refers to the number of tumor samples per data set. The analysis included 368 data sets and 9,845 tumor samples. Only data sets limited to tumor samples were retrieved; those solely listing data on tumor stroma or normal peripheral blood lymphocytes in cancer patients, or those that combined several cancer types, were omitted from the analysis. There were no other exclusion criteria.

Figure 2. Probability density distribution of cancer gene data sets of high-incidence adult cancers. All cancer data sets were retrieved from GEO (query performed on August 15, 2015) and plotted against sample size (x-axis).
To illustrate some of the limitations imposed by small-size cohorts, we ran two simple tests comparing gene expression in two publicly available Ewing's sarcoma gene data sets.33,34 In the first test, we looked at two genes that encode epigenetic modifiers with important roles in tumorigenesis. The first gene, lysine-specific demethylase 1 (to avoid gene and species ambiguity,35,36 NCBI gene IDs are given herein along with the gene symbol; LSD1; GeneID: 23028), was shown to be overexpressed and to serve as a drug target in Ewing's sarcoma in vitro37 as well as in other neoplasms, such as breast cancer.38 The second gene, enhancer of zeste homolog 2 (EZH2; GeneID: 2146), was also shown to be overexpressed in Ewing's sarcoma and to be a drug target both in vitro and in vivo.39,40 As proposed by others,41 we computed bivariate kernel density estimates, ran regression analyses in the R environment, and compared the probability density distributions for either gene in two equal small-size Ewing's sarcoma cohorts. As shown in Figure 3A, LSD1 shows consistently high gene expression in both cohorts, whereas there are a few outliers for EZH2 (Fig. 3B), indicating that the sample size and number of cohorts utilized, while sufficient to analyze LSD1, were borderline in the case of EZH2.
We next ran a test, this time looking at chemokine (C-X-C motif) receptor 4 (CXCR4; GeneID: 7852), a gene encoding a chemokine receptor previously shown to mark metastatic Ewing's sarcoma and, as such, associated with about one-third of all samples, representing the fraction of metastatic tumors.42 In contrast to the earlier tests, here we found the size and number of cohorts to be limiting, as the bivariate kernel density estimates did not fully reproduce the predicted distribution for this gene (Fig. 3C). Although these examples show that analyses in two small-size cohorts may be sufficient for some genes, it is important to note that these were tests for which we already knew the answers. For hypothesis generation through bioinformatics, which is usually one of the main applications of gene data set mining, one would need a method to infer meaningfulness with a much higher degree of confidence.
In statistics, one way of boosting confidence is to increase the sample size. One example is the so-called sequential analysis, used particularly in prospective studies, whereby one adds samples (or recruits patients) until statistical significance is reached or until there are indications that significance is unlikely to be achieved.43 Although the sample size in sequential analysis is unknown prior to the end of the investigation, it tends to be smaller than in methods wherein the sample size is predetermined, making sequential analysis particularly suitable for cancers with very low incidences. Such a method, however, is impractical in the analysis of publicly available gene data sets, as their size is already fixed. In this case, an alternative is to increase the number of cohorts. Two bioinformatic methods have taken advantage of this option to infer meaningful correlates from small-size cohorts.
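The stopping rule described above can be sketched in a few lines. The following is an illustrative simulation only (the data, threshold, and function name are our own, not part of either method), and it deliberately uses a conservative significance threshold because naive repeated testing inflates the type I error rate; real sequential designs use corrected stopping boundaries.

```python
import numpy as np
from scipy import stats

def sequential_t_test(group_a, group_b, alpha=0.005, min_n=5):
    """Add one sample per group at a time and stop as soon as a
    two-sample t-test crosses the (deliberately conservative) alpha.
    Illustrative sketch only: formal sequential designs correct the
    stopping boundary (e.g., O'Brien-Fleming) to control type I error."""
    n = min_n
    for n in range(min_n, min(len(group_a), len(group_b)) + 1):
        t, p = stats.ttest_ind(group_a[:n], group_b[:n])
        if p < alpha:
            return n, p  # stop early: significance reached
    return n, p  # all available samples used

# Simulated measurements with a true group difference
rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, 100)
b = rng.normal(1.0, 1.0, 100)
n_used, p = sequential_t_test(a, b)
print(n_used, p)
```

With a genuine effect present, the procedure typically stops well before the full 100 samples per group are consumed, which is the property that makes sequential recruitment attractive for very low-incidence cancers.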
Ican and Affiliated Cancer Informatics Methods to Probe Small-Size Cohorts
To address the problem posed by the small size of cancer cohorts, one of the authors developed the first method to reliably infer gene expression significance, and its association with specific patient subsets, from publicly available small-size cohort data sets, irrespective of the expression profiling platform.37,42,44 This method, named Intercohort Co-ANalysis, or Ican, relies on several innovative tools. First, it uses published gene expression levels known to be biologically active in experimentally validated tissues as a benchmark for gene expression significance, thus extracting biological significance from gene expression profiles. This eliminates the variability across studies that results from different investigators customarily using different arbitrary cutoffs for gene expression significance. Second, instead of combining small-size cohorts into a larger meta-cohort, each small-size cohort is analyzed individually. This helps avoid conormalization and the averaging out of sample quantiles across cohorts with different variances and distribution functions. The cohort-specific distribution probabilities are in fact used to highlight the high intercohort variability inherent to small-size cohorts. Next, quantile fitting of sample subsets to specific disease states, say chemoresistant or metastatic tumors, is used to mine for consistent molecular correlates within individual cohorts. Finally, a subtractive overlay of cohort-restricted associations is carried out to uncover genes whose expression is consistently associated with select sample subsets in all cohorts. In our case, four small-size publicly available data sets, in addition to a fifth, nonpublicly available cohort that served for wet laboratory validation, were sufficient to infer gene expression significance.
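The final, subtractive-overlay step amounts to keeping only those genes that recur in every cohort's hit list. A minimal sketch, with hypothetical per-cohort hit lists (the gene sets below are invented for illustration and are not Ican's actual output):

```python
# Hypothetical per-cohort "hit lists": genes whose expression is
# consistently associated with a sample subset (e.g., metastatic
# tumors) within each individually analyzed cohort.
cohort_hits = [
    {"CXCR4", "EZH2", "LSD1", "MKI67"},
    {"CXCR4", "EZH2", "CCND1"},
    {"CXCR4", "EZH2", "LSD1"},
    {"CXCR4", "EZH2", "TP53"},
]

# Subtractive overlay: retain only genes recurrent in ALL cohorts
consistent = set.intersection(*cohort_hits)
print(sorted(consistent))  # ['CXCR4', 'EZH2']
```

Because each cohort is analyzed individually before the overlay, cohort-specific noise (here, MKI67, CCND1, LSD1, and TP53 appearing in only some lists) is subtracted out without any conormalization across cohorts.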
As a case in point, an Ican investigation of Ewing's sarcoma (a cancer wherein metastasis is the major poor-prognosis indicator) yielded several cosegregated chemokine ligand/receptor pairs in Ewing's sarcoma patient subsets and helped uncover the first two chemokine receptors associated with either metastases or poor prognosis in Ewing's sarcoma.42 To increase stringency, one may filter the patient-derived data sets through cell line-derived data sets. This can help eliminate genes associated not with the tumor cells themselves but rather with the tumor stroma or tumor-infiltrating lymphocytes.42 Using such a strategy, we could, for example, zero in on two druggable receptors that represent viable therapeutic strategies for the corresponding Ican patient subsets. Using Ican, another study uncovered a micro-RNA, miR-34a, as a major molecular determinant of chemosensitivity and patient survival.45 The analytical power of Ican can therefore help uncover genes and pathways with clinical significance from underpowered small-size cohorts, as well as from larger cohorts of highly heterogeneous diseases, which represent most diseases.4,28 We surmise that the growing number of gene expression profiling investigations, compounded by the mandatory submission of gene expression data sets to public repositories now requested by most journals, will lead to a large field of Ican applications and ensuing prognostic factor and biomarker discoveries.

Figure 3. Bivariate kernel density estimates of gene expression consistency across small-size cohorts. Genes tested were LSD1 (A), EZH2 (B), and CXCR4 (C).
A similar bioinformatics methodology, dubbed Integrative Transcriptome Analysis (Itan), was independently developed by research groups at Harvard University and the Massachusetts Institute of Technology (MIT).46 In this case, a coanalysis of nine hepatocellular carcinoma (HCC) gene data sets derived from different populations and microarray platforms was sufficient to uncover a novel mechanism of TGF-dependent WNT signaling activation in a subset of HCC patients.46 In contrast to Ican, which uses publicly available gene data sets for hypothesis formulation and an additional cohort to experimentally test the hypothesis, Itan uses the larger publicly available gene data sets for training (to avoid overfitting to any given cohort) and the smaller publicly available data sets for testing. The latter is accomplished by subclass mapping, which employs hierarchical clustering, k-means clustering, and nonnegative matrix factorization as unsupervised clustering methods to identify tumor subclasses.47 As in Ican, the overriding principle relies on molecular events consistently associated with particular tumor populations across all tested data sets. On this principle, the accuracy of both methods depends on the quality and number of data sets included in the analysis.
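Two of the unsupervised steps named above, nonnegative matrix factorization followed by k-means on the resulting sample weights, can be combined in a short sketch. This is a generic illustration on simulated data (the matrix sizes, signal strength, and two-subclass structure are our assumptions, not the published Itan pipeline):

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Simulated expression matrix: 40 tumors x 200 genes, with two latent
# subclasses, each overexpressing a disjoint block of 100 genes.
group1 = rng.gamma(2.0, 1.0, (20, 200)) + np.r_[np.full(100, 5.0), np.zeros(100)]
group2 = rng.gamma(2.0, 1.0, (20, 200)) + np.r_[np.zeros(100), np.full(100, 5.0)]
expr = np.vstack([group1, group2])

# Factorize into k=2 nonnegative "metagenes", then cluster the
# per-sample metagene weights to assign tumor subclasses.
W = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500).fit_transform(expr)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(W)
print(labels)
```

The point of factorizing first is that clustering operates on a handful of metagene weights rather than thousands of noisy genes, which is what makes the subclass assignment transferable across data sets of different sizes.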
Limitations of Small-Size Cohort Bioinformatic Methods
Although Ican and Itan can be useful in inferring meaningfulness for any given gene (and corresponding pathway), they remain of limited value when assessing the covariance of two or more genes across data sets, for example, to uncover gene networks associated with particular tumor subsets. This is because such analyses rely on Bayesian networks, Boolean networks, or the mathematics of product moment correlations, and, assuming all samples were added to the data set randomly (ie, patients were recruited consecutively without any prior knowledge of their clustering into one or another tumor subset, and no patients were removed from the cohort based on criteria related to the query at hand), these analyses are highly dependent on sample size.48 Thus, despite the constraints imposed by data set conormalization procedures, the analysis of meta-genes remains the method of choice here. For example, using the same data sets analyzed by Ican, product moment correlation analyses of meta-genes can determine whether a signaling pathway is on or off directly in tumor samples, or whether signaling molecules are active within specific pathways.49
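A common way to build a meta-gene, sketched below on simulated data, is to average the per-sample z-scores of a pathway's member genes and then compute the product moment (Pearson) correlation between two such meta-genes. The data, gene counts, and `metagene` helper are illustrative assumptions, not the cited studies' exact procedure:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(3)
n_samples = 30
# Simulated shared pathway activity driving two 5-gene sets
activity = rng.normal(0.0, 1.0, n_samples)
pathway_a = activity[:, None] + rng.normal(0.0, 0.5, (n_samples, 5))
pathway_b = 0.8 * activity[:, None] + rng.normal(0.0, 0.5, (n_samples, 5))

def metagene(expr):
    """Meta-gene score: average z-score of member genes per sample."""
    z = (expr - expr.mean(axis=0)) / expr.std(axis=0)
    return z.mean(axis=1)

r, p = pearsonr(metagene(pathway_a), metagene(pathway_b))
print(round(r, 2), p < 0.05)
```

Averaging over member genes suppresses gene-level noise, which is why meta-gene correlations can remain informative even at sample sizes where single gene-pair correlations are unreliable.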
Similarly, these methods would be ineffectual in inferring the significance of tumor drivers that harbor activating mutations but whose gene expression remains unchanged. In such cases, however, Ican and affiliated methods can be used in the analysis of the associated transcriptomes, provided that the gene in question, although itself impervious to Ican analysis, imparts a characteristic downstream gene expression signature, as shown, for instance, for several tumor drivers.50–53 While future studies should give us a better feel for the usefulness of Ican in such cases, this and affiliated methods will certainly find ample application in the field of biomarker discovery, in the search for markers of diagnosis, prognosis, patient risk stratification, and treatment response.37,42
Finally, though Ican is useful in the analysis of small-size cohorts, it requires multiple cohort data sets to infer differential gene expression significance. Unfortunately, many childhood cancers have very few (if any) gene expression data sets deposited in the public repositories, thus critically limiting the scope of Ican for these cancers. In this regard, an initiative of the NCI's Office of Cancer Genomics and Cancer Therapy Evaluation Program, dubbed Therapeutically Applicable Research to Generate Effective Treatments (TARGET), which aims at characterizing the transcriptomes and genomes of hard-to-treat childhood cancers, is most welcome. TARGET has already generated data sets for childhood acute lymphoblastic leukemia and for neuroblastoma, and efforts are underway to generate genomic and expression profiling data sets for childhood acute myeloid leukemia, osteosarcoma, and renal tumors.
Conclusions
The majority of gene data sets, including those of high-incidence adult cancers, are represented by small-size cohorts. Bioinformatic methods, such as Ican or Itan, can help analyze underpowered studies, provided that several data sets of the same disease type are available.
Although it may still be necessary to experimentally validate findings in additional data sets, particularly when novel or little-known pathways are uncovered, these methods have proven sufficient to uncover, with high confidence, genes meaningful for a particular biological or pathological state from small-size cohorts. As most cancers are genetically and epigenetically heterogeneous and/or of low incidence, the cancer informatics of small-size cohorts will remain a tool of choice in grasping for the brass ring of meaningful cancer-associated events in genomic and epigenomic data sets.
Author Contributions
Conceived and designed the experiments: NBB, IMBB. Analyzed the data: NBB, IMBB. Wrote the first draft of the manuscript: NBB, IMBB. Contributed to the writing of the manuscript: NBB, IMBB. Agree with manuscript results and conclusions: NBB, IMBB. Jointly developed the structure and arguments for the paper: NBB, IMBB. Made critical revisions and approved final version: NBB, IMBB. Both authors reviewed and approved the final manuscript.
Appendix
The following source code was used for the kernel density estimate statistical analyses implemented in R:
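The authors' R listing is not shown here. As a stand-in, the following minimal sketch computes a bivariate kernel density estimate over two cohorts' expression quantiles in Python (scipy.stats.gaussian_kde rather than R); the cohort values, sizes, and quantile grid are illustrative assumptions, not the study's data or exact procedure:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Hypothetical log2 expression values for one gene in two small cohorts
cohort_a = rng.normal(loc=8.0, scale=1.0, size=30)
cohort_b = rng.normal(loc=8.2, scale=1.2, size=25)

# Pair the two cohorts on a common grid of quantiles, then fit a
# bivariate Gaussian KDE over the paired quantile values.
q = np.linspace(0.05, 0.95, 50)
xa = np.quantile(cohort_a, q)
xb = np.quantile(cohort_b, q)
kde = gaussian_kde(np.vstack([xa, xb]))
density = kde(np.vstack([xa, xb]))  # one density value per quantile pair
print(density.shape)
```

If both cohorts sample the same underlying expression distribution, the density mass concentrates along the diagonal of the (cohort A, cohort B) plane; outliers off the diagonal flag intercohort inconsistency of the kind described for EZH2 and CXCR4 above.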
