Abstract
In this review we provide a systematic analysis of transcriptomic signatures derived from 42 breast cancer gene expression studies, in an effort to identify the most relevant breast cancer biomarkers using a meta-analysis method. Meta-data revealed a set of 117 genes that were the most commonly affected, with overlap ranging from 12% to 36% among breast cancer gene expression studies. Data mining analysis of transcripts and protein-protein interactions of these commonly modulated genes indicated three functional modules significantly affected among signatures: one module related to the response to steroid hormone stimulus, and two modules related to the cell cycle. Analysis of a publicly available gene expression data set showed that the obtained meta-signature is capable of predicting overall survival (
Introduction
Development of effective tools such as DNA microarrays for monitoring gene expression on a large scale has resulted in the discovery of gene networks and regulatory pathways in various tumor processes. In this respect, global gene expression in breast cancer has been profiled extensively over the last decade, which allowed the identification of breast cancer molecular subtypes and the development of prognostic and predictive gene signatures, resulting in an improved understanding of the heterogeneity of breast cancer.
In pioneering work, Perou et al used cDNA arrays to test the expression of approximately 8,000 genes in samples from 42 breast cancer patients.1 This first report suggested that primary breast carcinomas could be classified into specific ‘intrinsic subtypes’ distinguished by particular gene expression patterns. These data were confirmed and extended by Sorlie et al., who investigated the clinical usefulness of the breast cancer subtypes identified by screening for correlations between gene expression patterns and clinically relevant parameters. They demonstrated that classification of tumors based on gene expression patterns could be used as a prognostic marker with respect to overall and relapse-free survival in a subset of patients who had received uniform therapy.2,3 The five subtypes identified (Luminal A, Luminal B, Basal-like, ERBB2 positive/ER negative and normal breast-like) represent different biological entities and might originate from different cell types. One of the five subtypes was characterized by over-expression of ERBB2 and poor prognosis. A second tumor subtype, lacking expression of estrogen receptor
According to Sotiriou and Pusztai,10 global gene expression profiling has employed three different strategies to develop genomic signatures that may provide better prediction of clinical outcome. First, in the ‘top-down’ approach, gene expression data from tumors (or cell line models) are correlated with the clinical outcome of patients to identify prognostic gene signatures (eg, 70- and 76-gene poor-prognosis signatures). Second, in the ‘bottom-up’ approach, the prognostic predictor is derived from a gene expression signature related to a biological pathway or process (eg, wound-response, invasiveness and stromal related poor-prognosis signatures). Third, in the candidate-gene list approach, a set of biomarkers is prospectively selected on the basis of previous biological knowledge (eg, recurrence score signature).10
Among the myriad prognostic or predictive gene expression signatures generated, only four genetic assays are currently licensed for commercial use: the 70-gene ‘poor-prognosis’ signature (MammaPrint, Agendia BV, Amsterdam), the 21-gene ‘recurrence score’ signature (Oncotype DX, Genomic Health, Redwood City, California), the 97-gene ‘genomic grade index’ signature (MapQuant Dx, Ipsogen, Marseille, France), and the 2-gene ratio signature (Theros, Biotheranostics, San Diego, California). Some of these signatures have been previously compared. Fan et al demonstrated that 5 gene expression signatures (the intrinsic subtypes, the 70-gene, the 2-gene expression ratio, the 21-gene, and the wound response signatures) had similar performance in predicting outcome.11 However, comparisons of the gene lists derived from these studies have shown limited or no overlap between signatures. This disparity has been attributed to differences in the groups of patients analyzed (ER status, tumor grade, stage, etc.), in sample preparation (bulk, microdissected, etc.), in microarray platforms (high or low coverage of the human genome) and in the statistical methods used (supervised or unsupervised methods, gene selection, construction of the classifiers, etc.). In this sense, Ein-Dor et al demonstrated that many equally prognostic or predictive gene sets can be obtained from the same study.12 These data showed that each gene signature identifies different molecular features, which are predictive of clinical outcome while capturing only a partial picture of breast cancer biology. More importantly, these data suggest that combining multiple gene expression signatures may provide an integrated view that would be useful to define the most relevant breast cancer biomarkers.
In the present review, we provide a comprehensive integration of 42 breast cancer gene expression signatures, demonstrating that the overlap between gene expression signatures is greater than previously estimated by the comparison of a reduced set of gene lists.11 In addition, we demonstrate that the gene expression meta-signature is a powerful predictor of clinical outcome in patients with early-stage breast cancers. We also discuss the most relevant set of genes recurrently identified in this re-analysis of signatures.
Materials and Methods
Identification of common gene expression features among breast cancer signatures
We employed the GeneSigDB (release 2.0) online resource (http://compbio.dfci.harvard.edu/genesigdb) to detect gene overlap among the breast cancer gene expression signatures available in this database.13 GeneSigDB is a manually curated and standardized (EnsEMBL gene identifiers) database of gene expression signatures (n = 957), which focuses on cancer and stem cell studies. We selected the most relevant gene signatures derived from 42 breast cancer gene expression profiling studies (from 2002 to 2009) (see additional file 1). For the selected signatures, the GeneSigDB web application provides a gene-by-signature heatmap-style plot colored in red or grey according to the presence or absence of gene overlap, respectively.
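The overlap detection described above can be illustrated with a minimal sketch. It assumes each signature is available as a plain list of gene identifiers. The function names (`overlap_counts`, `presence_matrix`, `recurrent_genes`) and the toy gene memberships are hypothetical and are not taken from GeneSigDB or from the 42 signatures.

```python
from collections import Counter

def overlap_counts(signatures):
    """Count, for each gene, the number of signatures that contain it."""
    counts = Counter()
    for genes in signatures.values():
        counts.update(set(genes))  # set() so a gene counts once per signature
    return counts

def presence_matrix(signatures):
    """Return (genes, names, matrix); matrix[i][j] is 1 if gene i is in signature j."""
    names = sorted(signatures)
    members = {n: set(signatures[n]) for n in names}
    genes = sorted(set().union(*members.values()))
    matrix = [[1 if g in members[n] else 0 for n in names] for g in genes]
    return genes, names, matrix

def recurrent_genes(signatures, min_signatures):
    """Genes present in at least `min_signatures` signatures."""
    counts = overlap_counts(signatures)
    return sorted(g for g, c in counts.items() if c >= min_signatures)

# Toy input: hypothetical memberships, for illustration only.
toy_signatures = {
    "sig_A": ["CENPA", "CENPE", "CCNB1"],
    "sig_B": ["CCNB1", "CENPF", "CENPA"],
    "sig_C": ["CCNB1", "CENPN"],
}

genes, names, matrix = presence_matrix(toy_signatures)
print(recurrent_genes(toy_signatures, 2))
```

With 42 signatures, a threshold of `min_signatures=5` would correspond to the "more than four studies" criterion used for the meta-signature, that is, 5/42, or roughly 12% of the studies.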
Data extraction and hierarchical clustering
GeneSigDB data management was performed using a customizable HEM2TEM (for HeatMap to TExtMatrix) Java tool developed by us for extracting a plain text matrix from the XML/HTML heatmap previously described. To enable unsupervised classification and illustration of the commonly overlapped genes between the 42 breast cancer gene expression signatures, we used the Multi Experiment Viewer (MeV 4.5) software (http://www.tm4.org/mev/).14 Two-way (by gene and by signature) hierarchical clustering was used to examine the relationships among the 42 breast cancer gene expression signatures. Hierarchical clustering was based on Spearman's rank correlation distance metric and the complete linkage clustering method. Furthermore, we tested whether semantic terms (signature name, platform name or biological process) differed across clusters using Fisher's exact test. All P values were two sided, and
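As an illustration of the clustering step, the following sketch applies a Spearman rank correlation distance (1 minus the correlation) with complete linkage to a small presence/absence matrix. It is a plain-Python approximation written for this purpose, not the MeV implementation. The function names, the toy vectors and the choice to stop at a fixed number of clusters `k` are assumptions.

```python
from itertools import combinations
from math import sqrt

def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out

def spearman_distance(a, b):
    """1 - Spearman correlation, i.e. Pearson correlation of the ranks."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    if va == 0 or vb == 0:
        return 1.0  # constant vector: correlation undefined, treated as 0
    return 1.0 - cov / sqrt(va * vb)

def complete_linkage(vectors, k):
    """Agglomerate until k clusters remain; returns lists of vector indices."""
    clusters = [[i] for i in range(len(vectors))]
    dist = {(i, j): spearman_distance(vectors[i], vectors[j])
            for i, j in combinations(range(len(vectors)), 2)}

    def linkage(c1, c2):
        # Complete linkage: distance between the two farthest members.
        return max(dist[(min(i, j), max(i, j))] for i in c1 for j in c2)

    while len(clusters) > k:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]  # safe because a < b
    return clusters

# Toy signatures as presence/absence vectors over six genes (hypothetical).
toy_vectors = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 1],
]
print(complete_linkage(toy_vectors, 2))
```

Clustering by gene rather than by signature only requires transposing the matrix before calling `complete_linkage`.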
Data mining analysis
For automated functional annotation and classification of genes of interest based on Gene Ontology (GO) terms, we used the
In order to identify the molecular pathways that were mainly affected by the meta-signature, we looked for protein/gene interaction networks in the common core of overlapped genes. The protein-protein interaction network was generated using the STRING database (‘Search Tool for the Retrieval of Interacting Genes/Proteins’) (http://www.string.embl.de/).16 This bioinformatic tool collects, predicts and unifies most types of protein-protein associations, including direct and indirect associations. STRING runs a set of prediction algorithms and transfers known interactions from model organisms to other species based on predicted orthology of the respective proteins.16 In order to identify each gene in the database, we used both gene names and EnsEMBL gene identifiers in the ‘protein-mode’ application. The analysis input options were ‘co-occurrence’, ‘co-expression’, ‘experiments’, ‘databases’, and ‘text mining’ data at a high confidence level for predicted human orthology groups. All of the raw data reported as additional files in this article are publicly available at the journal web site.
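One simple way to look for groups of connected proteins in such a network is to keep only high-confidence edges and take the connected components of the resulting graph. The sketch below assumes the interactions are available as (protein, protein, score) triples and that a combined score of 0.7 marks high confidence. The threshold, the function name `modules` and the toy edges are assumptions. Connected components are only a crude stand-in for the functional modules discussed in this review, which were inferred from the network architecture.

```python
from collections import defaultdict

def modules(edges, min_score=0.7):
    """Connected components of the graph restricted to edges with score >= min_score."""
    graph = defaultdict(set)
    for a, b, score in edges:
        if score >= min_score:
            graph[a].add(b)
            graph[b].add(a)

    seen, components = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:  # depth-first traversal from `start`
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(graph[node] - comp)
        seen |= comp
        components.append(sorted(comp))
    return components

# Toy interaction list with made-up scores, for illustration only.
toy_edges = [
    ("CENPA", "CENPE", 0.95),
    ("CENPE", "CENPF", 0.90),
    ("CCNA2", "CCNB1", 0.92),
    ("CCNB2", "CCNB1", 0.85),
    ("CCNB1", "CENPA", 0.40),  # below threshold, so it does not join the two groups
]
print(modules(toy_edges))
```

Proteins whose edges all fall below the threshold drop out of the graph entirely, which mirrors the fact that only part of a gene list is usually covered by a high-confidence network.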
Gene expression meta-signature and survival analysis
To further investigate the prognostic value of the gene expression meta-signature, we performed survival analyses in a publicly available breast cancer microarray study. We selected the van de Vijver data set because of the biological diversity of the breast tumors included in this study.17 Briefly, the van de Vijver data set included 295 early-stage breast cancer samples (226 ER-positive and 69 ER-negative), of which 151 were lymph-node-negative and 144 were lymph-node-positive. The patients had all been treated by radical mastectomy or breast-conserving surgery, followed in some cases by radiotherapy, and a fraction of patients had received adjuvant treatment. Data on relapse-free survival (defined as the time to a first event) and overall survival were available for all patients. The gene expression profile was derived by researchers from the Netherlands Cancer Institute and Rosetta Inpharmatics-Merck using the Agilent Hu25K oligonucleotide (60mer) microarray (Agilent Technologies, Palo Alto, CA, USA). The gene expression matrix and the associated clinical data were obtained from the Rosetta Inpharmatics website17 (http://www.rii.com/publications/2002/nejm.html).
In an unsupervised analysis, the 295 tumor samples were grouped according to similarity in the 117-gene meta-signature by complete linkage clustering, using the Multi Experiment Viewer software. The samples were segregated into three classes (from Cluster 1 to Cluster 3) based on the second bifurcation of the clustering dendrogram. In addition, we integrated the gene expression meta-signature with four prognostic or predictive gene signatures (Intrinsic subtype, Poor-prognosis, Recurrence Score and Wound Response signatures) to evaluate the data set. Tumor classification according to the four prognostic or predictive gene signatures was established based on data provided by Fan et al 2006.11 Kaplan-Meier survival curves, log-rank statistics and Cox proportional hazard analyses were obtained using the SPSS® statistical software package (SPSS Inc., Chicago). The multivariate Cox proportional-hazard model included: estrogen receptor status (ER-positive vs. ER-negative), tumor grade (grade 1 vs. 2 and grade 1 vs. 3), lymph node status (LN-negative vs. 1–3 LN-positives and LN-negative vs. > 3 LN-positives), age (as a continuous variable), tumor size (diameter ≤ 2 cm vs. diameter > 2 cm), treatment received (no adjuvant therapy vs. chemotherapy/hormonal therapy), and gene expression meta-signature predictive clusters (cluster 1 vs. cluster 2/3). Overall survival and relapse-free survival were the end points.
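For readers unfamiliar with the survival statistics named above, the following sketch shows a Kaplan-Meier estimate and a two-group log-rank test in plain Python. It is an illustrative re-implementation under simple assumptions (right-censored data, two groups, chi-square with one degree of freedom) and is not the SPSS procedure used in the analysis. All function names and toy values are hypothetical.

```python
from math import erfc, sqrt

def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time (events: 1 = event, 0 = censored)."""
    curve, surv = [], 1.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        surv *= 1 - deaths / at_risk
        curve.append((t, surv))
    return curve

def logrank_chi2(times, events, groups):
    """Two-group log-rank chi-square statistic (1 degree of freedom)."""
    labels = sorted(set(groups))
    if len(labels) != 2:
        raise ValueError("exactly two groups are required")
    first = labels[0]
    obs_minus_exp, var = 0.0, 0.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        n = sum(1 for u in times if u >= t)                                # at risk, both groups
        n1 = sum(1 for u, g in zip(times, groups) if u >= t and g == first)
        d = sum(1 for u, e in zip(times, events) if u == t and e)          # events at t
        d1 = sum(1 for u, e, g in zip(times, events, groups)
                 if u == t and e and g == first)
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return obs_minus_exp ** 2 / var if var > 0 else 0.0

def chi2_1df_pvalue(chi2):
    """Upper-tail P value of a chi-square statistic with one degree of freedom."""
    return erfc(sqrt(chi2 / 2))

# Toy follow-up data (hypothetical): group "a" fails earlier than group "b".
toy_times = [1, 2, 3, 4, 5, 6]
toy_events = [1, 1, 1, 1, 1, 1]
toy_groups = ["a", "a", "a", "b", "b", "b"]
chi2 = logrank_chi2(toy_times, toy_events, toy_groups)
print(kaplan_meier(toy_times, toy_events), chi2, chi2_1df_pvalue(chi2))
```

In the setting described here, the two groups would be meta-signature cluster 1 versus clusters 2 and 3, with overall or relapse-free survival as the end point.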
Results and Discussion
Based on a novel gene list meta-analysis approach, a systematic review of 42 breast cancer gene signatures was performed in order to identify and compare the most relevant breast cancer biomarkers. The study comprised four phases: (a) detection of overlapping genes among the different signatures, (b) examination of the relationship between gene expression signatures by a two-way unsupervised analysis, (c) identification of the molecular pathways that are mainly affected by the gene expression meta-signature, and (d) validation of the gene expression meta-signature's prognostic value in a set of 295 patients with early-stage breast cancers obtained from the van de Vijver et al study.17
Identification of the gene expression meta-signature and data mining analysis
Among the 42 gene expression signatures (see additional file 1), a total of 946 transcripts were identified as overlapping in more than one study (Fig. 1A, Additional file 2). Of the 946 transcripts, 117 genes were identified in more than four studies, representing the set of most frequent breast cancer biomarkers in this analysis (Fig. 1B). Additional file 2 shows the most common overlapping genes between breast cancer signatures.

Overlap between gene identifiers across 42 breast cancer gene expression signatures.
Hierarchical clustering analysis of the 42 gene expression studies classified the signatures into four groups: the intrinsic subtype signatures, the response to chemotherapy related signatures, the stromal/extracellular matrix (ECM) related signatures and the signatures enriched in cell cycle genes (Fig. 2). It can be clearly seen that related signatures, such as intrinsic subtypes and ER-alpha status on the one hand, or stromal and extracellular matrix signatures on the other hand, have a large overlap relative to other gene expression signatures. Furthermore, it is interesting to note that the most common signature cluster found was associated with enrichment in cell cycle genes (Fig. 2). No statistically significant associations were detected between signature clusters and the microarray platforms employed for gene expression profiling (

Hierarchical clustering analysis of the 42 breast cancer gene expression studies classified them into four groups: the intrinsic subtypes, response to chemotherapy, stromal/extracellular matrix (ECM) and signatures enriched in cell cycle genes. It can clearly be seen that related signatures, such as intrinsic subtype and ER-alpha status on the one hand, or stromal and extracellular matrix signatures on the other hand, have a large overlap relative to other gene expression signatures.
Gene Ontology annotation of the 117-gene meta-signature showed that approximately 55% of the transcripts are involved in cell cycle regulation, 13% are related to response to steroid hormone stimulus, 4% are related to extracellular matrix interaction/remodeling and 3% are related to other signal transduction pathways (Fig. 3A, additional file 2). Additionally, Figure 3B shows a protein-protein interaction network associating the common core of genes across gene expression signatures. The graph was generated employing the STRING on-line resource based on high confidence data. STRING is a comprehensive tool integrating protein association information with the capability to transfer known interactions from model organisms to other species. The generated graph (Fig. 3B) indicates strong interactions among a set of 95 proteins derived from the 117-gene meta-signature (81% coverage). Furthermore, the network architecture suggests the existence of three functional modules (sets of genes that act in concert to carry out a specific function): a module related to the response to steroid hormone stimulus (green circles in Fig. 3B), and two modules related to the cell cycle signaling pathway (Fig. 3B).

Data mining analysis of the gene expression meta-signature.
Gene expression meta-signature analysis and its clinical relevance as prognostic marker
To further explore the prognostic value of the gene expression meta-signature, we performed univariate and multivariate analyses of 295 breast cancer patients obtained from a publicly available breast cancer gene expression data set.17 We first used hierarchical clustering (HCL) analysis to separate the patients into groups according to their similarity in the gene expression meta-signature, and then determined the overall and relapse-free survival rates for these groups.
The HCL analysis classified the patients into 3 clusters (Fig. 4A). To further elucidate the reasons driving the separation of breast carcinomas into three major groups, we integrated the gene expression meta-signature with four prognostic or predictive gene signatures (Fig. 4B–C). Interestingly, meta-signature cluster 1 was highly associated with the normal-like and luminal A intrinsic subtypes of breast carcinoma (

Cross-validation of the gene meta-signature with a single data set of 295 breast cancer samples and integration with 4 prognostic or predictive gene expression signatures.
Kaplan–Meier analysis revealed that meta-signature clusters 2 and 3 were particularly associated with shorter overall survival (

Kaplan–Meier curves of overall and relapse-free survival among the 295 early-stage breast cancer patients from the van de Vijver et al study (2002) according to the meta-signature (A and B), Intrinsic Subtypes (C and D), Poor Prognosis Signature (E and F), Recurrence Score (G and
To further evaluate the independent prognostic value of the gene expression meta-signature, we next performed a multivariate Cox proportional-hazard analysis that included the most relevant traditional prognostic factors, such as ER status, tumor grade, nodal status and tumor size. This analysis demonstrated that the gene expression meta-signature was a statistically significant predictor of both overall survival and relapse-free survival (Table 1).
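To make the hazard ratio reported by a Cox model more concrete, the sketch below fits a proportional-hazards model with a single covariate by Newton-Raphson on the partial likelihood (Breslow handling of tied event times). It is a deliberately reduced, univariate illustration. The analysis described here was multivariate and was run in SPSS. The function name, iteration settings and toy data are assumptions.

```python
from math import exp

def cox_single_covariate(times, events, x, iters=50, tol=1e-10):
    """Estimate beta for one covariate; exp(beta) is the hazard ratio per unit of x."""
    beta = 0.0
    event_times = sorted({t for t, e in zip(times, events) if e})
    for _ in range(iters):
        grad, info = 0.0, 0.0
        for t in event_times:
            risk = [xi for u, xi in zip(times, x) if u >= t]           # risk set at t
            failed = [xi for u, e, xi in zip(times, events, x) if u == t and e]
            w = [exp(beta * xi) for xi in risk]
            s0 = sum(w)
            s1 = sum(wi * xi for wi, xi in zip(w, risk))
            s2 = sum(wi * xi * xi for wi, xi in zip(w, risk))
            mean = s1 / s0                                             # weighted mean of x
            grad += sum(failed) - len(failed) * mean                   # score
            info += len(failed) * (s2 / s0 - mean ** 2)                # observed information
        if info == 0:
            break
        step = grad / info
        beta += step
        if abs(step) < tol:
            break
    return beta

# Toy data (hypothetical): x = 1 marks the higher-risk group (e.g. clusters 2/3).
toy_times = [2, 4, 5, 9, 3, 6, 8, 10, 12]
toy_events = [1, 1, 1, 0, 1, 0, 1, 1, 0]
toy_x = [1, 1, 1, 1, 0, 0, 0, 0, 0]

beta = cox_single_covariate(toy_times, toy_events, toy_x)
print("hazard ratio:", exp(beta))
```

A hazard ratio above 1 for the group coded as 1 indicates a higher instantaneous risk of the event relative to the reference group. In a multivariate model the same quantity is estimated while adjusting for the other covariates.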
Multivariate Cox proportional hazard analysis of standard clinical prognosis factors with the gene expression meta-signature predictor.
Size was a binary variable (0 = diameter of 2 cm or less, 1 = greater than 2 cm); age was a continuous variable formatted as decade-years. The hazard ratio for the meta-signature was calculated by comparing clusters 2 and 3 relative to cluster 1. Variables found to be significant (
The results show that the 117-gene meta-signature was highly informative in identifying patients with good and poor prognosis based on the expression profiles obtained from the van de Vijver data set.17 In addition, the meta-signature added important prognostic information beyond that provided by the standard clinical predictors. In fact, the meta-signature was the most predictive variable in the analysis, as reflected by its having the lowest nominal
Most highly up-regulated transcripts from the meta-signature gene list in the van de Vijver et al 2002 data set.
Gene expression modules associated with the meta-signature
Response to steroid hormone stimulus module
Approximately two-thirds of all breast cancers are ERα(+) at the time of diagnosis, and expression of this receptor is a determinant of a tumor phenotype that is associated with hormone-responsiveness. Patients with tumors expressing ERα have a longer disease-free interval and overall survival than patients with tumors that lack ERα expression.18 Several studies have been carried out using cDNA and oligonucleotide microarrays, identifying breast cancer subclasses possessing distinct biological and clinical properties.1,19 Among the distinctions made to date, the clearest separation was observed between ERα(+) and ERα(–) tumors. It has been suggested that there are sets of genes expressed in association with ERα that could play an important role in determining the hormone-responsive breast cancer phenotype.20
Functional annotation of the 117-gene meta-signature identified several genes related to the response to steroid hormone stimulus, such as
Reciprocally, ERα directly stimulates transcription of the
Cell cycle module and the mitotic spindle related genes
A common observation in cancer gene expression profiling is the systematic up-regulation of proliferation/cell cycle related genes among human cancer cells. The up-regulation of these genes is consistent with the fact that cancer is a disease that disrupts normal cell cycle control. Moreover, both in interphase and during mitosis, surveillance mechanisms (checkpoints) ensure that cell cycle events occur in the correct order by delaying crucial transitions until previous processes have been completed. Lesions in the processes and checkpoints mentioned above inevitably lead to genetic imbalances, a hallmark of cells in most solid tumors.
As previously described, functional annotation of the 117-gene meta-signature identified 64 genes related to the cell cycle process. In addition, according to the gene/protein network analysis, the 64 genes were divided into two modules: 32 genes (50%) related to mitotic spindle biology and 32 genes (50%) related to cell cycle progression per se (red circles and part of the blue circles in Figure 3, respectively). More importantly, the mitotic spindle module consists of 32 genes, many of which have been associated with gene over-expression and poor prognosis in breast cancer, such as
Interestingly, in another study it has been demonstrated that BRCA1 regulates transcriptional expression of multiple cell cycle genes, including the genes mentioned above
Interestingly, the centromere associated protein family members mentioned above (CENPA, CENPN, CENPE and CENPF) are all linked in the spindle gene module. CENPA is essential for the recruitment to the centromere of most other proteins required for kinetochore function,59 as indicated by the observation that RNAi of
In view of this information, it is interesting to note that most genes of the mitotic spindle cluster are involved in the G2-M phase of the cell cycle, in which they are more active. Since these genes arose from a breast cancer gene signature meta-analysis of 42 studies, it is plausible that these genes, involved in “opening the door to proliferation”, could represent potential targets for breast cancer therapy. Although many of them have been extensively studied in breast carcinoma, there are new ones that might constitute the “key to close the door”.
The other cluster of cell cycle genes is a more heterogeneous group, which mainly includes cyclins, cyclin dependent kinases, cyclin dependent kinase inhibitors and members of the minichromosome maintenance complex (MCM). Several studies have focused on the behavior and localization of different cyclins during tumor progression. Of the cyclins that emerged from our analysis, cyclins A2, B1, B2 and E2 are all well characterized; however, there is not enough information about their expression in breast cancer. Cyclin A2 is associated with cellular proliferation and can be used in molecular diagnostics as a proliferation marker. It has been demonstrated that this gene is down-regulated in an estrogen-mediated manner.73 A recent study suggested that the oncogenic role of overexpressed cyclin B1 is mediated in the nuclei of breast carcinoma cells, and that its nuclear translocation is regulated by
Conclusions
In summary, microarray technology has allowed the discovery of relevant signatures and consequently the identification of novel genes that may have an impact as breast cancer biomarkers. Our comprehensive comparison of overlapping genes across 42 breast cancer gene expression signatures provides an integrated view of a significant number of transcripts identified as highly modulated in breast tumors. The identification of individual proteins is of high relevance not only for their potential value as prognostic biomarkers but also because they may provide insight into mechanisms and pathways of relevance in breast cancer progression. More importantly, this analysis identified the most promising biomarkers for further evaluation in breast cancer, such as the cell cycle and mitotic spindle related genes.
Supplementary Data
Additional file 1
42 gene expression signatures selected for analysis and their corresponding list of genes.
Additional file 2
List of 946 transcripts that were identified as overlapping in more than one of the 42 gene expression signatures analyzed.
Disclosure
This manuscript has been read and approved by all authors. This paper is unique and not under consideration by any other publication and has not been published elsewhere. The authors and peer reviewers report no conflicts of interest. The authors confirm that they have permission to reproduce any copyrighted material.
Acknowledgements
This work was supported by FONCYT (PICT N32702, BID 1728 OC/AR) and CONICET (PIP N2131) grants.
