Abstract

Large public repositories of microarray experiments offer an abundance of biological data. It is of interest to use and to combine the available material to create new biological information and to develop a broader view on biological phenomena. Meta-analyses recombine similar information over a series of experiments to sketch scientific aspects which were not accessible by each of the single experiments. Meta-analysis of high-throughput experiments has to handle methodological as well as technical challenges. Methodological aspects concern the identification of homogeneous material which can be combined by appropriate statistical procedures. Technical challenges come from the data management of large amounts of high-dimensional data, long computation time, as well as the handling of the stored phenotype data.

This paper compares in a meta-analysis of a large series of microarray experiments the interaction structure within selected pathways between different tumour entities. The feasibility of such a study is explored and a technical as well as a statistical framework for its completion is presented. Multiple obstacles were met during completion of this project. They are mainly related to the quality of the available data and influence the biological interpretation of the results derived.

The sobering experience of our study asks for combined efforts to improve the data quality in public repositories of high-throughput data. The exploration of the available data in large meta-analyses is limited by incomplete documentation of essential aspects of experiments and studies, by technical deficiencies in the data stored, and by careless duplications of data.

Keywords

meta-analysis, oncology, R, public microarray data, gene graphs

Introduction

Increasing insights into the pathogenesis of malignant disorders and the detection of a rapidly rising number of molecular alterations gave rise to the hope that cancer specific genetic profiles might be generated that will define biologic subgroups as well as define targets for direct specific therapeutic agents.¹ The search for genomic alterations has revealed a huge heterogeneity not only within one histologically defined cancer entity but even within one individual tumour.² The heterogeneity of genomic mutations, however, becomes less complex since their functional effects merge in the alteration of a few, distinct pathways, only.³ Hence, the understanding of cancer biology may be improved also by focusing on alterations in pathway activities across tumour entities.

Rhodes et al⁴ developed meta-analytic tools to characterize a common transcriptional profile that is universally activated in most cancer types relative to the normal tissues from which they arise, likely reflecting essential transcriptional features of neoplastic transformation. In addition, they characterized a transcriptional profile that is commonly activated in various types of undifferentiated cancer, suggesting common molecular mechanisms by which cancer cells progress and avoid differentiation.

It is the goal of this study to explore the feasibility of a large cross-cancer meta-analysis based on high-throughput gene expression microarray data (GEMA) to compare the interaction structure between members of specific pathways across relevant tumour entities based on available gene expression microarray data from oncological studies. The challenges of this project are given by the quality of available data, the data management for the projected study, the biostatistics/bioinformatics tools available for the analysis, and finally the strategy for interpreting the computational results.

Editorial policies and the idea of reusing high-throughput gene expression data for validation and new research questions triggered the creation of public repositories. The MIAME (Minimum Information About a Microarray Experiment) criteria⁵ formulate the necessary conditions for verifying and reproducing results of microarray data analyses. MIAME compliance assures a sensible reuse of public microarray data for the study of new questions: biological properties of the samples and the phenotypes that were assayed need to be recorded alongside the data obtained from these assays.

At the moment there are three recommended international repositories for archiving publication-related functional genomic data:⁶,⁷ ArrayExpress (AE),⁸ Gene Expression Omnibus (GEO),⁹ and the Center for Information Biology Gene Expression Database (CIBEX).¹⁰ GEO is currently the largest fully public gene expression resource.

Meta-analytic tools for GEMA have been developed by many authors,¹¹ but mainly in the field of differential gene expression and profiling. To our knowledge, this paper is the first attempt at a meta-analysis of pathway-specific network structures across tumour entities. The structural comparison is motivated by the discovery of different relationships between cancer types.¹² There is evidence for familial associations between acute myeloid leukaemia and colorectal cancer.¹³ Men with a family history of breast cancer also have an increased risk of prostate cancer.¹⁴ Different leukaemias derive from specific deregulations during haematopoietic stem cell differentiation.¹⁵

Therefore, the interaction structure among genes annotated to specific pathways is explored and compared between eight human cancer entities. The entities fall into two tumour groups: four solid tumours (breast, colon, prostate, lung) and four haemic tumours (ALL, AML, CLL, lymphoma). Thirteen KEGG pathways, organized into three groups, are studied: basic cellular signalling pathways (KEGG ID 04110: Cell cycle, 04115: p53 signalling pathway, 04210: Apoptosis, 04310: Wnt signalling pathway, 04512: ECM-receptor interaction), disease-specific pathways (05210: Colorectal cancer, 05215: Prostate cancer, 05221: Acute myeloid leukaemia, 05223: Non-small cell lung cancer), and pathways related to DNA repair (04150: mTOR signalling pathway, 03410: Base excision repair, 03420: Nucleotide excision repair, 03430: Mismatch repair).

Table 1.

Number of experiments and samples in GEO (published data) and AE database (27/02/2009).

Database	Experiments	Samples	Experiments without FLEO	First data
GEO	11298	286645	4362 (39%)	Jan 2001
AE	7637	224947	1599 (21%)	Oct 2003

Abbreviation: FLEO, feature-level extraction output.

The exploration of the communication structure within a large set of genes becomes feasible when the dynamics of the complex biological system are ignored. The available microarray measurements represent time averages of transcription dynamics. Conditional interaction graphs are used to infer their conditional correlation structure.¹⁶ An interpretation of the edges of these graphs will not be given. The interest lies in assessing evidence that the edges of these graphs differ between cancer entities.

The paper is an explorative study of a strategy for combining publicly available data repositories, bioinformatic tools, and statistical concepts to address the defined task. To this end, an analysis pipeline for the intended problems is described. The adaptation of data management and related tools to assemble the data, to check its quality, and to perform the low- and high-level analyses for a very large set of microarrays is demonstrated, and the results are presented. The paper is organized as follows: Section 2 describes material and methods and presents the data as well as the tools used for the low- and high-level analyses. Section 3 contains the results on global differences in the conditional correlation structure of thirteen pathways in eight cancer entities. We discuss our experiences and results in Section 4.

Materials and Methods

Microarray Data Set

Because data are imported weekly from GEO into AE, the data are taken from AE in order to facilitate the data management process. The repositories are dominated by experiments with Affymetrix microarray data of the ‘HG-U133A’ and ‘HG-U133 Plus 2.0’ chip platforms. In order to work with samples with a uniform laboratory work-up, we concentrate on data from the ‘HG-U133A’ Affymetrix GeneChip. In order to avoid bias due to specific pre-processing of the raw data, the feature-level extraction output (FLEO) files (CEL files) are used.¹⁷

All experiments from the AE repository that were available on February 27, 2009 and satisfy the following selection criteria are included: FLEO data are available, more than 10 arrays have chip type HG-U133A, the experiment has more than 20 arrays, and 50% of the arrays belong to one of the eight cancer entities. Some experiments satisfying these criteria contain identical arrays. For example, the arrays from the experiments ‘E-GEOD-3910’ and ‘E-GEOD-3911’ together are identical to the arrays from the super-series experiment ‘E-GEOD-3912’. Such experiments are not included in the study to avoid duplicate arrays. Thereby 23 experiments are excluded.
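The selection criteria above can be expressed as a simple filter. The following is a minimal sketch, not the code used in the study: the `Experiment` record, its field names, and the accession values are hypothetical, and the "50% of the arrays" criterion is read here as "at least 50%", which is an assumption.

```python
from dataclasses import dataclass

# The eight cancer entities named in the study.
ENTITIES = {"BREAST", "COLON", "PROSTATE", "LUNG", "ALL", "AML", "CLL", "LYMPHOMA"}


@dataclass
class Experiment:
    """Hypothetical per-experiment record (field names are illustrative)."""
    accession: str     # repository accession; the values used below are made up
    has_fleo: bool     # feature-level extraction output (CEL files) available?
    chip_types: list   # one chip-type label per array
    entities: list     # one entity label per array (None if not a target entity)


def is_eligible(exp: Experiment) -> bool:
    """Apply the four inclusion criteria described in the text."""
    n_arrays = len(exp.chip_types)
    n_u133a = sum(1 for chip in exp.chip_types if chip == "HG-U133A")
    n_target = sum(1 for entity in exp.entities if entity in ENTITIES)
    return (
        exp.has_fleo                       # FLEO data available
        and n_u133a > 10                   # more than 10 HG-U133A arrays
        and n_arrays > 20                  # more than 20 arrays in total
        and n_target >= 0.5 * n_arrays     # assumption: "at least 50%"
    )
```

Removing duplicated arrays (for example, super-series that repeat their sub-series) is a separate step and is not covered by this filter.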

A large cancer data set with more than 7000 microarrays is built from about 60 publicly available experiments in the AE database. An overview of the selected experiments is available in the Appendix. Detailed statistics of the data set are shown in Table 2. Data from cell line experiments and from human patients are grouped together. Furthermore, cancer subtypes are combined into one cancer entity (eg, childhood ALL is included in the ALL cancer entity group).

Table 2.

Statistics of available arrays for selected ArrayExpress experiments grouped by the eight cancer entities.

Entity	Experiments	Arrays	HG-U133A	Deficient	Used
BREAST	20	3595	2454 (68%)	40 (1%)	1834 (51%)
ALL	12	1190	1140 (96%)	3 (0%)	916 (77%)
LUNG	7	537	398 (74%)	12 (2%)	386 (72%)
COLON	6	203	203 (100%)	6 (3%)	197 (97%)
PROSTATE	5	475	418 (88%)	2 (0%)	416 (88%)
AML	4	726	563 (78%)	29 (4%)	534 (74%)
LYMPHOMA	4	335	335 (100%)	4 (1%)	331 (99%)
CLL	3	194	182 (94%)	5 (3%)	177 (91%)
Total	61	7255	5693 (78%)	101 (1%)	4791 (66%)

The R language¹⁸ and the Bioconductor project¹⁹ are chosen as the computational environment. Parallel computing is used to handle several thousand microarrays in the low- and high-level analyses of our data.²⁰,²¹ The Bioconductor package affyPara²²,²³ implements parallel computing for the pre-processing and quality assessment of microarray data.

The tools ‘boxplot’ and ‘MA-plot’¹⁹ are used for quality assessment in the pre-processing step. If an array is deficient in both assessments, it is marked as ‘deficient’ and excluded. Sixty deficient arrays for solid cancer experiments and 41 deficient arrays for haemic cancer experiments are excluded, which is about 1% of the data. Due to duplicated arrays in different experiments (from one cancer entity) and several deficient arrays, the set of arrays used in the analysis is smaller than all available arrays. For the breast cancer experiments only 51% can be used in the analysis and about 66% of all arrays. Therefore, the meta-analysis is executed on 4791 microarrays: 2833 arrays for solid tumours and 1958 arrays for haemic disease tumours.
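The array counts quoted above follow directly from Table 2. As a quick consistency check, the per-entity numbers can be summed by tumour group; this is only bookkeeping on the published table, with the grouping into solid and haemic entities taken from the Introduction.

```python
# Deficient and used arrays per entity, copied from Table 2.
DEFICIENT = {
    "BREAST": 40, "LUNG": 12, "COLON": 6, "PROSTATE": 2,
    "ALL": 3, "AML": 29, "LYMPHOMA": 4, "CLL": 5,
}
USED = {
    "BREAST": 1834, "LUNG": 386, "COLON": 197, "PROSTATE": 416,
    "ALL": 916, "AML": 534, "LYMPHOMA": 331, "CLL": 177,
}
SOLID = ("BREAST", "LUNG", "COLON", "PROSTATE")
HAEMIC = ("ALL", "AML", "LYMPHOMA", "CLL")
TOTAL_ARRAYS = 7255  # all arrays in the 61 selected experiments (Table 2)


def group_total(table, group):
    """Sum one column of Table 2 over the entities of a tumour group."""
    return sum(table[entity] for entity in group)


solid_used = group_total(USED, SOLID)             # 2833
haemic_used = group_total(USED, HAEMIC)           # 1958
all_used = solid_used + haemic_used               # 4791
solid_deficient = group_total(DEFICIENT, SOLID)   # 60
haemic_deficient = group_total(DEFICIENT, HAEMIC) # 41
share_used = all_used / TOTAL_ARRAYS              # about 0.66
```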

Phenotype Data

No detailed data on the phenotypes of the patients included in the study were available, owing to a lack of compliance with the MIAME annotation rules. Even basic information such as sex and age of the patients is not completely available. Detailed information on basic features of the tumours is generally missing. Our findings are summarized as follows: 47% of the patients are female and 18% male (36% missing), and the median age is 55 (50% missing). All other variables, eg, tumour staging, are available for less than 20% of all arrays. Figure 1 shows the histogram of the age distribution for the haemic and solid cancer groups. Since basic information on the tumours is not available, the tumour entities may represent quite inhomogeneous groups.

Figure 1

Histogram of the age distribution for the haemic and solid cancer groups.

Data Management

Sixty-one single experiments and 7255 microarrays define the data set. Owing to the large number of CEL files and a data volume of about 80 GB, data management and storage are intricate. To make the data management feasible and reproducible, the raw data and processed data are saved in a generally defined directory structure on the local hard disk. For every cancer entity, a directory containing the files is created. The file structure is optimized for data processing with the R language and for re-usability of intermediate results.

The R package ArrayExpressDataManage supports the data management of AE experiments on the local file system. It uses the Bioconductor package ArrayExpress²⁴ to download data from the AE database. Functions for different operations on the file structure are provided: standard microarray processing steps (eg, parallel and serial RMA pre-processing) as well as functions for cleaning the data structure and creating overview tables.

The package automatically creates the data set generation script. Given an R list object with the AE experiment IDs, the complete data set can be regenerated from the AE database. For the large cancer study, the list object is available in the Appendix. Therefore, the data set of our analysis is not submitted as a new super-series data set to one of the public repositories. It (raw data and phenodata) is already available in the AE database and can easily be constructed by the analyst from the data set generation script. It is straightforward to add new experiments to the analysis. For more details, see the vignette or the help files of the package. The package is available at the R-forge repository: http://AEDataManage.R-forge.R-project.org/.

Low-level Analysis

The data is pre-processed in one run using the R packages ArrayExpressDataManage and affyPara. After quality control, normalization is achieved by the Robust Multichip Average²⁵ [RMA] method. All analyses are parallelized and run on the 32 engine computer cluster at the IBE (LMU, Munich) offering a maximum of 128 processors. Each machine runs on four processors and eight GB main memory and they are connected with a 1 Gbit network. The complete RMA pre-processing of the 4791 HG-U133A CEL files took about 50 minutes computation time.

The data showed strong batch effects. Correction for batch effects²⁶,²⁷ uses an empirical Bayes framework as proposed by Johnson et al.²⁷

High-Level Analysis

The PC-Algorithm²⁸ is used to estimate the network structure (conditional correlation structure) within the set of genes annotated to each pathway for each cancer entity.

Estimating Graphs

Multivariate gene expression data is characterised by its mean value structure as well as its dependence or correlation structure. While the first is concerned with the quantitative amount of transcription activity, the second focuses on the map of direct influences between genes: does the transcription activity of one gene influence the transcription activity of a second gene when all other genes annotated to the corresponding pathway are held at a fixed transcription level? An edge is drawn between two nodes (genes) if a direct influence is assessed. The PC-Algorithm estimates such a graph from observational data.²⁸,²⁹

Thorough validation studies³⁰ show advantages of the PC-Algorithm compared to competing approaches,³¹⁻³⁵ especially for sparse graphs, in terms of estimation quality (true and false discovery rates for edges) as well as computational speed.

The PC-Algorithm is run with α = 0.05. This is a good choice for a graph with fewer than 20% of the maximal number of edges.²⁸ Such sparsity is a plausible assumption for gene sets annotated to the KEGG pathways.
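The role of α can be illustrated with the kind of test on which edge decisions in a PC-type algorithm rest. For Gaussian data, a common choice is a Fisher z-test of a partial correlation; whether exactly this test was used in the study is not stated here, so the sketch below is an assumption. It handles a single conditioning variable only, whereas the full algorithm iterates over conditioning sets of increasing size. The function names are illustrative.

```python
import math
from statistics import NormalDist


def pearson(x, y):
    """Plain Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y given z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


def independent_given(x, y, z, alpha=0.05):
    """True if 'partial correlation = 0' is NOT rejected at level alpha.

    Fisher z-test with one conditioning variable, so the scaling
    factor is sqrt(n - 1 - 3). Requires more than four observations.
    """
    n = len(x)
    r = partial_corr(x, y, z)
    r = max(min(r, 0.999999), -0.999999)       # keep the log finite
    fisher_z = 0.5 * math.log((1 + r) / (1 - r))
    statistic = math.sqrt(n - 1 - 3) * abs(fisher_z)
    return statistic <= NormalDist().inv_cdf(1 - alpha / 2)
```

When such a test does not reject, the edge between the two genes is removed; a larger α therefore keeps more edges and yields denser estimated graphs.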

Comparing Graphs

Graphs on the same set of nodes are compared by the Structural Hamming Distance (SHD). The SHD between two graphs is the number of edge insertions, deletions, or flips needed to transform one graph into the other. The smaller the SHD, the greater the similarity between the two graphs. The SHD is symmetric and can be calculated as SHD = (number of distinct edges occurring in either graph) − (number of edges common to both graphs).
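For undirected graphs, this count is the size of the symmetric difference of the two edge sets. The following minimal sketch makes that explicit; it ignores edge orientation (so "flips" do not occur), and the function names are illustrative.

```python
def edge_set(pairs):
    """Represent undirected edges so that (A, B) and (B, A) coincide."""
    return {frozenset(pair) for pair in pairs}


def shd_undirected(edges_a, edges_b):
    """Edges in either graph minus edges shared by both graphs."""
    a, b = edge_set(edges_a), edge_set(edges_b)
    return len(a | b) - len(a & b)


# Two small graphs on the nodes A, B, C, D (toy data).
g1 = [("A", "B"), ("B", "C"), ("C", "D")]
g2 = [("B", "A"), ("C", "D"), ("A", "D")]
distance = shd_undirected(g1, g2)  # B-C is deleted and A-D inserted, so 2
```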

The null-hypothesis of no structural difference between two tumour entities is tested by a permutation test. The test assesses whether an observed SHD between two graphs is untypically large compared to the SHD distribution under the null-hypothesis. This distribution results from comparing two estimated graphs from two data sets which differ just by random fluctuations. The permutation test is carried out after standardizing the transcription values of genes annotated to the specific pathways: within each of the two cancer entities being compared, the mean value is subtracted from the individual measurements and the difference is divided by the standard deviation. The rejection of this null-hypothesis at a 5% significance level is considered as evidence that the cell processes, as captured by the specific set of pathway genes, exhibit differential dynamics between the two tumour entities considered.

The resampling for the test procedure proceeds as follows:

• Choose the SHD to measure the differential conditional correlation structure between both graphs.

• Estimate each graph by the PC-Algorithm with α = 0.05 from the observed data and determine SHD_obs between both graphs.

• For resampling step i, permute the data units between both data sets, estimate both graphs, and calculate the specific SHD_i (i = 1, …, R).

• Determine a permutation P-value by p_perm = #{SHD_obs < SHD_i}/R.

• Reject the null-hypothesis if p_perm is smaller than 0.05.

The data is resampled R = 500 times.

Permutation P-values below 0.01 are considered as evidence for a difference. Larger P-values are called marginal (P ≤ 0.05) or not significant (P > 0.05).
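The resampling scheme can be sketched in a few lines. The example below is not the pipeline used in the study: to stay self-contained it replaces the PC-Algorithm with a deliberately simple stand-in estimator (an edge wherever the absolute Pearson correlation exceeds a threshold), and all function names and the threshold are assumptions. Standardization within each group, the permutation of data units, and the P-value formula follow the description above.

```python
import math
import random


def standardize(rows):
    """Column-wise z-scores within one group (rows = arrays, columns = genes)."""
    columns = []
    for col in zip(*rows):
        mean = sum(col) / len(col)
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (len(col) - 1))
        columns.append([(v - mean) / sd if sd > 0 else 0.0 for v in col])
    return [list(row) for row in zip(*columns)]


def corr_graph(rows, threshold=0.5):
    """Stand-in estimator, NOT the PC-Algorithm: edge if |r| > threshold."""
    cols = list(zip(*rows))
    n = len(rows)
    edges = set()
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            mi, mj = sum(cols[i]) / n, sum(cols[j]) / n
            sij = sum((a - mi) * (b - mj) for a, b in zip(cols[i], cols[j]))
            sii = sum((a - mi) ** 2 for a in cols[i])
            sjj = sum((b - mj) ** 2 for b in cols[j])
            if sii > 0 and sjj > 0 and abs(sij / math.sqrt(sii * sjj)) > threshold:
                edges.add((i, j))
    return edges


def shd(edges_a, edges_b):
    """SHD of two undirected graphs given as sets of (i, j) pairs with i < j."""
    return len(edges_a ^ edges_b)


def permutation_pvalue(group_a, group_b, estimate=corr_graph, n_perm=500, seed=0):
    """Return (SHD_obs, p_perm) with p_perm = #{SHD_obs < SHD_i} / R."""
    a, b = standardize(group_a), standardize(group_b)
    shd_obs = shd(estimate(a), estimate(b))
    pooled = a + b
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                      # permute data units across groups
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        if shd(estimate(perm_a), estimate(perm_b)) > shd_obs:
            exceed += 1
    return shd_obs, exceed / n_perm
```

With R = 500 permutations the smallest non-zero P-value is 1/500 = 0.002, so a run in which no permuted SHD exceeds the observed one can only be reported as P < 0.002.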

Results

A total of 4791 microarrays were grouped into eight tumour entities (four solid tumours with a total of 2833 arrays and four haemic tumours with a total of 1958 arrays). The minimum sample size is 177 arrays for samples from CLL patients; the maximum sample size is 1834 arrays for breast cancer tissue (see Table 2). The phenotype information on the individual tumour samples is very sparse and is not considered in the following analysis.

Figure 2 shows the SHD for all six combinations of solid tumours (red triangles), all six combinations of haemic tumours (black triangles), and for all 16 haemic-solid combinations (blue triangles) when conditional independence graphs are estimated for each entity and compared by SHD.

Figure 2.

SHD in single pathways for comparisons within solid tumours (black), haemic tumours (red) and between group comparisons (blue).

There is no obvious evidence in any pathway that the SHD for a between-group (haemic/solid) comparison is larger than the SHD for a within-group (haemic/haemic or solid/solid) comparison.

The comparison within solid tumours can be summarized as follows. The breast-colon comparison (# of arrays: 1834/197) shows a distinct structure only for the Wnt signalling pathway (04310). The breast-lung comparison (# of arrays: 1834/386) shows a pronounced difference for most pathways, except the AML pathway (05221) and the Mismatch repair pathway (03430). The breast-prostate comparison (# of arrays: 1834/416) shows marginal or non-significant differences for the p53 signalling pathway (04115), the ECM-receptor interaction pathway (04512), the AML pathway (05221), the Non-small cell lung cancer pathway (05223), and the Mismatch repair pathway (03430). The colon-lung comparison (# of arrays: 197/386) shows marginal or non-significant differences for the ECM-receptor interaction pathway (04512), the AML pathway (05221), and the Non-small cell lung cancer pathway (05223). The colon-prostate comparison (# of arrays: 197/416) shows marginal or non-significant differences for the p53 signalling pathway (04115), Apoptosis (04210), the ECM-receptor interaction pathway (04512), the Prostate cancer pathway (05215), the AML pathway (05221), the Non-small cell lung cancer pathway (05223), and the Mismatch repair pathway (03430). The lung-prostate comparison (# of arrays: 386/416) shows marginal or non-significant differences for the ECM-receptor interaction pathway (04512), the Non-small cell lung cancer pathway (05223), and the Mismatch repair pathway (03430).

Table 3.

SHDs (permutation P-values) for different haemic and solid cancer entities and pathways.

Solid entities
	BRE-COL	BRE-LUN	BRE-PRO	COL-LUN	COL-PRO	LUN-PRO
04110	577 (0.302)	642 (<0.002)	627 (<0.002)	435 (<0.002)	420 (<0.002)	477 (<0.002)
04115	319 (0.994)	373 (<0.002)	350 (0.046)	234 (<0.002)	223 (0.016)	279 (<0.002)
04210	475 (0.054)	540 (<0.002)	475 (<0.002)	335 (<0.002)	298 (0.02)	375 (<0.002)
04310	834 (<0.002)	919 (<0.002)	920 (<0.002)	617 (<0.002)	568 (<0.002)	701 (<0.002)
04512	435 (0.993)	477 (<0.002)	453 (0.614)	314 (0.686)	310 (0.592)	354 (0.802)
05210	506 (0.524)	569 (<0.002)	549 (<0.002)	393 (<0.002)	351 (<0.002)	428 (<0.002)
05215	527 (0.998)	607 (<0.002)	596 (0.002)	416 (<0.002)	387 (0.046)	457 (0.002)
05221	279 (0.993)	322 (0.044)	312 (0.84)	235 (0.136)	215 (0.374)	272 (0.008)
05223	314 (0.644)	336 (<0.002)	313 (0.162)	222 (0.038)	213 (0.028)	241 (0.022)
04150	234 (0.418)	242 (<0.002)	268 (<0.002)	172 (<0.002)	170 (<0.002)	180 (0.004)
03410	89 (0.744)	107 (<0.002)	107 (<0.002)	68 (<0.002)	68 (<0.002)	76 (<0.002)
03420	145 (0.35)	163 (<0.002)	146 (<0.002)	114 (<0.002)	109 (<0.002)	107 (<0.002)
03430	60 (0.993)	73 (0.054)	61 (0.918)	63 (0.006)	45 (0.366)	62 (0.188)

Haemic entities
	ALL-AML	ALL-CLL	ALL-LYM	AML-CLL	AML-LYM	CLL-LYM
04110	581 (<0.002)	542 (<0.002)	570 (<0.002)	447 (<0.002)	453 (<0.002)	436 (<0.002)
04115	300 (<0.002)	283 (0.142)	266 (0.066)	241 (<0.002)	240 (<0.002)	193 (0.128)
04210	438 (<0.002)	373 (0.118)	393 (<0.002)	345 (<0.002)	383 (<0.002)	296 (0.004)
04310	822 (<0.002)	697 (0.05)	761 (<0.002)	585 (<0.002)	675 (<0.002)	526 (<0.002)
04512	468 (0.004)	435 (0.993)	443 (0.544)	373 (0.658)	381 (<0.002)	338 (0.52)
05210	503 (<0.002)	424 (0.644)	472 (<0.002)	389 (<0.002)	413 (<0.002)	336 (<0.002)
05215	516 (<0.002)	470 (0.994)	506 (0.006)	410 (0.23)	440 (<0.002)	344 (0.562)
05221	265 (0.002)	242 (0.994)	261 (0.008)	233 (0.386)	248 (<0.002)	195 (0.232)
05223	307 (<0.002)	255 (0.868)	258 (0.02)	234 (0.014)	249 (<0.002)	193 (0.038)
04150	241 (<0.002)	223 (0.06)	225 (<0.002)	192 (<0.002)	200 (<0.002)	162 (0.002)
03410	104 (<0.002)	85 (0.012)	91 (<0.002)	83 (<0.002)	91 (<0.002)	76 (<0.002)
03420	132 (<0.002)	123 (0.034)	131 (<0.002)	125 (<0.002)	133 (<0.002)	100 (<0.002)
03430	62 (0.002)	63 (0.758)	63 (<0.002)	61 (0.062)	59 (0.042)	48 (0.628)

The comparison within haemic tumours can be summarized as follows. The ALL-AML comparison (# of arrays: 916/534) shows a distinct conditional correlation structure for each pathway. The ALL-CLL comparison (# of arrays: 916/177) shows marginal or non-significant differences for all pathways except Cell cycle (04110). The ALL-LYM comparison (# of arrays: 916/331) shows marginal or non-significant differences for p53 signalling (04115), ECM-receptor interaction (04512), and Non-small cell lung cancer (05223). The AML-CLL comparison (# of arrays: 534/177) shows marginal or non-significant differences for ECM-receptor interaction (04512), Prostate cancer (05215), AML (05221), Non-small cell lung cancer (05223), and Mismatch repair (03430). The AML-LYM comparison (# of arrays: 534/331) shows only the Mismatch repair pathway (03430) as marginally significant. The CLL-LYM comparison (# of arrays: 177/331) shows marginal or non-significant differences for p53 signalling (04115), ECM-receptor interaction (04512), Prostate cancer (05215), AML (05221), Non-small cell lung cancer (05223), and Mismatch repair (03430).

Table 6 in the Appendix presents the SHD and P-values for the between-group comparisons. They result in distinctive conditional correlation structures in all pathways for most of the pairs. More than two marginal or non-significant P-values are found in the COL-CLL, COL-LYM, and LUN-ALL comparisons (see Table 4). No clear evidence for a difference in the COL-CLL comparison is found for the p53 signalling (04115), Apoptosis (04210), ECM-receptor interaction (04512), Prostate cancer (05215), AML (05221), Non-small cell lung cancer (05223), mTOR signalling (04150), Base excision repair (03410), Nucleotide excision repair (03420), and Mismatch repair (03430) pathways. No clear evidence for a difference in the COL-LYM comparison is found for p53 signalling (04115), ECM-receptor interaction (04512), Prostate cancer (05215), AML (05221), Non-small cell lung cancer (05223), mTOR signalling (04150), and Mismatch repair (03430). For the LUN-ALL comparison, the seven pathways without clear evidence for a difference (Table 4) are documented in Table 6.

Table 4.

Number of pathways with no evidence for difference in conditional correlation structure.

Solid tumours						Haemic tumours
BRE COL	BRE LUN	BRE PRO	COL LUN	COL PRO	LUN PRO	ALL AML	ALL CLL	ALL LYM	AML CLL	AML LYM	CLL LYM
12	2	5	3	7	3	0	12	3	5	1	6

Mixed comparison
BRE ALL	BRE AML	BRE CLL	BRE LYM	COL ALL	COL AML	COL CLL	COL LYM	LUN ALL	LUN AML	LUN CLL	LUN LYM	PRO ALL	PRO AML	PRO CLL	PRO LYM
0	0	0	0	1	2	10	7	7	0	0	0	2	0	2	0

We use the number of pathways with no evidence for a differential conditional correlation structure as a measure of similarity between tumour entities. Table 4 and Figure 3 summarize the situation. Table 5 lists, for each pathway, the number of comparisons between and within groups with a permutation P-value of at least 0.01, ie, without evidence for a difference. The highest-ranked pathways with respect to no evidence for a difference are the Mismatch repair (03430), Non-small cell lung cancer (05223), AML (05221), ECM-receptor interaction (04512), and p53 signalling (04115) pathways. The pathways Cell cycle (04110) and Wnt signalling (04310) show a significant difference in conditional correlation structure in all comparisons except one.

Figure 3.

Similarity between tumours in terms of pathways with no evidence for a difference in conditional correlation structure.

Table 5.

Number of comparisons with no evidence (P ≥ 0.01) for a difference in conditional correlation structure (per pathway).

Pathway KEGG ID	Total (28 comparisons)	Solid tumours (6 comparisons)	Haemic tumours (6 comparisons)	Mixed tumours (16 comparisons)
03410	3	1	1	1
03420	4	1	1	2
03430	15	5	4	6
04110	1	1	0	0
04115	10	3	3	4
04150	5	1	1	3
04210	5	2	1	2
04310	1	0	1	0
04512	11	5	4	2
05210	2	1	1	0
05215	7	2	3	2
05221	13	5	3	5
05223	13	5	4	4
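The within-group counts in Table 5 can be recomputed from the permutation P-values in Table 3. The sketch below does this for the Mismatch repair pathway (03430), using the 0.01 evidence threshold and the "marginal" band up to 0.05 stated in the Methods. Entries reported as "<0.002" are stored as 0.0, which is an assumption made only for the purpose of counting; the function names are illustrative.

```python
# Permutation P-values for pathway 03430 (Mismatch repair), copied from Table 3.
# "<0.002" is encoded as 0.0 (assumption, see lead-in).
P_03430 = {
    "solid": {
        "BRE-COL": 0.993, "BRE-LUN": 0.054, "BRE-PRO": 0.918,
        "COL-LUN": 0.006, "COL-PRO": 0.366, "LUN-PRO": 0.188,
    },
    "haemic": {
        "ALL-AML": 0.002, "ALL-CLL": 0.758, "ALL-LYM": 0.0,
        "AML-CLL": 0.062, "AML-LYM": 0.042, "CLL-LYM": 0.628,
    },
}


def classify(p):
    """Label a permutation P-value as described in the Methods."""
    if p < 0.01:
        return "evidence"
    if p <= 0.05:
        return "marginal"
    return "not significant"


def count_no_evidence(pvalues, threshold=0.01):
    """Number of comparisons whose P-value is at or above the threshold."""
    return sum(1 for p in pvalues.values() if p >= threshold)


solid_count = count_no_evidence(P_03430["solid"])    # 5, as in Table 5
haemic_count = count_no_evidence(P_03430["haemic"])  # 4, as in Table 5
```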

The major similarities between the entity pairs are visualized in Figure 3. Every node represents an entity, and the width of an edge represents the number of pathways with no evidence for a difference. Similar pathway structures can be found for ALL-CLL, BREAST-COLON, and COLON-CLL. Among the haemic tumour entities there is a noticeable similarity between lymphatic tumours (ALL, CLL, lymphoma). Among the solid tumour entities there is a similarity between tumours arising in gland tissues (breast, colon, prostate).

Discussion

Similarities as well as dissimilarities between cellular processes in different tumour entities are of general interest. They promise a better understanding of basic disease mechanism as well as therapeutic principles. To this end many studies are concerned with the comparison of transcription profiles and gene signatures.⁴ Bioinformatic tools give easy access to a wide range of signatures³⁶ and help to validate them over a wide range of disease entities. For example the molecular program of normal wound healing plays an important role in cancer metastasis. Consistent features in the transcriptional response of normal fibroblasts to serum reveal links between wound healing and cancer progression in a variety of common epithelial tumours.³⁷

Our study does not focus on transcription profiles but on the structure of the wiring between genes annotated to specific pathways and on whether these structures differ between tumour entities. We formalize the wiring between genes by conditional correlation graphs,²⁸ where the genes of interest define the nodes and a direct influence between two genes is represented by an edge.

The following ideas motivated the study:

• Tumours of different tissues may have distinct features in the way the corresponding transcription of genes annotated to the pathways is regulated. Therefore, we compared haemic and solid tumours. Within both groups we defined two subgroups. Solid tumours were split into a group taken from gland tissue (breast, colon, and prostate cancer) and lung tissue. Haemic tumours were split into lymphatic tumours (ALL, CLL, and lymphoma) and AML.

• To detect similar regulation structures in major pathways between cancers, we studied generic pathways which are crucial for the cell machinery (Cell cycle, Apoptosis, p53 signalling, ECM-receptor interaction, Wnt signalling), pathways which are disease related (Colorectal cancer, Prostate cancer, Non-small cell lung cancer, AML), and finally pathways which concern DNA repair.

The study is designed as a meta-analysis of data available in public repositories. It uses microarrays from one specific technical platform (HG-U133A) to avoid unnecessary heterogeneity caused by differing technical and lab-specific conditions. The data collection was the most challenging part of the meta-analysis. Our activity focused on the ArrayExpress repository. Here we found 61 studies with a total of 5693 HG-U133A arrays which contributed material to the tumours of interest. A total of 101 defective arrays and a further 801 duplicates of microarrays were removed. The package ArrayExpressDataManage was developed to organise the data management of the 4791 remaining microarrays for the meta-analysis. The package allows an efficient and reproducible retrieval of large data sets from the ArrayExpress repository.

The quality of the downloaded data in terms of phenotype information was very low. It is astonishing that other authors do not report this fact. Phenotype information is generally missing and even incomplete in basic items to characterize the clinical staging of tumours. Our data was taken from clinical studies and only 64% of the studies provided the sex of the patients. Age was missing in 50% of the patients. Detailed information on the tumour (tumour grade was available in 20% of the solid cancer, only sparse information on metastatic disease) was not given.

All studies contributed to ArrayExpress formally declare compliance with the MIAME criteria.³⁸ MIAME also requests information on the biological properties of the samples and the phenotypes that were assayed. Neglecting this information may produce a meaningless mixture of samples and invalidates most of the public oncological microarray data for secondary research.

The missing phenotype information confronts us with a potentially inhomogeneous set and a biological mix of probes within a tumour entity which are in different stages of its development. This confounding may invalidate the interpretation of our results.

Furthermore, the observed microarray data quantifies time averages of a complex dynamics with many components. Additionally, the gene expression measured for a few selected genes annotated to a pathway is only a small observation window on the system. The unobserved components also may confound the conditional correlation between two observed genes.

All these limitations impose severe restrictions on interpretability and allow only a very coarse conclusion if a difference between two conditional correlation graphs is assessed: the dynamics are somehow distinct. Techniques to locate differences between two graphs more precisely, within specific subsets of nodes, are under study and are applied in settings with a stricter control of the phenotype and the confounding.¹⁶

The estimation of conditional correlation structures was performed with the PC-Algorithm.²⁸ The difference between the estimated structures was quantified by the Structural Hamming Distance. Statistical significance was explored by resampling techniques. The computational challenges were handled by a parallelized computation environment created out of standard open access tools from R and Bioconductor.¹⁸,¹⁹ Additional software which was required to build up our pipeline for reproducible calculations is readily provided.

The quantitative results for the conditional correlation structures of the gene sets and tumour entities studied can be summarized as follows: The inferred structures for the selected thirteen pathways look mostly similar between breast cancer and colon cancer as well as between ALL and CLL samples. The lack of strong differences in the structures between colon cancer and CLL samples has, to our knowledge, never been addressed in the literature. There are narrative reports on patients with two different primary tumours where the treatment directed at one of the two resulted in a response only of the second and not of the first (in our centre a patient with simultaneous colon cancer and AML). The pathways with the most similar conditional correlation of the annotated genes between tumour entities are Mismatch repair (03430), Non-small cell lung cancer (05223), and AML (05221). The wiring between genes annotated to the Wnt pathway (04512) seems to be more similar within than between the tumour entities (solid/haemic).

The study addresses, in a limited way, both heuristics that motivated the meta-analysis. It defined the technical requirements to perform a meta-analysis with about 4800 microarrays. The statistical methods that help to tackle the question posed are also available. However, it was not possible to enrich the developed analysis pipeline with the necessary biological and clinical data. The publicly available data generally lack relevant and important phenotype information. The sobering experience of our study asks for combined efforts to improve the data quality in public repositories for data from high-throughput technologies.

The dataset used for the presented results is from February 2009. Since then, only seven new data sets that meet our inclusion/exclusion criteria have become available: six for breast cancer and one for prostate cancer, altogether about 550 arrays. Furthermore, it would have been possible to validate our findings on microarray data derived from the microarray type HGU-133 Plus 2.0. We did not perform the validation since we are not able to assure the homogeneity between the populations used for the first step and for the validation. In this case we would not be able to discuss possible deviations between the validation based on the HGU-133 Plus 2.0 arrays and the calculation based on the HGU-133 arrays. The HGU-133 Plus 2.0 arrays are more recent than the HGU-133 arrays, which may imply that the studies based on the HGU-133 Plus 2.0 array were performed with more specific questions on more specific populations.

Footnotes

This manuscript has been read and approved by all authors. This paper is unique and is not under consideration by any other publication and has not been published elsewhere. The authors and peer reviewers of this paper report no conflicts of interest. The authors confirm that they have permission to reproduce any copyrighted material.

Methodological Considerations for the Interpretation of our Results

A short outline of ideas is given on how to relate biological ideas to the results of the described microarray analyses.

There are two problems: the measurement of time averages on a narrow set of dynamic processes, and the confounding of interactions between genes annotated to a pathway by unobserved quantities.

In order to understand the problem of time averages, the cell is considered as a large dynamic system in a stable state. The multivariate process $(X_t)_{t \in [0,T]}$ describes the dynamics of all entities within the cell over a time period of length T. The data read from a microarray are the average over many cells caught at different time points within the dynamics of the cell-specific system. If the dynamic system is ergodic, its long-run behaviour can be read from cross-sectional measurements of many simultaneous trajectories at a fixed time point. This is expressed through ergodic theorems, which assert that, under certain conditions, the time average of a function along the trajectories exists almost everywhere and is related to the space average.³⁹

Therefore, if a large number of cells is assayed, we obtain an average of the measurements from all cells, which is identical to the average over time of $(X_t)_{t \in [0,T]}$:

$$W = \frac{1}{T} \int_{0}^{T} X_{t} \, dt$$
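The ergodic argument can be illustrated with a toy simulation. The sketch below assumes a deliberately simple stand-in for the cell dynamics: a two-state Markov chain (a gene switching "off" and "on") with hypothetical switching probabilities. It compares the time average along one long trajectory with the cross-sectional average over many independent "cells" observed at one fixed time point. Nothing here is taken from the study's data; all parameters are illustrative assumptions.

```python
import random


def step(state, rng, p_on=0.2, p_off=0.3):
    """Advance a two-state chain by one step (0 = off, 1 = on)."""
    if state == 0:
        return 1 if rng.random() < p_on else 0
    return 0 if rng.random() < p_off else 1


def time_average(n_steps, rng):
    """Average the state along a single long trajectory."""
    state, total = 0, 0
    for _ in range(n_steps):
        state = step(state, rng)
        total += state
    return total / n_steps


def cross_sectional_average(n_cells, n_steps, rng):
    """Average the state over many independent cells at one fixed time."""
    total = 0
    for _ in range(n_cells):
        state = 0
        for _ in range(n_steps):
            state = step(state, rng)
        total += state
    return total / n_cells


rng = random.Random(1)  # fixed seed for reproducibility
long_run = time_average(50_000, rng)
snapshot = cross_sectional_average(5_000, 50, rng)
stationary = 0.2 / (0.2 + 0.3)  # stationary probability of the "on" state

print(long_run, snapshot, stationary)
```

Both averages should lie close to the stationary probability of 0.4, which is the sense in which a snapshot of many cells can stand in for the time average of one cell.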

The covariance matrix of the high-dimensional vector W consists of integrals of covariance functions (on the diagonal) and of cross-covariance functions (off the diagonal). The term cross-covariance refers to the covariance cov(X, Y) between two random vectors X and Y; it is used to distinguish that concept from the “covariance” of a random vector X, which is understood to be the matrix of covariances between the scalar components of X.

If W is observed in the tissue of several individuals, it is possible to estimate the conditional correlation structure of W by a conditional correlation graph. If material from a different biological condition exists, one could estimate a second graph and compare both. In case of a difference between both graphs the conclusion is allowed that the corresponding covariance matrices are different and that also some time averages of cross-covariance functions between components of the cell dynamics are different. This would allow the vague statement that the dynamic system under one condition is different from the dynamic system which governs the alternative condition. A deeper insight given such data is not possible.

The next problem is that we do not see the complex dynamic system. Transcription measurements are only available for a restricted set of genes (defined by the pathway). We do not see the whole of W but only a small subset of it. Let U be the components of W which are observable from the data and V be the components of W which are not measured (protein concentration, gene expression of genes which are not part of our pathway, …).

Often a random dynamical system is considered as a complex Gaussian process, so that the framework of multivariate normal distributions is available for a thorough formal analysis. Here, the precision matrix of W, which is the inverse of the covariance matrix, plays the central role in understanding direct interactions between the components.
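As a brief aside on why the precision matrix captures direct interactions: under the Gaussian assumption, a standard identity links its entries to partial correlations. Writing $q_{ij}$ for the entries of the precision matrix of W (a notation assumed here for illustration only), the partial correlation between components i and j given all remaining components is

$$\rho_{ij \mid \text{rest}} = - \frac{q_{ij}}{\sqrt{q_{ii} \, q_{jj}}}$$

so that a zero entry $q_{ij} = 0$ corresponds to a missing edge between i and j in the conditional correlation graph.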

This approach makes it possible to understand how the unobserved components influence the correlation structure of the observed data by confounding. Confounding means that the conditional correlation between two genes is also the effect of unobserved components in the system and possibly not a real biological feature that is shared by both genes. This makes it difficult to interpret a conditional correlation graph of genes annotated to a pathway. It may show effects caused from outside and not by biological activity within the pathway.

The precision matrix of W is denoted by Q and can be partitioned into the blocks $Q_{UU}$ and $Q_{VV}$, which describe the conditional correlation structure in the observed part (U) and the unobserved part (V) of the system. The blocks $Q_{UV}$ and $Q_{VU} (= Q_{UV}^{T})$ describe the conditional correlation structure between the observed and unobserved components of the system.

The precision matrix of the marginal distribution of the observed components U is given by

$$Q_{UU}^{\mathrm{marg}} = Q_{UU} - Q_{UV} \, Q_{VV}^{-1} \, Q_{VU}$$

This formalizes the idea that in the worst case an observed conditional correlation between two pathway genes is not caused by activity within the pathway but transmitted by conditional correlation of the genes with components of the system which are not observed. The following example demonstrates a practical consequence of the formal consideration.

A transcription factor regulates the expression of gene G1. Transcription factors belong to the unobserved V components of W. The concentration of a transcription factor may be regulated by some other protein which is also an unobserved V component of W. This protein is regulated by the transcriptional products of gene G2. The conditional correlation between the two proteins is an element of $Q_{VV}$, while the interaction of the transcription factor with G1 and the regulation of the protein by gene G2 are represented by elements of $Q_{UV}$. This may imply a non-zero element in $Q_{UU}^{\mathrm{marg}}$ without the need for a direct interaction of G1 and G2 within the pathway.
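The example can be checked numerically. The sketch below uses a hypothetical 4 × 4 precision matrix with components ordered as (G1, G2, transcription factor, regulating protein); the numeric values are assumptions chosen only so that the matrix is positive definite and G1 and G2 have no direct interaction. It computes the precision matrix of the observed genes as the Schur complement of the unobserved block, which is the standard result for Gaussian marginals, and confirms it against a direct inversion of the marginal covariance. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def inverse(matrix):
    """Invert a square matrix exactly by Gauss-Jordan elimination.

    A sketch without singularity handling: a singular input raises
    StopIteration when no non-zero pivot can be found.
    """
    n = len(matrix)
    # Augment the matrix with the identity on the right-hand side.
    aug = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [x / pivot_value for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def block(matrix, rows, cols):
    """Extract the sub-matrix with the given row and column indices."""
    return [[matrix[r][c] for c in cols] for r in rows]


def marginal_precision(q, observed, hidden):
    """Schur complement: Q_UU - Q_UV * inverse(Q_VV) * Q_VU."""
    q_uu = block(q, observed, observed)
    q_uv = block(q, observed, hidden)
    q_vu = block(q, hidden, observed)
    q_vv_inv = inverse(block(q, hidden, hidden))
    correction = matmul(matmul(q_uv, q_vv_inv), q_vu)
    return [[a - b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(q_uu, correction)]


# Order: G1, G2 (observed, U); transcription factor, protein (unobserved, V).
Q = [[2, 0, 1, 0],   # G1 interacts only with the transcription factor
     [0, 2, 0, 1],   # G2 interacts only with the regulating protein
     [1, 0, 3, 1],   # transcription factor interacts with the protein
     [0, 1, 1, 3]]

Q_marg = marginal_precision(Q, observed=[0, 1], hidden=[2, 3])
print(Q_marg[0][1])  # 1/8: G1 and G2 appear linked although Q[0][1] == 0
```

The non-zero off-diagonal entry arises purely from the path G1, transcription factor, protein, G2 through the unobserved components.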

References

1. Hudson Thomas J, Anderson Warwick, Artez Axel, International Cancer Genome Consortium. International network of cancer genome projects. Nature. 2010 Apr; 464(7291): 993–8.

2. Baisse, Bouzourene, Saraga E.P., Bosman F.T., Ben-Hattar. Intratumor genetic heterogeneity in advanced human colorectal adenocarcinoma. Int J Cancer. 2001 Aug; 93(3): 346–52.

3. Brabletz Thomas, Jung Andreas, Spaderna Simone, Hlubek Falk, Kirchner Thomas. Opinion: migrating cancer stem cells—an integrated concept of malignant tumour progression. Nat Rev Cancer. 2005 Sep; 5(9): 744–9.

4. Rhodes Daniel R, Yu Jianjun, Shanker. Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression. Proc Natl Acad Sci USA. 2004 Jun; 101(25): 9309–14.

5. Brazma Alvis. Minimum information about a microarray experiment (MIAME)—towards standards for microarray data. Nature Genetics. 2001; 29: 365–71.

6. Ball Catherine A, Brazma Alvis, Causton Helen. Submission of microarray data to public repositories. PLoS Biology. 2004 Aug; 2(9): e317.

7. Gardiner-Garden, Littlejohn T.G. A comparison of microarray databases. Brief Bioinform. 2001 May; 2(2): 143–58.

8. Brazma Alvis, Parkinson Helen, Sarkans Ugis. ArrayExpress—a public repository for microarray gene expression data at the EBI. Nucleic Acids Research. 2003 Jan; 31(1): 68–71.

9. Barrett Tanya, Troup Dennis, Wilhite Stephen. NCBI GEO: archive for high-throughput functional genomic data. Nucleic Acids Research. 2009 Jan; 37(Database issue): D885–90.

10. Ikeo Kazuho, Ishi-i Jun, Tamura Takurou, Gojobori Takashi, Tateno Yoshio. CIBEX: center for information biology gene expression database. C R Biol. 2003; 326(10-11): 1079–82.

11. Conlon Erin M. A bayesian mixture model for meta-analysis of microarray studies. Funct Integr Genomics. 2008 Feb; 8(1): 43–53.

12. Landgren, Pfeiffer R.M., Stewart. Risk of second malignant neoplasms among lymphoma patients with a family history of cancer. International Journal of Cancer. 2007; 1099–1102(5): 8–14.

13. Lynch. Family with acute myelocytic leukemia, breast, ovarian, and gastrointestinal cancer. Cancer Genetics and Cytogenetics. 2009; 137(1): 8–14.

14. Brandt, Bermejo J.L., Sundquist, Hemminki. Risk of second malignant neoplasms among lymphoma patients with a family history of cancer. European Journal of Cancer. 2009. [Epub ahead of print].

15. Arinobu, Mizuno S.I., Chong. Reciprocal activation of GATA-1 and PU.1 marks initial specification of hematopoietic stem cells into myelo-erythroid and myelolymphoid lineages. Cell Stem Cell. 2007; 1(4): 416–27.

16. Mansmann Ulrich, Schmidberger Markus, Strobl Ralf, Jurinovic Vindi. Statistical modelling and regression structures—festschrift in honour of Ludwig Fahrmeir, chapter Indirect Comparison of Interaction Graphs. Physica, 2009: 249–65.

17. Ramasamy Adaikalavan, Mondry Adrian, Holmes Chris C, Altman Douglas G. Key issues in conducting a meta-analysis of gene expression microarray datasets. PLoS Medicine. 2008 Sep; 5(9): e184.

18. R development core team. R: A language and environment for statistical computing. Vienna, Austria: R foundation for statistical computing; 2009. ISBN 3-900051-07–0.

19. Gentleman Robert C, Carey Vincent J, Bates Douglas M. Bioconductor: open software development for computational biology and bioinformatics. Genome Biology. 2004: 5.

20. Schmidberger Markus, Morgan Martin, Eddelbuettel Dirk, Yu Hao, Tierney Luke, Mansmann Ulrich. State of the art in parallel computing with R. Journal of Statistical Software. 2009; 31(1).

21. Schmidberger Markus. Parallel Computing for Biological Data. PhD thesis, 2009 Nov.

22. Schmidberger Markus, Mansmann Ulrich. Parallelized pre-processing algorithms for high-density oligonucleotide arrays. In Proceedings IEEE International Symposium on Parallel and Distributed Processing IPDPS 2008. 2008 Apr 14–18: 1–7.

23. Schmidberger Markus, Mansmann Ulrich. affyPara—a bio-conductor package for parallelized preprocessing algorithms of affymetrix microarray data. Bioinformatics and Biology Insights. 2009; 3: 83–7.

24. Kauffmann Audrey, Rayner Tim F, Parkinson Helen. Importing ArrayExpress datasets into R/Bioconductor. Bioinformatics. 2009; 25(16): 2092–4.

25. Gentleman Robert C, Carey Vincent, Huber Wolfgang, Irizarry Rafael, Dudoit Sandrine. Bioinformatics and Computational Biology Solutions using R and Bioconductor, 1st ed. Springer; 2005 Aug.

26. Irizarry Rafael A, Warren Daniel, Spencer Forrest. Multiple-laboratory comparison of microarray platforms. Nature Methods. 2005 May; 2(5): 345–50.

27. Johnson Evan, Li Cheng, Rabinovic Ariel. Adjusting batch effects in microarray expression data using empirical bayes methods. Biostatistics. 2007 Jan; 8(1): 118–27.

28. Kalisch Markus, Buehlmann Peter. Estimating high dimensional acyclic graphs with the PC-Algorithm. Journal of Machine Learning Research. 2007; 8: 613–36.

29. Spirtes Peter, Glymour Clark, Scheines Richard. Causation, Prediction, and Search, 2nd ed. Cambridge, MA, USA: The MIT Press; 2001 Jan.

30. Villers Fanny, Schaeffer Brigitte, Bertin Caroline, Huet Sylvie. Assessing the validity domains of graphical Gaussian models in order to infer relationships among components of complex biological systems. Statistical Applications in Genetics and Molecular Biology. 2008; 7(1): Article 14.

31. Schaefer Juliane, Strimmer Korbinian. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology. 2005; 4: Article 32.

32. Meinshausen Nicolai, Buehlmann Peter. High dimensional graphs and variable selection with the Lasso. The Annals of Statistics. 2006; 34: 1436–62.

33. Wainwright Martin J, Ravikumar Pradeep, Lafferty John D. High dimensional graphical model selection using L1-regularized logistic regression. Proceedings of Advances in Neural Information Processing Systems. 2006; 9: 1465–72.

34. Friedman Jerome, Hastie Trevor, Tibshirani Robert. Sparse inverse covariance estimation with the graphical Lasso. Biostatistics. 2007; 9(3): 432–41.

35. Banerjee Onureena, El Ghaoui Laurent, d'Aspremont Alexandre. Model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data. Journal of Machine Learning Research. 2008: 485–516.

36. Culhane Aedn C, Schwarzl Thomas, Sultana Razvan. Genesigdb—a curated database of gene expression signatures. Nucleic Acids Res. 2010 Jan; 38(Database issue): D716–25.

37. Chang Howard Y, Nuyten Dimitry SA, Sneddon Julie B. Robustness, scalability, and integration of a wound-response gene expression signature in predicting breast cancer survival. Proc Natl Acad Sci USA. 2005 Mar; 102(10): 3738–43.

38. Brazma Alvis. Minimum information about a microarray experiment (MIAME)—successes, failures, challenges. Scientific World Journal. 9: 420–3.

39. Arnold Ludwig. Random Dynamical Systems, 2nd ed. Berlin, Germany: Springer-Verlag; 1998 Jan.

Conceptual Aspects of Large Meta-Analyses with Publicly Available Microarray Data: A case study in Oncology
