Abstract
Background:
Recent studies indicate that gene expression levels in blood may be able to differentiate subjects with Alzheimer’s disease (AD) from normal elderly controls and mild cognitively impaired (MCI) subjects. However, there is limited replicability at the single marker level. A pathway-based interpretation of gene expression may prove more robust.
Objectives:
This study aimed to investigate whether a case/control classification model built on pathway level data was more robust than a gene level model and may consequently perform better in test data. The study used two batches of gene expression data from the AddNeuroMed (ANM) and Dementia Case Registry (DCR) cohorts.
Methods:
Our study used Illumina Human HT-12 Expression BeadChips to collect gene expression from blood samples. Random forest modeling with recursive feature elimination was used to predict case/control status. Age and APOE
Results:
Gene and pathway level models performed similarly to each other and to a model based on demographic information only.
Conclusions:
Any potential increase in concordance from the novel pathway level approach used here has not lead to a greater predictive ability in these datasets. However, we have only tested one method for creating pathway level scores. Further, we have been able to benchmark pathways against genes in datasets that had been extensively harmonized. Further work should focus on the use of alternative methods for creating pathway level scores, in particular those that incorporate pathway topology, and the use of an endophenotype based approach.
INTRODUCTION
The most common form of dementia is Alzheimer’s disease (AD). It is predicted that by 2050, 1 in every 85 people will be living with the disease [1]. No disease modifying treatments are available for AD and existing treatments only provide short-term symptomatic relief in a subset of patients [2]. Additionally, in the early stages (between 2 and 15 years prior to the development of clinical symptoms) the disease is difficult to diagnose. Villemagne et al. and Jack et al. hypothesize that characteristic AD pathology (the presence of amyloid-β (Aβ) plaques and hyperphosphorylated tau tangles in the brain) begins to develop up to 20 years prior to clinical diagnosis [3, 4]. This extended prodromal stage is an important window in which to target treatments that may be able to alter the course of the disease; provided people could be sensitively and accurately diagnosed. Aβ, tau, and phosphorylated-tau levels are indicative of AD pathology in this prodromal period and can be measured in cerebrospinal fluid (CSF) and by positron emission tomography (PET) imaging [5]. The procedures involved in attaining these measurements can be invasive or expensive and require specialized administration, equipment. and expertise. The development of a less invasive, potentially cheaper technique, such as a blood test, would offer significant advantages [6].
Recent studies indicate that gene expression levels in blood may be able to differentiate AD subjects from normal elderly controls and mild cognitive impairment (MCI) subjects with prodromal disease [7–10]. Han et al. provide an overview of studies of gene expression associated with AD-related phenotypes [11]. They state that the blood transcriptome is vital in the disease mechanism of AD and should therefore be investigated further in independent studies of a large sample size. A more general summary of gene expression data in neurodegenerative diseases is given by Cooper-Knock et al. [12]. This review emphasizes the dysregulation in neuroinflammation and intracellular signaling pathways including calcium signaling in AD. The commonality between these reviews is that they both highlight limited replicability at the single marker level. Furthermore, Han et al. report a greater concordance between differentially expressed genes at the pathway level. A pathway-based interpretation of gene expression may therefore prove more robust across different sample populations. Such an approach may also reduce noise and dimensionality. It is important to note that differential gene expression, as described in these reviews, does not necessarily identify genes that will be useful in a classification context.
Although previous gene expression studies in AD have retrospectively identified pathways altered in disease [9], this is the first study to use pathway scores for each individual to build predictive models across the population. This study used Pathway Level Analysis of Gene Expression (PLAGE) to estimate pathway variability across samples in the population by calculating sample-wise pathway scores [14]. PLAGE outperformed other single sample enrichment methods such as ZSCORE, Gene Set Variation Analysis (GSVA), and Single Sample Gene Set Enrichment Analysis (SSGSEA) in a comparison of sensitivity, specificity, and prioritization by Tarca et al. [15]. PLAGE scores have been used in univariate t-testing and unsupervised clustering methods to investigate the pathways involved in oral leukoplakia and those leading to cell proliferation and migration in leukemia [16, 17]. We combine, for the first time, PLAGE scoring with a supervised machine learning approach to build an AD classifier.
This study used blood expression data from subjects participating in the AddNeuroMed (ANM) and Dementia Case Registry (DCR) studies to develop models of clinical diagnosis. The performance metrics of gene expression and demographic models is compared with those generated using pathway level measures of expression.
MATERIALS AND METHODS
Cohort
ANM is a European multi-center study aiming to develop biomarkers for AD [18]. Subjects with an AD diagnosis as well as those with MCI and healthy controls were recruited from centers based in Kuopio, Lodz, London, Perugia, Thessaloniki, and Toulouse. Details of study design and enrollment are provided by Lunnon et al. [10]. Subjects for the DCR were recruited from the Maudsley and Kings Healthcare Partners, which incorporates the Alzheimer’s Research UK (ARUK) cohort [19] from whom gene expression data has not previously been reported.
The present study used data from 748 subjects: 614 subjects from ANM and 134 subjects from DCR.
Gene expression
Whole blood samples (2.5 ml) were collected after 2 h of fasting into Paxgene Blood RNA tubes (BD) and extracted as in Lunnon et al. [9]. Illumina Human HT-12 Expression BeadChips were used to analyze the whole transcriptome according to the manufacturers protocol. The gene expression analysis was run in two batches at two different sites. Batch 1 contained samples from 356 ANM subjects run on version 3 of the BeadChip, as previously described [9, 10]. Batch 2 contained samples from 411 subjects: 134 from DCR and 277 from ANM run on version 4 of the BeadChip. Samples from 19 subjects were included in both batches. See Fig. 1 for an overview of sample numbers. The raw gene expression data are available as GEO DataSets (Accession number GSE63060 for batch 1 and GSE63061 for batch 2).
Statistical analysis
Data pre-processing
The data pre-processing performed in this study is different to that used for the original analysis by Lunnon et al. [9, 10]. The data processing pipeline used in this study aims to address the effects of technical data artifacts in gene expression studies [20]. Raw gene expression data was exported from Illumina’s Genome studio and processed in R (version 3.1.1) [21] using the lumi package [22] and custom in-house pre-processing scripts (GitHub, http://bit.ly/1vjyKNo). Briefly, raw expression data was subject to a model based background correction for bead array [23]. This used negative bead expression levels to correct for background noise. The data was then log base 2 transformed and robust spline normalized in lumi [22]. Outlying samples were iteratively identified using fundamental network concepts and removed, following the methods described by Oldham et al. [24]. To reduce any batch effects we adjusted for technical categorical variables using ComBat [25]. Continuous technical artifacts were accounted for by taking the first principal component across housekeeping and undetected probes and regressing this against technical variables. Variables significantly associated with the first principal component were then regressed against expression for each probe, and the mean adjusted residuals taken forward for all further analyses. Finally, the data was reduced to a subset of probes that could be reliably detected in 80% of samples in at least one diagnostic group. Finally, subjects were excluded where there were discrepancies between the recorded sex and sex determined by the XIST (ILMN 1764573), USP9Y (ILMN 2056795) and EIF1AY (ILMN 1755537 and ILMN 2228976) X- and Y-linked genes.
Demographic data for the ANM and DCR subjects was extracted using CohortExplorer [26].
Pathway level analysis of gene expression (PLAGE)
Gene level expression data were condensed to sample wise, pathway level scores using PLAGE [14]. PLAGE groups genes into pathways defined by the Broad Institute Collection of Curated Pathways [27] and outputs a score, per sample, for each of these sets. We restricted PLAGE to only include pathways with between 10 and 500 genes. The generation of PLAGE scores was implemented through R package ‘GSVA’ and is detailed in Supplementary Methods, section 1 [13].
Data analysis
Clinical diagnosis (AD versus non-demented elderly control) classification models were built using batch 1 gene expression data. Variable selection was performed using recursive feature elimination (RFE) and the creation of a tolerance set using the ‘pickSizeTolerance’ function in R. This function finds a smaller set of variables while maintaining model accuracy [28]. Three Random Forest (RF) models were built, the first of which was a model based on demographic data alone (
Each model was used to predict the diagnostic status of subjects in batch 2. Model statistics including accuracy, sensitivity, and specificity were generated and compared between the
Full details of model building are provided in Supplementary Methods section 2.
Additionally, variable importance (determined as the change in Gini index) was examined in the
The validity of the pathways selected in the
RESULTS
Cohort demographics
Table 1 gives an overview of the demographics of subjects included in the two batches of gene expression data.
Data pre-processing
As a result of pre-processing 12 samples in batch 1 and 49 samples in batch 2 failed quality control (QC) and were removed. The majority of these samples failed QC as they were identified as outliers. Additionally, some samples were removed because the sex of the individual recorded in the clinical database did not match the biological sample (2 samples in batch 1 and 7 in batch 2).
Samples from 19 subjects were present in both batch 1 and batch 2. Samples from 14 of these individuals passed QC in both batches; only data from batch 1 was used and the other was discarded. Correlation between the two batches was at least 0.9 for all individuals (Supplementary Figure 3). Batch 2 gene expression data contains subjects from the DCR whereas batch 1 does not. This study used the same protocols, staff, and facilities as the London sample collection site within ANM. Principal components analysis (PCA) was performed across the batch 2 gene expression data from DCR and ANM subjects from London. The first three principal components (accounting for >40% of variation) were linearly regressed against the study the individual was enrolled in (DCR or London ANM) and found to be non-significant. Therefore, it was deemed appropriate to group DCR subjects with London ANM, allowing the model trained in batch 1 data to be simply applied to batch 2 data.
After data processing only subjects with either an AD diagnosis at all visits or control status at all visits were analyzed further: 207 subjects in batch 1 and 236 in batch 2.
Only gene probes that mapped between the version 3 and version 4 chips used to generate batches 1 and 2, respectively, were used for analysis (5212 probes). The Broad Institute Collection of Curated Pathways matched these probes to 834 pathways [27].
Data analysis
Demographic model
The following demographic variables were associated with case/control status in our cohorts (Table 1): age, sex, APOE status, years in full time education, and sample collection site. These variables were therefore used in multivariate modeling using RFE. The optimal cross-validated accuracy was found when including all variables; calculation of a tolerance set excluded the variable representing the Lodz sample collection site. Variable importance scores showed age as the most important covariate followed by years in full timeeducation and then APOE status and sample collection site. In batch 2 test data the model achieved an accuracy of 0.69, sensitivity of 0.53 and specificity of 0.84.
The area under the ROC curve was 0.77 (see Table 2 and Fig. 2).
Additionally, a model that did not contain the sample collection site was built. The aim was to create a model based on demographics that would be available to clinicians. This model had a slightly decreased accuracy in comparison to the
Gene model
The top 5% of variables from the bootstrapped variable importance calculations (261 variables) were carried forward to the RFE model building. The optimal cross-validated accuracy from RFE in the
In batch 2 test data, the
Pathway model
The top 5% of variables from the bootstrapped variable importance calculations (42 variables) were carried forward to the RFE model building. The optimal cross-validated accuracy from RFE in the
Permutation tests of variable importance were performed to assess the size of effect relative to that observed under the null hypothesis of no association. Of the 14 pathways, two achieved nominal significance with a
Additionally, we compared the model accuracy of 1000 models comprising 16 random variables from the pathways, age, sex, APOE status, and years in full time education. This yielded a
In batch 2 test data, the model accuracy was lower than that of the
There is minimal overlap in genes between the different pathways included in the final
Misclassification
We discovered that 22% of controls used in the training data had reported memory complaints deemed not serious enough to reflect a change in diagnosis. By studying misclassification rates split by AD subjects, control subjects, and control subjects with memory complaints, we see that the most well classified group in the
DISCUSSION
In this study we investigated whether AD cases could be differentiated from control subjects using gene expression data analyzed at the pathway level. We were particularly interested in confirming whether pathway level information created a more robust predictor of case/control status than expression data at the gene level as recent reviews of AD studies have suggested [11]. Our results, using subjects from the ANM and DCR cohorts, show similar model performance in a
The fourteen pathways included in the final
RF models are commonly used in biomarker studies [9, 33]. However, it has been shown that they exhibit variable selection bias being more likely to select continuous variables or those with many categories [34]. Additionally, the presence of correlated predictors (as is common in gene expression studies) can add further bias [35]. Strobl et al. aimed to address these issues with an ensemble-learning algorithm based on conditional inference trees; Conditional RF (CRF) models [36, 37]. We attempted to use this methodology in the present study. We hypothesized that the creation of an unbiased predictor may highlight different pathways and genes to those previously discovered, potentially allowing greater predictive ability. However, the process of creating a CRF model was computationally expensive even when using high performance computing resources. Model building considering the 834 pathways and 5,212 genes was consequently infeasible. Work to improve the efficiency of this method would be computationally beneficial and would allow the use of alternative variable importance measures. Measures such as mean decrease in accuracy and conditional mean decrease in accuracy would be an improvement over biased variable importance measures such as the Gini index, which was used in this study.
This study used the Broad institute collection of curated pathways to generate the
The creation of a
The models created in this study all achieved an accuracy of approximately 70% with the
Furthermore, we found that the heterogeneity of control subjects may be leading to reduced predictive accuracy and suggest that the use of an endophenotype may be beneficial in future work.
Conclusions
We have used subjects from the ANM and DCR studies to investigate case/control classification using gene and pathway level expression data. We hypothesized that a model built on pathway level data may be more robust than a gene level model and consequently perform better in test data. However, a pathway level model built using scores and a gene level model performed similarly to each other and to a model based on demographic information only. Further work should focus on the use of alternative methods for creating pathway level scores, in particular those that incorporate pathway topology, and the use of an endophenotype based approach.
Footnotes
ACKNOWLEDGMENTS
This work was supported by the Alzheimer’s Society, InnoMed (Innovative Medicines in Europe), an integrated project funded by the European Union of the Sixth Framework program priority (FP6-2004- LIFESCIHEALTH-5); Alzheimer’s Research Trust UK; the John and Lucille van Geest Foundation (AH); and the NIHR Biomedical Research Centre for Mental Health and Biomedical Research Unit for Dementia at the South London, Maudsley NHS Foundation Trust and Kings College London, and a joint infrastructure grant from Guy’s and St Thomas’ Charity and the Maudsley Charity. The research leading to these results has received support from the Innovative Medicines Initiative Joint Undertaking under EMIF grant agreement number 115372, resources of which are composed of financial contribution from the European Union’s Seventh Framework Program (FP7/2007-2013) and EFPIA companies’ in kind contribution. Kuopio University Hospital (HS) and funding from UEF- BRAIN (HS). Steven Kiddle is supported by an MRC Career Development Award in Biostatistics (MR/L011859/1).
