Abstract
INTRODUCTION
Parkinson’s disease (PD) is the second most prevalent neurodegenerative disease worldwide, with rates expecting to double in the next 15 years [1]. Despite many efforts to more accurately diagnose and understand PD pathophysiology, diagnosis is primarily clinical and can only be confirmed at autopsy [2, 3]. Therefore, there is a need for sensitive and specific biomarkers to diagnose PD at early stages, enabling clinicians to monitor disease progression and develop therapeutic approaches to assess response to existing and future treatments.
Genome-wide expression analyses using next generation sequencing (NGS) technology have proven valuable for identifying biological processes in neurodegenerative diseases [2]. Next generation RNA sequencing (RNA-seq) provides sequence information and measures the abundance of RNA molecules present at any particular time in a specific cell, tissue or organ. RNA sequencing does not rely on
Despite the collection of multiple large data sets, these approaches to date have not produced clinically useful PD biomarkers. This can be attributed in part to small sample volumes and samples with fragmented RNAs in low abundance that make downstream RNA sequencing challenging.
Here, we report a methodology to purify and perform next generation RNA sequencing to identify extracellular RNA transcripts in CSF. Our data demonstrate that CSF contains a diversity of fragmented RNA species, including several RNAs that are dysregulated in PD patients. We optimized a method to use small volumes (300
MATERIALS AND METHODS
Human CSF samples
Frequency matching was used to recruit 27 PD patients and 30 age- and sex-matched healthy controls at the Pacific Northwest Udall Center of Excellence for Parkinson’s Disease Research (PANUC) and the Alzheimer’s Disease Research Center (ADRC) at the University of Washington (UW) and Veterans Affairs (VA) Puget Sound Health Care System (Seattle, Washington, USA) as previously described [8–10]. All subjects underwent the following evaluations: medical history, physical and neurological examinations, laboratory tests, and neuropsychological assessments. The inclusion and exclusion criteria for study subjects have been previously described [8–11]. Briefly, PD patients met UK PD Society Brain Bank clinical diagnostic criteria for PD [12], except that having “more than one affected relative” was not considered an exclusion criterion. Control subjects were the patients’ spouses or community volunteers in good health. They had no signs or symptoms suggesting cognitive impairment or neurological disease; all had a Mini Mental Status Exam (MMSE) score between 28 and 30, a Clinical Dementia Rating (CDR) score of 0, and New York University paragraph recall scores (immediate and delayed) of >6. All procedures were performed in strict compliance with guidelines for human experimentation and all participants underwent detailed informed consent procedures. Subjects provided consent in writing in accordance with procedures approved by the institutional review boards at the UW and VA Puget Sound Health Care System.
CSF was collected by lumbar puncture as previously described [9]. Similar sample collection protocols and quality control procedures were followed at all participating centers, in particular, the use of polypropylene collection and storage tubes, rapid separation into single use aliquots, and freezing of CSF samples, to minimize potential site variability. De-identified samples were aliquoted and stored at –80°C, coded as group “A” or “B”, and then transferred to the University of Miami for RNA analysis. The UW CSF data dictionary is shown in Supplementary Table 1.
Library preparation of the CSF transcriptome and sequencing
Library preparation was optimized to achieve successful sequencing runs from low input RNA by scaling up reverse transcription and PCR reaction volumes without increasing PCR amplification cycles. Extracted RNA was between 550 t0 3600 picograms as shown in Supplementary Figure 1. We utilized NEB-based directional RNA-Seq (Ipswich, MA, USA) sample preparation kit to prepare all libraries. End repair of cDNA libraries was performed and 5’ and 3’ adaptors were ligated, sequentially. Eluted cDNA was enriched and remaining RNA products were digested with USER excision and PCR amplification. During enrichment, each sample was indexed with one unique index primer from a set of 12 different index primers for Illumina multiplex sequencing. RNA was reverse transcribed to synthesize first strand cDNA, followed by a purification step, using Agencourt RNA Clean XP beads (Beckman Coulter, Brea, CA, USA). Second strand cDNA synthesis was performed and libraries were amplified (12 cycles), followed by purification using 1.8 X Agencourt AMPure beads (Beckman Coulter, Brea, CA, USA). Finally, three uniquely indexed libraries were pooled and sequenced on a HigSeq2000 Illumina sequencer. A quality control assessment was performed, using the Bowtie software package by estimating genome coverage, percent alignment and nucleotide quality. Raw sequencing data was transformed to FastQ format and stored on a dedicated server. All raw sequencing data and corresponding codes are uploaded onto the Parkinson’s Disease Biomarker Program database (PDBP) Data Management Resource (PDBP DMR) (http://pdbp.ninds.nih.gov) and are freely available to registered users.
Sequencing read processing and genome mapping
Genome mapping was performed using the Spliced Transcripts Alignment to Reference (STAR) software [13], based on the latest available versions of STAR 2.1.3 and the most recent human transcriptome (reference hg19) downloaded from the UCSC database (Santa Cruz, CA, USA). Transcript assembly was processed and counts per million (CPM) values for each aligned sequence (known and novel) were determined. To ensure that reads were indeed novel, we applied an initial filter based on expression levels to eliminate low abundance single-exon transcripts to avoid mislabeling of genomic DNA contaminated reads.
RNA sequencing analysis and identification of differentially expressed transcripts
A trimming algorithm based on Trim Galore software (Babraham Institute, Babraham, United Kingdom) was used to trim reads for adapter removal. EdgeR [14], a Bioconductor software package for differential expression analysis of replicated count data, was used to correct for multiple testing (false discovery rate (FDR) cutoff of <0.1) and for the identification of differentially expressed transcripts based on CPM values [14].
Functional annotation and pathway analysis of differentially expressed transcripts (DETs)
Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis of DETs was performed using the Database for Annotation Visualization and Integrated Discovery (DAVID) online analytical tools [15–17]. Protein-protein interaction networks of DETs were constructed via the search tool for the retrieval of interacting genes/proteins (STRING; http://www.stringdb.org/) and the similarity between the PPI networks and the KEGG pathways were calculated via EnrichNet [18, 19].
Quantitative real time PCR (qRT-PCR) validation of RNA-seq data
The cDNA synthesis protocol is described in Supplementary Figure 2. The majority of RNA in the CSF is short in length, most likely the result of RNA fragmentation. We synthesized cDNA and utilized SYBR green-based qRT-PCR reactions with transcript-specific forward primers and a universal PCR (reverse) primer (Supplementary Table 3) that anneals to the 5’ end of the tagged cDNA (Supplementary Figure 3). Small CSF sample volumes limited qRT-PCR gene validation to a subset of candidate sequencing tags; therefore 21 transcripts (Supplementary Table 3) were selected for validation based on their relevance to known PD genes in the PPI analysis.
We performed Sanger sequencing to validate the specificity of qRT-PCR products. Raw data was normalized to the geometric mean of the 18 S gene that met the criteria of having similar expression between sample groups. Relative expression was compared between PD and control groups using the student’s
RESULTS
High-throughput sequencing of the CSF transcriptome
We analyzed RNA transcripts from the CSF of 27 PD patients and 30 controls. CSF RNA-seq data were analyzed using STAR to align reads to the human genome, generating an average of 84×106 paired-end reads that aligned to the reference human genome as well as to annotated genes. The average number of trimmed reads was 81×106 per sample, of which 82.7% aligned to the human reference genome (UCSC_hg19) (Supplementary Table 2). Our analysis focused on RNA fragments that were reproducibly present in at least 65% of samples in each group and CPM >1. After filtering, a total of 3521 fragmented transcripts were detected: protein-coding (2862 transcripts; 81.3% of total transcripts), non-coding RNAs (458 transcripts; 13% of total transcripts) and pseudo genes (201 transcripts; 5.7% of total transcripts, Supplementary Table 4).
Differentially expressed transcripts (DETs) in the CSF of PD patients
Differential expression analysis was performed using
Differential expression analysis of CPM values using
Functional annotation clustering of differentially expressed transcripts
To determine the relationship between differentially expressed transcripts identified in RNA-seq to relevant PD processes and pathways, gene ontology (GO) and pathway enrichment analyses were performed using DAVID online tools. We considered significantly enriched genes as those with the following criteria:
Based on the cutoff criteria of |log2 FC|>1,
Construction of protein-protein interaction (PPI) networks
We mapped 201 DETs to the STRING database (the hub protein was selected according to the node degree) and screened for significant interactions (>0.7). These relationships were integrated to construct interaction networks among interacting proteins (Fig. 3) and the similarity between PPI networks and KEGG pathways were calculated via EnrichNet [18, 19]. This analysis detected the following proteins:
Validation of CSF sequencing
Target transcripts included 19 genes from the PPI network (Fig. 3) and two non-coding RNAs that are differentially expressed. The differential expression observed in RNA-seq between the two groups was validated in 13 out of 17 transcripts (76%) by qRT-PCR (Table 2). Our comparison of RNA-Seq and qRT-PCR showed good agreement in both the direction and magnitude of fold change for candidate transcripts. Log2 fold changes of these transcripts in qRT-PCR correlated with RNA-seq (
DISCUSSION
Our RNA-seq data and bioinformatics pipeline suggest that extracellular RNAs might be detectable in CSF and their presence and differential expression can be monitored to identify Parkinson’s disease-related extracellular RNAs. We observed higher frequency of non-coding RNAs in the CSF of PD patients compared to controls corroborating previous reports that various ncRNA species perform essential functions in the regulation of gene expression during neural development, plasticity, and aging [20, 21]. Several ncRNAs are implicated in PD [22] and other CNS diseases. Specifically, long noncoding RNAs (lncRNAs) are abundantly transcribed and highly expressed in the brain [25]. Recent evidence suggests lncRNAs play important roles in neurodegenerative diseases, specifically PD [26] and some reports suggest a link between
DNA methyltransferase 1 (
We identified protein tyrosine phosphatase, receptor type, C (
Bone morphogenetic protein 7 (
Overall, these candidate genes form an interesting picture of the transcripts present in the CSF. Validation of candidates of known importance in PD and PD-related pathways suggest that CSF transcriptome profiling holds promise for the identification of novel PD RNA biomarker discovery.
CONCLUSIONS
Cerebrospinal fluid contains a number of measurable RNA species that can be used to generate a CSF-RNA transcriptome. Analysis of these profiles from a small cohort of PD patients and controls resulted in the validation of several transcripts, including:
AUTHOR CONTRIBUTIONS
AH, RPF and RA conceived the study, performed experiments, wrote the manuscript and analyzed the data. EP, CZ, SCH and MS collected and characterized clinical samples. CW, JZ and MAF conceived the study, performed experiments and analyzed the data.
CONFLICT OF INTEREST
The authors have no conflicts of interest to report.
Footnotes
ACKNOWLEDGMENTS
NIH grants P50NS071674 supporting AH and part of MAF as well as sequencing costs, U01 NS082137 (to JZ), P50 NS062684 (PANUC), and P50 NIA AG05136 (to EP).
