Abstract
RNA sequencing (RNA-seq) has revolutionized transcriptome analysis through profiling the expression of thousands of genes at the same time. Systematic analysis of orthologous transcripts across species is critical for understanding the evolution of gene expression and uncovering important information in animal models of human diseases. Several computational methods have been published for analyzing gene expression between species, but they often lack crucial details and therefore cannot serve as a practical guide. Here, we present the first step-by-step protocol for cross-species RNA-seq analysis with a concise workflow that is largely based on the free open-source R language and Bioconductor packages. This protocol covers the entire process from short-read mapping, gene expression quantification, differential expression analysis to pathway enrichment. Many useful utilities for data visualization are included. This complete and easy-to-follow protocol provides hands-on guidance for users who are new to cross-species gene expression analysis.
Introduction
The transcriptome representing the entire repertoire of gene transcripts in a cell bridges the gap between genetic information encoded in DNA and phenotypes. A quantitative measurement of the transcriptome provides a snapshot of the content and dynamics of RNA species under a certain cellular condition. Traditional tools for gene expression profiling include Northern blot, reverse-transcription polymerase chain reaction, expressed sequence tags, and serial analysis of gene expression. The advent of microarray1,2 and, more recently, RNA sequencing3,4 (RNA-seq) allows fast, cost-effective, and comprehensive measurement of messenger RNA abundance for thousands of genes simultaneously. Many studies that compare the two technologies in the same system have found that RNA-seq has increased sensitivity for the identification of differentially expressed genes, compared to microarray measurements.5–10
Although microarray or RNA-seq experiments are often used to probe changes in gene expression within a species,11–17 understanding the differences in gene expression between species has a number of important applications in the fields of biology and medicine, including (1) evolution in gene expression18–20; (2) animal models of human diseases such as cancers,21,22 Alzheimer's disease, 23 Huntington's disease, 24 diabetes, 25 and hypertension 26 ; (3) developmental biology 27 ; (4) aging28–30; (5) toxicology 31 ; and (6) biomarkers.32,33 As such, several computational methods have been proposed and developed to analyze interspecies gene expression data.34–36
However, a detailed, step-by-step protocol is not available for cross-species RNA-seq data analysis, which hampers the full utilization of gene expression data in public repositories such as Gene Expression Omnibus 37 and ArrayExpress. 38 Here, we address this need with a protocol based on published computational pipelines39–41 and relevant Bioconductor packages. The steps in this protocol are detailed in “Description of the protocol” section, which include (1) short-read alignment to a genome, (2) quantification of gene expression based on a given annotation, (3) lifting of annotations between species to their best orthologs, (4) differential expression analysis between multiple species or between multiple samples of one species, and (5) pathway enrichment and analysis of differentially expressed genes.
Description of the Protocol
RNA-seq analysis typically begins with the sequencing of many individual complementary DNA reads, which are usually no more than several hundred base pairs long. Quality control software assesses the quality of each base pair of a sequenced read and returns a file in the FASTQ format with both DNA sequence and a quality score for each nucleotide. A scientist then examines this file and may use tools to improve the average quality of the data by truncating the read fragments with low-quality scores. Other operations are often performed in this step as well, such as the demultiplexing of barcoded samples. At the end of the quality control process, a file with the high-quality reads is the starting point for this protocol (Fig. 1, Step 1).

Oomputational pipeline for cross-species expression analysis using RNA-seq. The pipeline is divided into eight steps. Description of each step and corresponding commands are included in the text. The custom scripts for RNA-seq analysis and data visualization can be downloaded from https://github.com/ploverso. Other software or R packages listed are available for downloading (details are given in “Hardware and software” section).
The reads are then aligned to the genome of the organism using
The next step is to quantify gene expression across species based on a gene annotation (Fig. 1, Step 4). This part of the protocol is quite specific to comparisons between species and is sensitive to errors. A single reference species is identified, in this protocol the mm10 annotation,46,47 and the annotation file is downloaded in the GFF format. Constitutive exons, which are exons that are always included in the final gene product, are identified in this annotation using MISO.
48
Other parts of the annotation that are not constitutive exons are discarded. Pairwise genome alignments of the chosen reference annotation to each query species are downloaded in the AXT format.49–51 All exons in the reference annotation that have complete orthologous regions in all query species genomes are lifted to their respective orthologous position in each query species, while maintaining the gene IDs of the reference species. The resulting annotations are then converted from the GFF format to the GTF format using the
The annotations are then used to count the number of reads that align to each exon from the indexed alignment file, which is used to calculate expression on a per-gene level. For comparison between species, this pipeline uses a count-based method rather than an FPKM-based method for quantifying expression, as it is easier to integrate this information into downstream expression analysis tools. The reason for this is twofold. First, many tools that compute differential expression (eg,
Mapped short reads are counted for each sample against the respective annotation using Rsubread,
52
which returns the count information in a list (Fig. 1, Step 5). This list can then be read into edgeR
53
that is able to perform a number of statistical tests upon the count data (Fig. 1, Step 6). Of primary importance, differential expression is computed for each gene between each sample using a negative binomial distribution.54,55 The list of differentially expressed genes may then be subset by magnitude and reported directly, as well as lend itself to further downstream analysis. In particular, our method covers the use of
In addition to the above-mentioned pathway analysis charts, a number of scripts are written that leverage various other R packages to display various interesting aspects of the data using heatmaps, Venn diagrams, and other charts.
Generation of Cross-Species Genome Annotations
The key to being able to compare RNA-seq data between different species is the generation of a cross-species genome annotation. To this end, one species is selected as a “reference” species against which any other query species is compared. In this protocol, the mouse genome and annotation are selected as the reference. The goal is to ultimately compare all data from all relevant samples to one another in terms of genes and pathways in the reference species. The best way to do this is to use orthologous genome regions. Because many annotations may be variably complete or have similarly named genes that have different functions, comparisons at a base pair level are used to determine which regions in the query species’ genomes are to match up to each region in the reference annotation.
One commonly used tool for the translation of genomic coordinates from one annotation version to another, or indeed one species to another, is the University of California Santa Cruz (UCSC)'s
Furthermore, in our protocol, the reference annotation is filtered such that only constitutive exons, that is, the exons in a gene that are always incorporated into the final gene product, are included. Although alternatively splicing is biologically important, comparing all exons in a gene between species is less meaningful as exons may differ in size and number. As such, exons that may be spliced out of the primary transcript, that is, cassette exons, are removed from consideration since they serve as a source of variation between samples.
Comparisons with other Methods
To the best of the knowledge of the authors, this is the first published protocol providing a generally applicable set of instructions for the comparison of RNA-seq data between different species. This protocol is partially based on a method published by Liu et al.
8
for comparing RNA-seq data between closely related species, while adapting other parts of this protocol on previously published protocols and software39–41 that are already commonly used throughout the field. The goal of this protocol is to use off-the-shelf software when possible for the analysis of the data, while writing and publishing new code for ease of comparisons between species. There are often many options available for most steps in the protocol, which work with similar input and output files; the user may alter some steps to use one of the many other tools available, for example,
Several other groups have published tools for cross-species gene expression analysis.34–36 For instance, Kuhn et al. 34 used an approach similar to ours (gene ontology), but rather than examine either species at a base pair level, they have built tools to query Homologene for gene IDs and return orthologous mappings in another species. This tool has been wrapped and published in a Bioconductor package “annotationTools.” One weakness of their method is that it does not contain any function to judge the relatedness between the genes. It simply returns the gene IDs, relying on Homologene to do the heavy lifting.
Zhu et al. 36 used a method that is also similar to ours. That is, they have built cross-species annotations using lift-Over, which poses some problems discussed in “Generation of cross-species genome annotations” section. They then used BLAT to filter out their orthologous exons, while we used AXT files instead.
Kristiansson et al. 35 defined and implemented a method for cross-species gene expression analysis as well, although they did not provide a detailed step-by-step protocol for upstream preparation and downstream analyses. While the general idea of this method is the same as ours - comparison of expression based on ontology - the implementation is very different. They take into account the homology structure between compared species and compare the expression data from genes that have any number of orthologs and paralogs. A simulation study has shown that this method has increased statistical power compared to other methods. We may construct a separate protocol later based on their method for the analysis of cross-species gene expression data.
Confounding Effects
Several confounding effects may be introduced in the cross-species analysis of RNA-seq data. First, as this protocol seeks to compare RNA-seq data between different species using one annotation, differences in the procurement and treatment of cells can introduce variations into the data, as well as the relatedness of the species. Second, the quantification of orthologous genes across species can only be an approximation based on alignment scores, and in distantly related species, this may introduce confounding of the data. Third, if the protocol is used to compare data from more than one study, differences in cells or data treatments between studies may also introduce variations. Fourth, experimental and computational tools used for the analysis may introduce variations. It is advised that a user should keep track of all software versions and cell treatment protocols used. Any differences in the protocols may be used to determine whether differences found in the downstream analysis are the result of these factors or true biological expression variations. Fifth, comparing one sample from one species to one sample from another species may have bias introduced in the process of the cross-species annotation. In order to control for this, it is recommended that multiple cell types in a species be examined. Thus, variations in gene expression in individual samples can be controlled against the average expression of all samples in the same species.
Hardware and Software
The computing resources necessary for this protocol are heavy, particularly for the alignment of sequencing reads to the reference genome. While there are ways to reduce the memory footprint of the alignment if necessary, it is recommended to use a computer with at least eight cores and 64 GB of RAM, as well as at least 500 GB of hard drive space. Additional resources allow multiple samples to be run simultaneously, significantly speeding up the analysis.
This protocol is constructed on and for a GNU/Linux operating system, and commands are given assuming that the user is using a POSIX-compliant operating system with access to a shell such as bash. While it is possible to run the protocol under Microsoft Windows, several additional steps would be necessary for the proper execution of various programs, which is outside the scope of this protocol. The author recommends one of the Debian or Red Hat distro's for this protocol.
The SHRiMP alignment software may be downloaded from http://compbio.cs.toronto.edu/shrimp/.
To work with the alignment files, SAMtools is used for conversion, sorting, and indexing of the files. It may be installed from your distro's software repository or downloaded from http://samtools.sourceforge.net/.
Determination of constitutive exons leverages MISO, which can be downloaded from https://miso.readthedocs.org/en/fastmiso/index.html.
Various scripts and utilities for working with axt files, as well as various downstream analyses, were written by the author and may be downloaded from a git repo created for this protocol at https://github.com/ploverso.
The
The R statistical computing environment may be downloaded from http://cran.rstudio.com/.
Bioconductor and several of its packages (
The
Following is the output of R's sessionInfo() command, which will show the versions of all packages used:
Alternative a Ligners to the Protocol
This protocol is modular, meaning that users can use aligners other than
Input Data
Theoretically, this protocol is suitable for RNA-seq data from any species generated from a commercial NGS platform (Illumina, SOLiD, or Ion Torrent), provided that the quality control on the reads are performed according to the manufacturer's instructions. For the purpose of illustration, we use RNA-seq data from two different species,
Protocol
Preparation of reference genome
The reference genomes for rat and mouse (rnor5 and mm10, respectively) are downloaded in compressed FASTA format from llumina's igenomes FTP server: ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/. Furthermore, the comprehensive gene annotation for mm10 is downloaded in the GFF format from GENCODE: http://www.gencodegenes.org/mouse_releases/3.html. Once the genome FASTA files are downloaded and extracted for each species, they are preprocessed with
This command indexes the genome and saves the indexes and projections of the genome to files that are loaded for each sample to align against. If this preprocessing step is not done, it will be performed prior to mapping for each sample, causing the alignment step per sample to take many hours longer. The index files will be several times larger than the original genome.fa files and should be placed in a location convenient to the RNA-seq samples.
Alignment of RNA-seq reads to indexed genome
Each sample should be aligned to its respective genome, which is specified with the assembly name given when performing the indexing. This step is the most computationally intensive and should be performed on a computer with at least 60 GB of RAM. This step can be parallelized easily, and allotting more CPU cores to the alignment allows it to run significantly faster. We suggest writing a simple bash script to run the alignments, to reduce the amount of oversight necessary for large numbers of samples. The general command used to align the rat data is
This command outputs the alignment to a specified output file in the SAM format. It is only valid for the rat data, which are single-ended RNA-seq data. The mouse data are paired-end data, with each sample name suffixed with a _1 or _2 in the FASTQformat. The command used for the mouse data is
The -p option specifies how
Conversion, sorting, and indexing of SAM files
The SAM output of
BAM files that are sorted and indexed have greatly enhanced access, which speeds up downstream analysis. Additionally, BAM files tend to be only a small fraction of the size of the SAM files, freeing up disk space and reducing RAM requirements for downstream analysis programs that need to load the entire BAM files into memory.
Generation of cross-species annotations
Starting with the GFF file downloaded from GENCODE in a previous step, the constitutive exons must be identified. This is done with the
This command extracts all constitutive exons (ie, all exons are always incorporated into the final gene product) that are greater than 100 bp, an arbitrary cutoff” value, into a GFF file in a specified folder. This file is then broken down into individual chromosomes using a PERL script from the github repo specified above, gffToChrs.pl.
Next, the axt files are downloaded, one axt file per chromosome. The wget utility may be helpful in doing so. The axt files are provided for many species by UCSC at: http://hgdownload.cse.ucsc.edu/goldenPath/mm10/vsRn5/axtNet/. Then, the
It should be noted that the output folder must be created prior to running the script. Furthermore, any files in the folders should be deleted if this step needs to be rerun as the script appends rather than overwrites files. The script only supports a single input GFF and AXT file, so a bash script may be useful to run all chromosomes for the reference species. It is not recommended to run the chromosomes in parallel as each input chromosome may map to any of the output chromosome files.
Once this is completed, the GFF files should be sorted and combined into a single annotation file. This is again easy to do with R. For example:
Important: The final annotation for each species MUST be in the same chromosomal order as the genome.fa file for that species, otherwise the final GFF file will be sorted improperly and many gene quantification tools will fail to work. Additionally, chromosomes must be in the same format as in genome.fa for that species (eg, chr10 vs 10).
Finally, the GFF files should be converted to GTF format. GTF is essentially a simplified, more specific form of the GFF format. The Cufflinks package comes with a utility for performing this conversion,
These GTF files are then used together with the BAM files generated previously to quantify the expression at a pergene level.
Counting of gene features
The next step is to quantify the expression for each sample, which is done using the Bio-conductor package
This last command saves all the data, from all the samples, to a file that can be reloaded by R at any time. This is useful when analyses are done on a computer that a researcher may not have access all the time or when further analysis may be desired at a later time. Unless saved, the data are deleted when the R environment is closed.
The data are then prepared for loading into
These commands do additional sorting and filtering to ensure that all genes for all samples are in the correct order, and there are no gene IDs that do not exist for all samples.
Differential expression analysis
Once the files with the counts have been prepared, they can be analyzed using
To load the count data into
The generated experimental design matrix (specifying which samples/replicates in the “counts.txt” file to include under which labels) appears as follows:
While not strictly necessary, a design matrix is exceptionally useful as it allows comparisons to be made with the
These comparisons allow for the determination of differentially expressed genes both generally across all cell types as well as on a per-cell-type basis. Comparisons of cellular differences between individual species as well as comparisons between cell types are also defined. Not all the above comparisons wound up having their data used for the final reporting of results. However, the presence of many of these comparisons allowed for additional error checking of the data. Additionally, as analyses are done, certain comparisons that had not been of interest before may turn to be of interest. Overall, it is useful to have all the data preprocessed and ready to go if it should be decided that any subset of it may be needed, rather than adding onto already existing data structures after the fact.
The next step is to use
The table of fold changes and false discovery rate (FDR)-corrected
In cross-species analyses, simple comparisons - for example, comparing the rat astrocytes directly against the mouse astrocytes - may present skewed data, as artifacts introduced by the cross-species annotation are not controlled in this comparison. To tackle this issue, in the
Visualization of data
Using the gene expression and pathway data contained in the lists described earlier, the data are graphed and visualized. First,
Furthermore, Venn diagrams are generated using the
Pathway analysis and visualization
While lists of differentially expressed genes are a detailed and robust way to represent differences in expression between two samples, they are not a very friendly format for humans to understand. A table of thousands of gene ID tags, each with individual expression values, is not easy to read and to visually extrapolate to biological significance. To this end,
Once lists of significantly enriched pathways have been generated, the KEGG IDs of enriched
Discussion
Here we describe, to the best of our knowledge, the first comprehensive protocol for the comparison of RNA-seq data between species. The novelty of this should be emphasized. While other methods have been developed for interspe-cies comparisons, a comprehensive protocol that walks users through the analysis process does not exist. By thoroughly explaining and documenting each step, it is easy for a person without prior experience to comprehend and follow this protocol rather than attempt to run or even develop it on his/her own.
Compared to other previously published methods, this protocol has a number of strengths. For example, the generation of a cross-species annotation enables the use of other commonly used downstream analysis tools. This means that the cross-species annotation can be easily integrated into existing pipelines for automation. The downstream analysis methods used in this protocol are essentially identical to those used in within-species RNA-seq data analysis. Most common downstream analysis tools will be able to natively support the sorts of comparisons necessary for comparing between species with minimal effort or changes to existing workflows.
Furthermore, the use of UCSC axt files means that huge numbers of comparisons are possible. Any species can theoretically be compared to any other species so long as an alignment exists between their genomes, although for distant species the quality of the comparison may be questionable. The axt files can be used as a drop-in to create annotations for as many species as is desired. Additionally, they provide a standardized format, which again proves useful in automation.
Author Contributions
Conceived and designed the experiments: FC. Analyzed the data: PRL. Wrote the first draft of the manuscript: PRL, FC. Contributed to the writing of the manuscript: PRL, FC. Agreed with manuscript results and conclusions: PRL, FC. Jointly developed the structure and arguments for the paper: PRL, FC. Made critical revisions and approved the final version: PRL, FC. Both the authors reviewed and approved the final manuscript.
Supplementary Material
Footnotes
Acknowledgment
An earlier version of this research was presented by Peter R. LoVerso in fulfillment of the requirements of his M.S. in Bio-informatics at Rochester Institute of Technology.
