
Abstract

Advances in next-generation sequencing (NGS) have allowed significant breakthroughs in microbial ecology studies. This has led to the rapid expansion of research in the field and the establishment of “metagenomics”, often defined as the analysis of DNA from microbial communities in environmental samples without prior need for culturing. Many metagenomics statistical/computational tools and databases have been developed in order to allow the exploitation of the huge influx of data. In this review article, we provide an overview of the sequencing technologies and how they are uniquely suited to various types of metagenomic studies. We focus on the currently available bioinformatics techniques, tools, and methodologies for performing each individual step of a typical metagenomic dataset analysis. We also provide future trends in the field with respect to tools and technologies currently under development. Moreover, we discuss data management, distribution, and integration tools that are capable of performing comparative metagenomic analyses of multiple datasets using well-established databases, as well as commonly used annotation standards.

Keywords

metagenomics, next-generation sequencing, computational tools, data analysis

Introduction

The advent of next-generation sequencing (NGS), or high-throughput sequencing, has revolutionized the field of microbial ecology and brought classical environmental studies to another level. This type of cutting-edge technology has led to the establishment of the field of “metagenomics”, defined as the direct genetic analysis of genomes contained within an environmental sample without the prior need for cultivating clonal cultures. Initially, the term was only used for functional and sequence-based analysis of the collective microbial genomes contained in an environmental sample,¹ but currently it is also widely applied to studies performing polymerase chain reaction (PCR) amplification of certain genes of interest. The former can be referred to as “full shotgun metagenomics”,² and the latter as “marker gene amplification metagenomics” (eg, of the 16S ribosomal RNA gene) or “meta-genetics”.³

Such methodologies allow much faster and more elaborate genomic/genetic profiling of an environmental sample at an acceptable cost. Full shotgun metagenomics has the capacity to sequence the majority of available genomes within an environmental sample (or community). This creates a community biodiversity profile that can be further associated with functional composition analysis of known and unknown organism lineages (ie, genera or taxa).⁴ Shotgun metagenomics has evolved to address the questions of who is present in an environmental community, what they are doing (function-wise), and how these microorganisms interact to sustain a balanced ecological niche. It further provides access to functional gene composition information derived from microbial communities inhabiting ecosystems of practical interest.

Marker gene metagenomics is a fast, if coarse, way to obtain a community/taxonomic distribution profile or fingerprint using PCR amplification and sequencing of evolutionarily conserved marker genes, such as the 16S rRNA gene.⁵ This taxonomic distribution can subsequently be associated with environmental data (metadata) derived from the sampling site under investigation.

Several types of ecosystems have been studied so far using metagenomics, including extreme environments such as areas of volcanism^6–9 or other areas of extreme temperature,^10,11 alkalinity,¹² acidity,^13,14 low oxygen,^15,16 and high heavy-metal composition.^17,18 These environments are an invaluable resource for bioprospecting and allow the discovery of novel enzymes capable of catalyzing reactions of biotechnological and commercial interest.¹⁹

The first metagenomic studies were focused on low-diversity environments, such as an acid mine drainage,²⁰ human gut microbiome,²¹ and water samples from the Sargasso Sea,²² mainly due to the unavailability at that time of both high-throughput sequencing technologies and relevant software for scaffold assembly. As more researchers entered this new field of study, the need for powerful tools and software became apparent and led to the creation of several such tools.

Sequencing Technologies

The two NGS technologies most commonly used to date are the 454 Life Sciences and the Illumina systems, with usage recently shifting in favor of the latter. Both technologies have been widely used in metagenomic studies, and hence it is important to briefly describe their advantages and disadvantages with respect to the sequencing of metagenomics samples.

The 454 pyrosequencer was the first next-generation sequencer to achieve commercial introduction in 2004.²³ Its chemistry relies on the immobilization of DNA fragments on DNA-capture beads in a water-oil emulsion and then using PCR to amplify the fixed fragments. The beads are placed on a PicoTiterPlate (a fiber-optic chip). DNA polymerase is also packed in the plate, and pyrosequencing is performed.^24,25 Its main difference from the classic Sanger sequencing is that pyrosequencing relies on the detection of pyrophosphate release on nucleotide incorporation rather than chain termination with dideoxynucleotides. The release of pyrophosphate is conveyed into light using enzyme reactions, which is then converted into actual sequence information.²³

In the initial years of high-throughput sequencing, scientists embraced the new technology and hence discovered the existence of the “rare biosphere”.²⁶ However, in many cases the apparent assignment of a microbial operational taxonomic unit (OTU) was in fact an artifact of sequencing errors, which inflated the diversity estimates.²⁷ Noise generated by 454 pyrosequencing technology affected different aspects of metagenomic data analysis and led to biased results.²⁸

PCR errors may lead to replicate sequence artifacts, which can cause overestimation of species abundance and functional gene abundance in 16S rRNA and full shotgun metagenomics, respectively. PCR can also generate noise in the form of single base pair errors (ie, substitutions, deletions) that can cause frame shifts for protein coding genes in shotgun metagenomics. Moreover, PCR chimeras (sequences generated by undesired end-joining of two or more true sequences) can also affect 16S metagenomics results with respect to species distribution.²⁹ Sequencing errors can also occur due to the actual chemistry underlying the technology. For example, there is an inherent difficulty in clearly identifying the intensity of 454 pyrosequencing-generated flowgrams. This task becomes even more difficult during the sequencing of homopolymers.³⁰ The 454 pyrosequencing technology can generate reads up to 1,000 bp in length and ~1,000,000 reads per run. The relatively long read length generated by this technology (in comparison to other sequencing technologies) allows a significantly less error-prone assembly in shotgun metagenomics and permits greater annotation accuracy.^31,32 The cost of sequencing using 454 pyrosequencing technology is estimated at around US$20 per Mb, but it has a relatively low coverage of 0.7 GB per sequencing run. With respect to pyrosequencing, <20 ng of DNA is sufficient for sequencing single-end libraries, although paired-end sequencing may require larger quantities of DNA.

Although 454 will eventually stop being supported by Life Sciences, one should still take into account that a large number of existing unpublished datasets have been generated via this technology. Therefore, it is important to include it in this review and compare it with the other sequencing technology that has become more popular over recent years, namely Illumina.

Illumina dye sequencing by synthesis begins with the attachment of DNA molecules to primers on a slide, followed by amplification of that DNA to produce local colonies.²³ This generation of “DNA clusters” is accompanied by the addition of fluorescently labeled, reversible terminator bases (adenine, cytosine, guanine, and thymine) attached with a blocking group.³³ The four bases then compete for binding sites on the template DNA to be sequenced, and the nonincorporated molecules are washed away. After each synthesis cycle, a laser is used to excite the dyes, and a high-resolution scan of the incorporated base is made. A chemical deblocking step ensures the removal of the 3’ terminal blocking group and the dye in a single step. The process is repeated until the full DNA molecule is sequenced. Illumina has a variety of sequencing instruments dedicated to different applications. MiSeq, for example, has an output of 15 GB and 25 million sequencing reads of 300 bp in length; clustered fragments can be sequenced from both ends (paired-end sequencing), which can be merged so that 600 bp reads can be obtained. HiSeq2500 has a much greater output (1,000 GB per run) but offers 125 bp reads. Illumina sequencing involves a much lower cost (~US$0.50 per Mb), but the run time is longer than that for 454 pyrosequencing. Currently, this limitation is being addressed by the MiSeq Illumina machine, which has been developed in order to run smaller jobs at a much faster rate with relatively high throughput. Illumina allows sample preparation sizes of <20 ng DNA (similar to 454 pyrosequencing).

The shorter read length produced by Illumina may increase errors during assembly and, subsequently, the annotation inaccuracies during shotgun metagenomics data analysis.³⁴ In contrast, when analyzing 16S metagenomics data, this technology obviates the need for time-consuming noise removal algorithms required for pyrosequencing and makes analysis less error-prone.³⁵ The greater coverage/yield generally offered by Illumina allows a significant decrease of systematic errors. This advantage and the low cost are the deciding factors that have turned Illumina into the preferred high-throughput sequencing technology for metagenomics studies.

Additional sequencing technologies are available and can potentially be used for metagenomic studies. These include the Applied Biosystems SOLiD 5500 W Series sequencer, which offers higher coverage than 454 pyrosequencing but lower than Illumina (~120 GB per run). It allows fragment or mate-paired sequencing; however, it can only guarantee a low error rate for sequencing reads of maximum 50 bp in length.³⁶ This reduces the possibility of generating a reliable and usable de novo assembly for shotgun metagenomics; but, on the other hand, this technology performs very well when utilizing a reference genome for mapping or assembly of reads. However, using the Exact Call Chemistry (ECC) module, the SOLiD system offers to boost the accuracy of its ligation-based sequencing.

An emerging sequencing technology that may have high impact on the fields of genomics and metagenomics was recently developed by Pacific Biosciences (PacBio).³⁶ This technology uses single-molecule real-time (SMRT) sequencing, which is a parallelized single-molecule DNA sequencing by synthesis. SMRT sequencing utilizes the zero-mode waveguide (ZMW), whereby a single DNA polymerase enzyme is fixed to the bottom of a ZMW with a single molecule of DNA as a template. The ZMW is a structure that creates an illuminated observation volume that is small enough to allow the observation of a single nucleotide of DNA (also known as a base) being incorporated by DNA polymerase. Each of the four DNA bases is attached to one of four different fluorescent dyes. When a nucleotide is incorporated by the DNA polymerase, the fluorescent tag is cleaved off, which diffuses out of the observation area of the ZMW where its fluorescence is no longer observable. A detector detects the fluorescent signal of the nucleotide incorporation, and the base call is made according to the corresponding fluorescence of the dye. PacBio provides much longer read lengths (~10,000 bp) compared to the aforementioned technologies, thus having obvious advantages when addressing issues of annotation and assembly for shotgun metagenomics. PacBio technology uses a process called strobing to perform paired-end read sequencing. Despite the high read length of PacBio, this technology is limited by high error rates and low coverage (albeit at higher throughput than Sanger sequencing).

In addition to the aforementioned technologies, which are based on optics, technologies such as Ion Torrent's semiconductor sequencing benchtop sequencer and Ion Proton are now coming into play. These technologies are based on the use of proton emission during polymerization of DNA in order to detect nucleotide incorporation. This system promises read lengths of >200 bp and relatively high throughput, on the order of magnitude achieved by 454 Life Sciences systems. Additionally, it offers higher quality than 454, especially when sequencing homopolymers, but at a similar cost (about US$23 per Mb for the Ion Torrent PGM -314 Chip). Looking into the future, and given that 454 will eventually stop being supported by Life Sciences, it is very likely that former users of 454 pyrosequencing will switch to Ion Torrent sequencing chemistry, due to the similarities between the two (eg, the emulsion PCR step) and the significant advantages of the latter.

An even more cutting-edge technology is currently under development by Oxford Nanopore Technologies, which is developing “strand sequencing”, a method of DNA analysis that could potentially sequence completely intact DNA strands/polymers passed through a protein nanopore. This obviates the need for shotgun sequencing and aims to revolutionize the sequencing industry in the future. Oxford Nanopore intends to commercialize this technology with the company's GridION™ and MinION™ systems. For metagenomics, this technology can have obvious advantages, as it will eliminate erroneous sequencing caused by shotgun metagenomics and exclude the need for the error-prone assembly step during data analysis (for details, see later). However, nanopore sequencing is at the moment not commercialized (it is offered only through the MinION™ Access Program) and is still being optimized on a case-by-case basis using specific template and sequencing needs.

Another example of an innovative and very promising technology is the Irys Technology (BioNano Genomics), which uses micro and nanostructures and offers new ways of de novo constructing genome maps. The input is DNA labeled at specific sequence motifs that can be used for imaging and identification in IrysChips. These labeling steps result in a uniquely identifiable, sequence-specific pattern of labels to be used for de novo map assembly or for anchoring sequencing contigs.

Shotgun Metagenomics

Assembly of Shotgun Metagenomics Data

Metagenomics studies are commonly applied to investigate the specific genomes (known as well as unknown, both cultured and uncultured) that are present within an environmental community under study. Moreover, when performing full shotgun metagenomics, the complete sequences of protein coding genes (previously characterized or novel) as well as full operons in the sequenced genomes can offer invaluable functional knowledge about the community. For these reasons, an assembly of shorter reads into genomic contigs and orientation of these into scaffolds is often performed to provide a more compact and concise view of the sequenced community under investigation. Early attempts at metagenomic data assemblies utilized tools initially implemented for single genome data assemblies. They, therefore, fell short when forced to assemble reads into contigs for metagenomic samples. However, assembly tools have significantly evolved since then, and the current line of tools has been modified and specifically designed to assemble samples containing multiple genomes, thereby rendering them much more effective for the task at hand.

The process of assembling shorter reads into contigs can take two different routes: 1) reference-based assembly and 2) de novo assembly. The choice of which route to follow depends on the dataset that needs to be analyzed and on the specific needs of each research project. For example, de novo assembly could be, in theory, used even if a reference genome exists, if the computational power allows for it.

Reference-based assembly refers to the use of one or more reference genomes as a “map” in order to create contigs, which can represent genomes or parts of genomes belonging to a specific species or genus. Tools such as Newbler (Roche), MIRA 4,³⁷ or AMOS, as well as the recent MetaAMOS,³⁸ are commonly used in metagenomics for performing reference-based assemblies. These tools are not computationally intensive and perform well when metagenomic samples are derived from extensively studied and researched areas. In such cases, sequences from closely related organisms would have already been deposited in online data repositories and databases, allowing them to be used as references for the assembly process. Often, assemblies are visually evaluated using genome browser tools such as Artemis.³⁹ The observation of large gaps in the query genome(s) of the resulting assembly, when compared to the reference genome(s), can be seen as an indication that perhaps the assembly is incomplete or that the reference genome(s) used are too distantly related to the community under investigation to perform optimally.

De novo assembly refers to the generation of assembled contigs using no prior reference to known genome(s).⁴⁰ This task is computationally expensive and relies heavily on sophisticated graph theory algorithms, such as de Bruijn graphs, which were specifically employed to tackle this job. Tools such as EULER,⁴¹ Velvet,⁴² SOAP,⁴³ and ABySS⁴⁴ were amongst the first to perform de novo assembly and are still widely used today. They require computers with large amounts of memory and generally long execution times (depending on the size of the dataset). However, these tools were built with the assumption of assembling a single genome and often underperform when used for metagenome assemblies. Problems arise from 1) variation between similar subspecies, 2) genomic sequence similarity between different species, and 3) difference in abundance for species in a sample also affected by different sequencing depths for individual species. These issues introduce kinks (or branches) in the de Bruijn graph, and have to be addressed in order to improve the assembly.
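To make the notion of a kink (branch) concrete, the following is a minimal sketch of a de Bruijn graph built from reads, assuming error-free reads and a single strand. The function names `de_bruijn_graph` and `branching_nodes` and the two toy reads are hypothetical and serve only as illustration; the sketch does not reproduce the data structures of any of the assemblers cited above.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer links its (k-1)-prefix to its (k-1)-suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Return nodes with more than one outgoing edge (the 'kinks')."""
    return sorted(node for node, successors in graph.items()
                  if len(successors) > 1)


# Hypothetical reads from two closely related strains that differ
# at a single position (T vs A after the shared prefix ACGT).
reads = ["ACGTTGCA", "ACGTAGCA"]
graph = de_bruijn_graph(reads, k=4)
branches = branching_nodes(graph)
print(branches)
```

In this toy case the single-base difference between the two strains produces one node with two successors, which is the kind of ambiguity that a single-genome assembler has to resolve and that metagenome-aware assemblers try to exploit.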

The next generation of assembly tools, such as MetaVelvet and very recently MetaVelvet-SL^45,46 and Meta-IDBA,⁴⁷ was developed to address these issues. MetaVelvet and Meta-IDBA employ a combined binning (for details on binning, see below) and assembly approach to create more accurate assemblies from datasets containing a mixture of multiple genomes. They make use of k-mer frequencies to detect kinks in the de Bruijn graph and then use these k-mer thresholds to decompose the graph into subgraphs. These tools further assemble contigs and scaffolds based on the decomposed subgraphs, and thus perform a more efficient grouping/assembly of contigs, effectively separating those belonging to different species.

The IDBA-UD algorithm⁴⁸ was recently developed to additionally address the issue of metagenomic sequencing technologies with uneven sequencing depths. It makes use of multiple depth-relative k-mer thresholds to remove erroneous k-mers in both low-depth and high-depth regions. Comparison of the performances of these tools is often performed using the N50 length score, which is defined as “the length for which the collection of all contigs of that length, or longer, contains at least half of the total of the lengths of the contigs in the assembly”.^49,50 A recent comparison of the latest line of assembly tools shows that IDBA-UD can reconstruct longer contigs with higher accuracy.⁴⁸ However, there is still much room for improvement of metagenomic assembly algorithms before they fully capture the task at hand.
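The N50 definition quoted above can be computed directly from a list of contig lengths. The following is a minimal sketch; the function name `n50` and the example lengths are hypothetical.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running total first reaches at least half
    of the summed length of all contigs.
    """
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


# Hypothetical assembly: total length 54, so half is 27.
# 10 + 9 + 8 = 27 reaches the halfway point, giving an N50 of 8.
example_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(example_lengths))
```

Note that N50 only summarizes contiguity; it says nothing about whether the contigs are correctly assembled, which is why it is usually reported alongside accuracy measures.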

Binning Tools for Metagenomes

Binning is the process of grouping (binning) reads or contigs into individual genomes and assigning the groups to specific species, subspecies, or genera. Binning methods can be characterized in two different ways depending on the information used to group the sequences at hand: 1) Composition-based binning is based on the observation that individual genomes have a unique distribution of k-mer sequences (also denoted as genomic signatures). By making use of this conserved species-specific nucleotide composition, these methods are capable of grouping sequences into their respective genomes. 2) Similarity- or homology-based binning refers to the process of using alignment algorithms such as BLAST or profile hidden Markov models (pHMMs) to obtain similarity information about specific sequences/genes from publicly available databases (eg, NCBI's nonredundant database, nr, or PFAM). Thereafter, sequences are binned according to their assigned taxonomic information.

Available composition-based binning algorithms are included in tools such as TETRA,⁵¹ S-GSOM,^52,53 Phylopythia⁵⁴ and its successor PhylopythiaS,⁵⁵ TACOA,⁵⁶ PCAHIER,⁵⁷ ESOM,^58,59 and ClaMS,⁶⁰ while examples of purely similarity-based binning software include tools such as CARMA,⁶¹ MetaPhyler,⁶² and SOrt-ITEMS.⁶³ Some tools employ similarity-based binning algorithms in their metagenomics analysis pipelines. Examples of such tools are IMG/MER 4,⁶⁴ MG-RAST,^65,66 and MEGAN,^67–69 which are described in more detail below.

Certain binning tools employ a hybrid approach using both composition and similarity-based information to group sequences. Some examples of such tools are PhymmBL⁷⁰ and MetaCluster.^71,72 More innovative binning approaches include co-abundance gene segregation across a series of metagenomic samples, thus facilitating the assembly of microbial genomes without the need for reference sequences.⁷³ This new method promises to overcome the usual computational challenges of other binning tools and has been tested for a human gut microbiome.

Binning tools can further be characterized with respect to the type of algorithm they employ, namely 1) ab initio unsupervised classifiers and 2) supervised/training-based classifiers.⁶⁰ Unsupervised binning refers to the process of using pre-existing bins derived from genomic sequences to classify a given dataset without user supervision. In contrast, supervised binning allows user intervention and supervision in the training process per se. More particularly, the user may specify the type of sequences that will be used to train each bin and, furthermore, select sequences from known taxonomic lineages to use while training the classifier. Sophisticated algorithms such as support vector machines (PhylopythiaS), hidden Markov models (PhymmBL, TETRA), as well as self-organizing maps (ESOMs) have been used in binning algorithms. However, tools such as PhylopythiaS and TETRA allow little user intervention, while ClaMS and ESOM provide a more supervised training approach that can be fine-tuned to allow optimal classification for the specific dataset under consideration.

There are certain aspects that one must take into consideration when performing the binning of metagenomic sequences. Composition-based binning using genomic signatures has its drawbacks, especially when performed on short reads (eg, 150 bp). Given that there are 256 possible tetranucleotide combinations, a short read is unlikely to provide sufficient information to reliably assign a taxonomic rank to a specific bin. Therefore, it is common practice to perform composition-based binning on assembled datasets. This way, longer contigs can provide the required k-mer distribution information, which will allow effective binning and taxonomic assignment.³¹ Observation of a taxonomic marker sequence (eg, the 16S rRNA gene) within the bins can further facilitate reliable taxonomic assignment for the respective bin. Similarity-based binning also has its disadvantages. Although capable of binning reads of short length, it fails to do so accurately when the metagenome under consideration consists of numerous closely related species. This may cause assignment of closely related sequences to the same reference genome, perhaps at a higher taxonomic level (eg, order or class), thereby generating bins containing a mixture of genomes. Therefore, optimal binning results are expected to be attained when combining both composition- and similarity-based approaches as adopted by hybrid tools such as PhymmBL⁷⁰ and MetaCluster.^71,72
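The limitation of short reads for composition-based binning can be illustrated with a minimal sketch of a tetranucleotide frequency profile. The function name `tetranucleotide_profile` and the artificial repetitive read are hypothetical, and the sketch counts k-mers on one strand only; it is not the signature computation used by any specific binning tool named above.

```python
from collections import Counter
from itertools import product


def tetranucleotide_profile(sequence, k=4):
    """Return relative frequencies for all 4**k k-mers over A, C, G, T."""
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + k]
                     for i in range(len(sequence) - k + 1))
    # Enumerate every possible k-mer so that absent ones appear as 0.0.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in kmers)
    return {kmer: (counts[kmer] / total if total else 0.0)
            for kmer in kmers}


# Hypothetical 150 bp read: it contains only 150 - 4 + 1 = 147
# overlapping windows, fewer than the 256 possible tetranucleotides.
read = "ACGT" * 37 + "AC"
windows = len(read) - 4 + 1
profile = tetranucleotide_profile(read)
observed = sum(1 for freq in profile.values() if freq > 0)
print(len(profile), windows, observed)
```

Because a 150 bp read can populate at most 147 of the 256 entries, and usually far fewer, its profile is a sparse and noisy estimate of the genomic signature, which is consistent with the practice of binning assembled contigs instead.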

Annotation of Metagenomics Sequences

Annotation pipelines for metagenomes are specifically designed to work with mixtures of genomes and contigs of varying length. Initially, a series of preprocessing steps prepare the reads for annotation. These include 1) trimming of low-quality reads using platform-specific tools such as the FASTX-Toolkit⁷⁴ (additionally, FastQC⁶⁷ can provide summary statistics for FASTQ files; both have recently been integrated into the Galaxy platform,^75–77 and SolexaQA⁷⁸ and Lucy 2⁷⁹ are also used for FASTQ files; most of these tools make use of Phred or Q quality scores,^80,81 the thresholds of which depend on the sequencing technology); 2) masking of low-complexity reads using tools such as DUST⁸²; 3) a de-replication step that removes sequences that are more than 95% identical; and 4) a screening step performed by some tools (eg, MG-RAST), in which the pipeline provides the option of removing reads that are near-exact matches to the genomes of a handful of model organisms, including fly, mouse, cow, and human. This screening is done using mapping tools such as Bowtie 2.⁸³
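As an illustration of how Phred scores drive the trimming step, the sketch below decodes a FASTQ quality string and trims low-quality bases from the 3' end. It assumes the Phred+33 ASCII encoding and an illustrative threshold of Q20; the function names are hypothetical, and the logic is a simplification rather than the algorithm of the FASTX-Toolkit or any other tool mentioned above.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (Phred+33 assumed) into integers."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Phred definition: Q = -10 * log10(P), hence P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def trim_3prime(sequence, quality_string, min_q=20, offset=33):
    """Remove trailing bases whose quality is below min_q."""
    scores = phred_scores(quality_string, offset)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return sequence[:end], quality_string[:end]


# Hypothetical read: 'I' encodes Q40, '5' Q20, '#' Q2, and '!' Q0.
seq, qual = "ACGTACGT", "IIIII5#!"
trimmed = trim_3prime(seq, qual, min_q=20)
print(trimmed, error_probability(20))
```

A score of Q20 corresponds to an expected error rate of 1 in 100 base calls, which is why the appropriate threshold depends on the error profile of the sequencing platform.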

The next main stage of the annotation pipeline is the identification of genes within the reads/assembled contigs, a process often denoted as “gene calling”.⁶⁴ Genes are labeled as coding DNA sequences (CDSs) and noncoding RNA genes, and certain annotation pipelines (eg, IMG/MER) also predict regulatory elements such as clustered regularly interspaced short palindromic repeats (CRISPRs).

CDSs are identified using a number of tools including MetaGeneMark,⁸⁴ Metagene,⁸⁵ Prodigal,⁸⁶ Orphelia,⁸⁷ and FragGeneScan,⁸⁸ all of which utilize ab initio gene prediction algorithms. Often, annotation pipelines use an intersection of these tools to obtain a more informative prediction of the protein coding genes. Gene prediction tools utilize codon information (eg, the start codon AUG) to identify potential open reading frames and hence label sequences as coding or noncoding. Most tools can be trained by using the desired training sets. For example, FragGeneScan is trained for prokaryotic genomes only, and is used by IMG/MER and MG-RAST as well as EBI Metagenomics. It is believed to be one of the most accurate gene-prediction tools currently available. However, like most of these tools, it is expected to have an average prediction accuracy of ~65%-70%, resulting in multiple genes that are missed altogether.⁸⁸
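The role of codon information in locating open reading frames can be shown with a deliberately naive sketch. It scans only the forward strand, requires a complete ORF (ATG through a stop codon of the standard genetic code), and uses a hypothetical function name `find_orfs` and a toy sequence. The gene callers named above rely on trained statistical models, so this is not a stand-in for any of them.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def find_orfs(sequence, min_codons=3):
    """Return (start, end) positions of forward-strand ORFs.

    An ORF here runs from an ATG to the first in-frame stop codon;
    `end` is exclusive and includes the stop codon.
    """
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


# Hypothetical fragment containing one ORF in the third reading frame.
fragment = "CCATGAAATTTTAGGG"
orfs = find_orfs(fragment)
print(orfs, [fragment[s:e] for s, e in orfs])
```

Short metagenomic reads frequently contain only part of a gene, with no start or stop codon in view, which is one reason a simple scan of this kind misses genes that model-based predictors can still detect.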

CRISPR elements are identified by programs such as CRT⁸⁹ and PILER-CR.⁹⁰ IMG/MER uses a concatenation of results obtained from both these programs, retaining the longest element prediction in case of overlap.

Noncoding RNAs such as tRNAs are predicted using programs like tRNAscan.^91,92 Ribosomal RNA (rRNA) genes (5S, 16S, and 23S) are predicted using internally developed rRNA models in IMG/MER, whereas MG-RAST predicts rRNA genes by similarity searches against three known databases (SILVA,⁹³ Greengenes,⁹⁴ and the Ribosomal Database Project, RDP^95,96).

The next stage of the annotation pipeline involves functional assignment to the predicted protein coding genes. This is currently achieved by homology-based searches of query sequences against databases containing known functional and/or taxonomic information. Due to the large size of metagenomic datasets, this stage is often very expensive computationally and highly automated. BLAST or other sequence-similarity-based algorithms⁹⁷ are often run on high-performance computer clusters. Often, multithreading or other parallel programming approaches are used to divide jobs among multiple central/graphics processing units (CPUs/GPUs). This significantly reduces the wall-clock time needed to run the queries.

Some widely used data repositories to obtain annotation for metagenomic datasets include functional annotation databases such as KEGG,^98,99 SEED,¹⁰⁰ eggNOG,¹⁰¹ COG/KOG,¹⁰² as well as protein domain databases such as PFAM^103,104 and TIGRFAM.¹⁰⁵ Often, annotation pipelines make use of multiple databases or composite protein domain databases such as Interpro¹⁰⁶ (see EBI Metagenomics) in order to obtain a more collective, cumulative biological functional annotation.

IMG/MER utilizes HMMsearch (profile HMMs) to associate genes with PFAM, and genes are further annotated using COGs. Databases of position-specific scoring matrices (PSSMs) for COGs are downloaded from NCBI and are used to annotate protein sequences. Moreover, genes are labeled using KEGG-associated KO terms and EC numbers, and are assigned phylogeny using similarity searches. With a large set of genomes in its public repositories, IMG/MER can exploit its own resources, using them as reference nonredundant databases from which it obtains additional functional annotation.

MG-RAST utilizes many of the databases described above for annotation mapping as well as the NCBI taxonomy. The primary data product displayed to the user by MG-RAST is in the form of abundance profiles, and taxonomic information is projected against this data.

Both IMG/MER and MG-RAST are widely used data management repositories and comparative genomics environments. They are fully automated pipelines that provide quality control, gene prediction, and functional annotation. Both tools support user download of data products generated, as well as optional sharing and publishing within the respective portals. However, there are important differences between MG-RAST and IMG/MER that are relevant to the way MG-RAST calculates abundance profiles.

MG-RAST predicts all genes in the metagenome, and then identifies the best homologs of those genes in the isolate genomes using a tool called BLAT (BLAST-like alignment tool).¹⁰⁷ BLAT misses similarities below 70% identity, so many strong hits to other genes are missed. After the best hits to genes from an isolate genome are identified, all subsequent analysis is done using the genes of the isolate genomes, not the genes of the metagenome at hand. This imposes several limitations, because the analysis is not performed on the original genes of the metagenome but on their “proxy” genes in the isolate genomes instead. The advantage of this method is its speed; the only computationally intensive step is to find the best hits of the metagenomes against the isolates. Once this is done, all other comparisons already exist. The other major advantage is that the MG-RAST database does not grow in size, as is the case with the IMG/MER database.

IMG/MER also begins with prediction of all genes from the metagenome, but then runs all the computations on those genes rather than on their proxies. This allows the identification of PFAM hits (which is not supported in MG-RAST) and provides much more detailed functional information compared to COGs, which is the only protein families database used in MG-RAST. The major bottleneck for IMG/MER is the exponential growth of the gene number, which is not an issue for MG-RAST since the metagenome genes are not kept for analysis. It is, however, important to use PFAM for functional analysis: when comparing the number of genes from any metagenome that fall into COG or PFAM clusters, PFAM provides significantly higher coverage and therefore allows a much deeper analysis. Another major advantage of IMG/MER is that, since the tool keeps the original metagenome genes, it also keeps the original contigs, which provides synteny information. Therefore, it is far more suitable if one is interested in identifying novel biosynthetic gene clusters (BGCs) in the metagenomes, a type of analysis that may be less viable using MG-RAST. The prediction of BGCs from metagenomics data has recently been gaining a great deal of interest due to their potential in biotechnological applications. The possibility of engineering BGCs for the production of secondary metabolites with improved properties, known for their use in anticancer drugs and antibiotics, offers considerable potential for bioprospecting.

The EBI Metagenomics service¹⁰⁸ is a newly developed web-based portal that uses metadata structures and formats that comply with the Genomic Standards Consortium (GSC) guidelines. Moreover, EBI Metagenomics is adopting a novel data schema hosted by EMBL-EBI, known as the European Nucleotide Archive (ENA)¹⁰⁹ data schema, which aims to integrate data derived from sequencing technologies under a consensus, mutually accepted standard. EBI Metagenomics offers a dual shotgun and marker gene analysis service. It allows the extraction of rRNA data from shotgun metagenomic data using tools such as rRNASelector¹¹⁰ for concurrent marker gene analysis. It therefore supports additional 16S rRNA-based analysis tools such as QIIME¹¹¹ (see section on Marker Gene Metagenomics) for the efficient taxonomic assignment of these sequences. For functional analysis and annotation of CDS sequences, EBI Metagenomics uses FragGeneScan to obtain protein coding sequences and thereafter utilizes databases such as InterPro, a composite resource that integrates multiple databases of protein families and allows for protein domain prediction and functional assignment. EBI Metagenomics provides data archiving via ENA and provides unique accession numbers for submitted datasets. Archiving policies require the data to be made public; however, there is a 2-year period (from submission) during which the data is kept private pending user publication of analysis results.

CAMERA¹¹² is another online cloud computing service that provides hosted software tools and a high-performance computing infrastructure for the analysis of metagenomic data. One advantage of CAMERA is that it allows greater user intervention and flexibility during the analysis process. However, this means that users must have expertise, knowledge, and hands-on experience in metagenomic data analysis in order to ensure correct execution of the pipeline and accuracy of results. Moreover, in order to perform comparative metagenomics using CAMERA, the datasets at hand must first be processed through the CAMERA pipeline, which makes integration of data from different resources more computationally demanding. MEGAN 5⁶⁷ is yet another tool that performs analysis of metagenomic data and offers a wide range of visualization tools for metagenomic annotation results. It supports multiple visualization schemes including functional or taxonomic dendrograms, tag clouds, bar charts, and Krona taxonomic plots,¹¹³ which allow hierarchical data to be explored in the form of a zoomable pie chart.

Marker Gene Metagenomics

It is widely accepted that sequencing of the 16S rRNA gene reflects eubacterial evolution.¹¹⁴ Since the introduction of SSU rDNA-based molecular techniques,^115–117 the study of microbial diversity in natural environments has advanced significantly. In addition, pyrosequencing^24,25 of the 16S rRNA gene has been widely applied in the field of microbial ecology^118–120 and has resulted in a great number of sequences deposited in relevant databases, thus enhancing the value of 16S as the “gold standard” in microbial ecology. While the 16S rRNA gene fragment, containing one or more variable regions, is the preferred target marker gene for bacteria and archaea, this is not the case for fungi and eukaryotes where the preferred marker genes are the internal transcribed spacer (ITS) and 18S rRNA gene, respectively.

Taxonomic analysis for prokaryotes (ie, bacteria and archaea) is regularly performed using 16S data derived from varying sequencing technologies (eg, 454 pyrosequencing as well as Illumina, SOLiD, and Ion Torrent), and, for the purposes of this review, we list the relevant software that allows analysis for most sequencing technologies. Commonly used tools for 16S data analysis and denoising include QIIME,¹¹¹ Mothur,¹²¹ SILVAngs,⁹³ MEGAN,⁶⁷ and AmpliconNoise.¹²² Despite the vast availability of algorithms and software for analysis of 16S metagenomics datasets, QIIME seems to be established as the “gold standard”.¹²³

It is important to be aware of certain terminology required for the efficient analysis of 16S metagenomics data: 1) Amplicon: a DNA fragment that is amplified by PCR, eg, one or more 16S rRNA variable regions, or other marker genes. Most researchers will make use of standard PCR primers; 2) OTU (operational taxonomic unit): a working proxy for species distinction in microbiology, typically defined using rRNA gene sequences and a percentage similarity threshold for classifying microbes within the same, or different, OTUs; 3) Barcode: a short DNA sequence that is added to each read during amplification and that is specific for a given sample. This allows samples to be mixed (multiplexed) to reduce sequencing cost. During analysis, sequences need to be demultiplexed, ie, separated by sample.
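Demultiplexing, as defined above, can be sketched in a few lines. This is a simplified illustration rather than the procedure of any particular tool: it assumes exact barcode matches at the start of each read, and the barcode sequences, sample names, and the `demultiplex` helper are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Read = Tuple[str, str]  # (read identifier, nucleotide sequence)


def demultiplex(reads: List[Read], barcodes: Dict[str, str]) -> Dict[str, List[Read]]:
    """Assign each read to a sample by exact barcode match at the read start.

    The barcode is trimmed from assigned reads; reads without a known
    barcode are collected under the key "unassigned".
    """
    by_sample: Dict[str, List[Read]] = defaultdict(list)
    for read_id, seq in reads:
        for barcode, sample in barcodes.items():
            if seq.startswith(barcode):
                by_sample[sample].append((read_id, seq[len(barcode):]))
                break
        else:  # no barcode matched this read
            by_sample["unassigned"].append((read_id, seq))
    return dict(by_sample)


# Toy barcodes and reads (invented for illustration only)
barcodes = {"ACGT": "soil", "TGCA": "water"}
reads = [
    ("r1", "ACGTGGCCTTAA"),
    ("r2", "TGCAGGCCTTAG"),
    ("r3", "GGGGGGCCTTAA"),
]
samples = demultiplex(reads, barcodes)
```

Production tools typically also tolerate a limited number of barcode mismatches and remove primer sequences; both are omitted here for brevity.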

Analysis usually requires a reference database that is searched to find the closest match to an OTU, from which a taxonomic lineage is inferred. Some widely utilized databases include Greengenes⁹⁴ (16S), Ribosomal Database Project^95,96,124 (16S), Silva^93,125 (16S + 18S), and Unite¹²⁶ (ITS). These databases are less suitable for certain groups of organisms, such as protists and viruses, which are extremely diverse and for which considerably less sequence information is available compared to bacteria.

Denoising

Denoising is important for 16S metagenomic data analysis, and it is platform-specific; ie, certain platforms (eg, Illumina) require less denoising than others (eg, pyrosequencing). For example, denoising of 454 pyrosequencing data, despite being computationally expensive, is necessary due to intrinsic errors generated from pyrosequencing that can give rise to erroneous OTUs. A procedure called “flowgram clustering” removes problematic reads and increases the accuracy of the taxonomic analysis. Several denoising algorithms have been developed so far,^127–131 but for the purpose of this review three of them will be analyzed in detail.

Denoising is performed very efficiently by AmpliconNoise,¹²² a tool that uses the following basic denoising steps: 1) Filtering of noisy reads: reads are truncated based on the appearance of low signal intensities; 2) Removing pyrosequencing noise: a distance between flowgrams is defined, and the true sequences and their frequencies are inferred by an expectation-maximization (EM) algorithm; 3) Removing PCR noise: the same ideas are used for removing PCR errors; 4) Chimera identification and removal: for each sequence, exact pairwise alignments are performed to all sequences with equal or greater abundance, which form the set of possible parents. Although a considerable number of sequences are lost during the denoising process, it results in high-quality sequences¹³²; however, there has been some debate on the level of stringency required to achieve such high quality.¹³³
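The parent-selection rule in step 4 can be made concrete with a small sketch. It covers only the choice of candidate parents by abundance, not the alignment or chimera scoring itself, and it is not taken from AmpliconNoise; the sequence labels, counts, and the `candidate_parents` helper are hypothetical.

```python
from typing import Dict, List


def candidate_parents(abundance: Dict[str, int], query: str) -> List[str]:
    """Return every other sequence whose abundance is equal to or greater
    than that of `query`, ordered from most to least abundant."""
    threshold = abundance[query]
    candidates = [
        seq for seq, count in abundance.items()
        if seq != query and count >= threshold
    ]
    return sorted(candidates, key=lambda seq: -abundance[seq])


# Toy abundances of denoised sequences (invented for illustration only)
abundance = {"seqA": 120, "seqB": 80, "seqC": 5, "seqD": 5}
parents_of_c = candidate_parents(abundance, "seqC")
parents_of_a = candidate_parents(abundance, "seqA")
```

The rationale for the rule is that a chimera forms from templates already present in the reaction, so its parents are expected to be at least as abundant as the chimera itself; the most abundant sequence therefore has no candidate parents.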

A very popular software package for the analysis of microbial communities is QIIME. Initially, QIIME was implemented for use with 454 pyrosequencing datasets only, ie, using sff (Standard Flowgram Format) files, but it has since been modified to accept the fastq file format, thereby making the analysis of Illumina datasets possible. The QIIME developers provide users with extensive online tutorials for several workflows, and QIIME is available as an open-source software package implemented mostly in Python.

Another widely used software for the analysis of microbial communities is Mothur. It was created from the combination of pre-existing software, such as DOTUR,¹³⁴ SONS,¹³⁵ and Treeclimber,¹³⁶ but, due to the community support it has received, currently it incorporates many more algorithms, thus providing the user with a variety of choices.

More recently, a web-based application called SILVAngs⁹³ was developed, which provides a fully automated analysis pipeline for data derived from rRNA marker gene amplicon sequencing. The analysis workflow is based on 1) Alignment of reads, 2) Quality assessment and filtering of reads, 3) Dereplication, whereby identical sequences are filtered out to avoid overestimation, 4) Clustering and OTU picking using a priori defined thresholds, and 5) Taxonomic assignment of OTUs using the SILVA rDNA database.
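The dereplication step (step 3) is simple enough to sketch directly. The following is an illustrative stand-in rather than SILVAngs code: it collapses exactly identical sequences and keeps their counts, and the `dereplicate` helper and toy reads are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def dereplicate(seqs: List[str]) -> List[Tuple[str, int]]:
    """Collapse identical sequences into (sequence, count) pairs,
    most abundant first, with ties broken alphabetically."""
    counts = Counter(seqs)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Toy reads (invented for illustration only)
reads = ["ACGT", "ACGT", "ACGA", "ACGT", "ACGA", "TTTT"]
unique_reads = dereplicate(reads)
```

Keeping the counts alongside the unique sequences lets later steps, such as clustering, work on fewer sequences without losing abundance information.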

The choice of denoising algorithm largely depends on the user. Once a choice is made, the user should also consider whether to deviate from the default parameters. Parameter adjustment depends on the dataset produced, ie, which specific 16S rRNA region was sequenced and which technology was used to perform the actual sequencing. In addition, it has been suggested that the use of different denoising methods can produce significantly different outcomes,¹³⁷ which should be taken into careful consideration when comparing studies that have utilized different algorithms for data analysis.

OTU Clustering, Picking, and Taxonomic Assignment

After the demultiplexing of the dataset, ie, the assignment of reads to samples using barcode information, the next step is OTU picking. For bacteria/archaea, it is generally accepted that sequences sharing more than 97% similarity are grouped into the same OTU and taken to correspond to the same species, but other similarity cutoffs can also be employed if needed for the downstream analyses. There are numerous OTU picking strategies: 1) De novo is used if amplicons overlap and if a reference sequence collection is not available. It clusters all reads without using a reference and is quite expensive computationally, hence not very suitable for very large datasets. 2) Closed-reference is used if amplicons do not overlap and if a reference sequence collection is available. This approach discards reads that do not hit a reference sequence. 3) Open-reference is used if amplicons overlap and a reference dataset is available. This method clusters reads against a reference dataset, and reads that do not match the reference are subsequently clustered de novo. All the aforementioned are incorporated into QIIME. There are also other types of OTU clustering and picking strategies being developed^138–141; the most appropriate choice for the downstream analysis will depend on the type of data and the user.
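The three OTU picking strategies can be contrasted in a toy sketch. This is not QIIME's implementation: it assumes equal-length, pre-aligned sequences, uses the fraction of matching positions as a stand-in for alignment-based identity, and clusters greedily around the first sequence seen. All function names and sequences are hypothetical, and the 0.8 threshold in the example is chosen only so that very short toy sequences can cluster (the text's usual cutoff is 97%).

```python
from typing import Dict, List, Optional, Sequence, Tuple


def identity(a: str, b: str) -> float:
    """Fraction of matching positions (assumes pre-aligned, equal-length input)."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def closest(seq: str, centroids: Sequence[str], threshold: float) -> Optional[str]:
    """Return the most similar centroid at or above `threshold`, or None."""
    best, best_identity = None, 0.0
    for centroid in centroids:
        ident = identity(seq, centroid)
        if ident >= threshold and ident > best_identity:
            best, best_identity = centroid, ident
    return best


def closed_reference(reads: List[str], reference: List[str],
                     threshold: float = 0.97) -> Tuple[Dict[str, List[str]], List[str]]:
    """Cluster reads against reference sequences; return (OTUs, unmatched reads)."""
    otus: Dict[str, List[str]] = {ref: [] for ref in reference}
    unmatched: List[str] = []
    for read in reads:
        hit = closest(read, reference, threshold)
        if hit is None:
            unmatched.append(read)  # a closed-reference analysis discards these
        else:
            otus[hit].append(read)
    return otus, unmatched


def de_novo(reads: List[str], threshold: float = 0.97) -> Dict[str, List[str]]:
    """Greedy clustering: a read joins the closest centroid or founds a new OTU."""
    otus: Dict[str, List[str]] = {}
    for read in reads:
        hit = closest(read, list(otus), threshold)
        if hit is None:
            otus[read] = [read]
        else:
            otus[hit].append(read)
    return otus


def open_reference(reads: List[str], reference: List[str],
                   threshold: float = 0.97) -> Dict[str, List[str]]:
    """Closed-reference first, then de novo clustering of the unmatched reads."""
    otus, unmatched = closed_reference(reads, reference, threshold)
    otus.update(de_novo(unmatched, threshold))
    return otus


# Toy data (invented for illustration only)
reference = ["AAAAAAAAAA"]
reads = ["AAAAAAAAAT", "CCCCCCCCCC", "CCCCCCCCCG", "AAAAAAAAAA"]
closed, discarded = closed_reference(reads, reference, threshold=0.8)
novel = de_novo(reads, threshold=0.8)
opened = open_reference(reads, reference, threshold=0.8)
```

The example shows the trade-off described in the text: the closed-reference run loses the two reads that have no reference counterpart, whereas the open-reference run keeps them as a new OTU.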

Taxonomic assignment of OTUs can be performed using a variety of algorithms. Currently QIIME supports numerous algorithms, such as BLAST, the RDP classifier, RTAX, the Mothur classifier, and uclust, to search for the closest match to an OTU, from which a taxonomic lineage is inferred. This requires reference databases of marker genes. Some commonly utilized databases include Greengenes⁹⁴ (16S), Ribosomal Database Project^95,96,124 (16S), Silva^93,125 (16S + 18S), and Unite¹²⁶ (ITS).

Statistical Analysis and Visualization of Results

QIIME output includes a representation of a taxonomic tree in Newick format, which can be visualized in applications such as FigTree,¹⁴² and a file in Biom (Biological Observation Matrix) format¹⁴³ representing OTU tables. This file can be imported into MEGAN for visualization or into any other statistical software requiring matrix-type data. In addition, alpha-diversity analysis (diversity within a sample, eg, Phylogenetic Diversity (PD), Chao,¹⁴⁴ etc.) and beta-diversity analysis (diversity across samples, eg, UniFrac,¹⁴⁵ PCoA), as well as taxonomic composition and phylogenetic analyses, are supported through QIIME. Numerous other tools and software packages exist for performing statistical analysis of metagenomic data. The Primer-E package¹⁴⁶ is commonly utilized by microbial ecologists and allows for multiple multivariate statistical analyses, such as multidimensional scaling (MDS), analysis of similarities (ANOSIM), and hypothesis testing. Recently the R statistical programming language¹⁴⁷ has gained immense popularity and is currently widely used for multivariate statistics. Packages such as vegan,¹⁴⁸ phyloseq,¹⁴⁹ and Bioconductor¹⁵⁰ provide multiple built-in functions and libraries for performing the wide range of statistical analyses required for metagenomic datasets. While it is out of the scope of this review to thoroughly analyze visualization tools for genomic data, readers are encouraged to consult a recent review article.¹⁵¹
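To make the distinction between alpha and beta diversity concrete, the sketch below computes two within-sample measures and one between-sample measure from OTU count vectors. It is an illustration under stated assumptions, not QIIME output: the Chao1 estimator is written in its bias-corrected form, the Shannon index is added as a second common alpha-diversity measure, and Bray-Curtis dissimilarity is used as a simple stand-in for beta diversity because UniFrac additionally requires a phylogenetic tree. The count vectors are invented.

```python
import math
from typing import Sequence


def chao1(counts: Sequence[int]) -> float:
    """Bias-corrected Chao1 richness: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)),
    where F1 and F2 are the numbers of singleton and doubleton OTUs."""
    observed = sum(1 for c in counts if c > 0)
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))


def shannon(counts: Sequence[int]) -> float:
    """Shannon index H = -sum(p_i * ln p_i) over OTUs with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def bray_curtis(a: Sequence[int], b: Sequence[int]) -> float:
    """Bray-Curtis dissimilarity between two samples over the same OTUs:
    1 - 2 * (sum of shared minimum counts) / (total counts in both samples)."""
    shared = sum(min(x, y) for x, y in zip(a, b))
    return 1 - 2 * shared / (sum(a) + sum(b))


# Toy OTU count vectors; positions correspond to the same OTUs in both samples
sample_a = [10, 5, 1, 1, 2, 0]
sample_b = [0, 5, 3, 0, 2, 4]

richness_a = chao1(sample_a)
evenness_a = shannon(sample_a)
dissimilarity = bray_curtis(sample_a, sample_b)
```

A matrix of pairwise dissimilarities computed this way for all samples is the usual input to ordination methods such as PCoA or MDS mentioned above.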

Data Management, Storage, and Sharing

Tools such as IMG/MER, CAMERA, MG-RAST, and EBI Metagenomics (which also incorporates QIIME) provide an integrated environment for analysis, management, storage, and sharing of metagenome projects. This requires the design of a consensus, commonly accepted annotation scheme that allows for efficient data exchange, integration, sharing, and visualization between different platforms and further reduces the need for reprocessing of metagenomic datasets, a task that is computationally very expensive.

The GSC is currently investing heavily toward a widely accepted language that shares ontologies and nomenclatures thereby providing a common standard for exchange of data derived from the analysis of metagenomic projects. Toward this goal, MIMS (Minimum Information about a Metagenome Sequence) and MIMARKS (Minimum Information about a MARKer Sequence)¹⁵² have been devised, providing a scheme of standard languages for metadata annotation.

Conclusions

Tools and databases for metagenomic data analysis are becoming increasingly efficient and elaborate (for an overview of the tools most utilized nowadays for metagenomic data analysis, see Table 1). Technologies offering increased read length, such as PacBio, or new chemistry, such as Irys Technology and Nanopore Sequencing, are beginning to offer new capabilities to the analysis pipelines and aid many aspects of the assembly and the concurrent annotation process. Assembly tools such as IDBA-UD are being developed and increasingly improved to address the specific problem of assembling mixtures of genomes, which is inherent in metagenomic samples. Databases like GOLD,¹⁵³ associated with the IMG/MER portal, can be used as a reference in order to perform validation tests for assembly tools. Moreover, the use of simulated metagenomic datasets has been proposed in order to assess these tools.¹⁵⁴

Table 1

Tools grouped according to their main functionality.

Shotgun metagenomics	Assembly	EULER⁴¹
		Velvet⁴²
		SOAP⁴³
		ABySS⁴⁴
		MetaVelvet⁴⁶
		MetaVelvet-SL⁴⁵
		Meta-IDBA⁴⁷
		IDBA-UD⁴⁸
		Newbler (Roche)
		MIRA³⁷
		Mapsembler¹⁷¹
		ALLPATHS^172,173
		MetaORFA^174,175
		MetAMOS³⁸
	Binning	TETRA⁵¹
		S-GSOM⁵²
		PhylopythiaS^54,55
		TACOA⁵⁶
		PCAHIER⁵⁷
		ESOM⁵⁸
		ClaMS⁶⁰
		CARMA⁶¹
		WGSQuikr¹⁷⁶
		SPHINX¹⁷⁷
		MetaPhyler⁶²
		SOrt-ITEMS⁶³
		PhymmBL⁷⁰
		MetaCluster^71,72
	Annotation	FASTX-Toolkit⁷⁴
		FastQC⁶⁷
		SolexaQA⁷⁸
		Lucy 2⁷⁹
		DUST⁸²
		Bowtie⁸³
		MetaGeneMark⁸⁴
		LEfSe¹⁹
		TACOA⁵⁶
		Metagene⁸⁵
		CREST¹⁷⁸
		Prodigal⁸⁶
		mOTU-LG¹⁷⁹
		Orphelia⁸⁷
		Kraken¹⁸⁰
		FragGeneScan⁸⁸
		CRT⁸⁹
		NBC¹⁸¹
		MyTaxa¹⁸²
		RITA¹⁸³
		PILER-CR⁹⁰
		tRNAscan¹⁸⁴
		KEGG⁹⁹
		MetaCluster TA⁷¹
		SEED¹⁰⁰
		eggNOG¹⁰¹
		ProViDE¹⁸⁵
		COG/KOG¹⁸⁶
		PFAM^103,104,187
		TIGRFAM¹⁰⁵
		MetaPhlAn¹⁸⁸
		HighSSR¹⁸⁹
		Blat¹⁰⁷
	Analysis pipelines	IMG/MER^64,190
		MG-RAST⁶⁵
		MEGAN 5^67–69
		CAMERA¹¹²
		Parallel-META^74,191
		EBI Metagenomics¹⁰⁸
		METAREP¹⁹²
		PHACCS¹⁹³
Marker gene metagenomics	Standalone software	QIIME^111,194
		Mothur¹²¹
		JAguc¹⁹⁵
		M-pick¹⁹⁶
		OTUbase¹⁹⁷
		CopyRighter¹⁹⁸
		AbundantOTU¹⁹⁹
		UniFrac^145,200
		ESPRIT^141,201
	Analysis pipelines	SILVA¹²⁵
		FunFrame²⁰²
		PANGEA²⁰³
		FastGroupII²⁰⁴
		CLOTU²⁰⁵
	Denoising	AmpliconNoise¹²²
		DADA²⁸
		JATAC¹²⁷
		UCHIME²⁰⁶
		Bellerophon²⁰⁷
		CANGS^208,209
	Databases	SILVA¹²⁵
		Greengenes⁹⁴
		Ribosomal Database Project (RDP)²¹⁰
		Unite¹²⁶

There has been some controversy within the metagenomics community regarding the actual need for performing assembly on metagenomes. One contention is that using clustering algorithms such as cd-hit^155,156 or uclust⁹⁷ is sufficient to group similar reads together and thereafter proceed to annotation of these clusters without prior assembly. This clustering approach may allow for more accurate annotation of highly diverse samples containing rare, uncultured genomes that may otherwise be excluded from the assembly process due to their low coverage. One drawback of not performing an assembly may be that complex regulatory elements such as CRISPRs may not be identified successfully.³¹

Binning and annotation methods are also constantly being modified to specifically address metagenomic analysis pipelines. These processes will improve significantly as the number of genomes from cultured as well as uncultured organisms in public databases increases. Composition-based as well as similarity-based binning methods, especially those making use of supervised machine learning algorithms (eg, PhyloPythiaS, trained on reference genomes), will become increasingly accurate due to the availability of more reliable information.

At this stage it is important to mention that, in spite of the best efforts to reconstruct and prepare datasets by 1) quality filtering, 2) performing assemblies, and 3) binning sequences into taxonomically informative groups, annotation pipelines still achieve successful annotation for only ~50% of the sequences under analysis.^31,157 As mentioned above, the annotation process is highly dependent on the available databases and hence limited by the amount of information that is present within these repositories. Sequences that do not have any similarity with any other sequence existing in a known database are termed “orphan genes”.¹⁵⁸ These genes are believed to be 1) a consequence of sequencing errors and/or a reflection of the inaccuracy of gene prediction tools, or 2) truly novel genes that have no sequence or function similarity to known genes and may share higher order similarity in the form of protein folds.^31,158 A lot of work is currently being undertaken in order to shed some light on these unknowns/orphans using various types of information. Some existing tools use pathway information from metagenomic neighbors and also context-dependent metabolomic data to assign a functional annotation to unknown genes.^159,160 Along these lines, the use of metabolomic, metatranscriptomic, and/or metaproteomic data will provide a more elaborate view of the “picture”, addressing all aspects of the central dogma in the metagenomics era. Moreover, single-cell genomics, which derives information from sequencing individual cells, is becoming increasingly popular. The synergy of single-cell genomics with metagenomics can allow a more accurate separation of metagenomic sequences into individual genomes, guided by the single-cell sequencing data.

A wide array of software is currently available to perform each step of the marker gene metagenomics analysis pipeline. What is missing from the literature is a systematic evaluation of the software and algorithms that have been used so far and a standardized means of comparing results derived from different workflows. Variation in results can occur due to inconsistencies in a number of factors, such as DNA extraction,^161,162 primer pair and amplification region,^163–165 sequencing platform,¹⁶⁶ and the software used.¹⁶⁷ All of the aforementioned sources of variation make it very difficult to compare results and to obtain trustworthy ones. The computational and programming challenges of improving the available software can be met, but only through benchmarks, simulations,¹⁶⁸ and thorough testing. Initiatives such as the GSC could potentially take over the design of the “Minimum Analysis Requirements of Metagenome Sequences (MARMS)”. This would consist of standardized methodologies and a consensus on the choice of software, analysis steps, threshold values, and parameters. Such an initiative would eliminate, or at least minimize, the biases that can be generated by analyzing data using multiple methodologies.

The availability of data analysis services such as EBI Metagenomics, IMG/MER, MG-RAST, and SILVAngs will further allow users with limited computational facilities to perform analysis of metagenomic samples. In comparative metagenomic analyses, one can use tools to compare samples from different ecological niches and extract information that is common and/or unique to a specific environment.^8,169,170 Moreover, the GSC is striving toward the successful integration of analyzed data under a unified and mutually acceptable structure/format that will facilitate the exchange of valuable insights and information in the field of microbial ecology and environmental microbiology.

To sum up, we have created a metagenomics flowchart (Fig. 1) outlining all the aforementioned basic steps of the analysis pipeline. Analysis can take two different routes depending on the type of sequencing data (marker gene or shotgun metagenomics). Every analysis step shown in the flowchart is complemented by a list of some well-established tools used by the metagenomics community.

Figure 1

Flowchart of basic metagenomics steps and tools currently in practice.

Author Contributions

AO, GAP, II conceived the idea of the manuscript. AO, CP wrote the first draft of the manuscript. All other authors (GAP, II, NP, PP, GK, CA) made critical revisions and approved the final version of the manuscript.

References

1. Riesenfeld C.S., Schloss P.D., Handelsman. Metagenomics: genomic analysis of microbial communities. Annu Rev Genet. 2004; 38: 525–52.

2. Xia L.C., Cram J.A., Chen, Fuhrman J.A., Sun. Accurate genome relative abundance estimation based on shotgun metagenomic reads. PLoS One. 2011; 6(12): e27992.

3. Handelsman. Metagenetics: spending our inheritance on the future. Microb Biotechnol. 2009; 2(2): 138–9.

4. Tringe S.G., von Mering, Kobayashi. Comparative metagenomics of microbial communities. Science. 2005; 308(5721): 554–7.

5. Tringe S.S.G., Hugenholtz. A renaissance for the pioneering 16S rRNA gene. Curr Opin Microbiol. 2008; 11(5): 442–6.

6. Benson C.A., Bizzoco R.W., Lipson D.A., Kelley S.T. Microbial diversity in nonsulfur, sulfur and iron geothermal steam vents. FEMS Microbiol Ecol. 2011; 76(1): 74–88.

7. Urich, Lanzén, Stokke. Microbial community structure and functioning in marine sediments associated with diffuse hydrothermal venting assessed by integrated meta-omics. Environ Microbiol. 2014; 16(9): 2699–710.

8. Xie, Wang, Guo. Comparative metagenomics of microbial communities inhabiting deep-sea hydrothermal vent chimneys with contrasting chemistries. ISME J. 2011; 5(3): 414–26.

9. Kilias S.P., Nomikou, Papanikolaou. New insights into hydrothermal vent processes in the unique shallow-submarine arc-volcano, Kolumbo (Santorini), Greece. Sci Rep. 2013; 3: 2421.

10. Bradford M.A., Davies C.A., Frey S.D. Thermal adaptation of soil microbial respiration to elevated temperature. Ecol Lett. 2008; 11(12): 1316–27.

11. Pearce D.A., Newsham K.K., Thorne M.A. Metagenomic analysis of a southern maritime antarctic soil. Front Microbiol. 2012; 3: 403.

12. Xiong, Liu, Lin. Geographic distance and pH drive bacterial distribution in alkaline lake sediments across Tibetan Plateau. Environ Microbiol. 2012; 14(9): 2457–66.

13. García-Moyano, González-Toril, Aguilera, Amils, Aguilera. Comparative microbial ecology study of the sediments and the water column of Río Tinto, an extreme acidic environment. FEMS Microbiol Ecol. 2012; 81(2): 303–14.

14. Johnson D.B. Geomicrobiology of extremely acidic subsurface environments. FEMS Microbiol Ecol. 2012; 81(1): 2–12.

15. Bryant J.A., Stewart F.J., Eppley J.M., DeLong E.F. Microbial community phylogenetic and trait diversity declines with depth in a marine oxygen minimum zone. Ecology. 2012; 93(7): 1659–73.

16. Stevens, Ulloa. Bacterial diversity in the oxygen minimum zone of the eastern tropical South Pacific. Environ Microbiol. 2008; 10(5): 1244–59.

17. Chodak, Gołębiewski, Morawska-Ploskonka, Kuduk, Niklińska. Diversity of microorganisms from forest soils differently polluted with heavy metals. Appl Soil Ecol. 2013; 64: 7–14.

18. Gołębiewski, Deja-Sikora, Cichosz, Tretyn, Wróbel. 16S rDNA pyrosequencing analysis of bacterial community in heavy metals polluted soils. Microb Ecol. 2014; 67(3): 635–47.

19. Segata, Izard, Waldron. Metagenomic biomarker discovery and explanation. Genome Biol. 2011; 12(6): R60.

20. Tyson G.W., Chapman, Hugenholtz. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature. 2004; 428(6978): 37–43.

21. Breitbart, Hewson, Felts. Metagenomic analyses of an uncultured viral community from human feces. J Bacteriol. 2003; 185(20): 6220–3.

22. Venter J.C., Remington, Heidelberg J.F. Environmental genome shotgun sequencing of the Sargasso Sea. Science. 2004; 304(5667): 66–74.

23. Mardis E.R. Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet. 2008; 9: 387–402.

24. Ronaghi. Pyrosequencing sheds light on DNA sequencing. Genome Res. 2001; 11(1): 3–11.

25. Ronaghi, Uhlén, Nyrén. A sequencing method based on real-time pyrophosphate. Science. 1998; 281(5375): 363–5.

26. Sogin M.L., Morrison H.G., Huber J.A. Microbial diversity in the deep sea and the underexplored “rare biosphere”. Proc Natl Acad Sci USA. 2006; 103(32): 12115–20.

27. Brown S.P., Veach A.M., Rigdon-Huss A.R., Grond, Lothamer K.L., Lickteig S.K. Scraping the bottom of the barrel: are rare high throughput sequences artifacts? Fungal Ecol. 2014; 13: 6–10.

28. Rosen M.J., Callahan B.J., Fisher D.S., Holmes S.P. Denoising PCR-amplified metagenome data. BMC Bioinformatics. 2012; 13(1): 283.

29. Brodin, Mild, Hedskog. PCR-induced transitions are the major source of error in cleaned ultra-deep pyrosequencing data. PLoS One. 2013; 8(7): e70388.

30. Rothberg J.M., Leamon J.H. The development and impact of 454 sequencing. Nat Biotechnol. 2008; 26(10): 1117–24.

31. Thomas, Gilbert, Meyer. Metagenomics - a guide from sampling to data analysis. Microb Inform Exp. 2012; 2(1): 3.

32. Wommack K.E., Bhavsar, Ravel. Metagenomics: read length matters. Appl Environ Microbiol. 2008; 74(5): 1453–63.

33. Bentley D.R., Balasubramanian, Swerdlow H.P. Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008; 456(7218): 53–9.

34. Kircher, Sawyer, Meyer. Double indexing overcomes inaccuracies in multiplex sequencing on the Illumina platform. Nucleic Acids Res. 2012; 40(1): e3.

35. Werner J.J., Zhou, Caporaso J.G., Knight, Angenent L.T. Comparison of Illumina paired-end and single-direction sequencing for microbial 16S rRNA gene amplicon surveys. ISME J. 2011; 6(7): 1273–6.

36. Metzker M.L. Sequencing technologies - the next generation. Nat Rev Genet. 2010; 11(1): 31–46.

37. Chevreux, Pfisterer, Drescher. Using the miraEST assembler for reliable and automated mRNA transcript assembly and SNP detection in sequenced ESTs. Genome Res. 2004; 14(6): 1147–59.

38. Treangen T.J., Koren, Sommer D.D. MetAMOS: a modular and open source metagenomic assembly and analysis pipeline. Genome Biol. 2013; 14(1): R2.

39. Rutherford, Parkhill, Crook. Artemis: sequence visualization and annotation. Bioinformatics. 2000; 16(10): 944–5.

40. Paszkiewicz, Studholme D.J. De novo assembly of short sequence reads. Brief Bioinform. 2010; 11(5): 457–72.

41. Pevzner P.A., Tang, Waterman M.S. An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci USA. 2001; 98(17): 9748–53.

42. Zerbino D.R., Birney. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008; 18(5): 821–9.

43. , Li, Kristiansen, Wang. SOAP: short oligonucleotide alignment program. Bioinformatics. 2008; 24(5): 713–4.

44. Simpson J.T., Wong, Jackman S.D., Schein J.E., Jones S.J., Birol. ABySS: a parallel assembler for short read sequence data. Genome Res. 2009; 19(6): 1117–23.

45. Afiahayati, Sato K., Sakakibara. MetaVelvet-SL: an extension of the Velvet assembler to a de novo metagenomic assembler utilizing supervised learning. DNA Res. 2014; 22(1): 69–77.

46. Namiki, Hachiya, Tanaka, Sakakibara. MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Res. 2012; 40(20): e155.

47. Peng, Leung H.C., Yiu S.M., Chin F.Y. Meta-IDBA: a de Novo assembler for metagenomic data. Bioinformatics. 2011; 27(13): i94–101.

48. Peng, Leung H.C., Yiu S.M., Chin F.Y. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics. 2012; 28(11): 1420–8.

49. Earl, Bradnam, St John. Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Res. 2011; 21(12): 2224–41.

50. Miller J.R., Koren, Sutton. Assembly algorithms for next-generation sequencing data. Genomics. 2010; 95(6): 315–27.

51. Teeling, Waldmann, Lombardot, Bauer, Glockner F.O. TETRA: a web-service and a stand-alone program for the analysis and comparison of tetra-nucleotide usage patterns in DNA sequences. BMC Bioinformatics. 2004; 5: 163.

52. Chan C.K., Hsu A.L., Halgamuge S.K., Tang S.L. Binning sequences using very sparse labels within a metagenome. BMC Bioinformatics. 2008; 9: 215.

53. Chan C.K., Hsu A.L., Tang S.L., Halgamuge S.K. Using growing self-organising maps to improve the binning process in environmental whole-genome shotgun sequencing. J Biomed Biotechnol. 2008; 2008: 513701.

54. McHardy A.C., Martin H.G., Tsirigos, Hugenholtz, Rigoutsos. Accurate phylogenetic classification of variable-length DNA fragments. Nat Methods. 2007; 4(1): 63–72.

55. Patil K.R., Roune, McHardy A.C. The PhyloPythiaS web server for taxonomic assignment of metagenome sequences. PLoS One. 2012; 7(6): e38581.

56. Diaz N.N., Krause, Goesmann, Niehaus, Nattkemper T.W. TACOA: taxonomic classification of environmental genomic fragments using a kernelized nearest neighbor approach. BMC Bioinformatics. 2009; 10: 56.

57. Zheng, Wu. Short prokaryotic DNA fragment binning using a hierarchical classifier based on linear discriminant analysis and principal component analysis. J Bioinform Comput Biol. 2010; 8(6): 995–1011.

58. Ultsch, Moerchen. ESOM-Maps: tools for clustering, visualization, and classification with Emergent SOM. 2005. Technical Report, Department of Mathematics and Computer Science, University of Marburg.

59. Dick G.J., Andersson A.F., Baker B.J. Community-wide analysis of microbial genome sequence signatures. Genome Biol. 2009; 10(8): R85.

60. Pati, Heath L.S., Kyrpides N.C., Ivanova. ClaMS: a classifier for metagenomic sequences. Stand Genomic Sci. 2011; 5(2): 248–53.

61. Krause, Diaz N.N., Goesmann. Phylogenetic classification of short environmental DNA fragments. Nucleic Acids Res. 2008; 36(7): 2230–9.

62. Liu, Gibbons, Ghodsi, Treangen, Pop. Accurate and fast estimation of taxonomic profiles from metagenomic shotgun sequences. BMC Genomics. 2011; 12(suppl 2): S4.

63. Monzoorul Haque, Ghosh T.S., Komanduri, Mande S.S. SOrt-ITEMS: sequence orthology based approach for improved taxonomic estimation of metagenomic sequences. Bioinformatics. 2009; 25(14): 1722–30.

64. Markowitz V.M., Chen I.M., Palaniappan. IMG 4 version of the integrated microbial genomes comparative analysis system. Nucleic Acids Res. 2014; 42(Database issue): D560–7.

65. Glass E.M., Wilkening, Wilke, Antonopoulos, Meyer. Using the metagenomics RAST server (MG-RAST) for analyzing shotgun metagenomes. Cold Spring Harb Protoc. 2010; 1: pdb.prot5368.

66. Meyer, Paarmann, D'Souza. The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics. 2008; 9: 386.

67. Huson D.H., Auch A.F., Qi, Schuster S.C. MEGAN analysis of metagenomic data. Genome Res. 2007; 17(3): 377–86.

68. Huson D.H., Mitra. Introduction to the analysis of environmental sequences: metagenomics with MEGAN. Methods Mol Biol. 2012; 856: 415–29.

69. Huson D.H., Weber. Microbial community analysis using MEGAN. Methods Enzymol. 2013; 531: 465–85.

70. Brady, Salzberg S.L. Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models. Nat Methods. 2009; 6(9): 673–6.

71.

Wang

, Leung

, Yiu

, Chin

MetaCluster-TA: taxonomic annotation for metagenomic data based on assembly-assisted binning.

BMC Genomics. 2014; 15(suppl 1): S12.

72.

Wang

, Leung

H.C.

, Yiu

S.M.

, Chin

F.Y.

MetaCluster 5.0: a two-round binning approach for metagenomic data for low-abundance species in a noisy sample.

Bioinformatics. 2012; 28(18): i356–62.

73.

Nielsen

H.B.

, Almeida

, Juncker

A.S.

; MetaHIT Consortium. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nat Biotechnol. 2014; 32(8): 822–8.

74.

Su

, Xu

, Ning

Parallel-META: efficient metagenomic data analysis based on high-performance computation.

BMC Syst Biol. 2012; 6(suppl 1): S16.

75.

Blankenberg

, Von Kuster

, Coraor

. Galaxy: a web-based genome analysis tool for experimentalists. Curr Protoc Mol Biol. 2010; Chapter 19: 1–21.

76.

Giardine

, Riemer

, Hardison

R.C.

. Galaxy: a platform for interactive large-scale genome analysis. Genome Res. 2005; 15(10): 1451–5.

77.

Goecks

, Nekrutenko

, Taylor

, Galaxy Team

Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences.

Genome Biol. 2010; 11(8): R86.

78.

Cox

M.P.

, Peterson

D.A.

, Biggs

P.J.

SolexaQA: at-a-glance quality assessment of Illumina second-generation sequencing data.

BMC Bioinformatics. 2010; 11: 485.

79.

Li

, Chou

H.H.

LUCY2: an interactive DNA sequence quality trimming and vector removal tool.

Bioinformatics. 2004; 20(16): 2865–6.

80.

Ewing

, Green

Base-calling of automated sequencer traces using phred. II. Error probabilities.

Genome Res. 1998; 8(3): 186–94.

81.

Ewing

, Hillier

, Wendl

M.C.

, Green

Base-calling of automated sequencer traces using phred. I. Accuracy assessment.

Genome Res. 1998; 8(3): 175–85.

82.

Morgulis

, Gertz

E.M.

, Schaffer

A.A.

, Agarwala

A fast and symmetric DUST implementation to mask low-complexity DNA sequences.

J Comput Biol. 2006; 13(5): 1028–40.

83.

Langmead

, Salzberg

S.L.

Fast gapped-read alignment with Bowtie 2.

Nat Methods. 2012; 9(4): 357–9.

84.

Zhu

, Lomsadze

, Borodovsky

Ab initio gene identification in metagenomic sequences.

Nucleic Acids Res. 2010; 38(12): e132.

85.

Noguchi

, Park

, Takagi

MetaGene: prokaryotic gene finding from environmental genome shotgun sequences.

Nucleic Acids Res. 2006; 34(19): 5623–30.

86.

Hyatt

, Chen

G.L.

, Locascio

P.F.

, Land

M.L.

, Larimer

F.W.

, Hauser

L.J.

Prodigal: prokaryotic gene recognition and translation initiation site identification.

BMC Bioinformatics. 2010; 11: 119.

87.

Hoff

K.J.

, Lingner

, Meinicke

, Tech

Orphelia: predicting genes in metagenomic sequencing reads.

Nucleic Acids Res. 2009; 37(Web Server issue): W101–5.

88.

Rho

, Tang

, Ye

FragGeneScan: predicting genes in short and error-prone reads.

Nucleic Acids Res. 2010; 38(20): e191.

89.

Bland

, Ramsey

T.L.

, Sabree

. CRISPR recognition tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats. BMC Bioinformatics. 2007; 8: 209.

90.

Edgar

R.C.

PILER-CR: fast and accurate identification of CRISPR repeats.

BMC Bioinformatics. 2007; 8: 18.

91.

Lowe

T.M.

, Eddy

S.R.

tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence.

Nucleic Acids Res. 1997; 25(5): 955–64.

92.

Schattner

, Brooks

A.N.

, Lowe

T.M.

The tRNAscan-SE, snoscan and snoGPS web servers for the detection of tRNAs and snoRNAs.

Nucleic Acids Res. 2005; 33(Web Server issue): W686–9.

93.

Quast

, Pruesse

, Yilmaz

. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res. 2013; 41(Database issue): D590–6.

94.

DeSantis

T.Z.

, Hugenholtz

, Larsen

. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl Environ Microbiol. 2006; 72(7): 5069–72.

95.

Cole

J.R.

, Chai

, Farris

R.J.

. The ribosomal database project (RDP-II): introducing myRDP space and quality controlled public data. Nucleic Acids Res. 2007; 35(Database issue): D169–72.

96.

Maidak

B.L.

, Olsen

G.J.

, Larsen

, Overbeek

, McCaughey

M.J.

, Woese

C.R.

The ribosomal database project (RDP).

Nucleic Acids Res. 1996; 24(1): 82–5.

97.

Edgar

R.C.

Search and clustering orders of magnitude faster than BLAST.

Bioinformatics. 2010; 26(19): 2460–1.

98.

Du

, Yuan

, Ma

, Song

, Xie

, Chen

KEGG-PATH: Kyoto encyclopedia of genes and genomes-based pathway analysis using a path analysis model.

Mol Biosyst. 2014; 10(9): 2441–7.

99.

Ogata

, Goto

, Sato

, Fujibuchi

, Bono

, Kanehisa

KEGG: Kyoto encyclopedia of genes and genomes.

Nucleic Acids Res. 1999; 27(1): 29–34.

100.

Overbeek

, Begley

, Butler

R.M.

. The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res. 2005; 33(17): 5691–702.

101.

Powell

, Forslund

, Szklarczyk

. eggNOG v4.0: nested orthology inference across 3686 organisms. Nucleic Acids Res. 2014; 42(Database issue): D231–9.

102.

Tatusov

R.L.

, Galperin

M.Y.

, Natale

D.A.

, Koonin

E.V.

The COG database: a tool for genome-scale analysis of protein functions and evolution.

Nucleic Acids Res. 2000; 28(1): 33–6.

103.

Bateman

, Birney

, Durbin

, Eddy

S.R.

, Howe

K.L.

, Sonnhammer

E.L.

The Pfam protein families database.

Nucleic Acids Res. 2000; 28(1): 263–6.

104.

Finn

R.D.

, Bateman

, Clements

. Pfam: the protein families database. Nucleic Acids Res. 2014; 42(Database issue): D222–30.

105.

Haft

D.H.

, Selengut

J.D.

, White

The TIGRFAMs database of protein families.

Nucleic Acids Res. 2003; 31(1): 371–3.

106.

Hunter

, Apweiler

, Attwood

T.K.

. InterPro: the integrative protein signature database. Nucleic Acids Res. 2009; 37(Database issue): D211–5.

107.

Kent

W.J.

BLAT — the BLAST-like alignment tool.

Genome Res. 2002; 12(4): 656–64.

108.

Hunter

, Corbett

, Denise

. EBI metagenomics — a new resource for the analysis and archiving of metagenomic data. Nucleic Acids Res. 2014; 42(Database issue): D600–6.

109.

Leinonen

, Akhtar

, Birney

. The European nucleotide archive. Nucleic Acids Res. 2011; 39(Database issue): D28–31.

110.

Lee

J.H.

, Yi

, Chun

rRNASelector: a computer program for selecting ribosomal RNA encoding sequences from metagenomic and metatranscriptomic shotgun libraries.

J Microbiol. 2011; 49(4): 689–91.

111.

Caporaso

J.G.

, Kuczynski

, Stombaugh

. QIIME allows analysis of high-throughput community sequencing data. Nat Methods. 2010; 7(5): 335–6.

112.

Seshadri

, Kravitz

S.A.

, Smarr

, Gilna

, Frazier

CAMERA: a community resource for metagenomics.

PLoS Biol. 2007; 5(3): e75.

113.

Ondov

B.D.

, Bergman

N.H.

, Phillippy

A.M.

Interactive metagenomic visualization in a Web browser.

BMC Bioinformatics. 2011; 12: 385.

114.

Woese

C.R.

Bacterial evolution.

Microbiol Rev. 1987; 51(2): 221–71.

115.

Amann

R.I.

, Ludwig

, Schleifer

K.H.

Phylogenetic identification and in situ detection of individual microbial cells without cultivation.

Microbiol Rev. 1995; 59(1): 143–69.

116.

Muyzer

DGGE/TGGE a method for identifying genes from natural ecosystems.

Curr Opin Microbiol. 1999; 2(3): 317–22.

117.

Rusch

D.B.

, Halpern

A.L.

, Sutton

. The Sorcerer II global ocean sampling expedition: northwest Atlantic through eastern tropical Pacific. PLoS Biol. 2007; 5(3): e77.

118.

Jones

R.T.

, Robeson

M.S.

, Lauber

C.L.

, Hamady

, Knight

, Fierer

A comprehensive survey of soil acidobacterial diversity using pyrosequencing and clone library analyses.

ISME J. 2009; 3(4): 442–53.

119.

Luna

R.A.

, Fasciano

L.R.

, Jones

S.C.

, Boyanton

Jr, Ton

T.T.

, Versalovic

DNA pyrosequencing-based bacterial pathogen identification in a pediatric hospital setting.

J Clin Microbiol. 2007; 45(9): 2985–92.

120.

Thompson

F.L.

, Bruce

, Gonzalez

. Coastal bacterioplankton community diversity along a latitudinal gradient in Latin America by means of V6 tag pyrosequencing. Arch Microbiol. 2011; 193(2): 105–14.

121.

Schloss

P.D.

, Westcott

S.L.

, Ryabin

. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol. 2009; 75(23): 7537–41.

122.

Quince

, Lanzen

, Davenport

R.J.

, Turnbaugh

P.J.

Removing noise from pyrosequenced amplicons.

BMC Bioinformatics. 2011; 12: 38.

123.

Nilakanta

, Drews

K.L.

, Firrell

, Foulkes

M.A.

, Jablonski

K.A.

A review of software for analyzing molecular sequences.

BMC Res Notes. 2014; 7(1): 830.

124.

Cole

J.R.

, Wang

, Cardenas

. The ribosomal database project: improved alignments and new tools for rRNA analysis. Nucleic Acids Res. 2009; 37(Database issue): D141–5.

125.

Pruesse

, Quast

, Knittel

. SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res. 2007; 35(21): 7188–96.

126.

Kõljalg

, Nilsson

R.H.

, Abarenkov

. Towards a unified paradigm for sequence-based identification of fungi. Mol Ecol. 2013; 22(21): 5271–7.

127.

Balzer

, Malde

, Grohme

M.A.

, Jonassen

Filtering duplicate reads from 454 pyrosequencing data.

Bioinformatics. 2013; 29(7): 830–6.

128.

Bragg

, Stone

, Imelfort

, Hugenholtz

, Tyson

G.W.

Fast, accurate error-correction of amplicon pyrosequences using Acacia.

Nat Methods. 2012; 9(5): 425–6.

129.

Iyer

, Bouzek

, Deng

, Larsen

, Casey

, Mullins

J.I.

Quality score based identification and correction of pyrosequencing errors.

PLoS One. 2013; 8(9): e73015.

130.

Keegan

K.P.

, Trimble

W.L.

, Wilkening

. A platform-independent method for detecting errors in metagenomic sequencing data: DRISEE. PLoS Comput Biol. 2012; 8(6): e1002541.

131.

Reeder

, Knight

Rapidly denoising pyrosequencing amplicon reads by exploiting rank-abundance distributions.

Nat Methods. 2010; 7(9): 668–9.

132.

Gaspar

J.M.

, Thomas

The consequences of denoising marker-based metagenomic data.

BMC Proc. 2012; 6(suppl 6): 11.

133.

Bakker

M.G.

, Tu

Z.J.

, Bradeen

J.M.

, Kinkel

L.L.

Implications of pyrosequencing error correction for biological data interpretation.

PLoS One. 2012; 7(8): e44357.

134.

Schloss

P.D.

, Handelsman

Introducing DOTUR, a computer program for defining operational taxonomic units and estimating species richness.

Appl Environ Microbiol. 2005; 71(3): 1501–6.

135.

Schloss

P.D.

, Handelsman

Introducing SONS, a tool for operational taxonomic unit-based comparisons of microbial community memberships and structures.

Appl Environ Microbiol. 2006; 72(10): 6773–9.

136.

Schloss

, Handelsman

Introducing TreeClimber, a test to compare microbial community structures.

Appl Environ Microbiol. 2006; 72(4): 2379–2379.

137.

Koskinen

, Auvinen

, Bjorkroth

K.J.

, Hultman

Inconsistent denoising and clustering algorithms for amplicon sequence data.

J Comput Biol. 2014. DOI: 10.1089/cmb.2014.0268. [Ahead of print]. Accessed at http://online.liebertpub.com/doi/abs/10.1089/cmb.2014.0268

138.

Hwang

, Oh

, Kim

T.K.

. CLUSTOM: a novel method for clustering 16S rRNA next generation sequences by overlap minimization. PLoS One. 2013; 8(5): e62623.

139.

Patin

N.V.

, Kunin

, Lidström

, Ashby

M.N.

Effects of OTU clustering and PCR artifacts on microbial diversity estimates.

Microb Ecol. 2013; 65(3): 709–19.

140.

Preheim

S.P.

, Perrotta

A.R.

, Martin-Platero

A.M.

, Gupta

, Alm

E.J.

Distribution-based clustering: using ecology to refine the operational taxonomic unit.

Appl Environ Microbiol. 2013; 79(21): 6593–603.

141.

Sun

, Cai

, Liu

. ESPRIT: estimating species richness using large collections of 16S rRNA pyrosequences. Nucleic Acids Res. 2009; 37(10): e76.

142.

FigTree. Available at: http://tree.bio.ed.ac.uk/software/figtree/.

143.

McDonald

, Clemente

J.C.

, Kuczynski

. The biological observation matrix (BIOM) format or: how I learned to stop worrying and love the ome-ome. Giga Sci. 2012; 1(1): 7.

144.

Chao

Nonparametric estimation of the number of classes in a population.

Scand J Stat. 1984; 11: 265–70.

145.

Lozupone

, Hamady

, Knight

UniFrac - an online tool for comparing microbial community diversity in a phylogenetic context.

BMC Bioinformatics. 2006; 7: 371.

146.

Clarke

K.R.

, Gorley

R.N.

PRIMER v6: User Manual/Tutorial. Plymouth: PRIMER-E; 2006.

147.

R Development Core Team

R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2008: I.

148.

Oksanen

, Kindt

, Legendre

. The vegan package. 2008; 10(01): 2008.

149.

McMurdie

P.J.

, Holmes

phyloseq: an R package for reproducible interactive analysis and graphics of microbiome census data.

PLoS One. 2013; 8(4): e61217.

150.

Gentleman

R.C.

, Carey

V.J.

, Bates

D.M.

. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004; 5(10): R80.

151.

Pavlopoulos

G.A.

, Oulas

, Iacucci

. Unraveling genomic variation from next generation sequencing data. BioData Min. 2013; 6(1): 13.

152.

Yilmaz

, Kottmann

, Field

. Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications. Nat Biotechnol. 2011; 29(5): 415–20.

153.

Reddy

T.B.

, Thomas

A.D.

, Stamatis

. The Genomes OnLine Database (GOLD) v.5: a metadata management system based on a four level (meta)genome project classification. Nucleic Acids Res. 2014; 43(Database issue): D1099–106.

154.

Mavromatis

, Ivanova

, Barry

. Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nat Methods. 2007; 4(6): 495–500.

155.

Fu

, Niu

, Zhu

, Wu

, Li

CD-HIT: accelerated for clustering the next-generation sequencing data.

Bioinformatics. 2012; 28(23): 3150–2.

156.

Li

, Godzik

Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences.

Bioinformatics. 2006; 22(13): 1658–9.

157.

Gilbert

J.A.

, Field

, Swift

. The taxonomic and functional diversity of microbes at a temperate coastal site: a ‘multi-omic’ study of seasonal and diel temporal variation. PLoS One. 2010; 5(11): e15545.

158.

Lespinet

, Labedan

Orphan enzymes?

Science. 2005; 307(5706): 42.

159.

Yamada

, Waller

A.S.

, Raes

. Prediction and identification of sequences coding for orphan enzymes using genomic and metagenomic neighbours. Mol Syst Biol. 2012; 8: 581.

160.

Smith

A.A.

, Belda

, Viari

, Medigue

, Vallenet

The CanOE strategy: integrating genomic and metabolic contexts across multiple prokaryote genomes to find candidate genes for orphan enzymes.

PLoS Comput Biol. 2012; 8(5): e1002540.

161.

Cruaud

, Vigneron

, Lucchetti-Miganeh

, Ciron

P.E.

, Godfroy

, Cambon-Bonavita

M-A

. Influence of DNA extraction methods, 16S rRNA targeted hypervariable regions and sample origins on the microbial diversity detected by 454 pyrosequencing in marine chemosynthetic ecosystems. Appl Environ Microbiol. 2014; 80(15): 4626–39.

162.

Vishnivetskaya

T.A.

, Layton

A.C.

, Lau

M.C.

. Commercial DNA extraction kits impact observed microbial community composition in permafrost samples. FEMS Microbiol Ecol. 2014; 87(1): 217–30.

163.

Kim

, Morrison

, Yu

Evaluation of different partial 16S rRNA gene sequence regions for phylogenetic analysis of microbiomes.

J Microbiol Methods. 2011; 84(1): 81–7.

164.

Klindworth

, Pruesse

, Schweer

. Evaluation of general 16S ribosomal RNA gene PCR primers for classical and next-generation sequencing-based diversity studies. Nucleic Acids Res. 2013; 41(1): e1.

165.

Soergel

D.A.W.

, Dey

, Knight

, Brenner

S.E.

Selection of primers for optimal taxonomic classification of environmental 16S rRNA gene sequences.

ISME J. 2012; 6(7): 1440–4.

166.

Harismendy

, Ng

P.C.

, Strausberg

R.L.

. Evaluation of next generation sequencing platforms for population targeted sequencing studies. Genome Biol. 2009; 10(3): R32.

167.

Sun

, Cai

, Huse

S.M.

. A large-scale benchmark study of existing algorithms for taxonomy-independent microbial community analysis. Brief Bioinform. 2012; 13(1): 107–21.

168.

Richter

D.C.

, Ott

, Auch

A.F.

, Schmid

, Huson

D.H.

MetaSim: a sequencing simulator for genomics and metagenomics.

PLoS One. 2008; 3(10): 1–12.

169.

D'Argenio

, Casaburi

, Precone

, Salvatore

Comparative metagenomic analysis of human gut microbiome composition using two different bioinformatic pipelines.

Biomed Res Int. 2014; 2014: 325340.

170.

Sangwan

, Lata

, Dwivedi

. Comparative metagenomic analysis of soil microbial communities across three hexachlorocyclohexane contamination levels. PLoS One. 2012; 7(9): e46219.

171.

Peterlongo

, Chikhi

Mapsembler, targeted and micro assembly of large NGS datasets on a desktop computer.

BMC Bioinformatics. 2012; 13: 48.

172.

Maccallum

, Przybylski

, Gnerre

. ALLPATHS 2: small genomes assembled accurately and with high continuity from short paired reads. Genome Biol. 2009; 10(10): R103.

173.

Butler

, MacCallum

, Kleber

. ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome Res. 2008; 18(5): 810–20.

174.

Ye

, Tang

An ORFome assembly approach to metagenomics sequences analysis.

J Bioinform Comput Biol. 2009; 7(3): 455–71.

175.

Ye

, Tang

An ORFome assembly approach to metagenomics sequences analysis.

J Bioinform Comput Biol. 2008; 7: 3–13.

176.

Koslicki

, Foucart

, Rosen

WGSQuikr: fast whole-genome shotgun metagenomic classification.

PLoS One. 2014; 9(3): e91784.

177.

Mohammed

M.H.

, Ghosh

T.S.

, Singh

N.K.

, Mande

S.S.

SPHINX - an algorithm for taxonomic binning of metagenomic sequences.

Bioinformatics. 2011; 27(1): 22–30.

178.

Lanzén

, Jørgensen

S.L.

, Huson

D.H.

. CREST — classification resources for environmental sequence tags. PLoS One. 2012; 7(11): e49334.

179.

Sunagawa

, Mende

D.R.

, Zeller

. Metagenomic species profiling using universal phylogenetic marker genes. Nat Methods. 2013; 10(12): 1196–9.

180.

Wood

D.E.

, Salzberg

S.L.

Kraken: ultrafast metagenomic sequence classification using exact alignments.

Genome Biol. 2014; 15(3): R46.

181.

Rosen

G.L.

, Reichenberger

E.R.

, Rosenfeld

A.M.

NBC: the Naive Bayes classification tool webserver for taxonomic classification of metagenomic reads.

Bioinformatics. 2011; 27(1): 127–9.

182.

Luo

, Rodriguez

R.L.

, Konstantinidis

K.T.

MyTaxa: an advanced taxonomic classifier for genomic and metagenomic sequences.

Nucleic Acids Res. 2014; 42(8): e73.

183.

MacDonald

N.J.

, Parks

D.H.

, Beiko

R.G.

Rapid identification of high-confidence taxonomic assignments for metagenomic data.

Nucleic Acids Res. 2012; 40(14): e111.

184.

Wang

, Wang

, Yue

, Zhang

, Liu

The complete mitochondrial genome of the Bufo tibetanus (Anura: Bufonidae).

Mitochondrial DNA. 2013; 24(3): 186–8.

185.

Ghosh

T.S.

, Mohammed

M.H.

, Komanduri

, Mande

S.S.

ProViDE: a software tool for accurate estimation of viral diversity in metagenomic samples.

Bioinformation. 2011; 6(2): 91–4.

186.

Tatusov

R.L.

, Fedorova

N.D.

, Jackson

J.D.

. The COG database: an updated version includes eukaryotes. BMC Bioinformatics. 2003; 4: 41.

187.

Finn

R.D.

, Miller

B.L.

, Clements

, Bateman

iPfam: a database of protein family and domain interactions found in the protein data bank.

Nucleic Acids Res. 2014; 42(Database issue): D364–73.

188.

Segata

, Waldron

, Ballarini

, Narasimhan

, Jousson

, Huttenhower

Metagenomic microbial community profiling using unique clade-specific marker genes.

Nat Methods. 2012; 9(8): 811–4.

189.

Churbanov

, Ryan

, Hasan

. HighSSR: high-throughput SSR characterization and locus development from next-gen sequencing data. Bioinformatics. 2012; 28(21): 2797–803.

190.

Markowitz

V.M.

, Chen

I.M.

, Chu

. IMG/M 4 version of the integrated metagenome comparative analysis system. Nucleic Acids Res. 2014; 42(Database issue): D568–73.

191.

Su

, Pan

, Song

, Xu

, Ning

Parallel-META 2.0: enhanced metagenomic data analysis with functional annotation, high performance computing and advanced visualization.

PLoS One. 2014; 9(3): e89323.

192.

Goll

, Rusch

D.B.

, Tanenbaum

D.M.

. METAREP: JCVI metagenomics reports — an open source tool for high-performance comparative metagenomics. Bioinformatics. 2010; 26(20): 2631–2.

193.

Angly

, Rodriguez-Brito

, Bangor

. PHACCS, an online tool for estimating the structure and diversity of uncultured viral communities using metagenomic information. BMC Bioinformatics. 2005; 6: 41.

194.

Kuczynski

, Stombaugh

, Walters

W.A.

, Gonzalez

, Caporaso

J.G.

, Knight

Using QIIME to analyze 16S rRNA gene sequences from microbial communities.

Curr Protoc Bioinformatics. 2011; Chapter 10: Unit10.17.

195.

Nebel

M.E.

, Wild

, Holzhauser

. Jaguc — a software package for environmental diversity analyses. J Bioinform Comput Biol. 2011; 9(6): 749–73.

196.

Wang

, Yao

, Sun

, Mai

M-pick, a modularity-based method for OTU picking of 16S rRNA sequences.

BMC Bioinformatics. 2013; 14: 43.

197.

Beck

, Settles

, Foster

J.A.

OTUbase: an R infrastructure package for operational taxonomic unit data.

Bioinformatics. 2011; 27(12): 1700–1.

198.

Angly

F.E.

, Dennis

P.G.

, Skarshewski

, Vanwonterghem

, Hugenholtz

, Tyson

G.W.

CopyRighter: a rapid tool for improving the accuracy of microbial community profiles through lineage-specific gene copy number correction.

Microbiome. 2014; 2: 11.

199.

Ye

Identification and quantification of abundant species from pyrosequences of 16S rRNA by consensus alignment.

Proceedings IEEE Int Conf Bioinformatics Biomed. 2011; 2010: 153–7.

200.

Lozupone

, Knight

UniFrac: a new phylogenetic method for comparing microbial communities.

Appl Environ Microbiol. 2005; 71(12): 8228–35.

201.

Cai

, Sun

ESPRIT-Tree: hierarchical clustering analysis of millions of 16S rRNA pyrosequences in quasilinear computational time.

Nucleic Acids Res. 2011; 39(14): e95.

202.

Weisman

, Yasuda

, Bowen

J.L.

FunFrame: functional gene ecological analysis pipeline.

Bioinformatics. 2013; 29(9): 1212–4.

203.

Giongo

, Crabb

D.B.

, Davis-Richardson

A.G.

. PANGEA: pipeline for analysis of next generation amplicons. ISME J. 2010; 4(7): 852–61.

204.

Yu

, Breitbart

, McNairnie

, Rohwer

FastGroupII: a web-based bioinformatics platform for analyses of large 16S rDNA libraries.

BMC Bioinformatics. 2006; 7: 57.

205.

Kumar

, Carlsen

, Mevik

B.H.

. CLOTU: an online pipeline for processing and clustering of 454 amplicon reads into OTUs followed by taxonomic annotation. BMC Bioinformatics. 2011; 12: 182.

206.

Edgar

R.C.

, Haas

B.J.

, Clemente

J.C.

, Quince

, Knight

UCHIME improves sensitivity and speed of chimera detection.

Bioinformatics. 2011; 27(16): 2194–200.

207.

Huber

, Faulkner

, Hugenholtz

Bellerophon: a program to detect chimeric sequences in multiple sequence alignments.

Bioinformatics. 2004; 20(14): 2317–9.

208.

Pandey

R.V.

, Nolte

, Boenigk

, Schlotterer

CANGS DB: a stand-alone web-based database tool for processing, managing and analyzing 454 data in biodiversity studies.

BMC Res Notes. 2011; 4: 227.

209.

Pandey

R.V.

, Nolte

, Schlotterer

CANGS: a user-friendly utility for processing and analyzing 454 GS-FLX data in biodiversity studies.

BMC Res Notes. 2010; 3: 3.

210.

Cole

J.R.

, Chai

, Marsh

T.L.

; Ribosomal Database Project. The Ribosomal Database Project (RDP-II): previewing a new autoaligner that allows regular updates and the new prokaryotic taxonomy. Nucleic Acids Res. 2003; 31(1): 442–3.

Metagenomics: Tools and Insights for Analyzing Next-Generation Sequencing Data Derived from Biodiversity Studies
