Abstract
DNA sequences accumulating in the International Nucleotide Sequence Databases (INSD) form a rich source of information for taxonomic and ecological meta-analyses. However, these databases include many erroneous entries, and the data are poorly annotated with metadata, making it difficult to target and extract entries of interest with any degree of precision. Here we describe the web-based workbench PlutoF, which is designed to bridge the gap between the needs of contemporary research in biology and the existing software resources and databases. Built on a relational database, PlutoF allows rapid remote submission, retrieval, and analysis of study, specimen, and sequence data in INSD as well as in private datasets through web-based thin clients. In contrast to INSD, PlutoF supports internationally standardized terminology to allow very specific annotation and linking of interacting specimens and species. The sequence analysis module is optimized for identification and analysis of environmental ITS sequences of fungi, but it can be modified to operate on any genetic marker and group of organisms. The workbench is available at http://plutof.ut.ee.
Introduction
Molecular (DNA-based) techniques and informatics are vital research tools in nearly all fields of the biological sciences, including ecology and taxonomy.1–3 As more and more DNA sequences accumulate in the International Nucleotide Sequence Databases (INSD: EMBL, GenBank, and DDBJ), 4 the joint corpus of sequence data generated by the international research community gradually attains far-reaching explanatory power in the disciplines of taxonomy, ecology, and biogeography.5–7 The analysis of such amalgamated data is of particular relevance to understanding the biology of microorganisms because of their inconspicuous and poorly understood nature, their high population sizes, and the insurmountable difficulties associated with keeping many of them in culture.8,9 Extensive sampling in terms of sequence depth, ecological niches, and geographical regions is typically required to answer microbiological questions with any noteworthy degree of certainty, pointing to the benefits of, and indeed the need for, integrating datasets and resources already generated. Studies in microbiology rely to a great extent not only on the sequence data themselves but also on the associated metadata (auxiliary information on, eg, collection site, host, and soil type). Unfortunately, INSD does not require that metadata be submitted alongside the sequence data and offers little by way of a standardized vocabulary for specifying metadata, leaving the sequence authors free to decide which information items to give and how to give them. Thus, in spite of international standardization and data infrastructure initiatives such as the Darwin Core standard (maintained by TDWG, http://rs.tdwg.org/dwc/) and the Microbiological Common Language, 10 the INSD metadata are often given in inconsistent and irreconcilable ways (eg, specified under different headings or using synonymous wording).
Additional technical problems further complicate data mining of public sequence data. Names of species or higher taxonomic lineages are often applied in conflicting ways due to differences in taxonomic opinion or in tradition among ecologists and taxonomists. 11 A substantial proportion of the publicly available sequences are furthermore chimeric, reverse complementary, or contain numerous erroneous bases or ambiguities.12–14 Worryingly, there is at present no straightforward way to alert other users of INSD to the presence of such defective data, 15 paving the way for the percolation of incorrect information through the databases and the scientific community at large. 16 As an example, the set of nuclear internal transcribed spacer (ITS) sequences of fungi in INSD includes an estimated 1% reverse complementary, 1.5% chimeric, and more than 10% incorrectly identified entries.17,18 This is problematic given the weight assigned to the ITS region in contemporary mycology; it is the most commonly sequenced genetic marker for species identification from environmental samples due to its ease of amplification and its discriminative power at the species level.19–21
These complications notwithstanding, the INSD provides an important backbone resource for the development of more accurate, but less inclusive, databases, such as SILVA, 22 Greengenes, 23 and UNITE.19,24 One of the main objectives of these resources is to facilitate reliable taxonomic identification of newly generated environmental and clinical sequences (ie, from samples such as soil, wood, and gut). The core set of reference sequences in these databases is composed of entries that have passed various steps of quality control and that are deemed of sufficient standard and reliability to be of true use in taxonomy and ecology. As such, these initiatives often assume the role of INSD as the primary reference database in large-scale environmental sequencing studies,25–27 and they typically feature tailored search tools and analysis modules not found in INSD. As an example, the command line-based utility mothur 28 was developed to span the range of steps involved in assigning environmental sequences to species or operational taxonomic unit (OTU) level and to obtain diversity assessments of the samples at hand. However, these utilities were primarily built with prokaryotes and the ribosomal small subunit (16S) gene in mind; furthermore, many of them require that the input data be presented in the form of a joint, scientifically sound multiple alignment.23,28,29 Thus, by their very nature, these resources are largely incompatible with fungal ITS sequence data, since the high variability of the region precludes reliable alignment across higher taxonomic levels.
Here we describe an online workbench—PlutoF—that is designed to tackle the many issues of contemporary DNA-based research in ecology and taxonomy. PlutoF was developed in response to the need of many researchers and research networks to manage and analyze their molecular data in ways not fully supported by existing resources and databases (Table 1). The ultimate goal of PlutoF is to cover all elements of the extant biodiversity, viz. ecological, genetic, and taxonomic diversity of all biological kingdoms. This will enable researchers to address integrated questions spanning the different fields of the biosciences, something that is in increasing demand. 30 Through the PlutoF workbench any researcher can develop an indefinite number of databases and bring together existing databases for joint analysis. Data available to the user can be searched, sorted, and analysed across all these databases. The present study addresses the procedures of rapid submission and retrieval of large sequence datasets, annotation of new and pre-existing sequences and specimens, and the sequence analysis features of PlutoF. The workbench supports tools for processing raw community sequence data from any genetic marker, but the analysis module of PlutoF is optimized for fungal ITS sequences by default.
Table 1. Overview and comparison of PlutoF with INSD, mothur, and QIIME.
Database Structure and Operation
Database and web design
The PlutoF workbench draws from the relational MySQL v. 5.0.77 database and has more than 150 tables for storing taxonomic, ecological, and molecular data (Suppl. Item 1). The database structure (Fig. 1) is rooted in Taxonomer 31 but with far-reaching modifications to integrate modules for storing multimedia, molecular data, and analysis results. The current database model enables users to insert, search, and browse various taxon occurrences (based on, eg, specimens, observations, or DNA sequences), literature references, and scientific collections. PlutoF has a hierarchical study/plot/sample model (Fig. 1) that enables users to manage their own projects all the way from sampling design and persistent storage of data to molecular data analysis and interpretation of the results. Users can work with their own data and form workgroups of users that share the data. The regularly updated classification used in the central taxonomy module is largely based on Hibbett et al (2007) 32 and Index Fungorum (http://www.indexfungorum.org/) for fungi, Fauna Europaea (http://www.faunaeur.org/) for animals (higher taxonomic levels have however been updated to reflect recent phylogenetic literature),33,34 and APG III 35 for plants.

Figure 1. Simplified database scheme showing the core modules. Shaded modules and arrows illustrate the hierarchical structure of the study/plot/sample model, and lines indicate relationships among other modules.
As implemented at the University of Tartu (Estonia), the PlutoF workbench runs on a quad-core 64-bit Linux server (CentOS 5.2, Apache webserver v. 2.2.3). The PlutoF web interface is built with PHP, HTML, CSS, and JavaScript (including AJAX) and communicates with the databases through SQL. The software packages of the analysis module are written in Perl. PlutoF has been tested with all major web browsers, including Mozilla Firefox (v. 2.x and 3.x), Internet Explorer (v. 6.x–8.x), and Safari (v. 5.0) on various operating systems.
Storage of sequence data in PlutoF
The core information of the PlutoF system is the sequence data, and PlutoF supports the distinction between external (eg, INSD) and internal (public or private) sequences. These datasets can be queried separately or jointly. In recognition of the explanatory power of the body of amalgamated fungal ITS sequences in INSD, PlutoF offers the possibility to mirror the INSD for the fungal ITS data (Suppl. Fig. 2a) or any other genetic marker of interest; in the UNITE database, all reasonably full-length fungal ITS sequences identified as such in INSD are downloaded on a monthly basis. As of September 2010, UNITE thus contained 160,581 INSD sequences and 6,368 native sequences of the fungal ITS region (the latter including 2,843 entries from fully identified and vouchered reference fruiting bodies). The overall corpus of sequences corresponds to about 15,000 fully identified species of fungi; about 50% of the sequences, nearly all of which stem from INSD, nevertheless remain unidentified to species level. The system furthermore supports the distinction between different classes of sequences. The present classes include INSD, native reference, native non-reference, and next generation sequencing (NGS) sequences (eg, sequences from massively parallel (“454”) pyrosequencing 36 efforts). NGS entries form a challenge due to their sheer numbers and potential reduction in length and read quality.37,38 We advocate that pyrosequencing entries be marked as being distinct from sequences obtained using traditional Sanger sequencing. Since cleaning and filtering methods for raw pyrosequencing data improve over time,37,39–41 the availability of the raw NGS data underlying scientific studies and results may prove important for subsequent analyses.
PlutoF accordingly supports deposition of NGS data at two levels—i) compressed files of raw sequence data, quality scores, and barcode translation tables; and ii) quality filtered sequences—optionally in the form of majority-rule consensus sequences—with abundance and sample information added to their annotation. Templates for comma- and tab-delimited files are available for these purposes. These and other file types can be uploaded to the database through the PlutoF Digital Repository module, which recognises most common file types and formats.
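The second deposition level can be illustrated with a small sketch. The text does not specify how consensus sequences are computed, so the following is only one plausible reading of "majority-rule consensus": it assumes the reads are already aligned to equal length, keeps a base when a strict majority of reads agree, and writes N otherwise. The function name is hypothetical and not part of PlutoF.

```python
from collections import Counter


def majority_rule_consensus(aligned_reads):
    """Return (consensus, abundance) for equal-length, pre-aligned reads.

    Illustrative only: the 'strict majority, else N' rule is an assumption.
    """
    if not aligned_reads:
        raise ValueError("at least one read is required")
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be aligned to the same length")

    consensus = []
    for column in zip(*aligned_reads):
        base, count = Counter(column).most_common(1)[0]
        # Keep the base only if more than half of the reads agree on it.
        consensus.append(base if count * 2 > len(column) else "N")
    # Abundance is taken here as the number of reads behind the consensus.
    return "".join(consensus), len(aligned_reads)


consensus, abundance = majority_rule_consensus(["ACGT", "ACGA", "ACGT"])
```

The abundance value returned alongside the consensus corresponds to the abundance information that the text says is added to the annotation of quality-filtered sequences.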
The sequence data in INSD are by default retrieved with all available metadata (eg, isolation source, geographical locality, and literature reference); these data are extracted and stored in PlutoF. All INSD entries are indexed according to study of origin using the hierarchical model so that sequences belonging to the same study are separated into plots and samples based on their locality information, as available. This makes precise data retrieval possible (Suppl. Fig. 2b); for instance one could search for all studies involving fungal ITS sequences on Canadian territory in a single query. Similarly, all sequences deposited by a specific researcher or during a given year are easily retrieved. Such searches are not always straightforward in INSD itself.
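A minimal sketch of how such a query could look under the study/plot/sample hierarchy is given here. The table and column names are hypothetical and do not reflect the actual PlutoF schema, the accessions are placeholders, and Python's built-in sqlite3 module stands in for the MySQL database so that the example is self-contained.

```python
import sqlite3


def build_demo_db():
    """Create a toy study/plot/sample/sequence hierarchy (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE study (id INTEGER PRIMARY KEY, title TEXT, year INTEGER);
        CREATE TABLE plot (id INTEGER PRIMARY KEY,
                           study_id INTEGER REFERENCES study(id), country TEXT);
        CREATE TABLE sample (id INTEGER PRIMARY KEY,
                             plot_id INTEGER REFERENCES plot(id), source TEXT);
        CREATE TABLE sequence (id INTEGER PRIMARY KEY,
                               sample_id INTEGER REFERENCES sample(id),
                               accession TEXT, marker TEXT);
        """
    )
    conn.executemany("INSERT INTO study VALUES (?, ?, ?)",
                     [(1, "Boreal soil fungi", 2008), (2, "Alpine root tips", 2009)])
    conn.executemany("INSERT INTO plot VALUES (?, ?, ?)",
                     [(1, 1, "Canada"), (2, 2, "Switzerland")])
    conn.executemany("INSERT INTO sample VALUES (?, ?, ?)",
                     [(1, 1, "soil"), (2, 2, "root")])
    conn.executemany("INSERT INTO sequence VALUES (?, ?, ?, ?)",
                     [(1, 1, "DEMO0001", "ITS"), (2, 1, "DEMO0002", "ITS"),
                      (3, 2, "DEMO0003", "ITS")])
    return conn


def studies_by_country(conn, country, marker):
    """Return titles of studies with sequences of `marker` sampled in `country`."""
    rows = conn.execute(
        """
        SELECT DISTINCT study.title
        FROM study
        JOIN plot ON plot.study_id = study.id
        JOIN sample ON sample.plot_id = plot.id
        JOIN sequence ON sequence.sample_id = sample.id
        WHERE plot.country = ? AND sequence.marker = ?
        ORDER BY study.title
        """,
        (country, marker),
    ).fetchall()
    return [title for (title,) in rows]


canadian_its_studies = studies_by_country(build_demo_db(), "Canada", "ITS")
```

Because locality is stored at the plot level and sequences hang off samples, a single join along the hierarchy answers the question; the same pattern would extend to filtering by submitter or year.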
Data Handling and Sequence Analysis Modules
Handling user data
The PlutoF structure supports submission of sample details and other auxiliary information along with sequence data on a sequence-by-sequence, as well as bulk, basis. For example, samples (as Taxon occurrence in the main menu) may comprise multiple specimens in a scientific collection, mere field observations of some given species, or DNA sequences from various genes and organisms. Similarly to INSD, direct submission of sequence data requires that the name of the study or project be given along with one or more plots as relevant. Unlike INSD, however, PlutoF offers a standardized vocabulary for describing and defining the properties of the sequences and the conditions under which they were obtained. In accordance with contemporary research in ecology, PlutoF supports the subdivision of plots into samples to allow very specific data retrieval queries. For each plot and sample, comprehensive descriptions can be provided, including data on locality (eg, geo-coordinates, altitude, and municipalities), habitat (following the IUCN habitat classification system: http://www.iucnredlist.org/technical-documents/classification-schemes/habitats-classification-scheme-ver3 including history, age, and climate), soil (the FAO classification), 42 soil horizon (chemical and physical properties), plant root (eg, biomass, turnover, and production by diameter), forest (eg, canopy height, stand density, and basal area), and general information (name, type, and size). Specimen information includes taxonomy (eg, name of the taxon and phenologic/phenotypic data), collection (date, collector, and determiner), and substrate/interacting taxon (taxonomy and type of interaction). Sequence information includes ID, DNA sequence, name of the gene, PCR primers, and level of availability to other users.
While the taxonomic classification in PlutoF follows international standards, power users can add and edit taxon names directly in the workbench on subclass or lower level. Above the level of subclass, only administrators can implement changes; prior agreement between classification curators is however required. All users can apply for the right to upload and edit taxon names.
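The editing rule described above can be expressed as a simple rank check. This is a hedged sketch only: the rank list, the role names, and the assumption that administrators may also edit at subclass level and below are illustrative choices, not details given in the text, and the required agreement between classification curators is not modelled.

```python
# Ranks ordered from highest to lowest; the exact list is an assumption.
RANKS = ["kingdom", "phylum", "subphylum", "class", "subclass",
         "order", "family", "genus", "species"]


def can_edit(role, rank):
    """Return True if a user with `role` may edit taxon names at `rank`."""
    if rank not in RANKS:
        raise ValueError(f"unknown rank: {rank}")
    if role == "administrator":
        # Only administrators may change names above subclass.
        return True
    if role == "power_user":
        # Power users may edit at subclass level or lower.
        return RANKS.index(rank) >= RANKS.index("subclass")
    return False


power_user_may_edit_genus = can_edit("power_user", "genus")
```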
Annotating INSD entries
The PlutoF workbench allows third-party annotation of INSD, as well as native, sequences. The primary rationale is to support the addition of missing metadata, the correction of incomplete or incorrect taxonomic information, and the provision of information pertaining to the overall reliability of the sequence, such as chimeric nature. The original information is retained, and annotations are introduced as separate data layers. All annotations are by default non-anonymous. Missing metadata can be added directly to each specimen/sequence, sample, or plot in the relevant window (Suppl. Fig. 2c, d). Sequences of ectomycorrhizal fungi can be assigned to monophyletic lineages (sensu Tedersoo et al 2010) 43 to overcome paraphyly. Updating taxonomic annotations—typically by providing additional taxon names to misidentified or unnamed sequences—should only be undertaken by users with sufficient experience of the taxonomic lineage at hand, and PlutoF supports a peer-review type of process for managing such annotations.
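The idea of keeping the original record intact while stacking annotations on top of it can be sketched as follows. The class and field names are hypothetical, the accession is a placeholder, and the rule that the most recent annotation takes precedence is an assumption; the peer-review process mentioned above is not modelled.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    annotator: str   # annotations are non-anonymous
    field_name: str
    value: str


@dataclass
class SequenceRecord:
    accession: str
    original: dict                      # metadata as retrieved; never modified
    layers: list = field(default_factory=list)

    def annotate(self, annotator, field_name, value):
        """Add an annotation as a separate layer on top of the original data."""
        if not annotator:
            raise ValueError("an annotator name is required")
        self.layers.append(Annotation(annotator, field_name, value))

    def current(self, field_name):
        """Return the latest annotated value, falling back to the original."""
        for annotation in reversed(self.layers):
            if annotation.field_name == field_name:
                return annotation.value
        return self.original.get(field_name)


record = SequenceRecord("DEMO0001", {"taxon": "uncultured fungus"})
record.annotate("A. Curator", "taxon", "Russula sp.")
```

After the call, `record.current("taxon")` reflects the annotation, while `record.original["taxon"]` still holds the value as submitted.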
Bioinformatics resources and the analysis module
PlutoF enables rapid sorting and retrieval of relevant sequence data by various search parameters such as sequence ID, taxon name, country, interacting taxon, sequence length, and study. Another option is to use the BLAST 44 -based search tool emerencia, 45 which is designed to track the taxonomic affiliation of insufficiently identified ITS sequences over time. In both cases, relevant entries are marked and sent to the clipboard, where they can be checked for duplicates (data that have been submitted to both PlutoF and INSD) and exported to FASTA or comma-separated (CSV) files with a full set of metadata. In addition, data can be sent to an integrated Google Maps module for instant geographical visualisation (Suppl. Fig. 2e).
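The clipboard steps (duplicate check, then export) might look roughly like the sketch below. How PlutoF actually detects duplicates is not stated; treating records with identical sequence strings as duplicates is an assumption made here, and the record fields and function names are hypothetical.

```python
import csv
import io


def drop_duplicates(records):
    """Keep the first record for each distinct sequence (case-insensitive)."""
    seen = set()
    unique = []
    for record in records:
        key = record["sequence"].upper()
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def to_fasta(records):
    """Format records as FASTA with the ID and taxon name in the header."""
    return "".join(f">{r['id']} {r['taxon']}\n{r['sequence']}\n" for r in records)


def to_csv(records, fields=("id", "taxon", "country", "sequence")):
    """Format records as CSV text with one metadata column per field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(fields),
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


clipboard = [
    {"id": "A1", "taxon": "Russula sp.", "country": "Estonia", "sequence": "ACGT"},
    {"id": "B2", "taxon": "Russula sp.", "country": "Estonia", "sequence": "acgt"},
]
unique_records = drop_duplicates(clipboard)
```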
The analysis module includes software for extracting and classifying ITS sequences that are derived from high-throughput sequencing or cloning studies (Suppl. Fig. 2f). Based on highly conserved short signal motifs, the ITS Extractor 46 separates the ITS1 and ITS2 subregions of the ITS region from the flanking rDNA genes, a step that improves the precision of clustering and sequence identification.47,48 BLASTClust of the BLAST suite performs single-linkage clustering at user-defined similarity threshold values to collapse query datasets into OTUs. The chimera checker utility identifies potentially chimeric ITS sequences by contrasting the respective taxonomic signal of the ITS1 and ITS2 subregions. 18 A serial BLAST engine to compare arbitrarily large query datasets for similarity against the sequences in UNITE/INSD is also available. A pyrosequencing pipeline allows pyrosequencing datasets of the ITS region to be analysed in a reasonable time, providing the taxonomic results in a spreadsheet format where OTUs are separated into rows and samples into columns. 49
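To make the clustering and output steps concrete, the following sketch performs single-linkage clustering at a user-defined threshold and builds an OTU-by-sample table. It is not BLASTClust: the similarity measure is a crude position-by-position identity that stands in for alignment-based similarity, and all function names and data are hypothetical.

```python
from itertools import combinations


def identity(a, b):
    """Fraction of matching positions; a stand-in for alignment-based similarity."""
    longest = max(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / longest if longest else 1.0


def single_linkage_otus(seqs, threshold):
    """Cluster {id: sequence} into OTUs; any pair at or above threshold links."""
    parent = {seq_id: seq_id for seq_id in seqs}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if identity(seqs[a], seqs[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for seq_id in seqs:
        clusters.setdefault(find(seq_id), []).append(seq_id)
    return sorted(sorted(members) for members in clusters.values())


def otu_table(otus, sample_of):
    """Return (samples, rows): one row per OTU, one count column per sample."""
    samples = sorted(set(sample_of.values()))
    rows = [[sum(sample_of[s] == smp for s in otu) for smp in samples]
            for otu in otus]
    return samples, rows


reads = {"r1": "AAAAAAAAAA", "r2": "AAAAAAAAAT",
         "r3": "AAAAAAAATT", "r4": "CCCCCCCCCC"}
otus = single_linkage_otus(reads, 0.9)
samples, rows = otu_table(otus, {"r1": "soil", "r2": "soil",
                                 "r3": "root", "r4": "root"})
```

In this toy dataset r1 and r3 are only 80% identical, yet they end up in the same OTU because each is 90% identical to r2; this chaining is the defining property of single-linkage clustering.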
Conclusions
PlutoF is a web-based workbench for the storage, editing, analysis, and overall management of ecological, taxonomic, and genetic data. It has a strong ecological and taxonomic orientation but also covers several aspects of biogeography and co-evolution. PlutoF was developed in light of the urgent need to address integrated questions in these fields through DNA sequence data. In recognition of the increasing internationalisation of biological research and the fact that different research groups and taxonomic lineages require different information items to be stored and analysed, PlutoF is flexible, scalable, and highly modularized. PlutoF is run at the University of Tartu, Estonia, and it is open for public use, including data submission, annotation, and analysis. Potential users are requested to contact the curator (http://plutof.ut.ee/contact.php) to obtain authentication information.
Footnotes
Disclosures
This manuscript has been read and approved by all authors. This paper is unique and is not under consideration by any other publication and has not been published elsewhere. The authors and peer reviewers of this paper report no conflicts of interest. The authors confirm that they have permission to reproduce any copyrighted material.
Acknowledgements
We thank the Estonian Science Foundation (grants 8-2/T8030PKPK, 0180012s09, 0180122s08, 0180127s08, 6939, 7434, 7558, 8235, and JD-0092), FIBIR, and Kapten Carl Stenholms Donationsfond for financial support.
