Abstract
Biomarker identification is often associated with the diagnosis and evaluation of various diseases. Recently, microRNA (miRNA) has been implicated in the development of diseases, particularly cancer. With the advent of next-generation sequencing, the amount of data on miRNA has increased tremendously in the last decade, requiring new bioinformatics approaches for processing and storing new information. New strategies have been developed for mining these sequencing datasets to allow a better understanding of the actions of miRNAs. As a result, many databases have also been established to disseminate these findings. This review focuses on several curated databases of miRNAs and their targets from both predicted and validated sources.
Introduction
In 1956, Crick stated the central dogma of molecular biology describing the flow of information from DNA to RNA to protein. 1 Although the process of information transmission was oversimplified, the central dogma hinted at the wealth of information that can be extracted from every biological sequence. The mining of information from nucleotide and protein sequences prompted the development of bioinformatics, the science that interfaces biology and computer science to answer biological questions on a molecular level. Sequence-based discovery allows the elucidation of the relationships between structure, function, and evolution. Discovering the relationships between our genetic sequences and the various genetic actions, including the causes of diseases, is one of the main goals of bioinformatics.
Biomarker identification is often associated with the diagnosis and evaluation of various diseases. Many biomarkers are macromolecules such as nucleic acids, carbohydrates, and proteins. The initial isolation of nucleic acid-based biomarkers relies on genomics, whereas the isolation of protein-based biomarkers relies on proteomics. These raw
With the advent of next-generation sequencing (NGS), the identification and quantitation of miRNA as biomarkers are becoming more precise. Many miRNA quantitation experiments are the result of whole transcriptome sequencing, often referred to as RNA-seq. 8 The hallmark feature of NGS is the ability to sequence millions of nucleotide strands simultaneously, which results in an unprecedented amount of coverage for any genome. While NGS is of great interest to many readers, the technical detail is beyond the scope and the allotted space of this review. For readers interested in NGS technology, review articles by Mardis, Mutz et al, and Koboldt et al provide thorough coverage of its usage and application.9–12 For readers interested in NGS and classical methods of miRNA discovery, Eminaga et al, Tam et al, and Git et al provide an excellent overview of the processes.13–15
NGS is also known as massively parallel sequencing or deep sequencing due to its potential outputs. Consequently, the amount of data generated has also been unprecedented. This requires the establishment of corresponding protocols for processing miRNA data from RNA-seq experiments. For bioinformatics to contribute to the analysis of these RNA-seq datasets, protocols need to be created for finding the most relevant miRNA species. While the main goal of this review is to focus on various repositories of miRNAs and their interactions, it is worth noting that computational approaches, such as miRClassify, 16 are also accelerating the overall annotation process of miRNAs. In addition, TargetScan, 17 miRanda, 18 and PicTar 19 are the leading target prediction programs in the field, as reflected by the number of citations. For other computational approaches, readers are referred to the review articles by Zou et al, Wang et al, and Wei et al.16,20,21
As one of the most important goals in bioinformatics, the proper storage and organization of data will lead to easy retrieval and dissemination of information. This review focuses on the specific aspect of databases in miRNA discovery. Several databases are discussed below. The inclusion of databases reviewed here must meet the following criteria: (1) clear documentation of updates and history, (2) recent updates in the past 12 months, and (3) not a simple derivative on data from another database. The major features of each database reviewed here are summarized in Table 1.
Table 1. Summarized features from databases reviewed.
miRBase
miRBase (www.mirbase.org) combines the knowledge of miRNA and NGS to create a repository aimed at assigning stable and consistent names to novel miRNAs. 22 While it can be accessed via its web interface, bulk download via file transfer protocol is also available. Established in 2002, miRBase was originally called the miRNA Registry, which allowed submissions of novel miRNAs to be named in a consistent and organized fashion. 23 Its first release contained 218 miRNA loci from five species. As of June 2014, after continuous growth, release 21 contains 28,645 entries representing hairpin precursor miRNAs that expressed 35,828 mature miRNA products in 223 species. miRBase can be used for searching and browsing both hairpin and mature sequences.
Since the inception of miRBase, its annotation strategy has been continually improved to organize all the information associated with miRNA species. Its goal was to assign official identifiers as quickly as possible for use in publications. For example, the prefix in dme-mir-100 designates the organism (here, Drosophila melanogaster) and is followed by a sequentially assigned number. More recently, sequences derived from the 5′ and 3′ arms of the hairpin precursor are named dme-miR-100-5p and dme-miR-100-3p, respectively, to specify the mature sequences. This standardized scheme also includes a strategy where homologous miRNA loci from different species are assigned the same number. Two of the most recent developments for miRBase are associated with the advances of NGS technology and community-based contributions toward the textual and functional information on miRNAs. 22 The curators of miRBase attributed the most recent database additions to next-generation or deep sequencing, which has led to more research groups participating in the process. Similar to many other knowledge bases, the annotation process in miRBase is also community based. Two major sources are involved in the annotation process: publications from PubMed and contributions of textual and functional annotations from the miRNA community. miRBase provides primary references for each miRNA sequence describing its discovery, links to evidence supporting the annotation, coordinates on the genome, and links to databases of predicted and validated target sites. miRBase can be searched with identifiers or keywords along with genomic location. miRNA sequences were also collected and mapped from the Gene Expression Omnibus (GEO) and the Sequence Read Archive (SRA, formerly the Short Read Archive), which are hosted by the National Center for Biotechnology Information (NCBI).
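The naming scheme described above lends itself to a small parsing example. The following Python sketch is illustrative only and is not code from miRBase: the class MirnaName, the functions parse_mirna_name and share_number, and the regular expression are hypothetical names introduced here. It assumes a lowercase organism prefix of three or four letters, "mir" for hairpin precursors, "miR" for mature sequences, and an optional -5p or -3p arm suffix; names outside this pattern (for example, let-7 family members or paralog suffixes) are deliberately not handled.

```python
import re
from typing import NamedTuple, Optional


class MirnaName(NamedTuple):
    species: str        # organism prefix, e.g. "dme"
    kind: str           # "precursor" for mir, "mature" for miR
    number: str         # sequentially assigned number, e.g. "100"
    arm: Optional[str]  # "5p", "3p", or None when no arm is given


# Assumed pattern: <prefix>-<mir|miR>-<number>[letter][-5p|-3p]
_NAME_RE = re.compile(
    r"^(?P<species>[a-z]{3,4})-(?P<tag>mir|miR)-(?P<number>\d+[a-z]?)"
    r"(?:-(?P<arm>[35]p))?$"
)


def parse_mirna_name(name: str) -> MirnaName:
    """Split a miRBase-style identifier into its components."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a recognized miRNA name: {name!r}")
    kind = "precursor" if match.group("tag") == "mir" else "mature"
    return MirnaName(match.group("species"), kind,
                     match.group("number"), match.group("arm"))


def share_number(name_a: str, name_b: str) -> bool:
    """True when two names carry the same assigned number,
    as homologous loci from different species do under this scheme."""
    return parse_mirna_name(name_a).number == parse_mirna_name(name_b).number
```

For instance, parse_mirna_name("dme-miR-100-5p") yields the prefix dme, the kind "mature", the number 100, and the arm 5p, while share_number("dme-mir-100", "hsa-mir-100") reports that the two loci carry the same number.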
miRDB
While serving as an online resource for functional annotations, miRDB (www.mirdb.org) also functions as a repository for miRNA-target predictions with data downloaded from version 21 of miRBase. 24 Users can also submit their own sequences for prediction at miRDB. As of early 2015, 2.1 million predicted gene targets regulated by 6,709 miRNAs are included in miRDB. The above target prediction was performed with MirTarget. 24
MirTarget was developed by analyzing high-throughput expression profiling data in a support vector machine framework. The MirTarget algorithm also serves as the back-end for the web server interface in prediction. One of the most recent developments was the inclusion of integrated computational analyses with literature, resulting in a new strategy and a scoring system for the identification of functional miRNA with the following four selection criteria. First, PubMed literature mining was utilized to map NCBI gene database for the association of miRNAs with corresponding PubMed records. Second, sequence conservation among different species was considered as functionally important. Third, expression profiles from 81 RNA-seq experiments were used for functional miRNA identification. Fourth, functional annotations by miRBase resulted in the identification of
miRWalk
The third database reviewed is miRWalk (mirwalk.uni-hd.de), which hosts predicted and validated miRNA-binding sites along with information on all known genes of human, mouse, and rat. 25 Similar to miRDB, miRWalk also utilizes automated text mining searches of PubMed to extract information on miRNAs. It is designed as a comprehensive database for predicted and validated targets for miRNAs associated with genes, pathways, diseases, organs, cell lines, and transcription factors.
One of the goals for miRWalk is to use a computational approach to identify the longest consecutive complementary regions between miRNA and gene sequences. The identified miRNA binding sites are generated with the miRWalk algorithm and then combined with the results of many other established prediction programs and databases, including DIANA-microTv4.0,26,27 DIANA-microT-CDS, 26 miRanda-rel2010, 18 mirBridge, 28 miRDB4.0, 24 miRmap, 29 miRNAMap, 30 doRiNA, 31 PicTar2, 19 RNA22v2, 32 RNAhybrid2.1, 33 and TargetScan6.2. 34 Continual updates and upgrades are the goals for improving miRWalk. Recently, the comparative platform of miRNA-binding sites within the mRNA 3′-UTR region was also upgraded with 13 miRNA-target prediction datasets. All results described above can be found via the web interface of miRWalk 2.0, containing two modules: predicted target module (PTM) and validated target module (VTM). The PTM provides novel comparative platforms of binding sites for the promoter, coding sequence (CDS), and 5′- and 3′-UTR regions. The VTM contains interaction information associated with genes, pathways, organs, diseases, cell lines, Online Mendelian Inheritance in Man (OMIM) disorders, and literature on miRNAs, in addition to information on proteins known to be involved in miRNA processing. The above modules are categorized into different search pages to allow users to retrieve miRNA-associated information using different identifiers.
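The idea of searching for the longest consecutive complementary region between a miRNA and a gene sequence can be sketched in a few lines of Python. This is not the miRWalk algorithm itself, whose details are not given here; it is a minimal illustration that assumes strict Watson-Crick pairing (no G:U wobble), antiparallel binding, and a standard dynamic-programming search for the longest common substring between the reverse complement of the miRNA and the target. The function names are hypothetical.

```python
from typing import Tuple

# Watson-Crick complements for RNA; G:U wobble pairs are ignored (assumption).
_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def _as_rna(seq: str) -> str:
    """Upper-case the sequence and convert DNA-style T to U."""
    return seq.upper().replace("T", "U")


def reverse_complement(rna: str) -> str:
    return _as_rna(rna).translate(_COMPLEMENT)[::-1]


def longest_complementary_region(mirna: str, target: str) -> Tuple[int, int, int]:
    """Return (length, mirna_start, target_start), 0-based and 5'->3',
    for the longest run of consecutive base pairs between the two sequences."""
    probe = reverse_complement(mirna)   # what a perfectly paired target would read
    tgt = _as_rna(target)
    best_len = best_probe_end = best_tgt_end = 0
    prev = [0] * (len(tgt) + 1)
    for i in range(1, len(probe) + 1):
        curr = [0] * (len(tgt) + 1)
        for j in range(1, len(tgt) + 1):
            if probe[i - 1] == tgt[j - 1]:
                curr[j] = prev[j - 1] + 1      # extend the current run
                if curr[j] > best_len:
                    best_len, best_probe_end, best_tgt_end = curr[j], i, j
        prev = curr
    # Probe coordinates are reversed relative to the miRNA, so map them back.
    mirna_start = len(probe) - best_probe_end
    target_start = best_tgt_end - best_len
    return best_len, mirna_start, target_start
```

With the let-7a sequence UGAGGUAGUAGGUUGUAUAGUU and the toy target GGGGCUACCUCGGGG, the function reports a seven-nucleotide region pairing miRNA positions 2 to 8 (0-based start 1) with the target starting at index 4. A production tool would additionally weigh wobble pairs, binding energy, and site context.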
miRTarBase
The miRTarBase (mirtarbase.mbc.nctu.edu.tw) aimed to provide “the most current and comprehensive information of experimentally validated miRNA-target interactions (MTIs).” 35 For its initial launch of version 1.0 in 2010, the database utilized over 100 published studies. As of September 15, 2015, version 6.0 is the most current iteration of miRTarBase containing 4,966 articles and 3,786 miRNAs. In comparison to databases that provide collections of miRNAs without deeper annotation, the uniqueness of miRTarBase is the curation on MTIs with both manual and computer-aided methods together with a robust suite of tools for the visualization of MTIs and diseases.
In the most recent release, over 360,000 MTIs were collected by manual review after applying natural language processing (NLP) to literature text. The use of an artificial intelligence approach such as NLP by the curators of miRTarBase is a distinguishing feature and should increase the number of relevant articles in the database. Unlike other miRNA databases, miRTarBase also offers several graphical visualization features. For instance, the word cloud is a new feature for visualizing relationships between an individual miRNA and medical conditions. For interactions between miRNAs and their respective targets, Cytoscape Web can be integrated to aid the understanding of miRNA-target regulation. 36 Beyond the usage of Cytoscape Web, the curators also used the Database for Annotation, Visualization and Integrated Discovery (DAVID) gene annotation tool to perform gene ontology and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment annotation to further examine the functions of the target genes involved in MTIs.37–39 These MTIs and associated annotations can be searched by users via the species browser and the search utility. Both interfaces have recently been enhanced and redesigned to allow basic MTI searches by miRNA, target gene symbol, validation method, or PubMed ID.
Other than user interface and visualization tools, miRTarBase sets itself apart from similar databases by incorporating datasets from NCBI GEO (www.ncbi.nlm.nih.gov/geo/) and The Cancer Genome Atlas (TCGA) (cancergenome.nih.gov/) to provide miRNA-target gene expression profiles. Specifically, TCGA provides clinical aspects of miRNA and gene expression profiles. Gene expression profiles from the above two data sources are currently considered as a method for experimental validation with the NGS technology. Several specific approaches involving NGS technology are currently being utilized by the curators, including cross-linking and immunoprecipitation (CLIP)-seq, 40 cross-linking, ligation, and sequencing of hybrids (CLASH-seq), 41 and degradome-seq. 42 Overall, miRTarBase contains 21 human CLIP-seq datasets, 5 mouse CLIP-seq datasets, 6 nematode datasets, and 1 human CLASH-seq dataset.
miRCancer
For readers specifically interested in miRNA and cancer, miRCancer (mircancer.ecu.edu) provides a comprehensive collection on the expression of miRNAs via text mining of PubMed. 43 The components of this approach are literature collection, named entity and expression recognition, rule matching, voting, manual verification, and recording. Regular expressions were first used to identify miRNAs in the literature, with miR and miR- used to locate miRNA names. Species prefixes, such as hsa- and mmu-, were also used as a part of the regular expressions in searching for related literature. For recognition of cancer names, a cancer name dictionary was compiled from the International Classification of Diseases for Oncology (codes.iarc.fr). The curators also established a dictionary for miRNA expression with 28 terms to include common keywords and phrases for upregulation and downregulation. The text mining approach for miRCancer further relies on 75 rules constructed by the curators using sentence structures commonly found in describing miRNA expressed in cancer cells. These rules are hard-coded sentence structures. Manual revision is then carried out to improve automated extraction. As of March 2015, 44,353 miRNAs for 173 cases of human cancer are associated with 2,073 publications in miRCancer.
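The recognition steps described above (a regular expression for miRNA names, a cancer name dictionary, and an expression term dictionary) can be illustrated with a short Python sketch. It is not the miRCancer implementation: the regular expression, the tiny dictionaries, and the single co-occurrence rule below are hypothetical stand-ins for the 28-term expression dictionary and the 75 hard-coded sentence rules, which are not reproduced in this review.

```python
import re
from typing import List, Tuple

# Assumed pattern: optional hsa-/mmu- prefix, "miR" with or without a hyphen,
# a number with an optional letter, and an optional -5p/-3p arm.
MIRNA_RE = re.compile(r"\b(?:(?:hsa|mmu)-)?miR-?\d+[a-z]?(?:-[35]p)?\b")

# Hypothetical miniature dictionaries, for illustration only.
UP_TERMS = ("upregulated", "up-regulated", "overexpressed", "increased", "elevated")
DOWN_TERMS = ("downregulated", "down-regulated", "decreased", "reduced")
CANCER_NAMES = ("breast cancer", "lung cancer", "glioblastoma")


def extract_associations(sentence: str) -> List[Tuple[str, str, str]]:
    """Return (miRNA, direction, cancer) triples found in one sentence.

    A single co-occurrence rule is applied: the sentence must mention a
    miRNA, a cancer name, and expression terms pointing in one direction.
    """
    lowered = sentence.lower()
    has_up = any(term in lowered for term in UP_TERMS)
    has_down = any(term in lowered for term in DOWN_TERMS)
    if has_up == has_down:          # no expression term, or conflicting terms
        return []
    direction = "up" if has_up else "down"
    mirnas = MIRNA_RE.findall(sentence)
    cancers = [name for name in CANCER_NAMES if name in lowered]
    return [(mirna, direction, cancer) for mirna in mirnas for cancer in cancers]
```

A simple rule like this one over-generates on complex sentences, which is consistent with the pipeline described above including voting and manual verification after rule matching.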
doRiNA 2.0
The main goal of doRiNA is to create a single framework for the systematic curation, storage, and integration of RNA-binding proteins (RBPs) and miRNAs from different species.31,44 It is a database of RNA interactions in posttranscriptional regulation, with predictions carried out by PicTar. 19 Unlike other miRNA databases, doRiNA 2.0 (dorina.mdc-berlin.de) can be implemented locally, which allows integration into third-party pipelines. Furthermore, doRiNA 2.0 solicits user feedback and operates on an open-source model. As a part of the upgraded version 2.0, the developers also reworked the user interface and expanded the database to improve the usability of the website. These features make it one of the more distinctive and technically sophisticated databases reviewed here.
Developers of doRiNA 2.0 collected and integrated all available data on miRNA and RBP target sites from the public domain. More than 67 new publicly available RBP datasets have been added into doRiNA 2.0. In the latest version of doRiNA, miRNA and their targets were identified with both computational predictions and new experimental techniques by chimeric sequencing reads. Due to the lack of reliable
Recent updates in version 2.0 provide various improvements over the previous version. Developers of doRiNA paid special attention to the infrastructure and interoperability surrounding their repository. doRiNA 2.0 can now achieve high query speed and complexity by precomputing several important data characteristics. External developers can easily integrate doRiNA 2.0 into third-party analysis pipelines via a representational state transfer (REST) application program interface (API), while the Python API can be used for local queries by users. Documentation for the above two APIs can be found at http://dorina.mdc-berlin.de/docs. The developers have also migrated away from the traditional Common Gateway Interface (CGI) and MySQL relational database implementations and instead used a fast key-value cache and store (redis.io) as well as in-memory caching of frequent queries for faster access. Mirrored sites and database servers are utilized by doRiNA 2.0 to achieve high service availability. Both the web application and the APIs are available under an open-source license approved by the Open Source Initiative that permits research and commercial access and reuse. The developers at doRiNA essentially created an
SomamiR
SomamiR (compbio.uthsc.edu/SomamiR/) was created to integrate heterogeneous datasets to investigate the impact of somatic and germline mutations on miRNA function in cancer. 46 It specifically contains experimentally determined germline and somatic miRNA mutations associated with cancer, along with their target sites. A total of 15 sources of somatic mutations that have been identified from whole-genome sequencing of paired normal and cancer samples were analyzed and incorporated into SomamiR.
Three methods were used to predict how mutations may impact target sites in SomamiR. First, a comprehensive list of how somatic mutations may alter miRNA-binding sites was created with methods established by Ellwanger et al. 47 Second, two popular miRNA-target prediction algorithms, TargetScan 17 and PITA, 48 were used to determine mutations that are more likely to alter functional binding sites. Third, five major types of information were used to annotate miRNAs, genes, and target locations in SomamiR: results of association studies, gene pathways, sequence conservation, expression of miRNAs in cancer, and germline mutations. For association studies, high scoring markers from genome-wide association studies (GWAS) of cancer in the National Human Genome Research Institute (NHGRI) GWAS catalog were collected. The data on meta-analysis of cancer candidate gene association studies from the Cancer GAMAdb 49 were also collected. Developers also carried out functional annotation of genes containing somatic mutations that alter miRNA target sites with KEGG. They further highlighted genes with somatic mutations from miRNA target sites in each pathway. To improve miRNA-target prediction, the conservation of a target site sequence across species has been used. A 46-way multiZ 50 alignment of vertebrate genomes was utilized to determine whether the sequence of a predicted target site was conserved. To better understand somatic cell mutations associated with cancer, miRNA expression data from various cancer genome sequencing projects deposited at TCGA were also collected. In addition to somatic cell mutations, germline mutations that alter predicted and experimental miRNA target sites were collected from PolymiRTS. 51 The name PolymiRTS derives from polymorphisms in miRNAs and their target sites. PolymiRTS is a database for tracking and identifying sequence polymorphisms in miRNAs or their target sites to possibly reveal links to molecular, physiological, and behavioral disease phenotypes.
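To make the notion of a mutation that alters a miRNA target site concrete, the Python sketch below checks whether a single-nucleotide substitution in a 3′-UTR removes or creates a canonical seed match. It is a deliberate simplification and not the SomamiR pipeline, which relied on TargetScan, PITA, and the methods of Ellwanger et al: the sketch assumes that a site is an exact match to the reverse complement of miRNA nucleotides 2 to 8, ignores site context and binding energy, and uses hypothetical function names.

```python
from typing import Dict, List

_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def _as_rna(seq: str) -> str:
    """Upper-case the sequence and convert DNA-style T to U."""
    return seq.upper().replace("T", "U")


def seed_site(mirna: str) -> str:
    """Target-side sequence pairing with miRNA nucleotides 2-8 (the seed)."""
    seed = _as_rna(mirna)[1:8]
    return seed.translate(_COMPLEMENT)[::-1]


def find_sites(utr: str, site: str) -> List[int]:
    """0-based start positions of exact matches to `site` in `utr`."""
    utr = _as_rna(utr)
    return [i for i in range(len(utr) - len(site) + 1)
            if utr[i:i + len(site)] == site]


def seed_site_changes(utr: str, pos: int, alt: str, mirna: str) -> Dict[str, List[int]]:
    """Compare seed sites before and after substituting `alt` at 0-based `pos`."""
    reference = _as_rna(utr)
    if not 0 <= pos < len(reference):
        raise ValueError("substitution position lies outside the sequence")
    mutated = reference[:pos] + _as_rna(alt) + reference[pos + 1:]
    site = seed_site(mirna)
    before = set(find_sites(reference, site))
    after = set(find_sites(mutated, site))
    return {"lost": sorted(before - after), "gained": sorted(after - before)}
```

For the let-7a sequence UGAGGUAGUAGGUUGUAUAGUU and the toy UTR GGGGCUACCUCGGGG, substituting G at index 6 removes the seed site that starts at index 4, whereas a substitution at index 0 leaves the site untouched.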
In SomamiR, each gene is represented by a single web page to provide all somatic mutations that alter miRNA target sites in the gene, as well as associate with specific types of cancer. Each web page representing a gene can also be accessed through several browsable tables that are linked from the database homepage. These browsable tables contain somatic mutations in miRNAs and respective target sites. Furthermore, experimental evidence linking these mutations to various cancer types is also incorporated into these tables. Two additional tables can be used to browse database entries in the context of association studies and KEGG gene pathways. SomamiR also allows the following criteria for searching against the database: miRNA, gene symbol, RefSeq ID, and chromosome location. The search can be performed using the form on the website or by uploading a batch file with multiple terms. For users who are interested in parsing the database for further analysis, the complete content of SomamiR is also available for download at http://compbio.uthsc.edu/SomamiR/download/.
Early Detection Research Network (EDRN)
While the above-described databases are exclusively for the discovery and understanding of miRNAs, other repositories can contain similar information from various types of biomarkers. One such effort in categorizing data related to multiple types of biomarkers is EDRN from the National Cancer Institute 52 (edrn.nci.nih.gov). While EDRN is not exclusively designated as a sequence-level repository, biomarker data, including miRNAs, can be found under the section of
Conclusion
While data repositories were the main focus of this review, miRNA-target prediction also presents other interesting questions in bioinformatics. It is more challenging to predict miRNA targets in animals than in plants, due to imperfect base pairing with target sites. This illustrates a limitation of any prediction algorithm that stems from the complexity of many biological systems. Further improvements will be needed to achieve accurate predictions of miRNA targets. In addition to the goal of predicting miRNA targets, the selected miRNA databases reviewed above share the commonality of relying on textual information, mostly from PubMed, in the retrieval of relevant literature. It is also important for readers to note that one of the most important efforts is the standardization of biomarker nomenclature, including that of various miRNAs, by EDRN. Standardization will improve the interoperability among different research groups and databases. Furthermore, nearly all curators of the above repositories recognized that major growth of data will result from sequencing. With the advent of new technologies, there is no doubt that more miRNAs will be discovered, resulting in an exciting new era for researchers.
Author Contributions
Wrote the first draft of the manuscript: ACM, JSW, T-TT. Contributed to the writing of the manuscript: ACM, JSW, T-TT. Jointly developed the structure and arguments for the article: ACM, JSW, T-TT. Made the critical revisions and approved the final version: ACM, JSW, T-TT. All authors reviewed and approved the final article.
Footnotes
Acknowledgment
We thank Ashley Pedicini for her assistance in the preparation of this article.
