Abstract
RNA-seq is gradually becoming the dominating technique employed to access the global gene expression in biological samples, allowing more flexible protocols and robust analysis. However, the nature of RNA-seq results imposes new data-handling challenges when it comes to computational analysis. With the increasing employment of machine learning (ML) techniques in biomedical sciences, databases that could provide curated data sets treated with state-of-the-art approaches already adapted to ML protocols, become essential for testing new algorithms. In this study, we present the Benchmarking of ARtificial intelligence Research: Curated RNA-seq Database (BARRA:CuRDa). BARRA:CuRDa was built exclusively for cancer research and is composed of 17 handpicked RNA-seq data sets for Homo sapiens that were gathered from the Gene Expression Omnibus, using rigorous filtering criteria. All data sets were individually submitted to sample quality analysis, removal of low-quality bases and artifacts from the experimental process, removal of ribosomal RNA, and estimation of transcript-level abundance. Moreover, all data sets were tested using standard approaches in the field, which allows them to be used as benchmark to new ML approaches. A feature selection analysis was also performed on each data set to investigate the biological accuracy of basic techniques. Results include genes already related to their specific tumoral tissue a large amount of long noncoding RNA and pseudogenes. BARRA:CuRDa is available at http://sbcb.inf.ufrgs.br/barracurda.
Get full access to this article
View all access options for this article.
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
