Analysis of high-throughput biological data using their rank values

Abstract

High-throughput biological technologies are routinely used to generate gene expression profiling and cytogenetics data. To achieve high performance, the methods available in the literature have become increasingly specialized and often require substantial computational resources. Here, we propose a new, versatile method based on the rank values obtained by ordering the data. We use linear algebra and the Perron–Frobenius theorem, and we extend a previously presented method for detecting differentially expressed genes to the detection of recurrent copy number aberrations. One result derived from the proposed method is a one-sample Student's t-test based on rank values. To our knowledge, the proposed method is the only one that applies to both gene expression profiling and cytogenetics data sets. The method is fast and deterministic, and it has a low computational load. Probabilities are associated with genes, which allows a statistically significant subset of the data set to be selected. Stability scores are also introduced as quality parameters.

The performance and comparative analyses were carried out using real data sets. The proposed method can be accessed through an R package available from the CRAN (Comprehensive R Archive Network) website: https://cran.r-project.org/web/packages/fcros.
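As a rough illustration of what a one-sample Student's t-test on rank values can look like, the sketch below ranks genes within each sample, scales the ranks by the number of genes, and tests each gene's mean scaled rank against the average scaled rank. It is not the algorithm implemented in the fcros package. The function names (`normalized_ranks`, `rank_t_statistics`), the toy matrix, and the choice of null mean are assumptions made for this example only.

```python
import math
import statistics


def normalized_ranks(values):
    """Rank values in ascending order and scale to (0, 1]; ties share their average rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of tied values starting at position i.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank / n
        i = j + 1
    return ranks


def rank_t_statistics(matrix):
    """One-sample t statistic per gene, computed on within-sample scaled ranks.

    matrix[g][s] is the value (for example a log ratio) of gene g in sample s.
    Assumption: the null mean is the average scaled rank, (n + 1) / (2 n).
    """
    n_genes = len(matrix)
    n_samples = len(matrix[0])
    mu0 = (n_genes + 1) / (2 * n_genes)
    # Rank the genes separately within each sample (column).
    columns = [
        normalized_ranks([matrix[g][s] for g in range(n_genes)])
        for s in range(n_samples)
    ]
    t_values = []
    for g in range(n_genes):
        r = [columns[s][g] for s in range(n_samples)]
        mean = statistics.mean(r)
        sd = statistics.stdev(r)
        if sd == 0:
            # Identical ranks in every sample: the statistic is unbounded.
            t = 0.0 if mean == mu0 else math.copysign(math.inf, mean - mu0)
        else:
            t = (mean - mu0) / (sd / math.sqrt(n_samples))
        t_values.append(t)
    return t_values


# Toy data (hypothetical): 5 genes, 3 samples.
matrix = [
    [2.1, 1.9, 2.4],     # consistently highest
    [0.1, -0.2, 0.0],
    [-0.1, 0.3, 0.1],
    [0.0, 0.1, -0.3],
    [-2.0, -1.8, -2.2],  # consistently lowest
]
t_values = rank_t_statistics(matrix)
```

Under the usual assumptions of the t-test, a p-value for each gene would come from a Student's t distribution with `n_samples - 1` degrees of freedom. The standard library has no t distribution, so that step is left out here.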

Keywords

Microarray, sequencing, differentially expressed genes, count reads, recurrent copy number aberration


