Abstract
Automated high-throughput sequencing of cDNA clones from numerous libraries has generated a wealth of information about both genome sequence and relative transcript abundance. A common statistical challenge in analyzing library sequences is to infer whether the same transcript is differentially expressed under two conditions, such as normal and diseased tissue. In contrast to the continuously variable intensity measurements from microarray experiments, data from cDNA library sequencing take the form of discrete counts of the incidence of a clone or transcript in a finite sample. In this paper, we first propose a statistical model for data generated by cDNA library sequencing. The model is based on the Poisson distribution mixed with the generalized inverse Gaussian distribution (PGIG), introduced by Sichel (1971, 1975). PGIG has been used in applications such as population abundance modeling, ecological studies, and word frequencies in publications. Using data from the literature, we show that the proposed model provides a good fit to the observed data. Building on this model, we then develop an empirical Bayesian significance test (EBST) for inferring the statistical significance of differential gene expression from discrete data.
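To make the idea of a Poisson mixture concrete, the following is a minimal simulation sketch, not the authors' fitting procedure or the EBST itself. It assumes the special case in which the mixing distribution is an inverse Gaussian (the generalized inverse Gaussian with index -1/2) rather than the general PGIG family, and the function names and parameter values (`mu`, `lam`, `seed`) are hypothetical choices for illustration. Each simulated transcript gets its own latent rate, and the observed clone count is then Poisson with that rate, which produces counts that are more dispersed than a plain Poisson sample.

```python
import math
import random


def sample_inverse_gaussian(mu, lam, rng):
    """Draw one value from an inverse Gaussian with mean mu and shape lam.

    Uses the transformation method of Michael, Schucany and Haas (1976).
    """
    nu = rng.gauss(0.0, 1.0)
    y = nu * nu
    x = (mu
         + (mu * mu * y) / (2.0 * lam)
         - (mu / (2.0 * lam)) * math.sqrt(4.0 * mu * lam * y + mu * mu * y * y))
    # Accept the smaller root with probability mu / (mu + x), else use the other root.
    if rng.random() <= mu / (mu + x):
        return x
    return mu * mu / x


def sample_poisson(rate, rng):
    """Draw one Poisson count by Knuth's multiplication method (suitable for modest rates)."""
    limit = math.exp(-rate)
    count = 0
    product = rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_clone_counts(n, mu, lam, seed=None):
    """Simulate n clone counts from a Poisson-inverse-Gaussian mixture.

    Assumption for illustration only: each transcript's latent rate is
    inverse Gaussian(mu, lam), and its observed count is Poisson(rate).
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n):
        rate = sample_inverse_gaussian(mu, lam, rng)
        counts.append(sample_poisson(rate, rng))
    return counts


if __name__ == "__main__":
    counts = simulate_clone_counts(5000, mu=2.0, lam=1.0, seed=1)
    mean = sum(counts) / len(counts)
    var = sum((c - mean) ** 2 for c in counts) / (len(counts) - 1)
    # Under this mixture the variance exceeds the mean (overdispersion).
    print("mean:", round(mean, 3), "variance:", round(var, 3))
```

Under these assumptions the mixture has mean `mu` and variance `mu + mu**3 / lam`, so the sample variance should be noticeably larger than the sample mean, unlike a plain Poisson sample where the two are roughly equal.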
