A bivariate zero-inflated negative binomial model and its applications to biomedical settings

Abstract

The zero-inflated negative binomial distribution has been widely used for count data analyses in various biomedical settings due to its capacity of modeling excess zeros and overdispersion. When there are correlated count variables, a bivariate model is essential for understanding their full distributional features. Examples include measuring correlation of two genes in sparse single-cell RNA sequencing data and modeling dental caries count indices on two different tooth surface types. For these purposes, we develop a richly parametrized bivariate zero-inflated negative binomial model that has a simple latent variable framework and eight free parameters with intuitive interpretations. In the scRNA-seq data example, the correlation is estimated after adjusting for the effects of dropout events represented by excess zeros. In the dental caries data, we analyze how the treatment with Xylitol lozenges affects the marginal mean and other patterns of response manifested in the two dental caries traits. An R package “bzinb” is available on Comprehensive R Archive Network.

Keywords

Bivariate zero-inflated negative binomial model, dental caries, expectation-maximization algorithm, single-cell RNA sequencing

1. Introduction

1.1 Motivation

In biomedical research, count data often include a large number of zeros. For example, quite often people do not have dental caries,¹ and the majority of a population does not make a hospital visit during a given year.² In omics data, either because of technological reasons related to sequencing or due to some biological reasons, counts are often very sparse.^4,5,3

Such count data with excess number of zeros are frequently modeled using the zero-inflated Poisson (ZIP) or the zero-inflated negative binomial (ZINB) distributions.^7,6 The negative binomial distribution, of which the Poisson distribution is a limiting case, has a capacity of modeling overdispersion that accounts for the heterogeneity of the incidence processes and thus is widely used in practice. Zero-inflation refers to the phenomenon where the proportion of zeros is greater than is expected by the corresponding baseline distribution.

However, these univariate models only provide insights about marginal distributions and do not inform the joint characterization of two dependent count variables. For example, when the association of two genes is of interest, a bivariate model is needed to effectively estimate the dependence. On the other hand, in caries and other clinical trials, a joint model may be beneficial when univariate analyses are less efficient for comparing marginal mean outcomes between treatment groups. As in the choice of multivariate analysis of variance versus multiple analyses of variance for the hypothesis testing of continuous random variables, multivariate count models can control type-I error more efficiently than multiple testing based on univariate ZIP or ZINB models.⁸ To meet these different purposes, we develop a flexible bivariate ZINB model that characterizes the joint distribution of two correlated count random variables.

The two motivating examples are used to illustrate the utility of the proposed bivariate model for count data in biomedical research. The first one involves the estimation of dependence of two genes in single-cell RNA sequencing (scRNA-seq) data that have a significant amount of dropouts, which are represented by excess zeros under model assumptions. The pairwise dependencies are important as they serve as building blocks for identifying pathways. However, possible dropout events obfuscate measuring the dependence of gene pairs in the absence of dropout. For this task, we estimate the dependence after controlling for the dropout events through a bivariate zero-inflation model. The second example is characterizing the distribution of the number of dental caries detected on two different tooth surfaces. A treatment—Xylitol lozenges—may not only affect the mean count for both traits to varying degrees but also may modify the structure of the joint distribution. A bivariate model is needed to adequately characterize the joint distribution and estimate marginal effects on outcomes with possibly improved precision. More details of the two examples are given in the following subsections.

1.2 Measuring genewise dependence in scRNA-seq data

Single-cell RNA sequencing (scRNA-seq) is a high throughput sequencing technology that profiles gene expression at a cell’s resolution.⁹ This is in contrast to bulk RNA sequencing (RNA-seq), where a group of cells is sequenced together and, consequently, no cell-level information is available in the data. As a price for cell-level resolution, scRNA-seq loses some information through the so-called “dropout” phenomenon; during the sequencing steps (and the capturing steps, e.g. in the 10X sequencing platform) of scRNA-seq, a large number of RNA molecules go undetected. Consequently, the observed count data include a greater number of zeros than would be expected given the number of molecules sequenced and our a priori knowledge of transcription rates at individual loci.^4,10,11 In contrast, in bulk RNA-seq, excess zeros are less frequently observed.¹⁰ For these reasons, negative binomial models have been extensively used for bulk RNA-seq data,^12,14,13 and ZINB models are typically used for scRNA-seq data.^5,4 Although recent technologies such as unique molecular identifiers (UMIs) effectively remove the dropout events, the problem still persists in data obtained from many of the non-UMI-based platforms.

Statistical inferences at both individual gene level^15,16 and gene set level, for example, pathways, can be misleading without considering the excess zeros caused by dropouts. Inference of gene-gene dependence, for example, the correlation-based method, has been widely used in pathway analysis of bulk RNA-seq data¹⁷ and in recent scRNA-seq data analyses.^{15,16,19,20,18} However, the conventional Pearson correlation of two genes with significant dropouts in scRNA-seq data may not properly reflect the underlying gene-gene dependence.

For example, a pair of genes whose expressions are highly correlated in the absence of dropouts would have an attenuated correlation in the observed data if only one of the genes has a large amount of dropouts. On the other hand, a pair of uncorrelated genes would have a higher correlation when both genes have dropouts in a substantial portion of the sample. This systematic bias will not vanish without adjusting for the effects of the dropout events, regardless of the dependence measure used, such as Pearson correlation or mutual information.

Two strategies have been considered to address the bias in scRNA-seq data. Imputation methods^21,18,22 aim to provide expression levels free of the excess zeros by imputing them. While imputation methods are versatile in that they provide ready-to-use data, they are not deterministic, giving different results on every run, and analysis results over multiple imputations may have to be combined. The other strategy is an estimation of the count distribution.^11,23 Once information about the distribution of the expressions before dropouts has been obtained, one can do downstream analyses such as measuring the dependence of the before-dropout expressions. However, many of the methods taking this approach focus on modeling marginal distributions and do not explicitly posit a dependence structure between two genes.

Our proposed method, to be introduced in Section 1.4, takes the distribution estimation approach where a bivariate distribution explicitly addresses the dependence structure. Specifically, our method is built on a bivariate generalization of the ZINB model, where the dropout probabilities are modeled using the zero-inflation parameters. In this approach, the full joint distribution is estimated using a bivariate count model with zero-inflation and the underlying distribution before dropouts is uncovered to estimate the dependence.

1.3 Characterizing the joint count distribution of two dental caries traits

The second example is about describing the pattern of the number of dental caries occurring at two different tooth surface types in a randomized clinical trial (the Xylitol for Adult Caries Trial, or X-ACT).²⁴ In one of its secondary studies, the effect of Xylitol lozenges on the incidence of three different dental caries outcomes was examined with univariate models according to the type of tooth surface: smooth-surface caries, proximal-surface caries, and occlusal-surface caries.²⁵ To be more specific, we refer to dental caries as the annualized D$_{23}$FS (cavitated caries lesions) caries increment. There are often a large number of zero counts for dental caries, and the ZINB model is thus frequently used in the literature,²⁶ as exemplified by this study. Of note, the excess zeros in this example have a different nature than those in scRNA-seq data in the sense that those zeros more likely represent a lack of incidence from the “non-susceptible” population and are not induced by some unwanted events such as dropouts in the scRNA-seq setting.

However, the original analysis can only tell us about the marginal distribution of each trait and does not provide any information of their joint structure. For example, Xylitol lozenges may have lowered the correlation between the proximal- and smooth-surface caries by reducing the concordant pairs (zero for both or non-zero for both) while increasing the discordant pairs (zero for one and non-zero for the other). A joint analysis may give additional clues for identifying the mechanism of Xylitol lozenges and can only be obtained through joint modeling of the traits. We analyze the joint distribution of a pair of traits using our bivariate extension of the ZINB distribution.

The bivariate models could also be used to boost statistical efficiency in marginal analyses, since the information contained in one variable could be borrowed in making inferences on the other. For instance, in testing the group difference of mean incidence rates on smooth surfaces, the joint incidence rate of smooth and proximal surfaces could be estimated with a bivariate model and the group differences could be tested for each margin at an enhanced precision.

1.4 Review of existing bivariate count models and the proposed method

In building bivariate count models, it is noteworthy that a variety of bivariate models that fit overdispersed count data have been proposed: bivariate Poisson mixture models,^28,27,29 bivariate generalized Poisson models,³⁰ and copula models.^31,13,32 These models can be further extended to flexibly accommodate excess zeros by introducing zero-inflation parameters or composing hurdle models. For a comprehensive survey of bivariate count models, refer to Cameron and Trivedi³³ and Chou and Steenhard.³⁴

Among the plethora of models proposed in the literature, many of the bivariate Poisson mixture models and bivariate generalized Poisson models take overly complicated forms: they do not have simple marginal distributions (e.g. the GBIVARNB model by Gurmu and Elder²⁸), and their parameters are hard to interpret and/or computationally expensive to estimate. Copula-based bivariate models can be alternatives to the mixture models, but they depend on the underlying copula models and can be difficult to interpret.

Many existing bivariate negative binomial models are primarily designed for modeling marginal means rather than pairwise dependence. For example, Gurmu and Elder²⁸ discussed a bivariate negative binomial distribution (BIVARNB), but their model is specified by only four parameters, which may not provide sufficient flexibility to delineate diverse distributional structure. For such a bivariate joint distribution, four parameters are needed to specify the first two marginal moments of each of the two variables, while another parameter is needed solely for modeling the dependence. Subsequently, Wang³⁵ extended BIVARNB to a zero-inflated BIVARNB regression setting. In this model, zero-inflation is dictated by a single parameter, implying that when one variable either drops out or not, the other variable behaves exactly the same, which may not be the case for scRNA-seq data; one gene can drop out, while the other does not in a sample. Instead, it is possible to have three free parameters for the full joint zero-inflation probability structure.³⁶

We propose a bivariate zero-inflated negative binomial model with eight parameters: five parameters for the negative binomial part and another three free parameters for the zero-inflation part. This model allows analyzing the dependence of two zero-inflated count variables parametrically but with more flexibility than existing models. Specifically, five parameters of our proposed model characterize all moments of the first two orders before zero-inflation, and the three zero-inflation parameters model the dropouts or the membership of non-susceptible groups with full flexibility.

Besides the flexibility and the provision of the dependence measure, the proposed BZINB model has the following features. The parameters have simple latent variable interpretations, the joint distribution can be marginalized into the corresponding univariate ZINB distributions, and the model can be easily reduced to a non-zero-inflated model, or BNB, by dropping the zero-inflation parameters.

The rest of the article is organized as follows. In Section 2, we describe how the model is constructed based, successively, on a bivariate negative binomial (BNB) model and a bivariate ZINB (BZINB) model. We present the maximum likelihood estimator using the expectation-maximization (EM) algorithm in Section 3. In Section 4, we illustrate how well the model fits data and how the model-based dependence measure behaves in contrast to naive measures using mouse paneth scRNA-seq data and study the performance of the dependence measure via simulations. In Section 5, we analyze the dental caries clinical trial data with the BZINB model to characterize the joint distribution of two dental caries traits, and illustrate the use of the model in testing the group differences in the marginal means of the two traits. In Section 6, we address the limitations of the models and discuss potential extensions. Section 7 provides software information.

2. The model

2.1 A BNB model

In constructing the BZINB model, to induce dependence and zero-inflation, layers of latent variables were used as in Kocherlakota and Kocherlakota³⁷ and Li et al.³⁶ We first introduce a simpler model, the BNB model in this subsection, and then generalize it to the BZINB model in Section 2.2.

The key assumption that induces the dependence structure of BNB (and BZINB) is that the mean parameters of two Poisson random variables are sums of gamma random variables that share a common gamma random variable. Let $R_{j} \sim G a m m a (α_{j}, β)$ for $j = 0, 1, 2$ and $R_{j}$ ’s be mutually independent, where $α_{j}$ and $β$ are the shape and scale parameters, respectively. Then $(R_{0} + R_{1}, R_{0} + R_{2})$ is bivariate gamma distributed, denoted as $B G a m m a (α_{0}, α_{1}, α_{2}, β)$ . To account for heterogeneous scales of the two Poisson mean variables, we introduce an additional parameter $δ \in R^{+}$ . Then, a pair $(X_{1}, X_{2})$ of Poisson variables with means $(R_{0} + R_{1}, δ (R_{0} + R_{2}))$ follows a bivariate negative binomial distribution, denoted as

(X_{1}, X_{2}) \sim B N B (α_{0}, α_{1}, α_{2}, β_{1}, β_{2})

(1)

where we reparametrize $(β, δ)$ as $(β_{1}, β_{2}) = (β, δ β)$, and the probability mass function is given as

\begin{aligned} P_{BNB} (X_{1} = x_{1}, X_{2} = x_{2}) & = ∭_{R_{+}^{3}} \frac{(R_{0} + R_{1})^{x_{1}} (R_{0} + R_{2})^{x_{2}} e^{- \frac{1 + β_{1} + β_{2}}{β_{1}} R_{0} - \frac{1 + β_{1}}{β_{1}} R_{1} - \frac{1 + β_{2}}{β_{1}} R_{2}}}{x_{1}! x_{2}! Γ (α_{0}) Γ (α_{1}) Γ (α_{2}) β_{1}^{α_{0} + α_{1} + α_{2} + x_{2}}} R_{0}^{α_{0} - 1} R_{1}^{α_{1} - 1} R_{2}^{α_{2} - 1} β_{2}^{x_{2}} \prod_{j = 0}^{2} d R_{j} \times 1_{(x_{1}, x_{2}) \in N_{0}^{2}} \\ & = \sum_{k = 0}^{x_{1}} \sum_{m = 0}^{x_{2}} \binom{α_{0} + x_{1} + x_{2} - k - m - 1}{α_{0} + x_{2} - m - 1} \binom{α_{0} + x_{2} - m - 1}{α_{0} - 1} \binom{α_{1} + k - 1}{α_{1} - 1} \binom{α_{2} + m - 1}{α_{2} - 1} \\ & \quad \times \frac{β_{1}^{x_{1}} β_{2}^{x_{2}} (β_{1} + β_{2} + 1)^{k + m - x_{1} - x_{2} - α_{0}}}{(β_{1} + 1)^{k + α_{1}} (β_{2} + 1)^{m + α_{2}}} 1_{(x_{1}, x_{2}) \in N_{0}^{2}} \end{aligned}

where $R_{+}$ and $N_{0}$ denote the positive real and nonnegative integer spaces, respectively, and superscripts represent the dimension of the product space. The support indicators will be omitted throughout this article when the context is clear. The derivation is provided in the Web Section A of the Supplemental Materials.
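The finite double-sum form of the probability mass function can be evaluated directly. The following is a minimal sketch using only the Python standard library; the function names `log_gen_binom` and `bnb_pmf` are illustrative and are not taken from the article or from the bzinb package. It assumes that the binomial coefficients with real-valued $α_{j}$ are read as ratios of gamma functions, and it works on the log scale to avoid overflow.

```python
import math


def log_gen_binom(a, n):
    """Log of the generalized binomial coefficient C(a + n - 1, a - 1),
    read as Gamma(a + n) / (Gamma(a) * n!) for real a > 0 and integer n >= 0."""
    return math.lgamma(a + n) - math.lgamma(a) - math.lgamma(n + 1)


def bnb_pmf(x1, x2, a0, a1, a2, b1, b2):
    """P_BNB(X1 = x1, X2 = x2) evaluated through the double sum over (k, m)."""
    if x1 < 0 or x2 < 0:
        return 0.0
    total = 0.0
    for k in range(x1 + 1):
        for m in range(x2 + 1):
            rest = x1 + x2 - k - m  # counts attributed to the shared component R_0
            log_term = (
                # product of the first two binomial coefficients of the sum
                math.lgamma(a0 + rest) - math.lgamma(a0)
                - math.lgamma(x1 - k + 1) - math.lgamma(x2 - m + 1)
                + log_gen_binom(a1, k)
                + log_gen_binom(a2, m)
                + x1 * math.log(b1) + x2 * math.log(b2)
                - (rest + a0) * math.log(b1 + b2 + 1.0)
                - (k + a1) * math.log(b1 + 1.0)
                - (m + a2) * math.log(b2 + 1.0)
            )
            total += math.exp(log_term)
    return total
```

The cost of one evaluation grows with $(x_{1} + 1)(x_{2} + 1)$, so this direct form is practical mainly for small to moderate counts.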

This BNB model is marginally negative binomial, as we know from the construction procedure that both $X_{1}$ and $X_{2}$ are Poisson random variables with means marginally Gamma distributed, respectively:

X_{j} \sim N B (α_{0} + α_{j}, \frac{1}{β_{j} + 1}) for j = 1, 2

where the random variable $X \sim N B (ν, ϕ)$ can be interpreted as the number of failures observed before the $ν$th success, with success probability $ϕ$ in each independent trial; that is, its probability mass function is expressed as

P_{NB} (x; ν, ϕ) = \binom{x + ν - 1}{x} ϕ^{ν} (1 - ϕ)^{x}

Interpretation of the BNB parameters is straightforward: $α_{0}$, $α_{1}$, and $α_{2}$ are the shape parameters of the latent variables, where a larger $α_{0}$ implies a larger amount of shared components in $X_{1}$ and $X_{2}$ and thus a larger correlation; $β_{1}$ and $β_{2}$ control the scale of $X_{1}$ and $X_{2}$, respectively. Note that in the scRNA-seq data context, $X_{1}$ and $X_{2}$ may represent the before-dropout expression level of each of two genes in a cell in the absence of dropout events, which we rarely observe in practice.

The first two moments and the correlation of a BNB random pair are given as

\begin{aligned} E (X_{j}) & = (α_{0} + α_{j}) β_{j}, j = 1, 2, \\ Var (X_{j}) & = (α_{0} + α_{j}) β_{j} (β_{j} + 1), j = 1, 2, \\ Cov (X_{1}, X_{2}) & = α_{0} β_{1} β_{2}, \\ Cor (X_{1}, X_{2}) & = \frac{α_{0}}{\sqrt{(α_{0} + α_{1}) (α_{0} + α_{2})}} \sqrt{\frac{β_{1} β_{2}}{(β_{1} + 1) (β_{2} + 1)}} . \end{aligned}

(2)

Note that this distribution only allows positive correlation. See Section 6 for more discussion.
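The latent-variable construction and the moment formulas in equation (2) can be checked against each other by simulation. The sketch below is illustrative only: the helper names, the parameter values, and the seed are assumptions, and the Poisson sampler is Knuth's simple multiplication method, chosen because the standard library has no Poisson generator.

```python
import math
import random


def poisson_draw(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    limit = math.exp(-lam)
    count, prod = 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count


def bnb_draw(a0, a1, a2, b1, b2, rng):
    """One (X1, X2) pair: Poisson counts whose means share the gamma variable R_0."""
    r0 = rng.gammavariate(a0, b1)  # second argument is the scale
    r1 = rng.gammavariate(a1, b1)
    r2 = rng.gammavariate(a2, b1)
    delta = b2 / b1
    return poisson_draw(r0 + r1, rng), poisson_draw(delta * (r0 + r2), rng)


def bnb_theory(a0, a1, a2, b1, b2):
    """Marginal means and correlation as given in equation (2)."""
    mean1 = (a0 + a1) * b1
    mean2 = (a0 + a2) * b2
    corr = (a0 / math.sqrt((a0 + a1) * (a0 + a2))
            * math.sqrt(b1 * b2 / ((b1 + 1.0) * (b2 + 1.0))))
    return mean1, mean2, corr


def sample_summary(pairs):
    """Sample means and Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    m1 = sum(x for x, _ in pairs) / n
    m2 = sum(y for _, y in pairs) / n
    s11 = sum((x - m1) ** 2 for x, _ in pairs)
    s22 = sum((y - m2) ** 2 for _, y in pairs)
    s12 = sum((x - m1) * (y - m2) for x, y in pairs)
    return m1, m2, s12 / math.sqrt(s11 * s22)


rng = random.Random(2024)            # arbitrary seed for reproducibility
params = (2.0, 1.0, 1.5, 1.0, 0.5)   # hypothetical (a0, a1, a2, b1, b2)
draws = [bnb_draw(*params, rng) for _ in range(20000)]
empirical = sample_summary(draws)
theory = bnb_theory(*params)
```

With a sample of this size, `empirical` should lie close to `theory`; the agreement is only approximate because of Monte Carlo error.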

Maher³⁸ developed another bivariate negative binomial distribution that is a constrained case of BNB in a sense that the marginal means and variances are the same for both variables.

One can further generalize this BNB model into an $m$-variate negative binomial model by adding common latent gamma parameter(s) to the $m$ gamma variables.

2.2 A BZINB model

In this subsection, we generalize the BNB model to the BZINB model by including zero-inflation components. Since BZINB is also a generalization of the univariate ZINB model, we illustrate the construction of the univariate ZINB model first and move to the bivariate version.

A univariate negative binomial model, $N B (ν, ϕ)$ , can be generalized to allow zero-inflation by having an additional parameter, $π$ : $Z I N B (ν, ϕ, π)$ . The ZINB model has a latent variable interpretation. Let $X$ follow $N B (ν, ϕ)$ and $E$ denote the zero-inflation indicator having $1$ with probability of $π$ and $0$ otherwise, independently of $X$ . Then $Y \equiv (1 - E) X$ follows $Z I N B (ν, ϕ, π)$ with the probability measure of $P_{ZINB} (y; ν, ϕ, π) = (1 - π) P_{NB} (y; ν, ϕ) + π ζ (y)$ , where $ζ (a) \equiv 1_{(a = 0)}$ .

Similarly, a multivariate zero-inflated random variable can be constructed using a latent variable that follows the multivariate Bernoulli distribution as in the Poisson case.³⁶ For a bivariate distribution, suppose we have a random vector $E \equiv (E_{1}, E_{2}, E_{3}, E_{4})^{⊤} \sim M N (1, π)$ , where $M N (1, π)$ denotes the multinomial distribution with a single trial and an associated probability of $π \equiv (π_{1}, π_{2}, π_{3}, π_{4})^{⊤}$ with $1^{⊤} π = 1$ . Now the BZINB distribution can be formulated as

(Y_{1}, Y_{2}) := ((E_{1} + E_{2}) X_{1}, (E_{1} + E_{3}) X_{2})

(3)

where $(X_{1}, X_{2}) \sim B N B (α_{0}, α_{1}, α_{2}, β_{1}, β_{2})$ and $E_{1}$, $E_{2}$, $E_{3}$, and $E_{4}$ are the indicators of observing both $X_{1}$ and $X_{2}$, only $X_{1}$, only $X_{2}$, and none of them, respectively. We say $(Y_{1}, Y_{2}) \sim B Z I N B (α_{0}, α_{1}, α_{2}, β_{1}, β_{2}, π_{1}, π_{2}, π_{3})$. A simpler model with a restriction of $π_{2} = π_{3} = 0$ can also be considered, as by Wang.³⁵

From the construction of the latent variables in the above paragraph, the probability mass function of a BZINB variable is derived as

\begin{aligned} P_{BZINB} (Y_{1} = y_{1}, Y_{2} = y_{2}; α, β, π) & = π_{1} P_{BNB} (y_{1}, y_{2}; α_{0}, α_{1}, α_{2}, β_{1}, β_{2}) \\ + π_{2} P_{NB} (y_{1}; α_{0} + α_{1}, \frac{1}{β_{1} + 1}) ζ (y_{2}) + π_{3} P_{NB} (y_{2}; α_{0} + α_{2}, \frac{1}{β_{2} + 1}) ζ (y_{1}) \\ + π_{4} ζ (y_{1} + y_{2}) \end{aligned}

(4)

where $α = (α_{0}, α_{1}, α_{2})^{⊤}$, $β = (β_{1}, β_{2})^{⊤}$, and $π = (π_{1}, π_{2}, π_{3}, π_{4})^{⊤}$ with $1^{⊤} π = 1$. Here, the parameters $α$ and $β$ have the same interpretation as in BNB but in the presence of dropouts, and $π$ indicates the dropout probability, where $π_{1}$, $π_{2}$, $π_{3}$, and $π_{4}$ are the probabilities that none, $Y_{2}$ only, $Y_{1}$ only, and both were dropped out, respectively.
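Equation (4) is a four-component mixture and translates almost line by line into code. The sketch below uses only the Python standard library; the function names and argument layout are illustrative assumptions, not the interface of the bzinb package. The BNB term is evaluated with the double-sum form of Section 2.1, with the binomial coefficients read as gamma-function ratios.

```python
import math


def nb_pmf(x, nu, phi):
    """Negative binomial pmf: C(x + nu - 1, x) * phi**nu * (1 - phi)**x."""
    if x < 0:
        return 0.0
    return math.exp(math.lgamma(x + nu) - math.lgamma(nu) - math.lgamma(x + 1)
                    + nu * math.log(phi) + x * math.log(1.0 - phi))


def bnb_pmf(x1, x2, a0, a1, a2, b1, b2):
    """BNB pmf through the finite double sum over (k, m)."""
    total = 0.0
    for k in range(x1 + 1):
        for m in range(x2 + 1):
            rest = x1 + x2 - k - m
            total += math.exp(
                math.lgamma(a0 + rest) - math.lgamma(a0)
                - math.lgamma(x1 - k + 1) - math.lgamma(x2 - m + 1)
                + math.lgamma(a1 + k) - math.lgamma(a1) - math.lgamma(k + 1)
                + math.lgamma(a2 + m) - math.lgamma(a2) - math.lgamma(m + 1)
                + x1 * math.log(b1) + x2 * math.log(b2)
                - (rest + a0) * math.log(b1 + b2 + 1.0)
                - (k + a1) * math.log(b1 + 1.0)
                - (m + a2) * math.log(b2 + 1.0)
            )
    return total


def bzinb_pmf(y1, y2, alpha, beta, pi):
    """Equation (4): mixture over the four observation patterns."""
    a0, a1, a2 = alpha
    b1, b2 = beta
    p1, p2, p3, p4 = pi  # both observed, only X1, only X2, neither
    value = p1 * bnb_pmf(y1, y2, a0, a1, a2, b1, b2)
    if y2 == 0:  # second count dropped out
        value += p2 * nb_pmf(y1, a0 + a1, 1.0 / (b1 + 1.0))
    if y1 == 0:  # first count dropped out
        value += p3 * nb_pmf(y2, a0 + a2, 1.0 / (b2 + 1.0))
    if y1 == 0 and y2 == 0:  # both dropped out
        value += p4
    return value
```

Setting `pi = (1, 0, 0, 0)` recovers the BNB probability mass function, which mirrors the reduction to the non-zero-inflated model mentioned in Section 1.4.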

In scRNA-seq data, $Y_{1}$ and $Y_{2}$ are the observed number of expressions for each of two genes in a cell. The term observed was used in contrast to before-dropout in a sense that an unobserved subset of the zeros are excess zeros due to dropouts.

This BZINB distribution is marginally ZINB, since the latent random variables, $X_{1}$ and $X_{2}$, are marginally negative binomial random variables (from Section 2.1) with probabilities of being observed, $π_{1} + π_{2}$ and $π_{1} + π_{3}$, respectively:

Y_{j} \sim Z I N B (α_{0} + α_{j}, \frac{1}{β_{j} + 1}, π_{4 - j} + π_{4}) for j = 1, 2

(5)

The first two moments of a BZINB pair are given as

\begin{aligned} E (Y_{j}) & = (π_{1} + π_{j + 1}) (α_{0} + α_{j}) β_{j}, \end{aligned}

(6)

\begin{aligned} Var (Y_{j}) & = (α_{0} + α_{j})^{2} β_{j}^{2} (π_{1} + π_{j + 1}) (1 - π_{1} - π_{j + 1}) + (α_{0} + α_{j}) β_{j} (β_{j} + 1) (π_{1} + π_{j + 1}) for j = 1, 2, \end{aligned}

(7)

\begin{aligned} Cov (Y_{1}, Y_{2}) & = \{α_{0} + (α_{0} + α_{1}) (α_{0} + α_{2})\} β_{1} β_{2} π_{1} - (α_{0} + α_{1}) (α_{0} + α_{2}) β_{1} β_{2} (π_{1} + π_{2}) (π_{1} + π_{3}), \end{aligned}

(8)
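A brief sketch of where equations (6) to (8) come from may help. It assumes, as in the univariate construction, that $E$ is independent of $(X_{1}, X_{2})$, and it uses the facts that the indicators satisfy $E_{k}^{2} = E_{k}$ and $E_{k} E_{l} = 0$ for $k \neq l$:

```latex
\begin{aligned}
E (Y_{j}) & = E (E_{1} + E_{j + 1}) \, E (X_{j}) = (π_{1} + π_{j + 1}) (α_{0} + α_{j}) β_{j}, \\
E (Y_{j}^{2}) & = (π_{1} + π_{j + 1}) \{Var (X_{j}) + E (X_{j})^{2}\}, \\
E (Y_{1} Y_{2}) & = E \{(E_{1} + E_{2}) (E_{1} + E_{3})\} \, E (X_{1} X_{2}) = π_{1} \{Cov (X_{1}, X_{2}) + E (X_{1}) E (X_{2})\}.
\end{aligned}
```

Subtracting $E (Y_{j})^{2}$ from the second line and $E (Y_{1}) E (Y_{2})$ from the third, and substituting the BNB moments of equation (2), gives the variance and covariance stated above.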

and the correlation $ρ (Y_{1}, Y_{2})$ does not simplify further than

ρ (Y_{1}, Y_{2}) = Cov (Y_{1}, Y_{2}) / \sqrt{Var (Y_{1}) Var (Y_{2})}

(9)

When dropouts are technical artifacts whose effects are unwanted and need to be adjusted for, the underlying correlation $ρ^{*}$ of $Y_{1}$ and $Y_{2}$ under the BZINB model is simply the correlation of $X_{1}$ and $X_{2}$ (equation (2)), which is

ρ^{*} (Y_{1}, Y_{2}) = \frac{α_{0}}{\sqrt{(α_{0} + α_{1}) (α_{0} + α_{2})}} \sqrt{\frac{β_{1} β_{2}}{(β_{1} + 1) (β_{2} + 1)}}

(10)
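The difference between the observed correlation of equation (9) and the underlying correlation of equation (10) is easy to explore numerically. The following sketch simply transcribes equations (6) to (10); the function name and the returned dictionary keys are illustrative choices, not part of the article.

```python
import math


def bzinb_correlations(a0, a1, a2, b1, b2, p1, p2, p3):
    """Observed correlation rho (eq. 9) and underlying correlation rho* (eq. 10)."""
    shape = (a0 + a1, a0 + a2)
    scale = (b1, b2)
    p_obs = (p1 + p2, p1 + p3)  # probability that each count is observed

    # Equations (6) and (7): marginal means and variances.
    mean = [p * s * b for p, s, b in zip(p_obs, shape, scale)]
    var = [(s * b) ** 2 * p * (1.0 - p) + s * b * (b + 1.0) * p
           for p, s, b in zip(p_obs, shape, scale)]

    # Equation (8): covariance.
    cross = shape[0] * shape[1] * b1 * b2
    cov = (a0 * b1 * b2 + cross) * p1 - cross * p_obs[0] * p_obs[1]

    rho = cov / math.sqrt(var[0] * var[1])
    rho_star = (a0 / math.sqrt(shape[0] * shape[1])
                * math.sqrt(b1 * b2 / ((b1 + 1.0) * (b2 + 1.0))))
    return {"mean": mean, "var": var, "cov": cov, "rho": rho, "rho_star": rho_star}
```

For example, calling the function with `p1 = 1` and `p2 = p3 = 0` makes `rho` and `rho_star` coincide, while letting only one of the two counts drop out (say `p1 = p2 = 0.5`, `p3 = 0`) pulls `rho` below `rho_star`, in line with the attenuation described in Section 1.2.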

3. Estimation

With the natural interpretation of the BZINB model as layers of latent variables, one can estimate the parameters by the expectation-maximization (EM) algorithm.

The joint probability density of the observed and latent variables (“complete likelihood”) is given as

\begin{aligned} f (Y_{1}, Y_{2}, X_{1}, X_{2}, R_{0}, R_{1}, R_{2}, E_{1}, E_{2}, E_{3}, E_{4}) & = f (X_{1}, X_{2}, R_{0}, R_{1}, R_{2}, E_{1}, E_{2}, E_{3}, E_{4}) \\ \times 1_{{Y_{1} = X_{1} (E_{1} + E_{2}), Y_{2} = X_{2} (E_{1} + E_{3})}} \end{aligned}

(11)

with

\begin{aligned} f (X_{1}, X_{2}, R_{0}, R_{1}, R_{2}, E_{1}, E_{2}, E_{3}, E_{4}) & = \frac{(R_{0} + R_{1})^{X_{1}} (R_{0} + R_{2})^{X_{2}} R_{0}^{α_{0} - 1} R_{1}^{α_{1} - 1} R_{2}^{α_{2} - 1} β_{2}^{X_{2}}}{X_{1}! X_{2}! Γ (α_{0}) Γ (α_{1}) Γ (α_{2}) β_{1}^{X_{2} + α_{0} + α_{1} + α_{2}}} \\ & \quad \times \frac{\prod_{k = 1}^{4} π_{k}^{E_{k}} 1_{\sum_{k = 1}^{4} E_{k} = 1}}{\exp \{R_{0} \frac{1 + β_{1} + β_{2}}{β_{1}} + R_{1} \frac{1 + β_{1}}{β_{1}} + R_{2} \frac{1 + β_{2}}{β_{1}}\}} \end{aligned}

Thus, the full individual log-likelihood for the $i$th entry, or the $i$th cell, is

\begin{aligned} l_{i}^{Full} & = X_{1, i} \log (R_{0, i} + R_{1, i}) + X_{2, i} \log (R_{0, i} + R_{2, i}) + (α_{0} - 1) \log R_{0, i} + (α_{1} - 1) \log R_{1, i} + (α_{2} - 1) \log R_{2, i} \\ + X_{2, i} \log β_{2} - (X_{2, i} + α_{0} + α_{1} + α_{2}) \log β_{1} + \sum_{k = 1}^{4} E_{k, i} \log π_{k} - \log X_{1, i}! - \log X_{2, i}! \\ - \log Γ (α_{0}) - \log Γ (α_{1}) - \log Γ (α_{2}) - R_{0, i} \frac{1 + β_{1} + β_{2}}{β_{1}} - R_{1, i} \frac{1 + β_{1}}{β_{1}} - R_{2, i} \frac{1 + β_{2}}{β_{1}} \\ + \log 1_{(Y_{1, i} = X_{1, i} (E_{1, i} + E_{2, i}))} + \log 1_{(Y_{2, i} = X_{2, i} (E_{1, i} + E_{3, i}))} + \log 1_{\sum_{k = 1}^{4} E_{k} = 1} \end{aligned}

The expected full log-likelihood conditional on the observed data is linear in $E [R_{j, i} | Y_{1, i}, Y_{2, i}; θ]$, $E [\log R_{j, i} | Y_{1, i}, Y_{2, i}; θ]$, $E [E_{k, i} | Y_{1, i}, Y_{2, i}; θ]$, and $E [X_{2, i} | Y_{1, i}, Y_{2, i}; θ]$, where $θ \equiv (α^{⊤}, β^{⊤}, π^{⊤})^{⊤}$, $j = 0, 1, 2$, and $k = 1, 2, 3, 4$. The formulae of the components (including the analytic solution of the conditional expectations) are given in Web Section B.1 of the Supplemental Materials.

As the likelihood is the product of functions concave with respect to each of the parameters at the expectation (see Web Section B.2 of the Supplemental Materials), the maximization can be achieved by solving a system of score equations. The individual scores are given as

\begin{aligned} \partial_{α_{j}} E [l_{i}^{Full} | \cdot] & = E [\log R_{j, i} | \cdot] - \log β_{1} - ψ (α_{j}), j = 0, 1, 2 \\ \partial_{β_{1}} E [l_{i}^{Full} | \cdot] & = E [R_{0, i} + R_{2, i} | \cdot] \frac{1 + β_{2}}{β_{1}^{2}} + \frac{E [R_{1, i} | \cdot]}{β_{1}^{2}} - \frac{α_{0} + α_{1} + α_{2} + E [X_{2, i} | \cdot]}{β_{1}} \\ \partial_{β_{2}} E [l_{i}^{Full} | \cdot] & = - \frac{E [R_{0, i} + R_{2, i} | \cdot]}{β_{1}} + \frac{E [X_{2, i} | \cdot]}{β_{2}} \\ \partial_{π_{j}} E [l_{i}^{Full} | \cdot] & = \frac{E [E_{j, i} | \cdot]}{π_{j}} - \frac{1 - E [E_{j, i} | \cdot]}{1 - π_{j}}, j = 1, 2, 3 \end{aligned}

where the conditioning arguments $(Y_{1}, Y_{2}; θ)$ are suppressed as $(\cdot)$ and, because we assume a sample of independent entries, can be replaced with $(Y_{1, i}, Y_{2, i}; θ)$. Here $Y_{j}$ denotes $(Y_{j, 1}, \dots, Y_{j, n})^{⊤}$ for $j = 1, 2$, $n$ is the sample size, $\partial_{a} b$ denotes the partial derivative of $b$ with respect to $a$, and $ψ (\cdot)$ is the digamma function.

At the $(k + 1)$st iteration of the EM algorithm, we get $θ^{(k + 1)}$ by solving the score equations $\partial_{θ} \sum_{i = 1}^{n} E [l_{i}^{Full} | Y_{1}, Y_{2}, θ^{(k)}] = 0$:

\begin{aligned} \frac{β_{2}^{(k + 1)}}{β_{1}^{(k + 1)}} & = \frac{\bar{E} [X_{2, i} | \cdot]}{\bar{E} [R_{0, i} + R_{2, i} | \cdot]}, \\ β_{1}^{(k + 1)} & = \frac{\bar{E} [R_{0, i} + R_{1, i} + R_{2, i} | \cdot]}{α_{0}^{(k + 1)} + α_{1}^{(k + 1)} + α_{2}^{(k + 1)}}, \\ π_{j}^{(k + 1)} & = \bar{E} [E_{j, i} | \cdot], \quad j = 1, 2, 3, 4, \\ α_{j}^{(k + 1)} & = ψ^{- 1} \{- \log β_{1}^{(k + 1)} + \bar{E} [\log R_{j, i} | \cdot]\}, \quad j = 0, 1, 2, \end{aligned}

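where $\bar{E} [A | \cdot]$ denotes the empirical average of the conditional expectations, that is, $\frac{1}{n} \sum_{i = 1}^{n} E [A_{i} | \cdot]$, and the conditioning arguments $(Y_{1}, Y_{2}, θ^{(k)})$ are again suppressed. Because the updates for $β_{1}$ and $α_{j}$ depend on each other, they are obtained by first solving the following scalar equation for $β_{1}$ with the Newton-Raphson algorithm and then plugging the solution into the expression for $α_{j}$: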

\begin{aligned} \text{Solve for } β_{1}: \quad β_{1} & = \frac{\bar{E} [R_{0} + R_{1} + R_{2} | \cdot]}{\sum_{k = 0}^{2} ψ^{- 1} (- \log β_{1} + \bar{E} [\log R_{k} | \cdot])}, \\ \text{then get } α_{j} & = ψ^{- 1} (- \log β_{1} + \bar{E} [\log R_{j} | \cdot]), \quad j = 0, 1, 2. \end{aligned}
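The scalar solve for $β_{1}$ can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the digamma, trigamma, and inverse-digamma routines are simple standard-library approximations, the Newton derivative $\sum_{k} α_{k} - \sum_{k} 1 / ψ' (α_{k})$ is derived here from the equation above, and all function names are hypothetical. The inputs are the averaged conditional expectations $\bar{E} [R_{0} + R_{1} + R_{2} | \cdot]$ and $\bar{E} [\log R_{j} | \cdot]$, which would come from the E-step.

```python
import math


def digamma(x):
    """psi(x) for x > 0: upward recurrence, then an asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (acc + math.log(x) - 0.5 / x
            - inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0)))


def trigamma(x):
    """psi'(x) for x > 0: upward recurrence, then an asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc += 1.0 / (x * x)
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (acc + 1.0 / x + 0.5 * inv2
            + (inv2 / x) * (1.0 / 6.0 - inv2 * (1.0 / 30.0 - inv2 / 42.0)))


def inv_digamma(y, iters=50):
    """Solve psi(x) = y by Newton's method from a standard starting value."""
    x = math.exp(y) + 0.5 if y >= -2.22 else -1.0 / (y + 0.5772156649015329)
    for _ in range(iters):
        step = (digamma(x) - y) / trigamma(x)
        x_new = x - step
        x = x_new if x_new > 0.0 else 0.5 * x  # keep the iterate positive
        if abs(step) < 1e-14 * x:
            break
    return x


def solve_beta1_and_alphas(mean_r_sum, mean_log_r, b1_init=1.0,
                           tol=1e-10, max_iter=200):
    """Newton-Raphson on g(b1) = b1 * sum_k alpha_k(b1) - mean_r_sum,
    where alpha_k(b1) = inv_digamma(mean_log_r[k] - log(b1))."""
    b1 = b1_init
    for _ in range(max_iter):
        alphas = [inv_digamma(c - math.log(b1)) for c in mean_log_r]
        g = b1 * sum(alphas) - mean_r_sum
        g_prime = sum(alphas) - sum(1.0 / trigamma(a) for a in alphas)
        b1_new = b1 - g / g_prime
        if b1_new <= 0.0:  # safeguard: stay in the parameter space
            b1_new = 0.5 * b1
        converged = abs(b1_new - b1) < tol * b1
        b1 = b1_new
        if converged:
            break
    alphas = [inv_digamma(c - math.log(b1)) for c in mean_log_r]
    return b1, alphas
```

As a sanity check, feeding the exact gamma moments $\sum_{j} α_{j} β_{1}$ and $ψ (α_{j}) + \log β_{1}$ for known parameter values should return those same values. In a full EM iteration, $β_{2}$ would then follow from the ratio update for $β_{2} / β_{1}$.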

After enough iterations to observe convergence, the final updated parameter values serve as the maximum likelihood estimates. Although the likelihood is concave at the expectation, it may not always be so, and thus, the convergence of the EM algorithm may depend on the choice of initial values. Our simulations suggest that the estimator does not depend much on the initial values, especially when the sample size is large. See the sensitivity analysis in Web Section B.3 of the Supplemental Materials.

The standard error of the maximum likelihood parameter estimates can be calculated using observed information. In Web Section D of the Supplemental Materials, detailed formulae are given, and simulations illustrating the accuracy of standard error estimation are included in Section 4.3.

4. Measuring gene-gene correlations accounting for dropout events by the BZINB model

4.1 Model comparison using the mouse paneth data

In this section, we show how the BZINB model fits an scRNA-seq data set compared to its nested models (Section 4.1), present how model-based dependence measures can differ from naive measures (Section 4.2), and study the asymptotic behavior of the estimator through simulations (Section 4.3). The data were collected from paneth cells of a C57Bl6 mouse with a Sox9 gene knockout. The Fluidigm C1 system was used to capture single cells and generate Illumina libraries using the manufacturers’ protocols. The Illumina NextSeq sequencing platform was used for paired-end sequencing. Reads per cell were demultiplexed using mRNASeqHT_demultiplex.pl, a script provided by Fluidigm. Low quality base calls and primers were removed using Trimmomatic³⁹ and poly-A tails were removed using a custom perl script. Reads were aligned to the mouse genome (mm9) using STAR (https://academic.oup.com/bioinformatics/article/29/1/15/272537) and reads per gene were counted using htseq-count (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4287950).

The data are composed of 23,425 genes for 800 cells, where all the cells came from a single mouse and have the same cell type. Over 90% of genes have more than 90% zero counts and the average (minimum, 25th percentile, 75th percentile, and maximum) proportion of zero counts for a gene is 97.3% (4.8%, 97.0%, 99.9%, and 100%) in the data. We perform a zero-inflation test to detect zero inflation, since a high proportion of zeros does not necessarily mean zero inflation. The likelihood ratio test comparing the NB and ZINB distributions is used for this test, with a 50:50 mixture of the $χ_{0}^{2}$ and $χ_{1}^{2}$ distributions as the reference distribution, where $χ_{d}^{2}$ denotes a central $χ^{2}$ distribution with $d$ degrees of freedom.⁴⁰ Figure 1 provides the histogram of p-values associated with the zero-inflation test. Without zero-inflation, the histogram should be flat over the $[0, 0.5]$ interval. However, a spike at the left end is observed, indicating the existence of zero-inflation, and thus overall, the use of the BZINB model is suggested over the BNB model.
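The p-value under the 50:50 mixture reference distribution has a simple closed form, sketched below. The function name and its two log-likelihood arguments are illustrative assumptions; the fitted NB and ZINB log-likelihoods would come from whatever univariate fitting routine is used. The sketch relies on the identity $P (χ_{1}^{2} > t) = \mathrm{erfc} (\sqrt{t / 2})$ and on the $χ_{0}^{2}$ component being a point mass at zero.

```python
import math


def zero_inflation_pvalue(loglik_nb, loglik_zinb):
    """Likelihood ratio test of NB versus ZINB with a 50:50 mixture of
    chi-square(0) and chi-square(1) as the reference distribution."""
    # Small negative values can arise from numerical noise; truncate at zero.
    stat = max(0.0, 2.0 * (loglik_zinb - loglik_nb))
    if stat == 0.0:
        return 1.0  # the whole reference distribution lies at or above zero
    # Only the chi-square(1) half can exceed a positive statistic.
    return 0.5 * math.erfc(math.sqrt(stat / 2.0))
```

Under the null hypothesis this p-value equals 1 with probability 0.5 and is otherwise uniform on $[0, 0.5]$, which is the reference shape used to read the histogram in Figure 1.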

Figure 1.

The histogram of the $p$-values from the likelihood ratio test for zero-inflation. The reference $p$-value distribution under the null model is uniform over $[0, 0.5]$ with a point mass of 0.5 at 1, where the uniform density corresponds to the 5% relative frequency line with the width of bins being 0.05.

We compare four nested models: BZINB, BNB, bivariate zero-inflated Poisson (BZIP), and bivariate Poisson (BP). BZIP has fixed mean values instead of the latent gamma variables of BZINB, and BP further lacks zero-inflation components. The estimated probability distributions of these models are compared with the empirical distributions of 50 gene pairs, selected as described below.

To systematically study the model performances, we performed a stratified sampling of genes according to their proportion of zeros; strata H, M, L, and V include genes with $\geq$ 90%, 80% to 90%, 60% to 80%, and $< 60\%$ zeros, respectively. Genes with $\geq$ 98% of zeros and genes with extremely large expression ($> 10{,}000$ counts for at least one cell) were screened out. After screening out those irregular genes, the four strata contain 81.4%, 13.5%, 4.2%, and 0.9% of genes, respectively. We randomly selected five pairs of genes from each possible combination of two strata (HH, MM, LL, VV, HM, HL, HV, ML, MV, and LV) without replacement. For each of the 50 pairs (5 pairs $\times$ 10 combinations), we estimated the parameters of the four nested models. Based on the parameter estimates, the distributions of the four models were compared. As it is not straightforward to visually compare distributions of more than one dimension, and the strong smoothness of the model-based estimated distributions makes it hard to illustrate the distributional characteristics, we drew a random sample of size $n = 800$ from each estimated model and the resulting empirical distributions were then compared. Figure 2 shows three pairs, from HH, MM, and LL. The results for all 50 pairs are given in Figures S1 to S10 in the Supplemental Materials. It can be seen from those figures that the distributional patterns are similar among the five pairs within each stratum and are very different between strata.
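The stratified selection of gene pairs can be sketched as follows. This is only one reading of the procedure: it assumes that the screening step has already been applied, that "without replacement" means no gene is used in more than one pair, and that the input is a mapping from gene name to its proportion of zeros. The function names, the seed argument, and the input format are all hypothetical.

```python
import itertools
import random


def stratum(zero_prop):
    """Assign a gene to stratum H, M, L, or V by its proportion of zeros."""
    if zero_prop >= 0.90:
        return "H"
    if zero_prop >= 0.80:
        return "M"
    if zero_prop >= 0.60:
        return "L"
    return "V"


def sample_pairs(zero_props, pairs_per_combo=5, seed=0):
    """Draw gene pairs for every combination of two strata, never reusing a gene.

    zero_props: dict mapping gene name -> proportion of zero counts
    (assumed to be already screened).
    """
    rng = random.Random(seed)
    pools = {s: [] for s in "HMLV"}
    for gene, prop in zero_props.items():
        pools[stratum(prop)].append(gene)
    for pool in pools.values():
        rng.shuffle(pool)  # popping from a shuffled list samples without replacement

    pairs = {}
    for s1, s2 in itertools.combinations_with_replacement("HMLV", 2):
        combo = []
        for _ in range(pairs_per_combo):
            first = pools[s1].pop()
            second = pools[s2].pop()
            combo.append((first, second))
        pairs[s1 + s2] = combo
    return pairs
```

With five pairs per combination, each stratum must supply at least 25 genes (10 for the within-stratum pairs and 5 for each of the three cross-stratum combinations); a smaller pool would raise an `IndexError` in this sketch.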

Figure 2.

The bivariate distribution of actual and simulated mouse Paneth RNA count data. Each of the three panels corresponds to the first gene pair from each stratum: HH1, MM1, and LL1, where the letters represent a stratum with varying proportions of zeros and the numbers represent the number of the pair in each stratum. Each panel has the empirical distribution (top), which serves as truth for the simulations, and the four model-based simulated empirical distributions (bottom). The first four lines of numbers represent the distribution of the dichotomized values ( $+$ for non-zeros) suggested by each model, and the last lines represent the Akaike Information Criterion (AIC) values of each model.

For any pair, the BP model clearly fails to address the overdispersion and zero-inflation, while the BZIP model cannot properly mimic the overdispersion. BNB and BZINB mimic the true distribution fairly well in most of the pairs. Although BNB provides a decent fit to the data, the seemingly good fit may come at the price of sacrificing the dependence structure. The Akaike Information Criterion (AIC) can be used for model comparisons, and it suggests BZINB for pairs MM1 and LL1 and BNB for pair HH1. Figure 3 compares the marginal correlations of the 50 gene pairs implied by both the BZINB and BNB models; that is, correlations are calculated accounting for both zero-inflation and nonzero-inflated counts. While the BZINB model seems to preserve the magnitude of the observed correlation, BNB has correlations conspicuously lower than the corresponding empirical Pearson correlations. This suggests that the BZINB model better captures the overall structure of the data.

Figure 3.

The BZINB and BNB model-based marginal correlations ( $y$ -axis, according to (2) and (9), respectively) and the empirical Pearson correlations ( $x$ -axis) of the 50 gene pairs. The diagonal lines represent $y = x$ . BZINB: bivariate zero-inflated negative binomial; BNB: bivariate negative binomial.

Furthermore, when genes have some large-valued counts and many zeros at the same time, either marginally or jointly, the BZINB model has an apparent advantage over the BNB model. Often, in the BNB model, nonzero count pairs are highly concentrated on the diagonal line, while nonzero counts in the BZINB model are more dispersed away from the diagonal line (LL1 in Figure 2 and more examples in Figures S1 to S2). This can be explained by the lack of flexibility of the BNB model. When data are highly zero-inflated but overdispersed at the same time, BNB is forced to have small shape parameters ( $α_{j}, j = 0, 1, 2$ ) and large scale parameters ( $β_{j}, j = 1, 2$ ) while keeping the mean of the latent Gamma variables, $E [R_{j}] = α_{j} β_{1}$ , close to zero. These latent Gamma variables, serving as mean parameters of Poisson variables, take on very small values most of the time and very large values with a small chance. It is unlikely that both $R_{1}$ and $R_{2}$ take large values at the same time (“CASE 1”); it is more frequent that $R_{0}$ alone takes a large value (“CASE 2”). Thus, the latent Poisson variables, $X_{1}$ and $X_{2}$ , are more likely to have similarly large numbers (resulting from CASE 2) than to have significantly different nonzero numbers (resulting from CASE 1).
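The CASE 1 versus CASE 2 argument can be checked with a small Monte Carlo sketch of the latent Gamma variables alone. The shape, scale, and threshold values below are hypothetical and chosen only to mimic "small shape, large scale"; a common shape for all three variables is assumed for simplicity.

```python
import random


def case_frequencies(alpha=0.05, beta=20.0, threshold=5.0, n=20000, seed=1):
    """Estimate how often CASE 1 and CASE 2 occur among latent Gamma draws.

    CASE 1: R1 and R2 are both large while R0 is not.
    CASE 2: R0 alone is large.
    All parameter values are illustrative assumptions, not fitted estimates.
    """
    rng = random.Random(seed)
    case1 = case2 = 0
    for _ in range(n):
        # gammavariate(shape, scale): mean is shape * scale
        r0 = rng.gammavariate(alpha, beta)
        r1 = rng.gammavariate(alpha, beta)
        r2 = rng.gammavariate(alpha, beta)
        if r1 > threshold and r2 > threshold and r0 <= threshold:
            case1 += 1
        elif r0 > threshold and r1 <= threshold and r2 <= threshold:
            case2 += 1
    return case1 / n, case2 / n


freq_case1, freq_case2 = case_frequencies()
print(f"CASE 1: {freq_case1:.4f}, CASE 2: {freq_case2:.4f}")
```

Because CASE 1 requires two independent rare events and CASE 2 only one, CASE 2 dominates whenever large values are rare, which is what pushes the BNB nonzero pairs toward the diagonal.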

4.2 Comparison of dependence measures in the mouse Paneth scRNA-seq data

When the excess zeros are believed to come from dropouts, the BZINB model may uncover the underlying dependence using measures such as $ρ^{*}$ and $M I^{*}$ . $M I^{*}$ is the underlying mutual information defined similarly to $ρ^{*}$ . It can be estimated by first estimating the BZINB model parameters and then computing the mutual information, $M I (Z_{1}, Z_{2}) := \sum_{z_{1} = 0}^{\infty} \sum_{z_{2} = 0}^{\infty} Pr (Z_{1} = z_{1}, Z_{2} = z_{2}) \log \frac{Pr (Z_{1} = z_{1}, Z_{2} = z_{2})}{Pr (Z_{1} = z_{1}) Pr (Z_{2} = z_{2})}$ , of the estimated distribution after replacing $π$ with $(1, 0, 0, 0)^{⊤}$ . In other words, the underlying mutual information, $M I^{*} (Y_{1}, Y_{2})$ , is defined as the mutual information $M I (X_{1}, X_{2})$ of the latent variables $(X_{1}, X_{2})$ .

For the same 50 pairs as in the previous subsection, we estimated the dependence using naive measures, namely Pearson correlation (PC) and empirical mutual information (EMI), and zero-inflation adjusted measures, namely underlying correlation ( $ρ^{*}$ ) and underlying MI ( $M I^{*}$ ) based on the BZINB model. Specifically, EMI is obtained as $E M I (Y_{1}, Y_{2}) := \sum_{y_{1} = 0}^{\infty} \sum_{y_{2} = 0}^{\infty} \hat{Pr} (Y_{1} = y_{1}, Y_{2} = y_{2}) \log \frac{\hat{Pr} (Y_{1} = y_{1}, Y_{2} = y_{2})}{\hat{Pr} (Y_{1} = y_{1}) \hat{Pr} (Y_{2} = y_{2})}$ , where $\hat{Pr} (Y_{1} = y_{1}, Y_{2} = y_{2}) = \frac{1}{n} \sum_{i = 1}^{n} 1 (Y_{1 i} = y_{1}, Y_{2 i} = y_{2})$ , $\hat{Pr} (Y_{j} = y_{j})$ are similarly defined for $j = 1, 2$ , and $0 \log 0$ is defined as zero. Figure 4 summarizes the estimates for all the pairs. The plots of the empirical distribution with estimated dependence measures for each pair are also available in Figures S11 to S20 of the Supplemental Materials.
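The EMI definition above translates directly into a few lines of code. The sketch below is illustrative only: the function name is hypothetical and the natural logarithm is assumed, since the base of the logarithm is not stated here.

```python
import math
from collections import Counter


def empirical_mi(y1, y2):
    """Empirical mutual information of two paired count vectors.

    Only observed (y1, y2) cells are visited, so the convention
    0 log 0 = 0 is applied implicitly. Natural log is an assumption.
    """
    if len(y1) != len(y2) or not y1:
        raise ValueError("y1 and y2 must be non-empty and of equal length")
    n = len(y1)
    joint = Counter(zip(y1, y2))  # cell counts
    marg1 = Counter(y1)           # marginal counts of Y1
    marg2 = Counter(y2)           # marginal counts of Y2
    mi = 0.0
    for (a, b), count in joint.items():
        p_joint = count / n
        # p_joint / (p1 * p2) simplifies to count * n / (m1 * m2)
        mi += p_joint * math.log(count * n / (marg1[a] * marg2[b]))
    return mi


print(empirical_mi([0, 0, 1, 3], [0, 0, 2, 5]))
```

For two identical balanced binary vectors the function returns $\log 2$, and for empirically independent vectors it returns zero.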

Figure 4.

Estimated dependence measures of 50 pairs. Pearson correlation (PC) and underlying correlation estimates ( $ρ^{*}$ ) (left). Empirical (EMI) and underlying ( $M I^{*}$ ) mutual information estimates (right).

In the left panel of Figure 4, PC and $ρ^{*}$ are relatively close to each other in most cases, but they can also take quite different values (e.g. HM4 and MM2). If we judge whether two genes are correlated based on (naive) PC with a certain threshold, say PC > 0.2, many gene pairs might be missed (e.g. HM4) or falsely included (e.g. MM2). Pairs from the relatively high zero proportion strata, such as H and M, tend to have large gaps between the two measures. This is the consequence of adjusting for zero-inflation in the BZINB model.

Similar analyses can be done for MI-based measures. While EMI and $M I^{*}$ estimates are correlated, there are pairs that are located notably away from the overall trend. For example, the pair MV2 has the highest $M I^{*}$ , while its EMI is not one of the highest. Also, the values of $M I^{*}$ are in general smaller than those of EMI for scRNA-seq data. The same is seen for $ρ^{*}$ versus PC but to a lesser degree. The heavy proportion of zero-zero pairs boosts naive EMI, while $M I^{*}$ removes the effects of the co-zero-inflation. These results suggest that measures that fail to identify the excess zeros caused by the dropout events may be highly misleading.

4.3 Evaluation of the dependence estimator based on simulation

We ran simulations to study the performance of estimators of the underlying correlation and their associated standard errors under a finite sample size. We considered 40 distinct sets of BZINB parameter values (Table 1). Note that for each value of $ρ^{*}$ there are two distinct sets of parameters $(α, β)$ , the first (a) of which has lower $α$ values and the second (b) of which has higher $α$ values. For each parameter set $(α, β, π)$ and for $n = 250, 500, 800, 1500, 2500$ , we generated $n_{sim} = 1000$ random BZINB samples of size $n$ .

Table 1.
The set of parameters for simulation. Combination of $(α_{0}, α_{1}, α_{2}, β_{1}, β_{2})$ and $(π_{1}, π_{2}, π_{3}, π_{4})$ below makes $40 (= 8 \times 5)$ sets in total.

Underlying correlation	#	$(α_{0}, α_{1}, α_{2}, β_{1}, β_{2})$	Zero-inflation	$(π_{1}, π_{2}, π_{3}, π_{4})$
1. High	( $ρ^{*} = 0.6$ )	1-a	(0.2, 0.05, 0.05, 3.0, 3.0)	i. Low	(0.7, 0.1, 0.1, 0.1)
		1-b	(2.0, 0.7, 0.1, 2.5, 2.5)	ii. Moderate-balanced	(0.5, 0.15, 0.15, 0.2)
2. Moderate	( $ρ^{*} = 0.3$ )	2-a	(1.0, 1.0, 1.0, 1.5, 1.5)	iii. Moderate-unbalanced	(0.5, 0.1, 0.3, 0.1)
		2-b	(3.0, 2.0, 1.0, 1.5, 0.5)	iv. High-balanced	(0.2, 0.2, 0.2, 0.4)
3. Low	( $ρ^{*} = 0.1$ )	3-a	(0.2, 0.3, 3.0, 2.0, 1.5)	v. High-unbalanced	(0.2, 0.1, 0.4, 0.3)
		3-b	(0.5, 2.0, 2.0, 0.5, 3.0)
4. Very low	( $ρ^{*} = 0.01$ )	4-a	(0.01, 0.1, 1.0, 0.5, 0.5)
		4-b	(0.05, 2.0, 3.0, 3.0, 0.5)
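The $ρ^{*}$ targets in Table 1 can be reproduced from the listed $(α, β)$ values. The sketch below assumes the latent Gamma-Poisson construction $X_{1} \sim Poisson (R_{0} + R_{1})$ , $X_{2} \sim Poisson ((β_{2} / β_{1}) (R_{0} + R_{2}))$ with independent $R_{j} \sim Gamma (α_{j}, β_{1})$ (shape, scale), which is not restated in this section; the closed form in the code follows from that assumption and should be checked against the paper's own definition of $ρ^{*}$ . The function and dictionary names are hypothetical.

```python
import math


def underlying_correlation(a0, a1, a2, b1, b2):
    """Correlation of the latent counts (X1, X2) under the assumed construction.

    Var(X1) = (a0 + a1) * b1 * (1 + b1), Var(X2) = (a0 + a2) * b2 * (1 + b2),
    Cov(X1, X2) = a0 * b1 * b2.
    """
    shape_part = a0 / math.sqrt((a0 + a1) * (a0 + a2))
    scale_part = math.sqrt(b1 * b2 / ((1.0 + b1) * (1.0 + b2)))
    return shape_part * scale_part


# (alpha0, alpha1, alpha2, beta1, beta2) as listed in Table 1
TABLE1 = {
    "1-a": (0.2, 0.05, 0.05, 3.0, 3.0),
    "1-b": (2.0, 0.7, 0.1, 2.5, 2.5),
    "2-a": (1.0, 1.0, 1.0, 1.5, 1.5),
    "2-b": (3.0, 2.0, 1.0, 1.5, 0.5),
    "3-a": (0.2, 0.3, 3.0, 2.0, 1.5),
    "3-b": (0.5, 2.0, 2.0, 0.5, 3.0),
    "4-a": (0.01, 0.1, 1.0, 0.5, 0.5),
    "4-b": (0.05, 2.0, 3.0, 3.0, 0.5),
}

for label, params in TABLE1.items():
    print(label, round(underlying_correlation(*params), 3))
```

Under this assumption the eight parameter sets give values that round to 0.6, 0.6, 0.3, 0.3, 0.1, 0.1, 0.01, and 0.01, matching the four correlation levels of the table.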

For each replicate $k$ of the $n_{sim}$ simulation replicates, we obtained an estimate ${\hat{ρ}}_{k}^{*}$ of the parameter $ρ^{*}$ , the standard error estimate $s e ({\hat{ρ}}_{k}^{*})$ , and the logit-transformed 95% confidence interval (i.e. ${logit}^{- 1} (logit ({\hat{ρ}}_{k}^{*}) \pm 1.96 \frac{s e ({\hat{ρ}}_{k}^{*})}{{\hat{ρ}}_{k}^{*} (1 - {\hat{ρ}}_{k}^{*})})$ ). Then for each set of parameters, the following three quantities were calculated:

the average estimated standard error (SE, $\bar{s e} ({\hat{ρ}}^{*})$ ),

the standard deviation of the parameter estimates (SD, $s d ({\hat{ρ}}^{*})$ ),

the empirical coverage probability (CP, $\frac{1}{n_{sim}} \sum_{k = 1}^{n_{sim}} 1_{ρ^{*} \in {CI}_{k}}$ , where ${CI}_{k}$ is the logit-transformed 95% confidence interval for the $k$ th replicate).
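The interval and the three summary quantities above can be sketched as follows. The function names are hypothetical, estimates are assumed to lie strictly between 0 and 1, and the sample standard deviation with divisor $n_{sim} - 1$ is an assumption, since the divisor is not stated.

```python
import math
import statistics


def logit(p):
    return math.log(p / (1.0 - p))


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit_ci(rho_hat, se, z=1.96):
    """Logit-transformed confidence interval for an estimate in (0, 1).

    The delta method gives se(logit(rho_hat)) = se / (rho_hat * (1 - rho_hat)).
    """
    half_width = z * se / (rho_hat * (1.0 - rho_hat))
    centre = logit(rho_hat)
    return expit(centre - half_width), expit(centre + half_width)


def summarize(estimates, std_errors, rho_true):
    """Return (average SE, SD of the estimates, empirical coverage)."""
    avg_se = statistics.mean(std_errors)
    sd = statistics.stdev(estimates)  # divisor n - 1 (assumption)
    covered = 0
    for rho_hat, se in zip(estimates, std_errors):
        lower, upper = logit_ci(rho_hat, se)
        covered += lower <= rho_true <= upper
    return avg_se, sd, covered / len(estimates)


print(logit_ci(0.3, 0.05))
```

Unlike a symmetric Wald interval, the back-transformed interval always stays inside $(0, 1)$, which matters when $ρ^{*}$ is close to the boundary.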

In the simulation results, the mean parameter estimates are close to their true parameter values, or get closer as the sample size grows, for each of the 40 scenarios (Figure 5). For most of the 40 parameter sets, CP was close to 0.95, and for those where it was not, CP gets closer to 0.95 with increasing sample size. Similarly, the average estimated standard error (SE) was close to the standard deviation of the parameter estimates (SD), especially when the sample size was large (Figure 6). However, when the underlying correlation was close to zero (i.e. 0.01 in our example), standard error estimation did not perform as well in terms of both CP and closeness of SE to SD. The parameter value being near the boundary may be responsible for the poorer performance. Also, Scenarios iv and v have higher SE and SD than the others. One possible explanation for this is that the effective sample size for those high zero-inflation scenarios is smaller than in the other scenarios; in other words, for samples with many zeros, the actual number of observations needed to validly estimate the shape and scale parameters ( $α$ and $β$ ) is relatively larger. In these simulations, the average computation time ranges from less than a minute to a few minutes for most of the settings. See Web Section C of the Supplemental Materials for more details.

Figure 5.

Mean parameter estimates ( ${\hat{ρ}}^{*}$ ) and coverage probability (CP). Each color represents a distinct simulation scenario.

Figure 6.

Standard error (SE, solid lines) and standard deviation (SD, dashed lines) of the bivariate zero-inflated negative binomial (BZINB)-based underlying correlation estimates. Each color represents a distinct simulation scenario.

5. Modeling the incidence of dental caries on two surfaces using the BZINB model

In the dental caries clinical trial (X-ACT) study,²⁵ 647 participants (ages 21–80 years) were randomized with equal probability to receive Xylitol or inactive lozenges. The number of caries 36 months after treatment initiation was recorded for proximal and smooth tooth surfaces, with the proportion of no-caries for each type being 23.8% and 22.6%, respectively. Figure 7 compares the empirical distribution with the model estimates and provides the zero-inflation test results (the mixture chi-square statistics, ${\bar{χ}}^{2}$ ) for each surface-type data. According to the test, the number of caries on smooth surfaces (no caries for 23% of the sample) is significantly zero-inflated ( $p = 0.0025$ ). The NB model seems to have a somewhat better fit to the proximal type, and the zero-inflation test statistic is not statistically significant at 5% (the AIC values of the ZINB and NB models are 2949 and 2954, respectively, for the smooth-surface caries data, and those for the proximal-surface caries data are 2736 and 2735, respectively). Thus, a BZINB model is suggested if the joint distribution between proximal and smooth surfaces is of interest.
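A $p$ -value for the mixture chi-square statistic can be computed in closed form. The sketch below assumes the usual equal-weight mixture of a point mass at zero and a chi-square distribution with one degree of freedom, which matches the reference distribution described for the zero-inflation histogram (uniform over $[0, 0.5]$ plus a point mass of 0.5 at 1); the function name is hypothetical.

```python
import math


def zero_inflation_pvalue(lrt_stat):
    """p-value of a likelihood ratio statistic under 0.5*chi2_0 + 0.5*chi2_1.

    P(chi2_1 > t) = erfc(sqrt(t / 2)); a statistic of zero gives p = 1.
    The 50:50 mixture weights are an assumption.
    """
    if lrt_stat <= 0.0:
        return 1.0
    return 0.5 * math.erfc(math.sqrt(lrt_stat / 2.0))


for stat in (0.0, 1.0, 3.8415, 8.0):
    print(stat, round(zero_inflation_pvalue(stat), 4))
```

Under this reference distribution a positive statistic yields exactly half of the ordinary one-degree-of-freedom chi-square $p$ -value.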

Figure 7.

The empirical distribution of the number of caries (bars) on the proximal and smooth surfaces. The fitted NB distribution (red dots, left), the fitted ZINB distribution (blue dots, right), and the zero-inflation test statistics ( ${\bar{χ}}^{2}$ ) and their $p$ -values are shown. NB: negative binomial model; ZINB: zero-inflated negative binomial model.

The following two approaches shed light on the effectiveness and the efficiency of the bivariate model. First, we rigorously investigate the difference in the joint distribution of the caries counts between the intervention (Xylitol) and control groups, which is not obtainable from univariate models. In the second analysis, we illustrate how bivariate models could be more efficient than univariate models in testing marginal mean differences. The BZINB model parameter estimates for each group are provided in Table 2 and, together with their covariance estimates, the analyses of the joint dichotomized distribution and marginal mean tests are derived.

Table 2.

The BZINB parameter estimates and their standard errors for the Xylitol and control groups. $\hat{α} = ({\hat{α}}_{0}, {\hat{α}}_{1}, {\hat{α}}_{2}), \hat{β} = ({\hat{β}}_{1}, {\hat{β}}_{2}), \hat{π} = ({\hat{π}}_{1}, {\hat{π}}_{2}, {\hat{π}}_{3}, {\hat{π}}_{4})$ .

Parameter	Control	Xylitol
$(\hat{α}, \hat{β})$	(1.51, 0.28, 0.008, 1.58, 2.34)	(1.56, 0.001, 0.14, 1.65, 1.86)
SE( $\hat{α}, \hat{β}$ )	(0.28, 0.18, 0.22, 0.28, 0.43)	(0.33, 0.21, 0.21, 0.25, 0.32)
$(\hat{π})$	(0.936, 0.007, $< 0.001$ , 0.057)	(0.921, 0.035, 0.009, 0.035)
SE( $\hat{π}$ )	(0.04, 0.02, 0.02, 0.02)	(0.05, 0.03, 0.03, 0.03)

BZINB: bivariate zero-inflated negative binomial; SE: standard error.

Denoting the proximal and smooth surface caries counts as $Y_{1}$ and $Y_{2}$ , respectively, the first analysis focuses on the joint distribution of the dichotomized caries statuses or prevalences whereby $Pr (Y_{1} = 0, Y_{2} = 0), Pr (Y_{1} > 0, Y_{2} = 0)$ , and $Pr (Y_{1} = 0, Y_{2} > 0)$ are estimated for each group based on the BZINB model (Table 3). The detailed formulae are provided in Web Section E.1 of the Supplemental Materials. The BZINB model-based estimates are fairly close to the empirical distribution (Web Table S3 in the Supplemental Materials) and provide, at the same time, a more stable and systematic estimation of the true distribution than the empirical distribution estimators.

Table 3.

The bivariate zero-inflated negative binomial (BZINB)-estimated joint probability (and its standard error) of dichotomized caries incidence on smooth- and proximal-surfaces for control (top) and Xylitol (bottom) groups. The estimated probabilities are given by the plug-in estimates of (4) for each group. See Web Section E.3 of the Supplemental Materials for more details.

		# smooth-surface
Control group		$Y_{2} = 0$	$Y_{2} > 0$	total
Proximal-surface	$Y_{1} = 0$	0.122 (0.018)	0.106 (0.025)	0.228 (0.029)
	$Y_{1} > 0$	0.092 (0.028)	0.680 (0.023)	0.772 (0.029)
	Total	0.214 (0.032)	0.786 (0.032)	1.000 (-)
		# smooth-surface ( $Y_{2}$ )
Xylitol group		$Y_{2} = 0$	$Y_{2} > 0$	total
Proximal-surface	$Y_{1} = 0$	0.119 (0.018)	0.132 (0.033)	0.251 (0.035)
	$Y_{1} > 0$	0.106 (0.028)	0.643 (0.024)	0.749 (0.035)
	Total	0.225 (0.035)	0.775 (0.035)	1.000 (-)

For both proximal and smooth surfaces, the marginal proportion of caries-free participants is higher in the treatment group. Interestingly, however, the joint proportions of caries on both surfaces and of caries-free on both surfaces are both lower. Given that there was virtually no change in caries-free-for-both-surfaces, this perhaps implies a shift from caries-for-both-surfaces to caries-for-one-surface. Approximately 3.7%p, where “%p” refers to percentage points, of the caries-for-both group has been transferred to the caries-for-smooth-surface-only (2.6%p) or caries-for-proximal-surface-only (1.4%p) groups, with the remaining 0.3%p difference due to the decrease in the caries-free-for-both group.
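The percentage-point shifts quoted above follow directly from the joint probabilities in Table 3, as the short calculation below shows. The dictionary keys are labels introduced here for illustration only.

```python
# Joint probabilities from Table 3 (Y1: proximal surface, Y2: smooth surface).
control = {
    "caries_free_both": 0.122,  # Y1 = 0, Y2 = 0
    "smooth_only": 0.106,       # Y1 = 0, Y2 > 0
    "proximal_only": 0.092,     # Y1 > 0, Y2 = 0
    "caries_both": 0.680,       # Y1 > 0, Y2 > 0
}
xylitol = {
    "caries_free_both": 0.119,
    "smooth_only": 0.132,
    "proximal_only": 0.106,
    "caries_both": 0.643,
}

# Change from control to Xylitol, in percentage points.
shift_pp = {cell: round(100 * (xylitol[cell] - control[cell]), 1)
            for cell in control}

for cell, shift in shift_pp.items():
    print(f"{cell}: {shift:+.1f}%p")
```

The gains in the two single-surface cells (2.6%p and 1.4%p) add up to the losses in the caries-for-both (3.7%p) and caries-free-for-both (0.3%p) cells.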

In the second analysis, we compare the overall mean caries counts between the Xylitol and control groups for the two outcomes because, in clinical trials, the overall means are typically of interest as opposed to the latent class means. Nonetheless, two-part models for counts provide the structure to test the difference in the marginal means, $E [Y_{j} | control] - E [Y_{j} | Xylitol]$ for $j = 1, 2$ , using both the univariate and bivariate ZINB models. Since the Xylitol intervention is not expected to increase the caries incidence, directional tests are used. The derivation of the statistics is relegated to Web Section C3 of the Supplemental Materials.

It can be seen from Table 4 that while the means are not very different across the models being used, the use of the bivariate model appears to increase the precision of the marginal means estimates. This result is as expected, because in univariate models, the information contained in the other variable not being modeled is not utilized, while bivariate models can leverage the information of the other variable implied by the underlying structure. The standard errors of the BZINB-based estimators (in parentheses) and the p-values of the BZINB-based tests are overall smaller than those of the ZINB-based ones. Most notably, the p-value of the univariate test for smooth surfaces is 0.051 under the BZINB model, compared to 0.106 for the ZINB model.

Table 4.

The univariate and bivariate model-based estimates (and the standard errors in parentheses) of the group means and the group differences. The $p$ -values are given by the directional z-tests. The global mean difference test is done with the chi-square distribution of two degrees of freedom as the reference.

Model	Surface	Control group mean	Xylitol group mean	Difference	$z$	$p$
ZINB	Proximal	2.68 (0.29)	2.48 (0.15)	0.20 (0.32)	0.628	0.132
	Smooth	3.31 (0.35)	2.95 (0.29)	0.36 (0.45)	0.801	0.106
BZINB	Proximal	2.69 (0.18)	2.47 (0.18)	0.21 (0.26)	0.831	0.101
	Smooth	3.33 (0.22)	2.95 (0.20)	0.38 (0.30)	1.266	0.051
	(global)			$χ^{2} = 1.97$		0.373

ZINB: zero-inflated negative binomial; BZINB: bivariate zero-inflated negative binomial.

A further advantage of bivariate models is the use of global tests, where inference is made by testing against the null hypothesis that the marginal mean parameters of the two groups are the same for all surfaces and rejecting the null if the marginal means differ between groups for at least one of the surface types. In the Xylitol trial, the BZINB-based global test of differences is not statistically significant ( $p = 0.373$ ); therefore, we fail to reject the null hypothesis of no difference in average proximal and smooth surface caries counts between the Xylitol and control groups.
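The global $p$ -value can be recovered from the statistic reported in Table 4, because the chi-square survival function with two degrees of freedom has the closed form $\exp (- x / 2)$ . The sketch below only reproduces that final step; the function name is hypothetical, and the construction of the statistic itself is left to the Supplemental Materials.

```python
import math


def chi2_sf_two_df(x):
    """Survival function of the chi-square distribution with 2 degrees of freedom."""
    if x <= 0.0:
        return 1.0
    return math.exp(-x / 2.0)


global_stat = 1.97  # chi-square statistic reported in Table 4
p_global = chi2_sf_two_df(global_stat)
print(round(p_global, 3))
```

The result agrees with the reported $p = 0.373$ .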

6. Discussion

This article proposes a richly parametrized BZINB model that provides a full specification for the distribution of two correlated, overdispersed and zero-inflated, count random variables. Compared to existing ones, it models bivariate count data with high flexibility by having eight free parameters and at the same time with simple latent variable interpretations. The hierarchical nature of the framework allows for the consideration of nested models, such as BNB and BZIP models, and makes the model highly versatile and applicable to various contexts. Moreover, to our knowledge, the BNB model having five parameters is novel.

The BZINB model is applicable to diverse biomedical settings. In the scRNA-seq settings, by decomposing two sources of zeros, the distribution of counts without dropouts is recovered and the dependence is estimated accordingly. In a second example with a totally different perspective on the meaning and utility of modeling “excess zeros” than the first example, the joint pattern of two types of dental caries was examined using the BZINB model in the Xylitol lozenges clinical trial. In particular, the BZINB model applied to bivariate caries counts from the Xylitol study enabled estimation of marginal parameters of common interest in clinical trials including the joint probabilities for the presence versus absence of any caries and overall mean caries counts for each surface type.

The BZINB model proposed in this article assumes an independent and identically distributed random bivariate sample of zero-inflated counts. Future work could generalize this homogeneous mean model to allow for covariance analysis or joint conditional mean analysis by introducing the generalized linear model framework. As in the univariate ZINB regression, the latent count variables (i.e. $X_{1}$ and $X_{2}$ ) can be modeled using linear predictors with some link function. For example, in the Xylitol study, the treatment indicator or baseline factors could be included in the linear predictors. Nonetheless, we are easily able to construct z-tests based on separate BZINB models for the two treatment groups in the Xylitol study because the groups were statistically independent. In a similar way, tests of treatment efficacy could be conducted for independent subgroups according to baseline factors, although often such tests have limited power.

A growing body of literature reports that many scRNA-seq data are not zero-inflated and that dropout events are primarily caused by PCR amplification, an effect that can be removed by the UMI technique.^43,42,41 While there is reasonable confidence that droplet-based data such as 10X, which use UMI quantification, are not zero-inflated, there is still a need to address dropouts in scRNA-seq data from other platforms as well as in single-cell proteomics and metatranscriptomics data. As we have observed the presence of zero-inflation in Section 4, zero-inflated models such as BZINB are needed for the example dataset.

Our model can be applied to other settings where two sources of zeros are believed to exist, such as frailty. For example, the first source corresponds to a cohort of people who are not susceptible to disease and will always have a zero count; the other source consists of random zeros among susceptible individuals. In this case, the dependence measure proposed in equation (10) applies to the bivariate outcome among the latent class of individuals who are susceptible to disease.

The proper use of the BZINB model depends on researchers’ understanding of how zeros were generated in the data. For example, if the expressed mRNAs are captured and sequenced without dropouts with a certain platform, the observed zeros in the resulting data would represent genes with no expression. In these settings where the excess zeros are not caused by dropout, the overall mean count and the proportion of subjects with positive counts have meaningful interpretations that may be directly modeled by marginalized ZINB¹ and hurdle models,⁴⁴ respectively. Directly modeling the observed pairs of counts, that is, ( $Y_{1}$ , $Y_{2}$ ), using such models extended to bivariate counts could be beneficial, including for clinical trials such as the Xylitol study where marginal means are of interest and zero-inflated models are viewed as a convenient mechanism to account for many zeros.⁴⁵ These scenarios underscore that any model, including BZINB, may not be ideal for all purposes, and that the statistical model for zero-inflated counts should be chosen to match the research question.²⁶

In the BZINB model, allowing only positive $ρ^{*}$ can be regarded as a limitation. One justification for the BZINB model is that negative correlation of count data is not so prevalent in reality. For example, in genomics data, there are some genes that suppress other genes from being expressed; however, such genes either are relatively rare or have a weak negative correlation with other genes. On the other hand, when we believe that the zeros are mostly not induced by dropout events, we can consider using the overall correlation $ρ (Y_{1}, Y_{2})$ , which allows for negative correlation, instead of $ρ^{*} (Y_{1}, Y_{2})$ .

Finally, the BZINB model can also be generalized to a multivariate zero-inflated negative binomial model. This model may have an exponentially increasing number of latent variables or parameters as the dimension gets large. Though the lack of parsimony may make the multivariate model look less attractive, the idea can be put to practical use in simulating multivariate zero-inflated count data and potentially in statistical analysis based on Bayesian models. For instance, genomic count data with a large number of zeros can be mimicked by a set of latent random layers along with the generalized linear model framework. In dental caries clinical trials, a trivariate ZINB model could analyze caries counts from three tooth surface types simultaneously with a single global test, avoiding the need for multiplicity adjustments to control the family-wise Type I error.

7. Software

An R package, bzinb, for estimating BZINB parameters is available on CRAN. The R code for the mouse Paneth and the dental caries data analyses and simulations is available at https://github.com/Hunyong/BZINB_analysis.

Supplemental Material

sj-pdf-1-smm-10.1177_09622802231172028 - Supplemental material for A bivariate zero-inflated negative binomial model and its applications to biomedical settings

Supplemental material, sj-pdf-1-smm-10.1177_09622802231172028 for A bivariate zero-inflated negative binomial model and its applications to biomedical settings by Hunyong Cho, Chuwen Liu, John S Preisser and Di Wu in Statistical Methods in Medical Research

Footnotes

Acknowledgments

The authors thank Scott Magness and Joshua Starmer for providing the mouse Paneth scRNA-seq data, André V. Ritter for sharing the X-ACT study data, and Michael I. Love for discussion of the implications of the dropouts.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by a grant from the National Institutes of Health, National Institute of Dental and Craniofacial Research, R03-DE028983, and University of North Carolina Computational Medicine Program Award 2020.

ORCID iD

Hunyong Cho

Supplemental material

The reader is referred to the online Supplemental Materials for A. the standard error calculation, B. details of the EM algorithm, C. additional details of the Xylitol experiment data analyses, and Web Figures. Supplemental material is available online.

References

1. Preisser, Das, Long, et al. Marginalized zero-inflated negative binomial regression with application to dental caries. Stat Med 2016; 35: 1722–1735.

2. Gurmu. Semi-parametric estimation of hurdle regression models with an application to Medicaid utilization. J Appl Econ 1997; 12: 225–242.

3. Aldirawi, Yang, Metwally. Identifying appropriate probabilistic models for sparse discrete omics data. In: 2019 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI), pp. 1–4. IEEE, 2019.

4. Risso, Perraudeau, Gribkova, et al. A general and flexible method for signal extraction from single-cell RNA-seq data. Nat Commun 2018; 9: 284.

5. Van den Berge, Perraudeau, Soneson, et al. Observation weights unlock bulk RNA-seq tools for zero inflation and single-cell applications. Genome Biol 2018; 19: 1–17.

6. Greene. Accounting for excess zeros and sample selection in Poisson and negative binomial regression models, 1994.

7. Lambert. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 1992; 34: 1–14.

8. Warne. A primer on multivariate analysis of variance (MANOVA) for behavioral scientists. Practical Assessment, Research & Evaluation 2014; 19.

9. Kolodziejczyk, Kim, Svensson, et al. The technology and biology of single-cell RNA sequencing. Mol Cell 2015; 58: 610–620.

10. Hicks, Townes, Teng, et al. Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics 2017; 19: 562–578.

11. Huang, Wang, Torre, et al. SAVER: gene expression recovery for single-cell RNA sequencing. Nat Methods 2018; 15: 539–542.

12. Love, Huber, Anders. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 2014; 15: 550.

13. Hanson, Y-Y. Flexible bivariate correlated count data regression. Stat Med 2020; 39: 3476–3490.

14. Robinson, McCarthy, Smyth. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 2010; 26: 139–140.

15. Iacono, Massoni-Badosa, Heyn. Single-cell transcriptomics unveils gene regulatory network plasticity. Genome Biol 2019; 20: 110.

16. A new dynamic correlation algorithm reveals novel functional aspects in single cell and bulk RNA-seq data. PLoS Comput Biol 2018; 14: e1006391.

17. Zhang, Horvath. A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol 2005; 4. 10.2202/1544-6115.1128.

18. Eraslan, Simon, Mircea, et al. Single-cell RNA-seq denoising using a deep count autoencoder. Nat Commun 2019; 10: 390.

19. Pont, Tosolini, Fournié. Single-cell signature explorer for comprehensive visualization of single cell signatures across scRNA-seq data sets. Nucleic Acids Res 2019; 47: e133.

20. Van Dijk, Sharma, Nainys, et al. Recovering gene interactions from single-cell data using data diffusion. Cell 2018; 174: 716–729.

21. An accurate and robust imputation method scImpute for single-cell RNA-seq data. Nat Commun 2018; 9: 997.

22. Peng, Zhu, Yin, et al. SCRABBLE: single-cell RNA-seq imputation constrained by bulk RNA-seq data. Genome Biol 2019; 20: 88.

23. Wang, Huang, Torre, et al. Gene expression distribution deconvolution in single-cell RNA sequencing. Proc Natl Acad Sci USA 2018; 115: E6437–E6446.

24. Bader, Vollmer, Shugars, et al. Results from the xylitol for adult caries trial (X-ACT). J Am Dent Assoc 2013; 144: 21–30.

25. Ritter, Bader, Leo, et al. Tooth-surface-specific effects of xylitol: randomized trial results. J Dent Res 2013; 92: 512–517.

26. Preisser, Long, Stamm. Matching the statistical model to the research question for dental caries indices with many zero counts. Caries Res 2017; 51: 198–208.

27. Famoye. On the bivariate negative binomial regression model. J Appl Stat 2010; 37: 969–981.

28. Gurmu, Elder. Generalized bivariate count data regression models. Econ Lett 1999; 68: 31–36.

29. Jørgensen. Exponential dispersion models. J R Stat Soc: Ser B (Methodol) 1987; 49: 127–145.

30. Famoye, Consul. Bivariate generalized Poisson distribution with some applications. Metrika 1995; 42: 127–138.

31. Cameron, Trivedi, et al. Modelling the differences in counted outcomes using bivariate copula models with application to mismeasured counts. Econom J 2004; 7: 566–584.

32. Safari-Katesari, Samadi, Zaroudi. Modelling count data via copulas. Statistics 2020; 54: 1329–1355.

33. Cameron, Trivedi. Regression analysis of count data. New York: Cambridge University Press, 2013.

34. Chou, Steenhard. Bivariate count data regression models – a SAS macro program. SAS Global Forum – Statistics and Data Analysis, SAS Institute, 2011.

35. Wang. A bivariate zero-inflated negative binomial regression model for count data with excess zeros. Econ Lett 2003; 78: 373–378.

36. Park, et al. Multivariate zero-inflated Poisson models and their applications. Technometrics 1999; 41: 29–38.

37. Kocherlakota. Bivariate Discrete Distributions. New York: Marcel Dekker, 1992.

38. Maher. A bivariate negative binomial model to explain traffic accident migration. Accid Anal Prev 1990; 22: 487–498.

39. Bolger, Lohse, Usadel. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 2014; 30: 2114–2120.

40. Shapiro. Asymptotic distribution of test statistics in the analysis of moment structures under inequality constraints. Biometrika 1985; 72: 133–144.

41. Svensson. Droplet scRNA-seq is not zero-inflated. Nat Biotechnol 2020; 38: 147–150.

42. Townes, Hicks, Aryee, et al. Feature selection and dimension reduction for single-cell RNA-seq based on a multinomial model. Genome Biol 2019; 20: 1–16.

43. Vieth, Ziegenhain, Parekh, et al. powsimR: power analysis for bulk and single cell RNA-seq experiments. Bioinformatics 2017; 33: 3486–3488.

44. Mullahy. Specification and testing of some modified count data models. J Econom 1986; 33: 341–365.

45. Mwalili, Lesaffre, Declerck. The zero-inflated negative binomial regression model with correction for misclassification: an example in caries research. Stat Methods Med Res 2008; 17: 123–139.
