Abstract
In this article, we present new commands for modeling count data using marginalized zero-inflated distributions. Although our main focus is on the commands themselves, we also present examples that illustrate some of them.
Keywords
1 Introduction
Often, count responses exhibit zero-inflation, that is, a higher prevalence of zeros than is accounted for by the underlying distribution of the regression model to be fit. This discordance can occur for outcome variables in many fields of study, such as medicine, public health, and manufacturing. In these cases, estimation based on the distributional assumptions of Poisson, generalized Poisson, and negative binomial models can result in incorrect parameter estimates and biased standard errors. Zero-inflated count data are encountered in the number of defects in manufacturing (Lambert 1992), patient falls in hospitals (Ullah, Finch, and Day 2010), and the number of cubes in the test of tower building for motor development (Cheung 2002), to name a few. Hardin and Hilbe (2018) describe the two origins of zero outcomes: outcomes for individuals who do not enter into the counting process and outcomes for individuals who enter into the counting process and have a zero outcome. Mullahy (1986) proposed the zero-inflated Poisson (ZIP) model, using a model familiar to researchers (Poisson), to deal with outcomes with an excess of zeros. However, for modeling count data with zero outcomes where overdispersion or underdispersion exists, one should consider other models, such as zero-inflated generalized Poisson (ZIGP) and zero-inflated negative binomial (ZINB) (Famoye and Singh 2006; Greene 1994).
Sometimes analysts want to estimate the marginal mean and to interpret the estimated coefficients as population-average parameters. Several approaches to marginal models have been proposed. Lee et al. (2011) proposed likelihood-based marginalized models for zero-inflated clustered count data using hurdle models. Kassahun et al. (2014) modeled hierarchical count data that exhibited overdispersion, correlation, and an excess of zeros using marginalized hurdle and marginalized ZIP (MZIP) normal-gamma models. Heagerty and Zeger (2000) used a marginalized multilevel model that regressed the marginal mean, instead of the conditional mean, on the covariates. Long et al. (2014) recently proposed an MZIP regression model that directly models the population mean count, thereby allowing the parameters to be interpreted population-wide. Preisser et al. (2016) also proposed a marginalized zero-inflated negative binomial (MZINB) regression model and applied it to dental caries in a school-based fluoride mouth rinse program.
We introduce the new commands
In this article, we illustrate modeling count data using MZIP, MZIGP, and MZINB regression models. In section 2, we review the three marginalized zero-inflated regression models. In section 3, we present syntax for the new commands. In section 4, we present a synthetic-data example and a real-world data example. Finally, we summarize in section 5.
2 Marginalized zero-inflated distributions
2.1 Marginalized ZIP distribution
The widely known ZIP regression model with a count outcome variable, Yi (i = 1,…, n), has the probability pi that the binary process results in a zero outcome, where 0 ≤ pi < 1, and the counting process probability of a zero outcome is from the Poisson distribution. Thus, we have a probability mass function (p.m.f.)
where µi = exp(xiβ) and pi = g−1(ziγ) and where g−1(·) is the inverse link function of the linear predictor ziγ; our software allows specification of inverse link functions for logit, probit, loglog, and complementary loglog.
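For reference, the ZIP p.m.f. implied by these definitions can be written in its standard form. This is a sketch reconstructed from the definitions above, with \(p_i\) the zero-inflation probability and \(\mu_i\) the Poisson mean, rather than a quotation of the article's display.

```latex
P(Y_i = y_i) =
\begin{cases}
p_i + (1 - p_i)\exp(-\mu_i), & y_i = 0, \\[4pt]
(1 - p_i)\,\dfrac{\exp(-\mu_i)\,\mu_i^{y_i}}{y_i!}, & y_i = 1, 2, \ldots
\end{cases}
```

Under this form, \(E(Y_i) = (1 - p_i)\mu_i\), so in the ZIP model β describes the mean of the counting process rather than the overall mean.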
For a random sample of observations y1, y2,…, yn, the MZIP regression log-likelihood function is given by
where the mean (µi) is rescaled from the ZIP regression model to µi = exp{xiβ − ln(1 − pi)} and Z is the set of zero outcomes.
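The log likelihood described here can be sketched by substituting the rescaled mean into the standard ZIP likelihood. The expression below is a reconstruction from the surrounding definitions, not a quotation of the article's display: it assumes the zero and nonzero outcomes contribute in the usual ZIP way, with Z the set of zero outcomes.

```latex
\mathcal{L}(\beta, \gamma) =
\sum_{i \in Z} \ln\left\{ p_i + (1 - p_i)\exp(-\mu_i) \right\}
+ \sum_{i \notin Z} \left\{ \ln(1 - p_i) - \mu_i + y_i \ln \mu_i - \ln(y_i!) \right\},
\qquad
\mu_i = \exp\{ x_i\beta - \ln(1 - p_i) \}
```

Under this rescaling, the overall mean is \(E(Y_i) = (1 - p_i)\mu_i = \exp(x_i\beta)\), which is why the coefficients β can be interpreted as effects on the marginal (population-average) mean.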
2.2 MZIGP distribution
The ZIGP regression model with a count outcome variable, Yi, where i = 1,…, n, has the p.m.f.
where µi = exp(xiβ), pi = g−1(ziγ), and δ is the dispersion parameter having 0 ≤ δ < 1. By applying the same concept from the MZIP regression model in section 2.1 to the ZIGP regression model, we introduce the MZIGP regression model. For a random sample of observations y1, y2,…, yn, the MZIGP regression log-likelihood function is
where the mean (µi) is rescaled from the ZIGP regression model to µi = exp{xiβ − ln(1 − pi)}, δ is the dispersion parameter having 0 ≤ δ < 1, and Z is the set of zero outcomes.
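As a hedged illustration of the MZIGP likelihood, the sketch below assumes one common parameterization of the generalized Poisson distribution, with θ = (1 − δ)µ and p.m.f. θ(θ + δy)^(y−1) exp(−θ − δy)/y!. The article's own display is not reproduced here, so this parameterization is an assumption; the function names are hypothetical and are not the article's commands.

```python
import math


def gp_logpmf(y, mu, delta):
    """Log p.m.f. of a generalized Poisson with mean mu.

    Assumed parameterization: theta = (1 - delta) * mu with 0 <= delta < 1,
    f(y) = theta * (theta + delta*y)**(y - 1) * exp(-theta - delta*y) / y!
    """
    theta = (1.0 - delta) * mu
    return (math.log(theta) + (y - 1) * math.log(theta + delta * y)
            - theta - delta * y - math.lgamma(y + 1))


def mzigp_loglik_obs(y, xb, p, delta):
    """Log-likelihood contribution of one observation under MZIGP.

    xb is the linear predictor of the marginal mean, so the mean of the
    counting process is rescaled to exp(xb - ln(1 - p)).
    """
    mu = math.exp(xb - math.log(1.0 - p))
    if y == 0:
        # Zero from either the binary process or the counting process.
        return math.log(p + (1.0 - p) * math.exp(gp_logpmf(0, mu, delta)))
    return math.log(1.0 - p) + gp_logpmf(y, mu, delta)


if __name__ == "__main__":
    # Illustrative values only: marginal mean exp(xb) = 2.
    xb, p, delta = math.log(2.0), 0.3, 0.3
    probs = [math.exp(mzigp_loglik_obs(y, xb, p, delta)) for y in range(400)]
    print("total probability:", sum(probs))
    print("marginal mean:", sum(y * q for y, q in enumerate(probs)))
```

Summing the implied probabilities gives a quick check that the mixture is a proper distribution and that its mean equals exp(xb), which is the point of the rescaling.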
2.3 MZINB distribution
The ZINB regression model with a count outcome variable, Yi, where i = 1,…, n, has the p.m.f.
where µi = exp(xiβ), pi = g−1(ziγ), and δ is the dispersion parameter. Lastly, we apply the same concept from the MZIP regression model in section 2.1 to the ZINB regression model, and we introduce the MZINB regression model. For a random sample of observations y1, y2,…, yn, the MZINB regression log-likelihood function is
where the mean (µi) is rescaled from the ZINB regression model to µi = exp{xiβ − ln(1 − pi)}, δ is the dispersion parameter, and Z is the set of zero outcomes.
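The MZINB log likelihood can be sketched in the same way. The code below assumes the common NB2 parameterization of the negative binomial (variance µ + δµ², δ > 0); whether the article uses exactly this form is an assumption, and the function names are hypothetical.

```python
import math


def nb_logpmf(y, mu, delta):
    """Log p.m.f. of a negative binomial with mean mu (assumed NB2 form).

    The variance is mu + delta * mu**2, with delta > 0.
    """
    r = 1.0 / delta
    return (math.lgamma(y + r) - math.lgamma(y + 1) - math.lgamma(r)
            + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))


def mzinb_loglik(ys, xbs, ps, delta):
    """MZINB log likelihood for outcomes ys.

    xbs holds the linear predictors of the marginal mean, and ps holds the
    zero-inflation probabilities (each in [0, 1)).
    """
    total = 0.0
    for y, xb, p in zip(ys, xbs, ps):
        # Rescale so that (1 - p) * mu equals exp(xb).
        mu = math.exp(xb - math.log(1.0 - p))
        if y == 0:
            total += math.log(p + (1.0 - p) * math.exp(nb_logpmf(0, mu, delta)))
        else:
            total += math.log(1.0 - p) + nb_logpmf(y, mu, delta)
    return total


if __name__ == "__main__":
    # Illustrative values only.
    print(mzinb_loglik([0, 3, 1], [0.0, 0.4, 0.4], [0.5, 0.2, 0.2], 0.5))
```

The zero outcomes take the mixture branch and the positive outcomes take the counting-process branch, mirroring the split over the set Z in the text.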
3 Syntax
The accompanying software includes the command files and supporting files for prediction and help. In the following syntax diagrams, unspecified options include the usual collection of maximization and display options available for all estimation commands. All marginalized zero-inflated commands include the
Equivalent in syntax to the
4 Examples
4.1 Example synthetic marginalized zero-inflated data
Here we illustrate how to generate synthetic marginalized zero-inflated data. We synthesized
Having created an outcome with our specified associations, we can fit some models (below) to see how closely the sample data match the specifications. The first model, a Poisson regression fit to our synthesized marginalized zero-inflated data, shows that the robust variance estimator does a good job of adjusting for the overdispersion due to the excess zeros (compare with the MZIP results at the end of this section).
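To make the data-generating step concrete, here is a minimal sketch in Python rather than in the article's software. It assumes a single binary covariate, a logit link for the zero-inflation probability, and illustrative coefficient values; none of these choices come from the article, and all names are hypothetical.

```python
import math
import random


def rpoisson(mu, rng):
    """Draw one Poisson(mu) variate by Knuth's method (adequate for small mu)."""
    limit = math.exp(-mu)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_mzip(n, beta, gamma, seed=12345):
    """Generate (x, y) pairs from an MZIP model with one binary covariate.

    Marginal mean: exp(beta[0] + beta[1] * x).
    Zero-inflation probability: invlogit(gamma[0] + gamma[1] * x).
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0
        p = 1.0 / (1.0 + math.exp(-(gamma[0] + gamma[1] * x)))
        nu = math.exp(beta[0] + beta[1] * x)  # marginal mean
        mu = nu / (1.0 - p)                   # rescaled Poisson mean
        y = 0 if rng.random() < p else rpoisson(mu, rng)
        data.append((x, y))
    return data


def mean_ratio(data):
    """Ratio of the sample mean of y at x = 1 to the sample mean at x = 0."""
    y1 = [y for x, y in data if x == 1]
    y0 = [y for x, y in data if x == 0]
    return (sum(y1) / len(y1)) / (sum(y0) / len(y0))


if __name__ == "__main__":
    sample = simulate_mzip(100000, beta=(0.0, 0.5), gamma=(-0.5, 0.3))
    print("ratio of group means:", mean_ratio(sample))
    print("exp(beta1):", math.exp(0.5))
```

With one binary covariate, the Poisson regression estimate of exp(β1) equals the ratio of the two group means, so this ratio should land near the value used to generate the data. This is consistent with the observation that the Poisson point estimates recover the marginal specification.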
However, when we fit a ZIP model to our sample data, we see a worse match to our synthetic-data specifications: the estimated coefficients for both of the nonzero-inflated components are not close to the values used to synthesize the data. Even so, we can use a program to calculate the difference and ratio versions of the average predicted value.
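The idea behind the difference and ratio versions of the average predicted value can be sketched as follows. This is an illustration in Python, not the program used in the article. It assumes a logit link for the inflation part and a binary covariate of interest, and the function names and coefficient values are hypothetical.

```python
import math


def zip_mean(x, beta, gamma):
    """Overall mean (1 - p) * mu of a ZIP model with a logit inflation link."""
    eta_count = sum(b * v for b, v in zip(beta, x))
    eta_infl = sum(g * v for g, v in zip(gamma, x))
    p = 1.0 / (1.0 + math.exp(-eta_infl))
    return (1.0 - p) * math.exp(eta_count)


def average_predicted_effect(rows, beta, gamma, j):
    """Average predicted value with covariate j set to 1 versus set to 0.

    Each row is a covariate vector whose first element is the constant 1.
    Returns the (difference, ratio) of the two averages.
    """
    at_one = [list(r[:j]) + [1.0] + list(r[j + 1:]) for r in rows]
    at_zero = [list(r[:j]) + [0.0] + list(r[j + 1:]) for r in rows]
    hi = sum(zip_mean(r, beta, gamma) for r in at_one) / len(rows)
    lo = sum(zip_mean(r, beta, gamma) for r in at_zero) / len(rows)
    return hi - lo, hi / lo


if __name__ == "__main__":
    rows = [[1.0, 0.0, 0.2], [1.0, 1.0, -0.4], [1.0, 0.0, 1.1]]
    beta = [0.1, 0.7, 0.3]     # count-component coefficients (illustrative)
    gamma = [-0.5, 0.4, 0.2]   # inflation-component coefficients (illustrative)
    diff, ratio = average_predicted_effect(rows, beta, gamma, j=1)
    print("difference:", diff, "ratio:", ratio)
```

Because the covariate enters both the count and the inflation components of a ZIP model, neither coefficient alone gives its effect on the overall mean. Averaging the predicted overall means combines the two. When the covariate has no effect on the inflation part, the ratio reduces to the exponentiated count coefficient.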
The ratio version of the average predicted value depicted above illustrates the total estimated effect of the
Finally, we fit the data with the MZIP regression model, requesting exponentiated coefficients. As expected, because the data were generated according to this model, the coefficients are well estimated.
4.2 Example real-world study
We use the popular German health reform data for the year 1984 as example data. The goal of our example is to understand the number of visits made to a physician during 1984. Our predictor of interest is whether the patient is highly educated based on achieving a graduate degree (
From the output, variables
Similarly, from the output, variables
5 Summary
In this article, we introduced supporting programs for modeling count data using marginalized zero-inflated distributions. We illustrated the use of the new command
6 Programs and supplemental materials
To install a snapshot of the corresponding software files as they existed at the time of publication of this article, type
Supplemental Material
Supplemental Material, st0563 for Modeling count data with marginalized zero-inflated distributions by Tammy H. Cummings and James W. Hardin in The Stata Journal
References
