Abstract
Mediation analysis has become increasingly popular over the last decade as researchers are interested in assessing mechanistic pathways for intervention. Although available methods have increased, there are still limited options for mediation analysis with zero-inflated count variables where the distribution of response has a “cluster” of data at the zero value (i.e. distribution of number of cigarettes smoked per day, where nonsmokers cluster at zero cigarettes). The currently available methods do not obtain unbiased population average effects of mediation effects. In this paper, we propose an extension of the counterfactual approach to mediation with direct and indirect effects to scenarios where the mediator is a count variable with excess zeroes by utilizing the Marginalized Zero-Inflated Poisson Model (MZIP) for the mediator model. We derive direct and indirect effects for continuous, binary, and count outcomes, as well as adapt to allow mediator-exposure interactions. Our proposed work allows straightforward calculation of direct and indirect effects for the overall population mean values of the mediator, for scenarios in which researchers are interested in generalizing direct and indirect effects to the population. We apply this novel methodology to an application observing how alcohol consumption may explain sex differences in cholesterol and assess model performance via a simulation study comparing the proposed MZIP mediator framework to existing methods for marginal mediator effects.
Introduction
Zero-inflated count variables are common in many fields of research; for example, in cardiovascular disease (CVD) research, this could include risk factors such as number of cigarettes smoked and number of alcoholic beverages consumed or health outcomes such as number of arrhythmias, surgery complication count, and coronary artery stenosis.1–3 Standard count regressions like Poisson and negative binomial models fail to accurately predict count outcomes with excess zeroes. 4 The Zero-Inflated Poisson (ZIP) model was developed based on a mixture distribution of a degenerative distribution at zero (excess zeroes) and a Poisson distribution. 4 Parameters from ZIP's Poisson process are interpreted with respect to the nonexcess zero population, but often researchers are interested in explaining effects with respect to the whole population. 5 In response the Marginalized Zero-Inflated Poisson (MZIP) model was developed by reparametrizing the likelihood of the ZIP model, to directly model the overall mean while addressing zero-inflation. 6
Mediation analysis has become a powerful tool used in many fields to explore causal pathways that may suggest ways to lessen the burden of CVD-related disparities, allowing investigators to quantify the portion of the association between an exposure and outcome that can be explained by a potential mediating factor (Figure 1). For example, in men, higher Southern diet scores (a potential mediating factor) explain approximately 46% of the association between Black race (an exposure) and incident hypertension (an outcome), 7 and a portion of the educational disparities in CVD risk are attributable to smoking. 8 While mediation methods have been proposed for zero-inflated count outcomes,9,10 there is a dearth of methods for zero-inflated count mediators.

Pathways of a standard mediation analysis with exposure, X, mediator, M, and outcome, Y. No interaction is assumed.
The counterfactual approach to mediation provides definitions of mediation effects based on calculations of expectations for both outcome and mediator models, which are easily implementable and computationally straightforward.11–13 Currently, the counterfactual approach to mediation has been adapted for binary, continuous and count outcomes and mediators. 13
This article extends the counterfactual approach to mediation for zero-inflated count mediators utilizing the MZIP model for the mediator.6,13 Assuming all assumptions of causal mediation methods are satisfied, this method allows for easily derivable and computationally efficient mediation effects for the overall population mean. Section 2 reviews the counterfactual approach specification of mediation effects. Section 3 reviews the MZIP model. Section 4 extends the counterfactual approach to mediation where the mediator is a count variable with excess zeroes using MZIP. Section 5 presents a simulation study to examine the properties of the new MZIP counterfactual mediation model to compare to standard count counterfactual mediation methods. Section 6 presents analysis observing if gender differences in lipoprotein cholesterol can be explained by alcohol consumption. A discussion will follow in section 7.
Traditional approaches to mediation such as the difference and product methods do not give causal interpretations when there are interactions or when outcome or mediator models beyond the identity link are fitted.13,14 Many frameworks have been developed to address the lack of flexibility of traditional method including the counterfactual approach to mediation.11–13,15–18 Assume
The counterfactual approach involves fitting two regression equations: equation (1) regresses the outcome on the exposure, mediator, and any covariates and equation (2) regresses the mediator on the exposure and the same covariates as in the first model.
Standard errors for effects in mediation analysis are typically computed using bootstrapping methods.19,20 In the case of large sample sizes, bootstrapping may be too computationally intensive, and standard errors using the delta method are alternatively available.13,16
The counterfactual approach to causal mediation requires the following assumptions about confounding to be satisfied for accurate estimation of NDE and NIE:
Assumption 1: No uncontrolled confounding of the exposure–outcome relationship ( Assumption 2: No uncontrolled confounding of the mediator–outcome relationship ( Assumption 3: No uncontrolled confounding of the exposure–mediator relationship ( Assumption 4: No mediator–outcome confounder is affected by the exposure (
Where

This DAG illustrates a scenario with proper control of confounders in a mediation analysis with an exposure X, mediator M, outcome Y, exposure–outcome confounder C1, mediator–outcome confounder C2, and exposure–mediator confounder C3.
When a count variable has more zeroes than expected by a count distribution, this count is often referred to as “zero-inflated” or having “excess zeroes.” When a count outcome has excess zeroes, Poisson and negative binomial model estimates will be biased.
4
While several models have been developed for excess zeroes in count data, many of these models do not provide inference comparable to standard count regression.4,24–26 For example, the zero-inflated Poisson (ZIP) model allows the count variable of interest,
Mediation with zero-inflated count mediator
Using the counterfactual definitions of mediation effects allows for easily implementable and computationally straightforward estimations of NDE and NIE. Merging this framework with MZIP gives mediation effects interpreted with respect to the population mean while minimizing the bias of mediation effects. In addition to the confounding and consistency assumptions needed for causal mediation discussed in Section 2, the proposed methods additionally require that the mediator and outcome model are correctly specified.
Continuous outcome
Integrating the MZIP model into the counterfactual mediation framework results in derivations similar to the counterfactual approach with a Poisson model. When the mediator is a count variable with excess zeroes, the first fit model a continuous outcome, Y, on the exposure, X, mediator, M, and a vector of covariates,
By summing NDE and NIE the total effect can be obtained. The proportion of the exposure–outcome relationship operating through the mediator, called the proportion mediated, can be derived by
As in other applications of counterfactual mediation, standard errors and confidence intervals for the direct and indirect effects can be derived through bootstrapping or delta method techniques (Supplemental Appendices A1 and A2). Overdispersion is often a concern of Poisson models mostly due to underestimation of variance. For zero-inflated counts, the use of robust standard errors has been shown to alleviate this burden when using an MZIP model. 25 When using the delta method for the proposed method one can use either model based or robust covariance structures for the MZIP mediator model to obtain effect estimates, minimizing the burden of overdispersion. For an outcome model specified using an identity link, the formulas of NDE and NIE will be the same for MZIP and Poisson mediator models, but the two models differ in distributional assumptions and estimation techniques.
Binary or count outcome
Derivations of mediation effects have also been computed for binary and count outcomes. Given the odds ratio is noncollapsible, it is not recommended to use a logistic regression outcome model for a nonrare binary outcome in a mediation framework.
29
For nonrare binary outcomes it is recommended to use a log-binomial or Poisson model with robust standard errors to obtain risk ratio interpretations of effects.
30
Since a log-link is used for the outcome model, NDE and NIE will be on a risk ratio scale. For binary (log-link) outcomes, the model is specified as:
Simulation
To examine the properties of the proposed mediation methods, a simulation study was performed using the model and formulas for direct and indirect effects with a continuous outcome specified in the continuous outcome section.
For each iteration, a binary exposure of interest x is simulated from a Bernoulli distribution with a probability of 0.5. A covariate c, that is simulated from
Various parameter scenarios in which the natural direct and natural indirect effect are in the same direction were examined, meaning we can conveniently describe each scenario with the proportion mediated. Four scenarios of mediator data generation were considered as method performance may vary by zero-inflation levels, overall mean, and exposure effect on the probability of being an excess zero impact results (Table 1). Scenario 1 was used as a reference, scenario 2 decreased the zero-inflation, scenario 3 widened the gap between treatment and control for the probability of being an excess zero, and scenario 4 increased the overall mean (Figure 3). For each parameter scenario, different samples sizes (200, 600, 1000) are considered with 5000 iterations. Additionally, using equation (1) we assume
Scenarios for simulation study on zero-inflated mediator using MZIP.
Scenarios for simulation study on zero-inflated mediator using MZIP.
*The confounder variable is fixed at its mean level C = 2 for these calculations.

Distributions of simulated zero-inflated mediator by exposure group. From Scenario 1, Scenario 2 has a decreased probability of being an excess zero, Scenario 3 has a larger differential effect on the probability of being an excess zero in the unexposed group compared to the exposed group, and Scenario 4 has an increased overall population mean.
Given the lack of zero-inflated count mediation methods estimating marginalized effects, the proposed MZIP meditation method is compared to Poisson and linear mediator models which ignore the excess zeroes feature but provide estimation through modeling the overall mean. For each mediator modeling approach, the NDE will be the same as it relies solely on the outcome model. To compare methods in the estimation of NIE, percent median bias, coverage, power, and median standard error using both the delta method and bootstrapped standard errors are calculated.
From Table 2, note that MZIP mediator models have the lowest bias. Poisson regression was not noticeably biased in scenarios with a lower effect of exposure on excess zero probabilities. However, Poisson methods exhibited increased bias when there was a larger treatment effect on the probability of excess zero (scenario 3) and when the overall mean was increased (scenario 4). This is not unexpected as the Poisson model does not account for these differential effects in the treatment on the probability of being an excess zero. Linear regression yielded biased results in every scenario. For the MZIP mediator model bias decreased modestly as sample size increased. For Poisson and linear mediator models, this trend did not hold and in some cases with higher zero-inflation bias increased as sample size increased. This is likely reflective of bias converging to the population value based on the Poisson and normal distributions which are overestimating the mean of the zero-inflated mediator.
Comparison of median percent bias for estimation of NIE using MZIP, Poisson, and linear models for the mediator.
In Table 3 note that the delta method coverage probabilities for the NIE for Poisson and linear regression in a mediation framework are subpar and their bootstrap counterparts are slightly lower than the nominal 95%. Coverage for MZIP mediator models using both the delta method and bootstrap standard errors were near 95% for all scenarios. Coverage was stable across the sample size for all methods.
Comparison of coverage probabilities for NIE for MZIP, Poisson, and linear mediator models.
Delta method and bootstrap errors for MZIP were comparable in terms of power (Table 4) and median standard error (Table 5) implying that bootstrap methods may not be necessary for MZIP application and delta method variance estimation is sufficient. Also, MZIP standard errors were close to the intrinsic standard error for the model (Table 5) implying that the model accurately estimates parameter variability. Poisson regression significantly underestimated standard errors which explains the poor coverage and high power of the model. Linear regression yielded a higher variance of NIE than other models, but still underestimated the true variance of NIE. The performance of linear regressions can be explained by linear regression tendency to not perform well due to skewness and sparsity of count data causing heteroskedasticity of standard errors. 31
Comparison of power for NIE estimates using MZIP, Poisson, and linear regression mediator models.
Comparison of median standard errors for NIE estimates using MZIP, Poisson, and linear regression mediator models.
Simulations were also completed for binary outcomes (Supplemental Appendix A5) and for overdispersed zero-inflated count mediators (Supplemental Appendix A6). The simulations with binary outcomes were comparable to continuous outcomes. For overdispersed mediators we observed that model-based delta method variance for MZIP did not provide adequate coverage; however, bootstrapping or use of robust delta method variance led to nominal coverage with robust errors having rapid computation speed. Additional simulations were conducted varying the value of
Overall, the proposed mediation method for zero-inflated count mediators using MZIP performed well in estimating the NIE and its corresponding variance in all sample sizes considered under both standard error estimation techniques. Poisson regression significantly underestimated the variance when using delta method errors. Delta method standard errors inherit distributional assumptions of the mediator model; for a Poisson model, the mean is equal to the variance. Notably for a mediator with a large number of zeroes, the overall mean is small resulting in small Poisson model variance estimates as well. Although computationally intensive bootstrap methods largely resolved the deficiencies of variance estimation for the Poisson mediator model, the biased estimation of mediation effects is problematic, particularly when there was a large treatment effect on the probability of being an excess zero and when the overall mean of the zero-inflated mediator was increased. Linear regression methods also performed poorly, indicating that jointly ignoring the zero-inflation and count nature of the mediator can lead to severely biased estimation. Linear regression assumes the mediator is unbounded, so it is not surprising that it behaved poorly with a bounded variable.
Cholesterol has long been associated with CVD events. 32 Using this novel mediation technique, we will observe if sex differences in lipid values can be explained by behavioral factors suitable to intervention. Studies have found relationships between alcohol consumption and low-density lipoprotein cholesterol, high-density lipoprotein cholesterol, and triglycerides.33–35 Alcohol consumption (e.g. number of drinks per week) is a zero-inflated count variable that could be intervened upon if this variable acts as a mediator between sex and cholesterol.
The REasons for Geographic And Racial Differences in Stroke (REGARDS) study is an ongoing, national cohort targeted at identifying factors that explain regional and race differences in stroke. 36 REGARDS enrolled 30239 black and white individuals between 2003 and 2007 and continues to follow participants to understand why stroke incidence is higher among Black Americans and southerners, particularly in regions with a higher risk of stroke called the Stroke Belt and Stroke Buckle. 36 REGARDS has intensive baseline and follow-up data on participants and is an ideal setting for exploring reasons for CVD-related disparities. Using this cohort, we observe how sex differences in lipid measures (9.5 years after cohort entry) can be explained by baseline alcohol consumption.
Triglycerides follow a skewed distribution, so they were log-transformed for analysis. Finally, adjustment is made for numerous covariates at baseline including race, urbanicity, geographical region (Stroke belt, Stroke buckle), income level, education level, and baseline statin use. After excluding people with missing baseline covariates and follow-up cholesterol, our analytic sample size is 12093. Alcohol consumption in REGARDS is self-reported as the number of drinks per week and contains about 70% zeroes (Figure 4). We assume that confounding and consistency assumptions are satisfied. In applied work, rigorous examination of these assumptions is necessary.

Distribution of number of alcoholic drinks in the last week by sex in the REGARDS study (n = 12,093). Over 70% of participants reported no drinks in the last week.
Shown in Table 6 are the results of the analysis examining the potential mediation of sex differences in log-triglycerides by alcohol consumption (Supplemental Appendix A7). Due to the large sample size, variance estimates were similar for all methods except for indirect and total effects for the linear regression mediator model without an interaction this is likely due to a combination of skewness in the mediator model causing heteroskedasticity of variance estimates and not including the interaction term in the outcome model to explain variability. From simulations in Section 5, we observe that Poisson and linear regression had higher estimates of NIE and are likely overestimating NIE and subsequently the proportion mediated. These results hold with and without exposure–mediator interaction effects, and the estimated NIE was less when including the exposure–mediator interaction across all methods. Interaction terms were significant in the outcome model (p < .0001). We found that about 12% of sex disparities in triglycerides can be explained by alcohol consumption and that the relationship between sex and triglycerides varies by alcohol consumption.
Mediation results showing sex disparities in triglycerides (log) explained by alcohol consumption (female = reference group).
Sensitivity analysis stratifying by statin use examined whether mediation effects varied by medication usage and no significant differences were observed across strata. One limitation of this analysis is the potentially nonlinear relationship between alcohol consumption and cholesterol,33,34 but accounting for a nonlinear relationship in mediation analysis is an area of future method development. Different specifications of alcohol consumption may be warranted to account for such nonlinearities through, for example, categorization. Although methods for ordinal mediators exist they are not comparable to the proposed method and were not considered.37,38 We also considered robust standard errors for effect estimates given the seemingly over-dispersed outcome, but standard errors were equivalent to model-based standard errors for MZIP.
This application utilizes alcohol consumption as an example of a zero-inflated count mediator. Other zero-inflated variables that may act as mediators include healthcare utilization frequency, 39 cigarette smoking, 40 and the Charlson comorbidity index. 41
A mediation method for zero-inflated count mediators was proposed by incorporating the MZIP model into the counterfactual mediation framework. This novel causal mediation method for zero-inflated count mediators has marginal effect interpretations, options for rapid computation of variance, exposure–mediator interaction compatibility, and can accommodate continuous, binary and count outcomes. Given satisfaction of the confounding and consistency assumptions of causal mediation, this novel application of MZIP in mediation analysis yields unbiased population-average NIE estimates in a straightforward way compared to other two-part zero-inflated models. While previous work has developed a methodology for zero-inflated count outcomes,9,10 the proposed method focuses on zero-inflated mediators.
The simulation study discussed in Section 5 demonstrated that other marginal mediator models (Poisson and linear regression) gave biased results, particularly given a large treatment effect on the probability of being an excess zero. This is because these models do not account for exposure differences in excess zeroes and subsequent impact on parameter estimation. Simulation results also showed that Poisson and linear regression underestimated the variance of NIE. While mediation for Poisson and linear mediator models are readily available and easy to use,13,15 using these methods on a zero-inflated count variable should be avoided to prevent inaccurate and unreliable conclusions.4,5 Specifically, Poisson models tend to overestimate the overall mean of zero-inflated counts while underestimating variance. The assumption of normality in linear regression fails to be satisfied when the mediator has a large proportion of observations on a boundary space of the observed variable. The discussed method using MZIP yields unbiased estimates of NIE and its variance and is now readily available in an R package called “mzipmed” on the Comprehensive R Archive Network.42,43
Standard errors for NDE and NIE are typically computed via bootstrap methods to account for multiple sources of model variability; however, this can be computationally intensive for large datasets such as the motivating REGARDS cohort. Using an MZIP mediator model, closed-form expressions of variance via the delta method that are comparable to bootstrapped variance estimation have been derived. Both delta and bootstrapping methods provide reliable estimates of variance and are incorporated into the R package. Avoiding computationally intensive methods for reliable variance estimation can provide analytic efficiency, particularly for large datasets.
While we have shown that the proposed method performs better than more conventional approaches for zero-inflated counts, the use of the MZIP model needs to be a justifiable modeling approach for the zero-inflated variable beyond mediation. As the MZIP model has significantly more parameters than a Poisson model, a sufficient sample size is also needed. Without sufficient zero counts to warrant a zero-inflated model, mediation with a Poisson model will be more powerful and computationally efficient than the MZIP model.4,44
One disadvantage of the proposed counterfactual approach to mediation is that added complexity to the mediator or outcome model requires new formulaic expressions of NDE and NIE. 13 Not all potential scenarios have been considered in the R package including cases with multiple mediator/exposures, covariate–exposure interactions, covariate–mediator interactions, and nonlinear exposure/mediators associations. While these derivations are obtainable, they were not presently considered and are an area of future development. Other potential expansions to this method also will allow the modeling of other types of outcomes such as time-to-event variables. In addition, we only considered ZIP, but data could be zero-inflated negative binomial. While robust standard errors using MZIP seem to perform adequately in our simulations, future work will extend this methodology to other marginal zero-inflated models such as negative binomial. 25
Conclusion
In this paper, we propose a causal mediation framework that takes into consideration zero-inflation of potential mediators by using MZIP for the mediator model, which provides marginal inference of the exposure on the mediator. Failure to consider the zero inflation of a mediator with excess zeroes with traditional models like Poisson and linear regression can yield inaccurate results. Marginalized mean indirect effect estimates are not directly obtained with the use of ZIP, meaning that inference on population effects is challenging to obtain. The proposed method circumvents these issues by minimizing the bias of indirect effects, giving ideal coverage of standard errors, and providing marginal effect estimates. While we focused on alcohol consumption as a zero-inflated count mediator, cigarette use, 40 sexual encounters, 6 dental caries, 5 healthcare utilization, 39 and coronary artery stenosis 45 are other zero-inflated variables. Each of these variables could be reliably incorporated into the discussed method as mediators to describe, for example, health disparities in cardiovascular, dental, or healthcare research.
Supplemental Material
sj-docx-1-smm-10.1177_09622802231220495 - Supplemental material for Application of marginalized zero-inflated models when mediators have excess zeroes
Supplemental material, sj-docx-1-smm-10.1177_09622802231220495 for Application of marginalized zero-inflated models when mediators have excess zeroes by Andrew Sims, Hemant Tiwari, Emily B. Levitan, Dustin Long, George Howard, Todd Brown, Melissa J. Smith, Jinhong Cui and D. Leann Long in Statistical Methods in Medical Research
Footnotes
Acknowledgments
We thank the other investigators, the staff, and the participants of the Reasons for Geographic and Racial Differences in Stroke study for their valuable contributions. A full list of participating Reasons for Geographic and Racial Differences in Stroke investigators and institutions can be found at
.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research project is supported and cofunded by the National Institute of Neurological Disorders and Stroke and the National Institute on Aging (cooperative agreement U01 NS041588). This project is also supported by a National Heart, Lung, and Blood Institute (NHLBI) predoctoral training fellowship (T32 HL155007).
Supplemental material
Supplemental material for this article is available online.
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
