Assessing treatment effect heterogeneity in the presence of missing effect modifier data in cluster-randomized trials

Abstract

Understanding whether and how treatment effects vary across subgroups is crucial to inform clinical practice and recommendations. Accordingly, the assessment of heterogeneous treatment effects based on pre-specified potential effect modifiers has become a common goal in modern randomized trials. However, when one or more potential effect modifiers are missing, complete-case analysis may lead to bias and under-coverage. While statistical methods for handling missing data have been proposed and compared for individually randomized trials with missing effect modifier data, few guidelines exist for the cluster-randomized setting, where intracluster correlations in the effect modifiers, outcomes, or even missingness mechanisms may introduce further threats to accurate assessment of heterogeneous treatment effects. In this article, the performance of several missing data methods is compared through a simulation study of cluster-randomized trials with a continuous outcome and missing binary effect modifier data, and further illustrated using real data from the Work, Family, and Health Study. Our results suggest that multilevel multiple imputation and Bayesian multilevel multiple imputation perform better than other available methods, and that Bayesian multilevel multiple imputation has lower bias and coverage closer to the nominal level than standard multilevel multiple imputation when there are model specification or compatibility issues.

Keywords

Bayesian inference, cluster randomized trials, heterogeneous treatment effects, missing data, multilevel multiple imputation

1. Introduction

Cluster-randomized trials (CRTs) are a popular experimental design where groups of individuals are randomized to different treatment arms.¹ This design is used when the intervention is designed to be administered at the cluster level or in studies where randomization of individuals is difficult or impractical. Though methods for study design and analysis of CRTs have been developed (see, e.g. a pair of recent reviews by Turner et al.^2,3), they mostly focus on studying the average treatment effect (ATE). Methods for the design and analysis of CRTs have only recently branched into identifying heterogeneous treatment effects (HTEs), where the treatment effect is hypothesized to vary across pre-specified subgroups of the trial population.^4–9 In particular, there is burgeoning interest in confirmatory HTE analysis of CRTs focused on pre-specified subgroups defined by baseline demographic characteristics. A recent systematic review of 64 CRT analyses published between 2010 and 2016 found that 16 (25%) examined HTE across pre-specified demographic subgroups, and noted that guidance on design and analysis of CRTs to assess HTE needs further development.¹⁰ Evaluation of HTE is critical to understanding treatment effects in vulnerable subpopulations and developing real-world recommendations for the administration of new treatments.

In this article, we focus on confirmatory and hypothesis-driven HTE analyses that require pre-specification. This is in contrast to exploratory HTE analyses that are often ad-hoc and used for generating hypotheses for future studies. Confirmatory HTE analyses in CRTs can be assessed using a variety of methods, including generalized estimating equations (GEE) or generalized mixed-effects models with statistical interaction terms,⁴ pre-specified subgroup analyses (often also based on GEE or mixed-effects models),¹¹ and data-driven methods for targeting pre-specified conditional ATEs for well-defined subpopulations.^12,13 Among these approaches, the statistical interaction approach is likely the most popular in practice.^10,14 These interactions are often assessed one at a time, but the potential effect modifiers composing the interactions may be subject to missingness. As these modifiers are typically baseline covariates, they can be missing in a variety of common scenarios, including when electronic health record data is used to capture baseline covariates or when trial participants actively choose not to respond to questions in baseline surveys and enrollment forms. This latter scenario may occur if the questions make participants uncomfortable, if participants are not sure how to answer the questions, or if there are language barriers, among other reasons. Missing data of various forms can introduce bias in the analysis and interpretation of study results,¹⁵ and missing modifiers in particular have been shown to introduce bias in the estimation of interaction terms in observational studies and individually randomized trials.^16,17 However, the extent of this bias and appropriate methods to address missing effect modifier data are under-explored in CRTs. 
In the remainder of this section, we review some existing methods from two related areas of research: missing outcomes in CRTs and missing modifiers in individually randomized trials and observational studies, to provide context for our current study.

While research on missing effect modifier data in CRTs is under-developed, research on methods for missing outcome data in CRTs has grown substantially over the last two decades.^3,18,19 In particular, multiple imputation (MI) has been shown to be a useful tool for reducing bias,²⁰ but addressing the correlations within clusters was found to be key for valid estimation and inference in CRTs.²¹ One method that addresses such correlations is multilevel MI (MMI), which includes random intercepts for clusters in both imputation and outcome models. MMI has been studied for both continuous and binary outcomes in CRTs,^22–24 and has also been shown to be useful for studies with missing multivariate outcomes.²⁵ While fixed effects for clusters are used in some imputation models, such specification may introduce model parameter estimation bias in many settings, especially when the intracluster correlations (ICCs) or cluster sizes are small.²⁶ Whether and how these methods can be applied to study HTE in CRTs with missing effect modifier data is comparatively less well understood; the goal of this article is, therefore, to compare existing methods such as MI and MMI in a CRT setting with a continuous outcome where a binary effect modifier (used to define pre-specified subgroup membership) is subject to missingness.

Research developing or comparing methods for addressing missing modifier data is more developed in the observational study and individually randomized trial settings. In some of this existing literature, these methods are described broadly as methods for missing covariates or missing covariates with interactions. It is worth mentioning that in the current article, all effect modifiers of interest we consider are baseline covariates, even though not all baseline covariates are effect modifiers of interest. In observational studies, MI and MMI are popular for handling missing modifiers, and much attention has been focused on whether interaction terms should be computed as the product of variables after imputation (passive imputation) or treated as their own missing variable in the process (just-another-variable imputation).¹⁷ This issue is less important in randomized trials focused on HTE, since an interaction between treatment and a modifier will always equal 0 under placebo and equal the modifier under treatment (assuming 0-1 coding). A second key issue, which remains relevant for randomized trials, is that imputation and outcome models should be compatible. Imputation and outcome models are compatible (sometimes referred to as congenial) if there exists a joint model for the modifier and outcome which has implied conditional distributions equal to those specified by the imputation and outcome models.^27,28 However, choosing a compatible imputation model may not be straightforward, especially when the outcome model contains interaction terms. Semiparametric approaches such as fractional imputation^29,30 and nonparametric approaches such as imputation from Bayesian additive regression trees (BARTs)^31,32 have been proposed to address missing covariates, but such methods have not yet been extended to account for the within-cluster correlations in the covariates and outcomes in the CRT setting. 
With independent data, Erler et al.³³ noted that nesting MI in a Bayesian approach could alleviate imputation model misspecification issues and help account for additional uncertainty in the imputation process. This procedure can be adapted for the analyses of CRTs with a random-effects specification in the imputation model to arrive at a Bayesian MMI (B-MMI) approach, which represents another useful candidate method for evaluating HTE in CRTs with missing modifiers. Thus, from prior work on (i) missing outcomes in CRTs and (ii) missing modifiers in individually randomized trials or observational studies, we can enlist a set of existing methods that may be applicable for assessing HTE in CRTs with missing modifiers, but which have not been studied specifically for that purpose. Importantly, previous methods and guidelines developed for individually randomized trials will not necessarily be applicable for CRTs because covariates and/or missingness may be correlated within clusters, and this challenge has motivated our current study.

Overall, the contribution of this article is to develop and report on a series of comparison studies in order to determine which readily available methods may be most suitable for assessing HTE in CRTs with missing effect modifier data. The rest of the article is structured as follows. First, we define notation and describe the GEE approach,³⁴ which will be used for all outcome analyses throughout the article. Then, we describe each of the missing data methods that will be compared. Next, we compare the methods in an extensive simulation study using the ADEMP framework.³⁵ Then, we apply each of the methods to real data from the Work, Family, and Health Study, where we impose controlled missing data scenarios on complete data where the true values are known. Finally, we conclude with a discussion of the results and offer some recommendations for practitioners.

2. Notation and outcome model for CRTs

We consider the setting of a two-arm CRT with intervention assignment for cluster $i$ denoted as $A_{i} = 1$ for treatment and $A_{i} = 0$ for control. Suppose there are $C$ clusters of potentially varying sizes $n_{i}$ and let the total sample size $N = \sum_{i = 1}^{C} n_{i}$ . Let $M_{i j}$ be a pre-specified effect modifier (or subgroup variable) of interest for individual $j$ in cluster $i$ and let $Y_{i j}$ be the observed outcome for individual $j$ in cluster $i$ . In this article, we will focus specifically on a binary effect modifier measured at the individual level and a continuous outcome variable. Let $R_{i j}$ be an indicator equal to 1 if $M_{i j}$ is measured and 0 if $M_{i j}$ is missing. Suppose that there are $p$ additional baseline covariates available denoted $X_{i j} = (X_{1 i j}, X_{2 i j}, \dots, X_{p i j})^{T}$ . We assume $X_{i j}$ are fully observed auxiliary variables that may be marginally or conditionally associated with the missing effect modifier.

We first describe the data and analytic model in cases where the effect modifier is fully observed. We will focus on a GEE approach with correctly specified mean model for testing HTE to streamline our discussion. We consider the GEE approach because (i) the estimation and inference for treatment effect parameters can be robust to misspecification of the within-cluster correlation structure under correct specification of the marginal mean model,³⁴ and (ii) the population-averaged interpretation of the marginal mean model may be preferable for the analysis of CRTs.³⁶ We acknowledge, however, that GEE is not the only possible outcome modeling choice for CRTs, and mixed-effects regression may provide a more efficient estimator when both the conditional mean model and the random-effects structure are correctly specified. Suppose that the outcome follows a generalized linear marginal model of the form

\begin{aligned} g \{E (Y_{ij})\} = g (μ_{ij}) = γ_{0} + γ_{1} A_{i} + γ_{2} M_{ij} + γ_{3} A_{i} M_{ij} \end{aligned}

where $g$ is a link function, and $μ_{ij}$ is the mean function of individual $j$ in cluster $i$. GEE methods estimate the model parameters $γ = (γ_{0}, γ_{1}, γ_{2}, γ_{3})^{T}$ while allowing for correlation between the model outcomes across individuals. This is accomplished by specifying the structure of a working covariance matrix $V_{i}$ for $Y_{i} = (Y_{i1}, Y_{i2}, \dots, Y_{i n_{i}})^{T}$ and solving the estimating equations

\begin{aligned} \sum_{i = 1}^{C} \frac{\partial μ_{i}}{\partial γ^{T}} V_{i}^{- 1} (Y_{i} - μ_{i}) = 0 \end{aligned}

where $μ_{i} = (μ_{i1}, μ_{i2}, \dots, μ_{i n_{i}})^{T}$. Common specifications of the working covariance structure include independence or exchangeability of individuals within a cluster. Importantly, when the mean model is correctly specified, the GEE estimator remains consistent regardless of covariance specification,³⁷ but can usually be more efficient when the working covariance structure is correctly specified.^34,38 In CRTs, an exchangeable correlation matrix is commonly specified to acknowledge the correlated outcomes in each cluster. Typically, robust sandwich standard error estimators are used for inference, and remain consistent estimators even if the working covariance model is incorrectly specified. The sandwich variance estimator takes the form of

\begin{aligned} \hat{Δ} \left\{ \sum_{i = 1}^{C} \frac{\partial μ_{i}}{\partial γ^{T}} V_{i}^{- 1} (Y_{i} - {\hat{μ}}_{i}) (Y_{i} - {\hat{μ}}_{i})^{T} V_{i}^{- 1} \frac{\partial μ_{i}}{\partial γ} \right\} \hat{Δ} \end{aligned}

where $\hat{Δ} = \left\{ \sum_{i = 1}^{C} (\partial μ_{i} / \partial γ^{T}) V_{i}^{- 1} (\partial μ_{i} / \partial γ) \right\}^{- 1}$ is referred to as the model-based variance estimator. Of note, even though the cluster sizes are potentially varying, we have assumed the absence of informative cluster size³⁹ and a correct specification of the GEE marginal mean model. In this case, we can define the two subgroup-specific treatment effect estimands on the link function scale by $γ_{1} = g \{E (Y_{ij} | A_{i} = 1, M_{ij} = 0)\} - g \{E (Y_{ij} | A_{i} = 0, M_{ij} = 0)\}$ and $γ_{1} + γ_{3} = g \{E (Y_{ij} | A_{i} = 1, M_{ij} = 1)\} - g \{E (Y_{ij} | A_{i} = 0, M_{ij} = 1)\}$; therefore, $γ_{1} + γ_{3} E (M_{ij})$ measures the ATE, whereas $γ_{3}$ represents the difference between the two subgroup-specific treatment effects and measures the degree of treatment effect heterogeneity.
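The estimating equations and sandwich variance above can be illustrated with a minimal Python sketch. It assumes an identity link and an independence working correlation (a simplification; the text notes that an exchangeable structure is more common in CRTs), in which case the GEE point estimate coincides with least squares and only the cluster-robust variance needs separate computation. The function name `gee_independence` and the toy data are hypothetical and are not taken from the article.

```python
import numpy as np

def gee_independence(y, X, cluster):
    """Identity-link GEE with an independence working correlation (V_i = I).

    Returns the point estimates and cluster-robust (sandwich) standard errors.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    cluster = np.asarray(cluster)
    # With V_i = I and identity link, d(mu_i)/d(gamma) is the design matrix,
    # so the estimating equations reduce to the least-squares normal equations.
    bread = np.linalg.inv(X.T @ X)      # inverse of sum_i X_i^T X_i
    gamma_hat = bread @ X.T @ y
    resid = y - X @ gamma_hat
    # Middle term: sum over clusters of X_i^T r_i r_i^T X_i.
    meat = np.zeros((X.shape[1], X.shape[1]))
    for c in np.unique(cluster):
        rows = cluster == c
        score = X[rows].T @ resid[rows]
        meat += np.outer(score, score)
    cov = bread @ meat @ bread
    return gamma_hat, np.sqrt(np.diag(cov))

# Toy data (hypothetical): 20 clusters of 25, treatment assigned by cluster.
rng = np.random.default_rng(0)
cluster = np.repeat(np.arange(20), 25)
A = (cluster % 2).astype(float)
M = rng.integers(0, 2, size=cluster.size).astype(float)
design = np.column_stack([np.ones(cluster.size), A, M, A * M])
y = design @ np.array([1.0, 1.0, 0.75, -0.5]) + rng.normal(size=cluster.size)

gamma_hat, se = gee_independence(y, design, cluster)
effect_m0 = gamma_hat[1]                  # treatment effect when M = 0
effect_m1 = gamma_hat[1] + gamma_hat[3]   # treatment effect when M = 1
```

Under this coding, `gamma_hat[3]` plays the role of $γ_{3}$, the difference between the two subgroup-specific treatment effects.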

Next, five missing data methods associated with fitting the GEE outcome model when $M_{i j}$ is partially missing are described in detail. We will develop a simulation study and application to compare these approaches to generate practical recommendations for analyzing CRTs. We specifically consider two types of missing data mechanisms. First, when $R_{i j} ⫫ M_{i j}$ , the effect modifier data are missing completely at random (MCAR). When the modifier missingness depends only on observed variables, that is, $R_{i j} ⫫ M_{i j} | A_{i}, Y_{i j}, X_{i j}$ , the data are said to be missing at random (MAR). When the modifier missingness depends on the values of the modifier or other unobserved data, the data are considered missing not at random (MNAR); we will not address MNAR scenarios in our simulations or application, but will return to a discussion of this at the end.

3. Statistical methods for addressing a missing binary effect modifier in CRTs

3.1. Complete-case analysis (CCA)

CCA entails fitting a model based on only individuals for whom complete data is available. Thus, any individuals with missing effect modifiers will be excluded from the analysis data set. While this is usually a valid strategy in MCAR settings, it may lead to biased inference in MAR settings where the outcome and missingness of the modifier are both dependent on one or more baseline covariates. In addition, CCA may have lower power than other methods which utilize the full data. Performing CCA is equivalent to solving the GEE estimating equations weighted by the indicator that the modifier is observed

\begin{aligned} \sum_{i = 1}^{C} \frac{\partial μ_{i}}{\partial γ^{T}} R_{i} V_{i}^{- 1} (Y_{i} - μ_{i}) = 0 \end{aligned}

where $R_{i}$ is a diagonal matrix with diagonal vector $(R_{i1}, R_{i2}, \dots, R_{i n_{i}})^{T}$. The robust sandwich variance estimator can be derived analogously by including the missing indicator matrix $R_{i}$.

3.2. Single imputation

When the data are not MCAR, it is often reasonable to relax this assumption and assume the data are MAR. This assumption underlies each of the imputation methods described in this article. Single imputation (SI) entails imputing a single, fixed value for each of the missing effect modifiers and then fitting the GEE outcome model. Imputation can be achieved in many ways, but is typically performed using parametric models. As stated in Section 1, the distribution of $M_{i j} | Y_{i j}, A_{i}, X_{i j}$ implied by the outcome model and other assumptions may not correspond to a model fit by off-the-shelf software. However, parametric models may still be “approximately compatible” in many scenarios, and it is recommended in general that the imputation model contain interaction terms that correspond to those in the outcome model (e.g. if the outcome model contains an interaction between the treatment and modifier, then the imputation model should contain an interaction between the outcome and treatment).⁴⁰

Once the imputation model is specified and fit, missing values of $M_{i j}$ are replaced by predicted values from the model. Let $M_{i j}^{*}$ be equal to $M_{i j}$ whenever $R_{i j} = 1$ and equal to the imputed value whenever $R_{i j} = 0$ , and let $g (μ_{i j}^{*}) = γ_{0} + γ_{1} A_{i} + γ_{2} M_{i j}^{*} + γ_{3} A_{i} M_{i j}^{*}$ be the specified mean model for the GEE analysis with imputed, complete data. Then the outcome model parameters are estimated using the imputed data set by solving the estimating equations

\begin{aligned} \sum_{i = 1}^{C} \frac{\partial μ_{i}^{*}}{\partial γ^{T}} V_{i}^{- 1} (Y_{i} - μ_{i}^{*}) = 0 \end{aligned}

3.3. Multiple imputation

MI extends the idea of SI by imputing multiple values for each missing effect modifier and then combining results across imputed data sets in order to account for variability in the imputation procedure. In particular, $D$ unique data sets will be imputed, where $D \in [5, 15]$ is often recommended in practice.⁴¹ In the simulations and data application of this article, we will use $D = 15$ . Then the outcome model will be fit for each of the $D$ data sets and the model estimates will be combined using Rubin’s rule.^42,43 For example, let ${\hat{γ}}_{d}$ be the estimate of $γ$ from imputed data set $d$ for $d = 1, 2, \dots, D$ . Then ${\hat{γ}}^{M I} = \frac{1}{D} \sum_{d = 1}^{D} {\hat{γ}}_{d}$ and

\begin{aligned} \hat{Var} ({\hat{γ}}^{M I}) = \frac{1}{D} \sum_{d = 1}^{D} Var ({\hat{γ}}_{d}) + \frac{D + 1}{D} \cdot \frac{1}{D - 1} \sum_{d = 1}^{D} ({\hat{γ}}_{d} - {\hat{γ}}^{M I})^{2} \end{aligned}

Standard confidence intervals can be constructed for $γ$ noting that $({\hat{γ}}^{MI} - γ) / \sqrt{\hat{Var} ({\hat{γ}}^{MI})} \sim t_{ν}$. The degrees of freedom $ν$ are calculated as

\begin{aligned} ν = (D - 1) {(1 + \frac{\sum_{d = 1}^{D} Var ({\hat{γ}}_{d})}{(D + 1) \sum_{d = 1}^{D} ({\hat{γ}}_{d} - {\hat{γ}}^{M I})^{2} / (D - 1)})}^{2} \end{aligned}

When the number of clusters is limited, an adjusted degrees of freedom $ν_{adj}$ has been recommended in practice for CRTs.^21,44 This is calculated as

\begin{aligned} ν_{a d j} = {(\frac{1}{ν} + \frac{1}{ν_{o b s}})}^{- 1} \end{aligned}

where

\begin{aligned} ν_{obs} = \frac{(C - 1) (C - 2)}{C + 1} {\left(1 + \frac{(D + 1) \sum_{d = 1}^{D} ({\hat{γ}}_{d} - {\hat{γ}}^{MI})^{2}}{(D - 1) \sum_{d = 1}^{D} Var ({\hat{γ}}_{d})}\right)}^{- 1} \end{aligned}

These adjusted degrees of freedom are used for all MI and MMI methods throughout the article.
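The pooling formulas in this section can be written compactly in code. The following Python sketch pools a single scalar coefficient across $D$ imputed data sets and returns the adjusted degrees of freedom; the function name `pool_rubin` and the example numbers are hypothetical. It assumes the between-imputation variance is strictly positive, since both degrees-of-freedom formulas divide by quantities involving it.

```python
import numpy as np

def pool_rubin(estimates, variances, n_clusters):
    """Pool one coefficient across D imputed data sets with Rubin's rule.

    estimates  : D point estimates of the coefficient
    variances  : D variance estimates (e.g. squared sandwich standard errors)
    n_clusters : number of clusters C, used in the adjusted degrees of freedom
    """
    est = np.asarray(estimates, dtype=float)
    var = np.asarray(variances, dtype=float)
    D = est.size
    pooled = est.mean()
    within = var.mean()                  # (1/D) * sum of Var(gamma_d)
    between = est.var(ddof=1)            # sum of squared deviations / (D - 1)
    total = within + (D + 1) / D * between
    # r = (D + 1) * between / (D * within); both df formulas depend on it.
    r = (D + 1) / D * between / within
    nu = (D - 1) * (1 + 1 / r) ** 2
    C = n_clusters
    nu_obs = (C - 1) * (C - 2) / (C + 1) / (1 + r)
    nu_adj = 1 / (1 / nu + 1 / nu_obs)
    return pooled, total, nu_adj

# Hypothetical inputs: three imputations of one coefficient, 20 clusters.
pooled, total_var, nu_adj = pool_rubin([1.0, 1.2, 0.8], [0.04, 0.04, 0.04], 20)
```

The adjusted degrees of freedom are always smaller than both $ν$ and $ν_{obs}$, so the resulting $t$ intervals are wider than those based on $ν$ alone.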

3.4. Multilevel MI

While the above imputation approaches each use GEE to account for ICC in the outcome variable when analyzing complete data, they assume in the imputation process that $ICC = 0$ for the effect modifier. This may be overly simplistic in CRTs where the effect modifiers (and covariates in general) in the same cluster can be positively correlated, leading to a non-zero covariate ICC.^4,5,9,45 Ignoring the covariate ICC in the imputation process may lead to incorrect confidence intervals around the HTE estimator. In the context of non-zero covariate ICC, MMI entails the steps of the MI procedure described above, but using a multilevel imputation model that acknowledges the correlated nature of the effect modifiers. For example, instead of imputing a binary effect modifier using logistic regression, one might use a logistic linear mixed-effects model with a random intercept corresponding to cluster membership, where the empirical best linear unbiased predictions can be used to predict the missing effect modifier. Then the imputed data sets are combined using Rubin’s rule, as above. The MI and MMI imputation procedures are implemented by predicting from models in a Frequentist fashion, that is, treating the estimated imputation model parameters as fixed. This is sometimes referred to as improper MI or approximate MI.^22,46

3.5. Bayesian MMI

B-MMI implements MMI within a Bayesian framework to account for uncertainty in imputation model parameter estimation. This can be accomplished using Markov chain Monte Carlo methods such as Gibbs sampling.⁴⁷ For a fully Bayesian approach that integrates the imputation model and outcome model in a single procedure, the general algorithm can be summarized as follows: (i) specify priors for all parameters; (ii) draw posterior samples of the imputation model parameters and impute values for all missing modifiers; (iii) draw posterior samples of the outcome model parameters using the imputed data; and (iv) iterate between steps (ii) and (iii) until convergence, and summarize the posterior distribution of the outcome model parameters. Although this fully Bayesian approach is attractive, it often does not separate the imputation step from the outcome data analysis step, and may constrain the choice of outcome model for data analysis. In practice, to separate the imputation and outcome data analysis steps, one often implements Bayesian MI by omitting only step (iii), and then sampling $D$ complete data sets from the posterior distribution, perhaps using a thinning procedure. Once the complete data are obtained, one can fit the outcome data analysis model (e.g. the GEE procedure) and combine the $D$ final parameter estimates using Rubin’s rule.

In this article, scenarios with a binary effect modifier will be considered. A natural choice for an imputation model would be a logistic mixed-effects model, with a random intercept corresponding to cluster membership. Leveraging Pólya-Gamma random variables for Bayesian inference of logistic regression models was proposed by Polson et al.⁴⁸, who also described expanding the procedure to logistic mixed-effects models. The essence of this approach is to recognize that the binomial likelihood parameterized by log-odds can be written as a Gaussian mixture of Pólya-Gamma densities, namely

\begin{aligned} \frac{(e^{ψ})^{a}}{(1 + e^{ψ})^{b}} = 2^{- b} e^{(a - b / 2) ψ} \int_{0}^{\infty} e^{- ω ψ^{2} / 2} f (ω | b, 0) d ω \end{aligned}

for all real numbers $a$ and $b > 0$, where $f (ω | b, 0)$ is the density of the Pólya-Gamma distribution with parameters $b$ and $0$, or $PG (b, 0)$, and $ψ$ can be taken as the linear component in a logistic mixed-effects model. However, handling missing data via this approach has not been described to our knowledge. To formally describe such a procedure, we combine the aforementioned generic approach with standard techniques for handling missing data in a Bayesian procedure using the following proposed Gibbs sampler. Specifically, we write the imputation model as $\text{logit} \{P (M_{ij} = 1)\} = w_{ij}^{T} η + α_{i}$, where $w_{ij}$ is the $p \times 1$ design vector including auxiliary covariates and possibly the treatment and observed outcome, and $α_{i} \sim N (0, 1 / τ_{α}^{2})$ is the cluster-level random intercept, with $τ_{α}^{2}$ defined as the precision parameter. Let ${\hat{η}}_{O}$ be the estimates from a logistic mixed-effects imputation model fit to the observed data (with initialized missing modifier values). Then,

1. Set initial values for the imputation model parameters $η$ as ${\hat{η}}_{O}$.

2. Set initial values for random intercepts $α = (α_{1}, \dots, α_{C})^{T}$ to 0 and an initial random effect precision $τ_{α} = 0.5$.

3. Define a prior $p (η) \sim N ({\hat{η}}_{O}, Σ)$ where $Σ$ is a $p \times p$ diagonal matrix with each variance set to be a large value, such that the prior is uninformative. Set Gamma hyperpriors for the random effect precision to be $c = d = 0.01$.

4. Impute missing values of $M_{ij}$ using the latest iteration of imputation model parameters.

5. Update the imputation model parameters and random effects as follows.

(a) Let $W = (w_{1}, \dots, w_{C})^{T}$ be the full design matrix for the imputation model and $M$ be the full modifier vector $(M_{1}, \dots, M_{C})^{T}$, where $w_{i} = (w_{i1}, \dots, w_{i n_{i}})^{T}$ and $M_{i} = (M_{i1}, M_{i2}, \dots, M_{i n_{i}})$. Let $I_{C}$ be an $N \times C$ matrix corresponding to cluster membership such that each element of column $i$ is 1 for all individuals that belong to cluster $i$ and 0 otherwise.

(b) Generate $ω$ as a Pólya-Gamma random vector $ω \sim PG (1, W^{T} η + I_{C} α)$. Let $Ω$ be a diagonal matrix with the diagonal vector equal to $ω$.

(c) Update $η$ as a multivariate random Normal variable with:

i. Covariance parameter equal to $(W Ω W^{T} + Σ^{- 1})^{- 1}$.

ii. Mean parameter equal to $(W Ω W^{T} + Σ^{- 1})^{- 1} \cdot [Σ^{- 1} {\hat{η}}_{O} + W Ω \{(M - 0.5) \circ ω^{- 1} - I_{C} α\}]$, where $\circ$ is an element-wise product.

(d) Update random effects $α$ as a multivariate random Normal variable with:

i. Covariance parameter equal to $(τ_{α} + I_{C}^{T} Ω I_{C})^{- 1}$.

ii. Mean parameter equal to $(τ_{α} + I_{C}^{T} Ω I_{C})^{- 1} \cdot I_{C}^{T} \cdot \{M - 1 / 2 - W^{T} η \circ ω\}$.

(e) Generate $τ_{α}$ as a Gamma random variable $τ_{α} \sim Gamma (1, c + C / 2, d + (I_{C} α)^{T} I_{C} α / 2)$.

6. Iterate between Steps 4 and 5 until the desired Markov chain Monte Carlo (MCMC) chain length is reached (or parameter estimates have sufficiently converged).

Finally, one can fit the GEE outcome model on each of $D$ imputed data sets and combine the parameter estimates using Rubin’s rule. One can select the $D$ data sets after a burn-in period and use a thinning procedure to separate imputed data sets across the MCMC chain. Note that this method is not fully Bayesian, in the sense that the imputation and outcome models are not estimated in the same procedure under a combined joint likelihood. However, the method does account for uncertainty in estimation of the imputation model parameters, distinguishing it from the MMI approach. Although the exact performance of this specific procedure is unknown in our setting and will be explored in the next section, prior work on MI has demonstrated that Bayesian imputation procedures should generally result in improved performance over their Frequentist counterparts (approximate MI procedures) from a theoretical perspective. Moreover, in scenarios where the imputation and outcome models may not be necessarily compatible, Bayesian imputation procedures hold the promise to explicitly account for uncertainty in estimating the imputation model parameters.^27,46,49
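The sampler described above can be sketched in Python as follows. This is an illustration under several stated assumptions, not the authors' implementation: (1) Pólya-Gamma draws are approximated by truncating the sum-of-gammas series representation, which is not an exact sampler; (2) the design matrix `W` is stored as $N \times p$ with one row per individual, so the linear predictor is computed as `W @ eta`; (3) the prior on $η$ is centered at zero instead of at a preliminary estimate ${\hat{η}}_{O}$ to keep the example self-contained; (4) $τ_{α}$ is treated as the random-intercept precision, with the standard conjugate Gamma update using shape $c + C/2$ and rate $d + \sum_{i} α_{i}^{2} / 2$. The function names `sample_pg_approx` and `bmmi_impute` and the toy data are hypothetical.

```python
import numpy as np

def sample_pg_approx(z, rng, n_terms=200):
    """Approximate PG(1, z) draws via a truncated sum-of-gammas series."""
    k = np.arange(1, n_terms + 1)
    g = rng.exponential(size=(z.size, n_terms))      # Gamma(1, 1) variates
    denom = (k - 0.5) ** 2 + (z[:, None] / (2 * np.pi)) ** 2
    return (g / denom).sum(axis=1) / (2 * np.pi ** 2)

def bmmi_impute(M, W, cluster, rng, n_iter=300, burn_in=100, thin=20,
                prior_var=100.0, c=0.01, d=0.01):
    """Return imputed copies of the binary vector M (np.nan marks missing).

    cluster must hold integer labels 0, ..., C-1.
    """
    miss = np.isnan(M)
    p = W.shape[1]
    C = int(cluster.max()) + 1
    eta, alpha, tau = np.zeros(p), np.zeros(C), 0.5   # initial values
    prior_prec = np.eye(p) / prior_var                # Sigma^{-1}, zero prior mean
    M_cur = np.where(miss, 0.0, M)
    imputed = []
    for it in range(n_iter):
        lin = W @ eta + alpha[cluster]
        # Step 4: impute missing modifiers from the current parameters.
        prob = 1.0 / (1.0 + np.exp(-lin))
        M_cur[miss] = rng.binomial(1, prob[miss])
        kappa = M_cur - 0.5
        # Step 5(b): Polya-Gamma augmentation variables.
        omega = sample_pg_approx(lin, rng)
        # Step 5(c): update the fixed effects eta.
        cov = np.linalg.inv(W.T @ (W * omega[:, None]) + prior_prec)
        mean = cov @ (W.T @ (kappa - omega * alpha[cluster]))
        eta = mean + np.linalg.cholesky(cov) @ rng.standard_normal(p)
        # Step 5(d): update the cluster random intercepts alpha.
        prec_a = tau + np.bincount(cluster, weights=omega, minlength=C)
        resid = kappa - omega * (W @ eta)
        mean_a = np.bincount(cluster, weights=resid, minlength=C) / prec_a
        alpha = mean_a + rng.standard_normal(C) / np.sqrt(prec_a)
        # Step 5(e): update the random-intercept precision (assumed form).
        tau = rng.gamma(c + C / 2, 1.0 / (d + alpha @ alpha / 2))
        # Keep thinned draws after the burn-in period.
        if it >= burn_in and (it - burn_in) % thin == 0:
            imputed.append(M_cur.copy())
    return imputed

# Toy run (hypothetical data): 10 clusters of 30, about 20% missing modifiers.
rng = np.random.default_rng(1)
cluster = np.repeat(np.arange(10), 30)
x = rng.standard_normal(cluster.size)
W = np.column_stack([np.ones(cluster.size), x])
M = rng.binomial(1, 0.6, cluster.size).astype(float)
M[rng.random(cluster.size) < 0.2] = np.nan
draws = bmmi_impute(M, W, cluster, rng, n_iter=60, burn_in=20, thin=10)
```

Each element of `draws` is one completed modifier vector; the outcome model would then be fit to each completed data set and the estimates pooled with Rubin's rule. In a real analysis the chain length, burn-in, and thinning interval would need to be far larger than in this toy run and checked with convergence diagnostics.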
4. Simulation study

4.1. Aims

The aim of the simulation study is to compare the missing data methods described in Section 4.4 under plausible data-generating mechanisms for CRTs. The key study objectives are to (i) study which of the methods perform best in practice when imputation models are correctly specified and (ii) evaluate how robust each approach is to imputation model misspecification and/or lack of compatibility between the imputation and outcome models. We will evaluate these questions under several specific data-generating processes described below.

4.2. Data-generating mechanisms

Simulations were separated into two scenarios according to the data-generating mechanisms. The initial setup was identical in each scenario. First, a set of $C \in \{20, 50, 100\}$ clusters was generated, where cluster $i$ had $n_{i}$ individuals and $n_{i}$ was sampled from the Poisson distribution with mean 50. The total sample size was $N = \sum_{i = 1}^{C} n_{i}$. Then, a binary treatment $A_{i}$ was randomized at the cluster level, with $P (A_{i} = 1) = 0.5$ and exact 1:1 allocation. Next, a binary effect modifier $M_{ij}^{†}$ was generated as a Bernoulli random variable with $\text{logit} \{P (M_{ij}^{†} = 1)\} = 0.5 + α_{i}$, where $α_{i}$ was a random intercept generated as a Normal random variable with mean 0 and variance such that the ICC defined on the latent response scale was equal to 0.1.^9,50,51 Subsequent data-generating mechanisms varied across the two scenarios.

Scenario 1: In the first scenario, an additional covariate for person $j$ in cluster $i$ , $X_{i j}$ , was generated as a standard Normal random variable. Then the outcome $Y_{i j}$ was simulated according to the model

\begin{aligned} Y_{ij} = 1 + 1 A_{i} + 0.75 M_{ij}^{†} + β_{3} A_{i} M_{ij}^{†} + 0.8 X_{ij} A_{i} - 0.4 X_{ij} M_{ij}^{†} + 0.7 X_{ij} A_{i} M_{ij}^{†} + κ_{i} + ϵ_{ij} \end{aligned}

where $κ_{i}$ was a random intercept generated as a Normal random variable with a mean of 0 such that the outcome ICC was equal to 0.1 and $ϵ_{ij}$ was a Normal random variable with a mean of 0 and variance equal to 3. The outcome was simulated under two possible coefficient values, $β_{3} = 0$ and $β_{3} = - \{1 + \exp (- 0.5)\}$. Finally, missingness was imposed on the effect modifier by generating an indicator $R_{ij}$ as a Bernoulli random variable with $\text{logit} \{P (R_{ij} = 1)\} = 1.2 + 0.5 X_{ij} - 0.2 Y_{ij}$ such that the observed $M_{ij} = M_{ij}^{†}$ when $R_{ij} = 1$ and is missing otherwise. While outcome-dependent missingness may seem unusual for a clinical trial setting with outcome measured after treatment and modifier measurement, this effectively captures a scenario where unmeasured covariates affect both missingness and the outcome. The marginal percentage of missingness was about 32% when $β_{3} = 0$ and 30% when $β_{3} = - \{1 + \exp (- 0.5)\}$.
Scenario 2: In the second scenario, three covariates $X_{1 i j}, X_{2 i j},$ and $X_{3 i j}$ were generated as independent standard Normal random variables. Then the outcome was simulated according to the model

\begin{aligned} Y_{ij} & = 1 + 1 A_{i} + 0.75 M_{ij}^{†} + β_{3} A_{i} M_{ij}^{†} + 0.8 X_{1ij} A_{i} - 0.4 X_{1ij} M_{ij}^{†} \\ & \quad + 0.7 X_{1ij} A_{i} M_{ij}^{†} + 0.9 X_{2ij} A_{i} M_{ij}^{†} - 1.1 X_{3ij} A_{i} M_{ij}^{†} + κ_{i} + ϵ_{ij} \end{aligned}

where $κ_{i}$, $ϵ_{ij}$, and $β_{3}$ were generated as in Scenario 1. Missingness was imposed on the effect modifier by generating an indicator $R_{ij}$ as a Bernoulli random variable with

\begin{aligned} \text{logit}\{P(R_{ij} = 1)\} = 1.5 + 0.6 X_{1ij} + 1.2 X_{2ij} - 0.8 X_{3ij} - 0.2 Y_{ij} + ζ_{i} \end{aligned}

such that the observed $M_{ij} = M_{ij}^{†}$ when $R_{ij} = 1$ and is missing otherwise. The random intercept $ζ_{i}$ was generated with mean 0 such that the missingness ICC on the latent response scale was equal to 0.1.⁵⁰ The marginal missingness percentage was the same as in Scenario 1. Directed acyclic graphs (DAGs) representing Scenarios 1 and 2 are provided in Figure 1(a) and (b), respectively. All data simulation and analyses were performed in R version 4.1.2.⁵² Data were generated using the 64-bit Mersenne-Twister with input seed $1000k$ for simulation iteration $k$ in each scenario.
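The Scenario 2 missingness step can be sketched in a few lines. This is only an illustration, not the authors' R code: it assumes the random intercept is Normal and that the latent-scale ICC of a logistic model is $σ^{2} / (σ^{2} + π^{2}/3)$, a common convention that may differ from the cited definition. The function names and the default seed are hypothetical, and Python's generator is not the 64-bit Mersenne-Twister used in the paper.

```python
import math
import random


def random_intercept_sd(icc):
    """SD of a Normal random intercept giving the target latent-scale ICC.

    Assumes the logistic residual variance pi^2 / 3 (labelled assumption).
    """
    return math.sqrt(icc / (1.0 - icc) * math.pi ** 2 / 3.0)


def expit(v):
    return 1.0 / (1.0 + math.exp(-v))


def simulate_observed_indicator(clusters, icc=0.1, seed=1000):
    """Draw R_ij (1 = modifier observed) for every individual.

    clusters: list of clusters, each a list of (x1, x2, x3, y) tuples.
    Returns a list of lists with the same shape as `clusters`.
    """
    rng = random.Random(seed)
    sd = random_intercept_sd(icc)
    indicators = []
    for members in clusters:
        zeta = rng.gauss(0.0, sd)  # one random intercept per cluster
        row = []
        for x1, x2, x3, y in members:
            linear = 1.5 + 0.6 * x1 + 1.2 * x2 - 0.8 * x3 - 0.2 * y + zeta
            row.append(1 if rng.random() < expit(linear) else 0)
        indicators.append(row)
    return indicators


# Toy input: two clusters with made-up covariate and outcome values.
toy = [[(0.1, -0.3, 0.5, 1.2), (0.0, 0.4, -0.2, 0.8)], [(1.0, 0.2, 0.1, 2.0)]]
observed = simulate_observed_indicator(toy)
```

Under this convention, an ICC of 0.1 corresponds to a random-intercept variance of roughly 0.37.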

Figure 1.

Directed acyclic graphs (DAGs) corresponding to the simulation scenarios considered. (a) Simulation Scenario 1 and (b) simulation Scenario 2.

Note that in each scenario, the modifier is generated independently of the baseline covariates. This may seem unusual given the stated goal to compare imputation methods. However, this choice helps illustrate model compatibility in this setting. The modifier $M$ is associated with the outcome $Y$, such that $Y$ should in principle be included in imputation model specifications. However, even though $M$ is marginally independent of $A$, $X_{1}$, $X_{2}$, and $X_{3}$, once we condition on $Y$, it is not conditionally independent of those variables. Thus, the outcome model generation process informs a fairly complex imputation model, despite the simple modifier generation process.
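A small simulation can make this conditional dependence concrete. The sketch below is not the paper's simulation. It keeps only the control arm ($A = 0$) of a Scenario 1-style outcome model, drops the cluster effect, and assumes a unit-variance residual and a modifier prevalence of $\{1 + \exp(-0.5)\}^{-1}$; all of these are simplifying assumptions made for illustration.

```python
import math
import random


def expit(v):
    return 1.0 / (1.0 + math.exp(-v))


def mean(values):
    return sum(values) / len(values)


def simulate(n=20000, seed=1):
    """Draw (m, x, y) for control-arm individuals (A = 0), no clustering."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        m = 1 if rng.random() < expit(0.5) else 0  # assumed prevalence
        x = rng.gauss(0.0, 1.0)                    # X generated independently of M
        y = 1.0 + 0.75 * m - 0.4 * x * m + rng.gauss(0.0, 1.0)
        rows.append((m, x, y))
    return rows


def mean_x_gap(rows):
    """Mean of X among M = 1 minus mean of X among M = 0."""
    x1 = [x for m, x, _ in rows if m == 1]
    x0 = [x for m, x, _ in rows if m == 0]
    return mean(x1) - mean(x0)


rows = simulate()
marginal_gap = mean_x_gap(rows)                               # close to zero
conditional_gap = mean_x_gap([r for r in rows if r[2] > 2.5])  # clearly negative
print(round(marginal_gap, 3), round(conditional_gap, 3))
```

Marginally, the mean of $X$ is about the same in both modifier groups. Among individuals with a large outcome, those with $M = 1$ tend to have smaller $X$, so an imputation model for $M$ that conditions on $Y$ also needs $X$ and its interactions.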

4.3. Estimands

Two estimands were considered for each simulation study. First, we consider the implied marginal model after integrating out additional covariates, for example, for the first scenario,

\begin{aligned} E(Y_{ij} | A_{i}, M_{ij}^{†}) & = 1 + A_{i} + 0.75 M_{ij}^{†} + β_{3} A_{i} M_{ij}^{†} + 0.8 E(X_{ij}) A_{i} - 0.4 E(X_{ij}) M_{ij}^{†} + 0.7 E(X_{ij}) A_{i} M_{ij}^{†} \\ & = γ_{0} + γ_{1} A_{i} + γ_{2} M_{ij}^{†} + γ_{3} A_{i} M_{ij}^{†} \end{aligned}

The first estimand of interest is the interaction term from the marginal model, $γ_{3}$, whose interpretation as the difference in subgroup-specific treatment effects is provided earlier. This interaction term will be referred to as the HTE estimand moving forward. Note that since all baseline covariates had mean 0, the HTE estimand is equal to $β_{3}$ in both scenarios. The second estimand of interest is the ATE, namely, $E(Y | A = 1) - E(Y | A = 0)$. As we mentioned earlier, this estimand is equal to $γ_{1} + γ_{3} E(M^{†}) = 1 + β_{3} E(M^{†})$ under both scenarios since $X$, $A$, and $M^{†}$ are marginally independent under the full data distribution. Thus, the true ATE estimand is 1 when $β_{3} = 0$ and the true ATE estimand is 0 when $β_{3} = -\{1 + \exp(-0.5)\}$.
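As a quick check of the second value, suppose $E(M^{†}) = \{1 + \exp(-0.5)\}^{-1} \approx 0.622$. This prevalence is not restated in this section and is inferred here from the stated ATE values, so it should be read as an assumption. Then, with $β_{3} = -\{1 + \exp(-0.5)\} \approx -1.607$,

\begin{aligned} 1 + β_{3} E(M^{†}) & = 1 - \{1 + \exp(-0.5)\} \times \frac{1}{1 + \exp(-0.5)} \\ & = 1 - 1 = 0 \end{aligned}

which matches the null ATE stated above.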

4.4. Methods

Each of the missing data methods described earlier was used to analyze the data generated under each scenario. In particular, data were analyzed using CCA, SI, MI, MMI, and B-MMI. For each method, the final outcome data analysis model was a correct GEE specification for the marginal mean model (integrating out baseline covariates) with the identity link function. Thus, the estimator of the HTE estimand, ${\hat{γ}}_{3}$, is simply given by the estimated interaction coefficient from each model. For each analysis, the corresponding robust standard error is used. The estimator for the ATE estimand is given by

\begin{aligned} {\hat{γ}}_{1} + {\hat{γ}}_{3} \sum_{i = 1}^{C} \sum_{j = 1}^{n_{i}} M_{i j}^{*} / N \end{aligned}

where $M_{ij}^{*} = M_{ij}$ when the modifier is observed and equals the imputed value otherwise. More simply, one can mean-center the modifier variable after imputation such that the estimator is given by ${\hat{γ}}_{1}$.^5,6 The standard error of the ATE estimator can then also be estimated directly as the robust standard error of the treatment coefficient, or can be obtained via the delta method if the modifier is not mean-centered. Mean-centering of the effect modifier, however, does not affect the interpretation of the interaction effect or HTE estimator.
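To see why mean-centering works, write $\bar{M} = \sum_{i = 1}^{C} \sum_{j = 1}^{n_{i}} M_{ij}^{*} / N$ (notation introduced here only for illustration) and regroup the marginal mean model:

\begin{aligned} γ_{0} + γ_{1} A_{i} + γ_{2} M_{ij}^{*} + γ_{3} A_{i} M_{ij}^{*} & = (γ_{0} + γ_{2} \bar{M}) + (γ_{1} + γ_{3} \bar{M}) A_{i} \\ & \quad + γ_{2} (M_{ij}^{*} - \bar{M}) + γ_{3} A_{i} (M_{ij}^{*} - \bar{M}) \end{aligned}

The treatment coefficient in the centered model is therefore $γ_{1} + γ_{3} \bar{M}$, while the interaction coefficient $γ_{3}$ is unchanged. Without centering, and treating $\bar{M}$ as fixed (a simplifying assumption on our part), the variance of the ATE estimator can be sketched from the robust covariance matrix as

\begin{aligned} \widehat{Var}({\hat{γ}}_{1} + {\hat{γ}}_{3} \bar{M}) = \widehat{Var}({\hat{γ}}_{1}) + \bar{M}^{2} \widehat{Var}({\hat{γ}}_{3}) + 2 \bar{M} \widehat{Cov}({\hat{γ}}_{1}, {\hat{γ}}_{3}) \end{aligned}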

The general imputation model used had the form

\begin{aligned} \text{logit}\{P(M_{ij} = 1)\} = η^{T} f(X_{1ij}, X_{2ij}, X_{3ij}, A_{i}, Y_{ij}) + ξ_{i} \end{aligned}

where the function $f$ returns a design vector that is a function of its arguments and is varied by model specification; the random intercept $ξ_{i}$ was only included for MMI methods. Even though we have considered continuous baseline covariates in the simulations, the general form of the imputation model can easily include binary and categorical predictors as in any conventional regression model (and indeed includes the binary treatment assignment). For each of the imputation-based approaches, five imputation model specifications were used, ranging from models that only contained main effects (likely highly incompatible with the outcome model) to ones with three-way interactions and lower order terms (approximately compatible with the outcome model). Table 1 elaborates on the details of the imputation model specifications. Overall, two data scenarios were compared across three sample sizes for two estimands, two choices of $β_{3}$ values, and 21 combinations of methods and model specifications, for a total of 504 comparisons.

Table 1.
Possible imputation model specifications considered in the simulation studies, as well as abbreviations used in communicating simulation results.

Specification 1	Specification 2	Abbreviation
$logit {P (M_{i j} = 1)} = η_{0} + η_{1} X_{i j} + η_{2} A_{i} + η_{3} Y_{i j}$	$logit {P (M_{i j} = 1)} = η_{0} + η_{1} X_{1 i j} + η_{2} X_{2 i j} + η_{3} X_{3 i j} + η_{4} A_{i} + η_{5} Y_{i j}$	$M \sim X + A + Y$
$logit {P (M_{i j} = 1)} = η_{0} + η_{1} X_{i j} + η_{2} A_{i} + η_{3} Y_{i j} + η_{4} A_{i} Y_{i j}$	$logit {P (M_{i j} = 1)} = η_{0} + η_{1} X_{1 i j} + η_{2} X_{2 i j} + η_{3} X_{3 i j} + η_{4} A_{i} + η_{5} Y_{i j} + η_{6} A_{i} Y_{i j}$	$M \sim X + A * Y$
$logit {P (M_{i j} = 1)} = η_{0} + η_{1} X_{i j} + η_{2} A_{i} + η_{3} Y_{i j} + η_{4} X_{i j} A_{i}$	$logit {P (M_{i j} = 1)} = η_{0} + η_{1} X_{1 i j} + η_{2} X_{2 i j} + η_{3} X_{3 i j} + η_{4} A_{i} + η_{5} Y_{i j} + η_{6} X_{1 i j} A_{i} + η_{7} X_{2 i j} A_{i} + η_{8} X_{3 i j} A_{i}$	$M \sim X * A + Y$
$logit {P (M_{i j} = 1)} = η_{0} + η_{1} X_{i j} + η_{2} A_{i} + η_{3} Y_{i j} + η_{4} X_{i j} A_{i} + η_{5} A_{i} Y_{i j}$	$logit {P (M_{i j} = 1)} = η_{0} + η_{1} X_{1 i j} + η_{2} X_{2 i j} + η_{3} X_{3 i j} + η_{4} A_{i} + η_{5} Y_{i j} + η_{6} X_{1 i j} A_{i} + η_{7} X_{2 i j} A_{i} + η_{8} X_{3 i j} A_{i} + η_{9} A_{i} Y_{i j}$	$M \sim X * A + Y * A$
$logit {P (M_{i j} = 1)} = η_{0} + η_{1} X_{i j} + η_{2} A_{i} + η_{3} Y_{i j} + η_{4} X_{i j} A_{i} + η_{5} A_{i} Y_{i j} + η_{6} X_{i j} Y_{i j} + η_{7} X_{i j} A_{i} Y_{i j}$	$logit {P (M_{i j} = 1)} = η_{0} + η_{1} X_{1 i j} + η_{2} X_{2 i j} + η_{3} X_{3 i j} + η_{4} A_{i} + η_{5} Y_{i j} + η_{6} X_{1 i j} A_{i} + η_{7} X_{2 i j} A_{i} + η_{8} X_{3 i j} A_{i} + η_{9} A_{i} Y_{i j} + η_{10} X_{1 i j} Y_{i j} + η_{11} X_{2 i j} Y_{i j} + η_{12} X_{3 i j} Y_{i j} + η_{13} X_{1 i j} A_{i} Y_{i j} + η_{14} X_{2 i j} A_{i} Y_{i j} + η_{15} X_{3 i j} A_{i} Y_{i j}$	$M \sim X * A * Y$

Specification 1 (2) is the model used for simulation scenario 1 (2). For MMI and B-MMI methods, a random intercept was added to each of the specifications in the table. $\text{logit}(\cdot) = \log[(\cdot) / \{1 - (\cdot)\}]$. MMI: multilevel multiple imputation; B-MMI: Bayesian multilevel multiple imputation.
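The role of the function $f$ can be illustrated for the five Scenario 1 specifications in Table 1. The sketch below is a hypothetical helper, not part of the authors' code: it returns the design vector, including the intercept, for one individual, with terms ordered as in the table.

```python
def design_vector(spec, x, a, y):
    """Design vector f(x, a, y), intercept first, for a Scenario 1 specification.

    `spec` is one of the abbreviations in Table 1. Term order follows the
    table: main effects first, then X*A, A*Y, X*Y, and X*A*Y where present.
    """
    main_effects = [1.0, x, a, y]
    interactions = {
        "M ~ X + A + Y": [],
        "M ~ X + A * Y": [a * y],
        "M ~ X * A + Y": [x * a],
        "M ~ X * A + Y * A": [x * a, a * y],
        "M ~ X * A * Y": [x * a, a * y, x * y, x * a * y],
    }
    return main_effects + interactions[spec]


def linear_predictor(eta, spec, x, a, y, xi=0.0):
    """Compute eta^T f(x, a, y) + xi; xi is the cluster intercept (0 for SI and MI)."""
    f = design_vector(spec, x, a, y)
    if len(eta) != len(f):
        raise ValueError("eta must have one coefficient per design term")
    return sum(e * v for e, v in zip(eta, f)) + xi
```

For example, `design_vector("M ~ X * A * Y", 0.5, 1, 2.0)` has eight entries, matching $η_{0}$ through $η_{7}$ in the last row of the table.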

Since both scenarios used binary effect modifiers, all imputation procedures were specified under the umbrella of logistic regression procedures, that is, SI and MI used logistic regression, while the MMI and B-MMI methods used logistic mixed-effects models with a random intercept at the cluster level (B-MMI using estimates from such models as initial values). For the B-MMI method, each MCMC chain included 1000 burn-in iterations and used thinning to draw one set of posterior samples every 100 iterations after burn-in until $D = 15$ complete data sets were collected, for a total of 2500 iterations. For all MI procedures, we combined the outcome model estimates based on each complete data set using Rubin’s rules as described earlier.
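The pooling step can be sketched as follows. This uses the standard textbook form of Rubin's rules for a scalar estimand, namely the mean of the estimates plus a total variance combining within- and between-imputation variance. The paper's own description appears in an earlier section and takes precedence; the function name is hypothetical, and degrees-of-freedom adjustments are omitted.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Pool D point estimates and their squared standard errors.

    Returns (pooled_estimate, pooled_standard_error).
    """
    d = len(estimates)
    if d < 2 or d != len(variances):
        raise ValueError("need at least two imputations with matching lengths")
    pooled = statistics.mean(estimates)
    within = statistics.mean(variances)        # average squared standard error
    between = statistics.variance(estimates)   # sample variance, divisor d - 1
    total = within + (1.0 + 1.0 / d) * between
    return pooled, math.sqrt(total)


# Made-up interaction estimates from three imputed data sets.
estimate, std_error = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

With $D = 15$ imputed data sets, as used for B-MMI here, the same function would take 15 interaction-coefficient estimates and their squared robust standard errors.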

4.5. Performance measures

Four performance measures were considered for each simulation scenario and estimand. First, the bias for each method was calculated as the mean difference between the estimates across simulation iterations and the corresponding true estimand value described in Section 4.3. As a second measure, the coverage was calculated as the proportion of simulations for which the estimated confidence interval contained the corresponding true estimand value. Third, the power was calculated as the proportion of simulations which would reject the null hypothesis (that the corresponding estimand is equal to 0). This is equivalent to the Type I error rate when simulating under the null. Finally, the mean squared error (MSE) was calculated as the average squared difference between the estimate from each simulation and the corresponding true estimand value. The number of simulations run for each data-generating mechanism was chosen to ensure a reasonably low Monte Carlo standard error for the coverage estimates. In particular, supposing that the true coverage for a method is 95%, approximately 1900 simulations for each data-generating mechanism are required to achieve a Monte Carlo standard error of 0.5%. This was then rounded up to 2000 simulation iterations. Thus, for a method that truly has 95% coverage, we would expect to find estimated coverage between 94% and 96% across the vast majority of data-generating mechanisms we considered.
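These measures and the sample-size arithmetic can be sketched as below. The function names are illustrative, and rejection is taken to mean that the confidence interval excludes 0, which is assumed here to match the test used in the paper.

```python
import math


def performance(estimates, ci_lowers, ci_uppers, truth):
    """Bias, coverage, rejection rate, and MSE across simulation iterations."""
    n = len(estimates)
    intervals = list(zip(ci_lowers, ci_uppers))
    return {
        "bias": sum(e - truth for e in estimates) / n,
        "coverage": sum(lo <= truth <= hi for lo, hi in intervals) / n,
        # Power, or Type I error rate when the true estimand is 0.
        "rejection_rate": sum(not (lo <= 0.0 <= hi) for lo, hi in intervals) / n,
        "mse": sum((e - truth) ** 2 for e in estimates) / n,
    }


def coverage_mcse(true_coverage, n_sims):
    """Monte Carlo standard error of an estimated coverage proportion."""
    return math.sqrt(true_coverage * (1.0 - true_coverage) / n_sims)


def sims_for_mcse(true_coverage, target_se):
    """Number of simulations giving the target Monte Carlo standard error."""
    return true_coverage * (1.0 - true_coverage) / target_se ** 2


toy = performance([0.1, -0.1, 0.3], [-0.1, -0.3, 0.1], [0.3, 0.1, 0.5], truth=0.0)
needed = sims_for_mcse(0.95, 0.005)   # about 1900
achieved = coverage_mcse(0.95, 2000)  # a little under 0.005
```

With 2000 iterations the Monte Carlo standard error is about 0.49%, so an interval of roughly two standard errors around 95% gives the 94% to 96% range quoted above.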

4.6. Results

The results for the HTE estimand when setting $β_{3} = γ_{3} = 0$ in the first scenario are presented in Figure 2. Figure 2 shows that SI and MI can dramatically reduce bias compared to CCA when using the three-way interaction imputation model. However, some bias remains. The MMI and B-MMI approaches reduce bias further and have near-zero bias under the three-way interaction imputation model, but exhibit similar bias under the other imputation models. As expected, coverage was far below the nominal level for the SI method, but this recovered greatly after performing MI. B-MMI under the three-way interaction imputation model was the only method to achieve approximately nominal coverage and maintain the specified $α = 0.05$ Type I error rate, but could sometimes lead to overcoverage or undercoverage under misspecified imputation models. MI, MMI, and B-MMI all provided comparable MSE. Results for the ATE estimand under this data-generating mechanism are presented in Figure 1 of the Supplemental Material. While all methods showed a small amount of bias and undercoverage for this estimand, MI, MMI, and B-MMI were the closest to achieving zero bias and nominal coverage, with better performance than SI or CCA.

Figure 2.

Simulation results for the HTE estimand (interaction effect estimand $γ_{3}$) in the first simulation scenario when $γ_{3} = 0$. Rows from top to bottom show the bias, coverage, MSE, and Type I error metrics. Columns from left to right show the performance of SI, MI, MMI, and B-MMI methods. CCA is displayed for comparison on each panel. HTE: heterogeneous treatment effect; MSE: mean squared error; SI: single imputation; MI: multiple imputation; MMI: multilevel multiple imputation; B-MMI: Bayesian multilevel multiple imputation; CCA: complete-case analysis.

Results for the HTE estimand and ATE estimand when setting $β_{3} = γ_{3} = - {1 + \exp (- 0.5)}$ in Scenario 1 are presented in Supplemental Figures 2 and 3, respectively. For the HTE estimand, B-MMI slightly outperformed MMI across all imputation model specifications, followed by MI and SI. However, the differences between B-MMI and MMI were minimal. Notably, each of the imputation approaches can have substantially more bias than CCA if a highly incompatible imputation model is used, such as the imputation model with main effects only. Each imputation approach can also exhibit much worse coverage than CCA when the imputation model is misspecified, with coverage deteriorating further as the number of clusters increases; this is because the estimated confidence intervals narrow around biased point estimates. For the ATE estimand (Supplemental Figure 3), performance was similar for MI, MMI, and B-MMI, with each achieving lower bias and closer to nominal coverage than SI or CCA, although a small amount of bias remained.

The results of the second simulation scenario were largely in agreement with those of the first scenario. The results for the HTE estimand $γ_{3}$ when setting $β_{3} = γ_{3} = 0$ in the second scenario are presented in Figure 3. All imputation methods had low bias under the three-way interaction imputation model, but performed poorly under highly misspecified imputation models. However, in this scenario, B-MMI had noticeably higher coverage than MMI under all imputation models, and was the only method to achieve nominal coverage. Likewise, B-MMI under the three-way interaction model was the only method to achieve the nominal Type I error rate. The results for the ATE estimand for this data-generating mechanism are presented in Supplemental Figure 4. For this estimand, MI, MMI, and B-MMI all yielded good and comparable performance.

Figure 3.

Simulation results for the HTE estimand (interaction effect estimand $γ_{3}$) in the second simulation scenario when $γ_{3} = 0$. Rows from top to bottom show the bias, coverage, MSE, and Type I error metrics. Columns from left to right show the performance of SI, MI, MMI, and B-MMI methods. CCA is displayed for comparison on each panel. HTE: heterogeneous treatment effect; MSE: mean squared error; SI: single imputation; MI: multiple imputation; MMI: multilevel multiple imputation; B-MMI: Bayesian multilevel multiple imputation; CCA: complete-case analysis.

Figure 4.

Directed acyclic graphs (DAGs) corresponding to the Work, Family, and Health Study data application scenarios that were considered.

Finally, the results for the HTE and the ATE estimands when setting $β_{3} = γ_{3} = -\{1 + \exp(-0.5)\}$ in Scenario 2 are presented in Supplemental Figures 5 and 6. For the HTE estimand $γ_{3}$, MMI and B-MMI exhibited lower bias and higher coverage than SI or MI across all specified imputation models. However, B-MMI under the three-way interaction imputation model was the only method to achieve nominal coverage. As in Scenario 1, all imputation methods could perform substantially worse than CCA under highly misspecified imputation models. For the ATE estimand, performance was similar for MI, MMI, and B-MMI, with low but non-zero bias and slight undercoverage. Overall, B-MMI consistently had the strongest performance in the simulation scenarios considered, followed by MMI and then MI. As noted in theoretical examinations in related work,^27,46 Bayesian methods may perform better than their frequentist counterparts in scenarios where imputation and outcome model specification are incompatible because the Bayesian approaches more appropriately handle uncertainty in parameter estimation. This feature likely explains the improvements we found when comparing B-MMI with MMI.

5. Application to the Work, Family, and Health Study (WFHS)

In this section, we illustrate each of the methods described earlier by analyzing data from the WFHS. WFHS consisted of two CRTs conducted at two employers; here, we focus only on the experiment conducted at one employer, an extended-care company.⁵³ The trial included 30 work sites of 30–89 employees each. The work sites were randomly assigned to a comprehensive work–family intervention or usual practice conditions. The intervention consisted of two main components, one to increase managers’ support for their employees’ family and personal lives, and one to improve employees’ control over their schedule. As managers already have more control over their schedule and would be both giving and receiving the first component of the intervention, it is possible that the intervention effect on various outcomes varies by employee type (manager vs. non-manager).

We specifically focus on this effect modification problem even though the employee type variable is completely observed, such that we can simulate missing data scenarios while comparing to results from the known full data. In particular, we will study employee type as a potential modifier of the effect of the intervention on the outcome of time adequacy (TA), measured for each individual employee. TA was a composite variable averaging the individual survey responses to several questions of the form “To what extent is there enough time to…” followed by a familial activity or responsibility. TA ranged from 1 to 5 with 5 signifying always having adequate time for family. Thus, TA acts as a proxy variable for the desirable but unmeasurable outcome of “good work–life balance.” Our outcome specification was the difference between TA at 12 months post-intervention and TA at baseline. This research question largely falls under Aim 4 of the selected sub-study of WFHS, to “test whether employee, mid-level manager, and work-group characteristics moderate the effect of the intervention on work–family conflict and health outcomes.”

Although the employee type variable is fully observed, we considered three hypothetical missing data scenarios for this effect modifier. In the first scenario (Figure 4(a)), about 20% of the values of employee type were set to be missing completely at random. In the second scenario (Figure 4(b)), about 20% of the values of employee type were set to be MAR as a simple function of the outcome and two additional covariates: self-reported job autonomy and a self-reported assessment of control of schedule, each of which was reported on a Likert scale of 1 to 5, with 5 indicating more control/autonomy. In particular

\begin{aligned} \text{logit}\{P(\text{Employee Type is not missing})\} & = 2 + 0.5\, \text{TA} - 0.6\, (\text{Control of Schedule} \geq 4) \\ & \quad - 0.3\, (\text{Job Autonomy} \geq 4) \end{aligned}

The third scenario is a similar MAR setup (Figure 4(b)) to the second scenario, but missingness was simulated as a more complicated function of outcome and covariates, including a random effect to induce clustering of missingness within work sites:

\begin{aligned} \text{logit}\{P(\text{Employee Type is not missing})\} & = ζ + 2 + 0.5\, \text{TA} - 0.6\, (\text{Control of Schedule} \geq 4) \\ & \quad - 0.3\, (\text{Job Autonomy} \geq 4) \\ & \quad + 0.05\, \text{TA}\, (\text{Control of Schedule} \geq 4) \\ & \quad - 0.15\, \text{TA}\, (\text{Job Autonomy} \geq 4) \\ & \quad + 0.1\, \text{TA}\, (\text{Control of Schedule} \geq 4)\, (\text{Job Autonomy} \geq 4) \end{aligned}

where $ζ$ is a random intercept at the work site level generated as a Normal variable with mean 0 and variance such that ICC on the latent response scale was equal to 0.1.

In each scenario, the five methods are compared by calculating the point estimates and confidence intervals for the two estimands described earlier—the coefficient for interaction between intervention and employee type (the HTE) and the ATE. As we mentioned earlier, we assume the absence of informative cluster size and a correct specification of the GEE marginal mean model for data analysis; therefore, the ATE and the interaction effect can be mapped to the corresponding regression coefficients and interpreted without ambiguity. For the MAR scenarios, each imputation method specified an imputation model that included the outcome, the two baseline covariates (dichotomized as in the missing data generation), the treatment, and all possible two-way interactions between them. Models with higher-order interaction terms did not always converge due to the limited sample size and nontrivial number of parameters, and were thus excluded. The missing data simulation procedure and corresponding estimation under each method was repeated for a comparison across 500 replications.

Key results are presented in Figure 5, which shows the average point estimate and average 95% confidence (upper and lower) limits across the 500 replications for each method. For these results, 13 iterations were removed in Scenario 2 for MMI due to the non-convergence of the imputation model; these were resolved for B-MMI by using alternative initial values informed by the SI/MI imputation model. For the ATE estimand (top row of the figure), all methods seem to perform similarly but with slight variation in average confidence interval width. For the HTE estimand (bottom row of the figure), CCA yielded similar results to each of the imputation methods under all scenarios, and produced narrower confidence intervals on average than MI, MMI, or B-MMI. This is in slight contrast to the results from the simulation studies, and may be due to a lack of explanatory power of the assumed imputation model with the available auxiliary data. Alternatively, this phenomenon can arise as a result of increased uncertainty in the imputation procedure. That is, whereas CCA uses less data, the MI methods can introduce nontrivial uncertainty around the imputed values and, therefore, increase the width of the confidence interval. Finally, although SI had similar confidence limits to the complete data on average, SI was also the only method for which the averaged confidence intervals did not fully cover the range of the complete data confidence interval in Scenarios 2 and 3. MI, MMI, and B-MMI performed similarly, but had noticeably wider average confidence intervals than CCA or SI. Although there is no ground truth to compare to, this may indicate that the set of MI procedures are more faithfully accounting for the uncertainty due to missing effect modifier data.

Figure 5.

Results for the Work, Family, and Health Study data application. The top row shows the average point estimate and average confidence interval limits for each method across 500 replications of the simulation procedure for the ATE estimand. The bottom row shows the parallel results for the HTE estimand. HTE: heterogeneous treatment effect; ATE: average treatment effect.

Supplemental Tables 1 and 2 report the percentage of iterations where each method’s confidence interval was narrower than the corresponding complete data confidence interval, for the ATE and the HTE estimands, respectively. Notably, MI, MMI, and B-MMI almost never produced a narrower confidence interval compared to the complete data confidence interval, except for at most two iterations in a given scenario. In contrast, CCA yielded a narrower confidence interval compared to the complete data counterpart between 9.0% and 16.2% of the time, depending on the scenario. Furthermore, SI reported a narrower confidence interval than the complete data counterpart between 22.2% and 48.6% of iterations across all scenarios. This adds evidence to our Section 4 results that SI often yields overly narrow confidence intervals. Supplemental Tables 3 and 4 report a related metric, the percentage of iterations where each method’s 95% confidence interval completely covered the complete data 95% confidence interval. A very low value of this metric could indicate that an approach is frequently excluding plausible parameter values. For the ATE estimand, MI, MMI, and B-MMI almost always covered the complete data interval, while CCA and SI tended to cover the complete data interval only around a third of the time. For the HTE estimand in Scenario 2, CCA, SI, MI, MMI, and B-MMI reported covering the complete data confidence interval 34.0%, 19.4%, 60.6%, 63.2%, and 69.4% of the time, respectively. Results for other scenarios were similar, and collectively indicate that SI likely did not appropriately account for uncertainty, while B-MMI appropriately accounts for more uncertainty on average than other approaches.
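The two interval-comparison metrics reported in these supplemental tables can be sketched as follows. The function names and toy intervals are hypothetical, and treating an interval with identical endpoints as "covering" is our assumption about how ties are handled.

```python
def width(ci):
    """Width of a (lower, upper) confidence interval."""
    return ci[1] - ci[0]


def compare_to_complete(method_cis, complete_ci):
    """Summarize one method's intervals against the complete data interval.

    method_cis: list of (lower, upper) intervals, one per replication.
    complete_ci: the (lower, upper) interval from the fully observed data.
    Returns percentages across replications.
    """
    n = len(method_cis)
    narrower = sum(width(ci) < width(complete_ci) for ci in method_cis)
    covering = sum(
        ci[0] <= complete_ci[0] and ci[1] >= complete_ci[1] for ci in method_cis
    )
    return {
        "pct_narrower": 100.0 * narrower / n,
        "pct_covering": 100.0 * covering / n,
    }


# Made-up intervals for four replications of a single method.
summary = compare_to_complete(
    [(-0.2, 0.2), (-0.05, 0.05), (0.0, 0.3), (-0.1, 0.1)],
    complete_ci=(-0.1, 0.1),
)
```

A method that reflects the extra uncertainty from missing modifier data should rarely be narrower than the complete data interval and should usually cover it.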

Finally, to further investigate the variability of the point estimates across replications, we provide box plots of the point estimates in Supplemental Figure 7. The complete data point estimate and 95% confidence interval are presented on the left of each panel for reference. The top row displays results for the ATE estimand, where CCA exhibited much higher variability than the other approaches. SI exhibited slightly higher variability than the other imputation approaches. All methods’ median point estimates were close to the complete data estimate. The bottom row of the figure displays results for the HTE estimand, and suggests that no method had a median point estimate equal to the complete data point estimate in Scenario 2 or 3. While all methods tended to report point estimates within the complete data confidence interval, each had point estimates outside this range in either Scenario 2 or 3. As expected, SI had highly variable point estimates, with several point estimates outside of the complete data confidence interval. MMI exhibited a bias-variance tradeoff in Scenario 3, with its median point estimate being the furthest from the complete data point estimate, but also having less variable estimates overall.

We acknowledge that there is no known truth in this single data application, and the complete data results are themselves subject to variability. Thus, whether or not confidence intervals “match” those for the complete data should only be interpreted with caution. However, when considering all metrics together, this application provides strong evidence that CCA and SI are likely overly confident, and that MMI and B-MMI may be most faithful in capturing uncertainty. For the complete data, both the ATE and the HTE estimates were close to null with relatively wide confidence intervals, indicating no ATE and no effect heterogeneity. Although it is difficult to define the scale of meaningful treatment effects when the outcome is an average of Likert variables, the confidence intervals exclude treatment effects of an absolute value larger than 0.1, which would correspond to a very minor difference between groups.

6. Discussion

In this article, we compared several methods to address missing effect modifier data for assessing HTE (and for assessing ATE as a direct product of the analysis model) in CRTs using simulation studies and then used a completed CRT where we imposed selected missing data patterns to compare their performance using real data. The key findings of our work are as follows. First, MMI and B-MMI had the lowest bias and highest coverage across the settings we investigated, with B-MMI being the only method to achieve nominal coverage rates across several scenarios considered when the imputation models were approximately correct. However, when imputation models were strongly misspecified, imputation approaches performed poorly, and could even be worse than CCA in several scenarios. Second, using an imputation model with only main effects for the outcome, treatment, and covariates resulted in especially poor performance when targeting a non-null HTE estimand. This could have important implications in practice as imputation models with only main effects are the default specification in many software packages, such as the popular mice R package.⁵⁴ Third, including more interaction terms almost always resulted in better performance in the simulations, so in practice, it may be safer to “overspecify” imputation models when enough data is available to justify doing so. Future work should consider whether such overspecifications maintain good performance when the smallest necessary imputation model is more parsimonious. Finally, implementing the logistic mixed-effects model Gibbs Sampler for B-MMI is, to our knowledge, a new contribution; previous procedures in the multilevel data context have instead leveraged a probit link function.

The simulation study focused on data-generating mechanisms that induced correlations within clusters, as are commonly found in CRTs. Accounting for these correlations when performing imputation was critical, and MMI and B-MMI generally outperformed the other comparator methods throughout the simulation studies. In CRTs, much attention has been given to accounting for ICC of outcome variables, but similar consideration must also be given for the ICC of covariates, especially for the purpose of studying confirmatory HTE. The recommendation to account for correlations in effect modifiers in CRTs has been previously emphasized for designing CRTs,^4–9,55 and here we have reinforced that same recommendation when imputing missing effect modifier data in CRTs. This recommendation is more related to correct imputation model specification than to model compatibility. In this article, the terms “model misspecification” and “lack of model compatibility” were used somewhat interchangeably, but there are subtle differences between them. While specification of interaction terms in imputation models is important for compatibility, adjusting for correlations in a missing effect modifier may be primarily needed to reflect the extra variability in the effect modifier at the cluster level and thus to ensure valid uncertainty statements when analyzing the imputed complete data. While we primarily addressed missing effect modifier data at the individual level due to its strong relevance to mainstream practice in subgroup analysis of CRTs,¹¹ future work should also consider the impact of missing modifiers which are measured at the cluster level rather than the individual level.

Table 2.

Summary of findings from the simulation and data application comparisons and recommendations for practitioners.

∙ Model compatibility between the outcome and imputation models is key for unbiased estimation and inference.

∙ As assessing HTE often entails interaction terms in outcome models, corresponding interaction terms should be used in imputation models.

∙ Cautiously over-specifying imputation models may be a good strategy when enough data is available for the models to converge.

∙ As with missing outcome data, correlation within clusters should be accounted for when imputing missing modifier data.

∙ MMI and B-MMI are promising methods in this setting and recommended over CCA, SI, or MI, but may still have bias or undercoverage when using non-compatible models.

∙ B-MMI is recommended when targeting an HTE or interaction effect estimand.

∙ MMI and B-MMI performed similarly and are recommended when targeting an ATE estimand.

HTE: heterogeneous treatment effect; MSE: mean squared error; SI: single imputation; MI: multiple imputation; MMI: multilevel multiple imputation; B-MMI: Bayesian multilevel multiple imputation; CCA: complete-case analysis; ATE: average treatment effect.

A summary of findings and recommendations for practitioners reflecting our observations is provided in Table 2. As always, assessing the plausibility of MCAR and MAR assumptions should be the first step in addressing missingness. Then one should consider the assumptions of different methods and model specifications, noting that while each imputation method assumes MAR, they are distinguished by whether they account for partial or more complete uncertainty in the imputation process. These distinctions were informed by the simulations and real data analysis, where some methods were more robust to specifying imputation models which were incompatible with the substantive model. Overall, B-MMI has been shown to be a promising approach for handling missing effect modifier data in CRTs, and generally had the strongest performance among the methods that we considered. To facilitate implementation, all simulation and data application code are available at https://github.com/bblette1/CRT-miss-mod. For practitioners that do not perform, or prefer, analysis under the Bayesian paradigm, approximate (frequentist) MMI also exhibited acceptable performance for missing modifier data in CRTs in several scenarios. However, there may be bias or non-nominal coverage when using MMI or B-MMI in similar ways that we have in this article, for example, when the imputation models are not truly compatible with the outcome model. This was seen for the ATE estimand, where no method achieved nominal coverage or preserved the Type I error rate in Scenario 1 of the simulation study. Although using three-way interactions in the imputation models was often sufficient for good performance, there does not exist a joint model for which the implied conditional outcome model is a linear model with random intercept and the implied conditional modifier model is a logistic model with random intercept. 
This highlights the importance of, and the potential challenge in, imputation model specification in CRTs when interest lies in assessing HTE (and the ATE after effect modifier adjustment) with a single binary effect modifier. These recommendations will not necessarily generalize to other settings, such as scenarios with continuous effect modifiers or binary outcomes, and future work is necessary to extend our studies to those settings.
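To make concrete what accounting for imputation uncertainty means for the multiple imputation methods, the following minimal sketch pools estimates across imputed data sets with Rubin's rules.^42,43 The function name `pool_rubin` and the numerical inputs are hypothetical and chosen only for illustration; they are not taken from the simulation study or the data application.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors (Rubin's rules)."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total, within, between


# Hypothetical interaction (HTE) estimates from m = 5 imputed data sets.
est = [0.40, 0.50, 0.45, 0.55, 0.60]
var = [0.04, 0.04, 0.04, 0.04, 0.04]
q_bar, total, within, between = pool_rubin(est, var)
print(q_bar, total, within, between)
```

The between-imputation term is the part of the total variance that reflects uncertainty about the imputed modifier values. An analysis of a single imputed data set reports only the within-imputation component, which is one way to see why such an analysis can understate uncertainty.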

There are several paths forward that have been illuminated by these comparison studies. First, “substantive-model compatible” approaches defined for individually randomized trials or observational studies^28,56–59 could be compared and may be valuable tools in the CRT setting. These often entail deriving the correct imputation model once an outcome model is specified, and (as the correct model will likely have a non-standard form) sampling from this derived model using MCMC. Joint modeling methods for individually randomized trials and observational studies^60–62 may be similarly useful. Some of these are likely adaptable to the CRT setting, while others may require extension because they are not designed for a multilevel setting, lack off-the-shelf software, and/or depend on specific outcome model forms. Second, the use of non-parametric imputation approaches that are flexible enough to be compatible with a large range of outcome models also merits deeper examination. One procedure that may be especially useful for this is imputation via BART.^31,63,64 Not only would such methodology be useful for handling missing modifiers in CRTs, but it is also under-explored in individually randomized trial and observational study settings, where studying HTE introduces similar model compatibility issues. If such an approach can perform well without requiring the expertise to both derive an implied imputation model and code an appropriate sampling procedure, it would be practical for assessing HTE with missing modifier data. We plan to pursue this methodological extension in a separate report.
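The idea of deriving the imputation model from the outcome model can be sketched in a deliberately simplified case. The example below assumes a single-level setting (no clustering), a linear outcome model with a treatment-by-modifier interaction, a marginal Bernoulli model for the binary modifier, and known parameter values; all function names and numbers are hypothetical. It is not the multilevel procedure evaluated in this article, and a full substantive-model compatible method would also draw the parameters, typically via MCMC.

```python
import math
import random


def sigmoid(x):
    """Numerically stable inverse logit."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def prob_modifier(y, a, beta, sigma, p_m):
    """P(M = 1 | Y = y, A = a) implied by the outcome model via Bayes' rule.

    Assumed outcome model: Y | A, M ~ N(b0 + b1*A + b2*M + b3*A*M, sigma^2).
    Assumed modifier model: M ~ Bernoulli(p_m), independent of A (randomization).
    """
    if not 0.0 < p_m < 1.0:
        raise ValueError("p_m must lie strictly between 0 and 1")
    b0, b1, b2, b3 = beta
    mu1 = b0 + b1 * a + b2 + b3 * a   # outcome mean when M = 1
    mu0 = b0 + b1 * a                 # outcome mean when M = 0
    log_odds = math.log(p_m / (1.0 - p_m))
    log_odds += ((y - mu0) ** 2 - (y - mu1) ** 2) / (2.0 * sigma ** 2)
    return sigmoid(log_odds)


def impute_modifier(y, a, beta, sigma, p_m, rng):
    """Draw one imputed value of the missing binary modifier."""
    return 1 if rng.random() < prob_modifier(y, a, beta, sigma, p_m) else 0


beta = (0.0, 1.0, 0.5, 1.0)   # hypothetical (b0, b1, b2, b3)
rng = random.Random(2024)
p_mid = prob_modifier(1.75, 1, beta, 1.0, 0.5)  # 0.5: y is midway between the subgroup means
draw = impute_modifier(3.0, 1, beta, 1.0, 0.5, rng)
```

In this simplified case the implied log-odds are linear in the outcome, the treatment indicator, and their product, so a logistic imputation model with an outcome-by-treatment interaction is compatible with the outcome model. Once random intercepts enter both models, no such standard form is available, which is where sampling-based approaches become relevant.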

Our studies have several limitations. First, although we considered a wide range of data-generating mechanisms, there are inevitably many possible data structures that were not included in the simulation studies. For example, all data-generating mechanisms generated modifier data independently of the baseline covariates. While this still implies a complex imputation model, as described in Section 4, it leaves out plausible scenarios where auxiliary covariates have a direct impact on the modifier. Previous work indicates that such correlations may not greatly impact simulation results,¹⁷ but this may still be an interesting area for future exploration. Furthermore, we restricted our comparisons to scenarios with one individual-level binary effect modifier, which is common in practice when assessing subgroup-specific treatment effects or HTE.¹¹ Future work should consider method performance when modifiers are categorical or continuous variables, when effect modifiers are measured at the cluster level, or when there are multiple effect modifiers that are subject to missingness. In addition, for an individual-level effect modifier, varying the ICC of the effect modifier may be useful in assessing the operating characteristics of the MMI and B-MMI methods. To address data with multiple missing variables in CRTs, imputation via joint modeling and imputation via fully conditional specification should also be compared in future research; see Audigier et al.⁶⁵ for such a comparison in the observational study setting. In general, the biases found should only amplify in scenarios with more than one effect modifier, such that our results are still informative and reflective of the recommended practice and caveats in that setting. Second, only MCAR and MAR scenarios were considered across the simulations and data illustration. 
Imputation-based approaches may not perform as well on data where modifiers are MNAR, and sensitivity methods that account for the multilevel feature of the CRT data may be useful to address MNAR. Third, in the simulation study, standard robust standard errors were used for all GEE outcome models. However, previous research has shown that bias-corrected standard error estimates are often needed when there are fewer than 30 clusters, especially for estimating a cluster-level effect parameter with missing outcomes.^66,67 In our simulation results with missing effect modifier data, the standard robust standard error estimator seemed adequate for as few as 20 clusters, but using bias-corrected standard errors may improve the slight undercoverage for the ATE estimand in a few settings. Finally, while this article focused on imputation approaches, other methods for handling missing data in CRTs, such as weighting-based methods,²⁴ may be useful and require further development to address missing effect modifier data in the CRT setting.
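The difference between the standard robust standard error and a bias-corrected version can be illustrated in the simplest possible case. The sketch below assumes an intercept-only model with identity link and independence working correlation, for which the leverage-based correction of Mancl and DeRouen⁶⁷ reduces to inflating each cluster's residual sum by N / (N - n_i), where N is the total sample size and n_i the size of cluster i. The function name and the data are hypothetical; the GEE models in this article include treatment, modifier, and covariate terms, for which the same correction requires matrix computations.

```python
import math


def cluster_mean_se(clusters):
    """Standard and Mancl-DeRouen-type robust SEs for an overall mean.

    `clusters` is a list of lists of outcomes, one inner list per cluster.
    Assumes an intercept-only model with independence working correlation.
    """
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    n_total = sum(len(c) for c in clusters)
    mu_hat = sum(sum(c) for c in clusters) / n_total
    meat_std = 0.0
    meat_md = 0.0
    for c in clusters:
        s_i = sum(y - mu_hat for y in c)        # cluster sum of residuals
        inflate = n_total / (n_total - len(c))  # effect of (I - H_ii)^{-1}
        meat_std += s_i ** 2
        meat_md += (inflate * s_i) ** 2
    # Sandwich variance is meat / N^2 because the "bread" is 1 / N.
    se_std = math.sqrt(meat_std) / n_total
    se_md = math.sqrt(meat_md) / n_total
    return mu_hat, se_std, se_md


# Hypothetical data: three clusters of three observations each.
data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
mu_hat, se_std, se_md = cluster_mean_se(data)
print(mu_hat, se_std, se_md)
```

With only three clusters the corrected standard error is 1.5 times the uncorrected one, and the inflation factor approaches 1 as the number of clusters grows. This is consistent with the correction mattering mainly when few clusters are randomized.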

Supplemental Material

Supplemental material, sj-pdf-1-smm-10.1177_09622802241242323 for Assessing treatment effect heterogeneity in the presence of missing effect modifier data in cluster-randomized trials by Bryan S Blette, Scott D Halpern, Fan Li and Michael O Harhay in Statistical Methods in Medical Research

Footnotes

Acknowledgements

We would like to thank the investigators and participants of the Work, Family, and Health Study. The data from the Work, Family, and Health Study is publicly available at .

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Research in this article was partially supported by the Patient-Centered Outcomes Research Institute^® (PCORI^® Awards ME-2020C1-19220 to Michael O Harhay and ME-2020C3-21072 to Fan Li). Michael O Harhay and Fan Li are also funded by the United States National Institutes of Health (NIH), National Heart, Lung, and Blood Institute (grant number R01-HL168202). All statements in this report, including its findings and conclusions, are solely those of the authors and do not necessarily represent the views of the NIH or PCORI^® or its Board of Governors or Methodology Committee.

Supplemental material

Supplemental material for this article is available online.

References

1. Murray. Design and analysis of group-randomized trials. 29. New York, NY: Oxford University Press, 1998.

2. Turner, Gallis, et al. Review of recent methodological developments in group-randomized trials: part 1—design. Am J Public Health 2017; 107: 907–915.

3. Turner, Prague, Gallis, et al. Review of recent methodological developments in group-randomized trials: part 2—analysis. Am J Public Health 2017; 107: 1078–1086.

4. Yang, Starks, et al. Sample size requirements for detecting treatment effect heterogeneity in cluster randomized trials. Stat Med 2020; 39: 4218–4237.

5. Tong, Esserman. Accounting for unequal cluster sizes in designing cluster randomized trials to detect treatment effect heterogeneity. Stat Med 2022; 41: 1376–1396.

6. Chen, Tian, et al. Designing three-level cluster randomized trials to assess treatment effect heterogeneity. Biostatistics 2023; 24: 833–849.

7. Tong, Taljaard. Sample size considerations for assessing treatment effect heterogeneity in randomized trials with heterogeneous intracluster correlations and variances. Stat Med 2023; 42: 3392–3412.

8. Ryan, Esserman. Maximin optimal cluster randomized designs for assessing treatment effect heterogeneity. Stat Med 2023; 42: 3764–3785.

9. Maleyeff, Wang, Haneuse, et al. Sample size requirements for testing treatment effect heterogeneity in cluster randomized trials with binary outcomes. Stat Med 2023; 42: 1–30.

10. Starks, Sanders, Coeytaux, et al. Assessing heterogeneity of treatment effect analyses in health-related cluster randomized trials: a systematic review. PLoS ONE 2019; 14: e0219894.

11. Wang, Goldfeld, Taljaard, et al. Sample size requirements to test subgroup-specific treatment effects in cluster-randomized trials. Prev Sci 2023; 1–15.

12. Abrevaya, Hsu, Lieli. Estimating conditional average treatment effects. J Bus Econ Stat 2015; 33: 485–505.

13. Semenova, Chernozhukov. Debiased machine learning of conditional average treatment effects and other causal functions. Econom J 2021; 24: 264–289.

14. Nicholls, Al-Jaishi, Niznick, et al. Health equity considerations in pragmatic trials in Alzheimer’s and dementia disease: results from a methodological review. Alzheimer’s & Dementia: Diagnosis, Assessment & Disease Monitoring 2023; 15: e12392.

15. Little, Rubin. Statistical analysis with missing data. 793. Hoboken, NJ: John Wiley & Sons, 2019.

16. Von Hippel. How to impute interactions, squares, and other transformed variables. Sociol Methodol 2009; 39: 265–291.

17. Seaman, Bartlett, White. Multiple imputation of missing covariates with non-linear effects and interactions: an evaluation of statistical methods. BMC Med Res Methodol 2012; 12: 1–13.

18. Díaz-Ordaz, Kenward, Cohen, et al. Are missing data adequately handled in cluster randomised trials? A systematic review and guidelines. Clinical Trials 2014; 11: 590–600.

19. Fiero, Huang, Oren, et al. Statistical analysis and handling of missing data in cluster randomized trials: a systematic review. Trials 2016; 17: 1–10.

20. Hunsberger, Murray, Ed Davis, et al. Imputation strategies for missing data in a school-based multi-centre study: the pathways study. Stat Med 2001; 20: 305–316.

21. Taljaard, Donner, Klar. Imputation strategies for missing continuous outcomes in cluster randomized trials. Biometrical Journal 2008; 50: 329–345.

22. Caille, Leyrat, Giraudeau. A comparison of imputation strategies in cluster randomized trials with missing binary outcomes. Stat Methods Med Res 2016; 25: 2650–2669.

23. Hossain, DiazOrdaz, Bartlett. Missing binary outcomes under covariate-dependent missingness in cluster randomised trials. Stat Med 2017; 36: 3092–3109.

24. Turner, Yao, et al. Properties and pitfalls of weighting as an alternative to multilevel multiple imputation in cluster randomized trials with missing binary outcomes under covariate-dependent missingness. Stat Methods Med Res 2020; 29: 1338–1353.

25. Díaz-Ordaz, Kenward, Grieve. Handling missing values in cost effectiveness analyses that use data from cluster randomized trials. J R Stat Soc Ser A (Stat Soc) 2014; 177: 457–474.

26. Andridge. Quantifying the impact of fixed effects modeling of clusters in multiple imputation for cluster randomized trials. Biometrical Journal 2011; 53: 57–74.

27. Meng. Multiple-imputation inferences with uncongenial sources of input. Stat Sci 1994; 9: 538–558.

28. Bartlett, Morris. Multiple imputation of covariates by substantive-model compatible fully conditional specification. Stata J 2015; 15: 437–456.

29. Yang, Kim. A semiparametric inference to regression analysis with missing covariates in survey data. Stat Sin 2017; 27: 261–285.

30. Sang, Kim, Lee. Semiparametric fractional imputation using Gaussian mixture models for handling multivariate missing data. J Am Stat Assoc 2022; 117: 654–663.

31. Daniels, Winterstein. Sequential BART for imputation of missing covariates. Biostatistics 2016; 17: 589–602.

32. Ramosaj, Pauly. Predicting missing values: a comparative study on non-parametric approaches for imputation. Comput Stat 2019; 34: 1741–1764.

33. Erler, Rizopoulos, Rosmalen, et al. Dealing with missing covariates in epidemiologic studies: a comparison between multiple imputation and a full Bayesian approach. Stat Med 2016; 35: 2955–2974.

34. Liang, Zeger. Longitudinal data analysis using generalized linear models. Biometrika 1986; 73: 13–22.

35. Morris, White, Crowther. Using simulation studies to evaluate statistical methods. Stat Med 2019; 38: 2074–2102.

36. Preisser, Young, Zaccaro, et al. An integrated population-averaged approach to the design, analysis and sample size determination of cluster-unit trials. Stat Med 2003; 22: 1235–1254.

37. Zeger, Liang. Longitudinal data analysis for discrete and continuous outcomes. Biometrics 1986; 42: 121–130.

38. Tong. Sample size estimation for modified Poisson analysis of cluster randomized trials with a binary outcome. Stat Methods Med Res 2021; 30: 1288–1305.

39. Kahan, Copas, et al. Estimands in cluster-randomized trials: choosing analyses that answer the right question. Int J Epidemiol 2023; 52: 107–118.

40. Tilling, Williamson, Spratt, et al. Appropriate inclusion of interactions was needed to avoid bias in multiple imputation. J Clin Epidemiol 2016; 80: 107–115.

41. Schafer. Analysis of incomplete multivariate data. New York, NY: CRC Press, 1997.

42. Rubin. Multiple imputation after 18+ years. J Am Stat Assoc 1996; 91: 473–489.

43. Rubin. Multiple imputation for nonresponse in surveys. 81. Hoboken, NJ: John Wiley & Sons, 2004.

44. Barnard, Rubin. Miscellanea. Small-sample degrees of freedom with multiple imputation. Biometrika 1999; 86: 948–955.

45. Raudenbush. Statistical analysis and optimal design for cluster randomized trials. Psychol Methods 1997; 2: 173–185.

46. Murray. Multiple imputation: a review of practical and theoretical findings. Stat Sci 2018; 33: 142–159.

47. Casella, George. Explaining the Gibbs sampler. Am Stat 1992; 46: 167–174.

48. Polson, Scott, Windle. Bayesian inference for logistic models using Pólya–Gamma latent variables. J Am Stat Assoc 2013; 108: 1339–1349.

49. Zhang. Multiple imputation: theory and method. Int Stat Rev 2003; 71: 581–592.

50. Eldridge, Ukoumunne, Carlin. The intra-cluster correlation coefficient in cluster randomized trials: a review of definitions. Int Stat Rev 2009; 77: 378–394.

51. Turner, Heagerty, et al. An evaluation of constrained randomization for the design and analysis of group-randomized trials with binary outcomes. Stat Med 2017; 36: 3791–3806.

52. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2021. https://www.R-project.org/.

53. Work, Family and Health Network. Work, Family, and Health Study (WFHS). Inter-university Consortium for Political and Social Research, 2018. DOI:https://doi.org/10.3886/ICPSR36158.v2.

54. Van Buuren, Groothuis-Oudshoorn. mice: multivariate imputation by chained equations in R. J Stat Softw 2011; 45: 1–67.

55. Tong, Harhay, et al. Accounting for expected attrition in the planning of cluster randomized trials for assessing treatment effect heterogeneity. BMC Med Res Methodol 2023; 23: 85.

56. Goldstein, Carpenter, Browne. Fitting multilevel multivariate models with missing data in responses and covariates that may include interactions and non-linear terms. J R Stat Soc Ser A (Stat Soc) 2014; 553–564.

57. Kim, Sugar, Belin. Evaluating model-based imputation methods for missing covariates in regression models with interactions. Stat Med 2015; 34: 1876–1888.

58. Enders, Keller. A model-based imputation procedure for multilevel regression models with random coefficients, interaction effects, and nonlinear terms. Psychol Methods 2020; 25: 88–112.

59. Lüdtke, Robitzsch, West. Regression models involving nonlinear effects with missing data: a sequential modeling approach using Bayesian estimation. Psychol Methods 2020; 25: 157.

60. Zhang, Wang. Moderation analysis with missing data in the predictors. Psychol Methods 2017; 22: 649–666.

61. Kim, Belin, Sugar. Multiple imputation with non-additively related variables: joint-modeling and approximations. Stat Methods Med Res 2018; 27: 1683–1694.

62. Erler, Rizopoulos, Jaddoe, et al. Bayesian imputation of time-varying covariates in linear mixed models. Stat Methods Med Res 2019; 28: 555–568.

63. Chipman, George, McCulloch. BART: Bayesian additive regression trees. Ann Appl Stat 2010; 4: 266–298.

64. Chen, Harhay, Tong, et al. A Bayesian machine learning approach for estimating heterogeneous survivor causal effects: applications to a critical care trial. Ann Appl Stat 2024; 18: 350–374.

65. Audigier, White, Jolani, et al. Multiple imputation for multilevel data with continuous and binary variables. Stat Sci 2018; 33: 160–183.

66. Kauermann, Carroll. A note on the efficiency of sandwich covariance matrix estimation. J Am Stat Assoc 2001; 96: 1387–1396.

67. Mancl, DeRouen. A covariance estimator for GEE with improved small-sample properties. Biometrics 2001; 57: 126–134.
