Abstract
Missing data problems are common in biological, high-dimensional data, where data can be partially or completely missing. Algorithms have been developed to reconstruct the missing values by means of imputation or expectation-maximization algorithms. For missing data problems, it has been suggested that the regression model of interest should be incorporated into the imputation procedure to reduce bias of the regression coefficients. We here consider a challenging missing data problem, where diplotypes of the KIR loci are to be reconstructed. These loci are difficult to genotype, resulting in ambiguous genotype calls. We extend a previously proposed expectation-maximization algorithm by incorporating a potentially high-dimensional regression model to model the outcome. Three strategies are evaluated: (1) only allelic predictors, (2) allelic predictors and forward-backward selection on haplotype predictors, and (3) penalized regression on a saturated model. In a simulation study, we compared these strategies with a baseline expectation-maximization algorithm without outcome model. For extreme choices of effect sizes and missingness levels, the outcome-based expectation-maximization algorithms outperformed the no-outcome expectation-maximization algorithm. However, in all other cases, the no-outcome expectation-maximization algorithm performed either superior or comparable to the three strategies, suggesting the outcome model can have a harmful effect. In a data analysis concerning death after allogeneic hematopoietic stem cell transplantation as a function of donor KIR genes, expectation-maximization algorithms with and without outcome showed very similar results. In conclusion, outcome based missing data models in the high-dimensional setting have to be used with care and are likely to lead to biased results.
Keywords
Introduction
A common problem in the analysis of statistical models is the presence of missing data. This can be entirely missing values, that is, not available (
From a statistical point of view, missingness can be classified into the missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) taxonomy, 2 which make different assumptions on the conditional distribution of the incomplete data, given observed data. MCAR assumes an independent distribution of missing values, making a complete case analysis valid, albeit power is usually reduced.2,3 The MAR assumption states that the probability of missingness only depends on other variables in the dataset. MAR underpins most approaches for handling missingness and under MAR complete case analysis can also lead to valid conclusions, although other analysis approaches are more efficient in general, because of the increased sample size. 4 The discarding of incomplete cases can make the complete case analysis infeasible, especially for high-dimensional data where missingness is observed in most, if not all, participants. MNAR includes the case when unobserved variables are required for conditional independence and is difficult to deal with in practice.
For likelihood based models, two categories of approaches are often used, both assuming MAR: multiple imputation (MI)2,5,6 and expectation-maximization (EM) algorithms. 7 Analysis using MI is a two-step procedure. In the first step, missing values are imputed by data drawn from an appropriate distribution iteratively, producing several datasets. In the second step, these data are analyzed “as usual” and results are pooled.2,5,8 The goal of MI is to derive valid pooled coefficients and standard errors in the second step. 2 In contrast, the goal of the EM algorithm is to compute the maximum-likelihood estimate for each variable of the dataset. This is done by imputing expectations of missing values using working parameters assumed to approximate true parameters (the E-step). The MI and EM algorithm are similar in the sense that the same conditional distribution given observed data is used to draw from (MI) and to calculate expected values of (EM). The resulting likelihood looks like a complete data likelihood and can be optimized as usual (the M-step). 9 Resulting parameter estimates are used in an iteration as new working parameters.
Using MI, it is necessary to include the outcome variable of the regression model, that is the target of the analysis, in the procedure to avoid bias.3,10 This follows from the idea that if an association exists between predictors and outcome, their correlation is informative on missing values and if no such association exists, no harm is done. It has therefore been proposed to explicitly include the regression model into the MI procedure. 5 The substantive model compatible fully conditional specification (SMC-FCS) procedure draws imputations by incorporating the regression model of interest into the imputation process, representing the research question. This model, also called the substantive model, models the expected relationship between predictors and outcome. The SMC-FCS model is a principled approach that leads to unbiased parameter estimates under non-linear link-functions while methods that ignore the substantive model can lead to biased estimates. 5 This holds under correct specification of the substantive model. We here define a substantive model to be designed to answer the underlying research question and to be used after the imputation procedure has been completed. Such a substantive model might be difficult to define before imputation, because of issues like model selection and regularization. We distinguish here between the substantive model to be used after imputation, and the outcome model to be used during imputation, which is a regression model with the same outcome as the substantive model, but using fewer biological or statistical assumptions or chosen by statistical convenience. When emphasizing this distinction we refer to the latter as working outcome model. Considering the fact that it is common practice to include more variables into an MI procedure than are used in a substantive model in order to improve imputation results, 10 a mismatch between the substantive and working outcome models occurs regularly in practice. As depending on the research question a substantive model might be hard to define, discrepancies between substantive and outcome models are important to investigate. We do this by mimicking several practical approaches covering the high-dimensional setting.
To this end, we develop an EM algorithm including an outcome model for a specific genetic situation which should, under correct specification, allow for unbiased estimation of the substantive model. To understand characteristics of the procedures we investigate two properties: first, the distribution of the partly missing data in terms of the overall distribution conditioned on all observed covariates and the distribution per individual, as proper imputation of values is the first step in obtaining optimal regression coefficients. Second, the bias and efficiency of the regression coefficients of a substantive model is of interest.
We have chosen the problem of diplotype (pair of haplotypes) reconstruction for ambiguous genotype data 11 and discuss broad applicability of our results below. Previous work suggests that results can be improved when outcome data is taken into account beyond genetic data alone. 11 However, this comes with several challenges. First, as discussed above, it is not straightforward to specify a substantive model. In general, most genetic variants are considered to have a small effect on any outcome12,13 implying that the correct choice of predictor variables is unknown a-priori. Second, potential outcome models are usually high-dimensional as haplotype (tuples of alleles) based models are of interest, eventually leading to fully saturated regression models. Third, we want to allow an analyst to specify a very rich outcome model, possibly including additional covariates that help to impute but are not of direct relevance for the substantive model. After the imputation procedure, the analyst can then fit and test several substantive models. This implies that a regression model used after running the EM algorithm might differ from the outcome model during the algorithm. Although this approach slightly deviates from the idea that the imputation model should be compatible with the substantive model, it leads to a much less cumbersome and slow workflow compared to re-running every new substantive model with the EM algorithm. This also reflects common practice, when no substantive model can be specified a-priori.
Our work is motivated by studies of the killer-cell immunoglobulin-like receptor (KIR) gene region, which is one of the most complex genetic regions observed in humans exhibiting high allelic polymorphism, copy-number-variation (CNV), and allele homology across genes, thereby making it difficult to measure and analyze. We illustrate our methods on a dataset describing the impact of KIR genes on outcomes after allogeneic stem cell transplantation of patients with secondary acute myeloid leukemia (sAML) or myelodysplastic syndromes (MDS).11,14 One main source of missingness in this data stems from the measurement process and leads to ambiguity of genotype calls, that is, a measurement is, in general, compatible with a larger set of diplotypes. 15 Due to the high amount of missing data in this application, it is paramount to retain as much information as possible, suggesting to add an outcome model to the EM algorithm, to improve both estimation of haplotype frequencies (HTFs) and of regression coefficients that capture the impact of KIR genes on outcomes after transplantation.
KIR genes are under investigation because they orchestrate natural-killer (NK)-cell functioning, which is assumed to play a role in the graft-versus-leukemia (GvL) reaction after allogeneic hematopoietic stem cell transplantation (alloHCT). Certain KIR variants have previously been found to be associated with reduced disease relapse and improved patients’ overall survival, although their role is still under scrutiny.14,16–18 To fit in the methodological framework of the current article focusing on parametric regression models, we present a much simplified version of the analyses shown elsewhere by modeling patient death until 6 months after alloHCT as a binary outcome in our data application.
In a more general setting, we consider high-dimensional data for which a substantial model is either unknown or difficult to fit. While we consider a specific example concerning genetics, we believe that our findings are valid for similar data. Dimensionality of the data considered in this study is comparable to many other research questions (omics data such as gene expression, DNA variation, and blood metabolites). In our case, we can handle missing data using a likelihood approach, which might not be possible in other situations, making our results optimistic. The reader more interested in general conclusions can therefore focus on simulation results that give an impression on problems arising from similar amount of missingness in other datasets.
The article is structured as follows. In Section 2, we develop an EM algorithms for diplotype reconstruction taking into account an outcome model, and describe methodology used in the simulations and data analysis. Section 3 describes an extensive simulation study. In Section 4, we present results from the simulation study and a real data analysis and we close with a discussion in Section 5.
Methods
Data and genetic likelihood
Genetic data considered here is composed of observations of DNA sequences at fixed positions (loci), called alleles. Tuples along a single chromosome of alleles form haplotypes. Pairs thereof, inherited from both parents, are called diplotypes. If pairs of alleles are observed per locus, that is, it is unknown on which chromosomes alleles lie, observations are called (unphased) genotypes. The missing data problem considered here is that allelic state is observed incompletely and that phasing information is missing in general. Additionally, several loci might be indistinguishable by the measuring process, resulting in more than two observed alleles per locus (copy-number variation).
19
We first consider genetic information only. Following notation from van der Burg et al.,
11
for
We now assume that besides a multi-locus genotype also an outcome is observed, depending on the diplotype and potentially influenced by additional covariates. We assume that the outcome is modeled by a regression model, the density of which is given by
Each EM algorithm starts with the full data likelihood, which has to iterate/integrate over all possibilities for missing values. This likelihood can then be used to condition missing values on observed values and compute the expectation under given parameter values. 7 In the context of genetics, mostly phase-ambiguity (i.e. genotypes are observed, but diplotypes are of interest) has been considered.20,21 Our study is based on an extended model for HTF estimation 11 that can cope with genetically diverse gene regions as well as with genotyping ambiguities. First, both ambiguous genotype calls and CNVs are allowed. Second, the algorithm controls complexity by an iterative reconstruction and collapsing of rare haplotypes. Third, the stability of HTF estimates is guaranteed by a profiling step.
Lacking an analytical solution for the maximum likelihood estimator (MLE), the log-likelihood is optimized using an EM algorithm which iterates between taking expectations (E-step), and updating parameters
E-step
Denote with
The M-step is estimated as the population frequency, calculated via
Reconstruction algorithm
To reconstruct actual data, the EM algorithm has to be broken down into smaller steps to make it computationally feasible (Figure 1). The algorithm adds loci iteratively, each time adding a new locus to a previous reconstruction. To obtain good starting values, the algorithm starts with the profile EM algorithm not including outcome information (step 1; van der Burg et al.
11
) results of which are used for the first E-step (matrix
Note that in the E-step for HTFs (step 7), frequencies of collapsed haplotypes from the previous step are fixed to their frequency from the previous iteration, resulting in a profiling for this parameter. This profiling was introduced earlier and is due to the fact that collapsing does not retain full information about missingness and can, therefore, lead to bias (see van der Burg et al.
11
for details).To avoid numerical instability, the terms
Outcome design matrix
Different types of design matrices are used depending on their role in the procedure: data simulation, the different iterations of the EM algorithm and a substantive model for data analysis, all potentially being different. All matrices are derived by the following steps, starting with the matrix of known (simulation) or expected (EM algorithm or substantive model) diplotype counts
Working outcome model
Since in a high-dimensional situation it is generally difficult to specify a “true” regression model additional modeling steps are required. Three strategies for constructing the design matrix for the outcome model within the EM algorithm (step 4) were considered:
CNV alleles
CNV observations are sets of alleles that can be observed for a gene on a single haplotype, molecularly corresponding to several gene copies. In the simulation design matrix, the set of alleles is treated as a separate allele observable for this gene (locus). Because of their low frequency, effect sizes are difficult to estimate and we decompose CNV alleles into their individual alleles, when constructing design matrices for model fitting. CNVs contribute to each allele or haplotype in the allele set as a fractional count (additive effect), which is determined as one over the number of combinations that can be formed by the decomposition.
Substantive models
A design matrix is constructed based on the expected counts as described above (Section 2.4.1). Two types of substantive models are considered: (1) only use allelic predictors, (2) use all haplotypes with an expected frequency
Simulation study
Our simulation study follows the aims, data-generating mechanisms, estimands, methods, and performance measures (ADEMP) structure discussed by Morris et al. 25
Aim
The aim of the simulation study is to compare the performance of the profile EM algorithm based on genotypes only with the profile EM algorithm with outcome. As outcome models, the different modeling strategies mentioned above (Section 2.4.2) are compared.
Data-generation mechanism
Haplotype simulation
Diplotypes consisting of three loci were simulated. We used
Parameters of the simulation studies.
Parameters of the simulation studies.
Note: The two sets of AFs (set 1/set 2) for the simulation of the individual diplotypes. AFs are based on the actual KIR dataset. Within brackets, the starting points of the two sets of regression coefficients (set 1/set 2) for the simulation of the outcomes. These effect sizes are divided or multiplied by 5, thus each set leads to three sub-scenarios. Haplotypes that in the scenarios with haplotype effects have a double effect (once for individual alleles and once for the haplotype) are A–F and C–H (also indicated by superscripts
AFs: allele frequencies; KIR: killer-cell immunoglobulin-like receptor; LD: linkage disequilibrium.
For diplotype simulation, HTFs are determined first. To this end, loci are added iteratively, where in each iteration, LD between new alleles and existing haplotypes defines the new joint, multinomial distribution (Table 1). From this distribution,
To determine the effect of ambiguities on estimating allele and haplotype effects in regression models, continuous and binary outcomes were simulated based on diplotypes. The design matrix
To limit the number of simulation scenarios, a subset was chosen from combinations of (1) two sets of HTFs (non-uniform (default) and approximately uniform), (2) two sets of regression coefficients (small CNV allele effects (default) and strong CNV allele effects), (3) two sets of (non-allele) haplotype regression coefficients (none (default) and haplotype effects present), and (4) two types of outcomes (continuous (default) and binary), leading to a partial factorial design of eight scenarios (Supplemental Appendix C and Table S1). Criteria to eliminate scenarios were to avoid models with likely unstable outcome models (e.g. small effect sizes and low corresponding HTFs). Each of these scenarios was combined with LD (
Proof-of-concept simulations
To establish validity of the outcome profile EM algorithm, another two simulation scenarios with few alleles and strong effect sizes were chosen, without (scenario A) and with (scenario B) haplotype effects (Table 2). Simulations for this scenario follow the same steps as described above.
Parameters of the proof-of-concept simulation.
Parameters of the proof-of-concept simulation.
Note: The set of AFs for the simulation of the individual diplotypes. AFs are based on the actual KIR dataset. Within brackets, the starting points of the effects size set for the simulation of the individuals outcome. These effect sizes are divided or multiplied by 3, thus each set leads to three sub-scenarios. Haplotypes that in the scenarios with haplotype effects have a double effect (once for individual alleles and once for the haplotype) are A–E and F–H (also indicated by superscripts
AFs: allele frequencies; KIR: killer-cell immunoglobulin-like receptor; LD: linkage disequilibrium.
Estimands are the individual and population HTFs. The population HTFs are the
Additionally, we evaluate the accuracy of substantive model regression coefficients.
Methods to evaluate
Each replication is analyzed with the profile EM algorithm with and without included outcome. The threshold for collapsing rare haplotypes was set to
Performance measures
For each scenario, 100 independent replications ( Profile EM-algorithm with outcome for a set of loci. HTFs are updated with the profile EM-algorithm with outcome for the haplotypes (loci in reconstruction) by iterating between a weighted haplotype only profile EMalgorithm (EMprof; van der Burg et al. (2023)) and updating regression coefficients. Maximized parameters are returned as

where
To quantify individual HTF accuracy, we use the HTR measure:
Proof-of-concept
We start with a simple proof-of-concept simulation study, including few allele effects. Here, only the sub-scenario of scenario A (without additional haplotype effects) containing high effect sizes will be discussed. We only show results for the six HTs with the highest HTFs. Further explanation and results are given in Supplemental Appendices D and E.
The simulation results are displayed in a nested-loop plot of Figure 2. In this plot, results of all sub-scenarios are plotted in vertical bins stacked next to each other, where the set of lines below the graph indicates the parameters used for each sub-scenario. For scenario A, there are nine datasets with different combinations of LD and ambiguities, and each dataset has been analysed with four EM algorithms that differ in the used outcome model: no-outcome model, allelic model, forward-backward model, and penalized model. The HTFs estimated with each model are compared, both on an individual level with the HTR measure (bold red horizontal lines), and on the population level with population HTFs (bold black horizontal lines).

HTF estimation, scenario A. Note: Nested-loop plot of RMSEs for HTFs and HTR measure for all sub-scenarios of the proof-of-concept study without haplotype effects. Each column of points separated by vertical grid lines depicts a different set of parameters, where the lines at the bottom of the graph define these parameter sets. Each sub-scenario is analyzed by four outcome models: no-outcome (lowest line; noY), allelic model (one step; allele), forward-backward model (second step; FB), and penalized regression model (upper line; pen). The sub-scenarios differ in their added LD and ambiguities, where a step in the line corresponds to a higher value (lowest line is 0.0; highest line is 0.8). Further explanation of the nested-loop plot is given by Rücker and Schwarzer.
26
The haplotypes plotted are a combination of gene 1, gene 2, and gene 3, with the genes in that order separated by a “
As observed by the large differences between the red lines corresponding to the HTR measure, HTFs estimated with the allelic outcome model strongly outperform the no-outcome model in sub-scenarios with medium and/or high ambiguity and LD (Figure 2). The forward-backward selection model performs slightly worse than the allelic model, but still better than the no-outcome model, while the penalized model is slightly worse than the no-outcome model. Both the forward-backward and penalized models take possible haplotype effects into account, which were absent in the simulations. This is also reflected in KLD ratios (comparing the KLD of outcome models with the KLD of the no-outcome model), which for most of these sub-scenarios are above one (Figure 2). However, the fact that the black lines corresponding to the population HTFs are comparable across most sub-scenarios shows that the beneficial effect of the allelic and forward-backward model in the aforementioned sub-scenarios is small. This is especially true for sub-scenarios with low ambiguities, which is further confirmed by KLD ratios being far below 1.
Two substantive models were evaluated: one with only allelic predictors and another with all candidate list haplotypes. RMSEs for the regression coefficients from both substantive models follow a similar pattern as observed above: high effect sizes allow for more reliable estimation with the allelic and forward-backward selection model (Supplemental Figure S2(A) and (B)). For other sub-scenarios, the four methods perform comparably.
Haplotype frequencies
For the default scenario (scenario 001) with allelic effects only and a continuous outcome, all methods perform well for low ambiguity with the HTR measure (Figure 3). In most other sub-scenarios, the no-outcome model outperforms the three outcome models. Exceptions are the allelic and forward-backward models for medium LD, high ambiguities, and the highest effects. For high LD, and the same ambiguities and effects all methods performed similar again. Population HTFs are consistently well estimated by the no-outcome model and almost always outperform the three outcome models, with lower RMSE and KLD ratios substantially lower than 1 (Figure 3). With specific sub-scenarios, the outcome models can perform adequately, but the no-outcome model is never substantially worse.

HTF estimation, scenario 001. Note: Nested-loop plot of RMSEs for HTFs and HTR measure for all sub-scenarios of default scenario 001, containing HTF and effect size set 1. Explanation about the nested-loop plots is given in the legend of Figure 1. The horizontal bold black lines represent the RMSE of the popHTFs (with values on the left y-axis), while the bold red lines represent the HTR measure of the individual HTFs (with values on the right y-axis). The table below the graph shows mean RMSE values of these popHTFs and the HTR measures, as well as the KLD ratios between the no-outcome model and the three outcome models. All values in the table are the range (min–max) of values for the three sub-scenarios above. A ratio
Overall, estimates obtained by the penalized model perform relatively worst, reflecting the bias-variance trade-off of this model. The allelic and forward-backward model perform best under several sub-scenarios, however, the no-outcome model performs better in other sub-scenarios and is most consistent across all sub-scenarios.
The HTF estimates of the scenario with haplotype effects (scenario 002) shows similar results, except for a drop of the performance of the allelic model (Supplemental Figure S4(A)). In all other cases, the no-outcome performs either better or comparable to the outcome models.
For the substantive models with only allelic predictors, the default scenario with high effect sizes shows superior performance of the forward-backward model, and especially of the allelic models, compared to the no-outcome model in terms of RMSEs of regression coefficients (Figure 4(a)). Including all candidate list predictors in the substantive model, the allelic outcome model still performs well, while the forward-backward model shows inconsistent behavior (Figure 4(b)). For both substantive models, the penalized model performs well for a low amount of ambiguities but starts to deteriorate for more ambiguities. In the scenario, where also haplotype effects are simulated, the beneficial effect of the allelic outcome model is attenuated (Supplemental Figure S6(A) and (B)). The allelic outcome model now only outperforms the no-outcome model with candidate list predictors in the substantive model, while for the other sub-scenarios it always performs worse. The RMSE values for the substantive model regression coefficients for the three outcome models in the other scenarios are equal to, or worse than the no-outcome model (Supplemental Figure S6(C) and (D)). Especially with high effect sizes, RMSEs can be very large for the outcome models (Supplemental Figure S6(E) to (H)). More results of the partially factorial design simulation are given in SupplementalAppendix D.

Substantive model regression coefficients, scenario 001. Note: Nested-loop plots of RMSE values for estimated regression coefficients for all sub-scenarios of default scenario 001, containing HTF and effect size set 1. Two substantive models are run with different sets of predictors: (A) only alleles as predictor and (B) all candidate list predictors as predictor. Explanation about the nested-loop plots is given in the legend of Figure 1. Here, the colored dots are either alleles (single letters) or haplotypes (combination of alleles separated by a “
Set-up
The KIR data contains genotype data of seven genes (KIR2DL3, KIR2DL1, KIR2DS1, KIR2DL2, KIR2DS4, KIR3DL1, and KIR3DS1), which is a subset of the known KIR genes. The current analysis is based on the subset of 1030 donor-patients pairs registered with EBMT for whom survival outcome data were available and who had no “POS” denotation for any of the seven genes.14,27 For the data analysis, different grouping threshold values were used (
In each analysis, the no-outcome as well as the three outcome models (alleles only, forward-backward, and penalized) were included. Survival outcome was converted into a binary outcome by computing the survival indicator at 6 months, patients censored before 6 months were considered to be alive at 6 months. In total, 229 (22.2%) patients had experienced the event. For both, the inside EM algorithm regression models as for the substantive models a logistic outcome model was used.
Regression coefficients obtained via the outcome and substantive model were used to compare outcome models with the no-outcome model. Additionally, HTFs were compared to a superset of the data (containing an additional 2517 genotype donors for which no patient outcome is available) 11 and to estimates obtained from another study. 15
Results
All genes were considered as starting genes in the reconstruction. KLDs comparing results of the no-outcome EM algorithm with the three outcome EM algorithm strategies are shown in Figure 5 and are used to evaluate the collapsing threshold. The KLD measures do not show monotonic behavior with decreasing threshold values which would allow to choose a threshold below which KLD values would not change substantially. Instead, we use a collapsing threshold of

HTF dissimilarity. Note: The dissimilarity between the HTFs of the no-outcome EM algorithm and the HTFs estimated with the three outcome models for different threshold values (
Table 3 shows the estimated HTFs for the different methods with this threshold over the seven starting genes. HTFs estimated by the four models are very similar. Only the SDs (calculated over the different starting genes) in the allelic model are substantially larger compared to the other methods. Interestingly, despite the almost identical effect size estimates in the regression models, HTFs of allelic and forward-backward models are not more similar to each other as compared to other models. Compared with the superset and the external study, the four methods tested here result in comparable HTFs estimates. These differences indicate that method heterogeneity is small compared to sample fluctuation. Overall no substantial inconsistencies were observed. This is also the case for the information
Haplotype frequency comparison.
Note: Average HTFs over the seven starting genes in the different EM algorithm (between brackets the standard deviations over the seven starting genes). The 20 haplotypes with the highest HTFs in the baseline EM algorithm without outcome are shown and ordered numerically. Additionally, the HTFs observed in a study, here used as a rough comparison
15
and the superset. Analyses were run with
The two substantive models run after the EM algorithm do not support a beneficial effect of the incorporation of the outcome in haplotype reconstruction for this dataset either, probably because of over-parameterization of the models (Supplemental Appendix E). The substantive models with candidate list haplotypes as predictors are estimated as intercept-only models for both the no-outcome and the outcome methods, whereas for all methods the substantive model with allelic predictors results in unrealistically large effect sizes.
In this article, we have addressed the problem of missing data analysis in the context of genetic data. While we have implemented a specific model, our findings have implications beyond the genetic situation, which is discussed after our specific findings.
Here, we extended an EM algorithm 11 for haplotype reconstruction that can deal with a very general class of missing data, including ambiguities of measurements. Our extension allows to incorporate an outcome model in the imputation process, which is deemed necessary for unbiased estimates and which could potentially improve efficiency of regression models. 3 In the genetic situation, it is generally difficult to specify a “true” regression model, also called the substantive model, which can be used during reconstruction and later for the final analysis. We have therefore extended the profile EM algorithm by allowing to choose an outcome model used during reconstruction independently from a substantive model for the final analysis. In genetics, marginal analyses are often of interest, even when higher-dimensional effects can be suspected.28,29 This suggests that for high(er)-dimensional data, two outcome models might be required: one for imputation and one for estimation of effects.
In the simulations, we found that there can be a huge benefit of incorporating the outcome in the EM algorithm. Both the allelic and forward-backward regression models outperform the no-outcome EM algorithm model with respect to the HTR measure and RMSEs of regression coefficients for multiple sub-scenarios. This pertains to the proof-of-concept simulations with high effects sizes and a few other simulation scenarios. On the other hand, the outcome-based EM algorithm can perform poorly for these measures in most other scenarios (Figure 3). Our results show that, at a minimum, regression models have to be correctly specified as the EM algorithm with outcome also performed poorly in the proof-of-concept simulations under misspecification. Another aspect is model size. Saturated, nearly saturated or penalized models performed poorly in general. We used penalized regression to offer a generic solution: the outcome model always contains all possible haplotype indicators and the user does not have to choose a model explicitly. However, the inherently more sparse allelic and forward-backward selection models strongly outperformed penalized regression under correct specification. We used elastic net regression (
A possible strategy is to run both the no-outcome and EM algorithm with outcome and compare estimates, and then accept the outcome-based estimates only if they do not diverge too strongly from no-outcome findings. While it was possible to make such a strategy based on the Euclidean distances work under specific scenarios, we were not able to define a meaningful acceptance cut-off in general because it depends highly on parameter settings (data not shown). We therefore do not see this as a promising strategy.
In summary, a benefit of an EM algorithm with outcome could only be demonstrated in simulations where missingness was modeled based on the real KIR dataset, but effect sizes were chosen higher than can be expected for real data. Especially in the genetic situation, effect sizes are deemed to be small, in general. 12 In all other cases, the no-outcome EM algorithm performed comparable or better than the outcome model. Additionally, the regression model has to be correctly specified, when in practice, the genetic pathway is usually unknown.
In our data analysis, the beneficial impact of the outcome-models was limited. HTFs do not differ substantially and the information
We believe that our findings allow for some general conclusions about handling missing data in similar situations. While we have not implemented a MI algorithm, results from our EM algorithm can be used to create multiple imputed dataset. Each EM algorithm iteration involves computing the diplotype-distribution of missing values for each individual from which data can be drawn (Figure 1). We see the use of an EM algorithm versus a MI strategy as two different ways to estimate the same substantive model and, therefore, believe that our findings apply to a wide range of models, including MI algorithms, for handling missing data under the MAR assumption. Multiple imputation including the outcome variable in the imputation data has been shown to increase efficiency when implemented by regressions, 30 predictive mean matching, 10 or substantive model based imputations. 5 In these applications, however, the substantive model is known and it is possible to obtain reliable parameter estimates, that is, the substantive model can be well fitted. Our results show that these conditions are crucial for successful data imputation. Whenever model uncertainty becomes large either due to high dimensionality or poor conditioning of design matrices, the inclusion of a (working) outcome model is likely to be detrimental. To benefit from a substantive model when data are imputed it is therefore critical to be able to substantiate the choice of outcome model and to assess it’s model fit.
Conclusions
Regression model-based imputation seems beneficial in many cases, at least when the model is correctly specified, driven by true associations, and the model is well-conditioned (i.e. the design-matrix of a regression model is well-conditioned). However, misspecification can lead to substantial bias and decrease in efficiency. Outcome-based models are therefore not automatically “better” than no-outcome models. Especially in situations where it is challenging to correctly specify a substantive model—for example, due to high-dimensional predictors or interactions—, incorporating the outcome in the imputation procedure could do more harm than good. Therefore, outcome and no-outcome models have always to be performed together to check plausibility. In the high-dimensional setting, sparse models are required, but difficult to construct, especially if the process to be modeled is not well understood. Benefits of outcome-based imputation in such settings are doubtful.
Supplemental Material
sj-pdf-1-smm-10.1177_09622802241304112 - Supplemental material for High-dimensional, outcome-dependent missing data problems: Models for the human
loci
Supplemental material, sj-pdf-1-smm-10.1177_09622802241304112 for High-dimensional, outcome-dependent missing data problems: Models for the human
Footnotes
Acknowledgements
Data availability
For this study, two datasets are of importance:
The real KIR dataset consists of the genotyping data of 17 KIR genes for 4,077 individuals. The following KIR genes are included in the data: KIR2DL1, KIR2DL2, KIR2DL3, KIR2DL4, KIR2DL5, KIR2DP1, KIR2DS1, KIR2DS2, KIR2DS3, KIR2DS4, KIR2DS4N, KIR2DS5, KIR3DL1, KIR3DL2, KIR3DL3, KIR3DP1, and KIR3DS1. The dataset is managed by the Collaborative Biobank (CoBi), where any bona fide scientist can apply to get access to its datasets. More information about the access can be found under access policy at https://www.cobi-biobank.com/scientists/welcome/. All genotyped individuals above have supplied stem cells for allogeneic hematopoietic stem cell transplantation. For 1208 patients, outcome data were available with the EBMT. Researchers can apply to get access to EBMT data, more information can be found on https://www.ebmt.org/registry/ebmt-data-sharing.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article.
Supplemental material
Supplemental material for this article is available online.
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
