Should a Normal Imputation Model be Modified to Impute Skewed Variables?

Abstract

Researchers often impute continuous variables under an assumption of normality, yet many incomplete variables are skewed. We find that imputing skewed continuous variables under a normal model can lead to bias. The bias is usually mild for popular estimands such as means, standard deviations, and linear regression coefficients, but it can be severe for more shape-dependent estimands such as percentiles or the coefficient of skewness. We test several methods for adapting a normal imputation model to accommodate skewness, including methods that transform, truncate, or censor (round) normally imputed values, as well as methods that impute values from a quadratic or truncated regression. None of these modifications reliably reduces the biases of the normal model, and some modifications can make the biases much worse. We conclude that, if one has to impute a skewed variable under a normal model, it is usually safest to do so without modifications, unless one is more interested in estimating percentiles and shape than in estimating means, variances, and regressions. In the conclusion, we briefly discuss promising developments in the area of continuous imputation models that do not assume normality.
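The contrast between mean-type and shape-dependent estimands can be illustrated with a small simulation. The sketch below is not the design used in the article; it is a minimal illustration under assumptions chosen here for convenience: an exponential variable with mean 1, values missing completely at random at a 50% rate, a single imputation that ignores parameter uncertainty, and censoring at zero as the only modification. The function name `simulate` and all parameter values are hypothetical.

```python
import numpy as np


def simulate(n=200_000, missing_rate=0.5, seed=0):
    """Compare a normal imputation with a censored variant (illustrative only)."""
    rng = np.random.default_rng(seed)

    # Assumed skewed variable: exponential with mean 1 (always positive).
    complete = rng.exponential(scale=1.0, size=n)

    # Assumed missingness mechanism: missing completely at random.
    missing = rng.random(n) < missing_rate
    observed = complete[~missing]

    # Normal imputation: draw missing values from a normal distribution
    # with the observed mean and SD. This is a single imputation that
    # ignores uncertainty in the estimated parameters.
    normal = complete.copy()
    normal[missing] = rng.normal(
        observed.mean(), observed.std(ddof=1), size=int(missing.sum())
    )

    # One possible modification: censor (round) negative values up to zero.
    censored = np.maximum(normal, 0.0)

    def summarize(values):
        return {
            "mean": float(values.mean()),
            "p5": float(np.percentile(values, 5)),
        }

    return {
        "complete": summarize(complete),
        "normal": summarize(normal),
        "censored": summarize(censored),
    }


if __name__ == "__main__":
    for label, stats in simulate().items():
        print(f"{label:>9}: mean={stats['mean']:.3f}  5th pct={stats['p5']:.3f}")
```

Under these assumptions, the normally imputed data should reproduce the mean closely while placing the 5th percentile below zero, a value the original variable cannot take. Censoring removes the negative values but pulls the mean upward, because it raises every negative imputed value without lowering any other.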

Keywords

missing data, missing values, incomplete data, regression, transformation, normalization, multiple imputation, imputation


