Multiple Imputation for Combined-survey Estimation With Incomplete Regressors in One but Not Both Surveys

Abstract

Within-survey multiple imputation (MI) methods are adapted to pooled-survey regression estimation where one survey has more regressors, but typically fewer observations, than the other. This adaptation is achieved through (1) larger numbers of imputations to compensate for the higher fraction of missing values, (2) model-fit statistics to check the assumption that the two surveys sample from a common universe, and (3) specifying the analysis model completely from variables present in the survey with the larger set of regressors, thereby excluding variables never jointly observed. In contrast to the typical within-survey MI context, cross-survey missingness is monotonic and easily satisfies the missing at random assumption needed for unbiased MI. Large efficiency gains and substantial reduction in omitted variable bias are demonstrated in an application to sociodemographic differences in the risk of child obesity estimated from two nationally representative cohort surveys.
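As a rough illustration of why a higher fraction of missing values calls for more imputations, the sketch below uses the standard combining rules and the approximate relative-efficiency formula from Rubin (1987). It is not the authors' code: the function names `pool_rubin` and `relative_efficiency` and all numbers are hypothetical, chosen only for illustration.

```python
import numpy as np


def pool_rubin(estimates, variances):
    """Combine m completed-data estimates of one coefficient (Rubin's rules).

    estimates: the coefficient estimated on each of the m imputed data sets
    variances: the squared standard error from each of the m imputed data sets
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    q_bar = q.mean()                      # pooled point estimate
    within = u.mean()                     # within-imputation variance
    between = q.var(ddof=1)               # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    return q_bar, total


def relative_efficiency(gamma, m):
    """Approximate efficiency of m imputations relative to infinitely many,
    where gamma is the fraction of missing information (0 <= gamma <= 1)."""
    return 1.0 / (1.0 + gamma / m)


# Hypothetical estimates of one regression coefficient from m = 5 imputations.
estimate, total_variance = pool_rubin(
    [0.30, 0.34, 0.26, 0.32, 0.28],
    [0.010, 0.012, 0.011, 0.009, 0.010],
)

# With a high fraction of missing information (assumed gamma = 0.8 here),
# moving from 5 to 40 imputations recovers most of the lost efficiency.
eff_5 = relative_efficiency(0.8, 5)
eff_40 = relative_efficiency(0.8, 40)
```

In the pooled-survey setting, a regressor observed in only the smaller survey is missing for every case in the larger one, so the fraction of missing information can be far higher than in typical within-survey applications. This is the motivation for point (1) in the abstract.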

Keywords

combining data, multiple imputation, model fit statistics, panel surveys, child obesity
