Abstract
Testing whether items fit the assumptions of an item response theory model is an important step in evaluating a test. In the literature, numerous item fit statistics exist, many of which show severe limitations. The current study investigates the root mean squared deviation (RMSD) item fit statistic, which is used for evaluating item fit in various large-scale assessment studies. The three research questions of this study are (1) whether the empirical RMSD is an unbiased estimator of the population RMSD; (2) if this is not the case, whether this bias can be corrected; and (3) whether the test statistic provides an adequate significance test to detect misfitting items. Using simulation studies, it was found that the empirical RMSD is not an unbiased estimator of the population RMSD, and nonparametric bootstrapping falls short of entirely eliminating this bias. Using parametric bootstrapping, however, the RMSD can be used as a test statistic that outperforms the other approaches (infit and outfit, S − X²) with respect to both Type I error rate and power. The empirical application showed that parametric bootstrapping of the RMSD results in rather conservative item fit decisions, which suggests more lenient cut-off criteria.
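To illustrate the kind of procedure the abstract describes, the following is a minimal sketch, not the authors' implementation: it computes an RMSD-style statistic for a single item under a Rasch model (observed versus model-implied proportions correct across ability groups, weighted by group frequency) and derives a p-value by parametric bootstrapping, i.e., by simulating response data under the fitted model. All specifics here (Rasch model, known item difficulty, 10 ability groups, N(0, 1) abilities) are simplifying assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(42)

def p_rasch(theta, b):
    """Rasch model probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def rmsd(resp, theta, b, n_bins=10):
    """RMSD-style statistic: observed vs. model-implied proportions
    correct over ability groups, weighted by group frequency.
    (Simplified stand-in for the operational RMSD definition.)"""
    edges = np.quantile(theta, np.linspace(0.0, 1.0, n_bins + 1))
    edges[-1] += 1e-9  # make the last bin include the maximum
    grp = np.digitize(theta, edges) - 1
    sq, w = 0.0, 0.0
    for g in range(n_bins):
        mask = grp == g
        if not mask.any():
            continue
        obs = resp[mask].mean()                 # observed proportion correct
        exp = p_rasch(theta[mask], b).mean()    # model-implied proportion
        sq += mask.sum() * (obs - exp) ** 2
        w += mask.sum()
    return np.sqrt(sq / w)

def parametric_bootstrap_p(resp, theta, b, n_boot=200):
    """p-value: share of bootstrap RMSDs (data simulated under the
    fitted model) at least as large as the observed RMSD."""
    observed = rmsd(resp, theta, b)
    n, count = len(theta), 0
    for _ in range(n_boot):
        theta_b = rng.standard_normal(n)
        resp_b = (rng.random(n) < p_rasch(theta_b, b)).astype(int)
        if rmsd(resp_b, theta_b, b) >= observed:
            count += 1
    return (count + 1) / (n_boot + 1)

# Demo: one item generated by the fitted model and one misfitting item
# (generated with a guessing floor the Rasch model cannot capture).
n = 3000
theta = rng.standard_normal(n)
b = 0.2  # assumed known difficulty for this sketch
resp_fit = (rng.random(n) < p_rasch(theta, b)).astype(int)
c = 0.3  # guessing parameter inducing misfit
resp_misfit = (rng.random(n) < c + (1 - c) * p_rasch(theta, b)).astype(int)

p_fit = parametric_bootstrap_p(resp_fit, theta, b)
p_misfit = parametric_bootstrap_p(resp_misfit, theta, b)
```

The bootstrap turns the descriptive RMSD into a significance test: the fitting item should yield a non-extreme p-value, while the strongly misfitting item should be flagged.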
