
Abstract

Methods to detect item response theory (IRT) item-level misfit are typically derived assuming fixed test forms. However, IRT is also employed with more complicated test designs, such as the balanced incomplete block (BIB) design used in large-scale educational assessments. This study investigates two modifications of Douglas and Cohen’s (2001) nonparametric method of assessing item misfit for analyzing BIB data, based on (a) using block total scores and (b) pooling booklet-level scores. The block-level method showed extreme inflation of Type I error for short blocks containing 5 or 10 items. The pooled booklet method yielded Type I error rates close to the nominal $\alpha$ in most conditions and had power to detect misfitting items. The study also found that the Douglas and Cohen procedure is only slightly affected by the presence of other misfitting items in the block. The pooled booklet method is recommended for practical applications of Douglas and Cohen’s method with BIB data.

Keywords

IRT item fit nonparametric IRT IRT fit Douglas-Cohen procedure


References

1. Bartholomew, D. J., & Leung, S. O. (2002). A goodness of fit test for sparse 2^p contingency tables. The British Journal of Mathematical and Statistical Psychology, 55(Pt 1), 1–15. https://doi.org/10.1348/000711002159617

2. Bishop, Y. M., Fienberg, S. E., & Holland, P. W. (1975). Discrete multivariate analysis: Theory and practice. MIT Press.

3. Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37(1), 29–51. https://doi.org/10.1007/bf02291411

4. Bock, R. D., & Aitkin (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46(4), 443–459. https://doi.org/10.1007/bf02293801

5. Bolt, D. M. (2002). A Monte Carlo comparison of parametric and nonparametric polytomous DIF detection methods. Applied Measurement in Education, 15(2), 113–141. https://doi.org/10.1207/s15324818ame1502_01

6. Cai (2013). flexMIRT version 2: Flexible multilevel multidimensional item analysis and test scoring [computer program]. Vector Psychometric Group.

7. Cohen (1973). Eta-squared and partial eta-squared in fixed factor ANOVA designs. Educational and Psychological Measurement, 33(1), 107–112. https://doi.org/10.1177/001316447303300111

8. Darrell Bock, & Lieberman (1970). Fitting a response model for n dichotomously scored items. Psychometrika, 35(2), 179–197. https://doi.org/10.1007/bf02291262

9. Donoghue, J. R., & Isham, S. P. (1998). A comparison of procedures to detect item parameter drift. Applied Psychological Measurement, 22(1), 33–51. https://doi.org/10.1177/01466216980221002

10. Douglas (1997). Joint consistency of nonparametric item characteristic curve and ability estimation. Psychometrika, 62(1), 7–28. https://doi.org/10.1007/bf02294778

11. Douglas (2001). Asymptotic identifiability of non-parametric item response models. Psychometrika, 66(4), 531–540. https://doi.org/10.1007/bf02296194

12. Douglas, & Cohen (2001). Nonparametric item response function estimation for assessing parametric model fit. Applied Psychological Measurement, 25(3), 234–243. https://doi.org/10.1177/01466210122032046

13. Glas, C. A. W., & Falcón, J. C. S. (2003). A comparison of item fit statistics for the three-parameter logistic model. Applied Psychological Measurement, 27(2), 87–106. https://doi.org/10.1177/0146621602250530

14. Haberman, S. J., Sinharay, & Chon, K. H. (2013). Assessing item fit for unidimensional item response theory models using residuals from estimated item response functions. Psychometrika, 78(3), 417–440. https://doi.org/10.1007/s11336-012-9305-1

15. Liang, & Wells, C. S. (2009). A model fit statistic for generalized partial credit model. Educational and Psychological Measurement, 69(6), 913–928. https://doi.org/10.1177/0013164409332222

16. Liang, Wells, C. S., & Hambleton, R. K. (2014). An assessment of the nonparametric approach for evaluating the fit of item response models. Journal of Educational Measurement, 51(1), 1–17. https://doi.org/10.1111/jedm.12031

17. Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data (2nd ed.). John Wiley & Sons.

18. Lord, F. M. (1980). Applications of item response theory to practical testing problems. Lawrence Erlbaum Associates.

19. Lord, F. M., & Wingersky, M. S. (1984). Comparison of IRT true-score and equipercentile observed-score equatings. Applied Psychological Measurement, 8(4), 453–461. https://doi.org/10.1177/014662168400800409

20. Matsumoto, & Nishimura (1998). Mersenne twister: A 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Transactions on Modeling and Computer Simulation, 8(1), 3–30. https://doi.org/10.1145/272991.272995

21. Maydeu-Olivares (2013). Goodness-of-fit assessment of item response theory models. Measurement: Interdisciplinary Research & Perspective, 11(3), 71–101. https://doi.org/10.1080/15366367.2013.831680

22. Muraki, & Bock, R. D. (1997). PARSCALE: IRT item analysis and test scoring for rating-scale data [computer program]. Scientific Software International.

23. Orlando, & Thissen (2000). Likelihood-based item-fit indices for dichotomous item response theory models. Applied Psychological Measurement, 24(1), 50–64. https://doi.org/10.1177/01466216000241003

24. Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C: The art of scientific computing (2nd ed.). Cambridge University Press.

25. Ramsay, J. O. (1991). Kernel smoothing approaches to nonparametric item characteristic curve estimation. Psychometrika, 56(4), 611–630. https://doi.org/10.1007/bf02294494

26. Ramsay, J. O. (2000). TESTGRAF: A program for the graphical analysis of multiple choice and questionnaire data [computer program]. https://www.psych.mcgill.ca/faculty/ramsay.html

27. Reiser (1996). Analysis of residuals for the multinomial item response model. Psychometrika, 61(3), 509–528. https://doi.org/10.1007/bf02294552

28. Sgammato, & Donoghue, J. R. (2016). Evaluation of Item Response Theory (IRT) fit indices. Paper presented at the annual meeting of the National Council on Measurement in Education.

29. Sgammato, & Donoghue, J. R. (2018). Evaluating a modified nonparametric procedure to assess parametric IRT model fit. Paper presented at the annual meeting of the National Council on Measurement in Education.

30. Sinharay (2008). A further look at the correlation between item parameters and item fit statistics. Journal of Educational Measurement, 45, 1–15. https://doi.org/10.1111/j.1745-3984.2007.00049.x

31. Stone, C. A. (2000). Monte Carlo based null distribution for an alternative goodness of fit test statistic in IRT models. Journal of Educational Measurement, 37(1), 58–75. https://doi.org/10.1111/j.1745-3984.2000.tb01076.x

32. Stone, C. A. (2003). Empirical power and Type I error rates for an IRT fit statistic that considers the precision of ability estimates. Educational and Psychological Measurement, 63(4), 566–583. https://doi.org/10.1177/0013164402251034

33. Stone, C. A., & Hansen, M. A. (2000). The effect of errors in estimating ability on goodness-of-fit tests for IRT models. Educational and Psychological Measurement, 60(6), 974–991. https://doi.org/10.1177/00131640021970907

34. Stone, C. A., Mislevy, R. J., & Mazzeo (1994). Classification error and goodness-of-fit in IRT models. Paper presented at the annual meeting of the American Educational Research Association.

35. Stone, C. A., & Zhang (2003). Assessing goodness of fit of item response theory models: A comparison of traditional and alternative procedures. Journal of Educational Measurement, 40(4), 331–352. https://doi.org/10.1111/j.1745-3984.2003.tb01150.x

36. van Rijn, P. W., Sinharay, Haberman, S. J., & Johnson, M. S. (2016). Assessment of fit of item response theory models used in large-scale educational survey assessments. Large-scale Assessments in Education, 4(1), 10. https://doi.org/10.1186/s40536-016-0025-3

37. Wells, C. S., & Bolt, D. M. (2008). Investigation of a nonparametric procedure for assessing goodness-of-fit in item response theory. Applied Measurement in Education, 21(1), 22–40. https://doi.org/10.1080/08957340701796464

38. Yen (1981). Using simulation results to choose a latent trait model. Applied Psychological Measurement, 5(2), 245–262. https://doi.org/10.1177/014662168100500212


Evaluating the Douglas-Cohen IRT Goodness of Fit Measure With BIB Sampling of Items
