Abstract
In item response theory applications, item fit analysis is often performed for precalibrated items using response data from subsequent test administrations. Because this practice involves sampling variability from two distinct samples, which must be properly accounted for in statistical inference, conventional item fit analysis needs to be revisited and modified. This study extends the item fit analysis originally proposed by Haberman et al., which examines the discrepancy between the model-implied and empirical expected score curves. We analytically derive standard errors that accurately account for the sampling variability from both samples within the framework of restricted recalibration. After the derivation, we present findings from a simulation study that evaluates the performance of the proposed method in terms of empirical Type I error rate and power, for both dichotomous and polytomous items. An empirical example is also provided, in which we assess the item fit of a pediatric short-form scale in the Patient-Reported Outcome Measurement Information System.
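To make the quantity under study concrete, the sketch below contrasts an empirical expected score curve (mean observed item score within latent-trait bins) with the model-implied curve from precalibrated parameters. It is a minimal illustration assuming a dichotomous 2PL item with simulated data and hypothetical parameter values; it is not the authors' test statistic, whose standard errors additionally account for calibration-sample variability under restricted recalibration.

```python
import numpy as np

def irf_2pl(theta, a, b):
    """2PL item response function: probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# --- Illustrative data; all parameter values are hypothetical ---
rng = np.random.default_rng(0)
n = 5000
theta = rng.normal(size=n)                    # latent trait scores
a_true, b_true = 1.2, 0.3                     # generating parameters
y = rng.binomial(1, irf_2pl(theta, a_true, b_true))

# Precalibrated parameters, treated as fixed from an earlier sample
a_cal, b_cal = 1.2, 0.3

# Empirical expected score curve: mean observed score within theta bins
edges = np.linspace(-3, 3, 13)
centers = 0.5 * (edges[:-1] + edges[1:])
idx = np.digitize(theta, edges) - 1
emp = np.array([y[idx == k].mean() if np.any(idx == k) else np.nan
                for k in range(len(centers))])

# Model-implied expected score curve evaluated at the bin centers
imp = irf_2pl(centers, a_cal, b_cal)

# Pointwise discrepancy: the quantity whose sampling variability
# (from both the calibration and validation samples) the paper's
# derived standard errors are meant to capture
disc = emp - imp
```

For a polytomous item, the same comparison applies with the expected score computed as a probability-weighted sum over the response categories.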
Supplementary Material
