Assessing Item-Level Fit for Higher Order Item Response Theory Models

Abstract

Testing item-level fit is important in scale development because it guides item revision and deletion. Many item-level fit indices have been proposed in the literature, yet none of them has been directly applicable to an important family of models, namely, higher order item response theory (HO-IRT) models. In this study, chi-square-based fit indices (i.e., Yen’s Q₁, McKinley and Mills’s G², and Orlando and Thissen’s S-X² and S-G²) were extended to HO-IRT models. Their performance was evaluated via simulation studies in terms of false positive rates and correct detection rates. The manipulated factors included test structure (i.e., test length and number of dimensions), sample size, level of correlation among dimensions, and the proportion of misfitting items. For misfitting items, the sources of misfit, including misfitting item response functions and misspecified factor structures, were also manipulated. The results from the simulation studies demonstrate that S-G² is promising for higher order items.
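
For orientation, the two Orlando and Thissen indices named above can be written for a single dichotomous item as a Pearson-type sum and a likelihood-ratio-type sum over summed-score groups. The sketch below assumes that the group sizes, the observed proportions correct, and the model-implied proportions correct are already available (in practice the model-implied values are usually obtained with a Lord-Wingersky-type recursion, which is not shown). The function and argument names are illustrative, and this is the standard unidimensional form, not the HO-IRT extension developed in the article.

```python
import math


def s_x2_and_s_g2(n_k, obs_k, exp_k):
    """Return (S-X2, S-G2) for one dichotomous item.

    n_k   : number of examinees in each summed-score group (assumed input)
    obs_k : observed proportion correct on the item in each group
    exp_k : model-implied proportion correct in each group, strictly in (0, 1)
    """
    x2 = 0.0
    g2 = 0.0
    for n, o, e in zip(n_k, obs_k, exp_k):
        # Pearson-type contribution of this score group
        x2 += n * (o - e) ** 2 / (e * (1.0 - e))
        # Likelihood-ratio contribution; a term with zero observed
        # proportion contributes nothing (0 * log 0 is taken as 0)
        if o > 0.0:
            g2 += 2.0 * n * o * math.log(o / e)
        if o < 1.0:
            g2 += 2.0 * n * (1.0 - o) * math.log((1.0 - o) / (1.0 - e))
    return x2, g2


# Hypothetical numbers for three score groups, for illustration only
x2, g2 = s_x2_and_s_g2([50, 120, 80], [0.30, 0.55, 0.85], [0.28, 0.60, 0.82])
```

Both statistics equal zero when the observed and model-implied proportions agree in every group and grow as the two diverge. Each is referred to a chi-square distribution whose degrees of freedom depend on the number of score groups and of estimated item parameters.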

Keywords

higher order IRT models, item fit, false positive rate, correct detection rate

References

1. Baker, F. B., & Kim, S.-H. (2004). Item response theory: Parameter estimation techniques (2nd ed., Revised and expanded). New York, NY: Marcel Dekker.

2. Bartholomew, D. J., & Leung, S. O. (2002). A goodness of fit test for sparse 2^p contingency tables. British Journal of Mathematical and Statistical Psychology, 55, 1-15.

3. Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29-51.

4. Bock, R. D., & Haberman, S. J. (2009, July). Confidence bands for examining goodness-of-fit of estimated item response functions. Paper presented at the Annual Meeting of the Psychometric Society, Cambridge, UK.

5. Cai (2015). Lord-Wingersky algorithm version 2.0 for hierarchical item factor models with applications in test scoring, scale alignment, and model fit testing. Psychometrika, 80, 535-559.

6. Cai, Maydeu-Olivares, Coffman, D. L., & Thissen (2006). Limited-information goodness-of-fit testing of item response theory models for sparse 2^P tables. British Journal of Mathematical and Statistical Psychology, 59, 173-194.

7. Chon, K. H., Lee, W. C., & Dunbar, S. B. (2010). A comparison of item fit statistics for mixed IRT models. Journal of Educational Measurement, 47, 318-338.

8. de la Torre & Hong (2010). Parameter estimation with small sample size: A higher-order IRT model approach. Applied Psychological Measurement, 34, 267-285.

9. de la Torre & Song (2009). Simultaneous estimation of overall and domain abilities: A higher-order IRT model approach. Applied Psychological Measurement, 33, 620-639.

10. Demars, C. E. (2005). Type I error rates for Parscale’s fit index. Educational and Psychological Measurement, 65, 42-50.

11. Douglas & Cohen, A. S. (2001). Nonparametric item response function estimation for assessing parametric model fit. Applied Psychological Measurement, 25, 234-243.

12. Fienberg, S. E. (1979). The use of chi-squared statistics for categorical data problems. Journal of the Royal Statistical Society: Series B (Methodological), 41, 54-64.

13. Fox, J. P., & Glas, C. A. (2001). Bayesian estimation of a multilevel IRT model using Gibbs sampling. Psychometrika, 66, 271-288.

14. Glas, C. A. W., & Suárez-Falcón, J. C. (2003). A comparison of item-fit statistics for the three-parameter logistic model. Applied Psychological Measurement, 27, 87-106.

15. Haberman, S. J. (2009). Use of generalized residuals to examine goodness of fit of item response models. ETS Research Report Series, 2009(1), 1-17.

16. Haberman, S. J., Sinharay, & Chon, K. H. (2013). Assessing item fit for unidimensional item response theory models using residuals from estimated item response functions. Psychometrika, 78, 417-440.

17. Hambleton, R. K., Swaminathan, & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.

18. Huang, H. Y. (2015). A multilevel higher order item response theory model for measuring latent growth in longitudinal data. Applied Psychological Measurement, 39, 362-372.

19. Huang, H. Y. (2017). Mixture IRT model with a higher-order structure for latent traits. Educational and Psychological Measurement, 77, 275-304.

20. Huang, H. Y., Chen, P. H., & Wang, W. C. (2012). Computerized adaptive testing using a class of high-order item response theory models. Applied Psychological Measurement, 36, 689-706.

21. Huang, H. Y., & Wang, W. C. (2013). Higher order testlet response models for hierarchical latent traits and testlet-based items. Educational and Psychological Measurement, 73, 491-511.

22. Huang, H. Y., Wang, W. C., Chen, P. H., & C. M. (2013). Higher-order item response models for hierarchical latent traits. Applied Psychological Measurement, 37, 619-637.

23. Huo, de la Torre, Mun, E. Y., Kim, S. Y., Ray, A. E., Jiao, & White, H. R. (2015). A hierarchical multi-unidimensional IRT approach for analyzing sparse, multi-group data for integrative data analysis. Psychometrika, 80, 834-855.

24. LaHuis, D. M., Clark, & O’Brien (2011). An examination of item response theory item fit indices for the graded response model. Organizational Research Methods, 14, 10-23.

25. Lee (2014). Application of higher-order IRT models and hierarchical IRT models to computerized adaptive testing (Electronic Theses and Dissertations). University of California, Los Angeles.

26. Rupp, A. A. (2011). Performance of the S-X² statistic for full-information bifactor models. Educational and Psychological Measurement, 71, 986-1005.

27. Liang & Wells, C. S. (2009). A model fit statistic for generalized partial credit model. Educational and Psychological Measurement, 69, 913-928.

28. Liang & Wells, C. S. (2015). A nonparametric approach for assessing goodness-of-fit of IRT models in a mixed format test. Applied Measurement in Education, 28, 115-129.

29. Liang, Wells, C. S., & Hambleton, R. K. (2014). An assessment of the nonparametric approach for evaluating the fit of item response models. Journal of Educational Measurement, 51, 1-17.

30. Liu & Maydeu-Olivares (2013). Local dependence diagnostics in IRT modeling of binary data. Educational and Psychological Measurement, 73, 254-274.

31. Liu & Maydeu-Olivares (2014). Identifying the source of misfit in item response theory models. Multivariate Behavioral Research, 49, 354-371.

32. Lord, F. M., & Wingersky, M. S. (1984). Comparison of IRT true-score and equipercentile observed-score “equatings.” Applied Psychological Measurement, 8, 453-461.

33. Kang & Chen, T. T. (2008). Performance of the generalized S-X² item fit index for polytomous IRT models. Journal of Educational Measurement, 45, 391-406.

34. Maydeu-Olivares & Joe (2005). Limited- and full-information estimation and goodness-of-fit testing in 2ⁿ contingency tables: A unified framework. Journal of the American Statistical Association, 100, 1009-1020.

35. McKinley, R. L., & Mills, C. N. (1985). A comparison of several goodness-of-fit statistics. Applied Psychological Measurement, 9, 49-57.

36. Micceri (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin, 105, 156-166.

37. Muraki & Bock, R. D. (1997). PARSCALE: IRT item analysis and test scoring for rating-scale data [Computer software]. Chicago, IL: Scientific Software International.

38. Orlando & Thissen (2000). Likelihood-based item-fit indices for dichotomous item response theory models. Applied Psychological Measurement, 24, 50-64.

39. Orlando & Thissen (2003). Further investigation of the performance of S-X²: An item fit index for use with dichotomous item response theory models. Applied Psychological Measurement, 27, 289-298.

40. Ranger & Kuhn, J. T. (2012). Assessing fit of item response models using the information matrix test. Journal of Educational Measurement, 49, 247-268.

41. Reckase (2009). Multidimensional item response theory (Vol. 150). New York, NY: Springer.

42. Reiser (2008). Goodness-of-fit testing using components based on marginal frequencies of multinomial data. British Journal of Mathematical and Statistical Psychology, 61, 331-360.

43. Rijmen, Jeon, von Davier, & Rabe-Hesketh (2014). A third-order item response theory model for modeling the effects of domains and subdomains in large-scale educational assessment surveys. Journal of Educational and Behavioral Statistics, 39, 235-256.

44. Roberts, J. S. (2008). Modified likelihood-based item fit statistics for the generalized graded unfolding model. Applied Psychological Measurement, 32, 407-423.

45. Sinharay (2005). Assessing fit of unidimensional item response theory models using a Bayesian approach. Journal of Educational Measurement, 42, 375-394.

46. Sinharay (2006). Bayesian item fit analysis for unidimensional item response theory models. British Journal of Mathematical and Statistical Psychology, 59, 429-449.

47. Stone, C. A. (2000). Monte Carlo based null distribution for an alternative goodness-of-fit test statistic in IRT models. Journal of Educational Measurement, 37, 158-175.

48. Stone, C. A., & Zhang (2003). Assessing goodness of fit of item response theory models: A comparison of traditional and alternative procedures. Journal of Educational Measurement, 40, 331-352.

49. Stroud, A. H. (1974). Numerical quadrature and solution of ordinary differential equations. New York, NY: Springer.

50. Suárez-Falcón, J. C., & Glas, C. A. W. (2003). Evaluation of global testing procedures for item fit to the Rasch model. British Journal of Mathematical and Statistical Psychology, 56, 127-143.

51. van der Linden, W. J. (2009). Conceptual issues in response-time modeling. Journal of Educational Measurement, 46, 247-272.

52. Wang (2014). Improving measurement precision of hierarchical latent traits using adaptive testing. Journal of Educational and Behavioral Statistics, 39, 452-477.

53. Wang, Kohli, & Henn (2016). A second-order longitudinal model for binary outcomes: Item response theory versus structural equation modeling. Structural Equation Modeling: A Multidisciplinary Journal, 23, 455-465.

54. Wang, Shu, & Shang (2015). Assessing item-level fit for the DINA model. Applied Psychological Measurement, 39, 525-538.

55. Wells, C. S., & Bolt, D. M. (2008). Investigation of a nonparametric procedure for assessing goodness-of-fit in item response theory. Applied Measurement in Education, 21, 22-40.

56. Yen, W. M. (1981). Using simulation results to choose a latent trait model. Applied Psychological Measurement, 5, 245-262.

57. Zhang & Stone, C. A. (2008). Evaluating item fit for multidimensional item response models. Educational and Psychological Measurement, 68, 181-196.