Likelihood-Based Item-Fit Indices for Dichotomous Item Response Theory Models

Abstract

New goodness-of-fit indices are introduced for dichotomous item response theory (IRT) models. These indices are based on the likelihoods of number-correct scores derived from the IRT model, and they provide a direct comparison of the modeled and observed frequencies of correct and incorrect responses at each number-correct score. The behavior of the Pearson X² form (S-X²) and the likelihood ratio G² form (S-G²) of the new indices was assessed in a simulation study and compared with two fit indices similar to those currently in use (Q₁-X² and Q₁-G²). The simulations included three conditions in which the simulating and fitting models were identical and three conditions involving model misspecification. S-X² performed well, with Type I error rates close to the nominal .05 and .01 levels, and its performance improved with increased test length. S-G² tended to reject the null hypothesis too often, as did Q₁-X² and Q₁-G². The power of S-X² appeared to be similar across test lengths but varied with the type of model misspecification.
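The abstract does not give the formulas, so the following is only a minimal sketch of how a comparison of modeled and observed proportions at each number-correct score might be computed. It assumes a two-parameter logistic item model, a standard normal latent distribution approximated on an equally spaced grid, a standard recursion over items for the score likelihoods, and a Pearson-type sum over the interior scores. All function names, item parameters, and the grid are hypothetical choices made for illustration and are not taken from the article.

```python
import math

def p_correct(theta, a, b):
    """Assumed 2PL probability of a correct response (illustrative only)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def score_likelihoods(probs):
    """Likelihood of each number-correct score at one theta, built item by item."""
    dist = [1.0]  # with zero items, a score of 0 is certain
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            new[k] += mass * (1.0 - p)  # this item answered incorrectly
            new[k + 1] += mass * p      # this item answered correctly
        dist = new
    return dist

def expected_proportions(items, item_index, nodes, weights):
    """Model-implied proportion correct on one item at each number-correct score."""
    n = len(items)
    numer = [0.0] * (n + 1)
    denom = [0.0] * (n + 1)
    for theta, w in zip(nodes, weights):
        probs = [p_correct(theta, a, b) for a, b in items]
        full = score_likelihoods(probs)
        # Score likelihoods for the test with the studied item left out
        rest = score_likelihoods(probs[:item_index] + probs[item_index + 1:])
        for k in range(n + 1):
            denom[k] += w * full[k]
            if k >= 1:
                numer[k] += w * probs[item_index] * rest[k - 1]
    return [numer[k] / denom[k] for k in range(n + 1)]

def pearson_fit(observed, expected, counts):
    """Pearson-type sum over interior scores 1..n-1 (assumed form of the index)."""
    total = 0.0
    for k in range(1, len(counts) - 1):
        e = expected[k]
        total += counts[k] * (observed[k] - e) ** 2 / (e * (1.0 - e))
    return total

# Hypothetical (a, b) item parameters and a crude grid over theta
items = [(1.0, -1.0), (1.2, 0.0), (0.8, 0.5), (1.5, 1.0)]
nodes = [-4.0 + 0.1 * j for j in range(81)]
# Unnormalized N(0, 1) density; the constant cancels in the ratio
weights = [math.exp(-0.5 * t * t) for t in nodes]
expected = expected_proportions(items, 0, nodes, weights)
```

Here `observed` and `counts` would hold, for each number-correct score, the observed proportion answering the item correctly and the number of examinees with that score. The reference distribution, degrees of freedom, and any rule for collapsing sparse score groups are not stated in the abstract and are deliberately left out of this sketch.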

