Abstract
Model–data fit of item response theory (IRT) models is generally assessed by comparing examinees' observed performance on individual items with the performance predicted under the chosen IRT model. However, traditional chi-square methods for evaluating goodness-of-fit of IRT models are not appropriate when the underlying trait/ability is estimated imprecisely (e.g., on shorter assessments). This article describes a goodness-of-fit statistic that directly accounts for the uncertainty with which ability is estimated, together with a resampling-based hypothesis testing procedure. A simulation study was conducted to evaluate the empirical power and Type I error rates of the proposed procedure. Results indicated that the procedure should be useful for evaluating goodness-of-fit of IRT models in most testing applications where uncertainty in ability estimation is an issue.
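The general shape of a resampling-based item-fit test can be sketched as a parametric bootstrap under a 2PL model: compute a Pearson-type fit statistic from ability-grouped observed vs. predicted proportions correct, then build its null distribution by repeatedly simulating responses from the fitted model. This is a minimal illustrative sketch, not the article's actual statistic or procedure; the function names (`p_2pl`, `fit_statistic`), the grouping scheme, and the item parameters are all assumptions, and a fuller implementation would re-estimate abilities within each replication to propagate their uncertainty.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_2pl(theta, a, b):
    # 2PL item response function: probability of a correct response
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def fit_statistic(responses, theta_hat, a, b, n_groups=5):
    # Pearson-type item-fit statistic: group examinees by estimated ability,
    # then compare observed vs. model-predicted proportions correct per group
    order = np.argsort(theta_hat)
    stat = 0.0
    for g in np.array_split(order, n_groups):
        obs = responses[g].mean()
        exp = p_2pl(theta_hat[g], a, b).mean()
        exp = min(max(exp, 1e-6), 1.0 - 1e-6)  # guard against division by zero
        stat += len(g) * (obs - exp) ** 2 / (exp * (1.0 - exp))
    return stat

def resampling_p_value(responses, theta_hat, a, b, n_rep=500):
    # Parametric bootstrap: simulate responses from the fitted model,
    # recompute the statistic, and locate the observed value in the
    # resulting null distribution
    observed = fit_statistic(responses, theta_hat, a, b)
    null = np.empty(n_rep)
    for r in range(n_rep):
        sim = rng.binomial(1, p_2pl(theta_hat, a, b))
        null[r] = fit_statistic(sim, theta_hat, a, b)
    return observed, float((null >= observed).mean())

# Toy data: 400 examinees whose abilities are estimated with error,
# mimicking the imprecision of a short assessment
theta_true = rng.normal(size=400)
theta_hat = theta_true + rng.normal(scale=0.5, size=400)  # noisy estimates
a, b = 1.2, 0.3  # hypothetical item parameters
responses = rng.binomial(1, p_2pl(theta_true, a, b))

stat, p = resampling_p_value(responses, theta_hat, a, b)
print(f"fit statistic = {stat:.2f}, bootstrap p = {p:.3f}")
```

Because the data here are generated from the same 2PL model being tested, the bootstrap p-value should usually be non-significant; a misfitting item would push the observed statistic into the upper tail of the simulated null distribution.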