
Abstract

Assessing the goodness of fit of item response theory (IRT) models typically involves evaluating differences between observed and expected score response distributions using a chi-square test statistic. When these methods are applied to shorter assessments, the uncertainty with which ability is estimated greatly affects how well the null chi-square distribution approximates the sampling distribution of the statistic. Results from a Monte Carlo study indicated serious departures between the theoretical null distributions and the empirically derived sampling distributions of the chi-square statistic for tests with 8 and 16 constructed-response items. This article also describes a fit statistic that attempts to account for the uncertainty in estimating ability and that could therefore be applied in testing situations in which ability is not precisely estimated. The method employs more information from the same distribution used to obtain Bayesian point estimates of ability: it reflects the probabilities that examinees have ability equal to each of a range of values rather than restricting expectations to single values.
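
The idea of spreading each examinee over a range of ability values can be illustrated with a minimal sketch. It is not the statistic developed in the article: it assumes dichotomous items under a two-parameter logistic model (the study concerns constructed-response items), a standard normal prior, a five-point ability grid, and made-up item parameters and response patterns. All function and variable names are hypothetical. Each examinee contributes fractional "pseudo-observed" counts to every grid point in proportion to the posterior probability of that ability value, and those counts are compared with model expectations.

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct response (an illustrative model choice)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def posterior_weights(responses, items, grid):
    """Posterior probability of each grid point for one examinee,
    assuming a standard normal prior on ability."""
    weights = []
    for theta in grid:
        value = math.exp(-0.5 * theta * theta)  # N(0, 1) prior kernel
        for u, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            value *= p if u == 1 else 1.0 - p
        weights.append(value)
    total = sum(weights)
    return [w / total for w in weights]

def pseudo_count_statistic(data, items, item_index, grid):
    """Chi-square-type discrepancy for one item from pseudo-observed counts."""
    n_at = [0.0] * len(grid)  # fractional number of examinees at each point
    r_at = [0.0] * len(grid)  # fractional number of correct responses there
    for responses in data:
        post = posterior_weights(responses, items, grid)
        for k, w in enumerate(post):
            n_at[k] += w
            r_at[k] += w * responses[item_index]
    a, b = items[item_index]
    stat = 0.0
    for k, theta in enumerate(grid):
        p = p_correct(theta, a, b)
        exp_correct = n_at[k] * p
        exp_incorrect = n_at[k] * (1.0 - p)
        stat += (r_at[k] - exp_correct) ** 2 / exp_correct
        stat += ((n_at[k] - r_at[k]) - exp_incorrect) ** 2 / exp_incorrect
    return stat, n_at, r_at

# Hypothetical (discrimination, difficulty) pairs and response patterns.
items = [(1.0, -1.0), (1.2, 0.0), (0.8, 1.0), (1.5, 0.5)]
data = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 0, 1, 0],
]
grid = [-2.0, -1.0, 0.0, 1.0, 2.0]

stat, n_at, r_at = pseudo_count_statistic(data, items, item_index=1, grid=grid)
print("pseudo-observed examinees per grid point:", [round(n, 3) for n in n_at])
print("discrepancy statistic:", round(stat, 3))
```

Because every examinee contributes to several grid points, the cell counts in this sketch are not independent, so no chi-square reference distribution is assumed for the resulting value.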



The Effect of Errors in Estimating Ability on Goodness-of-Fit Tests for IRT Models
