Abstract
This article investigates the accuracy of examinee classification into performance categories and the estimation of the theta parameter for several item response theory (IRT) scaling techniques when applied to six administrations of a test. Previous research has investigated only two administrations; however, many testing programs equate tests across multiple administrations. As such, this article seeks to examine the long-term sustainability of IRT scaling methods. Three different types of shifts in the ability distribution were examined: no change, a mean shift, and a change in skewness. Haebara, Stocking-Lord, mean/sigma, mean/mean, and fixed common item parameter (FCIP) scaling were compared with respect to bias, root mean square error, and classification of examinees into performance categories. Results indicate that FCIP may be the most suitable for complex changes in examinee performance, whereas the methods performed quite similarly for simple changes.
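As a minimal sketch of the evaluation criteria named above (not the authors' actual simulation design), bias and root mean square error for theta recovery are typically computed as the mean signed error and the root of the mean squared error between generating and estimated abilities. All values below are simulated for illustration only:

```python
import numpy as np

# Hypothetical illustration: bias and RMSE, two of the recovery criteria
# named in the abstract, computed for simulated theta estimates.
rng = np.random.default_rng(0)
theta_true = rng.normal(0.0, 1.0, size=1000)              # generating abilities
theta_hat = theta_true + rng.normal(0.1, 0.3, size=1000)  # estimates with a small systematic shift

bias = np.mean(theta_hat - theta_true)                    # average signed error
rmse = np.sqrt(np.mean((theta_hat - theta_true) ** 2))    # root mean square error
print(f"bias={bias:.3f}, rmse={rmse:.3f}")
```

In a study like this one, these statistics would be computed per scaling method and per administration, so that error accumulation across the equating chain can be tracked.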