Abstract
Large-scale testing programs involving classification decisions typically have multiple forms available and conduct equating to ensure cut-score comparability across forms. A test developer might be interested in the extent to which an examinee who happens to take a particular form would receive a consistent classification decision had the examinee taken an equated alternate form. In this article, classification consistency indices directly applicable to equating contexts are introduced, and procedures for estimating these indices are presented under three equating designs: the single-group design, the random-groups design, and the common-item nonequivalent-groups design. Two families of psychometric models (item response theory models and beta-binomial models) are considered, with a focus on procedures for estimating conditional score distributions and ability distributions. Two empirical analyses illustrate the use of the methodology under the common-item nonequivalent-groups design and the random-groups design, using item response theory models and beta-binomial models.
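As a minimal sketch of the kind of index the abstract describes: under a beta-binomial model, an examinee's true score follows a beta distribution, and observed scores on two parallel forms are conditionally independent binomial draws given the true score. A marginal consistency index is then the probability that both forms yield the same pass/fail decision, integrated over the true-score distribution. All parameter values below are hypothetical illustrations, not the article's data, and the quadrature approach is an assumption for this sketch.

```python
import math

def binom_sf(n, p, c):
    """P(X >= c) for X ~ Binomial(n, p) -- probability of passing cut score c."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c, n + 1))

def classification_consistency(n_items, a, b, cut, n_grid=2000):
    """Marginal classification consistency under a beta-binomial model.

    True proportion-correct pi ~ Beta(a, b); scores on two parallel forms
    are conditionally independent Binomial(n_items, pi) given pi.
    Integrates P(same decision on both forms | pi) over the beta density
    by midpoint quadrature on a grid (hypothetical parameters throughout).
    """
    beta_const = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    total = 0.0
    for i in range(n_grid):
        pi = (i + 0.5) / n_grid
        dens = pi**(a - 1) * (1 - pi)**(b - 1) / beta_const
        p_pass = binom_sf(n_items, pi, cut)
        # Same decision: both pass or both fail.
        cond_consistency = p_pass**2 + (1 - p_pass)**2
        total += dens * cond_consistency / n_grid
    return total
```

For example, `classification_consistency(20, 8, 4, 12)` gives the probability of a consistent decision on a 20-item test with cut score 12 under a Beta(8, 4) true-score distribution; because the conditional consistency p² + (1 − p)² is never below 0.5, the index always lies between 0.5 and 1.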
