Abstract
This study compares the efficacy of different strategies for translating item-level, proportion-correct standard-setting judgments into a θ-metric test cutoff score for use with item response theory (IRT) scoring, using Monte Carlo methods. Simulated Angoff-type ratings, consisting of 1,000 independent 75 Item × 13 Rater matrices, were generated at five points along the θ continuum, at three levels of rater fit to the item characteristic curves, yielding 14,625,000 ratings as the basis of the analyses. These simulated proportion-correct ratings were converted to the IRT θ scale using test-level and item-level methods explicated by Kane (1987). Kane's optimally weighted, item-level conversion method initially produced anomalous results; however, imposing a restriction on the weights was found to avoid these anomalies and to render the optimally weighted method the most statistically efficient. Six areas for future research are outlined for advancing the integration of these classical standard-setting ratings into IRT methodology.
