Abstract
This article introduces a new test-centered standard-setting method, together with a procedure for detecting intrajudge inconsistency in the method. The standard-setting method, which is based on interdependent evaluations of alternative responses, has judges closely evaluate the process that examinees use to solve multiple-choice items. The new method is compared with existing methods, particularly the Nedelsky and Angoff methods. Empirical results from three different experiments confirm the hypothesis that standards set by the new method are higher than those set by the Nedelsky method but lower than those set by the Angoff method. The procedure for detecting intrajudge inconsistency is based on residual diagnosis of the judgments, which makes it possible to trace inconsistencies to items, response alternatives, and/or judges. An empirical application of the procedure in an experiment with the new standard-setting method suggests that the method is internally consistent; it also reveals an interesting difference between the residuals for the correct and incorrect alternatives.