Interjudge Reliability and Decision Reproducibility

Abstract

The purpose of this article is to discuss the importance of decision reproducibility for performance assessments. When decisions from two judges about a student's performance using comparable tasks correlate, decisions have been considered reproducible. However, when judges differ in expectations and tasks differ in difficulty, decisions may not be independent of the particular judges or tasks encountered unless appropriate adjustments for the observable differences are made. In this study, data were analyzed with the Facets model and provided evidence that judges grade differently, whether or not the scores given correlate well. This outcome suggests that adjustments for differences among judge severities should be made before student measures are estimated to produce reproducible decisions for certification, achievement, or promotion.
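The abstract's central point, that scores from two judges can correlate well while the judges still grade differently, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the ratings, the pass mark, and the variable names are hypothetical and are not taken from the study. The adjustment shown is a crude raw-score shift by the difference in judge means, not the logit-scale estimation performed by the Facets model.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def count_disagreements(scores_a, scores_b, pass_mark):
    """Number of students whose pass/fail decision differs between judges."""
    return sum((a >= pass_mark) != (b >= pass_mark)
               for a, b in zip(scores_a, scores_b))


# Hypothetical ratings of the same five students by two judges.
# The severe judge ranks students identically but scores 2 points lower.
lenient = [3, 4, 5, 6, 7]
severe = [s - 2 for s in lenient]
PASS_MARK = 5  # hypothetical cut score

r = pearson(lenient, severe)
disagreements = count_disagreements(lenient, severe, PASS_MARK)

# Crude adjustment (an assumption, not the Facets procedure): shift the
# severe judge's scores by the difference between the two judges' means.
shift = mean(lenient) - mean(severe)
severe_adjusted = [s + shift for s in severe]
disagreements_adjusted = count_disagreements(lenient, severe_adjusted, PASS_MARK)

print(f"correlation: {r:.2f}")
print(f"pass/fail disagreements before adjustment: {disagreements}")
print(f"pass/fail disagreements after adjustment: {disagreements_adjusted}")
```

In this toy data the two judges correlate perfectly, yet two of the five students would pass with one judge and fail with the other. Once the severity difference is removed, the decisions agree, which is the sense in which a decision becomes reproducible regardless of the judge encountered.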
