Abstract
A method for measuring interrater agreement on checklists is presented. Rather than assigning individual scores to raters, the technique computes a single agreement score from the concordance of their check mark configurations. An overall coefficient of agreement, called phi, is derived. The agreement coefficient expected by chance and the statistical significance of phi are determined by statistical simulation. Although checklist agreement is dichotomous (raters either agree or disagree on each item), we show that the binomial distribution does not provide a valid test of the statistical significance of phi. A medical education study illustrates the phi methodology.
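The simulation approach described in the abstract can be sketched as a Monte Carlo procedure: compute an agreement score from the observed check mark configurations, then repeatedly shuffle each rater's checks across items to build a null distribution, from which both the chance-expected agreement and a p-value follow. The sketch below is illustrative only: it assumes a simple stand-in definition of phi (the proportion of items on which all raters agree) and a shuffle-based null that preserves each rater's number of checks; the paper's actual coefficient and null model may differ.

```python
import random

def phi_agreement(marks):
    """Hypothetical stand-in for phi: the proportion of items on which
    all raters place the same mark (the paper's definition may differ).
    `marks` is a list of per-rater lists of 0/1 check marks."""
    n_items = len(marks[0])
    agreements = sum(
        1 for i in range(n_items)
        if len({rater[i] for rater in marks}) == 1
    )
    return agreements / n_items

def simulate_null(marks, n_sims=10_000, seed=0):
    """Monte Carlo null distribution: shuffle each rater's checks across
    items (preserving that rater's total number of checks), recompute the
    agreement score, and repeat. Returns the observed score, the mean
    chance-level score, and a one-sided p-value."""
    rng = random.Random(seed)
    observed = phi_agreement(marks)
    null_scores = []
    for _ in range(n_sims):
        shuffled = []
        for rater in marks:
            permuted = list(rater)
            rng.shuffle(permuted)
            shuffled.append(permuted)
        null_scores.append(phi_agreement(shuffled))
    chance = sum(null_scores) / n_sims
    p_value = sum(1 for s in null_scores if s >= observed) / n_sims
    return observed, chance, p_value
```

Because the null scores are generated by permutation rather than assumed to follow a binomial law, the p-value reflects the dependence among items induced by each rater's marking pattern, which is the kind of structure the abstract says a binomial test cannot accommodate.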