A Generalization of Cohen's Kappa Agreement Measure to Interval Measurement and Multiple Raters

Abstract

Cohen's kappa statistic is frequently used to measure agreement between two observers employing categorical polytomies. In this paper, Cohen's statistic is shown to be inherently multivariate in nature; it is expanded to analyze ordinal and interval data; and it is extended to more than two observers. A nonasymptotic test of significance is provided for the generalized statistic.
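
As background, the classical two-rater, categorical form of kappa that the abstract takes as its starting point compares the observed proportion of agreement with the agreement expected by chance from each observer's marginal proportions. The sketch below is a minimal illustration of that classical form only. It is not the generalized statistic or the nonasymptotic test developed in the paper, and the function name `cohen_kappa` and the sample ratings are hypothetical.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Classical two-rater Cohen's kappa for categorical labels (illustrative)."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Observed proportion of subjects on which the two raters agree.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: sum over categories of the product of marginal proportions.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical ratings of four subjects by two observers.
rater_1 = ["x", "x", "y", "y"]
rater_2 = ["x", "y", "y", "y"]
kappa = cohen_kappa(rater_1, rater_2)
print(kappa)
```

Because this form scores every disagreement equally, it carries no notion of distance between ordinal or interval ratings and handles only two observers, which are the limitations the abstract says the generalization addresses.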
