Cohen's kappa statistic is frequently used to measure agreement between two observers employing categorical polytomies. In this paper, Cohen's statistic is shown to be inherently multivariate in nature; it is expanded to analyze ordinal and interval data; and it is extended to more than two observers. A nonasymptotic test of significance is provided for the generalized statistic.
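For orientation, the classical two-observer kappa for nominal categories, which this paper takes as its starting point, can be sketched as below. This is a minimal illustration of the standard chance-corrected agreement formula, kappa = (p_o - p_e) / (1 - p_e), and not the generalized multivariate statistic or the nonasymptotic test developed in the paper. The function name `cohen_kappa` and the sample ratings are hypothetical.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Classical Cohen's kappa for two observers rating the same items.

    Illustrative sketch only: nominal categories, two observers,
    no weighting and no significance test.
    """
    if len(ratings_a) != len(ratings_b) or len(ratings_a) == 0:
        raise ValueError("both observers must rate the same non-empty set of items")

    n = len(ratings_a)

    # Observed proportion of items on which the two observers agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Agreement expected by chance, from each observer's marginal frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")

    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings of four items by two observers.
observer_1 = ["x", "x", "y", "y"]
observer_2 = ["x", "y", "y", "y"]
kappa = cohen_kappa(observer_1, observer_2)
```

Here the observers agree on three of four items (p_o = 0.75), chance agreement from the marginals is p_e = 0.5, and so kappa = 0.5.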