A new procedure for computing weighted kappa with multiple raters is described, together with a resampling procedure for obtaining approximate probability values for the statistic. Application of weighted kappa is illustrated with an example analysis of classifications by three independent raters.
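The sketch below illustrates the general approach described above, assuming NumPy, a pairwise-average formulation of multi-rater weighted kappa, and a resampling null distribution generated by independently permuting each rater's classifications (which fixes every rater's marginal frequencies). It is a minimal illustration under those assumptions, not the authors' exact algorithm; all function names here are hypothetical.

```python
# Illustrative sketch only: multi-rater weighted kappa taken as the mean
# of all pairwise weighted kappas, with a permutation resampling P-value.
# This is an assumed formulation, not the article's exact procedure.
import itertools
import numpy as np

def weighted_kappa(ratings, weights):
    """Weighted kappa averaged over all pairs of raters.

    ratings : (n_raters, n_subjects) integer array of category indices 0..k-1
    weights : (k, k) disagreement weights with zeros on the diagonal
    """
    n_raters, n_subjects = ratings.shape
    k = weights.shape[0]
    pairs = list(itertools.combinations(range(n_raters), 2))
    # Observed disagreement: mean weight attached to each pair's classifications.
    d_obs = np.mean([weights[ratings[i], ratings[j]].mean() for i, j in pairs])
    # Expected disagreement under independence, from each rater's marginals.
    marginals = [np.bincount(ratings[r], minlength=k) / n_subjects
                 for r in range(n_raters)]
    d_exp = np.mean([marginals[i] @ weights @ marginals[j] for i, j in pairs])
    return 1.0 - d_obs / d_exp

def resampled_p_value(ratings, weights, n_resamples=10000, seed=0):
    """Approximate P(kappa >= observed) by shuffling each rater's ratings,
    which preserves every rater's marginal frequencies under the null."""
    rng = np.random.default_rng(seed)
    observed = weighted_kappa(ratings, weights)
    shuffled = ratings.copy()
    hits = 0
    for _ in range(n_resamples):
        for row in shuffled:               # independent permutation per rater
            rng.shuffle(row)
        hits += weighted_kappa(shuffled, weights) >= observed
    return observed, hits / n_resamples

# Hypothetical data: three raters classify 12 subjects into k = 3 ordered
# categories; linear disagreement weights w_ij = |i - j|.
ratings = np.array([[0, 1, 2, 1, 0, 2, 1, 1, 0, 2, 2, 1],
                    [0, 1, 2, 1, 1, 2, 1, 0, 0, 2, 1, 1],
                    [0, 2, 2, 1, 0, 2, 1, 1, 0, 2, 2, 0]])
weights = np.abs(np.subtract.outer(np.arange(3), np.arange(3)))
kappa, p = resampled_p_value(ratings, weights, n_resamples=5000)
print(f"weighted kappa = {kappa:.3f}, resampling P-value = {p:.4f}")
```

Shuffling within each rater, rather than enumerating all permutations, is what makes the probability value approximate: its precision is governed by the number of resamplings.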