Abstract
Fleiss’s Kappa is an extension of Cohen’s Kappa, developed to assess the degree of interrater agreement among multiple raters or methods classifying subjects using categorical scales. Like Cohen’s Kappa, it adjusts the observed proportion of agreement to account for agreement expected by chance. However, over time, several paradoxes and interpretative challenges have been identified, largely stemming from the assumption of random chance agreement and the sensitivity of the coefficient to the number of raters. Interpreting Fleiss’s Kappa can be particularly difficult because of its dependence on the distribution of categories and on prevalence patterns. This paper argues that a portion of the observed agreement may be better explained by the interaction between category prevalence and inherent category characteristics, such as ambiguity, appeal, or social desirability, rather than by chance alone. By shifting away from the assumption of random rater assignment, the paper introduces a novel agreement coefficient that adjusts the expected agreement to account for category prevalence, providing a more accurate measure of interrater reliability in the presence of imbalanced category distributions. It also examines the theoretical justification for this new measure, its interpretability, its standard error, and the robustness of its estimates in simulations and practical applications.
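For orientation, the standard chance-corrected form of Fleiss’s Kappa that the abstract refers to is sketched below; the notation is supplied here only as background and is not drawn from the paper’s own development.

\[
\kappa \;=\; \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e},
\qquad
\bar{P} \;=\; \frac{1}{N}\sum_{i=1}^{N}\frac{1}{n(n-1)}\!\left(\sum_{j=1}^{k} n_{ij}^{2} - n\right),
\qquad
\bar{P}_e \;=\; \sum_{j=1}^{k} p_j^{2},
\]

where \(N\) is the number of subjects, \(n\) the number of raters per subject, \(k\) the number of categories, \(n_{ij}\) the number of raters assigning subject \(i\) to category \(j\), and \(p_j = \tfrac{1}{Nn}\sum_{i} n_{ij}\) the marginal proportion of ratings in category \(j\). Under this formulation the expected agreement \(\bar{P}_e\) is determined entirely by the marginals under an assumption of random assignment, which is the assumption the proposed prevalence-adjusted coefficient relaxes.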
