Abstract
Inter-rater reliability is commonly assessed using chance-corrected agreement coefficients such as Cohen’s κ, which summarize concordance among categorical judgments without modeling the inferential processes that generate them. As a result, κ is sensitive to prevalence imbalance, task difficulty, and heterogeneity in decision criteria, and it is often misinterpreted as a proxy for diagnostic accuracy or rater competence. This paper reframes inter-rater reliability within a signal detection–theoretic (SDT) framework in which categorical judgments arise from comparisons between latent continuous evidence and rater-specific decision thresholds. Within this generative model, κ can be interpreted as a bounded transformation of discrete strategic variance (i.e., the observable consequence of dispersion in latent decision criteria) rather than as a direct measure of epistemic alignment. To make this structure explicit, we introduce the Strategic Convergence Index (SCI), a normalized functional summarizing convergence in rater decision thresholds under an SDT generative process. SCI is not proposed as a standalone agreement coefficient but as a model-implied quantity whose interpretation depends on explicit assumptions about evidence distributions and decision rules. Monte Carlo simulations show that κ varies systematically with prevalence and perceptual discriminability even when decision-policy alignment is held constant, whereas SCI selectively tracks epistemic alignment and remains invariant to these factors. Supplementary model-based analyses further illustrate that SCI can be recovered as a stable system-level property even under latent-truth uncertainty, whereas individual thresholds may be weakly identified. Together, these results clarify the epistemic meaning of κ and motivate a decomposition of inter-rater reliability into outcome-level agreement and process-level alignment. By linking classical agreement statistics to an explicit generative model of judgment, the Strategic Convergence framework advances reliability assessment from description toward explanation.
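The abstract's central empirical claim, that κ shifts with prevalence even when raters' decision policies are held fixed, can be illustrated with a minimal Monte Carlo sketch. The sketch below assumes an equal-variance Gaussian SDT generative process in which two raters observe the same latent evidence and differ only in fixed decision criteria; the specific values of d′, the criteria, and the shared-evidence simplification are illustrative assumptions, not the authors' simulation design, and no attempt is made to implement SCI itself, whose functional form the abstract does not specify.

```python
import numpy as np

def cohens_kappa(a, b):
    """Cohen's kappa for two binary rating vectors."""
    a, b = np.asarray(a), np.asarray(b)
    po = np.mean(a == b)                    # observed agreement
    pa, pb = a.mean(), b.mean()             # marginal "positive" rates
    pe = pa * pb + (1 - pa) * (1 - pb)      # chance-expected agreement
    return (po - pe) / (1 - pe)

def simulate_kappa(prevalence, d_prime=1.5, c1=0.70, c2=0.80,
                   n=200_000, seed=0):
    """Equal-variance Gaussian SDT: evidence ~ N(0, 1) on noise trials and
    N(d', 1) on signal trials. Each rater responds "positive" when the shared
    evidence exceeds her own fixed criterion, so the dispersion in decision
    thresholds (|c1 - c2|) is held constant across prevalence conditions."""
    rng = np.random.default_rng(seed)
    signal = rng.random(n) < prevalence
    evidence = rng.normal(loc=np.where(signal, d_prime, 0.0), scale=1.0)
    return cohens_kappa(evidence > c1, evidence > c2)

# kappa changes with base rate even though the decision policies never change
for p in (0.05, 0.20, 0.50):
    print(f"prevalence = {p:.2f}   kappa = {simulate_kappa(p):.3f}")
```

Under these assumptions, κ rises and falls with the base rate of signal trials although the two criteria, and hence the raters' policy alignment, are identical in every condition, which is the dissociation the abstract attributes to outcome-level agreement versus process-level alignment.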
