Sequentially Determined Measures of Interobserver Agreement (Kappa) in Clinical Trials May Vary Independent of Changes in Observer Performance

Abstract

Background:

Cohen's kappa is a statistic that estimates chance-corrected interobserver agreement. It was originally introduced to help develop diagnostic tests, where the interpretative readings of 2 observers (for example, of a mammogram or other imaging study) were compared at a single point in time. Kappa is known to depend on the prevalence of disease, so kappas obtained in different settings are hard to compare.
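The prevalence dependence mentioned above can be illustrated with a minimal sketch. It is not taken from the article: the function names, the assumption that the 2 observers read independently given true disease status, and the sensitivity and specificity of 0.9 are all hypothetical choices made for illustration.

```python
def cohen_kappa(table):
    """Cohen's kappa from a square table of joint rating probabilities or counts."""
    n = len(table)
    total = sum(sum(row) for row in table)
    p_observed = sum(table[i][i] for i in range(n)) / total
    row_marg = [sum(table[i]) / total for i in range(n)]
    col_marg = [sum(table[i][j] for i in range(n)) / total for j in range(n)]
    # Agreement expected by chance from the two observers' marginal rates.
    p_chance = sum(row_marg[i] * col_marg[i] for i in range(n))
    return (p_observed - p_chance) / (1 - p_chance)


def two_reader_table(prevalence, sensitivity, specificity):
    """Expected 2x2 table for two observers with identical accuracy.

    Assumption (hypothetical): readings are independent given true status.
    """
    pos_if_diseased = sensitivity
    pos_if_healthy = 1 - specificity
    both_pos = (prevalence * pos_if_diseased ** 2
                + (1 - prevalence) * pos_if_healthy ** 2)
    both_neg = (prevalence * (1 - pos_if_diseased) ** 2
                + (1 - prevalence) * (1 - pos_if_healthy) ** 2)
    discordant = (prevalence * pos_if_diseased * (1 - pos_if_diseased)
                  + (1 - prevalence) * pos_if_healthy * (1 - pos_if_healthy))
    return [[both_pos, discordant], [discordant, both_neg]]


# Same observer accuracy in every row; only the prevalence changes.
for prevalence in (0.05, 0.20, 0.50):
    kappa = cohen_kappa(two_reader_table(prevalence, 0.9, 0.9))
    print(f"prevalence={prevalence:.2f}  kappa={kappa:.3f}")
```

Under these assumptions the raw agreement is the same at every prevalence, but the chance-expected agreement rises as the disease becomes rarer, so kappa falls even though neither observer has changed.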

Methods:

Using simulation, we examine an analogous situation, not previously described, that occurs in clinical trials where sequential measurements are obtained to evaluate disease progression or clinical improvement over time.

Results:

We show that weighted kappa, used for multilevel outcomes, changes over the course of a trial even when observer performance is held constant.
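A minimal sketch of how such an effect can arise is given below. It is not the authors' simulation: the 4-level score, the off-by-one error model, the value `p_correct=0.7`, and the two score distributions (a "baseline" visit restricted to high scores, as entry criteria might impose, and a "later" visit with scores spread evenly) are hypothetical assumptions. Expected probabilities are computed directly rather than sampled, so the result is deterministic.

```python
def reading_probs(true_score, n_levels, p_correct):
    """Probability of each reported score given the true score.

    Hypothetical error model: the observer is correct with p_correct and is
    otherwise off by one level, split evenly over the neighbours that exist.
    """
    neighbours = [s for s in (true_score - 1, true_score + 1)
                  if 0 <= s < n_levels]
    probs = [0.0] * n_levels
    probs[true_score] = p_correct
    for s in neighbours:
        probs[s] = (1 - p_correct) / len(neighbours)
    return probs


def joint_table(true_dist, p_correct):
    """Expected joint table for two observers reading independently."""
    n = len(true_dist)
    table = [[0.0] * n for _ in range(n)]
    for true_score, p_true in enumerate(true_dist):
        r = reading_probs(true_score, n, p_correct)
        for i in range(n):
            for j in range(n):
                table[i][j] += p_true * r[i] * r[j]
    return table


def weighted_kappa(table, power=2):
    """Weighted kappa with disagreement weights (|i-j|/(n-1))**power."""
    n = len(table)
    total = sum(map(sum, table))
    row = [sum(table[i]) / total for i in range(n)]
    col = [sum(table[i][j] for i in range(n)) / total for j in range(n)]
    observed = expected = 0.0
    for i in range(n):
        for j in range(n):
            w = (abs(i - j) / (n - 1)) ** power
            observed += w * table[i][j] / total
            expected += w * row[i] * col[j]
    return 1 - observed / expected


# Identical observer model at both visits; only the case mix differs.
baseline = weighted_kappa(joint_table([0.0, 0.0, 0.5, 0.5], p_correct=0.7))
later = weighted_kappa(joint_table([0.25, 0.25, 0.25, 0.25], p_correct=0.7))
print(f"baseline visit: {baseline:.3f}   later visit: {later:.3f}")
```

With these illustrative inputs the quadratic-weighted kappa is roughly 0.32 at the restricted baseline visit and roughly 0.77 once scores are spread across all levels, although the observer error model is unchanged. The magnitude depends entirely on the assumed distributions and should not be read as a result from the article.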

Conclusions:

Kappa and closely related measures can therefore be used only with great difficulty, if at all, for quality assurance in clinical trials.

Keywords

clinical trials, interobserver agreement, Cohen kappa, repeated measures, biased estimator, simulation, central reading

