Abstract
When evaluating a new diagnostic test against a less-than-perfect "gold standard," the kappa coefficient of agreement κ is often inappropriately used as a measure of "diagnostic accuracy," which frequently leads to paradoxical findings. In this paper, κ is expressed as a function of disease prevalence and diagnostic accuracy (subject to Youden's index > 0), whereby necessary and sufficient conditions, given the accuracy rates, are derived to aid in locating the maximizer of κ. Paradoxical behavior of κ can thus be detected in light of diagnostic accuracy. Attempts are made to clarify the subtle difference between "diagnostic accuracy" and "diagnostic reliability." The implication of this difference is then assessed from a regulatory perspective. To extend the idea of κ beyond its originally intended use, the maximum likelihood method, coupled with the Expectation-Maximization algorithm, is proposed as a remedial option, not for measuring diagnostic agreement or reliability but rather for evaluating diagnostic accuracy. Some illustrative examples adapted from published data are provided.
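The prevalence dependence described above can be sketched numerically. The following is a minimal illustration (not the paper's own derivation) that treats the reference standard as perfect and computes Cohen's κ between a binary test and that reference from prevalence, sensitivity, and specificity; holding accuracy fixed while varying prevalence exhibits the paradox in which κ falls even though the accuracy rates do not change.

```python
def kappa(prev, se, sp):
    """Cohen's kappa between a binary test and a perfect reference.

    prev: disease prevalence
    se, sp: sensitivity and specificity of the test
    (illustrative values; not from the paper's data)
    """
    po = prev * se + (1 - prev) * sp           # observed agreement
    q = prev * se + (1 - prev) * (1 - sp)      # marginal P(test positive)
    pe = prev * q + (1 - prev) * (1 - q)       # chance agreement from marginals
    return (po - pe) / (1 - pe)

# Same accuracy (se = sp = 0.9, Youden's index = 0.8) at two prevalences:
for p in (0.05, 0.50):
    print(f"prevalence={p:.2f}  kappa={kappa(p, 0.9, 0.9):.3f}")
# kappa is about 0.43 at 5% prevalence but 0.80 at 50%,
# even though sensitivity and specificity are unchanged.
```

This is why κ, which depends on the marginal distributions, behaves as a measure of agreement rather than of accuracy: the same sensitivity and specificity yield very different κ values across populations.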
