Assessing Rater Performance without a "Gold Standard" Using Consensus Theory

Abstract

This study illustrates the use of consensus theory to assess the diagnostic performances of raters and to estimate case diagnoses in the absence of a criterion or "gold" standard. A description is provided of how consensus theory "pools" information provided by raters, estimating rater competencies and differentially weighting their responses. Although the model assumes that raters respond without bias (i.e., sensitivity = specificity), a Monte Carlo simulation with 1,200 data sets shows that model estimates appear to be robust even with bias. The model is illustrated on a set of elbow radiographs, and consensus-model estimates are compared with those obtained from follow-up data. Results indicate that with high rater competencies, the model retrieves accurate estimates of competency and case diagnoses even when raters' responses are biased. Key words: clinical competence; interobserver variation; diagnostic evaluation; models, mathematical; consensus theory. (Med Decis Making 1997;17:71-79)
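The pooling described above can be sketched in a few lines. This is a minimal illustration, not the authors' software: it assumes the dichotomous consensus model with unbiased guessing, in which a rater with competency D answers correctly with probability (1 + D)/2, so that the expected agreement between raters i and j is (1 + D_i D_j)/2. The function names, the iterated least-squares fit, and the simulated data are all assumptions made for this example.

```python
import numpy as np

def estimate_competencies(responses, n_iter=100):
    """Estimate rater competencies from a raters x cases array of 0/1 responses."""
    X = np.asarray(responses)
    n_raters = X.shape[0]
    # Proportion of cases on which each pair of raters gives the same response.
    match = (X[:, None, :] == X[None, :, :]).mean(axis=2)
    # Under unbiased guessing, E[match_ij] = (1 + D_i * D_j) / 2.
    m_star = 2.0 * match - 1.0
    # Fit the off-diagonal of m_star with a single factor d d^T,
    # updating one rater at a time by least squares (an assumed fitting method).
    off_diagonal = ~np.eye(n_raters, dtype=bool)
    d = np.full(n_raters, 0.5)
    for _ in range(n_iter):
        for i in range(n_raters):
            others = off_diagonal[i]
            d[i] = m_star[i, others] @ d[others] / (d[others] @ d[others])
    # Keep estimates strictly inside (0, 1) so the log-odds weights stay finite.
    return np.clip(d, 0.01, 0.99)

def consensus_diagnoses(responses, d, prior=0.5):
    """Posterior probability that each case is positive, weighting raters by competency."""
    X = np.asarray(responses)
    p_correct = (1.0 + d) / 2.0
    weights = np.log(p_correct / (1.0 - p_correct))  # log-odds weight per rater
    signed = 2 * X - 1                               # +1 for "positive", -1 for "negative"
    log_odds = np.log(prior / (1.0 - prior)) + weights @ signed
    return 1.0 / (1.0 + np.exp(-log_odds))

# Simulated example with hypothetical competencies (not data from the study).
rng = np.random.default_rng(0)
true_d = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.9, 0.8, 0.7])
n_cases = 400
truth = rng.integers(0, 2, size=n_cases)
knows = rng.random((true_d.size, n_cases)) < true_d[:, None]
guess = rng.integers(0, 2, size=(true_d.size, n_cases))
responses = np.where(knows, truth, guess)

d_hat = estimate_competencies(responses)
posterior = consensus_diagnoses(responses, d_hat)
accuracy = ((posterior > 0.5) == truth).mean()
print("estimated competencies:", np.round(d_hat, 2))
print("proportion of cases classified correctly:", accuracy)
```

The simulation here generates unbiased responses only. Checking robustness to bias, as the Monte Carlo study does, would require drawing guesses with unequal probabilities for the two response options.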
