Permutation procedures to compute exact and resampling probability values for weighted kappa are described. Comparisons with asymptotic probability values demonstrate that exact permutation procedures are advantageous for sparse data sets, whereas resampling permutation procedures are appropriate for both sparse and nonsparse data sets.
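A resampling probability value of the kind described above can be sketched as follows. This is a minimal illustration, not necessarily the procedure used in the article: it assumes two raters, categories coded 0 to k - 1, and disagreement weights of the form |i - j| raised to a power (1 for linear, 2 for quadratic). The function names `weighted_kappa` and `resampling_p_value` are hypothetical. Shuffling one rater's ratings keeps both sets of marginal totals fixed, and the probability value is estimated as the proportion of shuffles whose weighted kappa is at least as large as the observed value.

```python
import random


def weighted_kappa(a, b, k, power=1):
    """Weighted kappa for two raters' labels coded 0..k-1.

    Assumed disagreement weights: |i - j| ** power
    (power=1 linear, power=2 quadratic).
    """
    n = len(a)
    # Observed weighted disagreement, summed over the paired ratings.
    observed = sum(abs(x - y) ** power for x, y in zip(a, b))
    # Marginal category counts for each rater.
    row = [0] * k
    col = [0] * k
    for x, y in zip(a, b):
        row[x] += 1
        col[y] += 1
    # Weighted disagreement expected by chance, given the marginals.
    # This is zero (and kappa undefined) if both raters use one category only.
    expected = sum(abs(i - j) ** power * row[i] * col[j]
                   for i in range(k) for j in range(k)) / n
    return 1.0 - observed / expected


def resampling_p_value(a, b, k, n_resamples=10000, power=1, seed=0):
    """Estimate the upper-tail probability value of weighted kappa by
    randomly re-pairing the second rater's ratings with the first's."""
    rng = random.Random(seed)
    observed = weighted_kappa(a, b, k, power)
    shuffled = list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(shuffled)  # marginal totals are unchanged by shuffling
        if weighted_kappa(a, shuffled, k, power) >= observed:
            hits += 1
    return observed, hits / n_resamples


if __name__ == "__main__":
    rater1 = [0, 0, 1, 1, 2, 2, 2, 1, 0, 2, 1, 0]
    rater2 = [0, 1, 1, 1, 2, 2, 1, 1, 0, 2, 2, 0]
    kappa, p = resampling_p_value(rater1, rater2, k=3, n_resamples=5000)
    print(f"weighted kappa = {kappa:.3f}, resampling p = {p:.4f}")
```

An exact probability value would replace the random shuffles with a full enumeration of the possible re-pairings, which is only practical when the number of ratings is small.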