Abstract
Interrater reliability (IRR) assesses the stability of a coding protocol over time and across coders. For practical reasons, it is often difficult to assess IRR for an entire dataset, so researchers sometimes calculate IRR for a subset of the total data sample. The purpose of this study is to investigate the accuracy of such subset IRRs. Using bootstrapping, we determined the effects of sample size (10%, 25%, & 40% of the total dataset) and IRR measure type (percent agreement, Krippendorff’s alpha, & the G Index) on the bias and percent error of subset IRRs. Results support the practice of calculating IRR from subsets of the total data sample, though we discuss how the accuracy of subset IRR values may depend on aspects of the dataset, such as total sample size and coding methodology.
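The procedure the abstract describes can be sketched in code. The following is a minimal illustration, not the authors' implementation: it computes the three IRR measures named above for two coders on nominal data (Krippendorff's alpha in its simplest form, with no missing values), then bootstraps subset IRR values at a chosen fraction of the dataset. The function names and the uniform-chance assumption in the G index are ours; in practice the number of categories comes from the coding scheme, not the observed data.

```python
import random
from collections import Counter

def percent_agreement(pairs):
    """Proportion of units on which the two coders assign the same code."""
    return sum(a == b for a, b in pairs) / len(pairs)

def g_index(pairs, n_categories=None):
    """G index: chance-corrects percent agreement with a uniform chance
    rate of 1/k for k categories. k is inferred from the data here as a
    fallback; normally it is fixed by the coding protocol."""
    if n_categories is None:
        n_categories = len({v for pair in pairs for v in pair})
    p_o = percent_agreement(pairs)
    p_c = 1.0 / n_categories
    return (p_o - p_c) / (1.0 - p_c)

def krippendorff_alpha_nominal(pairs):
    """Krippendorff's alpha for two coders, nominal data, no missing
    values: alpha = 1 - D_o / D_e, with expected disagreement computed
    from the category marginals pooled across both coders."""
    n_units = len(pairs)
    values = [v for pair in pairs for v in pair]
    total = len(values)                       # 2 codes per unit
    marg = Counter(values)                    # pooled category counts
    d_o = sum(a != b for a, b in pairs) / n_units
    d_e = sum(marg[c] * marg[k]
              for c in marg for k in marg if c != k) / (total * (total - 1))
    if d_e == 0:                              # only one category observed
        return 1.0
    return 1.0 - d_o / d_e

def bootstrap_subset_irr(pairs, fraction, measure, n_boot=1000, seed=0):
    """Draw bootstrap subsets of the given fraction of units (sampling
    with replacement) and return the list of subset IRR values, which
    can then be compared against the full-sample IRR."""
    rng = random.Random(seed)
    m = max(1, round(fraction * len(pairs)))
    return [measure([rng.choice(pairs) for _ in range(m)])
            for _ in range(n_boot)]
```

Comparing the distribution of `bootstrap_subset_irr(data, 0.25, percent_agreement)` against `percent_agreement(data)` gives the bias and percent error of the 25% subset, which is the kind of comparison the study reports across the three subset sizes and three measures.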