Abstract
Interrater reliability (IRR) assesses the stability of a coding protocol over time and across coders. For practical reasons, it is often difficult to assess IRR for an entire dataset, so researchers sometimes calculate IRR for a subset of the total data sample. The purpose of this study is to investigate the accuracy of such subset IRRs. Using bootstrapping, we determined the effects of sample size (10%, 25%, & 40% of the total dataset) and IRR measure type (percent agreement, Krippendorff’s alpha, & the G Index) on the bias and percent error of subset IRRs. Results support the practice of calculating IRR from subsets of the total data sample, though we discuss how the accuracy of subset IRR values may depend on aspects of the dataset, such as total sample size and coding methodology.
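The procedure the abstract describes can be sketched in code. The following is a minimal illustration, not the authors' implementation: it computes the three IRR measures named above for two coders on nominal data (Krippendorff's alpha in its simplest form, with no missing values), then bootstraps subset IRR values at a chosen fraction of the dataset. The function names and the uniform-chance assumption in the G index are ours; in practice the number of categories comes from the coding scheme, not the observed data.

```python
import random
from collections import Counter

def percent_agreement(pairs):
    """Proportion of units on which the two coders assign the same code."""
    return sum(a == b for a, b in pairs) / len(pairs)

def g_index(pairs, n_categories=None):
    """G index: chance-corrects percent agreement with a uniform chance
    rate of 1/k for k categories. k is inferred from the data here as a
    fallback; normally it is fixed by the coding protocol."""
    if n_categories is None:
        n_categories = len({v for pair in pairs for v in pair})
    p_o = percent_agreement(pairs)
    p_c = 1.0 / n_categories
    return (p_o - p_c) / (1.0 - p_c)

def krippendorff_alpha_nominal(pairs):
    """Krippendorff's alpha for two coders, nominal data, no missing
    values: alpha = 1 - D_o / D_e, with expected disagreement computed
    from the category marginals pooled across both coders."""
    n_units = len(pairs)
    values = [v for pair in pairs for v in pair]
    total = len(values)                       # 2 codes per unit
    marg = Counter(values)                    # pooled category counts
    d_o = sum(a != b for a, b in pairs) / n_units
    d_e = sum(marg[c] * marg[k]
              for c in marg for k in marg if c != k) / (total * (total - 1))
    if d_e == 0:                              # only one category observed
        return 1.0
    return 1.0 - d_o / d_e

def bootstrap_subset_irr(pairs, fraction, measure, n_boot=1000, seed=0):
    """Draw bootstrap subsets of the given fraction of units (sampling
    with replacement) and return the list of subset IRR values, which
    can then be compared against the full-sample IRR."""
    rng = random.Random(seed)
    m = max(1, round(fraction * len(pairs)))
    return [measure([rng.choice(pairs) for _ in range(m)])
            for _ in range(n_boot)]
```

Comparing the distribution of `bootstrap_subset_irr(data, 0.25, percent_agreement)` against `percent_agreement(data)` gives the bias and percent error of the 25% subset, which is the kind of comparison the study reports across the three subset sizes and three measures.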