Abstract
All research, whether qualitative or quantitative, is concerned with the extent to which analyses can adequately describe the phenomena they seek to describe. In qualitative research, we use internal validity checks like intercoder agreement to measure the extent to which independent researchers observe the same phenomena in data. Researchers report indices of agreement as evidence of the consistency and dependability of interpretations, and we do so to make claims about the trustworthiness of our research accounts. However, few studies report how multiple analysts developed alignment in their interpretation of data, a process that undergirds accounts of consistency, dependability, and trustworthiness. In this article, we review the issues and options around achieving intercoder agreement. Drawing on our experience from a longitudinal, team-based research project that required rapid cycles of qualitative data analysis, we reflect on the challenges we had achieving high intercoder agreement (which we refer to as the perils). It was through these challenges that we developed a method that helps to foster shared ways of seeing data, and thus alignment in our interpretations of phenomena in data. We present this method as a tool for dyadic and team-based qualitative data analysis to facilitate reliable and consistent high-inference interpretations of data with multiple analysts.
Keywords
Introduction
All research, whether qualitative or quantitative, postmodern or pragmatic, is concerned with the extent to which analyses can adequately describe the phenomena they seek to describe. Internal validity checks in qualitative research concern the extent to which independent researchers observe the same constructs and phenomena in data (LeCompte & Goetz, 1982). As a scholarly community, we report these internal checks when we disseminate our research, and in these reports, these metrics serve doubly as evidence of the trustworthiness of our interpretations, i.e., as metrics of the credibility, reliability, dependability, transferability, and confirmability of interpretations and findings (Guba, 1981).
As a solitary analyst, practices such as “showing the workings” of analyses serve the goal of making interpretations transparent in written reports of analyses, where the reader serves as a corroborator of these interpretations (e.g., Mehan, 1979; Mishler, 1990). In research with multiple analysts, measures of intercoder agreement serve as evidence of the corroboration of interpretations between analysts, and therefore evidence of trustworthiness. Developing shared ways of seeing data in research teams is central to claims made about trustworthiness. However, much of the scholarship that reports intercoder agreement often glosses over how multiple analysts developed alignment in their interpretations of data.
Qualitative data is rich and complex. Theoretical and conceptual tools employed to analyze qualitative data are comparably complex, e.g., levels of analysis, high levels of inference, interactions, and latent features of data like power, historicity, culture, and norms. In addition, analysts bring to any analytic task their subjectivities, knowledges, cultures, histories, experiences, and expertise. Further, team research, as an intersubjective process, can introduce dynamics of power and authority that can shape a collective analytic process (Naganathan et al., 2022). An additional complexity in the analysis of qualitative data is the variety of purposes and uses of analyses, such as theory building, or research to inform policy and practice. These different intentions of research may require different attunements in analysis in order to be responsive to various stakeholders (beyond the academy) such as research participants, communities, and policy makers. Thus, for a team of researchers to achieve shared ways of seeing data that lend themselves to high levels of intercoder reliability, in light of the aforementioned complexities (data, theory, analyst positionalities, and uses of analyses), is not a trivial task. Yet, most qualitative research reports seem to mask this complexity altogether.
In this article, we present a method for achieving intercoder agreement that is explicit about the process of how researchers can align their ways of seeing in order to interpret qualitative data reliably and consistently between coders. The method we present emerged from a large-scale, longitudinal, team-based project in public secondary schools in which we have a steady stream of science classroom discourse data that we code and then share with teachers so they can reflect on their facilitation of student learning. Our team of researchers is diverse in terms of expertise, experience with the context, and cultural and linguistic backgrounds, all potential challenges for achieving high intercoder agreement. When disagreements between coders became impassable, we sought to excavate the disagreements (aided by copious meeting notes) in order to understand precisely what contributed to coding disagreements and how we managed to resolve them. It was through this process that we developed a method that enabled the team to achieve greater agreement without compromising depth of inference (utilizing high-inference coding categories to classify rich qualitative data). We share insights from the process of achieving intercoder agreement and consider their implications for team-based research. We then consider the state of the field around techniques for achieving agreement, what these approaches afford for reliability, and what tradeoffs they introduce into the research process. We conclude with a novel method for achieving intercoder agreement in team-based research.
Approaches to Achieving Intercoder Agreement
Over the decades, there has been much debate about whether there is a need for conventions like intercoder agreement, given what the use of these conventions implies paradigmatically about the nature of qualitative inquiry (e.g., MacPhail et al., 2016; O’Connor & Joffe, 2020). We, the authors of this article, locate ourselves within an equally epistemic and pragmatic tension of qualitative research that necessitates, in equal measure, complex high-inference coding, the use of multiple coders, and reliable, consistent, and dependable interpretations of data. It is an epistemic tension insofar as we give primacy to the rich insights into data that qualitative methods uniquely supply. It is a pragmatic tension in terms of the scale, the need for speed, the use of multiple coders, and a research design in which our analyses are shared and verified with research participants. In addition, our analyses would be used to train machine learning models to code automatically. We argue that there is a need for methodological precision on core aspects of the research process that are the basis of trustworthiness, that is, how we achieve shared ways of seeing data between multiple coders.
Intercoder agreement concerns the consistency of interpretation of data. It is a tool used to establish the dependability of interpretations, and it thus stands as an indicator of their reliability (Guba, 1981). Much as the process of triangulation seeks to corroborate interpretations through multiple sources of convergent data, intercoder agreement seeks a similar type of convergence vis-à-vis multiple researchers. Often, agreement is arrived at through converging interpretations of the same phenomena in data, i.e., consensus. Using a coding scheme, agreement would mean that researchers agree that X data is an example of X phenomena, and Y data is an example of Y phenomena, in a representative sample of data from a dataset.
Researchers check intercoder agreement in several ways. Some researchers calculate the percentage of alignment in the categorization of a sample of data by multiple researchers. However, this approach has lost favor because it is statistically flawed: some observations of alignment are simply chance agreements. Thus, statistical approaches were developed and appropriated to account for chance agreements in the calculation of intercoder agreement, e.g., Cohen’s and Fleiss’s kappa (Banerjee et al., 1999). Qualitative researchers now commonly report these statistics as evidence of trustworthiness, often with very little additional information about the complex process of arriving at a coefficient that is accepted as ‘very good agreement’. In addition, this practice has become so normative that there is very little negotiation in research reports of the standards of what constitutes evidence of good agreement, despite these metrics being used with a wide variety of data, for a wide variety of research purposes and uses. It is worth considering how conventional thresholds for “very good agreement” are universally applied, and whether they ought to be, since dependability is not only in the service of the scientific community, but also of those who make use of the evidence the scientific community produces. 1
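For reference, Cohen’s kappa corrects raw agreement for chance by comparing the observed proportion of agreement, $p_o$, with the proportion of agreement expected by chance, $p_e$ (estimated from each coder’s marginal code frequencies):

\[
\kappa = \frac{p_o - p_e}{1 - p_e}
\]

A kappa of 0 indicates agreement no better than chance and a kappa of 1 indicates perfect agreement; percentage agreement, by contrast, amounts to reporting $p_o$ alone.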
There are few reports of research that include details of the process by which multiple analysts achieve alignment in interpretations of qualitative data. The strategies that are reported (e.g., simplification, discussion) introduce limitations that can impact the credibility, dependability, and transferability of interpretations and findings, i.e., have consequences for the robustness of claims of trustworthiness.
Simplification
One research strategy used to achieve high intercoder agreement is simplification. Developing and employing a low-inference coding scheme requires researchers to make the lowest inferential step, thus increasing the likelihood that researchers are in alignment with their inferences. For example, in the context of classroom discussion transcripts, distinguishing between a question and an answer would constitute a low-inference coding decision for several reasons. First, a transcript might contain markers such as punctuation, which help to identify questions by the use of a question mark. Second, an analyst’s knowledge of language and conversation may assist in the identification of answers by their collocation with questions. Third, the culture of classroom interaction between teacher and students often takes the form of question/answer pairs (Mehan, 1979); thus, it might be a low-inference decision for an analyst to decide between the two possibilities.
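To make the notion of a low-inference decision concrete, the sketch below is purely illustrative (it is not the ClassInSight coding scheme or any published scheme): it labels a conversational turn as a question or an answer using only the surface markers just described, i.e., a question mark and adjacency to a preceding question.

```python
# Illustrative sketch only: a low-inference rule for labeling conversational turns
# as "question" or "answer" using surface markers (punctuation and adjacency).
from typing import Optional

def classify_turn(text: str, previous_label: Optional[str]) -> str:
    """Assign a coarse, low-inference label to a single conversational turn."""
    if text.strip().endswith("?"):
        return "question"
    if previous_label == "question":
        # A turn that directly follows a question is treated as its answer.
        return "answer"
    return "other"

turns = ["What happens to the energy?", "It changes into heat.", "Right, good."]
labels = []
for turn in turns:
    labels.append(classify_turn(turn, labels[-1] if labels else None))
print(labels)  # ['question', 'answer', 'other']
```

Because each decision rests on an observable marker rather than interpretation, independent coders applying such a rule are very likely to converge.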
Simplification increases the likelihood that independent researchers are aligned in their interpretations because it reduces the level of inference needed to classify data. However, simplification can often be so coarse that it has the potential to be less meaningful descriptively. Take, for example, Mehan’s breakthrough ethnomethodological study of classroom language. In 1979 he identified a pattern of question asking and answering that was unique to classroom life. Four decades later, researchers have sought to describe classroom exchanges in more detail, capturing a wide variety of features such as the depth of knowledge required to answer a question (Webb, 2009), the socio-cognitive implicature of teachers’ questions (Clarke et al., 2015), and what instigates students' responses to teachers’ questions (Clarke et al., 2016). Thus, if one were simply to classify classroom data in terms of questions and answers, coarse categories of discursive interaction in light of decades of research in this area, the advantage would be that high intercoder agreement is easier to achieve, but the disadvantage would be that the account risks being less meaningfully descriptive of phenomena. In terms of credibility, simplification raises questions about the extent to which the account adequately describes theory with data, context, and meanings. In terms of transferability, simplification also raises questions about whether there is enough information to determine if the findings would fit (and therefore describe) another setting (Guba, 1981; Miles & Huberman, 1994).
Discussion
Another research strategy to achieve intercoder agreement reported in qualitative research is discussion. Discussion is used in the coding process either to obtain consensus in interpretations of codes or to probe coding disagreements (e.g., Cascio et al., 2019; Zade et al., 2018). Consensus building is often a favorable end in the coding process, as it supports indices of intercoder agreement. Some approaches use discussion to collaboratively code and develop codebooks inductively, then apply the codebook independently to novel data with a shared understanding of the coding scheme (Cascio et al., 2019; Naganathan et al., 2022). However, as a process, discussion is often a footnote in reports of coding, with many reports simply acknowledging that agreement was sought and measured and that, in incidents of disagreement, discussion helped achieve consensus. Yet disagreements can be systematic and can indicate a need for the coding frame to be refined in order to resolve ambiguity, or that there are multiple reasonable interpretations of the same data (MacPhail et al., 2016; Zade et al., 2018).
While discussion can be productive in resolving disagreement, reports of this kind can obscure dynamics that shape consensus building and that ultimately threaten the legitimacy of evidence of trustworthiness. For example, qualitative researchers have long used member checks as a tool to verify coding with populations involved in research as participants. However, power differentials, which can be shaped by gender, race, class, institutional affiliation, and position, can influence the extent to which individuals have the agency to disagree, raising questions about the authenticity of the “consensus” that undergirds accounts of trustworthiness. Some newer practices use discussion to collaboratively code with members of the research team and participants as a means to actively disrupt traditionally hierarchical research relationships (Naganathan et al., 2022).
The Perils and Possibilities for Achieving Intercoder Agreement in Team Based Qualitative Research
Prior work has highlighted the paucity of guidelines for how a team of researchers can work to align their ways of seeing and interpreting data, which are the foundations of what gets calculated and reported as intercoder agreement. In this section, we introduce a method for developing shared ways of seeing phenomena in data as a means of achieving intercoder agreement. We explore the methodological challenges and opportunities that emerged through the process of carrying out coding on a longitudinal project, ClassInSight. We refer to the challenges as the perils of team-based coding, and the methodological innovations we developed in light of those challenges as the possibilities for depth and reliability in team-based qualitative data analysis.
ClassInSight Project
In our work on educational improvement, we seek to produce insight into educational processes that can help teachers to reflect and scaffold how they can transform education practices to be more just and equitable. The ClassInSight project is a 5-year research-practice partnership with two public school districts in Southern California that serve a diverse (ethnic, racial and linguistic) population of students. The goal of the project is to support teachers’ reflection on equitable pedagogy in secondary science, and scaffold instructional decision making to enact equitable pedagogical practices that have the potential to promote robust student learning. We do this by audio-recording science lessons and transforming those lessons into transcripts, then coding the transcripts and transforming the coding into meaningful analytics (visualizations) for teacher reflection on their practice (e.g., Figure 1).

Figure 1. ClassInSight web app dashboard.
There are a few constraints that are meaningful to the logistics of our coding process. First, we want to provide teachers with timely analytics about their classroom observations as close to the observation event as possible so that the analytics are meaningful - this is a pace constraint. Second, we want to provide teachers with meaningful analytics about their facilitation of science lessons, so that we can scaffold their reflection in ways that align with the current state of the knowledge on the science of learning, specifically, how classroom discussion supports student sense-making of science and learning of scientific concepts as a consequence (e.g., Chen et al., 2020; Clarke et al., 2015) - this is a depth of analysis and conceptual constraint. Third, our coding was simultaneously being used by computational linguists to train machine learning algorithms in the hopes that we could automate analytics over the life of the project – this is a use constraint. For us, a core challenge of our work has always been how to achieve the depth of analysis that is adequately descriptive of the interactions we observe, and meaningful enough for our participants to make sense of descriptions in light of their memory of teaching these lessons. In light of these goals, we have developed a fast paced, data analysis infrastructure, i.e., big Qual (Brower et al., 2019), to support data analysis of a large and steady stream of classroom observation data so that teachers can reflect on theory-driven categorizations of classroom teaching.
Perils: Roadblocks in Achieving Intercoder Agreement in Team-based Analysis
The specific analytic task for us (the research team) was to decompose the transcript into codes that captured the dialogic function of the utterances in the transcript (Sushil et al., 2022). For example, if an utterance was an invitation for a student to elaborate on an idea, then it would be coded as (I). The first iteration of the codebook started in Fall 2019. We initially used coarse categories for the interaction between students and teachers around the subject matter, because our coding was being used to develop automated coding using machine learning and natural language processing. However, challenges with developing accurate automated models, and a need to prioritize our commitments to the school districts and teachers over and above scientific advancements in machine learning, meant that we needed to proceed with human coders only in the interim. Because we shifted to human coders, we could also prioritize the depth of analysis that (human) qualitative analysts bring to rich and complex discourse data, which would be far more difficult to achieve accurately with machine learning. Thus we iterated on our codebook two more times with richer, high-inference categories of classroom dialogue, drawing on and adapting existing coding schemes of teacher-led classroom dialogue (T-SEDA Collective, 2022; Vrikki et al., 2019; Wells & Arauz, 2006).
The coding team was co-led by two experts in classroom dialogue and discourse analysis, particularly Conversation Analysis (Hutchby & Wooffitt, 2008). The remainder of the coding team included two doctoral students, in their second and third years respectively, who were former teachers, and later, two additional postdoctoral fellows, one a former teacher with prior expertise in discourse analysis of classroom data and the other with expertise in science and qualitative data analysis more generally. The coding team was diverse in terms of experience with discourse analysis, as well as in linguistic and ethnic backgrounds. Using the codebook, the team engaged in discussion of the codes and some side-by-side coding of a set of transcripts as training (n = 3; each lesson transcript averages 500 codable turns).
When it was felt that team members were aligned in their understanding of the codebook, the code definitions, and how they present in the data we were gathering, we would calculate intercoder agreement. To do so, we had each coder code a set of classroom transcripts and calculated a Cohen’s Kappa against a primary coder; thus, we were simply seeking that each new coder agreed with the expert coder. 2 As the team grew over the years of the project, we repeated this process. However, much to our dismay, we simply could not achieve good intercoder agreement across all the categories we were coding for, with codes that presented less frequently in our dataset adding additional challenges to agreement. We repeated the steps of collaborative coding in team meetings. We revisited the codebook to reevaluate the coding categories. We added more exemplars of codes to the codebook. We conducted analyses of disagreements to identify whether disagreements were systematic. Once there was some confidence in coders’ alignment, we re-coded a training set of transcripts. However, we still could not obtain reasonable intercoder agreement. Along the journey, we engaged in a range of troubleshooting efforts to push through and get good agreement. Eventually, coders achieved a Cohen’s Kappa of .70 or above between expert and novice coders, which we considered very good agreement.
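As a minimal sketch of this check (not the project’s actual tooling), pairwise Cohen’s Kappa between a new coder and the primary coder can be computed with scikit-learn; the code labels below are hypothetical placeholders rather than the ClassInSight codebook.

```python
# Minimal sketch: checking a new coder's agreement against the primary coder
# with Cohen's kappa. Code labels ("I", "E", "R") are hypothetical placeholders.
from sklearn.metrics import cohen_kappa_score

primary_coder = ["I", "E", "I", "R", "E", "I", "R", "R"]  # codes from the primary (expert) coder
new_coder     = ["I", "E", "R", "R", "E", "I", "R", "E"]  # codes from the coder being checked

kappa = cohen_kappa_score(primary_coder, new_coder)
print(f"Cohen's kappa vs. primary coder: {kappa:.2f}")

# In a workflow like ours, a kappa of .70 or above would be treated as very good
# agreement before the new coder proceeds to independent coding.
```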
In order to determine why it was so difficult to achieve good intercoder agreement with adaptations of coding schemes that were rigorously developed and widely used (T-SEDA Collective, 2022; Wells & Arauz, 2006), and to understand how we were eventually able to obtain very good agreement, we conducted an autoethnography of our coding journey, aided by coding meeting notes that were, in effect, thick descriptions of our coding meeting discussions (Sushil et al., 2022). Focusing on 2 years of coding challenges, we examined detailed meeting notes recorded during our IRR-oriented meetings between Fall 2019 and Spring 2021. Two researchers conducted open coding of the meeting notes and identified preliminary codes for excerpts that would provide insights into our disagreements and the decisions that were made to get through each impasse. These codes were discussed and validated by the entire team of researchers that worked on the analyses. A codebook was developed based on the agreements reached. “Decision Point” and “Corrective Measures/Strategies” are examples of codes included in the codebook, which sought to characterize critical moments, roadblocks, and decisions in our process. The same two researchers then coded the remaining 106 pages of meeting notes using qualitative data analysis software. A subsequent cross-code analysis of the excerpts resulted in themes that helped to identify the relationships among coding impasses, the decisions that resolved them, and, ultimately, intercoder agreement. It is from this process that we developed a team coding strategy, which we outline in the next section.
We found that most points of impasse were solvable once we identified why there was disagreement. For example, through our discussions we realized that the structure of the transcript, differences in experience coding classroom discussions, and differences in experience coding with the categories in our codebook were all contributing to disagreement. Some coders were interpreting a conversational turn with more context from the surrounding turns than others. We also learned that some coders were drawing on more context about the school, teacher, and lesson because they may have been in the classroom collecting the data. It was from these discussions that we recognized a need for greater common ground through shared experiences that help to align the mental models we individually held of the data and the codebook. Moreover, we recognized that what was surfaced in these discussions was a resource that, once shared, helped to broker shared ways of seeing the phenomena in data. We therefore used these explications to refine our processes and codebook and to better design strategies for attuning our mental models of data and phenomena. In this sense, mining disagreements can be leveraged as a resource for achieving better agreement, rather than treated as a hindrance that needs to be overcome expeditiously. In the next section, we introduce a process for achieving intercoder agreement through developing shared heuristics and reasoning during the coding process. Heuristics refers to the interpretive strategies that coders employ to interpret data and identify which code should be applied to a data segment.
Possibilities: Achieving Intercoder Agreement through the Development of Shared Heuristics
Preparation for Coding
The first stage in the process (see Figure 2) is to establish the purpose of the coding, as it may impact the process of coding. For example, in the case of the ClassInSight project, our primary purpose was to provide teachers with a visual description of the classroom discussion in terms of how teachers support student learning of science through dialogue. Central to this purpose was to create a representation of what is (i.e., what happened in the classroom discussion) in order to scaffold what could be (i.e., supporting teachers to reflect on the nature of the discussion in terms of student learning and begin to envision how they might refine discussion facilitation moves to support robust student learning). A secondary purpose of our coding was driven by our research goals: to repurpose the coding to examine teachers’ facilitation of discussion and how, over time, their facilitation may change as a product of their reflection and tinkering. With respect to our primary purpose, this meant that the coding schema had to be a prima facie representation of a lesson, as the coding was of lessons each teacher had taught in the recent past, and the categories would have to be meaningful to them in terms of their sense-making about teaching and learning science.

Figure 2. Procedure for developing shared heuristics in team-based coding.
Second, the coding scheme should be either developed or informed by and adapted from published schemas, and it should clearly align with the research questions, i.e., with the purpose of the coding. It is also at this stage that decisions should be made about how interpretive the codes should be (high or low inference) and how the data will be segmented for coding. For example, coding may rely on high inferences about latent characteristics of the data that require theoretical expertise to identify (O’Connor & Joffe, 2020). One such example from our own work on ClassInSight is that we recognized that expert and novice coders were chunking the text differently; a heuristic developed with expertise enables a conversation analyst to identify a code with respect to its adjacency to another code. This led us to reformat the transcript to help scaffold novice coders in attending to these contingencies. A classroom transcript has multiple levels of meaningful “chunks” that serve as levels of analysis, e.g., lesson, episode, sequence, exchange, conversational turn (Wells & Arauz, 2006).
Third, logistical decisions should be made about the coding team and approaches to coding, such as whole-team collaborative coding (Cascio et al., 2019) or using a lead coder(s) with whom new coders must align. In all cases, these decisions are consequential to the expectations of voice (Naganathan et al., 2022), and to the overall procedure for achieving agreement. In addition, it should be decided what proportion of the data will be used as the training dataset. In the ClassInSight project, we had an estimate of how many lessons we needed to observe per teacher, per year, which helped us decide on the size of the coding team necessary to meet our scale and speed requirements. Because of the size of our dataset, we determined that five lesson transcripts would constitute our training set, and we sought diversity of teaching styles in the training set so that coders would have experience with the multiple ways in which the codes might present in the data. In addition, because the project was a 5-year longitudinal project, logistical decisions had to be made to account for changeover in staff throughout the project, as students may graduate and postdoctoral scholars may move on to other appointments. Thus, it was determined that a primary coder would code the training set, and as new coders were onboarded, they would seek alignment with that standard.
Coding Training: Developing Shared Heuristics
The next phase of the process is coder training (Figure 2). The key discovery from our autoethnography of the coding process (Sushil et al., 2022) was that when coders shared the heuristics used to code, we were able to (a) identify the heuristics employed independently by a coder that contributed to disagreements, and (b) surface heuristics employed to code that were productive, in terms of helping coders develop a shared way of seeing the data (e.g., identify phenomena and apply the relevant code). Again, heuristics refers to the interpretive strategies that coders employ to interpret data and identify which code should be applied to a data segment.
Table. Exemplar of a Code and Heuristics Used to Identify the Code in Data.
Note. Heuristics in italics.
It is at this stage that we recommend documenting these heuristics, and the associated reasoning around their use, in the codebook or in the training data itself. In our work on the ClassInSight project, we captured heuristics as annotations within a transcript, next to the utterances of text and the codes they were assigned. In this way, these transcripts served as a “worked example” of a coded transcript, a contextualized example of codes that analysts could return to when coding independently. This is particularly important given the contextual nature of discourse analysis, whereby surrounding text is essential for determining what a code is (e.g., adjacency, exchanges, and sequences are levels of text that coders can use to infer the meaning of a conversational turn). We also captured if/then reasoning around the heuristics in the codebook, and continued to update the codebook as additional reasoning emerged from collaborative coding sessions. Recording our reasoning was an essential step in our process, as it was only through arbitrating perspectives and disagreements that we were able to surface heuristics and reasoning that are tacit for expert coders and novel for novices. In this sense, the codebook became a living document of our collaborative sensemaking around our data and the codes. It may be possible at the start of a project to enumerate heuristics and reasoning around coding; however, we posit that it may not be possible to predict all the perspectives, experiences, and histories that a coding team brings to the coding process and that shape how members code. Thus, side-by-side coding for a period of time helps to surface these often implicit ways of engaging with data that analysts bring to the coding process and that can contribute to divergent interpretations. In addition, side-by-side coding can be an epistemic resource whereby novice coders have space to share their fresh perspectives, and have those perspectives negotiated and taken up by expert coders, as novices make sense of data and phenomena unencumbered by prior experience.
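The sketch below is a hypothetical illustration of how such a codebook entry might record a code’s definition, the heuristics used to spot it, and the if/then reasoning surfaced in collaborative sessions; the field names, example code, and transcript pointer are invented for illustration and are not the ClassInSight codebook itself.

```python
# Hypothetical sketch of a structured codebook entry that records heuristics and
# if/then reasoning alongside the code definition. All content is illustrative.
codebook_entry = {
    "code": "I",
    "label": "Invitation to elaborate",
    "definition": "Teacher invites a student to expand on or justify an idea.",
    "heuristics": [
        "Look at the preceding student turn: an invitation usually follows a "
        "substantive student contribution rather than a procedural exchange.",
        "Open-ended stems ('Can you say more...', 'Why do you think...') signal "
        "elaboration rather than a known-answer question.",
    ],
    "if_then_reasoning": [
        "If the teacher turn asks the same student for more about the same idea, "
        "then code I; if it redirects the idea to another student, then consider "
        "a different code.",
    ],
    "worked_examples": ["training_transcript_03, turns 112-118"],  # pointers into the training set
}
```

Keeping entries in a structure like this makes it straightforward to append newly surfaced reasoning after each collaborative session, so the codebook remains a living document.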
Over time, the process of sharing and documenting heuristics helps the coding team align not only in their interpretations of data, but also in the underlying reasoning that supports those interpretations and in the heuristics (i.e., interpretive strategies) used to identify coding categories in data. This phase of the process ends when the lead coder(s) feel confident that coders have a shared way of seeing the data. We recommend that this process be iterative, so that over time (especially in the case of longitudinal coding projects) potential coding drift is minimized, i.e., after long periods of independent coding, analysts can drift away from the norms established during training.
Intercoder Reliability Check
The next phase of the process (Figure 2) is to check intercoder agreement. At this stage, coders code a portion of the data to be used to measure intercoder agreement. This should be a different set of transcripts from the training set (as the training set serves as a worked example of the team’s coding frame). Prior to measuring intercoder agreement, a threshold should be determined for what will count, for the coding project, as an acceptable level of agreement. We argue that qualitative researchers should be expansive in their definitions of very good agreement, as the uses of data, and the consequences of those uses, may mean that the thresholds generally accepted by the research community are not universally appropriate.
Measure intercoder agreement with a preferred measure (e.g., Cohen’s Kappa). If the coders meet the threshold level of agreement that was predetermined, then the data to be coded can be distributed to coders and independent coding can commence. If, however, values do not meet the threshold, then have a third (neutral) party conduct an analysis of disagreements. This individual helps to identify patterned disagreements. Next, the coding team can return to the training set of data and have a targeted discussion of the codes for which there was patterned disagreement. Again, surfaced heuristics and reasoning should be recorded. Then the coders recode the data that will be checked for intercoder agreement. Once the threshold is met, independent coding can commence. 3 Again, we recommend periodic collaborative coding to minimize the potential for coding drift.
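One way the neutral third party might look for patterned disagreement is sketched below, assuming pandas is available: a cross-tabulation of the two coders’ labels makes systematic confusions between particular codes easy to spot. The code labels are hypothetical placeholders, not our codebook.

```python
# Minimal sketch: cross-tabulate two coders' labels to surface patterned disagreement.
# Labels ("I", "E", "R") are hypothetical placeholders.
import pandas as pd

coder_a = ["I", "E", "I", "R", "E", "I", "R", "R", "E", "I"]
coder_b = ["I", "E", "R", "R", "E", "R", "R", "E", "E", "R"]

confusion = pd.crosstab(
    pd.Series(coder_a, name="Primary coder"),
    pd.Series(coder_b, name="New coder"),
)
print(confusion)
# Large off-diagonal cells (e.g., many turns coded I by one coder and R by the other)
# point to a specific pair of codes to revisit in a targeted discussion.
```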
Reliable Onboarding of New Coders
The final phase concerns sustainability (Figure 2). This phase will be most relevant to longitudinal and team-based projects that have a workflow similar to the ClassInSight project. When onboarding new coders, all of the prior steps are used to develop shared heuristics. When checking intercoder agreement, agreement should be checked between new coders and the lead coder.
Discussion and Considerations
Achieving intercoder agreement relies on multiple coders developing shared ways of seeing phenomena in data. Developments in the philosophy of qualitative methods have lent themselves to the widespread use of conventions for qualitative reporting, such as forms of evidence of the trustworthiness of interpretations, e.g., measures of intercoder agreement (Guba, 1981; Mishler, 1990). However, in terms of research practice, how researchers develop shared ways of seeing phenomena in complex data is under-reported and often mysterious, despite being the foundation of claims of trustworthiness of interpretations and potential evidence for transferability, confirmability, and dependability. Thus, there is a need to reconceptualize precisely what is meant by trustworthiness in qualitative research and the extent to which our methods of achieving agreement serve as evidentiary arguments for trustworthiness claims.
Developing shared heuristics for coding is one approach for dyadic or team-based research that helps analysts achieve shared ways of seeing phenomena in data, and it provides a more robust foundation for outcome measures of intercoder agreement. We offer a blueprint for how research teams can develop these practices for projects with multiple layers of complexity, from the nature of the data and the theoretical tools employed to the diversity of the analysts and the uses of the analysis.
The advantages of our approach are multifold. First, shared heuristics for coding attend not just to the assignment of a category to data, but also to the reasoning that underlies categorization. Thus, it is a way to attune multiple coders to phenomena and their presentation in data, and to attune sense-making as a scaffold for dependable and consistent identification of phenomena in data. Second, the process we outline produces a rich audit trail that serves as a resource for novice analysts’ learning, but also as a calibration tool for expert analysts, especially in the context of a longitudinal project. This level of detail can also be useful for open science, supporting other research teams to learn from and utilize the tools developed on a project. Finally, and most importantly, it supports the achievement of very good intercoder agreement without the need to oversimplify the depth of inference.
The biggest limitation of developing shared heuristics is the time investment that may be necessary for the process. While it may not seem worthwhile to use this approach on a project that has a fairly small dataset, we would argue that discussion of heuristics is valuable as it allows one to be explicit about identifying evidence of phenomena in data. Moreover, we would argue that a dataset of any size would benefit from this preparatory analytical calibration, as it would help to equip a team to process a dataset faster and with greater dependability than otherwise. The initial time investment may also be a consideration when a coding team is composed of novice analysts. However, sharing of heuristics through this process can help to socialize ways of making sense of data, helping novices quickly grow in their knowledge, skill and expertise. In this sense, we would argue that the advantages of this approach far outweigh potential limitations.
Footnotes
Acknowledgments
The authors would like to thank undergraduate research assistants Vivian Leung, Barbara Lee and Yerin Go for their support with data processing, upon which this research project relied and which made the developments reported in this article possible. In addition, we would like to thank our school district partners, as well as our university partners at Carnegie Mellon University and Pennsylvania State University.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the James S. McDonnell Foundation Teachers as Learners Program.
