Abstract
Purpose
This study examines AI–human alignment in thematic analysis from a multi-level perspective, asking whether agreement at the level of individual codes is necessary for convergence in higher-order themes. While prior research often evaluates overall agreement, this study distinguishes between code-level variability and theme-level stability to provide a more nuanced assessment of AI-assisted qualitative analysis.
Methods
Using qualitative interview data from a doctoral study on e-commerce, the original human thematic analysis (2012, NVivo) was compared with four independent AI-assisted analyses conducted in 2026 using Claude Code. Two prompting strategies were tested: general prompts and structured multi-phase prompts. Alignment was assessed at both code and theme levels using the F1 Score, while inter-session consistency was evaluated through bidirectional mapping across session pairs.
Findings
At the code level, AI outputs showed considerable variability, with F1 Scores ranging from 55.9% to 87.6% and clear differences between prompting approaches (structured: 83.1% average vs. general: 59.2%). In contrast, theme-level alignment remained consistently high across all sessions (90.9%–100%), with strong inter-session consistency (average 86.1%). These findings indicate that although AI-generated codes may differ across sessions and from human coding, the resulting thematic structures converge reliably.
Originality
This study introduces a multi-level alignment framework and provides one of the first empirical evaluations of Claude Code as an agentic AI tool for thematic analysis. The 14-year gap between the original human analysis and AI reanalysis offers a distinctive test of AI engagement with historical qualitative data. The study identifies a pattern of hierarchical convergence, where code-level divergence coexists with theme-level stability.
Implications
The findings suggest that strict code-level agreement may not be necessary for reliable thematic conclusions. AI-assisted analysis can support theme development when structured prompting and human oversight are maintained, offering methodological guidance for integrating AI into qualitative research while preserving analytical rigor.
Keywords
1. Introduction
Artificial intelligence (AI) is increasingly used to support qualitative data analysis, yet a key question remains unresolved: must AI reproduce human coding decisions to generate meaningful thematic insights? Many existing evaluations report AI-human agreement at a single level of analysis, without systematically distinguishing between code-level and theme-level alignment (Bennis & Mouwafaq, 2025; Hila & Hauser, 2025; Jain et al., 2025). This practice implicitly assumes that agreement at one level reflects agreement at the other, and this paper challenges that assumption. Drawing on a multi-level alignment framework, it examines whether AI-human agreement operates differently at the code level and the theme level, and what this distinction means for how AI-assisted qualitative analysis should be evaluated, interpreted, and integrated into research practice. The significance of this question depends on which tradition of thematic analysis is being considered.
Thematic analysis has been fundamental to social science inquiry for decades. Braun and Clarke (2006) provided an influential framework for identifying, analyzing, and reporting patterns within qualitative data. In subsequent work, they distinguished thematic analysis as a family of methods encompassing coding reliability approaches that prioritize structured coding and accuracy, codebook approaches such as template and framework analysis, and reflexive thematic analysis that centers researcher subjectivity and interpretive engagement (Braun & Clarke, 2021). This diversity within thematic analysis has implications for how AI-assisted approaches are evaluated, as the quality criteria appropriate for each tradition differ. Understanding this landscape is essential for situating the growing body of empirical evidence on AI-assisted analysis.
Large Language Models (LLMs) with advanced natural language understanding capabilities have opened new possibilities for qualitative research. These AI systems can process large amounts of text, identify patterns, and categorize information in ways that may complement human analytical capabilities. Recent empirical studies have demonstrated approximately 80% agreement between LLM and human thematic analysis (Castellanos et al., 2025). Evaluators preferred LLM-generated codes 61% of the time for their analytical utility (Schroeder et al., 2025). Despite these encouraging findings, fundamental questions remain about the reliability and practical applicability of AI-assisted analysis, and critically, whether existing evaluations capture the right dimensions of agreement. These questions become particularly salient when examining AI tools that move beyond conversational interfaces into agentic, file-system-integrated environments.
This paper explores an area that has received little attention in the emerging literature on AI-assisted qualitative research. Despite the growing landscape of AI-integrated qualitative tools, I could not identify any published academic study examining Claude Code, Anthropic’s command-line agentic implementation of the Claude LLM, for qualitative research purposes. Unlike AI features embedded within CAQDAS platforms or web-based LLM interfaces, Claude Code functions as a standalone agentic environment that enables the AI to independently navigate project directories, process entire transcript corpora, and produce formatted analytical artifacts. At the time this study was conducted (January 2026), no peer-reviewed research had examined this class of agentic AI tool for qualitative thematic analysis.
1.1. Research Problem and Significance
Despite the growing number of AI-assisted analytical tools, fundamental questions remain about their alignment and reliability for qualitative research. Can AI systems achieve meaningful alignment with human analytical judgments? Under what conditions is such alignment maximized? Are AI-generated findings consistent enough across independent runs to support research conclusions? These questions go beyond technical concerns. They touch on the foundations of qualitative inquiry: the nature of interpretation, the role of human judgment, and the criteria for evaluating analytical quality (Braun & Clarke, 2021; Chatzichristos, 2025).
Critics of AI-assisted qualitative analysis have raised legitimate concerns. Qualitative interpretation, they argue, is a fundamentally human activity that draws upon contextual knowledge, cultural understanding, and embodied experience that AI systems cannot possess (Braun & Clarke, 2021; Sakaguchi et al., 2025). The meaning-making process central to qualitative research requires subjective engagement with data that may not be replicable through algorithmic processing (Jowsey, Braun, et al., 2025). In addition, the black-box nature of AI systems raises questions about the transparency and auditability of AI-generated analytical outputs (Resnik & Hosseini, 2025).
Proponents counter that AI systems need not replicate human cognition to produce analytical outputs that align with human analysis. If AI-generated themes show concordance with those produced through human analysis, and if this alignment is consistent across multiple runs, then AI-assisted approaches may complement traditional methods (Castellanos et al., 2025; Montes et al., 2025). The empirical question of alignment should guide assessments of AI utility for qualitative research, rather than philosophical debates about machine understanding.
The qualitative research community lacks systematic empirical evidence on the alignment between AI-assisted and human thematic analysis. A systematic mapping study of LLM applications in qualitative research found that 75% of relevant publications appeared in 2024 alone, which shows how recent this literature is and helps explain why rigorous validation studies comparing AI outputs to established human analyses remain scarce (Barros et al., 2025). Without such evidence, researchers cannot make informed decisions about when and how AI-assisted approaches should be integrated into qualitative research practice. Some scholars have formally rejected GenAI use for reflexive qualitative approaches (Jowsey, Braun, et al., 2025), though this position has been challenged by researchers who advocate for critical, researcher-led engagement with AI tools (De Paoli, 2026; Friese et al., 2026). This ongoing debate highlights the need for rigorous empirical studies that examine both alignment (concordance with human analysis) and reliability (consistency across sessions).
The present study addresses this need through an experimental design that offers certain advantages. By applying AI analysis to interview data that was analyzed 14 years ago using established methods, I can assess AI performance against a known reference point. The original analysis was subsequently validated through quantitative survey research with 153 retailers, lending additional credibility to its use as a comparator, though it remains one analyst’s interpretation rather than an objective standard. In addition, by conducting four independent AI analysis sessions, I can evaluate the consistency and reliability of AI-assisted analysis, a dimension that has received limited attention in existing literature.
1.2. Research Context
This study addresses the alignment question through analysis of qualitative data from my doctoral research project on the diffusion of online retailing adoption in Saudi Arabia. The original study, conducted in 2012, used semi-structured interviews with retailers and industry experts to explore the factors influencing e-commerce adoption in the Saudi context. I conducted thematic analysis of the interview transcripts using NVivo software, which yielded 51 codes organized under three meta-categories: Consumer Related Factors, Environment Related Factors, and Organization Related Factors.
This dataset is suitable for AI-human alignment assessment for three reasons. First, the analysis was conducted over a decade ago, and the AI sessions were run without providing the original coding or thematic framework. Because the original PhD thesis is publicly available, the possibility that the AI model encountered elements of this work during training cannot be fully ruled out; nevertheless, the procedural separation between the AI sessions and the original analysis, combined with the 14-year temporal gap, provides a reasonable degree of analytical independence. Second, the thematic framework is grounded in established theoretical models: Diffusion of Innovation (DOI) theory, the Technology-Organization-Environment (TOE) framework, and the Stages of Growth for E-commerce (SOG-e) model. These models provide clear conceptual anchors against which AI performance can be evaluated. Third, the dataset represents a complete analytical arc from raw interview data through codes to themes, enabling assessment at multiple levels of analytical detail.
The 14-year gap between the original human analysis (2012) and the AI reanalysis (2026) introduces both challenges and opportunities. This gap may introduce confounds related to evolving interpretive frameworks and AI training data. At the same time, it provides an opportunity to assess the temporal stability of qualitative findings and the capacity of AI systems to engage with historical qualitative data.
1.3. Research Objectives
The objectives of this methodological experiment are:
1. Conduct an AI-assisted thematic analysis of qualitative interview data using Claude Code without providing prior human analysis results during the analytical sessions.
2. Conduct multiple independent AI analysis sessions using different prompting strategies (general vs. structured) to test replication and assess reliability.
3. Compare the themes, codes, and factors identified by the AI sessions with those from the original human NVivo analysis.
4. Assess inter-session consistency between the four AI analyses.
5. Document the methodological process and Claude Code capabilities to enable replication.
1.4. Contribution to Knowledge
This study contributes to the knowledge on AI-assisted qualitative research in three areas: methodological, practical, and conceptual. These categories are used here as an organizing device rather than strict boundaries; the contributions are interconnected by nature, and insights in one area often carry weight in the others.
Methodological and empirical contribution - The multi-level alignment framework developed in this study provides a systematic, replicable methodology for evaluating AI-human concordance in thematic analysis. The F1 Score serves as a balanced metric that accounts for both precision (relevance of AI codes) and recall (coverage of human concepts), offering a more nuanced assessment than simple agreement percentages. By applying this framework across four independent AI sessions and comparing results against established human analysis, the study provides quantitative benchmarks that can inform future research design and tool evaluation.
Practical contribution - The comparative evaluation of prompting approaches provides actionable guidance for researchers. Understanding which prompting strategies maximize alignment can help researchers implement AI-assisted analysis more effectively. The documentation of Claude Code’s agentic workflow also supports replication by other researchers.
Conceptual contribution - The study identifies an empirical pattern of hierarchical convergence, where theme-level alignment consistently exceeds code-level alignment across all sessions and comparison types. This observation has implications for how AI-assisted findings should be interpreted and reported, suggesting that thematic conclusions may be more reliable than code-level details. While this pattern requires further investigation across different datasets and domains, it offers an initial conceptual lens for understanding AI performance in qualitative analysis.
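To make the F1 Score described under the methodological contribution concrete, the following minimal sketch computes precision, recall, and their harmonic mean from match counts. It is an illustration under stated assumptions, not the study's actual scoring procedure: the function name `alignment_f1`, the `AlignmentScore` container, and the way matches are counted (AI codes that map to at least one human code, and human codes covered by at least one AI code) are all hypothetical, and the example counts other than the 51 human codes are invented for demonstration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignmentScore:
    precision: float  # share of AI codes that map to a human code
    recall: float     # share of human codes covered by an AI code
    f1: float         # harmonic mean of precision and recall


def alignment_f1(ai_matched: int, ai_total: int,
                 human_matched: int, human_total: int) -> AlignmentScore:
    """Compute precision, recall, and F1 from match counts.

    The counting rule is an assumption made for illustration only.
    """
    if ai_total <= 0 or human_total <= 0:
        raise ValueError("totals must be positive")
    precision = ai_matched / ai_total
    recall = human_matched / human_total
    if precision + recall == 0:
        # No matches in either direction: define F1 as 0.
        return AlignmentScore(0.0, 0.0, 0.0)
    f1 = 2 * precision * recall / (precision + recall)
    return AlignmentScore(precision, recall, f1)


# Illustrative counts only: 30 of 40 AI codes match, 34 of 51 human codes covered.
example = alignment_f1(ai_matched=30, ai_total=40, human_matched=34, human_total=51)
print(f"precision={example.precision:.3f} recall={example.recall:.3f} f1={example.f1:.3f}")
```

Because the harmonic mean is pulled toward the lower of the two values, a session that generates many loosely relevant codes (high recall, low precision) cannot score well on F1 by coverage alone.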
2. Literature Review
2.1. The Emergence of AI in Qualitative Research
The integration of AI into qualitative research is among the notable methodological developments of the past decade. Computer-Assisted Qualitative Data Analysis Software (CAQDAS) such as NVivo, ATLAS.ti, and MAXQDA have long supported researchers with data organization, coding management, and retrieval functions. However, these tools maintained the researcher’s exclusive interpretive authority (Mehta et al., 2025). Generative AI (GenAI) changes this by enabling active participation in the analytical process itself: suggesting codes, identifying themes, and generating interpretations that previously required human cognitive engagement.
This transformation has accelerated rapidly. A systematic mapping study by Barros et al. (2025) examining LLM applications in qualitative research found that 75% of relevant publications appeared in 2024 alone. This indicates the growing interest in this methodological development. The field spans diverse domains, including healthcare (Castellanos et al., 2025; Sakaguchi et al., 2025), education (Tai et al., 2024), policy research (Liu & Sun, 2025), and software engineering (Montes et al., 2025).
Several technological advances make this research timely. Modern LLMs demonstrate sophisticated language understanding, context retention across lengthy documents, and the ability to follow complex analytical instructions. Anthropic’s analysis of 308,210 real-world Claude conversations revealed that the model mirrors users’ values in 28.2% of interactions while pushing back on approximately 3% of requests deemed inappropriate (Huang et al., 2025). This suggests a capacity for nuanced engagement that may be relevant to qualitative research requiring interpretive sensitivity.
2.2. Empirical Evidence for LLM-Human Agreement
The central question for AI-assisted qualitative research is whether LLMs can produce analytical outputs that align with human expert judgment. The empirical evidence on this question has been encouraging, though not without important caveats. Castellanos et al. (2025) examined thematic summarization in healthcare qualitative data and found 80% agreement between LLM and human interpretation. LLMs agreed with humans on theme alignment and convergence for two-thirds of the analyzed topics. Wachinger et al. (2025) found that ChatGPT’s results were “particularly convincing for the identification of descriptive themes.”
Montes et al. (2025) reported that evaluators preferred LLM-generated codes over human codes 61% of the time, judging them more analytically useful for answering research questions. This result, which suggests that AI-generated codes may in many cases exceed human codes in utility, challenges the assumption that human analysis is superior in all dimensions. However, these advantages come with important caveats, particularly regarding the distinction between descriptive and interpretive coding.
Multiple studies find that LLMs excel at identifying surface-level, explicit themes but struggle with latent, implied meanings that require cultural or contextual understanding. Castellanos et al. (2025) found that LLMs were less successful at identifying subtle, interpretive themes compared to concrete, descriptive ones, and may miss themes requiring deep contextual or domain knowledge. Wachinger et al. (2025) similarly reported that LLM-generated results were particularly convincing for descriptive themes. Montes et al. (2025) noted that LLMs missed latent interpretations and produced themes with unclear boundaries. Research in Japanese clinical contexts further highlighted cultural interpretation as a specific challenge (Sakaguchi et al., 2025). These findings suggest that LLMs may be most useful for preliminary or descriptive analysis, with human researchers retaining authority over interpretive depth.
2.3. The Reliability Challenge
While alignment with human analysis has received considerable attention, the reliability dimension, consistency across independent applications, remains understudied. This gap is problematic because reliability is foundational to any analytical method’s scientific credibility. The present study’s inter-session replication design addresses this issue, but the existing literature provides limited benchmarks.
Jain et al. (2025) developed a framework for assessing reliability through dual metrics combining Cohen’s Kappa and semantic similarity. They tested Claude 3.5 Sonnet, Gemini 2.5 Pro, and GPT-4o through six independent runs per model on interview transcripts. Gemini achieved the highest reliability (κ > 0.90), followed by GPT-4o (κ > 0.85) and Claude (κ > 0.84). These differences suggest that model selection affects consistency.
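For readers less familiar with the statistic, Cohen’s Kappa corrects raw agreement for the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e), where p_o is the observed proportion of agreement and p_e is the proportion expected from each rater’s label frequencies. The sketch below implements this textbook definition for two label sequences. It is not the dual-metric framework of Jain et al. (2025); the function name `cohens_kappa` and the example labels are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label proportions.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Illustrative labels for ten segments coded in two independent runs.
run_1 = ["trust"] * 5 + ["cost"] * 5
run_2 = ["trust"] * 4 + ["cost"] + ["cost"] * 4 + ["trust"]
print(round(cohens_kappa(run_1, run_2), 3))
```

In this example the two runs agree on 8 of 10 segments, but because half of that agreement would be expected by chance, kappa is noticeably lower than the raw 80% agreement.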
Borchers et al. (2025) examined whether multi-agent consensus approaches could improve coding accuracy. Testing six LLMs ranging from 3 to 32 billion parameters, they found that consensus-making only improves accuracy under specific conditions: low temperature settings, a single LLM, and a single code type. Multi-agent approaches showed minimal accuracy gains overall, suggesting that simpler approaches may be preferable for achieving reliable results.
The inter-rater reliability metrics from existing studies provide benchmarks for contextualizing the present findings. Research indicates that LLMs achieve substantial agreement (κ = 0.76-0.78) for deductive coding with pre-established codebooks, but only moderate agreement (κ = 0.54-0.57) for inductive coding tasks (Hila & Hauser, 2025; Zhang et al., 2025). This difference in performance on deductive versus inductive tasks has implications for methodological design: providing clear frameworks and examples improves LLM reliability.
2.4. Model Comparison and Selection
The fast-changing field of LLMs raises practical questions about which model researchers should select for qualitative analysis. Comparative studies have begun evaluating different models across dimensions relevant to research applications. Bennis and Mouwafaq (2025) conducted a comprehensive multi-model comparison, testing nine GenAI models on thematic analysis of Cutaneous Leishmaniasis qualitative data from 448 participant responses. The models tested included Claude 3.5 Sonnet, Llama 3.1 405B, NotebookLM, Gemini 1.5 Advanced Ultra, ChatGPT o1-Pro, and DeepSeekV3. Advanced models achieved high congruence with reference standards, with some achieving perfect concordance (Jaccard index = 1.00).
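The Jaccard index mentioned above measures overlap between two sets as the size of their intersection divided by the size of their union, so a value of 1.00 means the two theme sets are identical. The following minimal sketch shows the calculation; the function name `jaccard_index` is hypothetical, and the theme labels are borrowed from this paper’s meta-categories purely as placeholders, not as results from Bennis and Mouwafaq (2025).

```python
from typing import Hashable, Iterable


def jaccard_index(themes_a: Iterable[Hashable], themes_b: Iterable[Hashable]) -> float:
    """Jaccard index: |A intersect B| / |A union B|."""
    a, b = set(themes_a), set(themes_b)
    if not a and not b:
        # Two empty sets are treated as identical by convention here.
        return 1.0
    return len(a & b) / len(a | b)


# Placeholder theme labels for illustration only.
reference = {"Consumer Related Factors", "Environment Related Factors",
             "Organization Related Factors"}
model_output = {"Consumer Related Factors", "Environment Related Factors",
                "Technology Related Factors"}

print(jaccard_index(reference, reference))
print(jaccard_index(reference, model_output))
```

Note that the index assumes themes have already been matched to identical labels; in practice, deciding whether two differently worded themes are the same is itself an interpretive step.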
Mavrych et al. (2025) compared Claude, ChatGPT, Copilot, and Gemini against medical students on neuroscience questions. Claude achieved 83% accuracy (highest), followed by GPT-4 at 81.7%, Copilot at 59.5%, GPT-3.5 at 58.3%, and Gemini at 53.6%. Wójcik et al. (2025) compared these models on medical examinations in English and Polish, finding that Claude achieved the highest accuracy for most question groups in both languages.
However, accuracy alone does not determine suitability for qualitative research. On the Vectara Hallucination Leaderboard (last updated February 5, 2026), the lowest reported hallucination rate is 1.8% (antgroup/finix_s1_32b). Among the other models listed, google/gemini-2.5-flash-lite reports 3.3%, while openai/gpt-4.1-2025-04-14 reports 5.6%. Several Anthropic models report higher rates, including anthropic/claude-sonnet-4-20250514 (10.3%) and anthropic/claude-opus-4-5-20251101 (10.9%) (Vectara, 2026). For qualitative coding that requires factual accuracy, researchers may need to balance capability against the risk of hallucinations.
The present study selected Claude Code, powered by the Opus 4.5 model, for several reasons informed by the considerations above. Claude models have demonstrated strong performance in accuracy benchmarks across multiple domains (Mavrych et al., 2025; Wójcik et al., 2025). For analyzing lengthy interview transcripts, Claude Code’s context window handling offers a practical advantage: unlike models that require document chunking, Claude Code can process complete interview sets in a single session, maintaining a coherent understanding across the full corpus. More importantly, Claude Code’s agentic command-line architecture offers capabilities that differ from both CAQDAS-integrated AI features and web-based LLM interfaces, as discussed in detail in Section 2.8. This combination of model performance, context handling, and tool architecture motivated the selection for this study.
2.5. Methodological Frameworks for AI-Assisted Analysis
Researchers have begun developing frameworks for integrating LLMs into qualitative workflows while preserving methodological integrity. These frameworks address three interconnected challenges: how to structure the AI-assisted analytical process, how to design effective prompts, and how to report AI-integrated research transparently.
On the analytical process, Nguyen-Trung (2025) introduced GAITA (Guided AI Thematic Analysis), adapting Template Analysis to position researchers as reflexive instruments while guiding GPT-4 through four stages: data familiarization, preliminary coding, template formation, and theme development. Similarly, Naeem et al. (2025) provide practical guidance for integrating generative AI across Braun and Clarke’s six phases of thematic analysis, from data familiarization and initial code generation to theme development, review, definition, and reporting. Both frameworks share a common principle: human researchers retain interpretive authority while AI assists with structured analytical tasks. Their guidance includes prompt strategies, examples of AI-assisted coding workflows, and recommendations for maintaining researcher reflexivity and methodological transparency.
On prompt design, the ACTOR framework (Nguyen-Trung, 2025) provides a structured approach: Assign role and context, Clarify task and format, Tailor with examples, Outline constraints, and Refine iteratively. This framework addresses prompt sensitivity, one of the major sources of variability in AI-assisted analysis identified in the literature.
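One way to picture how the five ACTOR components could be laid out in practice is sketched below. This is an illustrative assumption, not the published framework’s implementation and not the prompts used in this study: the `ActorPrompt` class, its field names, the section headings, and all prompt wording are hypothetical. The first four components are modelled as fields, and the fifth (Refine iteratively) is modelled as producing a revised copy of the prompt.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ActorPrompt:
    """Hypothetical container for the first four ACTOR components."""
    role_and_context: str              # Assign role and context
    task_and_format: str               # Clarify task and format
    examples: tuple[str, ...] = ()     # Tailor with examples
    constraints: tuple[str, ...] = ()  # Outline constraints

    def render(self) -> str:
        """Assemble the non-empty components into one prompt string."""
        sections = [
            ("ROLE AND CONTEXT", [self.role_and_context]),
            ("TASK AND FORMAT", [self.task_and_format]),
            ("EXAMPLES", list(self.examples)),
            ("CONSTRAINTS", list(self.constraints)),
        ]
        blocks = []
        for title, items in sections:
            if items:
                body = "\n".join(f"- {item}" for item in items)
                blocks.append(f"{title}:\n{body}")
        return "\n\n".join(blocks)


# Placeholder wording for illustration only.
draft = ActorPrompt(
    role_and_context="You are a qualitative analyst coding interview transcripts.",
    task_and_format="Generate initial codes and return them as a table.",
    constraints=("Ground every code in a quoted excerpt.",),
)

# Refine iteratively: revise one component and render again.
revised = replace(draft, examples=("<excerpt> -> <code label>",))
print(revised.render())
```

Keeping each component as a separate field makes it easier to record which part of the prompt changed between iterations, which supports the kind of transparent documentation that reporting frameworks call for.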
On reporting standards, COREQ+LLM is being developed as an extension of the Consolidated Criteria for Reporting Qualitative Research for LLM-integrated studies, aiming to ensure methodological rigor, transparency, and interpretability when AI tools are used (Fehring et al., 2025).
The present study draws on insights from these emerging frameworks. The structured multi-phase prompting approach used in Sessions 3 and 4 reflects the phased analytical design advocated by GAITA and Naeem et al., while the prompt structure incorporates elements consistent with the ACTOR framework, assigning the AI a qualitative analyst role, clarifying the coding task, and outlining methodological constraints. The documentation and transparency practices adopted in this study also align with the reporting principles underlying COREQ+LLM. By comparing general and structured prompting approaches, this study provides empirical evidence on the effectiveness of methodologically grounded prompt design, complementing the largely prescriptive guidance offered by existing frameworks.
2.6. The Methodological Controversy
The integration of AI into qualitative research has generated significant scholarly debate. In an open letter published in Qualitative Inquiry, Jowsey, Braun, et al. (2025) gathered 419 experienced qualitative researchers from 32 countries, including Virginia Braun and Victoria Clarke (the originators of reflexive thematic analysis), to formally reject GenAI use for reflexive qualitative approaches. Their argument centers on methodological incompatibility: reflexive thematic analysis requires a subjective, positioned, and reflexive researcher, which AI cannot provide. They also reject AI use on grounds of social and environmental justice. When AI intervenes in the analytical process, reflexivity risks being displaced onto model verification rather than self-interrogation of the researcher’s own analytic lens (Montes et al., 2025).
This position has prompted substantive counter-arguments, also published in Qualitative Inquiry. Friese et al. (2026) contended that rejecting GenAI in its entirety risks closing off methodological evolution and isolating qualitative research from broader epistemic developments. They argued that GenAI, when used under close researcher leadership and control, can serve as a legitimate analytical support within reflexive qualitative inquiry, the critical factor being whether interpretive authority remains with the human researcher, not whether AI is involved in the process. De Paoli (2026) argued that the categorical rejection rests on philosophical assumptions that risk becoming dogma, and that prohibiting GenAI on metaphysical grounds negatively impacts debate and innovation in qualitative analysis. He noted that GenAI does not replace interpretation but serves as a thinking companion that helps researchers ask better questions of their data. Greenhalgh (2026) similarly called for moving beyond binary framings of adoption versus refusal, arguing that such polarization leaves little space for principled disagreement or methodological experimentation. She proposed refocusing the debate on epistemic authority, distinguishing AI-led analysis from human-led practices that incorporate AI as one of several analytical resources.
A common thread across these responses is the distinction between AI as replacement and AI as complement. The critique from Jowsey et al. applies most directly to approaches where AI displaces rather than supplements human analysis. It should also be noted that their critique is directed specifically at reflexive thematic analysis, one approach within the broader family of thematic analysis methods (Braun & Clarke, 2021). Studies operating within coding reliability or codebook traditions, where consistency and replicability are valued quality indicators, engage with AI tools under different methodological considerations. When AI is positioned within a researcher-led workflow, where the human retains authority over interpretation and meaning-making, the methodological concerns, while still relevant, operate differently. This distinction aligns with the hybrid model emerging from the broader literature, where human insight and reflexivity guide and critically evaluate computational analysis, rather than delegating interpretive authority entirely to machines (Chatzichristos, 2025).
The present study is designed with this distinction in mind. The research design, which compares the original human analysis (AlGhamdi, 2014) with AI Sessions 1–4, treats AI as an independent analytical perspective to be compared against human analysis, not a substitute for it. The original 2012 analysis was conducted by a human researcher using structured, inductive content analysis, an approach consistent with the coding reliability tradition rather than reflexive thematic analysis. The AI sessions serve as points of methodological triangulation, evaluated using metrics of alignment and consistency that are appropriate to this tradition. This positioning is consistent with the call from Friese et al. (2026) for critical engagement with AI tools under researcher oversight, while acknowledging the legitimate concerns raised by Jowsey, Braun, et al. (2025) about the boundaries of AI involvement in interpretive work.
2.7. Ethical Considerations
Beyond methodological debates, AI-assisted qualitative research raises ethical concerns that warrant careful attention. Bias and discrimination are key issues, as AI systems can reproduce and amplify biases inherent in training data, potentially supporting analyses that are discriminatory or harmful (Resnik & Hosseini, 2025). Biases related to race, ethnicity, gender, sexuality, age, nationality, and socioeconomic status embedded in AI systems could perpetuate existing disparities if uncritically propagated into research findings. In qualitative analysis specifically, such biases may shape which themes are foregrounded and which are overlooked, with implications for marginalized voices and underrepresented perspectives (Mehta et al., 2025). Researchers must therefore critically evaluate AI-generated outputs for systematic patterns of omission or emphasis rather than accepting them at face value.
Transparency and reproducibility present additional challenges. The proprietary nature of commercial AI systems makes interpretation of AI reasoning difficult, and version changes in AI models may produce different results over time (Resnik & Hosseini, 2025). These concerns have prompted the development of reporting frameworks such as COREQ+LLM, which aims to ensure methodological rigor, transparency, and interpretability when AI tools are used in qualitative research (Fehring et al., 2025). The need for transparency is underscored by empirical evidence of significant quality differences across AI tools. Jowsey, Stapleton, et al. (2025) examined Microsoft Copilot for thematic analysis and found concerning results: Copilot outputs included 58% fabricated quotes compared to 79% accuracy for human researchers, and none of the AI outputs provided participant spread by theme. Based on these findings, the researchers could not recommend Copilot for thematic analysis. This highlights that not all AI tools are equally suitable, and that researchers must evaluate specific tools rather than assuming uniform AI capability.
Data privacy concerns are important when sensitive qualitative data is sent to cloud-based AI services. Participant confidentiality may be compromised, data may be used for model training without consent, and sensitive information may be retained on external servers (Resnik & Hosseini, 2025). Samuel and Wassenaar (2025) highlighted that uploading qualitative data to cloud-based AI services raises specific informed consent challenges, as participants may not fully understand or be able to assess the risks involved. These risks are particularly acute in qualitative research, where interview data often contain rich personal narratives that are difficult to fully de-identify (Montes et al., 2025). While Claude Code operates as a local command-line tool with direct file system access, the data is transmitted to Anthropic’s servers for processing via the API. These considerations informed my decision to work with previously published, de-identified data from a completed PhD thesis (AlGhamdi, 2014), where participant anonymization had already been carried out. Researchers working with sensitive or unpublished qualitative data should carefully review the data handling and retention policies of any AI service before use.
2.8. Claude Code in Academic Research
The landscape of AI-assisted qualitative analysis tools has expanded considerably. Established CAQDAS platforms have introduced AI-powered features: MAXQDA AI Assist offers AI-supported coding suggestions and summaries within the MAXQDA environment, ATLAS.ti AI integrates LLM capabilities for code generation and thematic grouping, and newer platforms such as QInsights provide purpose-built AI-driven qualitative analysis workflows. General-purpose LLMs accessed through web interfaces (e.g., the Claude web interface, ChatGPT, Gemini) have also been employed for thematic analysis tasks, as documented in the studies reviewed in Sections 2.2–2.5.
Claude Code occupies a distinct position within this landscape. Rather than embedding AI assistance within an existing CAQDAS tool or relying on a browser-based conversational interface, Claude Code is an agentic command-line tool that brings a frontier LLM directly into the researcher’s local computing environment (Anthropic, 2024). This architectural distinction carries methodological implications. CAQDAS AI features operate within the constraints of their host platforms, typically assisting with discrete tasks such as suggesting codes for individual segments. Web-based LLM interfaces require manual data upload, are bounded by session and context window limitations, and depend on conversational prompting for each analytical step. Claude Code, by contrast, operates autonomously across the full project directory structure, processes multiple files in sequence without manual intervention, maintains session context throughout extended analytical workflows, and generates structured output artifacts directly to the local file system. In practice, this means that a researcher can direct Claude Code to analyze an entire corpus of interview transcripts through a single methodological instruction, with the tool independently reading, coding, and synthesizing across all documents.
Figure 1 summarizes these technical characteristics:
• Direct file system access - The system can process entire directories of transcripts or documents without manual upload procedures.
• Multi-file synthesis - Claude Code can analyze and cross-reference multiple documents within a single analytical workflow.
• Session persistence - Context is maintained throughout an active session, supporting iterative refinement of coding schemes.
• Structured output generation - The tool can generate formatted artifacts (e.g., Markdown files, HTML reports, tables, figures), facilitating documentation and audit trails.
• Integrated workflow automation - Researchers can execute multi-step analytical procedures within a single command-line environment.
Figure 1. Technical capabilities of Claude Code for qualitative thematic analysis

Despite these distinct capabilities, I found no published peer-reviewed studies that investigate Claude Code as a qualitative thematic analysis tool. This study was conducted in January 2026, a period during which the academic literature on Claude focused primarily on model performance through web-based or API-mediated interactions rather than on the methodological implications of agentic command-line implementations. The present study addresses this gap by providing an empirical evaluation of Claude Code’s analytical capabilities, inter-session consistency, and alignment with established human analysis.
Current documentation of Claude Code remains largely technical and product-oriented, consisting primarily of official documentation and practitioner accounts rather than peer-reviewed methodological evaluations. By treating the tool’s agentic architecture as a methodological variable rather than a neutral interface, this study contributes empirical evidence on how tool design may shape AI-assisted qualitative analysis.
3. Methodology
3.1. Experimental Design
This study used a comparative design examining five independent analyses of the same qualitative dataset: (1) the original human analysis conducted in 2012 using NVivo 8, (2) AI-assisted analysis Session 1 using Claude Code with general prompts, (3) AI-assisted analysis Session 2 using Claude Code with general prompts, (4) AI-assisted analysis Session 3 using Claude Code with structured multi-phase prompts, and (5) AI-assisted analysis Session 4 using Claude Code with structured multi-phase prompts. This five-way comparison enables assessment of alignment (concordance with human analysis), reliability (consistency across AI sessions), and the impact of prompting strategy on analytical outcomes (see Figure 2).
Figure 2. Research experimental design: five-way comparison framework
The analytical approach adopted in this study aligns with the coding reliability tradition within thematic analysis. Braun and Clarke (2021) distinguished thematic analysis as a family of methods encompassing coding reliability approaches that prioritize structured coding and accuracy, codebook approaches such as template and framework analysis, and reflexive thematic analysis that centers researcher subjectivity and interpretive engagement. The original 2012 human analysis followed an inductive content analysis methodology using NVivo, employing structured coding procedures and frequency-based categorization across multiple coding phases. The present study evaluates AI-assisted analysis using metrics of accuracy, alignment, and consistency, quality criteria appropriate to coding reliability approaches, where replicability and agreement are valued indicators of analytical quality. This positioning is distinct from reflexive thematic analysis, where such reliability measures would be considered inappropriate (Braun & Clarke, 2021). The present study does not claim that AI can perform reflexive thematic analysis; rather, it examines whether AI can produce thematic outputs that align with human analysis within a framework where structured coding, consistency, and replicability are the relevant evaluative criteria.
Epistemologically, this study adopts a post-positivist orientation, treating the qualitative data as containing identifiable patterns that can be systematically coded and compared across analysts, whether human or AI. This orientation is consistent with the coding reliability tradition, where analytical quality is assessed through metrics of agreement and consistency rather than through the reflexive engagement of the researcher with the data.
3.2. The Original Dataset
The data comprises 16 semi-structured interview transcripts collected as part of my PhD study examining e-commerce adoption by retail businesses in Saudi Arabia (AlGhamdi, 2014). I conducted interviews with retail managers representing diverse sectors, including electronics, fashion, furniture, and general retail. The interviews explored barriers to e-commerce adoption, current business strategies, perceptions of consumer readiness, and views on the Saudi e-commerce ecosystem.
The original analysis was conducted using NVivo 8, following inductive content analysis methodology. The researcher progressed through multiple coding phases: initial open coding generated 82 codes, which were refined through focused coding to 55 codes, and ultimately consolidated into 22 factors organized across three categories (Consumer-related, Environment-related, and Organization-related). This analysis was later validated through a quantitative survey of 153 retailers, which confirmed the relative importance of the identified factors.
3.3. Analytical Protocol and Process
3.3.1. Prompting Strategies
Two prompting approaches were employed to examine the impact of prompt structure on analytical outcomes: general prompting (2 sessions) and structured multi-phase prompting (2 sessions). Each of the four sessions was conducted in a completely separate Claude Code instance with no memory or context from previous sessions.
3.3.1.1. General Prompting Approach (Sessions 1-2)
Sessions 1 and 2 used a general prompting approach. I directed Claude Code to a folder containing research materials organized under three sub-folders: Objectives & Methodology, Transcripts, and Human Analysis (initially hidden). In a single prompt, I instructed Claude Code to:
• Read the methodology documentation to understand the research context, objectives, and interview protocol.
• Adopt the same methodological process used in the original study.
• Perform thematic analysis on all 16 interview transcripts in the transcripts folder.
• Output results to a designated AI analysis folder.
This approach provided minimal prescriptive guidance, allowing Claude Code to autonomously determine its analytical workflow based on the available content (see Figure 3).
Figure 3. Approach of the general prompt for Claude Code Sessions 1 and 2
3.3.1.2. Structured Multi-phase Prompting Approach (Sessions 3-4)
Sessions 3 and 4 employed a structured prompting approach consisting of three sequential phases, each with detailed instructions.
1. Data familiarization prompt. Claude Code was instructed to read the methodology documentation to understand the research context, objectives, and interview protocol in the original study.
2. Open coding prompt. Claude Code was instructed to read all 16 interview transcripts located in the “Transcripts” folder and to perform line-by-line coding. The instructions directed it to identify meaningful units, assign descriptive labels, track frequency, and generate an initial codebook with definitions and exemplar quotes.
3. Theme development prompt. Claude Code was asked to group related codes into potential themes, review themes against the coded extracts, define and name each theme, and develop a thematic map showing relationships, with definitions and exemplar quotes.
I then directed Claude Code to output structured HTML reports for the codes and themes. These reports can be accessed through the data repository link provided in the data availability statement. Appendix A presents the structured prompts used, and Figure 4 illustrates the steps followed.
Figure 4. Approach of the structured multi-phase prompt for Claude Code Sessions 3 and 4
3.3.2. Quote Verification Protocol
Given concerns in the literature about AI-generated fabricated quotes (Jowsey, Stapleton, et al., 2025), I implemented a systematic verification protocol. From each AI session’s output, I randomly selected 20% of cited quotes (exemplar quotes provided as evidence for codes) for verification. I traced each selected quote to the original interview transcript to confirm: (1) the quote exists verbatim or with minor transcription variations, (2) the quote is attributed to the correct participant, and (3) the contextual meaning aligns with how the quote was used.
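The sampling and tracing steps of this protocol can be sketched in code. The following is a minimal illustration only, not the procedure actually used in the study (verification was performed manually): the function names, the dictionary layout for quotes and transcripts, and the normalization rules for "minor transcription variations" are all assumptions introduced here. It covers criteria (1) and (2); criterion (3), contextual meaning, still requires human judgment.

```python
import random
import re


def normalize(text: str) -> str:
    """Lowercase, unify curly quotes, drop punctuation, collapse whitespace."""
    text = text.lower().replace("’", "'").replace("“", '"').replace("”", '"')
    text = re.sub(r"[^\w\s']", " ", text)  # keep word chars, spaces, apostrophes
    return re.sub(r"\s+", " ", text).strip()


def sample_quotes(quotes, fraction=0.20, seed=0):
    """Draw a reproducible random sample covering `fraction` of the quotes."""
    k = max(1, round(len(quotes) * fraction))
    return random.Random(seed).sample(quotes, k)


def verify_quote(quote, transcripts):
    """Criteria (1) and (2): the quote text occurs in the transcript
    of the participant it is attributed to (hypothetical data layout)."""
    source = transcripts.get(quote["participant"], "")
    return normalize(quote["text"]) in normalize(source)


# Hypothetical example data, not taken from the study's transcripts.
transcripts = {"P1": "Honestly, customers do not trust paying online here."}
quotes = [{"participant": "P1", "text": "customers do not trust paying online"}]

for q in sample_quotes(quotes):
    print(q["participant"], verify_quote(q, transcripts))
```

Fixing the random seed makes the 20% sample reproducible, which helps when the verification is documented as part of an audit trail.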
Verification Results Across the Four Sessions Revealed High Quote Accuracy
Figure 5 illustrates the quote verification process used in this study. It shows a sample of how quotes generated in the Claude Code codebook (upper panel) were traced to their source locations in the original interview transcripts (lower panel). In the example shown, the code “ENV-IGNORANCE-FEAR” (defined as “lack of knowledge breeding fear and reluctance; ignorance about e-commerce leading to avoidance”) includes three example quotes attributed to specific participants. Each quote was verified by locating the corresponding passage in the original transcript, as indicated by the connecting arrows and highlighted text. This verification process confirmed that Claude Code accurately extracted and attributed participant statements.
Figure 5. Quote verification process: tracing AI-generated citations to original interview transcripts
3.4. Alignment Measurement Methods
To evaluate the alignment between Claude Code’s analysis and human analysis, I developed a multi-level assessment framework. This framework examines alignment at two levels: code-level and theme-level. Code-level alignment compares the specific codes identified by each approach. Theme-level alignment compares the higher-order thematic structures that emerged from each analysis. I also assessed inter-session consistency at both levels to evaluate the reproducibility of Claude Code analysis across multiple independent runs. I adopted the F1 Score as the primary alignment metric throughout the analysis. F1 balances precision and recall, providing a single interpretable measure that accounts for both the relevance and completeness of the generated codes and themes.
It is important to note that the human analysis serves as a reference point for comparison rather than an objective gold standard. As a single analyst’s interpretation, it represents one valid reading of the data. However, the original analysis was subsequently validated through quantitative survey research with 153 retailers (AlGhamdi, 2014), which confirmed the relative importance of the identified factors, lending additional credibility to its use as a comparator.
Throughout this study, the term alignment refers to the degree of agreement between AI-generated and human-generated analytical outputs. Alignment with a single human analysis demonstrates concordance but does not by itself establish interpretive validity, which would require evidence of analytical quality beyond agreement with one reference point. The term reliability refers to consistency across independent AI sessions. These operational definitions should not be conflated with broader notions of interpretive validity in qualitative research.
3.4.1. Code-Level Alignment
Code-level alignment was assessed by mapping each Claude Code-identified code to the corresponding human-identified codes based on semantic equivalence. The human analysis produced 51 codes organized under three meta-categories: Consumer Related Factors, Environment Related Factors, and Organization Related Factors. Each of the four AI sessions was independently compared against this reference set of human codes.
To ensure systematic and replicable classification, the following explicit decision rules were applied:
(A) Full Match (Score: 1) applied when:
• The AI code and human code describe the same underlying concept
• The conceptual scope is equivalent (neither broader nor narrower)
• A domain expert would consider them interchangeable labels for the same phenomenon
Example: AI code “CREDIT_CARD_RELUCTANCE” ↔ Human code “Consumers’ reluctance to use credit cards”. Both capture identical consumer payment behavior concerns.
(B) Partial Match (Score: 0.5) applied when any of the following conditions exist:
• The AI code combines two or more human codes into a single construct
• The AI code represents a subset of a broader human code
• There is substantial but not complete conceptual overlap
• The AI code uses a different framing that shifts emphasis while retaining core meaning
Example 1: AI code “DIGITAL_INFRASTRUCTURE_GAPS” ↔ Human codes “Lack of electronic payment systems” + “Logistics and delivery challenges”. The AI code combines two distinct human codes.
Example 2: AI code “TRUST_DEFICIT” ↔ Human code “Lack of consumer trust”. Substantial overlap, but the AI code is slightly broader (includes institutional trust).
(C) Extended/AI-Unique (Score: 0) applied when:
• The AI code represents a concept not present in the human codebook
• No human code captures >25% of the AI code’s conceptual content
• The concept may be valid, but was not identified in the original analysis
Example: AI code “FUTURE_MARKET_OPTIMISM”. No corresponding human code addresses forward-looking market projections.
It should be noted that the classification of matches as full, partial, or extended relied on the researcher’s judgment. While explicit decision rules were applied to ensure systematic and replicable classification, and the complete mapping tables with justifications are provided in Appendix B, the process inherently involves interpretive decisions, particularly in distinguishing between full and partial matches. A different analyst might draw these boundaries differently, which could affect the resulting F1 Scores. To mitigate this, borderline cases were resolved conservatively: where the match between an AI code and a human code was ambiguous, a partial match (0.5) was assigned rather than a full match (1.0). This conservative approach may underestimate alignment in some cases, but it reduces the risk of inflating scores through generous classification.
The F1 Score was calculated using the following formulas:
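Under the scoring rules above, with full matches weighted 1 and partial matches weighted 0.5, the calculation can be written as follows. The symbols are notation assumed here, and the weighted-match form is an assumption; it does, however, reproduce the Session 4 values reported in Section 4.1.1 (37 full and 25 partial matches among 62 AI codes, compared against 51 human codes, giving 79.8% precision, 97.1% recall, and an F1 Score of 87.6%).

```latex
\[
P = \frac{N_{\text{full}} + 0.5\,N_{\text{partial}}}{N_{\text{AI}}},
\qquad
R = \frac{N_{\text{full}} + 0.5\,N_{\text{partial}}}{N_{\text{human}}},
\qquad
F_1 = \frac{2\,P\,R}{P + R}
\]
```

Here $N_{\text{AI}}$ is the total number of codes generated in a session (including extended codes scored 0) and $N_{\text{human}}$ is the number of codes in the human reference set.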
This approach ensures that the alignment metric penalizes both over-generation of codes (low precision) and under-coverage of human concepts (low recall). This provides a balanced assessment of AI performance. Precision measures how relevant or accurate the AI-generated codes are, whereas recall measures how completely the AI captured the human-identified concepts.
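The calculation can be illustrated with a short sketch. It assumes that precision and recall share a weighted match count (full matches count 1, partial matches 0.5), divided by the number of AI codes and the number of human codes, respectively; the function and parameter names are hypothetical. The example uses the Session 4 counts reported in Section 4.1.1.

```python
def alignment_scores(full, partial, n_ai_codes, n_human_codes):
    """Return (precision, recall, F1) from match counts.

    Assumption: full matches are weighted 1.0 and partial matches 0.5,
    and the same weighted sum feeds both precision and recall.
    """
    matched = full * 1.0 + partial * 0.5
    precision = matched / n_ai_codes      # relevance of the AI codes
    recall = matched / n_human_codes      # coverage of the human codes
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Session 4: 37 full and 25 partial matches, 62 AI codes, 51 human codes.
p, r, f1 = alignment_scores(full=37, partial=25, n_ai_codes=62, n_human_codes=51)
print(f"precision={p:.1%} recall={r:.1%} F1={f1:.1%}")
# prints: precision=79.8% recall=97.1% F1=87.6%
```

Because extended codes add to the denominator of precision without adding to the weighted match count, generating many AI-unique codes lowers precision, while missing human codes lowers recall.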
It should be acknowledged that the F1 Score, as applied here, captures structural similarity between AI and human analytical outputs, the extent to which the same concepts were identified and categorized. It does not assess interpretive depth, analytical nuance, or the quality of meaning-making that underlies the coding process. Two analysts may assign the same code label while differing in the richness of their interpretive engagement with the data. This metric is appropriate within the coding reliability framework adopted by this study, but it should not be taken as a comprehensive measure of analytical quality in the broader qualitative sense.
3.4.2. Theme-Level Alignment
Theme-level alignment was assessed by mapping each Claude Code theme to the three human meta-categories: consumer, environment, and organization related factors. A theme was classified as a “Full Match” (scored as 1) when it clearly aligned with a single human meta-category. For example, an AI theme labeled “Trust Deficit” was considered fully aligned with “Consumer Related Factors” as trust is fundamentally a consumer-oriented construct in the context of e-commerce adoption. A “Partial Match” (scored as 0.5) was assigned when a theme spanned multiple human categories or represented a cross-cutting concept. Themes such as “Education as the Catalyst” that addressed both consumer knowledge and organizational training needs received partial match scores. Themes representing entirely new conceptual dimensions not present in the human analysis were classified as “Extended” themes, scoring 0 (see Appendix C).
The same F1 Score formula was applied at the theme level, with precision calculated as the proportion of AI themes that mapped to human categories, and recall calculated as the proportion of human categories covered by Claude Code themes.
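A minimal sketch of the theme-level calculation follows. It assumes that precision is the mean of the per-theme scores (1, 0.5, or 0) and that recall is the share of the three human meta-categories covered by at least one AI theme; the function name and category labels are hypothetical. The example reflects Session 3 as described in Section 4.2.1: six themes, two of them partial matches, with all three human categories covered.

```python
HUMAN_CATEGORIES = {"consumer", "environment", "organization"}


def theme_alignment(theme_scores, covered_categories):
    """Return (precision, recall, F1) for one session's themes.

    theme_scores: one score per AI theme (1.0 full, 0.5 partial, 0.0 extended).
    covered_categories: human meta-categories touched by at least one AI theme.
    """
    precision = sum(theme_scores) / len(theme_scores)
    recall = len(set(covered_categories) & HUMAN_CATEGORIES) / len(HUMAN_CATEGORIES)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Session 3: four full matches and two partial matches, full category coverage.
p, r, f1 = theme_alignment([1, 1, 1, 1, 0.5, 0.5], HUMAN_CATEGORIES)
print(f"precision={p:.1%} recall={r:.1%} F1={f1:.1%}")
# prints: precision=83.3% recall=100.0% F1=90.9%
```

With only three human meta-categories, recall saturates quickly, so theme-level F1 differences between sessions are driven almost entirely by precision.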
3.4.3. Inter-Session Consistency
Inter-session consistency was assessed to evaluate the reproducibility and reliability of Claude Code analysis across independent runs. For each pair of Claude Code sessions (six pairs total: S1-S2, S1-S3, S1-S4, S2-S3, S2-S4, S3-S4), bidirectional mapping was performed at both code and theme levels.
For code-level inter-session analysis, codes from Session A were mapped to semantically equivalent codes in Session B (Forward alignment). Codes from Session B were mapped to Session A (Backward alignment). The inter-session F1 Score was then calculated as the harmonic mean of these bidirectional alignments:
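Written out, with notation assumed here for illustration, let $\mathrm{Fwd}(A \to B)$ denote the forward alignment (the share of Session A codes matched in Session B) and $\mathrm{Bwd}(B \to A)$ the backward alignment (the share of Session B codes matched in Session A). The harmonic mean is then:

```latex
\[
F_1(A, B) =
\frac{2 \cdot \mathrm{Fwd}(A \to B) \cdot \mathrm{Bwd}(B \to A)}
     {\mathrm{Fwd}(A \to B) + \mathrm{Bwd}(B \to A)}
\]
```

Swapping the roles of A and B exchanges the two terms but leaves both the product and the sum unchanged.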
This approach produces a symmetric measure where F1(A,B) = F1(B,A). This ensures that the consistency measure is not biased by which session is used as the reference. The same methodology was applied at the theme level. Themes are mapped between session pairs to assess structural consistency in the higher-order thematic frameworks.
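The pairing and the symmetric score can be sketched as follows. The function name and the alignment values in the example are hypothetical and are not results from this study; only the session labels and the harmonic-mean definition come from the text.

```python
from itertools import combinations

SESSIONS = ["S1", "S2", "S3", "S4"]


def inter_session_f1(forward, backward):
    """Harmonic mean of forward (A to B) and backward (B to A) alignment."""
    if forward + backward == 0:
        return 0.0
    return 2 * forward * backward / (forward + backward)


# All unordered session pairs: S1-S2, S1-S3, S1-S4, S2-S3, S2-S4, S3-S4.
pairs = list(combinations(SESSIONS, 2))
print(len(pairs), pairs)

# Hypothetical alignments: 90% of A's codes found in B, 60% of B's found in A.
score_ab = inter_session_f1(0.90, 0.60)
score_ba = inter_session_f1(0.60, 0.90)
print(score_ab == score_ba)  # the measure does not depend on the reference session
```

The harmonic mean is dominated by the smaller of the two directions, so a session pair scores highly only when each session covers most of the other's codes.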
Sessions were also grouped by prompting approach (general vs. structured prompts) to examine whether the prompting methodology influenced both alignment with human analysis and inter-session consistency.
Figure 6 summarizes the alignment measurement framework. Matching criteria (A) assign scores of 1.0, 0.5, or 0.0 based on conceptual overlap between Claude Code and human codes. The F1 Score (B) balances precision and recall to assess alignment. Assessment occurs at both code and theme levels (C), with inter-session consistency (D) evaluated through bidirectional mapping across six session pairs. The complete flow (E) distinguishes validity (human-Claude Code alignment) from reliability (inter-session consistency).
Figure 6. Alignment measurement framework. (A) Matching criteria for code-level comparison; (B) F1 score calculation; (C) Multi-level assessment at code and theme levels; (D) Bidirectional inter-session consistency; (E) Complete assessment flow
3.5. Technical Specifications
3.5.1. Claude Code Technical Specifications
Claude Code vs. Web Interface Comparison
3.5.2. Session Technical Specifications
Technical Specifications of all Five Analyses
4. Results
4.1. Code-level Alignment Results
4.1.1. Human-AI Code Alignment
Code-Level Human-AI Alignment Metrics
Sessions using structured multi-phase prompts achieved substantially higher alignment with human analysis than sessions using general prompts. Session 4 achieved the highest F1 Score of 87.6%, with a precision of 79.8% and a recall of 97.1%. This suggests that it captured nearly all human-identified concepts while maintaining acceptable precision. Session 3 achieved an F1 Score of 78.5% with the most balanced performance between precision (81.9%) and recall (75.5%). In contrast, Sessions 1 and 2 achieved F1 Scores of 62.4% and 55.9%, respectively. The average F1 Score for general prompts was 59.2%, which is 23.9 percentage points lower than the 83.1% average for structured prompts.
Analysis of the precision-recall trade-off revealed distinct patterns between the prompting approaches. General prompt sessions showed high precision (averaging 87.5%) but low recall (averaging 44.7%). This suggests that while the codes they generated were highly relevant to the human analysis, they missed more than half of the human-identified concepts. These sessions produced fewer codes (26 codes each) but with higher precision. Conversely, structured prompt sessions achieved more comprehensive coverage, with Session 3 producing 47 codes and Session 4 producing 62 unique codes (after removing duplicates). Session 4’s exceptional recall of 97.1% shows that the structured multi-phase prompting approach enabled the AI to capture nearly the complete conceptual landscape identified through traditional human analysis.
The number of full matches versus partial matches also differed by prompting approach. Session 4 achieved 37 full matches and 25 partial matches, while Session 1 achieved 22 full matches and only 4 partial matches. This indicates that structured prompts do more than find additional concepts: they also produce codes that capture relationships between concepts, and such codes partially overlap with multiple human codes.
4.1.2. Code-Level Inter-Session Consistency
Code-Level Inter-Session Consistency (F1 Scores)
Sessions using the same prompting approach exhibited higher internal consistency. The two general prompt sessions (S1 & S2) achieved an F1 Score of 79.8%, while the two structured prompt sessions (S3 & S4) achieved 72.5%. The slightly lower consistency between structured sessions can be attributed to the larger number of codes generated (47 and 62 codes, respectively), which creates more opportunity for variation while still capturing the same core concepts.
Cross-approach comparisons yielded lower F1 Scores, averaging 58.3% across the four general-to-structured session pairs (ranging from 52.2% for S2-S4 to 64.2% for S1-S3). This pattern reflects the fundamental difference in code granularity between the approaches. Structured prompts generate approximately twice as many codes as general prompts.
The directional analysis revealed asymmetric patterns. When mapping general prompt codes to structured prompt sessions, alignment was high, averaging 87.3%. This suggests that structured sessions captured nearly all concepts identified by general sessions. However, the reverse mapping showed lower alignment, averaging 44.5%. This suggests that structured prompts identified many additional codes not present in general prompt sessions. This asymmetry supports the finding that structured prompts produce more comprehensive analyses while maintaining coverage of the core concepts identified through simpler prompting approaches.
4.2. Theme-Level Alignment Results
4.2.1. Human-AI Theme Alignment
Theme-Level Human-AI Alignment Metrics
Session 4 achieved complete theme-level alignment with an F1 Score of 100%, as all six of its themes mapped directly to the human categories with no cross-cutting or extended themes. Sessions 1 and 2 both achieved F1 Scores of 92.3%, with a precision of 85.7% (6 out of 7 themes mapping to human categories) and a recall of 100%. Session 1 included one extended theme (“E-commerce Opportunities & Future Outlook”) that represented a forward-looking perspective not explicitly captured in the human framework. Session 3 achieved an F1 Score of 90.9%, with two themes (“Education as the Catalyst” and “The Mutual Waiting Game”) classified as partial matches due to their cross-cutting nature spanning multiple human categories.
The average theme-level F1 Score for general prompts was 92.3%, compared to 95.5% for structured prompts. While both approaches achieved high theme-level alignment, structured prompts demonstrated slightly better performance, especially with Session 4’s perfect alignment.
Code-level vs Theme-Level Alignment Comparison
The AI sessions produced 6-7 themes compared to the human analysis’s 3 meta-categories, representing a finer level of granularity. These themes are consistent with the traditional human analysis: the additional detail allowed for more precise and nuanced categorization while still aligning clearly with the broader human framework. For example, the human “Consumer Related Factors” category was captured across multiple AI themes, including trust-related themes, cultural/shopping behavior themes, and knowledge/education themes.
4.2.2. Theme-Level Inter-Session Consistency
Theme-Level Inter-session Consistency (F1 Scores)
The two structured prompt sessions (S3 and S4) demonstrated the highest consistency with an F1 Score of 91.7%, suggesting strong agreement in their thematic structures. Both sessions identified themes related to trust, shopping culture, infrastructure and ecosystem, organizational factors, and education. There were only small differences in how these concepts were defined and labeled.
General prompt sessions (S1 and S2) achieved 85.0% consistency, also indicating strong agreement despite differences in theme labeling. For example, Session 1’s “Trust & Security” theme corresponded directly to Session 2’s “Trust Deficit and Risk Perception” theme.
Cross-approach comparisons performed much better at the theme level than at the code level. The average F1 Score was 85.0% at the theme level, compared to 58.3% at the code level, an improvement of 26.7 points. This suggests that different prompting approaches reach similar high-level themes, even if they produce different detailed codes. The pair S2–S4 showed especially high consistency across approaches, with a score of 88.6%. This indicates that the thematic frameworks identified by general and structured prompts are largely compatible.
The overall average theme-level inter-session consistency was 86.1%, representing a 21.9 point improvement over code-level consistency at 64.2%. This finding has important methodological implications. While AI-assisted analysis may show variation in specific code identification across runs, the higher-order thematic structures remain stable and consistent. This suggests that theme-level findings may be more reliable for drawing research conclusions.
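The overall averages can be checked from the figures reported above. The sketch below assumes that each overall average is the unweighted mean over the six session pairs: the two same-approach pairs plus the four cross-approach pairs, for which only the mean is used here. The function name is hypothetical.

```python
def six_pair_mean(same_general, same_structured, cross_mean, n_cross=4):
    """Unweighted mean over two same-approach pairs and n_cross cross pairs."""
    total = same_general + same_structured + n_cross * cross_mean
    return total / (2 + n_cross)


# Code level: S1-S2 = 79.8, S3-S4 = 72.5, cross-approach mean = 58.3.
code_level = six_pair_mean(79.8, 72.5, 58.3)
# Theme level: S1-S2 = 85.0, S3-S4 = 91.7, cross-approach mean = 85.0.
theme_level = six_pair_mean(85.0, 91.7, 85.0)

gap = theme_level - code_level
print(f"code={code_level:.2f} theme={theme_level:.2f} gap={gap:.2f}")
```

The results agree with the reported 64.2%, 86.1%, and 21.9-point difference to within rounding of the inputs.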
4.3. Summary of Alignment Findings
Figure 7 below presents inter-session consistency heatmaps showing F1 Scores between all Claude Code session pairs at the code level (left) and theme level (right). Darker green indicates higher consistency.
Figure 7. Inter-session consistency heatmaps: code-level and theme-level F1 scores
These findings suggest several key patterns. Structured prompts perform much better than general prompts at the code level, with an average F1 Score 23.9 percentage points higher (83.1% vs. 59.2%). At the theme level, the difference is much smaller: the advantage drops to 3.2 percentage points (95.5% vs. 92.3%), because both approaches successfully capture the main high-level themes. Additionally, theme-level alignment is consistently higher than code-level alignment for both human-AI comparison and inter-session consistency, with average improvements of 22.8 and 21.9 percentage points, respectively. The analysis also shows that AI-assisted thematic analysis exhibits strong reproducibility, especially at the theme level, where average inter-session consistency exceeds 86%. This indicates that while specific code identification may vary across runs, the overarching thematic conclusions remain stable.
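The human-AI averages summarized here follow directly from the per-session F1 Scores reported in Sections 4.1.1 and 4.2.1. The sketch below recomputes them; the variable and function names are hypothetical, and small differences from the reported figures reflect rounding of the published per-session values.

```python
from statistics import mean

# Per-session human-AI F1 Scores (percent), as reported in Section 4.
code_f1 = {"S1": 62.4, "S2": 55.9, "S3": 78.5, "S4": 87.6}
theme_f1 = {"S1": 92.3, "S2": 92.3, "S3": 90.9, "S4": 100.0}

GENERAL = ("S1", "S2")
STRUCTURED = ("S3", "S4")


def group_mean(scores, sessions):
    """Mean F1 Score over the named sessions."""
    return mean(scores[s] for s in sessions)


# Advantage of structured over general prompts at each level.
code_gap = group_mean(code_f1, STRUCTURED) - group_mean(code_f1, GENERAL)
theme_gap = group_mean(theme_f1, STRUCTURED) - group_mean(theme_f1, GENERAL)

# Overall gain from code level to theme level across all four sessions.
level_gap = mean(theme_f1.values()) - mean(code_f1.values())

print(f"code gap={code_gap:.2f} theme gap={theme_gap:.2f} level gap={level_gap:.2f}")
```

With only two sessions per prompting condition, these means describe the observed sessions and do not support statistical inference.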
5. Discussion
5.1. Interpretation of Key Findings
This study evaluated the alignment between Claude Code thematic analysis and human analysis across multiple dimensions. The findings suggest several patterns that contribute to our understanding of how LLMs, especially agentic ones, can be used for qualitative data analysis.
5.1.1. Observed Differences Between Prompting Approaches
A key observation from this exploratory study is the difference in performance between prompting approaches. Sessions using structured multi-phase prompts achieved an average F1 Score of 83.1% at the code level, compared to 59.2% for general prompts. While this difference is substantial, the small sample size (n=2 per condition) prevents formal statistical inference, and the observed pattern requires replication before causal conclusions can be drawn. With this caveat, the pattern suggests that the quality of Claude Code’s qualitative analysis may depend on the methodological rigor of the prompting strategy.
General prompts, while achieving high precision (87.5%), suffered from low recall (44.7%), indicating that the AI identified relevant codes but missed more than half of the concepts captured in human analysis. The pattern suggests that without explicit methodological guidance, LLMs may tend toward conservative interpretation, identifying only the most salient themes while overlooking subtler or more nuanced concepts. By contrast, structured prompts achieved a more balanced precision-recall trade-off (80.9% precision, 86.3% recall), enabling comprehensive coverage of the conceptual landscape without substantial loss in relevance.
This finding aligns with emerging best practices in prompt engineering for qualitative research tasks (Tai et al., 2024), which emphasize the importance of explicit methodological scaffolding. The structured approach employed in Sessions 3 and 4 incorporated distinct phases for open coding, axial coding, and selective coding, mirroring the grounded theory methodology of Corbin and Strauss (1990). The 23.9 percentage-point improvement achieved through structured prompting is consistent with broader findings in the literature. Nguyen-Trung (2025) emphasized the importance of explicit methodological scaffolding through the GAITA and ACTOR frameworks, and Naeem et al. (2025) provided similar evidence that phased prompting aligned with Braun and Clarke’s six stages improves AI analytical quality. The present findings provide quantitative support for these largely prescriptive recommendations. The code-level F1 Score of 83.1% achieved by structured prompts also compares favorably with the approximately 80% agreement reported by Castellanos et al. (2025) in healthcare qualitative data and the high congruence reported by Bennis and Mouwafaq (2025) across multiple generative models.
5.1.2. Theme-Level Stability
A key finding is the hierarchical pattern of alignment, where theme-level agreement (93.9% average F1) exceeded code-level agreement (71.1% average F1) across all sessions. This difference suggests that Claude Code analysis exhibits what might be termed hierarchical convergence: while granular code identification varies, the higher-order thematic structures remain stable and consistent with the human analysis.
This pattern has methodological implications. In qualitative research, the primary analytical value typically resides at the thematic level, where patterns are synthesized into meaningful interpretive frameworks. The finding that all four Claude Code sessions achieved 100% recall at the theme level suggests that Claude Code analysis captures the essential conceptual structure of qualitative data, even when lower-level coding exhibits variation.
The hierarchical convergence phenomenon can be understood in terms of abstraction tolerance. At the code level, minor differences in terminology, scope, or conceptual boundaries can produce apparent disagreement even when the underlying concepts are similar. At the theme level, these variations are absorbed into broader categorical structures where semantic equivalence is more readily achieved. This suggests that theme-level findings from Claude Code analysis may be more stable and consistent than code-level findings. This pattern is consistent with observations elsewhere in the literature, though it has not previously been quantified in these terms. Castellanos et al. (2025) found that LLMs agreed with humans on theme alignment for two-thirds of analyzed topics, while noting greater divergence at the level of specific coding. Wachinger et al. (2025) similarly reported that LLM outputs were most convincing at the descriptive theme level. The present study extends these observations by providing a direct quantitative comparison: an average improvement of 22.8 percentage points from code-level to theme-level F1 Scores across all sessions. These findings indicate that hierarchical convergence may be a general property of AI-assisted thematic analysis rather than an artefact of a particular dataset or tool.
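The abstraction-tolerance argument can be illustrated with a toy sketch. All code labels, theme labels, and the code-to-theme mapping below are hypothetical and invented for illustration; they are not drawn from the study's codebooks. Exact string matching also stands in for the semantic mapping used in the study.

```python
def prf(predicted: set[str], reference: set[str]) -> tuple[float, float, float]:
    """Precision, recall, and F1 of a predicted label set against a reference set."""
    matched = len(predicted & reference)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1


# Hypothetical mapping from granular codes to higher-order themes.
theme_of = {
    "delivery delays": "Logistics",
    "courier reliability": "Logistics",
    "postal addressing": "Logistics",
    "card fraud fears": "Trust",
    "payment security": "Trust",
    "seller reputation": "Trust",
}

# Hypothetical code sets: the two analysts share only half of their codes.
human_codes = {"delivery delays", "postal addressing", "card fraud fears", "seller reputation"}
ai_codes = {"delivery delays", "courier reliability", "payment security", "seller reputation"}

_, _, code_f1 = prf(ai_codes, human_codes)

# Roll each code set up to its themes and compare again.
human_themes = {theme_of[code] for code in human_codes}
ai_themes = {theme_of[code] for code in ai_codes}
_, _, theme_f1 = prf(ai_themes, human_themes)

print(f"Code-level F1:  {code_f1:.0%}")   # 50%
print(f"Theme-level F1: {theme_f1:.0%}")  # 100%
```

In this toy case the divergent codes fall under the same two themes, so disagreement at the code level disappears once the codes are aggregated.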
5.1.3. Inter-Session Consistency and Reproducibility
The inter-session consistency analysis revealed that AI-assisted thematic analysis shows strong reproducibility, especially at the theme level (86.1% average F1). Sessions using the same prompting approach showed higher internal consistency (same-type average: 88.4% for themes, 76.2% for codes) compared to cross-approach comparisons (cross-type average: 85.0% for themes, 58.3% for codes).
These findings address a common concern regarding the reliability of AI-assisted analysis: the potential for stochastic variation to produce inconsistent results across runs. While the code-level inter-session consistency (64.2%) indicates meaningful variation in specific code identification, the theme-level consistency (86.1%) suggests that the analytical conclusions remain stable. This pattern suggests that researchers employing Claude Code thematic analysis can have confidence in their thematic findings, though they should exercise appropriate caution when reporting specific codes as definitive.
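The relationship between the same-approach, cross-approach, and overall consistency figures can be sketched as follows. The sketch assumes that Sessions 1 and 2 used general prompts (inferred from Sessions 3 and 4 being the structured sessions) and that the overall averages are taken over all six unordered session pairs; both are assumptions rather than details confirmed by the study.

```python
from itertools import combinations

# Assumed assignment of sessions to prompting conditions.
condition = {1: "general", 2: "general", 3: "structured", 4: "structured"}

pairs = list(combinations(condition, 2))  # all unordered session pairs
same = [p for p in pairs if condition[p[0]] == condition[p[1]]]
cross = [p for p in pairs if condition[p[0]] != condition[p[1]]]


def pooled(same_avg: float, cross_avg: float) -> float:
    """Average over all pairs, weighting each group by its number of pairs."""
    return (len(same) * same_avg + len(cross) * cross_avg) / len(pairs)


theme_overall = pooled(88.4, 85.0)  # reported same-type and cross-type theme averages
code_overall = pooled(76.2, 58.3)   # reported same-type and cross-type code averages

print(f"{len(same)} same-approach pairs, {len(cross)} cross-approach pairs")
print(f"Theme-level overall: {theme_overall:.2f}")
print(f"Code-level overall:  {code_overall:.2f}")
```

Under these assumptions the pooled values agree with the reported overall figures (86.1% for themes, 64.2% for codes) to within rounding of the inputs. Because four of the six pairs are cross-approach, the overall averages sit closer to the cross-approach values.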
The asymmetric directional patterns observed in inter-session analysis, where general prompt codes mapped well to structured sessions (87.3%) but not vice versa (44.5%), highlight the relationship between prompting complexity and analytical depth. Structured prompts appear to generate a superset of concepts that encompasses those identified through simpler approaches while introducing additional nuance and granularity. The inter-session theme-level consistency of 86.1% can be contextualized against existing reliability benchmarks. Jain et al. (2025) reported Cohen’s Kappa values exceeding 0.84 for Claude across six independent runs, while Hila and Hauser (2025) found substantial agreement (κ = 0.76–0.78) for deductive coding tasks. The present study’s findings are broadly consistent with these benchmarks, while adding a dimension not captured by Kappa alone: the distinction between code-level and theme-level consistency. The finding that same-approach sessions achieved higher consistency (76.2% for codes, 88.4% for themes) than cross-approach sessions (58.3% for codes, 85.0% for themes) also aligns with the sensitivity to prompting observed by Borchers et al. (2025), who found that coding accuracy was influenced by specific methodological conditions rather than being uniform across configurations.
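The directional asymmetry noted at the start of this subsection can be illustrated with a small sketch of bidirectional mapping. The code labels are hypothetical, and exact set membership is used as a simplified stand-in for the semantic matching applied in the study.

```python
def coverage(source: set[str], target: set[str]) -> float:
    """Share of labels in `source` that have a counterpart in `target`."""
    return len(source & target) / len(source) if source else 0.0


# Hypothetical code sets: the structured session contains every general-session
# code plus additional, finer-grained ones.
general_codes = {"trust", "delivery", "payment"}
structured_codes = general_codes | {"addressing", "regulation", "awareness"}

general_to_structured = coverage(general_codes, structured_codes)
structured_to_general = coverage(structured_codes, general_codes)

print(f"General -> structured: {general_to_structured:.0%}")  # 100%
print(f"Structured -> general: {structured_to_general:.0%}")  # 50%
```

When one set is close to a superset of the other, mapping from the smaller set succeeds almost everywhere while mapping from the larger set does not, which is the shape of the asymmetry reported above.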
5.2. Bridging a 14-Year Methodological Gap
A unique aspect of this study is the temporal gap between the original human analysis (conducted using NVivo in 2012) and the AI-assisted reanalysis (conducted using Claude Code in 2026). This 14-year interval raises questions about temporal validity and the stability of qualitative findings across analytical eras.
The high alignment achieved, especially Session 4’s 87.6% F1 Score at the code level and 100% at the theme level, suggests that the conceptual structure of the data remains interpretable across this temporal gap. This finding has two implications. First, it demonstrates the temporal stability of the original human analysis, as its conceptual framework could be reproduced over a decade later. Second, it supports the AI’s capacity to engage with historical qualitative data, suggesting potential applications for the re-analysis of archival qualitative datasets.
However, the temporal gap also introduces interpretive complexity. The AI’s extended themes, concepts not present in the original human analysis, may reflect either (a) aspects of the data that were underweighted in the original analysis, or (b) interpretive frameworks that have emerged in the intervening years that shape how the AI reads the data. For example, Session 1’s extended theme “E-commerce Opportunities & Future Outlook” may reflect contemporary discourse around digital transformation that was less prominent in 2012. This highlights the need for careful consideration of how temporal context shapes AI-assisted interpretation. The capacity of AI to engage with historical qualitative data has received little attention in the existing literature. Most validation studies compare AI and human analyses conducted within the same time period (e.g., Castellanos et al., 2025; Montes et al., 2025; Wachinger et al., 2025). The present study’s 14-year gap between original and AI analysis provides a distinctive test. The high alignment achieved suggests that the conceptual structure embedded in qualitative data can be recovered across significant temporal distances, though researchers should remain attentive to the possibility that AI interpretive frameworks reflect contemporary rather than historical analytical sensibilities.
5.3. Addressing Methodological Critiques
The scholarly debate over GenAI in qualitative research (discussed in Section 2.6) raises important considerations for interpreting the findings of this study. Jowsey, Braun, et al. (2025) argued that reflexive thematic analysis requires a subjective, positioned, reflexive researcher and that AI involvement is, therefore, methodologically incongruent. Friese et al. (2026), De Paoli (2026), and Greenhalgh (2026) countered that the critical question is not whether AI is involved, but whether interpretive authority remains with the human researcher. As noted in Section 3.1, the present study operates within the coding reliability tradition of thematic analysis, where consistency, alignment, and replicability are appropriate quality criteria, a distinct tradition from the reflexive thematic analysis that is the focus of Jowsey et al.'s critique.
Nonetheless, the findings of this study offer empirical evidence relevant to the broader debate. The importance of structured prompting, which improved code-level alignment by 23.9 percentage points, indicates that AI does not autonomously produce quality analysis. The quality of the output directly reflected the methodological scaffolding provided by the researcher. This supports the position advanced by Friese et al. (2026) that AI can function as a legitimate analytical support when used under close researcher direction, rather than as an autonomous interpreter. The AI sessions in this study did not engage in reflexive practice; they operated as methodologically directed instruments responding to human-designed analytical protocols.
Furthermore, the study was not designed as an AI-led reflexive thematic analysis. The original 2012 analysis was conducted by a human researcher using structured, inductive content analysis with NVivo. The 2026 AI analyses serve as independent analytical perspectives compared to human analysis. This triangulation design allows identification of where AI and human analyses converge (suggesting stable, perspective-independent patterns) and where they diverge (suggesting areas requiring deeper interpretive engagement). Rather than displacing human analytical authority, this approach complements it by providing an external reference point.
The application context also plays a role. The present study examined business interview data concerning e-commerce adoption, an applied organizational context that differs from domains involving high levels of cultural, emotional, or political sensitivity. Prior research has identified limitations in AI performance in culturally nuanced contexts (Sakaguchi et al., 2025). Accordingly, the findings should be interpreted as context-specific rather than universally generalizable. While the results support the viability of AI-assisted analysis within certain applied research domains, they do not imply equivalence or adequacy in all qualitative contexts, particularly those requiring deep cultural or interpretive sensitivity.
5.4. Claude Code as a Research Tool
This study provides one of the first academic evaluations of Claude Code for qualitative research, addressing the gap identified in Section 2.8. The agentic capabilities described therein proved valuable in practice. Direct file system access enabled seamless processing of all 16 transcripts without manual uploads, while autonomous multi-file processing allowed analysis of the complete corpus in single sessions. Session persistence maintained analytical context throughout extended workflows, and structured output generation produced formatted HTML reports directly to the local file system.
These capabilities distinguish Claude Code from web-based interfaces that require copy-paste workflows and from API-based approaches that require custom development. For researchers without programming expertise who want to incorporate AI assistance, Claude Code offers an accessible entry point. The approximately 30-minute processing time per session contrasts with the weeks required for traditional analysis. However, this efficiency comparison must be contextualized by the fact that human analysis involves reflexive engagement that AI cannot replicate.
5.5. Implications
The implications of this study span methodological, practical, and conceptual dimensions. While they are presented below under these three headings for clarity, the categories are intertwined in practice: a methodological choice often carries practical consequences, and a practical pattern can prompt conceptual reflection. The headings are intended to aid navigation rather than to draw firm lines between related insights.
5.5.1. Methodological Implications
The multi-level alignment framework developed in this study, together with the F1 Score metric it applies, offers a replicable approach for evaluating AI-human concordance in qualitative research. Existing studies have relied on simple agreement percentages or Cohen’s Kappa (Hila & Hauser, 2025; Jain et al., 2025), which do not distinguish between precision and recall. The F1 Score captures both dimensions, enabling researchers to identify whether an AI tool is generating relevant codes (precision) or capturing the full range of human-identified concepts (recall). This distinction proved critical in the present study: general prompts achieved high precision (87.5%) but low recall (44.7%), a pattern that would be obscured by a single agreement metric.
The inter-session replication design also provides a template for assessing AI reliability. By conducting multiple independent sessions with varying prompting strategies, researchers can evaluate both alignment with human analysis and reliability within a single study. This dual assessment addresses a gap identified in the literature, where alignment and reliability have typically been examined in isolation (Jain et al., 2025).
5.5.2. Practical Implications
The findings suggest several models for integrating AI-assisted analysis into qualitative research workflows.
First, AI analysis can serve as a first-pass analytical tool, generating initial codes and themes that human researchers subsequently refine, validate, and interpret. This model leverages AI efficiency while preserving human interpretive authority. The high alignment achieved in this study suggests that AI-generated frameworks can provide reliable scaffolding for subsequent human analysis.
Second, AI analysis can be conducted in parallel with human analysis, and the results compared to identify areas of convergence and divergence. Areas of agreement may be treated with higher confidence, while divergent findings prompt deeper examination. This study’s alignment metrics could serve as benchmarks for such comparative evaluation.
Third, AI analysis can be used for preliminary exploration of large datasets, identifying candidate themes and patterns that human researchers subsequently investigate through traditional methods. The AI’s capacity to process large volumes of text efficiently makes it suitable for large-scale qualitative data applications.
Across all models, the finding that structured prompting outperformed general prompting by 23.9 percentage points at the code level provides clear practical guidance: researchers should invest in methodological scaffolding when designing prompts, mirroring established analytical phases rather than relying on open-ended instructions.
5.5.3. Conceptual Implications
The hierarchical convergence pattern described in Section 5.1.2 has implications for how AI-assisted findings should be reported. Thematic conclusions from AI analysis may be sufficiently stable for informing research findings, while code-level details should be treated as indicative rather than definitive. However, the present study examined a single dataset in an applied business context, and the generalizability of this pattern remains an open question. Future research should examine whether hierarchical convergence holds in domains requiring deeper cultural or interpretive sensitivity.
The importance of structured prompting also raises a broader point about the nature of AI-assisted qualitative analysis. The AI functions as a methodologically-directed instrument rather than an autonomous interpreter; the quality of its output reflects the quality of the methodological scaffolding provided by human researchers. This suggests a collaborative model where human researchers retain interpretive authority while delegating certain analytical tasks to AI systems, reconceptualizing the researcher’s role toward methodological design, prompt engineering, and critical evaluation of AI-generated outputs.
5.6. Limitations
This study should be interpreted in light of several limitations that define the scope of its contributions.
First, the human analysis used as the reference point was conducted by a single researcher. Although the original analysis followed a structured coding process and was later supported by quantitative validation with a larger sample, it remains one valid interpretation rather than an objective standard. As such, the alignment metrics reported in this study reflect concordance with this specific analytical perspective rather than definitive correctness. Future research could strengthen this design by incorporating multiple human coders and reporting inter-rater reliability, enabling comparison between AI-human and human-human agreement.
Second, the F1 Score captures structural similarity between AI and human outputs but does not assess interpretive depth, reflexivity, or the richness of meaning-making that characterize other qualitative traditions, particularly reflexive thematic analysis. The findings should therefore be interpreted within the coding reliability framework adopted by this study and not generalized to all forms of qualitative inquiry.
Third, the study is based on a single dataset drawn from interviews on e-commerce adoption in Saudi Arabia. While this dataset provides a complete analytical arc from raw data to validated thematic structure, the findings may not generalize to other domains, types of qualitative data, or research contexts. In particular, datasets requiring deeper latent interpretation, culturally embedded meaning, or highly specialized domain knowledge may yield different patterns of alignment. Replication across diverse datasets is necessary to assess the robustness of the observed multi-level alignment pattern.
Fourth, the study design included only two sessions per prompting condition, which limits statistical inference. While the observed 23.9 percentage-point difference between approaches is larger than within-approach variation, the findings should be considered exploratory evidence rather than statistically confirmed effects. Expanding the number of sessions in future research would enable a more rigorous comparison of prompting strategies.
Fifth, two tool-specific limitations should be acknowledged. The possibility of indirect exposure of the dataset to the AI model during training cannot be entirely excluded. The original PhD thesis is publicly available, and while no prior coding or analytical results were provided during the AI sessions, the potential for partial memorization remains a theoretical concern. However, the observed variability across AI sessions suggests that outputs were not deterministic reproductions of a fixed source. Additionally, the study focuses on a single AI system (Claude Code, Opus 4.5) and a specific agentic workflow. Given the rapid evolution of LLMs and differences in architecture and training data, the results may not generalize to other models or tools. Future research using novel datasets and comparative multi-system designs would address both concerns.
Despite these limitations, the study provides a controlled and transparent experimental framework that isolates key variables (prompting strategy, analytical level, and inter-session consistency), offering a foundation for cumulative research on AI-assisted qualitative analysis.
6. Conclusion
This study shows that alignment between AI-assisted and human thematic analysis is inherently multi-level. While agreement at the level of individual codes is variable, higher-order thematic structures consistently converge. This finding challenges the assumption that close correspondence in coding is necessary for meaningful qualitative outcomes and instead highlights a hierarchical pattern of convergence, where variability at the micro-level does not prevent stability at the macro-level.
The results show that, under appropriate conditions, AI-assisted analysis can achieve strong alignment with human interpretation. The best-performing session reached 87.6% F1 at the code level and 100% at the theme level, indicating that AI systems are capable of both detailed pattern recognition and coherent conceptual synthesis. However, this capability is not inherent to the technology alone. The observed 23.9 percentage-point improvement in code-level alignment under structured prompting underscores the importance of methodological scaffolding. AI systems do not independently reproduce rigorous analytical processes; they require explicit guidance shaped by qualitative research principles.
The consistency of theme-level findings across independent sessions (86.1%) provides evidence that AI-assisted analysis can produce reproducible higher-order interpretations, even when lower-level coding varies. This has important implications for practice. It suggests that AI-generated themes may be sufficiently stable to inform research conclusions, while code-level outputs should be interpreted as provisional and subject to researcher validation.
These findings should be interpreted within the boundaries of this study, including the use of a single dataset, a coding reliability analytical framework, and a specific AI system (Claude Code with the Opus 4.5 model). Within these constraints, the results provide evidence that AI-assisted thematic analysis can function as a methodologically sound complement to human analysis (Figure 8).
Figure 8. Summary of key findings and practical guidance for AI-assisted thematic analysis
As qualitative research increasingly engages with large and complex datasets, AI-assisted approaches offer a practical pathway for extending analytical capacity. Their value lies not in replacing human interpretation, but in supporting it. When guided by structured prompting, critical oversight, and clear methodological framing, AI systems can contribute meaningfully to qualitative inquiry. Rather than redefining qualitative analysis, they reshape how it is operationalized, positioning the researcher as both analyst and methodological designer of AI-assisted processes.
Supplemental Material
Supplemental material for From Code Variability to Theme Convergence: AI–Human Alignment in Thematic Analysis With Claude Code by Rayed AlGhamdi in International Journal of Qualitative Methods
Footnotes
Acknowledgments
The author acknowledges with thanks the DSR for technical and financial support.
Funding
The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This project was funded by the Deanship of Scientific Research (DSR) at King Abdulaziz University, Jeddah, Saudi Arabia, under grant no. (IPP:572-611-2025).
Declaration of Conflicting Interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
The qualitative dataset analyzed in this study originates from a previously published doctoral thesis (AlGhamdi, 2014). To support transparency and reproducibility, all materials used in the Claude Code reanalysis are openly available via the Open Science Framework at
. The repository is organized into three folders: (1) Claude Code Analysis, containing the complete outputs from all sessions, including codebooks, thematic reports, and prompting scripts; (2) Human Analysis, containing the original PhD thesis along with the research problem, questions, and methodology extracted from it, the corresponding NVivo coding files, and two Excel sheets providing side-by-side comparisons of codes and themes across the human analysis and all four Claude Code sessions; and (3) Transcripts, containing the anonymized interview transcripts used in both the original and AI-assisted analyses. All data are fully anonymized and contain no personally identifiable information.
Use of AI and Data Handling Statement
AI Tools Used - This study employed Claude Code (Anthropic’s command-line interface, powered by Claude Opus 4.5, model ID: claude-opus-4-5-20251101) as the primary analytical tool for AI-assisted thematic analysis. Additionally, the same model was used to support language refinement and improve the clarity, flow, and structure of the manuscript during the writing phase.
Data Processing and Privacy - Claude Code operates as a local command-line tool with direct file system access, processing data through Anthropic’s API. The interview transcripts used in this study were from a previously published PhD thesis (AlGhamdi, 2014) and contained no personally identifiable information, as all participant data had been anonymized in the original study. Researchers using Claude Code with sensitive or unpublished data should review Anthropic’s data retention policies and consider local processing options where available.
Reproducibility Considerations - AI model outputs may vary across different sessions and model versions. The specific model version (claude-opus-4-5-20251101) is documented to enable future comparison studies. Researchers attempting replication should note that subsequent model updates may produce different results, and exact prompts should be used to maximize comparability.
The author retains full responsibility for the accuracy, originality, and integrity of the work.
References
