From Code Variability to Theme Convergence: AI–Human Alignment in Thematic Analysis With Claude Code

Abstract

Purpose

This study examines AI–human alignment in thematic analysis from a multi-level perspective, asking whether agreement at the level of individual codes is necessary for convergence in higher-order themes. While prior research often evaluates overall agreement, this study distinguishes between code-level variability and theme-level stability to provide a more nuanced assessment of AI-assisted qualitative analysis.

Methods

Using qualitative interview data from a doctoral study on e-commerce, this study compared the original human thematic analysis (2012, NVivo) with four independent AI-assisted analyses conducted in 2026 using Claude Code. Two prompting strategies were tested: general prompts and structured multi-phase prompts. Alignment was assessed at both code and theme levels using the F1 Score, while inter-session consistency was evaluated through bidirectional mapping across session pairs.

Findings

At the code level, AI outputs showed considerable variability, with F1 Scores ranging from 55.9% to 87.6% and clear differences between prompting approaches (structured: 83.1% average vs. general: 59.2%). In contrast, theme-level alignment remained consistently high across all sessions (90.9%–100%), with strong inter-session consistency (average 86.1%). These findings indicate that although AI-generated codes may differ across sessions and from human coding, the resulting thematic structures converge reliably.

Originality

This study introduces a multi-level alignment framework and provides one of the first empirical evaluations of Claude Code as an agentic AI tool for thematic analysis. The 14-year gap between the original human analysis and AI reanalysis offers a distinctive test of AI engagement with historical qualitative data. The study identifies a pattern of hierarchical convergence, where code-level divergence coexists with theme-level stability.

Implications

The findings suggest that strict code-level agreement may not be necessary for reliable thematic conclusions. AI-assisted analysis can support theme development when structured prompting and human oversight are maintained, offering methodological guidance for integrating AI into qualitative research while preserving analytical rigor.

Keywords

AI-assisted qualitative analysis; thematic analysis; LLMs; prompt engineering; qualitative research methods; NVivo; Claude Code

1. Introduction

Artificial intelligence (AI) is increasingly used to support qualitative data analysis, yet a key question remains unresolved: must AI reproduce human coding decisions to generate meaningful thematic insights? Many existing evaluations report AI-human agreement at a single level of analysis, without systematically distinguishing between code-level and theme-level alignment (Bennis & Mouwafaq, 2025; Hila & Hauser, 2025; Jain et al., 2025). This paper challenges that assumption. Drawing on a multi-level alignment framework, it examines whether AI-human agreement operates differently at the code level and the theme level, and what this distinction means for how AI-assisted qualitative analysis should be evaluated, interpreted, and integrated into research practice. The significance of this question depends on which tradition of thematic analysis is being considered.

Thematic analysis has been fundamental to social science inquiry for decades. Braun and Clarke (2006) provided an influential framework for identifying, analyzing, and reporting patterns within qualitative data. In subsequent work, they distinguished thematic analysis as a family of methods encompassing coding reliability approaches that prioritize structured coding and accuracy, codebook approaches such as template and framework analysis, and reflexive thematic analysis that centers researcher subjectivity and interpretive engagement (Braun & Clarke, 2021). This diversity within thematic analysis has implications for how AI-assisted approaches are evaluated, as the quality criteria appropriate for each tradition differ. Understanding this landscape is essential for situating the growing body of empirical evidence on AI-assisted analysis.

Large Language Models (LLMs) with advanced natural language understanding capabilities have opened new possibilities for qualitative research. These AI systems can process large amounts of text, identify patterns, and categorize information in ways that may complement human analytical capabilities. Recent empirical studies have demonstrated approximately 80% agreement between LLM and human thematic analysis (Castellanos et al., 2025), and evaluators have preferred LLM-generated codes 61% of the time for their analytical utility (Montes et al., 2025). Despite these encouraging findings, fundamental questions remain about the reliability and practical applicability of AI-assisted analysis, and critically, whether existing evaluations capture the right dimensions of agreement. These questions become particularly salient when examining AI tools that move beyond conversational interfaces into agentic, file-system-integrated environments.

This paper explores an area that has received little attention in the emerging literature on AI-assisted qualitative research. Despite the growing landscape of AI-integrated qualitative tools, I could not identify any published academic study examining Claude Code, Anthropic’s command-line agentic implementation of the Claude LLM, for qualitative research purposes. Unlike AI features embedded within CAQDAS platforms or web-based LLM interfaces, Claude Code functions as a standalone agentic environment that enables the AI to independently navigate project directories, process entire transcript corpora, and produce formatted analytical artifacts. At the time this study was conducted (January 2026), no peer-reviewed research had examined this class of agentic AI tool for qualitative thematic analysis.

1.1. Research Problem and Significance

Despite the growing number of AI-assisted analytical tools, fundamental questions remain about their alignment and reliability for qualitative research. Can AI systems achieve meaningful alignment with human analytical judgments? Under what conditions is such alignment maximized? Are AI-generated findings consistent enough across independent runs to support research conclusions? These questions go beyond technical concerns. They touch on the foundations of qualitative inquiry: the nature of interpretation, the role of human judgment, and the criteria for evaluating analytical quality (Braun & Clarke, 2021; Chatzichristos, 2025).

Critics of AI-assisted qualitative analysis have raised legitimate concerns. Qualitative interpretation, they argue, is a fundamentally human activity that draws upon contextual knowledge, cultural understanding, and embodied experience that AI systems cannot possess (Braun & Clarke, 2021; Sakaguchi et al., 2025). The meaning-making process central to qualitative research requires subjective engagement with data that may not be replicable through algorithmic processing (Jowsey, Braun, et al., 2025). In addition, the black-box nature of AI systems raises questions about the transparency and auditability of AI-generated analytical outputs (Resnik & Hosseini, 2025).

Proponents counter that AI systems need not replicate human cognition to produce analytical outputs that align with human analysis. If AI-generated themes show concordance with those produced through human analysis, and if this alignment is consistent across multiple runs, then AI-assisted approaches may complement traditional methods (Castellanos et al., 2025; Montes et al., 2025). The empirical question of alignment should guide assessments of AI utility for qualitative research, rather than philosophical debates about machine understanding.

The qualitative research community lacks systematic empirical evidence on the alignment between AI-assisted and human thematic analysis. A systematic mapping study of LLM applications in qualitative research found that 75% of relevant publications appeared in 2024 alone, which shows how recent the field is and helps explain why rigorous validation studies comparing AI outputs to established human analyses remain scarce (Barros et al., 2025). Without such evidence, researchers cannot make informed decisions about when and how AI-assisted approaches should be integrated into qualitative research practice. Some scholars have formally rejected GenAI use for reflexive qualitative approaches (Jowsey, Braun, et al., 2025), though this position has been challenged by researchers who advocate for critical, researcher-led engagement with AI tools (De Paoli, 2026; Friese et al., 2026). This ongoing debate highlights the need for rigorous empirical studies that examine both alignment (concordance with human analysis) and reliability (consistency across sessions).

The present study addresses this need through an experimental design that offers certain advantages. By applying AI analysis to interview data that was analyzed 14 years ago using established methods, we can assess AI performance against an established reference point. The original analysis was subsequently validated through quantitative survey research with 153 retailers, lending additional credibility to its use as a comparator, though it remains one analyst’s interpretation rather than an objective standard. In addition, by conducting four independent AI analysis sessions, we can evaluate the consistency and reliability of AI-assisted analysis, a dimension that has received limited attention in existing literature.

1.2. Research Context

This study addresses the alignment question through analysis of qualitative data from my doctoral research project on the diffusion of online retailing adoption in Saudi Arabia. The original study, conducted in 2012, used semi-structured interviews with retailers and industry experts to explore the factors influencing e-commerce adoption in the Saudi context. I conducted thematic analysis of the interview transcripts using NVivo software, which yielded 51 codes organized under three meta-categories: Consumer Related Factors, Environment Related Factors, and Organization Related Factors.

This dataset is suitable for AI-human alignment assessment for several reasons. The analysis was conducted over a decade ago, and the AI sessions were run without providing the original coding or thematic framework. However, since the original PhD thesis is publicly available, the possibility that the AI model encountered elements of this work during training cannot be fully ruled out. While this does not eliminate concerns about indirect exposure, the procedural separation between the AI sessions and the original analysis, combined with the 14-year temporal gap, provides a reasonable degree of analytical independence. The thematic framework is grounded in established theoretical models: Diffusion of Innovation (DOI) theory, the Technology-Organization-Environment (TOE) framework, and the Stages of Growth for E-commerce (SOG-e) model. These models provide clear conceptual anchors against which AI performance can be evaluated. The dataset also represents a complete analytical arc from raw interview data through codes to themes, enabling assessment at multiple levels of analytical detail.

The 14-year gap between the original human analysis (2012) and the AI reanalysis (2026) introduces both challenges and opportunities. This gap may introduce confounds related to evolving interpretive frameworks and AI training data. At the same time, it provides an opportunity to assess the temporal stability of qualitative findings and the capacity of AI systems to engage with historical qualitative data.

1.3. Research Objectives

The objectives of this methodological experiment are:

1. Conduct an AI-assisted thematic analysis of qualitative interview data using Claude Code without providing prior human analysis results during the analytical sessions

2. Conduct multiple independent AI analysis sessions using different prompting strategies (general vs. structured) to test replication and assess reliability

3. Compare the themes, codes, and factors identified by the AI sessions with those from the original human NVivo analysis

4. Assess inter-session consistency between the four AI analyses

5. Document the methodological process and Claude Code capabilities to enable replication

1.4. Contribution to Knowledge

This study contributes to knowledge on AI-assisted qualitative research in three areas: methodological, practical, and conceptual. These categories are used here as an organizing device rather than strict boundaries; the contributions are interconnected by nature, and insights in one area often carry weight in the others.

Methodological and empirical contribution - The multi-level alignment framework developed in this study provides a systematic, replicable methodology for evaluating AI-human concordance in thematic analysis. The F1 Score serves as a balanced metric that accounts for both precision (relevance of AI codes) and recall (coverage of human concepts), offering a more nuanced assessment than simple agreement percentages. By applying this framework across four independent AI sessions and comparing results against established human analysis, the study provides quantitative benchmarks that can inform future research design and tool evaluation.
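To make the metric concrete, the following minimal Python sketch shows one way precision, recall, and the F1 Score could be computed once a reviewer has judged which AI codes correspond to which human codes. The function name, the pair-based representation of matches, and the toy code labels are illustrative assumptions; they are not the study's data or its exact matching procedure.

```python
def alignment_scores(ai_codes, human_codes, matches):
    """Return (precision, recall, f1) for one AI session against a human codebook.

    ai_codes / human_codes: sets of code labels.
    matches: set of (ai_code, human_code) pairs judged semantically equivalent.
    All names and the matching scheme are hypothetical, for illustration only.
    """
    # Keep only pairs whose members actually belong to the two code sets.
    valid = {(a, h) for a, h in matches if a in ai_codes and h in human_codes}
    matched_ai = {a for a, _ in valid}        # AI codes with a human counterpart
    matched_human = {h for _, h in valid}     # human codes covered by the AI

    precision = len(matched_ai) / len(ai_codes) if ai_codes else 0.0
    recall = len(matched_human) / len(human_codes) if human_codes else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Toy data with invented labels (not the study's codes).
ai = {"payment trust", "delivery logistics", "staff IT skills", "brand image"}
human = {"trust", "logistics", "IT skills", "government support", "cost"}
judged = {
    ("payment trust", "trust"),
    ("delivery logistics", "logistics"),
    ("staff IT skills", "IT skills"),
}

precision, recall, f1 = alignment_scores(ai, human, judged)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

In this toy case three of the four AI codes have a human counterpart (precision 0.75) and three of the five human codes are covered (recall 0.60), so the F1 Score is about 0.67. A simple agreement percentage would hide which of the two shortfalls is responsible.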

Practical contribution - The comparative evaluation of prompting approaches provides actionable guidance for researchers. Understanding which prompting strategies maximize alignment can help researchers implement AI-assisted analysis more effectively. The documentation of Claude Code’s agentic workflow also supports replication by other researchers.

Conceptual contribution - The study identifies an empirical pattern of hierarchical convergence, where theme-level alignment consistently exceeds code-level alignment across all sessions and comparison types. This observation has implications for how AI-assisted findings should be interpreted and reported, suggesting that thematic conclusions may be more reliable than code-level details. While this pattern requires further investigation across different datasets and domains, it offers an initial conceptual lens for understanding AI performance in qualitative analysis.

2. Literature Review

2.1. The Emergence of AI in Qualitative Research

The integration of AI into qualitative research is among the notable methodological developments of the past decade. Computer-Assisted Qualitative Data Analysis Software (CAQDAS) such as NVivo, ATLAS.ti, and MAXQDA have long supported researchers with data organization, coding management, and retrieval functions. However, these tools maintained the researcher’s exclusive interpretive authority (Mehta et al., 2025). Generative AI (GenAI) changes this by enabling active participation in the analytical process itself: suggesting codes, identifying themes, and generating interpretations that previously required human cognitive engagement.

This transformation has accelerated rapidly. A systematic mapping study by Barros et al. (2025) examining LLM applications in qualitative research found that 75% of relevant publications appeared in 2024 alone, a sign of how quickly interest in this methodological development is growing. The field spans diverse domains, including healthcare (Castellanos et al., 2025; Sakaguchi et al., 2025), education (Tai et al., 2024), policy research (Liu & Sun, 2025), and software engineering (Montes et al., 2025).

Several technological advances make this research timely. Modern LLMs demonstrate sophisticated language understanding, context retention across lengthy documents, and the ability to follow complex analytical instructions. Anthropic’s analysis of 308,210 real-world Claude conversations revealed that the model mirrors users’ values in 28.2% of interactions while pushing back on approximately 3% of requests deemed inappropriate (Huang et al., 2025). This suggests a capacity for nuanced engagement that may be relevant to qualitative research requiring interpretive sensitivity.

2.2. Empirical Evidence for LLM-Human Agreement

The central question for AI-assisted qualitative research is whether LLMs can produce analytical outputs that align with human expert judgment. The empirical evidence on this question has been encouraging, though not without important caveats. Castellanos et al. (2025) examined thematic summarization in healthcare qualitative data and found 80% agreement between LLM and human interpretation. LLMs agreed with humans on theme alignment and convergence for two-thirds of the analyzed topics. Wachinger et al. (2025) found that ChatGPT’s results were “particularly convincing for the identification of descriptive themes.”

Montes et al. (2025) reported that evaluators preferred LLM-generated codes over human codes 61% of the time, judging them more analytically useful for answering research questions. This result, that AI-generated codes may in many cases exceed human codes in utility, challenges the assumption that human analysis is superior on every dimension. However, the advantage comes with important caveats, particularly regarding the distinction between descriptive and interpretive coding.

The distinction between descriptive and interpretive coding is important. Multiple studies find that LLMs excel at identifying surface-level, explicit themes but struggle with latent, implied meanings that require cultural or contextual understanding. Castellanos et al. (2025) found that LLMs were less successful at identifying subtle, interpretive themes compared to concrete, descriptive ones, and may miss themes requiring deep contextual or domain knowledge. Wachinger et al. (2025) similarly reported that LLM-generated results were particularly convincing for descriptive themes. Montes et al. (2025) noted that LLMs missed latent interpretations and produced themes with unclear boundaries. Research in Japanese clinical contexts further highlighted cultural interpretation as a specific challenge (Sakaguchi et al., 2025). These findings suggest that LLMs may be most useful for preliminary or descriptive analysis, with human researchers retaining authority over interpretive depth.

2.3. The Reliability Challenge

While alignment with human analysis has received considerable attention, the reliability dimension, consistency across independent applications, remains understudied. This gap is problematic because reliability is foundational to any analytical method’s scientific credibility. The present study’s inter-session replication design addresses this issue, but the existing literature provides limited benchmarks.

Jain et al. (2025) developed a framework for assessing reliability through dual metrics combining Cohen’s Kappa and semantic similarity. They tested Claude 3.5 Sonnet, Gemini 2.5 Pro, and GPT-4o through six independent runs per model on interview transcripts. Gemini achieved the highest reliability (κ > 0.90), followed by GPT-4o (κ > 0.85) and Claude (κ > 0.84). These differences suggest that model selection affects consistency.
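For readers less familiar with the statistic, Cohen's Kappa corrects raw agreement for the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is chance agreement derived from each rater's label frequencies. The sketch below is a minimal illustration for two coding runs over the same segments. The function name and toy labels are assumptions, and it covers only the Kappa half of the dual metric, not the semantic similarity component or the cited authors' implementation.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both runs must label the same, non-empty set of segments")
    n = len(labels_a)

    # Observed agreement: share of segments given the same code in both runs.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of the two runs' marginal proportions per code.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)

    if expected == 1.0:  # both runs used one identical code throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical codes assigned to six transcript segments by two independent runs.
run_1 = ["cost", "trust", "trust", "cost", "logistics", "trust"]
run_2 = ["cost", "trust", "cost", "cost", "logistics", "trust"]
print(round(cohens_kappa(run_1, run_2), 3))
```

The two toy runs agree on five of six segments (raw agreement 0.83), but Kappa is lower, about 0.74, because some of that agreement would be expected by chance given how often each code is used.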

Borchers et al. (2025) examined whether multi-agent consensus approaches could improve coding accuracy. Testing six LLMs ranging from 3 to 32 billion parameters, they found that consensus-making only improves accuracy under specific conditions: low temperature settings, a single LLM, and a single code type. Multi-agent approaches showed minimal accuracy gains overall, suggesting that simpler approaches may be preferable for achieving reliable results.

The inter-rater reliability metrics from existing studies provide benchmarks for contextualizing our findings. Research indicates that LLMs achieve substantial agreement (κ = 0.76–0.78) for deductive coding with pre-established codebooks, but only moderate agreement (κ = 0.54–0.57) for inductive coding tasks (Hila & Hauser, 2025; Zhang et al., 2025). This difference in performance on deductive versus inductive tasks has implications for methodological design: providing clear frameworks and examples improves LLM reliability.
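The verbal labels "substantial" and "moderate" match the widely used Landis and Koch benchmark bands for Kappa; that the cited studies follow this convention is an assumption made here for illustration. A small helper of this kind (the function name is hypothetical) makes the mapping explicit:

```python
def kappa_label(kappa):
    """Map a kappa value to the Landis and Koch verbal benchmark (assumed convention)."""
    if kappa < 0:
        return "poor"
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")


for value in (0.54, 0.76, 0.90):
    print(value, kappa_label(value))
```

Under these bands, the deductive range quoted above falls in "substantial" and the inductive range in "moderate".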

2.4. Model Comparison and Selection

The fast-changing field of LLMs raises practical questions about which model researchers should select for qualitative analysis. Comparative studies have begun evaluating different models across dimensions relevant to research applications. Bennis and Mouwafaq (2025) conducted a comprehensive multi-model comparison, testing nine GenAI models on thematic analysis of Cutaneous Leishmaniasis qualitative data from 448 participant responses. The models tested included Claude 3.5 Sonnet, Llama 3.1 405B, NotebookLM, Gemini 1.5 Advanced Ultra, ChatGPT o1-Pro, and DeepSeekV3. Advanced models achieved high congruence with reference standards, with some achieving perfect concordance (Jaccard index = 1.00).
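The Jaccard index mentioned above measures the overlap between two sets as the size of their intersection divided by the size of their union, so a value of 1.00 means the model's theme set and the reference set are identical. The sketch below illustrates the calculation with invented theme labels; the function name and data are assumptions, not material from the cited study.

```python
def jaccard_index(set_a, set_b):
    """Jaccard index |A ∩ B| / |A ∪ B| for two collections of theme labels."""
    a, b = set(set_a), set(set_b)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical by convention
    return len(a & b) / len(a | b)


# Hypothetical theme labels, for illustration only.
reference = {"theme A", "theme B", "theme C"}
model_full = {"theme A", "theme B", "theme C"}
model_partial = {"theme B", "theme C", "theme D"}

print(jaccard_index(reference, model_full))     # identical sets
print(jaccard_index(reference, model_partial))  # two shared themes out of four distinct
```

Because the index is computed on sets of already-matched labels, it says nothing about how the matching between model themes and reference themes was decided; that judgment remains a separate methodological step.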

Mavrych et al. (2025) compared Claude, ChatGPT, Copilot, and Gemini against medical students on neuroscience questions. Claude achieved 83% accuracy (highest), followed by GPT-4 at 81.7%, Copilot at 59.5%, GPT-3.5 at 58.3%, and Gemini at 53.6%. Wójcik et al. (2025) compared these models on medical examinations in English and Polish, finding that Claude achieved the highest accuracy for most question groups in both languages.

However, accuracy alone does not determine suitability for qualitative research. On the Vectara Hallucination Leaderboard (last updated February 5, 2026), the lowest reported hallucination rate belongs to antgroup/finix_s1_32b (1.8%). Among the other models listed, google/gemini-2.5-flash-lite reports 3.3% and openai/gpt-4.1-2025-04-14 reports 5.6%. Several Anthropic models score higher, including anthropic/claude-sonnet-4-20250514 (10.3%) and anthropic/claude-opus-4-5-20251101 (10.9%) (Vectara, 2026). For qualitative coding that requires factual accuracy, researchers may need to balance capability against the risk of hallucinations.

The present study selected Claude Code, powered by the Opus 4.5 model, for several reasons informed by the considerations above. Claude models have demonstrated strong performance in accuracy benchmarks across multiple domains (Mavrych et al., 2025; Wójcik et al., 2025). For analyzing lengthy interview transcripts, Claude Code’s context window handling offers a practical advantage: unlike models that require document chunking, Claude Code can process complete interview sets in a single session, maintaining a coherent understanding across the full corpus. More importantly, Claude Code’s agentic command-line architecture offers capabilities that differ from both CAQDAS-integrated AI features and web-based LLM interfaces, as discussed in detail in Section 2.8. This combination of model performance, context handling, and tool architecture motivated the selection for this study.

2.5. Methodological Frameworks for AI-Assisted Analysis

Researchers have begun developing frameworks for integrating LLMs into qualitative workflows while preserving methodological integrity. These frameworks address three interconnected challenges: how to structure the AI-assisted analytical process, how to design effective prompts, and how to report AI-integrated research transparently.

On the analytical process, Nguyen-Trung (2025) introduced GAITA (Guided AI Thematic Analysis), adapting Template Analysis to position researchers as reflexive instruments while guiding GPT-4 through four stages: data familiarization, preliminary coding, template formation, and theme development. Similarly, Naeem et al. (2025) provide practical guidance for integrating generative AI across Braun and Clarke’s six phases of thematic analysis, from data familiarization and initial code generation to theme development, review, definition, and reporting. Both frameworks share a common principle: human researchers retain interpretive authority while AI assists with structured analytical tasks. Their guidance includes prompt strategies, examples of AI-assisted coding workflows, and recommendations for maintaining researcher reflexivity and methodological transparency.

On prompt design, the ACTOR framework (Nguyen-Trung, 2025) provides a structured approach: Assign role and context, Clarify task and format, Tailor with examples, Outline constraints, and Refine iteratively. This framework addresses prompt sensitivity, one of the major sources of variability in AI-assisted analysis identified in the literature.
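As an illustration of how the five ACTOR components might be kept separate and auditable in practice, the sketch below assembles a prompt from labeled parts. The class, its field names, the section headings, and the example wording are all hypothetical; they are not the prompts used in this study or text prescribed by the framework.

```python
from dataclasses import dataclass, field


@dataclass
class ActorPrompt:
    """Container for the five ACTOR components (all names are hypothetical)."""
    role_and_context: str                             # Assign role and context
    task_and_format: str                              # Clarify task and format
    examples: list = field(default_factory=list)      # Tailor with examples
    constraints: list = field(default_factory=list)   # Outline constraints
    refinements: list = field(default_factory=list)   # Refine iteratively

    def refine(self, note):
        """Record a revision made after reviewing an earlier output."""
        self.refinements.append(note)

    def render(self):
        """Join the non-empty components into one prompt string."""
        sections = [
            ("ROLE AND CONTEXT", [self.role_and_context]),
            ("TASK AND FORMAT", [self.task_and_format]),
            ("EXAMPLES", self.examples),
            ("CONSTRAINTS", self.constraints),
            ("REVISIONS FROM EARLIER ROUNDS", self.refinements),
        ]
        blocks = []
        for title, items in sections:
            if items:  # skip components that were left empty
                blocks.append(title + "\n" + "\n".join(f"- {item}" for item in items))
        return "\n\n".join(blocks)


prompt = ActorPrompt(
    role_and_context="You are a qualitative analyst coding interview transcripts.",
    task_and_format="Generate initial codes as a table of code, definition, and supporting quote.",
    constraints=["Quote participants verbatim.", "Do not infer beyond the transcript."],
)
prompt.refine("Merge codes whose definitions overlap.")
print(prompt.render())
```

Keeping each component as a separate field means every change between rounds is recorded rather than buried in free text, which supports the kind of transparent reporting that prompt sensitivity calls for.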

On reporting standards, COREQ+LLM is being developed as an extension of the Consolidated Criteria for Reporting Qualitative Research for LLM-integrated studies, aiming to ensure methodological rigor, transparency, and interpretability when AI tools are used (Fehring et al., 2025).

The present study draws on insights from these emerging frameworks. The structured multi-phase prompting approach used in Sessions 3 and 4 reflects the phased analytical design advocated by GAITA and Naeem et al., while the prompt structure incorporates elements consistent with the ACTOR framework, assigning the AI a qualitative analyst role, clarifying the coding task, and outlining methodological constraints. The documentation and transparency practices adopted in this study also align with the reporting principles underlying COREQ+LLM. By comparing general and structured prompting approaches, this study provides empirical evidence on the effectiveness of methodologically grounded prompt design, complementing the largely prescriptive guidance offered by existing frameworks.

2.6. The Methodological Controversy

The integration of AI into qualitative research has generated significant scholarly debate. In an open letter published in Qualitative Inquiry, Jowsey, Braun, et al. (2025) gathered 419 experienced qualitative researchers from 32 countries, including Virginia Braun and Victoria Clarke (the originators of reflexive thematic analysis), to formally reject GenAI use for reflexive qualitative approaches. Their argument centers on methodological incompatibility: reflexive thematic analysis requires a subjective, positioned, and reflexive researcher, which AI cannot provide. They also reject AI use on grounds of social and environmental justice. When AI intervenes in the analytical process, reflexivity risks being displaced onto model verification rather than self-interrogation of the researcher’s own analytic lens (Montes et al., 2025).

This position has prompted substantive counter-arguments, also published in Qualitative Inquiry. Friese et al. (2026) contended that rejecting GenAI in its entirety risks closing off methodological evolution and isolating qualitative research from broader epistemic developments. They argued that GenAI, when used under close researcher leadership and control, can serve as a legitimate analytical support within reflexive qualitative inquiry, the critical factor being whether interpretive authority remains with the human researcher, not whether AI is involved in the process. De Paoli (2026) argued that the categorical rejection rests on philosophical assumptions that risk becoming dogma, and that prohibiting GenAI on metaphysical grounds negatively impacts debate and innovation in qualitative analysis. He noted that GenAI does not replace interpretation but serves as a thinking companion that helps researchers ask better questions of their data. Greenhalgh (2026) similarly called for moving beyond binary framings of adoption versus refusal, arguing that such polarization leaves little space for principled disagreement or methodological experimentation. She proposed refocusing the debate on epistemic authority, distinguishing AI-led analysis from human-led practices that incorporate AI as one of several analytical resources.

A common thread across these responses is the distinction between AI as replacement and AI as complement. The critique from Jowsey et al. applies most directly to approaches where AI displaces rather than supplements human analysis. It should also be noted that their critique is directed specifically at reflexive thematic analysis, one approach within the broader family of thematic analysis methods (Braun & Clarke, 2021). Studies operating within coding reliability or codebook traditions, where consistency and replicability are valued quality indicators, engage with AI tools under different methodological considerations. When AI is positioned within a researcher-led workflow, where the human retains authority over interpretation and meaning-making, the methodological concerns, while still relevant, operate differently. This distinction aligns with the hybrid model emerging from the broader literature, where human insight and reflexivity guide and critically evaluate computational analysis, rather than delegating interpretive authority entirely to machines (Chatzichristos, 2025).

The present study is designed with this distinction in mind. The research design, which compares the original human analysis (AlGhamdi, 2014) with AI Sessions 1–4, treats AI as an independent analytical perspective to be compared against human analysis, not a substitute for it. The original 2012 analysis was conducted by a human researcher using structured, inductive content analysis, an approach consistent with the coding reliability tradition rather than reflexive thematic analysis. The AI sessions serve as points of methodological triangulation, evaluated using metrics of alignment and consistency that are appropriate to this tradition. This positioning is consistent with the call from Friese et al. (2026) for critical engagement with AI tools under researcher oversight, while acknowledging the legitimate concerns raised by Jowsey, Braun, et al. (2025) about the boundaries of AI involvement in interpretive work.

2.7. Ethical Considerations

Beyond methodological debates, AI-assisted qualitative research raises ethical concerns that warrant careful attention. Bias and discrimination are key issues, as AI systems can reproduce and amplify biases inherent in training data, potentially supporting analyses that are discriminatory or harmful (Resnik & Hosseini, 2025). Biases related to race, ethnicity, gender, sexuality, age, nationality, and socioeconomic status embedded in AI systems could perpetuate existing disparities if uncritically propagated into research findings. In qualitative analysis specifically, such biases may shape which themes are foregrounded and which are overlooked, with implications for marginalized voices and underrepresented perspectives (Mehta et al., 2025). Researchers must therefore critically evaluate AI-generated outputs for systematic patterns of omission or emphasis rather than accepting them at face value.

Transparency and reproducibility present additional challenges. The proprietary nature of commercial AI systems makes interpretation of AI reasoning difficult, and version changes in AI models may produce different results over time (Resnik & Hosseini, 2025). These concerns have prompted the development of reporting frameworks such as COREQ+LLM, which aims to ensure methodological rigor, transparency, and interpretability when AI tools are used in qualitative research (Fehring et al., 2025). The need for transparency is underscored by empirical evidence of significant quality differences across AI tools. Jowsey, Stapleton, et al. (2025) examined Microsoft Copilot for thematic analysis and found concerning results: Copilot outputs included 58% fabricated quotes compared to 79% accuracy for human researchers, and none of the AI outputs provided participant spread by theme. Based on these findings, the researchers could not recommend Copilot for thematic analysis. This highlights that not all AI tools are equally suitable, and that researchers must evaluate specific tools rather than assuming uniform AI capability.

Data privacy concerns are important when sensitive qualitative data is sent to cloud-based AI services. Participant confidentiality may be compromised, data may be used for model training without consent, and sensitive information may be retained on external servers (Resnik & Hosseini, 2025). Samuel and Wassenaar (2025) highlighted that uploading qualitative data to cloud-based AI services raises specific informed consent challenges, as participants may not fully understand or be able to assess the risks involved. These risks are particularly acute in qualitative research, where interview data often contain rich personal narratives that are difficult to fully de-identify (Montes et al., 2025). While Claude Code operates as a local command-line tool with direct file system access, the data is transmitted to Anthropic’s servers for processing via the API. These considerations informed my decision to work with previously published, de-identified data from a completed PhD thesis (AlGhamdi, 2014), where participant anonymization had already been carried out. Researchers working with sensitive or unpublished qualitative data should carefully review the data handling and retention policies of any AI service before use.

2.8. Claude Code in Academic Research

The landscape of AI-assisted qualitative analysis tools has expanded considerably. Established CAQDAS platforms have introduced AI-powered features: MAXQDA AI Assist offers AI-supported coding suggestions and summaries within the MAXQDA environment, ATLAS.ti AI integrates LLM capabilities for code generation and thematic grouping, and newer platforms such as QInsights provide purpose-built AI-driven qualitative analysis workflows. General-purpose LLMs accessed through web interfaces (e.g., Claude web-based, ChatGPT, Gemini) have also been employed for thematic analysis tasks, as documented in the studies reviewed in Sections 2.2–2.5.

Claude Code occupies a distinct position within this landscape. Rather than embedding AI assistance within an existing CAQDAS tool or relying on a browser-based conversational interface, Claude Code is an agentic command-line tool that brings a frontier LLM directly into the researcher’s local computing environment (Anthropic, 2024). This architectural distinction carries methodological implications. CAQDAS AI features operate within the constraints of their host platforms, typically assisting with discrete tasks such as suggesting codes for individual segments. Web-based LLM interfaces require manual data upload, are bounded by session and context window limitations, and depend on conversational prompting for each analytical step. Claude Code, by contrast, operates autonomously across the full project directory structure, processes multiple files in sequence without manual intervention, maintains session context throughout extended analytical workflows, and generates structured output artifacts directly to the local file system. In practice, this means that a researcher can direct Claude Code to analyze an entire corpus of interview transcripts through a single methodological instruction, with the tool independently reading, coding, and synthesizing across all documents.

Figure 1 summarizes these technical characteristics:

• Direct file system access - The system can process entire directories of transcripts or documents without manual upload procedures.

• Multi-file synthesis - Claude Code can analyze and cross-reference multiple documents within a single analytical workflow.

• Session persistence - Context is maintained throughout an active session, supporting iterative refinement of coding schemes.

• Structured output generation - The tool can generate formatted artifacts (e.g., Markdown files, HTML reports, tables, figures), facilitating documentation and audit trails.

• Integrated workflow automation - Researchers can execute multi-step analytical procedures within a single command-line environment.

Figure 1.

Technical capabilities of Claude Code for qualitative thematic analysis

Despite these distinct capabilities, I found no published peer-reviewed studies that investigate Claude Code as a qualitative thematic analysis tool. This study was conducted in January 2026, a period during which the academic literature on Claude focused primarily on model performance through web-based or API-mediated interactions rather than on the methodological implications of agentic command-line implementations. The present study addresses this gap by providing an empirical evaluation of Claude Code’s analytical capabilities, inter-session consistency, and alignment with established human analysis.

Current documentation of Claude Code remains largely technical and product-oriented, consisting primarily of official documentation and practitioner accounts rather than peer-reviewed methodological evaluations. By treating the tool’s agentic architecture as a methodological variable rather than a neutral interface, this study contributes empirical evidence on how tool design may shape AI-assisted qualitative analysis.

3. Methodology

3.1. Experimental Design

This study used a comparative design examining five independent analyses of the same qualitative dataset: (1) the original human analysis conducted in 2012 using NVivo 8, (2) AI-assisted analysis Session 1 using Claude Code with general prompts, (3) AI-assisted analysis Session 2 using Claude Code with general prompts, (4) AI-assisted analysis Session 3 using Claude Code with structured multi-phase prompts, and (5) AI-assisted analysis Session 4 using Claude Code with structured multi-phase prompts. This five-way comparison enables assessment of alignment (concordance with human analysis), reliability (consistency across AI sessions), and the impact of prompting strategy on analytical outcomes (see Figure 2).

Figure 2.

Research experimental design - five-way comparison framework

The analytical approach adopted in this study aligns with the coding reliability tradition within thematic analysis. Braun and Clarke (2021) distinguished thematic analysis as a family of methods encompassing coding reliability approaches that prioritize structured coding and accuracy, codebook approaches such as template and framework analysis, and reflexive thematic analysis that centers researcher subjectivity and interpretive engagement. The original 2012 human analysis followed an inductive content analysis methodology using NVivo, employing structured coding procedures and frequency-based categorization across multiple coding phases. The present study evaluates AI-assisted analysis using metrics of accuracy, alignment, and consistency, quality criteria appropriate to coding reliability approaches, where replicability and agreement are valued indicators of analytical quality. This positioning is distinct from reflexive thematic analysis, where such reliability measures would be considered inappropriate (Braun & Clarke, 2021). The present study does not claim that AI can perform reflexive thematic analysis; rather, it examines whether AI can produce thematic outputs that align with human analysis within a framework where structured coding, consistency, and replicability are the relevant evaluative criteria.

Epistemologically, this study adopts a post-positivist orientation, treating the qualitative data as containing identifiable patterns that can be systematically coded and compared across analysts, whether human or AI. This orientation is consistent with the coding reliability tradition, where analytical quality is assessed through metrics of agreement and consistency rather than through the reflexive engagement of the researcher with the data.

3.2. The Original Dataset

The data comprises 16 semi-structured interview transcripts collected as part of my PhD study examining e-commerce adoption by retail businesses in Saudi Arabia (AlGhamdi, 2014). I conducted interviews with retail managers representing diverse sectors, including electronics, fashion, furniture, and general retail. The interviews explored barriers to e-commerce adoption, current business strategies, perceptions of consumer readiness, and views on the Saudi e-commerce ecosystem.

The original analysis was conducted using NVivo 8, following inductive content analysis methodology. The researcher progressed through multiple coding phases: initial open coding generated 82 codes, which were refined through focused coding to 55 codes, and ultimately consolidated into 22 factors organized across three categories (Consumer-related, Environment-related, and Organization-related). This analysis was later validated through a quantitative survey of 153 retailers, which confirmed the relative importance of the identified factors.

3.3. Analytical Protocol and Process

3.3.1. Prompting Strategies

Two prompting approaches were employed to examine the impact of prompt structure on analytical outcomes: general prompting (2 sessions) and structured multi-phase prompting (2 sessions). Each of the four sessions was conducted in a completely separate Claude Code instance with no memory or context from previous sessions.

3.3.1.1. General Prompting Approach (Sessions 1-2)

Sessions 1 and 2 used a general prompting approach. I directed Claude Code to a folder containing research materials organized under three sub-folders: Objectives & Methodology, Transcripts, and Human Analysis (initially hidden). In a single prompt, I instructed Claude Code to:

• Read the methodology documentation to understand the research context, objectives, and interview protocol.

• Adopt the same methodological process used in the original study.

• Perform thematic analysis on all 16 interview transcripts in the transcripts folder.

• Output results to a designated AI analysis folder.

This approach provided minimal prescriptive guidance, allowing Claude Code to autonomously determine its analytical workflow based on the available content (see Figure 3).

Figure 3.

Approach of the general prompt for Claude Code sessions 1 and 2

3.3.1.2. Structured Multi-phase Prompting Approach (Sessions 3-4)

Sessions 3 and 4 employed a structured prompting approach consisting of three sequential phases, each with detailed instructions.

1. The first phase is the data familiarization prompt. Claude Code was instructed to read the methodology documentation to understand the research context, objectives, and interview protocol of the original study.

2. The second phase is the open coding prompt. Claude Code was instructed to read all 16 interview transcripts in the “Transcripts” folder and to code them line by line, with specific instructions to identify meaningful units, assign descriptive labels, track frequency, and generate an initial codebook with definitions and exemplar quotes.

3. The third phase is the theme development prompt. Claude Code was asked to group related codes into potential themes, review the themes against the coded extracts, define and name each theme, and develop a thematic map showing relationships, with definitions and exemplar quotes.

Then, I directed Claude Code to output structured HTML reports for the codes and themes. These reports can be accessed through the data repository link provided in the data availability statement. Appendix A presents the structured prompts used, and Figure 4 illustrates the steps followed.

Figure 4.

Approach of the structured multi-phase prompt for Claude Code sessions 3 and 4
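Both prompting approaches start from the same folder layout, a fresh session, and a human analysis that is withheld from the tool. The sketch below shows one way such an isolated workspace could be prepared before each session. It is an illustration only: the study does not describe how the Human Analysis folder was hidden, so copying only the visible folders is an assumption, and the function name `prepare_session_workspace` and the output folder name "AI Analysis" are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path

# Sub-folder names follow the layout described in Section 3.3.1.1.
VISIBLE_FOLDERS = ["Objectives & Methodology", "Transcripts"]
HIDDEN_FOLDER = "Human Analysis"  # withheld so the session stays procedurally blind


def prepare_session_workspace(source: Path, dest: Path) -> Path:
    """Copy only the visible folders into a fresh workspace and add an output folder."""
    # Refuse to reuse a folder left over from an earlier session.
    dest.mkdir(parents=True, exist_ok=False)
    for name in VISIBLE_FOLDERS:
        shutil.copytree(source / name, dest / name)
    # Hypothetical name for the designated AI analysis output folder.
    (dest / "AI Analysis").mkdir()
    return dest


if __name__ == "__main__":
    # Demo on a throwaway directory with placeholder content.
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "study"
        for name in VISIBLE_FOLDERS + [HIDDEN_FOLDER]:
            (source / name).mkdir(parents=True)
        (source / "Transcripts" / "interview_01.txt").write_text("placeholder")
        workspace = prepare_session_workspace(source, Path(tmp) / "session_1")
        print(sorted(p.name for p in workspace.iterdir()))
```

Building each workspace from scratch keeps the four sessions independent at the file level. It does not by itself guarantee that the tool carries no memory between sessions, which depends on how each Claude Code instance is started.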

3.3.2. Quote Verification Protocol

Given concerns in the literature about AI-generated fabricated quotes (Jowsey, Stapleton, et al., 2025), I implemented a systematic verification protocol. From each AI session’s output, I randomly selected 20% of cited quotes (exemplar quotes provided as evidence for codes) for verification. I traced each selected quote to the original interview transcript to confirm: (1) the quote exists verbatim or with minor transcription variations, (2) the quote is attributed to the correct participant, and (3) the contextual meaning aligns with how the quote was used.

“Minor variations” included small differences in punctuation, word order, or grammar/speaking corrections that did not affect meaning. No fabrication was found (see Table 1). Overall, the 100% verification accuracy rate contrasts with the 42% accuracy (58% fabrication) reported for Microsoft Copilot (Jowsey, Stapleton, et al., 2025), suggesting that Claude Code may offer higher fidelity in quote extraction.

Table 1.

Verification Results Across the Four Sessions Revealed High Quote Accuracy

Session	Quotes sampled	Verified accurate	Minor variations	Accuracy rate
Session 1	12	11	1	100%
Session 2	12	10	2	100%
Session 3	18	17	1	100%
Session 4	22	20	2	100%
Total	64	58	6	100%

Figure 5 illustrates the quote verification process used in this study, showing a sample of how quotes from the Claude Code codebook (upper panel) were traced to their source locations in the original interview transcripts (lower panel). In the example shown, the code “ENV-IGNORANCE-FEAR” (defined as “lack of knowledge breeding fear and reluctance; ignorance about e-commerce leading to avoidance”) includes three example quotes attributed to specific participants. Each quote was verified by locating the corresponding passage in the original transcript, as indicated by the connecting arrows and highlighted text. This verification process confirmed that Claude Code accurately extracted and attributed participant statements.

Figure 5.

Quote verification process - tracing AI-generated citations to original interview transcripts
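The verification in this study was carried out by manually tracing quotes to transcripts. As a possible aid for researchers replicating the protocol, the sketch below draws a 20% random sample of quotes and screens each one against a transcript for a verbatim or near-verbatim match. The function names, the normalization rules, and the 0.85 similarity threshold are assumptions made for illustration, not part of the original protocol. The sketch covers only the first criterion (the quote exists); attribution and contextual meaning still need human review.

```python
import random
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def sample_quotes(quotes, fraction=0.20, seed=0):
    """Randomly draw a fixed fraction of quotes (at least one)."""
    k = max(1, round(len(quotes) * fraction))
    return random.Random(seed).sample(quotes, k)


def check_quote(quote: str, transcript: str, threshold: float = 0.85) -> str:
    """Classify a quote as 'verbatim', 'minor variation', or 'not found'."""
    q, t = normalize(quote), normalize(transcript)
    if q in t:
        return "verbatim"
    # Slide a window of the quote's word length over the transcript
    # and keep the best character-level similarity ratio.
    q_words, t_words = q.split(), t.split()
    n = len(q_words)
    best = 0.0
    for i in range(max(1, len(t_words) - n + 1)):
        window = " ".join(t_words[i:i + n])
        ratio = SequenceMatcher(None, q, window, autojunk=False).ratio()
        best = max(best, ratio)
    return "minor variation" if best >= threshold else "not found"
```

A quote flagged as "not found" is a candidate for closer inspection rather than proof of fabrication, since heavy paraphrase or a quote stitched from two passages would also fall below the threshold.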

3.4. Alignment Measurement Methods

To evaluate the alignment between Claude Code’s analysis and human analysis, I developed a multi-level assessment framework. This framework examines alignment at two levels: code-level and theme-level. Code-level alignment compares the specific codes identified by each approach. Theme-level alignment compares the higher-order thematic structures that emerged from each analysis. I also assessed inter-session consistency at both levels to evaluate the reproducibility of Claude Code analysis across multiple independent runs. I adopted the F1 Score as the primary alignment metric throughout the analysis. F1 balances precision and recall, providing a single interpretable measure that accounts for both the relevance and completeness of the generated codes and themes.

It is important to note that the human analysis serves as a reference point for comparison rather than an objective gold standard. As a single analyst’s interpretation, it represents one valid reading of the data. However, the original analysis was subsequently validated through quantitative survey research with 153 retailers (AlGhamdi, 2014), which confirmed the relative importance of the identified factors, lending additional credibility to its use as a comparator.

Throughout this study, the term alignment refers to the degree of agreement between AI-generated and human-generated analytical outputs. Alignment with a single human analysis demonstrates concordance but does not by itself establish interpretive validity, which would require evidence of analytical quality beyond agreement with one reference point. The term reliability refers to consistency across independent AI sessions. These operational definitions should not be conflated with broader notions of interpretive validity in qualitative research.

3.4.1. Code-Level Alignment

Code-level alignment was assessed by mapping each Claude Code-identified code to the corresponding human-identified codes based on semantic equivalence. The human analysis produced 51 codes organized under three meta-categories: Consumer Related Factors, Environment Related Factors, and Organization Related Factors. Each of the four AI sessions was independently compared against this reference set of human codes.

To ensure systematic and replicable classification, the following explicit decision rules were applied:

(A) Full Match (Score: 1) applied when:

• The AI code and human code describe the same underlying concept

• The conceptual scope is equivalent (neither broader nor narrower)

• A domain expert would consider them interchangeable labels for the same phenomenon

Example: AI code “CREDIT_CARD_RELUCTANCE” ↔ Human code “Consumers’ reluctance to use credit cards” — Both capture identical consumer payment behavior concerns.

(B) Partial Match (Score: 0.5) applied when any of the following conditions exist:

• The AI code combines two or more human codes into a single construct

• The AI code represents a subset of a broader human code

• There is substantial but not complete conceptual overlap

• The AI code uses a different framing that shifts emphasis while retaining core meaning

Example 1: AI code “DIGITAL_INFRASTRUCTURE_GAPS” ↔ Human codes “Lack of electronic payment systems” + “Logistics and delivery challenges” — The AI code combines two distinct human codes.

Example 2: AI code “TRUST_DEFICIT” ↔ Human code “Lack of consumer trust” — Substantial overlap, but the AI code is slightly broader (includes institutional trust).

(C) Extended (Score: 0) applied when:

• The AI code represents a concept not present in the human codebook

• No human code captures >25% of the AI code’s conceptual content

• The concept may be valid, but was not identified in the original analysis

Example: AI code “FUTURE_MARKET_OPTIMISM” — No corresponding human code addresses forward-looking market projections.

It should be noted that the classification of matches as full, partial, or extended relied on the researcher’s judgment. While explicit decision rules were applied to ensure systematic and replicable classification, and the complete mapping tables with justifications are provided in Appendix B, the process inherently involves interpretive decisions, particularly in distinguishing between full and partial matches. A different analyst might draw these boundaries differently, which could affect the resulting F1 Scores. To mitigate this, borderline cases were resolved conservatively: where the match between an AI code and a human code was ambiguous, a partial match (0.5) was assigned rather than a full match (1.0). This conservative approach may underestimate alignment in some cases, but it reduces the risk of inflating scores through generous classification.
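The decision rules can be recorded in a small data structure so that each mapping carries its match class and justification alongside the score. The sketch below is one hypothetical way to do this, using the four examples given in the rules. The class and field names are illustrative, and the classification itself remains a human judgment that the code only stores.

```python
from dataclasses import dataclass
from enum import Enum


class Match(Enum):
    """Match classes and their scores, as defined in the decision rules."""
    FULL = 1.0
    PARTIAL = 0.5
    EXTENDED = 0.0


@dataclass(frozen=True)
class Mapping:
    ai_code: str
    human_codes: tuple  # zero, one, or several human codes
    match: Match
    justification: str


def matched_score(mappings) -> float:
    """Sum the scores of all mappings (full = 1, partial = 0.5, extended = 0)."""
    return sum(m.match.value for m in mappings)


# The four examples given in the decision rules.
examples = [
    Mapping("CREDIT_CARD_RELUCTANCE",
            ("Consumers' reluctance to use credit cards",),
            Match.FULL, "same concept and scope"),
    Mapping("DIGITAL_INFRASTRUCTURE_GAPS",
            ("Lack of electronic payment systems",
             "Logistics and delivery challenges"),
            Match.PARTIAL, "combines two distinct human codes"),
    Mapping("TRUST_DEFICIT", ("Lack of consumer trust",),
            Match.PARTIAL, "slightly broader than the human code"),
    Mapping("FUTURE_MARKET_OPTIMISM", (),
            Match.EXTENDED, "no corresponding human code"),
]

print(matched_score(examples))  # 1.0 + 0.5 + 0.5 + 0.0 = 2.0
```

Keeping the justification next to each score makes borderline decisions easy to audit, which matters because the conservative rule (assign 0.5 when in doubt) is applied case by case.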

The F1 Score was calculated using the following formulas:

F_{1} = \frac{2 \times P \times R}{P + R}

(1)

where

P = \frac{M_{s}}{N_{AI}} and R = \frac{M_{s}}{N_{H}}

(2)

and

M_{s} = (F_{m} \times 1.0) + (P_{m} \times 0.5)

(3)

where P = Precision, R = Recall, M_{s} = Matched Score, N_{AI} = Total AI codes, N_{H} = Total Human codes, F_{m} = Full matches, and P_{m} = Partial matches.

This approach ensures that the alignment metric penalizes both over-generation of codes (low precision) and under-coverage of human concepts (low recall). This provides a balanced assessment of AI performance. Precision measures how relevant or accurate the AI-generated codes are, whereas recall measures how completely the AI captured the human-identified concepts.
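To make the calculation concrete, the sketch below applies Equations 1-3 to the Session 4 counts reported in Section 4.1.1 (37 full matches, 25 partial matches, 62 AI codes, 51 human codes). The function name `alignment_metrics` is illustrative and is not taken from the study's materials.

```python
def alignment_metrics(matched_score: float, n_ai: int, n_human: int) -> dict:
    """Return precision, recall, and F1 as fractions between 0 and 1."""
    if n_ai <= 0 or n_human <= 0:
        raise ValueError("code counts must be positive")
    precision = matched_score / n_ai      # Equation 2, P
    recall = matched_score / n_human      # Equation 2, R
    if matched_score == 0:
        f1 = 0.0                          # avoid dividing by zero
    else:
        f1 = 2 * precision * recall / (precision + recall)  # Equation 1
    return {"precision": precision, "recall": recall, "f1": f1}


# Session 4: matched score from Equation 3.
m_s = 37 * 1.0 + 25 * 0.5  # 49.5
session4 = alignment_metrics(m_s, n_ai=62, n_human=51)
print({k: round(v * 100, 1) for k, v in session4.items()})
# {'precision': 79.8, 'recall': 97.1, 'f1': 87.6}
```

Small last-digit differences can arise depending on whether precision and recall are rounded before F1 is computed, so recomputed values may not always match tabulated ones to the final decimal.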

It should be acknowledged that the F1 Score, as applied here, captures structural similarity between AI and human analytical outputs, that is, the extent to which the same concepts were identified and categorized. It does not assess interpretive depth, analytical nuance, or the quality of meaning-making that underlies the coding process. Two analysts may assign the same code label while differing in the richness of their interpretive engagement with the data. This metric is appropriate within the coding reliability framework adopted by this study, but it should not be taken as a comprehensive measure of analytical quality in the broader qualitative sense.

3.4.2. Theme-Level Alignment

Theme-level alignment was assessed by mapping each Claude Code theme to the three human meta-categories: consumer, environment, and organization related factors. A theme was classified as a “Full Match” (scored as 1) when it clearly aligned with a single human meta-category. For example, an AI theme labeled “Trust Deficit” was considered fully aligned with “Consumer Related Factors” as trust is fundamentally a consumer-oriented construct in the context of e-commerce adoption. A “Partial Match” (scored as 0.5) was assigned when a theme spanned multiple human categories or represented a cross-cutting concept. Themes such as “Education as the Catalyst” that addressed both consumer knowledge and organizational training needs received partial match scores. Themes representing entirely new conceptual dimensions not present in the human analysis were classified as “Extended” themes, scoring 0 (see Appendix C).

The same F1 Score formula was applied at the theme level, with precision calculated as the proportion of AI themes that mapped to human categories, and recall calculated as the proportion of human categories covered by Claude Code themes.
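As a worked illustration of the theme-level calculation, the following uses the Session 3 counts reported in Table 6 (six AI themes, of which four are full matches and two are partial matches, with all three human meta-categories covered). The arithmetic is a sketch of how the description above can be applied, not a reproduction of the author's worksheet.

```latex
\begin{aligned}
P &= \frac{(4 \times 1.0) + (2 \times 0.5)}{6} = \frac{5}{6} \approx 0.833 \\
R &= \frac{3}{3} = 1.000 \\
F_{1} &= \frac{2 \times 0.833 \times 1.000}{0.833 + 1.000} \approx 0.909
\end{aligned}
```

Because recall is computed over only three meta-categories, it reaches 100% as soon as each category is touched by at least one theme, so differences in theme-level F1 are driven almost entirely by precision.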

3.4.3. Inter-Session Consistency

Inter-session consistency was assessed to evaluate the reproducibility and reliability of Claude Code analysis across independent runs. For each pair of Claude Code sessions (six pairs total: S1-S2, S1-S3, S1-S4, S2-S3, S2-S4, S3-S4), bidirectional mapping was performed at both code and theme levels.

For code-level inter-session analysis, codes from Session A were mapped to semantically equivalent codes in Session B (Forward alignment). Codes from Session B were mapped to Session A (Backward alignment). The inter-session F1 Score was then calculated as the harmonic mean of these bidirectional alignments:

F_{1}^{inter}(A, B) = \frac{2 \times F_{w} \times B_{w}}{F_{w} + B_{w}}

(4)

where F_{w} = Forward alignment (proportion of Session A codes found in Session B) and B_{w} = Backward alignment (proportion of Session B codes found in Session A).

This approach produces a symmetric measure where F1(A,B) = F1(B,A). This ensures that the consistency measure is not biased by which session is used as the reference. The same methodology was applied at the theme level. Themes are mapped between session pairs to assess structural consistency in the higher-order thematic frameworks.
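The pairing and the symmetric score in Equation 4 can be sketched in a few lines. The example enumerates the six session pairs and applies the harmonic mean to the forward and backward values reported for the S3-S4 code-level comparison in Table 5 (87.2% and 62.1%). The function name is illustrative.

```python
from itertools import combinations


def inter_session_f1(forward: float, backward: float) -> float:
    """Harmonic mean of forward and backward alignment (Equation 4)."""
    if forward + backward == 0:
        return 0.0  # no overlap in either direction
    return 2 * forward * backward / (forward + backward)


sessions = ["S1", "S2", "S3", "S4"]
pairs = list(combinations(sessions, 2))  # six unordered session pairs
print(pairs)

# S3 <-> S4 at the code level, using the reported directional values.
print(round(100 * inter_session_f1(0.872, 0.621), 1))  # 72.5
```

Swapping the two arguments leaves the result unchanged, which is the symmetry property the measure relies on.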

Sessions were also grouped by prompting approach (general vs. structured) to examine whether the prompting methodology influenced both alignment with human analysis and inter-session consistency.

Figure 6 summarizes the alignment measurement framework. Matching criteria (A) assign scores of 1.0, 0.5, or 0.0 based on conceptual overlap between Claude Code and human codes. The F1 Score (B) balances precision and recall to assess alignment. Assessment occurs at both code and theme levels (C), with inter-session consistency (D) evaluated through bidirectional mapping across six session pairs. The complete flow (E) distinguishes alignment (human-Claude Code agreement) from reliability (inter-session consistency).

Figure 6.

Alignment measurement framework. (A) Matching criteria for code-level comparison; (B) F1 score calculation; (C) Multi-level assessment at code and theme levels; (D) Bidirectional inter-session consistency; (E) Complete assessment flow

3.5. Technical Specifications

3.5.1. Claude Code Technical Specifications

Claude Code is Anthropic’s official command-line interface tool that brings Claude’s AI capabilities directly into the researcher’s local computing environment (Anthropic, 2024). Unlike web-based interfaces, Claude Code operates as an agentic AI assistant with the following capabilities presented in Table 2.

Table 2.

Claude Code vs. Web Interface Comparison

Capability	Web interface (claude.ai)	Claude Code (CLI)
File System Access	Limited - manual uploads only	Full - direct read/write to local files
Output Generation	Manual copy/paste required	Direct file creation (HTML, CSV, SVG, etc.)
Multi-File Operations	Sequential uploads only	Parallel processing of multiple files
Session Persistence	Conversation history only	Full persistence with memory system
Large Document Handling	Subject to upload limits	Systematic splitting and processing
Command Execution	None	Full Bash access for system operations

3.5.2. Session Technical Specifications

Table 3 presents the technical specifications for all five analyses.

Table 3.

Technical Specifications of all Five Analyses

Specification	Human	AI session 1	AI session 2	AI session 3	AI session 4
Analysis Tool	NVivo 8	Claude Code	Claude Code	Claude Code	Claude Code
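Model ID	N/A	claude-opus-4-5-20251101	claude-opus-4-5-20251101	claude-opus-4-5-20251101	claude-opus-4-5-20251101
Platform	Windows	Linux (WSL) 4.4.0-19041	Linux (WSL) 4.4.0-19041	Linux (WSL) 4.4.0-19041	Linux (WSL) 4.4.0-19041
Analysis Date	2012	Feb 12, 2026	Feb 13, 2026	Feb 14, 2026	Feb 15, 2026
Prompting Approach	N/A	General	General	Structured	Structured
Duration	Several weeks	∼30 min	∼30 min	∼30 min	∼30 min
Prior Exposure to findings	N/A	Not provided during sessions (procedurally blind)	Not provided during sessions (procedurally blind)	Not provided during sessions (procedurally blind)	Not provided during sessions (procedurally blind)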

4. Results

4.1. Code-level Alignment Results

4.1.1. Human-AI Code Alignment

The code-level alignment analysis revealed substantial differences in performance between the two prompting approaches. Table 4 presents detailed metrics for each session’s alignment with the 51 human-identified codes.

Table 4.

Code-Level Human-AI Alignment Metrics

Session	Prompt type	AI codes	Matched	Precision	Recall	F1 score
Session 1	General	26	24.0	92.3%	47.1%	62.4%
Session 2	General	26	21.5	82.7%	42.2%	55.9%
Session 3	Structured	47	38.5	81.9%	75.5%	78.5%
Session 4	Structured	62	49.5	79.8%	97.1%	87.6%
General Prompts (Avg)		26	22.8	87.5%	44.7%	59.2%
Structured Prompts (Avg)		54.5	44.0	80.9%	86.3%	83.1%

Sessions using structured multi-phase prompts achieved substantially higher alignment with human analysis than sessions using general prompts. Session 4 achieved the highest F1 Score of 87.6%, with a precision of 79.8% and a recall of 97.1%. This suggests that it captured nearly all human-identified concepts while maintaining acceptable precision. Session 3 achieved an F1 Score of 78.5% with the most balanced performance between precision (81.9%) and recall (75.5%). In contrast, Sessions 1 and 2 achieved F1 Scores of 62.4% and 55.9%, respectively. The average F1 Score for general prompts was 59.2%, 23.9 percentage points lower than the 83.1% average for structured prompts.

Analysis of the precision-recall trade-off revealed distinct patterns between the prompting approaches. General prompt sessions showed high precision (averaging 87.5%) but low recall (averaging 44.7%). This suggests that while the codes they generated were highly relevant to the human analysis, they missed more than half of the human-identified concepts. These sessions produced fewer codes (26 codes each) but with higher precision. Conversely, structured prompt sessions achieved more comprehensive coverage, with Session 3 producing 47 codes and Session 4 producing 62 unique codes (after removing duplicates). Session 4’s exceptional recall of 97.1% shows that the structured multi-phase prompting approach enabled the AI to capture nearly the complete conceptual landscape identified through traditional human analysis.

The number of full matches versus partial matches also differed by prompting approach. Session 4 achieved 37 full matches and 25 partial matches, while Session 1 achieved 22 full matches and only 4 partial matches. This indicates that structured prompts do more than find additional concepts. They also capture subtle relationships between these concepts and, because of this, their codes more often overlap partially with multiple human codes.

4.1.2. Code-Level Inter-Session Consistency

Inter-session consistency at the code level varied considerably depending on the comparison type. Table 5 presents the F1 Scores for all session pairs.

Table 5.

Code-Level Inter-Session Consistency (F1 Scores)

Pair	Prompt Types	Forward	Backward	F1 Score
S1 ↔ S2	General ↔ General	78.8%	80.8%	79.8%
S1 ↔ S3	General ↔ Structured	86.5%	51.1%	64.2%
S1 ↔ S4	General ↔ Structured	92.3%	41.1%	56.8%
S2 ↔ S3	General ↔ Structured	82.7%	46.8%	59.8%
S2 ↔ S4	General ↔ Structured	88.5%	37.1%	52.2%
S3 ↔ S4	Structured ↔ Structured	87.2%	62.1%	72.5%
Same-Type Average				76.2%
Cross-Type Average				58.3%
Overall Average				64.2%

Sessions using the same prompting approach exhibited higher internal consistency. The two general prompt sessions (S1 & S2) achieved an F1 Score of 79.8%, while the two structured prompt sessions (S3 & S4) achieved 72.5%. The slightly lower consistency between structured sessions can be attributed to the larger number of codes generated (47 and 62 codes, respectively), which creates more opportunity for variation while still capturing the same core concepts.

Cross-approach comparisons yielded lower F1 Scores, averaging 58.3% across the four general-to-structured session pairs (ranging from 52.2% for S2-S4 to 64.2% for S1-S3). This pattern reflects the fundamental difference in code granularity between the approaches. Structured prompts generate approximately twice as many codes as general prompts.
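The same-type and cross-type averages can be recomputed directly from the pairwise F1 Scores in Table 5. The sketch below groups the six pairs by whether both sessions used the same prompting approach. The dictionary layout and function name are illustrative choices.

```python
from statistics import mean

PROMPT_TYPE = {"S1": "general", "S2": "general",
               "S3": "structured", "S4": "structured"}

# Code-level inter-session F1 Scores (%) as reported in Table 5.
CODE_F1 = {
    ("S1", "S2"): 79.8, ("S1", "S3"): 64.2, ("S1", "S4"): 56.8,
    ("S2", "S3"): 59.8, ("S2", "S4"): 52.2, ("S3", "S4"): 72.5,
}


def group_averages(pair_scores: dict) -> dict:
    """Average pairwise scores by same-type, cross-type, and overall."""
    same = [v for (a, b), v in pair_scores.items()
            if PROMPT_TYPE[a] == PROMPT_TYPE[b]]
    cross = [v for (a, b), v in pair_scores.items()
             if PROMPT_TYPE[a] != PROMPT_TYPE[b]]
    return {
        "same_type": mean(same),
        "cross_type": mean(cross),
        "overall": mean(list(pair_scores.values())),
    }


averages = group_averages(CODE_F1)
print(averages)
```

The unrounded results are approximately 76.15, 58.25, and 64.22, which agree with the reported 76.2%, 58.3%, and 64.2% to within rounding.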

The directional analysis revealed asymmetric patterns. When mapping general prompt codes to structured prompt sessions, alignment was high, averaging 87.3%. This suggests that structured sessions captured nearly all concepts identified by general sessions. However, the reverse mapping showed lower alignment, averaging 44.5%, which suggests that structured prompts identified many additional codes not present in general prompt sessions. This asymmetry supports the finding that structured prompts produce more comprehensive analyses while maintaining coverage of the core concepts identified through simpler prompting approaches.

4.2. Theme-Level Alignment Results

4.2.1. Human-AI Theme Alignment

Theme-level alignment analysis demonstrated higher agreement between AI and human analysis compared to code-level alignment. All four Claude Code sessions successfully identified themes corresponding to all three human meta-categories (Consumer, Environment, Organization), achieving 100% recall across all sessions. Table 6 below presents the theme-level alignment metrics.

Table 6.

Theme-Level Human-AI Alignment Metrics

Session	Prompt type	AI themes	Full match	Partial	Precision	Recall	F1 score
Session 1	General	7	6	0	85.7%	100%	92.3%
Session 2	General	7	5	2	85.7%	100%	92.3%
Session 3	Structured	6	4	2	83.3%	100%	90.9%
Session 4	Structured	6	6	0	100%	100%	100%
General Prompts (Avg)		7	5.5	1	85.7%	100%	92.3%
Structured Prompts (Avg)		6	5	1	91.7%	100%	95.5%

Session 4 achieved complete theme-level alignment with an F1 Score of 100%, as all six of its themes mapped directly to the human categories with no cross-cutting or extended themes. Sessions 1 and 2 both achieved F1 Scores of 92.3%, with a precision of 85.7% (6 out of 7 themes mapping to human categories) and a recall of 100%. Session 1 included one extended theme (“E-commerce Opportunities & Future Outlook”) that represented a forward-looking perspective not explicitly captured in the human framework. Session 3 achieved an F1 Score of 90.9%, with two themes (“Education as the Catalyst” and “The Mutual Waiting Game”) classified as partial matches due to their cross-cutting nature spanning multiple human categories.

The average theme-level F1 Score for general prompts was 92.3%, compared to 95.5% for structured prompts. While both approaches achieved high theme-level alignment, structured prompts demonstrated slightly better performance, especially with Session 4’s perfect alignment.

A comparison of code-level and theme-level alignment revealed that theme-level alignment was consistently higher across all sessions, as shown in Table 7. The average improvement from code-level to theme-level F1 Score was 22.8 percentage points, ranging from 12.4 points for Sessions 3 and 4 to 36.4 points for Session 2. This pattern suggests that although AI sessions may identify different specific codes, they still reach similar overall themes.

Table 7.

Code-level vs Theme-Level Alignment Comparison

Session	Code-level F1	Theme-level F1	Difference
Session 1	62.4%	92.3%	+29.9%
Session 2	55.9%	92.3%	+36.4%
Session 3	78.5%	90.9%	+12.4%
Session 4	87.6%	100%	+12.4%
Average	71.1%	93.9%	+22.8%

These themes are consistent with traditional human analysis. The AI sessions produced 6-7 themes compared to the human analysis’s 3 meta-categories, representing a finer level of granularity. This extra level of detail allowed for more precise and nuanced categorization while still aligning clearly with the broader human framework. For example, the human “Consumer Related Factors” category was captured across multiple AI themes, including trust-related themes, cultural/shopping behavior themes, and knowledge/education themes.

4.2.2. Theme-Level Inter-Session Consistency

Inter-session consistency at the theme level was higher than at the code level, with all session pairs achieving F1 Scores above 83%. Table 8 presents the theme-level inter-session alignment results.

Table 8.

Theme-Level Inter-session Consistency (F1 Scores)

Pair	Prompt types	Forward	Backward	F1 score
S1 ↔ S2	General ↔ General	78.6%	92.9%	85.0%
S1 ↔ S3	General ↔ Structured	71.4%	100%	83.3%
S1 ↔ S4	General ↔ Structured	71.4%	100%	83.3%
S2 ↔ S3	General ↔ Structured	78.6%	91.7%	84.6%
S2 ↔ S4	General ↔ Structured	85.7%	91.7%	88.6%
S3 ↔ S4	Structured ↔ Structured	91.7%	91.7%	91.7%
Same-Type Average				88.4%
Cross-Type Average				85.0%
Overall Average				86.1%
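As a cross-check on Table 8, the sketch below recomputes each pair's F1 Score and then averages the reported F1 Scores by pairing type. It assumes that the F1 Score is the harmonic mean of the forward and backward mapping rates; the table itself does not state the formula, so this is an assumption. Variable and function names are hypothetical. Because the inputs are rounded percentages, recomputed values can differ from the published ones by a fraction of a point.

```python
from statistics import mean

# Table 8 rows: (pair, pairing type, forward %, backward %, reported F1 %).
table8 = [
    ("S1-S2", "same", 78.6, 92.9, 85.0),
    ("S1-S3", "cross", 71.4, 100.0, 83.3),
    ("S1-S4", "cross", 71.4, 100.0, 83.3),
    ("S2-S3", "cross", 78.6, 91.7, 84.6),
    ("S2-S4", "cross", 85.7, 91.7, 88.6),
    ("S3-S4", "same", 91.7, 91.7, 91.7),
]

def harmonic_f1(forward, backward):
    """Harmonic mean of the two directional mapping rates (assumed definition)."""
    return 2 * forward * backward / (forward + backward)

# Recompute F1 per pair from the rounded forward/backward percentages.
recomputed = {row[0]: harmonic_f1(row[2], row[3]) for row in table8}
reported = {row[0]: row[4] for row in table8}

# Average the reported F1 Scores by pairing type.
same_type = mean(row[4] for row in table8 if row[1] == "same")
cross_type = mean(row[4] for row in table8 if row[1] == "cross")
overall = mean(row[4] for row in table8)

for pair in recomputed:
    print(f"{pair}: recomputed {recomputed[pair]:.1f}%, reported {reported[pair]:.1f}%")
print(f"Same-type {same_type:.1f}%, cross-type {cross_type:.1f}%, overall {overall:.1f}%")
```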

The two structured prompt sessions (S3 and S4) demonstrated the highest consistency, with an F1 Score of 91.7%, indicating strong agreement in their thematic structures. Both sessions identified themes related to trust, shopping culture, infrastructure and ecosystem, organizational factors, and education. There were only small differences in how these concepts were defined and labeled.

General prompt sessions (S1 and S2) achieved 85.0% consistency, also indicating strong agreement despite differences in theme labeling. For example, Session 1’s “Trust & Security” theme corresponded directly to Session 2’s “Trust Deficit and Risk Perception” theme.

Cross-approach comparisons performed much better at the theme level than at the code level. The average F1 Score was 85.0% at the theme level, compared to 58.3% at the code level, an improvement of 26.7 points. This suggests that different prompting approaches reach similar high-level themes, even if they produce different detailed codes. The pair S2–S4 showed especially high consistency across approaches, with a score of 88.6%. This indicates that the thematic frameworks identified by general and structured prompts are largely compatible.

The overall average theme-level inter-session consistency was 86.1%, representing a 21.9 point improvement over code-level consistency at 64.2%. This finding has important methodological implications. While AI-assisted analysis may show variation in specific code identification across runs, the higher-order thematic structures remain stable and consistent. This suggests that theme-level findings may be more reliable for drawing research conclusions.

4.3. Summary of Alignment Findings

Figure 7 below illustrates inter-session consistency heatmaps showing F1 Scores between all Claude Code session pairs at the code level (left) and theme level (right). Darker green indicates higher consistency.

Figure 7.

Inter-session consistency heatmaps: Code-level and theme-level F1 scores

These findings suggest several key patterns. At the code level, structured prompts perform much better than general prompts, with an F1 Score 23.9 percentage points higher (83.1% vs. 59.2%). At the theme level, the advantage drops to 3.2 percentage points (95.5% vs. 92.3%), because both approaches successfully capture the main high-level themes. Additionally, theme-level alignment is consistently higher than code-level alignment for both human-AI comparison and inter-session consistency, with average improvements of 22.8 and 21.9 percentage points, respectively. The analysis also suggests strong reproducibility for AI-assisted thematic analysis, especially at the theme level, where average inter-session consistency exceeds 86%. This indicates that while specific code identification may vary across runs, the overarching thematic conclusions remain stable.

5. Discussion

5.1. Interpretation of Key Findings

This study evaluated the alignment between Claude Code thematic analysis and human analysis across multiple dimensions. The findings suggest several patterns that contribute to our understanding of how LLMs, especially agentic ones, can be used for qualitative data analysis.

5.1.1. Observed Differences Between Prompting Approaches

A key observation from this exploratory study is the difference in performance between prompting approaches. Sessions using structured multi-phase prompts achieved an average F1 Score of 83.1% at the code level, compared to 59.2% for general prompts. While this difference is substantial, the small sample size (n=2 per condition) prevents formal statistical inference, and the observed pattern requires replication before causal conclusions can be drawn. With this caveat, the pattern suggests that the quality of Claude Code’s qualitative analysis may depend on the methodological rigor of the prompting strategy.

General prompts, while achieving high precision (87.5%), suffered from low recall (44.7%), indicating that the AI identified relevant codes but missed more than half of the concepts captured in human analysis. The pattern indicates that without explicit methodological guidance, LLMs tend toward conservative interpretation, identifying only the most salient themes while overlooking subtler or more nuanced concepts. By contrast, structured prompts achieved a more balanced precision-recall trade-off (80.9% precision, 86.3% recall), enabling comprehensive coverage of the conceptual landscape without substantial loss in relevance.
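The precision-recall imbalance can be made concrete with the standard F1 formula, the harmonic mean of precision and recall. The sketch below is illustrative and uses the average figures quoted above; the function name is an assumption. An F1 Score computed from averaged precision and recall need not equal the average of per-session F1 Scores, which may account for a small gap from the 83.1% reported for structured prompts.

```python
def f1_score(precision, recall):
    """Standard F1: harmonic mean of precision and recall (same units in and out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Average precision and recall (percent) quoted in the text.
general = f1_score(87.5, 44.7)
structured = f1_score(80.9, 86.3)

# Share of human-identified codes not recovered under general prompts.
missed_general = 100.0 - 44.7

print(f"General prompts:    F1 = {general:.1f}%")
print(f"Structured prompts: F1 = {structured:.1f}%")
print(f"Human codes missed by general prompts: {missed_general:.1f}%")
```

Because the harmonic mean is pulled toward the lower of its two inputs, the low recall of general prompts dominates their F1 Score despite high precision.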

This finding aligns with emerging best practices in prompt engineering for qualitative research tasks (Tai et al., 2024), which emphasize the importance of explicit methodological scaffolding. The structured approach employed in Sessions 3 and 4 incorporated distinct phases for open coding, axial coding, and selective coding, mirroring the grounded theory methodology of Corbin and Strauss (1990). The 23.9 percentage-point improvement achieved through structured prompting is consistent with broader findings in the literature. Nguyen-Trung (2025) likewise stressed such scaffolding through the GAITA and ACTOR frameworks, and Naeem et al. (2025) provided similar evidence that phased prompting aligned with Braun and Clarke’s six stages improves AI analytical quality. The present findings provide quantitative support for these largely prescriptive recommendations. The code-level F1 Score of 83.1% achieved by structured prompts also compares favorably with the approximately 80% agreement reported by Castellanos et al. (2025) in healthcare qualitative data and the high congruence reported by Bennis and Mouwafaq (2025) across multiple generative models.

5.1.2. Theme-Level Stability

A key finding is the hierarchical pattern of alignment, where theme-level agreement (93.9% average F1) exceeded code-level agreement (71.1% average F1) across all sessions. This difference suggests that Claude Code analysis exhibits what might be termed hierarchical convergence: while variation exists in granular code identification, the higher-order thematic structures remain stable and consistent with human analysis.

This pattern has methodological implications. In qualitative research, the primary analytical value typically resides at the thematic level, where patterns are synthesized into meaningful interpretive frameworks. The finding that all four Claude Code sessions achieved 100% recall at the theme level suggests that Claude Code analysis captures the essential conceptual structure of qualitative data, even when lower-level coding exhibits variation.

The hierarchical convergence phenomenon can be understood in terms of abstraction tolerance. At the code level, minor differences in terminology, scope, or conceptual boundaries can produce apparent disagreement even when the underlying concepts are similar. At the theme level, these variations are absorbed into broader categorical structures where semantic equivalence is more readily achieved. This suggests that theme-level findings from Claude Code analysis may be more stable and consistent than code-level findings. This pattern is consistent with observations elsewhere in the literature, though it has not previously been quantified in these terms. Castellanos et al. (2025) found that LLMs agreed with humans on theme alignment for two-thirds of analyzed topics, while noting greater divergence at the level of specific coding. Wachinger et al. (2025) similarly reported that LLM outputs were most convincing at the descriptive theme level. The present study extends these observations by providing a direct quantitative comparison: an average improvement of 22.8 percentage points from code-level to theme-level F1 Scores across all sessions. These findings indicate that hierarchical convergence may be a general property of AI-assisted thematic analysis rather than an artefact of a particular dataset or tool.

5.1.3. Inter-Session Consistency and Reproducibility

The inter-session consistency analysis revealed that AI-assisted thematic analysis shows strong reproducibility, especially at the theme level (86.1% average F1). Sessions using the same prompting approach showed higher internal consistency (same-type average: 88.4% for themes, 76.2% for codes) compared to cross-approach comparisons (cross-type average: 85.0% for themes, 58.3% for codes).

These findings address a common concern regarding the reliability of AI-assisted analysis: the potential for stochastic variation to produce inconsistent results across runs. While the code-level inter-session consistency (64.2%) indicates meaningful variation in specific code identification, the theme-level consistency (86.1%) suggests that the analytical conclusions remain stable. This pattern suggests that researchers employing Claude Code thematic analysis can have confidence in their thematic findings, though they should exercise appropriate caution when reporting specific codes as definitive.

The asymmetric directional patterns observed in inter-session analysis, where general prompt codes mapped well to structured sessions (87.3%) but not vice versa (44.5%), highlight the relationship between prompting complexity and analytical depth. Structured prompts appear to generate a superset of concepts that encompasses those identified through simpler approaches while introducing additional nuance and granularity. The inter-session theme-level consistency of 86.1% can be contextualized against existing reliability benchmarks. Jain et al. (2025) reported Cohen’s Kappa values exceeding 0.84 for Claude across six independent runs, while Hila and Hauser (2025) found substantial agreement (κ = 0.76–0.78) for deductive coding tasks. The present study’s findings are broadly consistent with these benchmarks, while adding a dimension not captured by Kappa alone: the distinction between code-level and theme-level consistency. The finding that same-approach sessions achieved higher consistency (76.2% for codes, 88.4% for themes) than cross-approach sessions (58.3% for codes, 85.0% for themes) also aligns with the sensitivity to prompting observed by Borchers et al. (2025), who found that coding accuracy was influenced by specific methodological conditions rather than being uniform across configurations.
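The directional asymmetry noted above can be illustrated with a toy example. The sketch below treats each session's codebook as a set of already-harmonized labels and computes directional coverage as the share of one session's codes that have a counterpart in the other. The labels and the exact-match rule are hypothetical simplifications, whereas the study matched concepts across differently worded labels.

```python
def coverage(source, target):
    """Percent of codes in `source` that also appear in `target`."""
    if not source:
        return 0.0
    return 100.0 * len(source & target) / len(source)

# Hypothetical, already-harmonized code labels (not taken from the study).
general_codes = {"trust", "delivery", "payment", "awareness"}
structured_codes = {
    "trust", "delivery", "payment", "regulation",
    "postal addressing", "habit", "it skills", "pricing",
}

# Most general-prompt codes find a match in the larger structured codebook ...
general_to_structured = coverage(general_codes, structured_codes)
# ... but many structured-prompt codes have no general-prompt counterpart.
structured_to_general = coverage(structured_codes, general_codes)

print(f"General -> structured: {general_to_structured:.1f}%")
print(f"Structured -> general: {structured_to_general:.1f}%")
```

When one codebook largely contains the other, coverage is high in one direction and low in the reverse, which is the pattern the superset interpretation predicts.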

5.2. Bridging a 14-Year Methodological Gap

A unique aspect of this study is the temporal gap between the original human analysis (conducted using NVivo in 2012) and the AI-assisted reanalysis (conducted using Claude Code in 2026). This 14-year interval raises questions about temporal validity and the stability of qualitative findings across analytical eras.

The high alignment achieved, especially Session 4’s 87.6% F1 Score at the code level and 100% at the theme level, suggests that the conceptual structure of the data remains interpretable across this temporal gap. This finding has two implications. First, it demonstrates the temporal stability of the original human analysis, as its conceptual framework could be reproduced over a decade later. Second, it supports the AI’s capacity to engage with historical qualitative data, suggesting potential applications for the re-analysis of archival qualitative datasets.

However, the temporal gap also introduces interpretive complexity. The AI’s extended themes, concepts not present in the original human analysis, may reflect either (a) aspects of the data that were underweighted in the original analysis, or (b) interpretive frameworks that have emerged in the intervening years that shape how the AI reads the data. For example, Session 1’s extended theme “E-commerce Opportunities & Future Outlook” may reflect contemporary discourse around digital transformation that was less prominent in 2012. This highlights the need for careful consideration of how temporal context shapes AI-assisted interpretation. The capacity of AI to engage with historical qualitative data has received little attention in the existing literature. Most validation studies compare AI and human analyses conducted within the same time period (e.g., Castellanos et al., 2025; Montes et al., 2025; Wachinger et al., 2025). The present study’s 14-year gap between original and AI analysis provides a distinctive test. The high alignment achieved suggests that the conceptual structure embedded in qualitative data can be recovered across significant temporal distances, though researchers should remain attentive to the possibility that AI interpretive frameworks reflect contemporary rather than historical analytical sensibilities.

5.3. Addressing Methodological Critiques

The scholarly debate over GenAI in qualitative research (discussed in Section 2.6) raises important considerations for interpreting the findings of this study. Jowsey, Braun, et al. (2025) argued that reflexive thematic analysis requires a subjective, positioned, reflexive researcher and that AI involvement is, therefore, methodologically incongruent. Friese et al. (2026), De Paoli (2026), and Greenhalgh (2026) countered that the critical question is not whether AI is involved, but whether interpretive authority remains with the human researcher. As noted in Section 3.1, the present study operates within the coding reliability tradition of thematic analysis, where consistency, alignment, and replicability are appropriate quality criteria, a distinct tradition from the reflexive thematic analysis that is the focus of Jowsey et al.'s critique.

The findings of this study nonetheless offer empirical evidence relevant to the broader debate. The importance of structured prompting, which improved code-level alignment by 23.9 percentage points, demonstrates that AI does not autonomously produce quality analysis. The quality of the output directly reflected the methodological scaffolding provided by the researcher. This supports the position advanced by Friese et al. (2026) that AI can function as a legitimate analytical support when used under close researcher direction, rather than as an autonomous interpreter. The AI sessions in this study did not engage in reflexive practice; they operated as methodologically-directed instruments responding to human-designed analytical protocols.

Furthermore, the study was not designed as an AI-led reflexive thematic analysis. The original 2012 analysis was conducted by a human researcher using structured, inductive content analysis with NVivo. The 2026 AI analyses serve as independent analytical perspectives against which the human analysis can be compared. This triangulation design allows identification of where AI and human analyses converge (suggesting stable, perspective-independent patterns) and where they diverge (suggesting areas requiring deeper interpretive engagement). Rather than displacing human analytical authority, this approach complements it by providing an external reference point.

The application context also plays a role. The present study examined business interview data concerning e-commerce adoption, an applied organizational context that differs from domains involving high levels of cultural, emotional, or political sensitivity. Prior research has identified limitations in AI performance in culturally nuanced contexts (Sakaguchi et al., 2025). Accordingly, the findings should be interpreted as context-specific rather than universally generalizable. While the results support the viability of AI-assisted analysis within certain applied research domains, they do not imply equivalence or adequacy in all qualitative contexts, particularly those requiring deep cultural or interpretive sensitivity.

5.4. Claude Code as a Research Tool

This study provides one of the first academic evaluations of Claude Code for qualitative research, addressing the gap identified in Section 2.8. The agentic capabilities described therein proved valuable in practice. Direct file system access enabled seamless processing of all 16 transcripts without manual uploads, while autonomous multi-file processing allowed analysis of the complete corpus in single sessions. Session persistence maintained analytical context throughout extended workflows, and structured output generation produced formatted HTML reports directly to the local file system.

These capabilities distinguish Claude Code from web-based interfaces that require copy-paste workflows and from API-based approaches that require custom development. For researchers without programming expertise who want to incorporate AI assistance, Claude Code offers an accessible entry point. The approximately 30-minute processing time per session contrasts with the weeks required for traditional analysis. However, this efficiency comparison must be contextualized by the fact that human analysis involves reflexive engagement that AI cannot replicate.

5.5. Implications

The implications of this study span methodological, practical, and conceptual dimensions. While they are presented below under these three headings for clarity, the categories are intertwined in practice: a methodological choice often carries practical consequences, and a practical pattern can prompt conceptual reflection. The headings are intended to aid navigation rather than to draw firm lines between related insights.

5.5.1. Methodological Implications

The multi-level alignment framework developed in this study, together with its use of the F1 Score as the alignment metric, offers a replicable approach for evaluating AI-human concordance in qualitative research. Existing studies have relied on simple agreement percentages or Cohen’s Kappa (Hila & Hauser, 2025; Jain et al., 2025), which do not distinguish between precision and recall. The F1 Score captures both dimensions, enabling researchers to identify whether an AI tool is generating relevant codes (precision) or capturing the full range of human-identified concepts (recall). This distinction proved critical in the present study: general prompts achieved high precision (87.5%) but low recall (44.7%), a pattern that would be obscured by a single agreement metric.

The inter-session replication design also provides a template for assessing AI reliability. By conducting multiple independent sessions with varying prompting strategies, researchers can evaluate both alignment with human analysis and reliability within a single study. This dual assessment addresses a gap identified in the literature, where alignment and reliability have typically been examined in isolation (Jain et al., 2025).

5.5.2. Practical Implications

The findings suggest several models for integrating AI-assisted analysis into qualitative research workflows.

First, AI analysis can serve as a first-pass analytical tool, generating initial codes and themes that human researchers subsequently refine, validate, and interpret. This model leverages AI efficiency while preserving human interpretive authority. The high alignment achieved in this study suggests that AI-generated frameworks can provide reliable scaffolding for subsequent human analysis.

Second, AI analysis can be conducted in parallel with human analysis, and the results compared to identify areas of convergence and divergence. Areas of agreement may be treated with higher confidence, while divergent findings prompt deeper examination. This study’s alignment metrics could serve as benchmarks for such comparative evaluation.

Third, AI analysis can be used to preliminarily explore large datasets, identifying candidate themes and patterns that human researchers subsequently investigate through traditional methods. The AI’s capacity to process large volumes of text efficiently makes it suitable for large-scale qualitative data applications.

Across all models, the finding that structured prompting outperformed general prompting by 23.9 percentage points at the code level provides clear practical guidance: researchers should invest in methodological scaffolding when designing prompts, mirroring established analytical phases rather than relying on open-ended instructions.

5.5.3. Conceptual Implications

The hierarchical convergence pattern described in Section 5.1.2 has implications for how AI-assisted findings should be reported. Thematic conclusions from AI analysis may be sufficiently stable for informing research findings, while code-level details should be treated as indicative rather than definitive. However, the present study examined a single dataset in an applied business context, and the generalizability of this pattern remains an open question. Future research should examine whether hierarchical convergence holds in domains requiring deeper cultural or interpretive sensitivity.

The importance of structured prompting also raises a broader point about the nature of AI-assisted qualitative analysis. The AI functions as a methodologically-directed instrument rather than an autonomous interpreter; the quality of its output reflects the quality of the methodological scaffolding provided by human researchers. This suggests a collaborative model where human researchers retain interpretive authority while delegating certain analytical tasks to AI systems, reconceptualizing the researcher’s role toward methodological design, prompt engineering, and critical evaluation of AI-generated outputs.

5.6. Limitations

This study should be interpreted in light of several limitations that define the scope of its contributions.

First, the human analysis used as the reference point was conducted by a single researcher. Although the original analysis followed a structured coding process and was later supported by quantitative validation with a larger sample, it remains one valid interpretation rather than an objective standard. As such, the alignment metrics reported in this study reflect concordance with this specific analytical perspective rather than definitive correctness. Future research could strengthen this design by incorporating multiple human coders and reporting inter-rater reliability, enabling comparison between AI-human and human-human agreement.

Second, the F1 Score captures structural similarity between AI and human outputs but does not assess interpretive depth, reflexivity, or the richness of meaning-making that characterize other qualitative traditions, particularly reflexive thematic analysis. The findings should therefore be interpreted within the coding reliability framework adopted by this study and not generalized to all forms of qualitative inquiry.

Third, the study is based on a single dataset drawn from interviews on e-commerce adoption in Saudi Arabia. While this dataset provides a complete analytical arc from raw data to validated thematic structure, the findings may not generalize to other domains, types of qualitative data, or research contexts. In particular, datasets requiring deeper latent interpretation, culturally embedded meaning, or highly specialized domain knowledge may yield different patterns of alignment. Replication across diverse datasets is necessary to assess the robustness of the observed multi-level alignment pattern.

Fourth, the study design included only two sessions per prompting condition, which limits statistical inference. While the observed 23.9 percentage-point difference between approaches is larger than within-approach variation, the findings should be considered exploratory evidence rather than statistically confirmed effects. Expanding the number of sessions in future research would enable a more rigorous comparison of prompting strategies.

Fifth, two tool-specific limitations should be acknowledged. The possibility of indirect exposure of the dataset to the AI model during training cannot be entirely excluded. The original PhD thesis is publicly available, and while no prior coding or analytical results were provided during the AI sessions, the potential for partial memorization remains a theoretical concern. However, the observed variability across AI sessions suggests that outputs were not deterministic reproductions of a fixed source. Additionally, the study focuses on a single AI system (Claude Code, Opus 4.5) and a specific agentic workflow. Given the rapid evolution of LLMs and differences in architecture and training data, the results may not generalize to other models or tools. Future research using novel datasets and comparative multi-system designs would address both concerns.

Despite these limitations, the study provides a controlled and transparent experimental framework that isolates key variables (prompting strategy, analytical level, and inter-session consistency), offering a foundation for cumulative research on AI-assisted qualitative analysis.

6. Conclusion

This study shows that alignment between AI-assisted and human thematic analysis is inherently multi-level. While agreement at the level of individual codes is variable, higher-order thematic structures consistently converge. This finding challenges the assumption that close correspondence in coding is necessary for meaningful qualitative outcomes and instead highlights a hierarchical pattern of convergence, where variability at the micro-level does not prevent stability at the macro-level.

The results show that, under appropriate conditions, AI-assisted analysis can achieve strong alignment with human interpretation. The best-performing session reached 87.6% F1 at the code level and 100% at the theme level, indicating that AI systems are capable of both detailed pattern recognition and coherent conceptual synthesis. However, this capability is not inherent to the technology alone. The observed 23.9 percentage-point improvement in code-level alignment under structured prompting underscores the importance of methodological scaffolding. AI systems do not independently reproduce rigorous analytical processes; they require explicit guidance shaped by qualitative research principles.

The consistency of theme-level findings across independent sessions (86.1%) provides evidence that AI-assisted analysis can produce reproducible higher-order interpretations, even when lower-level coding varies. This has important implications for practice. It suggests that AI-generated themes may be sufficiently stable to inform research conclusions, while code-level outputs should be interpreted as provisional and subject to researcher validation.

These findings should be interpreted within the boundaries of this study, including the use of a single dataset, a coding reliability analytical framework, and a specific AI system (Claude Code with the Opus 4.5 model). Within these constraints, the results provide evidence that AI-assisted thematic analysis can function as a methodologically sound complement to human analysis (Figure 8).

Figure 8.

Summary of key findings and practical guidance for AI-assisted thematic analysis

As qualitative research increasingly engages with large and complex datasets, AI-assisted approaches offer a practical pathway for extending analytical capacity. Their value lies not in replacing human interpretation, but in supporting it. When guided by structured prompting, critical oversight, and clear methodological framing, AI systems can contribute meaningfully to qualitative inquiry. Rather than redefining qualitative analysis, they reshape how it is operationalized, positioning the researcher as both analyst and methodological designer of AI-assisted processes.

Supplemental Material

Supplemental material for From Code Variability to Theme Convergence: AI–Human Alignment in Thematic Analysis With Claude Code by Rayed AlGhamdi in International Journal of Qualitative Methods

Footnotes

Acknowledgments

The author gratefully acknowledges the Deanship of Scientific Research (DSR) for technical and financial support.

ORCID iD

Rayed AlGhamdi

Funding

The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This project was funded by the Deanship of Scientific Research (DSR) at King Abdulaziz University, Jeddah, Saudi Arabia, under grant no. (IPP:572-611-2025).

Declaration of Conflicting Interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The qualitative dataset analyzed in this study originates from a previously published doctoral thesis (AlGhamdi, 2014). To support transparency and reproducibility, all materials used in the Claude Code reanalysis are openly available via the Open Science Framework. The repository is organized into three folders: (1) Claude Code Analysis, containing the complete outputs from all sessions, including codebooks, thematic reports, and prompting scripts; (2) Human Analysis, containing the original PhD thesis along with the research problem, questions, and methodology extracted from it, the corresponding NVivo coding files, and two Excel sheets providing side-by-side comparisons of codes and themes across the human analysis and all four Claude Code sessions; and (3) Transcripts, containing the anonymized interview transcripts used in both the original and AI-assisted analyses. All data are fully anonymized and contain no personally identifiable information.

Use of AI and Data Handling Statement

AI Tools Used - This study employed Claude Code (Anthropic’s command-line interface, powered by Claude Opus 4.5, model ID: claude-opus-4-5-20251101) as the primary analytical tool for AI-assisted thematic analysis. Additionally, the same model was used to support language refinement and improve the clarity, flow, and structure of the manuscript during the writing phase.

Data Processing and Privacy - Claude Code operates as a local command-line tool with direct file system access, processing data through Anthropic’s API. The interview transcripts used in this study were from a previously published PhD thesis (AlGhamdi, 2014) and contained no personally identifiable information, as all participant data had been anonymized in the original study. Researchers using Claude Code with sensitive or unpublished data should review Anthropic’s data retention policies and consider local processing options where available.

Reproducibility Considerations - AI model outputs may vary across different sessions and model versions. The specific model version (claude-opus-4-5-20251101) is documented to enable future comparison studies. Researchers attempting replication should note that subsequent model updates may produce different results, and exact prompts should be used to maximize comparability.

The author retains full responsibility for the accuracy, originality, and integrity of the work.

Supplemental Material

Supplemental material for this article is available online.

References

AlGhamdi (2014). Diffusion of the adoption of online retailing in Saudi Arabia. Doctoral dissertation, Griffith University. arXiv preprint arXiv:1406.1469. https://doi.org/10.48550/arXiv.1406.1469

Anthropic (2024). Claude Code documentation. Retrieved February 22, 2026, from https://docs.anthropic.com

Barros, C. F.; Azevedo, B. B.; Neto, V. V. G.; Kassab; Kalinowski; Do Nascimento, H. A. D.; Bandeira, M. C. (2025). Large language model for qualitative research: A systematic mapping study. In 2025 IEEE/ACM International Workshop on Methodological Issues with Empirical Studies in Software Engineering (WSESE) (pp. 48–55). IEEE. https://doi.org/10.1109/WSESE66602.2025.00015

Bennis; Mouwafaq (2025). Advancing AI-driven thematic analysis in qualitative research: A comparative study of nine generative models on Cutaneous Leishmaniasis data. BMC Medical Informatics and Decision Making, 25(1), 124. https://doi.org/10.1186/s12911-025-02961-5

Borchers, Shahrokhian, Balzan, Tajik, Sankaranarayanan, & Simon (2025). Temperature and persona shape LLM agent consensus with minimal accuracy gains in qualitative coding. arXiv preprint arXiv:2507.11198, 1–35. https://doi.org/10.48550/arXiv.2507.11198

Braun & Clarke (2006). Using thematic analysis in psychology. Qualitative Research in Psychology, 3(2), 77–101. https://doi.org/10.1191/1478088706qp063oa

Braun & Clarke (2021). One size fits all? What counts as quality practice in (reflexive) thematic analysis? Qualitative Research in Psychology, 18(3), 328–352. https://doi.org/10.1080/14780887.2020.1769238

Castellanos, Jiang, Gomes, Vander Meer, & Castillo (2025). Large Language Models for Thematic Summarization in Qualitative Health Care Research: Comparative Analysis of Model and Human Performance. JMIR AI, 4(7), e64447. https://doi.org/10.2196/64447

Chatzichristos (2025). Qualitative Research in the Era of AI: A Return to Positivism or a New Paradigm? International Journal of Qualitative Methods, 24, 16094069251337583. https://doi.org/10.1177/16094069251337583

Corbin, Juliet M., & Strauss, Anselm (1990). Grounded theory research: Procedures, canons, and evaluative criteria. Qualitative Sociology, 13(1), 3–21. http://link.springer.com/10.1007/BF00988593

De Paoli (2026). Why We Should Reject to Reject the Use of Generative Artificial Intelligence in Qualitative Analysis: A Response to Jowsey, Braun, Clarke, Lupton, and Fine (2025). Qualitative Inquiry, 10778004261425137. https://doi.org/10.1177/10778004261425137

Fehring, Frings, Rust, Kempny, Thürmann, P. A., & Meister (2025). Extension of the Consolidated Criteria for Reporting Qualitative Research Guideline to Large Language Models (COREQ+ LLM): Protocol for a Multiphase Study. JMIR Research Protocols, 14(1), e78682. https://doi.org/10.2196/78682

Friese, Nguyen-Trung, Powell, & Morgan, D. L. (2026). Beyond Binary Positions: Making Space for Critical and Reflexive GenAI Integration in Qualitative Research. Qualitative Inquiry, 10778004261429393. https://doi.org/10.1177/10778004261429393

Greenhalgh (2026). Reflexive Qualitative Research and Generative AI: A Call to go Beyond the Binary. Qualitative Inquiry, 10778004261429383. https://doi.org/10.1177/10778004261429383

Hila & Hauser (2025). Assessing the Reliability of Large Language Models for Deductive Qualitative Coding: A Comparative Intervention Study with ChatGPT. Proceedings of the Association for Information Science and Technology, 62(1), 275–285. https://doi.org/10.1002/pra2.1255

Huang, Durmus, Handa, McCain, Tamkin, Stern, Hong, & Ganguli (2025). Values in the Wild: Discovering and Mapping Values in Real-World Language Model Interactions. In Second Conference on Language Modeling. COLM. 102–115.

Jain, Adeyinka, Roseman, & Allsop (2025). Multi-LLM Thematic Analysis with Dual Reliability Metrics: Combining Cohen's Kappa and Semantic Similarity for Qualitative Research Validation. arXiv preprint arXiv:2512.20352, 1–11. https://doi.org/10.48550/arXiv.2512.20352

Jowsey, Braun, Clarke, Lupton, & Fine (2025). We reject the use of generative artificial intelligence for reflexive qualitative research. Qualitative Inquiry, 10778004251401851. https://doi.org/10.1177/10778004251401851

Jowsey, Stapleton, Campbell, Davidson, McGillivray, Maugeri, & Keogh (2025). Frankenstein, thematic analysis and generative artificial intelligence: Quality appraisal methods and considerations for qualitative research. PLoS One, 20(9), e0330217. https://doi.org/10.1371/journal.pone.0337734

Liu & Sun (2025). From voices to validity: Leveraging large language models (LLMs) for textual analysis of policy stakeholder interviews. AERA Open, 11, 23328584251374595. https://doi.org/10.1177/23328584251374595

Mavrych, Yaqinuddin, & Bolgova (2025). Claude, ChatGPT, Copilot, and Gemini performance versus students in different topics of neuroscience. Advances in Physiology Education, 49(2), 430–437. https://doi.org/10.1152/advan.00093.2024

Mehta, S. D., Paul, Awiti, Young, Zulaika, Otieno, F. O., Phillips-Howard, P. A., Mason, & Bhaumik (2025). Evaluation of large language models within GenAI in qualitative research. Scientific Reports, 15(1), 34993. https://doi.org/10.1038/s41598-025-18969-w

Montes, C. M., Feldt, Martos, C. M., Ouhbi, Premanandan, & Graziotin (2025). Large Language Models in Thematic Analysis: Prompt Engineering, Evaluation, and Guidelines for Qualitative Software Engineering Research. arXiv preprint arXiv:2510.18456, 1–17. https://doi.org/10.48550/arXiv.2510.18456

Naeem, Smith, & Thomas (2025). Thematic Analysis and Artificial Intelligence: A Step-by-Step Process for Using ChatGPT in Thematic Analysis. International Journal of Qualitative Methods, 24, 16094069251333886. https://doi.org/10.1177/16094069251333886

Nguyen-Trung (2025). ChatGPT in thematic analysis: Can AI become a research assistant in qualitative research? Quality & Quantity, 59(6), 4945–4978. https://doi.org/10.1007/s11135-025-02165-z

Resnik, D. B., & Hosseini (2025). The ethics of using artificial intelligence in scientific research: New guidance needed for a new tool. AI and Ethics, 5(2), 1499–1521. https://doi.org/10.1007/s43681-024-00493-8

Sakaguchi, Sakama, & Watari (2025). Evaluating ChatGPT in Qualitative Thematic Analysis With Human Researchers in the Japanese Clinical Context and Its Cultural Interpretation Challenges: Comparative Qualitative Study. Journal of Medical Internet Research, 27, e71521. https://doi.org/10.2196/71521

Samuel & Wassenaar (2025). Joint editorial: Informed consent and AI transcription of qualitative data. Journal of Empirical Research on Human Research Ethics, 20(1-2), 3–5. https://doi.org/10.1177/15562646241296712

Schroeder, Aubin Le Quéré, Randazzo, Mimno, & Schoenebeck (2025). Large Language Models in Qualitative Research: Uses, Tensions, and Intentions. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems (pp. 1–17), Yokohama, Japan, 26 April 2025. https://doi.org/10.1145/3706598.3713120

Tai, R. H., Bentley, L. R., Xia, Sitt, J. M., Fankhauser, S. C., Chicas-Mosier, A. M., & Monteith, B. G. (2024). An examination of the use of large language models to aid analysis of textual data. International Journal of Qualitative Methods, 23, 16094069241231168. https://doi.org/10.1177/16094069241231168

Vectara. (2026). hallucination-leaderboard [Source code]. GitHub. Retrieved February 14, 2026, from https://github.com/vectara/hallucination-leaderboard

Wachinger, Bärnighausen, Schäfer, L. N., Scott, & McMahon, S. A. (2025). Prompts, pearls, imperfections: Comparing ChatGPT and a human researcher in qualitative data analysis. Qualitative Health Research, 35(9), 951–966. https://doi.org/10.1177/10497323241244669

Wójcik, Adamiak, Czerepak, Tokarczuk, & Szalewski (2025). Comparing the performance of ChatGPT, Gemini, and Claude in English and Polish on medical examinations. Scientific Reports, 15, 33083. https://doi.org/10.1038/s41598-025-17030-0

Zhang, Xie, Rubino, Graver, Cai, Kim, & Carroll, J. M. (2025). Exploring inductive and deductive qualitative coding with AI: Investigating inter-rater reliability between large language model and human coders. AHFE Open Access, 195, 142–152. https://doi.org/10.54941/ahfe1006232
