Abstract
In this article, we examine the application of qualitative methods for exploring and capturing the emergent behaviors and characteristics of AI systems. In doing so, we formulate key facets of the ‘interviewing AI’ framework: (1) exploratory familiarization to develop an initial understanding of the AI system's functionalities and responses, (2) systematic investigation through structured probing to elicit behaviors such as hallucinations and manifestations of reasoning, using different prompting approaches, and (3) two complementary approaches: temporal and comparative analyses of AI behavior, examining changes over time or comparing multiple systems at a single point in time. We further discuss (4) potential qualitative analysis methods such as critical discourse analysis or content analysis adapted to theorize and interpret AI behaviors, and (5) triangulation, which integrates qualitative insights from interviewing AI with other methods such as user and expert studies, public interaction records analysis, and quantitative analysis to form a multidimensional and comprehensive understanding of AI systems. Finally, we address (6) ethical considerations by emphasizing transparency, reflexivity, and responsible interpretation of findings to ensure rigorous and contextualized research practices.
Introduction
The increasing adoption of generative AI systems, such as large language models (LLMs), along with their unpredictable performances, has raised key challenges about ways to capture and study their behaviors and characteristics. A nascent but growing field of research, often referred to as “machine behavior,” has begun to study AI systems not merely as engineering artifacts but as agentic systems with distinct behavioral patterns that unfold within use contexts beyond training phases (Rahwan et al., 2019). While avoiding anthropomorphic assumptions, this perspective advocates for recognizing the emergent nature of AI behaviors, which are not entirely predictable and comprehensible even to their developers (Tsvetkova et al., 2024). Such unpredictability often results in surprising capabilities that surface in practice, sometimes leading to harm or what Marres et al. (2024) call “AI frictions.” These frictions challenge the idea of AI as a coherent, seamless ‘thing’ and underscore the messy and unpredictable reality of its deployment in the real world (Kaun and Männiste, 2025).
AI behaviors are increasingly emergent in use as they arise from the interaction of the model's architecture, training data, and real-world environments rather than being merely explicitly programmed (Jones et al., 2025). Hence, emergence in this context highlights the dynamic nature of AI systems as they engage with complex inputs and newly experienced environments. These systems can develop abilities or shortcomings not necessarily present in smaller models or during training phases (Yampolskiy, 2025).
As AI systems become more integrated into diverse real-world applications, from workplaces to medical domains (Margaryan, 2023), the range of surprising and unforeseen behaviors is widening, manifesting what Horton (2023) refers to as the “performativity” problem, shaped by interactions within specific human environments (Hansen, 2021). The dynamic and opaque nature of generative AI thus calls for novel research methods to capture the full range of its capabilities, behaviors, and limitations (Magee et al., 2023). More than ever, questions remain: How do we reliably capture and articulate the “voice” of AI systems when dealing with entities that increasingly act as communicative participants rather than merely mediating tools or engineering artifacts (Guzman and Lewis, 2020; Hepp et al., 2023)? And what are the implications for research methods when the subject of study is no longer just human? This challenge becomes even more significant as we confront the reality that AI systems are not static tools and evolve based on user interactions, displaying human-like behaviors that may resemble autonomy (Bareis, 2024; Peter et al., 2025; Yampolskiy, 2025).
Why qualitative research?
In response to these challenges, we argue for applying qualitative research to study the emergent characteristics and behaviors of non-deterministic AI systems (systems that can generate different outputs for similar inputs due to inherent randomness or stochastic processes) (Magee et al., 2023; Tsvetkova et al., 2024). While quantitative and computational methods provide valuable insights into the performance of AI systems, qualitative approaches can provide unique affordances in explaining the contextual and nuanced nature of these behaviors that partly stem from situated interactions with the users (Mlynář et al., 2025).
What sets a method like interviewing AI apart is its ability to capture the depth, meaning, and complexity of phenomena in the context of their occurrence (Seaver, 2017). In contrast to quantitative approaches, which focus on measurable outcomes, qualitative methods provide a more interpretive and explorative view of AI behaviors in complex contexts (Ho et al., 2024). Qualitative methods reveal patterns missed by aggregate and reductionist metrics, especially regarding affordances, breakdowns, and sociopolitical entanglements of AI applications as contested domains imbued with power dynamics (Gourlet et al., 2024; Suchman, 2023). Approaches such as critical discourse analysis are especially helpful in uncovering the normative underpinnings and discursive constructions in AI outputs (including hegemonic ideologies and implicit biases embedded in training data or reflected in user interactions). Qualitative research therefore helps unearth and critically examine tensions that emerge during use across different contexts. This stands in contrast to dominant portrayals of AI systems as universally applicable, user-friendly, efficient, and versatile agents (Luchs et al., 2023).
Research establishes that generative AI systems’ interactive nature is central to their emergent behaviors, as these systems evolve based on user inputs that reflect unique interpretations and intentions (Hancock et al., 2020). Qualitative research is particularly well-positioned to capture this dynamism (what Rahwan et al. (2019) call hybrid ‘human-machine behavior’), as it can aptly document how AI systems shift and adapt in real-world contexts of use. In relation to this interactivity, which is central to emergent behaviors (Hansen, 2021), qualitative research can direct attention to human judgment and subjective assessment to explore and evaluate AI behaviors holistically beyond technical benchmarks or narrow task-centered performance metrics (e.g., contextual relevance of AI-generated outputs) (Glazko et al., 2023; Shah, 2023).
What interviewing AI entails
In this article, we formulate ‘interviewing AI’ as a qualitative framework for studying AI. Much like classic qualitative research, the goal here is to explore, document, and interpret AI behavior as it develops through real-world interactions. Particularly, interviewing AI is valuable in answering why and how questions (Creswell, 2009); for example, exploring why AI systems have arrived at certain conclusions based on analysis of false-positive and/or false-negative results (Oami et al., 2024). As such, we argue that by employing this qualitative framework, researchers can complement computational methods to capture the depth, context, and complexity of AI systems.
The rest of the article unpacks what we mean by interviewing AI (see Table 1 for an outline of the key elements of the framework). We begin with experimentation to understand AI behavior firsthand and gain an initial perspective, then probing to uncover deeper and more systematic insight into AI behavior. Next, we introduce Temporal Interaction Analysis (TIA) and Comparative Synchronic Analysis (CSA) for documenting behavior over time or across systems. Finally, we highlight qualitative analysis techniques and triangulation to capture AI's emergent, adaptive nature.
Table 1. Key elements of the interviewing AI framework.
Exploratory familiarization with AI
Before diving into the more structured data collection and analysis processes, researchers need to gain some initial understanding of the interactive milieu and the specific AI system they are studying. This familiarization phase involves first-hand experimentation with the AI system in question to gain an initial intuition about the system's operations, behaviors, inferences, potential limitations, blind spots, and, more broadly, structural societal concerns (Marres et al., 2024). This process can be seen as a form of “human-in-the-loop” interaction (Shah, 2023) as researchers’ own engagement generates in-vivo understanding, key to documenting the behaviors that arise only when AI is deployed in real-world environments (de Seta et al., 2024).
During experimentation, researchers can engage in a variety of tasks to more effectively develop an understanding of AI behavior and capabilities, what Guzman and Lewis (2020) might refer to as “the functional dimensions” of the system. These include observing how the system reacts to typical or varied user prompts and evaluating its performance by contrasting simple and complex questions to assess the adaptability and reliability of responses. Experimentation is therefore the first step in what Margaryan (2023) calls ‘intelligent interrogation,’ the practice of formulating practical questions for AI. Researchers can also familiarize themselves with behavioral patterns, speculating when, why, and how specific outputs may emerge (specifically how the AI may handle ambiguity and contradictions). They can also focus on topics or types of questions in which the AI excels or struggles, while also noting quirks or idiosyncrasies in its behavior or reasoning.
This early engagement helps establish the behaviors and performances researchers can expect to explore more deeply. The experiential approach helps researchers gain a “grip” on AI's abilities, issues, and concrete operations (Gourlet et al., 2024), preparing them for the next phase, where the goal shifts from observation to systematically eliciting evidence and building theories about the system's capabilities. For example, Barambones et al. (2024) performed a preliminary study to understand system behavior and optimize persona training. By testing various prompt formats and model settings, they refined prompt design, enhanced the realism of simulated interviews and informed the main study's setup.
It is important to note that experimentation is a situated encounter with the system and its affordances, and so is inescapably shaped by the researchers’ prior experiences, expectations, and sociomaterial context (Suchman and Thimm, 2024). As such, the observations and intuitions gained are co-constructed through this situatedness and should be treated as partial and contingent. This requires reflexive consideration of how researchers’ positionality (disciplinary background, familiarity with AI, expectations, and biases) shapes both the prompts they use and the interpretations they carry into the next stages of investigation.
Systematic investigation through structured probing
The next step is structured probing, which focuses on eliciting deeper, structured insights into the system's internal logic and emergent behaviors. Probing, a foundational ethnographic technique for eliciting deeper meanings (Patton, 2014), has recently been adapted to study generative models (de Seta, 2024). Probing is designed to generate a theoretically oriented, analytical viewpoint that enables researchers to systematically analyze and classify patterns, behaviors, and underlying reasoning processes within the AI system and connect them with extant theoretical knowledge. In contrast to the exploratory and open-ended nature of the familiarization step, probing is more purposive and targeted and involves systematically crafting and posing structured prompts to reveal, record, and analyze detailed, complex, or unexpected behaviors from the AI (see for example, Krapp et al., 2024).
Probing can serve several crucial goals in interviewing AI, and these goals can be achieved through multiple approaches (see Table 2 for an outline).
Table 2. Various goals achieved through different probing approaches.
Documenting invisible behaviors
Probing facilitates the evaluation of how the AI system behaves based on diverse inputs, including varied, ambiguous, or even contradictory prompts (Abdullahi et al., 2024). This process identifies recurring patterns, themes, and responses to varied interaction strategies (Scholl et al., 2024) and uncovers behaviors that might not necessarily be evident during casual interactions (Magee et al., 2023). For example, probing can capture patterns of emergent behaviors such as hallucinations (responses that are plausible-sounding but factually incorrect) or creative outputs entailing novel combinations of ideas. This exploration is key to theorizing the system's capabilities and the nature of its emergent behaviors. For example, Triem and Ding (2024) systematically examined instances where LLMs changed their opinions to specifically gain an understanding of the structure of their reasoning.
Revealing boundaries through breakdown analysis
Probing can trigger specific behaviors in the AI system by using complex prompts (Schmidt et al., 2024). This approach pushes the system to its limits, revealing, for example, how it manages challenging or edge-case scenarios. Such targeted probing helps identify instances in which the system exhibits unexpected performance or fails to maintain logical consistency. For example, researchers directly “interviewed LLMs,” asking them to identify common failure modes, such as those observed in the 11–20 Money Request Game, in which LLMs struggled to provide accurate explanations for their choices and often exhibited reasoning that deviated significantly from human participants (Gao et al., 2024). Capturing and analyzing these instances of breakdown or what Waller et al. (2024) dubbed “questionable occasions” helps show specific limits to the systems’ capabilities, such as reasoning or intelligence (e.g., Lorè and Heydari, 2024). A key example of breakdown in human-AI interactions is moments of miscommunication where systems exhibit unexpected communicative performance or failures to maintain logical consistency (Mlynář et al., 2025).
Unpacking AI reasoning and internal logic
Probing can produce rich theoretical insights that embody the AI's underlying decision-making processes and reasoning (Magee et al., 2023). By closely analyzing the responses generated through probing, researchers can gather evidence about how the AI processes inputs, how it formulates its responses, and which factors may more significantly influence its outputs (Henrickson and Meroño-Peñuela, 2025). This method is beneficial for opening the black box of AI systems, where the internal mechanisms are often opaque and not fully understood. Probing approaches in this context can focus on the explanations provided by the system itself to understand its reasoning process and identify potential areas of bias or errors in response to varied stimuli (e.g., identifying instances where the AI may misunderstand the meaning of specific relations by analyzing the explanations provided by the system (Cohn, 2023)). As such, the primary focus of the analysis is not solely on whether the system completes tasks correctly or incorrectly but rather on how it arrives at its conclusions (Abdullahi et al., 2024). For instance, researchers qualitatively analyzed the reasoning provided by an AI system to interpret the logic behind its decisions in an effort to understand how it links its expressed values to its actions (Leng and Yuan, 2023) or to determine if it was consistent with common misconceptions in educational contexts (Kieser et al., 2023).
Approaches in probing
Researchers can employ several approaches during the probing process to achieve the above goals, and each yields different types of insights. It is important to note that these approaches can be effectively combined rather than treated as mutually exclusive strategies.
Systematic prompt variation
One effective probing approach involves creating a framework of prompts that vary systematically in complexity, structure, or style. Such prompt structures can be derived from existing deductive theoretical models. This approach helps identify how the AI responds to different types of inputs and reveals potential patterns or inconsistencies in its behavior. For example, Sänger et al. (2024) assessed AI's capabilities in interpreting and manipulating scientific workflow descriptions by using a variety of prompts designed for distinct research objectives (e.g., comprehending/explaining workflows, modifying/adapting workflows, and extending workflows). Consequently, the prompts used were varied to simulate the interaction between a user and the AI system for different tasks.
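To make the mechanics of this approach concrete, the grid of varied prompts can be sketched in code. The following is a minimal, illustrative sketch (not a procedure from the article): the dimensions (`TOPICS`, `COMPLEXITIES`, `STYLES`), their values, and the `build_prompt` helper are all hypothetical placeholders a researcher would replace with dimensions derived from their own deductive theoretical model.

```python
from itertools import product

# Hypothetical variation dimensions; in a real study these would be
# derived from the theoretical model guiding the probe design.
TOPICS = ["causal reasoning", "arithmetic"]
COMPLEXITIES = ["simple", "multi-step"]
STYLES = ["direct question", "scenario-based"]

def build_prompt(topic, complexity, style):
    """Compose one probe from a single cell of the variation grid."""
    return f"[{style}] Give a {complexity} explanation of {topic}."

def generate_probe_grid():
    """Return every combination so responses can be compared cell by cell."""
    return [
        {"topic": t, "complexity": c, "style": s, "prompt": build_prompt(t, c, s)}
        for t, c, s in product(TOPICS, COMPLEXITIES, STYLES)
    ]

grid = generate_probe_grid()
print(len(grid))  # 2 topics x 2 complexities x 2 styles = 8 probes
```

Holding two dimensions fixed while varying the third lets the researcher attribute a change in system behavior to a specific feature of the input, which is the analytic point of systematic variation.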
Role-playing scenarios
By engaging the AI in role-playing scenarios, researchers can also observe how the system adapts its behavior across various contexts. This means asking the system to take on different personas or roles (e.g., a teacher, a friendly companion, an autistic person, a creative writer, a 10th-grade student in school, or a debate opponent) (De Freitas et al., 2024; Kieser et al., 2023; Krapp et al., 2024; Park et al., 2025). Role-playing scenarios enable unique insights into the AI's ability to shift perspectives and employ different modes of interaction, directing attention to both the flexibility and potential limitations of its conversational and thinking abilities. For example, through a “Tell Me Your Story” approach, Munn and Henrickson (2024) prompted an AI system as a social agent to disclose information about its design, training data, and embedded values. This approach also helps researchers disclose implicit biases and inclinations that might not be easily revealed through direct questioning (e.g., Buyl et al., 2024), shedding light on otherwise obscured aspects of the AI's internal logic and the constraints imposed by design guardrails. As an example, Magee et al. (2023) drew on a role-playing approach by assigning distinct personas (“Zhang” as helpful, “Ali” as truthful, and “Maria” as harmless) to explore how the system adapted to different normative roles. These personas helped the researchers examine how the AI adjusted its response based on the assigned identity, potentially surfacing embedded values.
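A persona instruction of this kind is straightforward to template. The sketch below is purely illustrative: the `persona_prompt` helper and the example personas are hypothetical, not drawn from the cited studies.

```python
def persona_prompt(persona, task):
    """Wrap a task in a persona instruction; wording is one common pattern."""
    return f"You are {persona}. Stay in character while you {task}"

# Hypothetical personas echoing the kinds listed above (teacher, debate opponent).
personas = ["a patient 10th-grade teacher", "a skeptical debate opponent"]
task = "explain why the sky is blue."
prompts = [persona_prompt(p, task) for p in personas]
```

Issuing the same underlying task under each persona, and then qualitatively comparing the responses, is what surfaces the perspective-shifting (or its failure) discussed above.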
Boundary testing
Deliberately pushing the AI to its limits through boundary testing helps reveal the limits of its capabilities and allows researchers to observe how the system behaves when faced with challenging or incomprehensible tasks (Zhan et al., 2023). This can involve presenting the AI with complex, ambiguous, or nonsensical questions to determine how it responds. For example, Stroebl et al. (2024) were able to elicit some of AI systems’ boundaries in producing reliably correct code by focusing on generating false positives. Boundary testing is beneficial for identifying where the system's reasoning breaks down (what we called breakdown analysis earlier) or where it may produce unexpected outputs, such as hallucinations or overly simplistic responses to complex problems.
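A boundary-testing suite can be organized as an ordered ladder of probes, from well-formed to nonsensical, so the researcher can note where coherent responses stop. The probe categories and the stub model below are hypothetical illustrations, not instruments from the cited studies.

```python
# Illustrative edge-case probes ordered from well-formed to nonsensical.
BOUNDARY_PROBES = [
    ("well-formed", "What is 17 * 23?"),
    ("ambiguous", "Is the answer to the previous question correct?"),
    ("contradictory", "List three even prime numbers greater than 2."),
    ("nonsensical", "How heavy is the color of Tuesday?"),
]

def run_boundary_suite(ask):
    """Pair each probe with the system's response for later qualitative coding."""
    return [(kind, prompt, ask(prompt)) for kind, prompt in BOUNDARY_PROBES]

# A stub stands in for a real model call here.
results = run_boundary_suite(lambda p: f"response to: {p[:20]}...")
```

Keeping the probe kind attached to each response makes the subsequent breakdown analysis systematic: the researcher codes responses per category rather than recalling which prompt was which.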
Chain-of-thought probing
Another probing approach involves asking the AI to explain how it arrived at a particular answer, specifically contributing to what we earlier referred to as “unpacking AI reasoning.” This approach, known as chain-of-thought prompting, pushes the AI system to articulate the steps it took to reach a conclusion, which provides a window into how the AI structures its reasoning processes (Doshi et al., 2024) and therefore facilitates opening its black box (Micus et al., 2024). Chain-of-thought prompting may result in breaking down complex tasks into smaller steps and generating more relevant outputs (Spurlock et al., 2024). Qualitative analysis of AI outputs generated by this type of probing can be especially helpful in offering insight into the AI's internal logic and whether and how it is inconsistent or deficient across different contexts (Mirzadeh et al., 2024). For example, Li et al. (2023) applied a method akin to chain-of-thought prompting to track AI agents’ reasoning as they updated belief states in collaborative tasks, demonstrating how the agents processed and adapted to new information.
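In practice, chain-of-thought probing pairs a reasoning cue with a way of separating the articulated steps from the final answer for analysis. The sketch below is a hypothetical illustration: the cue wording is one common variant, the parsing helper assumes numbered steps, and the canned response stands in for a real model call.

```python
def with_chain_of_thought(question):
    # One common chain-of-thought cue; the exact wording is not standardized.
    return question + "\nLet's think step by step, numbering each step."

def split_reasoning_steps(response):
    """Separate numbered reasoning lines from the final answer line."""
    lines = [ln.strip() for ln in response.splitlines() if ln.strip()]
    steps = [ln for ln in lines if ln[0].isdigit()]
    answer = lines[-1] if lines and not lines[-1][0].isdigit() else None
    return steps, answer

# Canned response in place of an actual system call.
fake_response = "1. Identify the premise.\n2. Check the inference.\nAnswer: valid"
steps, answer = split_reasoning_steps(fake_response)
```

The steps, not just the answer, then become the qualitative data: each numbered line can be coded for validity, relevance, or consistency with the lines around it.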
Incremental probing
This involves iteratively refining and improving the responses generated by the system (Schwenke et al., 2023), often based on external guidance provided by researchers (Liu et al., 2023). The process begins with researchers providing an initial prompt to guide the AI's response. After evaluating the generated output for quality and relevance, researchers modify the prompt to provide more specific instructions, clarify expectations, or address any shortcomings in the response (Shah, 2023). This cycle of evaluation and refinement continues until the system arrives at a satisfactory response, which allows for a deeper exploration of the system's behaviors and logic (Lingard et al., 2023). As an example, Otmar et al. (2025) used an iterative and incremental approach to prompting, progressively refining instructions, and engaging in feedback loops with an AI system (e.g., suggesting alternative titles, identifying areas for pacing adjustments) to improve the system's output for editing tasks. This iterative process helps researchers explore how the AI adapts its behavior (or struggles to adapt) based on previous interactions, highlighting both its capacity for and limitations in incremental behavior improvement and understanding of inputs.
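The evaluate-and-refine cycle described above can be sketched as a loop in which the researcher's judgment enters through two supplied functions. Everything here is a hypothetical scaffold: `ask` stands in for a real model call, and `is_satisfactory` and `refine` encode decisions the researcher makes, not anything automated in the cited studies.

```python
def incremental_probe(ask, initial_prompt, is_satisfactory, refine, max_rounds=5):
    """Iteratively refine the prompt until the output passes the researcher's check.

    `ask` is the model call; `is_satisfactory` and `refine` encode the
    researcher's judgment and are supplied by the study design.
    """
    prompt = initial_prompt
    transcript = []
    for round_no in range(1, max_rounds + 1):
        response = ask(prompt)
        transcript.append({"round": round_no, "prompt": prompt, "response": response})
        if is_satisfactory(response):
            break
        prompt = refine(prompt, response)
    return transcript

# Stub model: echoes the prompt length as a stand-in for a real system.
def fake_ask(p):
    return f"({len(p)} chars) draft answer"

log = incremental_probe(
    ask=fake_ask,
    initial_prompt="Summarize the study.",
    is_satisfactory=lambda r: "draft" not in r,  # never satisfied here
    refine=lambda p, r: p + " Be more specific.",
)
```

The returned transcript, one entry per round, is itself the primary data: it records how (or whether) the system's output changed as instructions were progressively tightened.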
Least-to-most prompting
Least-to-most prompting is another viable approach for observing emergent behavior, where minimal guidance is initially provided to the AI, and additional prompts are introduced only when necessary (Zhou et al., 2022). This method allows researchers to observe whether the system can reason independently and reveals its capacity to infer, adapt, or ask for clarification when guidance is minimal. As the level of prompting increases, patterns in the AI's logic, such as where it fails or succeeds in processing new sets of information, become apparent. Least-to-most prompting is valuable for exploring the limits of AI's capacity to act and reason beyond its training, as well as its ability for independent problem-solving (Vu et al., 2024).
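Procedurally, least-to-most prompting amounts to revealing guidance one increment at a time and recording how much was needed. The sketch below is hypothetical: the `least_to_most` helper, the hint list, and the stub model (which only "solves" the task once enough hints accumulate) are all illustrative placeholders.

```python
def least_to_most(ask, task, hints, solved):
    """Reveal hints one at a time, recording how much guidance was needed."""
    given = []
    response = ask(task)
    while not solved(response) and hints:
        given.append(hints.pop(0))
        response = ask(task + "\nHints so far: " + "; ".join(given))
    return {"hints_used": len(given), "final_response": response}

# Stub model: "solves" the task only once at least two hints are joined in.
result = least_to_most(
    ask=lambda p: "SOLVED" if p.count(";") >= 1 else "stuck",
    task="Decompose the problem.",
    hints=["split into subproblems", "solve the easiest first", "combine"],
    solved=lambda r: r == "SOLVED",
)
```

The count of hints consumed before success is the observable of interest: across tasks, it gives a qualitative handle on how much scaffolding the system needs before its reasoning comes through.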
Counterfactual prompting
This probing approach presents AI systems with scenarios that alter specific facts or conditions while maintaining other contextual elements. By doing so, researchers are able to evaluate how the system behaves in understanding causal relationships and maintaining logical consistency across hypothetical changes (Meinke et al., 2024). For example, researchers might ask the system to reason about ‘what if gravity worked in reverse’ or ‘what if humans had three arms,’ allowing observation of how the AI applies its knowledge to impossible scenarios. Such demonstrations are particularly valuable for assessing the system's ability to maintain logical consistency when reasoning beyond its training data, evaluating how it extrapolates from known principles to novel situations, and finally determining its ability to acknowledge or address contradictions (Ma et al., 2023). As such, this method reveals both the flexibility and limitations of AI reasoning when confronted with scenarios that cannot be solved through pattern matching alone.
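The essential move in counterfactual prompting is building matched pairs: a factual baseline and a variant that alters exactly one condition. The following sketch is a hypothetical illustration of that pairing; the helper and its wording are placeholders, not an instrument from the cited work.

```python
def counterfactual_pair(base_scenario, altered_fact):
    """Build a matched factual/counterfactual prompt pair for comparison."""
    factual = f"In the following scenario, {base_scenario}. What follows?"
    counterfactual = (
        f"Now suppose instead that {altered_fact}, keeping everything else "
        f"about the scenario the same. What follows?"
    )
    return factual, counterfactual

f, cf = counterfactual_pair(
    base_scenario="objects fall toward the ground",
    altered_fact="gravity worked in reverse",
)
```

Because only one fact differs between the two prompts, any divergence in the responses can be qualitatively attributed to how the system handles the hypothetical change rather than to incidental wording.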
Temporal and comparative analyses of AI behavior
In the study of emergent behaviors in generative AI systems, researchers can adopt two complementary approaches: Temporal Interaction Analysis (TIA) and Comparative Synchronic Analysis (CSA). Interviewing AI through these two offers distinct yet reinforcing approaches to understanding how AI systems exhibit behaviors (see Table 3 for comparison). TIA focuses on tracking behavior changes over time or iterations, while CSA offers a comparative snapshot of the behavior of multiple systems at the same time. These two methods respectively reflect the diachronic and synchronic modes of analysis in technology studies discussed by Steve Barley (1990). By employing both methods, researchers can achieve a comprehensive view of AI behavior, addressing both diachronic (over-time) and synchronic (fixed-time) dimensions of AI.
Table 3. Comparison between temporal interaction analysis (TIA) and comparative synchronic analysis (CSA).
Temporal interaction analysis (TIA)
Temporal Interaction Analysis (TIA) is intended to track and analyze the emergent behaviors of AI systems over time. It is particularly effective in observing how behaviors stagnate, evolve, or completely transform through repeated interactions. AI behaviors are dynamic and may constantly change based on the contexts of interactions, frequency of interactions, or user prompts, making a case for longitudinal analyses to capture both shifts and consistencies in these behaviors (Guzman and Lewis, 2020; Mlynář et al., 2025). In addition, models can evolve post-training through fine-tuning, new data integration, guardrails (e.g., responses to user feedback or safety improvements), or continual learning methods, shifting performance post-deployment (Burkhardt and Rieder, 2024). These changes call for temporal approaches that collect system outputs over iterations and over time.
A TIA approach helps monitor whether the system's responses become more refined, accurate, or consistent over time or over multiple iterations (e.g., Krapp et al., 2024). Furthermore, by revisiting similar prompts, researchers can observe how interaction patterns influence the AI's emergent behavior and whether learning or stabilization occurs. For example, in their study comparing AI's performance with a human researcher, Wachinger et al. (2025) used an iterative method, prompting the system repeatedly to examine its evolving responses. They noted both shifts and stabilization in the AI's ability to identify descriptive themes, but also inconsistencies in generating insightful codes and performing deeper interpretive analysis. As another example, Janse van Rensburg (2024) used similar prompts across three different time frames to assess the consistency of AI responses and observe potential variations over time. One of the key goals of the study, in line with the ideals of a TIA approach, was to explicitly determine if the system consistently produced responses in which the same critical thinking skills and dispositions could be identified and, consequently, to establish if its modeling capacity changed over time.
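The bookkeeping a TIA study requires, revisiting the same prompt across sessions and diffing the responses, can be sketched minimally. The `record_session` and `responses_over_time` helpers below are hypothetical scaffolding, and the canned answers stand in for real model outputs collected weeks apart.

```python
import datetime

def record_session(store, system_name, prompt, response):
    """Append a timestamped observation so later runs can be compared."""
    store.append({
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "system": system_name,
        "prompt": prompt,
        "response": response,
    })

def responses_over_time(store, prompt):
    """All recorded responses to one prompt, in chronological order."""
    return [obs["response"] for obs in store if obs["prompt"] == prompt]

session_log = []
record_session(session_log, "model-x", "Define critical thinking.", "Answer v1")
record_session(session_log, "model-x", "Define critical thinking.", "Answer v2")
```

The chronological list per prompt is what the researcher then reads qualitatively, asking whether the responses refine, stabilize, or drift, which is precisely the comparison studies like Janse van Rensburg (2024) make across time frames.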
The TIA approach is not solely focused on changes in system behavior over time; it can also be used to gain a more comprehensive and immersive understanding of system behavior and performance. For instance, researchers performed autoethnographies over several months to gain a firsthand account of how generative AI systems could address accessibility needs (Glazko et al., 2023), the potential and challenges of using AI systems in academic writing (Schwenke et al., 2023), and the complex quasi-social relationships formed with the AI system (Krapp et al., 2024). In these examples, the TIA approach provided the researchers with deep access to individual experience and insights into the systems’ behaviors and performance, including both potentials and limitations, which emerge through sustained, real-world interactions over an extended period of time.
Comparative synchronic analysis (CSA)
Comparative Synchronic Analysis (CSA) offers a snapshot comparison of multiple AI systems (e.g., Abdullahi et al., 2024; Alfirević et al., 2024) using the same set of prompts or tasks and collecting data during a defined short-term period. CSA is particularly useful when researchers are looking to evaluate how different systems behave relative to identical tasks or prompts under similar conditions (Buyl et al., 2024). For instance, Collier et al. (2024) employed a systematic CSA-based evaluation of multiple LLMs to assess their performance in similar tasks of product risk assessment at a single point in time. The researchers used the identical prompts across different LLMs for each product category, and their assessment involved recording which failure modes, injuries, and risk mitigation tactics were highlighted by each model.
CSA is particularly valuable in studies where researchers focus on comparing the performance and behavior of different AI systems against similar benchmarks, without requiring extended longitudinal engagement. For example, Akyon et al. (2024) critically assessed the comprehension capabilities of six different LLMs at a single point in time by focusing on a subset of medical research papers and comparing their responses against a benchmark established by medical experts.
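Structurally, a CSA design is a systems-by-prompts matrix filled in during one short data-collection window. The sketch below is an illustrative scaffold: the `synchronic_comparison` helper is hypothetical, and two stub functions stand in for calls to real systems.

```python
def synchronic_comparison(systems, prompts):
    """Run the same prompts across multiple systems at one point in time."""
    return {
        name: {p: ask(p) for p in prompts}
        for name, ask in systems.items()
    }

# Two stub "models" standing in for real systems under comparison.
systems = {
    "model-a": lambda p: f"A says: {p.upper()}",
    "model-b": lambda p: f"B says: {p.lower()}",
}
matrix = synchronic_comparison(systems, ["Assess Product Risk"])
```

Reading down a column of the matrix (one prompt, all systems) is the CSA comparison itself; because every cell was collected under the same conditions, differences between rows can be attributed to the systems rather than to the prompts or the moment of collection.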
Exploring and interpreting patterns in AI behavior
In studying emergent behaviors in AI systems, much like in traditional qualitative research, data collection and analysis can occur simultaneously (Merriam and Tisdell, 2015). This iterative approach allows researchers to adjust their probing techniques and explore new themes as new insights emerge during their interactions with AI systems (e.g., Alfirević et al., 2024). This also means that researchers continually refine their probing approach in response to how the AI behaves, adjusting based on their ongoing analysis of its outputs.
By treating AI-generated responses as reflections of emergent behavior, researchers can apply qualitative methods traditionally used to study human behavior. These approaches allow for the identification and theorization of behavioral patterns, such as emerging affordances and limitations in AI abilities relative to specific tasks (e.g., Chavan et al., 2024). For example, Krakowski (2025), by using varied prompts and responses of generative AI systems, discussed how these systems may (mis)perform arithmetic tasks and speculated about potential variation in performance. In the following section, we outline other examples of approaches, adapted from human-centered behavioral studies, that can effectively be applied to the study of AI.
Identifying patterns through thematic analysis
At its core, thematic analysis serves as a useful foundation for spotting and interpreting recurring themes and patterns in AI-generated outputs. Recent research has shed light on aspects of AI behavior, from how these systems maintain consistent personas and tone to the way they develop preferences over time (Krapp et al., 2024). A thematic coding approach helps researchers systematically map out various types of AI errors, whether they represent logical challenges, struggles with edge cases, or repeated inaccuracies (Stroebl et al., 2024). For example, Lewis and Mitchell (2024) manually examined a sample of incorrect responses from humans and GPT models to categorize the types of errors made to understand the underlying reasoning processes. Thematic analysis can also help generate more systematic insights into AI capabilities, particularly in areas like critical thinking and reasoning within AI-generated content (Janse van Rensburg, 2024).
Thematic analysis can extend beyond coding by penetrating deeper layers of meaning in AI behavior, enabling researchers to interpret the social, cultural, and situational factors shaping the behavior of AI systems. Consider how researchers explore the way AI-generated text mirrors implicit assumptions or biases from its training data (e.g., Munn and Henrickson, 2024; Park et al., 2025). Taking this further, hermeneutic approaches offer an even more contextually aware analysis by focusing on the meaning and value ingrained in AI responses. By carefully examining word choice, structure, and context, researchers can better articulate how AI systems create meaning through user interactions. For example, focusing on features like the strategic use of politeness, conversational repairs, and the dynamic deployment of repetitions, Jones et al. (2025) examined the coproduction of meaning and manifestation of cultural conventions in conversation between a human and an AI chatbot. This approach acknowledges and accommodates the interactive, interpretive nature of AI behavior, and inquires into how these systems operate in specific interactive situations, manifesting the values, logic patterns, or even representations of certain cultural values embedded in their outputs (Henrickson and Meroño-Peñuela, 2025).
Uncovering ideological and social biases through critical analysis
Approaches such as critical discourse analysis (CDA) have emerged as powerful strategies for examining how AI behavior reflects broader societal biases, ideologies, and power dynamics. CDA is a qualitative, interpretive method that focuses on language as a social practice and interrogates how discourse structures enact, legitimize, or obscure power relations (Wodak and Meyer, 2015). Through these approaches, researchers can focus on linguistic patterns that might reveal the worldviews of AI creators or the hegemonic discourses embedded in training data (Buyl et al., 2024; Gourlet et al., 2024), or illustrate how these systems may amplify existing social injustices (Kay et al., 2024). A fruitful analytical approach for exploring the systems' ideological foundations in this context revolves around tracing how training data may specifically drive AI responses (Lee et al., 2025).
A compelling example of CDA-based analysis comes from researchers who examined AI-generated recommendation letters, discovering differences in how frequently certain nouns and adjectives appeared for male versus female candidates. This study underscored persistent gender stereotypes in descriptions of ability, leadership qualities, community involvement, and personal life (Wan et al., 2023). As another example, by focusing on a random subset of the generated responses, Park et al. (2025) directly engaged with the specific ways through which an LLM associated autism with common stereotypes, such as difficulties in social skills (“socially awkward”), sensory sensitivities requiring management, and the idea of autistic people being “unique” often framed in terms of skills beneficial to others. Such a critical qualitative analysis enabled researchers to reveal an implicit “bias paradox” in the system performance: While perpetuating negative stereotypes and a deficit-oriented perspective, the system also frequently used explicitly inclusive language and expressed a desire to be inclusive or connect with autistic people.
Quantifying or explaining behavioral patterns through content analysis
Content analysis gives researchers a systematic way to track and measure specific linguistic features in AI-generated text, helping identify and quantify behavioral patterns. Spurlock et al. (2024) put this approach to work in studying how AI systems represent and compare items (in this case, movies) and how varying levels of detail in prompts may affect recommendation quality. Similarly, De Freitas et al. (2024) quantified loneliness-related conversations with AI, demonstrating how content analysis can uncover patterns of engagement in AI-human interactions. In another example, researchers tracked the frequency of terms like “civilian” or “terrorist” to explore the ways through which language choice influenced the portrayal of violence and airstrikes across different linguistic contexts (Kazenwadel and Steinert, 2023).
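The term-frequency tracking described above can be illustrated with a minimal sketch. The responses and term lexicon here are hypothetical stand-ins, not data from the cited studies; in practice, researchers would substitute their own corpus of collected AI outputs and a theoretically motivated set of terms.

```python
from collections import Counter
import re

# Hypothetical AI-generated responses standing in for a collected corpus.
responses = [
    "Reports described the strike as targeting militants near the city.",
    "Witnesses said civilians were among those affected by the strike.",
    "Officials characterized the operation as targeting militants.",
]

# Illustrative lexicon of terms of interest, in the spirit of
# content-analytic studies of conflict reporting.
terms = ["civilian", "militant", "terrorist"]

def term_frequencies(texts, terms):
    """Count occurrences of each term (matched as a substring,
    case-insensitively, so plurals are included) across the corpus."""
    counts = Counter()
    for text in texts:
        for term in terms:
            counts[term] += len(re.findall(term, text, flags=re.IGNORECASE))
    return counts

print(term_frequencies(responses, terms))
```

A tally like this is only the quantitative backbone of content analysis; the interpretive work of relating frequencies to framing and context remains with the researcher.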
Treating human-AI exchanges as unified analytical units
When applying all these types of qualitative analysis noted above, it stands to reason to treat the prompts used and AI responses as a single unit of analysis, as humans and AI co-produce meaning in their unique context of interactions (Jones et al., 2025). This unit of analysis aligns with the principle of examining the sequential organization of ‘talk-in-interaction’, in this case, the exchange between a human and an AI system (Mlynář et al., 2025). Both elements work together to reveal the system's thinking and behaviors as part of the same analytical framework (Henrickson and Meroño-Peñuela, 2025).
Triangulating methods for a holistic understanding of AI behavior
Relying on a single research method to study AI behavior offers a limited view of how these systems operate and unfold in practice; interviewing AI, therefore, should not be thought of as a standalone research method. Triangulation has been recognized as the process of using multiple research methods or sources to study the same phenomenon (Korstjens and Moser, 2018). This practice can contribute to research validity by enabling researchers to see different dimensions of AI behavior from complementary angles. As such, interviewing AI can be greatly complemented by integrating it with other methods, such as quantitative approaches, technical analysis, and user studies (Ma et al., 2024; Xu et al., 2024).
Integrating multi-source public interaction records
The first-hand qualitative data discussed above can be complemented with an analysis of publicly available datasets documenting thousands of users’ interactions with AI systems. For instance, the Dev-GPT dataset captures AI use in collaborative coding via GitHub (Chavan et al., 2024), while Cleverbot logs provide examples of informal, spontaneous dialogue with AI chatbots (De Freitas et al., 2024). More critically, large-scale datasets like WildChat (Zhao et al., 2024), LMSYS-Chat-1M (Zheng et al., 2023), and public conversation records (e.g., Cheng et al., 2025) offer valuable insight into diverse user interactions across different social contexts. Integrating such datasets enables researchers to triangulate findings and assess patterns and consistency in emergent AI behavior.
Juxtaposing system behavior with human perspectives
Another useful way to triangulate qualitative research on AI is to juxtapose system behavior with the assessment of human actors or experts, particularly when human-AI interaction is a focus of the study (Arawjo et al., 2024; McDuff et al., 2023). These assessments provide a baseline or human lens through which AI behavior can be examined and contextualized (e.g., Bijker et al., 2024; Sänger et al., 2024). For example, Hou et al. (2024) compared an AI chatbot's relationship advice with Reddit users’ rankings of similar solutions and investigated how closely the AI's suggestions aligned with the collective judgment of these users. The combination of system behavior and lived experience helps build a more holistic understanding of AI, integrative of both technical performance and human interpretation.
Such triangulation highlights the system's capacities and limitations not only from the AI's internal logic but also from the external perspectives of people who encounter these systems in real-world applications (e.g., Otis et al., 2023). This kind of qualitative research uncovers how AI's “functioning in the wild” may bear on the perceived usefulness and credibility of AI in exploratory contexts.
Triangulation with quantitative analysis
Interviewing AI can also be triangulated with quantitative studies that provide a more structured or measurable view of AI systems (e.g., Scholl et al., 2024). For example, researchers conducted a mixed-methods study combining quantitative metrics (response accuracy) and in-depth qualitative thematic analysis to evaluate an AI's performance in providing contextually relevant support within patient-centered dementia care scenarios (Li et al., 2024). As another example, Lorè and Heydari (2024) relied on statistical analysis of data from game-theory simulations to identify patterns in LLM strategic decision-making, such as cooperation rates. They then complemented this with an exploratory, qualitative analysis, which focused on the explicitly prompted reasoning provided by the systems for selected illustrative examples. This combined approach, analyzing both the quantitative outcomes and the qualitative insights from the systems’ stated motivations, allowed the researchers to arrive at a more systematic picture of how different LLMs behave strategically across varying scenarios.
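The pairing of a quantitative metric with qualitative theme codes can be sketched in a few lines. The records below are hypothetical: each pairs a correctness judgment (the quantitative strand) with a researcher-assigned theme code (the qualitative strand), and the summary reports both side by side, as a mixed-methods study might.

```python
from collections import Counter

# Hypothetical evaluation records pairing a quantitative correctness
# judgment with a researcher-assigned qualitative theme code.
records = [
    {"prompt": "q1", "correct": True,  "theme": "empathetic framing"},
    {"prompt": "q2", "correct": False, "theme": "hallucinated detail"},
    {"prompt": "q3", "correct": True,  "theme": "empathetic framing"},
    {"prompt": "q4", "correct": False, "theme": "hedged refusal"},
]

def summarize(records):
    """Return overall response accuracy alongside counts of qualitative
    theme codes, mirroring a mixed-methods summary table."""
    accuracy = sum(r["correct"] for r in records) / len(records)
    themes = Counter(r["theme"] for r in records)
    return accuracy, themes

accuracy, themes = summarize(records)
print(f"accuracy={accuracy:.2f}", themes.most_common())
```

Such a summary does not replace thematic interpretation; it simply keeps the two strands of evidence aligned on the same set of interactions so they can inform each other.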
In short, through triangulation with other methods, interviewing AI becomes an integral part of the broader research methodological toolkit and sources. Together, these varied research approaches ensure that the multidimensional nature of AI systems’ behavior and performance (e.g., how they process information, respond to prompts, and interact with humans) can be fully captured, explained, contextualized, and theorized.
Ethical considerations in qualitative methods for studying AI systems
In this section, we explore key considerations in applying qualitative methods to the study of AI that can enhance the ethical validity of research.
Responsibility in interpreting emergent AI behavior
While interpretation is a cornerstone of qualitative research, applying it to AI systems raises unique ethical concerns. In this context, researchers must interpret and theorize behaviors from non-human agents that essentially lack key human characteristics such as intent or self-awareness. Therefore, there is always a risk of over-attributing human-like traits to these systems (DeVrio et al., 2025; Peter et al., 2025).
To avoid such issues, ethical qualitative research should contextualize AI behavior within the design constraints of these systems. While AI systems may exhibit human-like patterns, they are fundamentally different in nature. Over-anthropomorphizing AI (e.g., assigning exclusively human faculties to AI) can lead to misleading theories about AI capabilities and limitations (Bory et al., 2025).
Moreover, in applying interpretive approaches, researchers need to remain mindful and critical of AI systems as non-neutral actors that inevitably bring forward specific worldviews and associated biases (Guzman and Lewis, 2020; Jones et al., 2025). As demonstrated in numerous studies of AI systems and their biased responses or exclusionary language (e.g., Kay et al., 2024; Wan et al., 2023), the researcher's role in responsibly revealing and theorizing these inherent characteristics of AI systems becomes even more pivotal.
Transparency in reporting and theorizing AI behavior
In the context of interviewing AI, a responsible and transparent approach entails clear documentation of how probing was conducted and how emergent behaviors were identified and analyzed (de Seta, 2024; Shah, 2023). This methodological transparency allows for accountability (see, for example, Waller et al., 2024) while also ensuring that other researchers can critically engage with research outcomes.
Specifically, audit trails have proven to be effective safeguards that can enhance the validity and transparency of qualitative research by keeping records of the research trajectory (Korstjens and Moser, 2018). Detailed journaling of interactions with AI can create comprehensive documentation that includes annotated screenshots and notes. This documentation can capture rich, detailed accounts of researchers’ individual experiences with AI systems, enable critical examination of those experiences, help researchers process uncomfortable emotions, and increase research transparency, especially when conducted as a collaborative effort (Desai et al., 2023).
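The journaling practice described above can be supported by even very simple tooling. The sketch below, with hypothetical content and a made-up file name, appends each prompt-response exchange, with a timestamp and an optional researcher annotation, to a JSONL audit trail that can later be shared or critically revisited.

```python
import json
import os
import tempfile
from datetime import datetime, timezone

def log_interaction(path, prompt, response, note=""):
    """Append one prompt-response exchange, with a UTC timestamp and an
    optional researcher annotation, to a JSONL audit-trail file."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt,
        "response": response,
        "researcher_note": note,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")

# Example usage with hypothetical content.
trail = os.path.join(tempfile.gettempdir(), "audit_trail.jsonl")
log_interaction(
    trail,
    "Explain your last answer.",
    "I based it on general patterns in my training data.",
    note="System deflects when asked about sources.",
)
```

Plain append-only files of this kind are deliberately low-tech: they preserve the chronological order of the research trajectory and remain readable without special software, which serves the transparency goals discussed here.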
Further, when interpreting findings about AI systems, researchers should clearly articulate the limits of their methods and avoid making broad generalizations that extend beyond the scope of their study (Gillespie, 2024; Mlynář et al., 2025). For example, the artificially constructed scenarios discussed earlier under probing approaches can reveal important dimensions of AI systems’ behavior in practice, but they also run the risk of presenting an incomplete or skewed picture of the system's capabilities or drawbacks. As another example, when applying a comparative analysis approach, researchers should be transparent about the specific conditions under which multiple systems are evaluated and how those conditions might have influenced the results within a specific timeframe, given that AI systems may change dynamically over time and consequently exhibit dissimilar behaviors at different points in time.
Reflexive analysis of researchers’ influence
Recognizing the researcher's role in human-in-the-loop qualitative studies of AI also requires reflexivity regarding their influence on ‘interviewing AI.’ Qualitative research is not necessarily about objective data collection but rather an inherently co-creative, subjective process. In the context of interviewing AI, researchers do not interface with fully autonomous actors or agents who have independent intent or sentience (Bory et al., 2025; Mlynář et al., 2025); rather, as Suchman and Thimm (2024) argue, these systems, even when behaving unpredictably, do not act outside of their relations with humans (notably, the researcher in this case). Reflexive analysis, therefore, should directly focus on the relational aspects of AI (Guzman and Lewis, 2020) or what Esposito (2022) calls “artificial communication” between humans and machines, emphasizing how researchers view the role and nature of both AI and themselves in light of their interactions with these systems. While emphasizing our earlier point about human-AI exchanges as unified analytical units, we note that this dynamic implicates researchers themselves in the evaluation of AI behaviors (Arora et al., 2024), as their decisions, such as probing strategies, actively influence the behavior of the systems (Henrickson and Meroño-Peñuela, 2025; Krapp et al., 2024).
In classic qualitative research, the researcher's role in shaping interactions and interpretation needs to be properly acknowledged and scrutinized (Berger, 2015). In the context of studying AI, it is equally important to recognize how researchers’ study design can steer the AI toward specific types of responses (Krakowski, 2025). Much like ethnographers account for their impact on fieldwork, AI researchers must recognize how their situations, personal contexts, and perceptions, as well as their unique interaction styles, shape the observations they make through questions and prompts (Desai and Twidale, 2023; Glazko et al., 2023). As noted, these systems, as conversational reflections of humans, could project or even reinforce the biases of the researcher and society at large (Kaplan, 2024).
This active participation of researchers further calls for reflexive practices, such as critically examining assumptions, documenting biases, and continuously evaluating the coding and analysis processes, to help researchers understand and declare their positionality, influence, and mitigate potential bias-related distortions in research outcomes (Janse van Rensburg, 2024).
Conclusion
Exploring and studying AI systems through qualitative methods offers a critical lens to uncover and capture the emergent behaviors, capabilities, and limitations of these systems in vivo. By “interviewing” AI through structured prompts and analytical frameworks, qualitative researchers can theorize about the interplay between system design, user interactions, and broader sociotechnical contexts of use. The approaches discussed in this article complement quantitative and technical methods and provide key insights into the adaptive, interactive, and, more importantly, interpretive and critical dimensions of AI systems and their actual behavior in practice.
In this article, we highlighted methodological tools that can be adopted for studying AI. These methods exhibit strong disciplinary overlaps with interpretive sociology, ethnography, critical studies of algorithms, human-machine communication, and human-computer interaction traditions. Together, they highlight a growing methodological convergence aimed at making sense of machine behavior through human-centered inquiry.
The interviewing AI framework, as such, provides an empirical approach to generate a pragmatic understanding of AI systems, their actual behaviors and implications, which are already deeply influencing various social domains, in contrast to the overhyped ‘imaginaries of AI’ as capable of emulating human intelligence (i.e., general artificial intelligence) (Bory et al., 2025). Transcending the prevalent framing of AI as an ‘uncontroversial thing’ as discussed by Suchman (2023), our framework seeks to contribute to a critical understanding of AI as taking shape in practice. It does so by capturing the situated behaviors of these systems that emerge at the intersection of various research probes, use contingencies, and systems’ underlying processes and embedded logic that reflect an entanglement in actor networks from programmers to gatekeepers (e.g., training data regimes that may engender specific forms of bias) (Bareis, 2024).
Interviewing AI is uniquely suited to capture the depth and contextuality of AI behavior through direct interactions and analyses of outputs, contributing to the conceptualization of ‘AI in situated action’ (Gourlet et al., 2024; Mlynář et al., 2025; Monteiro et al., 2024). Understanding AI actions in real-world applications helps trace some of the fundamental roots of ‘AI frictions’ (Kaun and Männiste, 2025; Marres et al., 2024) by revealing the underlying technical mechanisms contributing to societal problems and controversies around AI.
While AI systems like LLMs continue to evolve rapidly, shifting toward embedding capabilities such as reasoning, autonomous actions, and multi-agent coordination, the interviewing AI framework remains a durable method for capturing emergent behaviors irrespective of specific system architectures. Its emphasis on interpretive interaction and attention to system inputs and outputs ensures a degree of adaptability and relevance across generations of AI models. As AI systems progressively partake in complex social tasks, interviewing AI offers a scalable approach to exploring the sociotechnical dynamics of AI in practice. In this context, integrating large-scale interaction data points to future methodological opportunities.
Limitations and future research
The interviewing AI framework serves as a normative and integrative guide, synthesizing diverse methodological traditions and emerging qualitative practices for studying AI systems. The framework is not intended as a step-by-step procedural recipe, nor is it the outcome of a single empirical study. Rather, it offers a conceptual map that researchers of AI can adapt and appropriate based on their own research contexts, research problems, and disciplinary orientations. While no single case study to date has employed all framework elements in a unified investigation, we hope its modular and flexible elements allow for tailored applications suited to a variety of empirical demands. In this way, the framework follows what Kaplan (1964) refers to as a “reconstructed logic”: a normative idealization of scientific inquiry that, while not describing actual research practice in totality, can nonetheless guide future methodological development.
Future research could benefit from unified or multi-phase case studies that progressively integrate multiple framework elements, consequently evaluating their utility across diverse research domains, interactional contexts, and research goals. Finally, this article specifically focuses on the study of LLM-based chatbots, with the recognition that AI systems take various forms (Jarrahi and Glaser, 2025), and that the qualitative methods broached here may not generalize to other AI systems or intelligent machines, particularly those that do not interface with humans in similar ways (e.g., through prompting-based interactions).
Funding
The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the Frances Carroll McColl Term Professorship at the University of North Carolina at Chapel Hill.
Declaration of conflicting interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
