Abstract
In this article, we examine the application of qualitative methods for exploring and capturing the emergent behaviors and characteristics of AI systems. In doing so, we formulate key facets of the ‘interviewing AI’ framework: (1) exploratory familiarization to develop an initial understanding of the AI system's functionalities and responses, (2) systematic investigation through structured probing to elicit behaviors such as hallucinations and manifestations of reasoning, using different prompting approaches, and (3) two complementary approaches: temporal and comparative analyses of AI behavior, examining changes over time or comparing multiple systems at a single point in time. We further discuss (4) potential qualitative analysis methods such as critical discourse analysis or content analysis adapted to theorize and interpret AI behaviors, and (5) triangulation, which integrates qualitative insights from interviewing AI with other methods such as user and expert studies, public interaction records analysis, and quantitative analysis to form a multidimensional and comprehensive understanding of AI systems. Finally, we address (6) ethical considerations by emphasizing transparency, reflexivity, and responsible interpretation of findings to ensure rigorous and contextualized research practices.
Introduction
The increasing adoption of generative AI systems, such as large language models (LLMs), along with their unpredictable performances, has raised key challenges about ways to capture and study their behaviors and characteristics. A nascent but growing field of research, often referred to as “machine behavior,” has begun to study AI systems not merely as engineering artifacts but as agentic systems with distinct behavioral patterns that unfold within use contexts beyond training phases (Rahwan et al., 2019). While avoiding anthropomorphic assumptions, this perspective advocates for recognizing the emergent nature of AI behaviors, which are not entirely predictable and comprehensible even to their developers (Tsvetkova et al., 2024). Such unpredictability often results in surprising capabilities that surface in practice, sometimes leading to harm or what Marres et al. (2024) call “AI frictions.” These frictions challenge the idea of AI as a coherent, seamless ‘thing’ and underscore the messy and unpredictable reality of its deployment in the real world (Kaun and Männiste, 2025).
AI behaviors are increasingly emergent in use as they arise from the interaction of the model's architecture, training data, and real-world environments rather than being merely explicitly programmed (Jones et al., 2025). Hence, emergence in this context highlights the dynamic nature of AI systems as they engage with complex inputs and newly experienced environments. These systems can develop abilities or shortcomings not necessarily present in smaller models or during training phases (Yampolskiy, 2025).
As AI systems become more integrated into diverse real-world applications, from workplaces to medical domains (Margaryan, 2023), the range of surprising and unforeseen behaviors is widening, manifesting what Horton (2023) refers to as the “performativity” problem, shaped by interactions within specific human environments (Hansen, 2021). The dynamic and opaque nature of generative AI thus calls for novel research methods to capture the full range of its capabilities, behaviors, and limitations (Magee et al., 2023). More than ever, questions remain: How do we reliably capture and articulate the “voice” of AI systems when dealing with entities that increasingly act as communicative participants rather than merely mediating tools or engineering artifacts (Guzman and Lewis, 2020; Hepp et al., 2023)? And what are the implications for research methods when the subject of study is no longer just human? This challenge becomes even more significant as we confront the reality that AI systems are not static tools and evolve based on user interactions, displaying human-like behaviors that may resemble autonomy (Bareis, 2024; Peter et al., 2025; Yampolskiy, 2025).
Why qualitative research?
In response to these challenges, we argue for applying qualitative research to study the emergent characteristics and behaviors of non-deterministic AI systems (systems that can generate different outputs for similar inputs due to inherent randomness or stochastic processes) (Magee et al., 2023; Tsvetkova et al., 2024). While quantitative and computational methods provide valuable insights into the performance of AI systems, qualitative approaches can provide unique affordances in explaining the contextual and nuanced nature of these behaviors that partly stem from situated interactions with the users (Mlynář et al., 2025).
What sets a method like interviewing AI apart is its ability to capture the depth, meaning, and complexity of phenomena in the context of their occurrence (Seaver, 2017). In contrast to quantitative approaches, which focus on measurable outcomes, qualitative methods provide a more interpretive and explorative view of AI behaviors in complex contexts (Ho et al., 2024). Qualitative methods reveal patterns missed by aggregate and reductionist metrics, especially regarding affordances, breakdowns, and sociopolitical entanglements of AI applications as contested domains imbued with power dynamics (Gourlet et al., 2024; Suchman, 2023). Approaches such as critical discourse analysis are especially helpful in uncovering the normative underpinnings and discursive constructions in AI outputs (including hegemonic ideologies and implicit biases embedded in training data or reflected in user interactions). Qualitative research therefore helps unearth and critically examine tensions that emerge during use across different contexts. This stands in contrast to dominant portrayals of AI systems as universally applicable, user-friendly, efficient, and versatile agents (Luchs et al., 2023).
Research establishes that generative AI systems’ interactive nature is central to their emergent behaviors, as these systems evolve based on user inputs that reflect unique interpretations and intentions (Hancock et al., 2020). Qualitative research is particularly well-positioned to capture this dynamism (what Rahwan et al. (2019) call hybrid ‘human-machine behavior’), as it can aptly document how AI systems shift and adapt in real-world contexts of use. In relation to this interactivity, which is central to emergent behaviors (Hansen, 2021), qualitative research can direct attention to human judgment and subjective assessment to explore and evaluate AI behaviors holistically beyond technical benchmarks or narrow task-centered performance metrics (e.g., contextual relevance of AI-generated outputs) (Glazko et al., 2023; Shah, 2023).
What interviewing AI entails
In this article, we formulate ‘interviewing AI’ as a qualitative framework for studying AI. Much like classic qualitative research, the goal here is to explore, document, and interpret AI behavior as it develops through real-world interactions. Particularly, interviewing AI is valuable in answering why and how questions (Creswell, 2009); for example, exploring why AI systems have arrived at certain conclusions based on analysis of false-positive and/or false-negative results (Oami et al., 2024). As such, we argue that by employing this qualitative framework, researchers can complement computational methods to capture the depth, context, and complexity of AI systems.
The rest of the article unpacks what we mean by interviewing AI (see Table 1 for an outline of the key elements of the framework). We begin with experimentation to understand AI behavior firsthand and gain an initial perspective, then probing to uncover deeper and more systematic insight into AI behavior. Next, we introduce Temporal Interaction Analysis (TIA) and Comparative Synchronic Analysis (CSA) for documenting behavior over time or across systems. Finally, we highlight qualitative analysis techniques and triangulation to capture AI's emergent, adaptive nature.
Table 1. Key elements of the interviewing AI framework.
Exploratory familiarization with AI
Before diving into the more structured data collection and analysis processes, researchers need to gain some initial understanding of the interactive milieu and the specific AI system they are studying. This familiarization phase involves first-hand experimentation with the AI system in question to gain an initial intuition about the system's operations, behaviors, inferences, potential limitations, blind spots, and, more broadly, structural societal concerns (Marres et al., 2024). This process can be seen as a form of “human-in-the-loop” interaction (Shah, 2023) as researchers’ own engagement generates in-vivo understanding, key to documenting the behaviors that arise only when AI is deployed in real-world environments (de Seta et al., 2024).
During experimentation, researchers can engage in a variety of tasks to more effectively develop an understanding of AI behavior and capabilities, what Guzman and Lewis (2020) might refer to as “the functional dimensions” of the system. These include observing how the system reacts to typical or varied user prompts and evaluating its performance by contrasting simple and complex questions to assess the adaptability and reliability of responses. Experimentation is therefore the first step in what Margaryan (2023) calls ‘intelligent interrogation,’ the practice of formulating practical questions for AI. Researchers can also familiarize themselves with behavioral patterns, speculating when, why, and how specific outputs may emerge (specifically how the AI may handle ambiguity and contradictions). They can also focus on topics or types of questions in which the AI excels or struggles, while also noting quirks or idiosyncrasies in its behavior or reasoning.
This early engagement helps establish the behaviors and performances researchers can expect to explore more deeply. The experiential approach helps researchers gain a “grip” on AI's abilities, issues, and concrete operations (Gourlet et al., 2024), preparing them for the next phase, where the goal shifts from observation to systematically eliciting evidence and building theories about the system's capabilities. For example, Barambones et al. (2024) performed a preliminary study to understand system behavior and optimize persona training. By testing various prompt formats and model settings, they refined prompt design, enhanced the realism of simulated interviews and informed the main study's setup.
It is important to note that experimentation is a situated encounter with the system and its affordances, and so is inescapably shaped by the researchers’ prior experiences, expectations, and sociomaterial context (Suchman and Thimm, 2024). As such, the observations and intuitions gained are co-constructed through this situatedness and should be treated as partial and contingent. This requires reflexive consideration of how researchers’ positionality (disciplinary background, familiarity with AI, expectations, and biases) shapes both the prompts they use and the interpretations they carry into the next stages of investigation.
Systematic investigation through structured probing
The next step is structured probing, which focuses on eliciting deeper, structured insights into the system's internal logic and emergent behaviors. Probing, a foundational ethnographic technique for eliciting deeper meanings (Patton, 2014), has recently been adapted to study generative models (de Seta, 2024). Probing is designed to generate a theoretically oriented, analytical viewpoint that enables researchers to systematically analyze and classify patterns, behaviors, and underlying reasoning processes within the AI system and connect them with extant theoretical knowledge. In contrast to the exploratory and open-ended nature of the familiarization step, probing is more purposive and targeted and involves systematically crafting and posing structured prompts to reveal, record, and analyze detailed, complex, or unexpected behaviors from the AI (see for example, Krapp et al., 2024).
Probing can serve several crucial goals in interviewing AI, and these goals can be achieved through multiple approaches (see Table 2 for an outline).
Table 2. Various goals achieved through different probing approaches.
Documenting invisible behaviors
Probing facilitates the evaluation of how the AI system behaves based on diverse inputs, including varied, ambiguous, or even contradictory prompts (Abdullahi et al., 2024). This process identifies recurring patterns, themes, and responses to varied interaction strategies (Scholl et al., 2024) and uncovers behaviors that might not necessarily be evident during casual interactions (Magee et al., 2023). For example, probing can capture patterns of emergent behaviors such as hallucinations (responses that are plausible-sounding but factually incorrect) or creative outputs entailing novel combinations of ideas. This exploration is key to theorizing the system's capabilities and the nature of its emergent behaviors. For example, Triem and Ding (2024) systematically examined instances where LLMs changed their opinions to specifically gain an understanding of the structure of their reasoning.
Revealing boundaries through breakdown analysis
Probing can trigger specific behaviors in the AI system by using complex prompts (Schmidt et al., 2024). This approach pushes the system to its limits, revealing, for example, how it manages challenging or edge-case scenarios. Such targeted probing helps identify instances in which the system exhibits unexpected performance or fails to maintain logical consistency. For example, researchers directly “interviewed LLMs,” asking them to identify common failure modes, such as those observed in the 11–20 Money Request Game, in which LLMs struggled to provide accurate explanations for their choices and often exhibited reasoning that deviated significantly from human participants (Gao et al., 2024). Capturing and analyzing these instances of breakdown or what Waller et al. (2024) dubbed “questionable occasions” helps show specific limits to the systems’ capabilities, such as reasoning or intelligence (e.g., Lorè and Heydari, 2024). A key example of breakdown in human-AI interactions is moments of miscommunication where systems exhibit unexpected communicative performance or failures to maintain logical consistency (Mlynář et al., 2025).
Unpacking AI reasoning and internal logic
Probing can produce rich theoretical insights that embody the AI's underlying decision-making processes and reasoning (Magee et al., 2023). By closely analyzing the responses generated through probing, researchers can gather evidence about how the AI processes inputs, how it formulates its responses, and which factors may more significantly influence its outputs (Henrickson and Meroño-Peñuela, 2025). This method is beneficial for opening the black box of AI systems, where the internal mechanisms are often opaque and not fully understood. Probing approaches in this context can focus on the explanations provided by the system itself to understand its reasoning process and identify potential areas of bias or errors in response to varied stimuli (e.g., identifying instances where the AI may misunderstand the meaning of specific relations by analyzing the explanations provided by the system (Cohn, 2023)). As such, the primary focus of the analysis is not solely on whether the system completes tasks correctly or incorrectly but rather on how it arrives at its conclusions (Abdullahi et al., 2024). For instance, researchers qualitatively analyzed the reasoning provided by an AI system to interpret the logic behind its decisions in an effort to understand how it links its expressed values to its actions (Leng and Yuan, 2023) or to determine if it was consistent with common misconceptions in educational contexts (Kieser et al., 2023).
Approaches in probing
Researchers can employ several approaches during the probing process to achieve the above goals, and each yields different types of insights. It is important to note that these approaches can be effectively combined rather than treated as mutually exclusive strategies.
Systematic prompt variation
One effective probing approach involves creating a framework of prompts that vary systematically in complexity, structure, or style. Such prompt structures can be derived from existing deductive theoretical models. This approach helps identify how the AI responds to different types of inputs and reveals potential patterns or inconsistencies in its behavior. For example, Sänger et al. (2024) assessed AI's capabilities in interpreting and manipulating scientific workflow descriptions by using a variety of prompts designed for distinct research objectives (e.g., comprehending/explaining workflows, modifying/adapting workflows, and extending workflows). Consequently, the prompts used were varied to simulate the interaction between a user and the AI system for different tasks.
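To make the mechanics of this approach concrete, the grid of varied prompts can be sketched in code. The following is a minimal, illustrative sketch (not a procedure from the article): the dimensions (`TOPICS`, `COMPLEXITIES`, `STYLES`), their values, and the `build_prompt` helper are all hypothetical placeholders a researcher would replace with dimensions derived from their own deductive theoretical model.

```python
from itertools import product

# Hypothetical variation dimensions; in a real study these would be
# derived from the theoretical model guiding the probe design.
TOPICS = ["causal reasoning", "arithmetic"]
COMPLEXITIES = ["simple", "multi-step"]
STYLES = ["direct question", "scenario-based"]

def build_prompt(topic, complexity, style):
    """Compose one probe from a single cell of the variation grid."""
    return f"[{style}] Give a {complexity} explanation of {topic}."

def generate_probe_grid():
    """Return every combination so responses can be compared cell by cell."""
    return [
        {"topic": t, "complexity": c, "style": s, "prompt": build_prompt(t, c, s)}
        for t, c, s in product(TOPICS, COMPLEXITIES, STYLES)
    ]

grid = generate_probe_grid()
print(len(grid))  # 2 topics x 2 complexities x 2 styles = 8 probes
```

Holding two dimensions fixed while varying the third lets the researcher attribute a change in system behavior to a specific feature of the input, which is the analytic point of systematic variation.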
Role-playing scenarios
By engaging the AI in role-playing scenarios, researchers can also observe how the system adapts its behavior across various contexts. This means asking the system to take on different personas or roles (e.g., a teacher, a friendly companion, an autistic person, a creative writer, a 10th-grade student in school, or a debate opponent) (De Freitas et al., 2024; Kieser et al., 2023; Krapp et al., 2024; Park et al., 2025). Role-playing scenarios enable unique insights into the AI's ability to shift perspectives and employ different modes of interaction, directing attention to both the flexibility and potential limitations of its conversational and thinking abilities. For example, through a “Tell Me Your Story” approach, Munn and Henrickson (2024) prompted an AI system as a social agent to disclose information about its design, training data, and embedded values. This approach also helps researchers disclose implicit biases and inclinations that might not be easily revealed through direct questioning (e.g., Buyl et al., 2024), shedding light on otherwise obscured aspects of the AI's internal logic and the constraints imposed by design guardrails. As an example, Magee et al. (2023) drew on a role-playing approach by assigning distinct personas (“Zhang” as helpful, “Ali” as truthful, and “Maria” as harmless) to explore how the system adapted to different normative roles. These personas helped the researchers examine how the AI adjusted its response based on the assigned identity, potentially surfacing embedded values.
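A persona instruction of this kind is straightforward to template. The sketch below is purely illustrative: the `persona_prompt` helper and the example personas are hypothetical, not drawn from the cited studies.

```python
def persona_prompt(persona, task):
    """Wrap a task in a persona instruction; wording is one common pattern."""
    return f"You are {persona}. Stay in character while you {task}"

# Hypothetical personas echoing the kinds listed above (teacher, debate opponent).
personas = ["a patient 10th-grade teacher", "a skeptical debate opponent"]
task = "explain why the sky is blue."
prompts = [persona_prompt(p, task) for p in personas]
```

Issuing the same underlying task under each persona, and then qualitatively comparing the responses, is what surfaces the perspective-shifting (or its failure) discussed above.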
Boundary testing
Deliberately pushing the AI to its limits through boundary testing helps reveal the limits of its capabilities and allows researchers to observe how the system behaves when faced with challenging or incomprehensible tasks (Zhan et al., 2023). This can involve presenting the AI with complex, ambiguous, or nonsensical questions to determine how it responds. For example, Stroebl et al. (2024) were able to elicit some of AI systems’ boundaries in producing reliably correct code by focusing on generating false positives. Boundary testing is beneficial for identifying where the system's reasoning breaks down (what we called breakdown analysis earlier) or where it may produce unexpected outputs, such as hallucinations or overly simplistic responses to complex problems.
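A boundary-testing suite can be organized as an ordered ladder of probes, from well-formed to nonsensical, so the researcher can note where coherent responses stop. The probe categories and the stub model below are hypothetical illustrations, not instruments from the cited studies.

```python
# Illustrative edge-case probes ordered from well-formed to nonsensical.
BOUNDARY_PROBES = [
    ("well-formed", "What is 17 * 23?"),
    ("ambiguous", "Is the answer to the previous question correct?"),
    ("contradictory", "List three even prime numbers greater than 2."),
    ("nonsensical", "How heavy is the color of Tuesday?"),
]

def run_boundary_suite(ask):
    """Pair each probe with the system's response for later qualitative coding."""
    return [(kind, prompt, ask(prompt)) for kind, prompt in BOUNDARY_PROBES]

# A stub stands in for a real model call here.
results = run_boundary_suite(lambda p: f"response to: {p[:20]}...")
```

Keeping the probe kind attached to each response makes the subsequent breakdown analysis systematic: the researcher codes responses per category rather than recalling which prompt was which.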
Chain-of-thought probing
Another probing approach involves asking the AI to explain how it arrived at a particular answer, specifically contributing to what we earlier referred to as “unpacking AI reasoning.” This approach, known as chain-of-thought prompting, pushes the AI system to articulate the steps it took to reach a conclusion, which provides a window into how the AI structures its reasoning processes (Doshi et al., 2024) and therefore facilitates opening its black box (Micus et al., 2024). Chain-of-thought prompting may result in breaking down complex tasks into smaller steps and generating more relevant outputs (Spurlock et al., 2024). Qualitative analysis of AI outputs generated by this type of probing can be especially helpful in offering insight into the AI's internal logic and whether and how it is inconsistent or deficient across different contexts (Mirzadeh et al., 2024). For example, Li et al. (2023) applied a method akin to chain-of-thought prompting to track AI agents’ reasoning as they updated belief states in collaborative tasks, demonstrating how the agents processed and adapted to new information.
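In practice, chain-of-thought probing pairs a reasoning cue with a way of separating the articulated steps from the final answer for analysis. The sketch below is a hypothetical illustration: the cue wording is one common variant, the parsing helper assumes numbered steps, and the canned response stands in for a real model call.

```python
def with_chain_of_thought(question):
    # One common chain-of-thought cue; the exact wording is not standardized.
    return question + "\nLet's think step by step, numbering each step."

def split_reasoning_steps(response):
    """Separate numbered reasoning lines from the final answer line."""
    lines = [ln.strip() for ln in response.splitlines() if ln.strip()]
    steps = [ln for ln in lines if ln[0].isdigit()]
    answer = lines[-1] if lines and not lines[-1][0].isdigit() else None
    return steps, answer

# Canned response in place of an actual system call.
fake_response = "1. Identify the premise.\n2. Check the inference.\nAnswer: valid"
steps, answer = split_reasoning_steps(fake_response)
```

The steps, not just the answer, then become the qualitative data: each numbered line can be coded for validity, relevance, or consistency with the lines around it.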
Incremental probing
This involves iteratively refining and improving the responses generated by the system (Schwenke et al., 2023), often based on external guidance provided by researchers (Liu et al., 2023). The process begins with researchers providing an initial prompt to guide the AI's response. After evaluating the generated output for quality and relevance, researchers modify the prompt to provide more specific instructions, clarify expectations, or address any shortcomings in the response (Shah, 2023). This cycle of evaluation and refinement continues until the system arrives at a satisfactory response, which allows for a deeper exploration of the system's behaviors and logic (Lingard et al., 2023). As an example, Otmar et al. (2025) used an iterative and incremental approach to prompting, progressively refining instructions, and engaging in feedback loops with an AI system (e.g., suggesting alternative titles, identifying areas for pacing adjustments) to improve the system's output for editing tasks. This iterative process helps researchers explore how the AI adapts its behavior (or struggles to adapt) based on previous interactions, highlighting both its capacity for and limitations in incremental behavior improvement and understanding of inputs.
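The evaluate-and-refine cycle described above can be sketched as a loop in which the researcher's judgment enters through two supplied functions. Everything here is a hypothetical scaffold: `ask` stands in for a real model call, and `is_satisfactory` and `refine` encode decisions the researcher makes, not anything automated in the cited studies.

```python
def incremental_probe(ask, initial_prompt, is_satisfactory, refine, max_rounds=5):
    """Iteratively refine the prompt until the output passes the researcher's check.

    `ask` is the model call; `is_satisfactory` and `refine` encode the
    researcher's judgment and are supplied by the study design.
    """
    prompt = initial_prompt
    transcript = []
    for round_no in range(1, max_rounds + 1):
        response = ask(prompt)
        transcript.append({"round": round_no, "prompt": prompt, "response": response})
        if is_satisfactory(response):
            break
        prompt = refine(prompt, response)
    return transcript

# Stub model: echoes the prompt length as a stand-in for a real system.
def fake_ask(p):
    return f"({len(p)} chars) draft answer"

log = incremental_probe(
    ask=fake_ask,
    initial_prompt="Summarize the study.",
    is_satisfactory=lambda r: "draft" not in r,  # never satisfied here
    refine=lambda p, r: p + " Be more specific.",
)
```

The returned transcript, one entry per round, is itself the primary data: it records how (or whether) the system's output changed as instructions were progressively tightened.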
Least-to-most prompting
Least-to-most prompting is another viable approach for observing emergent behavior, where minimal guidance is initially provided to the AI, and additional prompts are introduced only when necessary (Zhou et al., 2022). This method allows researchers to observe whether the system can reason independently and reveals its capacity to infer, adapt, or ask for clarification when guidance is minimal. As the level of prompting increases, patterns in the AI's logic, such as where it fails or succeeds in processing new sets of information, become apparent. Least-to-most prompting is valuable for exploring the limits of AI's capacity to act and reason beyond its training, as well as its ability for independent problem-solving (Vu et al., 2024).
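Procedurally, least-to-most prompting amounts to revealing guidance one increment at a time and recording how much was needed. The sketch below is hypothetical: the `least_to_most` helper, the hint list, and the stub model (which only "solves" the task once enough hints accumulate) are all illustrative placeholders.

```python
def least_to_most(ask, task, hints, solved):
    """Reveal hints one at a time, recording how much guidance was needed."""
    given = []
    response = ask(task)
    while not solved(response) and hints:
        given.append(hints.pop(0))
        response = ask(task + "\nHints so far: " + "; ".join(given))
    return {"hints_used": len(given), "final_response": response}

# Stub model: "solves" the task only once at least two hints are joined in.
result = least_to_most(
    ask=lambda p: "SOLVED" if p.count(";") >= 1 else "stuck",
    task="Decompose the problem.",
    hints=["split into subproblems", "solve the easiest first", "combine"],
    solved=lambda r: r == "SOLVED",
)
```

The count of hints consumed before success is the observable of interest: across tasks, it gives a qualitative handle on how much scaffolding the system needs before its reasoning comes through.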
Counterfactual prompting
This probing approach presents AI systems with scenarios that alter specific facts or conditions while maintaining other contextual elements. By doing so, researchers are able to evaluate how the system behaves in understanding causal relationships and maintaining logical consistency across hypothetical changes (Meinke et al., 2024). For example, researchers might ask the system to reason about ‘what if gravity worked in reverse’ or ‘what if humans had three arms,’ allowing observation of how the AI applies its knowledge to impossible scenarios. Such demonstrations are particularly valuable for assessing the system's ability to maintain logical consistency when reasoning beyond its training data, evaluating how it extrapolates from known principles to novel situations, and finally determining its ability to acknowledge or address contradictions (Ma et al., 2023). As such, this method reveals both the flexibility and limitations of AI reasoning when confronted with scenarios that cannot be solved through pattern matching alone.
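The essential move in counterfactual prompting is building matched pairs: a factual baseline and a variant that alters exactly one condition. The following sketch is a hypothetical illustration of that pairing; the helper and its wording are placeholders, not an instrument from the cited work.

```python
def counterfactual_pair(base_scenario, altered_fact):
    """Build a matched factual/counterfactual prompt pair for comparison."""
    factual = f"In the following scenario, {base_scenario}. What follows?"
    counterfactual = (
        f"Now suppose instead that {altered_fact}, keeping everything else "
        f"about the scenario the same. What follows?"
    )
    return factual, counterfactual

f, cf = counterfactual_pair(
    base_scenario="objects fall toward the ground",
    altered_fact="gravity worked in reverse",
)
```

Because only one fact differs between the two prompts, any divergence in the responses can be qualitatively attributed to how the system handles the hypothetical change rather than to incidental wording.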
Temporal and comparative analyses of AI behavior
In the study of emergent behaviors in generative AI systems, researchers can adopt two complementary approaches: Temporal Interaction Analysis (TIA) and Comparative Synchronic Analysis (CSA). Interviewing AI through these two offers distinct yet reinforcing approaches to understanding how AI systems exhibit behaviors (see Table 3 for comparison). TIA focuses on tracking behavior changes over time or iterations, while CSA offers a comparative snapshot of the behavior of multiple systems at the same time. These two methods respectively reflect the diachronic and synchronic modes of analysis in technology studies discussed by Steve Barley (1990). By employing both methods, researchers can achieve a comprehensive view of AI behavior, addressing both diachronic (over-time) and synchronic (fixed-time) dimensions of AI.
Table 3. Comparison between temporal interaction analysis (TIA) and comparative synchronic analysis (CSA).
Temporal interaction analysis (TIA)
Temporal Interaction Analysis (TIA) is intended to track and analyze the emergent behaviors of AI systems over time. It is particularly effective in observing how behaviors stagnate, evolve, or completely transform through repeated interactions. AI behaviors are dynamic and may constantly change based on the contexts of interactions, frequency of interactions, or user prompts, making a case for longitudinal analyses to capture both shifts and consistencies in these behaviors (Guzman and Lewis, 2020; Mlynář et al., 2025). In addition, models can evolve post-training through fine-tuning, new data integration, guardrails (e.g., responses to user feedback or safety improvements), or continual learning methods, shifting performance post-deployment (Burkhardt and Rieder, 2024). These changes call for temporal approaches that collect system outputs over iterations and over time.
A TIA approach helps monitor whether the system's responses become more refined, accurate, or consistent over time or over multiple iterations (e.g., Krapp et al., 2024). Furthermore, by revisiting similar prompts, researchers can observe how interaction patterns influence the AI's emergent behavior and whether learning or stabilization occurs. For example, in their study comparing AI's performance with a human researcher, Wachinger et al. (2025) used an iterative method, prompting the system repeatedly to examine its evolving responses. They noted both shifts and stabilization in the AI's ability to identify descriptive themes, but also inconsistencies in generating insightful codes and performing deeper interpretive analysis. As another example, Janse van Rensburg (2024) used similar prompts across three different time frames to assess the consistency of AI responses and observe potential variations over time. One of the key goals of the study, in line with the ideals of a TIA approach, was to explicitly determine if the system consistently produced responses in which the same critical thinking skills and dispositions could be identified and, consequently, to establish if its modeling capacity changed over time.
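The bookkeeping a TIA study requires, revisiting the same prompt across sessions and diffing the responses, can be sketched minimally. The `record_session` and `responses_over_time` helpers below are hypothetical scaffolding, and the canned answers stand in for real model outputs collected weeks apart.

```python
import datetime

def record_session(store, system_name, prompt, response):
    """Append a timestamped observation so later runs can be compared."""
    store.append({
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "system": system_name,
        "prompt": prompt,
        "response": response,
    })

def responses_over_time(store, prompt):
    """All recorded responses to one prompt, in chronological order."""
    return [obs["response"] for obs in store if obs["prompt"] == prompt]

session_log = []
record_session(session_log, "model-x", "Define critical thinking.", "Answer v1")
record_session(session_log, "model-x", "Define critical thinking.", "Answer v2")
```

The chronological list per prompt is what the researcher then reads qualitatively, asking whether the responses refine, stabilize, or drift, which is precisely the comparison studies like Janse van Rensburg (2024) make across time frames.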
The TIA approach is not solely focused on changes in system behavior over time; it can also be used to gain a more comprehensive and immersive understanding of system behavior and performance. For instance, researchers performed autoethnographies over several months to gain a firsthand account of how generative AI systems could address accessibility needs (Glazko et al., 2023), the potential and challenges of using AI systems in academic writing (Schwenke et al., 2023), and the complex quasi-social relationships formed with the AI system (Krapp et al., 2024). In these examples, the TIA approach provided the researchers with deep access to individual experience and insights into the systems’ behaviors and performance, including both potentials and limitations, which emerge through sustained, real-world interactions over an extended period of time.
Comparative synchronic analysis (CSA)
Comparative Synchronic Analysis (CSA) offers a snapshot comparison of multiple AI systems (e.g., Abdullahi et al., 2024; Alfirević et al., 2024) using the same set of prompts or tasks and collecting data during a defined short-term period. CSA is particularly useful when researchers are looking to evaluate how different systems behave relative to identical tasks or prompts under similar conditions (Buyl et al., 2024). For instance, Collier et al. (2024) employed a systematic CSA-based evaluation of multiple LLMs to assess their performance in similar tasks of product risk assessment at a single point in time. The researchers used the identical prompts across different LLMs for each product category, and their assessment involved recording which failure modes, injuries, and risk mitigation tactics were highlighted by each model.
CSA is particularly valuable in studies where researchers focus on comparing the performance and behavior of different AI systems against similar benchmarks, without requiring extended longitudinal engagement. For example, Akyon et al. (2024) critically assessed the comprehension capabilities of six different LLMs at a single point in time by focusing on a subset of medical research papers and comparing their responses against a benchmark established by medical experts.
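Structurally, a CSA design is a systems-by-prompts matrix filled in during one short data-collection window. The sketch below is an illustrative scaffold: the `synchronic_comparison` helper is hypothetical, and two stub functions stand in for calls to real systems.

```python
def synchronic_comparison(systems, prompts):
    """Run the same prompts across multiple systems at one point in time."""
    return {
        name: {p: ask(p) for p in prompts}
        for name, ask in systems.items()
    }

# Two stub "models" standing in for real systems under comparison.
systems = {
    "model-a": lambda p: f"A says: {p.upper()}",
    "model-b": lambda p: f"B says: {p.lower()}",
}
matrix = synchronic_comparison(systems, ["Assess Product Risk"])
```

Reading down a column of the matrix (one prompt, all systems) is the CSA comparison itself; because every cell was collected under the same conditions, differences between rows can be attributed to the systems rather than to the prompts or the moment of collection.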
Exploring and interpreting patterns in AI behavior
In studying emergent behaviors in AI systems, much like in traditional qualitative research, data collection and analysis can occur simultaneously (Merriam and Tisdell, 2015). This iterative approach allows researchers to adjust their probing techniques and explore new themes as new insights emerge during their interactions with AI systems (e.g., Alfirević et al., 2024). This also means that researchers continually refine their probing approach in response to how the AI behaves, adjusting based on their ongoing analysis of its outputs.
By treating AI-generated responses as reflections of emergent behavior, researchers can apply qualitative methods traditionally used to study human behavior. These approaches allow for the identification and theorization of behavioral patterns, such as emerging affordances and limitations in AI abilities relative to specific tasks (e.g., Chavan et al., 2024). For example, Krakowski (2025), by using varied prompts and responses of generative AI systems, discussed how these systems may (mis)perform arithmetic tasks and speculated about potential variation in performance. In the following section, we outline other examples of approaches, adapted from human-centered behavioral studies, that can effectively be applied to the study of AI.
Identifying patterns through thematic analysis
At its core, thematic analysis serves as a useful foundation for spotting and interpreting recurring themes and patterns in AI-generated outputs. Recent research has shed light on aspects of AI behavior, from how these systems maintain consistent personas and tone to the way they develop preferences over time (Krapp et al., 2024). A thematic coding approach helps researchers systematically map out various types of AI errors, whether they represent logical challenges, struggles with edge cases, or repeated inaccuracies (Stroebl et al., 2024). For example, Lewis and Mitchell (2024) manually examined a sample of incorrect responses from humans and GPT models to categorize the types of errors made to understand the underlying reasoning processes. Thematic analysis can also help generate more systematic insights into AI capabilities, particularly in areas like critical thinking and reasoning within AI-generated content (Janse van Rensburg, 2024).
Thematic analysis can extend beyond coding by penetrating deeper layers of meaning in AI behavior, enabling researchers to interpret the social, cultural, and situational factors shaping the behavior of AI systems. Consider how researchers explore the way AI-generated text mirrors implicit assumptions or biases from its training data (e.g., Munn and Henrickson, 2024; Park et al., 2025). Taking this further, hermeneutic approaches offer an even more contextually aware analysis by focusing on the meaning and value ingrained in AI responses. By carefully examining word choice, structure, and context, researchers can better articulate how AI systems create meaning through user interactions. For example, focusing on features like the strategic use of politeness, conversational repairs, and the dynamic deployment of repetitions, Jones et al. (2025) examined the coproduction of meaning and manifestation of cultural conventions in conversation between a human and an AI chatbot. This approach acknowledges and accommodates the interactive, interpretive nature of AI behavior, and inquires into how these systems operate in specific interactive situations, manifesting the values, logic patterns, or even representations of certain cultural values embedded in their outputs (Henrickson and Meroño-Peñuela, 2025).
Uncovering ideological and social biases through critical analysis
Approaches such as critical discourse analysis (CDA) have emerged as powerful strategies for examining how AI behavior reflects broader societal biases, ideologies, and power dynamics. CDA is a qualitative, interpretive method that focuses on language as a social practice and interrogates how discourse structures enact, legitimize, or obscure power relations (Wodak and Meyer, 2015). Through these approaches, researchers can focus on linguistic patterns that might reveal the worldviews of AI creators or the hegemonic discourses embedded in training data (Buyl et al., 2024; Gourlet et al., 2024), or illustrate how these systems may amplify existing social injustices (Kay et al., 2024). A fruitful analytical approach for exploring the systems' ideological foundations in this context revolves around tracing how training data may specifically drive AI responses (Lee et al., 2025).
A compelling example of CDA-based analysis comes from researchers who examined AI-generated recommendation letters, discovering differences in how frequently certain nouns and adjectives appeared for male versus female candidates. This study underscored persistent gender stereotypes in descriptions of ability, leadership qualities, community involvement, and personal life (Wan et al., 2023). As another example, by focusing on a random subset of the generated responses, Park et al. (2025) directly engaged with the specific ways through which an LLM associated autism with common stereotypes, such as difficulties in social skills (“socially awkward”), sensory sensitivities requiring management, and the idea of autistic people being “unique” often framed in terms of skills beneficial to others. Such a critical qualitative analysis enabled researchers to reveal an implicit “bias paradox” in the system performance: While perpetuating negative stereotypes and a deficit-oriented perspective, the system also frequently used explicitly inclusive language and expressed a desire to be inclusive or connect with autistic people.
Quantifying or explaining behavioral patterns through content analysis
Content analysis gives researchers a systematic way to track and measure specific linguistic features in AI-generated text, helping identify and quantify behavioral patterns. Spurlock et al. (2024) put this approach to work in studying how AI systems represent and compare items (in this case, movies) and how varying levels of detail in prompts may affect recommendation quality. Similarly, De Freitas et al. (2024) quantified loneliness-related conversations with AI, demonstrating how content analysis can uncover patterns of engagement in AI-human interactions. In another example, researchers tracked the frequency of terms like “civilian” or “terrorist” to explore the ways through which language choice influenced the portrayal of violence and airstrikes across different linguistic contexts (Kazenwadel and Steinert, 2023).
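The term-frequency tracking described above can be illustrated with a minimal sketch. The responses and term lexicon here are hypothetical stand-ins, not data from the cited studies; in practice, researchers would substitute their own corpus of collected AI outputs and a theoretically motivated set of terms.

```python
from collections import Counter
import re

# Hypothetical AI-generated responses standing in for a collected corpus.
responses = [
    "Reports described the strike as targeting militants near the city.",
    "Witnesses said civilians were among those affected by the strike.",
    "Officials characterized the operation as targeting militants.",
]

# Illustrative lexicon of terms of interest, in the spirit of
# content-analytic studies of conflict reporting.
terms = ["civilian", "militant", "terrorist"]

def term_frequencies(texts, terms):
    """Count occurrences of each term (matched as a substring,
    case-insensitively, so plurals are included) across the corpus."""
    counts = Counter()
    for text in texts:
        for term in terms:
            counts[term] += len(re.findall(term, text, flags=re.IGNORECASE))
    return counts

print(term_frequencies(responses, terms))
```

A tally like this is only the quantitative backbone of content analysis; the interpretive work of relating frequencies to framing and context remains with the researcher.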
Treating human-AI exchanges as unified analytical units
When applying all these types of qualitative analysis noted above, it stands to reason to treat the prompts used and AI responses as a single unit of analysis, as humans and AI co-produce meaning in their unique context of interactions (Jones et al., 2025). This unit of analysis aligns with the principle of examining the sequential organization of ‘talk-in-interaction’, in this case, the exchange between a human and an AI system (Mlynář et al., 2025). Both elements work together to reveal the system's thinking and behaviors as part of the same analytical framework (Henrickson and Meroño-Peñuela, 2025).
Triangulating methods for a holistic understanding of AI behavior
Relying on a single research method to study AI behavior offers a limited view of how these systems operate and unfold in practice; interviewing AI, therefore, should not be thought of as a standalone research method. Triangulation has been recognized as the process of using multiple research methods or sources to study the same phenomenon (Korstjens and Moser, 2018). This practice can contribute to research validity by enabling researchers to see different dimensions of AI behavior from complementary angles. As such, interviewing AI can be greatly complemented by integrating it with other methods, such as quantitative approaches, technical analysis, and user studies (Ma et al., 2024; Xu et al., 2024).
Integrating multi-source public interaction records
The first-hand qualitative data discussed above can be complemented with an analysis of publicly available datasets documenting thousands of users’ interactions with AI systems. For instance, the Dev-GPT dataset captures AI use in collaborative coding via GitHub (Chavan et al., 2024), while Cleverbot logs provide examples of informal, spontaneous dialogue with AI chatbots (De Freitas et al., 2024). More critically, large-scale datasets like WildChat (Zhao et al., 2024), LMSYS-Chat-1M (Zheng et al., 2023), and public conversation records (e.g., Cheng et al., 2025) offer valuable insight into diverse user interactions across different social contexts. Integrating such datasets enables researchers to triangulate findings and assess patterns and consistency in emergent AI behavior.
Juxtaposing system behavior with human perspectives
Another useful way to triangulate qualitative research on AI is to juxtapose system behavior with the assessment of human actors or experts, particularly when human-AI interaction is a focus of the study (Arawjo et al., 2024; McDuff et al., 2023). These assessments provide a baseline or human lens through which AI behavior can be examined and contextualized (e.g., Bijker et al., 2024; Sänger et al., 2024). For example, Hou et al. (2024) compared an AI chatbot's relationship advice with Reddit users’ rankings of similar solutions and investigated how closely the AI's suggestions aligned with the collective judgment of these users. The combination of system behavior and lived experience helps build a more holistic understanding of AI, integrative of both technical performance and human interpretation.
Such triangulation highlights the system's capacities and limitations not only from the AI's internal logic but also from the external perspectives of people who encounter these systems in real-world applications (e.g., Otis et al., 2023). This kind of qualitative research uncovers how AI's “functioning in the wild” may bear on the perceived usefulness and credibility of AI in exploratory contexts.
Triangulation with quantitative analysis
Interviewing AI can also be triangulated with quantitative studies that provide a more structured or measurable view of AI systems (e.g., Scholl et al., 2024). For example, researchers conducted a mixed-methods study combining quantitative metrics (response accuracy) and in-depth qualitative thematic analysis to evaluate an AI's performance in providing contextually relevant support within patient-centered dementia care scenarios (Li et al., 2024). As another example, Lorè and Heydari (2024) relied on statistical analysis of data from game-theory simulations to identify patterns in LLM strategic decision-making, such as cooperation rates. They then complemented this with an exploratory, qualitative analysis, which focused on the explicitly prompted reasoning provided by the systems for selected illustrative examples. This combined approach, analyzing both the quantitative outcomes and the qualitative insights from the systems’ stated motivations, allowed the researchers to arrive at a more systematic picture of how different LLMs behave strategically across varying scenarios.
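The pairing of a quantitative metric with qualitative theme codes can be sketched in a few lines. The records below are hypothetical: each pairs a correctness judgment (the quantitative strand) with a researcher-assigned theme code (the qualitative strand), and the summary reports both side by side, as a mixed-methods study might.

```python
from collections import Counter

# Hypothetical evaluation records pairing a quantitative correctness
# judgment with a researcher-assigned qualitative theme code.
records = [
    {"prompt": "q1", "correct": True,  "theme": "empathetic framing"},
    {"prompt": "q2", "correct": False, "theme": "hallucinated detail"},
    {"prompt": "q3", "correct": True,  "theme": "empathetic framing"},
    {"prompt": "q4", "correct": False, "theme": "hedged refusal"},
]

def summarize(records):
    """Return overall response accuracy alongside counts of qualitative
    theme codes, mirroring a mixed-methods summary table."""
    accuracy = sum(r["correct"] for r in records) / len(records)
    themes = Counter(r["theme"] for r in records)
    return accuracy, themes

accuracy, themes = summarize(records)
print(f"accuracy={accuracy:.2f}", themes.most_common())
```

Such a summary does not replace thematic interpretation; it simply keeps the two strands of evidence aligned on the same set of interactions so they can inform each other.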
In short, through triangulation with other methods, interviewing AI becomes an integral part of the broader research methodological toolkit and sources. Together, these varied research approaches ensure that the multidimensional nature of AI systems’ behavior and performance (e.g., how they process information, respond to prompts, and interact with humans) can be fully captured, explained, contextualized, and theorized.
Ethical considerations in qualitative methods for studying AI systems
In this section, we explore key considerations in applying qualitative methods to the study of AI that can enhance the ethical validity of research.
Responsibility in interpreting emergent AI behavior
While interpretation is a cornerstone of qualitative research, applying it to AI systems raises unique ethical concerns. In this context, researchers must interpret and theorize behaviors from non-human agents that essentially lack key human characteristics such as intent or self-awareness. Therefore, there is always a risk of over-attributing human-like traits to these systems (DeVrio et al., 2025; Peter et al., 2025).
To avoid such issues, ethical qualitative research should contextualize AI behavior within the design constraints of these systems. While AI systems may exhibit human-like patterns, they are fundamentally different in nature. Over-anthropomorphizing AI (e.g., assigning exclusively human faculties to AI) can lead to misleading theories about AI capabilities and limitations (Bory et al., 2025).
Moreover, in applying interpretive approaches, researchers need to remain mindful and critical of AI systems as non-neutral actors that inevitably bring forward specific worldviews and associated biases (Guzman and Lewis, 2020; Jones et al., 2025). As demonstrated in numerous studies of AI systems and their biased responses or exclusionary language (e.g., Kay et al., 2024; Wan et al., 2023), the researcher's role in responsibly revealing and theorizing these inherent characteristics of AI systems becomes even more pivotal.
Transparency in reporting and theorizing AI behavior
In the context of interviewing AI, a responsible and transparent approach entails clear documentation of how probing was conducted and how emergent behaviors were identified and analyzed (de Seta, 2024; Shah, 2023). This methodological transparency allows for accountability (see, for example, Waller et al., 2024) while also ensuring that other researchers can critically engage with research outcomes.
Specifically, audit trails have proven to be effective safeguards that can enhance the validity and transparency of qualitative research by keeping records of the research trajectory (Korstjens and Moser, 2018). Detailed journaling of interactions with AI can create comprehensive documentation that includes annotated screenshots and notes. This documentation can capture rich, detailed accounts of researchers’ individual experiences with AI systems, enable critical examination of those experiences, help researchers process uncomfortable emotions, and increase research transparency, especially when conducted as a collaborative effort (Desai et al., 2023).
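The journaling practice described above can be supported by even very simple tooling. The sketch below, with hypothetical content and a made-up file name, appends each prompt-response exchange, with a timestamp and an optional researcher annotation, to a JSONL audit trail that can later be shared or critically revisited.

```python
import json
import os
import tempfile
from datetime import datetime, timezone

def log_interaction(path, prompt, response, note=""):
    """Append one prompt-response exchange, with a UTC timestamp and an
    optional researcher annotation, to a JSONL audit-trail file."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt,
        "response": response,
        "researcher_note": note,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")

# Example usage with hypothetical content.
trail = os.path.join(tempfile.gettempdir(), "audit_trail.jsonl")
log_interaction(
    trail,
    "Explain your last answer.",
    "I based it on general patterns in my training data.",
    note="System deflects when asked about sources.",
)
```

Plain append-only files of this kind are deliberately low-tech: they preserve the chronological order of the research trajectory and remain readable without special software, which serves the transparency goals discussed here.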
Further, when interpreting findings about AI systems, researchers should clearly articulate the limits of their methods and avoid making broad generalizations that extend beyond the scope of their study (Gillespie, 2024; Mlynář et al., 2025). For example, the artificially constructed scenarios discussed earlier under probing approaches can reveal important dimensions of AI systems’ behavior in practice, but they also run the risk of presenting an incomplete or skewed picture of the system's capabilities or drawbacks. As another example, when applying a comparative analysis approach, researchers should be transparent about the specific conditions under which multiple systems are evaluated and how those conditions might have influenced the results within a specific timeframe, given that AI systems may change dynamically over time and consequently exhibit dissimilar behaviors at different points in time.
Reflexive analysis of researchers’ influence
Recognizing the researcher's role in human-in-the-loop qualitative studies of AI also requires reflexivity regarding their influence on ‘interviewing AI.’ Qualitative research is not necessarily about objective data collection but rather an inherently co-creative, subjective process. In the context of interviewing AI, researchers do not interface with fully autonomous actors or agents who have independent intent or sentience (Bory et al., 2025; Mlynář et al., 2025); rather, as Suchman and Thimm (2024) argue, these systems, even when behaving unpredictably, do not act outside of their relations with humans (notably, the researcher in this case). Reflexive analysis, therefore, should directly focus on the relational aspects of AI (Guzman and Lewis, 2020) or what Esposito (2022) calls “artificial communication” between humans and machines, emphasizing how researchers view the role and nature of both AI and themselves in light of their interactions with these systems. While emphasizing our earlier point about human-AI exchanges as unified analytical units, we note that this dynamic implicates researchers themselves in the evaluation of AI behaviors (Arora et al., 2024), as their decisions, such as probing strategies, actively influence the behavior of the systems (Henrickson and Meroño-Peñuela, 2025; Krapp et al., 2024).
In classic qualitative research, the researcher's role in shaping interactions and interpretation needs to be properly acknowledged and scrutinized (Berger, 2015). In the context of studying AI, it is equally important to recognize how researchers’ study design can steer the AI toward specific types of responses (Krakowski, 2025). Much like ethnographers account for their impact on fieldwork, AI researchers must recognize how their situations, personal contexts, and perceptions, as well as their unique interaction styles, shape the observations they make through questions and prompts (Desai and Twidale, 2023; Glazko et al., 2023). As noted, these systems, as conversational reflections of humans, could project or even reinforce the biases of the researcher and society at large (Kaplan, 2024).
This active participation of researchers further calls for reflexive practices, such as critically examining assumptions, documenting biases, and continuously evaluating the coding and analysis processes, to help researchers understand and declare their positionality, influence, and mitigate potential bias-related distortions in research outcomes (Janse van Rensburg, 2024).
Conclusion
Exploring and studying AI systems through qualitative methods offers a critical lens to uncover and capture the emergent behaviors, capabilities, and limitations of these systems in vivo. By “interviewing” AI through structured prompts and analytical frameworks, qualitative researchers can theorize about the interplay between system design, user interactions, and broader sociotechnical contexts of use. The approaches discussed in this article complement quantitative and technical methods and provide key insights into the adaptive, interactive, and, more importantly, interpretive and critical dimensions of AI systems and their actual behavior in practice.
In this article, we highlighted methodological tools that can be adopted for studying AI. These methods exhibit strong disciplinary overlaps with interpretive sociology, ethnography, critical studies of algorithms, human-machine communication, and human-computer interaction traditions. Together, they highlight a growing methodological convergence aimed at making sense of machine behavior through human-centered inquiry.
The interviewing AI framework, as such, provides an empirical approach to generate a pragmatic understanding of AI systems, their actual behaviors and implications, which are already deeply influencing various social domains, in contrast to the overhyped ‘imaginaries of AI’ as capable of emulating human intelligence (i.e., general artificial intelligence) (Bory et al., 2025). Transcending the prevalent framing of AI as an ‘uncontroversial thing’ as discussed by Suchman (2023), our framework seeks to contribute to a critical understanding of AI as taking shape in practice. It does so by capturing the situated behaviors of these systems that emerge at the intersection of various research probes, use contingencies, and systems’ underlying processes and embedded logic that reflect an entanglement in actor networks from programmers to gatekeepers (e.g., training data regimes that may engender specific forms of bias) (Bareis, 2024).
Interviewing AI is uniquely suited to capture the depth and contextuality of AI behavior through direct interactions and analyses of outputs, contributing to the conceptualization of ‘AI in situated action’ (Gourlet et al., 2024; Mlynář et al., 2025; Monteiro et al., 2024). Understanding AI actions in real-world applications helps trace some of the fundamental roots of ‘AI frictions’ (Kaun and Männiste, 2025; Marres et al., 2024) by revealing the underlying technical mechanisms contributing to societal problems and controversies around AI.
While AI systems like LLMs continue to evolve rapidly, shifting toward embedding capabilities such as reasoning, autonomous actions, and multi-agent coordination, the interviewing AI framework remains a durable method for capturing emergent behaviors irrespective of specific system architectures. Its emphasis on interpretive interaction and attention to system inputs and outputs ensures a degree of adaptability and relevance across generations of AI models. As AI systems progressively partake in complex social tasks, interviewing AI offers a scalable approach to exploring the sociotechnical dynamics of AI in practice. In this context, integrating large-scale interaction data points to future methodological opportunities.
Limitations and future research
The interviewing AI framework serves as a normative and integrative guide, synthesizing diverse methodological traditions and emerging qualitative practices for studying AI systems. The framework is not intended as a step-by-step procedural recipe, nor is it the outcome of a single empirical study. Rather, it offers a conceptual map that researchers of AI can adapt and appropriate based on their own research contexts, research problems, and disciplinary orientations. While no single case study to date has employed all framework elements in a unified investigation, we hope its modular and flexible elements allow for tailored applications suited to a variety of empirical demands. In this way, the framework follows what Kaplan (1964) refers to as a “reconstructed logic”: a normative idealization of scientific inquiry that, while not describing actual research practice in totality, can nonetheless guide future methodological development.
Future research could benefit from unified or multi-phase case studies that progressively integrate multiple framework elements, consequently evaluating their utility across diverse research domains, interactional contexts, and research goals. Finally, this article specifically focuses on the study of LLM-based chatbots, with the recognition that AI systems take various forms (Jarrahi and Glaser, 2025), and that the qualitative methods broached here may not generalize to other AI systems or intelligent machines, particularly those that do not interface with humans in similar ways (e.g., through prompting-based interactions).
Funding
The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the Frances Carroll McColl Term Professorship at the University of North Carolina at Chapel Hill.
Declaration of conflicting interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
