Abstract
Can artificial-intelligence (AI) systems, such as large language models (LLMs), replace human participants in behavioral and psychological research? Here, I critically evaluate the replacement perspective and identify six interpretive fallacies that undermine its validity. These fallacies are (a) equating token prediction with human intelligence, (b) treating LLMs as the average human, (c) interpreting alignment as explanation, (d) anthropomorphizing AI systems, (e) essentializing identities, and (f) substituting model data for human evidence. Each fallacy represents a potential misunderstanding about what LLMs are and what they can tell researchers about human cognition. In the analysis, I distinguish levels of similarity between LLMs and humans, particularly functional equivalence (outputs) versus mechanistic equivalence (processes), while highlighting both technical limitations (addressable through engineering) and conceptual limitations (arising from fundamental differences between statistical and biological intelligence). For each fallacy, specific safeguards are provided to guide responsible research practices. Ultimately, the analysis supports conceptualizing LLMs as pragmatic simulation tools—useful for role-play, rapid hypothesis testing, and computational modeling (provided their outputs are validated against human data)—rather than as replacements for human participants. This framework enables researchers to leverage language models productively while respecting the fundamental differences between machine intelligence and human thought.
Collecting data from human participants is resource-intensive and time-consuming. The rise of large language models (LLMs) capable of generating human-like text has sparked substantial interest among psychologists and social scientists (Abdurahman et al., 2025; Demszky et al., 2023; Lin, 2023). In tasks ranging from perception and cognition to language and moral reasoning, LLMs sometimes produce responses that mirror those of average human participants (Aher et al., 2023; Binz & Schulz, 2023b; Dasgupta et al., 2022; Dillion et al., 2023; J. Hu et al., 2024; Marjieh et al., 2024). As these models are increasingly incorporated into experimental designs and research methodologies, promising instant and tireless responses, a question arises: Could they ever stand in for humans in behavioral and psychological research?
Across various fields, proposals have emerged suggesting that LLMs could, to varying degrees, “substitute human participants” in empirical research as “silicon samples” (Sarstedt et al., 2024), “supplant human participants for data collection” in social science as “simulated participants” (Grossmann et al., 2023), serve as “a single participant” and “a proxy for human participants in a certain set of circumstances” in psychological science (Dillion et al., 2023), and “replace humans” in human-centered design to provide “simulated user responses” (Schmidt et al., 2024). Echoing this perspective, commercial entities now peddle synthetic, human-like artificial-intelligence (AI) subjects for user and market research—advertising user research “without the users.”
Although replacement of human participants is not yet mainstream practice, these emerging viewpoints reflect a broader narrative in AI development—one that envisions systems that could “outperform humans at most economically valuable work” (OpenAI, 2024) or “on a vast array of tasks” (Bengio, 2024). Although AI researchers overwhelmingly treat LLMs as task-automation tools and agents, the discourse around such powerful AI capabilities creates a context in which proposals to use LLMs as research participants naturally emerge. In light of these broader technological ambitions and the specific research proposals they inspire, it is both timely and necessary to critically evaluate the replacement perspective and its methodological implications, particularly to clarify the role of LLMs in behavioral- and social-science research more broadly, whether or not the replacement proposal gains traction.
To this end, in this article, I examine the fundamental assumptions that underlie proposals to use LLMs as human replacements. By clarifying what LLMs actually are—computational tools that simulate linguistic behavior through statistical pattern matching and reinforcement learning (see Box 1)—one can better understand their appropriate role in psychology and behavioral sciences.
Language Models and the Language-User Illusion
Six key fallacies are identified, each arising from the misinterpretation of LLMs as human replacements: the token-prediction-as-human-intelligence fallacy (conflating statistical text prediction with genuine human intelligence), the average-human fallacy (assuming model outputs represent typical human responses), the alignment-as-explanation fallacy (interpreting similarity between model and human outputs as evidence of shared cognitive mechanisms), the anthropomorphism fallacy (attributing human-like mental states to AI), the identity-essentialization fallacy (treating social categories as fixed, homogeneous traits rather than fluid, contextual identities), and the substitution fallacy (presuming LLM-generated data can directly replace human evidence without validation).
These fallacies undermine research validity—the degree to which researchers’ evidence and reasoning support the conclusions they draw (Cook & Campbell, 1979; Cronbach & Meehl, 1955; Kerschbaumer et al., 2025; Stanley & Campbell, 1963). Validity concerns in LLM research stem from both methodological choices (how researchers prompt and test models) and theoretical interpretations (what researchers claim model outputs represent). Each fallacy undermines validity in distinct ways: Some compromise the ability to make causal inferences, others limit generalizability across populations, and still others reflect misalignment between what researchers intend to measure and what they actually capture. The methodological and theoretical safeguards provided for each fallacy address these threats by strengthening both research design and interpretive frameworks.
Taken together, these analyses support the function of LLMs as neural language simulators. Rather than stand-ins for people, they reveal latent psychological and cognitive dynamics embedded in the text—dynamics rooted in the communication of human thoughts, attitudes, and behaviors. This approach offers both conceptual clarity and practical guidance for leveraging LLMs appropriately in behavioral and cognitive research.
Interpretive Fallacies in LLM-Based Research
To understand why LLMs function as simulators rather than replacements, one must first examine the conceptual errors that underlie the replacement perspective. Six fallacies emerge in discussions of LLMs as human substitutes, each revealing a different way researchers might misinterpret what LLMs are and what their outputs signify (Table 1).
Six Fallacies in Conceptualizing LLMs as Human Replacements
Note: LLM = large language model; WEIRD = Western, educated, industrialized, rich, and democratic; RLHF = reinforcement learning from human feedback.
Token-prediction-as-human-intelligence fallacy
LLMs accomplish tasks once reserved for humans through a layered process: They are initially pretrained to predict sequential tokens and then posttrained via fine-tuning and reinforcement learning; finally, their reasoning behavior is shaped by guidance from well-crafted prompts and inference-time techniques (Box 1). This process equips LLMs with intricate linguistic knowledge (e.g., syntactic rules, semantic relations)—often termed “formal” linguistic competence (Mahowald et al., 2024). Crucially, these models do more than memorize patterns; they demonstrate the ability to navigate complex, context-dependent tasks in both language processing and generation (Mahowald et al., 2024; Millière & Buckner, 2024a). Through learning and contextual adaptation, LLMs can infer underlying task structures and generate appropriate responses across varied domains—a form of instrumental knowledge that supports problem-solving in diverse settings (Yildirim & Paul, 2024).
Yet at their core, LLMs have no minds (Searle, 1980) but are autoregressive statistical models that manipulate language—a task different from human cognition—producing a kind of ungrounded intelligence. They manipulate language by leveraging statistical patterns learned during training; although these patterns encode aspects of derivative meaning present in text-corpus relationships (e.g., “Paris”–“France,” “fire”–“hot”), the models themselves are yet to access this meaning through direct, grounded experience (e.g., visual qualia, proprioception). This mirrors Church encoding in lambda calculus, in which data and operations are expressed purely through abstract functions, defined by how they transform inputs to outputs rather than by inherent, grounded meaning (Church, 1936). This is in contrast to human intelligence, which integrates general and specialized capabilities through evolutionary adaptations and developmental learning (Box 2)—and is “grounded in one’s embodied physical and emotional experiences” and “deeply reliant on one’s social and cultural environments” (Mitchell, 2024).
Biological Versus Machine Optimization
A fallacy emerges when one mistakes the capacity of LLMs to predict language patterns and solve cognitive problems for genuine human-like understanding or intelligence. To unpack this fallacy, it is helpful to distinguish between two levels of understanding: functional understanding (reliably producing appropriate outputs, as when a dog successfully catches a ball) and reflective understanding (consciously grasping underlying principles, as when humans can articulate why the ball follows a particular trajectory). Although LLMs may demonstrate considerable functional understanding by generating contextually appropriate responses, they currently lack the reflective understanding that emerges from embodied experience and consciousness. Consequently, mistaking their output for genuine cognition would misrepresent what is being measured; the model’s linguistic performance, based on statistical token prediction, risks being conflated with human cognitive processes that integrate both functional and reflective understanding. The “intelligence” of LLMs is thus operationally different from the human cognition it might appear to simulate.
This fallacy manifests in two major ways. First, although LLMs process vast amounts of text reflecting real-world information and human experience, their tokens operate without direct, firsthand sensorimotor contact with the world. Consequently, they lack embodied referents and lived experience derived from such interaction (Leivada et al., 2023). This critical absence of grounding is demonstrated by how text-based LLMs, although capable of representing nonsensorimotor features of human concepts (e.g., emotional valence), systematically fail to capture sensorimotor features, particularly those related to motor actions. Indeed, the alignment between LLM and human conceptual representations diminishes markedly from nonsensorimotor to sensorimotor domains (Xu et al., 2025).
Second, this absence of first-person, sensorimotor grounding limits the models’ connection to the physical and social realities they aim to simulate. This hinders the development of fully functional linguistic competence—the use of language to achieve goals in the world (Mahowald et al., 2024). It also restricts their capacity to acquire the kind of worldly knowledge needed for robust world models—internal representations that are structure-preserving and behaviorally efficacious in real-world interactions (Webb et al., 2023; Yildirim & Paul, 2024). Such models, including cognitive maps, body schemas, and spatial schemas, support embodied reasoning and action, which remain beyond the scope of current models (Wicke & Wachowiak, 2024).
A critical consequence of this fallacy is that when researchers attribute changes in model outputs to experimental manipulations without recognizing the underlying statistical nature, they inevitably risk misinterpretation. Consider a simple prompt: “Mao Zedong was . . . .” Unlike interacting with another mind, when people engage with a chatbot, they are not seeking its opinion—despite the compelling illusion thereof—but rather making a computational request: Given the statistical distributions in the language model, what sequence is most likely to follow these words? Models trained for neutrality will likely provide correspondingly factual responses. Indeed, the character of such responses is shaped by the model’s training, geopolitical alignment, and prompt language. For example, DeepSeek-R1—pretrained mostly on English and Chinese text and subsequently aligned with government regulatory directives (Box 1)—showed substantially higher proportions of Chinese-state propaganda and anti-U.S. bias compared with ChatGPT o3-mini-high, which was pretrained primarily on English text. This bias was most pronounced with Simplified Chinese queries, diminished with Traditional Chinese inputs, and was nearly absent when queried in English (Huang et al., 2025).
Fundamentally, then, the model, unlike a human, has no communicative intent; no opinion of, attitude toward, or belief about Mao; and no intrinsic capacity to tell the truth—it just models a distribution of token sequences based on the training texts (Shanahan, 2024). Although LLMs manifest proficient language use that goes beyond simply retrieving prerecorded text strings, they are not true language users in the philosophical sense. They do not possess intrinsic meaning, communicative intentions, or other internal states essential to human language users (Block, 1981).
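To make the "computational request" framing concrete, the following minimal sketch (assuming the open-source transformers library and the small GPT-2 model rather than any of the proprietary chat systems discussed here) shows what a prompt actually elicits from a language model: a probability distribution over candidate next tokens, not an opinion or belief.

```python
# Minimal sketch: a prompt yields a distribution over next tokens.
# Assumes the Hugging Face `transformers` package and the open GPT-2 model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)

# Probability distribution over the vocabulary for the token that would follow the prompt
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)

for prob, token_id in zip(top.values, top.indices):
    print(repr(tokenizer.decode([int(token_id)])), round(prob.item(), 3))
```

Whatever continuation is then sampled from this distribution reflects regularities in the training texts, not a stance taken by a communicating agent.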
Indeed, model outputs can be highly sensitive to seemingly trivial prompt variations—variations that human language users would tolerate (Ivanova, 2025). Small changes in prompt language, wording, order, or context can produce drastically different outputs not because of the manipulated variable of interest but because of the model’s inherent sensitivity patterns—a phenomenon that contrasts with how humans process language.
In addition, findings from LLM studies cannot be reliably generalized to human cognition. The gap between token-prediction mechanisms and human cognitive processes fundamentally limits what one can infer about real human populations from model behavior, even when performance patterns may appear superficially similar. As philosopher John Searle (1980) illustrated in his “Chinese Room” thought experiment, a system might process symbols according to rules without understanding their meaning—just as a person following instructions to manipulate Chinese characters could produce appropriate responses without comprehending Chinese. When models generate coherent language without the intentionality, consciousness, and direct grounding in real-world experience that characterize human intelligence (Searle, 1980), they are capturing linguistic form (the observable structure of language) rather than true meaning (the communicative intent behind language; Bender & Koller, 2020). Therefore, to confuse cognitive algorithms for cognition—or models of the mind for the mind itself—represents a category mistake both ontologically (misidentifying their nature) and epistemologically (misunderstanding their knowledge).
No doubt LLMs (and AI in general) will continue to advance. But, as detailed in Box 3, even these improvements face fundamental conceptual limitations regarding embodied experience and grounding. Knowledge, as constructivists such as Jean Piaget have argued, is built from sensory experiences and perceptions, which are then layered with symbols and categories over time. Without such experiences, LLMs are like an artificial version of the proverbial Mary, who studies everything about color in a black-and-white room her entire life (Jackson, 1982). One can argue that when Mary leaves her room and sees color for the first time, she learns something new—what it is like to see something pink (Jackson, 1986). If so, a great deal more is at stake for language models, which experience neither color nor anything else. Indeed, unlike Mary, who can use her other experiences as a scaffold for understanding color—much like Helen Keller using associations from senses such as touch and smell to construct a color scheme despite being blind and deaf (“Pink makes me think of a baby’s cheek, or a gentle southern breeze” [Keller, 1929])—LLMs have none of this kind of sensory scaffolding.
Technical Versus Conceptual Limitations in Model Simulation of Human Behavior and Cognition
Essential aspects of human cognition—emotions, intuition, and other subjective experiences—so far remain incomputable and therefore cannot be fully replicated by computational algorithms. Even if everything about human cognition were theoretically computable, actually capturing the full depth and complexity of human thought processes would likely exceed computational feasibility. The interplay of perception, memory, emotions, and social contexts creates patterns so intricate that complete replication would require exponentially increasing computational resources. In other words, creating AI systems that genuinely replicate human cognition—systems behaving like humans under all circumstances—may be fundamentally intractable from a computational perspective (van Rooij et al., 2024). AI systems—constrained by their current algorithms, data, and embedded assumptions—face inherent challenges in replicating human cognition. This limitation parallels Gödel’s famous discovery in mathematics: Any formal system sophisticated enough for basic arithmetic will contain true statements that cannot be proven within that system (Fokas, 2023). Likewise, AI systems face intrinsic constraints in capturing all aspects of human thought within their computational frameworks.
To mitigate these issues, consider the following.
Model selection and settings
Different model architectures may capture different aspects of linguistic behavior (Dettki et al., 2025). Choose models based on their specific capabilities for the research question rather than assuming larger or more advanced models automatically approximate human cognition better (Zan et al., 2025; Zhou et al., 2024). Compare base models with fine-tuned versions to understand how optimization objectives affect outputs relative to human responses (Binz & Schulz, 2023b; Yax et al., 2024). Evaluate the impact of model parameters, such as temperature (C. Li & Qi, 2025).
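As an illustration, the sketch below (assuming the OpenAI Python SDK’s chat-completions interface; any comparable client and model name could be substituted) probes how the temperature setting affects response variability on a single survey item—one concrete way to evaluate the impact of model parameters before drawing substantive conclusions.

```python
# Minimal sketch: how temperature affects response variability on one item.
# Assumes the OpenAI Python SDK; "gpt-4o-mini" is only an illustrative model name.
from collections import Counter

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ITEM = (
    "On a scale from 1 (strongly disagree) to 7 (strongly agree): "
    "I see myself as someone who is talkative. Answer with a single number."
)

def sample_responses(model: str, temperature: float, n: int = 20) -> Counter:
    """Collect n independent completions of the same item and tally the answers."""
    answers = []
    for _ in range(n):
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": ITEM}],
            temperature=temperature,
            max_tokens=5,
        )
        answers.append(reply.choices[0].message.content.strip())
    return Counter(answers)

for temp in (0.0, 0.7, 1.5):
    print(f"temperature={temp}:", sample_responses("gpt-4o-mini", temperature=temp))
```

The same loop can be repeated across base and fine-tuned variants of a model family to see how optimization objectives shift the response distribution.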
Prompt design
Test model sensitivity to prompt variations to distinguish between robust patterns and artifacts of specific phrasings (Brucks & Toubia, 2025). This helps identify when models are exhibiting systematic “reasoning” versus merely responding to surface-level cues in ways humans would not (Ivanova, 2025).
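A minimal sketch of such a sensitivity check follows; ask_llm is a hypothetical placeholder for whichever model client a researcher uses, and the vignette paraphrases are purely illustrative.

```python
# Minimal sketch: check whether answers are stable across semantically
# equivalent paraphrases of the same question.
def ask_llm(prompt: str) -> str:
    """Hypothetical placeholder for the researcher's model client."""
    return "Yes"  # canned answer so the sketch runs end to end

variants = [
    "Anna put her keys in the drawer and left. Does she know the keys were later moved?",
    "Anna placed her keys in the drawer before leaving. Does she know that the keys were moved afterwards?",
    "Before leaving, Anna left her keys in the drawer. Is she aware that the keys were moved later on?",
]

answers = [ask_llm(v) for v in variants]
print("Answers:", answers)
print("Consistent across paraphrases:", len({a.lower() for a in answers}) == 1)
```

Divergent answers across paraphrases that human respondents would treat as equivalent point to surface-level sensitivity rather than a robust underlying pattern.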
Interpretations and applications
Validate LLM outputs against human data, particularly when studying cognitive processes, to establish appropriate boundaries for generalizing from LLM experiments to human psychology. Make explicit that LLM performance, however impressive, stems from statistical prediction rather than human-like understanding (Shiffrin & Mitchell, 2023).
Ethics
Acknowledge model limitations in research reporting, particularly the inherent distinction between token prediction and human cognition, to prevent misinterpretation and reduce the risk of anthropomorphizing LLMs in scientific literature (Ibrahim & Cheng, 2025).
The average-human fallacy
Mischaracterizing the nature of intelligence in LLMs may not matter as much if they behave or perform like the average human, enabling them to functionally replace human participants. But this assumption commits the average-human fallacy, which misrepresents what LLM outputs signify.
Consider the engineering purpose and approach of LLMs. They are explicitly developed to outperform humans, as measured by a broad range of benchmarks and standardized tests, an engineering feat bolstered by continually improved designs and algorithms, access to more data and compute, and freedom from biological limitations (see also Box 2). In actual tests, for example, GPT-4 outperforms humans in detecting and interpreting irony, recognizing indirect requests or hints in conversation (Strachan et al., 2024), and performing analogical reasoning tasks (Webb et al., 2023) and probabilistic reasoning tasks, such as the Linda/Bill problems and the bat-and-ball problem (Yax et al., 2024), but underperforms humans in tasks such as faux-pas tests (Strachan et al., 2024). This contrasts with the assumption of the replacement view that LLM responses mirror average human judgments from the training data (Dillion et al., 2023) or the majority’s mainstream opinions (Qu et al., 2024).
The generalizability challenge is multifaceted. First, LLMs are not representative samples of any defined human population, let alone an “average” human. Their training data (mostly online text) exhibit systematic biases—predominantly Western, educated, industrialized, rich, and democratic (WEIRD) populations, especially individuals who are hegemonic, young, and publicly expressive (Crockett & Messeri, 2023; Santurkar et al., 2023; Tao et al., 2024). This biased training creates a complex landscape for psychological simulation: Although LLMs might adequately capture behaviors that are largely universal across human cultures (certain basic cognitive or emotional phenomena, perhaps), they face limitations when simulating those known to vary across cultures and groups—from number representations to personality traits and moral reasoning. Further complicating this picture is the incomplete understanding of which psychological phenomena are universal versus culturally variable. This uncertainty means researchers must exercise particular caution when using WEIRD-biased LLMs—or any LLM for that matter—to simulate potentially culture-dependent phenomena.
In tandem with spatial biases, the temporal distribution of the data is more concentrated in recent history, with the model’s understanding of the past filtered through the lens of contemporary languages and norms (Ziems et al., 2024). This presentist and recency bias risks temporal flattening, wherein historical and contemporary voices are homogenized, obscuring the richness of historical diversity in human thought. This is problematic for probing thoughts and behaviors that have evolved over extended periods—such as the way people think about concepts such as gender, race, and class (Kozlowski & Evans, 2024). The models may lack the contextual richness and temporal granularity needed to understand how these concepts were discussed in different periods, potentially reinforcing contemporary biases when trying to understand people from earlier time periods.
Beyond training-data biases, the engineering goals of LLMs further compromise their representational accuracy. The goal of developing advanced LLMs—to provide accurate, helpful answers to users—contrasts with human cognition itself, which is replete with inaccuracies, biases, shortcuts, and idiosyncrasies. For instance, in language comprehension, people often settle for a partial and sometimes inaccurate understanding that is nevertheless sufficient for the task at hand—“good-enough representations” (Ferreira et al., 2002). In decision-making, when emotions are induced in bargaining games and repeated-cooperation games, GPT-4 tends to maintain consistent, rational decision-making, in contrast to humans (Mozikov et al., 2024).
Because of model limitations, opacity, and the common use of reinforcement learning from human feedback (RLHF), such models cannot be assumed to represent the “average” of their training data. Responses may reflect hallucination, the influence of RLHF, or other bias-reduction efforts; they could also simply regurgitate specific instances, examples, or strategies from their training data, particularly when the query is well represented in the data (Aher et al., 2023; Binz & Schulz, 2023b; Shiffrin & Mitchell, 2023). For example, when LLM responses are altered through RLHF, they can deviate from the original training data—and may even exacerbate their misalignments with nondominant views (Santurkar et al., 2023). In this process, the communicative intent of developers (and data labelers) shapes model outputs to prioritize goals such as accuracy and helpfulness rather than reflecting “raw” human-like responses from the training texts.
This raises concerns about whether the outputs genuinely represent human responses. If the psychological construct being measured is meant to be “average human response,” the LLM fails to operationalize this construct because of the representational biases described above. The models often become “too neutral, detached, and nonjudgmental,” lacking selfhood and initiative (Ye et al., 2024) and exhibiting homogeneous personality profiles—high in agreeableness and low in neuroticism (C. Li & Qi, 2025; Pellert et al., 2024). Although this fine-tuning process can reduce certain biases (T. Hu et al., 2025), enhance accuracy, and make the model more pleasant to interact with, it also weakens its ability to reflect the actual attitudes and thoughts present in human texts (Harding et al., 2024) and to produce diverse responses (Murthy et al., 2024). Such alterations may also introduce new preferences or biases from the feedback, including sycophancy (a tendency to generate outputs that are excessively user-pleasing or conform to perceived desirable responses), making it challenging to rely on RLHF-tuned LLMs as accurate indicators of human thought and judgment (Park et al., 2024). In survey responses, for example, LLMs do not share human-response biases, and there are more pronounced discrepancies in RLHF-tuned models (Tjuatja et al., 2024). Likewise, within the GPT-3 family, fine-tuned models exhibited a higher propensity for conjunction fallacy and intuitive reasoning relative to base models (Yax et al., 2024).
The practical implications of these misalignments are evident in actual response patterns. LLM responses have been found to mischaracterize marginalized groups, as evidenced by out-group imitation rather than in-group description (A. Wang et al., 2025), and misrepresent sampled groups, as demonstrated by an upward bias in mean ratings of the Big Five personality traits (Niszczota et al., 2025) and in other surveys and tests (Hagendorff et al., 2023; Sarstedt et al., 2024; Tjuatja et al., 2024). In addition to these shifts in average responses, LLMs also fail to capture the nuances and heterogeneity of human responses, producing flattened, oversimplified portrayals of various groups (A. Wang et al., 2025).
Although diversifying the training corpus to include more languages and cultural contexts helps to broaden representations, achieving global representations is ultimately a long-term challenge (Lin & Li, 2023). In the foreseeable future, the quantity and quality of available non-English training texts will remain impoverished compared with English as the lingua franca.
To mitigate the average-human fallacy, key considerations include explicitly defining represented populations, using specialized models, testing prompts and models for consistency and bias, validating responses against human data, and transparently reporting representational limitations.
Model selection and customization
Be explicit about which populations are represented in the training data and how this affects generalizability claims. Consider using specialized models fine-tuned on representative human data and, importantly, verify performance with actual samples from those populations (Gao et al., 2024; Suh et al., 2025).
Prompt design
Test models across multiple prompting styles to assess the consistency of simulated responses and determine sensitivity to superficial variations (C. Li & Qi, 2025; Momennejad et al., 2023; Tao et al., 2024). Compare outputs from different LLM versions and architectures to identify systematic biases in representation.
Interpretations and applications
Validate LLM responses against human samples, particularly for claims about specific populations or demographic groups, and report cases in which LLM responses diverge from human patterns. Avoid generalizing beyond domains in which empirical validation has been established (Abdurahman et al., 2025).
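The sketch below illustrates one simple form of such validation, comparing item-level LLM ratings with human sample means; the numbers are placeholders, and real analyses would also examine dispersion and subgroup patterns.

```python
# Minimal sketch: compare simulated ratings with a human sample.
# The arrays are placeholders for matched item-level means from both sources.
import numpy as np
from scipy import stats

human_means = np.array([3.1, 4.2, 2.8, 5.0, 3.6, 4.4, 2.9, 3.8])
llm_means   = np.array([3.9, 4.6, 3.5, 5.1, 4.2, 4.8, 3.7, 4.3])

r, p = stats.pearsonr(human_means, llm_means)
shift = llm_means.mean() - human_means.mean()

print(f"Item-level correlation: r = {r:.2f} (p = {p:.3f})")
print(f"Mean shift (LLM - human): {shift:+.2f}")
# A high correlation paired with a systematic positive shift would echo the
# upward bias in personality ratings reported by Niszczota et al. (2025).
```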
Ethics
Document potential biases and acknowledge limitations in representativeness when reporting results from LLM simulations (Abdurahman et al., 2025).
Alignment-as-explanation fallacy
High output alignment between LLM responses and human data has been suggested as evidence that LLMs can sometimes replace human participants (Dillion et al., 2023). Although such alignment suggests that LLMs might capture something mechanistic about human behavior, it is fallacious to assume mechanistic equivalence—that the model explains human cognition (Guest & Martin, 2023) or can replace human participants for theoretical understanding (see Box 4). Treating such alignment as sufficient for mechanistic equivalence constitutes the alignment-as-explanation fallacy. This fallacy arises when output similarity is conflated with representational and processing similarity, that is, when similar outputs are taken as evidence that LLMs engage the same psychological constructs or cognitive mechanisms as humans, overlooking the differences between the statistical pattern matching of current LLMs and the embodied, contextual cognitive processes of humans.
Functional Versus Mechanistic Equivalence
Indeed, a fundamental problem of applying human-centric tests and concepts to LLMs—such as theory of mind or emotional understanding—is the assumption that they engage with information in ways similar to humans, presupposing background psychological mechanisms that may be absent or irrelevant for LLMs (Box 1). These anthropomorphic assumptions undermine the construct validity (i.e., the degree to which a test measures the specific psychological construct it purports to measure) of psychological tests in LLMs (Millière & Buckner, 2024b). For example, response confidence in LLMs, as measured by token probability, may differ from human self-reports. Likewise, emotional intelligence—the ability to perceive, understand, manage, and use emotions in oneself and others—encompasses self-awareness, empathy, emotional regulation, and social skills, all rooted in subjective experiences that are absent in LLMs. Model “understanding” of emotions is purely expressive and syntactic, driven by learned associations between words and concepts rather than experiential, visceral insight or an internalized understanding of mental states. So when LLMs perform like humans on tests of emotional understanding—in either overall score or response pattern (X. Wang et al., 2023)—this alignment is at the surface level rather than reflecting genuine equivalence.
Apparent alignments between LLM and human responses may also be artifacts of specific task formulations rather than evidence of robust, human-like reasoning capabilities. One clue for a lack of true competence is a dissociation between accuracy and the model’s reasoning (i.e., its explanations of its responses; Leivada et al., 2023). Another critical test is prompt sensitivity, examining how performance is affected by superficial alterations to the prompts—changes to which humans typically show little to no sensitivity or respond to differently. Such prompt sensitivity has been documented in LLMs in survey responses—RLHF-tuned models can be highly sensitive to changes such as typos (Tjuatja et al., 2024)—and in tasks such as reasoning (Binz & Schulz, 2023b; Dasgupta et al., 2022; Yax et al., 2024), decision-making (Binz & Schulz, 2023a; Suri et al., 2024), theory of mind (Strachan et al., 2024), moral judgments (Oh & Demberg, 2025), and more (Kamoi et al., 2024; McCoy et al., 2024).
For example, LLMs but not humans improved their performance in reasoning tasks when the instruction included the phrase “let’s think step by step” (Yax et al., 2024), a form of chain-of-thought prompting. Conversely, although LLMs performed at ceiling in a false-belief test similarly to human participants, they struggled when small changes were made to the formulation of false-belief scenarios—suggesting syntactic pattern processing rather than robust reasoning (Strachan et al., 2024). These sensitivities to minor variations highlight how apparent alignment can mask differences in underlying processes, creating a threat to internal validity—the ability to draw firm conclusions about causal relationships—when the observed performance is misattributed to the intended manipulation rather than to superficial linguistic patterns.
But even when performance reflects underlying abilities, it is not clear whether models employ mechanisms similar to those of humans. Given that different systems can achieve the same outcome through different mechanisms—known as multiple realizability (Bowers et al., 2023; Guest & Martin, 2023), as in telling time with digital versus mechanical clocks—it is unwarranted to assume mechanistic or even functional equivalence (Box 4). Indeed, it is notoriously difficult to understand exactly what LLMs have learned. Beyond inherent differences between machine and biological intelligence in their architecture and algorithms, LLMs are trained on data sets much larger than what human learners experience. Differences in mechanisms may manifest as distinct response characteristics, including (a) context sensitivity, such as prompt sensitivity or performance variations across different vignettes (Yax et al., 2024); (b) response patterns, such as variability in open-ended responses (Y. Li et al., 2024), item-by-item performance variability (X. Wang et al., 2023), and correlation of accuracy with confidence (Yax et al., 2024); and (c) error types and consistency, such as errors arising from cognitive demands versus those from item wording or familiarity (Yax et al., 2024).
This raises a fundamental question: When LLMs perform at human levels in psychological tasks, what does that tell researchers about the capacity of these models? Performance can reflect true competence (some underlying abilities) or something more superficial, such as pattern memorization (Gao et al., 2024), reliance on other surface-level cues, or pure chance. Conversely, underperformance can reflect something other than incompetence, such as processing limitations or ineffective prompting (Firestone, 2020). Thus, to establish true competence, model outputs should be sensitive to changes in the task-relevant inputs but insensitive to irrelevant changes (Harding & Sharadin, 2024).
To mitigate the alignment-as-explanation fallacy, safeguards include testing alignment robustness against perturbations, examining reasoning processes, using causal interventions, performing cross-domain validation, and explicitly acknowledging the limits of inferring shared mechanisms.
Model testing and validation
Test whether the alignment between LLM and human responses is robust to perturbations in task structure, prompt wording, and contextual variations that should not affect performance (Oh & Demberg, 2025). If the performance varies significantly with superficial changes, this suggests a lack of true construct equivalence (Firestone, 2020).
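As a minimal illustration with placeholder numbers, the following sketch compares human–model alignment under original and superficially perturbed prompts; a sharp drop under perturbation would signal that the original alignment was an artifact of surface form.

```python
# Minimal sketch: does human-model alignment survive superficial perturbations?
# Placeholder per-item scores; real use would substitute scores from human
# participants and from the model under original and perturbed wordings.
import numpy as np

human         = np.array([0.82, 0.64, 0.71, 0.90, 0.55, 0.77])
llm_original  = np.array([0.80, 0.60, 0.74, 0.88, 0.58, 0.75])
llm_perturbed = np.array([0.95, 0.30, 0.88, 0.52, 0.91, 0.40])

def alignment(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.corrcoef(a, b)[0, 1])

print("Alignment with humans (original prompts): ", round(alignment(human, llm_original), 2))
print("Alignment with humans (perturbed prompts):", round(alignment(human, llm_perturbed), 2))
# A large drop under superficial perturbation suggests the original alignment
# reflected surface cues rather than shared underlying competence.
```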
Process tracing
Use methods such as chain-of-thought prompting to examine the reasoning paths that LLMs use to arrive at answers, comparing these with human-reasoning protocols (Bao et al., 2024). Significant differences in reasoning processes, even when outputs align, would suggest different underlying mechanisms.
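A minimal sketch of this kind of process tracing appears below; ask_llm is again a hypothetical placeholder for the researcher’s model client, and the canned reply merely keeps the example self-contained.

```python
# Minimal sketch: elicit a reasoning trace alongside a direct answer so the
# trace can be compared with human think-aloud protocols.
def ask_llm(prompt: str) -> str:
    """Hypothetical placeholder for the researcher's model client."""
    return "Step 1: ... Step 2: ... Final answer: 5 cents"  # canned reply

problem = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the "
    "ball. How much does the ball cost?"
)

direct = ask_llm(problem + " Answer with the amount only.")
traced = ask_llm(problem + " Let's think step by step, then state the final answer.")

print("Direct answer:  ", direct)
print("Reasoning trace:", traced)
# Do the intermediate steps invoke the same quantities and constraints that
# human protocols do, or does a correct final answer follow a qualitatively
# different reasoning path?
```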
Causal interventions
Implement interventions that test specific hypotheses about psychological mechanisms. For example, if humans and LLMs appear to use similar heuristics, test whether manipulations designed to affect those heuristics produce comparable effects (GX-Chen et al., 2025). Examine the causal effects of training data on model output through fine-tuning (T. Hu et al., 2025).
Cross-domain validation
Examine whether alignment in one domain generalizes to related domains that should engage similar cognitive processes. Limited transfer may suggest different underlying mechanisms despite surface alignment (Zan et al., 2025). The same applies to validation across different languages of the prompt, such as English and Chinese (Jin et al., 2024).
Explicit limitation acknowledgment
When reporting alignments between LLM and human responses, acknowledge the limitations of inferring shared mechanisms, noting their distinct architecture and learning history (Abdurahman et al., 2025).
Anthropomorphism fallacy
Beyond misalignments due to algorithms, purposes, and implementations, another issue with the replacement view is that it leads to anthropomorphism of LLMs (Crockett & Messeri, 2023; Shiffrin & Mitchell, 2023). Their fluent, human-seeming responses can trigger an enhanced ELIZA effect: the tendency to attribute human-like understanding to AI systems, creating a compelling language-user illusion (Box 1). Indeed, when researchers refer to “the minds of language models” or “the machine minds of LLMs” (Dillion et al., 2023), such terminology—whether used metaphorically or literally—can inadvertently encourage teleological bias, attributing purposes and goals to LLMs.
The anthropomorphism fallacy misinterprets LLM responses, treating statistical artifacts as expressions of an inner mental life. When researchers ascribe human-like mental states—such as beliefs, intentions, or consciousness—to LLMs based on their linguistic output, they risk invalidating measures of psychological constructs. Indeed, many such constructs, including attitudes and emotions, presuppose a mental architecture that current LLMs lack.
In many situations, using anthropomorphic language in conversation is natural—even useful. However, with AI such as LLMs—which currently lack documented markers of consciousness yet produce remarkably coherent conversations (Shardlow & Przybyla, 2024)—this semblance seduces users to interpret model behavior through the lens of folk psychology, attributing “beliefs” or “consciousness” to these systems (Colombatto & Fleming, 2024). Indeed, the chatbot service provider Character AI invites users to meet AIs that “feel alive.”
Although the philosophical debate about machine consciousness remains open (Butlin et al., 2023), ascribing emotions or intentions to current LLMs (saying they “believe” or “think”) risks creating impressions that outpace their demonstrated capabilities. As of mid-2025, no publicly available LLM exhibits clear markers of phenomenal awareness or intentional agency. Such attribution gaps can lead researchers and the public to either overestimate or underestimate these systems’ capabilities (Crockett & Messeri, 2023; Shanahan, 2024), potentially shaping AI policy in ways disconnected from technological reality (Lin, 2025a). Perhaps most concerning, this conflation may dilute the understanding of distinctly human qualities—feelings, thoughts, and virtues—thereby diminishing their meaning (Vallor, 2024).
The fallacy also leads researchers to draw invalid inferences from LLM outputs. When interpreting model responses as expressing human-like mental states rather than statistical probabilities, they may fail to account for the statistical artifacts inherent in token-prediction systems. Likewise, findings may be inappropriately generalized from anthropomorphized LLMs to human populations without recognizing the essential difference between statistical text generation and human psychological processes.
The replacement perspective thus risks anthropomorphizing algorithms and mischaracterizing their nature—a conceptual error that invites misunderstandings and misinterpretations. For example, one such misunderstanding is that “any given LLM can act as only a single participant” (Dillion et al., 2023). Yet unlike humans, who are influenced by a unique combination of personal experiences, emotions, and cognitive biases, LLMs are not limited to a single perspective but generate responses based on their vast, diverse data set. This means that depending on the prompt and context, the same LLM can produce a range of patchwork responses, each reflecting different viewpoints or types of reasoning (Santurkar et al., 2023). This variability is not indicative of a singular, consistent “mind” but rather of a multifaceted tool capable of simulating diverse perspectives. LLMs can role-play various characters or personas (Shanahan et al., 2023): a teenager, a senior citizen, a subject-matter expert, or a layperson. This chameleon-like ability highlights LLMs as tools for linguistic simulation, not as human participants.
Mitigating the anthropomorphism fallacy requires using precise language, maintaining conceptual clarity about LLMs’ lack of mental states, documenting technical settings, applying simulation-based interpretive frameworks, and implementing specific researcher training.
Language and framing
Use precise, nonanthropomorphic terminology when describing LLM outputs. Instead of saying an LLM “believes” or “feels,” opt for terms such as “produces,” “generates,” or “outputs” to accurately reflect their statistical nature (Ibrahim & Cheng, 2025; Shanahan, 2024).
Conceptual clarity
Explicitly acknowledge in research designs and reports that current LLMs do not possess mental states or consciousness (Shardlow & Przybyla, 2024). Define psychological constructs carefully, noting when they inherently depend on mental states that LLMs lack.
Documentation practices
Document the specific LLM, version, prompt design, and parameter settings used to generate responses (Lin, 2025a), emphasizing the technical rather than psychological aspects of the process.
Interpretive frameworks
Develop and apply interpretive frameworks that treat LLM outputs as simulations rather than expressions of beliefs or attitudes (Ibrahim & Cheng, 2025). This includes distinguishing between “simulated beliefs” and actual beliefs when reporting results.
Education and training
Provide training to research teams on the mechanisms of LLMs and the risks of anthropomorphic interpretations (Lin, 2025a). Foster a research culture that maintains conceptual precision when discussing AI capabilities.
Identity-essentialization fallacy
Under the replacement perspective, prompting often invokes identity labeling, such as instructing the LLM to act as or adopt the identity of “White man,” “Black woman,” “Chinese,” or “American”—as if such labels describe innate, static, homogeneous social groups, each entailing a specific set of behaviors (Chuang et al., 2024; A. Wang et al., 2025). This approach constitutes an identity-essentialization fallacy that caricatures how identity operates in human populations.
When simple demographic labels are used to prompt LLMs, there is often an implicit assumption that these can generate responses representative of real human populations. Such prompting approaches treat social categories as static and homogeneous, ignoring the vast diversity within any demographic group. When LLMs generate responses based on these simplified identity prompts, they may produce stereotyped or inaccurate outputs that fail to capture the nuanced realities of actual human populations (C. Li & Qi, 2025), limiting what one can learn about real-world contexts and populations (Lahoti et al., 2023; M. H. Lee et al., 2024). Indeed, demographic prompting can even reduce alignment with human judgments (Sun et al., 2025).
In colloquial exchanges, essentialist language about social categories—from “artists are eccentric” to “women are nurturing”—is convenient and also meaningful. But in empirical research, identity essentialization masks the fluidity and diversity inherent within any demographic, overlooking individual nuances and intersectionality while reinforcing stereotypes and biases prevalent within society, thus overestimating group differences (Namboodiripad et al., 2023; Prentice & Miller, 2006).
Identity is not a static, unitary construct that can be captured by a single demographic label—it is fluid, contextual, and intersectional. When researchers use essentialist prompting techniques, they misrepresent the psychological construct of identity itself, reducing rich, complex human experiences to one-dimensional categories. This reductive operationalization fails to capture how various aspects of identity interact, how identity salience shifts across contexts, and how individuals negotiate multiple, sometimes contradictory, identity facets.
This is not to deny the importance of identities or to advocate for identity-blindness. As pervasive societal structures that shape people’s thoughts, attitudes, and behaviors, social categories, such as race, gender, and class, are deeply embedded in people’s experiences—and often an ingrained part of their identity. But rather than reducing individuals to essentialist categories, a more appropriate approach is to consider how various identities—demographic, professional, or situational—interact by role-playing various personas through contextualized prompting. This involves crafting character profiles that encompass a broader array of characteristics—from contextual descriptions (“I am a young tech worker living in the United States”) to broader social categories (e.g., based on political leaning or personality type)—allowing for more nuanced explorations of perspectives and experiences.
For example, instead of prompting an LLM to act as a “Black woman,” which may reinforce stereotypes or oversimplify identity (Sun et al., 2025), one might construct a more holistic persona by adding a specific context, such as “a young entrepreneur from Atlanta who is passionate about sustainable fashion and community development,” or by incorporating intersectional identities, such as “a young Black female tech worker navigating the challenges of a male-dominated field.” These contextual descriptions incorporate identity but frame it within specific experiences, values, and contexts. Indeed, contextualized prompting has been shown to evoke distinct, diverse (A. Wang et al., 2025), and more aligned responses from LLMs (Bui et al., 2025).
Identity is multifaceted and context-dependent, with varying salience for different individuals. Simulating human participants therefore risks misrepresenting the salience of various aspects of identity—reflecting the prompter’s perspective or presumptions about which aspects of identity are important rather than capturing the intersectional reality experienced by the simulated persona. It is therefore crucial, whether using contextualized prompting or not, to examine potential biases and limitations in the prompt. Sidestepping genuine engagement with marginalized communities further risks artificial inclusion (Agnew et al., 2024).
To mitigate the identity-essentialization fallacy when simulating diverse perspectives, safeguards include using contextual prompting, incorporating intersectional approaches, performing diversity validation, ensuring transparency about limitations, and adopting collaborative methods.
Contextual prompting
Instead of using simple demographic labels, develop richer, context-specific prompts that incorporate multiple aspects of identity (including intersectionality), specific experiences, and environmental factors (Bui et al., 2025).
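One way to operationalize this is sketched below: composing a persona prompt from several contextual facets rather than a single demographic label. The field names and example values are illustrative only, not a validated coding scheme.

```python
# Minimal sketch: contextualized persona prompting instead of a bare label.
from dataclasses import dataclass

@dataclass
class Persona:
    context: str          # situational description
    occupation: str
    interests: str
    identity_facets: str  # intersecting identities, stated in context

    def to_prompt(self) -> str:
        return (
            f"You are {self.context}, working as {self.occupation}. "
            f"You care about {self.interests}. {self.identity_facets} "
            "Answer the following question from this perspective."
        )

# An essentialized label versus a contextualized profile.
flat_prompt = "You are a Black woman. Answer the following question."
rich_prompt = Persona(
    context="a young entrepreneur from Atlanta",
    occupation="the founder of a small sustainable-fashion label",
    interests="community development and mentoring local designers",
    identity_facets="You are a Black woman navigating a male-dominated industry.",
).to_prompt()

print(rich_prompt)
```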
Validation
Compare LLM-generated responses across multiple prompting strategies and validate against actual human responses from the target population to identify when simulations misrepresent or stereotype particular groups (Sun et al., 2025).
Transparency in limitations
Explicitly acknowledge the limitations of identity simulation in research reports, including the risk of reinforcing stereotypes or oversimplifying complex identities. Document the specific prompting approaches used and their potential biases (Sun et al., 2025).
Collaborative approach
When studying specific cultural or identity groups, involve members of those groups in designing prompts, validating outputs, and interpreting results (Zhao et al., 2024).
Substitution fallacy
LLMs can mimic certain aspects of human behavior and cognition, but using them as primary tools to directly reveal the human mind reflects a substitution fallacy.
A core issue arises from the temporal limitations of LLM training data. Because LLMs are trained on historical data sets with a specific cutoff date, they represent a snapshot of human knowledge, attitudes, and behaviors at that moment. Updating through retraining is infrequent and resource-intensive. This static nature restricts their capacity to capture ongoing societal changes, new social phenomena, or evolving attitudes and behaviors, such as rapidly changing views on technologies or social movements (Zhu et al., 2025). Without real-time adaptability, previous alignments do not guarantee current applicability. Furthermore, a model’s advertised knowledge cutoff often differs from its actual, or effective, knowledge cutoff. The functional knowledge of LLMs frequently corresponds to older text versions that predate the stated cutoff. This discrepancy stems from widespread temporal misalignments within large pretraining corpora—for instance, older documents lingering in recent web crawls—and from the incomplete removal of outdated or duplicated content during data processing (Cheng et al., 2024).
This fallacy persists even if one disregards challenges related to the static and historically bound nature of LLM training data—or issues of grounding, embodiment, and subjective experience. As the average-human fallacy illustrates, responses from LLMs cannot be assumed a priori to represent average responses of the targeted human group. Even when LLMs and humans show alignment, this correlation should not be confused with equivalence in cognitive processes or mechanisms (the alignment-as-explanation fallacy; Box 4). This leads to an epistemic dilemma: Generalizing findings from LLMs to humans requires corroboration with actual human data, undermining the basic premise of the substitution proposition.
In addition, substitution risks creating closed-loop information systems. When models trained on historical data are used as primary tools for generating new data, they perpetuate a self-referential loop that creates a distorted view of the present by amplifying the past (including its biases, errors, and oversights) rather than reflecting current human thought or behavior. This can lead to misleading inferences reflecting model artifacts rather than genuine psychological phenomena and behavioral patterns.
Even with up-to-date training data, excluding human participants leaves LLMs simulating humans in ways detached from rich, evolving realities. Such detachment can entrench outdated knowledge, weaken the diversity vital for human progress, and create epistemic echo chambers. Thus, LLMs should serve as supplementary rather than primary tools for understanding the human mind.
To mitigate the substitution fallacy, safeguards include using sequential validation with human participants, benchmarking against time-sensitive data, integrating mixed methods, ensuring temporal transparency, and implementing closed-loop detection techniques.
Sequential validation
Implement a sequential research design in which LLM explorations are followed by validation with human participants. Use LLMs for hypothesis generation or initial exploration but validate key findings with relevant human data (Gui & Toubia, 2023).
Benchmarking against time-sensitive data
Regularly benchmark LLM responses against recent human data to assess temporal drift in model outputs compared with current human attitudes and behaviors. This helps establish the temporal boundaries of generalizability for LLM-based findings (Cheng et al., 2024).
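The sketch below, using placeholder figures, illustrates how such drift can be tracked across successive waves of human data on a fast-moving topic.

```python
# Minimal sketch: track the gap between model-simulated and human-reported
# endorsement across survey waves. All numbers are placeholders.
import numpy as np

waves = ["2022", "2023", "2024", "2025"]
human_share = np.array([0.41, 0.48, 0.57, 0.63])  # share endorsing the attitude per wave
llm_share   = np.array([0.40, 0.42, 0.43, 0.43])  # model-simulated share per wave

for wave, h, m in zip(waves, human_share, llm_share):
    print(f"{wave}: human = {h:.2f}, model = {m:.2f}, gap = {h - m:+.2f}")
# A gap that widens across waves indicates that the model's training snapshot
# no longer tracks current attitudes.
```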
Temporal transparency
Explicitly document training data cutoff dates and potential temporal limitations in research reports, particularly in rapidly evolving domains (Cheng et al., 2024).
Concluding Remarks
Recent advances in human-level AI are renewing the classic debate on the role of computing artifacts in understanding the human mind and brain (Simon, 1983). In this article, I critically assessed the emerging proposition of substituting human participants with LLMs in behavioral and social sciences. By exposing six fallacies inherent in this replacement perspective, I underscore that despite their human-like language-production capabilities, current LLMs do not—and as presently conceived, cannot—substitute for human thought. Unlike the statistical text prediction that drives current LLMs, human intelligence emerges from embodied interaction with the world—grounded in sensory experiences, enriched by multimodal integration, and shaped by subjective consciousness. The predominantly linguistic nature of LLMs further constrains their ability to capture the breadth of human experience, including nonverbal cues, implicit attitudes, and real-world behaviors.
By identifying challenges to research validity and providing practical guidelines, the analysis supports the simulation perspective: LLMs serve as tools for simulating roles and modeling cognitive processes, complementing but not replacing humans. As outlined in Box 4, this perspective helps investigators distinguish between research contexts in which output-level simulation suffices (pragmatic applications such as rapid prototyping) and those requiring deeper mechanistic evidence (theoretical claims about cognitive processes). In practice, researchers should leverage LLMs primarily for hypothesis generation, theory development, and rapid prototyping—then validate with human participants. This sequential approach capitalizes on model strengths (comprehensive knowledge, efficient simulation) while acknowledging their limitations (lack of grounding, representational biases). Implementing the controls and considerations outlined for each fallacy can improve research quality and interpretability.
As emphasized in Box 3, understanding model limitations requires distinguishing between technical and conceptual constraints. Although technical limitations may be addressed through engineering advances, conceptual limitations represent fundamental challenges to using LLMs as psychological models. As these technologies evolve, the field must continuously reevaluate their capabilities and limitations, develop appropriate benchmarks, and establish guidelines for responsible integration. This perspective invites researchers to reconsider the role of AI in behavioral and cognitive science—as a mirror through which they can better understand the similarities and differences between human intelligence and machine intelligence. The limitations of apparently human-like models in replicating human thought may bring a deeper appreciation of the complexity and wonder of the mind.
Acknowledgements
I thank Gati Aher, Michael Bernstein, Danica Dillion, Nancy Fulda, Nicholas Laskowski, Paweł Niszczota, Philipp Schoenegger, Lindia Tjuatja, Lukasz Walasek, and David Wingate for comments on early drafts.
Transparency
Action Editor: Kongmeng Liew
Editor: David A. Sbarra