Abstract
Can artificial-intelligence (AI) systems, such as large language models (LLMs), replace human participants in behavioral and psychological research? Here, I critically evaluate the replacement perspective and identify six interpretive fallacies that undermine its validity. These fallacies are (a) equating token prediction with human intelligence, (b) treating LLMs as the average human, (c) interpreting alignment as explanation, (d) anthropomorphizing AI systems, (e) essentializing identities, and (f) substituting model data for human evidence. Each fallacy represents a potential misunderstanding about what LLMs are and what they can tell researchers about human cognition. In the analysis, I distinguish levels of similarity between LLMs and humans, particularly functional equivalence (outputs) versus mechanistic equivalence (processes), while highlighting both technical limitations (addressable through engineering) and conceptual limitations (arising from fundamental differences between statistical and biological intelligence). For each fallacy, specific safeguards are provided to guide responsible research practices. Ultimately, the analysis supports conceptualizing LLMs as pragmatic simulation tools—useful for role-play, rapid hypothesis testing, and computational modeling (provided their outputs are validated against human data)—rather than as replacements for human participants. This framework enables researchers to leverage language models productively while respecting the fundamental differences between machine intelligence and human thought.
Collecting data from human participants is resource-intensive and time-consuming. The rise of large language models (LLMs) capable of generating human-like text has sparked substantial interest among psychologists and social scientists (Abdurahman et al., 2025; Demszky et al., 2023; Lin, 2023). In tasks ranging from perception and cognition to language and moral reasoning, LLMs sometimes produce responses that mirror those of average human participants (Aher et al., 2023; Binz & Schulz, 2023b; Dasgupta et al., 2022; Dillion et al., 2023; J. Hu et al., 2024; Marjieh et al., 2024). As these models are increasingly incorporated into experimental designs and research methodologies, promising instant and tireless responses, a question arises: Could they ever stand in for humans in behavioral and psychological research?
Across various fields, proposals have emerged suggesting that LLMs could, to varying degrees, “substitute human participants” in empirical research as “silicon samples” (Sarstedt et al., 2024), “supplant human participants for data collection” in social science as “simulated participants” (Grossmann et al., 2023), serve as “a single participant” and “a proxy for human participants in a certain set of circumstances” in psychological science (Dillion et al., 2023), and “replace humans” in human-centered design to provide “simulated user responses” (Schmidt et al., 2024). Echoing this perspective, commercial entities now peddle synthetic, human-like artificial-intelligence (AI) subjects for user and market research—advertising user research “without the users.”
Although replacement of human participants is not yet mainstream practice, these emerging viewpoints reflect a broader narrative in AI development—one that envisions systems that could “outperform humans at most economically valuable work” (OpenAI, 2024) or “on a vast array of tasks” (Bengio, 2024). Although AI researchers overwhelmingly treat LLMs as task-automation tools and agents, the discourse around such powerful AI capabilities creates a context in which proposals to use LLMs as research participants naturally emerge. In light of these broader technological ambitions and the specific research proposals they inspire, it is both timely and necessary to critically evaluate the replacement perspective and its methodological implications, particularly to clarify the role of LLMs in behavioral- and social-science research more broadly, whether or not the replacement proposal gains traction.
To this end, in this article, I examine the fundamental assumptions that underlie proposals to use LLMs as human replacements. By clarifying what LLMs actually are—computational tools that simulate linguistic behavior through statistical pattern matching and reinforcement learning (see Box 1)—one can better understand their appropriate role in psychology and behavioral sciences.
Language Models and the Language-User Illusion
Six key fallacies are identified, each arising from the misinterpretation of LLMs as human replacements: the token-prediction-as-human-intelligence fallacy (conflating statistical text prediction with genuine human intelligence), the average-human fallacy (assuming model outputs represent typical human responses), the alignment-as-explanation fallacy (interpreting similarity between model and human outputs as evidence of shared cognitive mechanisms), the anthropomorphism fallacy (attributing human-like mental states to AI), the identity-essentialization fallacy (treating social categories as fixed, homogeneous traits rather than fluid, contextual identities), and the substitution fallacy (presuming LLM-generated data can directly replace human evidence without validation).
These fallacies undermine research validity—the degree to which researchers’ evidence and reasoning support the conclusions they draw (Cook & Campbell, 1979; Cronbach & Meehl, 1955; Kerschbaumer et al., 2025; Stanley & Campbell, 1963). Validity concerns in LLM research stem from both methodological choices (how researchers prompt and test models) and theoretical interpretations (what researchers claim model outputs represent). Each fallacy undermines validity in distinct ways: Some compromise the ability to make causal inferences, others limit generalizability across populations, and still others reflect misalignment between what researchers intend to measure and what they actually capture. The methodological and theoretical safeguards provided for each fallacy address these threats by strengthening both research design and interpretive frameworks.
Taken together, these analyses support the function of LLMs as neural language simulators. Rather than stand-ins for people, they reveal latent psychological and cognitive dynamics embedded in the text—dynamics rooted in the communication of human thoughts, attitudes, and behaviors. This approach offers both conceptual clarity and practical guidance for leveraging LLMs appropriately in behavioral and cognitive research.
Interpretive Fallacies in LLM-Based Research
To understand why LLMs function as simulators rather than replacements, one must first examine the conceptual errors that underlie the replacement perspective. Six fallacies emerge in discussions of LLMs as human substitutes, each revealing a different way researchers might misinterpret what LLMs are and what their outputs signify (Table 1).
Six Fallacies in Conceptualizing LLMs as Human Replacements
Note: LLM = large language model; WEIRD = Western, educated, industrialized, rich, and democratic; RLHF = reinforcement learning from human feedback.
Token-prediction-as-human-intelligence fallacy
LLMs accomplish tasks once reserved for humans through a layered process: They are initially pretrained to predict sequential tokens and then posttrained via fine-tuning and reinforcement learning; finally, their reasoning behavior is shaped by guidance from well-crafted prompts and inference-time techniques (Box 1). This process equips LLMs with intricate linguistic knowledge (e.g., syntactic rules, semantic relations)—often termed “formal” linguistic competence (Mahowald et al., 2024). Crucially, these models do more than memorize patterns; they demonstrate the ability to navigate complex, context-dependent tasks in both language processing and generation (Mahowald et al., 2024; Millière & Buckner, 2024a). Through learning and contextual adaptation, LLMs can infer underlying task structures and generate appropriate responses across varied domains—a form of instrumental knowledge that supports problem-solving in diverse settings (Yildirim & Paul, 2024).
Yet at their core, LLMs have no minds (Searle, 1980) but are autoregressive statistical models that manipulate language—a task different from human cognition—producing a kind of ungrounded intelligence. They manipulate language by leveraging statistical patterns learned during training; although these patterns encode aspects of derivative meaning present in text-corpus relationships (e.g., “Paris”–“France,” “fire”–“hot”), the models themselves are yet to access this meaning through direct, grounded experience (e.g., visual qualia, proprioception). This mirrors Church encoding in lambda calculus, in which data and operations are expressed purely through abstract functions, defined by how they transform inputs to outputs rather than by inherent, grounded meaning (Church, 1936). This is in contrast to human intelligence, which integrates general and specialized capabilities through evolutionary adaptations and developmental learning (Box 2)—and is “grounded in one’s embodied physical and emotional experiences” and “deeply reliant on one’s social and cultural environments” (Mitchell, 2024).
Biological Versus Machine Optimization
A fallacy emerges when one mistakes the capacity of LLMs to predict language patterns and solve cognitive problems for genuine human-like understanding or intelligence. To unpack this fallacy, it is helpful to distinguish between two levels of understanding: functional understanding (reliably producing appropriate outputs, as when a dog successfully catches a ball) and reflective understanding (consciously grasping underlying principles, as when humans can articulate why the ball follows a particular trajectory). Although LLMs may demonstrate considerable functional understanding by generating contextually appropriate responses, they currently lack the reflective understanding that emerges from embodied experience and consciousness. Consequently, mistaking their output for genuine cognition would misrepresent what is being measured; the model’s linguistic performance, based on statistical token prediction, risks being conflated with human cognitive processes that integrate both functional and reflective understanding. The “intelligence” of LLMs is thus operationally different from the human cognition it might appear to simulate.
This fallacy manifests in two major ways. First, although LLMs process vast amounts of text reflecting real-world information and human experience, their tokens operate without direct, firsthand sensorimotor contact with the world. Consequently, they lack embodied referents and lived experience derived from such interaction (Leivada et al., 2023). This critical absence of grounding is demonstrated by how text-based LLMs, although capable of representing nonsensorimotor features of human concepts (e.g., emotional valence), systematically fail to capture sensorimotor features, particularly those related to motor actions. Indeed, the alignment between LLM and human conceptual representations diminishes markedly from nonsensorimotor to sensorimotor domains (Xu et al., 2025).
Second, this absence of first-person, sensorimotor grounding limits the models’ connection to the physical and social realities they aim to simulate. This hinders the development of fully functional linguistic competence—the use of language to achieve goals in the world (Mahowald et al., 2024). It also restricts their capacity to acquire the kind of worldly knowledge needed for robust world models—internal representations that are structure-preserving and behaviorally efficacious in real-world interactions (Webb et al., 2023; Yildirim & Paul, 2024). Such models, including cognitive maps, body schemas, and spatial schemas, support embodied reasoning and action, which remain beyond the scope of current models (Wicke & Wachowiak, 2024).
A critical consequence of this fallacy is that when researchers attribute changes in model outputs to experimental manipulations without recognizing the underlying statistical nature, they inevitably risk misinterpretation. Consider a simple prompt: “Mao Zedong was . . . .” Unlike interacting with another mind, when people engage with a chatbot, they are not seeking its opinion—despite the compelling illusion thereof—but rather making a computational request: Given the statistical distributions in the language model, what sequence is most likely to follow these words? Models trained for neutrality will likely provide correspondingly factual responses. Indeed, the character of such responses is shaped by the model’s training, geopolitical alignment, and prompt language. For example, DeepSeek-R1—pretrained mostly on English and Chinese text and subsequently aligned with government regulatory directives (Box 1)—showed substantially higher proportions of Chinese-state propaganda and anti-U.S. bias compared with ChatGPT o3-mini-high, which was pretrained primarily on English text. This bias was most pronounced with Simplified Chinese queries, diminished with Traditional Chinese inputs, and was nearly absent when queried in English (Huang et al., 2025).
Fundamentally, then, the model, unlike a human, has no communicative intent; no opinion of, attitude toward, or belief about Mao; and no intrinsic capacity to tell the truth—it just models a distribution of token sequences based on the training texts (Shanahan, 2024). Although LLMs manifest proficient language use that goes beyond simply retrieving prerecorded text strings, they are not true language users in the philosophical sense. They do not possess intrinsic meaning, communicative intentions, or other internal states essential to human language users (Block, 1981).
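To make the "computational request" framing concrete, the following minimal sketch (assuming the open-source transformers library and the small GPT-2 model rather than any of the proprietary chat systems discussed here) shows what a prompt actually elicits from a language model: a probability distribution over candidate next tokens, not an opinion or belief.

```python
# Minimal sketch: a prompt yields a distribution over next tokens.
# Assumes the Hugging Face `transformers` package and the open GPT-2 model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)

# Probability distribution over the vocabulary for the token that would follow the prompt
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)

for prob, token_id in zip(top.values, top.indices):
    print(repr(tokenizer.decode([int(token_id)])), round(prob.item(), 3))
```

Whatever continuation is then sampled from this distribution reflects regularities in the training texts, not a stance taken by a communicating agent.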
Indeed, model outputs can be highly sensitive to seemingly trivial prompt variations—variations that human language users would tolerate (Ivanova, 2025). Small changes in prompt language, wording, order, or context can produce drastically different outputs not because of the manipulated variable of interest but because of the model’s inherent sensitivity patterns—a phenomenon that contrasts with how humans process language.
In addition, findings from LLM studies cannot be reliably generalized to human cognition. The gap between token-prediction mechanisms and human cognitive processes fundamentally limits what one can infer about real human populations from model behavior, even when performance patterns may appear superficially similar. As philosopher John Searle (1980) illustrated in his “Chinese Room” thought experiment, a system might process symbols according to rules without understanding their meaning—just as a person following instructions to manipulate Chinese characters could produce appropriate responses without comprehending Chinese. When models generate coherent language without the intentionality, consciousness, and direct grounding in real-world experience that characterize human intelligence (Searle, 1980), they are capturing linguistic form (the observable structure of language) rather than true meaning (the communicative intent behind language; Bender & Koller, 2020). Therefore, to confuse cognitive algorithms for cognition—or models of the mind for the mind itself—represents a category mistake both ontologically (misidentifying their nature) and epistemologically (misunderstanding their knowledge).
No doubt LLMs (and AI in general) will continue to advance. But, as detailed in Box 3, even these improvements face fundamental conceptual limitations regarding embodied experience and grounding. Knowledge, as constructivists such as Jean Piaget have argued, is built from sensory experiences and perceptions, which are then layered with symbols and categories over time. Without such experiences, LLMs are like an artificial version of the proverbial Mary, who studies everything about color in a black-and-white room her entire life (Jackson, 1982). One can argue that when Mary leaves her room and sees color for the first time, she learns something new—what it is like to see something pink (Jackson, 1986). If so, a great deal more is at stake for language models, which experience neither color nor anything else. Indeed, unlike Mary, who can use her other experiences as a scaffold for understanding color—much like Helen Keller using associations from senses such as touch and smell to construct a color scheme despite being blind and deaf (“Pink makes me think of a baby’s cheek, or a gentle southern breeze” [Keller, 1929])—LLMs have none of this kind of sensory scaffolding.
Technical Versus Conceptual Limitations in Model Simulation of Human Behavior and Cognition
Essential aspects of human cognition—emotions, intuition, and other subjective experiences—so far remain incomputable and therefore cannot be fully replicated by computational algorithms. Even if everything about human cognition were theoretically computable, actually capturing the full depth and complexity of human thought processes would likely exceed computational feasibility. The interplay of perception, memory, emotions, and social contexts creates patterns so intricate that complete replication would require exponentially increasing computational resources. In other words, creating AI systems that genuinely replicate human cognition—systems behaving like humans under all circumstances—may be fundamentally intractable from a computational perspective (van Rooij et al., 2024). AI systems—constrained by their current algorithms, data, and embedded assumptions—face inherent challenges in replicating human cognition. This limitation parallels Gödel’s famous discovery in mathematics: Any formal system sophisticated enough for basic arithmetic will contain true statements that cannot be proven within that system (Fokas, 2023). Likewise, AI systems face intrinsic constraints in capturing all aspects of human thought within their computational frameworks.
To mitigate these issues, consider the following.
Model selection and settings
Different model architectures may capture different aspects of linguistic behavior (Dettki et al., 2025). Choose models based on their specific capabilities for the research question rather than assuming larger or more advanced models automatically approximate human cognition better (Zan et al., 2025; Zhou et al., 2024). Compare base models with fine-tuned versions to understand how optimization objectives affect outputs relative to human responses (Binz & Schulz, 2023b; Yax et al., 2024). Evaluate the impact of model parameters, such as temperature (C. Li & Qi, 2025).
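As an illustration, the sketch below (assuming the OpenAI Python SDK’s chat-completions interface; any comparable client and model name could be substituted) probes how the temperature setting affects response variability on a single survey item—one concrete way to evaluate the impact of model parameters before drawing substantive conclusions.

```python
# Minimal sketch: how temperature affects response variability on one item.
# Assumes the OpenAI Python SDK; "gpt-4o-mini" is only an illustrative model name.
from collections import Counter

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ITEM = (
    "On a scale from 1 (strongly disagree) to 7 (strongly agree): "
    "I see myself as someone who is talkative. Answer with a single number."
)

def sample_responses(model: str, temperature: float, n: int = 20) -> Counter:
    """Collect n independent completions of the same item and tally the answers."""
    answers = []
    for _ in range(n):
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": ITEM}],
            temperature=temperature,
            max_tokens=5,
        )
        answers.append(reply.choices[0].message.content.strip())
    return Counter(answers)

for temp in (0.0, 0.7, 1.5):
    print(f"temperature={temp}:", sample_responses("gpt-4o-mini", temperature=temp))
```

The same loop can be repeated across base and fine-tuned variants of a model family to see how optimization objectives shift the response distribution.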
Prompt design
Test model sensitivity to prompt variations to distinguish between robust patterns and artifacts of specific phrasings (Brucks & Toubia, 2025). This helps identify when models are exhibiting systematic “reasoning” versus merely responding to surface-level cues in ways humans would not (Ivanova, 2025).
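A minimal sketch of such a sensitivity check follows; ask_llm is a hypothetical placeholder for whichever model client a researcher uses, and the vignette paraphrases are purely illustrative.

```python
# Minimal sketch: check whether answers are stable across semantically
# equivalent paraphrases of the same question.
def ask_llm(prompt: str) -> str:
    """Hypothetical placeholder for the researcher's model client."""
    return "Yes"  # canned answer so the sketch runs end to end

variants = [
    "Anna put her keys in the drawer and left. Does she know the keys were later moved?",
    "Anna placed her keys in the drawer before leaving. Does she know that the keys were moved afterwards?",
    "Before leaving, Anna left her keys in the drawer. Is she aware that the keys were moved later on?",
]

answers = [ask_llm(v) for v in variants]
print("Answers:", answers)
print("Consistent across paraphrases:", len({a.lower() for a in answers}) == 1)
```

Divergent answers across paraphrases that human respondents would treat as equivalent point to surface-level sensitivity rather than a robust underlying pattern.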
Interpretations and applications
Validate LLM outputs against human data, particularly when studying cognitive processes, to establish appropriate boundaries for generalizing from LLM experiments to human psychology. Make explicit that LLM performance, however impressive, stems from statistical prediction rather than human-like understanding (Shiffrin & Mitchell, 2023).
Ethics
Acknowledge model limitations in research reporting, particularly the inherent distinction between token prediction and human cognition, to prevent misinterpretation and reduce the risk of anthropomorphizing LLMs in scientific literature (Ibrahim & Cheng, 2025).
The average-human fallacy
Mischaracterizing the nature of intelligence in LLMs may not matter as much if they behave or perform like the average human, enabling them to functionally replace human participants. But this assumption commits the average-human fallacy, which misrepresents what LLM outputs signify.
Consider the engineering purpose and approach of LLMs. They are explicitly developed to outperform humans, as measured by a broad range of benchmarks and standardized tests, an engineering feat bolstered by continually improved designs and algorithms, access to more data and compute, and freedom from biological limitations (see also Box 2). In actual tests, for example, GPT-4 outperforms humans in detecting and interpreting irony, recognizing indirect requests or hints in conversation (Strachan et al., 2024), and performing analogical reasoning tasks (Webb et al., 2023) and probabilistic reasoning tasks, such as the Linda/Bill problems and the bat-and-ball problem (Yax et al., 2024), but underperforms humans in tasks such as faux-pas tests (Strachan et al., 2024). This contrasts with the assumption of the replacement view that LLM responses mirror average human judgments from the training data (Dillion et al., 2023) or the majority’s mainstream opinions (Qu et al., 2024).
The generalizability challenge is multifaceted. First, LLMs are not representative samples of any defined human population, let alone an “average” human. Their training data (mostly online text) exhibit systematic biases—predominantly Western, educated, industrialized, rich, and democratic (WEIRD) populations, especially individuals who are hegemonic, young, and publicly expressive (Crockett & Messeri, 2023; Santurkar et al., 2023; Tao et al., 2024). This biased training creates a complex landscape for psychological simulation: Although LLMs might adequately capture behaviors that are largely universal across human cultures (certain basic cognitive or emotional phenomena, perhaps), they face limitations when simulating those known to vary across cultures and groups—from number representations to personality traits and moral reasoning. Further complicating this picture is the incomplete understanding of which psychological phenomena are universal versus culturally variable. This uncertainty means researchers must exercise particular caution when using WEIRD-biased LLMs—or any LLM for that matter—to simulate potentially culture-dependent phenomena.
In tandem with spatial biases, the temporal distribution of the data is more concentrated in recent history, with the model’s understanding of the past filtered through the lens of contemporary languages and norms (Ziems et al., 2024). This presentist and recency bias risks temporal flattening, wherein historical and contemporary voices are homogenized, obscuring the richness of historical diversity in human thought. This is problematic for probing thoughts and behaviors that have evolved over extended periods—such as the way people think about concepts such as gender, race, and class (Kozlowski & Evans, 2024). The models may lack the contextual richness and temporal granularity needed to understand how these concepts were discussed in different periods, potentially reinforcing contemporary biases when trying to understand people from earlier time periods.
Beyond training-data biases, the engineering goals of LLMs further compromise their representational accuracy. The goal of developing advanced LLMs—to provide accurate, helpful answers to users—contrasts with human cognition itself, which is replete with inaccuracies, biases, shortcuts, and idiosyncrasies. For instance, in language comprehension, people often settle for a partial and sometimes inaccurate understanding that is nevertheless sufficient for the task at hand—“good-enough representations” (Ferreira et al., 2002). In decision-making, when emotions are induced in bargaining games and repeated-cooperation games, GPT-4 tends to maintain consistent, rational decision-making, in contrast to humans (Mozikov et al., 2024).
Because of model limitations, opacity, and the common use of reinforcement learning from human feedback (RLHF), such models cannot be assumed to represent the “average” of their training data. Responses may reflect hallucination, the influence of RLHF, or other bias-reduction efforts; they could also simply regurgitate specific instances, examples, or strategies from their training data, particularly when the query is well represented in the data (Aher et al., 2023; Binz & Schulz, 2023b; Shiffrin & Mitchell, 2023). For example, when LLM responses are altered through RLHF, they can deviate from the original training data—and may even exacerbate their misalignments with nondominant views (Santurkar et al., 2023). In this process, the communicative intent of developers (and data labelers) shapes model outputs to prioritize goals such as accuracy and helpfulness rather than reflecting “raw” human-like responses from the training texts.
This raises concerns about whether the outputs genuinely represent human responses. If the psychological construct being measured is meant to be “average human response,” the LLM fails to operationalize this construct because of the representational biases described above. The models often become “too neutral, detached, and nonjudgmental,” lacking selfhood and initiative (Ye et al., 2024) and exhibiting homogeneous personality profiles—high in agreeableness and low in neuroticism (C. Li & Qi, 2025; Pellert et al., 2024). Although this fine-tuning process can reduce certain biases (T. Hu et al., 2025), enhance accuracy, and make the model more pleasant to interact with, it also weakens its ability to reflect the actual attitudes and thoughts present in human texts (Harding et al., 2024) and to produce diverse responses (Murthy et al., 2024). Such alterations may also introduce new preferences or biases from the feedback, including sycophancy (a tendency to generate outputs that are excessively user-pleasing or conform to perceived desirable responses), making it challenging to rely on RLHF-tuned LLMs as accurate indicators of human thought and judgment (Park et al., 2024). In survey responses, for example, LLMs do not share human-response biases, and there are more pronounced discrepancies in RLHF-tuned models (Tjuatja et al., 2024). Likewise, within the GPT-3 family, fine-tuned models exhibited a higher propensity for conjunction fallacy and intuitive reasoning relative to base models (Yax et al., 2024).
The practical implications of these misalignments are evident in actual response patterns. LLM responses have been found to mischaracterize marginalized groups, as evidenced by out-group imitation rather than in-group description (A. Wang et al., 2025), and misrepresent sampled groups, as demonstrated by an upward bias in mean ratings of the Big Five personality traits (Niszczota et al., 2025) and in other surveys and tests (Hagendorff et al., 2023; Sarstedt et al., 2024; Tjuatja et al., 2024). In addition to these shifts in average responses, LLMs also fail to capture the nuances and heterogeneity of human responses, producing flattened, oversimplified portrayals of various groups (A. Wang et al., 2025).
Although diversifying the training corpus to include more languages and cultural contexts helps to broaden representations, achieving global representations is ultimately a long-term challenge (Lin & Li, 2023). In the foreseeable future, the quantity and quality of available non-English training texts will remain impoverished compared with English as the lingua franca.
To mitigate the average-human fallacy, key considerations include explicitly defining represented populations, using specialized models, testing prompts and models for consistency and bias, validating responses against human data, and transparently reporting representational limitations.
Model selection and customization
Be explicit about which populations are represented in the training data and how this affects generalizability claims. Consider using specialized models fine-tuned on representative human data and, importantly, verify performance with actual samples from those populations (Gao et al., 2024; Suh et al., 2025).
Prompt design
Test models across multiple prompting styles to assess the consistency of simulated responses and determine sensitivity to superficial variations (C. Li & Qi, 2025; Momennejad et al., 2023; Tao et al., 2024). Compare outputs from different LLM versions and architectures to identify systematic biases in representation.
Interpretations and applications
Validate LLM responses against human samples, particularly for claims about specific populations or demographic groups, and report cases in which LLM responses diverge from human patterns. Avoid generalizing beyond domains in which empirical validation has been established (Abdurahman et al., 2025).
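The sketch below illustrates one simple form of such validation, comparing item-level LLM ratings with human sample means; the numbers are placeholders, and real analyses would also examine dispersion and subgroup patterns.

```python
# Minimal sketch: compare simulated ratings with a human sample.
# The arrays are placeholders for matched item-level means from both sources.
import numpy as np
from scipy import stats

human_means = np.array([3.1, 4.2, 2.8, 5.0, 3.6, 4.4, 2.9, 3.8])
llm_means   = np.array([3.9, 4.6, 3.5, 5.1, 4.2, 4.8, 3.7, 4.3])

r, p = stats.pearsonr(human_means, llm_means)
shift = llm_means.mean() - human_means.mean()

print(f"Item-level correlation: r = {r:.2f} (p = {p:.3f})")
print(f"Mean shift (LLM - human): {shift:+.2f}")
# A high correlation paired with a systematic positive shift would echo the
# upward bias in personality ratings reported by Niszczota et al. (2025).
```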
Ethics
Document potential biases and acknowledge limitations in representativeness when reporting results from LLM simulations (Abdurahman et al., 2025).
Alignment-as-explanation fallacy
High output alignment between LLM responses and human data has been suggested as evidence that LLMs can sometimes replace human participants (Dillion et al., 2023). Although such alignment suggests that LLMs might capture something mechanistic about human behavior, it is fallacious to assume mechanistic equivalence—that the model explains human cognition (Guest & Martin, 2023) or can replace human participants for theoretical understanding (see Box 4). Treating such alignment as sufficient for mechanistic equivalence constitutes the alignment-as-explanation fallacy. This fallacy arises when output similarity is conflated with representational and processing similarity, that is, when similar outputs are taken as evidence that LLMs engage the same psychological constructs or cognitive mechanisms as humans, overlooking the differences between the statistical pattern matching of current LLMs and the embodied, contextual cognitive processes of humans.
Functional Versus Mechanistic Equivalence
Indeed, a fundamental problem of applying human-centric tests and concepts to LLMs—such as theory of mind or emotional understanding—is the assumption that they engage with information in ways similar to humans, presupposing background psychological mechanisms that may be absent or irrelevant for LLMs (Box 1). These anthropomorphic assumptions undermine the construct validity (i.e., the degree to which a test measures the specific psychological construct it purports to measure) of psychological tests in LLMs (Millière & Buckner, 2024b). For example, response confidence in LLMs, as measured by token probability, may differ from human self-reports. Likewise, emotional intelligence—the ability to perceive, understand, manage, and use emotions in oneself and others—encompasses self-awareness, empathy, emotional regulation, and social skills, all rooted in subjective experiences that are absent in LLMs. Model “understanding” of emotions is purely expressive and syntactic, driven by learned associations between words and concepts rather than experiential, visceral insight or an internalized understanding of mental states. So when LLMs perform like humans on tests of emotional understanding—in either overall score or response pattern (X. Wang et al., 2023)—this alignment is at the surface level rather than reflecting genuine equivalence.
Apparent alignments between LLM and human responses may also be artifacts of specific task formulations rather than evidence of robust, human-like reasoning capabilities. One clue for a lack of true competence is a dissociation between accuracy and the model’s reasoning (i.e., its explanations of its responses; Leivada et al., 2023). Another critical test is prompt sensitivity, examining how performance is affected by superficial alterations to the prompts—changes to which humans typically show little to no sensitivity or respond to differently. Such prompt sensitivity has been documented in LLMs in survey responses—RLHF-tuned models can be highly sensitive to changes such as typos (Tjuatja et al., 2024)—and in tasks such as reasoning (Binz & Schulz, 2023b; Dasgupta et al., 2022; Yax et al., 2024), decision-making (Binz & Schulz, 2023a; Suri et al., 2024), theory of mind (Strachan et al., 2024), moral judgments (Oh & Demberg, 2025), and more (Kamoi et al., 2024; McCoy et al., 2024).
For example, LLMs but not humans improved their performance in reasoning tasks when the instruction included the phrase “let’s think step by step” (Yax et al., 2024), a form of chain-of-thought prompting. Conversely, although LLMs performed at ceiling in a false-belief test similarly to human participants, they struggled when small changes were made to the formulation of false-belief scenarios—suggesting syntactic pattern processing rather than robust reasoning (Strachan et al., 2024). These sensitivities to minor variations highlight how apparent alignment can mask differences in underlying processes, creating a threat to internal validity—the ability to draw firm conclusions about causal relationships—when the observed performance is misattributed to the intended manipulation rather than to superficial linguistic patterns.
But even when performance reflects underlying abilities, it is not clear whether models employ mechanisms similar to those of humans. Given that different systems can achieve the same outcome through different mechanisms—known as multiple realizability (Bowers et al., 2023; Guest & Martin, 2023), as in telling time with digital versus mechanical clocks—it is unwarranted to assume mechanistic or even functional equivalence (Box 4). Indeed, it is notoriously difficult to understand exactly what LLMs have learned. Beyond inherent differences between machine and biological intelligence in their architecture and algorithms, LLMs are trained on data sets much larger than what human learners experience. Differences in mechanisms may manifest as distinct response characteristics, including (a) context sensitivity, such as prompt sensitivity or performance variations across different vignettes (Yax et al., 2024); (b) response patterns, such as variability in open-ended responses (Y. Li et al., 2024), item-by-item performance variability (X. Wang et al., 2023), and correlation of accuracy with confidence (Yax et al., 2024); and (c) error types and consistency, such as errors arising from cognitive demands versus those from item wording or familiarity (Yax et al., 2024).
This raises a fundamental question: When LLMs perform at human levels in psychological tasks, what does that tell researchers about the capacity of these models? Performance can reflect true competence (some underlying abilities) or something more superficial, such as pattern memorization (Gao et al., 2024), reliance on other surface-level cues, or pure chance. Conversely, underperformance can reflect something other than incompetence, such as processing limitations or ineffective prompting (Firestone, 2020). Thus, to establish true competence, model outputs should be sensitive to changes in the task-relevant inputs but insensitive to irrelevant changes (Harding & Sharadin, 2024).
To mitigate the alignment-as-explanation fallacy, safeguards include testing alignment robustness against perturbations, examining reasoning processes, using causal interventions, performing cross-domain validation, and explicitly acknowledging the limits of inferring shared mechanisms.
Model testing and validation
Test whether the alignment between LLM and human responses is robust to perturbations in task structure, prompt wording, and contextual variations that should not affect performance (Oh & Demberg, 2025). If the performance varies significantly with superficial changes, this suggests a lack of true construct equivalence (Firestone, 2020).
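As a minimal illustration with placeholder numbers, the following sketch compares human–model alignment under original and superficially perturbed prompts; a sharp drop under perturbation would signal that the original alignment was an artifact of surface form.

```python
# Minimal sketch: does human-model alignment survive superficial perturbations?
# Placeholder per-item scores; real use would substitute scores from human
# participants and from the model under original and perturbed wordings.
import numpy as np

human         = np.array([0.82, 0.64, 0.71, 0.90, 0.55, 0.77])
llm_original  = np.array([0.80, 0.60, 0.74, 0.88, 0.58, 0.75])
llm_perturbed = np.array([0.95, 0.30, 0.88, 0.52, 0.91, 0.40])

def alignment(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.corrcoef(a, b)[0, 1])

print("Alignment with humans (original prompts): ", round(alignment(human, llm_original), 2))
print("Alignment with humans (perturbed prompts):", round(alignment(human, llm_perturbed), 2))
# A large drop under superficial perturbation suggests the original alignment
# reflected surface cues rather than shared underlying competence.
```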
Process tracing
Use methods such as chain-of-thought prompting to examine the reasoning paths that LLMs use to arrive at answers, comparing these with human-reasoning protocols (Bao et al., 2024). Significant differences in reasoning processes, even when outputs align, would suggest different underlying mechanisms.
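A minimal sketch of this kind of process tracing appears below; ask_llm is again a hypothetical placeholder for the researcher’s model client, and the canned reply merely keeps the example self-contained.

```python
# Minimal sketch: elicit a reasoning trace alongside a direct answer so the
# trace can be compared with human think-aloud protocols.
def ask_llm(prompt: str) -> str:
    """Hypothetical placeholder for the researcher's model client."""
    return "Step 1: ... Step 2: ... Final answer: 5 cents"  # canned reply

problem = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the "
    "ball. How much does the ball cost?"
)

direct = ask_llm(problem + " Answer with the amount only.")
traced = ask_llm(problem + " Let's think step by step, then state the final answer.")

print("Direct answer:  ", direct)
print("Reasoning trace:", traced)
# Do the intermediate steps invoke the same quantities and constraints that
# human protocols do, or does a correct final answer follow a qualitatively
# different reasoning path?
```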
Causal interventions
Implement interventions that test specific hypotheses about psychological mechanisms. For example, if humans and LLMs appear to use similar heuristics, test whether manipulations designed to affect those heuristics produce comparable effects (GX-Chen et al., 2025). Examine the causal effects of training data on model output through fine-tuning (T. Hu et al., 2025).
Cross-domain validation
Examine whether alignment in one domain generalizes to related domains that should engage similar cognitive processes. Limited transfer may suggest different underlying mechanisms despite surface alignment (Zan et al., 2025). The same applies to validation across different languages of the prompt, such as English and Chinese (Jin et al., 2024).
Explicit limitation acknowledgment
When reporting alignments between LLM and human responses, acknowledge the limitations of inferring shared mechanisms, noting their distinct architecture and learning history (Abdurahman et al., 2025).
Anthropomorphism fallacy
Beyond misalignments due to algorithms, purposes, and implementations, another issue with the replacement view is that it leads to anthropomorphism of LLMs (Crockett & Messeri, 2023; Shiffrin & Mitchell, 2023). Their fluent, human-seeming responses can trigger an enhanced ELIZA effect: the tendency to attribute human-like understanding to AI systems, creating a compelling language-user illusion (Box 1). Indeed, when researchers refer to “the minds of language models” or “the machine minds of LLMs” (Dillion et al., 2023), such terminology—whether used metaphorically or literally—can inadvertently encourage teleological bias, attributing purposes and goals to LLMs.
The anthropomorphism fallacy misinterprets LLM responses, treating statistical artifacts as expressions of an inner mental life. When researchers ascribe human-like mental states—such as beliefs, intentions, or consciousness—to LLMs based on their linguistic output, they risk invalidating measures of psychological constructs. Indeed, many such constructs, including attitudes and emotions, presuppose a mental architecture that current LLMs lack.
In many situations, using anthropomorphic language in conversation is natural—even useful. However, with AI such as LLMs—which currently lack documented markers of consciousness yet produce remarkably coherent conversations (Shardlow & Przybyla, 2024)—this semblance seduces users to interpret model behavior through the lens of folk psychology, attributing “beliefs” or “consciousness” to these systems (Colombatto & Fleming, 2024). Indeed, the chatbot service provider Character AI invites users to meet AIs that “feel alive.”
Although the philosophical debate about machine consciousness remains open (Butlin et al., 2023), ascribing emotions or intentions to current LLMs (saying they “believe” or “think”) risks creating impressions that outpace their demonstrated capabilities. As of mid-2025, no publicly available LLM exhibits clear markers of phenomenal awareness or intentional agency. Such attribution gaps can lead researchers and the public to either overestimate or underestimate these systems’ capabilities (Crockett & Messeri, 2023; Shanahan, 2024), potentially shaping AI policy in ways disconnected from technological reality (Lin, 2025a). Perhaps most concerning, this conflation may dilute the understanding of distinctly human qualities—feelings, thoughts, and virtues—thereby diminishing their meaning (Vallor, 2024).
The fallacy also leads researchers to draw invalid inferences from LLM outputs. When interpreting model responses as expressing human-like mental states rather than statistical probabilities, they may fail to account for the statistical artifacts inherent in token-prediction systems. Likewise, findings may be inappropriately generalized from anthropomorphized LLMs to human populations without recognizing the essential difference between statistical text generation and human psychological processes.
The replacement perspective thus risks anthropomorphizing algorithms and mischaracterizing their nature—a conceptual error that invites misunderstandings and misinterpretations. For example, one such misunderstanding is that “any given LLM can act as only a single participant” (Dillion et al., 2023). Yet unlike humans, who are influenced by a unique combination of personal experiences, emotions, and cognitive biases, LLMs are not limited to a single perspective but generate responses based on their vast, diverse data set. This means that depending on the prompt and context, the same LLM can produce a range of patchwork responses, each reflecting different viewpoints or types of reasoning (Santurkar et al., 2023). This variability is not indicative of a singular, consistent “mind” but rather of a multifaceted tool capable of simulating diverse perspectives. LLMs can role-play various characters or personas (Shanahan et al., 2023): a teenager, a senior citizen, a subject-matter expert, or a layperson. This chameleon-like ability highlights LLMs as tools for linguistic simulation, not as human participants.
Mitigating the anthropomorphism fallacy requires using precise language, maintaining conceptual clarity about LLMs’ lack of mental states, documenting technical settings, applying simulation-based interpretive frameworks, and implementing specific researcher training.
Language and framing
Use precise, nonanthropomorphic terminology when describing LLM outputs. Instead of saying an LLM “believes” or “feels,” opt for terms such as “produces,” “generates,” or “outputs” to accurately reflect their statistical nature (Ibrahim & Cheng, 2025; Shanahan, 2024).
Conceptual clarity
Explicitly acknowledge in research designs and reports that current LLMs do not possess mental states or consciousness (Shardlow & Przybyla, 2024). Define psychological constructs carefully, noting when they inherently depend on mental states that LLMs lack.
Documentation practices
Document the specific LLM, version, prompt design, and parameter settings used to generate responses (Lin, 2025a), emphasizing the technical rather than psychological aspects of the process.
Interpretive frameworks
Develop and apply interpretive frameworks that treat LLM outputs as simulations rather than expressions of beliefs or attitudes (Ibrahim & Cheng, 2025). This includes distinguishing between “simulated beliefs” and actual beliefs when reporting results.
Education and training
Provide training to research teams on the mechanisms of LLMs and the risks of anthropomorphic interpretations (Lin, 2025a). Foster a research culture that maintains conceptual precision when discussing AI capabilities.
Identity-essentialization fallacy
Under the replacement perspective, prompting often invokes identity labeling, such as instructing the LLM to act as or adopt the identity of “White man,” “Black woman,” “Chinese,” or “American”—as if such labels describe innate, static, homogeneous social groups, each entailing a specific set of behaviors (Chuang et al., 2024; A. Wang et al., 2025). This approach constitutes an identity-essentialization fallacy that caricatures how identity operates in human populations.
When simple demographic labels are used to prompt LLMs, there is often an implicit assumption that these can generate responses representative of real human populations. Such prompting approaches treat social categories as static and homogeneous, ignoring the vast diversity within any demographic group. When LLMs generate responses based on these simplified identity prompts, they may produce stereotyped or inaccurate outputs that fail to capture the nuanced realities of actual human populations (C. Li & Qi, 2025), limiting what one can learn about real-world contexts and populations (Lahoti et al., 2023; M. H. Lee et al., 2024). Indeed, demographic prompting can even reduce alignment with human judgments (Sun et al., 2025).
In colloquial exchanges, essentialist language about social categories—from “artists are eccentric” to “women are nurturing”—is convenient and also meaningful. But in empirical research, identity essentialization masks the fluidity and diversity inherent within any demographic, overlooking individual nuances and intersectionality while reinforcing stereotypes and biases prevalent within society, thus overestimating group differences (Namboodiripad et al., 2023; Prentice & Miller, 2006).
Identity is not a static, unitary construct that can be captured by a single demographic label—it is fluid, contextual, and intersectional. When researchers use essentialist prompting techniques, they misrepresent the psychological construct of identity itself, reducing rich, complex human experiences to one-dimensional categories. This reductive operationalization fails to capture how various aspects of identity interact, how identity salience shifts across contexts, and how individuals negotiate multiple, sometimes contradictory, identity facets.
This is not to deny the importance of identities or to advocate for identity-blindness. As pervasive societal structures that shape people’s thoughts, attitudes, and behaviors, social categories, such as race, gender, and class, are deeply embedded in people’s experiences—and often an ingrained part of their identity. But rather than reducing individuals to essentialist categories, a more appropriate approach is to consider how various identities—demographic, professional, or situational—interact by role-playing various personas through contextualized prompting. This involves crafting character profiles that encompass a broader array of characteristics—from contextual descriptions (“I am a young tech worker living in the United States”) to broader social categories (e.g., based on political leaning or personality type)—allowing for more nuanced explorations of perspectives and experiences.
For example, instead of prompting an LLM to act as a “Black woman,” which may reinforce stereotypes or oversimplify identity (Sun et al., 2025), one might construct a more holistic persona by adding a specific context, such as “a young entrepreneur from Atlanta who is passionate about sustainable fashion and community development,” or by incorporating intersectional identities, such as “a young Black female tech worker navigating the challenges of a male-dominated field.” These contextual descriptions incorporate identity but frame it within specific experiences, values, and contexts. Indeed, contextualized prompting has been shown to evoke distinct, diverse (A. Wang et al., 2025), and more aligned responses from LLMs (Bui et al., 2025).
Identity is multifaceted and context-dependent, with varying salience for different individuals. Simulating human participants therefore risks misrepresenting the salience of various aspects of identity—reflecting the prompter’s perspective or presumptions about which aspects of identity are important rather than capturing the intersectional reality experienced by the simulated persona. It is therefore crucial, whether using contextualized prompting or not, to examine potential biases and limitations in the prompt. Sidestepping genuine engagement with marginalized communities further risks artificial inclusion (Agnew et al., 2024).
To mitigate the identity-essentialization fallacy when simulating diverse perspectives, safeguards include using contextual prompting, incorporating intersectional approaches, performing diversity validation, ensuring transparency about limitations, and adopting collaborative methods.
Contextual prompting
Instead of using simple demographic labels, develop richer, context-specific prompts that incorporate multiple aspects of identity (including intersectionality), specific experiences, and environmental factors (Bui et al., 2025).
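One way to operationalize this is sketched below: composing a persona prompt from several contextual facets rather than a single demographic label. The field names and example values are illustrative only, not a validated coding scheme.

```python
# Minimal sketch: contextualized persona prompting instead of a bare label.
from dataclasses import dataclass

@dataclass
class Persona:
    context: str          # situational description
    occupation: str
    interests: str
    identity_facets: str  # intersecting identities, stated in context

    def to_prompt(self) -> str:
        return (
            f"You are {self.context}, working as {self.occupation}. "
            f"You care about {self.interests}. {self.identity_facets} "
            "Answer the following question from this perspective."
        )

# An essentialized label versus a contextualized profile.
flat_prompt = "You are a Black woman. Answer the following question."
rich_prompt = Persona(
    context="a young entrepreneur from Atlanta",
    occupation="the founder of a small sustainable-fashion label",
    interests="community development and mentoring local designers",
    identity_facets="You are a Black woman navigating a male-dominated industry.",
).to_prompt()

print(rich_prompt)
```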
Validation
Compare LLM-generated responses across multiple prompting strategies and validate against actual human responses from the target population to identify when simulations misrepresent or stereotype particular groups (Sun et al., 2025).
Transparency in limitations
Explicitly acknowledge the limitations of identity simulation in research reports, including the risk of reinforcing stereotypes or oversimplifying complex identities. Document the specific prompting approaches used and their potential biases (Sun et al., 2025).
Collaborative approach
When studying specific cultural or identity groups, involve members of those groups in designing prompts, validating outputs, and interpreting results (Zhao et al., 2024).
Substitution fallacy
LLMs can mimic certain aspects of human behavior and cognition, but using them as primary tools to directly reveal the human mind reflects a substitution fallacy.
A core issue arises from the temporal limitations of LLM training data. Because LLMs are trained on historical data sets with a specific cutoff date, they represent a snapshot of human knowledge, attitudes, and behaviors at that moment. Updating through retraining is infrequent and resource-intensive. This static nature restricts their capacity to capture ongoing societal changes, new social phenomena, or evolving attitudes and behaviors, such as rapidly changing views on technologies or social movements (Zhu et al., 2025). Without real-time adaptability, previous alignments do not guarantee current applicability. Furthermore, a model’s advertised knowledge cutoff often differs from its actual, or effective, knowledge cutoff. The functional knowledge of LLMs frequently corresponds to older text versions that predate the stated cutoff. This discrepancy stems from widespread temporal misalignments within large pretraining corpora—for instance, older documents lingering in recent web crawls—and from the incomplete removal of outdated or duplicated content during data processing (Cheng et al., 2024).
This fallacy persists even if one disregards challenges related to the static and historically bound nature of LLM training data—or issues of grounding, embodiment, and subjective experience. As the average-human fallacy illustrates, responses from LLMs cannot be assumed a priori to represent average responses of the targeted human group. Even when LLMs and humans show alignment, this correlation should not be confused with equivalence in cognitive processes or mechanisms (the alignment-as-explanation fallacy; Box 4). This leads to an epistemic dilemma: Generalizing findings from LLMs to humans requires corroboration with actual human data, undermining the basic premise of the substitution proposition.
In addition, substitution risks creating closed-loop information systems. When models trained on historical data are used as primary tools for generating new data, they perpetuate a self-referential loop that creates a distorted view of the present by amplifying the past (including its biases, errors, and oversights) rather than reflecting current human thought or behavior. This can lead to misleading inferences reflecting model artifacts rather than genuine psychological phenomena and behavioral patterns.
Even with up-to-date training data, excluding human participants leaves LLMs simulating humans in ways detached from rich, evolving realities. Such detachment can entrench outdated knowledge, weaken the diversity vital for human progress, and create epistemic echo chambers. Thus, LLMs should serve as supplementary rather than primary tools for understanding the human mind.
To mitigate the substitution fallacy, safeguards include using sequential validation with human participants, benchmarking against time-sensitive data, integrating mixed methods, ensuring temporal transparency, and implementing closed-loop detection techniques.
Sequential validation
Implement a sequential research design in which LLM explorations are followed by validation with human participants. Use LLMs for hypothesis generation or initial exploration but validate key findings with relevant human data (Gui & Toubia, 2023).
Benchmarking against time-sensitive data
Regularly benchmark LLM responses against recent human data to assess temporal drift in model outputs compared with current human attitudes and behaviors. This helps establish the temporal boundaries of generalizability for LLM-based findings (Cheng et al., 2024).
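The sketch below, using placeholder figures, illustrates how such drift can be tracked across successive waves of human data on a fast-moving topic.

```python
# Minimal sketch: track the gap between model-simulated and human-reported
# endorsement across survey waves. All numbers are placeholders.
import numpy as np

waves = ["2022", "2023", "2024", "2025"]
human_share = np.array([0.41, 0.48, 0.57, 0.63])  # share endorsing the attitude per wave
llm_share   = np.array([0.40, 0.42, 0.43, 0.43])  # model-simulated share per wave

for wave, h, m in zip(waves, human_share, llm_share):
    print(f"{wave}: human = {h:.2f}, model = {m:.2f}, gap = {h - m:+.2f}")
# A gap that widens across waves indicates that the model's training snapshot
# no longer tracks current attitudes.
```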
Temporal transparency
Explicitly document training data cutoff dates and potential temporal limitations in research reports, particularly in rapidly evolving domains (Cheng et al., 2024).
Concluding Remarks
Recent advances in human-level AI are renewing the classic debate on the role of computing artifacts in understanding the human mind and brain (Simon, 1983). In this article, I critically assessed the emerging proposition of substituting human participants with LLMs in behavioral and social sciences. By exposing six fallacies inherent in this replacement perspective, I underscore that despite their human-like language-production capabilities, current LLMs do not—and as presently conceived, cannot—substitute for human thought. Unlike the statistical text prediction that drives current LLMs, human intelligence emerges from embodied interaction with the world—grounded in sensory experiences, enriched by multimodal integration, and shaped by subjective consciousness. The predominantly linguistic nature of LLMs further constrains their ability to capture the breadth of human experience, including nonverbal cues, implicit attitudes, and real-world behaviors.
By identifying challenges to research validity and providing practical guidelines, the analysis supports the simulation perspective: LLMs serve as tools for simulating roles and modeling cognitive processes, complementing but not replacing humans. As outlined in Box 4, this perspective helps investigators distinguish between research contexts in which output-level simulation suffices (pragmatic applications such as rapid prototyping) and those requiring deeper mechanistic evidence (theoretical claims about cognitive processes). In practice, researchers should leverage LLMs primarily for hypothesis generation, theory development, and rapid prototyping—then validate with human participants. This sequential approach capitalizes on model strengths (comprehensive knowledge, efficient simulation) while acknowledging their limitations (lack of grounding, representational biases). Implementing the controls and considerations outlined for each fallacy can improve research quality and interpretability.
As emphasized in Box 3, understanding model limitations requires distinguishing between technical and conceptual constraints. Although technical limitations may be addressed through engineering advances, conceptual limitations represent fundamental challenges to using LLMs as psychological models. As these technologies evolve, the field must continuously reevaluate their capabilities and limitations, develop appropriate benchmarks, and establish guidelines for responsible integration. This perspective invites researchers to reconsider the role of AI in behavioral and cognitive science—as a mirror through which they can better understand the similarities and differences between human intelligence and machine intelligence. The limitations of apparently human-like models in replicating human thought may bring a deeper appreciation of the complexity and wonder of the mind.
Acknowledgements
I thank Gati Aher, Michael Bernstein, Danica Dillion, Nancy Fulda, Nicholas Laskowski, Paweł Niszczota, Philipp Schoenegger, Lindia Tjuatja, Lukasz Walasek, and David Wingate for comments on early drafts.
Transparency
Action Editor: Kongmeng Liew
Editor: David A. Sbarra