Abstract
Large language models (LLMs) offer emerging opportunities for psychological and behavioral research, but methodological guidance is lacking. In this article, I develop a framework for using LLMs as psychological simulators across two primary applications: simulating roles and personas to explore diverse contexts, and serving as computational models to investigate cognitive processes. For simulation, the framework includes (a) an implementation-confound checklist distinguishing essential from context-dependent methodological checks, (b) methods for developing psychologically grounded personas that move beyond demographic categories, and (c) a three-tier validation framework (direct, indirect, and generative) tailored to data availability. A diagnostic decision framework guides researchers through establishing performance validity, identifying implementation artifacts, and interpreting LLM-human discrepancies. For cognitive modeling, I synthesize (a) emerging approaches for probing internal representations, (b) methodological advances in causal interventions, and (c) strategies for relating model behavior to human cognition. The framework addresses overarching challenges, including prompt sensitivity, temporal limitations from training-data cutoffs, and ethical considerations that extend beyond traditional human-subjects review. Open-weight models are recommended as the default for reproducibility. Together, this framework integrates emerging empirical evidence about LLM performance—including systematic biases, cultural limitations, and prompt brittleness—to help researchers navigate these challenges and leverage the unique capabilities of LLMs in psychological research.
Large language models (LLMs) have rapidly emerged as versatile tools in psychological research. But beyond their utility for writing assistance (Lin, 2025c), programming (Guo, 2023), and text analysis (Feuerriegel et al., 2025), how can LLMs contribute to the understanding of psychological phenomena and behavior? Although these systems offer unprecedented opportunities for psychological research (Demszky et al., 2023; Ke et al., 2025; Sartori & Orrù, 2023), their rapid adoption has outpaced methodological development, creating risks of invalid inferences and irreproducible findings. The field lacks both conceptual clarity about their distinct applications and methodological guidance for their implementation. To address these gaps, I frame LLMs as psychological simulators, providing a methodological guide for their two primary applications: simulating human roles and personas, and serving as models of cognitive processes.
The use of computational systems to simulate human behavior and cognition has deep roots in both psychology and artificial intelligence (AI). Computational modeling of human thought traces back to mid-20th-century efforts, such as Newell et al.’s (1958) general problem solver—one of the first implementations of the information-processing paradigm—and to early agent-based frameworks, such as Schelling’s (1971) segregation models. The cognitive-modeling tradition evolved from symbol-manipulation architectures through parallel-distributed-processing models (Rumelhart et al., 1986) to today’s large-scale deep networks. LLMs both continue this trajectory—operationalizing psychological constructs in code and data—and depart from it. Earlier neural networks faced criticism for opacity, but LLMs introduce qualitatively different interpretability challenges: Their vast parameter counts, training on uncontrolled internet corpora, and capabilities that emerge unpredictably with scale collectively create a kind of opacity that is fundamentally distinct from the handcrafted, theoretically motivated architectures of earlier cognitive models.
These novel challenges have prompted theoretical examination of fundamental questions: Can LLMs replace human participants (Lin, 2025b)? What are the implications of integrating AI with psychological science (van Rooij & Guest, 2025)? How do their limitations constrain the understanding of cognition (Cuskley et al., 2024; Shah & Varma, 2025), and does their use threaten or enhance the generalizability of psychological science (Crockett & Messeri, 2025; Lin, in press)? Yet even as researchers examine these fundamental questions, empirical applications proliferate rapidly—for example, using LLMs to simulate cross-cultural personality differences (Niszczota et al., 2025), probe theory of mind capabilities (Strachan et al., 2024), generate psycholinguistic norms (Trott, 2024a), and forecast human behavior (Schoenegger et al., 2024). Commercial services now offer AI participants for market research, and academic proposals suggest LLMs could substitute human participants (Grossmann et al., 2023; Sarstedt et al., 2024).
This disconnect between theoretical caution and empirical enthusiasm risks producing invalid inferences and irreproducible findings. Without rigorous methodological standards, researchers may mistake statistical artifacts for genuine psychological phenomena—a validity crisis driven by neglecting psychometric and causal-inference principles (Lin, 2025a). This has led to warnings against “GPTology”—the uncritical application of LLMs that overlooks the complexities of human psychology and risks producing low-quality research (Abdurahman et al., 2024). Building on emerging work that has begun to establish best practices (Abdurahman et al., 2025; Hussain et al., 2024; Lin, 2025d; Lu et al., 2024), in this article, I aim to provide systematic guidance to bridge theoretical potentials and applications.
The methodological gap reflects a fundamental shift in simulation architecture. Unlike traditional agent-based models (Bonabeau, 2002; Fagiolo et al., 2007) or cognitive models (Laird et al., 1987; Ritter et al., 2019) that embody explicit behavioral rules and theoretical commitments, LLMs learn behavioral patterns implicitly from vast corpora. They produce remarkably human-like outputs through mechanisms that—unlike their handcrafted predecessors—remain partially opaque (e.g., Lin, 2023). This shift from theory-driven to data-driven simulation demands new validation strategies, ethical frameworks, and interpretive approaches (Argyle et al., 2025).
Recent empirical work has begun mapping both the promise and perils of LLM-based simulation. On one hand, models can capture certain aspects of human psychology remarkably well—from replicating cultural differences in personality traits (Niszczota et al., 2025) and human-like error patterns in cognitive tasks (Sartori & Orrù, 2023) to predicting sensory judgments across multiple modalities (Marjieh et al., 2024). On the other hand, they exhibit systematic biases and oversensitivity to prompt variations, reflecting fundamental differences from human cognition that researchers must carefully navigate (Binz & Schulz, 2023b; Tjuatja et al., 2024). Models often fail to capture the diversity found in real human responses (P. S. Park et al., 2024; A. Wang et al., 2025), showing more extreme, less nuanced preference distributions in moral domains compared with human participants (Zaim bin Ahmad & Takemoto, 2025).
Below, I provide practical guidelines organized around two primary research applications. First, I examine how LLMs can simulate roles and personas to explore diverse perspectives and behaviors—extending the agent-based-modeling tradition with systems that generate linguistically rich, contextually sensitive responses. Second, I synthesize methodological approaches for using LLMs as cognitive models—building on the neural-network tradition to probe how these systems process information and whether their mechanisms illuminate human cognition. For each application, concrete methodological recommendations are provided, grounded in emerging empirical evidence.
This framework acknowledges the temporal, cultural, and representational constraints inherent in current LLMs (Ziems et al., 2024). They are trained on historical data with specific cutoff dates and predominantly reflect WEIRD (Western, educated, industrialized, rich, democratic) perspectives. Many also undergo posttraining modifications that further alter their psychological profiles, particularly around socially sensitive topics such as race and gender (Cui et al., 2025). These limitations do not negate their research value but rather define the contexts within which they can be productively employed. By making these constraints explicit and providing strategies to work within them, the framework enables researchers to harness LLM capabilities while avoiding common pitfalls.
The article proceeds as follows. I first examine role and persona simulation, providing guidelines for prompt design, response validation, and appropriate use cases—from simulating rare populations to prototyping survey instruments. I then analyze cognitive-modeling applications, reviewing methodological approaches for probing internal representations, synthesizing advances in causal interventions, and examining strategies for relating findings to human cognition. Finally, I address ethical considerations that extend beyond traditional human-subjects protections. Throughout this framework, I emphasize that LLM-based methods should supplement rather than substitute for traditional approaches, offering unique advantages while requiring careful validation against human data. A glossary of specialized AI terms follows this introduction (see Box 1).
Box 1. Glossary
Using Language Models to Simulate Roles and Personas
Given their extensive training data, LLMs show particular promise in their capacity to adopt different personas and simulate diverse perspectives. To effectively leverage this capability, it is essential to consider both theoretical foundations and practical strategies.
Conceptual foundations and model capabilities
Understanding LLMs as language simulators that can role-play various personas requires first recognizing what these systems can and cannot do (Shanahan et al., 2023). Although LLMs process text without genuine cognition or consciousness, the text they process embodies rich psychological and social information accumulated from vast training corpora. This characteristic uniquely positions them as tools for exploring how language both encodes and expresses psychological phenomena and behavior.
Responding to persona-based prompts, LLMs draw on statistical patterns linking linguistic expressions to social roles and psychological states—reflecting human dynamics embedded in language. Because language is the primary medium for expressing beliefs and intentions, LLMs can fluidly shift perspectives to generate diverse viewpoints often inaccessible through traditional recruitment (Tseng et al., 2024). This capacity arguably surpasses human role-playing—constrained by limited perspective-taking and idiosyncratic biases (Pronin et al., 2001)—by leveraging aggregate statistical patterns elusive to individuals.
Recent empirical work has begun to establish the conditions under which LLM simulations can meaningfully capture human psychological patterns. Niszczota et al. (2025) provided an instructive demonstration, using GPT-3.5 and GPT-4 to simulate cross-cultural personality differences. Their experiment compared Big Five personality traits between simulated U.S. and South Korean personas, prompting models to “play the role of an adult from [the United States/South Korea].” Although GPT-3.5 failed to produce meaningful cultural patterns, GPT-4 successfully replicated established cross-cultural differences in personality traits. In other words, different model versions can produce dramatically different results, necessitating careful model selection and comparison (see Table 1, model provenance and model architecture).
Table 1. Two-Tier Implementation Checklist
Note: LLM = large language model; API = application programming interface.
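To make the model-comparison point concrete, the sketch below conditions two model versions on contrasting cultural personas and collects a response distribution for each cell. It assumes the Hugging Face transformers library; the GPT-2 variants are stand-ins for whichever versions a study actually compares, and the personality item is illustrative.

```python
# A minimal sketch of persona-conditioned simulation across model versions.
# Assumptions: transformers is installed; gpt2/gpt2-large stand in for the
# model versions under comparison; the item wording is illustrative.
from transformers import pipeline

MODELS = ["gpt2", "gpt2-large"]  # stand-ins for the versions being compared
PERSONAS = ["an adult from the United States", "an adult from South Korea"]
ITEM = ("I see myself as someone who is talkative. "
        "Answer 1-5 (1 = disagree strongly, 5 = agree strongly).")

for model_name in MODELS:
    generate = pipeline("text-generation", model=model_name)
    for persona in PERSONAS:
        prompt = f"Play the role of {persona}. {ITEM}\nAnswer:"
        # Sample repeatedly to obtain a response distribution, not a point estimate
        outs = generate(prompt, max_new_tokens=3, do_sample=True,
                        temperature=1.0, num_return_sequences=20)
        answers = [o["generated_text"][len(prompt):].strip() for o in outs]
        print(model_name, persona, answers)
```

Crossing model versions with personas in this way exposes whether a cultural pattern replicates across versions or is an artifact of one model.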
Similarly encouraging results have emerged from studies examining the ability of LLMs to generate psycholinguistic norms. For instance, Trott (2024a) demonstrated that GPT-4 effectively captures human judgments of psycholinguistic properties—including word concreteness, semantic similarity, sensorimotor associations, and iconicity—with correlations matching or even surpassing average interannotator agreement. Moreover, substituting LLM-generated norms for small human samples in regression analyses preserves the direction and magnitude of effects, highlighting the utility of LLMs in approximating the “wisdom of small crowds” in psycholinguistic research (Trott, 2024b). LLMs thus offer rapid, cost-effective methods for generating initial approximations of psychological phenomena, particularly when these phenomena have strong linguistic components.
Conditioning models on specific psychological data sets can further improve alignment with human responses. Chuang et al. (2024), for example, integrated empirically derived human-belief networks—estimated via factor analysis on a 64-item controversial-beliefs survey—into LLM agent construction. By seeding role-playing agents with a single belief on a representative topic alongside demographic information and applying both in-context learning and supervised fine-tuning, they achieved substantially better alignment with human opinions on related test topics than when using demographics alone. Likewise, Moon et al. (2024) used detailed “backstories” rather than surface demographics to improve matching to human-response distributions by up to 18% and consistency metrics by 27% across three nationally representative surveys. Nuanced persona development can yield more reliable and representative simulations (see Table 1, persona granularity).
Fine-tuning on representative corpora can also improve validity. Chen et al. (2024) developed contextualized construct representation, converting psychological questionnaires into classical Chinese and fine-tuning models on historical texts. Their approach outperformed both generic models and simple prompting on culture-specific constructs, such as collectivism and traditionalism. Thus, domain expertise combined with technical customization can enhance validity for specific research contexts.
However, LLMs occupy a peculiar temporal state: They possess vast historical knowledge yet remain frozen at their training cutoff. They cannot reflect the postcutoff events or cultural shifts that shape human psychology and behavior (Kozlowski & Evans, 2024). A model trained before a major social movement cannot capture the transformations it engendered; ongoing societal learning remains beyond its horizon.
This temporal gap is further compounded by biases in the training data. Online text tends to overrepresent more recent periods; perspectives from the immediate past may crowd out those from more distant historical eras. Such skew can distort any attempt to study psychological change over time because the model’s understanding of the past is filtered through what was digitized and included in its corpus—introducing selection biases that must be explicitly acknowledged, and when possible, corrected (Ziems et al., 2024).
Methodological framework for implementation
To operationalize these capabilities into rigorous research practices, Table 1 presents a two-tier framework. Essential checks apply to most LLM-simulation research. Context-dependent checks become necessary when making specific theoretical claims—for instance, about demographic effects or general LLM properties.
The importance of these guidelines becomes apparent when examining how methodological choices shape research outcomes. That GPT-4 replicated cultural patterns when GPT-3.5 could not (Niszczota et al., 2025), for example, illustrates why the check for model architecture (see Table 1) is necessary when making claims about general LLM properties: What holds for one model version may not generalize to others.
The choice between base and instruction-tuned models also matters. Reinforcement learning from human feedback (RLHF) introduces systematic biases: RLHF-tuned models skew toward liberal, higher-income, educated perspectives relative to their base counterparts (Santurkar et al., 2023; Tao et al., 2024). Moreover, although instruction tuning improves simulation of topics with strong consensus, it degrades performance on pluralistic topics with diverse opinions (Hu et al., 2025). When studying implicit biases or authentic response patterns, researchers should therefore compare base and fine-tuned versions (Gao et al., 2025).
In general, open-weight models should serve as the default: Unlike closed-source alternatives, they provide version stability, parameter access, and transparency—necessary for replication. Such stability proves particularly critical for complex, longitudinal simulations (J. S. Park et al., 2023) in which silent proprietary updates could invalidate findings across multistage research programs. Access to internal workings (token probabilities, attention patterns, model parameters) further enables mechanistic analysis that closed systems cannot support. When researchers must employ closed-source models—for instance, to establish whether findings generalize across the current spectrum of systems or to access state-of-the-art capabilities unavailable in open alternatives—they should explicitly justify this choice, attempt replication with open-weight models when feasible, and clearly acknowledge the resulting constraints on verification and reproducibility.
Prompt-sensitivity analysis addresses empirical findings of extreme sensitivity to wording variations. Tjuatja et al. (2024) found that RLHF-tuned models showed high sensitivity to seemingly trivial changes, such as typos in survey questions—variations that human respondents would typically ignore. This brittleness necessitates systematic testing of prompt variations to distinguish robust psychological patterns from artifacts of specific phrasings. Effective prompts balance specificity with ecological validity, providing enough context to elicit coherent responses while avoiding overly constraining scenarios that might limit generalizability (Lin, 2024a).
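The sketch below illustrates one way to operationalize such a prompt-sensitivity analysis: semantically equivalent variants (a paraphrase, a typo) should yield statistically indistinguishable response distributions. The item wordings are illustrative, and get_responses() is a stand-in for whatever sampling routine a study uses.

```python
# A sketch of prompt-sensitivity analysis. A significant test across variants
# flags brittleness: responses depend on wording, not the target construct.
import numpy as np
from collections import Counter
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
variants = [
    "Do you agree that taxes on gasoline should be increased?",
    "Do you agree that gasoline taxes should be raised?",   # paraphrase
    "Do you agre that taxes on gasoline should be increased?",  # typo variant
]

def get_responses(prompt, n=100):
    # Stand-in: replace with actual model sampling for each prompt variant
    return rng.choice(["agree", "disagree"], size=n)

table = []
for v in variants:
    counts = Counter(get_responses(v))
    table.append([counts["agree"], counts["disagree"]])

chi2, p, _, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```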
Applications in psychology and behavior research
Building on these methodological foundations, I now examine concrete applications in which LLM-based role simulation demonstrates particular promise. These use cases span multiple domains of psychological inquiry, progressing from simple substitution to complex multiagent systems.
Studying inaccessible populations
Perhaps most compelling is the ability to study populations that remain inaccessible through traditional recruitment methods. Executives, political leaders, historical figures, and members of isolated communities have long presented challenges for psychological research. LLMs trained on relevant textual data can simulate responses from these populations, enabling exploratory studies that would otherwise remain impossible. Chen et al. (2024), for example, used LLMs trained on historical texts to investigate psychological patterns in past populations while validating findings against historical records and acknowledging the interpretive constraints imposed by their methods (see also Varnum et al., 2024). For historical populations, temporal distance is inherent to the inquiry; for contemporary inaccessible populations—executives, political leaders, isolated communities—such displacement becomes a validity threat that requires contemporaneous validation.
Addressing ethical constraints
Ethical and practical constraints often limit the ability to study extreme situations or sensitive topics with human participants. Here, LLM simulation offers unique alternatives. R. Wang et al. (2024), for example, used LLMs to simulate patients via diverse cognitive models informed by cognitive-behavioral therapy (CBT), creating interactive training scenarios for mental-health trainees that would be difficult to construct with real patients. Their PATIENT-Ψ system generates case formulations grounded in CBT principles, allowing trainees to practice cognitive-model formulation and therapeutic interviewing in a risk-free environment. The system’s effectiveness was validated through measures of trainee skill acquisition and confidence and expert evaluations.
Rapid prototyping and cross-cultural research
The rapid-prototyping capabilities of LLMs prove particularly valuable in survey and experimental design. At minimal cost and time, researchers can iterate on instrument wording, flag ambiguous items, and identify questions that might produce floor or ceiling effects. Although this simulation does not replace the need for a human pilot study—an anomalous result could reflect a flawed item but also a simulation artifact—its purpose is to refine the instrument for a more targeted and efficient human pilot. An unexpected LLM response prompts the researcher to critically examine the item, making the validation process more deliberate. This application proves especially useful in cross-cultural research, in which subtle linguistic or conceptual differences can invalidate measures across populations. By simulating responses from different cultural contexts using appropriately varied prompts, researchers can identify problematic items before beginning expensive international data-collection efforts (Sarstedt et al., 2024; Tao et al., 2024).
Complex social systems
Complex social phenomena involving multiple actors and emergent dynamics represent another frontier for LLM-powered simulations. J. S. Park et al. (2023) introduced generative agents—LLM-driven entities that store natural-language “memories,” synthesize them into higher-level reflections, and plan actions in an open-world sandbox. When prompted to “throw a Valentine’s Day party,” 25 agents autonomously spread invitations, forged relationships, coordinated the event, and diffused information across their social network. A systematic ablation of the memory, reflection, and planning modules underscored each component’s necessity for producing believable individual and collective behaviors. Horton (2023) extended this approach to economics, using LLMs to simulate labor-market dynamics and test how minimum-wage policies affect realized wages and labor substitution. LLMs can thus serve as laboratories for studying complex social systems that would be difficult to manipulate experimentally with human participants.
A three-tier validation framework
These diverse applications require rigorous validation, which must adapt to human-data availability. Figure 1 presents a decision framework distinguishing three scenarios—direct, indirect, and generative validation—each requiring different methodological approaches.
Fig. 1. A decision framework for validation and interpretation in LLM-based psychological simulation. The framework proceeds through four stages. Stage 1 establishes performance validity through direct validation when comparable human benchmarks exist, indirect validation when only partial data are available (e.g., testing constituent processes or convergent measures), or generative validation when human data cannot be ethically or practically obtained (evaluating theoretical coherence, emergent properties, and expert assessment of plausibility). Stage 2 identifies implementation confounds that may artificially inflate or deflate LLM-human differences, including sampling strategy, prompt design, and parameter selection. Stage 3 guides interpretation of observed discrepancies by considering how training-data artifacts (temporal displacement, representational biases) and fundamental architectural differences (lack of embodiment, statistical vs. experiential learning) may differentially shape or limit conclusions. Stage 4 emphasizes transparent reporting that matches empirical claims to validation strength and documents all implementation choices. LLM = large language model.
Direct validation (when comparable human benchmarks exist)
Much of current research falls into this category, in which established human data sets permit direct comparison. Validation involves comparing LLM responses against established human findings. Methodological equivalence is critical—identical or comparable tasks and instructions ensure that observed differences reflect genuine psychological divergences rather than task artifacts (essential checks in Table 1).
Response-distribution analysis provides another crucial validation tool. Human populations exhibit natural variability, and credible simulations must capture not just average tendencies but also distributional properties. Mei et al. (2024), for example, compared GPT-3.5-Turbo and GPT-4 with tens of thousands of human participants across six canonical economic games, examining not only mean choices but also full response distributions, dynamic consistency (e.g., tit-for-tat behavior in a repeated prisoner’s dilemma), and sensitivity to framing and context. Although GPT-4 often falls within the human-response range and passes a behavioral Turing test in several games, it diverges notably in the prisoner’s dilemma and as the investor in the trust game.
More concerning, P. S. Park et al. (2024) found that GPT-3.5 (text-davinci-003) produced near-zero variation in six of 14 study replications (Many Labs 2)—a “correct answer” effect in which responses homogenize into a single modal answer rather than reflecting human-like diversity. Such granular distributional analysis—evaluating how and why LLMs differ, not just whether they match human means—establishes clear boundaries for meaningful approximation of human decision-making.
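A minimal sketch of such distributional validation follows: comparing variance and full distributional shape, not just means. The arrays are synthetic placeholders for simulated and human responses on the same Likert scale.

```python
# A sketch of distributional validation beyond mean comparison.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
llm_ratings = rng.choice([3, 4], size=500, p=[0.4, 0.6])  # homogenized responses
human_ratings = rng.integers(1, 6, size=500)              # fuller human spread

print("mean difference:", llm_ratings.mean() - human_ratings.mean())
print("variance ratio (LLM/human):", llm_ratings.var() / human_ratings.var())
ks = stats.ks_2samp(llm_ratings, human_ratings)
print(f"KS test: D = {ks.statistic:.3f}, p = {ks.pvalue:.4g}")
# A near-zero variance ratio signals the "correct answer" effect even
# when the means align.
```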
Validation extends beyond comparing final outcomes to examining how models arrive at those outcomes. Process-level validation assesses whether models follow psychologically plausible strategies rather than producing correct end states through artificial means. J. S. Park et al. (2023) exemplified this approach through systematic ablation studies: Removing memory, reflection, or planning modules from generative agents degraded performance in psychologically meaningful ways. Without memory, agents failed to maintain consistent relationships across interactions; without reflection, they could not synthesize experiences into higher-level generalizations; and without planning, they showed reduced believability and multistep coherence. These degradations mirror how human capabilities depend on intact cognitive systems, validating process-level psychological realism rather than mere endpoint fitting.
Sequential decision-making tasks offer particularly diagnostic evidence. Do models exhibit systematic search strategies, backtracking after dead ends, or refinement through iteration—or do they arrive at solutions through direct retrieval bypassing intermediate steps? Response-latency patterns, when reflecting actual computational demands, provide another signal: Humans show longer response times for difficult problems or conflicting information, and thus uniformly quick responding reveals fundamentally different processing. Strategy consistency across related tasks offers additional evidence: Simulations showing correlated patterns across judgment tasks—especially when common thinking dispositions are involved (Toplak et al., 2011)—suggest captured cognitive approaches rather than task-specific pattern matching.
Indirect validation (when only partial human data are available)
Many research questions lack perfectly matched human-comparison data but can draw on related empirical findings. Indirect validation proceeds through three complementary strategies: validating constituent processes, seeking convergent evidence across related measures, and stress-testing simulation limits.
Constituent process validation
Individual simulation components often permit validation against established findings from simpler paradigms. A complex multiagent simulation of organizational change might validate individual agent decision-making against choice architecture research, social-influence mechanisms against persuasion studies, and memory-based reasoning against recall findings. This piecewise approach builds confidence by confirming that components operate as psychological theory predicts.
R. Wang et al. (2024) employed this strategy in developing PATIENT-Ψ to simulate CBT patients for training mental-health professionals. Without public data sets of realistic CBT cognitive models, they validated constituent processes—how agents represent beliefs and emotional responses—against CBT principles and expert clinical judgments and then evaluated the system’s fidelity and training usefulness with experts and trainees. This component-level validation supports psychologically plausible mechanisms even when the full interactive system resists direct benchmark validation.
Convergent evidence across partial anchors
When multiple partial empirical anchors exist, convergence strengthens confidence. Researchers validating simulations of political-attitude formation might triangulate across cross-sectional surveys, longitudinal studies, and experimental findings on social influence. Divergence patterns prove equally informative: If a simulation matches cross-sectional distributions but fails to capture longitudinal change trajectories, this reveals specific limitations—the model may capture stable individual differences but miss dynamic attitude-formation processes. Such targeted divergences guide refinement more effectively than wholesale success or failure.
Stress-testing simulation limits
Systematic parameter manipulation provides diagnostic evidence of whether performance degrades in psychologically plausible ways. Does response quality decline gradually with task complexity, showing signatures of limited cognitive resources, or does the model maintain perfect performance until suddenly producing incoherent output? When encountering ambiguity, do models show increased variability and heuristic reliance, or do they confidently generate responses regardless of domain—revealing a lack of metacognitive awareness (Bowers et al., 2025)?
This discriminant-validity approach defines conditions under which simulation findings warrant psychological interpretation. Transparent reporting of limits—what the simulation can and cannot do, how it fails, cases in which inferences remain valid—is as essential as documenting successes.
Generative validation (when human data cannot be obtained)
Complex agentic simulations often resist direct validation. Consider simulating discrimination emergence across a multiagent social network: No existing data set provides equivalent temporal resolution, behavioral granularity, and experimental control, and creating one might require ethically problematic manipulations. Yet such simulations offer precisely the value justifying their development—exploring processes elusive to direct empirical investigation. Validation proceeds through three complementary strategies: assessing theoretical coherence, examining emergent phenomena, and soliciting expert evaluation.
Theoretical coherence
Does agent behavior align with established psychological mechanisms? A discrimination simulation should exhibit patterns consistent with social-identity theory, contact hypothesis, or stereotype-formation research—reduced bias following positive intergroup contact, increased bias under intergroup competition, or stereotype persistence despite disconfirming evidence. Theoretical incoherence (e.g., bias emerging without perceived threat, competition, or in-group identification) signals simulation failure rather than novel discovery. The simulation’s value lies in integrating known mechanisms to explore their interaction over time, not in producing theoretically arbitrary outcomes.
Emergent phenomena
Do expected higher-order patterns arise from lower-level interactions without explicit programming? In-group favoritism should emerge from individual biases and selective interaction, attitude polarization should emerge from homophily and confirmation bias, and social-norm formation should emerge from observation and conformity pressures. J. S. Park et al. (2023) demonstrated this: 25 agents autonomously spread party invitations, forged relationships, and coordinated events through emergent social dynamics rather than scripted behaviors. The presence of such emergent properties provides indirect validation that the simulation captures relevant psychological dynamics.
Expert evaluation
Domain specialists can assess qualitative plausibility—whether agent behaviors, interaction patterns, and developmental trajectories match theoretical expectations and ethnographic knowledge. Experts can identify failures of verisimilitude: discrimination emerging instantaneously, social networks forming without homophily constraints, or attitude change occurring without exposure to alternative views.
Positive and negative controls
Across all three validation scenarios, researchers should establish validation boundaries through systematic control conditions. Positive controls are conditions in which an effect is expected. In this context, they comprise tasks in which LLMs should perform competently given their training, such as making grammatical acceptability judgments or demonstrating basic textual reasoning. Success on these tasks confirms that the experimental setup (prompts, parameters) is sound and can detect a known capability. Failure, conversely, suggests an implementation confound—such as a poorly specified prompt or an inappropriate sampling strategy—rather than a fundamental model limitation.
Conversely, negative controls are conditions in which no effect is expected. For LLM simulations, these are tasks that tap into capabilities models architecturally lack, such as making proprioceptive judgments or describing phenomenal experiences. An LLM should fail these controls. Unexpected success signals that the task was ill-posed—solvable through text-based reasoning alone—or that the model has learned to mimic competence through superficial linguistic patterns without capturing the underlying mechanism (Liu & Ding, 2025).
Systematic application of control conditions calibrates validation expectations (Fig. 1), preventing both overinterpretation (mistaking superficial alignment for deep equivalence) and underinterpretation (mistaking implementation failures for fundamental limitations).
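One way to implement such a control battery is sketched below. The prompts and the pass/fail scoring stand-in are illustrative assumptions, not a fixed protocol; a real battery would include multiple tasks per control type.

```python
# A sketch of a positive/negative control battery for LLM simulations.
CONTROLS = {
    # Positive control: success expected; failure flags an implementation
    # confound (e.g., malformed prompt, inappropriate sampling parameters).
    "grammaticality": dict(
        expect_success=True,
        prompt="Which is grammatical: 'the cat sleeps' or 'sleeps cat the'?"),
    # Negative control: failure expected; success flags an ill-posed task
    # or superficial mimicry of the target capability.
    "proprioception": dict(
        expect_success=False,
        prompt="Report the current joint angle of your left elbow."),
}

def run_and_score(prompt):
    # Stand-in: query the model and score the response as pass (True)/fail (False)
    return True

for name, spec in CONTROLS.items():
    if run_and_score(spec["prompt"]) != spec["expect_success"]:
        print(f"Control '{name}' violated: recalibrate before interpreting results.")
```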
Interpreting LLM-human discrepancies
When LLM-human differences emerge, a diagnostic framework is needed to distinguish implementation confounds (Fig. 1, Stage 2) from training-data artifacts and architectural differences (Fig. 1, Stage 3).
Implementation confounds
Methodological choices can artificially inflate or deflate observed similarities. Sampling strategy shapes variance structure. Generating 1,000 responses from a single LLM captures within-model stochasticity, not the between-persons variance of human populations. Consequently, even well-calibrated simulations exhibit compressed variance unless researchers explicitly model individual differences through diverse psychological profiles (Chuang et al., 2024) or full agentic frameworks (J. S. Park et al., 2023).
Prompt choices create additional confounds. Simple demographic prompts (“You are a 35-year-old woman from Chicago”) may elicit stereotypical responses not because models cannot capture human psychology but because prompts fail to specify the psychological richness guiding actual human responses (Chuang et al., 2024; Moon et al., 2024). Elicitation format also matters: Forcing constrained numerical outputs (Likert ratings, numerical scales) may compress variance and regress to the mean, whereas eliciting free-text responses and projecting them onto rating scales via semantic similarity may recover more faithful distributions (Maier et al., 2025). This reflects a general principle: Probe LLMs through their linguistic capabilities rather than imposing response formats designed for human participants.
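A minimal sketch of the free-text projection idea follows, in the spirit of Maier et al. (2025). It assumes the sentence-transformers package as one possible embedding backend; the Likert anchor phrasings are illustrative assumptions.

```python
# A sketch of projecting free-text responses onto a Likert scale via
# semantic similarity to anchor statements.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")
anchors = [
    "I completely disagree with this.",  # 1
    "I somewhat disagree with this.",    # 2
    "I neither agree nor disagree.",     # 3
    "I somewhat agree with this.",       # 4
    "I completely agree with this.",     # 5
]
anchor_emb = encoder.encode(anchors, convert_to_tensor=True)

def project_to_scale(free_text: str) -> int:
    """Return the 1-5 anchor most semantically similar to the response."""
    emb = encoder.encode(free_text, convert_to_tensor=True)
    return int(util.cos_sim(emb, anchor_emb).argmax()) + 1

print(project_to_scale("Honestly, that seems mostly right to me."))  # likely 4
```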
Temperature settings and other parameters similarly constrain interpretations: Low temperatures elicit the model's most probable response, whereas higher temperatures explore its learned distribution. Comparing low-temperature LLM outputs with human data conflates the modal response with distributional coverage.
Training-data artifacts
Temporal displacement represents a critical consideration. Models trained on historical data cannot reflect recent societal changes, emerging social movements, or evolving attitudes. For stable psychological phenomena (basic cognitive processes, fundamental emotional responses), temporal displacement may be negligible; for rapidly changing domains (social attitudes, technology adoption, political opinions), this gap fundamentally limits validity. Researchers should document model-training dates, implement contemporaneous validation for temporally sensitive domains, and consider temporal displacement when interpreting unexpected results. Representational biases in the training corpus (WEIRD overrepresentation, demographic skews) may also account for systematic deviations.
Architectural differences
Some discrepancies stem from fundamental constraints: lack of embodiment and sensorimotor grounding, and the gap between statistical pattern learning and experiential knowledge acquisition (Pezzulo et al., 2024). These limitations become particularly salient when phenomena require physical sensation, emotional arousal, or lived experience.
Conversely, the absence of LLM-human differences requires scrutiny. Superficial alignment—matching human means but not distributions or succeeding on simple tasks but failing on diagnostic variants—can mask deeper divergences, such as in response variance (P. S. Park et al., 2024). Robust validation demands convergent evidence: alignment across multiple metrics (means, variances, sequential dependencies), sensitivity to theoretically relevant manipulations, and stability across implementation and temporal contexts.
This diagnostic stance treats LLM simulations as scientific instruments requiring calibration. Implementation choices are parameters to be systematically varied, with result patterns revealing which phenomena are robust properties of learned representations versus artifacts of probing methods. Transparent reporting (Fig. 1, Stage 4) demands documenting implementation choices and confound checks while matching empirical claims to validation tier achieved.
Boundaries and limitations
Despite these advances, LLM role simulations face inherent limitations. The absence of embodied experience creates challenges for phenomena tied to physical sensation, emotional arousal, or lived experience. Although humans describe pain, emotion, and trauma through language—making these experiences partially accessible to text-based modeling—current LLMs lack the sensorimotor grounding that shapes how humans represent and reason about such experiences. More critically, validation becomes increasingly difficult as phenomena shift from those with clear behavioral or linguistic signatures (beliefs, stated preferences) to those rooted in subjective experience (qualia, proprioception, affective intensity). This limitation reflects not an absolute boundary but a gradient of tractability, with experientially grounded phenomena presenting steeper validation challenges.
The statistical nature of LLM responses enables capturing population-level patterns but may miss outliers, unique individual perspectives, or responses that deviate from typical training-data patterns (P. S. Park et al., 2024; A. Wang et al., 2025; Zaim bin Ahmad & Takemoto, 2025). LLMs excel at simulating modal responses but may fail to capture the full range of human psychological diversity—particularly problematic when studying individual differences, personality extremes, or rare psychological phenomena.
Cultural and demographic biases in training data create additional constraints. Overrepresentation of WEIRD populations and publicly expressive individuals means that simulations of marginalized groups or non-Western cultures require particular caution (Tao et al., 2024). For example, ChatGPT achieves higher simulation accuracy for male, White, older, highly educated, and upper-class personas and underperforms for others (Qu & Wang, 2024); it consistently portrays racial minorities as more homogeneous than White Americans (Lee et al., 2024); and it pervasively represents women as younger than men—a statistical bias that can diverge from workforce demographics (Guilbeault et al., 2025). These biases can amplify through statistical optimization (Z. Wang et al., 2024), compressing diverse experiences into narrow characterizations. Furthermore, LLMs show bias against null findings and inflate effect sizes: When replicating studies finding no significant effect, LLMs produced significant results in the vast majority of cases; for known effects, LLMs consistently generated larger effects than human participants (Cui et al., 2025).
These limitations underscore treating LLM simulations as hypothesis-generating tools requiring validation with human samples. Successful role simulation requires moving beyond simple demographic prompting to psychologically grounded approaches that acknowledge both model capabilities and fundamental constraints. Whether studying rare populations, prototyping interventions, or exploring complex social dynamics, researchers should maintain clear boundaries between simulation and substitution—using LLMs as tools for discovery rather than endpoints for inference.
Using Language Models to Model Cognitive Processes
Beyond simulating human roles and personas, LLMs offer a second avenue for psychological research: investigating cognitive mechanisms through their internal workings. This approach shifts the question from whether LLMs produce human-like outputs to what their computational processes reveal about cognition itself.
Theoretical foundations for cognitive modeling
Computational models for understanding cognition trace back to early AI and cognitive architectures (McGrath et al., 2024; Simon, 1983; van Rooij et al., 2024). LLMs are particularly compelling as cognitive models because they acquire complex linguistic knowledge through learning processes that, although distinct from human development, produce representations that often align with human cognitive structures.
First, these models develop internal representations corresponding to meaningful linguistic and conceptual categories without explicit programming. Different layers capture different levels of linguistic abstraction, from surface-level syntactic features in early layers to complex semantic relationships in later ones (Manning et al., 2020; Tenney et al., 2019). This hierarchical organization mirrors theories of language processing, suggesting computational principles transcending specific implementations (Liu et al., 2025). Fine-tuning a language model on human behavioral data both improves choice prediction and aligns internal representations with human neural activity without exposure to neural data during training (Binz et al., 2025).
Second, LLMs exhibit emergent behaviors arising from simple learning rules interacting with complex data (Wei et al., 2022)—paralleling how human cognitive abilities develop from basic neural mechanisms interacting with rich environmental input. Although specific learning algorithms differ (backpropagation in LLMs vs. the constellation of mechanisms supporting human learning), both systems demonstrate how sophisticated capabilities can emerge from simple foundations when exposed to structured information (Binz & Schulz, 2023a; Frank, 2023; Shah & Varma, 2025).
Third, alignment between model representations and neural activity follows a scaling law: As models grow larger, their internal representations—specifically, attention patterns—become better predictors of human brain activity and eye movements during naturalistic reading (Gao et al., 2025). Critically, this improvement stems from scale itself rather than instruction tuning: Base models of increasing size (7 billion to 65 billion parameters) show monotonically improving alignment with human functional-MRI data and regressive saccade patterns, whereas instruction-tuned variants of identical size show no advantage. The effect generalizes across languages (English and Chinese) and modalities (reading and listening), suggesting that scaling produces representations that are computationally more analogous to those underlying human language processing.
Fourth, the success of LLMs in capturing human-like performance suggests that they may have discovered computational solutions to problems that biological systems also face (Buckner, 2023). These convergent solutions manifest across multiple domains: unsupervised or self-supervised learning during pretraining (Manning et al., 2020), in-context learning, domain-general computations, and human-level performance on challenging tasks. When LLMs learn to track long-distance dependencies or resolve ambiguous pronouns, they develop mechanisms for maintaining and manipulating information over time—challenges that human cognitive systems also confront (Ambridge & Blything, 2024; Blank, 2023; Millière, 2024). Studying how LLMs solve these problems provides insights into computational requirements of cognition and potential mechanisms for meeting them (Lindsay, 2024).
Importantly, the investigation of LLMs as cognitive models depends on the transparency afforded by open-weight models (Zhang et al., 2022), allowing researchers to directly probe, intervene, observe, and measure model behaviors (Frank, 2023; McGrath et al., 2024). Measures fall into two categories—correlational and causal, detailed below—enabling researchers to understand how neural networks process information.
Correlational approaches to probing model cognition
Correlational methods comprise two major classes: internal probing and behavioral analysis of model outputs.
Internal probing
Probing involves training auxiliary classifiers on model internal states to predict specific properties of interest (Belinkov, 2022). For example, a classifier trained on activation patterns from a particular layer can determine whether that layer encodes syntactic information (e.g., part-of-speech tags) or semantic information (e.g., animacy). Probe performance reveals what information is represented at different processing stages, mapping information flow through the network (Manning et al., 2020; Tenney et al., 2019).
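A minimal probing sketch follows, training a linear classifier to decode animacy from one layer's hidden states of an open-weight model. It assumes the Hugging Face transformers library and scikit-learn; the word lists are illustrative, and a real probe would use held-out items and control tasks.

```python
# A sketch of a linear probe for animacy on GPT-2 hidden states.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)

animate = ["dog", "teacher", "horse", "child", "bird"]
inanimate = ["rock", "table", "cloud", "hammer", "door"]
words, labels = animate + inanimate, [1] * 5 + [0] * 5

def hidden_state(word, layer=6):
    ids = tok(word, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    # Mean-pool the chosen layer's activations over the word's subtokens
    return out.hidden_states[layer][0].mean(dim=0).numpy()

X = [hidden_state(w) for w in words]
probe = LogisticRegression(max_iter=1000).fit(X, labels)
# High probe accuracy indicates that layer 6 linearly encodes animacy;
# it does not by itself show that the model *uses* this information.
print(probe.score(X, labels))
```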
However, probing has important limitations. High probe accuracy does not necessarily mean the model uses that information functionally—the probe might detect incidental correlations rather than causally relevant features. Conversely, low probe accuracy does not prove information absence; it might simply be encoded in a format the probe cannot detect. These limitations have motivated more sophisticated approaches that combine probing with causal intervention (discussed next).
Behavioral analysis
Behavioral analysis focuses on systematic patterns in model outputs rather than internal representations. By carefully designing stimulus sets that isolate specific cognitive phenomena, researchers can test whether models exhibit human-like processing signatures: semantic priming (Jumelet et al., 2024), garden-path effects (Amouyal et al., 2025), and structural biases in ambiguity resolution. Liu and Ding (2025) developed a one-shot word-deletion task in which the deletion rule was deliberately ambiguous; both humans and LLMs consistently inferred rules based on syntactic structure, preferring to delete whole constituents rather than arbitrary word strings—revealing not just passive structural representation but also its active deployment in resolving uncertainty.
Behavioral analysis reveals functional similarities and differences between human and model cognition. When models show human-like patterns, it suggests they have discovered similar computational solutions despite different implementations. When they diverge, this highlights either model limitations or interesting differences in how artificial and biological systems process information. This comparative approach proves particularly valuable for understanding syntactic processing, semantic comprehension, and pragmatic inference.
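Per-token surprisal is one standard behavioral-analysis measure; the sketch below computes it for a classic garden-path sentence, using GPT-2 as a convenient open-weight model. A human-like processing signature would show a surprisal spike at the disambiguating word.

```python
# A sketch of surprisal computation for garden-path stimuli with GPT-2.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

sentence = "The horse raced past the barn fell."
ids = tok(sentence, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits

# Surprisal of token t given its context: -log2 p(token_t | tokens_<t)
log_probs = F.log_softmax(logits[0, :-1], dim=-1)
targets = ids[0, 1:]
surprisals = -log_probs[torch.arange(targets.shape[0]), targets]
surprisals = surprisals / torch.log(torch.tensor(2.0))  # nats -> bits
for token, s in zip(tok.convert_ids_to_tokens(targets), surprisals):
    print(f"{token:>10} {s.item():6.2f} bits")
# Expect elevated surprisal at the disambiguating token " fell".
```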
Causal intervention and mechanistic understanding
Whereas probing and behavioral analyses reveal correlational patterns, establishing mechanistic understanding requires causal-intervention techniques. Inspired by lesion studies in neuroscience, these methods enable researchers to directly link model components to behaviors through targeted manipulations—selectively modifying or disabling model parts to observe resulting changes. Unlike biological systems, LLMs permit precise, reversible interventions impossible or unethical in human studies.
Model editing
Meng et al. (2022) investigated factual knowledge storage in GPT-like LLMs. Performing causal-mediation interventions on hidden-state activations across model components, they identified middle-layer feed-forward modules as the critical locus of factual associations. They then demonstrated that individual facts—such as updating “The Space Needle is in Seattle” to “The Space Needle is in Paris”—can be reliably edited by applying targeted rank-one updates to corresponding feed-forward weights (rank-one model editing). This mechanistic insight reveals how models store factual information and illuminates the computational structure underlying knowledge representation.
Activation patching
Building on causal-intervention techniques, activation patching (causal tracing) offers a more precise tool for understanding information flow through networks (Heimersheim & Nanda, 2024). Researchers can take activation patterns from one context—say, when the model correctly answers “The capital of France is Paris”—and surgically insert them into another context in which the model processes a different question. By systematically swapping these activation patterns at different locations in the network, researchers can trace exactly which pathways carry specific information.
This technique has yielded insights into how models learn from examples. When given a prompt such as “Cat→Gato, Dog→Perro, House→?,” models infer that they should translate to Spanish and respond “Casa.” Activation patching revealed that specific components—called “induction heads”—detect these pattern mappings and copy relevant behavior, implementing a learned “find-and-apply-pattern” operation (Olsson et al., 2022). This shows how models adapt their behavior based on context (in-context learning), a capability that emerges from training despite never being explicitly programmed.
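The sketch below shows the core mechanics of activation patching with PyTorch forward hooks on GPT-2: record a layer's activation from a clean run, then substitute it into a corrupted run and observe the output shift. The layer choice and prompts are illustrative; real studies sweep layers and token positions.

```python
# A sketch of activation patching via forward hooks on GPT-2.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
LAYER = 6  # illustrative; sweep layers in a real analysis

clean = tok("The capital of France is", return_tensors="pt").input_ids
corrupt = tok("The capital of Italy is", return_tensors="pt").input_ids

# 1) Record the clean run's activation at the chosen layer
cache = {}
def save_hook(module, inputs, output):
    cache["h"] = output[0].detach()
handle = model.transformer.h[LAYER].register_forward_hook(save_hook)
with torch.no_grad():
    model(clean)
handle.remove()

# 2) Patch the cached activation into the corrupted run
def patch_hook(module, inputs, output):
    return (cache["h"],) + output[1:]
handle = model.transformer.h[LAYER].register_forward_hook(patch_hook)
with torch.no_grad():
    patched_logits = model(corrupt).logits
handle.remove()

# If the patched run now favors " Paris", this layer carries that information
top = patched_logits[0, -1].topk(5).indices
print(tok.convert_ids_to_tokens(top))
```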
These causal techniques offer unique advantages for cognitive modeling. Precise reversible interventions enable strong causal inferences about the relationship between representations and behaviors. Combined with behavioral analyses showing human-like performance patterns, these techniques provide converging evidence for shared computational principles.
Learning dynamics and developmental analogies
One of the most intriguing applications of LLMs as cognitive models involves studying how cognitive abilities emerge through learning. By analyzing models at different training stages or comparing models trained on different data, researchers can investigate how exposure to linguistic information shapes cognitive capabilities. This developmental perspective offers insights into the relationship between experience and cognitive structure.
Inspired by comparative psychology, the controlled-rearing approach trains models on carefully constructed data sets to test specific hypotheses about learning. Just as manipulating newborn chicks’ visual experiences (e.g., slow or fast object motion) reveals core learning algorithms supporting object perception, manipulating inputs in language models can test which specific input types are necessary for learning.
Misra and Mahowald (2024) showcased this approach in their study of syntactic generalization. They trained transformer language models on systematically manipulated corpora: a default corpus, one with all AANN (Article + Adjective + Numeral + Noun; e.g., “a beautiful five days”) sentences removed, and others in which AANNs were replaced by perturbed variants (ANAN, NAAN). Models trained without any AANN examples but still exposed to related constructions (e.g., “a few days”) generalized to novel AANN instances at well above chance levels, and those trained on corrupted variants did not. This finding demonstrates that language models can learn rare grammatical phenomena by bootstrapping from more frequent, related structures, providing computational support for theories of language acquisition.
This approach extends naturally to cross-linguistic and cross-cultural investigations. By training models on corpora from different languages or cultural contexts, researchers can explore how linguistic and cultural environment shapes cognitive representations. Multilingual models show that internal representations systematically vary by language pair—transfer is strongest between typologically similar languages—and that models develop language-specific and shared circuits for syntax and semantics (Muller et al., 2021; Pires et al., 2019).
The temporal dynamics of learning in LLMs also provide insights into cognitive development. Early in training, models rely on simple, surface-level heuristics (n-gram-like predictions) before gradually forming deeper, hierarchical representations—phases resembling developmental trajectories in children (e.g., progression from lexical to syntactic competence; Choshen et al., 2022; Evanson et al., 2023). Although LLM training unfolds on vastly different timescales and with different mechanisms, these parallels suggest general principles about how complex cognitive abilities emerge from simpler foundations through interaction with structured input.
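Such developmental analyses are practical because some open-weight suites publish intermediate checkpoints. The sketch below assumes the EleutherAI Pythia suite, which exposes checkpoints as Hugging Face revisions; evaluate_syntax() is a stand-in for any behavioral test battery applied at each training stage.

```python
# A sketch of tracking a capability across training checkpoints.
from transformers import AutoTokenizer, AutoModelForCausalLM

STEPS = [1000, 8000, 64000, 143000]  # early to final training stages

def evaluate_syntax(model, tokenizer):
    # Stand-in: score the checkpoint on, e.g., a subject-verb-agreement suite
    return float("nan")

for step in STEPS:
    rev = f"step{step}"
    tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m", revision=rev)
    model = AutoModelForCausalLM.from_pretrained(
        "EleutherAI/pythia-160m", revision=rev)
    print(rev, evaluate_syntax(model, tok))
```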
Multimodal extensions and embodied cognition
Although language models have proven valuable for understanding linguistic phenomena, their utility extends beyond language processing. When combined with vision systems—as in vision-language models—they serve as models for visual perception, memory, and other cognitive processes, complementing traditional artificial neural networks in modeling the mind and brain (Kanwisher et al., 2023; Wicke & Wachowiak, 2024). This expansion addresses a fundamental question in cognitive science: How do cognitive systems integrate information across multiple modalities?
Multimodal models processing both language and vision provide new opportunities to explore this integration. Models such as CLIP (contrastive language-image pretraining) and its successors learn to align representations across modalities by training on image-caption pairs, creating unified embeddings that capture both visual and linguistic information.
The cognitive relevance of these multimodal representations becomes apparent when examining their predictive power. A. Y. Wang et al. (2023) found that CLIP’s joint vision-language representations explained up to 79% of variance in high-level visual cortex, outperforming vision-only models (ImageNet-trained ResNet50) and text-only models (BERT)—especially in regions linked to scene and human-object interactions (parahippocampal place area, extrastriate body area, temporoparietal junction). Likewise, Shoham et al. (2024) showed that CLIP embeddings predict human similarity judgments in pairwise rating tasks—when participants rated the visual similarity of familiar faces and objects presented as images (perception) or reconstructed from names (recall)—significantly better than purely visual (VGG-16) or purely semantic (SGPT) models.
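The basic recipe behind such analyses is sketched below: embed stimuli with CLIP and correlate model similarities with human judgments across stimulus pairs. It uses the transformers CLIP implementation; the image paths are placeholders.

```python
# A sketch of predicting pairwise similarity judgments from CLIP embeddings.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def image_embedding(path):
    inputs = processor(images=Image.open(path), return_tensors="pt")
    with torch.no_grad():
        return model.get_image_features(**inputs)

# Placeholder stimuli; a real study iterates over all rated pairs and
# correlates these scores with human similarity ratings.
a = image_embedding("face_a.jpg")
b = image_embedding("face_b.jpg")
sim = torch.nn.functional.cosine_similarity(a, b).item()
print(f"CLIP similarity: {sim:.3f}")
```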
These findings suggest that joint training on images and natural language produces embeddings that better approximate how biological systems integrate multimodal information, highlighting how computational models with natural language supervision can reveal principles of cognitive organization difficult to uncover with traditional approaches.
Limitations and interpretive challenges
Fundamental differences between artificial and biological systems create both opportunities and constraints for cognitive modeling. Understanding these limitations is crucial for drawing appropriate inferences from model studies (Cuskley et al., 2024; Lin, 2025b; Shah & Varma, 2025; van Rooij & Guest, 2025).
Scale disparity
LLMs are exposed to far more linguistic data than any human encounters, potentially discovering statistical patterns that play no role in human cognition. This raises questions about whether model mechanisms reflect human-like solutions or alternative strategies enabled by massive data exposure. Researchers must carefully consider whether observed mechanisms could plausibly operate given human-scale learning constraints.
Architectural differences and circular inference
The core challenge stems from inductive biases: architectural priors, training objectives (e.g., next-token prediction), and statistical regularities in the data can jointly produce behaviors that mimic cognitive phenomena without sharing underlying mechanisms. This creates a risk of circular inference, in which a model appears to validate a theory simply because its design reflects similar assumptions, producing theory-consistent behaviors through mechanisms that differ from those proposed for human cognition. Researchers can mitigate this risk by testing whether effects replicate across different model families or by using mechanistic interventions to distinguish genuine cognitive alignment from learned statistical shortcuts (see the sketch below). However, such strategies are constrained by the current landscape, in which many state-of-the-art models share similar architectures (e.g., transformers), objectives, and training regimes, limiting their power to adjudicate between competing cognitive theories. Yet this limitation also points to a potential advantage: By manipulating these architectures, researchers can dissociate implementation from function, isolating the principles necessary for intelligent behavior.
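A minimal version of the cross-family replication check might look as follows. The model identifiers are illustrative open-weight checkpoints, and the task item and scoring are hypothetical; the point is only the design, namely running an identical behavioral probe across architecturally distinct families.

```python
# Sketch: checking whether a behavioral effect replicates across model
# families, guarding against architecture-specific artifacts. Checkpoints
# and the task item are illustrative.
from transformers import pipeline

MODEL_FAMILIES = {
    "gpt2-style": "gpt2",
    "pythia": "EleutherAI/pythia-410m",
    "olmo": "allenai/OLMo-1B-hf",
}

# Hypothetical cognitive-reflection-style item.
prompt = ("Q: A bat and a ball cost $1.10 in total. The bat costs $1.00 "
          "more than the ball. How much does the ball cost? A:")

for family, checkpoint in MODEL_FAMILIES.items():
    generator = pipeline("text-generation", model=checkpoint)
    out = generator(prompt, max_new_tokens=20, do_sample=False)
    # An effect appearing in only one family may reflect its training
    # recipe rather than a general computational principle.
    print(family, "->", out[0]["generated_text"][len(prompt):])
```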
Lack of grounding and embodiment
Lack of grounding is perhaps the most fundamental limitation for cognitive modeling. Human cognition develops through interaction with the physical and social world, shaping representations that purely linguistic exposure cannot replicate. This limitation particularly affects spatial reasoning, social cognition, and affective processing, domains in which embodied experience plays a crucial role. Even multimodal models such as CLIP lack sensorimotor interaction, temporal continuity, and physical grounding, and thus cannot capture cognitive phenomena rooted in bodily experience (e.g., real-world spatial reasoning, tool use, affective dynamics), suggesting a major frontier for future work in grounded, multimodal cognitive modeling.
Future directions and methodological recommendations
The next phase of LLM-based cognitive modeling requires greater transparency and deeper integration. Open-weight architectures and conceptual frameworks that connect internal representations (activations or attention patterns) to cognitive constructs will be essential. These tools can catalyze collaboration with cognitive neuroscience: Model activations can generate testable hypotheses about neural computation, and empirical data can guide model refinement.
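The basic ingredient for connecting internal representations to cognitive constructs is layer-wise activation extraction, sketched below with an open-weight model. The model choice and stimulus sentence are illustrative; the resulting representations could feed a probing classifier or a representational comparison with behavioral or neural data.

```python
# Sketch: extracting layer-wise activations so they can be related to a
# cognitive construct (e.g., via probing or comparison with neural data).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)

# Illustrative stimulus with a center-embedded relative clause.
sentence = "The scientist who the reviewers praised revised the manuscript."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states: tuple of (num_layers + 1) tensors, each of shape
# [batch, tokens, hidden_dim]; mean-pool over tokens for a sentence code.
sentence_reps = [h.mean(dim=1).squeeze(0) for h in outputs.hidden_states]
print(f"{len(sentence_reps)} layer representations of size {sentence_reps[0].shape[0]}")
```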
Systematic comparisons across architectures, training regimes, and model scales will help disentangle general computational principles from implementation-specific artifacts. This comparative approach echoes strategies in cognitive neuroscience, in which convergent evidence across methods strengthens theoretical claims.
Hybrid modeling approaches hold great promise. By combining LLMs with reinforcement-learning agents, embodied simulations, or complementary cognitive frameworks, researchers can better capture cognitive processes rooted in perception, action, and decision-making—domains in which purely linguistic models fall short. This could involve pairing large models with deliberately minimalist frameworks designed for mechanistic transparency. For example, “tiny” recurrent neural networks can be trained on behavioral data to discover underlying cognitive algorithms, which are then rendered interpretable through dynamical-systems analysis (Ji-An et al., 2025).
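To illustrate the "tiny network" strategy, the sketch below defines a deliberately minimal recurrent model that could be fit to trial-by-trial choice data and later inspected with dynamical-systems tools. The two-armed-bandit task structure, input coding, and random data are hypothetical stand-ins, not the procedure of Ji-An et al. (2025).

```python
# Sketch: fit a very small recurrent network to choice data, then inspect
# its low-dimensional dynamics. Task structure and data are hypothetical.
import torch
import torch.nn as nn

class TinyGRU(nn.Module):
    """Deliberately minimal: a 2-unit hidden state keeps dynamics plottable."""
    def __init__(self, hidden_size=2):
        super().__init__()
        self.gru = nn.GRU(input_size=3, hidden_size=hidden_size, batch_first=True)
        self.readout = nn.Linear(hidden_size, 2)  # logits over two choices

    def forward(self, x):
        h, _ = self.gru(x)      # x: [batch, trials, 3] = (prev choice, reward, bias)
        return self.readout(h)  # per-trial choice logits

model = TinyGRU()
# Hypothetical batch: 8 participants x 100 trials.
x = torch.randn(8, 100, 3)
targets = torch.randint(0, 2, (8, 100))
loss = nn.CrossEntropyLoss()(model(x).reshape(-1, 2), targets.reshape(-1))
loss.backward()  # after training, 2-D hidden trajectories can be analyzed
print(f"initial loss: {loss.item():.3f}")
```

The design choice here is the constraint itself: With only two hidden units, the learned algorithm can be visualized exhaustively rather than probed piecemeal.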
Methodologically, researchers should clearly articulate their theoretical commitments, specifying which cognitive phenomena they aim to model and why LLMs are suitable tools. A multimethod strategy that combines probing analyses, behavioral assays, and causal interventions can build convergent validity, and transparent reporting of scope and limitations can mitigate overinterpretation.
Ultimately, the field will progress not by asking whether LLMs are “like” human minds but by treating them as experimental systems: manipulable platforms for uncovering the algorithmic principles underlying intelligent behavior. In this role, LLMs can complement traditional approaches and illuminate new dimensions of mind and cognition.
Ethical Considerations Beyond Traditional Institutional Review Boards
The use of LLMs in psychological research raises ethical questions that extend beyond traditional human-subjects protections. Although LLMs are not sentient beings requiring protection from harm, their use in simulating human responses creates novel ethical challenges.
The representation problem
LLMs are trained on vast corpora scraped from the internet, transforming individual expressions into statistical patterns for research use—a purpose far removed from their original context (Longpre et al., 2024). This raises questions about representation and consent that traditional institutional-review-board (IRB) frameworks cannot address.
When one prompts an LLM to simulate responses from specific populations, one implicitly claims the model represents that group. Yet training data inevitably contain biases: Marginalized communities may be underrepresented, their perspectives filtered through others’ descriptions rather than direct expression. This creates risks of epistemic injustice whereby already marginalized voices are further silenced through computational mediation (A. Wang et al., 2025).
The challenge is compounded by bias amplification. LLMs not only reflect training-data biases but can also systematically amplify them through statistical optimization (Z. Wang et al., 2024). As noted earlier, models often achieve higher simulation accuracy for dominant demographic groups while compressing diverse human experiences into narrow, homogeneous characterizations (Lee et al., 2024; Qu & Wang, 2024).
Simulation also risks indirect harm by misrepresenting vulnerable groups, such as children, individuals with mental-health conditions, and trauma survivors (Y. Wang et al., 2024). Inaccurate portrayals might perpetuate stigma, and simulating trauma responses without survivor input can trivialize their experiences. Even careful simulations may feel appropriative to communities who have long struggled for direct representation.
Practical ethical guidelines
Given these challenges, researchers using LLMs for psychological simulation should adopt specific ethical practices.
Transparency requirements
Document key aspects of LLM use: model versions, prompts, parameters, and validation procedures (Lin, 2024b). Clearly indicate when findings derive from simulations and acknowledge limitations prominently—especially for closed-source models (Hussain et al., 2024). Monitor model updates because they can silently alter behaviors, potentially invalidating previous results.
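One way to operationalize this documentation is a machine-readable run record archived alongside the results, as in the sketch below. All field values are hypothetical placeholders; in practice, researchers would pin an exact model revision (e.g., a commit hash) and record every decoding parameter actually used.

```python
# Sketch: a machine-readable record of simulation settings, archived with
# the results. Field values are illustrative placeholders.
import datetime
import json

run_record = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # exact checkpoint, not "Llama 3"
    "model_revision": "main",                     # pin a commit hash in practice
    "decoding": {"temperature": 0.7, "top_p": 0.9, "max_new_tokens": 256},
    "prompt_template": "You are a participant in a psychology study. {item}",
    "random_seed": 42,
    "date_run": datetime.date.today().isoformat(),
    "validation": "compared with human pilot data (N = 50)",  # hypothetical
}

with open("simulation_run_record.json", "w") as f:
    json.dump(run_record, f, indent=2)
```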
Representation auditing
Before simulating any population, critically examine whether the model can credibly represent that group. Actively identify and mitigate biases through testing for stereotypical responses, examining response diversity, and validating against contemporary community data. For marginalized or vulnerable populations, collaborate with community members to design prompts, interpret outputs, and determine appropriate-use boundaries.
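One simple diversity check along these lines is sketched below: sampling many continuations of a persona prompt and quantifying how homogeneous they are. The model, prompt, and crude uniqueness index are illustrative; a serious audit would use embedding-based dispersion measures, stereotype-targeted test items, and comparison with community-provided data.

```python
# Sketch: auditing response diversity under a persona prompt. Highly
# homogeneous samples suggest the model compresses a group into a
# stereotype. Model and prompt are illustrative.
from transformers import pipeline

generator = pipeline("text-generation", model="EleutherAI/pythia-410m")
persona_prompt = "As a first-generation college student, my biggest worry is"

samples = [
    out["generated_text"]
    for out in generator(
        persona_prompt, max_new_tokens=30, do_sample=True,
        temperature=1.0, num_return_sequences=20,
    )
]
# Crude diversity index: share of unique continuations.
unique_share = len(set(samples)) / len(samples)
print(f"unique continuations: {unique_share:.0%}")
```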
Appropriate-use boundaries
Clearly establish when LLM simulation is appropriate. Simulation can be justified for initial exploration or hypothesis generation when participant recruitment is challenging, but subsequent validation with actual community members is essential. Efficiency gained through simulation must never justify bypassing communities whose experiences researchers seek to understand.
Community engagement
Involve community members in research design and validation, ensuring their perspectives shape methodological choices and help identify potential harms.
Institutional responsibilities
Traditional IRB frameworks focus on protecting individual human subjects from direct harm. LLM research raises additional ethical considerations—concerning collective representation, potential indirect harms through misrepresentation, and implications of substituting computational models for human voices—that extend beyond standard IRB purview but require institutional attention.
Rather than expanding IRB scope, institutions should foster ethical research through complementary mechanisms: developing methodological best practices for LLM simulation research, facilitating access to open-weight models and computational resources to enhance transparency and reproducibility, and establishing peer-review standards that evaluate appropriate use of simulation versus human participation. When LLM research does involve human subjects—whether through validation studies or when training data contain identifiable information—traditional IRB review applies as usual, focusing on informed consent, privacy protection, and minimizing foreseeable harm to participants.
Concluding Remarks
As tools for psychological science, LLMs are complex systems demanding new methodological vigilance. They are best understood as among the most complex, manipulable, and potentially insightful scientific instruments yet devised for exploring human thought and behavior. In this article, I have provided guidelines grounded in two distinct applications. Using LLMs for persona simulation requires researchers to grapple with temporal displacement and ethical representation, moving past surface-level prompting to psychologically grounded validation. Treating LLMs as cognitive models compels a shift from observing behavioral mimicry to probing mechanistic understanding through causal interventions that test the architecture of learned abilities.
The path forward requires a dual commitment. Psychologists must develop fluency with the technical particulars of model architecture, training, and validation, and the AI community must foster deeper engagement with the complexities of human cognition and the nuances of empirical research. Open-weight models and transparent reporting are essential for this cross-disciplinary work. The unique strengths of LLMs in scalability, experimental control, and counterfactual reasoning promise to accelerate discovery. But the rigor of the science researchers build with them will depend on the skill and care with which they learn to use them.
Acknowledgements
I thank Gati Aher, Michael Bernstein, Danica Dillion, Nancy Fulda, Nicholas Laskowski, Paweł Niszczota, Philipp Schoenegger, Lindia Tjuatja, Lukasz Walasek, and David Wingate for comments on early drafts. I used Claude Sonnet 4.5 and Gemini 2.5 Pro for proofreading the article, following the prompts described in Lin (2025c).
Transparency
Action Editor: David A. Sbarra
Editor: David A. Sbarra
Author Contributions
