Abstract
Each incident of failure of a novel drug to translate from preclinical experiments into the clinic increases the pressure to justify the use of animals in these experiments and for the interpretation of findings to be valid and convincing. This is especially the case with research of psychedelics, because they have already been tested extensively in humans. One reason for failure to translate is that preclinical research findings and conclusions are rarely, if ever, confirmed independently. Another is an incautious interpretation of the validity of the animal models that are used, in respect of their relevance to the human disorder of interest and its treatment. This article discusses both these points in the context of preclinical investigations of psychedelics as fast-acting antidepressants, but the generic points are relevant to other human disorders as well.
Introduction
After many years in the doldrums, there has been a resurgence of interest in the hunt for new classes of antidepressants. Alongside ketamine, a major focus for preclinical studies of psychedelics has been their potential action as fast-acting antidepressants. Although the efficacy of existing compounds is already being tested in humans, preclinical experiments are still needed to investigate their mechanism(s) of action and to search for new congeners with promising prospects for clinical development (reviewed by Liebnau et al., 2025).
This is happening at a time when the justification for preclinical research that uses animals is being challenged more than ever before. The ethical burden of preclinical experiments is exacerbated by the poor reproducibility of many research findings (de Oliveira Andrade, 2025) and their high rate of failure to translate into the human clinic (Yamaguchi et al., 2021). There are many reasons why a compound might not translate successfully into humans, including the emergence of unwanted side effects. However, if a compound was predicted to be efficacious on the basis of evidence from preclinical models, but turned out not to have any beneficial effects, then all the animals have been wasted. In some cases, that might be because the preclinical evidence has been misconstrued, rather than arising from a fundamental flaw in the animal model per se (see Stanford, 2025). Nevertheless, such failures have inflamed criticism of, and loss of confidence in, the reliability and relevance of findings from animal studies in the research of human disorders and their treatment. For all those reasons, it is essential that preclinical procedures are not only reproducible but also fit for purpose.
So far, most efforts to improve reproducibility have placed a strong emphasis on randomization and blinding as essential for reducing the risk of bias. Preregistration of research plans is also encouraged to help ensure compliance with the intended (and approved) research plan. However, successful translation depends not only on appropriate experimental design but also on the validity of the experimental procedures; both these elements must be satisfied if the findings are to be unambiguous and interpreted correctly.
This article first outlines steps being taken to improve the design and reporting of preclinical investigations. However, initiatives that are intended to improve reproducibility offer no benefit at all if the procedures are not carried out properly, or if they lack the validity needed to fulfil the experimental objectives (Pratt et al., 2022). That issue is discussed next. Despite focusing on behavioural procedures that are widely used in the research of potential psychedelic antidepressants, many of the points are generic and apply to other fields in the behavioural neurosciences.
Improving reproducibility and validity of research findings: General points
For preclinical findings to be translatable, there needs to be assurance that the investigation was planned properly and that the design of individual experiments is appropriate. To that end, several strategies and online tools have been developed to help improve the reproducibility of preclinical research (Table 1).
Table 1. The pathway to improving reproducibility of research findings.
As part of a step-wise approach to the design of a preclinical investigation, the first consideration is the PREPARE guidelines (Planning Research and Experimental Procedures on Animals: Recommendations for Excellence; Smith et al., 2018), which offer a broad checklist of factors that need to be addressed before embarking on any experiment that uses animals. These include the following: proper formulation of the study; effective communication between researchers and the animal facility; and quality control of reagents and the animals’ health status.
An important component of that process will be deciding what experimental procedures are to be used and how the information gathered will be appropriate for meeting the objectives of the study. Those decisions require a dispassionate appraisal of the validity of the procedures/models, bearing in mind that each of them could be valid for one purpose (e.g. as a predictive screen), but not another (e.g. as a model of a complex, multifactorial disorder in humans; see below). The Improving Translational Relevance in Preclinical Psychopharmacology (iTRIPP) guidelines offer advice on those points when designing experiments to study a ‘model’ of a psychiatric disorder and/or its treatment (Bailey et al., 2023).
The next step is to plan the entire investigation: that is, the series of experiments that aim to reach a safe conclusion (Table 1). The prospective development of this overall plan is important because that approach will reduce the risk of false positives and enable the most efficient use of resources, including animals (Bate et al., 2025). The overall plan should start with pilot studies, to help decide which experimental factors merit inclusion in subsequent steps of the investigation and also to identify the fixed levels of other factors (e.g. timeline or range of drug doses), to maximize the window of opportunity or optimize the process. This should be followed by exploratory hypothesis-generating experiments and, finally, the hypothesis-confirming experiment. Whereas that sequence is typically standard practice in industrial settings, blue-skies research tends to evolve more gradually. However, it is important to note that an investigation that approaches the conclusion by devising one experiment at a time, adding new factors as the investigation progresses, risks wasting resources and will take much longer. It can also fail to identify important interactions between the experimental factors.
Notably, many preclinical exploratory findings are reported as conclusive before they have been confirmed independently. This is a premature and risky strategy; it should be made clear that any such findings are merely exploratory and tentative (Bate et al., 2025; Table 1). Conclusive findings require a test of an unequivocal prediction for the effect of a given experimental intervention (the hypothesis): for example, that a novel psychedelic will produce the response of interest, of meaningful magnitude, over a predicted range of doses. This precaution reduces the risk of being misled by false-positive and/or false-negative findings.
In addition to prompting the hypothesis, the results from exploratory experiments are useful because they also provide information on the likely magnitude of the response to the experimental intervention and the variability of the measure of interest. Both are needed for power analysis, which is used to estimate the appropriate sample size in the hypothesis-confirming experiment. That estimation helps to ensure that the sample sizes are neither too small (i.e. the study is underpowered and lacks adequate sensitivity) nor too large (i.e. the study is overpowered and risks detecting changes that are too small to be of any importance).
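The link between effect magnitude, variability and sample size can be illustrated with a minimal sketch. It assumes a simple two-group comparison of means with a two-sided test and uses the normal approximation, which slightly underestimates the sample size that an exact t-test calculation would give when groups are small. The function name, the default significance level (0.05) and the default power (0.8) are illustrative assumptions and are not taken from the guidance cited above.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate number of animals per group for a two-group comparison.

    effect_size is the standardized difference between group means
    (expected difference divided by the pooled standard deviation),
    both of which would be estimated from the exploratory experiments.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    # Critical values of the standard normal distribution for the
    # two-sided significance level and for the desired power.
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    # Normal-approximation formula: n = 2 * ((z_alpha + z_power) / d) ** 2
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


if __name__ == "__main__":
    for d in (0.5, 1.0, 2.0):
        print(f"standardized effect {d}: {sample_size_per_group(d)} per group")
```

Under these assumptions, a standardized effect of 1.0 needs 16 animals per group, whereas halving the expected effect to 0.5 raises the requirement to 63 per group. This shows why reliable estimates of both the magnitude and the variability of the response are needed before the hypothesis-confirming experiment is designed.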
Only after carrying out a properly powered hypothesis-confirming experiment can it be inferred that the conclusion is likely to be correct and not a false positive (internal validity), but even that precaution does not guarantee reproducibility. Ideally, a further step would be to confirm that the same findings emerge when the experiment is carried out in a different laboratory setting and, even better, when testing the hypothesis in a different strain/species of animal (external validity). Unfortunately, the motivation to carry out those important checks is diminished by the need for confidentiality, which is driven by the competitive research environment of both the commercial and academic sectors. Nevertheless, findings that are confirmed as reproducible in that final category of tests are less likely to be false positives and more likely to offer promising translational potential.
Having planned the overall investigation, the next step is to design the individual experiments (Table 1). At this point, it is important to recognize that the design of each experiment depends on its role within the investigation: that is, whether it is an exploratory pilot study, an exploratory hypothesis-generating experiment or a hypothesis-confirming experiment (Bate et al., 2025). To help with that process, the National Centre for the 3Rs, UK, has developed the Experimental Design Assistant (EDA; https://nc3rs.org.uk/our-portfolio/experimental-design-assistant-eda). This enables researchers to build a schematic flowchart of a proposed experiment and to populate the plan with information about the experimental factors and important aspects of the procedure (e.g. species, strategies for blinding and randomization and the configuration of the samples within the experiment; Percie du Sert et al., 2017). The plan includes the intended strategy for statistical analysis (Bate et al., 2017). The EDA will then critique the overall design and analysis and will either confirm that they are appropriate and compatible, or flag points that raise concerns or that need more information; these flags are linked to online pages of advice on how to reconcile such issue(s). Finally, the EDA can produce a text report, which lists all the details that have been specified in the design. Importantly, this report makes it easy to check whether there are any issues that are important for promoting reproducibility, but have not been considered in the design.
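As an illustration of what a randomization and blinding strategy can look like in practice, the following minimal sketch allocates animals to equal-sized treatment groups at random and hides the treatment identity behind letter codes. It is not the procedure implemented by the EDA; the function name, the animal identifiers and the treatment labels are hypothetical, and equal group sizes are assumed.

```python
import random
import string


def blinded_allocation(animal_ids, treatments, seed=None):
    """Return (allocation, key) for a randomized, blinded experiment.

    allocation maps each animal to a letter code; key maps each code to
    its treatment and should be held by someone who does not score the
    animals until the analysis is complete.
    """
    if not treatments or len(animal_ids) % len(treatments) != 0:
        raise ValueError("number of animals must be a multiple of the number of treatments")
    if len(treatments) > len(string.ascii_uppercase):
        raise ValueError("too many treatments for single-letter codes")

    rng = random.Random(seed)  # a recorded seed makes the allocation reproducible

    # Shuffle the treatments before assigning codes, so that the code
    # letters carry no information about treatment order.
    shuffled_treatments = list(treatments)
    rng.shuffle(shuffled_treatments)
    key = dict(zip(string.ascii_uppercase, shuffled_treatments))

    # Build equal-sized groups of codes, then randomize which animal gets which.
    per_group = len(animal_ids) // len(treatments)
    codes = [code for code in key for _ in range(per_group)]
    rng.shuffle(codes)
    allocation = dict(zip(animal_ids, codes))
    return allocation, key


if __name__ == "__main__":
    animals = [f"mouse_{i}" for i in range(1, 13)]  # hypothetical identifiers
    allocation, key = blinded_allocation(animals, ["vehicle", "low dose", "high dose"], seed=7)
    print(allocation)  # what the experimenter sees: animal -> code only
```

The experimenter who doses and scores the animals would work only from the coded allocation; the key would be revealed after the data have been analysed. Recording the seed alongside the design allows the allocation to be regenerated and audited.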
The final step, after carrying out the experiment, is to report the experimental procedure and the findings to a standard that enables the experiment to be repeated independently (Table 1). Shortcomings in this reporting process have been identified as a major factor that contributes to poor reproducibility (Kilkenny et al., 2009). The ARRIVE Guidelines were developed to address that problem but, despite most leading journals formally endorsing those guidelines, there is evidently scope for improving the compliance of the articles they publish (Hair et al., 2019; Lilley et al., 2020). As a further remedy, the ARRIVE 2.0 Guidelines were devised. This version differs from the original by categorizing factors that should be included in original research reports as either ‘Essential’ or ‘Recommended’ (Percie du Sert et al., 2020), but it is too early to assess whether that adjustment has improved matters. In the meantime, a preprint of the RIVER Guidelines has been released (RIVER Working Group, 2023), which will offer equivalent advice on the reporting of experiments carried out in vitro.
To further consolidate the reporting process, the Transparency and Reproducibility Committee of the International Union of Basic and Clinical Pharmacology (IUPHAR) is developing guidance on Clarity, Evaluation, Assessment, Rigour (CLEAR). This guidance emphasizes the need for data transparency together with its experimental context, which needs a full description of how the experimental design was implemented. This information is essential because it will help to identify sources of bias and variability that could affect the interpretation of the findings.
Despite all these initiatives, even reproducible findings will be merely expensive distractions if it turns out that the experimental model lacked the required validity. However, the criteria for validity depend on the objectives of the study (Almanasreh et al., 2019; Campbell and Fiske, 1959; Slack and Drougalis, 2001; Willner, 1986):
Construct validity: The neurobiology and pharmacology of the model are consistent with our understanding of the disorder and its treatment, which need continual revalidation as new evidence emerges. This can be further assessed in respect of:
  Content validity: The extent to which the animal model incorporates all (or only some) features of the human disorder, or its treatment.
  Concurrent validity: The extent to which the response to a novel experimental challenge replicates that expressed in an established (validated) model (the ‘gold standard’).
  Divergent validity: The extent to which it is possible to distinguish between the construct validity of an established model and that of a novel challenge with a different (or overlapping) construct.
  Convergent validity: The extent to which the effect(s) of a novel experimental challenge matches/correlates with those from a different model that assesses the same response (construct).
Predictive validity: The response to a given category of experimental challenge(s) in animals is consistent and can be used to predict the response in humans. This is the essence of preclinical drug screening.
Face validity: The observed response to the experimental challenge in animals resembles the effects in humans. This category of validity is vulnerable to misleading anthropomorphism.
Internal validity: The response to an experimental challenge is reproducible when the experiment is repeated under identical circumstances.
External validity: The response is reproducible when the experiment is carried out in different circumstances (a different laboratory or strain of animal, for instance). This category is essential for translational validity.
Translational validity: The response to the experimental challenge in animals is borne out in humans.
In short, for preclinical research to offer any scientific benefits and to have a realistic possibility of translating into humans, the research findings must be valid for the purposes of the study objectives, as well as reproducible. The following sections address that former point, focusing on key aspects of preclinical research of psychedelics.
The hallucinogenic profile of psychedelics
There are two main classes of psychedelics: phenethylamines (e.g. mescaline) and indolamines. The latter comprises two subgroups: tryptamines (e.g. dimethyltryptamine and psilocybin) and ergolines (also known as lysergamides, e.g. LSD). Conclusions from early studies of the pharmacology of psychedelics were confounded by the lack of any knowledge of the multiple 5-HT receptor subtypes that have now been identified. As a consequence, LSD was described as a ‘5-HTD’ receptor antagonist (Burris and Sanders-Bush, 1992; Gaddum, 1957), whereas Aghajanian believed LSD activated inhibitory 5-HT autoreceptors in the midbrain (Aghajanian et al., 1968). There is now plenty of evidence that all these compounds are either full agonists or partial agonists of 5-HT2a receptors and that this action is responsible for their hallucinogenic effect.
Until recently, there was little human data to underpin this view (for obvious reasons), apart from studies such as Sadzot et al. (1989) who reported that the threshold for doses of psychedelics that induce hallucinations correlated with their affinity for human (and rat) 5-HT2a receptors. However, many later studies have reported that the psychotomimetic effects of psychedelics are blocked by the 5-HT2a receptor antagonist, ketanserin (e.g. Kraehenmann et al., 2017; Vollenweider et al., 1998; see Jalal, 2018). A recent imaging study, using positron emission tomography, has further confirmed that both 5-HT2a occupancy by psilocin and its plasma concentration correlate with the subjective (mystical) experience induced by its prodrug parent, psilocybin (Madsen et al., 2019; Stenbæk et al., 2021).
However, several 5-HT2a receptor agonists are widely purported to lack hallucinogenic actions. Amongst these are 2-bromo-lysergic acid diethylamide (2-bromo-LSD; Cerletti and Rothlin, 1955; Ginzel and Meyer-Gross, 1956; Lewis et al., 2023), lisuride and Ariadne (Cunningham et al., 2023). The piperazine, quipazine, although from a different chemical class, is also often cited as another example. But, as discussed below, these assessments are highly questionable (see also Fiorella et al., 1995).
The prevailing view that 2-bromo-LSD does not induce hallucinations is open to question, or qualification. For instance, at doses used to test for relief of cluster headache, most subjects felt ‘slightly tipsy’ (see: Karst et al., 2010) and another study reported ‘a delirious reaction similar in almost all respects to that of LSD’ (Richards et al., 1958). Given that this compound is a partial (and biased) agonist at 5-HT2a receptors, but a full agonist, antagonist, or even inverse agonist at many other G protein-coupled receptors (Lewis et al., 2023), it has not been possible to identify which action(s) could contribute to such a response.
It is also hard to understand why lisuride, which is a congener of LSD and a 5-HT2a receptor agonist, is widely regarded as non-hallucinogenic. In a clinical trial, again to treat headache, ‘visual hallucinations’ were specified as a side-effect that warranted cessation of the study (Somerville and Herrmann, 1978). Hallucinations are also listed as a potential side-effect in data sheets for formulations of this compound, which is used to treat Parkinson’s disease and (off-label) migraine.
The evidence that quipazine does not induce hallucinations is similarly tenuous and seems to rely on a small-scale human study that focused on its effects on hormone secretion (Parati et al., 1980). Peripheral side-effects are listed, and they all point to nausea and gastrointestinal problems, which are likely to be attributed to quipazine’s action as a 5-HT3 receptor agonist. This action will constrain the dose range that can be tested in humans, which could account for the dearth of reports of any hallucinogenic action. This limitation would not be evident in rodent studies, given that neither rats nor mice express the emetic reflex (Horn et al., 2013). As a consequence, compounds that are predicted to be efficacious, on the basis of preclinical evidence, might not translate into humans due to limitations imposed by the maximum tolerated dose. Nevertheless, as summarized by de la Fuente Revenga et al. (2021), the human data for the lack of hallucinogenic effects of quipazine are ‘both scant and fragmentary, when not contradictory’.
Of particular relevance to this article is the evidence that the behavioural effects of these compounds in animal studies (specifically, the head-twitch response (HTR; see below)) resemble those of psychedelics that are acknowledged to be hallucinogenic. Yet, because they are widely regarded as non-hallucinogenic (and, in the case of quipazine, non-psychedelic), this action has led to them being branded as ‘false positives’ (e.g. White et al., 1981). This could be unfortunate because that judgement, if incorrect, unjustifiably undermines confidence in the validity of preclinical investigations of the effects of psychedelics on rodent behaviour (see also: Kehler and Lindskov, 2025).
Another complication is the evidence that 5-HT2a agonism is necessary, but not sufficient, for the hallucinogenic actions of psychedelics. A full discussion of that topic is beyond the scope of this article but, in brief, evidence points to activation of 5-HT1a receptors as an important contribution to the hallucinogenic effect of tryptamines (Pokorny et al., 2016), while activation of 5-HT2c receptors contributes to the actions of phenethylamines (Custodio et al., 2023; Fantegrossi et al., 2008) and ergolines (Fiorella et al., 1995), including lisuride. In addition, each psychedelic compound has its profile of binding to the full range of receptors for 5-HT and other neurotransmitters, each with different receptor reserves, which will further distinguish them from one another (Holze et al., 2024; McKenna et al., 1990; Millan et al., 2002). The recruitment of several different second messenger cascades that follow activation of (G-protein coupled) 5-HT2a receptors, together with biased agonism, and their heterodimerization with receptors for other neurotransmitters, will further refine and define the response to individual psychedelics.
Little is known about the functional consequences of all these variables, but emerging evidence suggests that they have important implications for the psychotropic effects of these compounds and their therapeutic potential. This is particularly the case in respect of compounds with preferential binding to 5-HT2a receptors, such as LSD and psilocybin, which are regarded as ‘typical psychedelics’ (Kamal et al., 2023). By contrast, others, such as 5-methoxy-N,N-dimethyltryptamine, which has higher affinity for 5-HT1a than 5-HT2a receptors, together with ketamine (an NMDA-receptor antagonist) and methylenedioxymethamphetamine (a mixed-action, serotonin-releasing agent), are described as ‘atypical psychedelics’ (‘entheogens’). These disparate pharmacological profiles and topographical distributions of their receptor targets (Delli Pizzi et al., 2023) are thought to explain differences in their psychotropic effects, especially core features of the hallucinations, and their potential as fast-acting antidepressants (Bosch et al., 2022; Dourron et al., 2023).
All these differences, together with their respective pharmacokinetic profiles, could affect findings from preclinical behavioural studies in crucial ways. This variability is reminiscent of the detailed interrogations of the actions of sedative-hypnotics in studies of stimulus discrimination by Griffiths (e.g. Ator and Griffiths, 1986, 1989), who discovered that benzodiazepines and other compounds that bind to their allosteric-binding site on the GABAA-receptor do not all generalize to the same interoceptive cue. That topic has been discussed in detail elsewhere (Heal et al., 2025).
Against that background, the following sections discuss preclinical investigations of the behavioural responses to psychedelics, as 5-HT2a receptor agonists with antidepressant potential.
The validity of predictive screens
Head-twitches, hallucinations and 5-HT2a receptors
Psychedelics and other hallucinogens induce a behavioural syndrome in rodents, which includes head-twitches and wet-dog shakes (e.g. Titeler et al., 1988, but see Silva and Calil, 1975). There is long-standing evidence that head-twitches in mice (Goodwin and Green, 1985) and rats (Schreiber et al., 1995) are mediated by activation of 5-HT2a receptors. Moreover, the expression of head-twitches is prevented by the 5-HT2a receptor antagonist, ketanserin (Darmani et al., 1990) and is not seen in 5-HT2a receptor gene knockout mice (htr2−/−; González-Maeso et al., 2007). Interestingly, a recent study reported that head-twitches induced by psilocybin are also not evident in genetically altered mice lacking the 5-HT transporter (Gattuso et al., 2025).
It is not known whether activation of 5-HT2a receptors, a hallucinogenic (mystical) experience, or both, are essential for an antidepressant response to psychedelics (but see: Yaden and Griffiths, 2020). Nevertheless, many preclinical studies of putative psychedelic antidepressants have scored their induction of head-twitches in rodents, as an index of 5-HT2a receptor activation and, more speculatively, as a predictive screen for a hallucinogenic response (Halberstadt et al., 2020). Even if the latter assumption is correct, it still raises questions about whether particular features of the response to the test compound, including hallucinations (the discriminative stimulus), have any bearing on the HTR as a predictor of an antidepressant response (see above and Gouzoulis-Mayfrank et al., 2005).
The action of the LSD analogue, lisuride, has been somewhat problematic in this context because this 5-HT2a agonist does not cause head-twitches in mice, even at high doses. Given the evidence that lisuride does cause hallucinations (see above – but contrary to many descriptions of its actions), this finding should be regarded as a false-negative response were it not for its efficacy in the least shrew (Cryptotis parva; see: Darmani et al., 1994; Table 2). This disparity highlights the importance of species differences in this response.
Table 2. Typical and atypical behavioural and pharmacological profiles of psychedelics and related compounds.
The ergoline, ergometrine, is another apparent anomaly. There are reports that this compound has psychotropic (‘entheogenic’/‘psychotomimetic’) effects that are equivalent to a low dose of LSD (Bigwood et al., 1979; Selva et al., 1989), but no frank hallucinations have been reported for this 5-HT2a partial agonist. This may be because the unpleasant somatic effects of this compound are dose-limiting (Ott and Neely, 1980). Yet, it does induce head-twitches in mice, albeit at a concentration that is 5- to 200-fold higher than is needed for psilocybin, dimethyltryptamine or LSD (Balsara et al., 1986; Corne and Pickering, 1967). Similarly, there are no reports of hallucinations for the 5-HT2a agonist, quipazine (but see above), and yet this piperazine produces a HTR (de la Fuente Revenga et al., 2021). As a consequence, both ergometrine and quipazine are widely regarded as producing false-positive responses in the head-twitch test.
If hallucinations are essential for the efficacy of psychedelics as antidepressants, then purported false-negative (lisuride) and false-positive (quipazine and ergometrine) findings in the HTR test would undermine its use as a preclinical predictive screen for (hallucinogenic) psychedelic antidepressants. Clearly, it is important to resolve all these points because using animals to score head-twitches cannot be justified if hallucinations are not necessary and/or the test is unreliable (Figure 1).

Figure 1. A flowchart to indicate the criteria that justify the use of the head-twitch response (HTR) to predict the therapeutic effects of putative fast-acting antidepressants and their mechanism of action.
Notwithstanding those uncertainties, another (alternative) reason for scoring head-twitches assumes that 5-HT2a receptor agonism, alone, predicts a fast-acting antidepressant response. In that context, the validity of this test rests merely on the expression of head-twitches as an objective criterion. But, if assessment of 5-HT2a receptor activation is the intended objective, then there is no need to regard either ergometrine or quipazine as false positives. In fact, their provocation of head-twitches consolidates the validity of this behavioural assay as a predictive screen for 5-HT2a agonists, albeit with possible differences in species’ sensitivity.
That possibility, along with the belief that some tryptamines bind to 5-HT2a receptors but do not induce hallucinations (see above), has kindled the development of congeners as putative antidepressants that will lack an unwanted hallucinogenic side-effect (reviewed by Duan et al., 2024). However, if that is the case, it is not necessary to use animals at all because 5-HT2a receptor activation can be measured in vitro. That approach has the added advantage of enabling investigations of caveats such as biased agonism and coupling to different second messengers (Figure 1).
Yet another scenario is that activation of 5-HT2a receptors is essential, but not sufficient, for an antidepressant response. In fact, the need for co-activation of 5-HT2a and other neurotransmitter receptor(s), which would need to be identified, is the only situation for which the use of animals to score head-twitches can be fully justified (Figure 1).
Another factor to bear in mind, as pointed out by Halberstadt and Geyer (2013), is that what is scored as a ‘head-twitch’ can show poor inter-observer consistency (Silva and Calil, 1975) and is vulnerable to subjective bias, which confounds comparisons across different studies. That problem can be avoided by ensuring that head-twitches are scored only by proficiently trained experimenters, after confirmation of their intra- and inter-scorer consistency (>95% inter-rater reliability: Canal and Morgan, 2012; Garcia et al., 2007). This precaution is essential because the full-blown response to psychedelics in rodents is not confined to head-twitches, but includes reciprocal forepaw treading, flat body posture, head-weaving, hind limb abduction and Straub tail. The ‘serotonin syndrome’ in rodents is believed to be analogous to ‘serotonin toxicity’ in humans, which is a life-threatening delirium caused by excessive serotonergic transmission (Haberzettl et al., 2013; Stanford et al., 2010).
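One way to check scorer consistency before an experiment begins is sketched below. It assumes that two scorers have independently judged the same series of time bins from a recording as containing a head-twitch (1) or not (0). How the cited studies calculated their reliability figure is not specified here, so both the simple percentage agreement and Cohen's kappa (which corrects for agreement expected by chance) are shown; the function names and the example scores are hypothetical.

```python
def percent_agreement(scorer_a, scorer_b):
    """Percentage of time bins on which two scorers gave the same score."""
    if len(scorer_a) != len(scorer_b) or not scorer_a:
        raise ValueError("both scorers must score the same, non-empty set of bins")
    matches = sum(a == b for a, b in zip(scorer_a, scorer_b))
    return 100.0 * matches / len(scorer_a)


def cohens_kappa(scorer_a, scorer_b):
    """Agreement between two scorers, corrected for chance agreement."""
    if len(scorer_a) != len(scorer_b) or not scorer_a:
        raise ValueError("both scorers must score the same, non-empty set of bins")
    n = len(scorer_a)
    p_observed = sum(a == b for a, b in zip(scorer_a, scorer_b)) / n
    # Chance agreement: for each category, the product of the proportions
    # of bins that each scorer assigned to that category.
    categories = set(scorer_a) | set(scorer_b)
    p_expected = sum(
        (scorer_a.count(c) / n) * (scorer_b.count(c) / n) for c in categories
    )
    if p_expected == 1.0:
        return 1.0  # both scorers used a single, identical category throughout
    return (p_observed - p_expected) / (1 - p_expected)


if __name__ == "__main__":
    # Hypothetical scores for ten time bins (1 = head-twitch observed).
    scorer_a = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
    scorer_b = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
    print(f"agreement: {percent_agreement(scorer_a, scorer_b):.1f}%")
    print(f"kappa: {cohens_kappa(scorer_a, scorer_b):.2f}")
```

Because head-twitches are sparse events, most bins are scored as 0 by both scorers, so percentage agreement can look high even when the scorers disagree about the twitches themselves. A chance-corrected index helps to expose that problem.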
This might be an important, but neglected, confounder because reciprocal forepaw treading and flat body posture, at least, have been attributed to activation of 5-HT1a receptors in rats (Arvidsson et al., 1981; Smith and Peroutka, 1986) and mice (Yamada et al., 1988). Moreover, expression of both head-twitches and other aspects of the serotonin syndrome is evidently influenced by interactions with other serotonin receptors and other neurotransmitter systems (Goodwin et al., 1987; Heal et al., 1986). This is likely to be particularly important for compounds, such as lisuride, which bind to many different neurotransmitter receptors (Millan et al., 2002) and are purported to have anomalous profiles in this test.
Related to this broader profile is the evidence that expression of head-twitches after treatment with psychedelics has a bell-shaped dose–response profile. This could be explained by high drug doses activating stereotypic behaviours, associated with the full-blown serotonin syndrome, that prevent or mask expression of head-twitches (Corne and Pickering, 1967). Clearly, the choice of drug dose (and possibly the route of administration (Yamada et al., 1988)) needs to be considered carefully when assessing the effects of novel psychedelics in this test, not least because these variables, and their behavioural consequences, could contribute to the inconsistent reports of the effects of different compounds. In fact, this hormetic confounder could explain why lisuride does not provoke head-twitches in mice (but see above) and has been proposed by Glatfelter et al. (2024) to account for the apparent lack of hallucinogenic effects of this compound.
In summary, it is not at all certain that hallucinations per se are a necessary component of an antidepressant response to psychedelics, whether as a marker for a therapeutic dose or by inducing a change of psychological status, which drives benefits that outlast the acute effects of the drug. It is also not certain whether activation of 5-HT2a receptors, alone, is sufficient either to induce hallucinations or to relieve depression. Whatever the case, the validity of the HTR as a predictive screen for hallucinations and/or antidepressants rests on psychedelics producing a consistent change in animals’ behaviour; that validity is challenged by claims that this test is vulnerable to false negatives and false positives.
There is clearly a need to reconcile and validate all these different objectives and confounders. That process is important for ethical and scientific reasons, especially in respect of the need to confirm that in vitro (non-animal) alternatives cannot provide the required information.
The Forced Swim Test: Validity depends on the experimental objective
The Forced Swim Test (FST) has been used for over 50 years on the basis that the immobility of rodents, which develops after a short period of inescapable confinement in a cylinder of water (usually 6 min maximum), is an expression of depression, or ‘depression-like’ behaviour. A change of mindset on this point followed the long-overdue acknowledgement that depression in humans is typically a chronic relapsing disorder, whereas the immobility in the FST is state-dependent and dissipates when the animal is removed from the water. Also, the immobility is diminished by subchronic administration of an established antidepressant shortly before the test (c. 24 h), but a therapeutic response to established (monoamine-targeting) antidepressants in humans needs several weeks of treatment, at least. Interestingly, the latter limitation does not apply to psychedelics, which are being investigated as fast-acting antidepressants. Nevertheless, the use of the FST to study depression (or depression-like behaviour) is now widely deprecated and no longer permitted in the UK and some other jurisdictions, on the grounds of its lack of validity as a model of depression.
The only procedure to produce a change in animals’ phenotype that arguably produces many key features of depression is olfactory bulbectomy. Interestingly, the HTR to the psychedelic, 2,5-dimethoxy-4-iodoamphetamine (DOI), is particularly sensitive in olfactory bulbectomized animals and is abolished by an antagonist of either 5-HT2a or 5-HT2c receptors or chronic administration of the selective serotonin reuptake inhibitor (SSRI), fluvoxamine (Oba et al., 2013). Repeated LSD treatment also relieved the deficit in active avoidance in these animals, as did chronic administration of the established antidepressant, imipramine (Buchborn et al., 2014). There is clearly scope for more research on psychedelics in this model of depression.
The misunderstanding that led to the FST being used to study depression derives from publications that first reported the procedure. The authors commented that the immobility appears to ‘reflect a state of despair in the rat’ (Porsolt et al., 1978) and that ‘having learned that escape was impossible and their having given up hope. Immobility was therefore given the name “behavioral despair”’ (Porsolt et al., 2001). The immobility was arbitrarily given the name ‘behavioural despair’ to distinguish it from ‘learned helplessness’, which was used to describe the deficit in active avoidance that develops when animals experience a series of inescapable, uncontrollable foot-shocks.
Learned helplessness was abandoned as a model of depression more than 40 years ago (Maier, 1984) but, because ‘despair’ and ‘hopelessness’ are prominent features of depression, the assumption that the forced swimming induces depression in rodents still prevails despite the lack of any scientific justification. The only evidence for that interpretation resorts to ‘face validity’, which is vulnerable to anthropomorphic mistakes. The immobility that characterizes both behavioural despair and learned helplessness is now thought to reflect a coping strategy/stress resilience (Maier and Watkins, 2010; Molendijk and de Kloet, 2015; see: Stanford, 2020).
By contrast, all established antidepressants produce a positive response in the FST, and so it is still used as a predictive screen for new candidate treatments. The procedure has been refined since it was first developed, mainly because it was thought that SSRIs did not reduce immobility. That setback was resolved by scoring several components of animals’ behaviour (swimming and climbing), as well as overt immobility (Cryan et al., 2005).
A common criticism of the use of the FST as a screen for antidepressants is that the reduction in immobility (or increase in swimming/climbing) does not emulate any aspect of the antidepressant response in humans. Even though an increase in motor behaviour (or motor motivation) could be beneficial in depression, this criticism is a red herring. The key point is that a predictive screen does not need the behavioural response of the animal to emulate any aspect of the treatment in humans: the only requirement is for all drugs of a given therapeutic class to induce the same (any) change in animals’ behaviour. So far, the FST has proved an effective screen for all antidepressants that are licensed for that indication, including SSRIs and others that have been developed since then. Its predictive validity has been endorsed by a recent systematic review, which concluded that this test is ‘necessary and evidence based’ (Brandwein et al., 2023).
However, most studies using the FST have tested long-established antidepressants that augment monoamine transmission. The predictive validity of the FST for putative fast-acting antidepressants, which fall into a different pharmacological and therapeutic category, is less certain. Strong support for its validity in that category has emerged from the flurry of studies of the effects of esketamine (the S-enantiomer of ketamine) in the FST, which was prompted by its recent licensing as a fast-acting antidepressant.
So far, nearly all studies have found that S-ketamine reduces immobility of both male and female rats (e.g. Arjmand et al., 2023; Koncz et al., 2023; Pereira et al., 2019) and mice, even when administered as a nebulized formulation (Brandão et al., 2023). Many more positive findings have emerged for tests of the racemic mixture of ketamine in rats (e.g. Clark et al., 2024) and mice (e.g. Bulthuis et al., 2024). There is also evidence that its efficacy in this test is long-lasting (Viana et al., 2020), as is the therapeutic response to S-ketamine. One exception is where rats were repeatedly treated with a high dose of ketamine before carrying out the FST in a way that did not conform to the usual protocol (Zhou et al., 2025). Another is where even the active controls, imipramine and fluoxetine, were ineffective in mice (Medeiros et al., 2025).
Whether or not the predictive validity of the FST extends to psychedelics has yet to be confirmed, and that will be an important challenge. So far, comparatively few studies have looked at the effects of psychedelics in this test, but the majority have reported a reduction in the immobility of rats. They do seem to be less effective in mice, but that limitation might be resolved by refining the dose schedule or other test parameters (see: Brandwein et al., 2023; Table 3).
Table 3. Examples of findings for the effects of psychedelics in the Forced Swim Test (FST).
FSL: Flinders Sensitive Line; NC: no change; WH: Wistar Han; WKY: Wistar Kyoto.
Dose (mg/kg) unless otherwise stated.
Of particular note was the finding that psilocybin did not affect the immobility of the Flinders Sensitive Line strain of rats, which is used in preclinical research of depression (Jefsen et al., 2019). However, this strain has a low density of 5-HT2a receptor mRNA expression in key limbic areas (Osterlund et al., 1999), and so their apparently anomalous response to psilocybin in the FST might actually serve to endorse the reliability of the FST instead.
The focus on 5-HT2a agonism as a key component of an antidepressant response to psychedelics prompts the question of whether their effect on immobility in the FST can be attributed to activation of these receptors. Not many studies have addressed that question, but the reduction in immobility following treatment with psilocyn or DOI (Takaba et al., 2024) or 5-MeO-DMT (Cameron et al., 2023) was prevented by the 5-HT2a antagonist, ketanserin. Functional ablation of the 5-HT2a receptor gene does not affect baseline immobility in this test (Jaggar et al., 2017), and there is inconsistent evidence for its effect on the immobility response to psychedelics. Whereas the response to DOI and lisuride was abolished, the response to psilocybin remained intact (Sekssaoui et al., 2024). There is also evidence that 5-HT1a receptor co-activation makes an important contribution to a reduction in the immobility caused by psychedelics (Głuch-Lutwin et al., 2023), which again suggests that activation of 5-HT2a receptors might be necessary, but not sufficient, for a psychedelic response.
The wide range of experimental parameters that have been incorporated into different studies of the actions of psychedelics in the FST (particularly drug doses and treatment schedules) could explain why no consistent response profile has emerged so far. As a consequence, we can neither be confident about nor rule out the use of the FST as a reliable predictive screen for psychedelic antidepressants, particularly in mice. Given that there are no alternative procedures in prospect, with predictive validity that might match the track record of the FST, there is a pressing need to resolve this question.
The Open Field Test
The Open Field Test (OFT) is a deceptively straightforward, high-throughput procedure that apparently involves merely placing animals in a novel arena and scoring their movements for a few minutes.
When first developed, the Open Field was a large circular arena (at least 1 m in diameter for testing rats), which enabled low-level and even illumination throughout. The measure of primary interest was the rats’ defaecation rate when placed in the apparatus. Although rats with a comparatively high defaecation rate tended to spend more time in the periphery of the arena (‘thigmotaxis’), it was their defaecation rate that was thought to be an indication of their ‘emotional reactivity’. That proposal prompted a program of selective inbreeding of rats (high vs low rate of defaecation in the Open Field) to understand the neurobiological basis of ‘emotionality’ (see: Broadhurst, 1975). The culmination of that effort was the development of the Maudsley Reactive (MR) and Maudsley NonReactive (MNR) strains of rats. However, defaecation, as the primary measure of interest, attracted some criticism (Cunha and Masur, 1978) and was later eclipsed by assessment of animals’ preference for, versus avoidance of, the central zone of the arena. That shift was based on the unconfirmed assumption that animals that spent more time in the central zone were less ‘anxious’ than those that preferred the periphery.
The proposal that emotionality differed in MR and MNR strains of rat was always controversial (Broadhurst, 1976; Commissaris et al., 1986). The possibility that emotionality in the Open Field, and other preclinical tests, is analogous to anxiety in humans was even more uncertain (see: Lister, 1990). Such scepticism was endorsed by studies comparing the effects of anxiolytic drugs on several different aspects of the behaviour of MR and MNR rats, including the Open Field, which failed to validate the proposal (e.g. Rowan and Flaherty, 1991). An objective assessment of the effect of antianxiety drugs on animals’ behaviour in the OFT was that they reduce ambulation (Crawley, 1985), but do not reduce thigmotaxis. In short, it is hard to understand why animals’ centre-field behaviour in this test is still being used as an index of anxiety.
The OFT is also used to evaluate animals’ locomotor activity. This is particularly the case when investigating the actions of putative antidepressants. The intention in this context is to gather evidence that a positive response to a novel antidepressant in tests such as the FST predicts a change in mood/motivation in humans, rather than causing a non-specific increase in motor activity. Yet, there are two reasons why fulfilling that objective is probably not necessary: (1) as explained above, an increase in motor activity could help treat psychomotor retardation in depression and (2) a predictive screen for antidepressants does not need to measure any aspect of mood.
Whatever the case, interpreting results from the OFT needs to take into account the animals’ response to the stress of exposure to a novel arena. This could either increase their movement (rapid exploration to achieve escape) or reduce it (freezing, in the extreme). Because animals’ locomotor activity in the OFT is affected by their stress resilience, which is controversially interpreted as ‘anxiety’, these are not independent variables and so evaluating one, but not the other, will be misleading. This will be a particular problem when using centre-field activity as an index of the animal’s stress response (‘anxiety’). To take account of this interaction, the score for movement within, or directed towards, the central zone for each animal needs to be expressed as a proportion of its total activity in the arena (Salmon and Stanford, 1989). Yet, when using the OFT to assess animals’ ‘anxiety’, most studies merely report the percentage of time animals spend in the central zone; that adjustment does not resolve the issue.
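The proportional score described above can be sketched in a few lines. This is a minimal illustration only, not the procedure of Salmon and Stanford (1989): the function name, the argument names and the example counts are all hypothetical, and ‘activity’ is assumed to be any consistent count (e.g. line crossings or beam breaks).

```python
from typing import Optional


def centre_activity_proportion(centre_activity: float,
                               total_activity: float) -> Optional[float]:
    """Centre-zone activity expressed as a proportion of total activity.

    Both arguments are assumed to use the same unit (e.g. line crossings).
    Returns None when the animal did not move, because the proportion
    is undefined in that case.
    """
    if centre_activity < 0 or total_activity < 0:
        raise ValueError("activity counts must be non-negative")
    if centre_activity > total_activity:
        raise ValueError("centre activity cannot exceed total activity")
    if total_activity == 0:
        return None
    return centre_activity / total_activity


# Hypothetical counts for two animals with the same raw centre-zone score.
active_animal = centre_activity_proportion(centre_activity=20, total_activity=100)
inactive_animal = centre_activity_proportion(centre_activity=20, total_activity=40)
print(active_animal, inactive_animal)
```

With these made-up counts, the two animals have identical raw centre-zone scores but proportions of 0.2 and 0.5, respectively, which is the distinction that a raw centre-zone measure cannot show.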
Another potential confound is the apparatus itself. Despite Broadhurst’s detailed attention to the dimensions and illumination of the apparatus, contemporary Open Fields are usually square and made of Perspex or similar material. Those changes in construction not only introduce corners, which will be a preferred region of the arena, but also make it difficult to prevent shadows and reflective glare. All those technical features, which will affect animals’ behaviour in the OFT (Stanford, 2007; Voikar and Stanford, 2023; Walsh and Cummings, 1976), could be particularly important for psychedelics in view of evidence that LSD affects animals’ sensitivity to light (Cunha and Masur, 1978).
Another complication is that locomotor activity in the OFT will depend on what else the animals are doing. If the experimental intervention affects grooming or rearing, for instance, then ambulation will be affected indirectly because animals cannot carry out those behaviours and move around the arena at the same time. This is potentially particularly important for studies of psychedelics, which induce head-twitches, wet-dog shakes and other motor stereotypies that might disrupt ambulation long after their overt expression, or subliminally with low doses of psychedelics (Bysiek et al., 2025). It follows that any inference about the effect of psychedelics on animals’ locomotor activity in the OFT needs to be considered in the context of a full behavioural profile and depends on confirmation that any change in locomotor activity is not explained by their effects on other aspects of motor behaviour (e.g. Herpfer et al., 2005).
Although there have been many studies of the effects of psychedelics on locomotor activity and other exploratory/vegetative behaviours (Geyer et al., 1986; Halberstadt and Geyer, 2018; Halberstadt et al., 2019), they have all been carried out in what is essentially a novel environment, as in the OFT, or other form of activity meter, and do not consider interactions between different aspects of the animals’ behaviour. This is important because a direct comparison revealed that the effects of test compounds (including hallucinogens) on locomotor activity differ markedly between a novel test arena and the home cage (Robinson and Riedel, 2014).
As a consequence, any or all of the factors above could help explain why the effects of psychedelics on these behaviours, especially their locomotor activity, are not yet consistent enough to support any conclusions about their effects on either motor activity or emotionality (Table 4). Gaining a better understanding of the effects of psychedelics on spontaneous behaviours, without the complication of the stress of exposure to a novel environment, needs a clear assessment of the animals’ baseline activity throughout the circadian cycle while in their home cage (e.g. Porter et al., 2015) and a comparative assessment of the effects of test compounds on their behaviours.
Table 4. Examples of the effects of psychedelics on defaecation and locomotor activity in the Open Field Test.
Dose (mg/kg) unless otherwise stated.
The sucrose preference test (‘anhedonia’)
The reduction in rodents’ sucrose preference after experiencing a regime of chronic, unpredictable mild stressors (CUMS) is interpreted as analogous to anhedonia in humans and so serves as an animal model of this aspect of depression (but see Stanford, 2020). The validation of this model rests mainly on evidence that established antidepressants prevent this stress-induced reduction in motivation to seek reward (reviewed by Willner, 2017).
It is too early to tell whether this antidepressant action extends to psychedelics because only a limited range of compounds has been tested so far. The majority of studies have used either LSD or psilocybin in mice and a dose of 1 mg/kg, regardless of the test compound. An increase in sucrose preference is a common, but not invariable, finding (Table 5).
Table 5. Examples of the effects of psychedelics on animals’ preference for sucrose in the sucrose preference test.
DOI: 2,5-dimethoxy-4-iodoamphetamine; NC: no change.
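The article does not state how sucrose preference is calculated in the studies it cites. As a point of reference only, a widely used convention expresses sucrose intake as a percentage of total fluid intake in a two-bottle choice. The sketch below assumes that convention; the function name and the example volumes are hypothetical and are not taken from any cited study.

```python
from typing import Optional


def sucrose_preference_percent(sucrose_intake: float,
                               water_intake: float) -> Optional[float]:
    """Sucrose intake as a percentage of total fluid intake.

    Assumes a two-bottle choice (sucrose solution vs water), with both
    intakes measured in the same unit (e.g. ml or g). Returns None if
    the animal drank nothing, because preference is then undefined.
    """
    if sucrose_intake < 0 or water_intake < 0:
        raise ValueError("intakes must be non-negative")
    total_intake = sucrose_intake + water_intake
    if total_intake == 0:
        return None
    return 100.0 * sucrose_intake / total_intake


# Hypothetical intakes (ml): equal intake from each bottle means no preference.
print(sucrose_preference_percent(10.0, 10.0))  # 50.0
print(sucrose_preference_percent(15.0, 5.0))   # 75.0
```

Under this convention, a value of 50% indicates no preference, and a stress-induced fall towards that value is what is interpreted as ‘anhedonia’.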
When considering the validity of this test to study depression, or as a predictive screen for putative antidepressants, it should be borne in mind that, although anhedonia is a feature of major depressive disorder in humans, it is also a symptom of autism spectrum disorder, substance use disorder and schizophrenia. Its status as a symptom of schizophrenia is especially relevant given that early preclinical research of the neurobiology of that disorder used psychedelics as the experimental challenge, to produce a rodent model of schizophrenia on the basis that hallucinations are common to both.
In the context of research of depression and its treatment, it should also be noted that procedures for imposing CUMS that are used currently often differ markedly from the protocol developed by Willner (1986). The original procedure used the rodent equivalent of a series of daily ‘hassles’ (mild psychological stressors). However, many CUMS studies now incorporate a series of physiological stressors, such as 24 h food and/or water deprivation, swimming in ice-cold water and prolonged heat stress (45°C), each of which is moderate or severe, particularly when their cumulative severity is taken into account. The justification for such CUMS procedures is not assured, especially when they are compared with the types of non-noxious (psychological) stressors that can trigger, or exacerbate, depression in humans. It follows that the validity of their use, either to study the neurobiology of depression or as predictive screens for putative antidepressants, is highly questionable, both scientifically and ethically.
Finally, the wide range of stressors that have been used to induce anhedonia could well contribute to the disparate effects of psychedelics in these studies (Table 5). As with the FST, when used as a predictive screen, it may well be that the predictive validity of the sucrose preference test would improve if there were a better understanding of the environmental factors that contribute to variability in the response to the drug and of the effective dose regimen, together with better compliance with the need to apply chronic mild stress, both in terms of its duration and intensity.
Final comments
With the possible exception of the HTR, no clear pattern has emerged to confirm the effects of psychedelics on rodent behaviour. Doubtless, this is partly because the pharmacokinetics and pharmacodynamics of all these compounds differ substantially in ways that will affect both the magnitude and the time course of the behavioural response. To allay the burgeoning scepticism about the merits of preclinical experiments, it is essential to pin down and understand the key variables. The success of those investigations will depend on the appropriate design of the entire investigation, as well as individual experiments, and confirmation that the findings are reproducible.
However, more caution should be applied to the anthropomorphic interpretation of individual behaviours (face validity), which is controversial at best and could even be totally spurious. Some outcome measures are clearly more objective and do have obvious analogous relevance to human behaviour. These include, for instance, measures of drug self-administration (to assess the risk of misuse of a new drug), monitoring sleep/eating architecture in studies of insomnia and eating disorders, or assessment of cognitive performance in operant training tasks. By contrast, assessment of animals’ subjective state (mood) is particularly challenging, especially when drawing inferences that are based merely on changes in their motor activity or gustatory preference.
The justification for using animals to develop new psychedelic compounds and to study their underlying biological mechanisms is predicated on the need for the findings to be reliable predictors of therapeutic efficacy (and harms). At the moment, every failure to translate into humans is cited as evidence that animal experiments are irrelevant and misleading. Rebutting this criticism requires the experimental procedures to be valid and the inferences to be cautious and realistic. Only then can we be confident that animal models will make an unassailable contribution to the research of novel psychedelics, as fast-acting antidepressants.
Footnotes
Acknowledgements
I wish to thank the referees for their helpful comments on the first version of this article.
Declaration of conflicting interests
The author declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The author received no financial support for the research, authorship and/or publication of this article.
