Abstract
Given the well-known problems of replicability, how is it that researchers at respected institutions continue to publish and publicize studies that are fatally flawed in the sense of not providing evidence to support their strong claims? We argue that two general problems are (a) difficulties of analyzing data with multilevel structure and (b) misinterpretation of the literature. We demonstrate with the example of a recently published claim that altering patients’ subjective perception of time can have a notable effect on physical healing. We discuss ways of avoiding or at least reducing such problems, including comparing final results with simpler analyses, moving away from shot-in-the-dark phenomenological studies, and more carefully examining previous published claims. Making incorrect choices in multilevel modeling is just one way that things can go wrong, but this example also provides a window into more general problems with complicated designs, cutting-edge statistical methods, and the connections between substantive theory, experimental design, data collection, and replication.
A dozen years ago, Bem (2011) published an article claiming to find extrasensory perception, and this kicked off awareness of a replication crisis in psychology. The experiments in question indeed failed to replicate (Ritchie et al., 2012), but the more general issue remained that this unreplicable and scientifically implausible result had appeared to be supported by rigorous experimentation and analysis (Carey, 2011).
A few years later, the episode was summarized as follows in the news media: Even with all that extra care, Bem would not have dared to send in such a controversial finding had he not been able to replicate the results in his lab, and replicate them again, and then replicate them five more times. His finished paper lists nine separate ministudies of [extrasensory perception]. Eight of those returned the same effect. . . . But for most observers, at least the mainstream ones, the paper posed a very difficult dilemma. It was both methodologically sound and logically insane. (Engber, 2017)
In fact, Bem’s (2011) article contains zero actual replications. What it has could be called “conceptual replications,” open-ended studies that could be freely interpreted as successes through the “garden of forking paths” of data-dependent choices of data coding and analysis (Gelman & Loken, 2014). And the article is not “methodologically sound.” Its conclusions are based on p values, which are statements regarding what the data summaries would look like had the data come out differently, but that article offers no evidence that had the data come out differently, the analyses would have been the same. Indeed, the nine studies in that article feature all sorts of different data analyses.
What is stunning in retrospect is that (a) at the time, the Bem (2011) article looked like standard practice, maybe nothing special but nothing horrible either, whereas (b) now its problems are obvious and just jump out once you know what to look for. It is like one of those vision tests the eye doctor gives, where once you put on the 3-D glasses, the images just leap off the page. We are reminded of the notorious photographic images of fairies from the early 20th century that fooled Arthur Conan Doyle and others but to modern eyes are obvious fakes (Smith, 1997).
The problems of Bem (2011) are now clear, but the publication and promotion of unreplicable research remain a problem despite the progress made by the science-reform movement during the past decade.
In the present article, we explore some general issues by examining in detail a recent psychology article and investigating problems that might not be apparent in a casual reading but nonetheless lead to unreplicability (our data and code are available at https://osf.io/rh6g5/).
Two Factors Leading to Unreplicable Research
We consider two factors that lead to overconfidence in empirical claims from noisy data.
First, psychology experiments often include multilevel structure, for example, from repeated measurements, manipulations applied at the group level, or different raters. A certain amount of complexity in experimental data is typically unavoidable in psychology given that modern research often focuses on interactions: The hypothesis of interest is how a manipulation affects a change rather than an absolute level, how effects differ among groups, or how the effect of one variable depends on the level of another. In addition, high variation between people makes it advisable to perform within-person comparisons when possible, for reasons of both substantive theory and statistical efficiency. But analysis of multilevel data is difficult: It is easy to get apparently strong statistical results from correlated errors, it is not always clear how to perform simple sanity checks of complex analyses, and multilevel modeling introduces its own challenges.
The second problem is misreading of the empirical literature. Results from any single experiment will be open to multiple interpretations because no intervention occurs in a vacuum. New empirical findings are understood in the context of previous work on the topic. It is well known, although perhaps not so well understood, that published results tend to be overly optimistic about effect sizes as a result of low power and selection on statistical significance (Ioannidis, 2008). Beyond this, there are qualitative challenges in interpretation of the literature, with a mismatch between claims being made and the evidence that was used in support of those claims.
Although these problems have been discussed in general terms, they can really be understood only in the context of particular examples: Each statistical analysis presents its own challenges, and each literature review has its own concerns. As a result, we fear that these quantitative and qualitative problems of interpretation of evidence are insufficiently conveyed in the statistics and methods literatures.
In this article, we focus on the data analysis and cited articles of a single work in psychology, a recently published article reporting that altering patients’ subjective perception of time can have a notable effect on physical healing. Beyond revealing specific problems with that work, our investigation demonstrates the effort that can be required to track down what went wrong in a published article. Our purpose here is not specifically to critique that one article; indeed, we could have chosen many others that exhibit very similar problems. This particular article was chosen because it was by researchers from a respected institution, published in a journal that is generally considered legitimate, and demonstrates the following three key factors:
The published article reports large estimated effects from a small study with no clear theoretical justification, the sort of finding that is characteristic of the wave of unreplicable results in psychology, as discussed, for example, by Bishop (2020a).
On the other hand, the results appear at first glance to be unambiguously statistically significant and based on a solid experimental design.
In addition, the published article refers to a substantial literature reporting similar findings, thus potentially reducing the concern about the lack of clear mechanism of action of the treatment.
All three of these attributes are relevant. Without the first attribute—implausibly large effects—this would just be unexceptional science. Without the second attribute—apparent statistical significance—the results could be dismissed as an artifact of noisy data. And without the third attribute—connection to an existing literature—the results would not lead to any clear scientific interpretation. In the present article, we investigate these factors in the context of a single published research article and explain why its apparent statistical significance is a mirage and why its cited literature does not say what is claimed.
Much has been written on the replication crisis in psychology, including methodological studies, recommendations for changes in research and publication practices, organized replication studies, and surveys of the literature. Here, we present a detailed examination of a particular claim, a case study that we see as complementary to the broader takes on replication concerns. We have many times seen illusory statistical significance and inaccurate literature reviews, and it typically takes a bit of digging to track down these problems in each case. By carefully going through these steps in the context of a high-profile example, one can get a sense of how a result can misleadingly appear to be well founded both empirically and theoretically.
A Questionable Finding in Psychology
A recent article (Aungle & Langer, 2023) reported an experiment that “tested whether cupping marks produced by identical cupping treatments healed faster or slower as a function of perceived time.” Cupping “involves creating a localized suction on the skin . . . [leading to] bruising” (Aungle & Langer, 2023). Each of 33 participants was given this treatment three times; in each instance, the participant was given a 28-min-long task, and a photograph of the skin was taken before and after the 28-min interval. The three instances differed in that the experimenters manipulated the “perceived time” of the recovery interval, telling the participant it was 14 min, 28 min, or 56 min. The following results were reported: Healing in the 14-min condition had a mean rating of 6.17 (SD = 2.59, 32 Subjects, 800 ratings); healing in the 28-min condition had a mean rating of 6.43 (SD = 2.54, 33 Subjects, 825 ratings); and healing in the 56-min condition had a mean rating of 7.30 (SD = 2.25, 32 Subjects, 800 ratings). (Aungle & Langer, 2023)
Healing was measured by 25 external raters who were given the before and after photographs and asked to rate on the following scale: “0.0 = not at all healed, 5.0 = somewhat healed, 10.0 = completely healed” (Aungle & Langer, 2023). For each of three comparisons (56-min vs. 28-min conditions, 56-min vs. 14-min conditions, 28-min vs. 14-min conditions), the t score (estimate divided by standard error) was calculated: The resulting values were reported as 7.2, 10.7, and 2.5, respectively (Aungle & Langer, 2023).
Based on our experiences with small-sample studies, these results did not seem plausible. An indirect intervention given to 32 or 33 people yielding a t statistic of 7.2? A t statistic as high as 7 in such a setting would typically occur only for a manipulation check, not for the main finding of a study of a speculative effect; it is a red flag leading us immediately to question the data analysis.
We were also concerned about the larger claims made in the article’s summary: “We show that the effect of time on physical healing is significantly influenced by the psychological experience of time. . . . Our results demonstrate that the effect of time on physical healing is inseparable from the psychological experience of time” (Aungle & Langer, 2023). After a careful look at the statistical analysis, we find that the data do not provide strong evidence of the claimed effect, and even if the statistical results in the article had been correct, this would not demonstrate the inseparability claimed in that statement.
We report our efforts to understand where the claimed results came from, followed by our reanalysis of the data and our consequent understanding of the study and the literature to which it refers. In doing this, we came across a challenge in using multilevel modeling to account for experimental design. We hope this work is useful for future researchers who are analyzing multilevel data structures and for people who are trying to interpret the existing literature.
Reanalysis and Reassessment
Even when the substantive theory underlying a flawed research project is speculative or implausible, it can be helpful to reanalyze the data to better understand how the results of a noisy experiment could have been arranged in a way that convinced authors and reviewers alike that they were seeing strong evidence. In the case of Aungle and Langer (2023), challenges arose when analyzing a multilevel data structure.
Published analysis: multilevel model with varying intercepts
The data and code for the healing experiment are linked from the journal article’s webpage, so we could check the authors’ analyses and conduct our own as well (Aungle & Langer, 2023). It also makes sense to check things with a simpler model (which we report on in the Simple Paired-Comparisons Analysis section) by collapsing the data and comparing averages. We start by examining the analysis that appeared in the published article.
One recommended practice when analyzing data with clustering is to fit a multilevel model (Snijders & Bosker, 1999). In this case, the data are clustered by participant (coded as “Subject” in the data set) and rater (coded as “ResponseId”). For their Table 1, Aungle and Langer (2023) chose the best fit among several models. Here, we show the simplest, as fit in R:
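As a sketch, the varying-intercept model can be fit with the lme4 package; the formula follows the description above, but the data file name and column names here are assumptions based on the posted data set:

```r
# Varying-intercept multilevel model: ratings are clustered by
# participant ("Subject") and rater ("ResponseId").
# File and column names are assumptions based on the posted data set.
library(lme4)

healing <- read.csv("healing.csv")

fit_1 <- lmer(Healing ~ Condition + (1 | Subject) + (1 | ResponseId),
              data = healing)
summary(fit_1)
```

The two `(1 | ...)` terms give each participant and each rater its own intercept, which absorbs the participant-to-participant and rater-to-rater differences in overall ratings.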
In this model, the 14-min condition is the baseline, and the estimated coefficients for the 28-min and 56-min conditions correspond to average treatment effects relative to that baseline.
Before going on, we interpret the error terms in the above fitted model. Based on the fitted model, the measurements vary with standard deviations of 1.07 across participants and 1.22 across raters, and the unexplained, or residual, error has a standard deviation of 1.87, all relative to the 10-point scale of measurement. This all makes sense: Some participants’ bruises will look more serious than others’, different raters use different subjective scales, and there will be additional variation across measurements.
Multilevel model with varying intercepts and slopes
When estimating a treatment effect under a cluster design, it is not enough to fit a multilevel model with varying intercepts. The slopes—that is, the treatment effects—must also be allowed to vary, in accordance with the general principle of the design and analysis of experiments that the error term for any comparison should be at the level of analysis of the comparison; see, for example, Cochran and Cox (1957) and Barr et al. (2013). In the terminology of the analysis of variance, the treatment is applied between groups (subjects and raters), and so the estimated effect must be compared with a between-groups variance. This can be done by slightly extending the fitted multilevel model to allow the treatment effects to vary by subjects and raters:
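A sketch of this extension in lme4, with the treatment factor now entering the random-effects terms for both participants and raters (file and column names are again assumptions based on the posted data set):

```r
# Varying intercepts and varying slopes: the treatment effects are
# allowed to differ across participants and across raters.
# File and column names are assumptions based on the posted data set.
library(lme4)

healing <- read.csv("healing.csv")

fit_2 <- lmer(Healing ~ Condition +
                (1 + Condition | Subject) +
                (1 + Condition | ResponseId),
              data = healing)
summary(fit_2)
```

With this specification, the standard errors of the average treatment effects reflect how much the effects themselves vary from participant to participant and rater to rater, which is the appropriate error term for a treatment applied between groups.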
The estimated average treatment effects are similar to before, but the standard errors are much bigger. The fitted model estimates a high variation of effects across participants: The effect of 28 min (compared with the baseline of 14 min) is estimated to have a standard deviation of 1.99, and the effect of 56 min is estimated to have a standard deviation of 2.03. Those standard deviations are much higher than the estimated mean effects of 0.25 and 1.09, respectively; thus, according to the fitted model, the estimated effects are not consistent in their sign or their magnitude. In this case, the most important step was to allow treatment effects to vary by subjects. The variation of effects by raters is small, but it did not hurt to include that in the model. It would also be possible for the variation and the mean to vary by group, and various alterations of the model will slightly change the estimated effects and uncertainties. Our point here is not to offer a definitive analysis of these data but rather to understand how the published results had been so inappropriately strong.
An additional concern arises from the high estimated correlations of the varying slopes; indeed, the estimated correlation matrix for the varying intercepts and slopes for raters is not positive definite, which results in a warning message when the model is fit in R. Given the small number of groups, this sort of degeneracy in the maximum likelihood estimate of the covariance is not unexpected; see Chung et al. (2014). We checked our result by running a fully Bayesian analysis, which accounts for uncertainty in the estimation of these variance components. In this case, the result was essentially the same, and so we stick with the analysis shown above. Including other predictors into the model also left the estimates and standard errors of the average treatment effects essentially unchanged.
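One way to run the fully Bayesian check described here is with the brms package, which fits the same varying-intercepts-and-slopes specification in Stan; its default priors regularize the variance components that degenerate under maximum likelihood. This is a sketch with assumed file and column names:

```r
# Bayesian fit of the varying-intercepts-and-slopes model via brms/Stan.
# The priors on the random-effects covariance matrices regularize the
# estimates that are degenerate under maximum likelihood.
# File and column names are assumptions based on the posted data set.
library(brms)

healing <- read.csv("healing.csv")

fit_bayes <- brm(Healing ~ Condition +
                   (1 + Condition | Subject) +
                   (1 + Condition | ResponseId),
                 data = healing)
summary(fit_bayes)
```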
To summarize, the fitted model shows evidence for an average effect of the 56-min condition compared with the 14-min condition but not of the 28-min condition compared with the 14-min condition. Both effects are estimated to vary by a large amount across participants, implying that both the sign and the magnitude of the effects are highly variable. There is essentially zero variation in treatment effects across raters, which makes sense given that the raters are reacting to the images given to them and are not otherwise affected by the treatments.
Referring to the estimated intraclass correlation (ICC) in their fitted model, Aungle and Langer (2023) wrote, “A lower ICC value suggests that there was less variability in healing outcomes between subjects, indicating that the condition effect was relatively consistent across subjects.” This claim is in error: The model that they fit assumes constant treatment effects and thus offers no information at all regarding consistency of effects across subjects.
Simple paired-comparisons analysis
One frustrating aspect of this problem is that the statistical issue is clear—researchers want to obtain estimates and uncertainties of treatment effects in the presence of clustering—but textbook recommendations for analysis can be hard to follow. The multilevel analysis performed by Aungle and Langer (2023) looks superficially reasonable but is missing the all-important variation in treatment effects. We solved this problem by including varying slopes, but this leaves a lingering suspicion that some additional analysis step might still be missing.
One way to get a handle on the problem is to perform a simpler analysis. To start with, we consider each of the comparisons (56 min vs. 28 min, 56 min vs. 14 min, and 28 min vs. 14 min) as its own problem, thus avoiding the difficulties arising from analyzing an experiment with three treatment levels. Next, we simplify further by working with the mean of the 25 measurements for each person and each condition. This leaves us with a simple matched-pair design for each of the three comparisons, which we can then estimate in the usual way by computing the difference in outcome between the two conditions for each person and then summarizing by the mean of these differences along with its standard error, the standard deviation of the differences divided by the square root of the number of participants.
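In code, the collapsed matched-pair analysis for one comparison (56 min vs. 14 min) might look like the following; the column names and condition labels are assumptions based on the posted data set:

```r
# Simple paired-comparisons check: average the 25 ratings for each
# participant in each condition, then compare conditions within person.
# Column names and condition labels are assumptions.
healing <- read.csv("healing.csv")

means <- aggregate(Healing ~ Subject + Condition, data = healing, FUN = mean)
wide  <- reshape(means, idvar = "Subject", timevar = "Condition",
                 direction = "wide")

d   <- wide[["Healing.56"]] - wide[["Healing.14"]]  # per-person difference
d   <- d[!is.na(d)]                  # drop participants missing a condition
est <- mean(d)
se  <- sd(d) / sqrt(length(d))
c(estimate = est, std_error = se, t = est / se)
```

Because each participant serves as their own control, this collapsed analysis respects the clustering by construction, which is what makes it a useful sanity check on the multilevel model.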
This simple analysis is not intended to be an alternative to the multilevel model; it is just a way to compare it with something more easily understandable. In this case, we see no clear alternative to fitting a multilevel model with varying intercepts and slopes or using a procedure such as clustered standard errors, which would have its own theoretical and practical complications (Abadie et al., 2023).
Reevaluating the published claims
How does one think about the claims of Aungle and Langer (2023) now that its t scores have been downgraded from 10.7 to 3.0 (for the 56 min vs. 14 min comparison) and from 2.5 to 0.7 (for the 28 min vs. 14 min comparison)? As noted above, in addition to these possible average effects, any effect of the manipulation on healing is estimated to be highly variable, sometimes positive and sometimes negative.
We are skeptical that this study reveals anything about the effect of perceived time on physical healing for four reasons.
First, the statistically significant result that appeared is one of many comparisons that could have been made. Data were also gathered on participants’ anxiety, stress, depression, mindfulness, mood, and personality traits, implying many possible analyses that could have been performed. In the absence of preregistration, there is just no way of knowing what might have been done had the data turned out differently, and the result is that the appearance of a comparison that is 3 SE away from zero does not necessarily represent strong evidence of an effect (Simmons et al., 2011). To the extent that perceived time could affect healing, it would be easy to come up with hypotheses why such effects would occur only for patients with high or low levels of anxiety or stress or different levels of mindfulness, under only some conditions of mood, or for only some sorts of personality profiles because any of these could be related to “specific networks of expectations, physiological responses, and beliefs associated with participants’ concepts of time,” which is one of several speculative explanations offered by Aungle and Langer (2023) for their findings.
Second, the large estimated variation in effect size across people (Aungle & Langer, 2023) implies that any estimated average effect will be highly contingent on who happens to be in the study, and there is no reason to believe that the particular 33 people in the experiment are representative of any larger population of interest. This issue would always arise when attempting to use data from a lab experiment to generalize to the outside world, but it would have been less of a concern if the effect estimate had really been 10 SE away from zero because that would imply a consistency in the effect that would make generalization easier to swallow.
A third reason for skepticism is that any effect would be expected to vary not just across people but also across situations, leading to the same concern about interactions and stability. Aungle and Langer (2023) referred to “healing,” but the experiment did not involve the participants experiencing any sickness or injury beyond very mild bruising. Indeed, the technique of cupping is often promoted by proponents of alternative medical treatments as having healing properties of its own; on this account, what the patients experienced could have been explained as an interaction between psychological factors and the purported mechanism of action of cupping itself (which, to the best of our knowledge, remains unidentified; Singh & Ernst, 2008, p. 307).
A fourth concern is with the mechanism of action because it is not clear how the perception of time would affect changes in the skin in this setting. Aungle and Langer (2023) referred to “mind–body unity” and “the importance of psychological factors in all aspects of health and wellbeing [sic],” and we would not want to rule out the possibility of such an effect, but no mechanisms are examined in this study, so the result seems at best speculative, even taking the data summaries at face value. During the half hour of the experimental conditions, the participants were performing various activities on the computer that could affect blood flow, and these activities were different in each condition (watching videos under one condition, playing Tetris in another, playing a different video game in the third). In addition, there is no mention that the experimenter, who took the photos, was blind to the condition. There were many things going on in the experiment, and it seems to us to be a strong claim to attribute the observed differences to “perceived time” rather than to any of the other factors that were varying across the three conditions or even to just the general ability of researchers conducting uncontrolled studies to find patterns from noise.
In raising these concerns, we are not saying that the substantive conclusions of the study are necessarily wrong, just that there are many alternative explanations for the results that we find just as scientifically plausible as the published claim that “the effect of time on physical healing is significantly influenced by the psychological experience of time” (Aungle & Langer, 2023).
Problems With Citation of the Previous Literature
Unreplicable claims based on weak theory can gain apparent support by connections to related published work. Three problems can arise.
First, the connections between the cited literature and the new study can be tenuous, and this can particularly be an issue when the underlying theory is vague. Ideas such as embodied cognition, evolutionary psychology, nudging, mindfulness, or mind–body unity are general enough to encompass a wide range of potential phenomena, to the extent that there is almost no limit to the past studies that could be thought to have some possible relevance to any new experiment.
Second, informal literature reviews are subject to selection bias. An article promoting a controversial idea can easily cite studies claiming to have found evidence for related ideas while avoiding citations of failed replications or articles suggesting alternative theories. This can even be a problem with systematic meta-analyses if the entire subfield being meta-analyzed is full of studies with uncontrolled researcher degrees of freedom (Gelman, 2022).
Third, the interpretation of individual studies being cited can be seriously flawed. This is a problem of citing past literature as support for a general claim without looking at exactly what was done in the cited research and without following up on that work. Here, we discuss three different examples of this sort of misinterpretation of the literature cited in Aungle and Langer (2023).
Doctors’ assurance and allergic reactions
Aungle and Langer (2023) provided context for their study by citing related findings on “surprising mind-body unity phenomena.” One of these came from Leibowitz et al. (2018), which reported that patients who received a histamine skin prick reported less itchiness if they were assured by their health-care provider that “from this point forward your allergic reaction will start to diminish, and your rash and irritation will go away” (p. 2051). However, the statistically significant (
Leibowitz et al. (2018) concluded: We suspect the present study is a conservative test of this effect since participants were healthy volunteers whose allergic reactions were unlikely to be highly stressful or concerning, and allergic reactions were expected to decline over time even without intervention. (p. 2052)
This is highly speculative, and indeed, it would be easy to make a case for the exact opposite. If one wants to make this claim—or its opposite—it would make sense to go and test some more difficult patients.
If a one-sentence reassurance could reliably reduce short-term pain, this could have immediate implications in health-care practice, and so it would seem advisable for someone who believes in this result to conduct a careful replication study. A Google Scholar search conducted 5 years after the study’s publication showed 28 citations, none of which replicated the original experiment. The closest empirical study we found in these references was Leibowitz et al. (2019), which was also cited by Aungle and Langer (2023) as one of the “surprising mind-body unity phenomena” and studied “various mechanisms of open-label placebo treatments: a supportive patient-provider relationship, a medical ritual, positive expectations, and a rationale about the power of placebos” (Leibowitz et al., 2019, p. 613), looking at outcomes on allergic responses but not on itchiness. That study reported “no main effects of condition on allergic responses” but statistical significance on some particular interactions. Again, given the many possible interactions that could be studied, this does not represent strong evidence for the replicability of whatever happened to show up in this particular sample.
“Countless studies, many of which stand up to replication and rigorous scrutiny”
Aungle and Langer (2023) wrote, Previous research has found the skin is quite responsive to expectations. For example, patients who received physician assurances after skin pricks healed significantly faster, and the suggestion that one had touched poison ivy resulted in stronger symptoms than actually touching poison ivy.
The first of these claims refers to the aforementioned Leibowitz et al. (2018) study; the second points to Ikemi and Nakagawa (1962), which included a qualitative study of 13 boys who were exposed on one arm to the poisonous leaves of the lacquer or wax tree (not actually poison ivy) and on the other arm to inert leaves but were told that the exposures were the reverse. Under various conditions, skin reactions occurred on the arm where the boys were told the poisonous leaves had been applied rather than at the actual location of the contact. In a later review article, W. A. Brown (2015) described that study as “as baffling today as it was when it first appeared” and continued, This contact dermatitis study has not been replicated so it’s hard to know just how solid its remarkable findings may be. Nevertheless this study does not stand alone. . . . Countless studies, many of which stand up to replication and rigorous scrutiny, show that the power of expectation is as dramatic and perplexing as it was in the poison leaf study. (p. 19)
The phrase “countless studies, many of which stand up to replication and rigorous scrutiny” (W. A. Brown, 2015) is interesting in that it asks the reader to consider as evidence some large number of studies (the difference between “countless” and “many”) that do not stand up to replication and rigorous scrutiny. In this case, Aungle and Langer (2023) cited small uncontrolled studies that have not been replicated, which suggests to us that the many studies that stand up to replication and rigorous scrutiny are either hard to find or else not directly relevant to the claims being made in their article.
Exercise beliefs and biometric outcomes
Aungle and Langer (2023) also reported the following claim, which would be stunning if it held up under replication: If a person who does not exercise weighed themselves, checked their blood pressure, took careful body measurements, wrote everything down, maintained their same diet and level of physical activity, and then repeated the same measures a month later, few would expect exercise-like improvements. But in a study involving hotel housekeepers, that is effectively what the researchers found.
After a careful study of the reference (Crum & Langer, 2007), we are again skeptical. The treatment in this experiment was to inform the hotel housekeepers (in this study, 84 women working at seven hotels) “that the work they do (cleaning hotel rooms) is good exercise and satisfies the Surgeon General’s recommendations for an active lifestyle” (p. 165).
We have two reasons to doubt the above-quoted summary.
First, the reported changes seem implausibly large for population effects, with the women receiving the brief intervention seeing an average drop of two pounds of weight, half a percentage point of body fat, and five or 10 points of blood pressure a month later compared with the women in the control group. We would be inclined to attribute such large apparent effects to chance variation in the data. However, for many of the outcomes being studied, these differences are 2 or 3 SE from zero, which would seem to be unlikely to occur by chance alone.
Some of the apparent strength of the statistical patterns arose from clustering in the design that was not accounted for in the analysis, similar to the problem with the multilevel analysis discussed above. In the Crum and Langer (2007) article, the intervention was applied at the hotel level, with workers at four hotels receiving the treatment and at the other three receiving the control, but the published analysis does not appear to have accounted for this clustering. A more appropriate analysis would use a multilevel model with intercept and treatment effect varying by group, as described above, but with groups being hotel rather than participant in this case. This correction for clustering reduces the t statistics for the changes in biometric outcomes, but some of them still remain above the conventional level of statistical significance.
There is also a concern about the possibility of systematic error in body measurements if the experimenter was not blinded to the treatment. In addition, there were many missing observations, and some people in the data set had body mass index data that were not consistent with their recorded heights and weights. Flexibility in data coding, measurement, and analysis could suffice to explain the observed patterns in the data.
Stepping back, it is a stretch to expect that a presentation on how work is good exercise would result in major changes, especially for a group of people who have “maintained their same diet and level of physical activity” (Crum & Langer, 2007). We would expect that a one-shot study of 84 people would be too noisy to discover any plausible effect after 4 weeks—a period that is not only short but also arbitrary in the absence of any theoretical account of how the intervention might work without inducing change in diet and exercise. Even if such an effect exists, we would not expect it to work or to go in the same direction for everyone, and any average effect should be small. As explained by Button et al. (2013), when a noisy experiment is performed to study a small effect, any statistically significant result will tend to greatly overestimate the true underlying effect; this is called the “winner’s curse,” “statistical significance filter,” or “type M” (magnitude) error (Gelman & Carlin, 2014).
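The winner’s-curse logic can be illustrated with a short simulation. The particular values here (true effect 0.1, standard error 0.3) are assumptions chosen only to represent a small effect measured noisily; any similar pairing behaves the same way.

```python
import random
import statistics

random.seed(1)
true_effect, se = 0.1, 0.3  # assumed: small true effect, noisy estimate

# Replicate the same noisy study many times.
estimates = [random.gauss(true_effect, se) for _ in range(100_000)]

# Keep only the "publishable" results: estimates more than 2 SEs from zero.
significant = [est for est in estimates if abs(est) > 2 * se]

exaggeration = statistics.mean(abs(est) for est in significant) / true_effect
print(f"{len(significant) / len(estimates):.1%} of studies reach significance; "
      f"those that do overestimate the effect by a factor of {exaggeration:.1f}")
```

Conditioning on statistical significance selects exactly those replications in which noise happened to push the estimate far from zero, so the significant estimates are, on average, several times larger than the true effect.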
Our second concern with the above quote is the claim that the women in the study “maintained their same diet and level of physical activity.” Crum and Langer (2007) indeed stated that “actual behavior did not change” but did not report any direct measures of diet and physical activity at either the start or end of the study, just information from a retrospective questionnaire. It is problematic to take survey responses as measures of actual behavior, especially in the context of a study of an intervention specifically designed to alter perceptions of exercise. The interpretation given by Aungle and Langer (2023) requires that the words given to the participants at the beginning of the study could affect measures that are the result of physiological processes, such as weight, body-fat percentage, and blood pressure, without having any effect on survey responses on exercise and diet.
Beyond this, the data in Crum and Langer (2007) actually did show a large increase in perceived amount of exercise (the average going from 3.8 to 5.7 on a scale from 0 to 10), so if the survey responses are to be believed, this directly contradicts the claim that the participants “maintained their same diet and level of physical activity.” They also reported that “there were no significant changes in subjects’ substance abuse and diet.” However, with such a small sample, a lack of statistical significance does not imply that real changes were zero or even that they were small. All of this is in addition to the differences between actual diet and retrospective self-reports.
In short, to the extent that the intervention caused the physiological changes measured in Crum and Langer (2007), the data are consistent with the reasonable hypothesis that these were associated with behavioral changes, and so we do not think it makes sense for Aungle and Langer (2023) to cite this study as evidence for the claim that “the benefits of exercise” do not “require the act of exercising, or at the very least an increase in physical activity or change in diet.” Reaching such a conclusion requires, first, the statistical error of treating a nonstatistically significant result as zero (“no change in workload, exercise habits, overall physical activity, or diet”) and, second, discounting the actual survey responses in which participants reported exercising more in that study.
Discussion
A naive reader of many discussions of the replication crisis in science might gain the impression that all would be well if scientists were merely to follow open-science protocols and avoid certain questionable research practices. The problems go deeper, however: difficulties of statistical analysis of real data and misinterpretations of the scientific literature, two issues that arise more generally in unreplicable subfields of research. We hope that our careful exploration of these issues in a particular example gives insight into a problem that goes far beyond the literature in mind–body unity.
In addition, to the extent that there is general interest in the claim that physical healing can be affected by manipulating the psychological experience of time, or that a one-sentence reassurance could reliably reduce short-term pain, or that it is possible to gain the benefits of exercise while maintaining the same diet and level of physical activity—and given publication, citation, and media attention to such claims, they do seem to be of general interest—it should also be of interest to learn that the evidence for such claims is not nearly as strong as has been presented in the literature. And it is worth understanding the specifics of how an article published by researchers at a respected university could go so wrong.
Challenges of Accounting for Complex Designs
As illustrated in the Reanalysis and Reassessment section, data analysis in the real world can be difficult. Aungle and Langer (2023) followed general recommendations to use multilevel modeling when analyzing clustered data, but even so, they got tripped up and did not realize they needed to allow the treatment effects, not just the intercept, to vary by participant. And it seems that none of the reviewers of the article caught this error either. Indeed, we noticed it only because we were tipped off by the unrealistically large t statistics and then were able to download and interpret the authors’ code.
The most natural advice at this point would be to say that when estimating a causal effect (or more generally, a regression coefficient) from data with a multilevel structure, it is necessary to allow both the intercepts and effects to vary along each grouping structure. In practice, however, this is a challenge—first, because this particular issue is not always covered in textbooks; second, because including additional variance components to a model can make it less stable to fit, especially when the number of measurements and people in the study is small; and third, because modeling choices will still arise, for example, which other predictors to interact with the treatment indicator and how to code a treatment with multiple levels.
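In notation, the advice amounts to fitting a model of the following form for outcome $y_{ij}$ of measurement $i$ in group $j$, with treatment indicator $T_{ij}$ (a varying-intercept, varying-slope model; the symbols here are generic, not taken from the original article):

```latex
y_{ij} = \alpha_j + \beta_j T_{ij} + \epsilon_{ij}, \qquad
\alpha_j \sim \mathrm{N}(\mu_\alpha, \sigma_\alpha^2), \quad
\beta_j \sim \mathrm{N}(\mu_\beta, \sigma_\beta^2), \quad
\epsilon_{ij} \sim \mathrm{N}(0, \sigma_y^2).
```

The error described above corresponds to dropping $\beta_j$ and fitting a single common slope $\mu_\beta$, which treats repeated measurements on a person (or workers in a hotel) as if they carried independent information about the treatment effect.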
Similar problems arise with the general recommendation in observational studies to adjust for all relevant pretreatment variables. The number of possible adjustment variables can be large, in which case, researchers will need to choose what predictors to include and how to parameterize them, and it can be necessary to go beyond simple least squares adjustment.
A different direction would be to obtain standard errors by bootstrapping, with the resampling respecting the design of the study. Such an approach can work well but again, will require care in getting it to work in problems with nonnested structures, such as in the Aungle and Langer (2023) study. Our message here is not that complex designs cannot be analyzed or should not be conducted—indeed, one of us is on record as recommending within-person comparisons in psychology experiments (Gelman, 2018). Rather, this is just a reminder that statistical analysis of such data is far from routine, even for randomized experiments.
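A minimal sketch of a design-respecting bootstrap, using made-up toy data: entire hotels are resampled with replacement, so the bootstrap variation reflects the hotel-level assignment of treatment. (All outcome values below are hypothetical.)

```python
import random
import statistics

random.seed(2)
# Toy data: (treatment status, outcomes for that hotel's workers) per hotel.
hotels = [
    (1, [2.1, 1.8, 2.5, 1.9]), (1, [1.2, 1.6, 1.4]),  # treated hotels
    (0, [0.9, 1.1, 0.8, 1.0]), (0, [1.5, 1.3, 1.7]),  # control hotels
]

def effect(sample):
    """Difference in mean outcome, treated minus control workers."""
    t = [y for trt, ys in sample if trt == 1 for y in ys]
    c = [y for trt, ys in sample if trt == 0 for y in ys]
    return statistics.mean(t) - statistics.mean(c)

boot = []
while len(boot) < 2000:
    resample = [random.choice(hotels) for _ in hotels]  # resample whole hotels
    if {trt for trt, _ in resample} == {0, 1}:  # need both arms present
        boot.append(effect(resample))
se_boot = statistics.stdev(boot)
print(f"estimate: {effect(hotels):.2f}, cluster-bootstrap SE: {se_boot:.2f}")
```

Even this simple version requires a judgment call (discarding resamples with only one treatment arm), which illustrates the point that bootstrapping is not automatic for nonnested or small-cluster designs.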
Applied researchers remain in the awkward position of needing to use statistical methods for which there is no clear guidance and whose results can be difficult to understand. Ultimately, this reflects a fundamental issue that in all but the simplest designs, uncertainties in estimation depend on aspects of treatment-effect variation that themselves must be estimated from the data. This represents a challenge not just for practitioners but also for software developers and authors of textbooks and expository articles: It is impossible to give advice that is general, easy to follow and understand, and correct. As the saying goes, choose at most two.
Misinterpretation of the Literature and Relevance to the Replication Crisis
In the article under discussion, Aungle and Langer (2023) followed the pattern of much of the unreplicable research we have seen in psychology: a study of a highly speculative claim with results whose apparent statistical significance fades upon careful analysis. The problems were not apparent to casual reviewers, and the article was published in a legitimate journal and received some uncritical publicity (Carroll, 2024; DeSmith, 2024; Langer & Aungle, 2024; Levitt, 2024; Peterson, 2023; Plain, 2024; Rand, 2023). Beyond the specific problems with the statistical analysis, the authors featured a claim (“the effect of time on physical healing is inseparable from the psychological experience of time”) that was not directly addressed by the experiment and thus could not have been supported even if one were to take the reported quantitative summaries at face value. Other work in this literature has been similarly criticized on both methodological and theoretical grounds, and these criticisms are not new (Coyne, 2014; Liberman, 2009).
One recurring feature in the replication crisis is a style of writing and presentation that can give a misleading appearance of coherence to a diverse literature. Conceptual replications are valuable, but they represent yet another source of researcher degrees of freedom (Simmons et al., 2011): If a conceptual replication goes in the direction that is consistent with the story that researchers want to tell, they can label it as a replication and use it as evidence in favor of their theory; if it yields a null result or goes in the opposite direction, they can emphasize the differences between the different studies. This relates to the points made by Bishop (2020b) regarding researchers’ cognitive processes when designing experiments and interpreting their results.
Another issue we have seen before is reliance on earlier studies with flawed design and analysis. This arose in a recent meta-analysis of nudge interventions that was based on a large number of published articles that were subject to selection bias in what summaries they included and several articles that had been retracted or discredited because of potential fraud. Even without the inclusion of the fraudulent research, we judged that the potential for selection bias and effect-size variation in the mass of studies in the meta-analysis made its conclusions close to worthless (Szászi et al., 2022).
Aungle and Langer (2023) had similar problems, uncritically citing work that, when studied carefully, did not offer strong evidence in favor of their claims. One of these cited articles, Crum and Langer (2007), supported its argument for the plausibility of large effects by pointing to various studies of the placebo effect, including a newspaper report (Blakeslee, 1998) that referred to the unreplicated study of Ikemi and Nakagawa (1962) discussed above. This is a sort of game of telephone in which various problematic or unreplicated studies get referenced in a way that can make them appear to represent a consistent literature or a web of evidence, and it arises from a disconnect between scientific procedures and scientific theories (Devezer et al., 2021). Loose theory plus loose criteria for evidence combine to allow a literature to be built on sand, an issue discussed by Oberauer and Lewandowsky (2019).
Recommendations
In this article, we have considered several challenges:
At the technical level, there is not always a clear pathway for researchers to analyze data from complex designs to obtain efficient inferences while avoiding overconfidence. This is an issue that will not be going away given the increasing interest in varying treatment effects, within-person studies, and the need for more elaborate analyses to generalize from observational or experimental data to larger populations under more realistic scenarios.
The evidence provided in any particular study can depend strongly on the average size and variation of the underlying effect. Claims—explicit or implicit—about effect sizes are commonly based on a literature that is full of estimates that are biased wildly upward.
Readers of published articles often need to resort to a sort of forensics, especially with studies that are not preregistered, in which there can be many researcher degrees of freedom in data exclusion, coding, and analysis.
All these problems arise even when authors are acting in good faith. It is important to be able to criticize published research without impugning the integrity of the researchers; conversely, researchers should not fool themselves into thinking that they cannot publish work with serious and avoidable errors just because they are morally upright.
All these issues arose with the study we have considered here on physical healing as a function of perceived time (Aungle & Langer, 2023). The underlying claim—that manipulating a clock to alter patients’ subjective recovery time can affect actual physical recovery—does not fit in well with the standard paradigm of medicine, and Aungle and Langer (2023) did not offer any mechanistic theory of action. As a result, they are under some burden to argue for the plausibility of this entire line of research, which unfortunately is itself based on studies with similar methodological flaws (small samples, measurements too noisy to study effects that are realistically small and highly variable, misapplied statistical methods, and workflows with enough researcher degrees of freedom to make it possible to find apparent statistical significance even in the absence of any underlying consistent effects). Finding these errors took effort on our part, although this was facilitated by the openly posted data of Aungle and Langer and the openness of Crum and Langer to share the data from their 2007 article.
Beyond our immediate recommendation not to trust these particular published claims of mind–body effects on healing, weight loss, and blood pressure, we can offer some general advice.
In the short term, interpret studies in light of realistic possible effect sizes and accept that many experiments are just too small and noisy to provide useful scientific information: Even when some statistically significant comparisons can be found, they can be explained by chance variation. In the analysis stage, be aware of the challenges of design and adjustment in the presence of multilevel structure. When it is time to summarize and write up the study, present all comparisons of interest, ideally in graphical form, rather than focusing only on the largest or those that reach some statistical-significance threshold. When an interesting result arises, nail down the finding by designing and carrying out an exact replication. Contrary to all your expectations, the replication might fail; indeed, that is the reason for performing the replication in the first place (Nosek et al., 2012).
In the medium term, design new studies on the basis of plausible hypothetical models and then preregister before collecting data. Preregistration is a floor, not a ceiling: Use it to specify initial analyses with the understanding that you can and should go further when you see unexpected patterns in the data. The design and preregistration stage is a good time to think hard about effect sizes and their variation and to understand an experimental design using simulated data (Gelman, 2024). If a small study appears to reveal a potentially interesting result, the next scientific step is to probe it carefully with future experiments, not to treat it as an established fact in the later literature.
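A simulated-data design check of the sort just described can be brief. The values below are assumptions for illustration: a true effect of 0.1 standard deviations and 42 participants per arm, echoing the scale of an 84-person study.

```python
import random
import statistics

random.seed(3)

def one_study(n_per_arm=42, effect=0.1, sd=1.0):
    """Simulate one two-arm study; return True if it reaches |t| > 2."""
    t = [random.gauss(effect, sd) for _ in range(n_per_arm)]
    c = [random.gauss(0.0, sd) for _ in range(n_per_arm)]
    diff = statistics.mean(t) - statistics.mean(c)
    se = (statistics.variance(t) / n_per_arm
          + statistics.variance(c) / n_per_arm) ** 0.5
    return abs(diff / se) > 2

power = sum(one_study() for _ in range(5000)) / 5000
print(f"simulated power: {power:.2f}")
```

Under these assumptions, the design detects the effect only a small fraction of the time, and by the winner’s-curse logic above, the rare significant results will be large overestimates. Running such a simulation before collecting data makes the tradeoff visible at the design stage.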
Recognizing the importance of preregistration, journals, including AMPPS, are steadily increasing their support for it. The strongest current form of preregistration, Registered Reports, requires authors’ prespecified statistical-analysis plans to be reviewed before data collection starts, which brings external, impartial eyes to the process. Registered Reports show considerable promise in reducing publication bias (Scheel et al., 2021).
In the longer term, we hope that default analyses and workflows will keep up with advances in data collection and modeling and that the presence of stronger studies in the literature, along with formal replication studies, will allow researchers to avoid being trapped in a loop of pseudoreplication. When it comes to the study of mind-body interaction, we recommend moving away from shot-in-the-dark phenomenological studies—black-box experiments designed to demonstrate that an intervention has some effect, in the absence of any model of how it might work.
The above recommendations should seem reasonable, but none of them are easy even if—especially if—researchers are working from a position of honesty and transparency, which we believe characterizes ourselves and also the authors of the articles under discussion. Analyzing multilevel, panel, time-series, spatial, and network data is hard. There are statistical and computational literatures on all these topics, but these literatures often give conflicting recommendations. One of us has written a book on multilevel modeling and still remains confused about general recommendations for the inclusion of interactions when analyzing data with nonnested multilevel structure. So we cannot simply advise practitioners to solve their analysis problems by asking a nearby statistician. Hypothesizing effect sizes for simulated-data experimentation is another difficult task, requiring the hard work of making strong assumptions and committing to them, at least for the moment. We argue that this effort would be well spent, but it adds to the cognitive cost of conducting a study. It is far more work to design, hypothesize, preregister, and conduct a study than to just gather some data and run with them. Reading the literature with a critical eye can also be hard work—as we demonstrated to ourselves in preparing the present article—work that can seem wasted if the ultimate conclusion is not to trust that literature. It is far easier to take titles and abstracts at face value and perhaps to pipe previously published results into a meta-analysis.
In short, we are recommending a steady dose of blood, toil, tears, and sweat, with the argument being that this is the only way to make progress when working in a field in which effects are small and highly variable.
Statistical and Conceptual Problems Go Together
We have focused our inquiry on the Aungle and Langer (2023) article, which, despite the evident care that went into it, has many problems that we have often seen elsewhere in the human sciences: weak theory, noisy data, a data structure necessitating a complicated statistical analysis that was done wrong, uncontrolled researcher degrees of freedom, lack of preregistration or replication, and an uncritical reliance on a literature that also has all these problems.
Any one or two of these problems would raise a concern, but we argue that it is no coincidence that they all have happened together in one article, and as we noted earlier, this was by no means the only example we could have chosen to illustrate these issues. Weak theory often goes with noisy data: It is hard to know how to collect relevant data to test a theory that is not well specified. Such studies often have a scattershot flavor, with many different predictors and outcomes being measured in the hope that something will come up, thus yielding difficult data structures requiring complicated analyses with many researcher degrees of freedom. When underlying effects are small and highly variable, direct replications are often unsuccessful, leading to literatures that are full of unreplicated studies that continue to get cited without qualification. This seems to be a particular problem with claims about the potentially beneficial effects of emotional states on physical-health outcomes; indeed, one of us found enough material for an entire PhD dissertation on this topic (N. J. L. Brown, 2019).
Finally, all of this occurs in the context of what we believe is a sincere and highly motivated research program. The work being done in this literature can feel like science: a continual refinement of hypotheses in light of data, theory, and previous knowledge. It is through a combination of statistics (recognizing the biases and uncertainty in estimates in the context of variation and selection effects) and reality checks (including direct replications) that we have learned that this work, which looks and feels so much like science, can be missing some crucial components. This is why we believe there is general value in the effort taken in the present article to look carefully at the details of what went wrong in this one study and in the literature on which it is based.
Correction (March 2025):
Article updated to correct the order of the manuscript’s received and revision accepted dates.
Transparency
Action Editor: Pamela Davis-Kean
Editor: David A. Sbarra