Abstract
Psychologists are often interested in the effect of an internal state, such as ego depletion, that cannot be directly assigned in an experiment. Instead, they assign participants to a manipulation intended to produce this state and use manipulation checks to assess the manipulation’s effectiveness. In this article, I discuss statistical analyses for experiments in which researchers are primarily interested in the average treatment effect (ATE) of the target internal state rather than that of the manipulation. Often, researchers estimate the association of the manipulation itself with the dependent variable, but this intention-to-treat (ITT) estimator is typically biased for the ATE of the target state, and the bias could be either toward the null (conservative) or away from the null. I discuss the fairly stringent assumptions under which this estimator is conservative. Given this, I argue against the status-quo practice of interpreting the ITT estimate as the effect of the target state without any explicit discussion of whether these assumptions hold. Under a somewhat weaker version of the same assumptions, one can alternatively use instrumental-variables (IVs) analysis to directly estimate the effect of the target state. IVs analysis complements ITT analysis by directly addressing the central question of interest. As a running example, I consider a multisite replication study on the ego-depletion effect, in which the manipulation’s partial effectiveness led to criticism and several reanalyses that arrived at varying conclusions. I use IVs analysis to directly account for the manipulation’s partial effectiveness; this corroborated the replication authors’ reported null results.
In experimental psychology and related disciplines, researchers are often interested in the effects of a certain internal state, such as ego depletion or social stress, that cannot be directly assigned in an experiment. In such cases, researchers often assign participants to a manipulation that is intended to produce the target state and use manipulation-check items to assess whether the manipulation did so effectively (Aronson et al., 1990; Ejelöv & Luke, 2020). That is, manipulation checks are often used when the intervention of interest is not the randomized manipulation itself but, rather, the target state that the manipulation check measures. For example, in a multisite replication study on the ego-depletion effect, participants were randomly assigned to perform a putatively effortful and fatiguing task or to perform a similar but noneffortful control task (Hagger et al., 2016). Here, scientific interest centered less on the effects of the effortful task per se than on the effect of effort and fatigue. Indeed, experiments on ego depletion have used a diverse range of manipulations, including working on unsolvable puzzles, concealing one’s emotions while watching a movie, and avoiding using stereotypes when describing people (Baumeister et al., 1998; Gailliot et al., 2007). As a manipulation check, Hagger et al. (2016) asked participants to self-report their effort, fatigue, perceived difficulty, and frustration. The replications were criticized partly because the manipulation was only partially effective based on these measures (Baumeister & Vohs, 2016).
Although manipulation checks are widely used, there has been little guidance on appropriate statistical analyses for studies using these measures (Hauser et al., 2018). In this article, I provide such guidance, informed by classic and recent results in causal inference. I first discuss several widespread existing statistical methods, such as estimating the association of the manipulation with the dependent variable without including the manipulation check in the analysis model. In the language of clinical trials, this is an “intention-to-treat” (ITT) analysis. Intuitively, it might seem that this analysis would always provide a conservative estimate (i.e., in the correct direction but biased toward the null) of the effect of the target state itself on the dependent variable because the estimate is “diluted” by the manipulation’s limited effectiveness. This presumed conservatism is perhaps why this approach is in widespread use. In fact, as I discuss, the ITT estimator is not necessarily conservative: It can be biased away from rather than toward the null, and it can even be in the wrong direction (Hernán & Hernández-Díaz, 2012). This means that the status-quo practice of interpreting the ITT estimate as the effect of the target state itself is not automatically justified even if the estimate is interpreted as conservative. I discuss the fairly stringent assumptions under which the ITT estimator is guaranteed to be conservative. These assumptions are violated, for example, for several sites in the ego-depletion replications, so the ITT estimate cannot necessarily be interpreted as the effect of ego depletion itself or even as a conservative estimate thereof.
In fact, under a weaker version of the same assumptions, it is possible to obtain a consistent, rather than conservative, estimator for the effect of the target state itself (a consistent estimator is one whose estimates converge to the true parameter—in this case, the effect of the target state—as the sample size increases). I discuss how to do so using standard instrumental-variables (IVs) analysis (Angrist et al., 1996). Conducting such an analysis along with the standard ITT analysis helps more directly address a central question of interest when using manipulation checks. As I discuss, estimating the effect of the target state requires fairly strong assumptions regardless of whether one uses IVs analysis or the conservative interpretation of ITT analysis. In particular, both methods require assuming “excludability,” meaning that the manipulation affects the dependent variable via only the target state and not via any other mechanisms (Angrist et al., 1996). Others have critiqued the status-quo practice of interpreting the ITT estimate in terms of the effect of the target state (Eronen, 2020; Gruijters, 2022; Spirtes & Scheines, 2004); I offer further methodological reasons to abandon this practice unless excludability and the other required assumptions have been carefully justified.
Running Example: Ego Depletion
As a running example, I consider Hagger et al.’s (2016) multisite replication study on ego depletion, in which the ITT estimate was close to the null (standardized mean difference [SMD] = 0.04; 95% confidence interval [CI] = [–0.07, 0.15]). I show that accounting for the manipulation’s partial effectiveness yields point estimates that remain very close to the null, corroborating the replication team’s findings and contributing new evidence to a contentious debate (Baumeister & Vohs, 2016; Dang, 2016; Drummond & Philipp, 2017; Hagger & Chatzisarantis, 2016). The experimental conditions in the replications were (a) a control task, which involved viewing a series of words and rapidly pressing a button if the word contained the letter “e,” or (b) an effortful task, which was identical to the control task except that participants were to avoid pressing the button if the “e” was next to or one letter away from a vowel (Hagger et al., 2016). This additional stipulation requires response inhibition and so is thought to make the task more ego-depleting. The primary dependent variable was reaction-time variability (RTV) on a multisource interference task, which is conceptually similar to the Stroop task and requires response inhibition. Higher RTV indicates worse performance, so the ego-depletion hypothesis predicts that experiencing more ego depletion would increase RTV. The four manipulation-check items were 7-point scales assessing self-reported effort, fatigue, perceived difficulty, and frustration.
Setting and Notation
I focus on experiments in which the manipulation is randomly assigned but the target state is not. Let X denote the experimental condition; I assume this is a binary variable such that
Any discussion of appropriate statistical methods must begin with clearly defining the estimand of interest—that is, the unknown quantity one is trying to estimate. One estimand of natural interest is the average treatment effect (ATE) of R, which can be formalized using the counterfactual framework of causal inference.
In this framework, a participant’s potential outcome is the value that a given variable, usually Y, would take for that individual if the individual were to receive a particular intervention (Hernán & Robins, 2020). For example, if R is binary, then the potential outcome
The notation
for any baseline level c. I now consider three existing analytic approaches for experiments that use manipulation checks and consider whether these approaches can validly estimate
Summary of Statistical Analyses, Questions Addressed, and Assumptions
Note: DV = dependent variable.
The listed assumptions are not exhaustive. For example, all methods assume that the manipulation is randomized and make other standard causal-inference assumptions (e.g., that participants do not affect each other; Hernán & Robins, 2020).
The other assumptions are given in Mathur and Shpitser’s (2024) Proposition 1.
Existing Approaches
ITT analysis
Because the target state R is not randomly assigned, the association of R itself with Y may not provide a valid estimate for
This estimate is typically obtained by taking the mean difference between experimental conditions. Of course, because X is randomized, this estimate is statistically consistent (i.e., valid) for the ATE of the manipulation itself. However,
As noted in the introduction, intuition might incorrectly suggest that the ITT estimator would always be conservative (i.e., biased toward the null) for
Assumption 1 (no backfiring): The mean difference of X on R is positive.
Assumption 2 (excludability): The manipulation X affects the dependent variable Y only via the target state, R.
Assumption 3 (no simultaneous moderation): Any unmeasured moderators of the effect of target state R on Y are uncorrelated with any unmeasured moderators of the effect of manipulation X on R.
Assumption 4 (binary target state): The target state R is analyzed as a binary variable.
Assumption 1 states that the mean difference of X on R is positive, that is,
If the manipulation affects R (regardless of the strength or direction of effect) and satisfies Assumption 2, then it is called an “instrument” for the target state R. The following conservatism results follow from classic and recent results regarding IVs, which are used in causal inference to estimate the effects of variables that cannot themselves be randomly assigned (Angrist et al., 1996; Hartwig et al., 2023; Hernán & Robins, 2006).
Assumption 2 states that the manipulation affects the dependent variable only via the target state; this assumption would be violated if the manipulation had any effects on the dependent variable that were not entirely mediated by the target state. This assumption may be plausible a priori if the manipulation is compared with a closely matched control condition such that the only plausible difference between the conditions is whether they promote being in the target state. In Hagger et al.’s (2016) replications, the control task and ego-depleting task were identical except that the control task did not require effortful response inhibition. In addition, Assumption 2 can to some extent be tested empirically. For example, to try to rule out the possibility that the manipulation affects the dependent variable via mechanisms not involving the target state, one could measure variables representing these other plausible mechanisms (Ejelöv & Luke, 2020). If the manipulation is not empirically associated with any of these other mechanisms, this result could increase one’s confidence in Assumption 2, although it would not conclusively prove that the assumption holds.
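To make the role of Assumption 2 concrete, the following minimal simulation sketch (not part of the original analysis) compares the ITT estimate with and without a direct path from X to Y that bypasses R. All numerical values are hypothetical and chosen only for illustration: a true effect of R on Y of 0.5, a manipulation that raises the probability of being in the target state from 0.3 to 0.6, and the sizes of the direct effects. The function name `simulate_itt` is likewise an assumption.

```python
import random
from statistics import mean

EFFECT_R = 0.5  # hypothetical true effect of the target state R on Y


def simulate_itt(direct_effect, n=40000, seed=0):
    """Return the ITT estimate (mean difference in Y between conditions).

    direct_effect is a hypothetical effect of X on Y that bypasses R;
    any nonzero value violates excludability (Assumption 2).
    """
    rng = random.Random(seed)
    y_treated, y_control = [], []
    for _ in range(n):
        x = rng.randint(0, 1)  # randomized manipulation
        # Partially effective manipulation: P(R = 1) rises from 0.3 to 0.6.
        r = 1 if rng.random() < 0.3 + 0.3 * x else 0
        y = EFFECT_R * r + direct_effect * x + rng.gauss(0, 1)
        (y_treated if x == 1 else y_control).append(y)
    return mean(y_treated) - mean(y_control)


itt_excludable = simulate_itt(direct_effect=0.0)   # excludability holds
itt_inflated = simulate_itt(direct_effect=0.6)     # positive direct path
itt_reversed = simulate_itt(direct_effect=-0.4)    # negative direct path
print(itt_excludable, itt_inflated, itt_reversed)
```

Under these assumed values, the ITT estimate converges to 0.5 × 0.3 plus the direct effect. It is therefore attenuated toward the null when excludability holds, exceeds the true effect of 0.5 when the direct effect is 0.6, and has the wrong sign when the direct effect is −0.4.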
Assumption 3 refers to moderation of the R–Y and X–R relationships (Hartwig et al., 2023). In the ego-depletion example, if an unmeasured variable, such as trait neuroticism, moderated both (a) the effect of experiencing ego depletion on RTV and (b) the effect of the manipulation on experiencing ego depletion, then Assumption 3 would be violated. However, if either form of moderation were implausible for all unmeasured moderators, then Assumption 3 would hold. The version of this assumption stated here applies if R is binary. If R is continuous, then the effect of R on Y must also be additive linear (Hartwig et al., 2023). This is similar to the standard assumption used in common regression models and means, for example, that for a given individual, the effect of a 1-unit increase in ego depletion on the dependent variable is the same for any level of ego depletion.
Under Assumptions 1 through 4, the ITT estimator will be conservative; more formally, it will be consistent for a parameter that is no greater than the true
Mediation analysis
A second existing approach involves treating the target state, as assessed by the manipulation check, as a mediator and estimating the indirect effect of the manipulation that occurs through the target state (Hauser et al., 2018). For example, Drummond and Philipp (2017) reanalyzed Hagger et al.’s (2016) replication data and estimated small indirect effects through each manipulation-check item. Although mediation analysis can be useful and informative in many contexts, such analyses face two limitations in the context of manipulation checks. First, as I have suggested, experimenters using manipulation checks are often (although not always) ultimately interested in the effect of the target state itself. In contrast, the indirect effect represents the effect of the manipulation that operates via the manipulation’s effect on the target state. Although this estimand may be of interest for other reasons, it cannot be interpreted as the effect of the target state and so does not help address a question of central interest for experimenters who use manipulation checks (Hauser et al., 2018).
As a second limitation, even if the indirect effect is indeed of interest, mediation analysis requires strong assumptions to validly estimate causal effects (Pearl, 2009; Rohrer et al., 2022; VanderWeele, 2015). One such assumption is that there must be no confounding of the relationship between the mediator (here, the target state) and the dependent variable, conditional on any covariates that have been adjusted for in the analysis. Because the target state itself is not randomized, this assumption will often be violated unless the mediation analysis controls for all variables that affect both the target state and the dependent variable (Hauser et al., 2018; Pearl, 2009; VanderWeele, 2015). As noted above, in the ego-depletion example, it seems likely that certain participant characteristics could affect participants’ susceptibility to ego depletion and their RTV (Maples-Keller et al., 2016; Salmon et al., 2014). Because mediation analyses involving manipulation checks rarely adjust for such common causes (e.g., Drummond & Philipp, 2017), these analyses may not yield valid estimates of the indirect effect.
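The confounding concern can be illustrated with a small simulation sketch (again not part of the original analysis). Here a hypothetical unmeasured trait U raises both the manipulation-check score R and the dependent variable Y, and the true effect of R on Y is set to zero. All coefficients and helper names are assumptions made for illustration.

```python
import random


def ols_slope(xs, ys):
    """Slope from a simple least-squares regression of ys on xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx


def within_condition_slopes(n=40000, seed=1):
    """Slope of Y on R within each condition when R has no effect on Y."""
    rng = random.Random(seed)
    data = {0: ([], []), 1: ([], [])}
    for _ in range(n):
        x = rng.randint(0, 1)              # randomized manipulation
        u = rng.gauss(0, 1)                # unmeasured trait (hypothetical)
        r = 0.8 * x + u + rng.gauss(0, 1)  # target state depends on X and U
        y = 0.0 * r + u + rng.gauss(0, 1)  # R has zero causal effect on Y
        data[x][0].append(r)
        data[x][1].append(y)
    return {x: ols_slope(rs, ys) for x, (rs, ys) in data.items()}


print(within_condition_slopes())
```

Under these assumed values, the slope of Y on R within each condition is close to 0.5 even though R has no effect on Y, because within a condition the covariance of R and Y equals the variance of U (1) and the variance of R is 2. Randomizing X does nothing to remove this association.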
Moderator analysis
A third existing approach involves treating the manipulation check as a moderator rather than a mediator. For example, Dang (2016) reanalyzed the ego-depletion replications by examining the association of effort with RTV, stratified by experimental condition. It is not clear why, of the four manipulation-check items, Dang considered only effort rather than fatigue, the problematic item. Dang found that participants who reported greater effort during the depletion task had higher RTV (i.e., worse performance on the final task) and concluded that the “ineffectiveness of Hagger’s replication may result from ineffectiveness of their manipulation.” One possible reason for treating the target state as a moderator would seem to be to estimate how much more effective the manipulation was among participants who were in the target state. Perhaps counterintuitively, if the manipulation affects the target state at all, moderation analysis does not validly estimate this quantity (Mathur & Shpitser, 2024). This problem is essentially a form of posttreatment bias, that is, bias caused by conditioning on a variable affected by the treatment of interest (Montgomery et al., 2018). For moderation analysis to estimate the difference in effectiveness between participants who were in the target state versus those who were not, other assumptions are also required (Mathur & Shpitser, 2024). I do not discuss these other assumptions because the assumption that the manipulation does not affect the target state is already violated by design.
Alternative Approach: IVs Analysis
A consistent, rather than conservative, estimator of
Assumption 1′ (relevance): The manipulation X affects the target state R.
Assumption 2 (excludability): The manipulation X affects the dependent variable Y only via the target state, R.
Assumption 3 (no simultaneous moderation): Any unmeasured moderators of the effect of target state R on Y are uncorrelated with any unmeasured moderators of the effect of manipulation X on R.
Assumption
Heuristically, if X is only partially effective, then Assumptions 1′ through 3 guarantee that
Near violations of Assumption
Several approaches exist to assess how results might change if Assumptions
Reanalysis of the Ego-Depletion Effect
Using ITT analysis, Hagger et al. (2016) estimated that the ego-depletion effect was close to the null (SMD = 0.04; 95% CI = [–0.07, 0.15]). The manipulation had strong effects on three of the four manipulation-check items (effort, difficulty, and frustration); SMDs ranged from 0.82 to 1.91. However, the manipulation had only weak effects on fatigue (SMD = 0.09; 95% CI = [–0.03, 0.20]). The lead author of the original study on ego depletion critiqued the replications on several grounds, one of which was that “the manipulation failed to create ego depletion” based on the fatigue measure (Baumeister et al., 1998).
To investigate how the replication findings might change when accounting for the manipulation’s partial effectiveness, I reanalyzed the data using IVs analysis. I first consider the plausibility of Assumptions 1′ through 3. First, Assumption 1′ (relevance) appears to hold because the manipulation did somewhat increase ego depletion, if only partially. Assumption 2 (excludability) seems fairly plausible on theoretical grounds because the manipulation involved stylized laboratory tasks that were identical except that the control task did not require effortful response inhibition (i.e., suppressing a response to press a button when a word contained the letter “e” but the “e” was near a vowel). This is a strength of the replications compared with some previous studies on ego depletion in which the manipulations were often considerably less specific (e.g., forcing oneself to eat radishes instead of freshly baked cookies) and hence more likely to violate the excludability assumption (Baumeister et al., 1998; Lurquin & Miyake, 2017). On the other hand, there appears to be little empirical basis for ruling out possible undesired effects of the letter-crossing manipulation on nuisance psychological states. Such effects could result in excludability violations (but regarding possible effects on negative affect, see Hagger et al., 2010). This paucity of literature on manipulation specificity contrasts strikingly with the better developed literature on the effectiveness and mechanisms of these tasks at inducing ego depletion (Arber et al., 2017; Baumeister & Vohs, 2016; Singh & Göritz, 2019). As noted previously, Assumption 3 (no simultaneous moderation) could potentially be violated if an unmeasured variable, such as trait neuroticism, moderated both (a) the effect of experiencing ego depletion on the dependent variable (RTV) and (b) the effect of the manipulation on experiencing ego depletion. 
I leave further substantive consideration of this possibility to the ego-depletion research community.
In my reanalysis, I treated the four manipulation-check items (effort, difficulty, frustration, and fatigue) as a single composite measure of ego depletion by aggregating Hagger et al.’s (2016) publicly available estimates of the manipulation’s effects on each item, accounting for correlation between the items. I combined these with Hagger et al.’s ITT estimates to obtain IV estimates for each replication site (Fig. 1). I aggregated the IV estimates across sites using random-effects meta-analysis fit with restricted maximum likelihood estimation and Knapp-Hartung standard errors (Knapp & Hartung, 2003; Sidik & Jonkman, 2002). I estimated that on average across sites, the effect of the manipulation on composite ego depletion was SMD = 0.79 (95% CI = [0.71, 0.86]). Within sites, the ITT estimates ranged from −0.51 to 0.50, and the IV estimates ranged from −0.76 to 0.55. In 20 of 23 sites, the ITT estimate was closer to null than the IV estimate or was equal to the IV estimate. In my aggregated IV analysis, I estimated that the effect of ego depletion itself (rather than the effect of the manipulation) was SMD = 0.06 (95% CI = [–0.09, 0.20]). Like Hagger and Chatzisarantis’s (2016) ITT estimate, this IV estimate is very close to the null with a CI that excludes medium or large effect sizes.
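As a rough sketch of how a site-level IV estimate can be formed from summary statistics, the standard ratio (Wald) form of the IV estimator divides the ITT estimate by the estimated effect of the manipulation on the target state. The function below pairs that ratio with a first-order delta-method standard error. The function name, the default of zero covariance between the two estimates, and the input values are hypothetical illustrations; they are not taken from the reanalysis, whose exact standard-error computation is not described here.

```python
import math


def wald_iv(itt, se_itt, manip, se_manip, cov=0.0):
    """Ratio (Wald) IV estimate with a delta-method standard error.

    itt:   estimated effect of the manipulation X on the outcome Y
    manip: estimated effect of X on the target state R (must be nonzero)
    cov:   covariance between the two estimates; 0.0 is a simplifying assumption
    """
    if manip == 0:
        raise ValueError("manip must be nonzero (relevance assumption)")
    estimate = itt / manip
    variance = (
        se_itt ** 2 / manip ** 2
        + itt ** 2 * se_manip ** 2 / manip ** 4
        - 2 * itt * cov / manip ** 3
    )
    return estimate, math.sqrt(variance)


# Hypothetical site: ITT = 0.10 (SE 0.05), manipulation effect = 0.80 (SE 0.04).
est, se = wald_iv(0.10, 0.05, 0.80, 0.04)
print(f"IV = {est:.3f}, 95% CI = [{est - 1.96 * se:.3f}, {est + 1.96 * se:.3f}]")
```

When the estimated manipulation effect is less than 1 in absolute value, the ratio is further from zero than the ITT estimate and, with zero covariance, its standard error is larger than that of the ITT estimate. A manipulation effect close to zero makes the ratio unstable, which is one reason a weakly affected manipulation-check item tends to produce a wide CI.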

Fig. 1. Forest plot of intention-to-treat (ITT) estimates and instrumental variable (IV) estimates within each replication site and pooled across sites with 95% confidence intervals. Interval limits that extend past the plotted range are truncated. IV estimates are based on the composite ego-depletion measure.
In a critical commentary on the replications, Baumeister and Vohs (2016) argued that fatigue is the most important of the manipulation-check items. I conducted a second analysis that was maximally favorable to this viewpoint in which I treated fatigue as the only manipulation-check item. This analysis would be justified in the extreme case that the only effects of the manipulation were via fatigue and not via effort, perceived difficulty, frustration, or any other pathways. The resulting IV estimate was again very similar to the ITT estimate (SMD = 0.05; 95% CI = [–0.25, 0.38]). Although this estimate had a wide CI, it again corroborates Hagger et al.’s (2016) original findings under assumptions that favor the opposite conclusion. I reiterate that my IV reanalysis is subject to assumptions that might be violated and that merit further empirical evaluation and substantive consideration; nevertheless, these assumptions are less stringent than those required for the status-quo interpretation of the ITT estimate.
Discussion
In this article, I considered analytic approaches for experiments that involve manipulation checks, specifically, when the effect of primary interest is that of the target internal state that the manipulation is intended to produce. In this context, the widespread ITT analysis is not necessarily conservative for the ATE of the target state on the dependent variable. I discussed assumptions under which the ITT estimator is indeed conservative for this ATE. I suggested IVs analysis as an alternative approach that is well established in the causal-inference literature. The assumptions under which the IVs analysis estimates the ATE of the target state are slightly less stringent than those under which the ITT analysis is conservative.
The IVs estimate directly addresses a central question of interest in experiments involving manipulation checks, as in the ego-depletion literature. As another example, in a high-profile meta-analysis of experiments designed to assess the effects of mood on eating behavior, experimental manipulations of mood were highly diverse, including watching emotional video clips, recalling emotional experiences, receiving social feedback, and giving a public presentation (Cardi et al., 2015). Although many of the studies included manipulation checks, the meta-analysts (Cardi et al., 2015) extracted only ITT estimates. Despite the mood manipulations’ inconsistent effectiveness, the meta-analysts (Cardi et al., 2015) concluded that “eating behavior is influenced by emotional state” with no discussion of the assumptions required to thus interpret ITT estimates in terms of the target state. This is a case in which the causal effect of interest clearly concerns the target state, not the various manipulations, but the statistical analysis and discussion of assumptions were not well aligned with this objective.
On the other hand, there are cases in which the ITT estimate will be of equal or greater scientific interest than the IV estimate. Sometimes, researchers are additionally, or exclusively, interested in the effects of the manipulation itself on the dependent variable. This could be the case if the manipulation is a candidate real-world policy or intervention. For example, in a series of randomized experiments, Fernbach et al. (2013) found that participants who were made to explain the details of various political policies subsequently moderated their stances on the policies compared with participants who merely had to explain their own stances. The hypothesized mechanism was that having to explain policies reveals to participants their weak understanding of the policies, leading them to moderate their views. Fernbach et al. expressed interest in the manipulation itself as a potential intervention to counteract attitude polarization, and the ITT estimate directly assessed this possibility. If the authors were additionally interested in the effect of participants’ perceived understanding of the policies, the IV estimate would more directly address this alternative question. Therefore, my recommendation is not that the IV estimate should always be used instead of the ITT estimate. Rather, researchers using manipulation checks should explicitly define and preregister the causal effect(s) of interest and choose analysis methods accordingly.
Among other assumptions, IVs analysis requires the strong assumption of excludability (i.e., that the manipulation affects the dependent variable only via the target state). By comparison, if the ITT estimate is interpreted as the causal effect of the manipulation itself, this is justified simply because the manipulation was randomly assigned. My position is not that the assumptions for IV are typically well justified in psychology experiments: In fact, I concur with others’ concerns that manipulations in psychology may often be “fat-handed,” meaning that they manipulate numerous psychological states other than the target state (Eronen, 2020; Gruijters, 2022; Spirtes & Scheines, 2004). Excludability will usually be violated if these other, nuisance effects of the manipulation also affect the dependent variable. The key point is that if a researcher is interested in the effect of the target state rather than the manipulation—as is usually the case in experimental psychology—then the excludability assumption is the price paid, and its plausibility should be scrutinized in substantive context.
Critically, as I have discussed, the excludability assumption is not unique to IVs analysis. If researchers use the ITT estimate but interpret it as the effect of the target state, then they are still implicitly assuming excludability. In fact, this seems to be the status-quo practice, and it requires assumptions that are strictly more stringent than those required for IVs analysis. Thus, the status-quo practice of interpreting the ITT estimate as the effect of the target state without any explicit discussion of whether excludability holds should be abandoned. Instead, researchers should either (a) describe why, on theoretical or empirical grounds, their manipulation warrants making the excludability assumption, which would license interpreting either the IV estimate or (under additional assumptions) the conservative ITT estimate in terms of the effect of the target state, or (b) decide that their manipulation may not warrant making the excludability assumption and interpret the ITT estimate as only the effect of the manipulation itself, not the effect of the target state.
When interest does center on the target state, and hence one must establish that excludability is plausible, I concur with others’ recommendations to conduct separate, careful validation studies of the manipulation (Gruijters, 2022). These studies should assess the manipulation’s effects on not only the target state but also other nuisance psychological states that could potentially result in excludability violations (Gruijters, 2022). I have also suggested applying sensitivity analyses and bounding methods to assess robustness to violations of this and other assumptions.
Ultimately, I do not advocate for an uncritical switch from ITT analysis to IVs analysis. Rather, I hope this article encourages psychology researchers to carefully define their estimands of interest, articulate and justify the relevant assumptions, and choose an appropriate analysis method accordingly.
Footnotes
Acknowledgements
Transparency
Action Editor: Rogier A. Kievit
Author Contributions
