Abstract
Common visual heuristics used to interpret marginal effects plots are susceptible to Type-1 error. This susceptibility varies as a function of (a) sample size, (b) stochastic error in the true data generating process, and (c) the relative size of the main effects of the causal variable versus the moderator. I discuss simple alternatives to these standard visual heuristics that may improve inference and do not depend on regression parameters.
Introduction
The interpretation of interaction terms in political science is a topic of wide interest (Brambor et al., 2006; Braumoeller, 2004; Berry et al., 2012; Esarey and Sumner, forthcoming; Hainmueller et al., 2017; Kam and Franzese, 2007). An influential article by Brambor et al. (2006), in particular, has transformed how political scientists study and interpret interactive hypotheses. In addition to reminding researchers that they must include both constitutive terms and interaction terms if they wish to test interactive hypotheses, the authors write that “The analyst cannot … infer whether X has a meaningful conditional effect on Y from the magnitude and significance of the coefficient on the interaction term either… It means that one cannot determine whether a model should include an interaction term simply by looking at the significance of the coefficient on the interaction term” (p. 74). The authors propose instead to use “marginal effects plots,” which display the estimated marginal effect of the variable of interest across substantively meaningful values of the moderating variable.
Marginal effects plots have since become ubiquitous in political science. Despite their ubiquity, there has been little analysis of their performance as a tool for identifying interactive effects. Two recent papers have begun to look more closely at the marginal effects plot. Hainmueller et al. (2017) show that marginal effects plots (and indeed, any hypothesis test relying on an interaction term) rely on the assumption that the effect of the causal variable of interest is linear and constant across the values of the moderating variable. Esarey and Sumner (forthcoming), for their part, argue that marginal effects plots usually have inappropriate coverage because of the problem of multiple comparisons. My contribution in this manuscript is to draw attention to the visual heuristics that researchers implicitly use when they interpret marginal effects plots.
In this article I demonstrate that applied researchers have drawn incorrect conclusions from Brambor et al. (2006). The appropriate test for the presence of linear interaction effects is given by the significance of the coefficient on the interaction term. Commonly used visual heuristics, which I identify below, will often fail compared to this test. Marginal effects plots have other uses, but they should not be used to test for the presence of linear interaction effects.
Focusing on Type-1 error, or the problem of false positives, I ask the following question: How frequently will visual inspection of a marginal effects plot suggest that interaction effects exist when the true data generating process is not interactive? To investigate, I generate simulated data for a binary treatment variable

Figure 1. The marginal effect of D on Y.
Visual inspection of Figure 1 suggests that the effect of D on Y is positive at high values of
That conclusion is incorrect. The coefficient on the interaction term,
Learning from marginal effects plots
Marginal effects plots contain two pieces of information. The first is the slope of the “marginal effect line,” which is determined by the coefficient
When there is no interactive effect, the true value of
Figure 2 illustrates how such a marginal effects plot ought to look when there is no interaction between

Figure 2. The marginal effect of D on Y.
Because the effect of
When visual results are not so clean, researchers commonly follow one of two visual heuristics. The first, which I term the “crosses zero” heuristic, looks to see whether or not the confidence intervals capture zero for some portion of the range of
One other piece of information that may test whether or not an interaction effect exists is the coefficient
We are left with what seems to be an impasse. The coefficient on the interaction term is not a meaningful test of the marginal effect
The coefficient of
The confidence intervals around the point estimates across values of
Differences in
How frequently do researchers employ the crosses zero heuristic? I consulted each of the articles replicated by Hainmueller et al. (2017) and checked for evidence that authors explicitly based their inferences on a marginal effects plot rather than on the statistical significance of the interaction term. The authors argue that this sample represents “high profile” articles that likely took “special care to employ and interpret these models correctly.” By my count, 7 out of 22, or nearly one out of every three articles, fulfill this criterion. In the majority of the remaining 15 cases, the coefficient on the interaction term was itself significant, obviating the need to choose one or the other. For the same reasons that Hainmueller et al. (2017) argue that their replications represent a lower bound on the true rate of problematic multiplicative interaction terms, my count may also represent a lower bound on how often visual heuristics are used to identify interaction effects.
Simulations
The preceding discussion explains why it is not correct to compare marginal effects to make inferences about the presence of interactive effects. To illustrate the dangers of doing so, I use simulations. Based on the data generating process outlined in the introduction, I created 1000 simulated datasets and created “virtual” marginal effects plots for each. I then implemented five tests: three based on the heuristics outlined above, one based on the coefficients from the binning estimator, and one based on the coefficient on
Crosses zero heuristic. If the estimated marginal effect of
Crosses zero heuristic (bins). This heuristic applies the logic of the crosses zero heuristic to a plot derived from the binning estimator. If the confidence interval of the low (high) tercile captures zero, and the confidence interval of the high (low) tercile does not capture zero, and the point estimate for each tercile falls in the order (Low, Middle) > High or (High, Middle) > Low, then I conclude that the binning estimator plot is consistent with the presence of an interactive effect.
Compare extremes heuristic. If the maximum value of the lower confidence bound is greater than the minimum value of the upper confidence bound, then I conclude that the marginal effects plot is consistent with the presence of an interactive effect. Recognizing the critiques that exist of this heuristic, note here that I study only cases where the confidence band extends to the observed maximum and minimum of a normally distributed moderator whose values are independent of the causal variable.
Differences between bins. If the two-sided p-value for a test of the equality of the estimates for the first and third bins is less than .05, I conclude that the binning estimates are consistent with the presence of an interactive effect.
Coefficient and p-value. If the p-value associated with
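The three decision rules that are fully specified above (crosses zero for bins, compare extremes, and differences between bins) can be sketched in Python as follows. This is an illustrative sketch rather than the replication code: the function names, the argument layout, the normal approximation for the p-value, and the default of zero covariance between the two bin estimates are all assumptions made for this example.

```python
import math


def captures_zero(lower, upper):
    """True if the confidence interval [lower, upper] contains zero."""
    return lower <= 0.0 <= upper


def crosses_zero_bins(estimates, intervals):
    """Crosses zero heuristic applied to tercile (bin) estimates.

    estimates: (low, middle, high) point estimates (hypothetical layout).
    intervals: three (lower, upper) confidence intervals in the same order.
    """
    low, mid, high = estimates
    low_zero = captures_zero(*intervals[0])
    high_zero = captures_zero(*intervals[2])
    # Exactly one of the two outer terciles has an interval capturing zero.
    one_sided = low_zero != high_zero
    # Ordering: (Low, Middle) > High or (High, Middle) > Low.
    ordered = (low > high and mid > high) or (high > low and mid > low)
    return one_sided and ordered


def compare_extremes(lower_band, upper_band):
    """Compare extremes heuristic on a confidence band.

    Flags an interaction when the largest lower bound lies above the
    smallest upper bound, so no single value fits inside the whole band.
    """
    return max(lower_band) > min(upper_band)


def bins_differ(est_first, se_first, est_third, se_third, cov=0.0, alpha=0.05):
    """Two-sided test that the first and third bin estimates are equal.

    Assumption: normal approximation; cov is the covariance between the
    two estimates and defaults to zero purely for illustration.
    """
    se_diff = math.sqrt(se_first ** 2 + se_third ** 2 - 2.0 * cov)
    z = (est_first - est_third) / se_diff
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return p_value < alpha
```

Each function returns True when the corresponding rule would be read as consistent with an interactive effect, which makes the rules easy to tally across simulated datasets.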
I then repeat this process hundreds of times, varying four parameters: the sample size
First, I fix
The small sample performance of the crosses zero heuristic in Figure 3 is noteworthy because it runs counter to the common expectation that small samples lead to conservative tests, which are more likely to fail to reject the null when an alternative hypothesis is true. In identifying interaction effects, the crosses zero heuristic is anticonservative in small samples.
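The contrast between the crosses zero heuristic and the coefficient test can be illustrated with a small Monte Carlo sketch. It is not the simulation design behind Figure 3. The data generating process (balanced binary D, standard normal X and errors, a constant effect of D equal to 1, no interaction), the 1.96 critical value, the 21-point grid, and the operationalization of the crosses zero heuristic (the confidence interval excludes zero at some grid values of X and captures it at others) are all assumptions chosen for illustration.

```python
import math
import random


def ols(design, y):
    """OLS coefficients and covariance matrix via the normal equations."""
    n, k = len(design), len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    # Invert X'X by Gauss-Jordan elimination with partial pivoting.
    aug = [xtx[i] + [1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(k):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    inv = [row[k:] for row in aug]
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    resid = [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(design, y)]
    s2 = sum(e * e for e in resid) / (n - k)
    return beta, [[s2 * v for v in row] for row in inv]


def simulate(n=50, sims=200, effect=1.0, beta_x=1.0, crit=1.96, seed=1):
    """Share of datasets flagged as interactive by each rule (true interaction = 0)."""
    rng = random.Random(seed)
    coef_hits = zero_hits = 0
    for _ in range(sims):
        d = [i % 2 for i in range(n)]  # balanced binary treatment (assumption)
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [effect * di + beta_x * xi + rng.gauss(0.0, 1.0) for di, xi in zip(d, x)]
        design = [[1.0, di, xi, di * xi] for di, xi in zip(d, x)]
        beta, cov = ols(design, y)
        # Rule 1: significance of the coefficient on D*X.
        if abs(beta[3]) / math.sqrt(cov[3][3]) > crit:
            coef_hits += 1
        # Rule 2: crosses zero, checked on a grid over the observed range of X.
        lo, hi = min(x), max(x)
        excludes = []
        for g in range(21):
            xv = lo + (hi - lo) * g / 20
            me = beta[1] + beta[3] * xv
            se = math.sqrt(cov[1][1] + xv * xv * cov[3][3] + 2.0 * xv * cov[1][3])
            excludes.append(abs(me) > crit * se)
        if any(excludes) and not all(excludes):
            zero_hits += 1
    return coef_hits / sims, zero_hits / sims


if __name__ == "__main__":
    coef_rate, zero_rate = simulate()
    print("coefficient test:", coef_rate, "crosses zero:", zero_rate)
```

Under these assumptions one would expect the coefficient test to flag an interaction in roughly the nominal share of datasets, while the crosses zero rule flags one far more often, because the confidence band widens toward the extremes of X even when the effect of D is constant.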

Figure 3. Type-1 error rates for four different heuristics.
In the Appendix I vary other features of the simulations. Specifically, I vary the unexplained variance in the model (
Discussion
The crosses zero heuristic is overconfident when interpreted to be a test of the hypothesis that the effect of
To explore this, I adjust the data generating process to

Figure 4. A linear interaction effect.
I then test the performance of each of the five heuristics. To “stack the deck” in favor of the crosses zero heuristic, I only require that the confidence interval includes zero at
Figure 5 shows that with small sample sizes, all five heuristics are likely to fail to reject the null hypothesis that there is no interaction effect when one does exist. As sample size increases, all five heuristics improve, but the crosses zero heuristic based on the marginal effects plot improves the fastest. The crosses zero heuristic applied to the binning estimator is acceptable but too conservative, even with large samples. The formal test of the differences between bins only approaches the performance of the other four heuristics when the sample size is large. These results suggest that marginal effects plots are better suited than coefficients and p-values for identifying interaction effects, but only when we know that these effects exist and the sample size is relatively small. Similar conclusions may be drawn from simulations that increase the ratio of stochastic variance to systematic variance.

Figure 5. Type-2 error rates for four different heuristics.
Finally, I consider a case where the effects of
where
where
Here D has no effect on Y when

Figure 6. A nonlinear interaction effect.
Not surprisingly, the kernel estimator captures the nonlinear effect of

Figure 7. Detecting nonlinear interactions for two different heuristics.
In these simulations, the crosses zero heuristic nearly always fails to identify the correct nonlinear effect of
Recommendations
This article has shown that visual heuristics used to interpret marginal effects plots can lead to misleading substantive conclusions. When there is no interaction between
Brambor et al.’s (2006) most important contribution—amplified by Braumoeller (2004) and Kam and Franzese (2007) in ways that have fundamentally changed research practice—is to shift researchers away from simple inspection of coefficients and standard errors when examining substantive interaction effects. However, Brambor et al.’s (2006) argument that “one cannot determine whether a model should include an interaction term simply by looking at the significance of the coefficient on the interaction term” is incorrect if interpreted to mean that the coefficient on the interaction term does not test whether the effect of
This discussion suggests some simple guidelines for applied researchers. Assuming linear interaction effects, a conservative strategy that minimizes Type-1 error and does not depend on sample size, stochastic error, or the relative size of the causal effect of interest is to use only coefficients and p-values to test for the presence of interaction effects. Although marginal effects plots do calculate the correct marginal effects and their confidence intervals, they do not test for the presence or absence of an interactive effect. Marginal effects plots should be used, then, only to calculate substantive quantities of interest. They are also useful, in combination with histograms of the distribution of the moderating variable, to explore the sensitivity of interaction models to the range of the moderating variable.
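The division of labor described here can be sketched in a few lines of Python: one function reports the marginal effect of D and its confidence interval at a chosen value of the moderator, and a separate function tests for the interaction itself. The sketch assumes a linear interaction model with coefficients labeled b_d (on D) and b_dx (on the product of D and X). These names, the normal approximation, and the 1.96 critical value are assumptions for illustration and are not taken from the article's replication files.

```python
import math


def marginal_effect(b_d, b_dx, var_d, var_dx, cov_d_dx, x, crit=1.96):
    """Marginal effect of D at moderator value x, with a confidence interval.

    Uses the standard variance formula for a linear combination:
    Var(b_d + b_dx * x) = Var(b_d) + x^2 Var(b_dx) + 2 x Cov(b_d, b_dx).
    """
    effect = b_d + b_dx * x
    se = math.sqrt(var_d + x * x * var_dx + 2.0 * x * cov_d_dx)
    return effect, effect - crit * se, effect + crit * se


def interaction_p_value(b_dx, var_dx):
    """Two-sided p-value for the interaction coefficient (normal approximation)."""
    z = b_dx / math.sqrt(var_dx)
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    # Hypothetical estimates, used only to show the calling pattern.
    print(marginal_effect(1.0, 0.5, 0.04, 0.01, 0.0, x=2.0))
    print(interaction_p_value(0.5, 0.01))
```

The first function answers the substantive question (how large is the effect of D at a given value of the moderator), while only the second speaks to whether a linear interaction is present.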
Another strategy to improve standard visual heuristics is to add a second reference line that corresponds to the marginal effect of

Figure 8. The marginal effect of D on Y.
This additional dotted line focuses the eye not only on whether the confidence band includes zero, but also on whether the entire confidence band spans a common value. The figure on the left plots the same model as in Figure 1, and clearly reveals no interaction effect when
I provide open source R software for creating figures similar to Figure 8 in the package interplot.medline, which is based on the interplot package by Solt and Hu (2016). This simple addition to the standard marginal effects plot should discourage researchers from inferring that interaction effects exist when they do not.
Acknowledgements
Thanks to Bryce Corrigan, Justin Esarey, Jens Hainmueller, and anonymous referees for useful comments on previous drafts. I am responsible for all errors.
Correction (June 2025):
Declaration of conflicting interest
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplementary materials
The supplementary files are available at http://journals.sagepub.com/doi/suppl/10.1177/2053168018756668. The replication files can be found at ![]()
Notes
Carnegie Corporation of New York Grant
This publication was made possible (in part) by a grant from Carnegie Corporation of New York. The statements made and views expressed are solely the responsibility of the author.
References
