
Abstract

In this article, we describe the use of ivmediate, a new command to estimate causal mediation effects in instrumental-variables settings using the framework developed by Dippel et al. (2020, unpublished manuscript). ivmediate allows estimation of a treatment effect and the share of this effect that can be attributed to a mediator variable. While both treatment and mediator can be potentially endogenous, a single instrument suffices to identify both the causal treatment and the mediation effects.

Keywords

st0611, ivmediate, causal mediation analysis, treatment effects, instrumental variables

1 Introduction

There are many settings where a researcher would like to understand the mechanism that underlies an estimated effect of a treatment T on an outcome Y . For example, Becker and Woessmann (2009) are interested in the Weber hypothesis that religion, specifically Protestantism, affects economic growth. Because Protestantism promoted reading of the Bible,¹ they establish that an underlying mechanism M of the effect of religion on economic growth works through human capital accumulation, especially literacy. Given that the prevalence of religion across regions is likely not random, they introduce an instrumental variable (IV) and show that Protestantism caused higher literacy rates and thus economic growth. They derive plausible bounds for the range of a mediation effect but lack a formal framework to causally estimate the indirect effect of religion on economic growth that works through literacy.

Such an exercise of unpacking mechanisms is called mediation analysis, where a treatment T and one of its outcomes M, that is, the mediator, jointly cause a final outcome of interest Y . Mediation analysis has long been used in settings where T can be assumed to be randomly assigned. However, when T is systematically nonrandom and therefore needs to be instrumented by a variable Z,² there has been a lack of frameworks for undertaking mediation analysis in such IV settings without having separate instruments for both T and M.³ The command ivmediate fills this gap and provides a new regression command that allows researchers to use a single IV to estimate the causal effect of the intermediate variable on a final outcome using the estimator developed by Dippel et al. (2020). This complements existing ways to estimate causal mediation effects that assume randomness in the assignment of treatment T (Imai, Keele, and Tingley 2010) or require separate instruments for T and M (for example, Frölich and Huber [2017]; Jun et al. [2016]).

Table 1 illustrates the identification challenge described above. As a starting point, we show the standard IV estimations of the causal effect of T on M (model I) and the causal effect of T on Y (model II). In model I, T is considered endogenous (that is, $ε_{T} \not\perp\!\!\!\perp ε_{M}$: the two errors are not independent), and we introduce for the endogenous treatment T an IV Z, which is both independent of the omitted variables ($Z ∐ ε_{T}, ε_{M}$) and a reasonably strong predictor of T. Model II fits the treatment effect (TE) of T on Y using the same IV approach: T is again endogenous ($ε_{T} \not\perp\!\!\!\perp η_{Y}$), but Z is exogenous (that is, $Z ∐ ε_{T}, η_{Y}$). Table 1 is reprinted from Dippel et al. (2020).

Table 1. The identification problem of mediation analysis with IV

NOTES: (a) Model I is the standard IV model, which enables the identification of the causal effect of T on M. Model II is the standard IV model that enables the identification of the causal effects of T on Y . Model III is the IV Mediation Model with an instrumental variable Z. (b) Panel A gives the graphical representation of the models. Panel B presents the non-parametric structural equations of each model. Conditioning variables are suppressed for sake of notational simplicity. We use ∐ to denote statistical independence.

To identify what fraction of the TE is explained by the indirect effect, we have to perform a mediation analysis that decomposes the TE of T on Y into 1) the mediated “indirect” effect of T on Y that operates through M and 2) the residual “direct” effect that does not work through M. Model III of table 1 shows the main identification challenge in combining the two IV models into a general mediation model. Equations M = f_M(T, ε_M) and Y = f_Y(T, M, ε_Y) imply that T causes Y indirectly through M as well as directly, which is graphically represented by the arrow directly linking T to Y. In a regression of Y on both T and M, there are two potentially endogenous regressors (that is, $ε_{T} \not\perp\!\!\!\perp ε_{Y}$ and $ε_{M} \not\perp\!\!\!\perp ε_{Y}$, where $\not\perp\!\!\!\perp$ denotes statistical dependence), but there is only one instrument Z to address this endogeneity.

To overcome the underidentification problem, we do not assume away endogeneity in any of the key relationships in model III (the dependencies $ε_{T} \not\perp\!\!\!\perp ε_{M}$, $ε_{M} \not\perp\!\!\!\perp ε_{Y}$, and $ε_{T} \not\perp\!\!\!\perp ε_{Y}$ are all maintained), yet we do not need additional instruments. Instead, the omitted-variable concerns themselves can suggest a natural solution. This is the case when T is endogenous in a regression of M on T because of confounders that jointly affect M and T and when T is endogenous in a regression of Y on T because of the same confounders, which affect Y primarily through M.

Dippel et al. (2020) show that this assumption alone is sufficient to unpack the causal channels in model III, therefore allowing us to identify the extent to which T causes Y through M. Under linearity, the resulting identification framework is straightforward to estimate using three separate two-stage least squares (2SLS) regressions; these estimate i) the effect of T on M, ii) the effect of T on Y , and iii) the effect of M on Y conditional on T .

In the following section, we briefly explain the underlying econometric theory before turning to the estimation procedure in section 3. There, we also provide further guidance on the interpretation of results and on weak identification, both typical concerns for applied researchers. Section 4 describes the syntax and options of ivmediate. Section 5.1 provides a brief simulation exercise to show not only how ivmediate estimates the correct TE of a treatment but also how the TE can be decomposed into direct and indirect effects. In section 5.2, we then apply the command to a real-life example using the data and empirical setting of Becker and Woessmann (2009) to estimate how Protestantism affects local economic performance in Prussian counties in 1877 and how much of this effect is causally mediated by literacy.

2 Causal mediation analysis in IV models

Under linearity and with an instrument Z, the causal relations in model III in table 1 can be written as

$$Z = ε_{Z} \tag{1}$$

$$T = β_{T}^{Z} \times Z + ε_{T} \tag{2}$$

$$M = β_{M}^{T} \times T + ε_{M} \tag{3}$$

$$Y = β_{Y}^{T} \times T + β_{Y}^{M} \times M + ε_{Y} \tag{4}$$

Equations (1)–(4) can be compactly expressed as X = Ψ × X + ε in (5):

$$\underbrace{\begin{bmatrix} Z \\ T \\ M \\ Y \end{bmatrix}}_{X} = \underbrace{\begin{bmatrix} 0 & 0 & 0 & 0 \\ β_{T}^{Z} & 0 & 0 & 0 \\ 0 & β_{M}^{T} & 0 & 0 \\ 0 & β_{Y}^{T} & β_{Y}^{M} & 0 \end{bmatrix}}_{Ψ} \times \underbrace{\begin{bmatrix} Z \\ T \\ M \\ Y \end{bmatrix}}_{X} + \underbrace{\begin{bmatrix} ε_{Z} \\ ε_{T} \\ ε_{M} \\ ε_{Y} \end{bmatrix}}_{ε} \tag{5}$$

Equation (6) presents the covariance matrix Σ_X of observed variables X:

$$Σ_{X} \equiv \mathrm{Var}\begin{pmatrix} Z \\ T \\ M \\ Y \end{pmatrix} = \begin{bmatrix} σ_{ZZ} & σ_{ZT} & σ_{ZM} & σ_{ZY} \\ \cdot & σ_{TT} & σ_{TM} & σ_{TY} \\ \cdot & \cdot & σ_{MM} & σ_{MY} \\ \cdot & \cdot & \cdot & σ_{YY} \end{bmatrix} \tag{6}$$

Let Σ_ε denote the covariance matrix of the unobserved error terms ε. Because Z is an IV, ε_Z is statistically independent of ε_T, ε_M, and ε_Y. Thus, Σ_ε is given by

$$Σ_{ε} \equiv \mathrm{Var}\begin{pmatrix} ε_{Z} \\ ε_{T} \\ ε_{M} \\ ε_{Y} \end{pmatrix} = \begin{bmatrix} σ_{ε_Z}^{2} & 0 & 0 & 0 \\ \cdot & σ_{ε_T}^{2} & ρ_{TM} σ_{ε_T} σ_{ε_M} & ρ_{TY} σ_{ε_T} σ_{ε_Y} \\ \cdot & \cdot & σ_{ε_M}^{2} & ρ_{MY} σ_{ε_M} σ_{ε_Y} \\ \cdot & \cdot & \cdot & σ_{ε_Y}^{2} \end{bmatrix}$$

The identifying assumption in Dippel et al. (2020) is that T is endogenous in a regression of Y on T, but endogeneity cannot arise from confounders that jointly influence T and Y, only from confounders that jointly affect T and M (for example, Protestantism and literacy in Becker and Woessmann [2009]). The framework also allows for confounders that jointly influence M and Y (for example, literacy and economic growth in Becker and Woessmann [2009]). Formally, the identifying assumption is ρ_TY = 0 in Σ_ε, while allowing ρ_TM ≠ 0 and ρ_MY ≠ 0.
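To make the role of ρ_TY = 0 concrete, the moments involving Z and T can be written out from (1)–(4). The expansion below is our own worked step, not a result quoted from Dippel et al. (2020); it uses only the independence of ε_Z from the other errors and, in the last line, the identifying assumption.

```latex
\begin{aligned}
σ_{ZT} &= β_{T}^{Z} \, σ_{ZZ} \\
σ_{ZM} &= β_{M}^{T} \, σ_{ZT} \\
σ_{ZY} &= β_{Y}^{T} \, σ_{ZT} + β_{Y}^{M} \, σ_{ZM} \\
σ_{TY} &= β_{Y}^{T} \, σ_{TT} + β_{Y}^{M} \, σ_{TM}
          + \underbrace{ρ_{TY} \, σ_{ε_T} σ_{ε_Y}}_{=\,0 \text{ by assumption}}
\end{aligned}
```

The second line gives β_M^T as the ratio σ_ZM / σ_ZT. The last two lines are two linear equations in the two unknowns β_Y^M and β_Y^T, which can be solved whenever σ_ZM σ_TT ≠ σ_ZT σ_TM; this is one way to see why a single instrument can suffice once ρ_TY = 0.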

In section 5.1, we describe how to generate a simulated dataset with these dependent relations.

3 Estimation

3.1 Estimation procedure

The estimation equations to identify all linear coefficients are associated with well-known econometric estimators as follows (control variables are suppressed for notational simplicity and without loss of generality):

1. Parameter $β_{M}^{T}$ is identified by standard 2SLS estimation, described by the following two-equation system:

$$\text{First stage}: \quad T = β_{T}^{Z} \times Z + ε_{T} \tag{7}$$

$$\text{Second stage}: \quad M = β_{M}^{T} \times \hat{T} + ε_{M} \tag{8}$$

where $\hat{T}$ stands for the estimated values of T in the first stage.

2. Dippel et al. (2020) show that the identifying assumption ρ_TY = 0 yields a new exclusion restriction, which allows for the use of Z as an instrument for M when conditioned on T (but not unconditionally). This implies that $β_{Y}^{M}$ and $β_{Y}^{T}$ are the expected values of the estimators of a 2SLS regression where T plays the role of a conditioning variable, Z is the instrument, M is the endogenous variable, and Y is the dependent variable. Namely, $β_{Y}^{M}$ and $β_{Y}^{T}$ can be fit by the following 2SLS model:

$$\text{First stage}: \quad M = γ_{M}^{Z} \times Z + γ_{M}^{T} \times T + ε_{T} \tag{9}$$

$$\text{Second stage}: \quad Y = β_{Y}^{M} \times \hat{M} + β_{Y}^{T} \times T + ε_{Y} \tag{10}$$

where $\hat{M}$ are the estimated values of M in the first stage.

The estimation procedure associated with (7) and (8) is the standard IV approach. By contrast, the estimation procedure associated with (9) and (10) is novel and a property of the framework laid out in Dippel et al. (2020).
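The two 2SLS systems can be illustrated outside Stata with a minimal Python sketch. It is not the ivmediate implementation: it assumes no control variables, all structural coefficients equal to 1, standard normal Z, ε_T, and ε_Y, and the error construction of section 5.1 with ω = 0.5. The function names `cov`, `simulate`, and `iv_mediation` are hypothetical, and with one instrument the 2SLS estimates are computed directly from sample covariances.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)


def simulate(n, omega, seed):
    """Draw (Z, T, M, Y) from (1)-(4) with all coefficients set to 1 (assumption)."""
    rng = random.Random(seed)
    z, t, m, y = [], [], [], []
    for _ in range(n):
        z_i, e_t, e_y = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        # e_m is correlated with e_t and e_y, while e_t and e_y are independent.
        e_m = omega ** 0.5 * e_t + (1 - omega) ** 0.5 * e_y
        t_i = z_i + e_t
        m_i = t_i + e_m
        y_i = t_i + m_i + e_y
        z.append(z_i)
        t.append(t_i)
        m.append(m_i)
        y.append(y_i)
    return z, t, m, y


def iv_mediation(z, t, m, y):
    """Return (total, direct, indirect) effects from the three IV regressions."""
    # (7)-(8): just-identified IV of M on T with instrument Z.
    beta_mt = cov(z, m) / cov(z, t)
    # Model II: just-identified IV of Y on T with instrument Z (total effect).
    total = cov(z, y) / cov(z, t)
    # (9)-(10): IV of Y on M (instrument Z) with T as an included control.
    # The estimates solve the two sample moment equations
    #   cov(Z,Y) = bM*cov(Z,M) + bT*cov(Z,T)
    #   cov(T,Y) = bM*cov(T,M) + bT*cov(T,T)
    a11, a12, b1 = cov(z, m), cov(z, t), cov(z, y)
    a21, a22, b2 = cov(t, m), cov(t, t), cov(t, y)
    det = a11 * a22 - a12 * a21
    beta_ym = (b1 * a22 - a12 * b2) / det  # Cramer's rule
    beta_yt = (a11 * b2 - a21 * b1) / det
    return total, beta_yt, beta_mt * beta_ym


z, t, m, y = simulate(n=20000, omega=0.5, seed=1)
total, direct, indirect = iv_mediation(z, t, m, y)
print(f"total={total:.3f} direct={direct:.3f} indirect={indirect:.3f}")
```

Under these assumptions the true direct and indirect effects are both 1. In this design the determinant `det` is proportional to the correlation between ε_T and ε_M, so the second system becomes weakly identified as ω approaches 0.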

There are two first stages here in (7) and (9) for which ivmediate provides tests for weak identification by reporting the corresponding F statistics on the excluded instrument. If robust or cluster–robust standard errors are requested, the regression output displays the F statistic by Kleibergen and Paap (2006). To implement estimation of their corrected F statistic, we rely on the ranktest command by Kleibergen and Schaffer (2007).

In section 5.1, we compare the unbiased estimates resulting from (7)–(10) with the associated OLS estimates.

3.2 Interpretation

There is another explicit link between (7)–(10) and the direct estimation of the TE in model II of table 1. Model II is obtained from model III by substituting (8) into (10):

$$\begin{aligned} Y &= β_{Y}^{M} \times (β_{M}^{T} \times T + ε_{M}) + β_{Y}^{T} \times T + ε_{Y} \\ &= \underbrace{(β_{Y}^{M} \times β_{M}^{T} + β_{Y}^{T})}_{TE} \times T + \underbrace{β_{Y}^{M} ε_{M} + ε_{Y}}_{η_{Y}} \equiv g_{Y}(T, η_{Y}) \end{aligned} \tag{11}$$

Equation (11) shows that the direct estimate of TE produced by model II is algebraically identical to the combination of estimates $β_{Y}^{T} + β_{M}^{T} \times β_{Y}^{M}$ produced by model III [that is, (7)–(10)], namely, the direct effect plus the product of the two coefficients that make up the indirect effect.⁴ This algebraic equivalence holds for a scalar instrument Z but may not hold with a vector of instruments Z′. The ivmediate command is therefore limited to the use of a single scalar instrument.⁵
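The equivalence can be traced to the instrument moment condition of the second 2SLS regression. The short derivation below is a sketch in our own notation (hats denote sample quantities), assuming a single instrument Z and the same sample and controls in all three regressions.

```latex
\begin{aligned}
\hat{σ}_{ZY} &= \hat{β}_{Y}^{M} \, \hat{σ}_{ZM} + \hat{β}_{Y}^{T} \, \hat{σ}_{ZT}
  && \text{(moment condition for } Z \text{ in (9)--(10))} \\
\frac{\hat{σ}_{ZY}}{\hat{σ}_{ZT}} &= \hat{β}_{Y}^{M} \times \frac{\hat{σ}_{ZM}}{\hat{σ}_{ZT}} + \hat{β}_{Y}^{T}
  && \text{(divide by } \hat{σ}_{ZT} \text{)} \\
\widehat{TE} &= \hat{β}_{Y}^{M} \times \hat{β}_{M}^{T} + \hat{β}_{Y}^{T}
  && \text{(just-identified IV estimates are covariance ratios)}
\end{aligned}
```

With several instruments, the 2SLS estimates are no longer simple covariance ratios, so the step from the second to the third line need not go through.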

It is also worth noting that, in the mediation framework, either the direct effect $β_{Y}^{T}$ or the indirect effect $β_{M}^{T} \times β_{Y}^{M}$ (but not both) can have the opposite sign to the TE. For example, there is nothing logically inconsistent about having a positive TE that is composed of a (larger) positive indirect effect that is partly offset by a negative direct effect, or vice versa. In such a case, a statement like “the indirect effect explains more than 100 percent of the total effect” is not incorrect, but it does require careful explanation to avoid confusion.

3.3 Weak identification with two first-stage regressions

Applied researchers are now well aware of the bias introduced by weak identification in an IV setting (Bound, Jaeger, and Baker 1995). A rule of thumb is that an F test of the excluded instrument(s) in the first stage should yield an F statistic of 10 or more (Stock and Yogo 2005). How does this apply to the IV mediation setting with two first stages? Currently, there is no theory to guide applied researchers. Instead, we apply the code from section 5.1 to simulate the behavior of the estimator under different instrument strengths in the treatment and the mediator first stages. This is done by varying the amount of noise in ε_T and ε_Y .

Figure 1 plots the coefficient values of the total, direct, and indirect effects over different values of the first-stage F statistic. The left panel manipulates the strength of the instrument in the treatment first stage, and the right panel manipulates that in the mediation first stage. The instrument is only ever weak in one of the two first stages but not in both at the same time. Samples were simulated according to (1)–(4) with 1,000 observations for each value of the error variance. The values increase from 1 to 15 in increments of 0.5. In the example, the true values of the direct and indirect effects are both 1, summing up to a true TE of 2. The ivmediate simulations were then run 100 times for each error variance value.

Figure 1. Coefficient values under differing IV strengths in either first stage. NOTE: The left panel uses data simulated for different values of Var(ε_T), and the right panel uses data simulated for different values of Var(ε_Y). In each panel, the error variance ranges from 1 to 15 in steps of 0.5, and at each step, 100 random samples with 1,000 observations were drawn according to (1)–(4). Both panels show binned scatter plots of the coefficient values of the total, direct, and indirect effects over different values of the corresponding first-stage F statistic where the strength of the instrument was manipulated. The true TE is 2, and the true direct and indirect effects are equal to 1.

The left panel shows that, as the treatment first-stage F statistic approaches the rule-of-thumb value of 10, all effects begin to center on their true values. This is also the case in the right panel, but there the direct effect takes longer to center on its true value: it only begins to do so from a mediation first-stage F statistic of about 30. A conservative approach would therefore require a stronger instrument in the mediator first stage to accurately identify all three effects. If interest lies only in the indirect effect, the commonly used rule of thumb for a reasonably strong instrument seems applicable.
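How added noise in ε_T weakens the treatment first stage can be sketched in a few lines of Python. This is not the code behind figure 1: it covers only the first stage in (7), uses the classical (non-robust) F statistic, assumes β_T^Z = 1 and a standard normal Z, and draws one sample per variance. The function names are hypothetical.

```python
import random


def first_stage_f(z, t):
    """F statistic for the coefficient on z in an OLS of t on z and a constant."""
    n = len(z)
    mean_z, mean_t = sum(z) / n, sum(t) / n
    s_zz = sum((a - mean_z) ** 2 for a in z)
    s_tt = sum((b - mean_t) ** 2 for b in t)
    s_zt = sum((a - mean_z) * (b - mean_t) for a, b in zip(z, t))
    r2 = s_zt ** 2 / (s_zz * s_tt)
    # With a single regressor, F = t^2 = (n - 2) * R^2 / (1 - R^2).
    return (n - 2) * r2 / (1 - r2)


def f_by_error_variance(variances, n=1000, seed=1):
    """Simulate T = Z + e_T for each Var(e_T) and return the first-stage F."""
    rng = random.Random(seed)
    results = {}
    for v in variances:
        z = [rng.gauss(0, 1) for _ in range(n)]
        t = [z_i + rng.gauss(0, v ** 0.5) for z_i in z]
        results[v] = first_stage_f(z, t)
    return results


f_stats = f_by_error_variance([1, 5, 15])
for v, f in f_stats.items():
    print(f"Var(e_T)={v}: first-stage F={f:.1f}")
```

The F statistic for the mediator first stage in (9) is a partial F test on Z after conditioning on T and is not covered by this sketch.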

4 The ivmediate command

4.1 Syntax

ivmediate depvar [indepvars] [if] [in], mediator(varname) treatment(varname) instrument(varname) [absorb(varname) full vce(vcetype) level(#)]

4.2 Description

ivmediate implements the causal mediation analysis framework for IV models introduced by Dippel et al. (2020). The command allows the estimation of the causal treatment and mediation effects for potentially endogenous treatment and mediator variables without the need for an additional instrument for the mediator. A single IV suffices to identify both effects.

4.3 Options

mediator( varname ) includes a single mediator variable. mediator() is required.

treatment( varname ) includes a single treatment variable. treatment() is required.

instrument( varname ) includes a single IV. instrument() is required.

absorb( varname ) allows the absorption of one fixed effect. For details, see [R] areg.

full displays intermediate results together with the main results. Specifying this option will display three intermediate output tables:

the IV regression of Y on T (instrumented with Z)

the IV regression of M on T (instrumented with Z), for which the first-stage F statistic is reported as first stage one in the main table

the IV regression of Y on M (instrumented with Z) and controlling for T, for which the first-stage F statistic is reported as first stage two in the main table

The TE is the coefficient on T in the first table; the direct effect is the coefficient on T in the third table; the indirect effect is the product of the coefficient on T in the second table and the coefficient on M in the third. The mediation effect as percentage of the TE is therefore the indirect effect divided by the TE times 100.
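The arithmetic of this decomposition can be written out as a small Python helper. The function name, argument names, and input values are hypothetical and are not ivmediate output. The sketch computes the TE as the sum of the direct and indirect effects, relying on the single-instrument equivalence in section 3.2, whereas ivmediate reads the TE from the first table.

```python
def decompose(coef_t_in_m_eq, coef_m_in_y_eq, coef_t_in_y_eq):
    """Combine the coefficients of the second and third tables.

    coef_t_in_m_eq: coefficient on T in the IV regression of M on T
    coef_m_in_y_eq: coefficient on M in the IV regression of Y on M, given T
    coef_t_in_y_eq: coefficient on T in that same regression (direct effect)
    """
    indirect = coef_t_in_m_eq * coef_m_in_y_eq
    direct = coef_t_in_y_eq
    total = direct + indirect  # equals the TE with a single instrument
    share_pct = 100.0 * indirect / total
    return {"total": total, "direct": direct,
            "indirect": indirect, "share_pct": share_pct}


# Hypothetical coefficients with a negative direct effect.
result = decompose(0.5, 3.0, -0.5)
print(result)
# {'total': 1.0, 'direct': -0.5, 'indirect': 1.5, 'share_pct': 150.0}
```

The hypothetical inputs were chosen so that the direct effect partly offsets the indirect effect, which yields a mediation share above 100 percent, the case discussed in section 3.2.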

vce( vcetype ) may be robust to estimate Eicker/Huber/White standard errors or may be cluster clustervar to estimate cluster–robust standard errors. The default is vce(unadjusted) standard errors.

level( # ) specifies the confidence level, as a percentage, for confidence intervals. Integers between 10 and 99 inclusive are allowed. The default is level(95) or as set by set level; see [U] 20.8 Specifying the width of confidence intervals.

4.4 Stored results

ivmediate stores the following in e():

5 Empirical example

5.1 Simulation exercise

A simulated dataset with the assumed dependent relations can be straightforwardly generated in the following way:

Separately generate error terms ε_T and ε_Y that are normally distributed with mean 0 and variance 1, N(0, 1). These are statistically independent, that is, ε_T ∐ ε_Y .

Let error term ε_M be defined as $ε_{M} = \sqrt{ω} \times ε_{T} + \sqrt{1 - ω} \times ε_{Y}$ for any ω ∊ [0, 1].⁶

The correlation between ε_M and ε_T is given by $ρ_{TM} = \sqrt{ω}$. Thereby, ε_M and ε_T are not independent ($ε_{M} \not\perp\!\!\!\perp ε_{T}$). By symmetry, we also have that $ρ_{MY} = \sqrt{1 - ω}$ and $ε_{M} \not\perp\!\!\!\perp ε_{Y}$. Having drawn ε_T and ε_Y independently implies that the correlation between ε_T and ε_Y is ρ_TY = 0. However, for 0 < ω < 1, conditioning on ε_M = e induces a linear relation between ε_T and ε_Y, namely, $ε_{T} = e / \sqrt{ω} - \sqrt{(1 - ω) / ω} \times ε_{Y}$. Thus, the correlation between ε_T and ε_Y conditioned on ε_M is $ρ_{TY | ε_{M}} = -1$ and, thereby, $ε_{T} \not\perp\!\!\!\perp ε_{Y} | ε_{M}$. A high ω implies a high ρ_TM. By contrast, a low ω implies a high ρ_MY.
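The construction of the error terms can be checked numerically. The following Python sketch is an illustration rather than the article's Stata code; the function names, the seed, and the choice ω = 0.36 (so that the implied correlations are 0.6 and 0.8) are our assumptions.

```python
import math
import random


def corr(a, b):
    """Sample (Pearson) correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    s_aa = sum((x - mean_a) ** 2 for x in a)
    s_bb = sum((y - mean_b) ** 2 for y in b)
    return s_ab / math.sqrt(s_aa * s_bb)


def draw_errors(n, omega, seed):
    """Draw independent e_T, e_Y ~ N(0, 1) and build e_M from them."""
    rng = random.Random(seed)
    e_t = [rng.gauss(0, 1) for _ in range(n)]
    e_y = [rng.gauss(0, 1) for _ in range(n)]
    e_m = [math.sqrt(omega) * a + math.sqrt(1 - omega) * b
           for a, b in zip(e_t, e_y)]
    return e_t, e_m, e_y


e_t, e_m, e_y = draw_errors(n=20000, omega=0.36, seed=1)
rho_tm, rho_my, rho_ty = corr(e_t, e_m), corr(e_m, e_y), corr(e_t, e_y)
print(f"rho_TM={rho_tm:.3f} rho_MY={rho_my:.3f} rho_TY={rho_ty:.3f}")
```

With ω = 0.36, the sample correlations should lie close to √ω = 0.6, √(1 − ω) = 0.8, and 0, up to sampling noise.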

It is instructive to investigate the bias generated by a misspecified model where T and M are assumed to be exogenous, that is, the mutual independence of ε_T , ε_M , and ε_Y is wrongly assumed. Let the data be generated by (1)–(4) and the model coefficients be normalized to equal 1, that is, $β_{T}^{Z} = β_{M}^{T} = β_{Y}^{T} = β_{Y}^{M} = 1$ . The true parameters $β_{M}^{T}, β_{Y}^{T}$ , and $β_{Y}^{M}$ are identified through (7)–(10). If the error terms ε_T , ε_Y , and ε_M were wrongly assumed to be statistically independent, then parameters $β_{M}^{T}, β_{Y}^{T}$ , and $β_{Y}^{M}$ could be estimated by OLS through the following equations:

$$\begin{aligned} \text{OLS}: \quad β_{M}^{T} &= \frac{σ_{TM}}{σ_{TT}} \\ \text{OLS}: \quad β_{Y}^{T} &= \frac{σ_{MM} σ_{TY} - σ_{TM} σ_{MY}}{σ_{MM} σ_{TT} - σ_{TM}^{2}} \\ \text{OLS}: \quad β_{Y}^{M} &= \frac{σ_{TT} σ_{MY} - σ_{TM} σ_{TY}}{σ_{MM} σ_{TT} - σ_{TM}^{2}} \end{aligned}$$

While the true parameters are set to be 1, the OLS estimators may range from 0 to 2 depending on the error correlations. Because a high ω implies pronounced bias in the relation between T and M (a high ρ_TM ), the OLS estimate of $β_{M}^{T}$ diverges from the true value 1 as ω increases. By contrast, the OLS estimates of $β_{Y}^{T}$ and $β_{Y}^{M}$ converge to the true value 1.
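The range of the OLS values can be reproduced from population moments. The Python sketch below is our own calculation, not taken from the article: it assumes Var(Z) = Var(ε_T) = Var(ε_Y) = 1 and all coefficients equal to 1, derives the implied covariances of T, M, and Y, and plugs them into the usual OLS moment formulas (the slope σ_TM/σ_TT for M on T, and the two-regressor solution for Y on T and M). The function name `ols_limits` is hypothetical.

```python
import math


def ols_limits(omega):
    """Population OLS coefficients (M on T; Y on T and M) for a given omega."""
    s = math.sqrt(omega)        # corr(e_T, e_M)
    c = math.sqrt(1.0 - omega)  # corr(e_M, e_Y)
    # Implied covariances under T = Z + e_T, M = T + e_M, Y = T + M + e_Y.
    s_tt = 2.0
    s_tm = 2.0 + s
    s_mm = 3.0 + 2.0 * s
    s_ty = 4.0 + s
    s_my = 5.0 + 3.0 * s + c
    det = s_mm * s_tt - s_tm ** 2
    b_mt = s_tm / s_tt                         # OLS of M on T
    b_yt = (s_mm * s_ty - s_tm * s_my) / det   # coefficient on T in Y on T, M
    b_ym = (s_tt * s_my - s_tm * s_ty) / det   # coefficient on M in Y on T, M
    return b_mt, b_yt, b_ym


for omega in (0.0, 0.5, 1.0):
    b_mt, b_yt, b_ym = ols_limits(omega)
    print(f"omega={omega}: M on T={b_mt:.3f}, Y on T={b_yt:.3f}, Y on M={b_ym:.3f}")
```

Under these assumptions, ω = 0 gives (1, 0, 2) and ω = 1 gives (1.5, 1, 1) for the three coefficients, so the OLS value for β_M^T moves away from 1 as ω rises while the other two move toward 1.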

Given the model parameters, the TE of $β_{Y}^{M} \times β_{M}^{T} + β_{Y}^{T} = 1 \times 1 + 1 = 2$ is not recovered by simple OLS. In fact, not even the 95% confidence interval would include the true TE. On the other hand, 2SLS did recover the TE, but it could not disentangle the direct effect of the treatment (net of the mediator) from the indirect effect of the mediating variable. The simulation shows how ivmediate can both recover the true TE and decompose it into the direct and indirect effects as described in the theoretical section.

5.2 Applied example using the Becker and Woessmann (2009) data

The example below uses data from Becker and Woessmann (2009), who estimate the effect of Protestantism on economic prosperity in Prussian counties. To obtain exogenous variation in the share of Protestants in these counties, they used the fact that Protestantism spread concentrically around Wittenberg, the city where Martin Luther taught and preached. Following their example, we use distance to Wittenberg (kmwitt) as an instrument for the share of Protestants (f_prot) with the outcome being the per capita income tax (inctax) in 1877 as a measure for economic performance. The mediator we consider is the share of literate population (f_rw).

According to Becker and Woessmann (2009), Protestantism promoted reading of the Bible, which led to human capital accumulation and therefore promoted economic development. They are interested in estimating

$$Y = α \, \mathrm{Prot} + χ \, \mathrm{Lit} + X^{\prime} γ + ε \tag{12}$$

though they note that the “problem with such a model is that not only Protestantism but also literacy may be endogenous in this setting” (p. 570). Because they have no additional instrument for literacy, they use different types of bounding exercises using estimates from previous literature on the returns to education (see section VI.C in the original study). Using ivmediate, we can go further and directly estimate the effect of Protestantism that is mediated by literacy with only one instrument.

As in the original study, we condition on further covariates in the estimation of (12), which are the share of Jewish population, female population, individuals aged below 10, the share of population of Prussian origin, average household size, population size of the county, the percentage population growth between 1867 and 1871, and the share of the population with missing information on literacy.⁷

The TE estimate implies that every 1 percentage point increase in the share of Protestants increases per capita income tax revenues by 0.83 Marks. Under the typical IV assumptions, this effect is causal. The direct effect estimate attributes only 0.08 Marks of this increase to Protestantism itself, and it is not statistically significant. The indirect effect estimate, however, attributes 0.75 Marks of this increase to literacy as a mediating factor. This implies that literacy explains 90% of the TE of Protestantism on economic outcomes. This is in line with the findings by Becker and Woessmann (2009), who conclude that “Protestants’ higher literacy can account for roughly the whole gap in economic outcomes between the two denominations [Catholics and Protestants]” (p. 576).


Supplemental Material, st0611 for Causal mediation analysis in instrumental-variables regressions by Christian Dippel, Andreas Ferrara and Stephan Heblich in The Stata Journal


6 Programs and supplemental materials

To install a snapshot of the corresponding software files as they existed at the time of publication of this article, type


References

Baron, R. M., and D. A. Kenny. 1986. The moderator–mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology 51: 1173–1182. https://doi.org/10.1037/0022-3514.51.6.1173.

Becker, S. O., and L. Woessmann. 2009. Was Weber wrong? A human capital theory of Protestant economic history. Quarterly Journal of Economics 124: 531–596. https://doi.org/10.1162/qjec.2009.124.2.531.

Bound, J., D. A. Jaeger, and R. M. Baker. 1995. Problems with instrumental variables estimation when the correlation between the instruments and the endogeneous explanatory variable is weak. Journal of the American Statistical Association 90: 443–450. https://doi.org/10.2307/2291055.

Dippel, C., R. Gold, S. Heblich, and R. Pinto. 2020. Mediation analysis in IV settings with a single instrument. Unpublished manuscript.

Frölich, M., and M. Huber. 2017. Direct and indirect treatment effects—causal chains and mediation analysis with instrumental variables. Journal of the Royal Statistical Society, Series B 79: 1645–1666. https://doi.org/10.1111/rssb.12232.

Imai, K., L. Keele, and D. Tingley. 2010. A general approach to causal mediation analysis. Psychological Methods 15: 309–334. https://doi.org/10.1037/a0020761.

Jun, S. J., J. Pinkse, and N. Yildiz. 2016. Multiple discrete endogenous variables in weakly-separable triangular models. Econometrics 4: 7. https://doi.org/10.3390/econometrics4010007.

Kleibergen, F., and R. Paap. 2006. Generalized reduced rank tests using the singular value decomposition. Journal of Econometrics 133: 97–126. https://doi.org/10.1016/j.jeconom.2005.02.011.

Kleibergen, F., and M. E. Schaffer. 2007. ranktest: Stata module to test the rank of a matrix using the Kleibergen–Paap rk statistic. Statistical Software Components S456865, Department of Economics, Boston College. https://ideas.repec.org/c/boc/bocode/s456865.html.

MacKinnon, D. P. 2008. Introduction to Statistical Mediation Analysis. New York: Lawrence Erlbaum.

Stock, J. H., and M. Yogo. 2005. Testing for weak instruments in linear IV regression. In Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg, ed. D. W. K. Andrews and J. H. Stock, 80–108. New York: Cambridge University Press.
