Average treatment effect estimates robust to the “limited overlap” problem: robustate

Abstract

We introduce a new command, robustate, that executes the inverse-probability weighting estimation and inference for the average treatment effect with robustness against limited overlap (that is, weak satisfaction of the common support condition). This command produces estimates, standard errors, p-values, and confidence intervals for the average treatment effect. The utility of the command is demonstrated with both simulated data and real data on right heart catheterization. These illustrations show that the proposed estimator implemented by the robustate command is indeed more robust against limited overlap than the traditional inverse-probability weighting estimator. The main method of the command is proposed in Sasaki and Ura (2022, Econometric Theory 38: 66–112).

Keywords

st0674, robustate, average treatment effect, bias correction, common support, inverse-probability weighting, limited overlap, robustness, trimming

1 Introduction

The average treatment effect (ate) measures the average E(y ₁ − y ₀) of the difference between the potential outcome under treatment (y ₁) and the potential outcome under no treatment (y ₀). The difference of the potential outcomes (y ₁ − y ₀) is not directly observed in data, because we observe either y ₁ or y ₀, but not both, for each individual. Unobserved counterfactual potential outcomes can be inferred from similar observations in terms of covariates or propensity scores. However, such an inference is feasible only when there are similar observations for each observation in a dataset. If the support of the covariates for the treatment group does not overlap with the support of the covariates for the control group, then similar observations in the counterfactual group may not exist for every observation. Thus, the existing methods to estimate the ate, such as the inverse propensity-score weighting estimator implemented by the teffects ipw command and the matching-type estimators implemented by the teffects psmatch and teffects nnmatch commands, require the common support condition, also known as the strong overlap condition.

Several datasets are known to violate this strong overlap condition and are said to have limited overlap. Limited overlap can also be characterized by many observations with propensity-score values close to 0 or 1; for instance, an observation with a propensity-score value of 0.001 is untreated with probability 99.9%, and you would be unlikely to find a similar observation in the treatment group. For inverse-probability weighting (ipw) estimation, implemented by the teffects ipw command, propensity-score values close to 0 or 1 imply near-zero denominators and thus outliers. In turn, outliers lead to very large standard errors. A common practical solution to this limited overlap problem is to trim observations whose estimated propensity-score values are close to 0 or 1, but this ad hoc procedure generally biases the estimates of the ate. In Sasaki and Ura (2022), we propose a debiased trimmed ipw estimation procedure, along with a valid standard error of the procedure that accounts for the bias correction. In this article, we introduce the robustate command for the debiased trimmed ipw estimation of the ate, based on our proposed procedure.

We first review the method in sections 2 and 3. The command is introduced in section 4, followed by simulation and real-data analyses in sections 5 and 6, respectively.

2 Review of the method

This section reviews the method of debiased trimmed estimation and inference according to Sasaki and Ura (2022). Specifics and additional practical details implemented by the robustate command will be presented in section 3.

ipw estimands take the form of

θ_{0} = E\left\{\frac{g_{B}(w, γ_{0})}{g_{A}(w, γ_{0})}\right\} \tag{1}

where {g_A (w, γ ₀), g_B (w, γ ₀)} is a pair of functions of the observed random vector w and an unknown finite-dimensional parameter vector γ ₀ ∊ Γ. The problem of limited overlap is characterized by close-to-0 denominators g_A (w, γ ₀), which cause large variances in estimating the moment of ratios (1).

A common solution is to trim observations with small denominators g_A (w, γ ₀). We write an estimator with smooth trimming as

\tilde{θ}(h_{n}) = E_{n}\left[\frac{g_{B}(w, \hat{γ})}{g_{A}(w, \hat{γ})} \times S\left\{\frac{g_{A}(w, \hat{γ})}{h_{n}}\right\}\right] \tag{2}

where $\hat{γ}$ is an estimator of γ ₀, h_n is a tuning parameter of trimming, and S is a smoothed indicator function. The idea is to softly trim those observations with small denominator values g_A (w, $\hat{γ}$ ), where h_n defines “small”. The population counterpart of (2) can be written as

θ(h_{n}) = E\left[\frac{g_{B}(w, γ_{0})}{g_{A}(w, γ_{0})} \times S\left\{\frac{g_{A}(w, γ_{0})}{h_{n}}\right\}\right] \tag{3}

Estimand (3) with the trimming is different from the original estimand (1), and hence the trimming clearly induces a bias in estimating θ ₀ in general.

Theorem 3 of Sasaki and Ura (2022) implies that the bias from the trimming can be estimated by

\hat{λ}(h_{n}) = \sum_{κ = 1}^{k - 1} \frac{E_{n}\left(g_{A}(w, \hat{γ})^{κ - 1} \left[S\left\{\frac{g_{A}(w, \hat{γ})}{h_{n}}\right\} - 1\right]\right)}{κ!} \times \hat{m}^{(κ)}(0; \hat{γ}) \tag{4}

where

\hat{m}^{(κ)}(0; γ) = p_{K}^{(κ)}(0)' E_{n}\left[p_{K}\{g_{A}(w, \hat{γ})\} p_{K}\{g_{A}(w, \hat{γ})\}'\right]^{-1} E_{n}\left[p_{K}\{g_{A}(w, \hat{γ})\} g_{B}(w, \hat{γ})\right]

for each κ ∊ {1,…, k − 1} and p _K denotes an orthonormal basis of degree K. While this formula for ${\hat{m}}^{(κ)} (0; γ)$ may appear complicated, it in fact represents a prediction through the regression of the numerator g_B (w, $\hat{γ}$ ) on polynomials p _K{g_A (w, $\hat{γ}$ )} of the denominator g_A (w, $\hat{γ}$ ). With the bias estimator (4) subtracted from the biased trimmed estimator (2), we obtain the bias-corrected trimmed estimator

\hat{θ}(h_{n}) = E_{n}\left[\frac{g_{B}(w, \hat{γ})}{g_{A}(w, \hat{γ})} \times S\left\{\frac{g_{A}(w, \hat{γ})}{h_{n}}\right\}\right] - \sum_{κ = 1}^{k - 1} \frac{E_{n}\left(g_{A}(w, \hat{γ})^{κ - 1} \left[S\left\{\frac{g_{A}(w, \hat{γ})}{h_{n}}\right\} - 1\right]\right)}{κ!} \times \hat{m}^{(κ)}(0; \hat{γ})
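
The form of the bias estimator can be motivated by a heuristic Taylor expansion. The display below is only an informal sketch and not the formal argument of Sasaki and Ura (2022). It introduces the shorthand m(a) = E(g_B | g_A = a), with the dependence on γ ₀ suppressed, and it assumes that m is sufficiently smooth near a = 0 with m(0) = 0.

```latex
\begin{aligned}
\theta(h_n) - \theta_0
  &= E\left[\frac{g_B}{g_A}\left\{S\!\left(\frac{g_A}{h_n}\right) - 1\right\}\right]
   = E\left[\frac{m(g_A)}{g_A}\left\{S\!\left(\frac{g_A}{h_n}\right) - 1\right\}\right] \\
  &\approx \sum_{\kappa = 1}^{k - 1} \frac{m^{(\kappa)}(0)}{\kappa!}\,
     E\left[g_A^{\kappa - 1}\left\{S\!\left(\frac{g_A}{h_n}\right) - 1\right\}\right]
\end{aligned}
```

The first equality subtracts (1) from (3), the second uses the law of iterated expectations, and the approximation replaces m(a) with its Taylor polynomial around a = 0, which is reasonable because S(g_A / h_n) − 1 is nonzero only when g_A < h_n. Replacing population moments and derivatives with sample analogues gives an expression of the same form as (4).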

After obtaining the debiased trimmed estimator, one must still estimate the variance of this estimator, accounting for the bias estimation as well as the estimation of γ ₀. In Sasaki and Ura (2022), we show in our lemma 10 that the influence function of $\hat{θ} (h_{n})$ can be written as

\begin{aligned} z(h_{n}) ={} & ω_{1, n}(w, γ_{0}) + E\left\{\left. \frac{\partial}{\partial γ} ω_{1, n}(w, γ) \right|_{γ = γ_{0}}\right\}' \times ϕ \\ & + \sum_{κ = 1}^{k - 1} \frac{ω_{2, κ, n}(w, γ_{0}) \times m^{(κ)}(0) + E\{ω_{2, κ, n}(w, γ_{0})\} \times ψ_{κ}}{κ!} \\ & + \sum_{κ = 1}^{k - 1} \frac{m^{(κ)}(0) \times E\left\{\left. \frac{\partial}{\partial γ'} ω_{2, κ, n}(w, γ) \right|_{γ = γ_{0}}\right\} + E\{ω_{2, κ, n}(w, γ_{0})\} \times \left. \frac{\partial}{\partial γ'} m^{(κ)}(0; γ) \right|_{γ = γ_{0}}}{κ!} \times ϕ \end{aligned} \tag{5}

where ψ_κ and φ denote the influence functions of ${\hat{m}}^{(κ)} (0; γ)$ and $\tilde{γ}$ , respectively, and

\begin{aligned} ω_{1, n}(w, γ) &= g_{B}(w, γ) \frac{S\left\{\frac{g_{A}(w, γ)}{h_{n}}\right\}}{g_{A}(w, γ)} \\ ω_{2, κ, n}(w, γ) &= g_{A}(w, γ)^{κ - 1} \times \left[S\left\{\frac{g_{A}(w, γ)}{h_{n}}\right\} - 1\right] \end{aligned}

In theorem 4 in Sasaki and Ura (2022), we show the asymptotic normality

\frac{\hat{θ}(h_{n}) - θ_{0}}{\sqrt{\operatorname{Var}\{z(h_{n})\} / n}} \overset{d}{\to} N(0, 1)

under suitable assumptions.

The above theory motivates $\hat{θ} (h_{n})$ as a debiased trimmed estimator of θ ₀ and

\sqrt{\widehat{\operatorname{Var}}\{z(h_{n})\} / n}

as its asymptotically valid standard error, accounting for the bias correction and a preliminary estimation of γ ₀. The robustate command uses these expressions to produce its output, which consists of estimates, standard errors, p-values, and confidence intervals for the ate.
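
To make the last step concrete, the following Python sketch (written in Python rather than Stata purely for illustration) turns a point estimate and a list of estimated influence-function values into a standard error, a normal-approximation confidence interval, and a two-sided p-value. The function names and the inputs are hypothetical; in particular, the use of the n − 1 divisor in the sample variance is an assumption and need not match what robustate does internally.

```python
import math
from statistics import NormalDist, variance


def se_and_ci(theta_hat, z_values, level=0.95):
    """Standard error sqrt(Var(z) / n) and a normal-approximation interval."""
    n = len(z_values)
    # ASSUMPTION: sample variance with the n - 1 divisor.
    se = math.sqrt(variance(z_values) / n)
    crit = NormalDist().inv_cdf(0.5 + level / 2.0)  # e.g. about 1.96 for 95%
    return se, (theta_hat - crit * se, theta_hat + crit * se)


def p_value(theta_hat, se):
    """Two-sided p-value for the null hypothesis that the effect is zero."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(theta_hat / se)))
```

The interval relies on the asymptotic normality stated above, so it inherits the assumptions under which that result holds.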

3 Specifics and additional details about the method

While section 2 presents a review of the general method, this section focuses on the ate and presents specifics and additional details about the method implemented by the robustate command.

3.1 ATE

Suppose that a researcher observes a random sample of w = (y, d, x ^′ ) ^′ , where y denotes an observed outcome, d denotes the binary indicator of an observed treatment, and x denotes the vector of observed controls. Suppose that p(x, γ ₀) models the conditional probability P (d = 1| x) of receiving the treatment given the observed characteristics x, also known as the propensity score. With this notation, the ipw estimand for the ate can be written in the form of (1), where {g_A (w, γ ₀), g_B (w, γ ₀)} is defined by

\begin{aligned} g_{A}(w, γ_{0}) &= d + \{p(x, γ_{0}) - 1\} \times (2 d - 1) \\ g_{B}(w, γ_{0}) &= y \times (2 d - 1) \end{aligned}

There are alternative ways to define {g_A (w, γ ₀), g_B (w, γ ₀)} for the ate. We use this particular definition of {g_A (w, γ ₀), g_B (w, γ ₀)} to conveniently trim observations with propensity scores close to both 0 and 1 by (2) [note that g_A (w, γ ₀) = 1 − p(x, γ ₀) if d = 0 and g_A (w, γ ₀) = p(x, γ ₀) if d = 1].
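
As a small check of this definition, the Python sketch below (Python is used only for illustration; the function names are hypothetical and not part of robustate) codes the denominator and numerator and averages their ratio, which gives the untrimmed ipw estimate when the propensity scores are treated as known.

```python
def g_a(d, p):
    """Denominator: p if treated (d = 1) and 1 - p if untreated (d = 0)."""
    return d + (p - 1.0) * (2 * d - 1)


def g_b(y, d):
    """Numerator: +y for treated and -y for untreated observations."""
    return y * (2 * d - 1)


def naive_ipw(y, d, p):
    """Sample average of g_B / g_A, i.e. the untrimmed ipw estimate."""
    n = len(y)
    return sum(g_b(yi, di) / g_a(di, pi) for yi, di, pi in zip(y, d, p)) / n


# Toy data, made up for illustration only.
y = [2.0, 1.0, 3.0, 0.5]
d = [1, 0, 1, 0]
p = [0.5, 0.5, 0.5, 0.5]
estimate = naive_ipw(y, d, p)  # (4 - 2 + 6 - 1) / 4 = 1.75
```

Each term equals y/p for a treated observation and −y/(1 − p) for an untreated one, so a propensity score near 0 or 1 in the relevant group produces a very large term.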

3.2 Propensity-score estimation

Suppose that the propensity-score function is modeled by the generalized linear model p(x, γ ₀) = Λ(x ^′ γ ₀) for some link function Λ. The robustate command contains two options of Λ. One option is the standard logistic cumulative distribution function (cdf),

Λ(z) = \frac{1}{1 + \exp(-z)}

in which case p(x, γ ₀) = Λ(x ^′ γ ₀) represents the logit binary treatment choice model. The other option is the standard normal cdf,

Λ (z) = Φ (z)

in which case p(x, γ ₀) = Λ(x ^′ γ ₀) represents the probit binary treatment choice model.

The parameter vector γ ₀ of the generalized linear model is estimated by the maximum-likelihood estimation procedure. Concretely, the estimator $\tilde{γ}$ of γ ₀ is defined by

\tilde{γ} = \underset{γ \in Γ}{\arg\max}\; Q_{n}(γ) \tag{6}

where the criterion function Q_n (γ) takes the form of

Q_{n}(γ) = E_{n}\left[d \log\{Λ(x'γ)\} + (1 - d) \log\{1 - Λ(x'γ)\}\right]

3.3 Influence function of preliminary estimation

The asymptotic variance of the debiased trimmed estimator $\hat{θ} (h_{n})$ needs to account for the preliminary estimation of γ ₀. As such, the influence function representation (5) of $\hat{θ} (h_{n})$ involves the influence function φ of the preliminary estimator $\tilde{γ}$ . In light of the concrete estimator $\tilde{γ}$ defined in (6), the influence function φ of the preliminary estimator $\tilde{γ}$ can be estimated by

\hat{ϕ} = -\left\{\frac{\partial^{2}}{\partial γ \, \partial γ'} Q_{n}(\tilde{γ})\right\}^{-1} \left. \frac{\partial}{\partial γ} \left[d \log\{Λ(x'γ)\} + (1 - d) \log\{1 - Λ(x'γ)\}\right] \right|_{γ = \tilde{γ}} \tag{7}

In the special case of the logit link function, it can be more explicitly written with the analytic expression

\hat{ϕ} = E_{n}\left[x Λ(x'\tilde{γ}) \{1 - Λ(x'\tilde{γ})\} x'\right]^{-1} x \{d - Λ(x'\tilde{γ})\}

The robustate command uses this explicit formula for the case of logit propensity-score estimation, while it uses numerical derivatives in (7) for the case of probit propensity-score estimation.
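
The logit case can be illustrated with a deliberately simplified Python sketch (Python is used only for illustration). It assumes a single covariate and no intercept, so that the matrix inverse reduces to a scalar division; the function names are hypothetical, and the Newton-Raphson loop is a generic maximizer rather than the routine used by robustate. For the logit link, the score of one observation is x{d − Λ(xγ)} and the negative Hessian is the average of x²Λ(xγ){1 − Λ(xγ)}.

```python
import math


def logistic(z):
    """Standard logistic cdf."""
    return 1.0 / (1.0 + math.exp(-z))


def information(x, gamma):
    """Average of x^2 * Lambda * (1 - Lambda), the negative Hessian of Q_n."""
    lam = [logistic(xi * gamma) for xi in x]
    return sum(xi * xi * li * (1.0 - li) for xi, li in zip(x, lam)) / len(x)


def fit_logit_scalar(x, d, iterations=50):
    """Newton-Raphson maximizer of Q_n for one covariate and no intercept."""
    gamma = 0.0
    for _ in range(iterations):
        score = sum(xi * (di - logistic(xi * gamma)) for xi, di in zip(x, d)) / len(x)
        gamma += score / information(x, gamma)
    return gamma


def influence_values(x, d, gamma):
    """Estimated influence function value of the logit estimator per observation."""
    info = information(x, gamma)
    return [xi * (di - logistic(xi * gamma)) / info for xi, di in zip(x, d)]


# Toy data, made up for illustration only (not perfectly separated).
x = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 1.5, -1.5]
d = [0, 0, 1, 0, 1, 1, 1, 0]
gamma_tilde = fit_logit_scalar(x, d)
phi_hat = influence_values(x, d, gamma_tilde)
```

Because the score averages to zero at the maximizer, the influence values also average to zero, which is a convenient sanity check on any implementation.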

3.4 Smoothed indicator function

The smoothed indicator function S can be defined arbitrarily as long as it satisfies S(0) = 0, S(u) = 1 for u ≥ 1, and some regularity conditions. In the implementation by the robustate command, the specific form of S defined by

S(u) = \begin{cases} 0 & \text{if } u < 0 \\ 6 u^{5} - 15 u^{4} + 10 u^{3} & \text{if } 0 \leq u \leq 1 \\ 1 & \text{if } 1 < u \end{cases}

is used. With this definition of S, the smooth trimmed estimator (2) gives full unit weight to observations with a large denominator $[g_{A} (w, \hat{γ}) \geq h_{n}]$ and a smaller weight, below one, to observations with a small denominator $[g_{A} (w, \hat{γ}) < h_{n}]$ . The relatively high polynomial degree (five) is needed to satisfy the regularity conditions in theorem 5 of Sasaki and Ura (2022).
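
A minimal Python sketch of this smoothed indicator and of the smoothly trimmed average in (2) follows (Python is used only for illustration). The function names and the toy inputs are hypothetical; the default h = 0.1 mirrors the default of the h() option, and observations with a nonpositive denominator are given zero weight as a numerical safeguard that is not discussed in the article.

```python
def smooth_indicator(u):
    """Quintic smoothed indicator: 0 below 0, 1 above 1, smooth in between."""
    if u < 0.0:
        return 0.0
    if u > 1.0:
        return 1.0
    return 6.0 * u**5 - 15.0 * u**4 + 10.0 * u**3


def trimmed_ipw(g_b_values, g_a_values, h=0.1):
    """Sample average of (g_B / g_A) * S(g_A / h), without bias correction."""
    total = 0.0
    for a, b in zip(g_a_values, g_b_values):
        if a > 0.0:  # safeguard: skip a division by zero; S(0) = 0 anyway
            total += b / a * smooth_indicator(a / h)
    return total / len(g_a_values)


# Toy values: the second observation has a small denominator and is downweighted.
value = trimmed_ipw([1.0, 1.0], [0.5, 0.05], h=0.1)  # (2 * 1 + 20 * 0.5) / 2 = 6
```

At u = 0.5 the weight is exactly one half, so an observation whose denominator equals h/2 contributes half of its untrimmed ratio.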

3.5 Shifted orthonormal Legendre polynomial basis

The bias correction in (4) is based on a sieve nonparametric estimation with an orthonormal basis p _K (a) of degree K. The robustate command uses the shifted orthonormal Legendre polynomial basis, which is given by

p_{K}(a) = \begin{pmatrix} 1 \\ \sqrt{3} (2 a - 1) \\ \sqrt{5} (6 a^{2} - 6 a + 1) \\ \sqrt{7} (20 a^{3} - 30 a^{2} + 12 a - 1) \\ \sqrt{9} (70 a^{4} - 140 a^{3} + 90 a^{2} - 20 a + 1) \\ \sqrt{11} (252 a^{5} - 630 a^{4} + 560 a^{3} - 210 a^{2} + 30 a - 1) \\ \vdots \end{pmatrix}

The minimum value of K that is allowed in theory is 4. The default value of K in the robustate command is set to 4 (see section 4.2).
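
The basis can be generated without typing each polynomial by hand. The Python sketch below (Python is used only for illustration) evaluates the basis with the standard three-term recurrence for Legendre polynomials applied to t = 2a − 1 and rescales the jth term by √(2j + 1). The function name is hypothetical, and treating K as the number of basis functions returned is an assumption about how K is counted.

```python
import math


def shifted_legendre_basis(a, K=4):
    """First K orthonormal shifted Legendre polynomials on [0, 1] at point a."""
    t = 2.0 * a - 1.0  # map [0, 1] to [-1, 1]
    values = [1.0, t]  # Legendre polynomials of degree 0 and 1
    for j in range(1, K - 1):
        # Bonnet recurrence: (j + 1) P_{j+1}(t) = (2j + 1) t P_j(t) - j P_{j-1}(t)
        values.append(((2 * j + 1) * t * values[j] - j * values[j - 1]) / (j + 1))
    # Rescale so that each basis function has unit L2 norm on [0, 1].
    return [math.sqrt(2 * j + 1) * values[j] for j in range(K)]


basis_at_point = shifted_legendre_basis(0.3, K=4)
```

For example, the third entry at a = 0.3 equals √5 × (6 × 0.09 − 6 × 0.3 + 1) = −0.26√5, in agreement with the explicit expression displayed above.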

4 The robustate command

4.1 Syntax

The syntax of the robustate command is as follows:

robustate outcome treatment controls [if] [in] [, probit h(real) k(real)]

outcome stands for the outcome variable y, treatment stands for the binary indicator of a treatment d, and controls include observed controls x. Exactly one outcome variable, exactly one treatment variable, and at least one controls variable must be included to run the command.

4.2 Options

probit specifies to estimate the propensity score with probit estimation. The default is to use logit estimation.

h(real) sets the trimming threshold. The default is h(0.1). real must be a real number in (0, 1). Larger values induce larger biases of the naïve estimator.

k(real) sets the sieve dimension for bias correction. The default is k(4). real must be an integer no smaller than 4.

4.3 Stored results

robustate stores the following in e():

5 Simulation studies

In this section, we use Monte Carlo simulations to evaluate the small-sample performance of the method implemented by the robustate command. We use the data-generating design outlined below.

First, generate the five-dimensional vector x = (x ₁ ,…, x ₅) ^′ of controls according to the t distribution of df degrees of freedom:

x_{1}, \dots, x_{5} \overset{\text{i.i.d.}}{\sim} t(df)

In turn, the propensity score p is produced by

p = Λ (x^{'} γ_{0})

where Λ is the logistic cdf and γ ₀ = (1.0, 0.8, 0.6, 0.4, 0.2) ^′ . Note that a smaller value of df induces a heavier-tailed distribution of x ₁ ,…, x ₅, which in turn pushes the propensity score p closer to 0 or 1. In other words, the smaller df is, the more severe the limited overlap becomes.

Given the treatment choice d and the propensity score p, the outcome variable is finally generated by

\begin{aligned} y &= (1 - d) \times y_{0} + d \times y_{1} \\ y_{0} &= v_{0} \\ y_{1} &= v_{1} + \{1 + 0.5 \times Φ^{-1}(p)\} \end{aligned}

where $v_{0}, v_{1} \overset{\text{i.i.d.}}{\sim} N (0, 1)$ and Φ ⁻ ¹ denotes the quantile function of the standard normal distribution. Recall that y ₀ and y ₁ denote the potential outcomes under no treatment and under treatment, respectively, which are not usually observed by a researcher. Note that the inclusion of p in the equation of the potential outcome y ₁ induces endogeneity.
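
One draw from this design can be sketched in Python as follows (Python is used only for illustration; the function names are hypothetical). Two points go beyond what is stated above and are assumptions: the treatment is drawn as d ~ Bernoulli(p), and p is kept strictly inside (0, 1) before the normal quantile is applied, purely as a numerical safeguard.

```python
import math
import random
from statistics import NormalDist

GAMMA_0 = (1.0, 0.8, 0.6, 0.4, 0.2)  # coefficient vector from the text


def draw_t(rng, df):
    """One Student t(df) draw: standard normal over sqrt(chi-squared(df) / df)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-squared(df) = Gamma(df/2, scale 2)
    return z / math.sqrt(chi2 / df)


def logistic(z):
    """Numerically stable logistic cdf."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def simulate(n, df, seed=0):
    """Return n simulated observations as dictionaries."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    sample = []
    for _ in range(n):
        x = [draw_t(rng, df) for _ in range(5)]
        p = logistic(sum(g * xj for g, xj in zip(GAMMA_0, x)))
        # Safeguard (not in the article): keep p strictly inside (0, 1).
        p = min(max(p, 1e-12), 1.0 - 1e-12)
        d = 1 if rng.random() < p else 0  # ASSUMPTION: d ~ Bernoulli(p)
        y0 = rng.gauss(0.0, 1.0)
        y1 = rng.gauss(0.0, 1.0) + 1.0 + 0.5 * std_normal.inv_cdf(p)
        sample.append({"x": x, "p": p, "d": d, "y0": y0, "y1": y1,
                       "y": (1 - d) * y0 + d * y1})
    return sample


data = simulate(n=200, df=5, seed=1)
```

Lowering df in this sketch produces more propensity scores near 0 and 1, which is the mechanism behind the limited overlap studied in this section.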

We vary the sample size n ∊ {200, 400} and the degrees-of-freedom parameter df ∊ {3, 5, 7, 9} across sets of simulations. Each set of simulations is based on 500 Monte Carlo iterations. We summarize the simulation results in the table below. For each of the naïve ipw estimator and our proposed robust ipw estimator, the table reports the root mean squared error (rmse), the 90% coverage frequency, and the 95% coverage frequency.

n	df	Naïve ipw rmse	Naïve ipw 90%	Naïve ipw 95%	Robust ipw rmse	Robust ipw 90%	Robust ipw 95%
200	9	0.220	0.922	0.964	0.179	0.890	0.944
400	9	0.148	0.916	0.962	0.126	0.912	0.958
200	7	0.217	0.920	0.962	0.180	0.896	0.958
400	7	0.156	0.906	0.966	0.129	0.906	0.956
200	5	0.244	0.912	0.968	0.187	0.900	0.948
400	5	0.224	0.892	0.954	0.126	0.910	0.954
200	3	0.831	0.890	0.948	0.194	0.884	0.934
400	3	0.346	0.896	0.954	0.134	0.884	0.944

Observe that the coverage frequencies are fairly close to the respective nominal probabilities for both estimators. On the other hand, the rmse differs markedly between the naïve and robust estimators. Specifically, for each sample size n and each value of df, the rmse of the naïve estimator is larger than that of the robust estimator. Furthermore, this difference tends to grow as the df of the t distribution decreases: with n = 200, the naïve rmse is 0.220 against 0.179 for the robust estimator when df = 9, but 0.831 against 0.194 when df = 3. Recall that the problem of limited overlap is more severe as df gets smaller. Hence, these results show that the naïve estimator is vulnerable to limited overlap, while our proposed estimator remains robust against it.
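
For readers who wish to reproduce summary statistics of this kind, the two measures can be computed from Monte Carlo output as in the Python sketch below (Python is used only for illustration). The function names and the toy inputs are hypothetical, and the true value of the ate under the simulation design must be supplied separately.

```python
import math


def rmse(estimates, truth):
    """Root mean squared error of the estimates around the true value."""
    return math.sqrt(sum((e - truth) ** 2 for e in estimates) / len(estimates))


def coverage(estimates, std_errors, truth, crit=1.959963984540054):
    """Share of intervals estimate +/- crit * se that contain the true value."""
    hits = sum(1 for e, s in zip(estimates, std_errors)
               if abs(e - truth) <= crit * s)
    return hits / len(estimates)


# Toy inputs, made up for illustration only.
toy_rmse = rmse([1.0, 3.0], truth=2.0)  # 1.0
toy_coverage = coverage([1.0, 3.0, 2.5], [0.6, 0.4, 0.1], truth=2.0)  # 1 of 3
```

The default critical value corresponds to a nominal 95% interval; a 90% interval would use approximately 1.645 instead.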

6 Illustration of the command

We now illustrate the robustate command with an analysis of the treatment effect of catheterization on 30-day survival using a subsample of real data. A small-sample dataset is available in the robustate package, and it can be loaded by typing the following command:

use catheterization_small

The command below executes the estimation and inference for the ate of catheterization on 30-day survival.

Here outcome is the name of the outcome variable, treat is the name of the binary indicator of a treatment, and all the rest are the names of control variables.

This command produces the results displayed below. The first row of this output shows an estimate based on the naïve inverse propensity-score weighting estimation method, which is not robust against limited overlap. The second row, on the other hand, shows an estimate based on our proposed debiased trimmed estimation method, which is robust against limited overlap.

Notice that the robust estimator yields a much narrower 95% confidence interval than the naïve estimator. This result is consistent with the simulation results presented in section 5. It is also worth noting that the robust method indicates a statistically significant ate, whereas the naïve method does not. The counterintuitive conclusion that right heart catheterization has a negative effect on 30-day survival is consistent with previous findings.

7 Conclusion

In this article, we introduced the robustate command, which executes an ipw estimation and inference for the ate with robustness against limited overlap based on the method of Sasaki and Ura (2022). We illustrated the command with simulated data and demonstrated that the proposed method achieves smaller rmses than the traditional estimator, especially as the limited overlap becomes more severe. We also illustrated the command with a small real dataset on right heart catheterization and showed that the proposed method delivers a narrower confidence interval than the traditional estimator.

While a brief review of the main method of the debiased trimmed ipw estimation is presented in section 2 and additional practical details of the method implemented by the robustate command are described in section 3, we refer interested readers to Sasaki and Ura (2022) for further details about the supporting econometric theory.

Finally, we close this article by discussing how other approaches might treat the limited overlap problem. A common approach is to trim observations with estimated propensity scores close to 0 or 1 (where popular trimming thresholds are 0.1 and 0.9) and to run the standard ipw estimator without bias correction as implemented by the teffects ipw command. This approach biases the estimate with respect to the ate in general but instead obtains the conditional ate among the subpopulation of the untrimmed region (for example, [0.1, 0.9]) of the propensity score (see Crump et al. [2009]). Furthermore, a naïve application of the existing commands, like teffects ipw, would not obtain a correct standard error even for such a conditional ate, because of the data-driven trimming. Matching-type estimators, as implemented by the teffects psmatch command and the teffects nnmatch command, would generally bias the estimate with respect to the ate because matching observations outside of the common support entails matching dissimilar observations. We are not aware of a bias-correction method for matching-type estimators.

Supplemental Material

Supplemental Material, sj-zip-1-stj-10.1177_1536867X221106402 for Average treatment effect estimates robust to the “limited overlap” problem: robustate by Yuya Sasaki and Takuya Ura in The Stata Journal

8 Programs and supplemental materials

To install a snapshot of the corresponding software files as they existed at the time of publication of this article, type

robustate can also be downloaded from the Statistical Software Components Archive by typing

ssc install robustate

References

Crump, R. K., V. J. Hotz, G. W. Imbens, and O. A. Mitnik. 2009. Dealing with limited overlap in estimation of average treatment effects. Biometrika 96: 187–199. https://doi.org/10.1093/biomet/asn055.

Sasaki, Y., and T. Ura. 2022. Estimation and inference for moments of ratios with robustness against large trimming bias. Econometric Theory 38: 66–112. https://doi.org/10.1017/S0266466621000025.
