1 Introduction
Linear regression models containing endogenous regressors are generally identified using outside information such as exogenous external instruments or by parametric distribution assumptions. Some articles obtain identification without external instruments by exploiting heteroskedasticity, including Rigobon (2003), Klein and Vella (2010), Lewbel (1997, 2018), and Prono (2014). In particular, Lewbel (2012) shows how one can use heteroskedasticity to construct instruments when no external instruments are available. Other articles that obtain identification using constructed instruments include Lewbel (1997) and Erickson and Whited (2002). See Lewbel (Forthcoming) for a general discussion of identification methods like these.
In this article, we provide advice and instructions for researchers who wish to apply the Lewbel (2012) estimator. That article includes estimators for fully simultaneous systems, semiparametric systems, and bounds for when key identifying assumptions are violated. However, most empirical applications use the estimator for a single-equation linear regression model with a single endogenous regressor, which is the focus here. Baum and Schaffer (2012) implemented this linear single-equation estimator as the command ivreg2h, which is available from the Statistical Software Components (SSC) archive.
Note that it is almost always preferable to use any available external instruments rather than constructed instruments like those of Lewbel (2012) because of the greater difficulty of confirming that constructed instruments are valid. However, constructed instruments can be useful if no external instruments are available and for testing validity of external instruments.
2 The model and estimator
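Assume a sample of observations of endogenous variables $Y_1$ and $Y_2$ and a vector of exogenous covariates X. We wish to estimate γ and the vector β in the model
$$Y_1 = X'\beta + Y_2\gamma + \varepsilon_1$$
$$Y_2 = X'\alpha + \varepsilon_2$$
where the errors $\varepsilon_1$ and $\varepsilon_2$ may be correlated.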
Standard instrumental-variables estimation depends on having an element of X that appears in the $Y_2$ equation but not in the $Y_1$ equation and uses that excluded regressor as an instrument for $Y_2$. The problem considered here is that perhaps no element of X is excluded from the $Y_1$ equation, or equivalently, we are not sure that any element of β is zero. Lewbel (2012) provides identification and a corresponding simple linear two-stage least-squares estimator for β and γ in this case, where no element of X can be used as an excluded instrument for $Y_2$. The method consists of constructing valid instruments for $Y_2$ by exploiting information contained in heteroskedasticity of $\varepsilon_2$.
We begin with some standard regression model assumptions. First, β and γ are assumed to be fixed constants (in particular, this means that if $Y_2$ is a treatment measure, then treatment effects are assumed to be homogeneous). Second, we have the standard exogenous X assumptions that $E(X\varepsilon_1) = 0$, $E(X\varepsilon_2) = 0$, and $E(XX')$ is nonsingular. Then, the key additional assumptions required for applying the Lewbel (2012) estimator are that $\mathrm{Cov}(Z, \varepsilon_1\varepsilon_2) = 0$ and $\mathrm{Cov}(Z, \varepsilon_2^2) \neq 0$, where either Z = X or Z is a subset of the elements of X.
The Lewbel (2012) estimator can be summarized as the following two steps:
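1. Estimate $\widehat{\alpha}$, the coefficient vector of the $Y_2$ equation, by an ordinary least-squares linear regression of $Y_2$ on X, and obtain estimated residuals $\widehat{\varepsilon}_2 = Y_2 - X'\widehat{\alpha}$.
2. Let Z be some or all of the elements of X (not including the constant term). Estimate β and γ by an ordinary linear two-stage least-squares regression of $Y_1$ on X and $Y_2$, using X and $(Z - \bar{Z})\widehat{\varepsilon}_2$ as instruments, where $\bar{Z}$ is the sample mean of Z.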
This estimator is implemented in the command ivreg2h by Baum and Schaffer (2012). Note that applying the estimator requires choosing which elements of X will compose the vector Z used to construct instruments. The default assumption in ivreg2h is that Z includes all the elements of X except for the constant term. However, one might also choose to let Z be only some of the elements of X if doing so helps to satisfy the assumptions required for the estimator as discussed in the next section.
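The two steps above can be sketched by hand with official Stata commands. This is only an illustration under stated assumptions: it uses the auto dataset shipped with Stata, and the variable roles (price as $Y_1$, mpg as $Y_2$, weight and length as X, with Z = X) are hypothetical choices with no economic meaning. The names e2hat, iv_weight, and iv_length are likewise made up for the example.

```stata
* Illustration only: variable roles are hypothetical.
* Assumed roles: Y1 = price, Y2 = mpg, X = (weight, length), Z = X
sysuse auto, clear

* Step 1: OLS regression of Y2 on X; keep the estimated residuals
regress mpg weight length
predict double e2hat, residuals

* Construct one instrument per element of Z: (Z - mean of Z) * residual
foreach z of varlist weight length {
    summarize `z', meanonly
    generate double iv_`z' = (`z' - r(mean)) * e2hat
}

* Step 2: 2SLS of Y1 on X and Y2, using X and the constructed instruments
ivregress 2sls price weight length (mpg = iv_weight iv_length), vce(robust)
```

In practice, ivreg2h automates this construction; the manual version is shown only to make the mechanics of the two steps explicit.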
3 Advice on applying the estimator
The main question to be answered by applied researchers who wish to use this estimator is whether the key assumptions, that $\mathrm{Cov}(Z, \varepsilon_1\varepsilon_2) = 0$ and $\mathrm{Cov}(Z, \varepsilon_2^2) \neq 0$, are likely to hold. Below, we discuss conditions that are sufficient to make these key assumptions hold. The virtue of these sufficient conditions (given as assumptions A1, A2, and A3 below) is that each can be motivated by economic theory, empirically tested with data, or both. The key assumptions can hold without satisfying assumptions A1, A2, and A3. However, if you can provide evidence (theory and tests as we describe below) for why these sufficient conditions should hold in your application, then the estimator is more likely to be appropriate for you to use.
Assumption A1. The errors $\varepsilon_1$ and $\varepsilon_2$ have the factor structure
$$\varepsilon_1 = cU + V_1, \qquad \varepsilon_2 = U + V_2$$
where c is a constant and U, $V_1$, and $V_2$ are unobserved error terms that are mutually independent conditional on Z.
The interpretation of assumption A1 is that $Y_2$ is endogenous because it contains an error component U that appears in the errors of both equations. This assumption is not directly testable and so should be justified by an appeal to either economic (structural) or econometric (statistical) theory. To illustrate, here we provide examples of how assumption A1 could be justified in many contexts.
Suppose $Y_2$ is endogenous because it is mismeasured. Then $V_1$ is the true outcome model error, and U is the measurement error. Classical measurement error in linear regression models satisfies assumption A1.
Suppose $Y_1$ is an individual’s wage and $Y_2$ is the individual’s education level. Here U could be unobserved ability, which affects both one’s educational attainment, $Y_2$, and one’s wage, $Y_1$. Then $V_1$ represents all the unobservables that affect wages but not education, while $V_2$ represents all the unobservables that affect education but not wages.
Suppose $Y_1$ is a firm’s value-added output per unit of capital and $Y_2$ is the firm’s labor per unit of capital. Here U could be unobserved entrepreneurship, which affects both productivity and the chosen level of inputs. Then $V_1$ represents all the unobservables that affect productivity but not inputs, and vice versa for $V_2$.
The point here, as illustrated by these examples, is that the endogeneity of $Y_2$ takes the form of there being some underlying, unobserved factor U that affects both $Y_1$ and $Y_2$.
Assumption A2. $U^2$ is not correlated with Z.
Assumption A2 says, loosely, that U is homoskedastic with respect to Z. The $Y_1$ equation is a structural model, so if we can argue that it is correctly specified without important omitted variables, then it is common to assume remaining errors are completely idiosyncratic. This may be a difficult assumption to justify in theory, but it is partly testable. In particular, we may apply a Pagan and Hall (1983) test to the $Y_1$ equation.
The idea behind the Pagan–Hall test is that if any of the exogenous variables can predict the squared residuals, then the errors are conditionally heteroskedastic. The more common Breusch–Pagan and White tests for heteroskedasticity (Breusch and Pagan 1979; White 1980) are inappropriate here because, as Pagan and Hall (1983) point out, those tests are valid only if heteroskedasticity is present in the equation being tested and nowhere else in the system (that is, the other structural equations in the system corresponding to the endogenous regressors must be homoskedastic, even though they are not being explicitly estimated). In contrast, under the null of conditional homoskedasticity in the two-stage least-squares regression, the Pagan–Hall statistic is distributed as $\chi^2_p$ (with p the number of variables used to model the squared errors), irrespective of the presence of heteroskedasticity elsewhere in the system.
The ivhettest command (Schaffer 2002), available from the SSC archive, is invoked by
ivhettest [varlist] [, options]
where the optional varlist specifies the exogenous variables to be used to model the squared errors. The tradeoff in the choice of variables to be used is that a smaller set of variables will conserve degrees of freedom at the cost of being unable to detect heteroskedasticity in certain directions. See, for example, Baum, Schaffer, and Stillman (2003), section 3, for more details.
For testing assumption A2, the correct set of variables to include in the test is the levels of the instruments Z (excluding the constant). This is available in ivhettest by specifying the ivlev option and is the default. We do not need to test if other variables (like squares or cross products of Z) are correlated with $U^2$, because those other forms of heteroskedasticity would not violate assumption A2.
A limitation of this test is that it tests homoskedasticity of $\varepsilon_1$, so if we reject homoskedasticity, we cannot know whether the rejection is due to violating assumption A2 or due to harmless heteroskedasticity of $V_1$. In short, failing to reject homoskedasticity of $\varepsilon_1$ provides evidence supporting assumption A2, but rejecting homoskedasticity of $\varepsilon_1$ does not mean that assumption A2 is necessarily violated.
Note that assumption A2 does not require that $U^2$ be fully homoskedastic, only that it not be correlated with Z. As discussed at the end of the previous section, to satisfy assumption A2 (and A3 below), one might be selective about which elements of X to include in Z.
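As a rough sketch of how the test of assumption A2 might be run, the following builds the constructed instruments by hand, fits the $Y_1$ equation with ivreg2, and then calls ivhettest with the levels of Z as the explicit varlist. The dataset and variable roles (price as $Y_1$, mpg as $Y_2$, weight and length as both X and Z) are hypothetical and chosen only so that the commands run; ivreg2 and ivhettest must be installed from the SSC archive.

```stata
* Illustration only: variable roles are hypothetical.
* Requires user-written packages: ssc install ivreg2, ssc install ranktest
sysuse auto, clear

* Construct the instruments (Z - mean of Z) * first-stage residual
regress mpg weight length
predict double e2hat, residuals
foreach z of varlist weight length {
    summarize `z', meanonly
    generate double iv_`z' = (`z' - r(mean)) * e2hat
}

* Fit the Y1 equation by instrumental variables
ivreg2 price weight length (mpg = iv_weight iv_length)

* Pagan-Hall test: do the levels of Z predict the squared residuals?
ivhettest weight length
```

Under the logic described above, failing to reject the null of homoskedasticity here is the outcome that supports assumption A2.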
Assumption A3. $\varepsilon_2^2$ is correlated with Z.
This assumption is needed to ensure that the constructed instrument ends up correlated with $Y_2$. If the previous assumptions hold, then this assumption is equivalent to heteroskedasticity of $V_2$ relative to Z. This assumption is easy to justify because the $Y_2$ equation need not be a structural equation. The $Y_2$ equation is like the first stage of two-stage least squares; it can be defined as just a linear projection of $Y_2$ on exogenous covariates. Moreover, this assumption can be tested by applying a Breusch and Pagan (1979) test to the $Y_2$ equation. Unlike the test of assumption A2 for the $Y_1$ equation, to satisfy assumption A3, we want to reject homoskedasticity.
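A minimal sketch of the test of assumption A3, using only official Stata commands, is given below. As before, the auto dataset and the roles of mpg (as $Y_2$) and weight and length (as X and Z) are hypothetical placeholders. The iid option requests the version of the Breusch–Pagan test that does not assume normally distributed errors.

```stata
* Illustration only: variable roles are hypothetical.
sysuse auto, clear

* The Y2 equation: linear projection of Y2 on the exogenous covariates X
regress mpg weight length

* Breusch-Pagan test: do the levels of Z predict the squared residuals?
estat hettest weight length, iid
```

Here a rejection of homoskedasticity is the desired outcome, because it indicates that the constructed instruments carry identifying information.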
Note that the above assumptions are not necessary for validity of the estimator. For example, it is possible that the factor model of assumption A1 does not hold, but the estimator is still consistent (see Lewbel [2018] for an example). However, we can have more confidence that the estimator is consistent in a given application if we can argue that the logic of assumption A1 holds and if we pass the tests in assumptions A2 and A3.
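To make the sufficiency argument concrete, the following sketch shows how assumptions A1 through A3 deliver the two key covariance conditions. It takes the factor structure to be $\varepsilon_1 = cU + V_1$ and $\varepsilon_2 = U + V_2$ and adds one simplifying assumption that is not stated in the text above: U, $V_1$, and $V_2$ each have mean zero conditional on Z. Under that assumption and conditional independence, every cross-product term has conditional mean zero.

```latex
\begin{align*}
\varepsilon_1\varepsilon_2
  &= (cU + V_1)(U + V_2) = cU^2 + cUV_2 + UV_1 + V_1V_2, \\
E(\varepsilon_1\varepsilon_2 \mid Z)
  &= c\,E(U^2 \mid Z)
  \quad\Rightarrow\quad
  \mathrm{Cov}(Z, \varepsilon_1\varepsilon_2) = c\,\mathrm{Cov}(Z, U^2) = 0
  \quad \text{(by A2)}, \\
\varepsilon_2^2
  &= U^2 + 2UV_2 + V_2^2, \\
E(\varepsilon_2^2 \mid Z)
  &= E(U^2 \mid Z) + E(V_2^2 \mid Z)
  \quad\Rightarrow\quad
  \mathrm{Cov}(Z, \varepsilon_2^2)
  = \mathrm{Cov}(Z, U^2) + \mathrm{Cov}(Z, V_2^2)
  = \mathrm{Cov}(Z, V_2^2).
\end{align*}
```

The last line shows why, given A1 and A2, assumption A3 amounts to heteroskedasticity of $V_2$ relative to Z: the covariance of Z with $\varepsilon_2^2$ is nonzero exactly when the covariance of Z with $V_2^2$ is nonzero.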
Additional tests lending even more support for the estimator are possible when Z has more than one element. In that case, the model is overidentified, and one can then apply standard overidentification tests such as the Hansen (1982) and Sargan (1958) J-test. However, note that this tests only a necessary condition for validity of the method, which is that all instruments yield the same coefficient estimates. It is possible, for example, that one fails to reject overidentification tests not because the assumptions hold but because the constructed instruments happen to all yield the same incorrect coefficient estimates. Still, failing to reject overidentification tests provides additional evidence in support of the model and estimator.
To summarize the results of this section, we note that one way to use this estimator convincingly is to do the following:
1. Use economic theory and data to justify linearity of the model $Y_1 = X'\beta + Y_2\gamma + \varepsilon_1$ and the assumption that X is exogenous.
2. Use economic theory and data to justify the factor structure of the errors given by assumption A1.
3. Choose a set of covariates Z (either all the elements of X except the constant or some subset of those elements) to use for constructing the instruments $(Z - \bar{Z})\widehat{\varepsilon}_2$. For the chosen Z, apply theory and the above described tests to justify the remaining identifying assumptions.
4 Implementing the estimator and tests
Using the Lewbel (2012) method, we construct instruments as simple functions of the model’s data. This approach may be a) applied when no ordinary (external) instruments are available or b) used along with external instruments to improve the efficiency of the instrumental-variables estimator. Constructed and external instruments can also be used to obtain overidentification, thereby allowing application of Sargan–Hansen tests (of the orthogonality conditions or overidentifying restrictions), which would not be possible in the case of exact identification by external instruments. This then allows one to simultaneously test validity of both the external instruments and the constructed instruments.
The implementation of the estimator in ivreg2h is based on the earlier xtivreg2 (Schaffer 2005) and ivreg2 (Baum, Schaffer, and Stillman 2003, 2007) commands. Essentially, ivreg2h generates the heteroskedasticity-based constructed instruments and then implements instrumental-variables estimation like these earlier commands. In addition to pure cross-section or time-series data, ivreg2h can also be applied to panel data using the within transformation of a fixed-effects model; see the fe option described below. Because ivreg2h is a variant of ivreg2, essentially all the features and options of that command are available in ivreg2h. For that reason, you can consult help ivreg2 for full details of the available options.
The robust and gmm2s options should generally be used, invoking the instrumental-variables generalized method of moments estimator. This will compute the Hansen J statistic as a test of overidentifying restrictions. The default Sargan test assumes independent and identically distributed (in particular, homoskedastic) errors. See Baum, Schaffer, and Stillman (2003, 2007) for further details. Note that the gmm2s option supersedes the gmm option described in the earlier article.
The ivreg2h command provides four more options: gen, gen(string [, replace]), fe, and z(). If the gen option is given, the generated (constructed) instruments are stored, with names built from the original variable names suffixed with g. If you want greater control over the naming of the generated instruments, use the gen(string [, replace]) option. The string argument allows the specification of a stub, or prefix, for the generated variable names, which will also be suffixed with g. You can remove earlier instruments with those same names with the replace suboption. If the data have been declared as a panel, you can use the fe option to specify that a fixed-effects model should be fit, as in xtivreg2. The z() option can be used to specify that only some of the included exogenous variables should be used to generate instruments, as suggested above.
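The options described above might be combined as in the following sketch. It assumes that ivreg2h, ivreg2, and ranktest have been installed from the SSC archive; the auto dataset, the variable roles, and the stub lew are hypothetical choices made only for illustration.

```stata
* Illustration only: variable roles and the stub name are hypothetical.
* Requires: ssc install ivreg2h, ssc install ivreg2, ssc install ranktest
sysuse auto, clear

* No external instruments; by default Z is all of X except the constant.
* With two generated instruments and one endogenous regressor, the model
* is overidentified, so a Hansen J statistic can be reported.
ivreg2h price weight length (mpg = ), robust gmm2s

* Restrict Z to weight only and keep the generated instrument, using a
* user-chosen stub for its name. With a single generated instrument the
* model is exactly identified, so no overidentification test is available.
ivreg2h price weight length (mpg = ), z(weight) gen(lew, replace) robust gmm2s
```

Restricting Z through z() trades overidentifying information for a smaller set of variables that must satisfy the identifying assumptions.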
The ivreg2h command can be invoked to fit either a) a model that would be identified even without the constructed instruments or b) a model that, without constructed instruments, would fail the order condition for identification by either having no excluded instruments or having fewer excluded instruments than needed for traditional identification.
In case a, where an adequate number of external instruments are augmented by the generated constructed instruments, ivreg2h provides three sets of estimates: the traditional instrumental-variable estimates, the estimates using only the generated instruments, and the estimates using both generated and excluded instruments. In this case, ivreg2h automatically produces a Hayashi C test of the excluded instruments’ validity, equivalent to that provided by the orthog() option in ivreg2; see Baum, Schaffer, and Stillman (2003, 18–19). The results of the third estimation (the one including both generated and excluded instruments) are stored in the ereturn list. All three sets of estimates are stored, named StdIV, GenInst, and GenExtInst, respectively.
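A sketch of case a follows. The variable turn is treated as an external instrument purely for illustration; nothing about the auto dataset justifies that role, and the other variable roles are equally hypothetical. The stored estimate names are those given above.

```stata
* Illustration only: turn is a hypothetical external instrument.
* Requires: ssc install ivreg2h, ssc install ivreg2, ssc install ranktest
sysuse auto, clear

* One external instrument plus the generated instruments
ivreg2h price weight length (mpg = turn), robust gmm2s

* Compare the three stored sets of estimates side by side
estimates table StdIV GenInst GenExtInst, b(%9.3f) se(%9.3f)
```

Placing the three sets of coefficients in one table makes it easy to see whether the external and constructed instruments lead to similar conclusions.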
In case b, where the equation would be underidentified without constructed instruments, either one or two sets of estimates will be produced and displayed. If there are no excluded instruments, only the estimates using the generated instruments are displayed. If there are excluded instruments but too few to produce identification by the order condition, the estimates using only generated instruments and those produced by both generated and excluded instruments will be displayed. Unlike ivreg2 or ivregress, ivreg2h allows the syntax ivreg2h depvar exogvar (endogvar = [varlist_iv]) [if] [in] [, options] because after augmentation with the generated instruments, the order condition for identification will be satisfied. The resulting estimates are stored in the ereturn list and as a set of estimates named GenInst and, optionally, GenExtInst.
The Pagan and Hall (1983) tests referenced above are available from the ivreg2 package of Baum, Schaffer, and Stillman (2003) using the ivhettest command. The default test does not assume normality of the errors.
4.1 Stored results
In the estimates table output, the displayed results j, jdf, and jp refer to the Hansen J statistic, its degrees of freedom, and its p-value. If independent and identically distributed errors are assumed and a Sargan test is displayed in the standard output, the Sargan statistic, its degrees of freedom, and p-value are displayed in j, jdf, and jpval because the Hansen and Sargan statistics coincide in that case. The results of the most recent estimation are stored in the ereturn list.
5 Examples of usage
In this example from Lewbel (2012), centering of regressors is used only to match the published results.
The following is an example with panel data and heteroskedasticity- and autocorrelation-consistent standard errors:
webuse grunfeld, clear
ivreg2h invest L(1/2).kstock (mvalue=), fe
ivreg2h invest L(1/2).kstock (mvalue=L(1/4).mvalue), fe robust bw(2)
6 Additional comments
Here we provide answers to additional questions that have been asked about the estimator.
1. Can validity of the estimator be tested?
Partially. The tests discussed in the previous sections are examples.
2. What if $Y_1$ or $Y_2$ is discrete?
The estimator may still be valid in this case. Lewbel (2018) gives one set of conditions that suffice for validity of the estimator. However, the factor structure given by assumption A1 will generally not hold if $Y_1$ or $Y_2$ is discrete, so it is much harder to justify application of the estimator. One might still apply the tests discussed in the previous sections to provide some evidence to rationalize the estimator in this case.
3. What does it mean if coefficient estimates are close to those from ordinary least squares?
In any application of instrumental-variables estimators, coefficient estimates can be close to ordinary least squares either by chance or if the instruments are highly correlated with the endogenous regressors. The same is true of constructed instruments.
4. Can the estimator be used with more than one endogenous regressor?
Conditions for validity of the estimator have been proven for one endogenous regressor. The estimator may be valid with multiple endogenous regressors, but the exact conditions required for validity in that case have not been shown.
5. Can I use functions of the constructed instruments as additional instruments?
No. The $\varepsilon_1$ errors are uncorrelated with the constructed instruments but may not be conditionally mean zero conditioning on the instruments. This means that unless you make additional strong assumptions, you cannot, for example, use squares of the constructed instruments or interactions of the constructed instruments with exogenous regressors as additional instruments.
6. Can I use the constructed instruments to estimate local average treatment effects?
No, except under very strong conditions. The method does not construct instruments designed to satisfy the assumptions for local average treatment-effects estimation. It constructs instruments in the traditional structural model sense, where linear model coefficients are fixed constants. This means that if the endogenous regressor is a measure of treatment, then the constructed instrument is valid for estimating a treatment effect only if the treatment effect is homogeneous, that is, the same for everyone in the population.
7. What if I have additional instruments?
This is the best-case scenario because those external instruments can be used along with the constructed instruments in the second step of the estimator (as discussed earlier). In particular, one of the best uses of the constructed instruments is to provide overidentifying information for model tests and robustness checks. For example, one could apply the overidentification tests discussed in the previous sections to estimates based on both constructed and external instruments. If validity is rejected, then either the model is misspecified or at least one of these instruments is invalid. If validity is not rejected, it is still possible that the model is wrong or the instruments are invalid, but one would at least have increased confidence that both the external and constructed instruments are valid. More informally, one might simply compare the estimated coefficients based on constructed instruments with those based on external instruments. If they are numerically similar, that increases confidence in the robustness of the model because the two estimators based on very different identifying assumptions are yielding similar results. More generally, identification based on constructed instruments is preferably not used in isolation but rather is ideally used in conjunction with other means of obtaining identification, both as a way to check robustness of results to alternative identifying assumptions and to increase estimation efficiency.
7 Conclusions
In the few years since the heteroskedasticity-based estimator was proposed, it has been cited more than 500 times according to Google Scholar. But as with any identification method that is based largely on structure and functional form, one must be cautious about interpreting the results. This article should help ensure that the estimator is applied appropriately.