In this article, we introduce lassopack, a suite of programs for regularized regression in Stata. lassopack implements lasso, square-root lasso, elastic net, ridge regression, adaptive lasso, and postestimation ordinary least squares. The methods are suitable for the high-dimensional setting, where the number of predictors p may be large and possibly greater than the number of observations, n. We offer three approaches for selecting the penalization (“tuning”) parameters: information criteria (implemented in lasso2), K-fold cross-validation and h-step-ahead rolling cross-validation for cross-section, panel, and time-series data (cvlasso), and theory-driven (“rigorous” or plugin) penalization for the lasso and square-root lasso for cross-section and panel data (rlasso). We discuss the theoretical framework and practical considerations for each approach. We also present Monte Carlo results to compare the performances of the penalization approaches.
Machine learning is attracting increasing attention across many scientific disciplines. Recent surveys explore how machine-learning methods can be used in economics and applied econometrics (Varian 2014; Mullainathan and Spiess 2017; Athey 2019; Kleinberg et al. 2018; Athey and Imbens 2019). At the time of writing, Stata offers only a limited set of machine-learning tools. lassopack is an attempt to extend this set by providing easy-to-use and flexible methods for regularized regression in Stata.1
While regularized linear regression is only one of many methods in the toolbox of machine learning, it has some properties that make it attractive for empirical research. To begin with, it is a straightforward extension of linear regression. Like ordinary least squares (OLS), regularized linear regression minimizes the sum of squared deviations between observed and model-predicted values but imposes a regularization penalty aimed at limiting model complexity. The most popular regularized regression method is the lasso—which this package is named after—introduced by Frank and Friedman (1993) and Tibshirani (1996), which penalizes the absolute size of coefficient estimates.
The primary purpose of regularized regression, as with supervised machine-learning methods more generally, is prediction. Regularized regression typically does not produce estimates that can be interpreted as causal, and statistical inference on these coefficients is complicated.2 While regularized regression may select the true model as the sample size increases, this is generally the case only under strong assumptions. However, regularized regression can aid causal inference without relying on the strong assumptions required for perfect model selection. The post-double-selection methodology of Belloni, Chernozhukov, and Hansen (2014) and the postregularization approach of Chernozhukov, Hansen, and Spindler (2015) can be used to select appropriate control variables from a large set of putative confounding factors and thereby improve robustness of estimation of the parameters of interest. Likewise, the first stage of two-step least squares is a prediction problem, and lasso or ridge can be applied to obtain optimal instruments (Belloni et al. 2012; Carrasco 2012; Hansen and Kozbur 2014). We implemented these methods in our sister package pdslasso (Ahrens, Hansen, and Schaffer 2018), which builds on the algorithms developed in lassopack.
The strength of regularized regression as a prediction technique stems from the bias-variance tradeoff. The prediction error can be decomposed into the unknown error variance reflecting the overall noise level (which is irreducible), the squared estimation bias, and the variance of the predictor. The variance of the estimated predictor increases with the model complexity, whereas the bias tends to decrease with model complexity. By reducing model complexity and inducing a shrinkage bias, regularized regression methods tend to outperform OLS in terms of out-of-sample prediction performance. In doing so, regularized regression addresses the common problem of overfitting: high in-sample fit (high R2) but poor prediction performance on unseen data.
Another advantage is that the regularization methods of lassopack—with the exception of ridge regression—set some coefficients to exactly zero, thereby excluding predictors from the model. In this way, regularized regression can serve as a model-selection technique. Model selection is challenging, especially when there are many putative predictors. Iterative testing procedures, such as the general-to-specific approach, typically induce pretesting biases, and hypothesis tests often lead to many false positives. At the same time, high-dimensional problems where the number of predictors is large relative to the sample size are a common phenomenon, especially when the true model is treated as unknown. Regularized regression allows for identification even when the number of predictors exceeds the sample size under the sparsity assumption, which requires that the true model include only a few predictors or, more generally, that only a few predictors are necessary to approximate the true model sufficiently well.
Regularized regression methods rely on tuning parameters that control the degree and type of penalization. lassopack offers three approaches to select these tuning parameters. The classical approach is to select tuning parameters using cross-validation (CV) to optimize out-of-sample prediction performance. CV methods are universally applicable and generally perform well for prediction tasks but are computationally expensive. The second approach relies on information criteria, such as the Akaike information criterion (AIC) (Zou, Hastie, and Tibshirani 2007; Zhang, Li, and Tsai 2010). Information criteria are easy to calculate and have attractive theoretical properties but are less robust to violations of the independence and homoskedasticity assumptions (Arlot and Celisse 2010). The third approach is rigorous penalization for the lasso and square-root lasso. The tuning parameters are chosen to theoretically provide good prediction performance and control overfitting in the presence of heteroskedastic, non-Gaussian, and cluster-dependent errors (Belloni et al. 2012; Belloni, Chernozhukov, and Wang 2014; Belloni et al. 2016). The rigorous approach places a high priority on controlling overfitting, thus often producing parsimonious models. This strong focus on containing overfitting is practically and theoretically beneficial for selecting control variables or instruments in a structural model but also implies that the approach may be outperformed by CV techniques for pure prediction tasks. Which approach is most appropriate depends on the type of data at hand and the purpose of the analysis. To guide applied researchers, we discuss the theoretical foundation of all three approaches and present Monte Carlo results that assess their relative performance.
This article proceeds as follows. In section 2, we present the estimation methods implemented in lassopack. In sections 3–5, we discuss the aforementioned approaches for selecting the tuning parameters: information criteria in section 3, CV in section 4, and rigorous penalization in section 5. In section 6, we present the three commands, which correspond to the three penalization approaches. In section 7, we demonstrate the commands. In section 8, we present Monte Carlo results. Further technical notes are in section 9.
Notation. We briefly clarify the notation used in this article. Suppose a is a vector of dimension m with typical element aj for j = 1,…, m. The ℓ1 norm is defined as ||a||1 = |a1| + ⋯ + |am|, and the ℓ2 norm is ||a||2 = (a1² + ⋯ + am²)^(1/2). The “ℓ0 norm” of a is denoted by ||a||0 and is equal to the number of nonzero elements in a. 1{·} denotes the indicator function. We use the notation b ∨ c to denote the maximum value of b and c, that is, max(b, c).
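As a quick illustration (not part of lassopack), these quantities are simple to compute in a few lines of Python:

```python
def l0_norm(a):
    """Number of nonzero elements (the "l0 norm")."""
    return sum(1 for aj in a if aj != 0)

def l1_norm(a):
    """Sum of absolute values (the l1 norm)."""
    return sum(abs(aj) for aj in a)

def l2_norm(a):
    """Euclidean norm (the l2 norm)."""
    return sum(aj ** 2 for aj in a) ** 0.5

a = [3, 0, -4]
print(l0_norm(a), l1_norm(a), l2_norm(a))  # 2 7 5.0
print(max(3, 5))                           # b ∨ c for b = 3, c = 5
```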
2 Regularized regression
Here we introduce the regularized regression methods implemented in lassopack. We consider the high-dimensional linear model

yi = xi′β + εi,   i = 1,…, n
where the number of predictors, p, may be large and even exceed the sample size, n. The regularization methods introduced in this section can accommodate large-p models under the assumption of sparsity: out of the p predictors, only a subset of s ≪ n is included in the true model, where s is the sparsity index

s := Σj 1{βj ≠ 0} = ||β||0
We refer to this assumption as exact sparsity. It is more restrictive than required, but we use it here for illustration. We will later relax the assumption to allow for nonzero but “small” βj coefficients. We also define the active set Ω = {j ∊ {1,…, p} : βj ≠ 0}, which is the set of nonzero coefficients. p, s, Ω, and β may generally depend on n, but we suppress the n subscript for notational convenience.
We adopt the following convention throughout the article: unless otherwise noted, all variables have been mean centered such that (1/n) Σi yi = 0 and (1/n) Σi xij = 0, and all variables are measured in their natural units; that is, they have not been prestandardized to have unit variance. By assuming the data have already been mean centered, we simplify the notation and exposition. Leaving the data in natural units, however, allows us to discuss standardization in the context of penalization.
Penalized regression methods rely on tuning parameters that control the degree and type of penalization. The estimation methods implemented in lassopack, which we introduce in the following subsection, use two tuning parameters. The first, λ, controls the general degree of penalization: setting λ to zero corresponds to no penalization and is equivalent to OLS, while a sufficiently large value of λ yields an empty model in which all coefficients are zero. The second tuning parameter, α, determines the relative contribution of ℓ1 (lasso-type) versus ℓ2 (ridge-type) penalization. We discuss the merits and properties of ℓ1 and ℓ2 penalization in the following subsection. In section 2.2, we introduce the three approaches offered by lassopack for selecting λ and α.
2.1 The estimators
Lasso
The lasso takes a special position because it provides the basis for the rigorous penalization approach (see section 5) and has inspired other methods such as elastic net and square-root lasso, which we introduce later in this section. The lasso minimizes the mean squared error subject to a penalty on the absolute size of coefficient estimates:

β̂_lasso = arg minβ (1/n) Σi (yi − xi′β)² + (λ/n) Σj ψj |βj|
The tuning parameter λ controls the overall penalty level, and ψj are predictor-specific penalty loadings.
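For intuition, the lasso objective can be minimized by cyclic coordinate descent with soft thresholding. The following Python sketch is illustrative only (lasso2 uses its own implementation) and assumes mean-centered data with unit penalty loadings ψj = 1:

```python
def soft_threshold(rho, t):
    """Soft-thresholding operator: shrinks rho toward zero by t."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for
    (1/n) sum_i (y_i - x_i'b)^2 + (lam/n) sum_j |b_j|
    with unit penalty loadings psi_j = 1 and mean-centered data."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # correlation of feature j with the partial residual
            rho = sum(X[i][j] * (y[i] - sum(X[i][k] * beta[k]
                                            for k in range(p) if k != j))
                      for i in range(n))
            z = sum(X[i][j] ** 2 for i in range(n))
            beta[j] = soft_threshold(rho, lam / 2.0) / z
    return beta

# hypothetical toy data: a strong first predictor and a weak second one
beta = lasso_cd([[1, 1], [-1, 1], [1, -1], [-1, -1]],
                [2.1, -1.9, 1.9, -2.1], lam=2.0)
print(beta)  # beta[1] is set exactly to zero; beta[0] is about 1.75
```

Note how the ℓ1 penalty produces an exact zero: the weak predictor's correlation with the residual falls below the threshold λ/2 and is snapped to zero rather than merely shrunk.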
Tibshirani (1996) motivates the lasso with two major advantages over OLS. First, because of the nature of the ℓ1 penalty, the lasso sets some of the coefficient estimates exactly to zero and, in doing so, removes some predictors from the model. Thus, the lasso serves as a model-selection technique and facilitates model interpretation. Second, lasso can outperform least squares in prediction accuracy because of the bias-variance tradeoff.
The lasso coefficient path, which constitutes the trajectory of coefficient estimates as a function of λ, is piecewise linear with changes in slope where variables enter or leave the active set. The change points are referred to as knots.
Unlike OLS, the lasso is not invariant to linear transformations, which is why scaling matters. If the predictors are not of equal variance, the most common approach is to prestandardize the data such that (1/n) Σi xij² = 1 and to set ψj = 1 for j = 1,…, p. Alternatively, we can set the penalty loadings to ψj = {(1/n) Σi xij²}^(1/2). The two methods yield identical results.
Ridge regression

Ridge regression replaces the ℓ1 penalty of the lasso with an ℓ2 penalty on the coefficients:

β̂_ridge = arg minβ (1/n) Σi (yi − xi′β)² + (λ/n) Σj ψj² βj²

The interpretation and choice of the penalty loadings ψj are the same as above. As with the lasso, we need to account for uneven variance either through preestimation standardization or by appropriately choosing the penalty loadings ψj.
In contrast with estimators relying on ℓ1 penalization, the ridge does not perform variable selection. At the same time, it also does not rely on the assumption of sparsity. This makes the ridge attractive in the presence of dense signals, that is, when the assumption of sparsity does not seem plausible. Dense high-dimensional problems are more challenging than sparse problems; for example, Dicker (2016) shows that, if p/n → ∞, it is not possible to outperform a trivial estimator that includes only the constant. If p, n → ∞ jointly, but p/n converges to a finite constant, the ridge has desirable properties in dense models and tends to perform better than sparsity-based methods (Hsu, Kakade, and Zhang 2014; Dicker 2016; Dobriban and Wager 2018).
Ridge regression is closely linked to principal component regression. Both methods are popular in the context of multicollinearity because of their low variance relative to OLS. Principal component regression applies OLS to a subset of components derived from principal component analysis, thereby discarding a specified number of components with low variance. The rationale for removing low-variance components is that the predictive power of each component tends to increase with the variance. The ridge can be interpreted as projecting the response against principal components while imposing a higher penalty on components exhibiting low variance. Hence, the ridge follows a similar principle, but rather than discarding low-variance components, it applies a more severe shrinkage (Hastie, Tibshirani, and Friedman 2009).
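The link to principal components can be made concrete through the singular-value decomposition: with unit penalty loadings, ridge shrinks the contribution of each principal direction by the factor dj²/(dj² + λ), so low-variance directions are shrunk most. A small numpy sketch (illustrative, not lassopack code; the data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
X -= X.mean(axis=0)          # mean-center, as assumed throughout
y = rng.standard_normal(20)
y -= y.mean()
lam = 2.0

# direct ridge solution with unit penalty loadings (Psi = I)
ridge_direct = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# SVD route: each principal direction is shrunk by d_j^2 / (d_j^2 + lam)
U, d, Vt = np.linalg.svd(X, full_matrices=False)
ridge_svd = Vt.T @ (d / (d ** 2 + lam) * (U.T @ y))

print(np.allclose(ridge_direct, ridge_svd))  # True
```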
Comparing lasso and ridge regression provides further insights into the nature of ℓ1 and ℓ2 penalization, so it is helpful to write lasso and ridge in constrained form as

β̂_lasso = arg minβ (1/n) Σi (yi − xi′β)²   subject to   Σj ψj |βj| ≤ τ
β̂_ridge = arg minβ (1/n) Σi (yi − xi′β)²   subject to   Σj ψj² βj² ≤ τ
and to examine the shapes of the constraint sets. The above optimization problems use the tuning parameter τ instead of λ. Note that there exists a data-dependent relationship between λ and τ.
Figure 1 illustrates the geometry underpinning lasso and ridge regression for the case of p = 2 and ψ1 = ψ2 = 1 (that is, unity penalty loadings). The gray elliptical lines represent residual-sum-of-squares contours, and the black lines indicate the lasso and ridge constraints. The lasso constraint set, given by |β1| + |β2| ≤ τ, is diamond shaped with vertices along the axes, from which it immediately follows that the lasso solution may set coefficients exactly to 0. In contrast, the ridge constraint set, β1² + β2² ≤ τ, is circular and will thus (effectively) never produce a solution with any coefficient set to 0. Finally, β̂ in the figure denotes the solution without penalization, which corresponds to OLS. The lasso solution at the corner of the diamond implies that, in this example, one of the coefficients is set to zero, whereas ridge and OLS produce nonzero estimates for both coefficients.
Behavior of ℓ1 and ℓ2 penalties in comparison. The gray lines represent residual-sum-of-squares contour lines, and the black lines represent the lasso and ridge constraints, respectively. β̂ denotes the OLS estimate; β̂_lasso and β̂_ridge are the lasso and ridge estimates. The illustration is based on Tibshirani (1996, fig. 2).
While there exists no closed-form solution for the lasso, the ridge solution can be expressed as

β̂_ridge = (X′X + λΨ′Ψ)⁻¹ X′y
Here X is the n × p matrix of predictors with typical element xij, y is the response vector, and Ψ = diag(ψ1,…, ψp) is the diagonal matrix of penalty loadings. The ridge regularizes the regressor matrix by adding positive constants to the diagonal of X′X. The ridge solution is thus well defined as long as λ and all the ψj are strictly positive, even if X′X is rank deficient.
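The closed form makes it easy to see why the ridge remains well defined when p > n. In the following numpy sketch (illustrative only; the dimensions are arbitrary), X′X is singular, yet the regularized system is invertible:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 5, 8                      # p > n, so X'X is rank deficient
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
lam, Psi = 1.0, np.eye(p)        # unit penalty loadings

# X'X is singular here, so the plain OLS normal equations fail...
assert np.linalg.matrix_rank(X.T @ X) < p

# ...but adding lam * Psi'Psi to the diagonal makes the system well posed
beta_ridge = np.linalg.solve(X.T @ X + lam * Psi.T @ Psi, X.T @ y)
print(beta_ridge.shape)  # (8,)
```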
Elastic net
The elastic net of Zou and Hastie (2005) combines some of the strengths of lasso and ridge regression. It applies a mix of ℓ1 (lasso-type) and ℓ2 (ridge-type) penalization:

β̂_elastic = arg minβ (1/n) Σi (yi − xi′β)² + (λ/n) {α Σj ψj |βj| + (1 − α) Σj ψj² βj²}

where α = 1 corresponds to the lasso and α = 0 to ridge regression.
The additional parameter α determines the relative contribution of ℓ1 versus ℓ2 penalization. In the presence of groups of correlated regressors, the lasso typically selects only one variable from each group, whereas the ridge tends to produce similar coefficient estimates for groups of correlated variables. However, the ridge does not yield sparse solutions, impeding model interpretation. The elastic net can produce sparse solutions for some α that are greater than zero and retains or drops correlated variables jointly.
Adaptive lasso
The irrepresentable condition has been shown to be sufficient and (almost) necessary for the lasso to be model-selection consistent (Zhao and Yu 2006; Meinshausen and Bühlmann 2006). However, the irrepresentable condition imposes strict constraints on the degree of correlation between predictors in the true model and predictors outside the model. Motivated by this nontrivial condition for the lasso to be variable-selection consistent, Zou (2006) proposed the adaptive lasso. The adaptive lasso uses the penalty loadings ψj = 1/|β̂0,j|^θ for some θ > 0, where β̂0,j is an initial estimator of βj. The adaptive lasso is variable-selection consistent for a fixed p under weaker assumptions than the standard lasso. If p < n, OLS can be used as the initial estimator. Huang, Ma, and Zhang (2008) prove variable-selection consistency for a large p and suggest using univariate OLS if p > n. The idea of adaptive penalty loadings can also be applied to elastic net and ridge regression (Zou and Zhang 2009).
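As a sketch (not lassopack code), the following Python snippet computes adaptive penalty loadings ψj = 1/|β̂0,j|^θ from univariate OLS initial estimates, the suggestion for p > n; the data values are hypothetical:

```python
def adaptive_loadings(X, y, theta=1.0):
    """Adaptive-lasso penalty loadings psi_j = 1 / |beta0_j|^theta,
    using univariate OLS coefficients as initial estimates."""
    n, p = len(X), len(X[0])
    psi = []
    for j in range(p):
        num = sum(X[i][j] * y[i] for i in range(n))
        den = sum(X[i][j] ** 2 for i in range(n))
        beta0 = num / den                 # univariate OLS slope for x_j
        psi.append(1.0 / abs(beta0) ** theta)
    return psi

# hypothetical toy data: first predictor strong, second weak
psi = adaptive_loadings([[1, 1], [-1, 1], [1, -1], [-1, -1]],
                        [2.1, -1.9, 1.9, -2.1])
print(psi)  # roughly [0.5, 10.0]: the weak predictor is penalized harder
```

Predictors with small initial coefficients receive large loadings and are therefore penalized more heavily, which is the mechanism behind the improved selection properties.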
Square-root lasso
The square-root lasso,

β̂_sqrt-lasso = arg minβ {(1/n) Σi (yi − xi′β)²}^(1/2) + (λ/n) Σj ψj |βj|,
is a modification of the lasso that minimizes the root mean squared error (rmse) while also imposing an ℓ1 penalty. The main advantage of the square-root lasso over the standard lasso becomes apparent if theoretically grounded, data-driven penalization is used. Specifically, the score vector, and thus the optimal penalty level, is independent of the unknown error variance under homoskedasticity as shown by Belloni, Chernozhukov, and Wang (2011), resulting in a simpler procedure for choosing λ (see section 5).3
Postestimation OLS
Penalized regression methods induce an attenuation bias that can be alleviated by postestimation OLS, which applies OLS to the variables selected by the first-stage variable-selection method; that is,

β̂_post = arg minβ (1/n) Σi (yi − xi′β)²   subject to   βj = 0 if β̃j = 0

where β̃ = (β̃1,…, β̃p)′ is a sparse first-step estimator such as the lasso. Thus, postestimation OLS treats the first-step estimator as a genuine model-selection technique. For the lasso with theory-driven regularization, Belloni and Chernozhukov (2013) have shown that postestimation OLS, also referred to as postlasso, achieves the same convergence rates as the lasso and can outperform the lasso in situations where consistent model selection is feasible (see also Belloni et al. [2012]). Similar results hold for the square-root lasso (Belloni, Chernozhukov, and Wang 2011, 2014).
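The two-step idea can be sketched in a few lines of numpy (illustrative only; the first-step coefficient vector here is a hypothetical shrunken lasso estimate):

```python
import numpy as np

def post_ols(X, y, beta_first_step):
    """Refit OLS on the support selected by a sparse first-step
    estimator (e.g., the lasso); unselected coefficients stay 0."""
    support = np.flatnonzero(beta_first_step)
    beta_post = np.zeros(len(beta_first_step))
    if support.size:
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        beta_post[support] = coef
    return beta_post

X = np.array([[1.0, 1], [-1, 1], [1, -1], [-1, -1]])
y = np.array([2.1, -1.9, 1.9, -2.1])
beta_lasso = np.array([1.75, 0.0])   # hypothetical shrunken first step
print(post_ols(X, y, beta_lasso))    # approximately [2.0, 0.0]
```

The refit pushes the selected coefficient from the shrunken value 1.75 back up toward the unpenalized OLS value, removing the attenuation bias while keeping the zero restriction.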
2.2 Choice of the tuning parameters
Because coefficient estimates and the set of selected variables depend on λ and α, a central question is how to choose these tuning parameters. Which method is most appropriate depends on the objectives and setting, particularly the aim of the analysis (prediction or model identification), computational constraints, and if and how the independent and identically distributed (i.i.d.) assumption is violated. lassopack offers three approaches for selecting λ and α:
1. Information criteria: The value of λ can be selected using information criteria. lasso2 implements model selection using four information criteria. See section 3.
2. CV: The aim of CV is to optimize the out-of-sample prediction performance. CV is implemented in cvlasso, which allows for CV across both λ and the elastic-net parameter α. See section 4.
3. Theory driven (“rigorous”): Theoretically justified and feasible penalty levels and loadings are available for the lasso and square-root lasso via rlasso. The penalization is chosen to dominate the noise of the data-generating process (represented by the score vector), which allows derivation of theoretical results with regard to consistent prediction and parameter estimation. See section 5.
3 Tuning parameter selection using information criteria
Information criteria are closely related to regularization methods. The classical AIC (Akaike 1974) is defined as −2 × log likelihood + 2p. Thus, the AIC can be interpreted as a penalized likelihood, which imposes a penalty on the number of predictors included in the model. However, this form of penalization, referred to as the ℓ0 penalty, has an important practical disadvantage: to find the model with the lowest AIC, we need to estimate all candidate model specifications. In practice, it is often not feasible to consider the full model space. For example, with only 20 predictors, there are already more than 1,000,000 different models.
The advantage of regularized regression is that it provides a data-driven method for reducing model selection to a one-dimensional problem (or two-dimensional problem in the case of the elastic net) where we need to select λ (and α). Theoretical properties of information criteria are well understood, and they are easy to compute once coefficient estimates are obtained. Thus, it seems natural to use the strengths of information criteria as model-selection procedures to select the penalization level.
Information criteria can be categorized based on two central properties: loss efficiency and model-selection consistency. A model-selection procedure is loss efficient if it yields the smallest average squared error attainable by all candidate models. Model-selection consistency requires that the true model be selected with probability approaching 1 as n → ∞. Accordingly, which information criterion is appropriate in a given setting also depends on whether the aim of the analysis is prediction or identification of the true model.
We first consider the two most popular information criteria, the AIC and the Bayesian information criterion (BIC; Schwarz 1978):

AIC(λ, α) = n log{σ̂²(λ, α)} + 2 df(λ, α)
BIC(λ, α) = n log{σ̂²(λ, α)} + df(λ, α) log(n)
where σ̂²(λ, α) = (1/n) Σi ε̂i² and the ε̂i are the residuals. df(λ, α) is the effective degrees of freedom, which is a measure of model complexity. In the linear regression model, the degrees of freedom is simply the number of regressors. Zou, Hastie, and Tibshirani (2007) show that the number of coefficients estimated to be nonzero, ŝ, is an unbiased and consistent estimate of df(λ) for the lasso (α = 1). Tibshirani and Taylor (2012) confirm that the result also holds if p > n. More generally, the degrees of freedom of the elastic net can be calculated as the trace of the projection matrix; that is,

df(λ, α) = tr{X_Ω̂ (X_Ω̂′ X_Ω̂ + λ(1 − α) Ψ_Ω̂′ Ψ_Ω̂)⁻¹ X_Ω̂′}
where X_Ω̂ is the matrix of selected regressors and Ψ_Ω̂ is the corresponding submatrix of penalty loadings. The unbiased estimator of the degrees of freedom provides a justification for using the classical AIC and BIC to select tuning parameters (Zou, Hastie, and Tibshirani 2007).
The BIC is known to be model-selection consistent if the true model is among the candidate models, whereas AIC is inconsistent. Clearly, the assumption that the true model is among the candidates is strong; even the existence of the “true model” may be problematic, so loss efficiency may become a desirable second best. The AIC is loss efficient, in contrast with the BIC. Yang (2005) shows that the differences between AIC-type information criteria and BIC-type information criteria are fundamental: a consistent model-selection method such as the BIC cannot be loss efficient, and vice versa. Zhang, Li, and Tsai (2010) confirm this relation in the context of penalized regression.
Both the AIC and the BIC are unsuitable in the large-p–small-n context, where they tend to select too many variables (see the Monte Carlo simulations in section 8). It is well known that the AIC is biased in small samples, which motivated the bias-corrected AIC (Sugiura 1978; Hurvich and Tsai 1989),

AICc(λ, α) = n log{σ̂²(λ, α)} + 2 df(λ, α) n / {n − df(λ, α) − 1}
The bias can be severe if df is large relative to n, and thus the AICc should be favored when n is small or with high-dimensional data.
The BIC relies on the assumption that each model has the same prior probability. This assumption seems reasonable when the researcher has no prior knowledge, yet it contradicts the principle of parsimony and becomes problematic if p is large. To see why, consider the case where p = 1,000 (following Chen and Chen [2008]): there are 1,000 models for which one parameter is nonzero (s = 1), while there are 1,000 × 999/2 models for which s = 2. Thus, the prior probability of s = 2 is greater than the prior probability of s = 1 by a factor of 999/2. More generally, because the prior probability that s = j is greater than the probability that s = j − 1 (up to the point where j = p/2), the BIC is likely to overselect variables. To address this shortcoming, Chen and Chen (2008) introduce the extended BIC (EBIC), defined as

EBICξ(λ, α) = n log{σ̂²(λ, α)} + df(λ, α) log(n) + 2ξ df(λ, α) log(p)
which imposes an additional penalty on the size of the model. The prior distribution is chosen such that the probability of a model with dimension j is inversely proportional to the total number of models for which s = j. The additional parameter, ξ ∊ [0, 1], controls the size of the additional penalty.4 Chen and Chen (2008) show in simulation studies that the EBICξ outperforms the traditional BIC, which exhibits a higher false-discovery rate when p is large relative to n.
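Given the residual variance σ̂² and the effective degrees of freedom df, all four criteria are cheap to compute. The Python sketch below uses the penalized-likelihood forms discussed in this section (illustrative only, not lasso2's code):

```python
from math import log

def aic(n, sigma2, df):
    """AIC: n log(sigma^2) + 2 df."""
    return n * log(sigma2) + 2 * df

def bic(n, sigma2, df):
    """BIC: n log(sigma^2) + df log(n)."""
    return n * log(sigma2) + df * log(n)

def aicc(n, sigma2, df):
    """Bias-corrected AIC; the correction blows up as df approaches n."""
    return n * log(sigma2) + 2 * df * n / (n - df - 1)

def ebic(n, sigma2, df, p, xi):
    """EBIC adds a penalty of 2 * xi * df * log(p); xi = 0 recovers BIC."""
    return bic(n, sigma2, df) + 2 * xi * df * log(p)
```

Note that ξ = 0 collapses the EBIC to the BIC, while larger ξ penalizes model size more heavily, which is what protects against overselection when p is large.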
4 Tuning parameter selection using CV
The aim of CV is to directly assess the performance of a model on unseen data. To this end, the data are repeatedly divided into a training and a validation dataset. The models are fit to the training data, and the validation data are used to assess the predictive performance. In the context of regularized regression, CV can be used to select the tuning parameters that yield the best performance, for example, the best out-of-sample mean squared prediction error. Many methods for CV are available. For an extensive review, we recommend Arlot and Celisse (2010). The most popular method is K-fold CV, which we introduce in section 4.1. In section 4.2, we discuss methods for CV in the time-series setting.
4.1 K-fold CV
For K-fold CV, proposed by Geisser (1975), the data are split into K groups of approximately equal size that are referred to as folds. Let Kk denote the set of observations in the kth fold, and let nk be the size of data partition k for k = 1,…, K. In the kth step, the kth fold is treated as the validation dataset, and the remaining K − 1 folds constitute the training dataset. The model is fit to the training data for a given value of λ and α. The resulting estimate, which is based on all the data except the observations in fold k, is denoted β̂_{−k}(λ, α). The procedure is repeated for each fold, as illustrated in figure 2, so that every data point is used for validation once. The mean squared prediction error (MSPE) for each fold is computed as

MSPE_k(λ, α) = (1/nk) Σ_{i ∊ Kk} {yi − xi′ β̂_{−k}(λ, α)}²
Data partition for five-fold CV. Each row corresponds to one step, and each column corresponds to one data partition (“fold”). In the first step, fold 1 constitutes the validation data, and folds 2–5 are the training data.
The K-fold CV estimate of the MSPE, which serves as a measure of prediction performance, is

MSPE_CV(λ, α) = (1/K) Σ_{k=1}^{K} MSPE_k(λ, α)
This suggests selecting λ and α as the values that minimize MSPE_CV(λ, α). An alternative common rule is to use the largest value of λ that is within one standard error of the minimum, which leads to a more parsimonious model.
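The K-fold procedure is easy to sketch for a fixed α. The following numpy example (illustrative, not cvlasso code) cross-validates the ridge penalty λ over a small grid; the data-generating process and grid values are hypothetical:

```python
import numpy as np

def kfold_mspe(X, y, lam, K=5):
    """K-fold CV estimate of the MSPE for ridge with penalty lam
    (unit penalty loadings). Folds are contiguous blocks for simplicity."""
    n, p = X.shape
    folds = np.array_split(np.arange(n), K)
    mspe = []
    for k in range(K):
        val = folds[k]
        train = np.setdiff1d(np.arange(n), val)
        Xt, yt = X[train], y[train]
        # fit ridge on the K-1 training folds only
        beta = np.linalg.solve(Xt.T @ Xt + lam * np.eye(p), Xt.T @ yt)
        resid = y[val] - X[val] @ beta
        mspe.append(np.mean(resid ** 2))
    return float(np.mean(mspe))

rng = np.random.default_rng(42)
X = rng.standard_normal((100, 10))
beta_true = np.zeros(10)
beta_true[:2] = [2.0, -1.0]          # sparse truth
y = X @ beta_true + rng.standard_normal(100)

grid = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = {lam: kfold_mspe(X, y, lam) for lam in grid}
best = min(scores, key=scores.get)   # lambda with the smallest CV MSPE
```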
CV can be computationally expensive. It is necessary to compute MSPE_CV(λ, α) for each value of λ on a grid if α is fixed (for example, when using the lasso) or, in the case of the elastic net, for each combination of values of λ and α on a two-dimensional grid. In addition, the model must be fit K times at each grid point, so the computational cost is approximately proportional to K.5
Standardization adds another layer of computational cost to K-fold CV. An important principle in CV is that the training dataset should not contain information from the validation dataset. This mimics the real-world situation where out-of-sample predictions are made despite the true response being unknown. The principle applies not only to individual observations but also to data transformations such as mean centering and standardization. Specifically, data transformations applied to the training data should not use information from the validation data or the full dataset. Mean centering and standardization using sample means and sample standard deviations for the full sample would violate this principle. Instead, when in each step the model is fit to the training data for a given λ and α, the training dataset must be recentered and restandardized, or, if standardization is built into the penalty loadings, the ψj must be recalculated based on the training dataset.
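The principle can be made concrete in a short numpy sketch: all centering and scaling statistics come from the training fold only and are then applied, unchanged, to the validation fold (illustrative code, not part of cvlasso):

```python
import numpy as np

def standardize_fold(X_train, X_val):
    """Standardize using training-fold statistics only, then apply the
    same transformation to the validation fold: the validation data
    must not leak into the centering and scaling step."""
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0)
    return (X_train - mu) / sd, (X_val - mu) / sd

rng = np.random.default_rng(7)
X = rng.standard_normal((30, 4)) * 5 + 10   # hypothetical unscaled data
Xt_s, Xv_s = standardize_fold(X[:24], X[24:])
# the training part has mean ~0 and sd ~1; the validation part generally
# does not, because it was transformed with training statistics
print(np.allclose(Xt_s.mean(axis=0), 0), np.allclose(Xt_s.std(axis=0), 1))
```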
The choice of K is a practical problem, but it also has theoretical implications. The variance of MSPE_CV decreases with K and is minimal (for linear regression) if K = n, which is referred to as leave-one-out (LOO) CV. Similarly, the bias decreases with the size of the training dataset. Given computational constraints, K values between 5 and 10 are often recommended, on the grounds that the performance of CV rarely improves for K greater than 10 (Hastie, Tibshirani, and Friedman 2009; Arlot and Celisse 2010).
If the aim of the researcher’s analysis is model identification rather than prediction, the theory requires training data to be “small” and the evaluation sample to be close to n (Shao 1993, 1997). The reason is that more data are required to evaluate which model is the “correct” one rather than to decrease bias and variance. This is referred to as the CV paradox (Yang 2006). However, because K-fold CV sets the size of the training sample to approximately n/K, K-fold CV is necessarily ill suited for selecting the true model.
Regarding the K-fold cross-validated lasso, Chetverikov, Liao, and Chernozhukov (2019) demonstrate that the estimation error converges to zero at a near-optimal rate. For example, if the errors are Gaussian, the convergence rate is slower only by a "small logarithmic factor" relative to the theory-driven approach to regularization (discussed in section 5). Through simulations and by deriving sparsity bounds, the authors also confirm that choosing the tuning parameter by CV tends to select more predictors than are in the true model, and more than theory-driven regularization selects.
4.2 CV with time-series data
Serially dependent data violate the principle that training and validation data are independent. That said, standard K-fold CV may still be appropriate in certain circumstances. Bergmeir, Hyndman, and Koo (2018) show that K-fold CV remains valid in the pure autoregressive model if one is willing to assume that the errors are uncorrelated. A useful implication is that K-fold CV can be used on autoregressive models that include a sufficient number of lags and are not otherwise badly misspecified, because such models have uncorrelated errors.
Rolling h-step-ahead CV is an intuitively appealing approach that directly incorporates the ordered nature of time-series data (Hyndman and Athanasopoulos 2012).6 The procedure builds on repeated h-step-ahead forecasts. The procedure is implemented in lassopack and illustrated in figures 3–4.
Rolling h-step-ahead CV with expanding training window. T and V denote that the observation is included in the training and validation samples, respectively. A dot (·) indicates that an observation is excluded from both training and validation data.
Rolling h-step-ahead CV with fixed training window
Figure 3(a) corresponds to the default case of one-step-ahead CV. T denotes the observation included in the training sample, and V refers to the validation sample. In the first step, observations 1–3 constitute the training dataset, and observation 4 is the validation point, whereas the remaining observations are unused as indicated by a dot (·). Figure 3(b) illustrates the case of two-step-ahead CV. In both cases, the training window expands incrementally, whereas figure 4 displays rolling CV with a fixed estimation window.
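The splits in figures 3–4 can be generated programmatically. The following Python sketch (illustrative only, not cvlasso's implementation) returns training and validation indices for expanding- and fixed-window rolling h-step-ahead CV:

```python
def rolling_cv_splits(n, h=1, min_train=3, fixed_window=False):
    """Generate (train, validation) index pairs for rolling
    h-step-ahead CV. With an expanding window the training set grows;
    with a fixed window it always keeps min_train observations."""
    splits = []
    for t in range(min_train, n - h + 1):
        start = t - min_train if fixed_window else 0
        train = list(range(start, t))
        val = t + h - 1          # the h-step-ahead observation
        splits.append((train, val))
    return splits

# one-step-ahead, expanding window, 6 observations:
for train, val in rolling_cv_splits(6, h=1):
    print(train, "->", val)
# [0, 1, 2] -> 3
# [0, 1, 2, 3] -> 4
# [0, 1, 2, 3, 4] -> 5
```

With h = 2, the observation between the end of the training window and the validation point is left unused, matching the dots in figure 3(b).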
4.3 Comparison with information criteria
Because information-based approaches and CV share the aim of model selection, one might expect that the two methods share some theoretical properties. Indeed, AIC and LOO CV are asymptotically equivalent, as shown by Stone (1977) for fixed p. Because information criteria require the model to be fit only once, they are computationally much more attractive, which might suggest that information criteria are superior in practice. However, some advantages of CV are its flexibility and that it adapts better to situations where the assumptions underlying information criteria, for example, homoskedasticity, are not satisfied (Andrews 1991; Arlot and Celisse 2010). If the aim of the analysis is identifying the true model, BIC and EBIC provide a better choice than K-fold CV, because there are strong but well-understood conditions under which BIC and EBIC are model-selection consistent.
5 Rigorous penalization
This section introduces the “rigorous” or plugin approach to penalization. Following Chernozhukov, Hansen, and Spindler (2016), we use the term “rigorous” to emphasize that the framework is grounded in theory: it provides penalization parameters that guarantee the optimal rate of convergence for prediction and parameter estimation, and it yields models whose size is of the same order as that of the true model. Rigorous penalization is of special interest because it provides the basis for methods that facilitate causal inference in the presence of many instruments, many control variables, or both: the instrumental-variables lasso (Belloni et al. 2012), the post-double-selection estimator (Belloni, Chernozhukov, and Hansen 2014), and the postregularization estimator (Chernozhukov, Hansen, and Spindler 2015), all of which are implemented in our sister package pdslasso (Ahrens, Hansen, and Schaffer 2018).
We discuss the conditions required to derive theoretical results for the lasso in section 5.1. Sections 5.2–5.5 present feasible algorithms for rigorous penalization choices for the lasso and square-root lasso under i.i.d., heteroskedastic, and cluster-dependent errors. Section 5.6 presents a related test for joint significance testing.
5.1 Theory of the lasso
There are three main conditions required to guarantee that the lasso is consistent in terms of prediction and parameter estimation.7 The first condition relates to sparsity. Sparsity is an attractive assumption in settings where we have a large set of potentially relevant regressors or consider various model specifications but assume that only one true model exists that includes few regressors. We have introduced exact sparsity in section 2, but the assumption is stronger than needed. For example, some true coefficients may be nonzero but small in absolute size, in which case it might be preferable to omit them, so we use a weaker assumption, approximate sparsity, based on the model

yi = f(wi) + εi
The elementary regressors wi are linked to the dependent variable through the unknown and possibly nonlinear function f(·). The aim of lasso (and square-root lasso) estimation is to approximate f(wi) using the target parameter vector β0 and the transformations xi := P(wi), where P(·) denotes a dictionary of transformations. The vector xi may be large relative to the sample size either because wi itself is high dimensional and xi := wi or because many transformations such as dummies, polynomials, and interactions are considered to approximate f(wi).
The assumption of approximate sparsity requires the existence of a target vector β0, which ensures that f(wi) can be approximated sufficiently well while using only few nonzero coefficients. Specifically, the target vector β0 and the sparsity index s are assumed to meet the condition

||β0||0 ≤ s = o(n)

and the resulting approximation error, ri := f(wi) − xi′β0, is bounded such that

√{En(ri²)} ≤ C√(s/n)    (1)
where C is a positive constant and En denotes the sample average. To emphasize the distinction between approximate and exact sparsity, consider the special case where f(wi) is linear with f(wi) = xi′β⋆ but where the true parameter vector β⋆ violates exact sparsity so that ||β⋆||0 > n. If β⋆ has many elements that are negligible in absolute size, we might still be able to approximate β⋆ using the sparse target vector β0 as long as the resulting approximation error is sufficiently small as specified in (1).
Restricted sparse eigenvalue condition. The second condition relates to the Gram matrix, n−1X′X. In the high-dimensional setting where p is larger than n, the Gram matrix is necessarily rank deficient, and the minimum (unrestricted) eigenvalue is zero; that is,

minδ≠0 δ′(n−1X′X)δ/(δ′δ) = 0
Thus, to accommodate a large p, the full-rank condition of OLS needs to be replaced by a weaker condition. In the context of sparse models, we instead assume that “small” submatrices of size m, where m is slightly larger than s, are well behaved. This is in fact the restricted sparse eigenvalue condition (RSEC) of Belloni et al. (2012). The RSEC formally states that the minimum and maximum sparse eigenvalues

φmin(m) = min{δ: 1≤||δ||0≤m} δ′(n−1X′X)δ/(δ′δ)    and    φmax(m) = max{δ: 1≤||δ||0≤m} δ′(n−1X′X)δ/(δ′δ)

are bounded away from zero and from above. The requirement φmin(m) > 0 implies that all small submatrices of size m have to be positive definite.8
Regularization event. The third central condition concerns the choice of the penalty level λ and the predictor-specific penalty loadings ψj. The idea is to select the penalty parameters to control the random part of the problem in that

λ/n ≥ c maxj ψj−1 |Sj|    (2)

with high probability. Here c > 1 is a constant slack parameter, and Sj = 2En(xij εi) is the jth element of the score vector S, the gradient of the least-squares objective En{(yi − xi′β)²} at the true value β0. The score vector summarizes the noise associated with the estimation problem.
Denote by Λ = n maxj |ψj−1 Sj| the maximal element of the score vector scaled by n and ψj, and denote by qΛ(·) the quantile function for Λ.9 In the rigorous lasso, we choose the penalty parameters λ and ψj and confidence level γ so that

λ ≥ c qΛ(1 − γ)    (3)

which ensures that the regularization event in (2) occurs with probability of at least 1 − γ.
A simple example illustrates the intuition behind this approach. Consider the case where the true model has βj = 0 for j = 1,…, p; that is, none of the regressors appears in the true model. It can be shown that for the lasso to select no variables, the penalty parameters λ and ψj need to satisfy λ ≥ n maxj ψj−1 |2En(xij yi)|.10 Because none of the regressors appears in the true model, yi = εi. We can therefore rewrite the requirement for the lasso to correctly identify the model without regressors as λ ≥ Λ, which (allowing for the slack parameter c) is the regularization event in (2). We want this regularization event to occur with a high probability of at least 1 − γ. If we choose values for λ and ψj such that λ ≥ c qΛ(1 − γ), then, by the definition of a quantile function, we will choose the correct model (no regressors) with a probability of at least 1 − γ. This is simply the rule in (3).
The chief practical problem in using the rigorous lasso is that the quantile function qΛ(·) is unknown. Two approaches to addressing this problem have been proposed in the literature, both of which are implemented in rlasso. The rlasso default is the “asymptotic” or X-independent approach: theoretically grounded and feasible penalty level and loadings are used that guarantee that (2) holds asymptotically, as n → ∞ and γ → 0. The X-independent penalty-level choice can be interpreted as an asymptotic upper bound on the quantile function qΛ(·). In the “exact” or X-dependent approach, the quantile function qΛ(·) is instead estimated by simulating the distribution of Λ conditional on the observed regressors X, yielding qΛ(1 − γ|X), the (1 − γ)-quantile of Λ given X. We first focus on the X-independent approach and introduce the X-dependent approach in section 5.5.
The X-independent penalty level and loadings are

λ = 2cσ√n Φ−1{1 − γ/(2p)}, ψj = √{En(xij²)}    and    λ = 2c√n Φ−1{1 − γ/(2p)}, ψj = √{En(xij² εi²)}    (4)

under homoskedasticity and heteroskedasticity, respectively. c is the slack parameter from above, and the significance level γ is required to converge toward 0. rlasso uses c = 1.1 and γ = 0.1/log(n) as defaults.11,12
Homoskedasticity. We first focus on the case of homoskedasticity. In the rigorous lasso approach, we standardize the score by the penalty loadings. Because E(xij² εi²) = σ²E(xij²) under homoskedasticity, we can separate the problem into two parts: the regressor-specific penalty loadings ψj = √{En(xij²)} standardize the regressors, and σ moves into the overall penalty level. In the special case where the regressors have already been standardized such that En(xij²) = 1, the penalty loadings are ψj = 1. Hence, the purpose of the regressor-specific penalty loadings in the case of homoskedasticity is to accommodate regressors with unequal variance.
The only unobserved term is σ, which appears in the optimal penalty level λ. To estimate σ, we can use some initial set of residuals, ε̂i,0, and calculate the initial estimate as σ̂0 = √{En(ε̂i,0²)}. A possible choice for the initial residuals is ε̂i,0 = yi, as in Belloni et al. (2012) and Belloni, Chernozhukov, and Hansen (2014). rlasso instead uses the residuals from OLS regression of yi on the set D of five regressors exhibiting the highest absolute correlation with yi.13 The procedure is summarized in algorithm A:
□ Algorithm A: Estimation of penalty level under homoskedasticity
1. Set k = 0, and define the maximum number of iterations, K. Regress yi against the subset of d predictors exhibiting the highest absolute correlation with yi, and compute the initial residuals, ε̂i,0, from this regression. Calculate the homoskedastic penalty loadings in (4).
2. If k ≤ K, compute the homoskedastic penalty level in (4) by replacing σ with σ̂k = √{En(ε̂i,k²)}, and obtain the rigorous lasso or postlasso estimator β̂k. Update the residuals, ε̂i,k+1 = yi − xi′β̂k. Set k ← k + 1.
3. Repeat step 2 until k > K or until convergence by updating the penalty level.
□
The rlasso default is to perform one further iteration after the initial estimate (that is, K = 1), which in our experience provides good performance. Either the lasso or postlasso residuals can be used in the update step; rlasso uses the postlasso residuals by default.14
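The penalty-level calculation in algorithm A can be sketched in Python. The sketch below is our own illustration, not the rlasso code; it implements step 1 and the formula for λ, while the refit in step 2 would be carried out by any lasso solver at the computed λ.

```python
import numpy as np
from statistics import NormalDist

def rigorous_lambda(sigma_hat, n, p, c=1.1, gamma=None):
    """Homoskedastic X-independent penalty: 2*c*sigma*sqrt(n)*Phi^-1(1 - gamma/(2p))."""
    if gamma is None:
        gamma = 0.1 / np.log(n)                  # rlasso default
    return 2 * c * sigma_hat * np.sqrt(n) * NormalDist().inv_cdf(1 - gamma / (2 * p))

def initial_sigma(y, X, d=5):
    """Step 1: sigma estimated from OLS of y on the d regressors most
    highly correlated (in absolute value) with y."""
    n, p = X.shape
    corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
    top = np.argsort(corr)[-min(d, p):]          # indices of the top-d regressors
    XD = np.column_stack([np.ones(n), X[:, top]])
    beta, *_ = np.linalg.lstsq(XD, y, rcond=None)
    resid = y - XD @ beta
    return np.sqrt(np.mean(resid ** 2))
```

Step 2 then refits the lasso at λ = rigorous_lambda(initial_sigma(y, X), n, p), recomputes the residuals, and updates σ̂ once (K = 1).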
Heteroskedasticity. The X-independent choice for the overall penalty level under heteroskedasticity is λ = 2c√n Φ−1{1 − γ/(2p)}. The only difference from the homoskedastic case is the absence of σ. The variance of ε is now captured via the penalty loadings, which are set to ψj = √{En(xij² εi²)}. Hence, the penalty loadings here serve two purposes: accommodating both heteroskedasticity and regressors with uneven variance. To help with intuition, we consider the case where the predictors are already standardized. It is easy to see that, if the errors are homoskedastic with σ = 1, the penalty loadings are (asymptotically) just ψj = 1. If the data are heteroskedastic, however, the standardized penalty loadings will not be 1. In most practical settings, the usual pattern will be that ψj > 1 for some j. Intuitively, heteroskedasticity typically increases the likelihood that the term xijεi takes on extreme values, thus requiring a higher degree of penalization through the penalty loadings.15
The disturbances εi are unobserved, so we obtain an initial set of penalty loadings, ψ̂j,0 = √{En(xij² ε̂i,0²)}, from an initial set of residuals, ε̂i,0, similarly to the i.i.d. case above. We summarize the algorithm for estimating the penalty level and loadings as follows:
□ Algorithm B: Estimation of penalty loadings under heteroskedasticity
1. Set k = 0, and define the maximum number of iterations, K. Regress yi against the subset of d predictors exhibiting the highest absolute correlation with yi, and compute the initial residuals, ε̂i,0, from this regression. Calculate the heteroskedastic penalty level λ in (4).
2. If k ≤ K, compute the heteroskedastic penalty loadings using the formula given in (4) by replacing εi with ε̂i,k, and obtain the rigorous lasso or postlasso estimator β̂k. Update the residuals, ε̂i,k+1 = yi − xi′β̂k. Set k ← k + 1.
3. Repeat step 2 until k > K or until convergence by updating the penalty loadings.
□
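The loading formula ψj = √{En(xij² ε̂i²)} used in algorithm B is straightforward to compute from a set of residuals; a small Python illustration (ours, not rlasso's):

```python
import numpy as np

def het_penalty_loadings(X, resid):
    """Heteroskedasticity-robust loadings: psi_j = sqrt(mean_i x_ij^2 * e_i^2)."""
    return np.sqrt(np.mean((X ** 2) * (resid[:, None] ** 2), axis=0))
```

With standardized regressors and i.i.d. unit-variance errors, the loadings are close to 1; errors whose variance rises with a regressor inflate that regressor's loading, in line with the intuition above.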
Theoretical property. Under the RSEC and approximate sparsity, and if the penalty level λ and the penalty loadings are estimated by algorithm A or B, the lasso and postlasso obey, with probability approaching 1,16

√[En{(xi′(β̂ − β0))²}] ≤ C √{s log(p ∨ n)/n}    (5)

||β̂ − β0||1 ≤ C √{s² log(p ∨ n)/n}    (6)

ŝ ≤ Cs

where ŝ = ||β̂||0 denotes the number of selected variables and C is a constant. The first relation in (5) provides an asymptotic bound for the prediction error, and the second relation in (6) bounds the bias in estimating the target parameter β0. Belloni et al. (2012) refer to the above convergence rates as near-oracle rates. If the identity of the s variables in the model were known, the prediction error would converge at the oracle rate √(s/n). Thus, the logarithmic term log(p ∨ n) can be interpreted as the cost of not knowing the true model. The final relation is a sparsity bound that shows that the size of the fit model does not diverge relative to the true model.
For comparison, the rates of the rigorous lasso are faster than the convergence rates of the K-fold cross-validated lasso derived in Chetverikov, Liao, and Chernozhukov (2019). Furthermore, the sparsity bound in Chetverikov, Liao, and Chernozhukov (2019) does not rule out situations where ŝ/s → ∞, implying that the model fit using CV may be much larger than the true model, because the penalty level chosen by K-fold CV cannot be guaranteed to satisfy the regularization event in (2). Indeed, Chetverikov, Liao, and Chernozhukov (2019) confirm in simulations that the cross-validated penalty is often smaller than (2) requires.
5.3 Rigorous square-root lasso
The theory of the square-root lasso is similar to the theory of the lasso (Belloni, Chernozhukov, and Wang 2011; 2014). The jth element of the score vector is now defined as

Sj = En(xij εi) / √{En(εi²)}

To see why the square-root lasso is of special interest, we define the standardized errors νi as νi = εi/σ. The jth element of the score vector becomes

Sj = En(xij σνi) / √{En(σ²νi²)} = En(xij νi) / √{En(νi²)}

and is thus independent of σ. For the same reason, the optimal penalty level for the square-root lasso in the i.i.d. case,

λ = c√n Φ−1{1 − γ/(2p)}    (7)

is independent of the noise level σ.
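Because (7) depends only on n, p, c, and γ, the penalty level can be computed directly from the data dimensions. A minimal Python sketch (an illustration, using the standard library's NormalDist for Φ−1, not the rlasso implementation):

```python
from math import log, sqrt
from statistics import NormalDist

def sqrt_lasso_lambda(n, p, c=1.1, gamma=None):
    """Square-root lasso penalty c*sqrt(n)*Phi^-1(1 - gamma/(2p)); no sigma needed."""
    if gamma is None:
        gamma = 0.1 / log(n)                     # rlasso-style default
    return c * sqrt(n) * NormalDist().inv_cdf(1 - gamma / (2 * p))
```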
Homoskedasticity. The ideal penalty loadings under homoskedasticity for the square-root lasso are given by formula (iv) in table 1, which provides an overview of penalty-loading choices. The ideal penalty parameters are independent of the unobserved error, which is an appealing theoretical property and implies a practical advantage. Because both λ and ψj can be calculated from the data, the rigorous square-root lasso is a one-step estimator under homoskedasticity. Belloni, Chernozhukov, and Wang (2011) show that the square-root lasso performs similarly to the lasso with infeasible ideal penalty loadings.
Table 1. Ideal penalty loadings for the lasso and square-root lasso under homoskedasticity, heteroskedasticity, and cluster dependence

                          lasso      square-root lasso
    homoskedasticity      (i)        (iv)
    heteroskedasticity    (ii)       (v)
    cluster dependence    (iii)      (vi)

note: Formulas (iii) and (vi) use the notation uij = Σt xijt εit.
Heteroskedasticity. In the case of heteroskedasticity, the optimal square-root lasso penalty level remains (7), but the penalty loadings, given by formula (v) in table 1, depend on the unobserved error and need to be estimated. Note that the updated penalty loadings computed from the residuals ε̂i use thresholding: the penalty loadings are enforced to be greater than or equal to the loadings in the homoskedastic case. The rlasso default algorithm used to obtain the penalty loadings in the heteroskedastic case is analogous to algorithm B.17 While the ideal penalty loadings are not independent of the error term if the errors are heteroskedastic, the square-root lasso may still have an advantage over the lasso because the ideal penalty loadings are pivotal with respect to the error term up to scale, as pointed out above.
5.4 Rigorous penalization with panel data
Belloni et al. (2016) extend the rigorous framework to the case of clustered data, which accommodates a limited form of dependence—within-group correlation—as well as heteroskedasticity. They prove consistency of the rigorous lasso using this approach in the large-n–fixed-T and large-n–large-T settings. The authors present the approach in the context of a fixed-effects panel-data model, yit = xit′β + µi + εit, and apply the rigorous lasso after the within transformation to remove the fixed effects µi. The approach extends to any clustered-type setting and to balanced and unbalanced panels. For convenience, we ignore the fixed effects and write the model as a balanced panel:

yit = xit′β + εit,  i = 1, …, n;  t = 1, …, T
The intuition behind the Belloni et al. (2016) approach is similar to that behind the clustered standard errors reported by various Stata estimation commands: observations within clusters are aggregated to create “superobservations” that are assumed independent, and these superobservations are treated similarly to cross-sectional observations in the nonclustered case. Specifically, define for the ith cluster and jth regressor the superobservation uij := Σt xijt εit. Then the penalty loadings for the clustered case are

ψj = √{(nT)−1 Σi ûij²}

which resembles the heteroskedastic case. The rlasso default for the overall penalty level is the same as in the heteroskedastic case, except that the default value for γ is 0.1/log(n); that is, we use the number of clusters n rather than the number of observations nT. lassopack also implements the rigorous square-root lasso for panel data, which uses the overall penalty in (7) and the penalty loadings in formula (vi) in table 1.
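The superobservation construction can be sketched in Python (our illustration, not the rlasso code; a cluster-label vector replaces the i index above):

```python
import numpy as np

def cluster_penalty_loadings(X, resid, cluster):
    """Cluster-robust loadings: form u_ij = sum_t x_ijt * e_it within each
    cluster i, then psi_j = sqrt((1/N) * sum_i u_ij^2), N = total observations."""
    N = X.shape[0]
    Z = X * resid[:, None]                       # observation-level scores
    U = np.vstack([Z[cluster == g].sum(axis=0)   # superobservations
                   for g in np.unique(cluster)])
    return np.sqrt((U ** 2).sum(axis=0) / N)
```

When every cluster contains a single observation, the formula collapses to the heteroskedastic loadings.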
5.5 X-dependent lambda
The X-dependent penalty is an alternative, sharper choice for the overall penalty level. Recall that the asymptotic, X-independent choice in (4) can be interpreted as an asymptotic upper bound on the quantile function of Λ, which is the scaled maximum value of the score vector. Instead of using the asymptotic choice, we can estimate by simulation the distribution of Λ conditional on the observed X and use this simulated distribution to obtain the quantile qΛ(1 − γ|X).
In the case of estimation by the lasso under homoskedasticity, we simulate the distribution of Λ using the statistic W, defined as

W = n maxj |ψj−1 2En(xij σ̂ gi)|

where ψj = √{En(xij²)} is the penalty loading for the homoskedastic case, σ̂ is an estimate of the standard deviation of the error using some previously obtained residuals, and gi is an i.i.d. standard normal variate drawn independently of the data. The X-dependent penalty choice is sharper and adapts to correlation in the regressor matrix (Belloni and Chernozhukov 2011).
Under heteroskedasticity, the lasso X-dependent penalty is obtained by a multiplier bootstrap procedure. In this case, the simulated statistic W is defined as in formula (ii) in table 2. The cluster–robust X-dependent penalty is again obtained analogously to the heteroskedastic case by defining superobservations, and the statistic W is defined as in formula (iii) in table 2. Note that the standard normal variate gi varies across but not within clusters.
Table 2. Definition of the W statistic for the simulation of the distribution of Λ for the lasso and square-root lasso under homoskedasticity, heteroskedasticity, and cluster dependence

                          lasso      square-root lasso
    homoskedasticity      (i)        (iv)
    heteroskedasticity    (ii)       (v)
    cluster dependence    (iii)      (vi)

note: gi is an i.i.d. standard normal variate drawn independently of the data. Formulas (iii) and (vi) use the notation uij = Σt xijt εit.
The X-dependent penalties for the square-root lasso are similarly obtained from quantiles of the simulated distribution of the square-root lasso Λ and are given by formulas (iv), (v), and (vi) in table 2 for the homoskedastic, heteroskedastic, and clustered cases, respectively.18
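For the homoskedastic lasso case, formula (i) in table 2, the simulation can be sketched in Python (an illustrative translation consistent with the notation above, not the rlasso implementation; note that Σi xij gi = n En(xij gi)):

```python
import numpy as np

def xdependent_lambda(X, sigma_hat, gamma=0.1, c=1.1, reps=5000, seed=42):
    """Simulate W = 2*sigma*max_j |sum_i x_ij g_i|/psi_j with g_i ~ N(0,1) i.i.d.
    and return c times the empirical (1-gamma)-quantile of the draws."""
    n, p = X.shape
    psi = np.sqrt(np.mean(X ** 2, axis=0))       # homoskedastic loadings
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((reps, n))           # multiplier draws
    W = 2 * sigma_hat * np.max(np.abs(G @ X) / psi, axis=1)
    return c * np.quantile(W, 1 - gamma)
```

With highly correlated regressors, the simulated quantile adapts and typically falls below the X-independent upper bound.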
5.6 Significance testing with the rigorous lasso
Inference using the lasso, especially in the high-dimensional setting, is a challenging and ongoing research area (see footnote 2). A test that has been developed and is available in rlasso corresponds to the test for joint significance of regressors using F or χ2 tests that is common in regression analysis. Specifically, Belloni et al. (2012) suggest using the Chernozhukov, Chetverikov, and Kato (2013, app. M) sup-score test to test for the joint significance of the regressors; that is,

H0: β1 = β2 = ⋯ = βp = 0
As in the preceding sections, regressors are assumed to be mean centered and in their original units.
If the null hypothesis is correct and the rest of the model is well specified, including the assumption that the regressors are orthogonal to the disturbance εi, then yi = εi and hence E(xiεi) = E(xiyi) = 0. The sup-score statistic is

SSS = √n maxj |En(xij yi)| / ψj

where

ψj = √{En(xij² yi²)}    (8)

Intuitively, the ψj in (8) play the same role as the penalty loadings do in rigorous-lasso estimation.
The p-value for the sup-score test is obtained by a multiplier bootstrap procedure simulating the distribution of SSS by the statistic W, defined as

W = √n maxj |En(xij yi gi)| / ψj

where gi is an i.i.d. standard normal variate drawn independently of the data and ψj is defined as in (8).
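A compact Python sketch of the sup-score statistic and its multiplier-bootstrap p-value (our illustration of the procedure described above, not the rlasso implementation; for simplicity, the bootstrap holds ψj fixed):

```python
import numpy as np

def sup_score_test(y, X, reps=500, seed=0):
    """SSS = sqrt(n)*max_j |mean(x_j*y)|/psi_j with psi_j = sqrt(mean(x_j^2*y^2));
    p-value from the multiplier bootstrap with y replaced by g*y, g ~ N(0,1)."""
    n = len(y)
    psi = np.sqrt(np.mean((X ** 2) * (y[:, None] ** 2), axis=0))
    stat = np.sqrt(n) * np.max(np.abs(X.T @ y / n) / psi)
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((reps, n))
    W = np.sqrt(n) * np.max(np.abs((G * y) @ X / n) / psi, axis=1)
    return stat, float(np.mean(W >= stat))
```

A strong signal in any single regressor drives the statistic up and the bootstrap p-value toward zero.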
The procedure above is valid under both homoskedasticity and heteroskedasticity but requires independence. A cluster–robust version of the sup-score test that accommodates within-group dependence is

SSS = √n maxj |n−1 Σi uij| / ψj

where uij = Σt xijt yit and ψj = √(n−1 Σi uij²).
The p-value for the cluster version of SSS comes from simulating

W = √n maxj |n−1 Σi uij gi| / ψj

where again gi is an i.i.d. standard normal variate drawn independently of the data. Note again that gi is invariant within clusters.
rlasso also reports conservative critical values for SSS using an asymptotically justified upper bound: CV = cΦ−1{1 − γSS/(2p)}. The default value of the slack parameter is c = 1.1, and the default test level is γSS = 0.05. These parameters can be varied using the c() and ssgamma() options, respectively. Because the simulation procedure used to obtain p-values from the simulated statistic W can be computationally intensive, users can request that only the computationally inexpensive conservative critical values be reported by setting the number of simulated values to zero with the ssnumsim(int) option.
6 The commands
The package lassopack consists of three commands: lasso2, cvlasso, and rlasso. Each command offers an alternative method for selecting the tuning parameters λ and α. We discuss each command in turn and present its syntax. We focus on the main syntax and the most important options. Some technical options are omitted for the sake of brevity. For a full description of syntax and options, see the help files.
6.1 lasso2: Base command and information criteria
The primary purpose of lasso2 is to obtain the solution of (adaptive) lasso, elastic net, and square-root lasso for a single value of λ or a list of penalty levels, that is, for λ1,…, λr,…, λR, where R is the number of penalty levels. The basic syntax of lasso2 is as follows:
lasso2 depvar indepvars [if] [in] [, options]
We omit the full list of options for the sake of brevity. Selected options are discussed below; see the help file for a full list of options.
Estimation methods
alpha(real) sets the elastic-net parameter, α, which controls the weight on ℓ1-norm (lasso-type) relative to ℓ2-norm (ridge-type) penalization. alpha(1) corresponds to the lasso (the default estimator), and alpha(0) corresponds to ridge regression. The real value must lie in the interval [0, 1].
sqrt specifies the square-root lasso estimator. Because the square-root lasso does not use any form of ℓ2 penalization, the option is incompatible with alpha(real).
adaptive specifies the adaptive lasso estimator. The penalty loading for predictor j is set to ψj = |β̂j,OLS|−θ, where β̂j,OLS is the OLS estimate, or the univariate OLS estimate if p > n. θ is the adaptive exponent and can be controlled using the adatheta(real) option.
adaloadings(matrix) is a matrix of alternative initial estimates, β̂0, used for calculating adaptive loadings. For example, this could be the vector e(b) from an initial lasso2 estimation. The absolute values of the elements of β̂0 are raised to the power −θ (note the minus).
adatheta(real) is the exponent for calculating adaptive penalty loadings. The default is adatheta(1).
ols specifies that postestimation OLS estimates are displayed and returned in e(betas) or e(b).
The options alpha(real), sqrt, adaptive, and ols can be used to select elastic net, square-root lasso, adaptive lasso, and postestimation OLS, respectively. The default estimator of lasso2 is the lasso, which corresponds to alpha(1). The special case of alpha(0) yields ridge regression.
Options relating to lambda
lambda(numlist) controls the penalty level or levels, λr, used for estimation. numlist is a scalar value or a list of values in descending order. Each λr must be greater than zero. If not specified, a default list is used, which is defined using Mata syntax by lambda(exp(rangen(log(lmax),log(lminratio*lmax),lcount))), where lcount, lminratio, and lmax are defined below; exp() is the exponential function; log() is the natural logarithm; and rangen(a,b,n) creates a column vector going from a to b in n−1 steps (see the mf_range help file). Thus, the default list ranges from lmax to lminratio*lmax, lcount is the number of values, and the distance between consecutive λr is constant on the logarithmic scale.
lcount(integer) is the number of penalty values, R, for which the solution is obtained. The default is lcount(100).
lminratio(real) is the ratio of the minimum penalty level, λR, to the maximum penalty level, λ1. real must be between 0 and 1. The default is lminratio(0.001).
lmax(real) is the maximum penalty level, λ1. By default, λ1 is chosen as the smallest penalty level for which the model is empty; when the regressors are mean centered and standardized, closed-form expressions for this value exist for the elastic net and the square-root lasso (see Friedman, Hastie, and Tibshirani [2010, sec. 2.5]).
The behavior of lasso2 depends on whether the numlist in lambda(numlist) is a scalar or of length greater than 1. If numlist is a list of more than one value, the solution is a matrix of coefficient estimates that are stored in e(betas). Each row in e(betas) corresponds to a distinct value of λr, and each column corresponds to one of the predictors in indepvars. The “path” of coefficient estimates over λr can be plotted using plotpath(method), where method controls whether the coefficient estimates are plotted against lambda (lambda), the natural logarithm of lambda (lnlambda), or the ℓ1 norm (norm). If the numlist in lambda(numlist) is a scalar value, the solution is a vector of coefficient estimates that is stored in e(b). The default behavior of lasso2 is to use a list of 100 values.
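The default grid is thus log-equidistant. A short Python equivalent of the default list (illustrative; the actual construction happens in Mata as shown above):

```python
import numpy as np

def default_lambda_list(lmax, lminratio=0.001, lcount=100):
    """Descending log-spaced grid from lmax to lminratio*lmax, mirroring
    lasso2's default exp(rangen(log(lmax), log(lminratio*lmax), lcount))."""
    return np.exp(np.linspace(np.log(lmax), np.log(lminratio * lmax), lcount))
```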
Information criteria
lic(string) specifies that, after the first lasso2 estimation using a list of penalty levels, the model that corresponds to the minimum information criterion will be fit and displayed. aic, bic, aicc, and ebic (the default) are allowed. However, the results are not stored in e().
postresults is used in combination with lic(string). postresults stores estimation results of the model selected by information criterion in e().19
ic(string) controls which information criterion is shown in the output of lasso2 when lambda() is a list. aic, bic, aicc, and ebic (the default) are allowed.
ebicgamma(real) controls the ξ parameter of the EBIC. ξ needs to lie in the [0, 1] interval. ξ = 0 is equivalent to the BIC. The default choice is ξ = 1 − log(n)/{2 log(p)}.
In addition to obtaining the coefficient path, lasso2 calculates four information criteria (AIC, AICc, BIC, and EBIC). These information criteria can be used for model selection by choosing the value of λr that yields the lowest value for one of the four information criteria. The ic(string) option controls which information criterion is shown in the output, where string can be replaced with aic, aicc, bic, or ebic. lic(string) displays the estimation results corresponding to the model selected by an information criterion. Note that lic(string) will not store the results of the estimation. This has the advantage that the user can compare the results for different information criteria without the need to refit the full model. To save the estimation results, specify the postresults option in combination with lic(string).
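For reference, the four criteria can be written in one common parameterization as follows (a sketch under the assumption that each criterion takes the form n log(RSS/n) plus a complexity penalty; the exact constants used by lasso2 may differ):

```python
import numpy as np

def info_criteria(rss, n, df, p, xi=None):
    """AIC, AICc, BIC, and EBIC for a model with residual sum of squares
    rss, df effective parameters, and p candidate regressors."""
    if xi is None:
        xi = 1 - np.log(n) / (2 * np.log(p))   # lasso2's default EBIC xi
    base = n * np.log(rss / n)
    aic = base + 2 * df
    aicc = base + 2 * df * n / (n - df - 1)    # small-sample correction
    bic = base + df * np.log(n)
    ebic = bic + 2 * xi * df * np.log(p)       # extra penalty growing in p
    return {"aic": aic, "aicc": aicc, "bic": bic, "ebic": ebic}
```

Model selection then amounts to evaluating the chosen criterion at each λr on the path and picking the minimizing value.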
Penalty loadings and standardization
notpen(varlist) sets penalty loadings to zero for predictors in varlist. Unpenalized predictors are always included in the model.
partial(varlist) specifies that variables in varlist are partialed out prior to estimation.
ploadings(matrix) is a row vector of penalty loadings and overrides the default standardization loadings. The size of the vector should equal the number of predictors (excluding partialed-out variables and excluding the constant).
unitloadings specifies that penalty loadings be set to a vector of ones; this option overrides the default standardization loadings.
prestd specifies that the dependent variable and predictors are standardized prior to estimation rather than standardized “on the fly” using penalty loadings. See section 9.2 for more details. By default, the coefficient estimates are unstandardized (that is, returned in original units).
stdcoef returns coefficients in standard deviation units; that is, the coefficients are not converted back to original units. This option is supported only with the prestd option.
Fixed effects and intercept
fe specifies that the within transformation is applied prior to estimation. The option requires the data in memory to be xtset.
noconstant suppresses the constant from estimation. The default behavior is to partial the constant out (that is, to center the regressors).
Plotting
plotpath(string) plots the coefficients path as a function of the ℓ1 norm (norm), lambda (lambda), or the log of lambda (lnlambda).
plotvar(varlist) specifies the list of variables to be included in the plot.
plotopt(string) specifies additional plotting options passed to line. For example, to turn the legend off, use plotopt(legend(off)).
plotlabel displays the variable labels in the plot.
Replay syntax
The replay syntax of lasso2 allows for plotting and changing display options without the need to rerun the full model. It can also be used to fit the model using the value of λ selected by an information criterion; see the help file for the full replay syntax.
Predict syntax
predict [type] newvar [if] [in] [, xb residuals ols lambda(real) lid(int) approx noisily postresults]
xb computes predicted values (the default).
residuals computes residuals.
ols specifies that postestimation OLS will be used for prediction.
If the previous lasso2 estimation uses more than one penalty level (that is, R > 1), the following options are applicable:
lambda(real) specifies the lambda value used for prediction.
lid(int) specifies the index of the lambda value used for prediction.
approx specifies that linear approximation is used instead of reestimation. This is faster but only exact if the coefficient path is piecewise linear.
noisily shows estimation output if reestimation is required.
postresults stores estimation results in e() if reestimation is used.
6.2 CV with cvlasso
cvlasso implements K-fold and h-step-ahead rolling CV. The syntax of cvlasso is
cvlasso depvar indepvars [if] [in] [, options]
We discuss only selected options; see the help file for a full list of options.
Options for K-fold CV
alpha(numlist) is a scalar elastic-net parameter or an ascending list of elastic-net parameters. Note that the alpha() option of cvlasso accepts a numlist, while lasso2 accepts only a scalar. If numlist is a list longer than 1, cvlasso cross-validates over λr with r = 1,…, R and αm with m = 1,…, M.
nfolds(int) is the number of folds used for K-fold CV. The default is nfolds(10), or K = 10.
foldvar(varname) can be used to specify what fold (data partition) each observation lies in. varname is an integer variable with values ranging from 1 to K. By default, the fold variable is randomly generated such that each fold is of approximately equal size.
savefoldvar(varname) saves the fold variable in varname.
seed(real) sets the seed for the generation of a random fold variable.
Options for h-step-ahead rolling CV
rolling uses rolling h-step-ahead CV. The option requires the data to be tsset or xtset.
h(int) changes the forecasting horizon. The default is h(1).
origin(int) controls the number of observations in the first training dataset.
fixedwindow ensures that the size of the training dataset is constant.
Options for selection of lambda
lopt specifies that, after CV, lasso2 fits the model with the value of λr that minimizes the mean squared prediction error; that is, the model is fit with λ = λ̂lopt.
lse specifies that, after CV, lasso2 fits the model with the largest λr whose mean squared prediction error is within one standard error of the minimum; that is, the model is fit with λ = λ̂lse.
postresults stores the lasso2 estimation results in e() (to be used in combination with lse or lopt).
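The two selection rules can be sketched in Python, given the matrix of fold-level mean squared prediction errors (our illustration of the rules, not cvlasso's code):

```python
import numpy as np

def select_lambda(lambdas, mspe_folds):
    """lopt: lambda minimizing the mean CV MSPE; lse: the largest lambda whose
    mean MSPE lies within one standard error of the minimum.
    lambdas: length-R array (descending); mspe_folds: K x R array."""
    mean = mspe_folds.mean(axis=0)
    se = mspe_folds.std(axis=0, ddof=1) / np.sqrt(mspe_folds.shape[0])
    i_opt = int(np.argmin(mean))
    within = mean <= mean[i_opt] + se[i_opt]     # one-standard-error band
    return lambdas[i_opt], lambdas[within].max()
```

The one-standard-error rule favors a more parsimonious model whose estimated prediction error is statistically indistinguishable from the minimum.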
Plotting
plotcv plots the estimated mean squared prediction error as a function of ln(lambda).
plotopt(string) additional plotting options passed to line. For example, to turn the legend off, use plotopt(legend(off)).
Storing estimation results for each fold
saveest(string) saves lasso2 results from each step of the CV in string1,…, stringK, where K is the number of folds. Intermediate results can be restored with estimates restore.
Like lasso2, cvlasso also provides a replay syntax that helps to avoid time-consuming reestimation. The replay syntax of cvlasso can be used for plotting and to fit the model corresponding to λ̂lopt or λ̂lse.
Predict syntax
predict [ type ] newvar [ if ] [ in ] [ , xb residuals lopt lse noisily ]
6.3 rlasso: Rigorous penalization
rlasso implements theory-driven penalization for lasso and square-root lasso. It allows for heteroskedastic, cluster-dependent, and non-Gaussian errors. Unlike lasso2 and cvlasso, rlasso estimates the penalty level λ using iterative algorithms. The syntax of rlasso is given by
rlasso depvar indepvars [if] [in] [weight] [, options]
aweights and pweights are allowed; see [U] 11.1.6 weight. pweights is equivalent to aweights + robust.
As above, we discuss only selected options; see the help file for a full list of options.
pnotpen(varlist) specifies that variables in varlist are not penalized.20
robust specifies that the penalty loadings account for heteroskedasticity.
cluster(var) specifies that the penalty loadings account for clustering on variable var.
center centers moments in heteroskedastic and cluster–robust loadings.21
lassopsi specifies that lasso or square-root lasso residuals be used to obtain the penalty loadings. The default is to use postestimation OLS residuals.22
corrnumber(int) specifies the number of high-correlation regressors used to obtain initial residuals. The default is corrnumber(5), and corrnumber(0) implies that depvar is used in place of residuals.
prestd standardizes data prior to estimation. The default is standardization during estimation via penalty loadings.
Options relating to lambda
xdependent specifies that the X-dependent penalty level is used; see section 5.5.
numsim(int) is the number of simulations used for the X-dependent case. The default is numsim(5000).
lalternative specifies the alternative, less sharp penalty level, which is defined as λ = 2c√{2n log(2p/γ)} (for the square-root lasso, 2c is replaced with c). See footnote 12.
gamma(real) is the γ in the rigorous penalty level [default γ = 0.1/log(n); the cluster-lasso default is γ = 0.1/log(nclust)]. See (4).
c(real) is the c in the rigorous penalty level. The default is c(1.1). See (4).
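To make the roles of c and γ concrete, the following Python sketch computes the homoskedastic plugin penalty level λ = 2c√n Φ⁻¹{1 − γ/(2p)}, with the leading 2c replaced by c for the square-root lasso. The function name and the default for γ are ours; rlasso additionally estimates penalty loadings, which are omitted here.

```python
from math import log, sqrt
from statistics import NormalDist

def rigorous_lambda(n, p, c=1.1, gamma=None, sqrt_lasso=False):
    """Plugin penalty level lambda = 2c*sqrt(n)*Phi^{-1}(1 - gamma/(2p));
    for the square-root lasso the leading 2c is replaced with c.
    Penalty loadings (and the noise level) are handled separately in rlasso
    and omitted in this sketch."""
    if gamma is None:
        gamma = 0.1 / log(n)  # assumed default; see the gamma() option
    quantile = NormalDist().inv_cdf(1 - gamma / (2 * p))
    return (c if sqrt_lasso else 2 * c) * sqrt(n) * quantile
```

Note that the penalty level grows with p through the normal quantile, reflecting that a larger candidate set requires heavier penalization to control spurious selections.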
Sup-score test
supscore reports the sup-score test of statistical significance.
testonly reports only the sup-score test without lasso estimation.
ssgamma(real) is the test level for the conservative critical value for the sup-score test. The default is ssgamma(0.05), that is, a 5% significance level.
ssnumsim(int) controls the number of simulations for the multiplier bootstrap of the sup-score test. The default is ssnumsim(500); ssnumsim(0) implies no simulation.
Predict syntax
predict [ type ] newvar [ if ] [ in ] [ , xb residuals lasso ols ]
xb generates fitted values (default).
residuals generates residuals.
lasso uses lasso coefficients for prediction. The default is to use estimates posted in the e(b) matrix.
ols uses OLS coefficients based on lasso-selected variables for prediction. The default is to use estimates posted in the e(b) matrix.
7 Demonstrations
In this section, we demonstrate the use of lasso2, cvlasso, and rlasso using one cross-section example (in section 7.1) and one time-series example (in section 7.2).
7.1 Cross-section
For demonstration purposes, we consider the Boston Housing Dataset available on the UCI Machine Learning Repository.23 The dataset includes 506 observations and 14 variables.24 The purpose of the analysis is to predict house prices using a set of census-level characteristics.
Estimation with lasso2
After importing the data and changing the variable order (to simplify the syntax), we use the lasso estimator by regressing the outcome variable (medv) against 11 continuous predictors (crim-lstat), the binary Charles River dummy variable chas, and the highway accessibility index rad, which we treat as categorical.
The lasso2 output above shows the following columns:
Knot is the knot index. Knots are points at which predictors enter or leave the model. The default output shows one line per knot. If the long option is specified, one row per λr value is shown.
ID shows the λr index, that is, r. By default, lasso2 uses a descending sequence of 100 penalty levels.
s is the number of predictors in the model.
L1-Norm shows the ℓ1 norm of coefficient estimates.
The sixth column (here labeled EBIC) shows one of four information criteria. The ic(string) option controls which information criterion is displayed, where string can be aic, aicc, bic, or ebic (the default).
R-sq shows the R2 value.
The final column shows which predictors are entered or removed from the model at each knot. The order in which predictors are entered into the model can be interpreted as an indication of the relative predictive power of each predictor.
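For reference, the following Python sketch computes the four criteria in their standard Gaussian-likelihood forms, where df denotes the number of selected predictors and ξ the EBIC parameter. The function is illustrative only; the exact expressions used by lasso2 are given earlier in the article.

```python
from math import log

def info_criteria(rss, n, df, p, xi=1.0):
    """Standard Gaussian-likelihood information criteria (a sketch, not
    the lasso2 implementation). rss is the residual sum of squares, df the
    number of selected predictors, and xi the EBIC parameter."""
    base = n * log(rss / n)                  # -2 log-likelihood up to a constant
    aic = base + 2 * df
    aicc = base + 2 * df * n / (n - df - 1)  # small-sample correction of AIC
    bic = base + df * log(n)
    ebic = bic + 2 * xi * df * log(p)        # extended BIC for large p
    return {"aic": aic, "aicc": aicc, "bic": bic, "ebic": ebic}
```

Because the EBIC adds a penalty that grows with log(p), it selects sparser models than the BIC when the candidate set is large, consistent with the Monte Carlo results in section 8.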
Because lambda(real) is not specified, lasso2 obtains the coefficient path for a default list of λr values. The largest penalty level is 6858.99, at which the model includes only the constant.
We use the ibn. prefix for rad instead of the i. prefix commonly used in regression analysis. The i. prefix makes Stata omit the (automatically selected) base category from the estimation to avoid perfect collinearity (see help fvvarlist). In contrast with OLS, the lasso does not require the full-rank condition, allowing us to include one dummy for every category without removing the base category dummy. In this way, we let the lasso choose the base category automatically as the categories omitted from the selected model.
Figure 5 shows the coefficient path of the lasso for selected variables as a function of ln(λ). The graph illustrates how the lasso shrinks coefficients toward zero as the penalty level increases (from left to right). We can also infer the order in which predictors leave the model. The figure was created using the following command:
The coefficient path of the lasso for selected variables as a function of ln(λ)
The plotopt() option passes settings to Stata’s line command and is used here to switch the legend off. Instead of a standard legend, we specify the plotlabel option to show variable names on the left-hand side of the graph.
The commands of lassopack support Stata’s factor-variable notation. This is especially useful when the relationship between the outcome variable and predictors is suspected to be nonlinear. For demonstration purposes, we add squared terms and interaction effects. We also allow the effect of lstat to depend on whether the binary Charles River dummy variable (chas) takes the values 1 or 0.
lasso2 supports model selection using information criteria. To this end, we use the replay syntax with the lic() option, which avoids fitting the full model again. The lic() option can also be specified in the first lasso2 call. In the following example, the replay syntax works similarly to a postestimation command:
Two columns are shown in the output: one for the lasso estimator and one for postestimation OLS, which applies OLS to the model selected by the lasso.
K-fold CV with cvlasso
Next, we consider K-fold CV:
The cvlasso output displays four columns: the index of λr (that is, r), the value of λr, the estimated mean squared prediction error (MSPE), and the standard deviation of the MSPE. The output indicates the value of λr that corresponds to the lowest MSPE with an asterisk (*); we refer to this value as λlopt. In addition, the symbol ⁁ marks the largest value of λr that is within one standard error of λlopt, which we denote as λlse.
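The selection rule can be sketched in a few lines of Python (illustrative only, not the cvlasso implementation; the function name is ours). Given the λ grid in descending order together with the estimated MSPE and its standard error at each grid point, it returns the MSPE-minimizing λ and the one-standard-error choice:

```python
def select_lambda(lambdas, mspe, se):
    """Return lambda_lopt (the lambda minimizing the estimated MSPE) and
    lambda_lse (the largest lambda whose MSPE lies within one standard
    error of the minimum). A sketch of the selection rule."""
    iopt = min(range(len(lambdas)), key=lambda r: mspe[r])
    lopt = lambdas[iopt]
    threshold = mspe[iopt] + se[iopt]
    lse = max(lam for lam, m in zip(lambdas, mspe) if m <= threshold)
    return lopt, lse
```

The one-standard-error choice deliberately trades a small increase in estimated prediction error for a sparser, more heavily penalized model.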
The estimated mean squared prediction error is shown in figure 6, which was created using the plotcv option. The graph shows the MSPE estimated by CV along with ± one standard error. The continuous and dashed vertical lines correspond to λlopt and λlse, respectively.
The MSPE estimated by CV along with ± one standard error. The continuous and dashed vertical lines correspond to lopt and lse, respectively.
To fit the model corresponding to either λlopt or λlse, we use the lopt or lse option, respectively. Similarly to the lic() option of lasso2, lopt and lse can be specified either in the first cvlasso call or after estimation using the replay syntax as in this example:
Rigorous penalization with rlasso
Lastly, we consider rlasso, which runs an iterative algorithm to estimate the penalty level and loadings. In contrast with lasso2 and cvlasso, it reports the selected model directly.
The supscore option prompts the sup-score test of joint significance. The p-value is obtained through multiplier bootstrap. The test statistic of 16.15 can also be compared with the asymptotic 5% critical value (here 3.75).
7.2 Time-series data
A standard problem in time-series econometrics is to select an appropriate lag length. In this subsection, we show how lassopack can be used for this purpose. We consider Stata’s built-in dataset lutkepohl2.dta, which includes quarterly (log-differenced) consumption (dln_consump), investment (dln_inv), and income (dln_inc) series for West Germany from the first quarter of 1960 to the fourth quarter of 1982. We demonstrate lag selection both via information criteria and via h-step-ahead rolling CV. We do not consider the rigorous penalization approach of rlasso because of its assumption of independence, which seems too restrictive in the time-series context.
Information criteria
After importing the data, we run the most general model with up to 12 lags of the variables dln_consump, dln_inv, and dln_inc using lasso2 with the lic(aicc) option.
The output consists of two parts. The second part of the output is prompted because lic(aicc) is specified. lic(aicc) asks lasso2 to fit the model selected by AICc, which in this case corresponds to λ11 = 0.207.
h-step-ahead rolling CV
In the next step, we consider h-step-ahead rolling CV.
The output indicates how the dataset is partitioned into training data and the validation point. For example, the shorthand 13-80 (81) in the output above indicates that observations 13 to 80 constitute the training dataset in the first step of CV, and observation 81 is the validation point. The options fixedwindow, h(), and origin() can be used to control the partitioning of data into training and validation data. h() sets the parameter h. For example, h(2) prompts 2-step-ahead forecasts (the default is h(1)). fixedwindow ensures that the training dataset is the same size in each step. If origin(50) is specified, the first training partition includes observations 13 to 50, as shown in the next example:
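The partitioning logic can be sketched as follows (Python; the function, its argument names, and the sample bounds in the examples below are ours and purely illustrative, not the cvlasso code). Each partition is a tuple of the first and last training observation and the validation point:

```python
def rolling_cv_splits(first, last, origin, h=1, fixed_window=False):
    """Enumerate h-step-ahead rolling CV partitions as tuples
    (train_start, train_end, validation_point) over observation indices.
    `origin` is the last observation of the first training window; `first`
    and `last` bound the usable sample. A sketch of the partitioning logic."""
    splits = []
    train_end = origin
    while train_end + h <= last:
        # with a fixed window the training set shifts forward; otherwise it expands
        start = first + (train_end - origin) if fixed_window else first
        splits.append((start, train_end, train_end + h))
        train_end += 1
    return splits
```

With first = 13 and origin = 80, the first partition is (13, 80, 81), matching the shorthand 13-80 (81) in the output above; with origin = 50, the first training window is observations 13 to 50.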
The option plotcv creates the graph of the estimated mean squared prediction error in figure 7. To fit the model corresponding to λlopt, we can—as in the previous examples—use the replay syntax:
CV plot. The graph uses Stata’s built-in dataset lutkepohl2.dta and 1-step-ahead rolling CV with origin(50).
Take care when setting the parameters of h-step-ahead rolling CV. The default settings have no particular econometric justification.
8 Monte Carlo simulation
We have introduced three alternative approaches for setting the penalization parameters in sections 3–5. In this section, we present the results of Monte Carlo simulations, which assess the performance of these approaches in terms of in-sample fit, out-of-sample prediction, model selection, and sparsity. To this end, we generate artificial data using the process

yi = Σ(j=1 to p) βj xij + εi,  εi ∼ N(0, σ²),  i = 1,…, 2n

with n = 200. We report results for p = 100 and for the high-dimensional setting with p = 220. The predictors xij are drawn from a multivariate normal distribution with corr(xij, xir) = 0.9^|j−r|. We vary the noise level σ between 0.5 and 5; specifically, we consider σ ∈ {0.5, 1, 2, 3, 5}. We define the parameters as βj = 1{j ≤ s} for j = 1,…, p with s = 20, implying exact sparsity. This simple design allows us to gain insights into the model-selection performance in terms of false-positive frequency (the number of variables falsely included) and false-negative frequency (the number of variables falsely omitted) when relevant and irrelevant regressors are correlated. All simulations use at least 1,000 iterations. We report additional Monte Carlo results in appendix A.1, where we use a design in which coefficients alternate in sign.
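One draw from this design can be sketched in Python as follows. The AR(1) construction across the predictor index is one standard way to obtain corr(xj, xr) = 0.9^|j−r| with unit variances; the function name and the exactly sparse coefficient rule βj = 1{j ≤ s} used below are our reading of the design, not lassopack code.

```python
import random
from math import sqrt

def simulate(n_obs, p, s, sigma, rho=0.9, seed=1):
    """Draw one dataset from the Monte Carlo design: predictors with
    corr(x_j, x_r) = rho^|j-r| (generated as a stationary AR(1) across the
    predictor index), coefficients beta_j = 1{j <= s}, and N(0, sigma^2)
    noise. An illustrative sketch of the design described in the text."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_obs):
        x = [rng.gauss(0, 1)]
        for _ in range(p - 1):
            # AR(1) step preserves unit variance and geometric correlation decay
            x.append(rho * x[-1] + sqrt(1 - rho ** 2) * rng.gauss(0, 1))
        X.append(x)
        y.append(sum(x[:s]) + rng.gauss(0, sigma))
    return X, y
```

Because the correlation decays geometrically in |j − r|, the relevant predictors (small j) are most strongly correlated with the irrelevant predictors immediately following them, which is what makes false positives likely near the sparsity boundary.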
Because the aim is to assess in-sample and out-of-sample performance, we generate 2n observations and use observations i = 1,…, n as the estimation sample and observations i = n + 1,…, 2n for assessing out-of-sample prediction performance. This allows us to calculate the root mean squared error (RMSE) and root mean squared prediction error (RMSPE) as

RMSE = √{(1/n) Σ(i=1 to n) (yi − ŷi)²}  and  RMSPE = √{(1/n) Σ(i=n+1 to 2n) (yi − ŷi)²}

where ŷi are the predictions from fitting the model to the first n observations.
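The two measures can be computed as follows (a short Python sketch; the function name is ours):

```python
from math import sqrt

def rmse_rmspe(y, yhat, n):
    """In-sample RMSE over observations 1..n and out-of-sample RMSPE over
    observations n+1..2n, where yhat holds predictions from a model fit on
    the first n observations only."""
    rmse = sqrt(sum((a - b) ** 2 for a, b in zip(y[:n], yhat[:n])) / n)
    rmspe = sqrt(sum((a - b) ** 2 for a, b in zip(y[n:2 * n], yhat[n:2 * n])) / n)
    return rmse, rmspe
```

A large gap between a low RMSE and a high RMSPE is the overfitting signature discussed for the AIC and stepwise regression below.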
Figures 8 and 9 illustrate results for the following estimation methods implemented in lassopack: the lasso with λ selected by AIC, BIC, EBICξ, and AICc (as implemented in lasso2); the rigorous lasso and the rigorous square-root lasso (implemented in rlasso) using the X-independent penalty choice; and the five-fold cross-validated lasso using the penalty level that minimizes the estimated MSPE (implemented in cvlasso). In addition, we report postestimation OLS results (denoted by dots).
Monte Carlo simulation for an exactly sparse parameter vector with p = 100 and n = 200
Monte Carlo simulation for an exactly sparse parameter vector with p = 220 and n = 200
For comparison, we also show results of stepwise regression (for p = 100 only) and the oracle estimator. Stepwise regression starts from the full model and iteratively removes regressors if the p-value is above a predefined threshold (10% in our case). Stepwise regression is known to suffer from overfitting and pretesting bias. However, it still serves as a relevant reference point because of its connection with ad hoc model selection using hypothesis testing and the general-to-specific approach. The oracle estimator is OLS applied to the predictors included in the true model. Naturally, the oracle estimator is expected to show the best performance but is not feasible in practice because the true model is not known. The results of figures 8 and 9 are also reported in tabular form in tables 5 and 6 in appendix A.2 along with additional information, including results for the rigorous lasso and rigorous square-root lasso with an X-dependent penalty choice.
We first summarize the main results in figure 8 for the case where p = 100. AIC and stepwise regression exhibit the worst selection performance, with around 18–20 falsely included predictors on average. While AIC and stepwise regression achieve the lowest RMSE (best in-sample fit), their out-of-sample prediction performance is among the worst—a typical symptom of overfitting. Note that the in-sample fit of AIC and stepwise regression even exceeds the fit of the oracle estimator. The corrected AIC improves upon the standard AIC in terms of both false-positive rate and prediction performance.
Compared with AICc, the BIC-type information criteria show similar out-of-sample prediction and better selection performance. While the EBIC performs only marginally better than BIC in terms of false positives, we expect the relative performance of BIC and EBIC to shift in favor of EBIC as p increases relative to n. Five-fold CV with the lasso behaves very similarly to the AICc across all measures. The rigorous lasso, rigorous square-root lasso, and EBIC exhibit overall the lowest false-positive rates, whereas the rigorous methods yield slightly higher RMSE and RMSPE than the information criteria and CV-based methods.
Postestimation OLS generally exhibits better in-sample fit, indicating that the shrinkage bias from penalization can be alleviated by applying OLS to the selected model. In terms of out-of-sample prediction, postestimation OLS performs worse than first-step lasso when variable-selection performance is poor (particularly for the AIC) but leads to performance improvements for rigorous methods.
We also present simulation results for the high-dimensional setting in figure 9. Specifically, we consider p = 220 instead of p = 100 while keeping the estimation sample size constant at n = 200. With between 164 and 195 included predictors on average, the AIC suffers from overfitting. The RMSPE of the AIC exceeds its RMSE by a factor of 5 or more. In comparison, the AICc and five-fold CV perform better as model selectors, with a false-positive frequency between six and eight predictors.
Despite many predictors to choose from, EBIC and rigorous methods perform generally well in recovering the true structure. The false-positive frequency is below 1 across all noise levels, and the false-negative rate is zero if σ is 1 or smaller. While the BIC performs similarly to the EBIC for σ = 0.5 and σ = 1, its performance resembles the poor performance of AIC for larger noise levels. The Monte Carlo results in table 6 in appendix A.2 highlight that EBIC and rigorous methods are well suited for the high-dimensional setting where p > n, while AIC and BIC are not appropriate.
The computational costs of each method are reported in table 3. rlasso with the X-independent penalty is the fastest method considered. The runtimes of the rigorous lasso and rigorous square-root lasso with p = 100 are 0.1 and 0.4 seconds, respectively, and increase only slightly when p is raised to 220. rlasso with an X-dependent penalty simulates the distribution of the maximum value of the score vector, which increases the computational cost of the rigorous lasso to 5.9 seconds for p = 100 (12.7 seconds for p = 220). With an average runtime of 3.1 seconds, lasso2 is slightly faster than rlasso with the X-dependent penalty if p = 100 but slower in the high-dimensional setup. Unsurprisingly, K-fold CV is the slowest method because it requires the model to be fit K times over a range of tuning parameters.
Runtime in seconds with p = 100 and p = 220

Method                        Call                           p = 100   p = 220
Rigorous lasso                rlasso y x                        0.09      0.24
  with X-dependent penalty    rlasso y x, xdep                  5.92     12.73
Rigorous square-root lasso    rlasso y x, sqrt                  0.39      0.74
  with X-dependent penalty    rlasso y x, sqrt xdep             3.34      7.03
CV                            cvlasso y x, nfolds(5) lopt      23.50    293.93
Information criteria          lasso2 y x                        3.06     44.06
Stepwise regression           stepwise, pr(.1): reg y x         4.65         –

note: PC specification: Intel Core i5-6500 with 16 GB RAM, Windows 7.
9 Technical notes
9.1 Pathwise coordinate-descent algorithms
lassopack implements the elastic net and square-root lasso using coordinate-descent algorithms. The algorithm—then referred to as “shooting”—was first proposed by Fu (1998) for the lasso and by Van der Kooij (2007) for the elastic net. Belloni, Chernozhukov, and Wang (2011, 2014) use the coordinate descent for the square-root lasso and have kindly provided MATLAB code.
Coordinate-descent algorithms repeatedly cycle over the predictors j = 1,…, p and update single-coefficient estimates until convergence. Suppose the predictors are centered and standardized to have unit variance, and the penalty loadings are ψj = 1 for all j. In that case, the update for coefficient j is obtained using univariate regression of the current partial residuals (that is, excluding the contribution of predictor j) against predictor j. More precisely, the update for the elastic net takes the form

β̂j ← S{(1/n) Σi xij (yi − ŷi(j)), λα} / {1 + λ(1 − α)}

where β̂j denotes the current coefficient estimate and ŷi(j) is the predicted value excluding the contribution of predictor j. Thus, because the predictors are standardized, (1/n) Σi xij (yi − ŷi(j)) is the OLS estimate from regressing the partial residual yi − ŷi(j) against predictor j. The function S(a, b), referred to as the soft-thresholding operator, is

S(a, b) = sign(a) max(|a| − b, 0)
and it sets some of the coefficients equal to zero. The coordinate-descent algorithm is spelled out for the square-root lasso in Belloni, Chernozhukov, and Wang (2014, supplementary material).25
The algorithm requires an initial beta estimate for which the ridge estimate is used. If the coefficient path is obtained for a list of λ values, lasso2 starts from the largest λ value and uses previous estimates as initial values (“warm starts”). See Friedman et al. (2007) and Friedman, Hastie, and Tibshirani (2010) and references therein for further information.
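A compact Python sketch of the cyclic coordinate-descent (“shooting”) algorithm for the lasso illustrates the update; the objective scaling below, (1/(2n))‖y − Xβ‖² + λ‖β‖₁ with standardized predictors and unit penalty loadings, follows Friedman, Hastie, and Tibshirani (2010) and is an assumption of this sketch, not the lassopack implementation.

```python
def soft_threshold(a, b):
    """Soft-thresholding operator S(a, b) = sign(a) * max(|a| - b, 0)."""
    if a > b:
        return a - b
    if a < -b:
        return a + b
    return 0.0

def lasso_cd(X, y, lam, tol=1e-10, max_iter=10_000):
    """Lasso via cyclic coordinate descent, a minimal sketch. Assumes the
    columns of X are centered with (1/n) * sum_i x_ij^2 = 1 and unit penalty
    loadings; objective (1/(2n)) * ||y - X b||^2 + lam * ||b||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # current residuals y - X @ beta
    for _ in range(max_iter):
        max_change = 0.0
        for j in range(p):
            # OLS coefficient of the partial residual on predictor j
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_bj = soft_threshold(rho, lam)
            if new_bj != beta[j]:
                delta = new_bj - beta[j]
                for i in range(n):
                    resid[i] -= X[i][j] * delta  # keep residuals in sync
                max_change = max(max_change, abs(delta))
                beta[j] = new_bj
        if max_change < tol:
            break
    return beta
```

Keeping the residual vector in sync after each single-coefficient update is what makes the per-coordinate step cheap; this, together with warm starts along the λ path, accounts for the speed of pathwise coordinate descent.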
9.2 Standardization
Because penalized regression methods are not invariant to scale, it is common practice to standardize the regressors xij to have mean zero and unit variance before computing the estimation results and then to unstandardize the coefficients after estimation. We refer to this approach as preestimation standardization. An alternative is to standardize on the fly by adapting the penalty loadings. The results are equivalent in theory: in the case of the lasso, setting the penalty loading ψj equal to the standard deviation of predictor j yields the same results as dividing predictor j by its standard deviation before estimation. Standardization on the fly is the default in lassopack because it tends to be faster. Preestimation standardization can be requested with the prestd option, which can lead to improved numerical precision or more stable results in the case of difficult problems; the cost is the (typically small) computation time required to standardize the data. The unitloadings option can be used if the researcher does not want to standardize the data. If the preestimation-standardization and standardization-on-the-fly results differ, the user can compare the values of the penalized minimized objective function saved in e(pmse) (the penalized mean squared error for the elastic net) or e(prmse) (the penalized RMSE for the square-root lasso).
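The equivalence of the two standardization routes can be verified in a toy single-predictor example (Python; illustrative only, not lassopack code; the closed-form solution assumes a mean-zero predictor):

```python
from math import sqrt

def soft(a, b):
    """Soft-thresholding operator sign(a) * max(|a| - b, 0)."""
    return (abs(a) - b if abs(a) > b else 0.0) * (1.0 if a >= 0 else -1.0)

def lasso_1d(x, y, lam, psi=1.0):
    """Closed-form single-predictor lasso with penalty loading psi:
    minimize (1/(2n)) * sum (y_i - b*x_i)^2 + lam * psi * |b|."""
    n = len(x)
    rho = sum(xi * yi for xi, yi in zip(x, y)) / n
    return soft(rho, lam * psi) / (sum(xi * xi for xi in x) / n)

def compare_standardization(x, y, lam):
    """Loading psi = sd(x) ('on the fly') versus dividing x by its standard
    deviation before estimation and unstandardizing afterward."""
    n = len(x)
    sd = sqrt(sum(xi * xi for xi in x) / n)  # x assumed mean zero here
    on_the_fly = lasso_1d(x, y, lam, psi=sd)
    prestd = lasso_1d([xi / sd for xi in x], y, lam) / sd
    return on_the_fly, prestd
```

Both routes solve the same penalized problem, so any discrepancy in practice reflects numerical precision rather than the method.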
9.3 Zero penalization and partialing out
In many applications, theory suggests that specific predictors affect the outcome variable. Hence, it might be desirable to always include these predictors in the model to improve finite-sample performance. Typical examples are the intercept, a time trend, or any other predictor that the researcher has prior knowledge of. lassopack offers two approaches for such situations:
• Zero penalization: The notpen() option of lasso2 and cvlasso allows one to set the penalty for specific predictors to 0; that is, ψℓ = 0 for some ℓ ∊ {1,…, p}. Those variables are not subject to penalization and will always be included in the model. rlasso supports zero penalization through the pnotpen() option, which accommodates zero penalization in the rigorous-lasso penalty loadings; see below.
• Partialing out: We can also apply the penalized regression method to the data after the effect of certain regressors has been partialed out. Partialing out is supported by lasso2, cvlasso, and rlasso using the partial() option. The penalized regression does not yield estimates of the partialed-out coefficients directly. Instead, lassopack recovers the partialed-out coefficients by postestimation OLS.
The two methods—zero penalization and partialing out—are numerically equivalent. Formally, suppose we do not want to subject a subset of the predictors to penalization. The zero-penalization estimate applies the penalized regression to the full set of predictors with the penalty loadings of the unpenalized predictors set to zero, whereas the partialed-out estimate applies the penalized regression to ỹ and the x̃j, the residuals from regressing y and the penalized regressors against the set of unpenalized regressors. The equivalence states that the two approaches yield the same coefficient estimates for all penalized predictors. The result is spelled out in Yamada (2017) for the lasso and ridge regression but holds for the elastic net more generally.
The partial() or notpen() option can be used for variables that should not be penalized by the lasso. The options are equivalent in theory (see above), but their numerical results can differ in practice because of the different calculation methods used. Partialing out variables can lead to improved numerical precision or more stable results in the case of difficult problems compared with zero penalization but may be slower in terms of computation time.
The estimation of penalty loadings in the rigorous lasso introduces an additional complication that necessitates the rlasso-specific option pnotpen(). The theory for the rlasso penalty loadings is based on the penalized regressors after partialing out the unpenalized variables. The pnotpen() option guarantees that the penalty loadings for the penalized regressors are the same as if the unpenalized regressors had instead first been partialed out.
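The equivalence is easy to verify in the simplest case, where the only unpenalized variable is the constant (Python; a toy sketch with one penalized slope, not lassopack code):

```python
from statistics import mean

def soft(a, b):
    """Soft-thresholding operator sign(a) * max(|a| - b, 0)."""
    return (abs(a) - b if abs(a) > b else 0.0) * (1.0 if a >= 0 else -1.0)

def lasso_unpenalized_intercept(x, y, lam, iters=500):
    """Zero penalization: penalized slope b plus unpenalized intercept a,
    minimizing (1/(2n)) * sum (y_i - a - b*x_i)^2 + lam*|b| by exact
    alternating coordinate updates."""
    n = len(x)
    a = b = 0.0
    for _ in range(iters):
        a = mean(yi - b * xi for xi, yi in zip(x, y))  # unpenalized coordinate
        rho = sum(xi * (yi - a) for xi, yi in zip(x, y)) / n
        b = soft(rho, lam) / (sum(xi * xi for xi in x) / n)
    return a, b

def lasso_partial_out_constant(x, y, lam):
    """Partialing out: demean y and x, fit the penalized slope, and recover
    the intercept by postestimation OLS (a = mean(y) - b*mean(x))."""
    n = len(x)
    xd = [xi - mean(x) for xi in x]
    yd = [yi - mean(y) for yi in y]
    rho = sum(xi * yi for xi, yi in zip(xd, yd)) / n
    b = soft(rho, lam) / (sum(xi * xi for xi in xd) / n)
    return mean(y) - b * mean(x), b
```

The two routines agree up to numerical precision, which is the single-regressor analogue of the general equivalence stated above.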
The fe (fixed-effects) option is equivalent to (but computationally faster and more accurate than) specifying unpenalized panel-specific dummies. The fixed-effects (within) transformation also removes the constant and the fixed effects. The panel variable used by the fe option is the panel variable set by xtset. If installed, the within transformation uses the fast ftools package by Correia (2016).
The prestd, notpen(), and pnotpen() options can be used as simple checks for numerical stability by comparing results that should be equivalent in theory. The values of the penalized minimized objective function saved in e(pmse) for the elastic net and e(prmse) for the square-root lasso may also be used for comparison.
9.4 Treatment of the constant
By default, the constant (if present) is not penalized; this is equivalent to mean centering prior to estimation. The partial() option also partials out the constant (if present). To partial out the constant only, we can specify partial(_cons). Both partial() and fe mean-center the data; the noconstant option is redundant in this case and may not be specified with these options. If the noconstant option is specified, an intercept is not included in the model, but the penalty loadings are still estimated using mean-centered regressors (see the center option).
Supplemental material st0594: lassopack: Model selection and prediction with regularized regression in Stata, by Achim Ahrens, Christian B. Hansen, and Mark E. Schaffer, The Stata Journal.
10 Acknowledgments
We thank Alexandre Belloni, who provided MATLAB code for the square-root lasso, and Sergio Correia for supporting us with the use of ftools. We also thank Christopher F. Baum, Jan Ditzen, David M. Drukker, Martin Spindler, participants of the 2018 London Stata Conference, and participants of the 2018 Swiss Stata Users Group meeting for many helpful comments and suggestions. We have received excellent comments from an anonymous referee that have improved the article. We are also grateful to the many users that have provided feedback. All remaining errors are our own.
11 Programs and supplemental materials
To install a snapshot of the corresponding software files as they existed at the time of publication of this article, type
To get the latest stable version of lassopack from our website, check the installation instructions at .
A Appendix
References
Ahrens, A., C. B. Hansen, and M. E. Schaffer. 2018. pdslasso: Stata module for post-selection and post-regularization OLS or IV estimation and inference. Statistical Software Components S458459, Department of Economics, Boston College. https://ideas.repec.org/c/boc/bocode/s458459.html.
Andrews, D. W. K. 1991. Asymptotic optimality of generalized CL, cross-validation, and generalized cross-validation in regression with heteroskedastic errors. Journal of Econometrics 47: 359–377. https://doi.org/10.1016/0304-4076(91)90107-O.
Arlot, S., and A. Celisse. 2010. A survey of cross-validation procedures for model selection. Statistics Surveys 4: 40–79. https://doi.org/10.1214/09-SS054.
Athey, S. 2019. The impact of machine learning on economics. In The Economics of Artificial Intelligence: An Agenda, ed. A. Agrawal, J. Gans, and A. Goldfarb, 507–547. Cambridge, MA: National Bureau of Economic Research.
Athey, S., and G. Imbens. 2019. Machine learning methods economists should know about. ArXiv Working Paper No. arXiv:1903.10075. https://arxiv.org/abs/1903.10075.
Belloni, A., D. Chen, V. Chernozhukov, and C. Hansen. 2012. Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica 80: 2369–2429. https://doi.org/10.3982/ECTA9626.
Belloni, A., and V. Chernozhukov. 2011. High dimensional sparse econometric models: An introduction. In Inverse Problems and High-Dimensional Estimation, ed. P. Alquier, E. Gautier, and G. Stoltz, chap. 3. Vol. 203 of Lecture Notes in Statistics. Berlin: Springer.
Belloni, A., and V. Chernozhukov. 2013. Least squares after model selection in high-dimensional sparse models. Bernoulli 19: 521–547. https://doi.org/10.3150/11-BEJ410.
Belloni, A., V. Chernozhukov, and C. Hansen. 2014. Inference on treatment effects after selection among high-dimensional controls. Review of Economic Studies 81: 608–650. https://doi.org/10.1093/restud/rdt044.
Belloni, A., V. Chernozhukov, C. Hansen, and D. Kozbur. 2016. Inference in high-dimensional panel models with an application to gun control. Journal of Business & Economic Statistics 34: 590–605. https://doi.org/10.1080/07350015.2015.1102733.
Belloni, A., V. Chernozhukov, and L. Wang. 2011. Square-root lasso: Pivotal recovery of sparse signals via conic programming. Biometrika 98: 791–806. https://doi.org/10.1093/biomet/asr043.
Belloni, A., V. Chernozhukov, and L. Wang. 2014. Pivotal estimation via square-root lasso in nonparametric regression. Annals of Statistics 42: 757–788. https://doi.org/10.1214/14-AOS1204.
Bergmeir, C., R. J. Hyndman, and B. Koo. 2018. A note on the validity of cross-validation for evaluating autoregressive time series prediction. Computational Statistics & Data Analysis 120: 70–83. https://doi.org/10.1016/j.csda.2017.11.003.
Bickel, P. J., Y. Ritov, and A. B. Tsybakov. 2009. Simultaneous analysis of lasso and Dantzig selector. Annals of Statistics 37: 1705–1732. https://doi.org/10.1214/08-AOS620.
Chen, J., and Z. Chen. 2008. Extended Bayesian information criteria for model selection with large model spaces. Biometrika 95: 759–771. https://doi.org/10.1093/biomet/asn034.
Chernozhukov, V., D. Chetverikov, and K. Kato. 2013. Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. Annals of Statistics 41: 2786–2819. https://doi.org/10.1214/13-AOS1161.
Chernozhukov, V., C. Hansen, and M. Spindler. 2015. Post-selection and post-regularization inference in linear models with many controls and instruments. American Economic Review 105: 486–490. https://doi.org/10.1257/aer.p20151022.
Chernozhukov, V., C. Hansen, and M. Spindler. 2016. High-dimensional metrics in R. ArXiv Working Paper No. arXiv:1603.01700. https://arxiv.org/abs/1603.01700.
Chetverikov, D., Z. Liao, and V. Chernozhukov. 2019. On cross-validated lasso. ArXiv Working Paper No. arXiv:1605.02214. http://arxiv.org/abs/1605.02214.
Correia, S. 2016. ftools: Stata module to provide alternatives to common Stata commands optimized for large datasets. Statistical Software Components S458213, Department of Economics, Boston College. https://ideas.repec.org/c/boc/bocode/s458213.html.
Dicker, L. H. 2016. Ridge regression and asymptotic minimax estimation over spheres of growing dimension. Bernoulli 22: 1–37. https://doi.org/10.3150/14-BEJ609.
Dobriban, E., and S. Wager. 2018. High-dimensional asymptotics of prediction: Ridge regression and classification. Annals of Statistics 46: 247–279. https://doi.org/10.1214/17-AOS1549.
Frank, I. E., and J. H. Friedman. 1993. A statistical view of some chemometrics regression tools. Technometrics 35: 109–135. https://doi.org/10.2307/1269656.
Friedman, J., T. Hastie, H. Höfling, and R. Tibshirani. 2007. Pathwise coordinate optimization. Annals of Applied Statistics 1: 302–332. https://doi.org/10.1214/07-AOAS131.
Friedman, J. H., T. Hastie, and R. Tibshirani. 2010. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33(1): 1–22. https://doi.org/10.18637/jss.v033.i01.
Geisser, S. 1975. The predictive sample reuse method with applications. Journal of the American Statistical Association 70: 320–328. https://doi.org/10.2307/2285815.
Hansen, C., and D. Kozbur. 2014. Instrumental variables estimation with many weak instruments using regularized JIVE. Journal of Econometrics 182: 290–308. https://doi.org/10.1016/j.jeconom.2014.04.022.
Hastie, T., R. Tibshirani, and J. Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York: Springer.
Hastie, T., R. Tibshirani, and M. Wainwright. 2015. Statistical Learning with Sparsity: The Lasso and Generalizations. Boca Raton, FL: CRC Press.
Hsu, D., S. M. Kakade, and T. Zhang. 2014. Random design analysis of ridge regression. Foundations of Computational Mathematics 14: 569–600. https://doi.org/10.1007/s10208-014-9192-1.
Hyndman, R. J., and G. Athanasopoulos. 2012. Forecasting: Principles and Practice. OTexts. https://www.otexts.org/fpp.
Jing, B.-Y., Q.-M. Shao, and Q. Wang. 2003. Self-normalized Cramér-type large deviations for independent random variables. Annals of Probability 31: 2167–2215. https://doi.org/10.1214/aop/1024404289.
Kleinberg, J., H. Lakkaraju, J. Leskovec, J. Ludwig, and S. Mullainathan. 2018. Human decisions and machine predictions. Quarterly Journal of Economics 133: 237–293. https://doi.org/10.1093/qje/qjx032.
Lockhart, R., J. Taylor, R. J. Tibshirani, and R. Tibshirani. 2014. A significance test for the lasso. Annals of Statistics 42: 413–468. https://doi.org/10.1214/13-AOS1175.
Meinshausen, N., and P. Bühlmann. 2006. High-dimensional graphs and variable selection with the lasso. Annals of Statistics 34: 1436–1462. https://doi.org/10.1214/009053606000000281.
Meinshausen, N., L. Meier, and P. Bühlmann. 2009. p-values for high-dimensional regression. Journal of the American Statistical Association 104: 1671–1681. https://doi.org/10.1198/jasa.2009.tm08647.
Mullainathan, S., and J. Spiess. 2017. Machine learning: An applied econometric approach. Journal of Economic Perspectives 31: 87–106. https://doi.org/10.1257/jep.31.2.87.
Raninen, E., and E. Ollila. 2017. Scaled and square-root elastic net. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4336–4340. New Orleans, LA: IEEE.
Shao, J. 1993. Linear model selection by cross-validation. Journal of the American Statistical Association 88: 486–494. https://doi.org/10.2307/2290328.
Shao, J. 1997. An asymptotic theory for linear model selection. Statistica Sinica 7: 221–242.
Stone, M. 1977. An asymptotic equivalence of choice of model by cross-validation and Akaike’s criterion. Journal of the Royal Statistical Society, Series B 39: 44–47. https://doi.org/10.1111/j.2517-6161.1977.tb01603.x.
Sugiura, N. 1978. Further analysts [sic] of the data by Akaike’s information criterion and the finite corrections. Communications in Statistics—Theory and Methods 7: 13–26. https://doi.org/10.1080/03610927808827599.
Tibshirani, R. J., and J. Taylor. 2012. Degrees of freedom in lasso problems. Annals of Statistics 40: 1198–1232. https://doi.org/10.1214/12-AOS1003.
Tikhonov, A. N. 1963. On the solution of ill-posed problems and the method of regularization. Doklady Akademii Nauk 151: 501–504.
Van der Kooij, A. 2007. Prediction accuracy and stability of regression with optimal scaling transformations. PhD thesis, Department of Data Theory, University of Leiden.
Weilenmann, B., I. Seidl, and T. Schulz. 2017. The socio-economic determinants of urban sprawl between 1980 and 2010 in Switzerland. Landscape and Urban Planning 157: 468–482. https://doi.org/10.1016/j.landurbplan.2016.08.002.
Yamada, H. 2017. The Frisch–Waugh–Lovell theorem for the lasso and the ridge regression. Communications in Statistics—Theory and Methods 46: 10897–10902. https://doi.org/10.1080/03610926.2016.1252403.
61.
YangY.2005. Can the strengths of AIC and BIC be shared? A conflict between model indentification and regression estimation. Biometrika92: 937–950. https://doi.org/10.1093/biomet/92.4.937.
62.
YangY.2006. Comparing learning methods for classification. Statistica Sinica16: 635–657.
63.
ZhangY.LiR.TsaiC.-L.2010. Regularization parameter selections via generalized information criterion. Journal of the American Statistical Association105: 312–323. https://doi.org/10.1198/jasa.2009.tm08013.
64.
ZhaoP.YuB.2006. On model selection consistency of lasso. Journal of Machine Learning Research7: 2541–2563.
ZouH.ZhangH. H.2009. On the adaptive elastic-net with a diverging number of parameters. Annals of Statistics37: 1733–1751. https://doi.org/10.1214/08-AOS625.