Abstract
We address the challenge of correlated predictors in high-dimensional generalized linear models (GLMs), where regression coefficients range from sparse to dense, by proposing a data-driven random projection (RP) method. This is particularly relevant for applications where the number of predictors is (much) larger than the number of observations and the underlying structure—whether sparse or dense—is unknown. We achieve this by using ridge-type estimates for variable screening and RP to incorporate information about the response–predictor relationship when performing dimensionality reduction. We demonstrate that a ridge estimator with a small penalty is effective for RP and screening, but the penalty value must be carefully selected. Unlike in linear regression, where penalties approaching zero work well, this approach leads to overfitting in non-Gaussian families. Instead, we recommend a data-driven method for penalty selection. In a simulation study, this data-driven RP improved prediction performance over conventional RPs, even surpassing benchmarks like elastic net. Furthermore, an ensemble of multiple such RPs combined with probabilistic variable screening delivered the best aggregated results in prediction and variable ranking across varying sparsity levels in our simulation study at a rather low computational cost. Finally, three applications with count and binary responses demonstrate the method's advantages in interpretability and prediction accuracy.
Introduction
High-dimensional data in a regression context, where the number of variables exceeds the number of observations (i.e., p > n or even p ≫ n), arise in many modern applications such as gene expression and spectroscopy data and pose a challenge for classical estimation techniques such as maximum likelihood.
The generalized linear model (GLM) extends the linear model to both continuous and discrete responses while maintaining interpretability. In high-dimensional settings, GLMs are typically regularized (e.g., Tibshirani, 1996; Fan and Li, 2001; Zou and Hastie, 2005). Alternatively or complementarily, the dimensionality of the feature space can be reduced to a moderate size while learning and inference are performed in the reduced predictor space. One fast way to achieve this is variable screening, that is, selecting a subset of the predictors based on their utility. Methods for screening often rely on parametric (such as the maximum likelihood estimates of univariate GLMs; Fan et al., 2009; Fan and Song, 2010) or nonparametric (e.g., Fan et al., 2011; Mai and Zou, 2013, 2015; Ke, 2023) measures but typically ignore predictor correlations. Fan et al. (2009) suggest an iterative procedure to address this issue, while Wang and Leng (2016) propose screening variables in linear regression using the high-dimensional ordinary least squares projection (HOLP), a ridge-type estimator with a closed-form solution when the penalty converges to zero. However, screening approaches based on ridge-type estimators are still rare in the context of GLMs.
An approach similar in scope to variable screening is random projection (RP), which reduces the dimensionality of the feature space by linearly projecting the features onto a lower-dimensional space, rather than employing a reduced set of the original features. Conventional RPs contain i.i.d. entries from a suitable distribution and are oblivious to the data used in the regression. Such projections have been used in a classification setting, for example, by Cannings and Samworth (2017), while Guhaniyogi and Dunson (2015) and Mukhopadhyay and Dunson (2020) focus on linear regression. On the other hand, Ryder et al. (2019) propose a data-informed RP using an asymmetric transformation of the predictor matrix without using information about the response. Parzer et al. (2025) propose a data-driven projection for linear regression which incorporates information from the estimated HOLP coefficients, that is, about both the predictors and the response.
In this article, we leverage the computational advantages of variable screening and RP and introduce a data-driven RP method for GLMs that accounts for the relationship between predictors and the response, while addressing the potentially complex correlation structure among the predictors. The manuscript extends Parzer et al. (2025) to the GLM family of models, by proposing a ridge-type estimator which can be integrated into a sparse RP matrix and can also be employed for screening the variables prior to projection, as it performs well in preserving the true relationship between predictors and the response.
A key aspect of the proposed ridge-type estimator is the selection of the penalty term. We extend the HOLP estimator to GLMs with canonical link, deriving a closed-form solution, and explore whether it retains the same benefits for RP and screening in GLMs as in linear regression. We find that for non-Gaussian families, a ridge estimator with zero penalty can overfit, making penalty selection a non-trivial task. While small penalty values reduce bias, a data-driven approach to choosing the penalty works best. Specifically, we propose selecting the smallest penalty value for which the deviance ratio in the fit stays under a certain threshold (e.g., 0.8 for non-Gaussian families and 0.999 for Gaussian). Simulations show that using these ridge estimates in the sparse RP matrix outperforms conventional RP techniques.
Given the randomness in the RP matrix, variability can be reduced by building ensembles of multiple RPs. For example, Guhaniyogi and Dunson (2015) propose repeatedly sampling RPs of different sizes and estimating an ensemble of linear regressions on the reduced predictors, while Cannings and Samworth (2017) generate various RPs for each classifier in an ensemble and pick the best one based on an appropriate loss function. Ensembles combining multiple RPs with variable screening steps have also been proposed and achieve better predictive performance in linear regression (Mukhopadhyay and Dunson, 2020; Parzer et al., 2025).
In a similar fashion, we extend the variable screening and RP procedure by building an ensemble of GLMs and averaging them to form the final model, adapting the algorithm in Parzer et al. (2025) to GLMs. An extensive simulation study reveals that this ensemble improves predictive performance, ranks predictors effectively and is computationally efficient, particularly with increasing predictor dimensions. It consistently performs well against state-of-the-art approaches such as penalized regression, random forests and support vector machines (SVMs) across a range of sparsity settings and yields the best overall performance when aggregated across all scenarios. This broad applicability makes the method versatile for high-dimensional regression with correlated predictors, especially when the sparsity of the underlying data-generating process is unknown, while remaining computationally able to handle datasets with small n and a (very) large number of predictors.
The integration of the GLM framework with probabilistic screening improves model interpretability, as the coefficients in the marginal models can be extracted and their reliability and relevance can be assessed. The GLM framework also offers modeling flexibility, facilitating seamless comparison across different family–link combinations.
The article is organized as follows: Section 2 presents the GLM model, data-informed RP, variable screening and the ensemble algorithm. Section 3 provides simulations comparing the method with state-of-the-art approaches. Applications are shown in Section 4 and Section 5 concludes.
Method
This section begins by presenting the GLM model class, followed by an introduction to the dimension reduction tools RP and variable screening. Then, we propose a novel coefficient estimator useful for both these concepts and state how this estimator can be used to extend the algorithm in Parzer et al. (2025) to GLMs. Throughout the section, we denote the response vector by y ∈ ℝn and the predictor matrix by X ∈ ℝn×p.
Generalized linear models
We assume to observe high-dimensional data (yi, xi), i = 1, …, n, where the distribution of the response yi belongs to the exponential dispersion family with density f(yi; θi, ϕ) = exp{(yiθi − b(θi))/a(ϕ) + c(yi, ϕ)},
where θi is the natural parameter, a(.) > 0 and c(.) are specific real-valued functions determining different families, ϕ is a dispersion parameter and b(.) is the log-partition function normalizing the density to integrate to one. If ϕ is known, we obtain densities in the natural exponential family for our responses. For example, the Poisson family is obtained with θi = log(μi), b(θi) = exp(θi), a(ϕ) = 1 and c(yi, ϕ) = −log(yi!). It can be shown that b(.) is twice differentiable and convex, with b′(θi) = E(yi | xi) and a(ϕ)b″(θi) = Var(yi | xi).
The responses are related to the p-dimensional predictors through the conditional mean, that is, the conditional mean of yi given xi is modeled as μi = E(yi | xi) = b′(θi) = g⁻¹(β0 + xi⊤β), where g(.) denotes the link function and β0 ∈ ℝ, β ∈ ℝp are the regression coefficients; for the canonical link, the natural parameter equals the linear predictor, θi = β0 + xi⊤β. The log-likelihood of the model is ℓ(β0, β, ϕ) = Σi [(yiθi − b(θi))/a(ϕ) + c(yi, ϕ)], but for maximization with respect to β, it suffices to use the scaled log-likelihood ℓ(β0, β) = Σi (yiθi − b(θi)) and treat ϕ as a nuisance parameter. In our general high-dimensional setting p > n, the predictor matrix X = (x1, …, xn)⊤ ∈ ℝn×p cannot have full column rank, so the maximum likelihood estimator is not unique and regularization or dimension reduction is required.
RP can be used as a dimension-reduction tool for high-dimensional regression by creating a random matrix Φ ∈ ℝm×p for a goal dimension m ≪ p and fitting the model on the projected predictors Φxi rather than on the original xi. We consider sparse projection matrices with exactly one non-zero entry ψj per column j, placed in a uniformly sampled row (cf. Definition 2.1). When using random sign diagonal elements ψj ∈ {−1, +1}, this construction yields a conventional sparse RP that is oblivious to the data, whereas choosing the ψj as coefficient estimates approximately proportional to the true β yields the data-informed RP we employ in the following.
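A minimal R sketch of this construction (the function name and interface are illustrative, not the authors' implementation): each predictor is assigned to one of the m rows; random signs give the conventional sparse RP, while ridge-type coefficient estimates give the data-informed variant.

```r
## Sparse RP matrix with one non-zero entry psi_j per column:
## random signs -> conventional sparse RP; ridge estimates -> data-informed RP.
sparse_rp <- function(p, m, psi = sample(c(-1, 1), p, replace = TRUE)) {
  rows <- sample.int(m, p, replace = TRUE)   # row assignment for each predictor
  Phi  <- matrix(0, nrow = m, ncol = p)
  Phi[cbind(rows, seq_len(p))] <- psi        # "diagonal" elements psi_j
  Phi
}

## conventional: default psi; data-informed: psi = ridge-type estimates
# z <- x %*% t(sparse_rp(ncol(x), m = 50, psi = beta_ridge))  # n x 50 reduced predictors
```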
Variable screening aims at selecting a small subset of variables based on some marginal utility measure and using the ones with the highest utility for further analysis. This procedure can complement the RP approach and further reduce the dimensionality of the problem by first screening for the important variables and then performing the RP. A seminal contribution in variable screening is the sure independence screening (SIS) of Fan and Lv (2008), who proposed to use the absolute marginal correlation of the predictors to the response in linear regression; this was later extended to GLMs in Fan et al. (2009) and Fan and Song (2010) by employing the maximum likelihood coefficient estimates of univariate GLMs instead of the correlation coefficient. However, in the presence of predictor correlation, screening based on a conditional utility measure (i.e., conditional on all the other variables in the model) is to be preferred to the unconditional one. To tackle this issue, Fan and Lv (2008) propose iterative SIS, which involves iteratively applying SIS and penalized regression to select a small set of variables, computing the residuals of a model fitted with the selected variables and using these residuals as a response variable to continue finding relevant variables. This procedure was later extended to more general model classes in Fan et al. (2009).
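For illustration, this classical marginal approach can be sketched in a few lines of R (function and argument names are ours, not from the cited works): each predictor is ranked by the absolute slope of its univariate GLM.

```r
## Illustrative SIS-type marginal screening for GLMs: fit one univariate
## GLM (intercept plus one predictor) per variable and keep the top ones.
marginal_screen <- function(x, y, family = binomial(), keep = 2 * nrow(x)) {
  scores <- apply(x, 2, function(xj) abs(coef(glm(y ~ xj, family = family))[2]))
  order(scores, decreasing = TRUE)[seq_len(min(keep, ncol(x)))]  # indices of top variables
}
```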
Similar to the linear regression case, the absolute value of the true coefficients in a GLM (assuming all predictors are on the same scale) can be employed as a measure of variable importance. Therefore, another approach is to find a screening coefficient that is capable of recovering the correct order of magnitude of the regression coefficients (but not necessarily their signs). We note that an estimator that performs well for the purpose of RP (as discussed above) would also be a good candidate as a screening coefficient. In the next section, we propose such an estimator for GLMs.
Proposed estimator for random projection and screening
In general, a ridge-type estimator for a GLM is defined as β̂(λ) = arg maxβ {ℓ(β) − λ‖β‖²} with the scaled log-likelihood ℓ(β) = Σi (yiθi − b(θi)) from above, where λ > 0 is a penalty parameter controlling the amount of shrinkage.
Motivated by (2.5) and Kobak et al. (2020), who showed that the optimal ridge-penalty in linear regression can be negative due to implicit regularization from high-dimensional predictors, we first investigate whether the limit of the ridge-type estimator as λ → 0+ retains the favorable properties of HOLP for screening and projection in GLMs. For GLMs with canonical link and no intercept, we show that this limit exists and admits a closed-form expression.
The proof can be found in the Supplementary Materials A. The intercept-free assumption can be avoided by appropriate centering of the response and the predictors.
As an alternative that can cover all cases, we propose to approximate the limiting estimator by computing the ridge estimator at a small positive penalty λmin.
Furthermore, in order to understand whether more penalization (i.e., through higher values of λ) is needed for other families as compared to the Gaussian case, we investigate alternative strategies for choosing the penalty value. In a simulation example in Section 3.3, we examine the choice of this λmin for different families. The results show that for non-Gaussian families, it is beneficial to use a higher λmin to avoid a saturated fit. We also show that the resulting estimator allows good recovery of the sign and magnitude of the true non-zero coefficients, while also performing well in terms of prediction when employed in the RP matrix.
Employing a single data-driven RP with the proposed estimator and then estimating a GLM on the reduced predictors can lead to high variability due to randomness. We address this by adapting the Sparse Projected Averaged Regression (SPAR) algorithm from Parzer et al. (2025) to GLMs, which builds an ensemble of GLMs in the following way: (a) randomly sampling predictors for inclusion in the RP based on the proposed screening coefficient, (b) projecting the sampled variables to a randomly chosen lower dimension using the proposed RP, (c) estimating penalized GLMs with the reduced predictors and (d) averaging them to form the final model. The adapted algorithm is given below, where * indicates changes compared to the linear regression formulation:
1.* Choose the family with corresponding log-likelihood ℓ(.) and link, and standardize the covariate inputs.
2.* Calculate the proposed screening coefficient based on the ridge-type estimator.
3. For k = 1, …, M:
3.1. Draw 2n predictors with probabilities proportional to the absolute values of the screening coefficients.
3.2. Project the remaining variables to a dimension mk, randomly drawn between log(p) and n/2, using the proposed data-informed RP.
3.3.* Fit a GLM of the response on the reduced predictors.
4. For a given threshold ν ≥ 0, set all entries of the marginal coefficient estimates with absolute value below ν to zero.
5. Combine the marginal models via a simple average on the link level.
6. Optionally, select M and ν via cross-validation by evaluating a two-dimensional grid of values. For each training fold, repeat Steps 2–6 using fixed index sets.
7. Output the estimated coefficients and predictions for the chosen M and ν.
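To make Steps 2–3 concrete, the following is a minimal, non-authoritative R sketch of the ensemble construction. The helper ridge_screen stands for the penalty-selection procedure discussed below, sparse_rp for the projection sketched in the previous section; these and all other names are illustrative, and the thresholding and averaging of Steps 4–5 are omitted for brevity.

```r
## Condensed sketch of Steps 2-3 of the adapted SPAR algorithm.
spar_glm_sketch <- function(x, y, family = poisson(), M = 20) {
  n <- nrow(x); p <- ncol(x)
  beta_scr <- ridge_screen(x, y, family)            # Step 2: screening coefficients (hypothetical helper)
  lapply(seq_len(M), function(k) {                  # Step 3: build M marginal models
    idx <- sample.int(p, size = min(2 * n, p),      # 3.1: probabilistic screening
                      prob = abs(beta_scr))
    m   <- sample(ceiling(log(p)):floor(n / 2), 1)  # 3.2: random goal dimension
    Phi <- sparse_rp(length(idx), m, psi = beta_scr[idx])   # data-informed RP
    z   <- x[, idx, drop = FALSE] %*% t(Phi)
    fit <- glm.fit(cbind(1, z), y, family = family) # 3.3: GLM on reduced predictors
    list(fit = fit, idx = idx, Phi = Phi)
  })
}
```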
Our goal is to select the smallest penalty that maintains good predictive performance in the context of screening and RP for GLMs. In our experiments, the estimator in Equation (2.6) (or its approximation) does not consistently achieve the best predictive performance across all family–link function combinations. The strategy for selecting the penalty parameter λmin is detailed in Section 3.3. Specifically, we recommend choosing the smallest λ for which the deviance ratio remains below a predefined threshold. Using a threshold on the deviance ratio has the advantage of being invariant to the data scale and link function.
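A minimal sketch of this rule using the ridge path from the glmnet package (the thresholds follow the recommendation above; the function itself is illustrative, not the authors' implementation):

```r
library(glmnet)

## Smallest ridge penalty whose deviance ratio stays below the threshold
## (0.8 for non-Gaussian families, 0.999 for Gaussian, as recommended above).
select_lambda_dev <- function(x, y, family = "poisson",
                              threshold = if (family == "gaussian") 0.999 else 0.8) {
  fit <- glmnet(x, y, family = family, alpha = 0)  # full ridge path
  ok  <- fit$dev.ratio <= threshold                # fits not exceeding the cut-off
  lambda_min <- min(fit$lambda[ok])                # smallest admissible penalty
  drop(as.matrix(coef(fit, s = lambda_min)))[-1]   # coefficients without intercept
}
```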
In Step 3.2 the dimension mk is random for the purpose of introducing more variability in the ensemble and reducing the reliance on a fixed (possibly arbitrarily chosen) goal dimension, in line with previous literature (e.g., Mukhopadhyay and Dunson, 2020).
In fitting the marginal models in Step 3.3, we also obtain intercept estimates, which can also be averaged and translated back to the original predictor scale to give an overall estimate of the intercept.
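For instance, with standardized predictors (training means mu and standard deviations sigma), the averaged standardized slopes beta_std and intercept alpha_std map back to the original scale as in this short sketch (the variable names are illustrative):

```r
## Back-transformation from standardized to original predictor scale.
beta_orig  <- beta_std / sigma                        # rescale slopes
alpha_orig <- alpha_std - sum(beta_std * mu / sigma)  # shift intercept accordingly
```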
This algorithm allows for several variations: (a) Different measures can be used in the cross-validation, such as mean squared error instead of the family-dependent deviance. (b) If more sparsity in the coefficients is desired, (M, ν) can be chosen by the one-standard-error rule, which yields the sparsest model whose cross-validated performance lies within one standard error of the best one.
Simulation study
In a first simulation study, we compare how well the estimator (2.4) recovers the true active β across different penalty choices and evaluate the predictive performance of the data-driven RP with these estimators. We then compare the SPAR algorithm's predictive performance and variable ranking ability against various benchmarks in a comprehensive simulation study.
Setup
We generate data from Equation (2.1) for five family–link combinations with n = 200 and additionally use ntest = 1 000 observations as a test sample. The p-dimensional predictors are simulated from a multivariate normal distribution with mean zero and covariance matrix Σ.
We consider p = 500, 2 000, 10 000 and three sparsity settings for β: sparse (a = [2 log(p)]), medium (a = [2 log(p) + n/2]) and dense (a = p/4), where a is the number of non-zero entries in β. These entries are set independently at random.
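A hedged sketch of one design cell (the AR(1) correlation level and the distribution of the non-zero coefficient entries are illustrative assumptions, as they are not fully specified here):

```r
library(MASS)

## Illustrative draw from one simulation setting.
n <- 200; p <- 500
Sigma <- 0.9^abs(outer(seq_len(p), seq_len(p), "-"))  # AR(1)-type covariance (level assumed)
x     <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
a     <- ceiling(2 * log(p))                          # sparse setting
beta  <- numeric(p)
beta[sample.int(p, a)] <- rnorm(a)                    # placeholder for the true entries
```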
Measures
Prediction performance is assessed on the independent test samples using the relative mean squared prediction error (rMSPE), that is, the mean squared error between the observed test responses and the predicted means, scaled by the corresponding error of the intercept-only null model, where values below one indicate an improvement over the null model; for the binomial family this corresponds to a scaled Brier score. To assess how well the linear predictor is recovered, we use the relative mean squared link error (rMSLE), where the mean squared error between the estimated and the true linear predictors on the test set is scaled analogously.
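Under the null-model scaling just described (stated here as an assumption about the "relative" prefix), the two measures can be sketched as:

```r
## Relative prediction and link errors, scaled by an intercept-only baseline.
rmspe <- function(y, mu_hat) {
  mean((y - mu_hat)^2) / mean((y - mean(y))^2)        # scaled by null-model error
}
rmsle <- function(eta_true, eta_hat, eta_null) {
  mean((eta_true - eta_hat)^2) / mean((eta_true - eta_null)^2)
}
```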
Furthermore, we evaluate variable ranking performance using the partial AUC (pAUC), where the true binary labels indicate whether a variable is truly active and the absolute estimated coefficients serve as ranking scores. To ensure fair comparison between sparse methods and dense methods, we limit the number of false positives to n/2 (see Wang et al., 2019), implicitly limiting the false positive rate to n/(2(p − a)) with p − a truly inactive variables. To obtain an interpretable measure in the interval (0, 1), we divide the resulting pAUC by n/(2(p − a)), so that a perfect method obtains a pAUC of 1.
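As an illustration, this rescaled pAUC can be computed with the pROC package (a sketch assuming pROC's specificity-based partial AUC; variable names are ours):

```r
library(pROC)

## Rescaled partial AUC for variable ranking: truth = binary activity
## indicator, score = absolute estimated coefficients.
pauc_rescaled <- function(truth, score, n) {
  fpr_max <- n / (2 * sum(truth == 0))                # cap on the false positive rate
  r <- roc(truth, score, quiet = TRUE)
  pauc <- auc(r, partial.auc = c(1, 1 - fpr_max),     # pROC restricts specificity
              partial.auc.focus = "specificity")
  as.numeric(pauc) / fpr_max                          # rescale to (0, 1)
}
```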
Simulations for screening and random projection
Recovery of true active β
We investigate whether the ridge estimator in (2.4) is appropriate for the purpose of screening and data-informed RP in the sense that it is able to recover the true active coefficients well by considering the correlation to the true active β.
In the following, we present results for the scenario n = 200, p = 2 000 with medium sparsity and block-diagonal Σ. We compare the following screening coefficients: (a) the screening coefficient in Fan and Song (2010), where each βj is estimated by the slope of a marginal GLM (marGLM) with an intercept and only one predictor; (b) the ridge estimator from (2.4) with λ chosen by cross-validation based on the deviance criterion (L2_cv); (c) the ridge estimator from (2.4) with λ converging to zero (L2_limit0), where in the binomial–logit case we use the exact limit; and (d) the ridge estimator with a data-driven approach to choosing λ as the smallest value for which the fraction of (null) deviance explained does not exceed 60% (L2_dev06), 80% (L2_dev08), 95% (L2_dev095) and 99.9% (L2_dev0999). Fixed deviance thresholds are used instead of fixed λ values, since the latter vary strongly across families. Table 2 in the Supplementary Materials shows the average λ resulting from the cross-validation and the deviance cut-offs; higher deviance cut-offs lead to smaller λ, with large variation in the actual values across families.
Figure 1 shows the distribution of the correlation coefficient of the true active coefficients to the different screening coefficients over 100 replications. We observe that L2_limit0 did not deliver the best results for all investigated family–links (for conciseness, we omit the results for binomial–cloglog as they are similar to binomial–logit). L2_cv also underperformed and the marGLM coefficients were the least effective at recovering the true coefficients. For the binomial family, differences among the ridge estimators were minor, but for the Poisson family, performance declined as the deviance ratio threshold increased or λ became too small. While the cross-validated penalty seems to be too large for the purpose of screening, using a 0.8–0.95 deviance ratio threshold for non-Gaussian families and 0.999 for Gaussian responses yields good results. The corresponding estimates are therefore well-suited as diagonal elements approximately proportional to the true β in our data-informed RP from Definition 2.1. In this comparison, we also computed the pAUC of these coefficients and the ratio of true active variables within the highest 3a absolute estimated values, but omit these results as they are similar to those based on correlation.
Figure 1: Correlation of true active coefficients to different screening estimators for 100 replications (n = 200, p = 2 000, medium sparsity and block-diagonal Σ).
Next, in the same setting as above, we investigate the predictive performance of a model where the predictors are first projected onto an m = n/4 = 50 dimensional space using the proposed data-informed RP with different screening coefficients, namely L2_cv, L2_limit0, L2_dev06, L2_dev08, L2_dev095 and L2_dev0999 from the previous section. Additionally, we show the oracle performance of our proposal using the true β in the RP (True_Beta). We also consider models where conventional RPs are used (i.e., Gaussian with iid standard normal entries and SparseCW, the matrix from Definition 2.1 with random sign diagonal elements). Furthermore, we estimate the models with adaptive LASSO (AdLASSO; Zou, 2006) and elastic net (α = 1/2) with penalty chosen by cross-validation based on deviance (ElNet) as performance benchmarks.
In Figure 2, we present the rMSLE and the prediction error (1 − AUC for binomial, rMSPE for Gaussian and Poisson) for ntest = 1 000 over 100 repetitions. The proposed data-informed projection with ridge estimates in the diagonal generally improves on the conventional RPs (Gaussian and SparseCW) with respect to both metrics, reaching a lower link estimation error and prediction error. The differences between the ridge estimators are less pronounced, but generally the performance of the estimators in terms of screening also translates to prediction power: L2_dev0999 and L2_limit0 deliver the best results for the Gaussian family with both identity and log link, while L2_dev08 and L2_dev095 achieve the best prediction performance for the other families. Furthermore, for this example, adapting the diagonal of the projection matrix suffices to outperform the high-dimensional regression benchmarks, adaptive LASSO and elastic net, although a noticeable gap to the oracle performance remains.
We note that we assessed the sensitivity of the results in Figure 2 regarding the choice of the goal dimension m = n/4 but observed that results are stable for values of m between log(p) and n/2 (see also Figure 9 in the Supplementary Materials).
Figure 2: Link estimation error (MSLE) and prediction error (1 − AUC for binomial, MSPE otherwise) for 100 replications (n = 200, p = 2 000, medium sparsity, block Σ). All projected methods use a single projection (without screening or ensembling).
We consider n = 200, p = 2 000 in sparse, medium and dense settings for the five family–link combinations and a block structure for the covariance matrix Σ.
We compare the following methods. As GLM-based methods, we use LASSO, elastic net (ElNet; α = 1/2), ridge, adaptive LASSO (AdLASSO; using package glmnet by Tay et al., 2023) and SIS (Fan and Song, 2010). The penalty in the GLM-based methods is chosen by cross-validation using deviance and by employing the one-standard-error rule. Then, as general regression benchmarks, we use random forest (RF) and support vector machine (SVM) models, each implemented in standard R packages.
Finally, as a set of methods using RPs, we use: (a) an ensemble of 50 models with the conventional RP from Definition 2.1 with random sign entries, without any screening and with random goal dimensions as in Step 3.2 of the SPAR algorithm in Section 2.4 (RP_CW_Ensemble); (b) Targeted Random Projections (TARP), an adaptation of Mukhopadhyay and Dunson (2020) to GLMs, where we perform screening based on marginal GLM coefficients and use the conventional RP of Achlioptas (2003) with ψ = 1/6 for an ensemble of M = 20 models; and (c) our proposed SPAR algorithm from Section 2.4 in two configurations: without cross-validation, using a fixed number of M = 20 models and selecting the thresholding parameter ν from a predefined grid of 20 values (starting at 0, followed by quantiles of the absolute values of all non-zero standardized coefficients computed at evenly spaced probability levels from 1/(20 − 1) to 1) based on the best model deviance on the training set; and with cross-validation over a grid of 20 ν values and M ∈ {10, 20, 30, 50} models. For both SPAR methods, the screening coefficient is L2_dev08 for the binomial and Poisson families and L2_dev0999 for the Gaussian family. Furthermore, link-level averaging is employed in the ensembles, as it has the advantage of preserving the interpretability of the coefficients.
Predictive performance
Figure 3 shows prediction and link estimation performance for 100 replications of n = 200, p = 2 000 in the block covariance setting. LASSO and SIS are excluded for brevity, as they were consistently outperformed by AdLASSO and elastic net, respectively. For the binomial family, we consider 1 − AUC as the prediction error; for all other families, the rMSPE. AUC is a measure of discrimination power rather than calibration (i.e., how well the probabilities are estimated); for measuring calibration, the rMSPE (scaled Brier score) can be employed. While the choice of the prediction measure is highly dependent on the application context, we note that the Brier score is more sensitive than the AUC to the type of averaging in the ensemble methods (link versus response-level averaging). Note that, even if the link estimation is excellent, probabilities in the tails of the distribution in particular can be over- or underestimated by averaging on the link level, given the non-linear link function. We therefore consider AUC as the prediction measure in the subsequent analysis but provide results with rMSPE and AUC for the binomial family in different settings, as well as results for SPAR using response-level averaging, in Figure 11 in the Supplementary Materials.
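The distinction between the two averaging schemes can be written compactly; a small sketch for the binomial–logit case, with eta denoting an illustrative n × M matrix of linear predictors from the M ensemble members:

```r
## Link- versus response-level averaging of an ensemble of logit models.
p_link     <- plogis(rowMeans(eta))   # average linear predictors, then transform
p_response <- rowMeans(plogis(eta))   # transform each model first, then average
```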
Figure 3: Comparison of prediction performance (1 − AUC for binomial family, rMSPE for all other families) and link estimation (rMSLE, not reported for SVM and RF) over 100 repetitions (n = 200, p = 2 000, medium sparsity and block Σ).
Generally, SPAR and SPAR-CV were among the best-performing methods in all settings except the sparse ones, where they were outperformed by AdLASSO and elastic net in terms of prediction and link estimation. In particular, SPAR seems to inherit strong prediction performance in dense cases from the L2-estimator used in the screening (and in the RP), while yielding stronger predictions than the dense methods in medium and sparse settings. The performance in link estimation for logistic regression is particularly remarkable.
Figure 4: Comparison of variable ranking (pAUC rescaled to [0, 1] for better comparison) over 100 repetitions for n = 200, p = 2 000, medium sparsity and block Σ.
Figure 10 in the Supplementary Materials provides further results for prediction error for the five family–link combinations across the different covariance settings in the medium sparsity setting with p = 2 000, as well as for increasing p in the block covariance and medium sparsity setting. Results for AR(1) and block covariance were similar in both prediction performance and method ranking. All methods performed best under compound symmetry and worst under independence, likely reflecting the information content in the covariance. SPAR and SPAR-CV are (among) the best methods for all covariance settings, except for the Poisson family in the compound symmetry and independence settings. Compared to the block or AR(1) setting, SVM performed worse under compound symmetry, except for the Gaussian identity case, while RF ranked higher in the compound symmetry and independence settings, particularly for the Gaussian and Poisson families. Performance declined with increasing dimensionality, with adaptive LASSO and elastic net most affected.
Figure 4 shows that SPAR and SPAR-CV both perform well in terms of variable ranking as measured by pAUC, where only SVM yielded a similar performance across all settings. For the ensemble methods, we here compute the pAUC of the final averaged coefficient estimates.
Figure 5: Mean ranks with 99% confidence intervals for prediction error (1 − AUC for binomial, rMSPE otherwise), link estimation (rMSLE), and variable ranking (pAUC) across all settings and nrep = 100. Methods not significantly worse than the best are shown in black.
To evaluate overall performance, we ranked methods from best (1) to worst (11) across all settings and replications, including those in the Supplementary Materials. Figure 5 shows the average ranks (with 99% confidence intervals) across all investigated settings for each of prediction (1 − AUC for binomial, rMSPE for all other families), link estimation (rMSLE) and variable ranking (pAUC). Aside from SVM in variable ranking, we find that SPAR-CV and SPAR achieved the best ranks on average over all simulation scenarios for all three measures. Using the Friedman and the post-hoc Nemenyi tests for multiple comparisons (Hollander et al., 2013), we can also report that SPAR and SPAR-CV were significantly better than all other methods for prediction and link estimation and, together with SVM, they were also significantly better for variable ranking. Even if SPAR and SPAR-CV are not best in every scenario, they perform well across the board, making them especially suitable when the degree of sparsity is unknown.
Computing time
Figure 6 shows computing time for three increasing values of p for the binomial family in the medium sparsity setting with block covariance. Most methods, except RF, SVM and SIS, inherit computational efficiency from the underlying glmnet implementation.
Figure 6: Comparison of average computing time for increasing p, n = 200, medium sparsity and block Σ, for the binomial family.
Table 1: Mean prediction metrics (with standard deviations) over 100 three-to-one train/test splits. The top three methods per dataset and metric are bolded. Note that these standard deviations do not represent formal variability estimates (e.g., confidence intervals) due to the inherent dependence across splits.
Applications
We demonstrate the proposed method on one high-dimensional dataset with a count response and two with binary responses. Results are detailed in the following sections and summarized in Table 1, which reports average prediction metrics over 100 random train/test splits (3:1 ratio). Note that while the stated standard deviations can serve as indicators of performance variation, they do not represent formal variability estimates (e.g., confidence intervals) due to the inherent dependence across splits. SPAR-CV uses deviance for cross-validation, while the other methods are tuned as described in Section 3.4.
Figure 7: Difference FTIR spectrum for one oil sample in the tribology dataset (top) and coefficients of p = 1 814 wavenumbers estimated by SPAR-CV in each of M = 50 marginal models (bottom). Marked intervals represent non-informative variables.
Tribology dataset
Fourier transform infrared (FTIR) spectroscopy is used in tribology to analyze changes in oil samples during use. The dataset introduced in Pfeiffer et al. (2022) comprises n = 34 automotive engine oil samples that were artificially degraded under controlled laboratory conditions. Two types of artificial alteration were applied, heating the oil and exposing it to dried air, to mimic real-world oil degradation. For each sample, the predictor variables consist of difference FTIR spectra, calculated as the absorbance difference between fresh and degraded oil, measured at p = 1 814 distinct wavenumbers. The target variable is the alteration duration in hours (as integers), which quantifies how long each sample was subjected to the degradation process.
Table 1 shows the relative MSPE for various methods. SPAR-CV was estimated using the Poisson, quasi-Poisson and Gaussian families with log links. We only present results for the Gaussian model in Table 1, as it had the best predictive performance (results for the other families are given in Table 3 in the Supplementary Materials). SPAR-CV performed best overall, followed by ridge regression.
Figure 7 (top panel) displays the difference spectrum for one sample, highlighting intervals with high or total absorption. These regions, typically non-informative due to hydrocarbon properties, are often pre-processed or discarded from analysis. The bottom panel shows the standardized coefficients estimated by SPAR-CV across M = 50 marginal models (y-axis) for all variables (x-axis), where the coefficients for each variable are sorted by their absolute values and displayed vertically as a colour gradient. The distribution of coefficients across the marginal models indicates which wavenumbers correlate positively or negatively with longer alteration durations. Even without pre-processing, non-informative variables rarely appear in the models and have low coefficients when they do, demonstrating the reliability of the SPAR method.
Darwin Alzheimer dataset
The dataset, introduced in Cilia et al. (2022), contains a binary response for Alzheimer's disease (AD) together with p = 450 extracted variables from 25 handwriting tests (18 features per task) for 89 AD patients and 85 healthy people (n = 174) and can be downloaded from the UC Irvine Machine Learning Repository. The dataset has been reported to contain outliers. We therefore first screened for multivariate outliers and imputed them using the detect deviating cells algorithm of Rousseeuw and Bossche (2018), implemented in the R package cellWise.
Table 1 presents the area under the ROC curve (AUC) and the relative MSPE (rMSPE; scaled Brier score) as prediction metrics for the binary classification task. For all GLM-based methods (i.e., except SVMs and RFs), results are shown for the binomial family with logit link function. We also investigated the use of the complementary log–log (cloglog) link for these methods, but the observed performance was generally slightly worse, and only Ridge and RP_CW_Ensemble achieved marginally better results with this link. Table 4 in the Supplementary Materials shows the results for both considered links.
SPAR and SPAR-CV performed similarly and were outperformed in AUC by both SVM and RF and in rMSPE by SVM alone. This suggests SPAR is a viable option for modelling this dataset, while offering low computational cost. SPAR-CV was, however, slower than RF. Table 6 in the Supplementary Materials shows the average computing time of all methods for all three data applications.
Figure 8 displays the estimated standardized coefficients for the p = 450 variables grouped by feature, across all marginal models. Feature blocks generally show either a positive or negative impact on the probability of AD across all 25 tasks (see online version for colour figures). For example, the probability of AD increases with the total time spent on a task (total_time). The number of pendowns (num_of_pendown, the number of times the pen hits the paper) is positively associated with the AD likelihood for the first few tasks, but negatively correlated for the remaining tasks, with a strong negative association observed for the task ’Copy the fields of a postal order’, which requires many pendowns when copying the individual fields. This may suggest that a lower number of pendowns in this task indicates failure to complete the task and thus a higher likelihood of AD.
Diffuse Large B-Cell Lymphoma (DLBCL)
This microarray dataset has been introduced in Shipp et al. (2002) and is available at the OpenML platform. It contains expression data for p = 5 469 genes of n = 77 patients diagnosed with two different types of lymphomas: Diffuse Large B-Cell Lymphoma (DLBCL) (58 cases) and follicular lymphoma (FL, 19 cases), making this a rather imbalanced dataset.
Again, Table 1 presents the AUC and rMSPE (scaled Brier score) as prediction metrics; for all GLM-based methods, results are shown for the binomial family with logit link. Due to the imbalance of the response, we again considered the cloglog link function for these methods, but only Ridge, LASSO and RP_CW_Ensemble achieved slightly better results with this link. Table 5 in the Supplementary Materials shows the results for both considered links.
Figure 8: Estimated coefficients for the p = 450 variables in the Darwin dataset, for each of the M = 50 marginal models in the SPAR-CV algorithm.
On this very high-dimensional dataset, SPAR and SPAR-CV perform very well, being slightly outperformed only by SVM in terms of rMSPE and by ridge in terms of AUC.
Conclusion
In this article, we propose a novel data-driven random projection method to be employed in high-dimensional GLMs, which efficiently reduces the dimensionality of the problem while preserving essential information between the response and the (possibly correlated) predictors. We achieve this by using ridge-type estimates of the regression coefficients, which should approximately recover the true regression coefficients, to construct the random projection matrix. These coefficients can also be employed for variable screening, which can be performed before random projection to further reduce the dimensionality of the problem.
A critical aspect of the proposed method is the selection of the penalty term for the ridge-type estimator. The penalty should generally be small to avoid over-regularization. However, determining the optimal size of the penalty has proven to be a non-trivial task. For linear regression, a ridge estimator with penalty converging to zero has shown good properties (Wang and Leng, 2016; Parzer et al., 2025). In this article, we derive the analytical formula for such an estimator in GLMs with canonical links and find that this estimator leads to lower predictive performance for non-Gaussian families, likely due to overfitting. More generally, there is no one-size-fits-all penalty value for all families. Instead, we advocate for a data-driven approach that decreases the penalty value as long as a goodness-of-fit criterion (e.g., the deviance ratio) stays below a certain threshold (e.g., 0.8 for non-Gaussian families and 0.999 for Gaussian).
Through extensive simulations, we show that integrating multiple probabilistic variable screening and projection steps into an ensemble of medium-sized GLMs can improve prediction accuracy and variable ranking, without too much computational cost. To implement this method, we adapt the SPAR algorithm from Parzer et al. (2025), ensuring that it is tailored to the specific requirements of high-dimensional GLMs; the method achieved good overall performance regarding ranks aggregated over all investigated scenarios, making it a valid choice when the true degree of sparsity is not known in practice. We note that the figures showing prediction and variable ranking metrics complement the aggregated ranks. While the metrics reflect performance across all replications and are influenced by data randomness, the ranks highlight consistent performance differences per replication, even if subtle. At the cost of higher computation time, which still scales well with p, the method can, to some degree, benefit from cross-validation, most notably in terms of ranking the variables based on their relevance. Even with the extensive simulation study trying to cover the most common settings, it is not certain whether the obtained results will always generalize to any other data-generating process encountered in practice. Finally, the proposed method achieved a strong prediction performance while retaining interpretability on three real datasets, supporting the simulation results.
A potential extension includes adapting the method to multivariate GLMs (e.g., multinomial) and multivariate responses (e.g., multivariate linear regression). A key extension in this direction would be designing a data-driven random projection that can preserve the multivariate structure in the data while also being straightforward and fast to compute. Additionally, ways of incorporating non-linearities in the random projection could be explored.
Acknowledgement
The authors acknowledge TU Wien Bibliothek for financial support through its Open Access Funding Programme.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship and/or publication of this article: Roman Parzer and Laura Vana-Gür acknowledge funding from the Austrian Science Fund (FWF) for the project ‘High-dimensional statistical learning: New methods to advance economic and sustainability policies’ (ZK 35).