Abstract
Simultaneously performing variable selection and inference in high-dimensional regression models is an open challenge in statistics and machine learning. The increasing availability of vast numbers of variables requires specific statistical procedures to accurately select the most important predictors in a high-dimensional space, while controlling the false discovery rate (FDR) associated with the variable selection procedure. In this paper, we propose the joint adoption of the Mirror Statistic approach to FDR control, coupled with outcome randomisation to maximise the statistical power of the variable selection procedure, measured through the true positive rate. Through extensive simulations, we show how our proposed strategy combines the benefits of the two techniques. The Mirror Statistic is a flexible method to control FDR, which only requires mild model assumptions, but needs two sets of independent regression coefficient estimates, usually obtained after splitting the original dataset. Outcome randomisation is an alternative to data splitting that generates two independent pseudo-outcomes, which can then be used to estimate the coefficients that go into the construction of the Mirror Statistic. The combination of these two approaches provides increased testing power in a number of scenarios, such as highly correlated covariates and high percentages of active variables. Moreover, it is scalable to very high-dimensional problems, since the algorithm has a low memory footprint and only requires a single run on the full dataset, as opposed to iterative alternatives such as multiple data splitting.
Introduction
Advances in data collection capabilities have allowed researchers to get access to thousands of features on multiple subjects in a relatively short time. Examples are next-generation sequencing technologies that allow fast DNA and RNA sequencing, and nuclear magnetic resonance spectroscopy used to identify metabolites. 1 Other examples are wearable personal devices that can continuously measure several anthropometric variables and clinical outcomes of interest, such as blood pressure and blood glucose levels, thus producing a vast collection of measurements. 2 Moreover, it is now common to integrate these multiple input sources into a single study, and to repeat these measurements over multiple time points, in a longitudinal fashion, in order to capture potentially relevant trends in the outcomes of interest under specific interventions or treatments.
The combination of all these factors generates an incredibly vast and complex set of features that need to be jointly analysed.3,4 In many such studies, feature selection becomes an important task, a typical example being biomarker discovery. This can be a daunting task, for at least two reasons: (a) the available sample size, for example, the number of patients recruited in a cohort study, is very often much lower than the number of available features (high-dimensional problem); standard statistical or machine learning models therefore struggle to extract the underlying fundamental associations, due to model limitations. (b) Even when a model can deal with a sample size lower than the number of features, the high probability of selecting false positive effects poses a serious threat to the validity of the analysis.
We address the first problem of variable selection by using all available features during the training process and building a statistical model that can automatically select the most relevant variables. This allows us to retain the full interpretability of the effects of the covariates on the outcome of interest, in contrast with dimensionality reduction techniques such as principal component analysis. 5
In high-dimensional regression (or classification) problems, popular choices are the least absolute shrinkage and selection operator (LASSO), ElasticNet, least-angle regression (LARS), and smoothly clipped absolute deviation (SCAD), which provide efficient algorithms that scale well to a large number of features.6–9 Alternatives also exist in the Bayesian framework, where variable selection is performed through an appropriate choice of the prior distributions. 10 Popular choices are the two-group Spike and Slab prior distribution, which explicitly provides posterior probabilities of inclusion for each variable, and one-group priors, such as the Laplace prior distribution, which shrink coefficients toward zero, acting as a regularisation in a similar fashion to the LASSO penalty.
In addition to selecting only a subset of variables, in many applications of real data, it is also essential to be able to make proper inferences on the selected subset of features. This is essential to have a reliable estimate of the regression coefficient confidence intervals and to be able to control some form of error rate, such as the false discovery rate (FDR). 11 However, simply proceeding with inference after a data-dependent variable selection step does not yield valid post-selection inference, as explained in Berk et al. 12 This is because data-driven variable selection procedures, such as the ones mentioned above, generate a model that is not deterministic, and the straightforward application of classic approaches to inference, such as ordinary least squares (OLS), does not account for this additional randomness. Therefore, the estimated coefficients and the corresponding confidence intervals will be biased, leading to a potential increase in erroneous classifications.
A number of methods exist to control error rates in multiple testing. Meinshausen and Bühlmann 13 propose stability selection as a method to control the number of false discoveries in a LASSO regression setting. Lee et al. 14 take a different approach, providing an analytical solution for the linear regression problem when using a LASSO penalty, formally accounting for the variable selection process at the inferential step, thus obtaining correct confidence intervals for the selected coefficients. Rügamer et al. 15 provide a partial extension to the case of additive and linear mixed models using Monte Carlo approximations; however, the proposed solution is computationally intensive. Overall, this formal treatment is limited to simple cases, such as linear regression, and extensions are difficult.
Barber and Candès, 16 in their pioneering work, propose the knockoff filter procedure as a direct way to control the FDR. The knockoff method works by augmenting the space of covariates, adding a perturbed version of each feature to the design matrix and performing variable selection on the new feature space. The authors provide an upper-bound estimate of the FDR and a new knockoff test statistic through which it is possible to control the FDR at any specified level. This approach, however, has some limitations: knowledge of the joint distribution of the covariates is required to construct the knockoff filters, and the original procedure is limited to the case where the sample size is larger than the number of covariates.
Despite the limitations of the knockoff, the idea of feature perturbation has been successfully adopted in other works as a way to control FDR. Xing et al. 18 develop the Gaussian Mirrors procedure, which controls FDR without requiring any distributional assumption on the covariates. However, this approach is inefficient because it only evaluates one variable at a time. Using a similar approach, Dai et al. 19 develop the Mirror Statistic, substituting the feature perturbation step with a two-step procedure based on data splitting (DS). The first step is variable selection on a subset of the data and the second step is statistical inference on a second subset, independent of the first. Since the two sets are independent by construction, the conditional distribution of the inference set given the output of the selection step is the same as the unconditional one, so classical procedures can be used to provide valid inference for the selected parameters. 20 One drawback of this approach is the loss of power caused by the reduced sample size available for the inference step (and similarly for the variable selection step). To mitigate this problem, the authors propose a variation called multiple DS (MDS), more akin to stability selection, showing increased power in simulations. 19 The authors show that this approach outperforms the knockoff filter in many situations. Nevertheless, MDS is computationally much more costly than DS, since the same procedure has to be repeated multiple times (at least 50 according to the authors).
In this paper, we propose to use the Mirror Statistic, but, instead of creating two independent sub-samples using DS, we borrow the idea of outcome randomisation from Rasines and Young, 20 who propose a simple mechanism to create two independent sets of data by adding random noise to the outcome, splitting the available information more efficiently into two new independent pseudo-outcomes. The result is an increase in statistical power and a more computationally efficient algorithm. We provide a performance comparison via numerical experiments, replicating the results of Dai et al. 19 and extending the simulation scenarios to more challenging settings.
Throughout the article, we use the following notation: RandMS to indicate our proposed model with outcome Randomisation plus Mirror Statistic, DS for single Data Splitting with Mirror Statistic, and MDS for Multiple Data Splitting with Mirror Statistic.
The remainder of the article is organised as follows. In Section 2, we review the methodology underlying FDR control via the Mirror Statistic and DS. We then introduce Randomisation and provide the algorithm for our proposed strategy. In Section 3, we show in detail the results of our simulations and the computational performance of the algorithm. In Section 4, we apply the proposed method to the selection of genes in a high-dimensional real-world study. Finally, in Section 5, we summarise our contributions, limitations, and potential extensions of the method.
Methods
FDR control
The FDR has been introduced as a less conservative approach to false positive error control compared to the family-wise error rate, which can be too restrictive when testing a large number of hypotheses. 11
FDR is defined as the expectation of the false discovery proportion (FDP):
$$\mathrm{FDP} = \frac{\#\{j \in \hat{S} : \beta_j = 0\}}{\max(|\hat{S}|, 1)}, \qquad \mathrm{FDR} = \mathbb{E}[\mathrm{FDP}],$$
where $\hat{S}$ is the set of selected variables and $\beta_j = 0$ identifies the null features.
Benjamini and Hochberg 11 provide a step-up correction procedure for the p-values of multiple tests that guarantees control of the FDR at a desired level.
For completeness, we also specify the formula for the true positive rate (TPR), defined as
$$\mathrm{TPR} = \frac{\#\{j \in \hat{S} : \beta_j \neq 0\}}{\#\{j : \beta_j \neq 0\}},$$
that is, the proportion of truly active variables that are selected.
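To make the two metrics concrete, they can be computed from a selected set and the true support as follows. This is a minimal Python sketch for illustration only; the analyses in the paper are implemented in Julia.

```python
def fdp(selected, true_support):
    """False discovery proportion: fraction of selected features that are null."""
    selected = set(selected)
    false_discoveries = len(selected - set(true_support))
    return false_discoveries / max(len(selected), 1)

def tpr(selected, true_support):
    """True positive rate: fraction of truly active features that are selected."""
    true_support = set(true_support)
    if not true_support:
        return 0.0
    return len(set(selected) & true_support) / len(true_support)

# Toy example: features 0-3 are truly active.
true_support = [0, 1, 2, 3]
selected = [0, 1, 2, 7]             # three true positives, one false discovery
print(fdp(selected, true_support))  # 0.25
print(tpr(selected, true_support))  # 0.75
```

The `max(..., 1)` in the denominator mirrors the convention in the FDP definition above, so that an empty selection has FDP equal to zero.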
The practice of splitting a given dataset into multiple smaller independent slices is common and standard in most machine learning applications, 22 where, generally, training, validation and testing sets are generated from the original sample. While in machine learning splitting is done to validate the prediction accuracy of a model, in classic statistical inference the same idea can be used to create valid inferential procedures. Dai et al. 19 use this strategy in order to control FDR via the Mirror Statistic, defined for variable $j$ as
$$M_j = \mathrm{sign}\big(\hat{\beta}_j^{(1)} \hat{\beta}_j^{(2)}\big)\big(|\hat{\beta}_j^{(1)}| + |\hat{\beta}_j^{(2)}|\big),$$
where $\hat{\beta}^{(1)}$ and $\hat{\beta}^{(2)}$ are two independent estimates of the regression coefficients obtained from the two halves of the data.
Under the assumption that, for a null feature $j$, the distribution of $M_j$ is symmetric around zero, the FDP associated with a cutoff $t > 0$ can be estimated by $\widehat{\mathrm{FDP}}(t) = \#\{j : M_j < -t\} / \max(\#\{j : M_j > t\}, 1)$, and the data-driven threshold is chosen as the smallest $t$ for which this estimate does not exceed the target FDR level $q$.
As we can see from the definition, to use the Mirror Statistic estimator we need two independent sets of observations. Although DS is universally valid and straightforward to use, it comes at the cost of a much-reduced sample size, which can have detrimental effects on the power of the statistical test and on the stability of the variable selection. To counteract this downside, Dai et al. 19 propose MDS as a way to increase the power of the test statistic. MDS amounts to repeating the whole variable selection procedure with simple DS multiple times and then aggregating the results. In simulations, MDS seems to provide higher power; however, this improvement comes at the cost of a much higher computational burden and additional uncertainty due to the choice of the aggregation strategy.
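As an illustration of how the Mirror Statistic is used in practice, the following Python sketch computes the additive mirror statistic from two independent coefficient estimates and searches for the data-driven cutoff. The additive form and the FDP estimator follow the construction described above; the toy coefficient estimates are arbitrary.

```python
import numpy as np

def mirror_statistics(beta1, beta2):
    """Additive mirror statistic from two independent coefficient estimates."""
    return np.sign(beta1 * beta2) * (np.abs(beta1) + np.abs(beta2))

def mirror_threshold(m, q=0.1):
    """Smallest cutoff t with estimated FDP #{M_j < -t} / max(#{M_j > t}, 1) <= q."""
    for t in np.sort(np.abs(m)):
        fdp_hat = np.sum(m < -t) / max(np.sum(m > t), 1)
        if fdp_hat <= q:
            return t
    return np.inf

# Toy example: strong positive agreement between the two estimates for the
# first three features, small noise with random signs for the remaining ones.
rng = np.random.default_rng(0)
b1 = np.concatenate([np.array([3.0, 2.5, 2.0]), rng.normal(0, 0.1, 20)])
b2 = np.concatenate([np.array([2.8, 2.6, 2.2]), rng.normal(0, 0.1, 20)])
m = mirror_statistics(b1, b2)
t = mirror_threshold(m, q=0.1)
selected = np.where(m > t)[0]
```

A truly active feature tends to be estimated with the same sign and a large magnitude in both halves, yielding a large positive $M_j$, while null features fall symmetrically on both sides of zero, which is what makes the one-sided FDP estimate valid.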
This is where randomisation offers an alternative way of distributing the available sample information and helps avoid the randomness of the simple DS, creating two pseudo-independent sets, by perturbing the outcome with some random noise
Here we consider the case where the outcome of interest is normally distributed, with mean given by a linear function of the features: $y \sim N(X\beta, \sigma^2 I_n)$.
Given a randomisation scale $\gamma > 0$ and independent noise $w \sim N(0, \sigma^2 I_n)$, two pseudo-outcomes can be constructed, for example, as $y^{(1)} = y + \gamma w$ and $y^{(2)} = y - w/\gamma$. Then, building on the properties of the multivariate normal distribution, $y^{(1)}$ and $y^{(2)}$ are independent, since they are jointly Gaussian and their covariance is $\sigma^2 I_n - \sigma^2 I_n = 0$. The first pseudo-outcome is used for variable selection and the second for inference.
Randomisation can be interpreted as averaging information over all possible data splits of the same size. Rasines and Young 20 show that for a normally distributed outcome the two randomised pseudo-outcomes are exactly independent, so that selection based on the first and inference based on the second are valid without sacrificing sample size.
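This decomposition can be checked numerically. The sketch below uses the parametrisation $y^{(1)} = y + \gamma w$, $y^{(2)} = y - w/\gamma$ with the noise variance matched to the outcome variance, which is one standard form of the Gaussian randomisation construction; the constants are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma, gamma = 100_000, 1.5, 1.0

# The mean is zero here for simplicity; any mean X @ beta cancels in the covariance.
y = rng.normal(0.0, sigma, size=n)
w = rng.normal(0.0, sigma, size=n)  # randomisation noise with the SAME variance sigma^2

y1 = y + gamma * w   # pseudo-outcome used for variable selection
y2 = y - w / gamma   # pseudo-outcome used for inference

# The two pseudo-outcomes are uncorrelated, hence independent (jointly Gaussian).
corr = float(np.corrcoef(y1, y2)[0, 1])
print(corr)
```

With $n = 10^5$ draws the empirical correlation is close to zero up to Monte Carlo error, in agreement with the exact independence stated above.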
The complete inferential procedure is detailed in the following algorithm:
If the residual variance $\sigma^2$ is unknown, it must be replaced by an estimate obtained from the data. This is the strategy that we adopt in the numerical experiments in Section 3, as well as in the real-world data application in Section 4. We further explore the implications of an incorrectly estimated variance in the Appendix.
The theory underlying the development of the Mirror Statistic is based on the independence of the two sets of coefficient estimates and the following symmetry and weak dependence assumptions: 19 (a) Symmetry: for each null feature $j$, the sampling distribution of the mirror statistic $M_j$ is symmetric about zero. (b) Weak dependence: the mirror statistics of the null features are only weakly dependent, so that the number of null statistics exceeding any threshold concentrates around its expectation.
A requirement for the symmetry condition is that all variables with a non-null coefficient are selected in the variable selection step. This is generally referred to as the sure screening property. In our simulations and real-world application, we use the combination of LASSO and OLS to perform variable selection and inference, respectively, in linear models. For the LASSO, the sure screening property depends on the signal strength condition, which requires the magnitude of the smallest non-zero coefficient to be sufficiently large, typically of order $\sqrt{\log p / n}$.
Simulations
To demonstrate the effectiveness of our proposed approach, we perform several simulations in multiple scenarios, comparing our method against DS and MDS.
We first repeat the simulations done in Dai et al., 19 to check whether we can achieve a similar performance to what was reported in the original paper. We then perform additional simulations under different scenarios, with the purpose of finding the limit of the approach, that is, when the performance deteriorates too much, and to check in which situations our method can perform better than the alternatives.
The common strategy for all simulations is to perform variable selection with LASSO and inference with a standard linear model. For all simulations, we set
The metrics that we track are the FDR and the TPR, or power. An ideal selection procedure will have a TPR close to 1, while keeping the FDR below the chosen target level.
Replication of the simulations from the original paper with DS and MDS
The data for this scenario is generated using the following parameters: sample size, number of covariates, number of non-zero coefficients, covariate correlation, regression coefficient signal strength, error variance, randomisation variance factor, and the predictor in equation (4).
Throughout the paper, the correlation coefficient refers to the pairwise correlation between the covariates.
The covariates are generated as independent random draws from a multivariate normal distribution with mean zero.
The regression coefficients are randomly generated from a normal distribution with mean zero and a standard deviation controlling the signal strength.
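The data-generating process described above can be sketched as follows. The specific values of the sample size, dimension, correlation and signal strength below are illustrative placeholders rather than the exact settings of the original study, and a constant pairwise correlation between covariates is assumed for simplicity.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative placeholder values (not the exact settings of the study):
n, p, p_active, rho, sigma = 200, 500, 10, 0.4, 1.0

# Covariates: multivariate normal, mean zero, constant pairwise correlation rho.
cov = np.full((p, p), rho) + (1.0 - rho) * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), cov, size=n)

# Active coefficients drawn from N(0, tau^2); all remaining coefficients are zero.
tau = 2.0
beta = np.zeros(p)
active = rng.choice(p, size=p_active, replace=False)
beta[active] = rng.normal(0.0, tau, size=p_active)

# Outcome: linear model with Gaussian noise.
y = X @ beta + rng.normal(0.0, sigma, size=n)
```

Varying `rho`, `p_active` and `tau` reproduces the axes explored in the simulations: correlation strength, percentage of active variables and signal strength.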
Replicating the strategy used by Dai et al., 19 we first compare the performance of the three algorithms fixing the signal strength and varying the correlation between the covariates, and then varying the signal strength itself. Fifty independent replications are run for each combination. Here we show boxplots of the results.
In Figures 1 and 2, top plots, we can observe some common patterns in the FDR estimates.
Figure 1. Paper replication study.

Figure 2. Paper replication study – signal strength.
In the bottom plots, we show the TPR estimates.
We now concentrate on testing our method on new scenarios that were not covered in the original paper. We explore the performance in contexts with a higher correlation, near ill-conditioned covariance matrices, higher proportions of non-zero regression coefficients, non-block diagonal covariance matrices and regression coefficients sampled from a fixed known pool of values.
Regression coefficients from a fixed pool
The first additional scenario that we test is the situation where the regression coefficients are drawn from a fixed pool of known values rather than from a normal distribution.
Selecting the coefficients from a known set of values allows for better control and understanding of the variable selection capabilities of the algorithms, since random draws from a normal distribution will naturally concentrate around the zero mean, making it harder to assess which coefficients can actually be considered non-zero. This set of regression coefficients is used in all of the following simulations.
Keeping all the other simulation settings as before, we run the same simulations. From Figure 3 (top), we see that the median values and variability of FDR are very similar to the ones obtained before, validating the ability of the Mirror Statistic to control the FDR. On the other hand, the results for the TPR in Figure 3 (bottom) suggest that variable selection with fixed coefficients is somewhat easier, as expected, which is reflected in higher median TPR values, concentrated near 1.
Figure 3. Regression coefficients from a fixed pool. Top: False discovery rate. Bottom: True positive rate.
All three methods have a comparable performance, both in terms of FDR control and TPR, with the only exception of MDS, which has, on average, a lower TPR than DS and RandMS. This behaviour could be explained by the fact that MDS has a very conservative control over FDR, thus also ending up with a lower TPR.
A natural question is whether DS, MDS and RandMS can cope with a higher percentage of active variables, which are potentially highly correlated. To this end, we increase the complexity of the simulated data by increasing the percentage of active variables and allowing very high degrees of correlation across the covariates.
We start by increasing the percentage of active variables to
Figure 4. Active coefficients. Top: False discovery rate. Bottom: True positive rate.
By increasing the percentage of active variables to
Figure 5. Active coefficients. Top: False discovery rate. Bottom: True positive rate.
In Figure 5 (bottom), we see that RandMS outperforms the competitors in terms of TPR, in particular for correlations up to, and including, 0.5. On the other hand, DS is unable to retain enough power.
Finally, setting the percentage of active variables to 30%, the difference is even more striking, again in favour of RandMS. Figure 6 (top) and (bottom) shows the results for FDR and TPR, respectively.
Figure 6. Active coefficients 30%. Top: False discovery rate. Bottom: True positive rate.
The next simulation is performed with covariates generated from a normal distribution whose inverse covariance matrix is near ill-conditioned, meaning that its lowest eigenvalues are close to zero. The covariance matrix used to generate the covariates is obtained by inverting this near ill-conditioned precision matrix.
Figure 7. Ill-conditioned covariance – active coefficients 10%. Top: False discovery rate. Bottom: True positive rate.
Figure 8. Ill-conditioned covariance – active coefficients 30%. Top: False discovery rate. Bottom: True positive rate.
In terms of TPR, the advantage of RandMS is clearer. Already with 10% active coefficients, Figure 7 (bottom), RandMS does a better job than DS and MDS, and, increasing that percentage to 30%, Figure 8 (bottom), RandMS clearly outperforms the competitors.
Here we test the performance of RandMS with a higher number of covariates,
In Figure 9, we report the performance for
Figure 9. Block covariance matrix.
We run a benchmark simulation of the computational requirements for the proposed Randomisation plus Mirror Statistic versus MDS (run with 50 splits). The benchmark, as well as all other analyses, has been done in Julia. 23 In Figure 10, we show the average time (in seconds) and the memory requirements (in gigabytes) needed to run the algorithms on a linear regression with an increasing number of variables.
Figure 10. Computational benchmark of randomisation with mirror statistic versus multiple data splitting. Left: CPU time in seconds. Right: Memory requirement in gigabytes.
Elevated serum triglyceride (TG) levels in the blood are strongly associated with an increased risk of cardiovascular diseases (CVDs). Serum TG levels can be reduced by a healthy diet, but there are also large inter-individual variations in fasting TG levels. An improved understanding of this variation would be beneficial for CVD prevention. In this example, we are interested in relating fasting TG levels to gene expression through a linear regression model. We have data from the screening visit of a randomised controlled dietary intervention trial, presented in detail in Ulven et al. 24 In this trial, gene expression was measured in peripheral blood mononuclear cells (PBMC). These are immune system cells and, because they are circulating cells, they are exposed to nutrients, metabolites and peripheral tissues, and may therefore reflect whole-body health. We include all individuals for whom we have both PBMC gene expression and fasting TG levels, in total 251 individuals. The outcome is the log measurement of blood TGs, while we use the expression measurements of 13,967 genes as covariates, so that the number of covariates vastly exceeds the sample size.
The log-transformed TG outcome is well approximated by a normal distribution, while the gene expression data have been already preprocessed and are also well approximated by a normal distribution.
We proceed to analyse the data with our proposed method, that is, outcome Randomisation plus Mirror Statistic, in order to identify which genes could potentially contribute to the differences in TG levels. We use LASSO for variable selection on the first randomised pseudo-outcome and OLS on the second for inference.
If we look only at the variable selection part of Algorithm 1 (i.e. LASSO), the number of selected genes is 30, while only two of those (plus the intercept) are selected by the full RandMS algorithm. Both selected genes, one of which is MYLIP, can be linked to atherosclerosis and thus seem biologically plausible.
Discussion
We propose the adoption of outcome randomisation instead of DS, in combination with the Mirror Statistic, in order to effectively control the FDR in high-dimensional linear regressions. Intuitively, randomisation acts as information averaging and helps avoid the pitfalls of DS. When combined with the Mirror Statistic, it correctly controls the FDR at the target level, while providing higher power and a more computationally efficient algorithm.
Our extensive simulations show superior performance compared to DS strategies, in various scenarios of increasing complexity. Even in very high-dimensional cases, we can retain the good scalability of the proposed method.
Finally, we perform a real data analysis, where the outcome of interest is blood TG levels, and the covariates are gene expression data. The dimension of the covariates space, compared to the sample size, makes this problem a perfect example of high-dimensional linear regression. We use our method to perform variable selection and inference, with a target FDR of 10%. We are able to identify two genes, potentially responsible for the variation of TG levels.
The proposed method is currently limited to normally distributed outcomes, for which randomisation takes a closed-form analytical solution and the symmetry requirement of the regression coefficients needed to apply the Mirror Statistic is satisfied. It is therefore possible to use the method for inference on linear regression models and Gaussian graphical models, for example. It would be interesting to explore extensions to outcomes following arbitrary distributions and to high-dimensional mixed models. Leiner et al. 25 provide an extension of randomisation to distributions belonging to the exponential family, relying on the concept of conjugate distributions from Bayesian statistics. Their result could potentially be used, for example, for a high-dimensional logistic regression model. Dai et al. 26 extend the use of the Mirror Statistic to high-dimensional logistic regression; however, their method relies on the computationally expensive procedure of de-biasing the LASSO estimate, which is needed to satisfy the symmetry requirement of the Mirror Statistic. Future efforts to combine these two extensions could be of practical interest. Another area of improvement could be the adoption of different variable selection techniques in Algorithm 1, for example, substituting LASSO with a model that can better handle highly correlated covariates, such as ElasticNet. 7
Acknowledgments
The authors wish to thank Professor Stine Ulven, Department of Nutrition, University of Oslo, for providing the TG dataset.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interest
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Code availability
Appendix

We explore the implications of an incorrectly estimated variance by tracking the average false discovery rate (FDR), power and proportion of selected variables against different values of the assumed residual variance. When the variance is overestimated, the algorithm becomes more conservative and the FDR is controlled. The number of variables selected stays roughly constant over the selected range of values.

In order to better understand the impact of violating the assumption of symmetry, we run a controlled simulation where we monitor the individual steps of the randomisation plus Mirror Statistic algorithm. We generate data from a linear regression model and track the proportion of true non-null variables selected by the LASSO, together with the distribution of the mirror statistics (associated with the true null coefficients) lying above and below the optimal threshold calculated for each repeated sampling.
