Sage Journals: Discover world-class research

Abstract

Bayesian structured additive quantile regression is an established tool for regressing outcomes with unknown distributions on a set of explanatory variables and/or when interest lies with effects on the more extreme values of the outcome. Even though variable selection for quantile regression exists, its scope is limited. We propose the use of the Normal Beta Prime Spike and Slab (NBPSS) prior in Bayesian quantile regression to aid the researcher in not only variable but also effect selection. We compare the Bayesian NBPSS approach to statistical boosting for quantile regression, a current standard in automated variable selection in quantile regression, in a simulation study with varying degrees of model complexity and illustrate both methods on an example of childhood malnutrition in Nigeria. The NBPSS prior shows good performance in variable and effect selection as well as prediction compared to boosting and can thus be recommended as an additional tool for quantile regression model building.

Keywords

Bayesian statistics, effect selection, NBPSS, quantile regression, variable selection

1 Introduction

In standard regression methods, the estimation procedure usually focuses on the location parameter (very often the mean) of an outcome of interest conditioned on a set of explanatory covariates. If the outcome is not characterized sufficiently well by the global conditional location parameter, plenty of methods exist to also explain other global parameters of the outcome by covariates, such as scale parameters (for example, the variance) or shape parameters (for example, the skewness), by means of distributional regression, also known as generalized additive models for location, scale and shape (GAMLSS, Rigby and Stasinopoulos, 2005).

While this works perfectly well for outcomes following a known distribution, the task becomes challenging when the distributional form of the outcome violates model assumptions, or when no global effect is to be expected for a specific parameter but a local effect clearly exists, for example, for specific quantiles of the outcome, and the task is to quantify that effect, too. Another drawback of distributional regression is the non-intuitive interpretation of the resulting coefficients, which are, with the exception of the identity link, subject to a distribution-specific link function. In these cases, it helps to consider model types that look beyond the estimation of specific distributional parameters (Kneib, 2013; Kneib, Silbersdorff, and Säfken, 2021). When the form of the outcome is impossible to describe by common modelling distributions, when the focus is on quantifying the effects on outliers of the outcome, and/or when straightforward communication of the results is necessary, a suitable regression type is quantile regression, first introduced by Koenker and Bassett Jr (1978).

Quantile regression works without parametric distributional assumptions on the outcome and yields potentially non-linear effects for a specific quantile of the outcome conditional on a set of covariates by weighting the outcome accordingly and employing linear programming (Koenker, Ng, and Portnoy, 1994). More flexible models are possible when using other estimation approaches such as Bayesian structured additive quantile regression (Waldmann, Kneib, Yue, Lang, and Flexeder, 2013) or statistical boosting for additive quantile regression (Fenske, Kneib, and Hothorn, 2011). Another challenge for correct modelling is the selection of informative variables. Many different approaches exist for mean regression and have been adapted to suit quantile regression. The LASSO features prominently as a methodological vehicle (Tibshirani, 1996) in variable selection for quantile regression: Koenker et al. (1994), Koenker (2005) and Belloni and Chernozhukov (2011) propose penalized splines in particular L1-penalized splines. Both Wu and Liu (2009) and Lv, Zhang, Zhao, and Liu (2015) use adaptive LASSO; the former in combination with a smoothly clipped absolute deviation (SCAD) function. Zhao and Lian (2016) used group SCAD instead for variable selection, while Ying Sun and Fuentes (2016) applied fused adaptive LASSO in spatial-temporal quantile regression. Sherwood and Maidman (2022) discuss variable selection via SCAD and LASSO in high-dimensional data.

Other methods for variable selection in quantile regression use, for example, a combination of the L0- and L1-norm (Dai, 2023), employ copulas (Fu et al., 2023) or use an information criterion (Eun Ryung Lee and Park, 2014). Jiang, Bondell, and Wang (2014) proposed a fused penalty to enable variable selection with simultaneous estimation of quantiles, and Bar, Booth, and Wells (2023) applied expectation maximization (EM) with weighted least squares to achieve variable selection.

While the previous publications used linear programming in one form or another in their algorithms, there exist also machine learning approaches to variable selection in quantile regression. Meinshausen (2006) established quantile regression forests, Fenske et al. (2011) expanded additive quantile regression to gradient boosting and both He, Qin, Wang, Wang, and Wang (2019) and Liu, Chen, Liu, Qin, and Fars (2023) suggested LASSO-penalized quantile regression neural networks.

In Bayesian quantile regression, the Bayesian equivalent of LASSO (Park and Casella, 2008) is quite popularly employed for variable selection (Waldmann et al., 2013; Alhamzawi, 2015; Shiyi Tu and Sun, 2017; Benoit and Van den Poel, 2017). Other possibilities include indirect variable selection by using the Savage-Dickey density ratio on the Bayes factors of proposal models (Oh, Choi, and Park, 2016), the proposal of a prior for model selection (Alhamzawi and Yu, 2013) and the usage of spike-and-slab priors in the form of stochastic search variable selection (SSVS) on the model coefficients (Alhamzawi and Yu, 2012; Chen, Dunson, Reed, and Yu, 2013) or on indicator variables (Kedia, Kundu, and Das, 2023).

SSVS spike-and-slab priors for variable selection were first discussed by Mitchell and Beauchamp (1988) and George and McCulloch (1993). They apply a spike-and-slab prior directly on the regression coefficients. When modelling a non-linear effect, the respective function can be represented as a basis expansion of the variable of interest with a set of corresponding coefficients, and in order to decide on the inclusion of this function in the model, the entire set of coefficients needs to be considered. Ishwaran and Rao (2005) enabled this by applying a spike-and-slab prior to the variance of the coefficients instead of the scalar coefficients themselves. Many variations of this approach exist (see, for example, Panagiotelis and Smith (2008), Zhu, Vannucci, and Cox (2010) or O’Hara and Sillanpää (2009) for a review), but deviations from Gaussian models and/or variations in the priors to fit the required model suffer from poor mixing (Zhu et al., 2010; Scheipl, Fahrmeir, and Kneib, 2012). In order to improve the mixing properties, Scheipl et al. (2012) proposed the peNMIG spike-and-slab prior for function selection in structured additive regression based on parameter expansion into an importance parameter and standardized coefficients (Gelman, van Dyk, Huang, and Boscardin, 2008). Similar, yet different, is the approach by Klein, Carlan, Kneib, Lang, and Wagner (2021), who proposed the Normal Beta Prime Spike and Slab (NBPSS) prior. In contrast to the peNMIG prior, which relies on the mixed-model decomposition of effects and a bimodal prior for the standardized coefficients, the NBPSS prior uses sparse design matrices, resulting in faster computation times, and enables samples from the full posterior instead of ones biased towards one of the modes. The NBPSS prior further enables discriminating between a linear and a non-linear component of an effect. While Guo, Jaeger, Rahman, Long, and Yi (2022) also address the problem of selecting the functional form of an effect, they only select non-linear effects once their linear components have been selected first. Thus, their approach might miss non-monotonic functions of variables, whereas function selection with the NBPSS prior does not.

Given the favourable properties of the NBPSS prior, we are interested in its extension to a quantile regression context and its variable selection performance therein. To this end, the methodological background of both Bayesian quantile regression and the NBPSS prior is explained in Section 2. We then apply the NBPSS prior in a simulation study to a set of different models varying in complexity and compare the results to other variable selection mechanisms for quantile regression in Section 3. Section 4 illustrates the application of both approaches to an example of childhood malnutrition in Nigeria, and Section 5 summarizes our findings and gives an outlook on possible future developments.

2 Methods

Traditionally regression models focus on modelling the conditional mean of the outcome given the regressors. The resulting model for the mean might, however, not apply to the more extreme observations. A solution to that is quantile regression, which regresses a specific conditional quantile of the outcome instead of the conditional mean.

2.1 Bayesian quantile regression

In quantile regression, the model specification takes the form

\begin{matrix} y_{τ} = η_{τ} + ε_{τ}, F_{ε_{τ}} (0) = τ \end{matrix},

(2.1)

where $τ$ is a distinct quantile of outcome $y$ and $η_{τ} = X β_{τ}$ a set of linear predictors of $y$ with covariate design matrix $X$ and the linear, quantile specific effects $β_{τ}$ . Further, the error term $ε_{τ}$ is defined such that its cumulative distribution function $F_{ε_{τ}} (x)$ takes on $τ$ at $x = 0$ . The linear, quantile specific effects $β_{τ}$ can then be estimated by minimizing the asymmetrically weighted absolute deviations (AWAD) of the error

\begin{matrix} \underset{β_{τ}}{arg min} \sum_{i = 1}^{n} ρ_{τ} (y_{i}, η_{i τ}) |y_{i} - η_{i τ}| \end{matrix}

(2.2)

with

ρ_{τ} (y_{i}, η_{i τ}) = \begin{cases} τ & \text{if } y_{i} - η_{i τ} \geq 0 \\ 1 - τ & \text{if } y_{i} - η_{i τ} < 0, \end{cases}

and n denoting the number of observations.
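As a small illustration of criterion (2.2), the following sketch evaluates the AWAD for a constant predictor and searches the observed values for the minimizer. The function names and the toy data are assumptions made for this example only; it uses a brute-force search rather than the linear programming approach referenced in the text.

```python
def check_weight(y, eta, tau):
    """Weight rho_tau from (2.2): tau for non-negative residuals, 1 - tau otherwise."""
    return tau if y - eta >= 0 else 1.0 - tau


def awad(ys, etas, tau):
    """Sum of asymmetrically weighted absolute deviations, criterion (2.2)."""
    return sum(check_weight(y, e, tau) * abs(y - e) for y, e in zip(ys, etas))


def best_constant(ys, tau):
    """Brute-force minimizer of (2.2) over constant predictors taken from the data."""
    return min(ys, key=lambda c: awad(ys, [c] * len(ys), tau))


# Toy data (hypothetical): the integers 0, 1, ..., 10
ys = [float(i) for i in range(11)]
print(best_constant(ys, 0.5))  # minimizer for the median
print(best_constant(ys, 0.9))  # minimizer for the 0.9-quantile
```

For these toy data the minimizer moves from the sample median towards the upper end of the data as $τ$ increases, which is the behaviour the asymmetric weights are meant to produce.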

The traditional method of solving (2.2) is by linear programming (Koenker and Bassett Jr, 1978). In order to estimate quantile regression under a Bayesian framework, that is, perform inference of the posterior distribution of the parameters of interest $β_{τ}$ , we need a suitable distribution to represent the outcome variable y as well as prior distributions for all parameters entering the model representing our prior knowledge. A distribution for y that allows us to reflect the asymmetric weighting implied by (2.2) is the asymmetric Laplace distribution (ALD), which for subject i is given by

p (y_{i} ∣ η_{i τ}, σ^{2}) = \frac{τ (1 - τ)}{σ^{2}} exp \{- ρ_{τ} (y_{i}, η_{i τ}) \frac{|y_{i} - η_{i τ}|}{σ^{2}}\}

with

ρ_{τ} (y_{i}, η_{i τ}) = \begin{cases} τ & \text{if } y_{i} - η_{i τ} \geq 0 \\ 1 - τ & \text{if } y_{i} - η_{i τ} < 0. \end{cases}
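To make the link between the ALD and criterion (2.2) explicit, the negative log-likelihood of n independent observations under the density above can be written out as follows (a short worked step, assuming independence across subjects):

```latex
-\log \prod_{i=1}^{n} p(y_i \mid \eta_{i\tau}, \sigma^2)
  = -n \log\frac{\tau(1-\tau)}{\sigma^2}
    + \frac{1}{\sigma^2} \sum_{i=1}^{n} \rho_\tau(y_i, \eta_{i\tau})\, |y_i - \eta_{i\tau}|
```

For fixed $σ^{2}$ the first term does not depend on $β_{τ}$ , so maximizing the ALD likelihood in $β_{τ}$ is equivalent to minimizing the AWAD criterion (2.2).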

Even though the ALD is an appropriate distribution to represent y in a Bayesian quantile regression setting, it contains an absolute value with undesirable differentiability properties. Fortunately, it can be expressed as a scale mixture of normal distributions with different variants proposed by Kozumi and Kobayashi (2011) and Yue and Rue (2011). For us, the latter is of interest, which uses the formulation

\begin{matrix} y_{i} ∣ ω_{i}, η_{i τ}, τ, σ^{2} \sim N (η_{i τ} + ζ ω_{i}, σ^{2} φ ω_{i}) \end{matrix}

(2.3)

with

ω_{i} \sim Exp (\frac{1}{σ^{2}}), ζ = \frac{1 - 2 τ}{τ (1 - τ)}, φ = \frac{2}{τ (1 - τ)} .
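The mixture representation (2.3) can be checked numerically. The sketch below draws from (2.3), reading $Exp (\frac{1}{σ^{2}})$ as an exponential distribution with rate $\frac{1}{σ^{2}}$ , and estimates the share of draws at or below the predictor, which should be close to $τ$ if the mixture reproduces the ALD. Function names, the seed and the parameter values are assumptions for illustration.

```python
import math
import random


def draw_from_mixture(eta, tau, sigma2, rng):
    """One draw of y from (2.3): y | w ~ N(eta + zeta * w, sigma2 * phi * w)."""
    zeta = (1.0 - 2.0 * tau) / (tau * (1.0 - tau))
    phi = 2.0 / (tau * (1.0 - tau))
    w = rng.expovariate(1.0 / sigma2)  # rate 1 / sigma2, i.e. mean sigma2
    return rng.gauss(eta + zeta * w, math.sqrt(sigma2 * phi * w))


def share_below(eta, tau, sigma2, n, seed=1):
    """Monte Carlo estimate of P(y <= eta) under the mixture."""
    rng = random.Random(seed)
    draws = [draw_from_mixture(eta, tau, sigma2, rng) for _ in range(n)]
    return sum(y <= eta for y in draws) / n


# Hypothetical predictor value and scale
for tau in (0.1, 0.5, 0.9):
    print(tau, share_below(eta=2.0, tau=tau, sigma2=1.5, n=20000))
```

The estimated shares should lie close to the respective $τ$ , in line with the condition $F_{ε_{τ}} (0) = τ$ from (2.1).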

2.2 Bayesian structured additive quantile regression

While (2.1) illustrates quantile regression with linear effects, the predictor $η_{τ}$ can be expanded to include non-linear, random and spatial effects. Koenker et al. (1994) considered smoothing splines for non-linear quantile regression, Yue and Rue (2011) and Waldmann et al. (2013) used a Bayesian approach for (non-)linear, possibly spatial (structured) additive quantile regression and Fenske et al. (2011) boosted structured additive quantile regression. From this, the predictor can be expressed in its most generic form as

\begin{matrix} η_{τ} = X β_{τ} + \sum_{j = 1}^{J} f_{τ, j} (z_{j}) + f_{τ, geo} (s) + U b_{τ}, \end{matrix}

(2.4)

with

(i) linear effects $β_{τ}$ on their corresponding design matrix X,

(ii) non-linear effects given by functions $f_{τ, j} (z_{j})$ for covariates $z_{j} \forall j \in \{1, \dots, J\}$ ,

(iii) spatial relation $f_{τ, geo} (x_{geo})$ on a spatial covariate $x_{geo}$ and

(iv) random effects $b_{τ}$ on their corresponding block-diagonal design matrix U.

Rewriting equation (2.4) in matrix notation yields

\begin{matrix} η_{τ} = Z_{1} γ_{τ, 1} + \dots + Z_{p} γ_{τ, p}, \end{matrix}

(2.5)

where $Z_{j}$ is an effect specific design matrix and $γ_{τ, j}$ a vector of quantile specific corresponding effect coefficients for $j \in \{1, \dots, p\}$ variables entering the model.

For the effects stated above, $Z_{j}$ takes the following forms: For (i) $Z_{j} = X$ and for (ii) $f_{τ, j} (z_{j}) = \sum_{k = 1}^{K} γ_{τ, j k} B_{k} (z_{j}) = Z_{j} γ_{τ, j}$ , where $B_{k} (z_{j})$ is the $k$ -th B-spline basis function evaluated at $z_{j}$ . For (iii) the spatial effect $f_{τ, geo} (x_{geo})$ of spatial covariate $x_{geo}$ with $S$ unique regions can be represented by an $n \times S$ incidence matrix $Z_{j}$ with entry 1 if individual $i \in \{1, \dots, n\}$ was observed in region $s \in \{1, \dots, S\}$ and 0 otherwise. Lastly, for the random effects (iv) $U b_{τ} = Z_{j} γ_{τ, j}$ , where $Z_{j} = U = blockdiag (u_{1}, \dots, u_{n})$ , that is, a block-diagonal matrix with $u_{i}$ being a vector of observations (or 1s for random intercepts) for individual $i \in \{1, \dots, n\}$ .

For Bayesian estimation of the model as given in (2.3) with the predictor as given in (2.5) we require priors for $ω_{i}, \frac{1}{σ^{2}}$ and $γ_{τ, j}$ . The weights $ω_{i}$ are exponentially distributed with intensity $\frac{1}{σ^{2}}$ , which is why we also specify a prior for $\frac{1}{σ^{2}}$ . They will be drawn as part of the MCMC algorithm in the hierarchical prior structure

ω_{i} ∣ σ^{2} \sim Exp (\frac{1}{σ^{2}}) and

\frac{1}{σ^{2}} \sim Ga (a_{y}, b_{y}),

where $a_{y}$ and $b_{y}$ are hyperparameters chosen so as to reflect our prior knowledge.

The prior for the coefficients $γ_{τ, j}$ is of generic form and proportional to a normal distribution with zero mean, prior variance $σ_{γ_{τ, j}}^{2}$ and precision matrix $K_{j}$ :

\begin{matrix} p (γ_{τ, j} ∣ σ_{γ_{τ, j}}^{2}) \propto \exp \{- \frac{1}{2 σ_{γ_{τ, j}}^{2}} γ_{τ, j}^{'} K_{j} γ_{τ, j}\} I_{[A_{j} γ_{τ, j} = 0]} . \end{matrix}

(2.6)

$A_{j}$ denotes a constraint matrix, by which the indicator function $I_{[A_{j} γ_{τ, j} = 0]}$ enforces constraints on the prior. Usually for non-linear and spatial effects the penalty matrix $K_{j}$ is rank deficient with the result that prior (2.6) would be partially improper. Here $I_{[A_{j} γ_{τ, j} = 0]}$ ensures identifiability and propriety of the prior. This generic prior is very flexible as depending on the form of the precision matrix $K_{j}$ it can be applied to map (non-)linear (fixed) and random effects as well as spatial effects.

For linear (i) and random effects (iv) $K_{j} = I_{j}$ and for non-linear effects (ii) $K_{j} = D^{'} D$ with D being a matrix of first or second order differences. For spatial effects (iii) $K_{j}$ is an adjacency matrix with entries of -1 if locations s and r are neighbours $(s \sim r)$ , the number of neighbours $|n (s)|$ if $s = r$ and 0 otherwise.
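The rank deficiency of the penalty matrix for non-linear effects can be illustrated with a short sketch. It builds $K_{j} = D^{'} D$ from first or second order differences and evaluates the quadratic form from the exponent of prior (2.6) for a few coefficient vectors. The function names, the number of coefficients and the example vectors are assumptions for illustration.

```python
def difference_matrix(k, order):
    """(k - order) x k matrix D of first or second order differences."""
    d = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]  # identity
    for _ in range(order):
        d = [[d[i + 1][j] - d[i][j] for j in range(k)] for i in range(len(d) - 1)]
    return d


def penalty_matrix(k, order):
    """Penalty matrix K = D'D for an effect with k basis coefficients."""
    d = difference_matrix(k, order)
    return [[sum(row[a] * row[b] for row in d) for b in range(k)] for a in range(k)]


def quadratic_form(K, g):
    """Penalty g'Kg as it appears in the exponent of prior (2.6)."""
    size = len(g)
    return sum(g[a] * K[a][b] * g[b] for a in range(size) for b in range(size))


K2 = penalty_matrix(6, order=2)
constant = [1.0] * 6
linear = [float(i) for i in range(6)]
wiggly = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]

print(quadratic_form(K2, constant))  # unpenalized: lies in the null space of K2
print(quadratic_form(K2, linear))    # unpenalized: lies in the null space of K2
print(quadratic_form(K2, wiggly))    # penalized
```

With second order differences, coefficient vectors that are constant or linear in their index receive no penalty, so the prior (2.6) carries no information about them. This is the part of the prior that the constraint $A_{j} γ_{τ, j} = 0$ addresses.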

Since we are not just interested in inference on the posterior distributions of the effects, but also in automatic function selection and effect decomposition as outlined in the introduction, the priors for the quantile specific effects $γ_{τ}$ also need to allow for variable or effect selection respectively.

2.3 Effect selection via the Normal Beta Prime Spike and Slab (NBPSS) prior

Inclusion of the correct effects in a regression model is paramount. Otherwise the estimates will be biased and the prediction performance will suffer. To achieve this in a Bayesian setting mixture priors with spike and slab components of the form

p (γ ∣ δ) = p_{slab} (γ_{δ}) \prod_{j : δ_{j} = 0} p_{spike} (γ_{j})

(2.7)

are applied with hierarchically specified inclusion probability

p (δ = 1 ∣ ω) = ω, ω \sim Beta (a_{ω}, b_{ω}),

where $γ$ is the vector of regression coefficients, $δ$ a vector of indicators for inclusion in the slab component, $γ_{δ}$ the vector of regression coefficients attributed to the slab component by indicator $δ = 1$ and $a_{ω}$ and $b_{ω}$ are hyperparameters of the Beta prior for the inclusion probability.

While originally spike and slab priors applied a mixture of normal distributions as prior for the coefficients (SSVS, George and McCulloch, 1993), contemporary spike and slab priors use a Normal Mixture of Inverse Gamma (NMIG) distributions on the variance of the coefficients (Ishwaran and Rao, 2005; Konrath, Kneib, and Fahrmeir, 2008). If effect selection is to include non-linear effects as well, approaches rely on re-parametrization of the effects and apply the prior to a specific parameter of the resulting term (importance parameter) as in parameter-expanded NMIG (peNMIG, Scheipl et al., 2012) and NBPSS (Klein et al., 2021).

This re-parametrization of an effect for the NBPSS prior is such that in the quantile regression setting

\begin{matrix} Z_{j} γ_{τ, j} = ϖ_{τ, j} Z_{j} {\tilde{γ}}_{τ, j} \end{matrix}

(2.8)

holds. The scalar $ϖ_{τ, j}$ is an importance parameter, via which effect selection happens, and ${\tilde{γ}}_{τ, j}$ are (standardized) coefficients.

Consider again the generic and thus flexible prior given in (2.6). With the re-parametrization in (2.8) this prior changes to

\begin{matrix} p ({\tilde{γ}}_{τ, j}) \propto \exp \{- \frac{1}{2} {\tilde{γ}}_{τ, j}^{'} K_{j} {\tilde{γ}}_{τ, j}\} I_{[A_{j} {\tilde{γ}}_{τ, j} = 0]} . \end{matrix}

(2.9)

In order to ensure identifiability and propriety of the prior Klein et al. (2021) choose the constraint $A_{j}$ such that it is equal to the representation of the basis of the null space of $K_{j}$ given by $(ker (K_{j}))$ , that is,

A_{j} = span (ker (K_{j})) .

This specific type of constraint matrix further enables the decomposition of a non-linear effect $Z_{j} {\tilde{γ}}_{τ, j}$ into an unpenalized linear effect component $Z_{j} {\tilde{γ}}_{unpen, τ, j}$ and a penalized non-linear effect component $Z_{j} {\tilde{γ}}_{pen, τ, j}$ ,

Z_{j} {\tilde{γ}}_{τ, j} = Z_{j} {\tilde{γ}}_{unpen, τ, j} + Z_{j} {\tilde{γ}}_{pen, τ, j} .

This allows us to decide whether to include a covariate in the model at all and, if so, whether its effect is purely linear or whether an additional non-linear effect component needs to be considered as well.

The actual variable or effect selection then happens by imposing a spike and slab prior on the squared importance parameter $ϖ_{τ, j}^{2}$ . This spike and slab prior is gamma distributed conditional on an inclusion indicator $δ_{τ, j}$ with inclusion probability $π_{τ, j}$ and scale parameter $ψ_{τ, j}^{2}$ . Since the importance parameter $ϖ_{τ, j}^{2}$ is specific to the quantile $τ$ , all of the parameters that $ϖ_{τ, j}^{2}$ depends on are also specific to $τ$ . The prior structure is hierarchical and follows the setup

ϖ_{τ, j}^{2} ∣ δ_{τ, j}, ψ_{τ, j}^{2} \sim Ga (\frac{1}{2}, \frac{1}{2 r_{j} (δ_{τ, j}) ψ_{τ, j}^{2}}),

δ_{τ, j} ∣ π_{τ, j} \sim Bern (π_{τ, j}),

ψ_{τ, j}^{2} \sim InvGa (a_{j}, b_{j})

π_{τ, j} \sim Beta (a_{0, j}, b_{0, j})

r_{j} (δ_{τ, j}) = \begin{cases} r_{j} > 0 \text{ small} & \text{if } δ_{τ, j} = 0 \\ 1 & \text{if } δ_{τ, j} = 1. \end{cases}

For indicator $δ_{τ, j} = 1$ the prior takes on the slab shape $Ga (\frac{1}{2}, \frac{1}{2 ψ_{τ, j}^{2}})$ and the effect is included in the model. For $δ_{τ, j} = 0$ the prior takes on the spike form $Ga (\frac{1}{2}, \frac{1}{2 r_{j} ψ_{τ, j}^{2}})$ with its density concentrated around zero, thus excluding the effect from the model.

When applying the NBPSS prior the choice of the hyperparameters $a_{j}, b_{j}, a_{0, j}, b_{0, j}$ and $r_{j}$ is critical. While Klein et al. (2021) suggest default values $a_{j} = 5$ and $a_{0, j} = b_{0, j} = 1$ , they recommend eliciting $b_{j}$ and $r_{j}$ via rules laid out in their paper. This prior elicitation is based on the marginal distribution of the standardized supremum of the function evaluations of one specific variable and requires the parameters $α$ (denoting the $α$ -quantile of this distribution) and c (denoting a threshold value a given supremum must not surpass). It is implemented in the R package sdPrior (Klein, 2018). For applications, they recommend using $α = c = 0.1$ .
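To give a feeling for how the spike and the slab component differ, the following sketch draws from the prior hierarchy of the squared importance parameter. It assumes that the second argument of $Ga (\cdot, \cdot)$ is a rate and that $InvGa (a, b)$ is the distribution of the reciprocal of a $Ga (a, b)$ variable. The values of b and r are placeholders chosen for illustration; they are not elicited with sdPrior, and the function names are hypothetical.

```python
import random


def draw_importance_sq(a, b, a0, b0, r, rng):
    """One prior draw of (delta, varpi^2) from the hierarchy of Section 2.3."""
    pi = rng.betavariate(a0, b0)               # inclusion probability
    delta = 1 if rng.random() < pi else 0      # Bernoulli(pi)
    psi2 = 1.0 / rng.gammavariate(a, 1.0 / b)  # InvGa(a, b) as 1 / Ga(shape a, rate b)
    r_delta = 1.0 if delta == 1 else r
    # Ga(1/2, rate 1 / (2 r psi2)); random.gammavariate expects a scale, i.e. 2 r psi2
    varpi2 = rng.gammavariate(0.5, 2.0 * r_delta * psi2)
    return delta, varpi2


def summarize(n, a=5.0, b=25.0, a0=1.0, b0=1.0, r=0.0005, seed=11):
    """Share of slab draws and mean of varpi^2 within slab and spike."""
    rng = random.Random(seed)
    draws = [draw_importance_sq(a, b, a0, b0, r, rng) for _ in range(n)]
    slab = [v for d, v in draws if d == 1]
    spike = [v for d, v in draws if d == 0]
    return len(slab) / n, sum(slab) / len(slab), sum(spike) / len(spike)


share_slab, mean_slab, mean_spike = summarize(20000)
print(share_slab, mean_slab, mean_spike)
```

Under these placeholder values, draws from the spike are several orders of magnitude smaller than draws from the slab, which is how an effect with $δ_{τ, j} = 0$ is effectively removed from the predictor.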

The graphical display of the employed prior hierarchy is in the supplementary material (Section 1).

2.4 Other methods for variable and/or effect selection in quantile regression

In order to put the focus of this article—effect selection in Bayesian quantile regression via the NBPSS prior—in context, we will address other methods for variable selection in quantile regression.

Frequentist statistics. Under a frequentist framework quantile smoothing splines (Koenker et al., 1994) and non-convex (group) penalties (Sherwood and Maidman, 2022; Zhao and Lian, 2016) are commonly used for variable selection in quantile regression. Quantile smoothing splines use a total variation penalty with the ensuing minimization problem being solved via linear programming. The method is implemented in the R package quantreg (Koenker, 2023). Non-convex (group) penalties (in the case of non-linear effects) hold under milder conditions than the total variation penalty and typical penalties are the (grouped) smoothly clipped absolute deviation (gSCAD) penalty and the minimax concave penalty (MCP). Linear programming is no longer possible with these types of penalties, and possible algorithms to obtain estimates are either the quantile iterative coordinate descent (QICD) algorithm (implemented in the R package rqPen with a choice of gSCAD or MCP penalty, Sherwood, Maidman, and Li, 2023) or the majorize-minimize (MM) algorithm (Zhao and Lian, 2016).

Bayesian statistics. Wu and Liu (2009), Benoit, Van den Poel, et al. (2017) and Waldmann et al. (2013) used Bayesian (adaptive) LASSO for variable selection in quantile regression. The Bayesian LASSO (Park and Casella, 2008) uses a Laplace-prior, which is conjugate to the representation of quantile regression as a scale mixture of normal densities. Therefore, full conditional distributions exist and draws from the posterior distributions can be realized via a Gibbs sampler. The Bayesian computation platform BayesX supports the Bayesian LASSO as a prior option as well as structured additive quantile regression (Belitz, Brezger, Kneib, and Lang, 2015) and the R package bayesQR supports Bayesian adaptive LASSO in quantile regression for linear predictors (Benoit et al., 2017). Spike-and-slab priors other than the NBPSS prior have also been shown to produce good results (Alhamzawi and Yu, 2012; Chen et al., 2013; Kedia et al., 2023), and in theory the spike-and-slab LASSO could also be used for variable selection due to the normal representation of quantile regression (Ročková and George, 2018; Guo et al., 2022). Both variants need programming by the user.

Statistical machine learning. In statistical machine learning it is important to distinguish between methods that result in a statistically interpretable model and those which do not (black box learners). A representative of the former is gradient boosting for quantile regression (Fenske et al., 2011). Gradient boosting derives a model by minimizing a loss function defining the regression type. In each iteration, it regresses the negative gradient, that is, the negative first derivative, of this loss function on each variable separately and updates the model with only a small proportion of the effect of the single variable that fits the gradient best. Since variables that are never chosen do not enter the final model, this enables automatic variable selection. The loss function for quantile regression is identical to the AWAD criterion (2.2). Gradient boosting for structured additive quantile regression is available within the R package mboost (Hothorn, Buehlmann, Kneib, Schmid, and Hofner, 2022). Examples of black box learners are quantile regression forests implemented in the R package quantregForest (Meinshausen, 2006, 2017) and LASSO-quantile regression neural networks (He et al., 2019; Liu et al., 2023).
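The boosting mechanism described above can be sketched in a few lines. This is a simplified, hypothetical implementation with linear base-learners only, a fixed number of iterations and a fixed step length; it is not the mboost implementation, and all names, the toy data and the tuning values are assumptions for illustration.

```python
import random


def check_loss(ys, fits, tau):
    """Average quantile loss, that is, criterion (2.2) divided by n."""
    return sum((tau if y >= f else 1.0 - tau) * abs(y - f) for y, f in zip(ys, fits)) / len(ys)


def predict(columns, offset, coefs):
    """Fitted values offset + sum_j coefs[j] * columns[j]."""
    n = len(columns[0])
    return [offset + sum(c * col[i] for c, col in zip(coefs, columns)) for i in range(n)]


def boost_quantile(columns, ys, tau, steps=150, nu=0.1):
    """Component-wise gradient boosting with one linear base-learner per column."""
    n = len(ys)
    offset = sorted(ys)[n // 2]  # start from (roughly) the sample median
    coefs = [0.0] * len(columns)
    selected = [0] * len(columns)
    fits = [offset] * n
    for _ in range(steps):
        # negative gradient of the check loss with respect to the current fit
        u = [tau if y - f >= 0 else tau - 1.0 for y, f in zip(ys, fits)]
        best_j, best_b, best_rss = 0, 0.0, float("inf")
        for j, col in enumerate(columns):
            b = sum(x * g for x, g in zip(col, u)) / sum(x * x for x in col)
            rss = sum((g - b * x) ** 2 for x, g in zip(col, u))
            if rss < best_rss:
                best_j, best_b, best_rss = j, b, rss
        coefs[best_j] += nu * best_b  # only a fraction nu of the best fit enters
        selected[best_j] += 1
        fits = [f + nu * best_b * x for f, x in zip(fits, columns[best_j])]
    return offset, coefs, selected


# Toy data (hypothetical): x1 is informative, x2 is not
rng = random.Random(3)
n = 200
x1 = [rng.uniform(-2.0, 2.0) for _ in range(n)]
x2 = [rng.uniform(-2.0, 2.0) for _ in range(n)]
ys = [2.0 * a + rng.gauss(0.0, 0.3) for a in x1]
columns = [[1.0] * n, x1, x2]  # constant column acts as intercept base-learner

offset, coefs, selected = boost_quantile(columns, ys, tau=0.5)
print(coefs, selected)
```

The vector `selected` counts how often each base-learner was chosen. A variable with a count of zero never enters the model, which is the sense in which boosting selects variables.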

Apart from the NBPSS prior, of the presented alternatives for variable selection in quantile regression only the Bayesian LASSO as used by Waldmann et al. (2013) and gradient boosting for quantile regression by Fenske et al. (2011) are to the best of our knowledge able to estimate structured additive quantile regression models with non-linear and in particular spatial effects.

3 Simulations

3.1 Setup

When it comes to effect selection in quantile regression two scenarios appear relevant: (I) A particular variable either has an effect across all quantiles (informative) or not (non-informative) and (II) a particular variable has an effect on a specific quantile (quantile specific informative), but not on others (quantile specific non-informative). Ideally an effect selection mechanism would be able to discriminate between both scenarios, and in order to examine whether the NBPSS prior is such a mechanism we will conduct a two-fold simulation study. Part I considers informative vs. non-informative variables and Part II in addition features quantile specific informative vs. non-informative variables. For comparability reasons, we will keep Part I as close as possible to the simulation study by Klein et al. (2021).

Thus, the performance of the NBPSS prior is assessed via its overall accuracy in effect selection as well as its resulting prediction performance and benchmarked against established variable selection mechanisms for quantile regression, that is, Bayesian LASSO as implemented in BayesX and gradient boosting as implemented in the R package mboost. Overall accuracy is measured as correct (non-)selection of variables. For the NBPSS approach, an effect counts as selected if its posterior inclusion probability is at least 50% (and as not selected if it is below 50%); for the Bayesian LASSO, an effect counts as not selected if its coefficients are shrunken to zero; and for boosting, an effect counts as selected if it is chosen at least once (and as not selected if it is never chosen). While prediction performance is usually measured as log-scores in a Bayesian context, this will not allow us to compare the Bayesian results to gradient boosting. Therefore, the prediction performance will be measured as the average quantile loss. The quantile loss is given by the evaluations of the AWAD criterion from (2.2). For better comparability the quantile loss will be centred around the quantile specific median of the results for NBPSS elicited for the combination of α = c = 0.1 (median-(0.1,0.1) centred quantile loss). This combination is the default Klein et al. (2021) recommend for applications (compare Section 2.3). Due to the weighting of data as part of the quantile regression estimation mechanism (refer to Section 2.1) and the resulting asymmetry in the availability of observations used for estimation, we would expect somewhat lower performance towards the outer quantiles in both dimensions, accuracy and prediction.
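The selection rules and the centring of the quantile loss described above can be sketched as follows. All numbers are made up for illustration and do not correspond to results of the simulation study; the function names are hypothetical.

```python
import statistics


def is_selected_nbpss(inclusion_probability, threshold=0.5):
    """NBPSS rule: selected if the posterior inclusion probability is at least 50%."""
    return inclusion_probability >= threshold


def is_selected_boosting(times_selected):
    """Boosting rule: selected if the effect was chosen at least once."""
    return times_selected >= 1


def selection_accuracy(decisions, truly_informative):
    """Share of effects whose (non-)selection matches the truth."""
    hits = sum(d == t for d, t in zip(decisions, truly_informative))
    return hits / len(truly_informative)


def centred_losses(losses, reference_losses):
    """Quantile losses centred around the median of a reference set of losses."""
    centre = statistics.median(reference_losses)
    return [loss - centre for loss in losses]


# Hypothetical example with four effects
truth = [True, False, True, True]
nbpss_decisions = [is_selected_nbpss(p) for p in [0.90, 0.20, 0.55, 0.49]]
boost_decisions = [is_selected_boosting(k) for k in [12, 3, 7, 0]]

print(selection_accuracy(nbpss_decisions, truth))
print(selection_accuracy(boost_decisions, truth))
print(centred_losses([0.35, 0.33], reference_losses=[0.30, 0.32, 0.31]))
```

Centring on the median of the reference losses makes values above zero read as a larger quantile loss than the reference configuration and values below zero as a smaller one.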

Part I: (Non-)informative variables in homo- and heteroscedastic models

For Part I, we replicate several parts of the simulation study conducted by Klein et al. (2021). These include homoscedastic Gaussian and heteroscedastic Gaussian location-scale models. All covariates will be standardized. In total, 16 independent variables x will be simulated according to a uniform distribution $U (-2, 2)$ and four different functions are specified as effects:

\begin{matrix} f_{1} (x) = x, f_{2} (x) = x + \frac{{(2 x - 2)}^{2}}{5.5}, f_{3} (x) = - x + π \sin (π x) \text{ and} \\ f_{4} (x) = 0.5 x + 15 ϕ (2 (x - 0.2)) - ϕ (x + 0.4) \end{matrix}

The predictor of the mean $η_{μ}$ includes twelve and the predictor of the variance $η_{σ}$ four of these variables; they take the forms

\begin{matrix} η_{μ} = f_{1} (x_{1}) + f_{2} (x_{2}) + f_{3} (x_{3}) + f_{4} (x_{4}) + \\ 1.5 (f_{1} (x_{5}) + f_{2} (x_{6}) + f_{3} (x_{7}) + f_{4} (x_{8})) + \\ 2 (f_{1} (x_{9}) + f_{2} (x_{10}) + f_{3} (x_{11}) + f_{4} (x_{12})) and \\ η_{σ} = f_{1} (x_{1}) + f_{2} (x_{2}) + f_{3} (x_{3}) + f_{4} (x_{4}) \end{matrix}

Moreover, the generated data will feature a spatial variable alternating between an informative and a non-informative effect, which will be added to both predictors in the informative cases. The Gaussian models are then simulated from

\begin{matrix} y = η_{μ} + ε, ε \sim N (0, \frac{Var (η_{μ})}{20} 1) \end{matrix}

(3.1)

and the Gaussian location-scale models from

\begin{matrix} y = η_{μ} + {(exp \{η_{σ}\})}^{0.5} \cdot ε, ε \sim N (0, 1) \end{matrix} .

(3.2)

Thus the Gaussian models from (3.1) and the Gaussian location-scale models from (3.2) are identical except for their variance.
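The data generating process of Part I can be sketched as follows. The sketch omits the spatial effect and the standardization of covariates, assumes that $ϕ$ denotes the standard normal density, reads the sine term as $π \sin (π x)$ , and reads the error variance in (3.1) as $\frac{Var (η_{μ})}{20}$ computed from the simulated predictor. All function names are hypothetical.

```python
import math
import random
import statistics


def normal_pdf(x):
    """Standard normal density (assumed meaning of the symbol phi in f4)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def f1(x):
    return x


def f2(x):
    return x + (2.0 * x - 2.0) ** 2 / 5.5


def f3(x):
    return -x + math.pi * math.sin(math.pi * x)


def f4(x):
    return 0.5 * x + 15.0 * normal_pdf(2.0 * (x - 0.2)) - normal_pdf(x + 0.4)


FUNCS = [f1, f2, f3, f4]
SCALES = [1.0, 1.5, 2.0]


def eta_mu(x):
    """Mean predictor: x[0..11] enter in blocks of four with factors 1, 1.5 and 2."""
    return sum(SCALES[j // 4] * FUNCS[j % 4](x[j]) for j in range(12))


def eta_sigma(x):
    """Variance predictor: only x[0..3] enter."""
    return sum(FUNCS[j](x[j]) for j in range(4))


def simulate(n, heteroscedastic, seed=0):
    """Draw 16 covariates from U(-2, 2) and an outcome from (3.1) or (3.2)."""
    rng = random.Random(seed)
    X = [[rng.uniform(-2.0, 2.0) for _ in range(16)] for _ in range(n)]
    mu = [eta_mu(x) for x in X]
    if heteroscedastic:
        sd = [math.exp(eta_sigma(x)) ** 0.5 for x in X]        # model (3.2)
    else:
        sd = [math.sqrt(statistics.pvariance(mu) / 20.0)] * n  # model (3.1)
    y = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sd)]
    return X, y


X, y = simulate(1000, heteroscedastic=False, seed=1)
print(len(X), len(X[0]), len(y))
```

In this sketch the covariates with indices 12 to 15 do not enter either predictor, so they play the role of the non-informative variables.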

To account for estimation complexity due to sample size the Gaussian models are tested on n = 200 and n = 1000 observations and the Gaussian location-scale models on n = 1000 and n = 200 observations. Predictions are performed on n_predict = 5000 new observations. Each combination of the above is repeated R = 100 times. Further, to examine sensitivity to the choice of hyperparameters, prior elicitation with the help of the R package sdPrior is carried out with possible combinations of α = 0.05, 0.1, 0.2 and c = 0.1, 0.2.

Part II: Quantile specific (non-)informative variables in heteroscedastic models

Part II of the simulation study will focus on a spatial Gaussian location-scale model with n = 1000 observations and two variables only informative for specific quantiles. The data generation mechanism is the same as for Part I, as is the prediction logic. The predictors take the forms

\begin{matrix} η_{μ} = z_{0.75} (1 + x_{1}) + ({(2 x_{2})}^{2} + 1) + z_{0.1} (1 + x_{3}) + \\ {(2 x_{4})}^{2} + (x_{5} - sin 2 x_{5}) + (- x_{6}) and \\ η_{σ} = (1 + x_{1}) + 1.5 (x_{2}^{2} + 0.5) + (1 + x_{3}) \end{matrix}

where $z_{τ}$ is the $τ$ -quantile of the standard normal distribution. Data is then simulated from

\begin{matrix} y = η_{μ} + η_{σ} \cdot ε, ε \sim N (0, 1) . \end{matrix}

(3.3)

While the model simulated in (3.3) is also a Gaussian location-scale model, it differs from (3.2) in the function that links the predictor $η_{σ}$ to the variance. The choice of an identity link function for (3.3) is motivated by the intention to simulate a true non-zero effect on a specific quantile. Details on the rationale behind this can be found in the supplementary material (Section 2).

For estimation we consider a set of seven quantiles $τ \in \{0.05, 0.10, 0.25, 0.5, 0.75, 0.90, 0.95\}$ and we use the current version of BayesX (Belitz, Brezger, Kneib, Lang, and Umlauf, 2022) as well as mboost (version 2.9-7, Hothorn et al., 2022) in R (R Core Team, 2022).

3.2 Results

The results presented here only include the methods 'NBPSS' and 'gradient boosting', since the Bayesian LASSO for geoadditive quantile regression, despite being mentioned in Waldmann et al. (2013), is not available in BayesX.

Part I: (Non-)informative variables in homo- and heteroscedastic models

The accuracy of effect selection for the homoscedastic models of Part I is in general very high for the NBPSS prior (Figure 1). This improves further with larger sample sizes. While in the non-spatial case effect selection is more precise towards the inner quantiles, this effect vanishes in the spatial case. This is due to the spatial effect being wrongly selected in the outer quantiles in the non-spatial case. To a certain degree, this is a result of decreased data density caused by the weighting step of the scale mixture representation of the observations during estimation, as pointed out before. The results also indicate a sensitivity of the NBPSS prior to the chosen hyperparameters, suggesting that different combinations of α and c (see Section 2.3) lead to more conservative or more liberal selection of variables. Notably, a larger probability α influences the selection of non-informative effects and a larger threshold c influences the deselection of informative variables. Detailed figures on the inclusion probability can be found in the supplementary material (Section 4). On a global level, however, this only influences the overall accuracy when α is chosen relatively small and c larger in comparison (or vice versa).

Figure 1

Notes: The left panel displays the results of the small sample size models $(n = 200)$ , and the right panel larger sample size models ( $n = 1000$ ). The upper row depicts models without spatial effect, the lower one with spatial effect. Results are given for all quantiles and combinations $(α, c)$ used for prior elicitation for the NBPSS prior as well as boosting.

The boosting results show lower selection accuracy in all homoscedastic cases. The reason lies with the high selection frequency of the non-informative variables, incorrect selection of non-linear components of strictly linear effects as well as the frequent selection of a spatial effect in non-spatial cases. For further information, the selection frequencies are depicted in more detail in the supplementary material (Section 4.4).
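As a minimal sketch of how such a selection accuracy could be computed for one replication, the snippet below assumes that accuracy is the share of candidate effects whose selection status matches the truth (selected if informative, left out if non-informative). This definition and the toy selection vectors are assumptions made for illustration and are not the simulation output.

```python
def selection_accuracy(selected, truly_informative):
    """Share of candidate effects that are correctly selected or correctly left out."""
    if len(selected) != len(truly_informative):
        raise ValueError("both sequences must describe the same candidate effects")
    hits = sum(s == t for s, t in zip(selected, truly_informative))
    return hits / len(selected)

# Hypothetical example with six candidate effects in a single replication.
truth    = [True, True, True, False, False, False]
nbpss    = [True, True, True, False, False, False]  # every effect classified correctly
boosting = [True, True, True, True, False, True]    # two non-informative effects selected

acc_nbpss = selection_accuracy(nbpss, truth)
acc_boosting = selection_accuracy(boosting, truth)
```

In this toy case the over-selection of non-informative effects alone lowers the accuracy, which mirrors the mechanism described for boosting above.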

In terms of estimation accuracy, the Bayesian and the boosting approach yield similar values of mean-squared error (MSE) and bias, while the 95%-coverage values for the Bayesian results are higher for the inner quantiles (Figure 2). It became relevant throughout the analysis to also consider these measures in the evaluation of the results. Since they were not originally planned to be included, they can be found, apart from the depiction of the coverage, in the supplementary material (Section 5).
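For readers less familiar with these measures, the following sketch shows how MSE, bias and interval coverage are commonly computed over simulation replications for a single scalar coefficient. The replicated estimates and credible intervals are hypothetical; the definitions used in the supplementary material may differ in detail, for example for smooth effects evaluated on a grid.

```python
def mse(estimates, truth):
    """Mean squared deviation of the replicated estimates from the true value."""
    return sum((e - truth) ** 2 for e in estimates) / len(estimates)

def bias(estimates, truth):
    """Average estimate minus the true value."""
    return sum(estimates) / len(estimates) - truth

def coverage(intervals, truth):
    """Share of (lower, upper) intervals that contain the true value."""
    return sum(lower <= truth <= upper for lower, upper in intervals) / len(intervals)

# Hypothetical results from four replications for a coefficient with true value 1.0.
true_beta = 1.0
estimates = [0.9, 1.1, 1.2, 0.8]
intervals = [(0.7, 1.1), (0.9, 1.3), (1.05, 1.35), (0.6, 1.0)]

mse_value = mse(estimates, true_beta)
bias_value = bias(estimates, true_beta)
coverage_value = coverage(intervals, true_beta)
```

Coverage requires interval estimates, which is why it can only be reported for the Bayesian results.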

Figure 2

Notes: The left panel displays the results of the small sample size models ( $n = 200$ ), and the right panel larger sample size models ( $n = 1000$ ). The upper row depicts models without spatial effect, the lower one with spatial effect. Results are given for all quantiles and combinations $(α, c)$ used for prior elicitation for the NBPSS prior. Boosting does not provide measures of estimation uncertainty (variances) and thus no coverage can be stated.

When it comes to prediction performance, the median-(0.1,0.1) centred quantile loss is in general lower for the NBPSS prior than for boosting, yet the discrepancy shrinks with increasing sample size (compare Figure 3). The weaker boosting performance can be traced back to the incorrect effect selection. In addition, the NBPSS prior shows more variability towards the outer quantiles, while the behaviour is inverted for boosting, which has higher variability towards the inner quantiles. For the Bayesian approach, the higher variability can be traced back to the more pronounced weighting of the data in the outer quantiles, which makes estimation a little less stable and is eventually reflected in the prediction. In contrast to the selection accuracy, the choice of hyperparameters hardly impacts the quality of the prediction.
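The quantile loss underlying this comparison is the check function of Koenker and Bassett, $ρ_{τ}(u) = u\,(τ - 1\{u < 0\})$. The sketch below implements only this plain, uncentred loss on hypothetical data; the median-(0.1,0.1) centring used for the figures is not defined in this section and is deliberately not reproduced here.

```python
def check_loss(y, q_hat, tau):
    """Check function rho_tau(u) = u * (tau - 1[u < 0]) with u = y - q_hat."""
    u = y - q_hat
    return u * (tau - (1.0 if u < 0 else 0.0))

def mean_quantile_loss(ys, q_hats, tau):
    """Average check loss of predicted tau-quantiles over a set of observations."""
    return sum(check_loss(y, q, tau) for y, q in zip(ys, q_hats)) / len(ys)

# Hypothetical observations and a constant predicted 10%-quantile.
ys = [1.0, 2.0, 3.0, 4.0]
q_hats = [2.0, 2.0, 2.0, 2.0]
loss_10 = mean_quantile_loss(ys, q_hats, tau=0.1)
```

For a low quantile such as τ = 0.1, observations below the prediction are penalized with weight 0.9 and observations above it with weight 0.1, so the loss rewards predictions that sit in the lower tail.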

Figure 3

Notes: The left panel displays the results of the small sample size models ( $n = 200$ ), and the right panel larger sample size models ( $n = 1000$ ). The upper row depicts models without spatial effect, the lower one with spatial effect. Results are given for all quantiles and combinations $(α, c)$ used for prior elicitation for the NBPSS prior as well as boosting.

For the heteroscedastic models a very similar picture is evident in terms of selection accuracy. The NBPSS prior yields high selection accuracy with slight drops towards the outer quantiles in the non-spatial case (see Figure 4) due to incorrect selection of the spatial effect as well as the artificial imbalance induced by weighting the data. What also becomes evident is that for the heteroscedastic effects of variables x₁ to x₄ the selection accuracy is quantile specific, which seems incorrect at first, but is indeed the desired behaviour. For certain quantiles, the effects of those variables are flat and rather close to zero (illustration in the supplementary material Section 3) and in consequence, effect selection is lower. This is the case for quantiles $τ \in \{0.05, 0.1\}$ for $f(x_{1})$, $τ \in \{0.05, 0.1, 0.25\}$ for $f(x_{2})$, $τ = 0.25$ for $f(x_{3})$ and $τ = 0.5$ for $f(x_{4})$. Depending on the hyperparameter choice, this may even show on a global level, for example, for the combination $(α = 0.05, c = 0.2)$. The selection accuracy for the boosting results is comparable to the one for the homoscedastic models for the same reasons and now also shows in the estimation accuracy measures MSE, bias and coverage (details provided in the supplementary material Section 5 and Figure 5). Yet the phenomenon of quantile-specific effects close to zero is also evident for the boosting results: although less pronounced than with the NBPSS prior, the linear component of the effect of $x_{2}$ was selected less frequently in the lower quantiles by boosting.

Figure 4

Notes: The left panel displays the results of the smaller sample size models ( $n = 1000$ ), and the right panel larger sample size models ( $n = 2000$ ). The upper and lower rows show models without and with spatial effects respectively. Results are given for all quantiles and combinations $(α, c)$ used for prior elicitation for the NBPSS prior as well as boosting for comparison.

Despite the additional complexity in the simulated data, the prediction performance in the heteroscedastic models is improved due to larger samples, that is, the median-(0.1,0.1) centred quantile loss is smaller than in the homoscedastic cases (see Figure 6). The median-(0.1,0.1) centred quantile loss generated by the boosting results is again higher than in the NBPSS prior results, but similar to the homoscedastic case the difference between the two methods decreases with increasing sample size. The increased sample size also seems to have a homogenizing effect on the variability in results across the quantiles: for the NBPSS prior the outer quantiles are still more volatile than the inner ones and vice versa for boosting, but this balances out with increasing sample size.

Part II: Quantile specific (non-)informative variables in heteroscedastic models

In Part II of the simulation study, which features quantile-specific non-informative variables, the overall accuracy is again relatively high, as can be seen from Figure 7, but coverage appears low (Figure 8). The accuracy results are expected to drop towards the outer quantiles, since the effect of x₁ was non-informative on the 25%-quantile and the effect of x₃ on the 90%-quantile. As a consequence, the effects on the neighbouring quantiles are very small and the NBPSS prior shows the expected selection and deselection for the relevant quantiles. In contrast, the boosting approach overall shows a weaker selection accuracy; compared to the Bayesian outcomes, the driving reason behind this is the frequent selection of non-informative variables and effects. Illustrations of the quantile-specific non-informative effects, inclusion probabilities and selection frequencies are provided in the supplementary material (Figures 3, 12 and 15).

Compared to Part I, the prediction quality for the boosting results is closer to the Bayesian results in Part II, as can be seen in Figure 9. For boosting, the median-(0.1,0.1) centred quantile loss is lower than for the NBPSS results at the more extreme quantiles, and vice versa for the inner quantiles. The prediction variability is again higher in the outer quantiles for the Bayesian results and higher in the inner quantiles for the boosting results. The former can, as in Part I, be traced back to the weighting of the observations as part of the scale mixture representation of quantile regression. For the NBPSS prior this behaviour is independent of the hyperparameters.

Figure 5

Notes: The left panel displays the results of the smaller sample size models ( $n = 1000$ ), and the right panel larger sample size models ( $n = 2000$ ). The upper row depicts models without spatial effect, the lower one with spatial effect. Results are given for all quantiles and combinations $(α, c)$ used for prior elicitation for the NBPSS prior. Boosting does not provide measures of estimation uncertainty (variances) and thus no coverage can be stated.

Compared to the previous publication on variable selection with the NBPSS prior in distributional regression (Klein et al., 2021), the results for quantile regression seem to fall behind expectations especially when measured by ‘selection accuracy’. But then the intention of quantile regression is to model an outcome locally and a local effect might differ substantially from a global effect (as yielded by distributional regression). As the heteroscedastic cases of Part I and the specifically designed Part II of the simulation study have shown, a particular effect might be more or less smooth or (close to) zero altogether for a specific quantile.

This leads to the question: When can a predictor on a specific quantile be considered correctly specified? On the one hand, it is desirable that all components are selected even when they are very small, which is especially important in the setting of a simulation study that aims to measure the quality of the effect selection properties. At the same time, it is to be questioned whether other variable selection mechanisms might achieve such a task (potentially also in other regression settings) and whether selection based on the posterior inclusion probability alone is sufficiently precise in these cases. The latter can be complemented by also taking the effect size and its deviation from the true effect into consideration, as we did throughout this simulation study with the common metrics MSE, bias and coverage (provided in the supplementary material). On the other hand, there is the argument of parsimony of the final model for a specific quantile: it might be argued that the simpler model, for example, one that leaves out a very small effect or models a slightly non-linear effect as linear only, is sufficient to describe the conditional quantile of the outcome. Following this thought, the posterior inclusion probability would be well suited to determine the predictor for a specific quantile and can be recommended in application settings.

Figure 6

Notes: The left panel displays the results of the smaller sample size models ( $n = 1000$ ), and the right panel larger sample size models ( $n = 2000$ ). The upper and lower rows depict models with non-informative and informative spatial effects respectively. Results are given for all quantiles and combinations $(α, c)$ used for prior elicitation for the NBPSS prior as well as boosting.

4 Childhood malnutrition in Nigeria

4.1 Identifying predictors of chronic malnutrition

In low- and middle-income countries malnutrition remains a problem, especially when it affects children. Since 1984, the Demographic and Health Surveys (DHS) Program (www.measuredhs.com) has been collecting reliable data in a wide variety of countries through nationally representative surveys on indicators of fertility, family planning, maternal and child health, gender, HIV/AIDS, malaria, and nutrition.

Figure 7

Notes: Results are given for all quantiles and combinations $(α, c)$ used for prior elicitation for the NBPSS prior as well as boosting for comparison.

Figure 8

Notes: Results are given for all quantiles and combinations $(α, c)$ used for prior elicitation for the NBPSS prior. Boosting does not provide measures of estimation uncertainty (variances) and thus no coverage can be stated.

Figure 9

Notes: Results are given for all quantiles and combinations $(α, c)$ used for prior elicitation for the NBPSS prior as well as boosting.

In general, it makes sense to distinguish between acute malnutrition and chronic malnutrition. A common measure for the latter is stunting, that is, insufficient height for a child's age. It is defined as a Z-score standardizing the actual anthropometric measurement by a reference population and its calculation follows the WHO Recommendations for data collection, analysis and reporting on anthropometric indicators in children under 5 years old.
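As a simplified sketch of such a standardization, the snippet below computes a height-for-age Z-score from a reference median and standard deviation. The reference values are hypothetical placeholders and not WHO figures, and the scaling by 100 is only an inference from the value range of the stunting variable in Table 1, not something stated in the text.

```python
def height_for_age_z(height_cm, ref_median_cm, ref_sd_cm):
    """Z-score of a child's height relative to a reference population of the same age and sex."""
    return (height_cm - ref_median_cm) / ref_sd_cm

# Hypothetical child and hypothetical reference values (not taken from WHO tables).
z = height_for_age_z(height_cm=80.0, ref_median_cm=86.0, ref_sd_cm=3.0)

# Assumed scaling: Table 1 reports stunting between -600 and 597, which suggests Z * 100.
stunting_score = 100 * z

# A Z-score below -2 is the conventional cut-off for classifying a child as stunted.
is_stunted = z < -2
```

Modelling the lower quantiles of this score, rather than a binary indicator derived from the cut-off, keeps the information on how severe the growth deficit is.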

Quantile regression, especially of the lower quantiles, is particularly well suited for analysing malnutrition, since it allows a focus on the more extreme observations, which represent varying stages of malnutrition better than is possible when regressing the mean. In addition, the broad variety of topics covered in the DHS Program allows us to explore the question of which indicators predict more severe cases of childhood malnutrition measured as the 5%- and 10%-quantile of stunting.

In terms of country, our analysis will focus on Nigeria, which still exhibits distinct sociodemographic inter-regional disparities. For comparability reasons with an earlier analysis on this topic by Klein et al. (2021), we will focus on data from 2013 and the same candidate variables for selection as given in Table 1. After cleansing (removing outliers and inconsistent values), 8 005 completely observed cases remain.

Table 1

Candidate variables in the Nigeria childhood malnutrition dataset.

Variable	Description	Mean (Std Dev.)	Min - Max
Continuous/discrete
stunting	degree of stunting	-92.85 (217.7)	-600.00 - 597.00
cage	child’s age in months	11.49 (9.08)	0.00 - 59.00
mbmi	mother’s BMI	22.55 (3.82)	12.10 - 39.91
mage	mother’s age in years	28 (6.85)	12.00 - 50.00
edupartner	education of mother’s partner in years	6.26 (5.83)	0.00 - 20.00
Binary	Description	Coding	Frequency in %
csex	child’s gender	female = 1 / male = 0	50.63/49.37
ctwin	child is a twin	single birth = 0 / twin = 1	98.46/1.54
cbirthorder1	child is 1st child	no = 0 / yes = 1	83.02/16.98
cbirthorder2	child is 2nd child	no = 0 / yes = 1	83.26/16.74
cbirthorder3	child is 3rd child	no = 0 / yes = 1	84.9/15.1
cbirthorder4	child is 4th child	no = 0 / yes = 1	86.88/13.12
cbirthorder5	child is 5th child	no = 0 / yes = 1	61.94/38.06
munemployed	mother’s employment status	employed = 0 / unemployed = 1	68.82/31.18
mresidence	residential area	rural = 0 / urban = 1	69.21/30.79
electricity	household has electricity	no = 0 / yes = 1	55.85/44.15
radio	household has a radio	no = 0 / yes = 1	33.39/66.61
television	household has a television	no = 0 / yes = 1	60.26/39.74
refrigerator	household has a refrigerator	no = 0 / yes = 1	85.88/14.12
bicycle	household has a bicycle	no = 0 / yes = 1	78.1/21.9
motorcycle	household has a motorcycle	no = 0 / yes = 1	58.24/41.76
car	household has a car	no = 0 / yes = 1	91.89/8.11

The analysis will focus on the 5%- and 10%-quantile to capture the more severe cases of malnutrition and on the median to allow for comparison with mean regression as performed in Klein et al. (2021). We specify the continuous variables cage (child's age in months), mbmi (mother's BMI), mage (mother's age in years) and edupartner (education of mother's partner in years) as smooth effects decomposed into their linear and their non-linear component. These variables are centred before the analysis to increase the stability of the algorithm. The remaining binary variables enter as exclusively linear effects. As a benchmark for variable selection in quantile regression we will compare NBPSS results to results from gradient boosting for quantile regression (package mboost, version 2.9-7, Hothorn et al., 2022). The best number of iterations for boosting was identified via the in-built 25-fold bootstrap cross-validation.
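The idea behind choosing the number of boosting iterations by bootstrap cross-validation can be sketched as follows. This is a generic illustration and not the mboost implementation: the helper names are hypothetical, and the out-of-bag risk values are made up rather than produced by a fitted model.

```python
import random

def bootstrap_folds(n, n_folds=25, seed=1):
    """Return (in-bag, out-of-bag) index lists for bootstrap cross-validation."""
    rng = random.Random(seed)
    folds = []
    for _ in range(n_folds):
        in_bag = [rng.randrange(n) for _ in range(n)]  # draw n rows with replacement
        in_bag_set = set(in_bag)
        out_of_bag = [i for i in range(n) if i not in in_bag_set]
        folds.append((in_bag, out_of_bag))
    return folds

def best_iteration(oob_risk):
    """oob_risk[b][m] is the out-of-bag risk of fold b after m + 1 boosting iterations."""
    n_iter = len(oob_risk[0])
    mean_risk = [sum(fold[m] for fold in oob_risk) / len(oob_risk) for m in range(n_iter)]
    return min(range(n_iter), key=mean_risk.__getitem__) + 1  # 1-based iteration count

folds = bootstrap_folds(n=50)

# Made-up out-of-bag risks for two folds and four iterations.
toy_risk = [[1.00, 0.60, 0.50, 0.55],
            [1.10, 0.70, 0.45, 0.60]]
m_stop = best_iteration(toy_risk)
```

In practice the model is fitted on each in-bag sample, the quantile loss is evaluated on the corresponding out-of-bag observations after every iteration, and the iteration with the lowest average out-of-bag risk is used as the stopping point.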

4.2 Results

Figure 10 shows the exclusively non-linear part of the effects of the quantitative variables on stunting for the different quantiles under analysis together with the boosted results and Figure 11 depicts the posterior means and 95%-credible intervals of the linear effects of the quantitative and binary variables together with their boosted counterparts. A depiction of the re-composed smooth effects is provided in the supplementary material (Section 6).

Figure 10

Notes: Solid lines represent results from Bayesian effect selection, dashed lines from boosting. Grey-coloured lines indicate effect was not selected by the respective algorithm. Boosting by design does not provide confidence intervals.

Figure 11

Notes: Blue points with error bars represent the results from Bayesian effect selection and black points represent the results from boosting. Boosting tends to underestimate effects in favour of improved generalizability of the estimated model and by design does not provide confidence intervals.

Inspecting the smooth effects, it becomes clear that the results of both methods are very similar. The NBPSS prior selected most of the smooth effects across the quantiles with the exception of the mother's BMI for the 10%-quantile and the median and the education in years of the mother's partner for the 5%- and 10%-quantile. Gradient boosting performed a similar selection, but the mother's BMI for the median, the mother's age not for the 10%-quantile and the partner's education not at all.

Analogously to the non-linear part, the linear effects yielded by both methods are very similar in magnitude. The only difference between the two methods seems to lie in the selection of variables: gradient boosting tends to select fewer linear effects than the NBPSS prior at the 5%- and 10%-quantile and more at the median (compare Table 2).

Table 2

Selection of variables by the NBPSS prior (including the selection frequency in brackets) and gradient boosting. Boosting seems to be more conservative in variable selection.

	tau = 0.05		tau = 0.1		tau = 0.5
	NBPSS	Boosting	NBPSS	Boosting	NBPSS	Boosting
lin(bicycle)	1 (0.634)	0	0 (0.449)	1	0 (0.317)	0
lin(cage)	1 (1.000)	1	1 (0.999)	1	1 (1.000)	1
lin(car)	1 (0.895)	1	1 (0.586)	0	0 (0.402)	0
lin(cbirthorder1)	1 (0.697)	0	1 (0.694)	0	0 (0.428)	1
lin(cbirthorder2)	1 (0.990)	1	1 (0.972)	1	1 (0.858)	1
lin(cbirthorder3)	1 (0.808)	0	1 (0.532)	1	0 (0.380)	1
lin(cbirthorder4)	1 (0.781)	1	1 (0.796)	1	0 (0.404)	1
lin(cbirthorder5)	1 (0.744)	0	1 (0.541)	0	1 (0.624)	1
lin(csex)	1 (1.000)	1	1 (1.000)	1	1 (1.000)	1
lin(ctwin)	1 (1.000)	1	1 (1.000)	1	1 (0.920)	1
lin(edupartner)	1 (0.997)	1	1 (0.969)	1	1 (0.999)	1
lin(electricity)	1 (0.589)	0	0 (0.490)	0	0 (0.427)	1
lin(mage)	1 (0.749)	1	1 (0.923)	1	1 (0.939)	1
lin(mbmi)	1 (0.779)	0	1 (0.537)	0	1 (0.634)	1
lin(motorcycle)	1 (0.517)	0	1 (0.536)	1	0 (0.327)	1
lin(mresidence)	1 (0.756)	0	1 (0.501)	0	0 (0.387)	1
lin(munemployed)	1 (0.706)	1	1 (0.562)	1	0 (0.337)	0
lin(radio)	1 (0.563)	1	1 (0.601)	0	0 (0.317)	0
lin(refrigerator)	1 (0.998)	1	1 (0.651)	1	0 (0.392)	1
lin(television)	1 (1.000)	1	1 (1.000)	1	1 (0.950)	1
sm(cage)	1 (1.000)	1	1 (1.000)	1	1 (1.000)	1
sm(edupartner)	0 (0.463)	0	0 (0.500)	0	1 (1.000)	1
sm(mage)	1 (0.999)	1	1 (0.820)	0	1 (0.820)	1
sm(mbmi)	1 (0.795)	0	0 (0.394)	0	0 (0.408)	0
mrf(region)	1 (1.000)	1	1 (1.000)	1	1 (1.000)	1
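The pattern in the linear effects of Table 2 can be checked with a short tally. The dictionary below is a transcription of the 0/1 selection indicators of the twenty linear effects from Table 2 (first string: NBPSS, second string: boosting, each in the order tau = 0.05, 0.1, 0.5); the helper name and the string encoding are choices made for this illustration only.

```python
TAUS = (0.05, 0.1, 0.5)

# effect: (NBPSS indicators, boosting indicators), one character per quantile in TAUS
TABLE2_LINEAR = {
    "bicycle":      ("100", "010"),
    "cage":         ("111", "111"),
    "car":          ("110", "100"),
    "cbirthorder1": ("110", "001"),
    "cbirthorder2": ("111", "111"),
    "cbirthorder3": ("110", "011"),
    "cbirthorder4": ("110", "111"),
    "cbirthorder5": ("111", "001"),
    "csex":         ("111", "111"),
    "ctwin":        ("111", "111"),
    "edupartner":   ("111", "111"),
    "electricity":  ("100", "001"),
    "mage":         ("111", "111"),
    "mbmi":         ("111", "001"),
    "motorcycle":   ("110", "011"),
    "mresidence":   ("110", "001"),
    "munemployed":  ("110", "110"),
    "radio":        ("110", "100"),
    "refrigerator": ("110", "111"),
    "television":   ("111", "111"),
}

def count_selected(table, method_index):
    """Selected linear effects per quantile for one method (0 = NBPSS, 1 = boosting)."""
    return [sum(row[method_index][k] == "1" for row in table.values())
            for k in range(len(TAUS))]

nbpss_counts = count_selected(TABLE2_LINEAR, 0)     # [20, 18, 9]
boosting_counts = count_selected(TABLE2_LINEAR, 1)  # [12, 13, 16]
```

The tally shows boosting selecting fewer linear effects than the NBPSS prior at the two lower quantiles and more at the median.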

Both algorithms also select the spatial effect for each quantile. Combined with the intercept they show a similar effect with a north-south-disparity (compare Figure 12).

Figure 12

Notes: The upper row belongs to results from Bayesian effect selection, the lower row depicts gradient boosting results. Both methods selected a spatial effect for each quantile. Note the different value range for the effect per quantile.

Socio-demographic factors such as the child being a girl, the mother's age or the partner's education have a protective effect on the level of stunting of a child, while the child's age or unemployment of the mother constitute risk factors. Years of partner's education and the mother's age show higher protection in the median, but the child's gender is more protective for the lower quantiles. Interestingly, the child's age as a risk factor is more distinct in the median than in the lower quantiles. Moreover, the spatial effect shows a pronounced north-south disparity with the southern, more urban regions acting as protection against chronic malnutrition. In addition, the living situation of the child plays a role: a refrigerator and a television in the household were protective indicators of children's nutritional status, especially in the lower quantiles, while other forms of media or electrification did not play a role. In particular, the latter could be highly correlated with the region the child lives in, with the availability of both being higher in the urbanized areas in the south, and is thus covered by the spatial effect. Further, the order of birth appears to impact the level of stunting: being the first or in particular the second child is a protective factor; the risk of stunting rises, however, with increasing order of birth. It is worth noting that this effect is more pronounced for the lower quantiles than it is for the median. Mobility factors do not seem to have a distinct impact on the severity of stunting regardless of the quantile. These outcomes are comparable to the findings in Klein et al. (2021).

To make the findings of the data example reproducible, the code used for the analysis can be found on GitHub at ‘/arappl/Nigeria’.

5 Conclusion

The aim of this study was to examine the NBPSS prior as an effect selection mechanism in Bayesian structured additive quantile regression and compare the outcome to Bayesian LASSO and gradient boosting for quantile regression. This was performed through a simulation study and illustrated by an example of childhood malnutrition in Nigeria. Results for Bayesian LASSO are not included, since this method was not available within the stated platform.

When it comes to variable selection in quantile regression, two scenarios are relevant: A variable might not be informative on the entire model or it might not be informative on a particular quantile. Our two-part simulation study accounted for both. The first part was kept close to the original publication of the NBPSS prior to capture various model complexities and it showed that the NBPSS prior outperformed boosting when it came to variable selection. It already became apparent that it is sensitive towards variables having no or a very small effect on specific quantiles, which is desirable behaviour. This was then confirmed in the second part of the simulation study, which in addition also covered variables non-informative on specific quantiles. The NBPSS prior proved its ability to discriminate between the two and yield accurate estimation results. Finally the findings were applied to data on childhood malnutrition in Nigeria and the results of both variable selection mechanisms were encouragingly close and in line with previous research on childhood malnutrition.

What was interesting is that a prediction-optimized method such as gradient boosting was outperformed by a Bayesian variable selection mechanism. That said, it needs to be noted that gradient boosting for quantile regression was used here in its default settings in order to provide an established baseline for comparison, and in what constitutes a rather low-dimensional setting for boosting. Given that boosting is known to perform better in considerably higher-dimensional data situations and given the variety of available stopping criteria other than the employed cross-validation, the overall performance of boosting in the present case could surely be optimized. An approach to this can be found in Strömer, Staerk, Klein, Weinhold, Titze, and Mayr (2022).

Despite the performance of the NBPSS prior, drawing the weights in the MCMC algorithm for Bayesian quantile regression in order to achieve the scale mixture representation is computationally costly, and particularly in the outer quantiles several tries may be needed to achieve convergence. This behaviour is amplified by adding a variable selection mechanism such as the NBPSS prior. Furthermore, the selection of variables is sensitive towards the hyperparameters chosen for the NBPSS prior, and very small or zero effects are not entirely deselected, but rather shrunken. While this is still acceptable from a statistical perspective, applied researchers might wish for a clear guideline on how to deal with such cases.

This could be achieved by a sensitivity analysis on an informed choice of hyperparameters and on the handling of very small effect estimates, and might be an aspect of future research. Especially with applied researchers in mind, a user-friendly solution similar to the R package makemyprior, which works in conjunction with R-INLA, might be useful (Hem, Fuglstad, and Riebler, 2022; Martins, Simpson, Lindgren, and Rue, 2013). In addition, it might be explored whether the NBPSS prior is also suitable for other complex model types such as joint models. Due to their model complexity, standard variable selection mechanisms are not available for joint models, but since this model class keeps receiving more and more interest, also from applied researchers, automated variable selection might help increase its popularity further.

Supplementary material

Supplementary materials for this article are available online.

Supplemental Material for Bayesian effect selection in structured additive quantile regression by Anja Rappl, Manuel Carlan, Thomas Kneib, Sebastiaan Klokman and Elisabeth Bergherr, in Statistical Modelling

Footnotes

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Funding

Thomas Kneib gratefully acknowledges funding from the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), grant KN 922/9-1. Elisabeth Bergherr gratefully acknowledges funding from the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), grant BE 7939/2-2.

References

Alhamzawi

(2015) Model selection in quantile regression models. Journal of Applied Statistics , 42(2), 445–458. doi: 10.1080/02664763.2014.959905

Alhamzawi

and Yu

(2012) Variable selection in quantile regression via Gibbs sampling. Journal of Applied Statistics , 39(4), 799–813. doi: 10.1080/02664763.2011.620082

Alhamzawi

and Yu

(2013) Conjugate priors and variable selection for Bayesian quantile regression. Computational Statistics & Data Analysis , 64, 209–219. doi: 10.1016/j.csda.2012.01.014

Bar

, Booth

and Wells

(2023) Mixed effect modelling and variable selection for quantile regression. Statistical Modelling , 23(1), 5380. doi: 10.1177/1471082X211033490

Belitz

, Brezger

, Kneib

and Lang

(2015) Bayesx-software for Bayesian inference in structured additive regression models. Version 3.0.2 . URL http://www.bayesx.org

Belitz

, Brezger

, Kneib

, Lang

and Umlauf

(2022) BayesX: Software for Bayesian Inference in Structured Additive Regression Models. Version 3.0 . http://www.unigoettingen.de/de/bayesx/550513.html

Belloni

and Chernozhukov

(2011) L1penalized quantile regression in highdimensional sparse models. The Annals of Statistics , 39(1), 82–130.

Benoit

and Van den Poel

(2017) bayesQR: A Bayesian approach to quantile regression. Journal of Statistical Software , 76(7), 1–32. doi: 10.18637/jss.v076.i07

Benoit

and Van den Poel

(2017) bayesQR: A Bayesian approach to quantile regression. Journal of Statistical Software , 76(7), 1–32.

10.

Chen

, Dunson

, Reed

and Yu

(2013) Bayesian variable selection in quantile regression. Statistics and Its Interface , 6(2), 261274.

11.

Dai

(2023) Variable selection in convex quantile regression: L1-norm or 10 -norm regularization? European Journal of Operational Research , 305(1), 338–355. doi: 10.1016/j.ejor.2022.05.041

12.

Eun Ryung Lee

and Park

(2014) Model selection via Bayesian information criterion for quantile regression models. Journal of the American Statistical Association , 109(505), 216–229. doi: 10.1080/01621459.2013.836975

13.

Fenske

, Kneib

and Hothorn

(2011) Identifying risk factors for severe childhood malnutrition by boosting additive quantile regression. Journal of the American Statistical Association , 106(494), 494–510. doi: 10.1198/jasa.2011.ap09272

14.

, Fu

and Song

(2023) Robust and smoothing variable selection for quantile regression models with longitudinal data. Journal of Statistical Computation and Simulation , 93(15), 2600–2624. doi: 10.1080/00949655.2023.2201007

15.

Gelman

, van Dyk

, Huang

and Boscardin

(2008) Using redundant parameterizations to fit hierarchical models. Journal of Computational and Graphical Statistics , 17(1), 95–122. doi: 10.1198/106186008X287337

16.

George

and McCulloch

(1993) Variable selection via Gibbs sampling. Journal of the American Statistical Association , 88(423), 881–889.

17.

Guo

, Jaeger

, Rahman

AKMF

, Long

and Yi

(2022) Spike-and-slab least absolute shrinkage and selection operator generalized additive models and scalable algorithms for high-dimensional data analysis. Statistics in Medicine , 41(20), 3899–3914. doi: 10.1002/sim.9483

18.

, Qin

, Wang

and Wang

(2019) Electricity consumption probability density forecasting method based on lasso-quantile regression neural network. Applied Energy , 233-234, 565–575. doi: 10.1016/j.apenergy.2018.10.061

19.

Hem

, G-A

Fuglstad

and Riebler

(2022) makemyprior: Intuitive Construction of Joint Priors for Variance Parameters. R package version 1.1.0 . URL https://CRAN.Rproject.org/package=makemyprior

20.

Hothorn

, Buehlmann

, Kneib

, Schmid

and Hofner

(2022) mboost: Modelbased Boosting. R package version 2.97 . URL https://CRAN.R-project.org/package=mboost

21.

Ishwaran

and Rao

(2005) Spike and slab variable selection: Frequentist and Bayesian strategies. Annals of Statistics , 33(2), 730773. doi: 10.1214/009053604000001147

22.

Jiang

, Bondell

and Wang

(2014) Interquantile shrinkage and variable selection in quantile regression. Computational Statistics & Data Analysis , 69, 208–219. doi: 10.1016/j.csda.2013.08.006

23.

Kedia

, Kundu

and Das

(2023) A Bayesian variable selection approach to longitudinal quantile regression. Statistical Methods & Applications , 32(1), 149–168.

24.

Klein

(2018) sdPrior: Scale-dependent Hyperpriors in Structured Additive Distributional Regression. R package version 1.0-0. URL https://CRAN.R-project.org/package=sdPrior

25.

Klein

, Carlan

, Kneib

, Lang

and Wagner

(2021) Bayesian Effect Selection in Structured Additive Distributional Regression Models. Bayesian Analysis , 16(2), 545573. doi: 10.1214/20-BA1214

26.

Kneib

(2013) Beyond mean regression. Statistical Modelling , 13(4), 275–303. doi: 10.1177/1471082X13494159

27.

Kneib

, Silbersdorff

and Säfken

(2021) Rage against the mean: a review of distributional regression approaches. Econometrics and Statistics . doi: 10.1016/j.ecosta.2021.07.006

28.

Koenker

(2005) Quantile Regression. Econometric Society Monographs . Cambridge University Press. doi: 10.1017/CBO9780511754098

29.

Koenker

(2023) quantreg: Quantile Regression. R package version 5.95 . URL https://CRAN.R-project.org/package=quantreg

30.

Koenker

and Bassett

Jr (1978) Regression quantiles. Econometrica: Journal of the Econometric Society , 46(1), 33–50.

31.

Koenker

, Ng

and Portnoy

(1994) Quantile smoothing splines. Biometrika , 81(4), 673680.

32.

Konrath

, Kneib

and Fahrmeir

(2008) Bayesian regularisation in structured additive regression models for survival data. URL http://nbn-resolving.de/urn/resolver.pl?urn=nbn:de:bvb:19-epub-5732-1

33.

Kozumi

and Kobayashi

(2011) Gibbs sampling methods for Bayesian quantile regression. Journal of Statistical Computation and Simulation , 81(11), 1565–1578. doi: 10.1080/00949655.2010.496117.

34.

Liu

, Chen

, Liu

, Qin

and Fars

(2023) A novel electricity load forecasting based on probabilistic least absolute shrinkage and selection operator-quantile regression neural network. International Journal of Hydrogen Energy , 48(88), 34486–34500. doi: 10.1016/j.ijhydene.2023.04.091

35.

, Zhang

, Zhao

and Liu

(2015) Quantile regression and variable selection of partial linear single-index model. Annals of the Institute of Statistical Mathematics , 67(2), 375409.

36.

Martins

, Simpson

, Lindgren

and Rue

(2013) Bayesian computing with INLA: New features. Computational Statistics & Data Analysis , 67, 68–83. doi: 10.1016/j.csda.2013.04.014

37.

Meinshausen

(2006) Quantile regression forests. Journal of Machine Learning Research , 7(35), 983–999. URL http://jmlr.org/papers/v7/meinshausen06a.html

38.

Meinshausen

(2017) quantregForest: Quantile Regression Forests. R package version 1. 3–7. URL https://CRAN.Rproject.org/package=quantregForest

39. Mitchell and Beauchamp (1988) Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83(404), 1023–1032.

40. O'Hara and Sillanpää (2009) A review of Bayesian variable selection methods: What, how and which. Bayesian Analysis, 4(1), 85–117. doi: 10.1214/09-BA403

41. Oh, Choi and Park (2016) Bayesian variable selection in quantile regression using the Savage-Dickey density ratio. Journal of the Korean Statistical Society, 45(3), 466–476. doi: 10.1016/j.jkss.2016.01.006

42. Panagiotelis and Smith (2008) Bayesian identification, selection and estimation of semiparametric functions in high-dimensional additive models. Journal of Econometrics, 143(2), 291–316. doi: 10.1016/j.jeconom.2007.10.003

43. Park and Casella (2008) The Bayesian lasso. Journal of the American Statistical Association, 103(482), 681–686.

44. R Core Team (2022) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/

45. Rigby and Stasinopoulos (2005) Generalized additive models for location, scale and shape. Journal of the Royal Statistical Society: Series C (Applied Statistics), 54(3), 507–554. doi: 10.1111/j.1467-9876.2005.00510.x

46. Ročková and George (2018) The spike-and-slab lasso. Journal of the American Statistical Association, 113(521), 431–444. doi: 10.1080/01621459.2016.1260469

47. Scheipl, Fahrmeir and Kneib (2012) Spike-and-slab priors for function selection in structured additive regression models. Journal of the American Statistical Association, 107(500), 1518–1532. doi: 10.1080/01621459.2012.737742

48. Sherwood and Maidman (2022) Additive nonlinear quantile regression in ultrahigh dimension. Journal of Machine Learning Research, 23(63), 1–47.

49. Sherwood, Maidman and Li (2023) rqPen: Penalized Quantile Regression. R package version 3.2.1. URL http://jmlr.org/papers/v23/19-697.html

50. Tu, Wang and Sun (2017) Bayesian variable selection and estimation in maximum entropy quantile regression. Journal of Applied Statistics, 44(2), 253–269. doi: 10.1080/02664763.2016.1168369

51. Strömer, Staerk, Klein, Weinhold, Titze and Mayr (2022) Deselection of base-learners for statistical boosting – with an application to distributional regression. Statistical Methods in Medical Research, 31(2), 207–224. doi: 10.1177/09622802211051088

52. Tibshirani (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1), 267–288.

53. Waldmann, Kneib, Yue, Lang and Flexeder (2013) Bayesian semi-parametric additive quantile regression. Statistical Modelling, 13(3), 223–252.

54. Wu and Liu (2009) Variable selection in quantile regression. Statistica Sinica, 19(2), 801–817.

55. Sun, Wang and Fuentes (2016) Fused adaptive lasso for spatial and temporal quantile function estimation. Technometrics, 58(1), 127–137. doi: 10.1080/00401706.2015.1017115

56. Yue and Rue (2011) Bayesian inference for additive mixed quantile regression models. Computational Statistics & Data Analysis, 55(1), 84–96.

57. Zhao and Lian (2016) Variable selection in additive quantile regression using nonconcave penalty. Statistics, 50(6), 1276–1289. doi: 10.1080/02331888.2016.1221954

58. Zhu, Vannucci and Cox (2010) A Bayesian hierarchical model for classification with selection of functional predictors. Biometrics, 66(2), 463–473. doi: 10.1111/j.1541-0420.2009.01283.x

