Sage Journals: Discover world-class research

Abstract

Ordinal data are quite common in applied statistics. Although some model selection and regularization techniques for categorical predictors and ordinal response models have been developed over the past few years, less work has been done concerning ordinal-on-ordinal regression. Motivated by a consumer test and a survey on the willingness to pay for luxury food products consisting of Likert-type items, we propose a strategy for smoothing and selecting ordinally scaled predictors in the cumulative logit model. First, the group lasso is modified by the use of difference penalties on neighbouring dummy coefficients, thus taking into account the predictors’ ordinal structure. Second, a fused lasso-type penalty is presented for the fusion of predictor categories and factor selection. The performance of both approaches is evaluated in simulation studies and on real-world data.

Keywords

Cumulative logit group lasso item-on-items regression Likert-scale proportional odds model (POM)sensometrics

1 Introduction

Ordered categorical variables arise commonly in regression modelling. Still, proper treatment of the underlying scale level and appropriate statistical modelling is an issue of discussion. In many applications, particularly, the nature of ordinal predictors such as rating scales is neglected and a classical linear model is fit instead, thus assuming a metric scale level (Tutz and Gertheiss, 2014). On the other hand, statistics textbooks often transform ordinal covariates into dummy variables, thus using the variables’ nominal scale level only. Ignoring that categories are ordered implies, however, that we do not make full use of all the information contained in the data. Sometimes, when using dummy-coded ordinal predictors, monotonicity in the corresponding regression parameters is assumed to exploit the categories’ ordering (compare, for example, Young et al., 1976; Linting et al., 2007; Rufibach, 2010), which is also known under the name ‘isotonic regression’ (Barlow et al., 1972), but the assumption of monotonicity is not always plausible (Linting et al., 2007). If the dependent variable is ordinal, ordered response models as treated in most advanced textbooks on statistical modelling (e.g., Fahrmeir et al., 2013) can be used. These models are typically recommended in the statistics literature but far less used in applications; compare, for example Christensen and Brockhoff (2013). Liddell and Kruschke (2018), for instance, demonstrated that applying metric models to ordinal data can lead to significant errors, such as Type I and Type II errors and argued for the use of ordered-probit models to ensure more accurate and reliable analyses. Although modelling of ordinal response variables has been well investigated, ordinal-on-ordinal regression is hardly mentioned. A comprehensive overview of statistical models for ordinal cross-sectional data is provided by Bartolucci et al. (2015), who offer a unified framework for analyzing such data using R and Stata. To investigate the relationship between two ordinal variables, basically, any statistical tool designed for analyzing the relationship between two categorical variables arranged in a contingency table could be used. Those tools include so-called association models. For instance, the row-column (RC) model (Goodman, 1979) is a type of association model that allows us to study the interaction between the row and column categories. Such models are particularly useful in understanding how different categories of one variable are associated with the categories of another variable. However, when the number of covariates and/or the number of levels per variable becomes large, RC models are no longer practically manageable. In this work, we hence focus on ordinal regression models to analyze the effects of a set of ordinal explanatory variables on an ordinal response variable. To give an example of such a setting, let us consider a survey on luxury food (Hartmann et al., 2016, 2017), which will be the primary case study here. This survey collected data on various items measured on a five-point Likert-type scale. The primary goal of the original study was to segment German consumers based on their perceived dimensions of luxury food and the shift of consumer consumption motives toward indulgence, quality and sustainability. In what follows, we will focus on a regression problem where the response variable is the willingness to pay for luxury food products, which is also evaluated on an ordinal five-point scale. The set of predictors considered contains 43 ordinally scaled statements (items) on eating habits, shopping places and habits, the importance of price, eating style and the subjective definition of luxury food. In high-dimensional settings such as this, however, with ordinal data both on the left and right-hand side of the regression, the full regression model, such as a cumulative logit model, is often hard to fit by common maximum likelihood. Penalty-based regularization has been shown to be a viable means to handle such situations and various regularization techniques have been proposed for regression models with categorical variables over the last two decades. Meier et al. (2008), for instance, used the group lasso penalty (Yuan and Lin, 2006) for factor selection in logistic regression. Tutz and Gertheiss (2014, 2016) proposed methods for the selection of ordinally scaled independent variables in the classical linear, the logistic and the log-linear (Poisson) models. Regularized ordinal regression has been considered by Archer and Williams (2012) and Wurm et al. (2021) using a lasso-type (Tibshirani, 1996) and elastic net (Zou and Hastie, 2005) penalty, respectively.

The limitation of the penalty approaches above, however, is that they assume either the response or the predictors to be of metric scale or binary. In this article, we propose a penalty-based strategy for model selection with ordinally scaled covariates and smoothing across covariate levels in ordinal regression, specifically, the cumulative logit model. For doing so, the original group lasso is modified by the use of a difference penalty on neighbouring dummy coefficients, thus taking into account their ordinal structure. As an alternative, the use of a fused lasso-type penalty (Tibshirani et al., 2005) will be discussed.

The remainder of the article is organized as follows. Section 2 introduces two motivating data examples on food products and the corresponding modelling framework. In Section 3, we will describe in detail the two penalty methods for ordinal predictors mentioned above. Some illustrative simulation studies in Section 4 examine the quality of the proposed methodology. In Section 5, the proposed methods are applied to data from a consumer test and the study on luxury food expenditure. In Section 6, we conclude with a summary and discussion. All computations were done using the statistical programme R (R Core Team, 2021). The proposed methods are implemented in the open-source R add-on package ordPens (Gertheiss and Hoshiyar, 2023) available on CRAN and Github (https://github.com/ahoshiyar/ordPens).

2 Data examples and modelling framework

2.1 Two case studies

2.1.1 Perception and acceptance of boar-tainted meat

For illustrative purposes, we first consider a consumer test about the perception and acceptance of boiled sausages from strongly boar-tainted meat, conducted by the Department of Animal Sciences, University of Göttingen (Meier-Dinkel et al., 2016). Due to the European call for alternatives to surgical castration of male piglets, as this practice was proven to be cruel (European Commission, 2010, 2019), there is rising interest in the production and processing of boar meat. However, there is a disadvantageous risk that off-flavors occur, the so-called boar taint, which is to be dealt with. An effort is being put into improving the processing of tainted raw material in a sustainable way with the primary aim of making it ‘fit for human consumption’. Specifically, one challenge lies in masking the undesirable odor—and the associated flavor—by combining spices, herbs and aromas and developing meat products that are palatable and accepted by consumers. The study at hand examined six sausages of different types (raw smoked-fermented or boiled emulsion-type sausages) with varying proportions of tainted raw material (50%, 100%, plus a control product without tainted material). A total of 120 participants were recruited to consume and evaluate the products with regard to different features; see Meier-Dinkel et al. (2016) for details on the recruited consumer sample. For illustrational purposes, we focus on boiled sausages here with a proportion of 50% tainted boar meat (product six). In order to process highly tainted meat sustainably, we aim to detect the relevant factors for the consumers’ overall liking. The six ordinal predictors considered ‘expected liking’, ‘appearance’, ‘odour’, ‘flavour’, ‘texture’, ‘aftertaste’ and the ordinal response ‘overall liking’ are measured on a nine-point scale (1 = ‘dislike extremely’ to 9 = ‘like extremely’). The data is summarised through barplots for selected variables in Figure 1 (top). Using a regression model, the effect of the ordinal covariates on the ordinal factor overall liking is to be investigated.

2.1.2 Consumers’ willingness to pay for luxury food

As a second and main case study, we consider the before-mentioned survey concerning, among other things, the consumers’ willingness to pay for ‘luxury food’ (Hartmann et al., 2016, 2017). The data was collected by a commercial panel provider and is representative of the German population with respect to age, gender and education (see Hartmann et al., 2016, 2017, for details). The part of the dataset we examine consists of 821 observations of 44 Likert-type items on personal luxury food definitions, eating and shopping habits, diet styles and food price sensitivities. One such item is, for example, ‘I associate a high price in food with particularly good quality’, with coding scheme −2 = ‘strongly disagree’, …, +2 = ‘fully agree’. The coding scheme of the remaining items is similar and given in Table C3 in the online supplements. Our response of interest is whether participants would be willing to pay a higher price for a food product that they associate with luxury, measured again on the ordinal −2 to +2 scale. The subset of the data that is analyzed here has been made publicly available at https://zenodo.org/record/8383248 (Hoshiyar et al., 2023). A different subset of the data has previously been studied by Tutz and Gertheiss (2016), but their focus was on aspects regarding more general behaviour when buying and consuming food products. Also, their response of interest was different, namely the (approximate) household’s weekly expenditure on food, assumed to be measured on a metric scale.

Figure 1

Illustration of the sensory data. Top: frequency distribution for the covariates with levels 1 = ‘dislike extremely’ to 9 = ‘like extremely’, segmented by frequencies of response y (‘overall liking’), with colors corresponding to response categories 1 = ‘dislike extremely’ to 9 = ‘like extremely’ (see top left). Bottom: polr estimates of the regression coefficients as functions of class labels for the sensory data from Section 2.1.1. The estimated thresholds are: −34.22, −1.20, 4.84, 7.32, 8.86, 11.29, 15.75 and 21.98.

2.2 Ordinal-on-ordinal regression

The cumulative model (McCullagh, 1980) is probably the most popular regression model for ordinal response variables; compare, for example, Agresti (2010, 2013), Fahrmeir and Tutz (2001), Fahrmeir et al. (2013), Tutz (2022). It can be motivated such that the observable response variable y ∈ {1, …, c} is a categorized version of a latent continuous variable u. The link between the response variable y_i (i.e., y for subject i = 1, …, n) and the corresponding latent variable u_i is then defined by the threshold mechanism $y_{i} = r \Leftrightarrow θ_{r - 1} < u_{i} \leq θ_{r}, r = 1, \dots, c$ , where $- \infty = θ_{0} < θ_{1} < \dots < θ_{c} = \infty$ are the ordered thresholds. Given p numeric covariates, the latent variable is typically modelled as $u_{i} = x_{i}^{⊤} β + ϵ_{i}$ , where $x_{i} = {(x_{i 1}, \dots, x_{i p})}^{⊤}$ denotes the covariate vector for subject i, β = (β₁, …, β_p)^⊤ is the parameter vector and ϵ_i is an error variable with distribution function F. From these assumptions, it follows that $P (y_{i} \leq r) = P (u_{i} \leq θ_{r}) = F (θ_{r} - x_{i}^{⊤} β) = F (η_{i r})$ , with $η_{i r} = θ_{r} - x_{i}^{⊤} β$ being the linear predictor. Common choices for F(·) are the logistic and the standard normal cumulative distribution function, resulting in the cumulative logit and the cumulative probit model, respectively. The most widely used cumulative model is the cumulative logit model (cf. Tutz, 2022), which we will concentrate on here. The cumulative logit model is also known as the proportional odds model (POM) since the ratio of the cumulative odds for two populations is postulated to be the same across all (response) categories.

With ordinal covariates, without loss of generality, each $x_{i} = {(x_{i 1}, \dots, x_{i p})}^{⊤}$ contains integers, with entry x_ij indicating the level of the jth variable that is observed for the ith subject. So let the jth variable have (potential) values 1, …, k_j and define dummy variables such that z_ijl = 1 if x_ij = l, l = 1, …, k_j and z_ijl = 0 otherwise. The model to be fitted then has the linear predictor

η_{i r} = θ_{r} - \sum_{j = 1}^{p} \sum_{l = 1}^{k_{j}} z_{i j l} β_{j l} = θ_{r} - \sum_{j = 1}^{p} z_{i j}^{⊤} β_{j},

(2.1)

where vector $z_{i j} = {(z_{i j 1}, \dots, z_{i j k_{j}})}^{⊤}$ collects the dummy variables belonging to the jth predictor variable and $β_{j} = {(β_{j 1}, \dots, β_{j k_{j}})}^{⊤}$ contains the corresponding regression coefficients. Estimation of unknown coefficients and thresholds θ_r relies on well-known maximum likelihood principles. Further details can be found in the online appendix (Section A) and in McCullagh (1980). For reasons of identifiability, however, some restrictions are needed. For instance, one can impose these restrictions by choosing a so-called reference category for each covariate and setting the corresponding regression coefficient to zero. Alternatively, one may set $\sum_{l} β_{j l} = 0 \forall j$ which is also known as ‘effect coding’ and will be used throughout this article. With respect to parameter estimates, both restrictions are equivalent. For illustration, Figure 1 (bottom) gives the estimated dummy coefficients in the POM when fit to the sensory data from Section 2.1.1 using the MASS::polr function in R (Venables and Ripley, 2002). In the first attempt to estimate the model, however, the algorithm did not converge. The function failed to find suitable starting values. This is a known problem with the function polr since it estimates a (binary) logit model to initiate and the function terminates when the GLM does not converge. Convergence problems occur when the fitted probabilities are extremely close to zero or one, as also addressed by Venables and Ripley (2002, p. 197–198). However, fitting was possible when inserting some group lasso penalized estimates as starting values (compare Section 3). It is seen that polr estimates are very wiggly and thus hard to interpret. Moreover, some effects appear to be extremely large (note that the effects refer to the logit scale here), causing some fitted response probabilities to be numerically zero or one. The reason for that becomes clear when looking at the data in Figure 1. Typically, there is only a small number of samples found in the lowest x-category (‘one’), with most (or even all) observations falling in the same and most extreme y-category ‘one’, whereas y-level ‘one’ is hardly found for x-levels greater than 1. As a consequence, fitting by pure maximum likelihood leads to (seemingly) extreme effects. At this point, it should be noted that polr is not the only routine available to fit cumulative logit/POM. For instance, the R package ordinal (Christensen, 2023) could be used as well. However, this package also uses common maximum likelihood and results are virtually identical to those obtained with polr. The main advantage of the ordinal package over polr is that it also allows to include random effects, which is often needed in sensometrics but not the focus of this article.

3 Regularization and model selection

3.1 Groupwise selection and smoothing

With categorical predictors included in the model, resulting in a large (potentially enormous) number of parameters to be estimated, typically, two challenges arise. The first step is to decide on the variables to include in the regression model. Second, we have to decide which categories within one categorical predictor should be distinguished. Furthermore, if maximizing the likelihood using (2.1), only the nominal scale level of the predictor variables is used. As already discussed by Tutz and Gertheiss (2014, 2016) for other types of regression models (namely, models with numerical, one-dimensional response), it can be beneficial to use special penalties to incorporate the predictors’ ordinal scale level. In particular, penalizing differences between coefficients that belong to adjacent categories in combination with a group lasso as done by Gertheiss et al. (2011) seems promising. The rationale behind this approach is as follows. If the levels of a categorical predictor are ordered, it may be assumed that the (expected) response does not change drastically but rather smoothly between neighbouring predictor categories. For simulation studies showing that this method is highly competitive in the classical linear model both in terms of variable selection and estimation accuracy, see Tutz and Gertheiss (2014) and Gertheiss and Tutz (2023). To implement the approach in the cumulative logit model, we will extend the logistic group lasso estimator for binary responses (Meier et al., 2008) to the class of cumulative logit models. This type of estimator is given by the minimizer of the function

l_{λ} (θ, β) = - \frac{1}{n} l (θ, β) + λ \sum_{j = 1}^{p} J_{j} (β_{j}),

(3.1)

where l(θ, β) is the log-likelihood of the cumulative logistic distributionin (2.2); vectors θ and β collect all thresholds θ₁, …, θ_c₋₁ and subvectors β1, …, βp, respectively. In order to take into account the ordinal structure of the predictors, we modify the usual L₂-norm of the group lasso by the first-order difference penalty functions

J_{j} (β_{j}) = \sqrt{\sum_{l = 2}^{k_{j}} d f_{j} {(β_{j l} - β_{j, l - 1})}^{2}} = \sqrt{d f_{j}} D_{1, j} β_{j}_{2},

(3.2)

with k_j being the number of levels of the jth covariate and $d f_{j} = k_{j} - 1. ‖ \cdot ‖_{2}$ denotes the L₂-norm and D _1, _j is the matrix that produces differences of first order for the jth predictor and is given in Appendix A. This first-order penalty both enforces the selection of the whole group of parameters that belong to the same categorical predictor and simultaneously smoothes over the ordered categories.

To find the final solution, we can utilize the block coordinate gradient descent method of Tseng and Yun (2009), as described in Meier et al. (2008). Namely, it combines a quadratic approximation of the log-likelihood with an additional line search. Based on the coordinate descent algorithm, the R package grplasso (Meier, 2020) provides (standard) group lasso estimation for the Poisson, the logistic and the classical linear models. For penalized ordinal-on-ordinal regression in terms of a cumulative logit model with ordinal predictors and corresponding ordinal smoothing penalty (3.2) some modifications have to be made. Using a second-order Taylor series expansion at ${\hat{γ}}^{[t]} = {({\hat{θ}}^{[t] ⊤}, {\hat{β}}^{[t] ⊤})}^{⊤}$ , that is, at the current estimate in iteration t of the algorithm and replacing the Hessian of l(·) by some suitable block diagonal matrix $H^{[t]} = d i a g (H_{0}^{[t]}, H_{1}^{[t]}, \dots, H_{p}^{[t]})$ we specify

M_{λ}^{[t]} (d) = - \{l ({\hat{γ}}^{[t]}) + d^{⊤} \nabla l ({\hat{γ}}^{[t]}) + \frac{1}{2} d^{⊤} H^{[t]} d\} + λ \sum_{j = 1}^{p} \sqrt{d f_{j}} D_{1, j} {\hat{β}}_{j}^{[t]} + d_{j}_{2} \approx l_{λ} ({\hat{γ}}^{[t]} + d),

(3.3)

with $d = {(d_{0}^{⊤}, d_{1}^{⊤}, \dots, d_{p}^{⊤})}^{⊤} \in ℝ^{c - 1 + \sum_{j} (k_{j} - 1)}$ , where the index 0 denotes the elements belonging to θ₁, …, θ_c₋₁ and D _1, _j from above (also see Appendix A.3). Minimization of $M_{λ}^{[t]} (d)$ is carried out with respect to the jth parameter group, j = 0, 1, …, p, as follows: We use the (k_j × k_j) submatrix $H_{j}^{[t]} = h_{j}^{[t]} I_{k_{j}}$ for some scalar $h_{j}^{[t]} \in ℝ$ , where a possible choice is $h_{j}^{[t]} = \min \{\max \{d i a g (\nabla^{2} l {({\hat{γ}}^{[t]})}_{j j}), 10^{- 6}\}, 10^{8}\}$ and $I_{k_{j}}$ is the identity matrix of corresponding dimensions. If

\nabla l {({\hat{γ}}^{[t]})}_{j} - h_{j}^{[t]} {\hat{β}}_{j}^{[t]}_{2} \leq λ \sqrt{d f_{j}},

(3.4)

the minimizer of $M_{λ}^{[t]} (d)$ in (3.3) with respect to group j is $d_{j}^{[t]} = - {\hat{β}}_{j}^{[t]}$ . Otherwise $d_{j}^{[t]} = - \frac{1}{h_{j}^{[t]}} \{\nabla l {({\hat{γ}}^{[t]})}_{j} - λ \sqrt{d f_{j}} \frac{\nabla l {({\hat{γ}}^{[t]})}_{j} - h_{j}^{[t]} {\hat{β}}_{j}^{[t]}}{\nabla l {({\hat{γ}}^{[t]})}_{j} - h_{j}^{[t]} {\hat{β}}_{j}^{[t]}_{2}}\}$ . If $d^{[t]} \neq 0$ , an (inexact) line search using the Armijo rule must be performed. Let α^[ ^t ^] be the largest value among the grid ${\{α_{o} δ^{l}\}}_{l \geq 0}$ such that $l_{λ} ({\hat{γ}}^{[t]} + α^{[t]} d^{[t]}) - l_{λ} ({\hat{γ}}^{[t]}) \leq α^{[t]} σ Δ^{[t]}$ , where $0 < σ < 1, α_{0} > 0$ and ∆^[ ^t ^] is the improvement in l_λ(·) when using a linear approximation for the objective function $Δ^{[t]} = - d^{[t] ⊤} \nabla l ({\hat{γ}}^{[t]}) + λ \sqrt{d f_{j}} ‖ D_{1, j} {\hat{β}}_{j}^{[t]} + d_{j}^{[t]}_{2} - λ \sqrt{d f_{j}} D_{1, j} {\hat{β}}_{j}^{[t]} ‖_{2}$ . Standard choices are α = 1, δ = 0.5 and σ = 0.1. Final, we specify ${\hat{β}}^{[t + 1]} = {\hat{β}}^{[t]} + α^{[t]} d^{[t]}$ . The described procedure, namely, looping through the block coordinates j = 0, 1, …, p: (a) update $H_{j}^{[t]} \leftarrow h_{j}^{[t]} I_{k_{j}}, (b) d^{[t]} \leftarrow \arg \min \{M_{λ}^{[t]} (d)\}, (c) if d^{[t]} \neq 0 : α^{[t]} \leftarrow$ line search, $β^{[t + 1]} \leftarrow β^{[t]} + α^{[t]} d^{[t]}$ , is repeated until some convergence criterion is met. Note that the check for differentiability in condition (3.4) is not necessary for the intercept parameters (because those parameters are not penalized), that is, j = 0 and the solution can be directly computed as $d_{0}^{⊤} = - \frac{1}{h_{0}^{[t]}} \nabla l {({\hat{γ}}^{[t]})}_{0}$ .

The strength of penalization is controlled by the parameter λ. With λ = 0, unpenalized maximum likelihood estimates for categorical variables as described in Section 2.2 are obtained. If λ → ∞ one obtains the extreme case that coefficients that belong to the same predictor are estimated to be equal, which means that the corresponding predictor has no effect on the response. In fact, with any of the restrictions (reference category set to zero or effect coding), all parameters except for thresholds are estimated to be zero if λ → ∞. Further, note that penalized estimates as proposed here are invariant against the concrete choice of the restrictions because the values of both the likelihood and penalty at (3.2) do not depend on the restriction chosen. Note that we do not penalize thresholds θ_r.

3.2 Ordinal-on-ordinal fusion

If a variable is selected by the smoothed group lasso (3.2), all coefficients belonging to that selected covariate are estimated as differing (at least slightly). However, it might be useful to fuse certain categories. Fusion of dummy coefficients of ordinal predictors has already been discussed by Tutz and Gertheiss (2014, 2016) for regression models with numerical, one-dimensional response; namely, the classical linear model and a generalized linear model with binary or count data (Poisson) response. Having data such as those described in Section 2.1 in mind, we extend this previous work to the multivariate case of multi-categorical, ordinal response models; specifically, the framework of the cumulative logit model. Fusion is done by a fused lasso-type penalty (Tibshirani et al., 2005) using the L₁-norm on adjacent categories. That is, adjacent categories may end up with exactly the same coefficient values and one obtains clusters of categories. The idea is to maximize l_λ(θ, β) from (3.1) above with

J_{j} (β_{j}) = \sum_{l = 2}^{k_{j}} |β_{j l} - β_{j, l - 1}| .

(3.5)

Using the L₁-norm for neighbouring categories encourages that adjacent parameters are set exactly equal rather than just being close (as done by penalty (3.2)). Penalty (3.5) thus has the effect that neighbouring categories may be fused (namely, if they have the same β values). The fused lasso also enables the selection of predictors: A predictor is excluded if all its categories are combined into one cluster. As already pointed out in Section 3.1 above, any of the restrictions considered here then means that all parameters belonging to the respective predictor are set to zero. As before, penalty (3.5) is invariant against the concrete choice of restriction. After reparametrization, our fused lasso variant can be fit using existing software for the lasso. Technical details can be found in Appendix A.3.

An alternative to the one-step, penalized likelihood methods presented in this article would be to use regularization for model selection only (i.e., identification of the relevant covariates and relevant differences between predictor levels). Afterward, the final model would be re-fit on the reduced and collapsed data through common, unpenalized maximum likelihood. Such an approach was, for example, used in Gertheiss and Tutz (2010) and is also known under the name ‘post-lasso’ (Belloni et al., 2012; Belloni and Chernozhukov, 2013).

3.3 Tuning parameter, model and feature selection

3.3.1 Cross validation

In the definitions and algorithms above, the penalty parameter λ was fixed at some specific value. However, the choice of an optimal value for λ should be made using the data at hand. A common strategy in penalized regression is K-fold cross-validation as described in many textbooks, for example, Hastie et al. (2009). As a measure of performance, we calculate the Brier score (Brier, 1950) $BS = \sum_{i} \sum_{r} {(v_{i r} - π_{i r})}^{2}$ with the c-dimensional vector $v_{i} = {(v_{i 1}, \dots, v_{i c})}^{⊤}$ of indicator variables, where v_ir = 1 if y_i = r and zero otherwise. Now, over a fine grid G of sensible values λ ∈ G, the optimal λ can be determined by minimizing the cross-validated Brier score. An alternative measure of performance is the so-called ranked probability score (see, for example, Epstein, 1969). On the data considered in this article, however, the results concerning the chosen λ’s were effectively the same.

3.3.2 Stability selection

The previous section described a potential process for selecting the regularization parameter, with the selection of this parameter also implying the particular set of features being selected. Particularly for high-dimensional data, cross-validation techniques can be quite challenging as cross-validated choices may include too many variables (Leng et al., 2006; Meinshausen and Bühlmann, 2006). If we are mainly interested in feature selection, we can alternatively apply stability selection as suggested by Meinshausen and Bühlmann (2010), a promising subsampling strategy combined with high-dimensional variable selection. In general, instead of selecting/fitting one model, the data are perturbed or subsampled many times and we choose those variables that occur in a large fraction of runs. Specifically, for the jth variable, the estimated probability ${\hat{π}}_{j}^{λ}$ of being in the stable selection set corresponds to the frequency of being chosen over all subsamples for a specific λ. In other words, we keep variables with a high selection probability ${\hat{π}}_{j}^{λ} \geq π_{t h r}$ and neglect those with low selection probability. The cutoff value can therefore be seen as a tuning parameter and a typical choice is $π_{t h r} \in (0.6, 0.9)$ (Meinshausen and Bühlmann, 2010). Motivated by the concept of regularisation paths, we draw the stability path: The probability for each variable to be selected when randomly subsampling from the data as a function of λ. Meinshausen and Bühlmann (2010) emphasize that choosing the proper regularization parameter is much less critical for the stability path, which increases the chance of selecting truly relevant variables.

4 Numerical experiments

4.1 Study design

Before applying the presented methodology to the real-world data in Section 5, we carry out a simulation study to investigate the properties of the proposed ordinal-on-ordinal selection approach using penalties (3.2) and (3.5). We generate p = 50 ordinally scaled covariates X₁, …, X₅₀, including 38 noise variables as follows: X₁, …, X₄ have non-monotone effects, X₅, …, X₈ are monotone but non-linear and the effect of X₉, …, X₁₂ is linear across categories. The remaining 38 predictors are irrelevant, that is, with zero effects. Each predictor is assumed to have the same number of levels, either five or nine, depending on the concrete setting considered. Factor levels are then randomly drawn from {1, …, 5} or {1, …, 9}, respectively. The (true) covariate effects are shown in Figure 2. Using those predictors, we construct the ordinal response (with five levels) through the cumulative logit model with linear predictor (2.1) and thresholds θ = (5.5, 6.5, 7.5, 8.5)^⊤ and consider three different sample sizes n = 200, 500 and 1 000.

Figure 2

True effects of influential predictors in the simulation study. (a) five levels without fused effects; (b) nine levels without fused effects; (c) five levels with fused effects; (d) nine levels with fused effects.

We assign ranks to the predictor items according to the order of non-zero ordinal group lasso coefficients in the coefficient path and call this ordinal rank selection (ORS). Similarly, we proceed with the non-zero ordinal fused lasso coefficients in the corresponding coefficient path and call this ordinal rank fusion (ORF). We compare these two approaches to two competitors: (a) the standard POM fit by polr and combined with AIC-based forward stepwise selection where the ordinal covariate is treated as nominal, that is, it is dummy-coded without any penalization and (b) treating the predictor as numeric and employing ordinalNet with a standard lasso penalty for selection. Note that the usual ordinalNet function offers no (nominal/ordinal) group lasso option. Thus, the selection of ordinal responses is only accessible if we ignore the categorical nature of the covariates. The entire data generation and estimation/selection process was repeated 100 times. The results are discussed below.

4.2 Results

4.2.1 Variable selection

To evaluate the methods’ performance in terms of variable selection, we construct the Receiver Operating Characteristic (ROC) (Fawcett, 2006) by varying selection thresholds λ (for ORS, ORF and ordinalNet) and calculate the area under this curve (AUC) in each iteration. Briefly, we evaluate the estimated models as a binary classification task, where covariates are predicted to be present or absent. For polr, the order of variables entering the (forward) stepwise selected model is used for ranking. Variables that were not selected at all are randomly coerced to the back. We only discuss the results of scenario (a) from Figure 2 in detail here, where true effects contain unfused effects of five levels. Due to a lack of space, we refer to the online supplementary material for the other settings’ results. Figures 3 and 4 show the performance in terms of the ROC as obtained with varying λ values, for n = 200 and n = 500, respectively. Additionally, the mean and median AUC for each method are given. The results for the remaining scenarios can be found in the online supplementary material (Figures C1–C12). Even with the relatively large sample size of 500, polr failed in 37 of the datasets (zero cases with n = 1 000, but all cases with n = 200). This result is in line with the issues that we observed with the standard POM for the boar-taint data in Section 2.2. The mean and median AUC over the 63 successful runs with n = 500 were 0.908 and 0.919, respectively. By contrast, the ordinal penalties work very well and both account for the ordinal-on-ordinal nature. In general, the choice between penalties (3.2) and (3.5) depends on the concrete application and personal preferences. While both enforce the selection of predictors, ORS yields smooth effects of predictors, whereas ORF yields parameter estimates that tend to be flat over some neighbouring categories, which represents a specific form of smoothness, too. The classical ordinalNet approach, however, is inferior for model selection since it offers no group lasso option for categorical or ordinal predictors. With n = 1 000 (not shown here), all methods—except for classical lasso—work (almost) equally well in scenarios (a) and (c). In scenarios (b) and (d), that is, with predictors having nine levels each, polr performs even worse than Figure 4, as estimation now fails in all cases with n = 500. Even with n = 1 000, polr is inferior to penalized selection and fusion in the nine predictor level setting (compare the online appendix for details, Figures C4–6 and C10–12).

Figure 3

ROC curves when using ORS, polr, ORF or ordinalNet for finding the relevant variables in terms of a binary ‘classification’ task with n = 200 according to simulation setting (a) from Figure 2 (five levels without fused effects).

Figure 4

ROC curves when using ORS, polr, ORF or ordinalNet for finding the relevant variables in terms of a binary ‘classification’ task with n = 500 according to simulation setting (a) from Figure 2 (five levels without fused effects).

Figure 5

ROC curves when using polr or ordinalFusion concerning the identification of relevant differences (i.e., fusion) of dummy coefficients for n = 1 000. Left: five levels with fused effects; right: nine levels with fused effects; that is, setting (c) and (d) from Figure 2, respectively.

To gain more insights into the different behaviours of the competing methods, we tested whether differences in AUC values are statistically significant. Table C2 in the online appendix presents the p-values of the corresponding (paired) t-tests, Table C3 shows fractions of p-values ≤ .05 of the DeLong test (DeLong et al., 1988). It is seen that the performance of the proposed penalties ORS and ORF does not significantly differ in most cases. Graphical comparisons of AUC values confirm these findings (Figures C13 and C14 in the online appendix). It is a positive result that the ROC and AUC of the considered fusion and smoothing/selection penalty are very similar, indicating that the fusion penalty performs well for variable selection purposes, too.

4.2.2 Level fusion

The results discussed so far concern the identification of relevant predictors. In scenarios (c) and (d), however, we are also interested in comparing the performance regarding the identification of relevant differences (i.e., fusion between adjacent dummy coefficients). For this purpose, we consider estimates using penalty (3.5). Figure 5 shows the performance in terms of ROC as obtained with varying λ and n = 1 000. In addition, we run stepAIC(polr) as a competing model, but on the re-coded design matrix (Walter et al., 1987), thus selecting (presumably relevant) differences between neighbouring categories. The run time of polr increases dramatically as contrasts of factor levels now enter and leave the model in each step separately. Figure C15 in the supplementary material illustrates the false positive/negative rates (FPR/FNR). With five underlying levels (scenario (c)) and n = 500, stepAIC(polr) only converged in 23 cases and, not surprisingly, in zero cases for n = 200. This is why we only present results for the largest sample size n = 1 000 in Figure 5. With nine underlying levels (scenario (d)), estimation with polr failed in all cases with n = 200 and n = 500. The poorer performance compared to variable selection (see also Figures C9 and C12, online appendix) is due to the fact that with selection, it is sufficient for a correct result (in terms of a ‘true positive’) if at least one contrast was selected. For identification of the relevant differences, however, all contrasts of neighbouring categories are checked individually. The finding that AUC-values in Figure 5 tend to be lower in the nine-levels case (right) than in the five-levels case (left) can at least partly be explained by the fact that the former model has much more parameters and that contrasts that were not selected at all are coerced to the back in a random order before drawing or calculating the ROC/AUC. However, the fusion penalty still works better than polr with stepwise selection.

5 Application to case studies

5.1 Perception and acceptance of boar-tainted meat

To illustrate the application of penalized ordinal-on-ordinal regression in practice, we first consider the boar-taint example from Section 2.1.1. The study aimed to identify important factors for consumer acceptance of boar-tainted meat. Determining the most important predictors of overall liking would be of great interest for the development of palatable products. Figure 6 (top) shows the resulting parameter estimates for the selected covariates and different values of tuning parameter λ when applying the proposed groupwise selection and smoothing method. Results of the remaining variables can be found in Appendix D. It is seen that for smaller λ (light gray), the estimates are more wiggly (and more extreme) and become more and more smoothed out/shrunken as λ increases. It is also seen that, at some point, the entire group of dummy coefficients belonging to one variable is set to zero, which means that the corresponding variable is excluded from the model. Figure 6 (bottom) shows the fitted coefficients when the fusion penalty (3.5) is employed. It is seen that some of the categories are fused, indicating that, at least for some variables, fewer rating categories would be sufficient to model the effect on overall liking. In general, the overall liking increases with the liking level of the covariates, with flavour apparently having the largest effect, followed by appearance and texture. On the other hand, the expected liking, the odour and the aftertaste seem to be much less relevant. The obvious consequence of these findings is that, to make meat from entire male pigs more attractive to consumers, the producers (of processed meat) should try to make their products look and, most importantly, taste better, for example, using spices masking the boar taint (Mörlein et al., 2019).

The predictive performance according to five-fold cross-validation on the boar-taint data for both the smooth group and fused lasso as a function of λ is given in Figure 7 (left panels). On the training data, this function is monotonically increasing in λ, as with smaller λ, more emphasis is put on the data. On the validation data, however, we see that the unpenalized POM (see λ → 0, i.e., log₁₀(λ) → −∞) is worse than smoothed selection or fusion. Cross-validation results thus indicate that predictive performance can be enhanced using the suggested penalized method(s). Performance improves on the validation data up to a certain λ value and deteriorates from there. Note that the shift in the curves for the training data is not to be understood in the y-direction but in the x-direction. This is because the penalties are not directly comparable with each other. The optimal smoothing parameter, as determined on the validation set(s), is indicated by the dotted lines in Figure 7. Based on those results, we can use the optimal λ (where the Brier score on the validation data reaches its minimum) to fit the final model. The resulting estimates are marked in blue in Figure 6. The stability paths for the selection penalty (3.2) and the fusion penalty (3.5), indicating the order of relevance of the predictors according to stability selection are found in the supplementary material (Figure C27).

Figure 6

Estimation results for the sensory data from Section 2.1.1. Top: Cumulative group lasso estimates of the regression coefficients as functions of class labels (λ ∈ {10, 5, 1, 0.5, 0.25}/n). Light grey indicates smaller λ. Blue lines correspond to cross-validated/optimal λ = 3.5/n. Bottom: Cumulative fused lasso estimates of the regression coefficients as functions of class labels (λ ∈ {10, 5, 1, 0.5, 0.25}/n). Light grey indicates smaller λ. Blue lines correspond to cross-validated/optimal λ = 4/n.

5.2 Consumers’ willingness to pay for luxury food

We use penalties (3.2) and (3.5) when modelling the luxury food data as described in the Introduction and Section 2.1.2. The descriptive statistics on the data analyzed can be found in the online supplementary material, covering, among other things, a description of the items along with observed frequencies (Table C4 and Figure C16) and the correlation plot (Figure C22). The code used for the presented case study is given in the vignette of our R package ordPens (https://html-preview.github.io/?url=https://github.com/ahoshiyar/ordPens/blob/master/inst/ordinal-on-ordinal.html). For more background information, we refer to Hartmann et al. (2016) who performed a partial least squares structural equation analysis (treating, however, the data as numeric). Using an initial principal component analysis (also treating the data as numeric), the authors revealed seven relevant dimensions concerning consumers’ attitudes toward luxury food: Hedonism, quality, sustainability, materialism, usability, uniqueness and price. In our analysis, by contrast, the research question is more specific in that we try to identify factors that determine the consumers’ willingness to pay for food products that they associate with luxury.

Figure 7

Brier score (averaged over all folds) with ordinal smooth selection penalty (black dashed) and with ordinal fusion penalty (blue solid) for the sensory data from Section 2.1.1 (left panels) and for the luxury food data from Section 2.1.2 (right panels).

5.2.1 Modelling results

Figure 8 illustrates the resulting parameter estimates for selected covariates and different values of tuning parameter λ when applying the ordinal group lasso. Figure 9 shows the corresponding results for the ordinal fused lasso. Similarly to the boar data example, the estimates are more wiggly for smaller λ (light grey) but become smoother and shrunken as λ increases. Especially, the willingness to pay tends to increase among people who (a) eat out frequently in expensive restaurants, (c) associate a high price with good quality and (g) associate luxury with high quality. Our findings thus contribute to the literature on food and luxury by providing additional information about the kind of consumer who may be willing to pay a higher price for ‘luxury food’. For some items, for example, ‘I like to cook myself’ and ‘Luxury food can impress my guests,’ however, we observe rather flat effects that quickly go to zero as soon as λ increases (compare the most right panel in Figure 9), indicating that those items and, hence, corresponding attitudes, provide only limited information regarding the respondent’s willingness to pay for luxury food. As a consequence, for example, a producer of luxury food, it might not be worth targeting consumers (e.g., through a TV ad where a party host impresses the guests by serving luxury food) who primarily want to ‘show off’ (compare Figure 9, bottom right). Instead of that, the food producer should rather emphasize the food’s high quality (compare Figure 9, third panel). From a more general perspective, Wiedmann et al. (2009) and Shirai (2015) already discussed the perceived association of luxury, quality and price. Alfnes and Sharma (2010) and Parsa et al. (2017) specifically considered customers in restaurants, highlighted price as a perceived positive indicator for the quality of food and investigated ways to increase the willingness to pay, respectively. An overview of the estimated effects of all items considered in our study is given in the supplementary material (Figures C17–C21). Since the effects of penalties (3.2) and (3.5) are quite similar, only the results of the fused lasso are presented there.

Figure 8

Cumulative group lasso estimates of the regression coefficients as functions of class labels (λ ∈ 14.5, 10, 5, 1, 0.25/n) for the luxury food data from Section 2.1.2. Light grey indicates smaller λ. Blue lines correspond to cross-validated/optimal λ = 14.5/n.

The (fivefold) cross-validated predictive performance in terms of the Brier Score as a function of λ is given in Figure 7 (right panels). If using the fused lasso and choosing the amount of penalty through (fivefold) cross-validation, we obtain an optimal λ around 18.5/n (marked by the dotted blue line in Figure 7). The corresponding estimates of the covariates’ effects are highlighted in blue in Figure 9. For the variable (g) ‘Luxury food has a good quality,’ for example, it is seen that the upper three categories are fused. If using the ordinal group lasso instead, the optimal (cross-validated) λ is around 14.5/n (marked by the dotted black line in Figure 7), with the resulting estimates in Figure 8 indicated by the blue lines. If choosing λ = 14.5/n to obtain a final model, 17 variables are removed and 26 variables are selected by the ordinal group lasso (with some showing very small effects, though). The most influential factors are those already shown in Figure 8: (a) Expensive restaurants, (c) associating a high price with good quality and (g) associating luxury with high quality. With the fused lasso, however, we might be more interested in detecting significant differences. To understand the fitted model’s complexity, we refer to the degrees of freedom (df). For the fused lasso, an approximation is given by the number of non-zero parameters, that is, non-zero differences of adjacent (dummy) coefficients (cf. Tibshirani et al., 2005). If using λ = 18.5/n for the ordinal fused lasso, we detect 51 relevant differences, which corresponds to about one-third compared to a total of 43 × 4 = 172 parameters in the full model.

Figure 9

Cumulative fused lasso estimates of the regression coefficients as functions of class labels (λ ∈ 18.5, 10, 5, 1, 0.25/n) for the luxury food data from Section 2.1.2. Light grey indicates smaller λ. Blue lines correspond to cross-validated/optimal λ = 18.5/n.

5.2.2 Computational complexity and performance evaluation

With the fusion penalty, in total 3.9 seconds elapsed when estimating the final model (i.e., using the optimal amount of penalization according to Figure 7) on a MacBook with a 2.3 GHz processor and 16 GB RAM. With the smoothing/selection penalty, 5.7 seconds elapsed. It should be noted, though, that the computation for a grid of λ values does not increase the computational time by a factor of the length of the grid compared to a single λ. This is because our algorithm sorts the provided λ in decreasing order and utilizes the results from the previous λ value for a ‘warm start’. This approach significantly reduces the computational burden when working with multiple λ values. For instance, running our ordinal group lasso algorithm (on the complete dataset) on a grid of five different λ values (as done to create Figure 8) took approximately 14.9 seconds.

Figure 10

Prediction accuracy (lower values are better) in terms of the test set Brier (left) and ranked probability (right) score for 50 random splits of the luxury food data from Section 2.1.2.

In order to evaluate the prediction accuracy of the models obtained using the proposed penalties (3.2) and (3.5), including ‘post-(fused) lasso’, we performed repeated random splitting of the luxury data into training and test sets. That means all regression parameters are estimated on the training data (including the choice of the tuning parameter) and then used to predict the test data. As the test set size, we choose 100 and the procedure is independently repeated 50 times. As before in the simulations studies, we compare our ordinal group and fused lasso with the ordinalNet and a common POM/AIC-based stepwise selection. Resulting violin plots are shown in Figure 10. Performance is measured in terms of the test set Brier Score (left) and ranked probability score (right). In the simulation studies in Section 4.2.1, we saw that the common POM with factor variables performed well in large samples and was superior to ordinalNet. In the simulation studies, however, we had a controlled setting with partially non-linear and even non-monotonic effects. Results on the luxury dataset are somewhat different. Although the dataset has a moderate sample size, the unpenalized POM seems to attribute too much importance to random noise in the data and, therefore, tends to overfit. The considered penalty methods, on the other hand, perform equally well, indicating that the covariates’ effects are approximated reasonably well by linear functions of the integer-valued predictor levels (as the latter is done by ordinalNet). The post-lasso does not provide any (further) benefit on this dataset.

6 Summary and discussion

In the present work, we proposed two ways of regularization in ordinal-on-ordinal regression. By use of tailored penalty terms, these methods effectively extract information from ordinally scaled predictors and account for ordinal response variables. In particular, an L₂-difference penalty of first order for the selection of relevant factors and an L₁ variant for fusion of categories were presented. These approaches are more flexible than the methods existing so far for ordinal responses in combination with ordinal predictors and allow for smoothing of covariate effects. The proposed methodology is implemented in R and publicly available on CRAN through add-on package ordPens.

To motivate and illustrate the proposed methods and to show their potential benefits, we presented two real-world examples of item-on-items regression. First, we considered a consumer test on the acceptance and perception of boar-tainted meat. The corresponding dataset consisted of seven Likert-type items with nine levels each. In summary, our presented results suggest that penalized regression works well in the cumulative logit model with ordinally scaled predictors and ordinal group or fused lasso penalty. We combined the penalty approach with stability selection to gain additional information about a variable’s probability to be included in the model, given a certain amount of penalization. Namely, if trying to predict the product’s overall liking score, the stability path revealed that the probability to be selected (within resampling) is highest for the flavour, the texture and the appearance of the meat product.

For comparison, we considered the standard POM implemented in the polr() R-function, which simply treats categorical/ordinal covariates as dummy-coded factors and offers no option for smoothing. We found that the use of polr can cause problems as (a) the employed fitting routine often fails (even for moderately dimensioned setups) and (b) often results in wiggly/implausible estimates of the (categorical) covariates’ effects. Our numerical experiments distinguished between settings involving five or nine predictor levels. In summary, polr only worked well when the sample size was large enough and (often) failed otherwise. Ordinal selection and ordinal fusion worked equally well and outperformed polr in most scenarios. Overall, performance deteriorated with an increasing number of category levels, that is, increasing model complexity. In addition, we compared our smoothing/selection and fusion penalties to the classical lasso for the cumulative logit model as implemented in ordinalNet(). This method offers no group lasso option for the selection of categorical predictors. In our numerical experiments, it was thus not particularly surprising that in terms of AUC values, standard ordinalNet with linear effects performed worst and therefore seems somewhat unsuited for ordinal-on-ordinal regression if at least some of the effects may be non-linear across factor levels or even non-monotone.

Our main interest was in a high-dimensional survey on luxury food expenditure, again consisting of Likert-type items. On the one hand, it can be assumed that in a dataset of this size, not all of the (43) items will be relevant for explaining the response, making variable selection a primary concern. On the other hand, we would like to have estimates that are rather smooth and interpretable while taking the covariates’ ordinal scale level into account. In summary, penalized (cumulative logit) regression appeared to be a sensible approach in studies of this type. For some items, we could observe that most categories were fused. For some (presumably) less relevant items, it could be seen that they reached effects equal to zero quickly as the amount of penalization increased. To obtain a final model, an appropriate penalty parameter was chosen using cross-validation. With the group lasso, eventually, about 40% of the covariates were removed from the model. With the fused lasso, we detected about one-third of the parameters to be relevant (in the sense of non-zero differences between adjacent categories).

Footnotes

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship and/or publication of this article: A. Hoshiyar’s and J. Gertheiss’ research was supported by Deutsche Forschungsgemeinschaft (DFG) under Grant GE2353/2-1.

Supplementary materials

References

Agresti

(2010) Analysis of Ordinal Categorical Data . 2nd ed. New York: Wiley.

Agresti

(2013) Categorical Data Analysis . 3rd ed. New York: Wiley.

Alfnes

and Sharma

(2010) Locally produced food in restaurants: Are the customers willing to pay a premium and why? International Journal of Revenue Management , 4, 238–58.

Archer

and Williams

(2012) L1 penalized continuation ratio models for ordinal response prediction using high-dimensional datasets. Statistics in Medicine , 31, 1464–74.

Barlow

, Bartholomew

, Bremner

and Brunk

(1972) Statistical Inference Under Order Restrictions: The Theory and Application of Isotonic Regression . 1st ed. New York: Wiley.

Bartolucci

, Bacci

and Gnaldi

(2015) Statistical analysis of questionnaires: A unified approach based on R and Stata . Boca Raton, FL: CRC Press.

Belloni

and Chernozhukov

(2013) Least squares after model selection in high-dimensional sparse models. Bernoulli , 19, 521–47.

Belloni

, Chen

, Chernozhukov

and Hansen

(2012) Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica , 80, 2369–2429.

Brier

(1950) Verification of forecasts expressed in terms of probability. Monthly Weather Review , 78, 1–3.

10.

Christensen

and Brockhoff

(2013) Analysis of sensory ratings data with cumulative link models. Journal de la Société Française de Statistique , 154, 58–79.

11.

Christensen

RHB

(2023) ordinal—Regression Models for Ordinal Data. R package version 2023.12–4. Available from: https://CRAN.R-project.org/package=ordinal

12.

DeLong

, DeLong

and Clarke-Pearson

(1988) Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach. Biometrics , 44, 837–45.

13.

Epstein

(1969) A scoring system for probability forecasts of ranked categories. Journal of Applied Meteorology , 8, 985–87.

14.

Commission

European

(2010) European Declaration on Alternatives to Surgical Castration of Pigs . Available from: https://food.ec.europa.eu/system/files/2016-10/aw_prac_farm_pigs_cast-alt_declaration_en.pdf. Accessed 2 June 2023.

15.

Commission

European

(2019) Establishing Best Practices on the Production, the Processing and the Marketing of Meat from Uncastrated Pigs or Pigs Vaccinated Against Boar Taint (Immunocastrated)–Final Report . Available from: https://food.ec.europa.eu/system/files/2019-03/aw_prac_farm_pigs_cast-alt_establishing-best-practices.pdf. Accessed 2 June 2023.

16.

Fahrmeir

, Kneib

, Lang

and Marx

(2013) Regression: Models, Methods and Applications. 1st edition . New York: Springer.

17.

Fahrmeir

and Tutz

(2001) Multivariate Statistical Modelling Based on Generalized Linear Models. 2nd edition . New York: Springer.

18.

Fawcett

(2006) An introduction to ROC analysis. Pattern Recognition Letters , 27, 861–74.

19.

Gertheiss

and Tutz

(2010) Sparse modeling of categorial explanatory variables. The Annals of Applied Statistics , 4, 2150–80.

20.

Gertheiss

, Hogger

, Oberhauser

and Tutz

(2011) Selection of ordinally scaled independent variables with applications to international classification of functioning core sets. Journal of the Royal Statistical Society: Series C , 60, 377–95.

21.

Gertheiss

and Hoshiyar

(2023) ordPens: Selection, Fusion, Smoothing and Principal Components Analysis for Ordinal Variables . R package version 1.1.0. Available from: https://CRAN.R-project.org/package=ordPens.

22.

Gertheiss

and Tutz

(2023) Regularization and Predictor Selection for Ordinal and Categorical Data, 1st edition . pages 199–232. Cham: Springer.

23.

Goodman

(1979) Simple models for the analysis of association in cross-classifications having ordered categories. Journal of the American Statistical Association , 74, 537–52.

24.

Hartmann

, Nitzko

and Spiller

(2016) The significance of definitional dimensions of luxury food. British Food Journal , 118, 1976–98.

25.

Hartmann

, Nitzko

and Spiller

(2017) Segmentation of German consumers based on perceived dimensions of luxury food. Journal of Food Products Marketing , 23, 733–68.

26.

Hastie

, Tibshirani

, Friedman

and Friedman

(2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd edition . New York: Springer.

27.

Hoshiyar

, Gertheiss

, Spiller

and Gertheiss

(2023) Spending on Luxury Food . Zenodo. Available from: https://zenodo.org/record/8383248.

28.

Leng

, Lin

and Wahba

(2006) A note on the lasso and related procedures in model selection. Statistica Sinica , 16, 1273–84.

29.

Liddell

and Kruschke

(2018) Analyzing ordinal data with metric models: What could possibly go wrong? Journal of Experimental Social Psychology , 79, 328–48.

30.

Linting

, Meulman

, van der Kooij

and Groenen

PJF

(2007) Nonlinear principal components analysis: Introduction and application. Psychological Methods , 12, 336–58.

31.

McCullagh

(1980) Regression models for ordinal data. Journal of the Royal Statistical Society: Series B , 42, 109–27.

32.

Meier

(2020) grplasso: Fitting User-Specified Models with Group Lasso Penalty . R package version 0.4-7. Available from: https://cran.r-project.org/web/packages/grplasso/index.html

33.

Meier

, Van De Geer

and Bühlmann

(2008) The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B , 70, 53–71.

34.

Meier-Dinkel

, Gertheiss

, Schnäckel

and Mörlein

(2016) Consumers’ perception and acceptance of boiled and fermented sausages from strongly boar-tainted meat. Meat Science , 118, 34–42.

35.

Meinshausen

and Bühlmann

(2006) High-dimensional graphs and variable selection with the lasso. The Annals of Statistics , 34, 1436–62.

36.

Meinshausen

and Bühlmann

(2010) Stability selection. Journal of the Royal Statistical Society: Series B , 72, 417–73.

37.

Mörlein

, Meier-Dinkel

, Gertheiss

, Schnäckel

and Mörlein

(2019) Sustainable use of tainted boar meat: Blending is a strategy for processed products. Meat Science , 152, 65–72.

38.

Parsa

, Dutta

and Njite

(2017) Consumer behaviour in restaurants: Assessing the importance of restaurant attributes in consumer patronage and willingness to pay. In: Hospitality Marketing and Consumer Behavior , pp. 211–39. Waretown: Apple Academic Press.

39.

R Core Team (2021). R: A Language and Environment for Statistical Computing . Vienna: R Foundation for Statistical Computing. Available from: https://www.r-project.org/

40.

Rufibach

(2010) An active set algorithm to estimate parameters in generalized linear models with ordered predictors. Computational Statistics and Data Analysis , 54, 1442–56.

41.

Shirai

(2015) Impact of “high quality, low price” appeal on consumer evaluations. Journal of Promotion Management , 21, 776–97.

42.

Tibshirani

(1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B , 58, 267–88.

43.

Tibshirani

, Saunders

, Rosset

, Zhu

and Knight

(2005) Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B , 67, 91–108.

44.

Tseng

and Yun

(2009) A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming , 117, 387–423.

45.

Tutz

(2022) Ordinal regression: A review and a taxonomy of models. Wiley Interdisciplinary Reviews: Computational Statistics , 14, e1545.

46.

Tutz

and Gertheiss

(2014) Rating scales as predictors–the old question of scale level and some answers. Psychometrika , 79, 357–76.

47.

Tutz

and Gertheiss

(2016) Regularized regression for categorical data. Statistical Modelling , 16, 161–200.

48.

Venables

and Ripley

(2002) Modern Applied Statistics with S. 4th edition . New York: Springer.

49.

Walter

, Feinstein

and Wells

(1987) Coding ordinal independent variables in multiple regression analyses. American Journal of Epidemiology , 125, 319–23.

50.

Wiedmann

, Hennigs

and Siebels

(2009) Value-based segmentation of luxury consumption behavior. Psychology and Marketing , 26, 625–51.

51.

Wurm

, Rathouz

and Hanlon

(2021) Regularized ordinal regression and the ordinalNet R package. Journal of Statistical Software , 99, 1–42.

52.

Young

, De Leeuw

and Takane

(1976) Regression with qualitative and quantitative variables: An alternating least squares method with optimal scaling features. Psychometrika , 41, 505–29.

53.

Yuan

and Lin

(2006) Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B , 68, 49–67.

54.

Zou

and Hastie

(2005) Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B , 67, 301–320.

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.95 MB

0.00 MB

Regularization and model selection for ordinal-on-ordinal regression with applications to food products’ testing and survey data

Abstract

Keywords

1 Introduction

2 Data examples and modelling framework

2.1 Two case studies

2.1.1 Perception and acceptance of boar-tainted meat

2.1.2 Consumers’ willingness to pay for luxury food

Figure 1

3.1 Groupwise selection and smoothing

3.3.1 Cross validation

3.3.2 Stability selection

4 Numerical experiments

4.1 Study design

Figure 2

True effects of influential predictors in the simulation study. (a) five levels without fused effects; (b) nine levels without fused effects; (c) five levels with fused effects; (d) nine levels with fused effects.

4.2.1 Variable selection

Figure 3

ROC curves when using ORS, polr, ORF or ordinalNet for finding the relevant variables in terms of a binary ‘classification’ task with n = 200 according to simulation setting (a) from Figure 2 (five levels without fused effects).

ROC curves when using ORS, polr, ORF or ordinalNet for finding the relevant variables in terms of a binary ‘classification’ task with n = 500 according to simulation setting (a) from Figure 2 (five levels without fused effects).

ROC curves when using polr or ordinalFusion concerning the identification of relevant differences (i.e., fusion) of dummy coefficients for n = 1 000. Left: five levels with fused effects; right: nine levels with fused effects; that is, setting (c) and (d) from Figure 2, respectively.

5 Application to case studies

5.1 Perception and acceptance of boar-tainted meat

Figure 6

Figure 7

Brier score (averaged over all folds) with ordinal smooth selection penalty (black dashed) and with ordinal fusion penalty (blue solid) for the sensory data from Section 2.1.1 (left panels) and for the luxury food data from Section 2.1.2 (right panels).

Figure 8

Cumulative group lasso estimates of the regression coefficients as functions of class labels (λ ∈ 14.5, 10, 5, 1, 0.25/n) for the luxury food data from Section 2.1.2. Light grey indicates smaller λ. Blue lines correspond to cross-validated/optimal λ = 14.5/n.

Cumulative fused lasso estimates of the regression coefficients as functions of class labels (λ ∈ 18.5, 10, 5, 1, 0.25/n) for the luxury food data from Section 2.1.2. Light grey indicates smaller λ. Blue lines correspond to cross-validated/optimal λ = 18.5/n.

Figure 10

Prediction accuracy (lower values are better) in terms of the test set Brier (left) and ranked probability (right) score for 50 random splits of the luxury food data from Section 2.1.2.

Footnotes

Declaration of Conflicting Interests

Funding

Supplementary materials

References

Supplementary Material