
Abstract

We propose a novel Bayesian model framework for discrete ordinal and count data based on conditional transformations of the responses. The conditional transformation function is estimated from the data in conjunction with an a priori chosen reference distribution. For count responses, the resulting transformation model is novel in the sense that it is a Bayesian fully parametric yet distribution-free approach that can additionally account for excess zeros with additive transformation function specifications. For ordinal categoric responses, our cumulative link transformation model allows the inclusion of linear and non-linear covariate effects that can additionally be made category-specific, resulting in (non-)proportional odds or hazards models and more, depending on the choice of the reference distribution. Inference is conducted by a generic modular Markov chain Monte Carlo algorithm where multivariate Gaussian priors enforce specific properties such as smoothness on the functional effects. To illustrate the versatility of Bayesian discrete conditional transformation models, applications to counts of patent citations in the presence of excess zeros and on treating forest health categories in a discrete partial proportional odds model are presented.

Keywords

discrete responses, Bayesian transformation models, penalised splines, overdispersion, zero inflation, partial proportional odds

1 Introduction

Discrete data commonly occur in almost every scientific area. In this article, we focus on the two relevant cases of count data and ordinal data as special instances of discrete response structures. Before the advent of generalized linear models (GLM, Nelder and Wedderburn, 1972), the peculiarities of count data were either ignored or treated simply by log transformations (Sokal and Rohlf, 1981). Then, the standard modeling approach for count data Y ∈ {0, 1, 2,…} became Poisson regression, $Y ∣ x ~ Po (λ_{x})$ . Since the Poisson distribution often turned out to be too simplistic for many applications, more advanced regression models were introduced as described, for example, by Cameron and Trivedi (1998), Winkelmann (2008) and Hilbe (2011) for negative binomial regression, $Y ∣ x ~ NB (λ_{x}, ν)$ accounting for potential overdispersion. Generalized additive models (GAM, Hastie and Tibshirani, 1990) unify these model types into one framework and drop the linearity assumption for the regression predictor. They require a fixed response distribution that belongs to the exponential family.

Similar to counts, ordered categorical data Y ∈ {1,…, c + 1} occur in many scientific disciplines such as medicine or the social sciences. A researcher in medicine, for example, may want to distinguish between different kinds of infection grades, while an ecologist could be interested in measuring forest health in terms of defoliation categories. Exploiting the natural ordering in these kinds of data is firmly established in the statistical community by cumulative link models as shown in McCullagh (1980). Prominent versions are the discrete proportional odds model and the discrete proportional hazards model (Tutz, 2011). In its simplest form, the cumulative link model is given by $π_{r} = P (Y = r) = F (γ_{r} - x^{T} β) - F (γ_{r - 1} - x^{T} β), r = 1, \dots, c + 1$ with some pre-specified cumulative distribution function F or equivalently $P (Y \leq r) = F (γ_{r} - x^{T} β) = π_{1} + π_{2} + \dots + π_{r}$, where $\sum_{r = 1}^{c + 1} π_{r} = 1$ is required and the ordering $- \infty \equiv γ_{0} < γ_{1} < \dots < γ_{c + 1} \equiv \infty$ must be respected. It is possible to include category-specific regression effects $x^{T} β_{r}$, resulting in a (linear) non-proportional odds (Peterson and Harrell, 1990) or non-proportional hazards model, depending on the choice of F.
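As a minimal numerical sketch of the cumulative link model above, the following Python snippet computes the category probabilities π_r from a set of thresholds and a linear predictor. The logistic choice of F, the function names and all numerical values are illustrative assumptions and are not taken from the applications in this article.

```python
import math


def logistic_cdf(z):
    """Standard logistic CDF, one possible choice for F."""
    return 1.0 / (1.0 + math.exp(-z))


def cumulative_link_probs(thresholds, eta, cdf=logistic_cdf):
    """Return P(Y = r), r = 1, ..., c + 1, for finite thresholds gamma_1 < ... < gamma_c.

    eta plays the role of the linear predictor x^T beta. The conventions
    gamma_0 = -inf and gamma_{c+1} = +inf are handled by fixing the
    cumulative probabilities at 0 and 1.
    """
    if any(a >= b for a, b in zip(thresholds, thresholds[1:])):
        raise ValueError("thresholds must be strictly increasing")
    cum = [0.0] + [cdf(g - eta) for g in thresholds] + [1.0]
    return [cum[r] - cum[r - 1] for r in range(1, len(cum))]


# Hypothetical values: c = 3 thresholds (four categories), x^T beta = 0.5.
probs = cumulative_link_probs([-1.0, 0.0, 1.5], eta=0.5)
print(probs)
```

Because the cumulative probabilities are differenced, the ordering of the thresholds is exactly what keeps every π_r positive.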

The dissemination of Markov chain Monte Carlo (MCMC) simulation techniques led to the development of Bayesian analogues for established models in the form of Bayesian GLMs (Dey et al., 2000) with many extensions, for example, by Frühwirth-Schnatter and Wagner (2006), Frühwirth-Schnatter et al. (2009), Rodrigues (2003) and the Bayesian GAM (Brezger and Lang, 2006). Ghosh et al. (2006) describe a Bayesian treatment of zero-inflated regression models, and Klein et al. (2015a) introduce zero-inflated and overdispersed count data to the framework of Bayesian structured additive distributional regression (Klein et al., 2015b). In a non-transformation environment, Lavine and Mockus (1995) and Dunson (2005), among others, apply a (strictly) isotonic regression function for count responses on the basis of a Dirichlet process mixture prior.

To bridge the gap between discrete ordinal and count regression models, we consider count data as ordinal categorical data with a very high number of intercept thresholds that, however, are not estimated but rather are fixed by design at all non-negative integers. Methodologically, both approaches are unified by the idea of a direct parametrization of the transformation function. Similar to Siegfried and Hothorn (2020), we treat the smooth parametrization of the thresholds as the defining element of the count transformation approach used in this article. While overdispersion is absorbed by the smooth transformation of the counts, we supplement the model with a second component that explicitly accounts for possible zero inflation. For a discussion of the connection between (binary) regression and transformation models, see Doksum and Gasko (1990).

To summarize, in this article, we aim to do the following:

Propose a Bayesian approach for count transformation models based on flexible transformation functions that are inferred from the data, which, in its simplest form with linear covariate shift effects, results in a distribution-free yet interpretable model framework for count data that automatically accounts for over- and underdispersion in the response distribution,

Account for excess zeros in two-component mixture models,

Propose a Bayesian approach for cumulative link transformation models with Bayesian proportional odds and proportional hazards models as special cases,

Allow for the inclusion of category-specific effects, resulting in non-proportional transformation model types,

Combine both model types into the class of Bayesian discrete conditional transformation models (BDCTM) and establish it as an extension of Bayesian conditional transformation models (BCTM) for continuous responses,

Supplement all models with non-linear, possibly high-dimensional covariate effects and interactions, and

Illustrate BDCTM's capability in the presence of count and categoric data in two applications.

The rest of this article is structured as follows: Section 2 introduces the model class we refer to as BDCTM with a preliminary discussion of its building blocks. Section 3 contains a description of posterior estimation. A simulation study evaluating BDCTM's performance in a count data setting is presented in Section 4. Section 5 features an application on patent citation counts and an application on forest health categories. We conclude in Section 6.

2 Bayesian discrete conditional transformation models

In what follows, we introduce BDCTM as a model class that represents a novel approach to the direct estimation of the conditional distribution function $F_{Y ∣ X = x} (y ∣ x)$ based on an independent sample of discrete responses Y₁,…, Y_n conditional on covariates x ₁,…, x _n . We broadly distinguish between cases of count data and ordered categorical data with a finite sample space, which have to be addressed by different assumptions on the sampling distribution and different basis functions.

Let y be an observation of a count or ordered categorical response variable Y and let x^T = (x₁,…, x_q) be a vector of observed explanatory variables. Moreover, let F_Z be the cumulative distribution function of an a priori chosen reference distribution, linking a discrete and monotonically increasing transformation function h(y|x) to the conditional distribution function $F_{Y ∣ X = x} (y ∣ x)$ via the connection

F_{Y ∣ X = x} (y ∣ x) = P (Y \leq y ∣ x) = F_{Z} (h (y ∣ x)) .

(2.1)

The responses are transformed towards the reference distribution conditionally on x by means of the transformation function h(y|x). By allowing different complexities of the transformation function h(y|x), BDCTM is able to resemble and expand on established models for count and ordinal data without requiring a fixed response distribution. The overarching goal of all models described in this article is to obtain an estimate of the distribution function $F_{Y ∣ X = x}$ by means of estimating h(y|x). In contrast to Bayesian CTMs for continuous responses, the transformation function will no longer be bijective since a continuous reference distribution is linked to the CDF of a discrete response variable.

We proceed by discussing each of the components of a BDCTM in more detail. Section 2.1 introduces the basic structure assumed for the transformation functions. Sections 2.2 and 2.3 present model variants for count data and ordinal responses, respectively, while Section 2.4 discusses a generic basis function representation for the transformation functions. Section 2.5 introduces the corresponding prior assumptions, Section 2.6 discusses partial contributions to the transformation function, and Section 2.7 considers the relevance of the choice of the reference distribution.

2.1 Transformation functions

Similar to Hothorn et al. (2014), we assume an additive decomposition on the scale of the transformation function into J partial transformation functions

h (y ∣ x) = \sum_{j = 1}^{J} h_{j} (y ∣ x),

(2.2)

where h_j(y|x) are response-covariate interactions that are monotone only in direction of y. We denote partial transformation functions that depend only on the covariates simply by h( x ). A simple transformation model, for example, is obtained by setting $h_{1} (y ∣ x) = h_{Y} (y)$ and $h_{2} (y ∣ x) = h (x)$ . We explicitly allow the inclusion of linear and non-linear covariate effects, that is,

h (x) = z^{T} β + f_{1} (v) + \dots + f_{L} (v),

(2.3)

where, in $x = {(z^{T}, v^{T})}^{T}$, $z$ contains all covariates associated with linear effects and $v$ contains covariates with assumed non-linear effects f₁,…,f_L.

2.2 Count transformation models

We distinguish between two related model types for count data: simple shift count transformation models that are able to deal with overdispersion and two-component mixture transformation models that can additionally deal with excess zeros.

Mean-shift count transformation models: Regular count transformation models are defined by shifts of the non-linear baseline transformation function h_Y:

F_{Y ∣ X = x} (y ∣ x) = F_{Z} (h_{Y} (⌊y⌋) - h (x))

(2.4)

where $⌊y⌋$ denotes the floor function returning the greatest integer less than or equal to y. Since all moments besides the conditional mean (which is shifted by h(x)) are captured solely by $h_{Y} (⌊y⌋)$, independently of the covariates, the resulting model is not affected by over- or underdispersion. Model (2.4) is similar to a regular linear transformation model, but the application of the floor function leads to jumps at the respective integers, such that the transformation function h_Y(y) is only evaluated at the distinct response values $y \in \{0, 1, 2, \dots\}$ and, as a consequence, the overall transformation is no longer invertible. The likelihood-based version of this model type restricted to linear covariate shifts was discussed in detail in Siegfried and Hothorn (2020).
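The step-function character of model (2.4) can be illustrated with a short Python sketch. The baseline transformation `h_Y` used here, a linear function of log(y + 1), is a hypothetical stand-in for the estimated smooth transformation, and the logistic reference distribution and the shift value are likewise assumptions made only for illustration.

```python
import math


def logistic_cdf(z):
    """Standard logistic reference distribution F_Z (an illustrative choice)."""
    return 1.0 / (1.0 + math.exp(-z))


def h_Y(y):
    """Hypothetical monotone baseline transformation; not an estimated function."""
    return -2.0 + 1.5 * math.log(y + 1.0)


def count_cdf(y, shift, cdf=logistic_cdf):
    """F(y | x) = F_Z(h_Y(floor(y)) - h(x)) for y >= 0, and 0 for y < 0."""
    if y < 0:
        return 0.0
    return cdf(h_Y(math.floor(y)) - shift)


def count_pmf(k, shift, cdf=logistic_cdf):
    """P(Y = k | x), the jump of the step function at the integer k."""
    return count_cdf(k, shift, cdf) - count_cdf(k - 1, shift, cdf)


shift = 0.8 * 0.5  # hypothetical linear shift h(x) = beta * z
# The distribution function is constant between consecutive integers.
print(count_cdf(2.0, shift) == count_cdf(2.7, shift))
print([round(count_pmf(k, shift), 4) for k in range(5)])
```

The covariates only move the argument of F_Z, while the shape of the implied count distribution is governed by `h_Y`.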

Two-component mixture count transformation models: Besides over- and underdispersion, count data often come with an excess number of zeros, which needs to be accommodated in the model. One possibility is to add a second component to the linear transformation function that captures zeros (Hothorn et al., 2018). A transformation function in that vein can be written as:

F_{Y ∣ X = x} (y ∣ x) = F_{Z} (h_{Y} (⌊y⌋) - h (x) + 1 (y = 0) (β_{0} - h_{0} (\tilde{x}))),

(2.5)

where $h_{0} (\tilde{x})$ and h(x) can consist of different linear and non-linear effects of different sets of covariates. This two-component mixture transformation model resembles a hurdle model with hurdle at zero, where the probability of an excess zero is perceived as the mean-shifted deviation from a regular count transformation model at y = 0:

P (Y = 0 ∣ X = x) = F_{Z} (h_{Y} (0) - h (x) + (β_{0} - h_{0} (\tilde{x}))) .

(2.6)

The process generating non-zeros in this case is not explicitly truncated but stems from a transformation function that excludes the zeros.

All count transformation functions of this type have in common that they act on the floor function $⌊y⌋$ , resulting in step functions in direction of y and thus the desired discrete distribution functions. Comparing this to the ordinal response models discussed in the next section, count data transformation models can also be considered as introducing a latent, continuous scale, implicitly determined by the transformation function, with a large number of pre-specified thresholds corresponding to the non-negative integers.
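To make the role of the additional zero component in (2.5) and (2.6) concrete, the following Python sketch evaluates the distribution function with and without an extra shift at y = 0. The baseline `h_Y`, the logistic reference distribution and the values of the shifts are hypothetical. The sketch does not check that the shifted value at zero stays below the value at one, which is needed for a valid distribution function.

```python
import math


def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))


def h_Y(y):
    """Hypothetical monotone baseline transformation (an assumption)."""
    return -2.0 + 1.5 * math.log(y + 1.0)


def zi_count_cdf(k, shift, zero_shift):
    """Model (2.5) for an integer count k.

    shift stands for h(x), zero_shift for beta_0 - h_0(x_tilde); the
    indicator 1(y = 0) switches the second component on at zero only.
    """
    if k < 0:
        return 0.0
    extra = zero_shift if k == 0 else 0.0
    return logistic_cdf(h_Y(k) - shift + extra)


shift, zero_shift = 0.4, 0.5  # hypothetical values
p0_plain = zi_count_cdf(0, shift, 0.0)       # P(Y = 0) without zero component
p0_zi = zi_count_cdf(0, shift, zero_shift)   # P(Y = 0) as in (2.6)
p1_zi = zi_count_cdf(1, shift, zero_shift) - p0_zi
print(p0_plain, p0_zi, p1_zi)
```

A positive value of β₀ − h₀(x̃) raises the probability of a zero relative to the plain shift model and leaves the distribution function at all positive counts unchanged.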

2.3 Cumulative link transformation models

For ordered categorical data, we distinguish between cumulative models with and without category-specific shifts. From a transformation perspective, the former are modeled in terms of response-covariate interactions that can be linear or non-linear in direction of the respective covariate.

Proportional models: The simplest cumulative transformation model is:

F_{Y ∣ X = x} (y_{r} ∣ x) = F_{Z} (h_{Y} (y_{r}) - h (x)),

(2.7)

where the term h( x ), which is independent of the category r, constitutes the log-odds ratio to h(0) or the log-hazard ratio in model types (2.4) and (2.7), depending on the choice of reference distribution.

Non-proportional models: Models of type (2.7) can be generalized by a category-specific shift resulting in the following model:

F_{Y ∣ X = x} (y_{r} ∣ x) = F_{Z} (h_{Y} (y_{r}) + h_{r} (x)),

(2.8)

where h_r(x) induces the category-specific shifts, resulting in linear or non-linear non-proportional odds or hazards models depending on F_Z and on whether h_r(x) consists of linear or non-linear effects. Partial proportional models as shown in the application in Section 5.2 consist of a mixture of proportional and non-proportional effects. The reparameterization illustrated in the following section guarantees that the implied probabilities P(Y = r) = F_Z(γ_r - h_r(x)) - F_Z(γ_{r - 1} - h_{r - 1}(x)) are always positive.

2.4 A generic joint basis

We assume that each of the J partial transformation functions can be approximated by a linear combination of basis functions c _j such that

h_{j} (y ∣ x) = c_{j} {(y, x)}^{T} γ_{j},

where γ_j is a vector of basis coefficients. Based on the additivity assumption in (2.2), the complete conditional transformation function can be denoted as

h (y ∣ x) = c {(y, x)}^{T} γ

(2.9)

with joint basis

c (y, x) = {(c_{1} {(y, x)}^{T}, \dots, c_{J} {(y, x)}^{T})}^{T}

and γ contains all partial basis coefficient vectors,

γ = {(γ_{1}^{T}, \dots, γ_{J}^{T})}^{T} .

(2.10)

This allows us to write all discrete conditional transformation models treated in this article in the general form:

F_{Y ∣ X = x} (y) = F_{Z} (c {(y, x)}^{T} γ) .

(2.11)

We call models of type (2.11) Bayesian discrete conditional transformation models (BDCTM). They can be conceived as extensions of the versatile model class of BCTM for continuous responses that were introduced by Carlan et al. (2020), taking the additional challenges arising from discrete responses into account. In this tradition, a BDCTM is fully specified by a reference distribution F_Z, the joint basis c(y, x) and a vector of basis coefficients γ together with suitable priors, which are introduced in the next section. The rest of this section discusses the generic basis that is used by the BDCTM in greater detail.

Let a_j denote a basis transformation of y with dimension D₁, collecting evaluated basis functions $B_{j 1 d_{1}} (y), d_{1} = 1, \dots, D_{1}$, and let b_j denote a basis transformation of x with dimension D₂, collecting evaluated basis functions $B_{j 2 d_{2}} (x), d_{2} = 1, \dots, D_{2}$. The resulting effects are approximated by the following linear combinations:

h_{j} (y) = \sum_{d_{1} = 1}^{D_{1}} γ_{j 1 d_{1}} B_{j 1 d_{1}} (y) = a_{j} {(y)}^{T} γ_{j 1}, h_{j} (x) = \sum_{d_{2} = 1}^{D_{2}} γ_{j 2 d_{2}} B_{j 2 d_{2}} (x) = b_{j} {(x)}^{T} γ_{j 2},

where $γ_{j 1} = {(γ_{j 11}, \dots, γ_{j 1 D_{1}})}^{T}$ and $γ_{j 2} = {(γ_{j 21}, \dots, γ_{j 2 D_{2}})}^{T}$ are partially reparameterized versions of the vectors of corresponding basis coefficients β _j ₁ and β _j ₂. The conditional transformation approach commonly involves response-covariate interactions (e.g., model types (2.6) and (2.8)), which is why we parametrize each partial transformation function generically as

\begin{matrix} h_{j} (y ∣ x) = c_{j} {(y, x)}^{T} γ_{j} = (a_{j} {(y)}^{T} \otimes b_{j} {(x)}^{T}) γ_{j} \\ = \sum_{d_{1} = 1}^{D_{1}} \sum_{d_{2} = 1}^{D_{2}} γ_{j, d_{1} d_{2}} B_{d_{1}} (y) B_{d_{2}} (x), \end{matrix}

(2.12)

where the Kronecker product forms parametric interactions between the evaluated basis functions, and γ_j is a vector of basis coefficients of dimension D = D₁D₂. A collection of special cases can be found in Section 2.6.
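The equivalence between the Kronecker form and the double sum in (2.12) can be checked numerically with a few lines of Python. The evaluated basis vectors and coefficients below are arbitrary illustrative numbers, not B-spline evaluations from a fitted model, and the helper names are hypothetical.

```python
def kron_vec(a, b):
    """Kronecker product of two vectors; entry (d1, d2) sits at index d1 * len(b) + d2."""
    return [a_d1 * b_d2 for a_d1 in a for b_d2 in b]


def partial_transformation(a, b, gamma):
    """h_j(y | x) = (a(y)^T kron b(x)^T) gamma_j for evaluated bases a and b."""
    c = kron_vec(a, b)
    if len(c) != len(gamma):
        raise ValueError("gamma must have dimension D1 * D2")
    return sum(c_d * g_d for c_d, g_d in zip(c, gamma))


# Hypothetical evaluated bases with D1 = 3 and D2 = 2.
a_y = [0.2, 0.5, 0.3]
b_x = [1.0, -0.4]
gamma = [0.1, 0.2, 0.4, 0.3, 0.9, 0.5]

h_kron = partial_transformation(a_y, b_x, gamma)
h_double_sum = sum(
    gamma[d1 * len(b_x) + d2] * a_y[d1] * b_x[d2]
    for d1 in range(len(a_y))
    for d2 in range(len(b_x))
)
print(h_kron, h_double_sum)
```

The index convention, with the response index running slowest, matches the ordering of the coefficients used in the reparameterization of this section.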

We require all transformation functions to be strictly monotonically increasing solely in the direction of y but not in direction of the explanatory variables such that $F_{Y ∣ X = x} (y_{j} ∣ x) < F_{Y ∣ X = x} (y_{j + 1} ∣ x)$ for all y_j < y_j₊₁. This property needs to be accommodated in the basis. For this, we adopt the approach of Pya and Wood (2015) for monotonically increasing smooth functions. The vector γ_j is reparameterized as $γ_{j} = Σ_{j} {\tilde{β}}_{j}$, where $Σ_{j} = Σ_{D_{1}} \otimes I_{D_{2}}$ and $Σ_{D_{1}}$ is given by the lower triangular matrix of size D₁ such that $Σ_{D_{1}, k l} = 0$ if k < l and $Σ_{D_{1}, k l} = 1$ if k ≥ l. The vector ${\tilde{β}}_{j}$ of dimension D = D₁ D₂ contains a mixture of unexponentiated and exponentiated β-coefficients given by

{\tilde{β}}_{j} = {(β_{j, 11}, \dots, β_{j, 1 D_{2}}, \exp (β_{j, 21}), \dots \exp (β_{j, 2 D_{2}}), \dots, \exp (β_{j, D_{1} D_{2}}))}^{T} .

(2.13)

and $I_{D_{2}}$ is an identity matrix of size D₂. An unconditional transformation function h_Y(y) is obtained by setting D₂ = 1 and a function of type h ( x ) is obtained by setting D₁ = 1.
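For the unconditional case D₂ = 1, the effect of the reparameterization can be sketched in Python as follows. The sketch assumes that the lower triangular matrix carries ones on and below the diagonal, as in Pya and Wood (2015). The function names and the coefficient values are hypothetical.

```python
import math


def lower_triangular_ones(d):
    """Sigma_{D1}: entry (k, l) is 1 if k >= l and 0 otherwise (assumed convention)."""
    return [[1.0 if k >= l else 0.0 for l in range(d)] for k in range(d)]


def beta_tilde(beta):
    """First coefficient untouched, all remaining coefficients exponentiated."""
    return [beta[0]] + [math.exp(b) for b in beta[1:]]


def gamma_from_beta(beta):
    """gamma = Sigma * beta_tilde for D2 = 1."""
    sigma = lower_triangular_ones(len(beta))
    bt = beta_tilde(beta)
    return [sum(s * b for s, b in zip(row, bt)) for row in sigma]


beta = [-1.0, 0.3, -2.0, 0.0]  # unconstrained, hypothetical values
gamma = gamma_from_beta(beta)
print(gamma)
```

Each γ_k equals the first coefficient plus a cumulative sum of strictly positive increments, so the basis coefficients are increasing for any real-valued β.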

The vector of basis coefficients for the whole conditional transformation function h(y| x ) is given by $γ = Σ \tilde{β}$ , where $\tilde{β} = {({\tilde{β}}_{1}^{T}, \dots, {\tilde{β}}_{J}^{T})}^{T}$ is based on $β = {(β_{1}^{T}, \dots, β_{J}^{T})}^{T}$ . Matrix Σ is block diagonal with Σ _j as diagonal elements.

Of course, other basis specifications could be employed to set up BDCTMs, as long as monotonicity along y is ensured. For example, the increasing splines considered in continuous ordinal regression (Manuguerra and Heller, 2010) would be a potential alternative. We rely on Bayesian P-splines and their tensor product interactions since these have been extensively studied in Bayesian structured additive regression and enable efficient and stable computations.

2.5 Priors

We adopt the principle of Bayesian P-splines (Lang and Brezger, 2004) and assume partially improper multivariate Gaussian priors for the unconstrained vectors β_j1 and β_j2 (on which the reparameterized vectors γ_j1 and γ_j2 are based) such that

\begin{array}{l} p (β_{j 1} ∣ τ_{j 1}^{2}) \propto {(\frac{1}{τ_{j 1}^{2}})}^{\frac{rk (K_{j 1})}{2}} \exp (- \frac{1}{2 τ_{j 1}^{2}} β_{j 1}^{T} K_{j 1} β_{j 1}), \\ p (β_{j 2} ∣ τ_{j 2}^{2}) \propto {(\frac{1}{τ_{j 2}^{2}})}^{\frac{rk (K_{j 2})}{2}} \exp (- \frac{1}{2 τ_{j 2}^{2}} β_{j 2}^{T} K_{j 2} β_{j 2}), \end{array}

(2.14)

where $τ_{j 1}^{2}$ and $τ_{j 2}^{2}$ are marginal smoothing variances, rk(·) is the rank of a matrix, and K _j1 and K _j2 are potentially rank deficient prior precision matrices. The generic formulation of the precision matrix associated with γ_j is given by

K_{j} = \frac{1}{τ_{j 1}^{2}} (K_{j 1} \otimes I_{D_{2}}) + \frac{1}{τ_{j 2}^{2}} (I_{D_{1}} \otimes K_{j 2}),

where precision matrices K_j1 and K_j2 control the penalty in the direction of y and x, respectively. For unconditional transformation functions or pure covariate functions, K_j2 or K_j1, respectively, is set to 0 such that only the prior precision of the corresponding effect is used. Specific choices are discussed in the next section. The model precision matrix K is given as the block diagonal matrix with matrices K_j as diagonal elements.

The smoothing variances $τ_{j 1}^{2}$ and $τ_{j 2}^{2}$ are associated with inverse gamma priors, $τ_{j 1}^{2} ~ IG (a_{j 1}, b_{j 1})$ and $τ_{j 2}^{2} ~ IG (a_{j 2}, b_{j 2})$. All model parameters are collected in $ϑ = {(β_{1}, \dots, β_{J}, τ_{11}^{2}, τ_{12}^{2}, \dots, τ_{J 1}^{2}, τ_{J 2}^{2})}^{T}$ with joint prior $p (ϑ)$.

2.6 Partial transformations

We start this section by introducing the two types of basis functions we use in a, depending on whether Y is a count variable or discrete ordinal, followed by a brief discussion of choices for b together with suitable precision matrices.

Smooth basis for count transformations: In case of a count response $Y \in \{0, 1, \dots\}$, a consists of B-spline basis functions $B_{d_{1}}$, i.e. $a_{j} (y) = {(B_{1} (y), \dots, B_{D_{1}} (y))}^{T}$. It may be useful to parametrize the transformation function on the log-scale, i.e. $a_{j} (\log (y))$ or $a_{j} (\log (y + 1))$, where especially the latter can be beneficial numerically if there are many small and some large counts. Smooth monotonic effects of a count transformation subject to the reparameterization in (2.13) are supplemented with a penalty matrix $K_{j 1} = D_{1}^{T} D_{1}$ based on a (D₁ – 2) × D₁ partial first-difference matrix D₁ that is zero except that $D_{i, i + 1} = - D_{i, i + 2} = 1$ for i = 1,…,D₁ - 2 to achieve shrinkage towards a straight line (Pya and Wood, 2015).
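The structure of this penalty can be made explicit with a small Python sketch that builds the partial first-difference matrix and the resulting precision matrix for a hypothetical basis dimension D₁ = 5. The helper names are illustrative and plain nested lists are used instead of a matrix library.

```python
def first_difference_matrix(d1):
    """(d1 - 2) x d1 matrix with D[i][i+1] = 1 and D[i][i+2] = -1 (0-based rows).

    The first column is zero, so the first (unexponentiated) coefficient
    is left unpenalized.
    """
    rows = []
    for i in range(d1 - 2):
        row = [0.0] * d1
        row[i + 1] = 1.0
        row[i + 2] = -1.0
        rows.append(row)
    return rows


def gram(D):
    """K = D^T D."""
    ncol = len(D[0])
    return [[sum(r[a] * r[b] for r in D) for b in range(ncol)] for a in range(ncol)]


def quad_form(K, beta):
    """beta^T K beta."""
    n = len(beta)
    return sum(beta[a] * K[a][b] * beta[b] for a in range(n) for b in range(n))


D1 = first_difference_matrix(5)
K = gram(D1)
flat = [3.0, 0.5, 0.5, 0.5, 0.5]   # equal entries after the first coefficient
rough = [0.0, 1.0, 2.0, 4.0, 4.0]  # squared differences: 1 + 4 + 0
print(quad_form(K, flat), quad_form(K, rough))
```

The quadratic form is the sum of squared differences between neighbouring coefficients from the second one onwards. It vanishes when those coefficients are equal, which after exponentiation corresponds to equal increments in γ.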

Discrete basis for ordinal categorical data: For ordered categorical responses $Y \in \{1, \dots, c + 1\}$, we assign one parameter to each category except for the reference category c + 1 (Hothorn et al., 2018). As a basis, we use the unit vector e_c of length c, i.e. a_j(y_r) = e_c(r), where

Y = r \Leftrightarrow e_{c} (r) = {(0, \dots, 1, \dots, 0)}^{T}, r = 1, \dots, c .

(2.15)

The corresponding precision matrix is K _j1 = 0.

Bases for covariate effects: For covariate effects, we allow linear bases b_j(z) = (z₁,…,z_p)^T together with precision matrix K_j2 = 0 and B-spline bases for non-linear effects $b_{j} (v) = {(B_{1} (v), \dots, B_{D_{2}} (v))}^{T}$ with a second-order random-walk precision matrix K_j2. All bases involving B-spline basis functions can be centered around zero for identification purposes.

Transformation random effects h_j(x) = β_g are based on the grouping indicator g ∈ {1,…, G}. The corresponding G-dimensional basis vector b_j(g) has entry one if x belongs to group g and zero otherwise. We set K_j = I_G for i.i.d. random effects. Regular non-monotonic tensor splines as used in the forest health application in Section 5.2 can be retrieved by using the specification in (2.12) and setting γ_j = β_j.

2.7 Reference distribution

In the context of discrete conditional transformation models, the reference distribution function F_Z plays the role of the inverse link function controlling the interpretational scale of the impact of the explanatory variables. While it can be chosen arbitrarily in theory, we concentrate on distributions with log-concave densities for F_Z to guarantee uniqueness of the maximum likelihood estimate, which usually will also imply unimodality of the posterior. Furthermore, it is advised to consider characteristics such as right-skewness or the support of the count data distribution in the selection process. Prominent choices for F_Z are

F_SL(z) = (1 + exp(-z))⁻¹, that is, the standard logistic distribution,

$Φ (z)$ , that is, the standard normal distribution, and

F_MEV(z) = 1 - exp(- exp(z)), that is, the minimum extreme value distribution

This results in logit, probit or cloglog interpretations of the covariate effects. Setting F_Z = F_SL, for example, results in the discrete proportional odds model and F_Z = F_MEV results in the proportional hazards model, with h( x ) becoming the log-odds ratio or the log-hazard ratio to h(0), respectively (Hothorn et al., 2018).
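The three reference distributions and the scales they induce can be written down directly in Python using only the standard library. The function names and the numerical values in the check at the end are illustrative assumptions.

```python
import math


def F_SL(z):
    """Standard logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-z))


def Phi(z):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def F_MEV(z):
    """Minimum extreme value distribution function."""
    return 1.0 - math.exp(-math.exp(z))


def logit(p):
    return math.log(p / (1.0 - p))


def cloglog(p):
    return math.log(-math.log(1.0 - p))


# Hypothetical baseline value h_Y(y) = 1.0 and shift h(x) = 0.4.
h_y, h_x = 1.0, 0.4
log_odds_change = logit(F_SL(h_y - h_x)) - logit(F_SL(h_y))
cloglog_change = cloglog(F_MEV(h_y - h_x)) - cloglog(F_MEV(h_y))
print(log_odds_change, cloglog_change)
```

On the logit and cloglog scales, a shift h(x) moves the transformed distribution function by the same amount at every value of y, which is the sense in which the effect is proportional.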

To reflect specific properties of the data-generating process, other link functions that have been considered in the context of GLM, such as skew-logistic or t-distributed link functions to reflect strong asymmetry or heavy tails, may be considered. However, given the flexibility of the transformation function, we do not expect large gains from such specifications since both asymmetry and tail behaviour should be taken up by the transformation function, leaving only a small potential for improving the fit via the link function. We therefore suggest to stick to the defaults and to select the reference distribution according to preferences on model interpretation.

2.8 Transformation probability mass functions

In this section, we introduce the transformation probability mass functions (PMFs) resulting from the different sampling assumptions that come with count and ordinal categoric data as well as the resulting transformation likelihoods. To emphasize that γ are partially non-linear reparameterizations of β, we write γ(β). Following Hothorn et al. (2018), the log-transformation PMF of a conditionally independent (count) response Y with unbounded support Y ∈ {0, 1, …} is given by

\log (f_{Z} (y ∣ β)) = \begin{cases} \log [F_{Z} (c {(y_{k}, x)}^{T} γ (β))] & k = 1 \\ \log [F_{Z} (c {(y_{k}, x)}^{T} γ (β)) - F_{Z} (c {(y_{k - 1}, x)}^{T} γ (β))] & k > 1. \end{cases}

(2.16)

In case of an ordinal categorical response with bounded support Y ∈ {y₁,…, y_c₊₁}, the corresponding conditional distribution function needs to take the additional constraint for the reference category c + 1, $P (Y \leq y_{c + 1} ∣ X = x) = F_{Z} (h (y_{c + 1} ∣ x)) = 1$, into account. The transformation PMF is then given by

f_{Z} (y ∣ β) = \begin{cases} F_{Z} (c {(y_{k}, x)}^{T} γ (β)) & k = 1 \\ F_{Z} (c {(y_{k}, x)}^{T} γ (β)) - F_{Z} (c {(y_{k - 1}, x)}^{T} γ (β)) & k = 2, \dots, c \\ 1 - F_{Z} (c {(y_{c}, x)}^{T} γ (β)) & k = c + 1. \end{cases}

(2.17)

With the convention $F_{Z} (h (y_{0})) = F_{Z} (h (- \infty)) = 0$ and $F_{Z} (h (y_{c + 1})) = F_{Z} (h (\infty)) = 1$ , the conditional PMF simplifies to

f_{Z} (y_{k} ∣ β) = F_{Z} (c {(y_{k}, x)}^{T} γ (β)) - F_{Z} (c {(y_{k - 1}, x)}^{T} γ (β))

(2.18)

encompassing count and ordered categoric models in a unified framework (Hothorn et al., 2018). Based on (2.18), the transformation log-likelihood for independent observations (y_i, x _i ), i = 1,…,n is given by

l (β) = \sum_{i = 1}^{n} \log (F_{Z} (c {(y_{i}, x_{i})}^{T} γ (β)) - F_{Z} (c {(y_{i} - 1, x_{i})}^{T} γ (β))) .

The likelihood is chosen according to the discrete response structure only, while the transformation function determines whether excess zeros are accounted for or if the category-specific effects are included, for example. With all building blocks in mind, a BDCTM can be fully specified by the set $\{ϑ ∣ F_{Z}, c, π_{ϑ} (\cdot)\}$ of unknown model parameters $ϑ$ , given a choice for the basis c , the reference distribution F_Z and the joint prior $π_{ϑ}$ (Carlan et al., 2020).
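A minimal Python sketch of the transformation log-likelihood for count data is given below. It uses a hypothetical fixed baseline `h_Y` and a single linear shift βz in place of the full basis expansion c(y, x)ᵀγ(β), so it illustrates only the interval-type structure of the likelihood and not the estimation of the transformation function. The toy data are invented for illustration.

```python
import math


def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))


def h_Y(y):
    """Hypothetical monotone baseline transformation (an assumption)."""
    return -2.0 + 1.5 * math.log(y + 1.0)


def log_likelihood(beta, counts, covariates, cdf=logistic_cdf, baseline=h_Y):
    """Sum over i of log(F_Z(h(y_i | x_i)) - F_Z(h(y_i - 1 | x_i))).

    For y_i = 0 the lower term is set to 0, following the convention
    F_Z(h(-inf)) = 0.
    """
    total = 0.0
    for y, z in zip(counts, covariates):
        upper = cdf(baseline(y) - beta * z)
        lower = cdf(baseline(y - 1) - beta * z) if y > 0 else 0.0
        total += math.log(upper - lower)
    return total


# Invented toy data.
counts = [0, 2, 5, 1, 0, 3]
covariates = [0.1, 0.7, 0.9, 0.4, 0.2, 0.6]
for beta in (0.0, 0.8, 1.6):
    print(beta, log_likelihood(beta, counts, covariates))
```

The same function would serve an ordinal response if the baseline returned the category thresholds and the last category were assigned an upper value of one.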

3 Posterior inference

For Bayesian inference, we rely on MCMC simulation techniques. We sketch the most relevant parts of the algorithm in this section.

Update of the basis coefficients: The log-full conditional of the basis coefficients (up to an additive constant) is given by

\log (p (β ∣ \cdot)) \propto l (β) - \frac{1}{2} β^{T} K β,

where the second term arises from the multivariate Gaussian prior. The gradient of the unnormalized log-posterior is needed for inference and is given by

s (β) = \sum_{i = 1}^{n} \frac{f_{Z} (c {(y_{i}, x_{i})}^{T} Σ \tilde{β}) c {(y_{i}, x_{i})}^{T} Σ C - f_{Z} (c {(y_{i} - 1, x_{i})}^{T} Σ \tilde{β}) c {(y_{i} - 1, x_{i})}^{T} Σ C}{F_{Z} (c {(y_{i}, x_{i})}^{T} Σ \tilde{β}) - F_{Z} (c {(y_{i} - 1, x_{i})}^{T} Σ \tilde{β})} - K β,

where C is a diagonal matrix of size D with entries

C_{d d} = \begin{cases} 1, & \text{if } {\tilde{β}}_{d} = β_{d} \\ \exp (β_{d}), & \text{otherwise.} \end{cases}
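The factor Σ C in the score function is the Jacobian of γ(β) = Σ β̃ with respect to β. For the case D₂ = 1, and assuming Σ has ones on and below the diagonal, this can be verified against finite differences with the following Python sketch. The helper names, the step size and the coefficient values are hypothetical.

```python
import math


def gamma_from_beta(beta):
    """gamma_k = beta_1 + sum_{l = 2..k} exp(beta_l), i.e. Sigma * beta_tilde for D2 = 1."""
    out, running = [], 0.0
    for d, b in enumerate(beta):
        running += b if d == 0 else math.exp(b)
        out.append(running)
    return out


def jacobian_sigma_c(beta):
    """Analytic Jacobian Sigma C with C = diag(1, exp(beta_2), ..., exp(beta_D))."""
    d = len(beta)
    c_diag = [1.0] + [math.exp(b) for b in beta[1:]]
    return [[c_diag[l] if k >= l else 0.0 for l in range(d)] for k in range(d)]


def jacobian_finite_diff(beta, eps=1e-6):
    """Forward-difference approximation of d gamma / d beta."""
    base = gamma_from_beta(beta)
    d = len(beta)
    jac = [[0.0] * d for _ in range(d)]
    for l in range(d):
        bumped = list(beta)
        bumped[l] += eps
        moved = gamma_from_beta(bumped)
        for k in range(d):
            jac[k][l] = (moved[k] - base[k]) / eps
    return jac


beta = [-1.0, 0.3, -2.0, 0.0]
analytic = jacobian_sigma_c(beta)
numeric = jacobian_finite_diff(beta)
max_abs_diff = max(
    abs(analytic[k][l] - numeric[k][l]) for k in range(4) for l in range(4)
)
print(max_abs_diff)
```

A check of this kind is a cheap safeguard when implementing the score function by hand, since gradient-based samplers are sensitive to errors in the derivative.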

Strong dependencies among the variables (which are partly due to the monotonicity restriction) complicate the sampling from the posterior distribution. This is further impeded by the mixed linear and non-linear dependence of the transformation function on β and $\tilde{β}$, respectively. Therefore, we use the No-U-Turn sampler (NUTS, Hoffman and Gelman, 2014) with dual averaging (Nesterov, 2009) for efficient exploration of the target distribution. The adaptive and dynamic nature of NUTS enables a streamlined estimation process that removes the need for costly preliminary tuning runs (needed for setting the number of leapfrog steps and the step size parameter) at the expense of some computation time per iteration. In the following, we distinguish between the burn-in period, which determines the number of samples that are discarded at the beginning of a Markov chain, and the warm-up period, which controls the length of the adaptive phase of the algorithm. Due to the high dependence between parameter blocks, all basis coefficients are updated in one step, followed by successive updates of the smoothing variances.

Update of the smoothing variances: In the univariate case, updating the smoothing variance is straightforward by using the full-conditional:

τ_{j}^{2} ∣ \cdot ~ IG (a_{j} + \frac{rk (K_{j})}{2}, b_{j} + \frac{1}{2} β_{j}^{T} K_{j} β_{j}),

where K _j is specified as shown in Section 2.6. However, in case of tensor splines based on a multivariate Gaussian prior with precision matrix,

\frac{1}{τ_{j 1}^{2}} (K_{j 1} \otimes I_{D_{2}}) + \frac{1}{τ_{j 2}^{2}} (I_{D_{1}} \otimes K_{j 2}),

(3.1)

we need to consider the generalized determinant of (3.1) when updating the smoothing variances. This complicates sampling, which is why we introduce an anisotropy parameter $ω_{j} \in (0, 1)$, resulting in an alternative representation of the precision given by

\frac{1}{τ_{j}^{2}} K_{j} = \frac{1}{τ_{j}^{2}} [ω_{j} (K_{j 1} \otimes I_{D_{2}}) + (1 - ω_{j}) (I_{D_{1}} \otimes K_{j 2})],

where ω_j controls how much prior information is assigned to each of the two covariates of the tensor spline. For the BDCTM, we consider a discrete prior for ω_j, which allows us to pre-compute a finite set of generalized determinants that can be used within the MCMC simulations (see Kneib et al., 2019, for a detailed explanation of this approach).

In the following, the hyperparameters of the inverse gamma prior are set to a_j₁ = a_j₂ = 1, b_j₁ = b_j₂ = 0.001, resulting in good and stable performance in all investigated cases.
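The univariate Gibbs update of a smoothing variance can be sketched in Python with the standard library alone. The sketch assumes that IG(a, b) denotes the inverse gamma distribution with shape a and scale b, so that a draw is obtained as the reciprocal of a gamma variate. The function names, the penalty matrix and the coefficient vector are hypothetical.

```python
import random


def quad_form(K, beta):
    """beta^T K beta."""
    n = len(beta)
    return sum(beta[i] * K[i][j] * beta[j] for i in range(n) for j in range(n))


def draw_smoothing_variance(beta, K, rank_K, a=1.0, b=0.001, rng=random):
    """Draw tau^2 from IG(a + rk(K) / 2, b + beta^T K beta / 2)."""
    shape = a + 0.5 * rank_K
    rate = b + 0.5 * quad_form(K, beta)
    # If X ~ Gamma(shape, scale = 1 / rate), then 1 / X ~ IG(shape, rate).
    return 1.0 / rng.gammavariate(shape, 1.0 / rate)


# Hypothetical penalty K = D^T D for four coefficients (rank 2) and a coefficient vector.
K = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, -1.0, 0.0],
    [0.0, -1.0, 2.0, -1.0],
    [0.0, 0.0, -1.0, 1.0],
]
beta = [0.2, 1.0, 0.4, 0.9]

rng = random.Random(2024)
draws = [draw_smoothing_variance(beta, K, rank_K=2, rng=rng) for _ in range(5)]
print(draws)
```

The defaults a = 1 and b = 0.001 mirror the hyperparameter values stated above. The rank of the penalty matrix is passed in explicitly because it is known from the construction of K.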

Numerical stability: Klein et al. (2015a) observed numerical problems if zero-inflation was wrongfully assumed when in fact, for example, a simple Poisson model was due. One reason is that the estimated predictor for the probability of an extra zero tends towards minus infinity in log-space. This is usually not an issue in models of type (2.5) as the coefficients that are related to the zero component are not exp-transformed. In cumulative models with category-specific effects, however, flat sections can lead to divergent transitions in which case weakly identified coefficients have to be dropped from the model (Pya and Wood, 2015). This issue can be remedied by adding $ϵ = 10 e^{- 6}$ to the diagonal of the precision matrix in this case. Moreover, the target acceptance rate can be increased to up to .99 to keep transitions in check.

Software: All computations were carried out in R version 4.1.0 (R Core Team, 2020). To improve computation time, likelihoods and score functions were implemented via the package Rcpp (Eddelbuettel et al., 2011). The mass matrix adaptation scheme was adopted from adnuts (Monnahan and Kristensen, 2018).

4 Simulation study

In this section, we present a simulation experiment that highlights the possible advantages of the count transformation approach in general and that compares our Bayesian approach with the likelihood-based linear count transformation model by Siegfried and Hothorn (2020).

Count transformation models can mimic most well-known models for count data. Therefore, a meaningful simulation study in this setting needs to consider the sensitivity of the flexible transformation function with respect to the true data-generating process. In other words, it needs to investigate to what extent the flexible transformation function is able to accommodate potential overdispersion and other characteristics of possibly complex data-generating processes.

Simulation design: We use a similar simulation design to Siegfried and Hothorn (2020) with the following properties:

One covariate is generated via $z ~ U [0, 1]$ .

Conditional on z, we consider five different count data generating processes (DGPs)

– Poisson with mean and variance $E (Y ∣ z) = V (Y ∣ z) = \exp (1.2 + 0.8 z)$,
– Negative binomial with mean $E (Y ∣ z) = \exp (1.2 + 0.8 z)$ and variance $V (Y ∣ z) = E (Y ∣ z) + E {(Y ∣ z)}^{2} / 3$, and
– Three different count data-generating processes according to $F_{Z} (a_{(8)} {(\log (y + 1))}^{T} γ - z β)$ with β = 0.8 and the reference distributions $F_{Z} = F_{SL}$ (logit), $F_{Z} = Φ$ (probit) and $F_{Z} = F_{MEV}$ (cloglog).

Each dataset was estimated by their corresponding true (oracle) models, that is, a Poisson GLM (mp), a negative binomial GLM (mnb), BDCTMs (bmlo denotes the logistic model, bmpr denotes the probit model and bmcll denotes the cloglog model) and a frequentist count transformation model (Siegfried and Hothorn, 2020) implemented in the R-package cotram (Siegfried and Hothorn, 2021), where mlo stands for the logistic model, mpr stands for the probit model and mcll for the cloglog model. Each model type was estimated for each DGP, resulting in 5 × 8 = 40 models in total.

Training and validation sample sizes are set to 250 and 750, respectively.

The simulation experiment was repeated in 100 replications with a total iteration number of 2,000 and a burn-in and warm-up phase of length 1,000, such that 1,000 iterations are being used for computing the estimates.

Each model fit is quantified by means of the centered out-of-sample log-likelihood, that is, the difference between the out-of-sample log-likelihood of the model and the out-of-sample log-likelihood of the true data-generating process, both evaluated on a hold-out sample. This takes a predictive perspective that implicitly controls for differences in complexity between the models. The results presented in Figure 1 confirm most of the findings of Siegfried and Hothorn (2020) regarding the merits of the count transformation approach.
Figure 1
Comparison of count data-generating processes on basis of centered out-of-sample log-likelihoods obtained from the respective model. Larger out-of-sample log-likelihoods indicate a better performance of the corresponding model.
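
The first two data-generating processes and the centered out-of-sample log-likelihood can be sketched as follows. This is only an illustration under simplifying assumptions: the negative binomial is drawn as a gamma-Poisson mixture with size parameter 3 (which yields the stated variance), the misspecified Poisson model uses the true mean instead of a mean estimated on a training sample, and all function names are hypothetical.

```python
import math
import random


def rpois(lam, rng):
    """Draw one Poisson variate with Knuth's multiplication method."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def mean_fun(z):
    return math.exp(1.2 + 0.8 * z)


def log_pois(y, mu):
    return y * math.log(mu) - mu - math.lgamma(y + 1)


def log_negbin(y, mu, size):
    """Negative binomial log-pmf with mean mu and variance mu + mu^2 / size."""
    return (math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
            + size * math.log(size / (size + mu))
            + y * math.log(mu / (size + mu)))


def simulate(n, dgp, rng, size=3.0):
    data = []
    for _ in range(n):
        z = rng.random()                            # z ~ U[0, 1]
        lam = mean_fun(z)
        if dgp == "negbin":                         # gamma-Poisson mixture
            lam = rng.gammavariate(size, lam / size)
        data.append((z, rpois(lam, rng)))
    return data


def centered_loglik(data, model_logpmf, true_logpmf):
    """Out-of-sample log-likelihood of a model minus that of the true DGP."""
    return sum(model_logpmf(y, z) - true_logpmf(y, z) for z, y in data)


rng = random.Random(1)
validation = simulate(750, "negbin", rng)


def pois_model(y, z):        # misspecified: ignores overdispersion
    return log_pois(y, mean_fun(z))


def nb_truth(y, z):          # true data-generating process
    return log_negbin(y, mean_fun(z), 3.0)


print(centered_loglik(validation, pois_model, nb_truth))
```

Values close to zero indicate that a model predicts almost as well as the true process, while strongly negative values signal misspecification.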

Based on these results, we can make the following statements:
The Poisson model, being the most rigid model, shows the worst performance with respect to the out-of-sample log-likelihood, if misspecified.

As expected, the negative binomial model performs well for the Poisson and the overdispersed case, but shows inferior performance in the remaining scenarios.

The fit of both the BDCTM and the cotram model is robust for all considered DGPs, effectively redeeming the promise of providing a flexible model framework for count data that is applicable in many situations.

The BDCTM seems to perform better than cotram in the more complicated scenarios and worse especially in settings where a simple Poisson model would be due; this may be less surprising considering BDCTM's spline-based nature in comparison to cotram's use of Bernstein polynomials.

The simulation study confirms the robustness of the BDCTM in the presence of different data-generating processes. Its fit is satisfactory in all investigated cases and highly competitive in the more complicated scenarios. While the Poisson distribution only works well in simple scenarios, the negative binomial distribution also works quite well for most scenarios (except the Poisson case). Still, BDCTMs outperform negative binomial regression uniformly over all but the Poisson and the negative binomial scenario.
5 Applications

We illustrate possible applications of the BDCTM in this section. For better readability, we attach the number of basis functions q to the basis as a subscript, writing $a_{(q)}$. Code required for reproducing the following applications is openly accessible.^*

5.1 Patent citations with excess zeros

Similar to an author of a scientific publication, an inventor who applies for a patent has to cite all existing patents their work is based on. We analyze the citation number (ncit : y) of patents granted by the European Patent Office (EPO). The considered dataset includes five dummy variables and three continuous variables. The available continuous covariates are the grant year (year), the number of designated states (ncountry) and the number of patent claims (nclaims). (For a full description of the explanatory variables in the data set of n = 4,805 observations, see Jerak and Wagner, 2006.) A high rate of zeros (≈ 46%) and a wide spread, ncit ∈ {0,…, 40}, hint at the presence of zero-inflation and overdispersion. A rigorous investigation of this presumption has to consider whether it holds conditional on the covariates. We let the sampler run for 2,000 iterations with a burn-in and warm-up phase of length 1,000 such that 1,000 iterations are obtained for inference.

We start our investigation with the simple linear transformation model (BDCTM_lin):

F_{SL} (a_{(8)} {(\log (⌊y + 1⌋))}^{T} γ - z^{T} β),

(5.1)

where the linear predictor $z^{T} β$ contains all available covariates. As a first in-sample assessment of the practical capabilities of our transformation approach, we want to inspect to what extent the observed frequencies ${obs}_{r} = \sum_{i = 1}^{n} 1 (y_{i} = r)$ in the data set match the expected frequencies ${exp}_{r} = \sum_{i = 1}^{n} f (r; {\hat{γ}}_{i})$ derived from the model. Figure 2 displays the rootograms as introduced by Kleiber and Zeileis (2016) obtained from the model in Equation (5.1), from a Poisson and from a negative binomial GLM with all covariates included in the predictors. Rootograms make use of a horizontal reference line (at zero) to highlight the discrepancies between observed and expected frequencies. The Poisson model clearly underfits the zeros and exhibits an undulating pattern, overpredicting counts between 1 and 4 and underpredicting the rest, which is a sign of substantial overdispersion. The flexible transformation function of the BDCTM is able to emulate the overdispersion-robust negative binomial model, which is reflected in the bars being closely aligned with the x-axis.

Figure 2

Patent citations. Rootograms of the linear Poisson, the linear negative binomial and the simple linear BDCTM model.
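
The quantities behind a hanging rootogram can be computed in a few lines. The sketch below assumes a Poisson model with given fitted means as a stand-in for any fitted count model. The toy data, the fitted means and the function names are hypothetical and unrelated to the patent data.

```python
import math


def pois_pmf(r, mu):
    return math.exp(r * math.log(mu) - mu - math.lgamma(r + 1))


def rootogram_table(y, fitted_means, max_count):
    """Observed and expected frequencies plus the lower end of each hanging bar."""
    rows = []
    for r in range(max_count + 1):
        observed = sum(1 for yi in y if yi == r)
        expected = sum(pois_pmf(r, mu) for mu in fitted_means)
        # Bars of height sqrt(observed) hang from sqrt(expected);
        # a lower end below zero means the count r is underpredicted.
        rows.append((r, observed, expected,
                     math.sqrt(expected) - math.sqrt(observed)))
    return rows


# Toy data with many zeros; one fitted mean per observation.
y = [0, 0, 0, 0, 1, 2, 0, 5, 0, 3]
fitted_means = [1.1] * len(y)
for r, observed, expected, lower_end in rootogram_table(y, fitted_means, 5):
    print(r, observed, round(expected, 2), round(lower_end, 2))
```

Working on the square-root scale makes deviations at small expected frequencies, such as those in the right tail, easier to see.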

In summary, this first visual inspection of the goodness-of-fit confirms that BDCTM is able to ameliorate the impact of overdispersion on the model fit.

We also want to pursue the assumption of excess zeros. For this, we consider a two-component model (BDCTM_hurdle-lin) in the vein of (2.5) with $h (x) = h_{0} (x) = z^{T} β$ :

F_{SL} (a_{(8)} {(\log (⌊y + 1⌋))}^{T} γ - z^{T} β + 1 (y = 0) (β_{0} - z^{T} β)),

where, again, z contains all explanatory variables in the data set. As GLM analogues, we consider the zero-inflated versions of the Poisson and of the negative binomial models. Previous analyses of the data set revealed that assuming non-linear relationships for the continuous covariates can improve the estimation results (Klein et al., 2015a). This does not automatically hold for the BDCTM, because the explanatory variables impact the response on a different scale (the scale of the transformation). Therefore, we estimated models of type (2.4) and (2.5), while replacing the covariate functions with additive functions of type (2.3), that is, $h (x) = h_{0} (x) = z^{T} β + f (\mathrm{ncountry}) + f (\mathrm{year}) + f (\mathrm{nclaims})$, where z now only contains the discrete covariates. In what follows, we refer to these partially non-linear models as BDCTM_nl and BDCTM_hurdle-nl, respectively. Figure 3 shows the estimated non-linear effects of ncountry, year and nclaims on the log-odds ratio from model BDCTM_nl.

Figure 3

Patent citations. Posterior mean estimates of the effects of nclaims, ncountry and year on the log-odds ratio, together with 95% credible intervals. Remaining covariates are held constant at their mean or are set to zero in case of dummy variables. Estimates belong to BDCTM_nl.

In the next step, we compared all models in terms of randomized quantile residuals as proposed by Rigby et al. (2008). For every observation y_i, we computed residuals ${\hat{r}}_{i} = Φ^{- 1} (u_{i})$, where $Φ^{- 1}$ is the quantile function of the standard normal distribution and u_i is randomly drawn from $U (F (y_{i} - 1 ∣ \hat{γ}), F (y_{i} ∣ \hat{γ}))$ with plugged-in estimates $\hat{γ}$. Here, $F (\cdot ∣ \hat{γ})$ is the estimated conditional distribution function. Residuals obtained from the true model follow a standard normal distribution, which is why deviations can be checked by quantile-quantile plots. Figure 4 shows the Q-Q plots of the considered models. Again, the Poisson model reveals a lack of fit, represented by the strong deviations from the normal line, which also holds true for its zero-inflated counterpart to a somewhat lesser extent. The negative binomial models provide a considerably better fit but seem to be surpassed by the BDCTMs, which show the best aptitude for inferring the distribution of patent citations while at the same time providing a flexible ‘sans souci’ approach, abolishing the need to search for the ‘right’ count distribution in general.

Figure 4

Patent citations. Comparison of quantile residuals obtained by BDCTM models with and without additional zero component with various generalized linear and zero-inflated models.
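
Randomized quantile residuals for a discrete response can be sketched as follows. The example assumes a Poisson distribution function as a stand-in for the estimated conditional distribution function and evaluates the residuals under the true model. The function names, the seed and the clamping constant `eps` are illustrative choices.

```python
import math
import random
from statistics import NormalDist


def pois_cdf(y, mu):
    """P(Y <= y) for a Poisson(mu) variable; returns 0 for y < 0."""
    if y < 0:
        return 0.0
    return min(1.0, sum(math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))
                        for k in range(y + 1)))


def randomized_quantile_residual(y, cdf, rng, eps=1e-12):
    """Draw u ~ U(F(y - 1), F(y)) and map it through the normal quantile function."""
    u = rng.uniform(cdf(y - 1), cdf(y))
    u = min(max(u, eps), 1.0 - eps)    # keep u strictly inside (0, 1)
    return NormalDist().inv_cdf(u)


rng = random.Random(42)
mu = 3.0
sample = []
for _ in range(2000):
    u, y = rng.random(), 0
    while pois_cdf(y, mu) < u and y < 200:   # inverse-cdf draw from Poisson(mu)
        y += 1
    sample.append(y)

residuals = [randomized_quantile_residual(y, lambda v: pois_cdf(v, mu), rng)
             for y in sample]
mean = sum(residuals) / len(residuals)
var = sum((r - mean) ** 2 for r in residuals) / (len(residuals) - 1)
print(round(mean, 3), round(var, 3))   # near 0 and near 1 under the true model
```

The randomization within the interval between the two distribution function values removes the discreteness of the response, so the residuals can be compared with a continuous normal reference in a Q-Q plot.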

For a more rigorous assessment of the out-of-sample performance, we conclude our analysis with an evaluation based on proper scoring rules. Originally proposed by Gneiting and Raftery (2007), they serve as summary measures for the predictive power of a model. Based on data y₁, …, y_R in a validation sample and estimated probabilities ${\hat{p}}_{r} = ({\hat{p}}_{r 0}, {\hat{p}}_{r 1}, \dots)$ obtained from the predictive distribution ${\hat{p}}_{r k} = f (y_{r} = k ∣ \hat{γ})$, scores are computed by taking the sum of the individual score contributions, $S = \sum_{r = 1}^{R} S ({\hat{p}}_{r}, y_{r})$. We consider the three most prominent scores:

Brier score: $S ({\hat{p}}_{r}, y_{r}) = - \sum_{k} {(1 (y_{r} = k) - {\hat{p}}_{r k})}^{2}$ ,

Logarithmic score: $S ({\hat{p}}_{r}, y_{r}) = \log ({\hat{p}}_{r y_{r}})$ (out-of-sample likelihood), and

Spherical score: $S ({\hat{p}}_{r}, y_{r}) = \frac{{\hat{p}}_{r y_{r}}}{\sqrt{\sum_{k} {\hat{p}}_{r k}^{2}}}$ .
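
The three scores can be written down directly from their definitions. The sketch below assumes that each predictive distribution is available as a probability vector truncated at some maximum count. The forecasts and outcomes shown are made-up numbers for illustration only.

```python
import math


def brier_score(p, y):
    """Negative squared distance between the indicator vector of y and p."""
    return -sum(((1.0 if k == y else 0.0) - pk) ** 2 for k, pk in enumerate(p))


def log_score(p, y):
    """Log of the predicted probability of the observed value."""
    return math.log(p[y])


def spherical_score(p, y):
    """Predicted probability of y divided by the Euclidean norm of p."""
    return p[y] / math.sqrt(sum(pk ** 2 for pk in p))


def score_sum(score, forecasts, outcomes):
    """Sum of the individual score contributions over a validation sample."""
    return sum(score(p, y) for p, y in zip(forecasts, outcomes))


# Two hypothetical forecasts over the counts 0, 1, 2 and the observed values.
forecasts = [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3]]
outcomes = [0, 2]
for score in (brier_score, log_score, spherical_score):
    print(score.__name__, score_sum(score, forecasts, outcomes))
```

All three scores are oriented such that larger values indicate better forecasts.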

The probabilistic forecasts collected in ${\hat{p}}_{r}$ for the responses y_r are assessed by 10-fold cross-validation. Table 1 shows the score sums obtained from the four BDCTM models introduced in this section, together with the widely applicable information criterion for Bayesian models (WAIC, Watanabe, 2010). The cotram model is specified equivalently to BDCTM_lin, which is why their similar performance in terms of the quadratic (Brier) and spherical scores is not surprising. Note that the logarithmic score considers only one probability of the predictive distribution and is therefore vulnerable to outliers and extreme observations, which could explain the better performance of BDCTM_lin in that regard. Accounting for excess zeros and allowing for non-linear effects both improve predictive power, culminating in the dominating performance of BDCTM_hurdle-nl across all measures except the WAIC, where the zero component did not lead to improvements. The scores could be further improved by a model selection procedure as shown in Klein et al. (2015a).

Table 1.

Patent citations. Score sums of all models obtained via 10-fold cross-validation. Calculation of the WAICs on basis of the whole data set. Best results are depicted in bold font

Model	Logarithmic	Quadratic	Spherical	WAIC
BDCTM_lin	-8119.67	-3444.53	2530.84	6257.85
BDCTM_hurdle-lin	-8091.47	-3438.87	2534.39	6224.634
BDCTM_nl	-8110.94	-3440.52	2533.98	6040.573
BDCTM_hurdle-nl	-8044.69	-3427.77	2543.44	6184.174
cotram	-8174.92	-3443.07	2531.23	-

5.2 A partial proportional odds model for forest health assessment

This short analysis involving non-linear category-specific effects is based on data from the forest of Rothenbuch (Spessart) over the years 1982-2004. Every year, the health status is evaluated and categorized by the response variable defol measuring defoliation grades. Since data is sparse in some of the original nine categories (0%, 12.5%, …, 100%), we aggregated them into three defoliation grades: 1 = no (0%), 2 = weak (12.5% - 37.5%) and 3 = severe (≥ 50%). Among others, the dataset comes with the covariates canopy (canopy density in percentage), x, y (x- and y-coordinates of location) and id (tree location identification number). (See Fahrmeir et al., 2013, for a full description of the dataset.) The goal of this analysis is to determine the effect of the covariates on the degree of defoliation. Since the forest data is notorious for confounding and high autocorrelation, we let the sampler run for 10,000 iterations with a burn-in and warm-up phase of length 1,000.

For this, we set up the partial proportional odds model

\begin{matrix} F_{Y ∣ X = x} (y_{r}) = F_{SL} (e {(\mathrm{defol})}^{T} γ_{1} + {(e {(\mathrm{defol})}^{T} \otimes b_{(10)} {(\mathrm{canopy})}^{T})}^{T} γ_{2} \\ - b {(\mathrm{id})}^{T} β_{3} \\ - {(b_{(10)} {(x)}^{T} \otimes b_{(10)} {(y)}^{T})}^{T} β_{4}), \end{matrix}

where we assume non-linear category-specific shifts of canopy, a transformation random effect for the tree location groups and a spatial non-linear effect on the basis of a tensor spline for the coordinates x and y. Figure 5 shows the estimated non-linear category-specific effect for canopy. The section for 0 ≤ canopy ≤ 25 displays almost parallel curves, which then vary more and more individually until they even cross. The variance of the estimated random effect for id is 2.42, and the standard deviation is 1.55. Figure 6 shows the estimated random intercepts. In a preliminary run, we observed the same problems with confounding in location-specific effects as Fahrmeir et al. (2013), which could be improved to some extent by adding the spatial effect. It is displayed in Figure 7.

Figure 5

Forest health: estimated non-linear category-specific effect of canopy, “no defoliation” in red, “severe defoliation” in blue, together with 95%-credible intervals.

Figure 6

Forest health: median-sorted estimated random intercepts for tree location groups.

Figure 7

Forest health: estimated two-dimensional spatial effect with triangles indicating observed tree locations based on 2nd order penalties.

6 Discussion

With the BDCTM, we present a novel Bayesian model framework for discrete data that combines cumulative link models with models for count data through directly modeling the conditional distribution function. Approaching these discrete data structures from the transformation perspective allows us to unify models that are usually treated separately under the same umbrella. The BDCTM is flexible in the sense that it permits the user to control interpretability by means of choosing a reference distribution in conjunction with an additive transformation function. Estimating the conditional distribution function directly makes deriving distributional aspects such as the conditional quantiles straightforward by numerical inversion of $F_{Z} (h (y ∣ x))$ (Siegfried and Hothorn, 2020). Furthermore, our Bayesian inferential procedure lets us obtain credible intervals and other quantities of interest without having to rely on large sample approximations. All high-dimensional effects are joined with suitable prior specifications, resulting in smooth effects across the board.

We demonstrate BDCTM's ability to handle under- or overdispersion in an adaptive fashion without restrictive distributional assumptions in Sections 4 and 5. A short investigation of a nonlinear non-proportional odds model highlights the versatility of our approach. In a model selection context, the unifying scope of the transformation function turns out to be a valuable simplification because there is just one “predictor” that has to be constructed. Though not shown in this article, it is possible to establish a relationship between overdispersion and the covariate effects by including full non-linear interactions between the count response and the respective explanatory variable. Constructing the conditional transformation function can be difficult as informed decisions about which effects to include and to interact with the response are required. Therefore, it would be desirable to develop an effect selection strategy via spike and slab priors in the spirit of Klein et al. (2021) for the BDCTM that could effectively tell the user what kind of effect is impacting the regular count process, the zero component or overdispersion.

As demonstrated in Section 5.2, our cumulative link transformation approach can be supplemented with category-specific linear or non-linear effects by modeling them as response-covariate interactions. This way, popular models such as (non-)proportional odds or hazards models can be retrieved simply by specifying the reference distribution. Both the count and the ordinal model could be supplemented with a more flexible link function as proposed by Aranda-Ordaz (1983), that is,

F (h) = 1 - {(λ \exp (h) + 1)}^{- λ^{- 1}},

which depends on an auxiliary parameter $λ \in (0, \infty)$, mediating between the complementary log-log link for λ → 0 and the logistic link for λ = 1. Horowitz (2001) avoided specifying the link function entirely. A Bayesian version would entail prior distributions on the space of nonparametric continuous reference distributions.
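
The two limiting cases of this link family can be checked numerically. The sketch below evaluates $F (h) = 1 - {(λ \exp (h) + 1)}^{- λ^{- 1}}$ at λ = 1, where it coincides with the standard logistic distribution function, and at a very small λ, where it approaches 1 − exp(−exp(h)), the minimum extreme value distribution function. Function names and the evaluation points are illustrative.

```python
import math


def aranda_ordaz_cdf(h, lam):
    """F(h) = 1 - (lam * exp(h) + 1)^(-1 / lam) for lam > 0."""
    # log1p keeps the evaluation stable when lam * exp(h) is tiny.
    return 1.0 - math.exp(-math.log1p(lam * math.exp(h)) / lam)


def logistic_cdf(h):
    return 1.0 / (1.0 + math.exp(-h))


def min_extreme_value_cdf(h):
    return 1.0 - math.exp(-math.exp(h))


for h in (-2.0, 0.0, 1.5):
    print(h,
          aranda_ordaz_cdf(h, 1.0), logistic_cdf(h),
          aranda_ordaz_cdf(h, 1e-8), min_extreme_value_cdf(h))
```

Treating λ as an additional unknown would therefore let the data choose between proportional odds and proportional hazards type models, at the price of one more parameter to sample.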

To conclude, we believe that in this article, the BDCTM is established as a flexible, modular modeling framework in the world of discrete data that is competitive in many modern scenarios.

Supplementary material

Supplementary material is available online.

Supplemental Material for Bayesian discrete conditional transformation models by Manuel Carlan, Thomas Kneib, in Statistical Modelling

Footnotes

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Funding

Thomas Kneib received financial support from the DFG within the research project KN 922/9-1. The work of Manuel Carlan was supported by DFG via the research training group 1644.

Note

References

Aranda-Ordaz (1983) An extension of the proportional-hazards model for grouped data. Biometrics, 39, 109–17.

Brezger and Lang (2006) Generalized structured additive regression based on Bayesian P-splines. Computational Statistics & Data Analysis, 50, 967–91.

Cameron and Trivedi (1998) Regression Analysis of Count Data. London: Cambridge University Press.

Carlan, Kneib and Klein (2020) Bayesian Conditional Transformation Models. arXiv e-prints, arXiv:2012.11016.

Dey, Ghosh and Mallick (2000) Generalized Linear Models: A Bayesian Perspective. Boca Raton: CRC Press.

Doksum and Gasko (1990) On a correspondence between models in binary regression analysis and in survival analysis. International Statistical Review, 58, 243–52.

Dunson (2005) Bayesian semiparametric isotonic regression for count data. Journal of the American Statistical Association, 100, 618–27.

Eddelbuettel, Francois, Allaire, Ushey, Kou, Russel, Chambers and Bates (2011) Rcpp: Seamless R and C++ integration. Journal of Statistical Software, 40, 1–18.

Fahrmeir, Kneib, Lang and Marx (2013) Regression: Models, Methods and Applications. New York: Springer.

Fruhwirth-Schnatter and Wagner (2006) Auxiliary mixture sampling for parameter-driven models of time series of counts with applications to state space modelling. Biometrika, 93, 827–41.

Fruhwirth-Schnatter, Fruhwirth, Held and Rue (2009) Improved auxiliary mixture sampling for hierarchical models of non-Gaussian data. Statistics and Computing, 19, 479–92.

Ghosh, Mukhopadhyay and Lu (2006) Bayesian analysis of zero-inflated regression models. Journal of Statistical Planning and Inference, 136, 1360–75.

Gneiting and Raftery (2007) Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102, 359–78.

Hastie and Tibshirani (1990) Generalized Additive Models, volume 43. Boca Raton: CRC Press.

Hilbe (2011) Negative Binomial Regression. London: Cambridge University Press.

Hoffman and Gelman (2014) The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15, 1593–1623.

Horowitz (2001) Nonparametric estimation of a generalized additive model with an unknown link function. Econometrica, 69, 499–513.

Hothorn, Kneib and Buhlmann (2014) Conditional transformation models. Journal of the Royal Statistical Society: Series B, 76, 3–27.

Hothorn, Most and Buhlmann (2018) Most likely transformations. Scandinavian Journal of Statistics, 45, 110–34.

Jerak and Wagner (2006) Modeling probabilities of patent oppositions in a Bayesian semiparametric regression framework. Empirical Economics, 31, 513–33.

Kleiber and Zeileis (2016) Visualizing count data regressions using rootograms. The American Statistician, 70, 296–303.

Klein, Kneib and Lang (2015a) Bayesian generalized additive models for location, scale, and shape for zero-inflated and overdispersed count data. Journal of the American Statistical Association, 110, 405–19.

Klein, Kneib, Lang and Sohn (2015b) Bayesian structured additive distributional regression with an application to regional income inequality in Germany. The Annals of Applied Statistics, 9, 1024–52.

Klein, Carlan, Kneib, Lang and Wagner (2021) Bayesian effect selection in structured additive distributional regression models. Bayesian Analysis, 16, 545–73.

Kneib, Klein, Lang and Umlauf (2019) Modular regression - a lego system for building structured additive distributional regression models with tensor product interactions. Test, 28, 1–39.

Lang and Brezger (2004) Bayesian P-splines. Journal of Computational and Graphical Statistics, 13, 183–212.

Lavine and Mockus (1995) A nonparametric Bayes method for isotonic regression. Journal of Statistical Planning and Inference, 46, 235–48.

Manuguerra and Heller (2010) Ordinal regression models for continuous scales. The International Journal of Biostatistics, 6.

McCullagh (1980) Regression models for ordinal data. Journal of the Royal Statistical Society: Series B (Methodological), 42, 109–27.

Monnahan and Kristensen (2018) No-U-turn sampling for fast Bayesian inference in ADMB and TMB: Introducing the adnuts and tmbstan R packages. PLoS ONE, 13, e0197954.

Nelder and Wedderburn (1972) Generalized linear models. Journal of the Royal Statistical Society: Series A, 135, 370–84.

Nesterov (2009) Primal-dual subgradient methods for convex problems. Mathematical Programming, 120, 221–59.

Peterson and Harrell (1990) Partial proportional odds models for ordinal response variables. Journal of the Royal Statistical Society: Series C (Applied Statistics), 39, 205–17.

Pya and Wood (2015) Shape constrained additive models. Statistics and Computing, 25, 543–59.

R Core Team (2020) R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. URL https://www.R-project.org/.

Rigby, Stasinopoulos and Akantziliotou (2008) Instructions on how to use the gamlss package in R. Computational Statistics and Data Analysis, 2, 194–95.

Rodrigues (2003) Bayesian analysis of zero-inflated distributions. Communications in Statistics-Theory and Methods, 32, 281–89.

Siegfried and Hothorn (2020) Count transformation models. Methods in Ecology and Evolution, 11, 818–27.

Siegfried and Hothorn (2021) Count Transformation Models: The cotram Package. URL https://CRAN.R-project.org/package=cotram. R package version 0.2.1.

Sokal and Rohlf (1981) Biometry: The Principles and Practice of Statistics in Biological Research. San Francisco: W.H. Freeman Ltd.

Tutz (2011) Regression for Categorical Data, volume 34. London: Cambridge University Press.

Watanabe (2010) Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research, 11, 3571–94.

Winkelmann (2008) Econometric Analysis of Count Data. New York: Springer.
