Variance-factorized loglinear models for overdispersed and underdispersed counts

Abstract

A technique is proposed for a more structured approach to modelling over- and underdispersed counts. Variance factorized loglinear (VFL) models are a parameterization of an (enlarged) distribution whereby the overdispersed variance is expressed as a product of the nominal variance and a remainder term involving a highly interpretable loglinear effect with a potentially different set of covariates. This facilitates mean–variance analysis by allowing the variance to be modelled more separately from the mean, especially for generalized linear models (GLMs). Consequently, VFL parameterized models offer substantial advantages over existing methods for investigating overdispersion, such as a unifying framework and greater interpretability. Examples given here of enlarged distributions that contain equidispersion as a special case include the extended beta-binomial (with a newly proposed ‘clog’ link and special offset), the negative binomial (NB) and the generalized Poisson (GP–1 and GP–2 variants). VFL parameterized models are constructed as vector generalized linear models (VGLMs) with a judicious combination of constraint matrices, offsets and link functions. Underdispersed data can be handled by expanding the response by a multiplier and adjusting the offsets. Useful for disentangling the mean–variance relationship in GLMs, VFL parameterized models are implemented within the very broad framework of the vector generalized additive model (VGAM) R package available from the Comprehensive R Archive Network (CRAN). The technique is generally applicable to mean-parameterized distributions.

Keywords

Constraint matrices; extended beta-binomial distribution; generalized Poisson distribution; mean-parameterized distributions; mean–variance analysis; negative binomial regression; offsets; parameter link function; vector generalized linear model

1 Introduction

Overdispersed counts commonly encountered in applied statistics are analyzed by a variety of techniques ranging from rigorous to ad hoc. Previous work such as Nelder and Lee (1991), Hinde and Demétrio (1998) and Carroll (2003) has sought and emphasized structure, and the purpose of this article is to propose a technique for constructing or identifying a subclass of models that aids such an analysis in a more systematic manner. In particular, the aim here is to enable and/or disentangle mean–variance modelling so that each part sheds separate insight into the trend and variability. We call this mean–variance analysis (MVA), where the purpose is to separately focus on the mean and variance of an (enlarged) model having two parameters. Although the binomial and Poisson cases are the main examples, the concepts carry over well beyond the generalized linear model (GLM; Nelder and Wedderburn, 1972) and to underdispersed data too (Section 4). Two main motivations for variance factorized loglinear (VFL) parameterized models are given in Section 2. The methodological contribution of this article is a proposal of a technique for parameterizing a distribution so that MVA is facilitated. Although applied in this article to four distributions, the technique is general and can be applied beyond the examples here.

Write the data as $(x_i, y_i)$ for i = 1, …, n, independently. Dropping the subscript i where convenient, $x = {(x_{[1]}^{T}, x_{[2]}^{T})}^{T}$ is a d-vector with kth element x_k, where $x_{[1]}$ and $x_{[2]}$ may be identical, overlapping or completely different in practice. In (1.2) below, the regression coefficients of x will be $β_j$. We define a VFL GLM as satisfying

Var (Y; x) = V_{e} (μ; x_{[1]}) R (μ; x_{[2]}),

(1.1)

where $V_e(μ; x_{[1]})$ is the nominal variance corresponding to equidispersion and $R(μ; x_{[2]})$ is a highly interpretable remainder involving a loglinear term. Thus $R(μ; x_{[2]})$ measures the amount of overdispersion, given that $μ(x_{[1]})$ is correctly specified. The central equation of this article, (1.1), offers the advantages described in Section 2.

1.1 VGLMs

We now summarize the class of vector generalized linear models (VGLMs; Yee, 2015) which conveniently fit VFL parameterized models because they have the necessary framework and infrastructure. The framework is very large so that it easily accommodates all the variants of Table 1. The log-likelihood to be maximized is $l = \sum_{i = 1}^{n} w_{i}^{*} l_{i}$ where the prior weights $w_{i}^{*}$ are positive, known and prespecified. VGLMs model multiple parameters (not a single μ) by multiple linear predictors η_j. For M parameters θ_j, VGLMs specify the jth linear predictor as

g_{j} (θ_{j}) = η_{j} = ω_{j} + β_{j}^{T} x = ω_{j} + \sum_{k = 1}^{d} β_{(j) k} x_{k},

(1.2)

Table 1

VFL parameterized models for the Poisson and binomial base. Note: ω₂ = log(N/(N − 1)) for the extended beta-binomial distribution (EBBD). The bottom half is for underdispersion with mY as the response, where m = 2, 3, … is a multiplier. Here, $x = {(x_{[1]}^{T}, x_{[2]}^{T})}^{T}$ and ‘GT’ stands for ‘generally truncated’.

$Var (Y \mid x_{[1]}, x_{[2]})$	(2-parameter) Enlarged model
$μ (x_{[1]}) \cdot [1 + e^{β_{[2]}^{T} x_{[2]}}]$	NB–H–VFL
$μ (x_{[1]}) \cdot \exp (e^{β_{[2]}^{T} x_{[2]}})$	GP–1–VFL
$μ (x_{[1]}) \cdot {[1 + e^{β_{[2]}^{T} x_{[2]}}]}^{2}$	GP–2–VFL
$μ (x_{[1]}) (1 - μ (x_{[1]})) \cdot [1 - e^{- \{ω_{2} + β_{[2]}^{T} x_{[2]}\}}]$	EBBD-clog-VFL for proportion Y
$μ (x_{[1]}) \cdot \frac{1}{m} [1 + e^{β_{[2]}^{T} x_{[2]}}]$	GT–NB–H–VFL with ω₁ = log m = ω₂
$μ (x_{[1]}) \cdot \frac{1}{m} \exp (e^{β_{[2]}^{T} x_{[2]}})$	GT–GP–1–VFL with ω₁ = log m and ω₂ = 0
$μ (x_{[1]}) \cdot \frac{1}{m} {[1 + e^{β_{[2]}^{T} x_{[2]}}]}^{2}$	GT–GP–2–VFL with ω₁ = log m = −ω₂

for j = 1, …, M and some suitable parameter link function g_j, so called because g_j operates on a parameter rather than on a mean only. The $β_{(j) k}$ in (1.2) reflect the two-dimensional structure of the regression coefficients, across the jth linear predictor and across the kth covariate, which is reflected by $B^{T}$ in (1.3) below. Here, x₁ = 1 denotes the optional intercept and the ω_j are offsets which are placed into $ω = (ω_{1}, \dots, ω_{M})^{T}$. For example, the generalized Poisson variant called the GP-1 (Yang et al., 2009) as a VGLM has $θ = (μ, φ)^{T}$ with η₁ = log μ and η₂ = log log φ by default since the dispersion index φ = 1 corresponds to equidispersion and φ ≥ 1 generally. Since M = 2 > 1 in this article, relationships between the $β_{(j) k}$ are permitted by the linear constraints

\begin{aligned} η (x_{i}) = \begin{pmatrix} η_{1} (x_{i}) \\ \vdots \\ η_{M} (x_{i}) \end{pmatrix} &= ω + \sum_{k = 1}^{d} β_{(k)} x_{i k} \\ &= ω + \sum_{k = 1}^{d} H_{k} β_{(k)}^{*} x_{i k} = ω + B^{T} x_{i}, \end{aligned}

(1.3)

for known constraint matrices $H_k$ of full column-rank (i.e., rank $R_k = \mathrm{ncol}(H_k)$) and

β_{(k)}^{*} = {(β_{(1) k}^{*}, β_{(2) k}^{*}, \dots)}^{T}

(1.4)

is a possibly reduced set of regression coefficients to be estimated. The $β_{(k)}$ and $β_{(k)}^{*}$ correspond to the kth covariate x_k whereas the $β_j$ correspond to η_j. Trivial constraints are denoted by $H_k = I_M$ and other common examples include parallelism ($H_k = 1_M$), exchangeability and intercept-only parameters $η_{j} = β_{(j) 1}^{*}$, depending on the model. VFL parameterized models are the result of a judicious combination of constraint matrices, offsets and links. In the software implementation, the $H_k$ are constructed internally for ease of use upon specification by the user.
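The constrained predictor (1.3) can be made concrete with a few lines of code. The following is a minimal sketch only, assuming M = 2 and plain Python lists; the function name `linear_predictors` and all numerical values are illustrative and are not part of VGAM. It uses the constraint matrices that Section 3.1 adopts for the NB–H–VFL model, namely (1, 1)^T for terms in x_[1] and (0, −1)^T for terms in x_[2].

```python
def linear_predictors(x, H, beta_star, omega):
    """Return eta = omega + sum_k H_k beta*_k x_k, as in (1.3).

    x         : list of d covariate values (x[0] = 1 for the intercept)
    H         : list of d constraint matrices, each an M x R_k list of lists
    beta_star : list of d coefficient vectors, the kth of length R_k
    omega     : list of M offsets
    """
    M = len(omega)
    eta = list(omega)
    for x_k, H_k, b_k in zip(x, H, beta_star):
        for j in range(M):
            # jth row of H_k times beta*_k, scaled by the covariate value
            eta[j] += x_k * sum(h * b for h, b in zip(H_k[j], b_k))
    return eta


# One covariate in x_[1] plus its copy in x_[2] (intercepts included).
H_mean = [[1.0], [1.0]]      # (1, 1)^T: the term enters eta_1 and eta_2
H_disp = [[0.0], [-1.0]]     # (0, -1)^T: the term enters eta_2 only, negated
x = [1.0, 0.5, 1.0, 0.5]     # intercept, x2, copied intercept, copied x2
H = [H_mean, H_mean, H_disp, H_disp]
beta_star = [[1.0], [2.0], [-1.0], [0.3]]   # illustrative values only
eta1, eta2 = linear_predictors(x, H, beta_star, omega=[0.0, 0.0])
```

With these choices eta1 − eta2 equals the linear combination of the copied covariates alone, which is what allows the ratio μ/κ in the NB variance to be free of the mean coefficients.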

1.2 Some previous work and approaches

From the vast number of methods proposed for unequidispersed count regression, it is beneficial to compare several to highlight issues and differences with this work. In general, an MVA requires a mean-parameterization, and the variance is usually the other parameter or is somewhat related to (1.1).

Fisher scoring used for VGLM estimation essentially only requires the expected information. Work such as Cifuentes-Amado and Cepeda-Cuervo (2024) is partly based on this algorithm; however, it is far more complex, it is Bayesian and no software is available. Somewhat similar is the compound Poisson-normal model of Hinde (1982) estimated by iteratively reweighted least squares, numerical integration and the expectation–maximization (EM) algorithm. However, while a generalized linear interactive modelling (GLIM) macro is given, there is no R implementation and the author states that its results are expected to be very similar to negative binomial (NB) regression.

One approach is based on mixtures such as the Poisson–Tweedie (e.g., Petterle et al., 2019; Abid et al., 2021), which have a variance such as μ + ϕμ^p and μ + μ² + ϕμ^p. These have several disadvantages; for example, the dispersion ϕ and power p parameters can only be modelled as intercept-only. In terms of interpretation, neither of these models is a suitable vehicle for MVA. Furthermore, neither of these models is as simple as (3.1), (3.2) or (3.3) below.

In recent years Conway–Maxwell–Poisson regression has received considerable attention (e.g., Shmueli et al., 2005; Sellers, 2023); however, there are numerical challenges such as computing its normalizing constant and expected information. Unfortunately the mean-parameterized variant (Huang, 2017) also suffers from these drawbacks. Another distribution whose normalizing constant is difficult to compute is that of del Castillo and Pérez-Casany (2005), even though the model may be reparameterized by the mean and variance ‘under certain assumptions’. Yet another approach is to use quasi-likelihood rather than maximum likelihood estimation (e.g., Engel and Brake, 1993, for proportions); however, such methods do not allow the dispersion parameter ϕ to incorporate covariates.

To summarize, the above work and most others tend to be piecemeal in nature and, if software has been written, specialized algorithms are required and it is limited to that specific model only. Also, these works often do not promote MVA because they lack the interpretability needed (e.g., (1.1)) and any overdispersion parameter cannot realistically incorporate covariates, as well as being computationally complex.

2 Motivating considerations

There are many instances where understanding the structure of variability is just as central as understanding the mean structure (Carroll, 2003). Under such conditions, if the variance structure is treated as a nuisance instead of being a central part of the modelling effort, it will lead to inefficient estimation of means and to misleading conclusions. One such instance where the mean–variance relationship has a central property is the case of multivariate abundances (Warton and Hui, 2017), where potentially serious artifacts can be introduced to analyses if this is not accounted for.

Under this setting, VFL parameterized models are spurred by two main motivations. Before describing these, it is emphasized that the technique is amenable to any mean-parameterized distribution and not only for counts. The methodology seeks to untwine the mean and the variance using two simple features available in the VGLM infrastructure (constraint matrices and offsets).

The first motivation is that (1.1) has several compelling advantages. Foremost, it allows variation in the response to be studied more separately from the signal. For example, in quantitative finance, the volatility may be of equal interest as the trend so that the ability to unravel the signal–noise relationship is crucial. Electrocardiography is another example where variability may be more important than the trend. Since V_e is a simple function of the mean for GLMs, another advantage is that the VFL approach teases out the mean–variance relationship so that interpretation is greatly facilitated. A different set of covariates may be used to model the mean and remainder terms, hence overdispersion can be modelled separately and conditional on μ. This follows the rationale of quasi-likelihood modelling, for example, having

Var (Y) = ϕ μ,

(2.1)

where a moment estimator of the dispersion parameter, $\hat{ϕ}$ , ‘follows’ after the maximum likelihood estimator of the mean, $\hat{μ}$ , to adjust for the extra-variation. Outside the exponential family, (1.1) holds partially by replacing μ by a general set of parameters θ. Additional advantages of VFL parameterized models include the following:

They are modular and belong to a large framework. Because they arise from a choice of H_k, g_j, ω and mean-parameterized enlarged distribution, they offer much flexibility so there may be more than one suitable model. For example, the GP-1-VFL and GP-2-VFL parameterizations below are new. Nowadays almost all practitioners are familiar with GLMs, hence would find the VGLM framework a natural extension and VGLMs as supercharged GLMs. VGLMs are far more general and are applied to more than 100 popular regression models in the vector generalized additive model (VGAM) package, such as for categorical responses, capture-recapture experiments and extremes (Yee, 2015). The insistence on having a large framework is vindicated, for example, by Pujol-Rigol et al. (2025) concluding that VGAM is the most versatile of 48 R packages for categorical regression.

They are very interpretable because loglinear models exhibit multiplicative effects. In contrast, under- and overdispersion is less interpretable for Conway–Maxwell–Poisson regression (for example) even when mean-parameterized because the variance is a complicated function of the parameters.

In addition to VGLMs, the framework of Yee (2015) provides other useful classes such as the VGAM of Section 5.3 for additive modelling, to provide a data-driven approach by smoothing.

The second motivation is computational. For ordinary two-parameter models the constraint matrices $H_k$ in (1.3) contain two columns. When the desire is to model the first two moments separately, it is difficult to delete one or both columns of $H_k$ in its R implementation. Instead, a term is easily dropped from or added to an S language formula, using functions such as drop1() and terms() (Chambers and Hastie, 1991). Additional computational advantages include:

The framework means that each VFL parameterized model can be implemented in R by a single ‘family’ function; for example, negbinomial() in VGAM fits over half-a-dozen NB variants, including VFL parameterized models. In contrast, other packages such as gamlss (Stasinopoulos et al., 2025) require a separate function for the Evans (1953) NB parameterization. A recent review of NB regression in the context of ecology is Stoklosa et al. (2022).

Ease of implementation: It is almost trivial to construct the appropriate $H_k$ within each VGAM family function.

Fisher scoring is well-established and mainstream (Osborne, 1992). Convergence is rapid and standard errors are automatically produced from scoring.

It is noted that for MVA a direct mean–variance parameterization is often problematic, for instance, estimation for the NB distribution parameterized by its mean and variance, NB(μ, σ²), having probability mass function

\Pr (Y = y) = \binom{y + μ^{2} / (σ^{2} - μ) - 1}{y} {(1 - \frac{μ}{σ^{2}})}^{y} {(\frac{μ}{σ^{2}})}^{μ^{2} / (σ^{2} - μ)}, \quad 0 < μ < σ^{2} < \infty,

involves a difficult constrained optimization because η₁ < η₂ when the log link is applied to both parameters. VFL parameterized models are a pragmatic compromise.
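To illustrate the NB(μ, σ²) parameterization and its ordering constraint, the following minimal sketch evaluates the probability mass function above with the standard library only. The function name `nb_pmf_mean_var` and the chosen values μ = 3 and σ² = 7.5 are illustrative assumptions. The guard on 0 < μ < σ² is the same restriction that becomes η₁ < η₂ under log links.

```python
import math

def nb_pmf_mean_var(y, mu, sigma2):
    """NB probability mass at y, parameterized by mean mu and variance sigma2."""
    if not (0 < mu < sigma2):
        raise ValueError("need 0 < mu < sigma2")
    k = mu**2 / (sigma2 - mu)     # exponent mu^2 / (sigma2 - mu) in the pmf
    p = mu / sigma2
    # log of the generalized binomial coefficient C(y + k - 1, y)
    log_binom = math.lgamma(y + k) - math.lgamma(k) - math.lgamma(y + 1)
    return math.exp(log_binom + y * math.log1p(-p) + k * math.log(p))

mu, sigma2 = 3.0, 7.5
probs = [nb_pmf_mean_var(y, mu, sigma2) for y in range(400)]
total = sum(probs)
mean = sum(y * p for y, p in enumerate(probs))
var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
```

Summing over a long enough range recovers the stated mean and variance, while any attempt to evaluate the function with μ ≥ σ² is rejected, which is the boundary an optimizer would have to respect.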

3 Examples of VFL parameterized models

We look separately at overdispersion relative to the Poisson and binomial distributions.

3.1 Poisson and the NB, GP-1 and GP-2

Consider Y ∼ NB(μ, κ) as the ‘enlarged’ distribution with η₁ = log μ and η₂ = log κ as μ and κ are positive. Then based on Var(Y) = μ(1 + μ/κ), a VFL parameterized model is obtained by choosing $H_k = (1, 1)^{T}$ for $x_{[1]}$ and $H_k = (0, -1)^{T}$ for $x_{[2]}$ since μ and κ appear as a ratio. Then $η_{1} = β_{[1]}^{T} x_{[1]}$ and $η_{2} = β_{[1]}^{T} x_{[1]} - β_{[2]}^{T} x_{[2]}$ so that

Var (Y | x) = μ (x_{[1]}) \cdot [1 + \frac{e^{β_{[1]}^{T} x_{[1]}}}{e^{β_{[1]}^{T} x_{[1]}} e^{- β_{[2]}^{T} x_{[2]}}}] = Var (Y^{*} | x_{[1]}) \cdot [1 + e^{β_{[2]}^{T} x_{[2]}}]

(3.1)

with the Poisson limit μ/κ → 0⁺ defining Y^* ∼ Pois(μ). Thus it is possible to distance the mean from the variance with respect to $x_{[1]}$ and $x_{[2]}$, which is a very favourable property because it enables a joint variable-by-variable analysis of the mean and variance. It is noted that if variable x_k appears in $x_{[2]}$ only and its (real) regression coefficient β_k is positive then increasing x_k is associated with increasing overdispersion. Incidentally, (3.1) coincides with an NB parameterization described by Evans (1953) and is abbreviated by ‘NB–H–VFL’ in Table 1.
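The cancellation behind (3.1) is easy to check numerically. The sketch below is illustrative only (the function name and the values of the two linear combinations are assumptions); it builds μ and κ from the two linear predictors implied by the constraint matrices and compares the standard NB variance with the factorized form.

```python
import math

def nbh_vfl_variance(lin1, lin2):
    """Variance of the NB-H-VFL model.

    lin1 = beta_[1]^T x_[1] and lin2 = beta_[2]^T x_[2]; the constraint
    matrices give log mu = lin1 and log kappa = lin1 - lin2.
    """
    mu = math.exp(lin1)
    kappa = math.exp(lin1 - lin2)
    return mu * (1.0 + mu / kappa)      # standard NB variance mu(1 + mu/kappa)

a, b = 1.2, -0.4                        # illustrative values only
direct = nbh_vfl_variance(a, b)
factorized = math.exp(a) * (1.0 + math.exp(b))   # right-hand side of (3.1)
```

The two quantities agree, and the ratio of the variance to the mean depends on `b` alone, which is the sense in which the remainder term is separated from the mean.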

Practically, one procedure is to make a copy of all the covariates used for μ and fit a model with $x_{[1]}$ and $x_{[2]}$. This may be followed by backward elimination (or stepwise regression more generally) for variable selection. Of interest will be the set of covariates for μ compared to those for the loglinear term in R(μ). This strategy is used in all the Section 5 examples.

We now apply the same argument to two variants of the generalized Poisson distribution.

GP-1. By default, η₁ = log μ and η₂ = log log φ because 1 ≤ φ and Var(Y) = μφ. By choosing $H_k (x_{[1]}) = (1, 0)^{T}$ and $H_k (x_{[2]}) = (0, 1)^{T}$ to partition x, then

Var (Y | x_{[1]}, x_{[2]}) = μ (x_{[1]}) \cdot \exp (e^{β_{[2]}^{T} x_{[2]}})

(3.2)

which is very interpretable.

GP-2. By default, η₁ = log μ and η₂ = log α because Var(Y) = μ(1 + αμ)² and α > 0. By choosing $H_k = (1, -1)^{T}$ for $x_{[1]}$ and $H_k = (0, 1)^{T}$ for $x_{[2]}$ to sever the product αμ then

Var (Y | x_{[1]}, x_{[2]}) = μ (x_{[1]}) \cdot {[1 + e^{β_{[2]}^{T} x_{[2]}}]}^{2},

(3.3)

so that it is similar to the NB–H–VFL (3.1) but the standard deviation is modelled instead of the variance.
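The GP-1 and GP-2 factorizations (3.2) and (3.3) can be checked in the same way. This is a minimal sketch with illustrative function names and values; `a` stands for the linear combination in x_[1] and `b` for the one in x_[2].

```python
import math

def gp1_vfl_variance(a, b):
    """GP-1: log mu = a, log log phi = b, Var = mu * phi."""
    mu = math.exp(a)
    phi = math.exp(math.exp(b))         # inverse of the log-log link, so phi > 1
    return mu * phi

def gp2_vfl_variance(a, b):
    """GP-2: log mu = a, log alpha = -a + b, Var = mu * (1 + alpha * mu)^2."""
    mu = math.exp(a)
    alpha = math.exp(-a + b)            # H_k = (1, -1)^T cancels mu in alpha * mu
    return mu * (1.0 + alpha * mu) ** 2

b = -1.0
# Variance-to-mean ratios at two different means: they should coincide.
gp1_ratios = [gp1_vfl_variance(a, b) / math.exp(a) for a in (0.7, 2.0)]
gp2_ratios = [gp2_vfl_variance(a, b) / math.exp(a) for a in (0.7, 2.0)]
```

In both cases the ratio of the variance to the mean is unchanged when only the mean predictor changes, matching exp(e^b) for the GP-1 and (1 + e^b)² for the GP-2.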

Collectively, (3.1), (3.2) and (3.3) enable overdispersion for counts to be modelled more separately from the mean so that one can perform variance modelling, given μ. Table 1 and Appendix B are a summary. All three equations have compelling advantages over the quasi-Poisson variance (2.1): they allow ϕ to be effectively modelled with $x_{[2]}$, and confidence intervals and standard errors for ϕ (i.e., R(μ)) can be computed. In contrast, (2.1) lacks a likelihood so that none of these are available.

3.2 Binomial and the BB and EBB

Compared to the classical beta-binomial, the EBBD (Prentice, 1986) has the advantage of accommodating a small amount of underdispersion. This is very useful when units within a cluster are independent because of sampling variation. The EBBD for a proportion y has probability mass function

\Pr (Y = y; N) = \binom{N}{N y} \frac{\prod_{i = 0}^{N y - 1} (μ + γ i) \prod_{i = 0}^{N (1 - y) - 1} (1 - μ + γ i)}{\prod_{i = 0}^{N - 1} (1 + γ i)}, \quad y = 0, 1 / N, \dots, 1,

(3.4)

where N > 1, μ = E(y) ∈ (0, 1), γ = ρ/(1 − ρ) and $\prod_{i = 0}^{- 1}$ terms are ignored. The intraclass correlation ρ satisfies

\max \{\frac{- μ}{N - μ - 1}, \frac{- (1 - μ)}{N - (1 - μ) - 1}\} \leq ρ \leq 1,

(3.5)

and Var(NY) = Nμ(1 − μ)[1 + (N − 1)ρ]. The classical beta-binomial replaces the lower bound of (3.5) by 0; therefore, unlike the classical case, the EBB test for independence does not involve hypothesis testing at the parameter space boundary, so that Tarone’s Z statistic (Tarone, 1979) is unnecessary. This is the second favourable property, described shortly. It is now shown that

η (x) = \begin{pmatrix} η_{1} (x) \\ η_{2} (x) \end{pmatrix} = \begin{pmatrix} \operatorname{logit} μ \\ - \log (1 - ρ) \end{pmatrix}

(3.6)

is a VFL parameterized model with two favourable properties. Note that the link proposed here for η₂ is an incomplete complementary log–log link function, which is named the complementary log (‘clog’) link instead of the usual ‘cloglog’.

Theorem 1 The extended beta-binomial distribution (3.4) with g₂(ρ) = − log(1 − ρ) has

Var (Y | x_{[1]}, x_{[2]}) = μ (x_{[1]}) (1 - μ (x_{[1]})) \cdot [1 - e^{- \{ω_{2} + β_{[2]}^{T} x_{[2]}\}}]

(3.7)

where ω₂ = log(N/(N − 1)). □

The first significant result is that $R (μ) = 1 - e^{- \{ω_{2} + β_{[2]}^{T} x_{[2]}\}}$ is very interpretable and measures the effect of $x_{[2]}$ on overdispersion relative to the variance of a Bernoulli random variable. The adjustment for N is absorbed into the offset, therefore the extra-variability can be studied seamlessly across clusters of varying sizes. Note that the offset ω₂ → 0⁺ quite quickly as N → ∞; for example, ω₂ = log(20/19) ≈ 0.051 when N = 20 and is smaller for larger N.
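The variance identity underlying Theorem 1 can be verified directly from the probability mass function (3.4). The sketch below is illustrative only: the function names `ebb_pmf` and `clog`, and the values N = 10 and μ = 0.3, are assumptions. The check is written purely in terms of ρ and N, using the fact that [1 + (N − 1)ρ]/N = 1 − exp{−(ω₂ + clog(ρ))} with ω₂ = log(N/(N − 1)), so it does not depend on how the offset is supplied in software. One slightly negative ρ inside the bound (3.5) is included.

```python
import math

def ebb_pmf(s, N, mu, rho):
    """Pr(NY = s) for the extended beta-binomial (3.4), s = 0, ..., N."""
    gamma = rho / (1.0 - rho)
    # Empty products equal 1, matching the convention stated after (3.4).
    num = math.prod(mu + gamma * i for i in range(s))
    num *= math.prod(1.0 - mu + gamma * i for i in range(N - s))
    den = math.prod(1.0 + gamma * i for i in range(N))
    return math.comb(N, s) * num / den

def clog(rho):
    """Complementary log link, -log(1 - rho)."""
    return -math.log1p(-rho)

N, mu = 10, 0.3
omega2 = math.log(N / (N - 1))
results = []
for rho in (-0.02, 0.0, 0.15):
    probs = [ebb_pmf(s, N, mu, rho) for s in range(N + 1)]
    mean_y = sum(s / N * p for s, p in enumerate(probs))
    var_y = sum((s / N - mean_y) ** 2 * p for s, p in enumerate(probs))
    remainder = 1.0 - math.exp(-(omega2 + clog(rho)))
    results.append((sum(probs), mean_y, var_y, mu * (1.0 - mu) * remainder))
```

For each ρ the probabilities sum to one, the mean of the proportion is μ and the empirical variance of the proportion matches μ(1 − μ) times the remainder term.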

The second favourable property of (3.6) is that if η₂ is intercept-only then testing H₀ : ρ = 0 is equivalent to testing H₀ : β₍₂₎₁ = 0, for example, using the ordinary Wald statistic produced by summary(). This is because cloglink(0) is 0 and this null value 0 is in the interior of the parameter space (3.5) and not on its boundary. Some simulations in the supplementary material confirm that the Wald and likelihood ratio test (LRT) p values are very similar, as expected.

4 Underdispersed counts

Most count distributions suitable for use as the enlarged distribution cannot handle underdispersion; for example, NB, GP–1 and GP–2 for the Poisson. A natural question is: How can VFL parameterized models handle underdispersed responses? A practical solution is offered in this section. Because the left-hand side of (1.1) is that of the enlarged distribution, the remainder term satisfies $R(μ; x_{[2]}) \geq 1$. However, since underdispersion corresponds to $R(μ; x_{[2]}) < 1$, Y may be expanded upon multiplication by some integer m > 1 so that the $R(μ; x_{[2]}) \geq 1$ property of (1.1) is retained. Consequently (1.1) still holds but for an adjusted response. The finer details are as follows.

We want

\frac{Var (Y^{*})}{E (Y^{*})} = R (θ_{1}, θ_{2}) \geq 1

(4.1)

for some Y-transformed Y^* not exhibiting underdispersion. A suitable transformation is Y^* = mY for some integer multiplier m > 1. Using the Poisson as the ‘baseline’ count distribution then (4.1) becomes m Var(Y)/μ = R(μ, θ₂) ≥ 1. Applying this to the NB distribution for example, one obtains the NB–H–VFL model having

Var (Y | x_{[1]}, x_{[2]}) \approx \frac{μ (x_{[1]})}{m} [1 + e^{β_{[2]}^{T} x_{[2]}}] .

(4.2)

For estimation it is recommended that the smallest m achieving unequivocal overdispersion be used; for example, we want a finite $\hat{κ}$ to avoid the parameter space boundary. Then the only change to (3.1) is to set ω_j = log m as an offset for η₁ and η₂.
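Since Var(mY)/E(mY) = m Var(Y)/E(Y), a candidate multiplier can be screened from the sample variance-to-mean ratio alone. The following is a rough sketch, not a prescribed procedure: the function name `smallest_multiplier` and its `threshold` argument are hypothetical, with thresholds above 1 standing in for the informal requirement of ‘unequivocal’ overdispersion.

```python
def smallest_multiplier(vmr, threshold=1.0):
    """Smallest integer m >= 2 with m * vmr > threshold.

    vmr is the variance-to-mean ratio of the original counts Y. The
    `threshold` argument is a hypothetical tuning knob: values above 1 ask
    for a clearer margin of overdispersion in the expanded response mY.
    """
    if vmr <= 0:
        raise ValueError("vmr must be positive")
    m = 2
    while m * vmr <= threshold:
        m += 1
    return m
```

For a ratio of 0.175, as reported for the sleep data of Section 5.2, a threshold of 1 gives m = 6 (expanded ratio 1.05), whereas asking for a margin of 1.2 gives m = 7 (expanded ratio about 1.23).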

Strictly, a more accurate approximation is mY ∼ GT–NB (Yee and Ma, 2024) with truncation set

T = \{\min (\hat{m} Y), \dots, \max (\hat{m} Y)\} \setminus \{\hat{m} \min (Y), \dots, \hat{m} \max (Y)\},

(4.3)

where ‘GT’ is an abbreviation for ‘generally truncated’. The simulation study below shows that without general truncation, the estimates are not consistent. Although the coefficients for $μ(x_{[1]})$ tend to behave well without general truncation, those for the index parameter definitely require it for consistency. The model is called GT–NB–H–VFL in Table 1. The GP-1 and GP-2 cases follow similarly and the bottom half of the table summarizes all three.
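The truncation set (4.3) can be enumerated explicitly. The sketch below assumes the reading that the second set consists of the multiples of m attainable by mY, so that T collects the integers in the range of mY that the expanded response can never equal; the function name `truncation_set` is illustrative.

```python
def truncation_set(y_min, y_max, m):
    """Integers between m*y_min and m*y_max that m*Y can never equal, as in (4.3)."""
    support = {m * y for y in range(y_min, y_max + 1)}
    return [v for v in range(m * y_min, m * y_max + 1) if v not in support]

# y in {3, ..., 12} and m = 7, the values used for the sleep data in Section 5.2
T = truncation_set(3, 12, 7)
```

With these values the range 21, …, 84 contains 64 integers, of which 10 are attainable, leaving 54 truncated values.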

It is noted that an alternative to (4.2) is to omit the offset for η₂ so that

Var (Y | x_{[1]}, x_{[2]}) \approx e^{β_{[1]}^{T} x_{[1]}} [\frac{1}{m} + e^{β_{[2]}^{T} x_{[2]}}] .

(4.4)

While this special case is just as interpretable, we believe the bottom half of Table 1 is preferable because they share a common format. It is also noted that relatively few distributions are capable of modelling underdispersion (e.g., Sellers and Morris, 2017) and therefore can theoretically serve as the base distribution for the generally truncated expansion (GTE) method behind (4.3).

A final note here is that recently Puig et al. (2024) described mechanisms leading to under- and over-dispersion and commented that while both are common, underdispersion is less frequent. Possibly one might consider the multiplier m as a way to compensate for binomial thinning which is one mechanism for underdispersion. For the mean-parameterized distributions considered in this article it is informative knowing the conditions characterizing all two-parameter count distributions that are partially or fully closed under addition so that the maximum likelihood estimator of the population mean is the sample mean; for more details see Puig (2003) and Puig and Valero (2006).

5 Examples

In the following, we chose $x_{[2]} = x_{[1]}$ by making a copy of each covariate and adding the suffix ‘.cp’. Thus $x_{[j]}$ is primarily used to model each η_j ‘separately’. The notation ${\hat{β}}_{(j) k}^{*}$ from (1.4) is used repeatedly to identify specific regression coefficients and the supplementary material includes many details not given here.

5.1 Length of hospital stay in cardiovascular patients

We consider 3 589 cardiovascular patients’ Y = length of stay (LOS; days) in hospitals in Arizona, USA, around 1991. Being integer-valued, it is reasonable treating the response as having a count distribution. The data, available as azpro in COUNT (Hilbe, 2016), has the four covariates age75 (1 if age > 75, else 0), sex (M = 1, 0 = F), admission type (admit; 1 = urgent/emergency, 0 = elective), and surgery procedure (1 = CABG, 0 = PTCA). The LOS variance-to-mean ratio (VMR) is 5.4.

An ordinary NB–H regression would have the form

\begin{array}{l} \log μ = η_{1} = \sum_{k = 1}^{5} β_{(1) k} x_{k}, \\ \log κ = η_{2} = \sum_{k = 1}^{5} β_{(2) k} x_{k} . \end{array}

Table 2

NB–VFL regression fitted to Arizona cardiovascular patients length of hospital stay: Final model. The suffix ‘.cp’ means a copy; for example, x1.cp = 1.

	Estimate	Std. Error	z value	Pr(>|z|)
(Intercept)	1.438	0.024	61.093	0.000
procedure	0.984	0.019	52.664	0.000
sex	−0.126	0.019	−6.647	0.000
age75	0.110	0.019	5.719	0.000
admit	0.338	0.018	18.610	0.000
procedure.cp	0.293	0.078	3.742	0.000
sex.cp	−0.225	0.077	−2.925	0.003
admit.cp	0.500	0.070	7.120	0.000

We fitted a VFL NB–H followed by backward elimination, which dropped x1.cp and age75.cp. All the remaining regression coefficients are highly statistically significant (Table 2). We draw the following conclusions (keeping all other variables fixed for their interpretation).

In terms of the mean LOS, percutaneous transluminal coronary angioplasty (PTCA) patients have a shorter duration compared to coronary artery bypass graft (CABG) patients, males have shorter stays, emergencies result in longer stays than elective treatment and increasing age is associated with longer stays. For example, the model predicts those aged over 75 have a mean LOS that is e^0.11046 − 1 ≈ 11.7% higher than those younger, cf. 12.0% empirically.

The procedure and procedure.cp estimates are both positive, so the model implies CABG patients will have a higher LOS mean and variance compared to PTCA patients. This is verified in the data. Likewise, since the estimates for sex and sex.cp are both negative, males should have a lower LOS mean and variance compared to females. This is also verified in the data.

Because the intercept is omitted from ${\hat{η}}_{2}$ by step4(), the VMR is particularly simple: the subgroup corresponding to $x_{[2]} = 0^{T}$ (females undergoing elective PTCA surgery) is expected to have an estimated VMR of 1 + e⁰ = 2. In fact it is 1.97.

On the whole, a higher mean is associated with a higher variance for each variable, which has clinical implications.
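The conclusions above can be reproduced from the Table 2 estimates with a short calculation. This is a sketch only: the coefficients are the three-decimal roundings printed in the table, the covariate coding follows the description of the data, and the function names are illustrative.

```python
import math

# Rounded estimates copied from Table 2 (mean part, then the '.cp' dispersion part).
mean_coef = {"(Intercept)": 1.438, "procedure": 0.984, "sex": -0.126,
             "age75": 0.110, "admit": 0.338}
disp_coef = {"procedure": 0.293, "sex": -0.225, "admit": 0.500}

def fitted_mean(x):
    """exp(eta1) for a dict of covariate values."""
    eta1 = mean_coef["(Intercept)"] + sum(mean_coef[k] * v for k, v in x.items())
    return math.exp(eta1)

def fitted_vmr(x):
    """Variance-to-mean ratio 1 + exp(eta2); eta2 has no intercept in the final model."""
    eta2 = sum(disp_coef.get(k, 0.0) * v for k, v in x.items())
    return 1.0 + math.exp(eta2)

# Female, elective PTCA, aged 75 or under.
baseline = {"procedure": 0, "sex": 0, "age75": 0, "admit": 0}
older = dict(baseline, age75=1)
male = dict(baseline, sex=1)
age_ratio = fitted_mean(older) / fitted_mean(baseline)
```

The baseline subgroup has a fitted variance-to-mean ratio of exactly 2, the ratio for males is smaller because the sex.cp estimate is negative, and `age_ratio` is the multiplicative age effect on the mean discussed above (computed here from the rounded estimate 0.110).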

5.2 Sleep duration in a NZ cohort

Sleep duration data were obtained from a large New Zealand cross-sectional study, available as xs.nz in VGAMdata. The study was an approximate random sample of the country’s working population in the mid-1990s (MacMahon et al., 1995), and the integer-valued response was recorded from the question: ‘How many hours do you usually sleep each night?’ After removing the missing values and outliers (3.6%) there were n = 10 147 individuals; specifically, y ∈ {3, …, 12} and we regressed against x₂ = age, sex, ethnicity (European, Maori, Polynesian, other) and x₅ = marital status (single, married, divorced, widowed). The VMR of 1.28/7.30 ≈ 0.175 indicates strong underdispersion and the multiplier m = 7 was found suitable for obtaining overdispersion relative to the Poisson (VMR ≈ 1.23).

Initially we fitted the NB–H–VFL (4.2) and GT–NB–H–VFL (4.3) with $x_{[1]} = x_{[2]}$, and chose the latter because simulations show the former may have inconsistent estimates (Section 5.4). Stepwise regression then removed sex from $x_{[2]}$. The resulting model (Table 3) shows several features, such as the following (keeping all other variables fixed at their values):

An increasing age is associated with a decreasing mean sleep duration and increasing overdispersion. This feature holds because if x_k appears in both η₁ and η₂ then a sufficient condition for an increase in overdispersion due to incrementing x_k is ${\hat{β}}_{(1) k}^{*} < 0$ and ${\hat{β}}_{(2) k}^{*} > 0$, since $μ(x_{[1]})$ decreases and $R(μ; x_{[2]})$ increases.

Two more examples of the previous point are Maori and Polynesians compared to Europeans. The effect is most strongly associated with Polynesians. There are several possible explanations why this might be, such as the group being more heterogeneous in life style and work life compared to the other ethnicities (Tautolo et al., 2020; Fangupo et al., 2022).

Divorcees and widows appear to experience lower mean sleep duration relative to singles, because the estimates −0.038 and −0.037 are negative and highly significant. There is weak evidence (p ≈ 0.045) of overdispersion among divorcees.

The model suggests a very small mean difference between males and females, which is confirmed by predicted and empirical results showing that males sleep about 100(1 − e^{−0.022})/7 ≈ 0.3% to 0.6% less on average compared to females.

Table 3

GT–NB–VFL regression fitted to underdispersed New Zealand health study sleep duration: Final model. The suffix ‘.cp’ means a copy; for example, x1.cp = 1.

	Estimate	Std. Error	z value	Pr(>|z|)
(Intercept)	2.104	0.006	355.111	0.000
age	−0.002	0.000	−17.664	0.000
sexM	−0.022	0.003	−6.381	0.000
ethnicityMaori	−0.032	0.005	−5.958	0.000
ethnicityPolynesian	−0.036	0.008	−4.368	0.000
ethnicityOther	−0.012	0.011	−1.122	0.262
maritalmarried	0.004	0.005	0.819	0.413
maritaldivorced	−0.038	0.007	−5.162	0.000
maritalwidowed	−0.037	0.010	−3.572	0.000
x1.cp	−4.893	0.431	−11.350	0.000
age.cp	0.059	0.005	11.728	0.000
ethnicity.cpMaori	1.551	0.191	8.139	0.000
ethnicity.cpPolynesian	2.013	0.179	11.213	0.000
ethnicity.cpOther	−0.970	2.363	−0.410	0.681
marital.cpmarried	−0.109	0.255	−0.426	0.670
marital.cpdivorced	0.585	0.292	2.003	0.045
marital.cpwidowed	0.132	0.284	0.463	0.643

Marginal statistics in the supplementary material confirm these results, as well as additional GP–1–VFL and GP–2–VFL analyses showing similar results to the NB–H–VFL, which is of no surprise due to the close similarity of the GP and NB distributions (Joe and Zhu, 2005).
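To see how the dispersion part of Table 3 translates to the original sleep-duration scale, the approximate variance in the bottom half of Table 1 can be evaluated for a few covariate patterns. This sketch is illustrative only: it uses the rounded Table 3 estimates, fixes marital status at the single baseline, ignores the truncation adjustment and uses a hypothetical function name.

```python
import math

M_MULT = 7                       # multiplier m used for the sleep data
# Rounded Table 3 estimates for the dispersion part (European, single baseline).
disp_intercept = -4.893          # x1.cp
disp_age = 0.059                 # age.cp
disp_ethnicity = {"European": 0.0, "Maori": 1.551,
                  "Polynesian": 2.013, "Other": -0.970}

def sleep_vmr(age, ethnicity="European"):
    """Approximate Var(Y)/E(Y) = (1/m) * [1 + exp(eta2)] for a single person."""
    eta2 = disp_intercept + disp_age * age + disp_ethnicity[ethnicity]
    return (1.0 + math.exp(eta2)) / M_MULT

ratios_at_40 = {e: sleep_vmr(40, e) for e in ("European", "Maori", "Polynesian")}
```

The ratio always exceeds 1/m, rises with age and is ordered European, Maori, Polynesian at a fixed age, which mirrors the signs of the corresponding ‘.cp’ estimates.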

5.3 Atomic bomb chromosomally aberrant cells

We fit an EBBD to the number of chromosome aberrations in survivors across two atomic bombs. The number of analyzed cells was N_i = 100 for each individual and the covariates were x₂ = dosage^{1/3} (rads) and x₃ = site (640 survivors in Hiroshima as the baseline category and 399 in Nagasaki). The data used, Atomic in FlexReg (Ascari et al., 2023), closely follow those of Otake and Prentice (1984).

Table 4

Extended beta-binomial regression for chromosomal aberrations due to radiation: Final model. This VFL parameterized model has variance (3.7). In the notation of (1.4), the regression coefficients are ${\hat{β}}_{(1) 1}^{*}$, ${\hat{β}}_{(1) 2}^{*}$, ${\hat{β}}_{(1) 3}^{*}$, ${\hat{β}}_{(1) 4}^{*}$, ${\hat{β}}_{(2) 1}^{*}$ and ${\hat{β}}_{(2) 2}^{*}$ respectively. The suffix ‘.cp’ means a copy; for example, x1.cp = 1.

	Estimate	Std. Error	z value	Pr(>|z|)
(Intercept)	−4.898	0.079	−61.976	0.000
txdose	0.371	0.014	26.974	0.000
bombN	0.013	0.129	0.104	0.917
txdose:bombN	−0.148	0.024	−6.121	0.000
x1.cp	−0.005	0.001	−3.547	0.000
txdose.cp	0.005	0.001	9.499	0.000

On fitting a VGAM (Yee and Wild, 1996), x₂ = dose^{1/3} was found to give a large improvement in linearity for both η_j. Denoting the bomb site by the indicator variable x₃ with Hiroshima as the baseline, a joint model (3.6) was fitted having

\begin{array}{l} η_{1} = β_{(1) 1}^{*} + β_{(1) 2}^{*} x_{2} + β_{(1) 3}^{*} x_{3} + β_{(1) 4}^{*} x_{2} x_{3}, \\ η_{2} = ω_{2} + β_{(2) 1}^{*} + β_{(2) 2}^{*} x_{2} + β_{(2) 3}^{*} x_{3} + β_{(2) 4}^{*} x_{2} x_{3} \end{array}

with ω₂ = log(100/99). Upon convergence, backward elimination removed $β_{(2) 3}^{*}$ and $β_{(2) 4}^{*}$ . Table 4 summarizes the final fit. The bombN term is retained by the principle of marginality. The dosage effect on the aberration probability differs between cities but is increasing in both, because the slopes 0.371 (Hiroshima) and 0.371 − 0.148 = 0.223 (Nagasaki) are both positive. The dosage effect on ρ, however, is the same in both cities.
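As a worked illustration, substituting the rounded estimates of Table 4 into the retained terms gives the fitted linear predictors for each city (x₃ = 0 for Hiroshima, x₃ = 1 for Nagasaki). The display is only arithmetic on the tabulated values, so the final digit is subject to rounding:

```latex
\begin{array}{ll}
\text{Hiroshima } (x_{3} = 0): & \hat{η}_{1} = -4.898 + 0.371\, x_{2}, \\
\text{Nagasaki } (x_{3} = 1): & \hat{η}_{1} = (-4.898 + 0.013) + (0.371 - 0.148)\, x_{2} = -4.885 + 0.223\, x_{2}, \\
\text{Both cities}: & \hat{η}_{2} = ω_{2} - 0.005 + 0.005\, x_{2}.
\end{array}
```

The two slopes for η₁ differ but are both positive, whereas η₂ has a single slope shared by the two cities.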

At a zero radiation dose in Hiroshima, $\hat{ρ} = {clog}^{- 1} (ω_{2} + {\hat{β}}_{(2) 1}^{*}) \approx 0.0053$ . Although an approximate 95% confidence interval does not cover 0, its closeness to 0 suggests near independence between chromosomes at very low dosages, which makes biological sense. In contrast, the tenfold range of the ρ estimates in the dataset, from 0.0053 to 0.053, suggests that the correlation becomes nonnegligible at high radiation levels, which also makes biological sense. Finally, as ${\hat{β}}_{(2) 2}^{*} > 0$ is highly significant, ρ, and therefore the variance of Y, is an increasing function of dosage (but does not appear to differ between cities).

5.4 Simulation study

The supplementary material reports on a simulation study conducted to compare the performance of the NB–H–VFL and GT–NB–H–VFL parameterized models relative to the quasi-Poisson for over- and under-dispersed data. The details are sketched here and results summarized. The overdispersed NB–H case had log μ = η₁ = 1 + 2x₂,

η_{2} = - 1 + x_{3}, Var (Y | x) = μ (x_{[1]}) [1 + e^{η_{2}}],

(5.1)

for n = 100, 500, 1 000 and x₂ and x₃ ∼ U(0, 1) independently. Six covariates were entered and backward elimination based on the Akaike information criterion (AIC) was used for variable selection. The correct model was found with proportions 0.42, 0.66 and 0.69 for these n values, respectively, showing that a large sample may be needed to recover the true model. The root mean square error decreased strongly with n for μ and more slowly for κ. Wald confidence intervals were shown to work well in terms of their 95% coverage rates. The quasi-Poisson model was confirmed to have moment estimator ϕ consistent with the overall mean

E_{x_{3}} Var (Y_{1} | x) = μ (x_{[1]}) [1 + \int_{0}^{1} e^{η_{2}} d x_{3}] = (2 - e^{- 1}) μ (x_{[1]}) .

(5.2)
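The multiplier in (5.2) follows from $\int_{0}^{1} e^{- 1 + x_{3}} d x_{3} = e^{- 1} (e - 1) = 1 - e^{- 1}$ . A minimal numerical check is sketched below. The function names and the midpoint rule are illustrative choices and are not taken from the simulation code in the supplementary material.

```python
import math

def eta2(x3):
    """Linear predictor of the variance remainder in (5.1): eta_2 = -1 + x3."""
    return -1.0 + x3

def mean_remainder(n_grid=100_000):
    """Midpoint-rule approximation of 1 + integral over (0, 1) of exp(eta_2) dx3."""
    h = 1.0 / n_grid
    integral = sum(math.exp(eta2((i + 0.5) * h)) for i in range(n_grid)) * h
    return 1.0 + integral

closed_form = 2.0 - math.exp(-1.0)  # multiplier of mu on the right of (5.2)
numeric = mean_remainder()
print(round(closed_form, 4), round(numeric, 4))  # prints 1.6321 1.6321
```

Averaged over x₃, the variance is therefore about 1.63 times the mean, which is the constant dispersion that a quasi-Poisson fit can recover.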

Some conclusions are: (a) The NB–H–VFL model performs estimation well as a function of n although a larger n is required for precise estimation of the coefficients in R(μ; x_[2]); (b) For variable selection, a reasonably large n is needed for reliability; (c) Unlike the quasi-Poisson model which is restricted to (5.2), VFL models can estimate covariate effects in R(μ; x_[2]), where Var(Y; x) = V_e(μ; x_[1]) R(μ; x_[2]).
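To make the factorization concrete for the data-generating model (5.1), the sketch below assumes the usual negative binomial variance function Var(Y) = μ + μ²/k with size parameter k. The symbol k and the function name are illustrative and not the article's notation. Choosing k = μ/e^{η₂} reproduces the nominal Poisson variance μ multiplied by the remainder 1 + e^{η₂}.

```python
import math

def nb_h_moments(x2, x3):
    """Mean, NB size and variance implied by (5.1) at covariate values x2, x3.

    Assumes the standard NB variance mu + mu^2 / k; `k` is an illustrative name.
    """
    mu = math.exp(1.0 + 2.0 * x2)           # log mu = eta_1 = 1 + 2 x2
    remainder = 1.0 + math.exp(-1.0 + x3)   # R = 1 + exp(eta_2), eta_2 = -1 + x3
    k = mu / (remainder - 1.0)              # size giving Var = mu * R
    var_nb = mu + mu ** 2 / k               # standard NB variance
    return mu, k, var_nb, mu * remainder

mu, k, var_nb, var_vfl = nb_h_moments(x2=0.5, x3=0.5)
print(abs(var_nb - var_vfl) < 1e-9)  # the two variance expressions agree
```

Under this assumption the mean depends only on x₂ and the remainder only on x₃, which is the separation that the VFL parameterization is designed to expose.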

The underdispersed case fitted data generated from GT–NB–VFL(mμ₁, mκ₁) with multiplier m = 4, $η_{2}^{*}$ = −1.5 + x₃ and

\log (m μ_{2}) = η_{1} = ω_{1} + 1 + 2 x_{2}, Var (m Y_{2} | x) = m μ_{2} (x_{[1]}) [1 + e^{η_{2}^{*}}],

with ω₁ = ω₂ = log m. For this, the NB–H–VFL produced biased estimates for $η_{2}^{*}$ due to ignoring values in the truncation set whereas the GT–NB–H–VFL model worked well as expected. However, the regression coefficients in $η_{2}^{*}$ may be quite variable for small n. The regression coefficients for a naive quasi-Poisson model fitted to Y₂ are inconsistent.
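The reason the multiplier yields underdispersion can be read off the displayed variance: since Var(mY₂) = m² Var(Y₂), dividing by m² gives Var(Y₂ | x)/μ₂ = [1 + e^{η₂^*}]/m. The sketch below evaluates this index over the range of x₃, taking the displayed variance expression at face value. The function name is illustrative.

```python
import math

def dispersion_index(x3, m=4.0):
    """Var(Y2 | x) / mu2 implied by Var(m*Y2 | x) = m*mu2*[1 + exp(eta2_star)].

    Uses Var(m*Y2) = m^2 * Var(Y2), so the index is (1 + exp(eta2_star)) / m.
    """
    eta2_star = -1.5 + x3
    return (1.0 + math.exp(eta2_star)) / m

lo, hi = dispersion_index(0.0), dispersion_index(1.0)
print(round(lo, 3), round(hi, 3))  # prints 0.306 0.402
```

With m = 4 the index stays well below 1 for all x₃ in (0, 1), so Y₂ is underdispersed relative to the Poisson even though mY₂ is overdispersed.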

6 Discussion

Understanding the structure of variation in counts continues to be an important problem (Carroll, 2003). VFL parameterized models address this need by providing additional insights into the variance structure conditional on the mean. The quest for methods to model the mean and variance separately has led to proposals such as some of those mentioned in Section 1.2, quasi-likelihood (Wedderburn, 1974), joint GLMs (Nelder and Lee, 1991), extended Poisson process models (EPPMs; e.g., Faddy, 1997; Faddy and Smith, 2008, 2011; Smith and Faddy, 2016, 2019) and double exponential families (Efron, 1986). None of these is without shortcomings. For example, double exponential families may use a normalizing constant that can only be approximated and, as mentioned in Section 3.1, quasi-likelihood cannot handle dispersion parameters with covariates, nor are ordinary standard errors available for confidence intervals without some adjustment. Likewise, EPPMs have complicated mean and variance expressions and lack the simplicity of (1.1); covariate dependence in the variance is also complex.

For practitioners primarily interested in the mean structure, VFL parameterized models still offer a benefit because how well the dispersion parameter is modelled affects the quality of the mean estimate. Taking the NB–H fitted by VGAM as a specific example, κ can be modelled more easily and ‘correctly’ even though only μ may be of interest. If x_k affects the mean and not κ then deleting $β_{(2) k}^{*}$ is cumbersome; however, when the model is NB–H–VFL parameterized, backward elimination via step4() easily removes that coefficient. VFL parameterized models are implemented in the latest version of VGAM (Yee, 2025) on CRAN.

Acknowledgements

The author is grateful to the referees and editors whose comments greatly improved earlier versions of the manuscript.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Funding

The authors received no financial support for the research, authorship and/or publication of this article.

References

Abid, Kokonendji and Masmoudi (2021) On Poisson–exponential–Tweedie models for ultra-overdispersed count data. AStA Advances in Statistical Analysis, 105, 1–23.

Ascari, Brisco AMD, Migliorati and Ongaro (2023) FlexReg: Regression Models for Bounded Continuous and Discrete Responses. R package version 1.3.0. Available at: URL https://CRAN.R-project.org/package=FlexReg

Carroll (2003) Variances are not always nuisance parameters. Biometrics, 59, 211–20.

Chambers and Hastie (eds) (1991) Statistical Models in S. Pacific Grove, CA: Wadsworth/Brooks Cole.

Cifuentes-Amado and Cepeda-Cuervo (2024) Overdispersed nonlinear regression models. Austrian Journal of Statistics, 53, 20–38.

del Castillo and Pérez-Casany (2005) Overdispersed and underdispersed Poisson generalizations. Journal of Statistical Planning and Inference, 134, 486–500.

Efron (1986) Double exponential families and their use in generalized linear regression. Journal of the American Statistical Association, 81, 709–21.

Engel and Brake (1993) Analysis of embryonic development with a model for under- or overdispersion relative to binomial variation. Biometrics, 49, 269–79.

Evans (1953) Experimental evidence concerning contagious distributions in ecology. Biometrika, 40, 186–211.

Faddy (1997) Extended Poisson process modelling and analysis of count data. Biometrical Journal, 39, 431–40.

Faddy and Smith (2008) Extended Poisson process modelling of dilution series data. Journal of the Royal Statistical Society: Series C (Applied Statistics), 57, 461–71.

Faddy and Smith (2011) Analysis of count data with covariate dependence in both mean and variance. Journal of Applied Statistics, 38, 2683–94.

Fangupo, Lucas, Taylor, Camp and Richards (2022) Sleep and parenting in ethnically diverse Pacific families in southern New Zealand: A qualitative exploration. Sleep Health, 8, 89–95.

Hilbe (2016) COUNT: Functions, Data and Code for Count Data. R package version 1.3.4. Available at: URL https://CRAN.R-project.org/package=COUNT

Hinde (1982) Compound Poisson regression models. In Gilchrist (ed.) GLIM 82: Proceedings of the International Conference on Generalised Linear Models. Pages 109–121. New York, USA: Springer.

Hinde and Demétrio CGB (1998) Overdispersion: Models and estimation. Computational Statistics and Data Analysis, 27, 151–70.

Huang (2017) Mean-parametrized Conway–Maxwell–Poisson regression models for dispersed counts. Statistical Modelling, 17, 359–80.

Joe and Zhu (2005) Generalized Poisson distribution: The property of mixture of Poisson and comparison with negative binomial distribution. Biometrical Journal, 47, 219–29.

MacMahon, Norton, Jackson, Mackie, Cheng, Vander Hoorn, Milne and McCulloch (1995) Fletcher Challenge-University of Auckland Heart and Health Study: Design and baseline findings. New Zealand Medical Journal, 108, 499–502.

Nelder and Lee (1991) Generalized linear models for the analysis of Taguchi-type experiments. Applied Stochastic Models and Data Analysis, 7, 107–20.

Nelder and Wedderburn RWM (1972) Generalized linear models. Journal of the Royal Statistical Society: Series A (Statistics in Society), 135, 370–84.

Osborne (1992) Fisher’s method of scoring. International Statistical Review, 60, 99–117.

Otake and Prentice (1984) The analysis of chromosomally aberrant cells based on beta-binomial distribution. Radiation Research, 98, 456–70.

Petterle, Bonat, Kokonendji, Seganfredo, Moraes and da Silva (2019) Double Poisson–Tweedie regression models. International Journal of Biostatistics, 15, 1–15.

Prentice (1986) Binary regression using an extended beta-binomial distribution, with discussion of correlation induced by covariate measurement errors. Journal of the American Statistical Association, 81, 321–27.

Puig (2003) Characterizing additively closed discrete models by a property of their maximum likelihood estimators, with an application to generalized Hermite distributions. Journal of the American Statistical Association, 98, 687–92.

Puig and Valero (2006) Count data distributions: Some characterizations with applications. Journal of the American Statistical Association, 101, 332–40.

Puig, Valero and Fernández-Fontelo (2024) Some mechanisms leading to underdispersion: Old and new proposals. Scandinavian Journal of Statistics, 51, 245–67.

Pujol-Rigol, Fernández and Casals (2025) A systematic review and comparative study of R packages for ordinal response regression models. WIREs Computational Statistics, 17, e70025.

Sellers (2023) The Conway–Maxwell–Poisson Distribution. Cambridge, UK: Cambridge University Press.

Sellers and Morris (2017) Underdispersion models: Models that are “under the radar”. Communications in Statistics–Theory and Methods, 46, 12075–86.

Shmueli, Minka, Kadane, Borle and Boatwright (2005) A useful distribution for fitting discrete data: Revival of the Conway–Maxwell–Poisson distribution. Journal of the Royal Statistical Society: Series C (Applied Statistics), 54, 127–42.

Smith and Faddy (2016) Mean and variance modeling of under- and overdispersed count data. Journal of Statistical Software, 69, 1–23. Available at: URL https://www.jstatsoft.org/v069/i06

Smith and Faddy (2019) Mean and variance modeling of under-dispersed and over-dispersed grouped binary data. Journal of Statistical Software, 90, 1–20. Available at: URL https://www.jstatsoft.org/v090/i08

Stasinopoulos, Rigby, Voudouris, Akantziliotou, Enea, Kiose and Zeileis (2025) gamlss: Generalized Additive Models for Location Scale and Shape. R package version 5.5-0. Available at: URL https://CRAN.R-project.org/package=gamlss

Stoklosa, Blakey and Hui FKC (2022) An overview of modern applications of negative binomial modelling in ecology and biodiversity. Diversity, 14, 1–24.

Tarone (1979) Testing the goodness of fit of the binomial distribution. Biometrika, 66, 585–90.

Tautolo, Faletau, Iusitini and Paterson (2020) Exploring success amongst Pacific families in New Zealand: Findings from the Pacific Islands families study. Auckland, New Zealand: Resource Books Ltd.

Warton and Hui FKC (2017) The central role of mean-variance relationships in the analysis of multivariate abundance data: A response to Roberts (2017). Methods in Ecology and Evolution, 8, 1408–14.

Wedderburn RWM (1974) Quasi-likelihood functions, generalized linear models, and the Gauss–Newton method. Biometrika, 61, 439–47.

Yang, Hardin and Addy (2009) Testing overdispersion in the zero-inflated Poisson model. Journal of Statistical Planning and Inference, 139, 3340–53.

Yee (2015) Vector Generalized Linear and Additive Models: With an Implementation in R. New York: Springer.

Yee (2025) VGAM: Vector Generalized Linear and Additive Models. R package version 1.1-14. Available at: URL https://CRAN.R-project.org/package=VGAM

Yee and Gray (2025) VGAMdata: Data Supporting the ‘VGAM’ Package. R package version 1.1-13. Available at: URL https://CRAN.R-project.org/package=VGAMdata

Yee and Ma (2024) Generally altered, inflated, truncated and deflated regression. Statistical Science, 39, 568–88.

Yee and Wild (1996) Vector generalized additive models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 58, 481–93.
