Abstract

In this article, we describe a new command, bamm, that implements a Bayesian method for addressing misclassification in multinomial data; see Swartz et al. (2004, Canadian Journal of Statistics 32: 285–302). We also describe a postestimation command, bammdx, that was developed to provide additional estimation diagnostics. We describe the method and the new commands and then present results from both a simulation study demonstrating bamm’s performance under a known misclassification data-generating process and an empirical example from alcohol epidemiology modeling.

Keywords

st0742, bamm, bammdx, Bayesian, multinomial, misclassification, alcohol, liver cirrhosis

1 Introduction

In this article, we present a new command for estimating categorical proportions in the presence of misclassification error. The motivating case for the development and implementation of this command comes from substance use epidemiology. In this field of research, quantitative goals include measuring alcohol relative risks (RRs) and attributable fractions for specific diseases (for example, Rehm and Shield [2013]; Kehoe et al. [2012]; Meier et al. [2013]; Nelson et al. [2013]) and deriving public health guidelines for consumption (Kalinowski and Humphreys 2016). Furthermore, multinomial data and the potential for misclassification bias are commonplace in many disciplines, including medicine (for example, Chen et al. [2019]), education (for example, Goldstein, Browne, and Charlton [2018]), and marketing (for example, Gmehling and La Mura [2016]). The command we describe below implements the Bayesian model proposed by Swartz et al. (2004), which explicitly accounts for the possibility of misclassification error in a multinomial data setting. Ignoring the possibility of misclassification can lead to biased and inefficient estimates of proportions (Goldstein and Wolf 1977; Katz and McSweeney 1979; Schwartz 1985).

Bross (1954) laid out the framework for working with misclassification in the binary case. If validation data are available, both the misclassification rate and the binary proportion can be estimated (Tenenbein 1970). Without validation data, these parameters may be bounded or estimated from a fully Bayesian model (for example, Gaba and Winkler [1992]; Joseph, Gyorkos, and Coupal [1995]; Bollinger and van Hasselt [2017]). However, when data have more than two categories, misclassification is more challenging with complex and potentially asymmetric errors across categories, and a much larger parameter space needed to support underlying distributions. This latter consideration is a major reason why Bayesian approaches to the problem have been popular (see Pérez et al. [2007] for a review). One such approach is from Swartz et al. (2004), who focused on Bayesian identifiability with multinomial data. Swartz et al. (2004) extended work on the binary case approach using mixtures of Dirichlet distributions (Evans et al. 1996) and took advantage of increasing computational capability to apply Markov chains to estimate high-dimensional posterior distributions.

As noted above, our interest in the statistical problem of misclassification in multinomial data is associated with modeling alcohol exposure distributions. In this context, researchers often rely on a nationally representative self-reported survey data source. These data sources typically collect quantities (for example, number of drinks per drinking day) and frequencies (for example, number of drinking days) of alcohol consumption during some reference period (for example, past 30 days). Using these data, we can derive the average amount of alcohol consumed daily either in terms of grams per day or in terms of standard drinking units (that is, 1 standard drinking unit = 10 grams per day). Because many national drinking guidelines are framed in terms of the number of standard drinking units, it is natural to classify individuals according to a prespecified set of drinking unit ranges. For example, in the empirical applications below, we will categorize survey respondents as: i) nondrinkers; ii) persons who consume up to one drink per day; iii) persons who consume one to two drinks per day; iv) persons who consume two to three drinks per day; v) persons who consume three to four drinks per day; vi) persons who consume four to five drinks per day; and vii) persons who consume more than five drinks per day.

A common concern, however, is that there is underlying error in the amount of alcohol consumption reported by survey respondents and that this causes some individuals to be misclassified (Kilian et al. 2020). While triangulation methods have been proposed for addressing this concern (Rehm et al. 2010a; Parish et al. 2017), these methods have some important limitations. A typical triangulation approach uses data from a self-reported survey to measure the average amount of alcohol consumed and compares this with a less error-prone measure of average alcohol consumption, such as from national records of alcohol sales. The ratio of the total volume of alcohol sold nationally per capita to the average amount of alcohol reported in a representative survey provides a measurement of how much people are underreporting on average. For example, if this ratio is approximately two, then on average survey respondents are reporting only about half of what they actually drink. This ratio can then be used as a constant multiplier to upshift the distribution of alcohol consumption as measured by the survey (that is, double all respondents' alcohol consumption). A key limitation associated with this approach is that it assumes that survey respondents underreport by a constant factor, which may be untrue. For example, it may be the case that more frequent drinkers underreport less than those drinkers who drink infrequently. Moreover, this approach implicitly assumes that there is no misclassification among nondrinkers. In the context of measuring alcohol-attributable fractions (AAFs), these limitations may be particularly important because prevalence estimates are used to weight the risk distribution and risks can vary substantially across the alcohol consumption distribution.

Swartz et al. (2004) provide a Bayesian misclassification method that can be used as an alternative method for adjusting the alcohol exposure distribution. As described below, the model is identified on the basis of constraints and by leveraging Bayesian priors. The Bayesian priors can be used to embed alternative assumptions about misclassification across the alcohol consumption distribution, and this provides greater flexibility over how the distribution is adjusted. Thus, this approach can be used to overcome the limitations of typical triangulation approaches.

2 The Swartz et al. (2004) model

2.1 The model

Let x_i denote a multinomial variable that takes the values 1,…, m for the ith observation. The goal is to measure the probability that x_i = j for j = 1,…, m. In the absence of misclassification, this would be a straightforward task (for example, the prevalence within each of the m categories would provide an unbiased estimator). However, in the presence of misclassification, x_i is a fallible measure, and the proportion of the sample observed within each of the m categories may under- or overstate the true prevalence. To develop a model that can correct for misclassification, Swartz et al. (2004) introduce a latent variable, denoted by t_i, that indicates the correct classification for each observation and a set of misclassification probabilities, denoted by π_kj ≡ Pr(x_i = j | t_i = k). Letting ρ_j ≡ Pr(t_i = j), the likelihood function is

L = \prod_{j = 1}^{m} \Pr(x_{i} = j)^{n_{j}} = \prod_{j = 1}^{m} \left(\sum_{k = 1}^{m} ρ_{k} π_{k j}\right)^{n_{j}}

where n_j denotes the number of observations such that x_i = j. To complete the description of the model, the priors for ρ ≡ (ρ_1,…, ρ_m) and π_j ≡ (π_j1,…, π_jm) for each j = 1,…, m are

ρ \sim \text{Dirichlet}(α_{1}, \dots, α_{m})

and

π_{j} \sim \text{Dirichlet}(β_{j 1}, \dots, β_{j m}) \quad \text{for } j = 1, \dots, m

The Dirichlet distribution is a multivariate extension of the Beta distribution and is a commonly used prior distribution in many Bayesian applications. In bamm (the command described in this article), the hyperparameters α ≡ (α_1,…, α_m) and β_j ≡ (β_j1,…, β_jm) for j = 1,…, m are prespecified by the user and can be adjusted from a default of 1 as an option to the command, though they must be strictly positive as a condition of the Dirichlet distribution.
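To make the likelihood concrete, the following Python sketch evaluates its logarithm from the category counts n_j, the classification probabilities ρ, and the misclassification matrix Π. It is an illustration only, not the code used in the bamm plugin; the function name, argument layout, and the two-category example are assumptions made here.

```python
import math

def log_likelihood(counts, rho, Pi):
    """Log of L = prod_j (sum_k rho_k * pi_kj)^(n_j).

    counts[j] is n_j, rho[k] is Pr(t = k), and Pi[k][j] is Pr(x = j | t = k).
    Indices are zero-based here, unlike the 1..m notation in the text.
    """
    m = len(counts)
    total = 0.0
    for j in range(m):
        # Marginal probability of observing category j.
        p_obs = sum(rho[k] * Pi[k][j] for k in range(m))
        total += counts[j] * math.log(p_obs)
    return total

# Hypothetical two-category data with no misclassification (Pi is the identity),
# so the likelihood reduces to the ordinary multinomial kernel.
identity = [[1.0, 0.0], [0.0, 1.0]]
example = log_likelihood([3, 1], [0.75, 0.25], identity)
```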

Swartz et al. (2004) provide a rigorous discussion of the concept of identification within the context of Bayesian modeling. In this discussion, they distinguish between two types of nonidentifiability. The first type corresponds with a more classical definition of parameter identification and can be resolved through the effective use of informative priors in the Bayesian specification of a prior and likelihood function. The second type is referred to as permutation-type nonidentifiability, and this type of nonidentifiability cannot be resolved through the use of priors. Mathematically, this type of nonidentifiability arises because any permutation of parameters that swaps the position of, for example, ρ_1 and ρ_2 while simultaneously swapping π_1 and π_2 leads to an equivalent value for $\sum_{k = 1}^{m} ρ_{k} π_{k j}$ for all j = 1,…, m and thus leads to an equivalent likelihood. While this type of nonidentifiability cannot be resolved through the use of priors, it can be overcome by imposing constraints on the misclassification probabilities, π_j. Multiple constraints that overcome permutation-type nonidentifiability are presented in Swartz et al. (2004), and these constraints range in terms of relative restrictiveness. The most restrictive of these constraints is

π_{j 1} < \cdots < π_{j, j - 1} < π_{j j} > π_{j, j + 1} > \cdots > π_{j m} \quad \forall j

To see how this constraint resolves the issue of identification, note that the likelihood function can be written in the presence of any constraint such that it returns zero for any parameter values that do not meet the constraint. For example, we could rewrite the likelihood as

L = \prod_{j = 1}^{m} \left(\sum_{k = 1}^{m} ρ_{k} π_{k j}\right)^{n_{j}} \times I\left(π_{j 1} < \cdots < π_{j, j - 1} < π_{j j} > π_{j, j + 1} > \cdots > π_{j m} \ \forall j\right)

If we now swap the positions of ρ_1 and ρ_2 while simultaneously swapping π_1 and π_2, then we will end up with two distinct likelihood values. This is because with the original positions, the likelihood function returns 0 unless all elements of the constraint are satisfied, including that at least the following two inequalities hold: π_21 < π_22 and π_11 > π_12. However, after swapping the positions of ρ_1 and ρ_2 while simultaneously swapping the positions of π_1 and π_2, we must rewrite the constraint, and the likelihood function returns 0 unless all the elements of the new constraints are satisfied, including that at least the following two inequalities hold: π_11 < π_12 and π_21 > π_22, which are diametrically opposed to the constraints as specified under the original positions. Thus, the two likelihood functions return distinct values, and the permutation-type identification issue has been resolved.
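The swap argument can be checked numerically. The Python sketch below uses hypothetical two-category values (chosen here for illustration) to show that the swapped parameters give identical marginal probabilities, and hence an identical unconstrained likelihood, while only the original ordering satisfies the two inequalities named above.

```python
def observed_probs(rho, Pi):
    """Marginal Pr(x = j) = sum_k rho_k * pi_kj for each category j."""
    m = len(rho)
    return [sum(rho[k] * Pi[k][j] for k in range(m)) for j in range(m)]

def satisfies_two_category_constraint(Pi):
    # With m = 2 the most restrictive constraint reduces to
    # pi_11 > pi_12 and pi_21 < pi_22 (zero-based indices below).
    return Pi[0][0] > Pi[0][1] and Pi[1][0] < Pi[1][1]

# Hypothetical parameter values.
rho = [0.6, 0.4]
Pi = [[0.8, 0.2],
      [0.3, 0.7]]

# Swap rho_1 with rho_2 and the row pi_1 with the row pi_2.
rho_swapped = [rho[1], rho[0]]
Pi_swapped = [Pi[1], Pi[0]]

same_marginals = all(
    abs(a - b) < 1e-12
    for a, b in zip(observed_probs(rho, Pi), observed_probs(rho_swapped, Pi_swapped))
)
```

Here same_marginals is true, so the data alone cannot distinguish the two labelings; the indicator term is what assigns the swapped version a likelihood of zero.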

The intuition here is that constraining the misclassification parameters solves the identifiability issue by essentially eliminating from the parameter space arbitrary permutations. Additionally, specifying constraints on the misclassification parameters is also somewhat intuitive. In words, the first constraint considered in Swartz et al. (2004) says that it is most likely that the observed classification is correct and that it becomes sequentially less likely that the correct classification differs from the observed classification as classifications more distant from the observed data are considered. For example, if a person is actually in category 1, the probability of observing them in category 4 is less likely than observing them in category 3, which in turn is less likely than observing them in category 2. This type of constraint is likely reasonable in a variety of analysis contexts where the data are ordinal. The remaining constraints suggested by Swartz et al. (2004) are similar in spirit and are detailed below as part of our discussion of the command options.

While the number of categories is theoretically unbounded, ensuring these types of constraints on the parameter space are met becomes increasingly complex as larger numbers of categories are allowed. We have found that the model performs reasonably well with as many as eight categories. More may be possible, but convergence will become slow and impractical at higher numbers of categories.

Last, note that, while constraints resolve the permutation-type nonidentifiability, priors are still needed to ensure identification for this model. In fact, in our experience, stronger priors lead to significant improvements in the identification of the parameters of interest. Examples below illustrate this point.

2.2 Details of the Gibbs sampling algorithm

Swartz et al. (2004) outline a Gibbs sampling algorithm for simulating from the posterior distribution for ρ . We present the algorithm below, with a minor modification to the sampling approach for the misclassification probabilities.

Step one

Sample from the discrete probability distribution for the latent unobserved variable, t_i, conditional on the observed variable, x_i, the corrected classification probabilities, ρ, and the misclassification probabilities, Π ≡ (π_1; …; π_m). The conditional probability mass function is

\Pr(t_{i} = j \mid x_{i}, ρ, Π) = \frac{ρ_{j} π_{j, x_{i}}}{\sum_{l = 1}^{m} ρ_{l} π_{l, x_{i}}}

We sample from this discrete probability distribution in the usual way. Specifically, we first generate a pseudo–random uniform variate, u, and then for each j = 1,…, m, we check whether F (j − 1) ≤ u ≤ F (j), where F (a) ≡ Pr(t_i ≤ a | x_i, ρ , Π) and F(0) ≡ 0. If this condition is met, then we set t_i = j. Otherwise, we check whether the condition is met for j + 1 and continue checking till the condition is met.
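A minimal Python sketch of this inverse-CDF draw follows. It is not the plugin's implementation; the function name and the zero-based category coding are assumptions made for illustration.

```python
import random

def sample_latent(x_i, rho, Pi, rng=random):
    """Draw t_i given the observed category x_i (categories coded 0..m-1)."""
    m = len(rho)
    # Unnormalized conditional probabilities rho_j * pi_{j, x_i}.
    weights = [rho[j] * Pi[j][x_i] for j in range(m)]
    total = sum(weights)
    u = rng.random()  # uniform draw on [0, 1)
    cumulative = 0.0
    for j in range(m):
        cumulative += weights[j] / total  # F(j) in the text's notation
        if u < cumulative:
            return j
    return m - 1  # guards against round-off leaving F(m) slightly below 1

# With no misclassification, the latent class must equal the observed class.
rng = random.Random(1)
draws = [sample_latent(1, [0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]], rng) for _ in range(100)]
```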

Step two

Sample from a Dirichlet distribution for the true but unobserved classification probabilities, ρ , conditional on t_i ,

(ρ \mid t_{i}) \sim \text{Dirichlet}\left(α_{1} + \sum_{i = 1}^{n} I(t_{i} = 1), \dots, α_{m} + \sum_{i = 1}^{n} I(t_{i} = m)\right)

where I(c) denotes an indicator function that returns 1 if condition c is true and returns 0 otherwise. We follow a standard algorithm for generating pseudorandom Dirichlet variates, which starts by sampling from a Gamma distribution with shape set to $α_{j} + \sum_{i = 1}^{n} I(t_{i} = j)$ and scale set to 1 for each of the m probability parameters. Then, after sampling m Gamma variates, denoted by y_j, we set

ρ_{j} = \frac{y_{j}}{\sum_{k = 1}^{m} y_{k}}, for j = 1, \dots, m
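The Gamma-normalization step can be sketched in a few lines of Python using the standard library's random.gammavariate(shape, scale). The function names are illustrative assumptions, and the latent classes are coded 0..m-1.

```python
import random

def sample_dirichlet(shapes, rng=random):
    """Draw one Dirichlet variate by normalizing independent Gamma(shape, 1) draws."""
    ys = [rng.gammavariate(a, 1.0) for a in shapes]
    total = sum(ys)
    return [y / total for y in ys]

def update_rho(t, alpha, rng=random):
    """Step two: draw rho given the current latent classes t and prior alpha."""
    m = len(alpha)
    counts = [0] * m
    for t_i in t:
        counts[t_i] += 1  # sum_i I(t_i = j)
    return sample_dirichlet([alpha[j] + counts[j] for j in range(m)], rng)

# Hypothetical latent classes for six observations and a flat prior.
rho_draw = update_rho([0, 0, 1, 2, 2, 2], [1.0, 1.0, 1.0], random.Random(7))
```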

Step three

Sample from a truncated-Dirichlet distribution for the misclassification probabilities, π_j , conditional on t_i and x_i ,

(π_{j} \mid t_{i}, x_{i}) \sim \text{truncated-Dirichlet}(γ_{j 1}, \dots, γ_{j m}), \quad \text{for } j = 1, \dots, m

where

γ_{j k} = β_{j k} + \sum_{i = 1}^{n} I (t_{i} = j) I (x_{i} = k)

The truncation points for this distribution are how the constraints are captured in the Gibbs algorithm. In general, regardless of the specific constraint chosen, we can always capture the constraints by specifying a lower and upper bound, q_jk1 and q_jk2, for each of the misclassification probabilities such that 0 ≤ q_jk1 ≤ π_jk ≤ q_jk2 ≤ 1.

As noted in Swartz et al. (2004), we can implement a rejection sampling scheme to sample from the truncated Dirichlet, which involves sampling from an untruncated Dirichlet and then checking whether the constraint is met. If not, then we continue sampling until the constraint is met. We use a more efficient sampling method instead, which is described in Damien and Walker (2001). This method essentially embeds a Gibbs algorithm within the larger Gibbs sampling scheme and is similar to an approach described in Swartz et al. (2004). In practice, however, we have found the Damien and Walker (2001) method to be more numerically stable.
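For reference, the simpler rejection scheme mentioned above (not the Damien and Walker method) can be sketched as follows. The constraint is passed in as a predicate; the function name, the max_tries safeguard, and the example constraint are assumptions introduced here.

```python
import random

def sample_truncated_dirichlet_rejection(gammas, constraint, rng=random, max_tries=100000):
    """Draw from a Dirichlet(gammas) restricted to the region where constraint(draw) holds.

    Repeatedly samples an untruncated Dirichlet and keeps the first draw
    that satisfies the constraint.
    """
    for _ in range(max_tries):
        ys = [rng.gammavariate(g, 1.0) for g in gammas]
        total = sum(ys)
        draw = [y / total for y in ys]
        if constraint(draw):
            return draw
    raise RuntimeError("no draw satisfied the constraint; acceptance rate is too low")

# Hypothetical first row of Pi: the diagonal element must be the largest.
row = sample_truncated_dirichlet_rejection(
    [5.0, 1.0, 1.0],
    lambda p: p[0] > max(p[1:]),
    random.Random(0),
)
```

Rejection sampling is easy to verify but can waste many draws when the constrained region has low posterior mass, which is one reason a more efficient scheme is attractive.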

This approach starts by sampling a pseudorandom uniform variate, u_0, between 0 and $π_{j m}^{c_{j m} - 1}$. If c_jm > 1, then we use cumulative distribution function inversion to generate π_jk for each k = 1,…, m − 1 from the distribution,

(π_{j k} \mid u_{0}, π_{j, -k}) \propto π_{j k}^{c_{j k} - 1} I\left(q_{j k 1} \leq π_{j k} \leq \min\left(1 - u_{0}^{1 / (c_{j m} - 1)} - \sum_{l \neq k} π_{j l}, q_{j k 2}\right)\right)

where π_{j,−k} denotes the set of misclassification probabilities in π_j (that is, those associated with t_i = j) after excluding the kth probability. If c_jm = 1, we do not need to introduce the u_0 variable, and the target distribution simplifies considerably. Last, if c_jm < 1, then we use cumulative distribution function inversion to generate π_jk for k = 1,…, m − 1 from the distribution:

(π_{j k} \mid u_{0}, π_{j, -k}) \propto π_{j k}^{c_{j k} - 1} I\left(\max\left(1 - u_{0}^{1 / (c_{j m} - 1)} - \sum_{l \neq k} π_{j l}, q_{j k 1}\right) \leq π_{j k} \leq q_{j k 2}\right)
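Both conditional distributions above have the same form: a density proportional to a power of π_jk on an interval. For a density proportional to x^(c−1) on [a, b] with c > 0, the cumulative distribution function is (x^c − a^c)/(b^c − a^c), which inverts in closed form. The Python sketch below shows that inversion only; the function name is an assumption, the bounds are taken as given inputs, and we assume that c_jk here plays the role of the Dirichlet parameter γ_jk.

```python
def sample_power_truncated(c, lower, upper, u):
    """Inverse-CDF draw from a density proportional to x**(c - 1) on [lower, upper].

    u is a uniform(0, 1) variate supplied by the caller; c must be positive.
    """
    lo_c = lower ** c
    hi_c = upper ** c
    return (lo_c + u * (hi_c - lo_c)) ** (1.0 / c)

# With c = 1 the density is flat, so u = 0.5 returns the midpoint of the interval.
midpoint = sample_power_truncated(1.0, 0.2, 0.6, 0.5)
```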

2.3 Model convergence and other calculations

bamm calculates and reports a summary of the $\hat{R}$ statistic presented in Gelman and Rubin (1992). This convergence check requires a set of independent Markov chains initialized from different starting values, and bamm uses the user-specified priors (or the default priors if unspecified) to independently sample starting values for each of the requested number of chains. The Gibbs algorithm detailed above samples the corrected classification probabilities, ρ, and the misclassification probabilities, Π. bamm calculates an $\hat{R}$ for each of the ρ_1,…, ρ_m parameters as well as for each of the π_1,…, π_m parameters. The $\hat{R}$ statistic compares the within-chain variance with a global variance estimator that mixes the within-chain and between-chain variances. Thus, the statistic captures whether the chains mix well: if these two variances are similar, then the $\hat{R}$ statistic will be close to one, whereas if these two variances are dissimilar, then the $\hat{R}$ statistic can be much larger than one. As a convergence check, it is standard practice to require that all parameters have an $\hat{R}$ < 1.1. To quickly check whether this is the case, the model summary reports a maximum $\hat{R}$ across all the parameters. The $\hat{R}$ calculations follow the approach used in Stan, a general-purpose Bayesian analysis software package. Specifically, we first split all the Markov chains in half and calculate the within-chain, between-chain, and global variances for each half chain as usual. As noted in the Stan documentation pages, this provides a more robust check for convergence (Stan Development Team 2022).
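The split-chain calculation can be sketched as follows, assuming the standard Gelman and Rubin variance decomposition applied to half chains. This is an illustration for a single parameter, not the code used by bamm, and the function name is an assumption.

```python
import math
from statistics import mean, variance

def split_rhat(chains):
    """Split-chain R-hat for one parameter; chains is a list of equal-length draw lists."""
    half = len(chains[0]) // 2
    halves = []
    for chain in chains:
        halves.append(chain[:half])
        halves.append(chain[half:2 * half])
    n = half
    # W: average within-chain variance across the half chains.
    W = mean([variance(h) for h in halves])
    # B: between-chain variance of the half-chain means, scaled by n.
    B = n * variance([mean(h) for h in halves])
    # Global variance estimate mixing within- and between-chain variation.
    var_plus = (n - 1) / n * W + B / n
    return math.sqrt(var_plus / W)

# Hypothetical draws: two chains exploring the same region, and two that do not overlap.
mixed = split_rhat([[0, 1, 0, 1, 0, 1, 0, 1], [1, 0, 1, 0, 1, 0, 1, 0]])
stuck = split_rhat([[0, 1, 0, 1, 0, 1, 0, 1], [10, 11, 10, 11, 10, 11, 10, 11]])
```

In this toy example, mixed falls below 1.1 while stuck is far above it, which is the pattern the maximum $\hat{R}$ summary is meant to flag.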

The other major statistic that bamm provides is the effective sample size and its corollary, efficiency. Markov chain Monte Carlo (MCMC) methods produce simulated samples for the posterior that are autocorrelated. This means that, while the user may request 10,000 post-burn-in points from the posterior distribution, the correlation present in the posterior sample may indicate a smaller effective sample size. The bamm estimator is known to produce posterior samples with high levels of autocorrelation (Swartz et al. 2004). The effective sample size is also used to calculate the Monte Carlo standard errors, which provide a measure of the precision of the posterior means. As with the $\hat{R}$ statistic, we split all the Markov chains in half to calculate the effective sample size for each parameter.
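To illustrate how autocorrelation reduces the effective sample size, the Python sketch below uses a deliberately simplified single-chain estimator: it sums sample autocorrelations until the first nonpositive value. This truncation rule and the function names are assumptions made here; the split-chain estimator used by bamm and Stan is more elaborate.

```python
import math
from statistics import mean, pvariance

def autocorrelation(chain, lag):
    """Sample autocorrelation of one chain at the given lag."""
    n = len(chain)
    mu = mean(chain)
    var = pvariance(chain, mu)
    cov = sum((chain[i] - mu) * (chain[i + lag] - mu) for i in range(n - lag)) / n
    return cov / var

def effective_sample_size(chain):
    """n / (1 + 2 * sum of autocorrelations), truncated at the first nonpositive lag."""
    n = len(chain)
    rho_sum = 0.0
    for lag in range(1, n):
        r = autocorrelation(chain, lag)
        if r <= 0.0:
            break
        rho_sum += r
    return n / (1.0 + 2.0 * rho_sum)

def monte_carlo_se(chain):
    """Monte Carlo standard error of the chain mean: sd / sqrt(ESS)."""
    return math.sqrt(pvariance(chain) / effective_sample_size(chain))

# Hypothetical chains: one with no positive autocorrelation, one that is very sticky.
alternating = [0.0, 1.0] * 50
sticky = [0.0] * 50 + [1.0] * 50
```

Efficiency is then the effective sample size divided by the number of saved draws; for the sticky chain it is well below one.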

Even when $\hat{R}$ < 1.1 for all sampled parameters, it is good practice to review some additional diagnostic information to ensure that convergence has been achieved. This includes visually inspecting trace plots (plots showing chain-specific sampled parameter values at each iteration of the Gibbs algorithm); chain-specific histograms or kernel density plots; and autocorrelation plots over iterations of the Gibbs algorithm. Additionally, if $\hat{R}$ ≥ 1.1 for at least one parameter, this will be indicated by the maximum $\hat{R}$ reported in the bamm output. However, in this event it can be useful to diagnose the issue further by inspecting the $\hat{R}$ statistic for each parameter individually to determine if it is a global issue or concentrated on one or two parameters. To support this additional investigation, we developed a suite of postestimation commands, which are all wrapped up under a unified command called bammdx. Depending on the subcommand chosen, bammdx will produce various diagnostic plots or provide a table of $\hat{R}$ statistics for each model parameter.

3 The bamm command

3.1 Syntax

The syntax of bamm was designed to be similar to the syntax for other Bayesian commands available in Stata; see [BAYES] bayes:

bamm varname [if] [in] [, cnumber(#) prior_p(name) prior_pi(name)
        nchains(#) mcmcsize(#) burnin(#) thinning(#) rseed(#) clevel(#)
        saving(filename[, replace])]

3.2 Required arguments

varname is the multinomial variable to be modeled. It must be integer valued and the minimum should be 1.

3.3 Options

cnumber( # ) specifies the constraint to be imposed on the misclassification probabilities. The default is cnumber(1). Only integers between 1 and 4 are allowed. The specific constraints available are specified below:

cnumber(1):

π_{j 1} < \cdots < π_{j, j - 1} < π_{j j} > π_{j, j + 1} > \cdots > π_{j m} \quad \forall j

cnumber(2):

π_{j i} < π_{j j} \forall i \neq j

cnumber(3):

π_{i j} + π_{j i} < π_{i i} + π_{j j} \forall i < j

cnumber(4):

π_{j i} < π_{i i} \forall j \neq i
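The four constraints can be written as checks on a candidate misclassification matrix. The Python sketch below is illustrative only: the function name and the example matrices are assumptions, and rows and columns are indexed from zero.

```python
def satisfies_constraint(Pi, cnumber):
    """Return True if the m x m matrix Pi meets the constraint selected by cnumber (1-4)."""
    m = len(Pi)
    if cnumber == 1:
        # Each row rises strictly to its diagonal element, then falls strictly.
        for j in range(m):
            row = Pi[j]
            if any(row[k] >= row[k + 1] for k in range(j)):
                return False
            if any(row[k] <= row[k + 1] for k in range(j, m - 1)):
                return False
        return True
    if cnumber == 2:
        # The diagonal element is the largest entry in its row.
        return all(Pi[j][i] < Pi[j][j] for j in range(m) for i in range(m) if i != j)
    if cnumber == 3:
        # Each off-diagonal pair sums to less than the matching diagonal pair.
        return all(Pi[i][j] + Pi[j][i] < Pi[i][i] + Pi[j][j]
                   for i in range(m) for j in range(i + 1, m))
    if cnumber == 4:
        # The diagonal element is the largest entry in its column.
        return all(Pi[j][i] < Pi[i][i] for i in range(m) for j in range(m) if j != i)
    raise ValueError("cnumber must be an integer between 1 and 4")

# Hypothetical matrices: the first meets all four constraints; the second
# violates constraint 1 in its first row but still meets constraint 2.
ordered = [[0.7, 0.2, 0.1], [0.2, 0.6, 0.2], [0.1, 0.3, 0.6]]
unordered = [[0.6, 0.1, 0.3], [0.2, 0.6, 0.2], [0.1, 0.3, 0.6]]
```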

prior_p( name ) specifies the hyperparameters for the Dirichlet prior for the corrected classification probabilities, ρ . name should be a Stata row matrix with m columns corresponding with each of the m categories in varname. The default is a 1 × m row matrix with all elements set to 1.0.

prior_pi( name ) specifies the hyperparameters for the Dirichlet priors for the misclassification probabilities, Π. name should be a Stata matrix that is m × m. The default is an m × m matrix whose elements are all set to 1.0.

nchains( # ) specifies the number of Markov chains to simulate. nchains(2) is the default. The $\hat{R}$ statistic requires simulating from at least 2 chains. bamm uses a C++ plugin library. Sampling from each chain is carried out independently and in parallel using the std::thread libraries available in C++. Thus, simulating more chains with a fixed mcmcsize() has the effect of improving the precision of the parameter estimates as well as allowing for the calculation of the estimation diagnostics with minimal computational cost.

mcmcsize( # ) specifies the MCMC sample size. The default is mcmcsize(10000). The algorithm will produce mcmcsize() samples for each chain specified by nchains().

burnin( # ) specifies the number of iterations for the burn-in period. The default is burnin(2500). Only post-burn-in samples from the posterior are saved by bamm. Although the default values for mcmcsize() and burnin() are typical starting points, the high level of autocorrelation in this model may warrant larger MCMC and burn-in sample sizes. Checking for convergence can inform whether a larger MCMC or burn-in sample size is necessary.

thinning( # ) specifies whether to thin the posterior samples. thinning(1) is the default, which has the effect of saving every observation from the posterior sample. Deviations from the default will save only every kth observation from the MCMC sample. If k > 1 is specified, then bamm will increase the number of MCMC samples produced such that, after keeping only every kth observation, the remaining sample size is equivalent to what was specified in mcmcsize() (or the default if unspecified). Thinning the posterior samples can reduce the amount of autocorrelation present in the posterior sample. However, a longer post-burn-in chain or more chains may produce the same effective sample size. Thus, this option can be viewed as a data storage convenience.
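The bookkeeping implied by thinning can be sketched in Python. This is an assumption-laden illustration rather than a description of the plugin: it assumes that the burn-in draws are not thinned and that the retained draws are taken at a fixed stride, and both function names are hypothetical.

```python
def iterations_per_chain(mcmcsize=10000, burnin=2500, thinning=1):
    """Total Gibbs iterations one chain must run so that mcmcsize draws remain after thinning."""
    return burnin + mcmcsize * thinning

def thin(draws, thinning):
    """Keep every thinning-th post-burn-in draw."""
    return draws[::thinning]

# With the defaults, each chain runs 12,500 iterations; with thinning(5) it runs 52,500.
default_total = iterations_per_chain()
thinned_total = iterations_per_chain(thinning=5)
```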

rseed( # ) specifies a seed value, which is passed to the bamm plugin. The plugin code uses this seed value for the first chain and offsets the seed value by the length of each chain, including the burn-in and post-burn-in periods, for each subsequent chain. Specifying a seed value is recommended because it ensures the analysis is reproducible. However, by default bamm uses rseed(0), ensuring reproducibility under the default setting for bamm.

clevel( # ) specifies the credible interval level. The default is clevel(95).

saving(filename[, replace]) saves the MCMC sample in filename.dta. The dataset will include a variable for each of the model parameters, with variable names set to eq1_p1,…, eq1_pk, as well as the chain (_chain), the iteration number (_index), the log likelihood value at each iteration (_loglikelihood), and the log posterior value at each iteration (_logposterior). The replace option will allow Stata to overwrite filename.dta if it exists. The saving() option is required for bammdx to be used after bamm.

3.4 Stored results

bamm stores the following in e():

3.5 Syntax for bammdx

The syntax of bammdx is also similar to other Stata commands and was designed to be similar to the syntax for the Bayesian postestimation commands bayesgraph and bayesstats grubin; see [BAYES] bayes. bammdx also encompasses multiple subcommands: bammdx plot, bammdx trace, bammdx ac, bammdx histogram, bammdx kdensity, and bammdx table. The syntax for each of these subcommands is as follows:

bammdx {plot | trace | ac | histogram | kdensity} {param_spec | _all} [if] [in] [,
        using(filename) sleep(#) wait close saving(filename[, replace])]

bammdx table [, using( filename )]

Calling bammdx after bamm requires that the user specify the saving() option for bamm. bammdx can also be called if the posterior samples are the active dataset in Stata, or bammdx can be called with the using() option specified providing the name of a Stata dataset that has the posterior samples stored.

3.6 Required arguments for bammdx

plot, trace, ac, histogram, kdensity, or table must be specified.

param_spec | _all must be specified after plot, trace, ac, histogram, or kdensity. This argument either provides the name of a model parameter (for example, eq1_p1) or, if _all is specified, produces plots for all model parameters.

3.7 Options for bammdx

using( filename ) specifies the name of a dataset that stores the simulation results from a previous call of bamm. If the previous command executed is bamm with the saving() option invoked, then it is not necessary to specify the name of the dataset that stores the simulation results. However, if another Stata command is executed in between calls of bamm and bammdx, this option is required. Alternatively, if the simulation results are already in memory, this option is not required.

sleep( # ) specifies that, when multiple graphs are produced, bammdx wait # milliseconds between graphs.

wait specifies that, when multiple graphs are produced, bammdx wait until the more condition is cleared before producing the next graph.

close specifies that, when multiple graphs are produced, bammdx close each graph before producing the next one.

saving(filename[, replace]) saves the graphs. When multiple graphs are requested, filename is used as a stem, and the graphs are saved as filename1, filename2, etc. The replace option specifies that the file be overwritten if it exists. This option essentially calls graph export, so the user can specify any of the supported file extensions (for example, .eps, .png), and the graph will be saved as the corresponding file type.

4 Examples

4.1 Simulation study

To illustrate the use of bamm, we conducted a small simulation study similar to the study presented in Swartz et al. (2004). We simulated data under assumed values for the correct classification probabilities, ρ , and assumed values for the misclassification probabilities, Π. To help other users in replicating this analysis, we provide a data file that contains a simulated multinomial variable, with known classification and misclassification probabilities. Below, we read this data file into memory and use tabulate to view the distribution of the misclassified multinomial variable. Notes were attached to the data file to provide a reference for the data-generating process used to simulate these data. These notes show that

ρ = \begin{bmatrix} 0.20 & 0.16 & 0.16 & 0.16 & 0.16 & 0.16 \end{bmatrix}

and

Π = \begin{bmatrix} 0.50 & 0.20 & 0.15 & 0.08 & 0.05 & 0.02 \\ 0.15 & 0.50 & 0.15 & 0.10 & 0.08 & 0.02 \\ 0.10 & 0.15 & 0.50 & 0.15 & 0.08 & 0.02 \\ 0.02 & 0.08 & 0.15 & 0.50 & 0.15 & 0.10 \\ 0.02 & 0.08 & 0.10 & 0.15 & 0.50 & 0.15 \\ 0.02 & 0.05 & 0.08 & 0.15 & 0.20 & 0.50 \end{bmatrix}

Tabulating this misclassified variate, we see that 15% of the sample is observed in the first category, between 17% and 19% of the sample is observed in each of the second through fifth categories, and 13% of the sample is observed in the sixth category. This shows us that the data are misclassified most strongly in the “tails” of the distribution, that is, in the lowest and highest categories.
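These observed shares are what the data-generating process implies in expectation. The short Python check below computes Pr(x = j) = Σ_k ρ_k π_kj from the ρ and Π given above; it is a verification sketch, not part of the bamm workflow.

```python
rho = [0.20, 0.16, 0.16, 0.16, 0.16, 0.16]
Pi = [
    [0.50, 0.20, 0.15, 0.08, 0.05, 0.02],
    [0.15, 0.50, 0.15, 0.10, 0.08, 0.02],
    [0.10, 0.15, 0.50, 0.15, 0.08, 0.02],
    [0.02, 0.08, 0.15, 0.50, 0.15, 0.10],
    [0.02, 0.08, 0.10, 0.15, 0.50, 0.15],
    [0.02, 0.05, 0.08, 0.15, 0.20, 0.50],
]

# Expected share of the sample observed in each category under misclassification.
observed = [sum(rho[k] * Pi[k][j] for k in range(6)) for j in range(6)]
# observed is approximately [0.1496, 0.1776, 0.1868, 0.1840, 0.1716, 0.1304]
```

The first category falls from a true share of 0.20 to about 0.15 and the sixth from 0.16 to about 0.13, while the middle categories are inflated.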

A naïve approach for estimating ρ , which ignores misclassification altogether, would be to estimate a multinomial logistic regression and use Stata’s margins command to obtain the predicted probabilities. As can be seen below, this leads to inaccurate estimates of ρ . Moreover, these estimates are identical to the observed distribution shown in the tabulation above, and none of the confidence intervals cover the true values of ρ .

Although the model can be identified on the basis of the constraints alone, flat priors are not recommended (Swartz et al. 2004). To illustrate why, we show what happens when all the hyperparameters in the prior distribution are set equivalently to 1, which is the default. Using the second constraint, and sampling with four chains, we see that the model does slightly better than the naïve estimator, coming closer to the true parameter value for ρ ₁ than the naïve estimator but performing much worse for ρ ₆ than the naïve estimator. More importantly, while the posterior means are still inaccurate, the credible intervals are wider, representing greater uncertainty over the parameter values and do cover the true values of ρ . The maximum Gelman–Rubin statistic reported shows that the model converged with all parameters having an $\hat{R}$ < 1.1. However, in practice a maximum $\hat{R}$ = 1.0865 would likely be considered too close to 1.1 by many peer reviewers.

Below, we see that increasing the number of Monte Carlo repetitions leads to a smaller maximum $\hat{R}$ . We continue to use flat priors but specify burnin(10000) and mcmcsize(40000). Estimates of ρ are similar, but we can now make a stronger argument for convergence as evidenced by a maximum $\hat{R}$ = 1.0312.

The last example shows what happens when we use informative priors for both the classification and misclassification probabilities. Specifically, we use a highly concentrated prior for the classification probabilities, which embeds an assumption that it is relatively more likely to be in the first classification than any other and that it is equally likely to be in the remaining classifications. Given our knowledge of the data-generating process, this is clearly a good assumption. We can see that the model now very accurately estimates ρ . Additionally, credible intervals are more narrow, while still covering the true values of ρ .

4.2 Empirical example

To provide an empirical example using bamm, we used a random sample from the 2019 Behavioral Risk Factor Surveillance System (BRFSS) on individuals aged 18 to 64 years to model the prevalence of drinking alcohol in the following categories: i) nondrinkers, ii) >0–1 drink/day, iii) >1–2 drinks/day, iv) >2–3 drinks/day, v) >3–4 drinks/day, vi) >4–5 drinks/day, and vii) > 5 drinks/day. These categories were derived from the quantity-frequency questions included in the BRFSS questionnaire.

We accounted for binge drinking as described in Stahre et al. (2006). Straightforward quantity-frequency calculations multiply the typical number of drinking days (F) by the average number of drinks per day (Q). This may underestimate average daily consumption if respondents sometimes or frequently have days on which they drink more than the reported average number of drinks per day. Because most surveillance surveys also provide data on the number of binge drinking days and average quantities consumed on binge drinking days, Stahre et al. (2006) recommend augmenting the basic quantity-frequency calculation as follows: define BQ as the binge quantity, BF as the binge frequency, and AF as the adjusted frequency of drinking (that is, AF = F − BF). Then the total drinks consumed can be calculated as Q × AF + BF × BQ.
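The binge-adjusted calculation can be sketched in a few lines of Python. The function name `total_drinks` and the example respondent are hypothetical; only the formula Q × AF + BF × BQ with AF = F − BF comes from the text.

```python
def total_drinks(q, f, bq, bf):
    """Binge-adjusted total drinks over the reference period.

    q  : average drinks per drinking day (Q)
    f  : number of drinking days (F)
    bq : drinks per binge drinking day (BQ)
    bf : number of binge drinking days (BF)
    """
    af = f - bf  # adjusted frequency: drinking days that were not binge days
    if af < 0:
        raise ValueError("binge days cannot exceed total drinking days")
    return q * af + bf * bq


# Hypothetical respondent: 10 drinking days at 2 drinks, 2 of them binge days at 6 drinks.
naive = 2 * 10                          # basic quantity-frequency product
adjusted = total_drinks(q=2, f=10, bq=6, bf=2)
print(naive, adjusted)
```

For this hypothetical respondent, the basic product gives 20 drinks, while the adjusted calculation gives 28, which shows how ignoring binge days can understate consumption.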

As described in the introduction, triangulation methods require data on per capita sales (or other similar sources). We used alcohol sales data from the National Institute on Alcohol Abuse and Alcoholism (NIAAA) (Slater and Alpert 2021). These data show there were 2.38 gallons of ethanol sold in 2019 across all beverage types per person aged 14 years or older. We converted per capita gallons of ethanol sold to average grams of per capita ethanol per day. The multiplier used in the triangulation approach presented in this study was then derived as 75% of the average grams of ethanol sold per day in 2019 divided by the average grams of ethanol consumed per day in the BRFSS sample used. Adjusting to 75% of the average amount of alcohol sold provides a more conservative adjustment and is consistent with what other epidemiological studies have done (see, for example, Esser et al. [2022]). While the per capita sales data do not account for nondrinkers, the denominator for the multiplier (that is, average daily alcohol consumption from the BRFSS) includes both drinkers and nondrinkers, with nondrinkers assigned zero grams of ethanol consumed per day. The BRFSS sample includes only individuals aged 18 to 64, whereas the per capita sales data represent sales per person aged 14 years or older. While the per capita sales data do provide separate estimates by beverage type, we did not consider different effects by beverage type for this analysis.
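The multiplier derivation can be sketched as follows in Python. The conversion constants (milliliters per U.S. gallon, ethanol density, and a 365-day year) are assumptions, because the article does not state which values were used. The survey mean is a made-up placeholder and not the BRFSS estimate.

```python
# Assumed conversion constants (not taken from the article).
ML_PER_US_GALLON = 3785.41
ETHANOL_GRAMS_PER_ML = 0.789
DAYS_PER_YEAR = 365

GALLONS_SOLD_PER_CAPITA = 2.38   # 2019 figure reported in the text
SALES_SHARE = 0.75               # 75% adjustment described in the text


def grams_per_day(gallons_per_year):
    """Convert gallons of ethanol per year to grams of ethanol per day."""
    grams_per_year = gallons_per_year * ML_PER_US_GALLON * ETHANOL_GRAMS_PER_ML
    return grams_per_year / DAYS_PER_YEAR


def triangulation_multiplier(sales_g_per_day, survey_g_per_day, share=SALES_SHARE):
    """Share of sales divided by the survey mean (nondrinkers counted as zero)."""
    return share * sales_g_per_day / survey_g_per_day


sales = grams_per_day(GALLONS_SOLD_PER_CAPITA)
survey_mean = 8.0  # hypothetical survey mean in grams per day, for illustration only
multiplier = triangulation_multiplier(sales, survey_mean)
print(round(sales, 2), round(multiplier, 3))
```

Under these assumed constants, the sales figure works out to roughly 19.5 grams of ethanol per person per day. The resulting multiplier depends entirely on the survey mean supplied.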

After deriving the triangulation multiplier, we multiplied the derived alcohol consumption variable by the multiplier. Other approaches (for example, Rehm et al. [2010a]; Parish et al. [2017]) adjust the parameters of estimated distributions (for example, the parameters of an estimated gamma distribution) rather than directly adjusting the data points themselves. We chose to directly adjust the data in this way for illustration purposes, because this allows us to provide the adjusted data to Stata users who can then apply our code examples and more easily replicate the examples in this article.

Using prevalence estimates (unadjusted, triangulation adjusted, and adjusted with bamm) along with RR data, we calculated the AAF of liver cirrhosis mortality using the following formula,

$$\mathrm{AAF}_{i} = \frac{P_{i} (RR_{i} - 1)}{1 + \sum_{j} P_{j} (RR_{j} - 1)}$$

where $P_i$ denotes the prevalence of drinking in the ith drinking category and $RR_i$ denotes the relative risk of liver cirrhosis mortality associated with drinking in category i. The formula used in this article decomposes the attributable fractions from each of the drinking categories as compared with no drinking. The total AAF can be calculated as the sum of the category-specific attributable fractions. Data on the RR of liver cirrhosis mortality were obtained from Rehm et al. (2010b).
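As a worked illustration of the formula, the Python sketch below computes category-specific and total AAFs from the rounded unadjusted female prevalences and RRs reported in Table 1. The function name `category_aafs` is hypothetical. Because the inputs are rounded to one decimal place, the results agree with the published figures only approximately.

```python
def category_aafs(prevalence, relative_risk):
    """Category-specific attributable fractions relative to nondrinkers.

    Both lists cover the drinking categories only; the reference
    (nondrinker) category has RR = 1 and contributes nothing.
    """
    excess = [p * (rr - 1.0) for p, rr in zip(prevalence, relative_risk)]
    denominator = 1.0 + sum(excess)
    return [e / denominator for e in excess]


# Rounded unadjusted female values from Table 1 (drinking categories only).
prev_female = [0.436, 0.047, 0.012, 0.007, 0.003, 0.006]
rr_female = [1.9, 5.6, 7.7, 10.1, 14.7, 22.7]

aafs = category_aafs(prev_female, rr_female)
total_aaf = sum(aafs)  # total AAF is the sum of the category-specific fractions
print([round(a, 3) for a in aafs], round(total_aaf, 3))
```

The total comes out close to the 47.7% reported for the unadjusted female estimates. The small gap reflects rounding in the tabulated inputs.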

Below, we use bamm to estimate the prevalence of drinking among female respondents in each of 7 drinking categories. The priors are specified as follows:

$$\rho \sim \text{Dirichlet}(800, 800, 300, 250, 220, 200, 180)$$

and

$$\begin{aligned}
\pi_{1} &\sim \text{Dirichlet}(150, 100, 100, 100, 100, 100, 100) \\
\pi_{2} &\sim \text{Dirichlet}(100, 150, 100, 100, 100, 100, 100) \\
\pi_{3} &\sim \text{Dirichlet}(100, 100, 150, 100, 100, 100, 100) \\
\pi_{4} &\sim \text{Dirichlet}(100, 100, 100, 150, 100, 100, 100) \\
\pi_{5} &\sim \text{Dirichlet}(100, 100, 100, 100, 150, 100, 100) \\
\pi_{6} &\sim \text{Dirichlet}(100, 100, 100, 100, 100, 150, 100) \\
\pi_{7} &\sim \text{Dirichlet}(100, 100, 100, 100, 100, 100, 150)
\end{aligned}$$

The prior for ρ reflects a prior belief that there is more underreporting among persons who are actually in higher drinking categories relative to those who are actually in lower drinking categories. The priors on $\pi_j$ for j = 1, …, 7 were chosen to be consistent with the second constraint, which we have used below.
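To see what these hyperparameters imply, the Python sketch below builds the 7 × 7 matrix of hyperparameters for the π priors (100 everywhere, plus 50 on the diagonal) and computes the prior means of each Dirichlet distribution. It is only a way to inspect the priors and does not show how they are passed to bamm; the names `rho_alpha`, `pi_alpha`, and `dirichlet_mean` are hypothetical.

```python
def dirichlet_mean(alpha):
    """Mean of a Dirichlet distribution: each hyperparameter over their sum."""
    total = sum(alpha)
    return [a / total for a in alpha]


# Hyperparameters for the prior on rho, as given in the text.
rho_alpha = [800, 800, 300, 250, 220, 200, 180]
k = len(rho_alpha)

# Row j holds the hyperparameters for pi_j: 150 on the diagonal, 100 elsewhere.
pi_alpha = [[100 + (50 if i == j else 0) for j in range(k)] for i in range(k)]

rho_prior_mean = dirichlet_mean(rho_alpha)
pi_prior_means = [dirichlet_mean(row) for row in pi_alpha]

print([round(m, 3) for m in rho_prior_mean])
print([round(m, 3) for m in pi_prior_means[0]])
```

Each π prior places a mean of 0.2 on the diagonal category and about 0.133 on each of the others. The prior mean for the highest category of ρ is about 0.065.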

As can be seen in the output above, all parameters have $\hat{R}$ < 1.1, with a maximum $\hat{R} = 1.0018$. While this provides a good argument for convergence, it is useful to also look at other diagnostic information, such as trace plots, autocorrelation plots, and others. To illustrate how to do this with bammdx, we produced all diagnostic plots for the first parameter, ρ₁. Figure 1 shows that there is no discernible walk across iterations of the Gibbs algorithm, that autocorrelation dies out after 10 or so lags, and that the chain-specific distributions mix well.

Figure 1. Diagnostic plots for ρ₁ after initial run of bamm

Table 1 summarizes the results. Triangulation-based estimates do not adjust the prevalence of nondrinkers for either female or male respondents. This is because triangulation upshifts consumption by a constant factor, and zero multiplied by any number remains zero. Based on the priors chosen in this analysis, bamm adjusts the upper end of the female distribution more severely than triangulation. Under triangulation, the prevalence of drinking > 5 drinks per day among female respondents shifts from 0.6% to 1.8%, whereas under bamm the prevalence shifts from 0.6% to 2.4%. In contrast, bamm adjusts the upper end of the male distribution less severely than triangulation. Accordingly, among females bamm produces larger estimates of the AAF than triangulation (65.0% versus 59.5%), while among males bamm produces smaller estimates of the AAF than triangulation (46.4% versus 57.2%). This highlights the importance of having a methodology that can more flexibly adjust across strata of the alcohol consumption distribution. Both approaches show that unadjusted results are likely underestimating the AAF of liver cirrhosis mortality cases.

Table 1. Estimates of the AAF of liver cirrhosis mortality cases

Sex/Category	RR	Unadjusted Prob.	Unadjusted AAF	Triangulation Prob.	Triangulation AAF	bamm Prob.	bamm AAF
Female
Nondrinker	Ref.	49.0%	Ref.	49.0%	Ref.	44.8%	Ref.
>0–1 drink/day	1.9	43.6%	20.5%	36.9%	13.4%	40.2%	12.7%
>1–2 drinks/day	5.6	4.7%	11.2%	7.1%	13.1%	4.4%	7.1%
>2–3 drinks/day	7.7	1.2%	4.1%	3.0%	8.0%	3.1%	7.3%
>3–4 drinks/day	10.1	0.7%	3.3%	1.7%	6.1%	2.7%	8.5%
>4–5 drinks/day	14.7	0.3%	1.9%	0.6%	3.4%	2.4%	11.4%
>5 drinks/day	22.7	0.6%	6.6%	1.8%	15.4%	2.4%	18.1%
Total			47.7%		59.5%		65.0%
Male
Nondrinker	Ref.	40.7%	Ref.	40.7%	Ref.	39.9%	Ref.
>0–1 drink/day	1.0	40.9%	0.02%	31.8%	0.01%	40.1%	0.02%
>1–2 drinks/day	1.6	9.1%	3.3%	9.9%	2.6%	6.1%	2.0%
>2–3 drinks/day	2.8	3.2%	3.5%	5.2%	4.0%	4.2%	4.0%
>3–4 drinks/day	5.6	2.1%	5.7%	3.6%	7.1%	3.5%	8.7%
>4–5 drinks/day	7.0	1.0%	3.6%	1.7%	4.4%	3.0%	9.7%
>5 drinks/day	14.0	3.1%	23.9%	7.0%	39.1%	3.1%	21.9%
Total			39.9%		57.2%		46.4%

SOURCE: Relative risk estimates are from Rehm et al. (2010b). All other estimates are from authors’ analysis of 2019 BRFSS data.

NOTE: Unadjusted estimates used a multinomial logistic regression to estimate predicted probabilities for each of the drinking categories. Triangulation estimates used the ratio of per capita alcohol sold divided by average alcohol consumption from the survey to upshift the alcohol distribution and then applied a multinomial logistic regression to the upshifted data. Adjusted estimates used bamm with cnumber(2). The same priors were applied for both women and men. Priors for the corrected classification probabilities were Dirichlet(800, 800, 300, 250, 220, 200, 180). Priors for the misclassification probabilities were Dirichlet{I(7) ∗ 50 + J(7, 7, 100)}.

5 Conclusion

bamm is a new command that implements the multinomial misclassification method proposed by Swartz et al. (2004). This approach is most appropriate for applications where a natural ordering across categories can be applied. This is because the permutation-type identification issues discussed here and in Swartz et al. (2004) are resolved via imposing logical constraints that rule out arbitrary permutations of parameters. Thus, with ordinal data a stronger argument can be made in favor of the types of constraints considered for this model. The simulation study highlights the role that informative priors also play in ensuring identification. Specifically, the simulation study shows that flat priors result in estimates that remain biased but where wide credible intervals do provide coverage for the known parameters from the data-generating process. Informative priors were easy to choose in the context of the simulation study because they could be chosen to reflect the known parameters from the data-generating process. Under these informative priors, the model eliminated practically all bias in the estimates and resulted in relatively narrow credible intervals.

This interplay between priors, credible interval width, and bias highlights the importance of the task of choosing and setting priors in empirical examples. Unfortunately, in most empirical applications, it is not likely that one can choose priors based on precise knowledge of the data-generating process, as is the case with the simulation study presented in this article. However, it may be possible to leverage external information to derive informative priors. For example, if a gold standard study exists demonstrating the degree to which a survey instrument misclassifies respondents, then results of this gold standard study could be used to derive informative priors.

The empirical study presented in this article provides some insight into the ways in which bamm provides advantages for alcohol epidemiology modeling. Specifically, we showed ways in which the model could more flexibly adjust the alcohol consumption distribution as compared with typical triangulation approaches, which rely on a multiplier derived from per capita sales estimates to shift the distribution by a constant proportion. We argue that this inflexibility can lead to biases in the AAF, and the empirical study shows that results differ substantively between triangulation and bamm.

While we have focused heavily on the use of bamm for modeling alcohol consumption, this approach could be applied in many other settings. The only requirement concerns the nature of the data: the approach requires a multinomial variate, so it cannot be applied directly to continuous data. As noted above, ordinal data are also most appropriate. The approach could also be used as an alternative to frequentist approaches to misclassification (for example, Oparina and Srisuma [2022]). Oparina and Srisuma (2022) and other similar approaches require instrumental variables or other specific benchmarks, which are not always feasible for every study. The Bayesian approach in this article allows users to “model their way out” of misclassification biases. While the results are likely sensitive to choice of prior, if priors are chosen carefully and presented with transparency, results will provide useful insights that can produce policy-relevant information.

6 Acknowledgment

This work was supported by a research grant from the National Institute on Alcohol Abuse and Alcoholism (NIAAA: R01AA027796-01). The views expressed in this article are the authors’ own and do not necessarily represent the official views of the NIAAA.

7 Programs and supplemental material

Supplemental Material, sj-zip-1-stj-10.1177_1536867X241233671 for A Bayesian method for addressing multinomial misclassification with applications for alcohol epidemiological modeling by William J. Parish, Arnie Aldridge and Martijn van Hasselt in The Stata Journal

To install the software files as they existed at the time of publication of this article, type

References

Bollinger C. R., van Hasselt. 2017. A Bayesian analysis of binary misclassification. Economics Letters 156: 68–73. https://doi.org/10.1016/j.econlet.2017.04.011.

Bross. 1954. Misclassification in 2 x 2 tables. Biometrics 10: 478–486. https://doi.org/10.2307/3001619.

Chen, Wang, Chubak, Hubbard R. A. 2019. Inflation of type I error rates due to differential misclassification in EHR-derived outcomes: Empirical illustration using breast cancer recurrence. Pharmacoepidemiology and Drug Safety 28: 264–268. https://doi.org/10.1002/pds.4680.

Damien, Walker S. G. 2001. Sampling truncated normal, beta, and gamma densities. Journal of Computational and Graphical Statistics 10: 206–215. https://doi.org/10.1198/10618600152627906.

Esser M. B., Sherk, Subbaraman M. S., Martinez, Karriker-Jaffe K. J., Sacks J. J., Naimi T. S. 2022. Improving estimates of alcohol-attributable deaths in the United States: Impact of adjusting for the underreporting of alcohol consumption. Journal of Studies on Alcohol and Drugs 83: 134–144. https://doi.org/10.15288/jsad.2022.83.134.

Evans, Guttman, Haitovsky, Swartz. 1996. Bayesian analysis of binary data subject to misclassification. In Bayesian Analysis in Statistics and Econometrics, ed. Berry D. A., Chaloner K. M., Geweke J. K., 67–77. New York: Wiley.

Gaba, Winkler R. L. 1992. Implications of error in survey data: A Bayesian model. Management Science 38: 913–925. https://doi.org/10.1287/mnsc.38.7.913.

Gelman, Rubin D. B. 1992. Inference from iterative simulation using multiple sequences. Statistical Science 7: 457–472. https://doi.org/10.1214/ss/1177011136.

Gmehling, La Mura. 2016. A Bayesian inference model for the credit rating scale. Journal of Risk Finance 17: 390–404. https://doi.org/10.1108/JRF-04-2016-0055.

Goldstein, Browne W. J., Charlton. 2018. A Bayesian model for measurement and misclassification errors alongside missing data, with an application to higher education participation in Australia. Journal of Applied Statistics 45: 918–931. https://doi.org/10.1080/02664763.2017.1322558.

Goldstein, Wolf. 1977. On the problem of bias in multinomial misclassification. Biometrics 33: 325–331. https://doi.org/10.2307/2529782.

Joseph, Gyorkos T. W., Coupal. 1995. Bayesian estimation of disease prevalence and the parameters of diagnostic tests in the absence of a gold standard. American Journal of Epidemiology 141: 263–272. https://doi.org/10.1093/oxfordjournals.aje.a117428.

Kalinowski, Humphreys. 2016. Governmental standard drink definitions and low-risk alcohol consumption guidelines in 37 countries. Addiction 111: 1293–1298. https://doi.org/10.1111/add.13341.

Katz B. M., McSweeney. 1979. Misclassification errors and categorical data analysis. Journal of Experimental Education 47: 331–338. https://doi.org/10.1080/00220973.1979.11011702.

Kehoe, Gmel, Shield K. D., Gmel, Rehm. 2012. Determining the best population-level alcohol consumption model and its impact on estimates of alcohol-attributable harms. Population Health Metrics 10(6). https://doi.org/10.1186/1478-7954-10-6.

Kilian, Manthey, Probst, Brunborg G. S., Bye E. K., Ekholm, Kraus, Moskalewicz, Sieroslawski, Rehm. 2020. Why is per capita consumption underestimated in alcohol surveys? Results from 39 surveys in 23 European countries. Alcohol and Alcoholism 55: 554–563. https://doi.org/10.1093/alcalc/agaa048.

Meier P. S., Meng, Holmes, Baumberg, Purshouse, Hill-McManus, Brennan. 2013. Adjusting for unrecorded consumption in survey and per capita sales data: Quantification of impact on gender- and age-specific alcohol-attributable fractions for oral and pharyngeal cancers in Great Britain. Alcohol and Alcoholism 48: 241–249. https://doi.org/10.1093/alcalc/agt001.

Nelson D. E., Jarman D. W., Rehm, Greenfield T. K., Rey, Kerr W. C., Miller, Shield K. D., Naimi T. S. 2013. Alcohol-attributable cancer deaths and years of potential life lost in the United States. American Journal of Public Health 103: 641–648. https://doi.org/10.2105/AJPH.2012.301199.

Oparina, Srisuma. 2022. Analyzing subjective well-being data with misclassification. Journal of Business and Economic Statistics 40: 730–743. https://doi.org/10.1080/07350015.2020.1865169.

Parish W. J., Aldridge, Allaire, Ekwueme D. U., Poehler, Guy G. P., Thomas C. C., Trogdon J. G. 2017. A new methodological approach to adjust alcohol exposure distributions to improve the estimation of alcohol-attributable fractions. Addiction 112: 2053–2063. https://doi.org/10.1111/add.13880.

Pérez C. J., Girón F. J., Martín, Ruiz, Rojano. 2007. Misclassified multinomial data: A Bayesian approach. Revista de la Real Academia de Ciencias Exactas, Fisicas y Naturales, A ser., 101: 71–80.

Rehm, Kehoe, Gmel, Stinson, Grant, Gmel. 2010a. Statistical modeling of volume of alcohol exposure for epidemiological studies of population health: The U.S. example. Population Health Metrics 8: 3. https://doi.org/10.1186/1478-7954-8-3.

Rehm, Shield K. D. 2013. Alcohol and mortality: Global alcohol-attributable deaths from cancer, liver cirrhosis, and injury in 2010. Alcohol Research: Current Reviews 35: 174–183.

Rehm, Taylor, Mohapatra, Irving, Baliunas, Patra, Roerecke. 2010b. Alcohol as a risk factor for liver cirrhosis: A systematic review and meta-analysis. Drug and Alcohol Review 29: 437–445. https://doi.org/10.1111/j.1465-3362.2009.00153.x.

Schwartz J. E. 1985. The neglected problem of measurement error in categorical data. Sociological Methods and Research 13: 435–466. https://doi.org/10.1177/0049124185013004001.

Slater M. E., Alpert H. R. 2021. Apparent per capita alcohol consumption: National, state, and regional trends, 1977–2019. Surveillance Report no. 117, U.S. Department of Health and Human Services, Public Health Service, National Institutes of Health. https://www.niaaa.nih.gov/sites/default/files/SR-117-Per-Capita-Consumption.pdf.

Stahre, Naimi, Brewer, Holt. 2006. Measuring average alcohol consumption: The impact of including binge drinks in quantity-frequency calculations. Addiction 101: 1711–1718. https://doi.org/10.1111/j.1360-0443.2006.01615.x.

Stan Development Team. 2022. Stan Modeling Language: User’s Guide and Reference Manual, Version 2.33. https://mc-stan.org/docs/stan-users-guide/index.html.

Swartz, Haitovsky, Vexler, Yang. 2004. Bayesian identifiability and misclassification in multinomial data. Canadian Journal of Statistics 32: 285–302. https://doi.org/10.2307/3315930.

Tenenbein. 1970. A double sampling scheme for estimating from binomial data with misclassifications. Journal of the American Statistical Association 65: 1350–1361. https://doi.org/10.2307/2284301.
