Sage Journals: Discover world-class research

Abstract

In small area estimation, it is sometimes necessary to use model-based methods to produce estimates in areas with little or no data. In official statistics, we often require that aggregates of small area estimates agree with national estimates for internal consistency purposes. Enforcing this agreement is referred to as benchmarking, and while methods currently exist to perform benchmarking, few are ideal for applications with non-normal outcomes and benchmarks with uncertainty. Fully Bayesian benchmarking is a theoretically appealing approach insofar as we can obtain posterior distributions conditional on a benchmarking constraint. However, existing implementations may be computationally prohibitive. In this paper, we critically review benchmarking methods in the context of small area estimation in low- and middle-income countries with binary outcomes and uncertain benchmarks, and propose a novel approach in which posterior samples of small area characteristics from an unbenchmarked model can be combined with a rejection sampler or Metropolis-Hastings algorithm to produce benchmarked posterior distributions in a computationally efficient way. To illustrate the flexibility and efficiency of our approach, we provide comparisons to an existing benchmarking approach in a simulation, and applications to HIV prevalence and under-5 mortality estimation. Code implementing our methodology is available in the R package stbench.

Keywords

small area estimation area-level model unit-level model HIV prevalence under-5 mortality

1. Introduction

In a public health context, small area estimates, where small areas are defined as domains with little to no data, are often produced by official statistics agencies for the purpose of informing targeted public health interventions. In low-and middle-income countries (LMICs), the most reliable source of public health information is household surveys such as Demographic and Health Surveys (DHS) and Multiple Indicator Cluster Surveys (MICS), and as such, direct (weighted) survey estimates are considered the gold standard when they are precise enough to be practically useful. In small area estimation, where we have by definition little to no data in certain domains, it may not be possible to obtain direct, survey estimates with reasonable precision. In these settings, model-based methods are often used when estimates at small levels of aggregation are required (Rao and Molina 2015). Model-based methods that incorporate spatial random effects, in the absence of reliable covariate information, allow for estimates in all small areas to be produced with little to no data, introducing some bias into the estimates in exchange for tighter interval estimates (Knorr-Held 2000; Riebler et al. 2016; Wakefield et al. 2020). Area-level models, such as the popular Fay-Herriot model (Fay and Herriot 1979), consider data at the level of the small area, whereas unit-level models consider data at the level of the individual or cluster (Battese et al. 1988). When area-level sample sizes are particularly small, unit-level models may be more desirable. For a review of both area-level and unit-level models, see Chapters 4 to 7 of Rao and Molina (2015).

It is often required that small area estimates agree with estimates at a higher level of aggregation. For example, subnational estimates may be required to aggregate to a national estimate. These higher level estimates are referred to as benchmarks, and are frequently considered more reliable than small area estimates since more data are available to inform them, or they are direct (weighted) estimates and therefore less dependent on model assumptions. Note that benchmarking as we consider it in this paper is distinct from calibration weighting (Särndal et al. 2003; Si and Zhou 2020). The latter involves incorporating known, population-level covariate information into a model to adjust for potential bias in a subnational model, while the former involves incorporating known, population-level outcome information into a model to enforce internal consistency in an official statistics setting.

Benchmarks may be from the same data source that was used to produce small area estimates or from an outside source, referred to as internal benchmarking and external benchmarking, respectively (Bell et al. 2013). In an internal benchmarking setting, the use of the same data twice—once to calculate model-based small area estimates and also to define benchmarks at a higher level of spatial aggregation—may lead to an understating of statistical uncertainty unless additional care is taken. The method of You et al. (2002) can be used to quantify the posterior mean squared error (MSE) of an internally benchmarked predictor in a Bayesian context. We also direct the reader to Pfeffermann and Tiller (2006) for an approach to assessing the uncertainty of internally benchmarked predictors. We note that the particular applications we consider in this paper are in an external benchmarking setting, and that this setting is relatively common in global health applications (Eaton et al. 2021; Osgood-Zimmerman et al. 2018) as well as more recent applications in agriculture (Chen et al. 2022). Internal benchmarking settings and their limitations are outside the scope of this paper.

Many existing benchmarking approaches treat benchmarking as a constraint problem, where estimates at smaller areas are constrained to agree with estimates at a higher level of aggregation. Approaches vary by the ways in which constraints have been incorporated into a modeling framework, and as such, the interpretation of resulting benchmarked estimates varies by approach. There are many ways to categorize benchmarking approaches, but one we may consider is a one-step versus two-step procedure. In a two-step procedure, estimates and uncertainty are first obtained from a model that is agnostic to the benchmarking constraint and are then adjusted to satisfy the benchmarking constraint. Datta et al. (2011), Ghosh and Steorts (2013), Steorts et al. (2020), Wang et al. (2008), Patra and Dunson (2018), Patra (2019), Ghosh et al. (2015), and Williams and Berg (2013) and Berg and Fuller (2018) all consider a two-step procedure where benchmarked estimates are calculated by minimizing posterior expected loss (for a given loss function) subject to a benchmarking constraint. Unbenchmarked estimates are first obtained, and then projected into a constrained space.

Other two-step procedures include difference benchmarking and ratio benchmarking (henceforth referred to as raking), described in Erciulescu et al. (2019). Again, a model that is agnostic to the benchmarking constraint is first fit, and then estimates are adjusted by a constant so that the benchmarking constraint is satisfied (Erciulescu et al. 2018, 2019, 2020). You et al. (2002) consider a hierarchical Bayesian model for unbenchmarked estimates, but obtain benchmarked estimates using the raking method, and quantify uncertainty via posterior MSE as opposed to estimating full posterior distributions for benchmarked estimates.

Other benchmarking approaches incorporate the benchmarking constraint into the data likelihood—referred to as an augmented model in Bell et al. (2013), Berg and Fuller (2018), and Stefan and Hidiroglou (2021)—and thus produce automatically benchmarked estimates in a single modeling step. You and Rao (2002) also follow this approach, and refer to this as a “self-benchmarking” property. Others (Erciulescu et al. 2019; Janicki and Vesper 2017; Nandram et al. 2019; Nandram and Sayit 2011; Zhang and Bryant 2020) propose benchmarking approaches that produce full, benchmarked posterior distributions for small area estimates, making uncertainty quantification straightforward in an external benchmarking setting. Following Zhang and Bryant (2020), we refer to such approaches as fully Bayesian benchmarking approaches. Note that we consider this type of approach to be distinct from fitting a Bayesian model and benchmarking only point estimates. Nandram and Sayit (2011) develop a fully Bayesian benchmarking approach specifically for betabinomial models. Janicki and Vesper (2017) perform fully Bayesian benchmarking by minimizing the KL divergence between a benchmarked and unbenchmarked posterior using moment constraints. Nandram et al. (2019) perform fully Bayesian benchmarking using a transformation of the benchmarking constraint that corresponds to “deleting” a single small area, but note that benchmarked estimates vary based on which area is deleted. Erciulescu et al. (2019) consider four different benchmarking approaches, one of which is the Bayesian method described in Nandram and Sayit (2011), and three of which involve fitting a hierarchical Bayesian model and benchmarking point estimates using ratio or difference benchmarking. Zhang and Bryant (2020) develop a fully Bayesian benchmarking approach that produces full posterior distributions conditional on either a soft or hard benchmarking constraint.

Benchmarking approaches have also been developed in a time series context, with certain similarities to small area estimation in requiring finer estimates in time to agree with aggregate estimates across multiple years (Dagum and Cholette 2006). Different benchmarking approaches are more or less appealing than others—depending on context—with regards to obtaining measures of uncertainty, computational tractability, and the way in which the benchmarking constraint is enforced.

Benchmarking methods may also differ depending on whether benchmarking must be exact or inexact. In exact benchmarking, as the name suggests, the benchmarking constraint must hold exactly, whereas in inexact benchmarking the constraint must hold within some margin of error. The latter can be viewed as a soft constraint as opposed to a hard constraint. Exact benchmarking may be appropriate if the benchmarks are unbiased and have little to no uncertainty, which can occur when they come from a census (Trabelsi and Hillmer 1990). Inexact, external benchmarking may be appropriate if national estimates come from a different survey and contain appreciable sampling error, as discussed in Hillmer and Trabelsi (1987), or if national estimates are model-based with appreciable uncertainty, as in the settings we consider in our application.

In this article, we present two novel implementations of the fully Bayesian benchmarking approach described in Zhang and Bryant (2020) that are more flexible and computationally tractable in many settings. Our approaches combine an unbenchmarked model with either a rejection sampler or Metropolis-Hastings algorithm to produce fully Bayesian benchmarked posteriors, which we describe in Subsection 3.4. We compare our method to that described in Zhang and Bryant (2020) in the setting of modeling HIV prevalence in South Africa, as well as modeling under-5 mortality rates (U5MR) in Namibia. These applications were chosen to demonstrate the flexibility of the proposed fully Bayesian approaches in estimating outcomes that lie between 0 and 1 with unique benchmarking constraints, the efficiency of the method in cases where the benchmarks are very consistent or inconsistent with the small area estimates, and the flexibility of the approaches to handle both area-level and unit-level models. To emphasize the novelty of our approaches and computational advantages of the proposed method over the Markov chain Monte Carlo (MCMC) samplers used in Zhang and Bryant (2020) in terms of flexibility, we use integrated nested Laplace approximation (INLA) and Template Model Builder (TMB) as alternative ways to conduct Bayesian inference using Laplace approximations, that are fast and do not require users to code model-specific fitting routines (Kristensen et al. 2016; Rue et al. 2009). We additionally compare run times for our proposed approaches to that of Zhang and Bryant (2020) in a simulation, and show that our proposed approaches provide not only increased flexibility in terms of modeling for Bayesian inference, but computational speed gains as well. All code for fitting the models described in this paper is available via the R package stbench, found at https://github.com/taylorokonek/stbench.

2. Small Area Models in Low- and Middle-Income Countries

The estimation of U5MR in LMICs at a subnational level motivates our desire for benchmarking. The UN Inter-agency Group for Child Mortality Estimation (IGME) produces annual, national level estimates of U5MR for all countries using a Bayesian B-spline bias-reduction (B3) method (Alkema and New 2014). Various data are used to produce B3 estimates, including vital registration, census, and household surveys, and many of these sources cannot be used for producing subnational estimates because geographic information is lacking, or the data type is not amenable to incorporate into a small area model. Subnational estimates of U5MR are of interest in addition to national estimates, in accordance with the Sustainable Development Goals (https://sustainabledevelopment.un.org/post2015/transformingourworld).

Model-based small area estimation has a long history in LMICs. While small area estimates that incorporate the survey design directly are preferred, they are often impractical at a small area level, with either too little precision to be practically useful or no data available in some small areas (Lehtonen and Veijanen 2009; Wakefield et al. 2020). This lack of data primarily comes from a disconnect between the administrative level at which these surveys are designed to produce reliable estimates (often administrative level 1, or state) and the administrative level at which public health interventions are made (often administrative level 2, or county). Model-based methods with spatial smoothing terms allow us to obtain precise estimates in all small areas, as required for public health estimates (Datta 2009; Wakefield et al. 2020).

Due to these smoothing terms and the inability to use data sources without geographic information to produce small area estimates, benchmarking is required to align subnational estimates with national estimates from the B3 model for internal consistency within the production of UN IGME estimates. Subnational estimates are currently produced for a handful of countries using a Beta-binomial model described in Wu et al. (2021), but benchmarking approaches in this context have not yet been rigorously explored. As national estimates are model-based (with uncertainty) in this context, we aim to use a benchmarking approach that incorporates national level uncertainty (i.e., is inexact).

Many public health outcomes, including HIV prevalence and U5MR that we consider for our applications, are estimated from Bernoulli or binomial counts. As such, binomial models are a typical choice in an LMIC setting (Eaton et al. 2021; Wu et al. 2021). We aim to use a benchmarking approach that is well-suited to binomial models with estimates that are proportions, and hence lie between 0 and 1. Just as the unbenchmarked estimates will lie in this range, the benchmarked estimates from our chosen benchmarking approach should as well.

Additionally, the approach we use should be fast and computationally flexible. These properties are particularly important in an LMIC setting, as practitioners and national statistics offices in LMICs may not have the same computational resources as those in high-income countries.

While our motivation for benchmarking primarily comes from subnational estimation of U5MR, the application to HIV prevalence is similarly important in a LMIC context, in that small area estimates of HIV prevalence are produced in an official statistics setting by UNAIDS and require benchmarking for internal consistency (Eaton et al. 2021). Similarly to U5MR, subnational estimates of HIV prevalence inform public health interventions and allow countries to monitor their progress toward the Sustainable Development Goals. The theoretical properties, computational speed, and flexibility of our proposed approaches are relevant to HIV prevalence estimation in LMICs, as (similarly to U5MR) national estimates are produced with uncertainty, and benchmarked estimates should lie between 0 and 1.

2.1. Small Area Models

Below we describe the unbenchmarked models that we consider for the HIV application, but they are generally appropriate for a binary outcome. The unbenchmarked models for the U5MR application are described in Subsection 2.2 of the Supplemental data. If there are sufficient data in each target area, then weighted (direct) estimates are reliable. Such estimates include the Horvitz-Thompson (HT) (Horvitz and Thompson 1952) and Hájek estimators (Hájek 1971). In situations where the direct estimates can be calculated but have unacceptably high design variance, one may smooth using an area-level model, such as the model of Fay and Herriot (1979). When the data are sparser still, area-level direct estimates may be unusable. With very little data, the variance estimate can be statistically unreliable or may not be calculable at all. In these cases, the raw data (counts, in the binary context) are modeled (in our HIV and U5MR examples, at the cluster level) to give a unit-level model.

2.1.1. Area-Level Model: Spatial Fay-Herriot

Let the HT estimator for area i be denoted ${\hat{θ}}_{i}^{H T}$ with design-based variance ${\hat{V}}_{i}^{*, H T}$ . Under the Fay-Herriot model, a working likelihood may be based on the distribution,

y_{1 i}^{*} = logit ({\hat{θ}}_{i}^{H T}) \sim N (η_{i}, {\hat{V}}_{i}^{H T})

(1)

where ${\hat{V}}_{i}^{H T} = V_{i}^{*, H T} / {({\hat{θ}}_{i}^{2, H T} (1 - {\hat{θ}}_{i}^{2, H T}))}^{2}$ is obtained via the delta method. Treating $y_{1 i}^{*}$ and ${\hat{V}}_{i}^{H T}$ as fixed, as in the Fay-Herriot model, we then consider the area-level model with the top level given in equation (1), and mean model

η_{i} = β_{0} + b_{i}

where $β_{0}$ is an intercept term, and $b_{i}$ follows a BYM2 spatial model (Riebler et al. 2016), denoted $b_{i} \sim BYM2 (τ_{b}, ϕ)$ , where $τ_{b}$ is the total precision and ϕ is the proportion of the variance that is spatial. The BYM2 model is a reparameterization of the BYM spatial random effect developed by Besag et al. (1991), which consists of an unstructured (iid) random effect and a structured, ICAR spatial random effect (Besag 1974). A sum-to-zero constraint is placed on the structured component of the BYM2 random effect to ensure identifiability (Besag et al. 1991). The above model is often used in the estimation of U5MR in LMICs as it directly accounts for survey design and incorporates spatial smoothing random effects (Li et al. 2019; Wakefield et al. 2020), and is readily applicable to applications of HIV prevalence as well (Wakefield et al. 2020).

For priors, we set $β_{0} \sim N ({0, 0.001}^{- 1})$ , $ϕ \sim Beta (0.5, 0.5)$ . We use penalized complexity (PC) priors for the precision $τ_{b}$ with values $U = 1$ and $α = 0.01$ , corresponding to a prior belief that the probability $σ_{b} = 1 / \sqrt{τ_{b}}$ is greater than 1 is $1 %$ (Simpson et al. 2017). Note that we chose less-informative hyperpriors for some of our model parameters. In practice, priors for hyperparameters should be carefully chosen for the given application.

2.1.2. Unit-Level Model: Binomial

We may assume binomial observations of the outcome for each cluster c within area i. Let $y_{1 i [c]}$ be the number of cases observed in $n_{i [c]}$ total people in cluster c within area i. We consider the unit-level model

\begin{array}{l} y_{1 i [c]} | n_{i [c]}, θ_{i [c]} \sim Binomial (n_{i [c]}, θ_{i [c]}), \\ η_{i [c]} = logit (θ_{i [c]}) = β_{0} + b_{i} + e_{i [c]}, \end{array}

where $θ_{i [c]}$ is prevalence in cluster c of area i, $β_{0}$ is an intercept term, $b_{i}$ again follows a BYM2 model (Riebler et al. 2016), denoted $b_{i} \sim BYM2 (τ_{b}, ϕ)$ , and $e_{i} [c] \overset{i i d}{\sim} N (0, σ_{e}^{2})$ is an iid cluster-level random effect. Constraints and priors are the same as for the area-level model, with the addition of a PC prior on $σ_{e}$ with values $U = 1$ and $α = 0.01$ .

Area-level predictions ${\hat{θ}}_{i}$ are obtained by marginalizing over the cluster random effect to get

{\hat{θ}}_{i} = expit (\frac{β_{0} + b_{i}}{\sqrt{1 + h^{2} σ_{e}^{2}}}),

where $h = \frac{16 \sqrt{3}}{15}$ . The cluster-level random effect is excluded from predictions, and a correction is done using h and $σ_{e}^{2}$ , as described in Dong and Wakefield (2021) and detailed in Subsection 9.13.1 and Exercise 9.1 of Wakefield (2013). The correction accounts for within-cluster variation that induces cluster-level overdispersion. If we instead believed the cluster-level random effect reflected true between-cluster differences, we could obtain predictions such as those obtained in Dong and Wakefield (2021).

3. Methods

Let $y_{2}$ be a national level estimate of the outcome of interest, possibly with uncertainty given by a confidence interval or standard error, $\hat{θ} = {({\hat{θ}}_{1}, \dots, {\hat{θ}}_{n})}^{Τ}$ be small area estimates for areas $i = 1, \dots, n$ , and $w_{i}$ be weights that do not depend on $θ$ , standardized so that $Σ_{i = 1}^{n} w_{i} = 1$ . Often, these weights are set to $w_{i} = N_{i} / Σ_{j = 1}^{n} N_{j}$ , where $N_{j}$ is the population size for the small area j. Note that the population sizes used should correspond to the population under study. For example, when estimating HIV prevalence using survey data, $N_{j}$ should be population counts for individuals within the age range in the sampling frame. For DHS surveys, this consists of individuals aged 15 to 49. To our knowledge, incorporating uncertainty about the population counts into benchmarking approaches has not yet been considered in the benchmarking literature.

Throughout the paper, we consider benchmarking constraints of the most basic form, $Σ_{i = 1}^{n} w_{i} {\hat{θ}}_{i} = y_{2}$ . While more complex benchmarking constraints—or even multiple benchmarking constraints as noted in Zhang and Bryant (2020)—may be useful in certain settings, this simple equality constraint is reasonable for the HIV prevalence and U5MR applications we consider, and is the constraint currently used in practice by both UNAIDS and the UN IGME (Eaton et al. 2021; Wu et al. 2021). For examples of benchmarking applications with inequality constraints, see Steorts et al. (2020) or Nandram et al. (2022).

In the following subsections we describe a subset of existing approaches to benchmarking and propose two novel approaches to fully Bayesian benchmarking. The methods we describe in detail were chosen either because they are commonly used benchmarking methods in an official statistics setting, or because they are particularly relevant to our motivating application. One method is not inherently preferable, though we argue that some are more appropriate than others in the context of estimating outcomes between 0 and 1 in LMICs. The only existing fully Bayesian approach we describe is that of Zhang and Bryant (2020), as our proposed approach builds directly from their methodology.

3.1. Benchmarked Bayes Estimate Approach

The first approach we describe was developed in a Bayesian, decision theoretic framework (Datta et al. 2011; Steorts et al. 2020). This involves minimizing expected posterior MSE loss subject to the benchmarking constraint, which results in a projection of the unbenchmarked estimates into a benchmarked (constrained) space. Since its development, others have extended this decision theoretic approach with different loss functions (Berg and Fuller 2018; Ghosh et al. 2015; Williams and Berg 2013). Methods have also recently been developed to obtain benchmarked uncertainty for these estimates under this decision theoretic framework (Patra 2019; Patra and Dunson 2018). As it involves minimizing expected posterior loss subject to a benchmarking constraint we call this approach the Benchmarked Bayes Estimate approach. Though we will argue that this approach is not appropriate for our motivating application, we detail it here as it is theoretically appealing, fast, and commonly used in many benchmarking applications.

For n small areas, let $\hat{θ} = {({\hat{θ}}_{1}, \dots, {\hat{θ}}_{n})}^{Τ}$ be the direct estimators of the small area means $θ = {(θ_{1}, \dots, θ_{n})}^{Τ}$ . We are interested in computing the benchmarked Bayes estimator ${\hat{θ}}^{B M} = {({\hat{θ}}_{1}^{B M}, \dots, {\hat{θ}}_{n}^{B M})}^{Τ}$ of $θ$ such that the constraint $Σ_{i = 1}^{n} w_{i} {\hat{θ}}_{i}^{B M} = y_{2}$ is satisfied.

As Datta et al. (2011) are interested in an estimate of the benchmarked posterior mean, they consider minimizing the posterior expectation of the weighted squared error loss $Σ_{i = 1}^{n} ϕ_{i} E [(θ_{i} - e_{i})^{2} | y_{1}]$ under the constraint ${\bar{e}}_{w} : = Σ_{i = 1}^{n} w_{i} e_{i} = y_{2}$ , where $ϕ_{i}$ are weights not necessarily equal to $w_{i}$ , and $y_{1}$ is a vector of area-level observations. They note that the weights $ϕ_{i}$ could be different for different policy makers, and a simple default is setting $ϕ_{i} = 1$ for all $i = 1, \dots, n$ . The resulting benchmarked Bayes estimate (solution to the constrained minimization of the posterior expectation of the weighted squared error loss) is

{\hat{θ}}^{B M} = {\hat{θ}}^{B} + s^{- 1} (y_{2} - {\bar{\hat{θ}}}_{w}^{B}) r,

(2)

where ${\hat{θ}}^{B} = {({\hat{θ}}_{1}^{B}, \dots, {\hat{θ}}_{n}^{B})}^{Τ}$ is a vector of unbenchmarked posterior means $E [θ_{i} | y_{1}]$ under a given prior, ${\bar{\hat{θ}}}_{w}^{B} = Σ_{i = 1}^{n} w_{i} {\hat{θ}}_{i}^{B}$ , $r = {(r_{1}, \dots, r_{n})}^{Τ}$ , $r_{i} = w_{i} / ϕ_{i}$ , and $s = Σ_{i = 1}^{n} w_{i}^{2} / ϕ_{i}$ . Note that if $ϕ_{i} = 1$ for all $i = 1, \dots, n$ , the benchmarked Bayes estimate becomes ${\hat{θ}}^{B M} = {\hat{θ}}^{B} + {(w^{Τ} w)}^{- 1} (y_{2} - {\bar{\hat{θ}}}_{w}^{B}) w$ . Thus from Equation (2) we can see that the benchmarked Bayes estimate is a function of the unbenchmarked Bayes estimate under MSE loss and user-specified weights. This is computationally appealing, as obtaining benchmarked estimates with this method is done via a quick post-processing step.

To obtain uncertainty around the benchmarked estimates, posterior samples can be projected into the space defined by the benchmarking constraint (Patra 2019; Patra and Dunson 2018). Geometrically, we can interpret this benchmarked Bayes estimate as the point estimate within the space defined by the benchmarking constraint that is as close to the unbenchmarked Bayes estimate as possible, where closeness is measured in terms of expected weighted squared error (Steorts et al. 2020).

The benchmarked estimate from Equation (2) is an exactly benchmarked estimate, that is, the benchmarking constraint holds exactly. Alternatively, in inexact benchmarking, the benchmarking constraint need not hold exactly. If inexact benchmarking is desired in the benchmarked Bayes estimate framework, a penalty term $λ > 0$ (pre-specified by the user) is introduced, and a slightly different loss function is considered,

L (θ, e) = λ {(y_{2} - {\bar{e}}_{w})}^{2} + \sum_{i = 1}^{n} ϕ_{i} {(θ_{i} - e_{i})}^{2} .

The Bayes estimate associated with this loss is

{\hat{θ}}_{λ}^{B} = {\hat{θ}}^{B} + {(s + λ^{- 1})}^{- 1} (y_{2} - {\bar{\hat{θ}}}_{w}^{B}) r,

where we note that as $λ \to \infty$ , ${\hat{θ}}_{λ}^{B}$ approaches the exactly benchmarked Bayes estimate in Equation (2). Considering a loss function with the addition of a penalty term for the benchmarking constraint allows the user to incorporate a predetermined level of agreement between the benchmarks and aggregated, unbenchmarked model estimates that must be met, whether that be exact benchmarking (corresponding to $λ \to \infty$ ) or inexact, where the resulting estimate ${\hat{θ}}_{λ}^{B}$ is a compromise between the exactly benchmarked and unbenchmarked Bayes estimate.

In the context of U5MR and HIV prevalence, the outcome of interest lies between 0 and 1, and may be close to 0. Any estimate that falls outside of $[0, 1]$ would be invalid. In applications where the outcome of interest lies in a restricted space, the benchmarked Bayes estimate approach described above can return invalid benchmarked estimates. In particular, note that the benchmarked Bayes estimates may possibly lie below zero when ${\bar{\hat{θ}}}_{w}^{B} > y_{2}$ . Ghosh et al. (2015) note this issue and consider a variant of the Kullback-Leibler loss function rather than weighted MSE loss, which addresses the issue of estimates falling below 0 when small area estimates should instead be positive. Williams and Berg (2013) and Berg and Fuller (2018) also note this issue, and the latter propose using a specific form for the weights $ϕ_{i}$ in equation (1) to deal with unbenchmarked estimates that are close to the boundary. Their approach does not guarantee, however, that benchmarked estimates will lie within the required restricted space. While they note that in many situations benchmarked estimates that lie outside the restricted space will be rare—and in such cases their approach may be sufficient—when estimating rare disease prevalence or mortality this boundary issue is a concern.

The benchmarked Bayes estimate approach is fast and theoretically justified in settings where estimates fall on the real line, are positive as in Ghosh et al. (2015), or lie well within the boundary of $[0, 1]$ . However, with the loss functions considered thus far in the literature, the approach will often fall short for targets that are on a restricted range.

3.2. Raking Approach

The second benchmarking approach we consider is simple and popular: raking, also referred to as the ratio-adjustment method (Datta et al. 2011; Ghosh et al. 2015; Zhang and Bryant 2020). A version of the raking approach is used by the Institute for Health Metrics and Evaluation (IHME) in a variety of applications (e.g., Osgood-Zimmerman et al. (2018) and Local Burden of Disease HIV Collaborators (2021)), in the Naomi model for estimating HIV prevalence and incidence (Eaton et al. 2021), and in the subnational U5MR estimates currently produced by UN IGME (Wu et al. 2021). This approach is commonly applied post hoc. The key feature of the raking approach is a ratio comparing an unbenchmarked national estimate to the national level benchmark

R = \frac{{\hat{θ}}^{N}}{y_{2}},

(3)

where ${\hat{θ}}^{N}$ denotes some unbenchmarked national level estimate, and $y_{2}$ again denotes the national level benchmark. The unbenchmarked national estimate could be obtained via the weighted sum of unbenchmarked small area estimates as ${\hat{θ}}^{N} = Σ_{i = 1}^{n} w_{i} {\hat{θ}}_{i}^{M}$ , where ${\hat{θ}}_{i}^{M}$ denote the posterior estimates (means or medians) in each area from our unbenchmarked model. Note that the raking approach is considered internal or external benchmarking based on whether $y_{2}$ is calculated from the same data as ${\hat{θ}}^{N}$ .

In a post hoc raking approach and with a sampling-based method, the posterior draws of $θ_{i}$ from an unbenchmarked model are multiplied by $1 / R$ so that the constraint $Σ_{i = 1}^{n} w_{i} {\hat{θ}}_{i}^{M} = y_{2}$ holds. Of note, this benchmarking approach will treat the unbenchmarked estimates in every area in the same fashion, regardless of the uncertainty in the unbenchmarked estimates. As such, the ranking of regions based on posterior medians/means will be preserved between unbenchmarked and benchmarked estimates. This behavior follows because of the ad hoc nature of the raking adjustment, as noted by Datta et al. (2011). It may be preferable to instead treat unbenchmarked estimates with more uncertainty differently than those with less.

The raking approach to benchmarking can also be approximated via the inclusion of a log offset term for the ratio comparing an unbenchmarked national estimate to the national level benchmark in logistic models when the outcome of interest is rare, as in under-5 mortality estimation. In the supplement of Wakefield et al. (2019) they show that, for rare outcomes, including a log offset for R in a logistic regression model corresponds approximately to the same multiplicative bias adjustment that would be done in the post hoc raking approach. This approach is currently used in the UN IGME’s subnational U5MR estimates. We note that the form of raking involving the inclusion of a log offset for R does in fact produce fully Bayesian estimates, in the sense that a full posterior distribution for the benchmarked estimates is produced. However, the approach differs from the fully Bayesian approach described in Section 3.3 in that a likelihood is not specified for the benchmarks themselves.

3.3. Fully Bayesian Benchmarking Approach

Consider an area-level, Bayesian hierarchical model for small area estimation (Zhang and Bryant 2020). For n small areas, let the area-level parameters we wish to estimate be denoted by $θ = {(θ_{1}, \dots, θ_{n})}^{Τ}$ , with a hierarchical structure specified through a model $π (θ | ϕ)$ with a vector of hyperparameters ϕ . The area-level observations are denoted by $y_{1} = {(y_{11}, \dots, y_{1 n})}^{Τ}$ . For example, the values $y_{1 i}$ may be binomial counts of HIV status, with the parameters $θ_{i}$ corresponding to HIV prevalence in area i. We can write the posterior distribution as

π (θ, ϕ | y_{1}) \propto π (y_{1} | θ, ϕ) π (θ | ϕ) π (ϕ) .

Though Zhang and Bryant (2020) consider more complex forms of benchmarking constraints, for simplicity we again consider the constraint $Σ_{i = 1}^{n} w_{i} θ_{i} = y_{2}$ , where $w_{i}$ and $y_{2}$ are defined previously. For simplicity, we will refer to $y_{2}$ as benchmarks and $Σ_{i = 1}^{n} w_{i} θ_{i} = y_{2}$ as a benchmarking constraint in an inexact benchmarking setting as well as an exact setting, despite the fact that in an inexact setting there are no hard constraints. The method we describe is generally applicable to a variety of benchmarking constraints.

To incorporate the benchmarking constraint into their hierarchical model, Zhang and Bryant (2020) define an additional likelihood term for the benchmarks, $π (y_{2} | θ)$ . This results in the benchmarked posterior distribution

π (θ, ϕ | y_{1}, y_{2}) \propto π (y_{1} | θ, ϕ) π (y_{2} | θ) π (θ | ϕ) π (ϕ),

(4)

with the assumptions that $y_{1}$ and $y_{2}$ are conditionally independent given θ , and that $π (y_{2} | θ)$ does not depend on hyperparameters ϕ . Notably, this first assumption does not hold in internal benchmarking settings. The likelihood term for the benchmarks, $π (y_{2} | θ)$ , pulls the likelihood for the area-level observations, $π (y_{1} | θ, ϕ)$ , toward the benchmarks. Uncertainty quantification is thus straightforward, as we obtain an entire benchmarked posterior distribution as opposed to simply a benchmarked point estimate.

Under exact benchmarking, we set $π (y_{2} | θ) = I [Σ_{i = 1}^{n} w_{i} θ_{i} = y_{2}]$ . Under inexact benchmarking, $π (y_{2} | θ)$ is a non-degenerate distribution specified by the user. One can incorporate a discrepancy parameter λ into this distribution to allow for varying levels of agreement between the benchmarks and the aggregate parameters if desired. For example, in the application we consider setting $π (y_{2} | θ)$ equal to a normal distribution with mean $Σ_{i = 1}^{n} w_{i} θ_{i}$ and variance equal to the variance of the national benchmark. This would be equivalent to setting $1 / λ$ equal to the variance of the national estimate in the Bayes estimate approach described in Subsection 3.1.

A benefit of this benchmarking approach is that it allows for nonlinear benchmarking constraints. This is particularly relevant for logistic models, where the benchmarking constraint may take the form $Σ_{i = 1}^{n} w_{i} θ_{i} = Σ_{i = 1}^{n} w_{i} expit (η_{i})$ for a mean model ηi. Unlike the estimates from a benchmarked Bayes estimate approach, the fully Bayesian benchmarking approach ensures that benchmarked estimates remain within the boundaries of the parameter space.

The fully Bayesian benchmarking approach can be used with both area-level and unit-level models. Though Zhang and Bryant (2020) describe an extension of their fully Bayesian approach to unit-level models, an implementation of this does not currently exist. Zhang and Bryant (2020) provide code in their R package demest for Poisson, binomial, multinomial, and normal models, with optional point mass, Poisson, binomial, and normal distributions for the likelihood for the benchmark, found at https://github.com/StatisticsNZ/demest. They provide an outline for a MCMC scheme for a general model. However, implementation of this MCMC scheme will be model-specific, which can limit the uptake of the method. Statistical programs such as INLA and TMB provide alternative ways to conduct Bayesian inference which have great computational advantages over MCMC samplers. These computational advantages are described in detail for INLA in Rue et al. (2009) and for TMB in Kristensen et al. (2016), and involve the use of Laplace approximations to obtain posterior distributions at a fraction of the time it would take using MCMC algorithms. The methods are particularly suited to space and space-time modeling with Markov random field models—situations in which MCMC can be especially computationally demanding because of the dependence in the posterior (Margossian et al. 2020).

The benchmarking approach we propose in the following section is a more general implementation of the inexact, fully Bayesian approach described by Zhang and Bryant (2020), and allows us to obtain fully Bayesian benchmarked estimates from any method that produces area-level samples from an unbenchmarked model. The approach we propose is readily applicable to both area- and unit-level models, so long as area-level samples can be produced, and can be used in conjunction with statistical programs such as INLA and TMB.

3.4. Proposed Approach

We propose an external benchmarking approach that combines an unbenchmarked model with either a rejection sampler or Metropolis-Hastings algorithm to produce fully Benchmarked posterior distributions conditional on a benchmarking constraint under inexact benchmarking. The key to our proposed approach is that from equation (4) we can write

\begin{matrix} π (θ, ϕ | y_{1}, y_{2}) \propto π (y_{1} | θ, ϕ) π (y_{2} | θ) π (θ | ϕ) π (ϕ), \\ \propto π (θ, ϕ | y_{1}) π (y_{2} | θ) . \end{matrix}

Intuitively, we can think of $π (y_{2} | θ)$ as a likelihood for the benchmarks and $π (θ, ϕ | y_{1})$ as a “prior” that corresponds to the posterior based on area-level observations $y_{1}$ . For concreteness we consider a normal distribution for $π (y_{2} | θ)$ , that is,

y_{2} | θ \sim N (\sum_{i = 1}^{n} w_{i} θ_{i}, σ_{y_{2}}^{2}),

where $w_{i}$ are population weights that sum to one across all regions, and $σ_{y_{2}}^{2}$ is treated as the known, national level variance for $y_{2}$ . One could consider other distributions for the benchmark likelihood, however we focus on the normal case in the following derivations.

Importantly, $π (θ, ϕ | y_{1})$ is the posterior distribution from an unbenchmarked model, that is, a model that is agnostic to the benchmarking constraint. In practice, this is the small area, subnational model fit without consideration of the national benchmarks. The most appropriate unbenchmarked subnational model will depend on the context of the statistical problem, though we note that if unbenchmarked subnational estimates are particularly far from reliable national benchmarks, this may indicate that the unbenchmarked model is inappropriate for the data. The approach we describe below will produce samples from the benchmarked posterior conditional on the benchmarking constraint for a given unbenchmarked model.

We note that though we target the same constrained posterior distribution as Zhang and Bryant (2020), our approach is distinct in the implementation in that we do not obtain samples from this posterior distribution in a single step, such as with an MCMC algorithm as implemented in Zhang and Bryant (2020). Rather, our approach allows for first obtaining samples from an unbenchmarked model (not necessarily using MCMC, hence greater flexibility and potential speed gains) and benchmarking in a second step involving either a rejection sampler or a Metropolis-Hastings algorithm.

3.4.1. Rejection Sampler

In a rejection sampling framework, we can obtain samples from $π (θ, ϕ | y_{1}, y_{2})$ by filtering samples from $π (θ, ϕ | y_{1})$ through the information provided by $π (y_{2} | θ)$ . This filtering is done via the following rejection sampler (Smith and Gelfand 1992):

Generate $U \sim Uniform (0, 1)$ and $(θ, ϕ) \sim π (θ, ϕ | y_{1})$ independently.

Accept $(θ, ϕ)$ if

U < \frac{π (y_{2} | θ)}{\sup_{θ} π (y_{2} | θ)} = \exp (- \frac{1}{2 σ_{y_{2}}^{2}} {(\sum_{i = 1}^{m} w_{i} θ_{i} - y_{2})}^{2}),

Otherwise, return to Step 1.

This rejection sampling approach targets the same constrained posterior distribution as the fully Bayesian benchmarking approach of Zhang and Bryant (2020) but in a computationally straightforward manner. As the rejection sampling approach only requires posterior draws from an unbenchmarked posterior distribution, this allows practitioners to use a wider array of computation tools to conduct fully Bayesian benchmarking, even when benchmarking constraints are nonlinear. A relevant example of such a computational tool is INLA, which does not allow for non-linear predictors, and therefore INLA cannot directly incorporate the likelihood for a nonlinear benchmark into the model (Rue et al. 2009). Crucially, unbenchmarked posterior samples can be generated from an INLA analysis.

3.4.2. Metropolis-Hastings

The rejection sampling approach to fully Bayesian benchmarking may be inefficient in cases where benchmarks are far from the population aggregated estimates from an unbenchmarked model. One potential way to address this inefficiency is to instead use an independence Metropolis-Hastings (MH) algorithm (Tierney 1994). Intuitively, we can fit an adjusted unbenchmarked distribution that has been shifted toward the national benchmarks and correct for this shift in the acceptance rate. This serves to increase the proportion of accepted samples relative to using the unbenchmarked distribution as a proposal distribution by moving the proposal distribution closer to where the constrained posterior should be.

Here we present one example of an adjusted unbenchmarked distribution that can be used as a proposal distribution, and show how the shift can be corrected for in the acceptance rate. Note that we assume in this example that the unadjusted, unbenchmarked model we would want to fit has a fixed intercept β with a flat prior, $π (β) \propto 1$ . This assumption is needed for terms to cancel in the acceptance rate, and is not an unreasonable assumption given that the models fit in our motivating application typically satisfy this assumption (Wu et al. 2021). We note that this assumption can be relaxed if other proposal distributions are considered, which may improve the acceptance rate, but that the acceptance probability may not cancel as conveniently in these cases.

Suppose that area-level parameters are linked to a mean model $g (θ) = β + η$ via a link function g, where β is a fixed intercept term with $π (β) \propto 1$ , and η denotes the summation of any remaining fixed or random effect terms in the unbenchmarked model (e.g., spatio-temporal smoothing terms, covariates, etc.). Let $π^{+} (β) \sim N (g (y_{2}), σ_{+}^{2})$ denote an alternative prior for the intercept, centered at $g (y_{2})$ with fixed variance $σ_{+}^{2}$ . Then we have

\begin{array}{l} π (η, β, ϕ | y_{1}, y_{2}) \propto π (y_{1} | η, β, ϕ) π (y_{2} | η, β) π (η | ϕ) π (ϕ) π (β), \\ = π (y_{1} | η, β, ϕ) π (y_{2} | η, β) π (η | ϕ) π (ϕ) π (β) (\frac{π^{+} (β)}{π^{+} (β)}), \\ \propto π^{+} (η, β, ϕ | y_{1}) (\frac{π (y_{2} | η, β)}{π^{+} (β)}), \end{array}

where $π^{+} (η, β, ϕ | y_{1})$ is an adjusted unbenchmarked distribution with $π^{+} (β)$ specified as the prior for the intercept.

In this framework, we can obtain samples from the same benchmarked posterior distribution $π (η, β, ϕ | y_{1}, y_{2})$ using independent samples from the adjusted unbenchmarked posterior distribution $π^{+} (η, β, ϕ | y_{1})$ . The algorithm is executed as follows:

Initalize $(β^{0}, η^{0})$ .

Sample $(β^{'}, η) \sim π^{+} (η, β, ϕ f | y_{1})$ .

Compute the acceptance probability

\begin{matrix} A = \min (1, \frac{π (η^{'}, β^{'}, ϕ^{'} | y_{1}, y_{2}) π^{+} (η^{0}, β^{0}, ϕ^{0} | y_{1})}{π (η^{0}, β^{0}, ϕ^{0} | y_{1}, y_{2}) π^{+} (η^{'}, β^{'}, ϕ^{'} | y_{1})}) \\ = \min (1, \frac{π (y_{2} | η^{'}, β^{'}) π^{+} (β^{0})}{π (y_{2} | η^{0}, β^{0}) π^{+} (β^{'})}) \end{matrix}

With probability A, accept the proposed value $(β^{'}, η^{'})$ and set $(β^{0}, η^{0}) = (β^{'}, η^{'})$ . Otherwise, set $(β^{0}, η^{0}) = (β^{0}, η^{0})$ .

Just as with the rejection sampling approach, the MH approach targets the same constrained posterior distribution as the fully Bayesian benchmarking approach of Zhang and Bryant (2020) but in a computationally convenient manner.

3.4.3. Convergence Diagnostics

While the MH algorithm may have a higher acceptance rate than the rejection sampling approach (and therefore greater computational speed), additional care must be taken to ensure that the algorithm has mixed and converged properly. Though not a convergence diagnostic, one basic check is to aim for an acceptance rate near 23.4%, as suggested in Gelman et al. (1997) as the optimal acceptance rate under normal posteriors for random walk Metropolis algorithms. It may be desirable to vary the prior $π^{+} (β)$ in the MH algorithm to obtain an acceptance rate close to this optimal value.

A common convergence diagnostic for MCMC sampling is the potential scale reduction factor $\hat{R}$ , introduced by Gelman and Rubin (1992). Intuitively, $\hat{R}$ (or split- $\hat{R}$ as it is extended in Gelman et al. (2013)) compares the variance of each individual Markov chain to the variance of all of the chains combined. If the sampler has mixed appropriately, $\hat{R}$ should be close to 1. More recently, new convergence diagnostics have been developed for MCMC methods that go beyond $\hat{R}$ , as there are certain situations where $\hat{R}$ can fail to correctly diagnose poor mixing. In brief, Vehtari et al. (2021) introduce rank-normalized split- $\hat{R}$ and bulk effective sample size (bulk-ESS) as measures that have good asymptotic efficiency through avoiding a normality assumption in non-rank-normalized measures. For more detail see Vehtari et al. (2021) for an overview of existing convergence diagnostics and details on those they propose.

Following the guidelines in Vehtari et al. (2021),we recommend running at least four chains if using the MH approach to benchmarking that we propose. Aiming for a rank-normalized split- $\hat{R}$ less than 1.01 and a bulk-ESS that exceeds four hundred (with four, 1000-iteration chains after warmup) provide default guidelines.

3.4.4. Limitations

Despite the potential improvements in speed from the MH approach over the rejection sampler approach, both methods may still be inefficient in cases where benchmarks are far from the population aggregated estimates from an unbenchmarked model. It may be necessary in such cases to use an approach such as Zhang and Bryant (2020)’s in order to produce benchmarked estimates. However, we caution that if the proportion of accepted samples is very small this may be indicative of inconsistencies between the two data sources, and model elaboration may be required in this case.

Additionally, the rejection sampling and MH approaches cannot benchmark estimates with zero variance, as the likelihood $π (y_{2} | θ)$ would be a point mass. We note this benchmarking scenario could be approximated by setting $σ_{y_{2}}^{2}$ to be very small, but the rejection sampling approach would likely be very inefficient due to the likelihood for the benchmark being very concentrated. The scenario where benchmarks have positive uncertainty is the most practically relevant in our context, as situations where the national benchmarks do not have some degree of uncertainty in an LMIC context are rare. For fully Bayesian benchmarking to benchmarks with zero variance, we refer the reader to the MCMC methods described in Zhang and Bryant (2020).

A final consideration with the MH approach is that using this algorithm with the suggested proposal distribution will not allow for comparison between the resulting benchmarked estimates and the unbenchmarked estimates from the unadjusted, unbenchmarked model, without separately fitting an unadjusted, unbenchmarked model. In some official statistics settings it may be desirable or even required to compare benchmarked estimates to unbenchmarked estimates, and if the time it takes to separately fit an unadjusted, unbenchmarked model is time-consuming, the MH approach may be impractical.

4. Simulation

To demonstrate the improved computational speed of our approach compared to Zhang and Bryant (2020), we compare run-times for our approach used with INLA to that of Zhang and Bryant (2020) implemented in Stan; a probabilistic programming language that uses a variant of Hamiltonian Monte Carlo to do full Bayesian inference (Carpenter et al. 2017). We chose not to compare the computational speed of our approach to that of Datta et al. (2011) or the raking approach, as both of those benchmarking approaches target different benchmarked estimates than our proposed method. Note, however, that both the raking and benchmarked Bayes estimate approach will be the fastest approach to benchmarking in general, as they involve a very quick step to adjusting unbenchmarked draws/estimates, and do not rely on acceptance rates (as our methods do to achieve a reasonable number of effective samples) or MCMC methods.

We simulate unit-level (cluster) binomial observations, using the nine provinces (small areas) of South Africa as our spatial structure. For each simulation setting, binomial probabilities were given by $p_{i} = 0.28, 0.29, \dots, 0.35, 0.36$ , with 100 binomial trials in each cluster. Equal probability weights were given to each province. We varied: the number of samples taken in each area, $m = {5, 10, 100, 1000}$ , the national level benchmark, $y_{2} = {0.29, 0.3}$ , and the variance of the national level benchmark, $σ_{y_{2}}^{2} = {0.01, 0.0001}$ . For each simulation setting, we generated ten unique datasets using the given parameters.

The unbenchmarked model we fit to the generated data is the unit-level model used in our application to South Africa (and described in Section 2.1.2), where we have binomial observations for clusters c within area i. Let $y_{1 i [c]}$ be the number of cases observed in $n_{i [c]}$ total observations in cluster c within area i. We consider the unbenchmarked, unit-level model

\begin{array}{l} y_{1 i [c]} | n_{i [c]}, θ_{i [c]} \sim Binomial (n_{i [c]}, θ_{i [c]}), \\ η_{i [c]} = logit (θ_{i [c]}) = β_{0} + b_{i} + e_{i [c]} \end{array}

where $θ_{i [c]}$ is case prevalence in area i and cluster c, $β_{0}$ is an intercept term, $b_{i}$ follows a BYM2 model (Riebler et al. 2016), denoted $b_{i} \sim BYM2 (τ_{b}, ϕ)$ , and $e_{i} [c] \overset{i i d}{\sim} N (0, σ_{e}^{2})$ is an independent and identically distributed (iid) cluster-level random effect. Note that although the model we fit was not used to generate the data, as we are interested only in comparing run times from each method, and ensuring that the benchmarked distributions are the same from each method, this does not affect the validity of our simulation results. More details on hyperprior specification and simulating the data can be found in Section 3 of the Supplemental data.

The modified unbenchmarked model, used for the MH algorithm is the same as above with the exception of the prior for the intercept being $π^{+} (β_{0}) \overset{d}{=} N (logit (y_{2}), \sqrt{0.1})$ , where $y_{2}$ is the national level benchmark. The benchmarked model we fit using the Zhang and Bryant (2020) approach includes the additional likelihood

y_{2} | θ, σ_{y_{2}}^{2} \sim N (\sum_{i = 1}^{n} w_{i} θ_{i}, σ_{y_{2}}^{2}),

and the proposed rejection sampler and MH algorithm approaches are carried out as described in Sections 3.4.1 and 3.4.2. Code for reproducing the simulation can be found at https://github.com/taylorokonek/benchmarking-paper-sim.

To fairly compare run-times, we compare the time it takes to:

Fit the fully Benchmarked model per Zhang and Bryant (2020) in Stan, and obtain a bulk-ESS of 1,000.

Fit the unbenchmarked model in INLA, draw posterior samples, and obtain 1,000 accepted samples using the rejection sampler.

Fit the modified unbenchmarked model in INLA, draw posterior samples, and obtain a bulk-ESS of 1,000 using the MH algorithm.

For both Stan and the MH algorithm, we run four chains with 1,000 burn-in samples each, and the appropriate number of samples after to obtain the desired bulk-ESS.

In Figure 1 we compare methods under the setting ${y_{2} = 0.29, σ_{y}^{2} = 0.01}$ , and in Figure 2 we compare methods under the setting ${y_{2} = 0.29, σ_{y}^{2} = 0.0001}$ . As expected, our proposed approaches outperform that of Zhang and Bryant (2020) in most simulation settings, as noted in Figures 1 and 2. In particular, the amount of time needed to obtain 1,000 samples from the benchmarked posterior distribution is much lower for both of our proposed approaches than that of Zhang and Bryant (2020) when the number of samples in each area is large. When the national variance is smaller, as in Figure 2, the Zhang and Bryant (2020) approach tends to outperform both the rejection sampler and MH algorithm at low sample sizes (5 and 10 per area), but not at larger sample sizes. Finally, note that, in settings with smaller national variance, the Metropolis-Hastings algorithm outperforms the rejection sampler approach in terms of computational speed, while the reverse is true when national variance is large. Under the setting ${y_{2} = 0.29, σ_{y}^{2} = 0.01, m = 5}$ , our proposed methods are three to five times faster than that of Zhang and Bryant (2020) on average, and under the setting ${y_{2} = 0.29, σ_{y}^{2} = 0.01, m = 1000}$ , our proposed methods are seventy-five to one hundred times faster than that of Zhang and Bryant (2020) on average. The speed of the MH algorithm could potentially be optimized by modifying prior for the intercept, $π^{+} (β_{0})$ , and we suggest that, when possible, multiple priors should be tested if the acceptance rate in the MH algorithm is lower than desired. The included figures and above summary are representative of the simulation results, and additional results from the simulation can be found in Section 3 of the Supplemental data.

Figure 1.

Setting: $y_{2} = 0.29$ , $σ_{y_{2}}^{2} = 0.01$ . Comparative, total run-time (in seconds) needed to obtain 1,000 samples from the rejection sampling (RS) approach, or 1,000 bulk-ESS from the Metropolis-Hastings (MH) approach or the approach of Zhang and Bryant (2020) (Stan). Each boxplot contains ten observations, with data generated under the given simulation setting for ten different seeds.

Figure 2.

Setting: $y_{2} = 0.29$ , $σ_{y_{2}}^{2} = 0.0001$ . Comparative, total run-time (in seconds) needed to obtain 1,000 samples from the rejection sampling (RS) approach, or 1,000 bulk-ESS from the Metropolis-Hastings (MH) approach or the approach of Zhang and Bryant (2020) (Stan). Each boxplot contains ten observations, with data generated under the given simulation setting for ten different seeds.

5. Application

Below we describe the data, models, and software used in the HIV prevalence application. The application to U5MR is in Section 2 of the Supplemental data.

We fit both area-level and unit-level models to demonstrate the flexibility of our proposed method over existing computational software. As noted in Section 3.3, there is currently no available implementation of the fully Bayesian benchmarking approach described in Zhang and Bryant (2020) to unit-level models. As unit-level models are commonly used in LMICs when estimating public health outcomes in an official statistics setting (Wakefield et al. 2020; Wu et al. 2021), it is important for practitioners to have a method available to conduct fully Bayesian benchmarking with unit-level models without needing to write their own MCMC algorithm. We demonstrate that our proposed method is readily applicable to both area- and unit-level models.

For our unbenchmarked area-level model, we consider the spatial Fay-Herriot model defined in Mercer et al. (2015). The model is detailed in Section 2.1.1. The Fay-Herriot model is one of the most well-known small area models, that directly incorporates aspects of the survey design using a transformed survey-weighted estimate and an iid random effect to increase precision of the resulting estimates (Fay and Herriot 1979). The spatial extension incorporates both an iid random effect and an additional structured spatial random effect to further capture spatial dependency in the observations (Mercer et al. 2015).

For our unbenchmarked unit-level model, we consider a hierarchical Bayesian model with a spatial random effect term for the application to HIV prevalence, and the model currently used to produce subnational estimates of U5MR for the UN IGME (Wu et al. 2021). The model is detailed in Subsection 2.1.2. Of note, these unit-level models do not directly account for the survey design like the area-level model. It has been suggested that, when possible, the inclusion of covariates related to the survey design could be included in such a model-based framework to account for the survey design (Wakefield 2013; Wu et al. 2021). One example of this can be found in Paige et al. (2022), though as noted in Wakefield et al. (2020), in many surveys conducted in LMICs the covariates corresponding to the survey design may be unavailable.

5.1. Data

Spatial boundary files for South Africa are obtained from GADM, the Database of Global Administrative Areas (Database of Global Administrative Areas [GADM] 2019).

We estimate HIV prevalence in South Africa from the 2016 South Africa DHS survey. The survey followed a multi-stage, stratified design, and was designed to provide estimates at the Administrative 1 (admin1) level, which consists of nine provinces, which we considered to be our small areas for this application. Subareas within each of the nine provinces were assigned urban/farm/traditional area status, and therefore resulted in twenty-six strata, as the Western Cape province does not have a traditional area geotype. The sampling frame was established from the 2011 census, and 750 enumeration areas (primary sampling units, PSUs) were selected across strata. The second stage of sampling sampled dwelling units, or households, from the enumeration areas, and every individual within the household (if available) was included in the survey. Only men and women aged fifteen to forty-nine were included in the HIV dataset. Households within a given enumeration area are given a single geographic location, and we denote these as clusters from here on. GPS coordinates are displaced by up to 2 km for urban clusters and 5 km for rural clusters, but are never displaced outside of their area of stratification.

Of note, all nine small areas (provinces) for this application contained at least one binomial observation. The number of PSUs in each small area ranged from 56 to 88. The binomial counts within each primary sampling unit ranged from 0 to 11, with totals in each primary sampling unit ranging from 1 to 30. The distribution of observed binomial proportions in each PSU can be seen in Figure 3.

Figure 3.

Observed binomial proportions in each PSU.

We obtain a national level estimate for 2016 in South Africa from the national Thembisa model, an HIV epidemic projection model that produces the official estimates published by UNAIDS for South Africa (Johnson, May, et al. 2017; Mahy et al. 2019; Stover et al. 2019). The national level estimate of HIV prevalence for 2016 in South Africa is 17.1%, with a 95% confidence interval of (15.6%, 18.3%). For the benchmark likelihood for the fully Bayesian benchmarking approaches we need a standard error for this benchmarked estimate, which we take to be (18.3−17.1) / 1.96 = 0.61 based on the assumption that the national level estimate is asymptotically normally distributed. The Thembisa model incorporates data from five Human Sciences Research Council (HSRC) surveys conducted from 2002 to 2017, the 2016 DHS survey, Antenatal Sentinel HIV and Syphilis Surveys, and antiretroviral therapy coverage data. Although the DHS survey is included in the model for the benchmarks, we treat this as an external benchmarking scenario as multiple other data sources are included in the national model as well.

The population count data we use for our HIV application comes from the provincial Thembisa model (Johnson, Dorrington, and Moolla 2017). We use this specific population data as they are currently used in UNAIDS models of subnational HIV prevalence in South Africa (Eaton et al. 2021). In Figure 4, we plot the spatial distribution of individuals aged fifteen to forty-nine (the population in our HIV example) in 2016 in South Africa; that is, out of all people aged fifteen to forty-nine in South Africa, the proportion of that sub-population who live in each region is displayed. The population proportions are also present in Table 1, along with the number of observations in each province and their direct, survey-weighted estimates.

Figure 4.

Proportion of the fifteen to forty-nine population living in each province. These proportions are the weights used to aggregate province level estimates to the national level. Provinces are labeled in alphabetical order: (1) Eastern Cape; (2) Free State; (3) Gauteng; (4) KwaZulu-Natal; (5) Limpopo; (6) Mpumulanga; (7) NorthWest; (8) Northern Cape; (9) Western Cape.

Table 1.

Proportion of the fifteen to forty-nine population living in each province, number of observations in each province, direct survey estimates, and their standard errors provided in parentheses. Provinces are arranged in ascending order of direct estimates.

Province	Population proportion	Sample size	Direct estimate (%)
Limpopo	0.09	815	9.8 (1.19)
Northern Cape	0.02	453	11.4 (2.25)
Western Cape	0.12	397	12.3 (2.96)
Eastern Cape	0.10	1,056	16.4 (1.40)
Gauteng	0.28	528	19.6 (2.07)
North West	0.07	917	20.6 (1.52)
Free State	0.05	849	21.3 (1.76)
Mpumalanga	0.08	893	23.9 (1.84)
KwaZulu-Natal	0.19	1,004	25.9 (2.08)

5.2. Benchmarked Models

We compute benchmarked estimates using the fully Bayesian approach of Zhang and Bryant (2020), the proposed approaches, and the raking approach. This allows us to show: (1) the benchmarked estimates from our proposed approaches are identical to those from the Zhang and Bryant (2020) method, and (2) a comparison of the benchmarked estimates from the proposed approach to what is currently used in practice (raking) by the UN when estimating both HIV Prevalence and U5MR subnationally (Eaton et al. 2021; Wu et al. 2021).We do not compute benchmarked estimates using the benchmarked Bayes estimate approach, as we have noted in Section 3.1 that it is inappropriate for these applications. We obtain credible intervals for our benchmarked and unbenchmarked estimates using draws from the posterior distributions.

5.2.1. Approaches

5.2.1.1. Raking

We obtain posterior medians ${\hat{θ}}_{i}^{M}$ from an unbenchmarked model for each area i, and compute R from equation (3), using weights $w_{i}$ and our benchmark $y_{2}$ , setting ${\hat{θ}}^{N} : = Σ_{i = 1}^{n} w_{i} {\hat{θ}}_{i}^{M}$ . We then adjust unbenchmarked posterior draws ${\hat{θ}}_{i}^{(k)}$ , $k = 1, \dots, K$ , to obtain benchmarked posterior draws ${\hat{θ}}_{i}^{R (k)} = {\hat{θ}}_{i}^{(k)} / R$ .

5.2.1.2. Fully Bayesian

Using weights $w_{i}$ and our benchmark $y_{2}$ with associated variance $σ_{y_{2}}^{2}$ , we fit the unbenchmarked model with the additional likelihood (Zhang and Bryant 2020)

y_{2} | θ \sim Normal (\sum_{i = 1}^{m} w_{i} θ_{i}, σ_{y_{2}}^{2}),

(5)

with $y_{2} = 17.1 %$ and $σ_{y_{2}}^{2} = 0.37$ , and denote benchmarked posterior draws by ${\hat{θ}}_{i}^{F B 1 (k)}$ .

5.2.1.3. Fully Bayesian: Rejection sampler

We obtain $k = 1, \dots, K$ posterior draws ${\hat{θ}}_{i}^{(k)}$ for each area i from the unbenchmarked unit-level or area-level model, and apply the algorithm described in Section 3.4.1 using the benchmark likelihood in Equation (5) to obtain benchmarked posterior draws ${\hat{θ}}_{i}^{F B 2 (k)}$ .

5.2.1.4. Fully Bayesian: Metropolis-Hastings algorithm

We obtain $k = 1, \dots, K$ posterior draws ${\hat{θ}}_{i}^{(k)}$ for each area i from the adjusted unbenchmarked unit-level or area-level model, and apply the algorithm described in Section 3.4.2 using the benchmark likelihood in equation (5) to obtain benchmarked posterior draws ${\hat{θ}}_{i}^{F B 3 (k)}$ . The adjusted unbenchmarked model uses the prior $π^{+} (β_{0}) \sim Normal (logit (y_{2} {), 0.001}^{- 1}) .$

5.3. Model Validation

As we view benchmarking as a necessary adjustment needed to an existing unbenchmarked model in an official statistics setting, we choose to validate the unbenchmarked models we fit prior to benchmarking. Note that if benchmarking is instead viewed as a way to reduce potential bias in subnational estimates, model validation may need to be done to the resulting benchmarked models as opposed to the unbenchmarked models.

Wakefield et al. (2020), among others (Osgood-Zimmerman et al. 2018; Wu et al. 2021), suggest that one way to approach model validation in small area models in a survey setting is to check whether the direct estimates in each area lie within an uncertainty interval surrounding the posterior means of the predictive distribution in that area. In a survey setting in LMICs, the direct estimates are considered the gold standard; hence, model-based estimates should not stray too far from the direct estimates. This also gives one an idea of average coverage across areas: 80% of the direct estimates (left out data) should lie within 80% credible intervals based on the predictive distribution.

We take this approach to model validation for both our unbenchmarked area-level and unit-level models. Details and figures containing the results of model validation for the HIV application can be found in Subsection 1.4 of the Supplemental data, and Subsection 2.6 of the Supplemental data for the U5MR application.

5.4. Computation

As the novelty of our approach is computational, we chose the statistical programs used in the application to demonstrate the flexibility of our approach compared to the fully Bayesian approach of Zhang and Bryant (2020).

For the approach described in Zhang and Bryant (2020), we implement models in Template Model Builder (TMB) (Kristensen et al. 2016). TMB is a flexible modeling tool that takes advantage of Laplace approximations for computational efficiency, and allows users to specify a wide variety of models in C++. For a review of TMB, see Section 4 of Osgood-Zimmerman and Wakefield (2021). Importantly, TMB allows users to specify nonlinear predictors, unlike INLA, which is needed for implementing the fully Bayesian benchmarking approach of Zhang and Bryant (2020) in settings with binary outcomes. Of note, TMB still requires model-specific implementations, and therefore does not have as nice of an interface as other programs (such as INLA, which we will describe below).

To show that the results from our proposed method are identical to those from the Zhang and Bryant (2020) method, we also fit the unbenchmarked HIV models in our application using TMB. This ensures that the posterior distributions are directly comparable. Note that our proposed method could have been implemented using INLA to fit the unbenchmarked model as well, but as INLA uses a different Laplace approximation than TMB, the results would not have been exactly, directly comparable.

For our U5MR application, we fit all unbenchmarked models in INLA to show that our proposed method is compatible with the software currently used to produce official estimates of U5MR for the UN IGME. INLA is an appealing, alternative Laplace approximation program for obtaining posterior distributions for latent Gaussian models (Rue et al. 2009). Unlike TMB, INLA does not allow users to specify nonlinear predictors. Since our proposed approach does not require the use of such nonlinear predictors for the benchmarking constraint by taking advantage of a two-step procedure, we implement the models for our U5MR application using INLA to demonstrate the benefit of the increased flexibility of our approach to accommodate fast Laplace approximation techniques such as INLA.

5.5. Results

In Figure 5 and Table 2, we display national level estimates from Thembisa, the unbenchmarked, and benchmarked results from the unit-level model. Results from the area-level model are very similar to those from the unit-level model, and can be found in Table 2 as well, with more detailed results in Subsection 1.2 of the Supplemental data and direct comparisons between the unit-level and area-level models found in Subsection 1.3 of the Supplemental data. We see that the fully Bayesian benchmarking approaches produce essentially identical results, and that they are a compromise between the likelihood given by the national level benchmark and the unbenchmarked posterior estimate. The raking approach enforces exact benchmarking, which is evidenced by the overlap between the density of the Thembisa national estimate and that of the benchmarked method’s national aggregated estimate. The fully Bayesian rejection sampler accepts 7.6% of unbenchmarked samples using the unit-level model, and 8.6% of unbenchmarked samples using the area-level model. The fully Bayesian MH approach accepts 18.8% of unbenchmarked samples using the unit-level model, and 18.9% of unbenchmarked samples using the area-level model.

Figure 5.

Aggregated national level HIV prevalence estimates from Thembisa, unbenchmarked, and benchmarked results from the unit-level model. The fully Bayesian model refers to the approach described in Zhang and Bryant (2020), and the fully Bayesian rejection sampler and fully Bayesian MH refer to the proposed approaches. All densities are based on five thousand samples.

Table 2.

Aggregated national level hiv prevalence estimates from Thembisa, unbenchmarked, and benchmarked results from the unit-level and area-level models. 95% Credible intervals are given next to posterior medians.

Model	Unit-level: Median (%)	Area-level: Median (%)
Thembisa	17.1 (15.6, 18.3)	17.1 (15.6, 18.3)
Unbenched	19.1 (17.6, 20.6)	19.0 (17.5, 20.7)
FB	18.0 (17.1, 19.0)	17.9 (17.0, 18.8)
FB: Rejection sampler	17.9 (17.0, 18.9)	17.9 (17.0, 18.8)
FB: MH	17.9 (17.0, 18.8)	17.9 (16.9, 18.8)
Raking	17.1 (15.8, 18.5)	17.1 (15.8, 18.7)

Figure 6 and Table 3 display the subnational breakdown of benchmarked and unbenchmarked estimates from the unit-level model. Of note, the benchmarked posterior medians and credible intervals for the fully Bayesian approaches look roughly identical (as they should), and the posterior distributions for each province from the raking approach are all slightly lower than those from the fully Bayesian approaches. This difference is due to the raking approach enforcing the benchmarking constraint exactly, thus pulling the unbenchmarked estimates down farther than the fully Bayesian approaches, as the national benchmark is lower than the unbenchmarked national estimate. Though more difficult to see, the fully Bayesian approach treats regions with more uncertainty differently than those with less uncertainty. In particular, note that in the Eastern Cape province the fully Bayesian benchmarked posterior medians are not as different from the unbenchmarked median as in the Gauteng province, where uncertainty is much higher in the unbenchmarked estimate. This is opposed to the raking approach, which shifts all estimates by the same multiplicative constant, regardless of uncertainty. Table 4 displays the subnational breakdown of benchmarked and unbenchmarked estimates from the area-level model, with additional results in Subsection 1.2 of the Supplemental data.

Figure 6.

Comparison of HIV Prevalence estimates from benchmarked and unbenchmarked unit-level models at a subnational level. Error bars correspond to 95% credible intervals, and provinces are arranged from left to right in order of unbenchmarked median.

Table 3.

Admin1 HIV prevalence estimates from unbenchmarked and benchmarked unit-level models. Point estimates provided are medians, with 95% credible intervals. Provinces are arranged in the order of lowest unbenchmarked median to highest.

Province	Unbenched (%)	FB (%)	FB: Rejection sampler (%)	FB: MH (%)	Raking (%)
Limpopo	11.2 (9.0, 13.9)	11.0 (8.8, 13.6)	10.8 (8.8, 13.4)	10.8 (8.7, 13.4)	10.1 (8.1, 12.5)
Western Cape	11.5 (8.6, 15.3)	10.9 (8.2, 14.4)	10.7 (8.0, 14.0)	10.7 (8.2, 14.0)	10.4 (7.7, 13.8)
Northern Cape	12.0 (9.2, 15.4)	12.0 (9.2, 15.6)	11.8 (9.0, 15.3)	11.7 (8.9, 15.4)	10.8 (8.3, 13.8)
Eastern Cape	17.5 (15.6, 19.4)	16.9 (15.3, 18.7)	16.8 (15.1, 18.5)	16.7 (15.1, 18.5)	15.7 (14.1, 17.4)
Gauteng	18.6 (15.3, 22.5)	16.7 (14.1, 19.6)	16.7 (14.1, 19.5)	16.7 (14.1, 19.4)	16.7 (13.7, 20.2)
Free State	21.0 (17.9, 24.2)	20.6 (17.7, 24.1)	20.5 (17.5, 23.7)	20.5 (17.7, 23.7)	18.8 (16.1, 21.8)
North West	21.1 (18.0, 24.5)	20.7 (17.8, 24.0)	20.5 (17.6, 23.8)	20.5 (17.6, 23.7)	18.9 (16.2, 22.0)
Mpumalanga	23.4 (20.2, 27.0)	22.9 (19.8, 26.3)	22.8 (19.7, 26.2)	22.7 (19.8, 26.1)	21.0 (18.2, 24.3)
KwaZulu-Natal	26.4 (23.1, 30.0)	25.1 (22.1, 28.3)	25.1 (22.2, 28.2)	25.1 (22.2, 28.1)	23.7 (20.7, 26.9)

Table 4.

Admin1 HIV prevalence estimates from unbenchmarked and benchmarked area-level models. Point estimates provided are medians, with 95% credible intervals. Provinces are arranged in the order of lowest unbenchmarked median to highest.

Province	Unbenched (%)	FB (%)	FB: Rejection sampler (%)	FB: MH (%)	Raking (%)
Limpopo	10.4 (8.3, 13.1)	10.1 (8.0, 12.8)	10.2 (8.0, 12.8)	10.1 (8.0, 12.8)	9.4 (7.5, 11.8)
Northern Cape	12.5 (8.9, 17.4)	12.2 (8.6, 17.0)	12.3 (8.7, 17.1)	12.4 (8.8, 17.3)	11.3 (8.0, 15.7)
Western Cape	12.6 (8.2, 18.8)	11.1 (7.5, 16.4)	11.0 (7.4, 16.2)	11.1 (7.5, 16.0)	11.3 (7.4, 17.0)
Eastern Cape	16.8 (14.6, 19.4)	16.3 (14.3, 18.5)	16.3 (14.1, 18.7)	16.2 (14.2, 18.6)	15.1 (13.1, 17.4)
Gauteng	19.3 (15.9, 23.6)	17.5 (14.7, 20.6)	17.4 (14.7, 20.5)	17.3 (14.5, 20.3)	17.4 (14.3, 21.2)
North West	20.2 (17.5, 23.3)	19.9 (17.2, 22.8)	19.9 (17.2, 22.9)	20.0 (17.3, 22.9)	18.2 (15.7, 21.0)
Free State	21.0 (17.9, 24.4)	20.6 (17.7, 24.1)	20.7 (17.6, 24.0)	20.7 (17.6, 24.0)	18.9 (16.1, 21.9)
Mpumalanga	23.4 (20.2, 27)	22.9 (19.7, 26.4)	22.9 (19.6, 26.4)	22.8 (19.6, 26.3)	21.1 (18.2, 24.3)
KwaZulu-Natal	25.5 (21.7, 29.4)	24.0 (20.7, 27.6)	24.0 (20.7, 27.4)	24.0 (20.6, 27.5)	22.9 (19.5, 26.5)

The model validation results for the HIV application for both unit- and area-level models suggest that both models are reasonably well-suited to the data. We compare posterior medians of the predictive distribution in each area (having left that area out of model fitting) to the direct estimate in each area, respectively. The 80% credible intervals capture $7 / 9$ and $8 / 9$ direct estimates in the unit- and area-level models, respectively, and the 50% credible intervals capture $3 / 9$ and $4 / 9$ in the unit- and area-level models, respectively. This suggests that the area-level model may be slightly more suited to the data, which is unsurprising given that the area-level model accounts for the survey design directly.

Additional plots and tables, including those for the area-level model, can be found in Section 1 of the Supplemental data.

5.6. U5MR

The results for the U5MR application are in Section 2 of the Supplemental data.

6. Discussion

In this paper we have summarized existing benchmarking approaches and their benefits and drawbacks with regard to data with rare binary outcomes in LMICs. We consider benchmarking methods that make use of a benchmarking constraint via one-step or two-step approaches, and pay particular attention to the resulting interpretation of the benchmarked estimates, the ease of uncertainty quantification, the acknowledgment of uncertainty in national estimates, the use of non-linear constraints, and computational tractability. We believe that the proposed rejection sampling and MH approaches to fully Bayesian benchmarking provide a desirable balance all of these concerns, and provide alternative computational approaches to the one-step, fully Bayesian benchmarking method developed by Zhang and Bryant (2020), which in many cases may be inflexible with regards to modeling choice and computational tractability, while targeting the same benchmarked, posterior distribution.

We show via an application of various benchmarking approaches to an HIV prevalence example that the proposed two-step approaches to fully Bayesian benchmarking produce the same benchmarked estimates as the one-step Zhang and Bryant (2020) approach, and that the resulting estimates are a compromise between the national level estimate and unbenchmarked estimates. The rejection sampling and MH approaches allow us to take advantage of potentially faster computational programs than traditional MCMC samplers, such as INLA or TMB, as evidenced by our application to U5MR. Inference for spatial models (continuous models especially) using MCMC methods is computationally challenging, and our approach allows users to conduct fully Bayesian benchmarking while relying on Laplace approximation methods for inference that are more suited for such models (Osgood-Zimmerman and Wakefield 2021). Additionally, the rejection sampling and MH approaches are easily applied to both unit-level and area-level models, as evidenced by both our HIV and U5MR applications. Though not explored in this paper, additional algorithms could be considered to similarly obtain benchmarked posterior distributions in a two-step fashion, such as the sampling-importance resampling approach (see Subsection 10.4 of Gelman et al. (2013)).

Though the purpose of this paper was not to provide an in-depth review of small area models in LMICs, one reviewer noted that our unit-level models do not directly account for the survey design. This point should not go unnoticed, and as with any application involving survey data, careful consideration should be given the survey design even if a model-based approach is required for adequate precision.

There are several limitations with the proposed approaches to fully Bayesian benchmarking. The first and potentially most pressing is the inability to conduct exact benchmarking, which may be required in some settings and is especially relevant if national benchmarks come from a census, though censuses have uncertainty in practice. Scenarios where true exact benchmarking is required are rare in an LMIC context. While we note that exact benchmarking can be approximated via the proposed approaches, it will likely be computationally inefficient. The one-step approach to fully Bayesian benchmarking may be a more useful approach if exact benchmarking is required, or a posterior projection approach using a loss function that respects the bounds on the estimates that are being benchmarked.

Additionally, our method does not account for uncertainty in the population count data. The population data used in our applications did not have reported uncertainty. If uncertainty for population count data were available, this would ideally be incorporated into the likelihood for the benchmarking constraint. Worldpop has recently started quantifying uncertainty in population data for select countries in their bottom-up Bayesian models, which may be of interest to include in future applications (Leasure et al. 2020).

Footnotes

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Research reported in this publication was supported by the National Institute Of Allergy And Infectious Diseases of the National Institutes of Health under Award Number R37AI029168. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Supplemental Material

Supplemental material for this article is available online.

Received: March 21, 2022

Accepted: December 7, 2023

References

Alkema

New

J. R.

2014. “Global Estimation of Child Mortality Using a Bayesian B-Spline Bias-Reduction Model.” The Annals of Applied Statistics 8 (4): 2122–49. DOI: https://doi.org/10.1214/14-AOAS768.

Battese

G. E.

Harter

R. M.

Fuller

W. A.

1988. “An Error-Components Model for Prediction of County Crop Areas Using Survey and Satellite Data.” Journal of the American Statistical Association 83 (401): 28–36. DOI: https://doi.org/10.2307/2288915.

Bell

W. R.

Datta

G. S.

Ghosh

2013. “Benchmarking Small Area Estimators.” Biometrika 100 (1): 189–202. DOI: https://doi.org/10.1093/biomet/ass063.

Berg

Fuller

W. A.

2018. “Benchmarked Small Area Prediction.” Canadian Journal of Statistics 46 (3): 482–500. DOI: https://doi.org/10.1002/cjs.11461.

Besag

1974. “Spatial Interaction and the Statistical Analysis of Lattice Systems.” Journal of the Royal Statistical Society: Series B (Methodological) 36 (2): 192–225. DOI: https://doi.org/10.1111/j.2517-6161.1974.tb00999.x.

Besag

York

Mollié

1991. “Bayesian Image Restoration, with Two Applications in Spatial Statistics.” Annals of the Institute of Statistical Mathematics 43 (1): 1–20. DOI: https://doi.org/10.1007/BF00116466.

Carpenter

Gelman

Hoffman

M. D.

Lee

Goodrich

Betancourt

Brubaker

Guo

Riddell

2017. “Stan: A Probabilistic Programming Language.” Journal of Statistical Software 76: 1. DOI: https://doi.org/10.18637/jss.v076.i01.

Chen

Nandram

Cruze

N. B.

2022. “Hierarchical Bayesian Model with Inequality Constraints for US County Estimates.” Journal of Official Statistics 38 (3): 709–32. DOI: https://doi.org/10.2478/JOS-2022-0032.

Dagum

E. B.

Cholette

P. A.

2006. Benchmarking, Temporal Distribution, and Reconciliation Methods for Time Series. New York, NY: Springer. DOI: https://doi.org/10.1007/0-387-35439-5.

10.

Database of Global Administrative Areas (GADM). 2019. “Global Administrative Areas [Shapefiles].” https://www.gadm.org.

11.

Datta

Ghosh

Steorts

Maples

2011. “Bayesian Benchmarking with Applications to Small Area Estimation.” Test 20 (3): 574–88. DOI: https://doi.org/10.1007/s11749-010-0218-y.

12.

Datta

G. S.

2009. “Model-Based Approach to Small Area Estimation.” In Rao

C. R.

(Ed.), Handbook of Statistics, Vol. 29, 251–88. Amsterdam: Elsevier. DOI: https://doi.org/10.1016/S0169-7161(09)00232-6.

13.

Dong

T. Q.

Wakefield

2021. “Modeling and Presentation of Vaccination Coverage Estimates Using Data from Household Surveys.” Vaccine 39 (18): 2584–94. DOI: https://doi.org/10.1016/j.vaccine.2021.03.007.

14.

Eaton

J. W.

Dwyer-Lindgren

Gutreuter

O’Driscoll

Stevens

Bajaj

Ashton

, et al. 2021. “Naomi: A New Modelling Tool for Estimating HIV Epidemic Indicators at the District Level in Sub-Saharan Africa.” Journal of the International AIDS Society 24: e25788. DOI: https://doi.org/10.1002/jia2.25788.

15.

Erciulescu

A. L.

Cruze

N. B.

Nandram

2018. “Benchmarking a Triplet of Official Estimates.” Environmental and Ecological Statistics 25 (4): 523–47. DOI: https://doi.org/10.1007/s10651-018-0416-4.

16.

Erciulescu

A. L.

Cruze

N. B.

Nandram

2019. “Model-Based County Level Crop Estimates Incorporating Auxiliary Sources of Information.” Journal of the Royal Statistical Society: Series A (Statistics in Society) 182 (1): 283–303. DOI: https://doi.org/10.1111/rssa.12390.

17.

Erciulescu

A. L.

Cruze

N. B.

Nandram

2020. “Statistical Challenges in Combining Survey and Auxiliary Data to Produce Official Statistics.” Journal of Official Statistics 36 (1): 63–88. DOI: https://doi.org/10.2478/jos-2020-0004.

18.

Fay

R. E.

Herriot

R. A.

1979. “Estimates of Income for Small Places: An Application of James-Stein Procedures to Census Data.” Journal of the American Statistical Association 74 (366a): 269–77. DOI: https://doi.org/10.2307/2286322.

19.

Gelman

Carlin

J. B.

Stern

H. S.

Dunson

D. B.

Vehtari

Rubin

D. B.

2013. Bayesian Data Analysis. New York, NY: Chapman and Hall/CRC. DOI: https://doi.org/10.1201/b16018.

20.

Gelman

Gilks

W. R.

Roberts

G. O.

1997. “Weak Convergence and Optimal Scaling of Random Walk Metropolis Algorithms.” The Annals of Applied Probability 7 (1): 110–20. DOI: https://doi.org/10.1214/aoap/1034625254.

21.

Gelman

Rubin

D. B.

1992. “Inference from Iterative Simulation Using Multiple Sequences.” Statistical Science 7: 457–72. DOI: https://doi.org/10.1214/ss/1177011136.

22.

Ghosh

Kubokawa

Kawakubo

2015. “Benchmarked Empirical Bayes Methods in Multiplicative Area-Level Models with Risk Evaluation.” Biometrika 102 (3): 647–59. DOI: https://doi.org/10.1093/BIOMET/ASV010.

23.

Ghosh

Steorts

R. C.

2013. “Two-Stage Benchmarking as Applied to Small Area Estimation.” Test 22 (4): 670–87. DOI: https://doi.org/10.1007/s11749-013-0338-2.

24.

Hájek

1971. “Discussion of ‘An Essay on the Logical Foundations of Survey Sampling, Part I’, by D. Basu.” Foundations of Statistical Inference, 326. DOI: https://doi.org/10.1007/978-1-4419-5825-924.

25.

Hillmer

S. C.

Trabelsi

1987. “Benchmarking of Economic Time Series.” Journal of the American Statistical Association 82 (400): 1064–71. DOI: https://doi.org/10.2307/2289382.

26.

Horvitz

D. G.

Thompson

D. J.

1952. “A Generalization of Sampling Without Replacement from a Finite Universe.” Journal of the American Statistical Association 47 (260): 663–85. DOI: https://doi.org/10.2307/2280784.

27.

Janicki

Vesper

2017. “Benchmarking Techniques for Reconciling Bayesian Small Area Models at Distinct Geographic Levels.” Statistical Methods and Applications 26 (4): 557–81. DOI: https://doi.org/10.1007/s10260-017-0379-x.

28.

Johnson

L. F.

Dorrington

R. E.

Moolla

2017. “Progress Towards the 2020 Targets for HIV Diagnosis and Antiretroviral Treatment in South Africa.” Southern African Journal of HIV Medicine 18 (1): 1–8. DOI: https://doi.org/10.4102/sajhivmed.v18i1.694.

29.

Johnson

L. F.

May

M. T.

Dorrington

R. E.

Cornell

Boulle

Egger

Davies

M.-A.

2017. “Estimating the Impact of Antiretroviral Treatment on Adult Mortality Trends in South Africa: A Mathematical Modelling Study.” PLoS Medicine 14 (12): e1002468. DOI: https://doi.org/10.1371/journal.pmed.1002468.

30.

Knorr-Held

. 2000. “Bayesian Modelling of Inseparable Space-Time Variation in Disease Risk.” Statistics in Medicine 19 (17–18): 2555–67. DOI: https://doi.org/10.1002/1097-0258(20000915/30)19:17/18<2555::AID-SIM587>3.0.CO;2-\%23.

31.

Kristensen

Nielsen

Berg

C. W.

Skaug

H. J.

Bell

2016. “TMB: Automatic Differentiation and Laplace approximation.” Journal of Statistical Software 70 (5): 1–21. DOI: https://doi.org/10.18637/jss.v070.i05.

32.

Leasure

D. R.

Jochem

W. C.

Weber

E. M.

Seaman

Tatem

A. J.

2020. “National Population Mapping from Sparse Survey Data: A Hierarchical Bayesian Modeling Framework to Account for Uncertainty.” Proceedings of the National Academy of Sciences of the United States of America 117 (39): 24173–9. DOI: https://doi.org/10.1073/pnas.1913050117.

33.

Lehtonen

Veijanen

2009. “Design-Based Methods of Estimation for Domains and Small Areas.” In Rao

C. R.

(Ed.), Handbook of Statistics, Vol. 29, 219–49. Amsterdam: Elsevier. DOI: https://doi.org/10.1016/S0169-7161(09)00231-4.

34.

Hsiao

Godwin

Martin

B. D.

Wakefield

Clark.

S. J.

2019. “Changes in the Spatial Distribution of the Under-Five Mortality Rate: Small-Area Analysis of 122 DHS Surveys in 262 Subregions of 35 Countries in Africa.” PLoS One 14 (1): e0210645. DOI: https://doi.org/10.1371/journal.pone.0210645.

35.

Local Burden of Disease HIV Collaborators. 2021. “Mapping Subnational HIV Mortality in Six Latin American Countries with Incomplete Vital Registration Systems.” BMC Medicine 19: 1–25. DOI: https://doi.org/10.1186/s12916-020-01876-4.

36.

Mahy

Marsh

Sabin

Wanyeki

Daher

Ghys

P. D.

2019. “HIV Estimates Through 2018: Data for Decision-Making.” AIDS (London, England) 33 (Suppl 3): S203. DOI: https://doi.org/10.1097/QAD.0000000000002321.

37.

Margossian

Vehtari

Simpson

Agrawal

2020. “Hamiltonian Monte Carlo Using an Adjoint-Differentiated Laplace Approximation: Bayesian Inference for Latent Gaussian Models and Beyond.” Advances in Neural Information Processing Systems 33: 9086–97. https://proceedings.neurips.cc/paper/2020/file/673de96b04fa3adcae1aacda704217ef-Paper.pdf (accessed January 2022).

38.

Mercer

L. D.

Wakefield

Pantazis

Lutambi

A. M.

Masanja

Clark

2015. “Space-Time Smoothing of Complex Survey Data: Small Area Estimation for Child Mortality.” The Annals of Applied Statistics 9 (4): 1889. DOI: https://doi.org/10.1214/15-AOAS872.

39.

Nandram, B., A. L. Erciulescu, N. B. Cruze, and L. Chen , et al. 2022. “Bayesian Small Area Models under Inequality Constraints with Benchmarking and Double Shrinkage.” Technical Report, United States Department of Agriculture, National Agricultural Statistics Service. https://www.nass.usda.gov/Education_and_Outreach/Reports,_Presentations_and_Conferences/reports/ResearchReport_constraintmodel.pdf (accessed January 2023).

40.

Nandram

Erciulescu

A. L.

Cruze

N. B.

2019. “Bayesian Benchmarking of the Fay-Herriot Model Using Random Deletion.” Survey Methodology 45 (2): 365–91. https://www150.statcan.gc.ca/n1/pub/12-001-x/2019002/article/00004-eng.htm (accessed March 2022).

41.

Nandram

Sayit

2011. “A Bayesian Analysis of Small Area Probabilities Under a Constraint.” Survey Methodology 37 (2): 137–52. https://www150.statcan.gc.ca/n1/pub/12-001-x/2011002/article/11603-eng.pdf (accessed March 2022).

42.

Osgood-Zimmerman

Millear

A. I.

Stubbs

R. W.

Shields

Pickering

B. V.

Earl

Graetz

, et al. 2018. “Mapping Child Growth Failure in Africa Between 2000 and 2015.” Nature 555 (7694): 41–7. DOI: https://doi.org/10.1038/nature25760.

43.

Osgood-Zimmerman

Wakefield

2021. “A Statistical Introduction to Template Model Builder: A Flexible Tool for Spatial Modeling.” arXiv:2103.09929v1. DOI: https://doi.org/10.48550/arXiv.2103.09929.

44.

Paige

Fuglstad

G.-A.

Riebler

Wakefield

2022. “Design-and Model-Based Approaches to Small-Area Estimation in a Low-and Middle-Income Country Context: Comparisons and Recommendations.” Journal of Survey Statistics and Methodology 10 (1): 50–80. DOI: https://doi.org/10.1093/jssam/smaa011.

45.

Patra

2019. “Constrained Bayesian Inference Through Posterior Projection with Applications.” PhD dissertation, Duke University. https://hdl.handle.net/10161/19864 (accessed March 2022).

46.

Patra

Dunson

D. B.

2018. “Constrained Bayesian Inference Through Posterior Projections.” arXiv:1812.05741v3. DOI: https://doi.org/10.48550/arXiv.1812.05741.

47.

Pfeffermann

Tiller

2006. “Small-Area Estimation with State–Space Models Subject to Benchmark Constraints.” Journal of the American Statistical Association 101 (476): 1387–97. DOI: https://doi.org/10.1198/016214506000000591.

48.

Rao

J. N.

Molina

2015. Small Area Estimation. Hoboken, NJ: John Wiley & Sons. DOI: https://doi.org/10.1002/9781118735855.

49.

Riebler

Sørbye

S. H.

Simpson

Rue

2016. “An Intuitive Bayesian Spatial Model for Disease Mapping That Accounts for Scaling.” Statistical Methods in Medical Research 25 (4): 1145–65. DOI: https://doi.org/10.1177/0962280216660421.

50.

Rue

Martino

Chopin

2009. “Approximate Bayesian Inference for Latent Gaussian Models by Using Integrated Nested Laplace Approximations.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71 (2): 319–92. DOI: https://doi.org/10.1111/j.1467-9868.2008.00700.x.

51.

Särndal

C.-E.

Swensson

Wretman

2003. Model Assisted Survey Sampling. New York, NY: Springer Science & Business Media. DOI: https://doi.org/10.1007/978-1-4612-4378-6.

52.

Zhou

2020. “Bayes-Raking: Bayesian Finite Population Inference with Known Margins.” Journal of Survey Statistics and Methodology 9 (4): 833–55. DOI: https://doi.org/10.1093/jssam/smaa008.

53.

Simpson

Rue

Riebler

Martins

T. G.

Sørbye

S. H.

, 2017. “Penalising Model Component Complexity: A Principled, Practical Approach to Constructing Priors.” Statistical Science 32 (1): 1–28. DOI: https://doi.org/10.1214/16-STS576.

54.

Smith

A. F.

Gelfand

A. E.

1992. “Bayesian Statistics Without Tears: A Sampling–Resampling Perspective.” The American Statistician 46 (2): 84–8. DOI: https://doi.org/10.2307/2684170.

55.

Stefan

Hidiroglou

M. A.

2021. “Small Area Benchmarked Estimation Under the Basic Unit Level Model When the Sampling Rates Are Non-Negligible.” Survey Methodology 47 (1): 123–50. https://www150.statcan.gc.ca/n1/pub/12-001-x/2021001/article/00007-eng.htm (accessed March 2022).

56.

Steorts

R. C.

Schmid

Tzavidis

2020. “Smoothing and Benchmarking for Small Area Estimation.” International Statistical Review 88: 580–98. DOI: https://doi.org/10.1111/insr.12373.

57.

Stover

Glaubius

Mofenson

Dugdale

C. M.

Davies

M.-A.

Patten

Yiannoutsos

2019. “Updates to the Spectrum/AIM Model for Estimating Key HIV Indicators at National and Subnational Levels.” AIDS (London, England) 33: S227. DOI: https://doi.org/10.1097/QAD.0000000000002357.

58.

Tierney

1994. “Markov Chains for Exploring Posterior Distributions.” The Annals of Statistics 22: 1701–28. DOI: https://doi.org/10.1214/aos/1176325750.

59.

Trabelsi

Hillmer

S. C.

1990. “Bench-Marking Time Series with Reliable Bench-Marks.” Journal of the Royal Statistical Society: Series C (Applied Statistics) 39 (3): 367–79. DOI: https://doi.org/10.2307/2347386.

60.

Vehtari

Gelman

Simpson

Carpenter

Bürkner

P.-C.

2021. “Rank-Normalization, Folding, and Localization: An Improved R for Assessing Convergence of MCMC (with Discussion).” Bayesian Analysis 16 (2): 667–718. DOI: https://doi.org/10.1214/20-BA1221.

61.

Wakefield

2013. Bayesian and Frequentist Regression Methods. Vol. 23. New York, NY: Springer. DOI: https://doi.org/10.1007/978-1-4419-0925-1.

62.

Wakefield

Fuglstad

G.-A.

Riebler

Godwin

Wilson

Clark

S. J.

2019. “Estimating Under-Five Mortality in Space and Time in a Developing World Context.” Statistical Methods in Medical Research 28 (9): 2614–34. DOI: https://doi.org/10.1177/0962280218767988.

63.

Wakefield

Okonek

Pedersen

2020. “Small Area Estimation for Disease Prevalence Mapping.” International Statistical Review 88 (2): 398–418. DOI: https://doi.org/10.1111/insr.12400.

64.

Wang

Fuller

W. A.

2008. “Small Area Estimation Under a Restriction.” Survey Methodology 34 (1): 29. https://www150.statcan.gc.ca/n1/en/pub/12-001-x/2008001/article/10619-eng.pdf?st=TuZsIBto (accessed January 2023).

65.

Williams

Berg

2013. “Incorporating User Input into Optimal Constraining Procedures for Survey Estimates.” Journal of Official Statistics 29 (3): 375. DOI: https://doi.org/10.2478/jos-2013-0032.

66.

Z. R.

Mayala

B. K.

Wang

Gao

Paige

Fuglstad

G.-A.

, et al. 2021. “Spatial Modeling for Subnational Administrative Level 2 Small-Area Estimation.” DHS Spatial Analysis Reports 21. https://dhsprogram.com/publications/publication-SAR21-Spatial-Analysis-Reports.cfm (accessed March 2022).

67.

You

Rao

2002. “A Pseudo-Empirical Best Linear Unbiased Prediction Approach to Small Area Estimation Using Survey Weights.” Canadian Journal of Statistics 30 (3): 431–9. DOI: https://doi.org/10.2307/3316146.

68.

You

Rao

Dick

2002. “Benchmarking Hierarchical Bayes Small Area Estimators with Application in Census Undercoverage Estimation.” In Proceedings of the Survey Methods Section, 86–90. https://ssc.ca/sites/default/files/survey/documents/SSC2002_Y_You.pdf (accessed January 2023).

69.

Zhang

J. L.

Bryant

2020. “Fully Bayesian Benchmarking of Small Area Estimation Models.” Journal of Official Statistics 36 (1): 197–223. DOI: https://doi.org/10.2478/jos-2020-0010.

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

32.80 MB

A Computationally Efficient Approach to Fully Bayesian Benchmarking

Abstract

Keywords

1. Introduction

2. Small Area Models in Low- and Middle-Income Countries

2.1. Small Area Models

2.1.1. Area-Level Model: Spatial Fay-Herriot

2.1.2. Unit-Level Model: Binomial

3. Methods

3.1. Benchmarked Bayes Estimate Approach

3.2. Raking Approach

3.3. Fully Bayesian Benchmarking Approach

3.4. Proposed Approach

3.4.1. Rejection Sampler

3.4.2. Metropolis-Hastings

3.4.3. Convergence Diagnostics

3.4.4. Limitations

4. Simulation

5. Application

5.1. Data

5.2. Benchmarked Models

5.2.1. Approaches

5.2.1.1. Raking

5.2.1.2. Fully Bayesian

5.2.1.3. Fully Bayesian: Rejection sampler

5.2.1.4. Fully Bayesian: Metropolis-Hastings algorithm

5.3. Model Validation

5.4. Computation

5.5. Results

5.6. U5MR

6. Discussion

Footnotes

Funding

Supplemental Material

References

Supplementary Material