
Abstract

When analyzing spatially referenced event data, the criteria for declaring rates as “reliable” are still a matter of dispute. What these varying criteria have in common, however, is that they are rarely satisfied for crude estimates in small area analysis settings, prompting the use of spatial models to improve reliability. While this is a reasonable response, recent work has quantified the extent to which popular models from the spatial statistics literature can overwhelm the information contained in the data, leading to oversmoothing. Here, we begin by providing a definition for a “reliable” estimate for event rates that can be used for crude and model-based estimates and allows for discrete and continuous statements of reliability. We then construct a spatial Bayesian framework that allows users to infuse prior information into their models to improve reliability while also guarding against oversmoothing. We apply our approach to county-level birth data from Pennsylvania, highlighting the effect of oversmoothing in spatial models and how our approach can allow users to better focus their attention on areas where sufficient data exist to drive inferential decisions. We then conclude with a brief discussion of how this definition of reliability can be used in the design of small area studies.

Keywords

Bayesian inference, informative priors, preterm birth, small area analyses, spatial statistics

1. Introduction

When producing estimates of the incidence of adverse health events (e.g., heart disease related deaths) or the prevalence of a risk factor (e.g., obesity) in small areas, there is a lack of consensus on what constitutes a “reliable” estimate. For instance, the United States Cancer Statistics (USCS) Working Group (2022) recommends declaring an estimate reliable if it is based on sixteen or more cases, while groups such as the New York Department of Health (1999) require twenty or more cases. As an illustration of the complexity of this issue, the Rhode Island Department of Health (2016) and the Utah Department of Health (2009) each provide detailed flowcharts describing when to report an estimate, when to report with a warning about reliability, and when to suppress estimates or aggregate data in their published reports. More recently, the National Center for Health Statistics (NCHS; Parker et al. 2017) produced a report detailing their new standards for reporting proportions, which consist of a mix of guidance based on the sample size (greater than thirty) and the width of the estimate’s confidence interval. Many of these approaches are based on the coefficient of variation (CV; also referred to as the relative standard error), that is, the standard error of the rate estimate divided by the estimate itself, though the rationale underlying these rules is still quite vague. An exception to this is the criterion of the USCS, which requires that the ratio of an estimate to the width of its 95% confidence interval (assuming normality) be greater than 1 or, equivalently, that $CV < 1 / 4$ . Regardless of the criteria used, however, it is often the case that data from small areas will fail to satisfy the necessary criteria, especially when the data are stratified by demographic factors such as race/ethnicity, sex, and age.

One approach to relieve concerns about reporting estimates with high uncertainty is to aggregate the data across neighboring spatial regions and/or adjacent time periods. For instance, the County Health Rankings and Roadmaps program (CHR&R; Remington et al. 2015) aggregates data over periods of up to seven years in order to provide reliable estimates for as many counties as possible when constructing the measures used to determine their rankings. While aggregation should boost case counts—and thus lead to more reliable estimates—it may also preclude inference on fine-level geographic disparities and/or temporal changes. Perhaps worse, spatially aggregating neighboring regions with vastly different underlying rates can produce estimates that misrepresent each of the individual regions (e.g., Bradley et al. 2017; Spielman and Folch 2015). These concerns serve as a motivation for this work.

When analyzing spatially referenced data, an attractive alternative to data aggregation is to report model-based estimates generated in a Bayesian framework. For instance, methods such as the conditional autoregressive (CAR) model of Besag et al. (1991) and its multivariate extensions (e.g., Gelfand and Vounatsou 2003; Quick et al. 2018) have been used to produce over forty years of county-level estimates of heart disease mortality by age, race, and sex (Vaughan et al. 2019), county-level estimates of suicide rates from 2005 to 2015 (Khana et al. 2018), and census tract-level estimates of obesity rates (Quick et al. 2020). Here, the benefit of using Bayesian methods is two-fold: not only do we obtain more precise (and thus more reliable) estimates via the infusion of prior information, but we may also obtain more accurate estimates by virtue of Bayesian shrinkage (Stein 1956). The key drawback of this, however, is that tasks such as quantifying and specifying the amount of information contributed by our models (relative to the data) are not always straightforward (e.g., Morita et al. 2008). For instance, recent work by Quick et al. (2021) and Song et al. (2024) illustrated that the informativeness of the CAR model framework of Besag et al. (1991), when left to its own devices, could far exceed the contribution of the data in most regions, leading to oversmoothing and potentially untrustworthy inference. While the authors provided guidance with regard to how to restrict the model’s informativeness, here we will provide a rationale for determining an appropriate cap on how much information the model should be allowed to contribute.

With this background in mind, the objectives of this paper are two-fold. First and foremost, we provide a definition for a “reliable” estimate of an incidence or prevalence rate that is based on statistical point and interval estimates. Not only does this definition accommodate crude and model-based estimates, but it also allows for discrete and continuous statements of reliability—for example, distinctions of reliable versus unreliable and the ability to indicate when one estimate is more reliable than another. Secondly, by anchoring this definition in a Bayesian framework, we allow users to infuse prior information (e.g., spatial structure) into their models to improve the reliability of their estimates. In doing so, however, we must also be cognizant of the influence these models may have on our estimates. As such, the foundation for the proposed work is the need to “embrace uncertainty”—that is, rather than designing models to ensure that all estimates are reliable, we will provide guidance for restricting the informativeness of the Besag et al. (1991) CAR model with an eye toward requiring a sufficient number of cases be observed for estimates to be deemed “reliable.”

2. Methods

2.1. Definition of Reliability

We begin by assuming $y_{i} \sim Bin (n_{i}, π_{i})$ , where $y_{i}$ denotes the number of cases in region $i$ out of a population (or in the case of survey data, a sample) of size $n_{i}$ and where $π_{i}$ denotes the underlying event rate, for $i = 1, \dots, I$ . For the purpose of establishing our definition for a reliable event rate, we then assume $π_{i} \sim Beta (a_{i}, b_{i})$ , which yields a posterior of the form

π_{i} | y_{i} \sim Beta (y_{i} + a_{i}, n_{i} - y_{i} + b_{i}),

(1)

and yields interpretations of $a_{i}$ and $b_{i}$ as the prior number of cases and the prior number of non-cases, respectively, out of a total prior population of size $a_{i} + b_{i}$ . In this framework, the USCS’s criteria for a reliable rate would require

CV (π_{i} | y_{i}) = \frac{\sqrt{V [π_{i} | y_{i}]}}{E [π_{i} | y_{i}]} = \sqrt{\frac{n_{i} - y_{i} + b_{i}}{(y_{i} + a_{i}) (n_{i} + a_{i} + b_{i} + 1)}} < 1 / 4 .

(2)

To identify conditions for $y_{i}$ in which the posterior in (1) would yield a reliable estimate, we first note that the requirement in equation (2) can be reexpressed as a requirement on the posterior number of cases: if $E [π_{i} | y_{i}] \leq 0.5$ ,

(y_{i} + a_{i}) > 16 (1 - E [π_{i} | y_{i}]),

(3)

when $n_{i} + a_{i} + b_{i} \geq 16$ . A convenient feature of the requirement in (2) is that if we were to approximate the 95% credible interval (CI) for $π_{i}$ as $E [π_{i} | y_{i}] \pm 2 \sqrt{V [π_{i} | y_{i}]}$ , then satisfying (3) would also result in $E [π_{i} | y_{i}]$ being greater than the width of its 95% CI. Finally, while the relationships in equations (1)–(3) are based on binomially distributed data, similar results hold when the data are modeled as being Poisson distributed, as is customary when analyzing birth and death rates (Brillinger 1986) and other rare event data. Analogous derivations under the Poisson model specification are provided in Supplemental Appendix A, the requirements under which are consistent with those for binomial data when $π_{i}$ is small.
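The checks in equations (2) and (3) are easy to evaluate directly. The following is a minimal sketch, not code from the paper: the function names are hypothetical, and the prior values $a_{i} = b_{i} = 0.5$ in the usage lines are illustrative assumptions only.

```python
import math

def posterior_cv(y, n, a, b):
    """CV of the Beta(y + a, n - y + b) posterior, as in equation (2)."""
    alpha = y + a          # posterior number of cases
    beta = n - y + b       # posterior number of non-cases
    return math.sqrt(beta / (alpha * (alpha + beta + 1)))

def meets_cv_criterion(y, n, a, b, threshold=0.25):
    """USCS-style criterion: posterior CV below 1/4."""
    return posterior_cv(y, n, a, b) < threshold

def meets_case_requirement(y, n, a, b):
    """Equation (3): posterior cases exceed 16 * (1 - posterior mean)."""
    posterior_mean = (y + a) / (n + a + b)
    return (y + a) > 16 * (1 - posterior_mean)

# Illustrative (assumed) inputs: 20 or 5 cases out of 1,000.
print(meets_cv_criterion(20, 1000, 0.5, 0.5), meets_case_requirement(20, 1000, 0.5, 0.5))
print(meets_cv_criterion(5, 1000, 0.5, 0.5), meets_case_requirement(5, 1000, 0.5, 0.5))
```

With these assumed inputs the two checks agree: twenty cases pass both and five cases fail both.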

We thus generalize the USCS reliability criteria as a function of quantiles of the posterior distribution as follows:

Definition 1. Estimates of the rate parameter $π$ obtained from the posterior distribution $p (π | y)$ are reliable at the $1 - α$ level if the posterior medians of $π$ and its opposite, $1 - π$ , are each larger than the width of their respective $(1 - α) \times 100 %$ equal-tailed credible intervals.

We also define the relative precision of an estimate at the $1 - α$ level to be the ratio of its posterior median over the width of its $(1 - α) \times 100 %$ credible interval, where a relative precision greater than 1 corresponds to an estimate that is reliable at the $1 - α$ level. Finally, note that Definition 1 is designed to ensure that $π$ and its opposite, $1 - π$ , receive the same reliability distinction; for example, the estimate for the prevalence of people who do smoke has the same level of reliability as the estimate for the prevalence of people who do not smoke.
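As a rough illustration of Definition 1 under the beta-binomial posterior in equation (1), the sketch below approximates the posterior median and equal-tailed interval by Monte Carlo sampling with the Python standard library. This is an assumption made for convenience; exact beta quantiles would serve equally well. The function names, the seed, and the prior values in the usage lines are all hypothetical.

```python
import random

def posterior_summary(y, n, a, b, level=0.95, draws=100_000, seed=1):
    """Monte Carlo estimate of the posterior median and the width of the
    equal-tailed credible interval for Beta(y + a, n - y + b)."""
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(y + a, n - y + b) for _ in range(draws))
    alpha = 1 - level
    last = draws - 1
    lower = samples[int((alpha / 2) * last)]
    upper = samples[int((1 - alpha / 2) * last)]
    median = samples[last // 2]
    return median, upper - lower

def relative_precision(y, n, a, b, level=0.95):
    """Posterior median divided by the credible interval width."""
    median, width = posterior_summary(y, n, a, b, level)
    return median / width

def is_reliable(y, n, a, b, level=0.95):
    """Definition 1: the medians of both pi and 1 - pi exceed the width."""
    median, width = posterior_summary(y, n, a, b, level)
    return min(median, 1 - median) > width

# Illustrative (assumed) inputs.
print(is_reliable(100, 1000, 0.5, 0.5))  # many events
print(is_reliable(2, 9, 0.5, 0.5))       # very few events
```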

2.2. Impact of Informative Priors

Using this definition of reliability, the requirement from (3), and a relatively noninformative prior for $π_{i}$ (that is, $a_{i} = 1 / 2$ and $b_{i} = a_{i} (1 - π_{i 0}) / π_{i 0}$ such that $π_{i 0} = a_{i} / (a_{i} + b_{i})$ ), we would need to observe sixteen cases to obtain a reliable estimate at the 0.95 level when $π_{i} = 0.01$ , twelve cases when $π_{i} = 0.20$ , and nine cases when $π_{i} = 0.40$ , as shown in Figure 1a. Similarly, if we were to relax the desired level of reliability and assume $π_{i} = 0.01$ , approximately sixteen events would be required to deem a rate reliable at the 0.95 level, eleven events at the 0.90 level, and seven events at the 0.80 level, as displayed in Figure 1b. From this point forward, references to reliability and the relative precision are at the 0.95 level unless otherwise stated.

Figure 1.

Comparison of the relative precision as a function of the number of events. Panel (a) displays the relative precision at the 0.95 level for various underlying event rates, and Panel (b) displays the relative precision for various levels of reliability for an event rate of $π_{i} = 0.01$ . (a) Relative Precision × Event Rates and (b) Relative Precision × Levels of Reliability.

While the results above are convenient, the true benefit of this definition of reliability is revealed when we consider informative prior specifications for $π_{i}$ . For instance, suppose we are analyzing a dataset comprising the number of infants born preterm (i.e., before thirty-seven weeks of pregnancy) out of the total number of births in one or more small areas and that prior information indicates that 10% of infants are born preterm. Based on equation (3), our posterior number of preterm births, $y_{i} + a_{i}$ , must exceed $16 \times (1 - 0.10) = 14.4$ for the estimates from these areas to be deemed “reliable.” Thus, rather than explicitly requiring $y_{i}$ to exceed a given threshold for an estimate to be deemed reliable, we can instead view such requirements in terms of a maximum on the amount of information that can be contributed by the prior. For instance, if $a_{i} < 16 \times (1 - 0.10) / 2 = 7.2$ , then the data must contribute more information to our estimate than the prior (that is, $y_{i} > a_{i}$ ) in order for an estimate to be deemed reliable.
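The arithmetic in this example can be reproduced with a few lines of code. This is a minimal sketch that, like the worked example, plugs an assumed event rate into equation (3) in place of the posterior mean; the function names are hypothetical.

```python
import math

def required_posterior_cases(rate):
    """Right-hand side of equation (3), using an assumed event rate."""
    return 16 * (1 - rate)

def min_observed_cases(rate, a):
    """Smallest integer y satisfying y + a > 16 * (1 - rate)."""
    needed = required_posterior_cases(rate) - a
    return max(0, math.floor(needed) + 1)

threshold = required_posterior_cases(0.10)   # 14.4 posterior preterm births
prior_cap = threshold / 2                    # 7.2: below this, y must exceed a
print(threshold, prior_cap)
print(min_observed_cases(0.10, 5))           # observed cases needed when a = 5
```

For a prior contributing five preterm births at a 10% rate, the sketch returns ten observed cases, since $10 + 5 > 14.4$ while $9 + 5$ falls short.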

That said, explicitly incorporating prior information as described above is not a common practice. Furthermore, in the context of disease mapping, it is common to consider a model specification in which

logit (π_{i}) | β, z, σ^{2} \sim Norm (x_{i}^{T} β + z_{i}, σ^{2}),

(4)

where $x_{i}$ denotes a $p$ -vector of region-specific covariates with a corresponding vector of regression coefficients, β, and $z = {(z_{1}, \dots, z_{I})}^{T}$ is a vector of spatial random effects based on the CAR model framework of Besag et al. (1991), denoted by $z \sim CAR (τ^{2})$ . While measuring the informativeness of the model specification in equation (4) is nontrivial, recent work by Song et al. (2024) established a relationship between the posterior in equation (1) and the posterior resulting from equation (4). This was achieved by first equating the mean and the variance of a beta distribution and a logitnormal distribution using the delta method, and then extending that relationship to the conditional distribution of $π_{i}$ given the remaining $π_{j}$ , $j \neq i$ , under (4) after integrating the spatial random effects, $z$ , out of the model, which yielded an approximation of the form:

{\hat{a}}_{i} = \frac{1 + \exp (x_{i}^{T} β)}{σ^{2} + (σ^{2} + τ^{2}) / m_{i}} - \frac{\exp (x_{i}^{T} β)}{1 + \exp (x_{i}^{T} β)},

(5)

where $m_{i}$ denotes the number of regions that neighbor region $i$ . Because each region can have its own covariate vector, $x_{i}$ , and its own number of neighbors, $m_{i}$ , Song et al. (2024) recommended defining the model’s baseline level of informativeness as

{\hat{a}}_{0} = \frac{1 + \exp (x_{0}^{T} β)}{σ^{2} + (σ^{2} + τ^{2}) / m_{0}} - \frac{\exp (x_{0}^{T} β)}{1 + \exp (x_{0}^{T} β)},

(6)

wherein $x_{i}$ and $m_{i}$ from equation (5) are replaced with $x_{0}$ and $m_{0}$ , respectively, where $x_{0} = {(x_{01}, \dots, x_{0 p})}^{T}$ represents the vector of the average covariate values—that is, $x_{0 j} = Σ_{i} x_{i j} / I$ for $j = 1, \dots, p$ —and $m_{0}$ represents a baseline number of neighbors with $m_{0} = 3$ suggested as a rule-of-thumb to allow for comparisons between datasets. Following a similar process, Quick et al. (2021) estimated the informativeness of the analogous Poisson model specification as

{\hat{a}}_{0} = \frac{1}{\exp [σ^{2} + (σ^{2} + τ^{2}) / m_{0}] - 1} .

(7)
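Equations (5) through (7) can be evaluated directly once values for the hyperparameters are available. The sketch below is only an illustration: the function names are hypothetical, and the variance values and the 10% baseline rate in the usage lines are assumptions chosen for demonstration, not estimates from any dataset.

```python
import math

def a_hat_binomial(xb, sigma2, tau2, m):
    """Equations (5)/(6): approximate prior number of events contributed by
    the logit-normal CAR model; xb is the linear predictor x^T beta."""
    e = math.exp(xb)
    return (1 + e) / (sigma2 + (sigma2 + tau2) / m) - e / (1 + e)

def a_hat_poisson(sigma2, tau2, m):
    """Equation (7): the analogous quantity for the Poisson specification."""
    return 1 / math.expm1(sigma2 + (sigma2 + tau2) / m)

# Assumed values: baseline rate of 10%, m0 = 3 neighbors.
xb0 = math.log(0.10 / 0.90)
print(a_hat_binomial(xb0, sigma2=0.01, tau2=0.05, m=3))
print(a_hat_poisson(sigma2=0.01, tau2=0.05, m=3))
```

In both expressions, smaller variance parameters produce a larger ${\hat{a}}_{0}$ , which is the sense in which a tightly estimated CAR model can contribute more information than the data.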

Finally, it should be emphasized that in the beta-binomial framework in equation (1), it was implicitly assumed that the hyperparameters $a_{i}$ and $b_{i}$ were fixed and known for all $i = 1, \dots, I$ . In contrast, the various hyperparameters encountered in the CAR model framework shown in equation (4), namely β, $σ^{2}$ , and $τ^{2}$ , are all commonly assigned non- or weakly informative hyperpriors. Moreover, while the variance parameters $σ^{2}$ and $τ^{2}$ are largely responsible for controlling the estimated model informativeness parameters presented in equations (5)–(7), β is primarily responsible for controlling the overall average rate of the $π_{i}$ parameters. As such, it is important to keep in mind that if there is not enough information in the data to precisely estimate β (e.g., if $Σ_{i} y_{i}$ is small), the rate estimates will be less precise, and thus less reliable, than ${\hat{a}}_{0}$ would suggest. Similarly, because ${\hat{a}}_{0}$ is calculated from the hyperparameters, using a point estimate such as the posterior mean or median for ${\hat{a}}_{0}$ will provide guidance into the model’s informativeness, but it may not perfectly align with the relative precision and the level of reliability of the rate estimates. While the use of informative priors, particularly on $σ^{2}$ and $τ^{2}$ , could provide some stability and help avoid oversmoothing, specifying said priors is often not intuitive. We therefore recommend relying on standard non- or weakly informative priors (e.g., those considered by Bernardinelli et al. (1995)) and controlling the informativeness of the model under the restriction that ${\hat{a}}_{0} < A$ for some $A > 0$ . A brief description of how the model in equation (4) can be fit under such a restriction and sample MultiBUGS (Goudie et al. 2020) code are provided in Supplemental Appendix B.
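One way to read the restriction ${\hat{a}}_{0} < A$ is as a lower bound on $τ^{2}$ for given values of $σ^{2}$ and β, obtained by rearranging equation (6). The sketch below works through that rearrangement; it is an illustrative reading under that assumption and is not the implementation described in Supplemental Appendix B. All function names and input values are hypothetical.

```python
import math

def a_hat_binomial(xb, sigma2, tau2, m):
    """Equation (6), with xb standing in for x_0^T beta."""
    e = math.exp(xb)
    return (1 + e) / (sigma2 + (sigma2 + tau2) / m) - e / (1 + e)

def tau2_lower_bound(A, xb, sigma2, m):
    """Value of tau2 at which a_hat equals A; a_hat < A requires tau2
    to exceed this bound (derived by solving equation (6) for tau2)."""
    e = math.exp(xb)
    p0 = e / (1 + e)
    # a_hat < A  <=>  sigma2 + (sigma2 + tau2) / m > (1 + e) / (A + p0)
    min_total = (1 + e) / (A + p0)
    return max(0.0, m * (min_total - sigma2) - sigma2)

def satisfies_restriction(A, xb, sigma2, tau2, m):
    """Indicator that a_hat_0 < A, as in the restricted model."""
    return a_hat_binomial(xb, sigma2, tau2, m) < A

# Assumed values: baseline rate of 10%, sigma2 = 0.01, m0 = 3, A = 5.
xb0 = math.log(0.10 / 0.90)
bound = tau2_lower_bound(5, xb0, 0.01, 3)
print(bound, satisfies_restriction(5, xb0, 0.01, 1.01 * bound, 3))
```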

3. Case Study: Rates of Preterm Birth in PA Counties

To illustrate how this approach can be used in practice, we conduct a case study pertaining to preterm birth in Pennsylvania counties from 2010 to 2019. The data are stratified by both year and race/ethnicity of the mother (white, black, Hispanic, and Asian) and were obtained from the Pennsylvania Department of Health’s (2020) web-based Enterprise Data Dissemination Informatics Exchange (EDDIE) system. In the EDDIE system, race/ethnicity is coded as white, black, Hispanic, or Asian with the caveat that these groups may not be mutually exclusive (e.g., mothers coded as “white” may be Hispanic or non-Hispanic white). For the purposes of our analysis, however, we will analyze each of these combinations of racial/ethnic groups and time-periods as separate, distinct datasets; doing so will allow us to make comparisons between datasets that are both similar (i.e., comparisons across time within each racial/ethnic group) and quite different (i.e., comparisons between racial/ethnic groups with different geographic distributions and/or underlying event rates) to assess the degree to which our results are anomalous or in fact phenomena that can commonly occur in analyses of real data.

A summary of the data is provided in Table 1. Here, we see evidence of several important features. First and foremost, we see evidence of Pennsylvania’s racial demographics: not only were approximately 70% of the state’s births to white mothers, but racial/ethnic minority mothers (particularly black and Asian mothers) are also more geographically concentrated in the state’s urban centers, Philadelphia and Pittsburgh. As a result, data from racial/ethnic minority mothers are sparse, with more than 30% of Pennsylvania’s counties experiencing zero preterm births from black, Asian, and Hispanic mothers and two-thirds of counties experiencing fewer than ten. That being said, Table 1 also displays evidence of high racial/ethnic disparities in the incidence of preterm birth, with black mothers in Pennsylvania experiencing rates nearly 50% higher than their non-black counterparts. Demographic challenges notwithstanding, exploring county-level trends in preterm birth by race is therefore of epidemiologic interest, motivating the use of spatial models to produce more stable estimates than the data alone can provide.

Table 1.

Summary of the Pennsylvania Preterm Birth Data, Presented as Averages Over the Ten-Year Period, 2010 to 2019.

	White	Black	Asian	Hispanic
Average births per year	97,934	19,848	5,970	14,809
Average preterm births per year	8,550	2,618	470	1,474
Average preterm birth rate (%)	8.7	13.2	7.9	10.0
Percent of counties with <10 preterm births (%)	4.5	67.2	83.7	70.4
Percent of counties with zero preterm births (%)	0.3	38.4	50.3	31.3
Ratio of urban vs. rural birth totals	0.44	3.27	1.95	0.64

Ratio of urban vs. rural births calculated by declaring counties with population densities greater than 1,000 people per square mile as “urban”, which corresponds to the metropolitan areas of Philadelphia (Philadelphia County and the neighboring Bucks, Delaware, and Montgomery Counties) and Pittsburgh (Allegheny County).

With this background in mind, we let $y_{i r t}$ and $n_{i r t}$ denote the number of preterm births and total number of births, respectively, to mothers of race $r$ in county $i$ in year $t$ . We then assume $y_{i r t} | π_{i r t} \sim Bin (n_{i r t}, π_{i r t})$ and implement two model specifications analogous to (4) to model $π_{i r t}$ , stratified by each combination of race and year. First, we will let

\begin{aligned} p (π_{\cdot r t}, β_{0; r t}, z_{\cdot r t}, σ_{r t}^{2}, τ_{r t}^{2} | y_{\cdot r t}) \propto {} & \prod_{i = 1}^{I} \left[ Bin (y_{i r t} | n_{i r t}, π_{i r t}) \times Norm (logit \, π_{i r t} | β_{0; r t} + z_{i r t}, σ_{r t}^{2}) \right] \\ & \times CAR (z_{\cdot r t} | τ_{r t}^{2}) \times IG (σ_{r t}^{2} | 1, 1 / 100) \times IG (τ_{r t}^{2} | 1, 1 / 7), \end{aligned}

(8)

denote a standard CAR modeling approach, where the priors above for $σ_{r t}^{2}$ and $τ_{r t}^{2}$ are consistent with previous studies (e.g., Bernardinelli et al. 1995; Waller et al. 1997) and the prior for $β_{0; r t}$ (i.e., a flat, improper prior) can be expressed as $p (β_{0; r t}) \propto 1$ . As the approach in (8) may have a tendency to produce overly informative models, we also consider a model specification in which we restrict ${\hat{a}}_{0; r t} < 5$ ; as discussed in Section 2.2, this restriction should be expected to result in a requirement that $y_{i r t} \geq 10$ for an estimate of $π_{i r t}$ to be deemed reliable at the 0.95 level. To achieve this, we follow the approach of Song et al. (2024) and let

\begin{aligned} p (π_{\cdot r t}, β_{0; r t}, z_{\cdot r t}, σ_{r t}^{2}, τ_{r t}^{2} | y_{\cdot r t}) \propto {} & \prod_{i = 1}^{I} \left[ Bin (y_{i r t} | n_{i r t}, π_{i r t}) \times Norm (logit \, π_{i r t} | β_{0; r t} + z_{i r t}, σ_{r t}^{2}) \right] \\ & \times CAR (z_{\cdot r t} | τ_{r t}^{2}) \times IG (σ_{r t}^{2} | 1, 1 / 100) \times IG (τ_{r t}^{2} | 1, 1 / 7) \times I \{ {\hat{a}}_{0; r t} < 5 \}, \end{aligned}

(9)

denote our restricted CAR modeling approach. Both the standard and restricted models were fit in R (R Core Team 2020) using Markov chain Monte Carlo (MCMC) algorithms run for a total of one hundred thousand iterations, with separate runs for each combination of race/ethnicity and year. For the sake of uniformity and to ease the computational burden of the post hoc analyses, the first fifty thousand iterations of each chain were discarded as burn-in and the last fifty thousand iterations were thinned by a factor of 10, resulting in five thousand iterations’ worth of samples for each combination of race/ethnicity and year for both the standard and restricted models.
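The post-processing described above (discarding the first half of each chain and keeping every tenth draw thereafter) amounts to a single slice. The sketch below uses a placeholder chain of integers rather than real MCMC output, and the function name is hypothetical; the models themselves were fit in R, so this is only an illustration of the bookkeeping.

```python
def burn_and_thin(chain, burn_in, thin):
    """Drop the first `burn_in` draws, then keep every `thin`-th draw."""
    return chain[burn_in::thin]

# Placeholder chain standing in for 100,000 MCMC iterations.
chain = list(range(100_000))
kept = burn_and_thin(chain, burn_in=50_000, thin=10)
print(len(kept))  # 5000
```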

Before we discuss the rate estimates themselves (and their degree of reliability), we consider the estimates for the informativeness of the standard CAR models (based on ${\hat{a}}_{0; r t}$ ) shown in Figure 2. Here, we see that the standard CAR model framework of Besag et al. (1991) is contributing the equivalent of thirteen to forty-three preterm births per county per year for each of the racial minority groups, values that are far greater than the observed number of preterm births in most of those counties (as previously noted in Table 1). While this is not quite the case for white mothers—where the model appears to be contributing fewer preterm births per county than the data—the model is still contributing in excess of forty preterm births per county, thus estimates for all groups considered may be susceptible to oversmoothing.

Figure 2.

Comparison of the estimated model informativeness parameters, ${\hat{a}}_{0; r t}$ , from the Pennsylvania preterm birth analysis; median county-level counts provided for reference.

To illustrate the potential impacts of oversmoothing, we consider the estimates for white mothers in McKean County in 2010 and the estimates for black mothers in Adams County in 2019 shown in Figure 3. In Figure 3a, we see that while the restricted CAR model yields an estimate for white mothers in McKean County that is fairly consistent with the observed data (17.2%)—reflecting the relatively large number of preterm births observed in the data (74)—the heightened informativeness of the standard CAR model has pulled the estimate away from its observed rate toward the rate in the neighboring counties (11.4%). As we illustrate at greater length in Supplemental Appendix C (and specifically in Figure C.1), such extreme oversmoothing could inhibit our ability to detect outlying regions and thus could stymie state and local health departments’ intervention efforts. Meanwhile, Figure 3b illustrates a second drawback of the standard CAR model’s tendency to oversmooth estimates: an (unwarranted) increase in precision. Specifically, Adams County is a predominantly white, rural county in which only two preterm births were observed for black mothers (out of nine total births) in 2019. While the observed rate for this county is consistent with both its neighbors and the overall state-level average for black mothers, the standard CAR model produces an estimate whose relative precision is twice that of the estimate produced by the restricted CAR model and consistent with that of a county with over thirty preterm births.

Figure 3.

Comparison of the posterior distributions of rate parameters, $π_{i r t}$ , under the standard and restricted CAR models for white and black mothers in selected counties: (a) White Mothers; McKean County and (b) Black Mothers; Adams County.

Having discussed the effect oversmoothing can have on an individual county’s estimates, we now shift our attention to the primary focuses of this paper: maps of estimates and reliability. To do so, we consider the maps shown in Figure 4 and the relative precision plots shown in Figure 5. We begin by comparing the estimates for white mothers in 2019 based on the standard CAR model in Figure 4a to those based on the restricted CAR model in Figure 4b. While the overall geographic patterns are similar, the estimates from the restricted model have more extreme values, a manifestation of the phenomenon observed in Figure 3a. In addition, estimates for several counties are deemed unreliable in the restricted model; in contrast, all of the estimates produced by the standard CAR model are deemed reliable. As shown in the relative precision plot in Figure 5a, the unreliable estimates under the restricted model (i.e., those with a relative precision less than 1) all correspond to instances where $y_{i r t} < 10$ , demonstrating that our ${\hat{a}}_{0; r t} < 5$ restriction had the desired effect.

Figure 4.

Comparison of the preterm birth rates for white and Asian mothers in 2019 from the standard and restricted CAR models: (a) White Mothers: 2019 (Standard), (b) White Mothers: 2019 (Restricted), (c) Asian Mothers: 2019 (Standard), and (d) Asian Mothers: 2019 (Restricted).

Figure 5.

Comparison of the relative precision of the race/ethnicity-specific estimates from 2019 under the standard and restricted CAR models: (a) White Mothers: 2019, (b) Black Mothers: 2019, (c) Asian Mothers: 2019, and (d) Hispanic Mothers: 2019.

While the results for white mothers in Figures 4a and 4b provided a subtle illustration of the difference between the standard and restricted CAR models, the results for Asian mothers in Figures 4c and 4d illustrate a much more stark difference. Specifically, unlike Pennsylvania’s white population, racial minorities in Pennsylvania are much more geographically concentrated in the state’s major metropolitan areas. As such, the heightened informativeness of the standard CAR model not only yields reliable estimates in all but two counties, but it also produces a very spatially smooth map as very few counties experienced enough preterm births to overrule the model’s spatial structure. In contrast, the restricted CAR model again results in a de facto requirement that $y_{i r t} \geq 10$ to obtain a reliable estimate (as shown in Figure 5c), producing unreliable estimates in fifty-six of the state’s sixty-seven counties. Similar results are observed for black and Hispanic mothers and for the entire ten-year study period, as shown in Figures C.1 and C.2 of the Supplemental Appendix.

Finally, the discussion up until this point has treated “reliability” as a binary “reliable versus unreliable” property of an estimate, but a key feature of our definition of a reliable estimate is that we can determine an estimate’s level of reliability by identifying the value of $α \in (0, 1)$ such that its posterior median will be greater than the width of its $(1 - α) \times 100 %$ equal-tailed credible interval. For instance, Figure 6 displays maps of the level of reliability of the estimates for Asian mothers in 2019 from the standard and restricted CAR models. As first observed in Figure 4c, Figure 6a shows that all but two counties achieved a level of reliability greater than the 0.95 level under the standard CAR model. In contrast, Figure 6b highlights the degree of geographic clustering of Pennsylvania’s Asian population by virtue of the disparate levels of reliability in the estimates. Reliability maps for other racial/ethnic groups and other years are shown in Figure C.3 of the Supplemental Appendix. There, not only do we observe the stark differences in the level of reliability produced by the standard and restricted CAR models, but we also observe differences in the year-to-year variability. In particular, while the restricted model produces reliability levels that are consistent year-to-year for each racial/ethnic group—reflecting a similar degree of consistency in the temporal behavior of the underlying data—the level of reliability of the estimates produced by the standard CAR model exhibits irregular behavior. This manifests in requirements of the data to achieve reliability that differ year-to-year, as can be observed in Figure C.1 of the Supplemental Appendix.
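Given posterior draws of a rate, the level of reliability described above can be approximated by searching over $α$ . The sketch below is one possible approach under stated assumptions: it uses empirical quantiles of the draws, relies on the interval width shrinking as $α$ grows, and applies the symmetric condition from Definition 1. The function names, seed, and simulated beta draws are hypothetical stand-ins for real MCMC output.

```python
import random

def interval_width(sorted_draws, alpha):
    """Width of the empirical equal-tailed (1 - alpha) interval."""
    last = len(sorted_draws) - 1
    lower = sorted_draws[int((alpha / 2) * last)]
    upper = sorted_draws[int((1 - alpha / 2) * last)]
    return upper - lower

def reliability_level(draws, tol=1e-4):
    """Largest level 1 - alpha (up to tol) at which the medians of both
    pi and 1 - pi exceed the interval width; found by bisection."""
    s = sorted(draws)
    median = s[(len(s) - 1) // 2]
    target = min(median, 1 - median)
    lo, hi = 0.0, 1.0          # bracket for alpha; width shrinks as alpha grows
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if target > interval_width(s, mid):
            hi = mid           # reliable at this alpha; try a stricter level
        else:
            lo = mid
    return 1 - hi

# Simulated draws standing in for posterior samples (assumed values).
rng = random.Random(7)
dense = [rng.betavariate(100.5, 900.5) for _ in range(20_000)]
sparse = [rng.betavariate(2.5, 7.5) for _ in range(20_000)]
print(reliability_level(dense), reliability_level(sparse))
```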

Figure 6.

Comparison of the level of reliability of the estimates of the preterm birth rate for Asian mothers in 2019 under the standard and restricted CAR models: (a) Asian Mothers: 2019 (Standard) and (b) Asian Mothers: 2019 (Restricted).

4. Discussion

This paper was motivated by the lack of consensus in the statistical and epidemiologic literature regarding the requirements for an estimate of an event rate to be deemed “reliable.” The proposed definition in Section 2 accommodates both crude and model-based estimates, as well as discrete (reliable vs. unreliable, based on some predetermined level) and continuous statements of reliability. Equally important, our definition of reliability can be directly related to the posterior number of events (that is, the sum of the observed number of events and the prior number of events), thereby allowing users to restrict the informativeness of their model specification such that a minimum number of events must be observed in order to expect to obtain a reliable estimate. Moreover, while properties of our definition of reliability are most clearly conveyed via the conjugate beta-binomial and Poisson-gamma modeling frameworks, the approximations proposed in Quick et al. (2021) and Song et al. (2024) allow us to extend these properties to the Besag et al. (1991) framework commonly used in the disease mapping literature where the question of reliability often arises. Finally, while the notion of restricting the informativeness of the Besag et al. (1991) framework was first proposed by Quick et al. (2021) and Song et al. (2024), the reliability criteria used here provide a rationale for imposing restrictions on the model that was lacking in their work. Future work aims to extend these restrictions to other, more recently developed approaches for disease mapping (e.g., Datta et al. 2019; Leroux et al. 2000), methods for multivariate spatial and spatiotemporal settings (e.g., Gelfand and Vounatsou 2003; Quick et al. 2017), particularly for the purpose of estimating age-standardized rates from age-stratified event data, and other challenges encountered in disease mapping that may complicate determinations of reliable vs. unreliable estimates (e.g., zero-inflated data (Agarwal et al. 2002), left-censored data (Quick 2019), and undercounting (Schmertmann and Gonzaga 2018)). In the interim, however, we recommend researchers assess the relative precision of their estimates to gain insight into the model’s informativeness prior to declaring estimates as reliable or unreliable.

A principal underpinning of this work is that declaring an estimate as “reliable” conveys an element of trustworthiness, not simply that an estimate is precise. As such, we believe the motivation of using model-based estimates should not be to produce “reliable” estimates but rather to improve the estimates’ precision in a deliberate and measured manner. Thus, the objective of this paper is to describe how researchers can exercise restraint and use criteria for reliability to inform the design of their statistical models to obtain estimates that have a desired level of precision. In addition, a key feature of the reliability definition provided here is that the continuous quantification of reliability provides researchers some flexibility with regard to presenting their results. For instance, Figure C.4 of the Supplemental Appendix illustrates how dynamic and/or interactive data visualization tools can be used to display estimates with varying levels of reliability as an alternative to producing maps with few “reliable” estimates or to artificially inflating the precision of estimates (e.g., by using an overly informative model) for the purpose of having more estimates being deemed “reliable.”

Finally, we recognize that there are situations where only the most highly populated regions will experience enough events in a standard amount of time (e.g., one year) to be deemed reliable, with or without the contribution of prior information, particularly when producing estimates for and making inference on rare outcomes. In these situations, rather than increasing the model’s informativeness, the requirements for reliability can instead be used as a basis for combining data across multiple time periods. For instance, based on the expected preterm birth rates for each of the race/ethnicities considered in Section 3, combining four years of data with a model specification that contributed $\hat{a}_{0} \approx 5$ prior events would be sufficient to produce reliable estimates for one-third of Pennsylvania’s counties for Asian mothers and for half of Pennsylvania’s counties for both black and Hispanic mothers, improvements of more than ten counties for each group compared to using a single year of data. Alternatively, agencies may consider other predetermined aggregations (e.g., aggregating from counties to congressional districts in the United States) in an effort to obtain larger event counts. As with the notion of using a relaxed level of reliability, however, agencies would need to weigh the benefits of improving reliability via aggregation versus producing estimates that may have reduced utility (e.g., the effect of aggregation on policymaking).
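The design calculation described above can be sketched as follows. This is a simplified illustration under stated assumptions rather than the procedure used in Section 3: expected annual event counts are treated as if they were observed, the threshold of sixteen posterior events is again borrowed from the USCS rule in the Introduction, and the county names and counts are hypothetical.

```python
import math


def years_needed(expected_events_per_year: float,
                 prior_events: float = 5.0,
                 min_posterior_events: float = 16.0) -> int:
    """Years of pooled data needed for expected + prior events to reach the threshold."""
    if expected_events_per_year <= 0:
        raise ValueError("expected_events_per_year must be positive")
    shortfall = min_posterior_events - prior_events
    if shortfall <= 0:
        return 0  # the prior alone already meets the (assumed) threshold
    return math.ceil(shortfall / expected_events_per_year)


# Hypothetical expected annual event counts for three counties.
expected = {"County A": 12.0, "County B": 3.0, "County C": 0.8}
plan = {name: years_needed(mu) for name, mu in expected.items()}
```

In this toy setting, a county expecting three events per year would need four years of pooled data, while a county expecting fewer than one event per year would need well over a decade, a case where spatial aggregation may be the more practical option.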

Supplemental Material

Supplemental material, sj-zip-1-jof-10.1177_0282423X241244917 for Reliable Event Rates for Disease Mapping by Harrison Quick and Guangzi Song in Journal of Official Statistics

Footnotes

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Support for this work came from the County Health Rankings & Roadmaps program and the National Heart, Lung, And Blood Institute of the National Institutes of Health, United States under Award Number R01HL158802. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Supplemental Material

Supplemental material for this article is available online.

Received: February 2023

Accepted: December 2023

References

Agarwal D. K., Gelfand A. E., and Citron-Pousty. 2002. “Zero-Inflated Models with Application to Spatial Count Data.” Environmental and Ecological Statistics 9: 341–55. DOI: https://doi.org/10.1023/A:1020910605990.

Bernardinelli, Clayton, and Montomoli. 1995. “Bayesian Estimates of Disease Maps: How Important Are Priors?” Statistics in Medicine 14: 2411–31. DOI: https://doi.org/10.1002/sim.4780142111.

Besag, York, and Mollié. 1991. “Bayesian Image Restoration, with Two Applications in Spatial Statistics.” Annals of the Institute of Statistical Mathematics 43: 1–59. DOI: https://doi.org/10.1007/BF00116466.

Bradley J. R., Wikle C. K., and Holan S. H. 2017. “Regionalization of Multiscale Spatial Processes Using a Criterion for Spatial Aggregation Error.” Journal of the Royal Statistical Society Series B: Statistical Methodology 79: 815–32. DOI: https://doi.org/10.1111/RSSB.12179.

Brillinger D. R. 1986. “The Natural Variability of Vital Rates and Associated Statistics.” Biometrics 42: 693–734. DOI: https://doi.org/10.2307/2530689.

Datta, Banerjee, Hodges J. S., and Gao. 2019. “Spatial Disease Mapping Using Directed Acyclic Graph Auto-Regressive (DAGAR) Models.” Bayesian Analysis 14: 1221–44. DOI: https://doi.org/10.1214/19-BA1177.

Gelfand A. E. and Vounatsou. 2003. “Proper Multivariate Conditional Autoregressive Models for Spatial Data Analysis.” Biostatistics 4: 11–25. DOI: https://doi.org/10.1093/BIOSTATISTICS/4.1.11.

Goudie R. J. B., Turner R. M., De Angelis, and Thomas. 2020. “MultiBUGS: A Parallel Implementation of the BUGS Modeling Framework for Faster Bayesian Inference.” Journal of Statistical Software 95: 7. DOI: https://doi.org/10.18637/JSS.V095.I07.

Khana, Rossen L. M., Hedegaard, and Warner. 2018. “A Bayesian Spatial and Temporal Modeling Approach to Mapping Geographic Variation in Mortality Rates for Subnational Areas with R-INLA.” Journal of Data Science 16: 147–82. DOI: https://doi.org/10.6339/JDS.20180116(1).0009.

Leroux B. G., Lei, and Breslow. 2000. “Estimation of Disease Rates in Small Areas: A New Mixed Model for Spatial Dependence.” In Statistical Models in Epidemiology, the Environment, and Clinical Trials, edited by Halloran M. E. and Berry, 179–91. New York, NY: Springer New York. DOI: https://doi.org/10.1007/978-1-4612-1284-3_4.

Morita, Thall P. F., and Müller. 2008. “Determining the Effective Sample Size of a Parametric Prior.” Biometrics 64: 594–602. DOI: https://doi.org/10.1111/J.1541-0420.2007.00888.X.

New York Department of Health. 1999. “Rates Based on Small Numbers — Statistics Teaching Tools.” Available at: https://www.health.ny.gov/diseases/chronic/ratesmall.htm (accessed December 10, 2023).

Parker, Talih, Malec, and the Data Suppression Workgroup. 2017. “National Center for Health Statistics Data Presentation Standards for Proportions.” National Center for Health Statistics, Vital and Health Statistics, 2.

Pennsylvania Department of Health. 2020. “EDDIE.” Available at: https://www.health.pa.gov/topics/HealthStatistics/EDDIE/Pages/EDDIE.aspx# (accessed December 10, 2023).

Quick. 2019. “Estimating County-Level Mortality Rates Using Highly Censored Data from CDC WONDER.” Preventing Chronic Disease 16: E76. DOI: https://doi.org/10.5888/pcd16.180441.

Quick, Song, and Tabb. 2021. “Evaluating the Informativeness of the Besag-York-Mollié CAR Model.” Spatial and Spatio-temporal Epidemiology 37: 100420. DOI: https://doi.org/10.1016/j.sste.2021.100420.

Quick, Terloyeva, Moore, and Diez Roux A. V. 2020. “Trends in Tract-Level Prevalence of Obesity in Philadelphia by Race/Ethnicity, Space, and Time.” Epidemiology 31: 15–21. DOI: https://doi.org/10.1097/EDE.0000000000001118.

Quick, Waller L. A., and Casper. 2017. “Multivariate Spatiotemporal Modeling of Age-Specific Stroke Mortality.” Annals of Applied Statistics 11: 2170–82. DOI: https://doi.org/10.1214/17-AOAS1068.

Quick, Waller L. A., and Casper. 2018. “A Multivariate Space-Time Model for Analysing County-Level Heart Disease Death Rates by Race and Sex.” Journal of the Royal Statistical Society Series C: Applied Statistics 67: 291–304. DOI: https://doi.org/10.1111/rssc.12215.

R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.

Remington P. L., Catlin B. B., and Gennuso K. P. 2015. “The County Health Rankings: Rationale and Methods.” Population Health Metrics 13: 1–12. DOI: https://doi.org/10.1186/s12963-015-0044-2.

Rhode Island Department of Health. 2016. “Rhode Island Small Numbers Reporting Policy.” Available at: http://www.health.ri.gov/publications/policies/SmallNumbersReporting.pdf (accessed December 10, 2023).

Schmertmann C. P. and Gonzaga M. R. 2018. “Bayesian Estimation of Age-Specific Mortality and Life Expectancy for Small Areas with Defective Vital Records.” Demography 55: 1363–88. DOI: https://doi.org/10.1007/S13524-018-0695-2.

Song, Tabb L. P., and Quick. 2024. “Restricted Spatial Models for the Analysis of Geographic and Racial Disparities in the Incidence of Low Birthweight in Pennsylvania.” Spatial and Spatio-temporal Epidemiology 49: 100649. DOI: https://doi.org/10.1016/j.sste.2024.100649.

Spielman S. E. and Folch D. C. 2015. “Reducing Uncertainty in the American Community Survey Through Data-Driven Regionalization.” PLoS One 10 (2): 1–21. DOI: https://doi.org/10.1371/journal.pone.0115626.

Stein. 1956. “Inadmissibility of the Usual Estimator for the Mean of a Multivariate Normal Distribution.” In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, 197–206. Berkeley, CA: University of California Press. DOI: https://doi.org/10.1525/9780520313880-018.

United States Cancer Statistics Working Group. 2022. “United States Cancer Statistics Data Visualizations Tool Technical Notes.” U.S. Department of Health and Human Services, Centers for Disease Control and Prevention, and National Cancer Institute. Available at: https://www.cdc.gov/cancer/uscs/pdf/uscs-data-visualizations-tool-technical-notes-508.pdf (accessed December 10, 2023).

Utah Department of Health, Data Suppression Decision Rules Work Group. 2009. “Report of Guidelines for Data Result Suppression.” Available at: https://ibis.health.utah.gov/ibisph-view/pdf/resource/DataSuppression.pdf (accessed December 10, 2023).

Vaughan A. S., Schieb, Kramer M. R., Quick, Taylor, and Casper. 2019. “Changing Rate Orders of Race-Gender Heart Disease Death Rates: An Exploration of County-Level Race-Gender Disparities.” SSM - Population Health 7: 100334. DOI: https://doi.org/10.1016/j.ssmph.2018.100334.

Waller L. A., Carlin B. P., Xia, and Gelfand A. E. 1997. “Hierarchical Spatio-Temporal Mapping of Disease Rates.” Journal of the American Statistical Association 92: 607–17. DOI: https://doi.org/10.1080/01621459.1997.10474012.
