Discussion

Abstract

In this excellent nontechnical overview paper, Dr Newhouse clearly outlines the potential utility of geospatial data in small area estimation. In this discussion, I offer some general observations on the modelling and estimation methodologies relevant to small area estimation. Because Dr Newhouse’s paper centres on classical linear mixed model prediction, I deliberately concentrate on this methodological framework.

Let me first underscore that both sound modelling principles and robust estimation methodologies are indispensable for generating dependable small area estimates. To illustrate this point, I invoke a prominent area-level model, also discussed by Dr Newhouse, known as the Fay-Herriot model.^[1] This model, a special case of the linear mixed model, is given by

$$y_{i} = θ_{i} + e_{i} = x_{i}^{'} β + v_{i} + e_{i}, \quad i = 1, \dots, m,$$

where $m$ denotes the number of small areas to be combined; $θ_{i}$ represents the true mean for area $i$; $y_{i}$ is the direct survey-weighted estimate of $θ_{i}$; $x_{i}$ is a $p \times 1$ vector of known auxiliary variables; $v_{i}$ is an area-specific random effect that captures the variation left unexplained by $x_{i}$; the random effects $\{v_{i}, i = 1, \dots, m\}$ and the errors $\{e_{i}, i = 1, \dots, m\}$ are independent with $v_{i} \sim N (0, A)$ and $e_{i} \sim N (0, ψ_{i})$. The regression coefficients $β$ and the variance component $A$ are generally unknown and need to be estimated. The sampling variances $ψ_{i}$ are assumed to be known. In practice, methodologies such as the generalized variance function are deployed to derive smooth estimates of $ψ_{i}$; see, for example, Hawala and Lahiri.^[2]
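To make the model concrete, the following minimal Python sketch simulates one data set from it. All numerical values (the number of areas, the coefficients, the variance component and the range of the sampling variances) are hypothetical choices made purely for illustration; none of them comes from Dr Newhouse's paper or from this discussion.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed, for reproducibility only

m = 50                                  # hypothetical number of small areas
X = np.column_stack([np.ones(m), rng.uniform(0.0, 1.0, m)])  # rows are x_i'
beta = np.array([1.0, 2.0])             # assumed regression coefficients
A = 0.5                                 # assumed variance of the random effects
psi = rng.uniform(0.2, 2.0, m)          # sampling variances, treated as known

v = rng.normal(0.0, np.sqrt(A), m)      # area-specific random effects, v_i ~ N(0, A)
e = rng.normal(0.0, np.sqrt(psi), m)    # sampling errors, e_i ~ N(0, psi_i)
theta = X @ beta + v                    # true area means
y = theta + e                           # direct survey-weighted estimates
```

In a real application only `y`, `X` and (smoothed estimates of) `psi` would be observed; `theta`, `beta` and `A` are available here only because the data are simulated.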

To facilitate clarity in exposition, let us assume that the model parameters $β$ and $A$ are known. In this scenario, one may predict $θ_{i}$ by the synthetic predictor $x_{i}^{'} β$, which is an unbiased predictor of $θ_{i}$ under the assumed area-level model. However, a notable drawback of the synthetic predictor is its complete reliance on the validity of the assumed model, which can lead to significant discrepancies from $y_{i}$ even in areas with substantial sample sizes. In essence, the synthetic predictor lacks the property of design consistency. Alternatively, we can employ the best predictor (BP) of $θ_{i}$, which is the conditional mean of $θ_{i}$ given $y = {(y_{1}, \dots, y_{m})}^{'}$ under the assumed model. The BP is given by ${\hat{θ}}_{i} = (1 - B_{i}) y_{i} + B_{i} x_{i}^{'} β$, where $B_{i} = ψ_{i} / (A + ψ_{i})$. The BP is a weighted average of $y_{i}$ and the synthetic predictor $x_{i}^{'} β$, giving more weight to $y_{i}$ for areas with small sampling variance $ψ_{i}$. The BP, unlike the synthetic predictor, is design consistent under mild regularity conditions. Evidently, if the model fails to include the area-specific random effect $v_{i}$, the BP coincides with the synthetic predictor. Consequently, the incorporation of the random effects $v_{i}$ is pivotal in ensuring the design consistency of the BP. While the synthetic predictor of $θ_{i}$ typically lacks the design consistency property, it can remain adequate provided it relies on well-selected auxiliary variables.
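The weighting behaviour of the BP can be seen in a small numerical sketch. The three areas below share the same direct estimate and the same synthetic value but differ in sampling variance; all numbers are hypothetical, and the function name `best_predictor` is introduced here only for illustration.

```python
import numpy as np

def best_predictor(y, synthetic, A, psi):
    """Return the BP and the shrinkage factors B_i = psi_i / (A + psi_i)."""
    B = psi / (A + psi)
    return (1.0 - B) * y + B * synthetic, B

y = np.array([10.0, 10.0, 10.0])        # hypothetical direct estimates y_i
synthetic = np.array([8.0, 8.0, 8.0])   # hypothetical synthetic values x_i' beta
psi = np.array([0.25, 1.0, 4.0])        # small, moderate and large sampling variance
A = 1.0                                 # assumed known variance component

theta_bp, B = best_predictor(y, synthetic, A, psi)
print(B)         # shrinkage factors: 0.2, 0.5, 0.8
print(theta_bp)  # best predictors:   9.6, 9.0, 8.4
```

The area with the smallest sampling variance stays close to its direct estimate, while the area with the largest sampling variance is pulled most of the way towards the synthetic value.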

Now, let us consider the more realistic scenario wherein $β$ and $A$ are unknown. In such instances, one may resort to the Empirical Best Linear Unbiased Predictor (EBLUP), which is derived by substituting estimates of $β$ and $A$ into the BP. A conventional approach is to use the residual maximum likelihood (REML) method for estimation. However, the REML estimate of $A$ may occasionally be zero, giving rise to the so-called overshrinkage problem, wherein the EBLUP reduces to the synthetic predictor $x_{i}^{'} \hat{β}$, with $\hat{β}$ representing the standard maximum likelihood estimator of $β$. The mean squared prediction error (MSPE) of the BP of $θ_{i}$ is given by $g_{1 i} = (1 - B_{i}) ψ_{i}$. A straightforward plug-in estimate of the MSPE of the EBLUP, wherein the variance component $A$ is replaced by its REML estimate, would be zero whenever the REML estimate of $A$ equals zero. This outcome is evidently misleading. To address this issue, researchers have explored various sophisticated methodologies, including the parametric bootstrap methods referenced by Dr Newhouse; for further details, refer to Jiang and Lahiri.^[3] Even with such advanced methods, MSPE estimates are likely to be negligible for large $m$, because the leading term in the MSPE approximation, that is, $g_{1 i} (A)$, is estimated at zero. To correct for this problem associated with the estimation of $A$, adjusted maximum likelihood methodologies have been proposed in the literature; see, for example, Li and Lahiri,^[4] Hirose and Lahiri,^[5] among others. Such methodologies yield strictly positive and consistent estimates of $A$, thereby avoiding the overshrinkage problem in the EBLUP and furnishing reasonable MSPE estimates.
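The overshrinkage problem can be reproduced in a few lines. The sketch below is illustrative only: the data are hypothetical and constructed so that the direct estimates lie very close to the regression line; the estimates of $A$ are found by a crude grid search rather than a proper optimizer; and the adjustment shown, which multiplies the residual likelihood by $A$, is assumed here as one simple form of adjusted likelihood, not as a faithful implementation of any specific cited method.

```python
import numpy as np

def residual_loglik(A, y, X, psi):
    """Residual log-likelihood of A under the Fay-Herriot model, up to a constant."""
    w = 1.0 / (A + psi)                            # diagonal of V^{-1}
    XtWX = X.T @ (w[:, None] * X)
    beta = np.linalg.solve(XtWX, X.T @ (w * y))    # GLS estimate of beta for this A
    resid = y - X @ beta
    _, logdet = np.linalg.slogdet(XtWX)
    return -0.5 * (np.sum(np.log(A + psi)) + logdet + np.sum(w * resid**2))

# Hypothetical data with far less spread than the sampling variances would predict.
m = 10
x = np.linspace(0.0, 1.0, m)
X = np.column_stack([np.ones(m), x])
psi = np.linspace(0.5, 1.5, m)
y = 1.0 + 2.0 * x + 0.05 * np.where(np.arange(m) % 2 == 0, 1.0, -1.0)

grid = np.linspace(0.0, 5.0, 5001)                 # candidate values of A, including 0
reml = np.array([residual_loglik(A, y, X, psi) for A in grid])
with np.errstate(divide="ignore"):                 # log(0) = -inf at A = 0 is intended
    adjusted = np.log(grid) + reml                 # log of (A times residual likelihood)

A_reml = grid[np.argmax(reml)]                     # boundary estimate
A_adj = grid[np.argmax(adjusted)]                  # strictly positive by construction

B_reml, B_adj = psi / (A_reml + psi), psi / (A_adj + psi)
g1_reml, g1_adj = (1.0 - B_reml) * psi, (1.0 - B_adj) * psi   # plug-in g_1i
print(A_reml, A_adj)
```

With these data the residual likelihood is maximized at the boundary $A = 0$, so every $B_{i}$ equals one, the EBLUP collapses to the synthetic predictor, and the plug-in $g_{1 i}$ is zero for every area. The adjusted criterion cannot be maximized at zero, so it returns a positive estimate of $A$ and positive plug-in values of $g_{1 i}$.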

An insightful extension of the area-level model is the twofold subarea model, initially proposed by Fuller and Goyeneche^[6] and subsequently refined by Torabi and Rao.^[7] This model framework has found application in various domains, including the Small Area Income and Poverty Estimates (SAIPE) program administered by the US Census Bureau. In this context, the model is typically employed to model direct estimates for US counties (subareas), which are hierarchically nested within states (areas).

Dr Newhouse aptly highlighted the diverse array of subarea-level geospatial data that may be available for analysis. Including such additional subarea auxiliary variables in subarea models holds promise, as these variables can capture within-area variation and thereby potentially enhance the precision of estimation for the respective areas. Moreover, models of this nature can prove instrumental in generating subarea estimates, as exemplified by the work of Fuller and Goyeneche.^[6]

I concur with the notion that geospatial data at the subarea level holds immense potential for augmenting area-level estimates, particularly if the quality of such data remains consistent across different areas. This premise underscores the importance of ensuring uniformity in the quality and reliability of subarea-level geospatial data across the geographical landscape, thereby maximizing the utility of such data in improving the accuracy and precision of area-level estimates.

Several pivotal considerations must be addressed before the implementation of aggregate statistics modelling, whether at the area level or subarea level. First, the accurate estimation of sampling variances $ψ_{i}$ is paramount. Bell^[8] extensively deliberated on the repercussions of inaccurately estimating $ψ_{i}$ on small area estimation endeavours. Typically, a smoothing technique such as the generalized variance function method is employed to address this concern. Second, it is imperative to acknowledge that area-level models inherently lack the capacity to accommodate additional variability stemming from the estimation of sampling variances. Third, it is common to model transformed direct estimates, potentially leading to bias in the final estimates upon conversion to the original scale. However, diligent implementation of winsorization and benchmarking strategies holds promise for mitigating such biases effectively.

On a positive note, given that survey-weighted direct estimates serve as the modelling basis, the modelling process is comparatively simpler than the corresponding unit-level model. Moreover, EBLUP based on the area-level model generally exhibits the desirable property of design consistency. It is worth highlighting that area-level models have garnered successful application in official statistics endeavours, such as the US Census Bureau’s SAIPE program (see, for instance, Bell et al.^[9]), as well as in Chilean poverty mapping initiatives (refer to Casas-Cordero et al.^[10]). These instances underscore the efficacy and utility of area-level modelling approaches in real-world statistical applications.

The unit-level model, as delineated by Dr Newhouse, holds significant promise for enhancing small area estimation endeavours, particularly when robust unit-level auxiliary variables are available, contingent upon the validity of the model. However, it is pertinent to acknowledge that the modelling task is inherently more complex compared to its area-level counterpart. Building a reasonable working unit-level model necessitates careful consideration of various sample design features, nonresponse mechanisms, and other relevant factors.

As highlighted by Dr Newhouse, the integration of geospatial data at the primary sampling unit (PSU) or subarea level holds potential for mitigating selection bias in unit-level modelling efforts. Recent simulation studies conducted by Chen et al.^[11] shed light on the efficacy of the Observed Best Predictor (OBP), as initially proposed by Jiang et al.,^[12] which capitalizes on area-level auxiliary variables. These studies indicate that the OBP, exclusively utilizing area-level auxiliary variables, outperforms both the OBP and EBLUP employing solely unit-level auxiliary variables in scenarios of model misspecification, potentially stemming from the utilization of inadequate unit-level auxiliary variables.

It would indeed be intriguing to explore and compare the performance of EBLUP utilizing subarea-level auxiliary variables, such as those derived from geospatial data, with EBLUP employing unit-level covariates, particularly in situations where the unit-level auxiliary information is limited or weak. For instance, when unit-level auxiliary variables are sourced from outdated census data, the comparative efficacy of these two approaches warrants careful examination. Such comparative analyses could offer valuable insights into the relative merits and limitations of leveraging different types of auxiliary variables in unit-level small area estimation modelling.

Dr Newhouse delved into a crucial aspect of estimation for unsampled small areas, an issue of paramount importance in statistical practice. A commonly employed strategy for confronting this challenge is the synthetic method, which involves fitting a model using data from sampled areas and subsequently plugging area-specific auxiliary variables from unsampled areas into the fitted model to obtain predictions for the unsampled areas. However, in instances where the area-specific auxiliary variables are weak, spatial models leveraging information from neighbouring sampled areas can offer a viable alternative, as elucidated by Vogt et al.^[13]
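The synthetic method for unsampled areas can be sketched as follows under an area-level model. The data, the value `A_hat` and the helper name `gls_beta` are hypothetical and serve only to illustrate the two steps: fit on the sampled areas, then predict from the auxiliary variables of the unsampled areas.

```python
import numpy as np

def gls_beta(y, X, A, psi):
    """Weighted least squares estimate of beta with weights 1 / (A + psi_i)."""
    w = 1.0 / (A + psi)
    return np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

# Step 1: fit the model using sampled areas only (hypothetical values).
X_sampled = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y_sampled = np.array([1.1, 2.9, 5.2, 6.8])      # direct estimates
psi_sampled = np.array([0.5, 1.0, 0.5, 1.0])    # known sampling variances
A_hat = 0.3                                     # assumed estimate of A
beta_hat = gls_beta(y_sampled, X_sampled, A_hat, psi_sampled)

# Step 2: plug the auxiliary variables of unsampled areas into the fitted model.
X_unsampled = np.array([[1.0, 1.5], [1.0, 4.0]])
theta_synthetic = X_unsampled @ beta_hat        # synthetic predictions
print(theta_synthetic)
```

Because no direct estimate exists for an unsampled area, the prediction rests entirely on the fitted regression, which is why weak auxiliary variables are a particular concern in this setting.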

Dr Newhouse also touched upon an alternative approach proposed by Pfeffermann and Sverchkov,^[14] which warrants further exploration across diverse applications. This informative sampling approach presents promising prospects for enhancing the estimation process, particularly in scenarios where traditional methods may fall short. Recently, Das and Lahiri^[15] introduced a methodology aimed at predicting means of unsampled small areas by leveraging a synthetic assumption regarding the similarity of cell means across areas. These cells are created using relevant auxiliary variables, such as demographic variables. Their proposed augmented hierarchical model incorporates available auxiliary variables at both unit and area levels, as well as survey weights as an additional auxiliary variable. This inclusion of survey weights is anticipated to mitigate the extent of informativeness in sampling and nonignorability in the nonresponse mechanism. While their methodology was implemented using Bayesian techniques, classical implementations are feasible and merit exploration. The idea of employing sampling weights to mitigate selection bias was previously considered, in different modelling contexts, by Rubin^[16] and Verret et al.^[17]

It would be intriguing to investigate the efficacy of methods mentioned in the preceding paragraph for estimation in unsampled areas, particularly in conjunction with subarea geospatial data. The integration of these innovative methodologies with geospatial approaches holds promise for advancing the precision and reliability of small area estimation techniques, thereby facilitating more robust statistical inference in practice. Further research in this direction could yield valuable insights into the optimal strategies for addressing the challenges inherent in estimating unsampled small areas.

Indeed, the complexities inherent in pooling a large number of areas in small area estimation necessitate careful consideration of various exchangeability assumptions on the model parameters. Researchers have explored diverse methodologies to address such challenges, including random coefficient models (e.g., Hobza and Morales^[18]) and random sampling variances models (e.g., Arora et al.^[19]). A recent contribution by Lahiri and Salvati^[20] introduces a flexible nested error regression model featuring high-dimensional area-specific regression coefficients and sampling variances. Their approach incorporates a data-driven parameter estimation method aimed at enhancing the predictive power of the resulting Empirical Best Predictor (EBP) for small areas. The method’s flexibility in modelling enables the pooling of a large number of small areas and facilitates the use of simple parametric bootstrap methods for estimating various measures of uncertainty. While promising, further research is warranted to fully comprehend the theoretical properties of this methodology.

The noted statistician George Box said, ‘Statisticians, like artists, have the bad habit of falling in love with their models.’ While small area researchers may indeed develop an affinity for their models, it is crucial to bear in mind another famous quote of his: ‘Essentially, all models are wrong, but some are useful.’ As data assume diverse forms and qualities, and new technologies continue to emerge, the small area estimation landscape remains dynamic and exciting. Maintaining an open mind in navigating different data situations and methodologies is paramount for progress in the field.

I congratulate Dr Newhouse for crafting an outstanding article that contributes significantly to the small area research community. Such contributions serve to advance our collective understanding and propel the field forward in exciting new directions.

References

1. Fay III and Herriot RA. Estimates of income for small places: an application of James-Stein procedures to census data. J Am Stat Assoc 1979; 74: 269–277.

2. Hawala and Lahiri. Variance Modeling for Domains. Stat Appl 2018; 16(1): 399–409.

3. Jiang and Lahiri. Mixed model prediction and small area estimation. Test 2006; 15: 1–96.

4. Li and Lahiri. Adjusted maximum likelihood method for solving small area estimation problems. J Multivariate Anal 2010; 101(4): 882–892. https://doi.org/10.1016/j.jmva.2009.10.009 (accessed May 17, 2024).

5. Hirose and Lahiri. Estimating variance of random effects to solve multiple problems simultaneously. Ann Stat 2018; 46: 1721–1741. https://doi.org/10.1214/17-AOS1600 (accessed May 17, 2024).

6. Fuller WA and Goyeneche JJ. Estimation of the state variance component [unpublished report]. 1988.

7. Torabi and Rao JNK. On small area estimation under a sub-area model. J Multivariate Anal 2014; 127: 36–55.

8. Bell WR. Examining sensitivity of small area inferences to uncertainty about sampling error variances. Proc Am Stat Assoc 2008; 327: 334 (‘Survey Research Methods’ section).

9. Bell, Basel and Maples JJ. An overview of the U.S. Census Bureau’s Small Area Income and Poverty Estimates Programme. In: Pratesi M, editor. Analysis of poverty data by small area estimation. West Sussex, UK: Wiley; 2016. pp. 349–378.

10. Casas-Cordero, Encina and Lahiri. Poverty mapping for the Chilean Comunas. In: Pratesi M, editor. Analysis of poverty data by small area estimation. Wiley Series in Survey Methodology, 2016. pp. 379–403.

11. Chen, Lahiri and Salvati. Effects of model misspecification on small area estimators. 2024. https://doi.org/10.48550/arXiv.2403.11276 (accessed May 17, 2024).

12. Jiang, Nguyen and Rao JS. Best predictive small area estimation. J Am Stat Assoc 2011; 106: 732–745.

13. Vogt, Lahiri and Munnich. Spatial prediction in small area estimation. Statistics in Transition New Series 2023; 24(3): 77–94. https://doi.org/10.59170/stattrans-2023-037 (accessed May 17, 2024).

14. Pfeffermann and Sverchkov. Small-area estimation under informative probability sampling of areas and within the selected areas. J Am Stat Assoc 2007; 102: 1427–1439.

15. Das and Lahiri. Hierarchical Bayes estimation of small area proportions using statistical linkage of disparate data sources. 2022. https://doi.org/10.48550/arXiv.2210.04980 (accessed May 17, 2024).

16. Rubin DB. An evaluation of model-dependent and probability-sampling inferences in sample surveys: Comment. J Am Stat Assoc 1983; 78(384): 803–805.

17. Verret, Rao and Hidiroglou MA. Model-based small area estimation under informative sampling. Surv Methodol 2015; 41(2): 333–348.

18. Hobza and Morales. Small area estimation under random regression coefficient models. J Stat Comput Simul 2013; 83(11): 2160–2177.

19. Arora, Lahiri and Mukherjee. Empirical Bayes estimation of finite population means from complex surveys. J Am Stat Assoc 1997; 92: 1555–1562.

20. Lahiri and Salvati. A nested error regression model with high-dimensional parameter for small area estimation. J R Stat Soc: Stat Methodol 2023; 85: 212–239. https://doi.org/10.1093/jrsssb/qkac010 (accessed May 17, 2024).