Abstract

In this excellent nontechnical overview paper, Dr Newhouse clearly outlines the potential utility of geospatial data in small area estimation. In this discussion, I offer some general observations on the modelling and estimation methodologies relevant to small area estimation. Because Dr Newhouse’s paper centres on classical linear mixed model prediction, I deliberately concentrate on this methodological framework.
Let me first stress that both sound modelling principles and robust estimation methods are indispensable for producing dependable small area estimates. To illustrate, I use a prominent area-level model, also discussed by Dr Newhouse, known as the Fay-Herriot model.[1] This model, a special case of the linear mixed model, is given by the following formulation:
where m denotes the number of small areas to be combined;
To facilitate clarity in exposition, let us assume that the model parameters
Now, let us consider the more realistic scenario wherein
An insightful extension of the area-level model is the twofold subarea model, initially proposed by Fuller and Goyeneche[6] and subsequently refined by Torabi and Rao.[7] This framework has been applied in various domains, including the Small Area Income and Poverty Estimates (SAIPE) program administered by the US Census Bureau. There, it is typically used to model direct estimates for US counties (subareas), which are nested within states (areas).
Dr Newhouse aptly highlighted the diverse array of subarea-level geospatial data that may be available for analysis. Including such subarea auxiliary variables in subarea models is promising because they can capture within-area variation and may thereby improve the precision of estimates for the areas themselves. Such models can also be used to produce subarea estimates, as exemplified by the work of Fuller and Goyeneche.[6]
I agree that subarea-level geospatial data hold considerable potential for improving area-level estimates, particularly if the quality of such data is consistent across areas. Ensuring uniform quality and reliability of subarea-level geospatial data across the geography is therefore important for realizing gains in the accuracy and precision of area-level estimates.
Several pivotal considerations must be addressed before the implementation of aggregate statistics modelling, whether at the area level or subarea level. First, the accurate estimation of sampling variances
On a positive note, given that survey-weighted direct estimates serve as the modelling basis, the modelling process is comparatively simpler than the corresponding unit-level model. Moreover, EBLUP based on the area-level model generally exhibits the desirable property of design consistency. It is worth highlighting that area-level models have garnered successful application in official statistics endeavours, such as the US Census Bureau’s SAIPE program (see, for instance, Bell et al.[9]), as well as in Chilean poverty mapping initiatives (refer to Casas-Cordero et al.[10]). These instances underscore the efficacy and utility of area-level modelling approaches in real-world statistical applications.
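To make the area-level EBLUP concrete, the following is a minimal sketch under stated assumptions rather than a reproduction of any cited implementation. It assumes the standard textbook form of the Fay-Herriot model, in which the direct estimate for area i equals the true mean plus a sampling error with known variance D_i, and the true mean equals a linear predictor plus a random effect with variance A. The function names, the simulated data, and the choice of the Prasad-Rao moment estimator for A are all illustrative assumptions. The shrinkage weight A/(A + D_i) shows why the predictor tracks the direct estimate as the sampling variance shrinks.

```python
import numpy as np


def prasad_rao_A(y, X, D):
    """Moment estimator of the model variance A (Prasad-Rao type), truncated at 0."""
    m, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    resid = y - X @ (XtX_inv @ (X.T @ y))            # OLS residuals
    leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)  # diagonal of the hat matrix
    A_hat = (np.sum(resid ** 2) - np.sum(D * (1.0 - leverage))) / (m - p)
    return max(A_hat, 0.0)


def fay_herriot_blup(y, X, D, A):
    """BLUP of the area means for a given A; plugging in an estimate of A gives the EBLUP.

    y : (m,) direct estimates      X : (m, p) area-level covariates
    D : (m,) sampling variances, treated as known
    """
    w = 1.0 / (A + D)                                 # precision of each direct estimate
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))  # weighted least squares
    gamma = A / (A + D)                               # shrinkage weight on the direct estimate
    synthetic = X @ beta                              # regression-synthetic component
    return gamma * y + (1.0 - gamma) * synthetic, beta, gamma


# Hypothetical simulated data, for illustration only.
rng = np.random.default_rng(0)
m = 30
X = np.column_stack([np.ones(m), rng.normal(size=m)])
D = rng.uniform(0.2, 1.0, size=m)
theta = X @ np.array([1.0, 2.0]) + rng.normal(scale=1.0, size=m)
y = theta + rng.normal(scale=np.sqrt(D))

A_hat = prasad_rao_A(y, X, D)
eblup, beta_hat, gamma = fay_herriot_blup(y, X, D, A_hat)
```

In this sketch, areas with small D_i receive weights close to one and their predictions stay close to the direct estimates, while areas with large D_i are pulled toward the regression-synthetic component.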
The unit-level model, as delineated by Dr Newhouse, holds significant promise for enhancing small area estimation endeavours, particularly when robust unit-level auxiliary variables are available, contingent upon the validity of the model. However, it is pertinent to acknowledge that the modelling task is inherently more complex compared to its area-level counterpart. Building a reasonable working unit-level model necessitates careful consideration of various sample design features, nonresponse mechanisms, and other relevant factors.
As highlighted by Dr Newhouse, the integration of geospatial data at the primary sampling unit (PSU) or subarea level holds potential for mitigating selection bias in unit-level modelling efforts. Recent simulation studies conducted by Chen et al.[11] shed light on the efficacy of the Observed Best Predictor (OBP), as initially proposed by Jiang et al.,[12] which capitalizes on area-level auxiliary variables. These studies indicate that the OBP, exclusively utilizing area-level auxiliary variables, outperforms both the OBP and EBLUP employing solely unit-level auxiliary variables in scenarios of model misspecification, potentially stemming from the utilization of inadequate unit-level auxiliary variables.
It would be interesting to compare the performance of EBLUP using subarea-level auxiliary variables, such as those derived from geospatial data, with EBLUP using unit-level covariates, particularly when the unit-level auxiliary information is limited or weak. For instance, when unit-level auxiliary variables are sourced from outdated census data, the comparative efficacy of these two approaches warrants careful examination. Such comparisons could offer valuable insights into the relative merits and limitations of different types of auxiliary variables in unit-level small area estimation modelling.
Dr Newhouse also addressed estimation for unsampled small areas, an issue of great importance in statistical practice. A commonly used strategy is the synthetic method, which fits a model using data from sampled areas and then plugs area-specific auxiliary variables from unsampled areas into the fitted model to obtain predictions for the unsampled areas. However, when the area-specific auxiliary variables are weak, spatial models that borrow information from neighbouring sampled areas can offer a viable alternative, as elucidated by Vogt et al.[13]
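The synthetic method can be sketched in a few lines. The example below is only an illustration under simple assumptions: it uses ordinary least squares as the fitted model, and the function name and the toy numbers are hypothetical rather than drawn from any cited study. The point it makes is that a prediction for an unsampled area rests entirely on its auxiliary variables and on coefficients learned from the sampled areas.

```python
import numpy as np


def synthetic_predict(y_sampled, X_sampled, X_unsampled):
    """Fit a linear model on sampled areas and predict for unsampled areas.

    Ordinary least squares is used here purely for illustration; any model
    fitted to the sampled areas could take its place.
    """
    beta, *_ = np.linalg.lstsq(X_sampled, y_sampled, rcond=None)
    return X_unsampled @ beta, beta


# Hypothetical toy data: an intercept and one area-level covariate.
X_sampled = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y_sampled = 2.0 + 3.0 * X_sampled[:, 1]              # noise-free, for clarity
X_unsampled = np.array([[1.0, 4.0], [1.0, 5.0]])     # covariates of unsampled areas

predictions, beta_hat = synthetic_predict(y_sampled, X_sampled, X_unsampled)
```

Because no direct estimate exists for an unsampled area, the prediction has no area-specific component, which is why weak auxiliary variables translate directly into weak synthetic predictions.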
Dr Newhouse also touched upon an alternative approach proposed by Pfeffermann and Sverchkov,[14] which warrants further exploration across diverse applications. This informative sampling approach is promising, particularly in scenarios where traditional methods may fall short. Recently, Das and Lahiri[15] introduced a methodology for predicting means of unsampled small areas that relies on a synthetic assumption about the similarity of cell means across areas. These cells are formed using relevant auxiliary variables, such as demographic variables. Their proposed augmented hierarchical model incorporates available auxiliary variables at both unit and area levels, as well as survey weights as an additional auxiliary variable. Including the survey weights is expected to reduce the effects of informative sampling and nonignorable nonresponse. While their methodology was implemented using Bayesian techniques, classical implementations are feasible and merit exploration. The idea of using sampling weights in different modelling contexts to mitigate selection bias was considered earlier by Rubin[16] and Verret et al.[17]
It would be intriguing to investigate the efficacy of methods mentioned in the preceding paragraph for estimation in unsampled areas, particularly in conjunction with subarea geospatial data. The integration of these innovative methodologies with geospatial approaches holds promise for advancing the precision and reliability of small area estimation techniques, thereby facilitating more robust statistical inference in practice. Further research in this direction could yield valuable insights into the optimal strategies for addressing the challenges inherent in estimating unsampled small areas.
Indeed, pooling a large number of areas in small area estimation requires careful consideration of the exchangeability assumptions placed on the model parameters. Researchers have explored diverse methodologies to address such challenges, including random coefficient models (e.g., Hobza and Morales[18]) and random sampling variance models (e.g., Arora et al.[19]). A recent contribution by Lahiri and Salvati[20] introduces a flexible nested error regression model featuring high-dimensional area-specific regression coefficients and sampling variances. Their approach incorporates a data-driven parameter estimation method aimed at enhancing the predictive power of the resulting Empirical Best Predictor (EBP) for small areas. The flexibility of the model allows a large number of small areas to be combined and permits the use of simple parametric bootstrap methods for estimating various measures of uncertainty. While promising, further research is needed to fully understand the theoretical properties of this methodology.
The noted statistician George Box said, ‘Statisticians, like artists, have the bad habit of falling in love with their models.’ While small area researchers may indeed develop an affinity for their models, it is crucial to bear in mind another famous quote of George Box: ‘Essentially, all models are wrong, but some are useful.’ As data assume diverse forms and qualities, and new technologies continue to emerge, the small area estimation landscape remains dynamic and exciting. Maintaining an open mind in navigating different data situations and methodologies is paramount for progress in the field.
I congratulate Dr Newhouse for crafting an outstanding article that contributes significantly to the small area research community. Such contributions advance our collective understanding and propel the field forward in exciting new directions.
