1. Sampling
National statistical offices (NSOs) conduct surveys to serve several key objectives for the public good: social and demographic insights, labor market analysis, census and population studies, environmental and agricultural statistics, and business and industry analysis, among many others. Probability surveys form the basis for such estimations. NSOs maintain sampling frames, within which samples of statistical units can be selected with controlled probabilities and then surveyed, allowing the production of objective, design-unbiased estimates.
In his seminal paper, Neyman (1934) introduced stratified random sampling and laid the foundations for design-based inference. Since then, a very large number of sampling algorithms have been proposed; a recent inventory can be found in Tillé (2006). Among these algorithms, the cube method (Deville and Tillé 2004) is certainly the finest technical innovation of the last twenty-five years. It enables the selection of (approximately) balanced samples, that is, samples that guarantee an exact estimate of the totals of auxiliary variables known for all units in the sampling frame. With balanced sampling, the variance depends only on the residuals of the regression of the variable of interest on the balancing variables, which can lead to very significant reductions in variance.
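The balancing property can be illustrated with the simplest exactly balanced design: stratified simple random sampling is balanced on the stratum indicator variables, since the HT estimate of each stratum size reproduces the true size whatever units are drawn. The sketch below (a minimal illustration with a hypothetical three-stratum population, not the cube method itself) makes this concrete:

```python
import random

def stratified_srs(strata_sizes, sample_sizes, seed=0):
    """Draw a simple random sample within each stratum; return, for each
    sampled unit, its stratum index and inclusion probability."""
    rng = random.Random(seed)
    sample = []
    for h, (N_h, n_h) in enumerate(zip(strata_sizes, sample_sizes)):
        for _ in rng.sample(range(N_h), n_h):  # SRS without replacement
            sample.append((h, n_h / N_h))      # pi_k = n_h / N_h in stratum h
    return sample

# Hypothetical population with three strata.
strata_sizes, sample_sizes = [100, 50, 25], [10, 10, 5]
sample = stratified_srs(strata_sizes, sample_sizes)

# HT estimates of the stratum sizes (totals of the stratum indicators):
# they match the true sizes exactly, regardless of which units were drawn.
ht_sizes = [sum(1 / pi for h, pi in sample if h == s) for s in range(3)]
print(ht_sizes)  # → [100.0, 50.0, 25.0]
```

The cube method generalizes this idea to an arbitrary set of quantitative balancing variables, for which exact balancing is usually only approximately attainable.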
Balancing is generally degraded by non-response, which makes the technique most attractive at a first stage of sampling, where this problem is generally absent. The method is commonly used by INSEE (the French NSO) for the population census (Durr and Dumais 2002) and for the selection of the master sample for household surveys (Costa et al. 2018). On the other hand, balanced sampling seems to be little used by other institutes, apart from a few experiments (Biggeri and Falorsi 2006; Gismondi 2007; Jocelyn 2018). In a global context where it is becoming more difficult to maintain the same sample sizes as in past surveys, I am convinced that balanced sampling is an essential tool for a first stage of sampling, reducing survey costs while maintaining the quality of statistical estimates (Chipperfield 2009).
2. Statistical Properties of Design and Estimators
Deville and Tillé (2004) proposed variance approximations for balanced sampling under the assumption of large entropy. However, the theoretical properties of the algorithm have yet to be fully explored. Chauvet (2014) provided a proof of the mean-square convergence of the Horvitz-Thompson (HT) estimator for martingale sampling algorithms, which applies in particular to the cube method. The asymptotic normality of the HT estimator has not been demonstrated, apart from the special case of the pivotal method (Chauvet and Le Gleut 2021).
More generally, the convergence of the HT estimator and its asymptotic normality are important properties for a design, ensuring that confidence intervals based on normality can be used. It is often relatively straightforward to show mean-square convergence of the HT estimator, even if there are sampling designs for which this property does not hold (e.g., Chauvet 2022). Asymptotic normality is trickier to establish, and has been demonstrated on a case-by-case basis for certain unequal probability sampling methods (see e.g., Chauvet and Le Gleut 2021, for a review). It has also been demonstrated for multi-stage sampling designs, which are more widely used in practice (Chauvet and Vallée 2020; Krewski and Rao 1981; Ohlsson 1989), and for two-phase sampling designs (Chen and Rao 2007). Whether from a practical or theoretical point of view, it is important to continue extending this type of result to cover more of the designs used in survey practice.
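Both properties are easy to observe empirically in the simplest setting. The following sketch (a Monte Carlo check under simple random sampling without replacement, with a hypothetical population) verifies the design-unbiasedness of the HT estimator, whose distribution over repeated samples is also approximately normal in this case:

```python
import random
import statistics

# Hypothetical finite population and its true total.
rng = random.Random(42)
y = [rng.gauss(50, 10) for _ in range(1000)]
N, n = len(y), 100
t_true = sum(y)

# Repeatedly draw an SRS without replacement and compute the HT estimator:
# pi_k = n/N for every unit, so t_hat = (N/n) * (sample sum).
estimates = []
for _ in range(2000):
    s = rng.sample(range(N), n)
    estimates.append(N / n * sum(y[k] for k in s))

mean_hat = statistics.mean(estimates)
print(abs(mean_hat - t_true) / t_true)  # relative bias close to 0
```

For designs such as the cube method, the open question is precisely whether a central limit theorem of this kind can be established in general, so that normality-based confidence intervals are justified.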
While household and social surveys are based on finite population sampling, environmental surveys like National forest inventories (NFIs) are based on continuous population sampling. It is common practice in NFIs to randomly select a sample of points in a continuum, and then to define fixed-shape supports (e.g., plots or polygons) from these points to perform the field survey (e.g., Vidal et al. 2016). Although the sampling design may be formalized in several manners (e.g., Eriksson 1995), the infinite population approach (e.g., Stevens and Urquhart 2000) is arguably the simplest device for inference. It consists in transforming a variable of interest defined on the population of trees into a local synthetic variable defined on any point of the territory, with the same (integral) total. Inference may be performed directly from the sampled population, which is straightforward by using the theory of continuous Horvitz-Thompson (HT) estimation (Cordy 1993) both in terms of point estimation and variance estimation (Chauvet et al. 2023).
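Using notation adapted from the cited references (the symbols here are my own shorthand), the local-variable construction and the continuous HT estimator can be written compactly:

```latex
% Local variable attached to the trees: A_k \subset B is the inclusion zone
% of tree k, i.e., the set of points u whose field plot would include tree k.
y(u) = \sum_{k \in U} \frac{y_k}{|A_k|} \, \mathbf{1}\{u \in A_k\},
\qquad
\int_{B} y(u) \, \mathrm{d}u = \sum_{k \in U} y_k .

% Continuous HT estimator for sample points u_1, \dots, u_n selected with
% inclusion density \pi(u) > 0 wherever y(u) \neq 0 (Cordy 1993):
\hat{t}_y = \sum_{i=1}^{n} \frac{y(u_i)}{\pi(u_i)} .
```

The first identity is what makes the device attractive: inference about the tree-level total reduces to inference about the integral of a function defined on the whole territory.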
For this type of design, it is also important to study the limiting properties of the HT estimator. Such properties have received little attention in the literature, with the exception of Barabesi and Franceschi (2011) and Barabesi et al. (2012), who prove the consistency of the HT estimator and derive a central limit theorem under one-per-stratum sampling (a.k.a. tessellation stratified sampling, or unaligned systematic sampling). However, a Hölder condition on the local variable is needed, which does not generally hold for the variables measured at the tree level in forest inventories. Similar results under weaker assumptions and for more general NFI designs are needed.
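Tessellation stratified sampling itself is simple to simulate: partition the territory into a grid and draw one uniform point per cell. The sketch below (a toy comparison on the unit square, with a hypothetical smooth local variable) shows the variance reduction it brings over purely independent uniform points, the effect that the cited central limit theorems formalize:

```python
import random

def smooth_y(x, y):
    """Hypothetical smooth 'local variable' on the unit square."""
    return (x - 0.5) ** 2 + (y - 0.5) ** 2

def ht_iid(rng, n):
    # n iid uniform points: inclusion density pi(u) = n on a unit-area region,
    # so the HT estimator of the integral is the sample mean of y(u).
    return sum(smooth_y(rng.random(), rng.random()) for _ in range(n)) / n

def ht_tessellation(rng, m):
    # One uniform point per cell of an m x m grid (n = m*m, pi(u) = n).
    n = m * m
    total = 0.0
    for i in range(m):
        for j in range(m):
            total += smooth_y((i + rng.random()) / m, (j + rng.random()) / m)
    return total / n

def var(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

rng = random.Random(7)
iid = [ht_iid(rng, 100) for _ in range(500)]
tess = [ht_tessellation(rng, 10) for _ in range(500)]

print(var(tess) < var(iid))  # → True: stratification reduces the variance
```

For such a smooth integrand the gain is large; the open theoretical question is the behavior of these estimators when the local variable is irregular, as in forest inventories.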
3. Mixed-Mode Surveys
Survey protocols involving several collection modes (online, telephone, or face-to-face) are becoming increasingly common among NSOs, and at INSEE in particular. This trend has accelerated since 2020 with the COVID-19 pandemic. The use of a mixed-mode survey improves survey coverage and encourages participation (Schouten et al. 2021). However, mixed-mode surveys are primarily used to reduce survey costs. With this in mind, the most commonly recommended type of protocol is sequential: the least expensive mode is offered first (often the Internet), then non-respondents are asked to respond through an alternative mode (telephone, then possibly face-to-face).
Mixed-mode surveys do, however, pose significant methodological challenges. Estimates may be subject to measurement bias if, all other things being equal, two individuals responding by two different modes exhibit different average behavior on the variable of interest. Estimates can also be tainted by non-ignorable response bias, where response behavior remains related to the variables of interest even after controlling for covariates. Numerous studies have found measurement bias and/or selection bias (see e.g., Olson et al. 2021). However, there is less work on weighting methods for mixed-mode surveys that take these biases into account; see Buelens and van den Brakel (2011, 2015), Brick et al. (2022), and Yu et al. (2024). Further work is needed to define a comprehensive framework for handling selection and measurement errors, and to propose weighting methods suited to different mixed-mode protocols.
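A stylized simulation makes the measurement-bias mechanism concrete. Everything below is assumed for illustration (the population, the mode-selection propensity, and the mode effect delta are all hypothetical): even with full response, a mode-specific measurement effect shifts the unadjusted estimate, and since mode choice is correlated with the variable of interest, the shift cannot be removed by simply reweighting the modes:

```python
import random

rng = random.Random(1)
N = 10_000

# Hypothetical population: true value y, and a propensity to answer online
# that increases with y (e.g., younger or more connected respondents).
y_true = [rng.gauss(100, 15) for _ in range(N)]
p_web = [min(0.9, max(0.1, 0.5 + (y - 100) / 60)) for y in y_true]

# Sequential protocol: web first; web answers carry an (assumed) measurement
# effect delta, while the telephone follow-up measures y exactly.
delta = 5.0
reported = [y + delta if rng.random() < p else y
            for y, p in zip(y_true, p_web)]

naive = sum(reported) / N
truth = sum(y_true) / N
print(truth, naive)  # the naive mean is shifted upward by roughly delta/2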
4. Data Integration for Forest Inventories
Data integration involves combining several data sources to improve the quality of statistical estimators. It can take many forms: for example, integrating data from several probability surveys, combining probability and non-probability surveys, or using auxiliary data from other sources; see for example Kim (2022).
NFIs traditionally produce estimates averaged over time periods of five or ten years. The sampling intensity is sufficient for national or even regional estimates, but is generally insufficient for estimation at a finer scale. Improving statistical estimates from NFIs requires auxiliary data that are well correlated with the field attributes. Canopy height models derived from LiDAR (Light Detection And Ranging) and digital photogrammetry are the most effective, but are still not widely available over large areas. NASA's GEDI (Global Ecosystem Dynamics Investigation) mission has acquired LiDAR data over most of the world's forests, with a high spatial density.
Theoretical estimators have been proposed to produce estimates based on GEDI data at the scale of 1 km² cells (Qi et al. 2019; Saarela et al. 2018). However, for forest management purposes, smaller cells would be required. One difficulty in using the high-resolution (and free) GEDI data is that their spatial distribution is neither controlled nor optimized for forestry use, with highly unbalanced spatial coverage. However, the potential of these data is enormous, especially for estimating wood volume or biomass (Patterson et al. 2019).
Forests are fragile ecosystems, which have to cope with dramatic changes in environmental conditions (longer growing seasons, rising average temperatures, changes in rainfall patterns), accompanied by the appearance of new pathogens. Recent decades have seen a marked increase in the frequency and intensity of disturbances affecting forest condition. Forecasts point to an increase in the intensity and frequency of climatic disturbances (Patacca et al. 2023), which will have repercussions on the capacity of forests to provide the expected ecosystem services. Forest management is faced with this prospect and must equip itself with the means to respond, in particular by producing estimates at fine geographical levels and on short time scales. The CONIFER project was recently supported by the French Institute of Mathematics for Planet Earth (iMPT) to contribute to these important issues.
Acknowledgements
I would like to sincerely thank the editors of the Journal of Official Statistics for giving me the opportunity to share my views on research needs and methodological developments in the field of surveys.
Funding
The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The project CONIFER was supported by the French institute of Mathematics for Planet Earth (iMPT).
Received: January 28, 2025
Accepted: February 7, 2025
