1. Sampling
National statistical offices (NSOs) conduct surveys to serve several key objectives for the public good: social and demographic insights, labor market analysis, census and population studies, environmental and agricultural statistics, and business and industry analysis, among many others. Probability surveys form the basis for such estimations. NSOs maintain sampling frames, within which samples of statistical units can be selected with controlled probabilities and then surveyed, allowing the production of objective, design-unbiased estimates.
In his seminal paper, Neyman (1934) introduced stratified random sampling and laid the foundations for design-based inference. Since then, a very large number of sampling algorithms have been proposed; a recent inventory can be found in Tillé (2006). Among these algorithms, the cube method (Deville and Tillé 2004) is certainly the finest technical innovation of the last twenty-five years. It enables the selection of (approximately) balanced samples, that is, samples that guarantee an exact estimate of the totals of auxiliary variables known for all units in the sampling frame. With balanced sampling, the variance depends only on the residuals of the regression of the variable of interest on the balancing variables, which can lead to very significant reductions in variance.
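The balancing property can be illustrated with the simplest exactly balanced design: stratified simple random sampling is balanced on the stratum indicator variables, since the HT estimate of each stratum size reproduces the true size whatever units are drawn. The sketch below (a minimal illustration with a hypothetical three-stratum population, not the cube method itself) makes this concrete:

```python
import random

def stratified_srs(strata_sizes, sample_sizes, seed=0):
    """Draw a simple random sample within each stratum; return, for each
    sampled unit, its stratum index and inclusion probability."""
    rng = random.Random(seed)
    sample = []
    for h, (N_h, n_h) in enumerate(zip(strata_sizes, sample_sizes)):
        for _ in rng.sample(range(N_h), n_h):  # SRS without replacement
            sample.append((h, n_h / N_h))      # pi_k = n_h / N_h in stratum h
    return sample

# Hypothetical population with three strata.
strata_sizes, sample_sizes = [100, 50, 25], [10, 10, 5]
sample = stratified_srs(strata_sizes, sample_sizes)

# HT estimates of the stratum sizes (totals of the stratum indicators):
# they match the true sizes exactly, regardless of which units were drawn.
ht_sizes = [sum(1 / pi for h, pi in sample if h == s) for s in range(3)]
print(ht_sizes)  # → [100.0, 50.0, 25.0]
```

The cube method generalizes this idea to an arbitrary set of quantitative balancing variables, for which exact balancing is usually only approximately attainable.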
Balancing is generally degraded by non-response, which makes the technique most attractive at a first stage of sampling, where this problem is generally absent. The method is commonly used by INSEE (the French NSO) for the population census (Durr and Dumais 2002) and for the selection of the master sample for household surveys (Costa et al. 2018). On the other hand, balanced sampling seems to be little used by other institutes, apart from a few experiments (Biggeri and Falorsi 2006; Gismondi 2007; Jocelyn 2018). In a global context where it is becoming more difficult to maintain the same sample sizes as in past surveys, I am convinced that balanced sampling is an essential tool for a first stage of sampling, reducing survey costs while maintaining the quality of statistical estimates (Chipperfield 2009).
2. Statistical Properties of Design and Estimators
Deville and Tillé (2004) proposed variance approximations for balanced sampling under the assumption of large entropy. However, the theoretical properties of the algorithm have yet to be fully explored. Chauvet (2014) provided a proof of the mean-square convergence of the Horvitz-Thompson (HT) estimator for martingale sampling algorithms, which applies in particular to the cube method. The asymptotic normality of the HT estimator has not been demonstrated, apart from the special case of the pivotal method (Chauvet and Le Gleut 2021).
More generally, the convergence of the HT estimator and its asymptotic normality are important properties for a design, ensuring that confidence intervals based on normality can be used. It is often relatively straightforward to show mean-square convergence of the HT estimator, even if there are sampling designs for which this property does not hold (e.g., Chauvet 2022). Asymptotic normality is trickier to establish, and has been demonstrated on a case-by-case basis for certain unequal probability sampling methods (see e.g., Chauvet and Le Gleut 2021, for a review). It has also been demonstrated for multi-stage sampling designs, which are more widely used in practice (Chauvet and Vallée 2020; Krewski and Rao 1981; Ohlsson 1989), and for two-phase sampling designs (Chen and Rao 2007). Whether from a practical or theoretical point of view, it is important to continue extending this type of result to cover more of the designs used in survey practice.
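Both properties are easy to observe empirically in the simplest setting. The following sketch (a Monte Carlo check under simple random sampling without replacement, with a hypothetical population) verifies the design-unbiasedness of the HT estimator, whose distribution over repeated samples is also approximately normal in this case:

```python
import random
import statistics

# Hypothetical finite population and its true total.
rng = random.Random(42)
y = [rng.gauss(50, 10) for _ in range(1000)]
N, n = len(y), 100
t_true = sum(y)

# Repeatedly draw an SRS without replacement and compute the HT estimator:
# pi_k = n/N for every unit, so t_hat = (N/n) * (sample sum).
estimates = []
for _ in range(2000):
    s = rng.sample(range(N), n)
    estimates.append(N / n * sum(y[k] for k in s))

mean_hat = statistics.mean(estimates)
print(abs(mean_hat - t_true) / t_true)  # relative bias close to 0
```

For designs such as the cube method, the open question is precisely whether a central limit theorem of this kind can be established in general, so that normality-based confidence intervals are justified.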
While household and social surveys are based on finite population sampling, environmental surveys like National forest inventories (NFIs) are based on continuous population sampling. It is common practice in NFIs to randomly select a sample of points in a continuum, and then to define fixed-shape supports (e.g., plots or polygons) from these points to perform the field survey (e.g., Vidal et al. 2016). Although the sampling design may be formalized in several manners (e.g., Eriksson 1995), the infinite population approach (e.g., Stevens and Urquhart 2000) is arguably the simplest device for inference. It consists in transforming a variable of interest defined on the population of trees into a local synthetic variable defined on any point of the territory, with the same (integral) total. Inference may be performed directly from the sampled population, which is straightforward by using the theory of continuous Horvitz-Thompson (HT) estimation (Cordy 1993) both in terms of point estimation and variance estimation (Chauvet et al. 2023).
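Using notation adapted from the cited references (the symbols here are my own shorthand), the local-variable construction and the continuous HT estimator can be written compactly:

```latex
% Local variable attached to the trees: A_k \subset B is the inclusion zone
% of tree k, i.e., the set of points u whose field plot would include tree k.
y(u) = \sum_{k \in U} \frac{y_k}{|A_k|} \, \mathbf{1}\{u \in A_k\},
\qquad
\int_{B} y(u) \, \mathrm{d}u = \sum_{k \in U} y_k .

% Continuous HT estimator for sample points u_1, \dots, u_n selected with
% inclusion density \pi(u) > 0 wherever y(u) \neq 0 (Cordy 1993):
\hat{t}_y = \sum_{i=1}^{n} \frac{y(u_i)}{\pi(u_i)} .
```

The first identity is what makes the device attractive: inference about the tree-level total reduces to inference about the integral of a function defined on the whole territory.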
For this type of design, it is also important to study the limiting properties of the HT estimator. Such properties have received little attention in the literature, with the exception of Barabesi and Franceschi (2011) and Barabesi et al. (2012), who prove the consistency of the HT estimator and derive a central limit theorem under one-per-stratum sampling (a.k.a. tessellation stratified sampling, or unaligned systematic sampling). However, a Hölder condition on the local variable is needed, which does not generally hold for the variables measured at the tree level in forest inventories. Similar results under weaker assumptions and for more general NFI designs are needed.
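Tessellation stratified sampling itself is simple to simulate: partition the territory into a grid and draw one uniform point per cell. The sketch below (a toy comparison on the unit square, with a hypothetical smooth local variable) shows the variance reduction it brings over purely independent uniform points, the effect that the cited central limit theorems formalize:

```python
import random

def smooth_y(x, y):
    """Hypothetical smooth 'local variable' on the unit square."""
    return (x - 0.5) ** 2 + (y - 0.5) ** 2

def ht_iid(rng, n):
    # n iid uniform points: inclusion density pi(u) = n on a unit-area region,
    # so the HT estimator of the integral is the sample mean of y(u).
    return sum(smooth_y(rng.random(), rng.random()) for _ in range(n)) / n

def ht_tessellation(rng, m):
    # One uniform point per cell of an m x m grid (n = m*m, pi(u) = n).
    n = m * m
    total = 0.0
    for i in range(m):
        for j in range(m):
            total += smooth_y((i + rng.random()) / m, (j + rng.random()) / m)
    return total / n

def var(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

rng = random.Random(7)
iid = [ht_iid(rng, 100) for _ in range(500)]
tess = [ht_tessellation(rng, 10) for _ in range(500)]

print(var(tess) < var(iid))  # → True: stratification reduces the variance
```

For such a smooth integrand the gain is large; the open theoretical question is the behavior of these estimators when the local variable is irregular, as in forest inventories.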
3. Mixed-Mode Surveys
Survey protocols involving several collection modes (online, telephone, or face-to-face) are becoming increasingly common among NSOs, and at INSEE in particular. This trend has accelerated since 2020 with the COVID-19 pandemic. The use of a mixed-mode survey improves survey coverage and encourages participation (Schouten et al. 2021). However, mixed-mode surveys are primarily used to reduce survey costs. With this in mind, the most commonly recommended type of protocol is sequential: the least expensive mode is offered first (often the Internet), then non-respondents are asked to respond through an alternative mode (telephone, then possibly face-to-face).
Mixed-mode surveys do, however, pose significant methodological challenges. Estimates may be subject to measurement bias if, all other things being equal, two individuals responding by two different modes exhibit different average behavior on the variable of interest. Estimates can also be tainted by non-ignorable response bias, where response behavior remains related to the variables of interest even after controlling for covariates. Numerous studies have found measurement bias and/or selection bias (see e.g., Olson et al. 2021). However, there is less work on weighting methods for mixed-mode surveys that take these biases into account; see Buelens and van den Brakel (2011, 2015), Brick et al. (2022), and Yu et al. (2024). Further work is needed to define a comprehensive framework for handling selection and measurement errors, and to propose weighting methods suited to different mixed-mode protocols.
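A stylized simulation makes the measurement-bias mechanism concrete. Everything below is assumed for illustration (the population, the mode-selection propensity, and the mode effect delta are all hypothetical): even with full response, a mode-specific measurement effect shifts the unadjusted estimate, and since mode choice is correlated with the variable of interest, the shift cannot be removed by simply reweighting the modes:

```python
import random

rng = random.Random(1)
N = 10_000

# Hypothetical population: true value y, and a propensity to answer online
# that increases with y (e.g., younger or more connected respondents).
y_true = [rng.gauss(100, 15) for _ in range(N)]
p_web = [min(0.9, max(0.1, 0.5 + (y - 100) / 60)) for y in y_true]

# Sequential protocol: web first; web answers carry an (assumed) measurement
# effect delta, while the telephone follow-up measures y exactly.
delta = 5.0
reported = [y + delta if rng.random() < p else y
            for y, p in zip(y_true, p_web)]

naive = sum(reported) / N
truth = sum(y_true) / N
print(truth, naive)  # the naive mean is shifted upward by roughly delta/2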
4. Data Integration for Forest Inventories
Data integration involves combining several data sources to improve the quality of statistical estimators. It can take many forms: for example, integrating data from several probability surveys, combining probability and non-probability surveys, or using auxiliary data from other sources; see for example Kim (2022).
NFIs traditionally produce estimates averaged over time periods of five or ten years. The sampling intensity is sufficient for national or even regional estimates, but is generally insufficient for estimation at a finer scale. Improving statistical estimates from NFIs requires auxiliary data that are well correlated with the field attributes. Canopy height models derived from LiDAR (Light Detection And Ranging) and digital photogrammetry are the most effective, but are still not widely available over large areas. NASA's GEDI (Global Ecosystem Dynamics Investigation) mission has acquired LiDAR data over most of the world's forests, with a high spatial density.
Theoretical estimators have been proposed to produce estimates based on GEDI data at the scale of 1 km² cells (Qi et al. 2019; Saarela et al. 2018). However, for forest management purposes, smaller cells would be required. One difficulty in using the high-resolution (and free) GEDI data is that their spatial distribution is neither controlled nor optimized for forestry use, with highly unbalanced spatial coverage. However, the potential of these data is enormous, especially for estimating wood volume or biomass (Patterson et al. 2019).
Forests are fragile ecosystems, which have to cope with dramatic changes in environmental conditions (longer growing seasons, rising average temperatures, changes in rainfall patterns), accompanied by the appearance of new pathogens. Recent decades have seen a marked increase in the frequency and intensity of disturbances affecting forest condition. Forecasts point to an increase in the intensity and frequency of climatic disturbances (Patacca et al. 2023), which will have repercussions on the capacity of forests to provide the expected ecosystem services. Forest management is faced with this prospect and must equip itself with the means to respond, in particular by producing estimates at fine geographical levels and on short time scales. The CONIFER project was recently supported by the French Institute of Mathematics for Planet Earth (iMPT) to contribute to these important issues.
Acknowledgements
I would like to sincerely thank the editors of the Journal of Official Statistics for giving me the opportunity to share my views on research needs and methodological developments in the field of surveys.
Funding
The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The project CONIFER was supported by the French institute of Mathematics for Planet Earth (iMPT).
Received: January 28, 2025
Accepted: February 7, 2025
