A data-dependent procedure for determining thresholds for minimum inclusion probabilities in an unequal probability sample

Abstract

Unequal probability sampling designs are often used when unit size is predictive of unit outcome. Such sampling schemes increase the precision of the estimated totals relative to other fixed-size unbiased sampling strategies by assigning inclusion probabilities proportional to unit size. Consequently, the largest units on the sampling frame have inclusion probabilities approaching one, while the smallest have inclusion probabilities approaching zero, so sampling weights vary widely. At the design stage, it may be reasonable to assume that these smallest units will have small values for outcome variables. However, any violation of this assumption can severely degrade the precision of the sample-based estimates. In practice, the influence of such units is often reduced via weight trimming (setting a threshold for the maximum sampling weight after sample selection). With multipurpose surveys, it is difficult to determine optimal thresholds appropriate for all variables. This paper introduces the Exchangeable Unit Inclusion Probability Average algorithm, a data-dependent procedure that uses clustering to determine stratum-specific thresholds for minimum inclusion probabilities, assigned before sample selection. This approach yields unbiased samples and can reduce sampling variance. We present an empirical application of this procedure to the U.S. Census Bureau’s Annual Integrated Economic Survey.
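
The general idea of flooring inclusion probabilities before selection can be illustrated with a minimal Python sketch. This is not the Exchangeable Unit Inclusion Probability Average algorithm itself: the abstract does not give its steps, so the sketch involves no clustering or strata. The function names `pps_inclusion_probs` and `apply_floor`, the unit sizes, and the floor of 0.10 are all hypothetical choices made for illustration.

```python
def pps_inclusion_probs(sizes, n):
    """Inclusion probabilities proportional to size for a sample of size n.

    Assumes positive sizes and n smaller than the number of units.
    Units whose proportional probability would exceed 1 become certainty
    units (probability 1); the rest are rescaled, repeating until stable.
    """
    probs = [0.0] * len(sizes)
    certain = set()
    while True:
        remaining = n - len(certain)
        total = sum(s for i, s in enumerate(sizes) if i not in certain)
        new_certain = set()
        for i, s in enumerate(sizes):
            if i in certain:
                probs[i] = 1.0
            else:
                probs[i] = remaining * s / total
                if probs[i] > 1.0:
                    new_certain.add(i)
        if not new_certain:
            return probs
        certain |= new_certain


def apply_floor(probs, floor):
    """Raise every inclusion probability below `floor` up to `floor`.

    Illustrative only: a single fixed floor, not a data-driven threshold.
    """
    return [max(p, floor) for p in probs]


# Hypothetical frame of five units with very unequal sizes.
sizes = [1, 2, 5, 12, 80]
probs = pps_inclusion_probs(sizes, n=2)
floored = apply_floor(probs, floor=0.10)

# Sampling weights are the inverses of the inclusion probabilities.
max_weight_before = max(1.0 / p for p in probs)
max_weight_after = max(1.0 / p for p in floored)
print(probs, max_weight_before)
print(floored, max_weight_after)
```

Because the weights are still the inverses of the probabilities actually used for selection, the Horvitz-Thompson estimator of a total remains unbiased under the floored design. The trade-off in this simple version is that raising small probabilities increases the expected sample size unless the other probabilities are rescaled.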

Keywords

Winsorization, k-means clustering, multipurpose survey, exchangeability, unequal probability sample

