Abstract

Chambers and Dunstan proposed a model-based predictor of the population distribution function that makes use of auxiliary population information under a general sampling design. Subsequently, Rao, Kovar, and Mantel proposed design-based ratio and difference predictors of the population distribution function that also use this auxiliary information. Both predictors (CD and RKM) assume a single level model for the target population. In this article we develop predictors of the finite population distribution function for a population that follows a multilevel model. These new predictors use the same smearing approach underpinning the CD predictor. We compare our new predictors with the CD and RKM predictors via design-based simulation, and show that they perform better than these single level predictors when there is significant intra-cluster correlation. The performances of these new two level predictors are also examined via an empirical study based on data from a large-scale UK business survey aimed at estimating the distribution of hourly pay rates.

AMS Subject Classification: Primary 62G30, Secondary 62G32

Keywords

Distribution function; multilevel model; stratification; smearing approach; business survey

1. Background

Chambers and Dunstan^[4] show that traditional survey methods used to estimate the extreme finite population quantiles of a response variable $Y$ can be problematic. They instead propose that such estimates be obtained by inverting a model-based predictor of the finite population distribution function of $Y$. This predictor, which assumes that the target population follows a single level model with heteroskedastic errors, is hereafter referred to as the CD predictor. It is based on combining the smearing approach of Duan^[6] with a model for the finite population distribution function of $Y$, and is model-consistent for the finite population distribution function of this variable. These authors also show that the proposed model-based predictor provides significant gains over competing design-based estimators when there is auxiliary population information that is linearly related to the target variable of interest. Following on from this, Rao et al.^[9] define design-based predictors of the finite population distribution function (hereafter referred to as RKM predictors) based on ratio and difference approaches under a general sampling design, and show that the RKM predictor is preferable to the CD predictor under model misspecification. However, both the CD and RKM predictors assume an uncorrelated data structure, in the sense that they do not take into account the presence of intra-cluster correlation in the population. As a consequence, it is expected that they will be inefficient when there is intra-cluster correlation. Chambers and Clark^[3] argue that without the simultaneous use of auxiliary information, there is little to be gained from just assuming a two level structure (more specifically a clustered structure) when predicting the value of a finite population distribution function.
We therefore propose an alternative framework for predicting the finite population distribution function in the presence of clustering based on assuming that the population of interest is a realization of a super-population that can be modelled via a two level linear regression structure, and show that applying the Chambers and Dunstan^[4] approach in this situation leads to more efficient and robust predictors of the finite population distribution function of $Y$ .

We note that our framework also allows us to calculate small-area estimates for quantities that can be defined in terms of functionals of alternative smearing-type estimators of the small-area finite population distribution function, for example using the outlier-resistant finite population distribution function prediction approach developed in Tzavidis et al.^[10]

In what follows we use simulation and an empirical study to compare the performance of our proposal with various potential predictors, including CD and RKM, under the two level super-population model.

Let $U$ be a finite population of size $N$ consisting of $D$ domains, each with $N_{i}$ population units, and let $y_{i j}$ ($i = 1, \dots, D$; $j = 1, \dots, N_{i}$) denote the value of the response variable $Y$ for the $j$th unit of the $i$th domain. The finite population distribution function of $Y$ is then defined as

F_{N} (t) = N^{- 1} \sum_{i = 1}^{D} \sum_{j \in U_{i}} Δ (t - y_{i j}),

where $U_{i}$ is the set of population units in domain $i$ and $Δ (e)$ is a step function such that $Δ (e) = 1$ if $e \geq 0$ and $Δ (e) = 0$ otherwise. Let $X$ denote a set of $p + 1$ auxiliary variables (including the intercept) related to $Y$, whose values $x_{i j}$ are known for all units of the population. Furthermore, assume that the response variable $Y$ follows a two level super-population model

\begin{array}{l} y_{i j} = x_{i j}^{T} β + u_{i} + ϵ_{i j} \\ u_{i} \sim N (0, σ_{u}^{2}), ϵ_{i j} \sim N (0, σ_{ϵ}^{2} υ (x_{i j})) \\ i = 1,2, \dots, D; j = 1,2, \dots, N_{i} \end{array}

(1.1)

where $β$ is a vector of $(p + 1)$ unknown regression parameters, $σ_{u}^{2}$ and $σ_{ϵ}^{2}$ are unknown domain-specific and unit-specific random error variances respectively, typically referred to as variance components, and $υ (x_{i j})$ is a known and strictly positive function of $x_{i j}$ . Note that, $υ (x_{i j}) = 1$ if the unit-specific random error is homoskedastic.
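To make the definitions concrete, the following Python sketch draws one realization of model (1.1) and evaluates $F_{N} (t)$ on it. It is only an illustration under stated assumptions: NumPy is used, there is a single auxiliary variable plus an intercept, errors are homoskedastic ($υ (x_{i j}) = 1$), and all numeric settings and function names (`simulate_two_level_population`, `finite_population_cdf`) are hypothetical rather than taken from this article.

```python
import numpy as np

def finite_population_cdf(y, t):
    """F_N(t): proportion of population values y_ij with y_ij <= t."""
    y = np.asarray(y, dtype=float)
    return float(np.mean(y <= t))

def simulate_two_level_population(sizes, beta, sigma_u, sigma_e, rng):
    """Draw one realization of model (1.1) with one covariate and v(x) = 1.

    The covariate distribution below is an arbitrary illustrative choice.
    """
    sizes = np.asarray(sizes)
    domain = np.repeat(np.arange(len(sizes)), sizes)   # domain label i of each unit
    x = rng.uniform(1.0, 10.0, size=sizes.sum())       # hypothetical auxiliary variable
    X = np.column_stack([np.ones_like(x), x])          # intercept plus x
    u = rng.normal(0.0, sigma_u, size=len(sizes))      # domain-specific errors u_i
    eps = rng.normal(0.0, sigma_e, size=sizes.sum())   # unit-specific errors
    y = X @ np.asarray(beta, dtype=float) + u[domain] + eps
    return domain, X, y

rng = np.random.default_rng(0)
domain, X, y = simulate_two_level_population([50, 80, 70], [2.0, 1.5], 1.0, 0.5, rng)
F_median = finite_population_cdf(y, np.median(y))      # close to 0.5 by construction
```

Evaluating `finite_population_cdf` over a grid of values of $t$ traces out the whole finite population distribution function, which can then be inverted to obtain population quantiles.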

2. Predicting $F_{N} (t)$ Under a Two Level Model

2.1 Model-based Approach

The distribution function $F_{N} (t)$ can be decomposed into sampled ( $s$ ) and non-sampled ( $r$ ) parts as

\begin{array}{l} F_{N} (t) = N^{- 1} [\sum_{i = 1}^{D} \sum_{j \in s_{i}} Δ (t - y_{i j}) + \sum_{l = 1}^{D} \sum_{k \in r_{l}} Δ (t - y_{l k})], \end{array}

(2.1)

where $s_{i}$ and $r_{i}$ indicate the set of sampled and non-sampled units of domain $i$ , respectively. If we assume that the variance components $σ_{u}^{2}$ and $σ_{ϵ}^{2}$ are known, a predictor of $F_{N} (t)$ can be defined following the smearing approach described in Chambers and Dunstan^[4], from now on referred to as a global smearing approach. This leads to the GSA predictor

F_{N}^{*} (t) = N^{- 1} [n {\hat{F}}_{s} (t) + (N - n) {\hat{F}}_{r}^{G} (t, \hat{β}, \tilde{u})],

where

{\hat{F}}_{s} (t) = n^{- 1} \sum_{i = 1}^{D} \sum_{j \in s_{i}} Δ (t - y_{i j}) and

{\hat{F}}_{r}^{G} (t, \hat{β}, \tilde{u}) = (N - n)^{- 1} \sum_{l = 1}^{D} \sum_{k \in r_{l}} n^{- 1} \sum_{i = 1}^{D} \sum_{j \in s_{i}} Δ (\frac{t - x_{l k}^{T} \hat{β} - {\tilde{u}}_{l}}{υ (x_{l k})} - \frac{y_{i j} - x_{i j}^{T} \hat{β} - {\tilde{u}}_{i}}{υ (x_{i j})}),

(2.2)

where $\hat{β}$ is the vector of estimated regression coefficients and $\tilde{u} = \{{\tilde{u}}_{1}, \dots, {\tilde{u}}_{D}\}$ is the best linear unbiased predictor (BLUP) of $u = \{u_{1}, \dots, u_{D}\}$. An alternative to global smearing is local smearing, in which only the sample residuals from domain $l$ are used to smear the non-sampled units of that domain. This leads to the LSA predictor, whose non-sample component is

{\hat{F}}_{r}^{L} (t, \hat{β}, \tilde{u}) = (N - n)^{- 1} \sum_{l = 1}^{D} \sum_{k \in r_{l}} n_{l}^{- 1} \sum_{j \in s_{l}} Δ (\frac{t - x_{l k}^{T} \hat{β} - {\tilde{u}}_{l}}{υ (x_{l k})} - \frac{y_{l j} - x_{l j}^{T} \hat{β} - {\tilde{u}}_{l}}{υ (x_{l j})}).

(2.3)

One needs to know the variance components $σ_{u}^{2}$ and $σ_{ϵ}^{2}$ before one can calculate the BLUP $\tilde{u}$ used in (2.2) and (2.3). In practice these parameters will not be known. However, they can be estimated from the sample data using maximum likelihood (ML) or restricted maximum likelihood (REML), and the resulting estimates then plugged into (2.2) and (2.3), leading to the empirical versions of the GSA and LSA predictors respectively:

{\hat{F}}_{r}^{G} (t, \hat{β}, \hat{u}) = (N - n)^{- 1} \sum_{l = 1}^{D} \sum_{k \in r_{l}} n^{- 1} \sum_{i = 1}^{D} \sum_{j \in s_{i}} Δ (\frac{t - x_{l k}^{T} \hat{β} - {\hat{u}}_{l}}{υ (x_{l k})} - \frac{y_{i j} - x_{i j}^{T} \hat{β} - {\hat{u}}_{i}}{υ (x_{i j})}),

(2.4)

and

{\hat{F}}_{r}^{L} (t, \hat{β}, \hat{u}) = (N - n)^{- 1} \sum_{l = 1}^{D} \sum_{k \in r_{l}} n_{l}^{- 1} \sum_{j \in s_{l}} Δ (\frac{t - x_{l k}^{T} \hat{β} - {\hat{u}}_{l}}{υ (x_{l k})} - \frac{y_{l j} - x_{l j}^{T} \hat{β} - {\hat{u}}_{l}}{υ (x_{l j})}),

(2.5)

where $\hat{u}$ is the empirical version of $\tilde{u}$, typically referred to as the empirical BLUP, or EBLUP, of $u$. Note that under homoskedasticity, ${\hat{u}}_{l}$ can be written as ${\hat{u}}_{l} = \frac{{\hat{σ}}_{u}^{2}}{{\hat{σ}}_{u}^{2} + {\hat{σ}}_{ϵ}^{2} / n_{l}} ({\bar{y}}_{l} - {\bar{x}}_{l}^{T} \hat{β})$, where ${\bar{y}}_{l}$ and ${\bar{x}}_{l}$ are the sample means of $y$ and $x$ in domain $l$, and ${\hat{σ}}_{u}^{2}$ and ${\hat{σ}}_{ϵ}^{2}$ are REML estimators of $σ_{u}^{2}$ and $σ_{ϵ}^{2}$ respectively. The predictors based on (2.4) and (2.5) are referred to from now on as the empirical GSA (EGSA) and empirical LSA (ELSA) predictors.
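The following Python sketch illustrates how the shrinkage formula and the non-sample components (2.4) and (2.5) might be computed. It assumes NumPy, that $\hat{β}$ and the variance component estimates are already available from a mixed-model fit, that domains are coded $0, \dots, D - 1$, and that every domain containing non-sampled units also contains sampled units. The function names (`eblup_homoskedastic`, `smearing_nonsample`, `predict_cdf`) and their arguments are hypothetical, introduced here only for illustration.

```python
import numpy as np

def eblup_homoskedastic(resid_s, dom_s, D, sigma_u2, sigma_e2):
    """Shrunken domain effects from marginal residuals y_ij - x_ij' beta_hat."""
    n_l = np.bincount(dom_s, minlength=D)
    sums = np.bincount(dom_s, weights=resid_s, minlength=D)
    mean_res = np.divide(sums, n_l, out=np.zeros(D), where=n_l > 0)
    # Shrinkage factor sigma_u2 / (sigma_u2 + sigma_e2 / n_l); zero for unsampled domains.
    gamma = np.where(n_l > 0, sigma_u2 / (sigma_u2 + sigma_e2 / np.maximum(n_l, 1)), 0.0)
    return gamma * mean_res

def smearing_nonsample(t, X_r, dom_r, X_s, y_s, dom_s, beta, u,
                       v_r=None, v_s=None, local=False):
    """Non-sample part of EGSA (local=False) or ELSA (local=True)."""
    v_r = np.ones(len(X_r)) if v_r is None else v_r
    v_s = np.ones(len(y_s)) if v_s is None else v_s
    gam_s = (y_s - X_s @ beta - u[dom_s]) / v_s   # standardized sample residuals
    thr_r = (t - X_r @ beta - u[dom_r]) / v_r     # standardized thresholds, one per non-sampled unit
    if not local:
        # Global smearing: every non-sampled unit is averaged over all n residuals.
        counts = np.searchsorted(np.sort(gam_s), thr_r, side="right")
        return float(np.mean(counts / len(gam_s)))
    total = 0.0
    for l in np.unique(dom_r):
        # Local smearing: only residuals from the unit's own domain are used.
        res_l = np.sort(gam_s[dom_s == l])
        thr_l = thr_r[dom_r == l]
        total += np.sum(np.searchsorted(res_l, thr_l, side="right") / len(res_l))
    return float(total / len(thr_r))

def predict_cdf(t, y_s, n_pop, F_r):
    """Combine sample and non-sample parts: N^{-1} [n F_s(t) + (N - n) F_r]."""
    n = len(y_s)
    return float((np.sum(y_s <= t) + (n_pop - n) * F_r) / n_pop)
```

Sorting the residuals once and using `np.searchsorted` counts, for each non-sampled unit, how many residuals fall at or below its threshold. This avoids forming the full $(N - n) \times n$ array of indicator values.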

The predictors $F_{N}^{*} (t)$ based on (2.2) and (2.3) are defined by conditioning on the vector $u$ of domain-specific random errors. This allows us to treat the conditional residuals $y_{i j} - x_{i j}^{T} β - u_{i}$ as independent and identically distributed. An unconditional version of these predictors can also be developed following the specification of the usual CD predictor. This is based on the estimates of the regression coefficient $β$ and of the marginal variances of the unit level errors $e_{i j} = u_{i} + ϵ_{i j}$ that are obtained after fitting a two level regression model to the sample data. In this case the global smearing version of the predictor ${\hat{F}}_{r}^{G} (\cdot)$ can be defined as follows:

{\hat{F}}_{r}^{*} (t, \hat{β}) = (N - n)^{- 1} \sum_{l = 1}^{D} \sum_{k \in r_{l}} n^{- 1} \sum_{i = 1}^{D} \sum_{j \in s_{i}} Δ (\frac{t - x_{l k}^{T} \hat{β}}{υ^{*} (x_{l k})} - \frac{y_{i j} - x_{i j}^{T} \hat{β}}{υ^{*} (x_{i j})}),

(2.6)

where $υ^{*} (x_{i j}) = \sqrt{(σ_{u}^{2} + σ_{ϵ}^{2} υ (x_{i j})) / σ_{u}^{2}}$ . Similarly, the local smearing version of this predictor is then:

{\hat{F}}_{r}^{L} (t, {\hat{β}}^{*}) = (N - n)^{- 1} \sum_{l = 1}^{D} \sum_{k \in r_{l}} n_{l}^{- 1} \sum_{j \in s_{l}} Δ (\frac{t - x_{l k}^{T} {\hat{β}}^{*}}{υ^{*} (x_{l k})} - \frac{y_{l j} - x_{l j}^{T} {\hat{β}}^{*}}{υ^{*} (x_{l j})}).

(2.7)

Clearly empirical versions of these unconditional CD-type predictors are easily written down once we have estimates of the variance components $σ_{u}^{2}$ and $σ_{ϵ}^{2}$ . We denote these empirical versions by MGSA (2.6) and MLSA (2.7) below.

2.2 Monte Carlo Approximation to Smearing

Predictors like (2.4) and (2.5) can be computationally intensive in realistic large-scale applications such as poverty mapping, because all the sample residuals are used in the smearing method. In order to speed up calculation in this situation, Marchetti et al.^[8] propose an alternative approach to implementing smearing based on Monte Carlo (MC) simulation. Following this approach, a Monte Carlo approximation to the value of the non-sample part of $F_{N}^{*} (t)$ in the EGSA predictor is

{\hat{F}}_{r}^{*} (t, \hat{β}, \hat{u}) = (N - n)^{- 1} B^{- 1} \sum_{b = 1}^{B} \sum_{l = 1}^{D} \sum_{k \in r_{l}} Δ (\frac{t - x_{l k}^{T} \hat{β} - {\hat{u}}_{l}}{υ (x_{l k})} - {\hat{γ}}_{l k}^{(b)}),

(2.8)

where the ${\hat{γ}}_{l k}^{(b)}$ are random draws with replacement from the sample residuals ${\hat{γ}}_{i j} = υ {(x_{i j})}^{- 1} (y_{i j} - x_{i j}^{T} \hat{β} - {\hat{u}}_{i})$, $i = 1, \dots, D$ and $j \in s_{i}$. Similarly, the local smearing predictor ELSA can be approximated by drawing the ${\hat{γ}}_{l k}^{(b)}$ from the corresponding domain-specific sample residuals.
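A minimal Python sketch of the Monte Carlo approximation (2.8) is given below. It assumes NumPy and that the fitted linear predictor $x_{l k}^{T} \hat{β} + {\hat{u}}_{l}$, the scale $υ (x_{l k})$ and the standardized sample residuals have already been computed. The function name `mc_smearing_nonsample` and its arguments are hypothetical.

```python
import numpy as np

def mc_smearing_nonsample(t, lin_pred_r, v_r, gamma_s, B, rng):
    """Monte Carlo approximation to the non-sample part of the EGSA predictor.

    lin_pred_r : x_lk' beta_hat + u_hat_l for each non-sampled unit
    v_r        : v(x_lk) for each non-sampled unit
    gamma_s    : standardized sample residuals (pooled over all domains)
    B          : number of Monte Carlo replications
    """
    lin_pred_r = np.asarray(lin_pred_r, dtype=float)
    thr = (t - lin_pred_r) / np.asarray(v_r, dtype=float)
    hits = 0
    for _ in range(B):
        # One residual is drawn with replacement for every non-sampled unit.
        draws = rng.choice(gamma_s, size=thr.shape[0], replace=True)
        hits += np.count_nonzero(draws <= thr)
    return hits / (B * thr.shape[0])
```

Each replication costs $O (N - n)$ operations, so the total cost is $O (B (N - n))$ rather than the $O (n (N - n))$ of a direct evaluation of (2.4). For the local version, the same function could be applied domain by domain with `gamma_s` restricted to that domain's residuals.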

2.3 Mean Squared Error Estimation via Bootstrap

We propose two different bootstrap methods for estimating the mean squared error (MSE) of the EGSA/ELSA predictors. They are based on the non-parametric bootstrap developed by Marchetti et al.^[8] and the two level block-bootstrap procedure developed by Chambers and Chandra.^[1] We refer to the non-parametric bootstrap procedure of Marchetti et al.^[8] as MTP, while the block bootstrap procedure of Chambers and Chandra^[1] is referred to as CC. The steps of the MTP bootstrap procedure are as follows:

Step 1 Generate bootstrap population values using the estimated errors as $y_{i j}^{*} = x_{i j}^{T} \hat{β} + {\hat{u}}_{i}^{*} + {\hat{e}}_{i j}^{*}$, where ${\hat{u}}_{i}^{*}$ and ${\hat{e}}_{i j}^{*}$ are randomly sampled with replacement from the corresponding rescaled vectors of estimated random errors ${\hat{u}}_{i}$ and ${\hat{e}}_{i j}$, $i = 1, \dots, D$ and $j \in s_{i}$, and then calculate the bootstrap domain-specific target parameters $F_{N}^{M T P} (t)$ from these bootstrap population values.

Step 2 Extract a sample $s^{*}$ of size $n$ from the bootstrap population using the same sample design as that used to obtain the original sample.

Step 3 Calculate the bootstrap values of the EGSA/ELSA predictors ${\hat{F}}_{N}^{* M T P} (t)$ based on the bootstrap sample data.

Step 4 Repeat Steps 1–3 $B$ times. In the $b$th bootstrap replication, let $F_{N}^{M T P (b)} (t)$ be the quantity of interest and let $F_{N}^{* M T P (b)} (t)$ be its corresponding predicted value.

Step 5 The MTP bootstrap estimator of the MSE of the EGSA/ELSA predictor of $F_{N} (t)$ is then

m s e^{M T P} ({\hat{F}}_{N}^{* M T P} (t)) = B^{- 1} \sum_{b = 1}^{B} {(F_{N}^{* M T P (b)} (t) - F_{N}^{M T P (b)} (t))}^{2} .
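Steps 1 to 5 can be sketched in Python as follows. This is an outline under stated assumptions rather than a full implementation: NumPy is used, the estimated errors passed in are assumed to be already centred and rescaled, and the sampling design and the predictor are supplied as callables. The names `mtp_bootstrap_mse`, `draw_sample` and `predictor` are hypothetical. In practice `predictor` would refit the two level model to each bootstrap sample and return the EGSA or ELSA prediction.

```python
import numpy as np

def mtp_bootstrap_mse(t, X_pop, dom_pop, beta_hat, u_hat, e_hat,
                      draw_sample, predictor, B, rng):
    """Bootstrap MSE estimate for a predictor of F_N(t) at a single t.

    draw_sample(rng)        -> indices of sampled units (stands in for the original design)
    predictor(t, y_star, s) -> predicted F_N(t) computed from the bootstrap sample s
    u_hat, e_hat            -> rescaled level two and level one estimated errors
    """
    D = len(u_hat)
    sq_err = np.empty(B)
    for b in range(B):
        # Step 1: bootstrap population and its distribution function value at t.
        u_star = rng.choice(u_hat, size=D, replace=True)
        e_star = rng.choice(e_hat, size=len(dom_pop), replace=True)
        y_star = X_pop @ beta_hat + u_star[dom_pop] + e_star
        F_true = np.mean(y_star <= t)
        # Step 2: draw a bootstrap sample using the supplied design.
        s = draw_sample(rng)
        # Step 3: compute the predictor from the bootstrap sample.
        F_pred = predictor(t, y_star, s)
        # Step 4: store the squared prediction error of this replication.
        sq_err[b] = (F_pred - F_true) ** 2
    # Step 5: average over the B replications.
    return float(sq_err.mean())
```

Passing the design and the predictor as callables keeps the loop identical for EGSA, ELSA or any other predictor whose MSE is to be estimated.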

The CC-based MSE estimation method follows the same steps except for generation of the bootstrap population. Under this approach, the bootstrap population is generated as follows:

Step 1 Calculate sample residuals as $r_{i j} = y_{i j} - x_{i j}^{T} \hat{β} - {\hat{u}}_{i}$, $i = 1, \dots, D$, $j = 1, \dots, n_{i}$, and then calculate estimated level two residuals as the group averages ${\bar{r}}_{i} = n_{i}^{- 1} \sum_{j = 1}^{n_{i}} r_{i j}$ of the $r_{i j}$ for the $D$ groups and estimated level one residuals as $r_{i j}^{1} = r_{i j} - {\bar{r}}_{i}$.

Step 2 Obtain ${\hat{u}}_{i}^{*}$ by sampling independently with replacement from the ${\bar{r}}_{i}$ , $i = 1 \dots D$ .

Step 3 Obtain ${\hat{e}}_{i j}^{*}$ for all the individuals belonging to the $i$th group by sampling independently with replacement from the set of $r_{i j}^{1}$ values defined by the $i$th group.

Step 4 Generate the bootstrap population as $y_{i j}^{*} = x_{i j}^{T} \hat{β} + {\hat{u}}_{i}^{*} + {\hat{e}}_{i j}^{*}$.

The CC bootstrap procedure then replicates Steps 2–5 of the MTP procedure, leading to the CC bootstrap estimator of the MSE of the EGSA/ELSA predictor of $F_{N} (t)$:

m s e^{C C} ({\hat{F}}_{N}^{* C C} (t)) = B^{- 1} \sum_{b = 1}^{B} {(F_{N}^{* C C (b)} (t) - F_{N}^{C C (b)} (t))}^{2} .
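The generation of a CC bootstrap population, following the four steps as written above, might be sketched in Python as follows. The sketch assumes NumPy, domains coded $0, \dots, D - 1$, and at least one sampled unit in every domain. The function name `cc_bootstrap_population` and its arguments are hypothetical.

```python
import numpy as np

def cc_bootstrap_population(X_pop, dom_pop, beta_hat, resid_s, dom_s, rng):
    """Generate one CC bootstrap population.

    resid_s : sample residuals r_ij = y_ij - x_ij' beta_hat - u_hat_i
    dom_s   : domain label of each sampled unit
    dom_pop : domain label of each population unit
    """
    D = int(dom_pop.max()) + 1
    n_i = np.bincount(dom_s, minlength=D)
    # Step 1: level two residuals are group means; level one residuals are deviations.
    r_bar = np.bincount(dom_s, weights=resid_s, minlength=D) / n_i
    r1 = resid_s - r_bar[dom_s]
    # Step 2: resample the level two residuals across groups.
    u_star = rng.choice(r_bar, size=D, replace=True)
    # Step 3: within each group, resample that group's level one residuals.
    e_star = np.empty(len(dom_pop))
    for i in range(D):
        members = dom_pop == i
        e_star[members] = rng.choice(r1[dom_s == i], size=int(members.sum()), replace=True)
    # Step 4: assemble the bootstrap population values.
    return X_pop @ beta_hat + u_star[dom_pop] + e_star
```

The returned vector plays the role of the bootstrap population in Steps 2 to 5 of the MTP procedure. Unlike the MTP scheme, level one residuals are never mixed across groups here.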

3. Numerical Evaluations

This section uses design-based simulation to illustrate the performances of the EGSA and ELSA smearing predictors of the finite population distribution function (2.1). These predictors are compared with their marginal versions MGSA (2.6) and MLSA (2.7) as well as with the single level predictors proposed in Chambers and Dunstan^[4] and Rao et al.,^[9] and with the unweighted (Dir) and the sample weighted (WDir) direct estimators given in Chambers and Clark^[3]. In the two level CD-type estimators EGSA and ELSA, ${\hat{u}}_{i}$ is assumed to be an asymptotically consistent predictor of the group-specific random effect. However, as is well known, these predictors are subject to shrinkage. As an alternative we therefore also consider implementations of EGSA and ELSA based on unshrunken versions of the predictors of the group-specific random effects (Chambers et al.^[2]). These are referred to as UGSA and ULSA respectively below. The aim here is to reduce the bias due to the shrinkage effect in EGSA and ELSA (Chambers et al.^[2]).

The simulation is based on a population of 338 sugar cane farms corresponding to a sample of these farms obtained in a 1982 survey of the Queensland sugar cane industry; for details see Chambers and Dunstan^[4]. The unplanned domains in this case are the four cane-growing regions of Queensland, with three response variables used for the simulation: (a) total cane harvested; (b) gross value of cane; (c) total farm expenditure. The auxiliary variable $x$ is the measure of size, the assigned area for cane planting. It is assumed that $υ (x_{i j})$ in (1.1) is $x_{i j}^{1 / 2}$. We tested the null hypothesis of zero between-region variation by computing the conditional AIC (cAIC) value proposed in Vaida and Blanchard^[11] and comparing this to the AIC value for a linear regression model without random effects. The cAIC and AIC values for the three response variables are shown in Table 1. The cAIC value for the linear mixed model is always smaller than the AIC value for the linear regression model. This indicates that the linear mixed model represents a better fit than the linear regression model that does not include random effects. A linear regression model that included region as a fixed effect was also fitted and the traditional CD estimator based on this model was then used to predict the finite population distribution functions of the three response variables. These are denoted by CD1 below. We expect CD1 to perform better than the CD predictor that does not account for between-region variability. The AIC values for the regression models with region as a fixed effect are close to the cAIC values for the corresponding random effects fits.

Table 1.

cAIC and AIC Values Computed for the three Response Variables Available in the Queensland Sugar Cane Industry Survey Data.

Response	cAIC	AIC
total cane harvested	5606.0	5681.0
gross value of cane	7766.5	7895.3
total farm expenditure	7561.6	7612.0

For each response variable, $S = 500$ samples of size $n = 30$ were drawn from the population under three different sampling designs: simple random sampling, stratified random sampling with 2 strata and proportional allocation, and stratified random sampling with 2 strata and optimal Neyman allocation based on $x$, following the approach described in Chambers and Dunstan^[4]. The stratum boundaries were the same for both methods of stratified sampling and were chosen to make the total measure of size ($x$) for each stratum as nearly equal as possible. The target parameter was the value of $F_{N} (t)$ for $t$ equal to the 10th, 25th, 50th, 75th and 90th percentiles of the finite population distribution of each of the three study variables. The performances of the various predictors in the simulation study were evaluated by computing their Mean Integrated Bias (MIB), Mean Integrated Absolute Error (MIAE) and Mean Integrated Mean Squared Error (MIMSE). The first two of these indicators measure bias related performance while the third one measures variability related performance. In order to assess the performance of the MTP and CC estimators of MSE, the simulation procedure was repeated using the same $S = 500$ samples, each with $B = 500$ bootstrap replications. The performances of the MSE estimators were also assessed using the MIB, MIAE and MIMSE indicators. Formally, the MIB, MIAE and MIMSE indicators measure performance by averaging over samples and target percentiles as follows:

\mathrm{MIB} = T^{- 1} \sum_{t = 1}^{T} S^{- 1} \sum_{s = 1}^{S} ({\hat{F}}_{N}^{s} (t) - F_{N}^{s} (t))

\mathrm{MIAE} = T^{- 1} \sum_{t = 1}^{T} S^{- 1} \sum_{s = 1}^{S} | {\hat{F}}_{N}^{s} (t) - F_{N}^{s} (t) |

\mathrm{MIMSE} = T^{- 1} \sum_{t = 1}^{T} S^{- 1} \sum_{s = 1}^{S} {({\hat{F}}_{N}^{s} (t) - F_{N}^{s} (t))}^{2}

where $t$ denotes the $10$ , $25$ , $50$ , $75$ and $90$ percentile values, with $T = 5$ .
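Given simulation output, the three indicators reduce to simple averages of prediction errors. A short Python sketch follows, assuming NumPy and that predicted and true values are stored as arrays with one row per sample and one column per target percentile. The function name `integrated_measures` is hypothetical.

```python
import numpy as np

def integrated_measures(F_hat, F_true):
    """Return MIB, MIAE and MIMSE.

    F_hat  : array of shape (S, T), predicted values by sample and percentile
    F_true : array of shape (S, T) or (T,), corresponding true values
    """
    err = np.asarray(F_hat, dtype=float) - np.asarray(F_true, dtype=float)
    # Averaging over both axes equals T^{-1} sum_t S^{-1} sum_s of each error measure.
    return {
        "MIB": float(err.mean()),
        "MIAE": float(np.abs(err).mean()),
        "MIMSE": float((err ** 2).mean()),
    }
```

Multiplying the returned values by $10^{4}$ puts them on the scale used to report the results in the tables.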

Table 2.

Simulation Values of MIB, MIAE and MIMSE for the Direct (Dir) and Sample Weighted Direct (WDir) Estimators, the Single Level Model-based Predictors CD and CD1, the Single Level Model-assisted Estimator RKM, and the Two Level Model-based Predictors MGSA & MLSA (Marginal CD Using Shrunken Random Effects), UGSA & ULSA (Conditional CD Using Unshrunken Random Effects) and EGSA & ELSA (Conditional CD Using Shrunken Random Effects) Under (i) Simple Random Sampling, (ii) Stratified Random Sampling with $2$ Strata and Proportional Allocation, and (iii) Stratified Random Sampling with $2$ Strata and Optimal Allocation. The Best Performances for Each Method of Sampling, Variable of Interest, and Performance Measure are Highlighted in Bold.

(i) Simple random sampling
	Dir	WDir	CD	CD1	RKM	MGSA	MLSA	UGSA	ULSA	EGSA	ELSA
total cane harvested
MIB ($\times 10^{4}$)	-28.00	-28.00	50.19	-16.22	12.66	27.51	-6.95	20.86	18.43	28.82	21.52
MIAE ($\times 10^{4}$)	559.60	559.60	289.97	236.67	381.01	335.76	300.11	276.11	271.72	254.96	261.18
MIMSE ($\times 10^{4}$)	52.05	52.05	13.28	9.48	23.88	17.35	14.50	12.58	12.21	10.64	11.34
gross value of cane
MIB ($\times 10^{4}$)	-32.93	-32.93	46.46	-45.21	5.69	20.94	-9.27	20.65	9.22	13.15	15.17
MIAE ($\times 10^{4}$)	544.27	544.27	261.95	247.17	389.45	322.08	268.98	252.66	241.77	224.34	231.45
MIMSE ($\times 10^{4}$)	50.19	50.19	11.26	9.59	24.60	16.27	11.75	10.72	9.77	8.33	8.99
total farm expenditure
MIB ($\times 10^{4}$)	-43.73	-43.73	-38.22	-62.59	-6.28	-63.24	-79.63	-48.85	-53.72	-44.88	-48.02
MIAE ($\times 10^{4}$)	560.00	560.00	284.98	287.75	386.48	368.04	361.68	315.94	308.02	279.63	295.12
MIMSE ($\times 10^{4}$)	51.99	51.99	12.78	13.52	24.12	20.73	20.34	15.99	15.18	12.47	13.95
(ii) Stratified random sampling with proportional allocation
total cane harvested
MIB ($\times 10^{4}$)	-30.13	-11.96	40.27	-29.62	-12.08	15.99	-21.57	6.78	3.40	14.12	6.68
MIAE ($\times 10^{4}$)	457.73	458.84	285.29	235.17	377.65	327.68	292.05	266.03	263.77	248.41	252.23
MIMSE ($\times 10^{4}$)	34.28	34.27	12.70	9.30	22.83	16.27	13.70	11.63	11.42	10.21	10.54
gross value of cane
MIB ($\times 10^{4}$)	-24.27	-6.04	30.52	-59.52	-5.14	1.73	-24.09	5.86	-4.50	-3.39	0.57
MIAE ($\times 10^{4}$)	461.07	462.62	257.58	244.09	391.05	314.93	266.42	247.90	238.22	220.82	227.85
MIMSE ($\times 10^{4}$)	34.89	34.91	10.74	9.32	24.91	15.35	11.28	10.08	9.41	8.06	8.68
total farm expenditure
MIB ($\times 10^{4}$)	-19.87	-1.99	-29.39	-60.46	-3.87	-54.25	-78.93	-49.93	-53.02	-41.82	-48.47
MIAE ($\times 10^{4}$)	458.13	459.94	280.18	282.25	377.92	359.73	348.26	304.95	298.75	275.70	286.04
MIMSE ($\times 10^{4}$)	35.57	35.64	12.28	12.97	23.27	19.57	18.92	14.94	14.30	12.02	13.12
(iii) Stratified random sampling with optimal allocation
total cane harvested
MIB ($\times 10^{4}$)	-1193.47	-1.26	99.73	-78.39	-14.37	56.02	-19.15	23.28	17.03	42.85	21.92
MIAE ($\times 10^{4}$)	1230.00	537.73	309.41	280.78	430.25	388.34	353.99	315.90	307.80	277.12	286.59
MIMSE ($\times 10^{4}$)	210.09	51.07	15.13	14.29	32.41	22.92	20.47	17.29	16.56	13.23	14.55
gross value of cane
MIB ($\times 10^{4}$)	-1186.93	3.04	77.18	-176.60	-10.30	40.85	-15.39	25.23	12.52	17.56	19.59
MIAE ($\times 10^{4}$)	1221.07	544.98	276.77	302.82	450.38	384.67	331.69	307.04	288.35	254.11	269.95
MIMSE ($\times 10^{4}$)	207.21	51.72	12.28	15.70	35.45	22.77	18.06	16.42	14.57	11.11	12.92
total farm expenditure
MIB ($\times 10^{4}$)	-1161.20	12.50	2.86	-40.78	-3.86	-24.51	-67.10	-23.76	-28.26	-10.81	-19.30
MIAE ($\times 10^{4}$)	1191.33	537.92	311.82	337.78	427.97	455.84	445.56	374.73	358.39	307.88	332.83
MIMSE ($\times 10^{4}$)	195.49	50.35	15.25	18.92	31.32	30.78	30.37	22.77	20.98	15.46	18.18

3.1 Simulation Results

The simulation performances of the different distribution function predictors for the response variables total cane harvested, gross value of cane and total farm expenditure are set out in Table 2 for simple random sampling and stratified sampling with proportional allocation and optimal allocation respectively. These results show a mixed picture for the MIB measure, with design-based methods (WDir and RKM) performing best in 5 scenarios and model-based methods (MLSA, ULSA, ELSA and CD) performing best in the remaining 4 scenarios. However, this changes when we consider performances with respect to the MIAE and MIMSE measures. Here it is clear that the EGSA predictor performs best overall. In particular, it is outright best in 6 scenarios, second best after CD1 in two scenarios and equal best with CD in the remaining scenario. Although this seems a surprising result at first, given that CD1 is based on fixed region effects while EGSA is based on random region effects, it can be explained by noting that CD1 assumes homoskedasticity in model errors while EGSA allows for heteroskedasticity in level one errors. The importance of allowing for heteroskedasticity in model specification when predicting the value of a finite population distribution function also explains why CD and CD1 perform similarly, even though the latter allows for average regional differences. It also emphasizes the fact that although the cAIC values generated by a two level model with heteroskedasticity and the AIC values generated by a homoskedastic single level model with region as a factor seem close (see Table 1), the CD1 predictor based on the latter model can perform worse than conditionally specified predictors like EGSA and ELSA based on the former model that also allow for heteroskedasticity.

The simulation performances of the MTP and CC based MSE estimators of the EGSA and ELSA predictors are set out in Table 3. Interestingly, for both simple random sampling and for stratified sampling with proportional allocation, the variations in performance shown here depend on the variable of interest, rather than the performance measure used, or the predictor whose MSE is being estimated. In particular, for the variables total cane harvested and total farm expenditure the MTP method outperforms the CC method for both EGSA and ELSA, while the reverse holds for gross value of cane, where the CC method outperforms the MTP method for EGSA and ELSA. For stratified random sampling with optimal allocation, there does not appear to be any particular trend in these results. The MTP method works well for EGSA with total farm expenditure, while the CC works well with ELSA for gross value of cane. There appears to be little to choose between the two methods of MSE estimation for the variable total cane harvested.

Table 3.

Simulation Values of MIB, MIAE and MIMSE for the MTP and CC Based Estimators of the Mean Squared Error (MSE) of the Global (EGSA.MTP & EGSA.CC) and Local (ELSA.MTP & ELSA.CC) Smearing Based Two Level CD Predictors Under (i) Simple Random Sampling, (ii) Stratified Random Sampling with $2$ Strata and Proportional Allocation, and (iii) Stratified Random Sampling with $2$ Strata and Optimal Allocation. The Best Performing Predictor/MSE Estimator Combination is Highlighted in Bold.

(i) Simple random sampling
	EGSA.MTP	ELSA.MTP	EGSA.CC	ELSA.CC
total cane harvested
MIB ($\times 10^{4}$)	-5.87	-11.00	-18.61	-28.98
MIAE ($\times 10^{4}$)	21.62	35.52	25.02	36.74
MIMSE ($\times 10^{4}$)	0.07	0.13	0.11	0.19
gross value of cane
MIB ($\times 10^{4}$)	26.10	16.00	11.38	-1.41
MIAE ($\times 10^{4}$)	47.39	38.26	44.58	28.82
MIMSE ($\times 10^{4}$)	0.31	0.18	0.23	0.13
total farm expenditure
MIB ($\times 10^{4}$)	-18.82	-26.91	-23.13	-41.58
MIAE ($\times 10^{4}$)	58.58	65.35	67.09	74.88
MIMSE ($\times 10^{4}$)	0.45	0.57	0.55	0.68
(ii) Stratified sampling with proportional allocation
total cane harvested
MIB ($\times 10^{4}$)	-18.54	-18.76	-29.22	-34.62
MIAE ($\times 10^{4}$)	21.40	34.31	29.22	35.16
MIMSE ($\times 10^{4}$)	0.09	0.14	0.15	0.21
gross value of cane
MIB ($\times 10^{4}$)	16.91	13.63	4.21	-1.75
MIAE ($\times 10^{4}$)	45.23	37.39	42.81	30.65
MIMSE ($\times 10^{4}$)	0.27	0.18	0.21	0.13
total farm expenditure
MIB ($\times 10^{4}$)	-20.06	-30.51	-23.92	-45.17
MIAE ($\times 10^{4}$)	57.25	67.42	65.36	78.07
MIMSE ($\times 10^{4}$)	0.42	0.59	0.51	0.72
(iii) Stratified random sampling with optimal allocation
total cane harvested
MIB ($\times 10^{4}$)	4.52	-3.06	-7.57	-20.27
MIAE ($\times 10^{4}$)	34.37	38.37	30.79	32.42
MIMSE ($\times 10^{4}$)	0.16	0.18	0.16	0.19
gross value of cane
MIB ($\times 10^{4}$)	16.91	13.63	13.92	-6.47
MIAE ($\times 10^{4}$)	45.23	37.39	40.22	37.00
MIMSE ($\times 10^{4}$)	0.27	0.18	0.20	0.19
total farm expenditure
MIB ($\times 10^{4}$)	-3.79	-32.14	-4.43	-43.32
MIAE ($\times 10^{4}$)	80.15	86.08	85.46	93.39
MIMSE ($\times 10^{4}$)	1.06	1.35	1.09	1.37

4. An Empirical Study: UK Business Survey Data

In this section we use data from the 2002 UK New Earnings Survey (NES2002) in an empirical study of the distribution of hourly pay rates (in pence units). Among $N = 142,999$ employees surveyed in NES2002, $n = 71,481$ were able to provide exact hourly pay rates ( $Y$ ) while the remaining $71,518$ surveyed employees could not provide their exact hourly pay rate. However, an implicit hourly pay rate ( $X_{1}$ ) can be calculated for all surveyed employees based on their total wage and total working hours. Here we assume that the $n = 71,481$ employees for whom we have both an exact hourly rate and implicit hourly rate constitute our sample, while the remaining $N - n = 71,518$ employees in the NES2002 sample constitute the non-sampled part of our target population of $N = 142,999$ employees. Our domains of interest for this study are the 76 occupation groups covered by NES2002, with employee implicit hourly pay rate ( $X_{1}$ ), sex ( $X_{2}$ ), and age-group ( $X_{3}$ ) used as explanatory variables in fitting a regression model to $Y$ . Preliminary analysis shows that the quantiles of $X_{1}$ are higher in the non-sampled employees ( $X_{1 r}$ ) compared to the sampled employees ( $X_{1 s}$ ) and all employees ( $X_{1 U}$ ). Since $Y$ and $X_{1}$ are highly correlated, the sample quantiles of $Y$ are close to those of $X_{1 s}$ as can be seen in Table 4. In order to calculate sampling weights, the sample was split into 220 strata based on the deciles of $X_{1}$ , $X_{2}$ , and $X_{3}$ (11 age-groups). These sampling weights were then used to estimate the pay rate distribution function using the weighted direct (WDir) estimator following Chambers and Clark^[3]. Finally, we see from Figure 1 that the stratum-specific and occupation-specific means of $Y$ and $X_{1}$ (denoted by $\bar{X}$ ) vary significantly in the survey data.

Linear models, with and without domain random effects, were fitted to $Y$ using $X_{1}$, $X_{2}$, $X_{3}$, and the interactions between $X_{2}$ and $X_{3}$ as fixed effects. The intra-cluster correlation (with occupation group as the cluster) was calculated to be 24%. Assuming homoskedastic level one and level two errors, the CD, RKM, EGSA, and ELSA predictors were used to obtain the estimated proportion of employees with hourly pay rates below $t = 500$, 600, 700, 800, 900 and 1,000 pence. The MTP based MSE estimation method was used to estimate the MSEs of these estimated proportions, including those based on the CD and RKM approaches. For CD and RKM, the bootstrap procedure was implemented by generating a bootstrap population based on marginal residuals, as $y_{i j}^{*} = x_{i j}^{T} \hat{β} + {\hat{e}}_{i j}^{*}$, where ${\hat{e}}_{i j}^{*}$ was randomly selected with replacement from the sample residuals ${\hat{e}}_{i j} = y_{i j} - {\hat{y}}_{i j}$.
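For reference, under a homoskedastic two level model the intra-cluster correlation is usually computed as the share of the level two variance in the total variance, $σ_{u}^{2} / (σ_{u}^{2} + σ_{ϵ}^{2})$. The short Python sketch below assumes this standard definition was the one used. The variance values in it are purely illustrative and are not the NES2002 estimates, which are not reported here.

```python
def intra_cluster_correlation(sigma_u2, sigma_e2):
    """Share of total variance due to the cluster (level two) effect."""
    return sigma_u2 / (sigma_u2 + sigma_e2)

# Illustrative values only: any pair in the ratio 24:76 gives a correlation of 0.24.
rho = intra_cluster_correlation(0.24, 0.76)
```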

Table 5 shows that the estimates obtained using the unweighted (Dir) and weighted (WDir) direct estimators are higher than those obtained using the other methods. This was expected given that the quantiles of $X_{1 s}$ are lower than those of $X_{1 r}$ (see Table 4). All the other estimators appropriately adjust for this issue, with CD making the smallest adjustments. The RKM, EGSA and ELSA predictors lead to very similar estimates in all cases. As far as MSE estimation is concerned, the CD and RKM predictors show very poor performance while the EGSA and ELSA predictors lead to significantly lower MSE estimates (see Table 5). It is of some concern that the MSE estimates for CD and RKM were considerably worse than those for the direct estimators.

Table 4.

Distributions of Employee Hourly Pay Rate $(Y)$ and Implicit Hourly Pay Rate $(X_{1})$ by Sample $(s)$ , Non-sample $(r)$ and Population $(U)$ for NES2002 Data.

Quantile
Vector	90%	75%	50%	25%	10%
$Y_{s}$	1081	800	581	480	422
$X_{1 s}$	1187	860	622	491	420
$X_{1 r}$	1614	1290	935	691	541
$X_{1 U}$	1469	1097	764	553	450

Figure 1.

Stratum and Occupation (group) Specific Mean Hourly Pay Rate $(\bar{Y})$ and Implicit Hourly Pay Rate $(\bar{X})$ for NES2002 Data.

Table 5.

Estimated values for the Distribution Function of Hourly Wage Rates (in Pence) Along with the Estimated Mean Squared Errors (MSE, $\times 10^{4}$ ) Using Direct (Dir), Weighted Direct (WDir), CD, RKM, Mixed Model CD with Global (EGSA) and Local (ELSA) Smearing Approaches Applied to NES2002 Data.

t	Dir	WDir	CD	RKM	EGSA	ELSA
Estimated distribution function value
500	0.30	0.33	0.25	0.24	0.20	0.20
600	0.52	0.52	0.43	0.40	0.36	0.36
700	0.67	0.67	0.58	0.52	0.48	0.49
800	0.75	0.75	0.68	0.60	0.59	0.59
900	0.82	0.82	0.76	0.68	0.67	0.67
1000	0.87	0.87	0.83	0.75	0.74	0.74
Estimated MSE ( $\times 10^{4}$ )
500	1.72	1.91	56.14	27.96	0.75	0.48
600	1.87	1.87	87.18	30.22	0.46	0.53
700	1.76	1.76	100.03	28.91	0.71	0.56
800	1.62	1.62	99.15	28.15	0.76	0.62
900	1.45	1.45	92.15	31.67	0.89	0.80
1000	1.27	1.27	82.06	35.40	0.95	0.91

5. Concluding Remarks

In this article we extend the Chambers and Dunstan^[4] approach to prediction of a finite population distribution function given two level clustered population data. Our simulation results and our empirical study provide evidence that two level smearing-based extensions of this approach (EGSA and ELSA) perform well when the population data exhibit significant intra-cluster correlation. We also develop two bootstrap methods for estimating the MSE of the predictors that we propose. These clearly show that ignoring intra-cluster correlation (a feature of single level predictors like CD and RKM) can lead to serious error when predicting the value of the population distribution function. The conditional two level smearing predictors EGSA and ELSA that we develop outperform their unconditional alternatives MGSA and MLSA as well as conditional versions based on unshrunken random effects (UGSA and ULSA). Two important advantages of the two level prediction methods that we propose in this article are that they can be easily extended to prediction of small area distribution functions and associated functionals (e.g., small area poverty indicators), and to outlier robust inference following the approach described in Welsh and Ronchetti.^[12] In further research, we aim to develop analytical MSE estimators of the EGSA and ELSA predictors in order to further evaluate the performances of the bootstrap-based MSE estimators introduced in this article. We also note that non-parametric versions of the EGSA and ELSA predictors can be developed following Chambers et al.^[5] and Kuk and Welsh.^[7]

Footnotes

Acknowledgement

We would like to acknowledge the contribution of the National Institute for Applied Statistics Research Australia, University of Wollongong, in providing access to its high performance computing facility. The work of Salvati was supported by the program Progetto di Ricerca di Ateneo: From survey-based to register-based statistics: a paradigm shift using latent variable models (grant PRA2018-9).

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Funding

The authors received no financial support for the research, authorship and/or publication of this article.

ORCID iD

Sumonkanti Das

References

1. Chambers and Chandra. A random effect block bootstrap for clustered data. J Comput Graph Stat 2013 22: 452–470.

2. Chambers, Chandra and Tzavidis. On bias-robust mean squared error estimation for pseudo-linear small area estimators. Surv Meth 2011 37: 153–170.

3. Chambers and Clark. An introduction to model-based survey sampling with applications. Oxford University Press 2012.

4. Chambers and Dunstan. Estimating distribution functions from survey data. Biometrika 1986 73: 597–604.

5. Chambers, Dorfman and Wehrly TE. Bias robust estimation in finite populations using nonparametric calibration. J Amer Stat Assoc 1993 88: 268–277.

6. Duan. Smearing estimate: A nonparametric retransformation method. J Amer Stat Assoc 1983 78: 605–610.

7. Kuk and Welsh. Robust estimation for finite populations based on a working model. J Royal Stat Soc, Series B 2001 63: 277–292.

8. Marchetti, Tzavidis and Pratesi. Non-parametric bootstrap mean squared error estimation for m-quantile estimators of small area averages, quantiles and poverty indicators. Comput Stat Data Anal 2012 56: 2889–2902.

9. Rao, Kovar and Mantel. On estimating distribution functions and quantiles from survey data using auxiliary information. Biometrika 1990 77: 365–375.

10. Tzavidis, Marchetti and Chambers. Robust prediction of small area means and distributions. Aust New Zeal J Statist 2010 52: 167–186.

11. Vaida and Blanchard. Conditional Akaike information for mixed-effects models. Biometrika 2005 92: 351–370.

12. Welsh and Ronchetti. Bias-calibrated estimation from sample surveys containing outliers. J Royal Stat Soc, Series B 1998 60: 413–428.

Predicting the Finite Population Distribution Function under a Multilevel Model
