
Abstract

The statistical production process of many National Institutes is increasingly exploiting the integration of administrative data and sample surveys. Administrative data are generally affected by errors; among others, under and over-coverage may introduce bias in the statistics produced. In this paper, we propose a method to make inference on the population sizes at different aggregation levels by leveraging administrative data in the presence of coverage errors. We introduce a Bayesian statistical model for integrating nonprobability (register-based) and probability samples. The use of a Bayesian model allows a natural quantification of the uncertainty of estimates through the posterior distribution of the unknown target parameters. Although the framework we discuss is quite general, we will mainly refer to the setting of the Italian Permanent Census. We first assess the model performance using simulated data closely related to the real one; then, we provide results for the real data.

Keywords

administrative data, data integration, Permanent Census, population size estimation, probability samples

1. Introduction

The use of non-survey data, especially administrative data, is becoming crucial in the statistical production system of National Statistical Institutes (NSIs). The reasons for this widespread use are discussed in Citro (2014): a data production context that leverages multiple sources is the best way to meet the user’s needs and to avoid high rates of non-response in questionnaires. A relevant framework concerning the integration of administrative and survey data is that of building statistical registers. Statistical registers allow NSIs to compute different summaries of the target population, including its size. Computing summary statistics directly from registers is an essential task of the NSIs today; the availability of register-based statistics allows the NSIs to have a sort of backbone for some core variables. Under the principles of standardization and coherence, such sources may be used in the production of all the related statistics of the NSIs. This is the case of North European countries that have produced purely register-based statistics since the 1990s (Statistics Denmark 1995; Statistics Finland 2004; Wallgren and Wallgren 2007).

Today, this framework is being adopted in many other countries (Istat 2016), and the register is widely considered a formal and reliable statistical product, where every estimate should be accompanied by its degree of uncertainty. This is a crucial point to consider; in the past, registers were assumed to be affected by negligible errors, while today, they are filtered and processed using old and new statistical methods to account for errors.

Errors may occur in different ways, as discussed by Biemer and Lyberg (2003) and Groves et al. (2011) for sample surveys and by Zhang (2012) for register-based statistics and data integration of nonprobability samples—such as administrative data. For instance, non-probability samples obtained from administrative data sources are typically affected by coverage errors. On the one hand, statistical units belonging to the target population may not be included in the sample (under-coverage); on the other hand, some units that do not belong to the target population may erroneously be included in the sample (over-coverage). For a discussion and illustration of these issues, see Wolter (1986) and Zhang (2015).

In this paper, we propose a Bayesian statistical model to make inference on the population size at different aggregation levels by leveraging administrative data and survey data in the presence of coverage errors. This model naturally accounts for different sources of error and quantifies the uncertainty of the estimates. The flexibility of the Bayesian approach has been proven crucial in managing complex models in similar contexts (see, e.g., the recent Elleouet et al. 2022). Through the simulations and analysis based on the 2018 “Italian Permanent Census” data, we show that a Bayesian approach can manage the different sources of error and adequately summarize the uncertainty through a suitable synthesis provided by the posterior distribution.

The paper is structured as follows: Section 2 describes the general statistical model. Section 3 details the motivating example, namely the 2018 Permanent Census. We empirically assess the model’s performance using a simulated dataset closely reflecting the Permanent Census data in Section 4. In Section 5, the model is applied to real data. The conclusions follow.

2. Model Setting

2.1. Notation

Let $U$ be a finite population of unknown size $N$. We denote with $A$ a non-probability sample of $U$, that is, any administrative data source; we assume $A$ to suffer from under- and over-coverage. Let $S^{u}$ and $S^{o}$ be two samples designed to assess the extent of the under-coverage and over-coverage errors in $A$, respectively. In particular, let $S^{u}$ be a simple random sample of the target population $U$, and let $S^{o}$ be a simple random sample of the administrative source $A$. We aim to estimate the target population size, $N$, by using $A$ and leveraging $S^{o}$ and $S^{u}$ to correct the coverage errors. Figure 1 schematically shows the overlap of the sources. In the figure, $S^{o}$ and $S^{u}$ do not overlap; we make this mild assumption throughout the paper.

Figure 1.

Representation of the data sources and the target population. The height of the rectangles represents the number of units included in the set; the width represents the amount of information (covariates), which is assumed to be common to all sources. $U$ is the target population, delimited by the thick solid line. $A$, delimited by the dashed line, is the administrative source. $A$ under-covers $U$, and it also includes out-of-target units (over-coverage). $S^{o}$ (gray rectangle, diagonal lines) is a sample designed to assess the extent of the over-coverage error in $A$, and it is typically contained in $A$. $S^{u}$ (gray rectangle, net) is a sample designed to assess the extent of the under-coverage error in $A$. It is assumed to include both units that belong to $A$ and units that do not.

Let $N^{A}$ be the number of individuals listed in $A$ . We denote with $N^{\bar{o}}$ the unknown number of units not over-covered by $A$ , namely the number of units listed in $A$ that belong to the target population; $| U \cap A | = N^{\bar{o}}$ . We also denote with $N^{u}$ the unknown number of population units missed by $A$ ; $| U ∖ A | = N^{u}$ . Our goal is to estimate the error-free population size $N = N^{\bar{o}} + N^{u}$ ; to do so, we leverage the two probability samples $S^{o}, S^{u}$ .

Let $n^{\bar{o} + o}$ be the size of $S^{o}$ (the height of the gray rectangle with diagonal stripes in Figure 1); by design, $n^{\bar{o} + o} \leq N^{A}$. Among the $n^{\bar{o} + o}$ units, we observe $n^{\bar{o}}$ units not over-covered by $A$, namely the in-target units, that is, units that belong to $U$ (see again Figure 1); $n^{o} = n^{\bar{o} + o} - n^{\bar{o}}$ is the observed number of over-covered units.

Similarly, we denote with $n^{\bar{u} + u} = n^{\bar{u}} + n^{u}$ the number of individuals covered by $S^{u}$ (the height of the rectangle with the gray net in Figure 1); $n^{\bar{u}}$ are those also covered by $A$, whereas $n^{u}$ is the observed number of those missed by $A$. Note that the number of units not over-covered by $A$ and the number of units not under-covered by $A$ coincide: both equal the number of in-target units listed in $A$. Thus, $n^{\bar{u}} \leq N^{\bar{o}}$.

We leverage the information contained in $S^{o}$ , $S^{u}$ to make inference on the probability of being under and over-covered by $A$ . Then, we will be able to predict $N^{\bar{o}}$ and $N^{u}$ .

One may be interested in estimating the in-target population at a specific domain level (e.g., municipality, gender, age class). Hence, we add the subscript $d$, $d = 1, \dots, D$, to denote the count at a specific domain level, for example, $N_{d}^{\bar{o}}$, $N_{d}^{u}$. We denote with bold letters the vectors of counts, for example, $\mathbf{N}^{A} = (N_{1}^{A}, \dots, N_{D}^{A})$. Table 1 helps visualize the observed data structure.

Table 1.

Example of Observed Data Structure.

$d$	$N_{d}^{A}$	$n_{d}^{\bar{o} + o}$		$n_{d}^{\bar{u} + u}$
		$n_{d}^{\bar{o}}$	$n_{d}^{o}$	$n_{d}^{\bar{u}}$	$n_{d}^{u}$
$1$	$N_{1}^{A}$	$n_{1}^{\bar{o}}$	$n_{1}^{o}$	$n_{1}^{\bar{u}}$	$n_{1}^{u}$
$2$	$N_{2}^{A}$	$n_{2}^{\bar{o}}$	$n_{2}^{o}$	−	−
$3$	$N_{3}^{A}$	−	−	$n_{3}^{\bar{u}}$	$n_{3}^{u}$
⋮	⋮	⋮	⋮	⋮	⋮
$D$	$N_{D}^{A}$	−	−	−	−

Note. The domain $d = 1$ is covered by both $S^{o}$ and $S^{u}$; thus, for $d = 1$ we observe (i) the number of units listed in $A$, $N_{1}^{A}$; (ii) the total number of units listed in $S^{o}$, $n_{1}^{\bar{o} + o} \leq N_{1}^{A}$, and its components, namely the number of units belonging to the target population correctly listed in $A$, $n_{1}^{\bar{o}}$, and the number of units that do not belong to the target population erroneously listed in $A$, $n_{1}^{o}$; and (iii) the total number of units listed in $S^{u}$, $n_{1}^{\bar{u} + u}$, and its components, namely the number of population units listed in $A$, $n_{1}^{\bar{u}}$, and the number of units missed by $A$, $n_{1}^{u}$. The domain $d = 2$ is covered only by $S^{o}$, whereas the domain $d = 3$ is covered only by $S^{u}$. Neither survey covers the domain $d = D$.
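As a minimal sketch of how the layout in Table 1 could be held in code, one may attach optional sample counts to each register count. The `DomainCounts` record and all numbers below are hypothetical and serve only as an illustration; they are not part of the paper.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class DomainCounts:
    """Observed counts for one domain d (hypothetical container)."""
    n_admin: int                                    # N_d^A
    over_sample: Optional[Tuple[int, int]] = None   # (n_d^{o-bar}, n_d^{o}) from S^o
    under_sample: Optional[Tuple[int, int]] = None  # (n_d^{u-bar}, n_d^{u}) from S^u

    def __post_init__(self):
        # S^o is drawn from A, so it cannot exceed the register count.
        if self.over_sample is not None and sum(self.over_sample) > self.n_admin:
            raise ValueError("S^o sample size exceeds N_d^A")


# Illustrative (made-up) counts mirroring the four rows of Table 1.
table = [
    DomainCounts(500, over_sample=(46, 4), under_sample=(38, 2)),  # d = 1
    DomainCounts(320, over_sample=(29, 1)),                        # d = 2
    DomainCounts(410, under_sample=(35, 3)),                       # d = 3
    DomainCounts(150),                                             # d = D
]
```

A missing sample is stored as `None`, which matches the dashes in Table 1 for domains not covered by a survey.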

2.2. Bayesian Model for Coverage Errors’ Correction

To predict the population counts at the specified domain level, one needs to learn about the probabilities of under- and over-coverage. In a Bayesian framework, this amounts to computing the joint posterior distribution of such probabilities, which are considered random. To this aim, we first explicitly state the likelihood function arising from our model and then discuss suitable prior distributions for the unknown parameters of the model.

This section describes a hierarchical Bayesian model for combining information from a non-probabilistic sample (from an administrative source) and the two random samples. In the first stage, we introduce a statistical model for describing the data-generation mechanism and deduce the resulting likelihood function; then, we model the probabilities of under- and over-coverage and specify the prior setting.

We assume that, independently for each domain $d = 1, \dots, D$ , the numbers of not over-covered units, $N_{d}^{\bar{o}}$ , can be considered as the number of successes that occur in $N_{d}^{A}$ independent Bernoulli trials:

$$N_{d}^{\bar{o}} \mid ψ_{d} \sim \mathrm{Bin}(N_{d}^{A}, 1 - ψ_{d}), \quad d = 1, \dots, D, \tag{1}$$

where $ψ_{d}$ is the probability of failure, that is, the probability of over-coverage, for domain $d$ ; more explicitly,

$$P(N_{d}^{\bar{o}} = q; ψ_{d}) = \binom{N_{d}^{A}}{q} (1 - ψ_{d})^{q} ψ_{d}^{N_{d}^{A} - q}, \quad d = 1, \dots, D; \; q = 0, 1, \dots, N_{d}^{A}. \tag{2}$$

Regarding the under-covered counts, we still assume independence among domains. However, assuming a Binomial model is not suitable in this case since we do not have a well-defined number of Bernoulli trials. Yet, one can interpret the number of units missed by $A$ as the number of failures that occur before $N_{d}^{\bar{o}}$ successes are observed. Let us denote with $φ_{d}$ the probability of failure, namely the probability of under-coverage, for domain $d$ ; thus,

$$N_{d}^{u} \mid N_{d}^{\bar{o}}, φ_{d} \sim \mathrm{NegBin}(N_{d}^{\bar{o}}, 1 - φ_{d}), \quad d = 1, \dots, D. \tag{3}$$

That is,

$$P(N_{d}^{u} = r; N_{d}^{\bar{o}}, φ_{d}) = \binom{N_{d}^{\bar{o}} + r - 1}{r} φ_{d}^{r} (1 - φ_{d})^{N_{d}^{\bar{o}}}, \quad d = 1, \dots, D; \; r = 0, 1, 2, \dots \tag{4}$$
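As a worked consequence of Equations (1) and (3), the standard Binomial and Negative Binomial means give the expected population size for fixed $ψ_{d}$ and $φ_{d}$. This moment calculation is an illustrative addition rather than a result stated in the paper:

$$\begin{aligned} E[N_{d}^{\bar{o}} \mid ψ_{d}] &= N_{d}^{A} (1 - ψ_{d}), \\ E[N_{d}^{u} \mid N_{d}^{\bar{o}}, φ_{d}] &= N_{d}^{\bar{o}} \, \frac{φ_{d}}{1 - φ_{d}}, \\ E[N_{d} \mid ψ_{d}, φ_{d}] &= N_{d}^{A} (1 - ψ_{d}) \left(1 + \frac{φ_{d}}{1 - φ_{d}}\right) = N_{d}^{A} \, \frac{1 - ψ_{d}}{1 - φ_{d}}. \end{aligned}$$

For instance, with the hypothetical values $N_{d}^{A} = 1000$, $ψ_{d} = 0.08$, and $φ_{d} = 0.03$, the expected size is $1000 \times 0.92 / 0.97 \approx 948.5$: over-coverage shrinks the register count, while under-coverage inflates it.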

We assume simple random sampling for both $S^{o}$ and $S^{u}$, and independence across domains; under such assumptions, one can combine the modeling assumptions in Equations (1) and (3) and write the likelihood function of the parameter vector $(φ = (φ_{1}, \dots, φ_{D}), ψ = (ψ_{1}, \dots, ψ_{D}))$ as

$$L(φ, ψ; N^{A}, s^{o}, s^{u}) = L(φ, ψ; s^{o}, s^{u}) \propto \prod_{d = 1}^{D} \mathrm{NegBin}(n_{d}^{u}; n_{d}^{\bar{u}}, 1 - φ_{d}) \, \mathrm{Bin}(n_{d}^{\bar{o}}; n_{d}^{\bar{o} + o}, 1 - ψ_{d}). \tag{5}$$
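The likelihood in Equation (5) can be sketched in a few lines of Python, here on the log scale for numerical stability. The function names and the example counts are hypothetical; skipping a domain's factor when a survey does not cover it (the dashes in Table 1) is an assumption of this sketch, not something the text spells out.

```python
import math


def log_binom_pmf(k, n, p):
    """log Bin(k; n, p): k successes in n trials, with 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1.0 - p))


def log_negbin_pmf(r, m, p):
    """log NegBin(r; m, p): r failures before m >= 1 successes, with 0 < p < 1."""
    return (math.lgamma(m + r) - math.lgamma(r + 1) - math.lgamma(m)
            + m * math.log(p) + r * math.log(1.0 - p))


def log_likelihood(phi, psi, over_samples, under_samples):
    """Sum the log of Equation (5) over domains.

    over_samples[d]  = (n_not_over, n_over), or None if S^o misses domain d.
    under_samples[d] = (n_covered, n_missed), or None if S^u misses domain d.
    """
    total = 0.0
    for d in range(len(phi)):
        if over_samples[d] is not None:
            n_not_over, n_over = over_samples[d]
            total += log_binom_pmf(n_not_over, n_not_over + n_over, 1.0 - psi[d])
        if under_samples[d] is not None:
            n_covered, n_missed = under_samples[d]
            total += log_negbin_pmf(n_missed, n_covered, 1.0 - phi[d])
    return total


# Made-up counts for two domains; the second is not covered by S^u.
value = log_likelihood(
    phi=[0.05, 0.04], psi=[0.10, 0.08],
    over_samples=[(46, 4), (29, 1)],
    under_samples=[(38, 2), None],
)
```

Working with `math.lgamma` avoids overflow in the binomial coefficients when the counts are large.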

It is important to stress that the hypothesis of simple random sampling for both $S^{o}$ and $S^{u}$ allows us to assume that the population model can be reproduced in the observed sample. However, it might happen that, in similar situations, one must consider more complex sample designs; in those cases, calibrating likelihood or Bayesian procedures is generally problematic, and some approximations are often unavoidable. Things are even more complicated in the presence of non-ignorable non-response patterns, which have to be considered in the sampling model; see, for example, Wang et al. (2018) for a discussion of these issues.

In the second stage, we model the probabilities of over and under-coverage as

$$\mathrm{logit}(ψ_{d}) = X_{d} β + b_{d}, \quad d = 1, \dots, D, \tag{6}$$

and

$$\mathrm{logit}(φ_{d}) = X_{d} δ + c_{d}, \quad d = 1, \dots, D, \tag{7}$$

where $X_{d}$ is the $d$-th row of the $D \times (1 + K^{'})$ design matrix $X$, with $K^{'}$ being the number of covariates included in the models, and $b_{d}$ and $c_{d}$ are random intercepts;

$$b_{d} \sim N(0, s_{b}^{2}), \quad c_{d} \sim N(0, s_{c}^{2}), \tag{8}$$

independently for each $d$ , with $s_{b}$ and $s_{c}$ unknown.
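Inverting the logit link in Equations (6) and (7) makes the mapping from linear predictor to probability explicit. The following is a worked restatement with a hypothetical numerical value, not an additional modeling assumption:

$$ψ_{d} = \frac{\exp(X_{d} β + b_{d})}{1 + \exp(X_{d} β + b_{d})}, \qquad φ_{d} = \frac{\exp(X_{d} δ + c_{d})}{1 + \exp(X_{d} δ + c_{d})}.$$

For example, a linear predictor $X_{d} β + b_{d} = -2$ corresponds to an over-coverage probability $ψ_{d} = e^{-2} / (1 + e^{-2}) \approx 0.119$.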

The probabilities of over- and under-coverage, $ψ$ and $φ$, are deterministic transformations of $(β, b)$ and $(δ, c)$, respectively; thus, the unknown parameters of interest are $β, δ, b, c, s_{b}^{2}, s_{c}^{2}$. Since the distributions for $b$ and $c$ are given by the second-level model specification, we only need to specify the joint prior distribution for the remaining unknowns. Hence, let the parameter vector of our model be

$$θ = (β, δ, s_{b}^{2}, s_{c}^{2}).$$

We assume the elements of $θ$ to be a priori independent, and we set

$$\begin{aligned} β, δ &\overset{iid}{\sim} N_{1 + K^{'}}(0, ω^{2} I), \\ s_{b}, s_{c} &\overset{iid}{\sim} \mathrm{Unif}(0, u), \end{aligned} \tag{9}$$

with $ω^{2}$ and $u$ known. The values of $ω$ and $u$ can be calibrated by considering that the range of plausible values for the coefficients of a logistic regression rarely goes outside the interval $(-5, 5)$. However, it is good practice to test the sensitivity of the results to different prior specifications.
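One way to see what a given choice of $ω$ and $u$ implies is a prior predictive check: draw parameters from the priors in Equation (9) and look at the coverage-error probabilities they induce. The sketch below is a hedged illustration; the function names and the covariate row are hypothetical, and the values $ω = 5$ and $u = 10$ are the ones reported later for the simulation study.

```python
import math
import random


def inv_logit(x):
    """Numerically stable inverse of logit(p) = log(p / (1 - p))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def draw_prior_probability(x_row, omega, u, rng):
    """One draw of a coverage-error probability under the priors in Equation (9)."""
    coefficients = [rng.gauss(0.0, omega) for _ in x_row]  # N(0, omega^2) each
    scale = rng.uniform(0.0, u)                            # s ~ Unif(0, u)
    intercept = rng.gauss(0.0, scale)                      # random intercept
    linear = sum(x * c for x, c in zip(x_row, coefficients)) + intercept
    return inv_logit(linear)


rng = random.Random(0)
# Hypothetical covariate row: intercept, one inactive and one active dummy.
draws = [draw_prior_probability([1, 0, 1], omega=5.0, u=10.0, rng=rng)
         for _ in range(2000)]
# Share of prior draws that are essentially 0 or 1.
share_extreme = sum(p < 0.01 or p > 0.99 for p in draws) / len(draws)
```

A large `share_extreme` would indicate that the prior puts much of its mass on near-degenerate probabilities, which is one reason to run the sensitivity analysis mentioned above.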

The posterior distribution, which is proportional to the product of the likelihood and the parameters’ joint prior, will be

$$π(θ \mid s^{o}, s^{u}) = π(β, s_{b}^{2}, δ, s_{c}^{2} \mid s^{o}, s^{u}) \propto L(β, s_{b}^{2}, δ, s_{c}^{2}; s^{o}, s^{u}) \, π(β) \, π(s_{b}^{2}) \, π(δ) \, π(s_{c}^{2}). \tag{10}$$

Our goal is the predictive distribution

$$\begin{aligned} P(N_{d} \mid N_{d}^{A}, s^{o}, s^{u}) &= P(N_{d}^{\bar{o}}, N_{d}^{u} \mid N_{d}^{A}, s^{o}, s^{u}) \\ &= \int_{Θ} \mathrm{NegBin}(N_{d}^{u}; N_{d}^{\bar{o}}, 1 - φ_{d}) \, \mathrm{Bin}(N_{d}^{\bar{o}}; N_{d}^{A}, 1 - ψ_{d}) \, π(θ \mid N_{d}^{A}, s^{o}, s^{u}) \, dθ. \end{aligned} \tag{11}$$

We obtain a posterior sample via the Hamiltonian Monte Carlo algorithm, implementing the model in Stan (Stan Development Team 2023).
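Given posterior draws of $(ψ_{d}, φ_{d})$, the integral in Equation (11) can be approximated by simulation: for each draw, sample $N_{d}^{\bar{o}}$ from the Binomial and $N_{d}^{u}$ from the Negative Binomial, then add them. The sketch below assumes plain Python rather than the authors' Stan code; the Beta-distributed "posterior draws" are placeholders standing in for real HMC output.

```python
import random
import statistics


def draw_population_size(n_admin, psi, phi, rng):
    """One draw of N_d = N_d^{o-bar} + N_d^{u}; requires phi < 1."""
    # Binomial step, Equation (1): each listed unit is in target w.p. 1 - psi.
    n_not_over = sum(1 for _ in range(n_admin) if rng.random() >= psi)
    # Negative Binomial step, Equation (3): count failures (missed units,
    # probability phi) until n_not_over successes have been observed.
    n_under = 0
    successes = 0
    while successes < n_not_over:
        if rng.random() < phi:
            n_under += 1
        else:
            successes += 1
    return n_not_over + n_under


def posterior_predictive(n_admin, posterior_draws, rng):
    """One predictive draw of N_d per posterior draw (psi_d, phi_d)."""
    return [draw_population_size(n_admin, psi, phi, rng)
            for psi, phi in posterior_draws]


rng = random.Random(1)
# Placeholder posterior draws (hypothetical); in practice these come from HMC.
posterior_draws = [(rng.betavariate(8, 92), rng.betavariate(3, 97))
                   for _ in range(500)]
n_draws = posterior_predictive(1000, posterior_draws, rng)
median_n = statistics.median(n_draws)
```

The resulting list `n_draws` is a sample from the predictive distribution, so summaries such as the median or an interval can be read off directly.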

3. 2018 Italian Permanent Census Data

3.1. The Motivating Example

In 2018, Italy shifted from a survey-based paradigm to a register-based one, establishing the Italian Permanent Population Census (Falorsi 2017). Population counts were produced using an integrated administrative source, the Base Register of Individuals (BRI), corrected for over- and under-coverage by two sample surveys. Righi et al. (2021) proposed, similarly to Pfeffermann et al. (2015), to estimate population counts by correcting the BRI counts with a function of the probabilities of under- and over-coverage estimated on two distinct sample surveys designed to account for coverage errors. The two samples consisted of (i) a list survey, a probabilistic sample of the BRI, and (ii) an area survey, a probabilistic sample of the target population. The list survey was designed to estimate the over-coverage, whereas the area survey was designed to assess the under-coverage. Both surveys used a two-stage sampling design, with the Italian municipalities as the primary sampling units. The secondary sampling units differed in the two samples: the households for the list sample and either addresses or enumeration areas for the area sample. A strategy proposed by Mancini and Toti (2014) consisted of applying a GLMM approach to obtain probabilities of over- and under-coverage at an individual level, conditional on individual characteristics such as gender, age, and citizenship. The major limitation of that approach was that the analytical form of the correction term, based on the plugged-in estimated weights, made it very complex to obtain an explicit expression for the accuracy evaluation of the count estimates. The analysis of the Italian municipalities proceeded separately for each Italian Region (NUTS2) and differently for municipalities with a population above or below 18,000 individuals. The aim was to estimate the population counts for each municipality at the profile level, with profiles determined by the combinations of some individual characteristics: individuals’ sex, age class, and citizenship.

Here, we consider the 2018 Permanent Census our motivating example; however, in this work, we make a simplifying assumption and consider the list and area surveys as simple random samples; see Subsection 2.2. We further discuss our simplifying assumption in the Conclusions.

3.2. An Application of the Theoretical Model

We exploit the idea of the Permanent Census to empirically assess the performance of the model introduced in Section 2. The notation introduced in Subsection 2.1 is suitable for this purpose; however, instead of considering a generic domain $d$, we need to consider cells defined by the pair $(h, q)$, with $h = 1, \dots, H$ representing the municipality of residence and $q = 1, \dots, Q$ denoting the individuals’ profile, determined by the combinations of some individual characteristics.

As covariates in the logistic regressions modeling the over- and under-coverage probabilities (Equations (6) and (7)), we use the whole set of variables that define the profiles $q$, namely sex, age class, and citizenship, with an additional variable indicating whether the municipality belongs to a rural or an urban area. Hence, $X = (X^{sex}, X^{age}, X^{cit}, X^{area})$, where

$X_{h, q}^{sex}$ , $h = 1, \dots, H, q = 1, \dots, Q$ , takes value 1 if the individuals composing the count are females and 0 otherwise;

$X_{h, q}^{age}$ , $h = 1, \dots, H, q = 1, \dots, Q$ , is a five-level variable indicating the individuals’ age class (0–5; 6–19; 20–39; 40–64; $\geq$ 65); for this variable, we create four dummies, the first class being the reference class;

$X_{h, q}^{cit}$, $h = 1, \dots, H, q = 1, \dots, Q$, takes value 1 if the individuals do not hold Italian citizenship and 0 otherwise;

$X_{h, q}^{area}$ , $h = 1, \dots, H, q = 1, \dots, Q$ , takes value 1 if the municipality $h$ is located in an urban area and 0 if it is in a rural area.
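The covariate coding described above can be sketched as a function that builds one row of the design matrix for a cell $(h, q)$. The function name, the age-class labels, and the column order (intercept, sex, four age dummies, citizenship, area) are assumptions made for this illustration.

```python
# Age classes as listed in the text; the first one is the reference class.
AGE_CLASSES = ["0-5", "6-19", "20-39", "40-64", "65+"]


def design_row(female, age_class, foreign, urban):
    """Return [intercept, sex, age2, age3, age4, age5, cit, area] for one cell."""
    if age_class not in AGE_CLASSES:
        raise ValueError(f"unknown age class: {age_class}")
    # Four dummies for the non-reference age classes.
    age_dummies = [1 if age_class == level else 0 for level in AGE_CLASSES[1:]]
    return [1, int(female)] + age_dummies + [int(foreign), int(urban)]


# Italian females aged 0-5 in a rural municipality: only intercept and sex are 1.
row_reference_age = design_row(True, "0-5", False, False)
# Foreign males aged 65 or more in an urban municipality.
row_elderly = design_row(False, "65+", True, True)
```

With this coding each row has $1 + K^{'} = 8$ entries, that is, an intercept plus $K^{'} = 7$ covariate columns.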

The random effects vary according to groups defined by the interactions between the area (as defined above) and the province (NUTS3) to which the municipality belongs. We estimate separate models for the twenty Italian Regions: Abruzzo (ABR), Basilicata (BAS), Calabria (CAL), Campania (CAM), Emilia-Romagna (EMI), Friuli-Venezia Giulia (FRI), Lazio (LAZ), Liguria (LIG), Lombardy (LOM), Marche (MAR), Molise (MOL), Piedmont (PIE), Puglia (PUG), Sardinia (SAR), Sicily (SIC), Trentino-Alto Adige (TRE), Tuscany (TOS), Umbria (UMB), Aosta Valley (VAL), Veneto (VEN).

In the next section, we assess the model’s performance regarding coverage, bias, and variability using simulated data that closely mimic the 2018 Permanent Census. In Section 5, we analyze the actual 2018 Permanent Census data.

4. Simulation Study

To empirically test the model’s performance, we simulate a complete synthetic dataset on which it is possible to distinguish all (otherwise unknown) components of the population: the units correctly listed in $A$ , those over-covered, and those missed by $A$ . We restrict our attention to municipalities with a population below 18,000.

In the next subsection, we describe the data simulation; results are shown in Subsection 4.2.

4.1. Data Simulation

First, we generate a synthetic data set composed of three components: (1) units that belong to the target population, $N_{h, q}^{\bar{o}}, h = 1, \dots, H, q = 1, \dots, Q$ , and are included in the non-probability sample obtained from the administrative source, (2) units that do not belong to the target population but are included in the non-probability sample (over-coverage component), $N_{h, q}^{o}, h = 1, \dots, H, q = 1, \dots, Q$ , and (3) units that are in the population but are not included in the non-probability sample (under-coverage components), $N_{h, q}^{u}, h = 1, \dots, H, q = 1, \dots, Q$ . Then, we draw the under and over-coverage samples from this synthetic data set.

The parameters used to generate this synthetic data set are based on the 2018 Census data. The details of the simulation follow.

1. Generation of a data set

We start with a data set composed of real counts of the Base Register of the Individuals involved in the 2018 Italian Permanent Census, ${N_{h, q}^{A}}$ , $h = 1, \dots, H, q = 1, \dots, Q$ . Then, for each municipality $h$ and profile $q$ :

1.1 We randomly generate the count of subjects that are correctly enumerated in $A$ , $N_{h, q}^{\bar{o}}$ , drawing from a binomial distribution with probability parameter equal to the relative frequency $f_{h, q}^{\bar{o}}$ of not over-covered subjects in $S^{o}$ :

$$N_{h, q}^{\bar{o}} \sim \mathrm{Bin}(N_{h, q}^{A}, 1 - ψ_{h, q} \equiv f_{h, q}^{\bar{o}}), \quad h = 1, \dots, H, \; q = 1, \dots, Q. \tag{12}$$

1.2 We randomly generate the count of under-covered subjects, $N_{h, q}^{u}$ , drawing from a negative binomial distribution with the number of successes equal to the number of not under-covered units, and using the relative frequency $f_{h, q}^{\bar{u}}$ of not under-covered subjects in $S^{u}$ as probability of success:

$$N_{h, q}^{u} \sim \mathrm{NegBin}(N_{h, q}^{\bar{o}}, 1 - φ_{h, q} \equiv f_{h, q}^{\bar{u}}), \quad h = 1, \dots, H, \; q = 1, \dots, Q. \tag{13}$$

The values $f^{\bar{o}}, f^{\bar{u}}$ are set on the observed relative frequencies of the not over- and not under-covered units in the surveys conducted for the 2018 Italian Permanent Census; yet, we will estimate $ψ$ and $φ$ assuming two logistic models, as detailed in Subsection 2.2. We denote with $N_{h, q}^{*}$ the (synthetic) target population size for municipality $h$ and profile $q$ , that is, $N_{h, q}^{*} = N_{h, q}^{\bar{o}} + N_{h, q}^{u}$ .

2. Drawing of the two samples for coverage correction

For each municipality, $h, h = 1, \dots, H$ , two simple random samples without replacement representing under and over-coverage samples are drawn from the synthetic data set previously generated. The samples are drawn only considering municipalities used in the actual 2018 sample survey census wave.
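The sampling step can be sketched as two simple random draws without replacement from a synthetic cell: one from the register list and one from the target population. The helper name, the counts, and the sample sizes below are hypothetical; the sketch also does not enforce the assumption that the two samples do not overlap.

```python
import random


def srs_counts(n_first, n_second, sample_size, rng):
    """Simple random sample without replacement from n_first + n_second units.

    Returns how many sampled units fall in the first and in the second group.
    """
    picked = rng.sample(range(n_first + n_second), sample_size)
    in_first = sum(1 for unit in picked if unit < n_first)
    return in_first, sample_size - in_first


rng = random.Random(42)
# Hypothetical synthetic cell: 920 correctly listed, 80 over-covered, 28 missed.
n_not_over, n_over, n_under = 920, 80, 28

# Over-coverage sample: drawn from the register A (listed units only).
s_over = srs_counts(n_not_over, n_over, sample_size=100, rng=rng)
# Under-coverage sample: drawn from the target population U.
s_under = srs_counts(n_not_over, n_under, sample_size=100, rng=rng)
```

Each call returns the pair of observed counts, for example (not over-covered, over-covered) for the list sample, which is the input the likelihood in Equation (5) needs.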

4.2. Analysis of the Synthetic Data and Results

The simulation of a synthetic data set, as detailed in the previous subsection, allows us to evaluate the performance of the proposed model knowing the (simulated) target population size, $N_{h, q}^{*}, h = 1, \dots, H, q = 1, \dots, Q$ .

Before presenting the simulations’ results, let us naively compare the target population size to that estimated using the administrative source. Let us set $N_{h}^{A} = \sum_{q} N_{h, q}^{A}$ , and, similarly, $N_{h}^{*} = \sum_{q} N_{h, q}^{*}$ . We define the following measure of relative error of the nonprobability sample derived by the administrative source at the municipal level:

$${RE}_{h}^{A} = \frac{|N_{h}^{A} - N_{h}^{*}|}{N_{h}^{A}}. \tag{14}$$

Figure 2 shows the distribution of ${RE}_{h}^{A}$ by Italian region (NUTS2). In each region, the relative error for most of the municipalities is below 5%. The outliers of the Marche region (MAR) behave markedly differently from those of the other regions, with some municipalities over 10% and a maximum of about 30%.

Figure 2.

Distribution of the municipal ${RE}_{h}^{A}$ (see Equation (14)), by region.

Now, we implement the model described in Section 2 and detailed in Subsection 3.2 with the aim of obtaining population size estimates closer to the target population size than the administrative counts. A Bayesian approach allows us to estimate the posterior distributions of population sizes at the profile level; consequently, any aggregation at the municipal or provincial level is easy to obtain. For instance, Figure 3 shows an example of a single municipality’s posterior distribution of counts. The solid line represents the (simulated) target population size $N_{h}^{*}$ ; the highest posterior density interval at 95% level (HPD95%, the gray area) covers $N_{h}^{*}$ . The administrative count for that municipality (dashed line) is far from $N_{h}^{*}$ . In the following subsections, we investigate the coverage with respect to the simulated target population value, the relative bias of our estimates, and their variability.

Figure 3.

Example of a posterior distribution of counts for a single municipality $h$ . The solid line represents the simulated population value $N_{h}^{*}$ , whereas the dashed line represents the administrative count $N_{h}^{A}$ , clearly affected by a high over-coverage error. The gray shade represents the highest posterior density interval at 95% credibility level (HPD95%).

For this simulation study, we have specified the hyperparameters of our model as follows: $ω^{2} = 25$, and $s_{b}, s_{c}$ uniformly distributed in $(0, 10)$. The results are robust to different prior specifications. Even for the largest regions, less than five minutes are required for the posterior estimation and prediction.

4.2.1. Coverage

The coverage of the proposed method is explored in Figure 4, which is a map showing the coverage by province. Figure 4a shows the coverage at the profile level for each province,

$$\frac{\sum_{h \in pv} 1_{\{N_{h, q}^{*} \in \mathrm{HPD95\%}(N_{h, q})\}}}{\sum_{h} 1_{\{h \in pv\}}}, \quad pv = 1, \dots, 107, \tag{15}$$

whereas Figure 4b shows the coverage at the municipal level for each province $pv$ , that is,

$$\frac{\sum_{h \in pv} 1_{\{N_{h}^{*} \in \mathrm{HPD95\%}(N_{h})\}}}{\sum_{h} 1_{\{h \in pv\}}}, \quad pv = 1, \dots, 107. \tag{16}$$
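A coverage rate of the kind in Equation (16) can be sketched from posterior draws as follows. The HPD interval is approximated here by the shortest interval containing the requested share of draws, which is a common approach for unimodal posteriors but an assumption of this sketch; the function names and the toy draws are hypothetical.

```python
import math


def hpd_interval(draws, level=0.95):
    """Shortest interval containing at least `level` of the sorted draws."""
    ordered = sorted(draws)
    n = len(ordered)
    k = max(1, math.ceil(level * n))  # number of draws inside the interval
    # Slide a window of k consecutive draws and keep the narrowest one.
    best = min(range(n - k + 1), key=lambda i: ordered[i + k - 1] - ordered[i])
    return ordered[best], ordered[best + k - 1]


def coverage_rate(targets, draws_by_unit, level=0.95):
    """Share of units whose target value falls inside its HPD interval."""
    hits = 0
    for target, draws in zip(targets, draws_by_unit):
        low, high = hpd_interval(draws, level)
        hits += low <= target <= high
    return hits / len(targets)


# Toy example with two units sharing the same (made-up) posterior draws.
toy_draws = [10, 1, 3, 20, 2]
rate = coverage_rate([2, 15], [toy_draws, toy_draws], level=0.5)
```

In the toy example the 50% interval is $(1, 3)$, so the first target is covered and the second is not, giving a rate of 0.5.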

The profile-level coverage (Figure 4a) is generally high, that is, around 95%, with some exceptions. For instance, coverage is below 75% for the provinces of Trieste (TRE), Bologna (EMI), Latina (LAZ), Ragusa, and Siracusa (SIC). Yet, looking at the municipal-level coverage (Figure 4b), we observe a substantial improvement: the coverage remains under 75% only for the province of Latina (LAZ).

Figure 4.

Coverage at the profile and municipal level, HPD95%. Provincial average. Simulated data. (a) Profile level and (b) municipal level.

4.2.2. Relative Bias

Beyond coverage, we may also look at a relative measure of the bias of our estimates, evaluated as:

$${RE}_{h} = \frac{|\mathrm{Med}(N_{h}) - N_{h}^{*}|}{\mathrm{Med}(N_{h})}, \quad h = 1, \dots, H, \tag{17}$$

where Med $(N_{h})$ is the posterior median of $N_{h}$. Although we do not observe much difference between the posterior median and the posterior mean, we opt for the former for robustness reasons. The RE is analogous to the measure in Equation (14); here, we measure the relative distance of the simulated target population count from the posterior median. To better appreciate the improvement obtained via correction, we compute the relative gain in terms of relative error, that is, $({RE}_{h}^{A} - {RE}_{h}) / {RE}_{h}^{A}$; see Figure 5. The first quartile of the distribution of the relative gain is above zero for eighteen regions out of twenty. Marche (MAR) and Valle D’Aosta (VAL) are exceptions; for these regions, the average gain is close to zero, that is, there is no major improvement when correcting for coverage errors.
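The two relative errors and the relative gain can be computed as in the following sketch. The function names and the numbers are hypothetical; note that the gain is undefined when the register count has no error, since it divides by ${RE}_{h}^{A}$.

```python
import statistics


def relative_error_admin(n_admin, n_target):
    """Equation (14): distance of the target from the register count."""
    return abs(n_admin - n_target) / n_admin


def relative_error_model(posterior_draws, n_target):
    """Equation (17): distance of the target from the posterior median."""
    median = statistics.median(posterior_draws)
    return abs(median - n_target) / median


def relative_gain(re_admin, re_model):
    """Positive when the model-based estimate beats the register count."""
    return (re_admin - re_model) / re_admin


# Made-up municipality: register count 1100, simulated target size 1000.
re_a = relative_error_admin(1100, 1000)
re_m = relative_error_model([990, 1000, 1010, 1020, 1030], 1000)
gain = relative_gain(re_a, re_m)
```

Here the register is off by about 9% and the posterior median by about 1%, so the gain is close to 0.89.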

Figure 5.

Distribution of the relative gains $({RE}_{h}^{A} - {RE}_{h}) / {RE}_{h}^{A}$ , by region (see Equation (17) for the definition of RE). Positive values indicate a gain in terms of relative error, namely ${RE}_{h} < {RE}_{h}^{A}$ .

4.2.3. Variability

We measure the variability of our estimates by computing a robust version of the coefficient of variation, namely the ratio between the interquartile range and the median at the municipal level:

$$RCV_{h} = \frac{{IQR}_{h}}{\mathrm{Med}(N_{h})}, \quad h = 1, \dots, H. \tag{18}$$

Results are shown in Figure 6. All values are below 8.4%; more than 90% of the municipalities have an $RCV$ lower than 2%.
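As a minimal sketch, the robust coefficient of variation in Equation (18) can be computed from posterior draws with the standard library. The quartile definition used by `statistics.quantiles` (its default "exclusive" method) is an assumption of this sketch, since the text does not say which definition was used.

```python
import statistics


def robust_cv(posterior_draws):
    """Interquartile range divided by the median of the draws."""
    # With n=4, quantiles returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(posterior_draws, n=4)
    return (q3 - q1) / median


# Made-up draws: quartiles are 2, 4, and 6, so the ratio is (6 - 2) / 4.
rcv = robust_cv([1, 2, 3, 4, 5, 6, 7])
```

Because it relies on quartiles rather than on the mean and standard deviation, this measure is not driven by a few extreme posterior draws.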

Figure 6.

Distribution of $RCV_{h}$ (see Equation (18)) by region.

5. The Bayesian Model Applied to the 2018 Permanent Census Data

In the previous section, we assessed the performance of our model using simulated data. In this section, we analyze the real 2018 Permanent Census data. We implement the model described in Section 2 and detailed in Subsection 3.2 for every Italian region. Figure 7 shows the posterior mean of the probability of over-coverage (Figure 7a) and under-coverage (Figure 7b) at the profile level, averaged at the province level. Overall, over-coverage is the most relevant source of error affecting the Base Register of Individuals (the administrative source); the province average ranges between 3% (Cuneo, PIE) and 15% (Siracusa, SIC), whereas the average under-coverage probability is always below 3.9% (Crotone, CAL).

Figure 7.

Over- and under-coverage probabilities at profile level, province averages. (a) Over-coverage and (b) under-coverage. 2018 Population census data.

The Appendix provides tables reporting the HPD95% of the $β$ and $δ$ coefficients obtained for each region. Being a female or a non-Italian citizen is associated with a higher probability of being over-covered. For some regions, namely Calabria, Friuli-Venezia Giulia, Lombardia, Puglia, and Valle D’Aosta, being a foreign citizen is also associated with a higher under-coverage probability. The probability of being over-covered increases in the age classes of the most mobile (working) population, namely the third and the fourth ones, whereas there is a negative association with belonging to the last one, corresponding to the over-65 population.

Figure 8 shows the relative frequency with which the count registered in the Base Register of Individuals (BRI) falls within the HPD95% interval at the profile and municipal levels. The relative frequency is quite high at the profile level, especially in Northern Italy, some provinces of the Center and the South, and Sardegna (SAR). When we aggregate the results to the municipal level, the BRI count tends to fall outside the narrower HPD95% interval. This may also be due to an accumulation of errors. At the municipal level, the error due to the model prediction is much lower than the coverage errors; at the profile level, the model error and the coverage errors have the same intensity and cannot be distinguished from each other. Figure 9 shows the same relative frequency of inclusion of the BRI count in the HPD95% interval, obtained using the simulated data.

Figure 8.

Relative frequency of the inclusion of the BRI estimate by the HPD95% at the profile and municipal level. Provincial average. 2018 Population census data. (a) Profile level and (b) municipal level.

Figure 9.

Relative frequency of the inclusion of the BRI estimate by the HPD95% at the profile and municipal level. Provincial average. Simulated data. (a) Profile level and (b) municipal level.

Finally, let us define the relative distance of the posterior median to the BRI estimate at the municipal level as

$$Δ_{h}(A, \mathrm{Med}) = \frac{\mathrm{Med}(N_{h}) - N_{h}^{A}}{N_{h}^{A}}, \quad h = 1, \dots, H. \tag{19}$$

Figure 10 compares the regional distribution of $Δ_{h} (A, Med)$ obtained using the 2018 Population census data with that obtained using simulated data. The distributions are almost identical. All curves lie mainly on the negative values, reflecting the weight of the over-coverage error in the administrative source. This strong similarity suggests that the evidence from the simulation experiment is also relevant for the application to the 2018 Population census data studied in this section.

Figure 10. Relative distance between the municipal-level posterior median, $Med (N_{h})$, and the BRI estimate, $N_{h}^{A}$. Real (black dashed curves) versus simulated (gray solid curves) data.

6. Conclusions

This paper proposes a Bayesian approach to estimating count data by integrating a nonprobability sample from an administrative source, which is susceptible to coverage errors, and two sample surveys. The proposed model provides count estimates at a detailed level, and it makes the uncertainty estimation straightforward. The application of this approach in a simulated experiment and to the 2018 Italian census data demonstrates its effectiveness in improving the quality of estimates by reducing bias in the administrative data source counts. Furthermore, it proves its feasibility in complex scenarios such as the Population Census.

However, it is important to note that the i.i.d. assumption used in the model approximation may not fully capture the random structure of the data due to survey design features such as stratification, weighting, and clustering. To address this limitation, future research should incorporate information on the sampling design into the modeling process to mitigate the effects of model misspecification. Including the variables that determine the sampling design as covariates and using hierarchical Bayesian models with random effects for clusters can enhance the accuracy of the estimates. Little (2006, 2022) offers valuable suggestions within a Bayesian framework to tackle this challenging problem. Di Zio et al. (2024) provide insights and proposals for considering sampling design features within the context of the application studied in this paper.

Finally, it is worth noting that errors other than under- and over-coverage ones, such as misclassification of the covariates defining the profiles, can also impact these data. Although misclassification is not accounted for in this study, it is essential to develop a general theoretical framework that simultaneously addresses misclassification and coverage errors. In the specific case of the Italian application, the influence of misclassification is deemed negligible for the variables used as covariates in the model. However, for a comprehensive approach, future work will aim to incorporate misclassification and coverage errors simultaneously, illustrating the flexibility of the model described in Section 2.

Appendix A

Acknowledgements

We thank the anonymous referees for the suggestions that greatly improved the quality of this paper.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Brunero Liseo’s research has been funded by Sapienza University of Rome, grant no. RM122181612D9F93.

ORCID iD

Veronica Ballerini

Received: July 2023

Accepted: December 2024

References

Biemer P. P., and Lyberg L. E. 2003. Introduction to Survey Quality. New York: John Wiley & Sons. DOI: https://doi.org/10.1002/0471458740.

Citro C. F. 2014. “From Multiple Modes for Surveys to Multiple Data Sources for Estimates.” Survey Methodology 40 (2): 137–62.

Di Zio, Liseo, and Ranalli M. G. 2024. “Bayesian Ideas in Survey Sampling: The Legacy of Basu.” Sankhya A 86: 71–94. DOI: https://doi.org/10.1007/s13171-023-00327-5.

Elleouet J. S., Graham, Kondratev, Morgan A. K., and Green R. M. 2022. “Small Domain Estimation of Census Coverage–A Case Study in Bayesian Analysis of Complex Survey Data.” Journal of Official Statistics 38 (3): 767–92. DOI: https://doi.org/10.2478/jos-2022-0034.

Falorsi. 2017. “Census and Social Surveys Integrated System.” United Nations Economic Commission for Europe–UNECE, Conference of European Statisticians-CES, Group of Experts on Population and Housing Censuses, 19th Meeting, Geneva, Switzerland, October 4–6. https://scholar.google.com/scholar_url?url=https://www.istat.it/it/files/2018/11/FalorsiS_original-paper.pdf&hl=it&sa=T&oi=gsb-ggp&ct=res&cd=0&d=10243942950096081740&ei=yc5SZ7adGNXEy9YPxYiS6QU&scisig=AFWwaeYQWaLZpi5o4BF4L4Hco9iE (accessed December 6, 2024).

Groves R. M., Fowler F. J. Jr., Couper M. P., Lepkowski J. M., Singer, and Tourangeau. 2011. Survey Methodology. New York: John Wiley & Sons.

Istat. 2016. “Istat’s Modernisation Programme.” Report. https://www.istat.it/wp-content/uploads/2011/04/IstatsModernistionProgramme_EN.pdf (accessed December 6, 2024).

Little R. J. 2006. “Calibrated Bayes: A Bayes-Frequentist Roadmap.” The American Statistician 60 (3): 213–23. DOI: https://doi.org/10.1198/000313006X117837.

Little R. J. 2022. “Bayes, Buttressed by Design-Based Ideas, is the Best Overarching Paradigm for Sample Survey Inference.” Survey Methodology 48: 257–81.

Mancini and Toti. 2014. “Dalla popolazione residente a quella abitualmente dimorante: modelli di previsione a confronto sui dati del censimento 2011.” Technical Report, Istat Working Papers N. 8/2014.

Pfeffermann, Eltinge J. L., and Brown L. D. 2015. “Methodological Issues and Challenges in the Production of Official Statistics: 24th Annual Morris Hansen Lecture.” Journal of Survey Statistics and Methodology 3 (4): 425–83. DOI: https://doi.org/10.1093/jssam/smv035.

Righi, Falorsi P. D., Daddi, Fiorello, Massoli, and Terribili M. D. 2021. “Optimal Sampling for the Population Coverage Survey of the New Italian Register Based Census.” Journal of Official Statistics 37 (3): 655–71. DOI: https://doi.org/10.2478/jos-2021-0029.

Stan Development Team. 2023. “RStan: The R Interface to Stan.” R Package Version 2.21.8.

Statistics Denmark. 1995. Statistics on Persons in Denmark – A Register-Based Statistical System. Eurostat.

Statistics Finland. 2004. Use of Registers and Administrative Data Sources for Statistical Purposes – Best Practices in Statistics Finland, Volume Handbook 45. Statistics Finland.

Wallgren and Wallgren. 2007. Register-Based Statistics: Administrative Data for Statistical Purposes. Vol. 553. John Wiley & Sons.

Wang, Kim J. K., and Yang. 2018. “Approximate Bayesian Inference Under Informative Sampling.” Biometrika 105 (1): 91–102. DOI: https://doi.org/10.1093/biomet/asx073.

Wolter K. M. 1986. “Some Coverage Error Models for Census Data.” Journal of the American Statistical Association 81 (394): 337–46. DOI: https://doi.org/10.1080/01621459.1986.10478277.

Zhang L.-C. 2012. “Topics of Statistical Theory for Register-Based Statistics and Data Integration.” Statistica Neerlandica 66 (1): 41–63. DOI: https://doi.org/10.1111/j.1467-9574.2011.00508.x.
