Abstract
Estimating the risk of re-identification probabilistically is well-developed for the case of a random representative sample drawn from the general population, such as large-scale government surveys conducted regularly at National Statistical Institutes. Recent work extended this procedure to assess the risk of re-identification in non-probability subpopulation registers such as a cancer register. In this paper, we extend this work further to the case of samples drawn from registers or more generally to non-probability samples, such as those used in opt-in panels at survey organizations. The assumption is that membership to the subpopulation register is not known and the sampling mechanism is also unknown. We show how to assess the risk of re-identification for these types of non-probability samples using a probability-based reference sample to infer population parameters under the probabilistic modeling framework. We demonstrate with a simulation study and a real application on the 2021 Survey of Doctoral Recipients drawn from a subpopulation register of all PhD recipients from an accredited US institution.
1. Introduction
Much of the literature on quantifying the risk of re-identification for random sample microdata has used probabilistic modeling to estimate population uniqueness from among sample uniques on a set of indirect quasi-identifying variables that are typically categorical, such as age, sex, place of residence, marital status, and occupation. See, for example, the early papers of Bethlehem et al. (1990), Benedetti et al. (1998), Fienberg and Makov (1998), Skinner and Holmes (1998), and Skinner and Shlomo (2008). In this setting, the authors focus on quantifying the risk of re-identification for a random probability-based sample drawn from the general population, for example large-scale government surveys such as the Labour Force Survey. The disclosure risk scenario is that an intruder wishes to link an individual in the released sample microdata to the general population through a vector
As mentioned, the disclosure risk assessment for survey microdata is based on quantifying the risk of re-identification and not on attribute disclosures. This is because there are hundreds of attributes from collected target variables in the survey microdata, and rather than trying to provide a probability of predicting each attribute, the risk of re-identification provides a single interpretable measure of disclosure risk for all attributes. We assume here that identity disclosure is a pre-requisite for individual attribute disclosures and thus, protecting the survey dataset against identity disclosure will avoid attribute disclosures (Willenborg and De Waal 2001).
In any sample (both probability and non-probability surveys), an intruder knows that it is possible that an individual in the sample microdata can be linked to an individual in the population. In this scenario, direct identifying variables, such as name, address, or identification numbers, are removed. Nevertheless, the risk of re-identification can arise when there are small counts on a set of cross-classified indirect quasi-identifying variables. These indirect quasi-identifying variables can be used to identify an individual and further confidential information may be learned from the survey target variables.
Under a probabilistic modeling approach, disclosure risk is assessed on the contingency table of sample counts spanned by the indirect quasi-identifying variables. The assumption is that the sample microdata contain responding individuals from a survey and the population counts are unknown. The risk of re-identification is therefore a function of both the population and the sample and measured in terms of population uniqueness, for example, the probability that a sample unique in a cell of the contingency table is a population unique. Besides measuring the disclosure risk for the sample microdata, the sample is also used as a data source for estimating population parameters and the disclosure risk measures under the probabilistic modeling framework.
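As a minimal illustration of the contingency table of sample counts and of sample uniques, the following sketch cross-classifies a handful of records and picks out the cells with a sample count of one. The records, the choice of three quasi-identifying variables, and all variable names are hypothetical and serve only to make the definitions concrete.

```python
from collections import Counter

# Hypothetical microdata: each record is a tuple of quasi-identifier values
# (age group, sex, marital status). The values are illustrative only.
records = [
    ("30-39", "F", "married"),
    ("30-39", "F", "married"),
    ("40-49", "M", "single"),
    ("50-59", "F", "widowed"),
    ("40-49", "M", "married"),
    ("40-49", "M", "married"),
    ("40-49", "M", "married"),
]

# f_k: sample count in each non-empty cell k of the cross-classification.
cell_counts = Counter(records)

# Sample uniques are the cells with f_k = 1.
sample_uniques = [cell for cell, f_k in cell_counts.items() if f_k == 1]
n_sample_uniques = len(sample_uniques)
print(n_sample_uniques, sample_uniques)
```

Whether a sample unique is also a population unique cannot be read from the sample alone, which is why the population counts have to be modeled.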
Shlomo and Skinner (2022) consider a new setting where there may be microdata available from a register that represents a particular subpopulation in the general population. The register may be publicly available but the membership of the subpopulation may not be known. Unlike a random probability-based sample in the previous setting, the subpopulation is not representative of the general population. A register is a type of administrative dataset, and is considered non-probability data when it is used for statistical purposes. This is because its selection is not based on random chance but rather on some specific inclusion criteria, leading to potential biases when attempting to generalize to the general population. Unlike probability data, which requires a defined sampling frame to ensure each unit has a known non-zero chance of selection, a register contains individuals who have been identified and included for non-random reasons (Golini and Righi 2024).
Similar to the original approach, Shlomo and Skinner (2022) also assume the same disclosure risk scenario in that an intruder aims to link an individual in the subpopulation register to an individual in the population of which the subpopulation is a subset and the population counts are unknown. They also assume that there are categorical indirect quasi-identifying variables
In order to allow inference about the general population uniqueness in the subpopulation, Shlomo and Skinner (2022) assume that there also exists survey microdata from a random sample to be used as a general probability-based reference sample. Thus, whereas in the original work, particularly in Elamir and Skinner (2006) and Skinner and Shlomo (2008), the sample microdata file served two purposes, one as the file about which disclosure risk is a concern and one for inference about population uniqueness, Shlomo and Skinner (2022) assume that we need to resort to using separate files for these two purposes.
In this paper, we demonstrate how the developed theory may be extended to assess disclosure risk in non-probability samples drawn from a subpopulation register or more generally to non-probability samples, such as those obtained from opt-in internet panel members within survey agencies where inclusion probabilities are unknown.
Section 2 reviews the original theory of assessing the risk of re-identification in probability-based random survey microdata and is followed by the case where we assess the risk of re-identification for a subpopulation register using a probability-based reference sample in Section 3. We extend the theory in Subsection 3.1 to include samples drawn from a register or more generally to non-probability samples where inclusion probabilities and the sampling mechanism are not known. We demonstrate the theory in a simulation study in Section 4. Section 5 presents an application to assess the disclosure risk in the public-use microdata of the Survey of Doctoral Recipients (SDR) 2021 (National Center for Science and Engineering Statistics (NCSES) 2023). The application includes assessing the disclosure risk of re-identification according to the original method where the statistical agency would like to know the number of subpopulation uniques on the set of sample uniques according to the set of indirect quasi-identifying variables, assuming that the intruder would have access to the register. Moreover, we also show the more realistic intruder scenario where we assess population uniqueness with respect to the general population on the set of sample uniques. For this approach, we need a probability-based reference sample drawn from the general population to be used to correct for the selection bias in the non-probability sample. For the application, we use the American Community Survey (ACS) 2021 (available here: https://www.census.gov/programs-surveys/acs/microdata/access.html) as the probability-based reference sample. We conclude in Section 6 with a discussion.
2. Overview of Assessing Disclosure Risk for Probability-Based Sample Microdata
Individual per-record risk measures in the form of a probability of re-identification are estimated using a probabilistic model. These per-record risk measures are then aggregated to obtain global risk measures for the entire file. We denote by F_k the population size in cell k of a table spanned by indirect quasi-identifying variables having K cells, f_k the sample size in cell k,
Skinner and Holmes (1998), Elamir and Skinner (2006), and Skinner and Shlomo (2008) propose using a Poisson distribution as the probabilistic modeling framework and a log-linear model to estimate population parameters to calculate τ1. In this model, they assume that
The sample frequencies f_k are independent Poisson distributed with a mean of
The fitted values are then calculated by:
Skinner and Shlomo (2008) developed goodness of fit criteria, termed the B-statistics, that can be used to assess the model fit of the log-linear model assuming the Normal distribution critical value of 1.96. The B-statistics represent the deviation from the assumptions of the first and second moments of the Poisson distribution, and when the value is small, we can expect a good fit under the log-linear modeling. They found that for the type of large-scale probability-based surveys conducted within National Statistical Institutes, such as the Labour Force Survey, an all 2-way interactions log-linear model generally provides accurate results with respect to estimated disclosure risk measures, as this model strikes a balance between structural and random zeroes in the contingency table spanned by the indirect quasi-identifying variables.
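To make the per-record and global risk measures concrete, the sketch below assumes the standard Poisson formulation in the cited literature: F_k is Poisson with mean λ_k and units are sampled independently with rate π, so that, given f_k = 1, the unsampled count F_k − 1 is Poisson with mean μ_k = λ_k(1 − π). Under that assumption, P(F_k = 1 | f_k = 1) = exp(−μ_k) and E(1/F_k | f_k = 1) = (1 − exp(−μ_k))/μ_k. The function name, the label tau2 for the second measure, and the toy fitted values are illustrative assumptions, not quantities taken from this paper.

```python
import math

def estimate_risk(sample_counts, lambda_hat, pi):
    """Global risk estimates summed over sample-unique cells (f_k = 1).

    tau1: expected number of sample uniques that are population uniques.
    tau2: expected number of correct matches, sum of E(1/F_k | f_k = 1).
    Assumes F_k - f_k | f_k ~ Poisson(lambda_k * (1 - pi)).
    """
    tau1 = 0.0
    tau2 = 0.0
    for f_k, lam in zip(sample_counts, lambda_hat):
        if f_k != 1:
            continue  # only sample uniques contribute
        mu = lam * (1.0 - pi)
        tau1 += math.exp(-mu)
        tau2 += (1.0 - math.exp(-mu)) / mu if mu > 0 else 1.0
    return tau1, tau2

# Toy inputs: five cells, three of which are sample uniques.
# lambda_hat would normally be the fitted values of a log-linear model.
sample_counts = [1, 0, 3, 1, 1]
lambda_hat = [2.0, 0.5, 60.0, 20.0, 1.0]
tau1_hat, tau2_hat = estimate_risk(sample_counts, lambda_hat, pi=0.05)
print(round(tau1_hat, 4), round(tau2_hat, 4))
```

Cells with a small fitted λ_k dominate both sums, which is why the quality of the log-linear fit in sparse cells matters for the risk estimates.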
Skinner and Shlomo (2008) also consider the case of a complex sample design where differential weights are assigned to each unit in the random probability-based sample. The λ_k can be estimated consistently using pseudo-maximum likelihood estimation where the estimating equation is modified to:
and
3. Assessing Disclosure Risk for a Non-Probability Subpopulation Register
The theory for measuring the risk of re-identification for a non-probability subpopulation register was developed in Shlomo and Skinner (2022). Let U and U1 denote the population and the subpopulation, respectively, with
Denote the subpopulation count in cell k as
To estimate disclosure risk measures
and:
3.1. Extending the Theory to Non-Probability Samples
From here, we wish to extend the theory from quantifying the disclosure risk of a non-probability subpopulation register to the case where we have a sample s1 drawn from the non-probability subpopulation register U1, or more generally a non-probability sample where membership of a sampling frame and the sampling mechanism are unknown. We do not observe the entire subpopulation; rather, we observe a non-probability sample s1 drawn from some hypothetical subpopulation
and
with the estimation of parameter ϕ_k shown in Equations (3) and (4) to account for the non-probability sample counts
We expect smaller disclosure risk measures for the non-probability sample s1 in Equations (5) and (6) compared to the case of estimating the disclosure risk measures for the subpopulation U1 as shown in Shlomo and Skinner (2022) in Equations (3) and (4) due to the extra variation and uncertainty arising from some (unknown) sampling procedure.
As in the original setting described in Section 2, we assume we have a random sample s drawn from a finite population U in which the values of
In a theoretical sense, for a sample drawn from the non-probability subpopulation register, we can assume that:
If the F_k and
In the first step, we use the full microdata of the non-probability sample and estimate the propensity score:
We set out the proposed estimation procedure in detail as follows:
Step 1: Estimate for each individual in the non-probability sample the propensity score
Step 2: Since the propensity score
Step 3: Using the population estimates
Step 4: Defining
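The propensity score and inverse propensity weighting in the steps above can be sketched as follows. This is only one common way to set the problem up and is an assumption of the sketch, not a reproduction of the estimator in this paper: the non-probability sample (indicator z = 1, weight 1) is stacked on the reference sample (z = 0, survey weights), a weighted logistic regression is fitted by Newton-Raphson, and the odds-based pseudo-weight (1 − p)/p is assigned to each non-probability unit. The single binary covariate, the toy counts, and all names are hypothetical.

```python
import math

def fit_weighted_logit(x, z, w, n_iter=50, tol=1e-10):
    """Newton-Raphson for logit P(z = 1 | x) = b0 + b1 * x with case weights w."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, zi, wi in zip(x, z, w):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            r = wi * (zi - p)        # weighted score contribution
            v = wi * p * (1.0 - p)   # weighted information contribution
            g0 += r
            g1 += r * xi
            h00 += v
            h01 += v * xi
            h11 += v * xi * xi
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det   # Newton step: H^{-1} g
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if max(abs(d0), abs(d1)) < tol:
            break
    return b0, b1

# Hypothetical reference sample: 10 units with x = 0 and 10 with x = 1,
# each carrying a survey weight of 100 (weighted totals of 1,000 per group).
ref_x = [0] * 10 + [1] * 10
ref_w = [100.0] * 20
# Hypothetical non-probability sample: 10 units with x = 0, 40 with x = 1.
np_x = [0] * 10 + [1] * 40

x = np_x + ref_x
z = [1] * len(np_x) + [0] * len(ref_x)
w = [1.0] * len(np_x) + ref_w

b0, b1 = fit_weighted_logit(x, z, w)

# Odds-based pseudo-weight (1 - p) / p = exp(-(b0 + b1 * x)) per unit.
pseudo_weights = [math.exp(-(b0 + b1 * xi)) for xi in np_x]
estimated_population_size = sum(pseudo_weights)
print(round(pseudo_weights[0], 3), round(pseudo_weights[-1], 3))
```

In this toy set-up the pseudo-weights recover the ratio of the weighted reference total to the non-probability count within each covariate group, so the weighted non-probability sample reproduces the reference distribution of x. Weighted cell counts built from such pseudo-weights could then feed a pseudo-maximum likelihood log-linear fit.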
4. Simulation Study
We use the simulation set-up from Shlomo and Skinner (2022) based on the UK Census 2001 synthetic data having the following variables (between brackets are the number of categories): Geography (6), Age group (14), Sex (2), Marital Status (6), Ethnicity (16), Economic Activity (10), and Ill Health (2) with N = 1,003,401. The subpopulation register contains individuals having ill health where N1 = 179,699. We produce a multiway contingency table of size K = 161,280 cells defined by all variables except Ill Health. Shlomo and Skinner (2022) compared the distributions in the population and the subpopulation data for quasi-identifying variables Age Group, Sex and Economic Activity (see Table 1) and showed that the subpopulation mainly contained the elderly population.
Table 1. Average of 200 Iterations (Simulation Standard Errors in Parentheses).
The simulation steps were as follows:
Step 1: Draw 200 random samples without replacement from the population using Bernoulli sampling where π = 1/20, resulting in a sample size of n = 50,171 on average. Draw 200 random samples without replacement from the subpopulation using Bernoulli sampling where π1 = 1/5, resulting in a sample size of n1 = 35,940 on average.
Step 2: Use the samples as the reference sample and non-probability sample, respectively, to estimate
Now we follow Steps 1 to 4 in Subsection 3.1 to calculate the risk measures
Table 1 shows the results of the estimation of the disclosure risk measures
From Table 1 we see a slight bias in the estimates of the disclosure risk measures based on the selected log-linear model with a combination of 2-way and 3-way interactions. The measures are slightly overestimated and therefore are more conservative disclosure risk measures. As mentioned, we used a combination of 2-way and 3-way interactions because of the large sample sizes drawn compared to their populations: 1/20 for the probability sample and 1/5 for the non-probability sample from the subpopulation register. A more in-depth analysis of goodness-of-fit might have improved the fit of the models. For typical and more realistic sample fractions, we generally use the all 2-way interaction model as will be shown in the application in Section 5.
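The Bernoulli sampling in Step 1 of the simulation can be sketched as below: each unit is selected independently with a fixed probability, so the realized sample size is random with expectation Nπ (about 50,170 for the population and about 35,940 for the subpopulation, in line with the reported averages). The seed, the function name, and the use of unit indices in place of actual records are assumptions of this sketch.

```python
import random

def bernoulli_sample(population_size, pi, rng):
    """Return the indices of units selected independently with probability pi."""
    return [i for i in range(population_size) if rng.random() < pi]

rng = random.Random(2021)            # arbitrary seed for reproducibility
N, N1 = 1_003_401, 179_699           # population and subpopulation sizes

s = bernoulli_sample(N, 1 / 20, rng)     # probability-based reference sample
s1 = bernoulli_sample(N1, 1 / 5, rng)    # sample from the subpopulation register

print(len(s), len(s1))
```

Repeating the two draws 200 times with different random streams would give the replicates over which the risk measures are averaged.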
Furthermore, from the simulation set-up in Shlomo and Skinner (2022) we obtained that the true (average) value of population uniques and subpopulation U1 uniques according to Equation (3) was 2,721 and the true (average) value of population uniques, subpopulation U1 uniques, and random sample s uniques was 55.5. Given we are now assessing the disclosure risk in the non-probability sample s1 drawn from some hypothetical subpopulation U1, we obtain a true (average) value of 543.9 for the number of population uniques and non-probability sample s1 uniques, and the true (average) value of 26.7 for the number of population uniques, non-probability sample s1 uniques, and random sample s uniques. As expected, we obtain smaller values of disclosure risk measures when assessing the disclosure risk on a non-probability sample from a subpopulation versus the disclosure risk measures on the whole subpopulation.
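In a simulation the population is known, so the true number of sample uniques that are also population uniques can be counted directly and compared with the model-based estimate. The sketch below shows that count for a toy population; the cell labels, sample indices, and function name are hypothetical.

```python
from collections import Counter

def true_unique_matches(population_cells, sample_indices):
    """Count cells with sample count f_k = 1 and population count F_k = 1."""
    F = Counter(population_cells)
    f = Counter(population_cells[i] for i in sample_indices)
    return sum(1 for k, f_k in f.items() if f_k == 1 and F[k] == 1)

# Toy population: one cell label per unit (labels stand in for the
# cross-classification of the quasi-identifying variables).
population_cells = ["a", "a", "b", "c", "c", "c", "d", "e"]
sample_indices = [0, 2, 3, 7]    # units selected into the sample

n_true = true_unique_matches(population_cells, sample_indices)
print(n_true)
```

Here cells a, b, c and e are sample uniques, but only b and e are also unique in the population.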
5. Risk of Re-Identification in the Survey of Doctoral Recipients 2021
The dataset of the Survey of Doctoral Recipients (SDR) 2021 (National Center for Science and Engineering Statistics (NCSES) 2023) included 80,295 individuals in a sample drawn from a subpopulation register with an estimated size of approximately 1,186,000 individuals between the ages of 25 and 75 who had received a PhD at an accredited university in the United States (US) in the area of science, engineering, and health. As mentioned in Section 1, a subpopulation register is considered a non-probability data source because its selection is not based on random chance but rather on the administrative processes of a data collection authority. Hence, any sample from a register is also a non-probability sample and we assume that the sampling mechanism with respect to the general population is unknown.
The sample for SDR is drawn from a subpopulation register referring to all persons awarded a PhD from an accredited US institution in a given year, called the Survey of Earned Doctorates (SED). The SED is an annual census conducted since 1957 and is sponsored by the National Center for Science and Engineering Statistics (NCSES) within the National Science Foundation (NSF) and by three other federal agencies: the National Institutes of Health, Department of Education, and National Endowment for the Humanities.
We now turn to the disclosure risk assessment of SDR 2021. We start with the original framework described in Section 2 where we assume that an intruder would try and match unique cells derived from a set of indirect quasi-identifying variables in SDR 2021 to PhD recipients in the subpopulation register of the SED and therefore we calculate the probability that a sample unique is unique in the subpopulation register τ1 based on the estimate
In this original framework, we use the pseudo-maximum likelihood log-linear model to estimate the population parameters λ_k, k = 1, …, K, where the marginals for the log-linear model are obtained by summing the weighted sample counts
We now turn to the new approach described in Subsection 3.1. Here, we assess the general population uniqueness for the SDR 2021 according to the general population of PhD recipients (and not the subpopulation register). This would be a more realistic disclosure risk assessment with respect to the statistical agency as the subpopulation of labeled PhD recipients in the SED would not be available to potential intruders. To assess the disclosure risk in the general population, we use a subset of the American Community Survey (ACS) 2021 as the probability-based reference sample. The subset included only those respondents declaring that they had a PhD and who were in the appropriate age range of 25 to 75. This led to a sample size of 35,755 individuals and an estimated population size of 3,288,181 after summing the weights in the ACS 2021. The inconsistencies in the population sizes between SDR 2021 and ACS 2021 are due to the non-probability nature of SDR 2021 and, in particular, to the fact that the ACS includes respondents declaring a PhD degree in any subject and from any university, including universities abroad. With these inconsistencies, we expect that estimating population uniqueness with respect to the general population of PhD degree recipients will give a lower risk of re-identification in SDR 2021 compared to estimating the subpopulation uniqueness based on the SED in the Original Approach.
We follow the procedure as outlined in Subsection 3.1 and denote this as the “New Approach” in Table 3. To estimate the propensity score
Table 2. Variables and Their Distributions Used to Estimate the Propensity Score in SDR 2021 According to ACS 2021.
From Table 2, we can see large differences between the probability-based reference sample ACS 2021 in columns 2 and 3, and the original unadjusted SDR 2021 in columns 4 and 5, for example, a larger percentage of males, employed, non-citizens, and Asians in SDR compared to ACS. The last 2 columns in Table 2 show the adjusted inverse propensity weighted (IPW) SDR 2021 distributions and it can be seen that these weighted SDR 2021 distributions are more similar to the ACS 2021 distributions.
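The kind of comparison summarized in Table 2 amounts to computing a categorical distribution once without weights and once with the inverse propensity weights. A minimal sketch follows; the four records and their weights are hypothetical and are not SDR or ACS values.

```python
from collections import defaultdict

def weighted_percentages(categories, weights):
    """Percentage distribution of a categorical variable under given weights."""
    totals = defaultdict(float)
    for c, w in zip(categories, weights):
        totals[c] += w
    grand_total = sum(totals.values())
    return {c: 100.0 * t / grand_total for c, t in totals.items()}

# Hypothetical non-probability sample with an over-represented group "M".
sex = ["M", "M", "M", "F"]
ipw = [10.0, 10.0, 10.0, 30.0]   # illustrative inverse propensity weights

unadjusted = weighted_percentages(sex, [1.0] * len(sex))
adjusted = weighted_percentages(sex, ipw)
print(unadjusted, adjusted)
```

Placing the adjusted percentages next to the weighted reference-sample percentages gives a quick check on whether the propensity model has removed the main distributional differences.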
We now turn to estimating the risk of re-identification for SDR 2021 according to population uniqueness in the general population of PhD recipients following Steps 2 through 4 in Subsection 3.1. We use the same indirect quasi-identifying variables as shown for the Original Approach framework: NSDRMENTOD (Field of Study, 26 categories); GENDER (2 categories); AGEGRP (10 categories); RACETHMP (Race/Ethnicity, 5 categories); BTHUS (Born in the US, 2 categories); CTZN_DRF (Type of Citizenship, 5 categories). However, because these variables are not defined in the probability-based reference sample ACS 2021, we do not have the sample uniques f_k = 1 on these variables and are not able to calculate
Table 3 contains the two disclosure risk measures. Under the Original Approach, the estimate for the number of SDR 2021 unique cells that are also unique on the set of indirect quasi-identifying variables in the subpopulation register of doctoral recipients (SED) is approximately 251, although the B-statistic is 5.75, which is above the critical value of 1.96. This indicates potential overestimation. However, SDR 2021 contains only a subset of PhD recipients and is a non-probability dataset with distributions that differ from the general population of PhD recipients. We can use the weights in SDR 2021 to infer population uniqueness with respect to the SED, but these weights are very skewed, ranging from 1 to 100 with a median of 10. We see that this non-probability sample of SDR 2021 is difficult to model according to the original log-linear modeling setting in Section 2, and we obtain unreliable goodness-of-fit (B-statistic) criteria as the dataset clearly deviates from a random probability-based sample, for which the theory was developed. Moreover, from the perspective of the statistical agency, it is unrealistic to assume that an intruder would have access to the labeled subpopulation register of the SED to attempt to make a re-identification.
Table 3. Final Risk Measures Under Original and New Approach.
Under the New Approach, the statistical agency releasing SDR 2021 data is interested in identifying population uniqueness according to the general population of PhD recipients. We therefore rely on a probability-based reference sample to estimate the risk of re-identification. The result for the New Approach in Table 3 is approximately 59 sample uniques that are general population uniques, with a B-statistic (under the pseudo-maximum likelihood approach) of 1.53. In this approach, the B-statistic is more reliable and the log-linear model using the inverse propensity weighted SDR 2021 is more stable.
6. Discussion
The original approach of assessing the risk of re-identification in surveys from non-probability registers or non-probability samples assumes that membership to a nonrepresentative subpopulation is known in order to make a re-identification from the survey microdata. This is currently the case for SDR 2021 where weighted margins from the survey are used to fit log-linear models to estimate population parameters for disclosure risk measures despite the problems of trying to fit models with data that have selection biases and skewed distributions, such as recipients of a PhD in science, engineering, and health. The original approach of measuring disclosure risk assumes the scenario that an intruder would have access to the labeled subpopulation register from which the SDR 2021 is drawn. Whether this is a viable disclosure risk scenario depends on the statistical agency perspective and the data environment on whether the information on the particular set of PhD recipients is accessible public knowledge.
If, however, the statistical agency wants to assess the risk of re-identification in SDR 2021 against the general population of PhD recipients, we can turn to using a probability-based reference sample to obtain propensity scores for the SDR 2021 where the inverse propensity scores provide weights that are a representation of this general population. As can be seen in the application, given the discrepancies in distributions between SDR 2021 and ACS 2021 shown in Table 2, the risk of re-identification under the new approach, where the ACS 2021 is used as the probability-based reference sample, is reduced by about 76%, from 251 estimated register-based population uniques that are sample uniques to 59 estimated general population uniques that are sample uniques.
This application shows different estimates of the risk of re-identification under different disclosure risk scenarios. It is the first attempt to apply the new theory shown in Section 3 and to assess the viability of this approach for assessing disclosure risk where non-probability subpopulations and non-probability samples are not representative of the general population.
