
Abstract

National Statistical Offices (NSOs) collect extensive data on the international activities of firms within their borders. However, they typically lack information about the foreign partners with whom these firms trade. Linking import data from one NSO to corresponding export data from a partner NSO could significantly enhance statistics on firms’ international operations. While technically feasible, such linkage is legally constrained by strict privacy laws. Private set intersection (PSI) protocols may help address privacy concerns but require unique identifiers to avoid linkage errors. To overcome this limitation, we propose a PSI protocol with three innovations. First, we estimate the rates of linkage error by modeling the number of links from a given record. Second, we adjust an estimated population mean according to the estimated linkage accuracy. Lastly, our adjustment explicitly accounts for this accuracy without assuming a particular relationship among the target variables.

Keywords

international trade, privacy preserving techniques, private set intersection

1. Introduction

National Statistical Offices (NSOs) collect a wide range of detailed information about the international activities of the firms located within their national borders. From their customs agencies, they typically receive information on which products are traded with which countries, by which firms, and at what time. NSOs can then link this information to business registries to get a sense of the difference in trading activity by firms of different characteristics such as age or size. So, while NSOs have an extensive picture of the trade activities of the firms active within their borders, they do not know anything about the partner firm involved in each transaction. That information is only available to the NSO of the partner country. Linking both data sets on a transaction level could vastly improve our understanding of firms’ international trade and investment activities.

To illustrate this point, we focus on the issue of firms using preferential trade agreements when they trade internationally. While such agreements can significantly lower trading costs by reducing import tariffs, preference utilization rates, that is, the proportion of eligible trade that actually uses the trade agreement, typically stagnate at around 60% to 70% (e.g., Nilsson 2022). A better understanding of the obstacles that firms face in using trade agreements can significantly increase the utility of these agreements. And while an NSO has information that can shed light on the barriers that the importer faces, it has no information about the exporter. This is a significant shortcoming, as it is the exporter who is responsible for providing the documentation that allows the importer to apply for preferential access.

Due to the lack of a unique firm identifier, linking these data sets would require the exchange of firm names. Privacy regulations complicate this process, especially between EU and non-EU NSOs. In this situation, private set intersection can offer a solution as it can link the data sets and compute the desired summary statistics without any exchange of microdata in the clear.

Indeed, private set intersection techniques have already demonstrated their usefulness in many applications in the public and private sectors, whether to track the spread of COVID-19 (Andreea 2021; United Nations 2023, 9), enable the sharing of administrative data across different government organizations (Straus 2021), or enable users to use mobile messaging applications without disclosing all their contacts to the service provider (Andreea 2021). Meanwhile, NSOs are also experimenting with these techniques to gain access to more data sources (Bruno et al. 2018; Dugdale et al. 2022).

Private set intersection techniques are input privacy techniques because they aim to allow “two or more parties to submit data into a calculation without the other respective parties seeing data in clear” (United Nations 2023, 20). Quite a few methods have been described to implement a private set intersection, which are reviewed by Andreea (2021). In general, these methods assume the presence of a unique identifier and make no provision for linkage errors, that is, false negatives (not linking records from the same unit) and false positives (linking records from different units), where the linkage is based on quasi-identifiers, that is, nonunique variables that may be susceptible to recording errors, such as names. Two such solutions are the protocol from De Cristofaro and Tsudik (2010, Section 5.6) and its more recent extension by Bruno et al. (2018).

To deal with the linkage errors, the authors have enhanced this latter protocol with the following three innovations. First, the rates of false negatives and false positives are estimated by modeling the number of links from a given record (Dasylva and Goussanou 2020). Second, an estimated population mean (hereafter also called mean for brevity) is adjusted according to the estimated linkage accuracy. Finally, the applied adjustment explicitly accounts for the false negatives and false positives, without assuming a particular relationship (such as a generalized linear model) among the target variables. Thus, this procedure represents a natural extension of the approach described by Judson et al. (2013) where the linked records are weighted to represent all the records under the assumption that there are no false positives. To the authors’ best knowledge, the resulting protocol is the first private set intersection protocol that addresses the issue of linkage error. It may serve to estimate a mean over a finite population of export transactions from a first country into a second country, where the import and export data sets are perfect censuses of this population, and each data set includes the linkage variables as well as one or many private variables unknown to the other party. This paper is an extension of work done by the authors as a part of the United Nations Economic Commission for Europe Input Privacy Preservation (UNECE IPP) project (UNECE 2023a).

The rest of our paper is structured as follows. The next section describes the use case. It is followed by Section 3, which provides the notation, and Section 4 that gives some background on private set intersection including the protocol by Bruno et al. (2018). Section 5 covers the statistical methodology to deal with the linkage errors, while Section 6 presents the modified protocol. In Section 7, the performance of the proposed statistical methodology is examined through simulations. Finally, Section 8 discusses the limitations and potential next steps, while Section 9 concludes the paper.

2. International Trade Use Case

As stated in the introduction, our goal is to link import data from Statistics Canada to export data from Statistics Netherlands, which would allow for a better understanding of the obstacles faced by small Dutch exporters in using the Comprehensive Economic and Trade Agreement (CETA), a free trade agreement between Canada and the European Union that entered into force provisionally in 2017. While both agencies record exporter names, product codes, dates, and transaction values, only Statistics Canada records the tariff regime, and only Statistics Netherlands stores exporter details like firm size. Common variables (e.g., names, product codes, dates, and values) can serve as linkage variables, while private variables, such as Canada’s tariff data or the Netherlands’ exporter size, must remain confidential. This situation is represented in Table 1.

Table 1.

Variables Included in the Different Data Sets.

Variable	Export microdata (Statistics Netherlands)	Import microdata (Statistics Canada)	Private information
Exporter name	✓	✓	×
Exporter size	✓	×	✓
Product code	✓	✓	×
Transaction date	✓	✓	×
Transaction value	✓	✓	×
Tariff type	×	✓	✓

Note. “Private” applies to information which is only known or permitted to be known by one of the two parties and never shared openly.

Statistics Netherlands could assess the use of the preferential tariff by exporter characteristics based on the related proportion of transactions with this tariff, if it were authorized to link the two data sets in the clear. Instead, the two data sets must be linked in a manner that is privacy preserving and accounts for the potential linkage errors, since there is no unique identifier.

In practice, the import microdata and the export microdata are expected to be near censuses of the target population, which comprises Dutch export transactions into Canada, over a given reference period. For simplicity, the two data sets are assumed to be perfect censuses, and the values of the common variables are not considered confidential, with respect to the two data holding parties. However, this information is considered confidential with respect to any other third party. In general, each data set may comprise many private variables. Then the challenge is that of estimating a population mean that involves at least one private variable from each data set, while minimizing the information that is disclosed about the private variables for any specific transaction.

3. Notation and Assumptions

In what follows, two records are called matched if they refer to the same transaction in accordance with Fellegi and Sunter (1969), Newcombe (1988), and Herzog et al. (2007). Note that this use of the word “matched” departs from that in Christen (2012), where it means that two records are declared (based on the available information) to be from the same unit (a transaction here) regardless of whether this is true.

This work considers a target population of N mutually independent transactions indexed over $\{1, \dots, N\}$ . In this population, transaction i is characterized by some quasi-identifiers (e.g., the exporter name, the product code), some covariates $x_{i} = {[x_{i 1} \dots x_{ip}]}^{⊤}$ that are only known by the exporter (e.g., the exporter size), and a response y_i that is only known by the importer (e.g., the tariff type), where x_i and y_i are loosely called target variables. For convenience, define the matrix of covariates $X = {[x_{1}^{⊤} \dots x_{N}^{⊤}]}^{⊤}$ and the vector of responses $y = {[y_{1} \dots y_{N}]}^{⊤}$ . The goal is to estimate a population mean of the form $\bar{t} = N^{- 1} \sum_{i = 1}^{N} t (x_{i}, y_{i})$ , for some known function $t (., .)$ , without assuming a particular parametric or semi-parametric model for the response. The problem is cast in the finite population inference paradigm, where the inferences are conditional on the realized values of the covariates and responses in the finite population, that is, while holding $X$ and $y$ fixed. Then, the variability is entirely ascribed to the quasi-identifiers and the linkage process that is driven by them (Chambers et al. 2009). From this inference perspective, the linkage process plays the role of the sampling process in traditional sample surveys.

Each transaction is recorded separately in the import and export tables, with possible typos on the quasi-identifiers. However, the recorded covariates and response are assumed to be error-free. Since there is no unique identifier, the records are labeled independently in the two tables, such that their records correspond through an unknown permutation. Without losing any generality, this permutation is assumed to be uniformly distributed and independent of the attributes (quasi-identifiers or responses) of all the transactions. Also, no generality is lost by assuming that record i corresponds to transaction i in each table. This is equivalent to conditioning on the event that the permutation matrix is the identity. Denote by v_i and $v_{i}'$ the quasi-identifiers recorded for transaction i in the export table and in the import table, respectively. It is further assumed that the tuples $(x_{1}, y_{1}, v_{1}, v_{1}')$ , …, $(x_{N}, y_{N}, v_{N}, v_{N}')$ are mutually independent. In other words, the tuples comprising the target variables and recorded quasi-identifiers are independent across the transactions. For $i, j = 1, \dots, N$ , denote by $(i, j)$ the pair that comprises record i from the export table and record j from the import table, and denote by l_ij the corresponding linkage decision (an indicator set to 1 when there is a link). In what follows, it is assumed that this decision only involves v_i and $v_{j}'$ , and no other record (as in the private set intersection protocol) unless mentioned otherwise.

The population mean may be estimated by the naïve estimator that is based on the links, that is, $\hat{\bar{t}} = N^{- 1} \sum_{i = 1}^{N} \sum_{j = 1}^{N} l_{ij} t (x_{i}, y_{j})$ . When the linkage key is based on a unique identifier, this naïve estimator is equal to the actual population mean. Otherwise, it may be biased due to the occurrence of linkage errors, which include the false negatives (FN) and false positives (FP). In relation to these concepts, define a true positive (TP) as a matched pair that is linked, and a true negative (TN) as an unmatched pair that is not linked. The four types of record pairs are usually represented in a 2 × 2 table called a confusion matrix, as shown in Table 2. With a slight abuse of the notation, let TP, TN, FP, and FN also denote the number of true positives, true negatives, false positives, and false negatives, respectively. Then $TP = \sum_{i = 1}^{N} l_{ii}$ , $TN = \sum_{i = 1}^{N} \sum_{j = 1 : j \neq i}^{N} (1 - l_{ij})$ , $FP = \sum_{i = 1}^{N} \sum_{j = 1 : j \neq i}^{N} l_{ij}$ , and $FN = N - TP$ .

Table 2.

Confusion Matrix.

	Linked	Not linked
Matched	TP	FN
Unmatched	FP	TN

There are many measures of the linkage accuracy, which include the recall, the precision, and the false positive rate (FPR). The recall is the proportion of matched pairs that are linked (i.e., $TP / (TP + FN) = TP / N$ ). The precision is the proportion of linked pairs that are matched (i.e., $TP / (TP + FP)$ ). As for the false positive rate, it is the proportion of unmatched pairs that are linked (i.e., $FP / (TN + FP)$ ). The precision is thus a function of the recall and false positive rate. In what follows, the false positive rate is considered very small if it is smaller than $0.01 / (N - 1)$ , that is, less than 0.01 false positives per export record on average.
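As a minimal sketch of these definitions, the following Python snippet counts the four cells of the confusion matrix and derives the recall, precision, and false positive rate. It relies on the labeling convention of this section, under which the pair $(i, j)$ is matched exactly when i = j. The function names and the toy set of links are illustrative assumptions, and the truth is of course unknown in a real linkage.

```python
def confusion_counts(links, n):
    """Count TP, FN, FP, TN over the n x n record pairs.

    links: set of (i, j) pairs that were linked, with i an export record and
    j an import record. Pair (i, j) is matched exactly when i == j.
    """
    tp = sum(1 for (i, j) in links if i == j)
    fp = sum(1 for (i, j) in links if i != j)
    fn = n - tp                      # matched pairs that were not linked
    tn = n * (n - 1) - fp            # unmatched pairs that were not linked
    return tp, fn, fp, tn


def accuracy_measures(tp, fn, fp, tn):
    """Return (recall, precision, false positive rate); assumes n >= 2."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    fpr = fp / (tn + fp)
    return recall, precision, fpr


# Toy example with N = 4: three correct links and one false link (3, 1).
links = {(0, 0), (1, 1), (2, 2), (3, 1)}
tp, fn, fp, tn = confusion_counts(links, 4)
recall, precision, fpr = accuracy_measures(tp, fn, fp, tn)
print(tp, fn, fp, tn, recall, precision, fpr)
```

In this toy example the false positive rate is 1/12, far above the $0.01 / (N - 1)$ threshold used in the text, which illustrates how demanding that threshold is.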

Finally, the standard assumption is made that the linkage errors are noninformative, that is, the linkage decisions (i.e., ${[l_{ij}]}_{1 \leq i, j \leq N}$ ) are independent of the responses $y$ given the covariates $X$ . This last assumption and the mutual independence of the tuples $(x_{1}, y_{1}, v_{1}, v_{1}')$ , …, $(x_{N}, y_{N}, v_{N}, v_{N}')$ imply that

\begin{array}{l} E [l_{i i} | X, y] = E [l_{i i} | x_{i}], \\ E [l_{i j} | X, y] = E [l_{i j} | x_{i}, x_{j}], i \neq j . \end{array}

4. A Private Set Intersection Protocol

A private set intersection is an example of secure multi-party computation (United Nations 2023, 28), where two parties determine the units represented in both their data sets without revealing any information about the other units. Additionally, some statistics (e.g., a total or a mean) may be computed over the intersection.

Bruno et al. (2018) describe such a protocol, which comprises a setup phase, a loading phase, and a query phase, in this order. In the setup phase, the two parties determine the intersection of their data sets by exchanging their lists of unique identifiers that are encoded through hashing and encryption (see Appendix A for details). In the loading phase, each party encrypts the records (including the unique identifier and target variables) from the intersection with a symmetric encryption key, and it uploads them to a neutral third party called the linker, who does not know the symmetric key. The two data sets are then linked by the third party with the encoded unique identifier. Finally, in the query phase, one of the two data-holding parties may send a request for a total over the intersection to the linker, who responds with the requested total.

The protocol is secure if the parties are non-colluding and honest but curious, that is, no two parties collaborate to defeat the protocol, and each party scrupulously follows the protocol but learns anything it can about the other parties. However, it must be adapted to deal with the linkage errors, which may occur when there is no unique identifier, such as when linking international trade data. While many other private set intersection protocols may be considered, such as the one by Pinkas et al. (2019), none of them address linkage errors. The protocol by Bruno et al. (2018) is chosen because it is easier to adapt to handle this issue than the other options.

5. Statistical Methodology

In the protocol by Bruno et al. (2018), two records are linked if they perfectly agree on a linkage key that is ideally based on a unique identifier. However, such an identifier may be unavailable, for example, when linking international trade data from different countries. Instead, the linkage key may be a concatenation of many quasi-identifiers, such as the exporter name, product code, and transaction date, at the risk of introducing linkage errors and potential bias. In this section, the authors propose to estimate the linkage accuracy with a model, and to adjust the estimated mean accordingly with a generalization of the method described by Judson et al. (2013). While the original method only adjusts for the false negatives, the proposed extension also accounts for the false positives. It requires the reliable estimation of the linkage accuracy, which is discussed next.
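To make the notion of a concatenated linkage key concrete, here is a minimal Python sketch. It is not the encoding of Bruno et al. (2018), which combines hashing and encryption as described in Appendix A. The field names, the cleaning rule in `normalize`, and the use of a keyed hash (HMAC-SHA256) with a shared secret are all assumptions made for illustration only.

```python
import hashlib
import hmac
import unicodedata


def normalize(value):
    """Assumed cleaning rule: strip accents, uppercase, collapse whitespace."""
    text = unicodedata.normalize("NFKD", str(value))
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return " ".join(text.upper().split())


def linkage_key(record, fields, secret):
    """Concatenate the chosen quasi-identifiers and encode them with HMAC-SHA256."""
    plain = "|".join(normalize(record[f]) for f in fields)
    return hmac.new(secret, plain.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical records and shared secret.
secret = b"shared-secret-agreed-by-both-parties"
fields = ["exporter_name", "product_code", "date"]
export_rec = {"exporter_name": "Café  Holland B.V.", "product_code": "0406", "date": "2023-05-01"}
import_rec = {"exporter_name": "CAFE HOLLAND B.V.", "product_code": "0406", "date": "2023-05-01"}
typo_rec = {"exporter_name": "CAFE HOLAND B.V.", "product_code": "0406", "date": "2023-05-01"}

key_export = linkage_key(export_rec, fields, secret)
key_import = linkage_key(import_rec, fields, secret)
key_typo = linkage_key(typo_rec, fields, secret)
print(key_export == key_import, key_export == key_typo)
```

The third record shows how a single typo in the exporter name changes the key and produces a false negative under exact agreement, which is the kind of error the methodology of this section is meant to handle.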

5.1. Estimating the Linkage Accuracy

Adjusting the statistics derived from the linked data requires estimates of the recall and false positive rate. In practice, these measures are typically estimated with clerical reviews on a probability sample of record pairs, if the records are linked in the clear (Dasylva et al. 2016). These reviews consist of the visual inspection of the sampled pairs to determine if they are matched. Unfortunately, such visual inspections are usually impossible in a privacy-preserving setting. Instead, one may consider one of the many statistical models that are reviewed by Dasylva and Goussanou (2024). These solutions include models of the number of links from a record (Blakely and Salmond 2002; Dasylva and Goussanou 2020), which are preferred to other options because they have modest computation requirements, implicitly account for all the interactions among the linkage variables, and are not limited to the probabilistic method of record linkage. Thus, they are easier to use in the current setup than the other options, provided that the decision to link two records involves no other record, for example, if the linkage is based on the perfect agreement of a linkage key. In this class of models, the model by Dasylva and Goussanou (2020) has the additional advantage of accounting for the heterogeneity of the records. It operates by modeling the number of links from a given record with a finite mixture, where each component is the convolution of a Bernoulli distribution with an independent Poisson distribution. In this model, a component represents the latent class of a record. To be specific, denote by n_i the number of links from record i, in the export table. Then the model is

\begin{matrix} n_{i} \sim \sum_{g = 1}^{G} α_{g} \, \mathrm{Bernoulli} (p_{g}) * \mathrm{Poisson} (λ_{g}) \end{matrix},

(1)

where * denotes the convolution operator, G is the number of latent classes, $α_{g}$ is the probability of class g, and p_g and $λ_{g}$ are the expected number of true positives and false positives per record from the class, respectively. The model parameters are related to the accuracy measures through the weighted sum of the p_g’s and $λ_{g}$ ’s, that is, $\bar{p} = \sum_{g = 1}^{G} α_{g} p_{g}$ and $\bar{λ} = \sum_{g = 1}^{G} α_{g} λ_{g}$ . Indeed, the recall, precision, and false positive rate are estimated consistently by $\bar{p}$ , $\bar{p} / (\bar{p} + \bar{λ})$ , and $\bar{λ} / (N - 1)$ , respectively, provided that the expected number of false positives per record is bounded above by a constant independent of N, and other regularity conditions hold (Dasylva and Goussanou 2022), which are detailed in Appendix B. The same conditions ensure that the maximum likelihood estimator is consistent, such that

\begin{matrix} \frac{TP}{TP + FN} & \to & \bar{p}, \\ \frac{TP}{TP + FP} & \to & \frac{\bar{p}}{\bar{p} + \bar{λ}}, \\ \frac{(N - 1) FP}{TN + FP} & \to & \bar{λ}, \end{matrix}

(2)

as $N \to + \infty$ , where the → symbol indicates a convergence in probability. Equation (2) implies that the linkage accuracy may be estimated consistently without clerical reviews. The model parameters may be estimated by maximizing the likelihood of the n_i’s numerically, where the number of classes G is chosen by minimizing Akaike’s information criterion (Akaike 1974; Dasylva and Goussanou 2022), which is denoted by $AIC (G)$ . This criterion is based on the difference between the number of model parameters (denoted by $k (G)$ ) and the maximum log-likelihood (denoted by $\hat{ℓ} (G)$ ) when there are G classes, that is, $AIC (G) = 2 (k (G) - \hat{ℓ} (G))$ . For the mixture model based on Equation (1) with no restrictions, $k (G) = 3 G - 1$ . With the constraint $p_{1} = \dots = p_{G}$ , we instead have $k (G) = 2 G$ . To prevent overfitting, G is set to the value minimizing $AIC (G)$ , thus penalizing models that have more parameters.
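The likelihood and the criterion can be sketched in a few lines of Python. In Equation (1), a record from class g has n links either with no true positive and n false positives, or with one true positive and n - 1 false positives, which gives the probability mass function below. The function names and parameter values are illustrative assumptions, and the numerical maximization that yields the maximum log-likelihood is not shown.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of k events with mean lam (zero for k < 0)."""
    if k < 0:
        return 0.0
    if lam == 0:
        return 1.0 if k == 0 else 0.0
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def link_count_pmf(n, alphas, ps, lams):
    """P(n_i = n) for the Bernoulli-Poisson mixture of Equation (1)."""
    return sum(
        a * ((1 - p) * poisson_pmf(n, lam) + p * poisson_pmf(n - 1, lam))
        for a, p, lam in zip(alphas, ps, lams)
    )


def log_likelihood(counts, alphas, ps, lams):
    """Log-likelihood of the link counts, treated as independent.

    Assumes every observed count has positive probability under the parameters.
    """
    return sum(math.log(link_count_pmf(n, alphas, ps, lams)) for n in counts)


def aic(max_loglik, n_classes, equal_p=False):
    """AIC(G) = 2 (k(G) - max log-likelihood), with k(G) = 3G - 1 or 2G."""
    k = 2 * n_classes if equal_p else 3 * n_classes - 1
    return 2 * (k - max_loglik)


# Hypothetical two-class parameters.
alphas, ps, lams = [0.7, 0.3], [0.9, 0.6], [0.05, 0.5]
total = sum(link_count_pmf(n, alphas, ps, lams) for n in range(60))
mean = sum(n * link_count_pmf(n, alphas, ps, lams) for n in range(60))
print(total, mean, log_likelihood([1, 1, 0, 2, 1], alphas, ps, lams))
```

The mean number of links per record equals $\bar{p} + \bar{λ}$ (0.81 + 0.185 here), which is a quick sanity check on fitted parameters.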

The model may be fitted to $n_{1}, \dots, n_{m}$ , where m is a nondecreasing function of N, which goes to infinity as N goes to infinity. Obviously, using a larger m leads to a smaller variance for the maximum likelihood estimator, which suggests the choice m = N. However, estimating the variance of the resulting estimator may be challenging when m is too large because the n_i’s are correlated. In Dasylva and Goussanou (2024), it is shown that $n_{1}, \dots, n_{m}$ are approximately independent when $m = o (\sqrt{N})$ , such as the optimal choice $m = O (N^{2 / 5})$ that is suggested in the same paper. To estimate the variance, it is beneficial to make this latter choice, even if the resulting estimator has a larger variance than when m=N. Even then, the resulting variance may be smaller than that obtained with clerical reviews.

The same methodology applies when hashing the records into Bloom filters, which are long strings of zeros and ones (Schnell 2016). Then, two records are linked according to the number of bit positions that are set to one in the corresponding filters. Bloom filters have been used to privately link health records (Schnell 2016). The proposed model may also serve to evaluate the linkage accuracy when it varies across known strata (e.g., based on the covariates x_i), or when enforcing the constraint that each export record is linked to at most one import record, for example, to reduce the false positives or for operational reasons. Indeed, the accuracy within a stratum may be estimated by fitting the mixture model therein, that is, by maximizing the log-likelihood of the n_i’s that are from the stratum records. Furthermore, the same model provides a basis for deriving the linkage accuracy when deleting all the links where an export record has many links, resulting in at most one link per record on the export side. In this case, Equation (2) is replaced by the following equation (Dasylva and Goussanou 2022, 17), since the decision to link two records now involves other records.

\begin{matrix} \frac{TP}{TP + FN} & \to & \sum_{g = 1}^{G} α_{g} p_{g} e^{- λ_{g}}, \\ \frac{TP}{TP + FP} & \to & \frac{\sum_{g = 1}^{G} α_{g} p_{g} e^{- λ_{g}}}{\sum_{g = 1}^{G} α_{g} e^{- λ_{g}} (p_{g} + (1 - p_{g}) λ_{g})}, \\ \frac{(N - 1) FP}{TN + FP} & \to & \sum_{g = 1}^{G} α_{g} (1 - p_{g}) λ_{g} e^{- λ_{g}}, \end{matrix}

(3)

where the → symbol indicates a convergence in probability. It is important to note that the parameters are based on the n_i’s before deleting some links. Since $p_{g} e^{- λ_{g}} \leq p_{g}$ and $(1 - p_{g}) λ_{g} e^{- λ_{g}} \leq λ_{g}$ for each g, the recall and false positive rate cannot increase when these links are deleted. The following proposition states that, in return, the precision does not decrease.

Proposition 1: Suppose that all the p_g’s are equal or $p_{g} > 1 / 2$ for each g. Then

\frac{\sum_{g = 1}^{G} α_{g} p_{g} e^{- λ_{g}}}{\sum_{g = 1}^{G} α_{g} e^{- λ_{g}} (p_{g} + (1 - p_{g}) λ_{g})} \geq \frac{\bar{p}}{\bar{p} + \bar{λ}} .

The proof is given in Appendix B.
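The limits in Equations (2) and (3) are easy to compare numerically. The Python sketch below evaluates both sets of limits for hypothetical parameter values that satisfy the condition $p_{g} > 1 / 2$ ; the function names and the values are assumptions chosen for illustration.

```python
import math


def accuracy_all_links(alphas, ps, lams):
    """Limits in Equation (2): recall, precision, (N - 1) x false positive rate."""
    p_bar = sum(a * p for a, p in zip(alphas, ps))
    lam_bar = sum(a * lam for a, lam in zip(alphas, lams))
    return p_bar, p_bar / (p_bar + lam_bar), lam_bar


def accuracy_single_links(alphas, ps, lams):
    """Limits in Equation (3), after deleting the links of multiply linked records."""
    # A true positive survives when the record has no false positive.
    tp = sum(a * p * math.exp(-lam) for a, p, lam in zip(alphas, ps, lams))
    # A false positive survives when it is the only link from the record.
    fp = sum(a * (1 - p) * lam * math.exp(-lam) for a, p, lam in zip(alphas, ps, lams))
    return tp, tp / (tp + fp), fp


# Hypothetical two-class parameters with every p_g above 1/2.
alphas, ps, lams = [0.7, 0.3], [0.9, 0.6], [0.05, 0.5]
recall_all, precision_all, fp_all = accuracy_all_links(alphas, ps, lams)
recall_one, precision_one, fp_one = accuracy_single_links(alphas, ps, lams)
print(recall_all, precision_all, fp_all)
print(recall_one, precision_one, fp_one)
```

For these values, deleting the multiple links lowers the recall and the false positive rate while the precision comes out higher than before the deletion.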

The estimated linkage accuracy provides a basis for deciding whether to ignore the errors or to adjust for them. In general, the first option may be chosen if the estimated accuracy is sufficiently high, that is, the recall and precision are both sufficiently high relative to the population mean being estimated. The second option is adjusting the naïve estimator. When the function $t (., .)$ is an indicator (i.e., $t (x_{i}, y_{j})$ is equal to 0 or 1), the decision to ignore the linkage errors or adjust for them may be based on the absolute value of the relative error between the naïve estimator and the population mean $\bar{t}$ (which is a proportion in this case), that is, $| E [\hat{\bar{t}} | X, y] - \bar{t} | / \bar{t}$ . Indeed

\begin{array}{rl} \frac{| E [\hat{\bar{t}} | X, y] - \bar{t} |}{\bar{t}} & \leq \frac{1}{\bar{t} N} \left[ \sum_{i = 1}^{N} (1 - E [l_{ii} | X, y]) t (x_{i}, y_{i}) + \sum_{i = 1}^{N} \sum_{j \neq i} E [l_{ij} | X, y] t (x_{i}, y_{j}) \right] \\ & \leq \frac{1}{\bar{t} N} \left[ \sum_{i = 1}^{N} (1 - E [l_{ii} | X, y]) + \sum_{i = 1}^{N} \sum_{j \neq i} E [l_{ij} | X, y] \right] \\ & = \frac{1}{\bar{t} N} \left[ N (1 - π) + N (N - 1) q \right] \\ & = \frac{(1 - \bar{p}) + \bar{λ}}{\bar{t}} . \end{array}

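In the last two lines, π and q denote the average probabilities of a true positive and of a false positive, so that π and $(N - 1) q$ correspond to $\bar{p}$ and $\bar{λ}$ , respectively. Since $\bar{t}$ is unknown, the decision to ignore the linkage errors may be based on the ratio $(1 - \bar{p} + \bar{λ}) / \hat{\bar{t}}$ . For example, the errors may be ignored if $(1 - \bar{p} + \bar{λ}) / \hat{\bar{t}}$ is smaller than some positive threshold ε. Otherwise, the naïve estimator may be corrected as discussed in the next section.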
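The decision rule amounts to a one-line computation, sketched below in Python. The function names, the estimates, and the two thresholds are hypothetical; the choice of ε is left to the analyst.

```python
def relative_error_bound(p_bar, lam_bar, t_naive):
    """Ratio (1 - p_bar + lam_bar) / t_naive used as a proxy for the relative error."""
    if t_naive <= 0:
        raise ValueError("the naive estimate must be positive")
    return (1.0 - p_bar + lam_bar) / t_naive


def ignore_linkage_errors(p_bar, lam_bar, t_naive, eps):
    """True when the bound falls below the chosen threshold eps."""
    return relative_error_bound(p_bar, lam_bar, t_naive) < eps


# Hypothetical estimates: recall 0.98, 0.01 false positives per record,
# and a naive proportion of 0.6.
bound = relative_error_bound(0.98, 0.01, 0.6)
print(bound, ignore_linkage_errors(0.98, 0.01, 0.6, eps=0.10))
print(ignore_linkage_errors(0.98, 0.01, 0.6, eps=0.02))
```

With these values the bound is about 0.05, so the errors would be ignored with ε = 0.10 but not with ε = 0.02.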

5.2. Accounting for the Linkage Errors

Thanks to the pioneering work of Neter et al. (1965), it has long been known that ignoring linkage errors can seriously bias the population estimates that are derived from linked data. Since then, many methods have been proposed to account for these errors, of which most fit within the super-population inference paradigm, where the inferences are not conditional on the realized values of the response in the finite population, and the focus is on a particular parametric or semi-parametric model of the response. See Han (2018) for a review of these solutions.

Weighting the linked records: Judson et al. (2013) and Christidis et al. (2018) describe an alternative approach that is common in health and social studies. It is also better suited to the finite population inference paradigm and does not assume a particular response model, in contrast to the situation under the super-population paradigm. This approach consists of linking the records with a high precision, and reweighting the linked records to account for the false negatives, under the assumption that there are no false positives. Then, the linkage process is essentially equivalent to a sampling process (the sample comprising the true positives), so that one may draw from classical sampling theory to derive consistent estimators and study their properties, for example, the Horvitz-Thompson and Hajek estimators. Indeed, denote by $π_{i}$ the probability of a true positive for transaction i, that is, $π_{i} = E [l_{ii} | X, y] = E [l_{ii} | x_{i}]$ . Then the Horvitz-Thompson estimator of the population mean is

{\hat{\bar{t}}}_{H T} = \frac{1}{N} \sum_{i = 1}^{N} \frac{l_{i i}}{π_{i}} t (x_{i}, y_{i}),

(4)

while Hajek’s estimator is

{\hat{\bar{t}}}_{H} = \frac{\sum_{i = 1}^{N} π_{i}^{- 1} l_{i i} t (x_{i}, y_{i})}{\sum_{i = 1}^{N} π_{i}^{- 1} l_{i i}} .

(5)

It is also possible to use a calibrated estimator of the form

{\hat{\bar{t}}}_{C} = \frac{\sum_{i = 1}^{N} w_{i} l_{i i} t (x_{i}, y_{i})}{\sum_{i = 1}^{N} w_{i} l_{i i}},

(6)

where the weight w_i is obtained by calibrating the sampling weight $1 / π_{i}$ to known population totals based on auxiliary variables such as age-sex groups in social and health studies. The variance of the Horvitz-Thompson estimator is easy to derive when the decision to link two records involves no other record and the recall is known. In this case, the linked transactions correspond to a Poisson sample, or a Bernoulli sample if all the $π_{i}$ ’s are the same, so that

\mathrm{var} ({\hat{\bar{t}}}_{H T} | X, y) = \frac{1}{N^{2}} \sum_{i = 1}^{N} \left( \frac{1}{π_{i}} - 1 \right) t {(x_{i}, y_{i})}^{2},

which is estimated without bias by

\widehat{\mathrm{var}} ({\hat{\bar{t}}}_{H T} | X, y) = \frac{1}{N^{2}} \sum_{i = 1}^{N} \frac{l_{i i}}{π_{i}} \left( \frac{1}{π_{i}} - 1 \right) t {(x_{i}, y_{i})}^{2} .

The variances of the Hajek and calibrated estimators have no closed-form expressions, but they may be estimated by linearization or resampling. The same is true for the Horvitz-Thompson estimator when the recall is estimated.
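A minimal Python sketch of Equations (4) and (5) and of the variance estimator above follows, under the same assumptions as in the text (no false positives and known true positive probabilities). The function names and the toy data are illustrative; the value of $t (x_{i}, y_{i})$ is only needed for the linked transactions, so it is set to `None` for the unlinked one.

```python
def horvitz_thompson(t_values, linked, pis):
    """Equation (4): weighted sum over the true positives, divided by N."""
    n = len(t_values)
    return sum(t / p for t, l, p in zip(t_values, linked, pis) if l) / n


def hajek(t_values, linked, pis):
    """Equation (5): ratio of the weighted total to the sum of the weights."""
    num = sum(t / p for t, l, p in zip(t_values, linked, pis) if l)
    den = sum(1.0 / p for l, p in zip(linked, pis) if l)
    return num / den


def ht_variance_estimate(t_values, linked, pis):
    """Variance estimator of the Horvitz-Thompson mean when the pi_i are known."""
    n = len(t_values)
    total = sum((1.0 / p) * (1.0 / p - 1.0) * t ** 2
                for t, l, p in zip(t_values, linked, pis) if l)
    return total / n ** 2


# Toy population of N = 5 transactions; the third one is a false negative.
t_values = [1, 0, None, 1, 0]   # t(x_i, y_i), unknown where there is no link
linked = [1, 1, 0, 1, 1]        # l_ii
pis = [0.8] * 5                 # known true positive probabilities

ht = horvitz_thompson(t_values, linked, pis)
hj = hajek(t_values, linked, pis)
var_ht = ht_variance_estimate(t_values, linked, pis)
print(ht, hj, var_ht)
```

The two point estimators coincide here because the sum of the weights over the linked records happens to equal N; in general they differ.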

New point estimators: A potential limitation of the above methodology is the assumption that the false positives are negligible. The authors have addressed this limitation with two distinct linkages, where the first linkage may be stricter than the second one. For example, the first linkage key may concatenate the exporter name, product code, transaction date, and transaction value, while the second linkage may only concatenate the exporter name, product code, and transaction date. For the pair $(i, j)$ , denote the linkage decisions associated with the first and second linkages by $l_{ij}^{(1)}$ and $l_{ij}^{(2)}$ , and the corresponding true positive probabilities by $π_{i}^{(1)} = E [l_{ii}^{(1)} | x_{i}]$ and $π_{i}^{(2)} = E [l_{ii}^{(2)} | x_{i}]$ . Also, define the functions

\begin{matrix} q^{(1)} (x, x') & = & E [l_{ij}^{(1)} | x_{i} = x, x_{j} = x'] & \text{for } i \neq j, \\ q^{(2)} (x, x') & = & E [l_{ij}^{(2)} | x_{i} = x, x_{j} = x'] & \text{for } i \neq j, \end{matrix}

which give the false positive probabilities for an unmatched pair given x_i and x_j, and observe that the right-hand sides do not depend on i or j. For notational convenience, let $q_{ij}^{(1)} = q^{(1)} (x_{i}, x_{j})$ , $q_{ij}^{(2)} = q^{(2)} (x_{i}, x_{j})$ .

Proposition 2: Suppose that $π_{i}^{(1)} - (q_{ij}^{(1)} / q_{ij}^{(2)}) π_{i}^{(2)} \neq 0$ for each i and j. Then, an unbiased estimator of the population mean $\bar{t}$ is

{\hat{\bar{t}}}_{A}^{(1)} = \frac{1}{N} \sum_{i = 1}^{N} \sum_{j = 1}^{N} \frac{l_{i j}^{(1)} - (q_{i j}^{(1)} / q_{i j}^{(2)}) l_{i j}^{(2)}}{π_{i}^{(1)} - (q_{i j}^{(1)} / q_{i j}^{(2)}) π_{i}^{(2)}} t (x_{i}, y_{j}) .

(7)

The proof is given in Appendix C.
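The main idea can be sketched in a few lines; this is an informal outline based on the definitions above, not the proof of Appendix C, and it implicitly requires $q_{ij}^{(2)} > 0$ . For an unmatched pair, the linear combination of the two linkage decisions has a null expectation, while for a matched pair its expectation equals the denominator in Equation (7):

\begin{array}{rll} E [l_{ij}^{(1)} - (q_{ij}^{(1)} / q_{ij}^{(2)}) l_{ij}^{(2)} | X, y] & = q_{ij}^{(1)} - (q_{ij}^{(1)} / q_{ij}^{(2)}) q_{ij}^{(2)} = 0, & i \neq j, \\ E [l_{ii}^{(1)} - (q_{ii}^{(1)} / q_{ii}^{(2)}) l_{ii}^{(2)} | X, y] & = π_{i}^{(1)} - (q_{ii}^{(1)} / q_{ii}^{(2)}) π_{i}^{(2)} . & \end{array}

Taking the expectation of Equation (7) term by term, the terms with $i \neq j$ vanish and each term with $i = j$ reduces to $t (x_{i}, y_{i})$ , so that

E [{\hat{\bar{t}}}_{A}^{(1)} | X, y] = \frac{1}{N} \sum_{i = 1}^{N} t (x_{i}, y_{i}) = \bar{t} .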

The estimator ${\hat{\bar{t}}}_{A}^{(1)}$ corresponds to a two-step adjustment procedure, where the first step combines the two linkage decisions linearly to remove the bias due to the false positives. In the second step, this linear combination is weighted to account for the false negatives. The estimator coincides with the Horvitz-Thompson estimator ${\hat{\bar{t}}}_{HT}$ if $q_{ij}^{(1)}$ is null (i.e., no false positive for the first linkage) and $q_{ij}^{(2)}$ is positive for each i and j. Besides, if the first linkage is stricter than the second one, we have $l_{ij}^{(1)} l_{ij}^{(2)} = l_{ij}^{(1)}$ and $q_{ij}^{(1)} / q_{ij}^{(2)} \leq 1$ .

In what follows, it is assumed that the false positive probabilities are such that $q_{ij}^{(1)} = q^{(1)}$ and $q_{ij}^{(2)} = q^{(2)}$ , where $q^{(1)}$ and $q^{(2)}$ do not depend on i or j. Also, the linkage parameters are estimated by ${\hat{π}}_{i}^{(1)}$ , ${\hat{π}}_{i}^{(2)}$ , ${\hat{q}}^{(1)}$ , and ${\hat{q}}^{(2)}$ . In this case, we have the following estimator.

{\hat{\bar{t}}}_{A}^{(2)} = \frac{1}{N} \sum_{i = 1}^{N} \sum_{j = 1}^{N} \frac{l_{i j}^{(1)} - ({\hat{q}}^{(1)} / {\hat{q}}^{(2)}) l_{i j}^{(2)}}{{\hat{π}}_{i}^{(1)} - ({\hat{q}}^{(1)} / {\hat{q}}^{(2)}) {\hat{π}}_{i}^{(2)}} t (x_{i}, y_{j}) .

(8)

Another simplification occurs when the probabilities $π_{i}^{(1)}$ and $π_{i}^{(2)}$ are uniform, that is, $π_{i}^{(1)} = π^{(1)}$ and $π_{i}^{(2)} = π^{(2)}$ , with the corresponding estimators denoted by ${\hat{π}}^{(1)}$ and ${\hat{π}}^{(2)}$ . Then,

{\hat{\bar{t}}}_{A}^{(3)} = \frac{\sum_{i = 1}^{N} \sum_{j = 1}^{N} l_{i j}^{(1)} t (x_{i}, y_{j}) - ({\hat{q}}^{(1)} / {\hat{q}}^{(2)}) \sum_{i = 1}^{N} \sum_{j = 1}^{N} l_{i j}^{(2)} t (x_{i}, y_{j})}{N ({\hat{π}}^{(1)} - ({\hat{q}}^{(1)} / {\hat{q}}^{(2)}) {\hat{π}}^{(2)})} .

(9)

Further simplifications occur if the function $t (., .)$ is of the form $t (x_{i}, y_{j}) = t_{x} (x_{i}) t_{y} (y_{j})$ for known functions $t_{x} (.)$ and $t_{y} (.)$ . For example, $t_{x} (.)$ may indicate whether the exporter is large, while $t_{y} (.)$ indicates whether the tariff is preferential, as in the use case. This particular form of $t (., .)$ greatly facilitates the derivation of the variance that is discussed later. It is also possible to consistently estimate the population mean with a single linkage instead of two by setting $l_{ij}^{(2)} = 1$ , that is, by treating all the pairs in the Cartesian product as linked, without actually implementing the second linkage. This latter point is crucial. In that case, the recall and the false positive probability of the second linkage are both equal to one, and the second double sum in Equation (9) factors into a product of two single sums. This leads to the following estimator.

{\hat{\bar{t}}}_{A}^{(4)} = \frac{\sum_{i = 1}^{N} \sum_{j = 1}^{N} l_{i j}^{(1)} t (x_{i}, y_{j}) - {\hat{q}}^{(1)} (\sum_{i = 1}^{N} t_{x} (x_{i})) (\sum_{j = 1}^{N} t_{y} (y_{j}))}{N ({\hat{π}}^{(1)} - {\hat{q}}^{(1)})} .

(10)
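As an illustration of the single-linkage estimator, the following Python sketch works from the list of linked pairs rather than a full indicator matrix, assuming the product form of $t (., .)$ . The function name `product_form_mean` and its arguments are hypothetical.

```python
def product_form_mean(links, tx, ty, pi1_hat, q1_hat):
    """Illustrative single-linkage estimator for t(x, y) = tx(x) * ty(y).

    links   : iterable of (i, j) pairs linked by the first key
    tx, ty  : length-N sequences with tx[i] = t_x(x_i) and ty[j] = t_y(y_j)
    pi1_hat : estimated recall of the first linkage
    q1_hat  : estimated false positive probability of the first linkage
    """
    n = len(tx)
    if len(ty) != n:
        raise ValueError("tx and ty must have the same length")
    denom = n * (pi1_hat - q1_hat)
    if denom == 0:
        raise ValueError("pi1_hat and q1_hat must differ")
    # Total over the pairs that were actually linked.
    linked_total = sum(tx[i] * ty[j] for i, j in links)
    # Total over the whole Cartesian product, obtained from two
    # single sums thanks to the product form.
    cartesian_total = sum(tx) * sum(ty)
    return (linked_total - q1_hat * cartesian_total) / denom
```

The design point is that the Cartesian-product term costs two passes over N values, so no second linkage needs to be run.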

Variance estimation: The authors have developed a bootstrap procedure to estimate the variance of ${\hat{\bar{t}}}_{A}^{(4)}$ (from Equation (10)). This procedure is described in Appendix C, and it exploits the product form of the function $t (., .)$ . In future work, this solution should be extended to the other proposed estimators.

6. Modified Protocol

To handle the linkage errors, the authors have modified the protocol by Bruno et al. (2018) under the assumption that $E [l_{ii}^{(1)} | X, y] = π^{(1)}$ , $E [l_{ij}^{(1)} | X, y] = q^{(1)}$ , and $E [l_{ij}^{(2)} | X, y] = q^{(2)}$ , for distinct i and j. While the modified protocol does not include the variance estimation, this feature can be added without difficulty. In detail, the modified protocol incorporates the following changes.

Setup: In this step, there are three changes. First, the setup now comprises four rounds for two distinct linkage keys (two rounds per key), instead of the two rounds previously used for a single key. Second, each round incorporates the computation of the n_i’s and the subsequent estimation of the linkage accuracy. For a given client record (i.e., a record from the data-holding party that is the client in the round), n_i is set to the number of server records (i.e., records from the data-holding party that is the server in the round) with the same value of the encoded linkage key. Then, this information is used to estimate the linkage accuracy according to Equations (1) and (2). Third, the intersection of the two data sets is based on the two linkage keys instead of a single key. To be specific, a client record i is placed in the intersection if n_i is positive for at least one linkage key. When the second linkage key is laxer than the first one (i.e., perfect agreement on the first key implies perfect agreement on the second key), this means that the record is placed in the intersection if n_i is positive for the second linkage key. Figure A2 in Appendix A shows the modified setup phase.
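The counting and intersection rules of the setup can be sketched in Python as follows. This is only an illustration: plain strings stand in for the encoded linkage keys, the cryptographic rounds are omitted, and the function names `link_counts` and `in_intersection` are hypothetical.

```python
from collections import Counter


def link_counts(client_keys, server_keys):
    """Return n_i for each client record: the number of server
    records carrying the same (encoded) linkage key value."""
    server_counter = Counter(server_keys)
    # A Counter returns 0 for keys that never occur on the server side.
    return [server_counter[key] for key in client_keys]


def in_intersection(n_first_key, n_second_key):
    """A client record is placed in the intersection if n_i is
    positive for at least one of the two linkage keys."""
    return [a > 0 or b > 0 for a, b in zip(n_first_key, n_second_key)]
```

For example, with a stricter first key and a laxer second key, a record that fails on the first key but agrees on the second one is still retained.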

Loading phase: In this step, the change is that the two encoded linkage keys are included in the intersection data set, instead of the single encoded key used previously. The remaining details are as before, that is, the intersection data set also includes the target variables, it is encrypted with a symmetric key, and it is sent to the linker by each data-holding party.

Query phase: In this step, there are two changes. The first change is implemented at the linker, who responds to a count request (e.g., the number of transactions for a given exporter type and tariff type) from a data-holding party by computing two totals, one for each linkage key. These totals are $\sum_{i} \sum_{j} l_{ij}^{(1)} t (x_{i}, y_{j})$ and $\sum_{i} \sum_{j} l_{ij}^{(2)} t (x_{i}, y_{j})$ , which appear on the right-hand side of Equation (9). The second change is implemented by the receiving data-holding party, who estimates the requested count based on Equation (9), the received totals, and the estimated recall and false positive rate from the setup phase.
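The division of work in the query phase can be sketched in Python under simplifying assumptions: the target function has the product form, each record is a dictionary holding a tuple of two key values under `"keys"` and its factor of the target function under `"t"`, and encryption is left out. All names below are hypothetical and do not come from the UNECE repository.

```python
from collections import defaultdict


def linked_total(client_records, server_records, key_index):
    """Linker side: sum of t_x(x_i) * t_y(y_j) over all pairs (i, j)
    that agree on the linkage key selected by key_index."""
    server_sum = defaultdict(float)
    for rec in server_records:
        server_sum[rec["keys"][key_index]] += rec["t"]
    return sum(
        rec["t"] * server_sum.get(rec["keys"][key_index], 0.0)
        for rec in client_records
    )


def estimate_mean(total1, total2, n, pi1_hat, pi2_hat, q1_hat, q2_hat):
    """Receiving party: combine the two totals as in Equation (9)."""
    ratio = q1_hat / q2_hat
    denom = n * (pi1_hat - ratio * pi2_hat)
    if denom == 0:
        raise ValueError("adjustment denominator is zero")
    return (total1 - ratio * total2) / denom
```

Grouping the server records by key value first keeps the linker's work linear in the number of records instead of quadratic.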

The above protocol has been implemented in Python, in the special case where the second key is error-free (i.e., $E [l_{ii}^{(2)}] = 1$ ) and laxer than the first linkage key. This implementation is available on the GitHub repository of the UNECE Input Privacy Preserving project (UNECE 2023b).

7. Simulations

Simulations are performed to evaluate the error estimation and error adjustment procedures, when the goal is to estimate the proportion of transactions with a preferential tariff, that is, the unweighted Preferential Utilization Rate, for small and large Dutch exporters. They are based on mock transactions and are implemented in R without running the implemented private set intersection protocol.

7.1. Simulation Setup

The simulations comprise the following two steps.

First step: A finite population of transactions is generated, including the product code, date, value, and tariff preference for each transaction. This population comprises two strata of equal sizes for small and large Dutch exporters. The product code is selected with replacement from a list of ten six-digit codes in keeping with the standard format of these codes (World Customs Organization 2022). The date is sampled with replacement from the days in the year 2021. The tariff preference follows a Bernoulli distribution with probability .25 or .40 according to whether the exporter is small or large, respectively. Finally, the value is selected by drawing a number uniformly between 0 and 1,000,000.

Second step: It comprises 100 repetitions, where the product code, date, value, and tariff preference are held fixed for each transaction. In a repetition, each transaction is assigned an exporter name that is sampled with replacement from 947 publicly listed firms (Securities and Exchange Commission). Next, the import and export data sets are created by recording the attributes of each transaction, that is, the product code, date, value, and tariff preference, in addition to the exporter name. The latter is recorded without errors in the import data set but possibly with typos in the export data set. Finally, the data sets are linked using the available variables, and the proportion of transactions with a preferential tariff is estimated for each exporter size. There are six scenarios that are described in Table 3. In scenarios 1 and 2, there are 2,000 transactions. In scenarios 3 to 6, there are instead 10,000 transactions, which results in a significant increase in the number of false positives compared to scenarios 1 and 2, where there are almost none. In scenarios 5 and 6, the variance is estimated with the bootstrap procedure described in Appendix C.
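The data generation described in the two steps can be sketched as follows. The sketch is written in Python for illustration only (the simulations themselves were run in R), and several details are assumptions: the function names, the way the ten product codes are drawn, and the typo mechanism, which here replaces one character of the name with `#`.

```python
import datetime as dt
import random


def make_population(n, rng):
    """First step: n mock transactions in two equal strata
    (small exporters first, then large exporters)."""
    # Ten distinct six-digit product codes (assumed drawn at random).
    codes = [f"{c:06d}" for c in rng.sample(range(10**6), 10)]
    start = dt.date(2021, 1, 1)
    population = []
    for k in range(n):
        large = k >= n // 2
        population.append({
            "large": large,
            "product_code": rng.choice(codes),
            "date": start + dt.timedelta(days=rng.randrange(365)),
            "value": rng.uniform(0, 1_000_000),
            # Preferential tariff: probability .40 if large, .25 if small.
            "preferential": rng.random() < (0.40 if large else 0.25),
        })
    return population


def add_typo(name, rng):
    """Assumed typo model: overwrite one character with '#'."""
    pos = rng.randrange(len(name))
    return name[:pos] + "#" + name[pos + 1:]


def export_name(name, typo_prob, rng):
    """Second step: the export side records the name with a typo
    with probability typo_prob; the import side keeps it as is."""
    return add_typo(name, rng) if rng.random() < typo_prob else name


def first_key(name, record):
    """First linkage key of scenarios 3 to 6: name, product code, month."""
    return (name, record["product_code"], record["date"].month)
```

Passing an explicit `random.Random` instance keeps each repetition reproducible from its seed.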

Table 3.

Simulation Scenarios.

Parameter	Scenario
	1	2	3	4	5	6
False negatives	None	Some	None	Some	Some	Some
False positives	Almost none	Almost none	Some	Some	Some	Some
Typo probability of exporter name in export data set	0.0	0.2 for small exporters, and 0.1 for large ones	0.0	0.2 for small exporters, and 0.1 for large ones	0.2 for all	0.2 for all
Num. transactions	2,000	2,000	10,000	10,000	10,000	10,000
1st linkage key	Exporter name, product code, month, and value rounded to nearest million	Exporter name, product code, month, and value rounded to nearest million	Exporter name, product code and month	Exporter name, product code and month	Exporter name, product code and month	Exporter name, product code and month
2nd linkage key	None	None	Exporter name, product code, quarter (e.g., January to March)	Exporter name, product code, quarter	Exporter name, product code, quarter	Exporter name, product code, quarter
Estimation of the linkage accuracy	Fit model within each stratum.	Fit model within each stratum.	Fit model within each stratum.	Fit model within each stratum.	Fit model using all n_i’s across the strata.	Fit model using 100 n_i’s drawn across the strata.

7.2. Results

The results are shown in Tables 4 to 9, where RRMSE means relative root mean squared error, and FPR means False Positive Rate. Tables 4 and 5 show the achieved and estimated linkage accuracy, where the recall and precision are always estimated with a small relative bias and RRMSE. The FPR is estimated with a moderate relative bias, which does not exceed 10% in absolute value. However, the variance may be quite large as evidenced by the RRMSE where the precision is very high or few n_i’s are used to estimate the linkage accuracy, as in scenarios 1, 2, and 6. Tables 6 and 7 show the performance of the estimated proportion based on the proposed adjustments. In scenario 1, where there are no false negatives and the precision is very high, the linkage errors can be ignored. In scenario 2, where there are false negatives and almost no false positives, only adjusting for the false negatives is sufficient and preferred to applying a full adjustment, because of the large variance of the estimated FPR. In scenario 3, where there are no false negatives but nonnegligible false positives, it is beneficial and sufficient to only adjust for the false positives. In scenario 4, where the false negatives and false positives are nonnegligible, a full adjustment is required and beneficial. This is also true in scenarios 5 and 6, where the adjustment based on Equation (10) is more effective than that based on Equation (8), when it comes to the RRMSE. Tables 8 and 9 show the bootstrap variance in scenarios 5 and 6, where the relative bias may be large if using all the n_i’s to estimate the linkage accuracy as in scenario 5. However, in scenario 6 (using only 100 n_i’s instead of all of them), the bootstrap variance has a small bias. All these results demonstrate that the proposed methodology performs as expected.
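For readers who wish to reproduce the summary measures, the following Python sketch computes a relative bias and an RRMSE, both in percent, from the estimates obtained over the repetitions. It assumes the usual definitions of these measures relative to a known, nonzero true value; the function names are hypothetical.

```python
import math


def relative_bias_pct(estimates, truth):
    """Relative bias in percent: 100 * (mean estimate - truth) / truth."""
    mean_est = sum(estimates) / len(estimates)
    return 100.0 * (mean_est - truth) / truth


def rrmse_pct(estimates, truth):
    """Relative root mean squared error in percent:
    100 * sqrt(mean squared error) / |truth|."""
    mse = sum((e - truth) ** 2 for e in estimates) / len(estimates)
    return 100.0 * math.sqrt(mse) / abs(truth)
```

Because the RRMSE combines the squared bias and the variance, a small relative bias with a large RRMSE, as seen for the FPR in some scenarios, points to a high variance.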

Table 4.

Achieved Linkage Accuracy.

Scenario	Linkage	Exporter size	Recall	Precision	FPR
1	1	All	1.0	0.998	7.950e-07
2	1	Small	0.801	0.998	7.950e-07
		Large	0.901	0.998	7.550e-07
		All	0.851	0.998	7.750e-07
3	1	All	1.0	0.919	8.850e-06
	2	All	1.0	0.976	2.400e-06
4	1	Small	0.801	0.914	7.510e-06
		Large	0.900	0.923	7.490e-06
		All	0.850	0.919	7.500e-06
	2	Small	0.800	0.781	2.250e-05
		Large	0.900	0.800	2.250e-05
		All	0.850	0.791	2.250e-05
5	1	All	0.800	0.919	7.010e-06
	2	All	0.800	0.792	2.110e-05
6	1	All	0.799	0.919	7.040e-06
	2	All	0.799	0.791	2.120e-05
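The achieved accuracy measures can be illustrated with a short Python sketch. It assumes two data sets of N records each in which record i on one side truly matches record i on the other side, so that there are N matched pairs and N(N - 1) unmatched pairs; the function name `linkage_accuracy` is hypothetical.

```python
def linkage_accuracy(links, n):
    """Return (recall, precision, FPR) for a set of linked pairs.

    links : iterable of (i, j) pairs declared as links
    n     : number of records on each side; pair (i, i) is a true match
    """
    links = set(links)
    true_positives = sum(1 for i, j in links if i == j)
    false_positives = len(links) - true_positives
    recall = true_positives / n
    precision = true_positives / len(links) if links else float("nan")
    # The FPR is taken over all unmatched pairs of the Cartesian product.
    fpr = false_positives / (n * (n - 1))
    return recall, precision, fpr
```

Since the number of unmatched pairs grows quadratically with N, a very small FPR can still coexist with a visibly reduced precision, which is the pattern seen when moving from 2,000 to 10,000 transactions.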

Table 5.

Estimated Linkage Accuracy.

Scenario	Linkage key	Exporter size	Recall		Precision		FPR
			Bias (%)	RRMSE (%)	Bias (%)	RRMSE (%)	Bias (%)	RRMSE (%)
1	1	All	—	—	0.000*	0.120	−0.005	75.515
2	1	Small	0.010	1.526	0.015	0.185	−7.392	93.013
		Large	0.024	1.122	−0.010	0.186	5.784	111.619
		All	0.067	0.933	0.001	0.160	−0.384	88.153
3	1	All	—	—	0.002	0.405	−0.020	4.978
	2	All	—	—	−0.000*	0.213	0.010	9.076
4	1	Small	0.054	0.781	0.054	0.589	−0.578	6.786
		Large	0.020	0.436	0.018	0.544	−0.215	7.038
		All	0.033	0.425	0.034	0.470	−0.386	5.782
	2	Small	0.080	0.884	0.079	0.975	−0.280	4.233
		Large	0.036	0.488	0.036	0.924	−0.143	4.532
		All	0.051	0.506	0.052	0.805	−0.197	3.741
5	1	All	0.079	0.737	0.0398	0.626	−0.405	7.602
	2	All	0.095	0.780	0.073	1.039	−0.236	4.764
6	1	All	0.094	5.323	0.217	3.246	−1.802	38.835
	2	All	−0.317	5.078	0.063	5.483	0.361	24.448

The * means that the value is less than 0.001.

Table 6.

Estimated Proportion with the Preferential Tariff in Scenarios 1 to 4.

Scenario	Exporter size	Ignore the linkage errors		Adjust for the false negatives (Equation (4))		Adjust for all linkage errors (Equation (8))
		Bias (%)	RRMSE (%)	Bias (%)	RRMSE (%)	Bias (%)	RRMSE (%)
1	Small	0.248	0.433	—	—	—	—
	Large	0.132	0.248	—	—	—	—
	All	0.174	0.258	—	—	—	—
2	Small	−19.713	19.863	0.179	2.710	—	—
	Large	−9.748	9.852	0.113	1.369	—	—
	All	−13.274	13.340	0.137	1.385	—	—
3	Small	11.657	11.704	—	—	−0.126	0.672
	Large	7.139	7.171	—	—	0.057	0.464
	All	8.858	8.880	—	—	−0.012	0.362
4	Small	−9.981	10.091	12.380	12.514	0.024	1.862
	Large	−3.880	3.983	6.808	6.868	0.008	0.937
	All	−6.201	6.256	8.928	8.975	0.014	0.845

Table 7.

Estimated Proportion with the Preferential Tariff in Scenarios 5 and 6.

Scenario	Exporter size	Ignore the linkage errors		Adjust for all linkage errors (Equation (8))		Adjust for all linkage errors (Equation (10))
		Bias (%)	RRMSE (%)	Bias (%)	RRMSE (%)	Bias (%)	RRMSE (%)
5	Small	−10.856	10.975	−0.080	2.138	−0.102	1.777
	Large	−14.254	14.303	0.022	1.455	0.077	1.348
	All	−12.962	13.002	−0.017	1.206	0.008	1.053
6	Small	−10.877	11.005	−2.333	17.979	0.276	6.887
	Large	−14.228	14.269	−0.771	12.839	0.487	5.963
	All	−12.953	12.992	−1.365	14.551	0.407	6.177

Table 8.

Bias (%) of the Bootstrap Variance for the Estimated Linkage Accuracy in Scenarios 5 and 6.

Scenario	Exporter size	Recall	Precision	FPR
5	All	2.782	−38.261	−38.619
6	All	−3.507	12.971	9.182

Table 9.

Bias (%) of the Bootstrap Variance for Estimated Proportion with a Preferential Tariff in Scenarios 5 and 6.

Scenario	Exporter size	Proportion with preferential tariff
5	Small	−32.499
	Large	−38.549
6	Small	−0.654
	Large	−1.454

8. Limitations and Next Steps

The proposed solution performs well in simulations, but it has many limitations. For example, it does not apply when the data sets are samples, in which case the estimated population mean may be biased. A second issue is the lack of support for continuous target variables. A third limitation is the absence of mechanisms to audit the queries from the data-holding parties or to control the statistical disclosure. One way to address these limitations is to perturb the linked data according to differential privacy techniques, and to implement the resulting methodology within a trusted execution environment as suggested by Zhang and Haraldsen (2022). A trusted execution environment is a technology that endeavors to “mimic the behavior of a trusted third party by attesting the functionality performed by hardware or by a cloud provider” (United Nations 2023, 20). It allows more flexible computations than cryptographic solutions, which means that the target variables may be continuous, and the linkage strategy may be more flexible and based on available packages in Python or R. It is also easy to accommodate situations where the membership in the intersection is sensitive, and to produce some synthetic data from the linked data using state-of-the-art solutions for this purpose (UNECE 2023c). For applications, the most immediate way to exploit the described solution is to link the records with a high precision and weight the linked records under the assumption that there are no false positives. Indeed, this leads to correct statistical inferences according to classical sampling theory.

There are also many open problems, which represent as many opportunities for advancing the state of the art. One of them is developing statistically sound resampling techniques for linked data. Indeed, progress in this area would enable correct statistical inferences when relaxing the assumption that there are no false positives. A second research avenue concerns the development of the linkage strategy when using a trusted execution environment. Indeed, when linking with quasi-identifiers, the final linkage strategy is usually the outcome of an iterative process that involves some trial and error, for example, to choose the record comparison functions and other critical parameters such as the weight threshold in the probabilistic method of record linkage (Fellegi and Sunter 1969). In the clear, each iteration typically involves many human interventions to evaluate the linkage accuracy and tune the linkage parameters. With a trusted execution environment, these interventions may be based on the reported linkage accuracy. However, the release of this information is likely to have an impact on the privacy budget, which must be studied, especially if the membership in the intersection is sensitive information. Another way to address this problem is to minimize and possibly eliminate human interventions by fully automating the design of the linkage strategy with machine learning techniques, within the trusted execution environment. This requires the development of machine learning solutions that can automatically choose from a large catalog of record comparison options without any training data. Additionally, this must be done while retaining the ability to evaluate the impact of each choice on the linkage accuracy. Dasylva and Chen (2022) describe an adaptation of recursive partitioning that is a step in this direction.

9. Conclusion

Linking international trade microdata from different countries could lead to a step change in providing better statistics and research into international dependencies. In this regard, private set intersection techniques could be a useful tool as they allow linkage while safeguarding privacy in accordance with the law. However, they must be adapted to link with quasi-identifiers and handle linkage errors. To do so, the authors have modified the protocol by Bruno et al. (2018) to estimate the linkage accuracy and adjust an estimated population mean according to the linkage error. This procedure allows us to calculate a novel statistic, in this case the preference utilization rate of Dutch exporters categorized by their size, which would provide further insights into the (lack of) use of preferential benefits by different types of firms. Through better statistics, governments can better understand the obstacles faced by firms, when trying to use preferential trade agreements, since current utilization rates typically fall between 60% and 70%. In turn, the resulting knowledge can guide policies aimed at further lowering firms’ trading costs. Furthermore, the low utilization of preferential tariffs clearly demonstrates that international trade statistics must keep up with globalization, which increases interdependence among countries.

Footnotes

Appendix A

Appendix B

Appendix C

Acknowledgements

This study was carried out in the context of the United Nations Economic Commission for Europe Input Privacy Preservation (UNECE IPP) project.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Disclaimer

The views expressed represent the authors’ opinions and not those of their respective statistical organizations.

ORCID iD

Abel Dasylva

Received: July 31, 2023

Accepted: March 5, 2025

References

Akaike. 1974. “A New Look at the Statistical Model Identification.” IEEE Transactions on Automatic Control 19: 716–23. DOI: https://doi.org/10.1109/TAC.1974.1100705.

Andreea. 2021. “Private Set Intersection: Past, Present and Future.” Proceedings of the 18th International Conference on Security and Cryptography (SECRYPT 2021), Online, July 6–8. https://www.scitepress.org/Papers/2021/105258/105258.pdf (accessed February, 2025).

Blakely and Salmond. 2002. “Probabilistic Record Linkage and a Method to Calculate the Positive Predicted Value.” International Journal of Epidemiology 31: 1246–52. DOI: https://doi.org/10.1093/ije/31.6.1246.

Bruno, Nicoletti, Scannapieco, and Zardetto. 2018. “Privacy Preserving Set Intersection.” Proceedings of the Ninth Irving Fisher Conference: Bank of International Settlements, Basel, August 30–31. https://www.Bis.org/ifc/publ/ifcb49_33.pdf (accessed February, 2025).

Chambers, Chipperfield, Davis, and Kovacevic. 2009. “Inference Based on Estimating Equations with Probability-Linked Data.” Working Paper Series, Centre for Statistical and Survey Methodology.

Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution and Duplicate Detection. Springer.

Christidis, Labrecque-Synnott, Pinault, Saidi, and Tjepkema. 2018. “The 1996 CanCHEC: Canadian Census Health and Environment Cohort Profile.” Analytical Studies: Methods and References, no. 013. Statistics Canada Catalogue no. 11-633-X. Statistics Canada. https://www150.statcan.gc.ca/n1/pub/11-633-x/11-633-x2018013-eng.htm (accessed February, 2025).

Dasylva, Abeysundera, Akpoué, Haddou, and Saidi. 2016. “Measuring the Quality of a Probabilistic Linkage Through Clerical Reviews.” Proceedings of the International Methodology Symposium: Statistics Canada, Gatineau, March 22–24. https://www.statcan.gc.ca/en/conferences/symposium2016/program/14743-eng.pdf (accessed February, 2025).

Dasylva and Chen. 2022. “Probabilistic Record Linkage Through Recursive Partitioning Without Training Data.” Presentation at the Monthly Meeting of the ONS-UNECE Machine Learning Group, April. https://statswiki.unece.org/spaces/MLP/pages/338329602/Machine+Learning+Group+2022 (accessed February, 2025).

Dasylva and Goussanou. 2020. “Estimating Linkage Errors Under Regularity Conditions.” Proceedings of the Section on Survey Research Methods, Online, August 2–6. American Statistical Association. http://www.asasrms.org/Proceedings/y2020/files/1505346.pdf (accessed February, 2025).

Dasylva and Goussanou. 2022. “On the Consistent Estimation of Linkage Errors Without Training Data.” Japanese Journal of Statistics and Data Science 5: 181–216. DOI: https://doi.org/10.1007/s42081-022-00153-3.

Dasylva and Goussanou. 2024. “Making Statistical Inferences About Linkage Errors.” Japanese Journal of Statistics and Data Science 7: 17–56. DOI: https://doi.org/10.1007/s42081-023-00228-9.

De Cristofaro and Tsudik. 2010. “Practical Private Set Intersection Protocols with Linear Bandwidth and Computational Complexity.” Proceedings of Financial Cryptography and Data Security, Tenerife, January 25–28. https://eprint.iacr.org/2009/491.pdf (accessed February, 2025).

Dugdale, Molladavoudi, Santos, and Templeton. 2022. “Privacy Enhancing Technologies at Statistics Canada.” Proceedings of the Survey Methods Section, Online, May 30–June 3. Statistical Society of Canada. https://ssc.ca/sites/default/files/imce/dugdale_molladavoudi_ssc2022.pdf (accessed February, 2025).

Fellegi and Sunter. 1969. “A Theory of Record Linkage.” Journal of the American Statistical Association 64: 1183–210. DOI: https://doi.org/10.1080/01621459.1969.10501049.

Han. 2018. “Statistical Inference Using Data from Multiple Files Combined Through Record Linkage.” Doctoral dissertation, University of Maryland.

Herzog, Scheuren, and Winkler. 2007. Data Quality and Record Linkage Techniques. New York: Springer.

Judson, D. H., Parker, and Larsen, M. D. 2013. Adjusting Sample Weights for Linkage-Eligibility Using SUDAAN. National Center for Health Statistics. https://www.cdc.gov/nchs/data/datalinkage/adjusting_sample_weights_for_linkage_eligibility_using_sudaan.pdf (accessed February, 2025).

Neter, Maynes, and Ramanathan. 1965. “The Effect of Mismatching on the Measurement of Response Error.” Journal of the American Statistical Association 60: 1005–27. DOI: https://doi.org/10.1080/01621459.1965.10480846.

Newcombe. 1988. Handbook of Record Linkage. Oxford: Oxford University Press.

Nilsson. 2022. “Time to Preference: Early Preference Uptake Under the EU-Canada Comprehensive Economic and Trade Agreement and the EU-Korea Free Trade Agreement.” Journal of Economic Integration 37 (4): 589–648. DOI: https://doi.org/10.11130/jei.2022.37.4.589.

Pinkas, Schneider, Tkachenko, and Yanai. 2019. “Efficient Circuit-Based PSI with Linear Communication.” Proceedings of EUROCRYPT, Darmstadt, May 19–23. https://doi.org/10.1007/978-3-030-17659-4_5 (accessed February, 2025).

Schnell. 2016. “Privacy-Preserving Record Linkage.” In Methodological Developments in Data Linkage, edited by Harron, Goldstein, and Dibben. Hoboken, NJ: Wiley.

Securities and Exchange Commission. “List of Companies (Corrected).” https://www.sec.gov/files/rules/other/4-460list.htm (accessed February 7, 2002).

Straus. 2021. “A Federal Government Privacy-Preserving Technology Demonstration.” https://mccourt.georgetown.edu/news/a-federal-government-privacy-preserving-technology-demonstration/ (accessed February, 2025).

UNECE. 2023a. “UNECE Project on Input Privacy Preservation: Final Report.” United Nations Economic Commission for Europe. https://zenodo.org/records/10400296 (accessed February, 2025).

UNECE. 2023b. “Input Privacy Preserving Project.” GitHub Repository. https://github.com/UNECE/Input-Privacy-Preserving-Project/tree/main/Code/PSI_WITH_TYPOS; https://github.com/UNECE/Input-Privacy-Preserving-Project (accessed February, 2025).

UNECE. 2023c. Synthetic Data for Official Statistics: A Starter Guide. United Nations Economic Commission for Europe. https://unece.org/sites/default/files/2022-11/ECECESSTAT20226.pdf (accessed February, 2025).

United Nations. 2023. “United Nations Guide on Privacy-Enhancing Technologies for Official Statistics.” United Nations Committee of Experts on Big Data and Data Science for Official Statistics. https://unstats.un.org/bigdata/task-teams/privacy/guide/ (accessed February, 2025).

World Customs Organization. 2022. “Harmonized System Nomenclature 2022 Edition.” https://www.wcoomd.org/en/topics/nomenclature/instrument-and-tools/hs-nomenclature-2022-edition/hs-nomenclature-2022-edition.aspx (accessed February, 2025).

Zhang, L.-C., and Haraldsen. 2022. “Secure Big Data Collection and Processing: Framework, Means and Opportunities.” Journal of the Royal Statistical Society: Series A (Statistics in Society) 185: 1541–59. DOI: https://doi.org/10.1111/rssa.12836.

Linking Trade Data from Different National Statistical Offices Through a Private Set Intersection
