
Abstract

This paper explores the integration of confidential microdata between two National Statistics Offices (NSOs) where a unique identifier is missing. Using mock data on international trade between their two countries, we use a cloud-based secure enclave to integrate both NSOs’ trade data at the transaction level. Particular attention is given to ensuring input and output privacy, while maintaining the flexibility to produce both summary statistics on the linked data and econometric analyses that can lead to novel insights. As such, this study provides valuable insights into the potential of privacy enhancing techniques, in particular secure enclaves and differential privacy techniques, to improve data privacy while preserving the utility of the statistical results.

Keywords

private data linkage, secure enclave, privacy enhancing technologies, econometric analysis, confidential microdata

1 Introduction

The digitization of data collection has significantly increased the amount of data that National Statistics Offices (NSOs) have at their disposal. A recent trend is the combination of various data sources, which further allows NSOs to monitor the various social and economic activities within their national borders. However, the combination of data sources can also pose challenges, particularly from a privacy point of view. Privacy Enhancing Techniques (henceforth PETs) hold the potential to alleviate these concerns. In this paper, we further explore the use of PETs to combine confidential data between two distinct NSOs.

The use case that we will consider involves the combination of international trade data between the Netherlands and Canada. Currently, when both countries trade with each other, each NSO holds more detailed information for imported than for exported transactions. For example, while Statistics Netherlands would know whether a Dutch importer made use of preferential tariffs under the Comprehensive Economic and Trade Agreement (CETA) between the EU and Canada, only Statistics Canada would know whether a Dutch exporter did so. However, Statistics Canada does not know anything about the Dutch exporter, other than its name. Linking this data at the transaction level would lead to much richer statistics and research on the use of international trade agreements. For example, it would allow the calculation of Preference Utilization Rates (PURs: the percentage of eligible trade that makes use of a trade agreement) by firm categories. Since PURs typically stagnate at around 60–70 percent, a better understanding of the obstacles to further use could lead to increased gains from international trade agreements. However, linking such transactions means NSOs need to share privacy-sensitive information on (the activities of) individual firms, which could violate various laws and regulations such as Canada's Statistics Act,¹ the Statistics Netherlands Act² and the General Data Protection Regulation³ in the European Union.

A potential solution is to link the data sets with a Private Set Intersection (PSI) protocol, i.e. a cryptographic protocol allowing two parties to compare their data sets and identify common elements, without revealing any information about the other elements. CBS, Statistics Canada, and Istat explored this approach for linking international trade data without unique identifiers as part of a UNECE project.⁴ Their work led to a new private set intersection protocol that estimates linkage errors and a population mean, while removing any potential bias caused by the errors. In a recent study,⁵ a bootstrap variance estimator is developed for the estimated mean. While the UNECE project illustrated the potential of PETs in practical applications, it also highlighted some limitations, such as the limited flexibility regarding the linkage or the supported statistical analysis, as well as the lack of statistical disclosure control.

This work focuses on overcoming these shortcomings by utilizing a cloud-based secure enclave, i.e. an isolated area of a virtual server, where sensitive data and operations are protected from unauthorized access and tampering (i.e. compromising the data or computations) by the enclave owner or any other party. Within this enclave, the records are linked with great flexibility, and the resulting links serve to compute summary statistics (e.g. on the use of preferential tariffs by Dutch exporters) and fit various statistical models (e.g. a linear or logit model). Then, the corresponding outputs are safely exported outside the enclave after they are randomly perturbed according to differential privacy methods. Furthermore, more utility is extracted from the linked data by generating some synthetic data, which is safely exported outside the enclave with all the target variables. By synthetic data, we mean a data set of fictitious transactions that are distributed like the actual transactions, according to the established links between the export and import data sets.

The paper is structured as follows. In the next section, we first explain the use case in greater detail. After that, we discuss previous and related work in Section 3. Section 4 describes the various methodological steps involved while Section 5 presents the evaluation including the tests in the enclave and simulations. Section 6 gives the conclusions and next steps.

2 Use case

The overarching goal of this project is to explore the use of PETs to link highly detailed (and therefore typically privacy-sensitive) data from two NSOs. The case in point that we consider in this paper concerns the use of preferential trade agreements when firms trade internationally. More specifically, we assume a situation where a firm from the Netherlands exports a product to Canada which is eligible for preferential (i.e. lower) import tariffs under CETA. Only the Canadian customs authority (and thus Statistics Canada) will record whether this transaction enters Canada under preferential terms or not. On the other hand, only Statistics Netherlands holds detailed information on the Dutch exporter, e.g. whether it is a large or small firm. Both agencies could therefore benefit from linking their information at the transaction level, such that both could eventually gain insight into preference utilization by firm categories.

In what follows, we will construct a mock dataset that tries to mimic the real-life data that would ultimately be used. This mock dataset will consist of several common variables, i.e. variables that both Statistics Netherlands and Statistics Canada have in their respective datasets. These include the name of the Dutch exporter, the code of the exported product (according to the 6-digit harmonized system⁶), as well as the date and value of the transaction. In addition, Statistics Netherlands uniquely observes some exporter characteristic, which we will assume to be the exporter size. Likewise, Statistics Canada uniquely observes the importer size and whether the transaction entered Canada under preferential terms or not (Table 1).

Table 1.
Overview of the availability of different variables in each dataset.

Variable	Export micro-data: Available (Y/N)	Export micro-data: Private (Y/N)	Import micro-data: Available (Y/N)	Import micro-data: Private (Y/N)
Exporter name	Y	N	Y	N
Exporter size	Y	Y	N	N
Importer size	N	N	Y	Y
Product code	Y	N	Y	N
Transaction date	Y	N	Y	N
Transaction value	Y	N	Y	N
Tariff type	N	N	Y	Y

Note: “Private” applies to information which is only known or permitted to be known by one of the two parties and never shared openly.

In what follows, we aim to calculate descriptive statistics, perform econometric analysis and generate some synthetic data from the linked data, while ensuring that the statistical outputs are safe and keeping the inputs private. These two requirements are also called output privacy and input privacy, respectively. Output privacy corresponds to statistical disclosure control. For the use case, it means that the attributes of any given transaction cannot be inferred from the statistical outputs. In this work, output privacy is achieved through differential privacy techniques. According to the UN PET Guide⁷ (p. 20), “Input privacy endeavors to allow two or more parties to submit data into a calculation without the other respective parties seeing data in clear.” It implies that the input data remains protected when it is collected and processed. It may be based on encryption or other measures that are implemented in software or hardware. In the case of encryption, the input data is processed while it is encrypted, i.e. without prior decryption.

3 Related works

This work derives from a larger collaborative effort involving multiple NSOs. It builds upon previous projects that have laid the groundwork for the current study. These projects have explored various techniques such as private set intersection and federated learning for secure data analysis and integration, focusing on the challenges and solutions related to privacy preservation in official statistics. In addition to the foundational projects, this work also leverages extensive technological and methodological experience gained from other domains, including advancements in Secure Multi-Party Computation (SMPC) and the implementation of secure enclaves.

The main project from which this work draws inspiration is the UNECE Project on Input Privacy Preservation (IPP),⁴ launched in January 2021, with the aim to explore statistical use cases that require input-side protection, assess the applicability of different privacy-preserving techniques and foster a collaborative community among statistical organizations and external partners, including academia and the private sector.

The UNECE IPP project demonstrated the feasibility of private set intersection for international trade while highlighting the need for computational efficiency and accurate linkage techniques. In that project, a protocol was developed by modifying the solution described by Bruno et al.⁸ to deal with the linkage errors when the records are linked without a unique identifier. With the new protocol, a population mean is estimated while removing any potential bias that is due to the linkage errors. In recent work, a variance estimator is proposed for the estimated mean.⁵ Like the solution by Bruno et al.,⁸ the protocol involves the two data-holding parties as well as a trusted third party called “linker”. The protocol is secure when the parties are non-colluding (i.e. not secretly cooperating or conspiring together) and honest but curious, where the latter qualifier means that the parties follow the protocol correctly but try to learn additional information from the data they observe. While this solution enables the parties to estimate a mean without sharing their clear data, it has many limitations, including no support for approximate comparisons of the linkage variables, no support for continuous target variables, a limited range of statistical analyses that may be performed, and no control of the statistical disclosure risk. Also, the need to have a trusted third party is restrictive.

Other solutions have been developed, which address some of these limitations. For example, Pinkas et al.⁹ describe a private set intersection approach which has the advantage of not requiring a trusted third party. It is based on oblivious transfer, whereby a receiver obtains a single piece of information (to which it is entitled) from a sender that has many such pieces of information, without revealing the obtained information to the sender. This solution was used to privately link student financial aid data in the United States.¹⁰ Zanussi, Dugdale and Santos¹¹ describe a similar solution. While these solutions dispense with a trusted party, they require a unique identifier and assume that there are no linkage errors.

Indeed, linkage errors are a particular concern when linking with quasi-identifiers, where a quasi-identifier is a nonunique variable that is possibly recorded with typos (e.g. the transaction date, the transaction value). To describe these errors, a record pair is called matched if its records are from the same unit. A linkage error is a false negative (i.e. not linking records from a matched pair) or a false positive (i.e. linking records from an unmatched pair). The linkage errors are usually measured by the recall and the precision, where the recall is the proportion of matched pairs that are linked, and the precision is the proportion of linked pairs that are matched. Performing approximate comparisons is an effective way to limit these errors when linking with quasi-identifiers. In a privacy-preserving context, the question is how to perform such comparisons on quasi-identifiers when they are encrypted or hashed. Encryption and hashing both aim to map a first message (called plain text) into a second message (called cipher text) that is unreadable. While encryption is reversible, i.e. the plain text may be recovered from the cipher text with a decryption key, this is not the case with hashing. One privacy-preserving linking method is to use Bloom filters that are based on hashing the quasi-identifiers.¹² While these methods can improve the linkage efficiency, they must be part of a bigger solution, which also protects the statistical summaries that are derived from the linked data. The latter is also true for all the above-described solutions.
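The definitions of recall and precision given above can be made concrete with a small sketch. It assumes that the set of truly matched pairs is known, which is only the case for mock or simulated data; the function name and the record identifiers are hypothetical.

```python
def linkage_accuracy(linked_pairs, matched_pairs):
    """Return (recall, precision) for a set of linked record pairs.

    Both arguments are collections of (export_id, import_id) tuples.
    recall    = share of matched pairs that are linked
    precision = share of linked pairs that are matched
    """
    linked = set(linked_pairs)
    matched = set(matched_pairs)
    true_positives = len(linked & matched)
    recall = true_positives / len(matched) if matched else float("nan")
    precision = true_positives / len(linked) if linked else float("nan")
    return recall, precision


# Toy example: four matched pairs, three of which are linked,
# plus one false positive ("e4" wrongly linked to "i9").
matched = {("e1", "i1"), ("e2", "i2"), ("e3", "i3"), ("e4", "i4")}
linked = {("e1", "i1"), ("e2", "i2"), ("e3", "i3"), ("e4", "i9")}
recall, precision = linkage_accuracy(linked, matched)
```

In this toy example, one false negative and one false positive lower both the recall and the precision to 0.75.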

An attractive alternative is a cloud-based secure enclave. As it has been mentioned before, it is essentially an isolated area of a virtual server, which protects sensitive data and operations from unauthorized access and tampering by the owner of the server or any other party. Enclaves provide a secure execution environment that ensures isolation, integrity, and authenticity. For our use case, this enables privacy-preserving computations with both formal security and privacy guarantees. In practice, enclaves achieve their guarantees through virtualized environments that are isolated from the host system, typically with no external network access and restricted communication through tightly controlled local channels. They also provide a cryptographic proof that their code and configuration have not been tampered with. While this solution requires trust in the cloud provider, in theory, it provides the greatest flexibility for implementing the linkage, statistical analysis and disclosure control measures, since the clear data may be processed with standard statistical packages within the enclave. This trust is necessary because the provider controls the underlying hardware and infrastructure. Nevertheless, it remains a practical and effective solution that is widely used.

4 Methodology

Within a secure enclave, the import and export data sets can be linked privately to fit various regression models and generate public use microdata through data synthesis, while controlling for the linkage errors, as shown in Figure 1. By design, this process preserves the privacy of the inputs, but it can also ensure that all the outputs are safe in the differential privacy (DP) framework.¹³ While fitting regression models and generating synthetic data from the linked data are major improvements over the UNECE IPP project,⁴ producing differentially private outputs is equally significant and requires a careful consideration of the threat model and the linkage impact on the privacy loss. Here, the term threat model loosely refers to the unit of information that is to be protected from statistical disclosure, as well as scenarios for such disclosure, which assume the curator model in this work, i.e. all the clear data is sent to a trusted third party, who performs the processing and applies the necessary disclosure control measures. Regarding the protected unit of information, a first choice is protecting each transaction, while a second choice is protecting all the transactions from a given exporter or importer. This is an important distinction when considering a regression model with fixed exporter effects. For the problem in hand, the linkage also affects the privacy loss in ways that go beyond some previous discussions of the interplay between record linkage and differential privacy. Indeed, this linkage involves data sets held by different statistical organizations, where each organization is not entitled to the private information owned by the other. However, an organization might leverage its data set to infer the private attribute, which its peer holds regarding a particular transaction, e.g. guessing the tariff type based on the variables in the export data set including the firm size. 
This means that each organization is a potential adversary for its peer, when it comes to protecting the private attributes.

Figure 1.

Production process.

In what follows, our goal is to produce various statistical outputs about a finite population of N international trade transactions, while respecting the above-described confidentiality constraints and leveraging the capabilities of the secure enclave. The outputs are to be based on an import data set and an export data set, which are assumed to be complete and duplicate-free censuses for simplicity. In the following paragraphs, a clear data production process is described, which operates within the boundaries of a statistical organization where there is unfettered access to all the data. Then, an overview of the DP framework is given along with a discussion of different threat models for the use case. Finally, the proposed solution is outlined based on a differentially private version of the clear data production process that runs within the secure enclave.

4.1 The clear data production process

For ease of presentation, it is convenient to first consider the situation where an analyst at Statistics Netherlands has access to all the clear data.

4.1.1 Analysis

First consider a standard generalized linear model of the form $g (E [y_{i}]) = x_{i}^{⊤} β$, based on the link function $g (.)$, the response $y_{i}$ of transaction i (e.g. whether the tariff is preferential), the vector $x_{i}$ of covariates and the regression parameters $β$, which comprise the intercept and coefficients. Two common choices of link function are the identity link $g (μ) = μ$ and the logit link $g (μ) = \log (μ / (1 - μ))$, for a linear model and a logit model, respectively, where $μ$ is a placeholder for the mean response. The inverse of the logit link is the logistic function $e^{η} / (1 + e^{η})$, which maps the linear predictor $η = x_{i}^{⊤} β$ to a probability. With the linear model, the parameters may be estimated by the ordinary least squares estimator $\hat{β} = (X^{⊤} X)^{- 1} X^{⊤} Y$, where $Y = [y_{1} \dots y_{N}]^{⊤}$ and $X = [x_{1} \dots x_{N}]^{⊤}$. A special case is the linear probability model (LPM) where the response is binary (i.e. either 0 or 1) as in the use case. Using a logit model is an alternative to the LPM, but the LPM is easier to apply and interpret, especially when a researcher wants to use so-called fixed effects to control for unobserved and time-invariant differences between firms, for example.

Indeed, econometricians often control for unobserved differences between observations that may lead to biased estimates by including fixed effects. In our case, there may be differences between exporters that we are unaware of, e.g. in terms of experience. To the extent that these differences are time invariant, we can control for them by using exporter fixed effects (i.e. time-invariant differences between exporters). This can be done effectively by de-meaning the data, i.e. by subtracting the average exporter response (i.e. the average response over the transactions from the same exporter) from each transaction response and subtracting the average exporter covariates from the covariates of each transaction. This only works for linear regressions such as ordinary least squares, which gives the LPM a distinct advantage over the logit model, where de-meaning would not work. However, the LPM may yield an estimate of the probability $E [y_{i}]$ that is less than 0 or greater than 1. For a logit model, the regression coefficients may be estimated by the numerical solution of the normal equation $X^{⊤} (Y - \hat{μ}) = 0$, where $\hat{μ} = [{\hat{μ}}_{1} \dots {\hat{μ}}_{N}]^{⊤}$ and ${\hat{μ}}_{i} = e^{x_{i}^{⊤} \hat{β}} / (1 + e^{x_{i}^{⊤} \hat{β}})$.
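The de-meaning step can be sketched as follows with NumPy. This is only an illustration on simulated data, with a continuous response for simplicity; the function name, the sample sizes and the true slope of 0.5 are assumptions made for the example and do not come from the trade data.

```python
import numpy as np


def within_estimator(y, X, groups):
    """OLS on de-meaned data, which absorbs the group (exporter) fixed effects."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    y_dm = y.copy()
    X_dm = X.copy()
    for g in np.unique(groups):
        mask = groups == g
        # Subtract the exporter average from the response and the covariates.
        y_dm[mask] -= y[mask].mean()
        X_dm[mask] -= X[mask].mean(axis=0)
    beta, *_ = np.linalg.lstsq(X_dm, y_dm, rcond=None)
    return beta


# Simulated data: 50 hypothetical exporters with 20 transactions each.
rng = np.random.default_rng(0)
n_exporters, n_per_exporter = 50, 20
groups = np.repeat(np.arange(n_exporters), n_per_exporter)
effect = rng.normal(size=n_exporters)[groups]      # unobserved exporter effect
x = rng.normal(size=groups.size) + effect          # covariate correlated with it
y = 0.5 * x + effect + 0.1 * rng.normal(size=groups.size)

beta_fe = within_estimator(y, x[:, None], groups)  # controls for the effect
pooled_design = np.column_stack([np.ones_like(x), x])
beta_pooled = np.linalg.lstsq(pooled_design, y, rcond=None)[0]  # ignores it
```

In this simulation, the de-meaned estimate stays close to the true slope, whereas the pooled regression is biased upward because the covariate is correlated with the unobserved exporter effect.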

The above estimators apply where the linkage is perfect, i.e. two records are linked if and only if they pertain to the same transaction. Such record pairs are called matched. However, linkage errors tend to occur when linking with quasi-identifiers, including false negatives and false positives, where a false negative is failing to link a matched pair, and a false positive is linking a pair that is unmatched. There are essentially two ways to mitigate the impact of these errors on an analysis: linking the records with a sufficiently high precision to suppress the false positives at the expense of more false negatives,¹⁴ or adjusting the analysis for any potential bias,¹⁵,¹⁶ which is not discussed further. With the approach by Judson et al.,¹⁴ the linked records may be reweighted to represent the unlinked records. However, this last step is not needed if the false negatives occur at random.

4.1.2 Linking the records

The records may be linked with the probabilistic method,¹⁷ which is a good choice when the linkage variables have typos. With this method, a pair may be linked only if it meets certain coarse criteria that are called blocking criteria. For such a pair that is called potential, a probabilistic linkage weight is assigned according to the observed agreements between the pair records, such that its weight increases with the similarity of the records. In the fully automated version of the method, a potential pair is automatically linked without any manual intervention if its weight exceeds a certain threshold, which is a function of the target linkage accuracy. In general, the pair weight is assigned by fitting a statistical model of the agreements in a potential pair with a numerical procedure, where the agreements are coded in a categorical vector of the form $γ = (γ_{1}, \dots, γ_{K})$ , with $γ_{k}$ indicating whether there is agreement on variable k. In the simplest case, $γ_{k}$ is dichotomous, e.g. $γ_{k} = 0, 1$ . A common statistical model for $γ$ is a log-linear mixture where the probability of observing $γ$ is of the form $p (γ) = π m (γ) + (1 - π) u (γ)$ , with $π$ denoting the proportion of matched pairs (also called mixing proportion), and $m (γ)$ and $u (γ)$ denoting the conditional probabilities of observing $γ$ given that the pair is matched and unmatched. Under the conditional independence assumption, each of these latter conditional probabilities is the product of the marginal conditional probabilities of the $γ_{k}$ 's. 
While this assumption may be restrictive, it leads to sensible linkage decisions in practice.¹⁷ Following the estimation of the model parameters, a pair weight is computed as $w (γ) = \log (m (γ) / u (γ))$ (or some scaled version of this variable), and it is optimal to link the pairs that are above a certain weight threshold, if one is minimizing the false positives subject to a target recall.¹⁷ In Annex 1, it is shown that this strategy is also optimal when maximizing the recall subject to a minimum precision. However, it requires a reliable estimator of the linkage accuracy.
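As an illustration of the weight $w (γ) = \log (m (γ) / u (γ))$ under the conditional independence assumption, the following sketch sums the per-variable log-ratios for dichotomous agreements. The agreement probabilities and the threshold are hypothetical values chosen for the example; in practice, they would be estimated from the potential pairs and derived from the target linkage accuracy.

```python
import math


def pair_weight(gamma, m_probs, u_probs):
    """Compute log(m(gamma) / u(gamma)) under conditional independence.

    gamma      : tuple of 0/1 agreement indicators, one per linkage variable
    m_probs[k] : P(agreement on variable k | pair is matched)
    u_probs[k] : P(agreement on variable k | pair is unmatched)
    """
    weight = 0.0
    for agree, m, u in zip(gamma, m_probs, u_probs):
        if agree == 1:
            weight += math.log(m / u)
        else:
            weight += math.log((1.0 - m) / (1.0 - u))
    return weight


# Hypothetical probabilities for: exporter name, product code, date, value.
m_probs = [0.95, 0.98, 0.90, 0.85]
u_probs = [0.01, 0.05, 0.02, 0.01]

w_all = pair_weight((1, 1, 1, 1), m_probs, u_probs)   # full agreement
w_none = pair_weight((0, 0, 0, 0), m_probs, u_probs)  # full disagreement

threshold = 5.0            # hypothetical weight threshold
is_linked = w_all > threshold
```

The weight increases with the number of agreements, so that a pair agreeing on all variables receives a large positive weight and a pair disagreeing on all of them a negative one.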

4.1.3 Evaluating the linkage accuracy

While the accuracy may be evaluated according to the above-described log-linear mixture, the resulting estimates may be biased due to violations of the restrictive conditional independence assumption. To avoid this problem, it is proposed to model the number of links from a record as suggested in a previous paper.¹⁸ According to this model, the number of links $n_{i}$ from export record i is distributed according to the following mixture.

$$n_{i} \sim \sum_{g = 1}^{G} α_{g} \, \mathrm{Bernoulli} (p_{g}) * \mathrm{Poisson} (λ_{g}),$$

where $*$ denotes the convolution operator, G is the number of latent record classes, $α_{g}$ is the probability of class g, and $p_{g}$ and $λ_{g}$ are the expected numbers of true positives and false positives per record from the class, respectively. With $\bar{p} = \sum_{g = 1}^{G} α_{g} p_{g}$ and $\bar{λ} = \sum_{g = 1}^{G} α_{g} λ_{g}$, the recall and precision are given by $\bar{p}$ and $\bar{p} / (\bar{p} + \bar{λ})$, respectively. Alternatively, the precision may be estimated by $(\# \, \text{export records}) \, \hat{\bar{p}} / (\# \, \text{links})$, where $\hat{\bar{p}}$ is the maximum likelihood estimator of $\bar{p}$. This alternative precision estimator will be handy later. The model parameters may be estimated by the numerical maximization of the log-likelihood of the $n_{i}$'s. The number of classes G may be selected in a data-driven manner through the minimization of Akaike information criterion.¹⁹ Yet, in practice, a single class often suffices where the precision is at 0.95 or above. The resulting estimators are known to be consistent under regularity conditions, if linking two duplicate-free censuses such that the decision to link two records is independent of other records.
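The recall and precision formulas of this model can be evaluated directly once the mixture parameters are available. The sketch below takes the parameters as given (it does not perform the maximum likelihood estimation), and the parameter values and counts are hypothetical.

```python
def accuracy_from_mixture(alpha, p, lam):
    """Recall and precision implied by the mixture parameters.

    alpha[g] : probability of latent class g
    p[g]     : expected number of true positives per record in class g
    lam[g]   : expected number of false positives per record in class g
    """
    p_bar = sum(a * pg for a, pg in zip(alpha, p))
    lam_bar = sum(a * lg for a, lg in zip(alpha, lam))
    recall = p_bar
    precision = p_bar / (p_bar + lam_bar)
    return recall, precision


def precision_from_counts(n_export_records, n_links, p_bar_hat):
    """Alternative precision estimate: (# export records) * p_bar_hat / (# links)."""
    return n_export_records * p_bar_hat / n_links


# Hypothetical two-class example.
alpha = [0.7, 0.3]
p = [0.95, 0.80]
lam = [0.01, 0.10]
recall, precision = accuracy_from_mixture(alpha, p, lam)

# With 1000 export records, the expected number of links is
# 1000 * (p_bar + lam_bar) = 942, so both precision formulas agree.
precision_alt = precision_from_counts(1000, 942, recall)
```

The two precision expressions coincide here because the number of links per record has expectation $\bar{p} + \bar{λ}$ under the model.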

Implementing the clear data production process is straightforward with the available record linkage and statistical packages. However, there is a need to control the statistical disclosure.

4.2 Specifying the desired level of confidentiality

The protection from statistical disclosure may be provided for individual transactions or individual firms (exporters or importers). In general, the desired protection level may be concisely expressed in the language of differential privacy, a robust mathematical definition of privacy that protects against a wide range of attacks, including differencing and record linkage.¹³ However, the specific constraints of the use case must be considered.

Differential privacy is a property of a statistical procedure or its output, which is hereafter called measurement. Basically, a measurement is differentially private if its distribution is essentially invariant under the modification or addition of a single record in the input dataset. The formal definition of differential privacy rests on the concept of adjacent datasets, which depends on the context. In that regard, there are two common definitions of adjacency, which both assume that each dataset comprises records from distinct individuals. According to the first definition, two datasets are called adjacent if they have the same number of records and differ in exactly one record, i.e. all the record values are identical across the two datasets except for one record. According to the second definition, two datasets are adjacent if they differ in exactly one record, and one dataset has one more record than the other. A general definition of adjacency, which includes both cases, is to call two datasets adjacent if they differ by one record (see,¹³ Definition 2.3, p.17). To define exact differential privacy, consider two adjacent datasets (according to one of the two definitions given above), a measurement as well as a subset of possible measurement values, and let p and $p^{'}$ denote the probabilities of observing the event that the measurement takes its value from the selected subset, for the first and second datasets, respectively. Then, the measurement is said to be $ε$ differentially private if $p \leq e^{ε} p^{'}$ , for any choice of the adjacent datasets and subset of measurement values.¹³ The exact differential privacy is further qualified as bounded if the two datasets have the same number of records.²⁰ Otherwise, it is called unbounded. There is also a relaxed definition of differential privacy called approximate differential privacy, which is instead characterized by two positive parameters $ε$ and $δ$ . 
According to this other definition, a measurement is called $(ε, δ)$ differentially private if $p \leq e^{ε} p^{'} + δ$ , for any choice of the adjacent datasets and subset of measurement values. Essentially, $(ε, δ)$ differential privacy means $ε$ differential privacy except for a subset of measurement values, the probability of which is bounded above by $δ$ . Like exact differential privacy, approximate differential privacy may be qualified as bounded and unbounded according to the chosen adjacency definition. In all the above definitions of differential privacy, the parameter $ε$ is called privacy loss parameter. Values between 0 and 5 are believed to provide a strong protection, but larger values have also been used in practice.²¹ A differentially private version of a statistic may be computed by perturbing the statistic, the input on which it is based (see Proposition 2.1 in¹³) or the processing, through the addition of noise (see Theorem 3.6 and Appendix A in¹³). When directly perturbing the statistic or the related input, the amount of noise is dictated by the global sensitivity of the perturbation target, which is the biggest possible change when a single record is modified or added.
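To illustrate the direct perturbation of a statistic, the following sketch applies the Laplace mechanism to a count, with noise of scale equal to the global sensitivity divided by $ε$. The mock data, the function name and the chosen $ε$ are assumptions made for the example. Under bounded differential privacy, modifying a single record changes the count by at most 1, hence a global sensitivity of 1.

```python
import numpy as np


def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Return value + Laplace noise with scale sensitivity / epsilon."""
    if epsilon <= 0 or sensitivity < 0:
        raise ValueError("epsilon must be positive and sensitivity non-negative")
    scale = sensitivity / epsilon
    return value + rng.laplace(loc=0.0, scale=scale)


rng = np.random.default_rng(42)

# Mock indicator of preferential tariff use for 1000 transactions.
is_preferential = rng.integers(0, 2, size=1000)
true_count = int(is_preferential.sum())

# Modifying one record changes the count by at most 1 (bounded adjacency).
noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5, rng=rng)
```

A smaller $ε$ or a larger sensitivity increases the noise scale, which is the trade-off between protection and utility discussed above.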

In the use case, the computed statistics are based on the linkage of two datasets that are held by different statistical organizations. Then, the actual privacy loss may be greater than expected if the linked data is the input of a differentially private procedure, which expects the different links to represent distinct transactions. A simple fix is to set an upper limit on the number of links per record²² and to account for this limit when setting the privacy loss parameter. Also, this parameter must account for the maximum number of transactions per exporter, where the goal is to protect individual exporters. This is detailed in the next section.

4.4 Building a differentially private process

A differentially private production process may be based on procedures, which expect that the input records represent distinct transactions. However, this is typically not true with the linked pairs as input due to the presence of false positives. While these errors may be rare at a high precision, accounting for them is crucial from a privacy standpoint, and it may be done as follows. Suppose that we wish to reuse an $ε$ differentially private procedure, where a record has up to d links (a constraint enforceable by deleting some of the potential pairs), and an exporter has up to T transactions.

With respect to privacy, there are two kinds of adversaries or attackers. The first kind is a participating statistical organization that tries to guess some of the private attributes that are held by its peer. This is the case if Statistics Canada is trying to infer the exporter size of a transaction, or Statistics Netherlands is trying to infer the tariff type of a transaction. This threat must be mitigated even if the produced statistical outputs are not released outside the two statistical organizations. The second kind of adversary is someone outside the two organizations, with no access to the import and export datasets. The goal of such an adversary is to correctly guess any attribute (among those present in either dataset) of a specific transaction. This distinction is a key feature of the use case, which impacts the definition of the privacy loss.

When the attacker is one of the two statistical organizations, the differential privacy is based on the worst case where a single record is modified in the dataset held by its peer, while all the remaining records are fixed and provided to the attacker. For example, if Statistics Canada were the attacker, this would mean that all the export records are known except the target of the inference. Of course, Statistics Canada would also benefit from knowing all the import records including the one matched with the target export record. Since the target record generates up to d linked pairs, the measurement based on the $ε$ differentially private procedure (with the linked pairs as input) is $d ε$ differentially private with respect to individual transactions. Thus, it is $T d ε$ differentially private with respect to individual exporters (as defined in the previous section) based on the group privacy theorem (see,¹³ Theorem 2.2, p. 20). When the original procedure is instead $(ε, δ)$ differentially private, the resulting measurement is $(d ε, (\sum_{k = 0}^{d - 1} e^{k ε}) δ)$ differentially private with respect to individual transactions, and $(T d ε, (\sum_{k = 0}^{T d - 1} e^{k ε}) δ)$ differentially private with respect to individual exporters. Here, by differential privacy, we mean bounded differential privacy because the number of transactions is known, and each dataset is a census. The above discussion applies whether the target record is an export or import record.

When the attacker is outside the two statistical organizations, the privacy loss must also account for the non private attributes as defined in Section 2. These attributes include all the linkage variables, such as the exporter name, product code, etc. In this case, the worst-case means that the attacker knows all the records except those associated with the target transaction. This transaction is associated with one import record and one export record (recall that each dataset is a census), where each record has up to d links. Therefore, a transaction may generate up to $2 d$ linked pairs in the worst case. In this case, the measurement based on the $ε$ differentially private procedure (with the linked pairs as input) is $2 d ε$ differentially private with respect to individual transactions, and $2 T d ε$ differentially private with respect to individual exporters. If the original procedure is $(ε, δ)$ differentially private, the resulting measurement is $(2 d ε, (\sum_{k = 0}^{2 d - 1} e^{k ε}) δ)$ differentially private with respect to individual transactions, and $(2 T d ε, (\sum_{k = 0}^{2 T d - 1} e^{k ε}) δ)$ differentially private with respect to individual exporters.
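The group privacy bounds above can be checked numerically. The following Python sketch is only an illustration of the formulas in this section; the function names `group_privacy` and `privacy_loss` are hypothetical and are not part of the enclave code.

```python
import math


def group_privacy(eps, delta, k):
    """Parameters after extending an (eps, delta) guarantee to groups of k linked pairs.

    Implements (k * eps, (sum_{j=0}^{k-1} e^{j * eps}) * delta), as in the text.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    group_eps = k * eps
    group_delta = sum(math.exp(j * eps) for j in range(k)) * delta
    return group_eps, group_delta


def privacy_loss(eps, delta, d, T=1, outside_attacker=False):
    """Privacy loss for up to d links per record and T transactions per exporter.

    T=1 protects individual transactions. An attacker outside the two
    organizations doubles the number of linked pairs (2 * d instead of d).
    """
    pairs = (2 if outside_attacker else 1) * d * T
    return group_privacy(eps, delta, pairs)
```

For instance, `privacy_loss(0.5, 0.0, d=2)` corresponds to the $d ε$ bound for a participating organization, while setting `outside_attacker=True` gives the $2 d ε$ bound.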

In what follows, we assume that the attacker is one of the two organizations. When the attacker is outside the two organizations, the same solutions apply after doubling the privacy budget, as suggested by the above discussion. We next review the differentially private procedures that are reused and detail how they are adapted.

4.4.1 Analysis

Many methods have been described for performing a differentially private regression, including the perturbation of an objective function,²³,²⁴ which has been implemented in the enclave. Chaudhuri et al.²³ focus on high-dimensional problems, with an objective function that incorporates a penalty for regularization (i.e. penalizing the model complexity) and a perturbation of the form $u^{⊤} β$ , where $β$ is the vector of parameters. This solution is used for logistic regression in the IBM diffprivlib package (https://diffprivlib.readthedocs.io/en/latest/), which is supported by the enclave. Zhang et al.²⁴ describe a different method for problems where the model parameters $β$ are estimated by minimizing an objective function of the form $\sum_{m = 1}^{M} (\sum_{i = 1}^{n} c_{m} (x_{i}, y_{i})) φ_{m} (β)$ with $(x_{1}, y_{1})$ ,…, $(x_{n}, y_{n})$ denoting the data points, $c_{m} (\cdot, \cdot)$ being a function that does not involve the parameters, and $φ_{m} (\cdot)$ being a function that is polynomial or well approximated by a polynomial. A differentially private estimator is obtained by perturbing the total $\sum_{i = 1}^{n} c_{m} (x_{i}, y_{i})$ with the Laplace mechanism (i.e., the addition of noise following the Laplace distribution) for each m. This method is related to the perturbation of sufficient statistics, where some statistics are called sufficient if the distribution of the data no longer depends on the parameters of the underlying statistical model after conditioning on these statistics. The main drawback is that the added noise grows with the number of covariates (e.g. quadratically for a linear regression), which is inconvenient when the covariates are high-dimensional. This approach is used for linear regression in the IBM diffprivlib package.

Within the enclave, differentially private linear and logistic regressions are based on this package.

4.4.2 Data synthesis

Some synthetic data may be generated from the linked data to enable some exploratory analysis and motivate a deeper analysis. It may be generated in a differentially private manner by the Multiplicative Weights Exponential Mechanism (MWEM),²⁵ where the synthetic dataset is generated iteratively by minimizing the difference between the actual data and the synthetic data for a set of linear queries. Initially, the synthetic data is drawn uniformly from the universe of possible records and each synthetic record is assigned the same weight. Then, in each iteration, a query is selected and the difference between the actual data and the synthetic data is evaluated in a differentially private manner. When the answer is larger with the actual data, the weight is increased where a synthetic record makes a positive contribution to the difference, and it is decreased where the contribution is negative. When the answer is smaller with the actual data, the weight is decreased where the contribution of a synthetic record is positive, and it is increased where the contribution is negative. The resulting synthetic data is differentially private.
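The iterative update described above can be illustrated with a short Python sketch. It follows the usual textbook presentation of MWEM for counting queries over a small finite universe (exponential mechanism for query selection, Laplace mechanism for the measurement, multiplicative weights for the update). It is not the SmartNoise implementation, and all names (`mwem`, `laplace`, `answer`) are hypothetical. The total number of records n is assumed to be known and positive, in line with the bounded differential privacy setting.

```python
import math
import random


def laplace(scale, rng):
    # The difference of two i.i.d. exponential variables is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def answer(query, hist):
    # A counting query is a 0/1 vector over the universe of possible records.
    return sum(q * h for q, h in zip(query, hist))


def mwem(true_hist, queries, eps, iterations, rng):
    """Return synthetic weights over the universe (same total as true_hist)."""
    n = sum(true_hist)
    size = len(true_hist)
    synth = [n / size] * size  # uniform start: every record has the same weight
    eps_step = eps / (2 * iterations)  # half of each iteration's budget per mechanism
    for _ in range(iterations):
        # Exponential mechanism: favour queries that the synthetic data answers badly.
        scores = [abs(answer(q, synth) - answer(q, true_hist)) for q in queries]
        top = max(scores)
        probs = [math.exp(eps_step * (s - top) / 2) for s in scores]
        query = rng.choices(queries, weights=probs, k=1)[0]
        # Laplace mechanism: noisy answer on the actual data.
        measurement = answer(query, true_hist) + laplace(1.0 / eps_step, rng)
        # Multiplicative weights: raise weights where the query contributes
        # and the actual answer is larger, lower them otherwise.
        error = measurement - answer(query, synth)
        synth = [w * math.exp(q * error / (2 * n)) for w, q in zip(synth, query)]
        total = sum(synth)
        synth = [w * n / total for w in synth]  # renormalise to n records
    return synth
```

A call such as `mwem([40, 10, 30, 20], [[1, 1, 0, 0], [1, 0, 1, 0]], eps=1.0, iterations=5, rng=random.Random(0))` returns four non-negative weights summing to 100, from which synthetic records may be sampled.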

Within the enclave, the MWEM data synthesizer is supported by a local library of the SmartNoise package (https://docs.smartnoise.org/synth/index.html).

4.4.3 Linkage and error estimation

The linkage may be a source of disclosure if the probabilistic weights and estimated linkage accuracy are communicated to a participating statistical organization, e.g. to let the organization choose the probabilistic weight threshold. To limit this disclosure, the linkage parameters and estimated accuracy may be computed with differentially private procedures.

For the probabilistic linkage weights, this procedure may be based on the post-processing property as follows. First, generate a differentially private histogram of the pair distribution according to the vector of comparison outcomes. Next, use the histogram as input to the numerical optimization of the log-linear mixture likelihood. The sensitivity of the actual histogram is $2 d$ (see Lemma 2 in Annex 1) with respect to individual transactions, which implies a sensitivity of $2 T d$ with respect to individual exporters, since each exporter has up to T transactions. This information lets us select the correct amount of noise when generating the differentially private histogram. In practice, the differentially private histogram is exported outside the enclave, where the probabilistic weights are estimated through nonlinear optimization. Then, the resulting estimates are imported into the enclave where they are applied to the potential pairs. To work within the enclave constraints, the histogram is generated as follows. First, assign a working exponential weight of $2^{k - 1}$ to $γ_{k}$ for $k = 1, \dots, K$ . Then, for each $t = 0, \dots, 2^{K} - 1$ , the enclave reports the number of potential pairs having a weight at least equal to t plus an adequate amount of noise from the Laplace distribution (¹³ p. 33) or double-sided geometric distribution,²⁶ to obtain an $ε$ differentially private count. From these noisy counts, the differentially private histogram is easily derived. This means that the overall procedure is $2^{K} ε$ differentially private. By exploiting the correlations among the cell counts, it is possible to construct a differentially private histogram that requires much less noise for the same privacy loss (¹³ p. 33). However, doing so is not currently possible within the enclave due to the constraints therein.

The linkage accuracy may be estimated in a differentially private manner by adding an adequate amount of noise to the $n_{i}$ frequencies and fitting the previously described error model to the perturbed frequencies. However, this solution is currently prohibited by the enclave constraints.

To overcome this hurdle, it is possible to model the total number of links for disjoint groups of export records, i.e. to model $n_{(H - 1) t + 1} + \dots + n_{H t}$ for groups of size H, where $t = 1, \dots, ⌊ N / H ⌋$ . In this case, the group size is chosen such that $H = o (\sqrt{N})$ , to ensure that the $n_{i}$ 's are approximately independent within each group,²⁷ each $n_{i}$ following the previously described mixture model. Then, the number of links of the different groups (hereafter called group $n_{i}$ 's) may be obtained by a differentially private histogram with $⌊ N / H ⌋$ bins, where bin t includes the export records with a record id (a meaningless number) between $(H - 1) t + 1$ and $H t$ . As before, the amount of noise is dictated by the sensitivity of the original histogram, which is bounded by $2 d$ for individual transactions (see Lemma 3 in Annex 1), and by $2 T d$ for individual exporters. Grouping the export records requires the following variation of the error model. As before, suppose that there are G latent record classes. Then, $n_{(H - 1) t + 1} + \dots + n_{H t}$ depends on the latent number $a_{g}$ of export records from class g in group t, where $(a_{1}, \dots, a_{G})$ has a multinomial distribution based on H trials and probabilities $(α_{1}, \dots, α_{G})$ , and the conditional distribution of $n_{(H - 1) t + 1} + \dots + n_{H t}$ given $(a_{1}, \dots, a_{G})$ is of the form $f_{1}^{(* a_{1})} * \dots * f_{G}^{(* a_{G})}$ , where $*$ denotes the convolution operator, $f_{g} = \mathrm{Bernoulli} (p_{g}) * \mathrm{Poisson} (λ_{g})$ and $f_{g}^{(* k)}$ is the $k$ -fold convolution of the distribution $f_{g}$ with itself. With a single record class, we have $n_{(H - 1) t + 1} + \dots + n_{H t} \sim \mathrm{Binomial} (H, p) * \mathrm{Poisson} (H λ)$ . As before, the recall and precision are given by $\bar{p}$ and $\bar{p} / (\bar{p} + \bar{λ})$ , respectively, where $\bar{p} = \sum_{g = 1}^{G} α_{g} p_{g}$ and $\bar{λ} = \sum_{g = 1}^{G} α_{g} λ_{g}$ . 
In practice, we access a noisy version of the group $n_{i}$ 's, which are exported outside the enclave. With the added noise, the group $n_{i}$ 's are of the form $n_{(H - 1) t + 1} + \dots + n_{H t} + e_{t}$ , where $e_{t}$ is the added noise, which follows the double-sided geometric distribution²⁶ with parameter $\exp (- ε / (2 d))$ (when protecting each transaction), where $ε$ is the privacy loss for estimating the precision and recall. The group $n_{i}$ 's are clipped at 0, i.e. we ultimately observe $\max (0, n_{(H - 1) t + 1} + \dots + n_{H t} + e_{t})$ for $t = 1, \dots, ⌊ N / H ⌋$ . Denoting the noise distribution by $\mathrm{DoubleGeom} (e^{- ε / (2 d)})$ , we have

\begin{aligned} n_{(H - 1) t + 1} + \dots + n_{H t} + e_{t} \sim \mathrm{Binomial} (H, p) * \mathrm{Poisson} (H λ) \\ * \mathrm{DoubleGeom} (e^{- ε / (2 d)}), \end{aligned}

when there is a single record class. As before, the model parameters (i.e. p and $λ$ ) may be estimated by maximizing the log-likelihood of the perturbed group $n_{i}$ 's. The recall and precision may be estimated by $\hat{p}$ and $\hat{p} / (\hat{p} + \hat{λ})$ , respectively, where $\hat{p}$ and $\hat{λ}$ are the maximum likelihood estimators. However, the estimated precision has a large positive bias when the actual precision is high, because $\hat{λ}$ severely underestimates $λ$ . Interestingly, this problem does not occur when directly accessing the group $n_{i}$ 's without any added noise. To mitigate this bias, the precision is instead estimated by $(\# \text{ export records}) \, \hat{\bar{p}} / (\text{DP } \# \text{ links})$ , where the denominator is a differentially private measurement of the total number of links, which is based on the sensitivity of $2 d$ (with respect to individual transactions), which is established in Lemma 4 in Annex 1. To remain within the allocated privacy budget of $ε$ , one may use a fraction of the budget (e.g. $(3 / 4) ε$ ) to obtain $\hat{p}$ and $\hat{λ}$ , and the remaining fraction (e.g. $(1 / 4) ε$ ) to measure the total number of links.
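For a single record class, the likelihood of the perturbed and clipped group counts can be evaluated by convolving the three probability mass functions numerically. The Python sketch below is one possible way to do so, under stated assumptions: the Poisson and noise distributions are truncated (`kmax`, `radius`), the Poisson mean $H λ$ is strictly positive, and all function names are hypothetical. It is not the code that was run for this study.

```python
import math


def convolve(a, b):
    """Discrete convolution of two probability mass functions given as lists."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def binomial_pmf(n, p):
    return [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


def poisson_pmf(mean, kmax):
    # Truncated at kmax; requires mean > 0.
    return [math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))
            for k in range(kmax + 1)]


def double_geom_pmf(alpha, radius):
    # Two-sided geometric noise on {-radius, ..., radius}, pmf proportional to alpha**|k|.
    c = (1 - alpha) / (1 + alpha)
    return [c * alpha ** abs(k) for k in range(-radius, radius + 1)]


def noisy_group_pmf(H, p, lam, eps, d, kmax=60, radius=200):
    """Return (offset, pmf), where pmf[i] is the probability of the value i + offset."""
    alpha = math.exp(-eps / (2 * d))
    links = convolve(binomial_pmf(H, p), poisson_pmf(H * lam, kmax))
    return -radius, convolve(links, double_geom_pmf(alpha, radius))


def clipped_log_likelihood(observed, H, p, lam, eps, d):
    """Log-likelihood of group counts that were perturbed and then clipped at 0."""
    offset, pmf = noisy_group_pmf(H, p, lam, eps, d)
    mass_at_zero = sum(pmf[: -offset + 1])  # all values <= 0 are reported as 0
    total = 0.0
    for y in observed:
        prob = mass_at_zero if y == 0 else pmf[y - offset]
        total += math.log(prob)
    return total
```

Maximizing `clipped_log_likelihood` over `p` and `lam` with a generic numerical optimizer would give the estimators $\hat{p}$ and $\hat{λ}$ discussed above.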

5 Evaluation

The methodology is evaluated with mock data and tests on the enclave. The code for these tests is publicly available on GitHub.²⁸

5.1 Generating the mock data

The mock data is based on a population of 100,000 fictitious transactions that are recorded with typos but without duplication in an export dataset and an import dataset. All the transactions are assumed to occur in the year 2021, with the exporter name, exporter size, importer size, product code, day, month, transaction value (in thousands of dollars), and tariff preference. These variables are distributed as follows. The exporter name is sampled with replacement from 947 firm names listed by the United States Securities and Exchange Commission (SEC). Thus, each exporter has about 100 transactions on average. The exporter size is set to ‘small’ or ‘large’ with equal probability. Likewise, the importer size is set to ‘small’ or ‘large’ with equal probability. The product code is sampled with replacement from a list of 200 six-digit product codes. The day is sampled with replacement from the set {1, …,30}, while the month is sampled with replacement from the set {1, …,12}. The transaction value is drawn uniformly between 2 and 1000. Finally, the tariff preference is set to “Preferential” with a probability that is given by a linear or logistic model. Otherwise, it is set to the regular tariff, the so-called Most-Favoured Nation (MFN) tariff. The probability of a preferential tariff is a function of whether the exporter is large ( $x_{1}$ ), whether the importer is large ( $x_{2}$ ) and the logarithm in base 10 of the transaction value ( $x_{3}$ ). With the linear probability model, we have $E [y | x_{1}, x_{2}, x_{3}] = β_{0} + β_{1} x_{1} + β_{2} x_{2} + β_{3} x_{3}$ , where $β_{0} = 0.2$ and $β_{1} = β_{2} = β_{3} = 0.1$ . With the logistic model, we have $\mathrm{logit} (E [y | x_{1}, x_{2}, x_{3}]) = β_{0} + β_{1} x_{1} + β_{2} x_{2} + β_{3} x_{3}$ , where $β_{0} = - 2$ and $β_{1} = β_{2} = β_{3} = 0.5$ . The export and import datasets are created by recording each transaction with typos in separate datasets. 
The export dataset includes the exporter name, exporter size, product code, day, month, and transaction value, while the import dataset includes the exporter name, importer size, product code, day, month, transaction value and tariff preference. For simplicity, the exporter size, importer size, and tariff preference are recorded without typos. For the other variables, the typos are produced as follows. The exporter name is recorded without any typo with probability 0.9; otherwise, it is modified by replacing all occurrences of a given letter by a different randomly chosen letter, where the replaced letter is chosen randomly among those that appear in the exporter name. The product code is modified in a similar manner, i.e. with probability 0.9, it is recorded without typos. Otherwise, it is modified by replacing all occurrences of a given digit by a different randomly chosen digit, where the replaced digit is chosen randomly among those that appear in the product code. The transaction value is recorded without typos on the exporter side. However, on the importer side, the transaction value is scaled by $e^{u}$ where u is drawn from the normal distribution with mean 0.05 and standard deviation 0.01. The day is recorded without any typo with probability 0.9; otherwise, it is decreased or increased by one with equal probability. The recorded day is set to 1 if the result is less than 1, and it is set to 30 if the result is greater than 30. The month is modified in a similar manner, i.e. it is recorded without any typo with probability 0.9; otherwise, it is decreased or increased by one with equal probability. In addition, the recorded month is set to 1 if the result is less than 1, and it is set to 12 if the result is greater than 12. The resulting datasets are utilized for testing the differentially private production process in the enclave.
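The typo mechanisms described above may be sketched in Python as follows. This is an illustrative reading of the description rather than the published test code: the field names (`exporter_name`, `product_code`, `day`, `month`, `value`), the function names and the choice of `string.ascii_letters` as the letter alphabet are assumptions, and the selection of variables kept in each dataset is omitted.

```python
import math
import random
import string


def replace_one_symbol(text, alphabet, rng):
    """Replace all occurrences of one symbol present in text by a different symbol."""
    present = sorted({c for c in text if c in alphabet})
    if not present:
        return text
    old = rng.choice(present)
    new = rng.choice([c for c in alphabet if c != old])
    return text.replace(old, new)


def shift_by_one(value, low, high, rng):
    """Decrease or increase by one with equal probability, then clip to [low, high]."""
    return min(high, max(low, value + rng.choice((-1, 1))))


def record_with_typos(tx, rng, importer_side, p_clean=0.9):
    """Record one transaction (a dict) with the typo mechanisms of this section."""
    rec = dict(tx)
    if rng.random() >= p_clean:
        rec["exporter_name"] = replace_one_symbol(
            tx["exporter_name"], string.ascii_letters, rng)
    if rng.random() >= p_clean:
        rec["product_code"] = replace_one_symbol(tx["product_code"], string.digits, rng)
    if rng.random() >= p_clean:
        rec["day"] = shift_by_one(tx["day"], 1, 30, rng)
    if rng.random() >= p_clean:
        rec["month"] = shift_by_one(tx["month"], 1, 12, rng)
    if importer_side:
        # The value is scaled by e^u with u ~ Normal(0.05, 0.01) on the importer side only.
        rec["value"] = tx["value"] * math.exp(rng.gauss(0.05, 0.01))
    return rec
```

Calling `record_with_typos` once with `importer_side=False` and once with `importer_side=True` on the same transaction yields the two recordings that the linkage later has to reconcile.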

5.2 Testing the enclave

Within the enclave, the implemented production process comprises a probabilistic linkage, the model-based estimation of the linkage errors, a regression and the generation of synthetic data. At the different stages, the allocated privacy budget is determined by prior simulations outside the enclave based on the expected utility of the different estimates. The tests consist in executing the entire production process ten times (each time representing an iteration), each time with a new mock dataset that is generated as described before. For each iteration, the linkage, regression and data synthesis performance are measured. The following paragraphs provide more details, where $ε$ is a privacy budget, which may vary according to the stage of the production process.

The linkage is probabilistic and implemented with a local version of the Record Linkage Toolkit (RLTK) using the variables exporter name, product code, transaction value, day and month. For better results, the exporter name is preprocessed including capitalization, removal of spaces and stop words such as “&”, “inc.”, “inc”, “corporation”, “corp.” and “corp”. While this preprocessing may occur within the secure enclave, it is more convenient to do so outside before uploading the datasets to the enclave. Within the enclave, blocking criteria are applied based on the following conditions.

Having the same exporter name, product code, transaction value, day and month

Having the same product code, transaction value, day and month

Having the same exporter name, day and month

Having the same exporter name, product code, transaction value and month

Having the same exporter name, product code, transaction value and day

The number of potential pairs per record is limited to $d = 2$ . For each potential pair, the linkage variables are compared as follows. For the exporter name and product code, an agreement is having a Jaro-Winkler similarity equal to or greater than 0.85. For the transaction value, an agreement is having a relative difference equal to or less than 0.1, where the denominator is the recorded transaction value on the exporter side. For the day and the month, an agreement is having an absolute difference (in number of days or months) equal to or less than 1. For each potential pair, these comparisons produce a vector of dichotomous outcomes, which serves to estimate the linkage weights, under the assumption of conditional independence. The probabilistic weights are estimated by maximizing the log-linear mixture likelihood numerically outside the enclave, based on the perturbed frequencies of the comparison vector. To extract the frequencies from the enclave, working agreement weights 16, 8, 4, 2 and 1 are used for the exporter name, product code, transaction value, day and month respectively, while all working disagreement weights are set to 0. Then, the total weight of each potential pair is computed; the blocking criteria constrain it to be between 15 and 31. For $t = 15, \dots, 31$ , the number of pairs with weight equal to or greater than t is obtained, and some noise is added to the result to obtain a perturbed count $f_{t}$ , which is $ε$ differentially private, based on the sensitivity given in Lemma 4 of Annex 1. The added noise follows the geometric mechanism²⁶ with parameter $\exp (- ε / (2 d))$ . By sequential composition, $(f_{15}, \dots, f_{31})$ is $17 ε$ differentially private, since the 17 cumulative counts are computed from the same pairs. The computation of this histogram can take up to 15 min, including the communication time with the enclave. Then, the number of pairs with weight t is set to $\max (0, f_{t} - f_{t + 1})$ for each t between 15 and 30, and it is set to $f_{t}$ if $t = 31$ . 
By the post-processing property (Dwork and Roth, 2014, Proposition 2.1), the resulting histogram is $17 ε$ differentially private. As mentioned before, the actual probabilistic weights are estimated by maximizing the log-linear mixture likelihood based on the perturbed frequencies. To obtain reasonable weights, this maximization is done under the following constraints. Let $m_{1}$ , $m_{2}$ , $m_{3}$ , $m_{4}$ and $m_{5}$ denote the conditional probabilities of agreement given that a pair is matched for the exporter name, product code, transaction value, day and month, respectively, and let $u_{1}$ , $u_{2}$ , $u_{3}$ , $u_{4}$ and $u_{5}$ denote the corresponding probabilities of agreement given that a pair is unmatched. The first set of constraints is $m_{t} > 2 u_{t}$ , for $t = 1, \dots, 5$ , which means that the probability of an agreement for a matched pair is at least twice as large as the corresponding probability for an unmatched pair for each linkage variable. This also implies that the agreement weight $\log (m_{t} / u_{t})$ is bounded away from 0 for each linkage variable. The second set of constraints is $m_{1} > m_{2}$ and $u_{1} < u_{2}$ , giving more importance to agreement on exporter name than agreement on product code. The third set of constraints is $m_{3} > m_{2}$ and $u_{3} < u_{2}$ , giving greater importance to agreement on transaction value relative to agreement on product code. The fourth set of constraints is $m_{2} > m_{4}$ and $u_{2} < u_{4}$ , giving greater importance to agreement on product code relative to agreement on day. Finally, the fifth set of constraints is $m_{2} > m_{5}$ and $u_{2} < u_{5}$ , giving a greater importance to agreement on product code relative to agreement on month.
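The extraction of the perturbed histogram through cumulative counts, as described in this paragraph, can be sketched in Python as follows. The working weights and the range 15 to 31 come from the text; the function names are hypothetical, and the noise sampler (the difference of two geometric variables) is one standard way of drawing from the two-sided geometric distribution, assumed here for illustration. Since $f_{t}$ counts the pairs with weight at least t, the number of pairs with weight exactly t is recovered as the difference $f_{t} - f_{t + 1}$ between consecutive cumulative counts.

```python
import math
import random

# Working agreement weights: exporter name, product code, transaction value, day, month.
WORKING_WEIGHTS = (16, 8, 4, 2, 1)


def working_weight(agreements):
    """Total working weight of a pair from its vector of 0/1 comparison outcomes."""
    return sum(w for w, a in zip(WORKING_WEIGHTS, agreements) if a)


def two_sided_geometric(alpha, rng):
    """Difference of two i.i.d. geometric variables on {0, 1, ...}, with 0 < alpha < 1."""
    def geometric():
        u = 1.0 - rng.random()  # uniform on (0, 1]
        return int(math.floor(math.log(u) / math.log(alpha)))
    return geometric() - geometric()


def noisy_cumulative_counts(pair_weights, eps, d, rng, lo=15, hi=31):
    """Perturbed counts f_t of pairs with weight >= t, each eps differentially private."""
    alpha = math.exp(-eps / (2 * d))
    return {t: sum(1 for w in pair_weights if w >= t) + two_sided_geometric(alpha, rng)
            for t in range(lo, hi + 1)}


def histogram_from_cumulative(f, lo=15, hi=31):
    """Post-process cumulative counts into per-weight counts, clipped at 0 below hi."""
    hist = {t: max(0, f[t] - f[t + 1]) for t in range(lo, hi)}
    hist[hi] = f[hi]  # the top cell is the last cumulative count itself
    return hist
```

With a weight estimation budget of 0.5 shared over the 17 counts, each call would use `eps=0.5 / 17`, e.g. `noisy_cumulative_counts(weights, 0.5 / 17, d=2, rng=random.Random())`.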

The estimated probabilistic weights are applied to the potential pairs within the enclave and the weight threshold is selected with the help of the error model to have a precision that is at least equal to 0.99. The actual procedure is iterative and visits the possible pair weights, in decreasing order, until the estimated precision falls below the target. This procedure is slightly simpler than that described in Lemma 1 in Annex 1, where some pairs are linked with a positive probability that is less than 1.0. For the estimation of the linkage accuracy, the records are partitioned into disjoint groups of 100 export records, and the group $n_{i}$ 's are perturbed according to the geometric mechanism²⁶ with parameter $\exp (- ε / (2 d))$ , based on Lemma 3 in Annex 1. Consequently, each measurement of the linkage accuracy is $ε$ differentially private, and the overall privacy loss is proportional to the number of iterations of the threshold selection procedure. For example, the total loss is $3 ε$ if the threshold is set at the second highest possible pair weight. Using the selected threshold, the linkage decisions are made within the enclave.

Finally, the linked data serves to fit the relevant regression model (linear or logistic) and generate some synthetic data. In these steps, the privacy parameter is set to $ε / d$ to ensure that the procedure is $ε$ differentially private despite the linkage. In the case of a linear regression, the regression is performed with diffprivlib version 0.6.4 within the enclave. However, this version of diffprivlib cannot be used for logistic regressions because it gives nonsensical results. Instead, version 0.6.6 must be used, which is not currently installed in the enclave. To mimic this feature, logistic regressions are implemented outside the enclave based on diffprivlib version 0.6.6 and on links that are produced in the clear. As mentioned before, the data synthesis is implemented with the MWEM synthesizer of the SmartNoise package. It concerns the exporter size, importer size, tariff preference, transaction value and logarithm in base 10 of the transaction value, with dummy coding for the categorical variables. Specifically, the exporter size is coded to 1 if the size is large, and the same applies for the importer size. As for the tariff preference, it is set to 1 if the tariff is preferential. To assess the utility of the synthetic data for data exploration, the appropriate regression model is fitted to the synthetic data with the package statsmodels, and the p-value of each coefficient is checked to see if it is significant at the 0.1 level. Ideally, each coefficient should be significant.

The production process involves some sequential composition and some parallel composition. Some parallel composition occurs when some steps are concurrent and have the same inputs, e.g. the analysis and data synthesis that are both based on the linked pairs. There is also some sequential composition because the weight estimation procedure precedes the threshold selection procedure, and the estimated weights are utilized by the latter procedure. Besides, the outputs from both procedures serve to create the linked pairs, which are inputs for the analysis and data synthesis as mentioned above.

5.3 Allocating the privacy budget

In terms of privacy budget, we aim for an overall budget that does not exceed 5.0 for the entire production process, while having a minimum utility for the point estimates and p-values that are based on the synthetic data. For a point estimate, the utility is measured by the root mean square relative error (i.e. the square root of the mean of the squared relative error between the point estimate and a reference value), to mirror the use of the coefficient of variation in assessing the reliability of published estimates at Statistics Canada.²⁹ Indeed, the reliability is classified as acceptable, marginal or unacceptable according to whether the coefficient of variation is less than or equal to 16.5%, greater than 16.5% and less than or equal to 33.3%, or greater than 33.3%. In what follows, these different levels are called gold, silver and bronze, respectively, and they are based on the root mean square relative error (RRMSE). For the estimated precision and recall, the reference values are the actual precision and recall based on the ground truth. For a regression coefficient, the reference value is the chosen value when generating the mock data. Our goal is to produce estimates that are at least at the gold or silver reliability level, for the estimated linkage accuracy and regression coefficients (i.e. $β_{1}$ , $β_{2}$ and $β_{3}$ ). The reliability is defined differently for the p-values based on the synthetic data. In this case, it is based on the number of times that a p-value is below the significance level of 0.1, for a variable that is known to be significant, including the exporter size, the importer size and the log of the transaction value. For simplicity, we use the following simple criterion that is called gold reliability. The null hypothesis (that a variable is not significant) is rejected at least 9 times out of 10 for a significant variable. 
In practice, it is difficult to choose the privacy loss parameter to achieve a given utility without accessing the data or knowing the specific analysis that is to be performed. However, here we fortunately have both and can obtain an answer by simulating and mimicking each stage of the production process outside the enclave ten times, for different values of the ratio of the privacy budget to the sensitivity. This ratio is hereafter denoted by $ε / Δ$ , and it is selected in the set $\{2^{- 0}, 2^{- 1}, \dots, 2^{- 5}\}$ , where $ε$ is the privacy loss parameter and $Δ$ is the sensitivity, which is a function of the maximum number of links per record ( $d$ ). When protecting each exporter, this sensitivity is also a function of the maximum number of transactions per exporter ( $T$ ). The simulations aim to identify the minimum value of the ratio $ε / Δ$ to achieve a reliability at the gold or silver level.
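The reliability criteria above can be written compactly in Python. The sketch below uses the thresholds of 16.5% and 33.3% and the rule of at least 9 rejections out of 10 stated in the text; the function names are hypothetical.

```python
import math


def rrmse(estimates, reference):
    """Root mean square relative error of repeated estimates against a reference value."""
    squared = [((e - reference) / reference) ** 2 for e in estimates]
    return math.sqrt(sum(squared) / len(squared))


def reliability(error):
    """Map an RRMSE to the gold, silver or bronze level."""
    if error <= 0.165:
        return "gold"
    if error <= 0.333:
        return "silver"
    return "bronze"


def p_values_are_gold(p_values, level=0.1, min_rejections=9):
    """Gold criterion for p-values: the null is rejected at least 9 times out of 10."""
    return sum(1 for p in p_values if p < level) >= min_rejections
```

For example, ten estimates of $β_{1} = 0.1$ would be passed as `reliability(rrmse(estimates, 0.1))`.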

For the error estimation step, the mock data is generated as described before, and the datasets are linked based on having the same exporter name, product code, transaction value, day and month. This has the advantage of simplicity, while resulting in a high precision like the probabilistic linkage implemented in the enclave. In the simulations, the precision is very close to 1.0 (between 0.999 and 1.0) while the recall is around 0.4. The linkage accuracy is estimated by perturbing the group $n_{i}$ 's. For the data analysis and data synthesis steps, the mock data is generated as before and the two datasets are linked perfectly, i.e. two records are linked if and only if they relate to the same transaction.

Tables 2 to 4 give the minimum $ε / Δ$ ratio according to the reliability in the different stages. In Table 2, this information is provided separately for the precision and recall. Both measures are estimated at the gold level with $ε / Δ$ as low as 0.25, and at the silver level with $ε / Δ$ as low as 0.125.

Table 2.

Minimum $ε / Δ$ ratio according to the reliability in the error estimation step.

Precision		Recall	
Gold	Silver	Gold	Silver
0.25	0.125	0.25	0.125

NA: Not available.

Table 3.

Minimum $ε / Δ$ ratio according to the reliability in the analysis step.

	$β_{1}$		$β_{2}$		$β_{3}$
Model	Gold	Silver	Gold	Silver	Gold	Silver
Linear	0.125	NA	0.25	0.0625	0.5	0.25
Logit	0.25	0.125	0.25	NA	1.0	0.5

NA: Not available.

Table 4.

Minimum $ε / Δ$ ratio for a gold reliability in the data synthesis step.

Model	$β_{1}$	$β_{2}$	$β_{3}$
Linear	0.03125	0.03125	0.03125
Logit	0.03125	0.03125	0.03125

Table 3 gives the minimum $ε / Δ$ ratio according to the reliability, data model (linear or logit) and regression coefficient. With the linear model, the estimation of $β_{1}$ , $β_{2}$ and $β_{3}$ at the gold level is observed with $ε / Δ$ as low as 0.125, 0.25 and 0.5, respectively. With the same model, $β_{2}$ and $β_{3}$ are estimated at the silver level with $ε / Δ$ as low as 0.0625 and 0.25, respectively, while estimation of $β_{1}$ at this level is not observed. In the case of the logit model, $β_{1}$ , $β_{2}$ and $β_{3}$ are respectively estimated at the gold level with $ε / Δ$ as low as 0.25, 0.25 and 1.0, while $β_{1}$ and $β_{3}$ are estimated at the silver level with $ε / Δ$ as low as 0.125 and 0.5, respectively. Overall, the required privacy budget tends to be larger under the logit model than the linear model, for a given reliability level.

Finally, Table 4 shows that it is possible to test the significance of each covariate at the gold level when $ε / Δ = 0.03125$ (the smallest selected value of $ε / Δ$ ), for the linear and logit models, based on the synthetic data.

Based on Tables 2 to 4, we can derive the minimum privacy loss to ensure that all the selected outputs have the desired reliability in a stage, such as the gold level, or the gold or silver level. For example, in the regression stage, having all outputs at the gold or silver level means that each regression coefficient is estimated with an RRMSE that does not exceed 33.3%. Of course, the derived privacy budget depends on the sensitivity, which is itself a function of d (the maximum number of links per record) and T (the maximum number of transactions per exporter); the latter if protecting each exporter. We can also do this exercise in the other direction as shown in Table 5, where we derive the largest possible d (if protecting each transaction) and T (if protecting each exporter) to ensure that all selected stage outputs are at the gold level, or at the gold or silver level, for a fixed maximum privacy budget. The parameters d and T are each constrained to be at least equal to 2, otherwise the configuration is deemed infeasible. For example, if the maximum d is 2 and the maximum T is less than 2, then we can choose $d = 2$ and protect each transaction at the selected reliability with the given privacy budget. However, we cannot protect each exporter. In detail, the maximum value of d is computed as follows. First, compute the maximum sensitivity by dividing the maximum privacy budget by the minimum $ε / Δ$ ratio according to the selected reliability. Next, derive the maximum value of d according to how the sensitivity $Δ$ depends on d. For example, consider the error estimation stage, a maximum privacy budget of $ε = 1.0$ and a reliability at the gold or silver level for all the selected outputs, i.e. the precision and recall. According to Table 2, all the outputs are at the gold or silver level with $ε / Δ$ as low as 0.125. Therefore, we must have $Δ \leq 1.0 / 0.125 = 8$ . Since the sensitivity is $Δ = 2 d$ (see Lemma 3 in Annex 1), we have $d \leq 4$ . 
When protecting each exporter, the sensitivity is $Δ = 2 d T$ . Since $d \geq 2$ , we have $T \leq 2$ .
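The derivation of the maximum d and T can be reproduced with a few lines of Python. In this sketch, the sensitivity is assumed to be of the form $Δ = c d$ when protecting each transaction and $Δ = c d T$ when protecting each exporter, with $c = 2$ for the error estimation stage as stated above. The function name and the parameter `sensitivity_per_link` (the constant c) are hypothetical.

```python
import math


def max_d_and_t(max_budget, min_ratio, sensitivity_per_link):
    """Largest d (protecting transactions) and T (protecting exporters).

    Returns None for a parameter whose maximum is below 2 (infeasible).
    """
    max_sensitivity = max_budget / min_ratio
    max_d = math.floor(max_sensitivity / sensitivity_per_link)
    # T is largest when d takes its smallest admissible value, d = 2.
    max_t = math.floor(max_sensitivity / (2 * sensitivity_per_link))
    return (max_d if max_d >= 2 else None, max_t if max_t >= 2 else None)
```

With the worked example above, `max_d_and_t(1.0, 0.125, 2)` gives a maximum d of 4 and a maximum T of 2.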

Table 5.

Maximum d and maximum T according to the privacy budget $ε$ and reliability for each stage.

			Gold		Gold and silver
Max. $ε$	Stage	Data model	Max. $d$	Max. $T$	Max. $d$	Max. $T$
0.5	Error estimation		<2	<2	2	<2
	Analysis	Linear	<2	<2	2	<2
		Logit	<2	<2	<2	<2
	Data synthesis	Linear	16	8	16	8
		Logit	16	8	16	8
1.0	Error estimation		2	<2	4	2
	Analysis	Linear	2	<2	4	2
		Logit	<2	<2	2	<2
	Data synthesis	Linear	32	16	32	16
		Logit	32	16	32	16
2.0	Error estimation		4	2	8	4
	Analysis	Linear	4	2	8	4
		Logit	2	<2	4	2
	Data synthesis	Linear	64	32	64	32
		Logit	64	32	64	32

The parameters d and T are mostly constrained by the analysis stage, because the same parameter values must be used across all the stages. According to Table 5, this means that d cannot exceed 8, and T cannot exceed 4, if the privacy budget per stage does not exceed 2.0, and the outputs are to be at the gold or silver level for the analysis stage. The small value of T illustrates the difficulty of providing group privacy for all the transactions associated with each exporter. This may be facilitated by increasing the number of distinct exporters. Table 5 provides a basis for allocating the privacy budget for the test scenarios that are described in the next section.
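Because the same d and T apply to every stage, the usable values are the minimum of the per-stage maxima. The sketch below illustrates this with the gold or silver columns of Table 5 for a per-stage budget of 2.0. The dictionary layout and the function name are illustrative, and the error estimation row, which has no data model in Table 5, is assumed to apply to both models.

```python
# (max d, max T) per stage, copied from Table 5 for a maximum epsilon of 2.0
# and the gold or silver reliability.
LIMITS_EPS_2 = {
    "linear": {
        "error_estimation": (8, 4),
        "analysis": (8, 4),
        "data_synthesis": (64, 32),
    },
    "logit": {
        "error_estimation": (8, 4),
        "analysis": (4, 2),
        "data_synthesis": (64, 32),
    },
}


def binding_limits(stage_limits):
    """Return the (d, T) pair that satisfies every stage at once."""
    d = min(limit_d for limit_d, _ in stage_limits.values())
    t = min(limit_t for _, limit_t in stage_limits.values())
    return d, t


for model, stage_limits in LIMITS_EPS_2.items():
    print(model, binding_limits(stage_limits))
```

For the linear model this gives the bounds of 8 and 4 quoted above; for the logit model the analysis stage is more restrictive.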

5.4 Scenarios

Table 6 describes the two test scenarios, where the protection is on individual transactions and the test is repeated ten times in each scenario. A budget of 0.5 is allocated to the weight estimation procedure, including a budget of $0.5 / 17$ for each perturbed frequency $f_{t}$, $t = 15, \dots, 31$, as described in Section 5.2. This budget is used to export the frequencies from the enclave. The resulting weights lead to sensible linkage decisions, even if this fact is not expressed in terms of reliability. In the first scenario, $d = 2$, the mock data is generated according to a linear model, and the linkage errors are estimated by perturbing the group $n_{i}$'s. The total privacy budget is of the form $2 + τ / 2$, where $τ$ is the number of attempts of the threshold selection procedure. The second scenario differs from the first by generating the data according to a logit model and allocating a privacy budget of 2.0 to the analysis stage. In this case, the total privacy budget is of the form $3 + τ / 2$. The test results are presented in the next section.

Table 6.
Test scenarios.

Scenario	Model	$d$	Step	Budget ( $ε$ )	Expected reliability
1	Linear	2	Weight estimation	0.5	NA
			Error estimation by perturbing the group $n_{i}$ 's	0.5	Gold or silver
			Analysis	1.0	Gold or silver
			Data synthesis	0.5	Gold
2	Logit	2	Weight estimation	0.5	NA
			Error estimation by perturbing the group $n_{i}$ 's	0.5	Gold or silver
			Analysis	2.0	Gold or silver
			Data synthesis	0.5	Gold

NA: Not available.
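The total budgets quoted for the two scenarios can be recovered from the step budgets in Table 6 by simple addition. The sketch below is a reading of the text rather than the authors' code: it assumes that the budgets of successive steps add up, and that each attempt of the threshold selection procedure consumes one error estimation budget of 0.5, which is what the $τ / 2$ term suggests. All names are illustrative.

```python
# Step budgets from Table 6 (epsilon per step).
STEP_BUDGETS = {
    1: {"weight_estimation": 0.5, "error_estimation": 0.5,
        "analysis": 1.0, "data_synthesis": 0.5},
    2: {"weight_estimation": 0.5, "error_estimation": 0.5,
        "analysis": 2.0, "data_synthesis": 0.5},
}


def total_budget(scenario, attempts):
    """Total epsilon for a scenario after `attempts` threshold attempts."""
    budgets = STEP_BUDGETS[scenario]
    fixed = (budgets["weight_estimation"]
             + budgets["analysis"]
             + budgets["data_synthesis"])
    # Assumption: every attempt spends the error estimation budget once.
    return fixed + attempts * budgets["error_estimation"]


# The weight estimation budget is split evenly over the perturbed
# frequencies f_t, t = 15, ..., 31.
FREQUENCY_INDICES = range(15, 32)
per_frequency_budget = 0.5 / len(FREQUENCY_INDICES)

print(total_budget(1, attempts=1), total_budget(2, attempts=1))
print(len(FREQUENCY_INDICES), per_frequency_budget)
```

With a single attempt, this gives 2.5 for the first scenario and 3.5 for the second.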

5.5 Results

The summary statistics for the actual and estimated linkage accuracy are shown in Tables 7 and 8, respectively. In Table 7, the mean precision is at or above 0.99 as expected, while the mean recall is at or above 0.795. For the estimated linkage accuracy, the absolute relative bias is below 1% while the RMSRE does not exceed 5%. Thus, the precision and recall are both estimated at the gold level.

Table 7.
Actual linkage accuracy.

Scenario	Measure	Mean	Variance ( $\times 10^{- 4}$ )	Min.	Max.
1	Precision	0.998	0.066	0.994	1.000
	Recall	0.795	2.398	0.784	0.819
2	Precision	0.990	3.215	0.956	1.000
	Recall	0.798	2.799	0.784	0.820

Table 8.

Estimated linkage accuracy.

Scenario	Measure	Mean	Variance ( $\times 10^{- 4}$ )	Min.	Max.	Relative bias (%)	RMSRE (%)
1	Precision	0.989	0.676	0.973	1.000	−0.934	1.321
	Recall	0.788	4.657	0.763	0.821	−0.856	1.336
2	Precision	0.982	4.296	0.940	1.000	−0.727	3.315
	Recall	0.793	17.701	0.737	0.856	−0.713	3.328
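For readers who want to reproduce the last two columns of such a table from their own runs, the sketch below computes a relative bias and an RMSRE over repeated iterations. The formulas are the usual ones (mean and root mean square of the per-iteration relative errors) and are an assumption here, since the exact definitions are given elsewhere in the paper. The input values are made up for illustration and are not taken from the tests.

```python
import math


def relative_errors(estimates, truths):
    """Per-iteration relative error (estimate - truth) / truth."""
    return [(est - true) / true for est, true in zip(estimates, truths)]


def relative_bias_pct(estimates, truths):
    """Mean relative error, in percent."""
    errors = relative_errors(estimates, truths)
    return 100 * sum(errors) / len(errors)


def rmsre_pct(estimates, truths):
    """Root mean square relative error, in percent."""
    errors = relative_errors(estimates, truths)
    return 100 * math.sqrt(sum(err * err for err in errors) / len(errors))


def is_gold_or_silver(rmsre):
    """Gold or silver reliability: RMSRE of at most 33.3%."""
    return rmsre <= 33.3


# Made-up example: two iterations with a true precision of 1.0.
estimated = [0.99, 0.98]
actual = [1.0, 1.0]
print(relative_bias_pct(estimated, actual))
print(rmsre_pct(estimated, actual))
print(is_gold_or_silver(rmsre_pct(estimated, actual)))
```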

The performance of the estimated regression parameters is shown in Table 9. The absolute relative bias does not exceed 10%, while the RMSRE does not exceed 33.3% for each coefficient, i.e. a reliability at the gold or silver level.

Table 9.

Estimated regression parameters.

Scenario	Parameter	Mean	Variance ( $\times 10^{- 4}$ )	Min.	Max.	Relative bias (%)	RMSRE (%)
1	$α$	0.207	15.682	0.143	0.279	3.280	19.068
	$β_{1}$	0.102	1.193	0.086	0.120	2.339	10.622
	$β_{2}$	0.101	1.751	0.073	0.123	1.078	12.599
	$β_{3}$	0.097	2.221	0.076	0.122	−3.302	14.519
2	$α$	−1.862	747.827	−2.179	−1.301	−6.897	14.691
	$β_{1}$	0.487	13.934	0.406	0.533	−2.693	7.577
	$β_{2}$	0.480	21.925	0.397	0.556	−3.950	9.723
	$β_{3}$	0.453	90.206	0.254	0.567	−9.354	20.304

Based on the synthetic data, each regression coefficient is significant at the 0.1 level, in each scenario and iteration. Thus, the tests perform at the gold level.

The total privacy budget is shown in Table 10. It never exceeds 5.5, which is quite acceptable. Overall, the results demonstrate that it is possible to privately link the datasets while controlling the linkage accuracy and generating many useful outputs from the linked data, with a reasonable privacy budget.

Table 10.

Total privacy budget.

Scenario	Mean	Min.	Max.
1	2.9	2.5	4
2	4.2	3.5	5.5

6 Conclusions and next steps

Secure enclaves have a major role to play in the private linkage of datasets across statistical organizations, regardless of the competing solutions that are based on a public key infrastructure, oblivious transfer, garbled circuits or fully homomorphic encryption. Indeed, they currently offer by far the greatest flexibility for implementing a linkage, statistical analysis or disclosure control measures, since the clear data may be processed with standard statistical packages within an enclave, at least in theory.

This study has demonstrated these benefits by implementing a differentially private production process within a cloud-based enclave for the econometric analysis of international trade micro-data, including a sophisticated probabilistic linkage with approximate comparisons to deal with typos. Additionally, the estimation of the linkage accuracy is based on the datasets that are being linked instead of relying on a prior linkage of “representative” datasets, which may be hard to find. Furthermore, the linked data not only serves to perform a differentially private linear or logistic regression, but also to generate differentially private synthetic data for data exploration outside the enclave, a very useful feature that may precede and guide the statistical analysis in practice. Overall, the production process preserves both input and output privacy, where the latter property is based on the application of differential privacy techniques. In tests on the enclave, the production process yields many reliable statistical outputs within a total privacy budget that does not exceed 5.5. This process communicates with the outside world to estimate the probabilistic weights, estimate the linkage accuracy and select the weight threshold.

While these results are encouraging, improvements are needed in several areas that may be the focus of future work. One is the computation of the probabilistic weights. This operation may be facilitated by relaxing some of the enclave constraints, for example by allowing the computation of aggregate counts for observations that are cross-classified by many categorical variables, and by enabling a much larger subset of the procedures and function calls provided by the installed packages (e.g. the Record Linkage Toolkit) instead of the handful currently available. Also, providing a general-purpose nonlinear optimization routine within the enclave may help implement many steps without having to communicate with the external world, which is a source of privacy loss and latency. This is true for the estimation of the linkage accuracy and the selection of the weight threshold. Another challenge is the large amount of noise required to protect each exporter, which currently limits the number of transactions per exporter.

Beyond the enclave, the methodology may be improved to address some of its current limitations, namely the lack of support for regressions with fixed effects and the need to provide variances and confidence intervals when performing a regression. Fixed effects are an important feature of panel data and, in the case of a linear regression, are usually dealt with through de-meaning. While this de-meaning is easily done on each dataset by the corresponding organization at the source, the solution is not as simple as using the resulting linked data as input to one of the previously described procedures for a differentially private linear regression. Indeed, the actual privacy loss may exceed the expected loss, since the latter is based on the wrong assumption that each record represents the information of a single transaction in the export and import datasets. Instead, the underlying methodologies of these procedures must be examined thoroughly to update the anticipated privacy loss according to the sensitivity of the de-meaned data. Besides, the methodology must be developed further to report variances and confidence intervals, which is an active area of research in differential privacy. Lastly, extensions are required to deal with the situation where the datasets are not censuses. Despite these shortcomings, secure enclaves represent a viable solution.
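The difficulty with de-meaning can be seen on a toy example. The sketch below is purely illustrative: the values, the grouping variable and the function name are made up, and it is not part of the tested production process. It shows that changing a single transaction alters the de-meaned value of every record in the same group, so that a de-meaned record no longer carries the information of one transaction only.

```python
from collections import defaultdict


def demean_by_group(values, groups):
    """Subtract the group mean from each value (within transformation)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, group in zip(values, groups):
        sums[group] += value
        counts[group] += 1
    return [value - sums[group] / counts[group]
            for value, group in zip(values, groups)]


# Hypothetical transaction values for two groups, "A" and "B".
groups = ["A", "A", "A", "B"]
original = [10.0, 20.0, 30.0, 5.0]
# Neighbouring dataset: only the first transaction is modified.
neighbour = [40.0, 20.0, 30.0, 5.0]

demeaned_original = demean_by_group(original, groups)
demeaned_neighbour = demean_by_group(neighbour, groups)

changed = sum(1 for a, b in zip(demeaned_original, demeaned_neighbour)
              if a != b)
print(demeaned_original)
print(demeaned_neighbour)
print("records affected by one changed transaction:", changed)
```

In this example, all three records of group "A" change although only one transaction was modified, which is why a sensitivity derived for one record per transaction no longer applies after de-meaning.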

By fostering collaboration and innovation, this work paves the way for a broader adoption of secure data integration techniques in official statistics, in compliance with international privacy regulations. It also provided the opportunity to learn many valuable lessons regarding the need to manage expectations about the reuse of existing statistical packages within an enclave, the importance of viewing differential privacy as a property of the entire production process instead of focusing on the outputs, and the usefulness of the differential privacy framework for discussing privacy risks. While an enclave may conveniently support an existing package, it may be necessary to disable some of the package features to guarantee that all the outputs are differentially private. Also, the entire production process must be designed according to differential privacy principles, from the ground up. Finally, differential privacy concepts and definitions have helped articulate the intricate privacy risks of the use case, where each transaction has a record in each dataset and the two datasets are held by different organizations. Based on this experience, the authors are convinced that the framework of differential privacy can be the basis for the much-needed lingua franca about privacy.³⁰ Therefore, education about this subject is a priority for national statistical organizations.

Supplemental Material

Supplemental material, sj-docx-1-sji-10.1177_18747655251355704 for Private linkage of international trade microdata in a cloud-based secure enclave by A Dasylva, B Santos, L Franssen, M De Cubellis, F De Fausti, A Pappagallo, N Berrios and J Fitzsimons in Statistical Journal of the IAOS

Footnotes

Author note

The views expressed herein are those of the authors and do not necessarily reflect the views of the respective organizations.

Acknowledgements

The authors would like to thank the Office of National Statistics and all members of the UN PET Lab for their support and insights.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Supplemental material

Supplemental material for this article is available online.

References

1. Canada Statistics Act. Revised Statutes of Canada, 1985, c. S-19, https://laws-lois.justice.gc.ca/eng/acts/s-19/fulltext.html.

2. Netherlands Statistics Netherlands Act. Effective from March 2, 2022, enacted November 20, 2003, https://www.cbs.nl/-/media/_pdf/2017/28/statistics-netherlands-act-2022.pdf.

3. European Union. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation). Official Journal L119, May 4, 2016, pp. 1–88. https://gdpr-info.eu/.

4. UNECE. UNECE project on input privacy preservation: final report. United Nations Economic Commission for Europe, 2023.

5. Dasylva, De Cubellis, De Fausti, et al. Linking trade data from different National Statistical Offices through a private set intersection. J Off Stat 2025; 41: 569–597.

6. World Customs Organization. Harmonized system nomenclature 2022 edition. Brussels: WCO; [cited 2025 Feb], https://www.wcoomd.org/en/topics/nomenclature/instrument-and-tools/hs-nomenclature-2022-edition/hs-nomenclature-2022-edition.aspx (2022).

7. United Nations. United Nations guide on privacy-enhancing technologies for official statistics. New York: United Nations Committee of Experts on Big Data and Data Science for Official Statistics; [cited 2025 Feb], https://unstats.un.org/bigdata/task-teams/privacy/guide/ (2023).

8. Bruno, Nicoletti, Scannapieco, et al. Privacy preserving set intersection. In: Proceedings of the ninth Irving Fisher conference, 2018. Bank of International Settlements. https://www.bis.org/ifc/publ/ifcb49_33.pdf.

9. Pinkas, Schneider, Tkachenko, et al. Efficient circuit-based PSI with linear communication. In: Proceedings of EUROCRYPT, 2019, pp.122–153.

10. Straus. A federal government privacy-preserving technology demonstration. [Cited 2024 Dec 15], https://mccourt.georgetown.edu/news/a-federal-government-privacy-preserving-technology-demonstration/ (2021 June 29).

11. Zanussi, Dugdale, Santos. Practical privacy-aware data linkage and statistical aggregation based on privacy enhancing techniques. In: Presentation at Eurostat conference on new techniques and technologies for statistics, 2023. Statistics Canada.

12. Schnell. Privacy-preserving record linkage. In: Harron, Goldstein, Dibben (eds) Methodological developments in data linkage. Chichester: Wiley, 2016, pp.201–225.

13. Dwork, Roth. The algorithmic foundations of differential privacy. Found Trends Theor Comput Sci 2014; 9: 211–407.

14. Judson, Parker, Larsen. Adjusting sample weights for linkage-eligibility using SUDAAN. National Center for Health Statistics, https://www.cdc.gov/nchs/data/datalinkage/adjusting_sample_weights_for_linkage_eligibility_using_sudaan.pdf (2013).

15. Lahiri, Larsen. Regression analysis with linked data. J Am Stat Assoc 2005; 100: 222–230.

16. Chambers, Chipperfield, Davis, et al. Inference based on estimating equations with probability-linked data. In: Centre for statistical and survey methodology, working paper series, 2009. Wollongong: University of Wollongong.

17. Fellegi, Sunter. A theory of record linkage. J Am Stat Assoc 1969; 64: 1183–1210.

18. Dasylva, Goussanou. On the consistent estimation of linkage errors without training data. Jpn J Stat Data Sci 2022; 5: 181–216.

19. Akaike. A new look at the statistical model identification. IEEE Trans Autom Control 1974; 19: 716–723.

20. Kifer, Machanavajjhala. No free lunch in data privacy. In: Proceedings of the 2011 ACM SIGMOD international conference on management of data, 2011, pp.193–204. Association for Computing Machinery.

21. Near, Darais. Differential privacy: future work and open challenges. NIST, 2022.

22. Differential privacy for complex data: answering queries across multiple data tables, https://www.nist.gov/blogs/cybersecurity-insights/differential-privacy-complex-data-answering-queries-across-multiple.

23. Chaudhuri, Monteleoni, Sarwate. Differentially private empirical risk minimization. J Mach Learn Res 2011; 12: 1069–1109.

24. Zhang, Xiao, et al. Functional mechanism: regression analysis under differential privacy. Proc VLDB Endow 2012; 5: 1364–1375.

25. Hardt, Ligett, McSherry. A simple and practical algorithm for differentially private data release. In: Bartlett PL, Pereira FCN, Burges CJC, Bottou L, Weinberger KQ (eds). Advances in neural information processing systems 25: 26th annual conference on neural information processing systems 2012, Lake Tahoe, NV, USA, 3–6 December 2012, pp.2348–2356.

26. Ghosh, Roughgarden, Sundararajan. Universally utility-maximizing privacy mechanisms. In: Proceedings of the forty-first annual ACM symposium on theory of computing (STOC ‘09), 2009, pp.351–360. New York: ACM.

27. Dasylva, Goussanou. Making statistical inferences about linkage errors. Jpn J Stat Data Sci 2024; 7: 17–56.

28. GitHub repository for the test code: ObliviousAI/private-linkage.

29. Statistics Canada. Table 5.1: quality level guidelines. Ottawa: Statistics Canada; [cited 2025 May 23], https://www150.statcan.gc.ca/n1/pub/13f0026m/2007001/table/tab5p1-eng.htm (2009).

30. Kean, Jansen, Hsiao, et al. A global expert panel weigh on privacy enhancing technologies (PETs) and risk assessment, mixed method Delphi study on the role of privacy enhancing technologies (PETs) for global data sharing ecosystems. Stat J Int Assoc Off Stat, Forthcoming.
