Abstract
In this note, we present a plausible structural mechanism by which over-parameterized deep learning models trained on real data may produce pseudo-synthetic data that constitute merely a different representation (or re-encoding) of the training data. We conjecture that, in principle, similar mechanisms may be learned by large-scale AI models even if they are not intentionally designed to do so. From there, we derive some cautionary warnings for potential adopters of pseudo-synthetic data generation tools based on deep learning. We claim that the burden of proof that no data re-encoding mechanism is at play in AI-based generation models rests with their proponents.
1. Context and Motivations
The outstanding success of Artificial Intelligence (AI) based on deep learning in numerous business fields is changing society's perception of both the value and the risks associated with personal data. Like oil, which releases energy but also pollution when fueling combustion engines, data creates new value but also new risks when fueling modern AI engines. The availability of more capable and powerful AI technology increases the appetite for granular data, and hence the pressure on those who hold the data to make them available to those who own the AI engines. But sharing and releasing granular data referring to natural persons, that is, personal data, comes with privacy risks and must comply with data protection legislation. The combined effect of the business world's increasing appetite for granular data and the general public's increasing attention to privacy drives the quest for solutions that, like pollution-free combustion, promise to enable risk-free sharing of granular data with the same general utility as the source data. One approach actively investigated by different research communities, and already offered commercially by several start-up companies, relies on the generation of so-called “synthetic data” by means of large deep learning models trained on real data.
Generally speaking, the term “synthetic data” is used to refer to collections of data records created artificially that should ideally resemble the original data collectively but not individually. In other words, synthetic data should (i) yield the same population-level features as the real data, for example, the same statistics and joint distributions (population-level similarity), while at the same time (ii) not represent, or allow one to derive, information about real-world data units (unit-level dissimilarity). The utility of the synthetic data depends on which population-level features (statistics) are retained from the original data: if the analysis task is known in advance, one may engineer the synthesis process to retain exactly the features and statistics of interest—but in this case one may wonder with Domingo-Ferrer et al. (2025) why not just publish the relevant statistics instead of a synthetic data set. When the analysis task is not known in advance, as is the case with public dissemination, the synthetic data generation process would aim at retaining as much as possible of the population-level features without revealing unit-level information. Underlying this approach is the assumption that population-level features are generally separable from, and not disclosive of, unit-level information—an assumption whose validity ultimately depends on the exact definition of population-level features.
The notion that synthetic data may serve as a means for data dissemination is not new to the official statistics community: over at least four decades researchers and practitioners in the Statistical Disclosure Control (SDC) field have developed a variety of synthetic data generation (or synthesis) methods with different trade-offs between data utility and disclosure risk, see for example, Calvino (2017), Hundepool et al. (2012), Drechsler and Haensch (2024), Domingo-Ferrer et al. (2024), and references therein. These traditional methods consist of sequences of operations designed step by step by human experts based on an explicit understanding of what population-level features to retain and what unit-level information to remove. Some of these approaches make use of “classical” Machine Learning (ML) tools, for example, for ML-based imputation (see Hundepool et al. 2012) but that does not change the essentially human-designed nature of the synthesis mechanism.
Compared to such earlier proposals, the advent of deep learning does not represent an incremental evolution but rather a leap into a new paradigm. Central to our discussion is the distinction between under-parameterized and over-parameterized ML models as separate regimes with different properties, as pointed out, for example, by Belkin et al. (2019), Theodoridis (2020), and Rocks and Mehta (2022). Classical ML models are typically under-parameterized: they do not have sufficient capacity (i.e., not enough model parameters) to fit precisely all the training data points. Indeed, ML models in the under-parameterized regime are purposely designed to avoid fitting the data too closely, since in that case the prediction performance (generalization power) degrades—a condition called overfitting. Unlike classical ML, modern deep learning networks are often over-parameterized: they do have sufficient capacity to memorize all the training data points and they typically tend to do so, since in the over-parameterized regime achieving good prediction performance on test data is not in contradiction with fitting the training data precisely.
The recent advent of deep learning has impacted research on synthetic data in multiple ways. First, it has extended the appeal of “synthetic data” to applications beyond the field of official statistics dissemination—the elective domain of the SDC research community—toward virtually the whole spectrum of data-intensive business sectors, from health to finance, from manufacturing to retail marketing. Second, it has attracted to the problem researchers from other scientific fields, with backgrounds and mindsets different from those of SDC experts. Third, it has fueled the expectation among potential adopters that synthetic data generation based on large-scale deep learning models is intrinsically superior to traditional schemes based on human design and classic small-scale ML tools.
The SDC community is very well aware that when synthetic data are derived from real data there is an unavoidable trade-off between the privacy risk and the utility of the resulting data set, and that these two aspects must be balanced against each other. They would therefore audit carefully every step of the data synthesis method and strive to assess, or at least make an educated judgment about, the residual level of disclosure risk. They would never consider such an assessment to be unnecessary merely because the resulting data are labeled as “synthetic.” Instead, the general hype around AI seems to be creating a misguided perception across various business sectors that AI-based synthetic data do not move along the same risk-versus-utility trade-off frontier as the traditional approaches, but rather leap over it and magically resolve the conflict between utility and privacy risk altogether, mystically delivering almost full utility at essentially zero privacy risk. This is, at least, how the narrative goes in certain commercial blogs and business articles, and occasionally also in some research articles, for example, Ammara et al. (2024).
Previous research papers have started to expose the fundamental fallacy of this view, both formally and empirically, see in particular Stadler and Troncoso (2022), Stadler et al. (2022, 2024), and references therein, while others have highlighted the opacity of such an approach, in the sense of not permitting a clear assessment of the actual level of risk and utility, see for example, Jordon et al. (2022). In line with such previous work, we provide here an additional contribution from a different angle, reinforcing previous efforts to demystify and dispute the claim that synthetic data based on deep learning are intrinsically risk-free. With this work we contribute to raising awareness among potential adopters and to cautioning statistical offices about the non-zero risks of synthetic data generation based on deep learning. Performing a careful risk assessment is absolutely necessary also with over-parameterized models, but it is much more challenging than for traditional methods, if possible at all, given the lack of interpretability.
1.1. Organization of the Paper
One major source of confusion is the overloading of the term synthetic data to refer to a range of fundamentally different paradigms. Therefore, we start by proposing in Section 2 a taxonomy and a differentiated terminology to distinguish the different synthetic data paradigms. We propose to adopt the term “pseudo-synthetic data” to refer to data produced by over-parameterized deep learning models. In Section 3 we introduce the notion of “privacy-deceptive coding” to refer to synthesis mechanisms that conceal rather than remove personal information, and present a simple example based on polynomial interpolation. In Section 4 we show how pseudo-synthetic data produced in this way may reveal unit-level information about the source data through membership inference and attribute discovery attacks, which we interpret as partial decoding of the source data. In Section 5 we elaborate on the possibility and plausibility that large-scale over-parameterized models may in principle end up learning a data transformation mechanism that is akin to a privacy-deceptive coding scheme. From there, we claim in Section 6 that pseudo-synthetic data derived by deep learning from personal data cannot and should not be cleared as anonymous or anonymized data unless a formal proof is given that the model has not learned any such structural mechanism, the burden of proof resting with their proponents. Lacking such a proof, pseudo-synthetic data should be considered as potentially embedding personal data, and therefore fall entirely within the scope of data protection legislation, that is, the GDPR in the European Union. This implies that sharing pseudo-synthetic data, like sharing pseudonymized or encrypted personal data, should be subject to a preliminary assessment of risk and lawfulness conditions. Finally, in Section 7 we conclude and identify directions for further work.
2. Data Generation Paradigms: Synthetic, Semi-Synthetic, and Pseudo-Synthetic
In several scientific and technology fields it is customary to generate synthetic data serving the purpose of testing the performance of some system, real or simulated, under configurable conditions. In the most genuinely synthetic scenario, depicted in Figure 1a, both the logic

Taxonomy of synthetic data generation paradigms: (a) purely artificial data, (b) synthetic data based on low-dimensional data fitting, (c) semi-synthetic data based on human-designed logic and (d) pseudo-synthetic data based on high-dimensional model training.
In a slightly more sophisticated scenario, depicted in Figure 1b, a set of real data
We claim that the dimensionality of the parameter space
Low-dimensional regime where
High-dimensional regime, for which
Such distinction follows recent developments in deep learning theory (see e.g., Belkin et al. 2018, 2019; Rocks and Mehta 2022; Theodoridis 2020) showing that the under-parameterized and over-parameterized regimes display different characteristics and structural mechanisms, and therefore should be treated separately.
Figure 1b refers to the low-dimensional case where the data generator
For the sake of completeness we mention that, in addition to the risk of deducing personal information from
A third possible scenario is depicted in Figure 1c, where the data
In the scenario depicted in Figure 1c the derivation function
More recently, following the outstanding success of deep learning in other application fields, a new computationally-intensive high-dimensional paradigm has emerged alongside the traditional design-based low-dimensional one. In this new paradigm, graphically sketched in Figure 1d, the generation of
The new paradigm of Figure 1d may be seen as evolved from the low-dimensional data fitting approach of Figure 1b, where (i) the designed-by-human functions
Recalling the distinction between the low-dimensional and high-dimensional regimes, we may consider the case of Figure 1b as the projection of data
In the light of the above, we argue that using the term synthetic to qualify
We do not intend to postulate that pseudo-synthetic data generated by over-parameterized models always and necessarily embed unit-level information from the training data; rather, we dispute the claim that they cannot do so. In other words, we claim that pseudo-synthetic data may bear some non-zero privacy risk, and therefore should be subject to a careful risk assessment that, however, cannot be reduced to merely checking that the constellation of new data points is dissimilar from the original one. In fact, as we show in the following, it is entirely possible to construct a new set of pseudo-synthetic data points that appear completely different from the original data points from which they are derived but still contain the whole unit-level information thereof.
3. Privacy-Deceptive Coding
Borrowing terminology from Information Theory, and particularly from Coding Theory, we use the term coding to refer to the way information (or data) is represented. We shall use the term “encoder” to refer to any system that, taking as input the source data set
The problem of exfiltrating personal information may be regarded as a particular type of channel coding problem where the source message
3.1. Definitions
Let us consider a generic process (procedure, algorithm) taking as input a set of records
With reference to the system
In order to certify that the new data set
In other words, we are asking whether it is possible to reduce anonymity to a matter of dissimilarity between the input and the output of the anonymization process. In agreement with Jordon et al. (2022) we argue that the answer to the above question should be negative, and that detailed knowledge of the process
3.2. An Example Based on Functional Interpolation
Let us consider a tabular data set
Let us divide the records into
The basic idea, illustrated in Figure 2, is to interpolate the group of

Example of interpolating polynomial curve in three-dimensional space (
A parametric curve in the
wherein the
In other words, we are picking the reference dimension
from where the
By stacking the interpolating functions for all variables
We remark that polynomials of degree
In general, given the parametric family
From the interpolating curve, the new points are generated by selecting a new set of
In other words, the generating points
Since the function interpolating the reference variable
The new data set
As the identified curve
In selecting the new points randomly along the curve, the generator may be configured to censor those candidate points that lie too close to the original points, by whatever distance measure is chosen, and to replace the censored points with other randomly selected candidates. In this way, the level of dissimilarity between the new and original data sets can be increased without impairing the possibility for an attacker to recover the interpolating curve.
A graphic example in 3-dimensional space (
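The mechanism can be sketched compactly in code. The snippet below is our own minimal illustration (the function names, the uniform sampling of the auxiliary parameter, and the censoring threshold min_dist are illustrative choices, not prescriptions from the scheme above): one group of d records with K numeric variables is interpolated by a parametric curve, using one variable as the reference/auxiliary parameter and degree d−1 polynomials for the remaining variables, and new, dissimilar-looking points are then sampled along that curve.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_group(records, n_new, ref=0, min_dist=0.05):
    """Pseudo-synthetic generation for one group of d records with K variables.

    The reference variable plays the role of the auxiliary parameter; every
    other variable is fitted by the unique polynomial of degree d-1 through
    the d points (assumes distinct reference values within the group).
    New points are sampled along the curve; candidates that fall too close
    to an original record are censored and re-drawn.
    """
    d, K = records.shape
    records = records[np.argsort(records[:, ref])]   # order by the reference variable
    t = records[:, ref]                              # auxiliary parameter = reference values
    polys = {k: np.polynomial.Polynomial.fit(t, records[:, k], deg=d - 1)
             for k in range(K) if k != ref}

    new_points = []
    while len(new_points) < n_new:
        t_new = rng.uniform(t.min(), t.max())
        cand = np.empty(K)
        cand[ref] = t_new
        for k, p in polys.items():
            cand[k] = p(t_new)
        # censoring: discard candidates too close to any original record
        if np.min(np.linalg.norm(records - cand, axis=1)) > min_dist:
            new_points.append(cand)
    return np.array(new_points)

# toy group: d = 4 records, K = 3 variables
group = rng.uniform(size=(4, 3))
pseudo = encode_group(group, n_new=6)
```

By construction, every generated point lies exactly on the same curve as the original records, which is precisely what the decoding attacks of Section 4 exploit.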
3.3. Variants
The basic interpolation scheme presented above may be varied in different ways to build privacy-deceptive coding schemes with additional desirable properties (desirable for the deceiver). In the approach presented above, the original records were divided by the value of their attribute variables into
Another variant of privacy-deceptive encoding dispenses with grouping and therefore does away with high-degree polynomials. In this variant, each single data point in the original data set is interpolated separately from the others. Assuming the predictor variables are bounded, and their infimum and supremum values are known (e.g., from metadata), we set two virtual “anchor points”
Taking inspiration from the basic mechanisms presented above, one could conceive dozens of similar schemes and variants thereof, more or less sophisticated. In all cases, the encoding process goes by the following steps:
Identify ordered groups of records in the original data set sharing the same attribute value. As particular cases, we have considered groups consisting of all records sharing the same attribute value in the first variant, and groups consisting of single records (singletons) in the second variant;
For each group of
For each group, find the unique function
For each group, sample the interpolating curve
The criterion by which the new points are selected along the interpolating curve (or equivalently by which the curve is sampled) may be refined further in order to improve the deceptive power of the overall scheme by increasing the dissimilarity between the new and original data. As anticipated above, the generator may apply censorship, that is, discard and resample the points that incidentally fall too close to the original points, by whatever measure of distance in the
As explained above, the possibility for the attacker to recover the interpolating curve does not depend on the criterion adopted to sample it, as long as the number of sampled points equals or exceeds the number of original points. This condition is often met in practical applications where pseudo-synthetic data are seen as a tool to “expand” real data sets of limited size (see e.g., Sivakumar et al. 2023).
Once the interpolating curves are recovered for all the groups in the data set, the attacker can readily infer membership and discover attribute values, as explained later in Section 4. In other words, as long as the new data are sampled from the interpolating curve, censorship or any other stratagem adopted in the data generation phase to ensure a target level of data dissimilarity is inconsequential for prospective attackers and does not reduce the actual amount of unit-level information carried into (and recoverable from) the pseudo-synthetic data.
For the sake of completeness we now elaborate on how the basic scheme presented above may be refined further to go beyond enabling membership inference and attribute discovery attacks, and allow potential attackers to reconstruct the whole data set, that is, enable full database reconstruction. To achieve this malicious goal it suffices to replace random sampling with deterministic selection in the final generation stage, that is, to generate the new points in the auxiliary variable
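As a sketch of this deterministic variant (our own illustration; the invertible map g below is an arbitrary stand-in for any reversible selection rule agreed upon by the dishonest parties), the new points' reference values are obtained by applying g to the original ones, so that a recipient who knows g can invert it and read the whole group back off the re-fitted curve:

```python
import numpy as np

def g(x):                       # invertible map on the reference values (illustrative choice)
    return 0.5 * x + 0.25

def g_inv(y):
    return 2.0 * (y - 0.25)

def encode_group_deterministic(records, ref=0):
    """Generate exactly d new points whose reference values are g(original values)."""
    d, K = records.shape
    records = records[np.argsort(records[:, ref])]
    t = records[:, ref]
    polys = {k: np.polynomial.Polynomial.fit(t, records[:, k], deg=d - 1)
             for k in range(K) if k != ref}
    new = np.empty((d, K))
    new[:, ref] = g(t)                               # deterministic, reversible selection
    for k, p in polys.items():
        new[:, k] = p(new[:, ref])
    return new

def reconstruct_group(pseudo, ref=0):
    """Attacker side: re-fit the curve from the pseudo points, invert g, and
    evaluate the curve at the recovered original reference values."""
    d, K = pseudo.shape
    polys = {k: np.polynomial.Polynomial.fit(pseudo[:, ref], pseudo[:, k], deg=d - 1)
             for k in range(K) if k != ref}
    rec = np.empty((d, K))
    rec[:, ref] = g_inv(pseudo[:, ref])
    for k, p in polys.items():
        rec[:, k] = p(rec[:, ref])
    return rec     # equals the original group (up to ordering and floating-point error)
```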
4. Recovering Source Information from Pseudo-Synthetic Data
At the receiving end of the new data set
4.1. Privacy Attacks as Decoding
Before proceeding further, we present how the three main types of privacy attacks may be conducted by a potential attacker when the pseudo-synthetic data were encoded with the interpolation mechanism presented in Section 3.
4.1.1. Membership Inference
Given a predictor vector, that is, a test point in the
4.1.2. Attribute Discovery
Given a predictor vector that is known by the attacker to be present in the source data set, that is, a test point in the
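Under the assumptions of the interpolation scheme of Section 3 (the attacker knows the function family, that is, the polynomial degree, and which variable serves as reference), both attacks reduce to re-fitting the curve from the pseudo-synthetic points. The sketch below, using the same illustrative conventions as the generation sketch in Section 3.2, is one possible rendering:

```python
import numpy as np

def recover_curve(pseudo, degree, ref=0):
    """Attacker side: re-fit the per-variable interpolating polynomials from
    n >= degree + 1 pseudo-synthetic points, using the reference variable as
    the regressor (the attacker is assumed to know the function family)."""
    x_ref = pseudo[:, ref]
    return {k: np.polynomial.Polynomial.fit(x_ref, pseudo[:, k], deg=degree)
            for k in range(pseudo.shape[1]) if k != ref}

def infer_membership(point, fitted, ref=0, tol=1e-6):
    """Membership inference: a test point lying (numerically) on the recovered
    curve is flagged as a likely member of the original group."""
    return all(abs(point[k] - f(point[ref])) < tol for k, f in fitted.items())

def discover_attribute(ref_value, fitted, target):
    """Attribute discovery: evaluate the recovered curve at a known value of
    the reference variable to obtain the target (sensitive) attribute."""
    return float(fitted[target](ref_value))

# usage against the output of the generator sketched in Section 3.2:
# fitted = recover_curve(pseudo, degree=group.shape[0] - 1)
# infer_membership(group[0], fitted)                 -> True (original record on curve)
# infer_membership(np.random.rand(3), fitted)        -> almost surely False
# discover_attribute(group[0, 0], fitted, target=2)  -> approximately group[0, 2]
```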
4.1.3. Full Database Reconstruction
The two attacks above may be considered as partial recovery of information about the original data from the new data: for a known value of the predictor vector, the attacker discovers the sensitive attribute value and/or infers membership in the source data set. We have already seen that with some additional sophistication of the encoding process, that is, replacing the random selection with a deterministic selection of the new points along the interpolating curves at the time of creating the pseudo-synthetic data, it is possible in principle to enable the complete recovery of the original group of data points, that is, a full database reconstruction attack.
Having clarified these three attacks, and recalling the generative scheme presented in the previous section, we distinguish two general scenarios:
Partial decoding. For each group of data, the new points are generated randomly along the interpolating curve for that group. From the new data
Full decoding. The new points are generated along the interpolating curve based on a deterministic reversible mapping
Full database reconstruction (full decoding) is unlikely to occur in practical settings, unless the pseudo-synthetic data generation is designed purposely to be deceptive, that is, the attacker has control over the data encoding process and uses the pseudo-synthetic data to exfiltrate covertly the original data. We have chosen to present this scenario for the sake of completeness and as a warning against possible scams, but it is not our main focus here. Instead, we argue that the partial decoding scenario may plausibly occur even in benign practical settings, where the attacker is only at the receiving end of the pseudo-synthetic data and has no influence over the encoding process.
4.2. Decoding with and without Auxiliary Information
In presenting the privacy-deceptive coding scheme in the previous section we have assumed that the attacker has some auxiliary knowledge about the encoding process, that in the case of attribute discovery and membership inference attacks (partial decoding) reduces to knowledge of the function family
First, if the pseudo-synthetic data set is disseminated publicly, then the protection of personal information in the source data rests on the secrecy of such auxiliary information. The risk assessment therefore must take into account how well the auxiliary information can be protected, that is, how difficult it is for potential attackers to retrieve, discover, or simply guess it. Additional caution should be paid to commercial deployments where the same company offers paired tools for generation (encoding) and analysis (decoding) of pseudo-synthetic data that are developed jointly, as the auxiliary information may be inadvertently passed from the generation to the analysis tool even in the absence of malicious intent, due to coupled development. Seen in these terms, privacy attacks against pseudo-synthetic data are analogous to cryptanalysis of ciphertext encrypted with an undisclosed algorithm—the historic failure of security by obscurity should serve as a warning here.
Second, while knowledge of

Example of approximate recovery of the interpolation curve. The piece-wise approximation (continuous) is obtained by connecting each pseudo-synthetic point to its closest neighbor through a straight line. In the example it follows closely the smooth interpolating curve (dashed).
The error rates in membership inference and attribute discovery attacks resulting from the approximated curve
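A minimal sketch of such an approximate attack is given below (our own illustration; ordering the points along the reference variable is used here as a simple stand-in for the nearest-neighbor chaining described in the figure above, and the threshold tol is an arbitrary choice): the distance of a test point to the resulting piece-wise linear curve is used as the membership signal.

```python
import numpy as np

def polyline_approximation(pseudo, ref=0):
    """Order the pseudo-synthetic points along the reference variable; the
    consecutive points, joined by straight segments, form a piece-wise linear
    approximation of the unknown interpolating curve."""
    return pseudo[np.argsort(pseudo[:, ref])]

def distance_to_polyline(point, vertices):
    """Euclidean distance from a test point to the piece-wise linear curve."""
    best = np.inf
    for a, b in zip(vertices[:-1], vertices[1:]):
        ab = b - a
        s = np.clip(np.dot(point - a, ab) / np.dot(ab, ab), 0.0, 1.0)
        best = min(best, np.linalg.norm(point - (a + s * ab)))
    return best

def approx_membership(point, vertices, tol=0.02):
    """Membership inference without knowledge of the function family: points
    close enough to the approximate curve are flagged as likely members."""
    return distance_to_polyline(point, vertices) < tol
```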
5. Learning Privacy-Deceptive Coding Schemes
In the previous sections we have presented some examples of human-designed privacy-deceptive coding schemes based on a simple and general structural mechanism, namely functional interpolation. We have shown how an attacker with access to a pseudo-synthetic data set built in this way may infer unit-level information about the source data. In this section we elaborate on the possibility for a large-scale AI model to learn privacy-deceptive coding and (partial) decoding mechanisms that are functionally similar to those presented so far.
5.1. Interpretation: Interpolating Curve as Latent Space
It is useful at this stage to establish a bridge between the interpolation examples from Section 3 and the terminology in use in AI/ML. The
The subspace learned from the source data in the generation stage (encoding) can be re-learned, at least approximately, from the new data.
The size of the subspace is much smaller than the whole domain space.
These conditions, referring to an arbitrary subspace learned by the AI model, generalize the conditions that enable membership inference and attribute discovery attacks based on knowledge of the interpolating line, which, as said above, may be seen as a special case of subspace.
The plausibility of the first condition rests on the fact that both the new and the original data points lie by construction in the same latent subspace or low-dimensional manifold. The second condition is based on the consideration that the decoding success rate, that is, the probability of inferring correctly membership or attribute value for a test data point relative to a random guess, is directly connected to the size of the subspace embedding the data points relative to the total volume of the
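The following Monte Carlo sketch (a back-of-the-envelope illustration of our own, with an arbitrary polynomial curve standing in for the learned subspace) gives a feel for how small an ε-neighborhood of a one-dimensional curve is relative to the K-dimensional unit cube, which is what makes proximity to the re-learned subspace such a strong membership signal:

```python
import numpy as np

rng = np.random.default_rng(1)

def tube_volume_fraction(curve_points, eps, n_samples=10_000):
    """Monte Carlo estimate of the fraction of the unit hypercube that lies
    within distance eps of a (densely sampled) one-dimensional curve."""
    K = curve_points.shape[1]
    samples = rng.uniform(size=(n_samples, K))
    # distance from each random sample to its nearest point on the curve
    dists = np.min(np.linalg.norm(samples[:, None, :] - curve_points[None, :, :], axis=2), axis=1)
    return float(np.mean(dists < eps))

K = 5
t = np.linspace(0.0, 1.0, 200)
curve = np.column_stack([t ** (k + 1) for k in range(K)])   # a polynomial curve inside [0,1]^K
print(tube_volume_fraction(curve, eps=0.05))
# The printed fraction is tiny (and shrinks further as K grows), so a test
# point found close to the re-learned curve is very unlikely to be there by
# chance: proximity is strong evidence of membership in the source data.
```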
5.2. Learning Intentionally Versus Learning Unintentionally
In Figure 4 we show two possible high-level schemas of AI settings. We assume the AI models are over-parameterized but make no assumption about their specific architecture. A recent survey by Lu et al. (2024) shows that Variational Auto-Encoders (VAE) and Generative Adversarial Networks (GAN) are among the most popular architectures in this field, but our discussion at this early stage is abstract and addresses any over-parameterized network independently of its architecture.

Examples of architectures for data generation aimed at maximizing utility while passing the privacy test. The training signals are derived from the scores in the privacy and utility tests. The utility test for the honest setting may be considered as a relaxation of the utility test in the dishonest setting: (a) dishonest setting designed to maximize utility for the attacker and (b) honest setting designed to maximize utility for the analyst.
Figure 4a shows a dishonest system, designed intentionally to learn some privacy-deceptive coding mechanism. Here the encoder and decoder modules refer to distinct but coupled networks that are trained jointly on the same signals, and may possibly share some layers. Their joint goal is to learn a reversible encoding mechanism that produces pseudo-synthetic data that pass the dissimilarity test
It is clear that only a dishonest actor would be motivated to implement deliberately the architecture of Figure 4a. The only conceivable practical use of such a system, beyond legitimate experimental research, would be to set up a personal data exfiltration scam. The encoder module trained in this way would be presented as a genuine synthetic data generator to the data holder. If the only condition to certify the non-personal nature of the data relied on the dissimilarity test
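For concreteness, the sketch below is our own minimal PyTorch illustration of the dishonest setting of Figure 4a; the specific losses (a mean-squared reconstruction error as the attacker's utility signal and a hinge-type penalty on the record-wise distance as the dissimilarity/privacy signal) are hypothetical stand-ins for the training signals of the figure:

```python
import torch
from torch import nn

K = 8                                    # number of variables per record
encoder = nn.Sequential(nn.Linear(K, 64), nn.ReLU(), nn.Linear(64, K))
decoder = nn.Sequential(nn.Linear(K, 64), nn.ReLU(), nn.Linear(64, K))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

real = torch.rand(256, K)                # toy "real" data set

for step in range(2000):
    pseudo = encoder(real)                           # pseudo-synthetic records
    recon = decoder(pseudo)                          # attacker-side reconstruction
    recon_loss = ((recon - real) ** 2).mean()        # utility signal for the attacker
    # dissimilarity "privacy test": penalize pseudo records that stay close
    # to their originating real record
    dist = torch.norm(pseudo - real, dim=1)
    privacy_penalty = torch.relu(0.5 - dist).mean()
    loss = recon_loss + privacy_penalty
    opt.zero_grad()
    loss.backward()
    opt.step()
# After training, each pseudo-synthetic record sits far from its originating
# real record (a purely distance-based privacy test is passed), while
# decoder(pseudo) recovers the original records almost exactly.
```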
We now move from the dishonest to the honest scenario, where the analyst does not pursue explicitly the reconstruction of the original data from the pseudo-synthetic data, but rather aims to preserve their utility for a range of legitimate analysis tasks. Toward this aim, the honest analyst may adopt the architecture of Figure 4b. In the new scheme, the utility of the pseudo-synthetic data is expressed in terms of their ability to drive a large-scale AI model trained on such data to deliver correct inference results, ideally the very same results that would be obtained by training the model directly on the real data. This implies that the pseudo-synthetic data retain all the population-level properties of the original data that are relevant across all tested analysis tasks.
The utility goal expressed in this way, from the perspective of the honest analyst, may be considered as a relaxation of the utility goal of the dishonest analyst in Figure 4a, that is, data reconstruction. In fact, retaining the whole unit-level information from the data—equivalently, learning the data themselves—is sufficient to retain any conceivable population-level pattern that may be learned from the data.
In principle, we cannot completely exclude a priori that the honest architecture ends up learning into the encoder module some kind of data mapping mechanism similar to the privacy-deceptive coding scheme learned by the dishonest architecture. The risk increases with the number of legitimate tasks considered in the honest setting, as each additional task implies that more information from the source data is retained in the new data set, increasing the probability that the generator ends up retaining all the information (in some hidden form, due to forced compliance with the privacy test). If that happens, an attacker with access to the pseudo-synthetic data generated by the honest architecture may succeed in eliciting personal information about the source data.
5.3. Relation to Recent Experimental Work
We do not intend to postulate here that learning such a privacy-deceptive scheme is necessarily the only possible outcome of honest learning. However, we conjecture that this is a possible and plausible outcome that cannot be excluded a priori. Recent experimental work by different authors provides empirical support for this claim and serves as an early warning. In an experimental setting similar to the honest analyst scenario described above, Slokom et al. (2022) report that “The ML model trained on the synthetic data […] was found to leak in the same way or slightly less than the original classifier.” Annamalai et al. (2024) find that the value of a sensitive attribute associated with a real record in the original data set can be inferred back from a set of synthetic data that do not appear to contain that specific record (but obviously “encode” such information in some way). More recently, Yao et al. (2025) find that membership inference attacks against synthetic data are successful even if the synthetic data appear markedly dissimilar from the original data, and conclude that the distance to the closest record and other analogous measures of dissimilarity are “uninformative of actual membership inference risk.”
While these empirical studies do not provide a conclusive proof, they represent early warnings that the possibility of pseudo-synthetic data being the outcome of some kind of privacy-deceptive coding scheme, implicitly learned by the AI model, cannot be dismissed. Our work is complementary to this parallel experimental work in that the privacy-deceptive coding schemes presented above provide hints about the kind of underlying structural mechanism that may explain those empirical findings.
5.4. A Note on Differential Privacy
In some sense, previous proposals of synthetic data generation that combine deep learning with Differential Privacy (DP), for example, McKay Bowen and Liu (2020), represent an implicit admission that deep learning without DP cannot be assumed to produce risk-free data. However, DP has its own limitations, and the formal guarantees that hold on paper under strict conditions are often lost in practical system implementations and application scenarios that do not fully meet those conditions, see for example, the discussion in Domingo-Ferrer et al. (2025), Seeman and Susser (2024), and Stadler et al. (2022). Even when the DP methodology, parameters, and source code are made public, practical DP deployments are often effectively opaque, in the sense that assessing (empirically) the actual level of risk and utility remains an extremely hard task. Similarly, large-scale deep learning models are opaque in the sense that interpreting what they have actually learned from the training data, and transferred into the newly generated data, remains an extremely hard task even when their architecture, source code, and weights are made public. Overlaying one opaque approach on top of the other should be seen as a way to increase, not reduce, the opacity of the overall system. Rather than heuristically piling one hyped but problematic approach on top of the other, we believe it would be epistemically cleaner to first investigate the structural mechanisms that make synthetic data based on deep learning problematic, and then leverage such knowledge to identify targeted countermeasures. Our work moves in this direction; therefore, mixed approaches that combine deep learning with DP are left outside the scope of the present contribution.
6. Discussion
6.1. Practical Implications of Considering Pseudo-Synthetic Data as Non-Personal Data
To illustrate concretely the potential danger of considering pseudo-synthetic data as non-personal data let us consider a scenario where the source data set
We consider the task of training a diagnostic AI model

Direct versus indirect model training: (a) direct model training on real data and (b) indirect model training via synthetic data.
Nominally, all such AI models are primarily aimed at learning from the training data some population-level properties linking the predictor variables to the attribute value (e.g., that living in a certain area correlates positively with lung cancer after a certain age). As far as classic under-parameterized ML models are concerned, this is all they may learn, because such models do not have the capacity to memorize all the data. With over-parameterized models and deep learning the story is different: in the process of learning the sought-after population-level properties from the training data, they may end up memorizing the training data, that is, individual data points from the training set (e.g., that one particular person of a specific sex, born on some specific date and resident in one specific place, was diagnosed with lung cancer). We note also that the notions of population-level and unit-level information are not always easily separable: for instance, learning some highly detailed collective characteristic of the data points, such as the fact that all the points sharing the same attribute value lie on the same interpolating line
The structural mechanism presented in Section 3 leads to encoding the original data
6.2. Analogy with Data Encryption
If the large-scale generative process were to perform some kind of privacy-deceptive coding, the data generation process may be considered similar to an encryption system, for which the decryption key must be kept separate from the encrypted data (or cipher-text)
6.3. Considerations on Legal Implications
We conjecture that any privacy test that reduces to a dissimilarity test and is oblivious to the detailed description of the anonymization process, that is, only considers the apparent dissimilarity at face value of the output data
In our opinion the burden of proof should always rest with the proponents of such methods to demonstrate that their model has genuinely removed personal information, and not merely encoded it in a way that is just not immediately apparent, for example, through some kind of implicitly learned privacy-deceptive coding scheme. In this sense, the lack of interpretability poses a serious problem.
Data Protection Authorities may have to define sound anonymization criteria specific to AI models and the pseudo-synthetic data derived from them. In the light of the simple yet general privacy-deceptive coding mechanism presented above, and in line with the findings of recent experimental work (see e.g., Annamalai et al. 2024; Yao et al. 2025), such criteria should not be limited to assessing the dissimilarity between the transformed and the original data, but should rather consider the detailed characteristics of the pseudo-synthetic data generation mechanism. Such hypothetical future guidelines may possibly distinguish between under-parameterized and over-parameterized models, which in turn implies the ability to quantify the intrinsic size of the data in terms of their amount of information, net of all redundancy. To the best of our knowledge, this aspect remains an open research question (we can only conjecture that the intrinsic size of a data set is linked to the notion of Kolmogorov complexity).
6.4. Directions for Interdisciplinary Research
Taking inspiration from Information Theory, and particularly from Channel Coding, we may interpret the chain
In the same way as research on attack models helps privacy defenders, we believe that devising new privacy-deceptive coding schemes would be instructive and valuable for privacy research in that it would help to identify more robust countermeasures and risk assessment criteria. Moreover, it may also drive privacy researchers to pull concepts and tools from well-established disciplines with a solid theoretical basis, such as Information Theory, Channel Coding theory, and Cryptography, that is, disciplines dealing with different forms of data representations and their properties.
7. Concluding Remarks
There is already wide empirical evidence (see e.g., the survey by Zhou et al. (2024) and the online repository Zhou (2025)) that deep learning models, like a sponge immersed in a liquid, tend to absorb personal information from the training data on which they are trained, and that such information can then be squeezed out of the trained model by means of so-called model inversion attacks, when not regurgitated spontaneously. Similarly, we expect in the near future a proliferation of papers on synthetic data inversion attacks, that is, empirical studies showing that the pseudo-synthetic data generated by deep learning models may also carry unit-level information from the original training data, and that such information can be inferred back via membership inference and attribute discovery attacks. In this perspective, recent works like Slokom et al. (2022), Annamalai et al. (2024), and Yao et al. (2025) appear as the first pioneering examples of a line of research that is set to grow.
In parallel with demonstrating empirically the feasibility of privacy attacks against pseudo-synthetic data, it is important to unveil the general structural mechanisms underlying the phenomenon, that is, to explain how and why personal information may flow through the model into the newly generated data. Such understanding is necessary for devising robust and theoretically sound mitigation measures. In this work we have attempted to take a step in this direction by formulating an initial hypothesis, inspired by the concept of parametric interpolation as a mechanism for data encoding and non-linear dimensionality reduction. Together with other prominent work in the field, for example Stadler and Troncoso (2022), Stadler et al. (2022), and Jordon et al. (2022), we hope that our work will contribute to raising awareness among legal experts, data protection officers, and potential adopters, including professional statisticians and managers of statistical offices, that pseudo-synthetic data based on deep learning are not a magic risk-free solution but rather an approach whose risks are yet to be fully understood.
Acknowledgements
I am sincerely grateful to the anonymous reviewers and to the editor for providing an extensive number of accurate and constructive comments and corrections on initial versions of this work.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
Disclaimer
The views expressed in this paper are those of the author and do not necessarily reflect the opinion of the European Commission.
Received: April 30, 2024
Accepted: November 11, 2025
