Abstract

Objectives

Sharing medical data is hampered by technical, regulatory, and privacy challenges, including compliance with the Health Insurance Portability and Accountability Act of 1996. However, existing data anonymization methods are error-prone or vulnerable to re-identification, and synthetic data generation approaches are limited. This study introduces SYNNER, a novel synthetic data generation framework that overcomes existing limitations, preserving data utility while ensuring privacy.

Methods

We employ knowledge graph embeddings to encode data into a k-dimensional space, capturing complex relationships. For each entity, its nearest neighbors are identified, and their characteristics are used to generate a synthetic version that maintains statistical consistency. We evaluated SYNNER on seven publicly available datasets, measuring the preservation of original data signals and comparing macro-F1 scores across prediction tasks. A novel evaluation protocol for differential privacy was also introduced, simulating an adversarial attack to infer missing values.

Results

The evaluation shows that SYNNER maintains an average of 83.2% of the signals from the original datasets. In predictive tasks, models trained on SYNNER-generated data achieved a proportional average macro-F1 score of 74.4%, comparable to those trained on the original data. The proposed evaluation protocol for differential privacy assesses whether synthetic datasets meet expected privacy standards and highlights potential risks of individual data point reconstruction.

Conclusion

SYNNER provides a scalable and effective solution for generating synthetic data that maintains statistical fidelity. It overcomes the limitations of existing methods, providing a privacy-preserving solution for synthetic data generation and advancing research in sensitive domains such as healthcare.

Keywords

Artificial intelligence, machine learning, health informatics, synthetic data generation, privacy-preserving data sharing, differential privacy evaluation, embedding-based sampling, knowledge graph embeddings

Introduction

In recent years, there has been a push for individualized treatment and preventive medicine.¹ These personalized approaches enable clinicians to identify disease risks and treatments for each patient, but require access to comprehensive datasets, since each patient has unique clinical circumstances.² However, precision medicine research has been slowed by the limited access to high-quality data, as hospitals and medical device companies often face significant barriers to making their data publicly available. The main difficulties reported in 2021, for example, include lack of technical or personnel capacity, complex interface issues, differences in vocabulary standards, and difficulty extracting relevant information from electronic health records (EHRs) to report.³

In addition, publishing personally identifiable information without consent can have serious consequences. Personally identifiable information is any data that could identify a specific individual, and its protection is enforced by the Health Insurance Portability and Accountability Act (HIPAA) of 1996.⁴ In just ten years (2012–2022), the Office for Civil Rights settled HIPAA violations totaling over $100 million.⁵ To comply with HIPAA, data can be anonymized manually or with professional-level products designed to anonymize data automatically, such as ARX,⁶ CloverDX,⁷ or Amnesia.⁸ However, manual anonymization is time-consuming and prone to human error, while automated methods are vulnerable to attacks.⁹,¹⁰ Moreover, these methods often fail to accommodate the diversity of data formats and the variability in labeling and terminology across medical practices.

In response to these challenges, synthetic data generation techniques have been proposed,¹¹–¹⁸ which aim to mimic the trends in the original data using publicly available data and expert knowledge. Synthea¹⁵ is one of the rule-based generators that create synthetic EHRs for patients. Although such datasets can be both realistic and representative, their creation often entails significant overhead, and expert knowledge may be either difficult to integrate or subject to bias.¹⁹,²⁰ Furthermore, expert knowledge is not easily transferable between domains;²¹ for instance, knowledge used to generate synthetic healthcare data may not apply to the financial domain, and vice versa. Alternative approaches have emerged that do not rely on expert input; instead, they learn patterns directly from existing data. Conditional Probability Generators (CPGs) and Generative Adversarial Networks (GANs) have been widely used in the field of synthetic data, being applied not only to image generation,²² but also to the generation of tabular data.

A key challenge in creating synthetic data is balancing utility and privacy. In methods that do not rely on expert knowledge (such as CPGs and GANs), there is a risk of reproducing exact records from the source data. To comply with HIPAA, Differential Privacy (DP)²³ must be satisfied. A differentially private dataset limits the influence of any single individual’s data on the analysis output, preventing data reconstruction or re-identification, and ensuring strong privacy guarantees. However, outliers pose an additional challenge, as their inclusion or removal can disproportionately affect the learned distributions,²⁴ which requires specialized strategies to mitigate this effect.

In this study, we use Knowledge Graph Embeddings (KGEs)²⁵ to enhance synthetic data generation and produce differentially private data. Our proposed method, SYNNER, uses KGEs to map entities into a $k$-dimensional space, capturing complex relationships between them through unsupervised learning. For each entity in the embedding space, we identify its closest neighbors, which are expected to share similar characteristics, and use them to generate its synthetic version. This is done by randomly selecting each feature value from a proximity-weighted distribution derived from the corresponding feature values of the nearest neighbors. This approach is particularly effective in scenarios where datasets cannot be modeled solely with expert knowledge.

We applied SYNNER to seven publicly available datasets, demonstrating its ability to replicate feature distributions and relationships while balancing data utility (i.e., realism) and privacy. The generated data closely mirrors the original in predictive tasks without duplicating real records. SYNNER also addresses common challenges in traditional synthetic data generation approaches (such as domain generalization and high duplication rates) by using an embedding representation that captures, on average, over 83% of the signals from the original data, preserving key relationships between features. The resulting synthetic data can be used not only to substitute training sets in the intended prediction tasks effectively, but also to achieve an average macro F1 score ratio of 74.4% compared to the resulting F1 score when using the original data as a training set.

Finally, to evaluate DP, we propose a label-prediction task in a simulated adversarial setting. The setup assumes that an attacker has access to the original data, the embeddings, the synthetic dataset, and the embedding model, and attempts to infer a missing feature value. Through predictive experiments comparing the performance of integrating synthetic or embedded data with the original dataset, we show that while embeddings alone do not substantially improve accuracy, adding synthetic data can enhance performance on certain features, potentially revealing sensitive patterns. These findings underscore the need for mitigation strategies, such as calibrated noise, to balance data utility and privacy.

Literature review

In pursuit of generating realistic synthetic data, researchers have explored various approaches, including rule-based and machine learning methods. Rule-based approaches rely heavily on expert knowledge, using statistical modeling and schema-driven techniques to capture population characteristics and simulate real-world data. While these approaches can produce highly realistic synthetic datasets, they tend to lack generalizability across domains and often risk duplicating records from the original data. On the other hand, machine learning approaches offer more flexibility by learning patterns directly from data. GAN-based methods,²⁶ for example, leverage adversarial training to learn data distributions and ensure privacy preservation. However, these approaches face challenges such as potential leakage of sensitive information, limited generalization in low-sample or imbalanced scenarios, and difficulty in modeling structured or multi-relational data.

In this section, we review related work on rule-based and machine learning approaches to synthetic data generation. We also introduce an alternative direction based on KGEs, which serves as the foundation for SYNNER. KGE techniques represent entities and their relationships in a continuous vector space, capturing complex structural patterns through unsupervised learning.

Rule-based approaches

Rule-based approaches rely on domain expertise and predefined statistical rules to generate synthetic data, rather than learning directly from source datasets. These methods typically incorporate publicly available statistics, ontologies, or schema constraints to simulate realistic data that reflect expected distributions and logical relationships within a given domain.

Publicly Available Data Approach to Realistic Synthetic EHR (PADARSER),¹⁶ for example, uses publicly available health statistics and care flows to generate realistic EHR. Care flows are generated from clinical practice guidelines, without using real patient data. Although the use of expert knowledge is helpful in creating realistic data, it is not transferable to other domains.

Graph Differential Dependencies (GDDx)¹⁷ is a schema-driven knowledge graph generator. The generation process relies on expert-defined schemas that impose constraints on the relationships between entities. While this ensures structural coherence, it requires significant domain knowledge, which limits the generalizability of the approach across different contexts.

Synthpop¹¹ uses conditional distributions to generate synthetic data. It models each column of the source data as a conditional distribution, where the value of a given attribute is sampled based on previously generated attributes in the same row. This creates a more realistic version of each entry in the resulting synthetic data. However, synthetic data generators based on conditional probabilities usually face a vanishing distribution problem when there are too many features and not enough data to support those features. As each feature is conditionally sampled, the subset of eligible entities becomes increasingly narrow. In cases where no entities exist with a given combination of previously selected features, the generation process may fail due to an empty distribution. This issue can lead to the vanishing gradient problem,²⁷ where generators that use conditional distributions can leak individuals from the source data that match less frequent combinations of feature values. In neural networks, the vanishing gradient problem occurs when the values used to update the internal parameters become extremely small as they are passed backward through many layers. This makes it hard for the earlier layers to adjust their weights, which in turn slows down or even blocks the training process.²⁸ In embedding processes, the vanishing gradient problem progressively reduces the embedding vector magnitudes during training, which can happen when similarity measures such as cosine similarity fall into saturation regions, or when positive and negative pairs produce nearly identical distances. As a result, the embeddings lose their ability to distinguish between inputs.²⁹ In both scenarios, the vanishing gradient problem can be understood as a form of forgetting.
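The narrowing of the eligible subset described above can be illustrated with a minimal sketch. It is not Synthpop's implementation (which fits a model per column rather than matching rows exactly); it assumes the simplest possible conditional generator, one that samples each column from the source rows that agree with every previously sampled value. The toy data, column names, and the helper `conditional_sample` are hypothetical.

```python
import random

def conditional_sample(rows, columns, rng):
    """Sample one synthetic row column by column, conditioning on earlier picks.

    Returns the synthetic row and the number of eligible source rows that
    remain after each column has been fixed.
    """
    eligible = list(rows)
    synthetic, support = {}, []
    for col in columns:
        # Draw the value from the rows that still match every earlier choice.
        value = rng.choice(eligible)[col]
        synthetic[col] = value
        eligible = [r for r in eligible if r[col] == value]
        support.append(len(eligible))
    return synthetic, support

# Hypothetical toy source data: 200 rows, 8 categorical columns, 4 values each.
rng = random.Random(0)
columns = [f"f{i}" for i in range(8)]
rows = [{c: rng.randrange(4) for c in columns} for _ in range(200)]

synthetic, support = conditional_sample(rows, columns, rng)
is_copy = synthetic in rows
print(support, is_copy)
```

Under exact conditioning, the support can only shrink as columns are added. Once it reaches a single row, every remaining value is copied from that one individual, so the synthetic row reproduces a source record. This is the leakage risk for rare feature combinations noted above.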

Machine learning approaches

Unlike rule-based methods, machine learning approaches generate synthetic data by learning directly from source data, rather than relying solely on expert knowledge. These approaches can learn the underlying patterns and distributions in the data, enabling the generation of high-fidelity synthetic data.

GANs²⁶ have been widely used to generate discrete and continuous tabular data. One example is medGAN,¹² an autoencoder combined with a GAN that aims to learn the distributions of multi-label discrete features in a medical context. The autoencoder helps the GAN learn the distribution of the variables by putting the data into a latent or encoded space. The synthetic encoded data are then decoded back into discrete variables. However, GANs may leak sensitive information, particularly when trained on small datasets or when feature distributions are highly imbalanced.¹⁴ At the same time, GANs can undergo mode collapse,³⁰ which results in the GAN producing the same entity over and over instead of a representative variety of entities. When experiencing mode collapse, the generator produces only a narrow subset of possible outputs rather than reflecting the full diversity of the data. This means that, although the generated samples may look realistic, they fail to represent all variations present in the original dataset. Instead of capturing all possible modes of the data (e.g., different classes, variations, or styles), the generator keeps producing very similar samples, preventing the model from generating a full range of realistic data.³¹

DPGAN¹⁴ is a differentially private GAN that improves on medGAN. DPGAN uses the Wasserstein distance,³² as opposed to the Jensen–Shannon divergence or the Kullback–Leibler divergence,³³ to approximate the difference between the probability distributions of the real data and the generated data. DPGAN achieves high-quality data point generation while also ensuring the protection of individuals. This is done by adding noise to the Wasserstein gradient during training. Although DPGAN improves on the mode collapse and gradient vanishing problems, the method cannot produce multi-relational data.

Conditional Tabular GAN (CTGAN¹³) is a generative model specifically designed to address the unique challenges of tabular data synthesis. Traditional tabular data often include a combination of continuous and discrete variables, making modeling difficult, especially when continuous variables have multiple modes or when discrete variables are imbalanced. Unlike standard deep learning or statistical methods, which often struggle with this complexity, CTGAN introduces a conditional generator that effectively models the joint distribution of tabular data. This approach enables it to generate realistic synthetic rows that reflect the structure and variability of the original dataset.

TabularARGN¹⁸ is an auto-regressive neural network architecture designed specifically to generate high-quality, privacy-preserving synthetic tabular data. TabularARGN addresses the unique challenges of real-world datasets, including mixed data types, missing values, and variable-length sequences, by modeling all possible conditional probabilities between features. This design enables advanced features such as conditional generation, missing value imputation, and fairness-aware synthesis, while maintaining DP safeguards.

Knowledge graph embeddings

To overcome the limitations of rule-based and machine learning approaches (e.g., reliance on domain-specific knowledge, lack of generalizability, and risk of privacy leakage), KGEs offer a promising alternative. KGE methods represent entities and their relationships in a continuous vector space, effectively capturing complex, multi-relational structures without requiring explicit rules or full generative modeling.

Traditionally, embeddings have been used to visualize clusters of high-dimensional data in low-dimensional space, providing a way to naturally group similar entities without explicitly sorting them together. Moreover, they are an effective way to represent known information in a form that supports inference and reasoning,³⁴–³⁶ making them particularly suitable for synthetic data generation scenarios where preserving underlying data semantics and generalization across domains are essential.

Static knowledge graphs (knowledge graphs that encode facts that are assumed to be universally or permanently true) are not suitable for embedding time-series or time-linked events, where many facts are only valid within a specific time frame. Temporal knowledge graph (TKG) embedding models aim to capture dynamic patterns and evolving relationships over time. While earlier models relied on random sampling to generate negative training examples, more recent work has shown that generating plausible but false negative samples using adversarial techniques improves the quality of the learned embeddings.²⁵ This results in more semantically rich and temporally aware representations, an important feature for modeling longitudinal data such as patient histories.

The training architecture proposed in Tissot and Pedebos²⁵ improves on previous work by choosing more likely false facts to use as training data for the GAN instead of uniform random sampling. This creates a more semantically rich TKG embedding. Across the five TKG models trained, those trained using the proposed adversarial approach yielded slightly better results.

Despite the growing interest in KGEs for tasks such as clustering, reasoning, and representation learning, their application in the context of synthetic data generation remains relatively underexplored. Although existing studies demonstrate their effectiveness in capturing structured patterns and performing semantic reasoning, their potential to produce synthetic data has not been fully realized, particularly in sensitive domains like healthcare. Our work aims to bridge this gap by leveraging KGE techniques as the foundation for SYNNER, a synthetic data generator designed to preserve utility while enhancing privacy through embedding-based sampling.

Methods

In this section, we outline SYNNER, our proposed method for generating synthetic data that closely emulates the characteristics and statistical properties of diverse datasets. Our approach handles both categorical and continuous attributes, generating realistic individual entities and ensuring an appropriate level of privacy. SYNNER implements a synthetic data generation pipeline that is structured into three main phases, as shown in Figure 1:

Embeddings generation and evaluation (A): In this initial phase, we generate embedding representations for the training and testing sets of the original datasets, without including their respective labels. The embeddings are evaluated through a prediction task on each dataset’s target label, which aims to assess their performance in comparison with the same task performed using the original data, and hence their effectiveness in capturing relevant data patterns for synthetic data generation.

Synthetic data generation (B): New embeddings are generated, now including data labels. Using the training set embeddings, we generate synthetic data using a statistical sampling method over the nearest neighbors within a maximum distance radius. Synthetic entities are generated by taking each target instance from the embedding space and using its $K$ nearest neighbors to generate its synthetic version, aiming to preserve the structural and relational properties of the data without replicating the exact same data points.

Differential privacy evaluation (C): Finally, we evaluate the DP of each generated synthetic dataset to ensure that individual data cannot be traced or reconstructed. We consider a scenario in which an attacker has access to all the data sources, including the synthetic data, the embedding representations, the embedding model (to embed new instances), and the original data, except for a single data point in one of the features, which is the target information to be reconstructed. We use classification tasks to evaluate the model’s ability to infer the missing value using different combinations of the three data sources: the original dataset (C.1), the original dataset + embeddings (C.2), and the original dataset + the synthetic dataset (C.3), followed by another model using all three data sources combined. This evaluation assesses whether the generated synthetic data adhere to privacy standards without compromising the confidentiality of the original information. If the probability of reconstructing a missing value is too high, the corresponding feature in the synthetic data should be reviewed and potentially modified, for instance, by adding random noise.

Figure 1.

Overview of the SYNNER framework, structured into three phases: (A) Embeddings generation and evaluation, (B) synthetic data generation using embedding-based sampling, and (C) differential privacy evaluation through simulated adversarial inference tasks. The pipeline highlights how embeddings capture feature relationships, synthetic data preserves structure without duplicating records, and privacy is empirically assessed under attack scenarios.

To address and evaluate diverse challenges in our proposed synthetic data generation process, we used the benchmark data described below. They were selected to represent different types of data and challenges in the generation of synthetic data and were preprocessed to obtain training and test splits when not provided (https://github.com/hextrato/SYNNER).

Mushroom Classification ³⁷: The Mushroom dataset is entirely categorical, with features such as cap-shape, odor, gill-color, and habitat. The target label indicates whether a mushroom is poisonous or edible. This dataset is known to be easy to resolve from a machine learning prediction perspective, and we use it as a baseline to validate most of the proposed steps; for example, synthetic data are also expected to resolve the same task easily, with minimal loss.

Avila Bible ³⁸: The Avila Bible dataset contains 10 continuous features extracted from 800 images of the “Avila Bible,” a giant 12th-century Latin copy of the Bible. The original task consisted of associating each pattern with one of 12 copyists. Avila Bible was specifically selected to address continuous distribution sampling during the synthetic data generation process, as there are no categorical features in this dataset, except for the target label.

Polish Bankruptcy ³⁹: The Polish Companies Bankruptcy dataset concerns bankruptcy prediction for Polish companies. Bankrupt companies were analyzed in the period 2000–2012, while still-operating companies were evaluated from 2007 to 2013. It contains 64 continuous features named X1 to X64, representing financial metrics, such as operating profit rate (X6), total asset growth rate (X29), and accounts receivable turnover (X46). The original task was to predict whether the company went bankrupt or not (1 or 0).

Fetal Health ⁴⁰: This dataset is composed of 20 continuous features related to fetal health, such as baseline_value, accelerations, and fetal_movement. The target label is fetal_health, which classifies fetal health into three categories (1, 2, 3). The challenge involves accurately classifying fetal health based on these time series physiological measurements.

Infant Health (kaggle/datasets/bhavkaur/synthetic-infant-health-dataset/) : This dataset was synthetically generated to mimic real infant data, including several categorical features, such as ChestXray, BodyO$_{2}$, BodyCO$_{2}$, Birth_Diseases and Age. The target label is “Sick” and can assume values “yes” or “no”.

Stroke Prediction (kaggle/datasets/fedesoriano/stroke-prediction-dataset) : This dataset contains three continuous features (age, avg_glucose_level, and bmi) along with several categorical features such as gender, hypertension, and smoking_status. The binary target label stroke determines whether a patient has had a stroke.

Genetic Disorder (kaggle/datasets/mukund23/predict-the-genetic-disorder) : This dataset contains a mix of continuous features such as Patient Age, Blood cell count (mcL), Mother’s age, and categorical features like Genes in mother’s side and Inherited from father. The dataset contains two target labels: Genetic Disorder (with classes like “mitochondrial” and “single-gene inheritance” diseases) and Disorder Subclass (including subclasses such as “Tay-Sachs” and “Alzheimer’s”).

Additional details about each dataset are presented in Table 1, including the number of instances in each training and test set, the number of categorical and continuous features, and the distribution of the target label values.

Table 1.

Characteristics of benchmark datasets used in this study, including number of training and test instances, number of continuous and categorical features, and class label distribution.

Dataset	Training instances	Test instances	Continuous features	Categorical features	Label (training only)	# of instances
Mushroom classification	6543	1873	–	22	Edible	3481
					Poisonous	3062
Avila Bible	10,430	10,437	10	–	A	4286
					B	5
					C	103
					D	352
					E	1095
					F	1961
					G	446
					H	519
					I	831
					W	44
					X	522
					Y	266
Polish bankruptcy	4728	1182	64	–	0	4405
					1	323
Fetal health	1712	414	20	1	1 (normal)	1323
					2 (suspect)	241
					3 (pathological)	148
Infant health	11,946	3054	–	20	No	9121
					Yes	2825
Stroke prediction	4093	1017	3	7	0 (no stroke)	3899
					1 (stroke)	194
Genetic disorder	15,900	4467	6	36	Mitochondrial	8158
					Multifactorial	1664
					Inheritance	6078

These datasets represent a range of clinical and non-clinical domains with diverse data types and prediction tasks.

Embeddings generation

To address the dependency between features in the source data, we use embeddings to organize entities into a multidimensional space where similar instances are naturally grouped together. We use KRAL (https://github.com/hextrato/kral), an unsupervised learning approach based on KGEs,³⁵ to generate embedding spaces that capture feature similarities and patterns present in the original datasets (see Figure 1(A)). KRAL was chosen for its ability to handle not only categorical but also continuous features when embedding a knowledge graph. While embeddings for categorical features are learned using translational projections of entities and relations, each continuous feature is embedded along a randomly chosen dimension $d$ of the $k$-dimensional space. For example, in a $k$-dimensional space with $k = 128$, the range of values of the continuous feature “heart rate” might be randomly assigned to dimension 25, while another continuous feature might be randomly assigned to dimension 74. Specifically, each dataset is partitioned into training and testing sets as shown in Table 1. The KGE algorithm is then applied directly to the original data, learning the relationships between the nodes (entities) and edges (features and their corresponding values) and encoding them as vectors in the $k$-dimensional space. This phase (Figure 1(A.1)) is performed without the target label, since our objective is to evaluate the extent to which the embeddings capture signals from the categorical and continuous features in each dataset, by reproducing the original prediction task using only the embeddings as input.
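As a rough illustration of what placing a continuous feature along a randomly chosen dimension could look like, the sketch below assigns each continuous feature its own dimension and writes a min-max scaled value onto that axis. This is not KRAL's implementation. The scaling scheme, the one-feature-per-dimension rule, the value ranges, and the helpers `assign_dimensions` and `place_value` are all assumptions made for illustration.

```python
import random

def assign_dimensions(features, k, rng):
    """Pick one distinct dimension per continuous feature (assumed: no sharing)."""
    return dict(zip(features, rng.sample(range(k), len(features))))

def place_value(value, lo, hi, dim, k):
    """Return a k-dimensional vector that is zero except on `dim`,
    which holds the min-max scaled value (assumed scaling)."""
    vec = [0.0] * k
    vec[dim] = (value - lo) / (hi - lo) if hi > lo else 0.0
    return vec

rng = random.Random(42)
k = 128
dims = assign_dimensions(["heart_rate", "bmi"], k, rng)

# Hypothetical observed range for heart rate: 40 to 200 beats per minute.
vec = place_value(72.0, lo=40.0, hi=200.0, dim=dims["heart_rate"], k=k)
print(dims, vec[dims["heart_rate"]])
```

With this kind of placement, two entities with similar values of a continuous feature sit close together along that feature's axis, which is what lets nearest-neighbor search in the embedding space reflect similarity in continuous attributes.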

After generating the embeddings based on the training set, we applied the learned transformations to generate embeddings for the test set. This approach ensures that the test embeddings are derived solely from the relationships captured in the training phase, without introducing additional information from the test set itself. In the next step (Figure 1(A.2)), the embeddings are evaluated in the same label prediction task proposed for each original dataset, so that we can assess their ability to capture meaningful patterns and relevant structures. In general, this step enables us to quantify the amount of meaningful signal that can be captured from each dataset, and to estimate how effectively the synthetic data could represent the underlying structure of the original data.

In $k$-dimensional spaces, the choice of $k$ directly affects the ability of the embeddings to capture the complexity and nuances of the original data. Although lower dimensionality improves computational efficiency, it may fail to preserve critical data patterns during embedding generation.⁴¹ To help determine a suitable dimensionality, the KGE algorithm was tested in a preliminary evaluation using a rule-based synthetic data generator⁴² configured to project entities into multiple $k$-dimensional spaces in separate experimental runs, allowing us to systematically assess how dimensionality and noise influence the quality of synthetic data (further details are presented in Section “Results”). Based on this evaluation, we used $k = 32$ for all datasets. Although the datasets are broadly similar in size and shape, we observed that the extent to which a given $k$ captures signals in the embedding vectors is correlated with the noise and the complexity of the relationships within the data. The results are presented in Table 2 and discussed in Section “Embeddings Evaluation.”

Table 2.

Evaluation of embeddings as input features for prediction tasks across all benchmark datasets.

Dataset	Macro F1 (original data)	Macro F1 (embeddings)	Relative performance ($\mathrm{F1}_{\text{embeddings}} / \mathrm{F1}_{\text{original}}$)
Fetal health	0.9159	0.8059	0.8799
Infant health	0.3721	0.3823	1.0274
Stroke prediction	0.2514	0.2360	0.9387
Genetic disorder (class)	0.5279	0.4383	0.8303
Genetic disorder (subclass)	0.3077	0.1681	0.5463
Avila Bible	0.9932	0.9040	0.9102
Polish bankruptcy	0.7250	0.4153	0.5728
Mushroom classification	1.0000	0.9977	0.9977

Macro F1 scores are compared between models trained on original data and embedding representations. The relative F1 performance quantifies the signal preservation achieved by embeddings.
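For readers who want to reproduce the last column of Table 2, the following sketch shows the standard macro F1 definition (the unweighted mean of per-class F1 scores) and the relative performance ratio. The function names are illustrative and the toy labels are hypothetical; only the two F1 pairs at the end are taken from Table 2.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never present or predicted contributes 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def relative_performance(f1_embeddings, f1_original):
    """Share of the original macro F1 retained by the embeddings."""
    return f1_embeddings / f1_original

# Hypothetical toy predictions to exercise macro_f1.
toy_score = macro_f1([0, 0, 1, 1], [0, 1, 1, 1])

# F1 pairs taken from Table 2.
fetal = round(relative_performance(0.8059, 0.9159), 4)
polish = round(relative_performance(0.4153, 0.7250), 4)
print(toy_score, fetal, polish)
```

Because macro F1 weights every class equally, minority classes in imbalanced datasets (such as class B in Avila Bible, with five training instances) affect the score as much as the majority classes do.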

Synthetic data generation: Embedding space sampling

Following an initial validation of the effectiveness of the embeddings in a predictive task, we re-run the embedding process using the same KGE algorithm, this time incorporating class labels (see Figure 1B). Integrating label information into the embedding process enables embeddings to capture label-dependent patterns and relationships. As a result, entities that exhibit similar features or associations with specific labels are embedded in close proximity within the latent space.

Sampling strategy

A commonly used strategy to generate synthetic data is random sampling.⁴³ Although this approach allows the creation of an unlimited number of synthetic records, it often fails to preserve essential data structures, resulting in unrealistic or unrepresentative samples. To enhance the fidelity of the synthetic data with respect to the original distribution, we adopt a density-based sampling strategy.⁴⁴ In this method, synthetic instances are generated within clusters formed by the $K$ nearest neighbors (KNN algorithm),⁴⁵ subject to a maximum radius constraint ($R_{\max}$). This constraint restricts sampling to densely populated regions, thereby ensuring that synthetic samples remain close to the true data manifold.

The density analysis step uses the embeddings of the source dataset to determine appropriate parameters for the synthetic data generation process, including the minimum and maximum number of neighbors and the maximum radius. Let $X = \{x_{i}\}_{i=1}^{N}$ denote the set of samples in $\mathbb{R}^{k}$, and let $d(x_{1}, x_{2})$ be the Euclidean distance between a pair of embedding vectors. For each instance $x_{i}$ in the source dataset, we count the neighbors $x_{n}$ within a given radius $r_{m}$, that is, those with $d(x_{i}, x_{n}) \leq r_{m}$, where $r_{m} \in [r_{\min}, r_{\max}]$. For each $r_{m}$, we report the minimum, maximum, and average number of neighbors in the source dataset, the percentage of instances $x_{i}$ that do not reach a given minimum number of neighbors, and the percentage of instances $x_{i}$ that exceed a given maximum number of neighbors. This analysis enables us to visualize and set the parameters required by the synthetic data generation process. For example, if a minimum number of neighbors $N_{\min} \geq 32$ is required, and it is acceptable for 10% of the source dataset to be classified as outliers (instances without the minimum number of neighbors, which are ignored during generation), what is the smallest value of $R_{\max}$ that avoids an excessive number of outliers?
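The density analysis can be sketched as a brute-force neighbor count over a grid of candidate radii. The code below is a minimal illustration and not the SYNNER implementation. The toy embeddings, the radius grid, the neighbor thresholds, and the helper names (`neighbor_counts`, `density_report`, `smallest_radius`) are assumptions.

```python
import math
import random

def neighbor_counts(points, radius):
    """For every point, count the other points within `radius` (Euclidean)."""
    counts = []
    for i, p in enumerate(points):
        counts.append(sum(
            1 for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= radius
        ))
    return counts

def density_report(points, radii, n_min, n_max):
    """Summarise neighborhood sizes for each candidate radius r_m."""
    report = {}
    for r in radii:
        counts = neighbor_counts(points, r)
        n = len(counts)
        report[r] = {
            "min": min(counts),
            "max": max(counts),
            "mean": sum(counts) / n,
            "pct_below_min": 100.0 * sum(c < n_min for c in counts) / n,
            "pct_above_max": 100.0 * sum(c > n_max for c in counts) / n,
        }
    return report

def smallest_radius(report, max_outlier_pct):
    """Smallest radius whose share of under-populated points is acceptable."""
    ok = [r for r, s in sorted(report.items())
          if s["pct_below_min"] <= max_outlier_pct]
    return ok[0] if ok else None

# Hypothetical toy embeddings: 300 points in a 4-dimensional space.
rng = random.Random(1)
points = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(300)]
radii = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]

report = density_report(points, radii, n_min=32, n_max=200)
r_max = smallest_radius(report, max_outlier_pct=10.0)
print(r_max, report[radii[-1]])
```

The pairwise loop is quadratic in the number of instances, which is acceptable for a sketch. For datasets the size of those in Table 1, a spatial index would be a more practical choice.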

Because each feature value is sampled independently from the distributions of its nearest neighbors, some higher-order dependencies may not be fully preserved. This includes temporal dynamics and strong inter-feature correlations, which are particularly relevant in clinical datasets. We revisit this limitation in Section 5 and suggest potential extensions to address it.

Maximum cluster radius

The selection of a suitable maximum radius ( $R_{m a x}$ ) is essential to balance utility, data density, and privacy. A smaller radius promotes the generation of synthetic samples that closely resemble real observations, thereby enhancing the representativeness and local fidelity of the data. However, this increased similarity can raise privacy risks, particularly in sensitive domains such as healthcare, where the potential for re-identification must be carefully managed. In contrast, a larger radius reduces the likelihood of generating samples that are too similar to real instances. However, this comes at the cost of diluting local structural information, which can result in overly generalized synthetic data that fail to capture fine-grained patterns present in the original distribution.

Avoiding replicating outliers

To mitigate the risk of replicating outliers, we impose a minimum neighbor count constraint ( $N_{m i n}$ ) within the defined radius $R_{m a x}$ . For example, if only a single neighbor is identified within $R_{m a x}$ , that neighbor would otherwise be replicated directly as the synthetic counterpart of the target instance. In addition, to avoid such scenarios, SYNNER relies exclusively on the local neighborhood density, explicitly excluding the target instance from the sampling process. This design ensures that the characteristics of the target instance are not used in generating its synthetic version. By imposing a minimum number of neighbors and excluding the target itself, SYNNER minimizes the risk of data duplication, except in degenerate cases where all neighbors are identical or very similar to each other.

Cluster quality

To refine the quality of each cluster, we introduce a maximum neighbor constraint ( $N_{m a x}$ ), which limits the number of neighbors considered when finding the nearest neighbors. This constraint helps reduce the influence of densely populated regions, thereby avoiding synthetic samples that disproportionately reflect global distributional characteristics at the expense of preserving localized structure. Building on the above considerations and to balance privacy with data utility, we examine how radius, outlier frequency, and local data density interact to select appropriate values for $R_{m a x}$ , $N_{m i n}$ , and $N_{m a x}$ . We perform a density analysis in the embedding space to analyze how different radius values influence the percentage of outliers, high-density regions, and the average number of neighbors in the embedding space.
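The interaction of the three constraints can be sketched as follows. This is a hedged illustration rather than the authors' code: the function name `synthesize`, the inverse-distance weighting, and the toy data are assumptions. The source states only that the target instance is excluded, that feature values are sampled independently from the nearest neighbors, and that closer neighbors receive higher weights.

```python
import math
import random

def synthesize(target_idx, embeddings, records, r_max, n_min, n_max, rng):
    """Return a synthetic record for one target, or None if it is an outlier."""
    target = embeddings[target_idx]
    near = []
    for i, e in enumerate(embeddings):
        d = math.dist(target, e)
        # The target itself is never a candidate, so its values are not reused.
        if i != target_idx and d <= r_max:
            near.append((d, i))
    if len(near) < n_min:
        return None              # too few neighbors within r_max: skip as outlier
    near.sort()
    near = near[:n_max]          # keep only the n_max closest neighbors
    ids = [i for _, i in near]
    # Assumed weighting scheme: inverse distance, so closer neighbors count more.
    weights = [1.0 / (d + 1e-9) for d, _ in near]
    # Each feature value is drawn independently from the neighbors' values.
    return tuple(
        records[rng.choices(ids, weights=weights, k=1)[0]][f]
        for f in range(len(records[0]))
    )

rng = random.Random(7)
# Toy embeddings and records (assumption): one categorical, one continuous feature.
embeddings = [(rng.random(), rng.random()) for _ in range(300)]
records = [(rng.choice("ABC"), round(rng.gauss(50, 10), 1)) for _ in range(300)]
synthetic = [synthesize(i, embeddings, records, r_max=0.15, n_min=8, n_max=16, rng=rng)
             for i in range(len(records))]
kept = [s for s in synthetic if s is not None]
print(len(kept), "synthetic records;", len(records) - len(kept), "outliers skipped")
```

Drawing each feature from a possibly different neighbor is what makes the sampling independent per feature, and it is also why higher-order dependencies between features may not be fully preserved.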

An illustrative example based on the Fetal Health dataset is shown in Figure 2. For a given pair of neighbor constraints $[N_{m i n}, N_{m a x}]$ , we examine how variations in $R_{m a x}$ affect outlier identification and cluster density. In the presented example, where $[N_{m i n}, N_{m a x}] = [16, 32]$ , if an outlier tolerance of 10% is acceptable, that is, the synthetic dataset will contain approximately 10% fewer instances than the original data, the minimum required value for $R_{m a x}$ is approximately 0.18. At this threshold, around 80% of the resulting clusters will be formed with at least $N_{m a x} = 32$ neighbors. The remaining 20% of clusters comprise between 16 and 31 neighbors, which still satisfy the constraint $N_{m i n}$ while supporting the generation of synthetic instances that respect local density without overfitting to densely populated regions. Finally, the same analysis provides the average number of neighbors found. When $R_{m a x} = 0.18$ , clusters corresponding to each target instance will have approximately 56 neighbors ( $10^{1.75}$ ). This configuration maintains a meaningful neighborhood size for synthetic sampling while limiting the risk of privacy leakage. Note that these parameters are dataset-specific and need to be tuned to balance the number of outliers against excessive generalization in the generated synthetic dataset.
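The radius selection in this example amounts to a lookup on the outlier curve: take the smallest radius whose outlier percentage falls within the tolerance. The sketch below assumes a hypothetical helper `smallest_radius_within_tolerance` and purely illustrative curve values that are not read from Figure 2. It also checks the log-scale reading quoted above.

```python
def smallest_radius_within_tolerance(outlier_pct_by_radius, tolerance_pct):
    """Return the smallest radius whose outlier percentage is <= tolerance, else None."""
    for radius in sorted(outlier_pct_by_radius):
        if outlier_pct_by_radius[radius] <= tolerance_pct:
            return radius
    return None

# Hypothetical outlier curve (illustrative values only, not taken from the paper).
curve = {0.1: 40.0, 0.2: 18.0, 0.3: 9.0, 0.4: 3.0}
chosen = smallest_radius_within_tolerance(curve, tolerance_pct=10.0)
print(chosen)              # 0.3

# Reading an average neighbor count off a log10 axis: 10^1.75 is about 56.
print(round(10 ** 1.75))   # 56
```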

Figure 2.

Density analysis for the Fetal Health dataset in the embedding space. The number of neighbors within varying radius thresholds is plotted (log scale), showing the trade-off between outlier frequency (<16 neighbors) and local density (>32 neighbors). Results guide the selection of the maximum cluster radius ( $R_{m a x}$ ) and neighbor constraints.

Data utility

Finally, we evaluate how each resulting synthetic dataset performs as a training set in the original label prediction task. We analyze the resulting F1 score from the test set, comparing two models: (a) One trained with the original training data and (b) another trained with synthetic data generated from the original training set. There are no specific expectations regarding the F1 score resulting from synthetic data, but we use these results to assess the representativeness of synthetic data with respect to the original data. These results are presented in Table 3 and discussed in Section “Synthetic data evaluation”.

Table 3.

Comparison of predictive performance between models trained on original training data versus synthetic data generated by SYNNER.

Dataset	Macro F1 (original)	Macro F1 (synthetic)	Radius ( $R_{m a x}$ )	Relative performance (F1 $_{s y n t h e t i c}$ / F1 $_{o r i g i n a l}$ )
Fetal health	0.9159	0.7454	0.170	0.8139
Infant health	0.3721	0.3670	0.050	0.9863
Stroke prediction	0.2514	0.1702	0.155	0.6769
Genetic disorder (class)	0.5279	0.4317	0.125	0.8178
Genetic disorder (subclass)	0.3077	0.1707	0.125	0.5548
Avila Bible	0.9932	0.6046	0.135	0.6088
Polish bankruptcy	0.7250	0.3618	0.065	0.4990
Mushroom classification	1.0000	0.9919	0.030	0.9919

Macro F1 scores, maximum radius used for sampling, and relative F1 performance are reported. Results highlight dataset-specific trade-offs in representativeness and generalization.
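The relative performance column can be reproduced directly from the two macro F1 columns. The short sketch below recomputes the ratios from the rounded values in Table 3 and averages them. The mean comes out at about 0.744, which appears to correspond to the 74.4% proportional average reported in the abstract. The variable names are illustrative.

```python
# (original macro F1, synthetic macro F1) as reported in Table 3.
table3 = {
    "Fetal health": (0.9159, 0.7454),
    "Infant health": (0.3721, 0.3670),
    "Stroke prediction": (0.2514, 0.1702),
    "Genetic disorder (class)": (0.5279, 0.4317),
    "Genetic disorder (subclass)": (0.3077, 0.1707),
    "Avila Bible": (0.9932, 0.6046),
    "Polish bankruptcy": (0.7250, 0.3618),
    "Mushroom classification": (1.0000, 0.9919),
}

# Relative performance = F1_synthetic / F1_original.
ratios = {name: syn / orig for name, (orig, syn) in table3.items()}
for name, ratio in ratios.items():
    print(f"{name:28s} {ratio:.4f}")

mean_ratio = sum(ratios.values()) / len(ratios)
print(f"mean relative performance: {mean_ratio:.3f}")  # 0.744
```

Ratios recomputed from the rounded F1 values may differ from the tabulated ones in the last decimal place.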

Differential privacy

Now that we have a synthetic version of the original training set, to ensure its practical applicability, it is essential to evaluate how well the synthetic dataset balances data utility and privacy protection. One concern when generating synthetic data is the risk of reconstructing sensitive or confidential information. When synthetic samples are too closely aligned with real data points, especially in high-dimensional representations, there is a risk that models trained on the combined datasets may overfit or expose patterns that could compromise individual privacy. In other words, the integration of real and synthetic data, particularly those generated from embedding spaces, could inadvertently improve a model’s predictive performance to the extent that it enables the reconstruction of sensitive or missing information from the original data.

To assess the trade-off between data utility and privacy, we propose evaluating how well synthetic datasets perform in prediction tasks when combined with the original dataset and/or their corresponding embeddings (see Figure 1(C)). We run prediction tasks on each feature assuming the scenario in which an attacker gains access to all versions of the same dataset: Synthetic data (Syn), the embedding representations along with the embedding model (Emb), and original data (DS), except for one missing feature that could represent, for example, a critical health indicator of a patient. In this setting, the attacker: (C1) Using only DS, would rely on the available patient records to train a model and infer the missing value (for example, a health indicator); (C2) when adding Emb, privacy risks could increase if encoded relationships preserve too much information; similarly, (C3) the addition of Syn could potentially improve model quality, leading to the exposure of sensitive patient data.

Taking into account this adversarial scenario, we quantified the attempt to reconstruct the missing value through a comparative evaluation in four experimental settings: (1) Inference using only the original dataset DS; (2) inference using the DS+Emb; (3) inference using DS+Syn; and (4) inference using DS+Emb+Syn. In each setting, we evaluate the predictive performance of each feature, ensuring DP so that Emb and Syn cannot be used to reveal sensitive data or reproduce information beyond the original dataset. Therefore, if a feature in the synthetic data increases the likelihood of inferring its real counterpart, it should be reconsidered to maintain an appropriate balance between utility and privacy, potentially by modifying it through techniques such as noise injection. These results are discussed in Section “Differential privacy evaluation”.
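The comparison across the four settings can be sketched as a simple per-feature check: a feature is flagged when any setting that adds Emb or Syn improves the attacker's score over the DS-only baseline by more than a chosen margin. The function name `flag_leaky_features`, the margin `min_gain`, and the scores below are hypothetical; the source does not specify a numeric threshold.

```python
def flag_leaky_features(scores, min_gain):
    """Flag features whose best augmented setting beats DS alone by more than min_gain.

    scores maps feature -> {"DS": s, "DS+Emb": s, "DS+Syn": s, "DS+Emb+Syn": s},
    where s is the attacker's predictive score (F1 or MRR).
    """
    flagged = {}
    for feature, by_setting in scores.items():
        baseline = by_setting["DS"]
        gains = {s: v - baseline for s, v in by_setting.items() if s != "DS"}
        worst = max(gains, key=gains.get)   # setting that helps the attacker most
        if gains[worst] > min_gain:
            flagged[feature] = (worst, round(gains[worst], 3))
    return flagged

# Hypothetical scores for two features (illustrative values only).
scores = {
    "feature_a": {"DS": 0.60, "DS+Emb": 0.61, "DS+Syn": 0.62, "DS+Emb+Syn": 0.62},
    "feature_b": {"DS": 0.52, "DS+Emb": 0.52, "DS+Syn": 0.71, "DS+Emb+Syn": 0.70},
}
flagged = flag_leaky_features(scores, min_gain=0.05)  # assumed margin
print(flagged)
```

A flagged feature is a candidate for mitigation, for example noise injection, before the synthetic dataset is released.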

Results

In this section, we present the results of our work in producing synthetic data for different datasets. We first performed a preliminary evaluation that shows how noise and the choice of $k$ can affect embedding performance when embedding knowledge into $k$ -dimensional spaces. These results help to support further tuning decisions regarding synthetic data generation. Then, we present our findings for each phase of the proposed SYNNER pipeline in Figure 1: (A) An evaluation to demonstrate how much signal the resulting embedding representation can capture from different datasets; (B) an evaluation to show how the resulting synthetic datasets perform in the same original prediction tasks; finally, (C) the result of our proposed evaluation protocol regarding DP.

As proposed in Section “Methods”, we use KGEs to enhance synthetic data generation and ensure DP in generated data. According to Dwork and Roth,²³ adding calibrated noise can protect sensitive information while preserving the statistical properties of the original dataset. Whether noise is an intrinsic feature of the data or added in a controlled way, it is essential to demonstrate how it affects the embedding process’s ability to capture signals from the data.

To this end, we used FormulAI.⁴² This rule-based dataset generation framework systematically adds a specified level of noise to a dataset, to evaluate the robustness of the embedding representations at different levels of noise. We used six variants, adding noise from 0% to 50%, to test their effect on the embedding process. Then, using the XGBoost model,⁴⁶ we compare macro F1 scores across the original data and their corresponding embeddings.

The dataset used in this evaluation was intentionally kept simple, consisting of only two categorical and two continuous features, along with a multilabel target with six distinct values. Embeddings were generated using a $k$ -dimensional space, with $k = 8$ . As shown in Table 4, embeddings tend to be more adversely affected by noise than the original data when used for prediction tasks (see column ‘Relative performance’). However, because embeddings also encode noise-related signals from data, noise can serve as a valuable mechanism for modifying synthetic data to address issues related to DP, as described in Section “Differential privacy evaluation”.

Table 4.

Preliminary evaluation of the impact of increasing noise levels (0%–50%) on embedding representations.

Noise level	Macro F1 (original dataset)	Macro F1 (embeddings)	Relative performance (F1 $_{e m b e d d i n g s}$ / F1 $_{o r i g i n a l}$ )
No noise	1.0000	0.9919	0.992
10%	0.9523	0.9458	0.993
20%	0.9559	0.9408	0.984
30%	0.9702	0.9279	0.956
40%	0.9592	0.8979	0.936
50%	0.9401	0.8392	0.893

Macro F1 scores are reported for the original data and embeddings, with relative F1 performance defined as the ratio of the embedding-based score to the original data score. Results highlight how embeddings encode noise and how this affects predictive performance.

We also used the FormulAI dataset to assess how different $k$ -dimensional spaces (where $k \in \{8, 16, 24, 32\}$ ) affect the embeddings' ability to capture data signals. As discussed in Section “Embeddings generation”, increasing the dimensionality is expected to yield better embedding representations, which should be reflected in better F1 scores in prediction tasks using the resulting embeddings.

Table 5 shows that the higher the dimensional space, the better the embedding process can accommodate the different entities that represent each instance of the original data. In other words, the higher the $k$ , the more representative the embeddings will become. Interestingly, in some cases, embeddings in higher-dimensional spaces can outperform the original data in prediction tasks, achieving slightly higher F1 scores. This suggests that the embeddings may capture underlying feature correlations more effectively than the raw input.

Table 5.

Comparison of embedding performance across different embedding space dimensionalities ( $k \in \{8, 16, 24, 32\}$ ).

Macro F1 (original dataset)	$k$	Macro F1 (embeddings)	Relative performance (F1 $_{e m b e d d i n g s}$ / F1 $_{o r i g i n a l}$ )
0.9401	8	0.8392	0.893
0.9401	16	0.9256	0.985
0.9401	24	0.9396	0.999
0.9401	32	0.9584	1.019

Macro F1 scores for original data and embeddings are reported, along with the relative F1 performance. Results show how higher dimensionality generally improves embedding representativeness.

The choice of $k$ clearly affects the embedding performance and can be used to enhance or restrict the amount of signal captured during this process. Given the sizes and shapes of the evaluation datasets, we adopted $k = 32$ in all subsequent experiments. Using a fixed value for $k$ allowed us to observe how different data shapes are affected in the same $k$ -dimensional space, and the corresponding impact on the synthetic data and privacy evaluations.

Embeddings evaluation

As described in the first step of the SYNNER pipeline (Figure 1(A)), we generate embedding representations for the training and test partitions of the original datasets, without including the target labels, as they could influence the relationships learned. Once the embeddings were generated, we evaluated their utility as synthetic data by using XGBoost to predict the original labels. The results are shown in Table 2.

An interesting exception occurs in the Infant Health dataset, where embeddings slightly outperform the original data (Macro F1 $=$ 0.3823 vs. 0.3721). We interpret this as an effect of the embedding transformation, which acts as a denoising and regularization mechanism. By projecting categorical feature-value pairs into a continuous vector space, embeddings smooth local irregularities and emphasize co-occurrence patterns, thereby reducing noise and sparsity. In this dataset, which contains many categorical attributes with uneven distributions, embeddings appear to better capture correlations that are harder to exploit directly from raw features, resulting in a modest performance gain.

When comparing results to the same task performed on the original datasets, we observed that embeddings typically achieve lower performance, as some signal is lost during the embedding process. In each dataset, the loss ratio in a prediction task relative to the original data varies with data complexity and the difficulty of predicting the target label. For example, in the Mushroom dataset, widely considered straightforward and easy to resolve, embeddings maintain a high predictive performance. In contrast, in the subclass prediction task in the Genetic Disorder dataset, embeddings achieve only 50% of the performance obtained with the original data.

Furthermore, in the Infant Health dataset, we observed that the F1 score using embeddings improved by nearly 3% compared to the original dataset. Although designed to reduce dimensionality, the embedding transformation can also enhance representation by capturing correlations between similar feature-value pairs, as becomes more evident in inherently difficult-to-learn prediction tasks.

Overall, this assessment confirms that embeddings can preserve data patterns. Most importantly, since embeddings typically underperform compared to the original data, they may provide a suitable level of privacy for subsequent synthetic data generation steps.

Synthetic data evaluation

In this section, we evaluate the effectiveness of our synthetic data generation framework by analyzing several key factors, including comparisons between density-based and random sampling approaches and the influence of radius and neighbor constraints in balancing trade-offs between privacy, utility, and generalizability. We also examine how these parameters affect class-level prediction performance and discuss the implications of dataset-specific parameter tuning. The results are reported in Table 3 and Figures 3 and 4.

Figure 3.

Impact of varying the maximum radius ( $R_{m a x}$ ) on predictive performance (Macro F1 score) in the Fetal Health dataset. Increasing $R_{m a x}$ expands clusters (from an average of 65 to 762 neighbors) but reduces F1 performance, showing that larger clusters dilute local structure and reduce the fidelity of synthetic data. Smaller radii preserve more of the original predictive signal, while larger radii reduce local fidelity but may provide stronger privacy by preventing overly close synthetic samples.

Figure 4.

Effect of varying minimum and maximum neighbor constraints ( $N_{m i n}$ and $N_{m a x}$ ) on Fetal Health predictive performance. As neighbor ranges increase ([4,8] $\to$ [256,512]), synthetic data becomes more generalized, and Macro F1 scores decline. Results highlight the importance of neighborhood constraints for preserving minority structures and preventing excessive generalization.

As discussed previously, and unlike the embedding evaluation in the previous section, we used embeddings learned from the full original datasets, including target labels, to improve the semantic reliability of synthetic data generated from latent representations. When class labels are incorporated into the KGE process, the resulting synthetic records retain a substantial portion of the discriminative structure of the original data. For example, as shown in Table 3, in the Fetal Health dataset, although the F1 score decreases from 0.9159 (original) to 0.7454 (synthetic), the retention ratio of 0.8139 reflects how strongly signals can be preserved during the embedding and synthetic data generation processes. Similarly, in the Mushroom dataset, the synthetic-based model achieves a macro F1 score of 0.9919, indicating minimal information loss, as expected for an easy-to-resolve task. However, in tasks with more granular class structures, such as predicting the subclass label in the Genetic Disorder dataset, the drop in performance is more noticeable (ratio $=$ 0.5548), highlighting the challenge of maintaining class separability in complex label spaces. This is consistent with the embedding evaluation for the same dataset.

From a clinical perspective, these observed performance gaps carry essential implications. For example, in the Fetal Health dataset, the reduction of roughly 19% in the macro F1 score reported in Table 3 (from 0.9159 to 0.7454) means that models trained solely on synthetic data cannot be assumed safe to apply in patient care. Even modest declines in predictive accuracy could translate into an increase in Type II errors (false negatives), that is, instances in which a pathological or high-risk case is incorrectly classified as normal. Instead, these results demonstrate that synthetic datasets preserve sufficient predictive signal to support model screening, prototyping, and hypothesis generation. In practice, synthetic data can help identify promising approaches while maintaining privacy. Still, final models must be retrained and independently validated on the original clinical data to ensure safety and reliability.

Density-based vs. random sampling

As mentioned in Section “Synthetic data generation: Embedding space sampling”, we adopted a density-based sampling strategy within the embedding space, which uses the K-nearest neighbors within a maximum radius constraint ( $R_{m a x}$ ) to ensure that synthetic records are generated in locally consistent regions. As the number of neighbors increases, the resulting synthetic instances become more generalist, potentially reducing their representativeness of local structures and decreasing the synthetic model performance. However, because of our weighted-density strategy in which closer neighbors have greater influence, this degradation is not linear. For example, in Table 3, datasets with tighter radius constraints, such as Mushroom ( $R = 0.030$ ), achieve an F1 ratio of 0.9919, while broader radii, such as in Stroke Prediction ( $R = 0.155$ ), correlate with lower ratios (0.6769), reflecting a loss of local fidelity.

Balancing utility and generalization via max radius

Selecting an appropriate radius parameter ( $R_{m a x}$ ) is critical for balancing data utility and generalization. Figure 3 shows how the F1 ratio decreases when we create synthetic versions of the Fetal Health dataset with increasing values of $R_{m a x}$ , which allows more neighbors to be used when generating each synthetic instance. Larger values of $R_{m a x}$ produce larger clusters and lead to a consistent decline in macro F1 scores when the synthetic data are used as a training set, indicating that larger radii dilute the structure of local clusters and reduce the discriminative quality of the synthetic data. In this experiment, $N_{m i n} = 16$ and $N_{m a x}$ varies from 32 to 128 as $R_{m a x}$ increases from 0.170 to 0.230. The corresponding average number of neighbors increases from 65 to 762, as shown in the density analysis (Figure 2). The initial setting $R_{m a x} = 0.170$ corresponds to the configuration reported in Table 3. These results suggest that the radius parameter can be adjusted to mitigate the risk of excessive generalization while maintaining a sufficient predictive signal, and that tuning $R_{m a x}$ may be effective in datasets with well-structured feature-label relationships. However, the choice of $R_{m a x}$ must also take into account the characteristics of each dataset.

Minimum and maximum neighbor constraints to prevent outlier duplication and preserve local structure

As described in Section “Synthetic data generation: Embedding space sampling”, we apply a minimum neighbor constraint ( $N_{m i n}$ ) to prevent overfitting and enhance diversity by ensuring that synthetic samples are generated from sufficiently representative clusters. Dense and coherent neighborhoods support robust local sampling. In contrast, in sparser or noisier regions in the embedding space, the neighborhood threshold is more difficult to satisfy consistently. Although this may hinder stable cluster formation, the constraint $N_{m i n}$ remains critical to prevent outlier replication and promote privacy compliance in synthetic data.

In addition to $N_{m i n}$ , we also apply a maximum neighbor constraint ( $N_{m a x}$ ) to limit the influence of dense clusters during sampling, thus promoting diversity and preserving minority structures in synthetic data. By restricting $N_{m a x}$ , we maintain higher utility, i.e., synthetic data that remains representative of the original dataset. The effectiveness of this strategy is supported by generally consistent performance trends, as shown in Figure 4.

Using the Fetal Health dataset as an example, we observe that relaxing $N_{m a x}$ leads to more generalist synthetic data and therefore reduced utility (i.e., lower predictive performance in the benchmark task), although the decline is not proportional to the number of neighbors added in each iteration. Each time we double $N_{m i n}$ and $N_{m a x}$ , the F1 score drops by about 5% on average, both for the label-specific scores and for the overall macro score in the prediction evaluation task.

Similarly to the effect of increasing $R_{m a x}$ , increasing the number of neighbors negatively affects performance when synthetic data are used as a training set. In this experiment, $R_{m a x}$ is fixed at 0.25, where the overall average number of nearest neighbors per instance is 1,131. We varied $[N_{m i n}, N_{m a x}]$ from $[4, 8]$ to $[256, 512]$ , and observed a drop in the macro F1 score from 0.7670 to 0.5497 (a reduction of nearly 30%). In addition, in more complex or heterogeneous datasets, we found that overly restrictive $N_{m a x}$ values can exclude critical but infrequent neighborhood patterns necessary for accurate predictions. A deeper investigation into this trade-off is left for future work.
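The figures quoted for this experiment can be cross-checked with a short calculation. Going from $[4, 8]$ to $[256, 512]$ corresponds to six doublings, and the overall macro F1 reduction can be converted into an average relative drop per doubling. The geometric averaging used here is an assumption about how the per-doubling figure is obtained, not a statement of the authors' procedure.

```python
import math

f1_start, f1_end = 0.7670, 0.5497     # macro F1 at [4, 8] and at [256, 512]
doublings = math.log2(256 / 4)        # number of times N_min (and N_max) is doubled
total_drop = 1 - f1_end / f1_start    # overall relative reduction
# Geometric average: the constant relative drop per doubling that yields total_drop.
per_doubling = 1 - (f1_end / f1_start) ** (1 / doublings)

print(f"doublings: {doublings:.0f}")                                  # 6
print(f"total relative drop: {total_drop:.1%}")                       # 28.3%
print(f"average relative drop per doubling: {per_doubling:.1%}")      # 5.4%
```

Both values agree with the text: a reduction of nearly 30% overall and roughly 5% per doubling.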

Evaluation using prediction performance

When analyzing the macro F1 scores shown in Table 3, we observed that although models trained with synthetic data are expected to underperform relative to those trained with real data, SYNNER can generate synthetic datasets that approximate the performance of their original counterparts in specific scenarios. For example, in the Infant Health dataset, the model trained on synthetic data achieves an F1 ratio of 0.9863 in a task that can be considered difficult to resolve ( $F 1 = 0.3721$ when using original data and $F 1 = 0.3670$ when using synthetic data as a training set), suggesting that our proposed method helps reinforce consistent local patterns and reduce noise. Such outcomes underscore the practical value of the SYNNER pipeline, particularly in contexts where privacy constraints limit access to real data.

However, the limitations of this method become apparent in datasets with weak or sparse local structures. In the Avila Bible and the Polish Bankruptcy datasets, for example, the F1 ratio decreased to 0.6088 and 0.4990, respectively. Insufficient density in the embedding space, likely due to the lack of categorical features, may have hindered the formation of meaningful synthetic clusters, underscoring the importance of local density in the design of sampling strategies for high-fidelity synthetic data generation. These performance drops may in part reflect the limitation of independent feature sampling, which can weaken the preservation of fine-grained correlations or subgroup-specific structures. This effect is especially noticeable in complex or heterogeneous datasets, where capturing relationships beyond local neighbor distributions is essential.

Finally, neighborhood constraints may affect the predictive performance of synthetic data unevenly across different label types within the same task. This effect is evident in the Genetic Disorder dataset, where $N_{m i n}$ and $N_{m a x}$ helped preserve consistent patterns when predicting the broader class label, but led to a noticeable performance drop when applied to the more granular subclass label. This trade-off underscores the challenge of balancing accuracy, diversity, and privacy, particularly when the complexity of the labels varies.

The evaluation results presented thus far have focused on the utility and representativeness of the synthetic data. However, in privacy-sensitive domains such as healthcare, it is equally important to assess how well synthetic data preserves privacy and mitigates the risk of exposing identifiable or sensitive information. To this end, we conduct a DP Evaluation in the following subsection, examining how well the synthetic data resists inference attacks under adversarial conditions.

Differential privacy evaluation

As mentioned in Section “Differential privacy”, we conducted a DP Evaluation to assess the balance between utility and privacy on the generated synthetic data. This task is performed through a label prediction task, considering a critical scenario in which an attacker gains access to the original and synthetic data, embeddings, and the embedding model, then tries to predict a missing feature value, such as some sensitive information about a patient.

As illustrated in Figure 1(C), three experimental settings were considered, using only the original dataset (DS), DS + Embeddings (Emb), and DS + the synthetic dataset generated (Syn). A fourth scenario is added to the experiment (the attacker owns DS + Emb + Syn), representing the maximum data exposure situation. For each feature in each resulting synthetic dataset, we evaluate its predictive potential. For continuous features, we use the Mean Reciprocal Rank (MRR) metric⁴⁷ to assess how closely the predictions match the actual values. This is done after discretizing each feature’s range into bins based on the standard deviation observed in the training set. For categorical features, we compute the F1 score, with a focus on predicting less frequent class values. To avoid trivial classification tasks dominated by the majority class, we select target values that occur in approximately 5%–10% of the dataset whenever possible. Figure 5 presents our evaluation results for all benchmark datasets using the XGBoost algorithm as a predictor.
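For continuous features, the scoring step can be sketched as follows. The sketch assumes that each bin is one training-set standard deviation wide and that the predictor returns a ranked list of candidate bins per case. The source states only that bins are based on the standard deviation, so the bin width, the helper names `make_binner` and `mean_reciprocal_rank`, and the toy values are assumptions.

```python
import statistics

def make_binner(train_values):
    """Discretize a continuous feature into bins one training std wide (assumed width)."""
    mean = statistics.mean(train_values)
    std = statistics.stdev(train_values)
    return lambda v: int((v - mean) // std)

def mean_reciprocal_rank(ranked_bins_per_case, true_bins):
    """ranked_bins_per_case[i] lists candidate bins for case i, best guess first."""
    total = 0.0
    for ranked, truth in zip(ranked_bins_per_case, true_bins):
        # Reciprocal rank is 1 / position of the true bin, or 0 if it is absent.
        total += 1.0 / (ranked.index(truth) + 1) if truth in ranked else 0.0
    return total / len(true_bins)

to_bin = make_binner([10, 12, 14, 16, 18])          # mean 14, std about 3.16
true_bins = [to_bin(v) for v in (14, 18, 10)]       # [0, 1, -2]
# Hypothetical ranked predictions for the three cases.
ranked = [[0, 1, -1], [0, 1, 2], [-1, 0, 1]]
mrr = mean_reciprocal_rank(ranked, true_bins)
print(true_bins, mrr)                               # [0, 1, -2] 0.5
```

In this toy case the true bin is ranked first, second, and not at all, giving reciprocal ranks of 1, 1/2, and 0, and an MRR of 0.5.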

Figure 5.

Differential privacy evaluation across benchmark datasets. Predictive performance (F1 for categorical features, MRR for continuous) is shown under four adversarial conditions: Using only the original dataset (DS), DS+Synthetic (DS+Syn), DS+Embeddings (DS+Emb), and DS+Syn+Emb. While embeddings alone rarely improve inference, the inclusion of synthetic data substantially improves prediction for some features (e.g., Infant Health: CXR, XRR, LF), underscoring both utility gains and privacy risks. (a) Genetic Disorder features—RR: Respiratory Rate(breaths/min); HR: Heart Rate(rates/min); FU: Follow-up; BA: Birth asphyxia; ABD: Autopsy shows birth defect; FA: Folic acid details; SMI: H/O serious maternal illness; RE: H/O radiation exposure(x-ray); SA: H/O substance abuse; IVF: Assisted conception IVF/ART; HA: History of anomalies in previous offspring; PA: No.of previous abortion; BD: Birth defects; BTR: Blood test result; PA: Patient Age; BCC: Blood cell count(mCL); WBCC: White Blood cell count. (b) Infant Health features—BA: Birth Asphyxia; HD: Hyp Distrib; HO2: Hypoxia O2; CO2; CXR: Chest X-Ray; GR: Grunting; LVHR: LVH Report; LBO2: Lower Body O2; RUQO2: RUQ O2; CO2R: CO2 Report; XRR: X-Ray Report; Dis: Disease; GR: Grunting Report; Age; LVH; DF: Duct Flow; CM: Cardiac Mixing; LP: Lung Parenchyma; LF: Lung Flow. (c) Fetal Health features—HT: Histogram Tendency; HNZ: Histogram number of zeroes; HNP: Histogram number of peaks; BV: Baseline Value; ASTV: Abnormal Short Term Variability. (d) Stroke features—SS: Smoking Status; WT: Work Type; HD: Heart Disease; HT: Hypertension; AGE: Age; AGL: Average Glucose Level; BMI. (e) Avila features—F1; F2; F3; F4; F5; F6; F7; F8; F9; F0. and (f) Polish Bankruptcy features—X1 to X64.

These datasets were chosen for their diverse feature sets and distinctive performance trends. We observed that combining the original dataset with embeddings alone does not significantly improve the F1 and MRR scores for any of the evaluated features. In contrast, incorporating synthetic data leads to notable performance gains across several features (e.g., CXR, XRR, and LF) in the Infant Health dataset (see Figure 5(b)). Although this underscores the potential of synthetic data to enhance model effectiveness, it also raises concerns about inadvertently exposing sensitive patterns. If performance improvements exceed acceptable privacy thresholds, mitigation strategies such as adding calibrated noise during generation (or post hoc to the synthetic data) can help reduce re-identification or leakage risks. Alternatively, increasing the $N_{m i n}$ constraint can further safeguard against exposing sensitive outliers in sparsely populated regions of the embedding space.

When performance improvements suggest potential privacy leakage, it is essential to implement safeguards that limit the risk of revealing sensitive information without compromising the usefulness of the resulting synthetic data. To that end, we evaluated two complementary strategies: (a) Increasing the minimum number of neighbors ( $N_{m i n}$ ) considered during synthetic data generation, which reduces the risk of modeling outliers in sparse regions and ensures that each synthetic instance is created from the statistical distribution of a larger local cluster; and (b) injecting controlled noise into the resulting synthetic dataset, which helps obscure sensitive feature correlations while maintaining high-quality data generation.

Figure 4 shows how increasing the minimum and maximum number of neighbors used to generate synthetic data can negatively affect the performance of the resulting synthetic dataset. This evaluation compares synthetic versions generated using $[N_{min}, N_{max}]$ ranges from $[4, 8]$ to $[256, 512]$. As the number of neighbors increases, synthetic data generation prioritizes broader generalization over local specificity. Consequently, we observe a steady decline in both individual label prediction performance and the overall macro F1 score in the original evaluation task when using the resulting synthetic data as the training set. This degradation stems from increasingly diluted clusters as more neighbors are aggregated. Because it affects all features in the dataset consistently, it may also help reduce the risk of data leakage or sensitive data reconstruction.
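A minimal sketch of how neighborhood constraints of this kind could be applied is given below. The helper `select_neighbors`, its parameter names, and the rule of skipping a target with fewer than `n_min` neighbors inside the radius are assumptions made for illustration; the section does not specify SYNNER's exact handling of sparse regions.

```python
import math


def select_neighbors(target, others, r_max, n_min, n_max):
    """Pick the nearest embeddings around `target`.

    Returns a list of (distance, index) pairs, closest first, capped at
    n_max. Returns None when fewer than n_min embeddings lie within r_max
    (assumed behavior: do not model an isolated outlier).
    """
    dists = sorted((math.dist(target, o), i) for i, o in enumerate(others))
    within = [(d, i) for d, i in dists if d <= r_max]
    if len(within) < n_min:
        return None
    return within[:n_max]


# Toy 2-D embeddings: four points in a dense region and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (0.3, 0.0), (5.0, 5.0)]
dense = select_neighbors(points[0], points[1:], r_max=1.0, n_min=2, n_max=2)
sparse = select_neighbors(points[4], points[:4], r_max=1.0, n_min=2, n_max=2)
```

Raising `n_min` and `n_max` in this sketch widens each neighborhood, which mirrors the trade-off described above: more generalization and less local specificity.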

We also evaluated the effect of introducing calibrated noise directly into the resulting synthetic dataset, aiming to obscure fine-grained patterns that could otherwise enable re-identification or reveal rare feature combinations. We experimented with two noise injection strategies: (a) adding noise only to specific target features considered more privacy sensitive or highly predictive, and (b) applying noise uniformly across all features in the dataset. In Figure 6, we contrast these two strategies with the idea of increasing $[N_{min}, N_{max}]$ described above, aiming to address privacy concerns related to the three features in the Infant Health dataset. We observed that increasing the number of nearest neighbors has less impact on performance than adding controlled noise to the synthetic data, since SYNNER assigns higher weights to neighbors closer to the center of each cluster (i.e., each synthetic target instance), regardless of the total number of neighbors considered (Figure 6(A)). Furthermore, we found that adding noise only to features identified as sensitive to reconstruction does not necessarily reduce privacy risks by the same proportion, mainly when those features can be inferred from combinations of other features (Figure 6(B)). Only after adding 100% noise, which fully randomized the target features, did the performance of the synthetic model drop to the same level observed in the original data. This suggests that applying controlled noise across all features may be a more effective strategy to mitigate potential data leakage (Figure 6(C)).
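The two noise strategies can be illustrated with a small sketch. The noise model used here, replacing a value with a random draw from the same column at a given rate, is an assumption for illustration (the section does not define the noise mechanism), as are the function name `inject_noise` and the example values.

```python
import random


def inject_noise(rows, columns, rate, seed=0):
    """Return a copy of `rows` in which each value of the listed columns is
    replaced, with probability `rate`, by a random value observed in that
    column. rate=1.0 fully randomizes the listed columns.
    """
    rng = random.Random(seed)
    pools = {c: [r[c] for r in rows] for c in columns}
    noisy = []
    for row in rows:
        new_row = dict(row)  # leave the input untouched
        for c in columns:
            if rng.random() < rate:
                new_row[c] = rng.choice(pools[c])
        noisy.append(new_row)
    return noisy


# Made-up records for illustration only.
rows = [
    {"CXR": "normal", "LF": "low", "Age": "0-3 days"},
    {"CXR": "oligaemic", "LF": "low", "Age": "4-10 days"},
    {"CXR": "plethoric", "LF": "high", "Age": "0-3 days"},
]
targeted = inject_noise(rows, ["CXR"], rate=1.0)               # strategy (a)
global_noise = inject_noise(rows, ["CXR", "LF", "Age"], 0.5)   # strategy (b)
```

In the targeted variant the untouched columns keep their original relationships, which is consistent with the observation that a sensitive feature may still be inferred from combinations of other features.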

Figure 6.

Comparison of three strategies to mitigate privacy leakage in the Infant Health dataset: (a) Increasing nearest neighbor constraints, (b) adding noise to specific sensitive features (CXR = Chest X-Ray; XRR = X-Ray Report, LF = Lung Flow), and (c) adding noise across all features. Results show that global noise injection is most effective at reducing predictive performance for sensitive features, suggesting it as a stronger safeguard than targeted feature noise or neighborhood expansion alone.

Finally, we analyzed duplicate records using two benchmark datasets: (a) Mushroom (in Table 6), which consists entirely of categorical features, and (b) Avila Bible (in Table 7), which contains continuous features (except for the target label), all normalized to similar scales. We compared the characteristics of the original datasets with their corresponding synthetic versions generated by four tools, including SYNNER.

Table 6.

Analysis of duplicate and unique instances in the Mushroom dataset when synthetic data are generated by different tools (synthpop, ARGN, Conditional tabular generative adversarial network (CTGAN), SYNNER).

Synthetic approach	% Unique instances	Duplicate ratio (regarding original)	F1 score (original task)
Original	94.74%	–	1.0000
Synthpop	40.70%	7.05%	0.9988
ARGN	48.78%	52.51%	0.9994
CTGAN	83.39%	12.00%	0.9351
SYNNER	48.69%	37.28%	0.9919

Percentages of unique instances, duplicate ratios relative to the original dataset, and F1 scores for the original classification task are reported. Results show that while SYNNER generates fewer unique records than CTGAN, it achieves higher predictive accuracy, placing it in an intermediate position that balances utility and duplication risk.

Table 7.

Analysis of duplicate and unique instances in the Avila Bible dataset under different synthetic generation tools.

Synthetic approach	% Unique instances	Duplicate ratio (regarding original)	Macro F1 score (original task)	Avg Min Diff (L2 norm)
Original	99.94%	–	0.9932	0.93589
Synthpop	98.37%	1.43%	0.9294	0.51521
ARGN	100.00%	0.01%	0.6256	0.71424
CTGAN	100.00%	0.00%	0.2276	1.01658
SYNNER	100.00%	0.00%	0.6046	0.82237

Continuous variables were rounded to one decimal place when computing duplicates. Metrics reported include percentage of unique instances, duplicate ratios relative to the original dataset, macro F1 scores, and average minimum pairwise Euclidean distance (“Avg Min Diff”) as a measure of similarity between records. Results indicate that SYNNER avoids generating exact duplicates, while achieving a similarity profile (“Avg Min Diff”) closer to the original dataset than other methods, balancing fidelity with privacy preservation.

We considered two instances in the Mushroom dataset to be duplicates if they shared identical values across all features. In the original training set, nearly 95% of the instances are unique. As expected, synthetic data generation tools may reproduce this tendency to duplicate records. Synthpop, ARGN, and SYNNER generated synthetic datasets in which only about 40% to 50% of the instances were unique, yet they still maintained high predictive accuracy on the test set. Synthpop was also effective in avoiding duplication of instances from the original data (about 7%). CTGAN, on the other hand, produced far fewer internal duplicate records and just 12% duplicates of the original data, but showed a notable drop in performance, with a reduction of nearly 7% in the F1 score, suggesting that the synthetic data loses utility as it becomes more random. Although SYNNER occupies an intermediate position when balancing internal versus original duplicates, the resulting synthetic data still performs accurately in the prediction task.
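The duplicate analysis for categorical data can be sketched as follows. The exact formulas are assumptions: here "% unique" is taken as the number of distinct rows divided by the total number of rows, and the "duplicate ratio" as the share of synthetic rows that also occur in the original data. The function name and toy rows are hypothetical.

```python
def duplicate_stats(original, synthetic):
    """Return (% unique synthetic rows, % synthetic rows also in original).

    Two rows count as duplicates when all feature values are identical.
    """
    orig_set = set(map(tuple, original))
    syn_rows = list(map(tuple, synthetic))
    pct_unique = 100.0 * len(set(syn_rows)) / len(syn_rows)
    dup_ratio = 100.0 * sum(r in orig_set for r in syn_rows) / len(syn_rows)
    return pct_unique, dup_ratio


# Toy categorical rows for illustration only.
original = [("x", "s", "n"), ("x", "y", "w"), ("b", "s", "w")]
synthetic = [("x", "s", "n"), ("x", "s", "n"), ("b", "y", "n"), ("f", "f", "g")]
pct_unique, dup_ratio = duplicate_stats(original, synthetic)
```

In this toy case three of the four synthetic rows are distinct, and two of them repeat an original record.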

In the Avila Bible dataset, we considered two instances to be duplicates if they shared identical values after rounding each continuous feature to one decimal place. The original training set contains only a few duplicates (in fact, very similar instances). Synthpop was the only tool that produced a noticeable share of internal and original duplicates, each under 2% of the resulting synthetic dataset; this coincided with higher prediction performance but also raises concerns about privacy and data leakage. For each instance, we also computed the distance to its closest pair, defined as the minimum Euclidean distance (L2 norm) over all feature values to any other instance, and report the average in Table 7 as "Avg Min Diff". The smaller this value, the more similar the instances are within the resulting dataset. SYNNER closely approximates the degree to which instances differ from each other in the original data.
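Both measures for continuous data can be written compactly. This is a sketch under stated assumptions: the function names are hypothetical, "% unique" is again taken as distinct rounded rows over total rows, and the brute-force pairwise search is quadratic in the number of rows, so a spatial index would be preferable for large datasets.

```python
import math


def pct_unique_after_rounding(rows, ndigits=1):
    """Share of distinct rows once every value is rounded to `ndigits`."""
    keys = {tuple(round(v, ndigits) for v in row) for row in rows}
    return 100.0 * len(keys) / len(rows)


def avg_min_diff(rows):
    """Average, over all rows, of the Euclidean distance to the closest
    other row (the "Avg Min Diff" idea described above)."""
    total = 0.0
    for i, a in enumerate(rows):
        total += min(math.dist(a, b) for j, b in enumerate(rows) if j != i)
    return total / len(rows)


# Toy continuous rows for illustration only.
near = [(0.12, 1.0), (0.14, 1.04), (0.5, 1.0)]
spread = [(0.0, 0.0), (3.0, 4.0), (3.0, 5.0)]
unique_pct = pct_unique_after_rounding(near)   # first two rows collapse
closest = avg_min_diff(spread)                 # (5 + 1 + 1) / 3
```

The rounding precision acts as a similarity threshold, so the reported duplicate counts depend on it.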

Finally, we computed the number of unique instances and the duplicate ratio in each of the evaluated datasets (original and synthetic versions produced by our approach), as shown in Table 8. The original Stroke Prediction dataset stands out for its large number of duplicates when only categorical features are considered: fewer than 3% of its instances are unique. SYNNER captured this behavior, generating only 3.46% unique synthetic combinations of categorical values. The same pattern is reflected in the high duplication ratio with respect to the original data: only approximately 3% of the instances in the synthetic version are distinct from those in the original data.

Table 8.

Comparison of unique instance percentages and duplicate ratios between original datasets and SYNNER-generated synthetic versions.

Dataset	% Unique instances (original), categorical only	% Unique instances (original), with continuous	% Unique instances (SYNNER), categorical only	% Unique instances (SYNNER), with continuous	Duplicate ratio (regarding original), categorical only	Duplicate ratio (regarding original), with continuous
Fetal health	98.19%	98.77%	100.00%	100.00%	0.00%	0.00%
Infant health	89.15%	–	85.64%	–	11.51%	–
Stroke prediction	2.98%	99.76% $^{(a)}$	3.46%	91.96% $^{(b)}$	96.97%	0.20% $^{(c)}$
Genetic disorder $^{(d)}$	100.00%	100.00%	100.00%	100.00%	0.00%	0.00%
Avila Bible $^{(e)}$	–	99.94%	–	100.00%	–	0.00%
Polish bankruptcy $^{(f)}$	–	99.94%	–	100.00%	–	0.00%
Mushroom classification	94.74%	–	48.69%	–	37.28%	–

Results are reported separately for categorical-only features and continuous-inclusive features (rounded as indicated). Findings illustrate dataset-specific duplication behaviors in both original and synthetic data. While some duplicates appear in SYNNER-generated data, these are not intentional replications of original records; rather, they arise from the probabilistic sampling process, where feature values are drawn independently from the distributions of nearest neighbors, which reflects the balance between preserving local feature distributions and minimizing overfitting to individual records.

$^{(a, b, c)}$ Continuous variables rounded to integers.

$^{(d, e, f)}$ All continuous variables rounded to one decimal place.

Our privacy analysis evaluates empirical resilience against adversarial privacy attacks. While these simulations demonstrate robustness, they do not constitute a formal DP guarantee. Thus, we report empirical reductions in privacy risk but do not claim provable DP compliance. We acknowledge that highly sensitive applications, such as those involving the infant health dataset, require stronger safeguards. Several mitigation strategies can reduce leakage risks if our method is deployed in practice. First, formal DP mechanisms can be integrated during training by injecting calibrated noise into gradients or outputs,^48,49 thereby providing provable guarantees that individual records cannot be inferred. Second, data minimization and aggregation strategies should be applied, where rarely occurring or uniquely identifying variables are generalized or grouped to reduce the risk of re-identification. Third, generating synthetic data under DP constraints can serve as a privacy-preserving alternative, particularly in cases where rare events, such as neonatal conditions, could otherwise be memorized by the model. Furthermore, because membership inference attacks⁵⁰ remain a particularly significant threat in health data scenarios, where an adversary attempts to determine whether the record of a specific individual was part of the training set, our framework could be extended with adversarial training or defenses based on DP to mitigate this risk further. Finally, robust access control and audit mechanisms should be implemented to monitor the use of trained models with limited exposure of potentially sensitive outputs.
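As an illustration of what a formal DP mechanism with calibrated noise looks like, the sketch below applies the standard Laplace mechanism to a counting query. It is not part of SYNNER and is far simpler than the gradient-level mechanisms cited above; the function name `dp_count`, the privacy budget `epsilon`, and the toy ages are assumptions. A counting query has sensitivity 1, so Laplace noise with scale $1/\epsilon$ yields $\epsilon$-DP for that single query.

```python
import random


def dp_count(values, predicate, epsilon, rng):
    """Release a count with Laplace noise of scale 1/epsilon.

    The difference of two independent exponential draws with rate epsilon
    is Laplace-distributed with scale 1/epsilon.
    """
    true_count = sum(1 for v in values if predicate(v))
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


# Toy data for illustration only: ages in days of hypothetical records.
ages = [1, 2, 2, 5, 9, 12, 30]
rng = random.Random(0)
noisy = dp_count(ages, lambda a: a <= 3, epsilon=1.0, rng=rng)
```

Smaller values of `epsilon` add more noise and give stronger guarantees; repeated queries consume additional privacy budget.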

Real-world applicability

A central motivation for SYNNER is its potential use in clinical settings, where hospitals often need predictive models but face critical constraints on sharing sensitive patient data. In such scenarios, external companies or research groups may offer candidate solutions, yet they cannot be granted access to the original clinical data at the outset. SYNNER provides a practical pathway to overcome this barrier by enabling hospitals to generate synthetic datasets that preserve a substantial fraction of the original predictive signal (approximately 83.2% in our experiments).

Consider a real-world example: A hospital needs to develop a predictive model but cannot grant immediate access to its clinical data. In this scenario, SYNNER enables the hospital to generate synthetic datasets that capture a significant portion of the predictive signal while maintaining robust privacy safeguards. These synthetic datasets can be shared with external research groups or companies for prototyping models without direct access to the original clinical data.

Internally, the hospital can then evaluate candidate models against held-out real test data, thus quantifying the performance gap between models trained on synthetic data and those trained on the original data. This screening step substantially reduces the number of external agents that ultimately require access to sensitive clinical data, ensuring that only the most robust approaches move forward and thereby minimizing exposure risks.
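The performance gap in this screening step can be quantified with a macro-F1 ratio, sketched below in plain Python. The function name, the toy labels, and the predictions are hypothetical; in practice the predictions would come from a candidate model trained on synthetic data and a reference model trained on the original data, both scored on the same held-out real test set.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy held-out labels and predictions for illustration only.
y_true = [1, 1, 0, 0, 1, 0]
pred_trained_on_original = [1, 1, 0, 0, 1, 1]
pred_trained_on_synthetic = [1, 0, 0, 0, 1, 1]

f1_orig = macro_f1(y_true, pred_trained_on_original)
f1_syn = macro_f1(y_true, pred_trained_on_synthetic)
ratio = f1_syn / f1_orig  # share of the original macro-F1 that is retained
```

A ratio close to 1 suggests that the candidate approach is worth advancing to validation on the original data.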

However, it is essential to note that synthetic data alone is not intended for the final deployment of clinical models. Their role is to support pre-evaluation and model screening, allowing institutions to identify promising approaches while preserving patient privacy. The representativeness of synthetic datasets, even when relatively high, does not guarantee preservation of all feature dependencies or subgroup-specific relationships. Therefore, once candidate models have been identified, they must undergo independent validation and retraining on the original, unseen clinical data as an indispensable step to confirm reliability and safety before deployment. In this sense, the contribution of SYNNER is to enable privacy-preserving prototyping and collaboration, not to replace validated clinical models.

SYNNER’s built-in DP evaluation further strengthens this workflow by quantifying privacy risks and supporting mitigation strategies such as tighter neighborhood constraints or calibrated noise injection. These safeguards should be viewed as mechanisms to measure and manage potential biases in model performance and re-identification risks, rather than as a means to eliminate them.

Discussion

In this study, we present SYNNER, our proposal for generating synthetic data. SYNNER implements a pipeline structured in three phases: (a) embedding generation, (b) synthetic data generation, and (c) DP evaluation. In the first phase, we generate embeddings from the original dataset using feature similarities and patterns within the data. In the second phase, we generate synthetic entities by randomly sampling feature values from their k-nearest neighbors in the embedding space, preserving the original relational properties while avoiding instance replication. In the last phase, we assess the DP of the generated synthetic datasets to determine whether sensitive data can be reconstructed.

Our framework included an empirical density analysis to establish neighborhood constraints to reduce the influence of outliers. With the analysis, we found that the effectiveness of SYNNER’s embedding-based sampling is highly sensitive to the maximum radius ( $R_{m a x}$ ), minimum neighbors ( $N_{m i n}$ ) and maximum neighbors ( $N_{m a x}$ ) parameters. Our findings show that a uniform parameter configuration across datasets is suboptimal, as evidenced by the wide range of F1 score ratios observed. Mushroom and Infant Health exhibited strong performance with synthetic data, suggesting that their intrinsic data geometry aligned well with SYNNER’s configuration. Conversely, datasets such as Genetic Disorder, Avila Bible, and Polish Bankruptcy yielded lower F1 ratios, indicating class imbalance or sparsity that may require further tuning.

Across seven publicly available datasets, SYNNER preserved, on average, 83.2% of the original data signals, and models trained on SYNNER-generated data reached a proportional average macro-F1 score of 74.4% relative to models trained on the original data. Our DP evaluation protocol further assesses whether SYNNER-generated datasets meet expected privacy standards and highlights potential risks of individual data point reconstruction. For cases where privacy risk remains, mitigation strategies such as tighter sampling constraints or calibrated noise can be applied during pre- or post-processing. This evaluation stage serves as a comprehensive diagnostic to assess the effectiveness of our design choices, including embedding strategies and sampling parameters. By identifying where synthetic data fall short in preserving predictive signal and privacy, researchers are better positioned to refine and tune their generative parameters.

Overall, our research contributes to ongoing efforts to generate synthetic data, providing insights into techniques for producing more realistic synthetic data. Most importantly, by addressing the limitations of existing approaches, SYNNER offers a scalable, privacy-preserving solution for synthetic data generation, paving the way for responsible research in sensitive areas such as healthcare.

Although our method has shown promising results in generating synthetic data, it has limitations, including the need to sample feature values for each synthetic instance independently. Specifically, each feature is sampled from a weighted distribution computed over the nearest neighbors in the embedding space. As a result, specific relational characteristics, such as temporal dependencies or feature correlations tied to demographic groups (e.g., sex-specific traits), may only be preserved by chance. This limitation becomes more pronounced in scenarios where data relationships evolve over time or exhibit structural dependencies. Future work should explore approaches that preserve these dependencies, for example, through sequence-aware models to retain temporal dynamics in longitudinal data, or conditional generative methods (CGMs) to enforce demographic and clinical subgroup consistency. These methods have been used in the medical/biomedical domain as effective means to capture evolving patterns and generate patient-specific synthetic data while maintaining subgroup consistency and realistic joint distributions.^51,52
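The effect of independent per-feature sampling can be seen in a small sketch. The inverse-distance weighting, the function name `sample_synthetic`, and the toy neighbors are assumptions for illustration; the text only states that features are drawn from a weighted distribution over nearest neighbors, with closer neighbors weighted more heavily.

```python
import random


def sample_synthetic(neighbors, distances, rng):
    """Draw each feature independently from the neighbors' values,
    weighting closer neighbors more heavily (assumed: 1 / distance)."""
    weights = [1.0 / (1e-9 + d) for d in distances]
    return {
        f: rng.choices([n[f] for n in neighbors], weights=weights, k=1)[0]
        for f in neighbors[0]
    }


# Two toy neighbors whose features are perfectly correlated.
neighbors = [
    {"sex": "F", "pregnant": "yes"},
    {"sex": "M", "pregnant": "no"},
]
rng = random.Random(0)
combos = {
    tuple(sample_synthetic(neighbors, [0.1, 0.1], rng).values())
    for _ in range(200)
}
```

Because the two features are sampled separately, `combos` can contain pairs such as `("M", "yes")` that appear in neither neighbor, which is exactly the loss of joint dependencies described above.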

Because SYNNER is designed to capture feature distributions of continuous and categorical tabular data, it is less suited for scenarios involving dynamic entity relationships or longitudinal structures. Extending the framework with temporal modeling or subgroup-aware sampling would address this limitation and broaden its applicability to complex healthcare datasets. In this direction, we also plan to create synthetic versions of the publicly available Medical Information Mart for Intensive Care datasets (MIMIC-III and MIMIC-IV).

As noted in Sections “Synthetic data generation: Embedding space sampling” and “Synthetic data evaluation”, an important limitation of SYNNER is that feature values are sampled independently from local neighbor distributions in the embedding space. While this strategy helps avoid direct record duplication, it may fail to preserve higher-order dependencies, such as temporal dynamics and strong inter-feature correlations, which are relevant to clinical data. As a result, specific relational characteristics, such as disease progression patterns or demographic subgroup associations, may be preserved only by chance. Future work should address this limitation by exploring sequence-aware models that retain temporal information or CGMs that explicitly enforce subgroup consistency and joint feature dependencies.

Finally, our investigation of instance duplication offers valuable insights for both utility and privacy assessment in synthetic data. Although we were able to identify cases in which full-instance duplicates occur in both the original and synthetic data based on categorical features (as observed in the Mushroom dataset), comparing continuous features is more subjective. It is influenced by factors such as bin size or similarity thresholds. Addressing this complexity is essential for more precise evaluations of fidelity and privacy. Future research could build on our work by developing adaptive similarity measures or incorporating domain-specific knowledge to assess similarity in continuous data. These directions not only enhance the robustness of synthetic data evaluation but also reinforce the practical utility of SYNNER for generating synthetic data in complex real-world scenarios.

Conclusions

In this study, we present SYNNER, our proposal for generating synthetic data, designed to capture feature distributions of continuous and categorical tabular data; it is less suited for scenarios involving dynamic entity relationships or longitudinal structures. SYNNER implements a pipeline structured in three phases: (a) embedding generation, (b) synthetic data generation, and (c) DP evaluation. In the first phase, we generate embeddings from the original dataset using feature similarities and patterns within the data. In the second phase, we generate synthetic entities by randomly sampling feature values from their $k$-nearest neighbors in the embedding space, preserving the original relational properties while avoiding instance replication. In the last phase, we assess the DP of the generated synthetic datasets to determine whether sensitive data can be reconstructed. We envisage extending SYNNER in two ways: (a) adding temporal modeling or subgroup-aware sampling to the framework, so that temporal dependencies and feature correlations tied to demographic groups are no longer preserved only by chance; and (b) creating synthetic versions of the publicly available MIMIC-III and MIMIC-IV datasets to validate the use of SYNNER-generated synthetic data in more realistic scenarios.

Footnotes

Acknowledgment

The authors would like to acknowledge the support of INSAFEDARE Project (Grant agreement ID: 101095661), which made this research possible.

ORCID iD

Hegler Tissot

Contributorship

Hegler Tissot: Designing and implementing core features, running final experiments, writing, and reviewing the manuscript.

Justin Moore: Implementing code, running validation experiments, writing.

Eric Benton: Implementing and running code to test related work.

Sarah Alshahrani: Implementing and running code to test related work.

Maria Helena Franciscatto: Designing and testing core features, performing results analysis, writing, and reviewing the manuscript.

Marcos Didonet Del Fabro: Designing and testing core features, performing results analysis, writing, and reviewing the manuscript.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research work has been partially funded by the INSAFEDARE Project (Grant agreement ID: 101095661).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Guarantor

CEA LIST, France.

References

1. Collins, Varmus. A new initiative on precision medicine. New Engl J Med 2015; 372: 793–795.

2. Liu, Luo, Jiang, et al. Difficulties and challenges in the development of precision medicine. Clin Genet 2019; 95: 569–574.

3. Office of the National Coordinator for Health Information Technology. Challenges in public health reporting experienced by non-federal acute care hospitals. https://www.healthit.gov/data/data-briefs/challenges-public-health-reporting-experienced-non-federal-acute-care-hospitals (2021, accessed: 27 August 2024).

4. US Department of Health & Human Services. Health information privacy: HIPAA Privacy Rule. https://www.hhs.gov/hipaa/for-professionals/privacy/index.html, Content last reviewed September 27. (2024, accessed: 12 January 2022).

5. HIPAA Journal. HIPAA Violation Cases. https://www.hipaajournal.com/hipaa-violation-cases/.

6. ARX. Anonymization tool. https://arx.deidentifier.org/anonymization-tool/#a25.

7. CloverDX. Data anonymization. https://www.cloverdx.com/data-anonymization.

8. Amnesia. High accuracy data anonymization. https://amnesia.openaire.eu/.

9. Unnikrishnan, Naini. De-anonymizing private data by matching statistics. In: 2013 51st annual allerton conference on communication, control, and computing (Allerton), 2013, pp.1616–1623. DOI: 10.1109/Allerton.2013.6736722.

10. Mittal, Beyah. Graph data anonymization, de-anonymization attacks, and de-anonymizability quantification: A survey. IEEE Commun Surveys Tutorials 2017; 19: 1305–1326.

11. Nowok, Raab, Dibben. SynthPop: Bespoke creation of synthetic data in R. J Stat Softw 2016; 74. DOI: 10.18637/jss.v074.i11.

12. Choi, Biswal, Malin, et al. Generating multi-label discrete patient records using generative adversarial networks. In: Doshi-Velez, Fackler, Kale, et al. (eds.) Proceedings of the 2nd machine learning for healthcare conference, Proceedings of Machine Learning Research, Vol. 68, pp.286–305. PMLR.

13. Skoularidou, Cuesta-Infante, et al. Modeling tabular data using conditional GAN. Red Hook, NY, USA: Curran Associates Inc., 2019.

14. Xie, Lin, Wang, et al. Differentially private generative adversarial network, 2018. DOI: 10.48550/ARXIV.1802.06739.

15. Syntheticmass. https://synthea.mitre.org/about.

16. Dube, Gallagher. Approach and method for generating realistic synthetic electronic healthcare records for secondary use. In: Gibbons J and MacCaull W (eds.) Foundations of Health Information Engineering and Systems, pp.69–86. Berlin, Heidelberg: Springer Berlin Heidelberg. ISBN 978-3-642-53956-5.

17. Feng, Mayer, et al. A schema-driven synthetic knowledge graph generation approach with extended graph differential dependencies (gddxs). IEEE Access 2021; 9: 5609–5639.

18. Tiwald, Krchova, Sidorenko, et al. TabularARGN: A flexible and efficient auto-regressive framework for generating high-fidelity synthetic data, 2025. DOI: 10.48550/ARXIV.2501.12012.

19. Barse, Kvarnström, Jonsson. Synthesizing test data for fraud detection systems. In: Proceedings of the 19th annual computer security applications conference. ACSAC ’03, p.384. USA: IEEE Computer Society. ISBN 0769520413.

20. Abadi, Chu, Goodfellow, et al. Deep learning with differential privacy. In: Proceedings of the 2016 ACM SIGSAC conference on computer and communications security. CCS ’16, 2016, pp.308–318. New York, NY, USA: Association for Computing Machinery. ISBN 9781450341394. DOI: 10.1145/2976749.2978318.

21. Shadbolt, Smart. Knowledge elicitation: Methods, tools and techniques. In: Wilson JR and Sharples S (eds.) Evaluation of Human Work, 2015. pp.163–200. CRC Press.

22. Huang, Jafari. Enhanced balancing gan: minority-class image generation. Neural Comput Appl 2023; 35: 5145–5154.

23. Dwork, Roth. The algorithmic foundations of differential privacy, 2013.

24. Luo, Chen, et al. Outlier-eliminated k-means clustering algorithm based on differential privacy preservation. Appl Intell 2016; 45: 1179–1191.

25. Tissot, Pedebos. Improving risk assessment of miscarriage during pregnancy with knowledge graph embeddings. J Health Informat Res 2021; 5: 359–381.

26. Goodfellow, Pouget-Abadie, Mirza, et al. Generative adversarial nets. In: Proceedings of the 27th international conference on neural information processing systems - Volume 2. NIPS’14, pp.2672–2680. Cambridge, MA, USA: MIT Press.

27. Ribeiro, Tiels, Aguirre, et al. Beyond exploding and vanishing gradients: analysing rnn training using attractors and smoothness. In: Chiappa S and Calandra R (eds.) Proceedings of the twenty third international conference on artificial intelligence and statistics, Proceedings of machine learning research, Vol. 108, pp.2370–2380. PMLR.

28. Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. Int J Uncert Fuzz Knowled-Based Syst 1998; 06: 107–116.

29. Aoe: Angle-optimized embeddings for semantic textual similarity. In: Proceedings of the 62nd Annual meeting of the association for computational linguistics (Volume 1: Long Papers). Association for Computational Linguistics, pp.1825–1839. DOI: 10.18653/v1/2024.acl-long.101.

30. Durall, Chatzimichailidis, Labus, et al. Combating mode collapse in gan training: An empirical analysis using hessian eigenvalues, 2020. DOI: 10.48550/ARXIV.2012.09673.

31. Creswell, White, Dumoulin, et al. Generative adversarial networks: An overview. IEEE Signal Process Mag 2018; 35: 53–65.

32. Gulrajani, Ahmed, Arjovsky, et al. Improved training of wasserstein GANs. In: Proceedings of the 31st international conference on neural information processing systems. NIPS’17, pp.5769–5779. Red Hook, NY, USA: Curran Associates Inc. ISBN 9781510860964.

33. Englesson, Azizpour. Generalized jensen-shannon divergence loss for learning with noisy labels, 2021. DOI: 10.48550/ARXIV.2105.04522.

34. Bordes, Usunier, Garcia-Duran, et al. Translating embeddings for modeling multi-relational data. In: Burges CJC, Bottou, Welling, et al. (eds.) Advances in neural information processing systems 26, 2013, pp.2787–2795. Curran Associates, Inc.

35. Tissot. HEXTRATO: Using Ontology-based Constraints to Improve Accuracy on Learning Domain-specific Entity and Relationship Embedding Representation for Knowledge Resolution. In: Proceedings of the 10th international joint conference on knowledge discovery, knowledge engineering and knowledge management, IC3K 2018, Volume 1: KDIR, Seville, Spain, September 18-20, 2018. pp.70–79.

36. Tissot, Pedebos. Improving risk assessment of miscarriage during pregnancy with knowledge graph embeddings. J Health Informat Res 2021; 5: 359–381.

37. Mushroom. UCI Machine Learning Repository, 1981. DOI: 10.24432/C5959T.

38. Stefano, Fontanella, Maniaci, et al. Avila. UCI Machine Learning Repository, 2018. DOI: 10.24432/C5K02X.

39. Tomczak. Polish Companies Bankruptcy. UCI Machine Learning Repository, 2016. DOI: 10.24432/C5F600.

40. Campos, Bernardes. Cardiotocography. UCI Machine Learning Repository, 2000. DOI: 10.24432/C51S4N.

41. Nickel, Kiela. Poincaré embeddings for learning hierarchical representations. Adv Neural Inf Process Syst 2017; 3: 30.

42. Tissot. FormulAI: Designing rule-based datasets for interpretable and challenging machine learning tasks. Artif Intell Appl 2024; 3: 72–82.

43. Martino, Luengo, Míguez. Independent Random Sampling Methods. Basel, Switzerland: Springer International Publishing, 2018.

44. Wang, Hamilton. Dbrs: A density-based spatial clustering method with random sampling. In: Whang, Jeon, Shim, et al. (eds.) Advances in knowledge discovery and data mining, pp.563–575. Berlin, Heidelberg: Springer Berlin Heidelberg. ISBN 978-3-540-36175-6.

45. Cunningham, Delany. k-nearest neighbour classifiers - a tutorial. ACM Comput Surv 2021; 54: 1–25.

46. Chen, Guestrin. Xgboost: A scalable tree boosting system. In: Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining. pp.785–794.

47. Craswell. Mean Reciprocal Rank. New York, NY: Springer US, 2009.

48. Abadi, Chu, Goodfellow, et al. Deep learning with differential privacy. In: Proceedings of the 2016 ACM SIGSAC conference on computer and communications security. CCS’16, pp.308–318, ACM. DOI: 10.1145/2976749.2978318.

49. Mironov. Rényi differential privacy. In: 2017 IEEE 30th Computer security foundations symposium (CSF), pp.263–275. IEEE. DOI: 10.1109/csf.2017.11.

50. Shokri, Stronati, Song, et al. Membership inference attacks against machine learning models. In: 2017 IEEE symposium on security and privacy (SP), pp.3–18. IEEE. DOI: 10.1109/sp.2017.41.

51. Ngu AHH, Metsis. TTS-CGAN: A transformer time-series conditional GAN for biosignal data augmentation. CoRR 2022; abs/2206.13676. DOI: 10.48550/ARXIV.2206.13676.

52. Liu, Altman. Conditional generative models for synthetic tabular data: Applications for precision medicine and diverse representations. Ann Rev Biomed Data Sci 2025; 8: 21–49.
