
Abstract

Objective

To systematically map the research landscape of synthetic data in healthcare between 2000 and 2024, revealing prevalent topics and tracking their evolution over time and across geographic locations.

Methods

We applied structural topic modeling (STM) to map this landscape, identifying prevalent topics and their evolution over time and geography. PubMed articles from 2000 to 2024 with “synthetic data,” “artificial data,” or “simulated data” in the title/abstract were analyzed. Texts were preprocessed (lowercasing, stopword removal, stemming), and STM was run with year and continent as covariates. The optimal number of topics (K = 10) was selected based on held-out likelihood and interpretability. Topic trends and correlations were analyzed using stacked area charts and network analysis.

Results

Across the 7533 included articles, annual publications grew roughly 20-fold between 2000 and 2024. North America (48.1%) and Europe (31.8%) dominated early research, while Asia's share rose from 4.7% to 24.1%. Topics were grouped into four themes: Biomedical Imaging & Signal Processing (21.1%), Synthetic Data Applications (20.7%), Computational & Statistical Methods (34.3%), and Genomics & Molecular Biology (23.9%). Initially prominent topics such as “Bayesian Modeling” (23.1% to 10.8%) and “Statistical Bias & Missing Data” (21.9% to 7.1%) declined, while “Synthetic Data Generation” (2.7% to 23.0%) and “Disease Modeling and Public Health” (3.5% to 14.3%) grew substantially.

Conclusion

Synthetic data research in healthcare is expanding, with shifting regional contributions and evolving topic focus. Realizing its potential requires cross-disciplinary collaboration, bias mitigation, and equitable partnerships.

Keywords

Healthcare, machine learning, network visualization, structural topic modeling, synthetic data

Introduction

The healthcare sector is increasingly leveraging data-driven approaches to improve patient outcomes, optimize operational efficiency, and advance medical research. However, acquiring comprehensive healthcare data is often constrained by high costs associated with advanced data acquisition techniques, privacy regulations, and ethical considerations regarding patient data collection and sharing.^1,2 These factors collectively impede the utilization of comprehensive data for advancing patient care and medical research. To address these challenges, synthetic data offer a promising solution.

Synthetic data are artificially generated datasets that mimic the statistical properties of real-world data without exposing sensitive information.³ They mitigate data scarcity and privacy concerns, enabling a broader scope of research and experimentation without patient data exposure.⁴ Moreover, they facilitate the creation of diverse and representative datasets, improve AI model generalizability, and mitigate bias arising from skewed or underrepresented populations.⁵ Synthetic data also facilitate innovation while safeguarding patient confidentiality, making them a crucial tool for overcoming data access barriers.

While a number of reviews have examined synthetic data in healthcare,^3,5–11 they predominantly focus on data generation methods (Table 1). Although valuable, these reviews often rely on manual theme identification, which introduces potential subjectivity and bias. Structural topic modeling (STM) is a probabilistic, unsupervised machine learning technique designed to uncover latent themes in textual corpora by incorporating document-level metadata.¹² Compared to traditional systematic reviews or bibliometric analyses, STM provides a scalable, data-driven way to identify themes, quantify their prevalence, and examine how they evolve over time.^13,14 In contrast to Latent Dirichlet Allocation (LDA) and related topic models, STM extends the analysis by explicitly modeling covariates, thereby enabling richer insights into research trends and contextual factors. While STM mitigates certain forms of subjective bias, its outputs remain sensitive to preprocessing decisions, parameter choices, and corpus selection, and thus the method cannot be regarded as entirely unbiased. Taking this into account, STM complements rather than replaces existing approaches by offering an additional lens on thematic structures and the evolution of research priorities in the literature. This is important in informing strategic decision-making by policymakers, funding agencies, and researchers.^15,16

Table 1.

Summary of review studies on synthetic data in healthcare.

Authors	Year	Period	Topic	Database	Number of articles	Methods
Gonzales et al.³	2023	-	Synthetic data in health care	PubMed, Scopus, and Google Scholar	72	A narrative review focused on thematic analysis of use cases
Pezoulas et al.⁶	2024	2015–2024	Synthetic data generation methods	PubMed and Scopus	83	Systematic review
Liu et al.⁷	2025	2019–2023	Deep learning approaches for synthetic data generation	Scopus, Web of Science, PubMed, and IEEE	21	Systematic review
Rujas et al.⁸	2024	2014–2024	Synthetic data generation methods: Domains, motivations, and future applications	PubMed, Scopus, and Web of Science	42	Scoping review
Goyal et al.⁹	2024	2007–2024	Synthetic Data Generation Techniques Using Generative AI	MDPI, IEEE Xplore, Science Direct, Research Gate, NeurIPS, and Arxiv	77	Systematic review
Smolyak et al.¹⁰	2024	-	Large language models and synthetic health data: progress and prospects	-	-	Synthesis of systematic scoping reviews
Kaabachi et al.¹¹	2025	2018–2024	Privacy and utility metrics in medical synthetic data	PubMed and Embase	24	Scoping review
Murtaza et al.⁵	2023	-	Synthetic data generation: State of the art in health care domain	IEEE, PubMed, Springer, and ACM	70	Narrative review

Despite its growing adoption, the thematic landscape of synthetic data research in healthcare remains underexplored. Here, we apply STM to map the research landscape of synthetic data in healthcare, revealing prevalent topics and tracking their evolution over time and across geographic locations. Specifically, this study aims to address this gap by answering two research questions: (i) What are the dominant topics in synthetic data research? (ii) What are the patterns in the thematic and geographical evolution of synthetic data research over time?

Materials and methods

Study design

This study is a systematic bibliometric review of globally published literature on synthetic data in healthcare, covering the period 1 January 2000 to 31 December 2024.

Data collection

We systematically retrieved relevant publications from PubMed using a search strategy that targeted articles containing terms related to synthetic data in the title or abstract: ("Synthetic data"[Title/Abstract] OR "Artificial data"[Title/Abstract] OR "Simulated data"[Title/Abstract] OR "generative data"[Title/Abstract] OR "synthetic patient"[Title/Abstract] OR "synthetic health"[Title/Abstract] OR "in-silico data"[Title/Abstract] OR "simulation data"[Title/Abstract] OR "data simulation"[Title/Abstract]). The batch_pubmed_download function was used to retrieve articles in XML format while adhering to PubMed API rate limits.¹⁷ A semiautomated screening process was implemented using a rule-based algorithm to filter potentially relevant articles. Specifically, the algorithm retained articles in which terms related to synthetic or simulated data co-occurred with references to healthcare or biomedical contexts, including clinical, epidemiological, or genomic applications. Following the automated screening, a random sample of 5% of abstracts from the final set of 7533 identified articles was manually reviewed to verify the accuracy of the rule-based approach, confirming that 99.7% aligned with the intended scope. We included English-language articles published between 2000 and 2024 that had an abstract.
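The co-occurrence rule described above can be sketched as follows. This is a minimal Python illustration, not the authors' algorithm: the two term lists and the function name `is_relevant` are hypothetical, since the exact screening rules are not given in the text.

```python
# Hypothetical term lists; the paper does not publish its exact screening rules.
SYNTHETIC_TERMS = [
    "synthetic data", "artificial data", "simulated data",
    "simulation data", "data simulation", "in-silico data",
]
HEALTH_TERMS = ["clinical", "patient", "epidemiolog", "genomic", "disease", "health"]


def _contains_any(text, terms):
    """Return True if any term occurs in text (case-insensitive substring match)."""
    lowered = text.lower()
    return any(term in lowered for term in terms)


def is_relevant(title, abstract):
    """Keep a record only when both term families co-occur in title + abstract."""
    text = f"{title} {abstract}"
    return _contains_any(text, SYNTHETIC_TERMS) and _contains_any(text, HEALTH_TERMS)


# Toy records for illustration only.
records = [
    {"title": "Synthetic data for clinical risk models",
     "abstract": "We generate patient records."},
    {"title": "Simulated data for bridge load testing",
     "abstract": "A civil engineering study."},
    {"title": "Genomic association methods",
     "abstract": "No generated datasets are used."},
]
kept = [r for r in records if is_relevant(r["title"], r["abstract"])]
```

Only the first toy record passes, because it is the only one that mentions both a synthetic-data term and a healthcare term. A manual check of a random sample, as the authors describe, remains necessary because substring rules of this kind can misfire.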

Data processing

Each downloaded XML file was assigned a unique timestamp-based filename to prevent duplication. The downloaded XML files were processed to extract key bibliographic information, including title, abstract, publication year, and first author affiliation. Metadata variables, including publication year and country, were derived from these fields. Countries were categorized into continents (Africa, Asia, Europe, North America, South America, and Oceania). To ensure data integrity, articles were assessed for missing values in key fields and for duplicate titles, and extracted fields were cleaned to remove HTML/XML artifacts.
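The cleaning and de-duplication steps might look like the sketch below. It is an assumption-laden Python illustration rather than the authors' R code: the helper names, the regular expression, and the three-entry country lookup are all hypothetical.

```python
import html
import re

# Matches anything that looks like an XML/HTML tag, e.g. <i> or </sub>.
TAG_RE = re.compile(r"<[^>]+>")


def clean_field(value):
    """Strip XML/HTML tags, decode entities, and collapse whitespace."""
    no_tags = TAG_RE.sub(" ", value)
    decoded = html.unescape(no_tags)
    return " ".join(decoded.split())


def drop_duplicate_titles(records):
    """Keep the first record per cleaned, case-insensitive title.

    Records whose title is empty after cleaning are skipped, which stands in
    for the missing-value check described in the text.
    """
    seen = set()
    unique = []
    for rec in records:
        key = clean_field(rec["title"]).lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


# Hypothetical lookup; a real one would cover every country in the corpus.
COUNTRY_TO_CONTINENT = {
    "United States": "North America",
    "China": "Asia",
    "Germany": "Europe",
}

records = [
    {"title": "Synthetic data", "country": "China"},
    {"title": "synthetic <b>data</b>", "country": "Germany"},
    {"title": "", "country": "United States"},
]
unique = drop_duplicate_titles(records)
continents = [COUNTRY_TO_CONTINENT[r["country"]] for r in unique]
```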

Text preprocessing

Titles and abstracts from the dataset were combined to form a text corpus. The corpus was preprocessed using the tm package in R. This process involved conversion of text to lowercase, removal of punctuation, numbers and English stopwords and stripping of extra whitespace.¹⁸ Thereafter, we stemmed all words using the Porter stemming algorithm.¹⁹ A Document-Term Matrix was then created, transformed into a matrix format, and transposed to generate a term matrix.
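For readers who want to see the shape of this pipeline, here is a standard-library Python sketch. It is not the tm workflow itself: the stopword list is a tiny illustrative subset, and Porter stemming is omitted because the Python standard library has no stemmer.

```python
import string
from collections import Counter

# Tiny illustrative stopword list; tm's English list is much longer.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "for", "with", "is", "are", "we"}

# Translation table that deletes punctuation and digits.
_STRIP = str.maketrans("", "", string.punctuation + string.digits)


def preprocess(text):
    """Lowercase, drop punctuation and digits, remove stopwords."""
    tokens = text.lower().translate(_STRIP).split()  # split() collapses whitespace
    return [tok for tok in tokens if tok not in STOPWORDS]


def document_term_matrix(docs):
    """Return (vocabulary, rows) where rows[i][j] counts term j in document i."""
    tokenised = [preprocess(doc) for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    index = {term: j for j, term in enumerate(vocab)}
    rows = []
    for toks in tokenised:
        row = [0] * len(vocab)
        for term, count in Counter(toks).items():
            row[index[term]] = count
        rows.append(row)
    return vocab, rows


docs = [
    "Synthetic data for 2 clinical trials: the data are simulated.",
    "Simulated   EEG signals and synthetic data.",
]
vocab, dtm = document_term_matrix(docs)
# Transposing gives the term-by-document layout mentioned in the text.
tdm = [list(col) for col in zip(*dtm)]
```

Because punctuation is deleted rather than replaced by a space, hyphenated words are fused into single tokens in this sketch. That behavior resembles the fused stems such as "singlecel" and "patientreport" reported in Table 2, although whether the same mechanism produced them is an assumption.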

Topic modeling

Structural topic modeling is an advanced probabilistic unsupervised machine learning topic modeling technique designed to uncover latent themes within a collection of textual documents by incorporating document-level metadata.^12–14 Preprocessed text data were formatted for STM analysis. Documents were structured into a list format compatible with the stm package, and the vocabulary was extracted from the term matrix.

To determine the optimal number of topics, a search was performed using searchK(), which evaluates model fit using held-out likelihood.¹⁸ A range of topic numbers (K = 5, 10, 15, 20, 25, 30) was tested, incorporating year and continent as prevalence covariates. The optimal number of topics (K) was selected by identifying the “elbow point” in the likelihood plot, balancing semantic coherence and exclusivity, and evaluating the interpretability of the resulting topics²⁰ (Figure S1). A final STM model was fitted with K = 10 topics, using year and continent as prevalence covariates. The Expectation–Maximization algorithm was run for 150 iterations, initialized with an LDA approach.²¹
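An elbow point is usually read off a plot by eye, but the idea can be sketched numerically. The Python function below uses one common heuristic (the largest fall in per-step gain); it is not part of the stm package and not necessarily how the authors located the elbow. The candidate grid comes from the text, whereas the held-out likelihood values are invented purely to show a bend and are not the study's results.

```python
def elbow_k(ks, scores):
    """Pick the K after which the gain in score slows down most sharply.

    Heuristic: compute the slope between neighbouring candidates, then choose
    the interior K where the slope falls the most. Illustrative only.
    """
    if len(ks) != len(scores) or len(ks) < 3:
        raise ValueError("need at least three (K, score) pairs of equal length")
    # Gain per unit K between neighbouring candidates.
    slopes = [
        (scores[i + 1] - scores[i]) / (ks[i + 1] - ks[i])
        for i in range(len(ks) - 1)
    ]
    # How much the slope falls when passing each interior K.
    drops = [slopes[i] - slopes[i + 1] for i in range(len(slopes) - 1)]
    best = max(range(len(drops)), key=lambda i: drops[i])
    return ks[best + 1]


# Candidate grid from the text; likelihood values are made up for illustration.
ks = [5, 10, 15, 20, 25, 30]
heldout = [-7.60, -7.30, -7.25, -7.22, -7.20, -7.19]
chosen = elbow_k(ks, heldout)
```

In practice a numeric elbow is only a starting point. As the text notes, the final choice also weighs semantic coherence, exclusivity, and whether the resulting topics are interpretable.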

The top terms associated with each topic were extracted using the labelTopics() function. Based on interpretability, topics were assigned descriptive labels reflecting their thematic content. Word clouds were also created for each topic, highlighting the most frequently associated words.²¹ The cloud() function from the stm package was used, with word sizes scaled to reflect their relative importance within each topic. Topics were then grouped into broader thematic categories according to their primary orientation, distinguishing between domain-specific applications, translational and clinical research contexts, methodological innovations, and biological sciences.

To assess temporal trends in synthetic data research, we computed the annual count of articles for each continent and standardized the data to ensure a complete set of year-continent combinations. Missing values were replaced with zero to maintain data consistency. We then calculated the annual proportion of articles per continent to facilitate a comparative analysis. We used a 100% stacked area chart to illustrate the temporal distribution of synthetic data research by continent. To examine how research topics evolved over time, we linked topic prevalence data with publication years and generated a second 100% stacked area chart to depict changes in topic proportions across years.
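The counting, zero-filling, and normalization steps can be illustrated with a short Python sketch. The function name and the toy records are hypothetical, and the original analysis was carried out in R.

```python
from collections import Counter

CONTINENTS = ["Africa", "Asia", "Europe", "North America", "Oceania", "South America"]


def annual_shares(records, years):
    """Return {year: {continent: share}} with zero-filled missing combinations."""
    counts = Counter(records)  # keys are (year, continent) pairs
    shares = {}
    for year in years:
        total = sum(counts[(year, c)] for c in CONTINENTS)
        shares[year] = {
            # Counter returns 0 for unseen (year, continent) pairs.
            c: (counts[(year, c)] / total if total else 0.0)
            for c in CONTINENTS
        }
    return shares


# Toy (year, continent) pairs, one per article; not the study's data.
records = [
    (2000, "North America"), (2000, "North America"), (2000, "Europe"),
    (2001, "Asia"), (2001, "Europe"), (2001, "Europe"), (2001, "North America"),
]
shares = annual_shares(records, years=[2000, 2001, 2002])
```

Each year's shares sum to 1 whenever that year has at least one article, which is the property a 100% stacked area chart relies on. A year with no articles, such as 2002 in the toy data, yields all zeros rather than a division error.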

We assessed relationships between topics using correlation-based network analysis.^18,21 A topic correlation matrix was derived from the STM model, with edges retained for correlations above a predefined threshold (0.1). An adjacency matrix was created to construct a graph-based network, which was visualized using the igraph and ggraph packages. To improve readability, we scaled node sizes based on topic prevalence, colored nodes according to their thematic classification, and weighted edges based on correlation strength. Additionally, a force-directed layout was employed to enhance network interpretability, and labels were added to identify key topics.
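The thresholding step that turns a correlation matrix into a network can be sketched as below. This Python version is illustrative only: the 0.1 cut-off is taken from the text, while the three-topic matrix and its labels are hypothetical, and the sketch assumes that only positive correlations above the cut-off become edges.

```python
THRESHOLD = 0.1  # cut-off stated in the text


def correlation_edges(corr, labels, threshold=THRESHOLD):
    """Return (adjacency, edges), keeping only correlations above the threshold."""
    n = len(corr)
    adjacency = [[0] * n for _ in range(n)]
    edges = []
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only; skip the diagonal
            if corr[i][j] > threshold:
                adjacency[i][j] = adjacency[j][i] = 1
                # Keep the correlation so it can be used as an edge weight.
                edges.append((labels[i], labels[j], corr[i][j]))
    return adjacency, edges


# Toy symmetric correlation matrix for three hypothetical topics.
labels = ["Topic A", "Topic B", "Topic C"]
corr = [
    [1.00, 0.25, -0.30],
    [0.25, 1.00, 0.05],
    [-0.30, 0.05, 1.00],
]
adjacency, edges = correlation_edges(corr, labels)
```

The resulting edge list, with correlations as weights, is the kind of input a graph library would need to draw a force-directed layout such as the one described above.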

Descriptive analysis using percentages and topic modeling were performed using R version 4.4.1 (R Foundation for Statistical Computing, Vienna, Austria).

Ethical approval

Ethical approval was not required for this study as it did not involve human or animal participants.

Results

During the review period, the search criterion yielded 20,011 articles. Of these, 25 articles without abstracts were excluded. An additional 12,437 articles were excluded following rule-based screening resulting in 7533 articles retained for further analysis (Figure 1). The number of publications analyzed exhibited a fluctuating upward trajectory, increasing approximately 20-fold from 43 articles in 2000 to 863 articles in 2024 (Figure 2).

Figure 1.

Flow diagram of included studies for synthetic data in healthcare, 2000–2024.

Figure 2.

Number of articles related to synthetic data in healthcare between 2000 and 2024.

Geographical patterns

Overall, North America (3625 [48.1%]) and Europe (2396 [31.8%]) were the primary contributors to the research, followed by Asia (1293 [17.2%]). In contrast, Oceania (112 [1.5%]), South America (74 [1.0%]), and Africa (33 [0.4%]) had the lowest contributions. Although North America contributed the most, its share declined from 58.1% in 2000 to 40.8% in 2024. Europe experienced a marginal drop, decreasing from 34.9% to 32.0% over the same period. Conversely, Asia experienced a steady rise, increasing from 4.7% in 2000 to 24.1% in 2024. Oceania and South America demonstrated modest contributions with minor fluctuations between years, ranging from 0.0% to 2.8% and 0.0% to 1.9%, respectively, between 2000 and 2024. Africa's contribution remained low and was minimal before 2018 but has shown steady growth since, peaking in 2019 (∼1.3% of global output) (Figure 3). At the country level, the top five contributors to the research were the United States of America (3388 [45.0%]), China (649 [8.6%]), the United Kingdom (562 [7.5%]), Germany (442 [5.9%]), and France (285 [3.8%]).

Figure 3.

Annual proportion of synthetic data research in healthcare by continent (2000–2024).

Topic identification and prevalence

Table 2 presents the identified topics along with key terms for each metric, as well as the thematic areas encompassing the ten topics. Figure 4 presents word clouds visualizing the most frequent words for each topic, ranked by the FREX criterion. Word size corresponds to each word's relative importance within its topic. Among these, Topic 2: Bayesian Modeling & Inference (15.1%), Topic 6: Statistical Bias & Missing Data (13.0%), Topic 8: Gene Expression & Transcriptomics (12.0%), Topic 9: Genetic Association Studies & Genomics (11.9%), and Topic 10: Medical Image Reconstruction (11.4%) emerged as the most prominent topics in synthetic data research in healthcare. In 2000, the most important topics were Topic 6: Statistical Bias & Missing Data (21.9%), Topic 2: Bayesian Modeling & Inference (17.9%), and Topic 7: Neuroimaging & Signal Processing (14.4%). By 2024, the focus had shifted to Topic 4: Synthetic Data Generation (23.0%), Topic 3: Disease Modeling and Public Health (14.3%), and Topic 2: Bayesian Modeling & Inference (10.8%), reflecting an evolving research landscape (Figure 5).

Figure 4.

Word cloud figure for the ten topics in synthetic data research in healthcare, 2000–2024.

Figure 5.

Temporal patterns in synthetic data research in healthcare topics, 2000–2024.

Table 2.

Themes, topic labels, and key terms in synthetic data research for healthcare.

Theme	Topic	Keywords per metric
Biomedical Imaging & Signal Processing	Topic 7: Neuroimaging & Signal Processing	Highest Prob: signal, use, data, function, brain, method, dynam
		FREX: wavelet, atrial, epilepsi, oscil, ica, aif, dcemri
		Lift: costfunct, keti, eyeblink, galiana, rey, otoacoust, sfoae
		Score: eeg, brain, fmri, dynam, signal, aif, wavelet
	Topic 10: Medical Image Reconstruction	Highest Prob: imag, use, method, data, reconstruct, simul, algorithm
		FREX: pet, mmlmath, deform, ultrasound, cbct, slice, spect
		Lift: beamform, axial, xmlnsmmlhttpwwwworgmathmathml, deform, jacobian, oneto, bspm
		Score: imag, phantom, reconstruct, motion, pet, deform, mri
Synthetic Data Applications in Biomedical Research	Topic 1: Pharmacokinetics & Drug Modeling	Highest Prob: simul, dose, use, model, concentr, data, studi
		FREX: pharmacokinet, dock, irradi, ligand, pbpk, charg, neutron
		Lift: octam, mgkg, doxorubicin, clomipramin, srtm, flush, crystallin
		Score: dose, pharmacokinet, concentr, water, bind, beam, pbpk
	Topic 3: Disease Modeling & Public Health	Highest Prob: data, health, use, research, simul, develop, infect
		FREX: infect, epidem, covid, outbreak, infecti, pandem, influenza
		Lift: memphi, tennesse, citywid, mpcss, lectur, mph, academia
		Score: health, infect, epidem, outbreak, covid, privaci, healthcar
	Topic 5: Clinical Trials & Risk Assessment	Highest Prob: patient, treatment, use, clinic, risk, outcom, trial
		FREX: mortal, hazard, item, pregnanc, nhane, patientreport, tdm
		Lift: timerel, timesincetermin, gsd, silesia, voivodship, pregabalin, qtc
		Score: trial, risk, surviv, outcom, hazard, mortal, exposur
Computational & Statistical Methods	Topic 2: Bayesian Modeling & Inference	Highest Prob: model, data, use, approach, method, estim, propos
		FREX: bayesian, mcmc, markov, mixtur, illustr, latent, uncertainti
		Lift: greybox, ivgtt, sdes, gee, decisiontheoret, extendeda, extendedb
		Score: model, bayesian, estim, infer, paramet, causal, regress
	Topic 4: Synthetic Data Generation & Machine Learning	Highest Prob: data, synthet, learn, generat, network, model, train
		FREX: adversari, gan, fault, autoencod, svm, pretrain, cnns
		Lift: oneclass, fecg, sobi, neurologist, mpiann, subtask, upperlimb
		Score: train, learn, imag, network, neural, deep, synthet
	Topic 6: Statistical Bias & Missing Data	Highest Prob: estim, data, method, use, simul, studi, test
		FREX: misclassif, rehydr, imput, bias, true, miss, confound
		Lift: productmo, inestim, etoposid, elisa, lambda, lcd, choleski
		Score: bias, estim, imput, rehydr, confound, error, miss
Genomics and Molecular Biology	Topic 8: Gene Expression & Transcriptomics	Highest Prob: gene, data, express, method, sequenc, network, use
		FREX: express, microarray, transcript, singlecel, phylogenet, rnaseq, transcriptom
		Lift: splice, microbiom, multiom, betweenclust, gep, rand, wellisol
		Score: gene, express, genom, microarray, singlecel, rnaseq, transcript
	Topic 9: Genetic Association Studies & Genomics	Highest Prob: associ, genet, data, method, diseas, use, studi
		FREX: trait, snps, genotyp, linkag, haplotyp, snp, gwas
		Lift: evolutionarili, diplotyp, diplotypebas, genotyp, lmax, loglmaxlmax, mthfr
		Score: genet, snps, trait, haplotyp, gwas, genotyp, gene

Thematic groups and temporal dynamics

The identified topics were grouped into four thematic areas:

Theme 1: Biomedical imaging and signal processing

This theme encompasses two topics: Topic 7: Neuroimaging & Signal Processing, and Topic 10: Medical Image Reconstruction (Table 2), collectively representing 21.1% of all evaluated publications. Topic 7 accounted for 9.7%, while Topic 10 contributed 11.4%. Over time, Topic 7 declined from 14.4% (2000) to 6.5% (2024), whereas Topic 10 showed fluctuating interest, dipping to a low of 6.6% in 2005 and peaking at 15.0% in 2006 (Figure 5). While Topic 7 has been grouped under Biomedical Imaging due to its application area, it also shares methodological commonality with the Computational & Statistical Methods theme.

Theme 2: Synthetic data applications in biomedical research

Comprising Topic 1 (Pharmacokinetics & Drug Modeling), Topic 3 (Disease Modeling & Public Health), and Topic 5 (Clinical Trials & Risk Assessment), the Synthetic Data Applications in Biomedical Research theme (Table 2) represented 20.7% of the total publications. Topics 1, 3, and 5 individually contributed 7.8%, 6.3%, and 6.6%, respectively. These topics had differing temporal trajectories: Topic 1 had a marginal decline from 11.0% in 2000 to 7.8% in 2024, while Topic 3 grew from 1.7% to 14.3% over the same time frame. Topic 5, by contrast, fluctuated between 3.0% (2001) and 9.0% (2024) (Figure 5).

Theme 3: Computational and statistical methods

Comprising Topics 2 (Bayesian Modeling & Inference), 4 (Synthetic Data Generation), and 6 (Statistical Bias & Missing Data) (Table 2), the Computational & Statistical Methods theme accounted for a significant 34.3% of the research output. With individual contributions of 15.1%, 6.2%, and 13.0%, respectively, these topics exhibited contrasting temporal patterns. While Topic 2 declined from 23.1% to 10.8%, and Topic 6 from 21.9% to 7.1%, Topic 4 had an upward trend, rising from 2.7% in 2000 to become the leading topic in 2024, accounting for 23.0% of research topics (Figure 5).

Theme 4: Genomics and molecular biology

Comprising two topics, 8 (Gene Expression & Transcriptomics) and 9 (Genetic Association Studies), the Genomics and Molecular Biology theme (Table 2) accounted for 23.9% of the analyzed research. The temporal trends within this theme were varied. Topic 8 was lowest at 2.6% in 2000, rose to a peak of 18.2% in 2013, and then gradually declined to 8.6% by 2024. In contrast, Topic 9 was dominant early on, peaking at 36.5% in 2001, but has since declined sharply to just 3.1% in 2024 (Figure 5).

Topic co-occurrence and correlations

The interconnections between different research topics in synthetic data research in healthcare are shown in the topic network diagram (Figure 6). The network is anchored by a strong Computational & Statistical Methods core, dominated by the close integration of “Bayesian Modeling” (Topic 2) and “Synthetic Data Generation” (Topic 4). This analytical foundation directly supports “Pharmacokinetics & Drug Modeling” (Topic 1), a central application hub in the quantitative domain. In parallel, “Medical Image Reconstruction” (Topic 10) functions as a key translational hub, linking “Neuroimaging” (Topic 7) with both clinical practice and molecular biology, particularly through its strong ties to “Genetic Association Studies & Genomics” (Topic 9). These hubs converge on “Clinical Trials & Risk Assessment” (Topic 5), which serves as the central clinical anchor connecting imaging, genomics (“Gene Expression & Transcriptomics,” Topic 8), and “Disease Modeling & Public Health” (Topic 3). More peripheral but still important are topics such as Topic 8, Topic 3, and Topic 6 (“Statistical Bias & Missing Data”), which reinforce applied and methodological depth. Overall, the landscape is defined by highly central translational hubs (Topics 10, 1, and 5) built upon a robust computational–statistical foundation (Topics 2 and 4), enabling integration across clinical, genomic, and methodological domains.

Figure 6.

Topic network in synthetic data research in healthcare, 2000–2024.

Discussion

We mapped 25 years of synthetic data research in healthcare using an STM approach. The research output grew twentyfold, from 43 articles in 2000 to 863 in 2024, reflecting a rising interest in synthetic data applications in healthcare. Furthermore, we observed substantial changes in the geographic distribution of research activity, despite a persistent global North-South disparity, and a dynamic evolution of key research topics. Specifically, initially prominent topics including “Bayesian Modeling,” “Neuroimaging,” and “Statistical Bias & Missing Data” gradually decreased as the research focus shifted to “Synthetic Data Generation” and “Disease Modeling and Public Health” in 2024, with “Bayesian Modeling” still ranking third despite its reduction.

Geographical landscape

Geographically, North America and Europe were initially the dominant contributors, but Asia's contribution increased substantially over time. Overall, Oceania, South America, and Africa contributed minimally (<2% each). The observed geographic patterns reflect both regional dominance and disparity in synthetic data research in healthcare. North America and Europe's historical leadership stems from established institutions, robust funding, strong industry-academia collaborations, and mature regulatory frameworks. Conversely, Asia's rising contribution, particularly from China, where it has been driven largely by state-led investment in research (2.4% of GDP in 2022) prioritizing AI and healthcare, signals a shifting research landscape.²² The persistent global North-South disparity^23–25 is exacerbated by limited funding, infrastructure deficits, brain drain, and weak regulatory frameworks in Africa, South America, and Southeast Asia.

Thematic landscape and temporal dynamics

The thematic contributions and temporal dynamics in synthetic data research reflect evolving technological priorities, healthcare demands, and methodological advancements. The Biomedical Imaging & Signal Processing theme, which represents approximately a fifth of the research output and was once dominant, has seen gradual declines in subtopics like “Neuroimaging.” This trend may stem from the maturation of imaging technologies, where foundational innovations in computed tomography, magnetic resonance imaging, and signal processing have already been integrated into clinical workflows,^26,27 which may reduce the urgency for novel breakthroughs. Additionally, the rise of computationally intensive methods in other areas, such as synthetic data generation, may have diverted focus from imaging and signal processing.

The Computational & Statistical Methods theme, the most dominant, representing slightly more than a third of the research output, illustrates a pivotal shift. The surge in “Synthetic Data Generation” overshadowed the declines in “Bayesian Modeling & Inference” and “Statistical Bias & Missing Data.” This shows the growing demand for scalable, AI-driven solutions in research and healthcare. This is especially true in contexts where synthetic data addresses challenges such as data privacy, scarcity of labeled datasets, and the need to train robust machine learning models.⁴ Bayesian methods, while robust, may have waned due to their computational complexity and the preference for faster, data-driven approaches in large-scale applications.²⁸ Moreover, the declining prominence of “Statistical Bias & Missing Data” does not reflect reduced importance, but rather a shift from methodological innovation to maturity and integration into standard research practice.^29,30

The Synthetic Data Applications in Biomedical Research theme, representing one-fifth of the research output, shows contrasting trajectories. “Disease Modeling & Public Health” had an upward trend, reflecting its growing centrality in global research and the need for predictive modeling to inform policy and outbreak management.³¹ This growth has been driven by the increasing burden of infectious and chronic diseases and by heightened urgency during major outbreaks such as SARS, H1N1, Ebola, and COVID-19. The parallel growth of computational power and large-scale health data has also enabled more sophisticated models for accurate forecasting, intervention planning, and informing public health policy. Conversely, “Clinical Trials & Risk Assessment” declined, possibly due to increasing regulatory scrutiny and the difficulty of translating synthetic data into accepted clinical endpoints. In parallel, the decline of “Pharmacokinetics & Drug Modeling” likely reflects its transition into a mature and highly standardized discipline within drug development. As pharmacokinetic methods and software became routine, the need for extensive methodological publication decreased, while research focus broadened toward biologics, complex therapies, and precision medicine, often emphasizing delivery systems or integrative approaches.³² At the same time, the rapid rise of computational fields such as Bayesian modeling, synthetic data generation, and machine learning has absorbed much of the methodological innovation once attributed to the topic, redistributing publications under these broader computational themes.

The Genomics & Molecular Biology theme, representing slightly more than one-fifth of the research output, reveals subfield variability. Synthetic genomic data address data scarcity and privacy concerns by enabling researchers to share datasets that mimic real sequences without exposing individual identities. This facilitates studies of rare variants and diseases, modeling of evolutionary dynamics, and validation of algorithms.^33–36 It supports hypothesis testing via evolutionary simulations, allows for controlled benchmarking and bias mitigation, and facilitates large-scale population genomics and high-risk climate adaptation modeling. Furthermore, synthetic data democratize genomic research by providing usable datasets for resource-limited settings and addressing ethical concerns in indigenous communities, promoting equitable access and collaboration in the field. While synthetic genomic data are increasingly utilized, concerns exist regarding their fidelity and ability to capture complex genomic phenomena such as epistasis and epigenetic regulation.^37,38 Additionally, establishing statistical validation standards is crucial to ensure the trustworthiness of synthetic genomic datasets. Temporally, “Gene Expression & Transcriptomics” grew, possibly fueled by falling sequencing costs and the rise of precision medicine. By contrast, “Genetic Association Studies” declined, potentially due to saturation in identifying common genetic variants and a shifting focus toward functional genomics.³⁹

Topic co-occurrence and correlations

The network structure highlights the growing centrality of computational and statistical methods, showing how advances in Bayesian modeling and synthetic data generation now form the analytical backbone of biomedical research. This methodological foundation is not only driving innovation but also enabling the shift toward multimodal data integration, particularly imaging and genomics, into more holistic models of disease and treatment. The translational role of medical imaging and genomics, converging on clinical trials and risk assessment, emphasizes the importance of aligning methodological sophistication with clinical relevance to directly inform patient care and policy. At the same time, peripheral but strategically important areas such as bias correction and handling of missing data remain critical for ensuring validity and reproducibility in an era of increasingly complex datasets. The observed structure reflects a field in transition, where traditional application areas like pharmacokinetics are becoming embedded within broader computational frameworks, signaling the need for domain expertise to evolve in tandem with advanced data science skills. Collectively, these dynamics point to the importance of cross-disciplinary training and collaboration as essential enablers of the next generation of biomedical research.

Policy implications

Our findings carry several practical and policy implications for fostering innovation, advancing equity, and addressing global health challenges. At the global level, closing the persistent North–South disparity requires targeted investments in infrastructure, training, and local talent retention in underrepresented regions. This should go hand-in-hand with equitable international partnerships that support technology transfer and collaborative research in these settings. Additionally, policymakers in emerging regions may draw lessons from Asia's rapid rise, particularly China's strategic state-led investment in AI and healthcare, as a model for catalyzing local research growth. Strategic global investment is also essential, with resources directed toward emerging areas such as AI-driven data generation, disease modeling, and genomics. In parallel, there is a need to ensure that more established or hard-to-regulate domains, such as imaging and clinical trials, are not neglected.

At the research and regulatory level, fostering interdisciplinarity is key. Institutions and funders should encourage cross-cutting research collaborations that connect computational methods with applied fields such as neuroimaging and disease modeling, as well as foundational sciences such as genomics. To support this, policymakers should work on establishing agile regulatory frameworks that uphold fidelity, privacy, and generalizability. This is particularly important for sensitive applications in clinical trials and genomic data to build public trust while enabling responsible innovation. Ethical governance will be equally crucial to ensure equitable access, prevent misuse, and mitigate risks of bias in synthetic data.

At the sectoral level, several themes demand attention. The rise of AI-driven synthetic data generation highlights the need for scalable, privacy-preserving solutions to meet the demands of modern machine learning. Growth in disease modeling underscores synthetic data's potential for pandemic preparedness and public health policy. Similarly, advances in synthetic genomics offer opportunities to democratize research, enable studies of rare variants, and drive precision medicine, provided standards for fidelity and validation are rigorously enforced.

Research gaps

Despite the diversity of themes identified in this analysis of synthetic healthcare data, key gaps remain. These thematic gaps are underpinned by common, cross-disciplinary challenges: technical limitations, ethical considerations, and regulatory uncertainty.

Theme 1: Biomedical imaging and signal processing

Despite advances in neuroimaging and medical image reconstruction, important research gaps remain. In neuroimaging, there is limited progress in developing synthetic data approaches that accurately capture the complexity of brain structure, function, and connectivity, especially for multimodal imaging and rare neurological conditions.^40 Similarly, in medical image reconstruction, while deep learning has enhanced image quality and reduced noise, challenges persist in generalizability across scanners, populations, and acquisition protocols.^41 Furthermore, research is needed to develop standardized digital phantoms and validation frameworks that can be universally adopted to benchmark generative models across different imaging modalities, moving beyond single-institution validation.^42,43 Limitations in molecular modeling, high computational costs, and the lack of standardized imaging benchmarks hinder broader adoption.

Theme 2: Synthetic data applications in biomedical research

Synthetic electronic health records (EHRs), clinical trial simulation, and disease modeling remain underdeveloped, largely due to fragmented data standards, regulatory barriers, and data scarcity.^44 The gap is most pronounced in applications for health economics and outcomes research and in modeling care delivery, including areas such as telemedicine. Personalized medicine beyond genomics is also underrepresented, suggesting a need for specialized tools and cross-institutional data-sharing frameworks.

Theme 3: Computational and statistical methods

Despite rapid methodological advances, key gaps persist. Bayesian approaches in healthcare remain underutilized for handling uncertainty in multimodal and longitudinal clinical data, particularly when data are sparse or incomplete. Similarly, while synthetic data methods are advancing, they often overlook statistical biases and missingness mechanisms inherent in real-world health datasets, limiting their reliability for downstream inference.^45,46 A critical gap lies in developing integrated frameworks that combine Bayesian modeling, principled missing data strategies, and fairness-aware synthetic data generation to ensure both validity and equity in clinical research and policy applications. Additionally, methods for generating complex, multimodal data are still nascent and require further development to address the complexity of clinical data.^47

Theme 4: Genomics and molecular biology

The primary gap here is the narrow focus on genomic sequence data itself. The field needs to expand to address the challenge of generating synthetic data for integrated multiomics and for spatial transcriptomics.^48 Furthermore, there is a significant opportunity to extend synthetic data applications beyond basic research toward personalized medicine and drug discovery. These areas face hurdles due to the complexity of molecular modeling and a lack of suitable synthetic data for preclinical and clinical stages.

Limitations

This study's methodology is subject to some limitations. The search strategy, restricted to specific keywords in titles/abstracts and reliant on PubMed alone, introduces potential selection bias by excluding relevant literature that uses alternative terminology or is indexed in other databases. Although the semiautomated screening approach enhanced efficiency and consistency, its reliance on predefined keyword rules means that some studies only tangentially related to synthetic data may have been included. This could introduce minor noise and slightly reduce topic precision. Geographic classification based solely on first-author affiliation may misrepresent multinational collaborations and regional contributions. While language bias was minimized via English titles/abstracts, exclusion of non-English publications is still possible. Future analyses should mitigate these issues through expanded search terms and data sources, expert validation, and refined geographic classification methods to enhance robustness and representativeness.

Conclusion

Through STM analysis of 25 years of synthetic data research in healthcare, this study reveals significant growth and evolution of topics. Research output has increased nearly twentyfold, accompanied by a shift in geographic contributions, with Asia's presence growing alongside the historical dominance of North America and Europe, although a persistent Global North–South disparity remains. Thematic foci have evolved, with increased emphasis on “Synthetic Data Generation” and “Disease Modeling and Public Health.” The topic network also shows computational and statistical methods becoming increasingly central, linking imaging, genomics, and applied domains. Despite these advances, key gaps exist in areas such as drug discovery, synthetic EHRs, personalized medicine beyond genomics, telemedicine, mental health applications, health economics, and ethical AI. These highlight the need for cross-disciplinary collaborations, bias mitigation strategies, and equitable partnerships to fully realize the potential of synthetic data in healthcare.

Supplemental Material

Supplemental material, sj-docx-1-dhj-10.1177_20552076251404530 for A quarter-century of synthetic data in healthcare: Unveiling trends with structural topic modeling by Billy Ogwel, Vincent H Mzazi, Alex O Awuor, Gabriel Otieno, Sidney Ogolla, Bryan O Nyawanda and Richard Omore in DIGITAL HEALTH

Footnotes

ORCID iD

Billy Ogwel

Contributorship

BO conceived the study. BO, VM, BON, AOA, GO, and RO contributed to study design and implementation. BO and BON analyzed and interpreted the data. BO drafted the manuscript, and all authors critically reviewed the manuscript for intellectual content and approved the final manuscript.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Availability of data and materials

The authors retrieved publicly available metadata from PubMed using a structured search for “synthetic data,” “artificial data,” and “simulated data” in titles and abstracts. The search strategy is detailed in the “Methods” section. The programming code for R is available on GitHub:

Disclosure

The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Kenya Medical Research Institute or partnering institutions.

Use of artificial intelligence (AI) tools

The authors would like to acknowledge the use of AI technology (Gemini and Deepseek) for grammar checking and proofreading of this manuscript.

Supplemental material

Supplemental material for this article is available online.

References

1. Institute of Medicine (US) Roundtable on Value & Science-Driven Health Care. Healthcare data as a public good: Privacy and security. In: Clinical Data as the Basic Staple of Health Learning: Creating and Protecting a Public Good: Workshop Summary [Internet]. US: National Academies Press, 2010. 5 [cited 2025 Feb 23]. https://www.ncbi.nlm.nih.gov/books/NBK54293/

2. Cong. Exploring barriers and ethical challenges to medical data sharing: perspectives from Chinese researchers. BMC Med Ethics 2024; 25: 132.

3. Gonzales, Guruswamy, Smith. Synthetic data in health care: a narrative review. PLOS Digit Health 2023; 2: e0000082.

4. Giuffrè, Shung. Harnessing the power of synthetic data in healthcare: innovation, application, and privacy. npj Digit Med 2023; 6: 1–8.

5. Murtaza, Ahmed, Khan, et al. Synthetic data generation: state of the art in health care domain. Comput Sci Rev 2023; 48: 100546.

6. Pezoulas, Zaridis, Mylona, et al. Synthetic data generation methods in healthcare: a review on open-source tools and methods. Comput Struct Biotechnol J 2024; 23: 2892–2910.

7. Liu, Acharya, Tan. Preserving privacy in healthcare: a systematic review of deep learning approaches for synthetic data generation. Comput Methods Programs Biomed 2025; 260: 108571.

8. Rujas, Herranz, Fico, et al. Synthetic data generation in healthcare: a scoping review of reviews on domains, motivations, and future applications. Int J Med Inform 2025; 195: 105763.

9. Goyal, Mahmoud. A systematic review of synthetic data generation techniques using generative AI. Electronics (Basel) 2024; 13: 3509.

10. Smolyak, Bjarnadóttir, Crowley, et al. Large language models and synthetic health data: progress and prospects. JAMIA Open 2024; 7: ooae114.

11. Kaabachi, Despraz, Meurers, et al. A scoping review of privacy and utility metrics in medical synthetic data. npj Digit Med 2025; 8: 1–9.

12. Wang, Zhang, Zhai. Structural topic model for latent topical structure analysis. 2011 June.

13. Hankar, Kasri, Beni-Hssane. A comprehensive overview of topic modeling: techniques, applications and challenges. Neurocomputing 2025; 628: 129638.

14. Egger. Topic modelling. In: Egger (ed) Applied Data Science in Tourism: Interdisciplinary Approaches, Methodologies, and Applications [Internet]. Cham: Springer International Publishing, 2022 [cited 2025 Feb 24], pp.375–403. 10.1007/978-3-030-88389-8_18

15. Park, Wang, et al. Application of structural topic modeling in a literature review of air transport. J Air Transp Manag 2025; 122: 102708.

16. Kherwa, Bansal. Topic modeling: a comprehensive review. EAI Endorsed Trans Scalable Inf Syst 2019; 7: 159623.

17. Fantini. Retrieving and processing PubMed records using easyPubMed [Internet]. 2019 [cited 2025 Feb 24], https://cran.r-project.org/web/packages/easyPubMed/vignettes/getting_started_with_easyPubMed.html

18. Bauer. 11.1 Lab: Structural topic model | computational social science [Internet]. 2022 [cited 2025 Feb 24], https://bookdown.org/paul/computational_social_science/lab-structural-topic-model.html#data-pre-processing

19. Willett. The Porter stemming algorithm: then and now. Program Electron Libr Inf Syst 2006; 40: 219–223.

20. Weston, Shryock, Light, et al. Selecting the number and labels of topics in topic modeling: a tutorial. Adv Methods Pract Psychol Sci 2023; 6: 25152459231160105.

21. Roberts, Stewart, Tingley. stm: an R package for structural topic models. J Stat Soft [Internet] 2019 [cited 2025 Feb 24]; 91: 1–40. http://www.jstatsoft.org/v91/i02/

22. Zhou, Dahal. Has R&D contributed to productivity growth in China? The role of basic, applied and experimental R&D. China Econ Rev 2024; 88: 102281.

23. Bain, Adeagbo, Avoka, et al. Identifying the conundrums of “global health” in the Global North and Global South: a case for Sub-Saharan Africa. Front Public Health [Internet] 2024 [cited 2025 Mar 1]; 12: 1168505. https://www.frontiersin.org/journals/public-health/articles/10.3389/fpubh.2024.1168505/full

24. Reidpath, Allotey. The problem of ‘trickle-down science’ from the Global North to the Global South. BMJ Glob Health [Internet] 2019 [cited 2025 Mar 1]; 4. https://gh.bmj.com/content/4/4/e001719

25. Abouzeid, Muthanna, Nuwayhid, et al. Barriers to sustainable health research leadership in the Global South: time for a Grand Bargain on localization of research leadership? Health Res Policy Syst 2022; 20: 136.

26. Pinto-Coelho. How artificial intelligence is shaping medical imaging technology: a survey of innovations and applications. Bioengineering (Basel) 2023; 10: 1435.

27. Rong, Liu. Advances in medical imaging techniques. BMC Methods 2024; 1: 10.

28. Mejia-Aguilar, Gutiérrez Prada, Portilla, et al. State-of-the-art review advantages and limitations of Bayesian approaches to decision-making in construction management: a critical review (1988–2023). ASCE-ASME J Risk Uncertainty Eng Syst Part A Civil Eng 2024; 10: 03124005.

29. Mesquita, Perfeito, Paolotti, et al. Epidemiological methods in transition: minimizing biases in classical and digital approaches. PLOS Digit Health 2025; 4: e0000670.

30. Pedersen, Mikkelsen, Cronin-Fenton, et al. Missing data and multiple imputation in clinical epidemiological research. Clin Epidemiol 2017; 9: 157–166.

31. Martin-Moreno, Alegre-Martinez, Martin-Gorgojo, et al. Predictive models for forecasting public health scenarios: practical experiences applied during the first wave of the COVID-19 pandemic. Int J Environ Res Public Health 2022; 19: 5546.

32. Rowland, Lesko, Rostami-Hodjegan. Physiologically based pharmacokinetics is impacting drug development and regulatory decision making. CPT Pharmacometr Syst Pharmacol 2015; 4: 313–315.

33. Bonomi, Huang, Ohno-Machado. Privacy challenges and research opportunities for genomic data sharing. Nat Genet 2020; 52: 646–654.

34. Oprisanu, Ganev, De Cristofaro. On utility and privacy in synthetic genomic data. In Proceedings 2022 Network and Distributed System Security Symposium [Internet], San Diego, CA, USA: Internet Society, 2022 [cited 2025 Mar 1], pp.1–17. https://www.ndss-symposium.org/wp-content/uploads/2022-92-paper.pdf

35. Zhou, Shu. Synthetic genomics in crop breeding: evidence, opportunities and challenges. Crop Design 2025; 4: 100090.

36. Alharbi, Rashid. A review of deep learning applications in human genomics using next-generation sequencing data. Hum Genomics 2022; 16: 26.

37. Wharrie, Yang, Raj, et al. HAPNEST: efficient, large-scale generation and evaluation of synthetic datasets for genotypes and phenotypes. Bioinformatics 2023; 39: btad535.

38. Jacobs, D’Amico, Benvenuti, et al. Opportunities and challenges of synthetic data generation in oncology. JCO Clin Cancer Inform 2023; 7: e2300045.

39. George SHL, Medina-Rivera, Idaghdour, et al. Increasing diversity of functional genetics studies to advance biological discovery and human health. Am J Hum Genet 2023; 110: 1996–2002.

40. Peters, Vansina, de Weerd, et al. Multimodal imaging of brain plasticity. In: Barbey (ed.) The Oxford Handbook of Cognitive Enhancement and Brain Plasticity [Internet]. Great Clarendon Street, Oxford, OX2 6DP, UK: Oxford University Press, 2024 [cited 2025 Oct 3]. 10.1093/oxfordhb/9780197677131.013.20.

41. Tran, Zeevi, Payabvash. Strategies to improve the robustness and generalizability of deep learning segmentation and classification in neuroimaging. Biomed Inform 2025; 5: 20.

42. Wegner, Schmiech, Sobirey, et al. Requirement analysis in medical phantom development: a survey tool approach with an illustrative example of a multimodal deformable pelvic phantom. Front Phys [Internet] 2024 [cited 2025 Sept 2]; 12: 1416601. https://www.frontiersin.org/journals/physics/articles/10.3389/fphy.2024.1416601/full.

43. Kinahan, Chenevert, Malyarenko. Standards, phantoms, and site qualification. 2021 September 30 [cited 2025 Sept 2], https://pubs.aip.org/books/monograph/76/chapter/20671241/Standards-Phantoms-and-Site-Qualification

44. Velocia, Kumar. Synthetic data in AI healthcare research: A review of use cases, benefits, and risks. In 2025.

45. Babu KAR, Mulay, Prabhu, et al. Position paper: Building trust in synthetic data for clinical AI [Internet]. arXiv; 2025 [cited 2025 Sept 2], http://arxiv.org/abs/2502.02076

46. Lautrup, Hyrup, Zimek, et al. Syntheval: a framework for detailed utility and privacy evaluation of tabular synthetic data. Data Min Knowl Disc 2025; 39: 6.

47. Schouten, Nicoletti, Dille, et al. Navigating the landscape of multimodal AI in medicine: a scoping review on technical challenges and clinical applications. Med Image Anal 2025; 105: 103621.

48. Selvarajoo, Maurer-Stroh. Towards multi-omics synthetic data integration. Brief Bioinform 2024; 25: bbae213.
