Abstract
Synthetic data – algorithmically generated data – has been considered a novel solution to the data scarcity issue, and a ‘technical fix’ able to fill the gap in areas where real data is sensitive or biased. Different narratives about the nature of synthetic data as either mirroring or replacing real data, alongside diverse evaluation metrics for measuring the fidelity and utility of such data, have proliferated across the machine learning fairness community, in public policy research, privacy and data protection studies, and critical data scholarship. Yet, there is still no consensus on what constitutes ‘high-quality’ synthetic data. Against this background, I demonstrate how the concept of synthetic data introduces an analogical perspective on data. This perspective is relational and regulative, extending the discussion on data quality to encompass questions of data justice and responsible innovation. It invites critical reflections on the purpose and trade-offs involved in synthetic data generation and use, the social practices and power dynamics that underpin and configure it, and how its direction can be shaped in response to changing real-world circumstances and emerging human values. Building on this analysis, I argue that the generation and use of synthetic data become meaningful only when embedded in responsible AI and data innovation ecosystems that remain responsive to ethical and social concerns and to changing real-world circumstances.
Introduction
The past years have witnessed a ‘data-centric’ shift in the field of artificial intelligence (AI), which emphasises the systematic engineering of data to develop effective AI (Zha et al., 2023). One crucial reason behind this shift lies in the demand for an increasing amount of data at an unprecedented rate, since this volume of data is a key factor in training and advancing AI models at scale, across multiple tasks and various domains. A lack of suitable data has thus become a central bottleneck for developing AI at scale.
But this lack of data is not a new problem. Various terms such as ‘data scarcity’ (Alzubaidi et al., 2023), ‘data shortage’ (Xu, 2022), and ‘data problem’ (Nikolenko, 2021) have been used to name the same underlying issue: data is difficult, time-consuming, and costly to label and obtain, and access to data is sometimes impossible (e.g. sufficient data simply do not exist, for instance, in regions of conflict) or is made difficult or undesirable by legal and socio-ethical constraints (e.g. privacy concerns, legal compliance requirements, and data that exhibit biased distributions).
Against this background, we are witnessing the rise of synthetic data ‒ data that is algorithmically generated ‒ which serves as a novel solution to this data crisis. Synthetic data has already entered different fields of application, and a growing number of companies now offer synthetic data generation as a commercial service.
However, the introduction of synthetic data is not exempt from relevant challenges. As large language models (LLMs) are trained more and more on synthetic data, one phenomenon that may occur is model collapse: the quality and diversity of generative models decrease over generations, which may amplify existing biases in society (Alemohammad et al., 2023). How can we ensure that synthetic data is ‘high-quality’ data for training AI models? To date, there is no consensus on how to define the ‘quality’ of synthetic data, notwithstanding the growing interest in the topic from different fields, for example, the machine learning fairness community, public policy research, privacy and data protection studies, and critical data scholarship.
In this article, my aim is to address this question and the gap in scholarship. While existing critical scholarship on synthetic data has already started to investigate the societal implications that underpin and configure synthetic data, I advance this scholarship by highlighting how synthetic data introduces an analogical perspective on data, which extends discussion on the quality of synthetic data to encompass questions of data justice and responsible innovation. These are important perspectives because they allow us to understand and evaluate synthetic data not in a decontextualised and purely technical manner, but in relation to the social practices, power dynamics, and data ecosystems that shape its generation and use.
The article is organised as follows. In the ‘Synthetic data and real data. On analogies’ section, I introduce and examine the relational and regulative perspective of analogy when defining and evaluating synthetic data, showing its benefits over the different narratives around synthetic data as mirroring or replacing real data that have proliferated in recent years. Rather than treating synthetic data as a mere ‘technical fix’ that can replace and work as real data, the adoption of this analogical perspective has the merit of prompting critical reflections about the quality, purpose, and trade-offs of synthetic data generation and use. In ‘The quality in the synthetic. On metrics’ section, I analyse the question of data quality and evaluation metrics and illustrate a case study of building a semi-synthetic dataset to test for biases introduced by AI systems in recruitment. I demonstrate that synthetic data generation is a data practice that requires context-sensitive evaluations and justifications, and synthesisers are called upon to anticipate, gain knowledge, and critically reflect on the societal implications of models and datasets. Building on this more nuanced analysis, in the ‘Meaningful synthetic data. On responsibility in data ecosystems’ section, I argue that the generation and use of synthetic data become meaningful only when embedded in responsible AI and data innovation ecosystems that are responsive to ethical and social concerns.
Synthetic data and real data. On analogies
Despite the nascent interest in synthetic data, to date, there is no clear consensus on how to define it, and different attempts to capture its conceptual meaning have been put forward, varying across contexts and affecting the transparency and reproducibility in research involving synthetic data generation (Giuffrè and Shung, 2023). But generally speaking, what is synthetic data? Synthetic data is data that has been generated by a model and is designed to reproduce some structural or statistical properties and distributions of real data (Jordon et al., 2022). Even if there is no univocal definition, what is particularly interesting in the conceptions proposed across different fields is the recurring focus on a mirroring criterion: data is synthetic when it is ‘mirroring properties of an original dataset’, when it could serve as a ‘proxy’ for real data (James et al., 2021). Along those lines, synthetic data has been defined as ‘an artificial alternative to real-world data, mimicking and replicating real datasets’ (UK Statistics Authority, 2022), as ‘almost-but-not-quite replica data’ or ‘fake’ data (van Bekkum and Zuiderveen Borgesius, 2023: 12), as ‘a stand-in for real world data’ (Renieris, 2023: 84), whose aim is to ‘mimic real-world data, to look like it, to stand in for it, and to be used in its place’.
The mirroring criterion encompasses one main direction in synthetic data generation, that is, the replication of real data: synthetic data is generated so as to reproduce the statistical properties and distributions of an existing dataset and to stand in for it.
Synthetic data aims to be treated as real data. Under the mirroring criterion, the relation between the two is framed as an analogy of attribution, in which the properties of real data are directly ascribed to synthetic data and the two are treated as homogeneous objects.
Unlike attribution, which ascribes properties from one object to another, the proportional perspective identifies only the relation between them (Reichl, 2023). The analogy of proportion offers a more refined epistemic tool for understanding synthetic data. Applied in this context, it means we do not assert an equation between homogeneous objects, as in the analogy of attribution: Synthetic Data = Real Data. Rather, it serves as a tool for developing knowledge about how the distributions in the synthetic and real datasets relate to one another, according to the analogy of proportion: Synthetic Distributions: Synthetic Dataset = Real Distributions: Real Dataset.
The benefits of adopting this relational perspective of analogy when defining and evaluating synthetic data are several. First, it challenges the assumption that synthetic and real datasets can or should be directly compared as homogeneous. A synthetic dataset needs to share many distributions with the real dataset, but some distributions do not bear on the purpose of generation and need not be shared between the two. A complete match between synthetic and original datasets can be difficult to achieve for privacy reasons, and even undesirable, as in the case of biased distributions. Full statistical similarity, that is, matching the distributions of the synthetic and real datasets, does not necessarily correspond to improved performance, as this depends on the context (Jordon et al., 2022). For example, privacy can trade off against fidelity: excessive similarity between the synthetic dataset and the original one poses information disclosure risks that could lead to re-identification, creating legal uncertainty (e.g. regarding GDPR compliance) (Beduschi, 2024). Framing synthetic data as a mere ‘technical fix’ or substitute for real data thus ignores the ethical and contextual dimensions involved in data generation (Helm et al., 2024; Jacobsen, 2023).
More fundamentally, the relational perspective of analogy plays a distinct theoretical and epistemic role, framing synthetic data generation as a regulative practice – one that enables inquiry and the development of new knowledge (van den Berg, 2018). Rather than offering certainty or asserting similarity between objects, this perspective guides examination and discovery in our experience, even when the characteristics of one object remain unknown (Callanan, 2008). As such, it functions as an epistemic tool that bridges the gap between the known and the unknown, by focusing on relations and not isolated objects and properties (Burles, 2023).
According to the fidelity metrics used to evaluate synthetic data generation, the aim is to approximate the distribution used to generate synthetic data as closely as possible to the (unknown) real data distribution (Jordon et al., 2022: 23). This implies that the quality of synthetic data rests upon a relational and contextually embedded perspective: it is not measured in isolation and decontextualised, but always in relation to a particular use and purpose of real data. Synthetic data takes its value from precisely the same space as the real data, yet not in a descriptive and homogeneous perspective that tends to duplicate the properties of real distributions in an accurate representation, but rather in a relational and regulative perspective that asks how the distributions of the two datasets relate to one another, given a specific purpose.
As Jordon et al. (2022: 15) argue, for synthetic data to be useful, its quality must be judged against the specific task it is meant to support: quality is not an intrinsic property of a dataset, but a task-dependent and purpose-relative one.
The excessive focus on a mirroring criterion obscures the fact that there is another important direction in synthetic data studies, that is, the generation of data for scenarios in which real data are scarce, absent, or impossible to collect (e.g. rare events, edge cases, or counterfactual conditions), where synthetic data is meant to augment and extend, rather than replicate, the real.
The adoption of the relational theory of analogy has the merit of broadening considerations beyond synthetic data as a mere ‘technical fix’ to encompass reflections about the quality of this data, and about the complex systems of science and innovation that are called upon to anticipate, gain knowledge of, and respond to the possible consequences of synthetic data generation and use. Synthetic data generation is not only a data practice that aims to develop knowledge, but also a data practice that invites questions of responsible innovation. Responsible innovation is an approach to science and innovation that aims to realise the alignment of research and innovation activities with beneficial societal goals and needs (Stilgoe et al., 2013). Synthetic data can enable system prototyping, helping practitioners, data scientists, engineers, but also policy-makers to understand the nature of phenomena under analysis, the characteristics and general patterns of real datasets as a whole, and the context in which they are used (Johansson et al., 2023: 10). Yet, those involved must remain aware of the innovation pathways they are shaping and the societal challenges synthetic data may pose. Responsible innovation studies highlight the importance of considering the tensions, governance practices, and accountability models that surround emerging technologies (Stilgoe et al., 2013). Framing synthetic data through the relational analogy perspective allows for a more nuanced analysis of the innovation ecosystems in which data are embedded. Crucially, it calls for scrutiny of the direction of data generation – its purposes, trade-offs, and how it can be shaped in response to changing circumstances and emerging human values.
The quality in the synthetic. On metrics
‘There is no AI without data’ (Gröger, 2021). Data quality is back in the spotlight in the context of building data ecosystems that can cope with the emerging data challenges posed by AI, and this shift in research focus from a model-centric to a ‘data-centric’ approach has led to the introduction of standards, quality frameworks, and strategies for anchoring dataset quality in the context of modelling, evaluation, and use (Mohammed et al., 2025). The question of data quality is thus not unique to synthetic data but applies to machine learning data and datasets more broadly. Yet despite being a widely used term in machine learning studies, defining exactly what is meant by dataset quality can be a surprisingly difficult task, since ‘quality’, like other constructs such as ‘fairness’, is an essentially contested construct, that is, one with multiple, sometimes conflicting, context-dependent theoretical understandings that make it inherently hard to measure and operationalise (Jacobs and Wallach, 2021). In particular, understanding and measuring synthetic data quality is non-trivial, as the quality targets of synthetic datasets can be vague, and current metrics are often insufficient: they capture different aspects of synthetic data and do not allow for granular evaluation or for navigating trade-offs between competing metrics, for example, privacy versus fidelity or fairness versus utility (De Wilde et al., 2024; van Breugel and van der Schaar, 2023).
Metrics have proliferated, but a common trend has emerged in machine learning studies that defines the quality of synthetic data generation along two main categories: the already cited fidelity, that is, how well synthetic data reproduces and preserves key distributions of real data (Jordon et al., 2022); and utility, that is, how well it can be used in real-life scenarios for a given task (Houssiau et al., 2022). Evaluating synthetic data generation from both the fidelity and the utility perspective typically centres on building mathematical or computational guarantees of these categories (Marshall et al., 2023). Machine learning scholars employ different mechanisms to calculate fidelity and utility measures (e.g. propensity scores and classification accuracy) and to investigate how best to use generated synthetic data in real-life scenarios (Dankar and Ibrahim, 2021). The general aim is to provide frameworks that can ‘quantify’ the information and properties to be preserved in synthetic data (Houssiau et al., 2022).
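To make this concrete, the following is a minimal sketch of one such mechanism, the propensity-score measure of fidelity (often called pMSE), assuming two numeric datasets of comparable size; the function name and the toy Gaussian data are illustrative rather than a standard implementation:

```python
# A minimal sketch of the propensity-score fidelity measure (pMSE),
# assuming two numeric datasets of comparable size; the toy Gaussian
# data below are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

def pmse(real: np.ndarray, synthetic: np.ndarray) -> float:
    """Train a classifier to tell real (0) from synthetic (1) records;
    the closer its propensity scores stay to the synthetic share c,
    the harder the datasets are to distinguish (higher fidelity)."""
    X = np.vstack([real, synthetic])
    y = np.concatenate([np.zeros(len(real)), np.ones(len(synthetic))])
    c = len(synthetic) / len(X)  # expected score under indistinguishability
    scores = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]
    return float(np.mean((scores - c) ** 2))  # 0.0 = indistinguishable

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(500, 4))
faithful = rng.normal(0.0, 1.0, size=(500, 4))  # matches the real distribution
shifted = rng.normal(1.5, 1.0, size=(500, 4))   # visibly different distribution
print(pmse(real, faithful))  # close to 0
print(pmse(real, shifted))   # noticeably larger
```

Even in this toy form, the measure is relational in precisely the sense discussed above: fidelity is defined by how the two datasets relate under a given discriminator, not by any intrinsic property of the synthetic data alone.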
In the case of fidelity, for example, the metric responds to a logic of preservation, that is, of identifying the distributions that should be preserved from an original dataset. To preserve the desired analytical value, synthesisers (the individuals generating the synthetic data) work along a spectrum that ranges from preserving a few pre-identified statistics to preserving as many relationships between variables as possible, and these variables can differ in nature: numerical, categorical, socio-demographic, and so on (UN, 2022). The value of synthetic data depends on how complex the system is and how sophisticated the data needs to be, and specific analysis is required on the part of synthesisers to determine which method to use, identify the type of synthetic data that is required, and establish the context within which it will be used (UN, 2022). The generation consists of two steps: modelling the distribution of the variables, and replacing the original values contained in a dataset with values generated from the model (Sallier, 2020). In this process, not every distribution contained in the original dataset is preserved.
But how should one select the order in which variables are synthesised? There is no known standard procedure for selecting the order of the variables, and subject matter expertise may be important for informing such choices (UN, 2022). For example, experts have proposed placing education before income when modelling statistical information, since it is more plausible that a person's income depends on their education than the reverse, and an order of synthesis that follows such substantive dependencies tends to better preserve the relationships between variables (UN, 2022).
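As an illustration of the two-step process and of this ordering choice, here is a minimal sketch of sequential synthesis in which education is modelled first and income conditional on it; the column names and the simple per-group Gaussian model are assumptions made for the example:

```python
# A minimal sketch of the two-step generation described above, with the
# ordering 'education before income': model each variable's distribution,
# then replace original values with draws from the model. Column names and
# the per-group Gaussian for income are assumptions made for the example.
import numpy as np
import pandas as pd

def synthesise(df: pd.DataFrame, rng: np.random.Generator) -> pd.DataFrame:
    # Step 1: model the marginal distribution of education and sample from it.
    edu_probs = df["education"].value_counts(normalize=True)
    edu = rng.choice(edu_probs.index, size=len(df), p=edu_probs.values)

    # Step 2: model income conditional on education (a toy per-group
    # Gaussian fitted on the original data) and generate synthetic values.
    income = np.empty(len(df))
    for level, group in df.groupby("education")["income"]:
        mask = edu == level
        income[mask] = rng.normal(group.mean(), group.std(), mask.sum())
    return pd.DataFrame({"education": edu, "income": income})

# e.g. synthetic_df = synthesise(original_df, np.random.default_rng(0))
```

Reversing the order (income first, education conditional on it) would encode a different substantive assumption about the data, which is exactly why such choices call for domain expertise rather than purely technical defaults.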
Synthesisers can use sampling, that is, varying the sample size of subgroups in datasets, or re-weighting, that is, assigning weights to distributions in order to balance the data (Barbierato et al., 2022). These data practices should always be evaluated against the risks of over-fitting, that is, introducing errors by fitting models too tightly to the available data, and over-generalisation, that is, losing detail and information (Offenhuber, 2024). In certain cases, some distributions in the data may be skewed towards a specific demographic, and it is then important to assess whether this skew marks a gap that requires increasing representation within the data or, alternatively, an element to be preserved in the distribution of the synthetic dataset (Johansson et al., 2023: 20). Data is synthesised, but it still corresponds to, and is built upon, real data about individuals and groups: to generate synthetic people, companies must still draw on data collected from real ones.
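The two balancing practices named at the start of this paragraph can be sketched as follows, on a hypothetical dataframe with a ‘group’ column; both function names are illustrative:

```python
# A minimal sketch of the two balancing practices named above, on a
# hypothetical dataframe with a 'group' column; both function names
# are illustrative.
import pandas as pd

def oversample(df: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Sampling: equalise subgroup sizes by resampling smaller groups."""
    target = df["group"].value_counts().max()
    return df.groupby("group", group_keys=False).sample(
        n=target, replace=True, random_state=seed)

def reweight(df: pd.DataFrame) -> pd.Series:
    """Re-weighting: per-record weights so every subgroup carries
    the same total weight in downstream estimation."""
    counts = df["group"].value_counts()
    return df["group"].map(len(df) / (len(counts) * counts))
```

Both operations embody the evaluative choices just discussed: oversampling duplicates records and so risks over-fitting to the duplicated minority, while re-weighting preserves the records but changes what the dataset is taken to represent.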
Building a semi-synthetic dataset
Let us consider this case study in the domain of recruitment. AI tools used to extract information from curricula vitae (CVs) and rank applicants against job descriptions can increase recruitment efficiency for recruiters and professionals, but studies have shown that these AI tools can also exacerbate discriminatory risks related to candidates’ age, gender, or national origin (Fabris et al., 2025). A case in point is Amazon's AI system for screening job applicants, which was trained on biased historical training data that led to a preference for male job applicants, reflecting male dominance in the company and the tech industry (Dastin, 2018).
Among the approaches to measuring, mitigating, and explaining bias in these AI tools, some propose building a semi-synthetic dataset based on real CVs collected through a data donation campaign, which can be used to test for biases introduced by automated CV ranking systems, evaluating and comparing ranking algorithms with fairness metrics before system deployment. Synthetic data is often used in combination with real data, and AI models can be trained on such hybrid datasets or on partially semi-synthetic datasets, that is, datasets that replace sensitive variables with synthetic values (Nikolenko, 2021). Synthetic data can be derived from real-world observations and, notwithstanding the different methodologies, all current approaches exist on a continuum between the real and the synthetic (Offenhuber, 2024).
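A minimal sketch of a partially semi-synthetic dataset in this sense might look as follows: real records are kept, but the listed sensitive variables are replaced with synthetic values drawn from their empirical distributions. The column names are hypothetical, and this is not the pipeline of the studies cited here:

```python
# A minimal sketch of a partially semi-synthetic dataset: real records are
# kept, but listed sensitive variables are replaced with synthetic values
# drawn from their empirical distributions. Column names are hypothetical.
import numpy as np
import pandas as pd

def semi_synthesise(df: pd.DataFrame, sensitive: list[str],
                    seed: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    out = df.copy()
    for col in sensitive:
        probs = df[col].value_counts(normalize=True)
        out[col] = rng.choice(probs.index, size=len(df), p=probs.values)
    return out

# e.g. semi_synthesise(cv_df, sensitive=["gender", "nationality"])
```

Note that independently resampling sensitive attributes severs their statistical links with the other variables; as discussed next, this is exactly why the proxy effects of sensitive data still demand scrutiny.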
Building a semi-synthetic dataset of CVs requires addressing complex challenges: understanding how to properly represent and realistically mimic the characteristics and different attributes of the real collected documents, achieving diversity and introducing as much variability as possible in the generated synthetic CVs, and maintaining consistency while putting privacy guarantees in place for the data subjects involved (Saldivar et al., 2024; Saldivar et al., 2025). And even when sensitive data like race and gender are not used as model predictors, it remains fundamental in synthetic data generation to consider the effect that such sensitive data might have on other, unprotected attributes, for example, race affecting educational opportunities and thereby producing disparately qualified applicants for the same job (Jordon et al., 2022: 28).
To address this challenge, one solution is to employ additional qualitative approaches to evaluate and study the context within which a synthetic generation process is to be applied (De Wilde et al., 2024). In this specific case, a qualitative analysis explored how gender, ethnicity, and other sensitive data subtly manifest in the real donated CVs, and identified potential proxies of discrimination, using techniques from the Social Sciences and Humanities (SSH) to describe biases that can influence both AI and manual hiring decisions (Bathia et al., 2024). The study highlighted the need to develop cross-disciplinary research collaborations that address the context-sensitive nature and societal relevance of data when generating synthetic datasets that can help practitioners reduce gender and intersectional discrimination (Bathia et al., 2024).
Complex social characteristics and phenomena, like ‘skills’ and ‘hireability’, are difficult to define, let alone measure. Many of the harms discussed in the literature on fairness in computational systems are the result of mismatches between such unobservable theoretical constructs and the observable properties chosen to operationalise and measure them (Jacobs and Wallach, 2021).
Individuals belonging to different groups may possess different realised skills and talents, and these may be accurately measured by CVs, resulting in no mismatch between the observed and the construct space. In that case, the construct space, ‘measuring the fit of each candidate for the job posting based on relevant skills (unobservable characteristics)’, has as its counterpart an observed space ‘containing measurable properties that serve as proxies for and aim to quantify those skills, for example, having a certain degree, having a certain amount of work experience in the field’. However, this dynamic between construct and observed space might obscure potential historical and representational biases, which, for example, might have led groups with the same potential to realise their skills differently, a realised difference that might then have grounded some form of discrimination (Baumann et al., 2023). This is why scholars interested in a more ethical approach to fairness metrics have introduced the notion of a ‘potential space’, that is, a space in which to consider individual and group potential and the biases that can arise from social inequalities, related to the quality of education, life experiences, gender stereotypes, and many others (Hertweck et al., 2021).
For building a ‘high-quality’ synthetic dataset, it is fundamental to have a context- and domain-specific understanding of the challenges raised by this ‘potential space’ and of the expected distributions in the data we aim for in the process of generation over time. Instead of focusing on computational guarantees and technical implementation alone, synthesisers should provide moral justification for the methods adopted to track relevant information and account for the needs of the diverse stakeholders involved in and impacted by the use of synthetic data (Capasso et al., 2024). One criticism raised against evaluation metrics like fidelity that are based on computational guarantees is that they obscure the fact that metrics are not neutral descriptions but are inherently performative, that is, they actively shape the world and create expectations around the value of models and datasets (Ravn, 2024). The discussion on synthetic data quality thus raises broader questions of responsibility within data ecosystems, revealing that its generation is not merely a technical data practice but a normative one. It entails value-laden decisions about what to include or exclude, and about how to represent and interpret complex social phenomena.
Meaningful synthetic data. On responsibility in data ecosystems
A few years ago, a reporter for MIT Technology Review found that Lensa AI, a mobile app with generative features built on Stable Diffusion, an open-source text-to-image generation model trained on images scraped from the internet, amplified and exacerbated existing bias, generating content that was increasingly sexualised and racialised (Heikkilä, 2022). In these online environments, there can be cases of ‘synthetic data spills’, that is, polluted data that encode dominant representations of reality and that can serve as ground truth for future generations of models, amplifying existing social biases (Wyllie et al., 2024). As models are trained more and more on synthetic data, one phenomenon that may occur is model collapse: a process of degradation of model quality (Alemohammad et al., 2023). In the process of collapse, models ‘forget’ the underlying distributions of the data, and the quality and diversity of generative models decrease over generations, leading to a synthetic distribution with little resemblance to the real one and to a ‘self-consuming’ loop (Shumailov et al., 2023). Downstream models can grow more and more distant from the reference distribution of the original dataset; at the same time, it is difficult to discern generated texts or images from those produced by humans, and often there is no indication that a piece of data has been synthesised, making it indistinguishable from data of human provenance (Johansson et al., 2023: 17). As the adoption of AI models trained on synthesised data continues to grow rapidly, driven by data scarcity and privacy regulation, these situations will only accelerate.
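The mechanism of such a self-consuming loop can be illustrated with a deliberately toy simulation, in which a ‘model’ (here just a fitted Gaussian) is repeatedly re-fitted on its own samples; the sample size and generation count are arbitrary assumptions:

```python
# A deliberately toy illustration of the self-consuming loop: a 'model'
# (a fitted Gaussian) is repeatedly re-fitted on its own samples. With
# small samples, the fitted parameters drift and the spread tends to
# shrink across generations; sample size and generation count are arbitrary.
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(0.0, 1.0, size=20)  # the original 'real' data, std = 1.0

for generation in range(20):
    mu, sigma = data.mean(), data.std()     # fit this generation's model
    data = rng.normal(mu, sigma, size=20)   # next generation trains on synthesis
    print(f"gen {generation:2d}: mean={mu:+.3f}  std={sigma:.3f}")
```

Nothing in this loop refers back to the original data after the first generation, which is the structural point: without continued access to fresh real data, estimation error compounds and diversity is progressively lost.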
The vast majority of works focusing on the responsible generation and use of synthetic data have highlighted the importance of ‘high-quality provenance information’, which means adopting an archival perspective on synthetic data curation, in which synthetic data can undergo a process of ‘watermarking’, that is, of identification and traceability, in order to distinguish it from real data (Calcraft et al., 2021; De Wilde et al., 2024; Wyllie et al., 2024). But are these measures sufficient to implement responsibility in the context of synthetic data generation and use? The issue of responsibility with emergent technologies like AI models has sparked much controversy over the last few years, since those models may give rise to what philosophers call ‘responsibility gaps’: it seems that somebody (e.g. individuals, organisations, and governments) should be held responsible for the outcomes of systems, but it is not clear who can be singled out along the chain of actions (Matthias, 2004). One suggestion in the philosophy of technology literature for filling responsibility gaps appeals to indirect forms of control, under the heading of ‘meaningful human control’. The theory of meaningful human control (MHC) promotes a ‘trace and track’ account: a ‘tracing’ condition, according to which systems should be designed so that the outcome of their operations can always be traced back to at least one human along the chain of design and operation; and a ‘tracking’ condition, according to which systems should be able to respond to the relevant moral reasons of the humans designing and deploying them and to the relevant facts of the environment in which they operate (Santoni de Sio and Mecacci, 2021; Santoni de Sio and van den Hoven, 2018).
The logic of traceability in terms of archival data curation (Jo and Gebru, 2020) responds to the need to flag and disclose the origin of synthetic data: it provides a means to recognise synthetic data as such, to evaluate its provenance along the complex chain of production, and to ensure that methods are in place for taking accountability for synthetic data spills (Wyllie et al., 2024). However, a focus on traceability alone is not sufficient to sustain the responsible use and generation of synthetic data. First, because traceability in this guise tends to be regarded as an inherently technical question, solvable through technical safeguards that allow data to be safely stored and its provenance traced, like the proposal for ‘watermark signals’ identifying the data source within the data itself (Calcraft et al., 2021: 24) or ‘generator cards’ that transparently state what information was (and was not) used to generate the data (Houssiau et al., 2022). But, as many have noted, these traceability techniques can be removed from a dataset, inadvertently or by malicious actors (Calcraft et al., 2021), and, more importantly, they focus on technical considerations of data accuracy and reliability without going beyond them.
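As an illustration of what such documentation might look like in practice, here is a minimal sketch of a machine-readable ‘generator card’ in the spirit of Houssiau et al. (2022); every field name and value is a hypothetical example, not a schema proposed by those authors:

```python
# A minimal sketch of a machine-readable 'generator card': a record, shipped
# with a synthetic dataset, of what was (and was not) used to generate it.
# Every field and value here is a hypothetical example, not a fixed schema.
from dataclasses import dataclass, asdict
import json

@dataclass
class GeneratorCard:
    generator: str              # model family used for synthesis
    source_dataset: str         # provenance of the original data
    variables_modelled: list    # what the generator saw
    variables_excluded: list    # what it deliberately did not see
    privacy_mechanism: str      # e.g. a differential privacy budget
    intended_use: str

card = GeneratorCard(
    generator="sequential parametric synthesiser",
    source_dataset="donated CV corpus, 2024 collection",
    variables_modelled=["education", "work_experience"],
    variables_excluded=["gender", "ethnicity"],
    privacy_mechanism="differential privacy, epsilon=1.0",
    intended_use="pre-deployment bias testing of CV ranking systems",
)
print(json.dumps(asdict(card), indent=2))
```

Even in this form, the card only records technical facts about the generation process; it does not by itself answer who is accountable for the choices those fields encode, which is the limitation developed next.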
Indeed, the second reason why traceability alone is not sufficient is that it leaves out considerations of the complex social practices that underpin and configure the generation and use of data and models. Meaningful forms of data curation should be able to trace how the labels and distributions of data change over time and across cultures, that is, they should pay attention to sociocultural inclusivity in their processes (Jo and Gebru, 2020). Beyond methods for technical traceability, there is another fundamental social task that can sustain responsibility: distributed AI power. Distributed AI power is a concept mobilised by proponents of algorithmic reparation that argues for co-creation between developers and community stakeholders, and is premised on undoing the existing power asymmetries, disproportionate risks, and dynamics that can inform the training of data (Davis et al., 2021).
To put this point in terms of the theory of MHC, technical traceability satisfies at most the tracing condition. The tracking condition further requires that synthetic data ecosystems remain responsive to the relevant moral reasons of the humans who design, deploy, and are affected by them, and this is precisely what distributed AI power seeks to secure.
In particular, proponents of algorithmic reparation argue for a shift from a fairness perspective in machine learning studies ‒ centred on the equal distribution of resources and benefits across social groups ‒ to a reparative justice perspective, which can use models to provide redress for past harms to people with marginalised intersectional identities (Davis et al., 2021). This shift is also relevant to the discourse on the responsible generation and use of synthetic data and needs further analysis in this context. Different actors are involved in synthetic data ecosystems, with different experiences of and power over managing, collecting, arranging, sharing, and auditing data. These actors have different degrees of control over synthetic data generation and different access to non-synthetic, fresh data (Shumailov et al., 2023). Therefore, to make sure that the process of data curation is sustained over time, there is a need to preserve access to fresh data and to share information about data provenance, but, more fundamentally, to ensure that the data ecosystem, as a particular political and economic system that advances a normative vision of how social issues should be understood and resolved, facilitates forms of data justice and democratic data governance (Dencik and Sanchez-Monedero, 2022).
Reparation
Recently, machine learning scholars have discussed the possibility of introducing the algorithmic reparation perspective into synthetic data generation as a way to promote social equity and justice (Wyllie et al., 2024). For example, models can be used for positive and intentional interventions in their data ecosystems, creating induced distribution shifts through progressive intersectional categorical sampling, for example, using sensitive data like race and gender and making the synthetic training data representative of intersectional identities (Wyllie et al., 2024). Following a reparation perspective, practitioners do not aim to mitigate bias or render sensitive data like gender invisible, but to leverage such data to benefit marginalised communities, in consideration of the non-ideal, real-world scenarios in which they operate, where inequalities and discrimination are systemic and entangled (Davis et al., 2021).
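A minimal sketch of such an intentional, intersectional sampling intervention might look as follows; the column names, group labels, and target shares are illustrative assumptions, not the procedure of Wyllie et al. (2024):

```python
# A minimal sketch of an intentional, intersectional sampling intervention:
# a synthetic training set is drawn so that intersectional groups appear at
# chosen target rates rather than at their historical rates. The columns,
# group labels, and target shares are illustrative assumptions.
import pandas as pd

def reparative_sample(df: pd.DataFrame, targets: dict,
                      n: int, seed: int = 0) -> pd.DataFrame:
    """Draw a training set whose intersectional groups appear at the
    given target shares rather than at their historical shares."""
    parts = []
    for (gender, ethnicity), share in targets.items():
        group = df[(df["gender"] == gender) & (df["ethnicity"] == ethnicity)]
        parts.append(group.sample(int(round(n * share)), replace=True,
                                  random_state=seed))
    return pd.concat(parts, ignore_index=True)

# e.g. deliberately equalised representation across four intersections:
# reparative_sample(df, {("woman", "black"): 0.25, ("man", "black"): 0.25,
#                        ("woman", "white"): 0.25, ("man", "white"): 0.25},
#                   n=10_000)
```

The crucial point is that the target shares are not discovered in the data but chosen: they encode a normative judgement about what a just distribution would look like, which is what distinguishes reparative sampling from mere rebalancing.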
The adoption of an algorithmic reparation approach in the context of synthetic data can serve to implement strategies for distributed AI power. Reparation measures in data practices can indeed contribute to raising awareness and understanding of the equity and fairness harms that may arise, and provide a venue for facilitating forms of data justice that go beyond the logic of technical traceability in the broader data setting. But if, on the one hand, the inclusion of reparation in synthetic data generation studies can be useful as a critical framework for addressing implicit (and unjust) socio-cultural dynamics and accounting for intersectional identities, on the other hand, it still focuses on mathematical and technical solutions to be optimised. As such, it fails to explicitly address the question of ‘agency’ over models (Wyllie et al., 2024: 14–15), which rests with different actors holding different powers, and of how these actors can influence model changes and shifts in the data ecosystem over time.
The EU General Data Protection Regulation (GDPR) prohibits the use of special categories of data (e.g. information revealing racial or ethnic origin) (Article 9(1), GDPR). However, the recent final draft of the AI Act provides exceptions to the GDPR that allow the use of such data for bias detection, provided this usage is subject to appropriate safeguards; specifically, synthetic or anonymised data are regarded as appropriate safeguards that enable bias detection without the use of sensitive data (Article 10(5)a, AI Act).
Yet the definition of appropriate safeguards remains unclear, and the AI Act neither gives a concrete indication of who decides what the appropriate safeguards are (e.g. providers, controllers, or organisations), nor elucidates the risks associated with the collection of sensitive data (van Bekkum and Zuiderveen Borgesius, 2023: 12). Moreover, the adoption of safeguards does not remove the need to use sensitive data: sensitive data from original datasets remain essential for the development and validation of bias detection models, even when using synthetic data, since such data must still be collected to create a synthetic dataset, and controlled access to it must be ensured throughout the downstream tasks of models and over generations, to avoid degradation and self-consuming loops. In this scenario, one solution to enable bias detection is to assign the collection, storage, and discrimination analysis of sensitive data to trusted third parties, that is, neutral parties that hold sensitive data and run bias analyses on their premises (Berendsen and Beauxis-Aussalet, 2024).
However, it is unclear who can serve as a trusted third party: governments, governmental organisations like the national statistics bureaus of member states that already collect demographic data at large scale, consumer rights groups, civil society research groups, consultancy and accountancy firms, and many others (Veale and Binns, 2017). Each of these entities has different levels of technical expertise, different requirements of transparency and trustworthiness, and varying auditing competence and involvement of marginalised groups in its activities. For example, depending on the context, it may be appropriate to involve trade unions where models and data are deployed in human resources decisions, while NGOs might be better suited to cases of historical bias, since they are perceived as more trustworthy by marginalised communities (Veale and Binns, 2017).
The adoption of algorithmic reparation measures in the machine learning community should, instead of focusing on technical implementations alone, take a more critical approach to realising distributed AI power. Beyond maintaining data ecosystems, reparation as a critical approach has the fundamental task of enacting co-creation data practices that are contextually and institutionally grounded, addressing different power dynamics and socio-technical problems. Moreover, implementing algorithmic reparation measures has another important limitation. With their focus on past harms, these measures neglect the analogical perspective introduced by synthetic data, which is regulative and relational, and which opens up the possibility of generating (future) scenarios for evaluating phenomena. Synthetic data and models need to adapt constantly to new scenarios and changes in data distribution, and to new reconfigurations of what is (considered) a ‘fair’ distribution. This is because synthetic data takes its value from precisely the same space as the real data, and if there is a shift in the distribution of real data, then the synthetic data may no longer be fair (Jordon et al., 2022: 29).
Consider the case of LLMs trained on the current corpus of online text, which comprises both human-produced and synthetic texts. As already noted, in these online environments there can be cases of ‘synthetic data spills’. Studies have shown that LLMs can present inherent limitations like misrepresentation and group flattening, since their generated responses can fail to recognise emergent within-group heterogeneity, for example, missing that not all non-binary people use they/them pronouns (Wang et al., 2024). These limitations are likely to persist without critical attention to how the emergent nuances, socially accepted norms, and complexity present in real-world scenarios can be captured, and without empirically better techniques for integrating them into models’ vast training data.
Concluding discussion. On responsibility
Building a ‘high-quality’ synthetic dataset requires not only technical accuracy but also an awareness of how synthetic data both shapes and is shaped by broader data ecosystems. Within these ecosystems, diverse actors must navigate and make explicit the often competing values that inform their choices, while engaging in contextually and institutionally grounded co-creation practices that reflect and sustain distributed AI power. Yet, creating ‘meaningful’ synthetic datasets goes further: it entails fostering responsible AI and data innovation ecosystems – systems that prioritise responsiveness to ethical and social concerns, enable anticipation of potential consequences, promote critical reflection, and integrate dynamic notions of justice and governance into their structure (Stahl, 2022, 2023).
The adoption of this ecosystem metaphor has the advantage of providing a strong conceptual basis for an improved understanding of the social reality surrounding synthetic data, and can draw on relevant discourses in responsible innovation studies, which have already developed conceptual and empirical tools for understanding and shaping ecosystems (Stahl, 2023). For example, in responsible innovation studies, responding to new situations and uncertainty is a key aspect. Beyond a principle of inclusion, which is associated with mechanisms that integrate different views and perspectives, responsible innovation studies argue for a principle of responsiveness, that is, the capacity to change the shape or direction of innovation in response to stakeholder and public values and to changing circumstances (Stilgoe et al., 2013).
Responsibility has not only a negative backward-looking nature concerning blame or redress for something that has happened, but also a positive forward-looking nature that consists in promoting and achieving socially shared values (Nyholm, 2023; Santoni de Sio and Mecacci, 2021).
In the same vein, reparation does not stand only for accountability and redress for past injustices, but implies a constructive worldmaking project bearing on present and future justice. Adopting an ecosystem perspective on responsibility and reparation might be a way to address the ‘dilemma of societal alignment’, or ‘value alignment’, that is, shaping science, technology, and innovation to ensure that their development processes are responsive and aligned with the values and needs of different publics (Ribeiro et al., 2018). In this article, I have demonstrated that the very concept of synthetic data introduces an analogical perspective on data, meaning that it allows us to frame data practices as regulative and relational practices that aim to develop knowledge, and through which questions of responsible innovation can be asked to make those practices more responsive to societal challenges. Meaningful synthetic data is not data in a vacuum: it is data whose generation and use are embedded in responsible AI and data innovation ecosystems, responsive to societal challenges and emerging human values.
To allow a richer understanding and practical implementation of real-life responsible AI and data innovation ecosystems, one avenue would consist, for example, not only in the creation of technical methods to demand redress and compute damages for past injustices, according to a backward-looking view, but, more fundamentally, in the adoption of methods for community-based perspectives, like participatory design and participatory action research (Costanza-Chock, 2020; Santoni de Sio, 2024). These perspectives account for heterogeneous communities involving unequal power relationships and multiple, sometimes conflicting, interests, and can instigate structural changes aimed at preventing societal harms and discrimination and at actively promoting socially shared values and more equitable forms of justice (Zhang, 2023). This can be a starting point for considering diverse positions in a pluralist society within the synthetic data ecosystem, one that does not correct or neutralise multiple perspectives within a logic of technical optimisation, but instead treats synthetic data generation as a collective and open-ended data practice.
Acknowledgements
Not applicable.
Ethical approval
Not applicable.
Informed consent
Not applicable.
Funding
The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the FINDHR (Fairness and Intersectional Non-Discrimination in Human Recommendation) project that received funding from the European Union's Horizon Europe research and innovation program under grant agreement No 101070212. Views and opinions expressed are however those of the author only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.
Declaration of conflicting interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data availability statement
Data sharing not applicable to this article as no datasets were generated or analysed during the present study.
