Abstract
Synthetic data – algorithmically generated data – has been considered a novel solution to the data scarcity issue, and a ‘technical fix’ able to fill the gap in areas where real data is sensitive or biased. Different narratives about the nature of synthetic data as either mirroring or replacing real data, alongside diverse evaluation metrics for measuring the fidelity and utility of such data, have proliferated across the machine learning fairness community, in public policy research, privacy and data protection studies, and critical data scholarship. Yet, there is still no consensus on what constitutes ‘high-quality’ synthetic data. Against this background, I demonstrate how the concept of synthetic data introduces an analogical perspective on data. This perspective is relational and regulative, extending the discussion on data quality to encompass questions of data justice and responsible innovation. It invites critical reflections on the purpose and trade-offs involved in synthetic data generation and use, the social practices and power dynamics that underpin and configure it, and how its direction can be shaped in response to changing real-world circumstances and emerging human values. Building on this analysis, I argue that the generation and use of synthetic data become meaningful only when embedded in responsible AI and data innovation ecosystems that remain responsive to ethical and social concerns and to changing real-world circumstances.
Introduction
The past years have witnessed a ‘data-centric’ shift in the field of artificial intelligence (AI), which emphasises the systematic engineering of data to develop effective AI (Zha et al., 2023). One crucial reason behind this shift lies in the demand for an increasing amount of data at an unprecedented rate, since this volume of data is a key factor in training and advancing AI models at scale, across multiple tasks and various domains. A lack of suitable data has thus become a central bottleneck for developing AI at scale.
But this lack of data is not a new problem. Various terms such as ‘data scarcity’ (Alzubaidi et al., 2023), ‘data shortage’ (Xu, 2022), and ‘data problem’ (Nikolenko, 2021) have been used to name the same underlying issue: data is difficult, time-consuming, and costly to label and obtain, and access to data is sometimes impossible (e.g. sufficient data simply do not exist, for instance, in regions of conflict) or is made difficult or undesirable by legal and socio-ethical constraints (e.g. privacy concerns, legal compliance requirements, and data that exhibit biased distributions).
Against this background, we are witnessing the rise of synthetic data ‒ data that is algorithmically generated ‒ which serves as a novel solution to this data crisis. Synthetic data has already entered different fields of application, and a growing number of companies now offer synthetic data generation as a commercial service.
However, the introduction of synthetic data is not exempt from relevant challenges. As large language models (LLMs) are trained more and more on synthetic data, one phenomenon that may occur is model collapse: the quality and diversity of generative models decrease over generations, which may amplify existing biases in society (Alemohammad et al., 2023). How can we ensure that synthetic data is ‘high-quality’ data for training AI models? To date, there is no consensus on how to define the ‘quality’ of synthetic data, notwithstanding the growing interest in the topic from different fields, for example, the machine learning fairness community, public policy research, privacy and data protection studies, and critical data scholarship.
In this article, my aim is to address this question and the gap in scholarship. While existing critical scholarship on synthetic data has already started to investigate the societal implications that underpin and configure synthetic data, I advance this scholarship by highlighting how synthetic data introduces an analogical perspective on data, which extends discussion on the quality of synthetic data to encompass questions of data justice and responsible innovation. These are important perspectives because they allow us to understand and evaluate synthetic data not in a decontextualised and purely technical manner, but in relation to the social practices, power dynamics, and data ecosystems that shape its generation and use.
The article is organised as follows. In the ‘Synthetic data and real data. On analogies’ section, I introduce and examine the relational and regulative perspective of analogy when defining and evaluating synthetic data, showing its benefits over the different narratives around synthetic data as mirroring or replacing real data that have proliferated in recent years. Rather than treating synthetic data as a mere ‘technical fix’ that can replace and work as real data, the adoption of this analogical perspective has the merit of prompting critical reflections about the quality, purpose, and trade-offs of synthetic data generation and use. In ‘The quality in the synthetic. On metrics’ section, I analyse the question of data quality and evaluation metrics and illustrate a case study of building a semi-synthetic dataset to test for biases introduced by AI systems in recruitment. I demonstrate that synthetic data generation is a data practice that requires context-sensitive evaluations and justifications, and synthesisers are called upon to anticipate, gain knowledge, and critically reflect on the societal implications of models and datasets. Building on this more nuanced analysis, in the ‘Meaningful synthetic data. On responsibility in data ecosystems’ section, I argue that the generation and use of synthetic data become meaningful only when embedded in responsible AI and data innovation ecosystems that are responsive to ethical and social concerns.
Synthetic data and real data. On analogies
Despite the nascent interest in synthetic data, to date, there is no clear consensus on how to define it, and different attempts to capture its conceptual meaning have been put forward, varying across contexts and affecting the transparency and reproducibility in research involving synthetic data generation (Giuffrè and Shung, 2023). But generally speaking, what is synthetic data? Synthetic data is data that has been generated by a model and is designed to reproduce some structural or statistical properties and distributions of real data (Jordon et al., 2022). Even if there is no univocal definition, what is particularly interesting in the conceptions proposed across different fields is the recurring focus on a mirroring criterion: data is synthetic when it is ‘mirroring properties of an original dataset’, when it could serve as a ‘proxy’ for real data (James et al., 2021). Along those lines, synthetic data has been defined as ‘an artificial alternative to real-world data, mimicking and replicating real datasets’ (UK Statistics Authority, 2022), as ‘almost-but-not-quite replica data’ or ‘fake’ data (van Bekkum and Zuiderveen Borgesius, 2023: 12), as ‘a stand-in for real world data’ (Renieris, 2023: 84), whose aim is to ‘mimic real-world data, to look like it, to stand in for it, and to be used in its place’.
The mirroring criterion encompasses one main direction in synthetic data generation, that is, the replication of real data: synthetic data is generated so as to reproduce the statistical properties and distributions of an existing dataset and to stand in for it.
Synthetic data aims to be treated as real data. Under the mirroring criterion, the relation between the two is framed as an analogy of attribution, in which the properties of real data are directly ascribed to synthetic data and the two are treated as homogeneous objects.
Unlike attribution, which ascribes properties from one object to another, the proportional perspective identifies only the relation between them (Reichl, 2023). The analogy of proportion offers a more refined epistemic tool for understanding synthetic data. Applied in this context, it means we do not assert an equation between homogeneous objects, as in the analogy of attribution: Synthetic Data = Real Data. Rather, it serves as a tool for developing knowledge about how the distributions in the synthetic and real datasets relate to one another, according to the analogy of proportion: Synthetic Distributions: Synthetic Dataset = Real Distributions: Real Dataset.
The benefits of adopting this relational perspective of analogy when defining and evaluating synthetic data are several. First, it challenges the assumption that synthetic and real datasets can or should be directly compared as homogeneous. A synthetic dataset needs to share many distributions with the real dataset, but some distributions do not bear on the purpose of generation and need not be shared between the two. A complete match between synthetic and original datasets can be difficult to achieve for privacy reasons, and even undesirable, as in the case of biased distributions. Full statistical similarity, that is, matching the distributions of the synthetic and real datasets, does not necessarily correspond to improved performance, as this depends on the context (Jordon et al., 2022). For example, privacy can trade off against fidelity: excessive similarity between the synthetic dataset and the original one poses information disclosure risks that could lead to re-identification, creating legal uncertainty (e.g. regarding GDPR compliance) (Beduschi, 2024). Framing synthetic data as a mere ‘technical fix’ or substitute for real data thus ignores the ethical and contextual dimensions involved in data generation (Helm et al., 2024; Jacobsen, 2023).
More fundamentally, the relational perspective of analogy plays a distinct theoretical and epistemic role, framing synthetic data generation as a regulative practice – one that enables inquiry and the development of new knowledge (van den Berg, 2018). Rather than offering certainty or asserting similarity between objects, this perspective guides examination and discovery in our experience, even when the characteristics of one object remain unknown (Callanan, 2008). As such, it functions as an epistemic tool that bridges the gap between the known and the unknown, by focusing on relations and not isolated objects and properties (Burles, 2023).
According to the fidelity metrics used to evaluate synthetic data generation, the aim is to approximate the distribution used to generate synthetic data as closely as possible to the (unknown) real data distribution (Jordon et al., 2022: 23). This implies that the quality of synthetic data rests upon a relational and contextually embedded perspective: it is not measured in isolation and decontextualised, but always in relation to a particular use and purpose of real data. Synthetic data takes its value from precisely the same space as the real data, yet not in a descriptive and homogeneous perspective that tends to duplicate the properties of real distributions in an accurate representation, but rather in a relational and regulative perspective that asks how the distributions of the two datasets relate to one another, given a specific purpose.
As Jordon et al. (2022: 15) argue, for synthetic data to be useful, its quality must be judged against the specific task it is meant to support: quality is not an intrinsic property of a dataset, but a task-dependent and purpose-relative one.
The excessive focus on a mirroring criterion obscures the fact that there is another important direction in synthetic data studies, that is, the generation of data for scenarios in which real data are scarce, absent, or impossible to collect (e.g. rare events, edge cases, or counterfactual conditions), where synthetic data is meant to augment and extend, rather than replicate, the real.
The adoption of the relational theory of analogy has the merit of broadening considerations beyond synthetic data as a mere ‘technical fix’ to encompass reflections about the quality of this data, and about the complex systems of science and innovation that are called upon to anticipate, gain knowledge of, and respond to the possible consequences of synthetic data generation and use. Synthetic data generation is not only a data practice that aims to develop knowledge, but also a data practice that invites questions of responsible innovation. Responsible innovation is an approach to science and innovation that aims to realise the alignment of research and innovation activities with beneficial societal goals and needs (Stilgoe et al., 2013). Synthetic data can enable system prototyping, helping practitioners, data scientists, engineers, but also policy-makers to understand the nature of phenomena under analysis, the characteristics and general patterns of real datasets as a whole, and the context in which they are used (Johansson et al., 2023: 10). Yet, those involved must remain aware of the innovation pathways they are shaping and the societal challenges synthetic data may pose. Responsible innovation studies highlight the importance of considering the tensions, governance practices, and accountability models that surround emerging technologies (Stilgoe et al., 2013). Framing synthetic data through the relational analogy perspective allows for a more nuanced analysis of the innovation ecosystems in which data are embedded. Crucially, it calls for scrutiny of the direction of data generation – its purposes, trade-offs, and how it can be shaped in response to changing circumstances and emerging human values.
The quality in the synthetic. On metrics
‘There is no AI without data’ (Gröger, 2021). Data quality is back in the spotlight in the context of building data ecosystems that can cope with the emerging data challenges posed by AI, and this shift in research focus from a model-centric to a ‘data-centric’ approach has led to the introduction of standards, quality frameworks, and strategies for anchoring dataset quality in the context of modelling, evaluation, and use (Mohammed et al., 2025). The question of data quality is thus not unique to synthetic data but applies to machine learning data and datasets more broadly. Yet despite being a widely used term in machine learning studies, defining exactly what is meant by dataset quality can be a surprisingly difficult task, since ‘quality’, like other constructs such as ‘fairness’, is an essentially contested construct, that is, one with multiple, sometimes conflicting, context-dependent theoretical understandings that make it inherently hard to measure and operationalise (Jacobs and Wallach, 2021). In particular, understanding and measuring synthetic data quality is non-trivial, as the quality targets of synthetic datasets can be vague, and current metrics are often insufficient: they capture different aspects of synthetic data and do not allow for granular evaluation or for navigating trade-offs between competing metrics, for example, privacy versus fidelity or fairness versus utility (De Wilde et al., 2024; van Breugel and van der Schaar, 2023).
Metrics have proliferated, but a common trend has emerged in machine learning studies that defines the quality of synthetic data generation along two main categories: the already cited fidelity, that is, how well synthetic data reproduces and preserves key distributions of real data (Jordon et al., 2022); and utility, that is, how well it can be used in real-life scenarios for a given task (Houssiau et al., 2022). Evaluating synthetic data generation from both the fidelity and the utility perspective typically centres on building mathematical or computational guarantees of these categories (Marshall et al., 2023). Machine learning scholars employ different mechanisms to calculate fidelity and utility measures (e.g. propensity scores and classification accuracy) and to investigate how best to use generated synthetic data in real-life scenarios (Dankar and Ibrahim, 2021). The general aim is to provide frameworks that can ‘quantify’ the information and properties to be preserved in synthetic data (Houssiau et al., 2022).
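To make this concrete, the following is a minimal sketch of one such mechanism, the propensity-score measure of fidelity (often called pMSE), assuming two numeric datasets of comparable size; the function name and the toy Gaussian data are illustrative rather than a standard implementation:

```python
# A minimal sketch of the propensity-score fidelity measure (pMSE),
# assuming two numeric datasets of comparable size; the toy Gaussian
# data below are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

def pmse(real: np.ndarray, synthetic: np.ndarray) -> float:
    """Train a classifier to tell real (0) from synthetic (1) records;
    the closer its propensity scores stay to the synthetic share c,
    the harder the datasets are to distinguish (higher fidelity)."""
    X = np.vstack([real, synthetic])
    y = np.concatenate([np.zeros(len(real)), np.ones(len(synthetic))])
    c = len(synthetic) / len(X)  # expected score under indistinguishability
    scores = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]
    return float(np.mean((scores - c) ** 2))  # 0.0 = indistinguishable

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(500, 4))
faithful = rng.normal(0.0, 1.0, size=(500, 4))  # matches the real distribution
shifted = rng.normal(1.5, 1.0, size=(500, 4))   # visibly different distribution
print(pmse(real, faithful))  # close to 0
print(pmse(real, shifted))   # noticeably larger
```

Even in this toy form, the measure is relational in precisely the sense discussed above: fidelity is defined by how the two datasets relate under a given discriminator, not by any intrinsic property of the synthetic data alone.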
In the case of fidelity, for example, the metric responds to a logic of preservation, that is, of identifying the distributions that should be preserved from an original dataset. To preserve the desired analytical value, synthesisers (the individuals generating the synthetic data) work along a spectrum that ranges from preserving a few pre-identified statistics to preserving as many relationships between variables as possible, and these variables can differ in nature: numerical, categorical, socio-demographic, and so on (UN, 2022). The value of synthetic data depends on how complex the system is and how sophisticated the data needs to be, and specific analysis is required on the part of synthesisers to determine which method to use, identify the type of synthetic data that is required, and establish the context within which it will be used (UN, 2022). The generation consists of two steps: modelling the distribution of the variables, and replacing the original values contained in a dataset with values generated from the model (Sallier, 2020). In this process, not every distribution contained in the original dataset is preserved.
But how should one select the order in which variables are synthesised? There is no known standard procedure for selecting the order of the variables, and subject matter expertise may be important for informing such choices (UN, 2022). For example, experts have proposed placing education before income when modelling statistical information, since it is more plausible that a person's income depends on their education than the reverse, and an order of synthesis that follows such substantive dependencies tends to better preserve the relationships between variables (UN, 2022).
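As an illustration of the two-step process and of this ordering choice, here is a minimal sketch of sequential synthesis in which education is modelled first and income conditional on it; the column names and the simple per-group Gaussian model are assumptions made for the example:

```python
# A minimal sketch of the two-step generation described above, with the
# ordering 'education before income': model each variable's distribution,
# then replace original values with draws from the model. Column names and
# the per-group Gaussian for income are assumptions made for the example.
import numpy as np
import pandas as pd

def synthesise(df: pd.DataFrame, rng: np.random.Generator) -> pd.DataFrame:
    # Step 1: model the marginal distribution of education and sample from it.
    edu_probs = df["education"].value_counts(normalize=True)
    edu = rng.choice(edu_probs.index, size=len(df), p=edu_probs.values)

    # Step 2: model income conditional on education (a toy per-group
    # Gaussian fitted on the original data) and generate synthetic values.
    income = np.empty(len(df))
    for level, group in df.groupby("education")["income"]:
        mask = edu == level
        income[mask] = rng.normal(group.mean(), group.std(), mask.sum())
    return pd.DataFrame({"education": edu, "income": income})

# e.g. synthetic_df = synthesise(original_df, np.random.default_rng(0))
```

Reversing the order (income first, education conditional on it) would encode a different substantive assumption about the data, which is exactly why such choices call for domain expertise rather than purely technical defaults.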
Synthesisers can use sampling, that is, varying the sample size of subgroups in datasets, or re-weighting, that is, assigning weights to distributions in order to balance the data (Barbierato et al., 2022). These data practices should always be evaluated against the risks of over-fitting, that is, introducing errors by fitting models too tightly to the available data, and over-generalisation, that is, losing detail and information (Offenhuber, 2024). In certain cases, some distributions in the data may be skewed towards a specific demographic, and it is then important to assess whether this skew marks a gap that requires increasing representation within the data or, alternatively, an element to be preserved in the distribution of the synthetic dataset (Johansson et al., 2023: 20). Data is synthesised, but it still corresponds to, and is built upon, real data about individuals and groups: to generate synthetic people, companies must still draw on data collected from real ones.
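The two balancing practices named at the start of this paragraph can be sketched as follows, on a hypothetical dataframe with a ‘group’ column; both function names are illustrative:

```python
# A minimal sketch of the two balancing practices named above, on a
# hypothetical dataframe with a 'group' column; both function names
# are illustrative.
import pandas as pd

def oversample(df: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Sampling: equalise subgroup sizes by resampling smaller groups."""
    target = df["group"].value_counts().max()
    return df.groupby("group", group_keys=False).sample(
        n=target, replace=True, random_state=seed)

def reweight(df: pd.DataFrame) -> pd.Series:
    """Re-weighting: per-record weights so every subgroup carries
    the same total weight in downstream estimation."""
    counts = df["group"].value_counts()
    return df["group"].map(len(df) / (len(counts) * counts))
```

Both operations embody the evaluative choices just discussed: oversampling duplicates records and so risks over-fitting to the duplicated minority, while re-weighting preserves the records but changes what the dataset is taken to represent.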
Building a semi-synthetic dataset
Let us consider this case study in the domain of recruitment. AI tools used to extract information from curricula vitae (CVs) and rank applicants against job descriptions can increase recruitment efficiency for recruiters and professionals, but studies have shown that these AI tools can also exacerbate discriminatory risks related to candidates’ age, gender, or national origin (Fabris et al., 2025). A case in point is Amazon's AI system for screening job applicants, which was trained on biased historical training data that led to a preference for male job applicants, reflecting male dominance in the company and the tech industry (Dastin, 2018).
Among the approaches to measuring, mitigating, and explaining bias in these AI tools, some propose building a semi-synthetic dataset based on real CVs collected through a data donation campaign, which can be used to test for biases introduced by automated CV ranking systems, evaluating and comparing ranking algorithms with fairness metrics before system deployment. Synthetic data is often used in combination with real data, and AI models can be trained on such hybrid datasets or on partially semi-synthetic datasets, that is, datasets that replace sensitive variables with synthetic values (Nikolenko, 2021). Synthetic data can be derived from real-world observations and, notwithstanding the different methodologies, all current approaches exist on a continuum between the real and the synthetic (Offenhuber, 2024).
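A minimal sketch of a partially semi-synthetic dataset in this sense might look as follows: real records are kept, but the listed sensitive variables are replaced with synthetic values drawn from their empirical distributions. The column names are hypothetical, and this is not the pipeline of the studies cited here:

```python
# A minimal sketch of a partially semi-synthetic dataset: real records are
# kept, but listed sensitive variables are replaced with synthetic values
# drawn from their empirical distributions. Column names are hypothetical.
import numpy as np
import pandas as pd

def semi_synthesise(df: pd.DataFrame, sensitive: list[str],
                    seed: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    out = df.copy()
    for col in sensitive:
        probs = df[col].value_counts(normalize=True)
        out[col] = rng.choice(probs.index, size=len(df), p=probs.values)
    return out

# e.g. semi_synthesise(cv_df, sensitive=["gender", "nationality"])
```

Note that independently resampling sensitive attributes severs their statistical links with the other variables; as discussed next, this is exactly why the proxy effects of sensitive data still demand scrutiny.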
Building a semi-synthetic dataset of CVs requires addressing complex challenges: understanding how to properly represent and realistically mimic the characteristics and different attributes of the real collected documents, achieving diversity and introducing as much variability as possible in the generated synthetic CVs, and maintaining consistency while putting privacy guarantees in place for the data subjects involved (Saldivar et al., 2024; Saldivar et al., 2025). And even when sensitive data like race and gender are not used as model predictors, it remains fundamental in synthetic data generation to consider the effect that such sensitive data might have on other, unprotected attributes, for example, race affecting educational opportunities and thereby producing disparately qualified applicants for the same job (Jordon et al., 2022: 28).
To address this challenge, one solution is to employ additional qualitative approaches to evaluate and study the context within which a synthetic generation process is to be applied (De Wilde et al., 2024). In this specific case, a qualitative analysis explored how gender, ethnicity, and other sensitive data subtly manifest in the real donated CVs, and identified potential proxies of discrimination, using techniques from the Social Sciences and Humanities (SSH) to describe biases that can influence both AI and manual hiring decisions (Bathia et al., 2024). The study highlighted the need to develop cross-disciplinary research collaborations that address the context-sensitive nature and societal relevance of data when generating synthetic datasets that can help practitioners reduce gender and intersectional discrimination (Bathia et al., 2024).
Complex social characteristics and phenomena, like ‘skills’ and ‘hireability’, are difficult to define, let alone measure. Many of the harms discussed in the literature on fairness in computational systems are the result of mismatches between such unobservable theoretical constructs and the observable properties chosen to operationalise and measure them (Jacobs and Wallach, 2021).
Individuals belonging to different groups may possess different realised skills and talents, and these may be accurately measured by CVs, resulting in no mismatch between the observed and the construct space. In that case, the construct space, ‘measuring the fit of each candidate for the job posting based on relevant skills (unobservable characteristics)’, has as its counterpart an observed space ‘containing measurable properties that serve as proxies for and aim to quantify those skills, for example, having a certain degree, having a certain amount of work experience in the field’. However, this dynamic between construct and observed space might obscure potential historical and representational biases, which, for example, might have led groups with the same potential to realise their skills differently, a realised difference that might then have grounded some form of discrimination (Baumann et al., 2023). This is why scholars interested in a more ethical approach to fairness metrics have introduced the notion of a ‘potential space’, that is, a space in which to consider individual and group potential and the biases that can arise from social inequalities, related to the quality of education, life experiences, gender stereotypes, and many others (Hertweck et al., 2021).
For building a ‘high-quality’ synthetic dataset, it is fundamental to have a context- and domain-specific understanding of the challenges raised by this ‘potential space’ and of the expected distributions in the data we aim for in the process of generation over time. Instead of focusing on computational guarantees and technical implementation alone, synthesisers should provide moral justification for the methods adopted to track relevant information and account for the needs of the diverse stakeholders involved in and impacted by the use of synthetic data (Capasso et al., 2024). One criticism raised against evaluation metrics like fidelity that are based on computational guarantees is that they obscure the fact that metrics are not neutral descriptions but are inherently performative, that is, they actively shape the world and create expectations around the value of models and datasets (Ravn, 2024). The discussion on synthetic data quality thus raises broader questions of responsibility within data ecosystems, revealing that its generation is not merely a technical data practice but a normative one. It entails value-laden decisions about what to include or exclude, and about how to represent and interpret complex social phenomena.
Meaningful synthetic data. On responsibility in data ecosystems
A few years ago, a reporter for MIT Technology Review found that Lensa AI, a mobile app with generative features built on Stable Diffusion, an open-source text-to-image generation model trained on images scraped from the internet, amplified and exacerbated existing bias, generating content that was increasingly sexualised and racialised (Heikkilä, 2022). In these online environments, there can be cases of ‘synthetic data spills’, that is, polluted data that encode dominant representations of reality and that can serve as ground truth for future generations of models, amplifying existing social biases (Wyllie et al., 2024). As models are trained more and more on synthetic data, one phenomenon that may occur is model collapse: a process of degradation of model quality (Alemohammad et al., 2023). In the process of collapse, models ‘forget’ the underlying distributions of the data, and the quality and diversity of generative models decrease over generations, leading to a synthetic distribution with little resemblance to the real one and to a ‘self-consuming’ loop (Shumailov et al., 2023). Downstream models can grow more and more distant from the reference distribution of the original dataset; at the same time, it is difficult to discern generated texts or images from those produced by humans, and often there is no indication that a piece of data has been synthesised, making it indistinguishable from data of human provenance (Johansson et al., 2023: 17). As the adoption of AI models trained on synthesised data continues to grow rapidly, driven by data scarcity and privacy regulation, these situations will only accelerate.
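The mechanism of such a self-consuming loop can be illustrated with a deliberately toy simulation, in which a ‘model’ (here just a fitted Gaussian) is repeatedly re-fitted on its own samples; the sample size and generation count are arbitrary assumptions:

```python
# A deliberately toy illustration of the self-consuming loop: a 'model'
# (a fitted Gaussian) is repeatedly re-fitted on its own samples. With
# small samples, the fitted parameters drift and the spread tends to
# shrink across generations; sample size and generation count are arbitrary.
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(0.0, 1.0, size=20)  # the original 'real' data, std = 1.0

for generation in range(20):
    mu, sigma = data.mean(), data.std()     # fit this generation's model
    data = rng.normal(mu, sigma, size=20)   # next generation trains on synthesis
    print(f"gen {generation:2d}: mean={mu:+.3f}  std={sigma:.3f}")
```

Nothing in this loop refers back to the original data after the first generation, which is the structural point: without continued access to fresh real data, estimation error compounds and diversity is progressively lost.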
The vast majority of works focusing on the responsible generation and use of synthetic data have highlighted the importance of ‘high-quality provenance information’, which means adopting an archival perspective on synthetic data curation, in which synthetic data can undergo a process of ‘watermarking’, that is, of identification and traceability, in order to distinguish it from real data (Calcraft et al., 2021; De Wilde et al., 2024; Wyllie et al., 2024). But are these measures sufficient to implement responsibility in the context of synthetic data generation and use? The issue of responsibility with emergent technologies like AI models has sparked much controversy over the last few years, since those models may give rise to what philosophers call ‘responsibility gaps’: it seems that somebody (e.g. individuals, organisations, and governments) should be held responsible for the outcomes of systems, but it is not clear who can be singled out along the chain of actions (Matthias, 2004). One suggestion in the philosophy of technology literature for filling responsibility gaps appeals to indirect forms of control, under the heading of ‘meaningful human control’. The theory of meaningful human control (MHC) promotes a ‘trace and track’ account: a ‘tracing’ condition, according to which systems should be designed so that the outcome of their operations can always be traced back to at least one human along the chain of design and operation; and a ‘tracking’ condition, according to which systems should be able to respond to the relevant moral reasons of the humans designing and deploying them and to the relevant facts of the environment in which they operate (Santoni de Sio and Mecacci, 2021; Santoni de Sio and van den Hoven, 2018).
The logic of traceability in terms of archival data curation (Jo and Gebru, 2020) responds to the need to flag and disclose the origin of synthetic data: it provides a means to recognise synthetic data as such, to evaluate its provenance along the complex chain of production, and to ensure that methods are in place for taking accountability for synthetic data spills (Wyllie et al., 2024). However, a focus on traceability alone is not sufficient to sustain the responsible use and generation of synthetic data. First, because traceability in this guise tends to be regarded as an inherently technical question, solvable through technical safeguards that allow data to be safely stored and its provenance traced, like the proposal for ‘watermark signals’ identifying the data source within the data itself (Calcraft et al., 2021: 24) or ‘generator cards’ that transparently state what information was (and was not) used to generate the data (Houssiau et al., 2022). But, as many have noted, these traceability techniques can be removed from a dataset, inadvertently or by malicious actors (Calcraft et al., 2021), and, more importantly, they focus on technical considerations of data accuracy and reliability without going beyond them.
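As an illustration of what such documentation might look like in practice, here is a minimal sketch of a machine-readable ‘generator card’ in the spirit of Houssiau et al. (2022); every field name and value is a hypothetical example, not a schema proposed by those authors:

```python
# A minimal sketch of a machine-readable 'generator card': a record, shipped
# with a synthetic dataset, of what was (and was not) used to generate it.
# Every field and value here is a hypothetical example, not a fixed schema.
from dataclasses import dataclass, asdict
import json

@dataclass
class GeneratorCard:
    generator: str              # model family used for synthesis
    source_dataset: str         # provenance of the original data
    variables_modelled: list    # what the generator saw
    variables_excluded: list    # what it deliberately did not see
    privacy_mechanism: str      # e.g. a differential privacy budget
    intended_use: str

card = GeneratorCard(
    generator="sequential parametric synthesiser",
    source_dataset="donated CV corpus, 2024 collection",
    variables_modelled=["education", "work_experience"],
    variables_excluded=["gender", "ethnicity"],
    privacy_mechanism="differential privacy, epsilon=1.0",
    intended_use="pre-deployment bias testing of CV ranking systems",
)
print(json.dumps(asdict(card), indent=2))
```

Even in this form, the card only records technical facts about the generation process; it does not by itself answer who is accountable for the choices those fields encode, which is the limitation developed next.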
Indeed, the second reason why traceability alone is not sufficient is that it leaves out considerations of the complex social practices that underpin and configure the generation and use of data and models. Meaningful forms of data curation should be able to trace how the labels and distributions of data change over time and across cultures, that is, they should pay attention to sociocultural inclusivity in their processes (Jo and Gebru, 2020). Beyond methods for technical traceability, there is another fundamental social task that can sustain responsibility: distributed AI power. Distributed AI power is a concept mobilised by proponents of algorithmic reparation that argues for co-creation between developers and community stakeholders, and is premised on undoing the existing power asymmetries, disproportionate risks, and dynamics that can inform the training of data (Davis et al., 2021).
To put this point in terms of the theory of MHC, technical traceability satisfies at most the tracing condition. The tracking condition further requires that synthetic data ecosystems remain responsive to the relevant moral reasons of the humans who design, deploy, and are affected by them, and this is precisely what distributed AI power seeks to secure.
In particular, proponents of algorithmic reparation argue for a shift from a fairness perspective in machine learning studies ‒ centred on the equal distribution of resources and benefits across social groups ‒ to a reparative justice perspective, which can use models to provide redress for past harms to people with marginalised intersectional identities (Davis et al., 2021). This shift is also relevant to the discourse on the responsible generation and use of synthetic data and needs further analysis in this context. Different actors are involved in synthetic data ecosystems, with different experiences of and power over managing, collecting, arranging, sharing, and auditing data. These actors have different degrees of control over synthetic data generation and different access to non-synthetic, fresh data (Shumailov et al., 2023). Therefore, to make sure that the process of data curation is sustained over time, there is a need to preserve access to fresh data and to share information about data provenance, but, more fundamentally, to ensure that the data ecosystem, as a particular political and economic system that advances a normative vision of how social issues should be understood and resolved, facilitates forms of data justice and democratic data governance (Dencik and Sanchez-Monedero, 2022).
Reparation
Recently, machine learning scholars have discussed the possibility of introducing the algorithmic reparation perspective into synthetic data generation as a way to promote social equity and justice (Wyllie et al., 2024). For example, models can be used for positive and intentional interventions in their data ecosystems, creating induced distribution shifts through progressive intersectional categorical sampling, for example, using sensitive data like race and gender and making the synthetic training data representative of intersectional identities (Wyllie et al., 2024). Following a reparation perspective, practitioners do not aim to mitigate bias or render sensitive data like gender invisible, but to leverage such data to benefit marginalised communities, in consideration of the non-ideal, real-world scenarios in which they operate, where inequalities and discrimination are systemic and entangled (Davis et al., 2021).
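A minimal sketch of such an intentional, intersectional sampling intervention might look as follows; the column names, group labels, and target shares are illustrative assumptions, not the procedure of Wyllie et al. (2024):

```python
# A minimal sketch of an intentional, intersectional sampling intervention:
# a synthetic training set is drawn so that intersectional groups appear at
# chosen target rates rather than at their historical rates. The columns,
# group labels, and target shares are illustrative assumptions.
import pandas as pd

def reparative_sample(df: pd.DataFrame, targets: dict,
                      n: int, seed: int = 0) -> pd.DataFrame:
    """Draw a training set whose intersectional groups appear at the
    given target shares rather than at their historical shares."""
    parts = []
    for (gender, ethnicity), share in targets.items():
        group = df[(df["gender"] == gender) & (df["ethnicity"] == ethnicity)]
        parts.append(group.sample(int(round(n * share)), replace=True,
                                  random_state=seed))
    return pd.concat(parts, ignore_index=True)

# e.g. deliberately equalised representation across four intersections:
# reparative_sample(df, {("woman", "black"): 0.25, ("man", "black"): 0.25,
#                        ("woman", "white"): 0.25, ("man", "white"): 0.25},
#                   n=10_000)
```

The crucial point is that the target shares are not discovered in the data but chosen: they encode a normative judgement about what a just distribution would look like, which is what distinguishes reparative sampling from mere rebalancing.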
The adoption of an algorithmic reparation approach in the context of synthetic data can serve to implement strategies for distributed AI power. Reparation measures in data practices can indeed contribute to raising awareness and understanding of the equity and fairness harms that may arise, and provide a venue for facilitating forms of data justice that go beyond the logic of technical traceability in the broader data setting. But if, on the one hand, the inclusion of reparation in synthetic data generation studies can be useful as a critical framework for addressing implicit (and unjust) socio-cultural dynamics and accounting for intersectional identities, on the other hand, it still focuses on mathematical and technical solutions to be optimised. As such, it fails to explicitly address the question of ‘agency’ over models (Wyllie et al., 2024: 14–15), which rests with different actors holding different powers, and of how these actors can influence model changes and shifts in the data ecosystem over time.
The EU General Data Protection Regulation (GDPR) prohibits the use of special categories of data (e.g. information revealing racial or ethnic origin) (Article 9(1), GDPR). However, the recent final draft of the AI Act provides exceptions to the GDPR that allow the use of such data for bias detection, provided this usage is subject to appropriate safeguards; specifically, synthetic or anonymised data are regarded as appropriate safeguards that enable bias detection without the use of sensitive data (Article 10(5)a, AI Act).
Yet the definition of appropriate safeguards remains unclear, and the AI Act neither gives a concrete indication of who decides what the appropriate safeguards are (e.g. providers, controllers, or organisations), nor elucidates the risks associated with the collection of sensitive data (van Bekkum and Zuiderveen Borgesius, 2023: 12). Moreover, the adoption of safeguards does not remove the need to use sensitive data: sensitive data from original datasets remain essential for the development and validation of bias detection models, even when using synthetic data, since such data must still be collected to create a synthetic dataset, and controlled access to it must be ensured throughout the downstream tasks of models and over generations, to avoid degradation and self-consuming loops. In this scenario, one solution to enable bias detection is to assign the collection, storage, and discrimination analysis of sensitive data to trusted third parties, that is, neutral parties that hold sensitive data and run bias analyses on their premises (Berendsen and Beauxis-Aussalet, 2024).
However, it is unclear who can serve as a trusted third party: governments, governmental organisations like the national statistics bureaus of member states that already collect demographic data at large scale, consumer rights groups, civil society research groups, consultancy and accountancy firms, and many others (Veale and Binns, 2017). Each of these entities has different levels of technical expertise, different requirements of transparency and trustworthiness, and varying auditing competence and involvement of marginalised groups in its activities. For example, depending on the context, it may be appropriate to involve trade unions where models and data are deployed in human resources decisions, while NGOs might be better suited to cases of historical bias, since they are perceived as more trustworthy by marginalised communities (Veale and Binns, 2017).
The adoption of algorithmic reparation measures in the machine learning community should, instead of focusing on technical implementations alone, take a more critical approach to realising distributed AI power. Beyond maintaining data ecosystems, reparation as a critical approach has the fundamental task of enacting co-creation data practices that are contextually and institutionally grounded, addressing different power dynamics and socio-technical problems. Moreover, implementing algorithmic reparation measures has another important limitation. With their focus on past harms, these measures neglect the analogical perspective introduced by synthetic data, which is regulative and relational, and which opens up the possibility of generating (future) scenarios for evaluating phenomena. Synthetic data and models need to adapt constantly to new scenarios and changes in data distribution, and to new reconfigurations of what is (considered) a ‘fair’ distribution. This is because synthetic data takes its value from precisely the same space as the real data, and if there is a shift in the distribution of real data, then the synthetic data may no longer be fair (Jordon et al., 2022: 29).
Consider the case of LLMs trained on the current corpus of online text, which comprises both human-produced and synthetic texts. As already noted, in these online environments there can be cases of ‘synthetic data spills’. Studies have shown that LLMs can present inherent limitations like misrepresentation and group flattening, since their generated responses can fail to recognise emergent within-group heterogeneity, for example, missing that not all non-binary people use they/them pronouns (Wang et al., 2024). These limitations are likely to persist without critical attention to how the emergent nuances, socially accepted norms, and complexity present in real-world scenarios can be captured, and without empirically better techniques for integrating them into models’ vast training data.
Concluding discussion. On responsibility
Building a ‘high-quality’ synthetic dataset requires not only technical accuracy but also an awareness of how synthetic data both shapes and is shaped by broader data ecosystems. Within these ecosystems, diverse actors must navigate and make explicit the often competing values that inform their choices, while engaging in contextually and institutionally grounded co-creation practices that reflect and sustain distributed AI power. Yet, creating ‘meaningful’ synthetic datasets goes further: it entails fostering responsible AI and data innovation ecosystems – systems that prioritise responsiveness to ethical and social concerns, enable anticipation of potential consequences, promote critical reflection, and integrate dynamic notions of justice and governance into their structure (Stahl, 2022, 2023).
The adoption of this ecosystem metaphor has the advantage of providing a strong conceptual basis for an improved understanding of the social reality surrounding synthetic data, and can draw on relevant discourses in responsible innovation studies, which have already developed conceptual and empirical tools for understanding and shaping ecosystems (Stahl, 2023). For example, in responsible innovation studies, responding to new situations and uncertainty is a key aspect. Beyond a principle of inclusion, which is associated with mechanisms that integrate different views and perspectives, responsible innovation studies argue for a principle of responsiveness, that is, the capacity to change the shape or direction of innovation in response to stakeholder and public values and to changing circumstances (Stilgoe et al., 2013).
Responsibility has not only a negative backward-looking nature concerning blame or redress for something that has happened, but also a positive forward-looking nature that consists in promoting and achieving socially shared values (Nyholm, 2023; Santoni de Sio and Mecacci, 2021).
In the same vein, reparation does not stand only for accountability and redress for past injustices, but implies a constructive worldmaking project bearing on present and future justice. Adopting an ecosystem perspective on responsibility and reparation might be a way to address the ‘dilemma of societal alignment’, or ‘value alignment’, that is, shaping science, technology, and innovation to ensure that their development processes are responsive and aligned with the values and needs of different publics (Ribeiro et al., 2018). In this article, I have demonstrated that the very concept of synthetic data introduces an analogical perspective on data, meaning that it allows us to frame data practices as regulative and relational practices that aim to develop knowledge, and through which questions of responsible innovation can be asked to make those practices more responsive to societal challenges. Meaningful synthetic data is not data in a vacuum: it is data whose generation and use are embedded in responsible AI and data innovation ecosystems, responsive to societal challenges and emerging human values.
To allow a richer understanding and practical implementation of real-life responsible AI and data innovation ecosystems, one avenue would consist, for example, not only in the creation of technical methods to demand redress and compute damages for past injustices, according to a backward-looking view, but, more fundamentally, in the adoption of methods for community-based perspectives, like participatory design and participatory action research (Costanza-Chock, 2020; Santoni de Sio, 2024). These perspectives account for heterogeneous communities involving unequal power relationships and multiple, sometimes conflicting, interests, and can instigate structural changes aimed at preventing societal harms and discrimination and at actively promoting socially shared values and more equitable forms of justice (Zhang, 2023). This can be a starting point for considering diverse positions in a pluralist society within the synthetic data ecosystem, one that does not correct or neutralise multiple perspectives within a logic of technical optimisation, but instead treats synthetic data generation as a collective and open-ended data practice.
Acknowledgements
Not applicable.
Ethical approval
Not applicable.
Informed consent
Not applicable.
Funding
The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the FINDHR (Fairness and Intersectional Non-Discrimination in Human Recommendation) project that received funding from the European Union's Horizon Europe research and innovation program under grant agreement No 101070212. Views and opinions expressed are however those of the author only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.
Declaration of conflicting interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data availability statement
Data sharing not applicable to this article as no datasets were generated or analysed during the present study.
