Abstract
What happens when there is not enough data to train machine learning algorithms? In recent years, so-called ‘synthetic data’ have been increasingly used to add to or supplement the training regimes of various machine learning algorithms. Seeking to read the notion of supplementarity differently through an engagement with the work of Jacques Derrida, I propose that the nascent emergence of synthetic data embodies what I call
Introduction
In June 2022, I interviewed two data scientists from a US-based tech company about their use of so-called ‘synthetic data’. When asked what happens when there is not enough data to train their machine learning algorithms, one of them responded:
What if I had a small population, right? So that might be a population that’s not represented well enough to train a model. Well, if I make more of that population with synthetic data – so augment the real data, right, it’s not replaced the real data but augmented it – can I build more effective models, more accurate models?
It was clear from the subsequent discussion that the question was rhetorical. He argued that synthetic data, produced by algorithmic models trained on real data, had the power to augment machine learning algorithms and their training datasets, making them more accurate and their populations more representative. What was put forward was a particular notion of
What does it mean to algorithmically enlarge, add on, or ‘correct’ a population? In the examples just mentioned, the specific methodologies are different, yet the assumption remains the same: one can variously add to the algorithm and its training dataset, and in so doing address and resolve the fundamental ethico-politics of machine learning. The supplement figures here as a necessary means of addressing the gaps, lacks, and imbalances of algorithms. The emergence of synthetic data takes this logic of supplementarity even further. It is claimed that if additional data cannot be extracted from society then they can be algorithmically generated, adding a further dimension to the social power of algorithms (Beer, 2017; Bucher, 2018). There is therefore a seductive and promissory element to what I call
In the above examples, to supplement implies to enhance, to add, to optimise. But this crucially rests on a narrow, specific reading of what it means to supplement algorithmic systems. This paper seeks to read the supplement differently, drawing on the work of Jacques Derrida. By drawing attention to both its fundamental necessity
Before moving on, it is worth defining what is meant by synthetic data. They are, broadly speaking, data points that have been generated by deep generative algorithms such as GANs, VAEs, large language models (LLMs), or diffusion models to be used as training data for other algorithms (Courville et al., 2016; Nikolenko, 2019, 2021). Many computer engineers combine several methods and architectures to produce synthetic data, ranging from GANs, Transformers, and 3D modelling pipelines to more statistical methods such as Bayesian network algorithms. Whatever the approach, synthetic training data are generally generated by machine learning models that extract features, attributes, and correlations from the real training dataset on which they have been trained. Synthetic data are often used either to reproduce the salient attributes of a given data example (e.g. synthetic facial recognition images that capture the texture and contours of a real face) or to approximate the overall statistical distribution of some real-world dataset (e.g. a synthetic insurance dataset that mimics the patterns of real persons without referring to any real persons) (Nikolenko, 2021). Although relying on the extraction of and learning from real training data, synthetic data are never simply the pure reproduction or ‘copy’ of the real. The aim is to produce data points that are as proximate as possible to the training data without being an identical mapping of them. It is not always clear, however, if this has been successfully achieved in practice (see Carlini et al., 2023). As such, synthetic data inhabit a relationally fraught nexus between the frameworks of extraction described by Zuboff (2019) and Crawford (2021) and the generative logic of deep learning algorithms that operates through ‘the inductive retrieval and recombination of infinite data volumes’ (Parisi, 2019: 4).
While there is nascent critical scholarship on synthetic data (Jacobsen, 2023; Steinhoff, 2022), there remains a need to explore the various ways in which they reconfigure the conditions of possibility for machine learning algorithms.
The paper is divided into four sections. I begin by discussing the notion of supplementarity, drawing on Jacques Derrida’s work. The second section foregrounds the fundamental condition of possibility for the logic of the synthetic supplement: the perceived ‘lack’ within machine learning in our algorithmic societies. The latter half of the paper examines how the synthetic supplement actually operates. Here, I claim that two crucial dimensions of the logic of the synthetic supplement are that of
On the Supplement
In order to fully explain why this double movement matters, we need to read the notion of supplementarity differently – it is never simply the addition that resolves. The term has a long lineage in philosophy, most notably in the work of Jacques Derrida. The concept runs through Derrida’s whole corpus. The most detailed treatment of the term, however, can be found in his 1967 book
In terms of its necessity, it is important to note how the supplement is intimately connected with Derrida’s wider vision of deconstruction (see for instance Culler, 1983; Critchley, 1992). Central to deconstruction is a claim about a fundamental, asymmetrical tension in Western philosophy between speech and writing, presence and absence. In these dichotomies, there is a metaphysical privileging of sameness, presence, and speech over difference, absence, and writing. Derrida states in
Yet, Derrida foregrounds a fundamental problem with this mode of thinking. As he puts it, ‘In the spoken address, presence is at once promised and refused’ (Derrida, 1976: 141). Speech, rather than being the embodiment of full (metaphysical) presence, contains within itself its own limits. Through a reading of Rousseau, Derrida outlines how writing is repeatedly seen as destructive, fallible, and deceptive. Yet, it also figures as a redemptive force insofar as it brings back what was lost in speech. Writing therefore emerges as a ‘necessary supplement’ to speech because ‘when Nature, as self-proximity, comes to be forbidden or interrupted, when speech fails to protect presence, writing becomes necessary. It must
This links to the second core aspect of the term: its two-pronged definition. Derrida suggests that the concept of the supplement in French has two principal meanings, neither of which can be understood independently of the other. Firstly, he states that ‘the supplement adds itself, it is a surplus, a plenitude enriching another plenitude, the
Ideas of lack and necessity bring us to the final element fundamental to Derrida’s notion of the supplement, namely its inherent impossibility. ‘Nature is somewhere incomplete’, he writes in his 1972 text
The ‘Lack’ of Machine Learning
How exactly, then, does Derrida’s conceptualisation of the supplement help us make sense of the ethico-politics of machine learning? Derrida’s term helps foreground an idea of
In these narratives, there is a persistent foregrounding of the perceived lack of machine learning. This lack is often understood in terms of training data. One computer engineer explains that ‘in general, many problems of modern AI come down to insufficient data: either the available datasets are too small or, also very often, even while capturing unlabeled data is relatively easy the costs of manual labeling are prohibitively high’ (Nikolenko, 2019: 3). Similarly, others have argued that there is a pervasive ‘data scarcity’ problem in machine learning, ‘as in many fields a sufficient amount of data is not available to train the model’ (Bansal et al., 2022: 6). In other cases, however, there is a more specific reference to a lack of racial representation in data (see Buolamwini and Gebru, 2019; Crawford and Paglen, 2019). It is against this notion of lack broadly understood that synthetic data have emerged as a desirable and even necessary solution. This is especially the case in domains or ‘data-limited regimes’ (Hoffmann et al., 2019) where ethical questions about the use of sensitive personal data are common (Chen et al., 2021). Echoing Derrida, the notion of lack in machine learning thus foregrounds the necessity of synthetic data as supplement, narrowly conceptualised as a mechanism of addition, augmentation, and correction. However, as the recent case of Meta and their large language translation model has shown, lack is by no means restricted to data-limited regimes (Fan, 2020). The lack of machine learning is pervasive.
So, what is at stake here? Synthetic data are increasingly used as a response to the challenges of lack in machine learning. They embody an emergent logic of supplementarity, which promises to address and resolve the ethico-political issues of algorithms. This logic derives its sense of necessity and immediacy from that which is always necessarily lacking in algorithmic models and their training datasets: a population that is too small or skewed, a particular skin type unaccounted for, the harm resulting from biased outputs. Indeed, Derrida (2007: 42) wrote that the fundamental promise of the supplement was that ‘it adds on, and thus inaugurates, it is an addition that serves to complete a whole, to fill in where there is a gap and thus to carry out a program’. Similarly, the program of the synthetic supplement attains a seductive and promissory veneer through claims that ‘synthetic data is an important approach to
As a result, there are ethico-political issues at stake in this logic of the synthetic supplement. In the sections that follow, I critically examine two crucial dimensions of this logic: the notions of
Imbalance: ‘Boost Those Underrepresented Groups and Balance Out the Real Data’
A crucial aspect of the logic of the synthetic supplement is a promise of completion. But the notion of completion here differs from something like Big Data’s ‘push towards total knowing’ (Steyerl and Crawford, 2017) or ‘the utopia of the infinite inventory’ (Ewald, 2020: 81). Instead, the synthetic supplement promises to address issues of
Yet, this logic always falls short of such promises precisely because it is never simply an addition. Instead, it is an active intervention into the data distribution of an algorithm, transforming what can be seen and acted upon. As one data scientist and epidemiologist working with medical synthetic patient data explains:
Sometimes politically there are some things which are important to us, for example, ensuring that certain box subgroups of the population are not underrepresented in the data. So, we expanded our synthetic data generation methods, and we first devised a methodology to be able to know what outcome the algorithm was meant to predict. Based on that outcome and based on what the key variable of interest was, let’s say it was ethnicity, we could work out automatically if any of the ethnic subgroups were underrepresented in that ground truth data, without making any assumptions. So, it would automatically detect [. . .] and it would say we feel you do not have enough cases in Chinese ethnicity in that group so you need to boost these. [. . .] We would then always sample from the real data, Chinese cases, and we would then do the whole synthetic data generation process so that the synthetic data would boost those underrepresented groups and balance out the real data.
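The procedure described in this interview, detecting an underrepresented subgroup and then generating synthetic cases from its real members until the groups are balanced, can be sketched roughly as follows. This is only an illustration under stated assumptions: the interviewee's actual method is not specified, the function names `find_underrepresented` and `boost_group`, the `min_share` threshold, and the toy records are all hypothetical, and a naive resample-and-jitter step stands in for a trained generative model.

```python
import random
from collections import Counter


def find_underrepresented(records, key, min_share):
    """Return {group: count} for groups whose share of records is below min_share."""
    counts = Counter(r[key] for r in records)
    total = len(records)
    return {g: n for g, n in counts.items() if n / total < min_share}


def boost_group(records, key, group, target_count, noise=0.05, seed=0):
    """Generate synthetic records for one group until it reaches target_count.

    Assumption: each synthetic record is a real case from the group with its
    numeric 'value' jittered by Gaussian noise. A real pipeline would use a
    trained generative model instead of this stand-in.
    """
    rng = random.Random(seed)
    real = [r for r in records if r[key] == group]
    needed = max(0, target_count - len(real))
    synthetic = []
    for _ in range(needed):
        base = rng.choice(real)  # always sample from the real cases
        synthetic.append({
            key: group,
            "value": base["value"] * (1 + rng.gauss(0, noise)),
            "synthetic": True,
        })
    return synthetic


# Hypothetical toy dataset: group A has 90 real cases, group B has 10.
records = (
    [{"group": "A", "value": 1.0 + 0.01 * i, "synthetic": False} for i in range(90)]
    + [{"group": "B", "value": 2.0 + 0.01 * i, "synthetic": False} for i in range(10)]
)

under = find_underrepresented(records, "group", min_share=0.3)

augmented = list(records)
for g in under:
    augmented += boost_group(records, "group", g, target_count=90)

shares = Counter(r["group"] for r in augmented)
n_synthetic = sum(1 for r in augmented if r["synthetic"])
```

In this toy run, the group shares move from 90/10 to 90/90, and every one of the added cases for group B is derived from the same ten real records. The sketch therefore changes the data distribution the downstream model sees rather than leaving it intact with a few extra rows.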
A similar point, echoing a form of supplementarity as ‘boosting’ and emphasising its impact on model performance, was raised in another interview: ‘you can definitely also create more examples which helps downstream models trained on synthetic data to just have better performance because we see more examples to have a more balanced dataset to be trained on’. Yet, there are two points of ethico-political tension here. First, the synthetic supplement participates in the reduction of algorithmic harms and biases to what Amaro (2022: 46, 48) calls a ‘problematic of representation’, where ‘the desire for representation’ is seen as the solution yet inevitably leads to the institutionalisation of certain representations that are ‘devoid of the dynamisms of Black life’. Second, this promise to resolve political imbalances is problematic because it obfuscates how machine learning can reconfigure the very notions of race and ethnicity. As Phan and Wark (2021) have argued, new modes of racialisation and racism emerge through the hidden layers of neural networks. ‘In the absence of explicit racial categories’, they state, ‘computational systems are still able to racialize us’ (p. 3), inferring categories of race from correlations and attributes in the data. While Derrida argued for an openness towards the other, the way algorithms learn to infer categories such as race from features and attributes forecloses the otherness of the other (Amoore, 2020; Critchley, 1992). As such, the logic of the synthetic supplement may figure as a necessary means of intervening in the imbalances in algorithmic systems, but machine learning algorithms also produce new modes of subjectification and racialisation.
This normative project even goes beyond the specific generation and use of synthetic data. There is an increasingly widespread dream of
Yet, the synthetic supplement falls short of itself precisely because it constitutes an active intervention into the data distribution of an algorithm. It never ‘just’ adds. It transforms what can be seen and acted upon by the algorithm. This means that there is a fundamental limit to the dream of perfect balance. As Derrida (1976: 145) put it, the supplement ‘adds only to replace. It intervenes or insinuates itself.’ Similarly, the synthetic supplement never completes nor renders whole. Instead, it reconfigures the very thing it promises to simply augment. This is well illustrated in the work of artist and researcher Anna Ridler (2020). For her 2018 project
Ridler’s GAN illustrates the limits of the logic of the synthetic supplement: although it promises balance, it falls short of itself. The algorithm always embodies this double movement. It generates not only something that is imbalanced but also fundamentally intervenes into what is meant by a population of tulips, here reconfigured as predominantly red. It follows that while generative algorithms such as GANs, VAEs, or diffusion models are highly constrained and circumscribed by their training datasets, they are not fully determined by them. They generate more than just what was input. They problematise what de Man (1971) has called ‘the conformity to origin’. That is, they do not merely reproduce or conform to data but instead generate something new, something
Absence: ‘Understand the Gaps and Generate Them’
Another key dimension of the logic of the synthetic supplement is that of absence. While imbalance refers to disproportional representations of certain classes or examples in data distributions, the synthetic supplement also promises to intervene in spaces where certain classes or examples are wholly
The synthetic supplement is seen by many computer engineers as a promising solution to the problem of gaps and absences in machine learning. As stated by the co-founder and Chief Technology Officer of one synthetic data company:
If you know what you are missing in your data, it does guide the machine to generate it. Then, of course, when you automate this the next step would probably be to automatically understand the gaps and generate them: this could be an anomaly or missing part, the bias in your data.
Here, the gap or the absence in the data figures as ‘the guide’, as that which directs the generative process. Crucially, the gap is also seen as a way to resolve some of the biases of machine learning. A similar point is raised by Rev Lebaredian, Vice President of Simulation Technology at NVIDIA:
With synthetic data it’s easier for us to create diversity of data. If I’m generating images of humans and I have a synthetic data generator, that allows me to change the configurations of people’s faces, their skin tone, eye colour, hairstyle, and all of those things. (Lebaredian quoted in Strickland, 2022)
Addressing the issue of algorithmic bias and the role of synthetic data, Lebaredian claims that ‘we can construct ideal worlds with the diversity that we want, and our AIs can be better for it’ (Lebaredian quoted in Strickland, 2022). Here, the world according to the algorithm is understood predominantly in terms of lack, insufficiency, a series of gaps in representations of identity. Machine learning algorithms are seen in terms of partiality, not only in relation to the accounts they regularly give of themselves (Amoore, 2020), but also in terms of their training regimes and the extent to which they are able to account for different aspects in the world. According to the logic of the synthetic supplement, the world appears as lacking, as biased, as underrepresented and yet nonetheless resolvable by the generation and incorporation of synthetic data. In short, the synthetic supplement promises to resolve the ethico-politics of machine learning through the filling of gaps in the data distribution.
The logic of the synthetic supplement delineates the problem space of algorithms in terms of gaps and absences. It promises that these
In contrast to its promises, the logic of the synthetic supplement is also that which always falls short of itself. As Derrida argued, if something needs supplementing then it is because it is always already incomplete and insufficient. Hinting at this impossibility of the synthetic supplement, one data scientist and medical researcher explained in an interview:
GANs seem to work really well with images, I mean phenomenally well, but what they’re doing is, you know, basically supplementing their datasets. . . . What I think we need to be careful of with some of these generative models is assuming we can fill gaps in data that are unfillable. We can’t create knowledge from nothing if there’s no underlying signal there in the first place, and so we need to be slightly wary of synthetic data in that respect.
Although medical training data for machine learning is notoriously noisy and sparse, the data scientist still expressed an unease about the use of generative models in generating additional data. The unease derived from the question of whether the supplement ‘simply’ adds to the data distribution or does something more fundamental to it. Does it create something from nothing? Emphasising that there are gaps in data that remain fundamentally unfillable, he stated in the interview: ‘you’re not going to magically create data that’s got signals that were never there before’. While this point resonates with the larger argument of this paper, the promise to fill a gap embodied by the logic of the synthetic supplement actually stands in a more fundamental tension with how machine learning algorithms generally operate. As Parisi (2019: 23) has argued, the power of machine learning algorithms derives from their abductive capacities to ‘learn from incomplete information’ and thus be ‘able to classify new cases that may otherwise remain incomplete or not fully specified’. Or as Amoore (2011: 28) puts it, algorithms infer from ‘across the gaps between data’ in order to ‘project onto an array of uncertain futures’. This means that the attempt to account for the absent fails to take into account how gaps are actually generative mechanisms for algorithms. They learn from gaps and absences in data to generate new rules and hypotheses about future possibilities.
The generativity of gaps echoes Derrida’s conceptualisation of the supplement: it not only signals a need and lack but also a plenitude and a space of intervention. It always already inhabits a space of semantic tension: it simultaneously refers to ideas of addition and completion as well as replacement and transformation. Similarly, the synthetic supplement promises to simply fill an absence, a gap in the data distribution, but there is no ‘simple’ addition. It necessarily falls short of these promises, because it is always an active reconfiguration of that data distribution, bringing into being what did not exist beforehand: a small population augmented, a ‘balanced’ dataset of tulips made red, additional skin types accounted for. That which is supplemented is made different. The attempt to algorithmically fill a gap generates other gaps, all of which are generative for machine learning algorithms. The synthetic supplement is incapable of resolving the fundamental incompleteness and tensions of algorithms precisely because it generates new gaps and frictions, thus keeping the space of the ethico-politics of machine learning perpetually open.
Conclusion: The Impossibility of Ethics
In this paper, I have interrogated the logic of the synthetic supplement, how it operates, and why it matters for the ethico-politics of algorithmic societies. Drawing on Derrida’s work on supplementarity, I have argued that there is a double movement at the heart of the logic of the synthetic supplement. On the one hand, it promises to resolve the tensions inherent in algorithms. This could be to balance out a data distribution, to fill in the gaps, to add more varied skin types to a dataset, or to enlarge and augment a small or skewed population. As such, the logic of the synthetic supplement frames the emergence of synthetic data as a promissory way to resolve entrenched issues in algorithmic systems. However, this logic always falls short of itself, revealing an inherent tension. It is incapable of resolving these issues precisely because it never simply adds to a data distribution. It is always an active reconfiguration of that which it claims to add to, bringing into being new possibilities of imbalance and absence. It is simultaneously an addition and an intervention. As such, there is a need for a renewed attentiveness to the ways in which machine learning models engender new conceptions and parameters of race and ethnicity (Amaro, 2022; Phan and Wark, 2021).
What is at stake with the logic of the synthetic supplement more broadly? Is it relevant only in terms of the emergence of generative algorithmic models and synthetic data? The logic of the synthetic supplement assumes a particular idea of ethics in the context of algorithms. Therefore, the logic is suspect in a way that goes beyond its interventions into the imbalances and absences of machine learning: it reinforces a much broader conceptualisation of ethics as simply an ‘add on’ to algorithms. It appears to promise the resolution of politics in algorithmic societies. Any ethical issue – race, gender, bias, representation, harm – can be solved merely by adding to algorithms and their datasets. In this logic, ethics becomes a supplement easily added. Addition becomes a means of correction and completion. This is not a new phenomenon, however. Reardon (2011) writes how, during the genomics research of the 1990s, committees such as the Human Genome Diversity Project were ‘casting ethics as distinct from science – as something that does not inhere in science but instead needs to be done along with it’ (p. 219). From this, a conceptual scheme emerged that ‘posited ethics as something that could be added onto science – and not something that was unavoidably implicitly in it’ (p. 231).
The issue outlined by Reardon resonates with how the logic of the synthetic supplement operates. For when ethics is conceived in supplementary terms, understood narrowly as an add-on, it creates the conditions of possibility for a conceptualisation of ethics as independent from the making and training of algorithms. This logic ‘casts’ the ethics of algorithms as resolvable through the simple addition of more data, features, attributes, or categories. The result is a dream of complete, balanced, and ethical algorithms. One generates to supplement to resolve. However, this is a narrow reading of what is meant by supplementarity. The logic of the synthetic supplement obfuscates how ethics is always already enmeshed with machine learning. It conceals different forms of violence as well as asymmetrical power relations (Bellanova et al., 2021). Of course, the field of AI ethics deals with inherently complex issues, ones which rarely have straightforward solutions. But it has also been critiqued for being ‘toothless’ (Green, 2021), constituting ‘a lack of friction between ethical principles and existing business principles’ (Munn, 2023: 4). Moreover, Amoore (2022: 21) has argued that ‘the notion that machine learning algorithms could be subject to good governance via regulation, or “AI ethics”, appeals to a different epistemic order than that which is itself generated by deep learning algorithms’, an order ‘generative of new norms and thresholds of what “good”, “normal”, and “stable” orders look like in the world’ (p. 21). As many have already argued, ethics cannot be reduced solely to paradigms of inclusion and fairness (e.g. Hoffmann, 2019). It must start from the place of the fundamental ‘partiality’ of all machine learning algorithms (Amoore, 2019, 2020), as well as the ‘irremediable incompleteness’ (Suchman cited in Bellanova et al., 2021) of all training datasets.
While frameworks of AI ethics and governance are necessary, there is also a need to emphasise what Derrida (2007) has called ‘the impossibility of ethics’ in the context of machine learning. ‘I would say that deconstruction loses nothing from admitting that it is impossible’, he writes, adding that ‘for a deconstructive operation, possibility is rather the danger, the danger of becoming an available set of rule-governed procedures, methods, accessible approaches’. ‘The interest of deconstruction’, he continues, ‘is a certain experience of the impossible: that is, as I shall insist in my conclusion, of the other’ (p. 15). On the one hand, this means that ‘ethics, and justice, can find no privileged ground for their articulation’ (Keenan, 1990: 1681). That is, the synthetic supplement cannot be seen as a privileged ground for how to articulate machine learning orders, for there is none. On the other hand, it means conceiving of ethics as ‘a certain experience of the impossible: that is [. . .] of the other – the experience of the other as the invention of the impossible’ (Derrida, 2007: 15). As opposed to the promises embodied by the logic of the synthetic supplement, the ethico-politics of machine learning is a never-ending, ever-shifting, impossible encounter with and responsibility for the other. The gap of the other in machine learning, the gap that is the space of the other, can never be completely filled or supplemented. It always remains open.
Acknowledgements
This paper has benefitted enormously from many fruitful discussions with Louise Amoore, Ludovico Rella, and Alexander Campolo on topics such as machine learning, synthetic data, and ethics. My thanks also go to the four anonymous reviewers as well as TCS board members for their careful engagement and comments.
Funding
The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the European Research Council (ERC) under the Horizon 2020 Framework Programme (Advanced Investigator Grant ERC-2019-AdG-883107-ALGOSOC).
