The Logic of the Synthetic Supplement in Algorithmic Societies

Abstract

What happens when there is not enough data to train machine learning algorithms? In recent years, so-called ‘synthetic data’ have been increasingly used to add to or supplement the training regimes of various machine learning algorithms. Seeking to read the notion of supplementarity differently through an engagement with the work of Jacques Derrida, I propose that the nascent emergence of synthetic data embodies what I call the logic of the synthetic supplement in algorithmic societies. I argue, on the one hand, that the synthetic supplement promises and claims to resolve the ethico-political tensions, frictions, and intractabilities of machine learning. On the other hand, it always falls short of these promises because it necessarily intervenes in that which it claims to merely augment. Ultimately, this means that the gaps and frictions of machine learning cannot be completely filled, supplemented, or resolved.

Keywords

Derrida, ethics, machine learning, race, synthetic data

Introduction

In June 2022, I interviewed two data scientists from a US-based tech company about their use of so-called ‘synthetic data’. When asked what happens when there is not enough data to train their machine learning algorithms, one of them responded:

What if I had a small population, right? So that might be a population that’s not represented well enough to train a model. Well, if I make more of that population with synthetic data – so augment the real data, right, it’s not replaced the real data but augmented it – can I build more effective models, more accurate models?¹

It was clear from the subsequent discussion that this question was rhetorical. He argued that synthetic data, produced by algorithmic models trained on real data, had the power to augment machine learning algorithms and their training datasets, making them more accurate, their populations more representative. What was put forward was a particular notion of supplementarity in relation to machine learning: a logic of enhancing, adding to, and augmenting data and algorithms. This is not an isolated example. In many critical responses to the harms of machine learning, a similar understanding of supplementarity is at play. In their ground-breaking 2018 paper titled ‘Gender Shades’, computer scientists Joy Buolamwini and Timnit Gebru point out the racial disparities and discriminatory practices of facial recognition systems resulting from the underrepresentation of black faces in training data. Their response mobilises a particular conception of the supplement as a tool of racial mitigation: ‘our work advances gender classification benchmarking by introducing a new face dataset composed of 1270 unique individuals that is more phenotypically balanced on the basis of skin type than existing benchmarks’ (Buolamwini and Gebru, 2018: 2). Echoing the data scientists generating synthetic data, the response here to the ethico-political issues of algorithms and their training data is to supplement these with ‘balanced’ distributions of skin types. In short, the solution to the risks and harms of AI is to supplement the algorithm, to make it more representative and inclusive, to add diversity.

What does it mean to algorithmically enlarge, add on, or ‘correct’ a population? In the examples just mentioned, the specific methodologies are different, yet the assumption remains the same: one can variously add to the algorithm and its training dataset, and in so doing address and resolve the fundamental ethico-politics of machine learning. The supplement figures here as a necessary means of addressing the gaps, lacks, and imbalances of algorithms. The emergence of synthetic data takes this logic of supplementarity even further. It is claimed that if additional data cannot be extracted from society then they can be algorithmically generated, adding a further dimension to the social power of algorithms (Beer, 2017; Bucher, 2018). There is therefore a seductive and promissory element to what I call the logic of the synthetic supplement: any small or skewed population, any data distribution, can be augmented through the generation and incorporation of synthetic data. Any algorithm can be made more accurate, representative, and less harmful. As such, the logic of the synthetic supplement embodies a ‘normative project’, that is, ‘a positive technique of intervention and transformation’ (Foucault, 2003: 50). A promise of inclusivity and representativeness.

In the above examples, to supplement implies to enhance, to add, to optimise. But this crucially rests on a narrow, specific reading of what it means to supplement algorithmic systems. This paper seeks to read the supplement differently, drawing on the work of Jacques Derrida. By drawing attention to both its fundamental necessity and impossibility, I argue that the logic of the synthetic supplement emerges as a promise to resolve the ethico-political tensions of machine learning. Yet, I also argue there is an inherent and unresolvable tension within the logic. It therefore always inhabits a double movement: it promises to resolve and yet it always already falls short of itself. This double movement foregrounds both the politics of claims to resolve and the fact that efforts to supplement algorithmic systems are never finished nor completely closed. Put differently, the promises of the synthetic supplement occupy an uneasy space between the ‘nearly-here’ and the always ‘out of reach’ (Beer, 2023: 3–4). As such, I contend that understanding the logic of the synthetic supplement is increasingly crucial for examining the ethico-politics of contemporary ‘algorithmic thought’ (Beer, 2023; Fazi, 2021), as well as our ‘algorithmic lives’ more broadly (Amoore and Piotukh, 2015; Cheney-Lippold, 2011).

Before moving on, it is worth defining what is meant by synthetic data. They are, broadly speaking, data points that have been generated by deep generative algorithms such as generative adversarial networks (GANs), variational autoencoders (VAEs), large language models (LLMs), or diffusion models to be used as training data for other algorithms (Courville et al., 2016; Nikolenko, 2019, 2021). Many computer engineers even deploy a combination of methods and architectures to produce synthetic data, approaches ranging from GANs, Transformers, and 3D modelling pipelines to more statistical methods such as Bayesian network algorithms. However, synthetic training data are generally generated by machine learning models that extract features, attributes, and correlations from the real training dataset on which they have been trained. Synthetic data are often used to either reproduce the salient attributes of a given data example (e.g. synthetic facial recognition images that capture the texture and contours of a real face) or to approximate the overall statistical distribution of some real-world dataset (e.g. a synthetic insurance dataset that mimics the patterns of real persons without referring to any real persons) (Nikolenko, 2021). Although relying on the extraction of and learning from real training data, synthetic data are never simply the pure reproduction or ‘copy’ of the real. The aim is to produce data points that are as proximate as possible to the training data without being an identical mapping of them. It is not always clear, however, if this has been successfully achieved in practice (see Carlini et al., 2023). As such, synthetic data inhabit a relationally fraught nexus between the frameworks of extraction described by Zuboff (2019) and Crawford (2021) and the generative logic of deep learning algorithms that operates through ‘the inductive retrieval and recombination of infinite data volumes’ (Parisi, 2019: 4). While there is nascent critical scholarship on synthetic data (Jacobsen, 2023; Steinhoff, 2022), there remains a need to explore the various ways in which they reconfigure the conditions of possibility for machine learning algorithms.

The paper is divided into four sections. I begin by discussing the notion of supplementarity, drawing on Jacques Derrida’s work. The second section foregrounds the fundamental condition of possibility for the logic of the synthetic supplement: the perceived ‘lack’ within machine learning in our algorithmic societies. The latter half of the paper examines how the synthetic supplement actually operates. Here, I claim that two crucial dimensions of the logic of the synthetic supplement are those of imbalance and absence. Both of these latter sections also exemplify the double movement of the synthetic supplement: the promise to resolve and the falling short. Imbalance and absence signal the necessity of the synthetic supplement as well as its impossibility. Through an analysis of these critical dimensions, I argue that while the logic of the synthetic supplement is fuelled by promises to resolve the ethico-political tensions and intractabilities of machine learning, it always falls short of its promises. As these examples will show, it never completes nor fully closes.

On the Supplement

In order to fully explain why this double movement matters, we need to read the notion of supplementarity differently – it is never simply the addition that resolves. The term has a long lineage in philosophy, most notably in the work of Jacques Derrida. The concept runs through Derrida’s whole corpus.² The most detailed treatment of the term, however, can be found in his 1967 book Of Grammatology (Derrida, 1976). Drawing on this text as well as his wider work, I will foreground three main aspects of Derrida’s treatment of the supplement: its necessity; its two-pronged definition; and its impossibility. These aspects will be important for the analyses in the latter half of the paper, as they help reveal what is at stake in thinking machine learning through the logic of the synthetic supplement.

In terms of its necessity, it is important to note how the supplement is intimately connected with Derrida’s wider vision of deconstruction (see for instance Culler, 1983; Critchley, 1992). Central to deconstruction is a claim about a fundamental, asymmetrical tension in Western philosophy between speech and writing, presence and absence. In these dichotomies, there is a metaphysical privileging of sameness, presence, and speech over difference, absence, and writing. Derrida states in Of Grammatology (Derrida, 1976: 8) that if one looks at the dichotomy between, say, writing and speech in many philosophical texts, writing is persistently confined to ‘a secondary and instrumental function’ or an ‘interpreter of an originary speech itself shielded from interpretation’. As such, speech has long been seen as both historically and metaphysically originary to writing, which is reduced to a secondary function. The role of writing is ‘simply’ to mediate speech which, in turn, remains unchanged by the mediation.

Yet, Derrida foregrounds a fundamental problem with this mode of thinking. As he puts it, ‘In the spoken address, presence is at once promised and refused’ (Derrida, 1976: 141). Speech, rather than being the embodiment of full (metaphysical) presence, contains within itself its own limits. Through a reading of Rousseau, Derrida outlines how writing is repeatedly seen as destructive, fallible, and deceptive. Yet, it also figures as a redemptive force insofar as it brings back what was lost in speech. Writing therefore emerges as a ‘necessary supplement’ to speech because ‘when Nature, as self-proximity, comes to be forbidden or interrupted, when speech fails to protect presence, writing becomes necessary. It must be added to the word urgently’ (p. 144). In other words, the supplement is necessary. Language is always already in need of supplementing. It is never complete.

This links to the second core aspect of the term: its two-pronged definition. Derrida suggests that the concept of the supplement in French has two principal meanings, neither of which can be understood independently of the other. Firstly, he states that ‘the supplement adds itself, it is a surplus, a plenitude enriching another plenitude, the fullest measure of presence. It cumulates and accumulates presence’ (Derrida, 1976: 144–5). Yet, Derrida also writes that the supplement ‘adds only to replace. It intervenes or insinuates itself in-the-place-of; if it fills, it is as if one fills a void’ (p. 145). Thus, the supplement embodies an ambiguous semantic space: it simultaneously signifies plenitude, completeness, and self-sufficiency as well as replacement, compensation, and intervention. This definitional tension is always present. ‘The supplement, which seems to be added as a plenitude to a plenitude, is equally that which compensates for a lack’ (Derrida, 1978: 212). The supplement is both that which fills and actively intervenes. It supplies something that is missing, filling a void, but it also supplies something additional, something different. That is, it transforms that which is being added to. As Derrida emphatically states in Limited Inc, ‘As though an addition were ever simple! [. . .] As though an addition or repetition did not alter!’ (Derrida, 1988: 103). In short, the supplement both adds and alters. It never simply augments or optimises.

Ideas of lack and necessity bring us to the final element fundamental to Derrida’s notion of the supplement, namely its inherent impossibility. ‘Nature is somewhere incomplete’, he writes in his 1972 text Dissemination, and ‘it lacks something needed for it to be what it is, that it has to be supplemented’ (Derrida, 2016: 40). Elsewhere he states that ‘if speech must be “added” to the thought identity of the object, it is because the “presence” of sense and speech had already from the start fallen short of itself’ (Derrida, 1973: 87). The supplement is not only necessary but is also necessarily lacking. Language has always already fallen short of itself from the start. This is precisely why Derrida states in a conversation with Maurizio Ferraris that ‘this logic of supplementarity is a logic of incompleteness’ (Derrida and Ferraris, 2001: 29). It is that which is never finished nor final. As such, the supplement comprises three principal dynamics in Derrida’s work: it is necessary; it is simultaneously that which adds and that which intervenes; and lastly, it is impossible, always incomplete. These three dynamics will help to clarify how the logic of the synthetic supplement operates and how it inhabits a double movement between promising to resolve the tensions of AI and yet falling short of itself.

The ‘Lack’ of Machine Learning

How exactly, then, does Derrida’s conceptualisation of the supplement help us make sense of the ethico-politics of machine learning? Derrida’s term helps foreground an idea of lack in machine learning: a focus on gaps, voids, absences, imbalances, and scarcity. It disrupts dominant ideas associated with Big Data and the rise of deep learning (Bengio et al., 2021; boyd and Crawford, 2012; Kitchin, 2014). Crucially, it suggests how recurrent narratives about the scale and volume of data, as well as the radical availability of large training datasets, occlude their own inherent frictions and limitations. But Derrida’s notion also showcases how narratives of lack have simultaneously given rise to what Berlant (2006: 20) calls ‘a cluster of promises’, that which ‘we want someone or something to make to us and make possible for us’. In contrast to recent claims – like ‘it is no longer enough to say that data is big [. . .] data is now in a state of surplus’ (Halpern et al., 2022: 197) – it is crucial that we also attend to a different set of narratives that have emerged and come to matter.

In these narratives, there is a persistent foregrounding of the perceived lack of machine learning. This lack is often understood in terms of training data. One computer engineer explains that ‘in general, many problems of modern AI come down to insufficient data: either the available datasets are too small or, also very often, even while capturing unlabeled data is relatively easy the costs of manual labeling are prohibitively high’ (Nikolenko, 2019: 3). Similarly, others have argued that there is a pervasive ‘data scarcity’ problem in machine learning, ‘as in many fields a sufficient amount of data is not available to train the model’ (Bansal et al., 2022: 6). In other cases, however, there is a more specific reference to a lack of racial representation in data (see Buolamwini and Gebru, 2018; Crawford and Paglen, 2019). It is against this notion of lack broadly understood that synthetic data have emerged as a desirable and even necessary solution. This is especially the case in domains or ‘data-limited regimes’ (Hoffmann et al., 2019) where ethical questions about the use of sensitive personal data are common (Chen et al., 2021). Echoing Derrida, the notion of lack in machine learning thus foregrounds the necessity of synthetic data as supplement, narrowly conceptualised as a mechanism of addition, augmentation, and correction. However, as the recent case of Meta and their large language translation model has shown, lack is by no means restricted to data-limited regimes (Fan, 2020). The lack of machine learning is pervasive.

So, what is at stake here? Synthetic data are increasingly used as a response to the challenges of lack in machine learning. They embody an emergent logic of supplementarity, which promises to address and resolve the ethico-political issues of algorithms. This logic derives its sense of necessity and immediacy from that which is always necessarily lacking in algorithmic models and their training datasets: a population that is too small or skewed, a particular skin type unaccounted for, the harm resulting from biased outputs. Indeed, Derrida (2007: 42) wrote that the fundamental promise of the supplement was that ‘it adds on, and thus inaugurates, it is an addition that serves to complete a whole, to fill in where there is a gap and thus to carry out a program’. Similarly, the program of the synthetic supplement attains a seductive and promissory veneer through claims that ‘synthetic data is an important approach to solving the data problem by either producing artificial data from scratch or using advanced data manipulation techniques to produce novel and diverse training examples’ (Nikolenko, 2019: 3). There is then a promise to intervene in the gaps, lacks, and bottlenecks of algorithms. The following sections will show how the logic of the synthetic supplement concerns itself with generating and filling in strategic gaps, supplying certain missing elements of the data distribution.

As a result, there are ethico-political issues at stake in this logic of the synthetic supplement. In the sections that follow, I critically examine two crucial dimensions of this logic: the notions of imbalance and absence. These are two critical elements of the synthetic supplement for they clearly highlight what the logic promises and how it intervenes. In short, what and how it makes supplemental data matter in algorithmic societies (Barad, 2007). However, it must be added that both dimensions embody the uneasy double movement outlined earlier: the claim to resolve and yet the necessary falling short of such claims. First, we turn to the matter of imbalance.

Imbalance: ‘Boost Those Underrepresented Groups and Balance Out the Real Data’

A crucial aspect of the logic of the synthetic supplement is a promise of completion. But the notion of completion here differs from something like Big Data’s ‘push towards total knowing’ (Steyerl and Crawford, 2017) or ‘the utopia of the infinite inventory’ (Ewald, 2020: 81). Instead, the synthetic supplement promises to address issues of imbalance in the data distributions of algorithmic systems. In the computer science literature this is sometimes called ‘the class imbalance problem’ or the ‘long-tail problem’ (Hoffmann et al., 2019; Hooker, 2021). ‘Most real-world data naturally have a skewed distribution’, machine learning researcher Hooker (2021: 2) writes, ‘with a small number of well-represented features and a “long-tail” of features that are relatively underrepresented.’ On one level, this is a computational problem. It is a question of reducing error rates, optimising the weights and parameters of the model, making it more accurate and robust to future unseen data examples. However, the matter of imbalance is also a normative project, for it has a performative impact on the work algorithms (should) do in the world: ‘The skew in feature frequency leads to disparate error rates on the underrepresented attribute. This prompts fairness concerns when the underrepresented attribute is a protected attribute’ (Hooker, 2021: 2). As such, the logic of the synthetic supplement constitutes a promise to address computational and ethico-political imbalances in algorithmic systems: underrepresented people groups, skin types, gender proportions, age distributions, classes, and so on.

Yet, this logic always falls short of such promises precisely because it is never simply an addition. Instead, it is an active intervention into the data distribution of an algorithm, transforming what can be seen and acted upon. As one data scientist and epidemiologist working with medical synthetic patient data explains:

Sometimes politically there are some things which are important to us, for example, ensuring that certain box subgroups of the population are not underrepresented in the data. So, we expanded our synthetic data generation methods, and we first devised a methodology to be able to know what outcome the algorithm was meant to predict. Based on that outcome and based on what the key variable of interest was, let’s say it was ethnicity, we could work out automatically if any of the ethnic subgroups were underrepresented in that ground truth data, without making any assumptions. So, it would automatically detect [. . .] and it would say we feel you do not have enough cases in Chinese ethnicity in that group so you need to boost these. [. . .] We would then always sample from the real data, Chinese cases, and we would then do the whole synthetic data generation process so that the synthetic data would boost those underrepresented groups and balance out the real data.³
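The workflow described in this interview (detect an underrepresented subgroup, sample its real cases, generate synthetic cases until the groups are balanced) can be sketched roughly as follows. This is a minimal illustration under stated assumptions and not the interviewee's method: the field names `ethnicity` and `value`, the 20% threshold, and the labels `A` and `B` are all hypothetical, and the 'generator' is simply jittered resampling of real cases, a crude stand-in for a trained generative model.

```python
import random
from collections import Counter


def find_underrepresented(records, key, min_share):
    """Return subgroup labels whose share of the records falls below min_share."""
    counts = Counter(r[key] for r in records)
    total = len(records)
    return sorted(g for g, n in counts.items() if n / total < min_share)


def boost(records, key, group, target_count, rng, noise=0.05):
    """Create synthetic records for one subgroup until it reaches target_count.

    Assumption: jittered resampling of real cases stands in for a trained
    generative model; every synthetic value is derived from a real case.
    """
    real = [r for r in records if r[key] == group]
    needed = max(0, target_count - len(real))
    synthetic = []
    for _ in range(needed):
        base = rng.choice(real)  # always sample from the real data
        synthetic.append({
            key: group,
            "value": base["value"] * (1 + rng.uniform(-noise, noise)),
            "synthetic": True,
        })
    return synthetic


# Hypothetical 'ground truth' data: 90 cases in group A, 10 in group B.
rng = random.Random(0)
records = (
    [{"ethnicity": "A", "value": rng.gauss(50, 5), "synthetic": False} for _ in range(90)]
    + [{"ethnicity": "B", "value": rng.gauss(60, 5), "synthetic": False} for _ in range(10)]
)

# Flag subgroups below a 20% share (hypothetical threshold), then boost each
# flagged subgroup up to the size of the largest group.
flagged = find_underrepresented(records, "ethnicity", min_share=0.2)
largest = max(Counter(r["ethnicity"] for r in records).values())
augmented = list(records)
for group in flagged:
    augmented += boost(records, "ethnicity", group, largest, rng)

balanced_counts = Counter(r["ethnicity"] for r in augmented)
```

In this sketch, eighty of the ninety cases in group `B` are generated from only ten real ones. The counts balance, but the augmented dataset is no longer the population that was sampled, which is one concrete sense in which 'boosting' intervenes in what it claims merely to add to.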

Echoing a form of supplementarity as ‘boosting’ as well as emphasising its impact on model performance, a similar point was raised in another interview: ‘you can definitely also create more examples which helps downstream models trained on synthetic data to just have better performance because we see more examples to have a more balanced dataset to be trained on’.⁴ Yet, there are two points of ethico-political tension here. First, the synthetic supplement participates in the reduction of algorithmic harms and biases to what Amaro (2022: 46, 48) calls a ‘problematic of representation’, where ‘the desire for representation’ is seen as the solution yet inevitably leads to the institutionalisation of certain representations that are ‘devoid of the dynamisms of Black life’. Second, this promise to resolve political imbalances is problematic, because it obfuscates how machine learning can reconfigure the very notions of race and ethnicity. As Phan and Wark (2021) have argued, new modes of racialisation and racism emerge through the hidden layers of neural networks. ‘In the absence of explicit racial categories’, they state, ‘computational systems are still able to racialize us’ (p. 3), inferring categories of race from correlations and attributes in the data. While Derrida argued for an openness towards the other, the way algorithms learn to infer categories such as race from features and attributes forecloses the otherness of the other (Amoore, 2020; Critchley, 1992). As such, the logic of the synthetic supplement may figure as a necessary means of intervening in the imbalances in algorithmic systems, but machine learning algorithms also produce new modes of subjectification and racialisation.

This normative project even goes beyond the specific generation and use of synthetic data. There is an increasingly widespread dream of the perfectly balanced training dataset in the AI community, where all machine learning algorithms are imagined as balanceable. In 2019, IBM released the ‘Diversity in Faces’ dataset to ‘advance the study of fairness in facial recognition systems’ (Smith, 2019). A direct response to the critical interventions proposed by scholars such as Gebru and Buolamwini, the IBM dataset aimed to provide ‘a more balanced distribution and broader coverage of facial images compared to previous datasets’ (Smith, 2019). What kind of population is being imagined here? A racially diverse yet balanced population not found in the ‘real world’. This illustrates the promissory and normative dimension of the logic of the synthetic supplement: the promise to resolve the tensions of machine learning and society more broadly through the balancing of data distributions. To optimise the algorithm’s performance and make it fairer, a persistent slippage between the computational and ethico-political. In short, a crucial aspect of the logic is how it embodies a promise to eradicate issues of imbalance in our algorithmic societies.

Yet, the synthetic supplement falls short of itself precisely because it constitutes an active intervention into the data distribution of an algorithm. It never ‘just’ adds. It transforms what can be seen and acted upon by the algorithm. This means that there is a fundamental limit to the dream of perfect balance. As Derrida (1976: 145) put it, the supplement ‘adds only to replace. It intervenes or insinuates itself.’ Similarly, the synthetic supplement never completes nor renders whole. Instead, it reconfigures the very thing it promises to simply augment. This is well illustrated in the work of artist and researcher Anna Ridler (2020). For her 2018 project Myriads (Tulips), Ridler explains how she produced her own labelled training dataset, consisting of thousands of photographic images of tulips taken in the Netherlands, in order to train a generative adversarial network (GAN). Ridler continues that she wanted to explore the possibilities of creating a training dataset devoid of the biases and imbalances commonly found in benchmark datasets such as ImageNet (Crawford and Paglen, 2019; Denton et al., 2021). This attempt at balance was operationalised through an even proportion of differently coloured tulips: 20% red, 20% white, and so on. Yet, surprisingly, the final outputs generated by the GAN were 80% synthetic images of red tulips. Beyond their evident bias towards hypersaturated colours such as red, Ridler was unable to fully explain why her GAN behaved this way. ‘Even when you try and be careful and considered’, Ridler (2020) concluded, ‘there are still things beyond your control that you can’t work with [. . .] It’s impossible for me to predict what will come out of my model. I can guess, but I can never know.’

Ridler’s GAN illustrates the limits of the logic of the synthetic supplement: although it promises balance, it falls short of itself. The algorithm always embodies this double movement. It not only generates something that is imbalanced but also fundamentally intervenes in what is meant by a population of tulips, here reconfigured as predominantly red. It follows that while generative algorithms such as GANs, VAEs, or diffusion models are highly constrained and circumscribed by their training datasets, they are not fully determined by them. They generate more than just what was input. They problematise what de Man (1971) has called ‘the conformity to origin’. That is, they do not merely reproduce or conform to data but instead generate something new, something in excess of their training data and the biases they may contain. In short, there is a ‘computational production of new probabilities’ (Parisi, 2013: x) and new possibilities for imbalance. This means that even if algorithms are supplemented by synthetic data, this alone will not fulfil the dream of balance. No matter what synthetic data are added to the training regime of an algorithm, the model will always generate outputs in excess of this mode of addition. As such, the logic of the synthetic supplement embodies both a promise to resolve through balance and a falling short of that promise.
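One way to put a rough number on the gap between Ridler’s balanced training set and the GAN’s outputs is to compare the two class distributions directly. The sketch below uses total variation distance, which is my choice of measure and not one Ridler reports. It also rests on assumptions: five equally sized colour classes (implied by ‘20% red, 20% white, and so on’), hypothetical names for the three colours the account does not specify, and an even split of the remaining 20% of outputs, which is not given in the source.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions (dicts of shares)."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in labels)


# Training set: five classes at 20% each. Only 'red' and 'white' are named in
# the account; the other three colour names are hypothetical placeholders.
training = {"red": 0.2, "white": 0.2, "yellow": 0.2, "pink": 0.2, "purple": 0.2}

# Generated outputs: 80% red as reported; the even split of the remaining 20%
# across the other classes is an assumption made only for illustration.
generated = {"red": 0.8, "white": 0.05, "yellow": 0.05, "pink": 0.05, "purple": 0.05}

shift = total_variation(training, generated)
print(f"total variation distance: {shift:.2f}")
```

Under these assumptions the distance is 0.6 on a scale from 0 (identical distributions) to 1 (no overlap), meaning that well over half of the probability mass has moved despite the balanced input.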

Absence: ‘Understand the Gaps and Generate Them’

Another key dimension of the logic of the synthetic supplement is that of absence. While imbalance refers to disproportional representations of certain classes or examples in data distributions, the synthetic supplement also promises to intervene in spaces where certain classes or examples are wholly absent from the distribution. As stated by one machine learning researcher, the power of synthetic data is their capacity to ‘fill in the holes’ in data distributions, meaning that generative models can be used to produce ‘synthetic data for situations that are lacking in the original dataset’ (Nikolenko, 2019: 64). Again, drawing on Derrida, the logic of the synthetic supplement never simply adds to a data distribution, filling in the holes and absences. It is always an active intervention into and a reconfiguration of that data distribution. It brings into being what did not exist beforehand. This means that while the synthetic supplement promises to simply add to algorithms and their training data, it gives rise to a wholly different set of tensions and questions. In this way, Derrida’s supplement becomes a way to foreground and make sense of the inherent lack of AI and machine learning as well as the impossibility of resolving this lack. The idea of absence, therefore, foregrounds the double movement of the logic of the synthetic supplement: its promises to resolve as well as its inherent tensions and incapacity to do so.

The synthetic supplement is seen by many computer engineers as a promising solution to the problem of gaps and absences in machine learning. As stated by the co-founder and Chief Technology Officer of one synthetic data company:

If you know what you are missing in your data, it does guide the machine to generate it. Then, of course, when you automate this the next step would probably be to automatically understand the gaps and generate them: this could be an anomaly or missing part, the bias in your data.⁵

Here, the gap or the absence in the data figures as ‘the guide’, as that which directs the generative process. Crucially, the gap is also seen as a way to resolve some of the biases of machine learning. A similar point is raised by Rev Lebaredian, Vice President of Simulation Technology at NVIDIA:

With synthetic data it’s easier for us to create diversity of data. If I’m generating images of humans and I have a synthetic data generator, that allows me to change the configurations of people’s faces, their skin tone, eye colour, hairstyle, and all of those things. (Lebaredian quoted in Strickland, 2022)

Addressing the issue of algorithmic bias and the role of synthetic data, Lebaredian claims that ‘we can construct ideal worlds with the diversity that we want, and our AIs can be better for it’ (Lebaredian quoted in Strickland, 2022). Here, the world according to the algorithm is understood predominantly in terms of lack, insufficiency, a series of gaps in representations of identity. Machine learning algorithms are seen in terms of partiality, not only in relation to the accounts they regularly give of themselves (Amoore, 2020), but also in terms of their training regimes and the extent to which they are able to account for different aspects in the world. According to the logic of the synthetic supplement, the world appears as lacking, as biased, as underrepresented and yet nonetheless resolvable by the generation and incorporation of synthetic data. In short, the synthetic supplement promises to resolve the ethico-politics of machine learning through the filling of gaps in the data distribution.

The logic of the synthetic supplement delineates the problem space of algorithms in terms of gaps and absences. It promises that the data missing from these gaps can and should be generated. On this view, issues concerning diversity, bias, and representation can be supplemented and, as a result, resolved. Yet one consequence is that the synthetic supplement is used to perpetuate vectors of power and control in society (Deleuze, 1992), or what Cheney-Lippold (2011) has called the ‘soft biopower’ of algorithms. For instance, GANs are being used to generate synthetic face images to augment the training datasets of biometric and facial recognition algorithms (Marriott, 2020). It is claimed that deep generative models not only have the ability to generate new ‘synthetic identities’ but also enable future facial recognition systems to bypass constraints usually associated with the extraction of biometric identities (such as privacy concerns). The synthetic supplement is therefore more than just a promise; it embodies nascent regimes that actively seek to reconfigure the conditions of possibility of machine learning.

In contrast to its promises, the logic of the synthetic supplement is also that which always falls short of itself. As Derrida argued, if something needs supplementing then it is because it is always already incomplete and insufficient. Hinting at this impossibility of the synthetic supplement, one data scientist and medical researcher explained in an interview:

GANs seem to work really well with images, I mean phenomenally well, but what they’re doing is, you know, basically supplementing their datasets. . . . What I think we need to be careful of with some of these generative models is assuming we can fill gaps in data that are unfillable. We can’t create knowledge from nothing if there’s no underlying signal there in the first place, and so we need to be slightly wary of synthetic data in that respect.⁶

Although medical training data for machine learning is notoriously noisy and sparse, the data scientist still expressed an unease about the use of generative models in generating additional data. The unease derived from the question of whether the supplement ‘simply’ adds to the data distribution or does something more fundamental to it. Does it create something from nothing? Emphasising that there are gaps in data that remain fundamentally unfillable, he stated in the interview: ‘you’re not going to magically create data that’s got signals that were never there before’.⁷ While this point resonates with the larger argument of this paper, the promise to fill a gap embodied by the logic of the synthetic supplement actually stands in a more fundamental tension with how machine learning algorithms generally operate. As Parisi (2019: 23) has argued, the power of machine learning algorithms derives from their abductive capacities to ‘learn from incomplete information’ and thus be ‘able to classify new cases that may otherwise remain incomplete or not fully specified’. Or as Amoore (2011: 28) puts it, algorithms infer from ‘across the gaps between data’ in order to ‘project onto an array of uncertain futures’. This means that the attempt to account for the absent fails to recognise that gaps are themselves generative mechanisms for algorithms. They learn from gaps and absences in data to generate new rules and hypotheses about future possibilities.

The generativity of gaps echoes Derrida’s conceptualisation of the supplement: it not only signals a need and lack but also a plenitude and a space of intervention. It always already inhabits a space of semantic tension: it simultaneously refers to ideas of addition and completion as well as replacement and transformation. Similarly, the synthetic supplement promises to simply fill an absence, a gap in the data distribution, but there is no ‘simple’ addition. It necessarily falls short of these promises, because it is always an active reconfiguration of that data distribution, bringing into being what did not exist beforehand: a small population augmented, a ‘balanced’ dataset of tulips made red, additional skin types accounted for. That which is supplemented is made different. The attempt to algorithmically fill a gap generates other gaps, all of which are generative for machine learning algorithms. The synthetic supplement is incapable of resolving the fundamental incompleteness and tensions of algorithms precisely because it generates new gaps and frictions, thus keeping the space of the ethico-politics of machine learning perpetually open.

Conclusion: The Impossibility of Ethics

In this paper, I have interrogated the logic of the synthetic supplement, how it operates, and why it matters for the ethico-politics of algorithmic societies. Drawing on Derrida’s work on supplementarity, I have argued that there is a double movement at the heart of the logic of the synthetic supplement. On the one hand, it promises to resolve the tensions inherent in algorithms. This could mean balancing out a data distribution by filling in its gaps, for instance by adding more varied skin types to a dataset, or it could mean enlarging and augmenting a small or skewed population. As such, the logic of the synthetic supplement frames the emergence of synthetic data as a promissory way to resolve entrenched issues in algorithmic systems. However, this logic always falls short of itself, revealing an inherent tension. It is incapable of resolving these issues precisely because it never simply adds to a data distribution. It is always an active reconfiguration of that which it claims to add to, bringing into being new possibilities of imbalance and absence. It is simultaneously an addition and an intervention. As such, there is a need for a renewed attentiveness to the ways in which machine learning models engender new conceptions and parameters of race and ethnicity (Amaro, 2022; Phan and Wark, 2021).

What is at stake with the logic of the synthetic supplement more broadly? Is it relevant only in terms of the emergence of generative algorithmic models and synthetic data? The logic of the synthetic supplement assumes a particular idea of ethics in the context of algorithms. Therefore, the logic is suspect in a way that goes beyond its interventions into the imbalances and absences of machine learning: it reinforces a much broader conceptualisation of ethics as simply an ‘add on’ to algorithms. It appears to promise the resolution of politics in algorithmic societies. Any ethical issue – race, gender, bias, representation, harm – can be solved merely by adding to algorithms and their datasets. In this logic, ethics becomes a supplement easily added. Addition becomes a means of correction and completion. This is not a new phenomenon, however. Reardon (2011) describes how, during the genomics research of the 1990s, committees such as the Human Genome Diversity Project were ‘casting ethics as distinct from science – as something that does not inhere in science but instead needs to be done along with it’ (p. 219). From this, a conceptual scheme emerged that ‘posited ethics as something that could be added onto science – and not something that was unavoidably implicitly in it’ (p. 231).

The issue outlined by Reardon resonates with how the logic of the synthetic supplement operates. For when ethics is conceived in supplementary terms, understood narrowly as an add-on, it creates the conditions of possibility for a conceptualisation of ethics as independent from the making and training of algorithms. This logic ‘casts’ the ethics of algorithms as resolvable through the simple addition of more data, features, attributes, or categories. The result is a dream of complete, balanced, and ethical algorithms. One generates to supplement to resolve. However, this is a narrow reading of what is meant by supplementarity. The logic of the synthetic supplement obfuscates how ethics is always already enmeshed with machine learning. It conceals different forms of violence as well as asymmetrical power relations (Bellanova et al., 2021). Of course, the field of AI ethics deals with inherently complex issues, ones which rarely have straightforward solutions. But it has also been critiqued for being ‘toothless’ (Green, 2021), constituting ‘a lack of friction between ethical principles and existing business principles’ (Munn, 2023: 4). Moreover, Amoore (2022: 21) has argued that ‘the notion that machine learning algorithms could be subject to good governance via regulation, or “AI ethics”, appeals to a different epistemic order than that which is itself generated by deep learning algorithms’, an order ‘generative of new norms and thresholds of what “good”, “normal”, and “stable” orders look like in the world’ (p. 21). As many have already argued, ethics cannot be reduced solely to paradigms of inclusion and fairness (e.g. Hoffmann, 2019). It must start from the place of the fundamental ‘partiality’ of all machine learning algorithms (Amoore, 2019, 2020), as well as the ‘irremediable incompleteness’ (Suchman cited in Bellanova et al., 2021) of all training datasets.

While frameworks of AI ethics and governance are necessary, there is also a need, in the context of machine learning, to emphasise what Derrida (2007) has called ‘the impossibility of ethics’. ‘I would say that deconstruction loses nothing from admitting that it is impossible’, Derrida writes, adding that ‘for a deconstructive operation, possibility is rather the danger, the danger of becoming an available set of rule-governed procedures, methods, accessible approaches’. ‘The interest of deconstruction’, he continues, ‘is a certain experience of the impossible: that is, as I shall insist in my conclusion, of the other’ (p. 15). On the one hand, this means that ‘ethics, and justice, can find no privileged ground for their articulation’ (Keenan, 1990: 1681). That is, the synthetic supplement cannot be seen as a privileged ground for how to articulate machine learning orders, for there is none. On the other hand, it means conceiving of ethics as ‘a certain experience of the impossible: that is [. . .] of the other – the experience of the other as the invention of the impossible’ (Derrida, 2007: 15). As opposed to the promises embodied by the logic of the synthetic supplement, the ethico-politics of machine learning is a never-ending, ever-shifting, impossible encounter with and responsibility for the other. The gap of the other in machine learning, the gap that is the space of the other, can never be completely filled or supplemented. It always remains open.

Acknowledgements

This paper has benefitted enormously from many fruitful discussions with Louise Amoore, Ludovico Rella, and Alexander Campolo on topics such as machine learning, synthetic data, and ethics. My thanks also go to the four anonymous reviewers as well as TCS board members for their careful engagement and comments.

Funding

The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the European Research Council (ERC) under the Horizon 2020 Framework Programme (Advanced Investigator Grant ERC-2019-AdG-883107-ALGOSOC).

ORCID iD

Benjamin N. Jacobsen

Notes

Benjamin N. Jacobsen is a Lecturer in Sociology at the University of York as well as a Visiting Fellow on Professor Louise Amoore’s ‘Algorithmic Societies’ project at Durham University. His current research explores the ethico-political implications of generative modelling and synthetic data on society and culture.

References

Amaro, Ramon (2022) The Black Technical Object: On Machine Learning and the Aspiration of Black Being. London: Sternberg Press.

Amoore, Louise (2011) Data derivatives: On the emergence of a security risk calculus for our times. Theory, Culture & Society 28(6): 24–43.

Amoore, Louise (2019) Doubt and the algorithm: On the partial accounts of machine learning. Theory, Culture & Society 36(6): 147–169.

Amoore, Louise (2020) Cloud Ethics: Algorithms and the Attributes of Ourselves and Others. Durham: Duke University Press.

Amoore, Louise (2022) Machine learning political orders. Review of International Studies 49(1): 20–36.

Amoore, Louise and Piotukh, Volha (2015) Algorithmic Life: Calculative Devices in the Age of Big Data. New York, NY: Routledge.

Bansal, Aayushi, Sharma, Rewa and Kathuria, Mamta (2022) A systematic review on the data scarcity problem in deep learning: Solution and applications. ACM Computing Surveys 54(10): 1–29.

Barad, Karen (2007) Meeting the Universe Halfway: Quantum Physics and the Entanglement of Matter and Meaning. Durham: Duke University Press.

Beer, David (2017) The social power of algorithms. Information, Communication & Society 20(1): 1–13.

Beer, David (2023) The Tensions of Algorithmic Thinking: Automation, Intelligence and the Politics of Knowing. Bristol: Bristol University Press.

Bellanova, Rocco, Irion, Kristina, Lindskov Jacobsen, Katja, et al. (2021) Toward a critique of algorithmic violence. International Political Sociology 15: 121–150.

Bengio, Yoshua, LeCun, Yann and Hinton, Geoffrey (2021) Deep learning for AI: Turing lecture. Communications of the ACM 64(7): 58–65.

Berlant, Lauren (2006) Cruel optimism. differences: A Journal of Feminist Cultural Studies 17(5): 20–36.

boyd, danah and Crawford, Kate (2012) Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon. Information, Communication & Society 15(5): 662–679.

Bucher, Taina (2018) If . . . Then: Algorithmic Power and Politics. Oxford: Oxford University Press.

Buolamwini, Joy and Gebru, Timnit (2018) Gender shades: Intersectional accuracy disparities in commercial gender classification. Proceedings of Machine Learning Research 81(1): 1–15.

Carlini, Nicholas, Hayes, Jamie, Nasr, Milad, et al. (2023) Extracting training data from diffusion models. Preprint. arXiv:2301.13188.

Chen, Richard J., Lu, Ming, Chen, Tiffany, et al. (2021) Synthetic data in machine learning for medicine and healthcare. Nature Biomedical Engineering 5: 493–497.

Cheney-Lippold, John (2011) A new algorithmic identity: Soft biopolitics and the modulation of control. Theory, Culture & Society 28(6): 164–181.

Courville, Aaron, Goodfellow, Ian and Bengio, Yoshua (2016) Deep Learning. Cambridge, MA and London: MIT Press. https://www.deeplearningbook.org/.

Crawford, Kate (2021) Atlas of AI: Power, Politics, and the Planetary Costs of Artificial Intelligence. New Haven: Yale University Press.

Crawford, Kate and Paglen, Trevor (2019) Excavating AI: The politics of images in machine learning training sets. Available at: https://excavating.ai/ (accessed 2 January 2024).

Critchley, Simon (1992) The Ethics of Deconstruction: Derrida and Levinas. Oxford: Blackwell.

Culler, Jonathan (1983) On Deconstruction: Theory and Criticism After Structuralism. Ithaca, NY: Cornell University Press.

de Man, Paul (1971) Blindness and Insight: Essays in the Rhetoric of Contemporary Criticism. Minneapolis, MN: University of Minnesota Press.

Deleuze, Gilles (1992) Postscript on the societies of control. October 59: 3–7.

Denton, Emily, Hanna, Alex, Amironesei, Razvan, et al. (2021) On the genealogy of machine learning datasets: A critical history of ImageNet. Big Data & Society 8(2). https://doi.org/10.1177/20539517211035955

Derrida, Jacques (1973) Speech and Phenomena, and Other Essays on Husserl’s Theory of Signs, trans. Allison, David B. Evanston, IL: Northwestern University Press.

Derrida, Jacques (1976) Of Grammatology, trans. Spivak, Gayatri Chakravorty. Baltimore, MD: Johns Hopkins University Press.

Derrida, Jacques (1978) Writing and Difference, trans. Bass, Alan. Chicago: University of Chicago Press.

Derrida, Jacques (1988) Limited Inc, ed. Graff, Gerald. Evanston, IL: Northwestern University Press.

Derrida, Jacques (2007) Psyche: Invention of the other. In: Kamuf, Peggy and Rottenberg, Elizabeth (eds) Psyche: Invention of the Other, Volume 1. Stanford, CA: Stanford University Press.

Derrida, Jacques (2016) Dissemination, trans. Johnson, Barbara. London: Bloomsbury Academic.

Derrida, Jacques and Ferraris, Maurizio (2001) A Taste for the Secret, eds Donis, Giacomo and Webb, David. Cambridge: Polity.

Domingos, Pedro (2012) A few useful things to know about machine learning. Communications of the ACM 55(10): 78–87.

Ewald, François (2020) The Birth of Solidarity: The History of the French Welfare State, trans. Johnson, Timothy Scott, ed. Cooper, Melinda. Durham, NC: Duke University Press.

Fan, Angela (2020) Introducing the first AI model that translates 100 languages without relying on English. Facebook Newsroom. Available at: https://about.fb.com/news/2020/10/first-multilingual-machine-translation-model/ (accessed 2 January 2024).

Fazi, M. Beatrice (2021) Introduction: Algorithmic thought. Theory, Culture & Society 38(7–8): 5–11.

Foucault, Michel (2003) Abnormal: Lectures at the Collège de France 1974–1975, trans. Burchell, Graham. London: Verso.

Green, Ben (2021) The contestation of tech ethics: A sociotechnical approach to technology ethics in practice. Journal of Social Computing 2(3): 209–225.

Halpern, Orit, Jagoda, Patrick, Kirkwood, Jeffrey West, et al. (2022) Surplus data: An introduction. Critical Inquiry 48(2): 197–210.

Hoffmann, Anna Lauren (2019) Where fairness fails: Data, algorithms, and the limits of antidiscrimination discourse. Information, Communication & Society 22(7): 900–915.

Hoffmann, Jordan, Bar-Sinai, Yohai, Lee, Lisa M., et al. (2019) Machine learning in a data-limited regime: Augmenting experiments with synthetic data uncovers order in crumpled sheets. Science Advances 5: 1–8.

Hooker, Sara (2021) Moving beyond ‘algorithmic bias is a data problem’. Patterns 2: 1–4.

Jacobsen, Benjamin N. (2023) Machine learning and the politics of synthetic data. Big Data & Society 10(1): 1–12.

Keenan, Thomas (1990) Deconstruction and the impossibility of justice. Cardozo Law Review 11(5–6): 1675–1686.

Kitchin, Rob (2014) Big data, new epistemologies and paradigm shifts. Big Data & Society 1(1): 1–12.

Marriott, Richard T. (2020) Data-augmentation with synthetic identities for robust facial recognition. PhD dissertation, École Centrale de Lyon and Université de Lyon.

Munn, Luke (2023) The uselessness of AI ethics. AI and Ethics 3: 869–877.

Nikolenko, Sergey I. (2019) Synthetic data for deep learning. Preprint. arXiv:1909.11512.

Nikolenko, Sergey I. (2021) Synthetic Data for Deep Learning. Cham, Switzerland: Springer.

Parisi, Luciana (2013) Contagious Architecture: Computation, Aesthetics, and Space. Cambridge, MA: MIT Press.

Parisi, Luciana (2019) Critical computation: Digital automata and general artificial thinking. Theory, Culture & Society 36(2): 89–121.

Phan, Thao and Wark, Scott (2021) Racial formations as data formations. Big Data & Society 8(2): 1–5.

Reardon, Jenny (2011) Human population genomics and the dilemma of difference. In: Jasanoff, Sheila (ed.) Reframing Rights: Bioconstitutionalism in the Genetic Age. Cambridge, MA: MIT Press, pp. 217–238.

Ridler, Anna (2020) The abstraction of nature. A presentation given on 19 February 2020, at the Aksioma Project Space in Ljubljana, Slovenia. Available at: https://vimeo.com/396388790 (accessed 2 January 2024).

Smith, John R. (2019) IBM research releases ‘diversity in faces’ dataset to advance study of fairness in facial recognition systems. IBM Research. Available at: https://phys.org/news/2019-01-ibm-diversity-dataset-advance-fairness.html (accessed 2 January 2024).

Steinhoff, James (2022) Toward a political economy of synthetic data: A data-intensive capitalism that is not a surveillance capitalism? New Media & Society. https://doi.org/10.1177/14614448221099217

Steyerl, Hito and Crawford, Kate (2017) Data streams. The New Inquiry. Available at: https://thenewinquiry.com/data-streams/ (accessed 2 January 2024).

Strickland, Eliza (2022) Are you still using real data to train your AI? IEEE Spectrum. Available at: https://spectrum.ieee.org/synthetic-data-ai#toggle-gdpr (accessed 2 January 2024).

Zuboff, Shoshana (2019) The Age of Surveillance Capitalism: The Fight for a Human Future at the New Frontier of Power. London: Profile Books.