Abstract
What is the relationship between ideas of sameness and difference for machine learning and AI? Algorithms are often understood to participate in the continual displacement of the different and heterogeneous in society in favour of sameness, of that which is socio-politically similar and proximate. In contrast to this prevalent emphasis on sameness, however, this paper argues that there is a nascent
Introduction
In a 2018 interview, CEO of tech company Affectiva and a pioneer of so-called ‘Emotional AI’, Rana el Kaliouby, was asked about the relationship between bias and AI. She responded that ‘It’s the data. It’s how we’re applying this data’, which means that ‘we need to make sure that the training data is representative of all the different ethnic groups, and that it has gender balance and age balance’ (Ford, 2018: 222). Unsurprisingly, issues of bias and representativeness in machine learning and its training data have only become more pressing for computer scientists and researchers since Kaliouby’s interview in 2018 (e.g. Barocas et al., 2023; Wang et al., 2022). Yet, a recurrent response to such issues has been to build and deploy models that are ‘blind to gender, ethnicity and age’, that circumvent the issue of representativeness altogether (Drage and Mackereth, 2022: 8). In such cases, Drage and Mackereth (2022) argue, there is a political claim that by not accounting for protected characteristics such as race or gender in input data, machine learning algorithms can be made more representative, removing the grounds for human discrimination and bias. 1
In contrast to such claims, this paper argues that there is another nascent logic in machine learning –
This paper argues that the emergent intersection of machine learning and synthetic data signals neither ‘the expulsion of the other’ (Han, 2018) nor the erasure of all visible markers of difference. Instead, computer scientists and machine learning engineers increasingly seek to procure training datasets that are ‘representative of all the different ethnic groups’ and have ‘gender balance and age balance’ (Kaliouby cited in Ford, 2018: 222) through the generation of synthetic data attributes on race, gender, and other protected characteristics (Bender et al., 2021; Buolamwini and Gebru, 2019; Crawford and Paglen, 2019). By actively generating and augmenting the different, the diverse, and the underrepresented in algorithms, synthetic data embody a promise to eradicate risks of bias, imbalance, and homogenization in machine learning. But this is a highly problematic promise. As Amoore (2020: 7) argues, the question of ethics does not exist outside of the algorithm precisely because it is ‘always already an ethico-political entity by virtue of being immanently formed through the relational attributes of selves and others’. The question this raises, therefore, is not what is an ‘ethical use’ of algorithms – or how can the algorithm be made ‘fairer’ – but rather ‘how are algorithmic arrangements generating ideas of goodness, transgression, and what society ought to be’ (p. 7). Similarly, the logic of heterophily underpinning the intersection of synthetic data and machine learning generates a particular politics of difference, that is, a specific idea of data ethics and representativeness in AI. Yet, it is a claim to representativeness where the subject and the other are nonetheless emptied of their intractable substance and inherent difficulties. The result is models that may be perceived as ‘more ethical’ or ‘more representative’ but still do not capture the nuances and variabilities of lived experience.
The paper draws on findings and examples from the emergent field of synthetic data and generative modelling (see also Jacobsen, 2023, 2024; Steinhoff, 2024), exploring how synthetic data attributes are algorithmically generated in order to intervene in the ways in which algorithms generate new parameters of sameness and difference in society. Synthetic data are used as training data for algorithms, and are most often generated using deep generative models such as large language models and diffusion models, computer graphics pipelines, statistical methods, or a combination of these approaches (Goodfellow et al., 2016; Nikolenko, 2021; Veselovsky et al., 2023). Synthetic data have become increasingly significant in a number of sectors, including healthcare and government, because they promise to be able to account for data points that are either absent from or underrepresented in the training data, thus making algorithms more representative and placing their outputs beyond the realm of risk (Jacobsen, 2023). While these models rely on the extraction of real-world data to make possible the generation of synthetic data points that approximate the real data distribution (Crawford, 2021; Zuboff, 2019), they are alluring precisely because synthetic data embody a claim to provide a space where more diverse or missing attributes can be generated and incorporated into the training of machine learning algorithms, often as a way of fine-tuning or optimization (Ferrari and McKelvey, 2022).
As such, the logic of heterophily promises to reconfigure this attributive capacity of algorithms. In the computer science literature, attributes are often understood as a quantity that can be further described with a value, where ‘the combination of an attribute and a value is a feature’ (Abney, 2008: 15; see also Alpaydin, 2010). The attribute is central to how algorithms learn to make differences in society. ‘In the ontology of algorithms’, Amoore (2020: 169) writes, ‘the mechanism that connects people, entities, and events is the attribute.’ This means that ‘the machine learning algorithm iteratively moves back and forth between the ground truth attributes of a known population’ and ‘the unknown feature vector that has not yet been encountered’, and as such learns to recognize and infer in the world (p. 169). In clustering, for instance, the model learns to find groupings of input data and distributes objects and individuals, whose attributes are more or less similar, into clusters. Machine learning algorithms therefore have an attributive capacity to anticipate and render objects and people knowable in the future based on how they learned from past groupings of data attributes. 2 One crucial question that emerges from this is: what does the attribute make actionable?
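The relation between attribute, value, feature, and cluster described above can be made concrete with a minimal sketch. It is an illustration only: the records, the Jaccard similarity measure, the threshold, and the greedy single-pass grouping are all assumptions chosen for brevity, not a method drawn from Abney, Alpaydin, or Amoore, and not how any production clustering algorithm is implemented.

```python
# Hypothetical records: each maps an attribute name to a value.
records = [
    {"glasses": "yes", "hair": "short"},
    {"glasses": "yes", "hair": "short"},
    {"glasses": "no", "hair": "long"},
    {"glasses": "no", "hair": "short"},
]


def to_features(record):
    """Treat each (attribute, value) pair as one feature."""
    return frozenset(record.items())


def similarity(a, b):
    """Jaccard similarity: shared features divided by all features present."""
    return len(a & b) / len(a | b)


def cluster(feature_sets, threshold=0.5):
    """Greedy grouping (an assumed, simplified stand-in for clustering).

    Each record joins the first cluster whose first member is at least
    `threshold` similar to it; otherwise it starts a new cluster.
    """
    clusters = []  # each cluster is a list of record indices
    for i, feats in enumerate(feature_sets):
        for members in clusters:
            if similarity(feats, feature_sets[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


features = [to_features(r) for r in records]
clusters = cluster(features)
print(clusters)
```

Even in this toy form, the grouping depends entirely on which attributes are recorded and on where the similarity threshold is set, which is one way of seeing how the attribute conditions what becomes actionable.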
The intersection of machine learning and synthetic data constitutes a drive towards generating a wide range of heterogeneous attributes and incorporating these into the training regimes of algorithms to improve generalization. The reason being that adding more data attributes – whether synthetic or real – to the distribution of an algorithm is
In what follows, I foreground three crucial dimensions of the heterophilic logic of machine learning and synthetic data. In the first section – ‘Disentanglement’ – I outline how data attributes are imagined and generated as radically separable, malleable, and controllable. The second section – ‘Compositionality’ – discusses what is made possible through the generation of disentangled synthetic data, namely attribute combinations that can be arranged and composed in a plethora of ways so as to produce in algorithms a hypersensitivity to ever more fine-grained differences in the world in order to make algorithms more representative and less biased. The final section – ‘Normativity’ – examines the normative claim embedded in this emergent conceptualization of difference. It gravitates towards what can be understood as ‘the uniformly biased’, where the biased, unjust, and imbalanced data distributions deriving from real-world domains can be circumvented or resolved. Lastly, I discuss what is at stake in thinking the power and politics of algorithms and synthetic data through the logic of heterophily, and argue that there is a continuing need to widen the field of critique of machine learning.
Disentanglement
The use of synthetic data for machine learning has become increasingly popular in recent years (Nikolenko, 2021). Algorithmic models are often trained on data distributions that contain an unequal frequency of attributes, such as ‘white male’ compared to ‘black female’. This has implications for ethics because ‘notions of fairness often coincide with how underrepresented sensitive attributes are treated by the model’ (Hooker, 2021: 3). As a result, a model trained on a large volume of data representing, say, human faces may still fail to recognize faces with rare characteristics (Jacobsen, 2023). It is for this reason that synthetic data constitute such a promising development for machine learning: researchers have shown how, in multiple domains, an algorithmic model trained on real data but fine-tuned on synthetic data that represent rare or sensitive attributes produces a higher accuracy score than a model trained solely on real data. The incorporation of diverse synthetic data is therefore seen as beneficial, constituting a reduction in a model’s error rates. 3 Synthetic data constitute an attempt to augment the probability for a model to perceive and recognize a broader range of things more accurately in the world. They embody a promise that different kinds of data (e.g. rare and sensitive attributes) can be generated and incorporated into the training regimes of machine learning algorithms, increasing their capacity to recognize and act in the world.
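The unequal frequency of attributes described above, and the kind of synthetic ‘top-up’ it invites, can be sketched in a few lines. The counts below are invented for illustration, and the function name `synthetic_top_up` is hypothetical; matching every group to the largest one is only one possible balancing rule and is assumed here for simplicity.

```python
from collections import Counter

# Hypothetical labels for a small training set; the counts are invented.
training_labels = (
    [("white", "male")] * 6
    + [("white", "female")] * 3
    + [("black", "male")] * 2
    + [("black", "female")] * 1
)


def synthetic_top_up(labels):
    """Return how many synthetic samples each group would need
    to match the size of the largest group (an assumed target)."""
    counts = Counter(labels)
    target = max(counts.values())
    return {group: target - n for group, n in counts.items()}


needed = synthetic_top_up(training_labels)
for group, n in needed.items():
    print(group, "needs", n, "synthetic samples")
```

The sketch makes visible what the balancing operation presupposes: that each group is a discrete label whose shortfall can be expressed as a number of samples to be generated.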
Yet, this dream to generate a wide variety of data attributes has also transformed the
An image is composed of the interaction between one or more light sources, the object shapes and the material properties of the various surfaces present in the image. It is important to distinguish between the related but distinct goals of learning invariant features and learning to disentangle explanatory factors. The central difference is the preservation of information. (p. 19)
The aim, in other words, is to build models that learn
The logic of heterophily underpinning the intersection of synthetic data and machine learning resonates with this notion of disentanglement. It signals a move towards generating data attributes that are radically separable, isolatable, addable, malleable, and controllable. It embodies a vision of the social world as an array of attributes and features that are simultaneously separable, independent of each other, and easily malleable and controllable. Take the example of synthetic images generated for the training of facial recognition systems. Computer scientists have stated that ‘disentangled face generation has become popular, which can provide the precise control of targeted face properties such as identity, pose, expression, and illumination’, which in turn has made possible the systematic exploration of the impacts of facial properties on face recognition (Qiu et al., 2022: 10881). What is being discussed here is the development of a ‘controllable face synthesis model’ that can be used to generate synthetic faces rarely found in the typical training sets of facial recognition systems. With ‘precise control of targeted face properties’, machine learning engineers and data scientists are now able to collect ‘large-scale face images of non-existing identities without the risk of privacy issues’ as well as ‘analyzing the influences of different facial attributes (e.g., expression, pose, and illumination)’ for the algorithmic model (p. 10881).
Synthetic data are therefore transforming the attributive capacity of machine learning algorithms, making it possible to generate increasingly disentangled and granular differences. This impacts both what the algorithm is able to infer in the future and what social relations are engendered in the present. The emergence of synthetic data and machine learning therefore valorises a particular notion of difference as something that is separable, controllable, and malleable in order to make the algorithm more representative and less biased. The aim is not only to generate facial training data at scale, but to generate ‘different facial attributes’, tweaking the expression, pose, or illumination of facial images to see how they variously impact the model. As such, the logic of heterophily opens onto a future where training datasets are not simply well-curated and relatively fixed entities. Instead, they are increasingly heterogeneous and variable, always in a state of becoming; and the unit of malleability shifts from facial images to more fine-grained attributes and features within such images.
This capacity to generate disentangled data attributes to make algorithms more representative, however, reproduces a particular politics of race. The once popular Israeli-based synthetic data company Datagen, which closed down in 2024, provided customers with access to the company’s self-service platform in order to build and customize synthetic humans as training data for their computer vision algorithms. These synthetic humans figure as radically isolatable, malleable, and tweakable parameters. According to their ‘API Catalog of Attributes’ (Datagen, 2023), human data attributes range from the seemingly mundane – such as eyebrows, hair, glasses, and beards – to protected characteristics such as age, gender, and ethnicity. As an example, the list of ethnicity attributes one could choose from is: ‘african, east_african, hispanic, mediterranean, north_european, southeast_asian, and south_asian’. Such attribute data are not only radically isolatable and controllable but also divorced from any biological and embodied determinations. On the one hand, these race specifications reflect deeply racialized and stereotypical representations, showcasing how attempts at disentanglement actually produce new entanglements with real-world stereotypes and entrenched representations. Data attributes that were supposed to be divorced from biology become reattached to very particular and narrow understandings of race.
On the other hand, however, it also imagines race as fundamentally separable from wider societal, structural issues of power. The logic of heterophily constitutes, as Amaro (2019: 84) put it, ‘a call to make black technical objects compatible to machine learning artificial intelligence algorithms’. Yet, this ground for racialized compatibility nonetheless ‘risks the further reduction of the lived potentiality of black life’. Whilst it resonates with historical ideas of race and control (Gilroy, 2000; Mbembe, 2019), 4 the emergence of the malleable and racialized synthetic attribute results in new racial formations that do not map onto any real bodies but only to a controllable feature within the algorithmic model (Phan and Wark, 2021). As such, the emergence of synthetic data embodies a nascent mode of racialized control that has the potential to further evade the already insufficient safeguards of protected characteristics. It is therefore not a question of
Compositionality
It is well known that one of the fundamental features of the learning process of neural network algorithms is that they can exploit what has been called the ‘compositional hierarchies’ in natural signals – that is, ‘where higher-level features are obtained by composing lower-level ones’ (LeCun et al., 2015: 439). For images, for instance, ‘local combinations of edges form motifs, motifs assemble into parts, and parts form objects’ (p. 439). Through combining features in one layer and creating more abstract features in the next layer, neural networks are able to learn many complex non-linear tasks and functions from input data, such as detecting objects and people within the relationships of image pixels. The 2018 Turing Award recipients – Yann LeCun, Yoshua Bengio, and Geoffrey Hinton – state that the benefit of this mode of learning is that ‘with the composition of enough such transformations, higher layers of representation amplify aspects of the input that are important for discrimination and suppress irrelevant variations’ (LeCun et al., 2015: 436). The power of deep learning, in their view, derives from its compositional capacity: to endlessly combine and re-combine data attributes and features into increasingly complex representations.
Crucially, this idea of compositionality feeds into the politics of difference of machine learning and synthetic data. Here it refers to the ways in which synthetic data attributes are not only malleable and controllable but can also be composed and recomposed in a myriad of ways to create something that is ultimately seen as beneficial to the algorithm. Moreover, the notion of compositionality signals a certain relationship between algorithmic recognition and generated combinations of the different and heterogeneous. At the 2021 NVIDIA GTC Conference, Gal Chechik, Director of AI Research at NVIDIA, presented on the challenges of developing machine learning algorithms that fuse perception with reasoning and decision-making. Speaking on the particular challenge of ‘compositional recognition’, Chechik (2021) stated that ‘what people can do is they can understand new combinations of familiar components, but this is really hard for our current deep [learning] methods’. As he put it, a computer vision algorithm may be able to recognize goats and trees in an image, but ‘may fail to recognize goats in a tree’. Similarly, it may recognize red tomatoes but ‘will struggle to recognize black tomatoes’, given it has not learnt a strong correlation between objects such as tomatoes and attributes such as black. In other words, algorithms find it difficult to recognize unfamiliar combinations of familiar objects or attribute-object pairs that they have not been previously exposed to. The aim, Chechik suggests, is to expose algorithms to unseen and increasingly diverse and heterogeneous combinations of objects and attributes in order to make them more robust to different unseen instances in the world.
What are the implications of this exposure to manufactured heterogeneity for how algorithms generate parameters of sameness and difference in society? The logic of heterophily underpinning synthetic data and machine learning valorises a notion of difference that is disentangled and malleable as well as infinitely recombinable. The fact that models are used to generate disentangled and malleable synthetic attributes makes possible the combination of data attributes into increasingly complex representations – and these representations do not even have to exist in the real world. Crucially, it makes possible a
ConfigNet learns a factorized latent space, where each part corresponds to a different facial attribute. The first column shows images produced by ConfigNet for certain points in the latent space. The remaining columns show changes to various parts of the latent space vectors, where we can generate attribute combinations outside the distribution of the training set, like children or women with facial hair. (p. 300)
Developing models that learn a factorized latent space representation of the training data is also crucial because ‘even when conditional models are trained with detailed labels, they struggle to generalize to out-of-distribution combinations of control parameters such as children with extensive facial hair or young people with grey hair’ (Kowalski et al., 2020: 299).
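What counts as an ‘out-of-distribution combination of control parameters’ can be illustrated with a short sketch. The attribute vocabulary and the set of observed combinations below are invented, and `unseen_combinations` is a hypothetical helper; the sketch only enumerates combinations that are absent from a toy training set and says nothing about how ConfigNet itself generates images.

```python
from itertools import product

# Hypothetical attribute vocabulary (names and values are assumptions).
attributes = {
    "age": ["child", "adult"],
    "facial_hair": ["none", "beard"],
    "hair_colour": ["brown", "grey"],
}

# Hypothetical combinations that occur in the training set,
# written in the same attribute order as the dictionary above.
observed = {
    ("child", "none", "brown"),
    ("adult", "none", "brown"),
    ("adult", "beard", "brown"),
    ("adult", "none", "grey"),
    ("adult", "beard", "grey"),
}


def unseen_combinations(attributes, observed):
    """List every attribute combination that never occurs in `observed`."""
    names = list(attributes)
    all_combos = product(*(attributes[name] for name in names))
    return [dict(zip(names, combo)) for combo in all_combos if combo not in observed]


missing = unseen_combinations(attributes, observed)
for combo in missing:
    print(combo)
```

In this toy vocabulary every missing combination involves a child with a beard or with grey hair, which mirrors the examples Kowalski et al. give. The enumeration also shows how quickly the space of possible combinations grows as attributes are added, since it is the product of the number of values per attribute.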
Microsoft Cambridge’s ConfigNet model is used to generate different attribute combinations for machine learning models that fall outside of the data distribution. The aim is, again, not to eradicate differences from the model, to make the algorithm ‘colour blind’, so to speak. Rather, it is to amplify differences for the algorithm through the generation of heterogeneous attribute combinations in order to augment the algorithm’s capacity to recognize and infer in the world – even combinations that, at first, appear highly anomalous or perhaps monstrous (such as children with facial hair). For Foucault (2003: 56), the figure of the monster, central to his notion of the abnormal, constitutes a limit point: ‘The monster is the limit, both the point at which law is overturned and the exception that is found only in extreme cases.’ Foucault argues that the monster is both that which transgresses and that which reinforces the societal boundaries that are transgressed. Similarly, the logic of heterophily promises the endless capacity to generate that which falls outside of the distribution of the training data. But such attribute combinations constitute a form of othering that generates rare and heterogeneous faces and, in turn, reinforces what falls
The question remains: why generate the different, heterogeneous, and monstrous? Microsoft Cambridge’s ConfigNet model also helps to unpack this question. As they note in their research paper, the aim of the model is to learn ‘a factorized latent space, where each part corresponds to a different facial attribute’ (Kowalski et al., 2020: 300). In other words, the model learns a compressed, low-dimensional representation of its training data and, as a result, learns to foreground the salient features and attributes in the data whilst discarding what it considers irrelevant. In short, the latent space indicates what a model has learned from data (Amoore et al., 2024). The latent space also provides the ground upon which different and fine-grained attribute combinations can be generated. If the model has been trained on a dataset of face images, then by moving between specific points in latent space machine learning engineers are able to change the output of the model, from a generated image of a woman to a man or from a man to a child, from a child with glasses to one without glasses, and so on (Sher, 2021).
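The two operations described above, moving between points in latent space and changing the part of the latent vector that corresponds to one attribute, can be sketched as follows. This is a minimal illustration under strong assumptions: the four-dimensional vectors, the `LAYOUT` mapping of attributes to slices, and both function names are hypothetical, and the trained decoder that would turn each vector into an image is omitted entirely.

```python
def interpolate(z_start, z_end, steps):
    """Linearly interpolate between two latent vectors, endpoints included."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    path = []
    for i in range(steps):
        t = i / (steps - 1)
        path.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path


# Hypothetical factorized layout: which slice of the vector encodes which
# attribute. A real model would learn its own, much larger, layout.
LAYOUT = {"age": slice(0, 2), "glasses": slice(2, 4)}


def swap_factor(z_target, z_source, factor):
    """Copy one attribute's slice from z_source into a copy of z_target."""
    z_new = list(z_target)
    z_new[LAYOUT[factor]] = z_source[LAYOUT[factor]]
    return z_new


z_a = [0.0, 0.0, 1.0, 1.0]  # hypothetical latent code for one face
z_b = [1.0, 2.0, 0.0, 0.0]  # hypothetical latent code for another face

path = interpolate(z_a, z_b, steps=3)
swapped = swap_factor(z_a, z_b, "glasses")
print(path)
print(swapped)
```

In a trained generative model each vector in `path` would be passed to the decoder to produce an image, and `swapped` would correspond to the first face with only the ‘glasses’ attribute taken from the second, assuming the latent space were in fact cleanly factorized.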
The emergence of synthetic data embodies a drive towards the generation of different and heterogeneous attribute combinations for machine learning algorithms. And by incorporating these synthetic attribute combinations into the training regime of the model, it becomes increasingly hypersensitive to differences in new input data. But these synthetic differences are not generated in order to immunise algorithmic models against the real, ‘to immunise the actual against the virtual, the probable against the excess of the possible’ (Rouvroy, 2018: 100). The logic of heterophily underpinning synthetic data and machine learning is not one of immunisation. As the Microsoft Cambridge researchers have claimed elsewhere, ‘training on data with darker skin types leads to a more robust model, perhaps because the task is harder – forcing the model to learn better representations or a more robust attention mechanism’ (McDuff et al., 2021: 3746). This is symptomatic of an attitude that algorithms actively benefit from the different, the difficult, the excess of the possible, and the monstrous. The dark skin, for instance, becomes a way to make the task harder, to make the model more robust, and to make the algorithm better at recognizing and inferring different shades of dark skin in the real world.
This means that rather than the erasure of the different and diverse from algorithmic models, the intersection of synthetic data and machine learning is fuelling the generation of increasingly diverse and heterogeneous attribute combinations, some of which may approximate existing social identities and protected characteristics and some of which may not (such as children with facial hair). The aim in either case is to produce algorithms that are increasingly sensitive to fine-grained differences in new data and in the world. Yet, this othering, this mode of generating the different and monstrous, withdraws the ethical obligation to respond to the other. That is because the other is displaced, transformed into a computational problem, a figure of the ‘out of distribution’. As such, by making attributes endlessly combinable and recombinable – so that there are no fixed reference points, no fixed human bodies – the logic of heterophily opens up a space for those building, tweaking, and deploying algorithmic systems to claim that there is no discrimination, racism, or ageism in their models.
Normativity
The emergence of synthetic data for machine learning constitutes a drive towards the generation of disentangled and malleable attributes, which can be endlessly composed into attribute combinations. This promises to make algorithms hypersensitive to differences in the world. All differences can be rendered controllable and malleable to the algorithm, decoupled from potential association with specific sorts of bodies or phenotypes. It follows that synthetic data also embody
This is a bit counterintuitive because even though the world may be biased, your models shouldn’t be. There are biases that can be caused by the gathering methods, things that are naturally less frequent or are harder to gather, appear less in the data. There are biases that can be caused during the annotation process, so things that are harder to annotate have more annotation mistakes. One of the most widely discussed biases, which poses a serious problem for real-world applications, are biases in demographics that are widely dependent on the geography of the gathering process. This is counterintuitive because you don’t want to represent the distributions of the real-world. You want to reflect high-level biases of the domain uniformly, in our training data. Ethnicities, ages, genders, lighting scenarios and smartphone camera type are a few examples. (Datagen, 2022)
Central to Datagen’s claim is a desire to
On one level, this normative claim to the uniformly biased is unsettling traditional statistical approaches, with their normal distributions, probability estimations, and the identification of regularities in seemingly stochastic processes (Amoore, 2013). On another level, and more worryingly, this claim to difference has the potential to undercut a politics of intervention that seeks to foreground the systemic unfairness and violence of machine learning models. The synthetic data points, attribute combinations, and subjects that are being algorithmically generated are promising to resolve the ethico-politics of algorithms by going beyond ‘the distributions of the real-world’ (Datagen, 2022). Here, ethical concepts such as fairness and representation are ‘narrowed and instrumentalized’, made measurable and easily implementable (Hong, 2022: 936). The fundamental promise is to be able to generate whatever racial or gendered attributes are needed for the training and fine-tuning of a machine learning model. Attributes such as race and gender never need to be insufficient, imbalanced, or wholly missing from algorithmic models.
This normative claim is underpinned by a very specific conceptualization of difference (of race, gender, diversity, and representation): it frames synthetic training data as fundamentally disentangled, malleable, controllable, and radically composable. All generated differences are made amenable and compatible with algorithmic models. And as the notion of the ‘uniformly biased’ suggests, these synthetic differences may be incorporated into the training of an algorithm, may exist in the model’s latent space, but their space for ethico-politics is
Conclusion
This paper has critically examined the intersection of synthetic data and machine learning through the conceptual lens of heterophily. This intersection is characterized by an increasing drive towards generating differences – various and diverse attributes and features – as additional training inputs for algorithms. I also foregrounded three core dimensions of the heterophilic logic of synthetic data and machine learning. Firstly, disentanglement expresses how attributes have become radically separable, malleable, and controllable. Secondly, this results in the endless play of compositionality by which heterogeneous attribute combinations are generated in order to create ‘more difficult’ and productive training datasets. Lastly, the drive towards the disentangled, controllable, and radically compositional also embodies a normative vision of the world. In other words, the emergence of synthetic data embodies a promise where controllable and endlessly modifiable and recombinable data attributes can be generated and incorporated into the training of algorithms as a way to bypass their ethico-political limitations and constraints whilst making them (seem) more representative. Synthetic data also constitute a claim to a particular notion of difference where the subject and the other are emptied of their intractable substance.
Whilst these synthetic attributes and their underlying conceptualization of difference are evidently reductive – they do not capture the complex and nuanced variability of the social world – there is still a danger that social science critiques that evoke reduction rely too firmly on what Ramon Amaro (2022: 46) has called ‘the problematic of representation’. That is, using the example of Joy Buolamwini’s Aspire Mirror project, 5 Amaro observes how issues of bias, risk, harm, and violence in machine learning are too often reduced to a question of racial and gendered representation: a lack of diversity in the training data as well as a lack of diversity of those that build the models. Such critiques are still valuable and necessary – especially given that issues of representation remain entangled with different forms of participatory injustice (see Noble, 2017). Yet, they are by themselves insufficient. ‘Coders like Buolamwini’, Amaro writes, ‘speak directly to the problem of erasure, more specifically the erasure of being, yet the act folds seamlessly into a desire for representation’ which is ‘devoid of the dynamisms of Black life’ (Amaro, 2022: 48). Amaro’s work foregrounds the limitations of operating solely within a framework of representation, because it does not fundamentally challenge its underlying sociocultural and structural conditions. Nor does it take into account how machine learning algorithms, in engaging with the world, necessarily transform and generate it. The danger of the representational framework is that it opens up a problem space where all possible critical responses and interventions inevitably gravitate towards the solution vector of either more representation or better representation – neither of which challenges the fundamental politics and power of algorithms. Indeed, the heterophilic logic of machine learning and synthetic data relies on and fuels this problematic of representation. 
Here, all differences – whether that of race, age, gender, lighting conditions, or smartphone types – can be generated and incorporated into the training or fine-tuning of a model with the aim of making it more representative.
There is also a danger that the intersection of synthetic data and machine learning participates in the steady erosion of what Louise Amoore (2020) has called ‘the unattributable’. For Amoore, the unattributable is ‘a potentiality that cannot be attributed to a unitary subject’ as well as ‘a refusal to be governed by the attribute’ (p. 171). Yet, she asks towards the end of her book
There is therefore a need to problematize this new politics of difference generated by synthetic data and machine learning. There is also a need to rethink the notion of difference in relation to contemporary algorithms, and to redraw the contours of a social critique that does not rely solely on representation. As Rosi Braidotti (1991: 177) once asked, ‘Can we formulate otherness, difference without devaluing it? Can we think of the other not as an other-than, but as a positively other entity?’ Part of the answer may lie in opening up a more agonistic reading of algorithmic culture (Crawford, 2016: 87), where ‘algorithmic decision making is always a contest’, where differences are never settled or without friction. Another part of the answer may lie in, as Edward Said (1985: 43) put it, not thinking what difference or representations can do
Footnotes
Acknowledgements
I want to thank Louise Amoore, Ludovico Rella, and Alexander Campolo for many productive discussions on topics such as machine learning, synthetic data, and ethics, which ultimately produced the ideas for this paper. My thanks also go to the anonymous peer reviewers as well as
Funding
The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research has received funding from the European Research Council (ERC) under Horizon 2020, Advanced Investigator Grant ERC-2019-ADG-883107-ALGOSOC.
