Abstract
Surveillance of human subjects is how data-intensive companies obtain much of their data, yet surveillance increasingly meets with social and regulatory resistance. Data-intensive companies are thus seeking other ways to meet their data needs. This article explores one of these: the creation of synthetic data, or data produced artificially as an alternative to real-world data. I show that capital is already heavily invested in synthetic data. I argue that its appeal goes beyond circumventing surveillance to accord with a structural tendency within capitalism toward the autonomization of the circuit of capital. By severing data from human subjectivity, synthetic data contributes to the automation of the production of automation technologies like machine learning. A shift from surveillance to synthesis, I argue, has epistemological, ontological, and political economic consequences for a society increasingly structured around data-intensive capital.
Surveillance is the unspoken byword of contemporary capitalism. New business models driven by data-intensive technologies like machine learning rely on the collection of large quantities of data, much of which is obtained by surveilling users of social media, mobile devices, and applications. The capitalist drive to accumulation is increasingly funneled through this data-intensive infrastructure, making surveillance a prerequisite for profit-generation. The search for new markets takes the form of continual incursions of surveillance into new spheres of life, driving visions of a future characterized by increasingly fine-grained and omnipresent monitoring. A social backlash, fueled in part by such visions, has contributed to the establishment of regulations on the collection and usage of data, and thus on surveillance, such as the European Union’s General Data Protection Regulation, implemented in 2018. Against such developments, data-intensive firms are striving to maintain the ability to collect data through surveillance via techniques ranging from record-setting lobbying spending (Chung, 2021) to encouraging a digital resignation to surveillance in users (Draper and Turow, 2019). This article examines another, technical, way in which data-intensive capital is responding to the curtailment of surveillance: synthetic data.
Synthetic data is data which is not collected via surveillance; rather it is “produced artificially” (Nikolenko, 2021: v). It is data “that computer simulations or algorithms generate as an alternative to real-world data” (Andrews, 2021). This article introduces and critically assesses this technology from a political economy perspective, drawing on analysis of documents from machine learning, data science and computer science research, and the artificial intelligence (AI) industry. It is easy to imagine why data-intensive capital might find synthetic data appealing at a moment of heightened scrutiny over its surveillance practices. However, I argue that the appeal of synthetic data for data-intensive capital goes beyond providing an alternative to surveillance. It also provides a novel technical means for continuing a historical tendency within capitalism toward the autonomization of the circuit of capital. The autonomization of capital is most often discussed in terms of the automation of production and circulation processes, as well as the speculative movements of value in the finance sector. But by shifting the production of data from a process contingent on the surveillance of human subjects to one of computational generation, synthetic data contributes to the automation of the production of the conditions for production and circulation in data-intensive capitalism. This, I argue, presents ontological, epistemological, and political implications for data-intensive technologies, data-intensive capital, and struggles against it.
Context: data-intensive capital and surveillance
Capitalism is the mode of production founded on the process of transforming a quantity of value into more value via the production and exchange of commodities. Karl Marx schematized this process as M-C-M’ (Marx, 1990: 251). More value must be generated than was invested; capital must be “valorized” (Marx, 1990: 252). Capital is this process endlessly repeated; it is an “end in itself” (Marx, 1990: 253). Hence the restlessness of capitalism. Valorization can never be finally completed. New markets must always be opened and new commodities produced. But wherein does value arise? According to Marx, and contrary to contemporary mainstream economics, not from the sale of commodities for more than their production costs. Rather, the value of capital arises from exploitation, which in Marx’s technical sense refers to the difference between the wage paid to a worker and the greater value that worker’s capacity to labor can produce (Marx, 1990: 418–421). Capitalists aim to maximize this difference, which Marx calls “surplus-value” (Marx, 1990: 293). The unequal exchange of capacity to labor for a wage defines the immanent antagonism of capitalism, between labor and capital.
Yet value is an abstract quantity. It can only exist if there are commodities which may be valued. Thus, the process of valorization necessitates a labor process, in which the concrete actions of humans produce goods or services. In the labor process, the labor/capital antagonism finds its primal expression. Since capital as such consists of numerous individual capitals engaged in competition, those individual capitals are driven to optimize their labor processes so as to increase the productivity of labor and the amount of surplus-value captured relative to their peers (Marx, 1990: 433). There are many ways to achieve increased productivity of labor going back to the division of labor. But Marx singled out the introduction of machinery as a qualitative break, because machines can be continually revolutionized, presenting ever new possibilities for increased productivity versus competitors (Marx, 1990: 1035). Capital’s imperative to valorization, driven by competition, thus compels the transformation of labor-intensive production processes into “process[es] of the technological application of scientific knowledge” (Marx, 1990: 775). In contemporary terms: processes of automation.
As Ramtin (1991) argues, once it is possible, automation becomes an “objective necessity imposed by the very functioning of the capitalist mode of production itself, in accordance with and as a result of the law of value” (p. 101). This does not mean that automation proceeds in a linear fashion, progressively eliminating humans across the economy. Rather, the automation of one task may necessitate new types of labor in adjacent tasks (Delfanti and Frey, 2021; Gray and Suri, 2019: 206). But wherever such new pockets of labor appear within data-intensive capital, there also appear efforts to replace them with machines, as the contemporary AI industry demonstrates (Steinhoff, 2021: 193–200). Over time, the constant component of capital (including machines) increases while its human component, labor, or variable capital, decreases. Marx (1990) referred to this as the increasing “organic composition” of capital (p. 762). This concept plays a central role in Marx’s schema, explaining several of capital’s distinctive macro-level tendencies (Ross, 2015). However, it also manifests on the smaller scale of the labor process.
In the labor process, machines are used not only to increase the speed and volume of production, but also to wrest control over the labor process from labor and transfer it to capital. Braverman (1998) describes this as the “separation of conception from execution” (p. 79). Its larval form is the division of labor, which develops into the “progressive elimination of the control functions of the worker . . . and their transfer to a device which is controlled . . . by management from outside the direct process” (Braverman, 1998: 146). Case studies of this separation abound in industries from machine tools (Noble, 1986) to education software (Mirrlees and Alvi, 2019). Machines allow capital to administer control over labor at an infrastructural level, fixing parameters of possible action and precluding alternative courses. The final stage of control is wholesale automation, or the “dissociation of living labour . . . from the production process” (Ramtin, 1991: 58). While such a maximally organic composition of capital is rarely obtained for most labor processes, it remains a motivating principle by structural necessity: the “development of the means of labour into machinery is not an accidental moment of capital, but is rather the historical reshaping of the traditional, inherited means of labour into a form
The year 2007 was a significant year in terms of the organic composition of capital. That year, Google acquired the targeted-advertising company DoubleClick, “ushering in a massive agglomeration of data between Google’s search information and DoubleClick’s user-tracked marketing data,” forming an “Internet-wide surveillance network” and setting off a chain of similar data-intensive acquisitions (Cheney-Lippold, 2017: 20). Since then, many sectors of capital have incorporated data-intensive technologies into their operations. Indeed, contemporary capitalism is often defined by its relation to data. It may be in terms of a “platform capitalism” centered on “extracting and using a particular kind of raw material: data” (Srnicek, 2017). It might be in terms of a “data capitalism” premised on “the sale of individual behavioral profiles tied to user data” (West, 2019: 23). Or it may be in terms of a “surveillance capitalism” since data collection depends on the “unilateral surveillance . . . of human behavior” (Zuboff, 2016: 1). The notion is one of a new mode of capitalist production in which digital data, harvested via surveillance, is of central importance to valorization. Data-intensive technologies, it would seem, are quite adequate to capital.
While few employ Marxist terminology, critical scholars of data and AI have tracked the increasingly data-intensive composition of capital. Such work has shown how data-intensive technologies are anything but neutral, immaterial, and inhuman. Rather, such scholars have argued that data-intensive technologies are implicated deeply in existing social relations, depend for their existence on a rich array of material factors, and cannot be considered in abstraction from the humans who design, deploy, and use them. More specifically, critical scholars have demonstrated how data-intensive technologies reinforce existing forms of discrimination (Katz, 2020; Noble, 2018) while undermining and complicating democratic processes (Sudmann, 2019), reconfiguring epistemologies (Kitchin, 2014; Lepage-Richer, 2021), and bolstering the pathologies of the capitalist mode of production, from ecological devastation to colonial predation of labor and resources (Chun, 2018; Crawford, 2021; Dyer-Witheford et al., 2019; Halpern, 2021; Verdegem, 2021). Others have pointed out how a self-reinforcing dynamic has developed between data and capital.
Once data-intensive technologies are built into business models, data becomes subject to the same endless self-justification as valorization. Data-intensive capital must “constantly collect and circulate data by producing commodities that create more data and building infrastructure to manage data. The stream of data must keep flowing and growing” (Sadowski, 2019: 4). Andrejevic (2020) notes a “cascading logic of automation . . . [in which] automated data collection leads to automated data processing, which, in turn, leads to automated response” (p. 9). While social media presents a wealth of data, capital cannot stop there, so surveillance extends “deeper into areas of human life that were once off-limits or too expensive to reach” (Crawford, 2021: 119). The notion is that, barring resistance, omnipresent, or as Andrejevic (2020) puts it, “frameless,” surveillance awaits. Capital’s data hunger will require an increasingly rigorous tracking of the human subjects from which data is generated. Thus, while data-intensive capital may achieve new levels of organic composition, a human element remains at its core.
Early critical research on data pointed out that data does not simply exist fully formed in the wild; it must be created. For Manovich (2001), data “does not just exist—it has to be generated. Data creators have to collect data and organize it, or create it from scratch” (p. 224). For Bowker (2005), “[r]aw data is both an oxymoron and a bad idea” (p. 184). Rather, data must be “cooked” or prepared. Subsequent work would also reject the notion of raw data (Gitelman, 2013), but would drop Manovich’s suggestive reference to creating data from scratch. Data is now recognized as the product of sensors collecting the traces of human actions; a copy of the world, not something that can be created from scratch. For Sadowski (2019), data is a “recorded abstraction of the world” (p. 2) much of which is “about people” (p. 6). For Gregory (2014), data “is made of people” or more precisely, the “very rhythms, circulations, palpitations, and mutations of our bodies.” For Lemov (2016), big data “is people” because “data is not only generated about individuals but also
In sum, critical views on data insist that data’s provenance is human subjectivity. Surveillance is thus not an accidental component of data-intensive capital, but its condition of possibility. While I do not dispute this logic, my goal is to explore an alternative means by which data-intensive capital might satisfy its hunger for data. What if capital could obtain data by a means other than the surveillance of people? This is the possibility suggested by synthetic data.
Synthetic data
Synthesis describes an act of combination; bringing together elements to generate a new whole; a “process that, by human agency, emulates certain properties of a naturally occurring material” (Raghunathan, 2021: 131). While natural rubber, made from the latex of the
Before delving into the technical details, it is worth considering why there is interest in making even more data, living as we are amid an unprecedented abundance or surfeit of data (Andrejevic, 2013; Mosco, 2015). In short, from the perspective of machine learning and computer science research, there is no surfeit, but rather a dire lack of data. Computer scientists complain of a “data shortage” (Yang, 2019) and “data scarcity” (Bansal et al., 2022) and assert that “many problems of modern AI come down to insufficient data” (Nikolenko, 2021: 12). Unsurprisingly, since AI is now integrated into many business processes, industry commentators share this perspective. A writer for
There are three frequently cited causes of the data shortage in both the technical and business literature. The first derives from the fact that data for supervised learning, the most popular approach to machine learning, must be hand-labeled (Nikolenko, 2021: 12). This is labor-intensive and makes creating labeled datasets difficult and buying them costly. One industry blogger puts the cost of a quality dataset between $10,500 and $85,000 USD (Incze, 2019). Furthermore, hand-labeling is unreliable. Recent research has shown that several datasets widely used as benchmarks contain a substantial number of errors (Wiggers, 2021a). The second cited cause of the data shortage is that for some applications, sufficient data simply does not exist. One may think of data in relation to digital platforms where millions of interactions can be captured each day, but for many potential applications of machine learning (farming is often mentioned), there is very little data (DeepLearning.AI, 2021; Yang, 2019). A related concern here is bias. While bias in data is a deep problem with diverse social factors (Ntoutsi et al., 2020), from a technical perspective, it refers to a dataset inadequately representing a particular feature. Synthetic data presents the possibility of creating new data to fix datasets biased in this sense (Watson, 2020). The third cited cause of the data shortage pertains to data accessibility. This refers to data which exists, but is rendered inaccessible by social, rather than technical, factors including privacy concerns, proprietary ownership, and regulation (El Emam, 2021). Medical applications are frequently mentioned in this regard. To summarize, from the perspectives of computer science and capital, there is no data surfeit, but rather a data shortage. Data is rare, expensive, and time-consuming to label, and access to the data that exists is often difficult, impossible, or ethically unsound.
This is the context in which synthetic data has been commercialized.
While synthetic data is far from superseding conventional data and far from eliminating completely the human component from the production of data, it is also far from merely theoretical. At the time of writing in early 2022, there are over 15 startup companies with synthetic data commodities on the market, including Synthesis AI, DataGen, Anyverse, Trūata, and Mostly.AI. Large tech companies are also involved. Microsoft (n.d.) plans to open-source its Synthetic Data Generator, and in 2021, Facebook spent an undisclosed sum to acquire the startup AI.Reverie, which creates simulated environments (detailed below) (Wiggers, 2021b). Even large non-AI companies like J.P. Morgan, John Deere, and American Express are producing and using synthetic data internally.
Still, one might reasonably ask: does synthetic data actually work? Experimental results suggest that it does. Using a generative model approach (detailed below), researchers created synthetic datasets from five publicly available datasets and challenged data scientists to develop predictive models from them. They compared the results of the data scientists working with the synthetic data to those of data scientists working with the original data and found “no significant difference,” suggesting that “synthetic data can successfully replace original data for data science” (Patki et al., 2016: 399). More recent studies are also supportive. Rankin et al. (2020) hold that most (92%) models trained on synthetic data have only slightly lower accuracy than those trained on real data, while Foraker et al. (2020) find that analyses on both synthetic and real data produced sufficiently statistically similar conclusions.
With context now established, it is time to turn to synthetic data itself. I will discuss three approaches to producing it: data augmentation, generative models, and simulated environments. I find it useful to imagine these on a continuum, in the same order, going from least to most synthetic, defined heuristically as how removed the data they produce are from the surveillance of human subjects. These approaches are often combined. For instance, you might generate synthetic data by capturing images of three-dimensional (3D) faces in a simulated environment. You might produce these faces, in turn, with a generative model, and you might expand your set of captured images via data augmentation.
Data augmentation
Data augmentation is the automated application of minor modifications to a dataset in order to enlarge or diversify it. This is most readily explicated in terms of image data. Here, augmentation can be as simple as rotating, cropping, or blurring images. Even if you merely rotate each existing photo once, you can double the size of your dataset. The potential gains are, however, much larger. Krizhevsky et al. (2012) show that 2048 new images can be generated from a single input image via data augmentation (p. 5). Furthermore, since labels are preserved throughout augmentation, a labeled dataset can be expanded for no additional labeling cost. The simplicity and effectiveness of data augmentation have made it a widespread practice. Today, most state-of-the-art image recognition models are trained with augmented data.
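To make the mechanics concrete, the following is a minimal sketch of label-preserving augmentation in plain Python. It is an illustration only: the function names, the toy image, and the "cat" label are hypothetical, and production pipelines typically rely on dedicated image libraries. The helper count_variants reflects one reading of how a figure like 2048 can arise (32 crop offsets along each axis, each crop also mirrored); that reading is an assumption here, not a restatement of the cited paper's procedure.

```python
from typing import List, Tuple

Image = List[List[int]]          # grayscale image as rows of pixel intensities
Labeled = Tuple[Image, str]      # (image, label) pair


def rotate90(img: Image) -> Image:
    """Rotate an image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def flip_horizontal(img: Image) -> Image:
    """Mirror an image left to right."""
    return [row[::-1] for row in img]


def crop(img: Image, top: int, left: int, size: int) -> Image:
    """Cut a size x size patch whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in img[top:top + size]]


def double_by_rotation(dataset: List[Labeled]) -> List[Labeled]:
    """Add one rotated copy of every image, reusing its label."""
    return dataset + [(rotate90(img), label) for img, label in dataset]


def augment(dataset: List[Labeled], patch: int) -> List[Labeled]:
    """Return every patch-sized crop of each image plus its mirror.

    The label of the source image is copied to each variant, so no new
    hand-labeling is needed.
    """
    out: List[Labeled] = []
    for img, label in dataset:
        height, width = len(img), len(img[0])
        for top in range(height - patch + 1):
            for left in range(width - patch + 1):
                piece = crop(img, top, left, patch)
                out.append((piece, label))
                out.append((flip_horizontal(piece), label))
    return out


def count_variants(offsets_per_axis: int) -> int:
    """Number of variants from square crop offsets, each also mirrored."""
    return offsets_per_axis * offsets_per_axis * 2


# Hypothetical 4x4 "photo" with a single hand-applied label.
toy = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
variants = augment([(toy, "cat")], patch=3)
```

In this toy case one labeled image yields eight labeled variants, and count_variants(32) evaluates to 2048. The point of the sketch is the one made above: the label travels with every transformed copy, so the dataset grows while the labeling cost stays fixed.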
Data augmentation may not sound very synthetic, as it merely extrapolates from existing data. However, it presents an example of how data can be procured, if not from scratch, from the manipulation of existing data. And when machine learning is applied to data augmentation, it further blurs the line between data augmentation and generation (Nikolenko, 2021: 90). So-called “smart” augmentation involves having a network which does not simply apply preset augmentations but learns an optimal way to augment data in the course of training in accord with reducing another network’s error (Lemley et al., 2017).
Generative models
Generative models generate a novel output, rather than classify a given input as this or that category (e.g. object recognition). There are many types of generative models, but the kind most used for synthetic data generation is the generative adversarial network (Goodfellow et al., 2014). The technical details are beyond the ambitions of this article, but the basic idea is to pit two networks against one another, with one trying to get the other to “misclassify data . . . that are only marginally different from those that they adequately classify” (Lepage-Richer, 2021: 215). Over the course of such adversarial exchanges a model can be trained that can extract and reproduce the statistical properties (and labels) of interest of a real dataset in a new synthetic dataset. Sensitive aspects may be eliminated, such that the new dataset will be anonymized. The generative model approach is thus pitched as mitigating privacy concerns and is promoted for applications in fields like medicine (Goncalves et al., 2020).
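The general idea of reproducing a dataset's statistical properties without reproducing its records can be sketched without any neural network. The example below is deliberately not a generative adversarial network: it swaps in a far simpler technique, fitting a per-label Gaussian to a measurement and sampling new labeled records from it. All names and the toy records are hypothetical, and a sketch like this carries no privacy guarantee.

```python
import random
import statistics
from typing import Dict, List, Tuple

Record = Tuple[float, str]                 # (measurement, label)
Model = Dict[str, Tuple[float, float]]     # label -> (mean, standard deviation)


def fit(records: List[Record]) -> Model:
    """Estimate the mean and standard deviation of the measurement per label.

    Each label needs at least two records for a standard deviation to exist.
    """
    by_label: Dict[str, List[float]] = {}
    for value, label in records:
        by_label.setdefault(label, []).append(value)
    return {
        label: (statistics.mean(values), statistics.stdev(values))
        for label, values in by_label.items()
    }


def sample(model: Model, n_per_label: int, seed: int = 0) -> List[Record]:
    """Draw new labeled records from the fitted per-label Gaussians."""
    rng = random.Random(seed)
    return [
        (rng.gauss(mu, sigma), label)
        for label, (mu, sigma) in model.items()
        for _ in range(n_per_label)
    ]


# Hypothetical "real" records: one measurement and one label each.
real = [
    (4.8, "a"), (5.1, "a"), (5.3, "a"), (4.9, "a"),
    (7.9, "b"), (8.2, "b"), (8.0, "b"), (8.4, "b"),
]
model = fit(real)
synthetic = sample(model, n_per_label=1000)
```

The synthetic records arrive already labeled and track the per-label averages of the original, while each one is a fresh draw rather than a copy. An adversarially trained model pursues the same goal for far richer data, such as images or full patient records.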
The Austrian company Mostly.AI, which uses generative models to produce synthetic data for banking, insurance, and telecommunications companies, describes synthetic data as data which is “nearly indistinguishable” from the original data it was generated from but “bears no direct relationship” to it (Pasieka, 2020). They even proclaim that “synthetic datasets contain all the value of the data without any privacy risk. A bit like a rich cake without calories” (Mostly.AI, n.d.). While such claims for anonymization are disputed (Stadler et al., 2021), they have convinced the United States of America’s medical administration. In 2020, the National Institutes of Health entered into a partnership with Israeli company MDClone to produce an anonymized synthetic clinical dataset for COVID-19 research (MDClone, n.d.).
Simulated environments
At the furthest end of the synthetic data continuum, we have the use of simulated environments. In this approach, data is synthetic not because it is extrapolated from existing conventional data, but because it is collected within a synthetic world. Simulated environments can be as simple as an empty virtual space in which objects can be captured in images. The dataset ShapeNet, which features around 51,300 categorized and labeled 3D models, ranging from airplanes to wine bottles, can be used in such an environment. More complex simulations range from detailed indoor scenes, like Facebook’s Habitat, to full cityscapes, complete with moving vehicles, pedestrians, and dynamic weather, such as VIVID (Virtual Environment for Visual Deep Learning) (Lai, n.d.). Environments may be passive, in that one merely gathers image or video data from within them. They may also be interactive, in which some sort of agent or user can interact with them to generate data, such as Waymo’s Simulation City, which generates driving data for autonomous vehicles (Hawkins, 2021). The increasingly sophisticated graphics of video games have also made their simulated environments, including that of Grand Theft Auto V, attractive venues for synthetic data generation (Richter et al., 2016). Video game engines are also being used to build bespoke simulated environments. The developers of the Unity engine sell a simulated environment for autonomous vehicle training which they claim is used by “80% of the world’s leading automotive manufacturers” (Unity Technologies, n.d.).
Simulated environments are the holy grail of synthetic data. Once constructed, they present the possibility of data which is not extrapolated from an existing dataset. This is desirable not only because it nullifies privacy concerns, but also because it eliminates the labor of data collection and labeling: “if you have a three-dimensional virtual scene complete with objects of interest, and images for the dataset are produced by rendering, it means that you automatically know which object every pixel belongs to, what are the 3D relations between them, and so on, and so forth. Producing new data becomes very cheap, and labeling becomes free” (Nikolenko, 2021: v).
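The claim that labeling becomes free can be illustrated with a toy renderer. The sketch below is a hypothetical two-dimensional stand-in for a simulated environment, not any vendor's system: it places rectangular "objects" on a grid and writes the image and the per-pixel label map in the same pass. The class names and intensity values are assumptions chosen for illustration.

```python
import random
from typing import List, Tuple

Grid = List[List[int]]

# Hypothetical object classes and the pixel intensity each is drawn with.
CLASSES = {0: "background", 1: "car", 2: "pedestrian"}
INTENSITY = {0: 0, 1: 200, 2: 120}


def render_scene(width: int, height: int, n_objects: int,
                 rng: random.Random) -> Tuple[Grid, Grid]:
    """Render a random scene and its per-pixel labels together.

    Objects are rectangles of at most 3 x 3 pixels, so width and height
    must each be at least 3. Later objects occlude earlier ones in both
    the image and the label map.
    """
    image = [[INTENSITY[0]] * width for _ in range(height)]
    labels = [[0] * width for _ in range(height)]
    for _ in range(n_objects):
        cls = rng.choice([1, 2])
        obj_w, obj_h = rng.randint(1, 3), rng.randint(1, 3)
        left = rng.randint(0, width - obj_w)
        top = rng.randint(0, height - obj_h)
        for y in range(top, top + obj_h):
            for x in range(left, left + obj_w):
                image[y][x] = INTENSITY[cls]   # what a camera would see
                labels[y][x] = cls             # ground truth, known by construction
    return image, labels


rng = random.Random(0)
dataset = [render_scene(8, 6, n_objects=3, rng=rng) for _ in range(100)]
```

Because the simulator decides where every object goes, the label map is a by-product of rendering rather than the result of human annotation, and producing a hundred more scenes costs only compute.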
Yet, even as they present such advantages, simulations remain quite labor-intensive since they must be built in the first place. Even the adaptation of existing video game environments generally requires extensive labeling. Thus, the data shortage reappears within the virtual realm. Some synthetic data practitioners already speak of a lack of “asset diversity” or paucity of virtual things (AI.Reverie, 2021). And as the ambitions of synthetic data producers grow, so will the size and complexity of simulated environments. Thus, research on the automated production of whole environments is underway using procedural generation (Qi et al., 2018) and generative adversarial models (Santana and Hotz, 2016). Microsoft researchers have recently shown that procedural generation, in combination with a large library of hand-crafted 3D components, can be used to generate synthetic faces which can then be used to train facial recognition algorithms without any real-world data as input. The researchers conclude that with “a sufficiently good synthetic framework, it is possible to create training data that can be used to solve real world problems in the wild, without using any real data at all” (Wood et al., 2021: 3682). This experiment demonstrates starkly the novel situation posed by synthetic data. If conventional data can be dispensed with, then so can its invasive collection. Additional layers of technological mediation and new forms of labor can replace surveillance.
Data adequate to capital
The approaches discussed above face a shared technical problem of whether models trained on synthetic data will work when applied to real-world data. This is called the problem of synthetic-to-real “domain transfer” (Nikolenko, 2021: vi) or of bridging the “reality gap” (Tremblay et al., 2018). All machine learning projects are concerned with “generalization ability” or how well a model functions when applied to data it was not trained on (Alpaydin, 2016: 40). With synthetic data this concern is amplified because it is even less certain whether training data adequately emulates the real world, especially in the case of simulated environments in which there are “limitations on rendering quality, including unrealistic texture, appearance, illumination and scene layout, etc.” (Chen et al., 2020: 1). Synthetic data thus comes with a new technical problem of adequately representing the real world it presents an alternative to. But beyond the technical realm, what might the broader implications of a shift from surveillance to synthesis be for data-intensive capital and societies structured around it?
We have seen the technical and business motivations for using synthetic data, summed up as the data shortage. But in addition to ameliorating this particular technical issue deriving from the nature of machine learning, I argue that synthetic data also satisfies a tendency immanent to capital, toward rendering the valorization process autonomous from human subjectivity. Recall that Marx (1993) describes an immanent tendency of capital to replace humans with machines or to reshape “the traditional, inherited means of labour into a form adequate to capital” (p. 694). In so doing, capital increases its machinic component or organic composition. I suggest that synthetic data be understood as a manifestation of rising organic composition because it involves increasing the machine element and minimizing the human element in the production of data. The data-intensive shift in capital is an ongoing episode in the rising organic composition of capital in which machine learning models are deployed to automate all kinds of analytic, predictive, and managerial functions in the spheres of conception, production, and circulation. But machine learning relies crucially on data collected via the surveillance of human subjects. While the subjects of surveillance are not laborers, they are human, and potentially fractious. By severing, or at least attenuating, the connection between data and surveillance, synthetic data contributes to the automation of the production of machine learning, itself an automation technology. It signals a qualitative shift in how data is obtained, beyond merely automating an existing process. It reconfigures the conditions of possibility for data-intensive capital, and constitutes a data more adequate to capital.
Implications
The data shortage refers to contemporary technical and social relations of data production based on surveillance, which are no longer adequate to data-driven capital. As popular discontent with surveillance mounts and new regulations following the EU’s General Data Protection Regulation, such as Canada’s (currently tabled) federal Bill C-11 Digital Charter Implementation Act, erect further barriers to conventional data collection, data-intensive capital will find these relations less and less tolerable. For that reason, it is worth considering the potential implications of synthetic data now. It seems reasonable to assume that synthetic data generation will occur alongside data collection via surveillance, rather than supplant it. For this reason, future research may find it fruitful to consider William Bogard’s work on the interplay between surveillance and simulation in digital media. Bogard (1996) suggests that simulation technologies “are forms of hypersurveillant control, where the prefix ‘hyper’ implies not simply an intensification of surveillance, but the effort to push surveillance technologies to their absolute limit” by prefiguring events, rather than merely recording them (p. 4). Yet, he notes that “surveillance and simulation, although really distinct assemblages, always mutually implicate and complicate each other’s operations and development” (Bogard, 2006: 75). How exactly the simulated environments approach to synthetic data might figure into such a configuration remains to be explored. But for the sake of vividness, let us imagine a situation in which synthetic data fully replaces conventional data. Perhaps surveillance for data collection is rendered impossible through sophisticated personal encryption systems, such as the gevulot technology in Hannu Rajaniemi’s (2010) novel
A first implication concerns the ontological status of data. The senior vice president of AI and machine learning at simulated environment producer Unity derides real data as “really just a snapshot of the situation” whereas synthetic data presents the possibility of “augment[ing] that real world with special use cases, special situations, special events” (VB Staff, 2021). While conventional data is, as critical data scholars argue, a trace or recording, synthetic data is pitched as “represent[ing] the real world in a very even way, better than the real world does” (VB Staff, 2021). It can include things that rarely or never happen. While machine learning is at base a mere “learning from examples” (Offert, 2021: 16), the simulated environments approach means that examples can be constructed as needed. On this account, synthetic data improves upon conventional data in that it has a broader scope. To borrow from Andrejevic (2020), it is “frameless” since it could potentially include anything, at least anything that can be simulated (p. 106). Synthetic data is also, to borrow from Baudrillard (1994), “hyperreal” since it quite literally involves the “generation by models of a real without origin or reality” (p. 1).
A second implication concerns the ontological status of embodiment. Phenomenological critiques of AI since Dreyfus (1972) have argued that intelligence does not consist of the abstract manipulation of formal symbols, but that it is rather the product of a body and perceptual apparatus, which integrate diverse motor and sensory data. Brooks (1991) and others have pursued a project of situated/embodied AI which focuses on robotic agents developing intelligence by interacting with real environments. Some critical scholars who have seen in such efforts a welcome focus on the concrete, and a turn away from abstraction, have insisted on the socially emancipatory import of the link between intelligence and embodiment (Suchman, 2007; Pettersen, 2019). However, the sophistication of simulated environments throws the abstract/embodied dichotomy into question because it allows virtual embodied agents to interact with virtual environments. Agents consisting wholly of code can have a virtual body and perceptual apparatus, subject to simulated laws of physics, and can interact with complex environments as well as other agents, emulating sociality. Indeed, situated/embodied AI research now often proceeds in simulated environments (Heess et al., 2017). Embodiment is no longer irreducible to the virtual. It is rather a question of bridging the reality gap to translate virtual embodiment into the real world.
A third implication is epistemological, concerning machine learning models trained on synthetic data. A frequently made critique of machine learning is that it cannot produce or grasp novelty. A model can only recognize patterns that exist in its training data, which can only encompass things that have already happened. Speaking of generative models in particular, Offert (2021) notes that “what there is to know is what is already known” (p. 11). In other words, machine learning works via statistical induction and cannot make the leap of Peircean abduction, which invents the new (Pasquinelli, 2017). Thus, machine learning models break down when exposed to anomalous events or merely unfamiliar data. Synthetic data presents the option of including in training data events that have never occurred in the real world, opening up the possibility of models which are not limited to reiterating the past. If you are collecting data on driving for an autonomous vehicle system, you may find it difficult to recruit real drivers willing to experience a five-car pileup. In a simulated environment, however, there is no barrier to running such a scenario thousands of times.
A fourth implication also concerns the epistemology of simulated environments. In our scenario, contemporary questions of how a given machine learning model works and what its training data were would remain. But an additional question could also arise: what was the simulated environment in which this data was generated like? Machine learning models consist of patterns extracted from training data. However, as Cramer (2018) rightly argues, this is far from an objective form of analysis, since hermeneutics and bias are implicated “at the point where data is captured, since almost any type of data acquisition requires subjective decision making” (p. 35). This will not change in a simulated environment and may in fact be compounded as it cannot be assumed that all virtual worlds will adequately emulate relevant real-world patterns. How well do simulated pedestrians emulate actual pedestrians? How is one to tell? What unexpected or unintended patterns might a model pick up from a simulated environment, particularly in the case of a complex environment with emergent properties generated by interacting agents? Nikolenko (2021) notes that “thankfully” it is not a common endeavor to try to transfer AI agents trained in video games such as StarCraft to actual combat situations (p. 215). But less extreme examples still present cause for concern.
A fifth implication is political. Marx (1990) saw the rising organic composition of capital in steam-powered machinery, which he described as a “mechanical monster whose body fills whole factories” (p. 503). Today, the mechanical monster’s sprawl is perhaps most evident in the proliferation of huge data centers. But it also takes the form of deepening layers of software: networks, operating systems, applications, and so on. Since capital became computational, its rising organic composition has involved a process of “market-driven immanentization” (Land, 2012: 339–340) in which virtual markets progressively exclude “transcendent elements” such as human subjectivity and replace them with “economically programmed circuits” (Land, 2012: 341). Synthetic data aims to replace the transcendent human subject as the origin of data. Yet, the human subject and its surveillance have been a focal point for resistance to data-intensive capital. If the production of data withdraws from the real world and occurs deep in digital immanence, where can resistance to data-intensive capital, which was previously directed at the point of surveillance, occur?
Conclusion
Despite the heading of this section, this is an article without many conclusions. Synthetic data is in a very early stage of commercialization, and to try to flesh out its political economy in detail would be premature. I have, for the most part, posed rather than answered questions. I have aimed only to present a level-headed assessment of synthetic data and some ways in which it is produced, and to provide some provocative musings on its potential implications within data-intensive capitalism. While much more remains to be investigated, it is safe to conclude one thing. In response to the assumption that the source of data is necessarily human, we can offer a qualified negative. It is a
Acknowledgements
The author thanks Elena Papagiannaki, Vince Manzerolle, and Alessandro Delfanti for helpful comments on drafts of this paper.
Author’s Note
All authors have agreed to the submission of this article and it is not currently being considered for publication by any other print or electronic journal.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Social Sciences and Humanities Research Council of Canada Grant 756-2021-0520.
