Abstract
Foundation models are a new frontier of value creation in the digital platform economy. These technologies rely on the production and consumption of massive datasets that are monetised through consumer-facing artificial intelligence (AI) products. However, the unlicensed use of these materials by the AI industry has provoked a legal and conceptual dispute. Media industries claim that the material in datasets is ‘content’ and subject to copyright law. AI industries, alternatively, are working to strategically reframe those materials as ‘data’, which is governed through regimes more congenial to the industry's business models, such as loosely enforced data protection and technologically secured trade secrets. This essay shows how AI copyright litigation, and the central question of data versus content, is mediating between different claims to the right to generate and justify value from datasets, as well as participating in the broader reformation of AI dataset markets.
Introduction
Foundation models are a new frontier of value creation in the digital platform economy. The market uptake of artificial intelligence (AI) systems built from foundation models – very large machine learning models trained on massive amounts of data – including generative AI systems, has been meteoric. But this latest generation of technology products and business strategies has become embroiled in the decades-long tussle between content (née media) industries and AI (née data processing) industries over the capacity to generate value from information as it flows between producers, online platforms, and consumers. Intellectual property scholars have excitedly dissected these developments, wrestling with the sticky doctrinal questions of whether and how this diverse range of technological systems infringe existing law. Rather than drill into the doctrinal questions, however, this essay highlights the need to frame these disputes within the broader legal dynamics of platformisation.
The tense legal relations between data processing and media industries have profoundly shaped the direction of platformisation, and the viability and profitability of different business models. Access to, and control over, flows of data and content have long determined which actors can capture value as information moves between producers, platforms, and consumers.
This ongoing dispute has reached a new crescendo, however, with the AI industry's unauthorised use of massive troves of online material for training foundation models, provoking a slew of litigation and policy work on how to understand the relationship between AI development, dataset circulations and copyright – or, more broadly, the question of how to manage the AI industry's new techniques for appropriating and valorising data. This iteration of the dispute involves a novel conceptual twist. Part of data processing/AI firms’ strategy for extracting content from the control of content industries is to argue that such material is not content at all. Specifically, this essay outlines how arguments rendered about copyright doctrine are proxies for broader questions of whether the material that constitutes AI pre-training datasets consists of either ‘content’, managed and monetised through intellectual property regimes like copyright, or ‘data’, managed and monetised through loosely enforced data protection laws and technologically secured trade secrets.
The goal of this essay is to situate those legal arguments within a broader sphere of conceptual competition over how to define the AI dataset as a legal and economic object, and to show how disputes over AI training are participating in the broader reformation of AI dataset markets.
Data or datasets?
Digital platforms and data processing businesses profit tremendously by controlling personal data as it flows through platforms and ad tech ecosystems. But foundation models and AI have introduced new data valorisation pathways and new narratives around what counts as data at all. In the (now familiar) online advertising business model, personal data is used to create profiles and behavioural metrics that format users as machine-readable bundles of probabilistic profit opportunities (Goldenfein & McGuigan, 2023). Those profiles and metrics enable advertisers to influence online experience in ways that increase the likelihood of preferred outcomes. Through these mechanisms, data is valorised less as a stable, tradeable commodity, and more as a constant flow between individuals, ad ecosystems and information environments (Benthall & Goldenfein, 2021), typically operating as a tool for steering consumer behaviour (Pistor, 2020). As personal data's centrality to the digital economy has stabilised, international statistical agencies and national accounting working groups have focused on how to define and standardise data as a financial and economic asset. Those agencies defined data as ‘information that is produced by accessing and observing phenomena: and recording, organizing and storing information elements from these phenomena in a digital format, which can be accessed electronically for reference or processing’ (Mitchell, 2021; Mitchell et al., 2021). The OECD (2008) gives a shorter but similar definition in its Glossary of Statistical Terms, in which it defines data as ‘Characteristics or information, usually numerical, that are collected through observation’.
The new legal and policy disputes erupting around AI training data butt awkwardly against those definitions. The materials that constitute the pre-training and fine-tuning datasets used to train and benchmark AI are not the same as the flows of personal data that power online advertising businesses. The latter may be information collected through observing phenomena – specifically users' interactions with information services and data tracking infrastructures – but to emulate human expression, generative AI systems are trained on human expression. AI datasets, especially massive pre-training datasets, are made of text, images, programming code and video. They are made of what, in other contexts, would be called online ‘content’. And this material is typically generated and managed by amateur and professional content industries like software, film and television, and news publishing businesses, as well as various online platforms that cultivate rich forms of expressive human interaction.
The different character of data used to train AI compared to the data powering online advertising is central to the legal instability of datasets. Media industries that control the ‘content’ that comprises datasets exploit their repositories in different ways, but they typically do so in close association with legal regimes that create and define assignable rights in creative expression. For example, commercial content industries rely on copyright laws to create rights that are monetised through licensing arrangements or assignments. Less professionalised content creators may not directly monetise content, but they nonetheless engage with copyright and licensing to generate other benefits and forms of social capital while still controlling access to, and permissible uses of, different materials. Open-source communities, for example, trade in productive ethos and reputational indexes that undergird large, decentralised projects like Wikipedia, GitHub and Stack Exchange, while still interacting with copyright, though in a more nuanced way, through Creative Commons or GNU General Public Licenses, which typically allow end users to freely run, study, and modify software. Such licenses permit and prohibit different uses of works but nonetheless depend on complex webs of legal arrangements that involve copyright creation, assignment and licensing (Choksi & Geodicke, 2023).
Using that material to train AI models would, intuitively, require licensing the right to do so from its creator, owner or assignee. But having to license all the materials in pre-training datasets would make building AI systems unfathomably complex and expensive for AI companies. Further, AI companies generate value from this material not by using its expressive content to aggregate human attention and sell ads or subscriptions, but by computationally processing and capturing what the material reveals about linguistic and grammatical rules, relations between dialogue and topics, image and text, the quirks of human language, and communicative styles. AI industries therefore want access to repositories of human expression, but they want to operationalise it in ways that leverage the latent statistical relations within that expression, that is, as data. And in doing so, they want to evade the legal regimes that govern the financial exploitation of that expression, that is, as content.
Content or data?
The effort to avoid engaging with existing economic and legal systems that govern the monetisation of expressive content by calling it ‘data’ is not simply strategic AI industry rhetoric. It reflects the actual material processes animating AI model training; and it elucidates a legitimate duality in datasets that is both deeply confounding for law and deeply consequential for data markets. Should the law approach datasets as repositories of cultural expression or as repositories of statistical relations extracted through technological processes? Does what we understand to constitute ‘data’ include not only what can be observed about a phenomenon, but also the material subject to observation? How should legal doctrine deal with this material being initially created in one economic context with its own pathways and practices for exploiting it as a form of expression, but then, through computation, it being transitioned into another economic context, controlled by platforms and AI firms, and subject to radically different value propositions and monetisation strategies?
These questions make the simple application of copyright doctrine to AI training datasets deeply fraught. Judges presiding over copyright litigation in the United States have, so far, struggled to reconcile the encounter between online content and AI as a computational form.
Looking through the narrow lens of copyright doctrine, one might interpret these arguments as merely prodding the boundaries of copyright's internal logic. But whether or not the stuff of datasets is deemed fact or expression fails to capture the stakes of the dispute, as well as the reality that both are clearly true. As copyright scholar Ben Sobel notes, the already unstable distinction between fact and expression has become an impractical and unworkable doctrinal anchor for AI copyright markets (Sobel, 2024). Determinations on copyright doctrine are certainly part of how law is mediating between different claims over the right to generate and justify value from datasets. But the doctrinal contouring of legitimate value claims is only part of a broader reformation of AI dataset markets, which is already well underway.
Media industries or data intermediaries?
The copyright questions, if ever determined coherently, might place datasets either inside or outside of the bounds of copyright law. In other words, datasets might be deemed to be constituted by data that can be appropriated from a public domain or, in contrast, they might be deemed content that has to be licensed to be legitimately processed. But this dichotomy neglects the reality that datasets are already being managed through novel and evolving governance regimes that trade precisely on their hybrid character. Datasets are both, simultaneously, content and data; and relying on legal doctrine to configure datasets as one or the other is blind to how computation has already reshaped the ontology of media and how it circulates in the digital economy (Çalışkan and Callon, 2009).
Our current technological milieu obsolesces the fact versus expression dichotomy for datasets. The material in AI datasets has both a ‘representational’ character, wherein meaning and value are derived from what the material expresses to human audiences, as well as an ‘operational’ character, wherein meaning and value are derived from its being operationalised in data processing systems that reveal statistical relations legible to machines (Goldenfein, 2024). In the generative AI context, those statistical relations about culture are then repackaged and sold as AI services capable of re-organising and recompiling the individual conceptual units of culture as ‘styles’ (Reimer and Peter, 2024). This duality is now characteristic of all digital media (Andrejevic and Zala, 2021; Farocki, 2004; Paglen, 2014; Uliasz, 2021), and emerging data markets already operate on this basis.
In 2023 and 2024 Microsoft, OpenAI, Perplexity, Google, Apple, Adobe and Nvidia – despite being engaged in ongoing copyright litigation – made dozens of confirmed deals with news media and other publishing companies, enabling content to be used for model training, with a reported spend of over $350,000,000 USD. Hundreds more unconfirmed arrangements of unknowable financial amounts have also been reported (Brown, n.d.; CB Insights, 2024; Schomer, 2024), with the marketplace value estimated at around $3 billion USD and growing (Paul and Tong, 2024). There are also a range of new industry organisations and standardisation efforts working to stabilise how datasets take shape as tradeable commodities (Rosenblatt, 2024). This includes, for instance, intermediaries like the Dataset Providers Alliance, which represents sellers of music, image and video for use in AI training (Paul, 2024), and advocates for rules and standards to help valorise datasets in legally legitimate ways, typically through dataset transparency and provenance rules that make data transactions more efficient. For these licensing deals, very little hinges on whether the object of exchange is content, data, or any other ‘product’. These pragmatic arrangements afford access to materials for the specific purpose of model training; they articulate the economic logic of content into the economic logic of data, and through that transition hybridise datasets as economic objects.
The financial value of these exchanges further lubricates the slippage between the representational-expressive and operational-statistical conceptual character of data. The nearly defunct media repository, Photobucket, for instance, claims to have been negotiating prices of $0.05 to $1.00 per image and more than $1.00 per video for AI training (Paul and Tong, 2024; Notopoulos, 2024). Text is priced closer to $0.001 per word, though other companies claim to charge higher rates. Platforms like Reddit have made deals that price access to datasets of thematically labelled comments at around $60,000,000 per year (Tong et al., 2024). Most academic publishers have also made data access arrangements (Informa PLC, 2024). On these measures, for some actors at least, media has a higher financial value as data than as content, suggesting that AI is reconfiguring media businesses into data businesses.
Content businesses such as publishers, art repositories, music labels and other media firms that control catalogues of content are therefore becoming a new type of data intermediary, selling access to their catalogues to AI companies as a form of data. Book-writing academics have experienced this first-hand, as academic presses now seek retroactive addenda to publishing contracts granting them rights to license access to book content for AI model training. While this presently constitutes only a fraction of academic publisher revenue (Informa PLC, 2024), these shifts demonstrate how the intellectual (i.e. expressive) value of academic work is being complemented, or potentially eclipsed, by what that work indicates statistically about the inherent relations of concepts and language. If this trend continues, the primary value of media may become its operationalisation for the creation of data, through which its more valuable second-order statistical encodings are extracted and monetised.
Conclusion
Legal regimes play a critical role in shaping the conditions under which data becomes a value form, including the commercial conditions of data access, commodification and monetisation. Informational commodities rely on regimes like intellectual property or data protection to define their boundaries and conditions of circulation. Which regimes govern the form and movement of informational commodities at which times, however, is determined by legal settlements that institute and organise markets and market actors. Media industries and data processing industries have tussled over the legal configurations defining the movement of data and content in the platform context for decades. We are now in the midst of a battle between data processing industries and media industries as to which regimes will govern the digital artefact that is the dataset. At the heart of this battle is a conceptual question: ‘What is data?’ The inchoate identity of datasets as legal and economic objects situates them at the frontier of two competing economic logics, championed by two sectors of the digital platform economy, in pursuance of a decades-old dispute over who is able to profit from the movement of information online. The resolutions and settlements that emerge from these disputes will set the terms of how we understand and govern the markets for AI inputs for some time.
Funding
The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The author received funding support through the Australian Research Council Centre of Excellence for Automated Decision-Making and Society. Grant number CE200100005.
Declaration of conflicting interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
