Abstract
Foundation models are a new frontier of value creation in the digital platform economy. These technologies rely on the production and consumption of massive datasets that are monetised through consumer-facing artificial intelligence (AI) products. However, the unlicensed use of these materials by the AI industry has provoked a legal and conceptual dispute. Media industries claim that the material in datasets is ‘content’ and subject to copyright law. AI industries, alternatively, are working to strategically reframe those materials as ‘data’, which is governed through regimes more congenial to the industry's business models, such as loosely enforced data protection and technologically secured trade secrets. This essay shows how AI copyright litigation, and the central question of data versus content, is mediating between different claims to the right to generate and justify value from datasets, as well as participating in the broader reformation of AI dataset markets.
Introduction
Foundation models are a new frontier of value creation in the digital platform economy. The market uptake of artificial intelligence (AI) systems built from foundation models – very large machine learning models trained on massive amounts of data – including generative AI systems, has been meteoric. But this latest generation of technology products and business strategies has become embroiled in the decades-long tussle between content (née media) industries and AI (née data processing) industries over the capacity to generate value from information as it flows between producers, online platforms, and consumers. Intellectual property scholars have excitedly dissected these developments, wrestling with the sticky doctrinal questions of whether and how this diverse range of technological systems infringe existing law. Rather than drill into the doctrinal questions, however, this essay highlights the need to frame these disputes within the broader legal dynamics of platformisation.
The tense legal relations between data processing and media industries have profoundly shaped the direction of platformisation, and the viability and profitability of different business models. Access to, and control over, flows of data and content have long determined which actors can capture value as information moves between producers, platforms, and consumers.
This ongoing dispute has reached a new crescendo, however, with the AI industry's unauthorised use of massive troves of online material for training foundation models, provoking a slew of litigation and policy work on how to understand the relationship between AI development, dataset circulations and copyright – or, more broadly, the question of how to manage the AI industry's new techniques for appropriating and valorising data. This iteration of the dispute involves a novel conceptual twist. Part of data processing/AI firms’ strategy for extracting content from the control of content industries is to argue that such material is not content at all. Specifically, this essay outlines how arguments rendered about copyright doctrine are proxies for broader questions of whether the material that constitutes AI pre-training datasets consists of either ‘content’, managed and monetised through intellectual property regimes like copyright, or ‘data’, managed and monetised through loosely enforced data protection laws and technologically secured trade secrets.
The goal of this essay is to situate those legal arguments within a broader sphere of conceptual competition over how to define the AI dataset as a legal and economic object, and to show how disputes over AI training are participating in the broader reformation of AI dataset markets.
Data or datasets?
Digital platforms and data processing businesses profit tremendously by controlling personal data as it flows through platforms and ad tech ecosystems. But foundation models and AI have introduced new data valorisation pathways and new narratives around what counts as data at all. In the (now familiar) online advertising business model, personal data is used to create profiles and behavioural metrics that format users as machine-readable bundles of probabilistic profit opportunities (Goldenfein & McGuigan, 2023). Those profiles and metrics enable advertisers to influence online experience in ways that increase the likelihood of preferred outcomes. Through these mechanisms, data is valorised less as a stable, tradeable commodity, and more as a constant flow between individuals, ad ecosystems and information environments (Benthall & Goldenfein, 2021), typically operating as a tool for steering consumer behaviour (Pistor, 2020). As personal data's centrality to the digital economy has stabilised, international statistical agencies and national accounting working groups have focused on how to define and standardise data as a financial and economic asset. Those agencies defined data as ‘information that is produced by accessing and observing phenomena: and recording, organizing and storing information elements from these phenomena in a digital format, which can be accessed electronically for reference or processing’ (Mitchell, 2021; Mitchell et al., 2021). The OECD (2008) gives a shorter but similar definition in its Glossary of Statistical Terms, in which it defines data as ‘Characteristics or information, usually numerical, that are collected through observation’.
The new legal and policy disputes erupting around AI training data butt awkwardly against those definitions. The materials that constitute the pre-training and fine-tuning datasets used to train and benchmark AI are not the same as the flows of personal data that power online advertising businesses. The latter may be information collected through observing phenomena – specifically users' interactions with information services and data tracking infrastructures – but to emulate human expression, generative AI systems are trained on human expression. AI datasets, especially massive pre-training datasets, are made of text, images, programming code and video. They are made of what, in other contexts, would be called online ‘content’. And this material is typically generated and managed by amateur and professional content industries like software, film and television, and news publishing businesses, as well as various online platforms that cultivate rich forms of expressive human interaction.
The different character of data used to train AI compared to the data powering online advertising is central to the legal instability of datasets. Media industries that control the ‘content’ that comprises datasets exploit their repositories in different ways, but they typically do so in close association with legal regimes that create and define assignable rights in creative expression. For example, commercial content industries rely on copyright laws to create rights that are monetised through licensing arrangements or assignments. Less professionalised content creators may not directly monetise content, but they nonetheless engage with copyright and licensing to generate other benefits and forms of social capital while still controlling access to, and permissible uses of, different materials. Open-source communities, for example, trade in productive ethos and reputational indexes that undergird large, decentralised projects like Wikipedia, GitHub and Stack Exchange, while still interacting with copyright, though in a more nuanced way, through Creative Commons or GNU General Public Licenses, which typically allow end users to freely run, study, and modify software. Such licenses permit and prohibit different uses of works but nonetheless depend on complex webs of legal arrangements that involve copyright creation, assignment and licensing (Choksi & Geodicke, 2023).
Using that material to train AI models would, intuitively, require licensing the right to do so from its creator, owner or assignee. But having to license all the materials in pre-training datasets would make building AI systems unfathomably complex and expensive for AI companies. Further, AI companies generate value from this material not by using its expressive content to aggregate human attention and sell ads or subscriptions, but by computationally processing and capturing what the material reveals about linguistic and grammatical rules, relations between dialogue and topics, image and text, the quirks of human language, and communicative styles. AI industries therefore want access to repositories of human expression, but they want to operationalise it in ways that leverage the latent statistical relations within that expression, that is, as data. And in doing so, they want to evade the legal regimes that govern the financial exploitation of that expression, that is, as content.
Content or data?
The effort to avoid engaging with existing economic and legal systems that govern the monetisation of expressive content by calling it ‘data’ is not simply strategic AI industry rhetoric. It reflects the actual material processes animating AI model training; and it elucidates a legitimate duality in datasets that is both deeply confounding for law and deeply consequential for data markets. Should the law approach datasets as repositories of cultural expression or as repositories of statistical relations extracted through technological processes? Does what we understand to constitute ‘data’ include not only what can be observed about a phenomenon, but also the material subject to observation? How should legal doctrine deal with this material being initially created in one economic context with its own pathways and practices for exploiting it as a form of expression, but then, through computation, it being transitioned into another economic context, controlled by platforms and AI firms, and subject to radically different value propositions and monetisation strategies?
These questions make the simple application of copyright doctrine to AI training datasets deeply fraught. Judges presiding over copyright litigation in the United States have, so far, struggled to reconcile the encounter between online content and AI as a computational form.
Looking through the narrow lens of copyright doctrine, one might interpret these arguments as merely prodding the boundaries of copyright's internal logic. But whether or not the stuff of datasets is deemed fact or expression fails to capture the stakes of the dispute, as well as the reality that both are clearly true. As copyright scholar Ben Sobel notes, the already unstable distinction between fact and expression has become an impractical and unworkable doctrinal anchor for AI copyright markets (Sobel, 2024). Determinations on copyright doctrine are certainly part of how law is mediating between different claims over the right to generate and justify value from datasets. But the doctrinal contouring of legitimate value claims is only part of a broader reformation of AI dataset markets, which is already well underway.
Media industries or data intermediaries?
The copyright questions, if ever determined coherently, might place datasets either inside or outside of the bounds of copyright law. In other words, datasets might be deemed to be constituted by data that can be appropriated from a public domain or, in contrast, they might be deemed content that has to be licensed to be legitimately processed. But this dichotomy neglects the reality that datasets are already being managed through novel and evolving governance regimes that trade precisely on their hybrid character. Datasets are both, simultaneously, content and data; and relying on legal doctrine to configure datasets as one or the other is blind to how computation has already reshaped the ontology of media and how it circulates in the digital economy (Çalışkan and Callon, 2009).
Our current technological milieu obsolesces the fact versus expression dichotomy for datasets. The material in AI datasets has both a ‘representational’ character, wherein meaning and value are derived from what the material expresses to human audiences, as well as an ‘operational’ character, wherein meaning and value are derived from its being operationalised in data processing systems that reveal statistical relations legible to machines (Goldenfein, 2024). In the generative AI context, those statistical relations about culture are then repackaged and sold as AI services capable of re-organising and recompiling the individual conceptual units of culture as ‘styles’ (Reimer and Peter, 2024). This duality is now characteristic of all digital media (Andrejevic and Zala, 2021; Farocki, 2004; Paglen, 2014; Uliasz, 2021), and emerging data markets already operate on this basis.
In 2023 and 2024 Microsoft, OpenAI, Perplexity, Google, Apple, Adobe and Nvidia – despite being engaged in ongoing copyright litigation – made dozens of confirmed deals with news media and other publishing companies, enabling content to be used for model training, with a reported spend of over $350,000,000 USD. Hundreds more unconfirmed arrangements of unknowable financial amounts have also been reported (Brown, n.d.; CB Insights, 2024; Schomer, 2024), with the marketplace value estimated at around $3 billion USD and growing (Paul and Tong, 2024). There are also a range of new industry organisations and standardisation efforts working to stabilise how datasets take shape as tradeable commodities (Rosenblatt, 2024). This includes, for instance, intermediaries like the Dataset Providers Alliance, which represents sellers of music, image and video for use in AI training (Paul, 2024), and advocates for rules and standards to help valorise datasets in legally legitimate ways, typically through dataset transparency and provenance rules that make data transactions more efficient. For these licensing deals, very little hinges on whether the object of exchange is content, data, or any other ‘product’. These pragmatic arrangements afford access to materials for the specific purpose of model training; they articulate the economic logic of content into the economic logic of data, and through that transition hybridise datasets as economic objects.
The financial value of these exchanges further lubricates the slippage between the representational-expressive and operational-statistical conceptual character of data. The nearly defunct media repository, Photobucket, for instance, claims to have been negotiating prices of $0.05 to $1.00 per image and more than $1.00 per video for AI training (Paul and Tong, 2024; Notopoulos, 2024). Text is priced closer to $0.001 per word, though other companies claim to charge higher rates. Platforms like Reddit have made deals that price access to datasets of thematically labelled comments at around $60,000,000 per year (Tong et al., 2024). Most academic publishers have also made data access arrangements (Informa PLC, 2024). On these measures, for some actors at least, media has a higher financial value as data than as content, suggesting that AI is reconfiguring media businesses into data businesses.
Content businesses such as publishers, art repositories, music labels and other media firms that control catalogues of content are therefore becoming a new type of data intermediary, selling access to their catalogues to AI companies as a form of data. Book-writing academics have experienced this first-hand, as academic presses now seek retroactive addenda to publishing contracts granting them rights to license access to book content for AI model training. While this presently constitutes only a fraction of academic publisher revenue (Informa PLC, 2024), these shifts demonstrate how the intellectual (i.e. expressive) value of academic work is being complemented, or potentially eclipsed, by what that work indicates statistically about the inherent relations of concepts and language. If this trend continues, the primary value of media may become its operationalisation for the creation of data, through which its more valuable second-order statistical encodings are extracted and monetised.
Conclusion
Legal regimes play a critical role in shaping the conditions under which data becomes a value form, including the commercial conditions of data access, commodification and monetisation. Informational commodities rely on regimes like intellectual property or data protection to define their boundaries and conditions of circulation. Which regimes govern the form and movement of informational commodities at which times, however, is determined by legal settlements that institute and organise markets and market actors. Media industries and data processing industries have tussled over the legal configurations defining the movement of data and content in the platform context for decades. We are now in the midst of a battle between data processing industries and media industries as to which regimes will govern the digital artefact that is the dataset. At the heart of this battle is a conceptual question: ‘What is data?’ The inchoate identity of datasets as legal and economic objects situates them at the frontier of two competing economic logics, championed by two sectors of the digital platform economy, in pursuance of a decades-old dispute over who is able to profit from the movement of information online. The resolutions and settlements that emerge from these disputes will set the terms of how we understand and govern the markets for AI inputs for some time.
Funding
The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The author received funding support through the Australian Research Council Centre of Excellence for Automated Decision-Making and Society. Grant number CE200100005.
Declaration of conflicting interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
