Abstract
The Enron Corpus is a canonical training dataset representing one of the first scale jumps in the size of natural language data for machine learning (ML) research. That corpus was built from 500,000 internal Enron emails released by the Federal Energy Regulatory Commission in the wake of the Enron prosecution. This article traces the historical and genealogical link between Enron and contemporary foundation models. Foundation model training sets are currently so large that they include almost all of the available data on the Internet, and researchers anticipate that future models will incorporate even more tokens. That might suggest feeding AI-generated data back into the models to train future generations, but this poses an existential problem known as model collapse. This essay investigates how and why the corporate collapse documented by one of the earliest “massive” ML corpora resonates with the model collapse syndrome that threatens the integrity of LLMs and multimodal generative AI today. Model collapse can be understood as the phenomenon where a model becomes poisoned by its own projection of reality; this also describes the conditions that led to the collapse of Enron Corp. The Enron Corpus thus poses a more fundamental question: Why is it that theft, scandal, and fraud lie at the heart of so many of the most prominent training sets? Is this Enron story just an outlier in ML history, or is there a genealogical trace of collapse?
Introduction
In a 2021 research paper published in Big Data & Society, Emily Denton, Alex Hanna and their co-authors proposed a method for developing what they call a critical history and genealogy of machine learning datasets (Denton et al., 2021). They focused on ImageNet, one of the foundational datasets and benchmarks in image recognition and computer vision. The competition built around the dataset, the ImageNet Large Scale Visual Recognition Challenge, led to a proof-of-concept breakthrough in 2012 when AlexNet, a convolutional neural network trained on GPUs, reported a significant jump in accurately identifying images using machine learning. By 2017, the entire competition was suspended after the bulk of competing teams scored significantly better on the benchmark than the humans who had created the original dataset.
At that point, ImageNet, a victim of its own success, passed into obsolescence as an active dataset. From the perspective of frontier work on foundation models in multi-modal generative AI, ImageNet is a historical footnote, with little relevance to cutting-edge algorithms and their chatbot interfaces. And yet there is universal acknowledgement in the field that ImageNet represented a breakthrough in the use of convolutional neural networks in the training of AI models. How much influence do historical datasets, once retired, continue to exert on the fields of machine learning and generative artificial intelligence? In framing the question this way, I am expanding the scope to include what in the humanities is called historiography: the history of how we write the history of a given field (White, 1990). When we think of ancestors and cultural tradition, we understand that human culture moves across generations. But what happens when a data set like ImageNet, which was the foundation for an earlier iteration of computer vision, drops out of use? How much residual effect do these early foundational model training sets have on AI more broadly? Do datasets have ancestral links that span across generations?
There have been several recent attempts to tell the critical history of specific training sets, focusing on the massive, publicly available data sets like the Common Crawl used to build many of the contemporary large language models (LLMs); the aforementioned ImageNet, historically the gold standard benchmark for image recognition and generation; and LAION-5B, a massive dataset assembled purely through machine learning processes. 1 These training datasets are huge—Common Crawl indexes over 2 billion webpages, and ImageNet contains 14 million labeled images—and yet they are proving incapable of satisfying the generative AI industry's voracious demand for new data to train their next-generation algorithms, which include hundreds of billions of parameters (Metz et al., 2024).
In order to answer the question about the genealogical and conceptual legacy of historical data sets, and building from the important work on ImageNet also done by creative and arts-based researchers like Trevor Paglen, Kate Crawford, Everest Pipkin, and Adam Harvey, I turn to another machine learning corpus of historical import in the development of natural language processing (NLP): the Enron Archive. I will first introduce the Enron Corpus, and take up the Enron Corporation's own advertising-copy provocation to “ask why.” Then I’ll turn to a series of creative researchers who engage the Enron Corpus as a data body in the aesthetic realm; all of these artists share an interest in bending the algorithms that arise out of the Enron Archive away from the homogenization of fraud, and back towards the emergence of some kind of beauty or truth. One truth that emerges from Enron is that its fraudulent business model ultimately collapsed because it had to account for the speculative assets it was feeding back into the financial markets. This kind of collapse connects to contemporary worries about model collapse in algorithmic systems. Both cases arise from analogous fears of poisoning a system with the forward projection of its own corrupt representation of reality (Shumailov et al., 2024).
The Enron Corpus
It might seem quaint to remind ourselves that early in the twenty-first century, a data set made up of a mere half million email messages could represent a paradigm-shifting development in the world of NLP. The Enron Corpus is just one such dataset: a collection of some 500,000 internal Enron emails released to the public by the Federal Energy Regulatory Commission in the wake of the prosecution of Enron's fraudulent business practices. The Enron Corpus quickly became the standard archive for many computer science machine learning courses, and the basis of all kinds of experimental algorithmic processes in the first decade of the twenty-first century. This is how a cluster of emails documenting an empirical corporate collapse became the Enron Corpus, a canonical training dataset representing one of the first scale jumps in the size of natural language data for machine learning research.
Today, the Enron Corpus, also known as the Enron Archive, is a statistically insignificant portion of the multi-billion token datasets used to pre-train foundation models. In a landmark white paper, Stanford University's Human-Centered Artificial Intelligence group (HAI) explained that these “foundation models” are so massive that their training cannot plausibly be supervised in the machine learning process. This means that the models are no longer trained on datasets that have been cleaned, labeled, and verified; instead, a “collect-all” mentality has been built into the systems themselves. In 2023, Fei-Fei Li called ImageNet “the largest hand-curated data set in AI's history” (Li, 2023: 177), and it will likely maintain that distinction in perpetuity. Today's brand-name deep learning algorithms are being fed daily dumps of massive scrapes of the entire Internet at a scale too large for even a crowdsourced global underclass to hand-label each token. In other words, today's foundation models “self-supervise” their own training.
The Enron Archive's historical significance, and the Enron Corporation's cultural meaning as the paragon of turn-of-the-century technocapitalist fraud, lend the Corpus additional weight in any narrative we might assemble about training data. There is a particular resonance between this question of self-supervision, on the one hand, and, on the other, contemporary questions of accountability for the data theft at the core of all first-generation transformer models like Google's BERT 2 . Even if the Enron emails are today an insignificant or even negligible part of these billion-parameter models, there is a historical and genealogical link between the Enron Corpus and contemporary foundation models that passes through all machine learning training corpora.
How much does that history matter? In order to answer that question, we’ll need to dive into the documents and the data, but we’ll also have to use our creative imagination to conjure up a few suspicious and deliberate absences. That is because the history behind the Enron Corpus is the story of the so-called Smartest Guys in the Room, one of the most spectacular corporate frauds in the history of technocapitalism. And part of that fraud was document deletion. In fact, one of the most important pieces of documentary evidence in the story of the Enron Corporation's collapse is an email that's not included within the Enron Archive itself, the infamous Nancy Temple message. This email proved pivotal in the prosecution of Enron's accounting firm Arthur Andersen, as it hinted at the widespread destruction of evidence across both firms. I’ll describe the specifics of that case below, but even this brief sketch suggests that the questions of justice and liability in the case of an epochal fraud might reside in the status of a single email. And that email stands out not only because it caused the deletion of truckloads of evidence, but also because it was not sent to an Enron address at all, and so never entered the Archive.
At this point, those questions are impossible to answer. But we can enrich our understanding of the case with a deep citational practice, which I offer to illuminate the story of the Enron Corpus. Through the critical history of this particular training dataset, I also propose the method of deep citation as an embodied human practice in the service of telling the story of training data. In the vocabulary of machine learning (ML), the method could be called “non-transformer embedding.” In this narrative embedding, the depth of citational reference is constrained by the following question: If we know data was deleted from a corpus that becomes a foundational data set, how deeply must we probe to be able to imagine what was not included in the matrix?
As Donna J. Haraway insists, building from Marilyn Strathern, it matters which ideas we use to think other ideas with (Haraway, 2016: 12). This particular deep citational practice will blend personal anecdote, archival research, artistic performance, literary storytelling, and the management of the many outlier data points that, from a literary perspective, we might call tangential story elements. I justify this methodology because I know that the generative AI systems that I’ll be discussing in this article are abandoning precisely this type of deep citational practice, and perhaps offering to replace it with a petrochemical-fueled simulation run on microchips embossed with rare-earth minerals. When we consider the Enron Archive, it helps bring into relief some unexpected connections between those algorithmic models that form the foundation of the contemporary technosystems we call “AI” and the greatest corporate collapse in the history of the United States.
Some of my readers might immediately complain: but you are just assembling a sequence of outlier examples, anecdotes all the way down! And I agree, at least in part: the question of the outlier itself is key to any understanding of training datasets, especially when confronting the problem of a societal context collapse. Deep citational practice, however, is the opposite of a narrative string of next-token generated inferred anecdotes. Instead of filling in the gaps through generative pretrained transformer architectures, the method embeds a particular dataset within the omissions, offenses and ambiguities of its own narrative. A critical history of the Enron Archive will allow us to follow the digital trace from a Corporation that poisoned itself with its own fraudulent projection of reality, through the phenomenon known as model collapse, to connect with our current anxiety that future generative models will become poisoned by AI slop (Koebler, 2024). The generational presence of the Enron Corpus can help explain why collapse—whether statistical, financial, or epistemological—seems to be the inevitable horizon of generative artificial intelligence.
Ask why: Enron's messy materiality
For many readers, the mere mention of the name Enron will conjure up feelings of rage, disgust, and shame. I know this all too well: in presenting the research that led to this article at San Francisco's Commonwealth Club, my name-check of the “Crooked E” elicited hisses from the audience. What I didn’t remember at the time was that the Commonwealth Club was the very venue where, on June 22, 2001, Enron CEO Jeffrey Skilling confronted an angry crowd of Californians who had lived through a year of rolling blackouts and exorbitant energy prices. As the Houston Chronicle reported, “A Houston energy executive got more than boos and jeers before starting a speech on California's energy crisis. He got a pie in the face.” (News Services, 2001; McLean and Elkind, 2013: 335–6).
Enron's corporate motto, heralded across print advertisements, television spots and shareholder meetings, was “Ask why.” The original intent, one supposes, was to highlight Enron's innovative and unorthodox strategy of speculative neoliberal economics. 3 In the wake of the company's collapse, the motto now reads as an implicit confession, or a cursed caveat emptor. The material history of the Enron Corpus offers us an opportunity for hindsight and foresight to converge into another “why?” question: Why is it that theft, scandal, and fraud lie at the heart of so many of the most prominent AI training datasets?
It is indeed disconcerting to learn that the several hundred thousand emails exchanged between Enron executives and their contacts during the unfolding of a generationally catastrophic corporate collapse might be, in terms of historical datasets, the last universal common ancestor of all neural network-based generative AI algorithms. That tidbit of trivia—that the Enron emails were released to the public in the interest of transparency and justice, and that the decision generated one of the largest freely accessible bodies of natural, authentic human email communication on the early Internet—will be approached through some artistic and aesthetic engagements with the Enron Archive in the second half of this essay. But the first question I want to pose is: How much do we need to know about that dataset to say that we “understand” the Enron Archive?
My pie story is nowhere near a definitive account of the Enron Corpus, and many would say that it's a silly outlier to the data story about the Enron Archive. But I think it's very useful to remind ourselves how San Francisco, or at least the Commonwealth Club, felt about Enron in the early 2000s. In that, it's a key part of the story of the Enron Corpus. Perhaps not coincidentally, when a friend's son got hold of a draft of this essay and asked ChatGPT to summarize it because it was too long for him to read, the generated output ignored the story about the Enron CEO getting a pie to the face, and also deleted this sentence from the conclusion, “…which cooked its books with bullshit that everyone pretended to have read.” In the rest of this essay, I’m going to invoke documents, performances, histories, and code, in the interest of providing precisely the kind of deep citational practice that generative AI systems are either unable or unwilling to deploy. This is the question I take seriously: What do we need in order to feel the presence of the Enron Archive in our datafied world? I ask the question with some urgency because I am worried that the corruption of Enron is pulsing through our terminals, which makes it all the more pressing that we develop a critical history of the Enron Corpus.
Deregulate, defraud ++ data
It is a safe bet to say that Bethany McLean and Peter Elkind would endorse *Poisoned with its own projection of reality* as a description of Enron, as documented in The Smartest Guys in the Room, their definitive history of the company's collapse. The documentary based on the book, also called The Smartest Guys in the Room, concisely introduces Enron as the “dark shadow of the American Dream” (Gibney, 2005). The Houston-based company began its life in the regional natural gas pipeline business, but grew into a poster child for the 1990s neoliberal economic boom times as an innovative energy trading company devoted to creating new markets in newly deregulated sectors. The Enron business strategy was publicly presented as a three-step process. First, the company would anticipate the deregulation of a sector of the global economy. This began with natural gas, expanded into energy generation and transmission more broadly, and at its apogee reached towards ever-more speculative gambles, like deregulated municipal water systems and nascent broadband internet connections. Then the company would push to privatize those sectors; this was an especially important component of Enron's international strategy, which led to purchases of previously state-owned utilities across the developing world. Finally, Enron would commoditize the delivery of the service—be it natural gas, electricity, water, or internet bandwidth—in order to create a secondary market to trade futures contracts, derivatives, and other exotic financial instruments that only exist on paper. These, then, are the components of Enron's business that the financial press celebrated as pure innovation: deregulate, privatize, and commoditize.
But as McLean, Elkind, and a few other intrepid business journalists and short-sellers exposed, this strategy's main profit center resided in the fourth, undisclosed step: defraud. Through accounting tricks, shell corporations, and invented metrics like “total contract value”, the Enron executives fraudulently booked those speculative future profits as realized and executed deals with multi-million dollar valuations, which drove up Enron's stock price to exorbitant heights (McLean and Elkind, 2013: 285). All the while, the executives were corruptly siphoning off cash from the balance sheets into their own funds and accounts. Since the first three steps in the sequence—deregulate, privatize, and commoditize—are closely related and tied to broader macroeconomic shifts in late-twentieth century neoliberalism, we can condense them all into one overarching strategy: deregulation. In that way, we can reduce Enron's entire business model into an alliterative two-step strategy: deregulate and defraud. Enron's inflated stock price was premised on investors buying the fraudulent future it was selling, which meant that once the company had expanded into the 6th largest company on Fortune's Global 500 list, it had no other option than to keep feeding its fraudulent futures contracts back into the economy it was gaming, poisoning the market with its corrupt projection of reality.
Another journalist, Mimi Swartz, worked closely with a whistleblower to tell the Enron story from the inside. Her insider knowledge allowed her to generate a pitch-perfect version of the story as the exploits of a group of postmodern Texan wildcatters. In her article “We’re all living in the world Enron created,” she imagines the story as a Netflix series, but ultimately determines that none of the characters are likable enough to warrant multiple hours of streaming content (Swartz, 2023). I agree with Swartz that the world does not need further dramatization of the Enron story, especially when Alex Gibney's 2005 documentary based on The Smartest Guys in the Room already captures the lessons from the fiasco in such stark terms. But that does not mean that Enron should be far from the front of our minds in these heady times…
What the collapse did gift us, beyond the public spectacle of Jeffrey Skilling taking a pie in the face in downtown San Francisco, was a collection of hundreds of thousands of emails included as evidence in the federal investigation into the energy trading company's fraudulent business model (Pender, 2002). After the trials—which resulted in multiple guilty pleas and convictions—the emails were released to the public domain by the Federal Energy Regulatory Commission (FERC). In hindsight, the decision seems excessively punitive, especially since the half million or so messages go far beyond the few communications that formed the basis of the prosecution, and furthermore contained material that should have been redacted. In its totality, the FERC data dump consisted of most of the emails exchanged on the Enron servers between the corporation's executives in the three-year run-up to the company's 2001 collapse. As Sam Lavigne elaborates in his presentation during the roundtable “The Making of Natural Language: An Evening with the Enron Email Archive,” FERC originally released 1.6 million emails, including messages with personal information that were later deemed irrelevant to Enron's criminal enterprise; FERC instituted an opt-out window for Enron employees, and re-released a trimmed-down version of the communications of the 158 top corporate executives at Enron, totaling around 500,000 messages.
William Cohen, a computer science professor at Carnegie Mellon, converted the public record dump into a clean and organized data set, and maintained it at his website until it was archived and listed in the Library of Congress and the Internet Archive (Cohen, 2003). In 2004, Bryan Klimt and Yiming Yang published an introduction to the Enron Corpus in the proceedings of the 15th European Conference on Machine Learning, and demonstrated the exciting possibilities for all kinds of analysis. According to Klimt and Yang, the Enron Corpus is so valuable precisely because it's a “large, complex and messy data set” representing real-world email exchanges on a variety of issues (Klimt and Yang, 2004: 180). The Enron Corpus had become the Utah Teapot of training sets: it is the standard natural language corpus for many computer science machine learning courses, and is the basis for myriad machine learning problem sets, from social network analysis to topic modeling and beyond to all kinds of experimental algorithmic processes. 4
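The corpus's place in those course problem sets is easy to make concrete. Cohen's release organizes the emails as a maildir-style directory tree, one plain-text RFC 822 message per file, so even a first exercise in social network analysis (say, counting which accounts sent the most mail) needs only the Python standard library. The sketch below is illustrative rather than drawn from any published course; the directory path and function name are my own:

```python
import os
from collections import Counter
from email.parser import Parser

def top_senders(root, k=10):
    """Walk a maildir-style tree (one RFC 822 message per file) and
    return the k most frequent From: addresses with their counts."""
    counts = Counter()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            # Parse headers only; bodies are irrelevant for sender counts.
            with open(os.path.join(dirpath, name), errors="ignore") as f:
                msg = Parser().parse(f, headersonly=True)
            counts[msg.get("From", "unknown")] += 1
    return counts.most_common(k)

# Hypothetical local path to a Cohen-style maildir release:
# print(top_senders("enron_mail/maildir"))
```

Pointed at the root of such a release, `top_senders` surfaces the most prolific accounts, the seed of the sender-recipient graphs that social network studies of the corpus build on.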
Nathan Heller's 2017 New Yorker article “What the Enron Emails Say About Us” describes how the afterlife of this public record of the email messages of 158 high-ranking employees became, within the world of computer and data science, the Enron Corpus, a toy model data set for NLP. In some NLP circles, working with that dataset became a rite of passage, and Heller memorably describes the archive itself as a hot-dog bun beset by seagulls, “pulled apart and pecked up” (Heller, 2017). This has led the dataset to occupy a strange position in the history of language modeling: “a canonic research text that no one has actually read” (Heller, 2017).
Thousands of different research projects have spun out from the Enron Corpus, from compliance bots, to studies of tonal formality in corporate emails, even including an analysis of “ball metaphors” in corporate language. Jessica Leber elaborates: “Much of today's software for fraud detection, counterterrorism operations, and mining workplace behavioral patterns over e-mail has been somehow touched by the data set” (Leber, 2013). This suggests that the Enron Corpus should indeed be a prime candidate to be deemed the Last Universal Common Ancestor (LUCA) of training sets, if an evolutionary history of Artificial Intelligence exists 5 . Indeed, all spam filters owe a genealogical debt to the archive, as Finn Brunton documents in his book Spam: A Shadow History of the Internet. Brunton describes the corpus as a remarkable object, data “frozen in place like the ruins of Pompeii for future researchers” (Brunton, 2013: 130). Brunton continues, “As a human document, it has the skeleton of a great, if pathetic, novel: a saga of nepotism, venality, arrogant posturing, office politics, stock deals, wedding contractors, and Texas strip clubs, played out over hundreds of thousands of messages.” (Brunton, 2013: 130).
Brunton's challenge to reimagine the Enron Corpus as a novel, reminiscent of Swartz's imagined Netflix series, is helpful in reminding us that beyond its status as a historically important corpus for early twenty-first century breakthroughs in NLP, the Enron Corpus is also a public record of the entire bureaucratic functioning of the email system of one of the most emblematic corporations of postmodern capitalism. And it is only available because it was used as evidence in the criminal prosecution of what was then the largest corporate bankruptcy in the history of the United States. In this, it is something of an outlier: the public record of corporate fraud that was actually punished! As Mimi Ọnụọha has explained, this is the rarest kind of dataset: one that reflects the inner workings of powerful people, full of information that those powerful people vigorously wanted to keep secret (Brain et al., 2017). And yet against all odds, the Enron Corpus exists, and betrays the wrongdoing of executives who are skilled at covering up precisely this kind of behavior.
Feeling the Enron Corpus
But is it really fair to call the Enron Corpus an outlier? Can something be an outlier if it is also the proper name of a foundation model training set? Outliers are the very things that are stripped away when data is compressed, and yet here we have the Enron Corpus as a potential foundation model training corpus LUCA. It is, at once, exemplary and unique, a representative sample that is also an outlier. Although it was once the largest thing of its kind, it is now a mere toy lost within Internet-scale foundation models. How much is the Enron Corpus still with us today? Is it possible to trace the Enron Corpus's sociotechnical fingerprint?
I adopt and adapt Sandra Harding's phrase “the social fingerprints of hegemony” from The Science Question in Feminism 6 . Given that the Enron case involves an exemplary early twenty-first century case of forensic accounting, the metaphor suggests the need for digital forensics to be a component of any critical history of data. We already have abundant documentation of these fingerprints at the level of representation, predicted by Safiya Noble in Algorithms of Oppression and confirmed by Buolamwini and Gebru's article “Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification,” to name just two examples of the abundant research in the field. Furthermore, as these scholars have argued, the critique of representation within these technosystems is necessary but not sufficient to address the harms they actually enact in our shared world. To return again to Harding, generative AI models will “always bear the social fingerprints of the dominant groups” of whatever culture they are modeling; and this will especially be the case in models that don’t transparently situate their knowledges. 7 Every technology stack has layers, and there are further infrastructures layered beneath the already egregious examples of harm enacted and perpetuated by these technosystems at the level of representation.
Without abandoning that already complex two-fold infrastructure of theft and exploitation, I do want to refocus on the specifics of the Enron Corpus's sociotechnical fingerprint, which also has theft, fraud and scandal deeply embedded within its own data body. In an interactive article published in April 2023, reporters at the Washington Post analyzed Google's Colossal Clean Crawled Corpus (C4) dataset, an example of the kind of massive dataset that has eclipsed the Enron Corpus to become one of the primary training sets for contemporary foundation models (Schaul et al., 2023). The story confirms the pattern of indiscriminately including copyrighted content in the crawl, as C4 contains 200 million examples of the copyright symbol ©, which—as the authors dryly remark—unambiguously denotes a work registered as intellectual property. 8
It also confirms that the Enron emails are part of C4.
In other words, this reporting confirms that the Enron Corpus is represented in some if not all foundation model training sets. Yet that fact cannot capture or convey what these datasets feel like, what it truly means to be an intelligence trained on the Enron archive. Beyond simple recognition of its presence, is it possible to perceive the social fingerprint of the corpus? Can we find a way to *feel* what we now know is there?
In order to investigate further, we need to shift to the aesthetic realm. There we can find artists who engage the problem of training data with creativity, and approach these technosystems through an experimental practice. I am not suggesting we consult folks “making AI art”, a practice which has become the near-exclusive province of middle-brow prosumers and celebrity self-branders. Instead I look to artists who push the limits of AI technosystems in an effort to bend the algorithms towards beauty or truth.
It was Everest Pipkin's Lacework (2020) that helped me understand what it would mean for a human being to actually experience a training corpus as a machine might. Their work allowed me to appreciate a cluster of other artists who attempt, in some form, to “actually read” a canonical corpus, including Adam Harvey and Jules LaPlace's exposing.ai project, James Bridle's performances, Kate Crawford and Trevor Paglen's excavating.ai and Mimi Ọnụọha's The Library of Missing Datasets.
It just so happens that many of those artists gathered in New York City in 2017 to celebrate the launch of Sam Lavigne and Tega Brain's The Good Life (Enron Simulator). 10 The Good Life is a masterpiece of email art, but more importantly for the current focus, it provides a (barely) human readable experience of the entire archive. In short, The Good Life is an inbox reenactment of the Enron Corpus (Lavigne and Brain, 2017). The artists further cleaned the Enron Corpus down to c. 225,000 emails, and at https://enron.email/ you can sign up to receive all of those messages in chronological order. As the artists have said, it is an experiential opportunity to relive the Enron collapse blow-by-blow.
Their original idea was to speed-run the Enron Corpus over 24 hours, beginning immediately whenever someone entered their email into the website, but this proved impossible, as the very spam filters whose technical genealogy relied on the Enron Corpus categorically shut down the project before a single email was delivered (Brain et al., 2017). The volume of messages also proved to be unworkable at the archive's “real” time scale, so the artists settled on several options ranging from 7 years (double the 3.5 years included in the archive) to 28 years (the longest simulation, which averages out to about 49 emails a day).
If you engage The Good Life (Enron Simulator) on any timeline, it is guaranteed that the Corpus will not FEEL like an outlier. And the simulator has the added benefit of breaking your digital persona by flooding your data stream with not only spam, but with the original spam filter training dataset! This is a point made by Sam Lavigne in the 2017 panel discussion “The Making of Natural Language: An Evening with the Enron Archive,” and I would advise readers to watch that roundtable presentation instead of signing up for The Good Life (Enron Simulator) itself. In the presentation, Finn Brunton situates Enron's sociotechnical fingerprint within a particular culture: “Not just Texan but Houstonian, and not just corporate, but a very specific kind of Houstonian, fratty, evangelical, financial arbitrage corporate culture” (Brain et al., 2017). This is the turn-of-the-century Houston that Mimi Swartz described as “a postmodern wildcatter's dream”, and in Houston today, Enron is still remembered as a criminal enterprise that betrayed a generation of workers while robbing them of their savings.
What The Good Life (Enron Simulator) helped me understand is the strange and convoluted way in which the epochal fraud perpetrated by Enron has become hardcoded into the AI mind. It does not answer the question of whether or not Enron is an outlier, but rather reframes the question entirely: what is the value of an outlier if the outlier itself is a poisoned projection of reality? Only a handful of the original 1.6 million emails in the Enron archive were used at trial, in what the defendants surely argued were non-representative outliers. To return to Ọnụọha's point: the true outlier is that Enron's executives were convicted, and that Arthur Andersen, the accounting-consulting firm that enabled the fraud, got the corporate death penalty. In the age of too-big-to-fail and too-big-to-jail, the closure and accountability in the Enron case seemed like something special. Almost too good to be true…
Nancy Temple and the opposite meaning
If I have waited this long to mention Arthur Andersen, it is only because the emails that led to that disgraced accounting firm's collapse are not represented in the Enron Corpus. Arthur Andersen existed in what Cory Doctorow calls the "consulting/accounting continuum" (Doctorow, 2024). It was a storied accounting firm with almost a century of industry experience, but its Houston office had become deeply implicated in Enron's fraudulent accounting practices, with in-house consultants advising Enron how to persuade the accounting side to sign off on Enron's speculative balance sheets. Even though it was an email sent to David Duncan, the lead Arthur Andersen partner on the Enron account, that led to the firm's conviction for obstruction of justice, that message was internal to Arthur Andersen, written by the firm's in-house counsel, so it never touched the Enron servers whose contents would become the Corpus.
This brings up a final and most inconvenient aspect of the Enron Corpus: even the original 1.6 million message archive was incomplete. This is not some metaphysical point about the inherent incompleteness of any archive, or a scholastic argument about the paradox of measurement. The Enron Corpus is incomplete because Enron employees, along with their accountant counterparts at Arthur Andersen, deliberately destroyed evidence of wrongdoing, both in physical and digital form. In the weeks of mid-October 2001 leading up to the inevitable subpoena of Enron's records, the document shredders at Enron's corporate headquarters ran non-stop. Enron's subpoena arrived on October 25, 2001, but the company continued shredding documents, sometimes at the rate of 7000 pounds an hour, until late January 2002 (ABC News, 2002, "Enron Destroyed Documents by the Truckload").
Across the way, Arthur Andersen employees were deleting their own emails and shredding their own documents. The firm deleted 30,000 computer files and shredded an additional 2000 pounds of documents in the two weeks between when Enron publicly acknowledged it was the target of a federal investigation and when Arthur Andersen received its own subpoena for records on November 8, 2001 (Kinsler, 2008: 103–4).
We know this in part because of an email written by one of Arthur Andersen's in-house attorneys, Nancy Temple. Temple's colleagues would later testify under oath that the email was a coded instruction to destroy the evidence that everyone knew would be subpoenaed in the immediate future (Schwartz and Eichenwald, 2002). The email's coded nature presents a particularly interesting example of the role of human interpretation in written communication. In its entirety, the October 12, 2001 message reads: "Mike – It might be useful to consider reminding the engagement team of our documentation and retention policy. It will be helpful to make sure that we have complied with the policy. Let me know if you have any questions."
Former employees testified that they interpreted this “reminder” of the firm's document retention policy to mean “Save what is important, destroy everything else.” During Arthur Andersen's trial for obstruction of justice, the October 12 email was considered a smoking gun, documentary proof of Andersen's cover-up of the criminal fraud they enabled Enron to pursue. This was further emphasized by another witness's testimony that depicted David Duncan, the lead partner on Andersen's Enron account, physically destroying a document labeled “smoking gun” as he proclaimed “we don’t need this.” (US v Andersen, 2005: 4–5).
And yet, for reasons that continue to confound everyone involved, the members of the jury that voted to convict Arthur Andersen of obstruction of justice later told the press that it was a different email that provided the crucial evidence under the instructions the judge in the case had given them. Instead of focusing on the document destruction that arose from the Oct 12 message, the jury focused on an October 16, 2001 email in which Temple explicitly told a co-worker to alter the language in a draft statement. The Supreme Court would later overturn the conviction, arguing that the judge's instructions to the jury were flawed, and that the Oct 16 email that jurors later publicly cited as the single most important piece of evidence did not remotely express the "corrupt intent" demanded by the statute. 11
In other words, the Supreme Court overturned Arthur Andersen's corporate death penalty on a technicality (the judge's faulty jury instructions), although this reprieve, granted in 2005, was not enough to resurrect the disgraced firm. The ruling sidestepped the entire question of document destruction itself, ignoring the story told unambiguously by the scant surviving evidence. Temple's reminder to "comply with the document policy" was clearly interpreted as an instruction to "destroy the evidence." We know this because employees at both Enron and Arthur Andersen spent the fall of 2001 frantically doing just that. And due to their success, some of the very evidence that would have helped prove the fraud case against the companies disappeared into a void. The Supreme Court took advantage of that void to collapse the conviction against Arthur Andersen into an unaccountable aporia. 12 All this, and yet somehow the two emails at the heart of this entire saga are not included in the Enron Corpus!
If there are any true outliers in the case of the Enron corpus, they are these two Arthur Andersen emails. I would go so far as to propose the Oct 12 and Oct 16 messages as canonical complex outliers, absent in the Enron archive yet permanently inscribed in the history of the United States Supreme Court. One email conveyed the exact and unstated opposite of what it “said”, while the other was used to justify a “plausible deniability” that only existed because of a documented example of the obstruction of justice. It might not even be sufficient to call the Oct 12 email an outlier; it is the coded inverse of an outlier. Burn after reading.
Model collapse
In HAI's foundation model whitepaper, the authors worry early on that “the defects of the foundation model are inherited by all the adapted models downstream” (Bommasani et al., 2022: 1). The risk, then, is that the worlds enacted downstream from these foundation models will perpetuate and even exaggerate the inherited defects of discrimination, theft, and exploitation. And those risks are not exhaustive of the variety of defects—computational, representational, sociological, ecological or infrastructural—that will be coded into future generations of these technosystems. That said, content theft is a strong contemporary example of this tendency, and recent reporting has provided incontrovertible evidence of internet-scale content larceny (Cole, 2024).
We know that foundation models are trained on nearly all publicly scrapable digital content, regardless of copyright status. We know that this scraping originally occurred with no plans for establishing consent, control, or compensation for the authors and creators of the data. 13 And we know that academic research and other data explicitly marked under Creative Commons-style copyleft licenses have been used commercially, in direct violation of the licensing terms. Even so, the demand for new training data does not subside!
At the same time, the widespread adoption of generative AI systems into the basic architecture of the Internet means that online content is becoming ever-more populated with AI-generated material. Mark Hansen (2015) has named this condition feed forward, a kind of cybernetic loop where predictive algorithms steer human behavior towards the very goals the algorithm predicts humans should want. What was originally conceived of as a feedback loop, when put in the service of generating new content, creates an algorithmic machine that feeds new tokens forward. This remixed statistical average is called “inference”, which is the broad name for the artificial content generated by these technosystems. Regardless of whether it is so-called high-quality synthetic inference or piles of AI slop, Artificial Intelligence content is now part of our collective media experience. 14 This poses a conundrum. As is fitting of a recursive conundrum, it has been named several times over in the wake of the launch of ChatGPT (Wong, 2023). Kate Crawford has called this the statistical ouroboros (Crawford, 2021: 131); a group of researchers from Rice University proposed “Model Autophagy Disorder (MAD)” (Alemohammad et al., 2023); and perhaps most memorably Jathan Sadowski named the phenomenon Habsburg AI (Baraniuk et al., 2023; Sadowski, 2025). The consensus technical name seems to be “model collapse”, as ratified by the July 2024 publication in Nature of an article starkly subtitled “AI models collapse when trained on recursively generated data.”
The Nature paper's thesis is elegant and terrifying: "Model collapse refers to a degenerative learning process in which models start forgetting improbable events over time, as the model becomes poisoned with its own projection of reality." One way to think about the "poison" at the root of model collapse is as the result of generation loss, the phenomenon, observed across all media, whereby quality degrades with each successive copy or transcoding. The classic example is the degradation of an image that is printed and photocopied, with a new photocopy made from the print, in a sequence that will eventually overwhelm the original image with noise, artefacts, and errors. Media artists have always been fascinated by generation loss, and composers like Alvin Lucier (I am sitting in a room, 1969) and William Basinski (The Disintegration Loops, 2002–2003) have explored the aesthetic possibilities of that distortion (Zimmer, 2018). Whether in analog or digital settings (where generation loss is the side effect of a compression algorithm), the primary cause of generation loss is the deletion of far-outlying data, what can be considered the "long tail" of a data distribution. Under normal circumstances, this outlying data would have an insignificant effect on the quality of the reproduced file, and in lossy data compression it is chopped off in an effort to reduce the file size. Yet under certain conditions, when future "downstream" copies are continually made from subsequent generations, the cumulative effect of chopping off those minuscule tails manifests as undesirable yet unavoidable errors.
But in the case of foundation model collapse, this “poison” is not introduced because of intentional lossy compression or unintentional transcoding errors. No, the poison of model collapse is of the generative variety, and thus has a particularly recursive character, that of an algorithm feeding its own “inference” forward to train future models. The “Model Collapse” paper presents an extreme hypothetical of this situation, where future foundation models are trained indiscriminately on the “inference” generated by the previous generation (Figure 1). Their conclusions are stark: “We find that indiscriminate use of model-generated content in training causes irreversible defects in the resulting models, in which tails of the original content distribution disappear.” In short, the infamous “long tail” of data distribution shrinks over time, which causes improbable events to be further underestimated. In the language of LLMs: the generation of unlikely tokens becomes even more improbable in subsequent generations, while at the same time more probable tokens become ever-more represented. In the language of poetry: those unlikely events that give experience its texture are stripped away as distractions and inefficiencies to the centralized logic of the homogenizing data stream (Ruby, 2024).
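The tail-shrinkage mechanism described above can be sketched in a few lines of code. The following toy simulation is my own illustration with hypothetical parameters, not the experiment from the Nature paper: a discrete token distribution is repeatedly "retrained" on a finite sample of its own output. Any token that goes unsampled in one generation drops to zero probability and can never return, so the distribution's support can only shrink.

```python
# Toy sketch of model collapse via tail shrinkage (hypothetical parameters,
# not the Shumailov et al. experiment): each generation is "trained" only on
# a finite sample drawn from the previous generation's distribution.
import random

def collapse_generations(vocab_size=50, sample_size=200, generations=30, seed=0):
    rng = random.Random(seed)
    # Zipf-like "true" distribution: token i has weight 1/(i+1),
    # so high-index tokens form the improbable long tail.
    weights = [1.0 / (i + 1) for i in range(vocab_size)]
    support_sizes = []
    for _ in range(generations):
        # Draw a finite training sample from the current model.
        sample = rng.choices(range(vocab_size), weights=weights, k=sample_size)
        counts = [0] * vocab_size
        for tok in sample:
            counts[tok] += 1
        # Retrain: tokens never sampled fall to zero probability, irreversibly.
        weights = [c / sample_size for c in counts]
        support_sizes.append(sum(1 for w in weights if w > 0))
    return support_sizes

sizes = collapse_generations()
print(sizes)  # the support (count of surviving tokens) never grows
```

Running the sketch shows the vocabulary of surviving tokens contracting generation after generation: the rarest tokens disappear first, and once gone they are gone for good, which is precisely the dynamic that "poisons" downstream models.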
In this very specific way, probable events "poison" reality and lead the model to converge across generations. To return to the exaggerated example from the "Model Collapse" paper: as generative AI algorithms feed AI inference into the ocean of data used to train future generations of models, those future models, progressively stripped of more and more outliers, will increasingly predict and generate their own computational "hallucinations", although I am ambivalent about "hallucination" as a term to describe what are, in fact, errors. Instead, I prefer to generalize the problem: model collapse is also a general description of how power expresses itself algorithmically, consolidating itself by stripping outliers away from its datafied representations of the world.
At the extreme, the entire model will collapse into a single mode. It sounds complicated, but the phenomenon can be intuitively grasped by reading the example published in Nature, where a model was given an existing text input and asked to finish the paragraph (see Figure 1).
According to the authors, these snapshot examples across nine generations make visible two processes—tail shrinkage and convergence—that are inevitable outcomes for any generative system trained on data produced by the previous generation. Some AI boosters have criticized the Model Collapse thesis, confidently pointing to the difference between the paper's hypothetical "indiscriminate" use of multi-generational AI data and Silicon Valley's careful production of "inference-generated high quality training data." 15 And yet the paper outlines with logical precision and statistical rigor that model collapse is here to stay and will substantially change the ecosystem of digital content.
In short: "Model Collapse" will be a feature of any second-order AI technosystem, with "second-order" simply being an AI technosystem trained in part by undistinguished outputs of previous generative models. As there is currently no universally-implemented digital watermark on AI-generated content, and given the flood of undifferentiated AI slop that now populates much of the Internet, it is a virtual guarantee that future training data will be populated with some amount of "inference" generated by previous models (Buschek & Thorp, 2023; Koebler, 2024). In this sense, Model Collapse is a certainty in an analogous way to the certainty that any secret backdoor built into a cryptographic system will inevitably render the entire system insecure. In cryptography, this is known as the "Keys under Doormats" fallacy; it is important to note that Ross Anderson, one of the co-authors of "Model Collapse", published the definitive proof of the "Keys under Doormats" vulnerability (Abelson et al., 2015). And in the same way that "there is no such thing as a secure backdoor to a cryptosystem" is axiomatic in the field of cryptography whether politicians and technologists like it or not, an emerging axiom of generative AI technosystems is that Habsburg AIs will experience irreparable model collapse as the amount of AI-generated content incorporated into future foundation models increases.
And this is why the Enron collapse resonates so deeply with the contemporary field of AI: the threat of model collapse is the threat of poisoning a system with its own projection of reality. 16
Conclusion: The antidote to botshit
Although it is an important point that Nancy Temple's emails are not part of the Enron Corpus, my overall argument has not focused on the missing pieces in foundation model training sets (a theme that has been explored by Mimi Ọnụọha and the other artists cited above). It has proposed a distinct yet related question: what is the value of an outlier in a training set? What if that particular outlier is the emblem of a collapse which seems to periodically cycle through our technosocial financial system? Bit players in the Enron story, like Lehman Brothers, would take their own turn in 2008 as chief villain in the next system collapse, and the same patterns ripple through other proper names for abstract and ultimately fraudulent concepts: Juicero, Theranos, FTX. Perhaps the investigations in the wake of those collapses will provide yet another archive of internal documents released through litigation, following Enron's inglorious footsteps.
How distributed are the effects of a digital space poisoned with its own projection of reality? Can we now somehow feel the corruption of Enron pulsing through the technosystem? Sam Lavigne and Tega Brain created a modality to experience the Corpus, and The Good Life (Enron Simulator) provides a direct perceptual experience of the ghost of Enron spread across our shared social machine. The work's true revelation, which I hope I have also approached in this essay, is the depth to which Enron has penetrated into the ontological structures of the AI mind.
That is what I think was behind the visceral reaction of my audience at the Commonwealth Club, the disgust they expressed when I mentioned the “Crooked E.” Nobody wants the ingredients of those first, early, foundational algorithmic sausages to have been the toxic slop of turn-of-the-century neoliberal globalization. And we don’t want this because we know, viscerally, that ethics manifest themselves in the building of these systems. There is no additive “ethics” poured into these systems as a lubricant after they've been manufactured. The ethics will always be the reflection of how these systems were built (Stark, 2023). GIGO, as the coders say: garbage in, garbage out.
I have been exploring the complex resonance of Enron, not only with the GIGO ethos of technocapitalism, but also with the urgent emergence of model collapse (Zimmer, 2024). Enron's story is that of a fraudulent business model based on the speculative trading of extractive rents, which cooked its books with bullshit that everyone pretended to have read. The most charitable interpretation of these fraudsters is that they were visionaries ahead of their time. But they were not content to invest in their own capacity to realize their vision; rather, they sold speculative shares in the promise that they would realize their vision in the future; they then securitized those promises, and shed them from their balance sheets systematically and clandestinely until they could do it no more.
And today, in a kind of atonal harmony with the toxic assets of Enron's poisoned accounting practices, so-called "cloud credits" zip around imagined balance sheets to produce context-collapsed inferences like, "In addition to being home to some of the world's largest populations of black @-@ tailed jackrabbits, white @-@ tailed jackrabbits, blue @-@ tailed jackrabbits, red @-@ tailed jackrabbits, yellow @-" (Figure 1). When I think of the computational genealogy linking that sentence to another one written by a postmodern Texan wildcatter using his work email, for instance, to manage an outing to some downtown Houston strip clubs, I begin to understand the full implications of the possibility of model collapse. The epistemic violence caused by AI slop, which poisons trust in a shared media space of factual reality, bears a family resemblance to the speculative financial shell game played by Enron and their Arthur Andersen co-conspirators, feeding forward a fraudulent hallucination. This resemblance is only further reinforced by contemporary discussion of a speculative asset bubble, and comparisons between Enron's "innovative" accounting and the financial deals that structure the Silicon Valley AI economy (Riley, 2025).

Figure 1. Example of model collapse from Shumailov et al. (2024). According to the authors, when model-generated content is indiscriminately introduced into future generations of LLMs, irreversible defects appear in the resulting models; these defects become further exaggerated across multiple generations.
By gathering these anecdotes, artworks, and outlier messages, I have tried to offer a rich citational practice in this essay as an antidote to AI slop, building from feminist science studies and a commitment to always situate knowledge. Donna Haraway turned Marilyn Strathern's idea into a slogan: it matters which thoughts one uses to think thought. In other words: thought-as-such is always thought from somewhere. My hope is that by confronting this truth, we can become a bit more sanguine about the generative AI technosystems self-replicating all around us. As the models collapse and converge onto a sloppy point of botshit singularity, at least the models will then be speaking from somewhere. It is my hope, however, that we can find a better ground on which to generate our future intelligences than from the smoldering crater of Enron's context collapse.
Footnotes
Acknowledgements
Thanks to UCSC Humanities Division's EXPLORE program, which supported three cohorts of undergraduate researchers to help with this project: Jennacess Carreon and Marissa Omaque; Chloe Allen and Caroline Wilson; Sophia Coffing, Shalini Satish and Pilar Zapien. I am grateful to all my colleagues who gave feedback on various sections of this article: the National Humanities Center's Responsible AI cohort, my THINK and Humanizing Technology collaborators at UCSC, and many others too numerous to name. A special thanks to the audiences who listened to early versions of the argument, especially that night at the Commonwealth Club in San Francisco, where I was reminded that Enron got a pie in the face.
Funding
This article developed out of research and teaching in the National Humanities Center's Responsible AI project, Summer 2022.
Declaration of conflicting interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
